Asim Shankar

[Some Faces]

Face Detection & 3D Trajectory Generation

As part of my final year undergraduate research project, we (Priyendra Singh Deshwal and I, with Dr. Amitabha Mukherjee) developed a system to generate 3D trajectories of actors moving in a video sequence and textual commentary on these trajectories.

As a first step, we implemented a face detection system for still images based on the system described by Henry Rowley (see CMU Face Group in references) in his doctoral thesis. We supplement the base detector described in that thesis with a "clustering" technique, applicable both to still images and across frames of a video, that improves accuracy and groups multiple detections of the same face together. We later came across a much faster detector (a Haar detector proposed by Viola and Jones in 2001). For tracking we use the Continuously-Adaptive Mean-Shift algorithm. For more details, read on!

System Description

Face Detection

We use two basic face detectors in our system: one that uses a neural network as a classifier and is based on Henry Rowley's thesis (see CMU Face Group in references), and one based on Haar-like features, implemented in Intel's open source computer vision library (OpenCV). We observed that our implementations produced a high rate of false positives and detected the same face multiple times. To improve accuracy and to distinguish between different faces, we use our own clustering technique.

Clustering Faces

The base detector reports each face as a rectangular region. We describe a face by a 4-tuple (top-left x-coordinate of the rectangle, top-left y-coordinate, size, frame), where size is the size of the rectangular region in pixels and frame is the frame number of the image in the video sequence. In this 4-dimensional space we then cluster faces according to the Euclidean distance between their 4-tuples. The strength of a cluster is the sum of the strengths of its individual faces (i.e., the values of the network output for those faces). Only clusters with strength above a certain threshold (the value of the threshold depends on the number of frames in the sequence) are accepted as faces, so each accepted cluster corresponds to a distinct face in the video. The clustering works as follows:
  1. Let F be the set of all faces (4-tuples) detected by the base detector; say N faces were detected.
  2. Construct an "edge" between a pair of these N faces if the Euclidean distance between the corresponding 4-tuples is less than a certain threshold.
  3. Each connected component of the graph formed by the vertex set F and the edges described above forms a single cluster, corresponding to a single face.
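The steps above amount to finding connected components in a graph of detections. A minimal pure-Python sketch of the idea (function and parameter names are our own for illustration, and the threshold values here are placeholders, not the ones we actually used):

```python
import math
from collections import defaultdict

def cluster_faces(faces, strengths, dist_thresh, strength_thresh):
    """Group face detections into clusters by connected components.

    faces: list of 4-tuples (x, y, size, frame); strengths: the
    corresponding network outputs. Two detections are joined by an
    edge when the Euclidean distance between their 4-tuples is below
    dist_thresh; each connected component is one candidate face, kept
    only if its summed strength clears strength_thresh.
    """
    n = len(faces)
    parent = list(range(n))  # union-find forest over detections

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Build edges between nearby detections and merge their components.
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(faces[i], faces[j]) < dist_thresh:
                parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i in range(n):
        clusters[find(i)].append(i)

    # Keep only clusters whose total strength clears the threshold.
    return [members for members in clusters.values()
            if sum(strengths[k] for k in members) >= strength_thresh]
```

Weak isolated detections (likely false positives) fall into low-strength singleton clusters and are discarded, while repeated detections of the same face reinforce one another.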

Tracking across frames

Applying the detection system to each and every frame of the sequence as a means of tracking is enticing. However, the detection systems can detect only frontal, upright faces (i.e., they cannot detect people who are not looking straight into the camera), so this would not work well. Instead, we use a tracking algorithm that, once initialized with the detected region, keeps track of it through the video even as the face rotates or turns.

The algorithm we use is the Continuously-Adaptive Mean-Shift or CamShift algorithm. An implementation of this can also be found in Intel's OpenCV library.
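At the heart of CamShift is the mean-shift step: slide a search window to the centroid of a weight image (typically a colour-histogram back-projection of the face region) and repeat until the window stops moving; CamShift additionally resizes the window from the window's mass at each step. A simplified, fixed-window sketch in pure Python (this is an illustration of the idea, not OpenCV's API):

```python
def mean_shift(weights, window, max_iter=20):
    """One mean-shift search: re-centre a fixed-size window on the
    centroid of the weight mass under it, until it stops moving.

    weights: 2D grid of per-pixel weights (e.g. a back-projection),
    window: (x, y, w, h) with (x, y) the top-left corner.
    """
    x, y, w, h = window
    rows, cols = len(weights), len(weights[0])
    for _ in range(max_iter):
        # Zeroth and first moments of the mass inside the window.
        m00 = m10 = m01 = 0.0
        for j in range(max(0, y), min(rows, y + h)):
            for i in range(max(0, x), min(cols, x + w)):
                wt = weights[j][i]
                m00 += wt
                m10 += i * wt
                m01 += j * wt
        if m00 == 0:
            break  # no mass under the window; nowhere to shift
        # Move the window so it is centred on the centroid.
        nx = round(m10 / m00) - w // 2
        ny = round(m01 / m00) - h // 2
        if (nx, ny) == (x, y):
            break  # converged
        x, y = nx, ny
    return (x, y, w, h)
```

Because each frame's result seeds the next frame's search, the window follows the face even when it is no longer frontal and upright, which is exactly what the detectors alone cannot handle.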

3D Trajectory Generation

Detection gives us the initial face; tracking gives us the (x, y, scale) coordinates of the detected face as it moves through the image, where (x, y) are the coordinates of the center of the rectangular region being tracked and scale is the size of the rectangular box. We then convert these image coordinates into real-world (x, y, z) coordinates using simple transformations and some calibration information from the camera, such as the true distance of a face from the camera at a known detection scale and the aspect ratio of the camera image.
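As a rough sketch of such a transformation, assume a pinhole camera and that a face's apparent size is inversely proportional to its distance. The calibration names and values below are illustrative, not our actual calibration data:

```python
def image_to_world(u, v, scale, calib):
    """Convert a tracked face (u, v, scale) in image coordinates to
    approximate real-world (x, y, z) under a pinhole-camera model.

    calib holds assumed calibration values:
      'f'         focal length in pixels,
      'cx', 'cy'  image centre in pixels,
      'ref_dist'  distance (e.g. metres) at which a reference face
      'ref_scale' appeared with this many pixels of size.
    """
    # Apparent size shrinks as 1/distance, so depth follows from
    # the reference measurement: z = ref_dist * ref_scale / scale.
    z = calib['ref_dist'] * calib['ref_scale'] / scale
    # Back-project the image offset from the centre through depth z.
    x = (u - calib['cx']) * z / calib['f']
    y = (v - calib['cy']) * z / calib['f']
    return (x, y, z)
```

A face tracked at the image centre with half the reference scale would thus be placed straight ahead at twice the reference distance.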

References & Links

More information on face detection: Of course, Google and other search engines will always be a good source of information!


Some sample results:
[Face Detection]
Four of four faces detected by our neural network detector.
[Frame A] [Frame B] [Frame C]
[Trajectory] The images above show three of the many frames in the video. The person detected in the first frame was tracked through the others and the generated trajectory is shown in the figure on the left. The pyramid in the figure shows the field of view of the camera. Commentary generated was "Actor 0 moves from left to right. Actor 0 moves from right to near the camera".



More Details

This single web page doesn't do justice to the amount of work that went into creating this system or to how things were achieved. I will be putting up our final report on this work soon, so you can read that for the finer details.

Last modified: Mon Apr 21 01:48:12 India Standard Time 2003