Yu Huang*, Jilin Tu**, Thomas S Huang**

*Thomson Corporate Research at Princeton, **University of Illinois at Urbana-Champaign

E-mail: **.******@*******.***, *******@***.****.***, *****@***.****.***

ABSTRACT

In this paper we propose a framework for factorization-based non-rigid shape modeling and tracking in stereo-motion. We construct a measurement matrix from the stereo-motion data captured by a stereo rig. Organized in a particular way, this matrix can be decomposed by Singular Value Decomposition (SVD) into the 3D basis shapes, their configuration weights, the rigid motion and the camera geometry. Accordingly, the stereo correspondences can be inferred from motion correspondences, requiring only that a minimum of 3K point stereo correspondences (where K is the dimension of the shape basis space) are created in advance. This framework keeps the rank-constraint property, and it offers further advantages such as simpler correspondence and accurate reconstruction even with short image sequences. Results with real data are given to demonstrate its performance.

Index Terms: Visual Tracking, Stereo Vision

1. INTRODUCTION

Tracking an object and recovering its 3D shape from sequences of images are fundamental problems in the computer vision community. They have various applications such as scene modeling, robot navigation, object recognition and virtualized reality [1, 2, 6, 9]. Traditionally there are two vision-based methods for 3-D reconstruction: visual motion and stereo vision. Both depend on solving the notorious correspondence problem. This problem is relatively easy to handle in visual motion [12] because the extracted features have strong temporal association even without any prior knowledge of the dynamic model. By comparison, stereo vision has a much easier reconstruction task, by triangulation, but its correspondence task is severely ill-posed even with the epipolar constraints.

In visual motion, Tomasi and Kanade [12] proposed one of the most influential approaches, the factorization method for rigid objects and orthographic projection. The key idea is the decomposition of a measurement matrix into its shape and motion components. Various extensions have been put forward [7-8, 14]. Stemming from the rigid factorization method, a non-rigid factorization method was first proposed by Bregler et al. [13]. In non-rigid factorization, the 3D shape is represented by a linear combination of basic modes of deformation. Brand proposed a flexible factorization approach which minimizes the deformations relative to the mean shape by introducing an optimal correction matrix [1]. Recently Xiao et al. proposed a new set of constraints on the shape basis in [15] and gave a closed-form solution to non-rigid structure from motion.

Researchers have also tackled the topic of augmenting structure from motion with stereo information. Some works are feature-based [4, 6], while others, called direct methods, use the spatial and temporal image gradient information [10]. The notable problem is how to fully utilize the redundant information in the stereo-motion analysis, but in practice the more important issue is how to make the two basic cues complement each other. Recently some stereo-motion papers have taken non-rigid motion into account [2, 9, 3]. A basic primitive called the dynamic surfel, which encodes the instantaneous local shape, reflectance and motion of a small region in the scene, is proposed in [2] to build the scene's structure in space-time from multiple views. Likewise, the object is modeled in [9] by a time-varying multi-resolution subdivision surface, which is fitted to the image data from multiple views. Both methods have to solve quite complicated optimization problems. Only Del Bue et al. addressed non-rigid stereo motion by a factorization method [3]; however, the stereo correspondence is assumed to be given, and the focus was on shape recovery only.

In this paper, we discuss 3D non-rigid shape recovery and tracking based on factorization. Our motivations come from the work in [5, 13]. Performing SVD on the well-organized stereo-motion measurement matrix, we can factorize it into 3D basis shapes, their configuration weights, stereo geometry and rigid motion parameters. Moreover, we infer stereo correspondences from motion correspondences, requiring only that at least 3K point stereo correspondences (where K is the dimension of the shape basis space) are created initially. This framework keeps the rank-constraint property [13]. It is an extension of the work in [5] to non-rigid objects, so advantages such as simpler correspondence and accurate reconstruction even with short sequences are preserved.

Sect. 2.1 reviews the factorization work for the non-rigid motion model in [13]. Our extension to stereo-motion is described in Sect. 2.2. In Sect. 2.3 we discuss how to infer stereo correspondences. Sect. 3 provides our experimental results on real sequences.

2.1 Non-rigid Motion Model

The shape of the non-rigid object is described [13] as a key-frame basis set $S_1, S_2, \dots, S_K$. Each key-frame $S_i$ is a $3 \times P$ matrix describing $P$ points. The shape of a specific configuration $S_t$ at time frame $t$ is a linear combination of the basis set:

$$
S_t = \sum_{i=1}^{K} l_{t,i} S_i, \qquad S_t, S_i \in \mathbb{R}^{3 \times P}, \quad l_{t,i} \in \mathbb{R}. \tag{1}
$$

Assume a weak-perspective model (scaled orthographic model) for the camera projection process.
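As a minimal numerical sketch of the shape model in equation (1) under the scaled orthographic assumption, the Python/NumPy snippet below builds each shape as a weighted sum of basis shapes, projects it with the first two rows of a scaled rotation, and stacks the frames into one matrix. The sizes `K`, `P`, `N`, the random synthetic data and the helper names (`random_rotation`, `shape_at`, `project`) are illustrative assumptions and not part of the original method.

```python
import numpy as np

rng = np.random.default_rng(0)
K, P, N = 2, 20, 15  # hypothetical sizes: basis shapes, points, frames

# K key-frame basis shapes S_i, each 3 x P, centered at the origin
basis = rng.standard_normal((K, 3, P))
basis -= basis.mean(axis=2, keepdims=True)


def random_rotation(rng):
    """Random 3x3 rotation from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))   # fix the column signs
    if np.linalg.det(q) < 0:      # enforce det = +1 (proper rotation)
        q[:, 0] = -q[:, 0]
    return q


def shape_at(weights, basis):
    """Equation (1): S_t = sum_i l_{t,i} S_i, returned as a 3 x P array."""
    return np.tensordot(weights, basis, axes=1)


def project(shape, rotation, scale):
    """Scaled orthographic projection: first two rows of scale * R * S."""
    return scale * rotation[:2] @ shape


frames = []
for t in range(N):
    weights = rng.standard_normal(K)          # configuration weights l_{t,i}
    S_t = shape_at(weights, basis)
    frames.append(project(S_t, random_rotation(rng), rng.uniform(0.8, 1.2)))

W = np.vstack(frames)                         # 2N x P stack of image points
rank_W = np.linalg.matrix_rank(W)
print(W.shape, rank_W)
```

With noise-free synthetic data of this kind, the stacked matrix has rank at most 3K, which is the rank constraint that the factorization relies on.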

The 2D image points $(u_{t,i}, v_{t,i})$ are related to the 3D points of a configuration $S_t$ at a specific time frame $t$ by

$$
\begin{bmatrix} u_{t,1} & \dots & u_{t,P} \\ v_{t,1} & \dots & v_{t,P} \end{bmatrix}
= R'_t \left( \sum_{i=1}^{K} l'_{t,i} S_i \right) + T'_t, \tag{2}
$$

$$
R'_t = \begin{bmatrix} r_1 & r_2 & r_3 \\ r_4 & r_5 & r_6 \end{bmatrix}, \tag{3}
$$

where $R'_t$ ($2 \times 3$) contains the first two rows of the full 3-D rigid rotation matrix $R_t$, and $T'_t$ is the 2-D rigid translation vector (it consists of the first two components of the 3-D translation vector $T_t$). The weak-perspective scaling is implicitly coded in $l'_{t,1}, \dots, l'_{t,K}$. We can eliminate $T'_t$ by subtracting the mean of all 2D image points, and can then assume that $S_t$ is centered at the origin. We can rewrite the linear combination in (2) as a matrix multiplication:

$$
\begin{bmatrix} u_{t,1} & \dots & u_{t,P} \\ v_{t,1} & \dots & v_{t,P} \end{bmatrix}
= \begin{bmatrix} l'_{t,1} R'_t & \dots & l'_{t,K} R'_t \end{bmatrix}
\begin{bmatrix} S_1 \\ \vdots \\ S_K \end{bmatrix}. \tag{4}
$$

Stacking all point tracks over the whole sequence into a large measurement matrix $W$, we can write

$$
W = \underbrace{\begin{bmatrix}
l'_{1,1} R'_1 & \dots & l'_{1,K} R'_1 \\
l'_{2,1} R'_2 & \dots & l'_{2,K} R'_2 \\
\vdots & & \vdots \\
l'_{N,1} R'_N & \dots & l'_{N,K} R'_N
\end{bmatrix}}_{Q'}
\underbrace{\begin{bmatrix} S_1 \\ S_2 \\ \vdots \\ S_K \end{bmatrix}}_{B}. \tag{5}
$$

Here the $2N \times 3K$ matrix $Q'$ contains, for each time frame $t$, the pose $R'_t$ and the configuration weights $l'_{t,1}, \dots, l'_{t,K}$, and the $3K \times P$ matrix $B$ codes the $K$ key-frame basis shapes $S_i$. In the noise-free case, the rank of $W$ is $r \le 3K$. This factorization can be realized using SVD, i.e. $W = U \Sigma V^T = \hat{Q}' \hat{B}$, considering only the first $r$ singular values and singular vectors.

The next step is to extract the pose $R'_t$ and the shape basis weights $l'_{t,1}, \dots, l'_{t,K}$ from the matrix $Q'$. Each block $Q'_t$ in $Q'$ can be written as (for convenience, the time index is dropped on the right-hand side) [13]

$$
Q'_t = \begin{bmatrix} l'_{t,1} R'_t & \dots & l'_{t,K} R'_t \end{bmatrix}
= \begin{bmatrix}
l'_1 r_1 & l'_1 r_2 & l'_1 r_3 & \dots & l'_K r_1 & l'_K r_2 & l'_K r_3 \\
l'_1 r_4 & l'_1 r_5 & l'_1 r_6 & \dots & l'_K r_4 & l'_K r_5 & l'_K r_6
\end{bmatrix}.
$$

The elements of $Q'_t$ can be reordered into a new matrix:

$$
\bar{Q}'_t = \begin{bmatrix}
l'_1 r_1 & l'_1 r_2 & l'_1 r_3 & l'_1 r_4 & l'_1 r_5 & l'_1 r_6 \\
l'_2 r_1 & l'_2 r_2 & l'_2 r_3 & l'_2 r_4 & l'_2 r_5 & l'_2 r_6 \\
\vdots & & & & & \vdots \\
l'_K r_1 & l'_K r_2 & l'_K r_3 & l'_K r_4 & l'_K r_5 & l'_K r_6
\end{bmatrix}
= \begin{bmatrix} l'_1 \\ l'_2 \\ \vdots \\ l'_K \end{bmatrix}
\begin{bmatrix} r_1 & r_2 & r_3 & r_4 & r_5 & r_6 \end{bmatrix}, \tag{6}
$$

which shows that $\bar{Q}'_t$ is of rank 1 and can also be factored by SVD. Because the factorization of $W$ is not unique, there exists an invertible matrix $G$ that ortho-normalizes all of the sub-blocks $\hat{Q}'_t$. This leads to an alternative factorization:

$$
Q' = \hat{Q}' G, \qquad B = G^{-1} \hat{B}. \tag{7}
$$

Irani exploited rank constraints in [7] for optical flow estimation in the case of rigid motion. Building on this technique, a framework of robust tracking can be set up (details are in [13]).

2.2 Stereo-Motion Model

Below we also utilize the rank constraints to help stereo correspondence. Let $(R, T)$ be the rotational and translational relationship between the stereo cameras. Under a scaled orthographic camera model we can again assume that the shape has been centered at the origin. Therefore the translation $T$ can be removed from the shape relationship, since a translation in depth only affects the scale factor and a translation in the image plane is eliminated by centering. So the 3D coordinates of any point with respect to the two camera coordinate frames, $S_l$ and $S_r$, and the corresponding shape bases, $S_{l,i}$ and $S_{r,i}$ ($i = 1, 2, \dots, K$), are related by

$$
S_r = R S_l, \qquad S_{r,i} = R S_{l,i}, \quad i = 1, 2, \dots, K. \tag{8}
$$

Now we rewrite (4) as

$$
W = \underbrace{\begin{bmatrix} F_1 & & \\ & \ddots & \\ & & F_N \end{bmatrix}}_{F}
\underbrace{\begin{bmatrix}
l_{1,1} R_1 & \dots & l_{1,K} R_1 \\
l_{2,1} R_2 & \dots & l_{2,K} R_2 \\
\vdots & & \vdots \\
l_{N,1} R_N & \dots & l_{N,K} R_N
\end{bmatrix}}_{Q}
\underbrace{\begin{bmatrix} S_1 \\ S_2 \\ \vdots \\ S_K \end{bmatrix}}_{B}, \tag{9}
$$

where $F_t$ is the $2 \times 3$ scaled orthographic projection matrix given by

$$
F_t = s_t \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}, \tag{10}
$$

with $s_t$ as the scale factor at time frame $t$.

Applying the non-rigid motion model to the two cameras separately, one obtains two image measurement matrices:

$$
W_l = F_l Q_l B_l, \qquad W_r = F_r Q_r B_r. \tag{11}
$$

Because the shape is centered at the origin, we can omit the translation component in the relationship between the two rigid motion representations for the two camera coordinate frames and only consider the relationship of the rotation components, $R_{r,t} R = R R_{l,t}$ (some derivations are given in our technical report).
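One way to see the rotation relationship is sketched below. It assumes (our reading, not necessarily the route taken in the technical report) that the rigid motion at frame $t$ acts on the origin-centered shape in each camera frame, so that the moved points are $X_{l,t} = R_{l,t} S_l$ in the left frame and $X_{r,t} = R_{r,t} S_r$ in the right frame, and that the fixed rig rotation $R$ of (8) also maps the moved points, $X_{r,t} = R X_{l,t}$. The symbols $X_{l,t}$ and $X_{r,t}$ are introduced here only for illustration.

$$
X_{r,t} = R X_{l,t} = R R_{l,t} S_l,
\qquad
X_{r,t} = R_{r,t} S_r = R_{r,t} R S_l .
$$

Equating the two expressions for every non-degenerate shape $S_l$ (one whose points span three dimensions) gives

$$
R_{r,t} R = R R_{l,t}
\quad\Longleftrightarrow\quad
R_{r,t} = R R_{l,t} R^{T}.
$$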


Consequently, we write

$$
\Omega = \begin{bmatrix} W_l \\ W_r \end{bmatrix}
= \begin{bmatrix} F_l Q_l B_l \\ F_r Q_r B_r \end{bmatrix}
= \begin{bmatrix} F_l Q_l B_l \\ F'_r E Q_l B_l \end{bmatrix}
= \underbrace{\begin{bmatrix} F_l & 0 \\ 0 & F'_r \end{bmatrix}}_{H}
\begin{bmatrix} I_{3N} \\ E \end{bmatrix} Q_l B_l, \tag{12}
$$

where $F'_r$ codes the scaling change of $F_r$ due to the translation $T$, and the $3N \times 3N$ matrix $E$ is given as

$$
E = \begin{bmatrix} R & & \\ & \ddots & \\ & & R \end{bmatrix}. \tag{13}
$$

Equation (12) represents the matrix decomposition of the stereo-motion correspondences into the 3D structure $B_l$, the rigid motion and shape basis weights $Q_l$, the stereo geometry $E$ and the camera parameters $H$. The stacked matrix $\Omega$, like $W_l$ and $W_r$, is of rank at most $3K$: $\operatorname{rank}(\Omega) \le 3K$. Based on this rank property, we can infer stereo matching from motion correspondences.

2.3 Stereo Matching Inference

Assume distinct feature points are extracted from the stereo image sequences, and in each sequence they are tracked separately using the motion correspondence method. At this point the stereo correspondences are not yet established, while the estimated dense motion correspondences are assumed to be mostly correct. With such motion correspondences, the measurement matrices $\hat{W}_l$ and $\hat{W}_r$ can be constructed; unlike $W_l$ and $W_r$, their columns have not been properly ordered. As $\Omega$ is of rank at most $3K$, a basis of the $3K$-dimensional subspace can be set up as long as a minimum of $3K$ linearly independent columns of $\Omega$ are available. All the other columns of $\Omega$ are then inferred from this basis.

Suppose $k$ matches are obtained by some stereo correspondence technique with epipolar constraints (to simplify the 1D search on the epipolar line, image rectification can be done prior to stereo matching), where $k \ge 3K$. The corresponding columns of $\hat{W}_l$ and $\hat{W}_r$ can be stacked into a $4N \times k$ sub-matrix $\Omega_k$. The SVD of $\Omega_k$ is $\Omega_k = U_k \Sigma_k V_k^T$. The first $3K'$ columns of $U_k$ form the optimal basis of the $3K'$-dimensional vector subspace (note that $K'$ is the estimated number of shape bases, which may not be equal to the true number $K$). Let $a_1, a_2, \dots, a_{3K'}$ be the extracted basis vectors of the column space of $\Omega$, and let the $4N \times 3K'$ matrix $A = [a_1, a_2, \dots, a_{3K'}]$, so that a column $v$ of $\Omega$ is a linear combination of the columns of $A$. Let the two $2N \times 3K'$ matrices $A_l$ and $A_r$ be the top-half and bottom-half sub-matrices of $A$, so that the columns of $A_l$ belong to $W_l$ and the columns of $A_r$ to $W_r$. For a column $v_l$ of $\hat{W}_l$, its stereo correspondence $v_r$ in $\hat{W}_r$ can be predicted from $A_l$ and $A_r$ as

$$
v_r = (A_r A_l^{+}) v_l, \tag{14}
$$

where $A_l^{+}$ is the pseudo-inverse of $A_l$, given by

$$
A_l^{+} = (A_l^T A_l)^{-1} A_l^T. \tag{15}
$$

The predicted result may not be exact due to noise, so a measure for feature matching is needed. Normally we calculate the least-mean-squares error (LMSE) over all positions in the entire image sequence with reference to the prediction results. However, even if this measure is small enough, we cannot guarantee that the pair is a correct stereo match. An additional measure related to windowed template matching is therefore taken into account, i.e. the average normalized correlation must be high enough [5]. If it is not, the image feature is ignored. Finally, all the inferred stereo correspondences are grouped together to re-estimate the basis $A$, which is expected to be more accurate. This process can be iterated until convergence.

Once all the stereo correspondences are obtained, we still reconstruct the 3D deformable shape via triangulation from the views of the calibrated stereo cameras [6]. Consequently we can calculate the 3D shape basis by factorization from the measurement matrix of 3D point positions, similar to (5) and (9), and then extract the pose parameters and the shape basis configuration weights by rank-1 constraints. Different from (5), this time we can extract all nine components of the rotation matrix rather than only the top two rows. Recovering the pose $R_t$ and the original configuration weights $l_{t,1}, \dots, l_{t,K}$ realizes 3-D non-rigid tracking.

3. EXPERIMENTAL RESULTS

Because of limited space, only results with real data are given here. In the experimental setup the two digital video cameras are mounted vertically and connected to a PC through 1394 links. The human face recordings in the collected videos are captured at a resolution of 320x240 and 30 frames per second. They contain rigid head motions and non-rigid eye/eyebrow/mouth facial motions.

It is difficult to estimate optical flow from facial motions using traditional gradient-based or template matching methods because the facial surface is smooth and its motion is non-rigid. We choose to use a Bezier volume model-based face tracker to obtain the optical flow around the face area [11]. For each camera, we track the facial motion using an independent face tracker with a dense 3D geometrical mesh model. The first experiment reconstructs the facial structure from rigid facial motions. In the videos, the human head moves up and backward within 30 frames. A pair of stereo images with the tracking points depicted is shown in Fig. 1.

As the face trackers are applied independently to the video sequences of the two cameras, we do not know whether there is correspondence between the mesh points of the face models used by the two face trackers, except for the points at the eye corners and mouth corners. We identify these points as distinct feature points (shown in red), and the correspondences of the remaining points are inferred using the bases factorized from the optical flow vectors of these distinct feature points. In the rigid motion case, we take the number of bases K=3. Fig. 2 shows the correspondences found between the optical flows estimated by the two face trackers. The red trajectories are the mapping of the optical flow of the mesh points from the upper camera view to the lower camera view using equation (14). The green trajectories show the corresponding trajectories of mesh points found in the video of the lower camera.
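The mapping of equation (14) and the least-squares matching step can be sketched on synthetic, noise-free data as follows. This is only an illustration under stated assumptions: the sizes `K`, `P`, `N`, `k`, the random rig rotation `R_stereo`, the per-frame scales, and the names `Omega_k`, `predict` and `matched` are hypothetical, and `np.linalg.pinv` is used in place of the explicit formula (15), with which it agrees when `A_l` has full column rank. The windowed correlation check of [5] is not included.

```python
import numpy as np

rng = np.random.default_rng(1)
K, P, N, k = 2, 30, 12, 8   # hypothetical sizes; k >= 3K seed matches


def random_rotation(rng):
    """Random 3x3 rotation from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q


# Synthetic non-rigid object seen by two scaled orthographic cameras
basis = rng.standard_normal((K, 3, P))
basis -= basis.mean(axis=2, keepdims=True)
R_stereo = random_rotation(rng)              # rotation between the cameras

rows_l, rows_r = [], []
for t in range(N):
    S_t = np.tensordot(rng.standard_normal(K), basis, axes=1)
    X_l = random_rotation(rng) @ S_t         # points in the left camera frame
    X_r = R_stereo @ X_l                     # same points in the right frame
    rows_l.append(rng.uniform(0.8, 1.2) * X_l[:2])
    rows_r.append(rng.uniform(0.8, 1.2) * X_r[:2])
W_l, W_r = np.vstack(rows_l), np.vstack(rows_r)   # each 2N x P

# Hide the correspondence of all right-view columns except the first k
perm = np.concatenate([np.arange(k), k + rng.permutation(P - k)])
W_r_shuffled = W_r[:, perm]

# Basis of the joint column space from the k seed matches
Omega_k = np.vstack([W_l[:, :k], W_r_shuffled[:, :k]])   # 4N x k
U, s, Vt = np.linalg.svd(Omega_k, full_matrices=False)
A = U[:, :3 * K]
A_l, A_r = A[:2 * N], A[2 * N:]

# Equation (14): predict right-view trajectories of unmatched left columns
predict = A_r @ np.linalg.pinv(A_l)
V_r_pred = predict @ W_l[:, k:]

# Match each prediction to the right-view column with least mean squared error
candidates = W_r_shuffled[:, k:]
err = ((V_r_pred[:, :, None] - candidates[:, None, :]) ** 2).mean(axis=0)
matched = err.argmin(axis=1)                 # index into `candidates`
print(matched)
```

In this noise-free setting the predictions coincide with the true right-view trajectories; with real tracks the error is nonzero, which is why the text adds a correlation test and re-estimates the basis iteratively.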

After the correspondence is established, the 3D face geometrical structure at each time instant can be reconstructed. Fig. 3 shows the reconstructed mesh points in 3D space.

Fig. 1. Tracking result for rigid motion: (a) upper camera, (b) lower camera.

Fig. 2. Optical flow trajectories.

Fig. 3. Reconstructed points.

In order to verify our approach with non-rigid motion, we further identified a stereo video sequence in which the subject opens the mouth within 8 frames. As shown in Fig. 4, the distinct facial features (depicted in red) are the eye corners, mouth corners, nostrils, and the centers of the upper and lower lips. As the non-rigid motion only contains the opening mouth, we take K=6 in this case. The correspondences found between the optical flow trajectories are shown in Fig. 5. It shows that most of the found correspondences of the optical flow trajectories are caused by the opening mouth. The reconstructed 3D face geometric structure is shown in Fig. 6, where the purple dots are the reconstructed 3D points.

Fig. 4. Tracking results for non-rigid motion: (a) upper camera, (b) lower camera.

Fig. 5. Optical flow trajectories.

Fig. 6. Reconstructed face.

4. CONCLUSIONS AND FUTURE WORK

We have presented a framework for recovering 3D non-rigid shape and motion viewed from calibrated stereo cameras. The approach is factorization-based, so it naturally has the rank-constraint property. It also provides a mechanism for inferring stereo correspondences from motion correspondences, requiring only that a minimum of 3K point stereo correspondences are created initially. The combination of motion and stereo cues offers advantages such as simpler stereo correspondence and accurate reconstruction even with short sequences. Experimental results from real stereo sequences are given to demonstrate the performance of the proposed method. Future work will address how to detect a large number of outliers for robust factorization and how to realize 3D model-based tracking along with model refinement.

We thank Dr. Zhengyou Zhang at Microsoft Research for allowing us to use the test stereo sequences.

REFERENCES

[1] M. E. Brand, "Morphable 3D models from video," IEEE CVPR'01, December 2001.

[2] R. L. Carceroni, K. N. Kutulakos, "Multi-View Scene Capture by Surfel Sampling: From Video Streams to Non-Rigid 3D Motion, Shape & Reflectance," ICCV'01, June 2001.

[3] A. Del Bue, L. Agapito, "Non-rigid stereo factorization," IJCV, 66(2), pp. 193-207, 2006.

[4] F. Dornaika and R. Chung, "Stereo Correspondence from Motion Correspondence," IEEE CVPR'99, pp. 70-75, 1999.

[5] P. K. Ho and R. Chung, "Stereo-Motion that Complements Stereo and Motion Analysis," IEEE CVPR'97, pp. 213-218, 1997.

[6] Y. Huang, T. S. Huang, "Facial Tracking with Head Pose Estimation in Stereo Vision," IEEE ICIP'02, Sept. 2002.

[7] M. Irani, "Multi-Frame Optical Flow Estimation Using Subspace Constraints," IEEE ICCV'99, September 1999.

[8] M. Irani and P. Anandan, "Factorization with Uncertainty," ECCV'00, June 2000.

[9] J. Neumann and Y. Aloimonos, "Spatio-temporal stereo using multi-resolution subdivision surfaces," IJCV, 47(1), pp. 181-193, 2002.

[10] G. Stein and A. Shashua, "Direct estimation of motion and extended scene structure for a moving stereo rig," IEEE CVPR'98, 1998.

[11] H. Tao and T. S. Huang, "Explanation-based facial motion tracking using a piecewise Bezier volume deformation model," IEEE CVPR'99, 1999.

[12] C. Tomasi and T. Kanade, "Shape and Motion from Image Streams under Orthography: a Factorization Method," IJCV, vol. 9, no. 2, pp. 137-154, 1992.

[13] L. Torresani, D. Yang, G. Alexander, C. Bregler, "Tracking and Modelling Non-Rigid Objects with Rank Constraints," IEEE CVPR'01, 2001.

[14] B. Triggs, "Factorization Methods for Projective Structure and Motion," Proc. IEEE CVPR'96, pp. 845-851, 1996.

[15] J. Xiao, J. Chai, T. Kanade, "A closed-form solution to non-rigid shape and motion recovery," ECCV'04.
