A FACTORIZATION METHOD IN STEREO MOTION FOR NON-RIGID OBJECTS
Yu Huang*, Jilin Tu**, Thomas S Huang**
*Thomson Corporate Research at Princeton, **University of Illinois at Urbana-Champaign
E-mail: **.******@*******.***, *******@***.****.***, *****@***.****.***
ABSTRACT

In this paper we propose a framework for factorization-based non-rigid shape modeling and tracking in stereo-motion. We construct a measurement matrix from the stereo-motion data captured by a stereo rig. Organized in a particular way, this matrix can be decomposed by Singular Value Decomposition (SVD) into the 3D basis shapes, their configuration weights, the rigid motion and the camera geometry. Accordingly, stereo correspondences can be inferred from motion correspondences, requiring only that a minimum of 3K point stereo correspondences (where K is the dimension of the shape basis space) are created in advance. This framework keeps the property of rank constraints while offering further advantages such as simpler correspondence and accurate reconstruction even with short image sequences. Results with real data are given to demonstrate its performance.

Index Terms: Visual Tracking, Stereo Vision

1. INTRODUCTION

Tracking an object and recovering its 3D shape from image sequences are fundamental problems in computer vision. They have various applications such as scene modeling, robot navigation, object recognition and virtualized reality [1, 2, 6, 9]. Traditionally there exist two vision-based methods for 3-D reconstruction: visual motion and stereo vision. Both depend on solving the notorious correspondence problem. This problem is relatively easy to handle in visual motion [12] because the extracted features have strong temporal association even without any prior knowledge of the dynamic model. Stereo vision, by comparison, enjoys a much easier reconstruction task via triangulation, but the stereo correspondence task is severely ill-posed even with the epipolar constraints.

In visual motion, Tomasi and Kanade [12] proposed one of the most influential approaches, the factorization method for rigid objects under orthographic projection. The key idea is the decomposition of a measurement matrix into its shape and motion components. Various extensions have been put forward [7-8, 14]. Stemming from the rigid factorization method, a non-rigid factorization method was first proposed by Bregler et al. [13]. In non-rigid factorization, the 3D shape is represented by a linear combination of basic modes of deformation. Brand proposed a flexible factorization approach which minimizes the deformations relative to the mean shape by introducing an optimal correction matrix [1]. Recently Xiao et al. proposed a new set of constraints on the shape basis and gave a closed-form solution to non-rigid structure from motion [15].

Researchers have also tackled the topic of augmenting structure from motion with stereo information. Some works are feature-based [4, 6], while others, called direct methods, use the spatial and temporal image gradient information [10]. One notable problem is how to fully utilize the redundant information in stereo-motion analysis, but in practice the more important issue is how to make the two basic cues complement each other. Recently, several stereo-motion papers have taken non-rigid motion into account [2, 9, 3]. A basic primitive called the dynamic surfel, which encodes the instantaneous local shape, reflectance and motion of a small region in the scene, is proposed in [2] to build the scene's structure in space-time from multiple views. Likewise, the object is modeled by a time-varying multi-resolution subdivision surface in [9], which is fitted to the image data from multiple views. Clearly, both of these methods have to solve rather complicated optimization problems. Only Del Bue et al. addressed non-rigid stereo motion by a factorization method [3]; nevertheless, stereo correspondence is assumed to be given and the focus was on shape recovery only.

In this paper, we discuss 3D non-rigid shape recovery and tracking based on factorization. Our motivations come from the work in [5, 13]. Performing singular value decomposition (SVD) on a well-organized stereo-motion measurement matrix, we factorize it into 3D basis shapes, their configuration weights, the stereo geometry and the rigid motion parameters. Moreover, we infer stereo correspondences from motion correspondences, requiring only that at least 3K point stereo correspondences (where K is the dimension of the shape basis space) are created initially. This framework still retains the property of rank constraints [13]. It extends the work of [5] to non-rigid objects, so advantages such as simpler correspondence and accurate reconstruction even with short sequences are preserved.

Sect. 2.1 reviews the factorization work for the non-rigid motion model in [13]. Our extension to stereo-motion is described in Sect. 2.2. In Sect. 2.3 we discuss how to infer stereo correspondences. Sect. 3 provides our experimental results on real sequences.

2. STEREO-MOTION FACTORIZATION

2.1 Non-rigid Motion Model

The shape of the non-rigid object is described [13] by a key-frame basis set S_1, S_2, ..., S_K. Each key frame S_i is a 3xP matrix describing P points. The shape of a specific configuration S_t at time frame t is a linear combination of the basis set:

    S_t = \sum_{i=1}^{K} l_{t,i} S_i,   S_t, S_i \in \mathbb{R}^{3 \times P},   l_{t,i} \in \mathbb{R}.   (1)
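As a concrete illustration of the linear shape model in (1), the following minimal NumPy sketch composes a shape from synthetic basis shapes (the basis shapes and weights here are arbitrary random data, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
K, P = 3, 10                      # number of basis shapes, number of points

# K key-frame basis shapes, each a 3xP matrix of 3D points (eq. (1))
S_basis = [rng.standard_normal((3, P)) for _ in range(K)]
l_t = rng.standard_normal(K)      # configuration weights at time frame t

# shape at time frame t: linear combination of the basis set
S_t = sum(l_t[i] * S_basis[i] for i in range(K))
print(S_t.shape)                  # (3, P)
```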
Assume a weak-perspective model (scaled orthographic model) for the camera projection process. The 2D image points (u_{t,i}, v_{t,i}) are related to the 3D points of a configuration S_t at a specific time frame t by

    [ u_{t,1} ... u_{t,P} ; v_{t,1} ... v_{t,P} ] = R'_t ( \sum_{i=1}^{K} l_{t,i} S_i ) + T'_t,   (2)

    R'_t = [ r_1  r_2  r_3 ; r_4  r_5  r_6 ],   (3)

where R'_t (2x3) contains the first two rows of the full 3-D rigid rotation matrix R_t, and T'_t is the 2-D rigid translational vector (it consists of the first two components of the 3-D translation vector T_t). The weak-perspective scaling has been implicitly coded in l'_{t,1}, ..., l'_{t,K}. Actually, we can eliminate T'_t by subtracting the mean of all 2D image points, and then assume that S_t is centered at the origin. We can rewrite the linear combination in (2) as a matrix multiplication:

    [ u_{t,1} ... u_{t,P} ; v_{t,1} ... v_{t,P} ] = [ l'_{t,1} R'_t  ...  l'_{t,K} R'_t ] [ S_1 ; S_2 ; ... ; S_K ].   (4)

Stacking all point tracks over the whole sequence into a large measurement matrix W, we can write

    W = \underbrace{[ l'_{1,1} R'_1 ... l'_{1,K} R'_1 ; l'_{2,1} R'_2 ... l'_{2,K} R'_2 ; ... ; l'_{N,1} R'_N ... l'_{N,K} R'_N ]}_{Q'} \underbrace{[ S_1 ; S_2 ; ... ; S_K ]}_{B}.   (5)

Here the 2Nx3K matrix Q' contains, for each time frame t, the pose R'_t and the configuration weights l'_{t,1}, ..., l'_{t,K}, and the 3KxP matrix B codes the K key-frame basis shapes S_i. In the noise-free case, the rank of W is r <= 3K. This factorization can be realized using SVD, i.e. W = U \Sigma V^T = Q' B, considering only the first r singular values and singular vectors.
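The truncated-SVD factorization of W in (5) can be sketched as follows (synthetic noise-free data; the particular split of U \Sigma V^T into Q' and B shown here is one arbitrary choice, since the factorization is only defined up to an invertible matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, P = 20, 2, 30               # frames, basis shapes, points

# synthetic noise-free measurements: W = Q' B, Q' (2N x 3K), B (3K x P)
Q = rng.standard_normal((2 * N, 3 * K))
B = rng.standard_normal((3 * K, P))
W = Q @ B

r = np.linalg.matrix_rank(W)      # rank is at most 3K
U, s, Vt = np.linalg.svd(W, full_matrices=False)
Q_hat = U[:, :3 * K] * s[:3 * K]  # one possible Q' (up to an invertible G)
B_hat = Vt[:3 * K, :]             # corresponding B
print(r, np.allclose(W, Q_hat @ B_hat))
```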
The next step is to extract the pose R'_t and the shape basis weights l'_{t,1}, ..., l'_{t,K} from the matrix Q'. Each Q'_t in Q' can be written as (for convenience, the time index is dropped) [13]

    Q'_t = [ l'_1 R' ... l'_K R' ] = [ l'_1 r'_1  l'_1 r'_2  l'_1 r'_3  ...  l'_K r'_1  l'_K r'_2  l'_K r'_3 ; l'_1 r'_4  l'_1 r'_5  l'_1 r'_6  ...  l'_K r'_4  l'_K r'_5  l'_K r'_6 ].

The elements of Q'_t can be reordered into a new matrix:

    \bar{Q}'_t = [ l'_1 r'_1  l'_1 r'_2  l'_1 r'_3  l'_1 r'_4  l'_1 r'_5  l'_1 r'_6 ; l'_2 r'_1  l'_2 r'_2  l'_2 r'_3  l'_2 r'_4  l'_2 r'_5  l'_2 r'_6 ; ... ; l'_K r'_1  l'_K r'_2  l'_K r'_3  l'_K r'_4  l'_K r'_5  l'_K r'_6 ] = [ l'_1 ; l'_2 ; ... ; l'_K ] [ r'_1  r'_2  r'_3  r'_4  r'_5  r'_6 ],   (6)

which shows that \bar{Q}'_t is of rank 1 and can also be factored by SVD. Because this factorization is not unique, there exists an invertible matrix G that ortho-normalizes all of the sub-blocks Q'_t. This leads to an alternative factorization:

    \tilde{Q}' = Q' G,   \tilde{B} = G^{-1} B.   (7)

Irani exploited rank constraints in [7] for optical flow estimation in the case of rigid motion. Building on this technique, a framework of robust tracking can be set up (details are in [13]).

2.2 Stereo-Motion Model

Below we also utilize the rank constraints to help stereo correspondence. Let (R, T) be the rotational and translational relationship between the stereo cameras. Under a scaled orthographic camera model we can again assume the shape has been centered at the origin. Therefore the translation T can be subtracted from the shape relationship, since a translation component in depth affects only the scale factor and a translation component in the image plane is eliminated. So the 3D coordinates of any point with respect to the two camera coordinate frames, S_l and S_r, and the corresponding shape bases, S_{l,i} and S_{r,i} (i = 1, 2, ..., K), are related by

    S_r = R S_l,   S_{r,i} = R S_{l,i},   i = 1, 2, ..., K.   (8)

Now we rewrite (4) as

    W = \underbrace{\mathrm{blockdiag}(F_1, ..., F_N)}_{F} \underbrace{[ l_{1,1} R_1 ... l_{1,K} R_1 ; l_{2,1} R_2 ... l_{2,K} R_2 ; ... ; l_{N,1} R_N ... l_{N,K} R_N ]}_{Q} \underbrace{[ S_1 ; S_2 ; ... ; S_K ]}_{B},   (9)

where F_t is the 2x3 scaled orthographic projection matrix given by

    F_t = s_t [ 1  0  0 ; 0  1  0 ],   (10)

with s_t the scale factor at time frame t.

Applying the non-rigid motion model to the two cameras separately, one obtains two image measurement matrices respectively as

    W_l = F_l Q_l B_l,   W_r = F_r Q_r B_r.   (11)

Because the shape is centered at the origin, we can omit the translation component in the relationship between the two rigid motion representations for the two camera coordinate frames and consider only the relationship of the rotation components, R_{r,t} R = R R_{l,t} (some derivations are given in our technical report*).

* http://www.ifp.uiuc.edu/~yuhuang/Factorization03.pdf
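As a sanity check on (8)-(11), the following sketch builds synthetic left- and right-camera measurement matrices from a common non-rigid scene under scaled orthography (all rotations, weights and scales are random stand-ins) and verifies that each measurement matrix has rank at most 3K:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, P = 15, 2, 25                  # frames, basis shapes, points

def random_rotation():
    # QR of a random matrix gives an orthogonal matrix; flip sign to det = +1
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return q if np.linalg.det(q) > 0 else -q

R = random_rotation()                # stereo rotation: S_r = R S_l, eq. (8)
S_l = [rng.standard_normal((3, P)) for _ in range(K)]   # left-frame basis

rows_l, rows_r = [], []
s_l, s_r = 1.0, 0.9                  # per-camera scale factors (constant here)
for t in range(N):
    R_lt = random_rotation()                       # left-camera pose
    l_t = rng.standard_normal(K)                   # configuration weights
    S_lt = sum(l_t[i] * S_l[i] for i in range(K))  # eq. (1)
    # F_t in eq. (10) keeps the first two rows, scaled by s_t
    rows_l.append(s_l * (R_lt @ S_lt)[:2])         # left image rows
    rows_r.append(s_r * (R @ R_lt @ S_lt)[:2])     # right camera sees R S_lt

W_l, W_r = np.vstack(rows_l), np.vstack(rows_r)    # each 2N x P, eq. (11)
print(np.linalg.matrix_rank(W_l), np.linalg.matrix_rank(W_r))  # both <= 3K
```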
Consequently, we write

    [ W_l ; W_r ] = [ F_l Q_l B_l ; F'_r \tilde{E} Q_l B_l ] = \underbrace{[ F_l  0 ; 0  F'_r ]}_{H} \underbrace{[ I_{3N} ; \tilde{E} ]}_{E} Q_l B_l,   (12)

where F'_r has absorbed the scaling change of F_r due to the translation T, and the 3Nx3N matrix \tilde{E} is the block-diagonal matrix built from the stereo rotation R:

    \tilde{E} = \mathrm{blockdiag}(R, ..., R).   (13)

Equation (12) represents the decomposition of the stereo-motion correspondences into the 3D structure B_l, the rigid motion and shape basis weights Q_l, the stereo geometry E and the camera parameters H. Obviously the stacked matrix [ W_l ; W_r ], like W_l and W_r, is of rank at most 3K. Based on this rank property, we can infer stereo matching from motion correspondences.

2.3 Stereo Matching Inference

Assume distinct feature points are extracted from the stereo image sequences, and in each sequence they are tracked separately by a motion correspondence method. The stereo correspondences are not established yet, while the estimated dense motion correspondences are assumed to be mostly correct. With such motion correspondences, measurement matrices \hat{W}_l and \hat{W}_r can be constructed; unlike W_l and W_r, their columns have not been properly ordered. As \hat{W} = [ \hat{W}_l ; \hat{W}_r ] is of rank at most 3K, a basis of the 3K-dimensional subspace can be set up as long as a minimum of 3K linearly independent columns of \hat{W} are available. All the other columns of \hat{W} can then be inferred from this basis.

Suppose k matches are obtained by some stereo correspondence technique with epipolar constraints (to simplify the 1D search along the epipolar line, image rectification can be performed prior to stereo matching), where k >= 3K. The corresponding columns of \hat{W}_l and \hat{W}_r can be stacked into a 4Nxk sub-matrix \hat{W}_k. The SVD of \hat{W}_k is \hat{W}_k = U_k \Sigma_k V_k^T. The first 3K' columns of U_k construct the optimal basis of the 3K'-dimensional vector subspace (note that K' is the estimated number of shape bases, which may not equal the true number K). Let a_1, a_2, ..., a_{3K'} be the extracted basis vectors of the column space of \hat{W}_k, and let A be the 4Nx3K' matrix A = [ a_1, a_2, ..., a_{3K'} ], so any column v of \hat{W} is a linear combination of the columns of A. Let the two 2Nx3K' matrices A_l and A_r be the top-half and bottom-half sub-matrices of A respectively, such that the rows of A_l correspond to \hat{W}_l and the rows of A_r to \hat{W}_r. For a column v_l of \hat{W}_l, its stereo correspondence v_r in \hat{W}_r can be predicted from A_l and A_r as

    v_r = ( A_r A_l^+ ) v_l,   (14)

where A_l^+ is the pseudo-inverse of A_l, given by

    A_l^+ = ( A_l^T A_l )^{-1} A_l^T.   (15)

However, the predicted result may not be exact due to noise, so a measure for feature matching is needed: normally we calculate the least-mean-squares error (LMSE) over all positions in the entire image sequence with respect to the prediction results. Even when this measure is small, we cannot guarantee a correct pair of stereo matches; an additional measure based on windowed template matching can be taken into account, i.e. the average normalized correlation must be high enough [5]. If not, the image feature is ignored. Finally, all the inferred stereo correspondences are grouped together to re-estimate the basis A, which is expected to be more accurate. This process can be iterated until convergence.

Once all the stereo correspondences are obtained, we still reconstruct the 3D deformable shape via triangulation from the views of the calibrated stereo cameras [6]. Consequently, we can calculate the 3D shape basis by factorization of the measurement matrix of 3D point positions, similar to (5) and (9), and then extract the pose parameters and shape basis configuration weights by rank-1 constraints. Different from (5), this time we can extract all nine components of the rotation matrix rather than only the top two rows. Recovering the pose R_t and the original configuration weights l_{t,1}, ..., l_{t,K} actually realizes 3-D non-rigid tracking.
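The subspace-based prediction of (14)-(15) can be sketched as follows (synthetic rank-3K data stands in for the seed correspondences; the sizes and the choice of test column are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, P, k = 12, 2, 40, 10        # frames, bases, points, seed matches (k >= 3K)

# synthetic stacked measurements [W_l; W_r] of rank 3K (4N x P)
H = rng.standard_normal((4 * N, 3 * K))
X = rng.standard_normal((3 * K, P))
W = H @ X
W_l, W_r = W[:2 * N], W[2 * N:]

# basis A from the SVD of the 4N x k sub-matrix of seed correspondences
U, _, _ = np.linalg.svd(W[:, :k], full_matrices=False)
A = U[:, :3 * K]                  # first 3K left singular vectors
A_l, A_r = A[:2 * N], A[2 * N:]   # top-half and bottom-half sub-matrices

# predict the right-image track of an unseen column from its left track
A_l_pinv = np.linalg.inv(A_l.T @ A_l) @ A_l.T   # eq. (15)
v_l = W_l[:, k + 3]               # a column with no stereo match yet
v_r_pred = (A_r @ A_l_pinv) @ v_l               # eq. (14)
print(np.allclose(v_r_pred, W_r[:, k + 3]))     # noise-free: exact
```

With noisy data the prediction is only approximate, which is why the LMSE and normalized-correlation checks above are needed.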
3. EXPERIMENTAL RESULTS

Due to space limitations, only results with real data are given here. In the experimental setup, two digital video cameras are mounted vertically and connected to a PC through 1394 links. The human face recordings in the collected videos are captured at a resolution of 320x240 and 30 frames per second. They contain rigid head motions and non-rigid eye/eyebrow/mouth facial motions.

It is difficult to estimate optical flow from facial motions using traditional gradient-based or template matching methods, because the facial surface is smooth and its motion is non-rigid. We choose to use a Bezier volume model-based face tracker to obtain the optical flow around the face area [11]. For each camera, we track the facial motion using an independent face tracker with a dense 3D geometrical mesh model. The first experiment is to reconstruct the facial structure from rigid facial motions. In the videos, the human head moves up and backward within 30 frames. A pair of stereo images with the tracked points depicted is shown in Fig. 1.

Fig. 1. Tracking result for rigid motion: (a) upper camera; (b) lower camera.

Because the face trackers are applied independently to the video sequences of the two cameras, we do not know whether there is correspondence between the mesh points of the face models used by the two trackers, except for the points at the eye corners and mouth corners. We identify these points as distinct feature points (shown in red), and the correspondences of the remaining points are inferred using the bases factorized from the optical flow vectors of these distinct feature points. In the rigid motion case, we take the number of bases K = 3. Fig. 2 shows the found correspondences of the optical flows estimated by the two face trackers. The red trajectories are the optical flow of the mesh points mapped from the upper camera view to the lower camera view using equation (14); the green trajectories show the found corresponding trajectories of the mesh points in the video of the lower camera. After the correspondence is established, the 3D face geometry at each time instant can be reconstructed. Fig. 3 shows the reconstructed mesh points in 3D space.

Fig. 2. Optical flow trajectories.   Fig. 3. Reconstructed points.
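Once a correspondence is established, each 3D point can be recovered by standard linear (DLT) triangulation from the two calibrated views. The sketch below is a generic illustration of that step, not the paper's exact implementation; the normalized projection matrices and the test point are synthetic:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two calibrated views.

    P1, P2: 3x4 projection matrices; x1, x2: (u, v) image points.
    """
    # each view contributes two linear constraints on the homogeneous point
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                    # null vector of A
    return X[:3] / X[3]           # de-homogenize

# synthetic check: project a known point with two cameras, then recover it
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])              # reference camera
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0], [0]])])  # translated camera
X_true = np.array([0.3, -0.2, 4.0])
x1 = P1 @ np.append(X_true, 1.0); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1.0); x2 = x2[:2] / x2[2]
print(np.allclose(triangulate(P1, P2, x1, x2), X_true))
```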
In order to verify our method on non-rigid motion, we further selected a stereo video sequence in which the subject opens the mouth within 8 frames. As shown in Fig. 4, the distinct facial features (depicted in red) are the eye corners, mouth corners, nostrils, and the centers of the upper and lower lips. As the non-rigid motion only contains the opening mouth, we take K = 6 in this case. The found correspondences of the optical flow trajectories are shown in Fig. 5; most of them are caused by the opening mouth. The reconstructed 3D face geometry is shown in Fig. 6, where the purple dots are the reconstructed 3D points.

Fig. 4. Tracking results for non-rigid motion: (a) upper camera; (b) lower camera.

Fig. 5. Optical flow trajectories.   Fig. 6. Reconstructed face.
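In our experiments the number of bases K is fixed per sequence (K = 3 for the rigid case, K = 6 for the mouth opening). When K is unknown, one simple heuristic (our illustration here, not a procedure prescribed above) is to estimate K' from the numerical rank of the measurement matrix, since rank(W) <= 3K:

```python
import numpy as np

def estimate_num_bases(W, tol=1e-8):
    """Estimate K' from the rank of a measurement matrix (rank <= 3K).

    Counts singular values above tol * s_max; with noisy data, tol would
    be raised to a noise-dependent threshold.
    """
    s = np.linalg.svd(W, compute_uv=False)
    r = int(np.sum(s > tol * s[0]))
    return -(-r // 3)             # ceil(r / 3)

# sanity check on synthetic rank-3K data (K = 2)
rng = np.random.default_rng(4)
W = rng.standard_normal((40, 6)) @ rng.standard_normal((6, 50))
print(estimate_num_bases(W))
```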
4. CONCLUSIONS AND FUTURE WORK

We have presented a framework for recovering 3D non-rigid shape and motion viewed from calibrated stereo cameras. The approach is factorization-based, so it naturally has the property of rank constraints. Meanwhile, it provides a mechanism for inferring stereo correspondences from motion correspondences, requiring only that a minimum of 3K point stereo correspondences are created initially. The combination of motion and stereo cues offers advantages such as simpler stereo correspondence and accurate reconstruction even with short sequences. Experimental results on real stereo sequences demonstrate the performance of the proposed method. Future work will address how to detect outliers for robust factorization and how to realize 3D model-based tracking along with model refinement.

5. ACKNOWLEDGEMENT

We thank Dr. Zhengyou Zhang of Microsoft Research for allowing us to use the test stereo sequences.

6. REFERENCES
[1] M. E. Brand, "Morphable 3D models from video," IEEE CVPR'01, December 2001.
[2] R. L. Carceroni and K. N. Kutulakos, "Multi-view scene capture by surfel sampling: from video streams to non-rigid 3D motion, shape and reflectance," ICCV'01, June 2001.
[3] A. Del Bue and L. Agapito, "Non-rigid stereo factorization," IJCV, 66(2), pp. 193-207, 2006.
[4] F. Dornaika and R. Chung, "Stereo correspondence from motion correspondence," IEEE CVPR'99, pp. 70-75, 1999.
[5] P. K. Ho and R. Chung, "Stereo-motion that complements stereo and motion analysis," IEEE CVPR'97, pp. 213-218, 1997.
[6] Y. Huang and T. S. Huang, "Facial tracking with head pose estimation in stereo vision," IEEE ICIP'02, September 2002.
[7] M. Irani, "Multi-frame optical flow estimation using subspace constraints," IEEE ICCV'99, September 1999.
[8] M. Irani and P. Anandan, "Factorization with uncertainty," ECCV'00, June 2000.
[9] J. Neumann and Y. Aloimonos, "Spatio-temporal stereo using multi-resolution subdivision surfaces," IJCV, 47(1), pp. 181-193, 2002.
[10] G. Stein and A. Shashua, "Direct estimation of motion and extended scene structure for a moving stereo rig," IEEE CVPR'98, 1998.
[11] H. Tao and T. S. Huang, "Explanation-based facial motion tracking using a piecewise Bezier volume deformation model," IEEE CVPR'99, 1999.
[12] C. Tomasi and T. Kanade, "Shape and motion from image streams under orthography: a factorization method," IJCV, 9(2), pp. 137-154, 1992.
[13] L. Torresani, D. Yang, G. Alexander, and C. Bregler, "Tracking and modelling non-rigid objects with rank constraints," IEEE CVPR'01, 2001.
[14] B. Triggs, "Factorization methods for projective structure and motion," IEEE CVPR'96, pp. 845-851, 1996.
[15] J. Xiao, J. Chai, and T. Kanade, "A closed-form solution to non-rigid shape and motion recovery," ECCV'04, 2004.