International Journal of Computer Vision **(*), **5 150, 2007
c 2007 Springer Science + Business Media, LLC. Manufactured in the United States.
DOI: 10.1007/s11263-007-0044-1
Image-Based Modeling by Joint Segmentation
LONG QUAN, JINGDONG WANG, PING TAN AND LU YUAN
The Department of Computer Science and Engineering, The Hong Kong University of Science and Technology
****@**.***.**
********@**.***.**
****@**.***.**
******@**.***.**
Received May 10, 2006; Accepted February 14, 2007
First online version published in March, 2007
Abstract. The paper rst traces the image-based modeling back to feature tracking and factorization that have been
developed in the group led by Kanade since the eighties. Both feature tracking and factorization have inspired and
motivated many important algorithms in structure from motion, 3D reconstruction and modeling. We then revisit the
recent quasi-dense approach to structure from motion. The key advantage of the quasi-dense approach is that it not only
delivers the structure from motion in a robust manner for practical modeling purposes, but also it provides a cloud of
suf ciently dense 3D points that allows the objects to be explicitly modeled. To structure the available 3D points and
registered 2D image information, we argue that a joint segmentation of both 3D and 2D is the fundamental stage for the
subsequent modeling. We nally propose a probabilistic framework for the joint segmentation. The optimal solution
to such a joint segmentation is still generally intractable, but approximate solutions are developed in this paper. These
methods are implemented and validated on real data set.
Keywords: structure from motion, image-based modeling, reconstruction, segmentation
The integral is over a window W around the point x. The
1. Introduction
transformation T may be taken to be a displacement in
its simplest form, x T (x) = x + d. Taylor expanding
Modeling by video taping has been advocated at CMU
the image term I (x + d), and differentiating E (d) with
by Kanade since the early eighties. The development has
respect to d for minima, we obtain Hd = e, where e is
been focused on two fundamental achievements by his
the residual error vector x W ( I I )g(weight)d x and g is
group, Moravec-Lucas-Tomasi-Kanade feature detection
the gradient of the image I . The H is the Hessian matrix,
and tracking, and Tomasi-Kanade factorization method
for reconstruction.
H= gT g(weight)d x.
x W
Lucas-Kanade tracking As done in Tomasi and Kanade (1991) and Shi and
Tomasi (1994), the feature point is de ned to be the pixel
at which we could have a reliable solution, i.e. H is not
The original idea of Lucas and Kanade (1981) is to track
or match an image patch I (x ) against a reference image singular at it.
I (x ) in two successive frames with unknown transforma- The Lucas-Kanade tracking equation has several ram-
tion T such that I (T (x)) = I (x). Due to the inherent i cations. First, it leads to one de nition of good feature
points (Tomasi and Kanade, 1991; Shi and Tomasi, 1994),
aperture problem for each individual pixel, it chooses a
of which H is well conditioned. This is per se the same de-
(weighted) integral error to minimize for the unknown
transformation T, tector of point of interest or corner in Harris and Stephens
(1988) that is motivated by improving Moravec s points
of interest (Moravec, 1981), which in turn is from the
E (T ) = I (T (x)) I (x)2 (weight) d x.
solution of discretization of the equation E (T ) with
x W
Quan et al.
136
d = {( 1 0, 1 0)} for self-matching I = I . It dated, but was related to the concept of af ne shape in-
troduced by Koenderink and Van Doorn in 1988. Koen-
is similar to that developed by F rstner (1994) as well.
o
derink s paper (Koenderink and van Doorn, 1989) has
All detectors of point of interest (F rstner, 1994; Harris
o
already been in circulation in 1988, and was published
and Stephens, 1988) only differ at computational imple-
in 1991 (Koenderink and van Doorn, 1989). The techni-
mentation of considering the eigensystem of H . Harris
and Stephen used det kTrace2 = 1 2 k ( 1 + 2 )2 . cal report of Tomas-Kanade factorization (Tomasi, 1991)
Tomasi and Kanade suggested min( 1, 2 ) >, and was available in 1991 and its journal version (Tomasi and
F rstner took aslo the scale into account and proposed Kanade, 1992) in 1992. This af ne shape was innovative,
o
2 = 2 . but limited to only af ne cameras. It is Faugeras paper
2
1
(Faugeras, 1992) that gave a de nite answer to what is
Second, it suggests that a dense correspondence
the parallel to an af ne camera for a projective camera.
between frames is ill-posed in nature as H is not
Hartley (1992) independently reached the same re-
well conditioned everywhere. I.e. there are pixels for
sults. Mohr et al. (1992, 1995) followed up by propos-
which H are close to singularity. The impossibility of
ing a numerical scheme, a kind of bundle adjustment
dense correspondence justi es the necessity of the in-
for projective reconstruction for multiple views, us-
troduction of the quasi-dense correspondence (Lhuil-
ing an explicit projective basis de ned by 5 space
lier and Quan, 2005) as a more achievable goal.
points. Interestingly and importantly, it was shown in
Thirdly, it implies that feature point based approach,
Sturm and Triggs (1996) and Triggs (1996) that the orig-
now the mainstream, is by its de nition the in nites-
inal Tomasi-Kanade af ne factorization for af ne cam-
imal expansion d for matching and self-matching only
eras could be extended to projective reconstruction for
good for close frames. This indeed re ects the actual
projective cameras, this is the projective factorization,
practice in the representative systems implemented in
that is the only alternative to a bundle-adjustment type
Pollefeys et al. (1998) and Nister (2001). Though a more
projective reconstruction. It has even an advantage of
general transformation than the translational displace-
straightforward initialization when an iterative scheme is
ment could be introduced into the development of match-
necessary. This projective tour for structure from mo-
ing and self-matching equations without theoretical dif -
tion is important as it tells us an important fact that
culties (Triggs, 2004), it leads often to inconclusive trade-
the matching problem for a rigid scene or object is
off between invariance and rareness of the descriptors
encoded by this uncalibrated projective geometry. And
(Triggs, 2004).
this projective structure could be ef ciently computed
from the feature points detected and matched in multiple
views.
Tomasi-Kanade factorization algorithm Second, it has been known that the second step of
the enforcement of metric constraints is capital, but
hard to succeed in some cases. This has been primarily
Related to the tracking, given the tracked points over the
viewed as a successive step of the entire reconstruction
sequence, but one step further to reconstruct these fea-
for af ne cameras. Now from an uncalibrated projective
ture points in 3D space. By rst considering a simpli ed
viewpoint, the enforcement of the metric constraints is
camera projection model that is approximated by an or-
thogonal type projection model (x, y, z ) (u, v ) in- equivalent to the autocalibration of the uncalibrated ap-
proach. Indeed, using the orthogonality constraints orig-
stead of a more general central projection model. Tomasi
inally proposed by Tomasi and Kanade gives an alterna-
(1991) proposed the factorization algorithm for recon-
struction. Given n points tracked over m views, and stack tive way of formulating the autocalibration both for af ne
all measurements (u, v ) over all views to form a big and projective cameras, which is usually formulated in
measurement matrix W, which has a rank constraint by terms of the absolute conic borrowed from the projective
its construction W = MS. Using SVD on M results in geometry.
W = U V = (U 1/2 )( 1/2 V ) = MS. The motion M Combining Lucas-Kanade-Tomasi feature detection
and tracking with the projective factorization is an al-
and shape S are de ned up to an arbitrary af ne transfor-
mation A as W = MS = (MA 1 )(AS), so the M and S ternative of the standard uncalibrated approach which
are de facto the af ne motion and af ne shape. Next, the is often implemented (Pollefeys et al., 1998; Hartley and
Zisserman, 2000; Faugeras et al., 2001) by rst extracting
metric constraints to guarantee that M is a valid rotation
the points of interest, Harris corners, then combining with
matrix are used to nalize the Euclidean motion M and
robust statistics such as RANSAC or LMS with a projec-
Euclidean shape S. Each of these two steps contains fun-
tive structure (Zhang et al., 1995), and nally optimizing
damental concepts that are related to the development of
the structure by a bundle-adjustment-like optimization
the uncalibrated approach.
(Mohr et al., 1995, 1992; Triggs et al., 2000) in projective
First, the importance of the pre-metric structure of the
shape, the S resulted from the SVD, though not eluci- space.
Image-Based Modeling by Joint Segmentation 137
Figure 1. One overview on the right of the reconstructed quasi-dense points for the entire scene from 25 images shown on the left.
Quasi-dense approach required. The main contribution of this paper is to intro-
duce a joint segmentation framework, for both 3D points
This sparse structure from motion approach usually re- and 2D pixels, and look for robust and ef cient solutions
quires a dense frame rate and leads to a too sparse set of to it in Section 3. Figure 2 shows one of the original 25
points to be suf cient for object modeling and depiction. images captured by a handheld camera for the example
This insuf ciency motivated the development the quasi- scenario, a rendered image of the nal modeling result
dense approach started since 1998 in Lhuillier (1998) and based on the desired 3D segmentation and 2D segmenta-
matured into the work in Lhuillier and Quan (2005). It tion results by a semi-automatic approach developed in
is robust and handles more distant views. More impor- Quan et al. (2006) using the joint segmentation method
tantly, it produces a set of semi-dense 3D points that was presented in this paper. The segmentation and modeling
impossible with the previous methods. The quasi-dense of such complex objects are almost impossible without
approach will be brie y revisited in Section 2. One exam- the joint segmentation.
ple is shown in Fig. 1. The quasi-dense approach can be
viewed as an extension to the discussed two key subjects,
2. Quasi-Dense Approach Revisited
feature point selection, re-sampled quasi-dense point vs.
interest point, and reconstruction, hierarchical division
The quasi-dense approach developed in Lhuillier (1998);
combined with a bundle-like method vs. batch solution
Lhuillier and Quan (2002, 2005) overcomes the sparse-
combined with a factorization.
ness of feature points and results in a more ef cient
and robust algorithm when combined with a bundle
Joint segmentation approach adjustment like algorithms both at projective and Eu-
clidean stages. The purpose is to view the quasi-dense
The increased density of the reconstructed 3D points from approach as an extension of the discussed two key sub-
multiple views, paves the way for the three-dimensional jects: feature point, quasi-dense point vs. interest point;
modeling of the objects in space, in addition to the re- and reconstruction, hierarchical division combined with a
covery of the camera positions. But the 3D points, even bundle-like method vs. batch solution with a factorization
semi-dense, are unstructured in space, therefore are not scheme.
yet suf cient for creating a geometric model of the un-
derlying objects. It is necessary to group the points and
pixels that belong to the same object into the same cluster 2.1. Quasi-Dense Point
of points and pixels. Obviously, the concept of object is
Initialization by sparse points of interest. We start by
subjective: for the example scenario shown in Fig. 2, the
whole plant might be considered as an object for some ap- detecting the points of interest in each image (Zhang et al.,
plications, whereas each individual leaf should be consid- 1995), then compute the Zero-Mean Normalized Cross-
ered as an independent object for some others, depending Correlation (ZNCC) to match points of interest in two
on the application and the realism details of the modeling images. First, we do the correlation from the rst image
Quan et al.
138
Figure 2. A proper segmentation is fundamental to 3D modeling. (a) One of the 25 original images captured by a handheld camera. (b) A rendered
image at a similar viewpoint of the reconstructed 3D model based on the segmentation results in (c) and (d). (c) A desired segmentation of 3D data
points from the reconstructed quasi-dense points, and (d) a desired segmentation of an input image.
to the second, and then do the same from the second to the First, the key is the de nition of the potential match
candidates N (x, x ) satisfying the disparity gradient
rst to retain a one-to-one consistent matching of points
of interest between two images. Most of the standard (Lhuillier and Quan, 2002; Pollard et al., 1985) as
approaches will continue with this set of feature points by
introducing global geometry constraint. The quasi-dense
N (a, a ) = {(b, b ), b N (a), b N (a ),
approach will not, it is merely initialized at this stage.
(b b) (a a)