Appeared in the Proceedings of the International Workshop on Synthetic-Natural Hybrid Coding and Three
Dimensional Imaging (IWSNHC3DI'97), pp. 192-194, September 5-9, 1997, Rodos-Palace, Rhodes, Greece.
Coding of Facial Image Sequences by Model-Based Optical Flow
Malcolm Davis
Texas Instruments
8330 LBJ Freeway, MS 8374
Dallas, Texas 75243
*******.*****@**.***

Mihran Tuceryan
Indiana Univ. Purdue Univ. Indianapolis
Dept. of Computer and Information Science
Indianapolis, Indiana 46202-5132
********@**.*****.***
ABSTRACT

A model-based method for estimating the shape and motion of 3D objects appearing in a video is described. This technique is used for model-based video coding (video compression). The method is based on a new variant of optical flow and uses 3D computer graphics to represent and display an object. Though the algorithm is general, this work concentrates on videos depicting the human head and face because of its relevance to videotelephony and teleconferencing. Rigid body motion of the head and facial expressions (opening the mouth) are accommodated. Results obtained from videos of a moving person are described.

1. INTRODUCTION

The concept behind model-based video coding (video compression) is that models of 3D objects and their motion require less information to transmit than videos of those objects. As illustrated in Figure 1, this type of coder analyzes the video to obtain values for the parameters of these models and estimates of the 3D motion of modeled objects. These parameter values and motion estimates are transmitted and a video display of the modeled objects and their motion is synthesized using 3D computer graphics.

This work concentrates on model-based coding of heads and faces because such techniques are most relevant for teleconferencing and visual communication. Inspired by the work of Waters [2] and Tang [3], a model of the human head and face has been developed which uses an approximate model of facial musculature to animate facial expressions. A model-based formulation of optical flow, derived from the work of DeCarlo and Metaxas [4, 5], provides estimates of the motion of the head model.

1.1. Model-Based Motion Estimation

In 3D model-based video coding, the 3D motion of the object depicted in the video must be determined. In a few methods for recognition, tracking, or coding of facial image sequences, optical flow is mapped directly onto the parameters of a 3D model in order to determine the motion of the modeled object. Most methods are specific to a particular motion model (e.g., rigid motion) and an assumed object model (e.g., a triangular mesh). It is possible, however, to modify the optical flow implementation (which usually assumes that the motion is planar) so that virtually any motion and object model can be accommodated in a regular and automatic manner [4, 5].

Parametric Representations: First, consider a 3D object, such as a human face that appears in a video. This object can be represented by a 3D vector function,
$$\vec{s}(\vec{u}, \vec{q}_s) = \left( s_1(\vec{u}, \vec{q}_s),\; s_2(\vec{u}, \vec{q}_s),\; s_3(\vec{u}, \vec{q}_s) \right)^T,$$
which associates each value of the vector $\vec{u}$ with a point on the surface of the object. The column vector, $\vec{q}_s$, contains values that control the shape of the object. The function $\vec{s}(\vec{u}, \vec{q}_s)$ is called a parametric representation of the object and the elements of $\vec{u}$ are the domain of this representation.

Figure 1: A block diagram of a model-based video coding system (from [1]). [Block labels. Encoder: input image, analysis, image source model. Decoder: synthesis, output image, image source model. Link: analysis data.]

As an example, suppose that $\vec{s}(\vec{u}, \vec{q}_s)$ is a generic model of an average human face (instead of an ellipsoid). In this case, $\vec{q}_s$ might contain parameters which would (1) adapt the shape of this generic model to the shape of an individual's face and (2) deform the face appropriately for facial expressions such as smiles, frowns, and raising the eyebrows.
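To make the idea of a parametric representation concrete, the sketch below uses the simplest shape the text mentions, an ellipsoid. It is an illustration only, not the face model used in this work: the function name `ellipsoid_surface`, the latitude/longitude choice for the domain vector, and the use of the three semi-axis lengths as the shape parameters are all assumptions.

```python
import math

def ellipsoid_surface(u, q_s):
    """Return the 3D surface point s(u, q_s) of an ellipsoid.

    u   = (u1, u2): domain coordinates, here latitude and longitude in radians
                    (an assumed choice of domain).
    q_s = (a, b, c): shape parameters, here the three semi-axis lengths.
    """
    u1, u2 = u
    a, b, c = q_s
    s1 = a * math.cos(u1) * math.cos(u2)
    s2 = b * math.cos(u1) * math.sin(u2)
    s3 = c * math.sin(u1)
    return (s1, s2, s3)

# Changing q_s changes the shape; changing u moves along the surface.
point = ellipsoid_surface((0.0, 0.0), (1.0, 2.0, 3.0))
other = ellipsoid_surface((0.4, -1.1), (1.0, 2.0, 3.0))
```

A face model plays the same role as this function, except that its shape parameters would describe individual head geometry and expression rather than three axis lengths.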
Figure 2: A computer graphics representation of a face as a 3D triangular mesh drawn: (a) as a wireframe (each line is an edge of a triangle); (b) as solid shapes with shading added; (c) with texture mapping overlaid.

Coordinate Transformations: The motion of a 3D object can be represented by a coordinate transformation which changes over time. A coordinate transformation maps (or transforms) each 3D coordinate location to another coordinate location. Examples of coordinate transformations include rotation, translation, perspective projection, and deformations such as scaling, bending, and twisting. A coordinate transformation is represented as a vector function, $\vec{y} = \vec{r}(\vec{s}, \vec{q}_r)$, which transforms (maps) the coordinate location, $\vec{s}$, into a new location, $\vec{y}$. The transformation has parameter values (e.g., rotation angles), which comprise the elements of the vector, $\vec{q}_r$.
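As a small illustration of a coordinate transformation with parameters, the sketch below rotates a point about the vertical axis, translates it, and then applies a pinhole perspective projection. The parameter layout `(theta, tx, ty, tz)`, the function names, and the unit focal length are assumptions made for the example; the paper does not specify its parameterization.

```python
import math

def rigid_transform(s, q_r):
    """y = r(s, q_r): rotate s about the vertical (y) axis, then translate.

    q_r = (theta, tx, ty, tz) is an assumed parameter vector:
    one rotation angle in radians and a 3D translation.
    """
    theta, tx, ty, tz = q_r
    x, y, z = s
    c, sn = math.cos(theta), math.sin(theta)
    return (c * x + sn * z + tx, y + ty, -sn * x + c * z + tz)

def perspective_project(y, focal_length=1.0):
    """Project a camera-space point onto the image plane (pinhole model)."""
    X, Y, Z = y
    return (focal_length * X / Z, focal_length * Y / Z)

# A quarter turn moves the point from the x axis onto the negative z axis,
# and the translation then places it 4 units in front of the camera.
y_cam = rigid_transform((1.0, 0.0, 0.0), (math.pi / 2, 0.0, 0.0, 5.0))
x_img = perspective_project(y_cam)
```

Composing such functions gives the mapping from object coordinates to image coordinates whose Jacobian appears in the model-based optical flow constraint.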
Model-Based Optical Flow: The well-known planar formulation of optical flow is based on the gradient constraint equation:

$$\nabla I(\vec{x}, t)^T \, \dot{\vec{x}} + I_t(\vec{x}, t) = 0$$

where $\nabla I(\vec{x}, t)$ is the gradient of $I(\vec{x}, t)$ with respect to the image coordinates $\vec{x}$ and $I_t(\vec{x}, t) = \partial I(\vec{x}, t) / \partial t$. This equation can be extended to encompass arbitrary motions, with the result [4]:

$$\nabla I(\vec{x}, t)^T \, L(\vec{u}, \vec{q}) \, \dot{\vec{q}} + I_t(\vec{x}, t) = 0 \qquad (1)$$

where $L(\vec{u}, \vec{q})$ is the Jacobian matrix of the coordinate transformation from object coordinates to camera coordinates, including animation or deformation of the object, with respect to the parameters of the transformation and the model, $\vec{q} = (\vec{q}_r^T, \vec{q}_s^T)^T$. This matrix, $L(\vec{u}, \vec{q})$, is used to transform partial derivatives of $\vec{q}$ into partial derivatives of $\vec{x}$: $\dot{\vec{x}} = L(\vec{u}, \vec{q}) \, \dot{\vec{q}}$.

Equation (1) represents the fundamental principle for estimating 3D motion from optical flow. As in the typical application of optical flow, values for the spatial and temporal derivatives of the image, $\nabla I(\vec{x}, t)$ and $I_t(\vec{x}, t)$, are obtained using derivative filters (with Gaussian kernels). It is assumed that the Jacobian matrix and the transformation are known in advance, e.g., the object is a face and the transformation is a combination of rotation, translation, facial expressions, and perspective projection. Only $\dot{\vec{q}}$ remains undetermined.

Least squares is used to solve for $\dot{\vec{q}}$ from the spatial and temporal derivatives of the image at a set of points, $\vec{x}_i$, $i = 1, 2, 3, \ldots, N$. The resulting value for $\dot{\vec{q}}$ is the 3D estimate of the motion occurring in the video at time $t$, expressed as the rate of change in the object position parameters, $\dot{\vec{q}}$. The object (e.g., a face) position parameters, $\vec{q}$, at time $t$ can be determined by numerically integrating $\dot{\vec{q}}$. Presently, this algorithm makes use of the Euler method.

Figure 3: Customization of a generic face model to conform to a particular individual: (a) a set of facial features are detected; (b) the corresponding location of these features on the generic face model; (c) the generic face model is warped (deformed) to bring the two sets of features into approximate alignment.

The Face Model: The 3D computer graphics representation of the face is a 3D triangular mesh with texture mapping. The model is depicted in Figure 2. This 3D model of a typical human head and face is customized to fit the shape of the individual depicted in the video as illustrated in Figure 3. The jaw can rotate, i.e., the mouth can open and close. By a method analogous to that of Waters [2], major muscles of the face are approximated by a set of "actuators" that can contract and relax. These actuators are anchored to fixed locations (bone) at one end and, at the other end, are attached to vertices of the triangular mesh (skin) through a simulated flexible medium. By appropriate activation of groups of "muscles" the face can be made to smile, frown, raise an eyebrow, and so on.
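The least-squares solution of the model-based optical flow constraint, followed by Euler integration of the parameter rates, can be sketched as below. This is a minimal illustration on synthetic data and not the authors' implementation: the function names, the array shapes, and the randomly generated gradients and Jacobians are assumptions, and a real system would obtain them from image derivative filters and from the face model. Each image point contributes one row, the image gradient times the Jacobian, and one right-hand side entry, the negated temporal derivative.

```python
import numpy as np

def estimate_q_dot(grad_I, I_t, L):
    """Solve grad_I[i]^T L[i] q_dot + I_t[i] = 0 for q_dot in the least-squares sense.

    grad_I: (N, 2) spatial image gradients at N points
    I_t:    (N,)   temporal image derivatives at the same points
    L:      (N, 2, n) Jacobian of image position w.r.t. the n model parameters
    """
    A = np.einsum("ni,nij->nj", grad_I, L)       # row i is grad_I[i]^T L[i]
    q_dot, *_ = np.linalg.lstsq(A, -I_t, rcond=None)
    return q_dot

def euler_step(q, q_dot, dt):
    """Advance the parameters by one frame interval using the Euler method."""
    return q + dt * q_dot

# Synthetic check: build I_t from a known q_dot and recover it.
rng = np.random.default_rng(0)
N, n = 50, 6
grad_I = rng.normal(size=(N, 2))
L = rng.normal(size=(N, 2, n))
q_dot_true = np.array([0.1, -0.2, 0.05, 0.0, 0.3, -0.1])
I_t = -np.einsum("ni,nij,j->n", grad_I, L, q_dot_true)

q_dot_est = estimate_q_dot(grad_I, I_t, L)
q_next = euler_step(np.zeros(n), q_dot_est, dt=0.1)
```

With noise-free synthetic data the recovered rates match the true ones; with real image derivatives the system is overdetermined and noisy, which is why a least-squares fit over many points is used.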
The Initial Pose: The position (pose) of the face in the first frame of the video is needed to initialize the algorithm. This initial pose is determined by using an optimization algorithm (gradient descent) to minimize the mean squared error between feature locations on the actual face and the modeled face.
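The pose initialization can be illustrated with a deliberately simplified sketch: a 2D rigid pose (one rotation angle and a translation) fitted by gradient descent to minimize the mean squared distance between model features and observed features. The 2D setting, the fixed step size, the iteration count, and all names below are assumptions for illustration; the paper's pose is three-dimensional and its optimizer settings are not given.

```python
import math

def pose_error_and_gradient(pose, model_pts, observed_pts):
    """Mean squared feature error and its gradient w.r.t. (theta, tx, ty)."""
    theta, tx, ty = pose
    c, s = math.cos(theta), math.sin(theta)
    n = len(model_pts)
    err = g_theta = g_tx = g_ty = 0.0
    for (px, py), (ox, oy) in zip(model_pts, observed_pts):
        rx = c * px - s * py + tx - ox           # residual, x component
        ry = s * px + c * py + ty - oy           # residual, y component
        err += (rx * rx + ry * ry) / n
        # d(rotated point)/d(theta) = (-s*px - c*py, c*px - s*py)
        g_theta += 2.0 * (rx * (-s * px - c * py) + ry * (c * px - s * py)) / n
        g_tx += 2.0 * rx / n
        g_ty += 2.0 * ry / n
    return err, (g_theta, g_tx, g_ty)

def fit_pose(model_pts, observed_pts, step=0.1, iterations=500):
    """Plain gradient descent starting from the identity pose."""
    pose = (0.0, 0.0, 0.0)
    for _ in range(iterations):
        _, grad = pose_error_and_gradient(pose, model_pts, observed_pts)
        pose = tuple(p - step * g for p, g in zip(pose, grad))
    return pose

# Synthetic features: a square moved by a known pose.
model_pts = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]
true_pose = (0.3, 0.5, -0.2)
ct, st = math.cos(true_pose[0]), math.sin(true_pose[0])
observed_pts = [(ct * x - st * y + true_pose[1], st * x + ct * y + true_pose[2])
                for x, y in model_pts]

estimated_pose = fit_pose(model_pts, observed_pts)
final_error, _ = pose_error_and_gradient(estimated_pose, model_pts, observed_pts)
```

Gradient descent of this kind finds a local minimum, so in practice the starting pose needs to be reasonably close to the true one.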
Model-Based Coding: In model-based video coding, the values of the parameters defining the current face shape and orientation, $\dot{\vec{q}}$ or $\vec{q}$, are encoded and transmitted for each frame in the video. Other information, such as the customization of the shape of the head model for the individual depicted in the video, is transmitted only at the beginning of communication.
2. RESULTS

The motion estimation algorithm described in this paper has been applied to video sequences depicting the head and shoulders. The motion of the head appearing in one video is tracked and used to create a second video depicting the computer-generated face model as it follows the motions in the first video. An example of rigid motion estimation is illustrated in Figure 4. In Figure 4(a), 5 frames extracted from a 100 frame video of M.T. are shown. The same frames from the computer-generated video are displayed in Figure 4(b). The fact that the video is computer generated is more apparent in Figure 4(c), where texture mapping has been disabled. A unique feature of this type of video coding is the ability for the person to appear differently at the receiver (decoder) than he does at the transmitter (encoder). This feature is illustrated in Figure 4(d) where a graphics model of S.K.'s head moves in synchronization with the video of M.T.

Figure 4: Model-based motion estimation of M.T.: (a) frames extracted from a video of M.T.; (b) the same five frames from a video where the computer-generated head image follows the motion of M.T.'s head; (c) the video of (b) with the texture map removed; (d) a computer-generated video where S.K.'s head follows the motion of M.T.'s head.
The tracking of more complex motion is illustrated in Figure 5 which depicts several frames extracted from a video of M.D. as he turns his head and simultaneously opens his mouth.

Figure 5: Model-based motion estimation of M.D.: (a) frames extracted from a video of M.D.; (b) the same five frames from a video where the computer-generated head image follows the motion of M.D.'s head.

It has been indicated that about 68 parameters are needed to encode facial expressions and head motion. Using this value, the estimated transmission (baud) rate of the video sequences in Figure 4 and Figure 5 is about 6,800 bits/sec (10 bits/parameter × 68 parameters/frame × 10 frames/sec) without any encoding of the parameter values. Applying a coding scheme, like arithmetic coding, to the parameter values would significantly reduce even this low rate.
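The rate estimate above can be reproduced with a few lines of arithmetic. The sketch below also shows one way a parameter could be reduced to 10 bits, using a uniform quantizer; the quantizer and the parameter range of -1 to 1 are assumptions for illustration, since the paper states only the bit budget per parameter.

```python
def quantize(value, lo, hi, bits=10):
    """Map a value in [lo, hi] to an integer code using a uniform quantizer."""
    levels = (1 << bits) - 1
    clipped = min(max(value, lo), hi)
    return round((clipped - lo) / (hi - lo) * levels)

def dequantize(code, lo, hi, bits=10):
    """Recover an approximate value from its integer code."""
    levels = (1 << bits) - 1
    return lo + code / levels * (hi - lo)

def bit_rate(bits_per_parameter, parameters_per_frame, frames_per_second):
    """Raw bit rate with no entropy coding of the parameter values."""
    return bits_per_parameter * parameters_per_frame * frames_per_second

rate = bit_rate(10, 68, 10)            # figures taken from the text
code = quantize(0.3, -1.0, 1.0)        # hypothetical parameter value and range
restored = dequantize(code, -1.0, 1.0)
```

Here `rate` evaluates to 6800 bits per second, matching the estimate in the text, and the round-trip error of the quantizer is at most half a quantization step.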
Acknowledgement
The authors are grateful to Scott King for his
contributions to the development of the face
model and several handy software tools. Doug
DeCarlo's frank discussions of his research are
appreciated. Bruce Flinchbaugh proofread the
manuscript.
3. REFERENCES
[1] K. Aizawa and T. S. Huang, "Model-based image coding: Advanced video coding techniques for very low bit-rate applications," Proceedings of the IEEE, vol. 83, pp. 259-271, Feb. 1995.
[2] K. Waters, "A muscle model for animating three-dimensional facial expression," Computer Graphics, vol. 21, pp. 17-24, July 1987.
[3] L.-A. Tang, Human Face Modeling, Analysis, and Synthesis. PhD thesis, Electrical Engineering Department, University of Illinois at Urbana-Champaign, Urbana, Illinois, 1996.
[4] D. DeCarlo and D. Metaxas, "The integration of optical flow and deformable models with applications to human face shape and motion estimation," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (San Francisco, CA), pp. 231-237, IEEE Computer Society Press, June 18-20, 1996.
[5] D. Metaxas and D. DeCarlo, "Deformable model-based face shape and motion estimation," in Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, (Killington, VT), pp. 146-150, IEEE Computer Society Press, Oct. 14-16, 1996.