**** **** ************* ********* ** Multimedia
On the Applicability of Speaker Diarization to Audio Concept Detection for
Multimedia Retrieval
Robert Mertens
International Computer Science Institute
1947 Center Street, Suite 600
Berkeley, CA 94704, USA
********@****.********.***

Po-Sen Huang
Beckman Institute, ECE Department
University of Illinois at Urbana-Champaign
Urbana, IL 61801, USA
********@********.***

Luke Gottlieb
International Computer Science Institute
1947 Center Street, Suite 600
Berkeley, CA 94704, USA
****@****.********.***

Gerald Friedland
International Computer Science Institute
1947 Center Street, Suite 600
Berkeley, CA 94704, USA
*******@****.********.***

Ajay Divakaran
SRI International Sarnoff
201 Washington Road
Princeton, NJ 08540, USA
****.*********@***.***
Abstract: Recently, audio concepts emerged as a useful building block in multimodal video retrieval systems. Information like "this file contains laughter", "this file contains engine sounds" or "this file contains slow music" can significantly improve purely visual based retrieval. The weak point of current approaches to audio concept detection is that they rely heavily on human annotators. In most approaches, audio material is manually inspected to identify relevant concepts. Then instances that contain examples of relevant concepts are selected, again manually, and used to train concept detectors. This approach comes with two major disadvantages: (1) it leads to rather abstract audio concepts that hardly cover the audio domain at hand and (2) the way human annotators identify audio concepts likely differs from the way a computer algorithm clusters audio data, introducing additional noise in training data. This paper explores whether unsupervised audio segmentation systems can be used to identify useful audio concepts by analyzing training data automatically and whether these audio concepts can be used for multimedia document classification and retrieval. A modified version of the ICSI (International Computer Science Institute) speaker diarization system finds segments in an audio track that have similar perceptual properties and groups these segments. This article provides an in-depth analysis of the statistical properties of similar acoustic segments identified by the diarization system in a predefined document set and the theoretical fitness of this approach to discern one document class from another.

Keywords: Audio Clustering, Audio Indexing, Speaker Diarization, Video Indexing

978-0-7695-4589-9/11 $26.00 © 2011 IEEE, DOI 10.1109/ISM.2011.79

I. INTRODUCTION

Multimedia retrieval is becoming more and more important for a number of reasons. The amount of multimedia data posted by end users on the web is increasing on a daily basis. Surveillance data is gathered with unprecedented coverage and archives of professionally created entertainment media or documentaries are growing steadily. All of these media objects are, however, of little value if users cannot find and retrieve them. At this point structuring and indexing of this vast amount of multimedia data makes the difference. In order to tackle the challenge of indexing multimedia data, a multitude of approaches have been devised in the past (see [1] or [2] for an overview). A video usually contains an audio and a visual stream; however, many approaches for video analysis focus only on the visual part of a video. Audio has recently begun to play a role in multimodal media analysis and can be leveraged to complement results from visual analysis to increase the effectiveness of multimedia retrieval or detection approaches. Audio information can be used to these ends in two fundamentally different ways. Speech recognition has been employed for video analysis since the late 1990s [3]. The second method for using audio analysis for video analysis is the detection of sound concepts that describe a video's content. The presence of human defined lower level acoustic concepts such as "indoor sound" or "people laughing" conveys valuable information as to a video's content, and such sound concepts can be automatically detected once a system is trained to recognize them [4][5]. The use of low level acoustic concepts does, however, usually involve manual concept definition. The two downsides of manual concept definition are that it usually leads to rather abstract concepts and that it introduces a human bias, as human annotators are likely to identify these concepts based on different properties of sound than a computer algorithm would. This paper explores the applicability of a speaker diarization engine to the definition and extraction of low level acoustic concepts from domain specific training data. The speaker diarization engine clusters segments of an audio stream that exhibit similar properties. It thus extracts acoustic concepts as they are defined by a computer algorithm, namely the speaker diarization engine. To explore whether this assumption holds, we have
generated and examined diarization data from the TRECVID MED 2011 data set. This data set has been released by NIST as training data for the TRECVID MED 2011 concept detection challenge. It contains randomly selected videos that are examples for fifteen different categories of high level concepts such as "wedding ceremony" and "woodworking project" and thus represents a useful data set for the analysis presented in this paper. The data set not only delivers a wealth of low level features that can be detected by the diarization approach, it also separates the data into higher level classes so that we can explore whether higher level classes can be predicted by the absence or presence of certain low level features. The analysis of the distribution of speaker segments found in the videos from different categories shows that speaker segments are not randomly distributed but can be used to predict whether a video belongs to a certain event class or not. It thus indicates that speaker segmentation generates low level audio concepts that can be used for higher level machine learning based classification. The remainder of this paper is organized as follows: Section 2 briefly explains the ICSI speaker diarization system. The contents of the NIST TRECVID 2011 MED data set are discussed in Section 3. Section 4 presents the methodology of the experiment. An overview of significant low level features found in the experimental results as well as a discussion of the distribution of these results is given in Section 5. Section 6 discusses the relevance of the findings of this paper and touches upon the perspectives opened for future research.

II. ICSI SPEAKER DIARIZATION SYSTEM

For detecting sound concepts in each individual video we used a system based on the ICSI speaker diarization system [7] in a faster-than-realtime version [9]. The actual diarization process consists of a pre-processing phase and a segmentation and clustering phase, as shown in figure 1. In the preprocessing phase audio features, in this case Mel-Frequency Cepstral Coefficients (MFCCs), are extracted from the video soundtrack. We use a frame period of 10 ms with an analysis window of 30 ms in the feature extraction. In its original application context the speaker diarization system also employs speech/non-speech segmentation to exclude non-speech in later processing steps. In order to use all audio information, we have omitted the exclusion of non-speech segments. In the segmentation and clustering stage of speaker diarization, an initial segmentation is generated by uniformly partitioning the audio track into K segments of the same length. K is chosen to be much larger than the assumed number of speakers in the audio track. For meeting recordings of about 30 minutes in length, previous work [6] experimentally determined K = 16 to be a good value. For the audio tracks used in the TRECVID MED 2011 data set, we have determined K = 64 to be a suitable value. The main reason for the higher K value is that the number of significant audio concepts in a video is much higher than the average number of speakers in a meeting video. The procedure for diarization is shown in Figure 1 and takes the following steps; a more detailed description can be found in [7]:

1) Initialization: Train a set of Gaussian Mixture Models (GMMs), one for each initial cluster.

2) Re-segmentation: Re-segment the audio track using the current GMMs using majority vote on the likelihoods of a specified minimum duration [8]. For audio concept detection, we have set this minimum duration to 200 milliseconds in order to capture sounds of smaller duration. For speaker segmentation higher values are used. A typical minimum for speaker segmentation would be 2500 milliseconds.

3) Re-training: Retrain the GMMs on the current segmentation using the expectation-maximization (EM) algorithm [8].

4) Agglomeration: Select the closest pair of clusters and merge them. At each iteration, the algorithm checks all possible pairs of clusters to see if there is an improvement in Bayesian Information Criterion (BIC) scores by merging each pair and re-training it on the combined audio segments. The clusters from the pair with the largest improvement in BIC scores are merged and the new GMM is used. The algorithm then repeats from the re-segmentation step until there are no remaining pairs that will lead to an improved BIC score.

The result of the algorithm consists of a segmentation of the audio track with n clusters and with one GMM for each cluster, where n is assumed to be the number of speakers, which in our case are audio concepts.

[Figure 1. ICSI Speaker Diarization System. Flowchart stages: Initialization, (Re-)Training, (Re-)Alignment, decision "Merge two Clusters?" (Yes/No), End.]

III. TRECVID MED 2011 DEV-T DATA SET

The TRECVid 2011 MED dataset is different from the original TRECVid dataset. The MED dataset is comprised of "found" videos, i.e. consumer-produced videos downloaded from various social networking sites. Most videos are very short (a couple of minutes) and not produced professionally. The query sets (so-called event kits) are comprised of fifteen categories with only five of those categories available in the testing set. The event kits consist of a total of 2040 videos and the test set of a total of 4251 videos. The five event categories which are available in the test set are attempting
a board trick, feeding an animal, landing a fish, wedding ceremony, and working on a woodworking project; the remainder of the videos in the test set are random videos not belonging to any of the event categories. The number of videos in each category for train and test is available in Table I. The contents of these videos are highly variant; for example, the concept "attempting a board trick" includes people skateboarding, snowboarding and surfing, while the wedding ceremony varies from a traditional Catholic mass, to a Hindu ceremony, to home-made music videos. The analysis presented in this paper is limited to the event kits which are in the training set of the TRECVID MED 2011 data set. Annotators' analyses of the testing set have revealed a huge number of different sound categories, some of which are event specific, like different tool sounds in woodworking or engine sounds, as well as music and different kinds of speech in the videos.

Table I
NUMBER OF VIDEOS FOR TRAIN AND TEST

Category  Description        Train Data  Test Data
E001      Board Tricks       160         111
E002      Feeding Animal     160         111
E003      Landing Fish       122         86
E004      Wedding            128         88
E005      Woodworking        142         100
E006      Birthday Party     173         0
E007      Changing Tire      110         0
E008      Flash Mob          173         0
E009      Vehicle Unstuck    131         0
E010      Grooming animal    136         0
E011      Make a Sandwich    111         0
E012      Parade             134         0
E013      Parkour            108         0
E014      Repair Appliance   123         0
E015      Sewing             116         0
Rest      Random Other       N/A         3755

IV. METHODOLOGY

To explore the applicability of speaker diarization to audio concept detection, we applied the ICSI speaker diarization system to the TRECVID MED 2011 data set and analyzed the results produced by speaker segmentation. The basic idea behind this approach is that speaker diarization clusters those segments of an audio stream that exhibit similar acoustic properties into a speaker model. When preprocessing filters such as speech/non-speech detection are removed from the system, a speaker model does not necessarily represent a speaker, but a low level audio concept. Our motivation was that event specific sounds like a power drill in a video about woodworking, an engine sound in a tire change scenario or a clapping sound in a wedding video could be found. Speaker models (which in our case are used as low level audio concepts) are represented by the diarization system as Gaussian Mixture Models. A Gaussian Mixture Model (GMM) is a number of Gaussian distributions that describe each feature in the speaker model, as shown in equation (1).

$$ p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i) \tag{1} $$

where $x$ is a $D$-dimensional random vector, $\mathcal{N}(x \mid \mu_i, \Sigma_i)$, $i = 1, \ldots, M$, are the component densities and $w_i$, $i = 1, \ldots, M$, are the mixture weights. Each component density is a $D$-variate Gaussian function with mean vector $\mu_i$ and covariance matrix $\Sigma_i$ (we use diagonal covariance matrices here). The mixture weights are constrained by $\sum_{i=1}^{M} w_i = 1$.

A single feature is represented by a number of Gaussians that are weighted according to how they influence the overall model. The Gaussian distributions themselves are represented by their mean value and their variance.

In order to match low level audio concepts across training videos and to also be able to classify low level feature models found in testing videos, we have simplified the Gaussian mixture model per speaker to a single vector that consists of the sums of the weighted means and the sums of the weighted variances of each Gaussian. In the remainder of this paper, we will call this vector a simplified supervector, as shown in equation (2).

$$ \phi(x) = \Big[ \sum_{i=1}^{M} w_i \mu_i \; ; \; \sum_{i=1}^{M} w_i \Sigma_i \Big] \tag{2} $$

We then clustered the simplified supervectors from all low level acoustic concepts from all video files with a Kmeans approach. The resulting clusters represent abstractions of the simplified supervector for all acoustic low level concepts and can be mapped back to the acoustic low level concepts (speaker models) in each video file by calculating the distance between the abstract simplified supervectors and the individual video's speaker models. By re-mapping the abstract simplified supervectors to the individual speaker models, one can count the overall occurrences of an abstract acoustic low level concept in all videos. We have also counted the number of occurrences of each abstract acoustic low level feature in all videos belonging to each event set. These numbers allow us to compute the normalized frequency of the occurrence of a specific acoustic low level concept sound per event as shown in equation (3).

$$ \mathrm{EEH}(c_i, E) = \frac{\sum_{k} \sum_{j} n_j \, P(c_i = c_j \mid c_j \in D_k, D_k \in E)}{\sum_{k} \sum_{j} n_j \, P(c_i = c_j \mid c_j \in D_k)} \tag{3} $$

where EEH represents the expected event histogram and $n_j$ is the occurrence number of $c_j$ in audio clip $D_k$. $P(c_i = c_j \mid c_j \in D_k, D_k \in E)$ is the probability that audio term $c_i$ equals $c_j$ given that $c_j$ is in the audio clip $D_k$ in the event $E$.
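Equations (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes diagonal-covariance GMMs given as plain lists, squared Euclidean distance for mapping a supervector to a cluster centre (the text only speaks of "the distance"), and hard cluster assignments, so that each probability in equation (3) is either 0 or 1. All function names and the toy data are hypothetical.

```python
from collections import Counter


def simplified_supervector(weights, means, variances):
    """Collapse one diagonal-covariance GMM into a single vector.

    Returns the weighted sum of the component means followed by the
    weighted sum of the component variances (2 * D values in total).
    """
    dims = len(means[0])
    mean_part = [sum(w * m[d] for w, m in zip(weights, means)) for d in range(dims)]
    var_part = [sum(w * v[d] for w, v in zip(weights, variances)) for d in range(dims)]
    return mean_part + var_part


def nearest_centroid(vector, centroids):
    """Index of the closest centroid (squared Euclidean distance, an assumption)."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def expected_event_histogram(clips, concept, event):
    """Equation (3) under hard assignments.

    `clips` is a list of (event_label, Counter) pairs; each Counter maps an
    abstract concept id to its occurrence count n_j in that clip.
    """
    in_event = sum(counts[concept] for label, counts in clips if label == event)
    overall = sum(counts[concept] for _, counts in clips)
    return in_event / overall if overall else 0.0


# Hypothetical toy data, not taken from the experiments.
sv = simplified_supervector(
    weights=[0.25, 0.75],
    means=[[0.0, 4.0], [4.0, 0.0]],
    variances=[[1.0, 1.0], [2.0, 4.0]],
)
concept_id = nearest_centroid(sv, centroids=[[0.0, 0.0, 1.0, 1.0], [3.0, 1.0, 2.0, 3.0]])

toy_clips = [
    ("E005", Counter({159: 3, 66: 1})),
    ("E005", Counter({159: 2})),
    ("E009", Counter({159: 2, 66: 4})),
    ("E015", Counter({66: 1})),
]
print(sv, concept_id)
print(expected_event_histogram(toy_clips, 159, "E005"))
```

Under hard assignments, the normalized frequency reduces to the share of a concept's occurrences that fall into clips of the given event.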
Likewise, $P(c_i = c_j \mid c_j \in D_k)$ is the probability that audio term $c_i$ equals $c_j$ given that $c_j$ is from audio clip $D_k$.

The higher the normalized frequency of a sound belonging to an abstract acoustic low level concept a in an event b is, the higher is the predictive power of this acoustic low level concept. To give an extreme example: if a power drill sound could be correctly identified by the system, it would be likely that all occurrences of that sound appear in videos belonging to the category woodworking, resulting in a normalized frequency of 100%. In other words, whenever a power drill sound occurs, we have a woodworking video. In order to analyze the applicability of speaker diarization, we have extracted those low level acoustic concepts that are the most predictive per event and examined these more closely, as discussed in the results section.

The whole workflow is illustrated in figure 2 and consists of two parts: the ICSI speaker diarization system and Kmeans clustering. As shown in the top region of the figure, suppose we have three audio clips: 1, 2, and 3. By applying the ICSI speaker diarization system, each audio clip is segmented into separate chunks. If the chunks are assumed by the system to exhibit similar acoustic properties, then the chunks will be considered to belong to the same speaker. For example, in audio clip 1, there are speakers A, B, and C. Note that each speaker in each audio clip is described by a Gaussian Mixture Model.

Finally, given different speakers from the ICSI speaker diarization system, we use the simplified supervector obtained from weighted mean and variance to represent a speaker. Then, we apply Kmeans clustering to cluster similar speakers among all audio clips. For example, speaker A in clip 1 and speaker D in clip 2 (see figure 2) are clustered together as cluster A.

[Figure 2. Workflow for speaker diarization and clustering. Three audio clips (1, 2, 3) are segmented by the ICSI speaker diarization system into per-clip speakers (A to G); Kmeans clustering then groups similar speakers across clips.]

V. RESULTS FROM DATA ANALYSIS

The analysis of the distribution of the acoustic low level concepts has revealed a number of acoustic low level concepts that have high predictive power for the abstract concepts (wedding, woodworking, etc.) given in the training set. For five abstract concepts, we found acoustic concepts with a normalized frequency of 50% or greater compared to a chance rate of 1/15, which is 6.7%: 100% for E001, 71.1% for E005, 71.4% for E007, 64.28% for E011, and 50% for E004. That is, one of these events can be predicted correctly with a probability of 80% or greater just by the presence of an instance of the respective acoustic low level concept. The event that performed worst in this comparison is "Changing a tire", which can only be predicted with a 34% accuracy based on its dominant low level sound concept. These numbers show that the low level sound concepts generated by speaker diarization are a good discriminator for the higher level concepts of the 15 events given in the NIST TRECVID 2011 data set. Table II shows the normalized frequencies of the top five most predictive acoustic low level concepts for all events in the event training kit. The normalized frequencies are based on Kmeans clustering with k=200. The average normalized frequency for the top sound concept for all events was 46.6%. Clustering with k=100 produced an average normalized frequency of 39.8% for the top sound concept. Clustering with k=1000 produced an average normalized frequency in the 1% range, likely due to overfitting to specific low level sound concepts from individual video clips.

While some sounds are a very good indicator of a specific event in the training set, their overall frequency of occurrence also has to be considered when using a specific sound concept for event detection. Sound 153, for instance, occurs only once in the whole training set. It is a fast gurgling water sound from a video about surfing. Sound 159, the top sound concept in E005, occurs in 21 videos, 12 of which are in E005. Also, multiple speaker models from these videos are matched to that sound concept, which is why the predictive value is 71.4%. We have manually inspected some of the instances of this sound in E005 and all videos from the other event categories in which this sound was found. Most instances of this sound concept in E005 are engine sounds of a moderate volume. The sound is also found in E001, E009, E014, and E015. In one instance in E001 it is a harp sound in spherical music, in the other one a compressor sound which is acoustically similar to a moderate engine sound. In E009 it occurs in two videos, in both as an engine sound of racing cars in the distance. In E014 it occurs in only one video, where it is similar to a moderate volume sound of a vacuum cleaner in the background. In E015 it occurs in 4 videos. In 3 of these, it is the sound of a sewing machine and it is acoustically similar to some of the sounds of E005. In the other video
Table II
TOP FIVE SOUND CONCEPTS ACCORDING TO NORMALIZED FREQUENCY

Event  Rank 1            Rank 2            Rank 3            Rank 4            Rank 5
E001   100% (ID 153)     25.8% (ID 117)    25% (ID 130)      19% (ID 189)      18.9% (ID 10)
E002   16.9% (ID 85)     14.8% (ID 108)    14.7% (ID 129)    14.6% (ID 90)     14.2% (ID 184)
E003   40.00% (ID 188)   31.25% (ID 13)    25.58% (ID 18)    22.32% (ID 58)    16.26% (ID 133)
E004   50.00% (ID 47)    41.66% (ID 48)    35.86% (ID 175)   33.33% (ID 94)    30.95% (ID 149)
E005   71.7% (ID 161)    58.3% (ID 88)     25% (ID 130)      24.7% (ID 66)     19.9% (ID 54)
E006   40.0% (ID 188)    14.2% (ID 52)     13.6% (ID 103)    13.3% (ID 4)      13% (ID 71)
E007   71.4% (ID 62)     43.5% (ID 51)     42.8% (ID 199)    41.9% (ID 139)    41.6% (ID 186)
E008   37.4% (ID 178)    29.8% (ID 110)    28.1% (ID 66)     25.0% (ID 130)    21.9% (ID 79)
E009   37.41% (ID 178)   29.81% (ID 110)   28.08% (ID 66)    25.00% (ID 130)   21.91% (ID 79)
E010   25.00% (ID 13)    21.42% (ID 169)   15.00% (ID 157)   14.91% (ID 65)    14.77% (ID 82)
E011   64.28% (ID 169)   36.36% (ID 103)   33.33% (ID 94)    32.98% (ID 70)    23.94% (ID 40)
E012   40.00% (ID 22)    40.00% (ID 156)   34.33% (ID 186)   33.33% (ID 135)   30.00% (ID 176)
E013   23.68% (ID 76)    15.11% (ID 43)    14.24% (ID 2)     13.07% (ID 20)    12.67% (ID 17)
E014   31.42% (ID 177)   30.43% (ID 84)    27.84% (ID 106)   27.08% (ID 98)    24.61% (ID 152)
E015   41.66% (ID 48)    33.33% (ID 94)    19.71% (ID 40)    14.77% (ID 37)    14.28% (ID 62)
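The summary figures that relate to Table II can be checked with a little arithmetic. The following Python sketch assumes the Rank 1 values exactly as printed in the table; the variable names are illustrative only. The mean it reports is computed from the printed column alone and need not coincide exactly with the average quoted in the text.

```python
# Rank 1 normalized frequencies from Table II as printed (percent).
top_frequency = {
    "E001": 100.0, "E002": 16.9, "E003": 40.00, "E004": 50.00, "E005": 71.7,
    "E006": 40.0, "E007": 71.4, "E008": 37.4, "E009": 37.41, "E010": 25.00,
    "E011": 64.28, "E012": 40.00, "E013": 23.68, "E014": 31.42, "E015": 41.66,
}

# Chance rate for 15 equally likely events, expressed in percent.
chance_rate = 100.0 / len(top_frequency)

# Events whose most predictive concept reaches at least 50%.
at_least_half = sorted(e for e, f in top_frequency.items() if f >= 50.0)

# Events whose most predictive concept beats the chance rate.
above_chance = [e for e, f in top_frequency.items() if f > chance_rate]

# Mean of the Rank 1 column.
mean_top = sum(top_frequency.values()) / len(top_frequency)

print(f"chance rate: {chance_rate:.1f}%")
print("top concept >= 50%:", at_least_half)
print("events above chance:", len(above_chance))
print(f"mean of top concepts: {mean_top:.1f}%")
```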
from E015 where it occurs, it is a high frequency sound of rotating equipment. Another observation we made is that there are a number of sound concepts for each specific event class that are not present in any video from that event class. These numbers are E001: 35, E002: 29, E003: 27, E004: 7, E005: 20, E006: 11, E007: 32, E008: 11, E009: 28, E010: 14, E011: 15, E012: 27, E013: 23, E014: 66 and E015: 22. Both observations indicate that the presence or absence as well as the number of occurrences of certain low level sound concepts in a video is a predictor for the concept class of that video. Figure 3 shows the distribution of low level sound concepts in the test set according to their frequency of occurrence for clustering with k=100, k=200, k=300 and k=1000. The figures suggest that with higher values for k in the clustering, the distribution comes closer to a Zipfian distribution. Zipfian distributions are sometimes connected to the applicability of TF-IDF measures, even though this is controversial from a theoretical perspective [10].

[Figure 3. Distribution Frequency for Kmeans K=100, 200, 300, 1000. Panels: (a) K=100, (b) K=200, (c) K=300, (d) K=1000.]

In addition to the analysis of the top performing sound concepts, we annotated and studied the distributions of seven sound sets, without prejudice towards how well they performed. In sound 5 we have a fairly poor discriminator, although by no means useless: it occurs in all of the event categories, with 41% of its occurrences being in E004, E005 and E008. Across the various event classes, it was observed to contain fairly similar sounds: that of guitar music and singing being most common, and where it is most highly discriminating. The cluster also contains machine noises, speech, and the sound of bacon cooking. Sound 8, on the other hand, only has one event with a greater than 10% chance of occurrence, E014 at 16%, and seems to be a cluster of sounds that include music with percussive elements. In E014 this translated to music with tools being used, but in other places it was bass heavy techno, or cars moving at high speed. With sound 76 we had a sound which to a human annotator seems relatively unremarkable, being mostly instrumental music; however, this sound was only detected in 25 of the videos, and not detected at all in six of the event classes.

VI. CONCLUSION

This more in-depth analysis of how our system has created the sound clusters is illuminating: the machine learning system is creating categories that, while clearly sensical for it and useful in classifying the events, are clearly not what a human annotator would create. This forces us to reevaluate the utility of our standard annotation approaches when preparing the data for this sort of system, since what the system is discovering and finding useful is quite different from what a human annotator might create; it seems unlikely that a human annotator would put guitar music in the same category as bacon cooking. This dichotomy will hopefully prove to be advantageous in the future, if we can develop a system that can combine the human understanding of sound meanings with the automatic segmentations which we have been using.

The distribution analysis of the clusters generated by the two processing steps, speaker diarization and clustering, has shown that the distribution of low level sound concepts found by speaker diarization differs between videos belonging to different classes. It is hence safe to assume that low
level sound concepts can be used for video classification. One advantage of using a representation of videos at the level of the distribution of low level sound concepts is that low level sound concepts deliver an abstract representation. Classification at this level hence requires only a small amount of data. One video can be described by a vector with 200 or 300 positions instead of a framewise representation of raw features. Preliminary machine learning experiments with this high level representation have confirmed our estimates. In the future we will design a video concept classification system based on the low level sound concepts found by speaker diarization.

VII. ACKNOWLEDGMENTS

Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20066. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsement, either expressed or implied, of IARPA, DOI/NBC, or the U.S. Government.

REFERENCES

[1] Michael S. Lew, Nicu Sebe, Chabane Djeraba, and Ramesh Jain, "Content-based multimedia information retrieval: State of the art and challenges," ACM Trans. Multimedia Comput. Commun. Appl., vol. 2, no. 1, pp. 1-19, 2006.

[2] Cees G. M. Snoek and Marcel Worring, "Concept-based video retrieval," Foundations and Trends in Information Retrieval, vol. 2, no. 4, pp. 215-322, 2009.

[3] H. D. Wactlar, T. Kanade, M. A. Smith, and S. M. Stevens, "Intelligent access to digital video: Informedia project," Computer, vol. 29, no. 5, pp. 46-52, 1996.

[4] Huan Li, Lei Bao, Zan Gao, Arnold Overwijk, Wei Liu, Long fei Zhang, Shoou-I Yu, Ming yu Chen, Florian Metze, and Alexander Hauptmann, "Informedia@trecvid 2010," in Notebook for NIST's TREC Video Retrieval Evaluation 2010, 2010.

[5] Y.-G. Jiang, X. Zeng, G. Ye, S. Bhattacharya, D. Ellis, M. Shah, and S.-F. Chang, "Columbia-ucf trecvid2010 multimedia event detection: Combining multiple modalities, contextual concepts, and temporal matching," in NIST TRECVID Workshop, 2010.

[6] David Imseng and Gerald Friedland, "Robust speaker diarization for short speech recordings," in Proceedings of the IEEE workshop on Automatic Speech Recognition and Understanding, 12 2009, pp. 432-437.

[7] Chuck Wooters and Marijn Huijbregts, Multimodal technologies for perception of humans, chapter "The ICSI RT07s Speaker Diarization System," pp. 509-519. Springer-Verlag, Berlin, Heidelberg, 2008.

[8] G. Friedland and O. Vinyals, "Live speaker identification in conversation," in Proceedings of the ACM International Conference on Multimedia, October 2008, pp. 1017-1018.

[9] Yan Huang, Oriol Vinyals, Gerald Friedland, Christian Müller, Nikki Mirghafori, and Chuck Wooters, "A fast-match approach for robust, faster than real-time speaker diarization," in ASRU, 2007.

[10] Stephen Robertson, "Understanding inverse document frequency: On theoretical arguments for idf," Journal of Documentation, vol. 60, 2004.