
**** **** ************* ********* ** Multimedia

On the Applicability of Speaker Diarization to Audio Concept Detection for Multimedia Retrieval

Robert Mertens
International Computer Science Institute
1947 Center Street, Suite 600
Berkeley, CA 94704, USA
********@****.********.***

Po-Sen Huang
Beckman Institute, ECE Department
University of Illinois at Urbana-Champaign
Urbana, IL 61801, USA
********@********.***

Luke Gottlieb
International Computer Science Institute
1947 Center Street, Suite 600
Berkeley, CA 94704, USA
****@****.********.***

Gerald Friedland
International Computer Science Institute
1947 Center Street, Suite 600
Berkeley, CA 94704, USA
*******@****.********.***

Ajay Divakaran
SRI International Sarnoff
201 Washington Road
Princeton, NJ 08540, USA
****.*********@***.***

Abstract: Audio concepts have recently emerged as a useful building block in multimodal video retrieval systems. Information such as "this file contains laughter", "this file contains engine sounds", or "this file contains slow music" can significantly improve purely visual retrieval. The weak point of current approaches to audio concept detection is that they rely heavily on human annotators. In most approaches, audio material is manually inspected to identify relevant concepts. Instances that contain examples of relevant concepts are then selected, again manually, and used to train concept detectors. This approach has two major disadvantages: (1) it leads to rather abstract audio concepts that hardly cover the audio domain at hand, and (2) the way human annotators identify audio concepts likely differs from the way a computer algorithm clusters audio data, which introduces additional noise into the training data. This paper explores whether unsupervised audio segmentation systems can be used to identify useful audio concepts by analyzing training data automatically, and whether these audio concepts can be used for multimedia document classification and retrieval. A modified version of the ICSI (International Computer Science Institute) speaker diarization system finds segments in an audio track that have similar perceptual properties and groups these segments. This article provides an in-depth analysis of the statistical properties of similar acoustic segments identified by the diarization system in a predefined document set and of the theoretical fitness of this approach to discern one document class from another.

Keywords: Audio Clustering, Audio Indexing, Speaker Diarization, Video Indexing

978-0-7695-4589-9/11 $26.00 © 2011 IEEE. DOI 10.1109/ISM.2011.79

I. INTRODUCTION

Multimedia retrieval is becoming more and more important for a number of reasons. The amount of multimedia data posted by end users on the web is increasing daily. Surveillance data is gathered with unprecedented coverage, and archives of professionally created entertainment media and documentaries are growing steadily. All of these media objects are, however, of little value if users cannot find and retrieve them. At this point, structuring and indexing this vast amount of multimedia data makes the difference. To tackle the challenge of indexing multimedia data, a multitude of approaches have been devised (see [1] or [2] for an overview). A video usually contains an audio and a visual stream; however, many approaches to video analysis focus only on the visual part of a video. Audio has recently begun to play a role in multimodal media analysis and can be leveraged to complement results from visual analysis in order to increase the effectiveness of multimedia retrieval or detection approaches.

Audio information can be used to these ends in two fundamentally different ways. The first is speech recognition, which has been employed for video analysis since the late 1990s [3]. The second is the detection of sound concepts that describe a video's content. The presence of human-defined lower level acoustic concepts such as "indoor sound" or "people laughing" conveys valuable information about a video's content, and such sound concepts can be detected automatically once a system is trained to recognize them [4][5]. The use of low level acoustic concepts does, however, usually involve manual concept definition. The two downsides of manual concept definition are that it usually leads to rather abstract concepts and that it introduces a human bias, as human annotators are likely to identify these concepts based on different properties of sound than a computer algorithm would. This paper explores the applicability of a speaker diarization engine to the definition and extraction of low level acoustic concepts from domain-specific training data. The speaker diarization engine clusters segments of an audio stream that exhibit similar properties. It thus extracts acoustic concepts as they are defined by a computer algorithm, namely the speaker diarization engine. The remainder of this paper explores whether this assumption holds.

To this end, we generated and examined diarization data from the TRECVID MED 2011 data set. This data set was released by NIST as training data for the TRECVID MED 2011 concept detection challenge. It contains randomly selected videos that are examples of fifteen different categories of high level concepts such as "wedding ceremony" and "woodworking project" and thus represents a useful data set for the analysis presented in this paper. The data set not only delivers a wealth of low level features that can be detected by the diarization approach, it also separates the data into higher level classes, so that we can explore whether higher level classes can be predicted by the absence or presence of certain low level features. The analysis of the distribution of speaker segments found in the videos from different categories shows that speaker segments are not randomly distributed but can be used to predict whether a video belongs to a certain event class or not. It thus indicates that speaker segmentation generates low level audio concepts that can be used for higher level machine learning based classification.

The remainder of this paper is organized as follows: Section II briefly explains the ICSI speaker diarization system. The contents of the NIST TRECVID 2011 MED data set are discussed in Section III. Section IV presents the methodology of the experiment. An overview of significant low level features found in the experimental results, as well as a discussion of the distribution of these results, is given in Section V. Section VI discusses the relevance of the findings of this paper and touches upon the perspectives opened for future research.

II. ICSI SPEAKER DIARIZATION SYSTEM

For detecting sound concepts in each individual video, we used a system based on the ICSI speaker diarization system [7] in a faster-than-real-time version [9]. The actual diarization process consists of a preprocessing phase and a segmentation and clustering phase, as shown in Figure 1. In the preprocessing phase, audio features, in this case Mel-Frequency Cepstral Coefficients (MFCCs), are extracted from the video soundtrack. We use a frame period of 10 ms with an analysis window of 30 ms in the feature extraction. In its original application context, the speaker diarization system also employs speech/non-speech segmentation to exclude non-speech in later processing steps. In order to use all audio information, we have omitted the exclusion of non-speech segments.

[Figure 1. ICSI Speaker Diarization System. Flowchart boxes: Initialization, (Re-)Training, (Re-)Alignment, decision "Merge two clusters?" (Yes/No), End.]

In the segmentation and clustering stage of speaker diarization, an initial segmentation is generated by uniformly partitioning the audio track into K segments of the same length. K is chosen to be much larger than the assumed number of speakers in the audio track. For meeting recordings of about 30 minutes in length, previous work [6] experimentally determined K = 16 to be a good value. For the audio tracks used in the TRECVID MED 2011 data set, we have determined K = 64 to be a suitable value. The main reason for the higher K value is that the number of significant audio concepts in a video is much higher than the average number of speakers in a meeting video. The procedure for diarization is shown in Figure 1 and takes the following steps; a more detailed description can be found in [7]:

1) Initialization: Train a set of Gaussian Mixture Models (GMMs), one for each initial cluster.

2) Re-segmentation: Re-segment the audio track with the current GMMs, using a majority vote on the likelihoods over a specified minimum duration [8]. For audio concept detection, we have set this minimum duration to 200 milliseconds in order to capture sounds of shorter duration. For speaker segmentation, higher values are used. A typical minimum for speaker segmentation would be 2500 milliseconds.

3) Re-training: Retrain the GMMs on the current segmentation using the expectation-maximization (EM) algorithm [8].

4) Agglomeration: Select the closest pair of clusters and merge them. At each iteration, the algorithm checks all possible pairs of clusters to see whether merging the pair and re-training it on the combined audio segments improves the Bayesian Information Criterion (BIC) score. The clusters from the pair with the largest improvement in BIC score are merged and the new GMM is used. The algorithm then repeats from the re-segmentation step until there are no remaining pairs that will lead to an improved BIC score.

The result of the algorithm consists of a segmentation of the audio track with n clusters and one GMM for each cluster, where n is assumed to be the number of speakers, which in our case are audio concepts.

III. TRECVID MED 2011 DEV-T DATA SET

The TRECVID 2011 MED data set is different from the original TRECVID data set. The MED data set is comprised of found videos, i.e. consumer-produced videos downloaded from various social networking sites. Most videos are very short (a couple of minutes) and not produced professionally. The query sets (so-called event kits) are comprised of fifteen categories, with only five of those categories available in the testing set. The event kits consist of a total of 2040 videos and the test set of a total of 4251 videos.
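As a rough illustration of the agglomeration step (step 4) of Section II, the sketch below merges toy clusters as long as some pair improves a BIC-style score. It is a simplification under stated assumptions, not the ICSI implementation: each cluster is modelled by a single one-dimensional Gaussian instead of a GMM over MFCC frames, the score is the textbook delta-BIC with a parameter penalty, and the re-segmentation and re-training steps between merges are omitted. The function names and the toy data are hypothetical.

```python
import math
from itertools import combinations


def variance(xs):
    """Maximum-likelihood variance, floored to avoid log(0)."""
    mean = sum(xs) / len(xs)
    return max(sum((x - mean) ** 2 for x in xs) / len(xs), 1e-6)


def delta_bic(a, b, penalty_weight=1.0):
    """Delta-BIC of modelling a and b separately versus with one Gaussian.

    Negative values mean the single merged Gaussian is preferred.
    """
    n_a, n_b = len(a), len(b)
    n = n_a + n_b
    gain = 0.5 * (n * math.log(variance(a + b))
                  - n_a * math.log(variance(a))
                  - n_b * math.log(variance(b)))
    # A 1-D Gaussian has two free parameters (mean and variance).
    penalty = 0.5 * 2 * math.log(n)
    return gain - penalty_weight * penalty


def best_merge(clusters):
    """Return the index pair whose merge improves the score most, or None."""
    scored = [(delta_bic(clusters[i], clusters[j]), i, j)
              for i, j in combinations(range(len(clusters)), 2)]
    score, i, j = min(scored)
    return (i, j) if score < 0 else None


def agglomerate(clusters):
    """Merge clusters greedily until no pair improves the score."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        pair = best_merge(clusters)
        if pair is None:
            break
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy data (assumed): two clusters around 0 and one around 10.
toy_clusters = [
    [0.0, 0.1, -0.1, 0.2, -0.2, 0.05],
    [0.1, 0.0, -0.15, 0.15, -0.05, 0.02],
    [10.0, 10.1, 9.9, 10.2, 9.8, 10.05],
]
merged = agglomerate(toy_clusters)
print(len(merged), [len(c) for c in merged])
```

In this toy run the two clusters around 0 are merged, while the cluster around 10 stays separate because merging it would worsen the score. This mirrors the stopping rule of step 4.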


The five event categories available in the test set are "attempting a board trick", "feeding an animal", "landing a fish", "wedding ceremony", and "working on a woodworking project"; the remainder of the videos in the test set are random videos that do not belong to any of the event categories. The number of videos in each category for training and testing is given in Table I. The contents of these videos vary widely. For example, the concept "attempting a board trick" includes people skateboarding, snowboarding, and surfing, while "wedding ceremony" ranges from a traditional Catholic mass to a Hindu ceremony to home-made music videos. The analysis presented in this paper is limited to the event kits, which form the training set of the TRECVID MED 2011 data set. Annotators' analyses of the testing set have revealed a huge number of different sound categories in the videos, some of which are event-specific, such as different tool sounds in woodworking or engine sounds, as well as music and different kinds of speech.

Table I
NUMBER OF VIDEOS FOR TRAIN AND TEST

Category   Description        Train Data   Test Data
E001       Board Tricks       160          111
E002       Feeding Animal     160          111
E003       Landing Fish       122          86
E004       Wedding            128          88
E005       Woodworking        142          100
E006       Birthday Party     173          0
E007       Changing Tire      110          0
E008       Flash Mob          173          0
E009       Vehicle Unstuck    131          0
E010       Grooming Animal    136          0
E011       Make a Sandwich    111          0
E012       Parade             134          0
E013       Parkour            108          0
E014       Repair Appliance   123          0
E015       Sewing             116          0
Rest       Random Other       N/A          3755

IV. METHODOLOGY

To explore the applicability of speaker diarization to audio concept detection, we applied the ICSI speaker diarization system to the TRECVID MED 2011 data set and analyzed the results produced by speaker segmentation. The basic idea behind this approach is that speaker diarization clusters those segments of an audio stream that exhibit similar acoustic properties into a speaker model. When preprocessing filters such as speech/non-speech detection are removed from the system, a speaker model does not necessarily represent a speaker, but a low level audio concept. Our motivation was that event-specific sounds, such as a power drill in a video about woodworking, an engine sound in a tire change scenario, or a clapping sound in a wedding video, could be found.

Speaker models (which in our case are used as low level audio concepts) are represented by the diarization system as Gaussian Mixture Models. A Gaussian Mixture Model (GMM) is a number of Gaussian distributions that describe each feature in the speaker model, as shown in equation (1):

$$p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i) \tag{1}$$

where $x$ is a $D$-dimensional random vector, $\mathcal{N}(x \mid \mu_i, \Sigma_i)$, $i = 1, \ldots, M$, are the component densities, and $w_i$, $i = 1, \ldots, M$, are the mixture weights. Each component density is a $D$-variate Gaussian function with mean vector $\mu_i$ and covariance matrix $\Sigma_i$ (we use diagonal covariance matrices here). The mixture weights are constrained by $\sum_{i=1}^{M} w_i = 1$.

A single feature is represented by a number of Gaussians that are weighted according to how they influence the overall model. The Gaussian distributions themselves are represented by their mean value and their variance.

In order to match low level audio concepts across training videos, and also to be able to classify low level feature models found in testing videos, we have simplified the Gaussian mixture model per speaker to a single vector that consists of the sums of the weighted means and the sums of the weighted variances of each Gaussian. In the remainder of this paper, we will call this vector a simplified supervector, as shown in equation (2):

$$s(x) = \Big[ \sum_{i=1}^{M} w_i \mu_i \; ; \; \sum_{i=1}^{M} w_i \Sigma_i \Big] \tag{2}$$

We then clustered the simplified supervectors of all low level acoustic concepts from all video files with a K-means approach. The resulting clusters represent abstractions of the simplified supervectors for all acoustic low level concepts and can be mapped back to the acoustic low level concepts (speaker models) in each video file by calculating the distance between the abstract simplified supervectors and the individual video's speaker models. By re-mapping the abstract simplified supervectors to the individual speaker models, one can count the overall occurrences of an abstract acoustic low level concept in all videos. We have also counted the number of occurrences of each abstract acoustic low level feature in all videos belonging to each event set. These numbers allow us to compute the normalized frequency of the occurrence of a specific acoustic low level concept per event, as shown in equation (3):

$$\mathrm{EEH}(c_i, E) = \frac{\sum_k \sum_j n_j \, P(c_i = c_j \mid c_j \in D_k, D_k \in E)}{\sum_k \sum_j n_j \, P(c_i = c_j \mid c_j \in D_k)} \tag{3}$$

where EEH denotes the expected event histogram, $n_j$ is the number of occurrences of $c_j$ in audio clip $D_k$, and $P(c_i = c_j \mid c_j \in D_k, D_k \in E)$ is the probability that audio term $c_i$ equals $c_j$ given that $c_j$ occurs in audio clip $D_k$ of event $E$.
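To make the simplified supervector of equation (2) and the mapping back to cluster centres more concrete, here is a minimal sketch in plain Python. It assumes diagonal covariances stored as per-dimension variance lists and uses Euclidean distance for the mapping; the paper does not state which distance was used, so that choice, the function names, and the toy numbers are all assumptions.

```python
import math


def simplified_supervector(weights, means, variances):
    """Collapse a diagonal-covariance GMM into one vector.

    The first half holds the weighted sum of the component means, the second
    half the weighted sum of the component variances, as in equation (2).
    """
    dims = len(means[0])
    mean_part = [sum(w * mu[d] for w, mu in zip(weights, means))
                 for d in range(dims)]
    var_part = [sum(w * var[d] for w, var in zip(weights, variances))
                for d in range(dims)]
    return mean_part + var_part


def nearest_concept(vector, centroids):
    """Index of the cluster centre closest to the vector (Euclidean)."""
    distances = [math.dist(vector, c) for c in centroids]
    return distances.index(min(distances))


# Toy speaker model with M = 2 components in D = 2 dimensions (assumed values).
weights = [0.25, 0.75]
means = [[0.0, 4.0], [4.0, 0.0]]
variances = [[1.0, 1.0], [1.0, 3.0]]

vector = simplified_supervector(weights, means, variances)

# Two hypothetical K-means centres in the same 2 * D dimensional space.
centroids = [[0.0, 0.0, 1.0, 1.0], [3.0, 1.0, 1.0, 2.0]]
concept_id = nearest_concept(vector, centroids)
print(vector, concept_id)
```

Collapsing each GMM to a fixed-length vector is what makes speaker models from different videos comparable, since the number of clusters per video and their GMM parameters differ from clip to clip.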


In the denominator of equation (3), $P(c_i = c_j \mid c_j \in D_k)$ is the probability that audio term $c_i$ equals $c_j$ given that $c_j$ comes from audio clip $D_k$.

The higher the normalized frequency of a sound belonging to an abstract acoustic low level concept a in an event b, the higher the predictive power of this acoustic low level concept. To give an extreme example: if a power drill sound could be correctly identified by the system, it would be likely that all occurrences of that sound appear in videos belonging to the category "woodworking", resulting in a normalized frequency of 100%. In other words, whenever a power drill sound occurs, we have a woodworking video. In order to analyze the applicability of speaker diarization, we have extracted those low level acoustic concepts that are the most predictive per event and examined them more closely, as discussed in the results section.

The whole workflow is illustrated in Figure 2 and consists of two parts: the ICSI speaker diarization system and K-means clustering. As shown in the top region of the figure, suppose we have three audio clips: 1, 2, and 3. By applying the ICSI speaker diarization system, each audio clip is segmented into separate chunks. If the chunks are assumed by the system to exhibit similar acoustic properties, then the chunks will be considered to belong to the same speaker. For example, in audio clip 1, there are speakers A, B, and C. Note that each speaker in each audio clip is described by a Gaussian Mixture Model.

Finally, given the different speakers from the ICSI speaker diarization system, we use the simplified supervector obtained from the weighted means and variances to represent a speaker. Then, we apply K-means clustering to cluster similar speakers among all audio clips. For example, speaker A in clip 1 and speaker D in clip 2 (see Figure 2) are clustered together as cluster A.

[Figure 2. Workflow for speaker diarization and clustering. Stages shown: audio clips, ICSI speaker diarization, K-means clustering.]

V. RESULTS FROM DATA ANALYSIS

The analysis of the distribution of the acoustic low level concepts has revealed a number of acoustic low level concepts that have high predictive power for the abstract concepts (wedding, woodworking, etc.) given in the training set. For five abstract concepts, we found acoustic concepts with a normalized frequency of 50% or greater, compared to a chance rate of 1/15, which is 6.7%: 100% for E001, 71.1% for E005, 71.4% for E007, 64.28% for E011, and 50% for E004. That is, each of these events can be predicted correctly with a probability of 50% or greater just by the presence of an instance of the respective acoustic low level concept. The event that performed worst in this comparison is "Changing a tire", which can only be predicted with 34% accuracy based on its dominant low level sound concept. These numbers show that the low level sound concepts generated by speaker diarization are a good discriminator for the higher level concepts of the 15 events given in the NIST TRECVID 2011 data set. Table II shows the normalized frequencies of the five most predictive acoustic low level concepts for all events in the event training kit. The normalized frequencies are based on K-means clustering with k=200. The average normalized frequency of the top sound concept over all events was 46.6%. Clustering with k=100 produced an average normalized frequency of 39.8% for the top sound concept. Clustering with k=1000 produced an average normalized frequency in the 1% range, likely due to overfitting to specific low level sound concepts from individual video clips.

While some sounds are a very good indicator of a specific event in the training set, their overall frequency of occurrence also has to be considered when using a specific sound concept for event detection. Sound 153, for instance, occurs only once in the whole training set. It is a fast gurgling water sound from a video about surfing. Sound 159, the top sound concept in E005, occurs in 21 videos, 12 of which are in E005. Also, multiple speaker models from these videos are matched to that sound concept, which is why the predictive value is 71.4%. We have manually inspected some of the instances of this sound in E005 and all videos from the other event categories in which this sound was found. Most instances of this sound concept in E005 are engine sounds of moderate volume. The sound is also found in E001, E009, E014, and E015. In one instance in E001 it is a harp sound in spherical music, in the other a compressor sound that is acoustically similar to a moderate engine sound. In E009 it occurs in two videos, in both as an engine sound of racing cars in the distance. In E014 it occurs in only one video, where it is similar to a moderate-volume sound of a vacuum cleaner in the background. In E015 it occurs in 4 videos, in 3 of which it is the sound of a sewing machine that is acoustically similar to some of the sounds of E005.


Table II
TOP FIVE SOUND CONCEPTS ACCORDING TO NORMALIZED FREQUENCY

Event   1st               2nd               3rd               4th               5th
E001    100% (ID 153)     25.8% (ID 117)    25% (ID 130)      19% (ID 189)      18.9% (ID 10)
E002    16.9% (ID 85)     14.8% (ID 108)    14.7% (ID 129)    14.6% (ID 90)     14.2% (ID 184)
E003    40.00% (ID 188)   31.25% (ID 13)    25.58% (ID 18)    22.32% (ID 58)    16.26% (ID 133)
E004    50.00% (ID 47)    41.66% (ID 48)    35.86% (ID 175)   33.33% (ID 94)    30.95% (ID 149)
E005    71.7% (ID 161)    58.3% (ID 88)     25% (ID 130)      24.7% (ID 66)     19.9% (ID 54)
E006    40.0% (ID 188)    14.2% (ID 52)     13.6% (ID 103)    13.3% (ID 4)      13% (ID 71)
E007    71.4% (ID 62)     43.5% (ID 51)     42.8% (ID 199)    41.9% (ID 139)    41.6% (ID 186)
E008    37.4% (ID 178)    29.8% (ID 110)    28.1% (ID 66)     25.0% (ID 130)    21.9% (ID 79)
E009    37.41% (ID 178)   29.81% (ID 110)   28.08% (ID 66)    25.00% (ID 130)   21.91% (ID 79)
E010    25.00% (ID 13)    21.42% (ID 169)   15.00% (ID 157)   14.91% (ID 65)    14.77% (ID 82)
E011    64.28% (ID 169)   36.36% (ID 103)   33.33% (ID 94)    32.98% (ID 70)    23.94% (ID 40)
E012    40.00% (ID 22)    40.00% (ID 156)   34.33% (ID 186)   33.33% (ID 135)   30.00% (ID 176)
E013    23.68% (ID 76)    15.11% (ID 43)    14.24% (ID 2)     13.07% (ID 20)    12.67% (ID 17)
E014    31.42% (ID 177)   30.43% (ID 84)    27.84% (ID 106)   27.08% (ID 98)    24.61% (ID 152)
E015    41.66% (ID 48)    33.33% (ID 94)    19.71% (ID 40)    14.77% (ID 37)    14.28% (ID 62)
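A ranking like the one in Table II could be derived along the following lines. This is a hedged sketch, not the authors' code: it assumes hard assignments, i.e. every speaker model is mapped to exactly one abstract sound concept, so the probabilities in equation (3) reduce to 0 or 1 and the normalized frequency becomes a ratio of counts. The function names, the concept IDs, and the counts below are made up for illustration and are not taken from the data set.

```python
from collections import Counter, defaultdict


def normalized_frequencies(assignments):
    """Share of each concept's occurrences that falls into each event.

    assignments: iterable of (event_id, concept_id) pairs, one pair per
    speaker model that was mapped to an abstract sound concept.
    """
    per_event = defaultdict(Counter)
    totals = Counter()
    for event, concept in assignments:
        per_event[event][concept] += 1
        totals[concept] += 1
    return {event: {c: n / totals[c] for c, n in counts.items()}
            for event, counts in per_event.items()}


def top_concepts(freqs, event, k=5):
    """The k concepts with the highest normalized frequency for an event."""
    ranked = sorted(freqs[event].items(),
                    key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Toy assignments (assumed): concept 1 occurs mostly in E005.
toy_assignments = [
    ("E005", 1), ("E005", 1), ("E005", 1), ("E015", 1),
    ("E001", 2),
    ("E005", 3), ("E015", 3),
]
freqs = normalized_frequencies(toy_assignments)
for event in sorted(freqs):
    print(event, top_concepts(freqs, event))
```

Note that a concept that occurs only once, like concept 2 in the toy data, reaches a normalized frequency of 100% for its event, so the overall number of occurrences should be read alongside the ranking.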

In the fourth E015 video where it occurs, it is a high-frequency sound of rotating equipment. Another observation we made is that for each specific event class there are a number of sound concepts that are not present in any video from that event class. These numbers are E001: 35, E002: 29, E003: 27, E004: 7, E005: 20, E006: 11, E007: 32, E008: 11, E009: 28, E010: 14, E011: 15, E012: 27, E013: 23, E014: 66, and E015: 22. Both observations indicate that the presence or absence, as well as the number of occurrences, of certain low level sound concepts in a video is a predictor for the concept class of that video. Figure 3 shows the distribution of low level sound concepts in the test set according to their frequency of occurrence for clustering with k=100, k=200, k=300, and k=1000. The figures suggest that with higher values for k in the clustering, the distribution comes closer to a Zipfian distribution. Zipfian distributions are sometimes connected to the applicability of TF-IDF measures, even though this is controversial from a theoretical perspective [10].

[Figure 3. Distribution Frequency for Kmeans K=100, 200, 300, 1000. Panels include (a) K=100 and (b) K=200.]

In addition to the analysis of the top performing sound concepts, we annotated and studied the distributions of seven sound sets, without prejudice towards how well they performed. In sound 5 we have a fairly poor discriminator, although by no means a useless one: it occurs in all of the event categories, with 41% of its occurrences being in E004, E005, and E008. Across the various event classes, it was observed to contain fairly similar sounds, with guitar music and singing being most common, also in the classes where it is most highly discriminating. The cluster also contains machine noises, speech, and the sound of bacon cooking. Sound 8, on the other hand, has only one event with a greater than 10% chance of occurrence, E014 at 16%, and seems to be a cluster of sounds that include music with percussive elements. In E014 this translated to music with tools being used, but in other places it was bass-heavy techno or cars moving at high speed. With sound 76 we had a sound which to a human annotator seems relatively unremarkable, being mostly instrumental music; however, this sound was detected in only 25 of the videos, and not detected at all in six of the event classes.

VI. CONCLUSION

This more in-depth analysis of how our system has created the sound clusters is illuminating: the machine learning system is creating categories that, while clearly sensible for it and useful in classifying the events, are clearly not what a human annotator would create. This forces us to reevaluate the utility of our standard annotation approaches when preparing the data for this sort of system, since what the system is discovering and finding useful is quite different from what a human annotator might create; it seems unlikely that a human annotator would put guitar music in the same category as bacon cooking. This dichotomy will hopefully prove to be advantageous in the future, if we can develop a system that combines the human understanding of sound meanings with the automatic segmentations which we have been using.

The distribution analysis of the clusters generated by the two processing steps, speaker diarization and clustering, has shown that the distribution of low level sound concepts found by speaker diarization differs between videos belonging to different classes.


It is hence safe to assume that low level sound concepts can be used for video classification. One advantage of representing videos at the level of the distribution of low level sound concepts is that low level sound concepts deliver an abstract representation. Classification at this level hence requires only a small amount of data. One video can be described by a vector with 200 or 300 positions instead of a framewise representation of raw features. Preliminary machine learning experiments with this high level representation have confirmed our estimates. In the future we will design a video concept classification system based on the low level sound concepts found by speaker diarization.

VII. ACKNOWLEDGMENTS

Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20066. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/NBC, or the U.S. Government.

REFERENCES

[1] Michael S. Lew, Nicu Sebe, Chabane Djeraba, and Ramesh Jain, "Content-based multimedia information retrieval: State of the art and challenges," ACM Trans. Multimedia Comput. Commun. Appl., vol. 2, no. 1, pp. 1-19, 2006.

[2] Cees G. M. Snoek and Marcel Worring, "Concept-based video retrieval," Foundations and Trends in Information Retrieval, vol. 2, no. 4, pp. 215-322, 2009.

[3] H. D. Wactlar, T. Kanade, M. A. Smith, and S. M. Stevens, "Intelligent access to digital video: Informedia project," Computer, vol. 29, no. 5, pp. 46-52, 1996.

[4] Huan Li, Lei Bao, Zan Gao, Arnold Overwijk, Wei Liu, Long-fei Zhang, Shoou-I Yu, Ming-yu Chen, Florian Metze, and Alexander Hauptmann, "Informedia@TRECVID 2010," in Notebook for NIST's TREC Video Retrieval Evaluation 2010, 2010.

[5] Y.-G. Jiang, X. Zeng, G. Ye, S. Bhattacharya, D. Ellis, M. Shah, and S.-F. Chang, "Columbia-UCF TRECVID2010 multimedia event detection: Combining multiple modalities, contextual concepts, and temporal matching," in NIST TRECVID Workshop, 2010.

[6] David Imseng and Gerald Friedland, "Robust speaker diarization for short speech recordings," in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, December 2009, pp. 432-437.

[7] Chuck Wooters and Marijn Huijbregts, "The ICSI RT07s Speaker Diarization System," in Multimodal Technologies for Perception of Humans, pp. 509-519. Springer-Verlag, Berlin, Heidelberg, 2008.

[8] G. Friedland and O. Vinyals, "Live speaker identification in conversation," in Proceedings of the ACM International Conference on Multimedia, October 2008, pp. 1017-1018.

[9] Yan Huang, Oriol Vinyals, Gerald Friedland, Christian Müller, Nikki Mirghafori, and Chuck Wooters, "A fast-match approach for robust, faster than real-time speaker diarization," in ASRU, 2007.

[10] Stephen Robertson, "Understanding inverse document frequency: On theoretical arguments for IDF," Journal of Documentation, vol. 60, 2004.
