Exemplar Selection Methods to
Distinguish Human from Animal Footsteps
Po-Sen Huang, Mark Hasegawa-Johnson, Thyagaraju Damarla
Beckman Institute, ECE Department, University of Illinois at Urbana-Champaign, U.S.A.
US Army Research Laboratory, 2800 Powder Mill Road, Adelphi, MD 20783, U.S.A.
********@********.***, ********@********.***, ********@***.****.***
Abstract
The class discovery problem is the problem of learning a classi er from a mixture of unlabeled
and labeled training data, under the constraint that labeled training data exist for only N-1 of the
N target classes. The task of distinguishing human from animal footsteps can be framed as a class
discovery problem. When humans travel alone, every footstep sound is caused by a human foot, therefore
labeled training examples for the human class are abundant. When humans travel with animals, their
footsteps are interspersed and/or overlapped in time; without a tedious labeling e ort, there are no gold-
standard labels specifying which species created each of the footstep events. This paper will describe
three di erent types of class discovery algorithm: the mixed-vs-unmixed classi er, the generative class
discovery algorithm, and the class of algorithms sometimes called self training. Experiments using the
ARL/Mississippi multisensory personnel tracking database will be reported. Experimental results suggest
that the mixed-vs-unmixed classi er gives the best performance in distinguishing mixed vs. unmixed test
tokens (recordings containing humans alone vs. humans with animals), and that the self-training method
shows promise for the task of learning to distinguish between the discrete footfall sounds of humans and
animals.
Index Terms: acoustic event detection, active sensing, Gaussian mixture models, support vector machines
1 Introduction
Personnel detection is an important task for Intelligence, Surveillance, and Reconnaissance (ISR) [1, 2]. One
might like to detect intruders in a certain area during the day and night so that the proper authorities can be
alerted. For example, border crimes including human tra cking would be reduced by automatic detection
of illegal aliens crossing the border. There are numerous other applications where personnel detection is
important.
However, personnel detection is a challenging problem. Video sensors consume high amounts of power
and require a large volume for storage. Hence, it is preferable to use non-imaging sensors, since they tend
to use low amounts of power and are long-lasting.
Traditionally, personnel detection research concentrated on using seismic sensors. When a person walks,
his/her impact on the ground causes seismic vibrations, which are captured by the seismic sensors [3, 4].
Hyung et al. proposed the method of extracting temporal gait patterns to provide information on temporal
distribution of gait beats [5].
At border crossings, animals such as mules, horses, or donkeys are often known to carry loads. Animal
hoof sounds make them distinct from human footstep sounds. Automatic algorithms that imitate human
capabilities in other acoustic event detection tasks have been constructed [6, 7], e.g., using perceptual linear
predictions (PLP) features coupled to tandem neural net - HMM recognizers.
Existing research considers only the case when there is a single object (a person or a four-legged animal)
walking using a single sensor in clean environments. However, when multiple objects such as humans with
1
Figure 1: Sensor layout, where a multi-sensor multi-modal system has acoustic, seismic, passive infra-red
(PIR), radar, magnetic, and electric eld sensors.
four-legged animals travel together in noisy environments, the task becomes more challenging. Moreover,
without a tedious labeling e ort, there are no gold-standard labels specifying which species created each
of the footstep events. The task of distinguishing human from animal footsteps can be framed as a class
discovery problem, which is the problem of learning a classi er from a mixture of unlabeled and labeled
training data, under the constraint that labeled training data exist for only N-1 of the N target classes.
In this paper, we aim to identify the footstep sounds of humans only and humans with (four-legged)
animals. Especially, in the humans with animals class, there is ambiguity among the footsteps of animals
alone, of humans alone, and of animals traveling together with humans. This paper will describe three
di erent types of class discovery algorithm: the mixed-vs-unmixed classi er using Support Vector Machines
(SVM), the generative class discovery algorithm using Gaussian Mixture Models (GMM), and the exemplar
selection algorithms using SVM and GMM.
The organization of this paper is as follows: Section 2 introduces the multi-sensor multi-modality data
and events. Section 3 describes the acoustic feature extraction. Section 4 discusses Gaussian mixture model
classi ers, Support Vector Machines, and exemplar selection methods. Section 5 describes the experiments
and discussions on the multi-sensor multi-modal dataset. We conclude this paper in Section 6.
2 Data
In this paper, we use a multi-sensor multi-modal realistic dataset collected in Arizona by the U.S. Army
Research Lab and the University of Mississippi. The data are collected in a realistic environment in an open
eld. There are three selected vantage points in the area. These three points are known to be used by the
illegal aliens crossing the border. These places where the data are collected include: (a) wash (a ash ood
river bed with ne-grain sand), (b) trail (a path through the shrubs and bushes, and (c) choke point (a valley
between two hills.) The data are recorded using several sensor modalities, namely, acoustic, seismic, passive
infrared (PIR), magnetic, E- eld, passive ultrasonic, sonar, and both infrared and visible video sensors. Each
sensor suite is placed along the path with a spacing of 40 to 60 meters apart. The detailed layout of the
sensors is shown in Figure 1. Test subjects walked or ran along the path and returned back along the same
path.
A total of 26 scenarios with various combinations of people, animals and payload are enacted. We can
categorize them as: single person (11.6%), two people (13%), three people (21.7%), one person with one
animal (14.5%), two people with two animals (15.9%), three people with three animals (17.4%), and seven
people with a dog (5.9%), where the animals can be a mule, a donkey, a horse, or a dog, and the number
in the parentheses represents the percentage of the data. The data are collected over a period of four days;
2
Figure 2: Actively select (turn on) acoustic signals by seismic signals.
each day at a di erent site and di erent environment. There is variable wind in the recording environment.
2.1 Active Sensing
The time duration for subjects passing by is short (about ten to twenty seconds at a time) compared to the
whole recording time ( ve to six minutes recording). Without any ground truth segmentation, we would like
to extract the time duration when test subjects are passing through. This problem can be formulated as an
example of active sensing and learning [8, 9], which refers to sequential data selection/collection and inference
procedures that actively seek out highly informative data, rather than relying solely on non-adaptive data
acquisition.
For acoustic sensors, in an outdoor scene, the signals are contaminated by wind sounds, human voices,
or unexpected airplane engine sounds. Seismic sensors, on the other hand, are relatively clean. Since seismic
and acoustic signals are pre-synchronized, we could select the time duration when test subjects pass through
using seismic sensors by an energy detection. If the energy in any ten-second interval exceeds a threshold,
the interval is marked active. ; therefore the acoustic active integral can be marked on the basis of seismic
energy. For each recording, there are two active segments (walked or ran along the path and returned back
along the same path). In this paper, we emphasize the classi cation of segmented acoustic recordings into
two classes: humans only, and humans with animals.
3 Features Extraction
In acoustic signals, for footsteps, the hoof sounds of animals such as horses, donkeys, or mules are perceptually
distinct from human footstep sounds. In order to imitate the perceptual discrimination abilities of human
listeners, we begin by using Perceptual Linear Predictive (PLP) features [10], which are common features in
speech recognition. As mentioned in Section 2, the data are recorded in an open eld. There are noisy wind
sounds blowing in the recordings. We use spectral subtraction to reduce the e ect of noise [11, 12].
From the active segments we extracted in Section 2.1, we further extract acoustic features from short-time
footstep sounds by incorporating seismic signals. Based on the idea of active sensing, in order to extract the
exact footstep sounds for classi cation, we can formulate the problem as turning acoustic sensors on/o by
seismic information, as shown in Figure 2.
To be more speci c, since there are no labels for the exact time of footstep sounds, we have to use the
seismic sensor information, assuming that the peaks in the seismic signals correspond to footsteps. Suppose
there are n groups of peaks (if some peaks are close to each other, we count them as one group) in the
seismic signal, whose times are ti, for i = 1, . . ., n. We choose a small time around the peaks and extract
PLP features within the time duration (ti, ti + ), for i = 1, . . ., n, as shown in Figure 3. In each
time period, we extract 13 PLP features using 186ms Hamming windows with 75% overlap, where 186ms
3
Figure 3: Using peaks of seismic signals for matching acoustic footstep sounds
is approximately equal to the time duration of a single footstep (from heel strike to toe slap). Delta and
delta-delta coe cients are appended to create a 39-dimensional feature vector.
4 Methods
4.1 Gaussian Mixture Model Classi ers
The motivation for using Gaussian mixture densities is that a su ciently large linear combination of Gaussian
basis functions is capable of representing any di erentiable sample distribution [13, 14]. A Gaussian mixture
density is a weighted sum of M component densities, as shown in the following equation,
M
p(x ) = pi bi (x) (1)
i=1
where x is a D-dimension random vector, bi (x), i = 1, . . ., M, are the component densities and pi, i =
1, . . ., M, are the mixture weights. Each component density is a D-variate Gaussian function of the form
1 1
exp{ (x i ) 1 (x i )}
bi (x) = (2)
i
(2 )D/2 1/2 2
i
M
with mean vector i and covariance matrix i . The mixture weights are constrained by i=1 pi = 1.
The complete Gaussian mixture density is parameterized by the mean vectors, covariance matrices (we use
diagonal covariance matrices here) and mixture weights from all component densities. These parameters
4
are collectively represented by the notation = {pi, i, i }, i = 1, . . ., M . For classi cation, each class is
represented by a GMM parameterized by .
Given training data from each class, the goal of model training is to estimate the parameters of the GMM.
Maximum likelihood model parameters are estimated using the Expectation-Maximization (EM) algorithm.
Generally, ten iterations are su cient for parameter convergence.
The objective is to nd the class model that has the maximum a posteriori probability for a given
observation sequence X . Assuming equal likelihood for all classes (i.e., p( k ) = 1/N ), the classi cation rule
simpli es to
T
N = argmax p(X k ) = argmax log p(xt k ) (3)
1 k N 1 k N t=1
where the second equation uses logarithms and the independence between observations. T is the number of
observations.
4.2 Support Vector Machines
A Support Vector Machine (SVM) estimates decision surfaces directly [15], rather than modeling a probability
distribution from the training data. Given training feature vectors xi Rn, i = 1, . . ., k in two classes with
label y Rk, where yi {1, 1}, a SVM solves the following optimization problem:
k
1T
min 2w w +C i=1 i
w,b,
subject to yi (wT (xi ) + b) 1 i
i 0, i = 1, . . ., k
where (xi ) maps xi onto a higher dimensional space, C 0 is the regularization parameter, and i is a
slack variable, which measures the degree of misclassi cation of the datum xi .
k
The solution can be written as w satis es w = i=1 yi i (xi ), and the decision function is
k
h(x) = sgn yi i K (xi, x) + b (4)
i=1
where K (xi, x) = (xi )T (x) is the kernel function. In this paper, we use LIBSVM with Radial Basis
Function (RBF) kernels, that is, K (xi, xj ) = exp( xi xj 2 ) [16].
4.3 Exemplar Selection Methods
Our goal is to classify humans only vs. humans with animals. In the humans with animals class, there are
instances of human footstep sounds. Therefore, there are some overlap between the two classes in the feature
space, as shown on the left hand side of Figure 4. Regularized discriminative methods such as support vector
machines (SVM) explicitly trade o the degree of class overlap vs. the complexity of the decision boundary
in order to minimize an estimate of expected risk. Generative models, on the other hand, model overlap
only to the extent permitted by the speci ed generative model.
In order to improve the classi ers ability to compensate for class overlap, therefore, we propose a multi-
stage algorithm for exemplar selection, as shown in Figure 5; this framework is similar to the self-training
methods used in semi-supervised learning.
The idea of the framework is to select the exemplar frames in the humans with animals class which are
dissimilar to the features in the humans only class. With the exemplar selection method, classi ers are easier
to learn the distinctive features between classes as shown on the right hand side of Figure 4.
5
Figure 4: Left: an example of feature space of humans only and humans with animals class. Right: an
example of feature space of humans only and estimated animals only class, after exemplar selection.
Figure 5: Multi-stage framework for acoustic exemplar selection
The algorithm is as follows:
1. Train an exemplar selection classi er (SVM or GMM) for humans only and humans with animals using
training data as shown in the left block of Figure 5.
2. Label the training data of the humans with animals class using the trained models as shown in the
middle block of Figure 5. Each frame in the training data is labeled as either the humans only class
or the humans with animals class.
3. Keep the frames which were labeled as humans with animals ; in other words, discard the frames which
were labeled as humans only.
4. Train a new classi er (SVM or GMM) between the estimated animals only class and the humans only
class as shown in the right block of Figure 5.
5 Experiments
In this section, we describe the experiments in classifying humans only vs. humans with animals. There
are 69 recordings in the dataset. We divide the recordings into four groups and choose two for training and
6
Accuracy Feature
GMM SVM
73.768 2.230 65.337 1.896
PLP features without (1)(2)(3)(4)
76.105 4.098 71.698 4.572
PLP features with (1)
74,975 5.079 78.093 1.699
PLP features with (1)(2), =0.1s
75.737 2.936 76.604 2.179
PLP features with (1)(2)(3), =0.1s
72.735 4.585 75.090 2.577
PLP features with (1)(2)(4), =0.1s
77.555 4.268 80.578 3.113
PLP features with (1)(2), =0.3s
79.015 3.799 72.638 2.727
PLP features with (1)(2)(3), =0.3s
75.325 3.739 77.196 1.706
PLP features with (1)(2)(4), =0.3s
75.392 3.376 76.214 4.396
PLP features with (1)(2), =0.5s
77.688 3.149 74.507 3.634
PLP features with (1)(2)(3), =0.5s
74.800 4.523 71.313 3.456
PLP features with (1)(2)(4), =0.5s
Table 1: Classi cation accuracy using Acoustic features, where (1) represents spectral subtraction, (2) rep-
resents the use of seismic peaks with di erent second (s), and (3) represents the use of our proposed
multi-stage exemplar selection framework using GMM classi er as the rst step of the algorithm. (4) rep-
resents the use of our proposed multi-stage exemplar selection framework using SVM classi er as the rst
step of the algorithm.
two for testing at a time, resulting in a six-fold cross-validation. In each fold, we randomly select a part of
recordings from training and testing sets as a validation set. We choose the best mixture count for the GMM
classi er and parameters and C for the SVM, according to the validation set. The experimental results
are represented by mean standard error.
As described in Section 3 and Section 4, we want to examine the e ect of using spectral subtraction,
seismic peaks with di erent s, and our proposed multi-stage exemplar selection framework using GMM
and SVM classi ers as the rst step of the algorithm. The experimental results are shown in Table 1.
The rst row PLP features without (1)(2)(3)(4) in Table 1 demonstrates the result of using the active
audio segments, without using the duration estimated by the peaks of seismic signals, and without using
spectral subtraction. Spectral subtraction (row 2) improves the performance for both classi ers.
It is helpful to further extract audio features from the time durations marked by peaks of seismic signals.
This method utilizes both the characteristics of acoustic and seismic sensor in the sensor suites. Without
using this method, there are many silence or noise segments in the audio signals, and the silence or noise
signals make both classi ers ill-trained.
Moreover, di erent values of capture di erent amounts of acoustic information. The results show that
=0.3s has the best performance compared with =0.1s and =0.5s. The seismic sensor and acoustic sensor
are not at exactly the same place and the rates of propagation are di erent. Therefore, there are asynchronies
between acoustic and seismic signals. Speci cally, with =0.1s, the acoustic segment does not contain the
entire footstep sound. On the other hand, with =0.5s, the acoustic signals include too much unrelated
noise. These reasons may explain the performance variation of both classi ers.
For our proposed multi-stage exemplar selection framework, using GMM for exemplar selection improves
the accuracy around 1 2% for GMM classi ers; on the contrary, using GMM for exemplar selection degrades
the accuracy for SVM classi ers. A possible reason is that SVM implicitly chooses support vectors for the
hyperplane in the feature space. Using GMM selected features, the SVM has less information, and hence has
worse performance. On the other hand, using SVM for exemplar selection degrades performance in all cases.
A possible explanation is that the SVM cannot select proper exemplar in the case of overlapping feature
space in the rst stage.
7
6 Conclusion
In this paper, we use a challenging realistic multi-sensor multi-modal dataset for personnel detection focusing
on the classi cation between humans only and humans with animals. Based on the idea of active sensing,
we use seismic signals to actively select acoustic signals which correspond to footstep sounds. To reduce
the ambiguity between the two classes, this paper explores the multi-stage exemplar selection methods.
Experimental results suggest that the SVM classi er gives the best performance in distinguishing mixed vs.
unmixed test tokens, and that the exemplar selection method using GMM classi er shows promise for the
task of learning to distinguish between the discrete footfall sounds of humans and animals.
7 Acknowledgments
This research is supported by ARO MURI 2009-31.
References
[1] T. Damarla, Sensor fusion for ISR assets, M. A. Kolodny, Ed., vol. 7694. SPIE, 2010.
[2] T. Damarla, L. Kaplan, and A. Chan, Human infrastructure & human activity detection, in Information
Fusion, 2007 10th International Conference on, 9-12 2007, pp. 1 8.
[3] J. M. Sabatier and A. E. Ekimov, Range limitation for seismic footstep detection, E. M. Carapezza, Ed., vol.
6963. SPIE, 2008.
[4] K. M. Houston and D. P. McGa gan, Spectrum analysis techniques for personnel detection using seismic
sensors, E. M. Carapezza, Ed., vol. 5090. SPIE, 2003, pp. 162 173.
[5] H. O. Park, A. A. Dibazar, and T. W. Berger, Cadence analysis of temporal gait patterns for seismic discrimi-
nation between human and quadruped footsteps, Acoustics, Speech, and Signal Processing, IEEE International
Conference on, pp. 1749 1752, 2009.
[6] X. Zhuang, X. Zhou, M. A. Hasegawa-Johnson, and T. S. Huang, Real-world acoustic event detection, Pattern
Recognition Letters, vol. 31, no. 12, pp. 1543 1551, 2010.
[7] P.-S. Huang, X. Zhuang, and M. A. Hasegawa-Johnson, Improving acoustic event detection using generalizable
visual features and multi-modality modeling, in Acoustics, Speech and Signal Processing. ICASSP 2011. IEEE
International Conference on, 2011.
[8] D. J. MacKay, Information-based objective functions for active data selection, Neural Computation, vol. 4,
pp. 590 604.
[9] R. Castro, C. Kalish, R. Nowak, R. Qian, T. Rogers, and X. Zhu, Human active learning, NIPS, 2008.
[10] H. Hermansky, Perceptual linear predictive (PLP) analysis of speech, The Journal of the Acoustical Society of
America, vol. 87, no. 4, pp. 1738 1752, 1990.
[11] M. Berouti, R. Schwartz, and J. Makhoul, Enhancement of speech corrupted by acoustic noise, in Acoustics,
Speech, and Signal Processing, IEEE International Conference on ICASSP 79., vol. 4, Apr. 1979, pp. 208 211.
[12] R. Martin, Noise power spectral density estimation based on optimal smoothing and minimum statistics,
Speech and Audio Processing, IEEE Transactions on, vol. 9, no. 5, pp. 504 512, Jul. 2001.
[13] L. R. Rabiner, B.-H. Juang, S. E. Levinson, and M. M. Sondhi, Recognition of isolated digits using hidden
markov models with continuous mixture densities. AT Technical Journal, vol. 64, no. 6 pt 1, pp. 1211 1234,
1985.
[14] L. Rabiner, A tutorial on hidden markov models and selected applications in speech recognition, Proceedings
of the IEEE, vol. 77, no. 2, pp. 257 286, Feb. 1989.
[15] B. E. Boser, I. M. Guyon, and V. N. Vapnik, A training algorithm for optimal margin classi ers, in
Proceedings of the fth annual workshop on Computational learning theory, ser. COLT 92. New York, NY,
USA: ACM, 1992, pp. 144 152. [Online]. Available: http://doi.acm.org/10.1145/130385.130401
[16] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2001, software available at
http://www.csie.ntu.edu.tw/ cjlin/libsvm.
8