WSEAS TRANS. ON INFORMATION SCIENCE & APPLICATIONS Issue 11, Volume 2, November 2005 ISSN: 1790-0832
Case-Based Free Text Evaluation for Customer Satisfaction
Monitoring
NICHOLAS EVANGELOPOULOS
Department of Information Technology and Decision Sciences
University of North Texas
P.O. Box 305249 Denton TX 76203-5249
UNITED STATES
********@***.*** http://www.coba.unt.edu/itds/faculty/evangelopoulos/evangelopoulos.htm
Abstract: - In this paper we show how free text can be rated against some desirable criteria in order to generate numerical input for control charts that monitor customer satisfaction. Following the
general principles of Case-Based Reasoning, a Case-Based Free Text Evaluation (CBFTE) algorithm is
developed and various parameters of interest are investigated. These include the document size of the
customer comments, whether binary or IDF weights are used in the calculations of document similarities, the
number of text clusters that are used to produce the initial training case base, and the number of loops used
while searching for the optimal clustering solution. CBFTE uses human experts to obtain customer
satisfaction ratings for the case base and then proceeds to automatically estimate the ratings of the remaining
documents. As an illustration of this framework we are using viewers' comments submitted to the Internet
Movie Database (IMDb) Web site after they watched a recent popular film.
Key-Words: - Text Mining, Customer Relationship Management, Case Based Reasoning, Text Clustering
1 Introduction
Modern Customer Relationship Management (CRM) is often faced with the time-consuming task of sifting through large volumes of customer feedback comments, which represent an underexploited source of business intelligence. The main difficulty here is that text data need to be converted to some numerical format, so that well-established statistical quality control tools can take over.
The purpose of this paper is to discuss the feasibility of Customer Satisfaction Monitoring using established Quality Control tools when the only data available are in free text format. The proposed solution is to use humans to evaluate a relatively small set of text documents against some desirable rating standard and then produce rating estimates for the rest of the documents automatically. The proposed methodology will be applied to rated movie viewers' comments after they watched a recent popular film. We will also perform some basic sensitivity analysis by investigating some related parameters.

2 Free Text Evaluation
Our approach to free text evaluation will follow a methodology related to Case-Based Reasoning, will use Vector Space Model arguments to calculate similarity between documents, and a k-means text clustering to determine the initial case base.

2.1 Vector Space Model (VSM)
The Vector Space Model (VSM), a ranking model that ranks individual documents against a query, was originally introduced by Gerald Salton and his associates [10], [11], and has been involved in a large part of the Information Retrieval research. VSM calculates the similarity between two documents, represented as vectors of unique terms, by considering the cosine of the angle that is formed between the two vectors,

Sim(d, q) = (d . q) / (||d|| ||q||),   (1)

where d and q are a document and a query vector, respectively, on a multi-dimensional space that uses as many dimensions as the non-trivial terms in a dictionary. Various improvements to the basic VSM have been introduced by researchers, such as term stemming [7], [9], or Latent Semantic Indexing (LSI) [4]. Based on these methods of quantifying document similarity, Text Classifiers have been widely used in document categorization (indexing). Methods for construction of document classifiers include inductive learning techniques [3], probabilistic and decision tree classifiers [5], genetic algorithms, neural networks [13], taxonomies [8], statistical clustering [2], and K-Nearest Neighbor [12, pp. 102-103], [6].
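To make the similarity measure in equation (1) concrete, the following minimal Python sketch (our own illustration, not code from the paper) computes the cosine similarity between two documents stored as {term: weight} dictionaries:

```python
import math
from collections import Counter

def cosine_similarity(d, q):
    """Cosine of the angle between two term-weight vectors stored as
    {term: weight} dictionaries (equation 1)."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# Toy usage with raw term counts for two short comments
doc = Counter("great movie very good acting".split())
query = Counter("great acting good movie".split())
print(cosine_similarity(doc, query))   # close to 1 for heavily overlapping texts
```

The same function works with binary or IDF term weights; only the weight values stored in the dictionaries change.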
2.2 Case-Based Reasoning (CBR)
Case-based reasoning (CBR) is an Artificial Intelligence framework that functions through the usage of previous cases. New cases are compared to a case base and checked for similarity. As the CBR system is utilized, new cases are added to the case base, building upon the general knowledge component found in the center of the CBR cycle, which follows conceptual steps referred to as retrieve, reuse, revise, and retain [1].

2.3 K-Means Text Clustering (KMTC)
Since our CBFTE approach will compare each not-yet-rated customer comment to its closest match in the Case Base, it is important to select the initial members of the Case Base not randomly, but in an optimized way that maximizes the average similarity of all cases to their closest match. The reason why we need our cases to be very close to their best matches is that, when two documents are very similar, their ratings are expected to be similar as well; however, if the documents are dissimilar, they could have ratings that are very different, but they could also have ratings that are very similar, since there is more than one possible way for customers to express a certain level of satisfaction. It is, therefore, desirable to identify an initial set of documents (training set) by forming document clusters and selecting an equal number of documents from each cluster to participate in the training set.

Algorithm 1:
For m = 1 to M, where M is a moderately large number:
1. Select a randomly chosen set of k seed documents that define k initial clusters and calculate their term concatenations qi, i = 1, ..., k.
2. Perform k queries using qi as the query documents. For each document dj in the collection, j = 1, ..., n, consider the similarities Sim(dj, qi) and assign dj to the cluster whose term concatenation qi is nearest. Use heuristics to deal with orphaned documents and empty clusters. Recalculate the term concatenations for the cluster receiving the new document and for the cluster losing the document.
3. Repeat Step 2 until no more assignments take place.
4. Calculate the Overall Similarity of the final clustering arrangement and save the clustering arrangement having the maximum Overall Similarity.
Next m.

The heuristics mentioned in Step 2 are explained below.

Heuristic KMTC.1:
If a document is orphaned (i.e., without a cluster to belong to), it will join the least populated cluster. Ties will be resolved by random assignment.

Heuristic KMTC.2:
If a cluster is left empty, it will be joined by a document that deserves to be in its current cluster the least. Ties will be resolved by random assignment.

Figure 1 illustrates two such clusters populated with documents with a high degree of within-cluster similarity. Because our Vector Space representation of documents uses the cosine of the angle between vectors as a measure of similarity, the usual n-dimensional Cartesian space with rectangular coordinates gives way to a system of normalized polar coordinates where radius lengths do not matter, only angles do. Essentially all documents get projected on an n-dimensional spherical surface (a hyper-sphere), marked on Figure 1 as the document projection horizon. VSM's view of the document clusters then becomes similar to an observer's view of the starry night sky. Some stars (or planets, or even galaxies for that matter, since they all look so similar) appear to form clusters, but are they really close to each other? Well, stars that appear to participate in star clusters, for one, usually are! In the results section we will investigate what happens with document clusters.

Fig. 1. Clusters of documents in the Vector Space. [Figure: document vectors in clusters m and n, plotted against term axes i and j and projected on the document projection horizon around the origin of the vector space.]
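As a rough illustration of Algorithm 1 and its two heuristics, the sketch below (an assumption-laden simplification written for this discussion, not the authors' original code) clusters bag-of-words documents by cosine similarity to cluster term concatenations, restarts M times, and keeps the arrangement with the highest overall similarity:

```python
import math
import random
from collections import Counter

def cosine_similarity(d, q):
    """Cosine similarity between two {term: weight} vectors (equation 1)."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    nd = math.sqrt(sum(w * w for w in d.values()))
    nq = math.sqrt(sum(w * w for w in q.values()))
    return dot / (nd * nq) if nd and nq else 0.0

def term_concat(cluster_docs):
    """Term concatenation of a cluster: merged term counts of its members."""
    merged = Counter()
    for doc in cluster_docs:
        merged.update(doc)
    return merged

def kmtc(documents, k, M, seed=0, max_passes=100):
    """Algorithm 1 (simplified): M restarts of similarity-based k-means,
    keeping the clustering arrangement with the highest overall similarity."""
    rng = random.Random(seed)
    best_assign, best_score = None, -1.0
    for _ in range(M):
        concats = [Counter(documents[i])
                   for i in rng.sample(range(len(documents)), k)]
        assign = [0] * len(documents)
        for _ in range(max_passes):            # cap passes to guarantee termination
            changed = False
            for j, doc in enumerate(documents):
                sims = [cosine_similarity(doc, q) for q in concats]
                target = max(range(k), key=lambda i: sims[i])
                if sims[target] == 0.0:
                    # Heuristic KMTC.1: an orphaned document joins the
                    # least populated cluster.
                    target = min(range(k), key=lambda i: assign.count(i))
                if assign[j] != target:
                    assign[j], changed = target, True
            for i in range(k):
                if i not in assign:
                    # Heuristic KMTC.2: an empty cluster takes the document
                    # that fits its current cluster the least.
                    worst = min(range(len(documents)),
                                key=lambda j: cosine_similarity(
                                    documents[j], concats[assign[j]]))
                    assign[worst] = i
            concats = [term_concat([documents[j] for j in range(len(documents))
                                    if assign[j] == i]) for i in range(k)]
            if not changed:
                break
        overall = sum(cosine_similarity(doc, concats[assign[j]])
                      for j, doc in enumerate(documents))
        if overall > best_score:
            best_assign, best_score = list(assign), overall
    return best_assign, best_score
```

Unlike Algorithm 1, this sketch recalculates the term concatenations once per pass rather than after every individual reassignment, and it breaks ties deterministically; both simplifications are ours.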
2.4 CBFTE Algorithm
In view of the preceding discussion, we summarize the procedures followed by a Case-Based Free Text Evaluation engine in the following algorithm:

Algorithm 2:
1. Removal of Unique Documents: Separate all unique documents from the collection.
2. KMTC: Perform M runs of K-Means Text Clustering (Algorithm 1) and choose the clustering solution with the highest overall similarity.
3. Construction of the Case Base: Have a human expert manually rate l representatives of each cluster. Store the k·l rated cases in the case base.
4. Case-Based Free Text Evaluation: For each of the remaining N - k·l documents, find the best match in the case base, and produce a corresponding rating estimate.
5. Acceptance Sampling: Sample s of the estimated ratings and ask the human expert to rate them independently. Learning: Expand the case base by storing the verified ratings.
6. Repeat Step 5 until all verified ratings are equal to the corresponding estimated ratings. Then accept the remaining N - k·l - s unverified estimates and stop.
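A compact sketch of steps 3 and 4 of Algorithm 2 is given below (our own illustrative code, not the paper's implementation; it assumes documents are bag-of-words Counters and reuses the cosine_similarity and term_concat functions from the sketches in Sections 2.1 and 2.3, with l = 1 representative per cluster):

```python
def build_case_base(documents, assignment, expert_ratings, k):
    """Step 3 of Algorithm 2: store one rated representative per cluster.
    expert_ratings maps document indices to human-assigned 1-10 ratings."""
    case_base = []
    for cluster in range(k):
        members = [j for j, c in enumerate(assignment) if c == cluster]
        if not members:
            continue
        concat = term_concat([documents[j] for j in members])
        # Representative: the member closest to the cluster's term concatenation.
        rep = max(members, key=lambda j: cosine_similarity(documents[j], concat))
        case_base.append((documents[rep], expert_ratings[rep]))
    return case_base

def estimate_rating(document, case_base):
    """Step 4 of Algorithm 2: best-match rating estimate, with no adaptation."""
    best_rating, best_sim = None, -1.0
    for case_doc, rating in case_base:
        sim = cosine_similarity(document, case_doc)
        if sim > best_sim:
            best_rating, best_sim = rating, sim
    return best_rating
```

The acceptance-sampling loop of steps 5 and 6 simply appends the newly verified (document, rating) pairs to case_base and re-estimates the remaining documents.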
3 A Customer Satisfaction Example
We will now discuss an example with free text data coming in the form of customer feedback. The data set under consideration will include short comments expressing how satisfied customers were with a certain product or service. Corresponding ratings in numerical form will also be used in order to evaluate the performance of our automatic rating algorithm.

3.1 Methods
We surveyed movie viewers' comments on a major box office movie that was released in 2005. These viewers were users of the Internet Movie Database (IMDb), a popular Web forum related to movies. Their movie feedback comments were posted during a certain 30-day period of 2005. Users posted their star-rating (on a 1-10 Likert scale). The distribution of the submitted ratings, based on a sample of about 50,000 users who volunteered to vote electronically during the aforementioned 30-day period, is shown in Fig. 2.

Fig. 2. Distribution of ratings submitted by 50,000 users. Information courtesy of The Internet Movie Database (http://www.imdb.com). Used with permission.

During that same period, about 2,100 written comments were submitted by registered users, together with their Likert-scale star-ratings. Those comments included a short comment (typically one line) summarizing their impression, and a longer comment (typically 1 paragraph to 1 page) elaborating on their feelings and opinions regarding the movie. About 900 of those registered users' postings originated in the United States and the remaining 1,200 in a large number of other countries. The fact that the comments were posted in English served, of course, as a source of bias: for instance, 250 comments came from the United Kingdom, 97 from Canada, 60 from Australia, whereas Germany contributed 21 postings, France 10, Greece 6, and Austria 3. Nevertheless, countries such as Argentina, Mexico, Turkey, Egypt, or Thailand were still represented in the sample. The most serious imbalance in the sample was the gender bias: the 2,100 users who posted comments included only 200 women. This was probably due to the nature of the particular movie which, being an action-adventure, apparently had many more male than female viewers who felt strongly about the movie and decided to submit comments about it. To some extent this is not only a bias of the sample, but also a reflection of the gender distribution in the population of interest (i.e., the viewers of the movie).

For the purposes of this study, 1,059 comments were sampled, using a sampling scheme that employed approximate separate sampling, i.e., picked an approximately equal number of comments from each rating level. The number of comments
participating in the sample that came from each rating level are shown in Fig. 3.

Fig. 3. Distribution of 1,059 sampled ratings.

3.2 Lexical Analysis
As part of our free text evaluation approach, a dictionary of all non-trivial terms included in the document collection was compiled. The size of the dictionary varied with the number of comments under consideration. For example, a small subset of 260 short comments had 418 different terms after being filtered with 530 stopwords. Another sample of 520 short comments had 732 different terms, whereas the full sample of 1,059 short comments had 1,228 different terms. Although the size of the dictionary grows bigger as the number of documents in the collection goes up, it is interesting to notice that the top 20 terms, after we sort the three dictionaries by descending term frequency, are almost identical: 18 out of 20 terms were present in all three dictionaries as part of the top 20 list, and the remaining 2 turned up in spots 21-25. Another interesting observation is the number of unique terms (i.e., terms that appear in only one document). The first dictionary had 317 unique terms out of 418, the second had 541 out of 732, and the third 833 out of 1,228. For very large document collections, the number of terms in the dictionary is expected to approach or even exceed the 10,000 to 20,000 words that are commonly used in the English language. The 20 terms with the highest frequencies included judgment expressions such as good, best, better, great, bad, and disappointing. They also included the word movie, words from the film's title, and the director's name. This second category is unrelated to the evaluative nature of these short comments. Subsequently, a new stopword list with 537 terms was applied. We started with a standard stoplist, removed evaluative terms such as bad and good, then added 7 non-evaluative terms that were specific to this set of documents.

For the purposes of calculating similarities between documents, terms are usually weighted so that terms that appear very frequently in the document collection are discounted and terms that appear in only a few documents are promoted. This approach results in the introduction of inter-document frequencies (IDFs) as weights on the term dimension coordinates. Alternatively, binary weights can be used, simply indicating whether each term appears in the document or not, without paying any attention to how common or how unusual that term is. In our analysis, IDF as well as binary weights were used.

Utilization of a Porter Stemmer yielded only marginal difference in the results, so no stemming was applied to the document terms.
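The two weighting schemes can be sketched as follows (our own illustration; the paper does not spell out its exact IDF formula, so the logarithmic inverse-document-frequency variant below is an assumption):

```python
import math

def binary_weights(doc_terms):
    """Binary weighting: 1 if the term occurs in the document, 0 otherwise."""
    return {t: 1.0 for t in doc_terms}

def idf_weights(doc_terms, collection):
    """IDF-style weighting (assumed log variant): discount terms that occur in
    many documents, promote terms that occur in only a few."""
    n = len(collection)
    weights = {}
    for t in doc_terms:
        df = sum(1 for d in collection if t in d)
        weights[t] = math.log(n / df) if df else 0.0
    return weights

# Toy usage: three tokenized comments
collection = [set("great movie loved it".split()),
              set("worst movie ever".split()),
              set("great acting bad plot".split())]
print(binary_weights(collection[0]))
print(idf_weights(collection[0], collection))
```

Either weighting can be plugged into the cosine similarity of equation (1) without any other change.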
3.3 Evaluation based on Bipolar Queries
Creation of potential predictor variables based on the results of bipolar queries, i.e., good/bad, agree/disagree, etc., is becoming a quite common approach in text mining. In our analysis, we first composed two queries representing satisfaction and dissatisfaction with the film, respectively:

Query 1: "Great movie, very good, better, best of all times, excellent."
Query 2: "Bad movie, the worst of all times, very disappointing, I was disappointed."

Based on the similarity results from these queries, the Bipolar Query Rating (BQR) of each document was estimated using the following empirical formula:

BQR = [1 + 9 * Sim1 / (Sim1 + Sim2)],   (2)

where Sim1 is the similarity metric from Query 1, Sim2 the corresponding result from Query 2, and the constants 1 and 9 aim to adjust the result to the 1-to-10 rating scale. This kind of rating estimation will serve as a comparison basis for our CBFTE results.
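A direct transcription of formula (2) into code might look as follows (a sketch under the assumptions that Sim1 and Sim2 are cosine similarities of the document to Query 1 and Query 2, and that the square brackets denote rounding to the nearest integer; it reuses cosine_similarity from the Section 2.1 sketch):

```python
def bipolar_query_rating(document, query_pos, query_neg):
    """Formula (2): map the document's relative similarity to a 'satisfied'
    query and a 'dissatisfied' query onto the 1-to-10 rating scale."""
    sim1 = cosine_similarity(document, query_pos)
    sim2 = cosine_similarity(document, query_neg)
    if sim1 + sim2 == 0:
        return None  # the comment matches neither query
    return round(1 + 9 * sim1 / (sim1 + sim2))
```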
3.4 CBFTE Algorithm Implementation
After the dictionary customization, k-means text clustering was applied to the set of documents, using a variety of values for the number of clusters k, as well as a number of clustering runs M, so that an optimal clustering solution could be picked. One representative from each cluster was then considered to have a known rating value (i.e., as if it had been rated by a human expert) and the remaining documents had their ratings estimated using a Best Match approach, i.e., without any adaptation (modification) of the estimated ratings.
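Putting the pieces together, a hypothetical end-to-end run of this implementation (reusing the kmtc, build_case_base, and estimate_rating sketches from Sections 2.3 and 2.4; the data, variable names, and ratings below are invented purely for illustration) could look like this:

```python
from collections import Counter

# Toy stand-ins for sampled comments (bag-of-words Counters); the actual
# data set in the paper had N = 260 to 1,059 comments.
comments = [Counter(t.split()) for t in [
    "great movie best acting ever", "worst movie very disappointing",
    "good film great story", "bad plot i was disappointed",
    "excellent movie loved it", "boring film bad acting"]]
expert_ratings = {0: 9, 1: 2, 2: 8, 3: 3, 4: 10, 5: 1}  # hypothetical 1-10 ratings

k, M = 2, 5                       # the paper explored k = 5..120 and M = 1..40
assignment, _ = kmtc(comments, k=k, M=M)
case_base = build_case_base(comments, assignment, expert_ratings, k=k)
estimates = {j: estimate_rating(comments[j], case_base)
             for j in range(len(comments))}
print(estimates)
```

In the actual procedure only the cluster representatives would carry expert ratings; the full expert_ratings dictionary above is just a convenience for the toy example.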
4 Results
This section presents the results of the methods described previously. In order to evaluate these results, accuracy rates, defined as the percentage of documents rated with zero error, as well as Mean Absolute Errors (MAEs), were employed. While it is important to achieve a high accuracy rate, it is also interesting to see the entire distribution of the rating error, defined here as the discrepancy between the estimated and the observed (= reported) rating. In the subsections that follow we will examine the effects of various parameters of interest on such text rating error distributions.
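The two evaluation measures just defined are straightforward to compute; a minimal sketch (ours, not the paper's code) is:

```python
def accuracy_rate(estimated, observed):
    """Accuracy rate: percentage of documents rated with zero error."""
    exact = sum(1 for e, o in zip(estimated, observed) if e == o)
    return 100.0 * exact / len(observed)

def mean_absolute_error(estimated, observed):
    """Mean Absolute Error (MAE) between estimated and reported ratings."""
    return sum(abs(e - o) for e, o in zip(estimated, observed)) / len(observed)

# Toy example: estimated vs. reported ratings for five comments
print(accuracy_rate([8, 3, 9, 5, 1], [8, 4, 9, 2, 1]))        # 60.0
print(mean_absolute_error([8, 3, 9, 5, 1], [8, 4, 9, 2, 1]))  # 0.8
```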
4.1 Size of the text documents
The first comparison we performed involved the size of the customer comments. A sample of 254 feedback entries was used, with short and long comments for each entry. Figures 4 and 5 show the rating error distributions using short and long comments, respectively, when bipolar queries were used to produce the rating estimates. Short comments yielded a slightly better performance, with a higher accuracy rate and less pronounced error distribution tails.

Fig. 4. Rating error distribution for 254 short comments, using the bipolar query rating method.

Fig. 5. Rating error distribution for 254 long comments, using the bipolar query rating method.

4.2 Term weights
Figure 6 shows the rating error distribution for 254 short comments, using binary weights instead of IDFs. The accuracy rate (the percentage of documents that were rated with zero error) is marginally higher when binary weights are used. Binary weights have been known in the literature to produce slightly better results when they operate on short documents, so this result is not surprising.

Fig. 6. Rating error distribution for 254 short comments, using the bipolar query rating method and binary weights.

4.3 Text Clustering and the effect of k
We now proceed with the implementation of the CBFTE algorithm as described in Section 3. The documents were assumed to have unknown ratings, and k-means text clustering produced k representatives that formed the initial case base. Those documents were subsequently assumed to have known ratings, provided by human experts. After estimating the remaining unknown ratings in the way that was described in Section 3, the percentage of documents having an estimated rating that was exactly equal to the true rating was obtained. Fig. 7 shows this rating accuracy for a collection of N=260 user comments and for a varying number of clusters that ranged from k=5 to
k=120. The rating accuracy starts around 10% for k=5 and reaches approximately 55% for k=120. For each value of k, M=40 iterations were applied so that the best clustering solution could be determined.

Fig. 7. Rating accuracy for N=260, M=40.

Figure 8 shows the rating error distribution using the CBFTE method, when k=30 clusters were used to produce 30 documents with known ratings. Only M=1 iteration was applied in the search for the best clustering solution. The results show an improvement over simple bipolar query rating. Figure 9 shows a similar rating error distribution when k=60 documents were considered known. These settings show again an improvement in performance, illustrating the merit of using a higher value of k. We observe that the errors really come a little closer to zero, although not by a lot.

Fig. 8. Rating error distribution for N=260, k=30, M=1.

Fig. 9. Rating error distribution for N=260, k=60, M=1.

4.4 The effect of M iterations on the discovery of a good clustering solution
Figure 9 in the previous subsection showed the rating error distribution for the selected case of N=260, k=60, applying only M=1 iteration during the search for the optimal clustering solution. Increasing the number of iterations can have a positive impact on the accuracy rate, as shown in Figures 10 and 11, where M is equal to 10 and 40, respectively. This result is quite interesting, since increasing the iterations during the text clustering stage only requires computing power, not human expert time.

Fig. 10. Rating error distribution for N=260, k=60, M=10.

Fig. 11. Rating error distribution for N=260, k=60, M=40.

Table 1 shows the Mean Absolute Errors (MAEs) next to the corresponding accuracy rates for three selected values of M. As the repetitions M increase from 1, to 10, to 40, the Mean Absolute Error (MAE) decreases from 2.08, to 2.056, to 1.905, respectively.

    M     MAE      Accuracy
    1     2.08     18.80%
    10    2.056    22.43%
    40    1.905    25.87%

Table 1. MAE and accuracy rates for 3 selected values of M.
5 Discussion
After an overall evaluation of the rating error distributions presented in the results section, we might claim that free text evaluation following the Case-Based approach that was presented in this paper is not without merit, but we would have to admit that it is not without problems, either. Customers can be very creative in the ways they express their feelings and their satisfaction. They can use rare words, slang, make obscure points, be sarcastic, or play word games, just to name a few. Still, with moderately large numbers of documents k that are processed by humans and fairly large numbers of clustering solution repetitions M that are processed by computers, the rating estimation errors can be kept within tight ranges. As a future direction, it is interesting to examine the possible improvement in rating accuracy, or the possible reduction in rating estimation error, when longer, more descriptive comments are used, as opposed to the short, summarizing tag lines that were employed here.

6 Conclusion
In this paper we brought together a number of text mining tools in order to create a framework for free text evaluation. Our methodology was applied to a collection of short customer feedback comments that expressed satisfaction or dissatisfaction over a recently released film, where the users also provided quantitative information expressing the same satisfaction in the form of a rating score. Our example illustrated that it is actually possible for a human rater to pay the cost of manually rating a subset of customer feedback comments and then enjoy the benefit of having the rest of them automatically rated without a great loss of rating accuracy. This way, a basis for further quantitative analysis and, therefore, more efficient customer relationship management can be established.

References:
[1] Aamodt, A. and Plaza, E., Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches, AI Communications, Vol. 7, No. 1, 1994, pp. 39-59.
[2] Anick, P. and Vaithyanathan, S., Exploiting Clustering and Phrases for Context-Based Information Retrieval, SIGIR, 1997.
[3] Cohen, W. W. and Singer, Y., Context-Sensitive Learning Methods for Text Categorization, SIGIR '96: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996, pp. 307-315.
[4] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R., Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, Vol. 41, 1990, pp. 391-407.
[5] Lewis, D. D. and Ringuette, M., A Comparison of Two Learning Algorithms for Text Categorization, Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR '94), 1994.
[6] Li, F. and Yang, Y., A Loss Function Analysis for Classification Methods in Text Categorization, The Twentieth International Conference on Machine Learning (ICML '03), 2003, pp. 472-479.
[7] Porter, M. F., An Algorithm for Suffix Stripping, Program, Vol. 14, No. 3, 1980, pp. 130-137.
[8] Raghavan, P., Information Retrieval Algorithms: A Survey, Symposium on Discrete Algorithms, ACM-SIAM, 1997.
[9] Riloff, E., Little Words Can Make a Big Difference for Text Classification, Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995, pp. 130-136.
[10] Salton, G. and Lesk, M. E., Computer Evaluation of Indexing and Text Processing, Journal of the Association for Computing Machinery, Vol. 15, No. 1, 1968, pp. 8-36.
[11] Salton, G. and Buckley, C., Term-Weighting Approaches in Automatic Text Retrieval, Information Processing and Management, Vol. 24, No. 5, 1988, pp. 513-523.
[12] Weiss, S. M. and Indurkhya, N., Predictive Data Mining: A Practical Guide, Morgan Kaufmann, San Francisco, 1998.
[13] Wiener, E., Pedersen, J. O. and Weigend, A. S., A Neural Network Approach to Topic Spotting, Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval (SDAIR '95), 1995.