WSEAS TRANS. ON INFORMATION SCIENCE & APPLICATIONS Issue 11, Volume 2, November 2005 ISSN: 1790-0832
Case-Based Free Text Evaluation for Customer Satisfaction
Monitoring
NICHOLAS EVANGELOPOULOS
Department of Information Technology and Decision Sciences
University of North Texas
P.O. Box 305249 Denton TX 76203-5249
UNITED STATES
********@***.*** http://www.coba.unt.edu/itds/faculty/evangelopoulos/evangelopoulos.htm
Abstract: - In this paper we show how free text can be rated against some desirable criteria in order to generate numerical input for control charts that monitor customer satisfaction. Following the
general principles of Case-Based Reasoning, a Case-Based Free Text Evaluation (CBFTE) algorithm is
developed and various parameters of interest are investigated. These include the document size of the
customer comments, whether binary or IDF weights are used in the calculations of document similarities, the
number of text clusters that are used to produce the initial training case base, and the number of loops used
while searching for the optimal clustering solution. CBFTE uses human experts to obtain customer
satisfaction ratings for the case base and then proceeds to automatically estimate the ratings of the remaining
documents. As an illustration of this framework we are using viewers' comments submitted to the Internet
Movie Database (IMDb) Web site after they watched a recent popular film.
Key-Words: - Text Mining, Customer Relationship Management, Case Based Reasoning, Text Clustering
1 Introduction
Modern Customer Relationship Management (CRM) is often faced with the time-consuming task of sifting through large volumes of customer feedback comments, which represent an underexploited source of business intelligence. The main difficulty here is that text data need to be converted to some numerical format, so that well-established statistical quality control tools can take over.
The purpose of this paper is to discuss the feasibility of Customer Satisfaction Monitoring using established Quality Control tools when the only data available are in free text format. The proposed solution is to use humans to evaluate a relatively small set of text documents against some desirable rating standard and then produce rating estimates for the rest of the documents automatically. The proposed methodology will be applied to rated movie viewers' comments after they watched a recent popular film. We will also perform some basic sensitivity analysis by investigating some related parameters.

2 Free Text Evaluation
Our approach to free text evaluation will follow a methodology related to Case-Based Reasoning, will use Vector Space Model arguments to calculate similarity between documents, and a k-means text clustering to determine the initial case base.

2.1 Vector Space Model (VSM)
The Vector Space Model (VSM), a ranking model that ranks individual documents against a query, was originally introduced by Gerald Salton and his associates [10], [11], and has been involved in a large part of the Information Retrieval research. VSM calculates the similarity between two documents, represented as vectors of unique terms, by considering the cosine of the angle that is formed between the two vectors,

Sim(d, q) = (d . q) / (||d|| ||q||),   (1)

where d and q are a document and a query vector, respectively, on a multi-dimensional space that uses as many dimensions as the non-trivial terms in a dictionary. Various improvements to the basic VSM have been introduced by researchers, such as term stemming [7], [9], or Latent Semantic Indexing (LSI) [4]. Based on these methods of quantifying document similarity, Text Classifiers have been widely used in document categorization (indexing). Methods for construction of document classifiers include inductive learning techniques [3], probabilistic and decision tree classifiers [5], genetic algorithms, neural networks [13], taxonomies [8], statistical clustering [2], and K-Nearest Neighbor [12, pp. 102-103], [6].
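To make the similarity measure in equation (1) concrete, the following minimal Python sketch (our own illustration, not code from the paper) computes the cosine similarity between two documents stored as {term: weight} dictionaries:

```python
import math
from collections import Counter

def cosine_similarity(d, q):
    """Cosine of the angle between two term-weight vectors stored as
    {term: weight} dictionaries (equation 1)."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# Toy usage with raw term counts for two short comments
doc = Counter("great movie very good acting".split())
query = Counter("great acting good movie".split())
print(cosine_similarity(doc, query))   # close to 1 for heavily overlapping texts
```

The same function works with binary or IDF term weights; only the weight values stored in the dictionaries change.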
2.2 Case-Based Reasoning (CBR)
Case-based reasoning (CBR) is an Artificial Intelligence framework that functions through the usage of previous cases. New cases are compared to a case base and checked for similarity. As the CBR system is utilized, new cases are added to the case base, building upon the general knowledge component found in the center of the CBR cycle, which follows conceptual steps referred to as retrieve, reuse, revise, and retain [1].

2.3 K-Means Text Clustering (KMTC)
Since our CBFTE approach will compare each not-yet-rated customer comment to its closest match in the Case Base, it is important to select the initial members of the Case Base not randomly, but in an optimized way that maximizes the average similarity of all cases to their closest match. The reason why we need our cases to be very close to their best matches is that, when two documents are very similar, their ratings are expected to be similar as well; however, if the documents are dissimilar, they could have ratings that are very different, but they could also have ratings that are very similar, since there is more than one possible way for customers to express a certain level of satisfaction. It is, therefore, desirable to identify an initial set of documents (training set) by forming document clusters and selecting an equal number of documents from each cluster to participate in the training set.

Algorithm 1:
For m = 1 to M, where M is a moderately large number:
1. Select a randomly chosen set of k seed documents that define k initial clusters and calculate their term concatenations qi, i = 1, ..., k.
2. Perform k queries using qi as the query documents. For each document dj in the collection, j = 1, ..., n, consider the similarities Sim(dj, qi) and assign dj to the cluster whose term concatenation qi is nearest. Use heuristics to deal with orphaned documents and empty clusters. Recalculate the term concatenations for the cluster receiving the new document and for the cluster losing the document.
3. Repeat Step 2 until no more assignments take place.
4. Calculate the Overall Similarity of the final clustering arrangement and save the clustering arrangement having the maximum Overall Similarity.
Next m.

The heuristics mentioned in Step 2 are explained below.

Heuristic KMTC.1:
If a document is orphaned (i.e., without a cluster to belong to), it will join the least populated cluster. Ties will be resolved by random assignment.

Heuristic KMTC.2:
If a cluster is left empty, it will be joined by a document that deserves to be in its current cluster the least. Ties will be resolved by random assignment.

Figure 1 illustrates two such clusters populated with documents with a high degree of within-cluster similarity. Because our Vector Space representation of documents uses the cosine of the angle between vectors as a measure of similarity, the usual n-dimensional Cartesian space with rectangular coordinates gives way to a system of normalized polar coordinates where radius lengths do not matter, only angles do. Essentially all documents get projected on an n-dimensional spherical surface (a hyper-sphere), marked on Figure 1 as the document projection horizon. VSM's view of the document clusters then becomes similar to an observer's view of the starry night sky. Some stars (or planets, or even galaxies for that matter, since they all look so similar) appear to form clusters, but are they really close to each other? Well, stars that appear to participate in star clusters, for one, usually are! In the results section we will investigate what happens with document clusters.

Fig. 1. Clusters of documents in the Vector Space. [Figure: document vectors in clusters m and n, plotted against term axes i and j and projected on the document projection horizon around the origin of the vector space.]
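As a rough illustration of Algorithm 1 and its two heuristics, the sketch below (an assumption-laden simplification written for this discussion, not the authors' original code) clusters bag-of-words documents by cosine similarity to cluster term concatenations, restarts M times, and keeps the arrangement with the highest overall similarity:

```python
import math
import random
from collections import Counter

def cosine_similarity(d, q):
    """Cosine similarity between two {term: weight} vectors (equation 1)."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    nd = math.sqrt(sum(w * w for w in d.values()))
    nq = math.sqrt(sum(w * w for w in q.values()))
    return dot / (nd * nq) if nd and nq else 0.0

def term_concat(cluster_docs):
    """Term concatenation of a cluster: merged term counts of its members."""
    merged = Counter()
    for doc in cluster_docs:
        merged.update(doc)
    return merged

def kmtc(documents, k, M, seed=0, max_passes=100):
    """Algorithm 1 (simplified): M restarts of similarity-based k-means,
    keeping the clustering arrangement with the highest overall similarity."""
    rng = random.Random(seed)
    best_assign, best_score = None, -1.0
    for _ in range(M):
        concats = [Counter(documents[i])
                   for i in rng.sample(range(len(documents)), k)]
        assign = [0] * len(documents)
        for _ in range(max_passes):            # cap passes to guarantee termination
            changed = False
            for j, doc in enumerate(documents):
                sims = [cosine_similarity(doc, q) for q in concats]
                target = max(range(k), key=lambda i: sims[i])
                if sims[target] == 0.0:
                    # Heuristic KMTC.1: an orphaned document joins the
                    # least populated cluster.
                    target = min(range(k), key=lambda i: assign.count(i))
                if assign[j] != target:
                    assign[j], changed = target, True
            for i in range(k):
                if i not in assign:
                    # Heuristic KMTC.2: an empty cluster takes the document
                    # that fits its current cluster the least.
                    worst = min(range(len(documents)),
                                key=lambda j: cosine_similarity(
                                    documents[j], concats[assign[j]]))
                    assign[worst] = i
            concats = [term_concat([documents[j] for j in range(len(documents))
                                    if assign[j] == i]) for i in range(k)]
            if not changed:
                break
        overall = sum(cosine_similarity(doc, concats[assign[j]])
                      for j, doc in enumerate(documents))
        if overall > best_score:
            best_assign, best_score = list(assign), overall
    return best_assign, best_score
```

Unlike Algorithm 1, this sketch recalculates the term concatenations once per pass rather than after every individual reassignment, and it breaks ties deterministically; both simplifications are ours.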
2.4 CBFTE Algorithm
In view of the preceding discussion, we summarize the procedures followed by a Case-Based Free Text Evaluation engine in the following algorithm:

Algorithm 2:
1. Removal of Unique Documents: Separate all unique documents from the collection.
2. KMTC: Perform M runs of K-Means Text Clustering (Algorithm 1) and choose the clustering solution with the highest overall similarity.
3. Construction of the Case Base: Have a human expert manually rate l representatives of each cluster. Store the k·l rated cases in the case base.
4. Case-Based Free Text Evaluation: For each of the remaining N - k·l documents, find the best match in the case base, and produce a corresponding rating estimate.
5. Acceptance Sampling: Sample s of the estimated ratings and ask the human expert to rate them independently. Learning: Expand the case base by storing the verified ratings.
6. Repeat Step 5 until all verified ratings are equal to the corresponding estimated ratings. Then accept the remaining N - k·l - s unverified estimates and stop.
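A compact sketch of steps 3 and 4 of Algorithm 2 is given below (our own illustrative code, not the paper's implementation; it assumes documents are bag-of-words Counters and reuses the cosine_similarity and term_concat functions from the sketches in Sections 2.1 and 2.3, with l = 1 representative per cluster):

```python
def build_case_base(documents, assignment, expert_ratings, k):
    """Step 3 of Algorithm 2: store one rated representative per cluster.
    expert_ratings maps document indices to human-assigned 1-10 ratings."""
    case_base = []
    for cluster in range(k):
        members = [j for j, c in enumerate(assignment) if c == cluster]
        if not members:
            continue
        concat = term_concat([documents[j] for j in members])
        # Representative: the member closest to the cluster's term concatenation.
        rep = max(members, key=lambda j: cosine_similarity(documents[j], concat))
        case_base.append((documents[rep], expert_ratings[rep]))
    return case_base

def estimate_rating(document, case_base):
    """Step 4 of Algorithm 2: best-match rating estimate, with no adaptation."""
    best_rating, best_sim = None, -1.0
    for case_doc, rating in case_base:
        sim = cosine_similarity(document, case_doc)
        if sim > best_sim:
            best_rating, best_sim = rating, sim
    return best_rating
```

The acceptance-sampling loop of steps 5 and 6 simply appends the newly verified (document, rating) pairs to case_base and re-estimates the remaining documents.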
3 A Customer Satisfaction Example
We will now discuss an example with free text data coming in the form of customer feedback. The data set under consideration will include short comments expressing how satisfied customers were with a certain product or service. Corresponding ratings in numerical form will also be used in order to evaluate the performance of our automatic rating algorithm.

3.1 Methods
We surveyed movie viewers' comments on a major box office movie that was released in 2005. These viewers were users of the Internet Movie Database (IMDb), a popular Web forum related to movies. Their movie feedback comments were posted during a certain 30-day period of 2005. Users posted their star-rating (on a 1-10 Likert scale). The distribution of the submitted ratings, based on a sample of about 50,000 users who volunteered to vote electronically during the aforementioned 30-day period, is shown in Fig. 2.

Fig. 2. Distribution of ratings submitted by 50,000 users. Information courtesy of The Internet Movie Database (http://www.imdb.com). Used with permission.

During that same period, about 2,100 written comments were submitted by registered users, together with their Likert-scale star-ratings. Those comments included a short comment (typically one line) summarizing their impression, and a longer comment (typically 1 paragraph to 1 page) elaborating on their feelings and opinions regarding the movie. About 900 of those registered users' postings originated in the United States and the remaining 1,200 in a large number of other countries. The fact that the comments were posted in English served, of course, as a source of bias: for instance, 250 comments came from the United Kingdom, 97 from Canada, 60 from Australia, whereas Germany contributed 21 postings, France 10, Greece 6, and Austria 3. Nevertheless, countries such as Argentina, Mexico, Turkey, Egypt, or Thailand were still represented in the sample. The most serious imbalance in the sample was the gender bias: the 2,100 users who posted comments included only 200 women. This was probably due to the nature of the particular movie which, being an action-adventure, apparently had many more male than female viewers who felt strongly about the movie and decided to submit comments about it. To some extent this is not only a bias of the sample, but also a reflection of the gender distribution in the population of interest (i.e., the viewers of the movie).

For the purposes of this study, 1,059 comments were sampled, using a sampling scheme that employed approximate separate sampling, i.e., picked an approximately equal number of comments from each rating level. The number of comments
participating in the sample that came from each rating level are shown in Fig. 3.

Fig. 3. Distribution of 1,059 sampled ratings.

3.2 Lexical Analysis
As part of our free text evaluation approach, a dictionary of all non-trivial terms included in the document collection was compiled. The size of the dictionary varied with the number of comments under consideration. For example, a small subset of 260 short comments had 418 different terms after being filtered with 530 stopwords. Another sample of 520 short comments had 732 different terms, whereas the full sample of 1,059 short comments had 1,228 different terms. Although the size of the dictionary grows bigger as the number of documents in the collection goes up, it is interesting to notice that the top 20 terms, after we sort the three dictionaries by descending term frequency, are almost identical: 18 out of 20 terms were present in all three dictionaries as part of the top 20 list, and the remaining 2 turned up in spots 21-25. Another interesting observation is the number of unique terms (i.e., terms that appear in only one document). The first dictionary had 317 unique terms out of 418, the second had 541 out of 732, and the third 833 out of 1,228. For very large document collections, the number of terms in the dictionary is expected to approach or even exceed the 10,000 to 20,000 words that are commonly used in the English language. The 20 terms with the highest frequencies included judgment expressions such as good, best, better, great, bad, and disappointing. They also included the word movie, words from the film's title, and the director's name. This second category is unrelated to the evaluative nature of these short comments. Subsequently, a new stopword list with 537 terms was applied. We started with a standard stoplist, removed evaluative terms such as bad and good, then added 7 non-evaluative terms that were specific to this set of documents.

For the purposes of calculating similarities between documents, terms are usually weighted so that terms that appear very frequently in the document collection are discounted and terms that appear in only a few documents are promoted. This approach results in the introduction of inter-document frequencies (IDFs) as weights on the term dimension coordinates. Alternatively, binary weights can be used, simply indicating whether each term appears in the document or not, without paying any attention to how common or how unusual that term is. In our analysis, IDF as well as binary weights were used.

Utilization of a Porter Stemmer yielded only marginal difference in the results, so no stemming was applied to the document terms.
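The two weighting schemes can be sketched as follows (our own illustration; the paper does not spell out its exact IDF formula, so the logarithmic inverse-document-frequency variant below is an assumption):

```python
import math

def binary_weights(doc_terms):
    """Binary weighting: 1 if the term occurs in the document, 0 otherwise."""
    return {t: 1.0 for t in doc_terms}

def idf_weights(doc_terms, collection):
    """IDF-style weighting (assumed log variant): discount terms that occur in
    many documents, promote terms that occur in only a few."""
    n = len(collection)
    weights = {}
    for t in doc_terms:
        df = sum(1 for d in collection if t in d)
        weights[t] = math.log(n / df) if df else 0.0
    return weights

# Toy usage: three tokenized comments
collection = [set("great movie loved it".split()),
              set("worst movie ever".split()),
              set("great acting bad plot".split())]
print(binary_weights(collection[0]))
print(idf_weights(collection[0], collection))
```

Either weighting can be plugged into the cosine similarity of equation (1) without any other change.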
3.3 Evaluation based on Bipolar Queries
Creation of potential predictor variables based on the results of bipolar queries, i.e., good/bad, agree/disagree, etc., is becoming a quite common approach in text mining. In our analysis, we first composed two queries representing satisfaction and dissatisfaction with the film, respectively:

Query 1: "Great movie, very good, better, best of all times, excellent."
Query 2: "Bad movie, the worst of all times, very disappointing, I was disappointed."

Based on the similarity results from these queries, the Bipolar Query Rating (BQR) of each document was estimated using the following empirical formula:

BQR = [1 + 9 * Sim1 / (Sim1 + Sim2)],   (2)

where Sim1 is the similarity metric from Query 1, Sim2 the corresponding result from Query 2, and the constants 1 and 9 aim to adjust the result to the 1-to-10 rating scale. This kind of rating estimation will serve as a comparison basis for our CBFTE results.
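A direct transcription of formula (2) into code might look as follows (a sketch under the assumptions that Sim1 and Sim2 are cosine similarities of the document to Query 1 and Query 2, and that the square brackets denote rounding to the nearest integer; it reuses cosine_similarity from the Section 2.1 sketch):

```python
def bipolar_query_rating(document, query_pos, query_neg):
    """Formula (2): map the document's relative similarity to a 'satisfied'
    query and a 'dissatisfied' query onto the 1-to-10 rating scale."""
    sim1 = cosine_similarity(document, query_pos)
    sim2 = cosine_similarity(document, query_neg)
    if sim1 + sim2 == 0:
        return None  # the comment matches neither query
    return round(1 + 9 * sim1 / (sim1 + sim2))
```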
3.4 CBFTE Algorithm Implementation
After the dictionary customization, k-means text clustering was applied to the set of documents, using a variety of values for the number of clusters k, as well as a number of clustering runs M, so that an optimal clustering solution could be picked. One representative from each cluster was then considered to have a known rating value (i.e., as if it had been rated by a human expert) and the remaining documents had their ratings estimated using a Best Match approach, i.e., without any adaptation (modification) of the estimated ratings.
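Putting the pieces together, a hypothetical end-to-end run of this implementation (reusing the kmtc, build_case_base, and estimate_rating sketches from Sections 2.3 and 2.4; the data, variable names, and ratings below are invented purely for illustration) could look like this:

```python
from collections import Counter

# Toy stand-ins for sampled comments (bag-of-words Counters); the actual
# data set in the paper had N = 260 to 1,059 comments.
comments = [Counter(t.split()) for t in [
    "great movie best acting ever", "worst movie very disappointing",
    "good film great story", "bad plot i was disappointed",
    "excellent movie loved it", "boring film bad acting"]]
expert_ratings = {0: 9, 1: 2, 2: 8, 3: 3, 4: 10, 5: 1}  # hypothetical 1-10 ratings

k, M = 2, 5                       # the paper explored k = 5..120 and M = 1..40
assignment, _ = kmtc(comments, k=k, M=M)
case_base = build_case_base(comments, assignment, expert_ratings, k=k)
estimates = {j: estimate_rating(comments[j], case_base)
             for j in range(len(comments))}
print(estimates)
```

In the actual procedure only the cluster representatives would carry expert ratings; the full expert_ratings dictionary above is just a convenience for the toy example.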
4 Results
This section presents the results of the methods described previously. In order to evaluate these results, accuracy rates, defined as the percentage of documents rated with zero error, as well as Mean Absolute Errors (MAEs), were employed. While it is important to achieve a high accuracy rate, it is also interesting to see the entire distribution of the rating error, defined here as the discrepancy between the estimated and the observed (= reported) rating. In the subsections that follow we will examine the effects of various parameters of interest on such text rating error distributions.
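The two evaluation measures just defined are straightforward to compute; a minimal sketch (ours, not the paper's code) is:

```python
def accuracy_rate(estimated, observed):
    """Accuracy rate: percentage of documents rated with zero error."""
    exact = sum(1 for e, o in zip(estimated, observed) if e == o)
    return 100.0 * exact / len(observed)

def mean_absolute_error(estimated, observed):
    """Mean Absolute Error (MAE) between estimated and reported ratings."""
    return sum(abs(e - o) for e, o in zip(estimated, observed)) / len(observed)

# Toy example: estimated vs. reported ratings for five comments
print(accuracy_rate([8, 3, 9, 5, 1], [8, 4, 9, 2, 1]))        # 60.0
print(mean_absolute_error([8, 3, 9, 5, 1], [8, 4, 9, 2, 1]))  # 0.8
```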
4.1 Size of the text documents
The first comparison we performed involved the size of the customer comments. A sample of 254 feedback entries was used, with short and long comments for each entry. Figures 4 and 5 show the rating error distributions using short and long comments, respectively, when bipolar queries were used to produce the rating estimates. Short comments yielded a slightly better performance, with a higher accuracy rate and less pronounced error distribution tails.

Fig. 4. Rating error distribution for 254 short comments, using the bipolar query rating method.

Fig. 5. Rating error distribution for 254 long comments, using the bipolar query rating method.

4.2 Term weights
Figure 6 shows the rating error distribution for 254 short comments, using binary weights instead of IDFs. The accuracy rate (the percentage of documents that were rated with zero error) is marginally higher when binary weights are used. Binary weights have been known in the literature to produce slightly better results when they operate on short documents, so this result is not surprising.

Fig. 6. Rating error distribution for 254 short comments, using the bipolar query rating method and binary weights.

4.3 Text Clustering and the effect of k
We now proceed with the implementation of the CBFTE algorithm as described in Section 3. The documents were assumed to have unknown ratings, and k-means text clustering produced k representatives that formed the initial case base. Those documents were subsequently assumed to have known ratings, provided by human experts. After estimating the remaining unknown ratings in the way that was described in Section 3, the percentage of documents having an estimated rating that was exactly equal to the true rating was obtained. Fig. 7 shows this rating accuracy for a collection of N=260 user comments and for a varying number of clusters that ranged from k=5 to
k=120. The rating accuracy starts around 10% for k=5 and reaches approximately 55% for k=120. For each value of k, M=40 iterations were applied so that the best clustering solution could be determined.

Fig. 7. Rating accuracy for N=260, M=40.

Figure 8 shows the rating error distribution using the CBFTE method, when k=30 clusters were used to produce 30 documents with known ratings. Only M=1 iteration was applied in the search for the best clustering solution. The results show an improvement over simple bipolar query rating. Figure 9 shows a similar rating error distribution when k=60 documents were considered known. These settings show again an improvement in performance, illustrating the merit of using a higher value of k. We observe that the errors really come a little closer to zero, although not by a lot.

Fig. 8. Rating error distribution for N=260, k=30, M=1.

Fig. 9. Rating error distribution for N=260, k=60, M=1.

4.4 The effect of M iterations on the discovery of a good clustering solution
Figure 9 in the previous subsection showed the rating error distribution for the selected case of N=260, k=60, applying only M=1 iteration during the search for the optimal clustering solution. Increasing the number of iterations can have a positive impact on the accuracy rate, as shown in Figures 10 and 11, where M is equal to 10 and 40, respectively. This result is quite interesting, since increasing the iterations during the text clustering stage only requires computing power, not human expert time.

Fig. 10. Rating error distribution for N=260, k=60, M=10.

Fig. 11. Rating error distribution for N=260, k=60, M=40.

Table 1 shows the Mean Absolute Errors (MAEs) next to the corresponding accuracy rates for three selected values of M. As the repetitions M increase from 1, to 10, to 40, the Mean Absolute Error (MAE) decreases from 2.08, to 2.056, to 1.905, respectively.

    M     MAE      Accuracy
    1     2.08     18.80%
    10    2.056    22.43%
    40    1.905    25.87%

Table 1. MAE and accuracy rates for 3 selected values of M.
5 Discussion
After an overall evaluation of the rating error distributions presented in the results section, we might claim that free text evaluation following the Case-Based approach that was presented in this paper is not without merit, but we would have to admit that it is not without problems, either. Customers can be very creative in the ways they express their feelings and their satisfaction. They can use rare words, slang, make obscure points, be sarcastic, or play word games, just to name a few. Still, with moderately large numbers of documents k that are processed by humans and fairly large numbers of clustering solution repetitions M that are processed by computers, the rating estimation errors can be kept within tight ranges. As a future direction, it is interesting to examine the possible improvement in rating accuracy, or the possible reduction in rating estimation error, when longer, more descriptive comments are used, as opposed to the short, summarizing tag lines that were employed here.

6 Conclusion
In this paper we brought together a number of text mining tools in order to create a framework for free text evaluation. Our methodology was applied to a collection of short customer feedback comments that expressed satisfaction or dissatisfaction over a recently released film, where the users also provided quantitative information expressing the same satisfaction in the form of a rating score. Our example illustrated that it is actually possible for a human rater to pay the cost of manually rating a subset of customer feedback comments and then enjoy the benefit of having the rest of them automatically rated without a great loss of rating accuracy. This way, a basis for further quantitative analysis and, therefore, more efficient customer relationship management can be established.

References:
[1] Aamodt, A. and Plaza, E., Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches, AI Communications, Vol. 7, No. 1, 1994, pp. 39-59.
[2] Anick, P. and Vaithyanathan, S., Exploiting Clustering and Phrases for Context-Based Information Retrieval, SIGIR, 1997.
[3] Cohen, W. W. and Singer, Y., Context-Sensitive Learning Methods for Text Categorization, SIGIR '96: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996, pp. 307-315.
[4] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R., Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, Vol. 41, 1990, pp. 391-407.
[5] Lewis, D. D. and Ringuette, M., A Comparison of Two Learning Algorithms for Text Categorization, Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR '94), 1994.
[6] Li, F. and Yang, Y., A Loss Function Analysis for Classification Methods in Text Categorization, The Twentieth International Conference on Machine Learning (ICML '03), 2003, pp. 472-479.
[7] Porter, M. F., An Algorithm for Suffix Stripping, Program, Vol. 14, No. 3, 1980, pp. 130-137.
[8] Raghavan, P., Information Retrieval Algorithms: A Survey, Symposium on Discrete Algorithms, ACM-SIAM, 1997.
[9] Riloff, E., Little Words Can Make a Big Difference for Text Classification, Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995, pp. 130-136.
[10] Salton, G. and Lesk, M. E., Computer Evaluation of Indexing and Text Processing, Journal of the Association for Computing Machinery, Vol. 15, No. 1, 1968, pp. 8-36.
[11] Salton, G. and Buckley, C., Term-Weighting Approaches in Automatic Text Retrieval, Information Processing and Management, Vol. 24, No. 5, 1988, pp. 513-523.
[12] Weiss, S. M. and Indurkhya, N., Predictive Data Mining: A Practical Guide, Morgan Kaufmann, San Francisco, 1998.
[13] Wiener, E., Pedersen, J. O. and Weigend, A. S., A Neural Network Approach to Topic Spotting, Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval (SDAIR '95), 1995.