TF-IDF method, because the ConfWeight method requires time-consuming statistical calculations, such as computing Student's t-distribution values and defining a confidence interval for each word.
3.4 Novel Term Relevance Estimation
(TRE)
The main idea of the method (Gasanova et al., 2013) is similar to ConfWeight, but it is far less time-consuming. The idea is that every word that appears in an article contributes some value to each class, and the class with the largest total value is defined as the winner for this article.
For each term we assign a real number, its term relevance, which depends on the term's frequency in utterances. The term weight is calculated using a modified formula of fuzzy rule relevance estimation for fuzzy classifiers (Ishibuchi et al., 1999), in which the membership function has been replaced by the word frequency in the current class. The details of the procedure are as follows:
Let $L$ be the number of classes; $n_i$ is the number of articles which belong to the $i$-th class; $N_{ji}$ is the number of occurrences of the $j$-th word in all articles from the $i$-th class; $T_{ji} = N_{ji}/n_i$ is the relative frequency of the $j$-th word occurrence in the $i$-th class.
$R_j = \max_i(T_{ji})$; $S_j = \arg\max_i(T_{ji})$ is the number of the class which we assign to the $j$-th word.
The term relevance, $C_j$, is given by
$$C_j = \frac{1}{\sum_{i=1}^{L} T_{ji}} \cdot \left( R_j - \frac{1}{L-1} \sum_{\substack{i=1 \\ i \neq S_j}}^{L} T_{ji} \right). \eqno(4)$$
$C_j$ is higher when the word occurs frequently in one class than when it appears in many classes. We use this novel term weight (TW) as an analogue of IDF in text preprocessing.
The learning phase consists of computing the $C_j$ values for each term; that is, the algorithm uses only statistical information obtained from the training set.
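The learning phase described above can be sketched in a few lines. The function name `tre_weights` and the data layout (tokenized articles with integer class labels) are illustrative assumptions, not part of the original method's implementation; the arithmetic follows Eq. (4):

```python
from collections import Counter

def tre_weights(docs, labels, num_classes):
    """Sketch of the TRE learning phase, Eq. (4).

    docs   -- list of tokenized articles (lists of words)
    labels -- class index (0 .. num_classes-1) for each article
    Returns C (word -> term relevance) and S (word -> assigned class).
    """
    n = Counter(labels)                       # n_i: articles per class
    N = {}                                    # word -> per-class occurrence counts N_ji
    for words, c in zip(docs, labels):
        for w in words:
            N.setdefault(w, [0] * num_classes)[c] += 1

    C, S = {}, {}
    for w, counts in N.items():
        T = [counts[i] / n[i] for i in range(num_classes)]  # T_ji = N_ji / n_i
        R = max(T)                            # R_j
        s = T.index(R)                        # S_j: class assigned to word j
        others = sum(T) - T[s]                # sum over i != S_j
        C[w] = (R - others / (num_classes - 1)) / sum(T)
        S[w] = s
    return C, S
```

A word concentrated in a single class gets relevance close to 1, while a word spread evenly across classes gets relevance close to 0.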
4 CLASSIFICATION
ALGORITHMS AND FEATURE
SELECTION
We have considered four different text preprocessing methods (binary representation, TF-IDF, ConfWeight, and the novel TRE method) and compared them using different classification algorithms. The methods have been implemented using RapidMiner (Shafait et al., 2010). The classification methods are:
- k-nearest neighbours algorithm with weighted vote (we have varied k from 1 to 15);
- kernel Bayes classifier with Laplace correction;
- neural network with error back propagation (standard settings in RapidMiner);
- fast large margin based on support vector machines (FLM) (standard settings in RapidMiner).
We use the macro F-score as the criterion of classification effectiveness. Precision for class $i$ is calculated as the number of correctly classified articles of class $i$ divided by the number of all articles which the algorithm assigned to this class. Recall is the number of correctly classified articles of class $i$ divided by the number of articles that actually belong to this class. Overall precision and recall are calculated as the arithmetic means of the per-class precisions and recalls (macro-averaging). The F-score is calculated as the harmonic mean of precision and recall.
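The macro F-score described above can be sketched as follows; the function name and argument layout are assumptions for illustration:

```python
def macro_f_score(y_true, y_pred, num_classes):
    """Macro F-score: per-class precision and recall are averaged
    arithmetically, then combined by the harmonic mean."""
    precisions, recalls = [], []
    for c in range(num_classes):
        assigned = sum(1 for p in y_pred if p == c)            # predicted as c
        actual = sum(1 for t in y_true if t == c)              # truly c
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == c)
        precisions.append(correct / assigned if assigned else 0.0)
        recalls.append(correct / actual if actual else 0.0)
    P = sum(precisions) / num_classes                          # macro precision
    R = sum(recalls) / num_classes                             # macro recall
    return 2 * P * R / (P + R) if P + R else 0.0
```

Macro-averaging weights every class equally regardless of its size, which is why it is preferred over micro-averaging when class distributions are skewed.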
Term weighting also provides a simple form of feature selection: we can ignore terms with low weight values (idf, Maxstr, or the novel TW). In this paper we propose a novel feature selection method based on term weighting, which can be applied only to text classification problems. First, we calculate the relative frequency of each word in each class on the training sample. Then, for each word, we choose the class with the maximum relative frequency, so that each word in the vocabulary has a corresponding class. When a new text arrives for classification, we calculate for each class the sum of the weights of the words which belong to that class. After this procedure the number of attributes equals the number of classes, so the method yields a very small number of attributes. The method can also be applied to binary preprocessing, for which standard feature selection is impossible.
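The classification step of the proposed feature selection can be sketched as below. The names `class_sum_features`, `weights`, and `word_class` are illustrative assumptions; `weights` would hold any of the term weights discussed (idf, Maxstr, or the novel TW), and `word_class` the class chosen for each vocabulary word on the training sample:

```python
def class_sum_features(words, weights, word_class, num_classes):
    """Collapse a tokenized text into num_classes attributes: for each
    class, sum the weights of the text's words assigned to that class."""
    features = [0.0] * num_classes
    for w in words:
        if w in word_class:                   # ignore out-of-vocabulary words
            features[word_class[w]] += weights[w]
    return features
```

A classifier then operates on these few per-class sums instead of the full bag-of-words vector, which drastically reduces dimensionality.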
5 RESULTS OF NUMERICAL
EXPERIMENTS
We have implemented four different text preprocessing methods (the binary method, TF-IDF, ConfWeight, and the novel TRE method). First, we measured the computational effectiveness of each text preprocessing technique: each method was run 20 times on the same computer (Intel Core i7 2.90 GHz, 8 GB RAM). Figure 1 compares the computational times of the different preprocessing methods.
We can see in Figure 1 that binary preprocessing is the fastest. TF-IDF and the novel TRE are approximately one and a half times slower than binary preprocessing and have almost the same computational efficiency. The most time-consuming method