USING CO-OCCURRENCE TO CLASSIFY UNSTRUCTURED DATA

IN TELECOMMUNICATION SERVICES

Motoi Iwashita, Ken Nishimatsu and Shinsuke Shimogawa

NTT Service Integration Laboratories, Midori-cho 3-9-11 Musashino, Tokyo, Japan

Keywords:

Co-occurrence, Correspondence analysis, Telecom operation, Text mining.

Abstract:

A variety of services have recently become available thanks to highly developed networks and personal equipment. Connecting this equipment becomes more complicated day by day as these services advance. Because software is often updated to keep up with advancements in services or security, problems such as no-connection are increasing, and determining the cause is difficult in some cases. Telecom operators must understand the situation and act as quickly as possible when they receive customer enquiries. In this paper, we propose a method for analyzing and classifying customer enquiries that enables quick and efficient responses. Because customer enquiries are generally stored as unstructured textual data, this method is based upon a co-occurrence technique to enable classification of a large amount of unstructured data into patterns.

1 INTRODUCTION

A conventional ﬁxed telephone service is simply pro-

vided by a telephone network. Because the net-

work structure is simple, it is easy to determine the

cause of service problems. Furthermore, telecom

operators with long-accumulated know-how can act

quickly. Recently, broadband infrastructures with

Asymmetric Digital Subscriber Line (ADSL) and op-

tical ﬁbers have penetrated the telecommunications

industry. This trend has induced the expansion of a variety of services, such as the exponential growth in Internet use, the provision of Voice over Internet Protocol (VoIP) and video distribution services, and security software countermeasures against virus attacks on PCs. As a result, the end-to-end network structure has become complicated once the connection of home equipment, such as a modem, and the setup of its software are taken into account, and discovering the cause of problems is difficult. Connecting service equipment will continue to become more complicated in the near future, given the drive towards ubiquitous services.

Customer satisfaction decreases when restoration takes a long time because the cause is difficult to discover. Therefore, it is necessary first to understand the features of the problem by classifying the enquiry data, and then to resolve the situation.

The saved information is not classified as structured data, i.e., it is unstructured data. Let us take a customer phone enquiry about no-connection as an example. There are several possible causes, such as failure of the optical fiber, modem, or application. Both normal and abnormal situations can appear in the same text, such as "the connection to the Internet is OK, but e-mail cannot be sent". Therefore, it is important to classify the unstructured textual data accurately.

A text mining technique, such as morphological

analysis, syntax analysis, co-occurrence relation, etc.,

is effective (Ohsumi, 2006),(Sato et al., 2007),(Sulli-

van, 2001),(Toda et al., 2005). This technique is ap-

plicable to customer questionnaire analyses in prod-

uct development, word searches in portal sites such

as Google and Yahoo, term frequency analyses in

web logs (blogs) or customer generated media, article

classiﬁcation by keyword in news articles, and eval-

uation indexes of a company’s image. Mainly mor-

phological analysis is applied in these areas to sur-

vey trends by analyzing the frequency of terms in se-

lected text. A keyword is extracted as a topic of a

sentence in terms of the features of the network struc-

ture (Masuo et al., 2001),(Ohsawa et al., 1997),(Cut-

ting et al., 1992),(Ho et al., 2001),(Leuski, 2001).

Clustering and co-occurrence related methods have been proposed to classify keywords and relate them to synonymous terms (different words having the same meaning) and to near-synonyms (words having similar meanings) (Uejima et al., 2004), (Rodoriguezd et al., 1998).

Iwashita M., Nishimatsu K. and Shimogawa S. (2008).

USING CO-OCCURRENCE TO CLASSIFY UNSTRUCTURED DATA IN TELECOMMUNICATION SERVICES.

In Proceedings of the International Conference on e-Business, pages 12-17

DOI: 10.5220/0001905500120017

 SciTePress

An improved method was proposed for synonymous term classification in fuzzy searches aimed at failure analysis (Naganuma et al., 2005). However, analyzing only the word trends in an enquiry is not always enough to understand failure trends and customer requirements; understanding the meaning of sentences is essential. No effective method for semantically analyzing text is currently applicable to telecom management.

In this paper, we propose a text classification method for classifying and analyzing a large amount of unstructured data consisting of customer enquiries. The method takes a semantic point of view by considering the features of telecom services and the co-occurrence of terms. The difficulties in analyzing textual data in telecom services with conventional techniques are explained in Sec. 2. Section 3 describes the features of telecom services. Our classification method is proposed in Sec. 4, and the results are discussed in Sec. 5.

2 DATA ANALYSIS IN TELECOM

OPERATIONS

2.1 Necessities of Data Classiﬁcation

and Text Mining

In general, the telecom operator saves the customer

enquiry and the coping process as information. The

aim of this is to enable the ﬁnding of similar prob-

lems by searching with related keywords and to en-

able quick action when such problems occur. These

coping processes are effective for sharing the knowl-

edge among assigned telecom operators and for im-

proving their skills. Therefore, this method is useful

when problems happen. However, the drawback of

this method is that it is impossible to get an overview

of all the possible patterns of a problem and to es-

tablish coping processes for more complex problems

in advance. Therefore, it is necessary to classify customer enquiries in advance so that effective coping processes can be established, optimal operators can be assigned, and failure trends can be surveyed.

The saved information generally consists of text. It takes a long time to analyze text word by word and to classify large amounts of textual data. Therefore, an effective method based on text mining techniques, such as term frequency analysis, identification of synonymous words, and extraction of related terms, is needed.

2.2 Limitations of Morphological

Analysis

Figure 1 shows the relationship between term fre-

quency and its ranking for 10,000 customer enquiries

about telecom services. The terms were classiﬁed and

counted by morphological analysis, and those that ap-

pear more than 50 times are shown in Fig. 1.

Figure 1: Relationship between term frequency and its rank-

ing.

"Category 1" is text relating to equipment, such as a modem, PC, telephone, etc., whereas "Category 2" is text relating to a failure phenomenon, such as setup trouble, no-connection, cable breakdown, etc. "Category 0" means terms analyzed without considering categories.

All the patterns show a 1/n feature, so they follow a power law (Newman, 2005), (Zipf, 1949), as is usual for general text. A power law implies

that there are many kinds of terms in the textual data

and means that there are many types of customer en-

quiries. Since the slope of Category 1 is steeper than

the other two, the terms used in that textual data are

limited compared with the terms in the other cate-

gories. On the other hand, the slope of Category 2

is gradual, which shows the variety of failure phenomenon terms. These results show that morphological analysis alone can only survey the terms with high frequency, such as PC, telephone, and Internet. To classify the enquiries, however, it is also necessary to understand the relationships among the terms.
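As a rough illustration of the rank-frequency check described above, the following sketch counts terms, ranks them by frequency, and estimates the slope in log-log space by least squares. The term list is hypothetical toy data constructed to follow an ideal 1/n pattern; it is not the enquiry data of Fig. 1, and the function names are chosen here only for illustration.

```python
import math
from collections import Counter


def rank_frequency(terms):
    """Return (rank, frequency) pairs, rank 1 being the most frequent term."""
    counts = Counter(terms)
    ordered = sorted(counts.values(), reverse=True)
    return [(rank, freq) for rank, freq in enumerate(ordered, start=1)]


def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical toy data: the k-th term appears 120 // k times (an ideal 1/n pattern).
toy_terms = []
for k, term in enumerate(["PC", "modem", "telephone", "router", "cable", "mail"], start=1):
    toy_terms.extend([term] * (120 // k))

pairs = rank_frequency(toy_terms)
slope = loglog_slope(pairs)
print(pairs)
print(round(slope, 2))
```

A slope close to -1 corresponds to the 1/n feature; a steeper slope, as reported for Category 1, would indicate that a few terms dominate the category.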

2.3 Limitations of Correspondence

Analysis

To classify into groups by term features, correspon-

dence analysis (Benzecri, 1992), (Hayashi, 1993),

(Takahashi, 1996) is applied. The terms with fre-

quency rankings higher than 400 were selected from

10,000 customer enquiries. The analysis of up to the

10th factor is shown in Table 1, and a graph of the 1st

and 3rd factors is shown in Fig. 2.

The i-th contribution rate, $r_i$, is calculated as follows.

$$r_i = \frac{\mathrm{Eigenvalue}(i)}{\sum_{j} \mathrm{Eigenvalue}(j)} \qquad (1)$$

These results show the difficulty of describing the features by correspondence analysis. This is because the accumulated contribution rate is under 20%, even when factors up to the 10th are considered, and most of the terms are concentrated in a small area of the factor plane. Therefore, it is clear that some preprocessing is needed before classification.
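The contribution rate of Eq. (1) and its accumulated value can be sketched as follows. The eigenvalues are hypothetical numbers chosen only to make the arithmetic easy to follow; they are not the values of Table 1, and the function names are illustrative.

```python
from itertools import accumulate


def contribution_rates(eigenvalues):
    """Eq. (1): each eigenvalue divided by the sum of all eigenvalues."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    return [value / total for value in eigenvalues]


def accumulated_rates(rates):
    """Running total of the contribution rates, factor by factor."""
    return list(accumulate(rates))


# Hypothetical eigenvalues, chosen only for illustration.
toy_eigenvalues = [4.0, 3.0, 2.0, 1.0]
rates = contribution_rates(toy_eigenvalues)
accumulated = accumulated_rates(rates)
print(rates)
print(accumulated)
```

When the accumulated rate stays low even after many factors, as in the analysis above, a two-factor plot carries little of the total information.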

Table 1: Correspondence analysis.

Figure 2: Relationship between 1st and 3rd factors.

3 FEATURES OF TEXTUAL DATA

IN TELECOM SERVICES

The features of textual data in telecom services are

explained in this section. An operator writes down information about a customer enquiry. Therefore, the style of the description depends heavily on the operator; that is, the description style is not standardized. There may be abbreviations of terms, wording that only that particular operator can understand, and so on. Moreover, there are many synonymous terms.

Taking service speciﬁcations of optical ﬁbers as an

example, there are descriptions using general names

(e.g. optical service, Fiber-To-The-Home (FTTH)),

and special/abbreviated names for customer enquiries

(e.g. description of product name, abbreviation). Let

us consider a situation where optical and telephone

services are provided to a customer. It is difﬁcult

to determine whether the cause of a problem is an

optical cable, modem failure, or application prob-

lem in the case of a phone call from a customer

about no-connection. Moreover, associated factors

that complicate the situation include partial trouble,

e.g. the Internet works whereas e-mail does not, or

no-connection because of software compatibility.

To summarize, it is difficult to apply a text mining technique directly to raw textual data in telecom management for semantic/structural classification. Therefore, we need a modification that takes the features of telecommunication services into account. A telecommunication service, such as Internet connection or VoIP, is generally provided by an end-to-end network consisting of a telephone, PC, the carrier's network, the provider's server, etc. (Fig. 3). Each component of the network clearly has its own characteristic events. We can predict that a component, such as the service, telephone, PC, or network, is strongly related to a problem, such as failure, misconfiguration of setup, or cable breakdown. Therefore, by designating the network component as one event and the problem as the other event, we can construct a semantic representation. Moreover, if a problem occurs in one piece of equipment in a network, it is expected to lead to problems in other components or to other events.

Figure 3: Telecom features categorization.

ICE-B 2008 - International Conference on e-Business

4 CLASSIFICATION METHOD

BASED ON CO-OCCURRENCE

4.1 Framework of Classiﬁcation

Textual data classification is required to cover all of the textual data and to fit the operator's way of thinking. It is desirable that the classification be determined from the viewpoints of term frequency, co-occurrence, and cause-effect relationship, as shown in Table 2. Term frequency tells us what kinds of customer enquiries appear often, while co-occurrence tells us which terms are strongly related. The cause-effect relationship tells us how multiple terms, such as a network component and a problem event, are related.

Table 2: Criteria for classiﬁcation.

Figures 4 and 5 show the frameworks in which

textual data is classiﬁed in terms of the previous three

criteria.

Classiﬁcation framework:

• Procedure 1: Classiﬁcation by type of access (Fig.

4 (a))

- Textual data is classiﬁed in terms of access, i.e.,

type of service, such as dial-up, ADSL, FTTH,

and so on.

• Procedure 2: Classiﬁcation by category (Fig. 4

(b))

- Textual data is classiﬁed in terms of categories

based on morphological analysis. For example,

Category A means component, such as a service,

telephone, PC, modem, etc., while Category B

means the problem, such as no-connection, mis-

conﬁguration, etc.

• Procedure 3: Calculation of term frequency and

co-occurrence (Fig. 5 (a))

- The frequency of x appearing in the textual data is represented by f(x), and the frequency of both x and y appearing in the textual data is represented by f(x,y). Let us select the pairs whose frequency is greater than β, where β is a given threshold. Then, calculate the co-occurrence as follows:

C(x,y) = f(x,y)/(f(x) + f(y) − f(x,y)) (2)

for any x and y such that x, y ∈ A or B.

• Procedure 4: Transition among multiple terms

(Fig. 5 (b))

- Let us choose a pair of terms such that

C(x, y) ≥ α, where α is a given threshold.

Step 1: Select a pair with Categories A and B

satisfying α ≤ C(i, j), i ∈ A, and j ∈ B.

Step 2: Select a pair with Category A satisfying

α ≤ C(i, j), i, j ∈ A, and α ≤ C(i,k), i ∈ A, and

k ∈ B.

Step 3: Select a pair with Category B satisfying

α ≤ C(i, j), i, j ∈ B, and α ≤ C(i,k), i ∈ B, and

k ∈ A.
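Procedure 3 can be sketched as follows. The sketch assumes that each enquiry has already been reduced to a set of terms by morphological analysis, that f(x) and f(x,y) count enquiries, and that the threshold β is applied to the pair frequency f(x,y); these are assumptions made for illustration, and the enquiries are hypothetical toy data.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(enquiries, beta):
    """Return {(x, y): C(x, y)} for pairs whose joint frequency f(x, y) exceeds beta."""
    term_freq = Counter()   # f(x): number of enquiries containing x
    pair_freq = Counter()   # f(x, y): number of enquiries containing both x and y
    for terms in enquiries:
        unique = sorted(set(terms))
        term_freq.update(unique)
        pair_freq.update(combinations(unique, 2))

    result = {}
    for (x, y), fxy in pair_freq.items():
        if fxy > beta:
            # Eq. (2): f(x, y) / (f(x) + f(y) - f(x, y))
            result[(x, y)] = fxy / (term_freq[x] + term_freq[y] - fxy)
    return result


# Hypothetical enquiries, each already reduced to its terms.
toy_enquiries = [
    {"Internet", "no-connection"},
    {"Internet", "no-connection", "modem"},
    {"VoIP", "no-connection"},
    {"Internet", "connection-OK"},
]
toy_cooc = cooccurrence(toy_enquiries, beta=1)
print(toy_cooc)
```

Here "Internet" and "no-connection" each appear in three enquiries and together in two, so Eq. (2) gives 2/(3 + 3 − 2) = 0.5; every other pair appears only once and is screened out by β.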

Figure 4: Term classiﬁcation based on morphological anal-

ysis.

Figure 5: Calculation based on co-occurrence.

4.2 Transition of Relativity Among

Multiple Terms

The effectiveness of procedure 4 in the previous section is explained in this section. For example, suppose a customer makes a general complaint that he/she is not able to send e-mail from his/her PC because there is no connection to the Internet. In that case, the cause might lie not in the PC but in the modem setup. Therefore, to support this transitional way of thinking, we need related keywords that suggest other causes, and the relationship among multiple terms should be made clear. If we calculate all co-occurrence values between the pairs of terms in Categories A and B, we need a long calculation time of $n^2 \times n^2 = O(n^4)$. If we use procedure 4, on the other hand, we can reduce the calculation time to $n \times n = O(n^2)$. This is because co-occurrence is calculated for each category as a unit.
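The three steps of procedure 4 can be sketched as follows. The sketch reads Steps 2 and 3 as "keep a within-category pair if at least one of its terms is also strongly tied to a term in the other category", which is one possible interpretation of the conditions; the co-occurrence values and term names are hypothetical.

```python
from itertools import combinations


def lookup(cooc, x, y):
    """Symmetric lookup of C(x, y); pairs that were never stored count as 0."""
    return cooc.get((x, y), cooc.get((y, x), 0.0))


def select_transitions(cooc, cat_a, cat_b, alpha):
    """Return (Step 1, Step 2, Step 3) pair lists for threshold alpha."""
    # Step 1: cross-category pairs (i in A, j in B) with C(i, j) >= alpha.
    cross = [(i, j) for i in sorted(cat_a) for j in sorted(cat_b)
             if lookup(cooc, i, j) >= alpha]

    def within(own, other):
        selected = []
        for i, j in combinations(sorted(own), 2):
            if lookup(cooc, i, j) < alpha:
                continue
            # Keep the pair only if one of its terms is tied to the other category.
            if any(lookup(cooc, t, k) >= alpha for t in (i, j) for k in other):
                selected.append((i, j))
        return selected

    # Steps 2 and 3: pairs inside Category A and inside Category B.
    return cross, within(cat_a, cat_b), within(cat_b, cat_a)


# Hypothetical categories and co-occurrence values.
cat_a = {"Internet", "VoIP", "modem"}
cat_b = {"connection-OK", "no-connection"}
toy_cooc = {
    ("Internet", "connection-OK"): 0.40,
    ("VoIP", "no-connection"): 0.35,
    ("Internet", "VoIP"): 0.30,
    ("connection-OK", "no-connection"): 0.28,
    ("modem", "no-connection"): 0.10,
}
step1, step2, step3 = select_transitions(toy_cooc, cat_a, cat_b, alpha=0.25)
print(step1, step2, step3)
```

Because each step only scans pairs of single terms, the work stays quadratic in the number of terms instead of enumerating every combination of an A-pair with a B-pair.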

5 DISCUSSION OF RESULTS

5.1 Co-occurrence in a Category

The order in which pairs are chosen depends strongly on the threshold used in the classification procedures. The relationship between term frequency and co-occurrence in textual data from 1000 customer enquiries is shown in Table 3. These 1000 customer enquiries are field data saved by telecom management for IP-based services during the daytime on one day in 2007. Pairs of terms are ordered by term frequency. There are three cases for co-occurrence with the threshold as a parameter. The selected pairs become similar to those chosen by term frequency when the threshold decreases. Because we screened and chose pairs with high frequency by procedure 2, using the choice decided by co-occurrence was effective for representing the features of the text. The threshold α is given a high value in the first step. If the number of selected pairs is small, α is decreased in a second step. In this way, iteratively decreasing α reveals the relationships among terms.

Table 3: Choice of pair of terms.

5.2 Co-occurrence Among Categories

Co-occurrence among categories was calculated for the textual data from 1000 customer enquiries, as shown in Fig. 6. We chose the marked pairs, which have a value greater than the given threshold. Seven pairs of terms were thus selected by co-occurrence. Moreover, there were pairs that had a strong transition rate within the same category. We merged those pairs and selected them as the 8th and 9th choices (Fig. 6).

Let us compare the proposed method with a method that uses only term frequency to select pairs of terms. The frequency of the 8th choice by pair frequency is 70, while that of the 8th choice by the proposed method is 54. The difference between the two choices is small for that amount of data, and the co-occurrence of the 8th choice by pair frequency is smaller than the given threshold, so that choice represents only weak relationships in the features of the text. The proposed method, in contrast, makes it possible to classify and understand complicated structure by selecting and relating terms with strong co-occurrence. Therefore, the choice by co-occurrence represents the features of the text.

Figure 6: Relationship between categories.

Figure 7 shows the choice of pairs with the threshold as a parameter. The number of candidates increases when the threshold decreases, because pairs with weaker co-occurrence are then admitted. The number of pairs with a transition rate also grows when the threshold decreases. The 6th choice is (1A and 2A) & (1B and 2B) when α = 0.25. 1A and 2A correspond to "Internet" and "VoIP", respectively, while 1B and 2B correspond to "connection is OK" and "no connection", respectively. We can therefore classify the text in a semantic sense, e.g., "Internet is OK, but VoIP has no connection".

The proposed data classification can provide an overview of all the possible patterns of a problem and allows coping processes to be established in advance. Furthermore, it has the potential to mine latent customer requirements that lead to new business.

Figure 7: Threshold and pairs.

6 CONCLUSIONS

A classiﬁcation technique for customer enquiries is

needed due to the increasing complexity of the con-

nections in end-to-end networks in the telecom oper-

ating ﬁeld. In this paper, we proposed one method

for analyzing and classifying customer enquiries that

enables quick and efﬁcient responses. Because cus-

tomer enquiries are generally stored as unstructured

textual data, this method is based upon morphologi-

cal analysis and co-occurrence techniques to enable

classiﬁcation of a large amount of unstructured data

into patterns. We applied the proposed method to 1000 customer enquiries and evaluated its effectiveness. The method can be applied not only to establishing coping processes in advance but also to mining potential requirements for new business.

We are currently conducting further study on ap-

plying this method to large amounts of data and on

determining a threshold for telecom operation.

REFERENCES

Benzecri, J.-P. (1992). Correspondence Analysis Hand-

book. Marcel Dekker.

Cutting, D., Karger, D., and Tukey, J. (1992). Scatter/gather: A cluster-based approach to browsing large document collections. In Proc. 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Hayashi, C. (1993). Quantiﬁcation -Theory and Method.

Asakura-shoten.

Ho, X., Ding, C., Zha, H., and Simon, H. (2001). Automatic

topic identiﬁcation using webpage clustering. In Proc.

2001 IEEE International Conference on Data Mining.

Leuski, A. (2001). Evaluating document clustering for in-

teractive information retrieval. In Proc. 2001 ACM

International Conference on Information and Knowl-

edge Management.

Masuo, Y., Ohsawa, Y., and Ishizuka, M. (2001). Document

as a small word. In Proc. JSAI 2001, International

workshop (LNAI2253), pages 444–448.

Naganuma, K., Isonishi, T., and Aikawa, T. (2005). Diamin-

ing: Text mining solution for customer relationship

management. Mitsubishi Technical Report, 79-4:259–

262.

Newman, M. (2005). Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46:323–351.

Ohsawa, Y., Benson, N., and H.Yachida (1997). Keygraph:

Automatic indexing by co-occurrence graph based on

building construction metaphor. In Proc. IEEE Forum

on Research and Technology Advances in Digital Li-

braries.

Ohsumi, N. (2006). Mining of textual data. recent trend

and its direction. http://wordminer.comquest.co.jp

/wmtips/pdf/20060910

1.pdf.

Rodoriguezd, M., Gomez-Hidalgo, J., and Diaz-Agudo, B. (1998). Using WordNet to complement training information in text categorization. In Proc. Recent Advances in Natural Language Processing.

Sato, S., Fukuda, K., Sugawara, S., and Kurihara, S. (2007).

On the relationship between word bursts in docu-

ment streams and clusters in lexical co-occurrence

networks. IPSJ, 48-SIG14:69–81.

Sullivan, D. (2001). Document Warehousing and Text Min-

ing. John Wiley.

Takahashi, S. (1996). Correspondence Analysis by Excel.

Ohm-sya.

Toda, H., Kataoka, R., and Kitagawa, H. (2005). Clustering

news articles using named entities. IPSJ SIG Techni-

cal Report, 2005-DBS-137:175–181.

Uejima, H., Miura, T., and Shioya, I. (2004). Improving

text categorization by synonym and polysemy. Trans.

on IECIE, J87-D-I, No. 2:137–144.

Zipf, G. (1949). Human Behavior and the Principle of Least

Effort. Addison-Wesley.
