A Classiﬁcation Method of Open-ended Questionnaires using

Category-based Dictionary from Sampled Documents

Keiichi Hamada

, Masanori Akiyoshi

, Masaki Samejima

and Hiroaki Oiso

Department of Information Science and Technology, Osaka University, 2-1 Yamadaoka, Suita, Osaka, Japan

Faculty of Applied Information Science, Hiroshima Institute of Technology, 2-1-1 Miyake saeki-ku, Hiroshima, Japan

Codetoys K. K, 8F Dojima building, 2-6-8 Nishitenma, Kita-ku, Osaka, Japan

Keywords:

Open-ended Questionnaires, Typical Words Involvement Degrees, Co-occurrence Pattern.

Abstract:

This paper addresses a classiﬁcation method of open-ended questionnaires using a category-based dictio-

nary. Different from other classiﬁcation methods, our proposed method introduces a category-based dictio-

nary which is generated from a small set of categorized samples. This category-based dictionary is used to

judge questionnaire categories with t f-id f(term frequency inverted document frequency) and co

t f-id f(co-

occurrence t f-id f). Experimental questionnaires about a university lecture show that 71% of these question-

naires are classiﬁed accurately.

1 INTRODUCTION

Recently, various types of questionnaires (e.g. close-

ended, open-ended) are collected to improve contents

or services. In open-ended questionnaires, people

write opinions in their own words that are expected

to involve signiﬁcant information. Analysts classify

the questionnaires into categories for grasping which

types of opinions are useful. However, analysts spend

a lot of time for reading the questionnaires in order to

classify them. In order to classify them into categories

efﬁciently, this paper addresses an efﬁcient classiﬁca-

tion of open-ended questionnaires.

One of the document classiﬁcation methods is

text mining(Berry, 2003) which analyzes and clas-

siﬁes large amounts of text data (e.g. news arti-

cles(Atkinson and Van der Goot, 2009), patent doc-

uments(Tseng et al., 2007)). SVM(Support Vector

Machine) and clustering are also the popular machine

learning techniques that are useful for questionnaire

classiﬁcation by a number of questionnaires(Zhang

and Lee, 2003)(Chim and Deng, 2008). However, the

questionnaires include grammatical errors and typos,

and are not accumulated for the classiﬁcation, while

make it difﬁcult to apply the text mining and the ma-

chine learning classiﬁer.

t f -id f(term frequency inverted document fre-

quency) (Salton and Buckley, 1988) that indicates

characteristics of words is a successful approach to

document classiﬁcation(Ramos, 2002). It is pos-

sible to ﬁnd characteristic words in each category

by t f-id f. Documents can be classiﬁed by empha-

sizing the characteristic words in comparing doc-

uments. But some classiﬁcation methods using

t f -id f(Trieschnigg et al., 2009) need tuning param-

eters to be determined manually for every targets,

which are difﬁcult to decide based on question-

naires that are not accumulated. In addition, text

mining techniques using co-occurrence patterns have

been proposed for supporting document classiﬁca-

tion (e.g. a keyword extraction algorithm using a

set of co-occurrence between each term and frequent

terms(Matsuo and Ishizuka, 2004)).

Based on our investigation on contents of ques-

tionnaires in each category, we found that there are

characteristic words and co-occurrence patterns. So,

we propose the classiﬁcation method using t f-id f

which considers words and co-occurrence patterns. In

order to reﬂect the characteristics of the categories to

t f -id f, the proposed method uses samples of ques-

tionnaires categorized by analysts in advance and cal-

culates “typical words involvement degree” between

an inputted questionnaire and each category based on

the samples. The questionnaire is classiﬁed into some

categories that have high typical words involvement

degrees with the questionnaire.

193

Hamada K., Akiyoshi M., Samejima M. and Oiso H..

A Classiﬁcation Method of Open-ended Questionnaires using Category-based Dictionary from Sampled Documents.

DOI: 10.5220/0003971601930198

In Proceedings of the 14th International Conference on Enterprise Information Systems (ICEIS-2012), pages 193-198

ISBN: 978-989-8565-12-9

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

Category based

dictionary

Category classification

sample

Unclassified

questionnaire

・・・・・・・・

Unknown

questionnaire

Category

classification

・・・・・・・・

Questionnaire

data

Manual

registration

Cat Cat

Manual

registration

Automatic

registration

Important words

refer

Judgement of typical

words involvement

refer

Sampling

Word tf-idf

〇〇

1.0

△△

0.7

…

Co-occurrence

pattern

co_tf-idf

〇〇, △△

1.0

〇〇, ××

0.6

…

Cat Cat

Cat

Figure 1: Outline of analysis support system.

2 CLASSIFICATION SUPPORT

SYSTEM

2.1 System Overview

Figure 1 shows the outline of the classiﬁcation sup-

port system. In advance, analysts pick up parts of

questionnaire data, and classify the sentences of the

questionnaires into categories Cat

. These sentences

are deﬁned as “category classiﬁcation samples”. The

goal of this research is to classify questionnairesusing

as few “category classiﬁcation samples” as possible,

because there are not many questionnaires and a few

“category classiﬁcation samples” make analysts easy

to classify to categories they expect.

For the classiﬁcation, this system uses only noun,

verb, and adjective that indicate the contents of the

questionnaires. The questionnaires are classiﬁed by

using similarity between “category classiﬁcation sam-

ples” and the questionnaires.

Because each sentence may have different con-

tents in the questionnaire, this system does not clas-

sify the questionnaire, but a sentence in the question-

naire. In case that sentences in a questionnaire have

contents of different categories(e.g. Cat

and Cat

the questionnaire is classiﬁed into the categories(e.g.

Cat

and Cat

Being inputted to the system, a questionnaire is

separated to sentences for the above reason. And the

system removes words in “general DB” which is a

database of stop words(e.g. do, be).

In order to classify questionnaires with high accu-

racy, we need to deﬁne how to decide the similarity.

2.2 Approach

The typical method to decide similarities is to ﬁnd

common words in “category classiﬁcation samples”.

But, the common words in these samples do not indi-

cate characteristics of words. So, questionnaires are

often classiﬁed wrongly.

Because analysts classify questionnaires based on

the meaning of the categories, the questionnaires are

similar to each other in the same category. Accord-

ing to investigate questionnaires, it is considered that

there are some features of questionnaires as below.

• A questionnaire includes typical words which are

words or synonyms included in other question-

naires in the same category. And, co-occurrence

patterns consisting of the typical words appear in

a questionnaire.

• Some categories have important words that char-

acterize the categories.

We deﬁne “typical words involvement degree” as

the similarity to classify questionnaires based on the

above features. This degree is an index based on how

many typical words and co-occurrence patterns ap-

pears in a questionnaire. Also, this degree is calcu-

lated by “category-based dictionary” that consists of

typical degrees of words and co-occurrence patterns.

And “important words DB” which is a database of

words that analysts consider to characterize a cate-

gory. “Category-based dictionary” is constructed by

category classiﬁcation samples, and “important words

DB” is manually constructed by analysts.

3 JUDGMENT OF TYPICAL

WORDS INVOLVEMENT

DEGREES

3.1 Construction of “Category-based

Dictionary”

As we mentioned before, a sentence in a question-

naire often includes typical words and co-occurrence

patterns. It is necessary to calculate typical degrees

of words and co-occurrence patterns. As for the con-

struction of “category-based dictionary”, t f −id f and

co t f-id f(co-occurrencet f-id f) are used as these de-

grees as shown in (1), (2), and Table 1.

t f -id f(w

) = t f(w

) × log

d f(w

)

(1)

co t f-id f(w

, w

) =

t f

, w

) × log

d f

, w

)

(2)

ICEIS2012-14thInternationalConferenceonEnterpriseInformationSystems

194

Table 1: Deﬁnition and condition of symbols.

Symbol Deﬁnition/Condition

j, k, l 1 ≤ j, k, l ≤ Const (Const is the number of

words in “category classiﬁcation samples”)

a word

t f(w

) the number of occurrences for w

in a category

N the number of all categories

d f (w

) the number of categories w

appears

t f

, w

) the number of occurrences which w

and w

appear together in a category

d f

, w

) the number of categories w

and w

appears

“Important words DB” is used for reﬂecting im-

portant words in typical words to typical degrees.

When a word is in a set “X” of words in “im-

portant words DB”, importance degrees of typical

words(C

≥ 1) are weighted to t f − id f and

co t f-id f as shown in (3), (4).

t f -id f(w

) = C

× t f(w

) × log

d f(w

)

(3)

co t f-id f(w

, w

) =

× t f

, w

) × log

d f

, w

)

(4)

s.t. w

∈ X, w

or w

∈ X

Finally, because t f − id f and co t f-id f in

“category-based dictionary” may not be normalized,

we make them normalize into a range of 0 and 1 for

each category.

3.2 Calculation of Typical Words

Involvement Degrees

“Typical words involvement degree” should be based

on both typical words and co-occurrence patterns. So,

typical words involvement degree R for a category is

decided by an average of R

and R

cot

as shown in (5)

- (9), and Table 2.

R =

+ R

cot

(5)

′

(6)

cot

′

cot

(7)

′

∑

1≤m≤n

t f -id f(w

) (8)

′

cot

∑

1≤p<q≤n

co t f-id f(w

, w

) (9)

Table 2: Deﬁnition of symbols.

Symbol Deﬁnition

and n

cot

the number of common words and com-

mon co-occurrence patterns between

“category-based dictionary” and the

sentence

3.3 Judgment by Typical Words

Involvement Degree and It’s

Example

A questionnaire is classiﬁed into the target category

by judging whether typical words involvement degree

R is higher than the threshold value S

as shown in

(10), (11), and Table 3.

sum

num

(10)

sum

∑

∈Cat

(11)

Table 3: Deﬁnition of symbols.

Symbol Deﬁnition

num

the number of sentences in Cat

the sentence r

typical words involvement degree R for the

sentence r

Figure 2 shows an example of typical words in-

volvement judgment. This target sentence includes

word “History” and co-occurence pattern “Calculator,

History” and “Calculator, Knowledge”. The typical

words involvement degree R for Cat

is 0.6965 which

is higher than S

= 0.5. Thus, the questionnaire in-

cluding this target sentence is classiﬁed to Cat

Category based

dictionary

Judgement of typical

words involvement

refer

Word tf-idf

Once 1.0

History 0.636

Difficult 0.144

Co-occurrence pattern co_tf-idf

Calculator, History 1.0

Calculator, Difficult 0.836

Calculator, Knowledge 0.514

Cat

A target sentence

In this lecture, I got many

knowledge about the

history of calculator.

Typical words involvement degree R

= {0.636 / 1 + (1 + 0.514) / 2} / 2 ≓ 0.6965 > 0.5

Rcot

(S = 0.5)

Figure 2: An example of typical words involvement judg-

ment.

3.4 Automatic Acquisition Method of

Importance Degrees C

The system needs to determine C

and C

automat-

ically based on relations between category classiﬁ-

AClassificationMethodofOpen-endedQuestionnairesusingCategory-basedDictionaryfromSampledDocuments

195

cation samples and “category-based dictionary”, be-

cause it is difﬁcult for analysts to determine C

and

in advance.

A correct sentence which is classiﬁed into a tar-

get category in category classiﬁcation samples should

havea high typical words involvementdegree because

the sentence has common words with the target cate-

gory. However, an incorrect sentence which is not

classiﬁed into the target category should have a low

typical words involvement degree because the sen-

tence lacks many related words to the topic. In ad-

dition, it should have low standard deviation of typ-

ical words involvement degrees for correct sentences

because typical words involvement degrees of the cor-

rect sentences are similar each other.

This system determines suitable C

under the

conditions that (1) a correct sentence has higher typ-

ical words involvement degree than that every incor-

rect sentence has, and (2) the standard deviation of

typical words involvement degrees for correct sen-

tences is as low as possible as shown in Figure 3.

Inappropriate

(C , C )

The feature of

category classification

sample Cat

Typical words

involvement degrees

･Mixed situation as to

correct and incorrect sentences

･

Large standard deviation as to

correct sentences

Category classification

sample

××

Appropriate

(C , C )

Typical words

involvement degrees

Category classification

sample

××

(C1, C2)：A pair of importance

degree of typical words

: Correct sentence

: Incorrect sentence

: Standard deviation for

correct sentences

××

･Separated situation as to

correct and incorrect sentences

･

Small standard deviation as to

correct sentences

Category classification

sample

Cat CatCat

Figure 3: Desirable situation as to correct and incorrect

opinions in category classiﬁcation sample.

Based on this, the automatic acquisition algorithm

is as follows.

1. For each combination C

and C

, calculate the

number of incorrect sentences as number, that

have higher typical words involvement degree

than the minimum of correct sentences in the con-

dition of 1.0 ≤ C

≤ C

max

C1.0 ≤ C

≤ C

max

2. Create class P including the combination C

and

which has lowest number.

3. For each combination of C

and C

in P, calcu-

late the standard deviation as sd, of typical words

involvement for correct sentences.

4. Determine the combination of C

and C

that lead

the lowest sd.

4 EVALUATION

4.1 Results of Experiment

We executed an experiment to evaluate effectiveness

of our proposed method. The target questionnaires

data are questionnaires on “Give what you learned or

what you feel about calculators’ history” in a univer-

sity lecture. The number of questionnaires is 165 with

an average of 3.0 sentences per a questionnaire and

with an average of 8.3 words per a sentence. We pro-

vided 10 categories in advance, and each extraction

has 20 questionnaires as “category classiﬁcation sam-

ples” that each category has at least two sentences. An

average of sentences including category classiﬁcation

samples per an extraction is 67.0. In this experiment,

we calculated the average of results for 5 times of ex-

tractions. Evaluation criteria are the recall rate, preci-

sion rate, and F measure as shown in Table4.

For separating a sentence to words, we used mor-

phological analysis “Japanese morphological analy-

sis” in Yahoo!Japan Developer Network. In addition,

we deﬁned C

max

= 10.0 in automatic acquisition of

importance degrees C

and C

We compared the effectiveness by the proposed

method to ones by other methods: the method of clus-

tering and the method of SVM. In the SVM, the num-

bers of occurrences for the top ﬁve frequently-used

words are used as vector elements.

Table 4: Deﬁnition of evaluation criteria.

Criteria Deﬁnition

recall rate recall = [

∑

the number of classiﬁed question-

naires correctly using a method for Cat

] / [

∑

the number of classiﬁed questionnaires manu-

ally for Cat

]

precision rate precision = [

∑

the number of classiﬁed ques-

tionnaires correctly using a method for Cat

]

/ [

∑

the number of classiﬁed questionnaires

manually for Cat

]

F measure [2× recall × precision] / [recall + precision]

ICEIS2012-14thInternationalConferenceonEnterpriseInformationSystems

196

Figure 4 shows that our proposed method is the

best classiﬁcation accuracy of the three methods. The

accuracy allows analysts to understand the number of

questionnaires in each category and the contents of

categories reading “category classiﬁcation samples”.

Clustering SVM Proposed method

Recall rate, Precision rate,

F measure(%)

Recall rate Precision rate F measure

Figure 4: Result of classiﬁcation experiment.

4.2 Discussion

Figure 5 shows F measures in increasing the number

of category classiﬁcation when the number of sam-

ples is changed from 20 to 100.

Even if category classiﬁcation samples are in-

creased, the accuracy of the proposed method is better

than the method of SVM. So, it is conﬁrmed that the

proposed method does not depend on the number of

category classiﬁcation samples. Thus, our proposed

method can classify questionnaires at a reduced cost.

20 samples 50 samples 100 samples

F measure(%)

SVM Proposed method

Figure 5: Result of classiﬁcation experiment by using in-

creased category classiﬁcation sample.

In order to verify the adequacy of important

words, we compared our proposed method to the

method without important words as C

= C

= 1.0.

Figure 6 shows the result for Data Set “A”(F measure

= 72.4%) which has the best F measure in 5 data sets

using our proposed method. Figure 6 shows that re-

call rate is increased by 4%, precision rate is increased

by 11%, F measure is increased by 8%.

Table 5 showsC

and C

for each category includ-

ing more than 20 questionnaires and F measure by

our proposed method. C

and C

are decided to 1.0

Without

important words

With

important words

Recall rate, Precision rate,

F measure(%)

Recall rate Precision rate F measure

Figure 6: Result of comparison with/without important

words.

in Category 4 because it does not have any important

words. Table 6 shows the result of sensitivity analysis

of changes in C

and C

for Category 6.

Table 6 shows that the F measure by our proposed

method is 5% lower than that one by using the best

and C

in Category 6. Thus, we can prove impor-

tant words are effective, and it is possible to improve

classiﬁcation accuracy by the improvement of the au-

tomatic acquisition method of C

and C

Table 5: C

and C

for each category and F measure.

Category Number 1 2 3 4 5 6

6.8 2.0 1.0 1.0 10.0 3.0

1.0 3.4 1.2 1.0 10.0 1.0

F measure(%) 76.6 67.9 82.5 55.4 85.1 75.0

Table 6: F measure(%) in category 6.

1 3 5 7 9 10

1 68.1 76.0 76.0 76.0 76.0 76.0

3 75.0 80.0 78.6 77.2 77.2 77.2

5 75.0 78.6 77.2 77.2 77.2 77.2

7 75.0 78.6 77.2 77.2 77.2 77.2

9 75.0 77.2 77.2 77.2 76.7 76.7

10 75.0 77.2 77.2 77.2 76.7 76.7

Figure 7 shows the results for each category in

Data Set “A” and Data Set “B”(F measure = 68.6%).

Table 7 shows the number of manually classiﬁed

questionnaires for each category that includes more

than 20 questionnaires. Both of these results show

that the classiﬁcation accuracy differs in each cate-

gory. The F measure for Category 5 in both data sets

is about 85%, but Category 4 is about 50%. This dif-

ference depends on whether the category’s content is

clear or not. Table 8 and 9 show the sentences in-

cluded in Category 4 and 5, respectively. The sen-

tences in categories which have clear contents e.g.

Category 5 includes clear words that characterize the

category, and these clear words have a high value of

AClassificationMethodofOpen-endedQuestionnairesusingCategory-basedDictionaryfromSampledDocuments

197

1 2 3 4 5 6

F measure(%)

Data set ``A'' Data set ``B''

Figure 7: Result of classiﬁcation experiment for 2 category

classiﬁcation sample patterns.

t f -id f and co t f-id f. So, the sentences with clear

contents can be classiﬁed accurately. On the other

hand, in categories which is confused contents such

as Category 4, the system can not identify words that

characterize the category, and the words appear in

other categories. Thus it is difﬁcult to classify in these

categories.

Table 7: The number of correctly categorized opinions for

each category.

Category Number 1 2 3 4 5 6

Data set “A” 89 53 46 39 25 22

Data set “B” 89 60 36 43 21 30

Table 8: Examples of category 4 “Dangerousness and how

to deal with information society”.

I would like to learn how to deal with overﬂooding

information.

But such information is not always right.

But I do not know whether it is good to depend on

information in the web.

Table 9: Example of category 5 “Electronic tag technol-

ogy”.

And it is nice to know electronic tag is used in book

stores’ security system.

I understood that electronic tags are used every-

where.

I think there will be no more cash registers in the

future because electronic tags are used for in all

goods.

5 CONCLUSIONS

This paper addressed the classiﬁcation method of

open-ended questionnaire using category-based dic-

tionary from category classiﬁcation samples. Our

proposed method uses typical words involvement de-

gree which is an index that measures the number of

typical words and co-occurrence patterns that charac-

terize a category. By applying our proposed method

to questionnaires about a university lecture, 71% of

these questionnaires are classiﬁed accurately. As a

result of experiments, the clearer the contents are, the

more accurate the proposed method can classify the

questionnaires.

REFERENCES

Atkinson, M. and Van der Goot, E. (2009). Near real time

information mining in multilingual news. In Proceed-

ings of the 18th international conference on World

wide web, WWW ’09, pages 1153–1154, New York,

NY, USA. ACM.

Berry, M. (2003). Survey of Text Mining : Clustering, Clas-

siﬁcation, and Retrieval. Springer.

Chim, H. and Deng, X. (2008). Efﬁcient phrase-based doc-

ument similarity for clustering. IEEE Transactions on

Knowledge and Data Engineering, 20:1217–1229.

Matsuo, Y. and Ishizuka, M. (2004). Keyword extraction

from a single document using word co-occurrence sta-

tistical information. International Journal on Artiﬁ-

cial Intelligence Tools, 13(1):157–169.

Ramos, J. (2002). Using TF-IDF to Determine Word Rele-

vance in Document Queries. Technical report, Depart-

ment of Computer Science, Rutgers University, 23515

BPO Way, Piscataway, NJ, 08855e.

Salton, G. and Buckley, C. (1988). Term-weighting ap-

proaches in automatic text retrieval. Information pro-

cessing and management, 24(5):513–523.

Trieschnigg, D., Pezik, P., Lee, V., de Jong, F., Kraaij,

W., and Rebholz-Schuhmann, D. (2009). MeSH Up:

effective MeSH text classiﬁcation for improved doc-

ument retrieval. Bioinformatics (Oxford, England),

25(11):1412–1418.

Tseng, Y.-H., Lin, C.-J., and Lin, Y.-I. (2007). Text mining

techniques for patent analysis. Inf. Process. Manage.,

43:1216–1247.

Zhang, D. and Lee, W. S. (2003). Question classiﬁcation

using support vector machines. In Proceedings of the

26th annual international ACM SIGIR conference on

Research and development in informaion retrieval, SI-

GIR ’03, pages 26–32, New York, NY, USA. ACM.

ICEIS2012-14thInternationalConferenceonEnterpriseInformationSystems

198