DETECTION OF SUBMITTERS SUSPECTED OF

PRETENDING TO BE SOMEONE ELSE TO

MANIPULATE COMMUNICATIONS IN A COMMUNITY SITE

Naoki Ishikawa, Ryo Nishimura, Yasuhiko Watanabe, Yoshihiro Okada

Ryukoku University, Dep. of Media Informatics, Seta, Otsu, Shiga, Japan

Masaki Murata

NICT, Seika-cho, Soraku-gun, Kyoto, Japan

Keywords:

Spooﬁng, Manipulation of communication, Credibility, Community site.

Abstract:

Community sites offer greater learning opportunities to users than search engines. One of the essential fac-

tors provides learning opportunities to users in community sites is anonymous submission. This is because

anonymity gives users chances to submit messages (questions, problems, answers, opinions, etc.) without

regard to shame and reputation. However, some users abuse the anonymity and disrupt communications in

a community site. For example, some users pretend to be other users by using multiple user accounts and

attempt to manipulate communications in the community site. Manipulated communications discourage mes-

sage submitters, keep users from retrieving good communication records, and decrease the credibility of the

communication site. To solve this problem, we conducted an experimental study to detect submitters sus-

pected of pretending to be someone else to manipulate communications in a community site by using machine

learning techniques. In this study, we used messages in the data of Yahoo! chiebukuro for data training and

examination.

1 INTRODUCTION

In these days, many people use community sites, such

as Q&A sites and social network services, where

users share their information and knowledge. Com-

munity sites offer greater learning opportunities to

users than search engines in the following points:

1. Users can submit ambiguous questions because

other users give some supports to them. Further-

more, users can submit questions in natural and

expressive sentences, not keywords.

Figure 1 is a question submitted to a widely-used

Japanese Q&A site, oshiete! goo, by a student in

the author’s class of media processing. He aimed

to obtain a sample program and do his assignment

of the class. On the other hand, the author ex-

pected students in the class to give the following

keywords to search engines and ﬁnd informative

web pages and sample programs:

• keywords used in the problem statement of his

assignment. (e.g. ppm, ﬂip, horizontally, 90

degree clockwise)

• keywords which were not used in it. (e.g.

fopen, sscanf)

fopen and sscanf were commands of C program-

ming language. The students could not write C

programs for the assignment without using these

commands. However, it is difﬁcult to think of

these keywords, especially, for students who did

not have detailed knowledge of media processing

and C programming language. By the way, the

student received a sample program with detailed

explanation three hours later after submitting the

question of Figure 1.

2. Communications in community sites are interac-

tive. Users have chances to not only submit ques-

tions but give answers and, especially, join discus-

sions.

As a result, community sites are promising media for

education.

166

Ishikawa N., Nishimura R., Watanabe Y., Okada Y. and Murata M. (2010).

DETECTION OF SUBMITTERS SUSPECTED OF PRETENDING TO BE SOMEONE ELSE TO MANIPULATE COMMUNICATIONS IN A COMMUNITY

SITE.

In Proceedings of the 2nd International Conference on Computer Supported Education, pages 166-171

DOI: 10.5220/0002772301660171

 SciTePress

Figure 1: A question submitted to a Japanese Q&A site, oshiete! goo, by a student in the author’s class.

One of the essential factors which provides learn-

ing opportunities to users in community sites is

anonymous submission. In most community sites,

user registration is required for those who want to join

the community sites. However, registered users gen-

erally need not reveal their real names to submit mes-

sages (questions, problems, answers, opinions, etc.).

It is important to submit messages anonymously to

a community site. This is because anonymity gives

users chances to submit messages without regard to

shame and reputation. However, some users abuse the

anonymity and disrupt communications in a commu-

nity site. For example, some users pretend to be other

users by using multiple user accounts and attempt to

manipulate communications in the community site.

Manipulated communications discourage other sub-

mitters, keep users from retrieving good communica-

tion records, and decrease the credibility of the com-

munity site. As a result, it is important to detect sub-

mitters suspected of pretending to be other users to

manipulate communications in a community site. In

this case, identity tracing based on user accounts is

not effective because these suspicious submitters of-

ten attempt to hide their true identity to avoid de-

tection. A possible solution is authorship identiﬁ-

cation based on analyzing stylistic features of mes-

sages. In recent years, a large number of studies

have been made on authorship identiﬁcation (Craig

99) (de Vel 01) (Koppel 02) (Corney 02) (Argamon

03) (Zheng 06), however, few researchers addressed

the identiﬁcation issues of authors who submit mes-

sages in a community site. To solve this problem,

in this study, we propose a method of detecting sub-

mitters suspected of pretending to be someone else to

manipulate communications in a community site. In

this method, in order to detect submitters suspected of

pretending to be someone else, we used an submitter

identiﬁer which was developed by learning stylistic

features of user’s messages and determine by whom a

series of input messages are submitted. We used mes-

sages in the data of Yahoo! chiebukuro

, a widely-

used Japanese Q&A site, for observation, data train-

ing and examination.

2 COMMUNICATIONS

MANIPULATED BY MULTIPLE

ACCOUNT USERS

In this study, we used messages in the data of Yahoo!

chiebukuro for observation, data training, and exam-

ination. The data of Yahoo! chiebukuro was pub-

lished by Yahoo! JAPAN via National Institute of In-

formatics in 2007

. This data consists of about 3.11

million questions and 13.47 million answers which

were posted on Yahoo! chiebukuro from April/2004

to October/2005. In this section, we observed mes-

sages submitted to the following categories in Yahoo!

chiebukuro.

• PC,

• healthcare, and

• social issues.

http://chiebukuro.yahoo.co.jp

http://research.nii.ac.jp/tdc/chiebukuro.html

DETECTION OF SUBMITTERS SUSPECTED OF PRETENDING TO BE SOMEONE ELSE TO MANIPULATE

COMMUNICATIONS IN A COMMUNITY SITE

167

Table 1: The numbers of submitters and their submitted messages to PC, healthcare, and social issues category in Yahoo!

chiebukuro (from April/2004 to October/2005)

PC healthcare social issues

submitters messages submitters messages submitters messages

question 43493 171848 29954 84364 13259 78777

answer 27420 474687 38223 289578 25766 403306

Table 2: The numbers of submitters and their messages, who submitted more than 200 messages to PC, healthcare, or social

issues category in Yahoo! chiebukuro (from April/2004 to October/2005)

PC healthcare social issues

submitters messages submitters messages submitters messages

question 17 5970 5 1581 39 14088

answer 395 260183 134 57406 312 180503

Table 1 shows the numbers of submitters and

their submitted messages to these categories from

April/2004 to October/2005. Also, Table 2 shows the

numbers of submitters who submitted more than 200

messages. We think that multiple account users who

intend to manipulate communications in community

sites are frequent message submitters.

In Yahoo! chiebukuro, users need not reveal their

real names to submit their messages. However, their

messages are traceable because their user accounts

are attached to them. Because of this traceability, we

can collect any users messages and some of them in-

clude clues of identifying individuals. As a result,

to avoid identifying individuals, it is reasonable and

proper that users change their user accounts or use

multiple user accounts. However, the following types

of message submissions using multiple user accounts

are neither reasonable nor proper.

TYPE I a question and its answer are submitted by

one and the same user.

We think that the user intended to manipulate

the message evaluation. For example, in Yahoo!

chiebukuro, each questioner is requested to deter-

mine which answer is best and give a best answer

label to it. These message evaluations encourage

message submitters to submit new messages and

increase the credibility of the community site. We

think, in order to get best answer labels and seem

a good answerer, the user has repeated this type of

submissions.

TYPE II two or more answers are submitted to the

same question by one and the same user.

We think that the user intended to dominate or dis-

rupt communicationsin the community site. To be

more precise, the user intended to

• control communications by advocating or justi-

fying his/her opinions, or

• disrupt communications by submitting two or

more inappropriate messages.

These kinds of submissions discourage other submit-

ters, keep users from retrieving good communication

records, and decrease the credibility of the commu-

nity site. As a result, it is important to detect users

suspected of pretending to be someone else to manip-

ulate communications in a community site.

TYPE I submissions are sometimes obscurer than

TYPE II submissions because the standards of best

answer selection differ with each questioner. In other

words, it is more possible to disrupt communications

by TYPE II submissions than TYPE I. As a result, in

this study, we intend to investigatea method of detect-

ing users who have repeated TYPE II submissions.

3 DETECTION OF SUBMITTERS

SUSPECTED OF PRETENDING

TO BE SOMEONE ELSE

In order to detect users who repeated TYPE II sub-

missions, we intend to detect users who

• have similar styles of writing, and

• submitted answers to the same questions.

It is easy to detect users who submitted answers to the

same questions by using their submission records. As

a result, in this section, we explain a method of detect-

ing users who have similar styles of writing. Figure

2 shows the outline of our method of detecting users

who have similar styles of writing.

In our method, we used a submitter identiﬁer

which is based on analyzing stylistic features and de-

termines by whom a series of input messages are sub-

mitted. As shown in Figure 2, the submitter identi-

ﬁer consists of N user classiﬁers developed by learn-

CSEDU 2010 - 2nd International Conference on Computer Supported Education

168

Figure 2: The outline of our method of detecting users who

have similar styles of writing

s1 the results of morphological analysis on sentences

in the target message

s2 the results of morphological analysis on the sen-

tence and sentence No.

s3 character 3-gram extracted from sentences in the

target message

s4 character 3-gram extracted from the sentence and

its sentence No.

s5 1 ∼ 10 characters at the head of each sentence

s6 1 ∼ 10 characters at the end of each sentence

s7 sequential patterns extracted by PreﬁxSpan (fre-

quency is 5+, item number is 3+, maximum gap

number is 1, and maximum gap length is 1)

Figure 3: Features used in maximum entropy (ME) method

for learning stylistic features of submitters. PreﬁxSpan

(http://preﬁxspan-rel.sourceforge.jp/) is a method of mining

sequential patterns efﬁciently.

ing users’ stylistic features. Each classiﬁer has a tar-

get user and calculates the probability that a series

of input messages were submitted by the target user.

Then, the identiﬁer determines that a series of input

messages were submitted by the user with the highest

probability. When the user with the highest proba-

bility differs from the user submitted a series of in-

put messages, our method determines that these users

have similar styles of writing. For example, in Fig-

ure 2, a series of input messages submitted by user j

are given to the submitter identiﬁer. Then, the iden-

tiﬁer determines that the series of of input messages

were submitted by user i. In this case, our method de-

termines that user i and j have similar styles of writ-

ing. In this way, the key to detecting users of similar

writing styles is the user classiﬁers. As a result, we

explain below how to develop these user classiﬁers.

Suppose that user (rank i) submitted l answers to

a communication site, ranked i-th place in the ranking

of frequent answer submitters, and is the target user

of classiﬁer (rank i). When a series of m answers of

Table 3: The number of target users in PC, healthcare, and

social issues category.

PC healthcare social issues

submitters 395 134 312

messages 260183 57406 180503

user (rank j) are given to classiﬁer (rank i), probabil-

ity score score(i, j) that user i and j were one and the

same user and user i submitted the series of m answers

is calculated as follows:

score(i, j) =











∏

k=1

ijk

(in case of

∏

k=1

ijk

∏

k=1

(1− P

ijk

))

0 (in case of

∏

k=1

ijk

≤

∏

k=1

(1− P

ijk

))

where P

ijk

is the probability that user i submitted mes-

sage k (1 ≤ k ≤ m) in the series of m messages of user

(rank j). P

ijk

is calculated by classiﬁer (rank i), which

was developed by learning stylistic features of user

(rank i). Training data for learning stylistic features

of user (rank i) consists of

• n messages which were selected randomly from l

messages submitted by user (rank i), and

• n messages which are extracted randomly from

messages submitted by other users.

In this study, we used the maximum entropy (ME)

method for data training. Figure 3 shows feature

s1 ∼ s7 used in machine learning on experimental

data. s1 and s2 were obtained by using the results of

the morphological analysis on experimental data. s3

and s4 were obtained by extracting character 3-gram

from experimental data. This is because Odaka et al.

reported that character 3-gram is good for Japanese

processing (Odaka 03). s5 and s6 were introduced

because, we thought, clue expressions to the author

identiﬁcation are often found at the head and end of

sentences. s7 was obtained by using PreﬁxSpan

PreﬁxSpan is a method of mining sequential patterns

efﬁciently and often used in document classiﬁcation.

By using PreﬁxSpan, Tsuboi et al. identiﬁed mail

senders (Tsuboi 02) and Matsumoto et al. classiﬁed

reviews into positive and negative ones (Matsumoto

04).

4 EXPERIMENTAL RESULTS

To evaluate our method, we conducted the following

experiments:

http://preﬁxspan-rel.sourceforge.jp/

DETECTION OF SUBMITTERS SUSPECTED OF PRETENDING TO BE SOMEONE ELSE TO MANIPULATE

COMMUNICATIONS IN A COMMUNITY SITE

169

experiment 1 The accuracy measurement of the user

classiﬁers.

experiment 2 The accuracy measurement of the sub-

mitter identiﬁer.

experiment 3 The detection of users who have simi-

lar styles of writing and submitted answers to the

same questions.

In this experiment, the target users were all sub-

mitters who submitted over 200 answer messages to

PC, healthcare, or social issues category in Yahoo!

chiebukuro. Table 3 shows the numbers of target sub-

mitters and their messages in each category.

100

1 2 3 4 5

percent accuracy

number of input messages (answers)

training data (100 ans.)

healthcare

social issues

(a) training data (100 answers)

100

1 2 3 4 5

percent accuracy

number of input messages (answers)

training data (200 ans.)

healthcare

social issues

(b) training data (200 answers)

100

1 2 3 4 5

percent accuracy

number of input messages (answers)

training data (300 ans.)

healthcare

social issues

Figure 4: The accuracy of the classiﬁers which determine

whether a series of messages were submitted by their target

users, under the various number (1 ∼ 5) of input messages

and the various size (100, 200, and 300 messages) of train-

ing data. The target users were all submitters who submit-

ted over 200 answer messages to PC, healthcare, and social

issues category in Yahoo! chiebukuro.

100

5 10 15 20 25

percent accuracy

number of input messages (answers)

training data (100 ans.)

healthcare

social issues

(a) training data (100 answers)

100

5 10 15 20 25

percent accuracy

number of input messages (answers)

training data (200 ans.)

healthcare

social issues

(b) training data (200 answers)

100

5 10 15 20 25

percent accuracy

number of input messages (answers)

training data (300 ans.)

healthcare

social issues

Figure 5: The accuracy of the identiﬁer which determines

by whom a series of messages were submitted, under the

various number (1 ∼ 25) of input messages and the vari-

ous size (100, 200, and 300 messages) of training data. The

target users were all submitters who submitted over 200 an-

swer messages to PC, healthcare, and social issues category

in Yahoo! chiebukuro.

We developed experimental data for data training

and examination in the next way. First, in order to de-

velop experimental data of examination, we extracted

50 messages from each user’s messages. Then, from

the other messages of each user, we extracted 50, 100,

and 150 messages and, as mentioned in section 3, de-

veloped three different sizes (100, 200, and 300 mes-

sages) of experimental data for data training. In the

experiments, we used a package for maximum en-

tropy method, maxent

, for data training. We also

http://www2.nict.go.jp/x/x161/members/mutiyama/

software.html#maxent

CSEDU 2010 - 2nd International Conference on Computer Supported Education

170

Table 4: The numbers of user pairs who have similar styles

of writing and submitted answers to the same questions.

frequency of submissions

to the same questions

category one or more ten or more

PC 87 12

healthcare 17 0

social issues 109 22

used a Japanese morphological analyzer, Mecab

, for

word segmentation of messages.

In experiment 1, we ﬁrst developed user classiﬁers

by applying maximum entropy (ME) method to the

training data. Then, we varied the numbers of input

messages to the classiﬁers and measured the accuracy

of them. Input messages were extracted from the ex-

perimental data for examination. Figure 4 shows the

accuracy of the classiﬁers under the various numbers

(1 ∼ 5) of input messages and the various sizes (100,

200, and 300 messages) of training data. As shown in

Figure 4, we obtained more than 95% accuracy when

we set the size of training data and the number of in-

put messages to be 300 (including 150 target user’s

messages) and 4, respectively. Furthermore, we found

character 3-gram (s3) and 1 ∼ 10 characters at the

head and end of sentences (s5 and s6) are effective to

this experiment.

In experiment 2, we measured the accuracy of the

identiﬁer. It consisted of N classiﬁers, the accuracy

of which are shown in Figure 4. Figure 5 shows the

accuracy of the identiﬁer under the various numbers

(1 ∼ 25) of input messages and the various sizes (100,

200, and 300 messages) of training data. As shown in

Figure 5, we obtained more than 80% accuracy when

we set the size of training data and the number of in-

put messages to be 300 (including 150 target user’s

messages) and 15, respectively.

In experiment 3, because we wanted to use the

identiﬁer with more than 85 % accuracy, we gave

training data consisting of 300 messages (including

150 target user’s messages) and set the number of in-

put messages to be 16. Table 4 shows the numbers

of user pairs who have similar styles of writing and

submitted answers to the same questions. In this ex-

periment, we found two user pairs suspected of pre-

tending to be someone else to manipulate communi-

cations. Those user pairs submitted answers to the

same questions in social issues category 43 and 17

times, respectively. We intend to examine whether

these user pairs are multiple account users, from var-

ious perspectives.

http://mecab.sourceforge.net/

5 CONCLUSIONS

In this paper, we proposed a method of detecting users

who have similar styles of writing and submitted an-

swers to the same questions in a community site fre-

quently. Our method detected some user pairs sus-

pected of pretending to be someone else and manip-

ulating communications in a community site. We in-

tend to examine this experimental results and reﬁne

our method. Then, we wish to contribute to learners

in community sites.

ACKNOWLEDGEMENTS

This research has been supported partly by the

Grant-in-Aid for Scientiﬁc Research (C) under Grant

No.20500106.

REFERENCES

Craig: Authorial attribution and computational stylistics:

if you can tell authors apart, have you learned any-

thing about them?, Literary and Linguistic Comput-

ing, 14(1), (1999).

de Vel, Anderson, Corney, and Mohay: Mining e-mail con-

tent for author identiﬁcation forensics, ACM SIG-

MOD Record, 30(4), (2001).

Koppel, Argamon, and Shimoni: Automatically Categoriz-

ing Written Text by Gender, Literary Linguistic and

Computing, 17(4), (2002).

Corney, de Vel, Anderson, and Mohay: Gender-Preferential

Text Mining of E-mail Discourse, ACSAC 2002,

(2002).

Argamon, Saric, and Stein: Style mining of electronic mes-

sages for multiple authorship discrimination: ﬁrst re-

sults, 9th ACM SIGKDD, (2003).

Zheng, Li, Chen, and Huang: A Framework of Author-

ship Identiﬁcation for Online Messages: Writing Style

Features and Classiﬁcation Techniques, Journal of the

American Society for Information Science and Tech-

nology, 57(3), (2006).

Odaka, Murata, Gao, Suwa, Shirai, Takahashi, Kuroiwa,

and Ogura: A Proposal on Student Report Scoring

System Using N-gram Text Analysis Method, IEICE

trans., J86-D-I(9), (2003).

Tsuboi and Matsumoto: Authorship Identiﬁcation for Het-

erogeneous Documents, ISPJ-NL-148, (2002).

Matsumoto, Takamura, and Okumura: Sentiment Classiﬁ-

cation using Word Sequences and Dependency Trees,

FIT2004, (2004).

DETECTION OF SUBMITTERS SUSPECTED OF PRETENDING TO BE SOMEONE ELSE TO MANIPULATE

COMMUNICATIONS IN A COMMUNITY SITE

171