IDIOLECT-BASED IDENTITY DISCLOSURE AND AUTHORSHIP

ATTRIBUTION IN WEB-BASED SOCIAL SPACES

Natalie Ardet

Freie Universit

at Berlin

Takustr.9, 14195 Berlin, Germany

Keywords:

identity, social networks, privacy, authorship attribution, hypermedia, text mining.

Abstract:

In this paper, we inspect new possible methods of Web surveillance combining web mining with sociolin-

guistic and semiotic related knowledge of human discourse. We ﬁrst give an overview of telecommunication

surveillance methods and systems, with focus on the Internet, and we describe the legal issues involved in

Web or Internet communications investigations. We put the emphasis on identity disclosure and anonymity or

pseudonymity undermining in open web spaces. Further, we give an overview of new trends in Internet medi-

ated communication, and examine the virtual social networks they create. Finally, we present the results of a

new method using the semiotic features of web documents for authorship attribution and identity disclosure.

1 INTRODUCTION

The Web is a growing social space allowing human

interaction and communication. Thus legal and ille-

gal telecommunication surveillance and user proﬁling

methods have been developed since the emergence of

the Internet and ﬂourish on the Web. This raised the

concern about privacy and anonymity in the ﬁeld of

computer mediated communication. We will describe

some current aspects of Internet surveillance and de-

pict the trends of communication in the Web, espe-

cially those communications taking place in Web so-

cial spaces. Finally, we will examine the problem of

identity disclosure and anonymity (or pseudonymity)

undermining in open web spaces. We illustrate the

problem by presenting the results of a new approach

using the properties of web documents to perform au-

thorship attribution and thus leading to partial or com-

plete identity disclosure.

2 TRADITIONS IN

TELECOMMUNICATIONS

SURVEILLANCE

In 1998, a study for the European Parliament (STOA,

1998) uncovered ECHELON, a multinational surveil-

lance network, centered at Sugar Grove, WV, U.S.A,

which intercepts all forms of electronic communica-

tions. It is the current electronic surveillance system

used by the National Security Agency of the United

States (NSA). This global system of electronic eaves-

dropping can monitor any electronic communication

in the world. It was thought to be used for com-

mercial purposes as well as military spying. ECHE-

LON monitors millions of communications, uses com-

puterized systems to target those of interest, by ori-

gin, destination, language or keywords. This elec-

tronic spy system was developed during the cold war.

Targets are tagged and forwarded to Fort Meade for

analysis and action. With the success of Internet

connected personal computers, surveillance software

targeting Internet users started to spread. Spyware

consists of ”computer software that gathers and re-

ports information about a computer user without the

user’s knowledge or consent”(Wikipedia, 2005). In

2004, according to a study by the National Cyber-

Security Alliance, 80% of home PCs are infested with

spyware (AOL/NCSA, 2004). As such, spyware is

cause for public concern about privacy on the In-

ternet. The concern about privacy lead to a con-

cern about the related concepts of identity, anonymity,

pseudonymity, unlinkability and unobservability in

computer mediated communication. Pﬁtzmann de-

ﬁnes identity as ”any subset of attributes of an in-

dividual which uniquely characterizes this individ-

ual within any set of individuals” (Pﬁtzmann, 2004).

305

Ardet N. (2005).

IDIOLECT-BASED IDENTITY DISCLOSURE AND AUTHORSHIP ATTRIBUTION IN WEB-BASED SOCIAL SPACES.

In Proceedings of the First International Conference on Web Information Systems and Technologies, pages 305-310

DOI: 10.5220/0001234803050310

 SciTePress

According to Jacobson, identity can be concealed

by the mean of anonymity. While ”anonymity con-

ceals an individuals real identity in online commu-

nication”, pseudonymity ”disguises” the real identity

(Jacobson, 1999). An important factor which dis-

tinguishes pseudonymity from anonymity is the con-

cept of accountability. Accountability is ”the prop-

erty that ensures that the actions of an individual

or an institution may be traced uniquely to that in-

dividual or institution” (ATIS, 2001). The major

dilemma is that anonymity protects privacy but under-

mines accountability whereas identiﬁcation assures

accountability but threatens privacy (Clarke, 1999).

Therefore Clarke stresses the ”central importance of

pseudonymity as a primary means of achieving the

necessary balance between the needs for privacy and

for accountability” (Clarke, 1999). The unlinkability

of a system is a property which ensures that a user

of the system may make multiple uses of resources

or services of the system without others being able

to link these uses together (Pﬁtzmann and K

ohntopp,

2001). In the following, the terminology in use is that

of a setting in which ”senders send messages to re-

cipients using a communication network”. We deﬁne

the unobservability property as P

and the anonymity

property as P

, s

is a sender, r

is a recipient, and R

is the communication relationship between s

and r

According to Pﬁtzmann, we have following implica-

tions:

⇒ P

(1)

) ⇒ P

) (2)

) ⇒ P

) (3)

) ⇒ P

) (4)

) ⇒ P

) (5)

) ⇒ P

) (6)

) ⇒ P

) (7)

) ⇒ P

) (8)

The concern of privacy led to new privacy-enhancing

rules and technologies in order to refrain the exploita-

tion and investigation of personal information in the

Internet. An emerging example is the Platform for

Privacy Preferences Project (P3P), a framework that

allows users to control the amount of personal infor-

mation they share with websites, developed by the

World Wide Web Consortium (Cranor, 2002).

3 COMPUTER SUPPORTED

SOCIAL NETWORKS

3.1 Social Networks and their

analysis

What is a social network? According to Garton et al.

(Garton, 1997), ”a social network is a set of people

(or organizations or other social entities) connected

by a set of social relationships, such as friendship, co-

working or information exchange”. Still according to

Garton et al. (Garton, 1997), a social network analy-

sis is a structural analysis, which units of analysis is

the interpersonal relation. A relation R (or

strand

) is

characterized by its content c, direction d and strength

s. A relation R denotes a relationship between two

actors a and b. R is directed or undirected. An undi-

rected relation can be unbalanced, which means that

its expression is asymmetrical. A set of relations con-

necting a pair of actors is called a tie. The more rela-

tions (or strands) in a tie, the more multiplex (or mul-

tistranded) is the tie. The composition of a relation

or a tie is derived from the social attributes of both

actors. Wasserman and Faust (Wasserman and Faust,

1994) introduced a different deﬁnition of a tie, from

their point of view a tie denotes a single aspect of a re-

lationship. In the example in ﬁgure 1, the entity ”Web

users” represents the actors. Each actor has a set of

attributes, e.g. ethnicity, social class and gender. A

possible relation between two actors is deﬁned as ”in-

teracts with”. This relation may have different aspects

such as ”topic” or ”purpose of conversation”.

Figure 1: User interaction

3.2 Computer mediated

communication

In the following part, we describe typical Internet

computer mediated communication (CMC) and pin-

point those which create observable social networks.

Email and Instant messaging systems (AIM, MSN,...)

are basically point-to-point two-way communication,

unless used for spamming purpose, then they are nei-

ther point-to-point nor two-way or communication.

These means of communication create only transient

WEBIST 2005 - WEB INTERFACES AND APPLICATIONS

306

social networks. We don’t consider ﬁle-sharing sys-

tems because they don’t support human to human

communication based on naturally occurring text or

speech. There are different types of one-to-many

means of communication. Usenet is a decentralized

system of discussion groups originally implemented

in 1979-1980 by Steve Bellovin, Jim Ellis, Tom Tr-

uscott, and Steve Daniel at Duke University (Hauben

and Hauben, 1997). Usenet predates the Internet, al-

though today most Usenet material is distributed over

the Internet using the NNTP protocol (Kantor and

Lapsley, 1986). Estimated half of the existing Usenet

newsgroups can be found on the Internet. Forums

(or discussion boards) are centralized, usually web-

based, discussion groups, allowing web users to have

a ”conversation” about a given topic. A new kind of

web communication support is the blog. Typically, a

blog contains a chronologically ordered set of mes-

sages posted by a single author or a group of au-

thors. Each message may be commented by its read-

ers. The readers can link the blog to their own blogroll

and thus creating a blogspace. The blogs are part

of a web space called blogosphere (or blogsphere).

A special feature of blogs and blogrolls is to export

their content by the means of feeds. This technique

is called content syndication. The most widely used

syndication formats are currently Really Simple Syn-

dication (RSS) and Atom (Nottingham, 2003). There

are currently seven different RSS formats (Pilgrim,

2002). Blog aggregators like Blogsnow

syndicate

the content of blogs. The aggregators are updated

by a ping mechanism informing them about that an

observed blog has been updated. The expansion of

the weblogs has been documented by the blog survey

company Technorati (see ﬁgure 2). Preece deﬁnes an

online community as people, who interact socially, of

a shared purpose, policies and computer systems to

support and mediate social interaction (Preece, 2000).

Internet researcher have created different taxonomies

to describe those interactions. The terms synchronous

is used for communication taking place in ”real time”

and asynchronous for timely non continuous commu-

nication. Regarding online investigation, the empha-

sis is put on the transience or persistence of commu-

nication. Communication is persistent if it is archived

in some way and can be recalled later. As an example,

an email message represents a persistent communica-

tion unless you delete it. Communication need to be

persistent and observable, to allow investigation. This

is typical for Blogs and Usenet. Social spaces like

discussion boards often reduce their observability by

restricting their access and typically reduce the per-

sistence of communication by the means of pruning

techniques as showed in the virtual ethnography done

by Ardet (Ardet, 2004).

http://www.blogsnow.com

Figure 2: Weblogs usage

4 IDIOLECT AND SEMIOTIC

FINGERPRINT

4.1 Properties of idiolects

Sociolinguistics is the study of language in human so-

ciety. An aspect of sociolinguistic research is an area

generally referred to as language variation. Language

variation focuses on how language varies in different

contexts, where context refers to factors like ethnic-

ity, social class, sex, geography and age. The choice

of the utterance used to convey a message is a also so-

cial decision tied to the social factors which deﬁne the

relationship between the transmitter and the receiver.

Language, dialect and idiolect are additional

factors which shape the utterances of the transmitter.

A dialect is a collection of attributes that make

one group of speakers noticeably different from

another group of speakers of the same language.

The term accent refers to a phonological variation

which may be important in speech analysis. Accent

is about pronunciation, while dialect is a broader

term encompassing syntactic, morphological, and

semantic properties as well. An idiolect is the variety

of language spoken by each individual speaker of the

language. While language and dialects are spoken

by a group of person, an idiolect is only spoken by a

single person. Chandler (Chandler, 2002) describes

an idiolect as a term from sociolinguistics referring

to the distinctive ways in which language is used

by individuals. In semiotic terms it can refer more

broadly to the stylistic and personal subcodes of

individuals. As a result of these deﬁnitions, we can

infer the following pyramid: a language consists of

one or more dialects, and each dialect is a collection

of idiolects as illustrated in ﬁgure 3. Each idiolect

maps to only one person, which implies that there

is an injective mapping function from an idiolect i

to a person p. The mapping function cannot be a

bijection because one person may speak different

dialects from one or more language, depending if the

person is monolingual or multilingual. We deﬁne X

IDIOLECT-BASED IDENTITY DISCLOSURE AND AUTHORSHIP ATTRIBUTION IN WEB-BASED SOCIAL

SPACES

307

as a set of all idiolects and Y the set of all speakers.

As f is injective, for every x and x’ in X, whenever

f(x) = f(x’), we must have x = x’. Which means it is

possible to identify a person by the mean of his or her

idiolect.

Figure 3: Language variety

4.2 Semiotic ﬁngerprint

We now introduce the notion of a semiotic ﬁngerprint,

a content dependent property of a document. Consid-

ering α

∈ A (1≤i≤n) as a semiotic sign, we de-

ﬁne the semiotic ﬁngerprint function f

as a mapping

between a document d = (α

, α

, . . . , α

) and its

semiotic ﬁngerprint σ

= (σ

d,1

, σ

d,2

, . . . , σ

d,m

) and

d,i

∈ R, σ

∈ F = R

: D

→ F

(d) = σ

is a feature set of d, i.e. a vector representation of

weighted and normalized attributes of d. Concerning

the input of the semiotic ﬁngerprint function f

, i.e.

the lexical unit of analysis of the forthcoming algo-

rithm, it may be a single document or a set of docu-

ments. If the semiotic ﬁngerprint function f

is injec-

tive, i.e. f (v

) = f(v

) implies v

= v

, this

would mean that each document has its own semi-

otic ﬁngerprint, which means a semiotic ﬁngerprint

uniquely identiﬁes a text. However, our goal is to de-

ﬁne a semiotic ﬁngerprint function which co-domain

can be split in semiotic ﬁngerprint classes C

∈ F.

Each class mapping eventually to an author:

= (σ

∈ F|∃d, f

(d) = σ

∧ a = author(d))

5 BLOGMINER WORKBENCH

5.1 Design and Conception

The BlogMiner workbench was designed to explore

weblogs, a category of open social spaces in the

Web. The discourse occurring in a weblog is encap-

sulated in messages posted on the blog. The pub-

lishing of messages usually occurs through the web-

blog’s content management system (CMS). BLOG-

MINER fetches the discourse occurring in a selected

weblog via the weblogs’own syndication mechanism.

An algorithm to calculate the semiotic ﬁngerprint of

a message as deﬁned in 4.2 has been developed and

integrated in BLOGMINER. The semiotic ﬁngerprint

computing will be presented here. At ﬁrst we have

to deﬁne the set of features, which will be used in

our analysis. The semiotic ﬁngerprint data structure

(SFDS) consists of two parts. The metrical part con-

tains the vector containing the values for the semi-

otic ﬁngerprints metrics (SFM). The lexical part con-

tains the occurrence set of semiotic units, i.e. the

distributional pattern of the words occurring in the

unit of analysis. The metrical part may consists of

occurrences and frequency of lexical, grammatical,

or paralinguistic patterns (Meyer, 2001)(Ha, 2003).

From a linguistic point of view, we have chosen an

empiric approach. This is due to the ”fuzziness” of

language occurring in many context of CMC as de-

scribed in (Ardet and Thome, 2004). Therefore, our

semiotic ﬁngerprint doesn’t include grammatical pat-

terns. The inclusion of grammatical pattern in order

to distinguish writing styles has been shown previ-

ously (Baayen et al., 1996) (Stamatatos and Kokki-

nakis, 2001) where the encoding of syntactic informa-

tion has been done with part-of-speech n-grams. The

lexical patterns we use include the frequency of punc-

tuation, numbers, capital letters. The paralinguistic

patterns we use are part of following categories:

Environmental contrast P

refers to a contrast be-

tween an object and its surroundings, caused by a

difference in shape, color, or illumination (PLAI,

2005)

Affect indicator P

refers to pattern which aim to

compensate the lack of facial expression or gesture

in CMC (Ardet and Thome, 2004)

Hypermedia usage P

refers to the usage of hyper-

media elements such as embedded images or hy-

perlinks.

Statistical values about the corpora like word count,

i.e. the size of the corpus, word types, i.e. the vo-

cabulary of the corpus and word tokens, i.e. the

frequency of each word type, are stored in the sec-

ond part of the semiotic ﬁngerprint, the lexical part.

These metrics have been discussed previously and

thus have not been integrated in our approach yet

(Oakes, 1998)(Smith, 1983)(Stamatatos and Kokki-

nakis, 2001)(Holmes, 1994) .

5.2 Results

The results achieved with our method for idiolect

recognition based on semiotic ﬁngerprint matching

will be presented now. The semiotic ﬁngerprint

has been computed for messages of three distincts

WEBIST 2005 - WEB INTERFACES AND APPLICATIONS

308

Table 1: Fingerprint values

3.2 4.6 1.8 0 9.2 2.7 3.7 0 0.9 0.9 7.4 10.1 1.8 0.9 0 0

8.6 4.5 2.2 0 4.5 2.2 2.2 0 1.1 0 0 1.1 2.2 0 0 0

1.8 4.6 3.1 0 12.5 3.1 4.6 0 0 1.5 0 15.6 7.8 1.5 0 0

8.3 7.6 0.1 0.8 2.6 0.5 0.4 0.8 0.1 0.2 0.7 0.8 0.2 0.1 3.8 0.7

3.8 4.4 0.3 1.2 5.0 1.2 1.2 2.2 0.9 0.3 0 0.9 1.2 0 1.9 0.6

6.1 6.0 2.3 1.1 6.3 2.0 2.0 1.4 1.4 0.2 3.4 0.8 0 0 2.6 1.4

weblogs

. In ﬁgure 4, each row shows semiotic ﬁn-

gerprints of a distinctive weblog. The weblogs mes-

sages are quite small (less than 200 words). Previous

research result, couldn’t achieve satisfactory results

with small documents (De Vel et al., 2001a)(De Vel

et al., 2001b). In ﬁgure 4, we can easily visualize

the difference between the ﬁngerprints in row 2, cor-

responding to weblog σ

and the other ﬁngerprints

and σ

. The semiotic ﬁngerprints of σ

and

are visually quite similar, thus if we inspect the

semiotic ﬁngerprint values in table 1, we observe that

in σ

, the attributes σ

, σ

are null whereas

these values are not null in σ

. The preliminary re-

sults on very small documents are quite promising and

need some further investigation.

Figure 4: Example of Semiotic Fingerprints

6 RELATED WORK

Ongoing research in the ﬁeld of authorship has led to

some interesting results. Koppel et.al. showed that

partial identity disclosure can be achieved using gen-

der inference techniques (Koppel et al., 2003). Inves-

tigation of large corporas (with an average of 42000

words) have shown that several classes of simple lex-

blabbermouth.net, executivewoman.blogspirit.com and

motorcitybadkitty.com

ical and syntactic features differ substantially accord-

ing to author gender, especially regarding the use of

pronouns and certain types of noun modiﬁers. Beside

research concerning partial identity disclosure, the in-

vestigation of groups of interest, i.e. groups of indi-

viduals who share social spaces may also reveal infor-

mation about an individual. In the context of social

network investigation, the concept of group is useful

for examining the relationships between sets of ac-

tors instead of single actors. Wellmann (Wellman,

1997) deﬁnes a group as ”a social network whose

ties are tightly-bounded within a delimited set and

are densely-knit so that almost all network members

are directly linked with each other”. A particular

group-ﬁnding algorithm is known as Friend-of-Friend

(FoF) . This technique was ﬁrst used in astrophysics

by Huchra and Geller (Huchra and Geller, 1982) to

identify group of galaxies. An entity belongs to a FoF

group if it lies within some linking length ǫ of any

other entity in the group. If no group is found, the en-

tity is entered in a list of isolated elements. All entities

found are added to the list of group members. The sur-

roundings of each group member are then searched.

This process is repeated until no further members can

be found. Another approach to group-ﬁnding entitled

conversation maps has been developed by Sacks. This

approach aims to visualize the interaction between

users of the Usenet in order to investigate very large

conversations (Sack, 2000).

7 CONCLUSION

In this paper, we ﬁrst explored the technical aspects

of the Internet, considered as a social model, as sug-

gested by the psychological view of Mantovani (Man-

tovani, 2001). Then, we gave a deﬁnition of computer

supported social networks, described their topology

(centered or distributed), their usages (personal and

business oriented) and explored the topological prop-

erties of two emerging types of computer-supported

social networks, aka Web social spaces. Finally, we

illustrated the issues of privacy and identity disclosure

in open Web social spaces by presenting the prelim-

inary results of an algorithm for idiolect recognition

IDIOLECT-BASED IDENTITY DISCLOSURE AND AUTHORSHIP ATTRIBUTION IN WEB-BASED SOCIAL

SPACES

309

based on semiotic ﬁngerprint matching. Our approach

uses the rich media content of web documents to build

a set of features, thus allowing stylometric analysis on

small documents.

REFERENCES

AOL/NCSA (2004). Aol/ncsa online safety study. Tech-

nical report, America Online and the National Cyber

Security Alliance.

Ardet, N. (2004). Teenagers, Internet and Black Metal mu-

sic. Conference Proceedings CIM 2004.

Ardet, N. and Thome, M. (2004). Virtual Ethnography: a

Computer-Based Approach. Computer and their Ap-

plications, Conference Proceedings (2004), 79–82.

ATIS (2001). Alliance for telecommunications industry so-

lutions telecom glossary.

Baayen, H., Halteren, H., and Tweedie, F. (1996). Outside

the cave of shadows: Using syntactic annotation to en-

hance authorship attribution. Literary and Linguistic

Computing, 11.

Chandler, D. (2002). Semiotics: The Basics. London,

Routeledge.

Clarke, R. (1999). Identiﬁed, anonymous and pseudony-

mous transactions: The spectrum of choice. User

Identiﬁcation & Privacy Protection Conference,

Stockholm.

Cranor, L. F. (2002). Web Privacy with P3P. O’Reilly &

Associates.

De Vel, O., Anderson, A., and Corney, M. (2001a). Min-

ing e-mail content for author identiﬁcation forensics.

ACM Sigmod, Volume 30 , Issue 4 (December 2001).

De Vel, O., Andersond, A., and Corney, M. (2001b). Multi-

topic-e-mail authorship attribution forensics. In ACM

Conference on Computer Security - Workshop on Data

Mining for Security Applications, November 8, 2001,

Philadelphia, PA, USA.

Garton, L. (1997). Studying on-line social networks.

JCMC (Journal of Computer Mediated Communica-

tion) Vol.3, Issue 1, 1997.

Ha, L. A. (2003). Extracting important domain-speciﬁc

concepts and relations from a glossary. In Proceed-

ings of the 6th CLUK Colloquium, pages 49–56, Ed-

inburgh, UK.

Hauben, M. and Hauben, R. (1997). Netizens: On the His-

tory and Impact of Usenet and the Internet. Wiley-

IEEE Computer Society Press.

Holmes, D. I. (1994). Authorship attribution. Computers

and the Humanities, Nr. 28:87–106.

Huchra, J. and Geller, M. (1982). Groups of Galaxies I.

Nearby Groups. ApJ 257 423.

Jacobson, D. (1999). Doing research in cyberspace. Fields

Methods, Vol. 11, No. 2, November 1999:pp. 127–

145.

Kantor, B. and Lapsley, P. (1986). Network news transfer

protocol. Technical report, U.C. San Diego.

Koppel, M., Argamon, S., and Shimoni, A. (2003). Auto-

matically categorizing written texts by author gender.

Literary and Linguistic Computing 17(4), November

2002, pp. 401–412.

Mantovani, G. (2001). The psychological construction of

the internet. from information foraging to social gath-

ering to cultural mediation. Cyberpsychology And Be-

havior. Vol. 4 (1), Pp. 47-56.

Meyer (2001). Extracting knowledge-rich contexts for ter-

minography. In D. Bourigault, C. J. and LHomme,

M. C., editors, Recent Advances in Computational

Terminology. Amsterdam, John Benjamins.

Nottingham, M. (2003). The atom syndication format 0.3

(pre-draft). Technical report, Atom Working Group.

Oakes, M. P. (1998). Statistics for Corpus Linguistics. Ed-

inburgh.

Pﬁtzmann, A. (2004). Anonymity, unobservability,

pseudonymity, and identity management a proposal

for terminology (draft v0.21 sep. 03, 2004). Technical

report, TU Dresden.

Pﬁtzmann, A. and K

ohntopp, M. (2001). Anonymity, un-

observability, and pseudonymity a proposal for ter-

minology. Technical report, proposal.

Pilgrim, M. (2002). What is rss? www.xml.com.

PLAI (2005). The plain language association international

glossary. http://www.plainlanguagenetwork.org/.

Preece, J. (2000). Online communities. Wiley.

Sack, W. (2000). Conversation map: A content-based

usenet newsgroup browser. In in the Proceedings of

the International Conference on Intelligent User In-

terfaces (New Orleans, LA: Association for Comput-

ing Machinery, January 2000).

Smith, M. (1983). Recent Experience and New Develop-

ments of Methods for the Determination of Author-

ship. Association for Literary and Linguistic Com-

puting Bulletin, 11, 1983, S. 73-82.

Stamatatos, E., N. F. and Kokkinakis, G. (2001). Computer-

based authorship attribution without lexical measures.

Computers and the Humanities 35, pages 193–214.

STOA (1998). An appraisal of technologies of politi-

cal control, interim study. Technical report, STOA

Programme, Directorate-General for Research Direc-

torate B, Eastman 112, rue Belliard 97-113, B-1047

Bruxelles., http://cryptome.org/stoa-atpc.htm.

Wasserman, S. and Faust, K. (1994). Social Network Analy-

sis: Methods and Applications. Cambridge University

Press.

Wellman, B. (1997). Cultures of the Internet, chapter An

electronic group is virtually a social network, page

pages 179205. Lawrence Erlbaum Publications, Mah-

wah, New Jersey.

Wikipedia (2005). Wikipedia Encyclopedia.

www.wikipedia.com.

WEBIST 2005 - WEB INTERFACES AND APPLICATIONS

310