EMPIRICAL TEXT MINING FOR GENRE DETECTION

Vasiliki Simaki

, Sofia Stamou

1,2

and Nikos Kirtsis

Computer Engineering and Informatics Department, Patras University, Patras, Greece

Department of Archives and Library Science, Ionian University, Corfu, Greece

Keywords: Genre Detection, Annotation, Human Study.

Abstract: In this paper, we report on a preliminary study we carried out for identifying patterns that characterize the

genre type of Greek texts. In the course of our study, we address four distinct genre types, we record their

observable stylistic elements and we indicate their exploitation for automatic genre-based document classi-

fication. The findings of our study demonstrate that texts contain lexical features with discriminative power

as far as genre is concerned, however modeling those features so that they can be explored by computer-

based applications is still in early stages.

1 INTRODUCTION

The genre is the totality of characteristics we ob-

serve in a text that gives a unique print. It is an het-

erogeneous categorical principle in that it provides

clues for classifying texts into specific styles. Genre

clues are closer to the field of semantics and unlike

text type that conveys information about the texts'

structure, they give out information about the style

of a text. Genre clues are employed for characteriz-

ing a text as subjective/objective, positive/negative

about the subject it elaborates, opinionated, factual,

etc. They may also be employed for unravelling sty-

listic features reflected within a text's content, with

these ranging from literary content to procedural,

descriptive and so forth.

In this paper, we report on an observatory study

we carried out in which we try to identify structural

and contextual clues within text in order to be able

to characterize its underlying genre accurately. Our

observations verify that there are discrete informa-

tional points within texts that constitute their genre

clues. Based on this finding, we take a step further

and try to mine lexical and syntactic patterns from

text, which may be treated as genre indicators for

automatically classifying texts according to genre

classes.

2 RELATED WORK

Automatic genre and style detection of texts is an

active field of research over the past years essen-

tially because there are many challenges and thus

yet-unresolved issues associated with it. In this re-

spect, several researchers have proposed various

solutions to the problem of identifying the character-

istics that communicate the genre and the stylistic

print of texts. In particular, Kessler et al. (1997) pro-

posed a theory of genres that compares various sur-

face textual features and they attempted the auto-

matic genre detection via the exploitation of genre

clues in a text's body. The clues they determined are:

structural, lexical, character-level and derivative;

and, despite their surface nature, they are as success-

ful as deep structural cues. Finn et al. (2002) pro-

posed the application of machine learning tech-

niques for the automatic classification of text genres.

They investigated three approaches to automatically

classify text documents by genre, i.e. the bag-of-

words method, part-of-speech statistics and hand-

crafted shallow linguistic features. The application

of their approach to a collection of news articles

revealed that part-of-speech tagging and statistical

analysis of a text's content may effectively contrib-

ute towards genre-based document classification.

Later, Finn and Kushmerick (2003) elaborated on

the details of textual genres and showed how genre

can be distinguished across different texts. Lee and

Myaeng (2004) implemented the so-called ‘devia-

tion-based statistical feature selection’ method and

proved its efficacy in automatically extracting genre-

related features from texts. Santini et al. (2006) pro-

posed a new definition for genre based on the traits

733

Simaki V., Stamou S. and Kirtsis N..

EMPIRICAL TEXT MINING FOR GENRE DETECTION.

DOI: 10.5220/0003956207330737

In Proceedings of the 8th International Conference on Web Information Systems and Technologies (WEBIST-2012), pages 733-737

ISBN: 978-989-8565-08-2

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

of hybridism, individualization and evolution. This

theoretical characterization constitutes the base for

an inferential model that automatically identifies

genres within web pages. Later, Santini (2007) pre-

sented a model based on a scheme performing zero-

to-multi genre assignments. Sharoff (2007) per-

formed experiments to find the best set of features

that delimit a typology useful for classifying web

pages according to their domain and genre. With

respect to Greek, Stamatatos et al. (2000) described

an approach that relies on classification schemes and

natural language processing modules for categoriz-

ing Greek texts according to their genre and author.

The application of their technique to various Greek

corpora proved the feasibility of their proposed

method.

3 METHODOLOGY

Our text genre detection approach operates upon

pre-determined classes of text types in which it

looks for features that signify the genre of the re-

spective texts. Before delving into the details of our

method, we should point out that we distinguish text

genre from text type in that the former represents the

way in which information is communicated via

texts, whereas the latter represents the type of infor-

mation texts convey. As an example consider the

current paper, whose style might be characterized as

technical or formal and whose type might be charac-

terized as scientific publication or research paper.

Based on the discrimination between style and

type, we propose a method that tries to identify the

genre (i.e. style) of texts that belong to particular

types. To obviate the need for pre-classifying texts

according to their type, we rely on an existing tex-

tual type classification scheme

that defines the fol-

lowing categories of text types: (i) narrative: an

account of events, (ii) expository: text that explains

something, (iii) procedural: text that gives instruc-

tions on how to do something and (iv) descriptive:

text that lists the characteristics of something. Each

of the above category types contains sub-types,

which share common elements in structure and con-

tent with their parental type and with each other, and

at the same time they deliver clues that are unique

and representative of their respective sub-type cate-

gory. As sub-type example categories consider the

following: (i) narrative texts may be sub-categorized

into novels, poetry or short stories, (ii) expository

texts may be sub-categorized into news articles,

travel books and periodicals, (iii) procedural texts

may be sub-categorized into various kinds of guide-

books, e.g. cooking books, manuals, installation

guides and (iv) descriptive texts may be sub-

categorized into encyclopedias, grammars and dic-

tionaries. Given the type and/or sub-type of a text,

our method tries to identify structural and contextual

elements that signify the text's style or else genre. To

ensure the accurate identification of genre, we relied

on a dataset of texts already classified into one of the

above types and after manually inspecting them we

extracted their genre-indicative elements. Manual

inspection of texts was performed by human expert

annotators, i.e. linguistics, to whom we distributed a

number of texts along with their respective type

categories and we asked them to indicate for every

text type the features (both contextual and structural

if feasible) that yield its style. Note that we did not

ask our study participants to directly indicate the

genre of every examined text type, but rather to in-

dicate the elements that shape the stylistic identity of

every examined text. This is because the quest in our

study is to determine the characteristics that distin-

guish the genre of text types from one another rather

than annotate texts with stylistic information. Based

on the above determination, one can employ super-

vised clustering methods for automatically grouping

texts in terms of their common underlying stylistic

elements. Turning back to the description of our

method, we should point out that due to space and

time constraints, we did not examine all the sub-

categories of every text type but rather we relied on

texts categorized under a specific (randomly se-

lected) type sub-category. Thus, for the narrative

category, we relied on texts belonging to the short

stories sub-category, for the expository category we

relied on texts belonging to the news articles sub-

category, for the procedural category we relied on

texts belonging to the cookbooks sub-category and

for the descriptive category we relied on texts be-

longing to the dictionaries sub-category.

The dataset used for each of the above catego-

ries, concerned online texts that we harvested from

suitable websites, i.e. sites containing (i) short sto-

ries, (ii) online newspapers, (iii) sites about recipes

and cooking and (iv) lexicographic sites. In the next

section, we report on the statistics of our collected

dataset. Now, we present the way in which human

annotators assessed our collected data sources in

order to identify their structural and contextual ele-

ments that signify their genre. To carry out our

study, we recruited 5 experienced linguists (3 fe-

male, 2 male), who volunteered to read the texts

collected for every type category and indicate the

http://www.sil.org/linguistics/glossaryoflinguisticterms/WhatIsATex

t.htm

WEBIST2012-8thInternationalConferenceonWebInformationSystemsandTechnologies

734

elements that they judged as indicative of the texts'

style. The only instruction given to our study par-

ticipants was that their role would not be to indicate

the texts' genre but rather to identify elements that

demonstrate the style. To acquaint our users with

their task, we run a supervised laboratory experi-

ment in which we asked them to experiment with a

number of texts and via the think-aloud protocol to

indicate which elements they deemed as style-

descriptive for each of the test texts. Note also that

our volunteers were aware of the type sub-category

of every text they examined. Finally, we advised our

participants to indicate only elements for which they

felt absolutely confident that they are indicative of

the texts' genre and we did not allow them to com-

municate with each other during their participation

in the survey. The time duration of the study was

five consecutive days, with two 3-hour sessions per

day. At the end of the study, we collected the notes

(in the form of free-style comments) our participants

delivered, we grouped them by text type and we

examined for every text type the genre-indicative

elements that were selected by the majority of our

study participants. Obtained results for each of the

text types considered are presented and discussed in

the following section.

4 IDENTIFYING GENRE CLUES

Table 1 summarizes our experimental dataset.

Table 1: Statistical distribution of the examined dataset

across text types.

Documents 853

Short stories (narrative) 157

News articles (expository) 359

Recipes (procedural) 334

Dictionaries (descriptive) 3

Having collected the experimental texts for our

human survey, we grouped them by type and we

asked our study participants to read the texts in

every type and note down the clues they considered

indicative of the respective texts' genre. To facilitate

the work of our volunteers, we asked them to write

down in the form of (self-selected) keywords and/or

short phrases brief descriptions of the clues they

identified. All texts across all four types were exam-

ined by all the participants, thus we received feed-

back for the entire dataset. Based on the user-

selected genre clues for each of the text types con-

sidered, we built an indexing module where we

stored the user-selected genre clues for further proc-

essing. Genre clues processing concerned manual

assessment of the collected feedback to identify re-

current genre identifiers. In our study, we deem a

genre clue as recurrent in the user feedback if at

least three of the participants indicated the same clue

as genre-indicative even if the terms used to verbal-

ize the clue were not identical among their com-

ments. Table 2 gives the statistics of the user-defined

genre clues for each of the types examined.

Table 2: User-selected genre clues for text types.

Text type: short stories

Genre clues selected by more that 3 users 5

Genre clues selected by 3 or less users 9

Text type: news articles

Genre clues selected by more that 3 users 4

Genre clues selected by 3 or less users 9

Text type: recipes

Genre clues selected by more that 3 users 4

Genre clues selected by 3 or less users 6

Text type: dictionaries

Genre clues selected by more that 3 users 3

Genre clues selected by 3 or less users 5

According to the reported data, there are high

levels of user agreement with respect to the elements

that signify the genre of different text types. This

might be due to the fact that all our study partici-

pants were experienced linguistics with a solid

knowledge of what text genre is and how it can be

encapsulated in the structural and lexical elements of

texts.

4.1 Analysis of Genre Clue Indicators

Having collected and processed user feedback with

respect to what constitute the indicative characteris-

tics of genre within texts, we proceed with the pres-

entation and analysis of those indicators in an at-

tempt to shed light on the text properties that signify

style. Table 3 reports the stylistic characteristics of

texts belonging to the each of the categories exam-

ined as given by at least 3 of our participants.

As the table shows, there are relatively high lev-

els of repetition among the genre clues that charac-

terize text types. To identify how genre clues are

pronounced into the structural and contextual prop-

erties of their corresponding texts, we relied on the

combined analysis of the texts employed in our

study and the comments our participants delivered

for each of the examined texts and we interpret our

findings as follows.

For the short stories category, the limited text

size and the quick narration are the most frequently

indicated clues of genre. The text size feature ac-

counts to the number of words narrative texts con-

tain, which according to our data ranges between

EMPIRICALTEXTMININGFORGENREDETECTION

735

Table 3: Genre clues for the short stories, new articles, recipes and dictionary categories respectively.

Text type: short stories Text type: news articles

Genre clue Fraction Genre clue Fraction

Limited text size 61.14% Headings 100.00%

Quick narration 57.89% Images, tables 100.00%

Presence of dialogue 47.77% Syntactic coherence 22.31%

Syntactic coherence 44.45% Named entities 13.74%

Recursion 40.25%

Text type: recipes Text type: dictionaries

Genre clue Fraction Genre clue Fraction

Structure replication 100.00% Structure replication 100.00%

Sequential short NPs 25.77% Short, elliptical phrases 98.00%

Precise verbal structures 11.88% Use of abbreviations/symbols 23.00%

Use of command words 10.51%

1,000 and 3,500 terms. The quick narration feature

accounts to the use of descriptive language, the suc-

cession of events (usually in a chronological order),

the use of pronouns and the presence of factual

statements within text. Modeling the characteristic

of quick narration is a quite complex task that re-

quires large volumes of human-annotated data (i.e.

narrative corpora) as well as the availability of so-

phisticated language processing modules. The syn-

tactic coherence feature accounts to the presence of

full syntactic phrases within texts and the recursion

feature accounts to embedding phrases of the same

type within texts. Both of them together with the

dialogue feature are genre descriptive elements for

narrative texts and can be captured after applying

syntactic parsing to the contents of the respective

texts.

The genre clues identified for the news articles

category have strong discrimination power as these

are present in the striking majority of journalistic

texts. In particular, we observe that what character-

izes the genre of a text as expository and specifically

as journalistic is the presence of headings, images

and the utilization of named entities in the text con-

tents. Modeling the above features in order to come

up with automated genre detection methods is rela-

tively easy. This is because most of the features that

characterize expository texts and news articles can

be identified within the texts’ structural properties

where the application of shallow parsing would be

sufficient for their detection. For the remaining fea-

tures that serve as indicators of expository texts, we

would need to apply deep syntactic analysis to the

texts’ contents to be able to automatically extract

them.

The genre clues identified for procedural texts

are pretty different from both narrative and exposi-

tory texts with their main differences accounting for

the texts' structural properties. In particular, texts

belonging to the recipes type replicate the same

structural and syntactic patterns, i.e. they begin with

a title/heading, which is followed by a list of sequen-

tial short noun phrases (NPs), which are topically

related to each other. Noun phrases are followed by

short text nuggets, which contain simple and repeti-

tive terminology, command words as well as precise

verbal structures, i.e. verbs in indicative plural. Our

data indicates that we can capture and therefore

model the stylistic features of procedural texts by

relying mainly on the texts' structure and less on

their content. This facilitates the genre characteris-

tics modeling task as it does not require the applica-

tion of deep lexical analysis to the examined texts.

Of course we are aware of the fact that different

types of procedural texts may vary in their structure,

but still one could apply our user-selected features as

a starting base for detecting texts that describe a

process of doing something.

Finally, dictionaries are specialized instances of

text with unique stylistic features. As such, we can

safely determine their genre by relying on the analy-

sis of their structural elements, i.e. by looking for the

pattern: lemma entry + sequences of short elliptical

phrases that may contain abbreviations and/or sym-

bols. Although instances of the above patterns may

vary across different categories of descriptive texts,

still the structure of their contextual elements, i.e.,

lemma entries followed by short descriptions, is

what constitutes their genre print.

5 CONCLUDING REMARKS

We have presented an exploratory study for identify-

ing the features that characterize the genre of differ-

ent text types. Our findings show that there exist

specific structural and contextual elements within

texts that can be modeled as genre clues in order to

be explored by automatic genre classification mod-

ules. In the future, we plan to investigate ways of

semi-automatically assigning texts of particular

WEBIST2012-8thInternationalConferenceonWebInformationSystemsandTechnologies

736

types to pre-specified genres.

REFERENCES

Finn, A. and Kushmerick, N. 2003. Learning to classify

documents according to genre. In Proceedings of the

Computational Approaches to Style Analysis and Syn-

thesis Workshop.

Finn, A., Kushmerick, N. and Smyth, B. 2002. Genre clas-

sification and domain transfer for information filter-

ing. In Proceedings of the European Colloquium on

Information Retrieval Research, pp. 353-362, Glas-

gow.

Karlgren, J. 1999. Stylistic experiments in information

retrieval. Natural Language Information Retrieval,

Kluwer.

Lee, Y. B. and Myaeng, S. H. 2004. Automatic identifica-

tion of text genres and their roles in subject-based

categorization. In the 37

Hawaiian Conference on

System Sciences.

Santini, M., Power, R. and Evans, R. 2006. Implementing

a characterization of genre for automatic genre identi-

fication of web pages. ACL Computational Linguistics

Conference.

Santini, M. 2007. Automatic genre identification: towards

a flexible classification scheme. In the BCS IRSG

Symposium: Future Directions in Information Access,

Glasgow, Scotland.

Sharoff, S. 2007. Classifying web corpora into domain and

genre using automatic feature identification. In the

Web as Corpus Workshop, Louvain-la-Neuve.

Stamatatos E., Fakotakis N. and Kokkinakis G. 2000.

Automatic text categorization in terms of genre and

author. Computational Linguistics, vol.26, no.4, pp.

461-485, MIT Press

EMPIRICALTEXTMININGFORGENREDETECTION

737