TEXT SIMPLIFICATION USING DEPENDENCY PARSING FOR SPANISH

Miguel Ballesteros

Departamento de Ingeniería del Software e Inteligencia Artificial, Universidad Complutense de Madrid

C/ Profesor José García Santesmases, s/n, E-28040 Madrid, Spain

Susana Bautista, Pablo Gervás

Instituto de Ingeniería del Conocimiento, Universidad Complutense de Madrid

C/ Profesor José García Santesmases, s/n, E-28040 Madrid, Spain

Keywords:

Text simpliﬁcation, Dependency parsing, Spanish.

Abstract:

In this paper we investigate the task of text simplification for Spanish. We propose a rule-based system that simplifies text using dependency parsing. Our main motivation is the need for text simplification to facilitate access to information by poor readers and by people with cognitive disabilities. This study is a first step towards building Spanish text simplification systems that help to create easy-to-read texts.

1 INTRODUCTION

Text simpliﬁcation aims at providing human readers

with a better understanding of a written text through

its simpliﬁcation. Our goal is to build a system to

promote access to Spanish texts for people at the rudi-

mentary and basic literacy levels, as well as for those

with cognitive disabilities.

In Spain a vast number of people belong to the so-called rudimentary and basic literacy levels. These people are only able to find explicit information in short texts, or to process slightly longer texts and make simple inferences. According to some studies that measure the literacy level of the population, 30% of the population have difficulty understanding texts beyond a certain complexity.

Reading comprehension entails three elements: the reader who is meant to comprehend, the text that is to be comprehended, and the activity of which comprehension is a part (Snow et al., 2002). In addition to the content presented in the text, the vocabulary load of the text and its linguistic structure, discourse style, and genre interact with the reader's knowledge. When these factors do not match the reader's knowledge and experience, the text becomes too complex for comprehension to occur. In this paper we focus on the syntactic structure of a text, aiming to maximize the comprehension of written texts through the simplification of their linguistic structure. This may involve simplifying lexical and syntactic phenomena, by substituting more common words and by breaking down and changing the syntactic structure of the sentence. As a result, the text is expected to be more easily understood (Siddharthan, 2003; Max, 2006). Text simplification may also involve dropping parts of sentences or full sentences, and adding extra material to explain a difficult point (Petersen and Ostendorf, 2007).

http://www.facillectura.es

It has already been shown that long sentences, conjoined sentences, embedded clauses, passives, non-canonical word order, and the use of low-frequency words, among other things, increase text complexity for language-impaired readers (Siddharthan, 2002; Klebanov et al., 2004; Devlin and Unthank, 2006; Bautista et al., 2009; Caseli et al., 2009).

There are different initiatives that provide guidelines for making text easier to comprehend: Plain Language, the "European Guidelines for the Production of Easy-to-Read Information", and the "Web Content Accessibility Guidelines". In principle, these recommendations can be applied to any language.

http://www.plainlanguage.gov

http://www.osmhi.org/contentpics/139/European

Ballesteros M., Bautista S. and Gervás P.

TEXT SIMPLIFICATION USING DEPENDENCY PARSING FOR SPANISH.

DOI: 10.5220/0003115803300335

In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2010), pages 330-335

ISBN: 978-989-8425-28-7

© 2010 SCITEPRESS (Science and Technology Publications, Lda.)

In this paper we present the results of the early steps in the study of syntactic simplification for Spanish, together with a rule-based syntactic simplification system for this language. We follow a subset of the guidelines to define our rules: use mostly short sentences, include one main idea per sentence, and do not try to express more than one idea or theme in each sentence.

This paper is organized as follows. In Section 2

we describe related approaches for text simpliﬁcation.

Section 3 presents our proposal and the evaluation

measures. In Section 4 we show our results. Section

5 presents the conclusions and some future work.

2 PREVIOUS WORK

In this section we present related approaches to text simplification, the state of the art in multilingual dependency parsing, and the corpus that we used for our experiment.

2.1 Text Simpliﬁcation

Existing text simplification systems can be compared along three axes: the type of system (rule-based or corpus-based), the type of knowledge used to identify the need for simplification, and the goals of the system.

A few rule-based systems have been developed for text simplification (Chandrasekar et al., 1996; Siddharthan, 2003; Bautista et al., 2009), focusing on different readers (people with poor literacy, aphasic readers, etc.). These systems contain a set of manually created simplification rules that are applied to each sentence. The rules are usually based on parser structures and limited to certain simplification operations. Siddharthan proposes a syntactic simplification architecture that relies on shallow text analysis and favors time performance. The general goal of the architecture is to make texts more accessible to a broader audience. Max (2006) applies text simplification in the writing process by embedding an interactive text simplification system into a word processor. At the user's request, an automatic parser analyzes an individual sentence and the system applies handcrafted rewriting rules. This system requires human intervention at every step.

http://www.osmhi.org/contentpics/139/European Guidelines for ETR publications (2).pdf

http://www.w3.org/TR/WCAG20/

Corpus-based systems, on the other hand, can learn from a corpus the relevant simplification operations and also the necessary degree of simplification for a given task (Petersen and Ostendorf, 2007). Petersen and Ostendorf address the task of text simplification in the context of second-language learning. They propose a data-driven approach to simplification using a corpus of paired articles in which each original sentence does not necessarily have a corresponding simplified sentence, making it possible to learn where writers have dropped or simplified sentences. A classifier is used to select the sentences to simplify, and Siddharthan's syntactic simplification system is used to split the selected sentences. Inui et al. (2003) propose a rule-based system for text simplification aimed at deaf people.

Some language technology systems attempt to simplify documents for various purposes. A variety of simplification techniques have been used, for example substituting common words for uncommon words (Devlin and Tait, 1998), converting passive sentences to the active voice and resolving references (Canning, 2000), reducing multiple-clause sentences to single-clause sentences (Chandrasekar and Srinivas, 1997; Canning, 2000; Siddharthan, 2002), and making appropriate choices at the discourse level (Williams et al., 2003).

There are also commercial systems, such as Simplus and StyleWriter, which aim to support Plain English writing.

2.2 Dependency Parsing

A dependency is a binary, asymmetrical syntactic relation between the words of a sentence that is relevant to the structure of the sentence (Kübler et al., 2009). Based on this idea, we can define dependency parsing. The words in a sentence depend on each other: for example, the direct object of a verb depends directly on the verb, and an adjective depends on a noun. The purpose of dependency analysis is to build a tree whose nodes represent the words of the sentence and whose edges represent the dependencies between them. This tree is called the dependency tree.
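As an illustration of this definition, a dependency tree can be stored as a list of words, each pointing to its head. The following is only a minimal sketch: the `Token` class, the `children_of` helper, the example sentence, and the labels other than "CD" are assumptions made for illustration and are not taken from our system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    idx: int     # 1-based position of the word in the sentence
    form: str    # surface word
    head: int    # index of the governing word, 0 for the root
    deprel: str  # dependency label of the edge to the head


def children_of(tokens):
    """Map each head index to the indices of its direct dependents."""
    children = defaultdict(list)
    for tok in tokens:
        children[tok.head].append(tok.idx)
    return dict(children)


# "Juan come manzanas rojas" (Juan eats red apples).
# Labels other than "CD" are illustrative placeholders.
sentence = [
    Token(1, "Juan", 2, "SUJ"),
    Token(2, "come", 0, "ROOT"),
    Token(3, "manzanas", 2, "CD"),
    Token(4, "rojas", 3, "MOD"),
]
tree = children_of(sentence)
```

Here the verb is the root, the direct object depends on the verb, and the adjective depends on the noun, as described above.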

A lot of work has been done on dependency parsers, and some shared tasks have had multilingual dependency parsing as their main theme, such as the CoNLL-X Shared Task (Buchholz and Marsi, 2006). Each year the Conference on Computational Natural Language Learning (CoNLL) features a shared task; the shared task of the 10th CoNLL was multilingual dependency parsing.

http://www.linguatechnologies.com/english/home.html

http://www.editorsoftware.com/writing-software


Figure 1: A tagged sentence from AnCora.

Many research groups took part; each group implemented a parser, and there were many languages to parse. The corpus used for Spanish parsing was AnCora (Palomar et al., 2004; Taulé et al., 2008), and we also used it for our experiment. The aim of this task was to extend the state of the art available at that time in dependency parsing. In 2007, another shared task on multilingual dependency parsing was held, the CoNLL-XI Shared Task (Mcdonald et al., 2007), but in this case Spanish was not among the languages to parse.

2.3 AnCora Corpus

We used the AnCora treebank (Palomar et al., 2004; Taulé et al., 2008), a corpus of 95,028 wordforms and 3,512 sentences that contains open-domain texts annotated with their dependency analyses. The CoNLL-X Shared Task (Buchholz and Marsi, 2006) used AnCora as the treebank for Spanish parsing, and the best scores were around 80% LAS (Labelled Attachment Score). AnCora was tagged automatically with morphosyntactic information (PoS tags) and manually checked. It has been used as a training corpus for learning-based systems.
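For reference, the Labelled Attachment Score is commonly computed as the percentage of tokens that receive both the correct head and the correct dependency label. The sketch below assumes this common definition; the function name and the toy data are hypothetical, and the official shared-task scorer applies further conventions (for example regarding punctuation) that are not reproduced here.

```python
def labelled_attachment_score(gold, predicted):
    """Percentage of tokens whose predicted (head, label) pair matches gold."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * correct / len(gold)


# Toy example: one (head, label) pair per token; the last head is wrong.
gold = [(2, "SUJ"), (0, "ROOT"), (2, "CD"), (3, "MOD"), (2, "CC")]
pred = [(2, "SUJ"), (0, "ROOT"), (2, "CD"), (3, "MOD"), (3, "CC")]
las = labelled_attachment_score(gold, pred)
```

In this toy case four of the five tokens are attached and labelled correctly, so the score is 80%.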

AnCora is in the CoNLL data format, as shown in Figure 1. A sentence given in that format carries all the information about the dependency tree and some other lexical information. AnCora has a dependency tag set of 20 different tags, but the frequency of most of the labels is very low. We saw that the "CD" (direct object) tag, the "CI" (indirect object) tag, and the "CC" (adjunct) tag appear in all the sentences with more than 15 wordforms.
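To make the format concrete, the following sketch reads sentences in the CoNLL-X layout, which has ten tab-separated columns per token (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and a blank line between sentences. The function name `read_conll` is hypothetical, and the sample rows are made up for illustration: only the "CD" and "CC" labels come from the tag set discussed above, while the other tags are placeholders rather than real AnCora annotations.

```python
def read_conll(text):
    """Parse CoNLL-X text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        current.append({
            "id": int(cols[0]),    # ID column
            "form": cols[1],       # FORM column
            "head": int(cols[6]),  # HEAD column, 0 for the root
            "deprel": cols[7],     # DEPREL column
        })
    if current:
        sentences.append(current)
    return sentences


# Illustrative rows; tags other than CD and CC are placeholders.
rows = [
    ("1", "Tocó", "tocar", "v", "vm", "_", "0", "ROOT", "_", "_"),
    ("2", "el", "el", "d", "da", "_", "4", "spec", "_", "_"),
    ("3", "familiar", "familiar", "a", "aq", "_", "4", "s.a", "_", "_"),
    ("4", "bulto", "bulto", "n", "nc", "_", "1", "CD", "_", "_"),
    ("5", "con", "con", "s", "sp", "_", "1", "CC", "_", "_"),
    ("6", "cuidado", "cuidado", "n", "nc", "_", "5", "sn", "_", "_"),
]
sample = "\n".join("\t".join(r) for r in rows) + "\n"
parsed = read_conll(sample)
```

Only the columns needed to rebuild the tree (ID, FORM, HEAD, DEPREL) are kept in this sketch.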

http://nextens.uvt.nl/~conll/

3 DEPENDENCY BASED TEXT SIMPLIFICATION

We propose a rule-based syntactic simplification system. It takes a dependency-parsed tree as input and is limited to a single simplification operation applied to the dependency tree: pruning the tree according to the dependency labels. The operation is applied sentence by sentence to the corpus, producing simplified versions of the sentences.

3.1 Dependency Tree Pruning

We considered which tag is the most appropriate to remove, and we focused on the small subset of three tags ("CC", "CD", and "CI") that appear in most of the sentences. The only tag that can be deleted without losing the main information of the sentence is the "CC" tag. It expresses complementary information about an action, such as when, where, how, and why, but it never reports who or what. By removing the "CC" tag we do not always lose information about, e.g., when or where, because this kind of information does not always depend on a verb.

In the following section we present our algorithm

that removes the “CC” tag from sentences tagged as a

dependency tree and produces a simpliﬁed version of

the sentence. The simpliﬁed version would be gram-

matically correct and easier to read and understand.

3.2 Pruning Algorithm

We implemented an algorithm that takes the dependency tree in the CoNLL data format and returns plain text with the simplified sentence. If the dependency tree is well-formed, with 100.0% label and dependency tag accuracy, or at least is correctly tagged for the tags that our algorithm takes into account, the resulting sentence will be grammatically correct.

The algorithm runs through the dependency tree and performs the following steps.

1. The algorithm removes all the nodes whose dependency tag is "CC".

2. The algorithm removes all the nodes whose parent was removed in step 1. It repeats step 2 while there are still nodes whose parent has been removed.

3. It generates a plain text sentence by removing all the semantic and syntactic information of the dependency tree.
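The three steps above can be sketched as follows. This is a minimal illustration and not our actual implementation: the function name `prune_cc` and the tuple layout are assumptions, and in the example tree only the "CD" and "CC" labels follow the tag set discussed earlier, while the remaining labels and the attachment of each word are placeholders.

```python
def prune_cc(tokens, label="CC"):
    """Return the sentence text after pruning every subtree rooted at `label`.

    `tokens` is a list of (id, form, head, deprel) tuples in sentence order.
    """
    # Step 1: remove every node whose dependency tag is the target label.
    removed = {tid for tid, _, _, deprel in tokens if deprel == label}
    # Step 2: keep removing nodes whose parent has already been removed,
    # until a full pass over the sentence removes nothing new.
    changed = True
    while changed:
        changed = False
        for tid, _, head, _ in tokens:
            if tid not in removed and head in removed:
                removed.add(tid)
                changed = True
    # Step 3: drop the annotations and keep only the surviving word forms.
    return " ".join(form for tid, form, _, _ in tokens if tid not in removed)


# Assumed analysis of "Tocó el familiar bulto con cuidado".
sentence = [
    (1, "Tocó", 0, "ROOT"),
    (2, "el", 4, "spec"),
    (3, "familiar", 4, "s.a"),
    (4, "bulto", 1, "CD"),
    (5, "con", 1, "CC"),
    (6, "cuidado", 5, "sn"),
]
simplified = prune_cc(sentence)
```

Under the assumed analysis, the preposition carries the "CC" tag and its dependent is removed in the second step, so only the verb and its direct object remain. A sentence without any "CC" node is returned unchanged.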

Figure 2: A tagged sentence from AnCora.

Figure 2 shows a very simple example sentence: Tocó el familiar bulto con cuidado (in English, He/She touched the familiar shape carefully). The resulting sentence is: Tocó el familiar bulto (in English, He/She touched the familiar shape). Our algorithm removes the information about how he/she touched it.

In Section 3.3 we present the evaluation design of our system, with two evaluation measures.

3.3 Evaluation Design

In this section we present two evaluation measures. The first is not an evaluation with the target group: it was carried out with a group of people who all have university studies. The second was carried out with a group of children between ten and eleven years old.

3.3.1 Questionnaire for Adults

As an evaluation measure, we surveyed a group of 20 people about how good the text simplification made by our algorithm is. They all have university studies and they are all native speakers of Spanish. None of them knows how the simplification algorithm works. We selected 20 sentences from the AnCora corpus. We showed them each original sentence together with its simplified version, and then asked them the following questions, to which they had to answer "yes" or "no":

• Q1: Is the main idea of the sentence retained?

• Q2: Was all the removed information unnecessary?

• Q3: Have only details without importance been deleted?

• Q4: Do you understand the simplified sentence better than the normal sentence?

The results of the survey are presented in Section 4.1.

3.3.2 Questionnaire for Children

As a second evaluation measure, we carried out an evaluation with the target group: a group of children between ten and eleven years old. There were 24 children, who worked in pairs, with one questionnaire per pair. We selected 20 sentences from the AnCora corpus and showed the children the simplified version and the original version. They had to answer "yes" or "no" to the following question for each sentence: Do you understand the simplified sentence better than the normal sentence?

The results of the survey are presented in Section 4.2.

4 RESULTS

In this section we present the global results of our experiment: the results of the two evaluation measures, and some global statistics on the corpus.

4.1 Results Adult Questionnaire

Table 1 shows the results of the evaluation made by the group of adults. The table gives the percentage of "yes" and "no" answers for each question.

The first question, Q1: Is the main idea of the sentence retained?, is the most important one. Over all sentences, 67.58% of the answers to Q1 were "yes". It is also important to notice that for 50% of the sentences, 86% or more of the people answered "yes". We can conclude that most of the people thought that, in most of the sentences, the main idea and the meaning of the sentence are preserved.


Table 1: Results obtained by the survey.

Question   YES      NO
Q1         67.58%   32.42%
Q2         27.66%   72.34%
Q3         46.72%   53.28%
Q4         60.76%   39.24%

If we focus on question Q2: Was all the removed information unnecessary?, people thought that not all the removed information was dispensable. This is probably because our algorithm made very aggressive simplifications in many cases. Looking at questions Q1 and Q2, we can see that most people feel that we are losing some information, but they think that the overall meaning is preserved.

Regarding the third question, Q3: Have only details without importance been deleted?, it is important to notice the differences between Q2 and Q3. A negative answer to Q3 indicates lower quality of the compressed sentence, whereas Q2 is more general: it asks whether we lose some data, which may not be highly important. Looking at the results, we can conclude that in some of the sentences where we lose some data, we are not losing the most important information.

The last question, Q4: Do you understand the simplified sentence better than the normal sentence?, asks how well people understand the simplified version compared to the normal sentence. Most of the people think that the simplified sentences are easier to read. It is important to notice that some of the sentences are not really difficult to read in the original version and, because of that, some people answered "no" to this question.

Finally, as a conclusion of the experiment, we see

that most of the people think that the main idea of the

sentences is preserved, which is one of our goals, and

they also think that the simpliﬁed version is easier to

read and understand than the original version which

is our second goal.

4.2 Results Children Questionnaire

The results of the survey on children are presented in Table 2. We had 240 answers: one answer for each of the 20 sentences from each of the 12 pairs of children. The children answered "yes" in 125 of the 240 cases. Therefore, in 52.08% of the answers the children believed that the simplified sentence was easier to read than the original version.
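The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below only restates the numbers reported here and in Section 3.3.2 (24 children working in pairs, 20 sentences, 125 "yes" answers); the variable names are our own.

```python
children = 24
pairs = children // 2        # one questionnaire per pair of children
sentences = 20
answers = pairs * sentences  # one answer per sentence and questionnaire

yes = 125
yes_pct = round(100 * yes / answers, 2)  # share of "yes" answers
no_pct = round(100 - yes_pct, 2)         # share of "no" answers
```

The resulting percentages match the YES and NO values reported for the children's questionnaire.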

Table 2: Results obtained by the survey.

Children   YES      NO
24         52.08%   47.92%

We can see the differences between the results for question Q4 in the first evaluation measure and the results given by the survey in this evaluation measure. In question Q4 the respondents are not in the target group, so they cannot say that they understand the simplified sentences better, because they understand both versions at the same level. In this second evaluation measure the children may have some problems understanding the sentences fluently, so our system can help them to understand the information better. In fact, children may have difficulty understanding even the simplified sentence, because they are not able to read some difficult concepts that are present in both the original version and the simplified version.

We can conclude that our system helps the group

of children to understand the sentences better, which

is our main goal.

4.3 Overall Statistics in Corpus

In this subsection we show the results obtained after simplifying the whole corpus, applying our algorithm sentence by sentence. The AnCora corpus has 3,512 sentences, and the algorithm simplifies 2,737 of them, that is, 77.93% of the total. The algorithm did not simplify the whole corpus, because sentences that do not have a "CC" tag are not simplified. The results of the experiment are given in Table 3, which shows the number of wordforms, the average sentence length, and the longest sentence length for the original corpus and the simplified corpus.

Table 3: Results on Sentence Length (SL).

                  Original   Simplified
Total Wordforms   95,028     58,415
Average SL        27.06 wf   16.63 wf
Longest SL        143 wf     94 wf
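The averages and the coverage figure follow directly from the corpus totals, as the short check below shows. It uses only numbers reported in this subsection; the variable names are our own, and the last value, the share of wordforms removed, is a derived figure (about 38.5%) that is not reported in the table.

```python
total_sentences = 3512
simplified_sentences = 2737
original_wf = 95028
simplified_wf = 58415

# Share of sentences that contain a "CC" tag and are therefore simplified.
coverage = round(100 * simplified_sentences / total_sentences, 2)

# Average sentence length in wordforms, before and after simplification.
avg_original = round(original_wf / total_sentences, 2)
avg_simplified = round(simplified_wf / total_sentences, 2)

# Derived figure: share of all wordforms removed by the pruning.
removed_share = round(100 * (original_wf - simplified_wf) / original_wf, 2)
```

Both averages are taken over all 3,512 sentences, including those that the algorithm leaves unchanged.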

5 CONCLUSIONS AND FUTURE WORK

The potential of text simplification systems, for example in education, is evident. For students, it is a first step towards more effective learning. For people with poor literacy, we see text simplification as a first step towards social inclusion, facilitating and developing the reading and writing skills needed to interact in society. The social impact of text simplification is undeniable.


Our system is a first approximation to an automatic system that runs through dependency trees and returns a simplified version of the parsed sentence. We can conclude that it is possible to simplify texts correctly using dependency parsing, in the particular case of Spanish. The simplified sentence is grammatically correct. On the other hand, whichever label is chosen, an algorithm that prunes by dependency label makes aggressive simplifications.

We made a simple version of the algorithm and focused only on the dependency tree. We are working to increase the number of simplification operations, and some future work might be oriented towards defining lexical simplification operations, such as replacing "difficult" words with a simpler synonym using a version of WordNet (Fellbaum, 1998) for Spanish such as EuroWordNet (Vossen, 1998).

REFERENCES

Bautista, S., Gervás, P., and Madrid, R. (2009). Feasibility Analysis for SemiAutomatic Conversion of Text to Improve Readability. In The Second International Conference on Information and Communication Technologies and Accessibility.

Buchholz, S. and Marsi, E. (2006). CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL-X), pages 149–164.

Canning, Y. (2000). Cohesive simpliﬁcation of newspaper

text for aphasic readers. In 3rd annual CLUK Doc-

toral Research Colloquium.

Caseli, H. M., Pereira, T. F., Specia, L., Pardo, T. A. S., Gasperin, C., and Aluisio, S. M. (2009). Building a Brazilian Portuguese Parallel Corpus of Original and Simplified Texts. In Proceedings of CICLing.

Chandrasekar, R., Doran, C., and Srinivas, B. (1996). Motivations and methods for text simplification. In Proceedings of the Sixteenth International Conference on Computational Linguistics (COLING '96), pages 1041–1044.

Chandrasekar, R. and Srinivas, B. (1997). Automatic induc-

tion of rules for text simpliﬁcation. Knowledge-Based

Systems, 10.

Devlin, S. and Tait, J. (1998). Linguistic Databases, chapter The use of a Psycholinguistic database in the Simplification of Text for Aphasic Readers, pages 161–173. CSLI.

Devlin, S. and Unthank, G. (2006). Helping aphasic people

process online information. In Assets ’06: Proceed-

ings of the 8th international ACM SIGACCESS con-

ference on Computers and accessibility, pages 225–

226, New York, NY, USA. ACM.

Fellbaum, C., editor (1998). WordNet: an electronic lexical

database. MIT Press.

Inui, K., Fujita, A., Takahashi, T., Iida, R., and Iwakura,

T. (2003). Text simpliﬁcation for reading assistance:

a project note. In Proceedings of the second in-

ternational workshop on Paraphrasing, pages 9–16,

Morristown, NJ, USA. Association for Computational

Linguistics.

Klebanov, B. B., Knight, K., and Marcu, D. (2004).

Text simpliﬁcation for information-seeking applica-

tions. In On the Move to Meaningful Internet Systems,

Lecture Notes in Computer Science, pages 735–747.

Springer Verlag.

Kübler, S., McDonald, R. T., and Nivre, J. (2009). Dependency Parsing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Max, A. (2006). Writing for language-impaired readers. In

CICLing, pages 567–570.

Mcdonald, R. K. R., Nilsson, J., Riedel, S., and Yuret, D. (2007). The CoNLL 2007 shared task on dependency parsing.

Palomar, M., Civit, M., Díaz, A., Moreno, L., Bisbal, E., Aranzabe, M., Ageno, A., Martí, M., and Navarro, B. (2004). 3LB: Construcción de una base de datos de árboles sintáctico–semánticos para el catalán, euskera y español. In Proceedings of the XX Conference of the Spanish Society for Natural Language Processing (SEPLN), pages 81–88. Sociedad Española para el Procesamiento del Lenguaje Natural.

Petersen, S. E. and Ostendorf, M. (2007). Text simplification for language learners: a corpus analysis. In Proc. of Workshop on Speech and Language Technology for Education.

Siddharthan, A. (2002). Resolving attachment and clause boundary ambiguities for simplifying relative clause constructs. In Proceedings of the Student Research Workshop, 40th Meeting of the Association for Computational Linguistics.

Siddharthan, A. (2003). Syntactic Simpliﬁcation and Text

Cohesion. PhD thesis, Research on Language and

Computation.

Snow, C. E. et al. (2002). Reading for understanding: toward an R&D program in reading comprehension. Rand, Santa Monica, CA.

Taulé, M., Martí, M., and Recasens, M. (2008). AnCora: Multilevel Annotated Corpora for Catalan and Spanish. In Proceedings of 6th International Conference on Language Resources and Evaluation.

Vossen, P., editor (1998). EuroWordNet: a multilingual

database with lexical semantic networks. Kluwer

Academic Publishers, Norwell, MA, USA.

Williams, S., Reiter, E., and Osman, L. M. (2003). Experiments with discourse-level choices and readability. In Proceedings of the European Natural Language Generation Workshop (ENLG) and 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL03), pages 127–134.
