6. Use the generative grammar for sentence generation from the resultant parse trees. We used the MontyLingua-based language generator, modified with additional rules for better performance and output quality.
7. Rank the sentences using the top-level information from the topic models to obtain highly probable and salient handles for extraction. These handles point to the markers in the given sentence from the input documents, focusing attention. Select the top-scoring 10% of sentences (a sketch of this step is given after the list).
8. Reorder the sentences to reduce the dependency-list complexity, using information from the corresponding parse tree.
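As an illustration of step 7, the following Python sketch scores sentences against per-word salience scores assumed to have been produced by the topic model; the scoring function, the handle look-up and the 10% cut-off parameter are illustrative assumptions rather than the exact implementation used here.

def rank_and_select(sentences, salience_scores, keep_ratio=0.10):
    """Score each sentence by the salience of the handle words it contains
    and keep the top `keep_ratio` fraction of sentences."""
    def score(sentence):
        return sum(salience_scores.get(w, 0.0) for w in sentence.lower().split())

    ranked = sorted(sentences, key=score, reverse=True)
    k = max(1, int(len(ranked) * keep_ratio))
    return ranked[:k]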
We used a 6 GB dump of English Wikipedia articles² and processed it to build a context-information database in the following way.
Algorithm 2: Wikipedia context preprocessing
1. For each word w (excluding stop-words) in every sentence of the articles, create a context set S_w containing the words surrounding w.
   (a) Add S_w to the database with w as the keyword.
   (b) If w already exists, update the database with the context set S_w.
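A minimal Python sketch of Algorithm 2 is given below. It assumes the dump has already been split into sentences; the stop-word list and the context window size are illustrative placeholders, not values taken from our implementation.

from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}  # abbreviated list
WINDOW = 3  # number of words taken on each side of w (illustrative choice)

def build_context_db(sentences):
    """Map each non-stop-word w to its context set S_w of surrounding words."""
    db = defaultdict(set)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, w in enumerate(words):
            if w in STOP_WORDS:
                continue
            context = words[max(0, i - WINDOW):i] + words[i + 1:i + 1 + WINDOW]
            db[w].update(c for c in context if c not in STOP_WORDS)
    return db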
3.4 Example
We tested our algorithm on a sample of 30 documents
from the DUC 2001 Corpus of about 800 words each.
An extract from the document d04a/FT923-5089 is
reproduced below.
There are growing signs that Hurricane Andrew, unwelcome as it was for the devastated inhabitants of Florida and Louisiana, may in the end do no harm to the re-election campaign of President George Bush.
The parse for this sentence is given in Figure 1, and the dependency structure in Figure 2. The original sentence has been split into three sentences by identifying the cohesive subtrees and reordering them in accordance with our algorithm. Some candidate words for lexical simplification are devastated and inhabitants. The present output does not reflect the possibilities for rephrasing these words, although lexically simpler replacements exist in our databases. Certain technical deficiencies in our software in this connection are presently being rectified.
The system output is given below:
² http://dumps.wikipedia.org/enwiki
Hurricane Andrew was unwelcome. Unwelcome for the devastated inhabitants of Florida and Louisiana. In the end did no harm to the re-election campaign of President George Bush.
4 EXPERIMENTS AND RESULTS
In the phase involving identification of difficult words, we used a Support Vector Machine (SVM) and conducted experiments with training sets of 1,100, 5,500 and 11,000 words, in which a 10:1 ratio of difficult-labelled to simple-labelled words was used for training. The SVM reported an accuracy of more than 86% (Table 1). The training set was deliberately overloaded with difficult-word samples because, by Zipf's law, a large number of words are rarely used; the corpus therefore does not provide enough samples of them, even though these words do appear sporadically. Experiments with varying numbers of pre-identified, commonly used words (selected by rank) can be conducted as follows. For the baseline we fixed the first n = 20,000 ranked words as simple. Without loss of generality, n can be increased and the training runs repeated. The intuition behind fixing the ratio at 10:1 for our experiments is that the first 20,000 ranked words (which we considered "simple") account for more than 90% of usage among the over 200,000 words in our word set.
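The following sketch illustrates this kind of training setup, assuming each word is reduced to a small numeric feature vector. The featurize function is a placeholder, and the 10:1 difficult-to-simple proportion is assumed to hold in the input lists rather than being enforced by the code.

from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def featurize(word, rank):
    # placeholder features: corpus frequency rank, length, vowel count
    return [rank, len(word), sum(ch in "aeiou" for ch in word)]

def train_difficulty_classifier(difficult, simple_ranked):
    # difficult: list of (word, corpus_rank) pairs, labelled 1 (difficult);
    # simple_ranked: the first 20,000 ranked words, labelled 0 (simple).
    X = ([featurize(w, r) for w, r in difficult]
         + [featurize(w, r) for r, w in enumerate(simple_ranked)])
    y = [1] * len(difficult) + [0] * len(simple_ranked)
    clf = SVC(kernel="rbf")
    print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())
    return clf.fit(X, y)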
In the simplification phase, we computed the Euclidean length of the vectors corresponding to both difficult and simple words, treating them as points in a five-dimensional space. We further hypothesized that an increase in the Euclidean length of a vector indicates an increase in the complexity of the word. Hence we filtered out those rules in which the Euclidean length of the difficult word was less than that of the simple word.
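A sketch of this filtering step is shown below, assuming the five-dimensional feature vectors are available in a dictionary keyed by word; the rule format (difficult word, proposed simple replacement) is an assumption made for illustration.

import math

def euclidean_length(vec):
    return math.sqrt(sum(x * x for x in vec))

def filter_rules(rules, vectors):
    """Keep only rules whose difficult word has a vector at least as long
    as that of its proposed simple replacement."""
    return [(difficult, simple) for difficult, simple in rules
            if euclidean_length(vectors[difficult]) >= euclidean_length(vectors[simple])]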
To verify our hypothesis we conducted the experiment on different subsets of the Brown corpus of 100,000, 400,000, 700,000 and 1,100,000 words and found it to be consistent. This was followed by the replacement task, which involves two possibilities. In the first, the difficult word has only one sense; since there is only one sense, the word is replaced by its simple equivalent irrespective of context. In the second, we found the sentence context of the difficult word and the sentence contexts of all possible simple words from the processed Wikipedia, and chose the simple word whose context best matched the sentence context in terms of word intersection.
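The context-matching step for the second possibility can be sketched as follows, assuming the context database built by Algorithm 2 is available; the candidate list is hypothetical, while the overlap criterion follows the description above.

def best_simple_replacement(sentence_words, candidates, context_db):
    """Choose the simple candidate whose stored Wikipedia context set has the
    largest intersection with the words of the current sentence."""
    sentence_context = set(sentence_words)
    best, best_overlap = None, -1
    for simple in candidates:
        overlap = len(sentence_context & context_db.get(simple, set()))
        if overlap > best_overlap:
            best, best_overlap = simple, overlap
    return best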
We evaluated the mean readability measures of the original document and the system-generated output