these elements of the text quality and develop a tool
for testing it.
At the same time, in many cases, the text sources
used by enterprise information systems address a
global audience. That is, they are written in English
and then distributed worldwide. To estimate the
readability of such Internet resources, we propose
adding linguistic features to the traditional
readability indexes, such as the use of one-word
verbs instead of verb phrases or the use of only the
international spelling of terms.
As an example of texts intended for a worldwide
audience, we use our corpus of Wikipedia articles.
Currently, there are a number of approaches to the
quality assessment of Wikipedia articles
(Lewoniewski, 2017). The quality assessment of
Wikipedia texts has become the subject of
studies in various fields of science. In 2006 one of the
co-founders of the online non-profit encyclopedia
Wikipedia suggested concentrating on the quality of
the articles instead of their number (Giles, 2005).
The best Wikipedia articles must follow specific
style guidelines, and the rating system depends on
the language edition. For example, in the English
Wikipedia, whose articles we examine in this study,
the article quality system has 9 grades: FA (Featured
Article), A, GA (Good Article), B, C, Start, Stub,
FL (Featured List), List. Each of these grades has
its own criteria. For instance, the criteria include
the relevance, informativeness and encyclopedicness
(Khairova, 2018) of the information, the correctness
of the text's spelling and grammar, and some others.
However, to date, all of these criteria are assessed
manually by the Wikipedia communities.
In our study, we consider the influence of
readability, and of some particular features of texts
written for a global audience, on text quality
assessment.
To estimate the influence of different linguistic
and statistical features on text readability, we
decided to use five different text corpora.
2 RELATED WORK
The readability concept was introduced in the 1920s
and refers to how easily a text can be read. Until the
late 1980s, the readability concept was used by
educators to identify the complexity of tutorials and
textbooks. The educators discovered a way to use
vocabulary difficulty and sentence length to predict
the difficulty level of a text (DuBay, 2004).
At present, readability is one of the dimensions
of text information quality, and it matters in every
profession where people need high-quality
information and knowledge. The five best-known
readability indexes are Flesch Reading Ease
(Cotugna, 2005), Flesch-Kincaid Grade Level, ARI
(Oosten, 2010), SMOG (Hedman, 2008) and FOG
(Walsh, 2008).
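To make the five indexes concrete, the following minimal sketch computes them from basic text counts using their commonly published formulas. It is an illustration and not the tool developed in this study: the names TextCounts, count_syllables and count_text are hypothetical, the syllable counter is a naive vowel-group heuristic, and FOG "complex words" are approximated here by words of three or more syllables.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class TextCounts:
    sentences: int
    words: int
    characters: int     # letters and digits only
    syllables: int
    polysyllables: int  # words with three or more syllables


def count_syllables(word: str) -> int:
    """Naive heuristic (an assumption): one syllable per run of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def count_text(text: str) -> TextCounts:
    """Collect the counts the formulas need; assumes non-empty English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9']+", text)
    syllables = [count_syllables(w) for w in words]
    return TextCounts(
        sentences=len(sentences),
        words=len(words),
        characters=sum(ch.isalnum() for w in words for ch in w),
        syllables=sum(syllables),
        polysyllables=sum(1 for s in syllables if s >= 3),
    )


def flesch_reading_ease(c: TextCounts) -> float:
    # Higher score means easier text.
    return 206.835 - 1.015 * (c.words / c.sentences) - 84.6 * (c.syllables / c.words)


def flesch_kincaid_grade(c: TextCounts) -> float:
    # Result is an approximate US school grade level.
    return 0.39 * (c.words / c.sentences) + 11.8 * (c.syllables / c.words) - 15.59


def automated_readability_index(c: TextCounts) -> float:
    # ARI uses characters per word instead of syllables.
    return 4.71 * (c.characters / c.words) + 0.5 * (c.words / c.sentences) - 21.43


def smog_grade(c: TextCounts) -> float:
    # Polysyllable count is normalized to a 30-sentence sample.
    return 1.0430 * math.sqrt(c.polysyllables * 30 / c.sentences) + 3.1291


def gunning_fog(c: TextCounts) -> float:
    # Assumption: complex words are approximated by polysyllabic words.
    return 0.4 * ((c.words / c.sentences) + 100 * (c.polysyllables / c.words))


if __name__ == "__main__":
    counts = count_text("The readability concept was introduced in the 1920s. "
                        "It refers to how easily a text can be read.")
    for index in (flesch_reading_ease, flesch_kincaid_grade,
                  automated_readability_index, smog_grade, gunning_fog):
        print(f"{index.__name__}: {index(counts):.2f}")
```

All five formulas depend only on sentence length and on word length measured in characters or syllables, which is why they are easy to compute but blind to the linguistic features discussed above.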
Generally, more recent methods build on the
well-known indexes and do not show a reliable
advantage of any one of them.
For instance, Pitler and Nenkova (Pitler, 2008)
ranked the influence of various readability factors on
predicting the readability of a text and the text quality.
Schwarm and Ostendorf (Schwarm, 2005)
proposed a new method for finding English texts of a
certain readability level, which combines the widely
known readability indexes with statistical language
models, support vector machines and other language
processing tools. Their research showed that
combining information from statistical LMs with
other features using support vector machines
provided the best results.
The authors of the next study (Oosten, 2010) used 4
corpora in two languages, Dutch and English, to find
the correspondences between the readability formulas
and the variables that are used in them. They
concluded that it was not reasonable to expect that
formulas based on language-independent features can
precisely predict the readability level.
It is interesting that many studies of readability
analyze texts devoted to health care. In our opinion,
this is because such texts must be understandable to
as many readers as possible. In medicine, it is
extremely important that texts with such information
correspond to the reading level of the average reader.
The United States Department of Health and Human
Services identified this level as the 7th grade
(D'alessandro, 2001).
According to the article (D'alessandro, 2001), the
average reading level in the USA is eighth to ninth
grade. However, the medical education materials
examined were too complex for average adults, which
means that such materials should be written at a
lower grade level to be understandable. This
conclusion was based on the results of the 2 most
widely used indexes, the Flesch Reading Ease score
and the Flesch-Kincaid Grade Level, which were used
to evaluate one hundred documents from 100 different
Web sites. The result was that pediatric patient
education materials on the Internet were not written
at an appropriate reading level for the average adult.