The pages parameter from the formula above
represents the number of indexed pages written in
the language in use. For English, this number can
easily be estimated by sending the word “the” to the
search engine and noting the number of hits. At the
time this paper was written, more than 11 billion
pages written in English were indexed by Google.
This value is detected automatically every time the
application is launched.
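As a minimal sketch of this start-up step (the paper does not name the search API, so the hit-count lookup below is a hypothetical hook rather than a real library call):

```python
def estimate_pages(hit_count, frequent_word="the"):
    # At application launch, query the search engine for a very frequent
    # word in the target language and take the reported number of hits as
    # the `pages` parameter. `hit_count` is a hypothetical hook standing
    # in for whatever search-engine API the deployed system actually uses.
    return hit_count(frequent_word)
```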
The second filter is applied when a low number
of co-occurrences is obtained (at most 500 hits).
Here, the parameter beta takes the value 1.05 in
order to provide stricter filtering than normal, since
fewer hits for the co-occurrence of the two chunks
imply a greater probability of a malapropism. The
filtering therefore depends not on the input text, but
on the number of hits for the co-occurrence of the
two chunks.
The third filter applies to co-occurrences that
have between 501 and 12,000 hits, and is the filter
used most often. Here, beta takes the value 1 instead
of 1.05, because this is considered the regular filter
in terms of permissiveness. Beyond this point, the
filters become increasingly permissive: as the
number of hits for the co-occurrence of the two
chunks grows, the probability of a malapropism
decreases.
The fourth filter is applied when no_combined is
between 12,001 and 14,000; here, beta takes the
value 0.95. The fifth filter lowers the probability of
signalling a malapropism further, setting beta to 0.9,
and is applied to chunks whose no_combined is
between 14,001 and 15,000. Finally, the most
permissive filter is applied when no_combined is
between 15,001 and 16,000, with beta set to 0.8.
Above this final threshold (16,000), no possible
malapropisms are signalled: with such a large
number of hits, one cannot reliably tell whether a
malapropism occurred or the text merely contains a
less frequent combination of two very popular
chunks.
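The whole cascade amounts to a mapping from no_combined to beta. The sketch below simply restates the thresholds from the text in Python; how beta then enters the detection formula is given by the formula above and is not repeated here:

```python
def beta_for(no_combined):
    """Map the co-occurrence hit count to the filtering coefficient beta,
    using the empirically determined thresholds from the text."""
    if no_combined <= 500:      # second filter: strictest
        return 1.05
    if no_combined <= 12000:    # third filter: the regular case
        return 1.0
    if no_combined <= 14000:    # fourth filter
        return 0.95
    if no_combined <= 15000:    # fifth filter
        return 0.9
    if no_combined <= 16000:    # sixth, most permissive filter
        return 0.8
    return None                 # above 16,000 hits, nothing is signalled
```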
The thresholds presented above and the
coefficient for the co-occurrence of the two chunks
on which these filters depend have been determined
empirically; they are language and time dependent,
but text independent. First, the values depend on the
language, because the number of pages written in
different languages is not the same. These values
were determined for English; if the language is
changed, the value of the pages parameter changes,
and so do the values of no_pages1, no_pages2 and
no_combined, so the thresholds are no longer
accurate. The values are also time dependent,
because the Internet is continuously expanding: the
number of written pages available to the search
engines keeps increasing and, at the same time, the
probability of finding incorrect text also increases,
affecting the thresholds of the presented filters.
Considering the large number of queries that are
sent to the search engine, we also investigated the
possibility of using the Google 5-grams corpus
“Web 1T 5-gram Version 1” (Brants and Franz,
2006) instead of sending our queries to the search
engine. Besides its very large size (30 GB of
compressed text), which makes it difficult to
integrate into any application, we observed another
drawback of this corpus: the document n-grams
were not completely covered by the corpus n-grams;
the coverage varied from 90% in the case of bigrams
down to 15% in the case of 5-grams. Moreover, the
nature of our application made us abandon this
corpus, because the application does not know a
priori the degree of the n-grams that are going to be
used, since this is determined dynamically by the
pseudo-chunker.
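A coverage figure of this kind can be measured with a simple set intersection. The sketch below assumes the document's n-grams and the corpus n-grams are already available as plain collections, which glosses over the lookup into the actual 30 GB corpus:

```python
def ngram_coverage(doc_ngrams, corpus_ngrams):
    # Fraction of the document's distinct n-grams that also occur in the
    # corpus (e.g. roughly 0.90 for bigrams and 0.15 for 5-grams against
    # Web 1T, per the figures reported above).
    doc = set(doc_ngrams)
    return len(doc & set(corpus_ngrams)) / len(doc) if doc else 0.0
```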
The purpose of this module is to limit the number
of misses in malapropism detection as much as
possible. The malapropisms signalled by this
module should cover all the real malapropisms that
exist in the text. The module also signals many false
alarms, but these are evaluated in the next module
and some of them are discarded.
3.2 Malapropisms Correction
This module has two main purposes. The first is
to determine which of the malapropisms signalled in
the previous step are false alarms, in order to
eliminate them. The second is to detect the most
probable candidates for the remaining malapropisms,
in order to correct the errors. The module uses all
three technologies: the paronyms dictionary, to
identify candidates for correcting the possible
malapropisms; lexical chains, to filter the list of
candidates down to those that fit the context; and,
finally, the search engine, to decide which candidate
is best when several fit the lexical chains.
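A hedged sketch of this candidate-selection logic follows; all five parameters are hypothetical hooks, since the paper does not expose the internal interfaces of the paronyms dictionary, the lexical-chain filter or the search-engine wrapper:

```python
def pick_correction(suspect, context, paronyms, fits_chains, hits):
    # suspect:     the word flagged as a possible malapropism
    # context:     the surrounding chunk pair as a format string with one
    #              slot for a candidate word, e.g. "a {} of errors"
    # paronyms:    dict mapping a word to its paronym candidates
    # fits_chains: predicate telling whether a candidate fits the
    #              document's lexical chains (the WordNet-based filter)
    # hits:        search-engine hit count for a query string

    # Lexical chains narrow the paronym candidates down to those that
    # fit the context.
    candidates = [c for c in paronyms.get(suspect, []) if fits_chains(c)]
    if not candidates:
        return None  # no candidate fits; the alarm may well be false
    # If several candidates fit the lexical chains, the search engine
    # breaks the tie: keep the candidate whose substituted chunk pair is
    # most frequent on the web.
    return max(candidates, key=lambda c: hits(context.format(c)))
```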
The module also works sequentially, analyzing
every pair of two chunks of words and deciding
whether a malapropism or a false alarm has been
found and, in the case of a malapropism, what the
replacement word should be. If the pair contains no
signalled malapropism, the process continues with
the next chunk, until a signalled malapropism is
reached.