
 
The pages parameter from the formula above represents the number of indexed pages written in the language under consideration. For English, this number can easily be found by sending the word “the” to the search engine and reading the reported number of hits. At the time this paper was written, more than 11 billion pages written in English were indexed by Google. This value is detected automatically every time the application is launched.
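For illustration, the sketch below shows how this start-up detection could look in Python; the search_hit_count helper is a hypothetical stand-in for whatever interface is used to read the hit count reported by the search engine, since the paper does not specify one.

def detect_pages_parameter(search_hit_count):
    # Estimate the number of indexed pages written in the target
    # language. For English, the hit count of the most frequent
    # word, "the", serves as a proxy (over 11 billion pages at the
    # time of writing).
    return search_hit_count("the")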
The second filter is applied when a low number of co-occurrences is obtained (fewer than 500 hits). Here, the beta parameter takes the value 1.05 in order to provide tougher filtering than normal, since fewer hits for the co-occurrence of the two chunks imply a greater probability of a malapropism. The filtering therefore depends not on the input text, but on the number of hits for the co-occurrence of the two chunks.
The third filter applies to co-occurrences that have between 501 and 12,000 hits, and it is the filter used most often. For this filter, beta takes the value 1 instead of 1.05, because it is considered the regular filter in terms of permissiveness. Beyond this range the filters become progressively more permissive, since as the number of hits for the co-occurrence of the two chunks grows, the probability of a malapropism decreases.
The fourth filter is applied when no_combined is between 12,001 and 14,000; here, beta’s value is 0.95. The fifth filter assumes an even lower probability of a malapropism by setting beta to 0.9, and is applied to chunks whose no_combined is between 14,001 and 15,000. Finally, the most permissive filter, with beta set to 0.8, is applied when no_combined is between 15,001 and 16,000.
Above this final threshold (16,000 hits), no possible malapropisms are signalled: with such a large number of hits, one cannot reliably tell whether a malapropism occurred or the text simply contains a less frequent combination of two very popular chunks.
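As a compact summary of the five filters and the final threshold, the following Python sketch maps no_combined to its beta coefficient; the function name and the None return value for the no-signal region are our own conventions, and the unstated boundary value of exactly 500 hits is grouped here with the lower range.

def select_beta(no_combined):
    # Map the hit count of the co-occurrence of the two chunks to
    # the empirically determined beta coefficient, or to None when
    # no malapropism should be signalled at all.
    if no_combined <= 500:      # second filter: tougher than normal
        return 1.05
    if no_combined <= 12000:    # third filter: the regular one
        return 1.0
    if no_combined <= 14000:    # fourth filter
        return 0.95
    if no_combined <= 15000:    # fifth filter
        return 0.9
    if no_combined <= 16000:    # sixth, most permissive filter
        return 0.8
    return None                 # above 16,000 hits: never signal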
The thresholds presented above and the coefficient for the co-occurrence of the two chunks on which these filters depend have been determined empirically; they are language and time dependent, but text independent. First of all, the values depend on the language, because the number of pages written in different languages is not the same. These values have been determined for English; if the language changes, the value of the pages parameter changes as well, and so do the values of no_pages1, no_pages2 and no_combined, so the thresholds are no longer accurate. The values are also time dependent, because the Internet is continuously expanding: the number of written pages available to the search engines keeps increasing and, at the same time, the probability of finding incorrect text also increases, affecting the thresholds of the presented filters.
Considering the large number of queries sent to the search engine, we also investigated the possibility of using the Google 5-grams corpus, “Web 1T 5-gram Version 1” (Brants and Franz, 2006), instead of sending our queries to the search engine. Besides its very large size (30 GB of compressed text), which makes it difficult to integrate into any application, we observed another drawback of this corpus: the document n-grams were not completely covered by the corpus’ n-grams, the coverage varying from 90% in the case of bigrams to 15% in the case of 5-grams. Moreover, the nature of our application made us abandon this corpus, because the application does not know a priori the degree of the n-grams that are going to be used, since this is determined dynamically by the pseudo-chunker.
The purpose of this module is to limit as much as possible the number of misses in malapropism detection. The malapropisms signalled by this module should cover all the real malapropisms present in the text. The module also signals many false malapropisms, but these are evaluated in the next module and some of them are discarded there.
3.2 Malapropisms Correction 
This module has two main purposes. The first is to determine which of the malapropisms signalled in the previous step are false alarms, in order to eliminate them. The second is to detect the most probable candidates for the remaining malapropisms, in order to correct the errors. This module uses all three technologies: the paronyms dictionary, to identify candidates for correcting the possible malapropisms; lexical chains, to filter the list of candidates and keep those that fit the context; and, finally, the search engine, to decide which candidate is best when more than one fits the lexical chains.
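A high-level sketch of how the three technologies could combine for a single signalled malapropism is given below; all helper names (paronyms, fits_lexical_chain, hit_count) are illustrative placeholders rather than the paper’s actual interfaces, and returning None models the elimination of a false alarm.

def correct_malapropism(word, context,
                        paronyms, fits_lexical_chain, hit_count):
    # 1. Paronyms dictionary: gather replacement candidates.
    candidates = paronyms(word)
    # 2. Lexical chains: keep only candidates that fit the context.
    in_context = [c for c in candidates if fits_lexical_chain(c, context)]
    if not in_context:
        return None  # nothing fits: treat the signal as a false alarm
    # 3. Search engine: when several candidates fit, choose the one
    # whose co-occurrence with the context has the most hits.
    return max(in_context, key=lambda c: hit_count(c, context))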
The module also works sequentially, analyzing every pair of two chunks of words and deciding whether a malapropism or a false alarm has been found and, in the case of a malapropism, what the replacement word should be. If the pair contains no signalled malapropisms, then the process continues with the next chunk, until a signalled malapropism is encountered.