etc.
These particularities prevent the effective use of
common Natural Language Processing (NLP)
techniques and hinder their use as “automatic
information providers”, as shown by Carvalho and
Curto (2014). Here we focus on improving the
process of automatic word error detection and
correction in such databases.
Error detection and correction is something that
everyone has become familiar with since the advent
of word processors. Nowadays, most of the
(computer / tablet / smartphone) written texts are
automatically corrected in “real-time” (i.e., as they
are being generated). However, everyone knows
that, despite the large recent advances, not all
corrections are proper, especially when less common
words (e.g. technical terms, named entities, etc.) are
being used. Also, most word correction tools are
limited to one or two errors per word. The capability
of humans to adapt very fast to new situations allows
them to detect most unwanted corrections as they are
proposed, and therefore react immediately. So, the
problem of word error detection and consequent
correction is basically non-existent when performed
in “real time” (and as long as the used vocabulary is
well known). However, if texts were not properly
corrected as they were created, then correcting them
later is a complex and expensive task that must
usually be done manually or, even when automated,
demands significant human intervention. This is
especially relevant for unedited technical text. In the
case of Big Data text databases, this task must
somehow be automated, since the size of the
database makes manual offline text editing
prohibitively expensive.
In this paper we propose a fuzzy-based semi-
automatic method to address the large number of
word errors contained in unedited Big Data text,
focusing specifically on the MIMIC II database.
2 THE MIMIC II DATABASE
The developed work uses data from the
Multiparameter Intelligent Monitoring in Intensive
Care (MIMIC II) database (Saeed, 2002). This is a large
database of ICU patients admitted to the Beth Israel
Deaconess Medical Center, collected from 2001 to
2006, and that has been de-identified by removal of
all Protected Health Information. The MIMIC II
database currently comprises 26,655 patients, of
which 19,075 are adults (>15 years old at time of
admission). It includes high-frequency sampled data
of bedside monitors, clinical data (laboratory tests,
physicians’ and nurses’ notes, imaging reports,
medications and other input/output events related to
the patient) and demographic data. From the
available data, and for this particular problem, we
are mainly interested in the physicians’ and nurses’
notes.
The MIMIC II text database contains a total of
156 million words with 3 or more characters, of
which 260,180 are distinct. Of these 260,180 distinct
words, only 31,527 (12%) appear in known word
lists: 30,828 appear on the SIL list of known English
words (which contains 109,582 distinct words) (SIL,
2014), and 429 appear on additional lists containing
medical terms not common in general English. The
remaining 228,923 words are simply unknown to
dictionaries, and most are the result of typing or
linguistic errors.
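The vocabulary profiling described above can be sketched in a few lines. The snippet below is a minimal, illustrative stand-in: `notes` and `wordlist` are toy placeholders for the MIMIC II notes and the SIL word list, which are not reproduced here.

```python
import re
from collections import Counter

def profile_vocabulary(text, known_words):
    """Count words with 3+ characters and split the distinct
    vocabulary into dictionary-known and unknown words."""
    tokens = re.findall(r"[a-z]{3,}", text.lower())
    counts = Counter(tokens)
    known = {w for w in counts if w in known_words}
    unknown = set(counts) - known
    return counts, known, unknown

# Toy stand-ins for the notes corpus and the reference word list.
notes = "Patient abdomen soft. abdomin soft nontender. abd soft"
wordlist = {"patient", "abdomen", "soft", "nontender"}
counts, known, unknown = profile_vocabulary(notes, wordlist)
print(unknown)  # the "unknown to dictionaries" bucket
```

Applied to the full database, the unknown bucket is where the 228,923 out-of-dictionary words land.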
As an example of the extent of such errors, here
is a non-exhaustive list of distinct misspelled
variants of the word “abdomen” found in the
MIMIC II database: abadomen, abdaomen,
abndomen, badomen, abdeomen, abdcomen,
abdemon, abdeom, abdoem, abdmoen, abdiomen,
abdman, abdmen, abdme, abddmen, abbomen,
abdmn, abdmonen, abdonem, abdoben, abdodmen,
abdoemen, abdomin. It should be noted that these
errors are not isolated, e.g., the incorrect form
“abdomin” appears 1,968 times in the database.
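What makes such variants recoverable is that they remain close, in string-similarity terms, to the intended word. As a minimal illustration (the mini-lexicon below is a made-up stand-in, not part of MIMIC II), Python's standard `difflib` already maps several of the observed variants back to “abdomen”:

```python
import difflib

# Toy lexicon standing in for a real medical word list (illustrative only).
lexicon = ["abdomen", "admission", "patient", "nontender"]

# A few of the misspelled variants observed in the database.
for typo in ["abdomin", "badomen", "abdaomen"]:
    # Return the best lexicon entry whose similarity ratio is >= 0.75.
    match = difflib.get_close_matches(typo, lexicon, n=1, cutoff=0.75)
    print(typo, "->", match)
```

All three variants are matched to “abdomen” at this cutoff; the harder cases are the shorter, more mangled forms, which motivate the fuzzy method proposed here.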
3 RELATED WORK
A typographical error, colloquially known as a
“typo”, is a mistake made in the typing process. Most
typographical errors consist of substitution,
transposition, duplication or omission of a small
number of characters. Damerau (1964) considered
that a simple error consists of exactly one of these
operations. Nowadays, many other types of errors
can be found in text databases: errors associated
with smaller keyboards, which have increased the
number of word typos; errors due to the widespread
use of blogs, microblogs, instant messaging, etc.;
errors associated with real-time voice transcription;
errors associated with poor Optical Character
Recognition when digitizing manuscripts; and so on.
One must also mention linguistic errors, which are
mostly due to lack of culture and/or education, and
are usually the result of phonetic similarities.
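The elementary operations above (substitution, transposition, duplication/insertion, omission) are exactly those counted by the Damerau-Levenshtein edit distance; Damerau's “simple errors” are the words at distance 1. A minimal sketch of its restricted (optimal string alignment) variant:

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment)
    distance: minimum number of substitutions, insertions, deletions
    and adjacent transpositions needed to turn a into b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                          # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j                          # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution/match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(osa_distance("abdomin", "abdomen"))  # one substitution -> 1
print(osa_distance("badomen", "abdomen"))  # one transposition -> 1
```

Since a large share of typos sit at distance 1 from the intended word, distance-bounded lookup is a common first filter in spelling correction, though on its own it cannot rank competing candidates.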
As described previously, automatic word error
correction is an expensive task when performed off
FCTA 2014 - International Conference on Fuzzy Computation Theory and Applications