etc.
These particularities prevent the effective use of
common Natural Language Processing (NLP)
techniques and hinder their use as “automatic
information providers”, as shown by Carvalho and
Curto (2014). Here we focus on improving the
process of automatic word error detection and
correction in such databases.
Error detection and correction is something that
everyone has become familiar with since the advent
of word processors. Nowadays, most of the
(computer / tablet / smartphone) written texts are
automatically corrected in “real-time” (i.e., as they
are being generated). However, everyone knows
that, despite the large recent advances, not all
corrections are proper, especially when less common
words (e.g. technical terms, named entities, etc.) are
being used. Also, most word correction tools are
limited to one or two errors per word. The capability
of humans to adapt very fast to new situations allows
them to detect most unwanted corrections as they are
proposed, and therefore react immediately. So, the
problem of word error detection and consequent
correction is basically non-existent when performed
in “real time” (and as long as the used vocabulary is
well known). However, if texts were not properly
corrected as they were created, then correcting them
later is a complex and expensive task that must
usually be done manually or, even when automated,
demands significant human intervention. This is
especially relevant for unedited technical text. In the
case of Big Data text databases, this task must
somehow be automated, since the size of the
database makes manual offline text editing
prohibitively expensive.
In this paper we propose a fuzzy-based semi-
automatic method to address the large number of
word errors contained in unedited Big Data text,
focusing specifically on the MIMIC II database.
2 THE MIMIC II DATABASE
The developed work uses data from the
Multiparameter Intelligent Monitoring in Intensive
Care (MIMIC II) database (Saeed, 2002). This is a large
database of ICU patients admitted to the Beth Israel
Deaconess Medical Center, collected from 2001 to
2006, and that has been de-identified by removal of
all Protected Health Information. The MIMIC II
database currently comprises 26,655 patients, of
which 19,075 are adults (>15 years old at time of
admission). It includes high-frequency sampled data
of bedside monitors, clinical data (laboratory tests,
physicians’ and nurses’ notes, imaging reports,
medications and other input/output events related to
the patient) and demographic data. From the
available data, and for this particular problem, we
are mainly interested in the physicians’ and nurses’
notes.
The MIMIC II text database contains a total of
156 million words with 3 or more characters, of
which 260,180 are distinct. Of these 260,180 distinct
words, only 31,527 (12%) appear in known word
lists: 30,828 appear on the SIL list of known English
words (which contains 109,582 distinct words) (SIL,
2014), and 429 appear on additional lists containing
medical terms not common in general English. The
remaining 228,923 words are simply unknown to
dictionaries, and most are the result of typing or
linguistic errors.
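The vocabulary profiling described above can be sketched in a few lines. The snippet below is a minimal, illustrative stand-in: `notes` and `wordlist` are toy placeholders for the MIMIC II notes and the SIL word list, which are not reproduced here.

```python
import re
from collections import Counter

def profile_vocabulary(text, known_words):
    """Count words with 3+ characters and split the distinct
    vocabulary into dictionary-known and unknown words."""
    tokens = re.findall(r"[a-z]{3,}", text.lower())
    counts = Counter(tokens)
    known = {w for w in counts if w in known_words}
    unknown = set(counts) - known
    return counts, known, unknown

# Toy stand-ins for the notes corpus and the reference word list.
notes = "Patient abdomen soft. abdomin soft nontender. abd soft"
wordlist = {"patient", "abdomen", "soft", "nontender"}
counts, known, unknown = profile_vocabulary(notes, wordlist)
print(unknown)  # the "unknown to dictionaries" bucket
```

Applied to the full database, the unknown bucket is where the 228,923 out-of-dictionary words land.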
As an example of the extent of such errors, here
is a non-exhaustive list of distinct misspelled
variants of the word “abdomen” found in the
MIMIC II database: abadomen, abdaomen,
abndomen, badomen, abdeomen, abdcomen,
abdemon, abdeom, abdoem, abdmoen, abdiomen,
abdman, abdmen, abdme, abddmen, abbomen,
abdmn, abdmonen, abdonem, abdoben, abdodmen,
abdoemen, abdomin. It should be noted that these
errors are not isolated, e.g., the incorrect form
“abdomin” appears 1,968 times in the database.
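What makes such variants recoverable is that they remain close, in string-similarity terms, to the intended word. As a minimal illustration (the mini-lexicon below is a made-up stand-in, not part of MIMIC II), Python's standard `difflib` already maps several of the observed variants back to “abdomen”:

```python
import difflib

# Toy lexicon standing in for a real medical word list (illustrative only).
lexicon = ["abdomen", "admission", "patient", "nontender"]

# A few of the misspelled variants observed in the database.
for typo in ["abdomin", "badomen", "abdaomen"]:
    # Return the best lexicon entry whose similarity ratio is >= 0.75.
    match = difflib.get_close_matches(typo, lexicon, n=1, cutoff=0.75)
    print(typo, "->", match)
```

All three variants are matched to “abdomen” at this cutoff; the harder cases are the shorter, more mangled forms, which motivate the fuzzy method proposed here.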
3 RELATED WORK
A typographical error, colloquially known as a
“typo”, is a mistake made in the typing process. Most
typographical errors consist of substitution,
transposition, duplication or omission of a small
number of characters. Damerau (1964) considered
that a simple error consists of exactly one of these
operations. Nowadays, many other types of errors
can be found in text databases: errors associated
with smaller keyboards, which have increased the
number of word typos; errors due to the widespread
use of blogs, microblogs, instant messaging, etc.;
errors associated with real-time voice transcription;
errors associated with poor Optical Character
Recognition when digitizing manuscripts; and so on.
One must also mention linguistic errors, which are
mostly due to lack of culture and/or education, and
are usually the result of phonetic similarities.
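The elementary operations above (substitution, transposition, duplication/insertion, omission) are exactly those counted by the Damerau-Levenshtein edit distance; Damerau's “simple errors” are the words at distance 1. A minimal sketch of its restricted (optimal string alignment) variant:

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment)
    distance: minimum number of substitutions, insertions, deletions
    and adjacent transpositions needed to turn a into b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                          # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j                          # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution/match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(osa_distance("abdomin", "abdomen"))  # one substitution -> 1
print(osa_distance("badomen", "abdomen"))  # one transposition -> 1
```

Since a large share of typos sit at distance 1 from the intended word, distance-bounded lookup is a common first filter in spelling correction, though on its own it cannot rank competing candidates.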
As described previously, automatic word error
correction is an expensive task when performed off
FCTA 2014 - International Conference on Fuzzy Computation Theory and Applications