using a spellchecker. However, this was shown in [10] not to be very helpful, as it gave
“little improvement in sensitivity, but a large increase in the number of false positives”.
Instead, the authors were partially able to handle misspellings by relying on contextual
rules, so that an identifier missed in dictionary lookup can still be caught by contextual
template matching. One idea for dealing with misspellings is to use a list of common
misspellings as a reference, alongside the other dictionaries. Better yet, it might
be possible to extract common patterns from that list and feed only those patterns to
the system (e.g. receive and recieve, tomorrow and tommorrow, etc.). If it were
possible to extract heuristics from simple observations, as humans often do, and to use this
approach along with methods that require an initial learning phase (such as in [14]),
this would not only greatly enhance the machine’s ability to perform de-identification,
beyond the detection of misspellings alone, but would eventually also
serve as a big step toward the advancement of the Natural Language Understanding
(NLU) field.
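The misspelling idea above can be sketched as follows. This is a minimal illustration, not a method taken from [10] or [14]: the word lists, the two patterns (a swapped "ie"/"ei" pair and an accidentally doubled letter), and the function names are all hypothetical and chosen only to cover the two examples given in the text.

```python
# Hypothetical data: a tiny dictionary and a tiny list of common misspellings.
DICTIONARY = {"receive", "tomorrow", "boston"}
COMMON_MISSPELLINGS = {"definately": "definitely"}


def pattern_variants(token):
    """Candidate corrections from two assumed patterns:
    a swapped 'ie'/'ei' pair and an accidentally doubled letter."""
    variants = set()
    for i in range(len(token) - 1):
        pair = token[i:i + 2]
        if pair in ("ie", "ei"):
            # Swap the two letters back, e.g. recieve -> receive.
            variants.add(token[:i] + pair[::-1] + token[i + 2:])
        if pair[0] == pair[1]:
            # Drop one of the doubled letters, e.g. tommorrow -> tomorrow.
            variants.add(token[:i] + token[i + 1:])
    return variants


def lookup(token):
    """Return the dictionary form of a token, or None if nothing matches."""
    token = token.lower()
    if token in DICTIONARY:
        return token
    if token in COMMON_MISSPELLINGS:
        return COMMON_MISSPELLINGS[token]
    matches = sorted(v for v in pattern_variants(token) if v in DICTIONARY)
    return matches[0] if matches else None


print(lookup("recieve"))    # receive
print(lookup("Tommorrow"))  # tomorrow
```

Restricting corrections to variants that actually occur in the dictionary keeps the number of extra candidates small, which matters given the increase in false positives reported for a general spellchecker.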
Re-identification: After identifiers have been recognized, they should be substituted in a
generic way that does not violate privacy while maintaining as much of the information
as possible. The two best approaches in this regard are: (1) In [10], identifiers are replaced
by phrases such as [***first name***], or any other category/subcategory of PHI, which
enhances readability and requires no further computation, since the category has
already been detected as part of the identifier’s recognition process. (2) A smoother
approach is to substitute them with surrogate information, such as John Doe for a first
name.
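Both substitution strategies can be sketched in a few lines. The example below assumes that recognition has already produced non-overlapping character spans with a category label; the span format, the function name, and the surrogate table are hypothetical and are not taken from [10].

```python
def substitute(text, spans, surrogates=None):
    """Replace each (start, end, category) span in text.

    With no surrogate for a category, the span becomes a category tag such as
    [***first name***]; otherwise it becomes the surrogate string.
    Spans are assumed not to overlap.
    """
    out = []
    pos = 0
    for start, end, category in sorted(spans):
        out.append(text[pos:start])
        if surrogates and category in surrogates:
            out.append(surrogates[category])
        else:
            out.append(f"[***{category}***]")
        pos = end
    out.append(text[pos:])
    return "".join(out)


text = "Mary Smith was seen in Boston."
spans = [(0, 4, "first name"), (5, 10, "last name"), (23, 29, "location")]

print(substitute(text, spans))
# [***first name***] [***last name***] was seen in [***location***].

print(substitute(text, spans, {"first name": "John", "last name": "Doe"}))
# John Doe was seen in [***location***].
```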
3 Main General Approaches to De-identification
Most approaches to recognition of PHI are either lexical or contextual:
Lexical. This approach mostly involves using dictionaries and gazetteers and performing
string-matching searches throughout the text. In general, the larger the dictionaries,
the better the result. In particular, most methods rely on dictionaries for the
recognition of first names and locations.
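A minimal sketch of the lexical approach is given below, assuming one small gazetteer per PHI category and simple case-insensitive word matching. The gazetteer contents and the function name are hypothetical; a real system would use far larger lists.

```python
import re

# Hypothetical gazetteers, one set of lowercase entries per PHI category.
GAZETTEERS = {
    "first name": {"mary", "john"},
    "location": {"boston", "geneva"},
}


def lexical_matches(text):
    """Return (start, end, category) for every word found in a gazetteer."""
    matches = []
    for m in re.finditer(r"[A-Za-z]+", text):
        word = m.group().lower()
        for category, entries in GAZETTEERS.items():
            if word in entries:
                matches.append((m.start(), m.end(), category))
    return matches


print(lexical_matches("Mary moved to Boston."))
# [(0, 4, 'first name'), (14, 20, 'location')]
```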
Contextual. This approach can take different forms. It can be as limited as associating a
number of common-sense templates with each of the PHI categories, as explained in [12],
such as a [firstname lastname] or [lastname, firstname] template for a person’s name,
or as sophisticated as in [11], which applies advanced natural language techniques
using a framework called MEDTAG to categorize words and recognize parts of speech.
MEDTAG is a system of tags with an ontology of the medical domain which aims at
disambiguation at the “word-sense” level. One example from [11] is the word “miss”,
which can be taken to mean an action (= fail) or a person (= a young lady). The tagging
system considers the context and distinguishes between the two senses. Their system also
performs part-of-speech tagging, which, again in the case of “miss”, determines whether
it is a name or a verb in the sentence, helping to eliminate potential ambiguities. However,
at the word-sense level their work is restricted to the 40 medical tags in the MEDTAG
framework and another custom-designed set of “anonymization-specific” tags. This set
disambiguates the words taken as possible PHI candidates and was extracted from the
investigation of particular cases.
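The simpler, template-based form of the contextual approach can be sketched with regular expressions. The example below assumes the two name templates mentioned above, [firstname lastname] and [lastname, firstname], and a small first-name dictionary; it is an illustration only and does not reproduce the rules of [12] or the MEDTAG-based system of [11]. It shows how a surname absent from every dictionary can still be caught through its context.

```python
import re

# Hypothetical first-name dictionary; the surnames below appear in no list.
FIRST_NAMES = {"Mary", "John"}

FIRST = "(?:" + "|".join(re.escape(n) for n in sorted(FIRST_NAMES)) + ")"
CAP = r"[A-Z][a-z]+"  # assumed shape of a surname: one capitalized word

TEMPLATES = [
    re.compile(rf"\b{FIRST} (?P<last>{CAP})\b"),   # [firstname lastname]
    re.compile(rf"\b(?P<last>{CAP}), {FIRST}\b"),  # [lastname, firstname]
]


def template_matches(text):
    """Return (start, end, category) spans for surnames found by template."""
    found = []
    for pattern in TEMPLATES:
        for m in pattern.finditer(text):
            found.append((m.start("last"), m.end("last"), "last name"))
    return sorted(found)


text = "Seen by John Smithson. Next of kin: Doeville, Mary."
print([text[s:e] for s, e, _ in template_matches(text)])
# ['Smithson', 'Doeville']
```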