anonymizing textual information. Sweeney’s pioneering work [16] is based on
removing personally identifying information from the text so that the integrity of the
information remains intact, even though the identity remains confidential. It includes
developing an algorithm and software program called ‘Scrub Extractor’ that
automatically extracts names, addresses, and other identifying information from
free-text documents. Sweeney’s recognition methodology aims at detecting
information that can identify a person. One important issue that arises here
concerns the definition of “personal information.” Is it the whole
text, the paragraph, the sentence, the phrase or only the word that denotes the
identity? We will answer this question in the next section. Sweeney also introduced
the DataFly system that provides an additional level of anonymity [17]. Ruch et al.
used syntactic and semantic knowledge to classify the tokens within a text [13]. N-
gram type rules, finite state automata and a recursive transition network were used to
encode the knowledge and extract patient identifiers. Taira et al. presented a
methodology that manually tags all references to patient identifiers and context
information [18]. The scheme searches for logical relations that are characterized by a
predicate and an ordered list of one or more arguments. In most cases, the logical
relation consists of three arguments: a head, a relation, and a value. In “Johnny
underwent a pyeloplasty for uretropelvic junction stenosis…”, the token Johnny is the
logical relation head, underwent is the relation, and pyeloplasty is the value. In
“Johnny is a 5 year old Caucasian male with Disease X”, the tokens 5 year old and
Caucasian modify male, which syntactically modifies its head Johnny [18]. The
identification detection problem is concerned with certain types of logical relations.
All combinations of words in a sentence that can fill the roles (i.e., head, relation, and
value) of a given logical relation are considered. Other authors in the area of medical
textual information worked on morpho-syntactic aspects of the term formation in
medical language. For example, work in this area led to the development of an
encoding system for diagnoses and interventions based on a semi-automatic encoder
with natural language entry and an interface [5].
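The head, relation, and value scheme described above can be illustrated with a short sketch. It is only an illustration under our own assumptions: the names LogicalRelation and candidate_relations are hypothetical, and the exhaustive enumeration of token triples is a simplification of "considering all combinations of words", not the procedure reported in [18], which the text here does not specify in detail.

```python
from itertools import permutations
from typing import Iterator, NamedTuple, Sequence


class LogicalRelation(NamedTuple):
    """Hypothetical container for the three roles of a logical relation."""
    head: str
    relation: str
    value: str


def candidate_relations(tokens: Sequence[str]) -> Iterator[LogicalRelation]:
    """Yield every assignment of three distinct token positions to the roles.

    Assumption: candidates are generated exhaustively and filtered later;
    no filtering or scoring step is modeled here.
    """
    for head, relation, value in permutations(tokens, 3):
        yield LogicalRelation(head, relation, value)


# Content tokens taken from the example sentence in the text.
tokens = ["Johnny", "underwent", "pyeloplasty"]
candidates = list(candidate_relations(tokens))
intended = LogicalRelation("Johnny", "underwent", "pyeloplasty")
```

Among the generated candidates, only the one stored in intended matches the reading given in the text. A detection scheme would need some further criterion to select it, which this sketch deliberately leaves out.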
Another important area of research in this direction is the notion of ‘k-anonymity’
[15]. The k-anonymization of a relational table assumes that a table with a primary key
that refers to a person is the personal information. Its main concern is anonymizing
entries in the table in order to block any attempt to reach “identifiability” that stems
from these entries. Systems that use such techniques aim at protecting individually
identifiable information while maintaining the entity relationships in the
original data. Still, these works do not clearly define “personal information.”
Implicitly, it is understood that the privacy aspect comes from associating the attribute
name with the identifying key of the relation.
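To make the k-anonymity notion concrete, the following minimal sketch checks whether a table satisfies it. It assumes the common formulation in which every combination of quasi-identifier values must occur in at least k rows. The function name is_k_anonymous, the column names, and the sample rows are hypothetical and are not taken from [15].

```python
from collections import Counter
from typing import Mapping, Sequence


def is_k_anonymous(
    rows: Sequence[Mapping[str, str]],
    quasi_identifiers: Sequence[str],
    k: int,
) -> bool:
    """Return True if each quasi-identifier value combination occurs >= k times."""
    groups = Counter(
        tuple(row[attr] for attr in quasi_identifiers) for row in rows
    )
    return all(count >= k for count in groups.values())


# Hypothetical, already generalized table: ZIP codes are truncated and ages
# are replaced by ranges; the diagnosis column is the sensitive attribute.
rows = [
    {"zip": "021**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "021**", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "30-39", "diagnosis": "diabetes"},
]

two_anonymous = is_k_anonymous(rows, ["zip", "age"], k=2)
three_anonymous = is_k_anonymous(rows, ["zip", "age"], k=3)
```

In this sample each (zip, age) combination appears exactly twice, so the table passes the check for k = 2 and fails it for k = 3. The check says nothing about what counts as “personal information”, which is the gap noted above.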
In spite of impressive efforts and results in this area, we claim that the topic of
“private information anonymization” has not been systematized. Systematization here
means systematically concentrating on the ‘quality’ of privacy in the general scheme
of anonymization of information. It starts with the definition of ‘private information’.
Additionally, anonymization methods are usually focused on eliminating identities.
This blunt mechanism obscures the finer points of anonymizing private information of a
person or private relations among persons. “John and Mary are in love” can be
anonymized with respect to John (“Someone and Mary are in love”), with respect to
Mary (“John and Someone are in love”), or with respect to the relationship between
them (“John and Mary are in some type of relation”). Our proposed systematic