2 RELATED WORK
The Internet and local networks have greatly increased
the availability of medical information and reduced
the cost and the time needed to access it. Examples
of on-line information related to the health domain
include medical reports, bibliographies, conference
proceedings, clinic instructions, health organisation
information, discussion forums, etc.
This is an opportunity for users but also a problem,
given the huge volume of information that is
generated daily. Finding relevant information in this
context increasingly requires controlled vocabularies
for information indexing as well as advanced
categorization and filtering techniques.
With respect to the first topic, several controlled
vocabularies for medicine exist. As an example, the
World Health Organization (WHO) defined the
International Classification of Diseases and Related
Health Problems (ICD) that provides codes to
classify diseases and a variety of signs, symptoms,
abnormal findings, complaints, social circumstances
and external causes of injury or disease. The current
version of ICD (WHO, 2004) is composed of more
than 155,000 codes grouped in 21 main chapters.
Moreover, the U.S. National Library of Medicine
developed the Medical Subject Headings (MeSH), a
controlled vocabulary of terms aimed at indexing
papers and books in life sciences and serving as a
thesaurus to facilitate searches in the medical field
(NLM, 2008).
MeSH includes more than 100,000 concepts and
a hierarchy more than 11 levels deep. It is used to
index MEDLINE, the biggest on-line database of
medical information. Several systems have been
developed to find relevant information in it, such as
the Grateful Med System, a windows-based query
interface, and COACH, a system that converts
keywords and phrases into MeSH-compatible queries.
Unfortunately, these systems require a learning
phase that still discourages potential users. Moreover,
they cannot actively support users during the search
task by determining user interests and refining
searches accordingly. In other words, they are not
filtering systems.
An information filtering system selects from an
information source only those items that are relevant
for a given user, based on an (explicitly defined or
implicitly inferred) user profile. A user profile
describes the interests of a user with respect to
domain topics and may be generated by exploiting
machine learning techniques, e.g. by extracting
keywords from a set of relevant documents and/or by
collecting (positive or negative) user feedback about
documents suggested by the system.
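As an illustration of profile-based filtering, the following minimal Python sketch builds a keyword profile from relevant documents, adjusts it with user feedback and selects the items that score above a threshold. It does not reproduce any of the systems cited here: the stopword list, the relative-frequency weighting, the additive feedback rule, the threshold and all function names are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption of this sketch).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}

def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_profile(relevant_docs, top_k=5):
    """Infer a profile: the top_k most frequent non-stopword terms,
    each weighted by its relative frequency in the relevant documents."""
    counts = Counter()
    for doc in relevant_docs:
        counts.update(t for t in tokenize(doc) if t not in STOPWORDS)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.most_common(top_k)}

def update_profile(profile, doc, liked, rate=0.1):
    """Return a new profile in which the terms of doc are raised (positive
    feedback) or lowered (negative feedback) by a fixed amount."""
    updated = dict(profile)
    sign = 1.0 if liked else -1.0
    for term in set(tokenize(doc)):
        if term not in STOPWORDS:
            updated[term] = updated.get(term, 0.0) + sign * rate
    return updated

def score(profile, doc):
    """Sum the weights of the profile terms that occur in doc."""
    terms = set(tokenize(doc))
    return sum(weight for term, weight in profile.items() if term in terms)

def filter_items(profile, docs, threshold=0.2):
    """Keep only the documents whose score reaches the threshold."""
    return [doc for doc in docs if score(profile, doc) >= threshold]

profile = build_profile(["asthma treatment with inhaled steroids",
                         "asthma symptoms in children"], top_k=3)
print(filter_items(profile, ["new asthma guidelines", "knee surgery recovery"]))
```

A real system would replace the raw frequency weights with a more robust term weighting scheme and would normally decay old feedback over time.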
One of the few existing filtering systems in the
medical domain is Kavanah (Santos, E., Nguyen,
H., 2000), which is based on interface agents that learn
the interests and the preferences of the users while
they perform search tasks. Discovered interests and
preferences are then exploited to help users retrieve
further information and relevant knowledge.
The system we propose in this paper tries to
merge the advantages of a controlled vocabulary
(ICD) with those of a filtering system. On the one
hand, it can automatically categorize unknown
medical documents on the basis of ICD; on the other,
it can filter relevant information on the basis of an
explicit ICD-based user profile.
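The filtering step can be pictured with a short Python sketch in which each document carries the ICD codes assigned by the categorizer and the explicit profile is simply a set of ICD codes. The Document class, the prefix-matching rule (a profile code such as J45 also matches its subcodes such as J45.0) and the sample data are assumptions made for illustration, not a description of the actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """A document together with the ICD codes assigned to it (hypothetical type)."""
    title: str
    icd_codes: set = field(default_factory=set)

def matches(profile_codes, doc):
    """True if any code of the document equals, or is a subcode of,
    a code listed in the profile (simple prefix match, an assumption)."""
    return any(code.startswith(wanted)
               for code in doc.icd_codes
               for wanted in profile_codes)

def filter_by_profile(profile_codes, docs):
    """Keep only the documents relevant to an explicit ICD-based profile."""
    return [doc for doc in docs if matches(profile_codes, doc)]

# Illustrative data: J45 (asthma) and E11 (type 2 diabetes) are ICD-10 codes.
library = [
    Document("Allergic asthma in adults", {"J45.0"}),
    Document("Managing type 2 diabetes", {"E11"}),
    Document("Unclassified note"),
]
for doc in filter_by_profile({"J45"}, library):
    print(doc.title)
```

Because the profile and the documents share the same vocabulary, matching reduces to comparing codes, which is also what makes the exchange of profiles and classifications with other ICD-based systems straightforward.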
It therefore differs from existing search tools
based on controlled vocabularies (like Grateful Med
System and COACH) because it also offers
profile-driven searches. It also differs from
traditional filtering systems (like Kavanah) because
available documents are pre-classified with respect
to a controlled vocabulary. This not only empowers
search facilities but also ensures greater
interoperability with external systems based on such
vocabularies.
3 THE CATEGORIZATION METHODOLOGY
The categorization process we apply is based on
machine learning techniques that automatically build
a classifier by letting it learn the distinctive
features of the categories of a given domain from a
training set of pre-classified documents
(Sebastiani, F., 2002).
In other words, given a certain set of categories
(those provided by ICD in our case), a training set is
used to extract the distinctive features (or terms) of
each category. Then, when a new document is
presented to the system, the system can assign it to
none, one or more categories on the basis of the
value of a given similarity function.
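The train-then-assign scheme just described can be sketched in a few lines of Python. This is a generic illustration, not the classifier used in this work: it assumes that each category is represented by the summed term counts of its training documents, that cosine similarity plays the role of the similarity function, and that a fixed threshold decides membership. All names and the sample data are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def term_vector(text):
    """Bag-of-words vector: term -> number of occurrences."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def train(training_set):
    """Extract the features of each category from (text, category) pairs
    by summing the term vectors of its training documents."""
    features = defaultdict(Counter)
    for text, category in training_set:
        features[category].update(term_vector(text))
    return dict(features)

def cosine(u, v):
    """Cosine similarity of two count vectors; lies in [0, 1]."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def categorize(features, text, threshold=0.3):
    """Return the categories (possibly none, possibly several) whose
    similarity with the document reaches the threshold, with their scores."""
    doc = term_vector(text)
    scores = {cat: cosine(doc, vec) for cat, vec in features.items()}
    return {cat: s for cat, s in scores.items() if s >= threshold}

# Illustrative training set labelled with ICD-10 codes.
features = train([
    ("asthma wheezing inhaler", "J45"),
    ("asthma cough inhaler", "J45"),
    ("diabetes insulin glucose", "E11"),
])
print(sorted(categorize(features, "asthma inhaler insulin glucose")))
print(sorted(categorize(features, "broken ankle")))
```

Thresholding each category independently is what allows a document to fall into no category at all or into several at once.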
More formally, given a document domain D and
a category domain C, the purpose of the algorithm is
to build a function φ able to associate with each pair
(d, c), with d ∈ D and c ∈ C, a membership value of
d in c included in the range [0, 1].
As anticipated, our approach relies on an initial
corpus Ω ⊂ D of pre-classified documents. In this
sense, every pair (d, c) with d ∈ Ω such that
φ(d, c) = 1 is a positive sample for the class c, while
every pair (d, c) such that φ(d, c) = 0 is a negative
sample for the
AN INFORMATION FILTERING SYSTEM FOR E-HEALTH - The Health-on-Net Experience