[MAJR] stands for “MeSH Major Topic” and asks the
system to retrieve only documents whose main subject is
a disease. The query provided us with 483,726 documents
(as of February 14th, 2007), which we downloaded to a
single file in MEDLINE format. We automatically processed
that file to obtain the titles and abstracts together with
their corresponding MeSH topics. This led us to 4,024
classes with at least one training sample each.
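For concreteness, the following is a minimal sketch of that
processing step, assuming the standard MEDLINE field tags TI
(title), AB (abstract) and MH (MeSH heading), with major topics
marked by an asterisk; function names are illustrative and the
actual script may differ.

def parse_medline(path):
    # Records are separated by blank lines; fields start with a tag
    # padded to four characters followed by "- "; wrapped lines are
    # indented continuations of the previous field.
    records, current, tag = [], {}, None
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():                  # blank line ends a record
                if current:
                    records.append(current)
                current, tag = {}, None
                continue
            if line[4:6] == "- " and line[:4].strip():
                tag, value = line[:4].strip(), line[6:].strip()
                current.setdefault(tag, []).append(value)
            elif tag:                             # continuation line
                current[tag][-1] += " " + line.strip()
        if current:
            records.append(current)
    return records

def training_samples(records):
    for rec in records:
        title = " ".join(rec.get("TI", []))
        abstract = " ".join(rec.get("AB", []))
        # major topics are the MeSH headings marked with an asterisk
        majors = [mh for mh in rec.get("MH", []) if "*" in mh]
        yield title + " " + abstract, majors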
With respect to the test set, we have used 400 medical
histories from the Department of Pathology of the
University of Pittsburgh School of Medicine
(http://path.upmc.edu/cases). Although the web page
contained 500 histories at that date, not all of them are
suitable for our purposes: some do not provide a concrete
diagnosis but only a discussion of the case, and others
have no direct match in the MeSH tree. We downloaded the
HTML cases and then converted them to text format, keeping
from each document the title and the whole clinical history,
including radiological findings, gross and microscopic
descriptions, etc. To obtain the expected output, we
extracted the top-level MeSH disease categories
corresponding to the diagnoses given in the titles of the
“final diagnosis” files (dx.html).
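As an illustration of how the expected output can be derived,
the sketch below takes the diagnosis from the <title> of a
dx.html page and maps it to top-level disease categories through
MeSH tree numbers (the C branch). The lookup table mesh_tree is
assumed to have been built beforehand from the MeSH distribution
files, and all names are illustrative.

import re

def expected_categories(dx_html, mesh_tree):
    # mesh_tree: descriptor name -> list of MeSH tree numbers, e.g.
    # "Pituitary Neoplasms" -> ["C04.588.322.609", "C10.228.140.617"]
    match = re.search(r"<title>(.*?)</title>", dx_html, re.I | re.S)
    diagnosis = re.sub(r"\s+", " ", match.group(1)).strip() if match else ""
    categories = set()
    for descriptor, tree_numbers in mesh_tree.items():
        if descriptor.lower() in diagnosis.lower():
            for number in tree_numbers:
                if number.startswith("C"):           # diseases branch of MeSH
                    categories.add(number.split(".")[0])   # e.g. "C10"
    return diagnosis, sorted(categories)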
To select a proper ranking algorithm, we have surveyed
several decades of literature on text classification and
category ranking in search of the most suitable one. We have
chosen the Sum of Weights (SOW) approach (Ruiz-Rico et al.,
2006), which is more suitable than the alternatives because
of its simplicity, efficiency, accuracy and incremental
training capability. Since medical databases are frequently
updated and grow continuously, we have preferred a fast and
unattended approach that lets us perform updates easily with
no substantial performance degradation as the number of
categories or training samples increases. The prohibitive
complexity of other classifiers such as SVM could make the
problem intractable, as stated in (Ruch, 2005).
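For illustration only, the sketch below implements a generic
sum-of-weights ranker of this kind: each (term, category) pair
accumulates a weight during training, and a document is scored
per category by summing the weights of its terms. The weighting
is deliberately simplified and does not reproduce the exact
formulation of (Ruiz-Rico et al., 2006); the point is the
incremental-update property, since adding a document only touches
the counters of its own terms and categories.

from collections import defaultdict

class SumOfWeightsRanker:
    def __init__(self):
        # term -> {category -> accumulated weight}
        self.weights = defaultdict(lambda: defaultdict(float))

    def train(self, text, categories):
        # incremental: only the (term, category) pairs of this
        # document are updated
        terms = text.lower().split()
        for term in terms:
            for category in categories:
                self.weights[term][category] += 1.0 / len(terms)

    def rank(self, text):
        # score each category by the sum of its term weights
        scores = defaultdict(float)
        for term in text.lower().split():
            for category, weight in self.weights.get(term, {}).items():
                scores[category] += weight
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)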
To evaluate the worth of our proposal, we have measured
accuracy through three common ranking performance measures
(Ruiz-Rico et al., 2006): precision at recall = 0 (P_{r=0}),
mean average precision (AvgP) and the precision/recall
break-even point (BEP). Sometimes only one diagnosis is valid
for a particular patient. In these cases, P_{r=0} lets us
quantify the mistaken answers, since it indicates the
proportion of correct topics given at the top-ranked position.
To assess the quality of the full ranking list, we use AvgP,
since it goes down the ranked list averaging precision until
all possible answers are covered.
BEP is the value where precision equals recall, that is, when
we take the number of relevant topics as the cut-off threshold.
To follow the same procedure as (Joachims, 1998), the
performance evaluation has been computed over the top-level
disease categories.

Figure 1: Example of the first level of a hierarchical diagnosis.

Figure 2: Output example after manual expansion of high-ranked
topics (top) and after selecting the flat diagnosis mode (bottom).
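As a concrete illustration of the three measures, the sketch
below computes them for a single test case from the ranked list
of predicted categories and the set of correct ones; corpus-level
figures are then means over all test cases. Function names are
illustrative.

def p_at_r0(ranking, relevant):
    # precision at recall = 0: is the top-ranked category a correct one?
    return 1.0 if ranking and ranking[0] in relevant else 0.0

def average_precision(ranking, relevant):
    # walk down the ranked list, averaging precision at each correct item
    hits, accumulated = 0, 0.0
    for position, category in enumerate(ranking, start=1):
        if category in relevant:
            hits += 1
            accumulated += hits / position
    return accumulated / len(relevant) if relevant else 0.0

def break_even_point(ranking, relevant):
    # cut the ranking at |relevant|, the point where precision equals recall
    k = len(relevant)
    return sum(1 for c in ranking[:k] if c in relevant) / k if k else 0.0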
2.1 Availability and Requirements
No special hardware or software is necessary to interact with
the assistant; an Internet connection and a standard browser
are enough to access it on-line at the following site:
www.dlsi.ua.es/omda/index.php.
By using a web interface and presenting results in text
format, we allow users to access the system from many types of
portable devices (laptops, PDAs, etc.). Moreover, they will
always have the latest version available, with no need to
install specific applications or software updates.
3 AN EXAMPLE
One of the 400 histories included in the test set looks
as follows:
Case 177 – Headaches, Lethargy and a Sellar/Suprasellar Mass