practitioners will reduce their workload and will
allow them to focus on the content of the patient
record and evaluate the presence of NI.
Some of the 83 attributes of the prevalence
database are acquired for administrative purposes
only. The majority of these attributes are the
synthesis of information from the patient records
emanating from laboratory, radiology, nursing, and
clinical databases. A more realistic approach to
reporting potential cases is to find a subset of the N
most relevant features (N<83) and to query the hospital
databases on the basis of these N features. The
results obtained are then classified and ranked with a
classification algorithm, and only the predicted positive
cases, i.e. lists of patients predicted to be infected,
are reviewed by the infection control practitioners.
The classification is based on a model built with the
N features.
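The review workflow described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the linear scoring rule, the weights, the threshold and the toy patients are all hypothetical and are not taken from the prevalence database.

```python
from typing import Dict, List, Tuple


def rank_predicted_positives(
    cases: Dict[str, List[float]],
    weights: List[float],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Score each case, keep the predicted positives, rank by decreasing score."""
    scored = []
    for case_id, features in cases.items():
        # Hypothetical linear score over the N selected features.
        score = sum(w * x for w, x in zip(weights, features))
        if score > threshold:  # predicted positive
            scored.append((case_id, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy cases described by N = 3 selected features (invented values).
cases = {
    "patient_a": [1.0, 0.0, 1.0],
    "patient_b": [0.0, 0.0, 1.0],
    "patient_c": [1.0, 1.0, 1.0],
}
review_list = rank_predicted_positives(
    cases, weights=[0.5, 0.75, 0.25], threshold=0.5
)
print(review_list)  # only these cases would be passed on for review
```

Only the cases in `review_list` would be handed to the infection control practitioners, with the highest-scoring ones first.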
We present in this paper a framework for building a
computer-aided NI detection system based on a set of N
pre-selected features using Fisher’s linear
discriminant algorithm. For this purpose, we analyze
a previous prevalence database to optimize the
classification process. The retrieval of the data from
the hospital data warehouse is not presented in this
paper. Fisher’s linear discriminant (FLD) was
chosen for its simplicity, as it has only one
parameter to optimize. One challenging
characteristic of NI prevalence data is the imbalance
between the positive and negative cases
(respectively 11% and 89%) (Cohen et al, 2006).
This important characteristic is taken into account in
the proposed methodology. An application was
developed to automate the whole process.
1.2 NI Prevalence Data
The prevalence data analyzed in this work were
collected at the hospitals during the 2006 survey.
The dataset contains 5 data categories: 1)
demographic information, 2) admission diagnosis
(classified according to McCabe5 (McCabe and
Jackson, 1962) and the Charlson index (Charlson et
al, 1987) classifications); 3) patient information at
the study date (ward type and name,
methicillin-resistant Staphylococcus aureus
carriage status, etc.); 4) information at the study date
and the 6 days before (clinical data, presence of a
central venous catheter, workload, infection status,
etc.); and 5) information related to the infections, i.e.
for infected patients (infection type, clinical data, etc.).
In this study, we are interested in the first 4
categories of data, as they are related to patient
infection; together they comprise 45 attributes. Most of
these data are categorical, except for the date
information (year of birth, admission date and study
date) and the workload value. The dataset contains
1573 cases.
The year of birth was converted into age and
discretized into 3 categories (0-60; 60-75; >75) as in
(Sax et al, 2002), and a new variable “hospitalization
duration” was created. A Mann-Whitney-Wilcoxon
statistical test on the workload value shows a
significant difference between infected and
non-infected patients. As workload is the only attribute
with missing values (91 cases, including 2 positive
cases), all cases with no workload value were removed.
The workload and the hospitalization duration were
then discretized using the minimum
description length principle (Kononenko, 1995).
Patients admitted for less than 48 hours at the time
of the study and not transferred from another
hospital were also removed. The final dataset
contains 1384 cases, including 166 positive cases
(11.99%). Let us denote this dataset S.
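The preprocessing steps above can be sketched as below. The record layout (`Case` and its field names) is hypothetical. The treatment of the age boundaries (60 and 75 assigned to the lower bin) and the approximation of "less than 48 hours" by "fewer than 2 calendar days" are assumptions made for this example only. The discretization based on the minimum description length principle is not shown.

```python
from datetime import date
from typing import List, NamedTuple, Optional


class Case(NamedTuple):
    # Hypothetical record layout, for illustration only.
    year_of_birth: int
    admission: date
    study: date
    workload: Optional[float]  # None when the value is missing
    transferred: bool          # transferred from another hospital
    infected: bool


def age_category(age: int) -> str:
    # Assumption: boundary ages 60 and 75 fall into the lower bin.
    if age <= 60:
        return "0-60"
    if age <= 75:
        return "60-75"
    return ">75"


def keep(case: Case) -> bool:
    """Apply the two exclusion rules described in the text."""
    if case.workload is None:
        return False  # missing workload value
    stay_days = (case.study - case.admission).days
    # Assumption: "less than 48 hours" approximated by "fewer than 2 days".
    if stay_days < 2 and not case.transferred:
        return False
    return True


cases: List[Case] = [
    Case(1930, date(2006, 5, 1), date(2006, 5, 10), 3.0, False, True),
    Case(1950, date(2006, 5, 9), date(2006, 5, 10), 2.0, False, False),
    Case(1950, date(2006, 5, 9), date(2006, 5, 10), 2.0, True, False),
    Case(1980, date(2006, 5, 1), date(2006, 5, 10), None, False, False),
]
kept = [c for c in cases if keep(c)]
categories = [age_category(c.study.year - c.year_of_birth) for c in kept]

# Proportion of positive cases reported for the final dataset S.
prevalence_percent = round(100 * 166 / 1384, 2)
print(len(kept), categories, prevalence_percent)
```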
The proportion of positive cases in the dataset S is
very low compared to that of the negative ones. Class
imbalance is an important issue in machine learning
since the class of interest is represented by a small
number of examples (Japkowicz and Stephen, 2002).
In the presence of imbalanced datasets, classification
algorithms tend to classify the larger class accurately
while generating more errors in the minority class. If
the positive class accounts for 10% of the cases, a
classification accuracy of 90% may be meaningless: a
classifier that labels every case as negative reaches
it with no sensitivity at all.
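The point about accuracy can be checked with a small worked example. The sketch below uses invented labels (10 positive and 90 negative cases) and a trivial classifier that predicts "negative" for every case. The function name is hypothetical.

```python
from typing import List, Tuple


def accuracy_and_sensitivity(
    y_true: List[int], y_pred: List[int]
) -> Tuple[float, float]:
    """Return (accuracy, sensitivity) for binary labels, 1 = positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    positives = sum(y_true)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / positives if positives else 0.0
    return accuracy, sensitivity


# Toy data: 10% positive cases, and a classifier that always says "negative".
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100
acc, sens = accuracy_and_sensitivity(y_true, y_pred)
print(acc, sens)  # 0.9 0.0
```

The accuracy is high only because the negative class dominates. None of the positive cases is detected.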
The class imbalance problem calls for specific
approaches to training classifiers and evaluating their
performance. Two approaches to the class imbalance
problem were proposed in (Cohen et al, 2006;
Estabrooks, 2004). The first one is to modify
the classification algorithm or at least use an
algorithm able to deal with imbalanced data. The
second resamples the data to reduce the imbalance
effect. The latter has the advantage of being
independent of any classification algorithm.
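As an illustration of the second approach, the sketch below performs random under-sampling of the negative class. This is only one possible resampling scheme and is not necessarily the one used in the cited works. The function name, the `ratio` parameter and the toy data are hypothetical, and the positive class is assumed to be the minority class.

```python
import random
from typing import List, Sequence, Tuple


def undersample_majority(
    cases: Sequence, labels: Sequence[int], ratio: float = 1.0, seed: int = 0
) -> Tuple[List, List[int]]:
    """Keep all positive cases and a random subset of the negative ones.

    `ratio` is the desired number of negatives per positive case.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_neg = min(len(neg), int(round(ratio * len(pos))))
    kept = pos + rng.sample(neg, n_neg)  # sampling without replacement
    rng.shuffle(kept)
    return [cases[i] for i in kept], [labels[i] for i in kept]


# Toy data: 12 positive and 88 negative cases.
labels = [1] * 12 + [0] * 88
cases = list(range(100))
x_bal, y_bal = undersample_majority(cases, labels, ratio=1.0)
print(len(y_bal), sum(y_bal))  # 24 cases, 12 of them positive
```

Because the resampling only changes the training set, the same balanced data can be passed to any classifier.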
1.3 Fisher’s Linear Discriminant
The basic idea behind linear discriminant algorithms
is to find a linear function providing the best
separation of instances from 2 classes. Fisher’s
linear discriminant looks for a direction w that
(i) maximizes the distance between the class means
when projected onto the line directed by w and
(ii) minimizes the variance around these projected
means (Fisher, 1936). The separating hyperplane is
orthogonal to w. An illustration of this algorithm is
given in Figure 1.
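A minimal sketch of the two-class Fisher direction is given below, using the classical closed form in which w is proportional to Sw^-1 (m1 - m0), with Sw the within-class scatter matrix and m0, m1 the class means. The helper names and toy points are hypothetical, Sw is assumed to be non-singular, and placing the decision threshold midway between the projected class means is an assumption of this example, not a choice stated in the paper.

```python
from typing import List

Vector = List[float]


def mean(rows: List[Vector]) -> Vector:
    return [sum(col) / len(rows) for col in zip(*rows)]


def scatter(rows: List[Vector], m: Vector) -> List[Vector]:
    """Scatter matrix: sum of (x - m)(x - m)^T over the rows."""
    d = len(m)
    s = [[0.0] * d for _ in range(d)]
    for x in rows:
        diff = [xi - mi for xi, mi in zip(x, m)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s


def solve(a: List[Vector], b: Vector) -> Vector:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fisher_direction(neg: List[Vector], pos: List[Vector]) -> Vector:
    """Return w = Sw^-1 (m1 - m0), assuming Sw is non-singular."""
    m0, m1 = mean(neg), mean(pos)
    s0, s1 = scatter(neg, m0), scatter(pos, m1)
    sw = [[u + v for u, v in zip(r0, r1)] for r0, r1 in zip(s0, s1)]
    return solve(sw, [b - a for a, b in zip(m0, m1)])


def project(w: Vector, x: Vector) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


# Toy 2-D points (invented values).
neg = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pos = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0], [4.0, 4.0]]
w = fisher_direction(neg, pos)
# Assumed threshold: midpoint of the projected class means.
threshold = 0.5 * (project(w, mean(neg)) + project(w, mean(pos)))
predictions = [int(project(w, x) > threshold) for x in neg + pos]
print(w, threshold, predictions)
```

On real data the threshold would typically be tuned rather than fixed at the midpoint, particularly when the classes are imbalanced.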
HEALTHINF 2009 - International Conference on Health Informatics