TOWARDS AN AUTOMATED NOSOCOMIAL INFECTION CASE REPORTING

Framework to Build a Computer-aided Detection of Nosocomial Infection

Jimison Iavindrasana, Gilles Cohen, Adrien Depeursinge

Medical Informatics Department, University and Hospitals of Geneva, rue Micheli-du-Crest, 24, Geneva, Switzerland

Henning Müeller

Medical Informatics Department, University and Hospitals of Geneva, rue Micheli-du-Crest, 24, Geneva, Switzerland

University of Applied Sciences Western Switzerland, Sierre, Switzerland

Rodolphe Meyer, Antoine Geissbuhler

Medical Informatics Department, University and Hospitals of Geneva, rue Micheli-du-Crest, 24, Geneva, Switzerland

Keywords: Nosocomial infection, Machine learning, Feature selection, Fisher's linear discriminant.

Abstract: The prevalence survey is a valid and realistic strategy for nosocomial infection surveillance, but it is resource- and labor-intensive. Querying the hospital data warehouse with a set of relevant features and applying a classification algorithm to the results can reduce the number of cases to be evaluated by the infection control practitioners. The objective of this work is to provide a framework to build a nosocomial infection model from a set of pre-selected features with Fisher’s linear discriminant algorithm. Applying the methodology to two datasets gives promising results: on average, 41.50% and 43.54% of the predicted positive cases are truly infected (precision), and 65.37% and 82.56% of the infected patients are retrieved (recall), respectively. The proposed framework can be applied to other classification algorithms, which is planned as future work.

1 INTRODUCTION

1.1 Context

Nosocomial infections (NI) are infections acquired in a hospital. In Switzerland, 70,000 hospitalized patients per year are infected and 2,000 deaths per year are caused by NI. A hospital concerned with the quality of patient care should have an infection prevention, control and surveillance program. Surveillance is the process of detecting these infections. Prevalence surveys are recognized as a valid and realistic NI surveillance strategy (French et al, 1983). The prevalence of NI is presented as the prevalence of infected patients, defined as the number of infected patients divided by the total number of patients hospitalized at the time of study, and the prevalence of infections, defined as the number of NIs divided by the total number of patients hospitalized at the time of study (Sax et al, 2002). However, a prevalence survey is resource- and labor-intensive, as it requires assembling a wide range of data gathered from multiple sources. The medical records of all patients hospitalized for more than 48 hours at the time of the survey are reviewed by infection control practitioners. During this first step, they extract information related to each patient and store it in a database specially developed for this purpose. The prevalence database contains 83 attributes covering administrative information, demographic characteristics, admission diagnoses, comorbidities and severity of illness scores, type of admission, exposure to various risks of infection, clinical and paraclinical information, and data related to infection when present. This database is analyzed to produce the official prevalence report.
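The two prevalence measures defined above differ only in their numerator. A minimal sketch, using hypothetical survey counts that are not taken from this study:

```python
def prevalence(infected_patients, infections, hospitalized):
    """Return (prevalence of infected patients, prevalence of infections) in percent."""
    if hospitalized <= 0:
        raise ValueError("hospitalized must be positive")
    return (100.0 * infected_patients / hospitalized,
            100.0 * infections / hospitalized)

# Hypothetical survey: 1000 hospitalized patients, 80 of them infected,
# 95 infections in total (a patient may have more than one infection).
patients_rate, infections_rate = prevalence(80, 95, 1000)
print(patients_rate, infections_rate)
```

The prevalence of infections is at least as large as the prevalence of infected patients, since each infected patient contributes one or more infections.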

The hospital data warehouse contains all the data of the operational system except the data of the current day. Querying the hospital data warehouse and applying data mining techniques in order to report “potential cases” to be reviewed by the infection control practitioners will reduce their workload and allow them to focus on the content of the patient record and evaluate the presence of NI.

Iavindrasana J., Cohen G., Depeursinge A., Müeller H., Meyer R. and Geissbuhler A. (2009). TOWARDS AN AUTOMATED NOSOCOMIAL INFECTION CASE REPORTING - Framework to Build a Computer-aided Detection of Nosocomial Infection. In Proceedings of the International Conference on Health Informatics, pages 317-322. DOI: 10.5220/0001553103170322. SciTePress.

Some of the 83 attributes of the prevalence database are acquired for administrative purposes only. The majority of these attributes are a synthesis of information from the patient records, drawn from laboratory, radiology, nursing, and clinical databases. A more realistic approach to reporting potential cases is to find a subset of the N most relevant features (N < 83) and query the hospital databases on the basis of these N features. The results obtained are then classified and ranked with a classification algorithm, and only the predicted positive cases, i.e. the patients predicted to be infected, are reviewed by the infection control practitioners. The classification is based on a model built with the N features.

In this paper, we present a framework to build a computer-aided detection of NI based on a set of N pre-selected features using Fisher’s linear discriminant algorithm. For this purpose, we analyze a previous prevalence database to optimize the classification process. The retrieval of the data from the hospital data warehouse is not presented in this paper. Fisher’s linear discriminant (FLD) was chosen for its “simplicity”, as it has only one parameter to optimize. One challenging characteristic of NI prevalence data is the imbalance between the positive and negative cases (respectively 11% and 89%) (Cohen et al, 2006). This important characteristic is taken into account in the proposed methodology. An application was developed to automate the whole process.

1.2 NI Prevalence Data

The prevalence data analyzed in this work were collected at the hospitals during the 2006 survey. The dataset contains 5 data categories: 1) demographic information; 2) admission diagnosis (classified according to the McCabe (McCabe and Jackson, 1962) and Charlson index (Charlson et al, 1987) classifications); 3) patient information at the study date (ward type and name, carriage status of Methicillin-Resistant Staphylococcus Aureus (MRSA), etc.); 4) information at the study date and the 6 days before (clinical data, central venous catheter carriage, workload, infection status, etc.); and 5) information related to the infections, i.e. for infected patients only (infection type, clinical data, etc.).

In this study, we are interested in the first 4 categories of data, as they are related to patient infection; together they comprise 45 attributes. Most of these data are categorical, except for date information (year of birth, admission date and study date) and the workload value. The dataset contains 1573 cases. The year of birth was converted into age and discretized into 3 categories (0-60; 60-75; >75) as in (Sax et al, 2002), and a new variable “hospitalization duration” was created. A Mann-Whitney-Wilcoxon statistical test on the workload value shows a significant difference between infected and non-infected patients. As the workload value is the only attribute with missing values (91 cases including 2 positive cases), all cases having no workload value were removed. The workload value and the hospitalization duration were then discretized using the minimum description length principle (Kononenko, 1995). Patients admitted for less than 48 hours at the time of the study and not transferred from another hospital were also removed. The final dataset contains 1384 cases, of which 166 are positive (11.99%). Let us denote this dataset S.
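The preparation steps described above can be sketched as follows. The field names (`born`, `admission`, `workload`, `transferred`), the study date and the sample cases are hypothetical, and two points are assumptions: the age boundaries are read as "up to 60", "above 60 and up to 75" and "above 75", and the 48-hour rule is approximated with whole days.

```python
from datetime import date

def age_category(year_of_birth, study_date):
    """Map a year of birth to one of the three age groups (assumed boundaries)."""
    age = study_date.year - year_of_birth
    if age <= 60:
        return "0-60"
    if age <= 75:
        return "60-75"
    return ">75"

def hospitalization_days(admission_date, study_date):
    """Derived variable: length of stay in whole days at the study date."""
    return (study_date - admission_date).days

def keep_case(case, study_date):
    """Drop cases without workload value, and stays under 48 h unless transferred."""
    if case["workload"] is None:
        return False
    stayed_48h = hospitalization_days(case["admission"], study_date) >= 2
    return stayed_48h or case["transferred"]

study = date(2006, 6, 1)  # hypothetical study date
cases = [
    {"born": 1930, "admission": date(2006, 5, 20), "workload": 95.0, "transferred": False},
    {"born": 1960, "admission": date(2006, 5, 31), "workload": 40.0, "transferred": False},
    {"born": 1940, "admission": date(2006, 5, 31), "workload": 50.0, "transferred": True},
    {"born": 1985, "admission": date(2006, 5, 10), "workload": None, "transferred": False},
]
kept = [c for c in cases if keep_case(c, study)]
print([age_category(c["born"], study) for c in kept])
```

The discretization of the workload value and of the hospitalization duration with the minimum description length principle is not reproduced in this sketch.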

The proportion of positive cases in the dataset S is very low compared to the negative ones. Class imbalance is an important issue in machine learning, since the class of interest is represented by a small number of examples (Japkowicz and Stephen, 2002). In the presence of imbalanced datasets, classification algorithms tend to classify the larger class accurately while generating more errors in the minority class. If the positive class accounts for 10% of the cases, a classification accuracy of 90% is meaningless when the classifier detects none of the positive cases.
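The following sketch makes this point concrete with an artificial example (100 cases, 10% positive, not taken from the prevalence data): a classifier that always predicts "not infected" reaches 90% accuracy while its recall is zero.

```python
def accuracy_and_recall(y_true, y_pred):
    """Accuracy and recall (sensitivity) for binary labels, 1 = positive."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    positives = sum(t == 1 for t in y_true)
    return correct / len(y_true), true_pos / positives

y_true = [1] * 10 + [0] * 90   # 10% positive cases
y_pred = [0] * 100             # trivial classifier: always negative
acc, rec = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={acc:.2f} recall={rec:.2f}")
```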

The class imbalance problem calls for specific approaches to train classifiers and evaluate their performance. Two approaches were proposed to deal with it in (Cohen et al, 2006; Estabrooks, 2004). The first one is to modify the classification algorithm, or at least to use an algorithm able to deal with imbalanced data. The second one resamples the data to reduce the effect of the imbalance. The latter has the advantage of being independent of any classification algorithm.

1.3 Fisher’s Linear Discriminant

The basic idea behind linear discriminant algorithms is to find a linear function providing the best separation of instances from 2 classes. Fisher’s linear discriminant looks for a direction w which (i) maximizes the distance between the means of the classes when projected on the line directed by w and (ii) minimizes the variance around these means (Fisher, 1936). Figure 1 illustrates this algorithm.

HEALTHINF 2009 - International Conference on Health Informatics


Formally, Fisher’s linear discriminant aims at

maximizing the function:

$$J(w) = \frac{w^{T} S_B\, w}{w^{T} S_W\, w} \qquad (1)$$

where $S_B$ is the scatter matrix between classes and $S_W$ the scatter matrix within classes. Equation (1) formulates Fisher’s linear discriminant as an algorithm that minimizes the variance within the classes while maximizing the variance between classes. An unknown case is assigned to the class whose centroid is nearest once the case is projected onto the line directed by w.
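For the two-class case, the maximizer of the criterion J(w) has a well-known closed form. The notation below (class means $m_1$ and $m_2$, sets of cases $C_1$ and $C_2$) is introduced here for illustration and does not come from the original text:

```latex
\begin{aligned}
S_B &= (m_1 - m_2)(m_1 - m_2)^{T}, \\
S_W &= \sum_{k=1}^{2} \sum_{x \in C_k} (x - m_k)(x - m_k)^{T}, \\
w^{*} &\propto S_W^{-1}\,(m_1 - m_2).
\end{aligned}
```

Only the direction of $w^{*}$ matters, since J(w) is unchanged when w is rescaled.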

Figure 1: Illustration of Fisher’s linear discriminant. The

algorithm is looking for the direction providing the best

separation of the classes when projected upon. In this

figure, the third image (bottom left) provides the best

separation of the datasets.

In a classification task, an object is a member of exactly one class, and an error occurs if the object is classified into the wrong one. The objective is then to minimize the misclassification rate. With Fisher’s linear discriminant algorithm, the scatter matrix within classes $S_W$ is estimated on the training datasets. To minimize the misclassification rate on unseen test sets (generalization error), a regularization factor r (0 ≤ r ≤ 1) is introduced into the computation of $S_W$ (Hastie et al, 2001). The regularization factor r has to be optimized to minimize the misclassification error.
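A minimal two-dimensional sketch of a regularized Fisher discriminant is given below. The text does not state how r enters the computation of the within-class scatter matrix; adding r times the identity matrix is one common choice and is an assumption here, as are the function names and the toy data. It is not the MATLABArsenal implementation.

```python
def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

def scatter(points, m):
    """2x2 scatter matrix: sum over points of (x - m)(x - m)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fit_fld(pos, neg, r=0.0):
    """Return (w, projected positive centroid, projected negative centroid)."""
    m_pos, m_neg = mean(pos), mean(neg)
    sp, sn = scatter(pos, m_pos), scatter(neg, m_neg)
    # Within-class scatter with an ASSUMED regularization term r * I.
    a, b = sp[0][0] + sn[0][0] + r, sp[0][1] + sn[0][1]
    c, d = sp[1][0] + sn[1][0], sp[1][1] + sn[1][1] + r
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular within-class scatter; increase r")
    diff = [m_pos[0] - m_neg[0], m_pos[1] - m_neg[1]]
    # w = S_W^{-1} (m_pos - m_neg), using the explicit 2x2 inverse.
    w = [(d * diff[0] - b * diff[1]) / det,
         (-c * diff[0] + a * diff[1]) / det]
    project = lambda x: w[0] * x[0] + w[1] * x[1]
    return w, project(m_pos), project(m_neg)

def predict(model, x):
    """Assign x to the class whose projected centroid is nearest."""
    w, c_pos, c_neg = model
    z = w[0] * x[0] + w[1] * x[1]
    return 1 if abs(z - c_pos) <= abs(z - c_neg) else 0

# Toy data, for illustration only.
pos = [(2, 2), (3, 3), (3, 2), (2, 3)]
neg = [(0, 0), (1, 1), (1, 0), (0, 1)]
model = fit_fld(pos, neg, r=0.5)
print(predict(model, (3, 3)), predict(model, (0, 0.5)))
```

With r = 0 the sketch reduces to the unregularized discriminant; a larger r keeps the matrix invertible when features are redundant, at the price of shrinking w towards the difference of the class means.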

2 MATERIAL AND METHODS

The attributes of the dataset S were ranked according to their information gain. Afterwards, a Chi-square statistical test was applied to filter the discriminative features to be retained for an evaluation with the classification algorithm. Let us denote S1 the new dataset created with this feature selection. The dataset S1 may contain attributes which are not always documented, or at least not documented in a machine-readable format, in the clinical database. We remove them from the dataset S1 to obtain a second dataset S2.
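The ranking step can be illustrated with a small sketch of information gain for categorical attributes. The attribute names and values below are hypothetical toy data, and the subsequent Chi-square filtering is not shown.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction of class entropy obtained by splitting on a categorical feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Toy data: 1 = infected, 0 = not infected (hypothetical values).
labels = [1, 1, 0, 0]
features = {
    "attribute_a": ["yes", "yes", "no", "no"],   # separates the classes perfectly
    "attribute_b": ["yes", "no", "yes", "no"],   # carries no information
}
ranking = sorted(features, key=lambda f: information_gain(features[f], labels),
                 reverse=True)
print(ranking)
```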

We used the two datasets S1 and S2 described above to evaluate the discriminative power of the selected features. For classification purposes, we use the open-source toolbox MATLABArsenal. This MATLAB package contains many classification algorithms, in particular the regularized Fisher linear discriminant algorithm described above. The MATLAB software is invoked from the Java application developed to automate the process. This application also uses the WEKA API for other routine tasks such as the splitting into training and testing sets.

The evaluation of the predictive power of the selected features is inspired by the experimental setup described in (Rätsch et al, 2001). One hundred (100) partitions into training and testing sets were generated from each of the data sources S1 and S2, with 60% of the cases in the training set and 40% in the testing set. The original class distribution is kept in both parts of each partition. A grid search is then applied to the first five training sets, after under-sampling, using 5-fold cross-validation to find the best parameter of the classification algorithm. In the five under-sampled training sets, the classes are equally distributed (50% positive cases and 50% negative cases).
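The partitioning and under-sampling described above can be sketched as follows. This is an illustrative pure-Python version, not the WEKA-based code of the application; the function names, the fixed random seeds and the rounding rule are assumptions. The class counts (166 positive cases out of 1384) are those of the dataset S.

```python
import random

def stratified_split(labels, train_ratio=0.6, rng=None):
    """Split case indices into train/test while keeping the class distribution."""
    rng = rng if rng is not None else random.Random(0)
    train, test = [], []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_ratio * len(idx))
        train += idx[:cut]
        test += idx[cut:]
    return train, test

def undersample(indices, labels, rng=None):
    """Keep all positive cases and an equally sized random sample of negatives."""
    rng = rng if rng is not None else random.Random(0)
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    return pos + rng.sample(neg, len(pos))

labels = [1] * 166 + [0] * (1384 - 166)
train, test = stratified_split(labels)
balanced = undersample(train, labels)
print(len(train), len(test), len(balanced))
```

Under-sampling discards negative cases only, so every positive case of the training set remains available for the parameter search.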

The regularization factor r takes 41 values from $2^{-20}$ to $2^{20}$ during this process. The best parameter for each training set is the one providing the highest recall (i.e. the parameter yielding the highest rate of true positive cases). The value selected for the classification algorithm is the median of these 5 best parameters. The 100 training sets (with the original class distribution) are then used to train Fisher’s linear discriminant models with this parameter. This process allows us to build 100 models and to validate each of them on the corresponding testing set. The general performance of the classifier is computed as the mean of the 100 classification performances on the test sets. The performances of the classification algorithm on the 2 datasets (S1 and S2) are also compared using the Mann-Whitney-Wilcoxon statistical test.
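The parameter selection can be sketched as below. The grid assumes integer exponents, which is what 41 values between the two bounds suggest; the cross-validated recall values are hypothetical placeholders, and `select_parameter` is an illustrative name.

```python
from statistics import median

# 41 candidate values, assuming integer exponents from -20 to 20.
grid = [2.0 ** k for k in range(-20, 21)]

def select_parameter(recall_by_set):
    """Pick, for each training set, the r with the highest mean cross-validated
    recall, then return the median of these best values."""
    best = [max(scores, key=scores.get) for scores in recall_by_set]
    return median(best)

# Hypothetical mean recalls for a few grid values on five under-sampled sets.
recall_by_set = [
    {0.25: 0.80, 0.5: 0.78, 1.0: 0.75},
    {0.25: 0.70, 0.5: 0.82, 1.0: 0.79},
    {0.25: 0.71, 0.5: 0.83, 1.0: 0.80},
    {0.25: 0.69, 0.5: 0.74, 1.0: 0.81},
    {0.5: 0.72, 1.0: 0.76, 2.0: 0.84},
]
print(len(grid), select_parameter(recall_by_set))
```

Taking the median rather than the mean keeps the selected value on the grid when the number of training sets is odd, and makes it insensitive to a single extreme choice.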


Table 1: List and rank of the features obtained with information gain followed by Chi-square filtering. The first column provides the rank of each attribute.

Rank Selected attributes
Antibiotic therapy
Fever
Mechanical ventilation
Urinary tract
Workload value > 91.5
Workload value <= 45.5
Stay at the intensive care unit during hospitalization
Central venous catheter
Hospitalization duration up to 7.5 days
Intensive care unit ward
Obstetrical ward
Surgery
McCabe score fatal < 6 months
No MRSA colonization
Actual MRSA colonization
McCabe score non fatal
Workload value between 45.5 and 91.5
Diabetes with organ affected
Transfer from another hospital as admission
Congestive cardiomyopathy

Figure 2: The mean of each performance measure on the datasets S1 and S2.


3 RESULTS

3.1 Feature Selection

Twenty (20) attributes were retained by the feature selection process, as summarized in Table 1. This subset is denoted S1. Two (2) clinical attributes are not always documented, or at least not documented in a machine-readable format, in the clinical database: fever and workload value. These attributes were removed to create a second data source, denoted S2, even though the information gain ranked them at the fourth and sixth positions.

3.2 Model Selection

The grid search applied to the two datasets S1 and S2 returned r = 0.5 and r = 1, respectively, as the best parameter. Figure 2 summarizes the mean of each performance metric (accuracy, precision, recall, F-measure and the ratio of positive predictions) obtained with the 2 datasets.

3.3 Classification Performances

Datasets S1 and S2 yield, respectively, a mean recall (±standard deviation, SD) of 65.37% (±6.76) and 82.56% (±4.22), a precision (±SD) of 41.50% (±3.9) and 43.54% (±4.59), and an F-measure (±SD) of 50.58% (±3.83) and 56.87% (±4.29) over the 100 training/testing splits. The mean accuracies (±SD) for S1 and S2 are 84.83% (±1.04) and 85.04% (±1.65), and the positive prediction ratios are respectively 18.82% (±1.72) and 22.73% (±1.55).

According to the results above, by querying the hospital data warehouse with the features present in datasets S1 and S2 and classifying the results with the FLD algorithm, we can expect to retrieve on average (±SD) 65.37% (±6.76) and 82.56% (±4.22) of the infected patients, respectively. The mean proportions of potential cases (±SD) to be submitted to the infection control practitioners (ICP) are respectively 18.82% (±1.72) and 22.73% (±1.55) of the hospitalized patients.
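For reference, the five performance measures reported in this section can be computed from the confusion matrix of one test set as sketched below. The counts are hypothetical round numbers chosen for readability, not results of this study.

```python
def metrics(tp, fp, fn, tn):
    """Performance measures from true/false positive and negative counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        # Share of all cases that would be submitted for review.
        "positive_ratio": (tp + fp) / total,
    }

# Hypothetical test set of 500 cases, 50 of them infected.
m = metrics(tp=40, fp=60, fn=10, tn=390)
for name, value in m.items():
    print(f"{name}: {100 * value:.2f}%")
```

The positive prediction ratio is the quantity that determines the review workload, while the recall determines how many infected patients reach the reviewers.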

A Mann-Whitney-Wilcoxon statistical test provided a p-value < 0.001 for accuracy, precision, F-measure and the positive prediction ratio. According to this test, the differences between S1 and S2 in accuracy, precision, F-measure and positive prediction ratio are statistically significant. The removal of the fever (temperature) and workload features significantly improved the performance of the FLD.

4 DISCUSSION AND

CONCLUSIONS

In this paper, we present a framework to build an NI model based on a small number of clinical features, which makes it possible to report NI cases to be reviewed by infection control practitioners. Fisher’s linear discriminant was chosen as the detection algorithm. The removal of the attributes characterizing fever and workload value does not reduce the sensitivity of the classifier. This may be explained by the strong correlation of these attributes with some important features: antibiotic therapy for fever; and surgery, stay at the intensive care unit during the hospitalization, artificial ventilation, urinary tract, and central venous catheter for the workload value. The automation of the process requires the integration of data from laboratory, radiology, nursing, and clinical databases.

Limits of this Work. The evaluation of the discriminative power of the selected features was carried out using Fisher’s linear discriminant algorithm only. A comparison with other classification algorithms, such as Support Vector Machines (SVM) and the kernel Fisher discriminant, could be a good option to improve the classification performance. The framework could also be extended with an evaluation of the best classifiers. The grid search for optimal parameters has a high computational cost, especially for classification algorithms with more than one parameter to optimize, such as SVMs or the kernel Fisher discriminant. A gradient descent method can be used to find the best parameters and can improve the generalization performance, as described in (Chapelle et al, 2002).

Future Work. The framework introduced in this paper makes it possible to evaluate the discriminative power of a subset of important features from the NI database. The feature selection method chosen in this work is based on the information gain combined with a Chi-square statistical test. More experiments with other feature selection techniques are required (Guyon and Elisseeff, 2003). The discriminative power of the selected features will be evaluated with more than one classification algorithm. The result of these evaluations, i.e. the minimal set of attributes required to predict most of the positive NI cases, will be retained to build queries for the hospital databases in order to automatically report potential cases for the prevalence surveys. This automated nosocomial infection reporting will make it possible to conduct more prevalence surveys at a lower cost.

ACKNOWLEDGEMENTS

The authors are grateful for the dataset provided by

the infection control team at the Geneva University

Hospital.

REFERENCES

Chapelle, O, Vapnik, V, Bousquet, O, Mukherjee, S, 2002. Choosing multiple parameters for support vector machines. Mach. Learning.
Charlson, ME, Pompei, P, Ales, KL, MacKenzie, CR, 1987. A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J Chronic Dis.
Cohen, G, Hilario, M, Sax, H, Hugonnet, S, Geissbuhler, A, 2006. Learning from imbalanced data in surveillance of nosocomial infection. Artificial Intelligence in Medicine.
Estabrooks, A, 2004. A multiple resampling method for learning from imbalanced datasets. Comput Intell.
Fisher, RA, 1936. The use of multiple measurements in taxonomic problems. Annals of Eugenics.
French, GG, Cheng, AF, Wong, SL, Donnan, S, 1983. Repeated prevalence surveys for monitoring effectiveness of hospital infection control. Lancet.
Guyon, I, Elisseeff, A, 2003. An introduction to variable and feature selection. Mach. Learning Res J.
Hastie, T, Tibshirani, R, Friedman, J, 2001. The elements of statistical learning: data mining, inference, and prediction. Springer.
Japkowicz, N, Stephen, S, 2002. The class imbalance problem: a systematic study. Intell Data Anal J.
Kononenko, I, 1995. On biases in estimating multi-valued attributes. In Proceedings of the 14th International Joint Conference on Artificial Intelligence. Morgan Kaufmann.
McCabe, WR, Jackson, GG, 1962. Gram-negative bacteremia, I: etiology and ecology. Arch Intern Med.
Rätsch, G, Onoda, T, Müller, KR, 2001. Soft margin for AdaBoost. Mach. Learning.
Sax, H, Pittet, D, 2002. Swiss-NOSO Network. Interhospital differences in nosocomial infection rates: importance of case-mix adjustment. Arch Intern Med.
