MULTIVARIATE TECHNIQUE FOR CLASSIFICATION RULE
SEARCHING
Exemplieied by CT Data of Patient
Jyhjeng Deng
Industrial Engineering & Technology Management Department, DaYeh University
112 Shan-Jeau Rd., Da-Tsuen, Chang-Hua, Taiwan
Keywords: Correspondence Analysis, Fisher’s linear Discriminant function, multivariate categorical data, parallel
coordinate plots.
Abstract: In the process of searching classification rules for multivariate categorical data, it is crucial to find a quick
start to locate the combination of levels of input and response variables which can contribute to the most
correct classification rate for the response variable. Fisher’s linear discriminant function is proposed to
select some important input-variable candidates; then, correspondence analysis is used to ascertain that the
level of candidates is closely related to the appropriate level of response variable. The closest linkage
between input variable and response variables is chosen as the rule for each input-variable candidate. The
algorithm is applied to the hospital data of patients whose CT scan diagnosis awaits a decision. The result
shows that my algorithm is not only quicker than an exhaustive search but the result is also identical to the
optimum solution by exhaustive search in terms of the correct classification rate. The correct classification
rate is about 80%. Finally, two parallel coordinate plots of the 20% mistakenly classified data and the
corresponding correctly classified data are compared, showing their mutual confounding and explaining
why the correct classification rate cannot be further improved.
1 INTRODUCTION
Patients sent to the emergency unit of a hospital
need immediate care to save their life. To attend to
patients in an appropriate manner, correct treatment
is crucial, leading to the issue of making a correct
diagnosis. In this investigation a data set on 959
patients sent to a local hospital within a certain
period of time for emergency care is collected. The
on duty physicians face the decision whether the
patients need head-computed tomography (HCT),
more commonly known as computed axial
tomography (CAT or CT scan). Since each patient
differs, a rule based on the physical characteristics
(such as blood pressure, breath, mental status, and
triage level, etc.) of the patient should be formulated
to help the physician make an appropriate decision
on the need for HCT. The first 80% (767) of the
original data is used as a training set to establish the
rule; whereas, the second 20% (192) is used as test
data to demonstrate the effectiveness of the rule.
The data set contains seven independent
variables, A1-A7 (such as sex, age, triage level,
mental status, breathing rate, diastolic blood
pressure and pulse rate) and one response variable,
HCT (D). The medical data are further classified by
the physician as in the contingency table shown in
Table 1 in Appendix A. Note that D has two
meanings: (1) the response variable of the HCT,
being the highest standard determined by the on-
duty physician, and (2) the result determined by the
classification rule. At first the double meaning might
be somewhat confusing, but it eliminates the need to
define another variable, as should be clear from the
context as this report proceeds.
A simple and direct way to solve the problem is
enumeration, an exhaustive method. In the single-
variable rule search, there will be
2
p
ways to
classifiy a categorical input variable with p levels
into a response variable of 2 levels . For example, to
find the best rule for variable A3 (with three levels)
which will provide the most correct classification via
HCT (with two levels), the correct rate for the
following six rules must be computed: (A3=1, D=1),
(A3=1, D=2), (A3=2, D=1), (A3=2, D=2), (A3=3,
D=1), (A3=3, D=2). Since there are seven
278
Deng J. (2008).
MULTIVARIATE TECHNIQUE FOR CLASSIFICATION RULE SEARCHING - Exemplieied by CT Data of Patient.
In Proceedings of the Tenth International Conference on Enterprise Information Systems, pages 278-286
DOI: 10.5220/0001723702780286
Copyright
c
SciTePress
independent variables and each is classified as 2, 5,
3, 4, 3, 3 and 4 levels, 2*(2+5+3+4+3+3+4)=48
rules need to be evaulated. The classification rule
(A3=1, D=1) means that if the patient’s triage level
is 1, he/she needs to have the HCT. This decision is
applied to the training data; however, it may be
incorrect according to the criterion of response
variable D. Thus, a correct rate can be calculated,
and the highest correct rate chosen. In this case,
when dealing with only one independent variable,
the obtained rule is applied to the test data set to
determine whether this rule can render a similar
correct classification rate. This type of selection
procedure can be applied to the two independent
variables. When variable
1
x has
1
l levels and
variable
2
x has
2
l levels, then there are 2
12
ll
rules to be evaluated. In this case study, there will be
2*(2*5+2*3+2*4+2*3+2*3+2*4+5*3+5*4
+5*3+5*3+5*4+4*3+4*3+4*4+3*3+3*4+3*4)= 404
rules altogether. Clearly, evaluating each in
succession is very time consuming; therefore, a
faster, more reliable method should be sought. For
this purpose, two multivariate techniques are used to
solve the problem in sequence. First, Fisher’s linear
discriminant function is built and important input
variables selected. By following the correspondence
analysis, the level from the input variable closest to
the level of the response variable can be selected as
the classification rule. For example, variable triage
(A3) is considered to be one of the most important
of the seven independent variables. Then, a
Euclidean distance can be derived between the level
of independent variable A3 and the level of decision
variable D. The shortest distance between them is
chosen as the classification rule for input variable
A3. By extending the aforementioned procedure for
a single variable to two variables, a combination of
composite rules to determine the optimum correct
classification rate can be established, in this case
about 80%. Finally, two parallel coordinate plots of
the 20% mistakenly classified data and the
corresponding correctly classified data are compared,
showing their mutual confounding and explaining
why the correct classification rate cannot be further
improved. Although Fisher’s linear discriminant
function and correspondence analysis are two well
known techniques in multivariate analysis, using
them together to find the classification rule for
multivariate categorical data is unusual. Fisher’s
function is mostly used to classify multivariate
continuous data into various categories. In the
literature, one can find its application to face
detection (
Yang, Kriegman and Ahuja, 2001) and its
combination with linear programming (
Lam and Moy,
2003). Used to detect the root of two categorical
variables. Correspondence analysis can be used in
the ecological study of animal populations (
Allombert,
Gaston and Martin, 2005
).
2 FISHER’S LINEAR
DISCRIMINAT FUNCTION
Suppose that there is a sample data set X of
multivariate variable
x composed of samples
j
X
with sample size of
j
n , 1, 2, ,j J= L , from
J
populations. To obtain the optimum classification
rule for multivariate sample
X , Fisher suggests
finding the linear combination of
T
ax which
maximizes the ratio of between-group-sum of
squares to the within-group-sum of squares (Hardle
and Simar, 2003),
T
T
aBa
aWa
, (1)
where
B is the between-group-sum of squares,
defined as (Johnson and Wichern, 2003)
()()
1
J
T
ii i
i
B nx x x x
=
=−
; (2)
whereas,
W is the within-group-sum of squares,
defined as
()()
11
i
n
J
T
ijiiji
ij
Wxxxx
==
=−
∑∑
. (3)
Note that
ij
x represents the
th
j sample from
population
i ;
i
x , the sample mean of population i ;
x
, the grand average of the total samples. The
solution of vector
a is found in Theorem 1 (Hardle
and Simar, 2003).
Theorem 1. The vector
a that maximizes (1) is the
eigenvector of
1
WB
corresponding to the largest
eigenvalue.
Now, the pertinent discrimination rule is as follows:
Classify
x into group j where
T
j
ax is closest to
T
ax. When J =2 is grouped, the discriminant rule
is computed as follows: The corresponding
eigenvector in Theorem 1 is
1
12
()aW x x
=−.
The corresponding discriminant rule is
M
279
MULTIVARIATE TECHNIQUE FOR CLASSIFICATION RULE SEARCHING - Exemplieied by CT Data of Patient
()
()
1
2
if 0
if 0
T
T
xaxx
xaxx
→Π >
→Π
, (4)
where
i
Π represents population i .
Note that all the data are not assumed to be normal;
the only assumption is that they are real numbers.
After this short introduction to Fisher’s linear
discriminant function, the focus is now directed
toward its application to the hospital data. To sketch
the eight variables of 767 data set, parallel
coordinate plots are used. The result is shown in
Appendix B. Since variable A1 is nominal and the
analysis in Appendix B indicates that variable A1
has no strong correlation with variable D. Fisher’s
linear discriminant function with input variables
(A2-A7) is found. Vector
a is [0.00023445, -
0.00079367, 0.00062467, 0.0020182, -2.9766e-06, -
0.00035162]
T
; the grand average
x
is [3.4433,
1.7901, 1.1917, 1.9974, 2.9126, 2.3051], the values
in
a and
x
corresponding to variables A2-A7.
Then, the number of individuals coming from
j
Π
,
which have been classified into
i
Π by
ij
n , are
denoted. By applying the discriminant rule in Eq. (1)
to the test data set, one has
12
n =0 and
21
n =48, the
correct classification rate being 0.75. By examining
the magnitude of the coefficients in
a , it is clear
that the three most important variables are A5
(breathing), A3 (triage) and A4 (mental state);
whereas, the least important is A6 (blood pressure).
Thus, the correlation between the level of these
variables and the CT level is investigated and the
closest relationship between them in terms of the
Euclidean distance searched. The closest is chosen
as the rule to classify the patients who need HCT. A
detailed explanation follows.
3 CORRESPONDENCE
ANALYSIS
The aim of correspondence analysis is to develop
simple indices showing relationships between the
rows and columns of a contingency table, wherein
row and column represents one category of the
corresponding variables. The entry
ij
x
in table
Χ
(with dimension (nxp)) represents the number of
observations in a sample which simultaneously fall
in the i
th
row category and the j
th
column category,
for
1, 2, ,in= L and 1, 2, ,
j
p
=
L . Then the
association between the row and column categories
can be measured by an
2
-test statistic defined as
()
2
2
11
/ ,
p
n
ij ij ij
ij
x
EE
χ
==
=−
∑∑
(5)
where
..
..
ij
ij
x
x
E
x
= with
.i
x
represents the sum in
the i
th
row;
.
j
x
, the sum in the j
th
column; and
.. .
1
n
i
i
x
x
=
=
, the grand total. Under the hypothesis of
independence,
2
has an
2
(1)( 1)np
χ
−−
distribution. If
the test statistic
2
is significant at the 5% level,
investigating the special reasons for the departure
from independence is worthwhile. To extract the
elements of dependence, the principle of
correspondence analysis (CorrAna) is brought into
play. The CorrAna procedure first determines the
SVD (singular value decomposition) of matrix
C
(nxp) with elements (Hardle and Simar, 2003)
(
)
1/2
/
ij ij ij ij
cxEE=− . (6)
When assuming that the rank of
C is
R
, the SVD
of
C yields
T
C
=
ΓΛΔ , (7)
where
Γ
contains the eigenvectors
k
γ
(nx1) of
T
CC ,
Δ
the eigenvectors
k
δ
(px1) of
T
CC
where
k
=
1,2,…,R and
()
1/2 1/ 2
1
diag , ,
R
λλ
Λ= L
(where diag represents the diagonal matrix) with
12 R
λ
λλ
≥≥L (the eigenvalues of
T
CC ). By
defining the matrices
A (nxn) and
B
(pxp) as
(
)
()
..
diag and diag
ij
A
xB x==, (8)
one can calculate
k
r (nx1) and
k
s (px1),
1, 2, ,kR
=
L as
1/ 2
1/ 2
,
,
kk k
kk k
rA
sB
λ
γ
λ
δ
=
=
(9)
where point vectors [
1
r ,
2
r ] and [
1
s ,
2
s ] are
plotted onto a two-dimensional graph, called biplot,
with n points in point vector [
1
r ,
2
r ] representing
the n rows and p points in [
1
s ,
2
s ] representing the
p columns. Thus, the entire contingency table can be
simplified as n+p points on the 2D graph. The
ICEIS 2008 - International Conference on Enterprise Information Systems
280
relationship between n points [
1
r ,
2
r ] and p points
[
1
s ,
2
s ] explains why rows and columns are not
independent.
Now, the correspondence analysis (CorrAna) is
applied to the contingency table of variables A5 and
D as an illustration, shown in Table 1.
Table 1: Contingency table of variables A5 and D for 767
Data.
HCT (D)
1 2
Breath
(A5)
<10/min 1 4
10~24/min 249 510
>24/min 3 0
The corresponding
R of
C
for Table 1 is 1;
moreover, vector
1
r =[-0.2762, -0.0038143, 1.4253]
and vector
1
s =[0.13109, -0.064523]. Since R =1,
there are no
2
r and
2
s . A zero vector is substituted
for
2
r and
2
s to make a 2D biplot, shown in Fig. 1.
Figure 1: Biplot of variables A5 and D.
Note that the two levels of response variable D
(CTYes corresponding to HCT [D=1] and CTNo,
HCT [D=2]), are close to the second level of
independent variable A5. Examining the value of
1
r
(representing A5) and
1
s (representing D), one may
clearly observe that HCT (D=2) is closer to (A5=2)
than HCT (D=1). The value of HCT (D=2) in Figure
1 is -0.064523, being the second value in
1
s ; the
value of (A5=2), -0.0038143, being the second value
in
1
r ; the value of HCT (D=1), 0.13109, being the
first value in
1
s . The rule is then derived as if
(A5=2); therefore, HCT should not be administered.
Thus when the breathing level is normal, the patient
does not need HCT. By applying this rule to the
remaining test data, one has
12
n =0 and
21
n =47, the
correct rate being 0.75521. Table 1 clearly indicates
that the rule is optimum; hence, any other rule will
yield a worse correct rate. The effectiveness of
applying this rule to the test data can be clearly
observed by scrutinizing the contingency table of
variables A5 and D for the test data set, shown in
Table 2.
Table 2: Contingency table of variables A5 and D for 192
Data.
HCT (D)
1 2
Breath
(A5)
<10/min 1 0
10~24/min 47 144
>24/min 0 0
The rule can be applied to Table 2 to obtain Figure 2,
where the correct result from the main rule if
[A5=2] then do not administer HCT is marked in
red; whereas, the correct result from the congruent
rule if [A5 2] then administer HCT is marked
in blue. The correct rate is (1+0+144)/192=0.75521.
No other classification rule will result in a better
correct rate.
Figure 2: Classification rule with contingency table.
By applying the same procedure to the contingency
tables of variables A3 and D, and A4 and D, the
correct rates of 0.70312 and 0.72396, respectively,
are obtained. The rules derived are if (A3=1), then
HCT (D=1); and if (A4=2), then HCT (D=1),
respectively. These values indicate that when the
triage level is 1, the patient needs HCT; and when
the mental status is ‘to call’, HCT, respectively. The
corresponding biplots are shown in Figures 3 and 4.
For the sake of analytical completeness, the
contingency tables of variables A1-A7 vs D for both
the training and the test data sets are listed in
Appendix C.
M
281
MULTIVARIATE TECHNIQUE FOR CLASSIFICATION RULE SEARCHING - Exemplieied by CT Data of Patient
Figure 3: Biplot of variables A3 and D.
Figure 4: Biplot of variables A4 and D.
From the data reported in Appendix C and previous
results, one may assert that analysis by Fisher’s
linear discriminant function and CorrAna provides a
quick start for identifying the important input
variables for searching the classification rule. In this
case the important input variables are A3, A4 and
A5, being in agreement with the Chi-squares test on
the contingency table shown in Appendix C.
Furthermore, the rule provided by CorrAna is in
very strong agreement with the exhaustive search,
with only a small difference in the rule for A4. The
rules by CorrAna for each important variable are: if
A3=1 then D=1; if A4=2, then D=1; if A5=2 then
D=2. These rules differ from the ones obtained by
exhaustive search only in variable A4, where if
A4=1 then D=2; however, there is no difference in
the correct classification rate.
The optimum correct classification rate by the
single-variable classification rule is 0.75521,
provided by the rule of input variable A5, where if
(A5=2) then D=2. Further research is needed to find
composite classification rules for two variables
which, perhaps, would render a higher correct rate.
The result of two-variable classification is explained
below and listed in the combination of composite
rules.
4 OPTIMUM RULE OF
COMBINATION FOR TWO
VARIABLES
Following the previous arguments, the combination
rule for two variables is straightforward. For the
sake of simplicity, only the procedure for finding the
optimum correct rate is illustrated. Variables A3 and
A4 are used to form a new variable, A9, where
A9=(A3-1)*4+A4. (10)
Here A9, A3 and A4, in addition to designating
the composite variable name, triage and mental
status, also represent the level of the corresponding
variables. Since the level of variable A3 is three and
that of A4 is 4, the level of A9 indicates the
combination level of A3 and A4, as shown in
equation (10). For example, when A9=9 the equation
denotes that A3=3 and A4=1. Theoretically, the total
level of A9 is 12; however, in this case study it is
only 10 because levels 11 and 12 of A9 are missing.
Therefore, the frequencies of (A3=3 and A4=3) and
(A3=3 and A4=4) are zero with regard to both levels
of HCT response variable (D). Applying CorrAna to
A9 with regard to response D, the biplot shown in
Figure 5 is obtained.
For a clearer display, the labels of the levels of
response D are changed from ‘CTYes’ and ‘CTNo’
to ‘Y’ and ‘N’. It is clear from Figure 5 that levels 8,
3, 4 and 2 are located to the left of ‘Y’ and levels 10
and 9 to the right of ‘N’; whereas, the other levels
remain between ‘Y’ and ‘N.’ Thus, it can be said
that when (A3=2 and A4=4) or (A3=1 and A4=3 or
4 or 2), then one should judge D=1; whereas, when
(A3=3 and A4=2) or (A3=3 and A4=1), then D=2.
Therefore, when the triage level is 2 and the mental
status level is a coma, or when the triage level is 1
and the mental status is unclear, one should judge
HCT to be necessary. In such cases, the patients are
in serious conditions; thus, administering HCT is
appropriate and useful in diagnosing the root
problem. However, when patient is in a level 3 triage
and the mental status is clear or one capable of
responding to a verbal stimulus, HCT is unnecessary.
Note that in these two cases, the patients are in better
condition, but that many patients do not fall into
either of these stated categories. To overcome this
ICEIS 2008 - International Conference on Enterprise Information Systems
282
problem, the single rule of if A5=2, D=2 is
applied to the remainder.
It is worth noting that the single variable rule, the
shortest Euclidean distance between the levels of the
independent and the response variables is chosen as
the classification rule; whereas, in the two-variable
rule, the levels of the composite formed by the two
variables on each side of the response variable is
chosen as the composite rule. These principles are
true because in the single-variable case, once the
main classification rule has been set, its complement
(congruent) is automatically determined. For
example, in the case of A5 and D, as shown in
Figure 2, once the main rule, if (A5=2) then D=2, is
set, its complement, if (A5 2) then D=1, is
automatically determined. Note that (A5 2) means
that (A5=1 or 3). Therefore, once the destiny of
(A5=2) is determined as D=2, the other choices of
A5 have a pre-determined result. Thus, a reasonable
choice for the classification rule can be based on the
shortest linkage between the levels from the
independent and the response variables.
If the same argument is followed in the two-
variable case, then there is only one level in the
composite variable A9 which can be associated with
one of the two levels of D. The other values of A9
will be assigned with the alternative value of D, a
procedure which does not make good sense since the
other values are not necessarily exclusive with the
chosen value in the main rule. To clarify this point,
an illustrative example is given as follows. If the
shortest distance between level 5 of A9 and level 2
of D is chosen as the classification rule, then by the
same argument in single-variable, level 5 of A9
should be associated with level 2 of D and other
values (these include level 9, of course) of A9
should be with level 1 of D. However, Figure 5
clearly shows that level 9 of A9 should be associated
with level 2 of D since it is closely associated with
level 2 of D by the interpretation of correspondence
analysis (Hardle and Simar, 2003). Thus, the only
reasonable classification rule is to divide the levels
of the composite variable into three regions with the
levels of D as the demarcation points. With the
levels in the middle region undecided, the levels in
the left region are associated with the left
demarcation point; whereas, the levels in the right
region are assigned to the right extremity. Note also
that the levels in the middle region can be classified
later by the rule derived from the single variable.
The optimum correct classification rate by these
two-variable classification rules in addition to the
single rule is 0.76562, with
12
n =2 and
21
n =43. A
slightly better result is achieved than from the single
variable rule where the correct classification rate is
0.75521, with
12
n =0 and
21
n =47.
Figure 5: Biplot of variables A9 and D.
By examining the two misclassifications of
12
n ,
one finds an additional rule to eliminate
12
n : When
(A3=1, A4>=3, A6=3) then D=2. This means that
when the triage level is 1, the mental status is ‘to
pain’ or ‘coma’, and the diastolic blood pressure is
above 110 mm Hg (very serious high blood
pressure), the patient should not be administered
HCT because the situation is probably too dangerous.
This is a special provision under the rule of stating
that when (A3=1, A4>=3) then D=1, thereby
indicating the importance of abnormally high
diastolic blood pressure, a strong indicator to
overrule the HCT decision under serious health
conditions.
At this point, the correct classification rate is
0.77604, with
12
n =0 and
21
n =43. Note that
21
n
means the number of misclassified members,
thereby these members are treated as not
administering HCT (D=2) when in fact they need for
administering HCT (D=1). Misclassifying D=1 as
D=2 is more serious than that of D=2 as D=1 since
the penalty for the former error is life or death;
whereas, the consequence of the latter is merely a
waste of CT resource utilization. Note that of 192
patients only 48 patients were classified as D=1;
moreover, of these 48, the classification was correct
only five times. Since correct classification rate for
the 48 patients was very low, it is worthwhile to
investigate why
21
n cannot be reduced. By
examining the sorted data of
21
n =43, one notices
M
283
MULTIVARIATE TECHNIQUE FOR CLASSIFICATION RULE SEARCHING - Exemplieied by CT Data of Patient
that the age factor has been overlooked. When
considering the age factor (A2), a new rule is
formulated to reduce
21
n : when (A2=5, A3=1,
A6=3) then D=1. This means that when the patient’s
age is above 65 years, the triage level is 1, and the
diastolic blood pressure is above 110 mm Hg, the
patient should be administered HCT. When this rule
is applied in addition to the previous composite rules,
the correct classification rate is increased to 0.79167,
with
12
n =1 and
21
n =39. The result of
12
n =1 is an
exception to the previous rule. Apparently, nothing
can be done to further reduce the
21
n . The parallel
coordinate plot (PCP) of seven variables (A1-A7)
for the data of
21
n =39 is shown in Figure 6.
Figure 6: Parallel coordinate plots of 39 hospital data.
Several points are noteworthy. First, in comparison
with Figure B1, Figure 6 has only one colour (black)
since the data of
21
n =39 are members of D=1 only.
Second, there is no connection between variables A3
through A6, thereby, indicating that the levels of A4
and A5 are of single value. Indeed, A4=1 and A5=2,
thus indicating that patients having a clear mental
status and a normal breathing rate are easily
misclassified as not needing HCT, an understandable
error. Third, there are four age levels (2-5) instead of
the five in the original setting. Each age level is
connected with two triage levels except for age level
2 of which is connected to only triage level 2. In
comparison with the PCP in Figure B1, the pattern
in Figure 6 is quite different, wherein each age level
is connected to almost every triage level. Fourth, the
diastolic blood pressure is shown only for levels 2
and 3, thereby indicating that none of the patients
has normal blood pressure. Moreover, the pulse
levels are at 1-3, thus indicating that none of the
patients has an unusually high pulse rate (greater
than 120/min). Furthermore, there is no connection
between A6=2 and A7=1, thereby indicating that no
patient has blood pressure in the range of 80 to 110
mm Hg and a pulse rate lower than 60/min.
After a close examination of the sorted 39 data
sets, another rule is discovered: if (A2=5, A3=1,
A4=1) then (D=1). This rule indicates that when the
patient is very old (more than 65 years) and has a
triage level of 1 and a clear mental status, HCT
should be administered. This rule will reduce one
mistake in
21
n , thereby rendering the correct
classification rate of 0.79688 with
12
n =1 and
21
n =38, the optimum discoverable solution. The
PCP of the
21
n =38 data set is shown in Figure 7.
When comparing Figures 6 and 7, one notices that
the line connecting the normalized value of A2=1 to
A3=0 in Figure 6 has been deleted from Figure 7.
There is no observable distinction between D=1
from 38 patients and D=2 from 122 patients,
extracted from 144 data sets, wherein D=2 in the test
data has the response variable D=2 with A2>1 and
A4=1 and A5=2 and A6>1 and A7<4. The
aforementioned conditions set for D=2 are exactly
the same as for the 38 sets except for D=1. The PCP
of the 122 sets is shown in Figure 8. When
comparing Figures 7 and 8, it is clear that if the line
segments in each Figure are treated as elements in a
set, then Figure 7 can be regarded as contained in
Figure 8 in terms of the set concept, thereby
demonstrating that since the 38 sets are prominently
involved with the corresponding 122 sets, the two
cannot be separated by any rule. For the sake of
completeness, part of the XploRe (Hardle, Klinke
and Muller, 2000) code is listed to illustrate the
formulation of the composite rule in Appendix D.
The self-explanatory code is similar to the c code.
Figure 7: Parallel coordinate plots of 38 hospital data.
ICEIS 2008 - International Conference on Enterprise Information Systems
284
Figure 8: Parallel coordinate plots of 122 hospital data.
5 CONCLUSIONS
Two multivariate techniques have been proposed to
clarify patients sent to an emergency room to wait
for a decision on the administration of HCT. The
959 data set were segmented into two portions, a
767 training data set and a 192 test data set, after
which Fisher’s linear discriminant function was used
to find linear rule vector
a . Since classification
using rule vector
a in equation (4) is not practical
for on-duty physicians, three important variables,
such as triage (A3), mental status (A4) and breathing
rate (A5), were chosen on the basis of the magnitude
of the coefficients of
a . Next, correspondence
analysis was used to determine the simple
classification rule most suitable for each variable to
classify the need for administering HCT. The simple
classification rule has the format of if A5=x then
D=ywhere x and y are the levels of the input (A5)
and output (D) variables, respectively. The selection
of the rule is based on the shortest Euclidean
distance between the levels of the input variable
(e.g., A5) and response variable D located on the x-
axis of a biplot. The case study demonstrated that
output from the joint effort of the multivariate
technique coincided with the exhaustive search, a
promising result. The optimum correct rate is only
0.75521 with
12
n =0 and
21
n =47 for the rule of (if
A5=2 the D=2), meaning that if the patient’s
breathing rate lies within the normal range of
10~24/min, HCT is not needed.
The extension of a single-variable classification
rule to a two-variable one is straightforward, yet
requiring a small modification for choosing the rule.
First, a composite variable (e.g., A9) is formed by a
linear combination of the two variables based on
equation (10) so that each combination of the levels
from the two maps into an integer level of the
composite variable. Then typical CorrAna is applied
to the contingency table formed by variables A9 and
D, wherein a biplot is produced with points
representing both the levels of the composite
variable and response variable D. By taking the two
points of the levels of D as the demarcation points,
the x-axis can be cut into three regions: one to the
left of the left extremity, the second between the
demarcations, and the third to the right of the right
extremity. Moreover, the levels in the left regions
are assigned to the level of D at the left demarcation
point, the levels in the right regions to the level of D
at the right demarcation point, and the levels of the
composite variable between to the level of D on the
basis of the optimum classification rule from the
single variable. The two variables selected are triage
(A3) and mental status (A4), which render the
highest correct classification rate among all
combinations of two variables. The rules state that
when (A3=2 and A4=4) or (A3=1 and A4=3 or 4 or
2) D=1; whereas, when (A3=3 and A4=2) or (A3=3
and A4=1), D=2. Thus, the correct classification rate
is 0.76562 with
12
n =2 and
21
n =43.
The correct classification rate can be further
increased by examining the structure of the sorted
but misclassified items in the test data set. The
formulation of the composite rule for the case study
is listed in Appendix D, with the correct
classification rate of 0.79688 with
12
n =1 and
21
n =38. The composite rules may be generally
summarized as (1) when the triage level is 2 and the
mental status is a coma, or when the triage level is 1
and the mental status is unclear, HCT should be
administered, and (2) when the patient is in level 3
of triage and the mental status is one capable of
responding to a verbal stimulus, HCT is unnecessary.
Exceptional rules should be applied to patients older
than 65 years (A2) and those with high diastolic
blood pressure (A6). For example, (1) when the
triage level is 1, the mental status is ‘to pain’ or
‘coma’, and the diastolic blood pressure is above
110 mm Hg (seriously high), the patient should not
be administered HCT; (2) when the patient’s age is
older than 65 years, the triage level is 1, and the
diastolic blood pressure is above 110 mm Hg, the
patient must be administered HCT. It is noteworthy
that the variables of sex (A1) and pulse rate (A7) are
not considered in the composite rules.
M
285
MULTIVARIATE TECHNIQUE FOR CLASSIFICATION RULE SEARCHING - Exemplieied by CT Data of Patient
To show why the correct classification rate cannot
be increased, two parallel coordinate plots of
the
21
n =38 data set being in D=1 and the
corresponding 122 data set being in D=2 were
compared. The two data sets had the same domains
for variables A1-A7. The comparison indicated that
since both are prominently involved (highly similar),
they cannot be separated by any rule. Thus, no
improvement can be made in the correct
classification rate.
ACKNOWLEDGEMENTS
I grateful thank my colleague Prof. Paul Chen for
both encouraging me to pursue this research and
generally sharing his hospital data. I also express
appreciation to Dr. Cheryl Rutlede, Department of
English, DaYeh University, for her editorial
assistance.
REFERENCES
Yang, M., Kriegman, D and Ahuja, N., 2001, Face
Detection Using Multimodal Density Models,
Computer Vision and Image Understanding 84, 264–
284.
Lam, K. and Moy, J., 2003, A piecewise linear
programming approach to the two-group discriminant
problem
an adaptation to Fisher’s linear
discriminant function model, European Journal of
Operational Research 145, 471 481.
Allombert, S., Gaston, A. and Martin, J., 2005, A natural
experiment on the impact of overabundant deer on
songbird populations, Biological Conservation 126,
1 13.
Hardle, W. and Simar L., 2003. Applied Multivariate
Statistical Analysis, Springer. Berlin.
Johnson R. and Wichern D., 2002, Applied Multivariate
Statistical Analysis, Prentice Hall. 5
th
, NJ, USA.
Hardle, W., Klinke, S. and Muller M., 2000. XploRe
Learning Guide, Springer. Berlin.
ICEIS 2008 - International Conference on Enterprise Information Systems
286