CART method, which is specifically chosen over
other methods for its reliability, speed and
accuracy.
2 RELATED WORK
Durand and Soulet (2005) worked on the
characterization of liver fibrosis using clustering on
the same dataset. They proposed a soft clustering
method to build a global model from emerging
patterns, which describe local contrasts between two
or more classes. Focusing on the in-hospital
examinations, the authors identified some
examinations that are more strongly associated with
the severe stages of liver fibrosis. They also noticed
that it is more difficult to characterize the initial
stages than the severe stages.
Yaseen et al. (2011) proposed a model using
Principal Component Analysis and a regression
model to predict the probability of survival or death
of hepatitis C patients, using the dataset from the
machine learning repository of the University of
California.
Ho et al. (2007) worked with the same dataset
that is used in this paper to solve the first and
second challenges given by the data provider. They
tried to find the change patterns of the test results
provided in the dataset. The authors then tried to
find the temporal relations between these change
patterns.
Different techniques were used by Aubrecht and
Kejkula (2005) to address the above-mentioned
challenge given by Chiba University Hospital. The
authors searched for temporal patterns between
Hepatitis B and C using a trend characterization
technique.
Vatham and Osmani (2005) made an effort to
classify the patients according to hepatitis type, i.e.,
B or C. After the classification, the authors used the
processed data to find the temporal patterns between
Hepatitis B and C. They used 3-fold
cross-validation to measure the accuracy of their
methodology. The system they developed correctly
classified samples of Class B and Class C around
57% and 61% of the time, respectively.
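As a rough illustration of how 3-fold cross-validation with per-class accuracy can be computed, the following plain-Python sketch may help. The toy samples and labels, the `majority_classifier` stand-in and all function names are assumptions made for this example only; they are not the method or data of Vatham and Osmani (2005).

```python
from collections import Counter


def k_fold_indices(n_samples, k=3):
    """Split indices 0..n_samples-1 into k contiguous, near-equal folds."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def majority_classifier(train_labels):
    """Hypothetical stand-in model: always predicts the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _sample: most_common


def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of the samples of each class that were predicted correctly."""
    totals, correct = Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / totals[label] for label in totals}


def cross_validate(samples, labels, k=3):
    """Collect out-of-fold predictions, then score them per class."""
    predictions = [None] * len(samples)
    for test_idx in k_fold_indices(len(samples), k):
        held_out = set(test_idx)
        train_labels = [labels[i] for i in range(len(labels)) if i not in held_out]
        model = majority_classifier(train_labels)
        for i in test_idx:
            predictions[i] = model(samples[i])
    return per_class_accuracy(labels, predictions)


# Toy data (invented): six patients labelled with hepatitis type B or C.
toy_samples = [[0.1], [0.4], [0.3], [0.9], [0.8], [0.7]]
toy_labels = ["B", "C", "B", "C", "B", "C"]
result = cross_validate(toy_samples, toy_labels, k=3)
print(result)
```

Reporting accuracy per class, rather than a single overall figure, makes it visible when a classifier does well on one hepatitis type and poorly on the other.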
Multi-relational association rules were used by
Pizzi et al. (2005). An algorithm named Connection
was used to infer the degree of liver fibrosis. The
authors examined the blood and urine tests along
with the biopsy results to find patterns that may
establish a correlation between the exam results and
the degree of fibrosis. They used the support and
confidence values to rate the rules and divided the
selected tests into three groups.
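For readers unfamiliar with the two measures, support and confidence of an association rule can be sketched as follows. The records, the item names such as `GPT=high` and the function names are invented for illustration; this is not the Connection algorithm, only the standard definitions of the two measures.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        return 0.0  # the rule never fires, so report zero confidence
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / antecedent_support


# Invented records: each set lists discretized findings for one patient.
records = [
    {"GPT=high", "fibrosis=F3"},
    {"GPT=high", "fibrosis=F3"},
    {"GPT=high", "fibrosis=F1"},
    {"GPT=normal", "fibrosis=F1"},
]

# Rule under test: GPT=high -> fibrosis=F3
rule_support = support(records, {"GPT=high", "fibrosis=F3"})
rule_confidence = confidence(records, {"GPT=high"}, {"fibrosis=F3"})
print(rule_support, rule_confidence)
```

Rules can then be ranked by these two values, for example by keeping only those above chosen minimum thresholds.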
Karthikeyan and Thangaraju (2013) analyzed the
hepatitis patients from the dataset provided in the
UC Irvine Machine Learning Repository. They used
an open source tool named WEKA and applied
different algorithms and data processing techniques.
They applied naive Bayes, J48, trees, random forest
and multilayer perceptron to the dataset and found
that naive Bayes performed better than the other
classifiers in terms of both time and accuracy. They
achieved an accuracy of around 84% with the naive
Bayes classifier.
The same dataset as used in this paper was
analyzed by Geamsakul et al. (2007). They used a
graph-based induction method for the classification
of hepatitis type. The algorithm constructed a
decision tree for graph-structured data while
simultaneously constructing the attributes for
classification. They classified both the hepatitis type
and the stage of fibrosis, for which they constructed
a total of 262 graphs. The authors achieved
an average accuracy of 79.6% for the classification
of hepatitis type.
3 DATA PREPROCESSING
Data pre-processing is usually the first step in any
work involving data mining. The dataset, as
mentioned before, contains data with different
patterns and needs pre-processing before data
mining techniques can be applied to it. As mentioned
before, the data consists of 7 tables, of which 5
have been used in this paper. The tables
include patients’ data, with an id for
reference, gender and date of birth. Most of these
patients underwent a liver biopsy, which is
recorded in another table. It is worth mentioning
here that not all the patients underwent a
liver biopsy and that the date of the biopsy differs
from patient to patient. The liver biopsy also reports
the fibrosis and the activity of the virus inside the
body. The in-house examinations of patients contain
the results of different medical examinations taken at
different times over a span of 20 years. The set
of examinations taken is not the same for all
patients at all times, so there are missing values in
this table.
Data mining provides different techniques to
handle missing values, such as filling them in with
a global constant, the mean, or interpolation.
However, because this dataset is sensitive in nature,
the missing values were not filled in using
off-the-shelf techniques. Instead, the missing
values in the dataset are simply ignored in this work.
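The difference between ignoring missing values and filling them in can be sketched in a few lines of Python. The measurement values and the function names below are assumptions made for illustration; they do not come from the dataset or from the implementation used in this work.

```python
from statistics import mean

# Invented example: one patient's results for a single test over time,
# where None marks an examination that was not taken.
measurements = [41.0, None, 55.0, None, 48.0]


def ignore_missing(values):
    """Keep only the observed values (the strategy described in the text)."""
    return [v for v in values if v is not None]


def fill_with_constant(values, constant=0.0):
    """Replace each missing value with a fixed global constant."""
    return [constant if v is None else v for v in values]


def fill_with_mean(values):
    """Replace each missing value with the mean of the observed values."""
    fill = mean(ignore_missing(values))
    return [fill if v is None else v for v in values]


print(ignore_missing(measurements))
print(fill_with_constant(measurements, constant=-1.0))
print(fill_with_mean(measurements))
```

Filling introduces values that were never measured, which is one reason to prefer ignoring them when the data is clinically sensitive.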
ICEIS 2013 - 15th International Conference on Enterprise Information Systems