COMPARISON OF DIFFERENT CLASSIFICATION TECHNIQUES ON PIMA INDIAN DIABETES DATA

Farhana Afroz and Rashedur M. Rahman

Department of Electrical Engineering and Computer Science, North South University, Bashundhara, Dhaka, Bangladesh

Keywords: Classification, Neural network, Decision tree, Rule based classifier, Fuzzy lattice, Fuzzy inference system, ANFIS.

Abstract: Data-mining techniques such as classification and clustering have been applied to large-scale data. In this research, we present a comparative study of different classification techniques using three data mining tools: WEKA, TANAGRA and MATLAB. The aim of this paper is to analyze the performance of different classification techniques on a large data set. The classifiers tested are Multilayer Perceptron, Bayes Network, J48graft (C4.5), Fuzzy Lattice Reasoning (FLR), NaiveBayes, JRip (RIPPER), Fuzzy Inference System (FIS) and Adaptive Neuro-Fuzzy Inference System (ANFIS). A fundamental review of the selected techniques is presented for introduction purposes. The diabetes data, with 768 instances and 9 attributes (8 inputs and 1 output), is used to test and justify the differences between the classification methods. Subsequently, the classification technique that has the potential to significantly improve on the common or conventional methods is suggested for use in large-scale data, bioinformatics or other general applications.

1 INTRODUCTION

The aim of this study is to investigate the performance of different classification methods using WEKA, TANAGRA and MATLAB on the PIMA Indian Diabetes Dataset (PIDD). A major problem in bioinformatics analysis and medical science is attaining the correct diagnosis from certain important information. Reaching the ultimate diagnosis generally involves a large number of tests, which in turn involve the clustering or classification of large-scale data. All of these test procedures are said to be necessary in order to reach the final diagnosis. On the other hand, a huge number of tests can complicate the main diagnosis process and make the end results difficult to obtain. This kind of difficulty could be resolved with the aid of machine learning, which can obtain the end result using several artificial intelligence algorithms that act as classifiers. Machine learning covers such a broad range of processes that it is difficult to define precisely. A dictionary definition includes phrases such as "to gain knowledge, or understanding of, or skill in, by study, instruction, or experience" and "modification of a behavioural tendency by experience"; zoologists and psychologists study learning in animals and humans (Nilsson, 2011). The extraction of important information from a large pile of data, together with its correlations, is often the advantage of using machine learning. New knowledge about tasks is constantly being discovered by humans, and vocabulary changes. There is a constant stream of new events in the world, and continually redesigning artificial intelligence systems to conform to new knowledge is impractical, but machine learning methods might be able to track much of it (Han and Kamber, 2000).

There is a substantial amount of research with machine learning algorithms such as Bayes network, Multilayer Perceptron, decision trees with pruning such as J48graft and C4.5, rule learners such as FLR and JRip, the Fuzzy Inference System and the Adaptive Neuro-Fuzzy Inference System.

2 DATA SET DESCRIPTION

The characteristics of the data set used in this research are summarized in Table 1. A detailed description of the data set is available at the UCI repository (UCI, 2011).


Afroz F. and M. Rahman R..

COMPARISON OF DIFFERENT CLASSIFICATION TECHNIQUES ON PIMA INDIAN DIABETES DATA.

DOI: 10.5220/0003496803650368

In Proceedings of the 13th International Conference on Enterprise Information Systems (ICEIS-2011), pages 365-368

ISBN: 978-989-8425-53-9

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

Table 1: Characteristics of PIMA Indian Dataset.

| Data Set | No. of Examples | Input Attributes | Output Classes | Number of Attributes |
|---|---|---|---|---|
| Pima Indian Diabetes | 768 | 8 | 2 | 9 |

The objective of this data set is the diagnosis of diabetes in Pima Indians. Based on personal data, such as age and number of times pregnant, and the results of medical examinations, e.g., blood pressure, body mass index and the result of a glucose tolerance test, the task is to decide whether a Pima Indian individual is diabetes positive or not. The attributes are given below:

1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)
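
The attribute list above can be turned into a small record parser. The following is a minimal sketch, assuming the data is available as comma-separated rows in the listed attribute order; the short column names follow the abbreviations in the header of Table 4, and the function name `parse_pidd` and the two sample rows are illustrative assumptions, not part of the original study.

```python
import csv
import io

# Short names follow the header of Table 4; "class" is the output variable.
COLUMNS = ["preg", "plas", "bp", "skin", "insl", "bmi", "dpf", "age", "class"]

def parse_pidd(text):
    """Parse comma-separated rows into (inputs, label) pairs."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        values = [float(v) for v in row]
        if len(values) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
        inputs = dict(zip(COLUMNS[:-1], values[:-1]))  # 8 input attributes
        label = int(values[-1])                        # class variable, 0 or 1
        records.append((inputs, label))
    return records

# Two sample rows, included only to illustrate the expected format.
SAMPLE = """6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
"""

records = parse_pidd(SAMPLE)
for inputs, label in records:
    print(label, inputs["plas"], inputs["bmi"])
```

Keeping the inputs in a dictionary keyed by attribute name makes later rule checks and feature selection easier to read than positional indexing.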

3 METHODOLOGY

In this research we deploy various classification

techniques. Those techniques are described briefly

below:

3.1 Multilayer Perceptron (MLP)

The architecture used for the MLP (Werbos, 1974) during simulations with the PIDD consisted of a three-layer feed-forward neural network: one input, one hidden and one output layer. The selected parameters for the model are: learningRate = 0.3/0.15; momentum = 0.2; randomSeed = 0; validationThreshold = 20; number of epochs = 500.
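
To make the roles of learningRate and momentum concrete, here is a minimal sketch of a forward pass through a three-layer network and one weight update with momentum. It assumes sigmoid units, a squared-error loss, no bias terms, and the standard momentum rule (new step = -learningRate * gradient + momentum * previous step); the weights, input and function names are hypothetical, and this is not a reproduction of the WEKA implementation.

```python
import math

LEARNING_RATE = 0.3  # learningRate from the parameter list above
MOMENTUM = 0.2       # momentum from the parameter list above

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Input layer -> one hidden layer -> one sigmoid output unit (biases omitted)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output

def output_layer_grads(hidden, output, target):
    """Gradient of 0.5 * (output - target)^2 with respect to the output weights."""
    delta = (output - target) * output * (1.0 - output)
    return [delta * h for h in hidden]

def momentum_step(weights, grads, velocity):
    """Apply step(t) = -lr * grad + momentum * step(t-1) to each weight."""
    new_velocity = [-LEARNING_RATE * g + MOMENTUM * v for g, v in zip(grads, velocity)]
    new_weights = [w + d for w, d in zip(weights, new_velocity)]
    return new_weights, new_velocity

# Hypothetical two-feature input and hypothetical initial weights.
x = [1.0, 0.5]
w_hidden = [[0.1, -0.2], [0.3, 0.4]]
w_out = [0.2, -0.1]
target = 1.0

hidden, before = forward(x, w_hidden, w_out)
grads = output_layer_grads(hidden, before, target)
w_out, velocity = momentum_step(w_out, grads, [0.0, 0.0])
_, after = forward(x, w_hidden, w_out)
print(before, after)  # the output moves toward the target after one update
```

Only the output-layer weights are updated here to keep the sketch short; full backpropagation would propagate the error term to the hidden-layer weights as well.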

3.2 BayesNet

BayesNet (John and Langley, 1995) learns Bayesian networks under the following presumptions: attributes are nominal (numeric ones are discretized beforehand) and there are no missing values (any such values are replaced globally). Learning has two parts: searching for the network structure and estimating the conditional probability tables of the network. In this study we run BayesNet with the SimpleEstimator and the K2 search algorithm, without using ADTree.

3.3 NaiveBayes

The NaiveBayes (John and Langley, 1995) classifier provides a simple approach, with clear semantics, to representing and learning probabilistic knowledge. It is termed naïve because it relies on two important simplifying assumptions: it assumes that the predictive attributes are conditionally independent given the class, and it posits that no hidden or latent attributes influence the prediction process.
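
The conditional independence assumption means the class score is a product of per-attribute likelihoods, or a sum in log space. The sketch below illustrates this, assuming a Gaussian likelihood for each numeric attribute; the toy data and function names are hypothetical and are not taken from the PIDD or from WEKA.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-attribute mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9  # small floor avoids a zero variance
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def log_gaussian(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(model, row):
    """Choose the class maximizing log prior + sum of per-attribute log likelihoods."""
    def score(label):
        prior, means, variances = model[label]
        return math.log(prior) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(row, means, variances)
        )
    return max(model, key=score)

# Hypothetical toy data: [glucose, bmi] with a 0/1 class label.
X = [[90.0, 22.0], [100.0, 25.0], [95.0, 24.0],
     [160.0, 35.0], [170.0, 38.0], [150.0, 33.0]]
y = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X, y)
print(predict(model, [92.0, 23.0]), predict(model, [165.0, 36.0]))
```

Because each attribute contributes an independent term to the score, training reduces to computing a handful of means and variances per class.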

3.4 J48graft (C4.5 Decision Tree Revision 8)

The C4.5 algorithm, developed by Quinlan (Quinlan, 1993), is perhaps the most popular tree classifier. The WEKA classifier package has its own version of C4.5, known as J48 or J48graft. For this study, the C4.5 classifier is used on the TANAGRA platform and the J48graft classifier on the WEKA platform. J48graft is an optimized implementation of C4.5 rev. 8. J48graft is evaluated in this study with the parameters: confidenceFactor = 0.25; minNumObj = 2; subtreeRaising = True; unpruned = False. C4.5 is evaluated with the parameters: Min size of leaves = 5; Confidence-level for pessimistic = 0.25.
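
C4.5 chooses splits using the gain ratio criterion, that is, information gain divided by the split information. The following is a minimal sketch of that computation for a binary threshold split on one numeric attribute; the data values and function names are hypothetical, and pruning (controlled by the confidence parameters above) is not covered.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Gain ratio of splitting into value <= threshold and value > threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty carries no information
    n = len(labels)
    w_left, w_right = len(left) / n, len(right) / n
    info_gain = entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))
    split_info = -(w_left * math.log2(w_left) + w_right * math.log2(w_right))
    return info_gain / split_info

# Hypothetical glucose values with class labels.
values = [90, 100, 150, 160]
labels = [0, 0, 1, 1]

for threshold in (90, 100, 150):
    print(threshold, gain_ratio(values, labels, threshold))
```

A tree learner would evaluate candidate thresholds for every attribute in this way and split on the one with the highest gain ratio.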

3.5 Fuzzy Lattice Reasoning (FLR) Classifier

The Fuzzy Lattice Reasoning (FLR) classifier is presented for inducing descriptive, decision-making knowledge (rules) in a mathematical lattice data domain including the space R^N. Tunable generalization is possible based on non-linear (sigmoid) positive valuation functions; moreover, the FLR classifier can deal with missing data. Learning is carried out both incrementally and fast by computing disjunctions of join-lattice interval conjunctions, where a join-lattice interval conjunction corresponds to a hyperbox in R^N. In this study we evaluated the FLR classifier in WEKA with the parameters: Rhoa = 0.5; Number of Rules = 2.

3.6 JRip (RIPPER)

Repeated Incremental Pruning to Produce Error Reduction (RIPPER) (Witten and Frank, 2005) is one of the basic and most popular algorithms. Classes are examined in order of increasing size, and an initial set of rules for each class is generated using incremental reduced-error pruning. In this study we evaluated RIPPER through JRip, an implementation of RIPPER in WEKA, with the parameters: folds = 10; minNo = 2; optimizations = 2; seed = 1; usePruning = true.

3.7 Fuzzy Inference System (FIS)

Fuzzy Inference Systems (FISs) are a technology developed for granular rule induction and generalization based on fuzzy logic. Note that since a data cluster can be interpreted as a (fuzzy) granule, data clustering may be closely related to fuzzy rule induction. Neural implementations have given conventional FISs a capacity for parallel implementation.

3.8 Adaptive Neuro-Fuzzy Inference Systems (ANFIS)

This work uses ANFIS (Adaptive Neuro-Fuzzy Inference Systems), a fuzzy classifier that is part of the MATLAB Fuzzy Logic Toolbox (FLT, 2011). ANFIS is a fuzzy inference system implemented under the framework of adaptive networks (Jyh and Roger, 1993).

4 RESULT ANALYSIS

In this study, we examine the performance of different classification methods using accuracy and error estimates for each classifier. The highest accuracy, 81.33%, belongs to J48graft and the lowest, 51.43%, belongs to FLR. Based on Figure 1 and Table 3, we can compare various error metrics among the different classifiers in WEKA. We find that J48graft is best and Bayes Net second best; MLP and JRip are moderate, while the performance of FLR is questionable.

Figure 1: Error comparison for WEKA.

An algorithm with a lower error rate is preferred, as it has a more powerful classification capability. The total time required to build the model is also a crucial parameter in comparing classification algorithms. In this experiment, the FLR classifier requires the shortest model building time, around 0.025 seconds. Bayes network is second with 0.04 seconds, and J48graft takes 0.135 seconds. The MLP algorithm requires the longest model building time, around 63.13 seconds.

The Kappa statistic is used to assess the accuracy of a measurement; in such cases it is usual to distinguish between the reliability of the data collected and their validity (Kappa, 2011). The Kappa scores of the selected algorithms range from about 0.01 to 0.59. Based on the Kappa statistic criteria, the accuracy of this classification is substantial. According to the best Kappa statistic, the J48graft classifier is the best among the classifiers tested.
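
For reference, Cohen's Kappa compares the observed agreement p_o between predicted and actual classes with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it from a confusion matrix; the matrix values and the function name are hypothetical and do not come from the experiments reported here.

```python
def cohen_kappa(confusion):
    """Cohen's Kappa from a square confusion matrix.

    confusion[i][j] is the number of instances of actual class i
    that were predicted as class j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(confusion[i][i] for i in range(k)) / total
    # Chance agreement: sum over classes of (actual share * predicted share).
    expected = sum(
        (sum(confusion[i]) / total) * (sum(row[i] for row in confusion) / total)
        for i in range(k)
    )
    return (observed - expected) / (1 - expected)

# Hypothetical confusion matrix for a two-class problem.
example = [[40, 10],
           [10, 40]]
print(cohen_kappa(example))
```

A value of 1 indicates perfect agreement, while a value near 0 indicates agreement no better than chance.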

Rule accuracy is 71.51% for FIS and 78.79% for ANFIS, obtained with different networks and architectures; this is shown in Table 2. IF-THEN rules are used for the adaptive classifiers. We use 7 IF-THEN fuzzy rules, with the Mamdani operator for FIS and Sugeno operators for the ANFIS membership functions. The rules are presented in Table 4.
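
To illustrate how IF-THEN rules of this kind can be read, the sketch below encodes rules 2 and 3 of Table 4 as crisp threshold tests. This is a deliberate simplification: it assumes that NDF means the attribute is not constrained, it replaces fuzzy membership functions with hard comparisons, and it resolves conflicts by taking the largest class weight among the rules that fire. None of these choices is stated in the paper, and the patient records are hypothetical.

```python
import operator

OPS = {"<=": operator.le, ">": operator.gt}

# Rules 2 and 3 transcribed from Table 4; NDF attributes are left out (assumption).
RULES = [
    {"conditions": [("preg", "<=", 3), ("skin", "<=", 35), ("insl", ">", 156),
                    ("bmi", "<=", 35.3), ("dpf", "<=", 0.787)],
     "weights": {0: 0.5, 1: 0.928}},
    {"conditions": [("dpf", "<=", 0.179), ("age", "<=", 34)],
     "weights": {0: 0.955, 1: 0.5}},
]

def rule_fires(rule, patient):
    """A rule fires when every one of its crisp conditions holds."""
    return all(OPS[op](patient[attr], bound) for attr, op, bound in rule["conditions"])

def classify(patient, rules=RULES):
    """Return the class with the largest weight among firing rules, or None."""
    best_class, best_weight = None, 0.0
    for rule in rules:
        if rule_fires(rule, patient):
            for cls, weight in rule["weights"].items():
                if weight > best_weight:
                    best_class, best_weight = cls, weight
    return best_class

# Hypothetical patient records.
patient_a = {"preg": 2, "skin": 30, "insl": 200, "bmi": 30.0, "dpf": 0.5, "age": 40}
patient_b = {"preg": 1, "skin": 20, "insl": 100, "bmi": 25.0, "dpf": 0.1, "age": 25}

print(classify(patient_a), classify(patient_b))
```

A real Mamdani or Sugeno system would instead compute a degree of membership for each condition and aggregate rule outputs, so this sketch only conveys the structure of the rules.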

We also measure our performance with True

Positive Rate (TPR), False Positive Rate (FPR),

Precision, Recall, F-measure and area under ROC

curve. Those results are shown in Table 3.
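
The rate-based measures above follow directly from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the counts and the function name are hypothetical, and the area under the ROC curve is omitted because it requires ranked classifier scores rather than a single confusion matrix.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard metrics for the positive class from confusion-matrix counts."""
    tpr = tp / (tp + fn)        # true positive rate, identical to recall
    fpr = fp / (fp + tn)        # false positive rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)  # harmonic mean
    return {
        "tp_rate": tpr,
        "fp_rate": fpr,
        "precision": precision,
        "recall": tpr,
        "f_measure": f_measure,
    }

# Hypothetical counts for a two-class problem.
metrics = binary_metrics(tp=40, fp=10, tn=40, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Since recall is the true positive rate under another name, the TP Rate and Recall columns of Table 3 carry the same values.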

Table 2: Performance measuring in rule based fuzzy approach using MATLAB.

| Learning systems | Training/test epochs | Avg. Error after training/test | No. of Extracted Rules | Accuracy (%) |
|---|---|---|---|---|
| FIS | 500 | 7.6358 | 7 | 71.51 |
| ANFIS | 500 | 7.6358 | 7 | 78.79 |

5 CONCLUSIONS

We use WEKA, TANAGRA and MATLAB to carry out an extensive performance comparison among the most popular classifier algorithms. In the absence of medical diagnostic evidence, it is difficult for experts to give an opinion on the grade of disease with confidence. There is a need to undertake medical diagnostic studies to construct more realistic fuzzy numbers for characterizing the imprecision and thereby fuzzily describing the nature of the patient's disease. First, the misclassification cost is not considered explicitly here. In future, cost-sensitive learning might make the study more practical and valuable. Second, this study used only 7 rules for FIS and ANFIS; increasing the number of rules might yield a more accurate diagnosis result.

Table 3: Different performance metrics on the training and test data sets using WEKA.

| Classifier | Phase | TP Rate | FP Rate | Precision | Recall | F-measure | ROC Area |
|---|---|---|---|---|---|---|---|
| MLP | Training | 0.806 | 0.191 | 0.819 | 0.806 | 0.809 | 0.872 |
| MLP | Testing | 0.778 | 0.306 | 0.774 | 0.778 | 0.776 | 0.813 |
| Bayes Net | Training | 0.783 | 0.26 | 0.783 | 0.783 | 0.783 | 0.851 |
| Bayes Net | Testing | 0.797 | 0.253 | 0.799 | 0.797 | 0.798 | 0.848 |
| J48graft | Training | 0.841 | 0.241 | 0.842 | 0.841 | 0.836 | 0.888 |
| J48graft | Testing | 0.785 | 0.189 | 0.816 | 0.785 | 0.792 | 0.803 |
| JRip | Training | 0.794 | 0.257 | 0.792 | 0.794 | 0.793 | 0.785 |
| JRip | Testing | 0.824 | 0.294 | 0.821 | 0.824 | 0.816 | 0.766 |
| FLR | Training | 0.358 | 0.344 | 0.774 | 0.358 | 0.2 | 0.507 |
| FLR | Testing | 0.67 | 0.662 | 0.582 | 0.67 | 0.572 | 0.504 |

Table 4: Sample rules framed for the proposed FIS and ANFIS. The columns preg. to age form the IF part; the class weights form the THEN part.

| Rule No. | preg. | plas | bp | skin | insl | bmi | dpf | age | Class 0 (Weight) | Class 1 (Weight) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | <=103 | >40 | <=26 | <=156 | <=35.3 | <=0.179 | <=34 | 0.955 | 0.5 |
| 2 | <=3 | NDF | NDF | <=35 | >156 | <=35.3 | <=0.787 | NDF | 0.5 | 0.928 |
| 3 | NDF | NDF | NDF | NDF | NDF | NDF | <=0.179 | <=34 | 0.955 | 0.5 |
| 4 | NDF | <=103 | NDF | NDF | NDF | NDF | <=0.787 | NDF | 0.944 | 0.5 |
| 5 | NDF | NDF | NDF | NDF | <=156 | <=35.3 | NDF | >34 or <=37 | 0.912 | 0.5 |
| 6 | NDF | >135 | NDF | NDF | <=185 | >33.7 | <=1.096 | >37 | 0.5 | 0.928 |
| 7 | 6 | >103 | NDF | NDF | NDF | >35.3 | <=1.096 | >34 | 0.5 | 0.909 |

REFERENCES

FLT, 2011. The MathWorks - Fuzzy Logic Toolbox, from http://www.mathworks.ch/access/helpdesk_r13/help/toolbox/fuzzy/fuzzy.html

Han, J., Kamber, M., 2000. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers.

John, G. H., Langley, P., 1995. Estimating Continuous Distributions in Bayesian Classifiers. In: Proc. of the 11th Conf. on Uncertainty in Artificial Intelligence.

Jyh, S., Roger, J., 1993. ANFIS: Adaptive-network-based fuzzy inference system, IEEE Transactions on Systems, Man, and Cybernetics, vol. 23, pp. 665–685.

Kappa Statistic, 2011. Link: http://www.dmi.columbia.edu/homepages/chuangj/kappa

Nilsson, N. J., 2011. Introduction to Machine Learning, http://ai.stanford.edu/~nilsson/mlbook.html

Quinlan, J., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo.

UCI machine learning repository, 2011. Link: http://www.ics.uci.edu/mlearn/MLRepository.html

Werbos, P., 1974. Beyond Regression: New Tools for Prediction and Analysis in the Behavioural Sciences, PhD Thesis, Harvard University.

Witten, I. H., Frank, E., 2005. Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco, USA.
