tmp: This attribute corresponds to the temperature. Temperature is important for the development of this disease when its values are between 20°C and 25°C.
hmdt: This attribute corresponds to the humidity. Humidity is important for the development of this disease when its values are higher than 92%.
rn: This attribute corresponds to the precipitation. Precipitation is responsible for the spread of the disease.
RSF: This attribute corresponds to rounded spots on fruits, one of the main symptoms of this disease.
WF: This attribute corresponds to the wrinkled fruits symptom.
diss: This attribute corresponds to the possibility of the disease occurring, based on the previous attributes.
The 10-fold cross-validation test mode was used, which means that in each fold 90% of the data is used for training and 10% for testing.
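The fold arithmetic described above can be sketched in plain Python. This is only an illustration of contiguous, unshuffled folds; Weka's own cross-validation mode typically shuffles and stratifies internally, so this is an assumption-laden simplification, not the tool's implementation.

```python
# Minimal sketch of 10-fold cross-validation index splitting.
# Assumes contiguous, unshuffled folds purely for illustration.
def ten_fold_splits(n_instances, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    fold_size = n_instances // k
    for fold in range(k):
        test = indices[fold * fold_size:(fold + 1) * fold_size]
        train = indices[:fold * fold_size] + indices[(fold + 1) * fold_size:]
        yield train, test

# With 4200 instances, each fold tests on 420 (10%) and trains on 3780 (90%).
for train, test in ten_fold_splits(4200):
    assert len(test) == 420 and len(train) == 3780
```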
3.1 Evaluation of Classification
Algorithm using Weka
In this paper, we chose Weka because it is a very sophisticated tool, used in many different applications, that includes visualization and algorithms for data analysis and predictive modelling.
We conducted a comparison study between algorithms provided by Weka, corresponding to different classification categories: for decision trees, Random Forest was chosen; for lazy classifiers, K-Nearest Neighbors was chosen, whose implementation in Weka is named IBk; for Bayes classifiers, Naïve Bayes was chosen; and, for function classifiers, Sequential Minimal Optimization (SMO) was chosen.
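As an illustration only, the four classifier categories have close scikit-learn analogues. The sketch below is an assumption on our part, not the paper's setup: the experiments reported here use Weka's own implementations (e.g. IBk and SMO), and the dataset below is synthetic stand-in data.

```python
# Hypothetical scikit-learn analogues of the four Weka classifier categories.
from sklearn.ensemble import RandomForestClassifier   # decision trees
from sklearn.neighbors import KNeighborsClassifier    # lazy classifier (Weka: IBk)
from sklearn.naive_bayes import GaussianNB            # Bayes classifier
from sklearn.svm import SVC                           # function classifier (SMO-trained SVM)
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Synthetic stand-in data; the vineyard dataset itself is not reproduced here.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

classifiers = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "SMO (SVM)": SVC(),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```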
We evaluate the performance of the classification algorithms using a confusion matrix. A confusion matrix can be represented by a table that summarizes the classification performance of a classifier with respect to some test data (Shultz and Fahlman, 2017). The entries of the confusion matrix are:
True positives (TP): we predicted “disease” and the instance does have the disease.
True negatives (TN): we predicted “no disease” and the instance does not have the disease.
False positives (FP): we predicted “disease” but the instance does not actually have the disease.
False negatives (FN): we predicted “no disease” but the instance actually does have the disease.
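The four counts above can be tallied directly from lists of predicted and actual labels. The labels in this sketch are invented for illustration; they are not data from the paper.

```python
# Tally TP/TN/FP/FN from predicted vs. actual labels,
# treating "disease" as the positive class.
def confusion_counts(actual, predicted, positive="disease"):
    tp = sum(1 for a, p in zip(actual, predicted) if p == positive and a == positive)
    tn = sum(1 for a, p in zip(actual, predicted) if p != positive and a != positive)
    fp = sum(1 for a, p in zip(actual, predicted) if p == positive and a != positive)
    fn = sum(1 for a, p in zip(actual, predicted) if p != positive and a == positive)
    return tp, tn, fp, fn

# Invented toy labels: one of each outcome.
actual    = ["disease", "no disease", "disease",    "no disease"]
predicted = ["disease", "no disease", "no disease", "disease"]
print(confusion_counts(actual, predicted))  # (1, 1, 1, 1)
```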
We calculate the values of precision and recall. Precision is the number of True Positives divided by the sum of True Positives and False Positives; basically, it is the fraction of positive predictions that are actually correct. Recall is the number of True Positives divided by the sum of True Positives and False Negatives; basically, it is the fraction of the positive class values in the test data that are correctly predicted.
The computation of precision and recall values is
as follows:
precision = TP / (TP + FP) (1)
recall = TP / (TP + FN) (2)
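Equations (1) and (2) translate directly to code. The counts used below are illustrative placeholders, not values from the paper's datasets.

```python
# Equation (1): precision = TP / (TP + FP)
def precision(tp, fp):
    return tp / (tp + fp)

# Equation (2): recall = TP / (TP + FN)
def recall(tp, fn):
    return tp / (tp + fn)

# Placeholder counts for illustration only.
tp, fp, fn = 90, 10, 30
print(precision(tp, fp))  # 0.9
print(recall(tp, fn))     # 0.75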
There are two possible predicted classes: “disease” and “no disease”. On the first dataset the classifier made a total of 4200 predictions; on the second dataset it made a total of 2800 predictions.
3.1.1 Random Forest
When applying the Random Forest algorithm to DataSet 1, in these 4200 cases, the classifier predicted “disease” 1900 times and “no disease” 2300 times. In reality, 1900 instances in the sample have the disease and 2300 do not. So, precision=1 and recall=1 for “disease”. This means that when “disease” was predicted, the system was in fact correct 100% of the time (precision), and that of all the instances that should have been predicted as “disease”, 100% were correctly predicted (recall).
For “no disease”, precision=1 and recall=1 as well: of the times “no disease” was predicted, the system was in fact correct 100% of the time, and of all the instances that should have been predicted as “no disease”, 100% were correctly predicted.
The results of applying the Random Forest algorithm to DataSet 1 are shown in Table 1.
Table 1: Confusion matrix of applying the Random Forest algorithm to Plasmopara viticola.

                        True 1 (disease)   True 2 (no disease)   Class Precision
Pred. 1 (disease)             1900                   0                100%
Pred. 2 (no disease)             0                2300                100%
Recall                        100%                100%
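As a check, Table 1's perfect scores for the “disease” class follow from its counts, where rows are predictions and columns are true classes. This is only a verification of the reported numbers, not part of the authors' pipeline.

```python
# Table 1's counts (rows = predicted class, columns = true class).
matrix = {
    ("pred_disease",    "true_disease"):    1900,
    ("pred_disease",    "true_no_disease"): 0,
    ("pred_no_disease", "true_disease"):    0,
    ("pred_no_disease", "true_no_disease"): 2300,
}
tp = matrix[("pred_disease",    "true_disease")]
fp = matrix[("pred_disease",    "true_no_disease")]
fn = matrix[("pred_no_disease", "true_disease")]
print(tp / (tp + fp))  # 1.0 -> precision for "disease"
print(tp / (tp + fn))  # 1.0 -> recall for "disease"
```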
When applying the Random Forest algorithm to DataSet 2, in these 2800 cases, the classifier predicted “disease” 330 times and “no disease” 2470 times. In reality, 329 instances in the sample have the disease and 2471 do not. So, precision=0.991 and recall=0.994 for “disease”. This means that, out of the times “disease” was predicted, 99.1% of the time the