False Positive 6
False Negative 9
True Negative 14
From the calculation above, the resulting values are
Precision = 0.65, Recall = 0.55, and
Accuracy = 63%.
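These figures follow from the standard confusion-matrix definitions, which can be checked with a minimal sketch. The True Positive count does not appear in the rows above, so the value of 11 used here is an assumption, inferred from a 40-item test set (40 - 6 - 9 - 14). The function name `confusion_metrics` is ours, for illustration only.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp)                  # share of predicted hoaxes that are hoaxes
    recall = tp / (tp + fn)                     # share of actual hoaxes that were detected
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all items classified correctly
    return precision, recall, accuracy


# tp=11 is an assumption: 40 test items minus the three reported counts.
precision, recall, accuracy = confusion_metrics(tp=11, fp=6, fn=9, tn=14)
print(f"Precision={precision:.2f} Recall={recall:.2f} Accuracy={accuracy:.3f}")
```

Under that assumption the accuracy comes out as 0.625, which the text rounds to 63%.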
4.2 Discussion
Based on the results of the testing scenario, a limit of
0.0014 gave the highest Precision and Accuracy, both
equal to 70%. Although this result is adequate, we
think that there is still considerable room for
improvement.
We think that certain conditions greatly influenced
the result discussed above. The size of the dataset,
which consisted of 100 samples of fake news, is one of
the factors that might have contributed to the result.
We believe that the result would be much better with
more than 100 samples, ideally at least 500. Collecting
a dataset of that size is also a notable challenge in
building a better system.
Another factor that might significantly contribute
to the result is the range of topics in the dataset.
Because the topics are not restricted, the words
stored in the dictionary are dispersed. This happens
because the hoax news library stores news on many
different topics but does not store news that is
circulated continuously. As a result, when a news item
is compared with stored news on a different topic, the
match is weaker than it would be for news that is
repeatedly used as a hot topic to spread lies. This
also strongly affects the calculation of word weights.
The weight of each word then becomes less
significant and is of little use in the next process. In
the Tf-Idf computation, the number of occurrences of
words in the stored documents is extremely influential.
The number of distinct words, their occurrences, and
the number of documents used all greatly affect the
calculated weight of the word being compared. Hence,
limiting the topics may significantly help the
computation of Tf-Idf for each word in the system
because the words become sufficiently homogeneous.
For example, political news should be separated from
entertainment or music news, and a separate
computation should be performed for each topic in
order to obtain the best result.
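To illustrate how the stored documents shape a word's weight, here is a minimal sketch of one common Tf-Idf variant (term frequency multiplied by log(N/df)). The exact formula used by the system is not restated in this section, so this variant, the function name `tf_idf`, and the toy documents are all assumptions for illustration only.

```python
import math


def tf_idf(term, doc, corpus):
    """Tf-Idf of `term` in `doc` (a list of words) against `corpus` (a list of docs).

    Assumed variant: tf = count / len(doc), idf = log(N / df).
    """
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)  # number of documents containing the term
    if df == 0:
        return 0.0  # term never seen in the stored documents
    return tf * math.log(len(corpus) / df)


# Hypothetical toy data: a politics-only collection and a mixed-topic one.
politics = [["election", "fraud", "ballot"], ["election", "minister", "bribe"]]
mixed = politics + [["singer", "concert", "tour"], ["film", "actor", "award"]]

doc = politics[0]
print(tf_idf("fraud", doc, politics))  # weight within the single-topic collection
print(tf_idf("fraud", doc, mixed))     # same word, same document, broader collection
```

The same word in the same document receives a different weight depending only on which other documents are stored, which is why the topic mix of the hoax library matters.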
In the stored hoax news documents, several topics
are frequent targets of hoax news dissemination.
However, the stored news is heavily concentrated on
politics, so documents containing hoax news on topics
that are rarely targeted are few and do not dominate
the collection. Not all news in the test data discusses
political topics, so some false negatives and false
positives occur, and these affect the final values of
Precision, Recall, and Accuracy.
In future work, we hope to collect more fake news,
with each topic separated from the others, to ensure
that the words in the datasets are homogeneous. In
addition, we will try different methods to obtain
better results.
5. CONCLUSION
Based on the research findings on Hoax Detection on
Social Media in Indonesia using the Levenshtein
Distance Method, it can be concluded that:
1. Applying the Levenshtein Distance Method in a
Hoax Detection System involves the following steps:
a. The creation of Target Data Documents, which
contain a simplified collection of hoax words
produced by pre-processing and then selected
by weighting each word using Tf-Idf.
b. The creation of a Hoax Detection System, in
which several processes produce the
classification value: pre-processing the
source word, comparing the source word with
the target word, calculating the distance
(Levenshtein Distance), assigning weights
(Tf-Idf), and calculating the final result with
its classification.
2. The Levenshtein Distance method combined with
Tf-Idf proved able to distinguish hoax from
non-hoax news with a fairly good level of
accuracy.
3. The testing scenario with a limit of 0.0014 used
100 news items indicated as hoaxes as training
data and 40 news items as test data, the latter
divided into 20 non-hoax and 20 hoax news items.
It produced consistent values of Precision 0.7,
Recall 0.7, and Accuracy 70%. This suggests that
the more hoax words are used as training data, the
more accurately the system performs detection.
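The distance calculation named in step 1(b) can be sketched as follows. This is the textbook dynamic-programming form of the Levenshtein Distance, offered as an illustration rather than the system's actual code. The function name `levenshtein` and the example words are assumptions. How the distance is combined with the Tf-Idf weight and compared against the 0.0014 limit is not restated in this section, so it is not shown.

```python
def levenshtein(source, target):
    """Minimum number of single-character insertions, deletions, and substitutions
    needed to turn `source` into `target`."""
    # previous[j] holds the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # distance from source[:i] to the empty string
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


# Hypothetical source and target words.
print(levenshtein("hoax", "hoaks"))
```

Keeping only two rows of the table at a time holds memory proportional to the length of the target word, which is sufficient for word-level comparison.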