2.1.3 Word Skip-Gram
In analyzing word structures, word skip-gram and
word n-gram exhibit similarities, yet differ in their
respective approaches. For example, when
considering the sentence "I bought a watch yesterday
afternoon," a 1-skip-gram analysis would yield the
following features: "I a," "bought watch," "a
yesterday," and "watch afternoon" (Sandaruwan et al.,
2019). This technique can pair essential content words while skipping over structural (function) words. A notable limitation of the 1-skip-gram, however, is that the skipped position may just as easily hold a meaningful, non-structural word, whose contribution to the feature is then lost.
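The 1-skip-gram extraction described above can be sketched in a few lines of Python; the function name and output format are my own illustration of the paper's example, not code from the cited work.

```python
# Sketch of k-skip bigram extraction. For k=1 this reproduces the
# example features from the "I bought a watch yesterday afternoon" sentence.
def skip_bigrams(tokens, k=1):
    """Return bigrams whose two words are separated by exactly k tokens."""
    return [(tokens[i], tokens[i + k + 1])
            for i in range(len(tokens) - k - 1)]

pairs = skip_bigrams("I bought a watch yesterday afternoon".split())
# -> [('I', 'a'), ('bought', 'watch'), ('a', 'yesterday'), ('watch', 'afternoon')]
```

Note that many formulations of the k-skip-n-gram also include the ordinary contiguous bigrams; the sketch above keeps only the pairs listed in the example.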
2.2 Machine Learning-Based Detection
Machine learning refers to algorithms that learn from data to perform tasks without explicit instructions and can generalize to unseen data. In this context, it enables systems to understand natural language and judge whether a statement is potentially harmful.
2.2.1 Traditional Machine Learning
In researching hate speech detection, researchers
commonly turn to supervised learning algorithms in
traditional machine learning. These algorithms learn from datasets that have been annotated by humans, from which different models can be trained into effective hate speech detectors. The literature strongly establishes
the credibility of using machine learning for hate
speech detection (Khan & Qureshi, 2022; Ates et al.,
2021; Khan et al., 2021). In their research, Ates et al. (2021) utilized models such as Naive Bayes, Gaussian Naive Bayes, LDA, QDA, and LGBM to
develop a detection model. However, a notable
drawback of this approach is the requirement for a
sufficiently large corpus to achieve optimal results,
necessitating a considerable investment of human
resources.
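As a minimal sketch of this supervised pipeline, the following toy multinomial Naive Bayes (one of the model families mentioned above) is trained on a tiny hypothetical annotated corpus; real systems use far larger datasets and library implementations, and the example phrases here are purely illustrative.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (token_list, label). Returns Naive Bayes parameters."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)   # per-class word frequencies
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab, len(docs)

def predict(model, tokens):
    """Pick the label maximizing log P(label) + sum log P(word | label),
    with Laplace (add-one) smoothing for unseen words."""
    class_counts, word_counts, vocab, n = model
    best, best_lp = None, float("-inf")
    for label, c in class_counts.items():
        lp = math.log(c / n)
        total = sum(word_counts[label].values())
        for t in tokens:
            lp += math.log((word_counts[label][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Hypothetical human-annotated corpus (the costly part the text describes).
corpus = [("you are awful".split(), "hateful"),
          ("awful hateful people".split(), "hateful"),
          ("have a nice day".split(), "benign"),
          ("what a nice watch".split(), "benign")]
model = train_nb(corpus)
```

The need to hand-label every training sentence, visible even at this toy scale, is exactly the human-resource cost noted above.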
2.2.2 Large Language Model
The mention of large language models often brings ChatGPT to mind. A large language model is a kind of machine learning model aimed at the generation and comprehension of natural language. These models are trained on larger corpora with more parameters, allowing them to grasp the intricacies of natural language from higher dimensions and to mirror human cognition more closely. Compared to traditional machine learning approaches, large language models do not necessitate manually labeled data for pre-training; they simply require an ample amount of text and parameters, thereby diminishing the reliance on human intervention. Recent research
by Garani et al. demonstrates that integrating large
language models into systems for harmful speech
detection can greatly enhance accuracy (Garani et al., 2023). Their study leveraged models
such as ChatGPT and XLM-Roberta to bolster the
precision of harmful speech detection mechanisms.
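A hedged sketch of how such an LLM-based detector might be wired up: `build_prompt` and `parse_verdict` are hypothetical helpers of my own, and the actual model call is omitted because it depends on the provider's SDK, not on anything in the cited work.

```python
# Zero-shot classification via prompting (sketch). Only the prompt
# construction and reply parsing are shown; the network call is elided.
def build_prompt(text):
    return (
        "Classify the following message as HARMFUL or BENIGN. "
        "Answer with a single word.\n\n"
        f"Message: {text}\nAnswer:"
    )

def parse_verdict(response):
    """Map the model's free-text reply to a boolean flag."""
    return response.strip().split()[0].upper() == "HARMFUL"

# The call itself would look roughly like (provider-specific, omitted):
#   reply = some_llm_client.complete(build_prompt(user_message))
#   is_harmful = parse_verdict(reply)
print(parse_verdict(" HARMFUL "))  # -> True
```

Because the classification criterion lives in the prompt rather than in labeled training data, no annotated corpus is needed, which is the reduction in human intervention the paragraph above describes.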
3 RESULTS AND DISCUSSIONS
Traditional methods tend to be less complex but also
less accurate than newer approaches, with a focus on
word-level and character-level features that may
hinder their performance in understanding natural
language. The classic Dictionary method involves
simply checking whether users have used prohibited
words for detection. This method is the simplest and
fastest, although it often suffers from issues such as
the emergence of new prohibited words and
misspelled prohibited words, leading to lower
accuracy. Furthermore, because the method relies solely on the prohibited word list, neutral terms that happen to contain or resemble prohibited words are prone to being flagged as false positives.
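A toy illustration of the dictionary method and its false-positive problem: naive substring matching against a blocklist flags the neutral word "grapes" because it contains "rape". The word list here is purely illustrative.

```python
# Minimal dictionary-based filter (sketch). Real systems at minimum
# match on word boundaries, which this deliberately naive version does not.
BLOCKLIST = {"rape", "scum"}

def dictionary_flag(text):
    words = text.lower().split()
    return any(bad in w for w in words for bad in BLOCKLIST)

print(dictionary_flag("I bought grapes"))   # -> True (false positive)
print(dictionary_flag("a nice afternoon"))  # -> False
```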
N-grams and word skip-grams are effective
techniques that partially tackle the challenge of
identifying misspelled prohibited words. These
approaches demonstrate a degree of contextual
understanding in sentences and vocabulary, which
aids in the detection of misspelled prohibited words.
Nevertheless, the limitation lies in their inability to
completely grasp the intricacies of natural language.
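One common way (not taken from the cited works) to operationalize this for misspelled prohibited words is to compare sets of character n-grams: a misspelling still shares most trigrams with its target, while an unrelated word shares none. The example words and any flagging threshold are illustrative assumptions.

```python
# Character-trigram overlap as a fuzzy match for obfuscated/misspelled words.
def char_ngrams(word, n=3):
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two n-gram sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

target = char_ngrams("stupid")
print(jaccard(target, char_ngrams("stupiid")))  # misspelling -> 0.5
print(jaccard(target, char_ngrams("flower")))   # unrelated   -> 0.0
```

A detector could flag words whose similarity to a blocklist entry exceeds some threshold, catching variants a plain dictionary lookup would miss.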
Machine learning can better understand natural
language, fundamentally distinguishing between
harmful and normal discourse by comprehending
natural language at the sentence level. Practitioners can improve detection accuracy by choosing among different models. However, traditional machine learning requires a significant amount of human effort to manually annotate all of the training data.
Moreover, models generated by traditional machine
learning methods can only be used for the
corresponding language, necessitating retraining for
other languages. Researchers achieved promising
results, with the accuracy of the LGBM model
reaching over 90% (Ates et al., 2021).
Utilizing large language models enables
individuals to construct more advanced hate speech
detection systems based on existing large-scale
models, for example, ChatGPT. An LLM has a deeper understanding of natural language, and its higher-dimensional comprehension of meaning allows it to fundamentally distinguish whether language is harmful. This distinction is key to developing effective hate speech detection systems.