a few SB data, we first employ and validate a data
clustering technique to produce high-quality clusters
of bidders. Second, we introduce an approach to de-
tect anomalies, i.e., fraudulent bidders in each cluster.
Since these are only a few training data, we can therefore verify their ground truth. Nevertheless, the resulting labeled subset is imbalanced, and we tackle this problem with a hybrid data sampling method (sketched after this paragraph). Next,
we develop two SB detection models based on two
SSC algorithms of different categories and then com-
pare their predictive performance using several qual-
ity metrics. Our objective is to determine the ideal
fraud classifier, which will be instrumental in distin-
guishing between genuine and fraudulent bidders on
auction sites. Lastly, we analyze the influence of the amount of labeled data on the SB model accuracy. More precisely, we determine how much labeled data is required to build the optimal fraud classification model.
We note that all comparisons between SSC models
are carried out using statistical testing.
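For illustration, the sketch below shows one way the hybrid sampling step could be realized, combining SMOTE oversampling of the minority (fraud) class with random undersampling of the majority class via imbalanced-learn; the library choice, the sampling ratios, and the synthetic stand-in data are assumptions made for the example, not necessarily our exact configuration.

```python
# Hedged sketch of hybrid resampling for an imbalanced labeled subset.
# Assumed: imbalanced-learn; the ratios 0.5 and 0.8 are illustrative only.
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Stand-in for the small labeled SB subset (fraud = minority class).
X, y = make_classification(n_samples=500, weights=[0.95], random_state=42)

hybrid = Pipeline(steps=[
    ("over", SMOTE(sampling_strategy=0.5, random_state=42)),        # synthesize minority samples
    ("under", RandomUnderSampler(sampling_strategy=0.8,             # then trim majority samples
                                 random_state=42)),
])
X_res, y_res = hybrid.fit_resample(X, y)
```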
We structure our paper as follows. Section 2 re-
views notable studies of SSC in the fraud detection
domain. Section 3 describes the characteristics of the
SB training dataset. Section 4 presents the process required to label a few SB data. Section 5 optimizes two SSC algorithms with the few labeled data and assesses
their performance with several quality metrics. Sec-
tion 6 examines the impact of labeled data amount on
the SSC accuracy. Finally, Section 7 concludes our
work and outlines future research directions.
2 RELATED WORK
In this section, we examine recent research, published in 2018, on the capability of SSC in the field of fraud detection. For instance,
to detect spams in tweets, the authors in (Sedhai
and Sun, 2018) proposed an adaptive SSC framework
consisting of two parts: real-time mode and batch
mode. The former detects and labels tweets using four labels: blacklisted URLs, near-duplicate, trusted (contains no spam words and is posted by a trusted user), and others; the batch mode then updates the models accordingly. A toy sketch of this labeling step follows this paragraph. For the experiments, the authors employed an older dataset containing a large number of tweets collected over two months in 2013. In the original dataset,
data came with labels obtained manually or automat-
ically. The authors randomly selected some of the au-
tomatically labeled data to manually relabel them in
order to expand the ground truth. For training, they
used only 6.6% of tweets while the rest was used for
testing. They compared the proposed system called
S3D (which updates after each time window) to four
other classifiers: Naive Bayes, Logistic Regression, Random Forest, and S3D-Update (a variant without batch updates). Experimentally, S3D was superior to the other classifiers and showed a good ability to learn new patterns and vocabulary. However, this study focused
on detecting spam tweets, not suspicious users. Indeed, discovering the fraudsters themselves is important because they can continue to conduct fraud as long as they have not been suspended.
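To make the four-way labeling concrete, the toy sketch below mimics such a real-time labeling rule; the blacklist, spam-word list, trusted-user list, and the exact-match duplicate check are simplifications we introduce for illustration, not S3D's actual rules.

```python
# Toy sketch of S3D-style four-way tweet labeling (rules heavily simplified;
# the blacklist, spam words, and trusted users below are placeholders).
BLACKLISTED_DOMAINS = {"spam.example.com"}
SPAM_WORDS = {"free", "winner", "click"}
TRUSTED_USERS = {"alice", "bob"}

def label_tweet(text: str, urls: list[str], user: str, seen: set[str]) -> str:
    if any(domain in url for url in urls for domain in BLACKLISTED_DOMAINS):
        return "blacklisted"
    normalized = " ".join(text.lower().split())
    if normalized in seen:          # exact match stands in for near-duplicate detection
        return "near-duplicate"
    seen.add(normalized)
    if user in TRUSTED_USERS and not (SPAM_WORDS & set(normalized.split())):
        return "trusted"
    return "other"
```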
The Irish Commission for Energy Regulation released a dataset, collected in 2009 and 2010, from around 5,000 Irish households. Very few of the data were manually labeled; almost 90% remained unlabeled because of the difficulty of conducting inspections.
In (Viegas et al., 2018), the authors took advantage of
the few labeled data and used them for SSC in order to
detect electricity fraud carried out by consumers. The
labeled data were imbalanced, so they added simu-
lated data to overcome this problem. Random For-
est Co-Training was employed to develop the clas-
sification models by varying the percentages of la-
beled data: 10%, 20% and 30%. More precisely,
the authors trained the Random Forest classifier on 10% labeled data and then gradually added the unlabeled instances that the model predicted with the highest confidence (a minimal sketch of this loop follows this paragraph).
The experiments showed that few (10%) labeled data yielded the best accuracy. The authors also demonstrated that SSC outperforms supervised classification with Random Forest, Naive Bayes and Logistic Regression.
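The confidence-based loop described above is essentially self-training; the sketch below illustrates it with a Random Forest seeded on a small labeled subset. The 0.95 confidence threshold, the number of rounds, and the synthetic data are our assumptions, not the settings of Viegas et al.

```python
# Minimal self-training sketch: Random Forest + confidence-based pseudo-labeling.
# Threshold, round count, and data are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y_true = make_classification(n_samples=2000, weights=[0.9], random_state=0)
rng = np.random.default_rng(0)
labeled = np.zeros(len(y_true), dtype=bool)
labeled[rng.choice(len(y_true), size=200, replace=False)] = True  # ~10% labeled seed
y = np.where(labeled, y_true, -1)                                 # -1 marks "unlabeled"

model = RandomForestClassifier(n_estimators=100, random_state=0)
for _ in range(5):                                    # a few self-training rounds
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[~labeled])
    confident = proba.max(axis=1) >= 0.95             # keep only high-confidence predictions
    if not confident.any():
        break
    idx = np.flatnonzero(~labeled)[confident]
    y[idx] = model.classes_[proba[confident].argmax(axis=1)]      # pseudo-label
    labeled[idx] = True
```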
Social Networking Services (SNSs) are increas-
ingly threatened by fake or compromised social bots.
These bots can mimic the behavior of legitimate users
to evade detection. In (Dorri et al., 2018), the au-
thors developed "SocialBotHunter", a collective SSC
approach that combines the structural information of
the social graph with the information on users’ social
behavior in order to detect social botnets in a Twitter-
like SNS. They used a popular tweet dataset consist-
ing of 10,000 legitimate users and 1,000 spammers.
Since this dataset lacks information on social inter-
actions among users, they used two random graph
generators to simulate social interactions in terms of
a social graph containing both legitimate and social
bot regions. To estimate the initial anomaly scores
of unlabeled users, first, a one-class SVM classifier was trained with the social graph of users and a small set of labeled legitimate users (a minimal sketch of this scoring step follows this paragraph). Next, to detect social bots,
the anomaly scores were revised by modeling the so-
cial interactions among all users as a pairwise Markov
Random Field (MRF) and applying the belief prop-
agation to the MRF. Furthermore, the authors used
a testing dataset of 9,000 legitimate unlabeled users
and 500 unlabeled social bots to evaluate the accuracy of their approach.
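As a rough illustration of the first stage only (the pairwise MRF and belief-propagation step is beyond a short sketch), the code below fits a one-class SVM on labeled legitimate users and scores the unlabeled users; the feature vectors and hyperparameters are stand-ins we assume for the example, not the authors' actual setup.

```python
# Sketch of one-class-SVM anomaly scoring over user features (first stage only).
# Features and hyperparameters are illustrative stand-ins.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_legit = rng.normal(0.0, 1.0, size=(200, 8))       # labeled legitimate users' features
X_unlabeled = rng.normal(0.4, 1.3, size=(1000, 8))  # unlabeled users' features

ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X_legit)
# decision_function > 0 means "looks legitimate"; negate it to get an anomaly score.
initial_anomaly_scores = -ocsvm.decision_function(X_unlabeled)
```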