Pump and Dump Cryptocurrency Detection Using Social Media

Domenico Alfano (1, 2), Roberto Abbruzzese (1, 2) and Domenico Parente

Department of Management & Innovation Systems, University of Salerno, Fisciano, Italy

Eustema S.p.A., Research and Development Centre, Napoli, Italy

Keywords: Cryptocurrency, Pump and Dump, Anomaly Detection, Natural Language Processing, Explainable AI.

Abstract: The economic implications of cryptocurrency price fluctuations and, more importantly, the complexity of the variables involved have made price forecasting a very popular topic among researchers. Particular attention has gone to detecting Pump & Dump events, in which manipulators induce cryptocurrency owners to buy or sell in order to profit at their expense. Over the last decade, research has progressed by proposing new metrics (financial and non-financial) capable of tracking and explaining the reasons for price fluctuations. Thanks to the advent of social media, major investment communities can be analysed through social channels to create new metrics. With developments in the field of Natural Language Processing, these social channels are used to extract the opinions and mood of expert investors and cryptocurrency owners. We propose to apply these innovative ways of creating metrics and to demonstrate that taking the generated metrics into account can significantly outperform existing Pump & Dump detection methods. Moreover, to measure how each created metric contributes to the detection, we use two explainability methods: SHapley Additive exPlanations (SHAP), a game-theoretic approach, and LIME, a method that explains each prediction of any black-box machine learning model by approximating it with a local, interpretable model.

1 INTRODUCTION

Cryptocurrencies are steadily gaining popularity, and more people are using them as investment vehicles. Despite the significant sums of money invested and traded in them, cryptocurrencies remain untested and largely unregulated. Their technical complexity and lack of regulation make them a desirable target for scammers aiming to prey on the uninformed (Kyle and Viswanathan, 2008). Pump-and-Dump (P&D) is a type of scam in which speculators try to increase the value of a particular cryptocurrency by disseminating false information about it.

A P&D scheme is a form of fraud in which perpetrators accumulate tokens over time, artificially inflate the market price by disseminating false information (pumping), and then sell what they have acquired to later buyers at the inflated price (dumping). Once the artificial boost ends, the price typically drops, leaving those who bought on the basis of the misinformation at a loss (Kamps and Kleinberg, 2018).

In the context of cryptocurrencies, the misinformation used to drive up the price is spread through dedicated public P&D groups.

These groups have developed as online chat rooms on social media platforms such as Telegram, with the specific aim of organizing pump-and-dump schemes on particular cryptocurrencies.

Studies suggest that, in order to obtain the greatest effect, these P&D groups mainly target less well-known currencies, especially those with low market capitalization and limited circulation, since they are thought to be simpler to manipulate (La Morgia, 2020; Mac and J., 2018). In a typical pump-and-dump scenario, group leaders advertise that the pump will occur at a specific time on a specific exchange, and that the currency will not be announced until that time. Once it is announced, the group chat participants attempt to be among the first to purchase the currency in order to maximize their gains. If they move too slowly, they may end up buying at the peak and be unable to sell for a profit. During the pump phase, users are frequently urged to broadcast false information about the coin in an effort to get others to buy it, so that they can sell it more readily. Although misinformation takes many forms, frequent strategies include fake news, nonexistent projects, phony partnerships, and false celebrity endorsements.

P&D manipulation is currently not always unlawful, because the technology underpinning cryptocurrencies is still relatively new (Kramer, 2005).

Alfano, D., Abbruzzese, R. and Parente, D.
Pump and Dump Cryptocurrency Detection Using Social Media.
DOI: 10.5220/0012059300003541
In Proceedings of the 12th International Conference on Data Science, Technology and Applications (DATA 2023), pages 235-240
ISBN: 978-989-758-664-4; ISSN: 2184-285X
© 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

This paper presents a novel approach to detecting P&D schemes in cryptocurrency markets. Since most previous research in this area has relied only on financial data, our work focuses on exploiting the large amount of freely available data accessible through the Telegram APIs to improve detection performance (La Morgia, 2020; Kamps and Kleinberg, 2018).

2 RELATED WORK

The literature review identified scientific publications on anomaly detection in social media. Soltani et al. (Neda Soltani, 2018) identified anomalies and suspicious users through network analysis techniques and through the study of the structure of nodes within communities. Another interesting work (Ryan G. Chacon, 2022) used the Fama-French model to measure and demonstrate the influence of social media as a source of inspiration for new investments. In that case, the authors attempted to replicate the investment strategy suggested by the WallStreetBets group to generate profits, using the analysis of the social channel to indicate the appropriate market to invest in. Other researchers (Firat Akba and Askerzade, 2022) attempted to identify Bitcoin price manipulation with Machine Learning techniques. Specifically, the aim of the study was to investigate periods of manipulation by studying user emotions and sentiment in social media.

The analysis was carried out with algorithms such as SVM and SARIMAX. Studies similar to those mentioned above are abundant, making it clear that social media are often used as a medium for fraudulent activities. In recent years, many of them have focused on the cryptocurrency market and the Pump & Dump phenomenon. However, this phenomenon existed in traditional finance before the creation of cryptocurrencies. Reviewing similar works in the financial field helps us understand the techniques used by malicious investors, making it easier to apply the same or new approaches to the cryptocurrency market. One of the main means used by fraudsters to pump a currency is spreading misinformation, often with the help of bots to automate the process (Mehrnoosh Mirtaheri et al., 2021). From a financial perspective, market manipulation schemes are classified into three main categories (Allen and Gale, 1992): information-based, action-based and trade-based.

Under this classification, pump-and-dump schemes can be defined as a combination of information-based and trade-based manipulation. The work of Kamps and Kleinberg (Kamps and Kleinberg, 2018) is a first attempt to detect pump-and-dump events using an adaptive threshold. They emphasise that there is no reliable data set of confirmed pump-and-dump patterns, so they cannot fully validate their results.

3 PUMP AND DUMP DETECTION

3.1 Data Set

The financial data set used in this study consists of manually labelled raw transaction data from the cryptocurrency exchange Binance, first made public by (La Morgia, 2020). It contains known instances of P&D found in transactions involving a variety of cryptocurrencies.

To generate the data set, the authors first joined various Telegram channels known for organizing and carrying out cryptocurrency P&D schemes. Then, over a two-year period, they gathered the timestamps of the official pump signals announced in each of these groups by the group administrators. Using these timestamps and the Binance API, the authors retrieved all transactions of each pumped cryptocurrency from up to a week before to a week after the pump, depending on what was available. This method led to the collection of 104 P&D events. After obtaining the raw data from the Binance API, the authors pre-processed it by grouping the transactions into chunks of five, fifteen, and twenty-five seconds, resulting in three distinct aggregated data sets. Each of these aggregated data sets contains the following features:

• PumpIndex, Symbol: The zero-based index of the pump, identifying it as one of the 104 available pumps, and the symbol of the coin on which the pump took place;
• StdRushOrder, AvgRushOrder: The moving standard deviation and the average percentage change of the number of rush orders;
• StdTrades: The moving standard deviation of the number of buy- and sell-side trades;
• StdVolume, AvgVolume: The moving standard deviation and the average percentage change of the order volume;
• StdPrice, AvgPrice, AvgPriceMax: The standard deviation, average percentage change, and maximum percentage change of the asset price, respectively.
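To make the chunking step concrete, the following is a minimal sketch, not the original authors' code. It assumes trades arrive as hypothetical (timestamp in seconds, price, volume) tuples, skips chunks that contain no trades, expresses changes as fractions rather than percentages, and uses an arbitrary trailing window of 2 chunks for the moving standard deviation. Rush-order features are not covered.

```python
from collections import defaultdict
from statistics import pstdev


def aggregate_chunks(trades, chunk_seconds):
    """Group (timestamp_s, price, volume) trades into fixed-size time chunks."""
    buckets = defaultdict(list)
    for ts, price, volume in trades:
        buckets[int(ts // chunk_seconds)].append((price, volume))
    rows = []
    for chunk in sorted(buckets):  # chunks without trades are simply absent
        prices = [p for p, _ in buckets[chunk]]
        volumes = [v for _, v in buckets[chunk]]
        rows.append({
            "chunk_start": chunk * chunk_seconds,
            "trades": len(prices),
            "volume": sum(volumes),
            "price": sum(prices) / len(prices),
        })
    return rows


def relative_change(values):
    """Fractional change between consecutive values (first entry is 0.0)."""
    return [0.0] + [(b - a) / a for a, b in zip(values, values[1:])]


def moving_std(values, window):
    """Population standard deviation over a trailing window of chunks."""
    return [pstdev(values[max(0, i - window + 1): i + 1])
            for i in range(len(values))]


# Hypothetical trades: (timestamp in seconds, price, volume).
trades = [(0, 10.0, 1.0), (2, 10.0, 2.0), (6, 11.0, 4.0),
          (7, 11.0, 2.0), (12, 12.1, 3.0)]
rows = aggregate_chunks(trades, chunk_seconds=5)
price_change = relative_change([r["price"] for r in rows])
volume_std = moving_std([r["volume"] for r in rows], window=2)
```

Calling `aggregate_chunks` with `chunk_seconds` set to 5, 15, or 25 would yield the three aggregation levels described in the text.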

On the social media side, it has been shown that real, organized price manipulation takes place (Kamps and Kleinberg, 2018). The least active groups carry out only one P&D operation per week, whilst the most active groups carry out roughly one operation every day. Generally speaking, the procedure follows these steps (Martineau, 2018):

• A few days or hours prior to the operation, the administrators announce that a P&D will occur, the exchange to be used, the precise time the operation will begin, and whether the operation will be Free for All, in which case everyone receives the message simultaneously, or Ranked, in which case VIPs and members at higher levels in the hierarchy receive the message before other members;
• The announcement is repeated more frequently as the start of the operation draws near;
• Just before the event begins, the organizers give some straightforward advice: check your Internet connection, buy low and sell high, and hold the currency as long as you can while waiting for outside investors;
• The free chat rooms are then closed to prevent "Fear, Uncertainty, and Doubt" (FUD) spread by those looking to disrupt the operation and cause panic within the group;
• At the designated time, the targeted cryptocurrency is revealed; when exactly a member learns it depends on their position in the group hierarchy. The name of the cryptocurrency is typically written in an obfuscated image that can only be read accurately by humans. The purpose of the obfuscation is to prevent bots from parsing the message with OCR methods and starting to trade more quickly than humans;
• Shortly after the operation begins, the admins release a news item and ask everyone in the group to spread the message that the price of the cryptocurrency is increasing. This is done on Twitter, in forums, and in dedicated chats. By presenting the pump as a special investment opportunity, this activity aims to draw in outside investors;
• Finally, after the operation, the admins reopen the free chat rooms and give the users some P&D statistics.

Following this process, and starting from the timestamps of the manually labelled pumps and the links to the Telegram groups from which the authors retrieved the information, we joined these groups and, through the Telegram APIs, downloaded all the messages exchanged in the chats from two days before to two days after each pump.

However, only some of these Telegram groups allowed access to the message history, so the data set was reduced from 104 pumps to 89 pumps.

These messages were pre-processed using Natural Language Processing techniques, in the following steps:
• Removal of Emoji, Images, Stop-Words and External Links: removing non-informative content;
• Lemmatization: grouping together the inflected forms of a word so they can be analysed as a single item;
• POS Tagging: marking each word in a text as corresponding to a particular part of speech, based on both its definition and its context; only words tagged as nouns (NOUN) or proper nouns (PROPN) were retained.
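The first of the steps above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical subset, the emoji pattern covers only common pictograph blocks, and tokens are lowercased. Lemmatization and POS tagging are not shown, since they would normally be delegated to an NLP library (for example spaCy, whose taggers emit the NOUN and PROPN labels).

```python
import re

# External links, including bare Telegram links.
URL_RE = re.compile(r"https?://\S+|t\.me/\S+")
# Rough emoji coverage: common pictograph and symbol blocks (an assumption).
EMOJI_RE = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F]")
# Illustrative stop-word subset; a real list would be much longer.
STOP_WORDS = {"the", "is", "on", "a", "an", "to", "of", "and", "in", "now"}


def clean_message(text):
    """Strip links and emoji, lowercase, tokenize, and drop stop-words."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    tokens = re.findall(r"[a-z0-9$]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


tokens = clean_message("The pump is on 🚀🚀 buy XYZ now https://t.me/example")
```

Here `tokens` contains only the remaining informative words, with the coin symbol preserved in lowercase.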

After gathering the messages from the relevant chat for each pump index, we proceeded to the feature extraction stage.

In particular, for each of the three data sets created by (La Morgia, 2020), and for each of the financial transactions (and therefore each pump index) within them, the new feature takes the value 1 if the currency symbol occurs in the messages exchanged in the 5 minutes before the transaction was made, and 0 otherwise. We thus obtain the final data set, which consists of 89 pumps, by adding a single categorical feature to the financial ones.
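The rule described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function name, the representation of messages as (timestamp in seconds, list of lowercase tokens) pairs, and the half-open window (from 5 minutes before up to, but excluding, the transaction time) are all assumptions.

```python
from bisect import bisect_left

WINDOW_SECONDS = 5 * 60  # look-back window stated in the text


def social_feature(transaction_times, symbol, messages):
    """Return 1 or 0 per transaction time: was `symbol` mentioned in the
    preceding 5 minutes?

    messages: list of (timestamp_s, tokens) pairs with lowercase tokens.
    """
    symbol = symbol.lower()
    mention_times = sorted(ts for ts, tokens in messages if symbol in tokens)
    flags = []
    for t in transaction_times:
        lo = bisect_left(mention_times, t - WINDOW_SECONDS)
        hi = bisect_left(mention_times, t)  # strictly before the transaction
        flags.append(1 if hi > lo else 0)
    return flags


# Hypothetical preprocessed chat messages for one pump.
messages = [(1000, ["pump", "soon"]), (1200, ["buy", "xyz"])]
flags = social_feature([1100, 1205, 1500, 1501], "XYZ", messages)
```

Sorting the mention timestamps once and using binary search keeps the lookup cheap even when each pump has many transaction rows.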

3.2 Model

To assess the contribution of the new Telegram feature under comparable conditions, we adopted the same classifier used by (La Morgia, 2020): the Random Forest, a collection of decision tree classifiers, each depending on the values of a random vector sampled independently, in which every tree casts a vote and the prediction is the class with the most votes overall (Breiman, 2001). Since our data set consists of only 89 pump-and-dump events, we did not split it into conventional train and test sets. Instead, we used 5-fold cross-validation to obtain a more reliable assessment of performance. The Random Forest uses 200 trees, a maximum depth of 4 per tree, and at least 6 samples per leaf node.
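The configuration above maps directly onto scikit-learn. The sketch below is illustrative only: the data are synthetic, and the stratified, shuffled folds and fixed random seeds are assumptions, since the text does not say how the folds were built.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the aggregated feature matrix and the P&D labels.
rng = np.random.default_rng(0)
n_rows = 600
X = rng.normal(size=(n_rows, 5))
noise = rng.normal(scale=0.5, size=n_rows)
y = (X[:, 0] + X[:, 1] + noise > 1.5).astype(int)

# Hyperparameters as stated in the text: 200 trees, depth 4, 6 samples per leaf.
clf = RandomForestClassifier(
    n_estimators=200, max_depth=4, min_samples_leaf=6, random_state=0
)

# 5-fold cross-validation; stratification and shuffling are assumptions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(clf, X, y, cv=cv, scoring=("precision", "recall", "f1"))
mean_scores = {m: scores[f"test_{m}"].mean() for m in ("precision", "recall", "f1")}
```

`mean_scores` then holds the fold-averaged precision, recall, and F1, the same three metrics reported in the results.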


4 RESULTS

On all data sets, adding the social feature allowed the classifier to beat the previous approach by a statistically significant margin. This demonstrates the effectiveness of using features derived from social data in this previously unexplored area.

The results across all data sets are shown in Table 1 (financial features only) and Table 2 (financial and social features).

Table 1: 5-Fold Random Forest Performance on Financial Features.

Chunk-Size   Precision   Recall   F1
5 Sec        89.1%       83.1%    86.04%
15 Sec       97.5%       89.8%    93.5%
25 Sec       97.5%       91.1%    94.1%

Table 2: 5-Fold Random Forest Performance on Financial and Social Features.

Chunk-Size   Precision   Recall   F1
5 Sec        94.2%       91%      92.5%
15 Sec       98.8%       94.3%    96.5%
25 Sec       97.6%       94.3%    95.9%

We found that predictions on the 5-second chunked data set are noticeably less accurate than those on the 15-second and 25-second chunked data sets (F1 of 92.5% versus 96.5% and 95.9% with social features), which suggests that detecting anomalies with smaller chunk sizes is a harder problem in general. This confirms findings from previous work (La Morgia, 2020).
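As a sanity check on Tables 1 and 2, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall; the dictionary layout is purely illustrative. The recomputed values agree with the reported F1 scores to within about 0.1 percentage points; such small gaps would be expected from rounding or from averaging per-fold scores, although the text does not say how the scores were aggregated.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, copied from Tables 1 and 2.
reported = {
    ("financial", "5 Sec"): (89.1, 83.1, 86.04),
    ("financial", "15 Sec"): (97.5, 89.8, 93.5),
    ("financial", "25 Sec"): (97.5, 91.1, 94.1),
    ("financial+social", "5 Sec"): (94.2, 91.0, 92.5),
    ("financial+social", "15 Sec"): (98.8, 94.3, 96.5),
    ("financial+social", "25 Sec"): (97.6, 94.3, 95.9),
}

# Absolute gap between recomputed and reported F1, per table row.
gaps = {
    key: abs(f1_score(p, r) - f1)
    for key, (p, r, f1) in reported.items()
}
max_gap = max(gaps.values())
```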

5 EXPLAINABILITY

5.1 Features Contribution Analysis

SHAP (SHapley Additive exPlanations) is a game-theoretic approach to explaining the output of any machine learning model (Strumbelj and Kononenko, 2014). SHAP solves an attribution problem: it distributes the prediction score of a model for a specific input among its input features, showing how influential each feature is in the decision. One of the leading approaches for contribution analysis is based on the Shapley value (Lundberg and Lee, 2017), a construct from cooperative game theory. In cooperative game theory, a group of players come together to consume a service, and this incurs some cost. The Shapley value distributes this cost among the players.

The algorithm computes the prediction while assigning different values to a feature in order to calculate that feature's contribution to the predicted outcome. The Python SHAP library provides several methods for estimating feature contributions, depending on the model used. Since the model selected here is the Random Forest, the appropriate method is TreeExplainer, which explains the output of ensemble tree models.
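To illustrate the attribution idea, the sketch below computes exact Shapley values by brute force for a small hypothetical model. It is not the SHAP library's implementation: it enumerates every feature subset (exponential cost), and it treats an "absent" feature as one replaced by a baseline value, which is an assumption. TreeExplainer obtains such attributions efficiently for tree ensembles.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature for the prediction model(x)."""
    n = len(x)

    def value(subset):
        # Features outside the subset are replaced by their baseline value.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    """Hypothetical model with two linear terms and one interaction."""
    return 2 * z[0] + 3 * z[1] + z[0] * z[2]


phi = shapley_values(toy_model, x=[1, 1, 1], baseline=[0, 0, 0])
```

For this toy model the interaction term is shared equally between the two features involved, and the contributions sum to the difference between the prediction for `x` and the prediction for the baseline.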

These are the results of the SHAP analysis:

Figure 1: Impact of Financial and Social features on RF (5 Folds) for Chunk-Size 5 Sec.

Figure 2: Impact of Financial and Social features on RF (5 Folds) for Chunk-Size 15 Sec.

As shown in these figures, the social feature has a considerable impact on all data sets. Its importance, however, wanes as the chunk size increases. This is unsurprising: with 15-second and 25-second chunks, fewer data points are needed on average to capture the same amount of information than with 5-second chunks. Consequently, the input contains less social information at larger chunk sizes.


Figure 3: Impact of Financial and Social features on RF (5 Folds) for Chunk-Size 25 Sec.

5.2 Single Instance Explanation

LIME (Local Interpretable Model-agnostic Explanations) is an explanation technique that learns an interpretable model locally around a prediction, in order to explain the prediction of any classifier in a way that is understandable and faithful (Marco Tulio Ribeiro, 2016). Each component of the name reflects a property we seek in an explanation. Local fidelity means that the explanation should accurately capture how the classifier behaves "around" the instance being predicted. Interpretable means that the explanation must be understandable by a person, otherwise it is pointless. LIME is model-agnostic because it can explain any model without having to "peek" into it, and it concentrates on training local models to explain specific predictions rather than building a global model. The objective is to understand why the machine learning model made a particular prediction. LIME investigates what happens to the predictions when different variations of the data are fed into the model, creating a new data set made up of perturbed samples and the corresponding black-box model predictions.
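The core of this procedure can be sketched with NumPy alone. The code below is a simplified illustration of the local-surrogate idea, not the LIME library itself: the black-box function is hypothetical, perturbations are Gaussian, and the exponential proximity kernel with its width is an arbitrary choice.

```python
import numpy as np


def black_box(X):
    """Hypothetical opaque classifier returning the probability of class 1."""
    return 1.0 / (1.0 + np.exp(-(3.0 * X[:, 0] - 2.0 * X[:, 1])))


def local_surrogate(predict, instance, n_samples=2000, scale=0.5,
                    kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear model around one instance."""
    rng = np.random.default_rng(seed)
    # 1. Perturb the instance to build a new data set around it.
    samples = instance + rng.normal(scale=scale, size=(n_samples, instance.size))
    # 2. Label the perturbed samples with the black-box predictions.
    targets = predict(samples)
    # 3. Weight each sample by its proximity to the instance.
    distances = np.linalg.norm(samples - instance, axis=1)
    proximity = np.exp(-(distances ** 2) / kernel_width ** 2)
    # 4. Weighted least squares on features centred at the instance.
    design = np.hstack([np.ones((n_samples, 1)), samples - instance])
    root_w = np.sqrt(proximity)[:, None]
    coef, *_ = np.linalg.lstsq(design * root_w, targets * root_w[:, 0], rcond=None)
    return coef[0], coef[1:]


intercept, coefs = local_surrogate(black_box, np.array([0.2, 0.1, 0.5]))
```

The signs and magnitudes of `coefs` give the local explanation: the first feature pushes the prediction up, the second pushes it down, and the third, which the black box ignores, receives a coefficient close to zero.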

These are the results of the LIME analysis:

In particular, for each data set, we consider instances that the (La Morgia, 2020) model predicts as class 0 (Nothing) and that the proposed model incorporating the social feature predicts as class 1 (Pump&Dump). The analysis demonstrates the significant weight of the social feature for all data sets. In many instances, especially when the values of the other features are all equal to 0, the social feature alone is able to fully overturn the local prediction.

Figure 4: Single Instance Explanation Chunk-Size 5 Sec. (a) Without Social feature. (b) With Social feature.

Figure 5: Single Instance Explanation Chunk-Size 15 Sec. (a) Without Social feature. (b) With Social feature.

Figure 6: Single Instance Explanation Chunk-Size 25 Sec. (a) Without Social feature. (b) With Social feature.

6 CONCLUSIONS

This paper studies the addition of social features to the cryptocurrency fraud detection problem space. We propose a novel method that reaches state-of-the-art performance on the available data.

Future work includes fine-tuning this model with new features to better account for the volatility generally found in cryptocurrencies, and exploring the potential of deep learning techniques (V. Chadalapaka and Vasil, 2022).


Moreover, given the demonstrated impact of the social feature within the proposed models, future research will focus on constructing new and more complex social features capable of capturing more of the information contained in the Telegram text messages.

REFERENCES

Allen, F. and Gale, D. (1992). Stock-price manipulation. The Review of Financial Studies.

Breiman, L. (2001). Random forests. Machine learning, vol. 45, no. 1, pp. 5–32.

Firat Akba, Ihsan Tolga Medeni, M. S. G. and Askerzade, I. (2022). Manipulator Detection in Cryptocurrency Markets Based on Forecasting Anomalies. IEEE International Conference on Artificial Intelligence in Engineering and Technology.

Kamps, J. and Kleinberg, B. (2018). To the moon: defining and detecting cryptocurrency pump-and-dumps. Crime Science.

Kramer, D. B. (2005). The way it is and the way it should be: Liability under § 10 (b) of the exchange act and rule 10b-5 thereunder for making false and misleading statements as part of a scheme to pump and dump a stock. University of Miami Business Law Review.

Kyle, A. S. and Viswanathan, S. (2008). How to define illegal price manipulation. American Economic Review.

La Morgia, A. Mei, F. S. (2020). Pump and Dumps in the Bitcoin Era: Real Time Detection of Cryptocurrency Market Manipulations. IEEE.

Lundberg, S. M. and Lee, S.-I. (2017). A unified approach to interpreting model predictions. Advances in neural information processing systems.

Mac, R. and J., L. (2018). Here’s how scammers are using fake news to screw with Bitcoin Investors. BuzzFeed News.

Marco Tulio Ribeiro, Sameer Singh, C. G. (2016). “Why Should I Trust You?” Explaining the Predictions of Any Classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016.

Martineau, P. (2018). Inside the group chats where people pump and dump cryptocurrency. The Outline.

Mehrnoosh Mirtaheri, Sami Abu-El-Haija, F. M., Steeg, G. V., and Galstyan, A. (2021). Identifying and Analyzing Cryptocurrency Manipulations in Social Media. IEEE Transactions on Computational Social Systems.

Neda Soltani, Elham Hormizi, S. A. H. G. (2018). Anomaly Detection in Q&A Based Social Networks. Advances in Intelligent Systems and Computing.

Ryan G. Chacon, Thibaut G. Morillon, R. W. (2022). Will the reddit rebellion take you to the moon? Evidence from WallStreetBets. Financial Markets and Portfolio Management.

Strumbelj, E. and Kononenko, I. (2014). Explaining prediction models and individual predictions with feature contributions. Knowledge and information systems.

V. Chadalapaka, Kyle Chang, G. M. and Vasil, A. (2022). Crypto Pump and Dump Detection via Deep Learning Techniques. arXiv.
