CrudeBERT: Applying Economic Theory Towards Fine-Tuning
Transformer-Based Sentiment Analysis Models to the Crude Oil Market
Himmet Kaplan¹ᵃ, Ralf-Peter Mundani²ᵇ, Heiko Rölke²ᶜ and Albert Weichselbraun²ᵈ
¹ Zurich University of Applied Sciences, Winterthur, Switzerland
² University of Applied Sciences of the Grisons, Chur, Switzerland
Keywords:
Natural Language Processing, Sentiment Analysis, Transformers, FinBERT, Crude Oil Market, Fine-Tuning.
Abstract:
Predicting market movements based on the sentiment of news media has a long tradition in data analysis.
With advances in natural language processing, transformer architectures have emerged that enable contextu-
ally aware sentiment classification. Nevertheless, current methods built for the general financial market such
as FinBERT cannot distinguish asset-specific value-driving factors. This paper addresses this shortcoming
by presenting a method that identifies and classifies events that impact supply and demand in the crude oil
markets within a large corpus of relevant news headlines. We then introduce CrudeBERT, a new sentiment
analysis model that draws upon these events to contextualize and fine-tune FinBERT, thereby yielding im-
proved sentiment classifications for headlines related to the crude oil futures market. An extensive evaluation
demonstrates that CrudeBERT outperforms proprietary and open-source solutions in the domain of crude oil.
1 INTRODUCTION
Crude oil is one of our primary energy sources and
also one of the most influential raw materials. Thus,
it is of utmost importance for the global economy
and even serves as an indicator of economic boom
or recession. Since crude oil is a limited natural re-
source, its price is expected to be determined by sup-
ply and demand. Yet, according to literature, crude oil
is one of the most volatile markets in the world since
its demand is primarily affected by economic activity
(business cycle) and exogenous events such as armed
conflicts and natural disasters (Buyuksahin and Har-
ris, 2011). Traditionally, analysts draw upon techni-
cal analysis which utilizes historical data for predic-
tion. However, historical data rarely provides high-
confidence insights (McCarthy et al., 2019). Com-
plementing technical analysis with additional contem-
porary and relevant information such as news articles
could be a promising strategy for achieving more re-
liable results. Several empirical studies demonstrated
that considering news media significantly improved
forecasts of large market movements (i.e., higher than
50%) of publicly listed assets (Qian and Rasheed,
2007). Therefore, many researchers studied the benefits
of incorporating news data into their prediction
models (Baboshkin and Uandykova, 2021). One
option for applying news to prediction tasks is
sentiment analysis, which quantifies the impact
of news on a certain asset as positive, neutral, or
negative. With the latest advancements in computer
hardware and software, particularly the development
of transformer architectures (Devlin et al., 2018),
modern natural language processing (NLP) algorithms
emerged that are capable of evaluating text in a
contextually aware manner for strategic forecasting.
The observations of Jiang et al. (2020) indicate that
current transformer-based sentiment classifiers can
achieve remarkable accuracies of up to 97.5 %. While
these sentiment analysis methods gained great traction
in prediction tasks for the stock market and
cryptocurrencies, they still play only a minor role in
forecasting crude oil prices. Thus, the benefits of
considering the sentiment of news headlines for crude
oil price predictions seem to be evident.

a: https://orcid.org/0000-0002-1115-8669
b: https://orcid.org/0000-0001-6248-714X
c: https://orcid.org/0000-0002-9141-0886
d: https://orcid.org/0000-0001-6399-045X
The presented research draws upon FinBERT, a
state-of-the-art transformer-based sentiment analysis
model that has been pre-trained for the general finan-
cial market. However, analyzing over a decade of
news headlines relevant to the crude oil market re-
vealed that FinBERT’s sentiment classification does
not deliver any apparent insights into the contempo-
rary development of oil prices. Therefore, we de-
veloped the publicly available CrudeBERT sentiment
analysis model that has been optimized for the crude
oil domain. CrudeBERT extends FinBERT by considering
the economic theory of supply and demand. In
our experiments, CrudeBERT outperforms FinBERT
and provides a promising tool for improving crude oil
price predictions by incorporating information on the
sentiment conveyed in news headlines.

Kaplan, H., Mundani, R., Rölke, H. and Weichselbraun, A.
CrudeBERT: Applying Economic Theory Towards Fine-Tuning Transformer-Based Sentiment Analysis Models to the Crude Oil Market.
DOI: 10.5220/0011749600003467
In Proceedings of the 25th International Conference on Enterprise Information Systems (ICEIS 2023) - Volume 1, pages 324-334
ISBN: 978-989-758-648-4; ISSN: 2184-4992
Copyright © 2023 by SCITEPRESS Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)
The main contributions of this paper can be sum-
marized as (i) developing a method that provides
transformer models with means for identifying the
major supply and demand factors that drive crude oil
futures markets, (ii) fine-tuning general transformer-
based sentiment analysis methods by incorporating
the economic model of supply and demand into these
models, and (iii) conducting extensive experiments
that draw upon multiple prediction settings to bench-
mark the developed method against a baseline (ran-
dom binary classification) and two state-of-the-art
(lexicon- and transformer-based) sentiment analysis
frameworks.
The remainder of this paper is organized as follows:
Section 2 discusses related literature that led
to the modern transformer-era sentiment analysis
applications. Afterward, Section 3 introduces the
CrudeBERT domain-specific affective model for
crude oil markets and its use in predicting
market movements. Section 4 describes the evaluation
setup, performs a comprehensive evaluation of
the CrudeBERT model, and discusses the obtained
results. Section 5 concludes the paper with a summary
and an outlook on future improvements.
2 RELATED WORK
From an industrial standpoint, crude oil is critical to
the world’s economy. Consequently, many research
articles focus on predicting its price with studies vary-
ing from technical to fundamental analysis. This liter-
ature review focuses on articles aimed at forecasting
crude oil prices by including sentiment features.
2.1 Efficient Market Hypothesis
The efficient market hypothesis (EMH) questions
whether information retrieved from news articles does
contain any predictive value at all since it claims that
the price of an asset already considers all publicly
available information. Eugene Fama distinguishes
between the weak, semi-strong, and strong forms of
EMH (Fama, 1970). The weak form claims that the
price reflects only its own price history, thus
making all information beyond historical prices
relevant for forecasting the future price of an asset.
The semi-strong form, on the other hand, holds that
the current price reflects both historical prices and
publicly available information. Therefore, only
confidential information such as insider knowledge
can add value to a prediction, provided it has not
yet altered the current price (Malkiel, 1989). The
strong form of the EMH assumes that prices reflect
historical prices as well as publicly available and
confidential information (Fama, 1970). Hence, the
strong form assumes that applying fundamental
analysis based on any available information cannot
lead to abnormal economic returns. This form of the
EMH is further supported by various studies that
emphasize the notorious difficulty of forecasting crude
oil prices, such as the work of Hamilton, which
concludes that the oil price appears to follow a
random walk with drift (Hamilton, 2008).
Yet, numerous experts have questioned the EMH's
strong and semi-strong forms, arguing that once a
news message is published, the available information
changes and, therefore, the price is expected to adapt.
Qian and Rasheed concluded from their experiments
that predictions based on news can correctly forecast
price fluctuations with greater than 50 % accuracy
(Qian and Rasheed, 2007). Furthermore, according to
Buyuksahin and Harris, crude oil demand is primarily driven
by exogenous events, such as armed conflicts and
natural disasters as well as the presence of specula-
tors such as noise traders. They assume that these
events considerably contribute towards making crude
oil one of the most volatile markets in the world.
They also observed a substantial relationship between
crude oil price changes and the behavior of politically
and economically unstable nations, which often trig-
ger such exogenous events (Buyuksahin and Harris,
2011). This observation is confirmed by Brandt and
Gao’s more recent study, which shows that macro-
economic and geopolitical news has a strong influ-
ence on crude oil, with varying impacts. For exam-
ple, macroeconomic news influences short-term price
movements and also helps to forecast long-term oil
prices. Geopolitical news, on the other hand, typically
has a strong and immediate impact that results in
increased trading volume. However, geopolitical news
delivers no conclusive insights
in terms of forecasting (Brandt and Gao, 2019). Wex
et al. state that forecasts based on the sentiment scores
of news articles that cover exogenous events are sta-
tistically significant (Wex et al., 2013).
2.2 Sentiment Analysis
Sentiment analysis is considered a prevalent classifi-
cation task in NLP, which categorizes affective and
subjective information within entire documents, para-
graphs, and sentences. It has gained increasing popu-
larity due to its vast potential for a variety of applica-
tions such as economics, finance, marketing, political
science, psychology, and human-computer interaction
(Mohammad, 2021). Sentiment in the context of sen-
timent analysis, which is also known as opinion min-
ing, refers to the quantification of natural language
within pre-defined affective dimensions (Weichsel-
braun et al., 2022) such as sentiment polarity which
distinguishes between positive, neutral, and negative
media coverage. Still, there is little discussion about
what sentiment in the context of NLP truly represents
(Hovy, 2015). Generally, researchers assume that au-
thors always express some sentiment while producing
natural language, since emotions, opinions, and ex-
pressions in language are fundamental human traits
(Taboada, 2016). Therefore, sentiment analysis can
also cover complex emotions such as the ones in-
troduced in Plutchik’s Wheel of Emotions (Plutchik,
1982) and the Hourglass of Emotions (Susanto et al.,
2020). However, most literature on financial sentiment
analysis (FSA) tends to break sentiment down into
binary polarities such as positive and negative, since
a binary classification is more suitable for directly
assessing the up or down movements of publicly traded
assets (Li et al., 2014).
2.3 Early NLP Methods for FSA
Sentiment analysis in the financial domain was
introduced in the 1980s. One of the first approaches
to classifying sentiment in text documents was the
Bag-of-Words (BOW) methodology, often referred to
as the lexicon-based technique (Liew, 2016). Since
a text consists of several words (tokens), BOW simply
accumulates the sentiment scores of positive and
negative words to compute the overall sentiment
classification. As the name suggests, these lexicon-based
methods utilize a lexicon consisting of words and their
sentiment values, predetermined by (ideally multiple)
human annotators. One of the most well-known
lexicons for FSA was developed by Loughran and
McDonald and aimed at interpreting liabilities
concerning 10-K filing returns (Loughran and McDonald,
2011). In their later work (Loughran and McDonald,
2016), they published a survey on the use of
text analysis with a focus on accounting and finance.
However, creating lexicons that include all possible
keywords, including negations and word combinations,
is very challenging, since a term's sentiment often
also depends on the context expressed by surrounding
terms or paragraphs.
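The accumulation step of a lexicon-based classifier, and its blindness to context, can be illustrated with a minimal sketch. The lexicon below is a hypothetical toy example with made-up entries; it is not the Loughran and McDonald lexicon, and the punctuation handling is an assumption made for illustration only.

```python
# Hypothetical toy lexicon: word -> sentiment value (not a published lexicon).
TOY_LEXICON = {
    "gain": 1, "growth": 1, "profit": 1,
    "loss": -1, "decline": -1, "risk": -1,
}


def bow_sentiment(text, lexicon=TOY_LEXICON):
    """Sum the lexicon values of all tokens and map the sum to a polarity."""
    tokens = [tok.strip(".,;:!?") for tok in text.lower().split()]
    score = sum(lexicon.get(tok, 0) for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Because word order and negation are ignored, `bow_sentiment("No loss reported this quarter")` is still classified as negative, which is the kind of context problem described above.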
2.4 Machine Learning-Based Sentiment
Analysis
With advancements in computer hardware and software,
modern FSA approaches started to rely heavily
on machine learning. These approaches mostly focused
on supervised learning, in which a model is trained on
annotated datasets containing pairs of inputs and
matching solutions. By correcting and optimizing
themselves based on the (mostly human-curated)
training dataset, these models identify rules and
patterns and attempt to derive a potentially
meaningful generalization. Supervised learning requires
large amounts of annotated training data and is
usually used for classification and regression
tasks (Chollet, 2018). For instance, Recurrent
Neural Networks (RNN) and Long Short-Term Memory
(LSTM) networks are particularly well suited for
sequential data, such as text (Tang et al., 2016).
However, the required training dataset is one
disadvantage of supervised learning, particularly for
classification problems such as sentiment analysis: a
bigger training dataset usually yields better results,
but creating and using large training datasets is
typically expensive (both computationally and
financially). Furthermore, RNNs suffer from
vanishing and exploding gradients, making them un-
suitable for lengthy texts, and are slower to train since
their sequential flow is incompatible with parallel pro-
cessing (Chollet, 2018). One approach to address-
ing this problem is combining supervised approaches
with unsupervised machine learning models. For in-
stance, word embeddings (also known as word vector
models), map words into a vector space that aligns
semantically related words close to each other in an
unsupervised manner. Among the most popular word
vector models are word2vec (Mikolov et al., 2013)
and GloVe (Pennington et al., 2014), which can be
trained on large text corpora, thereby capturing
the word semantics within the corpus. Nevertheless,
word embeddings still lack the power to fully consider
a term's context: once a model has been
trained, words with the same spelling always receive
the same word vector, independent of their context.
The word orange, for example, receives a single vector
regardless of whether it denotes a fruit or a color, so
that its vector represents a mixture of both concepts.
2.5 Transformer-Based Sentiment
Analysis
More recent language models draw upon the attention
mechanism (Bahdanau et al., 2016) which led to sig-
nificant gains in a range of NLP tasks including senti-
ment analysis. The attention mechanism allows neu-
ral networks to resemble the human cognitive func-
tion by selectively focusing on particularly relevant
information while dismissing other less relevant in-
formation. This approach encourages the neural net-
work to spend more computational resources on small
but relevant elements of the data (Bahdanau et al.,
2016) yielding improvements in terms of speed and
accuracy. Vaswani et al. built on the attention
mechanism to develop the transformer architecture,
which allows parallel training and is therefore more
efficient than RNNs. Initially, the transformer
architecture was proposed for neural machine
translation and thus contains an encoder and a
decoder. The encoder consists of multiple identical
layers, each combining multi-head attention with a
fully connected feed-forward network, which allows the
sequence to be evaluated from contextually varying perspectives
(Figure 1). The capability to consider a term’s context
together with the option to draw upon and customize
large pre-trained models has been key to the success
of transformer-based language models for sentiment
analysis (Vaswani et al., 2017).
Figure 1: Components of the Multi-Head Attention Design.
(Vaswani et al., 2017).
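The core operation behind this mechanism can be sketched in a few lines. The following is a minimal, single-head illustration of scaled dot-product attention as defined by Vaswani et al. (2017), written with plain Python lists for readability; the function names are our own, and multi-head attention additionally applies learned projections per head, which are omitted here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """For each query, return a weighted average of the value vectors.

    The weights are softmax(q . k / sqrt(d_k)) over all keys, so tokens
    whose keys match the query more closely contribute more to the output.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```

If all keys are equally similar to a query, the weights are uniform and the output is simply the mean of the value vectors; a key that matches the query strongly dominates the result instead.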
2.6 FinBERT for Financial Sentiment
Analysis
Shortly after the release of the transformer
architecture, Devlin et al. observed that the encoder,
when layered, can also serve as a strong representation
learning model, and for this purpose they developed the
Bidirectional Encoder Representations from
Transformers (BERT) (Devlin et al., 2018). One
noteworthy feature of BERT was its simple customization
for a wide range of NLP tasks with the important
added capability of contextual perception of words
(Yenicelik, 2020). Initially, it was pre-trained on
English Wikipedia and the BookCorpus (Zhu et al., 2015)
to give the model a general comprehension of natural
language. This model served as a foundation for
further adaptations to specific NLP applications and
domains, such as FinBERT (Araci, 2019), which
focuses on sentiment analysis of financial news. To
achieve this, Araci employed a subset of the
Thomson Reuters Text Research Collection (TRC2)
to adapt the model to the domain of financial news,
where occurrences of slang and spelling errors are
minimal. For the task-specific fine-tuning process,
the Financial Phrase Bank training dataset from Malo
et al. (2014) was utilized (Figure 2).
Figure 2: Process of generating FinBERT. (1) General domain pre-training: English books and Wikipedia for general understanding of language. (2) Domain adaptation to financial news: Thomson Reuters Corpora (TRC2) for further training to the domain of news. (3) Task-specific fine-tuning: Financial Phrase Bank as training dataset for sentiment classification.
Compared to the number of papers that used Fin-
BERT as a backend for classification, the proportion
of papers that use it for classifying sentiments towards
crude oil is relatively small.
2.7 RavenPack Event Sentiment Score
To capture the overall sentiment of the market,
RavenPack developed a lexicon-based news sentiment
index, namely the Event Sentiment Score (ESS), which
is a granular score between -1 (negative sentiment)
and 1 (positive sentiment). This score is determined
by systematically comparing stories that are typically
classified as having a positive or negative financial or
economic impact through manual assessments by
experts. Based on this human-curated lexicon, the ESS
algorithm can examine a wide range of sentiment
proxies that are frequently mentioned in financial
news, allowing it to classify the sentiment of events
ranging from earnings reports to natural disasters
(Hafez et al., 2020).
3 METHOD
This section introduces the CrudeBERT model, which
extends FinBERT by incorporating knowledge of an
event’s expected impact on crude oil supply and
demand. Section 3.1 presents an overview of the
news headline and crude oil price datasets used,
followed by a discussion of the data pre-processing
steps in Section 3.2. Afterward, we analyze the
shortcomings of FinBERT (Section 3.3) and address
them by developing the CrudeBERT model (Section 3.4).
3.1 Datasets
3.1.1 News Data
The dataset containing news information consists of
around 46,000 headlines published between 1 January
2000 and 1 April 2021 with high relevance to the
topic of crude oil, obtained through the RavenPack
Realtime News Discovery platform. Similar to Li
et al., we limit our analysis to news headlines, since
they are more easily accessible and have lower
requirements in terms of pre-processing, storage, and
computational power (Li et al., 2019). The headlines
used in this paper originate from 1034 unique news
sources. The largest share was published
on the Dow Jones newswires (approx. 21,200),
followed by Reuters (approx. 3,000), Bloomberg
(approx. 1,100), and Platts (approx. 870). Around
400 news sources contributed only
a single headline. It should be noted that RavenPack
has added new publishers over the years, which led to
a steady increase in the number of available sources
over time, especially until 2012. To ensure rich news
coverage with diverse sources, we limit our
evaluations to the period from 2012 onward.
3.1.2 Price Data
The oil market is dominated by the two most prevalent
grades, Brent Crude and West Texas Intermediate
(WTI), which dictate the price of crude oil (EIA,
2021). Brent crude is the benchmark for crude oil
in Africa, Europe, and the Middle East, accounting
for almost two-thirds of the global supply. WTI, on
the other hand, is the favored benchmark used by the
United States of America. Since all of the headlines
in our dataset are in English, the WTI futures prices
were regarded as potentially more relevant for our
research. The historical values of WTI have been
acquired from the financial market platform invest-
ing.com for the same period as the headlines.
3.2 Data Pre-Processing
3.2.1 Sentiment Data Normalization
We normalized the sentiment values of headlines,
computed by the sentiment classifiers, using
z-statistics with the aim of integrating the market’s
relative mood into the classification model. By
normalizing sentiment data over a sliding window, we
account for the efficient market hypothesis (i.e., the
market price reflects all publicly available
information) by assuming that only new information
that either disappoints or exceeds stakeholder
expectations results in significant price changes.
Equation 1 outlines the normalization of the sentiment
value at time point t based on a weekly sliding
window of size w = 5 with

$$\mathrm{sent}_{\mathrm{norm},t} = \frac{\mathrm{sent}_{t} - \overline{\mathrm{sent}}_{t,w}}{\sigma_{t,w}} \qquad (1)$$

where $\overline{\mathrm{sent}}_{t,w}$ indicates the average sentiment at time
points $t, t-1, \ldots, t-w$ within the sliding window, and
$\sigma_{t,w}$ the corresponding standard deviation.
3.2.2 Price Data Normalization
Due to market volatility, commodity and stock prices
show random fluctuations that overlap short-term and
long-term trends within the market. Therefore, we
also normalize price data using z-statistics to bet-
ter distinguish between significant market movements
and random fluctuations. As with $\mathrm{sent}_{\mathrm{norm},t}$, we
normalize price data for a weekly sliding window of
w = 5 (due to the market being closed over the
weekends) as outlined in the equation below:

$$\mathrm{price}_{\mathrm{norm},t} = \frac{\mathrm{price}_{t} - \overline{\mathrm{price}}_{t,w}}{\sigma_{t,w}} \qquad (2)$$

with $\overline{\mathrm{price}}_{t,w}$ and $\sigma_{t,w}$ indicating the average price and
standard deviation within the chosen sliding window.
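The sliding-window normalization used for both sentiment and price data can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the window spans the time points t-w, ..., t as described in the text, the population standard deviation is used (the paper does not state which estimator was applied), and positions without a full window or with zero deviation are returned as None.

```python
from statistics import mean, pstdev


def rolling_zscore(series, w=5):
    """Z-normalize each value against the window t-w, ..., t.

    Assumptions (not specified in the paper): population standard
    deviation; None where the window is incomplete or has no variation.
    """
    normalized = []
    for t in range(len(series)):
        if t < w:
            normalized.append(None)  # not enough history yet
            continue
        window = series[t - w : t + 1]
        sigma = pstdev(window)
        if sigma == 0:
            normalized.append(None)  # z-score undefined for a flat window
        else:
            normalized.append((series[t] - mean(window)) / sigma)
    return normalized
```

The same function can be applied to the daily sentiment series and to the daily price series, so that both are expressed relative to the market's recent level and variability.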
3.2.3 Handling Multiple Daily Sentiment Scores
Days covered by multiple news headlines yield multiple
sentiment scores, which need to be merged for
that given day. Prior work by Hafez et al. concluded
that using the sum rather than the mean captures
both the sentiment score and the sentiment volume
at the same time, even though the normalization
of scores severely shrinks the relative impact of
days with a lower news volume. According to their
research, this strategy resulted in superior outcomes
in their experiments (Hafez et al., 2018).
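A sum-based daily aggregation of this kind can be sketched as below. The input format, a list of (day, score) pairs with ISO date strings, is an assumption made for illustration.

```python
from collections import defaultdict


def aggregate_daily(scored_headlines):
    """Sum all headline sentiment scores that fall on the same day.

    `scored_headlines` is assumed to be an iterable of (day, score) pairs.
    Summing, unlike averaging, lets days with many headlines weigh more.
    """
    totals = defaultdict(float)
    for day, score in scored_headlines:
        totals[day] += score
    return dict(sorted(totals.items()))
```

With this approach, two mildly positive headlines on one day yield a larger daily score than a single one, so the result reflects both polarity and news volume.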
3.2.4 Handling Gaps in the Dataset
Rows containing gaps caused either by missing
headlines or missing prices (due to market closures)
were dropped entirely. Furthermore, all
values have been scaled to the range between -1 and 1.
The final dataset covers the period from 1 January 2012 to
1 April 2021 and yields 3376 rows of data.
The dataset aligns the summarized daily sentiment
scores with the price change of the following day
($\mathrm{Return}_{t+1}$), i.e., it assumes that markets will adapt to
new information by the next day at the latest.
Sentiment scores vary between positive (+1) and negative
(-1) values. The price, in contrast, always remains
positive, with the notable exception of 20 April 2020,
when prices became negative for a short period. Price
changes (i.e., Returns) are, therefore, better suited for
indicating the market’s reaction to news coverage. We
compute the daily Returns of WTI crude oil futures as
outlined in Equation 3 and compare them to the
sentiment scores.
$$\mathrm{Return}_{t} = \frac{\mathrm{Price}_{t} - \mathrm{Price}_{t-1}}{\mathrm{Price}_{t-1}} \qquad (3)$$
Interpreting the oil price as the result of cumu-
lative returns allows a comparison to the cumulative
sentiment scores, as illustrated in Figure 7.
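The return computation and the alignment of each day's sentiment with the following day's return can be sketched as follows. The function names and the list-based input format are assumptions for illustration; the sketch also assumes that no previous-day price is zero.

```python
def daily_returns(prices):
    """Relative price change between consecutive days (Equation 3).

    The result has one element fewer than `prices`: element i is the
    return of day i + 1. Assumes no previous-day price equals zero.
    """
    return [
        (prices[t] - prices[t - 1]) / prices[t - 1]
        for t in range(1, len(prices))
    ]


def align_with_next_day_return(sentiments, prices):
    """Pair the sentiment of day t with the return of day t + 1."""
    returns = daily_returns(prices)
    # returns[t] already holds the return of day t + 1, so no shift is needed;
    # zip drops the last day, which has no following-day return.
    return list(zip(sentiments, returns))
```

Pairing sentiment with the next day's return encodes the assumption stated above that markets adapt to new information by the following day at the latest.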
3.3 Shortcomings of FinBERT
Figure 7 performs a visual comparison of FinBERT’s
cumulative sentiment scores (red) and the price (blue)
history of WTI crude oil futures to assess FinBERT’s
forecasting potential. The plot does not show any ap-
parent relationship or trends and outlines the need for
an additional inquiry into the underlying causes of
this poor relationship and potential ways for correct-
ing it.
Adam Smith’s (1776) price theory advocates that
the price of limited resources such as crude oil is de-
termined by supply and demand. In this context, sup-
ply refers to the amount of a product or service that
a provider will sell at a given price during a specific
period. Demand denotes the amount of a product or
service that a buyer is willing to acquire during the
same period for a given price. The interaction be-
tween suppliers and customers yields a competitive
market in which the price of products and services
is determined by the equilibrium between supply and
demand (Smith, 1776). For example, if demand re-
mains constant but supply falls, the resulting shortage
will cause prices to rise. A shortage can also occur if
the supply remains constant but the demand rises.
In contrast, increased supply with constant de-
mand will result in a surplus and consequently a de-
crease in prices. A surplus can also emerge if supply
remains constant but demand falls. This logic behind
supply and demand can be summed up as follows:
Less supply → shortage → higher price
More supply → surplus → lower price
Less demand → surplus → lower price
More demand → shortage → higher price
A drill-down analysis that compared news head-
lines to FinBERT sentiment scores revealed that Fin-
BERT tended to produce dubious outcomes. Given
that crude oil is a publicly-traded asset and FinBERT
has been trained on general financial market news this
result seems arguably surprising. Having said that,
according to Xing et al. such behavior is expected
when utilizing general sentiment analysis methods for
a specific domain and is known as the domain adap-
tation problem (Xing et al., 2020). Weichselbraun
et al. (2022) also emphasize the need for domain-
specific affective models and present methods for cre-
ating such models.
Interpreting the FinBERT scores of the news headlines
listed in Table 1 based on the impact of supply
and demand on prices reveals some serious issues
with FinBERT’s assessment of strongly positive (+1),
strongly negative (-1), and neutral (0) events.
Headlines suggesting a drop in supply (e.g., due to
accidents at oil refineries and oil platforms) tend to
receive negative FinBERT scores, although the
corresponding events likely lead to higher crude oil
prices. The FinBERT model probably returns these
negative scores since accidents are rarely good news
in finance and due to moral assessments derived from
the human-made annotations within the Financial Phrase Bank.
Headlines implying a rise in supply (e.g., due to
oil discoveries and increasing exports), in contrast,
frequently yield neutral FinBERT scores.
When it comes to a decline in demand (e.g., induced
by decreased imports), the resulting surplus
should lead to a price decrease. This assessment is
also confirmed by two of the three FinBERT scores
for the analyzed headlines (rows Surplus, Demand
Decrease in Table 1). The first headline,
in contrast, yields a positive FinBERT score,
since FinBERT is not able to correctly interpret the
fall in imports indicated by negative numbers such as
-16.0 %. This limitation should be taken into
consideration when utilizing FinBERT, since a substantial
number of headlines contain such values. Lastly,
headlines that indicate increasing demand should
result in higher oil prices. The experiments with
FinBERT confirm that it considers headlines conveying
increased demand mostly as positive.
3.4 Extending FinBERT to CrudeBERT
In the next step, we extended FinBERT to Crude-
BERT to consider the economic law of supply and
demand in the model’s assessment.
3.4.1 Training Dataset Generation
To generate a domain-specific labeled silver standard
for CrudeBERT, we analyzed several hundred head-
lines to determine frequently recurring topics, key-
words indicating these topics, and their likely impact
on the supply and demand of crude oil. This process
identified the following major topics: accidents, oil
discoveries, changes in exports, changes in imports,
changes in demand, pricing, supply, pipeline
limitations, drilling, and spillage.

Table 1: Sample of headlines and output of FinBERT.

| Market effect | Cause | Headline | Expected Sentiment Score | FinBERT Sentiment Score |
|---|---|---|---|---|
| Shortage | Supply Decrease | Major Explosion, Fire at Oil Refinery in Southeast Philadelphia | Positive | -0.886292 |
| Shortage | Supply Decrease | PETROLEOS confirms Gulf of Mexico oil platform accident | Positive | -0.507213 |
| Shortage | Supply Decrease | CASUALTIES FEARED AT OIL ACCIDENT NEAR IRANS BORDER | Positive | -0.901763 |
| Shortage | Demand Increase | EIA Chief expects Global Oil Demand Growth 1 M B/D to 2011 | Positive | 0.930822 |
| Shortage | Demand Increase | Turkey Jan-Oct Crude Imports +98.5% To 57.9M MT | Positive | 0.866315 |
| Shortage | Demand Increase | China’s crude oil imports up 78.30% in February 2019 | Positive | 0.922963 |
| Surplus | Demand Decrease | China February Crude Imports -16.0% On Year | Negative | 0.540711 |
| Surplus | Demand Decrease | Turkey May Crude Imports down 11.0% On Year | Negative | -0.965965 |
| Surplus | Demand Decrease | Japan June Crude Oil Imports decrease 10.9% On Yr | Negative | -0.955271 |
| Surplus | Supply Increase | Iran’s’ Feb Oil Exports +20.9% On Mo at 1.56M B/D - Official | Negative | 0.139093 |
| Surplus | Supply Increase | Apache announces large petroleum discovery in Philadelphia | Negative | 0.089624 |
| Surplus | Supply Increase | Turkey finds oil near Syria, Iraq border | Negative | 0.076210 |
We then queried the RavenPack repository for
headlines containing the identified keywords and as-
signed them to the corresponding topic. The head-
line in Figure 3, for instance, was assigned the topic
changes in imports due to the occurrence of the word
import in the headline. Afterward, we determined the
direction of the change by classifying the headline’s
polarity based on the presence of terms that indicate
an increase, a decrease, or constant levels.
Turkey’s crude oil imports up 78.3% since 2021
Matched Topic = IMPORT
Matched Polarity = INCREASE
Figure 3: Example of Detected Topic and Polarity.
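The two-step labeling illustrated in Figure 3 can be sketched as a simple keyword matcher. The keyword lists below are hypothetical examples chosen for illustration; the paper does not publish the actual keywords, and the treatment of signed percentages such as -16.0% is our own assumption.

```python
import re

# Hypothetical keyword lists; the actual keywords are not given in the paper.
TOPIC_KEYWORDS = {
    "IMPORT": ("import",),
    "EXPORT": ("export",),
    "DISCOVERY": ("discovery", "finds oil"),
    "ACCIDENT": ("accident", "explosion"),
}
POLARITY_KEYWORDS = {
    "INCREASE": {"up", "rise", "rises", "increase", "growth"},
    "DECREASE": {"down", "fall", "falls", "decrease", "drop"},
    "NO_CHANGE": {"steady", "unchanged", "flat"},
}


def detect_topic(headline):
    """Return the first topic whose keyword occurs in the headline."""
    text = headline.lower()
    for topic, keywords in TOPIC_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return topic
    return None


def detect_polarity(headline):
    """Return the direction of change indicated by words or signed numbers."""
    tokens = set(re.findall(r"[a-z]+", headline.lower()))
    for polarity, keywords in POLARITY_KEYWORDS.items():
        if tokens & keywords:
            return polarity
    # Assumption: a leading sign before a number also signals the direction.
    if re.search(r"(?<![\w.])\+\d", headline):
        return "INCREASE"
    if re.search(r"(?<![\w.])-\d", headline):
        return "DECREASE"
    return None
```

For the headline in Figure 3, `detect_topic` yields IMPORT and `detect_polarity` yields INCREASE. A rule of this kind is transparent but depends entirely on the chosen keywords, which is why the resulting labels form a silver standard rather than a gold standard.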
The described approach enabled us to provide
topic and direction labels for around 30,000 head-
lines. Table 2 summarizes the ten detected frequently
reoccurring topics and their corresponding frequen-
cies (we do not report the number of headlines with
overlapping topics since it has been negligibly small):
Table 2: Frequency of Reoccurring Topics in the Domain of Crude Oil.

| Topic | Increase | No change | Decrease |
|---|---|---|---|
| Supply change | ca. 5900 | ca. 350 | ca. 5700 |
| Demand change | ca. 1300 | ca. 50 | ca. 800 |
| Export change | ca. 2000 | ca. 150 | ca. 1500 |
| Import change | ca. 2800 | ca. 50 | ca. 2300 |

Price change Spill Discovery Drilling
Increase Decrease ca. 2300 ca. 1600 ca. 100
Accident Pipeline issue
ca. 1600 ca. 1300
ca. 400 ca. 100
Assessing these labels based on the price theory of supply and demand allowed the creation of a domain-specific silver standard that classifies the headlines into positive (i.e., indicating increasing crude oil prices), negative (i.e., likely to cause decreasing crude oil prices), and neutral (i.e., should not affect the crude oil price), as outlined below (Figure 4):
Lower Prices (score: -1): Headlines covering events such as drilling, discovery, increased exports, or simply a rise in oil production are likely to cause an increase in supply. Similarly, headlines stating that oil imports or consumption are decreasing should, in principle, result in a surplus of oil and, therefore, lower prices.
Higher Prices (score: +1): Headlines announcing accidents, pipeline constraints, oil spills, or a direct decline in oil supply, in turn, indicate a possible oil shortage due to the negative impact of these events on supply. Likely shortages can also be inferred from news indicating a rise in demand, an increase in imports, or a drop in exports. Generally, news that signals a scarcity of oil or a price increase should have a positive impact on the price.
No Price Changes (score: 0): A neutral score has been assigned to the relatively small number of headlines that report no change in supply, demand, imports, or exports.
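The three groups above amount to a lookup from a (topic, direction) label to a sentiment score. The sketch below encodes lower prices as -1, higher prices as +1, and no change as 0; the label strings and the function name `silver_score` are illustrative assumptions rather than the identifiers used in the original pipeline.

```python
# Labels that signal scarcity and hence higher prices (score +1).
PRICE_UP = {
    ("price", "increase"), ("supply", "decrease"), ("demand", "increase"),
    ("export", "decrease"), ("import", "increase"),
    ("accident", None), ("pipeline issue", None), ("spill", None),
}
# Labels that signal a surplus and hence lower prices (score -1).
PRICE_DOWN = {
    ("price", "decrease"), ("supply", "increase"), ("demand", "decrease"),
    ("export", "increase"), ("import", "decrease"),
    ("drilling", None), ("discovery", None),
}
# Labels that report steady levels (score 0).
NEUTRAL = {
    ("supply", "no change"), ("demand", "no change"),
    ("export", "no change"), ("import", "no change"),
}


def silver_score(topic, direction=None):
    """Map a (topic, direction) label to its silver-standard sentiment score."""
    key = (topic, direction)
    if key in PRICE_UP:
        return 1
    if key in PRICE_DOWN:
        return -1
    if key in NEUTRAL:
        return 0
    raise ValueError(f"no rule for label {key!r}")


print(silver_score("import", "increase"))  # rising imports signal scarcity
print(silver_score("discovery"))           # a new discovery signals more supply
```

Raising an error for unknown labels, rather than defaulting to neutral, keeps unlabeled headlines from silently inflating the small neutral class.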
3.4.2 Model Fine-Tuning
The labeled headlines with the corresponding
domain-specific sentiment scores yielded the S&D-
Dataset which contains approximately 14,000 nega-
tive, 500 neutral, and 15,000 positive headlines.
[Figure 4 content, grouped by assigned label:]
OIL PRICE INCREASE (ca. 15,000 headlines): price increase, supply decrease, demand increase, exports decrease, pipeline constraint, spills, imports increase, accident.
OIL PRICE DECREASE (ca. 14,000 headlines): price decrease, supply increase, demand decrease, exports increase, drilling, oil discovery, imports decrease.
OIL PRICE SAME (ca. 500 headlines): supply steady, demand steady, exports steady, imports steady.
Figure 4: Assignment of the Labelled Topics.
We split the dataset into training (60%), test (20%), and validation (20%) partitions (keeping the distribution across classes), and used the training partition for fine-tuning FinBERT, resulting in the CrudeBERT classifier (Figure 5).
ICEIS 2023 - 25th International Conference on Enterprise Information Systems
[Figure 5 stages: (1) General domain pre-training: English books and Wikipedia for general understanding of language. (2) Domain adaptation to financial news: Thomson Reuters Corpora (TRC2) for further training to the domain of news. (3) Task-specific fine-tuning: labeled topics of the query search based on Smith’s price theory.]
Figure 5: Process of Fine-tuning FinBERT to CrudeBERT.
Despite the relatively low number of neutral headlines, we included them in training to provide the neural network with examples of lower domain-specific sentiment scores that have not been assigned to one of the two extremes (i.e., +1 for positive and -1 for negative news).
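A 60/20/20 split that keeps the class distribution, as described above, could be implemented along the following lines. This is a sketch using only the Python standard library; the function name, the fixed seed, and the toy data (the class sizes scaled down by a factor of 100) are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(examples, fractions=(0.6, 0.2, 0.2), seed=42):
    """Split (text, label) pairs into train/test/validation partitions.

    Each class is shuffled and divided separately, so every partition keeps
    roughly the same class shares as the full dataset.
    """
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    rng = random.Random(seed)
    train, test, validation = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_train = round(len(items) * fractions[0])
        n_test = round(len(items) * fractions[1])
        train += items[:n_train]
        test += items[n_train:n_train + n_test]
        validation += items[n_train + n_test:]
    return train, test, validation


# Toy data mimicking the class imbalance of the S&D-Dataset (scaled down).
data = ([(f"negative headline {i}", -1) for i in range(140)]
        + [(f"neutral headline {i}", 0) for i in range(5)]
        + [(f"positive headline {i}", 1) for i in range(150)])
train, test, validation = stratified_split(data)
print(len(train), len(test), len(validation))
```

Splitting per class matters most for the neutral headlines: with only a few hundred of them, a purely random split could leave a partition with almost no neutral examples.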
A preliminary evaluation of the CrudeBERT classifier on the silver test dataset yielded, despite the class imbalance, a macro F1 score of 0.97, a macro accuracy of 0.98, and a macro recall of 0.97 (Figure 6). In contrast, the same evaluation with the FinBERT classifier yielded a macro F1 score of 0.29, a macro accuracy of 0.59, and a macro recall of 0.32. Given the substantial number of headlines used for fine-tuning and their relatively short length (on average 10.4 words per headline), these improvements are not surprising.
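Macro-averaged scores weight every class equally, which is why they remain informative despite the small neutral class. The sketch below derives macro precision, recall, and F1 from a confusion matrix; the function name and the example counts are made up for illustration and are not the values behind Figure 6.

```python
def macro_scores(confusion):
    """Return (macro precision, macro recall, macro F1).

    confusion[i][j] is the number of items whose true class is i and whose
    predicted class is j. Each class contributes equally to the averages.
    """
    n = len(confusion)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        true_positives = confusion[k][k]
        predicted_as_k = sum(confusion[i][k] for i in range(n))
        actually_k = sum(confusion[k])
        precision = true_positives / predicted_as_k if predicted_as_k else 0.0
        recall = true_positives / actually_k if actually_k else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# A classifier that always predicts the first class reaches 80% plain
# accuracy on this imbalanced toy matrix, yet its macro recall is only 0.5.
always_first = [[8, 0], [2, 0]]
print(macro_scores(always_first))
```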
Figure 6: Confusion Matrices of the Two Transformer-based Financial Sentiment Classifiers on the Silver Dataset: (a) FinBERT, (b) CrudeBERT.
The qualitative comparison in Figure 7 further supports our initial intuition that FinBERT’s lack of knowledge of an event’s impact on supply and demand seriously limits its suitability for prediction tasks. Consequently, unlike the fine-tuned CrudeBERT model and the commercial classifier of RavenPack, it fails to track historical price movements.
[Figure 7 plot: scaled value (0 to 1) over date (2012 to 2020) for the series Futures, RavenPack, FinBERT, and CrudeBERT (ours).]
Figure 7: Comparison of WTI Crude Oil Futures Prices and Different Cumulative Sentiment Scores (Scaled).
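A curve such as those in Figure 7 can be obtained by accumulating daily sentiment scores and rescaling the result to the range from 0 to 1. How the series were actually scaled is not stated; the sketch below assumes plain min-max scaling, and the function name is hypothetical.

```python
from itertools import accumulate


def cumulative_scaled(daily_scores):
    """Cumulate daily sentiment scores and min-max scale them to [0, 1]."""
    cumulative = list(accumulate(daily_scores))
    lowest, highest = min(cumulative), max(cumulative)
    if highest == lowest:
        # A flat series has no range to scale; map it to zero.
        return [0.0] * len(cumulative)
    return [(value - lowest) / (highest - lowest) for value in cumulative]


# Running totals are 1, 2, 1, 3, which scale to 0.0, 0.5, 0.0, 1.0.
print(cumulative_scaled([1, 1, -1, 2]))
```

Scaling each series separately makes the curves comparable in shape but discards their absolute levels, so such a plot supports only a qualitative comparison.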
4 EVALUATION
The following experiments leverage three different sentiment classifiers (FinBERT, CrudeBERT, and RavenPack ESS) to assess the potential of analyzing headlines for predicting the direction of the next day’s (Return_{t+1}) change in crude oil futures prices, using a two-class higher/lower price classification schema.
The evaluation considers the period between 1
January 2012 and 1 April 2021 consisting of 3376
days’ worth of data. We use precision, recall, and
the F1 metric to assess the predictive potential of the
evaluated classifiers.
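The two-class setup can be illustrated with a small sketch that derives the next-day direction from a price series and compares it with sentiment-based predictions. Several details are assumptions made for illustration: unchanged prices are counted as "down", a daily sentiment score above zero is read as an "up" prediction, and the prices and scores are made-up values.

```python
def next_day_directions(prices):
    """Label day t as 'up' if the price on day t+1 is higher, else 'down'."""
    return ["up" if following > current else "down"
            for current, following in zip(prices, prices[1:])]


def predict_direction(sentiment, threshold=0.0):
    """Turn a daily sentiment score into a two-class price prediction."""
    return "up" if sentiment > threshold else "down"


prices = [50.0, 51.5, 51.0, 52.2]   # made-up closing prices for days 0..3
sentiment = [0.4, -0.2, 0.1]        # made-up daily scores for days 0..2

actual = next_day_directions(prices)
predicted = [predict_direction(score) for score in sentiment]
correct = sum(a == p for a, p in zip(actual, predicted))
print(actual, predicted, correct)
```

Because the label for day t depends on the price of day t+1, the last day of the series has no label and is dropped from the comparison.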
Table 3 summarizes the evaluation results. On average, CrudeBERT outperforms FinBERT, RavenPack, and a random baseline for binary classification. Applying FinBERT without any customizations to the prediction task seems to be counterproductive since it yields worse results than the random baseline. Fine-tuning FinBERT with the presented domain adaptation method considerably improves the method’s performance. CrudeBERT’s overall predictions also surpass the results from RavenPack’s proprietary sentiment classifier, although these differences are less pronounced. In terms of F1, CrudeBERT performs slightly worse for price-up predictions but considerably better at predicting price-down movements.
Figure 8 presents a confusion matrix that compares the predicted label for each classifier with the following day’s price changes of WTI crude oil futures (Return_{t+1}).
Figure 8: Confusion Matrices Comparing the Price Changes Predicted by RavenPack, FinBERT, and CrudeBERT with the Recorded Price Changes of the Following Day WTI Crude Oil Futures. Panels: (a) FinBERT, (b) Random, (c) RavenPack, (d) CrudeBERT.
Table 3: Classification Report of Different Sentiment Classifiers for Predicting Following Day WTI Oil Futures.
Metric     Category    FinBERT  Random  RavenPack  CrudeBERT
Precision  Price down  0.49     0.51    0.51       0.53
Precision  Price up    0.44     0.50    0.51       0.53
Precision  Macro       0.46     0.51    0.51       0.53
Recall     Price down  0.85     0.51    0.47       0.53
Recall     Price up    0.11     0.50    0.55       0.52
Recall     Macro       0.48     0.51    0.51       0.53
F1-Score   Price down  0.62     0.51    0.49       0.53
F1-Score   Price up    0.18     0.50    0.53       0.52
F1-Score   Macro       0.40     0.51    0.51       0.53
To determine whether the improvements provided by CrudeBERT are statistically significant, we drew upon the SciPy (https://scipy.org) stats package to perform Pearson’s chi-square test. When compared to either FinBERT or the random baseline, both CrudeBERT and RavenPack yield significantly better results at the 0.05 significance level. The improvements from RavenPack to CrudeBERT (1721 versus 1773 correct predictions) are less substantial and have only been judged significant at the 0.10 significance level.
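The comparison between RavenPack and CrudeBERT can be retraced with a small Pearson chi-square computation. Which counts entered the reported test is not specified, so the sketch below rests on two assumptions: each classifier made one prediction for each of the 3376 days, and CrudeBERT's correct/incorrect counts are tested as observed frequencies against RavenPack's counts as expected frequencies (a goodness-of-fit setup, which `scipy.stats.chisquare` also covers). It uses only the standard library, relying on the identity that the chi-square survival function with one degree of freedom equals erfc(sqrt(x/2)).

```python
import math


def chi_square_two_categories(observed, expected):
    """Pearson chi-square statistic and p-value for exactly two categories.

    With two categories the test has one degree of freedom, for which the
    p-value can be written as erfc(sqrt(statistic / 2)).
    """
    if len(observed) != 2 or len(expected) != 2:
        raise ValueError("this sketch only handles two categories")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


days = 3376                            # assumed number of predictions
crudebert = [1773, days - 1773]        # correct, incorrect
ravenpack = [1721, days - 1721]        # correct, incorrect

statistic, p_value = chi_square_two_categories(crudebert, ravenpack)
print(round(statistic, 3), round(p_value, 3))
```

Under these assumptions the p-value lies between 0.05 and 0.10, which agrees with the significance levels reported above. A contingency-table test on the same counts would be a different design and need not lead to the same conclusion.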
5 OUTLOOK AND CONCLUSION
Predicting market movements based on news head-
lines is still a very challenging task, as outlined in
the experiments conducted in Section 4. Even Fin-
BERT, a state-of-the-art sentiment classifier that con-
tains knowledge about the general financial domain,
is unable to offer helpful insights into the future price
fluctuations of commodities like crude oil when used
without any domain adaptations.
The presented paper, therefore, introduces a
method for fine-tuning FinBERT based on news head-
lines. Our approach selects frequently reoccurring
topics that cover events illustrating fundamental mar-
ket dynamics such as the interplay between supply
and demand. A frequency analysis identifies these
topics which are then used as keywords in search
queries for collecting additional suitable headlines.
Classifying the retrieved headlines based on Adam
Smith’s price theory allows the creation of a sil-
ver standard dataset, which serves as a practical and
cost-effective alternative to human-curated training
datasets. Applying this method to the domain of
crude oil led to the creation of a silver standard that
has then been used for fine-tuning FinBERT to create
CrudeBERT, a domain-specific affective model that
provides significantly better results than the original
transformer model. In our experiments, which cover crude oil futures price movements over a nine-year period, CrudeBERT outperforms FinBERT and a random baseline at a significance level of 0.05. CrudeBERT even yields better results than RavenPack’s proprietary sentiment analysis model, which has been optimized over years of development, although the observed improvements are only significant at the 0.10 significance level.
Future research on evaluation methods and metrics will help to better understand the relationship between the model’s predictions and future crude oil prices. The presented experiments only shed light upon its short-term prediction performance (i.e., Return_{t+1}, which covers the next business day). Thus, further research is required to investigate CrudeBERT’s suitability for long-term strategies and in different economic environments (e.g., during times of economic boom or recession).
It is also noteworthy that news headlines alone
rather than the whole article seem to be sufficient
for providing insights into the likely direction of
price changes. Despite the presented improvements,
CrudeBERT still has limitations and will be subject
to further developments. We also intend to provide
CrudeBERT with the ability to distinguish named en-
tities (e.g. countries and oil companies) and numer-
ical clues (e.g. increased by 10 % and increased by
1 %) to obtain a more fine-grained sentiment score.
This improved indicator should no longer be limited
to providing information on the direction of price
movements but also express their valence. Consid-
ering news volume seems to be another strategy for
assessing an event’s impact on the market.
Future research will also address the silver stan-
dard generation process. The current process, for in-
stance, does not contain any additional logic for hand-
ling headlines with contradictory information on fu-
ture supply and demand (e.g., “Ivory Coast Jan Crude
Oil Exports -1 % On Yr, Imports -3 %”). We, there-
fore, plan to develop strategies for identifying and
processing such mixed-signal news headlines.
Furthermore, we aim to assess the feasibility of
extending the presented method to other commodities
such as perishable (e.g., coffee beans), non-perishable
(e.g., natural gas), precious (e.g., gold), and non-
precious (e.g., iron ore) commodities, where pricing
may be influenced by similar factors.
ACKNOWLEDGMENT
We would like to extend our gratitude to Prof Dr Hans
Wernher van de Venn and the Institute of Mechatronic
Systems at Zurich University of Applied Sciences for
their generous support of this research. In addition,
we would like to thank Dr Adrian M.P. Brașoveanu
for his valuable inputs on suitable evaluations for the
CrudeBERT model. We would also like to thank Dr
Sahand Haji Ali Ahmad for his assessment of the rel-
evance of the news categories used in developing the
silver standard dataset.
REFERENCES
Araci, D. (2019). FinBERT: Financial Sentiment
Analysis with Pre-trained Language Models.
arXiv:1908.10063 [cs]. arXiv: 1908.10063.
Baboshkin, P. and Uandykova, M. (2021). Multi-source
Model of Heterogeneous Data Analysis for Oil Price
Forecasting. International Journal of Energy Eco-
nomics and Policy, 11(2):384–391.
Bahdanau, D., Cho, K., and Bengio, Y. (2016). Neu-
ral Machine Translation by Jointly Learning to Align
and Translate. arXiv:1409.0473 [cs, stat]. arXiv:
1409.0473.
Brandt, M. W. and Gao, L. (2019). Macro fundamentals or
geopolitical events? A textual analysis of news events
for crude oil. Journal of Empirical Finance, 51:64–94.
Buyuksahin, B. and Harris, J. (2011). Do Speculators Drive
Crude Oil Futures Prices? The Energy Journal, 32(2):167–202.
Chollet, F. (2018). Deep learning with Python. Manning
Publications Co, Shelter Island, New York. OCLC:
ocn982650571.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova,
K. (2018). BERT: Pre-training of Deep Bidirec-
tional Transformers for Language Understanding.
arXiv:1810.04805 [cs]. arXiv: 1810.04805.
EIA, U. E. I. A. (2021). Table Definitions, Sources, and
Explanatory Notes.
Fama, E. F. (1970). Efficient Capital Markets: A Review of
Theory and Empirical Work. The Journal of Finance,
25(2):383.
Hafez, P., Matas, R., Grinis, I., Gomez, F., Kangrga, M.,
and Liu, A. (2020). Factor Investing With Sentiment:
A Look at Asia-Pacific Markets. White Paper.
Hafez, P., Matas, R., Lautizi, F., A. Guerrero-Colón, J., Gómez, M., and Gómez, F. (2018). Effects of Event Sentiment Aggregation: Sum vs. Mean. White Paper, RavenPack.
Hamilton, J. (2008). Understanding Crude Oil Prices. Tech-
nical Report w14492, National Bureau of Economic
Research, Cambridge, MA.
Hovy, E. H. (2015). What are Sentiment, Affect, and Emo-
tion? Applying the Methodology of Michael Zock to
Sentiment Analysis. In Gala, N., Rapp, R., and Bel-
Enguix, G., editors, Language Production, Cognition,
and the Lexicon, pages 13–24. Springer International
Publishing, Cham.
Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Zhao, T.
(2020). SMART: Robust and Efficient Fine-Tuning
for Pre-trained Natural Language Models through
Principled Regularized Optimization. Proceedings
of the 58th Annual Meeting of the Association for
Computational Linguistics, pages 2177–2190. arXiv:
1911.03437.
Li, X., Shang, W., and Wang, S. (2019). Text-based crude
oil price forecasting: A deep learning approach. In-
ternational Journal of Forecasting, 35(4):1548–1560.
Li, X., Xie, H., Chen, L., Wang, J., and Deng, X. (2014).
News impact on stock price return via sentiment anal-
ysis. Knowledge-Based Systems, 69:14–23.
Liew, J. S. Y. (2016). Fine-grained Emotion Detection in
Microblog Text. PhD thesis.
Loughran, T. and McDonald, B. (2011). When Is a Liability
Not a Liability? Textual Analysis, Dictionaries, and
10-Ks. The Journal of Finance, 66(1):35–65.
Loughran, T. and McDonald, B. (2016). Textual analysis
in accounting and finance: A survey. Journal of Ac-
counting Research, 54(4):1187–1230.
Malkiel, B. G. (1989). Efficient market hypothesis. In Fi-
nance, pages 127–134. Springer.
Malo, P., Sinha, A., Korhonen, P., Wallenius, J., and Takala,
P. (2014). Good debt or bad debt: Detecting seman-
tic orientations in economic texts: Good Debt or Bad
Debt. Journal of the Association for Information Sci-
ence and Technology, 65(4):782–796.
McCarthy, R. V., McCarthy, M. M., Ceccucci, W., Ha-
lawi, L., and SpringerLink (Online service) (2019).
Applying Predictive Analytics Finding Value in Data.
OCLC: 1204071994.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).
Efficient estimation of word representations in vector
space.
Mohammad, S. M. (2021). Sentiment Analysis: Automatically Detecting Valence, Emotions, and Other Affectual States from Text. arXiv:2005.11882 [cs].
Pennington, J., Socher, R., and Manning, C. (2014). Glove:
Global Vectors for Word Representation. In Proceed-
ings of the 2014 Conference on Empirical Methods in
Natural Language Processing (EMNLP), pages 1532–
1543, Doha, Qatar. Association for Computational
Linguistics.
Plutchik, R. (1982). A psychoevolutionary theory of emotions. Social Science Information.
Qian, B. and Rasheed, K. (2007). Stock market predic-
tion with multiple classifiers. Applied Intelligence,
26(1):25–33. Publisher: Springer.
Smith, A. (1776). An Inquiry into the Nature and Causes of
the Wealth of Nations. McMaster University Archive
for the History of Economic Thought.
Susanto, Y., Livingstone, A., Ng, B. C., and Cambria, E.
(2020). The Hourglass model revisited. IEEE Intelli-
gent Systems, 35(5).
Taboada, M. (2016). Sentiment analysis: An overview from
linguistics. Annual Review of Linguistics, 2(1):325–
347.
Tang, D., Qin, B., Feng, X., and Liu, T. (2016). Effec-
tive LSTMs for target-dependent sentiment classifica-
tion. In Proceedings of COLING 2016, the 26th Inter-
national Conference on Computational Linguistics:
Technical Papers, pages 3298–3307, Osaka, Japan.
The COLING 2016 Organizing Committee.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, L., and Polosukhin, I.
(2017). Attention Is All You Need. arXiv:1706.03762
[cs]. arXiv: 1706.03762.
Weichselbraun, A., Steixner, J., Brasoveanu, A. M. P., Scharl, A., Göbel, M., and Nixon, L. J. B. (2022).
Automatic Expansion of Domain-Specific Affective
Models for Web Intelligence Applications. Cognitive
Computation, 14(1):228–245.
Wex, F., Widder, N., Liebmann, M., and Neumann, D.
(2013). Early Warning of Impending Oil Crises Using
the Predictive Power of Online News Stories. In 2013
46th Hawaii International Conference on System Sci-
ences, pages 1512–1521, Wailea, HI, USA. IEEE.
Xing, F., Malandri, L., Zhang, Y., and Cambria, E. (2020).
Financial Sentiment Analysis: An Investigation into
Common Mistakes and Silver Bullets. In Proceed-
ings of the 28th International Conference on Compu-
tational Linguistics, pages 978–987, Barcelona, Spain
(Online). International Committee on Computational
Linguistics.
Yenicelik, K. D. (2020). Understanding and Exploiting Subspace Organization in Contextual Word Embeddings. Masterthese, Eidgenössische Technische Hochschule Zürich, Zürich 8006, Schweiz.
Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun,
R., Torralba, A., and Fidler, S. (2015). Aligning Books
and Movies: Towards Story-Like Visual Explanations
by Watching Movies and Reading Books. In Proceed-
ings of the IEEE International Conference on Com-
puter Vision (ICCV).