CrudeBERT: Applying Economic Theory Towards Fine-Tuning Transformer-Based Sentiment Analysis Models to the Crude Oil Market

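Himmet Kaplan (1, a), Ralf-Peter Mundani (2, b), Heiko Rölke (2, c) and Albert Weichselbraun (2, d)

(1) Zurich University of Applied Sciences, Winterthur, Switzerland
(2) University of Applied Sciences of the Grisons, Chur, Switzerland

(a) https://orcid.org/0000-0002-1115-8669
(b) https://orcid.org/0000-0001-6248-714X
(c) https://orcid.org/0000-0002-9141-0886
(d) https://orcid.org/0000-0001-6399-045X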

Keywords: Natural Language Processing, Sentiment Analysis, Transformers, FinBERT, Crude Oil Market, Fine-Tuning.

Abstract: Predicting market movements based on the sentiment of news media has a long tradition in data analysis. With advances in natural language processing, transformer architectures have emerged that enable contextually aware sentiment classification. Nevertheless, current methods built for the general financial market such as FinBERT cannot distinguish asset-specific value-driving factors. This paper addresses this shortcoming by presenting a method that identifies and classifies events that impact supply and demand in the crude oil markets within a large corpus of relevant news headlines. We then introduce CrudeBERT, a new sentiment analysis model that draws upon these events to contextualize and fine-tune FinBERT, thereby yielding improved sentiment classifications for headlines related to the crude oil futures market. An extensive evaluation demonstrates that CrudeBERT outperforms proprietary and open-source solutions in the domain of crude oil.

1 INTRODUCTION

Crude oil is one of our primary energy sources and

also one of the most inﬂuential raw materials. Thus,

it is of utmost importance for the global economy

and even serves as an indicator of economic boom

or recession. Since crude oil is a limited natural re-

source, its price is expected to be determined by sup-

ply and demand. Yet, according to literature, crude oil

is one of the most volatile markets in the world since

its demand is primarily affected by economic activity

(business cycle) and exogenous events such as armed

conﬂicts and natural disasters (Buyuksahin and Har-

ris, 2011). Traditionally, analysts draw upon techni-

cal analysis which utilizes historical data for predic-

tion. However, historical data rarely provides high-

conﬁdence insights (McCarthy et al., 2019). Com-

plementing technical analysis with additional contem-

porary and relevant information such as news articles

could be a promising strategy for achieving more re-

liable results. Several empirical studies demonstrated

that considering news media signiﬁcantly improved

forecasts of large market movements (i.e., higher than

50%) of publicly listed assets (Qian and Rasheed,

2007). Therefore, many researchers studied the benefits of incorporating news data into their prediction models (Baboshkin and Uandykova, 2021). One

option for applying news to prediction tasks comes

with sentiment analysis which quantiﬁes the impact

of news on a certain asset as positive, neutral, or neg-

ative. With the latest advancement in computer hard-

and software, particularly the development of trans-

former architectures (Devlin et al., 2018), modern nat-

ural language processing (NLP) algorithms emerged

that are capable of evaluating text in a contextually

aware manner for strategic forecasting. The obser-

vations of Jiang et al. (2020) indicate that current

transformer-based sentiment classiﬁers can achieve

remarkable accuracies of up to 97.5 %. While these

sentiment analysis methods gained great traction in

prediction tasks for the stock market and cryptocur-

rencies, they still play only a minor role in forecasting

crude oil prices. Thus, the beneﬁts of considering the

sentiment of news headlines for crude oil price pre-

dictions seem to be evident.

The presented research draws upon FinBERT, a

state-of-the-art transformer-based sentiment analysis

model that has been pre-trained for the general ﬁnan-

cial market. However, analyzing over a decade of

news headlines relevant to the crude oil market re-

vealed that FinBERT’s sentiment classiﬁcation does

not deliver any apparent insights into the contempo-

rary development of oil prices. Therefore, we de-

veloped the publicly available CrudeBERT sentiment

analysis model that has been optimized for the crude


Kaplan, H., Mundani, R., Rölke, H. and Weichselbraun, A.

CrudeBERT: Applying Economic Theory Towards Fine-Tuning Transformer-Based Sentiment Analysis Models to the Crude Oil Market.

DOI: 10.5220/0011749600003467

In Proceedings of the 25th International Conference on Enterprise Information Systems (ICEIS 2023) - Volume 1, pages 324-334

ISBN: 978-989-758-648-4; ISSN: 2184-4992

© 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

oil domain. CrudeBERT extends FinBERT by consid-

ering the economic theory of supply and demand. In

our experiments, CrudeBERT outperforms FinBERT

and provides a promising tool for improving crude oil

price predictions by incorporating information on the

sentiment conveyed in news headlines.

The main contributions of this paper can be sum-

marized as (i) developing a method that provides

transformer models with means for identifying the

major supply and demand factors that drive crude oil

futures markets, (ii) ﬁne-tuning general transformer-

based sentiment analysis methods by incorporating

the economic model of supply and demand into these

models, and (iii) conducting extensive experiments

that draw upon multiple prediction settings to bench-

mark the developed method against a baseline (ran-

dom binary classiﬁcation) and two state-of-the-art

(lexicon- and transformer-based) sentiment analysis

frameworks.

The remainder of this paper is organized as follows: Section 2 discusses related literature that led to the modern transformer-era sentiment analysis applications. Afterward, Section 3 introduces the CrudeBERT domain-specific affective model for crude oil markets and its use in predicting market movements. Section 4 describes the evaluation setup, performs a comprehensive evaluation of the CrudeBERT model, and discusses the obtained results. Section 5 concludes the paper with a summary and an outlook on future improvements.

2 RELATED WORK

From an industrial standpoint, crude oil is critical to

the world’s economy. Consequently, many research

articles focus on predicting its price with studies vary-

ing from technical to fundamental analysis. This liter-

ature review focuses on articles aimed at forecasting

crude oil prices by including sentiment features.

2.1 Efﬁcient Market Hypothesis

The efﬁcient market hypothesis (EMH) questions

whether information retrieved from news articles does

contain any predictive value at all since it claims that

the price of an asset already considers all publicly

available information. Eugene Fama distinguishes

between the weak, semi-strong, and strong forms of

EMH (Fama, 1970). The weak form claims that the current price reflects only an asset's historical prices, which leaves all information beyond the price history relevant for forecasting its future price. The semi-strong variant, on the other hand, holds that the current price reflects both historical prices and publicly available information. Therefore, confidential information such as insider knowledge can add value to a prediction, provided it has not yet altered the current price (Malkiel, 1989). The strong form of the EMH assumes that prices reflect historical prices as well as publicly available and confidential information (Fama, 1970). Hence, the strong form assumes that fundamental analysis based on any available information cannot lead to abnormal economic returns. This form of the EMH is further supported by various studies that emphasize the notorious difficulty of forecasting crude oil prices, such as the work of Hamilton, which concludes that the oil price appears to follow a random walk with drift (Hamilton, 2008).

Yet, numerous experts have questioned the hy-

pothesis of the EMH’s strong and semi-strong forms,

claiming that once a news message is published, the

available information changes, and, therefore, the

price is expected to adapt. Qian and Rasheed concluded from their experiments that predictions based on news can correctly forecast price fluctuations with greater than 50 % accuracy (Qian and Rasheed, 2007). Furthermore, according to Buyuksahin and Harris, crude oil demand is primarily driven

by exogenous events, such as armed conﬂicts and

natural disasters as well as the presence of specula-

tors such as noise traders. They assume that these

events considerably contribute towards making crude

oil one of the most volatile markets in the world.

They also observed a substantial relationship between

crude oil price changes and the behavior of politically

and economically unstable nations, which often trig-

ger such exogenous events (Buyuksahin and Harris,

2011). This observation is conﬁrmed by Brandt and

Gao’s more recent study, which shows that macro-

economic and geopolitical news has a strong inﬂu-

ence on crude oil, with varying impacts. For exam-

ple, macroeconomic news inﬂuences short-term price

movements and also helps to forecast long-term oil

prices. Geopolitical news, on the other hand, typically has a robust and instantaneous impact that results in increased trading volume. However, geopolitical news delivers no conclusive insights

in terms of forecasting (Brandt and Gao, 2019). Wex

et al. state that forecasts based on the sentiment scores

of news articles that cover exogenous events are sta-

tistically signiﬁcant (Wex et al., 2013).

2.2 Sentiment Analysis

Sentiment analysis is considered a prevalent classiﬁ-

cation task in NLP, which categorizes affective and


subjective information within entire documents, para-

graphs, and sentences. It has gained increasing popu-

larity due to its vast potential for a variety of applica-

tions such as economics, ﬁnance, marketing, political

science, psychology, and human-computer interaction

(Mohammad, 2021). Sentiment in the context of sen-

timent analysis, which is also known as opinion min-

ing, refers to the quantiﬁcation of natural language

within pre-deﬁned affective dimensions (Weichsel-

braun et al., 2022) such as sentiment polarity which

distinguishes between positive, neutral, and negative

media coverage. Still, there is little discussion about

what sentiment in the context of NLP truly represents

(Hovy, 2015). Generally, researchers assume that au-

thors always express some sentiment while producing

natural language, since emotions, opinions, and ex-

pressions in language are fundamental human traits

(Taboada, 2016). Therefore, sentiment analysis can

also cover complex emotions such as the ones in-

troduced in Plutchik’s Wheel of Emotions (Plutchik,

1982) and the Hourglass of Emotions (Susanto et al.,

2020). However, most literature on financial sentiment analysis (FSA) breaks sentiment down into binary polarities such as positive and negative, because a binary classification is better suited to directly assessing the up or down movements of publicly traded assets (Li et al., 2014).

2.3 Early NLP Methods for FSA

Sentiment analysis in the financial domain was introduced in the 1980s. One of the first approaches to classifying sentiment in text documents was the Bag-of-Words (BOW) methodology, often referred to as the lexicon-based technique (Liew, 2016). Since a text consists of several words (tokens), BOW simply accumulates the sentiment scores of positive and negative words to compute the overall sentiment classification. As the name suggests, these lexicon-based methods utilize a lexicon consisting of words and their sentiment values, predetermined by (ideally multiple) human annotators. One of the most well-known lexicons for FSA was developed by Loughran and McDonald and aimed at interpreting liabilities concerning 10-K filing returns (Loughran and McDonald, 2011). In a later work (Loughran and McDonald, 2016), they published a survey on the use of text analysis with a focus on accounting and finance. However, creating lexicons that include all possible keywords, including negations and word combinations, is very challenging, since a term's sentiment often also depends on the context expressed by surrounding terms or paragraphs.
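The accumulation step of a lexicon-based classifier can be sketched in a few lines of Python. The lexicon below is a hypothetical toy example and not taken from Loughran and McDonald or any other published resource; the function names are likewise illustrative assumptions.

```python
import re

# Hypothetical toy lexicon; published lexicons are far larger and domain-specific.
TOY_LEXICON = {"gain": 1.0, "growth": 1.0, "profit": 1.0,
               "loss": -1.0, "decline": -1.0, "accident": -1.0}

def bow_sentiment(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of all tokens; unknown tokens count as 0."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(lexicon.get(token, 0.0) for token in tokens)

def polarity(score):
    """Map the accumulated score to a polarity label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity(bow_sentiment("Profit growth offsets quarterly loss")))
# The negation is ignored, so this headline is scored as negative.
print(polarity(bow_sentiment("No decline in output")))
```

The second call illustrates the context problem described above: because each token is scored in isolation, the negation in front of "decline" has no effect on the result.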

2.4 Machine Learning-Based Sentiment

Analysis

With advancements in computer hardware and software, modern FSA approaches started to rely heavily on machine learning. These approaches mostly focused on supervised learning, in which a model is trained on annotated datasets containing pairs of inputs and matching solutions. By correcting and optimizing themselves based on the (mostly human-curated) training dataset, such models identify rules and patterns and attempt to derive a potentially meaningful generalization. Supervised learning requires large amounts of annotated training data and is usually used for classification and regression tasks (Chollet, 2018). For instance, Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) networks are particularly well suited for sequential data, such as text (Tang et al., 2016). However, the required training dataset is one disadvantage of supervised learning, particularly for classification problems such as sentiment analysis: a bigger training dataset usually yields better results, but creating and using large training datasets is typically expensive (both computationally and financially). Furthermore, RNNs suffer from vanishing and exploding gradients, making them unsuitable for lengthy texts, and are slower to train since their sequential flow is incompatible with parallel processing (Chollet, 2018). One approach to addressing this problem is combining supervised approaches with unsupervised machine learning models. For instance, word embeddings (also known as word vector models) map words into a vector space that places semantically related words close to each other in an unsupervised manner. Among the most popular word vector models are word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), which can be trained on large text corpora and thereby capture the word semantics within the corpus. Nevertheless, word embeddings still lack the power to fully consider a term's context: once a model has been trained, words with the same spelling always receive the same word vector, independent of their context. The vector for orange, for example, is the same whether the word denotes a fruit or a color, and thus represents a mixture of both concepts.

2.5 Transformer-Based Sentiment

Analysis

More recent language models draw upon the attention mechanism (Bahdanau et al., 2016), which led to significant gains in a range of NLP tasks including sentiment analysis. The attention mechanism allows neural networks to resemble the human cognitive function by selectively focusing on particularly relevant information while dismissing other, less relevant information. This approach encourages the neural network to spend more computational resources on small but relevant elements of the data (Bahdanau et al., 2016), yielding improvements in terms of speed and accuracy. Vaswani et al. built on the attention mechanism to develop the transformer architecture, which allows parallel training and is therefore more efficient than RNNs. Initially, the transformer architecture was proposed for neural machine translation; thus, it contains an encoder and a decoder. The encoder is a stack of identical layers, each combining multi-head attention with a fully connected feed-forward network, which allows the sequence to be evaluated from contextually varying perspectives (Figure 1). The capability to consider a term's context, together with the option to draw upon and customize large pre-trained models, has been key to the success of transformer-based language models for sentiment analysis (Vaswani et al., 2017).

Figure 1: Components of the Multi-Head Attention Design.

(Vaswani et al., 2017).
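The core computation behind attention can be illustrated with a minimal, single-head sketch of scaled dot-product attention as defined by Vaswani et al. (2017). The function names and toy vectors are illustrative assumptions. Multi-head attention runs several such computations in parallel on learned projections of the input and concatenates the results, which this sketch omits.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Toy example: the query matches the first key more closely than the second,
# so the output lies closer to the first value (10.0) than to the second (20.0).
context = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[10.0], [20.0]])
print(context)
```

Because every query is processed independently of the others, the loop over queries can be parallelized, which is the property that makes transformers faster to train than RNNs.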

2.6 FinBERT for Financial Sentiment

Analysis

Shortly after the release of the transformer architecture, Devlin et al. observed that the encoder, when layered, can also serve as a strong representation learning model, and therefore developed the Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018). One noteworthy feature of BERT was its simple customization for a wide range of NLP tasks with the important added capability of contextual perception of words (Yenicelik, 2020). Initially, it was pre-trained on English Wikipedia and the BookCorpus to give the model a general comprehension of natural language (Zhu et al., 2015). This model served as a foundation for further adaptations to specific NLP applications and domains, such as FinBERT (Araci, 2019), which focuses on sentiment analysis of financial news. To achieve this, Araci employed a subset of the Thomson Reuters Text Research Collection (TRC2) to adapt the model to the domain of financial news, where occurrences of slang and spelling errors are minimal. For the task-specific fine-tuning process, the Financial Phrase Bank training dataset from Malo et al. (2014) was utilized (Figure 2).

Figure 2: Process of generating FinBERT. The figure shows three stages: (1) general domain pre-training on English books and Wikipedia for a general understanding of language; (2) domain adaptation to financial news through further training on the Thomson Reuters Corpora (TRC2); (3) task-specific fine-tuning with the Financial Phrase Bank as training dataset for sentiment classification.

Among the papers that use FinBERT as a backend for classification, only a relatively small proportion apply it to classifying sentiment towards crude oil.

2.7 RavenPack Event Sentiment Score

To capture the overall sentiment of the market, RavenPack developed a lexicon-based news sentiment index, the Event Sentiment Score (ESS), which is a granular score between -1 (negative sentiment) and 1 (positive sentiment). This score is determined by systematically comparing stories that experts, through manual assessment, typically classify as having a positive or negative financial or economic impact. Based on this human-curated lexicon, the ESS algorithm can examine a wide range of sentiment proxies that are frequently mentioned in financial news, allowing it to classify the sentiment of events ranging from earnings reports to natural disasters (Hafez et al., 2020).

3 METHOD

This section introduces the CrudeBERT model, which extends FinBERT by incorporating knowledge of an event's expected impact on crude oil supply and demand. Section 3.1 presents an overview of the news headline and crude oil price datasets, followed by a discussion of the data pre-processing steps. Afterward, we analyze the shortcomings of FinBERT and address them by developing the CrudeBERT model.


3.1 Datasets

3.1.1 News Data

The dataset containing news information consists of around 46,000 headlines with high relevance to the topic of crude oil, published between 1 January 2000 and 1 April 2021 and obtained through the RavenPack Realtime news Discovery platform. Similar to Li et al., we limit our analysis to news headlines, since they are more easily accessible and have lower requirements in terms of pre-processing, storage, and computational power (Li et al., 2019). The headlines used in this paper originate from 1034 unique news sources. The largest share was published on the Dow Jones newswires (approx. 21,200), followed by Reuters (approx. 3,000), Bloomberg (approx. 1,100), and Platts (approx. 870). Around 400 news sources delivered only a single headline. It should be noted that RavenPack has added new publishers over the years, which led to a steady increase in the number of available sources over time, especially until 2012. To ensure rich news coverage with diverse sources, we limit our evaluations to the period after 2012.

3.1.2 Price Data

The oil market is dominated by the two most prevalent grades, Brent Crude and West Texas Intermediate (WTI), which dictate the price of crude oil (EIA, 2021). Brent Crude is the benchmark for crude oil in Africa, Europe, and the Middle East, accounting for almost two-thirds of the global supply. WTI, on the other hand, is the favored benchmark in the United States of America. Since all of the headlines in our dataset are in English, the WTI futures prices were regarded as potentially more relevant for our research. The historical values of WTI were acquired from the financial market platform investing.com for the same period as the headlines.

3.2 Data Pre-Processing

3.2.1 Sentiment Data Normalization

We normalized the sentiment values of headlines computed by the sentiment classifiers using z-statistics, with the aim of integrating the market's relative mood into the classification model. By normalizing sentiment data over a sliding window we account for the perfect market theory (i.e., the market price reflects all publicly available information) by assuming that only new information that either disappoints or exceeds stakeholder expectations results in significant price changes.

Equation 1 outlines the normalization of the sentiment value at time point t based on a weekly sliding window of size w = 5 with

$$\mathrm{sent}_{\mathrm{norm},t} = \frac{\mathrm{sent}_t - \overline{\mathrm{sent}}_{t,w}}{\sigma_{t,w}} \qquad (1)$$

where $\overline{\mathrm{sent}}_{t,w}$ indicates the average sentiment at time points $t, t-1, \ldots, t-w$ within the sliding window, and $\sigma_{t,w}$ the corresponding standard deviation.
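A rolling z-score of this kind could be computed as in the following sketch. It rests on several assumptions that the text does not settle: the window consists of the w most recent observations including time point t, the population standard deviation is used, the first w − 1 observations are left unnormalized, and a window without any variation yields a score of 0. The function name is hypothetical.

```python
from statistics import mean, pstdev

def rolling_zscore(values, w=5):
    """z-score of each value against the trailing window of w observations ending at t."""
    normalized = []
    for t in range(len(values)):
        if t + 1 < w:
            normalized.append(None)  # not enough history for a full window
            continue
        window = values[t - w + 1 : t + 1]
        sigma = pstdev(window)
        # Assumption: a flat window carries no surprise, so the score is 0.
        if sigma == 0:
            normalized.append(0.0)
        else:
            normalized.append((values[t] - mean(window)) / sigma)
    return normalized

daily_scores = [0.2, -0.1, 0.4, 0.0, 0.9, -0.6]
print(rolling_zscore(daily_scores))
```

The same function applies unchanged to a price series, since both normalizations share the same form and window size.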

3.2.2 Price Data Normalization

Due to market volatility, commodity and stock prices show random fluctuations that overlap short-term and long-term trends within the market. Therefore, we also normalize price data using z-statistics to better distinguish between significant market movements and random fluctuations. As with $\mathrm{sent}_{\mathrm{norm},t}$, we normalize price data for a weekly sliding window of w = 5 (due to the market being closed over the weekends) as outlined in the equation below:

$$\mathrm{price}_{\mathrm{norm},t} = \frac{\mathrm{price}_t - \overline{\mathrm{price}}_{t,w}}{\sigma_{t,w}} \qquad (2)$$

with $\overline{\mathrm{price}}_{t,w}$ and $\sigma_{t,w}$ indicating the average price and standard deviation within the chosen sliding window.

3.2.3 Handling Multiple Daily Sentiment Scores

Days covered by multiple news headlines yield multiple sentiment scores, which need to be merged into a single value for that day. Prior work by Hafez et al. concluded that using the sum rather than the mean captures both the sentiment score and the sentiment volume at the same time, even though the normalization of scores severely shrinks the relative impact of days with a lower news volume. According to their research, this strategy resulted in superior outcomes in their experiments (Hafez et al., 2018).
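Merging by summation, as suggested by the cited prior work, might look like the sketch below. The input format (pairs of an ISO date string and a headline score) and the function name are assumptions made for illustration only.

```python
from collections import defaultdict

def daily_sentiment(scored_headlines):
    """Sum headline scores per day, so the total reflects polarity and news volume."""
    totals = defaultdict(float)
    for day, score in scored_headlines:
        totals[day] += score
    # ISO date strings sort chronologically.
    return dict(sorted(totals.items()))

scored = [("2020-04-20", -0.5), ("2020-04-20", -0.25), ("2020-04-21", 0.5)]
print(daily_sentiment(scored))
```

With a mean, two moderately negative headlines on the same day would yield the same value as a single one; the sum keeps the information that the day attracted more coverage.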

3.2.4 Handling Gaps in the Dataset

Rows containing gaps caused either by missing headlines or missing prices (due to closings of the market) were dropped entirely. Furthermore, all values were scaled to the range between −1 and 1. The final dataset covers the period from 1 January 2012 to 1 April 2021 and yields 3376 rows of data.
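The two clean-up steps can be sketched as follows. The text does not state which scaling method was used, so dividing by the largest absolute value is an assumption chosen here because it preserves the sign of each value; the row layout and function names are hypothetical as well.

```python
def drop_gaps(rows):
    """Keep only rows in which both the sentiment and the price are present."""
    return [row for row in rows
            if row["sentiment"] is not None and row["price"] is not None]

def scale_max_abs(values):
    """Scale values into [-1, 1] by dividing by the largest absolute value (assumed method)."""
    peak = max(abs(v) for v in values)
    if peak == 0:
        return [0.0 for _ in values]
    return [v / peak for v in values]

rows = [
    {"date": "2012-01-03", "sentiment": 2.0, "price": 102.96},
    {"date": "2012-01-04", "sentiment": None, "price": 103.22},  # no headlines
    {"date": "2012-01-07", "sentiment": -4.0, "price": None},    # market closed
    {"date": "2012-01-09", "sentiment": 1.0, "price": 101.31},
]
clean = drop_gaps(rows)
print(scale_max_abs([row["sentiment"] for row in clean]))
```

The dates and prices in this example are placeholder values and not taken from the dataset described above.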

The dataset aligns the summarized daily sentiment scores with the price change of the following day ($\mathrm{Return}_{t+1}$), i.e., assumes that markets will adapt to new information by the next day at the latest. Sentiment scores vary between positive (1) and negative (−1) values. The price, in contrast, always remains

positive with the notable exception of 20 April 2020

when prices became negative for a short period. Price

changes (i.e., Returns) are, therefore, better suited for

indicating the market’s reaction to news coverage. We

compute the daily Returns of WTI crude oil futures as

outlined in Equation 3 and compare them to the sen-

timent scores.

$$\mathrm{Return}_t = \frac{\mathrm{Price}_t - \mathrm{Price}_{t-1}}{\mathrm{Price}_{t-1}} \qquad (3)$$

Interpreting the oil price as the result of cumu-

lative returns allows a comparison to the cumulative

sentiment scores, as illustrated in Figure 7.
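The return computation and the alignment of day-t sentiment with the return of day t+1 can be sketched as below. The function names and sample values are illustrative assumptions, and the sketch does not handle a zero or negative previous price, which would need special treatment for dates such as 20 April 2020.

```python
def daily_returns(prices):
    """Return_t = (Price_t - Price_{t-1}) / Price_{t-1}; the first day has no return."""
    return [None] + [(price - prev) / prev
                     for prev, price in zip(prices, prices[1:])]

def align_with_next_day(sentiment, returns):
    """Pair the sentiment of day t with the return of day t+1."""
    return list(zip(sentiment, returns[1:]))

prices = [100.0, 110.0, 99.0]
sentiment = [0.5, -0.2, 0.1]
returns = daily_returns(prices)
print(align_with_next_day(sentiment, returns))
```

The last sentiment value has no following return and therefore drops out of the aligned pairs.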

3.3 Shortcomings of FinBERT

Figure 7 visually compares FinBERT's cumulative sentiment scores (red) with the price history (blue) of WTI crude oil futures to assess FinBERT's forecasting potential. The plot does not show any apparent relationship or trend, which calls for an inquiry into the underlying causes of this poor relationship and potential ways of correcting it.

Adam Smith’s (1776) price theory advocates that

the price of limited resources such as crude oil is de-

termined by supply and demand. In this context, sup-

ply refers to the amount of a product or service that

a provider will sell at a given price during a speciﬁc

period. Demand denotes the amount of a product or

service that a buyer is willing to acquire during the

same period for a given price. The interaction be-

tween suppliers and customers yields a competitive

market in which the price of products and services

is determined by the equilibrium between supply and

demand (Smith, 1776). For example, if demand re-

mains constant but supply falls, the resulting shortage

will cause prices to rise. A shortage can also occur if

the supply remains constant but the demand rises.

In contrast, increased supply with constant de-

mand will result in a surplus and consequently a de-

crease in prices. A surplus can also emerge if supply

remains constant but demand falls. This logic behind

supply and demand can be summed up as follows:

Less supply → shortage → higher price

More supply → surplus → lower price

Less demand → surplus → lower price

More demand → shortage → higher price
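The four rules above amount to a small lookup table, sketched here in Python. The identifiers are hypothetical and serve only to make the logic explicit.

```python
# (factor, change) -> (market condition, expected price effect)
PRICE_EFFECT = {
    ("supply", "decrease"): ("shortage", "higher price"),
    ("supply", "increase"): ("surplus", "lower price"),
    ("demand", "decrease"): ("surplus", "lower price"),
    ("demand", "increase"): ("shortage", "higher price"),
}

def expected_price_effect(factor, change):
    """Look up the market condition and price effect for a supply or demand change."""
    return PRICE_EFFECT[(factor, change)]

print(expected_price_effect("supply", "decrease"))
```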

A drill-down analysis that compared news head-

lines to FinBERT sentiment scores revealed that Fin-

BERT tended to produce dubious outcomes. Given

that crude oil is a publicly-traded asset and FinBERT

has been trained on general ﬁnancial market news this

result seems arguably surprising. Having said that,

according to Xing et al. such behavior is expected

when utilizing general sentiment analysis methods for

a speciﬁc domain and is known as the domain adap-

tation problem (Xing et al., 2020). Weichselbraun

et al. (2022) also emphasize the need for domain-

speciﬁc affective models and present methods for cre-

ating such models.

Interpreting the FinBERT scores of the news headlines listed in Table 1 based on the impact of supply and demand on prices reveals some serious issues with FinBERT's assessment of strongly positive (+1), strongly negative (-1), and neutral (0) events. Headlines suggesting a drop in supply (e.g., due to accidents at oil refineries and oil platforms) tend to receive negative FinBERT scores although the corresponding events likely lead to higher crude oil prices. The FinBERT model probably returns these negative scores since accidents are rarely good news in finance and due to moral assessments derived from the human-made annotations within the Financial Phrase Bank.

Headlines implying a rise in supply (e.g., due to

oil discoveries and increasing exports), in contrast,

frequently yield neutral FinBERT scores.

When it comes to a decline in demand (e.g., induced by decreased imports), the resulting surplus should lead to a price decrease. This assessment is also confirmed by two of the three FinBERT scores for the analyzed headlines (rows Surplus, Demand Decrease in Table 1). The first headline, in contrast, yields a positive FinBERT score, since FinBERT is not able to correctly interpret the fall in imports indicated by negative numbers such as -16.0 %. This limitation should be taken into consideration when utilizing FinBERT since a substantial number of headlines do contain such values. Lastly, headlines that indicate increasing demand should result in higher oil prices. The experiments with FinBERT confirm that it considers headlines conveying increased demand mostly as positive.

3.4 Extending FinBERT to CrudeBERT

In the next step, we extended FinBERT to Crude-

BERT to consider the economic law of supply and

demand in the model’s assessment.

3.4.1 Training Dataset Generation

To generate a domain-speciﬁc labeled silver standard

for CrudeBERT, we analyzed several hundred head-

lines to determine frequently recurring topics, key-

words indicating these topics, and their likely impact

on the supply and demand of crude oil. This process

identified the following major topics: accidents, oil discoveries, changes in exports, changes in imports, changes in demand, pricing, supply, pipeline limitations, drilling, and spillage.

Table 1: Sample of headlines and output of FinBERT.

| Market effect | Cause | Headline | Expected sentiment | FinBERT sentiment score |
|---|---|---|---|---|
| Shortage | Supply Decrease | Major Explosion, Fire at Oil Refinery in Southeast Philadelphia | Positive | -0.886292 |
| Shortage | Supply Decrease | PETROLEOS confirms Gulf of Mexico oil platform accident | Positive | -0.507213 |
| Shortage | Supply Decrease | CASUALTIES FEARED AT OIL ACCIDENT NEAR IRANS BORDER | Positive | -0.901763 |
| Shortage | Demand Increase | EIA Chief expects Global Oil Demand Growth 1 M B/D to 2011 | Positive | 0.930822 |
| Shortage | Demand Increase | Turkey Jan-Oct Crude Imports +98.5% To 57.9M MT | Positive | 0.866315 |
| Shortage | Demand Increase | China’s crude oil imports up 78.30% in February 2019 | Positive | 0.922963 |
| Surplus | Demand Decrease | China February Crude Imports -16.0% On Year | Negative | 0.540711 |
| Surplus | Demand Decrease | Turkey May Crude Imports down 11.0% On Year | Negative | -0.965965 |
| Surplus | Demand Decrease | Japan June Crude Oil Imports decrease 10.9% On Yr | Negative | -0.955271 |
| Surplus | Supply Increase | Iran’s Feb Oil Exports +20.9% On Mo at 1.56M B/D - Official | Negative | 0.139093 |
| Surplus | Supply Increase | Apache announces large petroleum discovery in Philadelphia | Negative | 0.089624 |
| Surplus | Supply Increase | Turkey finds oil near Syria, Iraq border | Negative | 0.076210 |

We then queried the RavenPack repository for

headlines containing the identiﬁed keywords and as-

signed them to the corresponding topic. The head-

line in Figure 3, for instance, was assigned the topic

changes in imports due to the occurrence of the word

import in the headline. Afterward, we determined the

direction of the change by classifying the headline’s

polarity based on the presence of terms that indicate

an increase, a decrease, or constant levels.

Turkey’s crude oil imports up 78.3% since 2021

Matched Topic = IMPORT

Matched Polarity = INCREASE

Figure 3: Example of Detected Topic and Polarity.
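The two matching steps could be approximated with simple keyword rules, as in the sketch below. The keyword lists are hypothetical, since the exact terms used for the queries are not given here, and the regular expression for signed numbers is an additional assumption intended to cover headlines such as "Imports -16.0% On Year".

```python
import re

# Hypothetical keyword stems; the first matching topic wins.
TOPIC_STEMS = {
    "IMPORT": ("import",),
    "EXPORT": ("export",),
    "DISCOVERY": ("discover", "finds oil"),
    "ACCIDENT": ("accident", "explosion"),
}
# Hypothetical direction terms.
POLARITY_TERMS = {
    "INCREASE": {"up", "rise", "rises", "increase", "increases", "growth"},
    "DECREASE": {"down", "fall", "falls", "decrease", "decreases", "drop"},
    "NO CHANGE": {"unchanged", "steady", "flat"},
}

def match_topic(headline):
    """Return the first topic whose keyword stem occurs in the headline."""
    text = headline.lower()
    for topic, stems in TOPIC_STEMS.items():
        if any(stem in text for stem in stems):
            return topic
    return None

def match_polarity(headline):
    """Return the direction of change indicated by a signed number or a direction term."""
    text = headline.lower()
    # A sign directly in front of a digit, not preceded by a word character or dot.
    signed = re.search(r"(?<![\w.])([+-])\d", text)
    if signed:
        return "INCREASE" if signed.group(1) == "+" else "DECREASE"
    words = set(re.findall(r"[a-z]+", text))
    for polarity, terms in POLARITY_TERMS.items():
        if words & terms:
            return polarity
    return None

headline = "Turkey’s crude oil imports up 78.3% since 2021"
print(match_topic(headline), match_polarity(headline))
```

Substring matching is deliberately crude in this sketch; a production rule set would need word boundaries and a strategy for headlines that match several topics.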

The described approach enabled us to provide

topic and direction labels for around 30,000 head-

lines. Table 2 summarizes the ten detected frequently

reoccurring topics and their corresponding frequen-

cies (we do not report the number of headlines with

overlapping topics since it has been negligibly small):

Table 2: Frequency of Reoccurring Topics in the Domain of Crude Oil.

| Topic | Increase | No change | Decrease |
|---|---|---|---|
| Supply change | ca. 5900 | ca. 350 | ca. 5700 |
| Demand change | ca. 1300 | ca. 50 | ca. 800 |
| Export change | ca. 2000 | ca. 150 | ca. 1500 |
| Import change | ca. 2800 | ca. 50 | ca. 2300 |

Price change Spill Discovery Drilling

Increase Decrease ca. 2300 ca. 1600 ca. 100

Accident Pipeline issue

ca. 1600 ca. 1300

ca. 400 ca. 100

Assessing these labels based on the price theory of supply and demand allowed the creation of a domain-specific silver standard that classifies the headlines as positive (i.e., likely to cause increasing crude oil prices), negative (i.e., likely to cause decreasing crude oil prices), or neutral (i.e., not expected to affect the crude oil price), as outlined below (Figure 4):

• Lower Prices (score: −1): Headlines covering events such as drilling, discovery, increased exports, or simply a rise in oil production indicate a likely increase in supply. Similarly, headlines stating that oil imports or consumption are decreasing should, in principle, result in a surplus of oil and, therefore, lower prices.

• Higher Prices (score: +1): Headlines announcing accidents, pipeline constraints, oil spills, or a direct decline in oil supply, in turn, indicate a possible oil shortage due to the negative impact of these events on supply. Likely shortages can also be inferred from news indicating a rise in demand, an increase in imports, or a drop in exports. Generally, news that signals a scarcity of oil or a price increase should have a positive impact on the price.

• No Price Changes (score: 0): A neutral score has been assigned to the relatively small number of headlines that report no change in supply, demand, imports, or exports.
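The three rules above amount to a lookup from a (topic, direction) label to a score in {−1, 0, +1}. The sketch below encodes that lookup in Python, following the assignment of topics shown in Figure 4; the dictionary layout, the lowercase label strings, and the function name `sentiment_score` are illustrative assumptions rather than the authors' implementation.

```python
# Direction-dependent topics: (topic, direction) -> score.
# +1 = higher prices expected, -1 = lower prices expected.
DIRECTIONAL_SCORES = {
    ("supply", "increase"): -1, ("supply", "decrease"): +1,
    ("demand", "increase"): +1, ("demand", "decrease"): -1,
    ("export", "increase"): -1, ("export", "decrease"): +1,
    ("import", "increase"): +1, ("import", "decrease"): -1,
    ("price", "increase"): +1, ("price", "decrease"): -1,
}
# Event topics imply a direction on their own.
EVENT_SCORES = {
    "accident": +1, "pipeline issue": +1, "spill": +1,
    "drilling": -1, "discovery": -1,
}
# Topics for which a "no change" headline receives the neutral score 0.
STEADY_TOPICS = {"supply", "demand", "export", "import"}


def sentiment_score(topic, direction=None):
    """Map a labeled headline to -1, 0, or +1; raises KeyError for unknown labels."""
    if topic in EVENT_SCORES:
        return EVENT_SCORES[topic]
    if direction == "no change" and topic in STEADY_TOPICS:
        return 0
    return DIRECTIONAL_SCORES[(topic, direction)]


print(sentiment_score("import", "increase"))  # headline from Figure 3
```

Raising `KeyError` for unlabeled combinations, rather than silently returning 0, keeps unknown cases from being mistaken for genuinely neutral headlines.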

3.4.2 Model Fine-Tuning

The labeled headlines with the corresponding domain-specific sentiment scores yielded the S&D-Dataset, which contains approximately 14,000 negative, 500 neutral, and 15,000 positive headlines.

OIL PRICE INCREASE (number of headlines: ca. 15,000): «Price increase», «Supply decrease», «Demand increase», «Exports decrease», «Pipeline constraint», «Spills», «Imports increase», «Accident»

OIL PRICE DECREASE (number of headlines: ca. 14,000): «Price decrease», «Supply increase», «Demand decrease», «Exports increase», «Drilling», «Oil discovery», «Imports decrease»

OIL PRICE SAME (number of headlines: ca. 500): «Supply steady», «Demand steady», «Exports steady», «Imports steady»

Figure 4: Assignment of the Labelled Topics.

We split the dataset into training (60%), test (20%), and validation (20%) partitions (preserving the class distribution in each partition) and used the training partition for fine-tuning FinBERT, resulting in the CrudeBERT classifier (Figure 5).
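A class-preserving 60/20/20 split of this kind can be sketched with the Python standard library alone. The function name `stratified_split`, the fixed seed, and the scaled-down synthetic class counts are assumptions for illustration; in practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument would serve the same purpose.

```python
import random
from collections import defaultdict


def stratified_split(examples, fractions=(0.6, 0.2, 0.2), seed=42):
    """Split (headline, label) pairs into train/test/validation lists,
    applying the same fractions within every class."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    rng = random.Random(seed)
    train, test, validation = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_train = round(len(items) * fractions[0])
        n_test = round(len(items) * fractions[1])
        train += items[:n_train]
        test += items[n_train:n_train + n_test]
        validation += items[n_train + n_test:]  # remainder of the class
    return train, test, validation


# Synthetic stand-in mirroring the class imbalance (counts scaled down by 100).
data = ([(f"neg {i}", -1) for i in range(140)]
        + [(f"neu {i}", 0) for i in range(5)]
        + [(f"pos {i}", 1) for i in range(150)])
train, test, validation = stratified_split(data)
print(len(train), len(test), len(validation))
```

Splitting per class guarantees that even the small neutral class is represented in every partition, which a purely random split would not ensure.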

Despite the relatively low number of neutral headlines, we included them in training to provide the neural network with examples of lower domain-specific sentiment scores that have not been assigned to one of the two extremes (i.e., +1 for positive and −1 for negative news).

General domain pre-training: English books and Wikipedia for general understanding of language
Domain adaptation to financial news: Thomson Reuters Corpora (TRC2) for further training to the domain of news
Task-specific fine-tuning: Labeled topics of the query search based on Smith’s price theory
Figure 5: Process of Fine-tuning FinBERT to CrudeBERT.

ICEIS 2023 - 25th International Conference on Enterprise Information Systems

A preliminary evaluation of the CrudeBERT classifier on the silver standard dataset yielded, despite the class imbalance, a macro F1 score of 0.97, a macro accuracy of 0.98, and a macro recall of 0.97 (Figure 6). In contrast, the same evaluation with the FinBERT classifier yielded a macro F1 score of 0.29, a macro accuracy of 0.59, and a macro recall of 0.32 on the silver test dataset (Figure 6). Given the substantial number of headlines used for fine-tuning and their relatively short length (on average 10.4 words per headline), these improvements are not surprising.
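Macro-averaged metrics weight every class equally, which is why they are informative under the class imbalance mentioned above. The following sketch shows one common convention (the unweighted mean of per-class precision, recall, and F1); the function name `macro_scores` and the toy data are assumptions, and the paper does not state which implementation was used.

```python
def macro_scores(y_true, y_pred, labels=(-1, 0, 1)):
    """Return (macro precision, macro recall, macro F1) as unweighted class means."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example: a classifier that always predicts +1 on imbalanced data.
y_true = [1] * 15 + [-1] * 14 + [0] * 1
y_pred = [1] * 30
print(macro_scores(y_true, y_pred))
```

In this toy case half of the predictions are correct, yet the macro recall is only one third because two of the three classes are never predicted.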

(a) FinBERT (b) CrudeBERT

Figure 6: Confusion Matrices of the Two Transformer-based Financial Sentiment Classifiers on the Silver Dataset.

The qualitative comparison in Figure 7 further supports our initial intuition that FinBERT’s lack of knowledge of an event’s impact on supply and demand seriously limits its suitability for prediction tasks. Consequently, it tracks historical price movements far less closely than the fine-tuned CrudeBERT model and the commercial classifier of RavenPack.

[Line plot; x-axis: Date (2012 to 2020); y-axis: Scaled value; series: Futures, RavenPack, FinBERT, CrudeBERT (ours)]

Figure 7: Comparison of WTI Crude Oil Futures Prices and Different Cumulative Sentiment Scores (Scaled).
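A cumulative, scaled sentiment series of the kind plotted in Figure 7 could be computed as sketched below. The scaling method is not specified in this section, so min-max scaling to the range [0, 1] is an assumption, as is the function name `scaled_cumulative`.

```python
from itertools import accumulate


def scaled_cumulative(daily_scores):
    """Cumulatively sum daily sentiment scores, then min-max scale to [0, 1]."""
    cumulative = list(accumulate(daily_scores))
    if not cumulative:
        return []
    low, high = min(cumulative), max(cumulative)
    if high == low:
        # A constant series carries no shape information; map it to 0.0.
        return [0.0 for _ in cumulative]
    return [(value - low) / (high - low) for value in cumulative]


print(scaled_cumulative([1, -1, 1, 1]))
```

Scaling the price series with the same function puts both curves on a common axis, so that only their shapes are compared.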

4 EVALUATION

The following experiments leverage three different sentiment classifiers (FinBERT, CrudeBERT, and RavenPack ESS) to assess the potential of analyzing headlines for predicting the direction of the next day’s change in crude oil futures prices ($\mathrm{Return}_{t+1}$), using a two-class higher/lower price classification schema.

The evaluation considers the period between 1 January 2012 and 1 April 2021, consisting of 3376 days’ worth of data. We use precision, recall, and the F1 metric to assess the predictive potential of the evaluated classifiers.
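The evaluation setup can be illustrated with a small sketch that derives next-day direction labels from a price series and compares them with sentiment-based predictions. Several details are assumptions made for this example: the rule that a positive daily sentiment predicts "up" and anything else "down", the treatment of unchanged prices as "down", the toy numbers, and all function names.

```python
from collections import Counter


def next_day_directions(prices):
    """Label day t as 'up' if the price on day t+1 is higher, else 'down'."""
    return ["up" if nxt > cur else "down" for cur, nxt in zip(prices, prices[1:])]


def confusion_counts(daily_sentiment, prices):
    """Count (predicted, actual) direction pairs for all days with a known next day."""
    actual = next_day_directions(prices)
    predicted = ["up" if score > 0 else "down" for score in daily_sentiment]
    # zip stops at the shorter list, so the final day (no next-day price) is dropped.
    return Counter(zip(predicted, actual))


def precision_recall(counts, positive):
    """Precision and recall for one direction, from (predicted, actual) counts."""
    negative = "down" if positive == "up" else "up"
    tp = counts[(positive, positive)]
    fp = counts[(positive, negative)]
    fn = counts[(negative, positive)]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


prices = [50.0, 51.0, 50.5, 50.5, 52.0]   # toy closing prices
sentiment = [0.4, 0.2, -0.3, 0.1, 0.9]    # toy daily sentiment scores
counts = confusion_counts(sentiment, prices)
print(precision_recall(counts, "up"), precision_recall(counts, "down"))
```

Aligning the sentiment of day t with the return of day t+1 is the step that makes this a prediction task rather than a same-day correlation.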

Table 3 summarizes the evaluation results. On average, CrudeBERT outperforms FinBERT, RavenPack, and a random baseline for binary classification. Applying FinBERT without any customizations to the prediction task seems to be counterproductive since it yields worse results than the random baseline. Fine-tuning FinBERT with the presented domain adaptation method considerably improves the method’s performance. CrudeBERT’s overall predictions also surpass the results from RavenPack’s proprietary sentiment classifier, although these differences are less pronounced. Compared to RavenPack, CrudeBERT performs slightly worse for price-up predictions but considerably better at predicting price-down movements.

Table 3: Classification Report of Different Sentiment Classifiers for Predicting Following Day WTI Oil Futures.

| Metric | Category | FinBERT | Random | RavenPack | CrudeBERT |
|---|---|---|---|---|---|
| Precision | Price down | 0.49 | 0.51 | 0.51 | 0.53 |
| Precision | Price up | 0.44 | 0.50 | 0.51 | 0.53 |
| Precision | Macro | 0.46 | 0.51 | 0.51 | 0.53 |
| Recall | Price down | 0.85 | 0.51 | 0.47 | 0.53 |
| Recall | Price up | 0.11 | 0.50 | 0.55 | 0.52 |
| Recall | Macro | 0.48 | 0.51 | 0.51 | 0.53 |
| F1-Score | Price down | 0.62 | 0.51 | 0.49 | 0.53 |
| F1-Score | Price up | 0.18 | 0.50 | 0.53 | 0.52 |
| F1-Score | Macro | 0.40 | 0.51 | 0.51 | 0.53 |

(a) FinBERT (b) Random

Figure 8: Confusion Matrices Comparing the Price Changes Predicted by RavenPack, FinBERT, and CrudeBERT with the Recorded Price Changes of the Following Day WTI Crude Oil Futures.

Figure 8 presents a confusion matrix that compares the predicted label for each classifier with the following day’s price changes of WTI crude oil futures ($\mathrm{Return}_{t+1}$). We therefore drew upon the SciPy stats package (https://scipy.org) to perform Pearson’s chi-square test to determine whether the improvements provided by CrudeBERT are statistically significant. When compared to either FinBERT or the random baseline, both CrudeBERT and RavenPack yield significantly better results at the 0.05 significance level. The improvements from RavenPack to CrudeBERT (1721 versus 1773 correct predictions) are less substantial and have only been judged significant at the 0.10 significance level.
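For readers unfamiliar with the test, the sketch below computes Pearson's chi-square statistic and p-value for a 2x2 contingency table using only the Python standard library. How the contingency tables were constructed for the reported comparisons is not described here, so the table layout (predicted versus actual direction) and the counts are purely hypothetical, and the function name `pearson_chi_square_2x2` is an assumption. With SciPy, `scipy.stats.chi2_contingency(table, correction=False)` computes the same uncorrected statistic.

```python
import math


def pearson_chi_square_2x2(table):
    """Pearson's chi-square test of independence for a 2x2 table
    (no continuity correction). Returns (statistic, p_value)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts: rows = predicted (down, up), columns = actual (down, up).
table = [[60, 40], [45, 55]]
statistic, p_value = pearson_chi_square_2x2(table)
print(round(statistic, 3), p_value < 0.05)
```

The closed-form p-value is valid only for one degree of freedom, which is the case for any 2x2 table; larger tables require the general chi-square distribution.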

5 OUTLOOK AND CONCLUSION

Predicting market movements based on news headlines is still a very challenging task, as the experiments in Section 4 show. Even FinBERT, a state-of-the-art sentiment classifier that contains knowledge about the general financial domain, is unable to offer helpful insights into the future price fluctuations of commodities like crude oil when used without any domain adaptations.

The presented paper, therefore, introduces a method for fine-tuning FinBERT based on news headlines. Our approach selects frequently reoccurring topics that cover events illustrating fundamental market dynamics such as the interplay between supply and demand. A frequency analysis identifies these topics, which are then used as keywords in search queries for collecting additional suitable headlines. Classifying the retrieved headlines based on Adam Smith’s price theory allows the creation of a silver standard dataset, which serves as a practical and cost-effective alternative to human-curated training datasets. Applying this method to the domain of crude oil led to the creation of a silver standard that has then been used for fine-tuning FinBERT to create CrudeBERT, a domain-specific affective model that provides significantly better results than the original transformer model. In our experiments, which cover crude oil futures price movements over a nine-year period, CrudeBERT outperforms FinBERT and a random baseline at a significance level of 0.05. CrudeBERT even yields better results than RavenPack’s proprietary sentiment analysis model, which has been optimized over years of development, although the observed improvements are only significant at the 0.10 significance level.

Future research on evaluation methods and metrics will help to better understand the relationship between the model’s predictions and future crude oil prices. The presented experiments only shed light upon its short-term prediction performance (i.e., $\mathrm{Return}_{t+1}$, which covers the next business day). Thus, further research is required to investigate CrudeBERT’s suitability for long-term strategies and in different economic environments (e.g., during times of economic boom or recession).

It is also noteworthy that news headlines alone, rather than whole articles, seem to be sufficient for providing insights into the likely direction of price changes. Despite the presented improvements, CrudeBERT still has limitations and will be subject to further development. We also intend to provide CrudeBERT with the ability to distinguish named entities (e.g., countries and oil companies) and numerical clues (e.g., increased by 10 % versus increased by 1 %) to obtain a more fine-grained sentiment score. This improved indicator should no longer be limited to providing information on the direction of price movements but also express their valence. Considering news volume seems to be another strategy for assessing an event’s impact on the market.

Future research will also address the silver standard generation process. The current process, for instance, does not contain any additional logic for handling headlines with contradictory information on future supply and demand (e.g., “Ivory Coast Jan Crude Oil Exports -1 % On Yr, Imports -3 %”). We, therefore, plan to develop strategies for identifying and processing such mixed-signal news headlines.

Furthermore, we aim to assess the feasibility of extending the presented method to other commodities, including perishable (e.g., coffee beans), non-perishable (e.g., natural gas), precious (e.g., gold), and non-precious (e.g., iron ore) ones, where pricing may be influenced by similar factors.

ACKNOWLEDGMENT

We would like to extend our gratitude to Prof Dr Hans Wernher van de Venn and the Institute of Mechatronic Systems at Zurich University of Applied Sciences for their generous support of this research. In addition, we would like to thank Dr Adrian M.P. Brașoveanu for his valuable input on suitable evaluations for the CrudeBERT model. We would also like to thank Dr Sahand Haji Ali Ahmad for his assessment of the relevance of the news categories used in developing the silver standard dataset.

REFERENCES

Araci, D. (2019). FinBERT: Financial Sentiment

Analysis with Pre-trained Language Models.

arXiv:1908.10063 [cs]. arXiv: 1908.10063.

Baboshkin, P. and Uandykova, M. (2021). Multi-source

Model of Heterogeneous Data Analysis for Oil Price

Forecasting. International Journal of Energy Eco-

nomics and Policy, 11(2):384–391.

Bahdanau, D., Cho, K., and Bengio, Y. (2016). Neu-

ral Machine Translation by Jointly Learning to Align

and Translate. arXiv:1409.0473 [cs, stat]. arXiv:

1409.0473.

Brandt, M. W. and Gao, L. (2019). Macro fundamentals or

geopolitical events? A textual analysis of news events

for crude oil. Journal of Empirical Finance, 51:64–94.

Buyuksahin, B. and Harris, J. (2011). Do Speculators Drive Crude Oil Futures Prices? The Energy Journal, 32(2):167–202.

Chollet, F. (2018). Deep learning with Python. Manning

Publications Co, Shelter Island, New York. OCLC:

ocn982650571.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova,

K. (2018). BERT: Pre-training of Deep Bidirec-

tional Transformers for Language Understanding.

arXiv:1810.04805 [cs]. arXiv: 1810.04805.

EIA, U. E. I. A. (2021). Table Deﬁnitions, Sources, and

Explanatory Notes.

Fama, E. F. (1970). Efﬁcient Capital Markets: A Review of

Theory and Empirical Work. The Journal of Finance,

25(2):383.

Hafez, P., Matas, R., Grinis, I., Gomez, F., Kangrga, M.,

and Liu, A. (2020). Factor Investing With Sentiment:

A Look at Asia-Paciﬁc Markets. White Paper.

Hafez, P., Matas, R., Lautizi, F., A. Guerrero-Colón, J., Gómez, M., and Gómez, F. (2018). Effects of Event Sentiment Aggregation: Sum vs. Mean. White Paper, RavenPack.


Hamilton, J. (2008). Understanding Crude Oil Prices. Tech-

nical Report w14492, National Bureau of Economic

Research, Cambridge, MA.

Hovy, E. H. (2015). What are Sentiment, Affect, and Emo-

tion? Applying the Methodology of Michael Zock to

Sentiment Analysis. In Gala, N., Rapp, R., and Bel-

Enguix, G., editors, Language Production, Cognition,

and the Lexicon, pages 13–24. Springer International

Publishing, Cham.

Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Zhao, T.

(2020). SMART: Robust and Efﬁcient Fine-Tuning

for Pre-trained Natural Language Models through

Principled Regularized Optimization. Proceedings

of the 58th Annual Meeting of the Association for

Computational Linguistics, pages 2177–2190. arXiv:

1911.03437.

Li, X., Shang, W., and Wang, S. (2019). Text-based crude

oil price forecasting: A deep learning approach. In-

ternational Journal of Forecasting, 35(4):1548–1560.

Li, X., Xie, H., Chen, L., Wang, J., and Deng, X. (2014).

News impact on stock price return via sentiment anal-

ysis. Knowledge-Based Systems, 69:14–23.

Liew, J. S. Y. (2016). Fine-grained Emotion Detection in

Microblog Text. PhD thesis.

Loughran, T. and McDonald, B. (2011). When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. The Journal of Finance, 66(1):35–65.

Loughran, T. and McDonald, B. (2016). Textual analysis

in accounting and ﬁnance: A survey. Journal of Ac-

counting Research, 54(4):1187–1230.

Malkiel, B. G. (1989). Efﬁcient market hypothesis. In Fi-

nance, pages 127–134. Springer.

Malo, P., Sinha, A., Korhonen, P., Wallenius, J., and Takala, P. (2014). Good debt or bad debt: Detecting semantic orientations in economic texts. Journal of the Association for Information Science and Technology, 65(4):782–796.

McCarthy, R. V., McCarthy, M. M., Ceccucci, W., and Halawi, L. (2019). Applying Predictive Analytics: Finding Value in Data. OCLC: 1204071994.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).

Efﬁcient estimation of word representations in vector

space.

Mohammad, S. M. (2021). Sentiment Analysis: Automatically Detecting Valence, Emotions, and Other Affectual States from Text. arXiv:2005.11882 [cs].

Pennington, J., Socher, R., and Manning, C. (2014). Glove:

Global Vectors for Word Representation. In Proceed-

ings of the 2014 Conference on Empirical Methods in

Natural Language Processing (EMNLP), pages 1532–

1543, Doha, Qatar. Association for Computational

Linguistics.

Plutchik, R. (1982). A psychoevolutionary theory of emotions. Social Science Information.

Qian, B. and Rasheed, K. (2007). Stock market predic-

tion with multiple classiﬁers. Applied Intelligence,

26(1):25–33. Publisher: Springer.

Smith, A. (1776). An Inquiry into the Nature and Causes of

the Wealth of Nations. McMaster University Archive

for the History of Economic Thought.

Susanto, Y., Livingstone, A., Ng, B. C., and Cambria, E.

(2020). The Hourglass model revisited. IEEE Intelli-

gent Systems, 35(5).

Taboada, M. (2016). Sentiment analysis: An overview from

linguistics. Annual Review of Linguistics, 2(1):325–

347.

Tang, D., Qin, B., Feng, X., and Liu, T. (2016). Effec-

tive LSTMs for target-dependent sentiment classiﬁca-

tion. In Proceedings of COLING 2016, the 26th Inter-

national Conference on Computational Linguistics:

Technical Papers, pages 3298–3307, Osaka, Japan.

The COLING 2016 Organizing Committee.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,

L., Gomez, A. N., Kaiser, L., and Polosukhin, I.

(2017). Attention Is All You Need. arXiv:1706.03762

[cs]. arXiv: 1706.03762.

Weichselbraun, A., Steixner, J., Brasoveanu, A. M. P., Scharl, A., Göbel, M., and Nixon, L. J. B. (2022). Automatic Expansion of Domain-Specific Affective Models for Web Intelligence Applications. Cognitive Computation, 14(1):228–245.

Wex, F., Widder, N., Liebmann, M., and Neumann, D.

(2013). Early Warning of Impending Oil Crises Using

the Predictive Power of Online News Stories. In 2013

46th Hawaii International Conference on System Sci-

ences, pages 1512–1521, Wailea, HI, USA. IEEE.

Xing, F., Malandri, L., Zhang, Y., and Cambria, E. (2020).

Financial Sentiment Analysis: An Investigation into

Common Mistakes and Silver Bullets. In Proceed-

ings of the 28th International Conference on Compu-

tational Linguistics, pages 978–987, Barcelona, Spain

(Online). International Committee on Computational

Linguistics.

Yenicelik, K. D. (2020). Understanding and Exploiting Subspace Organization in Contextual Word Embeddings. Master’s thesis, Eidgenössische Technische Hochschule Zürich, Zürich 8006, Switzerland.

Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun,

R., Torralba, A., and Fidler, S. (2015). Aligning Books

and Movies: Towards Story-Like Visual Explanations

by Watching Movies and Reading Books. In Proceed-

ings of the IEEE International Conference on Com-

puter Vision (ICCV).
