Learning to Predict Email Open Rates Using Subject and Sender

Daniel Vitor de Oliveira Santos

and Wladmir Cardoso Brand

Institute of Exact Sciences and Informatics, Pontiﬁcal Catholic University of Minas Gerais,

Dom Jos

e Gaspar Street, 500, Belo Horizonte, Brazil

Keywords:

Predictive Analytics, Email Marketing, Campaign Optimization, Text Mining, Machine Learning.

Abstract:

The burgeoning daily volume of emails has metamorphosed user inboxes into a battleground where marketers

vie for attention. This paper investigates the pivotal role of email subject lines in inﬂuencing open rates,

a critical metric in email marketing effectiveness. We employ text mining and advanced machine learning

methodologies to predict email open rates, utilizing subject lines and sender information. Our comparative

analysis spans eight regression models, leveraging diverse strategies such as morphological text attributes, op-

erational business factors, and semantic embeddings derived from TF-IDF, Word2Vec, and OpenAI’s language

models. The dataset comprises historical email campaign data, enabling the development and validation of

our predictive models. Notably, the CatBoost model, augmented with operational features and dimensionally

reduced embeddings, demonstrates superior performance, achieving a Root Mean Squared Error (RMSE) of

5.16, Mean Absolute Error (MAE) of 3.60, a Coefﬁcient of Determination (R

) of 77.53%, and Mean Absolute

Percentage Error (MAPE) of 14.73%. These results provide actionable insights for improving subject lines

and email marketing strategies, offering practical tools for practitioners and researchers.

1 INTRODUCTION

The digital revolution has signiﬁcantly transformed

personal and professional communication. Among

the various digital communication tools, email stands

out as one of the most powerful and effective, par-

ticularly in marketing. The number of email users

worldwide is approximately 4.3 billion in 2023 and

is expected to exceed 4.8 billion by 2027 (The Radi-

cati Group, 2023). Additionally, the total number of

business and consumer emails sent and received daily

surpassed 347 billion in 2023 and is projected to grow

to over 408 billion by the end of 2027.

Email marketing plays a crucial role in acquiring

new customers and retaining existing ones, standing

out in a digital environment saturated with informa-

tion. The effectiveness of email campaigns largely

depends on the subject lines, which are among the

ﬁrst elements seen by recipients. Well-crafted subject

lines, including personalization, emotional triggers

and conciseness, can signiﬁcantly increase open rates

compared to generic subject lines (Stupar-Rutenfrans

et al., 2019). However, many companies still rely

on intuition rather than a systematic, data-driven ap-

https://orcid.org/0009-0008-4116-6070

https://orcid.org/0000-0002-1523-1616

proach to crafting these crucial elements.

The high return on investment (ROI) of $40 for

every dollar spent on email marketing reported by

sources, such as Omnisend’s 2022 data report, under-

scores the importance of optimizing every aspect of

email campaigns, particularly the subject lines. This

high ROI is attributable to the direct and personal-

ized nature of email communication, which strength-

ens customer engagement and loyalty. Nevertheless,

achieving such outcomes is heavily dependent on the

effectiveness of the subject lines, as they signiﬁcantly

impact open rates and engagement (Advisor, 2024).

Despite its effectiveness, creating effective subject

lines is fraught with challenges, primarily due to the

heavy reliance on A/B testing. This method involves

sending different versions of an email to a sample

audience to determine which one yields higher open

rates and engagement metrics. This approach is time-

consuming, costly and fails to keep pace with the dy-

namic nature of consumer behavior, often requiring

constant iteration and testing. Such methods can de-

lay campaign launches and escalate operational costs,

highlighting the need for more efﬁcient strategies.

This paper addresses these challenges by propos-

ing an analytical approach to deepen understanding

the relationship between email subject lines and open

rates. By leveraging comprehensive datasets from

Santos, D. V. O. and Brandão, W. C.

Learning to Predict Email Open Rates Using Subject and Sender.

DOI: 10.5220/0012945700003825

In Proceedings of the 20th International Conference on Web Information Systems and Technologies (WEBIST 2024), pages 59-70

ISBN: 978-989-758-718-4; ISSN: 2184-3252

previous email campaigns and employing advanced

text mining and machine learning techniques, this re-

search aims to develop predictive models to evalu-

ate and maximize subject lines’ effectiveness. These

models are expected to provide actionable insights

that could signiﬁcantly reduce the reliance on inef-

ﬁcient and expensive A/B testing methods.

Eight different regression models including tree-

based methods, ensemble techniques, kernel-based

methods, and artiﬁcial neural networks are explored

to identify the most effective approach for analyz-

ing email marketing effectiveness. Each model type

provides unique advantages: tree-based methods of-

fer interpretability, ensemble techniques provide ro-

bustness, kernel-based methods capture complex re-

lationships, and neural networks can learn non-linear

patterns and handle large datasets. The research be-

gins with extracting textual features from email sub-

ject lines and sender ﬁelds, then integrates opera-

tional features, including contextual and temporal

data, along with the morphological attributes of the

subject lines. In the ﬁnal analysis phase, semantic fea-

tures represented by embeddings are combined with

operational features to assess their cumulative impact

on model performance, supporting the central hypoth-

esis that the semantic content of the subject line crit-

ically inﬂuences open rates, as evidenced by the de-

velopment and testing of these models.

Historical email campaign data, integrating open

rate metrics, were used to develop and validate the

models and approaches. Among these, the CatBoost

model was identiﬁed as particularly effective, incor-

porating operational features with advanced data pro-

cessing techniques such as PCA and OpenAI embed-

dings. This model demonstrated substantial effective-

ness in improving our understanding and optimization

of email subject lines, which are critical for enhancing

the performance of email marketing campaigns.

2 RELATED WORK

Research on email subject lines emphasizes their crit-

ical role in marketing efﬁcacy, particularly how well-

crafted lines can signiﬁcantly boost user engagement.

(Wainer et al., 2011) highlighted that subject lines

that incite curiosity or promise utility tend to achieve

higher open rates, demonstrating that recipients are

more likely to engage with content that piques their

interest or offers immediate value. Further studies like

(Sahni et al., 2016) have shown that personalization,

such as including the recipient’s name, can increase

open rates from 9.05% to 10.80%, leading to a 31%

increase in sales leads and a 17% decrease in unsub-

scribes, illustrating a strong consumer preference for

tailored email content.

Expanding on the importance of aesthetics, (Feld

et al., 2013) noted that visual design elements are

crucial, drawing a parallel between the impactful de-

Meanwhile, advancements in machine learning have

provided methods to enhance this effect; for example,

(Balakrishnan and Parekh, 2014) employed a Random

Forest model to predict open rates, achieving signif-

icant accuracy improvements with a RMSE of 1.60

x 10

−2

and a correlation of 0.352, underscoring the

potential to reﬁne email marketing strategies through

data-driven insights.

In the ﬁnancial sector, (Conceic¸

ao, 2019) applied

a Random Forest algorithm to predict campaign suc-

cess based on open rates, achieving an accuracy of

about 82% by identifying effective subject line key-

words. This complements psychological studies like

those by (Miller and Charles, 2016), which explored

how emotional triggers in subject lines, such as ex-

citement or urgency, signiﬁcantly inﬂuence user in-

teractions by enhancing open rates.

Recent contributions by (Joshi and Banerjee,

2023) and (Paulo et al., 2022) further demonstrate

the utility of integrating deep learning with traditional

machine learning approaches. Joshi’s Ngram-LSTM

model excelled in sparse data conditions, achieving

an accuracy of 84% within a 10% error tolerance. At

the same time, Paulo’s analysis across 140,000 email

subject lines found that Random Forest models were

most effective, achieving an accuracy of 62.4% and an

F1-score of 62.2%. These studies highlight the impor-

tance of combining structural and semantic features

of subject lines to reﬁne the prediction of open rates,

which this research builds upon by employing a multi-

faceted approach that blends semantic and morpho-

logical features with advanced machine learning tech-

niques for enhanced predictive accuracy and deeper

insights.

3 BACKGROUND

3.1 Email Marketing Concepts

Optimizing email marketing campaigns necessitates a

deep understanding of key performance metrics. Met-

rics such as Delivered reﬂect the number of success-

fully received emails, highlighting the efﬁciency of

email delivery systems. The Open metric measures

recipient engagement by recording opened emails.

Due to their reputation, the sender’s identity (Sender)

signiﬁcantly inﬂuences open rates. The Subject line,

WEBIST 2024 - 20th International Conference on Web Information Systems and Technologies

which captures immediate attention, along with the

Open Rate—calculated as follows:

Open Rate =



Number of Opens

Number of Delivered



× 100 (1)

—both dictate the campaign’s success by reﬂect-

ing the effectiveness of subject line strategies.

3.2 Dimensionality Reduction with PCA

Principal Component Analysis (PCA) is a statistical

technique that reduces the dimensionality of datasets,

aiming to preserve as much variability as possible.

This reduction is accomplished by transforming the

original variables into a new set of variables, linear

combinations of the original variables and orthogonal

to each other. These new variables, known as princi-

pal components, are ordered so that the ﬁrst few re-

tain most of the variation in all of the original vari-

ables, setting a foundational stage for handling high-

dimensional data effectively (Hastie et al., 2009).

3.3 Curse of Dimensionality

The concept of the ”curse of dimensionality,” ﬁrst in-

troduced by (Bellman, 1961), refers to various phe-

nomena that arise when analyzing and organizing data

in high-dimensional spaces. As the dimensionality

increases, the volume of the space increases so ex-

ponentially that the available data becomes sparse.

This sparsity is problematic for any method that re-

quires statistical signiﬁcance, as the data points are

too dispersed to be representative. PCA effectively

addresses this issue by reducing the dimensionality,

concentrating the data into its most informative as-

pects and signiﬁcantly diminishing the space in which

the data points are spread. This reduction not only

simpliﬁes the computational demands but also en-

hances the effectiveness of subsequent analyses by

mitigating the impact of the curse of dimensionality.

3.4 Text Embeddings

Following the theme of dimensionality reduction, text

embeddings represent another crucial technique for

managing complex data structures in natural language

processing. Just as PCA helps to reduce the complex-

ity of numerical data, text embeddings simplify the

vast and intricate relationships within text data into

manageable vectors. These vectors allow computa-

tional models to process and interpret large volumes

of text efﬁciently, paving the way for deeper insights

into language patterns and behaviors. Text embed-

dings are advanced techniques used to convert text

into a numerical format, enabling machines to under-

stand and process language. These embeddings repre-

sent words or phrases as vectors in a high-dimensional

space. Each dimension captures some aspect of the

word’s meaning, and similar words are positioned

closer together in this space, reﬂecting their seman-

tic similarities. The process of creating embeddings

involves analyzing a large corpus of text and learning

to map words that have similar meanings close to each

other. This mapping preserves the contextual nuances

of words, allowing algorithms to recognize that cer-

tain words share similar contexts and meanings even

if they never appear together in the text.

4 METHODOLOGY

The methodology employed in this paper is system-

atically illustrated in Figure 1 and comprises several

critical stages: (1) Key Concepts and Data Overview,

which provides a foundational understanding of the

metrics and the dataset used; (2) Dataset Prepara-

tion and Feature Engineering, which involves clean-

ing, preprocessing, and engineering features from the

data; (3) Embedding Generation and Text Preprocess-

ing, which captures semantic information from the

text data and prepares it for analysis; (4) Approaches,

which detail the different strategies used to combine

features and embeddings for model input; (5) Models,

which describe the various machine learning mod-

els employed; and (6) Evaluation Metrics and Trans-

formed Target Regressor, which presents all metrics

employed and techniques to improve model perfor-

mance.

4.1 Data Overview

The dataset, which includes over 4 billion emails

delivered from 4,285 campaigns, was provided by

Etus Media Holding (Etus Media Holding, 2024),

a company specializing in email marketing manage-

ment. The campaigns are between May 2022 and

May 2024. This dataset, in Portuguese, pertains to

Brazilian email campaigns and includes key attributes

such as sender details, subject lines, delivery statis-

tics, and user engagement metrics. Due to privacy and

data security concerns for the clients involved, spe-

ciﬁc information about the data collection methods

and the exact nature of the data cannot be fully dis-

closed. However, the general methodology involves

extracting data through secure queries from a com-

prehensive email marketing database. This dataset

is critical for subsequent preprocessing and analyti-

cal steps. The data is segregated into training (72%),

Learning to Predict Email Open Rates Using Subject and Sender

Figure 1: Flowchart of the methodology.

validation (8%), and test (20%) sets, ensuring com-

prehensive model training and validation. Data nor-

malization and dimensionality reduction via PCA are

critical steps in preparing this dataset for further anal-

ysis and model building.

4.2 Dataset Preparation and Feature

Engineering

Following a thorough data overview, the primary goal

in this phase is to prepare the dataset for robust anal-

ysis by ensuring data quality and extracting relevant

features. Initially, entries with missing critical data,

such as Subject, Sender, and Delivered, are removed.

To ensure the reliability of our analysis, the focus is

on campaigns that sent at least 100,000 emails. This

threshold is selected based on the observation that

campaigns with fewer emails often include pilot tests

and may contain inconsistencies or experimental er-

rors. In contrast, campaigns exceeding this email vol-

ume are generally considered validated and provide a

more stable basis for statistical analysis, reducing the

likelihood of anomalies that could skew the results.

Feature engineering enhances the model’s predic-

tive capabilities. Created features include sender open

rate (the average rate at which emails from the sender

are opened), sender count (total campaigns from the

sender), and sender sum delivered (total emails deliv-

ered by the sender). Temporal aspects such as hour

of the day and day of the week are used alongside

email-speciﬁc metrics like email open rate and deliv-

ered scaled (normalized delivery volume), enriching

the dataset with contextual and behavioral insights.

Feature selection, leveraging domain expertise

and statistical methods, narrow down to the most

impactful features, including detailed email metrics

and temporal data points that signiﬁcantly affect open

rates. This careful preparation sets the stage for the

next crucial step of embedding generation and text

preprocessing, further reﬁning the data for predictive

modeling.

4.3 Embedding Generation

After preparing the dataset, this phase generates text

embeddings to analyze the semantics of email sub-

ject lines, which are crucial for predicting open rates.

We use three types of embeddings: TF-IDF, which

highlights keyword importance; Word2Vec, captur-

ing contextual word relationships; and OpenAI’s pre-

trained models, providing advanced semantic under-

standing.

The Term Frequency-Inverse Document Fre-

quency (TF-IDF) method transforms text data into

numerical features by evaluating the importance of

words within the subject lines relative to their fre-

quency across all subject lines. This technique is ben-

eﬁcial for identifying keywords expected to inﬂuence

email open rates, highlighting signiﬁcant words while

downplaying common terms that may not carry spe-

ciﬁc meaning in the context of email subjects (Salton

and Buckley, 1988).

The Word2Vec model, introduced by Mikolov et

al. (Mikolov et al., 2013), generates vector represen-

tations of words by considering the context in which

they appear. This method captures semantic similari-

ties between words, producing meaningful numerical

embeddings. The continuous bag-of-words (CBOW)

and skip-gram models are particularly effective at

capturing nuanced relationships between words in a

large corpus.

Employing a pre-trained model from OpenAI

(OpenAI, 2024), embeddings are generated to en-

capsulate deeper semantic relationships within the

text. These embeddings leverage large-scale lan-

guage models trained on vast amounts of diverse text

data, providing rich, contextualized representations of

email subjects and senders. This approach is advan-

tageous for capturing complex linguistic patterns and

contextual nuances that simpler models might miss.

The text from the subject lines and the sender’s

information are combined into a single textual input

to generate these embeddings. This combined feature

(Subject + Sender) ensures that the embeddings cap-

WEBIST 2024 - 20th International Conference on Web Information Systems and Technologies

ture both the message content and the sender’s inﬂu-

ence, critical factors determining email open rates. By

employing these three embedding techniques, the re-

search aims to cover a broad spectrum of textual rep-

resentations, from keyword importance and contex-

tual semantics to deep contextual embeddings, lever-

aging the strengths of each method to enhance the pre-

dictive power of our models.

4.4 Approaches

Based on the extensive embeddings and preprocess-

ing processes, this article employs several approaches

to enhance the predictive power of the models by in-

tegrating various types of features and embeddings.

Each approach is designed to optimize different as-

pects of the predictive modeling process, enabling a

thorough examination of how various feature combi-

nations can impact the accuracy of open rate predic-

tions.

4.4.1 Morphologic

The ﬁrst approach, morphologic, focuses on basic text

features from email subjects, such as text length, word

count, presence of special characters, numbers, and

emojis, as well as the uppercase ratio, named entity

count, letter ratio, vowel count, consonant count, and

punctuation count. These features provide a founda-

tional understanding of the structural elements of sub-

ject lines, which can be critical for assessing their ef-

fectiveness.

4.4.2 Operational

Building upon the Morphologic approach, the sec-

ond approach, operational, extends the feature set by

adding contextual and temporal information. This in-

cludes the number of emails delivered, scaled number

of emails delivered, hour of the day, day of the week,

time of day, hourly open rate, weekday open rate, and

time of day open rate. By incorporating these fea-

tures, the model gains insights into the broader op-

erational context within which emails are sent and

received, enhancing its ability to predict open rates

based on real-world email campaign dynamics.

4.4.3 Text Preprocessing and Embedding

Before generating embeddings, the textual data from

email subjects and sender information undergoes sev-

eral preprocessing steps: normalization, cleaning, and

lemmatization. Normalization involves converting all

characters to lowercase and removing special char-

acters and punctuation marks. Cleaning entails re-

moving unnecessary characters such as numbers and

currency symbols (e.g., ’R$’). Lemmatization, per-

formed using the SpaCy library, reduces words to

their base or root form, treating different word forms

as a single entity. These preprocessing steps create a

uniﬁed and cleaned textual input, vital for the genera-

tion of embeddings.

Building upon this foundation, the next set of ap-

proaches focuses exclusively on employing different

types of embeddings: TF-IDF, Word2Vec, and Ope-

nAI embeddings. These embeddings capture deeper

textual semantics that may not be evident through

structural analysis alone. TF-IDF and Word2Vec

embeddings were trained on the dataset to high-

light key terms and contextual meanings. Addition-

ally, OpenAI embeddings, generated using the text-

embedding-3-small model from OpenAI’s API, pro-

vide a more sophisticated analysis of the text, lever-

aging large-scale language models.

This combined approach to text preprocessing and

embedding generation sets the stage for developing

advanced predictive models, moving forward to the

strategies that will employ these embeddings to effec-

tively forecast email open rates.

4.4.4 Combined

Finally, the most innovative approaches combine op-

erational features with each type of embedding. For

these strategies, PCA is utilized to reduce the dimen-

sionality of the embeddings to six dimensions, effec-

tively addressing the curse of dimensionality and en-

hancing our models’ performance. By applying di-

mensionality reduction, we focus on the most infor-

mative features, thereby improving the efﬁciency and

accuracy of our predictive algorithms.

By integrating these diverse strategies, the re-

search aims to test various combinations of features

and modeling techniques to identify the most effec-

tive approach for predicting email open rates. This

comprehensive approach allows for evaluating each

strategy’s contribution to overall performance, paving

the way for selecting the best-performing model in the

subsequent phase of model development and assess-

ment.

After the strategic development of predictive ca-

pabilities, the next crucial step involves selecting the

most effective model, which will be covered in the

subsequent subsection on models and evaluation.

4.5 Data Preprocessing

The dataset includes several categorical features, such

as the sender and email address, which must be con-

verted into a numerical format for the machine learn-

ing models to process. This conversion is achieved

Learning to Predict Email Open Rates Using Subject and Sender

using the One-Hot Encoding technique, transforming

each categorical feature into a series of binary features

representing unique categories. This step is crucial for

preserving categorical information in a format that the

models can effectively utilize.

Normalization is another essential step in data pre-

processing, ensuring that numerical features have a

consistent scale. Min-Max Scaling was applied to

rescale features to a range between 0 and 1. This

normalization improves the convergence of gradient-

based algorithms and ensures that all features con-

tribute equally during model training.

As previously discussed in the background sec-

tion, PCA is applied to reduce the dimensionality of

the embedding vectors and capture the most important

features. This step simpliﬁes the model and improves

computational efﬁciency without signiﬁcant loss of

information. The number of dimensions was deter-

mined through an iterative method, testing ranges

from 1 to 100 dimensions, with six being identiﬁed as

the optimal choice. The embeddings are thus trans-

formed into a smaller set of six principal components

and then used as input features for the predictive mod-

els.

To improve the model performance and stabilize

the variance of the target variable, each model was

trained using the TransformedTargetRegressor tech-

nique. This technique applies a transformation to

the target variable during training and inversely trans-

forms the predictions back to the original scale. In

this article, a logarithmic transformation was used due

to its effectiveness in stabilizing variance and manag-

ing skewed distributions, as shown in Equation 2 and

Equation 3.

The logarithmic transformation is applied as fol-

lows:

transformed

= log(y + 1) (2)

The inverse transformation to retrieve the original

target values is:

original

= exp(y

transformed

) − 1 (3)

The advantage of using TransformedTargetRe-

gressor is that it allows models to better handle non-

linear relationships and heteroscedasticity in the data.

By stabilizing the variance, the models can achieve

more accurate and reliable predictions, particularly

when dealing with target variables that have a wide

range of values.

4.6 Models

Advancing from the strategic integration of features

and embeddings, this phase concentrates on the rig-

orous comparison and identiﬁcation of the best-

performing predictive model. Given the complexity

of email behavior and the nuances captured by var-

ious feature sets, selecting the most effective model

is crucial for achieving high predictive accuracy in

open-rate forecasting.

Various regression techniques have been selected

based on their popularity in the literature and effec-

tiveness in handling different types of data and com-

plexities. These include tree-based methods, ensem-

ble techniques, kernel-based methods, and neural net-

works, each offering unique advantages and suited for

speciﬁc aspects of the analyzed data.

Tree-based methods, such as Decision Trees and

Random Forest models, are appreciated for their in-

terpretability and ability to handle numerical and cat-

egorical data. Decision Tree Regressor predicts the

value of a target variable by learning simple decision

rules inferred from the data features. At the same

time, Random Forest improves predictive accuracy

and controls over-ﬁtting by constructing multiple de-

cision trees and aggregating their predictions.

Ensemble techniques, including Gradient Boost-

ing, LightGBM, XGBoost, and CatBoost, are em-

ployed for their robustness and high accuracy. These

methods build models sequentially or in parallel, cor-

recting errors made by previous models or combin-

ing multiple models to enhance overall performance.

For instance, LightGBM and XGBoost are particu-

larly noted for their efﬁciency with large datasets and

their capability to optimize a differentiable loss func-

tion using gradient descent.

Kernel-based methods, represented by Support

Vector Regressor (SVR), offer versatility in both lin-

ear and non-linear regression tasks through kernel

tricks, making them effective in high-dimensional

spaces.

Neural networks, such as the Multilayer Percep-

tron Regressor (MLP), are capable of modeling com-

plex non-linear relationships due to their deep learn-

ing capabilities and multiple layers of neurons. The

MLP model used in this article is conﬁgured with a

speciﬁc architecture and learning parameters to opti-

mize performance. Speciﬁcally, the MLP employs a

single hidden layer consisting of 1024 units, utilizing

the rectiﬁed linear unit (ReLU) activation function.

Training is conducted using the Adam optimizer with

a constant learning rate of 0.001. These settings are

selected for their effectiveness in handling complex

patterns and interactions within the data, enhancing

the MLP’s ability to learn and make accurate predic-

tions.

Each model undergoes a thorough training and

evaluation process using the prepared datasets, split

into training, validation, and test sets as previously

outlined. This rigorous evaluation includes cross-

WEBIST 2024 - 20th International Conference on Web Information Systems and Technologies

validation and performance assessment using multi-

ple metrics to ensure that the models perform well

on training data and generalize effectively to unseen

data.

4.7 Evaluation Metrics

The evaluation of predictive models employed in this

article utilizes a comprehensive set of metrics, each

providing unique insights into the model’s perfor-

mance and the accuracy of the predictions. These

metrics are essential for a thorough understanding of

how well the models predict email open rates, and for

selecting the best model based on empirical evidence.

Mean Absolute Error (MAE), as shown in Equa-

tion 4, is used to measure the average magnitude of

the errors in a set of predictions, providing a straight-

forward indicator of prediction error magnitude with-

out considering the direction of the errors.

Root Mean Squared Error (RMSE), given by

Equation 5, takes the square root of the average

squared differences between predicted and actual val-

ues. It is favored in scenarios where larger errors are

more detrimental, as it gives greater weight to them,

thus providing a more sensitive measure of prediction

accuracy.

The Coefﬁcient of Determination (R²), deﬁned in

Equation 6, is crucial for assessing the proportion of

variance in the dependent variable that can be pre-

dicted from the independent variables. It helps de-

termine the models’ overall ﬁt to the data, with values

closer to 1 indicating a more accurate model.

Mean Absolute Percentage Error (MAPE), ex-

pressed in Equation 7, presents prediction accuracy as

a percentage, offering a clear perspective on the rela-

tive size of the prediction errors concerning the actual

values.

To further enhance model performance and

address issues of non-linear relationships and

heteroscedasticity, the TransformedTargetRegressor

technique with a logarithmic transformation is ap-

plied. This transformation stabilizes the variance of

the target variable, facilitating more reliable predic-

tions. The logarithmic transformation is particularly

effective in managing the effects of skewness in the

target data, enhancing the model’s ability to make ac-

curate predictions across a range of values.

The equations for the evaluation metrics and the

transformation process are as follows:

MAE =

∑

i=1

− ˆy

(4)

RMSE =

∑

i=1

− ˆy

)

(5)

= 1 −

∑

i=1

− ˆy

)

∑

i=1

− ¯y)

(6)

MAPE =

100

∑

i=1



− ˆy



(7)

This rigorous approach to model evaluation en-

sures that the predictive models developed in this arti-

cle are effective in theoretical settings and robust and

applicable in practical scenarios, providing reliable

forecasts that can signiﬁcantly enhance the efﬁcacy

of email marketing campaigns.

5 RESULTS

This section presents a comprehensive analysis of

the performance and optimization of various machine

learning models for predicting email open rates. The

evaluation includes a range of performance metrics,

model selection processes, optimization techniques,

feature importance assessments, diagnostic evalua-

tions, and performance comparisons across different

data segments.

5.1 Model Performance Overview

This section provides an in-depth analysis of the per-

formance of various machine learning models, em-

ploying different strategies with the primary objective

of identifying the most effective approach for predict-

ing email open rates. The evaluation focuses on com-

paring a diverse range of machine learning models

and feature engineering strategies to determine which

combination maximizes predictive performance. The

models were evaluated based on four key metrics:

RMSE, MAE, R

, and MAPE as illustrated in Figure

Among all evaluated models, CatBoost and Light-

GBM emerged as the most effective, particularly

when integrating operational features, OpenAI em-

beddings, and PCA for dimensionality reduction.

This combination not only mitigates the curse of di-

mensionality but also enhances the models’ ability to

discern nuanced patterns within the data, signiﬁcantly

improving predictive accuracy.

Detailed performance metrics analysis revealed

that tree-based models such as Random Forest, Gra-

dient Boosting, LightGBM, and CatBoost are partic-

ularly well-suited for this task, due to their superior

predictive capabilities. These models beneﬁt from en-

semble learning techniques that signiﬁcantly enhance

accuracy and robustness. In contrast, strategies rely-

ing solely on basic text features, such as the morpho-

logic strategy, proved less effective. This underscores

Learning to Predict Email Open Rates Using Subject and Sender

Figure 2: Comparison of RMSE, MAE, R

, and MAPE for Different Models and Strategies.

the complexity of the data and the limitations of sim-

ple text attributes in fully capturing its nuances.

Models incorporating embeddings consistently

outperformed those relying solely on basic text fea-

tures, highlighting the critical role of semantic in-

formation in the prediction process. The integration

of operational features with embeddings, especially

when paired with PCA for dimensionality reduction,

achieved the best results, providing a comprehensive

solution that leverages both textual and operational

data insights.

5.2 Best Model Selection

Following the detailed evaluation of various mod-

els, the CatBoost and LightGBM models consis-

tently demonstrated the highest performance, partic-

ularly when incorporating operational features along-

side OpenAI embeddings with PCA for dimension-

ality reduction. This section details the methodol-

ogy used to identify and optimize the best-performing

model, which holds substantial practical implications

for enhancing email marketing strategies.

A comprehensive ranking system based on a

heuristic aggregation of key performance met-

rics—RMSE, MAE, MAPE, and R

—was imple-

mented to determine the most effective model. For

RMSE, MAE, and MAPE, where lower values signify

better performance, models were ranked from best

to worst. For R

, higher values are better, so rank-

ings were inverted to maintain consistency across all

metrics, with lower ranks indicating superior perfor-

mance.

The ranks for each metric were summed to derive

an overall performance score for each model, facili-

tating a direct comparison of their effectiveness across

diverse conditions. This ranking method, detailed in

Table 1, is beneﬁcial as it accounts for the error mag-

nitude and the proportion of variance the model ex-

plains, providing a comprehensive view of model ro-

bustness and accuracy.

Table 1: Top 5 Best Models Based on Overall Performance.

Type RMSE MAE R

MAPE

CB+E1 5.268 3.705 76.6% 15.0%

LGBM+E1 5.323 3.745 76.1% 15.3%

CB+E2 5.398 3.732 75.4% 15.1%

LGBM+E2 5.392 3.749 75.4% 15.2%

CB+E3 5.422 3.818 75.2% 15.6%

Legend: CB = CatBoost; LGBM = LightGBM; E1 =

Operational + Embedding OpenAI + PCA; E2 = Op-

erational + Embedding TF-IDF + PCA; E3 = Opera-

tional + Embedding Word2Vec + PCA.

The CatBoost model with the Operational + Ope-

nAI + PCA strategy emerged as the best performer,

achieving the lowest RMSE and one of the highest R

values. This model’s ability to effectively integrate

advanced feature engineering with powerful text em-

beddings and dimensionality reduction highlights its

suitability for the task.

WEBIST 2024 - 20th International Conference on Web Information Systems and Technologies

5.3 Model Optimization

Hyperparameter optimization for the CatBoost model

was conducted using a Grid Search approach, which

explored a range of hyperparameter combinations to

maximize predictive performance and ensure robust-

ness against overﬁtting. This process focused on key

parameters that signiﬁcantly inﬂuence the model’s

performance, such as iterations, maximum depth, and

the evaluation metric, as shown in Table 2.

The speciﬁc hyperparameters tuned during this

process included:

Table 2: Optimized Hyperparameters for CatBoost.

Hyperparameter Value

Random Seed 42

Use Best Model True

Evaluation Metric RMSE

Iterations 6000

Max Depth 6

The optimization process prioritized the mini-

mization of the root mean square error (RMSE) as the

primary evaluation metric. This careful tuning aimed

at balancing the model’s bias and variance, crucial for

achieving reliable predictions.

The outcomes of the optimization are highlighted

in the Table 3, demonstrating the improved perfor-

mance of the CatBoost model with these reﬁned set-

tings:

Table 3: Performance of Optimized CatBoost Model.

Metric Value

RMSE 5.164

MAE 3.6097

77.53%

MAPE 14.73%

These results conﬁrm the effectiveness of the Grid

Search method in reﬁning the model’s hyperparam-

eters, signiﬁcantly enhancing its ability to generate

highly accurate predictions. The optimized settings

not only improved key performance metrics but also

conﬁrmed the model’s enhanced capacity to capture

complex patterns in the data, making it highly suitable

for practical applications in predicting email open

rates.

5.4 Feature Importance Analysis

The optimized CatBoost model provided crucial in-

sights into the features that signiﬁcantly inﬂuence

email open rates. This analysis is instrumental in un-

derstanding the key drivers behind recipient engage-

ment and optimizing email marketing strategies ac-

cordingly.

Feature importance analysis identiﬁed critical pre-

dictors for email open rates. The most inﬂuential

feature, Sender Open Rate, underscores the substan-

tial impact of the sender’s reputation on recipient be-

havior, emphasizing the importance of maintaining a

high sender reputation to enhance open rates. Other

notable features include the number of emails deliv-

ered (Delivered and Delivered Scaled), various open

rate metrics (Hour Open Rate and Email Open Rate),

and the principal components of the OpenAI embed-

dings (OpenAI PCA 0, OpenAI PCA 1, etc.). Table

4 lists these top features, offering actionable insights

for marketers.

Table 4: Top 10 Most Important Features in the Optimized

CatBoost Model.

Feature Importance

Sender Open Rate 20.54%

Delivered 9.13%

Delivered Scaled 7.79%

Hour Open Rate 5.94%

Email Open Rate 5.42%

OpenAI PCA 0 4.49%

OpenAI PCA 1 3.63%

OpenAI PCA 2 3.49%

OpenAI PCA 5 3.45%

OpenAI PCA 4 3.29%

The inclusion of principal components from the

OpenAI embeddings as important features demon-

strates their effectiveness in capturing relevant se-

mantic information from the text. These components

help represent textual nuances that contribute to accu-

rate open rate predictions. Additionally, the analysis

reveals that contextual and temporal characteristics,

such as hourly and email open rates, are crucial for

model performance. These features provide valuable

insights into when and how recipients interact with

emails, enabling more precise predictions.

This comprehensive analysis underscores the need

to combine advanced feature engineering with power-

ful textual representations to capture both contextual

and semantic information, signiﬁcantly enhancing the

model’s predictive power.

5.5 Model Diagnostics

To evaluate the performance of our optimized Cat-

Boost model further, we conducted a detailed diag-

nostic analysis using two primary visualizations: the

histogram of residual errors and the actual vs. pre-

dicted values plot. These visualizations provide in-

Learning to Predict Email Open Rates Using Subject and Sender

sights into the model’s error distribution and predic-

tive accuracy.

5.5.1 Histogram of Residual Errors

The histogram of residual errors (Figure 3) displays

the distribution of prediction errors, calculated as the

difference between the predicted and actual values.

Ideally, for a well-performing model, the residuals

should be symmetrically distributed around zero, in-

dicating no signiﬁcant bias in the predictions.

In our analysis, the residual errors form a roughly

normal distribution centered around zero, suggesting

that the CatBoost model’s errors are unbiased. The

presence of a few outliers, indicated by residuals at

the far ends of the distribution, suggests that while

the model performs well on average, it may struggle

with certain extreme cases. Nonetheless, the majority

of errors are close to zero, which is indicative of high

model accuracy.

Figure 3: Histogram of Residual Errors.

5.5.2 Actual vs. Predicted Values

The actual vs. predicted values plot (Figure 4)

provides a scatter plot of actual values against the

model’s predictions. The ideal outcome is for all

points to lie on the diagonal line where the predicted

values equal the actual values. Deviations from this

line represent prediction errors.

In our results, most points are clustered around

the diagonal line, indicating that the model’s predic-

tions closely match the actual values. This alignment

demonstrates the model’s strong predictive capabil-

ities. The spread of points along the diagonal also

highlights that the model maintains consistent perfor-

mance across the range of open rates. However, a

few deviations from the line conﬁrm the presence of

some prediction errors, consistent with the residual

error histogram analysis.

Figure 4: Actual vs. Predicted Values.

5.5.3 Discussion

The combination of these diagnostic plots supports

the robustness and effectiveness of the CatBoost

model. The symmetric distribution of residuals

around zero and the close alignment of actual and pre-

dicted values demonstrate that the model has a high

degree of accuracy and generalizes well to new data.

The analysis highlights that while the model per-

forms exceptionally well overall, attention to outliers

and extreme cases may provide opportunities for fur-

ther reﬁnement. These diagnostics conﬁrm that the

model’s strong performance metrics (RMSE, MAE,

, and MAPE) are indicative of its reliability and

predictive power in real-world applications.

5.6 Comparison Across Quartiles

To gain deeper insights into the predictive model’s

performance across different data segments, we an-

alyzed the model by dividing the data into quartiles

based on open rates. We used the KBinsDiscretizer

to discretize the open rates into ﬁve quantiles labeled

from A to E, with quartile A having the lowest and

quartile E having the highest open rates. This method

helps identify speciﬁc areas where the model per-

forms better or worse, as detailed in Table 5.

Table 5: Performance Metrics Across Quartiles.

Cut Min Max RMSE MAE MAPE

A 6.50 18.07 4.94 3.75 27%

B 18.17 23.73 4.21 2.90 14%

C 23.76 28.84 4.03 3.01 12%

D 28.85 35.84 4.31 3.10 10%

E 35.87 74.85 7.50 5.28 12%

The analysis presented in Table 5 demonstrates

that the model achieves optimal performance within

mid-range quartiles (B, C, and D), characterized

by open rates typical of standard email campaigns.

These quartiles exhibit signiﬁcantly lower RMSE,

MAE, and MAPE values, conﬁrming the model’s ro-

WEBIST 2024 - 20th International Conference on Web Information Systems and Technologies

Figure 5: Distribution of Open Rate by Quartile with Metrics.

bust predictive accuracy for common campaign out-

comes. In contrast, the highest quartile (E), which

includes campaigns with exceptionally high open

rates, presents substantial challenges, showing ele-

vated RMSE and MAE values. This pattern suggests

that extreme open rates introduce complexities that

can compromise model accuracy, likely due to out-

lier effects or unique characteristics that are under-

represented in the training data. Additionally, while

the performance in quartile A, representing the low-

est open rates, is relatively better than in quartile E,

it is not as good as in the mid-range quartiles. This

indicates that while the model handles low open rates

reasonably well, there is still room for improvement

in these segments.

Further illustrating the variability in model perfor-

mance, Figure 5 depicts the distribution of open rates

by quartile. This visual representation reinforces the

ﬁndings from Table 5 and highlights the areas where

the model demonstrates strong performance, also un-

derscoring the necessity for further improvements,

particularly in handling extreme values and outliers.

Additionally, the distribution of campaigns is skewed

towards the mid-range quartiles, B, C, and D. This

is where the majority of the campaigns fall, as indi-

cated by the higher frequency in the histogram. The

model’s better performance in these quartiles suggests

that it is well-tuned for the most common scenarios,

making it robust for practical applications.

Overall, this quartile-based analysis highlights the

strengths and limitations of the model. It performs

well for the majority of campaigns but faces chal-

lenges with extreme open rates. These insights are

crucial for understanding the model’s applicability

and for guiding future improvements, particularly in

enhancing its capability to handle outliers and ex-

treme cases more effectively.

6 CONCLUSION AND FUTURE

WORKS

This paper applied text mining and machine learn-

ing techniques to predict email open rates from sub-

ject lines and sender, demonstrating notable predic-

tive accuracy with the optimized CatBoost model.

Enhanced by operational features and advanced text

embeddings, particularly through PCA with OpenAI

embeddings, the model achieved an RMSE of 5.164,

MAE of 3.609, R

of 77.53%, and MAPE of 14.73%.

These results support the hypothesis that the semantic

content of subject lines signiﬁcantly impact open rates

and highlight the importance of the sender’s reputa-

tion, quantiﬁed by the Sender Open Rate. Although

the model performs well on average open rates, it ex-

hibits limitations with extreme values, suggesting the

need of reﬁnements to capture diverse behaviors.

Previous research emphasized the roles of curios-

ity and utility in subject lines (Wainer et al., 2011).

This article builds on these insights by demonstrat-

ing that incorporating advanced text embeddings can

signiﬁcantly enhance predictive capabilities beyond

Learning to Predict Email Open Rates Using Subject and Sender

simple text features. Additionally, personalization

improves open rates (Sahni et al., 2016), a ﬁnding

corroborated by this research, which identiﬁes the

Sender Open Rate as a pivotal predictor. Moreover,

this work extends prevous work (Balakrishnan and

Parekh, 2014) by integrating sophisticated machine

learning techniques like CatBoost with operational

features and PCA, thereby achieving improved pre-

dictive accuracy. Also, this paper conﬁrms the utility

of deep learning and embeddings (Joshi and Banerjee,

2023), and expands upon by illustrating that combin-

ing these embeddings with operational features effec-

tively boosts model performance.

Future research should focus on enhancing the

model’s capability to handle extreme open rates, pos-

sibly through the use of more detailed datasets and

advanced nonlinear modeling techniques. Exploring

Large Language Models (LLMs) to reﬁne text em-

beddings and their direct integration into regression

tasks could markedly improve prediction accuracy

and adaptability.

Optimizing this model could facilitate real-time

adjustments based on campaign feedback, reducing

reliance on traditional A/B testing, which is often

time-intensive and costly. Such an approach enables

highly personalized strategies that respond dynam-

ically to individual recipient behaviors, optimizing

ROI by enhancing the precision and effectiveness of

email marketing campaigns. By merging cutting-edge

machine learning technologies with email analytics,

this methodology adapts to the evolving dynamics of

consumer interactions in the digital era, offering sig-

niﬁcant enhancements over traditional methods.

Ultimately, this article bridges the gap between

theoretical machine learning techniques and their

practical application in marketing, providing a com-

prehensive methodology for enhancing email cam-

paign strategies through data-driven insights, thus

contributing to a more strategic and economically ef-

ﬁcient approach to digital marketing.

ACKNOWLEDGEMENTS

The authors thank the Pontif

ıcia Universidade

Cat

olica de Minas Gerais – PUC-Minas and

Coordenac¸

ao de Aperfeic¸oamento de Pessoal de

ıvel Superior — CAPES (CAPES – Grant PROAP

88887.842889/2023-00 – PUC/MG, Grant PDPG

88887.708960/2022-00 – PUC/MG - Inform

atica,

and Finance Code 001).

REFERENCES

Advisor, F. (2024). Best email marketing software 2024.

Balakrishnan, R. and Parekh, R. (2014). Learning to pre-

dict subject-line opens for large-scale email market-

ing. pages 579–584.

Bellman, R. (1961). Adaptive Control Processes: A Guided

Tour. Princeton University Press, Princeton, NJ.

Conceic¸

ao, A. G. d. (2019). Main factors driving the open

rate of email marketing campaigns. Dissertation for

master in modelling, data analysis and decision sup-

port systems, University of Porto, Porto, Portugal.

Etus Media Holding (2024). About Etus Media Holding.

Feld, S., Frenzen, H., Krafft, M., Peters, K., and Verhoef,

P. C. (2013). The effects of mailing design character-

istics on direct mail campaign performance. Interna-

tional Journal of Research in Marketing, 30(2):143–

159.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Ele-

ments of Statistical Learning: Data Mining, Inference,

and Prediction. Springer, New York, NY, 2nd edition.

Joshi, S. and Banerjee, I. (2023). Ngram-lstm open rate pre-

diction model (nlorpm): Simple, effective, and easy to

implement approach to predict open rates for market-

ing email. International Journal of Digital Marketing.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).

Efﬁcient estimation of word representations in vector

space. In Proceedings of the International Conference

on Learning Representations (ICLR).

Miller, R. and Charles, E. (2016). A psychological based

analysis of marketing email subject lines. In 2016

International Conference on Advances in ICT for

Emerging Regions (ICTer), pages 58–65. IEEE.

OpenAI (2024). Openai api.

Paulo, M., Migu

eis, V. L., and Pereira, I. (2022). Leverag-

ing email marketing: Using the subject line to antici-

pate the open rate. Expert Systems With Applications,

207:117974.

Sahni, N. S., Wheeler, S. C., and Chintagunta, P. K. (2016).

Personalization in email marketing: The role of non-

informative advertising content. Marketing Science,

35(2):216–233.

Salton, G. and Buckley, C. (1988). Term-weighting ap-

proaches in automatic text retrieval. Information Pro-

cessing & Management, 24(5):513–523.

Stupar-Rutenfrans, S., de Koff, D., and van den Elst, J.

(2019). The effect of subject lines on open rates of

email marketing messages. Advances in Social Sci-

ences Research Journal, 6(7):181–188.

The Radicati Group, I. (2023). Email statistics report, 2023-

2027.

Wainer, J., Dabbish, L., and Kraut, R. (2011). Should i open

this email? In Proceedings of the SIGCHI Conference

on Human Factors in Computing Systems, pages 343–

352. ACM.

WEBIST 2024 - 20th International Conference on Web Information Systems and Technologies