Product Embedding for Large-Scale Disaggregated Sales Data
Yinxing Li^a and Nobuhiko Terui^b
Graduate School of Economics and Management, Tohoku University, Japan
Keywords: LDA2Vec, Item2Vec, Demographics, Hierarchical Model, Customer Heterogeneity, Topic Model.
Abstract: This paper proposes a system that incorporates the marketing environment and customer heterogeneity.
We employ and extend the LDA2Vec and Item2Vec approaches for high-dimensional store data. Our study aims not
only to propose a model with better forecasting precision but also to reveal how customer demographics
affect customer behaviour. Our empirical results show that the marketing environment and customer
heterogeneity increase forecasting precision and that demographics have a significant influence on customer
behaviour through the hierarchical model.
1 INTRODUCTION
Marketing data are expanding in several modes
nowadays, as the number of variables explaining
customer behavior has greatly increased, and
automated data collection in the store has also led to
the recording of customer choice decisions from large
sample sizes. Thus, high-dimensional models have
recently gained considerable importance in several
areas, including marketing. Despite the rapid
expansion of available data, Naik et al. (2008)
noted that many algorithms do not scale linearly
but exponentially as the dimension of the variables
expands. This highlights the urgent need for faster
numerical methods and efficient statistical estimators.
While some previous studies focused on
dimension reduction approaches for products (e.g.,
Salakhutdinov and Mnih, 2008; Koren et al., 2009;
Paquet and Koenigstein, 2013), their final goal was
learning product similarities rather than forecasting.
^a https://orcid.org/0000-0001-9335-9802
^b https://orcid.org/0000-0003-4868-0140
After Word2Vec was proposed (Mikolov et al.,
2013) for natural language processing, where it
is designed to deal with high-dimensional sparse
vocabulary data, many studies applied and extended
the model to other fields, such as item
recommendation, including Prod2Vec (Grbovic et al.,
2015), Item2Vec (Barkan and Koenigstein, 2016),
and Meta-Prod2Vec (Vasile et al., 2016). These
approaches indicate that the Word2Vec framework
outperforms existing econometric models in sales
prediction. In addition, Pennington et al. (2014)
proposed a model that factorizes a large-scale word
matrix to improve performance in pairing
similar words. This approach was further employed for
parsing tasks by Levy and Goldberg (2014).
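The basket-level skip-gram idea behind Item2Vec-style approaches can be illustrated with a minimal sketch. The code below is not the model estimated in this paper, nor the implementation of any cited study. The function names (`basket_pairs`, `train_sgns`, `cosine`), the toy baskets, the uniform negative sampling, and all hyperparameters are assumptions chosen for illustration only. It treats each basket as an unordered set, forms every ordered pair of distinct items as a (target, context) example, and fits item vectors by skip-gram with negative sampling.

```python
import numpy as np
from itertools import permutations


def basket_pairs(baskets):
    """Return all ordered (target, context) pairs of distinct items per basket."""
    return [
        (i, j)
        for basket in baskets
        for i, j in permutations(sorted(set(basket)), 2)
    ]


def train_sgns(pairs, n_items, dim=8, n_neg=2, lr=0.05, epochs=100, seed=0):
    """Skip-gram with negative sampling, trained by plain SGD (illustrative)."""
    rng = np.random.default_rng(seed)
    target = rng.normal(scale=0.1, size=(n_items, dim))   # item (target) vectors
    context = rng.normal(scale=0.1, size=(n_items, dim))  # context vectors
    labels = np.zeros(n_neg + 1)
    labels[0] = 1.0  # first slot is the observed co-purchase, the rest are negatives
    for _ in range(epochs):
        for k in rng.permutation(len(pairs)):
            i, j = pairs[k]
            # Assumption: negatives are drawn uniformly and may collide with
            # true co-purchases; real systems usually use a frequency-based sampler.
            ctx = np.concatenate(([j], rng.integers(0, n_items, size=n_neg)))
            u = target[i].copy()
            c = context[ctx]  # fancy indexing returns a copy
            err = 1.0 / (1.0 + np.exp(-(c @ u))) - labels  # logistic-loss gradient
            target[i] -= lr * (err @ c)
            # np.add.at accumulates correctly when ctx contains repeated items
            np.add.at(context, ctx, -lr * np.outer(err, u))
    return target


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy data: items 0-2 tend to be bought together, as do items 3-5.
baskets = [[0, 1, 2], [0, 1], [1, 2], [3, 4, 5], [3, 4], [4, 5]]
pairs = basket_pairs(baskets)
emb = train_sgns(pairs, n_items=6)
print(cosine(emb[0], emb[1]), cosine(emb[0], emb[4]))
```

The learned vectors only encode co-purchase similarity. They carry no coefficient for price, promotion, or demographics, which is the kind of information a purely embedding-based approach leaves out.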
However, the main limitation of the existing
approaches is the lack of interpretability of the model.
Like most nonlinear machine learning
approaches, the Word2Vec framework cannot evaluate
the effect of variables, which may limit its
implications in the marketing field, such as
effective personalization and targeting (Essex, 2009).
Although extension models, such as Prod2Vec,
involve various marketing variables such as price and
customer demographic data, the role of the variables
in forecasting has not yet been discussed.
In light of the limitations mentioned above, we
propose a Word2Vec-based framework that
incorporates marketing variables. The main research
purposes are to (i) improve the precision of
forecasting by involving the hierarchical structure of
the Word2Vec framework with marketing mix
variables, and (ii) investigate and interpret the role of
the marketing mix variables.
In order to fulfil these aims, we analyze the large-
scale sales data of a retail store for our empirical
application. In addition to daily sales data for each
unique customer, our data also include daily price
information, several types of promotional information, and
demographic data for each customer. Our approach is