Visualizing and analyzing these new large data sets
is a challenge for traditional statistical tools. In
recent years, several research groups have developed
new supervised and unsupervised machine-learning
techniques to analyze them (Di Clemente et al., 2018;
Pentland, 2013). In this paper we use an unsupervised
method known as t-SNE (Maaten and Hinton, 2008).
t-SNE is a non-linear embedding technique based on
manifold learning that can represent all of our
high-dimensional consumer profiles in a
two-dimensional space while aiming to preserve the
original relative distances between users.
This embedded representation can be used as a basis
for more relevant clustering than demographic
segments. In addition, unsupervised methods like t-SNE
can surface meaningful and sometimes counterintuitive
insights from the data set that could otherwise be
missed. As an example, in this paper we use the t-SNE
embedding to identify two clusters of individuals:
one with above-average housing consumption and
another with above-average recreation consumption.
We found middle-aged individuals over-indexing in the
former, and younger, poorer males over-indexing in
the latter.
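For reference, the objective that t-SNE minimizes in its standard formulation (Maaten and Hinton, 2008) can be sketched as below. The notation is an assumption introduced here for illustration and is not used elsewhere in the paper: x_i is the high-dimensional consumer profile of user i, y_i is its two-dimensional embedding, N is the number of users, and sigma_i is a per-point bandwidth set through the perplexity parameter.

```latex
\begin{align}
p_{j|i} &= \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}, \\
q_{ij} &= \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}, \\
C &= \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}.
\end{align}
```

Minimizing C keeps users that are close in the original space close in the embedding, which is why the method preserves relative distances locally rather than exactly.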
The remainder of this paper is structured as
follows. Section 2 describes the methods, including a
data overview and a description of the consumption
categories. Section 3 presents the results: section 3.1
outlines our pipeline for constructing consumption
profiles from micro-transaction data and explores the
mean consumer profile in our sample; section 3.2
explores the differences in consumption across
demographics; section 3.3 shows the problems of this
traditional approach; and section 3.4 shows how
manifold learning can serve as a basis for more
relevant clusters, allowing us to analyze the
consumers of specific categories independently.
Finally, we discuss our results in section 4.
2 METHODS
2.1 Data Overview
The data set comprises almost 24 million banking
transactions from 49,965 users of a Spain-based
Personal Finance Management service. It covers all
transactions for those users in 2017, including
inbound/outbound money transfers, card payments and
cash withdrawals. In this work we only analyzed
transactions that can be connected to a consumption
category (section 2.2 and Appendix). Each user is
described by their Region, Age range, Income range,
and Gender. Demographic slices are structured as
follows:
• Region contains the Spanish region (Comunidad
Autónoma) where the user lives.
• Age range is divided into the following ranges: 18
to 25, 26 to 35, 36 to 45, 46 to 55 and Over 55.
• Income range comprises ranges defined by the
annualized average monthly income. The ranges
are the following: Under €584, €584 to €1,083,
€1,084 to €1,583, €1,584 to €2,416, €2,417 to
€3,333, €3,334 to €4,166, €4,167 to €5,000,
€5,001 to €5,834 and Over €5,834.
• Gender: Male or female.
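The age and income slices listed above can be expressed as a simple lookup. The following is a minimal sketch, not the authors' implementation: the function and constant names are hypothetical, and it assumes that ages are given in whole years and that incomes are rounded to whole euros before bucketing, since the text does not say how fractional amounts are handled.

```python
from bisect import bisect_right

# Inclusive lower bounds of every age range after the first one.
AGE_BOUNDS = [26, 36, 46, 56]
AGE_LABELS = ["18 to 25", "26 to 35", "36 to 45", "46 to 55", "Over 55"]

# Inclusive lower bounds (whole euros) of every income range after the first.
INCOME_BOUNDS = [584, 1084, 1584, 2417, 3334, 4167, 5001, 5835]
INCOME_LABELS = [
    "Under €584", "€584 to €1,083", "€1,084 to €1,583", "€1,584 to €2,416",
    "€2,417 to €3,333", "€3,334 to €4,166", "€4,167 to €5,000",
    "€5,001 to €5,834", "Over €5,834",
]


def age_range(age: int) -> str:
    """Return the age-range label for an age in whole years."""
    if age < 18:
        raise ValueError("the listed ranges start at 18")
    return AGE_LABELS[bisect_right(AGE_BOUNDS, age)]


def income_range(monthly_income: float) -> str:
    """Return the income-range label for an average monthly income in euros."""
    # Assumption: amounts are rounded to whole euros before bucketing.
    return INCOME_LABELS[bisect_right(INCOME_BOUNDS, round(monthly_income))]
```

For instance, `age_range(37)` falls in "36 to 45" and `income_range(1200)` falls in "€1,084 to €1,583".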
Since the main objective of this paper is not to
analyze the Spanish population but to introduce novel
methods to analyze big data sources, the data is not
resampled to mimic the Spanish population. As
explained in the next sections, this reduces the
complexity of the approach. The full description of
the sample distribution across regions, age ranges,
income ranges and gender can be found in the Appendix.
Significance is measured with respect to this data
sample. Users have given express consent to research
and commercial exploitation of the data, in accordance
with the General Data Protection Regulation, GDPR
(Council of European Union, 2016). The data has been
irreversibly anonymized using differential privacy
techniques.
2.2 Consumption Categories
We used our proprietary classification of consumption
categories, which is based on the COICOP standard
(Classification of Individual Consumption according to
Purpose) for households developed by the United
Nations Statistics Division (United Nations, DESA,
2018). This standard classifies consumption based on
the purpose of the goods or services acquired. Our
original proprietary classification includes 42
categories, but the granularity of this data set
allows us to map our transactions to only 27 of them
(Appendix). Although all calculations in this work
include all 27 categories, in the interest of clarity
most of the figures in this paper include only the 12
most relevant categories.
Transactions in the loans and transfers categories
are reassigned to the Housing category based on
amount, recurrence, and whether the corresponding
user has been identified as a loan or mortgage holder.
Transactions in low-frequency categories at the
commerce level (those covering less than 4% of the
total) were reassigned at random, keeping the relative
proportions of the remaining categories for that
commerce.
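The random reassignment of low-frequency categories could be sketched as follows. This is an illustrative reading of the description above, not the authors' code: it assumes that "commerce" means a single merchant, that the 4% share is computed over that merchant's transaction count, and that each rare transaction is redrawn independently with weights equal to the counts of the remaining categories. The function name and the sample labels are hypothetical.

```python
import random
from collections import Counter

MIN_SHARE = 0.04  # threshold quoted in the text: categories under 4% of the total


def reassign_rare_categories(categories, rng):
    """Reassign one merchant's rare category labels.

    `categories` holds one category label per transaction of a single
    merchant. Labels whose share is below MIN_SHARE are replaced by a label
    drawn at random, weighted by the counts of the remaining categories.
    """
    counts = Counter(categories)
    total = len(categories)
    kept = {cat: n for cat, n in counts.items() if n / total >= MIN_SHARE}
    if not kept or len(kept) == len(counts):
        # Nothing to reassign, or no frequent category to reassign to.
        return list(categories)
    labels, weights = zip(*kept.items())
    return [
        cat if cat in kept else rng.choices(labels, weights=weights)[0]
        for cat in categories
    ]


# Hypothetical merchant with one category covering 2% of its transactions.
rng = random.Random(0)
sample = ["Food"] * 60 + ["Restaurants"] * 38 + ["Other"] * 2
cleaned = reassign_rare_categories(sample, rng)
```

Drawing with weights proportional to the remaining counts keeps the expected proportions of the frequent categories unchanged, which matches the stated goal of preserving their relative proportions.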
DATA 2019 - 8th International Conference on Data Science, Technology and Applications