the leading social network in Spain, used by 34% of
the users who connect to a social network. The
dataset contains all forum discussions in Spanish
available on MySpace on May 16th, 2008. The total
amount of retrieved data was 1.7GB. The oldest
comments in the dataset were published on December
13, 2007 and the newest on the very same day the
data was collected.
2.1 Dataset Basic Statistics
The dataset contains about 300,000 comments produced
by about 23,340 distinct users. The comments are
divided into approximately 25,000 threads. The
longest forum thread has more than 15,000 comments,
which represents more than 5% of all comments
observed in the study, and the most prolific user
has more than 8,500 comments.
The number of posts per thread and the number of
posts per user follow a log-normal distribution with
a heavy tail that follows a power law with cut-off
(Kaltenbrunner et al., 2009).
Males and females follow a similar population
pyramid with slight differences: the most common
age is 18 for males and 17 for females, and the
median age is 22 for males and 20 for females.
There are 18 countries with more than 50 users.
Mexico comes first (30% of all users), followed by
Spain (4,000 users) and the US (3,500 users), while
10% of the users do not specify the country they
belong to.
3 PREPROCESSING AND
SELECTION OF THE DATA
In the original Spanish data, we found that many
words were affected by encoding inconsistencies.
This problem is especially acute in Spanish, given
its many accented words and special characters such
as ñ. Therefore, we had to perform a manual
preprocessing step to normalize all affected
characters.
The forum data contained users from over 20
different countries. Given that many countries had
only a few users, which made the data sparse, we
decided to select a subset of countries. In order to
identify users by origin, we discarded those
countries in which Spanish is not the official
language, as the Spanish speakers there may be
non-native speakers or immigrants from any country.
Additionally, we selected countries with over 30
users who had written more than 100 words in the
forums. The last requirement was arbitrarily chosen
to ensure enough data to train the classifier. This
filter kept the following countries: Mexico, Spain,
Argentina, Chile, Colombia, Peru and Venezuela.
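The country selection described above can be
illustrated with a minimal Python sketch. It is not
the code used in the study: the function name, the
record layout (user id, country, word count) and the
set of Spanish-speaking countries passed in are all
assumptions made for illustration.

```python
from collections import defaultdict

# Thresholds taken from the text; the constant names are hypothetical.
MIN_WORDS_PER_USER = 100  # a user counts only with more than 100 words
MIN_ACTIVE_USERS = 30     # a country needs more than 30 such users


def select_countries(users, spanish_official,
                     min_words=MIN_WORDS_PER_USER,
                     min_users=MIN_ACTIVE_USERS):
    """Return the sorted list of countries that pass both filters.

    users: iterable of (user_id, country, word_count) tuples (assumed layout).
    spanish_official: set of countries where Spanish is the official language.
    """
    active_users = defaultdict(int)
    for _user_id, country, word_count in users:
        # Skip countries where Spanish is not official and users who
        # wrote too little text to be useful for training.
        if country in spanish_official and word_count > min_words:
            active_users[country] += 1
    return sorted(c for c, n in active_users.items() if n > min_users)
```

Users who left the country field empty are dropped
implicitly, since an empty value is never part of
the set of Spanish-speaking countries.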
The age distribution shows an abnormal number of
users older than 80 years. This is probably because
they either did not fill in their birth date
correctly or simply invented one. We discarded all
users older than 80 years.
4 EXPERIMENTS WITH
STANDARD CLASSIFICATION
TECHNIQUES
Recent studies using the English language show that
a number of stylistic and content-based indicators
are significantly affected by both age and gender
(Argamon et al., 2007).
In this study, we propose to analyse all the
contributions of each user to the different forums.
We study whether the available information (bag of
words, number of comments, discussions, started
threads and typed characters) can be useful to
determine the user’s country, gender and age.
We grouped the seven countries mentioned in the
previous section into 3 broader categories: Central
America, South America and Spain.
Given the variety of ages (from 14 to 80), we also
grouped them into 3 broader categories: 14 to 18
years, 19 to 24 years and 25 years or more.
Finally, the gender has 2 categories: male or
female.
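As a minimal sketch of how these class labels could
be derived, the Python snippet below maps a country
to a region and an age to an age group. The
assignment of countries to regions is an assumption
(the text names the three regions but not which
country falls in which), as is the choice to place
age 25 in the oldest group. All names are
hypothetical.

```python
# Assumed grouping of the seven selected countries into the three regions.
REGION_BY_COUNTRY = {
    "Mexico": "Central America",
    "Argentina": "South America",
    "Chile": "South America",
    "Colombia": "South America",
    "Peru": "South America",
    "Venezuela": "South America",
    "Spain": "Spain",
}


def region(country):
    """Map one of the seven selected countries to its (assumed) region."""
    return REGION_BY_COUNTRY[country]


def age_group(age):
    """Map an age between 14 and 80 to one of the three age categories."""
    if not 14 <= age <= 80:
        raise ValueError("age outside the range kept in the dataset")
    if age <= 18:
        return "14-18"
    if age <= 24:
        return "19-24"
    return "25+"  # assumption: age 25 belongs to the oldest group
```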
For classification, we used the open source
software WEKA¹. Our attributes were words, and by
default we selected the most relevant ones by means
of the standard TF-IDF weighting (Salton and McGill,
1983). Evaluation was done by a standard 10-fold
cross-validation.
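To make the two ingredients concrete, the following
Python sketch computes a textbook TF-IDF weight (raw
term frequency times the logarithm of the inverse
document frequency) and splits items into 10 folds.
It does not reproduce WEKA's implementation, whose
exact weighting and fold assignment may differ; the
function names and the round-robin fold assignment
are assumptions chosen for brevity.

```python
import math
from collections import Counter


def tfidf(docs):
    """Weight each term of each document by tf * log(N / df).

    docs: list of token lists, one per user (assumed representation).
    Returns one {term: weight} dictionary per document.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weighted = []
    for tokens in docs:
        term_freq = Counter(tokens)
        weighted.append({term: count * math.log(n_docs / doc_freq[term])
                         for term, count in term_freq.items()})
    return weighted


def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; item i is tested in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test
```

A term used by every user gets weight zero, which is
why TF-IDF favours words that separate users from
one another. In practice the items would usually be
shuffled or stratified by class before being
assigned to folds.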
Figure 4 shows the user distribution in each
category. This distribution allows us to set a
baseline result for classification: around 51% for
gender, around 45% for origin and around 35% for
age.
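Assuming the baseline is the accuracy obtained by
always predicting the most frequent category, it can
be computed from the class distribution as in this
Python sketch. The function name is hypothetical and
the labels in the usage example are made-up toy
values, not the counts of the dataset.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the share of items carrying it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Toy example with invented labels: 51 items of class "a", 49 of class "b".
toy_labels = ["a"] * 51 + ["b"] * 49
best_label, accuracy = majority_baseline(toy_labels)
```

Any classifier worth using has to beat this
accuracy, which is why more balanced categories give
a lower baseline.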
Given the proposed classifications and their
categories, we want to analyse the following:
1. Which features improve the classification?
¹ http://www.cs.waikato.ac.nz/ml/weka/
WHERE ARE YOU FROM? - Tell me HOW you Write and I Will Tell you WHO you Are