based Classifiers, Probabilistic and Naïve Bayes
Classifiers, Proximity-based Classifiers and Linear
Classifiers which include SVM Classifiers,
Regression-based Classifiers, and Neural Network
Classifiers.
For the purposes of this study, attention was
centered on Linear Classifiers, since the experiments
were carried out with SVM, although initial tests
also tried other classifiers of this group. Linear
Classifiers have been developed independently;
however, they are similar at a basic conceptual level,
and the main differences lie in the details of the
objective function that is optimized and the iterative
approach used to determine the optimum direction of
separation (Aggarwal and Zhai, 2012).
Support Vector Machines are based on the
principle of determining separators in the search
space that best separate the different classes.
According to Aggarwal and Zhai (2012), SVMs can
also be used in a semi-supervised setting that
exploits unlabeled data in the classification process,
and this approach is quite scalable because it relies
on a number of modified quasi-Newton techniques,
which tend to be efficient in practice.
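The idea of a separator chosen by optimizing an objective function can be illustrated with a minimal sketch. It is not the implementation used in the study: it assumes a regularized hinge-loss objective and, for brevity, swaps the quasi-Newton optimizers mentioned above for plain subgradient descent. The toy data, the function names, and the hyperparameters (lam, lr, epochs) are all hypothetical.

```python
# Minimal linear SVM sketch: subgradient descent on the regularized hinge loss.
# Illustrative only; data, learning rate, and epoch count are assumptions.

def dot(w, x):
    """Inner product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Return (w, b) for the separator w.x + b = 0; labels must be +1 or -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (dot(w, x) + b)
            # The regularization term shrinks w on every step.
            w = [wi - lr * lam * wi for wi in w]
            if margin < 1:
                # The sample violates the margin, so the hinge loss is active.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Classify x by the side of the separator it falls on."""
    return 1 if dot(w, x) + b >= 0 else -1

# Toy, linearly separable two-feature vectors (hypothetical).
X = [[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
     [-2.0, -2.0], [-3.0, -1.0], [-1.5, -3.0]]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

Other linear classifiers keep the same decision rule and differ mainly in the loss that replaces the hinge term and in the optimizer used to minimize it.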
On the other hand, the age classification problem
has been addressed in the literature mainly by studies
dedicated to image analysis (Ramesha et al., 2010;
Gao & Ai, 2009; Ylioinas et al., 2012; Levi & Hassner,
2015; Rybintsev et al., 2015). Although some of them
use the algorithms mentioned above, the literature on
text-based age classification is not broad.
3 METHODOLOGY
This section presents the steps we followed to test
the performance of linear classification algorithms in
identifying the age of Twitter users. The first
subsection presents the construction of the training
database in two parts: (a) identification and
extraction of people's accounts on Twitter, and (b)
identification of age expressions in account
descriptions based on a Spanish lexicon created
around the concept of “cumpleaños” (birthday). The
second subsection shows the analysis process for the
derived variable “age range”.
3.1 Training Database Construction
Before creating the classifier, it was necessary to
have a database that makes it possible not only to
train a text classification model but also to extract
age-related information from the account profile
description. The information was collected from
Twitter’s REST API during August 2017, restricted to
Spanish-language tweets. All the datasets obtained
were manually validated by the CAOBA expert group.
3.1.1 Algorithm for Identifying People's
Accounts from the Twitter REST API
The experiment began with a search for Twitter user
accounts of people located in Colombia, using as
input the Twitter accounts of the 113 universities
registered in the country. A total of 114,953 linked
users were found, of which only 82,147 were active.
In addition, 19,378 Colombian celebrity accounts
were taken. From these accounts, 1,241,248 linked
users were identified, and it was possible to obtain
data from 967,660 of them.
A validation process was carried out to determine
whether each profile belonged to a person, which
required the manual construction of a list of 8,701
personal names. This analysis was performed on the
name and screen_name fields of the account profile
using a BoW (Bag of Words) process (Moreno et al.,
2017). This made it possible to determine whether an
account belonged to a person, meaning that it had a
person's name associated with it, or was a corporate
account. In total, 50,819 accounts of people linked to
universities and 734,037 accounts of people linked to
celebrities were obtained.
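The name-based validation can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Moreno et al. (2017): the five-entry PERSON_NAMES set is a hypothetical stand-in for the 8,701-name list, and the accent stripping and camelCase splitting are assumed preprocessing choices that the text does not describe.

```python
import re
import unicodedata

# Hypothetical excerpt; the study used a manually built list of 8,701 names.
PERSON_NAMES = {"maria", "jose", "juan", "andres", "camila"}

def normalize(text):
    """Lowercase and strip accents so that 'María' matches 'maria'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def bag_of_words(text):
    """Return the set of alphabetic tokens in text.

    camelCase boundaries are split first ("JuanPerez" -> "Juan Perez");
    underscores and digits act as separators.
    """
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    return set(re.findall(r"[a-z]+", normalize(spaced)))

def is_person_account(name, screen_name):
    """Flag the account as a person's if either field contains a known name."""
    tokens = bag_of_words(name) | bag_of_words(screen_name)
    return bool(tokens & PERSON_NAMES)
```

For example, `is_person_account("María Gómez", "mgomez_91")` matches on the token "maria", whereas an account named "Universidad Nacional" with screen name "UNColombia" yields no match and would be treated as a corporate account.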
3.1.2 Algorithm for Identifying Age
Expressions in Account Descriptions
from a Spanish Lexicon
Once the accounts of people to analyze had been
identified, a lexicon of Spanish expressions related
to the theme "cumpleaños" in the age category was
created, yielding 50 words. With this lexicon in
place, we analyzed the description of each user
account using BoW techniques. We identified 1,159
accounts linked to universities and 41,183 accounts
linked to celebrities whose descriptions contained
age-related text from this lexicon.
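A lexicon match over account descriptions might look like the sketch below. The 50-word lexicon is not listed in the text, so the AGE_LEXICON entries are hypothetical placeholders, and the extract_age helper (a number immediately followed by "años", within an assumed plausible range) is only one possible way of deriving an age and is not confirmed by the source.

```python
import re

# Hypothetical excerpt; the study's lexicon of 50 expressions is not listed.
AGE_LEXICON = {"cumpleaños", "años", "edad", "cumplo"}

def tokenize(description):
    """Lowercase the text and split it into Spanish word and number tokens."""
    return re.findall(r"[a-záéíóúüñ0-9]+", description.lower())

def has_age_expression(description):
    """True if any token of the description belongs to the lexicon."""
    return any(token in AGE_LEXICON for token in tokenize(description))

def extract_age(description, min_age=10, max_age=99):
    """Return the number directly before 'años' if plausible, else None.

    The pattern and the age bounds are assumptions made for illustration.
    """
    tokens = tokenize(description)
    for prev, token in zip(tokens, tokens[1:]):
        if token == "años" and prev.isdigit():
            age = int(prev)
            if min_age <= age <= max_age:
                return age
    return None
```

With these assumptions, a description such as "Estudiante, 21 años, Bogotá" is flagged and yields an age of 21, while "Feliz cumpleaños a mí" is flagged by the lexicon but carries no extractable age.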
Age Classification from Spanish Tweets