
based Classifiers, Probabilistic and Naïve Bayes
Classifiers, Proximity-based Classifiers, and Linear
Classifiers, which include SVM Classifiers,
Regression-based Classifiers, and Neural Network
Classifiers.
For the purposes of this study, attention was
centered on Linear Classifiers, since the experiments
were carried out with SVM, although initial tests also
tried other classifiers of this group. Linear Classifiers
have been developed independently; however, they are
similar at a basic conceptual level, and their main
differences lie in the details of the objective function
that is optimized and in the iterative approach used
to determine the optimum direction of separation
(Aggarwal and Zhai, 2012).
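To make the role of the objective function concrete, the following is the usual textbook form of a regularized linear classifier. The notation (w, b, λ, L) is an assumption introduced here for illustration and is not taken from the study itself.

```latex
% Linear predictor for a document vector x (e.g., term weights)
p = w \cdot x + b

% Generic regularized objective over n training pairs (x_i, y_i), y_i \in \{-1, +1\}
\min_{w,\, b} \; \frac{\lambda}{2} \lVert w \rVert^2
  + \sum_{i=1}^{n} L\bigl(y_i,\; w \cdot x_i + b\bigr)

% Two common choices of the loss L
L_{\text{hinge}}(y, p) = \max(0,\; 1 - y\,p)            % SVM
L_{\text{log}}(y, p)   = \log\bigl(1 + e^{-y\,p}\bigr)  % logistic regression
```

Under this view, an SVM and a regression-based classifier share the same predictor and differ only in the loss L and in the solver used to minimize the objective.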
Support Vector Machines are based on the
principle of determining separators in the search
space that best separate the different classes.
According to Aggarwal and Zhai (2012), this is
essentially a quite scalable semi-supervised approach
because of its use of unlabeled data in the
classification process and its use of a number of
modified quasi-Newton techniques, which tend to be
efficient in practice.
On the other hand, the age classification problem
has been addressed in the literature mainly by studies
dedicated to image analysis (Ramesha et al., 2010;
Gao & Ai, 2009; Ylioinas et al., 2012; Levi &
Hassner, 2015; Rybintsev et al., 2015). Although some
of them use the algorithms mentioned above, the
literature on text-based age classification is not broad.
3  METHODOLOGY 
This section presents the steps we followed to test
the performance of linear classification algorithms in
identifying the age of Twitter users. The first
subsection presents the construction of the training
database in two parts: (a) identification and
extraction of personal accounts on Twitter, and (b)
identification of age expressions in account
descriptions, based on a Spanish lexicon created
around the concept of “cumpleaños” (birthday). The
second subsection shows the analysis process for the
derived variable “age range”.
3.1  Training Database Construction 
Before creating the classifier, it was necessary to
build a database that makes it possible not only to
train a text classification model but also to extract
age-related information from the account profile
description. The information was collected from
Twitter’s REST API during August 2017, restricted to
Spanish-language Twitter content. All the datasets
obtained were manually validated by the CAOBA
expert group.
3.1.1  Algorithm for Identifying People
Accounts from Twitter REST API
The experiment began with a search for Twitter
user accounts of people located in Colombia, using as
input the Twitter accounts of the 113 universities
registered in the country. A total of 114,953 linked
users were found, of which only 82,147 were active.
In addition, 19,378 Colombian celebrity accounts
were taken as input. From these accounts, 1,241,248
linked users were identified, and data could be
obtained for 967,660 of them.
A validation process was carried out to confirm
that each profile belonged to a person, which
required a manually built list of 8,701 personal
names. This analysis was performed on the name and
screen_name fields of the account through a Bag of
Words (BoW) process (Moreno et al., 2017). This made
it possible to determine whether an account belonged
to a person, meaning that it had a person's name
associated with it, or was a corporate account. In
total, 50,819 accounts of people linked to universities
and 734,037 accounts of people linked to celebrities
were obtained.
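The name-based validation described above can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the six-entry PERSON_NAMES set is a hypothetical stand-in for the manual list of 8,701 names, and the helper names, accent stripping, and camelCase splitting are assumptions.

```python
import re
import unicodedata

# Hypothetical stand-in for the manually built list of 8,701 personal names.
PERSON_NAMES = {"maria", "jose", "andres", "camila", "juan", "laura"}


def normalize(text):
    """Lowercase the text and strip accents so 'María' matches 'maria'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def bag_of_words(text):
    """Split text into a set of alphabetic tokens.

    camelCase boundaries are split first, so a screen_name such as
    'JuanPerez99' yields the tokens {'juan', 'perez'}.
    """
    spaced = re.sub(r"(?<=[a-záéíóúñ])(?=[A-ZÁÉÍÓÚÑ])", " ", text)
    return set(re.findall(r"[a-z]+", normalize(spaced)))


def is_person_account(name, screen_name, names=PERSON_NAMES):
    """Return True if either field contains a token from the name list."""
    tokens = bag_of_words(name) | bag_of_words(screen_name)
    return not tokens.isdisjoint(names)
```

For example, `is_person_account("María Gómez", "mgomez_co")` matches on the token "maria", while an institutional account such as "Universidad Nacional" has no token in the list and is treated as non-personal.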
3.1.2  Algorithm for Identifying Age
Expressions in Account Descriptions
from a Spanish Lexicon
Once the accounts of people to analyze were
identified, a lexicon of Spanish expressions related
to the theme "cumpleaños" in the age category was
created, yielding 50 words. With this lexicon in
place, we analyzed the description of each user
account using BoW techniques. This identified 1,159
accounts linked to universities whose descriptions
contained age-related text, and 41,183 accounts
linked to celebrities that matched the same lexicon.
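A lexicon lookup of this kind might look like the sketch below. The four-word AGE_LEXICON is a hypothetical excerpt, since the 50-word lexicon is not listed in this section, and the AGE_PATTERN regular expression that pulls out an explicit age is an added assumption rather than a step described in the text.

```python
import re
import unicodedata

# Hypothetical excerpt; the full 50-word lexicon is not listed in the text.
AGE_LEXICON = {"cumpleaños", "cumple", "años", "edad"}

# Assumed pattern: a one- or two-digit number followed by the word "años".
AGE_PATTERN = re.compile(r"\b(\d{1,2})\s*años\b")


def find_age_expressions(description, lexicon=AGE_LEXICON):
    """Return (lexicon words found in the description, explicit age or None)."""
    # NFC keeps 'ñ' as a single character so it stays inside its token.
    text = unicodedata.normalize("NFC", description).lower()
    # Bag of words: the set of purely alphabetic tokens.
    tokens = set(re.findall(r"[^\W\d_]+", text))
    matched = sorted(tokens & lexicon)
    age_match = AGE_PATTERN.search(text)
    age = int(age_match.group(1)) if age_match else None
    return matched, age
```

A description such as "Tengo 23 años, estudiante de ingeniería" returns `(["años"], 23)`, whereas a description with no lexicon word returns `([], None)` and would not be counted as age-related.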