Authors:
Yaguang Liu [1]; Lisa Singh [1] and Zeina Mneimneh [2]
Affiliations:
[1] Department of Computer Science, Georgetown University, 3700 O St., NW, Washington, DC, U.S.A.
[2] Survey Research Center, University of Michigan, 426 Thompson Street, Ann Arbor, Michigan, U.S.A.
Keyword(s):
Demographic Inference, Siamese Network, BERT, Deep Learning.
Abstract:
In order for social scientists to use social media as a source for understanding human behavior and public opinion, they need to understand the demographic characteristics of the population participating in the conversation. What proportion are female? What proportion are young? While previous literature has investigated this problem, this work presents a larger-scale study that investigates inference techniques for predicting age and gender using Twitter data. We consider classic text features used in previous work and introduce new ones. Then we use a range of learning approaches, from classic machine learning models to deep learning ones, to understand the role of different language representations for demographic inference. On a data set created from Wikidata, we compare the value of different feature sets with different algorithms. In general, we find that classic models using statistical features and unigrams perform well. Neural networks also perform well, particularly models using sentence embeddings, e.g., a Siamese network configuration with attention to tweets and user biographies. The differences are marginal for age, but more significant for gender. In other words, it is reasonable to use simpler, interpretable models for some demographic inference tasks (like age). However, using a richer language model is important for gender, highlighting the varying role language plays for demographic inference on social media.