User Profiling: On the Road from URLs to Semantic Features
Claudio Barros and Perrine Moreau
Data Science Direction, Médiamétrie, 70 rue Rivay, Levallois-Perret, France
Keywords:
Text Mining, URLs, User Profiling, Feature Engineering, Topic Extraction, Semantics.
Abstract:
Text data is undoubtedly one of the richest and most peculiar sources of information there is. It comes in many
forms, each requiring specific treatment based on its nature in order to create meaningful features that can
subsequently be used in predictive modelling. URLs in particular are quite specific and require adapted
processing compared to usual text corpora. In this paper, we review different ways we have used
URLs to create meaningful features, both by exploiting the URL itself and by scraping its page content. We
additionally attempt to measure the impact of adding each group of created features in a predictive
modelling use case.
1 INTRODUCTION
Médiamétrie is the entity in charge of audience
measurement in France. For this purpose, we maintain
multiple panels of individuals, including one dedicated
to measuring the Internet audience, which is
representative of the French internet user population.
Thanks to this panel, we have surf data, as well as the
characteristics of the individuals who generated it.
The surf data consists of logs containing a timestamp,
a user ID and a visited URL.
Table 1: Example of surf data.
ID Panelist  Timestamp            URL
133121       2021-05-06 12:03:42  https://www.doctolib.fr/vaccination-covid-19/paris
133121       2021-05-06 12:37:01  https://www.lemonde.fr/actualite-en-continu/
509666       2021-05-06 22:16:18  https://www.750g.com/macarons-chocolat-r79291.htm
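To make the log structure concrete, the following is a minimal Python sketch of how such records could be represented and grouped per panelist. The names `SurfLog`, `parse_log` and `urls_by_panelist` are hypothetical and introduced only for illustration, and the pairing of each row with a URL assumes that the rows and URLs of Table 1 are listed in the same order; this is not a description of the production pipeline.

```python
from dataclasses import dataclass
from datetime import datetime
from urllib.parse import urlsplit


@dataclass(frozen=True)
class SurfLog:
    """One log line: who visited which URL, and when (hypothetical structure)."""
    panelist_id: int
    timestamp: datetime
    url: str

    @property
    def domain(self) -> str:
        # Network location of the URL, e.g. "www.lemonde.fr"
        return urlsplit(self.url).netloc


def parse_log(panelist_id: str, timestamp: str, url: str) -> SurfLog:
    """Convert raw string fields into a typed record."""
    return SurfLog(
        panelist_id=int(panelist_id),
        timestamp=datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S"),
        url=url,
    )


# Rows of Table 1 (row/URL pairing assumed to follow the listed order)
ROWS = [
    ("133121", "2021-05-06 12:03:42", "https://www.doctolib.fr/vaccination-covid-19/paris"),
    ("133121", "2021-05-06 12:37:01", "https://www.lemonde.fr/actualite-en-continu/"),
    ("509666", "2021-05-06 22:16:18", "https://www.750g.com/macarons-chocolat-r79291.htm"),
]

logs = [parse_log(*row) for row in ROWS]

# Group visited URLs by panelist: the unit at which features are later built
urls_by_panelist: dict[int, list[str]] = {}
for log in logs:
    urls_by_panelist.setdefault(log.panelist_id, []).append(log.url)
```

Grouping by panelist reflects the fact that profiling is done per individual rather than per log line.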
On the other hand, we receive data from clients
who own websites or groups of websites. This data
also contains user IDs and the associated surf on the
websites, but no information on the characteristics of
the visitors. In order to better understand
their audiences, our clients are interested in
the socio-demographic profile of their websites' visitors,
their household composition, their purchase intents
or their behaviours and interests. To predict these, we
have proposed a machine learning model based on our
panel. The inputs of the model include features cre-
ated from the visited URLs.
In this paper we review different ways we have
used URLs in order to create features that can be in-
terpreted by algorithms. Throughout the paper, the
data we used as illustration comes from our panel’s
PC surf data from May 2021. This corresponds to
more than 8 million logs and over 1.8 million distinct
URLs. In section 2, we focus on feature creation by
exploiting the raw URLs. In section 3, we scrape the
pages behind the URLs to bring content and context
into the equation. Section 4 consists of an evaluation
of the impact of each group of features created in a
predictive modelling use case. Finally, we draw some
conclusions and provide some critical analysis of our
work in section 5.
2 URL-BASED FEATURES
Here we focus on exploiting the raw URLs in order
to create features. Throughout our research,
the domains we studied were mostly French-speaking
news, cooking, cinema, video game and forum
websites. The corresponding URLs contained the
associated page titles in most cases, which made it
possible for us to use them as is.
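As an illustration of why title-bearing URLs can be used as is, here is a minimal Python sketch that extracts word-like tokens from the path of a URL. The function name `url_tokens` and the cleaning rules (dropping a trailing file extension and any token containing a digit) are assumptions made for this example, not necessarily the preprocessing applied in our work.

```python
import re
from urllib.parse import unquote, urlsplit


def url_tokens(url: str) -> list[str]:
    """Return lowercase word-like tokens found in the path of a URL."""
    # Keep only the path, decode percent-escapes, normalise case
    path = unquote(urlsplit(url).path).lower()
    # Drop a trailing file extension such as ".htm" (assumed cleaning rule)
    path = re.sub(r"\.[a-z0-9]{2,5}$", "", path)
    # Split on any run of non-alphanumeric characters (slashes, hyphens, ...)
    tokens = re.split(r"[\W_]+", path)
    # Discard empty strings and identifier-like tokens containing digits
    return [t for t in tokens if t and not any(c.isdigit() for c in t)]


print(url_tokens("https://www.750g.com/macarons-chocolat-r79291.htm"))
# ['macarons', 'chocolat']
print(url_tokens("https://www.doctolib.fr/vaccination-covid-19/paris"))
# ['vaccination', 'covid', 'paris']
```

On the URLs of Table 1, the tokens recovered this way already carry most of the page title, which is what makes the raw URL a usable source of features.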
The features created based on the raw URLs can be
split into 3 groups:
• keyword presence dummies
Barros, C. and Moreau, P.
User Profiling: On the Road from URLs to Semantic Features.
DOI: 10.5220/0011139900003269
In Proceedings of the 11th International Conference on Data Science, Technology and Applications (DATA 2022), pages 227-235
ISBN: 978-989-758-583-8; ISSN: 2184-285X
Copyright © 2022 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved