In order to build user profiles, one has to acquire data about user preferences. Common approaches include asking users directly about their preferences, or collecting click data. The first approach is more direct, but also more tedious and obtrusive for users. The second approach usually requires a log-in system to link user profiles and demographic information to user click choices, as in (Liu et al., 2010). We explore a third approach that does not directly interfere with users: it is based on simply monitoring what they click. We had indirect access to this information for some outlets, which advertise on their websites their most popular, i.e. most clicked, stories. The drawback is that this information is not available for all outlets, and it offers no fine-grained user segmentation. In previous work we explored such datasets with different techniques to model user preferences, both in terms of prediction performance and of applications (Hensinger et al., 2010; Hensinger et al., 2011).
Our models are built from pairwise data derived from user clicks: a more appealing news article versus a less appealing one, both collected on the same day and from the same outlet. This approach uses a linear utility function to connect pairwise preferences to utility values of items, in our case article scores, with the more appealing item receiving a higher score than its counterpart. A preference model w contains weights for individual article features, and the “appeal” score s(x) of an article x is computed by the linear function s(x) = ⟨w, x⟩. We represent articles as bags of words with TF-IDF weights as features, a standard representation in information retrieval and text categorisation (Salton et al., 1975) that underlies search engines, topic classifiers and spam filters (Sculley and Wachman, 2007). Models are computed with the Ranking Support Vector Machine (SVM) method introduced by Joachims (2002).
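As a rough illustration of this pipeline, the sketch below builds TF-IDF bag-of-words vectors and learns a weight vector w from click pairs by minimising a pairwise hinge loss with simple gradient updates. The paper uses Joachims’ Ranking SVM; this stand-in optimiser, and all article snippets and pair labels, are invented for illustration only.

```python
import math
from collections import Counter

# Toy front-page snippets (titles + short descriptions); all invented.
docs = [
    "celebrity wedding shocks fans worldwide",
    "markets fall on weak economic data",
    "new diet trend sweeps the nation",
    "parliament debates the budget reform",
]

def tfidf_vectors(corpus):
    """Bag-of-words vectors with TF-IDF weights (raw tf x log idf)."""
    n = len(corpus)
    tokenised = [doc.split() for doc in corpus]
    df = Counter(w for toks in tokenised for w in set(toks))
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(toks).items()}
            for toks in tokenised]

def score(w, x):
    """Appeal score s(x) = <w, x> on sparse dict vectors."""
    return sum(w.get(k, 0.0) * v for k, v in x.items())

def train_pairwise(pairs, epochs=100, lr=0.1, reg=0.001):
    """Fit w so each clicked article outscores its unclicked counterpart
    by a margin (pairwise hinge loss; a simplified stand-in for the
    Ranking SVM used in the paper)."""
    w = {}
    for _ in range(epochs):
        for better, worse in pairs:
            if score(w, better) - score(w, worse) < 1.0:
                for k, v in better.items():
                    w[k] = w.get(k, 0.0) + lr * v
                for k, v in worse.items():
                    w[k] = w.get(k, 0.0) - lr * v
            for k in w:  # mild L2 shrinkage
                w[k] *= 1.0 - lr * reg
    return w

vecs = tfidf_vectors(docs)
# Each pair: (more appealing, less appealing), same day and outlet.
pairs = [(vecs[0], vecs[1]), (vecs[2], vecs[3])]
w = train_pairwise(pairs)
print(score(w, vecs[0]) > score(w, vecs[1]))  # True
```

The dictionary-based sparse vectors mirror the high-dimensional, mostly-zero TF-IDF representation without requiring any external library.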
In Section 2, we focus on the tasks involved in building the models: we describe the theoretical framework for learning pairwise preference relations, and the selection and preparation of the data. We report on the performance of the resulting models and also explore their similarities to each other. The models are stable over time: we tested them on weekly datasets up to six months older than the data used to build them. They can also predict, better than random, the choice of a typical reader who has to choose between two articles.
Two factors that limit the performance of our models lie in the nature of the data we use. First, we apply a very coarse-grained user segmentation: all users of one outlet are treated as one homogeneous group, since more detailed information is not available to us. Second, we use textual content only, while online articles are often presented with supplementary material, for instance images or videos. Such additional data can influence users’ choices, but it is not provided by our data gathering system. Additionally, we use only a subset of the full article text, mimicking the real-life situation of news web pages, where the user typically sees only the titles and short descriptions of a collection of articles and has to choose which story to read. Given these restrictions and characteristics of our data, it is remarkable that it is still possible to produce user interest models with reliable performance.
Having created the models, the key question becomes how to exploit them. Our goal is to gain an understanding of the landscape of outlets, their editors’ choices, and how those relate to their readers’ interests. In this direction we performed a series of experiments. In Section 3 we compare the appeal of different news topics. We found that topics such as “Entertainment” and “Health” are perceived as more appealing than topics such as “Business”, “Environment” and “Politics”.
In Section 4 we compare outlets based on the appeal of the articles that appear on their main web pages. For each article, we compute an appeal score with each of the built models. We then average the appeal scores over all articles and models, for data from 33 different outlets. This allows us to rank the outlets by their overall appeal score. It turns out, perhaps not surprisingly, that articles from the online presence of the “People” magazine and from UK tabloids are more appealing than those from broadsheet papers and newswires.
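The averaging and ranking step can be sketched as follows; the outlet names and score values are invented, and each list stands for the flattened s(x) values of one outlet’s articles under every learned model.

```python
from statistics import mean

# Hypothetical appeal scores: for each outlet, the s(x) values of its
# articles under every model, flattened into one list (invented numbers).
appeal = {
    "tabloid_a":    [0.9, 1.2, 0.7, 1.0],
    "broadsheet_b": [0.1, -0.2, 0.3, 0.0],
    "newswire_c":   [-0.4, 0.0, 0.2, -0.1],
}

# Rank outlets by their mean appeal score, highest first.
ranking = sorted(appeal, key=lambda o: mean(appeal[o]), reverse=True)
print(ranking)  # ['tabloid_a', 'broadsheet_b', 'newswire_c']
```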
Finally, in Section 5, we attempt to explain the behaviour of audiences and their click choices. We measured the readability and linguistic subjectivity of articles and compared these quantities with the articles’ average appeal. Our finding is that outlets whose articles have similar appeal also have similar linguistic subjectivity.
2 MODELLING NEWS APPEAL
This section describes the theoretical framework of learning pairwise preference relations; the selection and preparation of the data we used in our experiments; and the resulting models, their prediction performance, and their distances to each other.
The key task is to score news articles by means of a linear function s(x) = ⟨w, x⟩, where x is the vector space representation of the article and w is a parameter vector.
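With a sparse bag-of-words representation, this score is a sparse dot product in which unseen words contribute zero; the weights and TF-IDF values below are invented toy numbers.

```python
# s(x) = <w, x> over sparse dicts; words absent from w contribute 0.
w = {"celebrity": 0.8, "budget": -0.3}   # learned weights (toy values)
x = {"celebrity": 1.2, "wedding": 0.5}   # article TF-IDF vector (toy values)
s = sum(w.get(word, 0.0) * tfidf for word, tfidf in x.items())
print(round(s, 2))  # 0.96
```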
ICPRAM 2012 - International Conference on Pattern Recognition Applications and Methods