AN EMPIRICAL COMPARISON OF LABEL PREDICTION ALGORITHMS ON AUTOMATICALLY INFERRED NETWORKS

Omar Ali¹, Giovanni Zappella², Tijl De Bie¹ and Nello Cristianini¹

¹Intelligent Systems Laboratory, Bristol University, Bristol, United Kingdom
²Dipartimento di Matematica ‘F. Enriques’, Università degli Studi di Milano, Milan, Italy

Keywords: Node classification, Social networks, Data mining, Machine learning.

Abstract:

The task of predicting the label of a network node, based on the labels of the remaining nodes, is an area of

growing interest in machine learning, as various types of data are naturally represented as nodes in a graph. As

an increasing number of methods and approaches are proposed to solve this task, the problem of comparing

their performance becomes of key importance. In this paper we present an extensive experimental comparison

of 15 different methods on 15 different labelled networks, and we release all datasets and source code. In addition, we release a further set of networks that were not used in this study (as not all benchmarked methods could manage very large datasets). Besides the release of data, protocols and algorithms, the key contribution of this study is that in each of the 225 combinations we tested, the best performance, both in accuracy and running time, was achieved by the same algorithm: Online Majority Vote. This is also one of the simplest methods to implement.

1 INTRODUCTION

One of the important trends in data mining and pattern analysis is the growing prevalence of datasets where data items are regarded as nodes in a network. This

is due to the increasing focus on domains where this

data arises naturally, such as social networking sites

or gene regulation analysis, and to the growing size

of other datasets where vector-based representations

can be very expensive to use.

As several approaches have been proposed to

solve this task, it is important to develop a rigorous benchmarking protocol that includes standardised datasets and open-source code, so that researchers can

pursue the most promising avenues and practitioners

can have objective information about the merits of

each approach.

In this paper we present an empirical study of sev-

eral state-of-the-art methods for label prediction, ap-

plied to our own, automatically inferred and labelled

social networks. Our experiments test 15 variants

of four algorithms, on three different networks, each

having ﬁve labellings, for a total of 225 different com-

binations.

The algorithm-classes we tested are the Weighted

Tree Algorithm (Cesa-Bianchi et al., 2010), Graph

Perceptron Algorithm (Herbster and Pontil, 2007),

Ofﬂine Label Propagation (Zhu et al., 2003) and On-

line Majority Vote. Multiple versions of some were

compared.

The data we used was created by us, using named

entities from news stories as nodes, co-occurrence information as links, and various entity-properties as labels. The networks we used in this experiment were

limited to a size of about 3000 nodes (in order to en-

sure all algorithms we tried could converge), but we

are releasing networks as large as 21K nodes, on the

website associated to this article, as well as the source

code of all methods.

The networks were generated by analysing over 5

million news articles, extracting 21K people named

in the news, with 325K co-occurrence links between

them. The extraction process involves co-reference

resolution, statistical co-occurrence analysis, and sta-

tistical analysis for label assignment. In addition,

these networks are produced by an autonomous sys-

tem and as such are constantly growing in size.

It is remarkable that in all our studies, the same

algorithm—Online Majority Vote—was the best per-

former both in terms of accuracy and in terms of

computational complexity. Other algorithms demon-

strated other advantages however.

The paper is organised as follows. First, we define the problem and give some background on label prediction algorithms. Second, we explain the data that we have used for this experiment and how it was inferred from online news. Third, we release large labelled networks and a library containing implementations of our algorithms. Then, we present and discuss our findings prior to concluding.

Ali O., Zappella G., De Bie T. and Cristianini N. (2012). AN EMPIRICAL COMPARISON OF LABEL PREDICTION ALGORITHMS ON AUTOMATICALLY INFERRED NETWORKS. In Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods, pages 259-268. DOI: 10.5220/0003695702590268. Copyright © SciTePress.

2 PROBLEM DEFINITION

Label prediction operates on an undirected network, or graph, G(V,E). There are n vertices in this graph, V = {1,...,n}, and edges between vertices have positive weights, w_{i,j} > 0 for (i,j) ∈ E. Large edge weights suggest that the vertices they connect are close to one another.

Associated with the vertices of this graph is a set of binary labels, y = (y_1,...,y_n) ∈ {−1,+1}^n. Every vertex is therefore labelled positively or negatively.

One factor that affects the performance of label

prediction algorithms is the regularity of the labels on

the input graph. If positive labels tend to be next to

one another then the graph is very regular, whereas if

they are randomly assigned it is very irregular. Regularity is to be expected in our networks, given that people are connected based on the news in which they appear. Consider, for example, Lewis Hamilton and Felipe Massa, who are both labelled as sportsmen and linked together; this is a result of them appearing together in Formula One news articles.

We measure the irregularity of a graph using the cutsize. Intuitively, this is proportional to the number of edges whose endpoints have different labels. We consider the set E^φ ⊆ E, where a φ-edge is any (i,j) such that y_i ≠ y_j. We define Φ = Σ_{(i,j)∈E^φ} w_{i,j}. Normalising the cutsize by the total weight of the graph gives us a measure of the irregularity of any graph:

irregularity(G(V,E)) = Σ_{(i,j)∈E^φ} w_{i,j} / Σ_{(i,j)∈E} w_{i,j}    (1)

3 METHODS

We run these experiments as a black-box test, using implementations of four label prediction algorithms. Three of the methods tested are online predictors: Online Majority Vote (OMV), the Perceptron with Graph Laplacian Kernel (GPA) and the Weighted Tree Algorithm (WTA), whilst the fourth, Offline Label Propagation (LABPROP), is used for comparison. The following paragraphs give brief outlines of each method.

We should also point out that WTA and GPA have

strong theoretical guarantees and mistake bound anal-

yses. In particular, the number of mistakes made by

WTA is directly proportional to the expected value of

Φ on a random spanning tree. Similarly, those made

by GPA are proportional to the value of Φ and the

diameter of the graph.

In our experimental section, all predictors are operated with a train/test protocol instead of an online one; however, this makes no difference for our purposes.

3.1 Online Majority Vote (OMV)

Online majority vote is the simplest and fastest of the methods that we test. It is very intuitive, predicting the label of a vertex by looking at those of its immediate neighbours and taking a majority vote. This vote is weighted according to edge weights and can be written as ŷ_i = sgn(Σ_{j∼i} w_{i,j} y_j), where j ∼ i means that j is connected to i. Time and space complexity are both Θ(|E|) (Cesa-Bianchi et al., 2010).
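The prediction rule above fits in a few lines. This is an illustrative toy, assuming an adjacency structure of (neighbour, weight) pairs and a tie-break towards +1; names are ours, not the released library's:

```python
# Online Majority Vote: sign of the weighted sum of neighbouring labels.
# A minimal sketch; adjacency layout and tie-breaking are assumptions.

def omv_predict(adj, labels, i):
    """adj: dict vertex -> list of (neighbour, weight); labels: known labels."""
    vote = sum(w * labels[j] for j, w in adj[i] if j in labels)
    return 1 if vote >= 0 else -1   # ties broken towards +1

adj = {1: [(2, 3.0), (3, 1.0), (4, 1.0)]}
labels = {2: +1, 3: -1, 4: -1}
print(omv_predict(adj, labels, 1))  # 1: the heavy edge to vertex 2 wins the vote
```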

3.2 Ofﬂine Label Propagation

(LABPROP)

Ofﬂine label propagation (Zhu et al., 2003) operates

on an entire graph at once and attempts to minimise

the energy in a system of labelled vertices. Energy in

this case is very similar to the previously mentioned

cutsize. Therefore its task is to propagate known la-

bels in such a way as to minimise the cutsize of the

graph; that is to produce the most regular graph pos-

sible given the known labels.

Given the label values, y = (y_1,...,y_n) ∈ {+1,−1}^n, the algorithm calculates a real-valued function for each unlabelled vertex as follows:

f(j) = Σ_{j∼i} w_{ij} f(i) / Σ_{j∼i} w_{ij}    (2)

where j ∼ i denotes that j is connected to i.

The function in (2) can be applied to each unlabelled vertex, iteratively until convergence, in a similar manner to the PageRank algorithm (Page et al., 1998). Positive labels are then applied to vertices with f(j) > 0. This decision threshold can be manipulated to reflect the input class distribution, or to incorporate prior knowledge (Zhu et al., 2003).

Our implementation uses a method described in (Zhu and Ghahramani, 2002) and has a time complexity of O(k|V|²), where k is the average degree of vertices in the input graph.
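The iterative update in equation (2) can be sketched as follows, with labelled vertices clamped to their known values. This is a toy illustration in the spirit of (Zhu and Ghahramani, 2002), not the authors' released implementation:

```python
# Iterative label propagation: repeatedly replace each unlabelled vertex's
# value with the weighted average of its neighbours (equation (2)).

def label_propagation(adj, seeds, iters=100):
    """adj: dict v -> list of (u, w); seeds: dict v -> +1/-1 (clamped)."""
    f = {v: seeds.get(v, 0.0) for v in adj}
    for _ in range(iters):
        for v in adj:
            if v in seeds:
                continue                       # labelled vertices stay fixed
            total = sum(w for _, w in adj[v])
            f[v] = sum(w * f[u] for u, w in adj[v]) / total
    return {v: (1 if f[v] > 0 else -1) for v in adj}

# A path 1-2-3: vertex 2 leans towards vertex 3 via the heavier edge.
adj = {1: [(2, 1.0)], 2: [(1, 1.0), (3, 2.0)], 3: [(2, 2.0)]}
print(label_propagation(adj, {1: +1, 3: -1}))  # {1: 1, 2: -1, 3: -1}
```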


3.3 Perceptron with Graph Laplacian

Kernel (GPA)

A kernel perceptron is used to predict labels on the

graph (Herbster and Pontil, 2007). This uses a kernel

based on the Laplacian of the graph to predict the la-

bels. This involves inverting the Laplacian, which is

computationally expensive; therefore they compute a

spanning tree of the graph and predict on that, which

is much more efficient. With the methods developed in (Herbster et al., 2009) we operate on a |V|×|V| matrix, giving a complexity of O(|V|² + |V|S_T), where S_T is the structural diameter, or maximum depth, of the tree.

Intuitively, GPA embeds the graph in a vector

space using the Laplacian kernel, so well connected

nodes on the graph will be close in this vector space.

Given that we expect strongly connected nodes to

have the same label, so too do we expect that nearby

nodes in the vector space will have the same label;

therefore we can use a Perceptron to classify them.

3.4 Weighted Tree Algorithm (WTA)

The weighted tree algorithm (Cesa-Bianchi et al., 2010) differs from its competitors in that it operates only on trees. This makes the algorithm extremely scalable, running in constant amortised time per prediction, i.e. O(|V|) to predict all labels on a tree.

WTA takes a tree as input, then goes one step fur-

ther and converts the input tree into a line, which it

makes its predictions on using a nearest neighbour

rule.

A tree, T, is converted into a line by first selecting a vertex of the tree at random. The algorithm then performs a depth-first traversal of the tree and stores this path as a line, L′. Backtracking steps are included in this process, meaning that the line can contain duplicate vertices and edges.

Duplicate vertices in the line, L′, are then removed. Neighbouring vertices are reconnected, and each new edge takes on the weight of the weakest connection between the removed vertex and its neighbours. This line, L, with duplicates removed, is the input to the weighted tree algorithm.

The algorithm predicts the label of an unlabelled

vertex by traversing the line, L, and returning the clos-

est label that is found. In (Cesa-Bianchi et al., 2010)

there is a clear explanation about how to implement

this algorithm efﬁciently.
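The conversion and prediction steps above can be sketched as below. This is a simplified toy that ignores edge weights (the real WTA keeps the weakest-connection weights and uses them in its nearest-neighbour rule), so all names and details here are illustrative only:

```python
# Tree -> line conversion by depth-first traversal with backtracking,
# followed by duplicate removal and a nearest-label rule on the line.
# Unweighted simplification of the scheme in (Cesa-Bianchi et al., 2010).

def tree_to_line(tree, root):
    walk = []
    def dfs(v, parent):
        walk.append(v)
        for u in tree[v]:
            if u != parent:
                dfs(u, v)
                walk.append(v)       # backtracking step revisits v
    dfs(root, None)
    seen, line = set(), []
    for v in walk:                   # keep the first occurrence of each vertex
        if v not in seen:
            seen.add(v)
            line.append(v)
    return line

def nearest_label(line, labels, v):
    pos = line.index(v)
    for d in range(1, len(line)):    # scan outwards from v's position
        for p in (pos - d, pos + d):
            if 0 <= p < len(line) and line[p] in labels:
                return labels[line[p]]
    return None

tree = {1: [2, 3], 2: [1], 3: [1, 4], 4: [3]}
line = tree_to_line(tree, 1)
print(line)                              # [1, 2, 3, 4]
print(nearest_label(line, {2: +1}, 4))   # 1: vertex 2 is the closest labelled vertex
```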

3.5 Extracting a Tree from a Graph

For a variety of motivations, both WTA and GPA op-

erate on a spanning tree instead of the complete graph.

It is therefore important to understand how different

techniques to extract those spanning trees will inﬂu-

ence our results. There are many alternative ways

to transform a graph into a tree and we will evalu-

ate some of them by comparing their properties. The

methods evaluated are:

• Minimum Spanning Tree (MST): chosen because

it gave some of the best results in previous liter-

ature (Herbster et al., 2009)(Cesa-Bianchi et al.,

2010).

• Random Spanning Tree: we chose two different types of random spanning tree because of their connections with the resistance distance (as explained in (Cesa-Bianchi et al., 2010)). These trees are extracted using algorithms with different computational complexities: the first is Broder’s algorithm (Broder, 1989) (RWST), which runs in O(|V| log |V|); the second is Wilson’s algorithm (Wilson, 1996) (NWST), which runs in O(|V|). Both bounds hold in expectation rather than in the worst case.

• Depth-ﬁrst Spanning Tree (DFST): chosen be-

cause it is a common technique for generating

spanning trees.

• Minimum Diameter Spanning Tree (SPST): cho-

sen because GPA’s mistake bound also depends on

the diameter of the tree.
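For illustration, a random-walk spanning tree in the spirit of Broder's algorithm can be sketched as follows: walk the graph until every vertex has been visited, keeping each vertex's entry edge. This is a toy, not the exact RWST/NWST extractors used in the experiments:

```python
# Random-walk spanning tree: the first edge used to enter each vertex
# becomes a tree edge; the walk stops once all vertices are visited.
import random

def random_walk_spanning_tree(adj, root, rng):
    """adj: dict v -> list of neighbours; returns (parent, child) tree edges."""
    visited = {root}
    tree = []
    v = root
    while len(visited) < len(adj):
        u = rng.choice(adj[v])
        if u not in visited:         # first entry into u defines its tree edge
            visited.add(u)
            tree.append((v, u))
        v = u
    return tree

adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
tree = random_walk_spanning_tree(adj, 1, random.Random(0))
print(len(tree))  # 3: a spanning tree of 4 vertices always has 3 edges
```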

3.6 Committees of Trees (CWTA)

We also tested a committee version of WTA (CWTA),

which uses 11 of the randomly generated trees (DFST,

NWST, RWST). The algorithm is the same as stan-

dard WTA except that the output label is the majority

vote between different instances of WTA running on

different trees.

4 NETWORK GENERATION

4.1 News Collection

Network generation begins with the collection of

news articles. These are collected by a spider, which

visits over 3,000 syndicated news feeds at regular in-

tervals throughout each day. News feeds are simple

XML ﬁles that can be easily monitored and parsed by

machines. The spider is able to handle a large volume

of news and does so using a multi-threaded architec-

ture that minimises the effect of network delays. It


parses feeds in the RSS and Atom formats using an open source Java library, Rome¹.

Topic labels are automatically applied to articles based on the feed(s) in which they appear. For example, an article that appears in both the ‘business’ and ‘politics’ feeds of the BBC will inherit both of these topics as tags. Tags are simple labels that can be attached to articles in our system.

The data acquisition system forms the basis of

many other works (Ali et al., 2010), (Ali and Cristian-

ini, 2010), (Flaounas et al., 2010a), (Flaounas et al.,

2010b), (Turchi et al., 2009) and also has its own suite

of monitoring tools (Flaounas et al., 2011).

4.2 Named Entity Extraction

References to people are extracted from news articles using a named entity recognition tool called GATE (Cunningham et al., 2002). People can be referred to in many different ways, for example by omitting their middle names, or by misspelling them. Resolving duplicate references to a specific named entity is the problem of entity matching.

Entity matching is a well-studied problem; however, it is one that is greatly affected by scale. Resolving duplicates among a large input of references (we monitor around 40 million) is not computationally feasible, as all pairs must be compared, which is of O(n²) complexity.

We resolve duplicate references using an approach

that focuses on high precision. It ﬁrst uses social

network information to limit the search space, then

applies string similarity measures and co-reference

information to determine which references to merge

into entities (Ali and Cristianini, 2010). This and fur-

ther ﬁltering steps (for example, ignoring those who

are mentioned only a few times) reduce our attention

to the 50,000 most-popular people in world media,

seen over a 2.5 year period.

4.3 Inference of Topic Labels

All articles in the system are associated with various tags inherited from the feeds in which they appear. We can therefore infer a lot about an entity based on the articles that mention it. We would expect to see a significant relation between Lewis Hamilton and sports articles, for example; however, we might also find that he appears in many entertainment articles.

Prior to calculating the most signiﬁcant relations

between entities and tags we must ﬁrst collect basic

statistics about each.

¹ Rome Library: https://rome.dev.java.net/.

Our input is a set of N articles, A = {a_1, a_2, ..., a_N}. Every article can contain a subset of the entities in our system, E = {e_1, e_2, ..., e_M}, as well as a subset of the property tags, P = {p_1, p_2, ..., p_Q}. The following values are then counted:

c(e): number of times entity e has appeared
c(p): number of times property p has appeared
c(e,p): number of times e has appeared in articles tagged with p

These values allow a contingency table to be pro-

duced for every observed pair of entity and tag. This

summarises their appearances, including the number

of times that they appear together, separately, or not

at all.

In some cases it is possible for cell counts to equal

zero. This typically occurs when an entity is always

seen in the context of a particular tag, or vice versa.

We may, for example, see Lewis Hamilton appear in

only sports news. This causes c(e) = c(e, p) and as a

consequence sets cell b = c(e) − c(e, p) = 0.

Undeﬁned odds ratios are avoided by applying a

Laplace correction to the contingency table. We pre-

tend to have seen four additional articles in which we

have seen: the entity alone; the tag alone; the entity

and tag together; neither the entity nor the tag. Table

1 shows how values are calculated for each cell of the

contingency table.

4.3.1 The Odds Ratio

To measure the strength of association between an entity and a tag we use the odds ratio (Boslaugh and Watters, 2008). If we label the cells of the contingency table as a, b, c, d, we can calculate the odds ratio using (3). The odds ratio is exponentially distributed, which makes thresholding and comparison of values more difficult later on. For this reason we use the natural logarithm of the odds ratio (3), which is approximately normally distributed.

OR(e,p) = (a·d)/(b·c),  LOR(e,p) = log((a·d)/(b·c))    (3)

Log odds ratios (LOR) close to zero suggest that

the entity and tag appear independently, such that

their appearances together were just down to chance.

Values greater than zero suggest that there is some as-

sociation between the pair and values less than zero

suggest a negative association; that is that a particular

tag means that the entity is unlikely to also appear.


Table 1: A contingency table for appearances of entities and tags, corrected so that cells are always greater than zero.

          p                    ¬p                             Total
e         c(e,p) + 1           c(e) − c(e,p) + 1              c(e) + 2
¬e        c(p) − c(e,p) + 1    N − c(e) − c(p) + c(e,p) + 1   N − c(e) + 2
Total     c(p) + 2             N − c(p) + 2                   N + 4
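The Laplace-corrected counts and the log odds ratio of equation (3) can be computed directly from the counts defined above; cell d counts articles containing neither the entity nor the tag. A minimal sketch with invented counts:

```python
# Log odds ratio (equation (3)) over the Laplace-corrected contingency table.
import math

def log_odds_ratio(c_e, c_p, c_ep, N):
    a = c_ep + 1                    # entity and tag together
    b = c_e - c_ep + 1              # entity without the tag
    c = c_p - c_ep + 1              # tag without the entity
    d = N - c_e - c_p + c_ep + 1    # neither entity nor tag
    return math.log((a * d) / (b * c))

# An entity seen 50 times, always with the tag: without the correction,
# cell b would be zero and the odds ratio undefined.
print(round(log_odds_ratio(c_e=50, c_p=200, c_ep=50, N=10000), 2))  # 8.1
```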

4.3.2 Signiﬁcant Values of the Odds Ratio

Increasing values of the odds ratio for any tag and entity suggest progressively stronger association between the two. Prior to applying tags to any entity it must be decided exactly which values of LOR are significant.

To decide this we calculate the standard error of the LOR, and confidence intervals, as shown in equation (4):

SE_LOR = √(1/a + 1/b + 1/c + 1/d),  CI_LOR = LOR ± z·SE_LOR    (4)

where z represents the level of significance that is required, as given by the standard normal distribution (µ = 0, σ = 1). At the 99.9% significance level, for example, we use z = 3.2905.

Recall that values of LOR are more likely to be

considered signiﬁcant if they are sufﬁciently far from

zero, either positive or negative. Whether or not a

value is sufﬁciently far from zero is determined by the

conﬁdence interval; if the conﬁdence interval includes

zero then we cannot say that the value is signiﬁcant,

but if it does not include zero then we can.

4.3.3 Comparing Odds Ratios

At this stage we have the tools to decide whether any

entity and tag are signiﬁcantly associated or not. To

do so we use the odds ratio and our conﬁdence in it;

if the odds ratio is large and our conﬁdence interval is

small we can say that the association is signiﬁcant.

We now consider how the strength of association

between pairs of tag and entity can be compared to

one another. For example, is Lewis Hamilton more

strongly associated with sports or entertainment news?

To be able to make such a comparison we scale the

odds ratio by the standard error, at our desired level

of signiﬁcance. This ensures that all associations can

later be compared directly in order to make fast deci-

sions as to which tag connects most strongly to which

entity.

This correction is shown in (5). Any associations with |a_odds| < 1 cannot be said to be significant, whereas any with |a_odds| ≥ 1 are considered significant.

a_odds = LOR(e,p) / (z·SE_LOR)    (5)

4.3.4 Example Topic Labels

Every entity now has a comparable association score, a_odds, that describes its level of association with any particular topic of news. This information is updated daily, such that topic associations can be requested immediately. For the purposes of label prediction we simply use the strongest topic association for each entity in the network being studied.

Table 2 shows some randomly-sampled examples

of people who are signiﬁcantly associated to busi-

ness, entertainment, and politics articles. Note that

there are some examples where duplicates still exist

(‘Trichet’), or where an organisation has been classi-

ﬁed as a person (‘X Factor’). This is due to the auto-

mated nature of the system; references are extracted,

cleaned and resolved without any human input.

4.4 Creating Networks

Previous steps have collected news, extracted and re-

solved references to entities, then inferred topic la-

bels. We use the results of these steps to produce net-

works for use in label prediction.

4.4.1 Signiﬁcant Links

People are linked based on the number of articles in which they co-occur. A network, G(V,E), has weighted vertices, w(v), and weighted edges, w(e) = w(v_1,v_2). The weight of a vertex, w(v), is equal to the number of articles in which v is mentioned, and the weight of an edge, w(v_1,v_2), is equal to the number of articles in which both v_1 and v_2 are mentioned. In addition, each social network is generated from a set of news articles, A, which amounts to N = |A| observations.

Raw co-occurrence counts do not represent the true strength of connection between pairs of people. Popular people will tend to be connected strongly due to their high occurrence counts; therefore we use the χ² (chi-squared) test statistic to find significantly linked pairs of people.


Table 2: Examples of people who are significantly associated with a selection of topics. Some mistakes get through the automated system and are italicised.

Business                        Entertainment       Politics
Sir Richard Branson             Justin Bieber       President Barack Obama
Jean-Claude Trichet             X Factor            John McCain
President Jean-Claude Trichet   Angelina Jolie      Harry Reid
Trichet                         Natalie Portman     Mitch McConnell
Ben Bernanke                    Britney Spears      John Boehner
Chris Bowen                     Anne Hathaway       Sarah Palin
Russell Simmons                 Harry Potter        Rep. Charles Rangel
Ken Burns                       Kim Kardashian      Speaker Nancy Pelosi
Madoff                          Johnny Depp         David Cameron
Bernanke                        Justin Timberlake   Ed Miliband

As an initial step, we discard any edges where w(e) < 5 in order to remove some noise from the network, and as a requirement for the χ² statistic.

The χ² test of independence compares observed counts of occurrences and co-occurrences against what would be expected if entities appeared independently. For any pair of entities, (v_1, v_2), we observe:

• N, the number of articles.
• O_{v_1} = w(v_1), the occurrence count of entity v_1.
• O_{v_2} = w(v_2), the occurrence count of entity v_2.
• O_{v_1,v_2} = w(v_1,v_2), the co-occurrence count of (v_1,v_2).

Therefore, under the null hypothesis, we expect v_1 and v_2 to co-occur (O_{v_1} · O_{v_2})/N times.

We build a 2×2 contingency table for the pair (v_1,v_2) with four possible outcomes: (v_1,v_2), (¬v_1,v_2), (v_1,¬v_2), (¬v_1,¬v_2). This is built once for our observed values, O, and again for the expected values, X. The chi-squared statistic is then calculated using (6):

χ²(v_1,v_2) = Σ_{v_1∈{v_1,¬v_1}} Σ_{v_2∈{v_2,¬v_2}} (O_{v_1,v_2} − X_{v_1,v_2})² / X_{v_1,v_2}    (6)

Edges of the input network, G, are then re-weighted to the χ² test statistic. Specifically, w′(v_1,v_2) = χ²(v_1,v_2). Next we consider the values of the χ² statistic and what makes them statistically significant.
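For a 2×2 table the statistic in (6) reduces to a few lines over the observed counts; a minimal sketch with invented counts:

```python
# Chi-squared statistic (equation (6)) for one pair of entities, built from
# the observed counts defined above; variable names are illustrative.

def chi_squared(o_1, o_2, o_12, N):
    """o_1, o_2: occurrence counts; o_12: co-occurrence count; N: articles."""
    observed = [o_12, o_1 - o_12, o_2 - o_12, N - o_1 - o_2 + o_12]
    # Expected counts under independence, from the marginal totals.
    expected = [o_1 * o_2 / N,
                o_1 * (N - o_2) / N,
                (N - o_1) * o_2 / N,
                (N - o_1) * (N - o_2) / N]
    return sum((o - x) ** 2 / x for o, x in zip(observed, expected))

# Two entities seen 100 times each, together 40 times, in 10,000 articles:
# far more co-occurrence than the single expected count under independence.
print(round(chi_squared(100, 100, 40, 10000), 2))  # 1551.88
```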

4.4.2 Thresholding

A suitable threshold value of χ² is selected using a required level of significance. As used previously, we use a level of 99%, or a p-value of 0.01. This means that one edge in every hundred that we keep is expected to be down to random chance.

Our test makes multiple comparisons between pairs of people, and the test statistic is calculated for all pairs which co-occur. In reality we implicitly compare g = n(n+1)/2 − n pairs, for n total entities. Therefore we apply the Bonferroni correction and adjust the significance level by a factor of 1/g. For example, with 2,000 people, we could make 1,999,000 comparisons, so we correct p = 0.01 to p/g = 5.0e−9.

We get a threshold, at this level of significance, by looking up the inverse of the χ² cumulative distribution function. Continuing the previous example gives us a threshold of 34.19.
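The Bonferroni-corrected threshold can be recovered with only the standard library, since the survival function of χ² with one degree of freedom equals erfc(√(t/2)); a sketch using bisection, with names and step counts our own:

```python
# Bonferroni-corrected chi-squared threshold for df = 1, inverting the
# survival function sf(t) = erfc(sqrt(t / 2)) by bisection.
import math

def chi2_threshold(p, n):
    g = n * (n + 1) // 2 - n              # implicit number of comparisons
    target = p / g                        # Bonferroni-corrected p-value
    lo, hi = 0.0, 100.0
    for _ in range(100):                  # bisect: sf is decreasing in t
        mid = (lo + hi) / 2
        if math.erfc(math.sqrt(mid / 2)) > target:
            lo = mid
        else:
            hi = mid
    return lo

# With 2,000 people and p = 0.01 this reproduces the threshold of about 34.19.
print(round(chi2_threshold(0.01, 2000), 2))
```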

4.4.3 Stricter Filtering

One problem with significance testing is the choice of null hypothesis, which in this case was that people appear in news articles independently. This is not very discriminative, however, as the majority of co-occurrences that we observe are indeed significant. This is simply a result of the fact that news is not reported randomly, but reports actual events involving people.

To further filter our networks we simply increase the threshold value of the χ² statistic. Increasing it results in a sparser network; however, we need a way to determine when to stop increasing this value. For this we measure the average log degree, d̄, of the network:

d̄ = Σ_{v∈V} log(d(v) + 1) / |V|    (7)

Filtering proceeds by removing edges that are smaller than the threshold, then measuring d̄. If d̄ is too large, the threshold is increased and the network filtered again. This continues until d̄ falls below a target value. Thus, we replace a χ² threshold with a threshold value of d̄. In our experiments we use a threshold value of d̄ = 1.4.
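The stopping rule above reads directly as a loop; a minimal sketch with an invented toy network, target and step size (the paper fixes only the target d̄ = 1.4):

```python
# Raise the edge-weight threshold until the average log degree (equation (7))
# of the surviving network falls below the target.
import math
from collections import defaultdict

def avg_log_degree(edges, vertices):
    deg = defaultdict(int)
    for i, j, _ in edges:
        deg[i] += 1
        deg[j] += 1
    return sum(math.log(deg[v] + 1) for v in vertices) / len(vertices)

def filter_network(edges, vertices, target=1.4, step=1.0):
    threshold = 0.0
    kept = edges
    while kept and avg_log_degree(kept, vertices) > target:
        threshold += step
        kept = [(i, j, w) for i, j, w in kept if w >= threshold]
    return kept, threshold

# A toy star network: weak spokes are pruned until d-bar drops below target.
edges = [(0, k, float(k)) for k in range(1, 8)]
vertices = list(range(8))
kept, threshold = filter_network(edges, vertices, target=0.5)
print(len(kept), threshold)  # 3 5.0
```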


Table 3: A description of the networks used in these experiments.

Name            Start Date   End Date     Nodes   Edges
Sparse          2009-10-01   2009-10-04   1,072   3,825
Medium-Sparse   2009-10-01   2009-10-04   3,627   11,875
Dense           2010-01-01   2010-01-04   1,919   53,077

Table 4: Irregularity of the networks for each tag, expressed as a percentage of mismatched edges, taking into account edge weight.

Network   Sport   Science   Gossip   Politics   Business
Sparse    4.43    11.9      12.9     16.1       16.8
Medium    7.39    14.8      11.6     12.6       15.7
Dense     4.02    9.78      12.5     11.1       9.72

4.4.4 The Networks

Three social networks were generated from news sto-

ries seen within different periods of time. Next, dis-

connected components were discarded so that only

the largest connected component remained. Edges

were re-weighted so that their values are linearly distributed, using w′_ij = log(w_ij). Table 3 summarises the networks that were produced for this experiment.

Next, the vertices of the network were labelled

with one of ﬁve topics of interest: politics, business,

science, sport, and entertainment. Only the most

strongly associated label was chosen for any vertex,

as it is common for them to have more than one.

This labelled network was then used to generate one new network for each of the five topics, where binary labels denote the presence or absence of the topic label in question. Specifically, we convert the labelled network, G, with labels y ∈ {1,2,3,4,5}, into five graphs, G_1,...,G_5, each with binary labels y ∈ {−1,+1}. This ensures that the networks are suitably presented to the algorithms under trial, and therefore each network and algorithm is tested separately on each topic.

Finally, we measured the irregularity of each of

these networks to understand how potentially difﬁcult

each of the networks is to label. See Table 4 for a

summary of the irregularity of each network.

5 DATA AND CODE

As an addition to the networks investigated in this paper we also release² a selection of larger networks, ranging in size from 14,916 nodes and 54,533 edges through to 21,487 nodes and 325,329 edges. All are labelled with a large number of statistically associated topic labels, as well as labels derived from the location of the news outlets a person is mentioned in; we can, for example, say that Pope Benedict XVI is currently (as of April 10th 2011) most strongly associated with news from Italy, The Vatican and Iraq.

² https://patterns.enm.bris.ac.uk/labelled-networks

These larger networks were not used in our experi-

ments as we could not test all methods on them within

available time and memory constraints. However, we

release them to encourage further work towards the

scalability of methods in this ﬁeld of research.

We also release the code used for our experiments.

Our repository contains all the Java classes necessary

to reproduce our experiments, including the predic-

tors and utilities for the creation of spanning trees.

This allows users to modify our settings from a simple

conﬁguration ﬁle, as well as to write their own predic-

tors implementing our interfaces for comparison with

existing ones.

6 RESULTS

Each of the algorithms was evaluated on fifteen input networks (three source networks with five topics in each). Ten iterations were run for each combination of algorithm and network. In the case of the tree-based methods this involved the generation of ten spanning trees per algorithm. Every iteration was run using a 50% train, 50% test split; half of the labels were obscured prior to each iteration. Tables 5–7 show results for each topic and algorithm on each of the input networks.

The most striking result is that OMV makes the

fewest errors in all of the tested cases. LABPROP per-

formed similarly well, but always fell short of OMV.

Both algorithms are very similar, with the only major

difference being that LABPROP considers informa-

tion from more than just the immediate neighbours

of a vertex. This suggests that such information only

serves to mislead the decision when working on social

networks.

Why this is the case needs further investigation.

However, the most likely explanation is in the nature

of social networks. Consider, for example, the small

world hypothesis (Travers and Milgram, 1969) and

ﬁndings that people are on average connected by only

5.2 steps in a network. Any algorithm that takes into

account information found outside of the immediate

neighbours of a vertex will quickly consider informa-

tion from a large proportion of the network.

WTA and GPA both perform less well on the


Table 5: Sparse Network - Percentage error of each algorithm (standard deviation in brackets), quoted for a number of topics, along with the overall average.

Algorithm    Business     Gossip       Politics     Sport        Science      Avg.
WTA MST      23.4 (3.1)   17.4 (3.3)   18.9 (1.6)   8.24 (2)     18.8 (1.5)   17.4 (2.3)
WTA NWST     23.9 (1.9)   18.5 (2.2)   20.7 (1.6)   8.32 (0.96)  18.7 (1.4)   18.0 (1.6)
WTA DFST     23.3 (1.6)   17.6 (1.3)   21.2 (1.7)   8.04 (1.1)   18.2 (1.3)   17.7 (1.4)
WTA RWST     23.5 (1.5)   20.4 (2.1)   20.7 (1.8)   9.41 (1.4)   18.7 (1.3)   18.5 (1.6)
WTA SPST     24.0 (2.1)   20.6 (1.8)   21.8 (1.5)   9.60 (1.2)   18.7 (1.5)   18.9 (1.6)
CWTA DFST    18.5 (1.2)   11.5 (1.1)   14.9 (1.2)   5.08 (0.61)  14.2 (1)     12.8 (1)
CWTA NWST    18.3 (1.2)   11.7 (1.3)   15.6 (1.1)   5.22 (0.75)  13.7 (1.1)   12.9 (1.1)
CWTA RWST    18.4 (1.4)   11.8 (1.4)   15.4 (1.3)   5.12 (0.64)  13.9 (0.92)  12.9 (1.1)
GPA MST      26.8 (5.9)   24.0 (8.9)   26.0 (9.3)   14.3 (9.2)   22.5 (11)    22.7 (8.8)
GPA NWST     27.7 (5)     28.4 (8)     27.0 (5.3)   14.3 (7.7)   22.4 (9.2)   24.0 (7)
GPA DFST     33.3 (7.7)   32.2 (6.4)   33.2 (8.4)   16.1 (9.7)   23.1 (10)    27.6 (8.4)
GPA RWST     29.6 (6.2)   28.7 (7.1)   27.2 (7)     14.4 (9.7)   19.5 (6.3)   23.9 (7.2)
GPA SPST     27.1 (6.8)   24.7 (7.8)   27.7 (9.5)   16.5 (14)    20.7 (11)    23.4 (9.9)
LABPROP      17.4 (1.1)   10.3 (1.1)   14.9 (1.4)   4.74 (0.8)   13.2 (1)     12.1 (1.1)
OMV          16.9 (0.91)  8.91 (0.86)  13.6 (1.2)   4.35 (0.62)  13.0 (1)     11.3 (0.92)

Table 6: Medium-Sparse Network - Percentage error of each algorithm (standard deviation in brackets), quoted for a number of topics, along with the overall average.

Algorithm    Business     Gossip       Politics     Sport        Science      Avg.
WTA MST      20.0 (0.75)  16.6 (0.78)  19.6 (0.59)  9.32 (0.6)   19.9 (0.65)  17.1 (0.67)
WTA NWST     20.6 (0.59)  17.5 (0.75)  20.2 (0.82)  9.73 (0.53)  19.7 (1)     17.6 (0.74)
WTA DFST     20.2 (0.82)  17.4 (0.84)  19.7 (0.6)   9.03 (0.59)  19.8 (0.66)  17.2 (0.7)
WTA RWST     20.7 (0.76)  17.3 (0.63)  19.8 (0.71)  10.0 (0.67)  19.2 (0.66)  17.4 (0.69)
WTA SPST     21.2 (1.1)   17.9 (0.57)  21.1 (0.6)   10.3 (0.64)  20.4 (0.57)  18.2 (0.7)
CWTA DFST    17.5 (0.66)  13.2 (0.86)  17.5 (0.68)  6.68 (0.48)  16.5 (0.77)  14.3 (0.69)
CWTA NWST    17.4 (0.75)  13.0 (0.61)  17.1 (0.75)  6.66 (0.58)  16.1 (0.68)  14.1 (0.67)
CWTA RWST    17.4 (0.54)  13.0 (0.68)  17.3 (0.66)  6.78 (0.54)  16.2 (0.69)  14.1 (0.62)
GPA MST      24.2 (5.9)   25.4 (6.7)   23.7 (5.8)   12.3 (2.6)   21.5 (7.1)   21.4 (5.6)
GPA NWST     29.0 (9.6)   28.0 (5.7)   28.3 (7.2)   15.8 (7.3)   23.1 (9.3)   24.9 (7.8)
GPA DFST     26.1 (4.1)   33.7 (7.8)   26.5 (6.7)   17.2 (7.7)   22.9 (8.1)   25.3 (6.9)
GPA RWST     26.1 (7.8)   25.5 (6.5)   25.3 (5)     14.4 (4.3)   24.4 (10)    23.2 (6.7)
GPA SPST     27.5 (13)    26.1 (12)    24.7 (6.6)   13.4 (4.9)   18.7 (6.4)   22.1 (8.7)
LABPROP      16.3 (0.93)  11.7 (0.82)  16.1 (0.73)  6.17 (0.57)  15.2 (0.4)   13.1 (0.69)
OMV          15.5 (0.84)  10.5 (0.56)  15.8 (0.56)  5.68 (0.39)  14.8 (0.62)  12.5 (0.59)

Table 7: Dense Network - Percentage error of each algorithm (standard deviation in brackets), quoted for a number of topics, along with the overall average.

Algorithm    Business     Gossip       Politics     Sport        Science      Avg.
WTA MST      22.2 (1)     14.9 (0.51)  17.7 (0.84)  6.99 (0.75)  19.7 (0.96)  16.3 (0.81)
WTA NWST     28.6 (1.1)   20.0 (0.92)  21.3 (1.1)   9.19 (0.89)  23.1 (1.2)   20.4 (1)
WTA DFST     23.6 (0.89)  17.8 (0.7)   18.0 (0.87)  7.01 (0.8)   20.0 (1)     17.3 (0.86)
WTA RWST     27.9 (0.97)  20.4 (1.3)   21.7 (0.96)  9.40 (0.81)  23.0 (0.85)  20.5 (0.98)
WTA SPST     28.7 (1.5)   29.5 (0.97)  22.1 (0.75)  11.7 (0.75)  23.3 (1.1)   23.1 (1)
CWTA DFST    17.8 (0.9)   9.83 (0.53)  13.2 (0.63)  3.57 (0.59)  14.8 (0.77)  11.8 (0.68)
CWTA NWST    18.6 (0.93)  9.97 (0.66)  14.3 (0.91)  4.56 (0.97)  15.0 (0.76)  12.5 (0.85)
CWTA RWST    18.1 (0.97)  9.58 (0.69)  13.8 (1)     4.09 (0.72)  14.6 (0.89)  12.1 (0.85)
GPA MST      38.8 (18)    26.2 (5.2)   22.6 (6.9)   8.55 (0.83)  24.6 (18)    24.2 (9.7)
GPA NWST     37.3 (8.7)   31.1 (5.7)   24.7 (6.4)   14.4 (7.9)   29.6 (16)    27.4 (9)
GPA DFST     36.0 (7)     34.9 (10)    27.7 (6.9)   13.7 (5.6)   24.8 (7.9)   27.4 (7.5)
GPA RWST     38.6 (11)    28.0 (3.9)   23.8 (6.4)   12.8 (7.6)   26.3 (13)    25.9 (8.3)
GPA SPST     35.9 (17)    32.7 (8.9)   22.1 (3.8)   10.6 (4.1)   30.1 (24)    26.3 (12)
LABPROP      17.8 (1.1)   7.84 (0.72)  11.1 (0.73)  3.91 (0.66)  15.0 (0.9)   11.1 (0.82)
OMV          16.2 (0.8)   7.50 (0.59)  10.2 (0.63)  3.21 (0.36)  13.8 (0.66)  10.2 (0.61)

ICPRAM 2012 - International Conference on Pattern Recognition Applications and Methods


tested networks³. Our results closely match

the WTA authors’ study (Cesa-Bianchi et al., 2010),

which found that WTA was generally more accurate

than GPA. We also see very similar patterns in the

choice of spanning tree; both methods tend to perform

best when using the minimum spanning tree (MST),

although the differences are typically small. It is important to remember, however, how much information is discarded in converting a graph to a tree, a point which will be discussed further in the concluding section.

Committees of trees (CWTA) were also examined

and found to perform at a similar level to OMV and

LABPROP. In this case the choice of spanning tree

appears to make very little difference to the overall

performance.

Another aspect to this experiment is the choice of

topic. It is interesting to see that all algorithms are

better able to propagate sport labels than business la-

bels. Our results show a strong correlation between

the irregularity of each network and the error rate; we

know that the sparse business network is about four times more irregular than the sport network, for example. Such

aspects are very good candidates for further study into

the content of the media.

7 CONCLUSIONS

In this paper we have performed an empirical study of

recent algorithms in label prediction. The main aim of

this has been to determine which method is the most

suitable for predicting missing labels in our automat-

ically generated social networks.

We have found that Online Majority Vote (OMV) was always the most accurate; it is also the simplest to implement and scales well. This might reflect the nature of our social networks rather than being a general property, but we believe the networks we have generated exhibit a number of properties that are common to real-world social networks.

Neither of the tree-based methods (WTA, GPA) quite matches OMV on this task; however, they are the only methods to offer strong theoretical guarantees.

Offline label propagation (LABPROP) performed nearly as well as OMV; its small shortfall is most likely due to it considering information that lies too far from the unlabelled node. It is an interesting algorithm, and its similarity to PageRank makes it easily parallelisable, particularly given recent methods in cloud computing (Malewicz et al., 2010). Note, however, that OMV and WTA are faster in practice.

³ While this paper was under review, a new algorithm was presented (Cesa-Bianchi et al., 2011) that outperforms WTA on a set of benchmarks. This new algorithm has not yet been introduced into the current experiment; this will be part of future work.

One of the biggest problems with social networks

is the number of edges they can contain. Storage can

quickly become an issue and execution times become

dominated by the edge count. This works in favour

of methods that scale in the number of nodes in a net-

work, for example, WTA or GPA.

What is perhaps most interesting about the tree

based methods is the sheer amount of data that they

throw away prior to running. Converting a graph into

a tree involves the removal of many edges; in the net-

works that we tested this corresponds to a loss of 72%

of edges in the Sparse network, 69% in the Medium-

Sparse network and 96% in the Dense network.
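Those percentages follow directly from the definition of a spanning tree: a connected graph on n vertices keeps exactly n - 1 of its m edges, so the discarded fraction is 1 - (n - 1)/m. A small sketch, using illustrative vertex and edge counts rather than those of our networks:

```python
def discarded_fraction(n_vertices, n_edges):
    """Fraction of edges lost when reducing a connected graph to a
    spanning tree, which retains exactly n_vertices - 1 edges."""
    return 1.0 - (n_vertices - 1) / n_edges

# The denser the graph, the larger the share of edges thrown away.
print(f"{discarded_fraction(10_000, 35_000):.0%}")   # -> 71%
print(f"{discarded_fraction(10_000, 250_000):.0%}")  # -> 96%
```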

To the best of our knowledge, there is no good way to observe connections in a social network and build a suitable tree directly, rather than a full network. If such a method existed, the pre-processing step of converting a graph to a tree would be unnecessary, yielding two advantages: a sparse representation of a network, and methods to predict labels on it. Although the tree-based methods did not perform quite as well on these networks, they have a lot of potential should issues such as online compression of social network data be satisfactorily solved.

In summary, we ﬁnd that OMV was the most suit-

able method for our networks at present. However, we

also ﬁnd many important trade-offs between methods.

WTA and GPA offer strong theoretical performance

guarantees and also scale in the number of nodes,

rather than edges. WTA and OMV are both very fast

methods, and WTA can improve its accuracy when

run on a committee of trees. LABPROP is accurate

at the expense of being slow; however, it lends itself

to parallelisation and would prove a useful option in a

distributed setting, for example.

Finally, we have released the networks used in

this study, along with several new networks that are

larger in size. An updated version of the source code

for label prediction algorithms has also been released.

We hope that this data and source code will help re-

searchers in future studies, both on labelled social net-

works and label prediction algorithms.

ACKNOWLEDGEMENTS

We thank Nicolò Cesa-Bianchi and Claudio Gentile

for their advice during these experiments. Nello Cris-

tianini was supported by the Royal Society via a

Wolfson Research Merit Award, and by the European


Project ComPLACS and Omar Ali by an EPSRC DTA

Grant.

REFERENCES

Ali, O. and Cristianini, N. (2010). Information Fusion for

Entity Matching in Unstructured Data. Artiﬁcial Intel-

ligence Applications and Innovations.

Ali, O., Flaounas, I., Bie, T. D., Mosdell, N., Lewis, J.,

and Cristianini, N. (2010). Automating News Content

Analysis: An Application to Gender Bias and Read-

ability. JMLR: Workshop and Conference Proceed-

ings, pages 1–7.

Boslaugh, S. and Watters, P. A. (2008). Statistics in a

Nutshell: A Desktop Quick Reference (In a Nutshell

(O’Reilly)). O’Reilly Media.

Broder, A. (1989). Generating random spanning trees. 30th

Annual Symposium on Foundations of Computer Sci-

ence, pages 442–447.

Cesa-Bianchi, N., Gentile, C., Vitale, F., and Zappella, G.

(2010). Random spanning trees and the prediction of

weighted graphs. In Proceedings of the 27th Interna-

tional Conference on Machine Learning.

Cesa-Bianchi, N., Gentile, C., Vitale, F., and Zappella, G.

(2011). See the Tree Through the Lines: The Shazoo

Algorithm. In Proceedings of the 25th Conference on

Neural Information Processing Systems.

Cunningham, H., Maynard, D., Bontcheva, K., and Tablan,

V. (2002). GATE: A framework and graphical devel-

opment environment for robust NLP tools and appli-

cations. In Proc. of the 40th Anniversary Meeting of

the Association for Computational Linguistics, pages

168–175.

Flaounas, I., Ali, O., Turchi, M., Snowsill, T., Nicart, F.,

De Bie, T., and Cristianini, N. (2011). NOAM: News

Outlets Analysis and Monitoring System. SIGMOD,

pages 1–3.

Flaounas, I., Turchi, M., Ali, O., Fyson, N., De Bie, T.,

Mosdell, N., Lewis, J., and Cristianini, N. (2010a).

The structure of the EU mediasphere. PLoS ONE,

5(12):e14243.

Flaounas, I. N., Fyson, N., and Cristianini, N. (2010b). Pre-

dicting Relations in News-Media Content among EU

Countries. Cognitive Information Processing (CIP),

2nd International Workshop on.

Herbster, M. and Pontil, M. (2007). Prediction on a graph

with a perceptron. Advances in Neural Information Processing Systems, 19:577.

Herbster, M., Pontil, M., and Rojas-galeano, S. (2009). Fast

Prediction on a Tree. Neural Information Processing

Systems.

Malewicz, G., Austern, M., Bik, A., Dehnert, J., Horn, I.,

Leiser, N., and Czajkowski, G. (2010). Pregel: a sys-

tem for large-scale graph processing. In Proceedings

of the 2010 International Conference on Management

of Data, pages 135–146. ACM.

Page, L., Brin, S., Motwani, R., and Winograd, T. (1998).

The pagerank citation ranking: Bringing order to the

web. World Wide Web Internet And Web Information

Systems, pages 1–17.

Travers, J. and Milgram, S. (1969). An Experimental Study

of the Small World Problem. Sociometry, 32(4):425.

Turchi, M., Flaounas, I. N., Ali, O., De Bie, T., Snowsill,

T., and Cristianini, N. (2009). Found in Transla-

tion. In ECML/PKDD, pages 746–749, Bled, Slove-

nia. Springer.

Wilson, D. B. (1996). Generating random spanning trees

more quickly than the cover time. Proceedings of the

twenty-eighth annual ACM symposium on Theory of

computing - STOC ’96, pages 296–303.

Zhu, X. and Ghahramani, Z. (2002). Learning from labeled

and unlabeled data with label propagation. School

Comput. Sci., Carnegie Mellon Univ., Tech. Rep.

CMU-CALD-02-107.

Zhu, X., Ghahramani, Z., and Lafferty, J. (2003). Semi-

supervised learning using gaussian ﬁelds and har-

monic functions. In International Conference on Ma-

chine Learning, volume 20, page 912.
