Explaining Unintelligible Words by Means of their Context

Bal

azs Pint

er, Gyula V

os, Zolt

an Szab

o and Andr

as L

orincz

Faculty of Informatics, E

otv

os Lor

and University, P

azm

any P. s

any 1/C, H-1117 Budapest, Hungary

Keywords:

Unintelligible Words, Wikiﬁcation, Link Disambiguation, Natural Language Processing, Structured Sparse

Coding.

Abstract:

Explaining unintelligible words is a practical problem for text obtained by optical character recognition, from

the Web (e.g., because of misspellings), etc. Approaches to wikiﬁcation, to enriching text by linking words to

Wikipedia articles, could help solve this problem. However, existing methods for wikiﬁcation assume that the

text is correct, so they are not capable of wikifying erroneous text. Because of errors, the problem of disam-

biguation (identifying the appropriate article to link to) becomes large-scale: as the word to be disambiguated

is unknown, the article to link to has to be selected from among hundreds, maybe thousands of candidate

articles. Existing approaches for the case where the word is known build upon the distributional hypothesis:

words that occur in the same contexts tend to have similar meanings. The increased number of candidate

articles makes the difﬁculty of spuriously similar contexts (when two contexts are similar but belong to dif-

ferent articles) more severe. We propose a method to overcome this difﬁculty by combining the distributional

hypothesis with structured sparsity, a rapidly expanding area of research. Empirically, our approach based on

structured sparsity compares favorably to various traditional classiﬁcation methods.

1 INTRODUCTION

Many common types of errors can occur in free text

that produce unintelligible words. A word may be

misspelled. Errors can be introduced by Optical Char-

acter Recognition (OCR) because of imperfect scans

or errors committed by the algorithm. Automatic

speech recognition can also introduce errors. Explain-

ing these unintelligible words with Wikipedia articles

could help users and computer algorithms alike to un-

derstand them.

Enriching text documents with links to Wikipedia

articles, wikiﬁcation, has been in the focus of much

attention recently. Starting from the work of (Mihal-

cea and Csomai, 2007) wikiﬁcation consists of two

phases: link detection and link disambiguation. The

detection phase identiﬁes the terms and phrases from

which links should be made. The disambiguation

phase identiﬁes the appropriate article for each de-

tected term to link to. For example, the term bank

could link to an article about ﬁnancial institutions, or

river banks.

We consider link disambiguation as our starting

point to approach explaining unintelligible words.

The words to be disambiguated are assumed given:

they are the erroneous words in the text. They can be

selected by the user, or detected automatically by

methods such as (Kukich, 1992; Leacock et al., 2010).

Current approaches to link disambiguation do not

handle text with errors because of a tacit assump-

tion: the disambiguation of different word types

are

treated as independent problems. This constraint is

essential to reduce the complexity of the disambigua-

tion problem: without it, the article the target word

will link to has to be selected from among hundreds,

maybe thousands of candidate articles. If an error

makes the surface form

of the target word unusable,

the problem cannot be decomposed to disambiguation

on different word types: the vast number of candidate

articles yields a large-scale problem. Due to the large

scale, an additional difﬁculty appears.

Typical methods to disambiguate known target

words apply the distributional hypothesis. Accord-

ing to the distributional hypothesis, words that occur

in the same contexts tend to have similar meanings

(Harris, 1954). Because the disambiguation problem

with unknown target words is intrinsically large-scale,

exceptions to the distributional hypothesis can occur

In “A rose is a rose is a rose”, there are three word types

(a, rose, is), but eight word tokens.

The word to be explained with a Wikipedia article.

The form of a word as it appears in the text.

382

Pintér B., Vörös G., Szabó Z. and Lõrincz A. (2013).

Explaining Unintelligible Words by Means of their Context.

In Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods, pages 382-387

DOI: 10.5220/0004267003820387

 SciTePress

more frequently. Particularly, let us call two contexts

spuriously similar if they are similar but belong to dif-

ferent articles. The amount of spuriously similar con-

texts tends to increase inherently with the number of

candidate articles. This makes the learning problem

considerably hard.

In this paper, we propose a method to explain

unintelligible words with Wikipedia articles that ad-

dresses this problem by using the distributional hy-

pothesis in a novel way. Structured sparse coding

(Bach et al., 2012) is introduced to diminish the ef-

fect of spurious similarities of contexts by utilizing

the structure of semantic space (Section 3).

The contributions of the paper are summarized as

follows: (i) we propose a method to disambiguate un-

intelligible words to Wikipedia articles. (ii) We show

that structured sparsity reduces the effect of spurious

similarities of contexts. (iii) We perform large-scale

evaluations where we disambiguate from among 1000

Wikipedia articles at once.

In the next section we review related work. Our

method and results are described in Section 3 and 4.

We discuss our results in Section 5 and conclude in

Section 6.

2 RELATED WORK

The main difference between previous methods in the

literature and ours is that they consider the disam-

biguation problems of different word types indepen-

dently. In our case, as the word type is unintelligible,

this would be unfeasible.

(Mihalcea and Csomai, 2007) introduced the con-

cept of wikiﬁcation: they proposed a method to auto-

matically enrich text with links to Wikipedia articles.

They used keyword extraction to detect the most im-

portant terms in the text, and disambiguated them to

Wikipedia articles with supervised learning using the

contexts. The same task was solved in (Milne and

Witten, 2008) more efﬁciently. Here, contexts were

taken into account also for the detection phase. Dis-

ambiguation was done using sense commonness and

sense relatedness scores.

Unlike the previously mentioned works, which in-

troduce links to important terms in the text to achieve

better readability, the goal of (Kulkarni et al., 2009)

was to add as many links as possible to help index-

ing for information retrieval. The terms were disam-

biguated by assuming that coherent documents refer

to entities from one or a few related topics or domains.

(Ratinov et al., 2011) proposed a similar disambigua-

tion system called GLOW (global wikiﬁcation), which

used several local and global features to obtain a set

of disambiguations that are coherent in the whole text.

In information retrieval and speech recognition,

unintelligible words pose a practical problem. The

TREC-5 confusion track (Kantor and Voorhees, 2000)

studied the impact of data corruption introduced by

scanning or OCR errors on retrieval performance.

In the subsequent spoken document retrieval tracks

(Garofolo et al., 2000), the errors were introduced by

automatic speech recognition.

Structured sparsity has been successfully applied

to natural language processing problems different

from ours in works such as (Jenatton et al., 2011) and

(Martins et al., 2011). (Jenatton et al., 2011) apply

sparse hierarchical dictionary learning to learn hierar-

chies of topics from a corpora of NIPS proceedings

papers. In a more recent application (Martins et al.,

2011), structured sparsity was used to perform effec-

tive feature template selection on three natural lan-

guage processing tasks (chunking, entity recognition,

and dependency parsing).

3 THE METHOD

We start from a list of candidate articles the unintel-

ligible word could be linked to. For each candidate

article, we collect a number of contexts. A context

of a candidate article consists of the N non-stopword

words occurring before and after the anchor of the

link that points to the article. There can be at most

2N words in a context.

The presented method makes use of a collection

of such contexts arranged in a word-context matrix

D (Turney and Pantel, 2010) (Figure 1). In this ma-

trix, each context is a column represented as a bag-of-

words vector v of word frequencies, where v

is the

number of occurrences of the ith word in the context.

The article is determined in two steps. First, we

formulate an inverse problem, and compute a repre-

sentation vector α

α. In the second step, a single can-

didate article is selected based on the weights in this

vector.

To compute the representation vector α

α, the con-

text x ∈ R

of the target word is approximated lin-

early with the columns of the word-context matrix

D = [d

, d

, . . . , d

] ∈ R

m×n

, called the dictionary in

the terminology of sparse coding. The columns of

the dictionary contain contexts, each labeled with the

candidate article l

∈ L the context was collected for.

Please note that multiple contexts can be, and in many

cases are, tagged with the same candidate article:

= l

is possible. There are m words in the vocab-

ulary, and n contexts in the dictionary.

ExplainingUnintelligibleWordsbyMeansoftheirContext

383

computer

leg

shoe

modern

Boot Foot

Booting

...

Figure 1: The word-context matrix D. Each column is a context of a candidate article (e.g., Boot, Foot). Each element D

i j

the matrix holds the number of occurences of the ith word in the jth context. For example, the word leg occurs three times in

the 7th context, which is the 3rd context labeled with Foot.

The representation vector α

α consists of the coefﬁ-

cients of a linear combination

x = α

+ α

+ . . . + α

. (1)

For each target word, whose context is x ∈ R

, a rep-

resentation vector α

α = [α

;α

;. . . ;α

] ∈ R

is com-

puted.

To diminish the effect of spurious similarities, we

introduce a structured sparsity inducing regularization

by organizing the contexts in D into groups. Each

group contains the contexts annotated with a single

candidate article. As only a single candidate is se-

lected for each target word, ideally only a single

group should be active in each representation vector

α. Sparsity on the groups is realized by computing α

with a group Lasso regularization (Yuan et al., 2006)

determined by the labels.

The groups are introduced as a family of sets

G = {G

}

l∈L

⊆ 2

{1,...,n}

. There are as many sets in G

as there are distinct candidate articles in L. For each

article l ∈ L, there is exactly one set G

∈ G that con-

tains the indices of all the columns d

tagged with l.

G forms a partition.

The representation vector α

α of the target word

whose context is x is deﬁned as the minimum of the

loss function

min

α∈R

∥x − Dα

α∥

+ λ

∑

l∈L

||α

, (2)

where α

∈ R

denotes the vector where only the

coordinates present in the set G

⊆ {1, . . . , n} are re-

tained.

The ﬁrst term is the approximation error, the sec-

ond one realizes the structured sparsity inducing reg-

ularization. Parameter λ > 0 controls the tradeoff be-

tween the two terms. The parameters w

> 0 denote

the weights for each group G

If each group is a singleton (i.e., G =

{{1}, {2}, . . . , {n}}) the Lasso problem (Tibshi-

rani, 1994) is recovered:

min

α∈R

∥x − Dα

α∥

+ λ

∑

i=1

|α

|. (3)

Setting λ = 0 yields the least squares cost function.

For the sake of simplicity, we represent each can-

didate article with the same number of contexts: there

are an equal number of columns in D for each label

l ∈ L (|G

| = |G

| = ·· · = |G

|L|

|). The weights w

the groups are set to 1.

In the second step, the link is disambiguated to a

single article based on the weights in this vector. We

utilize the group structure to condense the vector α

to a single article. The weights are summed in each

group G

∈ G, and the article l

∗

∈ L whose group con-

tains the most weight is selected:

∗

= arg max

l∈L

∑

(α

)

In this group Lasso formulation, whole groups are

selected. Each group G

∈ G contains contexts tagged

with the same candidate article l ∈ L, and only a few

groups can be selected. A context similar to the con-

text of the target word, x, only by accident has a

smaller chance to be selected: as it is in a group la-

beled with a candidate article that is less related in

meaning to the target word than the correct candidate,

its group contains mainly contexts less similar to x.

Therefore, errors introduced by spurious similarities

of contexts can be effectively diminished (Section 4).

4 RESULTS

To evaluate the method, we solve the disambiguation

task of wikiﬁcation, with a signiﬁcant difference: we

assume that the surface form of the target word is un-

known.

ICPRAM2013-InternationalConferenceonPatternRecognitionApplicationsandMethods

384

4.1 The Datasets

The datasets used in our experiments are obtained

by randomly sampling the links in Wikipedia. Each

dataset consists of contexts labeled with candidate ar-

ticles (c

, l

), (c

, l

), . . . . Each labeled context is ob-

tained by processing a link: the bag-of-words vector

generated from the context of the anchor text is anno-

tated with the target of the link.

We use the English Wikipedia database dump

from October 2010

. Disambiguation pages, and ar-

ticles that are too small to be relevant (i.e., have less

than 200 non-stopwords in their texts, or less than 20

incoming and 20 outgoing links) are discarded. In-

ﬂected words are reduced to root forms by the Porter

stemming algorithm (Porter, 1997).

To produce a dataset, a list of anchor texts are gen-

erated that match a number of criteria. These criteria

have been chosen to obtain (i) words that are frequent

enough to be suitable training examples and (ii) are

proper English words. The anchor text has to be a

single word between 3 and 20 characters long, must

consist of the letters of the English alphabet, must

be present in Wikipedia at least 100 times, and must

point to at least two different Wikipedia articles, but

not to more than 20. It has to occur at least once in

WordNet (Miller, 1995) and at least three times in the

British National Corpus (BNC Consortium, 2001).

A number of anchor texts are selected from this

list randomly, and their linked occurrences are col-

lected along with their N-wide contexts. Each link is

processed to obtain a labeled context (c

, l

To ensure that there are an equal number of con-

texts tagged with each article l ∈ L, d randomly se-

lected contexts are collected for each label. Labels

with less than d contexts are discarded. We do not

perform feature selection, but we remove the words

that appear less than ﬁve times across all contexts, in

order to discard very rare words.

4.2 Evaluations

The task we proposed is a disambiguation problem

where the algorithm has to decide between all candi-

date articles at once, because the surface form of the

target word is not available. Given a context x ∈ R

a word, the goal is to determine the appropriate candi-

date article l ∈ L. The performance of the algorithms

is measured as the accuracy of this classiﬁcation.

We compare our group Lasso based method to

three baselines: two different regularizations (least

squares and the Lasso) of the inverse problem de-

scribed in Section 3, and a Support Vector Machine

Downloaded from http://dumps.wikimedia.org/enwiki/.

(SVM). The SVM is a multiclass Support Vector Ma-

chine with a linear kernel, used successfully for wik-

iﬁcation in previous works (Milne and Witten, 2008;

Ratinov et al., 2011).

10 20 30 40 50

0.35

0.4

0.45

0.5

0.55

0.6

0.65

0.7

0.75

accuracy vs d, |L| = 500

accuracy

Group Lasso

Support Vector Machine

Lasso

Figure 2: Dependency of the accuracy on the number of

contexts per candidate article. There are d − 1 such con-

texts in each step of the cross-validation, as there is one test

example for each article. The data points are the mean of

values obtained on the ﬁve datasets. The error bars denote

the standard deviations. The results of least squares are not

illustrated as the standard deviations were very large. It per-

forms consistently below the Lasso.

For least squares and the Lasso, the link is disam-

biguated to the article that corresponds to the largest

coefﬁcient in α

α. For the SVM, a classiﬁcation prob-

lem is solved using the labeled contexts (c

, l

) as

training and test examples.

The minimization problems of both the Lasso and

the group Lasso (Eq. 2) are solved by the Sparse

Learning with Efﬁcient Projections (SLEP) package

(Liu et al., 2009). For the support vector machine, we

use the implementation of LIBSVM (Chang and Lin,

2001).

The algorithms are evaluated on ﬁve disjoint

datasets generated from Wikipedia (Section 4.1), each

with different candidate articles. The mean and

standard deviation of the accuracy across these ﬁve

datasets are reported.

There are |L| = 1000 different articles in each

dataset, and d = 50 contexts tagged with each article.

The algorithms are evaluated on datasets of different

sizes (i.e., d and |L| are different), generated from the

original ﬁve datasets by removing contexts and their

labels randomly .

In accord with (Lee and Ng, 2002; Sch

utze, 1998),

and others, we use a broad context, N = 20. We found

that a broad context improves the performance of all

four algorithms.

Before evaluating the algorithms, we examined

the effect of their parameters on the results. We

ExplainingUnintelligibleWordsbyMeansoftheirContext

385

100 200 300 400 500 600 700 800 900 1000

0.35

0.4

0.45

0.5

0.55

0.6

0.65

0.7

0.75

accuracy vs |L|, d = 10

number of candidate articles (|L|)

accuracy

Group Lasso

Support Vector Machine

Lasso

(a)

100 200 300 400 500 600 700 800 900 1000

0.35

0.4

0.45

0.5

0.55

0.6

0.65

0.7

0.75

accuracy vs |L|, d = 20

number of candidate articles (|L|)

accuracy

Group Lasso

Support Vector Machine

Lasso

(b)

Figure 3: Dependency of the accuracy on the number of candidate articles, |L|. The data points are the mean of values

obtained on the ﬁve datasets. The error bars denote the standard deviations. The results of least squares are not illustrated, as

the standard deviations were very large. It performs consistently below the Lasso.

found that the algorithms are robust: for the Lasso,

λ = 0.005, for the group Lasso, λ = 0.05, and for the

SVM, C = 1 was optimal in almost every validation

experiment.

In the ﬁrst evaluation, we examine the effect of

the number of training examples per candidate arti-

cle on the accuracy of the four algorithms. The start-

ing datasets consist of |L| = 500 articles with d = 10

contexts (or examples) each. Stratiﬁed 10-fold cross-

validation is used to determine the accuracy of the

classiﬁcation. The dataset is partitioned into 10 sub-

sets (the same as d), where each subset contains ex-

actly |L| examples – one annotated with each article.

In one iteration, one subset is used for testing, and the

other 9 subsets form the columns of D: there are |L |

test examples and n = (d −1)|L| columns in D in each

iteration. For the SVM, the columns of D are used as

training examples.

To examine the effect of additional contexts, we

add contexts to D for each candidate article, and ex-

amine the change in accuracy. In order to evaluate the

effect correctly (i.e., to not make the learning problem

harder), the test examples remain the same as with

d = 10. In other words, we perform the same cross-

validation as before, only we add additional columns

to D in each step. In Figure 2, we report the results

for d = 10, 20, 30, 40, 50.

In the second evaluation, the accuracy of the algo-

rithms is examined as the number of candidate articles

|L| increases. As in the ﬁrst evaluation, there are d =

10 examples per candidate article, and stratiﬁed 10-

fold cross-validation is performed. Then, the number

of examples is raised to d = 20 in the same way (i.e.,

the new examples are not added to the test examples).

We report the results for |L| = 100, 200, . . . , 1000 can-

didate articles in Figure 3.

5 DISCUSSION

The results are very consistent across the ﬁve disjoint

datasets, except in the case when the representation

vector was computed with least squares. The perfor-

mance of least squares was the worst of the four algo-

rithms, and it was so erratic that we did not plot it in

order to keep the ﬁgure uncluttered.

For group Lasso and the SVM, additional train-

ing examples help up to 20 examples per article (Fig-

ure 2), but only small gains can be achieved by adding

more than 20 examples.

In sharp contrast, the Lasso-based representation

does not beneﬁt from new training examples at all

when there are many candidate articles. This may be

the effect of spurious similarities. As more and more

candidate articles are added, the less chance Lasso has

to select the right article from among the candidates.

Representation vectors computed with structured

sparsity inducing regularization signiﬁcantly outper-

form the other methods, including SVM (Figure 3).

This illustrates the efﬁciency of our method: struc-

tured sparsity decreases the chance of selecting con-

texts spuriously similar to the context of the target

word.

6 CONCLUSIONS

We proposed a method to explain unintelligible words

with Wikipedia articles. In addition to explaining un-

ICPRAM2013-InternationalConferenceonPatternRecognitionApplicationsandMethods

386

intelligible words, the method can also be used to

help wikify possibly erroneous text from real-world

sources such as the Web, optical character recogni-

tion, or speech recognition. The numerical evalu-

ations demonstrated that the method works consis-

tently even in large-scale experiments, when disam-

biguating between up to 1000 Wikipedia articles.

A possible future application of the presented

method is the veriﬁcation of links to Wikipedia. The

method can assign a single weight to each candidate

article: the sum of the weights in its group in the rep-

resentation vector α

α. If the weight corresponding to

the target of the link is small in contrast to weights of

other articles, the link is probably incorrect.

The presented method can be generalized, as it

can work with arbitrarily labeled text fragments as

well as contexts of Wikipedia links. This more gen-

eral framework may have further applications, as the

idea of distributional similarity offers solutions to

many natural language processing problems. For ex-

ample, topics might be assigned to documents as

in centroid-based document classiﬁcation (Han and

Karypis, 2000).

ACKNOWLEDGEMENTS

The research has been supported by the ‘European

Robotic Surgery’ EC FP7 grant (no.: 288233). Any

opinions, ﬁndings and conclusions or recommenda-

tions expressed in this material are those of the au-

thors and do not necessarily reﬂect the views of other

members of the consortium or the European Commis-

sion.

REFERENCES

Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. (2012).

Optimization with sparsity-inducing penalties. Foun-

dations and Trends in Machine Learning, 4(1):1–106.

BNC Consortium (2001). The British National Corpus, ver-

sion 2 (BNC World).

Chang, C.-C. and Lin, C.-J. (2001). LIBSVM: a library for

support vector machines.

Garofolo, J. S., Auzanne, C. G. P., and Voorhees, E. M.

(2000). The TREC Spoken Document Retrieval

Track: A Success Story. In RIAO, pages 1–20.

Han, E.-H. and Karypis, G. (2000). Centroid-based doc-

ument classiﬁcation: Analysis and experimental re-

sults. In PKDD, pages 116–123.

Harris, Z. (1954). Distributional structure. Word,

10(23):146–162.

Jenatton, R., Mairal, J., Obozinski, G., and Bach, F.

(2011). Proximal methods for hierarchical sparse cod-

ing. Journal of Machine Learning Research, 12:2297–

2334.

Kantor, P. B. and Voorhees, E. M. (2000). The TREC-5

Confusion Track: Comparing Retrieval Methods for

Scanned Text. Information Retrieval, 2:165–176.

Kukich, K. (1992). Techniques for automatically correcting

words in text. ACM Computing Surveys, 24(4):377–

439.

Kulkarni, S., Singh, A., Ramakrishnan, G., and

Chakrabarti, S. (2009). Collective annotation of

Wikipedia entities in web text. In KDD, pages 457–

466.

Leacock, C., Chodorow, M., Gamon, M., and Tetreault, J.

(2010). Automated Grammatical Error Detection for

Language Learners. Synthesis Lectures on Human

Language Technologies. Morgan & Claypool Publish-

ers.

Lee, Y. K. and Ng, H. T. (2002). An empirical evaluation of

knowledge sources and learning algorithms for word

sense disambiguation. In EMNLP, pages 41–48.

Liu, J., Ji, S., and Ye, J. (2009). SLEP: Sparse Learning

with Efﬁcient Projections. Arizona State University.

Martins, A. F. T., Smith, N. A., Aguiar, P. M. Q., and

Figueiredo, M. A. T. (2011). Structured Sparsity in

Structured Prediction. In EMNLP, pages 1500–1511.

Mihalcea, R. and Csomai, A. (2007). Wikify!: linking doc-

uments to encyclopedic knowledge. In CIKM, pages

233–242.

Miller, G. A. (1995). WordNet: A lexical database for En-

glish. Communications of the ACM, 38:39–41.

Milne, D. and Witten, I. H. (2008). Learning to link with

Wikipedia. In CIKM, pages 509–518.

Porter, M. F. (1997). An algorithm for sufﬁx stripping, pages

313–316. Morgan Kaufmann Publishers Inc.

Ratinov, L., Roth, D., Downey, D., and Anderson, M.

(2011). Local and global algorithms for disambigua-

tion to Wikipedia. In ACL-HLT, pages 1375–1384.

Sch

utze, H. (1998). Automatic word sense discrimination.

Computational Linguistics, 24(1):97–123.

Tibshirani, R. (1994). Regression shrinkage and selection

via the lasso. Journal of the Royal Statistical Society,

Series B, 58:267–288.

Turney, P. D. and Pantel, P. (2010). From frequency to

meaning: vector space models of semantics. Journal

of Artiﬁcial Intelligence Research, 37(1):141–188.

Yuan, M., Yuan, M., Lin, Y., and Lin, Y. (2006). Model

selection and estimation in regression with grouped

variables. Journal of the Royal Statistical Society, Se-

ries B, 68:49–67.

ExplainingUnintelligibleWordsbyMeansoftheirContext

387