CONTEXT VECTOR CLASSIFICATION

Term Classification with Context Evaluation

Hendrik Schöneberg

Institute of Computer Science, University of Würzburg, Würzburg, Germany

Keywords:

Text mining, Classification, Deep tagging, Information retrieval.

Abstract:

Automated Deep Tagging relies heavily on the proper recognition of a term. If the term's surface form is obscured by spelling mistakes, OCR errors or typing variants, regular string matching or pattern matching algorithms may fail to classify it. Context Vector Tagging is an approach that analyzes term co-occurrence data and represents it in a vector space model, paying specific attention to the source's language. Using the cosine of the angle between two context vectors as a similarity measure, we propose that terms with similar context vectors share a similar word class, which allows even unknown terms to be classified. This approach is especially suited to the syntactic problems mentioned above and can support classic string-based or pattern-based classification algorithms in syntactically challenging environments.

1 INTRODUCTION

Motivation. Suppose we are researching an arbitrary topic on the internet. Unless we already know a source that provides the information we seek, at some point we will most likely have to use a search engine. The search engine's success depends heavily on the query we submit. Unfortunately, people can express the same idea very differently, owing to, for example, different educational backgrounds, language habits or personal preferences. According to studies by (Furnas et al., 1987) and (Deerwester et al., 1990), two people choose the same keyword to describe a single, well-known object less than 20% of the time.

To make an arbitrary source accessible to a broad variety of search queries, it is of great interest to provide additional knowledge that goes beyond the source's intrinsic information. Examples range from keywords describing the source's content category, editorial information or cross-references to related articles, down to fine-grained information such as synonyms for a specific term. The process of annotating a source with this additional information is called Deep Tagging. Deep Tagging a source manually is time-consuming and error-prone. This leads to a high demand for computer-aided or completely automated tagging approaches.

String and Pattern Matching Approach. Suppose, for example, that we want to find and tag all kinds of places in an unknown text file. Before a term referring to a place can be annotated with additional information, it must first be identified correctly. This task is most commonly performed by string matching or, more generally, pattern matching algorithms.

Unfortunately, generic matching algorithms can encounter a large variety of problems: spelling mistakes, OCR errors, typing variants and polysemy can inhibit the recognition process. To address these problems, algorithms usually draw on external knowledge provided as lists of synonyms, inflection rules, grammars, spelling variants or common spelling mistakes for a given term. This knowledge helps to improve the overall classification performance.

The University of Würzburg hosts projects dealing with the preparation and presentation of ancient sources (Würzburg-University-Library, 2010). Ancient sources follow only loose spelling conventions and tend to have a loose punctuation policy. For many terms, especially places or people, a broad variety of spelling variants exists. Furthermore, after digitization the sources can contain many OCR errors due to the elaborate handwriting of that time.

Performing a Deep Tagging on an ancient source

is especially challenging due to its heterogeneous


Schöneberg H..

CONTEXT VECTOR CLASSIFICATION - Term Classiﬁcation with Context Evaluation.

DOI: 10.5220/0003067403870391

In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2010), pages 387-391

ISBN: 978-989-8425-28-7

 2010 SCITEPRESS (Science and Technology Publications, Lda.)

appearance. Classiﬁcation algorithms depending on

string matching or pattern matching will therefore see

their use limited in this scenario.

Contextual Approach. Not only the term itself but also its context has proven to be a highly valuable source of information. According to (Miller and Charles, 1991), the exchangeability of two terms within a given context correlates with their semantic similarity. That is, the more easily two terms can be exchanged within the contexts in which they occur, the more likely they are to share a similar meaning. A statistical analysis of the context composition of two terms can therefore indicate their degree of semantic similarity.

Many approaches utilize the information contained within a term's context: (Gauch et al., 1999) propose an automatic query expansion approach based on information from term co-occurrence data. (Billhardt et al., 2002) analyze term co-occurrence data to estimate relationships and dependencies between terms. (Schütze, 1992) uses this information to create Context Vectors in a high-dimensional vector space to resolve polysemy. These results indicate that information about a term can be gained by analyzing its context. The following example illustrates the idea of information extraction from a term's context:

Example. Imagine passing by a group of people and overhearing a piece of conversation: "Tomorrow I am going to fly to ..."

Even though this sentence is not complete, it contains enough information for us to expect the missing word to be a place. In a conversation we would intuitively request the missing information by asking "Sorry, where are you going to?" and thereby express our expectation of a place. We classified the missing piece of information as a place just by its context. Our expectation is not restricted to a specific place, however. The sentence would make perfect sense with many terms, as long as they are instances of the class place: "Tomorrow I am going to fly to Berlin." "Tomorrow I am going to fly to London."

Conclusion. Consider two terms s and t as instances

of class x. If s and t are exchangeable within a con-

text c, then this context requires its related term to

be of class x, regardless of its particular instantiation.

(Miller and Charles, 1991) stated that semantic simi-

larity correlates to contextual similarity.

Using the information contained in a given term’s

context allows two actions:

Deduction of Knowledge. Given the above example, we expect the missing piece of information to be a place. If the speaker now replies with a word we have never heard before, we would assume it to be a place unknown to us. That is, we classified a previously unknown term using only the information within its context, and thereby acquired new knowledge.

Verification of Knowledge. If, on the other hand, the speaker replies with a term which, as far as we know, is not a place, we encounter a clash of knowledge: maybe our data is correct and the speaker provided false information, or maybe it is the other way around. In either case an erroneous piece of information has been detected just by its context.

Resolving Polysemy. This is a special case of the aforementioned clash of knowledge. We might, for example, know for a fact that a crane is a bird, but discover that, depending on its context, the term can also refer to a type of construction equipment.

We can suspect a term to be an instance of a certain class after evaluating its context because, as speakers of that particular language, we understand the underlying rules of forming a sentence. With those rules in mind we can conclude that only a few classes of terms would make sense in a given context. Teaching a computer to draw the same conclusions is challenging. Even with a sophisticated understanding of how to form a sentence in a given language, terms still have to be recognized in the first place, which brings us back to the recognition problems that string and pattern matching algorithms can encounter (see the String and Pattern Matching Approach paragraph above).

Classiﬁcation by Context. The contextual informa-

tion allows a transfer of knowledge to so far unknown

words: If you can identify a context c, which demands

its related term to be of class x, you could propose that

whenever you happen to ﬁnd another occurrence of c

within a source, its related term is an instance of class

x, too. This leads to the following working assump-

tion:

Working assumption. A classification algorithm can decide whether a given term is an instance of a class x (e.g. x = place) by evaluating how similar the term's context is to the contexts of known instances of x.

Statistical Context Analysis. Given an arbitrary source s, let n be the number of terms within s. (Schütze, 1992) introduces a high-dimensional vector space with n dimensions, one for each term in s. The context of any term t can then be represented as a vector within this vector space, where each dimension d (which is itself a term) holds the number of times t and d co-occurred throughout the source. The cosine of the angle (Baeza-Yates and Ribeiro-Neto, 1999) between two Context Vectors within this vector space measures the similarity of the two terms' co-occurrence patterns. Schütze suggests using a fixed window size or sentence boundaries to define co-occurrence.
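The construction described above can be illustrated with a minimal Python sketch. It assumes a fixed co-occurrence window and stores each Context Vector sparsely as a counter; the function names `context_vectors` and `cosine`, the window size and the toy sentences are illustrative assumptions, not part of the original work.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """Map each term to a Counter of the terms co-occurring within `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, term in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[term][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as Counters."""
    dot = sum(count * v[dim] for dim, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy sentences (illustrative only).
sentences = [
    "tomorrow i fly to berlin".split(),
    "tomorrow i fly to london".split(),
    "the crane is a bird".split(),
]
vecs = context_vectors(sentences, window=2)
print(cosine(vecs["berlin"], vecs["london"]))  # identical contexts, close to 1
print(cosine(vecs["berlin"], vecs["crane"]))   # no shared context terms, 0.0
```

In this toy example "berlin" and "london" share the same neighbours, so their vectors point in the same direction, while "crane" shares none of them.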

Part of Speech Analysis. However, another source of information has not been taken into account so far: most languages have rules for forming a sentence. Not only does a valid sentence have to contain certain integral parts (such as subject and predicate), the language's grammar also implies a certain order of the sentence's components. By analyzing the sequence of terms in a sentence we can gain additional information. Consider, for example, the expression 'the car': the occurrence of 'the' immediately before 'car' suggests that 'car' is a noun. Conversely, the language's grammar prevents many other parts of speech from immediately following 'the', which in turn affects term co-occurrence patterns. This information would be lost if we discarded a term's position within a sentence. (Gauch et al., 1999), for example, take a term's position within its context into account during the analysis of co-occurrence data. Our approach utilizes a scoring mechanism which applies weighting factors to a term's co-occurrences based on their position within the context.
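A position-dependent weighting of this kind might look like the following Python sketch. The concrete weights are not given in the text, so the 1/distance weighting and the "L:"/"R:" side prefixes used here are purely hypothetical choices for illustration.

```python
from collections import Counter

def weighted_context(tokens, index, window=3):
    """Score the neighbours of tokens[index]; nearer neighbours get larger weights.

    Assumption: weight = 1 / distance, and left/right neighbours are kept
    apart by an "L:" or "R:" prefix. Neither is taken from the paper.
    """
    scores = Counter()
    start = max(0, index - window)
    stop = min(len(tokens), index + window + 1)
    for j in range(start, stop):
        if j == index:
            continue
        distance = abs(j - index)
        side = "L" if j < index else "R"
        scores[f"{side}:{tokens[j]}"] += 1.0 / distance
    return scores

tokens = "we saw the car on the road".split()
ctx = weighted_context(tokens, tokens.index("car"))
print(ctx["L:the"], ctx["R:the"])  # 1.0 0.5
```

Keeping the side and distance means that 'the' directly before 'car' contributes more, and differently, than 'the' two positions after it.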

1.1 Performance Evaluation

Independence. The context evaluation approach is independent of the source's particular language, as it is an analysis of term co-occurrence patterns.

Stability. Imagine a source written in medieval German. This language follows only loose spelling conventions, resulting in a large number of spelling variants for a single term. Regular string or pattern matching approaches therefore have to depend on external knowledge to perform well. However, even though a single term's spelling could vary in medieval German, the rules for forming a sentence were as strict as in any Germanic language today. That is, a place name had a specific context, regardless of its actual spelling variant. Of course, the terms forming the context had different spelling variants as well. Imagine a term referring to a place. Its context could contain prepositions like "to", "from" or "in". As these words occur far more frequently than the place they refer to, their spelling will be far more consistent throughout the source. The context evaluation approach is thus able to deal with weak orthography and spelling variants by taking advantage of a statistical evaluation of frequently used terms.

2 CONTEXT CLASSIFICATION ALGORITHM

Let $T$ be the set of terms containing relevant information.

1. We define a class as a set of terms $T_X$ with

   $$T_X = \{\, t \mid t \text{ is an instance of class } X,\ t \in T \,\}$$

2. Pick an arbitrary query element $q \in T_X$.

3. Evaluate the context profile $P_q$, which is the set of all context items $c_{q,i}$ for the $n$ occurrences of $q$, $1 \le i \le n$, with

   $$P_q = \{\, c_{q,i} \mid c_{q,i} \text{ is a context item for } q \,\}$$

   and

   $$c_{q,i} = \{\, t \mid t \in T \text{ forming the local context of hit } i \,\}$$

4. Each component of a context item is assigned a score by the scoring function

   $$\mathrm{score} : \Sigma^{*} \times \Sigma^{*} \to \mathbb{R}$$

   A term's overallScore for a given context profile is the normalized sum of all scores:

   $$\mathrm{overallScore}(t, P_q) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{score}(t, c_{q,i})$$

   We assume a vector space $\mathbb{R}^{n}$, with $n$ here being the number of terms in $T$ (not the number of occurrences of $q$ used above). Each term $t \in T$ forms a dimension within the vector space. Given the context profile $P_q$ of a query $q$ with $x$ terms and their respective overallScore, $P_q$ can be interpreted as a vector in this vector space. We use the standard vector model as discussed in (Baeza-Yates and Ribeiro-Neto, 1999). By interpreting a context as a vector in this vector space we are able to estimate the similarity of two context profiles by measuring the cosine of the angle between them.

5. Let $\varepsilon$ be a similarity threshold with $0 \le \varepsilon \le 1$. For each context profile $P_r$ whose similarity to $P_q$ reaches the threshold $\varepsilon$ we propose:

   $$q \in T_X \;\wedge\; \mathrm{similarity}(P_q, P_r) \ge \varepsilon \implies r \in T_X$$

3 OPTIMIZATIONS

Obviously, the classification quality is heavily impacted by the choice of the query term $q$ and its resulting context profile $P_q$. Consider the following example:

Poor Representative. Assume again that we are interested in finding all kinds of places throughout a given


source. According to our algorithm (see Section 2) we choose a query term $q$ from the class of terms referring to places, $T_{\text{place}}$. We choose a certain place $b$ which happens to host a famous regular sports event but is otherwise fairly unknown. The query's context profile will most likely contain terms referring to the sports event. These attributes, however, are not commonly shared by instances of the class $T_{\text{place}}$. Attributes which might be essential for identifying a place, on the other hand, may not occur within the context profile at all. The resulting context profile $P_b$ therefore cannot reflect a typical context composition for an arbitrary place, even though $b$ is a place. Each instance of the class $T_{\text{place}}$ could appear in a slightly different context, resulting in a context profile with many terms relevant only to single instances, but not to the class. Clearly we need a way to identify the set of significant terms for a given class.

To extract the set of significant terms for a given class we cannot simply unite or intersect the contexts of its instances: a union would result in very large term sets, giving attributes that are relevant to only a few instances too much weight relative to the attributes relevant to the entire class. An intersection, on the other hand, could result in an empty term set, because the context profiles overlap only partially.

Majority Decision. A majority decision can determine which terms are relevant to a class rather than to particular instances. After choosing several instances of a class we count how frequently each term occurs within their context profiles. After sorting the terms by frequency of occurrence we then define the top $i$ terms to be the significant terms for their class.

Formal Description. Let $T$ be the set of all relevant terms in our source.

1. Choose a class, e.g. $T_{\text{place}}$.

2. Pick $n$ terms from $T_{\text{place}}$ and calculate their context profiles $P_1, \dots, P_n$.

3. Calculate $\mathrm{frequency}(t)$ for each term $t \in T$ with

   $$\mathrm{frequency}(t) = \sum_{j=1}^{n} \mathrm{occurs}(t, P_j)$$

   and

   $$\mathrm{occurs}(t, c) = \begin{cases} 1 & \text{if } t \text{ occurs in context item } c \\ 0 & \text{otherwise} \end{cases}$$

4. Sort the terms by their frequency and extract the most frequent $i$ items. We define this set of $i$ terms as $F_{\text{place}}$, the $i$-significant set of terms for the class place.

Result. We can use the majority decision defined above to determine the set $F_X$ of the $i$ most significant terms for class $X$. Instead of comparing unknown context vectors with a single instance of our class $X$, we create a cluster of $n$ instances and extract the set of significant terms. This avoids comparisons based on terms which might be relevant to only a few instances of a given class. Each significant term's score is the average of its overall scores across the cluster's instances. Each unknown context vector is then compared with this new cluster context profile.
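As a rough illustration of the averaging step, the following Python sketch builds a cluster context profile from per-instance score maps. It assumes that a term missing from an instance's profile contributes a score of zero and that the average runs over all $n$ instances; both points, like the name `cluster_profile` and the sample values, are assumptions.

```python
def cluster_profile(score_maps, significant):
    """Average each significant term's overall score across all instances.

    `score_maps` holds one {term: overallScore} mapping per class instance.
    Assumption: a term absent from an instance's profile scores 0.0 there.
    """
    n = len(score_maps)
    return {t: sum(m.get(t, 0.0) for m in score_maps) / n for t in significant}

# Hypothetical overall scores of two instances of one class.
score_maps = [
    {"to": 1.0, "in": 0.5, "match": 0.5},
    {"to": 0.5, "in": 0.5, "river": 1.0},
]
cluster = cluster_profile(score_maps, ["in", "to"])
print(cluster)  # {'in': 0.5, 'to': 0.75}
```

The resulting mapping can be compared with an unknown term's context vector using the same cosine measure as before.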

4 FUTURE WORK

Deep Tagging. The Context Vector Classification approach is designed to act as a support module for classic pattern matching algorithms in automated tagging. The Würzburg University Library is interested in processing (ancient) sources and annotating them according to the TEI-P5 standard (TEI-Consortium, 2007). In particular, the detection of events, composed of actors, places and dates, is as important as it is difficult due to the syntactical challenges mentioned above. A workbench which combines pattern matching algorithms with the Context Vector Classification approach is under development, with the goal of providing the user with suggestions for the classification of terms.

Reinforced Learning. A learning module for the classification framework is currently under development. After a classification has been proposed for a given term, the user can approve or decline the decision. Based on the user's input a weight factor will be applied to each component of the context vector. After several iterations the framework obtains a specific weight matrix for a class of terms. This specialization allows adaptation to different contextual environments and is expected to improve the classification quality.

5 RESULTS

Up to this point only small, yet very promising tests of

the classiﬁcation quality have been conducted. Large-

scale tests on corpora of different modern languages

are currently under development.

5.1 Ancient Source

The Context Vector Classification approach was tested with an ancient German source, Merian's Topographia Germaniae (Merian, 1642) and (Merian, 2010). Table 1 shows the most similar terms found for a given reference cluster, each cluster composed of five terms.

Table 1: Most similar terms by cosine angle. Ancient German source (Merian, 1642).

| CLUSTER | HIGHEST SIMILARITY |
|---|---|
| places | mäyntz sachsen mayntz oesterreich vianden bamberg marpurg mümpelgart angefangen |
| names | friderich adolph johann georg otto albrecht wilhelm friederich ludwig heinrich |
| roles | bischoff könig abbt rath general hertzog thurn graff zeit käyser |

5.2 Modern Corpus

Setup. The following examples demonstrate the classification quality. The context vectors were created from a German-language corpus of 3 million sentences (Leipzig-University, 1998).

Places. The test subject is a snippet of text containing 1658 terms, 26 of which are relevant for classification as place. The contexter module examined a term's left, right and combined context with a window size of up to 6 terms. See (Baeza-Yates and Ribeiro-Neto, 1999) for the definitions of precision, recall and f-value. Table 2 shows the results.

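Table 2: Cluster: Places. Context window size 6.

| CONTEXT | COSINE THRESHOLD | PRECISION | RECALL | F-VALUE |
|---|---|---|---|---|
| left | 0.9 | 0.73 | 0.85 | 0.79 |
| left | 0.95 | 0.95 | 0.77 | 0.85 |
| right | 0.9 | 0.07 | 0.81 | 0.13 |
| right | 0.95 | 0.26 | 0.62 | 0.36 |
| combined | 0.9 | 0.4 | 0.88 | 0.55 |
| combined | 0.95 | 0.95 | 0.81 | 0.88 |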
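For readers who want to relate the columns to each other, the following Python sketch computes the f-value as the harmonic mean of precision and recall, which is the balanced definition given in (Baeza-Yates and Ribeiro-Neto, 1999). The function name `f_value` is an assumption, and the input is simply the first "left" row for places with a cosine threshold of 0.9.

```python
def f_value(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision 0.73 and recall 0.85 from the first row for places.
print(round(f_value(0.73, 0.85), 2))  # 0.79
```

Because the reported precision and recall are themselves rounded, recomputed values can differ from the published f-values in the last digit.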

Names. In this case the test subject is a snippet of text containing 1635 terms, 19 of which are relevant for classification as name. The contexter module examined a term's left, right and combined context with a window size of up to 6 terms. See Table 3 for results.

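Table 3: Cluster: Names. Context window size 6.

| CONTEXT | COSINE THRESHOLD | PRECISION | RECALL | F-VALUE |
|---|---|---|---|---|
| left | 0.9 | 0.6 | 0.6 | 0.6 |
| left | 0.95 | 0.82 | 0.45 | 0.58 |
| right | 0.9 | 0.19 | 0.6 | 0.29 |
| right | 0.95 | 0.8 | 0.4 | 0.53 |
| combined | 0.9 | 0.5 | 0.7 | 0.58 |
| combined | 0.95 | 0.77 | 0.5 | 0.61 |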

REFERENCES

Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. ACM Press, New York, 1st edition.

Billhardt, H., Borrajo, D., and Maojo, V. (2002). A context vector model for information retrieval. Journal of the American Society for Information Science and Technology, 53:236–249.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391–407.

Furnas, G. W., Landauer, T. K., Gomez, L. M., and Dumais, S. T. (1987). The vocabulary problem in human-system communication. Commun. ACM, 30(11):964–971.

Gauch, S., Wang, J., and Rachakonda, S. M. (1999). A corpus analysis approach for automatic query expansion and its extension to multiple databases. ACM Trans. Inf. Syst., 17(3):250–269.

Leipzig-University (1998). German 3M corpus. http://corpora.informatik.uni-leipzig.de/.

Merian, M. d. A. (1642). Topographia Germaniae. Bärenreiter.

Merian, M. d. A. (2010). Topographiae Germaniae. http://de.wikisource.org/wiki/Topographia Germaniae/.

Miller, G. A. and Charles, W. G. (1991). Contextual correlates of semantic similarity. Language and Cognitive Processes, 6.

Schütze, H. (1992). Dimensions of meaning. In Supercomputing '92: Proceedings of the 1992 ACM/IEEE conference on Supercomputing, pages 787–796, Los Alamitos, CA, USA. IEEE Computer Society Press.

TEI-Consortium (2007). Guidelines for electronic text encoding and interchange. http://www.tei-c.org/release/doc/tei-p5-doc/en/html/.

Würzburg-University-Library (2010). Franconica online. http://franconica.uni-wuerzburg.de/Franconica/index.html/.
