contained within damaged parts can only be reconstructed heuristically. These factors lead to a wide variety of spelling variants (see Table 1).
Table 1: Spelling variants for the term 'Würzburg'.

Spelling variants for 'Würzburg' (excerpt):
Herbipolis, Wirceburgum, Wirciburc, Wirtzburg, Wirzburg, Wirziaburg, Würtzburg, Würtzb, wurtzb, Wurtzburg, wurtzburg, würtzburgk, würtzburg, W, würtzberg, Wurtzb, Würzburg, wurzburg ...
Identifying all spelling variants of a given Named Entity in an automated manner is a non-trivial task, in which traditional dictionary-driven information retrieval and markup approaches can only succeed to a certain degree, as it is very unlikely that all of an entity's spelling variants have been discovered yet.
Goal. We need to find classifiers for Named Entities that can cope with syntactically challenging environments and a broad variety of spelling variants.
2 CONTEXT CLASSIFICATION
APPROACHES
As the previous section pointed out, the domain of ancient documents poses a special challenge for Named Entity Recognition due to the terms' high variability. But instead of focusing on compensating for that variability, could we not look for more reliable sources of information instead?
Term Co-occurrence. In contrast to the loosely defined orthography, the grammar and sentence structure of ancient documents are comparably as restrictive as today's. Thus the likelihood of two terms t1 and t2 co-occurring is not arbitrary but follows a specific probability.
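As a minimal sketch of this idea, the co-occurrence probability of two terms can be estimated from how often they appear within the same sentence. The toy corpus below is hypothetical and only illustrates the counting scheme:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count how often each unordered term pair occurs in the same sentence."""
    pair_counts, term_counts = Counter(), Counter()
    for tokens in sentences:
        distinct = sorted(set(tokens))
        term_counts.update(distinct)
        pair_counts.update(combinations(distinct, 2))
    return pair_counts, term_counts

# Hypothetical toy corpus: each "sentence" is a list of normalized tokens.
corpus = [
    ["anno", "domini", "wurtzburg"],
    ["anno", "domini", "herbipolis"],
    ["anno", "wurtzburg"],
]
pairs, terms = cooccurrence_counts(corpus)
# Estimate P(t1, t2) as the fraction of sentences containing both terms.
p = pairs[("anno", "wurtzburg")] / len(corpus)
```

In a real setting the window would be a sentence or a fixed span of tokens, and the probabilities would be smoothed; the sketch only shows that pair frequencies are well-defined and far from uniform.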
Stop Words. A term's context mostly comprises stop words. Stop words are terms that mainly fulfill a linguistic purpose and carry little information themselves (such as prepositions, conjunctions or articles). Due to the stop words' frequent occurrence, their orthographic consistency is much higher than that of Named Entities like places or people's names.
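This frequency-based nature of stop words can be made visible with a simple ranking: the most frequent terms of a corpus are typically stop words. The corpus below is hypothetical, with invented Early-Modern-German-style tokens:

```python
from collections import Counter

def stopword_candidates(sentences, k=3):
    """Rank terms by raw frequency; the top of the list is dominated by stop words."""
    counts = Counter(t for tokens in sentences for t in tokens)
    return [term for term, _ in counts.most_common(k)]

# Hypothetical toy corpus.
corpus = [
    ["in", "der", "statt", "wurtzburg"],
    ["in", "dem", "jar", "in", "der", "statt"],
    ["der", "bischof", "von", "wurtzburg"],
]
candidates = stopword_candidates(corpus)
```

Because stop words recur so often, even a tiny sample ranks them above content words such as place names.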
'Event-driven' Tagging. As already pointed out, stop words carry little information apart from their linguistic function. For our task of extracting a source's semantics, they can obviously be neglected. But even within the group of non-stop-words, only a small subset is relevant to our interests: in wiki systems like Wikipedia^1, usually only a small subset of Named Entities is cross-referenced, for example other people's names, a person's function, role or profession (like mayor), places or dates. In order to precisely recreate past events it is usually necessary to find out who did something, when he or she did it, and where. We therefore define an Event e as e = (a, d, p), with a ∈ A, where A is the set of all actors; d ∈ D, where D is the set of all dates; and p ∈ P, where P is the set of all places. We can now reduce the complexity of Named Entity Recognition on ancient sources by limiting the relevant classes to events or integral parts of events.
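The event definition above maps directly onto a small data structure. The following sketch illustrates it; the actor, date and place values are invented:

```python
from typing import NamedTuple

class Event(NamedTuple):
    """An event e = (a, d, p) with actor a ∈ A, date d ∈ D and place p ∈ P."""
    actor: str
    date: str
    place: str

# Hypothetical instance with invented field values.
e = Event(actor="bishop of Wurtzburg", date="1573", place="Wurtzburg")
```

Since an Event is just a named triple, it behaves like the tuple (a, d, p) from the definition while keeping the three classes explicit.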
2.1 Related Work
According to (Miller and Charles, 1991), the exchangeability of two terms within a given context correlates with their semantic similarity: the more easily two terms can be exchanged within the contexts in which they occur, the more likely they are to share a similar meaning. A statistical analysis of two terms' context composition can therefore indicate their degree of semantic similarity. Many approaches utilize the information contained within a term's context: (Gauch et al., 1999) propose an automatic query expansion approach based on information from term co-occurrence data. (Billhardt et al., 2002) analyze term co-occurrence data to estimate relationships and dependencies between terms. (Schütze, 1992) uses contextual information to create Context Vectors in a high-dimensional vector space to resolve polysemy.
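A minimal illustration of the context-vector idea (a simplified sketch, not Schütze's actual high-dimensional construction; the corpus is hypothetical): two terms whose surrounding words coincide receive a high cosine similarity, hinting that they may be variants of the same entity:

```python
import math
from collections import Counter

def context_vector(term, sentences, window=2):
    """Collect counts of the terms occurring within +/-window positions of `term`."""
    vec = Counter()
    for tokens in sentences:
        for i, t in enumerate(tokens):
            if t == term:
                neighbours = tokens[max(0, i - window):i + window + 1]
                vec.update(w for w in neighbours if w != term)
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical corpus: two spelling variants appearing in identical contexts.
corpus = [
    ["in", "der", "statt", "wurtzburg"],
    ["in", "der", "statt", "herbipolis"],
]
sim = cosine(context_vector("wurtzburg", corpus),
             context_vector("herbipolis", corpus))
```

Here 'wurtzburg' and 'herbipolis' share their entire contexts, so the similarity is maximal; in real corpora the vectors are much sparser and the similarity only partial.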
Existing Knowledge. Of course, any a priori knowledge aggregated in databases for domains like toponyms (names derived from a place or region), professions or male names and their spelling variants is not discarded, but used as the entry point for our contextual approach, as it reliably shows us instances of the sought-after class. We can then analyze their contextual properties to find additional, so far unknown instances. Furthermore, we can inspect the a priori data for useful patterns to create domain-specific heuristics (e.g. the typical n-gram distribution of a given class, typical pre-/suffixes, or capitalization).
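As a rough sketch of such a heuristic, the character n-gram profile of known spelling variants (here, a few entries from Table 1) can score whether an unseen candidate shares a class-typical suffix; the scoring function is an invented simplification, not the system's actual heuristic:

```python
from collections import Counter

def ngram_profile(terms, n=3):
    """Character n-gram frequencies over known instances of a class."""
    grams = Counter()
    for term in terms:
        padded = f"^{term.lower()}$"  # mark word boundaries
        grams.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

def suffix_score(candidate, profile, n=3):
    """How often the candidate's final character n-gram occurs in the profile."""
    return profile[f"{candidate.lower()[-(n - 1):]}$"]

# A priori known spelling variants, taken from Table 1.
known = ["Wirtzburg", "Wurtzburg", "wurtzburg", "Wirzburg"]
profile = ngram_profile(known)
# A variant held out from the profile shares the class-typical '-rg' ending.
score = suffix_score("würtzberg", profile)
```

A high score means the candidate ends like the known instances of the class, which is exactly the kind of suffix regularity the a priori data exposes.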
As a term's context can apparently provide information about its semantics, the following sections introduce approaches that attempt to classify a term as an integral part of an event by evaluating its contextual information.
1 http://en.wikipedia.org
KDIR 2012 - International Conference on Knowledge Discovery and Information Retrieval