a sentence might describe an artist’s life events (e.g. “during his childhood”, “while
on her trip to Italy”, “at the death of his father”), a geographical reference in a work
(e.g. “Lake Como”), or a style (e.g. “impressionism”). A set of categories was initially
proposed through textual analysis of art survey texts and piloted by asking a
range of users to label sample text. Using this labeling, initial machine learning
experiments have been conducted to extract features that permit categorization of
sentences. These features can then be used in disambiguation to select between
different senses of a term according to its category.
The next phase, Linguistic Analysis, consists of several subprocesses. After
sentence segmentation, a part-of-speech (POS) tagger labels (i.e. tags) the function of
each word in a text, e.g., noun, verb, or preposition. Complete noun phrases can
then be identified by the NP chunker based on tag patterns. For example, a determiner,
followed by any number of adjectives, followed by any number of nouns, is
one such pattern that identifies a noun phrase, as in “the impressive still life drawing”.
The tagger used for CLiMB, the Stanford tagger,³ provides sentential analysis
of syntactic constructions, e.g., verb phrases and relative clauses. The output of
Linguistic Analysis consists of XML-tagged words that now carry part-of-speech tags
and syntactic parse labels. Lucene is used to create an efficient index
for these tagged words.⁴
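
The determiner-adjective-noun pattern described above can be illustrated with a minimal sketch. It is not the CLiMB chunker: the function name `chunk_noun_phrases`, the Penn Treebank-style tag names, and the hand-assigned tags in the usage note are assumptions made for illustration, and the sketch requires at least one noun so that a bare determiner is not reported as a phrase.

```python
# Assumed tag groups (Penn Treebank-style); the text does not name a tag set.
DETERMINERS = {"DT"}
ADJECTIVES = {"JJ", "JJR", "JJS"}
NOUNS = {"NN", "NNS", "NNP", "NNPS"}


def chunk_noun_phrases(tagged_words):
    """Return phrases matching: determiner, zero or more adjectives, one or more nouns.

    `tagged_words` is a list of (word, tag) pairs for a single sentence.
    """
    phrases = []
    i = 0
    n = len(tagged_words)
    while i < n:
        if tagged_words[i][1] in DETERMINERS:
            # Skip over any adjectives that follow the determiner.
            j = i + 1
            while j < n and tagged_words[j][1] in ADJECTIVES:
                j += 1
            # Consume the run of nouns that follows.
            k = j
            while k < n and tagged_words[k][1] in NOUNS:
                k += 1
            if k > j:  # assumption: require at least one noun
                phrases.append(" ".join(word for word, _ in tagged_words[i:k]))
                i = k
                continue
        i += 1
    return phrases
```

Given a sentence tagged by hand as `[("the", "DT"), ("impressive", "JJ"), ("still", "JJ"), ("life", "NN"), ("drawing", "NN"), ("hangs", "VBZ"), ("in", "IN"), ("the", "DT"), ("gallery", "NN")]`, the function returns the two phrases “the impressive still life drawing” and “the gallery”. A production chunker would typically cover many more patterns than this single one.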
At this point, the noun phrases stored in the index are input to the disambiguation
algorithm, which then enables sense mapping, so that the proper descriptor can
be selected from a controlled vocabulary. Words and phrases often have multiple
meanings that correspond to different descriptors in a controlled vocabulary, but
only one may be relevant in context. The ability to select one sense from many is
referred to as lexical disambiguation. We map to the appropriate descriptor from the
Getty Art and Architecture Thesaurus (AAT), the Getty Union List of Artist Names
(ULAN), and the Getty Thesaurus of Geographic Names (TGN).⁵ The AAT is a
well-established and widely used multi-faceted thesaurus of terms for the cataloging
and indexing of art, architecture, artifactual, and archival materials. In the AAT, each
concept is described through a record that has a unique ID, a preferred name, a record
description, variant names, and other information that relates the record to other records.
In total, the AAT has 31,000 such records. Within the AAT, there are 1,400 homonyms,
i.e., records with the same preferred name. For example, the term “wings” has five senses
in the AAT (see Table 1 below).
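
To make the record structure and the notion of a homonym concrete, the following sketch models a record with the fields listed above and groups records by preferred name. The class and function names are hypothetical, and the sample records in the usage note are invented placeholders, not actual AAT entries or IDs.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical stand-in for a thesaurus record with the fields named in the text."""
    record_id: str
    preferred_name: str
    description: str
    variant_names: list = field(default_factory=list)


def index_by_preferred_name(records):
    """Group records by lower-cased preferred name."""
    index = defaultdict(list)
    for record in records:
        index[record.preferred_name.lower()].append(record)
    return index


def candidate_senses(index, term):
    """Return every record whose preferred name matches `term` (case-insensitive)."""
    return index.get(term.lower(), [])


def homonyms(index):
    """Return only the preferred names shared by more than one record."""
    return {name: recs for name, recs in index.items() if len(recs) > 1}
```

With placeholder data such as `Record("ex-001", "wings", "placeholder sense A")`, `Record("ex-002", "wings", "placeholder sense B")`, and `Record("ex-003", "vaults", "placeholder sense C")`, calling `candidate_senses(index, "Wings")` yields the two “wings” records, and `homonyms(index)` contains only the key `"wings"`. The disambiguation step would then have to choose among these candidates; that choice is not modeled here.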
Table 2 shows the breakdown of the AAT vocabulary by number of senses
with a sample lexical item for each frequency. As with most dictionaries and thesauri,
most items have two to three senses, and only a few have more.
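
A breakdown of this kind can be computed by counting how many records share each preferred name and then tallying terms by that count. The sketch below is only an illustration: the function name and the `min_senses` parameter are assumptions, and the default of 2 (counting only homonyms) is a guess at how such a table is built.

```python
from collections import Counter


def sense_breakdown(preferred_names, min_senses=2):
    """Map each sense count to the number of terms that have that many senses.

    `preferred_names` holds one entry per record; names are compared
    case-insensitively. Terms with fewer than `min_senses` senses are skipped
    (assumption: the breakdown covers homonyms only).
    """
    senses_per_term = Counter(name.lower() for name in preferred_names)
    return Counter(
        count for count in senses_per_term.values() if count >= min_senses
    )
```

For the invented list `["wings", "wings", "wings", "frames", "frames", "vaults"]`, the result records one term with three senses and one term with two; passing `min_senses=1` also counts the single-sense term.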
³ Both the tagger and parser are available at: http://nlp.stanford.edu/software.
⁴ Lucene is a search engine library: http://lucene.apache.org.
⁵ Getty resources can be accessed at: http://www.getty.edu/research/conducting_research/vocabularies/aat