Knowledge Resource Development for Identifying Matching Image
Descriptions
Alicia Sagae and Scott E. Fahlman
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, U.S.A.
Keywords:
Image Retrieval, Textual Similarity, Textual Inference.
Abstract:
Background knowledge resources contribute to the performance of many current systems for textual inference
tasks (QA, textual entailment, summarization, retrieval, and others). However, it can be difficult to assess how
additions to such a knowledge base will impact a system that relies on it. This paper describes the incremental,
task-driven development of an ontology that provides features to a system that retrieves images based on their
textual descriptions. We perform error analysis on a baseline system that uses lexical features only, then focus
ontology development on reducing these errors against a development set. The resulting ontology contributes
more to performance than domain-general resources like WordNet, even on a test set of previously unseen
examples.
1 INTRODUCTION
This paper describes experiments to retrieve images
based on matching their descriptive English labels.
As in ad-hoc document retrieval, a baseline sys-
tem using term vectors to represent these labels per-
forms reasonably well (>80% MRR, Mean Recip-
rocal Rank). However, error analysis reveals that
the most challenging examples for this task require
a richer feature space, allowing the system to cap-
ture more of the deep semantic similarities that hu-
mans seem to notice when they make comparisons
between images and their descriptions. As a result,
we present a solution that uses knowledge-based fea-
tures for identifying when two English descriptions
refer to the same image.
Object labels assigned by humans typically con-
sist of short multi-word phrases. These phrases
exhibit syntactic and semantic structure that is not
always modeled by information retrieval systems.
Nonetheless, humans rely on this structure, along
with background knowledge, when generating and in-
terpreting labels. These characteristics place our task
in the class of Applied Textual Inference (ATI) prob-
lems. ATI tasks depend on some level of text un-
derstanding and background knowledge, but they are
designed to abstract away from system-specific rep-
resentational choices. They include summarization,
question answering, and recognizing textual entail-
ment, among other problems. Image-identity is a rela-
tion that holds between two texts, A and B, when they
refer to the same image. In our current work, we fo-
cus on the ATI problem of recognizing image-identity
between a description that serves as a query to a re-
trieval system, and a description that labels a known
image in a collection.
2 RELATED WORK
This work draws on related research in image retrieval
and knowledge-based textual inference. Digital im-
ages are commonly associated with textual metadata,
including tags, titles, descriptions, or a textual context
such as a web page where an image has been embed-
ded. As a result, query topics and indexed images
can be represented by textual features (text-based im-
age retrieval) or by features derived from computer-
vision analysis of the image (content-based image re-
trieval). Other approaches explore a combination of
the two (multimedia or multimodal image retrieval).
Multimodal retrieval techniques have shown promise
in retrieving images in response to a textual query,
even when the images in the test set were not anno-
tated (Blei and Jordan, 2003). To achieve this, a joint
model of visual features and keywords was learned
from a training set before running the test queries.
Similar joint models have been used for generating
image descriptions (Farhadi et al., 2010) and for mea-
suring semantic relatedness between words and im-
ages (Leong and Mihalcea, 2011). However, these
models capture only the relationship between text in a
query and visual features in an indexed image. Anno-
tations on the indexed image are not leveraged, even
when available.
Other approaches to multimodal retrieval allow
the models to take advantage of text-only features in
addition to visual and joint textual-visual features. In
the Wikipedia Image Retrieval Task at ImageCLEF
2011 (Tsikrika et al., 2011), the best-performing sys-
tem applied Late Semantic Combination to leverage
features from text and visual modalities (Csurka et al.,
2011). Under this combination strategy, text-only fea-
tures and visual-only features can be developed and
improved independently, and still contribute to better
combined multimodal performance.
As a result, text-only image retrieval features like
the ones explored in this work can be applied on their
own, as we show here, and may also be combined in a
multimodal system. In addition, although multimodal
retrieval represents a growing research area of inter-
est, visual features of a query may not be available
in a typical real-world image retrieval scenario. The
organizers of the ImageCLEF 2011 Wikipedia Im-
age Search task, while encouraging participants to de-
velop multimodal systems, acknowledge that “a text-only query . . . is likely to fit most users searching digital libraries or the Web” (Tsikrika et al., 2011). In
addition, text-only features have been shown to out-
perform multimodal features for some tasks like au-
tomatic image tagging (Leong et al., 2010).
Constraining ourselves to text-based representa-
tions of images, we find that the central problem of
image retrieval is to compare two texts and deter-
mine whether the image-identity relation holds be-
tween them. This problem is an instance of Applied
Textual Inference (ATI). Taking the PASCAL RTE
Challenge as an example, we can see that deep seman-
tic representations and knowledge-based techniques
play an important role in state-of-the-art ATI systems.
Of 16 research teams participating in the first chal-
lenge in 2005 (http://pascallin.ecs.soton.ac.uk/Challenges/RTE), 7 used features from WordNet, 3 ap-
plied some kind of world knowledge, and 7 applied
logical inference engines (9 systems out of 16 used at
least one of the three). In 2007 the number of partici-
pants and the variety of techniques expanded, with the
vast majority of these systems relying on some com-
bination of WordNet, syntactic matching/alignment,
and machine learning algorithms; the most success-
ful system in that year applied all of these techniques
in addition to a logical inference engine (Hickl and
Bensley, 2007).
Our approach applies knowledge from general-
purpose semantic resources like WordNet and Dolce,
in addition to developing new custom resources for
our task and for our training set. This approach to on-
tology development is consistent with (Montazeri and Hobbs, 2011); however, we perform an additional step
of error analysis before beginning ontology develop-
ment. One contribution of our work is the method-
ology for using this error analysis to drive ontology
development, described in Section 6.2. In addition,
many applications of formal ontology use manually-
constructed rules to operate over ontology concepts
and produce an analytical result. A second contribu-
tion of our work is the architecture for gathering rel-
evant ontological features and combining them with
syntactic features in a learned reranking function, de-
scribed in Section 7.
3 PROCEDURE
In our experiments, we developed an ontology to help
identify the description-identity relation among texts.
To evaluate, we performed retrieval on a data set
where images are associated with multiple descrip-
tions. For each image, one description is held out as a
development query, another is held out as a test query,
and the remainder make up the “document” that rep-
resents the image in a collection.
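As a rough sketch (in Python), the per-image split looks like the following; which description is chosen as the development query and which as the test query is an arbitrary choice in this illustration:

    # Minimal sketch of the per-image split: one development query, one test
    # query, and the remaining descriptions concatenated into the indexed
    # "document". The selection order here is illustrative only.
    def split_descriptions(descriptions):
        dev_query, test_query = descriptions[0], descriptions[1]
        document = " ".join(descriptions[2:])
        return dev_query, test_query, document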
First, we constructed a baseline system that uses
only lexical features (Section 5.2) and performed er-
ror analysis on the baseline, in order to identify the
most promising conceptual space for ontology devel-
opment. Next, we used a set of training queries to
construct the ontology (Section 6.2) and to provide
ontological features to a perceptron-based classifier,
training it to decide image-identity. Finally, we ap-
plied the ontology and the classifier to a held-out set
of test queries in order to rerank the baseline retrieval
results (Section 7). Our results show an improve-
ment in performance as measured by Mean Recipro-
cal Rank (MRR).
4 DATA
We evaluate on the Phetch data set, collected by (von
Ahn and Dabbish, 2004). It was collected in the con-
text of an online game where multiple participants
compete in teams to identify an image based on one
teammate’s typed description. The exercise was re-
peated for each image in a large collection of JPEG
files harvested from the web. In this data set, a single
description is a short paragraph written by a single
KnowledgeResourceDevelopmentforIdentifyingMatchingImageDescriptions
101
annotator/participant about a single image. Each image in the collection has multiple descriptions. Each description is composed of one or more phrases, short segments that contribute to the overall description and are usually connected rhetorically to each other. An example is shown in Figure 1.

Figure 1: Image from the Phetch data set with descriptions. One description is used for training, another for testing, and remaining descriptions are concatenated to form the document representation in the retrieval index.
Annotator 1 [training query]: “stadium; pageant girls in the foreground”
Annotator 2 [testing query]: “olympic stadium; women in foreground; with sashes”
Annotator 3 [indexed document]: “people at sporting event; pageant contestants”
The total number of images represented in the cor-
pus is over 50,000. Of these, approximately 6,000
were labeled with 5 or more descriptions. Our experi-
ments were conducted on a subset of 700 images that
have 5 descriptions or more and that do not contain
text in the image itself. This subset is shown as the 5A-notext partition in Table 1.
The availability of relevance judgments and the
level of detail in textual annotations set the Phetch
corpus apart from other data sets that are commonly
used to evaluate image retrieval, including the Corel
image collection and the ImageCLEF evaluation sets.
The Corel data set is most appropriate for evaluating
systems that use a sample image as a query, rather
than using text. Keywords, but not descriptions, are
available for the images in Corel. Some difficulties
with this data set for standardized evaluation are dis-
cussed by (Müller et al., 2002).
The ImageCLEF evaluation sets have evolved
over the years; one set was derived from the IAPR
TC-12 Benchmark of 20,000 images (Grübinger et al., 2006). This set has been used in the ImageCLEF image retrieval evaluations since 2006 (http://ir.shef.ac.uk/imageclef/2006/).
This data includes image descriptions, but provides
fewer examples labeled with relevance judgments
than Phetch. In the Phetch data set, every image
is labeled with descriptions that can be used as a
query/document pair, where relevance judgments are
binary: if the document contains descriptions that
come from the same image as the description being
used as a query, relevance is 1. Otherwise, relevance
is 0. In our experiments, we test on all 700 images
from the 5A-notext subset of Phetch. ImageCLEF, in
contrast, contains a subset of 60 images that have been
labeled with relevance judgments.
The MIRFLICKR-1M data set (Huiskes and Lew,
2008) contains 1 million images with their Flickr tags,
published under a Creative Commons license. These
images do not include full-phrase descriptions of the
type associated with each image in the Phetch collec-
tion. However, they are annotated with content-based
visual descriptors. As a result, MIRFLICKR has been
used for image annotation and retrieval by visual ex-
ample, but is not sufficient for testing retrieval by
phrasal description.
A smaller data set with structure similar to Phetch
was collected in 2010, using Amazon’s Mechanical
Turk as a source for annotators (Rashtchian et al.,
2010). This data set includes 8000 images from
Flickr.com, annotated with multiple full-sentence de-
scriptions. Although it contains far fewer examples,
this set was developed with natural language process-
ing applications in mind. As a result, more attention was paid to annotation quality than in the Phetch data, and this data is likely to contain less noise from misspellings and other annotator errors.
5 BASELINE
5.1 Parameterization
In our document collection, each document is a set
of descriptions for one image. Since each of the 700
images in the 5A-notext subset has 5 descriptions
or more, we use one description from each image as
a development query. One additional description is
KEOD2013-InternationalConferenceonKnowledgeEngineeringandOntologyDevelopment
102
Table 1: Phetch corpus partitions. Partitions are created based on the number of descriptions associated with each image.
Section Descriptions per Image Total Images Total Words
1A 1 17,237 470,924
1B 1 17,237 470,924
2A 2 7,264 371,313
2B 2 7,265 371,879
3A 3 5,171 367,284
3B 3 5,171 367,999
4A 4 3,084 283,142
4B 4 3,084 282,486
5A 5 2,946 357,582
5A-notext subset 5 700 68,867
5B 5 2,946 355,853
held out as a test query. The remaining 3 descrip-
tions are taken together to be the document represent-
ing the image in our collection. To retrieve an image,
we compare the query to each document in the collec-
tion and calculate a relevance score. The image from
which the query description was taken is interpreted
as the only relevant document. The rest of the doc-
uments in the collection are taken to be non-relevant
for that query. This interpretation casts the retrieval
task as a way to identify the image-identity relation
among two texts (the query and an image document).
To establish a baseline for retrieval performance
on the Phetch data, we perform indexing and retrieval
with version 2.5 of the Indri search engine (Strohman
et al., 2005), a component of the Lemur Toolkit for
Language Modeling and Information Retrieval (http://www.lemurproject.org/). Indri implements a retrieval model that combines the
language modeling approach (Ponte and Croft, 1998),
which estimates word probabilities, with the infer-
ence network approach (Turtle and Croft, 1991) for
combining beliefs into a single document-level re-
trieval score. Indri supports a complex query syntax
that includes ordered and unordered windows, along
with field-specific language models. In our initial
experiments, the best-performing query formulation
was the #combine operator, which treats the terms
in the query and document as an unordered bag-of-
words, and calculates document relevance as a func-
tion only of the overlap in terms between the query
and a given image description document. This key-
word parameterization does not take word order or
other syntactic structure into account. Results for this
setting are given in Table 4 as Kwds+spell, or re-
trieval with keyword features that have undergone a
spell-checking pass. On the development queries, the
baseline achieves MRR 0.8295; on the test queries, it
achieves 0.8216.
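To make the keyword parameterization concrete, the following sketch shows how a spell-corrected description might be turned into an Indri #combine query; the tokenization and stopword list here are illustrative assumptions, not necessarily the exact preprocessing used in the experiments.

    # Illustrative sketch: building an Indri #combine query from a description.
    # The stopword list and tokenization are assumptions for this example only.
    STOPWORDS = {"a", "an", "the", "in", "on", "at", "with", "of", "and", "or"}

    def to_combine_query(description):
        terms = [t.strip(";,.").lower() for t in description.split()]
        terms = [t for t in terms if t and t not in STOPWORDS]
        return "#combine( " + " ".join(terms) + " )"

    print(to_combine_query("stadium; pageant girls in the foreground"))
    # -> #combine( stadium pageant girls foreground )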
5.2 Error Analysis
In the retrieval setting described above, the main cri-
terion for success is returning the single image of in-
terest at the lowest rank possible. Since each query
describes precisely one image, we seek a metric that
measures where in the results list that image appeared.
This metric is Mean Reciprocal Rank (MRR). MRR
is defined as the average, over all queries, of 1 divided
by the rank where the correct document was found.
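In symbols, for a query set Q in which the correct image for query i is found at rank r_i:

    MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{r_i}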
When the system assigns the relevant image a rank
of N > 1, at least two errors have occurred. First, the
system compared a non-relevant image description to
the query and determined that they describe the same
image, when in fact they do not. This error is a type
of false-positive judgement, which we refer to as a
precision error, because it dilutes the result list with a
non-relevant image at a high rank. Second, the sys-
tem compared the relevant image description to the
query and determined that they did not describe the
same image, at least not confidently enough to rank
the document first in the result list. This is a type of
false-negative judgement, which we refer to as a recall
error, since it implies that the system failed to recog-
nize the relevant image when it appeared. We have
developed a vocabulary of error classes that trigger
both precision and recall errors. The most frequent of
these classes are described in Table 2.
The vocabulary of error classes is motivated by the
hypothesis that the bag-of-words representation for
image descriptions leads to errors because it fails to
recognize certain types of textual similarity. Specifi-
cally, the bag-of-words model fails to capture seman-
tic similarities that are obscured by surface features
KnowledgeResourceDevelopmentforIdentifyingMatchingImageDescriptions
103
Table 2: Classes of error in the baseline retrieval system.
Error Class Context Description and Examples
Ontology Precision The wrong word meaning triggered a false match
“on the bank” 6= “at the bank”
Recall Failed to match words that would be similar Ontological classes
“shawl” “wrap”
Faulty Inference Precision Match in spite of conflicting relations among words
“green bandana” 6= “green shirt”, “man skating” 6= “girl skating”
Recall Failed to make a relevant inference
“lips are puckered” “getting ready to kiss”
Contradiction Precision Failed to recognize contradictions
“black background” 6=“blue background”
Recall Failed to match due to faulty contradiction
“black or blue background” “black background”
Missing Precision Failed to penalize for missing major elements
Elements “globe on a stand” 6= “globe”
Recall Over-penalized for minor missing elements
“guy smiling with glasses” “a guy smiling”
like word choice, and it fails to follow the inferen-
tial chains of reasoning that human annotators envi-
sion between their descriptions and the content of an
image. As a result, our error classes are composed
of specialized cases of semantic and inferential mis-
match that we expect to see in the errorful retrieval
runs.
We arrived at this vocabulary of error types in an
iterative fashion, based on observations in a sample
of 50 retrieval results from the Phetch 3A data set (a
sample that does not overlap with the 5A-notext sub-
set on which we test). In a first pass, we annotated
this development sample with free-text descriptions
of the evidence that a human might use to correct the
errors made by the baseline system. In a second pass,
these annotations were distilled by hand into a set of
14 phenomena that result in retrieval error. These
14 classes were used to re-annotate the sample. Af-
ter this pass, a final revision of the annotation classes
was made to focus on the most frequent and clearly-
defined classes. The resulting vocabulary contains 8
classes with precise definitions in the precision-error
and recall-error contexts. These classes are not mu-
tually exclusive; rather, a given retrieval error can be
annotated with all classifications that apply.
Table 3 shows the most frequently occurring error
types in the annotated sample from section 3A. For
each query we make two comparisons: we compare
the query description to the indexed description of the
same image in order to annotate the recall errors. We
also compare the query description to the indexed de-
scription that was retrieved at rank 1 for this query, in
order to annotate the precision errors. Although not
all of the errorful results returned by the baseline sys-
tem involve errors from these classes, most of them
do (90% of precision errors and 86% of recall errors).
Recall errors were annotated with 3.3 of these classes,
on average, and precision errors were annotated with
an average of 2.5 classes.
The most frequently-appearing class in the case of
recall errors was Ontology-related; that is, surface-
level mismatch between concepts that would be iden-
tical or closely linked in an ontological representa-
tion of background knowledge. The most frequently-
occurring classification of precision errors relates to
contradiction. In these cases, the baseline system
failed to recognize an explicit contradiction between
the query and the image description that was retrieved
at rank 1. A system with a model for recognizing con-
tradiction might be able to correct the baseline system
in nearly 60% of the cases where it currently makes
Precision errors.
Given this background knowledge about the na-
ture of retrieval errors in a baseline retrieval run, we
can identify some specific strategies for improving re-
trieval performance. We have established a set of er-
ror classes and textual features that contribute to re-
trieval error under the bag-of-words model. In the
next Section we will establish a more knowledge-rich
model for representing image descriptions and use
that model to implement handlers for the error classes
described here: Ontology-based matching and infer-
ence, handling of quantification over major content
elements, negation, analogy, and a model of media
types.
KEOD2013-InternationalConferenceonKnowledgeEngineeringandOntologyDevelopment
104
Table 3: Highest-frequency error classes.

Error Type         Freq in Recall Errors   Freq in Precision Errors
Ontology           36 (72%)                7 (14%)
Quantification     34 (68%)                26 (52%)
Faulty Inference   29 (58%)                20 (40%)
Missing Elements   23 (46%)                27 (54%)
Any                43 (86%)                45 (90%)
Total              165                     126
Average            3.3                     2.52
6 ONTOLOGY
6.1 Framework and Development
To address the errors in Section 5.2, we developed
an ontology using the freely available Scone Knowl-
edge Base System (Scone, http://www.cs.cmu.edu/~sef/scone/) as our ontology frame-
work. The Scone engine supports adding, searching,
and evaluating logical statements based on marker-
passing inference (Fahlman, 2006). To support the ex-
periments described in Section 7, we implemented ad-
ditional Common Lisp components that extend Scone
engine functions. These include new ontologies, in-
ference routines, and APIs that use Scone to anno-
tate text and measure the semantic distance between
concepts. In the remainder of this paper we use
“SconeImage” to refer to the extended software suite.
Knowledge bases in Scone use a frame-semantics
formalism to represent a network of concepts, or
Scone elements. Scone supports taxonomic relation-
ships, like “a flower is a plant”, as well as role-filling
relationships, like “a flower has scent”. New non-
taxonomic relations can be defined as well, with in-
stances of such relations being encoded as statements,
like “a bird flies”. Exceptions can be marked to han-
dle relations that apply to most, but not all, instances
of a class, as in “a penguin is a bird that does not fly”.
Scone also includes a lexical lookup function that al-
lows multiple strings to be attached to any element.
Given such a knowledge base as input, routines
defined in the Scone engine calculate the answer to
queries like “Is a rose a flower?”. Extensions in
SconeImage make higher-level calculations that de-
pend on these answers, like “what is the relationship
between bouquet and rose?”. In all cases, the answers
returned by these calculations depend on the knowl-
edge bases that are currently loaded.
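To illustrate the kinds of assertions and queries involved, the following toy Python analogue mimics taxonomic links, statements, and exceptions. It is purely illustrative and is not Scone syntax; Scone itself is a Common Lisp system that answers such queries with marker-passing inference.

    # Toy illustration only; NOT Scone syntax. Models "is-a" links, statements
    # such as "a bird flies", and exceptions such as "a penguin does not fly".
    class ToyKB:
        def __init__(self):
            self.parent = {}        # concept -> parent concept
            self.statements = {}    # concept -> set of predicates
            self.exceptions = {}    # concept -> set of cancelled predicates

        def add_isa(self, child, parent):
            self.parent[child] = parent

        def add_statement(self, concept, predicate):
            self.statements.setdefault(concept, set()).add(predicate)

        def add_exception(self, concept, predicate):
            self.exceptions.setdefault(concept, set()).add(predicate)

        def is_a(self, concept, ancestor):
            while concept is not None:
                if concept == ancestor:
                    return True
                concept = self.parent.get(concept)
            return False

        def holds(self, concept, predicate):
            # Walk up the taxonomy; an exception cancels an inherited statement.
            while concept is not None:
                if predicate in self.exceptions.get(concept, set()):
                    return False
                if predicate in self.statements.get(concept, set()):
                    return True
                concept = self.parent.get(concept)
            return False

    kb = ToyKB()
    kb.add_isa("flower", "plant"); kb.add_isa("rose", "flower")
    kb.add_isa("penguin", "bird")
    kb.add_statement("bird", "flies"); kb.add_exception("penguin", "flies")
    print(kb.is_a("rose", "flower"))     # True  ("Is a rose a flower?")
    print(kb.holds("penguin", "flies"))  # False (exception to "a bird flies")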
To build these knowledge bases (KBs), we used
a combination of automatic and manual processes.
We first attempted to leverage existing ontological
knowledge by importing concepts and relations from
WordNet (Fellbaum, 1998), to which domain-specific
knowledge can later be added. Using this knowledge
alone to annotate image descriptions with semantic
information (using the first nominal synset available
for each content word in the description), we achieved
an improvement over the baseline system, but one that was not statistically significant (p ≥ 0.4 using single-factor ANOVA). This result supports the hypothesis
that knowledge may help on this task; however, our
baseline error analysis indicated that some classes of
error can only be corrected by a system that applies
reasoning and inference over its knowledge base rep-
resentations. The ad-hoc network structure of Word-
Net was not intended to support this type of reasoning.
We hypothesized that a task-specific knowledge
base, structured with the challenges from Section 5.2
in mind, could improve performance even more. The
intuition that KB structure, particularly at the up-
per levels of the ontology, plays an important role
in its usability and effectiveness is supported by re-
lated work on textual inference and knowledge en-
gineering. (Fan et al., 2003), for example, describe
the effect of ablating layers of the KB in a system
for resolving noun-noun compounds. Their finding
was that concepts from the upper levels of the ontol-
ogy were critical to performance on that task, and that
they had a larger impact on performance than con-
cepts near the frontier. We leverage these findings in
our work by selecting an existing upper-level ontol-
ogy and re-connecting a subset of WordNet concepts
to that ontology, while also adding several concepts
specific to the task of image retrieval as well as con-
cepts specific to our training data. In the SconeIm-
age KB, we use the DOLCE upper ontology (Masolo
et al., 2003) for the top levels, and then apply a map-
ping from DOLCE to WordNet following (Gangemi
et al., 2003), with some task-specific modifications.
6.2 Acquiring Knowledge from
Training Data
After the new upper-level SconeImage KB has been
constructed, we perform a round of knowledge acqui-
sition based on the training queries from the Phetch
5A data set. The baseline retrieval run on this data set
resulted in 50 retrieval failures, cases where no rele-
vant image was found in the result list. These failures
were classified according to the procedure described
in Section 5.2. To expand the knowledge base, a de-
veloper examined each of the retrieval failures an-
notated as Ontology or Faulty Inference errors. The
training query and collection document were com-
KnowledgeResourceDevelopmentforIdentifyingMatchingImageDescriptions
105
pared, and terms were added to the ontology to com-
pensate for the error.
New elements are selected to correspond with an
appropriate term from WordNet, but their arrange-
ment in the ontology may differ significantly from
their placement in WordNet. This development strat-
egy is consistent with the analysis that WordNet
is most useful for associating strings with lexical-
semantic concepts, while the arrangement of concepts
into logical structures can be improved through con-
nection to an ontology like DOLCE and an inference
platform like Scone. For example, while the terms
man, boy, woman, and girl all appear in the WordNet
hierarchy, the structure connecting “man” and “boy”
is different from the structure connecting “woman”
and “girl”. This type of inconsistency means that
an inference rule of the form if the query mentions
a subtype of person, expand with sister terms would
correct one of these lexical mismatches but not the
other. We prefer an ontology structure that uses par-
allel structures for conceptually parallel relationships
among concepts.
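The following sketch illustrates the parallel-structure point: given a taxonomy in which man/boy and woman/girl hang off parallel intermediate concepts, one sister-term expansion rule covers both lexical mismatches. The structure shown is a hypothetical assumption for illustration, not the actual SconeImage KB.

    # Hypothetical parallel taxonomy; not the actual SconeImage KB contents.
    PARENT = {
        "man": "male-person", "boy": "male-person",
        "woman": "female-person", "girl": "female-person",
        "male-person": "person", "female-person": "person",
    }

    def is_subtype_of_person(term):
        while term is not None:
            if term == "person":
                return True
            term = PARENT.get(term)
        return False

    def sister_terms(term):
        parent = PARENT.get(term)
        return sorted(t for t, p in PARENT.items() if p == parent and t != term)

    def expand_query(terms):
        expanded = list(terms)
        for t in terms:
            if is_subtype_of_person(t):
                expanded.extend(sister_terms(t))   # e.g. "woman" -> add "girl"
        return expanded

    print(expand_query(["woman", "smiling"]))
    # -> ['woman', 'smiling', 'girl']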
7 EXPERIMENTAL RESULTS
In our error analysis exercise, we identified retrieval
failures in the training set that could be attributed to
lack of ontological knowledge in the baseline sys-
tem. We hypothesized that a simple annotation ap-
proach can reduce these errors. To test this hypoth-
esis, we implemented a SconeImage module in Lisp
that performs a single-pass search over the words in
an image description for elements in the SconeImage
ontologies that are triggered by each word. All ele-
ments are added without performing word sense dis-
ambiguation. The resulting list of element identifiers
is concatenated with the original description to form
the annotated query or collection description. Index-
ing and retrieval are performed using Indri under the
same model described in Section 5.1.
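A minimal sketch of this annotation pass is shown below; the lexicon is a hypothetical stand-in for the SconeImage lookup, and the element names are invented for illustration.

    # Sketch of the single-pass annotation: every element triggered by a word is
    # added (no word sense disambiguation), and the element identifiers are
    # concatenated with the original description before indexing.
    # LEXICON is a hypothetical stand-in for the SconeImage lexical lookup.
    LEXICON = {
        "shawl": ["{garment}", "{wrap}"],
        "wrap":  ["{garment}", "{wrap}", "{enclose-action}"],  # all senses kept
        "stadium": ["{sports-venue}"],
    }

    def annotate(description):
        elements = []
        for word in description.lower().split():
            elements.extend(LEXICON.get(word.strip(";,."), []))
        return description + " " + " ".join(elements)

    print(annotate("stadium; pageant girls in the foreground"))
    # -> "stadium; pageant girls in the foreground {sports-venue}"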
In the 3A development subset, which was used to
estimate the frequency of error types in Section 5.2,
the knowledge-augmented system reduces the onto-
logical errors by 25%, a difference that is statistically
significant with 95% confidence (p = 0.05; single-factor ANOVA). This
result supports our hypothesis that knowledge base
annotation could reduce ontology-related retrieval er-
rors. When we turn to the full development set 5A,
which has many more examples, we see a reduction
in error of 17%. This result is marginal statistically
but is still an encouraging finding in support of the hy-
pothesis that correcting ontology errors leads to better
overall performance. Improvement on the full test set
is similar, with an error reduction of 15% compared
with the baseline. Even with marginal statistical sig-
nificance of p = 0.1, this result shows some support
for the conclusion that these improvements will gen-
eralize well to unseen queries.
We further hypothesized that the class of errors re-
lated to inference can be reduced by using our knowl-
edge base in combination with syntactic information.
To test this hypothesis, we first annotated each query
and result description from the baseline run with a
dependency structure, storing the result in a seman-
tic graph. Such a graph links vertices in the syntac-
tic dependency tree (i.e. words from the description)
with concepts from the knowledge base. As a result,
we can calculate similarity features between the query
graph and the graph for any image in the candidate
list, taking both semantic and syntactic similarity into
account.
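As a rough illustration of such a feature, the sketch below computes the overlap between the concept sets attached to two graphs' vertices. The actual feature set, which also uses the dependency edges, is richer than this; the graph representation here is a simplifying assumption.

    # Simplified sketch: one semantic-graph similarity feature, the fraction of
    # query-graph concepts that also appear in the candidate graph. Each vertex
    # is modelled as (word, set_of_KB_concepts); edges are omitted for brevity.
    def concept_overlap(query_graph, candidate_graph):
        q_concepts = set().union(*(c for _, c in query_graph)) if query_graph else set()
        d_concepts = set().union(*(c for _, c in candidate_graph)) if candidate_graph else set()
        if not q_concepts:
            return 0.0
        return len(q_concepts & d_concepts) / len(q_concepts)

    q = [("woman", {"{female-person}", "{person}"}), ("stadium", {"{sports-venue}"})]
    d = [("girls", {"{female-person}", "{person}"}), ("stadium", {"{sports-venue}"})]
    print(concept_overlap(q, d))   # 1.0: all query concepts are covered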
To combine these features into a relevance score
for a document, given a query, we define a combina-
tion function for the features and perform both man-
ual and learned weighting of features in this func-
tion. The manually-set weighting allows our intuition
about the relative importance of knowledge-base con-
cepts to play a role. However, the best performance is
achieved when we make this hand-tuned score (gph-
Sim score in Table 4) into another input feature for the
learned weighting function. We train an off-the-shelf
perceptron classifier based on (Collins, 2002) for this
purpose. The classifier distinguishes graph pairs that
describe the same image from graph pairs that do not.
To train the classifier, we generate similarity features
with SconeImage for every query-candidate pair in
the baseline retrieval output. When more than one
description is available for an indexed image in the
candidate list, we generate the features for every such
description independently.
For every query in the training set, at most one
candidate contains descriptions of the same image
as the query. These descriptions are positive train-
ing examples. The remaining descriptions are neg-
ative training examples. The classifier learns a set of weights λ_1, ..., λ_N for a linear combination over all of the features f_1, ..., f_N that we calculate over graph pairs:

    learnedGraphSim(g, g') = \sum_{n=1}^{|F|} \lambda_n \times f_n(g, g')        (1)

where F is the set of all similarity features, and λ_n is the weight of feature f_n.
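A compact sketch of how such a reranker might be trained and applied is shown below, assuming feature vectors have already been extracted for each graph pair. The update rule is a plain perceptron in the spirit of (Collins, 2002); averaging and other details of the actual training configuration are omitted.

    # Sketch: a simple perceptron learns weights for Equation (1), then scores
    # graph pairs with the learned linear combination. Feature extraction and
    # the exact training configuration are simplified assumptions here.
    def perceptron_train(examples, n_features, epochs=10):
        # examples: list of (feature_vector, label); label +1 = same image, -1 = different
        w = [0.0] * n_features
        for _ in range(epochs):
            for f, y in examples:
                score = sum(wi * fi for wi, fi in zip(w, f))
                if y * score <= 0:                      # misclassified: update
                    w = [wi + y * fi for wi, fi in zip(w, f)]
        return w

    def learned_graph_sim(w, f):
        # Equation (1): weighted sum of graph-pair similarity features.
        return sum(wi * fi for wi, fi in zip(w, f))

    weights = perceptron_train([([1.0, 0.2], +1), ([0.1, 0.9], -1)], n_features=2)
    print(learned_graph_sim(weights, [0.8, 0.3]))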
At test time, we use a fresh set of queries that
were never seen by the classifier during training (the
KEOD2013-InternationalConferenceonKnowledgeEngineeringandOntologyDevelopment
106
Table 4: Results of retrieval using combinations of syntactic and semantic features for graph-based reranking. ANOVA
significance is shown for improvement over the keywords baseline.
Ranking Features MRR (Test) MRR (Train)
Kwds+
spell
Synt-
graph
KB
annot.
Sem-
graph
gphSim
score
0.8295 0.8216
0.8404 0.8383
0.8508 0.8494
0.8535 0.8559
0.8531 0.8584
0.8528 0.8543
0.8567 (p = 0.05) 0.8575 (p = 0.02)
Kwds=keywords, spell=spell-correction, Synt-graph=syntactic graph features for reranking, KB-
annot=query expansion with KB concepts (before reranking), Sem-graph=semantic graph features
for reranking, gphSim score=value of the graphSim hand-tuned distance function
test query descriptions from Phetch Section 5A). The
test queries are run through the baseline retrieval sys-
tem. The results from this run are sent to SconeIm-
age for graph-feature extraction and scoring using the
hand-tuned similarity function. The output of the test
run is one verdict and similarity measurement for ev-
ery description of every candidate in the result list.
Our aim was to address inference-related errors by
including dependency information in our parameteri-
zation of the retrieval problem. We ran the learned-
GraphSim reranker on the annotated 3A subset in or-
der to observe the effect on annotated inference er-
rors. In comparison with the baseline, inference er-
rors were reduced by 24%. This represents a large
absolute improvement on inference errors as a result
of adding dependency information. Because the num-
ber of such examples is small (≈ 50), this number is
only marginally significant (p = 0.07); however, re-
sults on the full test set (5A-notext-subset) confirm
that this reduction contributes to better performance
overall, underscoring the importance of these gains.
Table 4 gives results from runs of the retrieval sys-
tem using a variety of features, including the base-
line (Kwds + spell), concepts from the task-specific
knowledge base (KB annot), and graph structures de-
scribed above (Sem-graph, gphSim score).
8 CONCLUSIONS AND FUTURE
WORK
Background knowledge resources contribute to the
performance of many current systems for textual in-
ference tasks (QA, textual entailment, summarization,
retrieval, and others). However, it can be difficult to
assess how additions to such a knowledge base will
impact a system that relies on it. This paper describes
the incremental, task-driven development of an ontol-
ogy that provides features to a system that retrieves
images based on their textual descriptions. We per-
form error analysis on a baseline system that uses
lexical features only, then focus ontology develop-
ment on reducing these errors against a development
set. The resulting ontology contributes more to per-
formance than domain-general resources like Word-
Net, even on a test set of previously unseen examples.
This paper describes the development of a knowl-
edge base as a principled response to errors found in
a baseline image retrieval system. These errors were
found to be triggered by shortcomings in the bag-of-
words representation of image descriptions. We ap-
ply the knowledge base for simple term annotation
and for learning measures of distance between graphs
constructed in the semantic space of the KB. Our ex-
periments support the hypotheses that textual infer-
ence techniques can lead to improved retrieval perfor-
mance, in particular on the most interesting types of
images: images whose descriptions can only be in-
terpreted by applying ontological and inferential knowledge. In addition, these im-
provements can complement the strengths of a strong
bag-of-words baseline to achieve better overall per-
formance on all image types.
The current paper addresses two of the error types
identified in Section 5.2. It would be an interesting
extension of this work to test specific techniques that
could reduce the remaining error types. For example,
co-reference resolution might be beneficial in reduc-
ing errors associated with quantification mismatch, in
particular as it relates to the number of people in an
image description. Techniques for contradiction de-
tection have been developed and tested for other tex-
tual inference problems, including recognizing tex-
KnowledgeResourceDevelopmentforIdentifyingMatchingImageDescriptions
107
tual entailment and question answering. These tech-
niques could also apply to retrieval errors caused by
real or perceived contradictions between a query and
an index description.
REFERENCES
Blei, D. and Jordan, M. (2003). Modeling annotated data.
In the 26th annual International ACM SIGIR Con-
ference on Research and Development in Information
Retrieval, pages 127–134. ACM Press.
Collins, M. (2002). Discriminative training methods for
hidden markov models: theory and experiments with
perceptron algorithms. In EMNLP ’02: Proceedings
of the ACL-02 conference on Empirical methods in
natural language processing, pages 1–8, Morristown,
NJ, USA. Association for Computational Linguistics.
Csurka, G., Clinchant, S., and Popescu, A. (2011). XRCE
and CEA LIST’s Participation at Wikipedia Retrieval
of ImageCLEF 2011. In Petras, V., Forner, P., and
Clough, P., editors, Working Notes of CLEF 2011,
Amsterdam, The Netherlands.
Fahlman, S. E. (2006). Marker-passing inference in the
scone knowledge-base system. In the First Annual
International Conference on Knowledge Science, En-
gineering, and Management (KSEM 2006), Guilin,
China.
Fan, J., Barker, K., and Porter, B. (2003). The knowledge
required to interpret noun compounds. Technical Re-
port UT-AI-TR-03-301, University of Texas at Austin.
Farhadi, A., Hejrati, M., Sadeghi, M. A., Young, P.,
Rashtchian, C., Hockenmaier, J., and Forsyth, D.
(2010). Every picture tells a story: generating sen-
tences from images. In Proceedings of ECCV 2010,
Greece.
Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. Bradford Books.
Gangemi, A., Guarino, N., Masolo, C., and Oltramari, A. (2003). Sweetening WordNet with DOLCE. AI Magazine, 24(3):13–24.
Grübinger, M., Clough, P., Müller, H., and Deselaers, T. (2006). The IAPR TC-12 benchmark: A new evaluation resource for visual information systems. In International Workshop OntoImage’2006 Language Resources for Content-Based Image Retrieval, held in conjunction with LREC’06, pages 13–23, Genoa, Italy.
Hickl, A. and Bensley, J. (2007). A discourse commitment-
based framework for recognizing textual entailment.
In ACL 2007 Workshop on Textual Entailment and
Paraphrasing, Prague. ACL.
Huiskes, M. J. and Lew, M. S. (2008). The mir flickr re-
trieval evaluation. In MIR ’08: Proceedings of the
2008 ACM International Conference on Multimedia
Information Retrieval, New York, NY, USA. ACM.
Leong, C. W. and Mihalcea, R. (2011). Measuring the
semantic relatedness between words and images. In
IWCS ’11 Proceedings of the Ninth International Con-
ference on Computational Semantics, pages 185–194.
Association for Computational Linguistics.
Leong, C. W., Mihalcea, R., and Hassan, S. (2010). Text
mining for automatic image tagging. In Coling 2010:
Posters, pages 647–655, Beijing, China. Coling 2010
Organizing Committee.
Masolo, C., Borgo, S., Gangemi, A., Guarino, N., Oltra-
mari, A., and Schneider, L. (2003). The wonder-
web library of foundational ontologies. Technical Re-
port WonderWeb Deliverable D17, National Research
Council, Institute of Cognitive Sciences and Technol-
ogy (ISTC-CNR).
Montazeri, N. and Hobbs, J. R. (2011). Elaborating a
knowledge base for deep lexical semantics. In Bos, J.
and Pulman, S., editors, Proceedings of the Ninth In-
ternational Conference on Computational Semantics
(IWCS 2011), pages 195–204.
Müller, H., Marchand-Maillet, S., and Pun, T. (2002). The truth about Corel - evaluation in image retrieval. In
Lew, M. S., Sebe, N., and Eakins, J. P., editors, Lec-
ture Notes In Computer Science, volume 2383, pages
38–49. Springer-Verlag, London.
Ponte, J. M. and Croft, W. B. (1998). A language model-
ing approach to information retrieval. In SIGIR 1998,
pages 275–281.
Rashtchian, C., Young, P., Hodosh, M., and Hockenmaier, J.
(2010). Collecting image annotations using Amazon’s Mechanical Turk. In NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk. Association for Computational Lin-
guistics.
Strohman, T., Metzler, D., Turtle, H., and Croft, W. B.
(2005). Indri: A language model-based search engine
for complex queries. In International Conference on
Intelligence Analysis (ICIA) (poster), McLean, VA.
Tsikrika, T., Popescu, A., and Kludas, J. (2011). Overview
of the wikipedia image retrieval task at imageclef
2011. In Working Notes for the CLEF 2011 Labs and
Workshop, Amsterdam, The Netherlands.
Turtle, H. and Croft, W. B. (1991). Evaluation of an infer-
ence network based retrieval model. Trans. Inf. Syst.,
9(3):187–222.
von Ahn, L. and Dabbish, L. (2004). Labeling images with a
computer game. In Proceedings of the SIGCHI confer-
ence on Human factors in computing systems, pages
319–326. ACM Press, Vienna, Austria.
KEOD2013-InternationalConferenceonKnowledgeEngineeringandOntologyDevelopment
108