Combining Text Semantics and Image Geometry to Improve Scene Interpretation

Dennis Medved¹, Fangyuan Jiang², Peter Exner¹, Magnus Oskarsson², Pierre Nugues¹ and Kalle Åström²

¹ Department of Computer Science, Lund University, Lund, Sweden
² Department of Mathematics, Lund University, Lund, Sweden
Keywords:
Semantic Parsing, Relation Extraction from Images, Machine Learning.
Abstract:
In this paper, we describe a novel system that identifies relations between the objects extracted from an image.
We started from the idea that in addition to the geometric and visual properties of the image objects, we could
exploit lexical and semantic information from the text accompanying the image. As an experimental setup, we
gathered a corpus of images from Wikipedia as well as their associated articles. We extracted two types of
objects, human beings and horses, and we considered three relations that could hold between them: Ride,
Lead, or None. We used geometric features as a baseline to identify the relations between the entities and we
describe the improvements brought by the addition of bag-of-word features and predicate–argument structures
we derived from the text. The best semantic model resulted in a relative error reduction of more than 18%
over the baseline.
1 INTRODUCTION
A large percentage of queries to retrieve images relate to people and objects (Markkula and Sormunen,
2000; Westman and Oittinen, 2006) as well as relations between them: the ‘story’ within the image
(Jörgensen, 1998). Although the automatic recognition, detection, and segmentation of objects in images
have reached remarkable levels of accuracy, reflected by the Pascal VOC Challenge evaluation (Carreira
and Sminchisescu, 2010; Felzenszwalb et al., 2010; Ladicky et al., 2010), the identification of relations
remains largely unexplored territory. Notable exceptions include Chen et al. (2012) and Myeong
et al. (2012). The identification of these relations, though, would enable users to search more accurately
for images illustrating two or more objects.
Relations between objects within images are of-
ten ambiguous and captions are intended to help us in
their interpretation. As human beings, we often have
to read the caption or the surrounding text to under-
stand what happened and the nature of the relations
between the entities. This combined use of text and
images has been explored in automatic interpretation
mostly in the form of bags of words, see Sect. 2. This
approach might be inadequate, however, as bags of
words do not take the word or sentence context into
account. This model inadequacy formed the starting
idea of this project: As we focused on relations in im-
ages, we tried to model their counterparts in the text
and reflect them not only with bags of words but also
in the form of predicate–argument structures.
2 RELATED WORK
To the best of our knowledge, no work has been done
to identify relations in images using a combined anal-
ysis of image and text data. There is, however, related work:
Paek et al. (1999) combined image segmentation with a text-based classifier using image captions as
input. They used bags of words and applied a TF·IDF weighting on the text. The goal was to label the
images as either taken indoors or outdoors. They improved the results by using both text and image
information together, compared to using only one of the classifiers.
Deschacht and Moens (2007) used a set of 100
image-text pairs from Yahoo! News and automatically
annotated the images utilizing the associated text. The
goal was to detect the presence of specific humans,
but also more general objects. They analyzed the im-
age captions to find named entities. They also derived
information from discourse segmentation, which was
used to determine the saliency of entities.
Moscato et al. (2009) used a large corpus of French news articles, composed of text, images, and
image captions. They combined an image detector that recognizes human faces and logos with named
entity detection in the text. The goal was to correctly annotate the faces and logos found in the images. The
images were not annotated by humans; instead, named entities in the captions were used as the ground truth,
and the classification was based on the articles.
Marszalek and Schmid (2007) used a semantic
network and image labels to integrate prior knowl-
edge of inter-class relationships in the learning step
of a classifier to achieve better classification results.
All of these works combined text and image analysis
for classification purposes, but they did not identify
relations in the images. Another area of related work
is the generation of natural language descriptions of
an image scene, see Gupta et al. (2012) and Kulkarni
et al. (2011).
3 DATA SET AND
EXPERIMENTAL SETUP
The internet provides plenty of combined sources of
images and text including news articles, blogs, and
social media. Wikipedia is one such source that, in addition to a large number of articles, is a substantial
repository of images illustrating the articles. As of
today, the English version has over 4 million articles
and about 2 million images (Wikipedia, 2012). It is
not unusual for editors to use an image for more than
one article, and an image can therefore have more
than one article or caption associated with it.
We gathered a subset of images and articles from
Wikipedia restricted to two object categories: Horse
and Human. We extracted the articles containing the
keywords Horse or Pony and we selected their asso-
ciated images. This resulted in 901 images, of which 788 could be used. Some images were duplicates and
some did not have a valid article associated with them.
An image connected to an article containing the words Horse or Pony does not necessarily contain
a real horse. It can depict something associated with the words, for example a car, a statue, or a
painting. Some of the images also include humans, either interacting with the horse or just being part of
the background; see Figure 1 for examples. An image can therefore contain zero or more horses and zero
or more humans.
We manually annotated the horses and humans
in the images with a set of possible relations: Ride,
Lead, and None. Ride and Lead hold when a human is riding or leading a horse; None covers any action
that is neither Ride nor Lead, including no action at all. The anno-
tation gave us the number of respective humans and
horses, their sizes and their locations in the image.
We processed the articles with a semantic parser
(Exner and Nugues, 2012), where the output for
each word is its lemma and part of speech, and for
each sentence, the dependency graph and predicate-
argument structures it contains. We finally applied a
coreference solver to each article.
4 VISUAL PARSING
As our focus was to investigate to what extent combining text and visual cues could improve
the interpretation or categorization precision, we set
aside the automatic detection of objects in the im-
ages. We manually identified the objects within the
images by creating bounding boxes around horses and
humans. We then labeled the interaction of each human-horse pair whose interaction corresponded to
Lead or Ride. The None relationships were left implicit. This resulted in 2,235 possible human-horse
pairs in the images, but the distribution of relations was heavily skewed towards the None relation. The
Lead relation had significantly fewer examples; see
Table 1.
Table 1: The number of different objects in the source material.

Item                  Count
Extracted images        901
Usable images           788
Human-horse pairs     2,235
Relation: None        1,935
Relation: Ride          233
Relation: Lead           67
The bounding boxes could have been generated automatically by an object detection algorithm trained
on the relevant categories (in our case, people and horses), such as the deformable part-based model
described in Felzenszwalb et al. (2010). This would have enabled us to skip the manual detection step,
but as our focus in this paper lies elsewhere, we opted not to do so.
5 SEMANTIC PARSING
We used the Athena parsing framework (Exner and
Nugues, 2012) in conjunction with a coreference
ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods
480
Figure 1: The upper row shows: A Ford Mustang, the 3rd Light Horse Regiment hat badge, and a snuff bottle. The lower
row shows: A human riding a horse, one human leading the horse and one bystander, and seven riders and two bystanders.
Bounding boxes are displayed.
solver (Stamborg et al., 2012) to parse the Wikipedia
articles. For each word, the parser outputs its lemma
and part of speech (POS). In addition, the parser pro-
duces a dependency graph with labeled edges for each
sentence as well as the predicates it contains and their
arguments. For each article, we also identify the words or phrases that refer to the same entity, i.e. words
or phrases that are coreferent.
Figure 2 shows the dependency graph and the predicate–argument structure of the caption: Ponies
walking the streets in Burley.¹
Figure 2: A representation of a parsed sentence: The upper
part shows the syntactic dependency graph and the lower
part shows the predicate, walk, and its two arguments the
parser has extracted: Ponies and the streets in Burley.
5.1 Predicates
The semantic parser uses the PropBank (Palmer et al.,
2005) nomenclature, where the predicate sense is ex-
plicitly shown as a number added after the word. The
sentence in Figure 2 contains one predicate: walk.01, with its two arguments A0 and A1, where A0
corresponds to the walker and A1, the path walked.

¹ http://en.wikipedia.org/wiki/New_Forest, retrieved November 9, 2012.
PropBank predicates can also have modifying arguments denoted with the prefix “AM-”. There exist
14 different types of modifiers in PropBank, such as:
AM-DIR: shows motion along some path,
AM-LOC: indicates where the action took place,
and
AM-TMP: shows when the action took place.
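
As an aside, the sketch below shows one possible in-memory representation of such a predicate–argument structure, instantiated for the caption in Figure 2. The class and field names are our own illustration only and do not reflect the actual output format of the Athena parser.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Predicate:
    """A PropBank-style predicate with its sense number and labeled arguments."""
    lemma: str                                           # e.g. "walk"
    sense: str                                           # e.g. "01", giving the frame walk.01
    args: Dict[str, str] = field(default_factory=dict)   # argument label -> argument text

    @property
    def frame(self) -> str:
        return f"{self.lemma}.{self.sense}"

# The predicate extracted from "Ponies walking the streets in Burley".
walk = Predicate(
    lemma="walk",
    sense="01",
    args={"A0": "Ponies",                  # the walker
          "A1": "the streets in Burley"},  # the path walked
)

print(walk.frame, walk.args)   # walk.01 {'A0': 'Ponies', 'A1': 'the streets in Burley'}
```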
5.2 Coreferences
We applied coreference resolution to create sets of coreferring mentions, as with the rider and the two
occurrences of he in this caption:
If the rider has a refusal at the direct route he may jump the other B element without additional
penalty than what he incurred for the refusal.²
The phrase the rider is the first mention of an entity in the coreference chain. It usually carries the most
information in the chain. Using it together with POS information, we substitute coreferent words with
this mention throughout a document, although this is mostly useful with pronouns. The modified documents
can thereafter be used with the different lexical features.
² http://en.wikipedia.org/wiki/Eventing, retrieved November 9, 2012.
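
A minimal sketch of this substitution step is given below, assuming the chains are provided as lists of token spans with the first (most informative) mention first. This input format is our simplification for illustration, not the solver's actual output.

```python
from typing import List, Tuple

# A mention is a (start, end) token span, end exclusive; a chain lists its mentions,
# with the first mention (e.g. "the rider") assumed to be the most informative one.
Chain = List[Tuple[int, int]]

def substitute_coreferents(tokens: List[str], chains: List[Chain]) -> List[str]:
    """Replace every later mention in each chain with the tokens of its first mention."""
    replacements = {}  # start index -> (end index, replacement tokens)
    for chain in chains:
        first_start, first_end = chain[0]
        first_tokens = tokens[first_start:first_end]
        for start, end in chain[1:]:
            replacements[start] = (end, first_tokens)

    output, i = [], 0
    while i < len(tokens):
        if i in replacements:
            end, repl = replacements[i]
            output.extend(repl)   # substitute the later mention
            i = end
        else:
            output.append(tokens[i])
            i += 1
    return output

tokens = "If the rider has a refusal he may jump the other element".split()
chains = [[(1, 3), (6, 7)]]   # the chain {the rider, he}
print(" ".join(substitute_coreferents(tokens, chains)))
# If the rider has a refusal the rider may jump the other element
```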
CombiningTextSemanticsandImageGeometrytoImproveSceneInterpretation
481
6 FEATURE EXTRACTION
We used classifiers with visual and semantic features
to identify the relations. The visual features formed a
baseline system. We then added semantic features to investigate their improvement over the baseline.
6.1 Visual Features
The visual parsing annotation provided us with a set
of objects within the images and their bounding boxes
defined by the coordinates of the center of each box,
its width, and height.
To implement the baseline, we derived a larger set
of visual features from the bounding boxes, such as
the overlapping area, the relative positions, etc., and
combinations of them. We ran an automatic gener-
ation of feature combinations and we applied a fea-
ture selection process to derive our visual feature set.
We evaluated the results using cross-validation. How-
ever, as the number of possible combinations was very large, we had to manually discard a large part of them.
Once stabilized, the baseline feature set remained unchanged while we developed and tested the lexical features.
It contains the following features:
F_Overlap: Boolean feature describing whether the two bounding boxes overlap or not.
F_Distance: numerical feature containing the normalized length between the centers of the bounding boxes.
F_Direction(8): nominal feature containing the direction of the human relative to the horse, discretized into eight directions.
F_Angle: numerical feature containing the angle between the centers of the boxes.
F_OverlapArea: numerical feature containing the size of the overlapping area of the boxes.
F_MinDistanceSide: numerical feature containing the minimum distance between the sides of the boxes.
F_AreaDifference: numerical feature containing the quotient of the areas.
We used logistic regression and, to cope with nonlinearities, we used pairs of features to emulate a
quadratic function. The three following features are pairs involving a numerical and a Boolean feature,
creating a numerical feature. The Boolean feature is used as a step function: if it is false, the output is a
constant; if it is true, the output is the value of the numeric feature.
F_Distance + F_LowAngle(7): numerical feature; F_LowAngle is true if the difference in angle is less than 7°.
F_Angle + F_LowAngle(7): numerical feature.
F_Angle + F_BelowDistance(100): numerical feature; F_BelowDistance(100) is true if the distance is less than 100.
Without these feature pairs, the classifier could not correctly identify the Lead relation and the F1 value
for it was 0. With these features, F1 increased to 0.29.
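
A minimal sketch of how such features can be computed from two bounding boxes is given below. The (center, width, height) box representation follows the annotation described above, but the exact normalization and the full feature set used in the paper are not reproduced, so this is only an approximation of the baseline.

```python
import math
from dataclasses import dataclass

@dataclass
class Box:
    cx: float   # center x
    cy: float   # center y
    w: float    # width
    h: float    # height

def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two axis-aligned boxes (F_OverlapArea)."""
    dx = min(a.cx + a.w / 2, b.cx + b.w / 2) - max(a.cx - a.w / 2, b.cx - b.w / 2)
    dy = min(a.cy + a.h / 2, b.cy + b.h / 2) - max(a.cy - a.h / 2, b.cy - b.h / 2)
    return max(dx, 0.0) * max(dy, 0.0)

def geometric_features(human: Box, horse: Box) -> dict:
    """Compute a few of the baseline features for one human-horse pair."""
    dx, dy = human.cx - horse.cx, human.cy - horse.cy
    distance = math.hypot(dx, dy)
    angle = math.degrees(math.atan2(dy, dx))        # F_Angle
    direction = int(((angle + 360) % 360) // 45)    # F_Direction(8): one of eight sectors
    area = overlap_area(human, horse)
    feats = {
        "F_Overlap": area > 0,
        "F_Distance": distance,
        "F_Direction": direction,
        "F_Angle": angle,
        "F_OverlapArea": area,
        "F_AreaDifference": (human.w * human.h) / (horse.w * horse.h),
    }
    # Feature pair emulating a quadratic term: the Boolean acts as a step function,
    # returning a constant (here 0.0) when false and the numeric value when true.
    feats["F_Angle+F_BelowDistance(100)"] = angle if distance < 100 else 0.0
    return feats

print(geometric_features(Box(120, 80, 40, 100), Box(130, 120, 120, 90)))
```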
Table 2 shows the recall, precision, and F1 for the three relations using visual features. Table 3 shows
the corresponding confusion matrix.
Table 2: Precision, recall and F1 for visual features.

       Precision   Recall    F1
None   0.9472      0.9648    0.9559
Ride   0.7685      0.7553    0.7619
Lead   0.4285      0.2239    0.2941
Mean                         0.6706
Table 3: The confusion matrix for visual features.

                       Predicted class
                     None   Ride   Lead
Actual class   None  1867     49     19
               Ride    56    176      1
               Lead    48      4     15
6.2 Semantic Features
We extracted the semantic features from the
Wikipedia articles. We implemented a selector to choose the scope of the input among: complete articles,
partial articles (the paragraph closest to an image), captions, and file names. The most spe-
cific information pertaining to an image is found in
the caption and the file name, followed by the partial
article, and finally, the whole article.
6.2.1 Bag-of-Words Features
A bag-of-words (BoW) feature was created for each of the four different inputs. A BoW feature is represented
by a vector of weighted word frequencies. The different versions have separate settings and dictionaries.
We also used a combined bag-of-words feature vector consisting of the concatenation of the partial
article, caption, and filename feature vectors.
The features have a filter that can exclude words that are either too common or too rare,
ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods
482
based on their frequency, controlled by a threshold.
We used a TF·IDF weighting on the included words.
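
One way to build such frequency-filtered TF·IDF vectors is sketched below with scikit-learn; the thresholds and the example captions are hypothetical, and the paper's actual dictionary settings are not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One document per image: here the caption text (other selectors would pass
# the whole article, the partial article, or the tokenized file name instead).
captions = [
    "Ponies walking the streets in Burley",
    "A rider leading a horse through the paddock",
    "A Ford Mustang at a car show",
]

# min_df / max_df implement the "too rare / too common" frequency filter;
# the thresholds below are illustrative, not the ones used in the paper.
vectorizer = TfidfVectorizer(lowercase=True, min_df=1, max_df=0.9)
X = vectorizer.fit_transform(captions)   # sparse matrix: documents x vocabulary

print(X.shape)
print(vectorizer.get_feature_names_out()[:10])
```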
We used file names as one of the inputs, as it is common for images in Wikipedia to have long
descriptive names. However, they are not as standardized as the captions. Some images have very long
descriptive titles; others are less informative, for example: “DMZ1.jpg”. The file names were not semantically
parsed, but we defined a heuristic algorithm to break the file name strings into individual words,
as sketched below.
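
One possible version of such a heuristic is shown here; it strips the extension and splits on underscores, hyphens, digits, and case changes. The exact rules used in the paper are not documented, so this is only an illustration.

```python
import os
import re

def filename_to_words(filename: str) -> list:
    """Break an image file name into individual lowercase words (heuristic)."""
    stem, _ = os.path.splitext(filename)
    # Insert a space at lowercase-to-uppercase boundaries (e.g. "PonyRace" -> "Pony Race").
    stem = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", stem)
    # Split on anything that is not a letter (underscores, hyphens, digits, ...).
    return [w.lower() for w in re.split(r"[^A-Za-z]+", stem) if w]

print(filename_to_words("New_Forest_pony_leading.jpg"))   # ['new', 'forest', 'pony', 'leading']
print(filename_to_words("DMZ1.jpg"))                      # ['dmz']
```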
6.2.2 Predicate Features
Instead of using all of the words in a document, we used information derived from the predicate–argument
structure to retain only the more relevant terms. We created a feature that only used the predicate
names and their arguments as input. Words that are neither predicates nor arguments of predicates are
removed from the input to the feature. The arguments can be filtered depending on their type, for example
A0, A1, or AM-TMP. We can either consider all of the words of the arguments, or only their heads.
As for the BoW features, we created predicate features with articles, partial articles, and captions as
input. We never used the file names, because we could not carry out a semantic analysis on them. We
also created a version of the predicate-based features that can be filtered further on the basis of a list of
predicate names, keeping only predicates present in a predefined list specified by regular expressions.
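
A rough sketch of this filtering is shown below. The predicate representation follows the simple structure illustrated in Section 5.1, and the predicate list in the second call is the ride/lead/pull/race list mentioned in the discussion; the function itself is our own illustration, not the actual implementation.

```python
import re

def predicate_feature_words(predicates, keep_args=("A0", "A1"), name_patterns=None):
    """Keep only predicate names and selected argument words as input to a lexical feature.

    Each predicate is a dict such as {"frame": "walk.01", "args": {"A0": "Ponies", ...}}.
    If name_patterns is given, only predicates whose lemma matches one of the regular
    expressions are kept (e.g. ["ride", "lead", "pull", "race"]).
    """
    words = []
    for pred in predicates:
        lemma = pred["frame"].split(".")[0]
        if name_patterns and not any(re.fullmatch(p, lemma) for p in name_patterns):
            continue
        words.append(lemma)
        for label, text in pred["args"].items():
            if label in keep_args:
                words.extend(text.lower().split())   # all the words of the argument
    return words

preds = [{"frame": "walk.01", "args": {"A0": "Ponies", "A1": "the streets in Burley"}},
         {"frame": "ride.01", "args": {"A0": "the rider", "A1": "a horse"}}]
print(predicate_feature_words(preds))
print(predicate_feature_words(preds, name_patterns=["ride", "lead", "pull", "race"]))
```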
7 CLASSIFICATION
To classify the relations, we used the LIBLINEAR
(Fan et al., 2008) package and the output probabil-
ities over all the classes. The easiest way to clas-
sify a horse-human pair is to take the corresponding
probability vector and pick the class with the highest
probability. But sometimes the probabilities are close to each other and there is no clear class to choose. We selected
a threshold using cross-validation. If the maximum
probability in the vector is not higher than the thresh-
old, the pair is classified as None. We observed that
because None represents a collection of actions and
nonaction, it is more likely to be the true class when
Ride and Lead have low probabilities.
Even with the threshold, this scheme can classify
two or more humans as riding or leading the same
horse. Although possible, it is more likely that only
one person is riding or leading the horse at a time.
Therefore we added constraints to the classification: a
horse can only have zero or one rider, and zero or one
leader. For each class, only the most probable human is chosen, and only if its probability is higher than the threshold.
For each human-horse pair, the predicted class is compared to the actual class. The information derived
from this comparison can be used to calculate the precision, recall, and F1 for each class. The arithmetic
mean of the three F1 values is calculated and can be used as a comparison value. We also computed the
number of correct classifications and a confusion matrix.
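
The decision rule with the None threshold and the one-rider/one-leader constraints can be summarized as in the sketch below; the probability matrix would come from the logistic regression model, and the threshold value of 0.5 is only a placeholder for the one actually selected by cross-validation.

```python
import numpy as np

CLASSES = ["None", "Ride", "Lead"]

def classify_pairs_for_horse(probs: np.ndarray, threshold: float = 0.5) -> list:
    """Assign relations to all human-horse pairs that share one horse.

    probs has one row per candidate human and one column per class in CLASSES.
    A horse gets at most one rider and at most one leader: for Ride and Lead,
    only the most probable human is considered, and only if its probability
    exceeds the threshold. Everything else defaults to None.
    """
    labels = ["None"] * len(probs)
    for class_idx, name in enumerate(CLASSES):
        if name == "None":
            continue
        best_human = int(np.argmax(probs[:, class_idx]))
        if probs[best_human, class_idx] > threshold and labels[best_human] == "None":
            labels[best_human] = name
    return labels

# Three humans detected around one horse; the 0.5 threshold is a placeholder.
probs = np.array([[0.10, 0.85, 0.05],    # clearly riding
                  [0.40, 0.35, 0.25],    # no class above the threshold -> None
                  [0.30, 0.10, 0.60]])   # leading
print(classify_pairs_for_horse(probs))   # ['Ride', 'None', 'Lead']
```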
8 SYSTEM ARCHITECTURE
Figure 3 summarizes the architecture of the whole
system:
1. Wikipedia is the source of the images and the ar-
ticles. The text annotation uses the Wiki markup
language.
2. Image analysis: placement of bounding boxes,
classification of objects and actions. This was
done manually, but could be replaced by an au-
tomatic system.
3. Text selection among: whole articles, the paragraphs closest to the images, file names, or captions.
4. Semantic parsing of the text, see Section 5.
5. Extraction of feature vectors based on the bound-
ing boxes and the semantic information.
6. Model training using logistic regression from the
LIBLINEAR package. This enables us to predict
probabilities for the different relations.
7. Relation classification using probabilities and
constraints.
9 RESULTS
We used the L2-regularized logistic regression (pri-
mal) solver from the LIBLINEAR package and we
evaluated the results of the classification with the dif-
ferent feature sets, starting from the baseline geometric features and adding lexical features of increasing
complexity. We carried out a 5-fold cross-validation.
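
For readers who want to reproduce the setup, the sketch below approximates it with scikit-learn, whose liblinear solver wraps the same LIBLINEAR library. The feature matrix, labels, and regularization strength are placeholders (only the class proportions follow Table 1); this is not the code used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

# Placeholder data: rows are human-horse pairs, columns are geometric + lexical features.
rng = np.random.default_rng(0)
X = rng.normal(size=(2235, 50))
y = rng.choice(["None", "Ride", "Lead"], size=2235, p=[0.866, 0.104, 0.030])

# L2-regularized logistic regression, primal formulation, via the liblinear solver.
model = LogisticRegression(solver="liblinear", penalty="l2", dual=False, C=1.0)

# 5-fold cross-validated predictions, then the arithmetic mean of the per-class F1 values.
pred = cross_val_predict(model, X, y, cv=5)
print(f1_score(y, pred, average="macro", labels=["None", "Ride", "Lead"]))
```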
We evaluated permutations of features and set-
tings and we report the set of combined BoW features
that yielded the best result. Table 4 shows an overview
of the results:
The baseline corresponds to the geometrical features; we obtained a mean F1 of 0.67 with them;
CombiningTextSemanticsandImageGeometrytoImproveSceneInterpretation
483
Table 4: An overview of the results, with their mean F1-value, difference and relative error reduction from the baseline mean F1-value.

                                 Mean of F1   Difference (pp)   Relative error reduction (%)
Baseline                         0.6706       0.00              0.00
BoW        Articles              0.6779       0.73              2.22
           Partial articles      0.6818       1.12              3.40
           Captions              0.6829       1.23              3.73
           Filenames             0.6802       0.96              2.91
           Combination           0.7132       4.26              12.9
Predicate  Articles              0.7318       6.12              18.6
           Partial articles      0.6933       2.27              6.89
           Captions              0.6791       0.85              2.58
           Articles + Words      0.6830       1.24              3.76
           Articles + Coref      0.7280       5.74              17.4
Figure 3: An overview of the system design, see Section 8 for description.
BoW corresponds to the baseline features and the bag-of-words features described in Sect. 6.2.1;
whatever the type of text we used as input, we observed an improvement. We obtained the best
results with a concatenation of the partial article, caption, and filename (combination, F1 = 0.71);

Predicate corresponds to the baseline features and the predicate feature vector described in
Sect. 6.2.2. Predicate features using only one lexical feature vector from the article text gave better
results than combining different portions of the text (F1 = 0.73).
Our best feature set is the predicate features using whole articles as input. It achieves a relative error
reduction of 18.6 percent compared to the baseline.
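
For reference, the relative error reduction figures in Table 4 are consistent with the usual definition, taking one minus the mean F1 as the error:

\[
\mathrm{RER} = \frac{F_1^{\mathrm{system}} - F_1^{\mathrm{baseline}}}{1 - F_1^{\mathrm{baseline}}}
             = \frac{0.7318 - 0.6706}{1 - 0.6706} \approx 0.186
\]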
Tables 2 and 3 show the detailed results of the
baseline with the geometric features only. Tables 5
and 6 show the results of the best BoW feature com-
bination: a concatenation of the feature vectors from
the inputs: partial articles, captions, and filenames.
Tables 7 and 8 show the result of the best predicate
features.
10 DISCUSSION
Classifying the Lead relation with geometric features, with only bounding boxes as the input, proved quite
difficult. There is indeed very little visual difference between standing next to a horse and leading it. We
were not able to classify any Lead relation correctly until we added the combination features.
For single BoW features, the captions gave the best result, followed by partial articles, filenames, and
lastly articles. This ordering was what we expected, given how specific the information in each input is
to the images. But for the predicate features, the order was reversed: articles produced the best result,
followed by partial articles and captions.
Using a specific list of predicates did not produce good results, although, depending on the list, the
results varied greatly. Using a list with the words ride, lead, pull, and race, with articles as input, gave
the best result, but Table 4 shows a drop of 4.88 percentage points compared to no filtering. These
negative results could possibly be explained by the fact that it is not common to explicitly describe the
relations in the images, so only using keywords such as ride is of little help.
ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods
484
Applying coreference resolution to the documents lowered the results. Table 4 shows a drop of 0.38
percentage points when it is applied to the predicate feature based on articles. Despite these negative
results, we still believe that solving coreferences could improve the results. The solver was designed to be
used with another set of semantic information. To be able to use the solver, we altered its source code and
possibly made it less accurate. We manually checked coreference chains and observed a significant number
of faulty examples, leading us to believe that the output quality of the solver left something to be desired.
Table 5: Precision, recall, and F1 for the concatenation of BoW features with the inputs: partial articles, captions and filenames.

       Precision   Recall    F1
None   0.9638      0.9638    0.9638
Ride   0.7642      0.8626    0.8104
Lead   0.5135      0.2835    0.3653
Mean                         0.7132
Table 6: The confusion matrix for the concatenation of BoW features with the inputs: partial articles, captions and filenames.

                       Predicted class
                     None   Ride   Lead
Actual class   None  1865     57     13
               Ride    27    201      5
               Lead    43      5     19
Table 7: Precision, recall and F1 for the predicate feature on articles.

       Precision   Recall    F1
None   0.9745      0.9498    0.9620
Ride   0.7301      0.9055    0.8084
Lead   0.4500      0.4029    0.4251
Mean                         0.7318
Table 8: The confusion matrix for the predicate feature on articles.

                       Predicted class
                     None   Ride   Lead
Actual class   None  1838     70     27
               Ride    16    211      6
               Lead    32      8     27
11 CONCLUSIONS AND FUTURE
WORK
We designed a supervised classifier to identify rela-
tions between pairs of objects in an image. As input to
the classifier, we used geometric, bag-of-words, and
semantic features. The results we obtained show that semantic information, in combination with geometric
features, is useful for improving the classification of relations in the images. Table 4 shows a relative
error reduction of 12.9 percent when utilizing a combination of bag-of-words features. An even greater
improvement is obtained using predicate information, with a relative error reduction of 18.6 percent compared
to the baseline.
Coreference resolution lowered the performance,
but the interface between the semantic parser and the
coreference solver was less than optimal. There is room for improvement regarding this step, either
in the interface to the semantic parser or by switching to another solver. It could also be interesting to try other
types of classifiers, not just logistic regression, and
see how they perform.
Using automatically annotated images as input to
the program could be relatively easily implemented
and would automate all the steps in the system. A nat-
ural continuation of the work is to expand the number
of objects and relations. Felzenszwalb et al. (2010),
for example, use 20 different classifiers for common
objects: cars, bottles, birds, etc. All of these, or a subset of them, could be chosen as the objects, together with some
common predicates between the objects as the rela-
tions.
It would also be interesting to try other sources of
images and text than Wikipedia: either using other re-
sources available online or creating a new database
with images captioned with text descriptions. An-
other interesting expansion of the work would be to
map entities found in the text to objects found in the
image. For example, if a caption includes the name of
a person, one could create a link between the image
and information about the entity.
ACKNOWLEDGEMENTS
This research was supported by Vetenskapsrådet, the Swedish research council, under grant 621-2010-4800
and the Swedish e-science program: eSSENCE.
CombiningTextSemanticsandImageGeometrytoImproveSceneInterpretation
485
REFERENCES
Carreira, J. and Sminchisescu, C. (2010). Constrained Para-
metric Min-Cuts for Automatic Object Segmentation. In
IEEE International Conference on Computer Vision and
Pattern Recognition.
Chen, N., Zhou, Q.-Y., and Prasanna, V. (2012). Under-
standing web images by object relation network. In Pro-
ceedings of the 21st international conference on World
Wide Web, WWW ’12, pages 291–300, New York, NY,
USA. ACM.
Deschacht, K. and Moens, M.-F. (2007). Text analysis
for automatic image annotation. In Proceedings of the
45th Annual Meeting of the Association of Computa-
tional Linguistics, pages 1000–1007, Prague.
Exner, P. and Nugues, P. (2012). Constructing large propo-
sition databases. In Proceedings of the Eighth Interna-
tional Conference on Language Resources and Evalua-
tion (LREC’12), Istanbul.
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and
Lin, C.-J. (2008). LIBLINEAR: A library for large linear
classification. Journal of Machine Learning Research,
9:1871–1874.
Felzenszwalb, P. F., Girshick, R. B., McAllester, D., and
Ramanan, D. (2010). Object detection with discrimina-
tively trained part based models. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 32(9):1627–
1645.
Gupta, A., Verma, Y., and Jawahar, C. (2012). Choosing
linguistics over vision to describe images. In Proc. of the
twenty-sixth AAAI conference on artificial intelligence.
Jörgensen, C. (1998). Attributes of images in describing tasks. Information Processing and Management, 34(2-3):161–174.
Kulkarni, G., Premraj, V., Dhar, S., Siming, L., Choi, Y.,
Berg, A., and Berg, T. (2011). Baby talk: Understand-
ing and generating image descriptions. In Proc. Conf.
Computer Vision and Pattern Recognition.
Ladicky, L., Russell, C., Kohli, P., and Torr, P. H. S. (2010).
Graph cut based inference with co-occurrence statistics.
In Proceedings of the 11th European conference on Com-
puter vision: Part V, ECCV’10, pages 239–253, Berlin,
Heidelberg. Springer-Verlag.
Markkula, M. and Sormunen, E. (2000). End-user search-
ing challenges indexing practices in the digital newspa-
per photo archive. Information retrieval, 1(4):259–285.
Marszalek, M. and Schmid, C. (2007). Semantic hierarchies
for visual object recognition. In Proc. Conf. Computer
Vision and Pattern Recognition.
Moscato, V., Picariello, A., Persia, F., and Penta, A. (2009).
A system for automatic image categorization. In Seman-
tic Computing, 2009. ICSC’09. IEEE International Con-
ference on, pages 624–629. IEEE.
Myeong, H., Chang, J. Y., and Lee, K. M. (2012). Learning
object relationships via graph-based context model. In
CVPR, pages 2727–2734.
Paek, S., Sable, C., Hatzivassiloglou, V., Jaimes, A., Schiff-
man, B., Chang, S., and Mckeown, K. (1999). Integra-
tion of visual and text-based approaches for the content
labeling and classification of photographs. In ACM SI-
GIR, volume 99.
Palmer, M., Gildea, D., and Kingsbury, P. (2005). The
Proposition Bank: an annotated corpus of semantic roles.
Computational Linguistics, 31(1):71–105.
Stamborg, M., Medved, D., Exner, P., and Nugues, P.
(2012). Using syntactic dependencies to solve corefer-
ences. In Joint Conference on EMNLP and CoNLL -
Shared Task, pages 64–70, Jeju Island, Korea. Associ-
ation for Computational Linguistics.
Westman, S. and Oittinen, P. (2006). Image retrieval by end-
users and intermediaries in a journalistic work context. In
Proceedings of the 1st international conference on Infor-
mation interaction in context, pages 102–110. ACM.
Wikipedia (2012). Wikipedia statistics English.
http://stats.wikimedia.org/EN/TablesWikipediaEN.htm.
ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods
486