TagNet: Using Soft Semantics to Search the Web
Geert Vanderhulst and Lieven Trappeniers
Alcatel-Lucent Bell Labs,
Copernicuslaan 50, 2018 Antwerpen, Belgium
Keywords:
Tagging, Ambiguity, Soft Semantics.
Abstract:
Semantic annotations are key to efficiently retrieve resources on the Web. On the one hand, ontologies under-
lying Web resources give rise to linked data. On the other hand, tagging has become increasingly popular to
bring order in data across Web applications and social networks. While taxonomies and folksonomies serve
the same purpose (i.e. classification), there is a large gap in semantics between uncontrolled keywords used for
tagging and hierarchical concepts found in a taxonomy. In this paper we introduce sematags as light-weight
‘soft semantics’ which aim to bridge the gap between ‘no semantics’ and ‘hard semantics’. Sematags define
aliases (synonyms) and isas (hypernyms) which overcome the typical issues conventional tags cope with such
as ambiguity. Furthermore, we present TagNet, a framework that extracts sematags from lexicons and existing
knowledge bases and exploits them to annotate, link and retrieve resources on the Web. We evaluate how soft
semantics can be used to semi-automatically map tagged photos in Flickr on DBpedia concepts and vice versa.
1 INTRODUCTION
The basic premise of the Semantic Web is to repre-
sent knowledge in a meaningful way so that comput-
ers can function more effectively by being able to dis-
tinguish different meanings of data. This is achieved
by describing data using languages with a logical en-
tailment such as OWL and RDF. We refer to this ap-
proach using the term hard semantics since linked
data is typically mapped on ontological resources by
domain experts or agents. Besides, we witness an
emerge of folksonomies to create order in a rapidly
expanding Web of data (Vander Wal, 2007; Specia
and Motta, 2007). The increasing popularity of tags
as a flat space of keywords is both visible on websites
such as Youtube, Flickr, Del.icio.us and across social
networks (e.g. hashtags on Twitter). However, there
are still a number of limitations of the current state
of technology as identified in (Garcia-Castro et al.,
2009): (i) tag ambiguity, (ii) missing links between
multiple synonyms, spelling variants, or morpholog-
ical variants, and (iii) variation in the level of granu-
larity and specificity of the tags used caused by differ-
ences in the domain expertise of agents. These issues
are due to the fact that tags typically have no seman-
tics associated.
In this paper we present TagNet, a framework
that eases the task of annotating and searching for
resources on the Web using soft semantics. Rather
than defining hard links to ontological concepts, we
add additional detail to a tag to remove ambiguity
and facilitate automatic derivation of links to exist-
ing knowledge bases. In TagNet, tags (i.e. plain text
keywords) are annotated in two dimensions: each tag
(i.e. sematag) defines aliases and isas as illustrated in
figure 1. Aliases are keywords that can be used as a
synonym for a given tag (e.g. synonyms, acronyms,
etc) and isas are keywords that generalize the mean-
ing of the tag (e.g. sport is an isa for tennis). The
combination of aliases and isas helps us to understand
and express the meaning of a tag using additional
tags, similar to the approach taken in (Garcia-Castro
et al., 2009). We distinguish between aliases and isas
because it matches well with the detail of informa-
tion contained in dictionaries (e.g. WordNet (Fell-
baum, 1998)) and ontologies (e.g. the DBpedia (Auer
et al., 2008) ontology) which are suitable sources to
extract tags from as outlined in section 2. For in-
stance, the WordNet lexicon expresses linguistic rela-
tions between words such as synonyms (aliases) and
hypernyms (isas). Also in ontologies, there is a no-
tion of similar concepts (aliases) – expressed in OWL
using constructs such as owl:sameAs (instance level)
and owl:equivalentClass (class level) and ‘isa’
relations contained in an ontology which directly map
on the isas of a tag in TagNet. TagNet advances the
state of the art by exploiting soft semantics both in
the annotation process of arbitrary resources on the
179
Vanderhulst G. and Trappeniers L..
TagNet: Using Soft Semantics to Search the Web.
DOI: 10.5220/0004325001790188
In Proceedings of the 9th International Conference on Web Information Systems and Technologies (WEBIST-2013), pages 179-188
ISBN: 978-989-8565-54-9
Copyright
c
2013 SCITEPRESS (Science and Technology Publications, Lda.)
Figure 1: A sematag consists of a name, aliases and isas
which we label as soft semantics. Its purpose is to link
tag-based systems (no semantics) with semantic knowledge
bases (hard semantics).
Web and during search operations. To solve ambigu-
ity for end-users, we explain the different senses of
a keyword using the isas and aliases of tags match-
ing the keyword (section 3) which is also useful to
refine search queries (section 4). A unique advantage
of sematags over hard URIs is their scalability to tag-
based systems combined with the ability to map them
on linked data, as illustrated in section 5 by means of
a case study.
2 TAGGING VOCABULARIES
To stimulate reuse and sharing of existing sematags
and relieve users of specifying aliases and isas manu-
ally, TagNet relies on vocabularies from which tags
are extracted and suggested to users. Basically, a
tag (i.e. entered keyword) is only valid if it can be
found in a vocabulary compatible with TagNet. How-
ever, since this constraint would prohibit free tagging,
we relax it by requiring that the name of the tag can
be freely chosen, as long as it is annotated with at
least one isa that appears in a controlled vocabulary.
Hence, a user can annotate a picture of a pet using the
sematag mickey (the name of the pet), provided that
it is enriched with e.g. a known dog isa tag.
We consider WordNet and DBpedia as main, con-
trolled vocabularies for TagNet thanks to their wide
coverage of contemporary terms. A lexicon such as
WordNet contains a rich set of words that are part of
the English language, but it still lacks several con-
cepts such as names of places, people, companies,
television shows, etc. In addition to language-specific
words, we also want to include commonly accepted
terms that are introduced by a community of users.
The DBpedia vocabulary extracts tags out of con-
tent that was published on Wikipedia. Examples
of tag names that are supported by this vocabulary
are google, san francisco, madonna, etc. Whilst
Wikipedia covers a large set of generally accepted
terms, it still excludes concepts that only matter to
specific users such as names of family members or
pets and highly specialized terms related to a partic-
ular domain (e.g. medicines). To share such special-
ized tags, custom user- or domain-specific vocabular-
ies can be integrated in TagNet as discussed in sec-
tion 6. Figure 2 illustrates the main task of a vocabu-
Figure 2: A vocabulary takes a keyword as input and out-
puts one or more sematags.
lary: taking a keyword as input, a vocabulary outputs
one or more tags that attribute a meaning to the key-
word using isas and aliases. Note that sematags do
not contain direct references to concepts defined in
the vocabulary’s underlying knowledge base. For ex-
ample, a tag extracted from the WordNet lexicon does
not store a reference to a WordNet synset, nor does
a DBpedia tag contain a link to a Wikipedia page.
We opted for such a loose coupling because it allows
us to describe and interpret all tags equally and in-
dependent of the semantics underlying a vocabulary.
However, decoupling tags and vocabularies also intro-
duces a new level of ambiguity between similar tags
extracted from different vocabularies. For instance,
the term dog is found both in WordNet and the DBpe-
dia vocabulary with slightly different semantics. To
overcome this ambiguity, we add a label to each tag
that identifies the vocabulary from where the tag orig-
inates. In the next sections we outline how tags are
extracted from WordNet and DBpedia.
2.1 WordNet Vocabulary
In WordNet, words are organized in synsets: sets of
words with a similar meaning (i.e. synonyms). For
each synset, hypernyms can be looked up (e.g. animal
is a hypernym of dog). Hence, for each word in a
synset, a sematag can be composed as follows:
1. the name of the tag is the name of the word;
2. the aliases of the sematag correspond to the names
of all other words in the synset;
3. the isas of the sematag relate to the names of all
direct hypernyms of the synset including hyper-
nyms of hypernyms.
A keyword that is looked up in WordNet can appear
in multiple synsets if it has several meanings and thus
gives rise to multiple sematags, one for each sense.
To improve the results, we filter out generic hyper-
nyms such as living thing, object, entity and
whole as they do not contribute to the differentiation
of senses. The tag depicted in figure 1 is an example
WEBIST2013-9thInternationalConferenceonWebInformationSystemsandTechnologies
180
of a sematag produced by the WordNet vocabulary.
The WordNet vocabulary is further subdivided in
the following subvocabularies: nouns, verbs, adjec-
tives and adverbs. This allows users and agents to
quickly distinguish between senses, knowing the lex-
ical role of a keyword.
2.2 DBpedia Vocabulary
The DBpedia ontology organizes Wikipedia
concepts in a structured hierarchy currently
covering about 320 classes and 1650 different
properties. There is an opportunity to gen-
erate sematags out of DBpedia classes (e.g.
http://dbpedia.org/ontology/Person), instances (e.g
http://dbpedia.org/resource/Semantic Web) and prop-
erties (e.g http://dbpedia.org/ontology/birthDate).
In this section, we will only elaborate on classes
and instances, yet properties are briefly discussed in
section 8. To understand how sematags are extracted
from DBpedia, we will first introduce the notion of
redirects and disambiguates in the DBpedia ontology:
Redirects. The wikiPageRedirects property maps
a resource on another resource. An example is
the resource Cow which does not have its own
page on Wikipedia, yet is redirected to the re-
source Cattle. Hence we can interpret cow as
an alias for cattle and vice versa. Several redi-
rects are also defined to support different descrip-
tions of the same resource. For instance, the re-
source Winston Churchill is also referred to as
Sir Winston and Prime Minister Churchill.
Disambiguates. The wikiPageDisambiguates
property maps a virtual resource on a col-
lection of relevant resources that could be
intended by a particular term. An example is
the Bird (disambiguation) resource which
links to o.a. Bird (animal), Birds, Illinois
(community in the USA) and The Birds (film)
(a Hitchcock movie). These resources can be
considered as distinct senses of the term bird.
For a given keyword, the DBpedia vocabulary will
lookup resources that match the keyword, also fol-
lowing redirects. If a match results in a disambiguat-
ing resource, each linked resource is also added to
the temporary result set. Next, for each resource
aliases are collected including the name of the re-
source if distinct from the keyword. This is achieved
by asking for all resources for which a redirect exists
to the current resource. The isas of a DBpedia se-
matag are populated by analyzing the classes to which
a resource belongs (indicated by the rdf:type rela-
tion). Whilst DBpedia concepts are also mapped on
other ontologies such as the YAGO knowledge base
(Suchanek et al., 2007), we currently require isas to
be part of the DBpedia ontology. The reason for this
is the level of detail provided by YAGO classes (e.g.
FilmsBasedOnShortFiction, 1960sHorrorFilms)
as compared to DBpedia classes (e.g. Film). Too
much detail compromises the general applicability of
a sematag. Some examples of tags extracted from
DBpedia resources are listed in figure 2. In addi-
tion, keywords are directly matched with classes in
the DBpedia ontology. If a matching class is found, a
sematag is created as follows:
1. the keyword that serves as name for the tag is sub-
stituted by an asterisk, indicating that the tag name
does not matter;
2. no aliases are added (no owl:equivalentClass
relations are defined in the DBpedia ontology);
3. the class name is included as isa, as well as any
parent classes.
The last sematag depicted in figure 2 is extracted from
the Bird class in the DBpedia ontology.
3 EXPLAINING TAGS USING
TAGS
When a resource is annotated using TagNet, a tag key-
word is looked up in available vocabularies. If the
keyword is found and multiple senses are detected,
the user is requested to select the proper meaning in
the context of the resource. However, disambiguating
between different meanings of a keyword is not al-
ways a trivial task. This is largely due to the fact that
several words have multiple senses, many of which
we do not use in daily life or even are aware of. For in-
stance, according to WordNet the noun dog has seven
senses of which most are less commonly used, e.g.
‘informal term for a man’ and ‘metal supports for logs
in a fireplace’. Similar, when looking up a keyword in
DBpedia, several concepts with the same name yet a
different meaning are typically returned. These often
include unexpected results because the search term
also corresponds to the name of an (infamous) music
album, place or alike. A straight-forward approach to
let users distinguish between multiple meanings is to
present them with a list of explanations as lined up
above, and let them pick the intended one. However,
this approach postulates some issues preventing quick
disambiguation. First, it clearly takes time for a user
to read all sense descriptions of a word as to iden-
tify the intended sense. Descriptions are often ver-
bose and/or expressed using scientific terms, making
TagNet:UsingSoftSemanticstoSearchtheWeb
181
Figure 3: Ambiguous senses of a keyword are explained to users by means of the isas of matching sematags. A dialog
visualizes the results of a tag lookup in available vocabularies and allows users to pick the intended sense.
it hard to grasp what is meant exactly (e.g. describ-
ing a dog as ‘a member of the genus Canis’ is still
confusing). Secondly, several senses only marginally
differ in semantics from each other. This level of de-
tail is redundant for most tagging purposes and causes
uncertainty when trying to select the proper sense. To
increase the efficiency of perceiving a tag’s senses, we
present a filtered set of the isas of a tag instead of
sense descriptions arranged by the (sub)vocabulary
they originate from as illustrated in figure 3. These
isas are fast to read and hence help to quickly differen-
tiate between senses. Sematags are further organized
by their aliases e.g. dog, utah prairie dog, etc
and can be picked on class level (only isas are in-
cluded) as well as instance level (aliases are included
that map on a unique resource, similar to a URI).
The reason we opted for isas as the primary means
to distinguish between tags with a similar name is be-
cause we learned that (i) tags extracted from WordNet
and DBpedia generally contain more indicative isas
than alias and (ii) broader terms seem more helpful
than similar terms to understand the semantics of a
tag. However, additional user experiments are needed
to validate this claim which is based on our own ex-
perience. Moreover, to improve the understandabil-
ity of tags explaining tags, it might be useful to in-
corporate statistics about the popularity of words to
decide which tags are best suited to explain the se-
mantics of tags. Another option is to give up some
semantic detail in favor of a simplified tagging expe-
rience; a proper balance is needed. A coarser filter
could for instance group the WordNet tags with isas
unpleasant woman, chap and villain into a sin-
gle tag with isas person and organism. Similar, we
could prune the DBpedia tags and only display key
terms such as animal, person, song, album, place
and band.
4 TagNet AS SEARCH TOOL
In this section we elaborate on the role of sematags
to facilitate search operations in a repository popu-
lated by resources. We represent a resource by a URI,
a name (label), a description and an optional image
(thumbnail). Resources are annotated with sematags
which are extracted from vocabularies as explained
in section 2. The extra information contained in se-
matags is exploited when retrieving resources. Search
terms are not only compared to a tag’s name, but
are also matched with its aliases and isas such that
searching for animal will also yield resources that
are tagged as bird or dog. TagNet implements a
meta-search algorithm that accepts a mix of sematags
and keywords – encoded as sematags with no isas and
aliases to find resources in connected repositories.
WEBIST2013-9thInternationalConferenceonWebInformationSystemsandTechnologies
182
A sematag t
1
matches a sematag t
2
if and only if:
1. t
1
and t
2
originate from the same vocabulary, and;
2. t
1
has the same name as t
2
or an alias exists in t
2
with the same name as t
1
or an isa exists in t
1
with
the same name as t
2
1
, and;
3. all isas contained in t
1
also exist in t
2
.
Figure 4: A repository takes sematags as input and outputs
resources matching a search query composed from the in-
put.
A search query tunnelled through TagNet can be
refined dynamically. Initially, search keywords are
passed to TagNet which are matched with the name
and tags of resources in a target repository. If the re-
sults are considered too many and/or too diverse, the
search results are narrowed down by indicating the
actual meaning of one or more keywords. To this
end, sematags representing the various senses of a
keyword are looked up in available vocabularies and
presented to the user in a dialog as depicted in fig-
ure 3. Finally, selected sematags are sent along with
remaining keywords (that were not disambiguated) to
a target repository and resources are returned. Hence
sematags help users to resolve ambiguity at search
time and refine a search query.
In the next two sections we discuss how sematags
can be used to search for resources in WordNet and
DBpedia repositories. Note that the knowledge base
underlying WordNet and DBpedia is used both for
tag extraction (using vocabularies) and retrieval of
resources (using repositories). Sematags originating
from the WordNet vocabulary relate to synsets which
can be considered as annotated WordNet resources.
Similar, it makes sense to query a DBpedia repos-
itory using DBpedia sematags because resources in
this repository are already (virtually) annotated with
DBpedia sematags.
4.1 WordNet Repository
In the WordNet repository, each synset is identified by
a URI that is composed of an identifier such as dog-0
with dog being the name of the synset and 0 corre-
sponding to its sense number. Searching for synsets
using sematags is achieved by looking up all synsets
matching the keyword of the sematag and filtering
1
Note that wildcards are allowed in tag names. A *
matches any tag name.
out the results by comparing aliases and isas. Fig-
ure 5 shows that the ability to search through Word-
Net via sematags is useful for finding resources in
a knowledge base that is mapped on WordNet such
as SUMO (Niles and Pease, 2001), OpenCyc (Ma-
tuszek et al., 2006), DBpedia, etc. Sematags originat-
ing from WordNet can be translated into synset URIs
which can then be used to query for resources that are
linked to particular synsets.
Figure 5: Facts about resources can be inferred from Word-
Net sematags by first translating the tags into synset URIs
and then using these URIs to locate resources in a knowl-
edge base mapped on WordNet.
4.2 DBpedia Repository
In the DBpedia repository, Wikipedia content is seen
as a collection of resources that are virtually anno-
tated with sematags. Given a sematag, a search algo-
rithm can look for resources annotated with a match-
ing sematag. Although these sematags do not really
exist, we can assume they do according to the follow-
ing rules based on the tag extraction method outlined
in section 2.2:
each resource is annotated with a sematag having
the same name as the resource itself;
the aliases of the sematag are derived from ‘redi-
rects’ and ‘disambiguates’ pointing to the re-
source;
the isas of the sematag relate to the class hierarchy
of the resource in the DBpedia ontology.
The following SPARQL query collects resources that
match this scheme:
SELECT DISTINCT ? r ? l ?c ? t
WHERE {
?r rdfs : label ? l ; dbo: abstra ct ?c .
?r a <$tag . isa1>; a <$tag . isa2>.
FILTER ( bif : contains (? l , $tag .name” or
$tag . alia s1 ” ’)) .
FILTER (( langMatches( lang (? l ) , en )) &&
(langMatches( lang (?c ) , en ) ) ) .
OPTIONAL { ? r dbo : thumbnail ? t }
}
This query does not include resources con-
nected through dbo:wikiPageRedirects or
dbo:wikiPageDisambiguates properties. We
include these resources using separate queries to keep
the queries simple and performant.
TagNet:UsingSoftSemanticstoSearchtheWeb
183
5 EVALUATION
To evaluate the effectiveness of soft semantics to link
tagged resources with a semantic knowledge base,
we tested how tags used in Web services such as
Del.icio.us, Flickr and Youtube can be dynamically
mapped on DBpedia or WordNet resources and vice
versa. With such links in place, we can infer facts
about a photo or video using its tags and involve
tagged resources in semantic queries.In a first step, we
explored how additional tag detail can be introduced
in Flickr. Next, we investigated which steps should be
traversed to unambiguously link a collection of popu-
lar tags to related DBpedia or WordNet resources.
5.1 Introducing Sematags in Flickr
In Flickr, photos are classified by means of user-
generated (ambiguous) keywords. Rather than sub-
stituting these tags for URIs of semantic resources
wich are incompatible with a tag-based system like
Flickr we aim to upgrade these tags to sematags and
hence remove ambiguity and facilitate mappings to
resources through TagNet. However, this means that
sematags need to be stored in Flickr, when annotating
a photo. We thus need a way to seamlessly inject isas
and aliases in a legacy tagging system without break-
ing its core functionalities (e.g. search functionality).
To this end, we consider two approaches:
1. sematags are encoded in a string nota-
tion such as name|alias1|isa1;isa2 (e.g.
atlantis||db:space shuttle) and added as a
single tag to a link, or;
2. sematags are flattened into an array of tags com-
posed of the name of the tag and its isas (e.g.
name, isa1, isa2) and added as distinct tags to
a resource.
The former approach is compatible with Flickr (and
a.o. also Del.icio.us) and results in a number of ben-
efits: i) free text searches in Flickr now also range
over the synonyms and hypermyns of a tag, ii) search
queries can be passed through TagNet, semantically
refined and forwarded to Flickr using Flickr’s open
API, iii) unlike hard links, sematags can easily be un-
derstood by humans and machines and iv) sematags
can be mapped on linked data and vice versa. How-
ever, we acknowledge that a custom encoding of se-
matags is not recognized by existing systems, result-
ing in poor textual representations of sematags. Se-
matags could be rendered in a more visually appeal-
ing way by hiding aliases and isas by default and de-
picting those when hovering over a tag or clicking
on it. Furthermore, if sematags are not natively sup-
Table 1: Scores attributed to the search results of 142 ran-
dom tags in TagNet using its DBpedia vocabulary.
Sc Description of score Tags
A The first hit matches a corresponding DB-
pedia resource.
110
B The results contain a match, but not in the
first hit.
10
C None of the results correspond to a match. 14
D No results were found. 8
ported, internal free text searches will also match vo-
cabulary labels prepended to tags such as db: which
is not desirable.
The latter approach gives up on aliases and loses
information about relationships between tags. In a
flattened array of multiple sematags, it is unclear
which tags are actually isas and to which tag they be-
long. Hence it is impossible to unambiguously map
multiple flattened sematags on semantic resources.
Yet, matching a sematag with a set of flattened se-
matags (i.e. the other way round) will only yield false
positives in rare cases if tags are compared as follows.
A sematag t matches an array of tags T if and only if:
1. T contains t or T contains an alias that exists in t;
2. T contains every isa that exists in t.
A situation where false positives are possible occurs
if a sematag t
2
introduces keywords in T that com-
promise the semantics of a flattened sematag t
1
. For
instance, if a link is tagged using a keyword dog in the
sense of an animal and another tag introduces the key-
word person, then searching for a dog in the sense of
a person would incorrectly return the resource. Se-
matags with wildcards in their name (see figure 2)
should also be avoided here. However, the main draw-
back of this approach is the lack of aliases which are
needed by a machine to distinguish between resources
annotated with the same set of isas.
5.2 From Tagged Photo to Linked Photo
We used a public beta release of TagNet
2
and the all
time most popular tags on Flickr
3
as a starting point
for our study. These comprise 142 tags, related to the
broad domain of photography, and are listed in fig-
ure 6. We looked up each tag in TagNet using DBpe-
dia as primary repository and assigned a score based
on the relevance of the search results which were lim-
ited to 20 results per query for usability reasons.
The results of this step are summarized in table 1
2
http://sematags.belllabs.be/
3
http://www.flickr.com/photos/tags/
WEBIST2013-9thInternationalConferenceonWebInformationSystemsandTechnologies
184
animals, architecture, art, asia, australia, autumn, baby,
band
C
, barcelona, beach, berlin, bike
C
, bird, birds,
birthday, black, blackandwhite
D
, blue, bw
C
, califor-
nia, canada, canon
B
, car, cat, chicago, china, christ-
mas, church
B
, city, clouds, color, concert, dance, day,
de
C
, dog, england, europe, fall, family, fashion, festi-
val, film, florida, flower, flowers, food, football, france,
friends
B
, fun, garden, geotagged, germany, girl, graf-
fiti, green, halloween, hawaii, holiday, house, india, in-
stagramapp
D
, iphone, iphoneography
D
, island, italia,
italy, japan, kids
C
, la
C
, lake, landscape, light, live
B
, lon-
don, love, macro
B
, me
D
, mexico, model
B
, museum, mu-
sic, nature, new
C
, newyork, newyorkcity
D
, night, nikon,
nyc, ocean, old
C
, paris, park, party, people, photo, pho-
tography, photos, portrait, raw
B
, red, river, rock
B
, san
C
,
sanfrancisco
D
, scotland, sea, seattle, show
C
, sky, snow,
spain, spring
C
, square
B
, squareformat
D
, street, sum-
mer, sun, sunset, taiwan, texas, thailand, tokyo, travel,
tree, trees, trip
C
, uk, unitedstates
D
, urban
B
, usa, va-
cation, vintage
C
, washington
C
, water, wedding, white,
winter, woman, yellow, zoo
Figure 6: By default, 110 out of 142 popular Flickr tags
(77.5%) are mapped correctly on a valid DBpedia resource
through TagNet (A score). Tags that need additional atten-
tion to resolve ambiguity are marked in bold and labeled
with a B, C or D score (see table 1).
and marked on figure 6. It shows that 110 tags re-
ceived an A score, meaning that they were correctly
mapped on a corresponding DBpedia resource in the
first hit. For example, the tag fall is resolved into
http://dbpedia.org/resource/Autumn and nyc maps on
http://dbpedia.org/resource/New York City. To de-
note the connection with DBpedia, we added the db
prefix to tags with an A score such that TagNet knows
which repository to use to map the tag on a URI.
For tags with a B score, additional detail should
be added to overcome ambiguity. For instance,
canon is in the first place known by DBpedia
as a city in Georgia, a priest, a list of topics re-
lated to Dutch history, etc, while in the context
of photography it refers to a company special-
ized in the manufacturing of imaging and optical
products. From the related DBpedia resource
(http://dbpedia.org/resource/Canon (company),
two isas can be extracted (company and
organisation) which give rise to a sematag
db:canon||company,organisation encoded
in a Flickr-compatible format as discussed in
section 5.1 that uniquely identifies the re-
source in TagNet. Similar, friends is recog-
nized by default as a sitcom while the resource
http://dbpedia.org/resource/Friendship is actu-
ally the best match for this tag’s meaning. The
Friendship resource has no specific isas, but by
including its name as an alias to a friend tag (i.e.
friend|db:friendship), TagNet can distinguish
between the different senses. As such, each DB-
pedia resource can be described unambiguously
by a human-understandable sematag that can be
dereferenced to a URI via TagNet and vice versa.
Note that we prefer to augment a tag with isas (if
available) over aliases derived from a resource’s label
since these specific aliases often tend to be spelling
variants of the tag name or informally refer to its isas.
For instance, an alias of canon in the sense of the
Japanese multinational would be canon (company).
Tags with a C or D score need extra attention. No
resources with a matching name exist in DBpedia or
they are not in line with the meaning of the tag. This
leaves us with two options: i) lookup the tag in a sec-
ondary repository or ii) replace the tag by a similar
tag or add A-rated aliases or isas to the tag. By relying
on WordNet as secondary repository, seven more tags
(band, bike, kids, new, old, show, washington)
were attributed an A or B score and thus could be up-
graded to sematags with a wn: prefix (e.g. wn:bike).
To clarify the semantics of the remaining 15
tags, we have to find at least one meaningful
isa or alias for each tag. For instance, the tag
me and iphoneography can be annotated with
db:person and db:blog isas respectively. Tags
like newyorkcity and sanfrancisco need an
alias that is spelled differently (e.g. db:nyc,
db:san francisco) or substituted by this alias,
while the tag instagramapp can be understood using
an instagram alias.
Tags like blackandwhite (and by extension also
typical Twitter hashtags such as #savetheplanet)
are more difficult to map on linked data as they de-
note a very specific property, a state of mind or ex-
pression which is hard to describe using formal se-
mantics. In summary, we showed that 127 out of 142
random tags (89.4%) could be mapped with minimal
effort on known concepts using DBpedia as primary
and WordNet as secondary vocabulary. By systemati-
cally enriching a tag with additional tags, a (sema)tag
becomes an alternate notation for a URI that scales
better to tag-based systems like Flickr, as it is hu-
man readable and supports free text queries (includ-
ing synonym and hypernym matching).
6 ARCHITECTURE AND
IMPLEMENTATION
TagNet is developed as a Java Web application, using
Servlet technology in the back-end and AJAX tech-
nology in the front-end. It offers a REST API and
TagNet:UsingSoftSemanticstoSearchtheWeb
185
has an open-ended design such that custom vocab-
ularies and repositories can easily be plugged in by
implementing a Vocabulary and Repository inter-
face respectively. An overview of the architecture
is presented in figure 7. The WordNet vocabulary
and repository make use of the WordNet 3.0 database
files
4
while the DBpedia vocabulary and repository
rely on the online DBpedia Virtuoso SPARQL end-
point
5
. The dependency on the DBpedia SPARQL
Figure 7: TagNet architecture overview.
query engine is a bottleneck in the current beta im-
plementation. On the one hand, it allows TagNet
to run on light-weight servers with limited memory
available, but on the other hand we rely on live data
which might not always be available in time. Each
request for sematags or resources is translated into
SPARQL queries which are directed to the online
DBpedia SPARQL endpoint. Although a lot of ef-
fort was spent in optimizing these queries, we ex-
perienced huge differences in their processing time
which is probably due to a variable load of the Virtu-
oso SPARQL engine over time. While the execution
of a query can be considered relatively fast at one mo-
ment, the same query might time out moments later.
To bypass these performance issues, we replicated the
DBpedia database on a local server such that network
latencies and processing delays caused by high loads
were avoided. Furthermore, we could cache sematags
that were looked up in the DBpedia vocabulary and
pre-generate a repository of annotated resource URIs.
This would dramatically speed up the matching of
tags since additional queries are only needed to col-
lect the data associated with matching resource URIs.
7 RELATED WORK
Previous work that has been done in the area of tag-
ging is quite diverse. For instance, models have been
proposed to represent relationships between agents,
4
http://wordnet.princeton.edu/wordnet/download/
5
http://dbpedia.org/sparql/
resources and tags and augment user-contributed data
(Newman, 2005; Gruber, 2007); frameworks were
proposed to add meaning to tags (Passant and Laublet,
2008; Garcia-Castro et al., 2009; Hepp, 2010); shar-
ing and reuse of social tagging data has been stud-
ied (Golder and Huberman, 2006; Kim et al., 2008)
as well as recommendation algorithms (Song et al.,
2008; Araujo et al., 2010; Sigurbj
¨
ornsson and van
Zwol, 2008).
In the remainder of this section, we elaborate on
the works with the closest match to TagNet and dis-
cuss how they differ or match. MOAT (Passant and
Laublet, 2008) extends an ontology designed for tag-
ging (Newman, 2005) and aims to enrich free tags (i.e
any user-defined keywords) with additional meaning.
Similar to TagNet, MOAT looks up the global mean-
ing of keywords in a controlled vocabulary and al-
lows users to select the appropriate meaning, or de-
fine a new meaning by referring to a Web resource
(e.g. a DBpedia resource). Unlike tags in MOAT
which are stored externally, sematags can be injected
in real-world tagging systems and mapped on knowl-
edge bases through TagNet.
Another approach to add meaning to tags is pre-
sented in Tags4Tags (Garcia-Castro et al., 2009)
where the underlying meaning of tag can be revealed
by means of another tag. In this work, the typical
meta-model in which a Web resource maintains one
or more hasTag relations with tag literals is expanded
with typed relationships between a pair of tags. The
ideas postulated in Tags4Tags were reused in Hyper-
Twitter (Hepp, 2010). Using so-called ‘tripletweets’,
tag equivalence (e.g. #webist13 = #webist2013,
tag specializations (e.g. #tennis subtag #sports)
and predefined relations between tags (e.g. #munich
>translation #muenchen) can be expressed. This
is completely in line with the vision of TagNet: a
Twitter vocabulary can process tripletweets and gen-
erate sematags out of them. Moreover, sematags
could be incorporated in HyperTwitter to express that
the hashtag #webist13 is a subtag of a sematag
webist.
To cope with large datasets and relieve users from
manual tagging steps, recommendation algorithms
were proposed that can (semi-automatically) gener-
ate annotations from Web pages (Song et al., 2008;
Araujo et al., 2010) and images (Sigurbj
¨
ornsson and
van Zwol, 2008). Additional research is needed to
investigate how sematags could be generated from ar-
bitrary Web resources, i.e. how the correct sense of a
keyword could be derived from the current context.
In (Weller, 2007), Weller compares ontologies and
folksonomies and suggests that they can be seen as
the two ends of a scale of documentation languages
WEBIST2013-9thInternationalConferenceonWebInformationSystemsandTechnologies
186
ranging from unstructured to highly formalised sys-
tems. Rather than seeing them as rivals, they can
be considered as elements in a toolbox which can be
used together to support concrete applications. In this
work, we showed how soft semantics blur the distance
between folksonomies (no semantics) and ontologies
(hard semantics) and help to complete each other.
8 CONCLUSIONS
AND OUTLOOK
The key contributions of TagNet are twofold. First,
TagNet introduces sematags which annotate regular
keywords with isas and aliases, hence solving typi-
cal tag-related issues such as dealing with ambiguity,
spelling variants and variations in the specificity of
tags. Unlike other approaches, sematags do not in-
clude hard links to Web resources but rather contain
a minimal set of information extracted from plug-
gable vocabularies that is used to lookup related
resources in a repository. This loose coupling guar-
antees that folksonomies remain folksonomies (using
richer tags) yet unambiguous links to concepts in for-
mal knowledge bases can still be retrieved. By sup-
porting WordNet and DBpedia as default vocabular-
ies, we cover a wide range of contemporary meaning-
ful tags. Second, TagNet serves as an extensible meta-
search engine. We illustrated how TagNet is used
to search through DBpedia using sematags and ex-
plained how other repositories can be supported. We
also indicated how sematags can be scaled to support
legacy tagging systems and give rise to enriched folk-
sonomies. A beta version of TagNet is available on
http://sematags.belllabs.be/.
In future work, we want to further explore and val-
idate the effectiveness of using tags to explain the dif-
ferent senses of a keyword to users. Another inter-
esting path to explore is the use of extended ‘facets’
(categories and subcategories to which resources be-
long) to narrow down search results. Sematags sup-
port basic faceted search by default as isas classify
resources in categories. To better align with existing
faceted search engines, we could indicate how many
resources match a sematag while refining a search op-
eration using the dialog depicted in figure 3. In (Ben-
Yitzhak et al., ), Ben-Yitzhak et al. also explained the
importance of gaining insight in the data behind facets
which is far richer than just knowing the quantities of
resources that belong to each facet. We see an op-
portunity to include information about the properties
of a resource in (intermediate) search results and (re-
fined) search queries. For instance, we can also anno-
tate properties of resources using sematags. Search-
ing for e.g. ‘birthplace artist’ in DBpedia with ‘birth-
place’ and ‘artist’ both being resolved to sematags
the former matching a property, the latter matching
resources would result in a list of instances of the
DBpedia Artist class for which a birthPlace prop-
erty is defined (which is also included in the search
results).
REFERENCES
Araujo, S., Houben, G.-J., and Schwabe, D. (2010). Linka-
tor: Enriching Web Pages by Automatically Adding
Dereferenceable Semantic Annotations. In 10th Inter-
national Conference on Web Engineering (ICWE’10),
pages 355–369. Springer-Verlag.
Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak,
R., and Ives, Z. (2008). DBpedia: A Nucleus for a
Web of Open Data. In 6th International Semantic Web
Conference (ISWC’07), pages 722–735.
Ben-Yitzhak, O., Golbandi, N., Har’El, N., Lempel, R.,
Neumann, A., Ofek-Koifman, S., Sheinwald, D.,
Shekita, E., Sznajder, B., and Yogev, S. Beyond Basic
Faceted Search. In International Conference on Web
Search and Web Data Mining (WSDM’08).
Fellbaum, C., editor (1998). WordNet: An Electronic Lexi-
cal Database. The MIT Press.
Garcia-Castro, L. J., Hepp, M., and Garcia, A. (2009).
Tags4Tags: Using Tagging to Consolidate Tags. In
20th International Conference on Database and Ex-
pert Systems Applications (DEXA’09), pages 619–
628. Springer-Verlag.
Golder, S. and Huberman, B. A. (2006). The Structure of
Collaborative Tagging Systems. Journal of Informa-
tion Science, 32(2):198–208.
Gruber, T. (2007). Collective Knowledge Systems: Where
the Social Web meets the Semantic Web. Web Seman-
tics Science Services and Agents on the World Wide
Web, 6(1):4–13.
Hepp, M. (2010). HyperTwitter: Collaborative Knowl-
edge Engineering via Twitter Messages. In 17th Inter-
national Conference on Knowledge Engineering and
Management by the Masses (EKAW’10), pages 451–
461. Springer-Verlag.
Kim, H.-L., Breslin, J., Yang, S.-K., and Kim, H.-G. (2008).
Social Semantic Cloud of Tag: Semantic Model for
Social Tagging. Agent and Multi-Agent Systems:
Technologies and Applications, pages 83–92.
Matuszek, C., Cabral, J., Witbrock, M., and DeOliveira, J.
(2006). An Introduction to the Syntax and Content
of Cyc. In AAAI Spring Symposium on Formalizing
and Compiling Background Knowledge and Its Ap-
plications to Knowledge Representation and Question
Answering, pages 44–49.
Newman, R. (2005). Tag Ontology Design. http://
www.holygoat.co.uk/projects/tags.
Niles, I. and Pease, A. (2001). Towards a Standard Upper
Ontology. In International Conference on Formal On-
tology in Information Systems (FOIS’01), pages 2–9.
ACM.
TagNet:UsingSoftSemanticstoSearchtheWeb
187
Passant, A. and Laublet, P. (2008). Meaning Of A Tag:
A Collaborative Approach to Bridge the Gap between
Tagging and Linked Data. In WWW’08 workshops:
Linked Data on the Web (LDOW’08).
Sigurbj
¨
ornsson, B. and van Zwol, R. (2008). Flickr Tag
Recommendation Based on Collective Knowledge. In
17th International Conference on World Wide Web,
pages 327–336. ACM.
Song, Y., Zhuang, Z., Li, H., Zhao, Q., Li, J., Lee, W.-
C., and Giles, C. L. (2008). Real-time Automatic
Tag Recommendation. In 31st International Confer-
ence on Research and Development in Information
Retrieval (SIGIR’08), pages 515–522. ACM.
Specia, L. and Motta, E. (2007). Integrating Folk-
sonomies with the Semantic Web. In 4th European Se-
mantic Web Conference (ESWC’07), pages 624–639.
Springer.
Suchanek, F. M., Kasneci, G., and Weikum, G. (2007).
YAGO: A Core of Semantic Knowledge. In 16th In-
ternational World Wide Web Conference (WWW’07),
pages 697–706. ACM Press.
Vander Wal, T. (2007). Folksonomy Coinage and Defini-
tion. http://vanderwal.net/folksonomy.html.
Weller, K. (2007). Folksonomies and Ontologies: Two New
Players in Indexing and Knowledge Representation.
In Applying Web 2.0. Innovation, Impact and Imple-
mentation, pages 108–115.
WEBIST2013-9thInternationalConferenceonWebInformationSystemsandTechnologies
188