An Approach for Semantic Search over Lithuanian News Website Corpus

Tomas Vileiniškis, Algirdas Šukys and Rita Butkienė

Department of Information Systems, Kaunas University of Technology, Studentu st. 50-309, Kaunas, Lithuania

Keywords: Semantic Search, SBVR, SPARQL, Information Retrieval, Ontology, Semantic Annotation, Lithuanian

Language.

Abstract: The continuous growth of unstructured textual information on the web implies the need for novel,

semantically aware content processing and information retrieval (IR) methods. Following the evolution and

wide adoption of Semantic Web technology, a number of approaches to overcome the limitations of

traditional keyword-based search techniques have been proposed. However, most of the research

concentrates on English and other well-known, linguistic resource-rich languages. Hence, this paper

presents an attempt to semantic search over domain-specific Lithuanian web documents. We introduce an

ontology-based semantic search framework capable of answering structured natural Lithuanian language

questions and discuss its language-dependent design decisions. The findings from a recent case study

showed that our proposed framework can be applied to approach meaning-based IR with significant results,

even when the underlying language is morphologically rich and has limited linguistic resources.

1 INTRODUCTION

In the context of traditional Web search, Information

Retrieval (IR) has been known as a task of retrieving

documents relevant to user information needs,

typically expressed by some form of a query. A

general IR model consists of three major building

blocks: representation of a user query, document

content description and a retrieval function. Early

work in IR field highly focused on keyword-based

models, such as exact-match Boolean and statistical

Vector Space Model (Salton et al., 1975). The

obvious shortcoming of these models is the lack of

conceptualization both at the query and document

representation level, which eventually results in poor

precision and recall rates. A number of approaches

such as query expansion (Carpineto and Romano,

2012) and word sense disambiguation (Stokoe et al.,

2003) have been proposed to manage synonymy and

polysemy in order to overcome the limitations of

prior models to some extent.

However, with the emerging growth of Semantic

Web technology, the way Web information retrieval

has been seen is changing. The introduction of

common standards for semantic data and domain

knowledge representation (Resource Description

Framework (RDF), RDF Schema (RDFS), Web

Ontology Language (OWL2)) followed by a

dedicated RDF query language (SPARQL)

influenced a wide body of research (Mangold, 2007)

towards meaning-based IR, which we will refer to as

semantic search throughout the paper. Standard

document text preprocessing steps used in classical

IR models (tokenization, stop word removal,

stemming etc.) are getting complemented by more

advanced Information Extraction (IE) methods such

as semantic annotation and ontology population. We

believe that application of IE methods for content

processing builds the foundation for efficient

semantic search.

In general, IE is known as an activity of

automatically extracting structured information from

unstructured information source. The main challenge

here is the complexity and ambiguity of natural

language, hence making IE hardly dependent on

advances in Natural Language Processing (NLP)

techniques. While state-of-the-art in IE related NLP

research for well-known languages (e.g. English)

has already reached levels of successful practical

application on a massive scale (e.g. IBM’s Watson

project) (Ferrucci, 2010), less popular and resource-

poor languages such as Lithuanian, remain an open

NLP research field.

The nature of Lithuanian language imposes many

NLP-related challenges. First of all, it is highly

inflected, which means that a single word root can

lead to hundreds of different word forms, each of

them expressing a distinct grammatical category. For

Vileiniškis, T., Šukys, A. and Butkien

e, R..

An Approach for Semantic Search over Lithuanian News Website Corpus.

In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2015) - Volume 1: KDIR, pages 57-66

ISBN: 978-989-758-158-8

example, a nominative singular noun asmuo (person)

alone has multiple other grammatical cases reflected

by alternating suffixes: asmens (genitive), asmeniui

(dative), asmenį (accusative), asmeniu

(instrumental), asmenyje (locative) etc. Such

declension of nouns and adjectives plays a major

role when determining grammatical function of a

word in a sentence. Moreover, Lithuanian language

doesn’t have a strict word order. A single sentence

can be expressed in multiple ways by switching

word positions without losing the initial meaning.

Therefore, reusing standard syntactic parsing

approaches, applicable for strict word order

languages (e.g. English) becomes complicated

(Šveikauskienė and Telksnys, 2014). These and

many other language-specific features require

special attention when developing sophisticated IE

methods dedicated for Lithuanian or any other

morphologically rich languages (e.g., Slavic) in

general.

In this paper we present a combined attempt to

semantic content processing and search over

Lithuanian web texts. A semantic search framework

for the task is proposed. We introduce an ontology

population-driven IE approach tightly coupled with

a model-to-model (M2M) transformation-based IR

model. We show how such tight-coupling enables us

to serve natural structured language queries over

domain-specific data represented in the form of

ontology. We then evaluate applicability of our

framework by performing a case study over

Lithuanian news website corpus, focusing on

political and economic domains. To the best of our

knowledge, this is the first public attempt to

Lithuanian text processing at the level of ontological

semantics.

The rest of the paper is structured as follows.

Section 2 gives a brief overview of related work in

semantic search area and provides the state-of-the-

art of NLP research for Lithuanian language. Section

3 presents the architecture of our semantic search

framework with emphasis on capturing and

maintaining domain-specific semantics throughout

the search process. The experimental observations

and lessons learned from the case study are

presented in Section 4. Finally, we draw conclusions

and discuss our future research plans in Section 5.

2 RELATED WORK

The evolution of Semantic Web technology has

made a significant impact on meaning-based IR

methods over the last decade. In particular, the

introduction of W3C’s OWL2, RDFS, RDF and

SPARQL to conceptualize, represent and query

domain specific knowledge led to an upsurge of

research in the field.

(Kiryakov et al., 2004) proposed KIM - a

framework for semantic annotation and retrieval.

The main idea behind KIM is the semantic typing of

named entities (NE) by linking them to pre-

populated knowledge base entries and/or appropriate

domain-ontology classes. (Fernández et al., 2011)

introduced an approach for semantically enhanced

IR by adapting the classical vector space model

(Castells et al., 2007). The IE task used to

conceptualize document content is similar to the one

proposed by (Kiryakov et al., 2004). In addition,

(Fernández et al., 2011) use an ontology-based

Question Answering (QA) system to interpret the

intent behind user queries. This is achieved by

deriving linguistic triples from a natural language

question and then looking up for answer-bearing

ontology concepts by syntactic triple similarity

matches (Lopez et al., 2009). Our approach to

capturing user query intents differs substantially: we

aim at obtaining a formal SPARQL query model

from a structured natural language question (see

Section 3.2).

Knowledge bases like Freebase or DBpedia have

been recently used to tackle the problem of open-

domain QA (Yao and Van Durme, 2014; Shekarpour

et al., 2015). While their main goal is to retrieve

answers to factoid-like questions over structured

world’s knowledge, our framework is primarily

aimed at mining and searching domain-specific texts

in order to satisfy event-oriented information needs.

All of the above mentioned approaches target

semantic search only from an English language

perspective, thus they build upon sophisticated NLP

methods that are well known and properly

researched. However, Lithuanian NLP research

progresses in little steps. Perhaps one of the most

significant achievements is the early work by

(Zinkevičius, 2000) who created the first Lithuanian

lemmatizer and part-of-speech (POS) tagger called

Lemuoklis. The syntax of Lithuanian language has

been extensively analyzed by (Šveikauskienė, 2005)

(Šveikauskienė and Telksnys, 2014). A recent

approach to statistical dependency parsing

(Kapociute-Dzikiene et al., 2013) showed the

importance of morphological features (especially

grammatical case) for the accuracy of results.

However, the lack of syntactically annotated data

suggests that rule-based parsing is a better choice.

The only publically available case study of NLP-

based content processing is presented in (Krilavičius

KDIR 2015 - 7th International Conference on Knowledge Discovery and Information Retrieval

et al., 2012), where authors apply named entity

recognition (NER) among other standard text pre-

processing steps to annotate and analyse Lithuanian

news media websites.

3 SEMANTIC SEARCH

FRAMEWORK

The architecture of our proposed semantic search

framework is depicted in Figure 1. As was noted in

Section 1 of the paper, the framework consists of

two major tightly coupled parts: the first one

performs information extraction (IE) related tasks

and the second one handles information retrieval

(IR) related activities. For a detailed explanation

please refer to Sections 3.1 and 3.2 respectively.

The IE module is dedicated for document text

preprocessing and annotation by linguistic

components in the NLP pipeline. IE module aims at

capturing and conceptualizing entities and the events

they participate in. We will emphasize the principles

behind the semantic annotation component in

subsequent sections. In order to avoid possible

confusion about terminology, a note on the use of

the terms “semantic annotation” and “ontology

population” should be given (Amardeilh, 2008). Our

text processing efforts concentrate on ontology

population, i.e., adding instance data (A-Box) to a

predefined ontology (T-Box). In addition, we

perform semantic annotation, i.e., we link slices of

text to their formal ontological representation bits

(A-box) created in the ontology population step. In

this aspect, our approach slightly differs from

previous works discussed in Section 2. The reason

for this is two-fold. First, there is no semantic

knowledge base that would have sufficient coverage

of domain specific entities and relations commonly

mentioned in Lithuanian media. The construction of

such resource would require a significant amount of

manual labor. Although, the existence of

multilingual lexical knowledge bases (e.g. BabelNet)

(Navigli et al., 2012) is well-known, the entries for

entities of a local importance (Lithuanian politics,

organizations etc.) are rare to be found. Secondly, IR

model behind our framework is based on formal

SPARQL query execution, thus we expect for all the

relevant domain knowledge acquired during text

processing to be present in RDF form at query time.

The IR module behind the framework is highly

based on SBVR (Semantics of Business Vocabulary

and Business Rules) standard. SBVR is the OMG

created metamodel and specification that defines

vocabulary and rules for describing business

semantics – business concepts, business facts, and

business rules using some kind of Controlled Natural

Language (OMG, 2008). SBVR enables to create

formal specifications understandable for business

people and also interpretable by software tools. This

is achieved by the usage of structured natural

language for representing meaning as formal logic

structures – semantic formulations. SBVR

metamodel is based on principle of separating

meaning of business concepts and business

restrictions from their representation. A number of

transformations of SBVR specifications to various

software models have been created: Web services

(Goedertier and Vanthienen, 2008), BPMN

(Bodenstaff et al., 2008), OWL2 (Karpovič et al.,

2014), etc. We employ specific SBVR metamodel

features to capture the meaning behind user’s

information needs and further to obtain a formal

SPARQL query representation by means of model-

to-model (M2M) transformation between the two.

3.1 Information Extraction

Information Extraction (IE) module aims to structure

natural language document text at the level of

ontological semantics, i.e. by analyzing entity

mentions and their domain-specific relations we

populate a predefined ontology schema with

instance data. Formal ontological representation of

document content allows taking advantage of

implicit knowledge that can be inferred by

employing OWL reasoning capabilities.

IE task behind our framework is powered by a

pipeline of NLP components for Lithuanian

language:

─ Lexical analyzer performs stop word removal and

standard text tokenization by breaking input text

into words, sentences and paragraphs.

─ Morphological analyzer assigns part-of-speech

(POS) tags to each of the word along with lemma,

grammatical number and most importantly

grammatical case.

─ Named Entity recognizer (NER) is based on

gazetteer lookups. It detects mentions of entities

that belong to three major type categories:

organizations, locations and persons.

─ Semantic annotator analyzes domain-specific

relations between entities and produces ontology

instance data in the form of RDF triples.

Each of the NLP components produces stand-off

annotations in a custom data format which gets

An Approach for Semantic Search over Lithuanian News Website Corpus

Figure 1: The architecture of the framework.

serialized using JSON. In such way, we keep the

documents and their annotations decoupled.

Since the principle behind the three first NLP

components in the pipeline is beyond the scope of

this paper we will focus on the fundamental features

of our semantic annotator.

Given an ontology schema, semantic annotator

attempts to populate it by instantiating classes and

their properties with entity and relation mentions

found in the analysed text. It follows a rule-based

approach that looks for specific lexico-semantic

patterns, combining information from prior lexical,

morphological and named entity annotations.

Capturing all the domain-specific relations one

could express in ontology is a non-trivial task,

especially for highly inflected languages like

Lithuanian. Moreover, the absence of production

ready syntactic parsers makes the task even more

challenging.

Our current ruleset targets extraction of political

and economic event mentions in their various forms.

We collected the most common reporting verbs

(sakyti (say), teigti (state), pranešti (announce) etc.)

from the news articles and derived multiple patterns

for utterance extraction. Example rules are given

below:

Rule I

Rule II

Rule I is based on direct quotation extraction,

while Rule II extracts indirect quotations by

matching common conjunction kad, jog (that)

patterns. In both cases we try to catch and instantiate

the full triple: the agent, the reporting verb and the

reported substance.

Some of the extraction rules are not as

straightforward and require more attention to

(c1) (c2) (c3)

<NE> <RVERB>{,} {kad|jog} {SUBSTANCE}

& type(NE) = Person =>

assert(c1:Person, c2:Saying,

c3:Substance, says<c1,c2>,

includes<c2,c3>)

(c1) (c2) (c3)

{„SUBSTANCE“,-} <RVERB> <NE>

& type(NE) = Person =>

assert(c1:Substance, c2:Saying,

c3:Person, says<c3,c2>,

includes<c2,c1>)

KDIR 2015 - 7th International Conference on Knowledge Discovery and Information Retrieval

Figure 2: A fragment of domain-specific event ontology.

language specific features. An example of this is the

detection of positions held by persons within

organizations (here, PNOUN stands for the position

noun like prezidentas (president), ministras

(minister), teisėjas (judge) etc.):

Rule III

By relying solely on lexical term sequence, we

could easily end up with many incorrect extractions.

In a sample sentence Europos Parlamente

prezidentė Dalia Grybauskaitė skaitė pranešimą

(President Dalia Grybauskaitė gave a speech at the

European Parliament) the locative case of the word

Europos Parlamente determines its grammatical

function - an adverbial modifier of place. Ignoring

the case mark, Rule III would result in assertion

works_in<Dalia Grybauskaitė, European

Parliament>

which is not entirely true. Therefore,

an additional check for the genitive case is made to

avoid incorrect extractions caused by Lithuanian

declension.

Among the three rules presented above, our

ruleset includes over 20 patterns for detecting

changes of prices, taxes and other abstract objects of

interest. Also, we always instantiate named entity

mentions, whether they participate in some event or

not.

The final assertions are produced according to

the ontology schema that we created for capturing

the event-specific knowledge commonly found in

Lithuanian news articles. Currently, it consists of

over 100 classes and nearly 70 relations. A tiny

fragment of the ontology relative to the running

examples throughout the paper is presented in Figure

2. The link between the document and the

recognized objects within the content is established

by an object property

<:refers_to_object>. The

Object class is the top class of all domain entities

that we try to detect through the IE process. Thus,

the enrichment of ontology with new domain entities

is only a matter of sub-classing

Object.

As every mention of a named entity within the

text results in new instance (and URI) creation, we

face ambiguity issues. The same entities tend to be

referred to under different lexical aliases (Dalia

Grybauskaitė, D. Grybauskaitė, Grybauskaitė etc.)

throughout the news articles. Lithuanian declension

causes even more suffix alternations (Daliai

Grybauskaitei, Dalią Grybauskaitę, Dalios

Grybauskaitės etc.) in such way having a negative

impact on recall with queries including proper

names (see Section 3.2). This is approached by

employing several heuristics to disambiguate all the

different entity mentions to a single entity we call

trusted:

─ First, we find equal entities by a common lemma

and abbreviation matches.

─ Secondly, we determine the main alias behind the

trusted entity by inflecting its nominative case,

and then create a new instance T.

─ Lastly, we link all the corresponding entities to

the trusted instance T by an object property

<:recognized_as_trusted_object>.

We iterate the above process for each distinct

entity type (organizations, locations, persons)

recognizable by NER. In addition, the ontology is

pre-populated with a set of well-known trusted

entities along with their main aliases, which makes

(c1) (c2) (c3)

& type(NE1) = Organization &

type(NE2) = Person

& caseMark(NE1) = genitive =>

assert(c1:Organization, c3:Person,

works in<c3,c1>)

An Approach for Semantic Search over Lithuanian News Website Corpus

the disambiguation process more precise.

As mentioned in Section 3, in addition to A-Box

ontology assertions, semantic annotator produces

stand-off semantic annotations, i.e. extracted

document text fragments get linked to their

corresponding ontology entity URIs by token

indices. This information is later used in IR phase.

3.2 Information Retrieval

Our semantically enhanced Information Retrieval

(IR) model builds upon the framework for querying

OWL2 ontologies using structured natural language,

as presented in (Sukys et al., 2012). The operation of

this framework depends on SBVR, OWL2 and

SPARQL specifications, each of which comes with a

formal metamodel, thus making model-to-model

transformations possible. SBVR business vocabulary

is used to formulate, serialize and transform user’s

information needs to a SPARQL query which,

eventually, retrieves appropriate answers from the

ontology. An advantage of such model-driven IR

approach is the ability to capture and map domain-

specific business restrictions to formal query

conditions (triple patterns) in a straightforward way,

once the M2M transformation rules are present.

To ensure that the resulting triple patterns of

SPARQL query correspond to ontology classes and

properties, it is important to keep correspondence

between SBVR vocabulary entries (general

concepts, verb concepts) and OWL2 ontology

entities. The most reliable way to do this is to obtain

ontology schema automatically using model

transformations from specifications of SBVR

business vocabulary and business rules as described

in (Karpovič et al., 2014) or vice versa (Bernotaityte

et al., 2013).

Having business vocabulary and corresponding

ontology schema in place, questions can be written

using structured natural language, which helps to

express the intents more precisely and avoid

ambiguities that are common in natural language

interpretation. Question formulation using structured

natural language under strict grammar rules imposes

the need to guide the end user throughout the

process. Therefore, we employ EBNF (Extended

Backus–Naur Form) to constraint user input to a

somewhat relaxed form of possible SBVR

formulations present in the vocabulary. As the

question is written with the help of contextual

suggestions, it gets parsed using EBNF rules and a

syntax tree is created. The latter, which contains

recognized statements of the question, is further

used to generate SBVR XMI (XML Metadata

Interchange) model. This model holds the captured

meaning of a question that is constructed using a

closed projection with restricting logical

formulations and projection variables, expressing

general concepts that should appear in the answer.

SBVR XMI model is further transformed into

SPARQL XMI model using ATL model-to-model

transformation language. The principles behind

transformation process and specific transformation

rules are described in more details in (Sukys et al.,

2012). At the final step, SPARQL XMI model is

translated to textual representation using a model-to-

text generator.

An illustrative example of transforming question

“Kokie asmenys dirba organizacijose?” (What

persons work in organizations?) to a SPARQL query

is given below. For the sake of simplicity, we

provide only a small fragment of SBVR vocabulary

(Lithuanian and English equivalents), necessary for

such transformation:

The fragment consists of three vocabulary entries:

general concepts

asmuo (person), organizacija

(organization) and a verb concept asmuo dirba

organizacijoje (person works in

organization) denoting the domain specific

relation between the prior defined concepts. The

declension of Lithuanian nouns can be clearly seen

from the above example, i.e. the word representing

general concept

organizacija changes its suffix

in the verb concept asmuo dirba

organizacijoje since the verb dirba governs the

locative case. In the English example, the

grammatical form of a general concept

organization remains the same since its role is

determined by the use of the preposition in. We

manage such language inflection by referring to the

same concepts in different SBVR formulations by

their main grammatical form – lemma.

Several heuristics are employed to make the

structured question as natural sounding as possible.

For example, in certain cases we allow omitting the

subject part of the question, which is later derived by

performing grammatical case-based matching in the

vocabulary entries. As a result, the question in our

running example can be expressed in a more user-

person

organization

person works in organization

asmuo

organizacija

asmuo dirba organizacijoje

KDIR 2015 - 7th International Conference on Knowledge Discovery and Information Retrieval

friendly form “Kas dirba organizacijose?” (Who

work in organizations?). Similarly, we manage

singular and plural noun forms as well.

The textual representation of a transformed

SPARQL query from the question in our running

example (English equivalent) is presented below:

SELECT ?person_i ?organization_i

WHERE {

?person_i ?person_in_organization

?organization_i .

?person_in_organization :label

"person works in organization".

?person_i rdf:type ?person_cl.

?person_cl :label "person".

?organization_i rdf:type

?organization_cl.

?organization_cl :label

"organization".

}

After the initial transformation, SELECT clause

projects a set of variables V that bind (given the

RDF graph data matches) to answer-bearing

ontology entity URIs. The basic graph pattern (BGP)

consists of multiple triple patterns that reflect the

identification of conforming vocabulary and

ontology concepts. In particular, we determine the

type of each of the projected variables v ∈ V in two

steps:

• A triple pattern T1 is created that binds a

representative literal value of SBVR concept to a

non-projected variable n (

<?person_cl :label

"person">).

• A subsequent triple pattern T2 is created with n in

an object position denoting the type of the

projected variable v (

<?person_i rdf:type

?person_cl>

In a similar way we identify necessary vocabulary-

conforming ontology properties.

Queries with proper names involved, e.g. Kas

dirba Europos Parlamente? (Who works in the

European Parliament?), are transformed by

additionally employing simple heuristics to retrieve

disambiguated instances (see Section 3.1). In

particular, we generate a set of triple patterns that

use the

<:recognized_as_trusted_object>

predicate to bind to all the non-trusted instances,

thus giving higher recall.

Finally, the query is augmented with triple

patterns that require for each of the projected

variables to be bound to a single document instance

(variable d), i.e. for each v ∈ V we create triple

patterns

<d :refers_to_object v>. At the last

step, we project an additional variable k in the

SELECT clause that denotes the internal document

identifier later on used for snippet generation. Note

that k ∉ V. An ORDER BY clause could be added to

sort the results according to document publication

date however, the ordering cost proved to be too

high on a larger dataset.

At this stage, we have fully-constructed a formal

SPARQL query that returns entity URI bindings,

essentially performing data retrieval. With the

original research aim in mind to attempt meaning-

based information retrieval, our proposed framework

includes a component for result snippet generation.

The logic behind it is based on the following

algorithm:

─ For each of the initial SPARQL projection

variables v ∈ V extract their URI bindings v → u;

─ For each u retrieve its beginning b and ending e

token indices from the semantic and lexical

annotations produced in IE phase;

─ Calculate min(b) and max(e) values to determine

the range of a text passage;

─ If min(b) and max(e) fit within boundaries of a

single sentence, extend the range of a text passage

to a full sentence;

─ Else If min(b) and max(e) overlap to the

neighboring sentences, extend the range of a text

passage to the boundaries of neighboring

sentences;

─ Extract the text passage as a final snippet.

─ Repeat for every tuple in the binding set.

Given that the lexical and semantic annotations

produced in IE phase are correct, the above

algorithm results in a snippet containing both the

answer-bearing entities, and the original context they

were extracted from.

4 EVALUATION

An early evaluation of our approach was performed

by conducting a case study over a crawled corpus of

Lithuanian news texts. We gathered over 90.000

domain specific documents from more than 30 news

portals. After initial pre-processing steps the

documents were annotated producing around 44

million explicit and 49 million implicit RDF triples

under OWL-Horst materialization settings in the

triple store. A prototype for the search interface was

deployed to ease the evaluation of the practical

applicability of our approach (see Figure 3).

In order to evaluate the search results in a

quantitative manner, we selected 4 different queries

for accuracy calculations: 2 abstract ones and 2 with

An Approach for Semantic Search over Lithuanian News Website Corpus

Figure 3: A prototype of semantic search interface for the case study. Sample results are shown for a question Kas dirba

Europos Parlamente? (Who works in the European Parliament?).

proper names involved (see Table 1). We then

judged the quality of the results on two main criteria:

whether the text snippet returned gives a correct

answer to the original question and if the answer-

related entities are correctly highlighted within the

text passage.

As a single article could possibly contain

multiple distinct answers to the same query, we

chose to calculate precision values snippet wise, so

the total number of analysed articles differs per

query and ranges from 12 to 65. Queries Q1, Q2 and

Q3, Q4 were assessed by manually evaluating 61

and 71 snippets respectively. These numbers proved

to be enough to observe a general trend in error

sources.

Getting correct recall values is not a

straightforward task in our current setting since a

full set of correct answers to each of the queries is

not known in advance. Therefore, we calculated

recall only on a working subset of articles, i.e. those

that had their snippets evaluated as mentioned

above. In particular, we analysed the content of

those articles to collect the number of missed

annotations and assertions required to stand as an

additional answer (snippet) with respect to the

original query.

Table 2 shows the primary results of text snippet

evaluation. Here, A

column stands for the amount

of snippets analysed, A

– snippets with correct

answer, A

– not found snippets. While most of the

queries achieve very high precision rates, Q2 stands

out with a bit lower results. We noticed that a

common pitfall here is the extraction of indirect

quotations, where pure lexico-semantic patterns

can’t differentiate between the reporting agent and

other agents contextually related to the reported

substance. Relatively low recall values indicate that

our current domain-specific event extraction ruleset

is capable of capturing only the most common event

expressions.

Table 1: Query set used for evaluation.

# Query

Q1 Ką kalbėjo agentai?

(What did the agents say?)

Q2 Ką kalbėjo Vladimiras Putinas?

(What did Vladimir Putin say?)

Q3 Kas dirba organizacijose?

(Who work in organizations?)

Q4 Kas dirba Europos Parlamente?

(Who works in the European Parliament?)

Table 2: Text snippet accuracy results.

# A

Recall Precision

Q1 61 57 64 0.456 0.934

Q2 61 54 50 0.486 0.885

Q3 71 69 59 0.493 0.971

Q4 71 71 16 0.816 1.000

In addition, we evaluated accuracy of entity

highlighting within the correctly returned snippets

(see Table 3). Each of the queries from our query set

is expected to return an entity tuple, either agent-

substance (Q1, Q2) or person-organization (Q3,

Q4), hence we split the results by agent/person and

substance/organization columns. A

column lists

the total number of snippets analysed, A

–

correctly highlighted agent/person entities, A

–

correctly highlighted substance/organization entities.

As the results in Table 3 show, Q2, Q3 and Q4 reach

KDIR 2015 - 7th International Conference on Knowledge Discovery and Information Retrieval

near-perfect precision values in both A

and A

This is because these queries mostly return entities

that get instantiated by strictly following NE tags. In

contrast, Q1 fails significantly on A

. The Agent

entity behind Q1 is more general according to our

ontology schema (see Figure 2), therefore not only

NE instances are included within the results. In

particular, Agent subclasses like Group (and other

specific types of agents) get instantiated by domain

list-guided noun lookups during IE. Our efforts here

fail short when the looked-up noun only governs the

correct entity to be instantiated in a noun phrase and

does not stand as an instance by itself. Thus, noun

phrase mining should be improved by taking into

account more Lithuanian morphological features.

Table 3: Entity highlighting accuracy results.

# A

Q1 57 36 53 0.632 0.930

Q2 54 54 51 1.000 0.944

Q3 69 69 69 1.000 1.000

Q4 71 71 71 1.000 1.000

The primary experimental evaluation of our

approach led to certain observations:

─ The precision of search results is mainly affected

by the performance of NLP components behind

IE task, since the IR phase operates in a Boolean

manner, i.e. given a transformed formal SPARQL

query the returned bindings hold all the

conditions expressed by the set of triple patterns.

─ Syntactic parser for Lithuanian is a crucial

linguistic component currently missing from the

NLP pipeline. Event extraction from complex

sentence structures with a free word order is a

non-trivial task and can hardly be carried out by

solely relying on lexico-semantic patterns.

─ Even when NLP-related errors occur at IE stage,

our snippet generation approach enables to

deliver a correct text passage with a decent level

of accuracy.

The evaluation results can be summed up on a

qualitative note. As shown in Figure 3, our strategy

for snippet generation attempts to present the user

with an answer in a single step, eliminating the need

to perform an additional search for the answer-

related text passage by opening the whole article.

We see this as one of the main features and

advantages of semantic search paradigm when

compared to classical pure keyword-based

approaches.

5 CONCLUSIONS AND FUTURE

WORK

We presented, to the best of our knowledge, the first

public attempt to semantic search over Lithuanian

language texts. The aim of our research was to show

that meaning-based information retrieval methods

can be successfully applied even for resource-poor,

highly inflected languages like Lithuanian. While

state-of-the-art in Lithuanian natural language

processing is far away from well computerized

languages such as English, information extraction at

the level of ontological semantics can still be

approached with significant results. However,

language specific linguistic features should be

stressed.

The semantic search framework we proposed is a

continuation of our previous research on semantics

of SBVR, OWL2 and SPARQL. In particular, we

study the conceptual conformity and discrepancies

between their metamodels by means of model-to-

model transformations. Although the practical

application of our model-driven approach requires

some customizations by preparing business

vocabularies and ontologies, it can be ported to

different domains. SBVR question transformation to

SPARQL is not dependent from the structured

language used. However, certain adoptions for using

different languages, such as Structured English or

Structured Lithuanian are needed due to

grammatical peculiarities of languages.

The early case study demonstrated promising

results and thus we seek to improve our efforts,

especially towards content processing. By refining

event extraction rules and patterns we hope to

capture more domain-specific event mentions along

with their event-specific characteristics. Moreover,

we plan on evaluating the gathered RDF knowledge

base in order to draw some statistical conclusions

regarding the most common events and entities

found in the news articles. With such information

available, we will aim at performing fine-grained

ontological typing of the most common entities

against public knowledge bases, by employing

named entity linking techniques. Finally, our

prototype from the case study, as a part of a bigger

project, is being launched for public availability with

the hope to provide better news search capabilities to

Lithuanian language users. See:

http://www.semantika.lt/SemanticSearch/Search/Ind

An Approach for Semantic Search over Lithuanian News Website Corpus

REFERENCES

Salton, G., Wong, A., Yang, C. S. (1975). A vector space

model for automatic indexing. Communications of the

ACM, 18(11), 613-620.

Carpineto, C., Romano, G. (2012). A survey of automatic

query expansion in information retrieval. ACM

Computing Surveys (CSUR), 44(1), 1.

Stokoe, C., Oakes, M. P., Tait, J. (2003). Word sense

disambiguation in information retrieval revisited. In

Proceedings of the 26th annual international ACM

SIGIR conference on Research and development in

information retrieval (pp. 159-166). ACM.

Mangold, C. (2007). A survey and classification of

semantic search approaches. International Journal of

Metadata, Semantics and Ontologies, 2(1), 23-34.

Ferrucci, D., Brown, E., Chu-Carroll, J., Fan, J., Gondek,

D., Kalyanpur, A., Welty, C. et al. (2010). Building

Watson: An overview of the DeepQA project. AI

magazine, 31(3), 59-79.

Šveikauskienė, D., Telksnys, L. (2014). Accuracy of the

Parsing of Lithuanian Simple Sentences. Information

Technology and Control, 43(4), 402-413.

Kiryakov, A., Popov, B., Terziev, I., Manov, D.,

Ognyanoff, D. (2004). Semantic annotation, indexing,

and retrieval. Web Semantics: Science, Services and

Agents on the World Wide Web, 2(1), 49-79.

Castells, P., Fernandez, M., and Vallet, D. (2007). An

adaptation of the vector-space model for ontology-

based information retrieval. Knowledge and Data

Engineering, IEEE Transactions on, 19(2), 261-272.

Fernández, M., Cantador, I., López, V., Vallet, D.,

Castells, P., Motta, E. (2011). Semantically enhanced

Information Retrieval: an ontology-based approach.

Web Semantics: Science, Services and Agents on the

World Wide Web, 9(4), 434-452.

Lopez, V., Uren, V., Sabou, M. R., Motta, E. (2009).

Cross ontology query answering on the semantic web:

an initial evaluation. In Proceedings of the fifth

international conference on Knowledge capture (pp.

17-24). ACM.

Zinkevičius, V. (2000). Lemuoklis–morfologinei analizei.

Darbai ir dienos, 24, 245-274.

Šveikauskienė, D. (2005). Formal description of the

syntax of the Lithuanian language. Information

Technologies and Control, 34(3).

Kapociute-Dzikiene, J., Nivre, J., Krupavicius, A. (2013).

Lithuanian Dependency Parsing with Rich

Morphological Features. In Fourth Workshop on

Statistical Parsing of Morphologically Rich

Languages (p. 12).

Krilavičius, T., Medelis, Ž., Kapočiūtė-Dzikienė, J.,

Žalandauskas, T. (2012). News Media Analysis Using

Focused Crawl and Natural Language Processing:

Case of Lithuanian News Websites. In Information

and Software Technologies (pp. 48-61). Springer

Berlin Heidelberg.

Amardeilh, F. (2008). Semantic annotation and ontology

population. Semantic Web Engineering in the

Knowledge Society, 424-p.

Navigli, R., Ponzetto, S. P. (2012). BabelNet: The

automatic construction, evaluation and application of a

wide-coverage multilingual semantic network.

Artificial Intelligence, 193, 217-250.

OMG, 2008. Semantics of Business Vocabulary and

Business Rules (SBVR). Version 1.0. December,

2008, OMG Document Number: formal/2008-01-02.

Goedertier, S., Vanthienen, J. (2008). A Vocabulary and

Execution Model for Declarative Service

Orchestration. Business Process Management

Workshops, LNCS, Vol. 4928, 496–501.

Bodenstaff, L., Ceravolo, P., Ernesto Damiani, R.,

Fugazza, C., Reed, K., Wombacher, A. (2008).

Representing and Validating Digital Business

Processes. Web Information Systems and

Technologies, LNBIP, Vol. 8(1), 19–32.

Karpovič, J., Kriščiūnienė, G., Ablonskis, L., Nemuraitė,

L. (2014). The Comprehensive Mapping of Semantics

of Business Vocabulary and Business Rules (SBVR)

to OWL 2 Ontologies. Information Technology and

Control, 43(3), 289-302.

Sukys, A., Nemuraite, L., Paradauskas, B., Sinkevicius, E.

(2012). Transformation framework for SBVR based

semantic queries in business information systems. In

BUSTECH 2012, The Second International

Conference on Business Intelligence and Technology

(pp. 19-24).

Sukys, A., Nemuraite, L., Paradauskas, B. (2012).

Representing and transforming SBVR question

patterns into SPARQL. In Information and Software

Technologies (pp. 436-451).

Bernotaityte, G., Nemuraite, L., Butkiene, R.,

Paradauskas, B. (2013). Developing SBVR

vocabularies and business rules from OWL2

ontologies. In Information and Software Technologies

(pp. 134-145).

Shekarpour, S., Marx, E., Ngomo, A. C. N., & Auer, S.

(2015). Sina: Semantic interpretation of user queries

for question answering on interlinked data. Web

Semantics: Science, Services and Agents on the World

Wide Web, 30, 39-51.

Yao, X., Van Durme, B. (2014). Information extraction

over structured data: Question answering with

freebase. In Proceedings of ACL.

KDIR 2015 - 7th International Conference on Knowledge Discovery and Information Retrieval