STRATEGIES FOR OPTIMIZING QUERYING THIRD PARTY

RESOURCES IN SEMANTIC WEB APPLICATIONS

Albert Weichselbraun

Institute for Information Business, Vienna University of Economics and Business Administration, Vienna, Austria

Keywords:

Search-test-stop, query optimization, Web services.

Abstract:

One key property of the Semantic Web is its support for interoperability. Combining knowledge sources from

different authors and locations yields reﬁned and better results.

Current Semantic Web applications only use a limited number of particularly useful and popular information providers like Swoogle, geonames, etc. for their queries. As more and more applications using Semantic Web technologies emerge, the load caused by these applications is expected to grow, requiring more efficient ways for querying external resources.

This research suggests an approach for query optimization based on ideas originally proposed by MacQueen for optimal stopping in business economics. Applications querying external resources are modeled as decision makers looking for optimal action/answer sets, facing search costs for acquiring information, test costs for checking the acquired information, and receiving a reward depending on the usefulness of the proposed solution.

Applying these concepts to the information system domain yields strategies for optimizing queries to external

services. An extensive evaluation compares these strategies to a conventional coverage based approach, based

on real world response times taken from three different popular Web services.

1 INTRODUCTION

Semantic Web applications provide, integrate and

process data from multiple data sources including

third party providers. Combining information from

locations and services is one of the key beneﬁts of se-

mantic applications.

Current approaches usually limit their queries to a number of particularly useful and popular services such as Swoogle, geonames, or Dbpedia. Research on automated web service discovery and matching (Gupta et al., 2007) focuses on enhanced applications, capable of identifying and interfacing relevant resources in real time. Future implementations, therefore, could theoretically issue queries spanning vast collections of different data sources, providing even more enhanced and accurate information. Obviously, such query strategies - if applied by a large enough number of clients - impose a considerable load on the affected services, even if only small pieces of information are requested. The World Wide Web Consortium's (W3C) struggle against excessive document type definition (DTD) traffic provides a recent example of the impact a large number of clients achieves. Ted Guild pointed out that the W3C receives up to 130 million requests per day from broken clients, fetching popular DTDs over and over again, leading to a sustained bandwidth consumption of approximately 350 Mbps.

Service providers like Google restrict the number of queries processed on a per IP/user basis to prevent excessive use of their Web services. From a client's perspective, overloaded Web services lead to higher response times and therefore higher costs in terms of processing time and service outages.

Grass and Zilberstein suggest applying value driven information gathering (VDIG) for considering the cost of information in query planning (Grass and Zilberstein, 2000). VDIG focuses on the query selection problem in terms of the trade off between response time and the value of the retrieved information. In contrast, approaches addressing only the coverage problem put their emphasis solely on maximizing precision and recall.

Optimizing value under scarce resources is a classical problem from economics and highly related to

decision theory. In this research we apply the search-

p.semanticlab.net/w3dtd


Weichselbraun A. (2008).

STRATEGIES FOR OPTIMIZING QUERYING THIRD PARTY RESOURCES IN SEMANTIC WEB APPLICATIONS.

In Proceedings of the Third International Conference on Software and Data Technologies - PL/DPS/KE, pages 111-118

DOI: 10.5220/0001876901110118

 SciTePress

Table 1: Response times of some popular Web services.

Service      Protocol   mean   median   t_min       t_max     variance
Amazon       REST        0.8      0.3     0.2       663.5        150.2
Dbpedia      SPARQL      0.9      0.5     0.1       301.2         42.7
Del.icio.us  REST        0.6      0.4     0.1        24.3          0.5
Geo          REST        1.8      0.1     0.0      1160.4        771.4
Google       Web         0.3      0.2     0.1        10.3          0.2
Swoogle      Web        35.8      1.6     0.2    101022.2    1762682.4
Wikipedia    Web         0.4      0.2     0.1        60.9          1.3

test-stop (STS) model to applications leveraging third

party resources. The STS model considers the user’s

preferences between accuracy and processing time,

maximizing the total utility in regard to these two

measures. In contrast to the approach described in

(Grass and Zilberstein, 2000) the STS model adds

support for a testing step, designed to obtain more in-

formation about the accuracy of the obtained results,

aiding the decision algorithm in its decision whether

to acquire additional information or act based on the

current answer set. Similar to (Ipeirotis et al., 2007) the resulting query strategy might lead to less accurate results than a "brute force" approach, but nevertheless optimizes the balance between accuracy and costs. This paper's results fall within the field of AI research applying techniques from decision theory to problems of agent decision making (Horvitz et al., 1988).

The article is organized as follows. Section 2

presents known query limits and response times of

some popular Web services. Section 3 provides the

theoretical background for the search-test-stop model,

and presents its extension to discrete probability func-

tions. Afterwards the application of this method to

applications utilizing external resources is outlined in

Section 4 and an evaluation of this technique is pre-

sented in Section 5. This paper closes with an outlook

and conclusions drawn in Section 6.

2 PERFORMANCE AND

SCALABILITY

As more and more applications using external data repositories emerge, strategies for a responsible use of these resources gain in importance.

Extensive queries to external resources increase their share of the program's execution time and may lead to longer response times, requiring the service's operators to impose limits on its use.

Even commercial providers like Google or Amazon restrict the number of accesses to their services. For instance, Google's Web API only allows 1000 requests a day, with exceptions for research projects. Workarounds like the use of Google's public Web interface may lead to blacklisting of the client's IP address. Google's geo coding service imposes a limit of 15,000 queries per day and IP address. Amazon limits clients to 20 queries per second, but restrictions vary between the offered services and might change over time. Other popular resources like geonames.org and Swoogle to our knowledge currently do not impose such limits.

A Web service timing application issuing five different queries to popular Web resources in 30 min intervals over a time period of five weeks yielded Table 1, listing the services' average response time ($\bar{t}$), the response time's median ($\tilde{t}$), its minimum and maximum values ($t_{min}$, $t_{max}$), and variance ($\sigma^2$).

These response times vary, depending on the client’s

Internet connectivity and location, but adequate val-

ues can be easily obtained by probing the service’s

response times from the client’s location.
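The statistics listed in Table 1 can be derived from raw probe timings along the following lines. This is only a minimal sketch: the function name and the sample values are hypothetical, and it assumes the sample variance (dividing by n - 1), which the text does not specify.

```python
import statistics


def summarize_response_times(samples):
    """Return mean, median, minimum, maximum and variance of response times."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate the variance")
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
        # Assumption: sample variance (n - 1 in the denominator).
        "variance": statistics.variance(samples),
    }


if __name__ == "__main__":
    # Hypothetical probe results; one slow outlier inflates mean and variance
    # while the median stays close to the typical response time.
    print(summarize_response_times([0.2, 0.3, 0.2, 10.3, 0.5]))
```

A single outage-like outlier dominates the mean and the variance but hardly moves the median, which is one reason the median is a more robust basis for cost estimates.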

Table 1 suggests that Google provides a fast and quite reliable service ($\sigma^2 = 0.2$) with only small variations in the response times. This result is not very surprising considering the global and highly reliable infrastructure Google employs.

Smaller information providers which cannot afford this kind of infrastructure in general provide good response times (due to fewer requests), but they are more sensitive to sudden peaks in the number of clients accessing their services. Table 1 reflects these spikes in terms of higher variances and $t_{max}$ values.

Our experiments suggest (see Section 5) that es-

pecially clients querying services with high vari-

ances beneﬁt from implementing the search-test-stop

model.

Another strategy from the client's perspective is avoiding external resources altogether. Many community projects like Wikipedia or geonames.org provide

database dumps which might be used to install a lo-

cal copy of the service. These dumps are usually

rather large (a current Wikipedia dump including all

p.semanticlab.net/gooso

developer.amazonwebservices.com

ICSOFT 2008 - International Conference on Software and Data Technologies


pages, discussions, and the edit history comprises approximately 6.4 GB) and often outdated (Wikipedia dumps are sometimes even more than one month old, other services like geonames update their records very frequently).

The import of this data requires customized tools (like mwdumper) or hacks and rarely proceeds without major hassles. In some cases the provided files do not contain all available data (geonames.org for instance does not publish the relatedTo information), so that querying the service cannot be avoided entirely.

3 THE SEARCH-TEST-STOP

MODEL

This section outlines the basic principles of the STS

model as found in decision theory. For a detailed

description of the model please refer to MacQueen

(MacQueen, 1964) and Hartmann (Hartmann, 1985).

MacQueen (MacQueen, 1964) describes the idea of the STS model as follows: A decision maker (a person or an agent) searches through a population of possible actions, sequentially discovering sets of actions ($S_i$), paying a certain cost each time a new set of actions is revealed (the search cost $c_s$). On the first encounter with a set of possible actions, the decision maker obtains some preliminary information ($x_0$) about its utility ($u$), based on which he can

1. continue looking for another set of possible actions (paying search cost $c_s$ for the next set $S_{i+1}$),
2. test the retrieved set of actions, to obtain ($x_1$) - a better estimation of the actions' value - paying the test cost ($c_t$) and based on this extended information continue with option 1 or finish the process with option 3, or
3. accept the current set of answers (and gain the utility $u$).

The challenge is combining these three options so that the total outcome is optimized by keeping the search ($c_s$) and test ($c_t$) costs low ($\sum_{i=1} c_s$, $\sum_{i=1} c_t$) without jeopardizing the obtained utility $u$.

Introducing the transformation $r = E(u|x_0)$ yields the following description for a policy without testing:

$$v = v F(v) + \int_{v}^{+\infty} r f(r)\,dr - c_s \qquad (1)$$

with the solution $v = v^*$. $F(r)$ represents the cumulative distribution function of the expected utility and $f(r)$ its probability mass function. The constant $c_s$

download.wikipedia.org; 2008-03-15

www.mediawiki.org/wiki/MWDumper

refers to the search cost and $v$ (better $v^*$) to the utility obtained by the solution of this equation.
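For a discrete distribution of r, Equation 1 can be rearranged into the condition that the expected gain over v equals the search cost, i.e. the sum of (r - v) f(r) over all r > v equals c_s. The following sketch solves this condition by bisection. The function name, the dictionary representation of f(r), and the numeric example are assumptions made for illustration and are not taken from MacQueen or Hartmann.

```python
def reservation_value(pmf, search_cost, tol=1e-9):
    """Solve sum_r max(r - v, 0) * f(r) = c_s for v by bisection.

    pmf maps each possible expected utility r to its probability f(r);
    the probabilities are assumed to sum to one.
    """
    if search_cost <= 0:
        raise ValueError("search_cost must be positive")

    def expected_gain(v):
        # Expected improvement over v obtained from one more search.
        return sum(max(r - v, 0.0) * p for r, p in pmf.items())

    hi = max(pmf)                # the gain is zero at the largest r
    lo = min(pmf) - search_cost  # the gain is at least search_cost here
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_gain(mid) > search_cost:
            lo = mid             # searching still pays off, so v is larger
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Hypothetical example: utility 0 or 10 with equal probability.
    print(reservation_value({0.0: 0.5, 10.0: 0.5}, search_cost=1.0))
```

Because the expected gain decreases monotonically in v, bisection is sufficient; higher search costs yield a lower v and therefore an earlier stop.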

Extending Equation 1 to testing yields Equation 2:

$$v = v F(r_l) + \int_{r_l}^{r_u} T(v, r) f(r)\,dr + \int_{r_u}^{+\infty} r f(r)\,dr - c_s \qquad (2)$$

and

$$T(v, r_l) = v \qquad (3)$$

$$T(v, r_u) = r_u \qquad (4)$$

$T(v, r)$ refers to the utility gained by testing, $r_l$ to the value below which the discovered action set ($S_i$) will be dropped, and $r_u$ to the minimal utility required for accepting $S_i$. A rational decision maker will only resort to testing if the utility gained outweighs its costs and therefore the condition $T(v, r) > v$ holds, which is the case in the interval $[r_l, r_u]$.
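The resulting decision rule can be sketched as a simple three-way comparison of the preliminary estimate r against a lower and an upper threshold. The names `r_lower` and `r_upper` are hypothetical labels for the two thresholds described above, and the handling of the boundaries (where the decision maker is indifferent) is an arbitrary choice of this sketch.

```python
def sts_action(r, r_lower, r_upper):
    """Return 'drop', 'test' or 'accept' for a preliminary estimate r.

    r_lower: value below which the action set is dropped (assumed name)
    r_upper: minimal utility required for accepting it (assumed name)
    """
    if r_lower > r_upper:
        raise ValueError("r_lower must not exceed r_upper")
    if r < r_lower:
        return "drop"    # continue searching for another action set
    if r >= r_upper:
        return "accept"  # stop and take the current action set
    return "test"        # pay the test cost to refine the estimate


if __name__ == "__main__":
    for estimate in (1.0, 3.0, 6.0):
        print(estimate, sts_action(estimate, r_lower=2.0, r_upper=5.0))
```

If both thresholds coincide, the rule degenerates to a policy without testing.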

In the next two sections we will (i) describe the

preconditions for applying this model to a real world

use case, and (ii) present a solution for discrete data.

3.1 Preconditions

MacQueen (MacQueen, 1964) defines a number of preconditions required for the application of the STS model. Hartmann (Hartmann, 1985) relaxes some of these restrictions, yielding the following set of requirements for the application of the model:

1. A common probability mass function $h(x_0, x_1, u)$ exists.
2. The expected value of $u$ given a known realization ($z = E(U|x_0, x_1)$) exists and is finite.
3. $F(z|x_0)$ is stochastically increasing in $x_0$. For the concept of stochastically increasing variables please refer to (Lehmann and Romano, 2005, p75).

3.2 The Discrete Search-Test-Stop

Model

This research deals with discrete service response

time distributions and therefore applies the dis-

crete STS methodology. Hartmann transferred Mac-

Queen’s approach to discrete models. The following

section summarizes the most important points of his

work (Hartmann, 1985).

Hartmann starts with a triple ($x_0$, $x_1$, $u$) of discrete probability variables, described by a common probability function $h(x_0, x_1, u)$. From $h$ Hartmann derives

1. the conditional probability function $f(u|x_0, x_1)$ and the expected value $Z = E(u|x_0, x_1)$,


1. Austria/Carinthia/Spittal/Heiligenblut/Grossglockner (mountain)

2. Austria/Carinthia/Spittal/Heiligenblut (village)

3. Austria/Carinthia/Spittal (district)

4. Austria/National Park Hohe Tauern (national park)

5. Austria/Carinthia (state)

6. Austria/Salzburg (Neighbor) (state)

7. Austria/Tyrol (Neighbor) (state)

8. Austria (country)

Figure 1: Ranking of possible “correct” results for geo-tagging an article covering the “Grossglockner”.


2. the probability function of $r$, $f(r|x_0)$ and $F(r|x_0)$, and
3. the probability of $x_0$, $f(x_0)$ and $F(x_0)$.

Provided that the conditions described in Section 3.1 are fulfilled, only five optimal policies are possible: (i) always test, (ii) never test, (iii) test if $u > u_l$, (iv) test if $u < u_u$, or (v) test if $u_l < u < u_u$, where $u_l$ and $u_u$ denote a lower and an upper utility threshold.

The expected utility equals

1. $E(u|x_0)$ for accepting without testing,
2. $T(v, r)$ with testing, and
3. $v^*$ if the action is dropped and a new set ($S_i$) is selected according to the optimal policy.

4 METHOD

This section focuses on the application of the STS model to Web services. First we describe heuristics for estimating the cost functions ($c_s$, $c_t$) and the common probability mass function $h(x_0, x_1, u)$. Afterwards the process of applying search-test-stop to tagging applications is elaborated.

4.1 Cost Functions

In the conventional STS model costs refer to the in-

vestment in terms of time and money for gathering

information. By applying this idea to software, costs

comprise all expenses in terms of CPU-time, band-

width and storage cost necessary to search for or test

certain answers.

Large scale Semantic Web projects, like the IDIOM media watch on climate change (Scharl et al., 2007), process hundreds of thousands of pages a week. Querying geonames for geo-tagging such numbers of documents would add days of processing time to the IDIOM architecture.

This research focuses solely on costs in terms of

response time, because they are the limiting factor

in our current research projects. Other applications

might require extending this approach to additional

factors like CPU-time, bandwidth, etc.

4.2 Utility Distributions

Applying the STS model to economic problems yields cash deposits and payments. Transferring this idea to information science is more subtle, because the utility is highly dependent on the application and its user's preferences. Even within one domain the notion of an answer set's ($S_i$) value might not be clear. For instance, in a geo context the "correct" answer for a certain problem may be a particular mountain in Austria, but the geo-tagger might not identify the mountain but the surrounding region or at least the state in which it is located (compare Figure 1). Assigning concrete utility values to these alternatives is not possible without detailed information regarding the application and user preferences. Approaches for evaluating the set's value might therefore vary from binary methods (full score for correct answers; no points for incomplete/incorrect answers) to complex ontology based approaches, evaluating the degree of correctness and the severity of deviations.

4.3 Application

This work has been motivated by performance issues in a geo-tagging application using resources from geonames.org and WordNet for improving tagging accuracy. Based on the experience garnered during the evaluation of STS models, this section will present a heuristic for determining the cost functions ($c_s$, $c_t$) and the common probability mass function $h(x_0, x_1, u)$.

4.3.1 Cost Functions

Searching leads to external queries and therefore

costs. Measuring a service’s performance over a cer-

tain time period allows estimating the average re-

sponse time and variance.

STS fits best for situations where the query cost is of the same order as the average utility retrieved ($O(c_s) = O(u)$). In settings with $O(c_s) \ll O(u)$ the search costs have no significant impact on the utility and if $O(c_s) \gg O(u)$ no searching will take place at


Figure 2: Database schema of a simple tagger.

all (because the involved costs are much higher than

the possible beneﬁt).

In real world situations the translation from search times to costs is highly user dependent. To simplify the comparison of the results, this research applies a linear translation function $c_s = \lambda \cdot t$ with $\lambda = const$, yielding costs of $O(c_s) = 1$. To reduce the influence of service outages the median of the response times ($\tilde{t}$) has been selected and a timeout of 60 seconds for any search operation is implemented.
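A linear translation of this kind might be sketched as follows. The function name is hypothetical, the response times are assumed to be given in seconds, and the value of the factor `lam` is left to the caller because it is user dependent.

```python
import statistics

TIMEOUT_SECONDS = 60.0  # timeout applied to any search operation (from the text)


def linear_search_cost(response_times, lam):
    """Translate measured response times into a search cost c_s = lam * t.

    response_times: measured times, assumed to be in seconds
    lam: constant, user dependent translation factor (assumed parameter)
    """
    if not response_times:
        raise ValueError("at least one response time is required")
    # A request never takes longer than the timeout, which caps outages.
    capped = [min(t, TIMEOUT_SECONDS) for t in response_times]
    # The median is less sensitive to outliers than the mean.
    return lam * statistics.median(capped)


if __name__ == "__main__":
    # Hypothetical measurements including one outage-like value.
    print(linear_search_cost([0.2, 0.4, 120.0], lam=2.0))
```

Combining the median with a timeout keeps a single outage from dominating the estimated cost.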

4.3.2 Utility Distribution

The discrete common probability mass function $h$ is composed of three components: the probability mass function of (i) the utility $u$, (ii) the random variable $x_0$ providing an estimate of the utility, and (iii) the random variable $x_1$ containing a refined estimate of the answer's utility.

In general a utility function assuming linearly independent utility values might look like Equation 5.

$$u = \sum_{i} \lambda(i)\, f_{eval}(i) \qquad (5)$$

The utility equals the sum of the utility gained by each answer set $S_i$, which is evaluated using an evaluation function $f_{eval}$, and weighted with a factor $\lambda(i)$. To simplify the computation of the utility we consider only correct answers as useful (Equation 6) and apply the same weight ($\lambda(i) = const = 1$) to all answers.

$$f_{eval}(i) = \begin{cases} 0 & \text{if } a_i \text{ incorrect;} \\ 1 & \text{if } a_i \text{ correct.} \end{cases} \qquad (6)$$
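Equations 5 and 6 translate directly into a few lines of code. The sketch below represents each answer by a boolean correctness flag; this representation and the function names are assumptions made for illustration.

```python
def f_eval(is_correct):
    """Binary evaluation function: 1 for a correct answer, 0 otherwise."""
    return 1 if is_correct else 0


def utility(answers, weight=lambda i: 1.0):
    """Weighted sum of the evaluated answers (Equation 5).

    answers: iterable of booleans, True if answer a_i is correct
    weight: function returning the weight lambda(i); constant 1 by default
    """
    return sum(weight(i) * f_eval(ok) for i, ok in enumerate(answers))


if __name__ == "__main__":
    # Hypothetical answer set with two correct and one incorrect answer.
    print(utility([True, False, True]))
```

With the constant weight of 1 the utility simply counts the correct answers; a graded evaluation function could be substituted for `f_eval` without changing the summation.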

Geo-tagging identifies geographic entities based on a knowledge base such as a gazetteer or a trained artificial intelligence algorithm.

After searching, the number of identified entries ($|S_i| = x_0$) provides a good estimation of the expected value of the answer's utility. Applying a focus algorithm (e.g. (Amitay et al., 2004)) yields a refined evaluation of the entity set ($|S_i| = x_1$) resolving geo ambiguities. $S_i$ might still contain incorrect answers due to errors in the geo disambiguation or due to ambiguous terms not resolved by the focus algorithm (e.g. turkey/bird versus Turkey/country). Based on the probabilities of a particular answer $a_i \in S_i$ being incorrect, $P_{incorr}(x_0)$/$P_{incorr}(x_1)$, the expected value $u$ for a given combination of $x_0$, $x_1$ is determined. Evaluating historical error rates yields estimations for $P_{incorr}(x_0)$ and $P_{incorr}(x_1)$.

If no historical data is available heuristics based

on the number of ambiguous geo-entries are useful

for providing an educated guess of the probabilities.
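Under the binary evaluation function and unit weights, and assuming that every answer in a set shares the same probability of being incorrect, linearity of expectation gives an expected utility of x (1 - P_incorr) for a set of x answers. The following sketch illustrates this reading; it is not necessarily the exact computation used in the evaluation, and the function name is hypothetical.

```python
def expected_utility(num_answers, p_incorr):
    """Expected number of correct answers in a set of num_answers entries.

    Assumes binary utilities (1 per correct answer) and one shared
    probability p_incorr of an answer being incorrect.
    """
    if num_answers < 0:
        raise ValueError("num_answers must not be negative")
    if not 0.0 <= p_incorr <= 1.0:
        raise ValueError("p_incorr must lie in [0, 1]")
    return num_answers * (1.0 - p_incorr)


if __name__ == "__main__":
    # The error probabilities 0.3 and 0.1 are the settings used in Section 5.1.
    print(expected_utility(10, 0.3), expected_utility(10, 0.1))
```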

A tagger recognizes patterns based on a pattern database table. The relation hasPattern translates these patterns to TaggingEntities such as spatial locations, persons, and organizations. Figure 2 visualizes a possible database layout for such a tagger. Unfortunately, the hasPattern table often does not provide a unique mapping between patterns and entities: names such as Vienna may refer to multiple entities (Vienna/Austria versus Vienna/Virginia/US). On the other hand, many entities have multiple patterns associated with them (e.g. Wien, Vienna, Vienne, Bech, etc.). Based on the database schema above, $P_{incorr}$ for such a tagger is estimated using the following heuristic:

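$$n_{Entities} = |\text{TaggingEntity}| \qquad (7)$$

$$n_{Mappings} = |\text{hasPattern}| \qquad (8)$$

$$n_{ambiguous} = |\sigma_{[\text{isAmbiguous}=\text{'true'}]}(\text{TaggingEntity} * \text{hasPattern})| \qquad (9)$$

$$P_{incorr} = 1 - \frac{n_{Entities}}{n_{Mappings} + n_{ambiguous}} \qquad (10)$$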
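Given the three counts from Equations 7-9, the heuristic of Equation 10 is a one-line computation. The sketch below takes the counts as plain integers; how they are obtained from the database is not shown, and the example numbers are hypothetical.

```python
def p_incorr(n_entities, n_mappings, n_ambiguous):
    """Heuristic probability of an incorrect answer (Equation 10)."""
    denominator = n_mappings + n_ambiguous
    if denominator <= 0:
        raise ValueError("n_mappings + n_ambiguous must be positive")
    return 1.0 - n_entities / denominator


if __name__ == "__main__":
    # Hypothetical counts: 80 entities, 90 pattern mappings, 10 ambiguous.
    print(p_incorr(n_entities=80, n_mappings=90, n_ambiguous=10))
```

The estimate grows as more patterns map to the same entities or are flagged as ambiguous.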

Extending the database schema visualized in Figure 2 to non geo entries using WordNet and applying Equations 7-10 yields $P_{incorr}$ for such entries.

5 EVALUATION

For evaluating the STS model's efficiency in real world applications, a simulation framework supporting (i) a solely coverage based decision logic and the search-test-stop model, (ii) artificial (normal distribution) and measured (compare Section 2) distributions of network response times, and (iii) common probability mass functions $h(x_0, x_1, u)$ composed


Figure 3: The search-test-stop approach. [Diagram: an input query enters the business logic, which cycles through three steps and returns a response. Search: get answers {a_1, ... a_n} and probabilities X_0; pay c_s. Test: get refined probabilities X_1; pay c_t. Stop: get the utility minus the costs accumulated. The external sources shown are folksonomies, an ontology search engine, SPARQL endpoints, ontologies, and RDF data.]

from user defined $P_{incorr}(x_0)$ and $P_{incorr}(x_1)$ settings has been programmed.

Integration of the Python numarray library enables efficient processing of matrix operations as required for computing decisions based on the search-test-stop model.

To prevent the coverage based decision logic from delivering large amounts of low quality answers, the simulation controller only accepts answers with an expected utility above a certain threshold ($u_{min}$). In contrast, the search-test-stop algorithm computes its threshold $u_{min}$ on the fly, based on the current responsiveness of the external service and the user's preferences.

5.1 Performance

Comparing the two approaches at different minimum quality levels ($u_{min}$) and service response time distributions approximated by a normal distribution $N(t, \sigma^2)$ yields Table 2. The common probability mass function has been composed with $P_{incorr}(x_0) = 0.3$, $P_{incorr}(x_1) = 0.1$. The parameters for the normal distribution are $c_s = N(2, 1.9)$ for high search costs, $c_s = N(1, 0.9)$ for medium search costs, and $c_s = N(0.5, 0.4)$ for low search costs.

Table 2 evaluates the search strategies according to two criteria: (i) the quality $u$, the average utility of an answer set ($S_i$) retrieved by the strategy, and (ii) the quantity $\frac{\Delta u}{\Delta t}$ - the rate at which the number of correct answers (and therefore the total utility ($u$)) grows.

sourceforge.net/projects/numpy

High u values correspond to accepting only high

quality results, with a lot of correct answers, and drop-

ping low quality answer sets (at the cost of a lower

quantity).
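The two criteria can be computed from a simulation log as sketched below. The sketch assumes that the log provides the utility of every accepted answer set and the total time spent, including the time for searches whose results were dropped; the function name and the sample values are hypothetical.

```python
def quality_and_quantity(accepted_utilities, total_time):
    """Return (quality, quantity) for a run of a query strategy.

    accepted_utilities: utility of each accepted answer set
    total_time: overall time spent searching and testing (assumed to
                include the time for dropped answer sets)
    """
    if not accepted_utilities:
        raise ValueError("no accepted answer sets")
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    total_utility = sum(accepted_utilities)
    quality = total_utility / len(accepted_utilities)  # average utility per set
    quantity = total_utility / total_time              # utility gained per time unit
    return quality, quantity


if __name__ == "__main__":
    print(quality_and_quantity([6.0, 8.0], total_time=4.0))
```

Dropping low quality answer sets raises the first value, but the time spent on them still counts against the second.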

The conventional coverage based approach (Conv) delivers the highest quantity for small $u_{min}$ values because virtually all answers are accepted and contribute to the total utility. This greedy approach comes at the cost of a lower answer quality and therefore low average utility $u$ per answer. Increasing $u_{min}$ yields a better answer quality, but lower quantity values. At high search costs this strategy's performance is particularly unsatisfactory, because it does not consider the costs of the search operation.

In contrast to the conventional approach, STS maximizes answer quality and quantity based on the current search cost, adjusting queries to the responsiveness of the service and the user's preferences. These preferences formalize the trade off between quality and quantity by specifying a transformation function between search cost and search times.

STS therefore optimizes the agent's behavior in terms of user utility. This does not mean that STS minimizes resource usage. Instead STS dynamically adjusts the resource utilization based on the cost of searching ($c_s$) and testing ($c_t$), providing the user with optimal results in terms of accuracy and response times.

Enforcing a minimal utility $u_{min}$ boosts the average utility $u$ of the non STS service, but at the cost of a higher resource utilization, independent from


Figure 4: Search-test-stop (STS) versus conventional (NON-STS) decision logic. [Plots not reproduced. Panels (a) Swoogle (swoogle.umbc.edu; $\tilde{t}$ = 1.6), (b) Google (google.com; $\tilde{t}$ = 0.2), and (c) geonames.org ($\tilde{t}$ = 0.1) each show the time efficiency of the STS and the non-STS decision logic for u_min = 4.00; panel (d) shows the search times at geonames.org.]

Table 2: Tagging performance.

Search cost (c_s)  u_min   Quality (u)       Quantity (Δu/Δt)
                           STS     Conv      STS     Conv
low                2       6.62    5.58      3.47    7.79
low                4       6.64    6.13      3.56    6.93
low                6       6.69    6.55      3.57    5.95
low                8       6.66    6.39      3.55    2.75
medium             2       4.99    4.84      1.88    3.22
medium             4       5.02    5.15      1.92    2.76
medium             6       5.01    5.32      1.89    2.27
medium             8       5.00    3.86      1.87    0.79
high               2       2.81    3.20      0.78    1.05
high               4       2.75    3.25      0.76    0.88
high               6       2.84    2.81      0.80    0.59
high               8       2.81   -0.91      0.76   -0.09

the server’s load (leading to extremely high response

times during high load conditions). Static limits also

do not consider additional queries at idle servers,

leading to lower utilities under low load conditions.

In contrast to the conventional approach STS (i) uti-

lizes dormant resources of idle servers, and (ii) spares

resources of busy servers, maximizing utility accord-

ing to the user’s preferences.

5.2 Web Services

In this section we simulate the effect of STS on the performance of real world Web services, using search costs as measured during the Web service timing (compare Section 2). Figure 4 visualizes the application of the search-test-stop model to Web services. The simulation uses the cost and common probability mass functions from Section 5.1.

Figure 4 compares the tagger's performance for three different Web services (Swoogle, Google, geonames) with $u_{min} = 4$. The fourth panel (d) visualizes the geonames response times over the observation period of five weeks. In all three use cases STS performs well, because the search times are adjusted according to the service's responsiveness. Geonames and


Swoogle experience the highest performance boost, due to high variances in the search cost, leading to negative utility for the conventional query strategy. Services with low variances ($\sigma^2$) in their response times such as Google, del.icio.us and Wikipedia benefit least from the application of the STS model, because static strategies perform better under these conditions.

6 OUTLOOK AND

CONCLUSIONS

This work presents an approach for optimizing access to third party remote resources. Optimizing the client's resource access strategy yields higher query performance and spares remote resources by preventing unnecessary queries.

The main contributions of this paper are (i) applying the search-test-stop model to value driven information gathering, extending its usefulness to domains where one or more testing steps allow refining the estimated utility of the answer set; (ii) demonstrating the use of this approach for semantic tagging; and (iii) evaluating how the search-test-stop model performs in comparison to a solely value based approach.

The experiments show that search-test-stop and

value driven information gathering perform especially

well in domains with highly variable search cost.

In this work we only use one level of testing; nevertheless, as Hartmann has shown (Hartmann, 1985), extending STS to n levels of testing is a straightforward task. Future research will transfer these techniques and results to more complex use cases integrating multiple data sources such as semi automatic ontology extension (Liu et al., 2005). The development of utility functions considering partially correct answers and user preferences will allow a more fine grained control over the process's performance, yielding highly accurate querying strategies and therefore better results.

ACKNOWLEDGEMENTS

The author wishes to thank Prof. Wolfgang Janko

for his valuable feedback and suggestions. The

project results have been developed in the IDIOM

(Information Diffusion across Interactive Online Me-

dia; www.idiom.at) project funded by the Aus-

trian Ministry of Transport, Innovation & Technol-

ogy (BMVIT) and the Austrian Research Promotion

Agency (FFG).

REFERENCES

Amitay, E., Har’El, N., Sivan, R., and Soffer, A. (2004).

Web-a-where: geotagging web content. In SIGIR ’04:

Proceedings of the 27th annual international ACM SI-

GIR conference on Research and development in in-

formation retrieval, pages 273–280, New York, NY,

USA. ACM.

Grass, J. and Zilberstein, S. (2000). A value-driven system for autonomous information gathering. Journal of Intelligent Information Systems, 14:5–27.

Gupta, C., Bhowmik, R., Head, M. R., Govindaraju, M.,

and Meng, W. (2007). Improving performance of web

services query matchmaking with automated knowl-

edge acquisition. In Web Intelligence, pages 559–563.

IEEE Computer Society.

Hartmann, J. (1985). Wirtschaftliche Alternativensuche mit Informationsbeschaffung unter Unsicherheit. PhD thesis, Universität Fridericiana Karlsruhe.

Horvitz, E. J., Breese, J. S., and Henrion, M. (1988). De-

cision theory in expert systems and artiﬁcial intelli-

gence. International Journal of Approximate Reason-

ing, 2:247–302.

Ipeirotis, P. G., Agichtein, E., Jain, P., and Gravano, L.

(2007). Towards a query optimizer for text-centric

tasks. ACM Trans. Database Syst., 32(4):21.

Lehmann, E. L. and Romano, J. P. (2005). Testing Statistical Hypotheses. Springer, New York, 3rd edition.

Liu, W., Weichselbraun, A., Scharl, A., and Chang, E.

(2005). Semi-automatic ontology extension using

spreading activation. Journal of Universal Knowledge

Management, 0(1):50–58.

MacQueen, J. (1964). Optimal policies for a class of

search and evaluation problems. Management Sci-

ence, 10(4):746–759.

Scharl, A., Weichselbraun, A., and Liu, W. (2007). Track-

ing and modelling information diffusion across inter-

active online media. International Journal of Meta-

data, Semantics and Ontologies, 2(2):136–145.
