MULTI-CRITERIA EVALUATION OF INFORMATION RETRIEVAL TOOLS

Nishant Kumar and Jan Vanthienen

Katholieke Universiteit Leuven, Leuven Institute for Research on Information Systems

Naamsestraat 69, 3000 Leuven, Belgium

Jan De Beer and Marie-Francine Moens

Katholieke Universiteit Leuven, Legal Informatics and Information Retrieval

ICRI, Tiensestraat 41, 3000 Leuven, Belgium

Keywords:

information retrieval, text mining, decision support systems, knowledge representation.

Abstract:

We propose a generic methodology for the evaluation of Text Mining/Search and Information Retrieval tools

based on their functional conformity to a predeﬁned set of functional requirements prioritized by distinguish-

able user proﬁles. The methodology is worked out and applied within the context of a research project con-

cerning the assessment of intelligent exploitation tools for unstructured information sources in the police

domain. We present the general setting of our work, give an overview of our evaluation approach, and discuss

our methodology for testing in greater detail. These kinds of evaluations are particularly useful both for (potential) purchasers of exploitation tools, given the high investments in time and money required to become proficient in their use, and for developers who aim at producing better quality software products.

1 INTRODUCTION

The invention of various text and data mining algorithms and their continuous improvement in terms of accuracy, performance, scalability, etc., paired with an ever expanding market of software producers turning the algorithms into general-purpose, versatile, fully fledged and easy-to-use software products, driven by an ever growing interest in and desire for such tools in various domains, strengthens the need to develop solid and sound evaluation procedures to test tools' absolute competence and relative competitiveness for their application and integration in live environments.

As software vendors tend to proclaim superiority

and supreme adequacy of their products, it is yet to

be studied and veriﬁed through objective and sound

means whether these claims hold true in practice. In

our project this is achieved through the deﬁnition of

various evaluation criteria, which will be used in a

subsequent benchmarking stage.

Deﬁning objective and adequate criteria is surely

not a trivial task. First, the concept of relevance as the

perceived quality or usability of any generated results,

is by itself very subjective in nature, depending partly

on the context of the task, the user, the anticipated

outcome, the objective, etc. As a consequence, evaluation usually entails and is founded on human interaction and judgment, severely constraining the amount of testing and effort that can be spent. Lastly, a great

number of heterogeneous and seemingly incompara-

ble factors and criteria take part in a cognitive human

judgment process, which is hard to reveal and formal-

ize.

With their vast amounts of interconnected struc-

tured and unstructured data ﬁles, police forces

throughout the world are gaining interest in powerful

and reliable automated tools that turn data into useful,

concise, accurate, and timely information and knowl-

edge, to improve or assist in information sharing and

criminal intelligence analysis. For police forces, information and knowledge are vital to the efficient and effective conduct of their operations. This widely known and well understood fact is

translated into the concept of Intelligence Led Polic-

ing (ILP), as opposed to the more traditional, labor-

intensive and less efﬁcient strategy of crime ﬁghting.

Section 2 brieﬂy describes the project

INFO-NS as

the context of and as a case study for the development

and application of our generic evaluation methodol-

ogy. One facet of the evaluation spectrum, which

consists of assessing the functional support of tools,

termed conformity testing, will be covered in depth in

Sect. 3. After a discussion of the evaluation model, we cover related work and references for further reading in Sect. 4 and conclude in Sect. 5.


Kumar N., Vanthienen J., De Beer J. and Moens M. (2006).

MULTI-CRITERIA EVALUATION OF INFORMATION RETRIEVAL TOOLS.

In Proceedings of the Eighth International Conference on Enterprise Information Systems - AIDSS, pages 150-155

DOI: 10.5220/0002463601500155

 SciTePress

2 PROJECT DESCRIPTION

2.1 Overview

The INFO-NS research project is an initiative of the

Belgian Science Policy Ofﬁce in collaboration with

the Belgian police, and is carried out by the research

groups of the authors of this paper. The aim of the project is to provide an objective study of the applicability of exploitation tools for unstructured information sources of the Belgian police. More

speciﬁcally, it is studied how information retrieval, in-

formation extraction and information processing tools

might leverage intelligence and decision support by

exploiting, linking, and contextualizing the unstruc-

tured information that is contained in vast amounts of

available free text material.

The project will achieve its objective through a

thorough evaluation of existing (off-the-shelf) re-

trieval and text mining products of some of the lead-

ing and most promising software producers in the

ﬁeld on a workbench of test cases and evaluation cri-

teria worked out in collaboration with police depart-

ments.

2.2 Evaluation Approach

We identiﬁed three major groups of evaluation crite-

ria, capturing the applicability, the competence, and

the practicality of the tools under evaluation.

Applicability The extent to which each of the pres-

elected tools (an initial market selection) answers

the identiﬁed functional needs of the various user

proﬁles.

Competence The extent to which each of the tools

performs at quality measures like capability, accu-

racy, ﬂexibility, scalability, etc. For this purpose,

task-speciﬁc evaluation procedures and criteria are

devised.

Practicality Includes performance, as the extent to which system resources (memory, disk space, network bandwidth, ...) are efficiently utilised, considering extensive document collections and a large potential number of concurrent users, next to various, more subjective criteria, such as user-friendliness, user-system interaction, the quality of documentation, etc.

For the remainder of this paper, we restrict ourselves to the evaluation of the first of these groups, termed conformity evaluation.

3 CONFORMITY EVALUATION

In this section we present our methodology for the

evaluation of tools solely on the basis of their support

(provision) regarding any functional needs and asso-

ciated priorities for a number of distinct user proﬁles

that are identiﬁed in the early stages of the project –

the requirements analysis phase. The methodology is

sufficiently generic, so that it can be readily adopted in

other projects, and is quite broad in scope, so as to

be readily portable to other situations in which some

sort of multi-criteria evaluation or analysis is to be

performed.

After we present our methodology, we illustrate

how we used this in the context of our project to pur-

sue part of its objective. Given space and conﬁdential-

ity constraints, we will however not go into too much

detail. We end this section with a discussion of our

proposed evaluation model.

3.1 Methodology

Our evaluation model assumes the following informa-

tion is available.

• A set of tools to be evaluated $T = \{T_i\}_{i=1}^{|T|}$.

• A set of relevant functionalities $F = \{F_i\}_{i=1}^{|F|}$ with fixed semantics and identifying labels.

• A hierarchy H defined over the functionalities in F according to the inclusion relation ⊃ (read: subsumes); $H = \{(i, j) \mid F_i \supset F_j \wedge \neg\exists k \neq i, j : F_i \supset F_k \supset F_j\}$.

Although not mandatory for our evaluation model,

H puts an order upon a potentially large set F

through the identiﬁcation of atomic (indivisible)

functionalities and their grouping to more general

functionalities. As will become clearer further in

this text, H allows us to proceed in a more method-

ical and systematic manner.

• For each tool $T_i$ a support tree $ST_i$. This concept is worked out in definition 1 (see below).

• A set of use cases $U = \{U_i\}_{i=1}^{|U|}$. Formally, every use case represents a logical grouping of related functionalities $U_i = \{F_{i,j}\}_{j=1}^{|U_i|}$.

In practice, a use case represents some particular task which comprises several functional components, in turn consisting of logically related

functionalities. Common components pertain to

data preprocessing, the support for accomplish-

ing the task, visualisation and interaction, import-

export capabilities, etc.

We refer to the application of these and similar techniques in the police domain, e.g. for the prioritization of criminal investigations and the assessment of threats based on offender (group) profiles or environmental conditions.


The provision of multiple use cases allows covering as many of the functionalities in F as possible with a selection of any number of tools, given the faint likelihood of having one supertool, i.e. a tool that best supports most tasks for everyone.

• A set of user profiles $P = \{P_i\}_{i=1}^{|P|}$ with identified priorities regarding each functionality in F.

• For each use case $U_i$ and user profile $P_j$ a requirements tree $RT_{i,j}$. This concept is worked out in definition 2.

Definition 1 (support tree) The support tree of tool $T_i$, noted $ST_i$, is a tree structure corresponding to H, wherein the node representing $F_j$ carries as attributes the label of $F_j$ for identification, as well as an indication of the degree to which the tool supports $F_j$.

Definition 2 (requirements tree) The requirements tree of use case $U_i$ for user profile $P_j$, noted $RT_{i,j}$, is a tree structure corresponding to H restricted to $\{F_{i,k}\}_{k=1}^{|U_i|}$. In this structure, the node representing $F_{i,k}$ carries as attributes the label of $F_{i,k}$ for identification, as well as an indication of the degree to which $F_{i,k}$ is desired by users of profile $P_j$.
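To make the restriction in definition 2 concrete, the following minimal Python sketch represents the hierarchy H as a parent-to-children map and derives the shape of a requirements tree from the functionalities of one use case. All names (`HIERARCHY`, `restrict`, the functionality labels) are hypothetical and chosen for illustration only; retaining the ancestors of the selected functionalities is an assumption about how the restriction might be carried out, not a description of the project's implementation.

```python
# Hypothetical hierarchy H: each functionality label maps to the labels it
# directly subsumes (its children); atomic functionalities have no entry.
HIERARCHY = {
    "search": ["free_text_search", "metadata_search"],
    "free_text_search": ["fuzzy_search", "conceptual_search"],
}
ROOT = "search"


def restrict(hierarchy, root, keep):
    """Restrict the hierarchy to the labels in `keep`, retaining ancestors.

    Returns a nested dict {label: {child_label: {...}}}, or None when neither
    `root` nor any of its descendants is in `keep`.
    """
    children = {}
    for child in hierarchy.get(root, []):
        subtree = restrict(hierarchy, child, keep)
        if subtree is not None:
            children.update(subtree)
    if root in keep or children:
        return {root: children}
    return None


# A use case as a set of functionalities, and the shape of its requirements tree.
use_case = {"fuzzy_search", "metadata_search"}
requirements_shape = restrict(HIERARCHY, ROOT, use_case)
```

In this sketch the node attributes (labels and priorities) are left out; only the tree shape obtained from H is shown.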

Given this information, we now aim to evaluate

how well each tool conforms to every use case in U,

and this for every user proﬁle in P individually. As

every combination of use case and user proﬁle is re-

ﬂected in a unique requirements tree, we thus want

to compute the conformity between every tool and re-

quirements tree. For this, we deﬁne an abstract oper-

ator τ that evaluates the “degree of support” of deﬁ-

nition 1 with respect to the “degree of desire” of de-

ﬁnition 2, given a particular support tree ST and a

requirements tree RT .

$\tau : ST \times RT \to \mathbb{R}$

The repeated operation of τ for each tool on all re-

quirement trees then produces an array of conformity

scores, which can optionally be combined (e.g. through weighting) into global scores, or used to filter away

dominated tools. This latter option can be achieved

by retaining only those tools in T for which there is

at least one requirements tree for which they give the

best result (among the other tools in T). The selec-

tion of non-dominated tools is given by the following

formula.

$$\bigcup_{j,k} \{\, T_i \mid \tau(ST_i, RT_{j,k}) = \max_{l} \tau(ST_l, RT_{j,k}) \,\} \qquad (1)$$

In the formula, the union is taken of all best tools for

every requirements tree.
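The selection of non-dominated tools can be sketched in a few lines of Python. The sketch assumes that the conformity scores have already been computed and stored in a nested dictionary, and that every tool has a score for every requirements tree; the function name `non_dominated`, the tool labels, and the score values are hypothetical.

```python
def non_dominated(scores):
    """Return the tools that attain the best score on at least one tree.

    `scores[tool][tree]` holds the conformity score of `tool` on the
    requirements tree identified by `tree` (a use case / profile pair).
    """
    tools = list(scores)
    trees = {tree for per_tool in scores.values() for tree in per_tool}
    selected = set()
    for tree in trees:
        best = max(scores[tool][tree] for tool in tools)
        # Union of all best tools for this requirements tree (ties included).
        selected |= {tool for tool in tools if scores[tool][tree] == best}
    return selected


# Hypothetical scores for three tools on two requirements trees.
scores = {
    "tool_a": {("search", "analyst"): 0.8, ("search", "investigator"): 0.4},
    "tool_b": {("search", "analyst"): 0.6, ("search", "investigator"): 0.7},
    "tool_c": {("search", "analyst"): 0.5, ("search", "investigator"): 0.3},
}
kept = non_dominated(scores)
```

Here `tool_c` is filtered away because it is never the best tool on any requirements tree.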

In the workout, we show how we defined the operator τ.

3.2 Workout

3.2.1 User Proﬁles

In association with the Belgian police, we ﬁrst identi-

ﬁed four user proﬁles for the tools being sought af-

ter. These proﬁles are quite general in nature and

are equally found in other police organisations, even

though they may go by different names.

Administrator Collects, manages, structures, and

sometimes already relates facts described in ofﬁcial

documents (case reports e.g.), dispatching the gath-

ered or derived information to other services upon

request or as part of the information ﬂow.

Investigator Conducts criminal investigations. Her

task is to compile a comprehensive report (a legal

case ﬁle) describing all acts and elements part of

the investigation, which will be the main source of

evidence used by judicial authorities for prosecu-

tion.

Operational analyst Examines, supports and assists

criminal investigations, especially more complex

ones. New hypotheses, alternatives, links, contex-

tualisations, schematisations, etc. can be suggested

or provided.

Strategic analyst Analyzes safety problems: their tendencies, trends, patterns, processes, novelties, etc. Such analyses serve as the basis for strategic

(long-term) decision making, pinpointing the main

security problems and giving insights into their na-

ture and characteristics. This allows allocating lim-

ited police resources for top efﬁcacy.

3.2.2 Functionalities and Priorities

We compiled an extensive list of functional require-

ments, partly technical requirements of a more pre-

requisite nature that we as technical researchers were

able to identify ourselves, and partly functional needs

of the user group, which we gathered through ques-

tionnaires, meetings and work sessions held through-

out the different police departments. A topical, high-

level overview follows.

• Tool Conﬁguration

– Document content indexing process

– Security and access control

– Support for multiple languages and document

formats

A prerequisite is the support for the three ofﬁcial lan-

guages in Belgium, namely Dutch, French, and German,

along with English for open sources. Given the emerging

threat of terrorism and the organised crime wave coming

from the East, interest in Arabic and Asiatic languages is

growing.

ICEIS 2006 - ARTIFICIAL INTELLIGENCE AND DECISION SUPPORT SYSTEMS


– Inclusion of metadata

– Automatic clustering or classiﬁcation

• Search & Retrieve

– Metadata search: document id, url, title, type,

language, origin,...

– Free text search: crosslingual, fuzzy, conceptual

search,...

– Entity search: crosslingual, phonetic, morpho-

logical search,...

– Similarity search: crosslingual search-by-

example

– Taxonomy search: category or cluster selection

– Multi-modal search

– Monitoring: automated signaling of relevant,

new or updated information, e.g. through user

proﬁling and proactive search agents

• User-System interaction

– Assisted formulation of search queries

– Filtering of search results through successive

formulation of queries

– Relevance feedback and query reﬁnement

– Repeated search and search history

– Visualisation, exportation, manipulation, and

browsing of search results

– Automated clustering or classiﬁcation of the

search result

• Qualitative Analysis

– Discovery of relations between terms, concepts,

entities, or any combination

– Assisted annotation of documents, also known as

text coding

– Support for creation of graphical schemes

– Automated recognition and classiﬁcation of en-

tities

– Visualisation and exportation of analysis results

Functionalities were hierarchically ordered and

presented in clear language to police ofﬁcers of the

identified user profiles. By having them score the functionalities according to their active needs, we were able to associate real-valued priority values with the functionalities in F for each profile.

3.2.3 Use Cases and Requirement Trees

Out of F we were able to distinguish ten distinct use

cases. As an example, consider the use case “free text

search”. As all others, this use case is made up of

several functional components, including tool conﬁg-

uration, document indexing, text search, and various

interaction functions. Given the hierarchical ordering

of our functionalities we set up the corresponding re-

quirements tree.

3.2.4 Tools and Support Trees

For each of the tools considered for evaluation, we will construct the corresponding support tree. The implementation is done through the specification of support values for each of the functionalities in F. Concretely, the support value of tool $T_i$ for $F_j$, noted $\sigma(T_i, F_j)$, is a real number in the unit interval giving expression to the “degree of support” of definition 1. A value of 0 indicates no support, 1 indicates full support, and partial support might be mapped along the continuum.

σ : T × F → [0, 1]
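As a small illustration, the mapping σ could be stored as a lookup table. The sketch below is only one possible representation: the function name `sigma`, the table layout, and the tool and functionality labels are hypothetical. The default of 0 for missing entries follows the convention, stated in this paper, that missing or unreliable information is treated as no support.

```python
def sigma(support_table, tool, functionality):
    """Return the support value of `tool` for `functionality`, in [0, 1].

    Missing entries default to 0 (no reliable information means no support).
    """
    value = float(support_table.get((tool, functionality), 0.0))
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"support value out of range: {value}")
    return value


# Hypothetical support table keyed by (tool label, functionality label).
SUPPORT = {
    ("tool_a", "fuzzy_search"): 1.0,     # full support
    ("tool_a", "phonetic_search"): 0.5,  # partial support
}
```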

3.2.5 Conformity Matching

In order to match a requirements tree with the sup-

port tree of a tool, we implement the matching op-

erator τ through the speciﬁcation of objective func-

tions at every single node in the requirements tree.

These functions take as arguments the support values

of the tool, the requirement priorities of a user pro-

ﬁle, and some extra, proﬁle-independent parameters.

Each objective function produces as a result a real-

valued conformity score with respect to the function-

ality associated to the node having the function at-

tached. Through repeated and systematic evaluation

of these functions - starting at the leaf nodes, trac-

ing intermediate nodes, and ending at the root node -

one obtains a global conformity score for each tool on

every use case and for every user proﬁle.
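The bottom-up evaluation can be sketched as a recursive traversal of the requirements tree. The concrete objective functions are not specified here, so the sketch makes two assumptions purely for illustration: a leaf scores the tool's support value, and an inner node scores the priority-weighted mean of its children. The names `Node` and `conformity` and all labels and values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A requirements-tree node: functionality label, priority, children."""
    label: str
    priority: float
    children: list = field(default_factory=list)


def conformity(node, support):
    """Evaluate the tree bottom-up and return the score at `node`.

    Leaf score: the tool's support value (0 when unknown).
    Inner score: priority-weighted mean of the children's scores
    (an assumed objective function, chosen only for illustration).
    """
    if not node.children:
        return support.get(node.label, 0.0)
    total = sum(child.priority for child in node.children)
    if total == 0:
        return 0.0
    return sum(child.priority * conformity(child, support)
               for child in node.children) / total


# Hypothetical requirements tree for one use case and one user profile.
tree = Node("free_text_search", 1.0, [
    Node("fuzzy_search", 3.0),
    Node("conceptual_search", 1.0),
])
# Hypothetical support values of one tool.
support = {"fuzzy_search": 1.0, "conceptual_search": 0.5}
score = conformity(tree, support)  # (3 * 1.0 + 1 * 0.5) / 4 = 0.875
```

Other operators (multiplicative, fuzzy logical) could be substituted at individual nodes without changing the traversal itself.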

In addition, next to the detailed intermediate re-

sults, which can give useful insight as to why and

at which points some tools fail, we build two clear

and concise contracted tables which we can easily de-

rive through priority composition. One table gives the

conformity of each tool for each use case (combined

over all proﬁles), whereas the other table gives the

conformity of each tool for every user proﬁle (com-

bined over all use cases).
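The two contracted tables can be sketched as follows. How scores are combined is an assumption: the sketch uses an unweighted mean, whereas the priority composition mentioned above may weight profiles or use cases differently. The function name `contract` and the example scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def contract(scores, axis):
    """Collapse scores[tool][(use_case, profile)] along one axis.

    axis=0 keeps use cases (combining over profiles);
    axis=1 keeps profiles (combining over use cases).
    """
    table = {}
    for tool, per_tree in scores.items():
        groups = defaultdict(list)
        for key, value in per_tree.items():
            groups[key[axis]].append(value)
        table[tool] = {name: mean(values) for name, values in groups.items()}
    return table


# Hypothetical conformity scores of one tool on four requirements trees.
scores = {
    "tool_a": {
        ("search", "analyst"): 0.8,
        ("search", "investigator"): 0.4,
        ("monitoring", "analyst"): 0.6,
        ("monitoring", "investigator"): 0.2,
    },
}
by_use_case = contract(scores, axis=0)
by_profile = contract(scores, axis=1)
```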

3.3 Discussion

3.3.1 Considerations

To safeguard the proper application of the proposed

evaluation model with a sound interpretation and use

of the produced results, a few conditions and remarks

should be made.

To prevent any market inﬂuence and to safeguard the

conﬁdentiality of our research, we choose not to make the

tools publicly known, at least not at this stage.

Whenever no (reliable) information can be obtained

about the degree of support, we safely assume support is

missing; σ = 0.


First of all, support values should be obtained by reliable means so as to reflect the tools' true support, otherwise results are bound to be meaningless and therefore useless. From our own experience, we found that software vendors have a tendency to label their products as being extremely versatile and applicable to the specific task at hand. As a researcher, one should therefore strive to establish these values through objective and motivated means, for example by consulting any documentation that describes the tools' capabilities and features, attending demo presentations, installing evaluation versions, looking for related studies conducted by trustworthy third parties, or drawing on personal use or prior knowledge.

Second, it is the nested objective functions which

serve to produce the absolute conformity scoring val-

ues. As these functions capture the very semantics

of the evaluation (matching) taking place, they should

be devised with great care and precision. Judicious use of mathematical operators (additive, multiplicative, fuzzy logical, ...) and overall consistency in design are prime points of attention.

Third, interpretation of results should primarily be

based on a relative comparison of tools by identifying

any signiﬁcant differences in conformity scores, as

the absolute scores may depend heavily on the some-

what arbitrary structuring of the tree, composition of

objective functions, and parameter settings.

As a last remark, we observed a marked differ-

ence in prioritizing functionalities among different

user proﬁles. Whereas some proﬁles cautiously dis-

tributed priorities as if they were given some ﬁxed

amount of priority points, others rated the majority

of functionalities equally and sufficiently high. Judicious use of normalizing operators in objective functions at different levels in the requirements tree prevents the model from being biased by these different prioritizing behaviors. The successive application of small-scale normalization gives the desired effect of conformity scores being somewhat more tailored to profiles that defined more balanced priority schemes, provided those schemes reflect actual gradations in desirability of functional needs.
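A minimal sketch of one such normalizing operator is given below, assuming that normalization means scaling the priorities of sibling nodes so that they sum to 1. This is an illustrative choice rather than the operator used in the project; the function name `normalize` and the example priorities are hypothetical.

```python
def normalize(priorities):
    """Scale sibling priorities so that they sum to 1 (uniform if all are 0)."""
    total = sum(priorities.values())
    if total == 0:
        return {label: 1.0 / len(priorities) for label in priorities}
    return {label: value / total for label, value in priorities.items()}


# A profile that rates everything equally high ...
generous = normalize({"fuzzy_search": 5, "phonetic_search": 5})
# ... and a profile that distributes its priorities more cautiously.
cautious = normalize({"fuzzy_search": 3, "phonetic_search": 1})
```

After normalization both profiles carry the same total weight at this level of the tree, so the generous profile no longer dominates merely by awarding higher absolute values.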

3.3.2 Possible Uses

Given accurate support values and priorities, one

could use this procedure to make a selection of tools

on the basis of functional conformity, as suggested

by (1). Such a selection would allow us to identify tools

that are promising and suitable candidates for further,

more thorough testing. Since the number of tools on the market is usually quite large, and time is limited in research projects, this early kind of preliminary evaluation may turn out to be an interesting, efficient and effective exercise.

As an example, tools claiming certain functional capabilities merely by the provision of some general-purpose macro language or Application Programming Interface (API) cannot be considered as meeting our interest in directly applicable, off-the-shelf tools.

As we had little prior knowledge about the tools

under study and too little time to perform a full-scale

support analysis of the tools, we decided to make a preselection motivated by early conformity impressions drawn from tool documentation, demo presentations and personal contacts, and to defer the conformity evaluation procedure until a later stage of our project.

4 RELATED WORK

In the past decade, many IT implementation projects

have been conducted in collaboration with police

forces throughout the world. Most of these projects

revolve around the centralisation and consolidation of

various digitized information sources, for the purpose

of information fusion, information sharing, improved

availability (ubiquitousness) of information, and ad-

vanced exploitation for criminal analysis. Among the

more renowned (pilot) projects we mention the trend-setting COPLINK project of Chen et al. (Hauck et al., 2001; Atabakhsh et al., 2001; Chen et al., 2002; Chen et al., 2003; Chen et al., 2004) in the state of Arizona, US, which has been growing vigorously since 1997, the CLEAR (Citizen Law Enforcement Analysis and Reporting) project in Chicago, the FLINTS (Forensic Led Intelligence System) project developed since 1999 by West Midlands police under the auspices of R. M. Leary (Leary), the OVER project of Oatley, Ewart and Zeleznikow (Oatley et al., 2004), in association with West Midlands police since 2000, and the expanding KDD-PN (Knowledge Discovery from Databases – Police Netherlands) project including the DataDetective tool since 2001.

We found that the majority of projects are quite

similar in scope and nature, involving the applica-

tion of data mining and subsequent visualisation tech-

niques on information that is implicitly assumed to

be electronically available in structured, clean, pre-

processed, and unprotected (readily accessible) form.

Among the more inspiring technologies are deci-

sion tree building, offender proﬁling, social network

analysis, spatio-temporal statistics and visualisation

techniques including hot spot analysis. Applications and numerous case studies can be found in a recent book by J. Mena on the subject matter (Mena, 2003). Kumar et al. (2006) and De Beer et al. (2006) discuss in detail the quality of commercial information retrieval and text mining tools. Van Rijsbergen (1979) discusses evaluation techniques for measuring


the performance of information retrieval tools. Related studies on functional use assessment, relevance assessment, and quality evaluation can be found in Lancaster (1968), Cooper (1973), and Ingwersen (1992), while the evaluation methodologies suggested by Elder and Abbott (1998), Nakhaeizadeh and Schnabl (1997), and Collier et al. (1999) are notable.

5 CONCLUSION

Through our research project with the Belgian police,

we encountered many interesting aspects that are not

readily found or touched upon in literature on the subject, most notably on the issues of privacy, security,

legal aspects such as the evidential value of generated

results, data preprocessing and cleaning, integration,

ﬂexibility, adaptability, and performance of exploita-

tion tools in practical settings. In this paper we pre-

sented our proposed evaluation methodology for con-

formity testing of software tools, which ﬁts in a larger

framework of tool evaluation. We hope our work may prove useful and inspire other field workers to reflect on these topics, as we believe the success and promising future of these tools depend heavily on their careful consideration.

ACKNOWLEDGMENTS

The authors would like to thank the Belgian police

for their interest and active collaboration, in particu-

lar Kris D’Hoore, Martine Pattyn and Paul Wouters.

This work was supported by the Belgian Science Policy Office through their AGORA research programme (AG/01/101).

REFERENCES

Atabakhsh, H., Schroeder, J., Chen, H., Chau, M., Xu, J. J.,

Zhang, J., and Bi, H. (2001). Coplink knowledge man-

agement for law enforcement: Text analysis, visual-

ization and collaboration. In Proceedings of the Na-

tional Conference for Digital Government Research,

volume 1.

Chen, H., Chung, W., Xu, J. J., Wang, G., Qin, Y., and Chau,

M. (2004). Crime data mining: a general framework

and some examples. 37(4).

Chen, H., Schroeder, J., Hauck, R., Ridgeway, L.,

Atabakhsh, H., Gupta, H., Boarman, C., Rasmussen,

K., and Clements, A. (2002). Coplink connect: infor-

mation and knowledge management for law enforce-

ment. 34(3):271–285.

Chen, H., Zeng, D., Atabakhsh, H., Wyzga, W., and

Schroeder, J. (2003). Coplink managing law enforce-

ment data and knowledge. 46(1):28–34.

Collier, K., Carey, B., Sautter, D., and Marjaniemi, C.

(1999). A methodology for evaluating and selecting

data mining software. In Proceedings of the Interna-

tional Conference on System Sciences.

Cooper, W. S. (1973). On selecting a measure of retrieval

effectiveness. Journal of the American Society for In-

formation Science, 24(2):87–100.

De Beer, J., Kumar, N., Moens, M.-F., and Vanthienen, J.

(2006). Assessing the state of the art of commercial

tools for unstructured information exploitation.

Elder, J. F. and Abbott, D. W. (1998). A comparison of

leading data mining tools. Technical report.

Hauck, R. V., Schroeder, J., and Chen, H. (2001). Coplink:

Developing information sharing and criminal intelli-

gence analysis technologies for law enforcement. In

Proceedings of the National Conference for Digital

Government Research, volume 1, pages 134–140.

Ingwersen, P. (1992). Information Retrieval Interaction.

Taylor Graham, London.

Kumar, N., De Beer, J., Vanthienen, J., and Moens, M.-F.

(2006). A study on the quality of enterprise search

tools.

Lancaster, F. W. (1968). Information Retrieval Systems:

Characteristics, Testing and Evaluation. Wiley, New

York.

Leary, R. M. The role of the national intelligence model and FLINTS in improving police performance. Online at http://www.homeoffice.gov.uk/docs2/resconf2002daytwo.html.

Mena, J. (2003). Investigative data mining for security and

criminal detection. First edition.

Nakhaeizadeh, G. and Schnabl, A. (1997). Development

of multi-criteria metrics for evaluation of data mining

algorithms. In Proceedings KDD-97. AAAI Press.

Oatley, G. C., Ewart, B. W., and Zeleznikow, J. (2004). De-

cision support systems for police: Lessons from the

application of data mining techniques to ‘soft’ foren-

sic evidence.

Van Rijsbergen, C. J. (1979). Information Retrieval. Butterworths, London, second edition.
