MINING SCIENTIFIC RESULTS THROUGH THE COMBINED USE

OF CLUSTERING AND LINEAR PROGRAMMING TECHNIQUES

∗

Andrea Tagarelli, Irina Trubitsyna, Sergio Greco

DEIS - University of Calabria

87030 Rende, Italy

Keywords:

Data Mining, Clustering, DEA, Efﬁciency Measures.

Abstract:

The paper proposes a technique based on a combined approach of data mining algorithms and linear program-

ming methods for classifying organizational units, such as research centers. We exploit clustering algorithms

for grouping information concerning the scientiﬁc activity of research centers. We also show that the replace-

ment of an expensive efﬁciency measurement, based on the solution of linear programs, with a simple formula

allows clusters of very good quality to be computed efﬁciently. Some initial experimental results, obtained

from an analysis of research centers in the agro-food sector, show the effectiveness of our approach, both from

an efﬁciency and a quality-of-results point of view.

1 INTRODUCTION

The high performance of organizational units, also

known as decision-making units, relies on good de-

cision support which can have a major impact on the

achievement of the goals of the unit. On the other

hand, the soundness of a decision usually reﬂects the

quality of the activities of the unit. For instance, a

decision made on a project in which a scientific research
center is involved could lead to an increase
in the productivity of the research center itself, provided
that such a project represents a relevant activity

from a scientiﬁc point of view.

The process of evaluating and comparing the per-

formances of organizational units is a challenging ap-

plication, in principle, for several research disciplines.

In particular, there is growing interest in measuring

the efﬁciency of organizational units involved in sim-

ilar activities, technologies and inputs. Moreover,

evaluating the productivity of research centers is use-

ful from the point of view of a careful deployment

of ﬁnancial resources to the centers themselves: in-

tuitively, a research center with a high performance

may gain more economic benefits than other

research centers with lower quality scores.

∗

Work supported by a MURST grant under the project

“Sistemi informatici integrati a supporto del bench-marking

di progetti ed interventi ad innovazione tecnologica in

campo agro-alimentare”

Traditional efﬁciency measures are often inade-

quate due to the presence of multiple inputs and out-

puts related to different resources, activities and en-

vironmental factors. In many productive ﬁelds, the

methods of parametric and non-parametric evaluation

seem to be preferred with respect to the combined use

of traditional indicators. In fact, such methods pro-

vide a synthetic indicator of the productivity by si-

multaneously considering multiple inputs and outputs

of the productive process. As a consequence, they al-

low the comparison of the efﬁciency of a given orga-

nizational unit with respect to the frontier of the possi-

ble efﬁcient solutions for all the organizational units.

The parametric methods (DFA, SFA) require the pre-

sumptive deﬁnition of the productive function, while

the non-parametric ones (DEA, FDH) are able to de-

termine the relative efﬁciency of organization units by

means of linear programming techniques. This is an

advantage, since the non-parametric methods permit

us to evaluate the performance of organization units

without any knowledge of their productive process.

The contribution of this paper is the deﬁnition of a

methodology for the classiﬁcation of research centers

combining data mining techniques, such as clustering,

and linear programming techniques. The expected re-

sult is a system capable of organizing research centers

by considering information about the volume and the

quality of their scientific activity. We study how to extract
and represent both scientific results and performance
information from research centers. Then, we
exploit clustering algorithms to accomplish the task
of organizing such information, and evaluate the
corresponding accuracy of the proposed approach.

Tagarelli A., Trubitsyna I. and Greco S. (2004).
MINING SCIENTIFIC RESULTS THROUGH THE COMBINED USE OF CLUSTERING AND LINEAR PROGRAMMING TECHNIQUES.
In Proceedings of the Sixth International Conference on Enterprise Information Systems, pages 84-91
DOI: 10.5220/0002624000840091
SciTePress

The remainder of this paper is organized as follows.

The next section gives a short overview of the clustering
process, tailored to our purposes. Section 3

presents DEA, a linear programming based technique

for measuring the efﬁciency of organizational units.

Section 4 illustrates the overall architecture and the

features of a system for the classiﬁcation of research

centers. Section 5 describes a methodology for organizing
research centers based on models computing
their efficiency. Section 6 proposes an alternative
way to compute the efficiency of research centers;
this section ends with an experimental evaluation
of the effectiveness of our approach. Finally,

Section 7 contains concluding remarks.

2 DATA CLUSTERING

Clustering is the task of organizing a collection of

objects (whose classiﬁcation is unknown) into mean-

ingful or useful groups, called clusters, based on the

interesting relationships discovered in the data. The

goal is that the objects within a cluster will be highly

similar to each other, but will be very dissimilar from

objects in other clusters. The greater the homogene-

ity/heterogeneity within/between groups, the better

the resulting partition of clusters.

A ﬁrst stage in a typical clustering task is the deﬁ-

nition of a model to represent the objects, drawn from

the same feature space. Typically, an object is represented
as a multidimensional vector, where each dimension
is a single feature. Formally, given an m-dimensional
space, an object $\mathbf{x}$ is a single data point
and consists of a vector of m measurements:
$\mathbf{x} = (x_1, \ldots, x_m)$. A set of n objects
$X = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$
to be clustered is in the form of an object-by-attribute
structure, i.e. an n-by-m matrix. The scalar components
$x_h$ of $\mathbf{x}$ are called features or attributes.

Many different clustering algorithms can be ex-

ploited (Jain and Dubes, 1988). Partitional and hi-

erarchical clustering techniques are by far the most

popular and important ones. In this work, we exploit

the well-known k-Means partitional algorithm, which
has the main advantage of requiring O(n) comparisons
per iteration and typically yields clusters of good quality. The
algorithm starts by randomly choosing k objects as
the initial cluster centers. Then it iteratively reassigns
each object to the cluster to which it is the closest,
based on the proximity between the object and the
cluster center, and updates the cluster centers, until a
convergence criterion is met.
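The k-Means loop described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function names, the `seed` and `max_iter` parameters, the use of cluster means as centers, and the toy data are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, k, max_iter=100, seed=0):
    """Plain k-Means: random initial centers, then reassign/update until stable."""
    rng = random.Random(seed)
    # Pick k distinct objects as the initial cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Reassign each object to the closest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Convergence criterion (assumed): assignments no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update each center to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


# Hypothetical toy data: two well-separated groups of 2-dimensional objects.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centers = k_means(points, k=2)
```

Each pass compares every object with the k centers, which is where the linear number of comparisons per iteration (for fixed k) comes from.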

The deﬁnition of a proximity measure between ob-

jects is crucial in the clustering. Object proximity is

assessed on the basis of the attribute values describ-

ing the objects, and is usually measured by a distance

function or metric. The most commonly used metric,
at least for ratio scales and continuous features,
is the Minkowski metric, defined as

$$d_p(\mathbf{x}_i, \mathbf{x}_j) = \left( \sum_{h=1}^{m} |x_{ih} - x_{jh}|^p \right)^{1/p} = \|\mathbf{x}_i - \mathbf{x}_j\|_p ,$$

which is

a generalization of the popular Euclidean distance,

obtained when p = 2. Higher p values increase

the inﬂuence of large differences at the expense of

small differences and, from this point of view, the Eu-

clidean distance represents a good trade-off. It works

well when the objects within a collection are natu-

rally clustered in compact and convex-shaped groups,

and it is exploited to deﬁne the squared-error crite-

rion, which is the most intuitive and frequently used

criterion function in partitional clustering algorithms.

The squared-error criterion computes the sum of the

squared distance of each object from the center of

the cluster, and tries to make the resulting clusters as

compact and as separate as possible.
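As a small illustration of the two notions above, the following sketch computes the Minkowski distance and the squared-error criterion of a partition. The function names are hypothetical, and the cluster center is assumed to be the mean of the cluster members.

```python
def minkowski(a, b, p=2):
    """Minkowski distance between two vectors; p=2 gives the Euclidean distance."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1.0 / p)


def squared_error(clusters):
    """Sum of squared Euclidean distances of each object from its cluster center.

    `clusters` is a list of clusters, each a non-empty list of vectors.
    The center of a cluster is taken to be the mean of its members (assumption).
    """
    total = 0.0
    for members in clusters:
        center = [sum(col) / len(members) for col in zip(*members)]
        total += sum(minkowski(x, center, p=2) ** 2 for x in members)
    return total


euclidean = minkowski((0.0, 0.0), (3.0, 4.0), p=2)   # 5.0
manhattan = minkowski((0.0, 0.0), (3.0, 4.0), p=1)   # 7.0
sse = squared_error([[(0.0, 0.0), (2.0, 0.0)]])      # each point is 1 away from (1, 0)
```

A lower squared-error value corresponds to more compact clusters, which is the quantity that k-Means tries to reduce at each iteration.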

Quality in clustering deals with questions like how

well a clustering scheme ﬁts a given dataset, and

how many groups partition the analyzed data. Three

approaches are adopted to investigate cluster valid-

ity (Halkidi et al., 2002): external criteria, internal

criteria, and relative criteria. A pre-speciﬁed struc-

ture, which reﬂects our intuition about the clustering

structure of the dataset, is exploited by external cri-

teria to evaluate a clustering. Internal criteria are de-

ﬁned over quantities that involve the representations

of the data themselves (e.g. proximity matrix). The

basic idea of the latter approach is instead the com-

parison of different clustering schemes resulting from

the same algorithm but with different parameter val-

ues.

Our choice falls back on external criteria, since it

is particularly convenient, for our purposes, to mea-

sure the degree to which a dataset conﬁrms an a-priori

speciﬁed scheme.

3 DEA TECHNIQUE

Data Envelopment Analysis (DEA) is a linear pro-

gramming technique that has been frequently ap-

plied to assess the efﬁciency of decision-making units

(hereinafter called DMUs), where the presence of

multiple inputs, as well as outputs, makes compar-

isons difﬁcult.

The measurement of relative efﬁciency was ad-

dressed in (Farrell, 1957) and developed in (Farrell

and Fieldhouse, 1962), focusing on the creation of

a hypothetical efﬁcient unit, as a weighted average

of efﬁcient units, to act as a comparator for an in-

efﬁcient unit. The ﬁrst DEA model was introduced

in (Charnes et al., 1978) and its extensions were used for


measuring and comparing the efﬁciency of local au-

thority departments, schools, hospitals, shops, bank

branches and similar entities with homogeneous sets

of units (Chung et al., 2000; Zhu, 2002; Charnes et al.,

1994; Stern et al., 1994; Thanassoulis et al., 1987). In

the Data Mining context, (Sohn and Choi, 2001) pro-

poses using DEA in order to ﬁnd the weights involved

in multi-attribute performances of classiﬁers in a data

ensemble algorithm. A recent bibliography of DEA

including applications can be found in (Emrouznejad,

2001).

DEA is a non-parametric technique, in the sense

that it does not require any assumption about the func-

tional form relating the independent variables to the

dependent variables. Instead, the efficiency of

each DMU is computed as the ratio of a weighted

sum of outputs and a weighted sum of inputs, where

the weight sets are different for distinct DMUs and

have to be selected to maximize the efﬁciency of each

DMU.

The selection of the attributes and their partition,

as input and output parameters, plays a crucial role in

the deﬁnition of a DEA model. In other terms, a DEA

model involves not only the choice of individual at-

tributes, but also deciding whether an attribute will be

treated as an input or an output parameter.

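A DEA model can hence be formally stated as follows.
Given N DMUs with I inputs and O outputs,
let $x_{ij}$ and $y_{oj}$ be, respectively, the i-th input and
the o-th output of DMU j, and let $v_{ij}$ and $w_{oj}$ be
the corresponding weights, where j ∈ {1, . . . , N},
i ∈ {1, . . . , I}, o ∈ {1, . . . , O}. The efficiency $E_j$
of a given DMU j can be obtained by solving the
following linear program:

$$
\begin{aligned}
\max \quad & E_j = \frac{\sum_{o=1}^{O} w_{oj}\, y_{oj}}{\sum_{i=1}^{I} v_{ij}\, x_{ij}} \\
\text{subject to} \quad & \frac{\sum_{o=1}^{O} w_{oj}\, y_{ol}}{\sum_{i=1}^{I} v_{ij}\, x_{il}} \le 1 \\
& w_{oj},\ v_{ij} \ge \varepsilon
\end{aligned}
$$

where l ∈ {1, . . . , N}, i ∈ {1, . . . , I}, o ∈ {1, . . . , O}.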

The variables of the above problem are the weights

that have been chosen to maximize the efﬁciency of

a given DMU j. The ﬁrst constraint represents the

upper bound for the efficiency of all DMUs computed
with the current weights. The second constraint,
where ε is a positive value close to 0, prevents
an input or an output from being totally ignored in
determining the efficiency.

If $E_j = 1$ then DMU j is efficient with respect to
other DMUs, otherwise there is some other more
efficient DMU, even if the weights have been chosen in
favor of DMU j. In fact, the solution technique
attempts to make the efficiency $E_j$ as large as possible.

The search procedure stops when some DMU hits the

upper bound of 1. Thus, for an inefﬁcient DMU at

least another unit will be efﬁcient with the given set

of weights.
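The meaning of a relative efficiency of 1 is easiest to see in the special case of one input and one output. There the ratio constraints bound the quotient of the two weights by the inverse of the best output/input ratio among all DMUs, so the optimum reduces to a closed form. The sketch below covers only this special case, ignores the lower bound ε, and uses a hypothetical function name and toy values; the general case requires an LP solver.

```python
def dea_efficiency_1x1(inputs, outputs):
    """Relative efficiency for the one-input, one-output special case.

    With a single input x and a single output y, maximizing w*y_j / (v*x_j)
    subject to w*y_l / (v*x_l) <= 1 for all l gives
    E_j = (y_j / x_j) / max_l (y_l / x_l).
    Inputs are assumed to be strictly positive.
    """
    ratios = [y / x for x, y in zip(inputs, outputs)]
    best = max(ratios)
    return [r / best for r in ratios]


# Hypothetical toy data: three DMUs.
inputs = [2.0, 4.0, 5.0]
outputs = [4.0, 4.0, 10.0]
efficiencies = dea_efficiency_1x1(inputs, outputs)  # [1.0, 0.5, 1.0]
```

The second DMU is inefficient because, even under its most favorable weights, the first and third DMUs reach the upper bound of 1 before it does.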

The ﬂexibility in the choice of weights is both a

weakness and a strength of this approach. It is a weak-

ness because in some cases the evaluation can be more

affected by the choice of the weights than by the at-

tribute values of DMUs; on the other hand, the in-

dependence of the weights is a strength because the

evaluation of DMUs’ inefﬁciency is deﬁnitive as the

most valuable weights have been chosen.

4 A SYSTEM FOR CLASSIFYING

DMUs

We present a system for the classiﬁcation of research

centers based on different parameters involving sci-

entiﬁc results and efﬁciency indicators. For this pur-

pose, the system combines clustering algorithms and

linear programming techniques. It takes as input aggregate
information, stored in the source database,
concerning the scientific activity of research centers
and, in particular, aggregate data involving any product
of scientific activities, such as publications,
projects, citations, and patents. As the numbers
of scientific publications and citations are absolute
values, not actually useful without a comprehensive
point of reference, some scientometric indicators
(see Section 4.1) need to be taken into account.

The global classiﬁcation process is reported in Fig-

ure 1 and consists of three main steps implemented by

the following modules:

1. Indicator computation – This module takes as input

the source aggregate information about research

centers and computes some scientometric indica-

tors on the volume and quality of the scientiﬁc ac-

tivity of research centers. The output of this mod-

ule is merged with the source database.

2. Efﬁciency evaluation – The efﬁciency evaluation

is based on a given model which exploits both

source aggregate information and scientometric in-

dicators. Such a model is usually deﬁned as a DEA

problem. In this case, the efﬁciency is computed as

the result of the objective function of a DEA linear

program. For each research center, the computed

efﬁciency value is merged with the scientometric

indicators and the source information.

3. Clustering – This module provides an organization

of DMUs into homogeneous groups according to

both source and derived information.

Note that the computation of the efficiency of
DMUs depends on a model that selects, from the set of
attributes, the input parameters and the output parameters.
In Section 5, we will show how different models
(i.e. different selections of attributes) behave differently
and can lead to different classifications
of research centers.

ICEIS 2004 - ARTIFICIAL INTELLIGENCE AND DECISION SUPPORT SYSTEMS

Figure 1: The research center analysis system.

4.1 Scientometric indicators

Scientometric indicators (Schubert, 1988; Galante et

al., 1998; Okubo, 1997) aim at measuring the output

of scientiﬁc and technological research through data

derived not only from scientiﬁc literature but from

patents as well. We used two scientometric indica-

tors concerning scientiﬁc publications and citations

and deﬁned as follows.

Definition 1 Let S be a set of scientific publications,
r be a research center, and c be a scientific discipline.
The Activity Index of r with respect to a category c is
defined as

$$AI = \frac{P_{rc} / P_c}{P_r / P},$$

where $P_{rc}$ is the number
of publications of r belonging to category c, $P_c$ is the
total number of publications belonging to category c,
$P_r$ is the number of publications of r, and P is the
total number of publications in S. □

Definition 2 Let r be a research center, y be a fixed
year, and S be a set of scientific publications related
to r in the year y. The Relative Citation Rate (RCR)
of r in the year y is defined as

$$RCR = \frac{Q / J}{F / J},$$

where J is the number of publications contained in S, Q
is the number of citations received by publications in
S in the years y, y + 1, y + 2, and F is the sum of
Impact Factors of journals publishing each item in S
in y. □

The journal Impact Factor is a measure of the fre-

quency with which the “average article” in a journal

has been cited during a given year. As a consequence,

RCR provides a measure of the incoming citations for

all items in S with respect to the expected citations.

The above two indicators, together with information

contained in the source database, will be used to com-

pute the efﬁciency.
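The two indicators can be computed directly from the counts named in Definitions 1 and 2. The sketch below is only illustrative: the function and parameter names are hypothetical, the Activity Index is written in the ratio-of-shares form, and all denominators are assumed to be non-zero.

```python
def activity_index(p_rc, p_c, p_r, p_total):
    """Activity Index of center r for category c.

    p_rc: publications of r in category c
    p_c:  all publications in category c
    p_r:  all publications of r
    p_total: all publications in the set S
    """
    return (p_rc / p_c) / (p_r / p_total)


def relative_citation_rate(citations, impact_factors):
    """Relative Citation Rate of a center for a fixed year.

    citations: citations received by the publications in S (Q)
    impact_factors: Impact Factor of the journal of each publication in S,
                    so len(impact_factors) is J and their sum is F
    """
    j = len(impact_factors)
    observed = citations / j             # Q / J
    expected = sum(impact_factors) / j   # F / J
    return observed / expected


# Hypothetical toy values.
ai = activity_index(p_rc=10, p_c=100, p_r=50, p_total=1000)          # 0.1 / 0.05
rcr = relative_citation_rate(citations=12, impact_factors=[1.5, 2.5, 2.0])
```

An AI above 1 means the center is more active in category c than its overall share of publications would suggest; an RCR above 1 means its publications are cited more than expected from the journals in which they appear.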

4.2 Efﬁciency Evaluation

As described in Section 3, a suitable way to com-

pute the efﬁciency of DMUs is to solve a system of

DEA linear programs (one for each DMU) according

to a given model stating the relevance of source ag-

gregate information and indicators. The results of the

DEA problems consist of the values assigned to the

weights which maximize the objective functions (i.e.

efﬁciency of DMUs). In the following, we will de-

ﬁne different DEA models each of which is based on

different selections of attributes that will be used, re-

spectively, as input and output parameters.

In order to apply linear programming methods, a

DEA problem needs to be converted into a linear

form. This can be obtained by setting the denominator

of the objective function equal to a constant (e.g. 1)

and maximizing its numerator. The resultant DEA

problem for a given DMU j is deﬁned as follows:

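$$
\begin{aligned}
\max \quad & E_j = \sum_{o=1}^{O} w_{oj}\, y_{oj} \\
\text{subject to} \quad & \sum_{i=1}^{I} v_{ij}\, x_{ij} = 1 \\
& \sum_{o=1}^{O} w_{oj}\, y_{ol} - \sum_{i=1}^{I} v_{ij}\, x_{il} \le 0 \\
& w_{oj},\ v_{ij} \ge \varepsilon
\end{aligned}
$$

where l ∈ {1, . . . , N}, i ∈ {1, . . . , I}, o ∈ {1, . . . , O}.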

Note that the introduction of the first constraint, which
normalizes the weighted sum of inputs, leads to the
transformation of the problem into linear form.
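The passage from the ratio form to the linear form can be spelled out in a few steps. The derivation below is a sketch of the usual scale-invariance argument; the scaling factor t is introduced here only for illustration, weighted input sums are assumed to be strictly positive, and the effect of rescaling on the lower bound ε is not treated.

```latex
% Scaling all weights of DMU j by any t > 0 leaves every efficiency ratio unchanged:
\frac{\sum_{o=1}^{O} (t\, w_{oj})\, y_{ol}}{\sum_{i=1}^{I} (t\, v_{ij})\, x_{il}}
  = \frac{\sum_{o=1}^{O} w_{oj}\, y_{ol}}{\sum_{i=1}^{I} v_{ij}\, x_{il}}
  \qquad \forall\, l \in \{1, \ldots, N\}.

% Hence t can be chosen so that the denominator of the objective equals 1:
\sum_{i=1}^{I} v_{ij}\, x_{ij} = 1
  \quad\Longrightarrow\quad
  E_j = \sum_{o=1}^{O} w_{oj}\, y_{oj}.

% Multiplying each ratio constraint by its (positive) denominator makes it linear:
\frac{\sum_{o=1}^{O} w_{oj}\, y_{ol}}{\sum_{i=1}^{I} v_{ij}\, x_{il}} \le 1
  \quad\Longleftrightarrow\quad
  \sum_{o=1}^{O} w_{oj}\, y_{ol} - \sum_{i=1}^{I} v_{ij}\, x_{il} \le 0 .
```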

4.3 Clustering of DMUs

Clustering of DMUs aims at identifying homogeneous
groups of DMUs that are similar from the point of view of
scientific activity. Formally, the problem can be
stated as follows: given a set $U = \{u_1, \ldots, u_N\}$ of
DMUs, find a suitable partition $P = \{C_1, \ldots, C_k\}$ of
U in k groups such that each group contains a homogeneous
subset of DMUs.

In our context, the notion of homogeneity can be

measured by exploiting, as attributes of DMUs, the

information previously presented. Each DMU is represented
as a multidimensional vector (Baeza-Yates
and Ribeiro-Neto, 1999). Moreover, for our purposes
it is particularly convenient to adopt a Euclidean
metric, since all the attributes have numeric

values. However, if the Euclidean metric is used

directly, some attributes (such as the ones corre-

sponding to absolute indicators) can exhibit a domi-

nant effect over other ones that have a smaller scale

of measurement. In order to avoid this, for each
DMU j we normalize all the attribute values to fall
within the range [0,1]. For each attribute $z_{pj}$, the
corresponding attribute with normalized value is defined as

$$a_{pj} = \frac{z_{pj} - \min(z_p)}{\max(z_p) - \min(z_p)},$$

where $z_{pj} \in \{x_{1j}, \ldots, x_{Ij}, y_{1j}, \ldots, y_{Oj}, E_j\}$ is the actual value of
the p-th attribute of DMU j, $z_p = \{z_{p1}, \ldots, z_{pN}\}$ is
the set of values assigned to the same attribute of distinct
DMUs, and $\max(z_p)$ and $\min(z_p)$ compute,
respectively, the maximum and the minimum value over
all DMUs.
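The normalization step can be sketched as a column-wise min-max rescaling of the DMU-by-attribute matrix. The function name and the toy data are hypothetical, and mapping a constant attribute to 0 is an assumption made here to avoid a division by zero; the text does not discuss that case.

```python
def min_max_normalize(rows):
    """Rescale every attribute (column) of a DMU-by-attribute matrix to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            # Constant columns are mapped to 0.0 (assumption, see lead-in).
            (z - lo) / (hi - lo) if hi > lo else 0.0
            for z, lo, hi in zip(row, lows, highs)
        ])
    return normalized


# Hypothetical toy data: three DMUs described by two attributes on different scales.
normalized = min_max_normalize([[1.0, 10.0], [3.0, 20.0], [5.0, 40.0]])
```

After this step both attributes contribute on the same scale to the Euclidean distance, so an absolute indicator with large values no longer dominates.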

5 CLASSIFICATION OF

RESEARCH CENTERS

Data Description

Our source database is composed of data related to

research centers in the agro-food sector. In par-

ticular, we have collected more than 3600 projects

and 8800 scientiﬁc publications, covering the period

1983-2000. We have also collected 2000 European or

international patents, mostly those of 1999. Information
about patents comes from the PATLIB Center, an

Italian information center for patents, whereas infor-

mation about projects and scientiﬁc publications has

been retrieved mostly through the CORDIS (Commu-

nity Research and Development Information Service)

site. In addition, we obtained information on about

15000 scientiﬁc publications with their bibliographic

references, related to the years 1998, 1999 and 2000.

For each research center r we extracted and stored

information comprising the attributes described
in Table 1.

Table 1: Attributes of research centers.

attribute   description
NPrj        Nr. of projects in which r is involved
NPub        Nr. of scientific publications financed by r
NPat        Nr. of patents financed by r
NCit        Nr. of incoming citations of publications financed by r
AI          AI value for r
RCR         RCR value for r

DEA models

In order to measure the efficiency of research centers
we defined different DEA models, by considering
different combinations of input and output attributes.
The models used in our experiments are reported in
Table 2, where we considered related attributes once
(e.g. we considered either NPub or AI and either
NCit or RCR). Observe that two models ($M_7$ and
$M_8$) take as input the efficiency computed by other
models ($M_1$ and $M_2$).

It is worth noticing that the models define the input
and the output parameters used in the
DEA linear programs differently. For instance, in the first model
($M_1$), we measured the efficiency of the research centers
that have been involved in projects, evaluating
their productivity in terms of patents and scientific
publications. In the last two models we tried to assess
efficiency variations of organizations during the
time period by using a global efficiency measure (e.g.
$E(M_1)$ and $E(M_2)$) and the parameters related to the
number of citations (e.g. NCit and RCR).

Table 2: Models for efficiency evaluation.

model   input param.        output param.
M_1     [ NPrj ]            [ NPub, NPat ]
M_2     [ NPrj ]            [ AI, NPat ]
M_3     [ NPrj ]            [ NPat, NPub, NCit ]
M_4     [ NPrj ]            [ NPat, AI, RCR ]
M_5     [ NPrj ]            [ NCit, NPub ]
M_6     [ NPrj ]            [ RCR, AI ]
M_7     [ E(M_1), NPub ]    [ NCit ]
M_8     [ E(M_2), AI ]      [ RCR ]

Clustering results

DMUs could be clustered on the basis of their efficiency
computed using the DEA technique. DEA usually
provides good results because it assesses the relative
efficiency values by choosing the most favorable weight
set for each DMU. However, in some cases, the
evaluation can be more affected by the choice of the
weights than by the attribute values of DMUs. Consider,
for instance, two clusters based on the efficiency
values calculated by model M reported in Table 3.
Observe that the partition is quite good, but the first
cluster, which is characterized by high efficiency values,
contains an outlier, DMU 7, whose scientific features
are very close to those of the second cluster. In this case,
very low input values (for the attribute NPrj) misleadingly
result in a high efficiency value.

Table 3: Classification of DMUs based on the M model.

DMU   NPrj    RCR     AI      E
...   ...     ...     ...     ...
70    0.001   0.501   0.410   0.722
96    0.001   0.257   0.562   0.828
9     0.001   0.600   0.501   0.877
39    0.001   1.000   0.480   1.000
7     0.001   0.098   0.740   1.000
...   ...     ...     ...     ...
42    0.023   0       0.794   0.267
65    0.015   0       0.659   0.296
...   ...     ...     ...     ...

The above observation suggests a different classification
of DMUs, where the clustering algorithm takes
into account not only the efficiency computed by
means of the DEA technique, but also source aggregate data
and scientometric indicators. We ran the k-Means
algorithm in several experiments, trying different
values of k for each model defined previously.
As an example, in the portion of data reported
in Table 4 we considered all the attributes together
with the efficiency value. The clustering of DMUs,
under the model M, assigned DMU 7 to cluster 7 instead
of cluster 4. This solution is more appropriate as
DMU 7 is very close to the other DMUs in cluster 7,
whereas the degree of similarity between DMU 7 and
DMUs belonging to cluster 4 is very low.

Table 4: Clustering of DMUs based on the M model.

DMU   NPrj    RCR     AI      E       cluster
...   ...     ...     ...     ...     ...
70    0.001   0.501   0.410   0.722   4
96    0.001   0.257   0.562   0.828   4
9     0.001   0.600   0.501   0.877   4
39    0.001   1.000   0.480   1.000   4
...   ...     ...     ...     ...     ...
42    0.023   0       0.794   0.267   7
65    0.015   0       0.659   0.296   7
7     0.001   0.098   0.740   1.000   7
...   ...     ...     ...     ...     ...

To sum up, the proposed technique for classifying
research centers on the basis of their performance
consists of two main steps: i) efficiency evaluation,
which is performed using DEA-based techniques, and
ii) clustering of DMUs, which considers the efficiency
values together with other model attributes. The first
step provides a value that expresses the relative
performance of each research center, while the second
one acts as a further refinement through the classification
of research centers, so that DMUs with similar
efficiency values can be assigned to different clusters.
In some sense, this process is similar to the identification
of relevant web pages (corresponding to DMUs
with high efficiency values) and the identification of
web communities (clusters of web pages with high
numbers of co-citations¹). Obviously, if we derive
large clusters, the clustering process can be further
refined by applying the algorithm to the distinct clusters.

6 APPROXIMATE EFFICIENCY

MEASURE

The problem in measuring the efficiency with the
above approach is that the DEA technique can be
computationally expensive and cannot be applied to
large datasets such as those currently used in Data
Mining. In fact, the computation of the efficiency
of DMUs consists of solving N DEA linear
programs whose solutions give us a suitable combination
of weights that maximizes the objective function.
DEA is good at estimating the “relative” efficiency
but not the “absolute” efficiency of DMUs; it can tell
you how well you are doing compared to your peers
but not compared to a “theoretical maximum”.

¹ Two web pages are “similar” if there is a significant
number of pages containing links to both of them.

As said before, a crucial issue in DEA problems

is the computational complexity. To address such an

issue, we propose an alternative way to compute the

efﬁciency of DMUs. Our idea is to deﬁne an approx-

imation of the DEA-efﬁciency measure, by simply

considering the objective function of a DEA model

(provided that suitable weights are given), and then

normalizing all the attributes as explained in Section 4.3.
Formally, our approximate efficiency measure
is defined as:

$$\eta_j = \frac{\sum_{o=1}^{O} w_o\, y_{oj}}{\sum_{i=1}^{I} v_i\, x_{ij}}$$

In order to minimize $|\hat{E}_j - \hat{\eta}_j|$, where $\hat{E}_j$ and $\hat{\eta}_j$
denote the normalized values of $E_j$ and $\eta_j$ respectively,
suitable weight sets for the computation of $\eta_j$ have to
be found.

6.1 Weight assignments

For each model M , obtained by selecting a set of I

input attributes and a set of O output attributes, we

deﬁned the input assignment set, denoted by V , as

the list of values assigned to the weights of the input

attributes; in an analogous way, we deﬁned the output

assignment set, denoted by W .

Note that, since η has a fractional form and $\hat{\eta}$
denotes the normalized value of η, some weight assignments
can provide the same values of $\hat{\eta}$. In such
a case we say that the two weight assignments are
equivalent.

Definition 3 Two weight assignments $\Phi_1 = [V_1, W_1]$
and $\Phi_2 = [V_2, W_2]$, used to compute the approximate
efficiency measures $\eta_1$ and $\eta_2$ respectively, are
equivalent if $\hat{\eta}_1 = \hat{\eta}_2$. □

Moreover, a sufficient condition for the
equivalence of two assignments is the proportionality
between their input weight values and between their
output weight values. Formally, this can be stated by the
following proposition:

Proposition 1. Two weight assignments $\Phi_1$ and $\Phi_2$
are equivalent if $V_1[i] = c_V\, V_2[i]$, ∀i ∈ {1, . . . , I} and
$W_1[o] = c_W\, W_2[o]$, ∀o ∈ {1, . . . , O}, where $c_V$
and $c_W$ are positive constants. □


As a consequence, we have the following corollary:

Corollary 1 For each assignment Φ = [V, W] there
exists a corresponding equivalent assignment
$\tilde{\Phi} = [\tilde{V}, \tilde{W}]$ such that
$\tilde{V}[1] = 1$ and $\tilde{W}[1] = 1$. □

From a practical point of view, the above corollary
means that we can perform a comparative analysis by
setting an element of $\tilde{V}$ and an element of $\tilde{W}$ to 1, and
then trying different combinations for the remaining
attribute weights. Thus, the number of parameters is
reduced to I + O − 2.
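The approximate measure and the equivalence of proportional weight assignments can be checked numerically. The sketch below is only an illustration: the function names and the toy DMUs are hypothetical, attribute values are assumed to be already normalized with strictly positive weighted input sums, and the normalization of η is assumed to be the min-max rescaling of Section 4.3.

```python
def eta(inputs, outputs, v, w):
    """Approximate efficiency: weighted sum of outputs over weighted sum of inputs."""
    return (sum(wo * y for wo, y in zip(w, outputs))
            / sum(vi * x for vi, x in zip(v, inputs)))


def normalize(values):
    """Min-max rescaling of a list of values to [0, 1] (values must not all be equal)."""
    lo, hi = min(values), max(values)
    return [(z - lo) / (hi - lo) for z in values]


# Hypothetical toy DMUs: (inputs, outputs) with one input and two outputs each.
dmus = [([2.0], [4.0, 1.0]), ([4.0], [4.0, 3.0]), ([5.0], [10.0, 2.0])]


def normalized_eta(v, w):
    return normalize([eta(x, y, v, w) for x, y in dmus])


# Two proportional assignments: V scaled by 4, W scaled by 2.
first = normalized_eta([1.0], [1.0, 0.1])
second = normalized_eta([4.0], [2.0, 0.2])
```

Scaling V and W by positive constants multiplies every η value by the same positive factor, which the min-max normalization cancels; this is why one input weight and one output weight can be fixed to 1 without loss of generality.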

6.2 Experimental results

To evaluate the effectiveness of our approximate efﬁ-

ciency measure, we carried out a comparative anal-

ysis trying different combinations for the attribute

weights. We performed experiments on two differ-

ent datasets, containing respectively 540 and 134 re-

search centers. We have used the models M

and M

for the larger dataset and the other models for the
smaller dataset. Table 5 shows two different value
assignments for the attribute weights, for each model.
The vector notation matches the list of attributes
selected for each model (see Table 2).

Table 5: Best settings of attribute weights.

model Φ

[1], [1, 1] [1], [1, 0.01]

[1], [1, 1] [1], [1, 0.001]

[1], [1, 1, 1] [1], [1, 0.1, 0.1]

[1], [1, 1] [1], [1, 0.001]

[1], [1, 1] [1], [1, 0.1]

[1, 1], [1] [1, 0.001], [1]

Figure 2 shows a comparison of the η measure with
respect to the DEA-efficiency measure (i.e. $|\hat{E}_j - \hat{\eta}_j|$)
relative to the model M. As we can see, high error
peaks are very few, whereas most of the error values
are below 0.2; this behavior is also confirmed
for the remaining models. Thus, the η measure
works as a good approximation of the DEA-efficiency.
Moreover, we can take advantage of the
fact that an approximate efficiency measure, such as
η, allows a favorable trade-off between accuracy and
efficiency, since its computation is not as expensive as
solving a DEA problem.

6.3 Clustering quality results

To evaluate the outcome of a clustering process, it is
important to check whether the computed clusters can
be considered to be of good quality. This can be done by
comparing the clusters with an ideal categorization of
DMUs. In our context, an ideal partition is defined
as the result of the clustering algorithm applied to a
given set of DMUs whose attributes include the DEA-efficiency
measure together with source aggregate
information and scientometric indicators.

Figure 2: Error rates of the η measure with respect to the DEA-efficiency
measure ($|\hat{E}_j - \hat{\eta}_j|$), according to the $\Phi_1$ (a) and $\Phi_2$
(b) weight combinations.

In the experiments, our aim was to compare the
ideal categorization $\Pi = \{\gamma_1, \ldots, \gamma_h\}$ of a set U of
DMUs to a clustering scheme $P = \{C_1, \ldots, C_k\}$ of
a set $U'$, where $U'$ was derived from U by replacing all
DEA-efficiency values with the corresponding η efficiency
values. The quality of P with respect to Π can
be evaluated by exploiting several quality measures.
In this work, we used the standard F-measure (Baeza-Yates
and Ribeiro-Neto, 1999): higher values of
the measure mean higher quality of clusters. Values
in the range [0.7, 1] are typical of good clusters.
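The comparison of a clustering scheme against an ideal categorization can be sketched with one common formulation of the clustering F-measure. The text does not spell out the exact variant it uses, so the definition below is an assumption: each ideal class is matched with its best-scoring cluster, and the per-class scores are averaged with weights proportional to class size. The function name and the toy partitions are hypothetical.

```python
def f_measure(ideal, clustering):
    """Overall F-measure of `clustering` with respect to the `ideal` partition.

    Both arguments are lists of groups, each group being a collection of DMU ids.
    For a class g and a cluster c: precision = |g & c| / |c|, recall = |g & c| / |g|,
    F = 2 * precision * recall / (precision + recall).
    """
    n = sum(len(g) for g in ideal)
    total = 0.0
    for g in ideal:
        g = set(g)
        best = 0.0
        for c in clustering:
            c = set(c)
            overlap = len(g & c)
            if overlap == 0:
                continue
            precision = overlap / len(c)
            recall = overlap / len(g)
            best = max(best, 2 * precision * recall / (precision + recall))
        # Weight the best score of each class by its relative size.
        total += len(g) / n * best
    return total


ideal = [[1, 2, 3], [4, 5, 6]]
same = f_measure(ideal, [[1, 2, 3], [4, 5, 6]])     # identical partitions
shifted = f_measure(ideal, [[1, 2], [3, 4, 5, 6]])  # one DMU in the wrong cluster
```

Under this formulation a value of 1 means that the clustering obtained with η reproduces the ideal partition exactly, and the value decreases as DMUs move to different clusters.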

We performed several experiments for each model
with a different number of clusters. Figure 3 contains
the summarized information for the case of 20 clusters.
The high values of F-measure suggest that our η
efficiency measure is a good approximation of DEA-efficiency
for all the models. Moreover, there exists
a model, M, such that the approximate technique
provides the same results, and this behavior holds
for any number of clusters. This means that the DEA
efficiency measure can be substituted with the approximate
measure, which improves the performance of our
technique. This is particularly important in the case
of large datasets.

Figure 3: Clustering quality results.

It is important to note that, while DEA techniques are non-parametric (i.e. the weights of the parameters are computed by solving linear programs), the computation of the approximate efficiency requires a weight to be assigned to each parameter. Our experiments have shown that, for some models (e.g. model M ), assigning arbitrary weight values (selected without knowing the production function) gives a good approximation of DEA. In any case, a good set of weight values can be chosen by comparing DEA and the approximate technique on small datasets.
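The weight selection on a small dataset can be sketched as a simple search over candidate weight vectors. The sketch below assumes that the DEA-efficiency values of the sample have already been computed by a linear programming solver (not shown) and picks the candidate with the lowest mean error |E − η|. The function names and the `toy_eta` formula are hypothetical and serve only to make the example runnable; `toy_eta` is not the η measure defined in this paper.

```python
def mean_abs_error(dmus, dea_eff, eta, weights):
    """Mean of |E - eta| over a sample, for one candidate weight vector."""
    if not dmus or len(dmus) != len(dea_eff):
        raise ValueError("need one DEA-efficiency value per DMU")
    errors = (abs(e - eta(d, weights)) for d, e in zip(dmus, dea_eff))
    return sum(errors) / len(dmus)


def pick_weights(dmus, dea_eff, eta, candidates):
    """Return the candidate weight vector that best reproduces DEA on the sample."""
    return min(candidates, key=lambda w: mean_abs_error(dmus, dea_eff, eta, w))


def toy_eta(dmu, weights):
    """HYPOTHETICAL stand-in for the approximate measure (not the paper's formula).

    dmu: (input, output_1, ..., output_m); returns a weighted output/input
    ratio capped at 1.
    """
    x, *outputs = dmu
    return min(1.0, sum(w * y for w, y in zip(weights, outputs)) / x)
```

A call such as `pick_weights(sample, dea_values, eta, candidates)` would then be run once on the small sample, and the selected weights reused on the full dataset, where solving the linear programs is the expensive step.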

7 CONCLUSIONS

We have presented a technique for the classification of organizational units, such as research centers, according to information on the volume and quality of their scientific activity. Such information involves aggregate data and scientometric indicators and allows efficiency values to be computed for the productivity of research centers. We also proposed an alternative efficiency measure that approximates DEA well, with the advantage of not requiring the solution of N linear programs. The classification process, based on clustering algorithms, was tested in several experiments, showing a high degree of efficiency and effectiveness in the research center context.

REFERENCES

Baeza-Yates, R., and Ribeiro-Neto, B. (1999). Modern Information Retrieval. ACM Press Books, Addison Wesley.

Charnes, A., Cooper, W. W., Lewin, A. Y., and Seiford, L. M. (1994). Data Envelopment Analysis: Theory, Methodology and Applications. Kluwer Academic Publishers.

Charnes, A., Cooper, W. W., and Rhodes, E. (1978). Measuring the efficiency of decision making units. European Journal of Operational Research 2, 429–444.

Chung, S. H., Yang, Y. S., and Wu, T.-H. (2000). Evaluating the Efficiency of University via DEA approach. In Proc. 5th Annual Int. Conf. on Industrial Engineering Theory, Applications and Practice.

Emrouznejad, A. (2001). An Extensive Bibliography of Data Envelopment Analysis. Tech. Rep., Business School, Univ. of Warwick,

Farrell, M. J. (1957). The measurement of productive efficiency. J. R. Statist. Soc., Series A 120, 253–281.

Farrell, M. J., and Fieldhouse, M. (1962). Estimating efficient production functions under increasing returns to scale. J. R. Statist. Soc., Series A 125, 252–267.

Galante, E., Sala, C., and Lanini, I. (1998). Valutazione della ricerca agricola, Franco Angeli (ed.), Milano.

Halkidi, M., Batistakis, Y., and Vazirgiannis, M. (2002). Cluster Validity Methods: Part I. SIGMOD Record 31(2), 40–45.

Jain, A. K., and Dubes, R. C. (1988). Algorithms for Clustering Data. Prentice-Hall advanced reference series.

Okubo, Y. (1997). Bibliometric indicators and analysis of research systems: Methods and examples. OECD, WP#1.

Schubert, A., Glaenzel, W., and Braun, T. (1988). Against Absolute Methods: Relative Scientometric Indicators and Relational Charts as Evaluation Tools. In Handbook of Quantitative Studies of Science and Technology, van Raan, A. F. J. (ed.), North-Holland, Amsterdam.

Sohn, S. Y., and Choi, H. (2001). Ensemble Based on Data Envelopment Analysis. In Proc. Aspects of Data Mining, Decision Support and Meta-Learning, 129–137.

Stern, Z. S., Mehrez, A., and Barboy, A. (1994). Academic departments efficiency via DEA. Computers and Operations Research 21(5), 543–556.

Thanassoulis, E., Dyson, R. G., and Foster, M. J. (1987). Relative Efficiency Assessments using Data Envelopment Analysis: an Application to Data on Rates Departments. J. Opl. Res. Soc. 38, 397–412.

Viveros, M. S., Nearhos, J. P., and Rothman, M. J. (1996). Applying Data Mining Techniques to a Health Insurance Information System. In 22nd VLDB Conf., 286–294.

Zhu, J. (2002). Quantitative Models for Performance Evaluation and Benchmarking: Data Envelopment Analysis with Spreadsheets and DEA Excel Solver. Kluwer Academic Publishers, Boston.
