
mance information from research centers. Then, we exploit clustering algorithms to organize such information, and we evaluate the accuracy of the proposed approach.
The remainder of this paper is organized as follows. The next section gives a short overview of the clustering process, tailored to our purposes. Section 3 presents DEA, a linear programming based technique for measuring the efficiency of organizational units. Section 4 illustrates the overall architecture and the features of a system for the classification of research centers. Section 5 describes a methodology for organizing research centers based on models computing their efficiency. Section 6 proposes an alternative way to compute the efficiency of research centers; this section ends by reporting the experimental evaluation, which shows the effectiveness of our approach. Finally, Section 7 contains concluding remarks.
2 DATA CLUSTERING
Clustering is the task of organizing a collection of objects (whose classification is unknown) into meaningful or useful groups, called clusters, based on interesting relationships discovered in the data. The goal is that objects within a cluster be highly similar to each other, while being very dissimilar from objects in other clusters. The greater the homogeneity within groups and the heterogeneity between groups, the better the resulting partition.
A first stage in a typical clustering task is the definition of a model to represent the objects, drawn from the same feature space. Typically, an object is represented as a multidimensional vector, where each dimension is a single feature. Formally, given an m-dimensional space, an object x is a single data point and consists of a vector of m measurements: x = (x_1, ..., x_m). A set of n objects X = {x_1, ..., x_n} to be clustered is in the form of an object-by-attribute structure, i.e. an n-by-m matrix. The scalar components x_i of x are called features or attributes.
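As a purely illustrative sketch (the feature values below are invented, not drawn from our dataset), such an object-by-attribute structure maps naturally onto a NumPy array:

```python
import numpy as np

# n = 4 objects, m = 3 features: an n-by-m object-by-attribute matrix.
# Feature values are invented, purely for illustration.
X = np.array([
    [0.5, 1.2, 3.0],   # x_1
    [0.6, 1.0, 2.8],   # x_2
    [5.1, 0.2, 0.9],   # x_3
    [4.9, 0.1, 1.1],   # x_4
])
n, m = X.shape  # 4 objects, 3 features each
```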
Many different clustering algorithms can be exploited (Jain and Dubes, 1988); partitional and hierarchical techniques are by far the most popular and important ones. In this work, we exploit the well-known k-Means partitional algorithm, which has the main advantage of requiring O(n) comparisons per iteration while guaranteeing a good quality of clusters. The algorithm starts by randomly choosing k objects as the initial cluster centers. Then it iteratively reassigns each object to the closest cluster, based on the proximity between the object and the cluster center, until a convergence criterion is met.
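For concreteness, the following is a minimal sketch of the k-Means loop just described, assuming the NumPy object-by-attribute matrix introduced above and Euclidean proximity; it is a didactic sketch, not the exact implementation used in our system.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal k-Means: random initial centers, then iterative reassignment.
    Assumes no cluster becomes empty during the iterations."""
    rng = np.random.default_rng(seed)
    # Randomly choose k objects as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Reassign each object to the cluster with the closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned objects.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence criterion: the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

For instance, `labels, centers = k_means(X, k=2)` partitions the toy matrix above into two clusters.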
The definition of a proximity measure between objects is crucial in clustering. Object proximity is assessed on the basis of the attribute values describing the objects, and is usually measured by a distance function or metric. The most commonly used metric, at least for ratio scales and continuous features,
is the Minkowski metric, defined as d_M(x_i, x_j) = (∑_{h=1}^{m} |x_{ih} − x_{jh}|^p)^{1/p} = ||x_i − x_j||_p, which is
a generalization of the popular Euclidean distance, obtained when p = 2. Higher values of p increase the influence of large differences at the expense of small ones and, from this point of view, the Euclidean distance represents a good trade-off. It works well when the objects within a collection are naturally clustered in compact and convex-shaped groups, and it is exploited to define the squared-error criterion, the most intuitive and frequently used criterion function in partitional clustering algorithms. The squared-error criterion computes the sum of the squared distances of each object from the center of its cluster, and minimizing it tends to make the resulting clusters as compact and as separate as possible.
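To make these definitions concrete, here is a small sketch of the Minkowski metric and of the squared-error criterion (function names are our own, for illustration):

```python
import numpy as np

def minkowski(xi, xj, p=2):
    """Minkowski distance between two objects; p = 2 gives the Euclidean distance."""
    return (np.abs(xi - xj) ** p).sum() ** (1.0 / p)

def squared_error(X, labels, centers):
    """Sum of the squared Euclidean distances of each object
    from the center of its cluster."""
    return sum(np.sum((X[labels == j] - c) ** 2)
               for j, c in enumerate(centers))
```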
Quality in clustering deals with questions such as how well a clustering scheme fits a given dataset, and how many groups the analyzed data should be partitioned into. Three approaches are adopted to investigate cluster validity (Halkidi et al., 2002): external criteria, internal criteria, and relative criteria. External criteria evaluate a clustering against a pre-specified structure that reflects our intuition about the clustering structure of the dataset. Internal criteria are defined over quantities that involve the representations of the data themselves (e.g. the proximity matrix). Relative criteria, instead, compare different clustering schemes resulting from the same algorithm run with different parameter values.
Our choice falls on external criteria, since it is particularly convenient, for our purposes, to measure the degree to which a dataset confirms an a-priori specified scheme.
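As an illustration of an external criterion, the following sketch computes the purity of a clustering against an a-priori labeling (one possible external measure, not necessarily the one adopted in our experiments):

```python
import numpy as np

def purity(cluster_labels, true_labels):
    """External criterion: fraction of objects whose cluster's majority
    reference class matches their own reference class."""
    cluster_labels = np.asarray(cluster_labels)
    true_labels = np.asarray(true_labels)
    total = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        # Count the most common reference class within this cluster.
        _, counts = np.unique(members, return_counts=True)
        total += counts.max()
    return total / len(true_labels)
```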
3 DEA TECHNIQUE
Data Envelopment Analysis (DEA) is a linear programming technique that has been frequently applied to assess the efficiency of decision-making units (hereinafter called DMUs) in settings where the presence of multiple inputs, as well as multiple outputs, makes comparisons difficult.
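As a minimal sketch of how such an assessment can be cast as a linear program, the classic input-oriented CCR model of (Charnes et al., 1978), in its linearized multiplier form, can be solved with an off-the-shelf LP solver; the code below is illustrative and is not the exact formulation developed later in the paper:

```python
import numpy as np
from scipy.optimize import linprog

def ccr_efficiency(inputs, outputs, o):
    """Input-oriented CCR model (multiplier form) for DMU o.

    inputs:  n-by-s matrix of input values,  one row per DMU
    outputs: n-by-t matrix of output values, one row per DMU
    Maximize u . outputs[o]  subject to  v . inputs[o] = 1  and
    u . outputs[j] - v . inputs[j] <= 0 for every DMU j, with u, v >= 0.
    """
    n, s = inputs.shape
    t = outputs.shape[1]
    # Decision variables: [v (s input weights), u (t output weights)].
    c = np.concatenate([np.zeros(s), -outputs[o]])         # linprog minimizes
    A_ub = np.hstack([-inputs, outputs])                   # u.y_j - v.x_j <= 0
    b_ub = np.zeros(n)
    A_eq = np.concatenate([inputs[o], np.zeros(t)])[None]  # v.x_o = 1
    b_eq = [1.0]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (s + t))
    return -res.fun  # efficiency score of DMU o
```

For each DMU o, the returned score lies in (0, 1], with 1 marking units on the efficient frontier.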
The measurement of relative efficiency was first addressed in (Farrell, 1957) and developed in (Farrell and Fieldhouse, 1962), focusing on the construction of a hypothetical efficient unit, as a weighted average of efficient units, to act as a comparator for an inefficient unit. The first DEA model was introduced in (Charnes et al., 1978) and its extensions were used for