CLASSIFICATION BY SUCCESSIVE NEIGHBORHOOD
David Grosser, Henri Ralambondrainy and Noel Conruyt
Laboratoire d’Informatique et Mathématiques, Université de la Réunion, 97490 Sainte-Clotilde, France
Keywords:
Classification, Similarity, Nearest neighbors, Structured data, Systematics.
Abstract:
Formalization of scientific knowledge in the life sciences by experts in biology or Systematics produces
arborescent representations whose values may be present, absent or unknown. To improve the robustness of the
classification of these complex objects, which are often only partially described, we propose a new
classification method that is iterative, interactive and semi-directed. It combines inductive techniques for
choosing discriminating variables with a search for nearest neighbors based on various similarity measures,
which take into account both the structure and the values of the objects when computing the neighborhood.
1 INTRODUCTION
Systematics is the scientific discipline that deals with
listing, describing, naming, classifying and identifying
living beings. In the context of environmental sciences,
acquiring and producing knowledge about biological
specimens and taxa is an essential part of the work of
systematists (Winston, 1999). Being able to describe,
classify and identify a specimen from its morphological
characters is a first step in monitoring biodiversity,
because it gives access to the information attached to
its species name (biology, geography, ecology, etc.).
This process can be assisted by computer-based decision
support tools. In return, such complex domains pose
interesting symbolic and numerical knowledge
representation and processing problems for the knowledge
engineering and computer science community.
Classical discrimination methods developed in data
analysis or machine learning, such as classification or
decision trees (Breiman et al., 1984; Quinlan, 1986), or
more recent methods from the data mining field, such as
association rule mining (Piatetsky-Shapiro, 1991) or
multifactor dimensionality reduction (Zhu and Davidson,
2007), are not sufficient here: they do not cope with
relations between attributes or with missing data, and
they are not very tolerant of errors in descriptions.
The problem we address is to determine the class of a
structured description that is only partially filled in
and may contain errors, using a reference case base that
has been classified a priori by qualified experts into k
classes. The proposed discrimination method proceeds by
inferring successive neighborhoods. It is inductive,
interactive, iterative and semi-directed. It combines
inductive techniques for selecting discriminating
variables with a nearest-neighbor search, using a
similarity measure that takes into account both the
structure (dependencies between variables) and the
content (missing and unknown values) of the descriptions.
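To make the idea of a neighbor search that tolerates unknown values concrete, here is a minimal sketch. It is not the similarity measure of the proposed method: it assumes flat (unstructured) descriptions, uses a simple matching coefficient restricted to attributes known in both descriptions, and ends with a plain majority vote. The names `similarity`, `nearest_neighbors`, `vote`, and the toy attributes and classes, are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# A flat description: attribute name -> value, with None marking an unknown value.
Description = Dict[str, Optional[str]]
Case = Tuple[Description, str]  # (description, class assigned by an expert)


def similarity(a: Description, b: Description) -> float:
    """Share of matching values among the attributes known in both descriptions."""
    shared = [k for k in a if k in b and a[k] is not None and b[k] is not None]
    if not shared:
        return 0.0  # assumption: nothing comparable counts as no evidence
    return sum(a[k] == b[k] for k in shared) / len(shared)


def nearest_neighbors(query: Description, cases: List[Case], k: int) -> List[Case]:
    """Return the k cases most similar to the query (ties keep case-base order)."""
    ranked = sorted(cases, key=lambda case: similarity(query, case[0]), reverse=True)
    return ranked[:k]


def vote(neighbors: List[Case]) -> str:
    """Majority class among the neighbors."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


# Hypothetical toy case base, for illustration only.
cases: List[Case] = [
    ({"shape": "round", "color": "red", "spines": "absent"}, "class_a"),
    ({"shape": "round", "color": "red", "spines": None}, "class_a"),
    ({"shape": "flat", "color": "brown", "spines": "present"}, "class_b"),
    ({"shape": "flat", "color": "red", "spines": "present"}, "class_b"),
]
query: Description = {"shape": "round", "color": None, "spines": "absent"}

neighbors = nearest_neighbors(query, cases, k=2)
predicted = vote(neighbors)
```

Skipping unknown values instead of counting them as mismatches keeps a partially filled description from being penalized for what has not been observed yet. A structure-aware measure would additionally have to account for dependencies between variables, which this sketch ignores.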
2 DATA REPRESENTATION
Within a knowledge base, observations are described
with the help of descriptive models. A descriptive
model represents ontological knowledge about the
considered domain and contains the structure and
organization of the descriptors.
2.1 The Descriptive Model
The descriptive model (Fig. 1), or schema, is a
rooted tree $M = (A, U)$, where $A$ is a set of
nodes (attributes) and $U$ a set of edges. Leaves
are simple classical attributes, such as numerical or
nominal ones, called "basic attributes". Internal nodes
are "structured attributes", that is, sub-trees made of
several attributes. For example,
$A_j : \langle A_1, \ldots, A_p \rangle$ denotes a
structured attribute where $A_j$ is the root of the
sub-tree and its sons $A_1, \ldots, A_p$, structured or
basic attributes, are the components of $A_j$.
Grosser D., Ralambondrainy H. and Conruyt N. (2009).
CLASSIFICATION BY SUCCESSIVE NEIGHBORHOOD.
In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, pages 288-291
DOI: 10.5220/0002299002880291
Copyright © SciTePress