the results obtained from applying the proposed ap-
proach on heterogeneous documents. Finally, we con-
clude by summarizing the contribution of this work
and giving some perspectives in section 5.
2 RELATED WORK
Owing to their advantages, bio-inspired algorithms have recently been
widely used as a tool for feature
selection in data mining. (Kabir et al., 2012) have
proposed a hybrid ant colony optimization (ACO) al-
gorithm for feature selection (FS), called ACOFS, us-
ing a neural network. A key aspect of this algorithm
is the selection of a subset of salient features of re-
duced size. ACOFS uses a hybrid search technique
that combines the advantages of wrapper and filter
approaches. In order to facilitate the hybrid search,
the authors designed new sets of rules for pheromone
update and heuristic information measurement. In
addition, the ants are guided in the correct directions
while constructing graph (subset) paths using a
bounded scheme at every step of the algorithm.
Together, these mechanisms not only
provide an effective balance between exploration and
exploitation of ants in the search, but also intensify
the global search capability of ACO for a high quality
solution in feature selection. Other studies
that have applied ant colony optimization to
feature selection include (Al-Ani, 2005) and
(Aghdam et al., 2009).
(Zahran and Kanaan, 2009) have introduced a feature
selection algorithm based on Particle Swarm Optimization
(PSO) to improve the performance of Arabic
text categorization. They used Radial Basis Function
(RBF) networks as the text classifier. On the basis
of the same bio-inspired algorithm, (Xue et al., 2012)
have proposed two multi-objective algorithms for se-
lecting the Pareto front of non-dominated solutions
(feature subsets) for classification. The first algorithm
introduces the non-dominated sorting idea of
multi-objective genetic algorithms into PSO for feature
selection. In the second algorithm, the multi-objective
PSO uses the ideas of crowding, mutation and dominance
to search for the Pareto front solutions.
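The notion of a Pareto front of non-dominated feature subsets can be illustrated with a minimal Python sketch. It assumes, for illustration only, that each candidate subset is scored by two objectives to be minimized, classification error and number of features; the function names and the sample values are hypothetical and are not taken from (Xue et al., 2012).

```python
from typing import List, Tuple

# Assumption: a candidate is (classification_error, number_of_features),
# and both objectives are minimized.
Candidate = Tuple[float, int]


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if a is no worse than b on both objectives
    and strictly better on at least one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


if __name__ == "__main__":
    # Hypothetical (error, feature count) pairs, invented for illustration.
    subsets = [(0.10, 12), (0.12, 8), (0.10, 15), (0.20, 3), (0.25, 5)]
    print(pareto_front(subsets))
```

In this toy input, (0.10, 15) is dominated by (0.10, 12) and (0.25, 5) is dominated by (0.20, 3), so neither belongs to the front.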
(Siedlecki and Sklansky, 1989) introduced the use
of genetic algorithms (GAs) for feature selection. In
a GA approach, a given feature subset is represented
as a binary string (a "chromosome") of length n, with a
zero or one in position i denoting the absence or
presence of feature i in the set, respectively, where
n is the total number of available features. A population
of chromosomes is maintained. Each chromosome
is evaluated to determine its "fitness", which
determines how likely the chromosome is to survive and
breed into the next generation. New chromosomes
are created from old chromosomes by the following
processes: (1) crossover, where parts of two differ-
ent parent chromosomes are mixed to create offspring
and (2) mutation, where the bits of a single parent are
randomly disturbed to create a child (Yusta, 2009).
(Jourdan et al., 2001) also presented a genetic algorithm
dedicated to feature selection, but for
a particular case encountered in the genetic analysis
of different diseases. The specificity of this problem
is that the authors are not looking for a single feature,
but for several associations of features that may be
involved in the studied disease. Other studies
applying genetic algorithms to feature selection
include (Yang and Honavar, 1998), (Oliveira et al., 2003)
and (Babatunde et al., 2014).
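The binary chromosome encoding and the two genetic operators described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: one-point crossover and independent bit-flip mutation are just one common way to "mix" and "disturb" chromosomes, and the function names, the mutation rate and the sample parents are hypothetical rather than taken from the cited works.

```python
import random
from typing import List, Tuple

# Bit i is 1 if feature i is present in the subset, 0 if it is absent.
Chromosome = List[int]


def selected_features(chrom: Chromosome) -> List[int]:
    """Decode a chromosome into the indices of the selected features."""
    return [i for i, bit in enumerate(chrom) if bit == 1]


def one_point_crossover(p1: Chromosome, p2: Chromosome,
                        rng: random.Random) -> Tuple[Chromosome, Chromosome]:
    """Mix two parents by swapping their tails at a random cut point
    (assumption: one-point crossover; other schemes exist)."""
    if len(p1) != len(p2) or len(p1) < 2:
        raise ValueError("parents must have equal length of at least 2")
    cut = rng.randint(1, len(p1) - 1)  # cut lies strictly inside the string
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]


def mutate(parent: Chromosome, rate: float, rng: random.Random) -> Chromosome:
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in parent]


if __name__ == "__main__":
    rng = random.Random(0)            # fixed seed so runs are repeatable
    mother = [1, 0, 1, 1, 0, 0]       # hypothetical parents with n = 6
    father = [0, 1, 0, 0, 1, 1]
    child_a, child_b = one_point_crossover(mother, father, rng)
    print(child_a, child_b)
    print(selected_features(mutate(child_a, 0.1, rng)))
```

A full GA would wrap these operators in a loop that evaluates fitness and selects survivors; that part is omitted here because the fitness function depends on the classifier being used.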
3 THE PROPOSED APPROACH
The proposed approach is composed of three mod-
ules: preprocessing, feature selection and matching.
Figure 1 shows the structure of the proposed approach.
First, to preprocess the input heterogeneous
text documents, we implement the most
important preprocessing steps, which include operations
such as data cleaning and stemming. Second,
we apply a quantum-inspired genetic algorithm to
select the minimum set of features that is ideally
necessary and sufficient to describe the semantics
of a set of heterogeneous text documents, with the aim of
reducing the cost and increasing the matching accuracy of
these documents. Finally, in the matching phase, cosine
similarity is used to measure how similar the input text
document is to its corresponding document.
The following subsections describe the details of each
step of the proposed approach.
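As a reminder of how the measure used in the matching module behaves, the following Python sketch computes the cosine similarity of two feature-weight vectors. It assumes that documents are already represented as equal-length numeric vectors (for example, weights over the selected features); the function name and the zero-vector convention are illustrative choices, not details given in this paper.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: an empty (all-zero) document matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical weight vectors over the same four features.
    query = [0.5, 0.0, 0.3, 0.0]
    candidate = [0.4, 0.1, 0.0, 0.0]
    print(cosine_similarity(query, candidate))
```

For non-negative weights the value lies between 0 (no shared features) and 1 (same direction), so a higher score indicates a closer match.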
3.1 Preprocessing Phase
Text preprocessing plays a very important
role in text mining techniques and applications,
and it becomes even more important when handling big
data generated from multiple sources. In this step,
we used the R language, which is widely used among data
miners. It offers multiple text mining packages
that facilitate preprocessing tasks such as
the elimination of punctuation, digits and stopwords,
stemming, and TF-IDF weighting.
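The preprocessing tasks listed above can be illustrated with a short sketch. Note that the authors worked in R; the Python code below is only a stand-in that uses the standard library, not their pipeline. The stopword list is a tiny hypothetical sample, stemming is omitted for brevity, and the TF-IDF formula shown (relative term frequency times log(N / df)) is one common variant among several.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for"}


def tokenize(text: str) -> List[str]:
    """Lowercase, drop punctuation and digits, then drop stopwords.
    Assumption: only ASCII letters are kept."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]


def tf_idf(docs: List[str]) -> List[Dict[str, float]]:
    """Return one {term: weight} mapping per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty documents
        weights.append({term: (count / total) * math.log(n_docs / df[term])
                        for term, count in counts.items()})
    return weights


if __name__ == "__main__":
    corpus = ["Ants select features.", "Particles select features in 2012!"]
    for doc_weights in tf_idf(corpus):
        print(doc_weights)
```

Terms that occur in every document, such as "select" in this toy corpus, receive a weight of zero under this variant, which is why TF-IDF tends to favor discriminative terms.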
As our bio-inspired approach is intended to deal
with heterogeneous text documents, the collected
dataset used in our work consists of scientific articles,