Towards a Bio-inspired Approach to Match Heterogeneous Documents

Nourelhouda Yahi

, Hacene Belhadef

, Mathieu Roche

and Amer Draa

NTIC Faculty, MISC Laboratory, University of Constantine 2-Abdelhamid Mehri, Algeria

UMR TETIS (Cirad, Cnrs, Irstea, AgroParisTech), France

Keywords:

Text Mining, Feature Selection, Semantic Similarity, Quantum Inspired Genetic Algorithm.

Abstract:

Matching heterogeneous text documents coming from different sources means matching data extracted from

these documents, generally structured in the form of vectors. The accuracy of matching directly depends on

the right choice of the content of these vectors. That’s why we need to select the best features. In this paper,

we present a new approach to select the minimum set of features that represents the semantics of a set of text

documents, using a quantum inspired genetic algorithm. Among different Vs characterizing the big data we

focus on ‘Variety’ criterion, therefore, we used three sets of different sources that are semantically similar to

retrieve their best features which describe the semantics of the corpus. In the matching phase, our approach

shows signiﬁcant improvement compared with the classic ‘Bag-of-words’ approach.

1 INTRODUCTION

In the recent years, the amount of data has been ex-

ploding. Data are generated by different individuals

and originated from different sources. That is why

the format of the presented information is heteroge-

neous and unstructured. Most often, it is expressed

by means of natural language. In order to evaluate the

similarity between documents, usually, a very known

document representation is used which is the ‘Bag-

of-words’ model. This model creates a matrix (i.e.

corpus) with word counts for each data instance (i.e.

documents). The count can be either absolute (i.e.

number of occurrences), binary (contains or does not

contain), sublinear (logarithm of the term frequency),

and so forth. However, if we directly use the vocab-

ulary contained in the training texts, we end up with

a vector space with a very high dimension. Each text

will be represented by a vector with as many terms

as there are words in the vocabulary. The processing

of vectorial space would require a lot of memory and

computation time and could stop us from using more

complex processing algorithms (R

ehel, 2005). Fea-

ture selection is a preprocessing technique commonly

used for high-dimensional data. This involves the se-

lection of important feature subset and removing ir-

relevant, redundant and, noisy features, for an easier

and more accurate data presentation.

In the literature, we ﬁnd different feature selection

methods; but due to the huge increase in the amount

of data, the use of metaheuristics enables us to ob-

tain reasonably good solutions, without needing to

explore the entire solution space. In real-world appli-

cations, people are more interested in obtaining good

solutions in a reasonable amount of time rather than

being obsessed with optimal solutions. Therefore, we

favour metaheuristic methods that have been proven

to be efﬁcient for dealing with real-world applica-

tions (Yusta, 2009). In our experiments we used a

Quantum-Inspired Genetic Algorithm (QIGA) which

offers advantages of genetic algorithms, including the

processing of several data in parallel using the mini-

mum of information, and the providing of a good bal-

ance between exploration and exploitation of research

space, and takes the quantum-inspired algorithms ad-

vantages, including the quantum representation of in-

dividuals; which allows a further exploration; a good

diversity is offered, it can cover a large part of the

search space. Also, quantum algorithms are charac-

terized by a reduced complexity; very few individuals

are necessary for a good representation of the search

space; and a huge computing power thanks to the su-

perposition of states and the quantum operators, en-

abling the processing of a large amount of informa-

tion in parallel (Draa, 2011).

The rest of the paper is organized as follows. We

ﬁrst discuss previous works related to feature selec-

tion using bio-inspired algorithms in Section 2. We

then describe the proposed approach in Section 3.

Section 4 discusses the details of our experiments and

276

Yahi, N., Belhadef, H., Roche, M. and Draa, A.

Towards a Bio-inspired Approach to Match Heterogeneous Documents.

DOI: 10.5220/0006294002760283

In Proceedings of the 13th International Conference on Web Information Systems and Technologies (WEBIST 2017), pages 276-283

ISBN: 978-989-758-246-2

the results obtained from applying the proposed ap-

proach on heterogeneous documents. Finally, we con-

clude by summarizing the contribution of this work

and giving some perspectives in section 5.

2 RELATED WORK

Because of their advantages, recently, bio-inspired al-

gorithms have been widely used as a tool for feature

selection in data mining. (Kabir et al., 2012) have

proposed a hybrid ant colony optimization (ACO) al-

gorithm for feature selection (FS), called ACOFS, us-

ing a neural network. A key aspect of this algorithm

is the selection of a subset of salient features of re-

duced size. ACOFS uses a hybrid search technique

that combines the advantages of wrapper and ﬁlter

approaches. In order to facilitate the hybrid search,

the authors designed new sets of rules for pheromone

update and a heuristic information measurement. On

the other hand, the ants are guided in correct direc-

tions, while constructing graph (subset) paths using a

bounded scheme in each and every step of the algo-

rithm. The above combinations ultimately not only

provide an effective balance between exploration and

exploitation of ants in the search, but also intensify

the global search capability of ACO for a high quality

solution in feature selection. There are other studies

that applied the ant colony algorithm to the problem

of feature selection such as (Al-Ani, 2005) and (Agh-

dam et al., 2009).

(Zahran and Kanaan, 2009) have introduced a fea-

ture selection algorithm based on Particle Swarm Op-

timization (PSO) to improve the performance of Ara-

bic text categorization. They used RBF networks (Ra-

dial Basis Function) as a text classiﬁer. On the basis

of the same bio-inspired algorithm, (Xue et al., 2012)

have proposed two multi-objective algorithms for se-

lecting the Pareto front of non-dominated solutions

(feature subsets) for classiﬁcation. The ﬁrst algorithm

introduces the idea of non-dominated sorting based

multi-objective genetic algorithm into PSO for feature

selection. In the second algorithm, the multi-objective

PSO uses the ideas of crowding, mutation and domi-

nance to search for the Pareto front solutions.

(Siedlecki and Sklansky, 1989) introduced the use

of Genetic Algorithm (GA) for feature selection. In

a GA approach, a given feature subset is represented

as a binary string ‘Chromosome’ of length n, with a

zero or one in a position i denoting the absence or

presence of feature i in the set, respectively. Note that

n is the total number of available features. A popu-

lation of chromosomes is maintained. Each chromo-

some is evaluated to determine its ”ﬁtness”, which de-

termines how likely the chromosome is to survive and

breed into the next generation. New chromosomes

are created from old chromosomes by the following

processes: (1) crossover, where parts of two differ-

ent parent chromosomes are mixed to create offspring

and (2) mutation, where the bits of a single parent are

randomly disturbed to create a child (Yusta, 2009).

(Jourdan et al., 2001) also presented a genetic algo-

rithm dedicated for a feature selection problem, but in

a particular case encountered in the genetic analysis

of different diseases. The speciﬁcities of this problem

is that the authors are not looking for a single feature,

but for several associations of features that may be in-

volved in the studied disease. There are other studies

applying the genetic algorithm on the problem of fea-

ture selection, we cite those of: (Yang and Honavar,

1998), (Oliveira et al., 2003) and (Babatunde et al.,

2014).

3 THE PROPOSED APPROACH

The proposed approach is composed of three mod-

ules: preprocessing, feature selection and matching.

Figure 1 shows the structure of the proposed ap-

proach. First, for the preprocessing of the input het-

erogeneous text documents, we implement the most

important prerocessing steps which include opera-

tions such as cleaning data and stemming. Second,

we apply a quantum-inspired genetic algorithm in or-

der to select the minimum set of features that are ide-

ally necessary and sufﬁcient to describe the semantics

of a set of heterogeneous text documents; in order to

reduce the cost and increase the matching accuracy of

these documents. Finally, the cosine similarity is used

to measure the difference between the input text doc-

ument and its corresponding in the matching phase.

The following subsections describe the details of each

step of the proposed approach.

3.1 Preprocessing Phase

Text preprocessing is a task that plays a very impor-

tant role in text mining techniques and applications,

it becomes even more important when handling big

data generated from multiple sources. In this step,

we used R language which is widely used among data

miners, it offers multiple packages for performing text

mining that facilitate preprocessing tasks including:

the elimination of punctuation, digits and stopwords,

stemming, TF-IDF weighting, etc.

As our bio-inspired approach is intended to deal

with heterogeneous text documents, the collected

dataset used in our work consists of scientiﬁc articles,

Towards a Bio-inspired Approach to Match Heterogeneous Documents

277

Dataset

Heterogeneous

Documents

Preprocessing

Cleaning and stemming

Feature

vectors

Criterion

Generation

Evaluation

Features

subset

Subset

quality

Feature selection with QIGA

Yes Selected

features

Matching

Cosine similarity

Figure 1: The proposed bio-inspired approach to match het-

erogeneous documents.

blog posts, and tweets. In order to clean the corpus

of scientiﬁc articles and blog posts, the tasks carried

out are: removing links, punctuation and digits, re-

moving stopwords using a standard list that we have

enriched, eliminating unnecessary spaces, stemming,

and deemphasis. Concerning tweets, we have added

for them removing hashtags and citations.

Figure 2: Word cloud of a preprocessed corpus.

3.2 Feature Selection Phase

The excessive dimensionality of feature vectors in-

creases the training time and decreases matching ac-

curacy. In order to select the best features, we

chose to use the Quantum-Inspired Genetic Algo-

rithm (QIGA); to take advantage from the genetic

algorithm which is suitable for discrete optimisation

problems and from quantum computing, which is able

to minimize the complexity of the algorithm.

3.2.1 Quantum-inspired Genetic Algorithm

A quantum-inspired genetic algorithm is a genetic al-

gorithm enriched by the concepts and principles of

quantum computing, such as the qubit, the superposi-

tion of states and the quantum operators. In quantum

computing, the smallest unit of information storage is

the qubit. A qubit can be in the state 1, the state 0 or

in a superposition of both. The state of a qubit can be

represented as indicated by Formula (1).

|Ψi = α|0i + β|1i (1)

Where |0i and |1i represent the conventional values

of bits 0 and 1, respectively. α and β are complex

numbers satisfying:

|α|

+ |β|

= 1 (2)

|α|

represents the probability that the qubit is found

in the state 0, while |β|

represents the probability that

the qubit is found in the state 1. A quantum register of

WEBIST 2017 - 13th International Conference on Web Information Systems and Technologies

278

m bits can represent 2

values simultaneously. How-

ever, in the act of observing a quantum state, there is

no more superposition and one of the values is then

available for use (Laboudi and Chikhi, 2009). As the

basic element here is the qubit, a chromosome is sim-

ply a string of m qubits forming a quantum register.

To extract a classical chromosome from a quan-

tum chromosome, we apply the quantum measure-

ment, measuring the state of a qubit forces it to point

to either ‘1’ or to ‘0’. The result depends on the am-

plitudes of the qubit, such as a qubit whose value

|α|

= 0.8 will have an 80% chance of being a ‘1’

and 20% to be in the state ‘0’.

Quantum crossover has the same principle as a

conventional crossover. But it operates on quan-

tum chromosomes. So, it is a probability matrice

crossover that generates as a result new probabil-

ity matrices. The quantum crossover between two

individuals (parents) at a given point can generate

two new individuals (offspring) whose genes become

from both parents. As shown in the Figure 3.

Figure 3: Quantum crossover.

Classical mutation operates as a small perturba-

tion that reverses the mutated bit. In a quantum mu-

tation, there is also a perturbation, but it operates on

the probabilities of a qubit of the concerned chromo-

some, as follows. Consider a qubit |Ai = α|0i + β|1i.

The qubit quantum mutation of A generates the qubit

|Bi = β|0i + α|1i.

Figure 4: Quantum mutation.

At each iteration, the current best solution serves

as a guide to ﬁnd new solutions that may be better.

This is implemented via a quantum gate, which ro-

tates the qubit in question in order to increase the

probability of having the binary value of the corre-

sponding bit in the current best solution (Draa, 2011).

The general principle of the quantum-inspired genetic

algorithm is illustrated in Figure 5.

Figure 5: Flowchart of the quantum inspired genetic algo-

rithm.

3.2.2 Representation of Solutions

In our bio-inspired approach a quantum representa-

tion of solutions is adopted. A given feature subset is

represented as a quantum register ‘Quantum chromo-

some’ of length n, where each qubit is composed of

two values, namely α and β in position i denoting the

absence or presence of feature i in the set. Note that

n is the total number of available features. A popula-

tion is simply a set of quantum chromosomes. Figure

6 shows the structure of a quantum chromosome.

Figure 6: The quantum chromosome structure.

3.2.3 Objective Function

In general, the objective function consists of two

terms which are in competition with each other: the

number of features (to be minimized) and the quality

Towards a Bio-inspired Approach to Match Heterogeneous Documents

279

(to be maximized). The decision is a compromise be-

tween these two objectives. In our approach the qual-

ity is evaluated with the cosine measure between the

vector of features to reduce and the learning vectors.

Given two n dimensional vectors ~v and ~w , the cosine

similarity between them is calculated as follows:

Cosine(~v,~w) =

~v ·~w

|~v||~w|

∑

i=1

× w

∑

i=1

∑

i=1

(3)

The resulting similarity ranges from 0 meaning totally

different, to 1 meaning exactly the same. In order to

calculate the cosine similarity we must pass by the

weighting step. In which we used two methods, the

ﬁrst method consists of concatenating the two feature

vectors (i.e. the vector to be optimized, and the vec-

tor used for learning). Then, two weight vectors are

constructed in this way: for each feature, if it does not

exist in the vector to optimize (for P1, but for P2 the

existence of the term is veriﬁed in the learning vector)

the given weight is ‘0’, otherwise, if it exists in the

vector to optimize only the given weight is ‘1’, oth-

erwise, if it exists in the two vectors the given weight

is ‘2’. The second weighting method is based on the

TF-IDF measure (Salton and McGill, 1986) which is

a weighting function that depends on the term fre-

quency (TF) in a given document calculated with its

relative collection frequency (IDF).

Since the calculation of TF-IDF for each term in

relation to each document takes a lot of time, in our

approach the TF-IDF value of each term is calculated

in relation to each corpus. Then, the ﬁtness is calcu-

lated using Formula (4).

Fitness = α ∗ Size + (1 − α) ∗ Quality (4)

Data: Quantum chromosomes of features to reduce

+ Chromosomes of learning features

Result: ﬁtness of each chromosome

for each quantum chromosome do

Apply quantum measurement

Calculate the number of non-selected features:

zeros

Concatenate the feature vectors (selected +

learning)

Generate two weight vectors: P1 and P2

Calculate the cosine similarity between P1 and

P2: cos

Calculate Fitness = α * zeros + (1-α) * cos

end

Algorithm 1: The objective function.

3.3 Matching Phase

In order to match heterogeneous text documents, we

calculate the cosine similarity using two methods, the

ﬁrst one consists in calculating the cosine between

vectors representing the terms occurrences, and the

second is a cosine calculation between TF-IDF vec-

tors (Term Frequency-Inverse Document Frequency)

of all test pairs at the basis of the features selected

from the previous phase.

4 EXPERIMENTAL RESULTS

In this section, the proposed approach is validated

through applying it on three types of text documents:

scientiﬁc articles, blog posts, and tweets. Then MRR

(Mean Reciprocal Rank) measurement is adopted in

order to compare the proposed approach to the classic

‘Bag-of-words’ approach.

4.1 Corpora

In this work, we focus on ‘Variety’ criterion of big

data. Consequently, we have collected heterogeneous

documents in French from various sources.

• Scientiﬁc articles dealing with climate change

subject from an abstract collection of arti-

cles, books, book chapters, thesis, etc., gotten

from: Agritrop open archive CIRAD

(Agri-

cultural research and international cooperation,

French organization working for the sustainable

development of tropical and Mediterranean re-

gions’publications). It offers free access to writ-

ten scientiﬁc production of CIRAD in compliance

with copyright. In 2016, Agritrop includes more

than 92,000 references and 25,000 full-text docu-

ments in agricultural research and rural develop-

ment in Southern countries.

• Non-climatic scientiﬁc articles from a collection

of abstracts of TETIS

Laboratory (for Territories,

Environment, Remote Sensing and Spatial Infor-

mation). It aims to develop methods of spatial in-

formation control to promote the knowledge and

management of environment and territories. The

research domains of these paper are agro-forestry,

urbanization, natural resource management, land-

use planning, etc.

• Blog posts addressing the climate issue that are

manually collected from two blogs

that address

the theme of climate change.

http://www.cirad.fr/

https://tetis.teledetection.fr/index.php/fr/tetis-summary

http://www.ecoco2.com/blog/rechauffement-

climatique and http://maplanete.blogs.sudouest.fr/tag/

rechauffement+climatique

WEBIST 2017 - 13th International Conference on Web Information Systems and Technologies

280

• Non-climatic blog posts that are manually col-

lected from a blog

of useful information of

medicines called non-conventional or alternative.

• Tweets dealing with climate change topic from

DEFITweets

which is a corpus of climate change

(15,000 tweets).

• Non-climatic tweets from Politweets corpus

(Longhi et al., 2014). This corpus gathers the

tweets of 7 personalities from 6 different French

political groups. Extracted from Twitter accounts

of these persons by a method that selects mes-

sages sent in 2013 and 2014, making the total cor-

pus of 34273 messages (tweets).

In our experiments we have randomly selected 100

documents for each category. These data are named:

CA for scientiﬁc Articles in Climate change domain,

NCA represents Non-Climatic scientiﬁc Articles, CB

for Climatic Blog posts, NCB for Non-Climatic Blog

posts, CT represents Climate change Tweets and NCT

is for Non-Climatic Tweets, where relevant pairs are

pairs of heterogeneous documents covering the same

subject; e.g. a document from CA and a document

from CB.

4.2 Experimental Protocol

After data preprocessing our goal is to select the

minimum features that describe the semantics of the

corpus using the quantum-inspired genetic algorithm

with the following parameters. The number of it-

erations has been ﬁxed at 300, the population size

was chosen to be 5, the probability of mutation and

crossover were chosen to be 0.1 and 0.9, respectively.

The number of documents for each category is 66

and the α value in the objective function took values

from 0.1 to 0.9. The quantum-inspired genetic algo-

rithm proceeds as follows. To extract the best features

of climate Articles CA, the Algorithm inputs are CA

feature vector as the vector to be reduced and (CB

+ CT) feature vector as the vector for learning fea-

tures. First, we start by randomly creating the initial

population. Then, for each iteration we apply the ge-

netic operators, we evaluate the population using the

formula of the objective function, and we extract the

words chosen by the algorithm. After that, we ap-

ply the above steps to extract the best features of cli-

mate blogs (inputs are CB to reduce and (CA + CT)

as learning vector), and the best features of climate

tweets (inputs are CT to reduce and (CA + CB) as

a learning vector). Finally, the base of features that

http://blog-medecine-douce.com/

https://deft.limsi.fr/2015/index.php

describes the semantics of the corpus is the concate-

nation of the best features selected in CA, CB and CT.

4.3 Mean Reciprocal Rank

The Reciprocal Rank (RR) information retrieval mea-

sure calculates the reciprocal of the rank at which the

ﬁrst relevant document was retrieved. RR equals 1

if a relevant document was retrieved at rank 1, if not

it equals 0.5 if a relevant document was retrieved at

rank 2 and so on. When averaged across queries, the

measure is called the Mean Reciprocal Rank (MRR)

(Craswell, 2009). This statistical measure evaluates

any process that produces a list of possible answers

to a sample of queries, ordered by the probability of

correctness. In our experiments, MRR is calculated in

two ways, the ﬁrst way is based on the calculation of

the cosine between term-occurrence vectors of all test

pairs, and the second one is based on the calculation

of the cosine between TF-IDF vectors of all test pairs.

These vectors are calculated on the basis of the fea-

tures selected with the quantum-inspired genetic algo-

rithm. The cosine values are sorted in descending or-

der, then couples relevance is evaluated, knowing that

relevant pairs are pairs of heterogeneous documents

covering the same subject; e.g. a document from CA

and a document from CB, by assigning the value 1 to

the relevant couples and 0 for irrelevant ones.

Table 1: Reciprocal Rank calculation based on cosine simi-

larity.

ID Doc 1 ID Doc 2 Cosine Relevance P@i

166 11 1.000 1 1

58 64 0.944 1 1

16 65 0.914 0 2/3

120 2 0.810 1 3/4

... ... ... ... ...

After that, for each couple i, precision P@i is cal-

culated as follows; If r relevant documents have been

retrieved at rank i, then:

P@i =

(5)

Finally we calculate the mean reciprocal rank us-

ing Formula 6.

MRR =

∑

i=1

P@i (6)

4.4 Results

We present in this section the results obtained from

applying the proposed bio-inspired approach on three

categories of heterogeneous text documents collected

from different sources which are; scientiﬁc articles,

Towards a Bio-inspired Approach to Match Heterogeneous Documents

281

blog posts, and tweets. These results are compared to

those given by the classic ‘Bag-of-words’ approach.

Table 2 shows the results of three variations of the

proposed approach, which are different in two steps;

the weighting step which precedes the cosine calcu-

lation in the objective function, and the cosine calcu-

lation step that precedes the MRR calculation. The

ﬁrst method uses a proposed weighting method, and

the MRR is calculated using the occurrence vectors

of the terms of all test pairs. The second method uses

TF-IDF as a weighting measure, and the MRR is cal-

culated using the occurrence vectors of the terms of

all test pairs. The last method uses the TF-IDF as a

weighting measure, and the MRR is calculated using

the TF-IDF vectors of all the test pairs. MRR values

are calculated for:

• The feature bases constructed by different α val-

ues of the objective function: Fitness = α ∗Size +

(1 − α) ∗ Quality.

• The features of learning corpus, after the prepro-

cessing phase.

• The features of learning corpus without prepro-

cessing, only punctuation removing.

From the reported results, we can see that the

ﬁrst method did not give better results compared to

the preprocessed ‘Bag-of-words’ approach, but it out-

performed the not preprocessed ‘Bag-of-words’ ap-

proach for three alpha values (0.2, 0.4, and 0.6). The

second method outperformed the preprocessed ‘Bag-

of-words’ approach for two alpha values (0.1 and 0.6)

and the not preprocessed ‘Bag-of-words’ approach

for all alpha values. In the third method the results

of the three alpha values (0.1, 0.2, and 0.6) were bet-

ter compared to the preprocessed ‘Bag-of-words’ ap-

proach, and the not preprocessed ‘Bag-of-words’ was

worse compared to all other results.

5 CONCLUSION AND FUTURE

WORK

In this paper, we have proposed a bio-inspired ap-

proach for matching heterogeneous text documents.

In the ﬁrst stage, we preprocess data by cleaning and

stemming to get a feature vector. The second phase

highlights the selection of the minimum set of fea-

tures that represents the semantics of text documents

using the quantum-inspired genetic algorithm. On the

basis of the features selected in the previous phase,

we do the matching. In order to validate the proposed

approach, three document sets coming from differ-

ent sources, that are semantically similar, are used

to retrieve their optimal features. Then MRR mea-

surement is used to evaluate matching accuracy. The

proposed approach outperformed the classical ‘Bag-

of-words’ approach.

Our future work intends to apply other weighting

measurements that can bring further improvements as

Okapi BM25, we would also compare the proposed

approach with other works using other corpora.

ACKNOWLEDGEMENT

This work has been supported by SONGES project

(FEDER and Occitanie).

REFERENCES

Aghdam, M. H., Ghasem-Aghaee, N., and Basiri, M. E.

(2009). Text feature selection using ant colony

optimization. Expert systems with applications,

36(3):6843–6853.

Al-Ani, A. (2005). Ant colony optimization for feature sub-

set selection. In WEC (2), pages 35–38. Citeseer.

Table 2: Comparison of matching accuracy of the proposed approach with different objective functions.

Objective function

ID Method

Method 1 Method 2 Method 3

0.1 * Size + 0.9 * Quality 0.7953 0.9578 0.9914

0.2 * Size + 0.8 * Quality 0.9319 0.9497 0.9865

0.3 * Size + 0.7 * Quality 0.8640 0.9171 0.9247

0.4 * Size + 0.6 * Quality 0.9190 0.9439 0.9706

0.5 * Size + 0.5 * Quality 0.8751 0.9361 0.9461

0.6 * Size + 0.4 * Quality 0.8741 0.9534 0.9963

0.7 * Size + 0.3 * Quality 0.8682 0.9461 0.9759

0.8 * Size + 0.2 * Quality 0.9274 0.9171 0.9247

0.9 * Size + 0.1 * Quality 0.8643 0.9439 0.9706

Bag-of-words (preprocessed) 0.9512 0.9770

Bag-of-words (not preprocessed) 0.8898 0.7920

WEBIST 2017 - 13th International Conference on Web Information Systems and Technologies

282

Babatunde, O., Armstrong, L., Leng, J., and Diepeveen, D.

(2014). A genetic algorithm-based feature selection.

British Journal of Mathematics & Computer Science,

4(21):889–905.

Craswell, N. (2009). Mean reciprocal rank. In Encyclopedia

of Database Systems, pages 1703–1703. Springer.

Draa, A. (2011). Mod

eles pour les syst

eemes complexes

adaptatifs pour la r

esolution de probl

emes : Auto-

mates cellulaires apprenants quantiques et

evolution

diff

erentielle quantique. PhD thesis, Constantine 2

Abdelhamid Mehri University, Algeria.

Jourdan, L., Dhaenens, C., and Talbi, E.-G. (2001). A ge-

netic algorithm for feature selection in data-mining for

genetics. Proceedings of the 4th Metaheuristics Inter-

national ConferencePorto (MIC2001), pages 29–34.

Kabir, M. M., Shahjahan, M., and Murase, K. (2012).

A new hybrid ant colony optimization algorithm for

feature selection. Expert Systems with Applications,

39(3):3747–3763.

Laboudi, Z. and Chikhi, S. (2009). Evolution d’automate

cellulaire par algorithme genetique quantique. In

CIIA.

Longhi, J., Marinica, C., Borzic, B., and Alkhouli, A.

(2014). Polititweets, corpus de tweets provenant

de comptes politiques inﬂuents. Banque de cor-

pus CoMeRe. Ortolang. fr: Nancy. http://hdl. handle.

net/11403/comere/cmr-polititweets.

Oliveira, L. S., Sabourin, R., Bortolozzi, F., and Suen,

C. Y. (2003). A methodology for feature selec-

tion using multiobjective genetic algorithms for hand-

written digit string recognition. International Jour-

nal of Pattern Recognition and Artiﬁcial Intelligence,

17(06):903–929.

ehel, S. (2005). Cat

egorisation automatique de textes et

cooccurrence de mots provenant de documents non

etiquet

es. Faculty of Science and Engineering, Uni-

versity LAVAL, QUEBEC.

Salton, G. and McGill, M. J. (1986). Introduction to modern

information retrieval.

Siedlecki, W. and Sklansky, J. (1989). A note on genetic

algorithms for large-scale feature selection. Pattern

recognition letters, 10(5):335–347.

Xue, B., Zhang, M., and Browne, W. N. (2012). Multi-

objective particle swarm optimisation (pso) for feature

selection. In Proceedings of the 14th annual confer-

ence on Genetic and evolutionary computation, pages

81–88. ACM.

Yang, J. and Honavar, V. (1998). Feature subset selection

using a genetic algorithm. In Feature extraction, con-

struction and selection, pages 117–136. Springer.

Yusta, S. C. (2009). Different metaheuristic strategies to

solve the feature selection problem. Pattern Recogni-

tion Letters, 30(5):525–534.

Zahran, B. M. and Kanaan, G. (2009). Text feature se-

lection using particle swarm optimization algorithm.

World Applied Sciences Journal 7 (Special Issue of

Computer & IT), pages 69–74.

Towards a Bio-inspired Approach to Match Heterogeneous Documents

283