2.2 Similarity Function
Once the VSM for two vectors V_x and V_y is created,
the usual method is to compute their similarity with
the cosine (Turney and Pantel, 2010):

similarity(V_x, V_y) = cos(α)   (2)

where α is the angle between V_x and V_y.
This similarity rate gives a good indication of how close
two vectors are within their vector space, and therefore
of how similar two profiles are.
However, this computation misses a human intuition
involved in profile recognition.
Sometimes, two profiles do not contain enough
relevant information to evaluate their similarity. For
example, if two profiles like the same singer and
contain only this information, it is not enough
to determine that they both concern the same person.
Through experimentation, we noticed that the norm
of a vector is a good metric to evaluate the relevance
of a profile. Therefore, the similarity rate is smoothed
with the average norm of the two vectors:
similarity(V_x, V_y) = cos(α) × (‖V_x‖ + ‖V_y‖) / 2   (3)

where ‖V‖ is the Euclidean norm of the vector V. The
factor (‖V_x‖ + ‖V_y‖) / 2 goes through a repair function
which ensures it stays in the real interval [0, 1].
This similarity rate is a real value in [0, 1] and it
can be interpreted as a percentage. For example, a rate
of 0.27 corresponds to a similarity of 27%.
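As a sketch, the smoothed similarity of Equation (3) can be implemented as follows. The paper does not specify the repair function, so a simple clamp of the average norm into [0, 1] is assumed here:

```python
import math

def cosine(vx, vy):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vx, vy))
    nx = math.sqrt(sum(a * a for a in vx))
    ny = math.sqrt(sum(b * b for b in vy))
    if nx == 0 or ny == 0:
        return 0.0
    return dot / (nx * ny)

def smoothed_similarity(vx, vy):
    """Cosine similarity smoothed by the average Euclidean norm (Eq. 3)."""
    nx = math.sqrt(sum(a * a for a in vx))
    ny = math.sqrt(sum(b * b for b in vy))
    # Assumed repair function: clamp the average norm into [0, 1].
    avg_norm = min((nx + ny) / 2, 1.0)
    return cosine(vx, vy) * avg_norm
```

With this clamp, two sparse profiles (small norms) are penalized even when their cosine is 1, matching the intuition described above.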
To sum up, making use of a VSM is an effective
way to move from semantic information to a
mathematical model that can then be used to compute
the similarity between two profiles efficiently. The
next step is to teach the computer to find the
weighting for each label dynamically. For this purpose, a
genetic algorithm is applied in this study.
3 GENETIC ALGORITHM
Genetic algorithms (GA) are heuristics, based on
Charles Darwin’s theory of natural evolution, used to
solve optimization problems (Hüe, 1997). The
general process for a GA is described as follows
(Eberhart et al., 1996; Kim and Cho, 2000):
Step 1: Initialize a population.
Step 2: Compute the fitness function for each
chromosome in the population.
Step 3: Reproduce chromosomes, based on their
fitness.
Step 4: Perform crossover and mutation.
Step 5: Go back to Step 2 or stop according to a
given stopping criterion.
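The five steps above can be sketched as a generic GA loop. The operators chosen here (roulette-wheel selection, single-point crossover, pointwise mutation, a fixed number of generations as stopping criterion) are common defaults, not necessarily the ones used in this study:

```python
import random

def genetic_algorithm(pop_size, n_genes, fitness, generations=100,
                      crossover_rate=0.7, mutation_rate=0.05):
    """Generic GA loop following Steps 1-5 (a sketch, not the exact setup)."""
    # Step 1: initialize a population of chromosomes with alleles in [0, 1].
    population = [[random.random() for _ in range(n_genes)]
                  for _ in range(pop_size)]
    # Step 5: stop after a fixed number of generations.
    for _ in range(generations):
        # Step 2: compute the fitness of each chromosome.
        scores = [fitness(c) for c in population]
        # Step 3: fitness-proportionate reproduction (roulette-wheel selection).
        weights = scores if sum(scores) > 0 else None
        parents = random.choices(population, weights=weights, k=pop_size)
        # Step 4: single-point crossover, then pointwise mutation.
        offspring = []
        for i in range(0, pop_size - 1, 2):
            a, b = parents[i][:], parents[i + 1][:]
            if n_genes > 1 and random.random() < crossover_rate:
                cut = random.randrange(1, n_genes)
                a[cut:], b[cut:] = b[cut:], a[cut:]
            offspring.extend([a, b])
        if len(offspring) < pop_size:  # keep the size stable when pop_size is odd
            offspring.append(parents[-1][:])
        for chrom in offspring:
            for g in range(n_genes):
                if random.random() < mutation_rate:
                    chrom[g] = random.random()
        population = offspring
    return max(population, key=fitness)
```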
A GA can also be used as a Machine Learning (ML)
algorithm and has been shown to be efficient for this
purpose (Goldberg, 1989). The underlying idea is that
nature-inspired algorithms can, in some cases in the
ML field, demonstrate a higher efficiency than
human-designed algorithms (Kluwer Academic
Publishers, 2001). Indeed, actual evolutionary
processes have succeeded in solving highly complex
problems, as proved through probabilistic arguments
(Moorhead and Kaplan, 1967).
In our case, a GA will be used to determine an
adequate set of weightings for the labels present in a
training set. Our training set is composed of
similarities between profiles: two profiles are either
similar (output = 1) or not similar (output = 0).
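Under these assumptions, a training example can be pictured as a pair of profiles with a binary output; the labels and values below are purely illustrative:

```python
# Hypothetical training examples: each pairs two profiles (as mappings
# from label to value) with a binary target, 1 if the profiles concern
# the same person and 0 otherwise. Labels and values are illustrative.
training_set = [
    ({"singer": 1, "city": 1}, {"singer": 1, "city": 1}, 1),  # similar
    ({"singer": 1, "city": 1}, {"singer": 0, "city": 0}, 0),  # not similar
]
```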
3.1 Genetic Representation
The genotype for each chromosome of the population
is the set of all labels in the training set. Each
label is defined as a gene, and the weighting for a
specific label is the allele of the corresponding gene. The
weighting is a value in [0, 1]; it can be interpreted as
the relevance of a label, reaching its best at value
1 and its worst at 0.
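This genotype can be sketched as a mapping from labels (genes) to weightings (alleles); the label names here are illustrative only:

```python
import random

# Illustrative labels; in practice the genes are all labels
# found in the training set.
labels = ["singer", "city", "employer"]

def random_chromosome(labels):
    """A chromosome: one allele (a weighting in [0, 1]) per gene (label)."""
    return {label: random.random() for label in labels}
```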
3.2 Population Initialization
For population initialization, there are two questions:
“What is the initial population size?” and “What is
the procedure to initialize the population?”.
About the population size, the goal is to find a
compromise between the global complexity and the
performance of the solution (Holland, 1992). A small
population may fail to converge and a large one may
demand an excessive amount of memory and
computation time. It turns out that the size of the
initial population has to be carefully chosen, tailored
to address our specific problem.
As a reminder, in our problem, the GA has to
compute a weighting for each of the N labels in the
training set. We evaluated different values for the
population size, as a compromise between efficiency and
complexity, and chose to fix the population size to
2×N.
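This sizing rule can be sketched as follows, assuming each allele is drawn uniformly at random from [0, 1] and using illustrative label names:

```python
import random

def initialize_population(labels):
    """Random initialization with population size fixed to 2*N,
    where N is the number of labels in the training set."""
    n = len(labels)
    return [{label: random.random() for label in labels}
            for _ in range(2 * n)]
```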
Secondly, there are usually two ways to initialize a
population: heuristic initialization or random
initialization. Heuristic initialization, even though it
converges faster to a solution, restricts the GA to
searching for solutions in a specific area, and it may
therefore fail to converge to a global optimum (Hüe,
1997). Random initialization helps prevent the GA
from getting stuck in a local optimum.
In our case, the random initialization consists in