O_i = {I_k : k ∈ 1, ..., M},   (1)

where O_i is the i-th object, I_k is the k-th coordinate of the input vector, which we will call a "feature", and M is the number of features.
Obviously, a single feature can appear in objects belonging to different success categories. After collecting data for a group of students, we obtain a distribution of the features with respect to the success categories into which the students have been classified. Consequently, every feature has a defined probability of belonging to each of the categories:
P_{I_k} = (P_{k1}, P_{k2}, ..., P_{kN}),   (2)

where P_{kn} is the probability that feature number k (I_k) was found in success category number n. We could write this as P(I_k|n).
The probability vector (2) is what defines the meaning of every feature in our approach. More precisely, it is not the vector itself that matters, but its direction. Thus the distance between two features in semantic space is measured as the angle between the respective vectors, or, more conveniently, as the cosine of that angle. This cosine is the semantic similarity measure, which is the basis for further computations. In this way, features with identical meaning have the maximal similarity, equal to 1, and features with completely different meanings have a similarity equal to 0. The similarity S_{kl} between two features P_{I_k} and P_{I_l} is calculated in the standard way as:

S_{kl} = cos α_{kl} = (P_{I_k} · P_{I_l}) / (‖P_{I_k}‖ ‖P_{I_l}‖).   (3)
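Equation (3) is straightforward to compute; a minimal sketch with NumPy (the example vectors are invented for illustration):

```python
import numpy as np

def similarity(p_k, p_l):
    """S_kl = cos(alpha_kl): cosine of the angle between two feature
    probability vectors, as in Eq. (3)."""
    p_k, p_l = np.asarray(p_k, dtype=float), np.asarray(p_l, dtype=float)
    return float(p_k @ p_l / (np.linalg.norm(p_k) * np.linalg.norm(p_l)))

# Vectors pointing in the same direction are maximally similar (S = 1),
# regardless of their lengths; orthogonal vectors give S = 0.
print(similarity([0.8, 0.1, 0.1], [0.4, 0.05, 0.05]))  # close to 1.0
print(similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))    # 0.0
```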
We have to explain the motivation for making such an assumption. If a feature is expressed in natural language, then there is a chance that there will be another feature which is expressed in a different form but has the same meaning. This has already been mentioned as synonymy. Usually, in a carefully designed survey, we will not find the same question twice. Our considerations, however, are of a more general nature. We treat the survey only as a particular example of a more general class of methods in which information is collected through a set of unformalized natural language expressions. A good example of such methods are medical tests, like the physical examination, where the symptoms are described using sentences in natural language. In a set of such descriptions, a large number of synonymous expressions can be found. Even in the case of surveys, a survey that is repeated can be modified many times. The purpose is to optimize the survey to make it maximally friendly and understandable for the individuals being examined, as well as to maximize the intake of valuable information. Formulating a good survey is not easy, because it requires experimental verification and examination of people's answers to the questions. Sometimes several attempts are necessary before the optimal choice of questions is found, and even then the questions can still be modified, especially on a larger time scale. After collecting different versions of the modified survey, there is a large chance of finding many questions with the same or close meaning, but formulated differently.
In many cases we want to integrate the different versions of the surveys in order to integrate the collected data. This is especially useful when we want to use unique historical data, which cannot be recreated. To make the integration of different surveys possible, one should create a mapping between them. For two surveys this could be a direct mapping, but for a larger number of surveys it is much more convenient to build a single ontology which integrates all the surveys. Building such an ontology in a standard approach would require a lot of manual work. In our approach, the mapping is performed automatically, by creating the semantic model of the questions/features and mapping between the features and the model, no matter which version of the survey we have. There is only one requirement to make this comparison possible: the output data (in this case the success categories) have to be the same for the different surveys. The output provides a kind of reference frame for the input data. We assume that the input is something that can vary, so the output needs to remain constant in order to create the mapping between the varying inputs.
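This integration step can be sketched as nearest-neighbour matching over the shared success categories. A hypothetical example (the question names and probability vectors are invented for illustration; the paper does not prescribe this exact matching procedure):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two probability vectors, as in Eq. (3)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical probability vectors over the SAME three output categories --
# the shared reference that makes two survey versions comparable.
survey_a = {"q1": np.array([0.80, 0.10, 0.10]),
            "q2": np.array([0.10, 0.20, 0.70])}
survey_b = {"q1'": np.array([0.75, 0.15, 0.10]),
            "q2'": np.array([0.15, 0.15, 0.70])}

# Map each question of survey A to its most similar question in survey B.
mapping = {qa: max(survey_b, key=lambda qb: cosine(va, survey_b[qb]))
           for qa, va in survey_a.items()}
print(mapping)  # {'q1': "q1'", 'q2': "q2'"}
```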
Identifying two features as having the same meaning does not always imply that the two features have the same meaning according to our common-sense understanding. We should remember that meaning is defined here in computational terms. There is a chance that two features with completely different interpretations (according to our understanding) will have the same meaning according to the presented approach. This results from the fact that the two features are associated with the same educational success classification. Consequently, in terms of computational meaning, they should be treated as synonyms.
Actually, exact synonymy is a rather theoretical concept, because it is very unlikely that we will find two features with semantic similarity equal to 1. To be more realistic, we have to assume that even if the similarity between two features is not 1, they can still be treated as synonyms if their similarity is close to 1.
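A minimal sketch of grouping such near-synonyms, assuming a similarity threshold of 0.95 (an arbitrary choice for illustration, not a value from this work); each feature is grouped with the first earlier feature whose similarity to it exceeds the threshold:

```python
import numpy as np

def synonym_groups(vectors, threshold=0.95):
    """Greedy grouping of features whose cosine similarity (Eq. 3)
    exceeds `threshold`. The threshold value is an assumption."""
    norms = [np.asarray(v, float) / np.linalg.norm(v) for v in vectors]
    groups, assigned = [], [False] * len(vectors)
    for i in range(len(vectors)):
        if assigned[i]:
            continue
        assigned[i] = True
        group = [i]
        for j in range(i + 1, len(vectors)):
            # Normalized vectors: their dot product is the cosine directly.
            if not assigned[j] and float(norms[i] @ norms[j]) > threshold:
                assigned[j] = True
                group.append(j)
        groups.append(group)
    return groups

vecs = [np.array([0.80, 0.10, 0.10]),
        np.array([0.78, 0.12, 0.10]),   # near-synonym of the first feature
        np.array([0.10, 0.10, 0.80])]
print(synonym_groups(vecs))  # [[0, 1], [2]]
```

A proper clustering algorithm over cosine distance would serve the same purpose; this greedy variant only illustrates the idea.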
Thus, to find synonyms, we have to find groups of very similar features. This task can be achieved by clustering in the space of meanings. In this way
DATA 2014 - 3rd International Conference on Data Management Technologies and Applications