Notes, each containing 4-10 statements (the number being randomly assigned), were created. Each statement was randomly assigned to be relevant or not relevant. In addition, each statement was assigned a depth, which is taken into account by the “relative” approach. The restriction was that a statement’s depth could not exceed the number of statements in the Semantic Note. Note that assigning a depth to a statement affects the depths of the remaining statements in the respective Semantic Note.
The range of the first statement’s depth is 1 ≤ d_{s_1} ≤ (|S_n| + 1), the next one’s 1 ≤ d_{s_2} ≤ (|S_n| + 1 − d_{s_1}), and so on. Furthermore,
each statement was randomly assigned as being dependent on some other statement or not. This affects the “dependant” approach.
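To make this setup concrete, the following Python sketch generates such test data. It is only a sketch under stated assumptions, not the original implementation: the names Statement, SemanticNote, and generate_note are ours, the 0.5 probability of a statement being relevant is assumed, and the shrinking depth range is interpreted as a running budget of |S_n| + 1.

import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Statement:
    relevant: bool             # randomly assigned relevance flag
    depth: int                 # depth used by the "relative" approach
    depends_on: Optional[int]  # index of the statement this one depends on, or None


@dataclass
class SemanticNote:
    statements: List[Statement] = field(default_factory=list)


def generate_note(p_relevant: float = 0.5) -> SemanticNote:
    """Generate one synthetic Semantic Note with 4-10 statements."""
    n = random.randint(4, 10)   # |S_n|, drawn at random
    budget = n + 1              # first depth range: 1 <= d_s1 <= |S_n| + 1
    statements: List[Statement] = []
    for i in range(n):
        # Guarded so the range never becomes empty (our assumption).
        depth = random.randint(1, max(1, budget))
        budget = budget - depth  # the next statement's range shrinks by this depth
        # A statement may depend on one of the earlier statements, or on none.
        depends_on = random.choice([None] + list(range(i))) if i > 0 else None
        statements.append(Statement(relevant=random.random() < p_relevant,
                                    depth=depth,
                                    depends_on=depends_on))
    return SemanticNote(statements)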
When generating the relevance values at the Semantic Note level, the basic (baseline) relevance was first calculated by summing up the relevances of the statements and dividing the sum by the total number of statements in the respective Note. The relative relevance was calculated in the same manner, but instead of summing up the plain statement relevances, their relative values were used. In the case of dependency relevance, it was first checked whether the statement in question was relevant. If so, it was checked whether the statement it was labeled dependent on was relevant as well. If that was also true, the statement was labeled as “dependency-relevant”. Finally, the “OM relevance” was simulated as follows: if the statement currently under inspection was not relevant, it did not automatically receive a relevance value of 0, but a randomly assigned floating-point value between 0 and 1. This represents the inclusion of close matches in the calculation.
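Continuing the sketch above, the Note-level calculations can be rendered as follows. Several details are our assumptions rather than quotations of the system: a relevant statement contributes 1 (and a close match its random value) in the OM case, a statement without a dependency label counts as dependency-relevant whenever it is itself relevant, and the 1/depth weighting merely illustrates the “relative” values, which are not spelled out here. The combined relevance, used further below, averages the three alternatives as described in Section 4.2.

def baseline_relevance(note: SemanticNote) -> float:
    """Baseline: sum of plain statement relevances divided by the number of statements."""
    return sum(1.0 for s in note.statements if s.relevant) / len(note.statements)


def relative_relevance(note: SemanticNote) -> float:
    """Relative: like baseline, but summing the statements' relative values
    (illustrated here as a 1/depth weighting of relevant statements)."""
    return sum(1.0 / s.depth for s in note.statements if s.relevant) / len(note.statements)


def dependency_relevance(note: SemanticNote) -> float:
    """Dependency: a statement is "dependency-relevant" if it is relevant and
    the statement it depends on (if any) is relevant as well."""
    count = 0
    for s in note.statements:
        if s.relevant and (s.depends_on is None
                           or note.statements[s.depends_on].relevant):
            count += 1
    return count / len(note.statements)


def om_relevance(note: SemanticNote) -> float:
    """OM: a non-relevant statement gets a random value in (0, 1) instead of 0,
    simulating the inclusion of close matches."""
    return sum(1.0 if s.relevant else random.random()
               for s in note.statements) / len(note.statements)


def combined_relevance(note: SemanticNote) -> float:
    """Combined: average of the relative, OM, and dependency relevances."""
    return (relative_relevance(note) + om_relevance(note)
            + dependency_relevance(note)) / 3.0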
To complement the above-mentioned relevance kinds, the system labeled each Semantic Note as “really” relevant or not at random, regardless of the relevance values derived from the statements. This represented the user’s actual view of the Semantic Note, whereas the above-mentioned relevance values represent the decision support capabilities of our system. The difference between the “real” relevance and the statement-based relevance kinds was tested as-is (with 0 correspondence), with 0.5 correspondence, and with 0.9 correspondence. This consideration is justified because the rules stored in the user profiles are envisaged to have some correlation with the actual relevances. That is, if a user creates a rule stating that she is interested in ice cream, it is indeed justified to assume that, all other things being equal, she will be more interested in an ice cream parlor than in a hot dog stand.
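A minimal sketch of this independent labeling, reusing the hypothetical structures above; the probability parameter anticipates the likelihood of relevance (Lhr) introduced below:

def assign_real_relevance(notes: List[SemanticNote], lhr: float) -> List[bool]:
    """Label each Semantic Note as "really" relevant with probability lhr,
    independently of its statement-based relevance values."""
    return [random.random() < lhr for _ in notes]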
For generating the test set, we varied three things: First, the likelihood of correspondence (Lhc) was set to 0, 0.5, or 0.9. Second, the likelihood of “real” relevance (Lhr) was set to either one half or one quarter. Third, the “real” relevances were reassigned based on the baseline relevance values, based on the combined relevance values, or not reassigned at all. The same value that was used as the threshold for retrieving content throughout the tests, namely 0.5, was also used as the threshold for reassignment. By combining these options, we came up with 18 different test cases, each with 500 generated Semantic Notes.
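Under the same assumptions, the 18 configurations can be enumerated as in the sketch below. The way Lhc is applied, namely realigning a Note’s “real” label with its statement-based relevance at the 0.5 threshold with probability Lhc, and the question of whether Lhc matters when no reassignment is done, reflect our reading of the description rather than the original implementation.

from itertools import product

LHC_VALUES = (0.0, 0.5, 0.9)                    # likelihood of correspondence
LHR_VALUES = (0.5, 0.25)                        # likelihood of "real" relevance
REASSIGNMENT = ("none", "baseline", "combined")
THRESHOLD = 0.5                                 # same threshold as used for retrieval


def build_test_cases(notes_per_case: int = 500):
    """Enumerate the 3 x 2 x 3 = 18 test cases, each with 500 Semantic Notes."""
    cases = []
    for lhc, lhr, mode in product(LHC_VALUES, LHR_VALUES, REASSIGNMENT):
        notes = [generate_note() for _ in range(notes_per_case)]
        real = assign_real_relevance(notes, lhr)
        if mode != "none":
            score = baseline_relevance if mode == "baseline" else combined_relevance
            # With probability lhc, realign the "real" label with the
            # statement-based relevance, using the 0.5 threshold.
            real = [(score(n) >= THRESHOLD) if random.random() < lhc else r
                    for n, r in zip(notes, real)]
        cases.append({"Lhc": lhc, "Lhr": lhr, "reassignment": mode,
                      "notes": notes, "real": real})
    return cases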
4.2 Evaluation Results
We now present some results of the simulations. In the following tables, the approaches are referred to as “Baseline”, “Relative”, “OM”, “Dependant”, and “Combined”. The “Combined” approach is the average of the “Relative”, “OM”, and “Dependant” approaches. Naturally, we could have considered other combinations, too. However, contrasting the alternative approaches with “Baseline” separately and as one combination is sufficient to give us guidelines on their performance.
Basic instruments of information retrieval, namely precision, recall, and the F-measure, were used in the evaluation. As the set of relevant documents, we used the “real” relevance, that is, the relevance which was not derived from the number of statements considered relevant. This way we could compare the decision support of the system with the (simulated) true relevance as considered by the user. In this setting, precision indicates the number of documents which are both retrieved and (“really”) relevant divided by the number of retrieved documents. Recall indicates the number of documents which are both retrieved and (“really”) relevant divided by the number of (“really”) relevant documents. Finally, the F-measure is the harmonic mean of precision and recall, with the formula F = 2 · precision · recall / (precision + recall).
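For clarity, the metrics can be computed directly from the retrieval decisions (made with the 0.5 threshold used throughout the tests) and the “real” labels. The function below continues the earlier sketches and is not the exact evaluation code.

def evaluate(retrieved: List[bool], really_relevant: List[bool]):
    """Precision, recall, and F-measure of the retrieval decisions
    against the (simulated) "real" relevance labels."""
    tp = sum(1 for r, g in zip(retrieved, really_relevant) if r and g)
    n_retrieved = sum(retrieved)
    n_relevant = sum(really_relevant)
    precision = tp / n_retrieved if n_retrieved else 0.0
    recall = tp / n_relevant if n_relevant else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return precision, recall, f_measure


# Example: evaluate the baseline approach on one generated test case.
# case = build_test_cases()[0]
# retrieved = [baseline_relevance(n) >= 0.5 for n in case["notes"]]
# print(evaluate(retrieved, case["real"]))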
Table 1 depicts the precision values for the case where none of the “real” relevance values are tampered with. There is no significant variation among the approaches; the average standard deviation (SD) between the approaches across the different cases is 0.05. If Table 1 is contrasted with Tables 2 and 3, it becomes apparent that more variation among the approaches emerges. The corresponding average SD is 0.12 for Table 2 and 0.15 for Table 3. Naturally, this does not hold for the first two rows, where the likelihood of correspondence (Lhc) is 0. But once the likelihood grows to 0.5, and especially 0.9, differences start to show. This is especially true in the case where the rearrangement of the “real” relevance values is done based on the combined