an evidence in the first branches cannot be used,
several extraction steps are necessary.
Comparing the tree search results to the previous
methods (see Table 4) reveals that tree search achieves
search performance similar to All with between 6 and 7
evidences, to Top7 with between 3 and 4, and to Top7
2-level with between 2 and 3. We infer that the main
advantage of tree search is the optimization of search
and extraction steps compared to simpler methods.
In conclusion, we found a structure that optimizes
search and extraction steps and delivers good search
results with relatively low effort in training time and
calibration of the method. We believe that the informa-
tion gain trees will enable our system to deal with
more complex setups.
7 EVALUATION OF FAILURE DETECTION
In this section, we evaluate the different failure detection strategies with respect to detection performance.
As a preparation for the second strategy (expected DoB), we first evaluate how the degree of belief (DoB)
depends on the search set size and the evidence type.
7.1 Evaluation Setup
We evaluate on the same corpus in two steps:
1. Degree of Belief Values. We conduct Attentive
Task (AT) search on random search setups to
evaluate how the degree of belief (DoB) value
depends on search set size and evidence type.
Further, we repeat the experiment to understand
how the DoB develops in case of a search failure
and to generate a threshold. We repeat each
search setup 20,000 times.
2. Failure Detection Strategies. We evaluate each
of the proposed strategies: 1) specific evidence
types, 2) expected DoB, 3) inclusion of AT tem-
plates, as well as the hybrid strategies 1) & 2) and
1) & 3). We compare them with established classi-
fication measures: precision Pr = tp/(tp + fp),
recall Re = tp/(tp + fn), and accuracy Acc =
(tp + tn)/(tp + fp + tn + fn) (a short computation
sketch follows this list). True positives tp are
correctly detected failures, false positives fp are
non-failures classified as failures, true negatives
tn are correctly detected non-failures, and false
negatives fn are undetected failures. We also conduct
a separate evaluation of the two failure cases. Ad-
ditionally, we aim at minimizing the costly search
steps. Experiments were repeated 80,000 times
for each setup, with the number of search steps,
i.e., the number of evidences used for search,
varying from 1 to 6.
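The following minimal Python sketch illustrates how these three measures are computed from the
four outcome counts; the counts in the usage line are illustrative placeholders, not results from our
experiments.

    def classification_measures(tp, fp, tn, fn):
        # precision: fraction of detected failures that are actual failures
        precision = tp / (tp + fp)
        # recall: fraction of actual failures that were detected
        recall = tp / (tp + fn)
        # accuracy: fraction of all cases that were classified correctly
        accuracy = (tp + tn) / (tp + fp + tn + fn)
        return precision, recall, accuracy

    # Illustrative counts only (not measured values).
    print(classification_measures(tp=120, fp=30, tn=800, fn=50))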
7.2 Degree of Belief Evaluation
We repeated the DoB experiments for different AT
search sets and varied the search set size, the number of
evidences used as input for search, and the type of evidence
group. For the evidence groups, we differentiate between
the best performing evidence types from our previous
work (Top7), all evidence types (All), and all possi-
ble evidence types without the Top7 (All w/o Top7).
For each search experiment, we generate a random
AT set of random size, select one corresponding doc-
ument, extract evidences according to the evidence
type group, and execute search. The findings depicted
in Figures 4 (a) - (c) are summarized as follows:
Search Set Size. The larger the search set, the
smaller the DoB value of the corresponding AT.
Figure 4 (a) displays the DoB development for all
evidence types when one evidence is used. This
effect diminishes with an increasing number of ev-
idences (see Figure 4 (b)). To measure the devel-
opment over the search set size, we calculate the com-
pound growth rate (CGR, see footnote 3) between search
set sizes 2 and 17 for different evidence numbers
(a short sketch of this computation follows this list).
Evidence Type Performance. The well performing
evidence types (Top7) decrease the DoB value
less than the others. We derive that the DoB is
more stable for calibrated searches.
Search Failures. Comparing the DoB for successful
searches with the top DoB for failures shows that
only a few selected evidence type combinations
result in relevant differences between the values
(see Figure 4 (c)). This is caused by many ev-
idences also matching one or more incorrect
ATs. In such a case, these ATs get a higher match-
ing value, and after normalization their value is
similar to that of the correct AT. For the expected
DoB strategy, we use only the selected evidence types
and half of the average difference as the threshold.
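The CGR from footnote 3 can be computed directly from the average DoB values per search set size;
the following Python sketch uses placeholder averages, not our measured values.

    def cgr(avg_dob, s0, sn):
        # Compound growth rate of the average DoB between search set sizes s0 and sn.
        return (avg_dob[sn] / avg_dob[s0]) ** (1.0 / (sn - s0)) - 1

    # Placeholder average DoB values per search set size (illustrative only).
    avg_dob = {2: 0.85, 17: 0.40}
    print(cgr(avg_dob, s0=2, sn=17))  # negative rate: the DoB shrinks as the search set grows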
We conclude that in our setup the DoB value is highly
sensitive to search set size and evidence type. Only a
few evidence type combinations allow using the distance
between the observed DoB and an expected DoB to identify
whether the corresponding AT is not included in the
search set. We will further evaluate the related strat-
egy, but these results indicate that the expected DoB
strategy could become fragile in other domains.
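As a minimal sketch of this expected-DoB check, the following Python snippet flags a failure when the
observed top DoB falls short of the expected DoB by more than the threshold; the expected value and the
average difference used here are hypothetical placeholders, not calibrated numbers from our evaluation.

    def is_search_failure(observed_dob, expected_dob, threshold):
        # Flag a failure if the observed DoB is too far below the expected DoB.
        return (expected_dob - observed_dob) > threshold

    # Hypothetical calibration: expected DoB for the selected evidence types and
    # half of an assumed average difference of 0.2 between successes and failures.
    expected_dob = 0.7
    threshold = 0.5 * 0.2
    print(is_search_failure(observed_dob=0.45, expected_dob=expected_dob, threshold=threshold))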
3 CGR(s_0, s_n) = (avgDoB(s_n) / avgDoB(s_0))^(1/(s_n - s_0)) - 1, where s_i is the search set
size and avgDoB(s_i) is the average DoB for search set size s_i.