2009). It is assessed by averaging the durations of all
false detections produced by the system at a single
operating point (with a chosen threshold). In a real
application, FD/h indicates the number of times a
clinician has to check the results of an automatic
detector in vain; however, not only the number of
occasions but also the total amount of time involved
should be reported. For instance, suppose two systems
both achieve a 90% good seizure detection rate, the
first at a cost of 1 FD/h with a mean duration of
20 min per detection and the second at a cost of
2 FD/h of 1 min each. The second system may be
preferred, as the results of the first imply that a
clinician has to check the EEG recording in vain for
~33% of the time, compared with only ~3% of the time
for the second.
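As a minimal sketch of this arithmetic (the helper name and the numbers are ours, reproducing the example above):

```python
def wasted_review_fraction(fd_per_hour, mean_fd_duration_min):
    """Fraction of each recorded hour spent reviewing false detections."""
    return fd_per_hour * mean_fd_duration_min / 60.0

# System A: 1 FD/h, each lasting ~20 min -> ~33% of the time checked in vain.
print(wasted_review_fraction(1, 20))  # 0.333...
# System B: 2 FD/h, each lasting ~1 min -> ~3% of the time checked in vain.
print(wasted_review_fraction(2, 1))   # 0.0333...
```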
The performance curve is obtained for healthy
babies and compared to the curve for sick babies
reported in (Temko et al., 2009), using the FD/h
metric while varying the threshold on the probability
of a seizure. The results are reported in Figure 4. To
make the results on sick and healthy babies
comparable, the same N-fold cross-validation is used
here (N=17). That is, each of the 47 healthy babies is
tested N times, using N models trained on N-1 sick
patients, with a normalization coefficient for each
channel in the montage calculated on the remaining
46 healthy patients. This way, the performances on
healthy and sick babies are directly comparable, as
the same model is used to test the held-out sick baby
and all healthy babies (which are never used in
training).
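A minimal sketch of this evaluation protocol is given below. The data loaders, array shapes, and the energy-ratio form of the per-channel coefficient are our assumptions for illustration, not the authors' implementation; classifier training and scoring are elided.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the recordings: (n_channels, n_samples) arrays.
sick = [rng.normal(0.0, 1.0, (8, 1000)) for _ in range(17)]
healthy = [rng.normal(0.0, 0.5, (8, 1000)) for _ in range(47)]

def background_energy(x):
    """Mean per-channel energy; a stand-in for the background-energy estimate."""
    return (x ** 2).mean(axis=1)

N = 17
for fold in range(N):
    train = [p for i, p in enumerate(sick) if i != fold]
    # Reference background energy of the sick training database.
    ref = np.mean([background_energy(p) for p in train], axis=0)
    # ... train the classifier on `train` and test it on sick[fold];
    # on sick data the coefficient is one, so the signal (and hence
    # the good seizure detection rate) is unchanged.

    for j, baby in enumerate(healthy):
        # Per-channel coefficients come from the 46 *other* healthy
        # babies, never from the baby under test.
        others = [b for k, b in enumerate(healthy) if k != j]
        bg = np.mean([background_energy(b) for b in others], axis=0)
        coeffs = np.sqrt(ref / bg)           # assumed form of the coefficient
        normalized = baby * coeffs[:, None]  # scale before feature extraction
        # ... test the fold's classifier on `normalized`.
```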
As can be seen from Figure 4, the performance
of the seizure detection system on healthy babies
before normalization is much worse than its
performance on sick babies. However, after
normalization, the FD/h curve for healthy babies is
consistently better than that for sick babies.
Additionally, the mean duration of false detections
on healthy babies is significantly lower than that on
sick babies. It is worth noting that, because the
normalization coefficients are calculated to
normalize to the background energy of the training
database, the performance in terms of good seizure
detection rate on sick babies is not changed: the
resulting coefficient is equal to one (i.e. no change
is applied to the sick baby signals).
Interestingly, after a certain point (~0.65 in
our case), the FD/h for healthy babies becomes
larger than that for sick babies. This shows that the
performance on sick and healthy babies cannot be
compared over the full range of FD/h. For instance,
the statistics of the sick-baby database show an
average of ~2.6 seizures per hour, which naturally
restricts the maximum number of false detections a
given algorithm can produce on this dataset. For
healthy babies, however, there is no upper limit
on false detections, so beyond a certain threshold the
algorithm stops producing false detections on sick
patients, due to the presence of actual seizures,
while still producing them on healthy babies.

Figure 4: The curves of performance for sick and healthy
babies for the FD/h metric. [Plot: false detections per hour
versus threshold on probability of non-seizure (1 - probability
of seizure), for sick patients, healthy patients before
normalization, and healthy patients after normalization; points
annotated with mean false-detection durations of 1.5m to 2.4m.]
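The cap can be illustrated with a toy calculation (the mean seizure duration and the firing rates below are assumed for illustration): on sick records, detector firings that overlap actual seizure time count as true detections, so only the non-seizure part of each hour can produce false detections, while on healthy records every firing is false.

```python
# Toy model of FD/h saturation (all numbers assumed for illustration).
seizure_fraction = 2.6 * 5 / 60  # ~2.6 seizures/h of ~5 min each

for firings_per_hour in (0.5, 1.0, 2.0, 4.0, 8.0):  # grows as threshold loosens
    fd_sick = firings_per_hour * (1 - seizure_fraction)  # seizures absorb firings
    fd_healthy = firings_per_hour                        # no seizures: all false
    print(f"{firings_per_hour:.1f} firings/h -> "
          f"sick {fd_sick:.2f} FD/h, healthy {fd_healthy:.2f} FD/h")
```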
This phenomenon first reveals the range of
thresholds that is practically useful for the
designed algorithm, which could not have been seen
by testing on sick babies alone. In our case, the
threshold on the probability of seizure should be set
higher than 0.35 to guarantee the reported
performance for all possible test patients.
Apart from revealing the practically useful range
of thresholds, testing on healthy patients shows how
the statistics of a dataset can affect the metrics
that measure the performance of the system. In
other words, the same algorithm tested on different
datasets can yield different metric values
depending on the density of seizures in those datasets.
For instance, in (Mitra et al., 2009) the average
number of seizures per hour was ~4.9, in
(Navakatikyan et al., 2006) there were ~4 seizures
per hour, and in (Deburchgraeve et al., 2008) ~3.3
seizures per hour. Compared with the datasets in
these studies, the results obtained on our dataset,
with ~2.6 seizures per hour, can be seen as an
over-pessimistic performance assessment.
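Under an assumed mean seizure duration (5 min here; the actual durations in these datasets differ), the non-seizure time available to generate false detections in each cited dataset can be compared directly:

```python
# Seizure densities cited above; a 5 min mean seizure duration is assumed.
densities = {"Mitra et al., 2009": 4.9,
             "Navakatikyan et al., 2006": 4.0,
             "Deburchgraeve et al., 2008": 3.3,
             "our dataset": 2.6}

for study, per_hour in densities.items():
    non_seizure = 1 - per_hour * 5 / 60  # fraction of each hour without seizure
    print(f"{study}: ~{non_seizure:.0%} of each hour can yield false detections")
```

With these assumptions, our dataset leaves roughly 78% of each hour free to generate false detections, against roughly 59% for the densest cited dataset, which is why the FD/h measured here tends toward the pessimistic side.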
In fact, the large difference between the FD/h
obtained on healthy babies and on sick babies
suggests that the results on sick and on healthy
babies should be reported separately, as was done
in (Mitra et al., 2009). In a certain sense, these
values indicate the average upper and lower bounds
on the FD/h achievable in practice. If reported
together, the final FD/h score will be skewed by the
amount of healthy-baby data, which can differ from
study to study (Navakatikyan et al., 2006;
Deburchgraeve et al., 2008). For example, in our
study, the developed