but they differ by less than 1 percent.
Next, we perform a pool-based active learning session using Algorithm 1 with our active learning GUI to train these baseline detectors and improve their performance. With our interface, we collect training images from Flickr, which was a source of VOC2007 images. We used the Flickr Image API to obtain only images posted after the VOC2007 competition date, to prevent the possibility of obtaining test data from the competition. Our interface allows a human user to search for images based on a query word such as "cow," "motorbike," etc. We collected the first 3000 images from the Flickr database for each category. To further speed up the training process, we also cache visual features: we scan the images with our baseline detectors and sort them by the uncertainty score in Eq. (6) of the highest-scoring detection window.
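As a rough illustration of this ranking step, the sketch below scans each downloaded image, keeps its highest-scoring detection window, and sorts the images by the uncertainty of that window. The helper names `scan_windows` and `uncertainty` are placeholders standing in for the paper's detector and the score of Eq. (6), which is not reproduced here; this is a sketch of the idea, not the authors' implementation.

```python
# Minimal sketch of the pre-processing step described above: scan each
# downloaded image with the baseline detector, keep its highest-scoring
# window, and rank images by the uncertainty of that window so that the
# most ambiguous ones are queried first.  `scan_windows` and `uncertainty`
# are hypothetical stand-ins for the detector and Eq. (6).

def rank_images_by_uncertainty(image_paths, scan_windows, uncertainty):
    """Return (image_path, best_window, uncertainty) tuples, most uncertain first."""
    ranked = []
    for path in image_paths:
        windows = scan_windows(path)                     # [(bbox, detection_score), ...]
        if not windows:
            continue
        bbox, score = max(windows, key=lambda w: w[1])   # highest-scoring window
        ranked.append((path, bbox, uncertainty(score)))  # Eq. (6) would be applied here
    ranked.sort(key=lambda item: item[2], reverse=True)  # most uncertain first
    return ranked
```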
It takes approximately 3-4 hours to search, download
and scan 3000 images (around 164 million bounding
windows) to generate the set of most-uncertain query
windows. However, this preprocessing can be done
with no human interaction other than specifying the
tag. The effectiveness of active learning can be seen
in Figure 4, which shows the best 5 and worst 5 actively selected query windows in each category. The worst 5 query windows in the bicycle category contain target objects that are easily classified, whereas the worst 5 query windows in the chair category do not contain any target objects. In the sofa category, the worst 5 query windows include both ones that contain easy target objects and ones without any.
Table 1 summarizes the results and training data
statistics. We downloaded a total of 60,000 images from Flickr and annotated bounding windows in 300 actively selected images for each category, based on the uncertainty of the highest-scoring detection window in each image. We used the highest-scoring detection window for sorting images in order to obtain as many positive labels as possible from a large unlabelled data pool.
data pool. We then conducted two simulations. One
is a common active learning approach in which we answer just the 300 query windows chosen by our uncertainty sampling criterion (ALORquery). The other fully utilizes our interface, where we are allowed not only to answer these 300 query windows but also to add user-selected query windows (ALORfull). The human interactive labelling of these 300 image queries took around 20 minutes, and fully utilizing our interface and giving more annotations for each category required less than 40 minutes.
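The difference between the two protocols can be summarized by the sketch below: in ALORquery the user only answers the machine-selected query windows, while in ALORfull the same session additionally accepts user-drawn windows. The callables `oracle_answer` and `oracle_extra_windows` are hypothetical stand-ins for the GUI interaction, not the authors' implementation.

```python
# Illustrative sketch of the two labelling protocols compared above.
# `oracle_answer` and `oracle_extra_windows` are placeholders for the
# human interaction provided by the GUI.

def labelling_session(query_windows, oracle_answer, oracle_extra_windows=None):
    """Collect labels for machine-selected queries, optionally plus user-added windows."""
    labels = []
    for image, window in query_windows:                       # e.g. 300 queries per category
        labels.append((image, window, oracle_answer(image, window)))   # Yes/No/Maybe answer
        if oracle_extra_windows is not None:                   # ALORfull only
            for extra in oracle_extra_windows(image):          # windows drawn by the user
                labels.append((image, extra, "yes"))
    return labels

# ALORquery: labelling_session(queries, answer)
# ALORfull:  labelling_session(queries, answer, draw_missed_objects)
```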
Figure 5: Performance comparison of ALORfull and ALORquery. Each point represents the average number of additional positive labels and the average training time per category. By allowing a user to add additional queries, the performance is consistently better. Note that just 53 labels from ALORfull outperform 140 labels from ALORquery.

Figure 5 shows the performance of ALORquery and ALORfull for different numbers of training images. ALORfull consistently outperforms ALORquery, which indicates that user-selected queries improve the overall detection performance. Additional results in Table 2 demonstrate the impact of user-selected queries on the final
classification performance. In the table, we show the
result of ALORquery with 300 images and ALORfull
with 100 and 300 images. For ALORfull, around 40% of the queries are selected by the user. Those queries are often quite challenging for a machine to choose, because they tend to be erroneous or missed detections that lie far from the decision boundary of the latent SVM. A human oracle is quite helpful in such cases.
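To illustrate why such windows evade machine selection, the toy snippet below uses a margin-based uncertainty, one common choice for uncertainty sampling (Eq. (6) in the paper may differ): windows whose detection scores lie far from the decision boundary, including confidently missed objects and confident false positives, receive low uncertainty and would not be queried automatically.

```python
# Toy illustration of the point above, assuming a margin-based uncertainty:
# the score is highest when the SVM detection score is close to the decision
# boundary (0), so confidently wrong or missed detections are never queried.

def margin_uncertainty(svm_score):
    """Higher when the detection score is close to the decision boundary (0)."""
    return -abs(svm_score)

print(margin_uncertainty(0.05))   # near-boundary window   -> high uncertainty (-0.05)
print(margin_uncertainty(-2.3))   # confidently missed object -> low uncertainty (-2.3)
print(margin_uncertainty(1.8))    # confident false positive  -> low uncertainty (-1.8)
```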
Fig. 6 presents the gain in average precision of our best result for each category. From our experiments, we can make several observations. First, in
most of the categories, our active learning interface
allows a user to quickly improve the performance of
the baseline detectors. Second, our user interface also
allows a user to achieve a better performance than
would be obtained by a simpler learning approach in
which a user answers Yes/No/Maybe queries for selected windows. Many difficult machine-selected queries that a user is not sure about (Fig. 3(c), for example) can be easily corrected with our interface.
Third, with less than 40 minutes of user input, we can
achieve significant performance improvement even
over the best competition results. The PASCAL competition has a section in which users can provide their
own data, but the difficulty of collecting such data
means there have seldom been entries in that section.
Our active learning approach and GUI would enable
users to efficiently collect useful data for improved