4 EXPERIMENTAL EVALUATION
In this section, we describe experiments designed to
answer the following questions: a) Can our algorithm
be used to estimate visualization diversity, a weaker
quality metric sufficient for many of our target appli-
cations? (Yes; Table 3) b) Can our algorithm effec-
tively extract correct sub-figures, a stronger quality
metric? (Yes; Table 3) c) Could a simpler method
work just as well as our algorithm? (No; Table 3) d)
Is step 3 of the algorithm (selection) necessary and ef-
fective? (Yes; Figure 11) e) Where does the algorithm
make mistakes? (Figure 14)
The corpus we used for our experiments was col-
lected from the PubMed database. We selected a ran-
dom subset of the PubMed database by collecting all
tar.gz files from 188 folders (from //pub/pmc/ee/00 to
//pub/pmc/ee/bb); these files contain the PDF files of
the papers as well as the source images of the figures,
so figure extraction was straightforward. In order to
filter non-figure images such as logos, banners, etc.,
we only used images of size greater than 8KB. We
manually identified the composite figures (we have a
classifier that can recognize multi-chart figures, but
for this experiment we wanted to avoid the additional
dependency) and divided them into a testing set and a
training set. We trained the classifier and performed
cross-validation on the training set, reserving the
test set for a final experimental evaluation. The test-
ing set S for the experiments contains 261 compos-
ite figures related to biology, biomedicine, or bio-
chemistry. Each figure contains at least two different
types of visualizations; e.g., a line plot and a scatter
plot, a photograph and a bar chart, etc. For ease of
evaluation, we excluded multi-chart figures composed
of sub-figures of a single type from this experiment,
for reasons explained under the first question below.
We evaluated performance in two ways: (1) type-based evaluation, a
simpler metric in which we attempt to count the num-
ber of distinct types of visualizations within a single
figure, and (2) chart-based evaluation, a stronger met-
ric in which we attempt to perfectly recover all sub-
figures within a composite figure.
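The size-based filtering step described above can be sketched as follows. The 8 KB threshold comes from the text; the function name, the directory-walking approach, and the set of image extensions are illustrative assumptions:

```python
import os

# Per the text: images at or below 8 KB are assumed to be
# logos, banners, and other non-figure images.
MIN_SIZE_BYTES = 8 * 1024

def collect_candidate_figures(extracted_dir):
    """Walk an extracted tar.gz directory tree and keep
    image files large enough to plausibly be figures."""
    kept = []
    for root, _dirs, files in os.walk(extracted_dir):
        for name in files:
            # Illustrative extension list; the corpus may differ.
            if not name.lower().endswith(
                (".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff")
            ):
                continue
            path = os.path.join(root, name)
            if os.path.getsize(path) > MIN_SIZE_BYTES:
                kept.append(path)
    return kept
```

Filtering by file size is a coarse heuristic, but it cheaply removes most decorative images before the manual identification of composite figures.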
Can Our Algorithm Be Used to Estimate Visual-
ization Diversity? The motivation for type-based
evaluation is that some of our target applications in
bibliometrics and search services need only know the
presence or absence of particular types of visualiza-
tions in each figure to afford improved search or to
collect aggregate statistics — it is not always required
to precisely extract a perfect sub-figure, as long as we
can tell what type of figure it is. For example, the
presence or absence of an electrophoresis gel image
appears to be a strong predictor of whether the paper
is in the area of experimental molecular biology; we
need not differentiate between a sub-figure with one
gel and a sub-figure with several gels. Moreover, it is
not always obvious what the correct answer should be
when decomposing collections of sub-figures of ho-
mogeneous type: Part of Figure 14(e) contains a num-
ber of repeated small multiples of the same type — it
is not clear that the correct answer is to subdivide all
of these individually. Intuitively, we are assessing the
algorithms’ ability to eliminate ambiguity about what
types of visualizations are being employed by a given
figure, since this task is a primitive in many of our
target applications.
To perform type-based evaluations we label a test
set by manually counting the number of distinct visu-
alization types in each composite figure. For exam-
ple, Figure 3 has two types of visualizations, a line
chart and a bar chart; Figure 6 also has two types of
visualizations, a line chart and an area chart; Figure
10(a) also has two types of visualizations, bar charts
and electrophoresis gels. We then run the decompo-
sition algorithm and manually distinguish correct ex-
tractions from incorrect extractions. Only homoge-
neous sub-images — those containing only one type
of visualization — are considered correct. For exam-
ple, the top block in Figure 10(a) is considered cor-
rect, because both sub-figures are the same type of vi-
sualization: an electrophoresis gel image. The bottom
two blocks of Figure 10(a) are considered incorrect,
since each contains both a bar chart and a gel.
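Under this criterion, a block is judged correct exactly when all of its sub-figures share a single visualization type. A minimal sketch of the check, with illustrative type labels:

```python
def is_homogeneous(block_types):
    """A block counts as a correct extraction only if every
    sub-figure it contains has the same visualization type."""
    return len(set(block_types)) == 1

# e.g., the top block of Figure 10(a): two gel images -> correct
top_block_ok = is_homogeneous(["gel", "gel"])
# the bottom blocks: a bar chart plus a gel each -> incorrect
bottom_block_ok = is_homogeneous(["bar_chart", "gel"])
```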
Using only the homogeneous sub-images (the het-
erogeneous sub-images are considered incorrect), we
manually count the number of distinct visualization
types found for each figure. We compare this number
with the number of distinct visualization types found
by manual inspection of the original figure. For exam-
ple, in Figure 10(a), the algorithm produced one ho-
mogeneous sub-image (the top portion), so only one
visualization type was discovered. However, the orig-
inal image has two distinct visualization types. So our
result for this figure would be 50%.
To determine the overall accuracy we define
a function diversity : Figure → Int as
diversity(f) = |{type(s) | s ∈ decompose(f)}|,
where decompose returns the set of subfigures and
type classifies each subfigure as a scatterplot, line
plot, etc. The final return value is the number of
distinct types that appear in the figure. We then
sum the diversity scores for all figures in the cor-
pus. We compute this value twice: once using our
automatic version of the decompose function and
once using a manual process. Finally, we divide the
total diversity computed automatically by the total
ICPRAM 2015 - International Conference on Pattern Recognition Applications and Methods
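The type-based accuracy computation described above can be sketched as follows. It assumes each figure's homogeneous sub-images have already been labeled with a type; the function names and the example labels are illustrative, not part of the original system:

```python
def diversity(subfigure_types):
    """diversity(f) = |{type(s) | s in decompose(f)}|: the number
    of distinct visualization types among the (homogeneous)
    sub-figures recovered from one composite figure."""
    return len(set(subfigure_types))

def type_based_accuracy(auto_types, manual_types):
    """Ratio of the summed automatic diversity scores to the
    summed manual (ground-truth) diversity scores over the corpus.
    Each argument is a list (one entry per figure) of type-label lists."""
    auto_total = sum(diversity(t) for t in auto_types)
    manual_total = sum(diversity(t) for t in manual_types)
    return auto_total / manual_total

# Figure 10(a) as described in the text: the algorithm yields one
# homogeneous type ("gel"), while manual inspection finds two
# ("gel" and "bar_chart"), giving 50% for that figure.
```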