next number in that row. However, several difficulties
had to be overcome. For example, attribute names or
quantities may be incorrectly transcribed or omitted
altogether, and the alignment process may fail if the
chart is still slightly rotated.
To address these potential problems, domain-specific
knowledge was again taken into account. To mitigate
the omission of words, we relied on the structure of
nutritional information tables, which is always the
same: the attributes always appear in the following
order: energy value, fats, saturated fats,
carbohydrates, sugar, proteins and salt. Therefore, if
“fats” and its quantity are found in a row and the
following row contains only a number, that number
can be assumed to correspond to the amount of
saturated fats. This improves accuracy, although it is
not always effective, because some charts show
additional information. For example, when “fiber” is
present it appears between “proteins” and “salt”, so
the system may confuse “fiber” with “salt”.
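As an illustration, the following sketch shows how this order-based inference could be implemented, assuming the OCR output is available as a list of (label, value) row pairs; the identifiers are ours, not part of the actual system, and optional rows such as “fiber” are not handled here.

```python
# Canonical attribute order on EU nutritional tables. Optional rows
# (e.g. "fiber" between "proteins" and "salt") are not handled by
# this simple sketch.
ATTRIBUTE_ORDER = ["energy", "fats", "saturated fats",
                   "carbohydrates", "sugar", "proteins", "salt"]

def infer_missing_labels(rows):
    """rows: list of (label_or_None, value) pairs from the OCR output.
    A row with a bare number is assigned the attribute that follows
    the last recognized one in the canonical order."""
    last_idx = -1
    labeled = []
    for label, value in rows:
        if label in ATTRIBUTE_ORDER:
            last_idx = ATTRIBUTE_ORDER.index(label)
        elif label is None and 0 <= last_idx < len(ATTRIBUTE_ORDER) - 1:
            # Assume the bare number belongs to the next attribute.
            last_idx += 1
            label = ATTRIBUTE_ORDER[last_idx]
        labeled.append((label, value))
    return labeled
```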
The amount of energy is more complicated, as it is
usually indicated both in kilojoules and in
kilocalories. Moreover, sometimes both values appear
in the same row and sometimes in consecutive rows.
On the other hand, this redundancy also helps to
overcome omissions, as the kilojoules of energy can
be calculated from the kilocalories and vice versa.
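A minimal sketch of this redundancy-based completion, using the standard conversion factor 1 kcal = 4.184 kJ, could look as follows (the function name is illustrative):

```python
KJ_PER_KCAL = 4.184  # standard conversion: 1 kcal = 4.184 kJ

def complete_energy(kj=None, kcal=None):
    """Derive the missing energy value from the other one, if possible."""
    if kj is None and kcal is not None:
        kj = round(kcal * KJ_PER_KCAL)
    elif kcal is None and kj is not None:
        kcal = round(kj / KJ_PER_KCAL)
    return kj, kcal
```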
In addition, these tables sometimes include a
second column, besides the one that indicates
amounts per 100 g, which shows the amount per
serving. This situation is identified by checking
whether there are two quantities per row. In that case,
the second column can be used to compensate for
omissions or inaccuracies in the first one. This is
done by computing the proportion between the
quantities in the first column and their counterparts
in the second. The most common proportion is
assumed to be the correct one and is used to correct
or recompute inaccurate or omitted quantities.
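A minimal sketch of this proportion-based correction might look as follows, assuming both columns are available as attribute-to-quantity mappings with None for omitted values; all identifiers are illustrative:

```python
from collections import Counter

def fill_from_serving_column(per_100g, per_serving):
    """per_100g, per_serving: dicts mapping attribute -> quantity or None.
    Estimates the 100 g / serving ratio from rows where both values were
    transcribed, takes the most common (rounded) ratio as correct, and
    uses it to fill in missing per-100 g quantities."""
    ratios = [round(per_100g[a] / per_serving[a], 2)
              for a in per_100g
              if per_100g[a] is not None and per_serving[a]]
    if not ratios:
        return per_100g
    ratio = Counter(ratios).most_common(1)[0][0]
    return {a: (v if v is not None else
                (per_serving[a] * ratio
                 if per_serving[a] is not None else None))
            for a, v in per_100g.items()}
```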
With these and other similar heuristics, the data is
extracted from the transcriptions, corrected and
completed. In this way, the quantity per 100 g of each
of the 7 considered attributes is acquired. As energy
is recovered both in kilojoules and in kilocalories, 8
values are finally obtained as the result for each image.
Regarding the evaluation of this whole transcription
system, a result is considered correct if it deviates
from the ground truth by less than 20%. For example,
if the ground truth were 0.4 g, a value between 0.32 g
and 0.48 g would be accepted. This threshold was
chosen based on (Guidance document for competent
authorities for the control of compliance with EU
Legislation, 2012), which establishes 20% as the
maximum acceptable error in the labeling of
nutritional facts. In this way, a meaningful accuracy
percentage can be computed for our system.
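For illustration, this acceptance criterion amounts to a simple relative-error check (the function name and the zero-value handling are our assumptions):

```python
def within_tolerance(predicted, ground_truth, tol=0.20):
    """A transcribed quantity counts as correct if it deviates from
    the ground truth by less than 20% (relative error)."""
    if ground_truth == 0:
        return predicted == 0
    return abs(predicted - ground_truth) / ground_truth < tol

# Example from the text: for a ground truth of 0.4 g, values around
# 0.32 g .. 0.48 g are accepted.
assert within_tolerance(0.33, 0.4) and not within_tolerance(0.49, 0.4)
```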
3 EXPERIMENTS
In our system, three main processes had to be
trained, optimized and evaluated:
• The convolutional neural network that
classifies the charts as light letters over a dark
background or dark letters over a light
background.
• The genetic algorithm used to optimize the
parameters of the filters applied to the images
during preprocessing.
• The transcription itself, in particular the
effectiveness of the complementary contour-
based transcription and of the post-processing
applied to the result.
For the first task, a small dataset of 347 pictures
was collected with 3 smartphones in a realistic
environment. The photos were taken of various food
products in a supermarket. From each picture, the
nutritional information chart was cropped, so that
only the necessary information was present. This
dataset was divided into 315 training samples and 32
test samples. All these images were manually labeled
with a binary class indicating whether they showed
light letters over a dark background or dark letters
over a light background.
The genetic algorithm is a more time-demanding
process, as it implies preprocessing all the pictures 30
times per generation. For this reason, we decided to
run it on a smaller set of 40 images selected from the
dataset described above. These pictures had to be
manually labeled with the key values needed to
compute the fitness function, namely the number of
decimal numbers in the chart and the number of
quantities followed by an explicit unit. A vocabulary
of 76 terms for this domain was also elaborated.
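As an illustration, a fitness function along these lines could compare the counts found in the OCR output against the manually labeled values; the regular expressions and the error measure below are assumptions for the sketch, not the exact formula used by the system:

```python
import re

# Illustrative patterns: decimal numbers, and quantities followed by a unit.
DECIMAL_RE = re.compile(r"\d+[.,]\d+")
UNIT_RE = re.compile(r"\d+(?:[.,]\d+)?\s*(?:g|mg|kj|kcal)\b", re.IGNORECASE)

def fitness(transcription, expected_decimals, expected_with_unit):
    """Scores a preprocessing parameter set by how close the counts in
    the resulting OCR transcription are to the manually labeled counts
    (lower is better)."""
    decimals = len(DECIMAL_RE.findall(transcription))
    with_unit = len(UNIT_RE.findall(transcription))
    return (abs(decimals - expected_decimals)
            + abs(with_unit - expected_with_unit))
```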
As the transcription is the most important part of
our system and we wanted more accurate and reliable
results, a bigger and more diverse dataset of 633
images was collected. These pictures contain the
nutritional facts from labels in Spanish and
correspond to different food and beverage products
(excluding alcoholic beverages). The pictures were
taken using 5 different smartphones with
conventional cameras. During acquisition, glare,
reflections, blur and other defects were avoided as
much as possible.
This dataset was also manually labeled, extracting
for each image the amount of each of the nutritional
characteristics per 100 g and per serving. For the