next number in that row. However, several difficulties
had to be overcome. For example, words may be
incorrectly transcribed or omitted, quantities may be
incorrectly transcribed, and the alignment process may
fail if the chart is still slightly rotated.
To solve these potential problems, the specific
information of our domain was again taken into
account. To mitigate the omission of words, we relied
on the structure of the nutritional information tables,
which is always the same. The attributes always appear
in the following order: energy value, fats, saturated
fats, carbohydrates, sugar, proteins and salt.
Therefore, if “fats” is found in a row together with its
quantity, and the following row contains only a number,
it can be assumed that this number corresponds to the
amount of saturated fats. This improves the accuracy,
although it is not always effective, because some
charts show additional information. For example, if
“fiber” is present, it appears between “proteins” and
“salt”, so the system may confuse “fiber” with “salt”.
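The order-based inference described above can be sketched as follows. This is a minimal illustration and not the implementation used in the system: the names ORDER and fill_missing_labels, and the representation of rows as (label, quantity) pairs with None for an omitted label, are assumptions made for the example.

```python
# Canonical attribute order stated in the text.
ORDER = ["energy", "fats", "saturated fats", "carbohydrates",
         "sugar", "proteins", "salt"]

def fill_missing_labels(rows):
    """Assign a label to rows whose word was omitted by the transcription.

    rows is a list of (label_or_None, quantity) pairs in reading order
    (a hypothetical representation). A row without a label receives the
    attribute that follows the previous row's label in ORDER.
    """
    filled = []
    previous = None
    for label, quantity in rows:
        if label is None and previous in ORDER:
            next_index = ORDER.index(previous) + 1
            if next_index < len(ORDER):
                label = ORDER[next_index]
        filled.append((label, quantity))
        previous = label
    return filled

rows = [("fats", 3.5), (None, 1.2), ("carbohydrates", 10.0)]
result = fill_missing_labels(rows)
```

As noted in the text, a sketch like this mislabels rows whenever a chart contains an extra attribute such as “fiber”, because the next expected label is then wrong.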
The amount of energy is more complicated, as it is
usually indicated in both kilojoules and kilocalories.
Moreover, the two values sometimes appear in the same
row and sometimes in consecutive rows. However, this
redundancy also helps to overcome omissions, as the
energy in kilojoules can be calculated from the
kilocalories and vice versa.
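As a minimal sketch of this recovery step, the following function fills in whichever energy value is missing. It assumes the standard conversion 1 kcal = 4.184 kJ; the function name and the use of None for an omitted value are illustrative choices, not taken from the system. Since label values are rounded, a recovered value may differ slightly from the printed one.

```python
KJ_PER_KCAL = 4.184  # standard conversion factor (assumed here)

def complete_energy(kj, kcal):
    """Return (kj, kcal), computing the missing value from the other.

    Either argument may be None when the transcription omitted it.
    If both are missing, nothing can be recovered.
    """
    if kj is None and kcal is not None:
        kj = kcal * KJ_PER_KCAL
    elif kcal is None and kj is not None:
        kcal = kj / KJ_PER_KCAL
    return kj, kcal

kj, kcal = complete_energy(None, 100.0)
```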
In addition, apart from the column that indicates the
amounts per 100 g, these tables sometimes include
another column that indicates the amount per serving.
This situation is identified by checking that there are
two quantities per row. In that case, the second column
can be used to compensate for omissions or inaccuracies
in the first column. This is done by calculating the
proportion between the quantities in the first column
and their counterparts in the second one. The correct
proportion is assumed to be the most common one, and it
is used to correct or calculate inaccurate or omitted
quantities.
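The proportion-based completion can be illustrated with the sketch below. It is written under several assumptions: rows are (per_100g, per_serving) pairs with None for an omitted quantity, ratios are rounded to two decimals before voting, and only omissions are filled (correcting inaccurate values would need an additional deviation check). All names are hypothetical.

```python
from collections import Counter

def most_common_ratio(pairs, digits=2):
    """Most frequent serving/100 g ratio among rows with both values."""
    ratios = [round(serving / per100, digits)
              for per100, serving in pairs
              if per100 and serving]  # skips None and zero values
    if not ratios:
        return None
    return Counter(ratios).most_common(1)[0][0]

def fill_per_100g(pairs, digits=2):
    """Fill omitted per-100 g quantities from the per-serving column."""
    ratio = most_common_ratio(pairs, digits)
    completed = []
    for per100, serving in pairs:
        if per100 is None and serving is not None and ratio:
            per100 = round(serving / ratio, 2)
        completed.append(per100)
    return completed

# A 30 g serving: every complete row has ratio 0.3.
pairs = [(10.0, 3.0), (5.0, 1.5), (None, 0.6), (2.0, 0.6)]
completed = fill_per_100g(pairs)
```

Voting on rounded ratios keeps a single badly transcribed row from distorting the estimate, which matches the assumption in the text that the most common proportion is the correct one.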
With these and other similar algorithms, the data is
corrected, completed and extracted from the
transcriptions. In this way, the quantity per 100 g of
each of the seven attributes considered is obtained. As
energy is recovered in both kilojoules and kilocalories,
eight values are finally obtained for each image.
 
For the evaluation of the whole transcription system, a
result is considered correct if it deviates from the
ground truth by less than 20%. For example, if the
ground truth is 0.4 g, a value between 0.32 g and 0.48 g
is accepted. This threshold was chosen based on
(Guidance document for competent authorities for the
control of compliance with EU Legislation, 2012), which
sets the maximum acceptable error in the labeling of
nutritional facts at 20%. In this way, a reasonable
accuracy percentage can be computed for our system.
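The acceptance criterion can be written down as a short sketch. The function names are illustrative, and the handling of a ground truth of zero (accepted only if the prediction is also zero) and of missing predictions (counted as errors) are assumptions, since the text does not specify them.

```python
def is_correct(predicted, truth, tolerance=0.20):
    """True if predicted deviates from truth by less than the tolerance."""
    if predicted is None:
        return False  # assumption: an omitted value counts as an error
    if truth == 0:
        return predicted == 0  # assumption: relative error undefined at 0
    return abs(predicted - truth) < tolerance * abs(truth)

def accuracy(predictions, truths, tolerance=0.20):
    """Fraction of predictions accepted under the tolerance criterion."""
    if not truths:
        raise ValueError("at least one ground-truth value is required")
    hits = sum(is_correct(p, t, tolerance)
               for p, t in zip(predictions, truths))
    return hits / len(truths)

# With a ground truth of 0.4 g, 0.45 g is accepted and 0.5 g is not.
score = accuracy([0.45, 0.5], [0.4, 0.4])
```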
3  EXPERIMENTS 
In our system, there were three main processes that
needed to be trained, optimized and evaluated:
•  The convolutional neural network that classifies
the charts as having light letters on a dark
background or dark letters on a light background.
•  The genetic algorithm used to optimize the
parameters of the filters applied to the images
as preprocessing.
•  The transcription itself, especially the
effectiveness of the complementary contour-based
transcription and of the post-processing applied
to the result.
For the first task, a small dataset of 347 pictures
was collected with 3 smartphones in a realistic
environment. The photos were taken of various food
products in a supermarket. From these pictures, the
nutritional information chart was cropped, so that
only the necessary information was present. This
dataset was divided into 315 samples for training and
32 for testing. For this task, all the images were
manually labeled with a binary class indicating
whether they showed light letters on a dark background
or dark letters on a light background.
The genetic algorithm is a more time-demanding
process, as it requires preprocessing all the pictures
30 times in each generation. For this reason, we
decided to run it with a smaller set of 40 images
selected from the dataset described above. These
pictures had to be manually labeled with the key
values needed to compute the fitness function, that
is, the number of decimal numbers in the chart and the
number of quantities followed by an explicit unit. A
vocabulary of 76 terms for this domain was also
compiled.
As the transcription is the most important part of
our system and we wanted more accurate and reliable
results, a bigger and more diverse dataset of 633
images was collected. These pictures contain the
nutritional facts from labels in Spanish and
correspond to different food and beverage products
(except alcoholic beverages). The pictures were taken
using 5 different smartphones with conventional
cameras. During acquisition, glare, reflections, blur
and other defects were avoided as much as possible.
This dataset was also manually labeled, extracting
for each image the amount of each of the nutritional
characteristics per 100 g and per serving. For the