4.1 Analysis of Comparison Results
The results in Table 1 show that the accuracy of Bag-of-Words (BoW) was not competitive with that of either CNNs or humans. Increasing the number of codewords improved the results slightly, but BoW performance remained much lower than that of CNNs or humans. Table 1 also shows that humans achieved 83% accuracy, while for 16 food types the best CNNs achieved 89% to 93%, depending on the CNN architecture used. The conclusion is that, given a small number of food categories to learn, CNNs are very accurate and can surpass humans.
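For reference, the BoW baseline clusters local image descriptors into a codebook of visual codewords and then represents each image as a codeword histogram fed to a classifier. Below is a minimal sketch of such a pipeline, assuming OpenCV and scikit-learn; the descriptor type (SIFT), codebook size and all names are illustrative choices, not the exact setup used in these experiments.

# Hypothetical BoW pipeline sketch; SIFT, k-means and the codebook
# size of 1000 are illustrative assumptions, not the paper's setup.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def extract_descriptors(image_paths):
    # Collect local SIFT descriptors from a set of training images.
    sift = cv2.SIFT_create()
    all_desc = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)
        if desc is not None:
            all_desc.append(desc)
    return np.vstack(all_desc)

def build_codebook(descriptors, n_codewords=1000):
    # Quantise descriptor space into n_codewords visual words.
    return KMeans(n_clusters=n_codewords).fit(descriptors)

def bow_histogram(image_path, codebook, n_codewords=1000):
    # Represent one image as a normalised histogram of codewords,
    # which is then used as the feature vector for a classifier.
    sift = cv2.SIFT_create()
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    words = codebook.predict(desc)
    hist, _ = np.histogram(words, bins=np.arange(n_codewords + 1))
    return hist / max(hist.sum(), 1)

Increasing the number of codewords corresponds to raising n_codewords above, which enlarges the histogram representation.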
A more realistic application of CNNs for food recognition requires learning thousands of food categories. Table 1 also shows that CNN accuracy on 256 categories of food (Food256) decreases significantly, in spite of having the full dataset to train with. The best performing CNN, R101, achieved 73% (down from 93% on 16 categories), while Iv3 and G achieved 67% and 56%, respectively. As the number of food categories to learn increases, the accuracy of all tested CNN architectures decreases significantly, yet practical food recognition systems should be able to learn thousands of food categories.
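To make the scale of this task concrete, the sketch below shows how a pretrained network such as R101 (ResNet-101) is typically adapted to the 256 Food256 categories by replacing its classification head and fine-tuning. It assumes PyTorch with a recent torchvision; the hyperparameters are illustrative, not those of the reported experiments.

# Hypothetical fine-tuning sketch (PyTorch/torchvision >= 0.13);
# learning rate and optimiser are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 256  # Food256 has 256 food categories

model = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # replace ImageNet head

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    # One optimisation step over a mini-batch of food images.
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()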
Table 2 shows that the most difficult categories for CNNs and for humans are different. The conclusion is that humans and CNNs behave very differently when classifying food types.
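One way to obtain the per-category difficulty behind a Table 2 style comparison is to compute per-class accuracy from the confusion matrix and rank the classes. A minimal sketch, assuming scikit-learn; the function and variable names are illustrative.

# Hypothetical per-class difficulty ranking from a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

def hardest_categories(y_true, y_pred, class_names, k=5):
    # Return the k classes with the lowest per-class accuracy (recall).
    cm = confusion_matrix(y_true, y_pred)
    per_class_acc = np.diag(cm) / cm.sum(axis=1)
    worst = np.argsort(per_class_acc)[:k]
    return [(class_names[i], per_class_acc[i]) for i in worst]

Running this separately on CNN predictions and on human answers yields the two (different) lists of most difficult categories.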
4.2 Conclusions from Experiments
Analysis of the results confirms that CNNs are very accurate, and much more accurate than BoW. But the results also show that when CNNs had to learn 256 food categories they were less accurate than humans. As a consequence, in realistic contexts with many food categories and prior knowledge, humans are expected to still be more accurate than CNNs. There is, however, a need to investigate the comparison to humans in more detail. Note also that other factors, such as illumination, perspective and occlusion, will further degrade automated recognition capacity, while other human capabilities and contextual information may help humans improve their guesses in real environments.
5 CONCLUSIONS AND FUTURE WORK
The promise of smartphone-based capture of the dish to be eaten for dietary assessment makes it important to evaluate the feasibility of automated food recognition. This work analysed bag-of-words (BoW) and deep learning-based (CNN) solutions for food recognition, comparing them to humans as well. The approaches were compared experimentally and the results analysed, allowing us to conclude that CNNs beat BoW significantly. We also concluded, however, that CNN accuracy decreases as the number of food categories to learn grows. Our current and future work is focused on analysing this issue in more detail, evaluating how deep learning compares with humans, identifying the deficiencies, and improving the approaches.
Future work on practical food recognition systems can ask the user which of the top-3 possibilities is the right one, since top-3 accuracy should be much higher, based also on results by other authors. Another line of work concerns feeding different kinds of contextual information to the CNN classifier stage, both during training and use, to improve automated classification accuracy; a sketch of the top-3 interaction follows.
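A minimal sketch of how such a top-3 interaction could be implemented with a trained classifier, assuming PyTorch; the function and names are illustrative.

# Hypothetical top-3 query for the interactive scenario described above.
import torch

def top3_choices(model, image_tensor, class_names):
    # Return the three most likely categories to offer to the user.
    model.eval()
    with torch.no_grad():
        logits = model(image_tensor.unsqueeze(0))  # add batch dimension
        probs = torch.softmax(logits, dim=1)
        top_p, top_i = probs.topk(3, dim=1)
    return [(class_names[i.item()], p.item())
            for p, i in zip(top_p[0], top_i[0])]

The user then confirms one of the three returned categories, which is why top-3 accuracy is the relevant measure for this scenario.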
Finally, it is very important to experiment with difficulty-inducing factors, such as bad illumination and shadows, perspective, occlusion and others, which will further degrade recognition capacity.