diction rather than the probability scores. This makes
our method more tolerant to error reinforcement. Our
experiments on the Food-101 and UPMC-101 datasets in Section 5 show that our method produces superior results compared to the recently proposed approach based on tri-training.
2 FORMULATION
Consider the problem of training a neural network to classify foods from around the world. To this end, we need to collect a database of foods and manually label them. Denoting the database by $X^s = \{(x^s_0, y^s_0), \dots, (x^s_n, y^s_n)\}$, our aim is to train the classification model

$$g(x^s) : \mathbb{R}^{H \times W} \to \mathbb{R} \qquad (1)$$
to predict the label of the input image. The database could be created in two ways. The first approach is to simply collect images of food from the Internet; note, however, that online users tend to decorate a food and take the best shot. The second approach is to collect the database considering different environmental conditions, variations of the same food from one country to another, and different imaging devices.
Our goal is to train a food classification network using these images. Then, we will deploy our model on a device that is designed to classify images of food captured by that device. In other words, there is another database, $X^t = \{(x^t_0, y^t_0), \dots, (x^t_m, y^t_m)\}$,
indicating the samples captured by the device. Here,
our dataset collected from the Internet is the source
domain and the dataset collected by the device is the
target domain where the actual test will take place.
Collecting $X^s$ using the first approach can be done quickly and efficiently. In contrast, collecting $X^s$ using the second approach is hard and, in practice, almost infeasible. However, the samples gathered by the second approach will be more diverse than those gathered by the first approach. Consequently, a model trained on data from the second approach is likely to be more accurate. (Torralba and Efros, 2011) showed that in most cases there is a shift between $X^s$ and $X^t$ even when $X^s$ is collected using the second approach. This shift between the databases negatively affects the classification accuracy on $X^t$.
More formally, the classification model $g(x^s)$ can be expressed as the composite function $g(x^s) = f(h(x^s))$, where $h : \mathbb{R}^{H \times W} \to X^D$ is a function that maps the input image into a $D$-dimensional space called the feature space. The joint probabilities of the feature vectors and their corresponding labels are denoted by $p^s(h(x^s), y^s)$ and $p^t(h(x^t), y^t)$ for the source domain and the target domain, respectively.
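To make this decomposition concrete, below is a minimal sketch in PyTorch (a framework choice of ours; the paper does not prescribe one). The backbone layers, the feature dimension $D$, and the number of classes are placeholder values, and the output is a vector of class scores rather than the scalar of Eq. (1), which is a common implementation choice.

import torch
import torch.nn as nn

D = 256            # dimensionality of the feature space X^D (placeholder)
NUM_CLASSES = 101  # e.g. 101 food categories (assumption)

# h : R^{HxW} -> X^D, maps an input image to a D-dimensional feature vector.
h = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, D),
)

# f : X^D -> class scores, the classifier applied on top of the feature space.
f = nn.Linear(D, NUM_CLASSES)

# g(x) = f(h(x)), the composite classification model of Eq. (1).
g = nn.Sequential(h, f)

x_s = torch.randn(8, 3, 224, 224)  # a batch of source-domain images
print(g(x_s).shape)                # torch.Size([8, 101])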
Domain adaptation refers to the problem of training the model $f(h(x^s))$ when $p^s(h(X^s)) \neq p^t(h(X^t))$ but $y^s, y^t \in L$, where $L$ is the label space. In other words, domain adaptation assumes that the labels $y^s$ and $y^t$ are drawn from the common label space $L$ and that $h(x^s)$ and $h(x^t)$ are drawn from the common feature space $X^D$. However, the distribution of feature vectors in the source domain is different from the distribution of feature vectors in the target domain. This is called covariate shift.
This is different from knowledge adaptation, where the basic assumption is that $p^s(h(X^s)) \approx p^t(h(X^t))$ but $y^s \in L$ and $y^t \in L'$, where $L$ and $L'$ are two different label spaces. Here, we focus only on methods for dealing with the covariate shift problem.
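For reference, the two assumptions can be summarized side by side; this merely restates the definitions above in display form.

\begin{align*}
\text{Covariate shift:} \quad & p^s\big(h(X^s)\big) \neq p^t\big(h(X^t)\big), && y^s, y^t \in L, \\
\text{Knowledge adaptation:} \quad & p^s\big(h(X^s)\big) \approx p^t\big(h(X^t)\big), && y^s \in L, \; y^t \in L'.
\end{align*}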
2.1 Domain Adaptation Types
Domain adaptation can be further divided into su-
pervised, unsupervised and semi-supervised domain
adaptation. In supervised domain adaptation, both $X^s$ and $X^t$ are labeled. In contrast, unsupervised domain adaptation deals with situations where the source domain is labeled but the target domain is only composed of $h(x^t)$, and $y^t$ is unknown for all target samples. Finally, semi-supervised domain adaptation re-
fers to problems where the target dataset is partially
labeled. However, the number of labeled target sam-
ples is very low. Figure 1 shows these three problems
schematically.
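As a concrete illustration of the data available in each setting, consider the small sketch below; the dataset sizes, image resolution, and fraction of labeled target samples are arbitrary placeholder values, not taken from the paper.

import numpy as np

n, m, H, W = 100, 80, 64, 64  # placeholder dataset sizes and image size

# Source domain: fully labeled in all three settings.
X_s = np.random.rand(n, H, W, 3)
y_s = np.random.randint(0, 101, size=n)

# Target-domain images are always available...
X_t = np.random.rand(m, H, W, 3)

# ...but the availability of target labels depends on the setting.
y_t_supervised = np.random.randint(0, 101, size=m)  # supervised DA: all target labels known
y_t_unsupervised = None                              # unsupervised DA: no target labels at all

y_t_semi = np.full(m, -1)                            # semi-supervised DA: only a few labels
labeled_idx = np.random.choice(m, size=m // 20, replace=False)
y_t_semi[labeled_idx] = np.random.randint(0, 101, size=labeled_idx.size)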
Unsupervised and semi-supervised domain adaptation have important practical applications, which we illustrate with an example. In order to collect $X^s$ in our application, we can simply rely on online images instead of collecting a diverse range of food images from wall-mounted cameras in the real world. This way, we can collect a considerable amount of food images in a reasonable time.
Then, we can collect many images from a wall-mounted camera in the real world without annotating their labels and create the $X^t$ dataset. Finally, our model can be trained using both $X^s$ and $X^t$. Using $X^s$, the model will learn the essential visual cues required for classifying foods. Then, it will refine its knowledge using the images in $X^t$.
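A schematic training loop for this scenario is sketched below. The supervised term on the labeled $X^s$ batch teaches the basic visual cues; the term on the unlabeled $X^t$ batch is only a placeholder here (a simple confidence-thresholded pseudo-labeling step) and is not the adaptation objective proposed in this paper, which is described later.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the labeled source set X^s and the unlabeled target set X^t.
x_s = torch.randn(256, 3, 64, 64); y_s = torch.randint(0, 101, (256,))
x_t = torch.randn(256, 3, 64, 64)

source_loader = DataLoader(TensorDataset(x_s, y_s), batch_size=32, shuffle=True)
target_loader = DataLoader(TensorDataset(x_t), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(32, 101))
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for epoch in range(5):
    for (xs, ys), (xt,) in zip(source_loader, target_loader):
        # Supervised loss on the labeled source batch: learn the visual cues.
        loss = F.cross_entropy(model(xs), ys)

        # Placeholder adaptation term on the unlabeled target batch:
        # confidence-thresholded pseudo-labels (NOT the method of this paper).
        with torch.no_grad():
            conf, pseudo = F.softmax(model(xt), dim=1).max(dim=1)
        mask = conf > 0.9
        if mask.any():
            loss = loss + F.cross_entropy(model(xt[mask]), pseudo[mask])

        opt.zero_grad()
        loss.backward()
        opt.step()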
3 RELATED WORK
In this section, we explain state-of-the-art domain adaptation techniques that are applicable to neural networks. Generally speaking, domain adaptation techniques break down into feature-space alignment, reconstruction-based, generative adversarial networks