and then tested on a distribution of them, including
potentially unseen tasks. Many popular solutions employ the Metric-based Meta-Learning method. Here, the objective is to learn a mapping from images to an embedding space in which similar images lie close together and images from different categories lie far apart.
In the 1990s, Bromley and LeCun introduced a
signature verification algorithm that used a novel ar-
tificial neural network - Siamese Neural Network
(Bromley et al., 1994). Siamese neural networks
are a class of neural network architectures that con-
tain identical twin sub-networks i.e., they have the
same configuration with the same parameters and
weights. Parameter updating is mirrored across both
sub-networks. In situations where we have thousands
of classes but only a few image examples per class,
these networks are popular. In (Koch et al., 2015), a one-shot classification strategy is presented that first learns image representations with a Siamese neural network and then reuses these features for the one-shot task without any retraining. Similar and dissimilar pairs of images are discriminated by computing the weighted L1 distance between the twin feature vectors and passing it through a sigmoid function, yielding a similarity score for each pair; the network is trained with a cross-entropy objective. The assumption is that the trained network will then classify new examples from novel classes well during the one-shot task. In (Garcia and Bruna, 2015), a single-layer message-passing iteration resembles a Siamese Neural Network: the model learns image vector representations whose Euclidean metric agrees with the label similarities.
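As an illustration of this pair-scoring scheme, a minimal Python/PyTorch sketch is given below; the tiny encoder and the 64-dimensional feature size are placeholders rather than the configuration used in (Koch et al., 2015).

    import torch
    import torch.nn as nn

    # Placeholder twin encoder; both images are embedded by the SAME module,
    # so parameters and their updates are shared, as in a Siamese network.
    encoder = nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=3), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(8, 64),
    )

    # Learned per-dimension weighting of the L1 distance between twin features.
    weighting = nn.Linear(64, 1)

    def similarity(img_a, img_b):
        f_a, f_b = encoder(img_a), encoder(img_b)
        # weighted component-wise |f_a - f_b|, squashed to (0, 1) by a sigmoid
        return torch.sigmoid(weighting(torch.abs(f_a - f_b)))

    # Pairs labelled 1 (same class) or 0 (different class) can be trained with
    # a cross-entropy objective on this score, e.g. nn.BCELoss()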
Matching Networks have also proven to be an ex-
cellent model for one-shot learning tasks. In (Vinyals
et al., 2016), the architecture combines the strengths of both parametric and non-parametric models: it acquires information from novel classes very quickly while still generalizing satisfactorily from common examples. Meta-learning with memory-augmented neural
networks (Santoro et al., 2016) greatly influences this
work. LSTMs (Hochreiter and Schmidhuber, 1997)
learn rapidly from sets of data fed in sequentially. In
addition, the authors of (Vinyals et al., 2016) employ
ideas of metric learning (Roweis et al., 2004) based
on features learnt. The set representation for images is
also prevalent in the graph neural model proposed in
(Garcia and Bruna, 2015). However, the main difference between the two implementations is that Matching Networks encode the support set independently of the target image, i.e., their learning mechanism always attends to the same node embeddings, in contrast to the stacked adjacency learning in (Garcia and Bruna, 2015).
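To make the contrast concrete, a minimal sketch of the attention-based classification used by Matching Networks (without FCE) is shown below, assuming the query and support embeddings have already been computed; the function and variable names are ours, not from (Vinyals et al., 2016).

    import torch
    import torch.nn.functional as F

    def matching_predict(f_query, g_support, support_labels, n_classes):
        # f_query:        (d,)   embedding of the target image
        # g_support:      (k, d) embeddings of the k support images
        # support_labels: (k,)   integer class labels of the support images
        # Attention: softmax over cosine similarities between query and support.
        sims = F.cosine_similarity(f_query.unsqueeze(0), g_support, dim=1)  # (k,)
        attn = F.softmax(sims, dim=0)                                       # (k,)
        one_hot = F.one_hot(support_labels, n_classes).float()              # (k, C)
        return attn @ one_hot  # (C,) class probabilities, a weighted label vote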
Prototypical Networks (Snell et al., 2017) were
designed to provide a simpler yet effective approach
for few-shot learning. They build upon work done in
(Vinyals et al., 2016) and the meta-learning approach
to few-shot learning (Ravi and Larochelle, 2016),
achieving better performance than Matching Networks without the complication of Full Context Embedding (FCE). These networks apply an inductive bias in the form of class prototypes: there exists an embedding space in which the samples of each class cluster around a single prototypical representation, which is simply the mean of the embedded samples. The query
image is then classified by finding the nearest class
prototype.
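A minimal sketch of this classification rule, assuming the embeddings of the query and support images are already available, could read as follows (names and shapes are illustrative):

    import torch

    def prototypical_predict(query_emb, support_emb, support_labels, n_classes):
        # query_emb:   (d,)   embedding of the query image
        # support_emb: (k, d) embeddings of the support images
        # A class prototype is the mean embedding of that class's support samples.
        prototypes = torch.stack([
            support_emb[support_labels == c].mean(dim=0) for c in range(n_classes)
        ])                                                                  # (C, d)
        # Assign the query to the nearest prototype (Euclidean distance; the
        # argmin is unchanged whether or not the distance is squared).
        dists = torch.cdist(query_emb.unsqueeze(0), prototypes).squeeze(0)  # (C,)
        return dists.argmin()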
In the graph neural model proposed in (Garcia and Bruna, 2015), Prototypical Networks correspond to combining information within each cluster, where a cluster is defined by nodes with similar labels.
3 PROPOSED METHOD
3.1 UNet for Image Embeddings
For the task of Image Segmentation in the field
of Biomedical Imaging, UNets (Ronneberger et al.,
2015) were proposed. The architecture consists of two sections: a contracting path and a symmetric expansion path. The contracting section contains multiple blocks, each applying two 3x3 convolution layers followed by a 2x2 max pooling layer to its input. The number of feature maps doubles after every block, enabling the architecture to capture context effectively. The expansion section aims to preserve the spatial properties of the image, which are key to generating the segmented output; it does this by concatenating feature maps from the corresponding blocks of the contracting section. Each block in this section consists of two 3x3 convolution layers and an upsampling layer, and, to maintain symmetry, the number of feature maps is halved after each block. For training, a softmax function is applied to every pixel of the resulting segmentation map, followed by a cross-entropy loss function.
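The two block types can be sketched as below; this is a simplified illustration rather than the exact configuration of (Ronneberger et al., 2015), the ReLU activations are assumptions, and the skip-connection concatenation is only noted in a comment.

    import torch.nn as nn

    def contracting_block(in_ch, out_ch):
        # Two 3x3 convolutions followed by 2x2 max pooling; out_ch is typically
        # twice the number of feature maps of the previous block.
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),
        )

    def expanding_block(in_ch, out_ch):
        # Upsampling followed by two 3x3 convolutions; in the full UNet the
        # upsampled maps are first concatenated with the feature maps of the
        # corresponding contracting block (omitted here for brevity).
        return nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv2d(in_ch, out_ch, kernel_size=3), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3), nn.ReLU(inplace=True),
        )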
For our proposed model, we extracted embed-
dings of size 64 from the end of the contraction sec-
tion by using a fully connected layer. These were then
fed into the graph neural model. As stated in (Ron-
neberger et al., 2015), UNet’s speed is one of its ma-
jor advantages. In our experiment, the UNet model
converged the quickest.
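A sketch of this embedding read-out is given below; the bottleneck width of 512 channels and the average-pooling step are assumptions for illustration, while the 64-dimensional fully connected output reflects our setup.

    import torch.nn as nn

    class UNetEmbedder(nn.Module):
        def __init__(self, contracting_path, bottleneck_ch=512, emb_dim=64):
            super().__init__()
            self.contracting_path = contracting_path     # UNet contraction section
            self.pool = nn.AdaptiveAvgPool2d(1)          # collapse spatial dims (assumption)
            self.fc = nn.Linear(bottleneck_ch, emb_dim)  # 64-d embedding for the graph model

        def forward(self, x):
            feats = self.pool(self.contracting_path(x)).flatten(1)  # (B, bottleneck_ch)
            return self.fc(feats)                                   # (B, 64)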