the spatial meaning is lost. Many of these models mix CNNs with LSTMs to process the series of words that make up a caption; see (Wang et al., 2016) for an example. Hossain et al. (Hossain et al., 2019) provide a useful review of image captioning techniques.
In 2014, in his talk on what is wrong with convolutional neural nets, Geoff Hinton discussed the limitations of max pooling and noted that CNNs can recognise the right elements but in the wrong arrangement. For example, a network might detect two eyes, a nose and a mouth and classify a face even if those elements are not arranged as a face. We were interested in whether a CNN could be made to learn such spatial relationships if the target outputs made them explicit. This is slightly different from the point that Hinton was making, but it sparked the question all the same: can a CNN learn image-wide spatial relationships simply by encoding the names of such relationships at the output layer?
There have been several attempts to explicitly address the challenge of learning spatial relationships among objects in an image using architectures that add spatially specific elements to the standard CNN. Mao et al. (Mao et al., 2014) use a mixture of recurrent and convolutional layers in a multimodal approach they call an m-CNN. The model takes a statistical approach to producing the words that form sentences describing images: each word is selected from a probability model conditioned on the image and the previous words in the sentence.
More recently, Raposo et al. (Raposo et al., 2017)
proposed relation networks (RN) as a way of allow-
ing networks to learn about the relationships between
objects in a scene. The RN models the relationships
between pairs of objects and is used in conjunction
with CNNs and LSTMs for image and language pro-
cessing.
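In outline, an RN applies a shared function g to every ordered pair of object representations, sums the results, and passes the sum through a readout function f. The sketch below is a minimal PyTorch illustration of that pattern with arbitrary layer sizes; it is not the exact architecture of (Raposo et al., 2017).

# A minimal sketch of a Relation Network head, assuming object features
# have already been extracted (e.g. from CNN feature-map cells or LSTM states).
import itertools
import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    def __init__(self, obj_dim=64, hidden=128, out_dim=10):
        super().__init__()
        # g scores a single ordered pair of objects; f reads out the summed scores.
        self.g = nn.Sequential(nn.Linear(2 * obj_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden), nn.ReLU())
        self.f = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                               nn.Linear(hidden, out_dim))

    def forward(self, objects):              # objects: (batch, n_objects, obj_dim)
        pair_scores = [self.g(torch.cat([objects[:, i], objects[:, j]], dim=-1))
                       for i, j in itertools.permutations(range(objects.size(1)), 2)]
        return self.f(torch.stack(pair_scores).sum(dim=0))

Note that the relational reasoning here is carried by the pairwise function g rather than by the convolutional backbone, which is precisely the kind of architectural addition that the present paper seeks to avoid.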
In this paper, we address the question of whether a simple CNN architecture without recurrent or LSTM components is able to learn relative spatial relationships in images. In particular, we ask whether a CNN can learn to generalise concepts such as above or below from example images without an architecture that is specifically designed to capture those relationships. We address this by generating images containing a small number of object classes arranged in a variety of spatial configurations, while ensuring that some combinations did not appear in the training data. When tested, the model was able to correctly report the relative locations of object combinations that were absent from the training data.
The motivation for the work is to add output nodes to a CNN that each refer to a defined concept. In this case, the concepts describe relative locations, but in future work they might describe an action or even an intention. Rather than generating a sentence (such as the girl is drinking the milk) that requires further post-processing to extract meaning, we aim to generate meaningful outputs directly from the image (subject=girl, verb=drink, object=milk, for example). We want to use a single architecture (a standard CNN) and change only the output targets in order to train on different meaningful relationships among objects in an image.
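As a concrete, hypothetical illustration of such an output encoding for the spatial-relation task, the target could be the concatenation of one-hot vectors for the first object's class, the relation, and the second object's class. The layout below is our own sketch for illustration, not a prescription taken from the experiments that follow.

import numpy as np

# Hypothetical fixed vocabularies for the structured output layer.
CLASSES = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
           "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
RELATIONS = ["above", "below", "left", "right",
             "above left", "below left", "above right", "below right"]

def one_hot(index, size):
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

def encode_target(obj_a, relation, obj_b):
    """Encode 'obj_a <relation> obj_b' as a single target vector for the CNN."""
    return np.concatenate([one_hot(CLASSES.index(obj_a), len(CLASSES)),
                           one_hot(RELATIONS.index(relation), len(RELATIONS)),
                           one_hot(CLASSES.index(obj_b), len(CLASSES))])

# e.g. encode_target("Shirt", "above", "Bag") -> a 28-dimensional target vector

Targets of this form could be attached to a standard CNN as a single output layer (or as three softmax groups) without any change to the convolutional backbone.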
The remainder of this paper is organised as fol-
lows. Section 2 describes the preparation of the train-
ing, validation and test data. Sections 3 and 4 de-
scribe two experiments with CNNs for spatial relation
recognition. Finally, Section 5 provides an analysis of
the results and some ideas for further work.
2 DATA PREPARATION
The datasets for the experiments were constructed us-
ing the Fashion-MNIST dataset (Xiao et al., 2017).
Images in this collection are 28 by 28 pixels in size
and they were used to generate larger images of 56 by
56 pixels by pasting two or three of the original im-
ages onto a larger canvas. The original images were
selected at random without replacement and placed on
the larger canvas in randomly chosen non-overlapping
locations. In this way, any two objects in an image
could have a clearly defined spatial relationship from
the set {above, below, left, right, above left, below
left, above right, below right}. The data were automatically labelled by the algorithm that generated them, making the process of producing large quantities of data very efficient. Figure 1 shows two example images. Note that since the Fashion-MNIST images are only 28 by 28 pixels, the figures in this paper accurately reflect the quality of those images.
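As a sketch of this generation and labelling procedure, the snippet below pastes the selected Fashion-MNIST images into distinct quadrants of the 56 by 56 canvas and derives the relation label for each ordered pair of objects from their quadrant positions. The quadrant-based placement and the label format are our assumptions for illustration; the generator used in the experiments may choose positions and emit labels differently.

import numpy as np
from itertools import permutations

TILE, CANVAS = 28, 56   # Fashion-MNIST tile size and composite canvas size

RELATIONS = {(-1, 0): "above", (1, 0): "below", (0, -1): "left", (0, 1): "right",
             (-1, -1): "above left", (1, -1): "below left",
             (-1, 1): "above right", (1, 1): "below right"}

def compose(images, labels, rng=None):
    """Paste two or three 28x28 images into distinct quadrants of a 56x56
    canvas and return the canvas plus a relation label for every ordered
    pair of objects (e.g. 'shirt above bag')."""
    rng = rng or np.random.default_rng()
    canvas = np.zeros((CANVAS, CANVAS), dtype=images[0].dtype)
    # Distinct quadrants guarantee non-overlapping placement by construction.
    quadrants = [(int(q) // 2, int(q) % 2)
                 for q in rng.choice(4, size=len(images), replace=False)]
    for img, (qr, qc) in zip(images, quadrants):
        canvas[qr * TILE:(qr + 1) * TILE, qc * TILE:(qc + 1) * TILE] = img
    relations = []
    for (la, (ra, ca)), (lb, (rb, cb)) in permutations(zip(labels, quadrants), 2):
        key = (int(np.sign(ra - rb)), int(np.sign(ca - cb)))
        relations.append(f"{la} {RELATIONS[key]} {lb}")
    return canvas, relations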
The Fashion-MNIST object classes are: T-
shirt/top, Trouser, Pullover, Dress, Coat, Sandal,
Shirt, Sneaker, Bag, Ankle boot. There are 60,000
training images and 10,000 test images in the dataset.
All of the images are provided in grey scale so they
do not have a colour dimension.
The training/validation/test protocol was as follows. The original Fashion-MNIST training data were split into 20% test data and 80% training data before the composite images were generated. The 80% used for training was further split into 80% training and 20% validation sets, again before the composite images were created. Consequently, the train, validation and test data share no original images in common. What is more, on different training runs, certain combinations of object class and spatial relationship (shirt above bag,