Table 1: The training time and test accuracy of a baseline RN and three HRN architectures at 30 and 200 epochs.

Architecture     Training time (30 epochs)   Test accuracy (30 epochs)   Training time (200 epochs)   Test accuracy (200 epochs)
RN (baseline)    2,080 s                     90.3%                       13,349 s                     92.7%
SCHRN               81 s                     88.9%                          549 s                     93.3%
UCHRN               67 s                     90.4%                          449 s                     94.1%
FDHRN               90 s                     87.8%                          545 s                     89.6%
The intuition behind this approach is that instead
of pre-assigning each object to a definite category, we
treat each object as belonging, say, 95% to category
one, 3% to category five, and so on. This transition
from hard category assignments, akin to Aristotelian
logic, to soft assignments, akin to fuzzy logic, both
allows for differentiability and increases the fitting
power of the network. For small m, when many objects
would fall "on the fence" between two classes, the
increased fitting power yields greater flexibility and
improved performance at the cost of training more
parameters.
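The soft assignment described above can be sketched as a softmax over per-category scores. This is a minimal illustration of the idea, not the paper's implementation; the function name and score values are ours.

```python
import numpy as np

def soft_assign(scores):
    """Convert per-category scores into a soft (fuzzy) membership
    distribution via a softmax; every object belongs fractionally
    to every category, which keeps the mapping differentiable."""
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

# One object whose scores favor category 0 but leave some mass elsewhere
scores = np.array([3.0, -0.5, 0.1])
membership = soft_assign(scores)
print(membership)  # a distribution over 3 categories, summing to 1
```

A hard assignment would instead be `scores.argmax()`, which is not differentiable; the softmax is the standard smooth relaxation.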
An important feature of the FDHRN architecture
is the use of trainable embeddings for the object cat-
egories. Imagine a zoo at which we ask only questions
about features that have little to do with animal
taxonomy, such as color or size ("do brown animals
tend to be bigger than black animals?"). If we were
to group objects by more domain-based categories,
such as "amphibian", "mammal", etc., there would
be a mismatch between the relational questions we ask
and the given categories, which would decrease the
learning efficiency. Trainable embeddings allow the
FDHRN to derive categories that are more specific to
the reasoning task at hand.
Although advantageous, inferring categories from
data can lead to increased model complexity and
make the network more prone to overfitting.
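One way such trainable category embeddings could induce task-specific soft categories is by scoring each object against a learned embedding per category. The following NumPy sketch is illustrative only; the dot-product similarity, shapes, and names are our assumptions, not the FDHRN's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n_objects, feat_dim, m_categories = 25, 16, 8

# Object feature vectors (inputs) and category embeddings; in the
# FDHRN setting the latter would be trained by gradient descent.
objects = rng.normal(size=(n_objects, feat_dim))
cat_embed = rng.normal(size=(m_categories, feat_dim))

def memberships(objects, cat_embed):
    """Soft category assignments from object-category similarity."""
    logits = objects @ cat_embed.T               # (n, m) similarity scores
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

M = memberships(objects, cat_embed)
print(M.shape)  # (25, 8): one membership distribution per object
```

Because the categories are learned rather than fixed, gradients flowing into `cat_embed` can reshape the categories to fit the relational questions, which is also where the extra parameters and overfitting risk come from.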
4 RESULTS
4.1 The Zoo Animals Dataset
We used a dataset containing information about
101 animals with various attributes for each animal
(Learning, 2020).
We tested all three aforementioned HRN architec-
tures in tasks of relational queries on subsets of the
101 animals. An example query might be: "Among
these animals, are those with lungs more likely to be
aquatic than those with fins?" The possible answers
(balanced to occur with about the same frequency) are
"yes," "no," and "about equally likely."
Each data point consists of a single task: a subset
of animals and a query to which the network gener-
ates an answer. During training, the correct answer is
also provided as a label. The models are scored based
on the percentage of correctly performed tasks, with
a random guess baseline achieving around 33% accu-
racy.
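A ground-truth label for such a query can be computed by comparing conditional frequencies over the animal subset. This is a sketch under our own assumptions: the attribute names, the dictionary representation, and the tolerance used for "about equally likely" are illustrative, not taken from the dataset's actual labeling code.

```python
def answer_query(animals, cond_a, cond_b, target, tol=0.05):
    """Among `animals` (dicts of binary attributes), compare
    P(target | cond_a) with P(target | cond_b)."""
    def rate(cond):
        group = [a for a in animals if a[cond]]
        return sum(a[target] for a in group) / len(group) if group else 0.0
    pa, pb = rate(cond_a), rate(cond_b)
    if abs(pa - pb) <= tol:
        return "about equally likely"
    return "yes" if pa > pb else "no"

# Toy subset: are animals with lungs more likely to be aquatic
# than animals with fins?
zoo = [
    {"lungs": 1, "fins": 0, "aquatic": 0},
    {"lungs": 1, "fins": 1, "aquatic": 1},
    {"lungs": 0, "fins": 1, "aquatic": 1},
    {"lungs": 1, "fins": 0, "aquatic": 0},
]
print(answer_query(zoo, "lungs", "fins", "aquatic"))  # "no"
```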
4.2 Testing Specifications
We use a version of the Animals dataset that contains
about 19,000 questions on different subsets of
animals. Half the questions are used for training while
the other half is reserved for testing. Each subset of
animals consists of 25 animals. The choice of 25 is
arbitrary, and purely made to ensure that the baseline
RN completes training in reasonable time. Even so,
the RN model takes several hours to train while the
three HRN architectures take a few minutes each.
All models have the same fully-connected layer
depth and width, as well as the same batch size (64)
and learning rate (0.0005). The tests ran in Python 3
on macOS with GPU support.
4.3 Comparison of Performance
We evaluated four models on the Animals dataset:
the original RN used as a baseline, an SCHRN with
7 categories, a UCHRN with 8 categories, and the
FDHRN with 8 categories. We compared both the
training times and the test accuracies achieved. The results
are summarized in Table 1, with values averaged over
several runs and rounded. All models converge by
200 epochs.
Even for merely 25 objects, the training time of
the baseline RN is one or two orders of magnitude
higher than that of the HRN models. This is expected
as, in general, the time complexity of the HRNs scales
with m^2 instead of n^2, where m is the number of
categories and n is the number of objects.
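The gap can be made concrete by counting the pairwise comparisons each model performs, assuming both consider all ordered pairs; the figures 25 and 8 match the object-set size and category count used in our experiments.

```python
def pairs(k):
    """Number of ordered pairs among k items: the O(k^2) term."""
    return k * k

n, m = 25, 8  # objects vs. categories
print(pairs(n), pairs(m))  # 625 vs. 64, roughly a 10x reduction
```

Since training time is dominated by this quadratic term, shrinking the quadratic factor from n to m accounts for most of the speedup in Table 1.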
We see that, out of all the models tested, the
UCHRN achieves superior accuracy at the end of
training and also takes the shortest time to finish train-
ing. This is likely because its assumptions are best
satisfied by the Animals dataset (the questions were
posed on all animal attributes). The SCHRN performs
worse, as the fixed categories given by the dataset
(mammal, bug, etc.) are not particularly suited for the
Hierarchical Relation Networks: Exploiting Categorical Structure in Neural Relational Reasoning