number of questions in the dataset, but also the diversity of the programs, as well as the use of the unbalanced testdev-all set for evaluation.
By comparing our best CL model (CL+W.a+P+R) to the models trained without CL (no-CL), we find substantial savings in computational cost, e.g. an 18-fold reduction compared to the top contender, the model trained on the Unbalanced dataset. The price to pay, a drop of only 2% in accuracy, appears reasonable. The
Random model, trained on randomly sampled 12M
examples, performs almost as well as the Unbalanced
model, an expected result since both models use a similar number of distinct training examples (12M vs. 14M). Training the Unbalanced model is almost 9 times more expensive than training Random, but the improvement in accuracy (70.2% vs. 69.4%) hardly justifies it. The proposed CL model, in contrast, has an almost 2 times lower computational cost than Random, confirming the superiority of curriculum learning in this type of application.
Table 4: Comparison of our CL model (CL+W.a+P+R) with the no-CL models (Unbalanced, Balanced, and Random) on the testdev-all set.
Model         Comp. cost    # examples   Accuracy
Unbalanced    9 × 14 M      14 M         0.702
Balanced      50 × 1.4 M    1.4 M        0.678
Random        12 × 1 M      ≤ 12 M       0.694
CL+W.a+P+R    7 × 1 M       < 7 M        0.681
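For concreteness, and assuming the cost column of Table 4 counts training-example presentations as epochs × examples seen per epoch, the ratios quoted above follow directly:

\[
  \frac{9 \times 14\,\mathrm{M}}{7 \times 1\,\mathrm{M}} = 18
  \qquad \text{and} \qquad
  \frac{12 \times 1\,\mathrm{M}}{7 \times 1\,\mathrm{M}} \approx 1.7,
\]

i.e. an 18-fold saving with respect to Unbalanced and an almost 2-fold saving with respect to Random for the CL+W.a+P+R model.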
6 CONCLUSION
In this work we present several Curriculum Learn-
ing (CL) strategies within a Neural Module Net-
work (NMN) framework for Visual Question An-
swering (VQA). Our visual reasoning approach leverages a cross-modal Transformer encoder to extract aligned question/image features, which are combined with question programs to perform multi-step reasoning over the image and predict an answer. The model employs an NMN architecture composed of multiple neural modules, each capable of performing a reasoning sub-task. We compare these CL strategies on the GQA dataset, where our model achieves a substantial reduction in computational cost. To drive the CL strategy, we introduce a difficulty measure based on the number of objects in the question; by training on a judiciously sampled 50% of the training data, we achieve accuracy close to that of an NMN model trained without CL on the entire training set.
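As a minimal sketch of how a difficulty-driven curriculum of this kind can be organized (the function names, the program representation, and the stage budgets below are illustrative assumptions rather than the exact implementation), examples can be ranked by the number of objects their program refers to and admitted stage by stage:

import random

def question_difficulty(program):
    # Difficulty proxy: number of object-selecting operations in the question's
    # functional program ("select" is an assumed operation name).
    return sum(1 for op in program if op.get("type") == "select")

def curriculum_stages(dataset, num_stages=4, budget_per_stage=1_000_000, seed=0):
    # Rank examples from easy to hard, then yield one sampled subset per stage.
    # The candidate pool grows cumulatively, so easy questions remain available
    # while harder ones are gradually admitted; the per-stage budget keeps the
    # total number of presented examples well below the full training set.
    rng = random.Random(seed)
    ranked = sorted(dataset, key=lambda ex: question_difficulty(ex["program"]))
    stage_size = max(1, len(ranked) // num_stages)
    for stage in range(1, num_stages + 1):
        pool = ranked[: stage * stage_size]
        yield rng.sample(pool, min(budget_per_stage, len(pool)))

Each yielded subset trains the model for one curriculum stage, keeping the total number of distinct examples seen well below the full training set (about 50% in our setting).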
ACKNOWLEDGEMENTS
We thank Souheil Hanoune for his insightful com-
ments. This work was partly supported by the French
Cifre fellowship 2018/1601 granted by ANRT, and by
XXII Group.