The idea is to verify model-generated solutions at
test time. Since the verifier outputs the probability that
a solution is correct, multiple trials of the same
problem can be carried out: each candidate solution is
ranked by the verifier, and the solution with the
highest verifier score is returned.
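As a reference, a minimal sketch of this best-of-n ranking scheme follows; the names generate_solution and verifier_score are hypothetical placeholders for the sampling model and the trained verifier, not part of the released pipeline.

    from typing import Callable, List

    def solve_with_verifier(
        problem: str,
        generate_solution: Callable[[str], str],      # hypothetical sampler
        verifier_score: Callable[[str, str], float],  # hypothetical verifier
        n_trials: int = 10,
    ) -> str:
        """Sample n_trials candidate solutions and return the one that
        the verifier judges most likely to be correct."""
        candidates: List[str] = [
            generate_solution(problem) for _ in range(n_trials)
        ]
        # Rank candidates by the verifier's estimated probability of
        # correctness and return the highest-scoring one.
        return max(candidates, key=lambda s: verifier_score(problem, s))

In this framing, the verifier trades extra test-time computation (more trials) for a higher solve rate.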
Figure 5 shows the percentage of correctly solved
problems against the number of trials for MC-ZSL.
This curve can be read as the performance achievable
with an ideal verifier, i.e., one that always assigns
probability 1 to a correct solution.
For one trial, the percentage is 18.63%; with
two and three trials it rises to 27.56% and 33.33%, respectively.
To show the potential of this approach, note
that with ten trials the percentage reaches 54.20%.
Figure 5: MC-ZSL solve rate against the number of trials.
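This ideal-verifier curve coincides with the pass@k metric of Chen et al. (2021): a problem counts as solved when at least one of the k sampled solutions is correct. As a point of reference, the sketch below implements their unbiased estimator; the function name is assumed, and this code is illustrative rather than part of the released pipeline.

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator of Chen et al. (2021): the
        probability that at least one of k samples, drawn without
        replacement from n generated solutions of which c are
        correct, solves the problem."""
        if n - c < k:
            return 1.0  # fewer incorrect samples than draws: always solved
        return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

    # Example: with n = 10 samples of which c = 2 are correct,
    # pass@1 = 1 - (8/9) * (9/10) = 0.2.
    print(pass_at_k(n=10, c=2, k=1))  # 0.2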
4 CONCLUSIONS
This work explores and measures the effectiveness of
recent deep learning models for solving grade-school
math tasks described in natural language. The
results show that problem solving based
on code generation is more effective than problem
solving based on natural-language reasoning.
A pipelined solution based on OpenAI Codex is
designed. Experimental results clearly show the
potential of the approach: Codex achieves an 18.63%
solve rate, against 6.82% for GPT-3.
Further improvements can be achieved by using
verifiers. The proposed approach has been
implemented, tested, and publicly released on
GitHub, to foster its adoption in various
research environments. An excerpt of significant
cases is included in the appendix.
ACKNOWLEDGEMENTS
We thank OpenAI for giving us free and unlimited
access to Codex to run our experiments. This work was
supported by the Italian Ministry of University and
Research (MUR) in the framework of the CrossLab
project (Departments of Excellence) and of the
FISR 2019 Programme, under
Grant No. 03602 of the project “SERICA”.
REFERENCES
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan,
J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A., Agarwal, S., et al. (2020). Language models are
few-shot learners. arXiv preprint arXiv:2005.14165.
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O.,
Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman,
G., Ray, A., et al. (2021). Evaluating large language
models trained on code. arXiv preprint
arXiv:2107.03374.
Cobbe, K., Kosaraju, V., Bavarian, M., Hilton, J., Nakano,
R., Hesse, C., & Schulman, J. (2021). Training verifiers
to solve math word problems. arXiv preprint
arXiv:2110.14168.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018).
BERT: Pre-training of deep bidirectional transformers for
language understanding. arXiv preprint
arXiv:1810.04805.
Galatolo, F.A. (2021). Math-codex repository on GitHub,
https://github.com/galatolofededico/math-codex.
Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart,
S., Tang, E., Song, D. & Steinhardt, J. (2021).
Measuring mathematical problem solving with the
MATH dataset. arXiv preprint arXiv:2103.03874.
Radford, A., Wu, J., Amodei, D., Amodei, D., Clark, J.,
Brundage, M., & Sutskever, I. (2019). Better language
models and their implications. OpenAI Blog,
https://openai.com/blog/better-language-models.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention
is all you need. Advances in Neural Information
Processing Systems, 30.
Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael,
J., Hill, F., Levy, O., & Bowman, S. R. (2019).
SuperGLUE: A stickier benchmark for general-purpose
language understanding systems. arXiv preprint
arXiv:1905.00537.
APPENDIX
The following are selected sample problems (P) from
the GSM8K dataset, together with the reference
solution (S), the solution provided by MC-ZSL (S_C), and
finally the solution provided by GPT-3-ZSL (S_G).