LLMs Take on the Bebras Challenge:
How Do Machines Compare to Students?
Germán Capdehourat, María Eugenia Curi and Víctor Koleszar
Ceibal, Uruguay
Keywords: Artificial Intelligence, Computational Thinking, Education, K-12, LLMs.
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains. However,
their performance on tasks involving logical reasoning and computational thinking remains an active
area of research. This study analyzes the behaviour of state-of-the-art LLMs on tasks from the Bebras Challenge,
a test designed to promote computational thinking skills. We compare the outcomes of LLMs with those of primary
and secondary school students from grades 3rd through 9th in Uruguay, who participated in the Bebras
Challenge as part of the country’s Computational Thinking and Artificial Intelligence program. The results
reveal that LLM performance increases with model complexity, with the most advanced models outperforming
the average results of the younger students. Our findings highlight both the promise
and the current limitations of LLMs in tackling computational thinking challenges, providing valuable insights
for their integration into educational contexts. In particular, the results suggest that LLMs could be used as a
complementary tool to analyze a task's difficulty level, which could help accelerate the time-consuming
exchange and discussion process currently required to categorize the tasks.
1 INTRODUCTION
The rapid development of large language models
(LLMs) has revolutionized various domains,
showcasing impressive capabilities in
natural language understanding, reasoning, and task
completion. LLMs such as GPT-4, Gemini and Llama
have demonstrated proficiency across a wide range of
applications, including content generation, coding,
problem-solving, and conversational agents (Fan,
2024). Their versatility has extended to educational
contexts, where they are increasingly used as tools to
support learning, tutoring, and the development of
critical skills such as reading comprehension and
problem-solving (Li, 2024).
To evaluate and benchmark their performance,
researchers commonly use standardized tests such as
MMLU, MATH and IFEVAL. These benchmarks
help identify strengths and weaknesses, with a
consistent finding being that LLMs often excel in
tasks requiring factual recall and pattern recognition
but face challenges with tasks that demand logical
reasoning and multi-step problem-solving. Studies in
this area (Wan, 2024) highlight limitations in their
ability to perform tasks grounded in logical
consistency or requiring abstract reasoning,
indicating a gap that merits further exploration.
Computational thinking (CT) is considered a
critical competency in today’s education landscape,
referring to problem-solving based on computer
science principles and concepts (Barr & Stephenson,
2011, Shute et al., 2017). CT emphasizes skills such
as algorithmic thinking, decomposition,
generalization, abstraction, and evaluation (Grover &
Pea, 2013). These skills mirror the logical and
structured reasoning often required in tasks where
LLMs struggle. Different tests used to evaluate CT,
such as the Bebras Challenge (Dagiene, 2016),
provide a unique lens to analyze the reasoning and
problem-solving capabilities of LLMs.
Exploring the performance of LLMs on Bebras
tasks is interesting not only from an academic
perspective but also from a practical one. Beyond solving tasks
themselves, LLMs could contribute to educational
settings by assisting educators in identifying the
appropriate age range for specific challenges,
considering different steps in the resolution process,
or automating the thematic classification of tasks.
These applications could reduce the time and effort
required to design and evaluate activities, enriching
the learning process.
In this study, we examine the performance of
state-of-the-art (SoTA) LLMs on Bebras tasks,
comparing their outcomes with those of students in
grades 3rd to 9th in Uruguay. This country has been
a pioneer in incorporating technology into the
educational system, starting in 2007 with a one-to-
one computer program called Ceibal (Ceibal, 2025).
In this context, the teaching of computational
thinking was incorporated a few years ago through a
specific educational program and a curriculum
designed for such purposes. The key questions
addressed in this study are: What is the accuracy rate
of SoTA LLMs when solving Bebras tasks? and How
does the performance of those LLMs compare to that
of Uruguayan students on the same tasks?
2 COMPUTATIONAL THINKING
EDUCATION IN URUGUAY
Ceibal is the innovative one-to-one educational
initiative that positioned Uruguay as the first
country in the world to provide laptops and Internet
access to all students and teachers in public K-12
schools. In this context, Ceibal has incorporated
computer science education into classrooms through
the Computational Thinking and Artificial
Intelligence (PCIA, in Spanish) program, a joint
initiative with the National Administration of Public
Education (ANEP, in Spanish). This program
operates on an optional basis but is integrated into
the regular school schedule, with a remote teacher
working collaboratively with the classroom teacher
(Koleszar et al., 2021a). By 2024, program coverage
had reached approximately 75% of public
schools across the country. The main goal is to help
students develop foundational computer science
concepts starting in primary education, learn different
approaches to problem-solving, and express solutions
through programming.
The Bebras Challenge is an international initiative to
promote participation in CT activities. It originated at
Vilnius University, with its first edition held in
Lithuania in 2004 (Dagienė, 2010). Since then,
participation has grown steadily, with nearly
4,000,000 participants from over 70 countries in
2023. Numerous studies highlight the challenging
work of producing high-quality tasks. Each year,
representatives from all participating countries
develop, approve, and validate a shared pool of tasks
from which each country selects its challenges.
This systematic process involves academics and
educators from all member countries and consists of
multiple stages. The revised tasks are then
presented to the Bebras community during the annual
workshop, where representatives from all countries
review, improve, and select a final pool of tasks that
can be used in the annual challenge.
Since 2020, Ceibal PCIA has been part of Bebras,
organizing the challenge for the schools in Uruguay.
After a one-month preparation phase, during which
students are provided with resources and taught
strategies for problem-solving, the annual challenge
is made available on an online learning assessment
platform, where students complete it individually.
These activities contribute to building and enriching
the teaching community while providing valuable
data for research and evaluation in computational
thinking skills (Stupurienė et al., 2016). For this
study, we examined a set of tasks from the 2023
Bebras edition implemented in Uruguay for students
in grades 3rd through 9th. This edition featured 26
tasks, distributed across three age categories,
designed to evaluate the following computational
thinking skills: algorithmic thinking, generalization
and evaluation.
3 RELATED WORK
Performance evaluation of LLMs is a rapidly
evolving field, with various benchmarks designed to
assess different aspects of their capabilities. Among
the most widely used are tasks like MMLU
(Hendrycks et al., 2021), which tests multi-task
language understanding, MATH (Hendrycks et al.,
2021), which focuses on mathematical reasoning, and
IFEVAL (Zhou et al., 2023), designed for instruction-
following abilities. These benchmarks provide a
foundation for understanding the strengths and
weaknesses of LLMs across different domains.
However, there is growing interest in assessing
models on more specialized tests. In our case, we
focused on computational thinking challenges, which
usually involve a greater component of logic and
reasoning, thus posing greater difficulty for the models
(Williams et al., 2024). This kind of analysis
allows us to gauge the LLMs' problem-solving
and reasoning abilities in scenarios aligned with
human cognitive processes.
As previously mentioned, we are particularly
interested in the computational thinking problems
from the Bebras challenge. A previous work that
studied LLM performance on these tasks
considered a legacy GPT-3 model (Bellettini et al.,
2023). This work investigates the ability of OpenAI's
DaVinci model to solve tasks from the Bebras
challenge, posing research questions such as How
often is the model able to answer correctly? or Does
the model perform better with some specific types of
tasks? Although the study provides valuable insights,
its conclusions are limited by the use of an earlier
LLM, with significantly lower performance
compared to current SoTA models. Moreover, it does
not compare the performance to that of real students,
leaving an essential gap in understanding how LLMs
fare against real-world benchmarks.
A more recent study (Pădurean et al., 2024)
uses a similar approach, but focuses on
visual programming and computational
thinking tests, such as HoC, ACE, and the CT-test. These
benchmarks focus on typical block-based
programming and computational thinking problems,
sometimes involving multimodal inputs, such as
textual descriptions and accompanying visuals. The
authors examine the performance of advanced LLMs,
including multimodal variants, and find that SoTA
models like GPT-4o and Llama3 barely match the
performance of an average school student. Although
these tests differ from Bebras, they provide valuable
context for understanding the limitations of LLMs in
computational thinking evaluation.
An additional area of interest involves the
automated classification of tasks. For example, Lucy
et al. (2024) investigate the categorization of
mathematical problems. The study highlights the
potential utility of automating task tagging to
streamline the preparation of educational materials.
However, the findings suggest that LLMs often
struggle to accurately tag problems according to
predefined standards, typically predicting labels that
approximate but subtly differ from the ground truth.
For Bebras challenges, task categorization involves
several dimensions, including difficulty levels, age
ranges, and specific computational thinking skills.
The framework proposed by Dagienė et al. (2017)
introduces a dual-level categorization system for
Bebras tasks. The first level relates to computational
thinking skills (e.g. abstraction, decomposition,
generalization), while the second addresses
informatics concepts such as algorithms, data
structures and representations, computer processes,
and human-computer interaction. Investigating
whether LLMs can effectively automate this classification
process is a promising research avenue
that could significantly influence how such
challenges are organized and used.
Our study contributes to this field in several ways.
To begin with, to the best of our knowledge, this is the first
work to analyze the performance of modern LLMs,
including state-of-the-art models, on the Bebras
challenge. Secondly, unlike previous works, we
directly compare LLM performance with that of
students participating in the Bebras challenge in
Uruguay. This comparison provides a clearer
understanding of how LLMs align with human
performance on such tasks. Finally, we explore the
potential for LLMs to assist in automating
educational processes, such as task categorization and
age-range recommendations. The results show that
while LLMs demonstrate a strong ability to solve
many tasks, their performance varies by task type.
While LLM performance could be incorporated as an objective
measure of challenge difficulty, their
ability to classify challenges according to skills or
knowledge domains is still insufficient for practical
applications. The study provides valuable insights
into the integration of LLMs in educational contexts
and contributes to the broader understanding of their
strengths and limitations.
4 EXPERIMENTAL ANALYSIS
WITH STUDENTS AND LLMS
Our experimental section is divided into two main
parts. On the one hand, we present the results of the
Bebras challenge 2023 in Uruguay. A set of
tasks was selected in order to analyze the students'
performance at different grade levels. On the other
hand, we analyze the capacity of various LLMs to
solve the Bebras tasks. To do so, we carried out a
preprocessing step to obtain a text-only version of the
tasks that include images. Finally, the section
concludes with a comparative discussion of the
results obtained for students and language models.
Within the Ceibal PCIA program, Bebras has been
offered as an optional initiative for schools every year since 2020
(Koleszar et al., 2021b; Porto et al., 2024). Different
tasks are selected to suit each educational level,
distributed in four categories: grades 1st-2nd, 3rd-
4th, and 5th-6th of primary education, and 7th-9th of
secondary education. For this analysis, the results for
students from 1st and 2nd grade were discarded,
because the original Bebras tasks were modified in
those cases to facilitate reading, making the text
simpler and more age-appropriate. The final subset of
student results analyzed is detailed in Table 1, while
the corresponding subset of 22 Bebras tasks was
distributed as shown in Figure 1. It should be noted
that some of them are repeated between the different
categories.
Table 1. Dataset considered from the Bebras 2023 edition
in Uruguay.

Grade   Category name   Number of students
3rd     Benteveos                    3,717
4th     Benteveos                   17,972
5th     Cardenales                  17,789
6th     Cardenales                  19,770
7th     Horneros                     1,488
8th     Horneros                     1,058
9th     Horneros                       741
Figure 1. Distribution of the Bebras tasks carried out in
Uruguay in 2023.
4.1 Students’ Results
Benteveos. The average percentage of correct
responses was 45.5% for 3rd-grade students and
53.4% for 4th-grade students. Figure 2 presents the
detailed performance results broken down by task. To
identify the tasks, we simplified the international
Bebras code, removing the year and keeping only the
country code and the assigned number. For example,
2023-LT-01 is represented as LT-01 (as all tasks are
from 2023).
Figure 2. Average percentage of correct responses by task
for 3rd- and 4th-grade students.
The first thing that can be observed is the great
variability in performance across tasks, from cases just
above 25% to others that almost reach 80% correct
answers. Additionally, the results show that 4th-grade
students systematically perform better than 3rd-grade
students, with a significant difference ranging
between 4% and 13%, and an average of 8%. This
difference is probably explained by the fact that, in
addition to the age difference, Ceibal's PCIA
educational program starts in schools during the 4th-grade
year. A closer look at the tasks shows that
LT-01 and CH-01 were particularly complex
for all participants, as the average correct response rate
was below 35%. On the other hand, SK-04 and UY-02
were simpler, with average correct response rates
above 65%.
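The same per-task, per-grade aggregation underlies Figures 2 to 4. The following is a minimal sketch of how such an aggregation could be computed, assuming a hypothetical response export with columns task_id, grade, and correct; the file name and column names are illustrative, not the assessment platform's actual format.

```python
# Sketch of the aggregation behind Figures 2-4: average percentage of correct
# responses per task and grade. Column names (task_id, grade, correct) are
# hypothetical; the real export format of the assessment platform may differ.
import pandas as pd

responses = pd.read_csv("bebras_2023_responses.csv")  # one row per student-task attempt

percent_correct = (
    responses
    .groupby(["grade", "task_id"])["correct"]  # 'correct' assumed to be 0/1
    .mean()
    .mul(100)
    .round(1)
    .unstack("grade")                          # tasks as rows, grades as columns
)
print(percent_correct)
```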
Cardenales. Figure 3 presents the results for the
5th- and 6th-grade students. Again, the results
of the older students are higher, with an average
percentage of correct responses of 47.3% for 5th
grade and 50.1% for 6th grade. However, the
differences in this case are smaller than those
observed between 3rd and 4th grade, ranging
from 0% to 5.6%, with an average of 2.7%. The
variability among tasks is similar to the previous case,
with a more noticeable break between two
groups of tasks: those above and those below 50%
correct answers. Considering the tasks also analyzed
for the previous category, CH-01 and LT-01 achieved
a higher average of correct responses compared to the
3rd- and 4th-grade students. Additionally, task
UY-02, which already showed good performance among
the Benteveos students, also yielded
better results for the students in this category.
Figure 3. Average percentage of correct responses by task
for 5th- and 6th-grade students.
Horneros. The number of tasks used in this category
is larger than in the previous ones, as can be seen in
Figure 4. The average percentage of correct responses
in this case was 50.0% for 7th-grade students,
50.6% for 8th-grade students, and 55.3% for 9th-grade
students. Thus, an improvement in performance
is again observed as the age of the students increases.
However, the gap between 7th and 8th grade is quite
small in this case, with several tasks in which the
results are actually better for 7th-grade students. The
differences between 8th and 9th grade are larger,
and the results for each task are again
systematically better for the older students. The
performance variability among tasks falls within an
even broader range in this case, between 25% and
92%. A deeper analysis of the tasks shows that for CA-01,
UY-02, and SA-01, students achieved higher average
results compared to the previous category. Tasks BR-04
and PH-03 have the lowest average correct response rates
of all tasks, which could indicate that
these are more complex tasks.
Although the set of tasks used for each category
is different, the insights found are
quite consistent across the three analyzed categories. In
all cases, the students' average performance increases
with age. Furthermore, great variability in performance is
observed across tasks, which shows that the
set selected for the challenge covers a fairly wide
range of difficulties.
Figure 4. Average percentage of correct responses by task
for 7th-, 8th- and 9th-grade students.
4.2 LLMs Performance on Bebras
As previously mentioned, in order to perform the tests
with several LLMs, a pre-processing step was applied
to prepare the tasks. To do this, a description of the
images or graphic elements that provide key
information for the formulation and resolution of the
problem was generated, with the aim of presenting the
task as faithfully as possible to how a student receives
it. To generate these text descriptions, the OpenAI
GPT-4o model was used and the generated outputs
were manually corrected. In this way, the pre-processed
tasks used as input for the language models
include both the original task text and the
corresponding text descriptions of the images.
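As a rough illustration, the snippet below shows how such a preprocessing step could be implemented with the official OpenAI Python client; the helper names and the prompt wording are assumptions for this sketch, not the prompts actually used, and the generated descriptions were manually corrected before use.

```python
# Sketch of the image-to-text preprocessing step (illustrative, not the authors' exact code).
# Assumes the official OpenAI Python client and locally stored task figures.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_image(image_path: str) -> str:
    """Ask GPT-4o for a textual description of a task figure (hypothetical prompt wording)."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this Bebras task figure so that the task can be "
                         "solved from text alone. Include all labels and spatial relations."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def build_text_only_task(task_text: str, image_paths: list[str]) -> str:
    """Concatenate the original task text with the (manually reviewed) image descriptions."""
    descriptions = [describe_image(p) for p in image_paths]
    return task_text + "\n\nImage descriptions:\n" + "\n".join(descriptions)
```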
Various LLMs were selected to run the
experiments, some proprietary and others
with open weights. For the proprietary case, OpenAI
models were used: the standard models
GPT-4 and GPT-4o, and the more recent
“reasoning” models o1-mini and o1-preview. For the
open-weights selection, we took into account
the most recent models available in Ollama and the local
hardware infrastructure available for running the
tests. Thus, the open models selected were of lower
capacity than those of OpenAI: gemma2:2B
and gemma2:9B from Google, llama3.2:3B and
llama3.1:8B from Meta, phi3.5:3.8B from Microsoft,
and qwen2.5:7B from Alibaba. The last number in
each open model name corresponds to the number of
billions of parameters. The selected LLMs thus
cover a wide range of capacity levels.
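For reference, a minimal sketch of how the open-weight models could be queried locally follows, assuming an Ollama server on its default port and the listed models already pulled; the payload follows Ollama's documented /api/generate interface, and the placeholder prompt is illustrative.

```python
# Sketch: querying locally served open-weight models through Ollama's REST API.
# Assumes an Ollama server running on localhost:11434 with the listed models pulled.
import requests

OPEN_MODELS = [
    "gemma2:2b", "gemma2:9b",      # Google
    "llama3.2:3b", "llama3.1:8b",  # Meta
    "phi3.5:3.8b",                 # Microsoft
    "qwen2.5:7b",                  # Alibaba
]

def ask_ollama(model: str, prompt: str, host: str = "http://localhost:11434") -> str:
    """Send a single non-streaming generation request and return the raw model answer."""
    response = requests.post(
        f"{host}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["response"]

if __name__ == "__main__":
    task_prompt = "..."  # a preprocessed Bebras task (text + image descriptions)
    for model in OPEN_MODELS:
        answer = ask_ollama(model, task_prompt)
        print(f"--- {model} ---\n{answer}\n")
```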
In addition to the different LLMs, different
prompt variations were also analyzed. In this case, two
slightly different prompts were considered. The
general structure in both cases was the same,
including the task description, the question to answer,
the available multiple-choice options and a final
instruction asking the model to solve the task,
indicating the correct answer. The analyzed variation
corresponds to a chain-of-thought (CoT) approach,
where the phrase “Let's think step by step the problem
to reach the correct answer” is also included in the
prompt.
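To make the two variants concrete, the following sketch assembles the plain prompt and the CoT prompt from the elements described above; the exact wording used in the experiments is only partially quoted in the text, so this template is an approximation, and the example task is invented.

```python
# Sketch of the two prompt variants: a plain prompt and a chain-of-thought (CoT) variant.
# The template approximates the structure described in the text; it is not the exact prompt.

COT_HINT = "Let's think step by step the problem to reach the correct answer."  # as quoted above

def build_prompt(task_text: str, question: str, options: list[str], use_cot: bool) -> str:
    """Assemble task description, question, multiple-choice options and final instruction."""
    lines = [task_text, "", f"Question: {question}", "Options:"]
    lines += [f"  {label}) {opt}" for label, opt in zip("ABCD", options)]
    lines += ["", "Solve the task and indicate the correct answer."]
    if use_cot:
        lines.append(COT_HINT)
    return "\n".join(lines)

# Example usage with an invented task
prompt_plain = build_prompt("A beaver stacks logs...", "How many logs remain?",
                            ["3", "4", "5", "6"], use_cot=False)
prompt_cot = build_prompt("A beaver stacks logs...", "How many logs remain?",
                          ["3", "4", "5", "6"], use_cot=True)
```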
In order to analyze the general performance of all
the different LLMs and prompts considered, a first
experiment was conducted using all the Bebras tasks
from the 2023 edition in Uruguay. The results are
presented in Figure 5, where the first thing to notice is
that performance is directly related to model
capability. The smaller models, which have a few
billion parameters, struggle to solve the tasks, with
only a few correct responses. The mid-range open
models tested reach a performance slightly above 30%.
The best results correspond to the OpenAI models,
above 50% for the standard models and much higher
for the most advanced “reasoning” models. Finally, it
is worth highlighting that almost all models perform
better with the CoT-based prompt.

Figure 5. Average accuracy results on the selected Bebras
tasks for the different LLMs evaluated.
From now on, we concentrate on the OpenAI
models, as they achieved the best results in
the previous experiments. The first thing we analyzed
is the consistency of the previous results. Since the
output of an LLM is not always the same,
we repeated the previous experiment 10 times for
each task. In this way, we computed the number of
correct answers for each task across the
experiment runs. The results are shown in Figure 6,
where the histograms indicate that GPT-4o presents
more consistent results than GPT-4o-mini. As can be
seen, its histograms are more concentrated
on values of 1 or 0, which indicates that responses for a
given task are always correct or always wrong. It is worth
noting that little impact of the prompt is observed in
this case.
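A minimal sketch of this consistency analysis follows, assuming a placeholder run_model() call in place of the actual API client; the data structures and plotting choices are illustrative, aiming only to reproduce the per-task histograms of Figure 6.

```python
# Sketch of the consistency analysis: each task is submitted N_RUNS times per model
# and the per-task fraction of correct answers is histogrammed (cf. Figure 6).
import matplotlib.pyplot as plt

N_RUNS = 10

def run_model(model: str, prompt: str) -> str:
    """Placeholder for a call to the model API returning the selected answer option."""
    raise NotImplementedError("wire this to the actual API client")

def per_task_accuracy(model: str, tasks: dict[str, dict]) -> dict[str, float]:
    """For each task id, return the fraction of N_RUNS repetitions answered correctly."""
    accuracy = {}
    for task_id, task in tasks.items():
        correct = sum(run_model(model, task["prompt"]) == task["answer"]
                      for _ in range(N_RUNS))
        accuracy[task_id] = correct / N_RUNS
    return accuracy

def plot_consistency(acc_by_model: dict[str, dict[str, float]]) -> None:
    """One histogram per model; values near 0 or 1 indicate consistently wrong/right tasks."""
    fig, axes = plt.subplots(1, len(acc_by_model),
                             figsize=(5 * len(acc_by_model), 4), squeeze=False)
    for ax, (model, acc) in zip(axes[0], acc_by_model.items()):
        ax.hist(list(acc.values()), bins=10, range=(0, 1))
        ax.set_title(model)
        ax.set_xlabel("fraction of correct runs per task")
        ax.set_ylabel("number of tasks")
    plt.tight_layout()
    plt.show()
```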
Figure 6. Consistency analysis for the OpenAI standard
models GPT-4o and GPT-4o-mini.
4.3 Results Comparison and Further
Discussion
After looking at the results separately for students and
LLMs, we analyze and compare the performance in
both cases. Since each test varies according to the
students' grade level, a comparison can be made
between the LLMs and the grades that share the same
tasks in the Bebras challenge. Figures 7, 8 and 9
present the comparative results for the different
categories, analyzing the performance of the LLMs
on the tasks corresponding to the challenge
for these grades.
Looking at the different graphs, the first thing to
notice is that the LLMs' results cover the whole range
of students' performance. That is to say, the less
capable models behave similarly to the weaker
students, while the opposite happens with the most
advanced models. This result is highly relevant,
since it indicates that it would be possible to automate,
to some extent, the calibration of the tasks based on the
performance of the different LLMs when
trying to solve them.
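Purely as an illustration of what such a calibration could look like (this is not part of the study), the sketch below derives a rough difficulty score from which rungs of a model ladder manage to solve a task; the ladder mirrors the models evaluated above, and the scoring rule is an assumption.

```python
# Illustrative sketch (not part of the study): using a ladder of models of increasing
# capability to derive a rough difficulty score per task. A task solved only by the
# strongest models gets a high score; one solved even by small models, a low one.

MODEL_LADDER = [  # weakest to strongest, mirroring the models evaluated above
    "gemma2:2b", "llama3.2:3b", "phi3.5:3.8b",
    "qwen2.5:7b", "llama3.1:8b", "gemma2:9b",
    "gpt-4", "gpt-4o", "o1-mini", "o1-preview",
]

def difficulty_score(solved_by: dict[str, bool]) -> float:
    """Fraction of the model ladder that fails the task: 0.0 = trivial, 1.0 = unsolved."""
    failures = sum(not solved_by.get(m, False) for m in MODEL_LADDER)
    return failures / len(MODEL_LADDER)

# Example: a task solved only by the two strongest "reasoning" models.
example = {m: m in {"o1-mini", "o1-preview"} for m in MODEL_LADDER}
print(difficulty_score(example))  # 0.8 -> would be flagged as a hard task
```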
Figure 7. Comparative results: LLMs vs. 3rd- and
4th-grade students.
Figure 8. Comparative results: LLMs vs. 5th- and
6th-grade students.
Figure 9. Comparative results: LLMs vs. 7th-, 8th-
and 9th-grade students.
Furthermore, if we compare the results across the
three graphs, it can be seen that the LLMs'
performance presents a noticeable drift towards
worse results (i.e. the vertical lines move to the left)
as the age level of the categories increases. This makes
sense, since the set of tasks selected for each
category is usually associated with the corresponding
ages, with the level of difficulty increasing for older
age groups. Thus, the performance degradation of the
LLMs is probably explained by the increasing
difficulty of the set of tasks selected for each
category.
The above results indicate that it would be possible
to integrate LLMs into the review and categorization
processes of the Bebras challenge, as an objective tool
to measure the difficulty level of each task. Although
this incorporation requires further study and work
on calibration and analysis of the models,
automating these tasks would be very beneficial
and could save significant effort in the time-consuming
exchange and discussion process currently
required to categorize the Bebras tasks.
Another automation where LLMs could help is the
classification of tasks according to the skills and
knowledge they require. The aforementioned
framework (Dagienė et al., 2017) was used in this case
to test the capabilities of some LLMs on this
classification task. Based on the best results obtained
in our preliminary tests, the models struggle to classify
correctly along both dimensions. For the skill
associated with each task, the best result was 50% of
the tasks classified correctly. The result is not much
better for the knowledge domain of each task, where
the best model reached 57.7% correct
classifications.
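A minimal sketch of how such a classification experiment could be set up follows; the category lists are an illustrative subset drawn from the framework as summarized above, the prompt wording is an assumption, and ask_llm() stands in for the actual API call.

```python
# Sketch of the two-dimension classification experiment (illustrative prompt wording).
# Category lists are an example subset of the Dagienė et al. (2017) framework dimensions;
# ask_llm() is a placeholder for a call to the chosen model.

CT_SKILLS = ["abstraction", "algorithmic thinking", "decomposition",
             "evaluation", "generalization"]
INFORMATICS_CONCEPTS = ["algorithms", "data structures and representations",
                        "computer processes", "human-computer interaction"]

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to the chosen LLM, returning a single label."""
    raise NotImplementedError("wire this to the actual API client")

def classify_task(task_text: str, labels: list[str]) -> str:
    prompt = (f"{task_text}\n\nClassify this Bebras task. "
              f"Answer with exactly one of: {', '.join(labels)}.")
    return ask_llm(prompt).strip().lower()

def classification_accuracy(tasks: list[dict], labels: list[str], key: str) -> float:
    """Fraction of tasks whose predicted label matches the ground-truth annotation."""
    correct = sum(classify_task(t["text"], labels) == t[key] for t in tasks)
    return correct / len(tasks)

# Usage with annotated tasks: accuracy on skills vs. knowledge domains, comparable
# to the 50% and 57.7% figures reported above.
# skill_acc = classification_accuracy(tasks, CT_SKILLS, key="skill")
# concept_acc = classification_accuracy(tasks, INFORMATICS_CONCEPTS, key="concept")
```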
5 CONCLUSIONS AND FUTURE
WORK
This study highlights the potential and limitations of
state-of-the-art LLMs in solving CT problems such as
those presented in the Bebras Challenge. While
LLMs have shown strong performance in solving
specific types of tasks, their variability across task
categories indicates opportunities for improvement.
These findings underscore the importance of
developing targeted methodologies for integrating
LLMs into educational processes.
The results demonstrate the significant impact of
prompt design on LLM performance. Incorporating
chain-of-thought reasoning into the prompts led to
noticeable improvements in accuracy, highlighting
the importance of carefully crafting input instructions
to align with the cognitive requirements of the tasks.
Furthermore, the performance of LLMs consistently
improved with increasing model size, a trend that
parallels the progression observed in students' results
as they advance in age and grade level. This
alignment between model size and student
performance provides a natural hierarchy of difficulty
levels, where tasks can be ranked according to their
complexity and solved progressively by students or
models with corresponding capabilities.
Advanced reasoning models achieved near-
perfect scores on the Bebras tasks, showcasing their
ability to handle complex computational thinking
challenges. However, even these models exhibited
limitations when tasked with categorizing exercises
based on the skills required or the knowledge
domains involved. This indicates that, while LLMs
are effective problem solvers, their meta-cognitive
abilities to analyze and classify tasks remain
underdeveloped, presenting an avenue for further
research and enhancement.
One promising area of future work involves
incorporating LLMs into the process of challenge
generation and evaluation. By leveraging their ability
to solve tasks and analyze patterns of performance,
LLMs could provide objective measures of task
difficulty. This capability could streamline the
current time-intensive process of categorizing
challenges by difficulty and assigning appropriate age
ranges. Tools based on LLM performance metrics
could serve as valuable resources for educators and
task designers, enabling a more efficient and data-
driven approach to preparing computational thinking
activities.
Another avenue for exploration is improving
LLMs' performance in automatic task classification.
While LLMs have demonstrated some capability in
identifying key skills and knowledge domains
associated with tasks, their accuracy remains
insufficient for practical applications. Enhancing
their ability to classify tasks based on computational
thinking skills, such as abstraction or algorithmic
reasoning, could significantly benefit the design of
targeted educational interventions and the
organization of challenge databases.
Finally, the potential for modifying the Bebras
Challenge format using LLMs represents an exciting
opportunity. For example, instead of relying solely on
multiple-choice questions, challenges could include
intermediate reasoning steps where LLMs assist
students in formulating their solutions. Such
modifications would still allow for automated grading
but would provide richer insights into students'
thought processes and problem-solving strategies.
LLMs could also play a role in generating adaptive
feedback, helping students improve their
computational thinking skills in real time.
REFERENCES
Barr, V., & Stephenson, C. (2011). Bringing computational
thinking to K-12: What is involved and what is the role
of the computer science education community? ACM
Inroads, 2(1), 48-54.
Bellettini, C., Lodi, M., Lonati, V., Monga, M., &
Morpurgo, A. (2023, April). DaVinci goes to Bebras: a
study on the problem solving ability of GPT-3. In
CSEDU 2023-15th International Conference on
Computer Supported Education (Vol. 2, pp. 59-69).
SCITEPRESS-Science and Technology Publications.
Casal-Otero, L., Catala, A., Fernandez-Morante, C.,
Taboada, M., Cebreiro, B. and Barro, S. (2023). AI
literacy in K-12: A systematic literature review, in
International Journal of STEM Education, 10(1), 29.
https://doi.org/10.1186/s40594-023-00418-7.
Ceibal (2025). What is Ceibal? https://ceibal.edu.uy/en/
what-is-ceibal/
Ceibal (2022). Pensamiento computacional. Propuestas
para el aula. https://bibliotecapais.ceibal.edu.uy/info/pensamiento-computacional-propuesta-para-el-aula-00018977
Dagienė, V. (2010). Sustaining informatics education by
contests. In Teaching Fundamentals Concepts of
Informatics: 4th International Conference on
Informatics in Secondary Schools-Evolution and
Perspectives, ISSEP 2010, Zurich, Switzerland,
January 13-15, 2010. Proceedings 4 (pp. 1-12).
Springer Berlin Heidelberg.
Dagiene, V., & Stupuriene, G. (2016). Bebras--A
Sustainable Community Building Model for the
Concept Based Learning of Informatics and
Computational Thinking. Informatics in education,
15(1), 25-44.
Dagienė, V., Sentance, S., & Stupurienė, G. (2017).
Developing a two-dimensional categorization system
for educational tasks in informatics. Informatica, 28(1),
23-44.
Grover, S., & Pea, R. (2013). Computational thinking in K–
12: A review of the state of the field. Educational
Researcher, 42(1), 38–43. https://doi.org/10.3102/0013189X12463051
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M.,
Song, D., & Steinhardt, J. (2021). Measuring Massive
Multitask Language Understanding. Proceedings of the
International Conference on Learning Representations
(ICLR).
Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart,
S., Tang, E., Song, D.X., & Steinhardt, J. (2021).
Measuring Mathematical Problem Solving With the
MATH Dataset. ArXiv, abs/2103.03874.
Kim, S., Jang, Y., Kim, W., Choi, S., Jung, H., Kim, S. and
Kim, H. (2021). Why and What to Teach: AI
Curriculum for Elementary School, in Proceedings of
the AAAI Conference on Artificial Intelligence, 35(17),
15569-15576.
Koleszar, V., Pérez Spagnolo, A., & Pereiro, E. (2021a).
Pensamiento computacional en educación primaria: El
caso de Uruguay. Jornadas Argentinas de Didáctica de
las Ciencias de la Computación, Buenos Aires,
Argentina.
Koleszar, V., Clavijo, D., Pereiro, E., & Urruticoechea, A.
(2021b). Análisis preliminares de los resultados del
desafío BEBRAS 2020 en Uruguay. Revista INFAD de
Psicología. International Journal of Developmental and
Educational Psychology., 1(2), 17-24.
Fan, L., Li, L., Ma, Z., Lee, S., Yu, H., & Hemphill, L.
(2024). A Bibliometric Review of Large Language
Models Research from 2017 to 2023. ACM Transactions
on Intelligent Systems and Technology, 15(5), Article 91,
25 pages. https://doi.org/10.1145/3664930
Long, D. & Magerko, B. (April 2020). What is AI literacy?
Competencies and design considerations, in
Proceedings of the 2020 CHI Conference on Human
Factors in Computing Systems, 1-16.
Lucy, L., August, T., Wang, R. E., Soldaini, L., Allison, C.,
& Lo, K. (2024). MathFish: Evaluating Language
Model Math Reasoning via Grounding in Educational
Curricula. arXiv preprint arXiv:2408.04226.
Natali, V., & Nugraheni, C. E. (2023). Indonesian Bebras
Challenge 2021 Exploratory Data Analysis. Olympiads
in Informatics, 17, 65-85.
Ng, D. T. K., Leung, J. K. L., Chu, S. K. W., & Qiao, M.
S. (2021). Conceptualizing AI literacy: An exploratory
review. Computers and Education: Artificial
Intelligence, 2, 100041. https://doi.org/10.1016/j.caeai.2021.100041
Olari, V. and Romeike, R. (October 2021). Addressing AI
and Data Literacy in Teacher Education: A Review of
Existing Educational Frameworks, in The 16th
Workshop in Primary and Secondary Computing
Education, pp. 1-2.
Pădurean, V. A., & Singla, A. (2024). Benchmarking
Generative Models on Computational Thinking Tests in
Elementary Visual Programming. arXiv preprint
arXiv:2406.09891.
Porto, C., Pereiro, E., Curi, M. E., Koleszar, V., &
Urruticoechea, A. (2024). Gender perspective in the
computational thinking program of Uruguay: teachers’
perceptions and results of the Bebras tasks. Journal of
Research on Technology in Education, 1-15.
Li, Q., Fu, L., Zhang, W., Chen, X., Yu, J., Xia, W.,
Zhang, W., Tang, R., & Yu, Y. (2024). Adapting Large
Language Models for Education: Foundational
Capabilities, Potentials, and Challenges. arXiv preprint
arXiv:2401.08664.
Sentance, S. and Waite, J. (2022). Perspectives on AI and
data science education. Recovered from
https://www.raspberrypi.org/app/uploads/2022/12/Perspectives-on-AI-and-data-science-education-_Sentance-Waite_2022.pdf
Shute, V. J., Sun, C., & Asbell-Clarke, J. (2017).
Demystifying computational thinking. Educational
Research Review, 22, 142–158. doi:10.1016/j.edurev.
2017.09.003
Tedre, M., Denning, P. and Toivonen, T. (November 2021).
CT 2.0, in Proceedings of the 21st Koli Calling
International Conference on Computing Education
Research, pp. 1-8.
Touretzky, D., Gardner-McCune, C., Martin, F. and
Seehorn, D. (2019). AI for K-12 Envisioning: What
Should Every Child Know about AI?, in Proceedings of
the AAAI Conference on Artificial Intelligence, 33(01),
9795-9799.
UNESCO (2023). Currículos de IA para la enseñanza
preescolar, primaria y secundaria: un mapeo de los
currículos de IA aprobados por los gobiernos.
Recovered from https://unesdoc.unesco.org/ark:/48223/pf0000380602_spa
Williams, S., & Huckle, J. (2024). Easy Problems That
LLMs Get Wrong. arXiv preprint arXiv:2405.19616.
Wan, Y., Wang, W., Yang, Y., Yuan, Y., Huang, J., He, P.,
Jiao, W., & Lyu, M. (2024). LogicAsker: Evaluating and
Improving the Logical Reasoning Ability of Large
Language Models. In Proceedings of the 2024
Conference on Empirical Methods in Natural Language
Processing (pp. 2124–2155). Miami, Florida, USA.
Association for Computational Linguistics.
Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y.,
Zhou, D., & Hou, L. (2023). Instruction-Following
Evaluation for Large Language Models. ArXiv,
abs/2311.07911.