Large Language Models in Civic Education on the Supervision and
Risk Assessment of Public Works
Joaquim J. C. M. Honório
1a
, Paulo C. O. Brito
1b
, J. Antão B. Moura
2c
and Nazareno F. Andrade
2d
1
Graduate Program in Computer Science, Federal University of Campina Grande (UFCG), Brazil
2
Systems and Computing Department, Federal University of Campina Grande (UFCG), Brazil
Keywords: Civic Education, Large Language Models, Machine Learning, Public Works.
Abstract: The Public Administration spends an estimated 13 trillion USD annually worldwide, of which
approximately 20% is allocated to public works. Despite strict rules, unfinished works for legal reasons,
including corruption, are not atypical, negatively impacting the region’s economy, culture, and society.
Civic awareness about this problem may help reduce such losses. This study investigates the use of Large
Language Models (LLM) and Retrieval-Augmented Generation (RAG) to support civic education on risks
in public works. While LLMs interpret and create human language, RAGs combine text production with
access to other external data, allowing contextualized responses. Here, we evaluate how these technologies
can facilitate the population’s understanding of technical information about public works. To this end, we
initially create and evaluate 4 Machine Learning models for risk prediction of public work failure, using
data from real public works. We provide a failure estimate for each contracted work based on the most
efficient model. These data and others related to government development and risk processes are accessed
and presented to the user through a web support system. Tests with 35 participants indicate a significant
improvement in citizens ability to understand complex aspects related to risks and contracts of public works.
1 INTRODUCTION
In the current situation of increasing complexity of
public funded, government managed projects, citizen
education becomes very relevant to monitoring and
assessing costs, quality, and effectiveness of said
projects. Of particular interest here are the so called
“public works”. The participation of the community
as an observer is important to properly monitor
project activities, and to ensure that the community’s
needs are met and that the public administration is
transparent and effective (Twizeyimana &
Andersson, 2019).
Public Administrations worldwide procure i.e.,
contract the acquisition, of – goods, services and
works from companies to the yearly tune of an
estimated 13 trillion US dollars
1
. Corruption and
a
https://orcid.org/0000-0001-5746-2108
b
https://orcid.org/0000-0002-0913-7586
c
https://orcid.org/0000-0002-6393-5722
d
https://orcid.org/0000-0001-5990-9495
mismanagement of procurement contracts, however,
may cause up to 30% of this estimate to be lost, with
procurement contracts be responsible for 57% of all
bribery cases in the Organization for Economic Co-
operation and Development (OECD) countries
2
.
These large numbers suggest an important problem
in terms of supporting the Public Administration to
manage procurement contracts and society at large
to monitor more effectively, by forecasting risk of
contract failure in special.
Information about public works projects is
frequently made available via government
transparency portals, public databases, and
independent organizations. Aside from particular
contract information, it is common to give statistical
estimations, frequently collected using Machine
Learning (ML) approaches. These estimates, aimed
1
The
Global
Value
of
Public
Procurement:
https://spendnetwork.com/13-trillion-the-global-value-
of-public-procurement/
2
OCDE
-
Highlights
Reforming
Public
Procurement:
https://www.oecd.org/gov/public-procurement/public-
procurement-progress-report-highlights.pdf.
Honório, J., Brito, P., Moura, J. and Andrade, N.
Large Language Models in Civic Education on the Supervision and Risk Assessment of Public Works.
DOI: 10.5220/0012589300003693
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 16th International Conference on Computer Supported Education (CSEDU 2024) - Volume 2, pages 27-38
ISBN: 978-989-758-697-2; ISSN: 2184-5026
Proceedings Copyright © 2024 by SCITEPRESS Science and Technology Publications, Lda.
27
at both the general public (e.g.: citizens) and
regulatory agencies, can include projections of costs
(Bayram & AlJibouri, 2016) (Barros, Marcy, &
Carvalho, 2018), contractual additions (Gallego,
Rivero, & Martínez, 2021), bid-rigging (Huber &
Imhof, 2019), and execution time (Titirla &
Aretoulis, 2019), among other things, as well as the
risk of project failure owing to legal rulings (Sun &
Sales, 2018) (Gallego, Rivero, & Martínez, 2021).
Several studies have investigated applying
machine learning to assess risks in public works.
However, non-specialists such as members of the
general population – may need help to use these
strategies in practice (Yang, Suh, Chen, & Ramos,
2018). While robust, machine learning frequently
allows for complicated outcomes their complexity
nature requires specialist expertise for interpretation,
making citizen participation and comprehension
difficult. Chat assistants can help overcome this
barrier, improving a lay person’s understanding of
technical issues (Pérez, Daradoumis, & Puig, 2020).
Large Language Models (LLM) are extensively
trained language models that can comprehend and
generate human language, trained with a large
volume of textual data (Kasneci, et al., 2023).
Retrieval-augmented generation (RAG) is an
extension of the models mentioned above that
enables the incorporation of the capability to obtain
and use information from external source (Lewis, et
al., 2020). By being combined with RAG, LLM’s
replies may include up-to-date external information
beyond the range of the data they were trained in.
In everyday life, LLMs offer support in various
applications, facilitating access to information in
different areas (Kasneci, et al., 2023). In education,
these tools pave the way for the creation of tutors
specialized in specific areas. They can be adapted to
offer personalized support in different disciplines,
for example, to answer multiple-choice questions
about code (Jaromir, Arav, Christopher, & Majd,
2023), for medical education (Kung, et al., 2023), to
give feedback to students (Wei, et al., 2023), among
others. Despite efforts, these models are yet to be
applied to the context of public works to support
civic education.
Despite advances in research developed around
ML in public works management, there are
limitations in results obtained so far. Firstly, there
are few efforts published in academic circles aimed
at predicting risks in public works, most of which
are directed at the entire area of government
procurement. Furthermore, there is a lack of
academic efforts specifically aimed at the practical
application of these technologies so that they are
accessible to the non-specialized population. This
scenario highlights the need for methods that enable
technical information to be made available in a more
accessible language, allowing greater understanding
and community participation in supervising, and
monitoring public works.
This work aims to support civic education by
making information about complex topics, such as
predictions generated by machine learning
algorithms more easily accessible to the non-
technical, general public. To achieve this objective,
a model for estimating risks in public works is
developed based on ML techniques to predict the
risk of project failure. Simultaneously, a chat
assistant model is developed that utilizes LLM and
RAG to transform and contextualize this technical
information into a language easily comprehensible
and accessible to non-technical citizens. Thus, this
article addresses the following research
questions that are derived from the above objectives:
Is it feasible to develop a specialized ML model
for assessing and predicting risks in public
works projects?
Are LLM models capable of assisting non-
experts in understanding information about
public works?
How does the integration of external
information (e.g., processes, machine learning
models, data on public works) through RAG
affect the accuracy of the information provided?
This article contributes to Civic Education (and
also, to applications of Artificial Intelligence AI),
by promoting transparency and citizen participation
through simplifying complex information about
public works. To date and to the best of our
knowledge, this is the first article to:
Evaluate the performance of LLM and RAG
models, including their limitations, in
simplifying complex information about public
works.
Provide an overview of the development flow to
describe complex information in an accessible
manner, contributing to transparency and civic
education.
Present a risk prediction model for estimating a
public work’s failure risk.
To facilitate replication of the contents of this
article, all used instruments and procedures are made
available according to practices of the Open Science
Framework (https://osf.io/7byd9/) in the external
repository.
CSEDU 2024 - 16th International Conference on Computer Supported Education
28
2 FUNDAMENTALS AND
RELATED WORK
2.1 Problem Definition
In the development of a ML model for risk
assessment in public works, a detailed set of project
characteristics is crucial. Let 𝑃=
𝑝
,𝑝
,…,𝑝
be
the complete set of available projects. A given
project, p
, where 𝑖 = 1,2,𝑛 , is described by a
feature vector
x
𝑝
𝑖
that belongs to the risk space R
,
where 𝑑 represents the variables such as cost,
duration, dimension, type of infrastructure, among
others.
The main function of the ML model, denoted by
M, is to map this feature vector to a risk estimate.
Mathematically, this is represented by the function
M:R
0,1
, which generates an estimated
probability of failure or a degree of risk,
𝑅
, for
each project p
. The aim of the model is to fine-tune
this mapping function to minimize the prediction
error, quantified by the sum of squared differences
between the observed real risks,
R
, and the model’s
estimates,
𝑅
. This goal is expressed by the
objective Equation 1.
min
𝑅
− 𝑅
(1)
Where 𝑛 is the total number of projects
evaluated. This minimization process is crucial to
ensure that the model makes accurate predictions
about the risks associated with different
infrastructure projects.
An assistant, 𝐴, is a chatbot that functions to
transform the quantitative outputs of the model 𝑀
into clear and understandable textual explanations.
The model 𝐴 can be described by the function
𝐴:
𝑅
𝑝
𝑖
×𝑅
×𝒟𝒯, where 𝒟 is a set of external
data and 𝒯 is the space of all possible explanatory
texts. For each project 𝑝
, 𝐴 generates an
explanatory text
𝑇
, which is a function of the risk
estimates
𝑅
, the project’s features 𝑥
, and the
contextual data from 𝒟.The efficacy of the chat
assistant is evaluated based on the accuracy, clarity,
and relevance of the textual explanations
𝑇
, by
presenting a random subset 𝑆 of selected projects for
assessment. Here, 𝑆 is mathematically defined as a
specific collection of elements, chosen from a larger
set 𝑃 of all possible projects, such that 𝑆⊆𝑃.
2.2 Machine Learning Model for Risk
Prediction
In recent years, the application of data mining and
machine learning (ML) techniques has gained
significant attention in improving the accuracy
around public procurement. Previous research in
public works has focused on estimating cost (Titirla
& Aretoulis, 2019), duration of projects (Titirla &
Aretoulis, 2019), preventing collusion in bidding
(Huber & Imhof, 2019) and predicting risk,
inefficiencies (Gallego, Rivero, & Martínez, 2021).
In parallel, ML studies have addressed risk
estimation and anomaly detection in all types of
government procurement, including works
(Domingos, Carvalho, Carvalho, & Ramos, 2016)
(Ivanov & Nesterov, 2019) (Sun & Sales, 2018).
One widely adopted framework in data mining
projects is the CRISP-DM (Cross-Industry Standard
Process for Data Mining) methodology, known for
its systematic and structured approach. Studies have
showcased its effectiveness in guiding the prediction
process and providing actionable insights for
decision-makers (Schröer, Kruse, & Gómez, 2021).
The methodology includes the steps illustrated in
Figure 1, starting with Business Understanding,
where relevant data for public works is collected and
prepared. The next step, Data Understanding
defines project objectives and aligns data analysis
with these goals. In Data Preparation, data is
cleaned, and key features are selected for prediction.
The Modelling step involves selecting machine
learning algorithms and splitting the data for training
and testing. Evaluation assesses model performance
using metrics like F1-score and ROC-AUC, with
adjustments made as needed. Finally, Deployment
implements resulting models in real-world scenarios,
integrates them into systems.
Figure 1: Overview of the CRISP-DM Methodology for
Risk Model Development.
Within the Modelling step of CRISP-DM,
various machine learning algorithms have been
utilized to predict public contracting outcomes.
Data
Understanding
Data
Preparation
Evaluation
Deployment
Modelling
Business
Understanding
Start
En
d
Large Language Models in Civic Education on the Supervision and Risk Assessment of Public Works
29
Notably, Random Forest (RF) (Auret & Aldrich,
2012) algorithms have proven to be robust and
accurate, capable of handling complex data
structures and capturing nonlinear relationships.
Artificial neural networks (ANN) (Hond, Asgari, &
Jeffery, 2020) with their ability to learn complex
patterns, have also shown promise in forecasting
contracting outcomes. Additionally, Stochastic
Gradient Boosting (SGB) (Bentéjac, Csörgő, &
Martínez-Muñoz, 2021), known for its ensemble
learning technique, has demonstrated its
effectiveness in improving prediction accuracy.
Logistic regression (LR), a classical statistical
method (Christodoulou, et al., 2019), has been
utilized as well, offering interpretability and model
transparency.
Evaluation metrics play a vital role in assessing
the performance of prediction models. Studies in this
field have employed various evaluation metrics to
gauge the accuracy and reliability of the models.
Here, adopted metrics are accuracy, precision, recall,
F1-Score, and area under the ROC curve.
Precision (Equation 2) is a metric that measures
the proportion of correctly predicted positive
instances out of all instances classified as positive.
In the context of public works contract failure
prediction, precision indicates the model’s ability to
accurately identify contracts that are likely to fail. A
higher precision suggests that the model has a lower
rate of false positive predictions, minimizing the
chances of wrongly flagging contracts as failures.
𝑃
=
𝑇𝑃
𝑇𝑃 +
𝐹𝑃
(2)
Where true positives (𝑇𝑃) represent the instances
correctly identified as contract failures. These are
cases where the prediction aligns with the actual
outcome, indicating that the model accurately
identified a failing contract. True negatives ( 𝑇𝑁)
denote instances correctly identified as successful
contracts, where the model correctly predicts the
absence of failure. On the other hand, false positives
(𝐹𝑃) occur when the model incorrectly predicts a
contract failure that succeeds. False negatives (𝐹𝑁)
represent cases where the model fails to identify an
actual contract failure, incorrectly classifying it as a
successful contract.
Recall (Equation 3), also known as sensitivity or
true positive rate, evaluates the proportion of
correctly predicted positive instances out of all
actual positive instances. In the context of contract
failure prediction, recall assesses the model’s ability
to capture and identify all contracts that are failing.
A higher recall indicates that the model can
effectively identify a larger portion of failing
contracts.
𝑅
=
𝑇𝑃
𝑇𝑃 +
𝐹𝑁
(3)
F1-Score (Equation 4) is a composite metric that
combines precision and recall into a single value. It
considers both false positives and false negatives
and provides a balanced measure of the model's
performance.
𝐹1 =
𝑃
𝑅
𝑃
+
𝑅
(4)
ROC-AUC is a metric commonly used in binary
classification tasks and evaluates the model's ability
to discriminate between positive and negative
instances at various probability thresholds. It plots
the true positive rate against the false positive rate
and calculates the area under the resulting curve. A
higher ROC-AUC score indicates better
discrimination ability and overall performance of the
contract failure prediction model.
By employing these metrics, we aim to
comprehensively evaluate the performance of the
public works contract failure prediction models
proposed in this article. The adopted metrics
provided a robust set of measurements, capturing
different aspects such as false positives, false
negatives, overall accuracy, and discriminative
ability of the models in predicting contract failures.
2.3 Large Language Models
AI is increasingly evident in various domains and
stands out in education. Among the most notable
technological advances today is LLM, which
emphasizes OpenAI's ChatGPT
3
. Generative AI
encompasses the creation of novel synthetic content
for diverse tasks (García-Peñalvo & Vázquez-
Ingelmo, 2023). These models, including Generative
Pre-trained Transformers (GPT), are based on deep
neural networks, which allows them to understand
and generate text effectively (Leippold, 2023). GPT-
3.5 Turbo
4
, with its 175 billion parameters, has
outstanding capacity compared to previous models.
Despite the gains brought through LLM, it
presents restrictions. One of the most significant is
“hallucination”
caused by returning generic or
3
ChatGPT: https://chat.openai.com/
4
GPT-3.5 Turbo: https://platform.openai.com/docs/model-
index-for-researchers
CSEDU 2024 - 16th International Conference on Computer Supported Education
30
Figure 2: Flowchart depicting the entire process from risk prediction for each project in the study set to the question-and-
answer mechanism of the chat assistant.
fictitious information, which can lead to a loss of
credibility and trust among users (Yuyan, et al.,
2023). Furthermore, although advanced, these
models still struggle with challenges related to
understanding contexts, which can result in
responses that do not capture what was asked by the
user.
The potential of the LLM in education has been
studied for different purposes. Researchers evaluated
the effectiveness of GPT-3 in programming tests
with code snippet questions (Jaromir, Arav,
Christopher, & Majd, 2023) (Savelka, Agarwal,
Bogart, Song, & Sakr, 2023). Another study
evaluated GPT-3 in medical exams, achieving or
approaching approval without specialized training,
suggesting its potential in medical education and
clinical decisions (Kung, et al., 2023). GPT has also
been evaluated in automated educational feedback
systems, showing that it generates detailed, coherent
responses aligned with instructor assessments,
potentially helping to develop students’ learning
skills (Wei, et al., 2023). Research has also explored
how English learners perceive and use ChatGPT
outside of the classroom, finding that LLM’s
perceived usefulness has had positive perceptions
for its actual use in informal English learning
(Guangxiang & Chaojun, 2023). Other researchers
have proposed an architectural model aimed at LLM
research, with practical use in education (Gonzalez,
2023).
Despite advances in integrating LLMs into
various educational areas, there is a significant gap
in the literature in terms of a lack of studies focused
on using LLMs to address complex topics in civic
education, such as processes and statistics. Thus, this
work adds to those in the literature by exploring the
effectiveness of these models to enrich teaching and
learning on socially relevant topics such as citizens’
awareness of risks of public works projects running
afoul of the law, intended quality and costs or of
social interests.
3 MATERIALS AND METHODS
3.1 Methodology
In this study, we evaluate the use of LLM and RAG
to teach complex concepts in civic education,
focusing on risk estimates of public works failure.
Initially, we develop and evaluate four different ML
models, selecting the most efficient one based on the
ROC-AUC metric. The selected ML model is then
integrated with the chat assistant to enhance system-
generated responses. Figure 2 presents an overview
of the structure used.
What does the risk of this work mean?
Question
Consider that the query contains the data of the work
(
𝑝
) selected in the panel and the predicted risks ( 𝑅
) for it.
Retrieval
Model
Processes of dev. of
the local works
How the risk model
was developed and
risk levels
Knowledge
Considering the estimated risk, it
suggests a moderate to high likelihood
of legal challenges affecting the
project’s completion.
Pre-trained
LLM
nswe
Search
External data (𝒟)
Query
Query +
𝒟
Response
Public Works
List Provider
(P) Public
Works Data
Risk
Predictor
Best
Classification
Model
Features
Training
Set
Test
Candidate
Classification
Models
Random Subset
of Public
Works
Evaluation with
the ROC-AUC
Including class balancing
techniques.
∀p
∈S, R
S
P
Large Language Models in Civic Education on the Supervision and Risk Assessment of Public Works
31
As part of the methodology, a questionnaire
5
was
applied to measure different aspects of the user's
experience with the chat assistant which makes use
of the risk estimate produced by the selected ML
method. The questionnaire was designed to assess
the users’ understanding of the concepts presented,
the accuracy of the delivered information, the ease
of use and the practicality of applying the
information in real-world scenarios. The
questionnaire also included questions that ranged
from demographic information to specific questions
about public works concepts - such as risk. Users
received the questionnaire digitally after
participating in an interactive session with the chat
assistant, during which they explored information
about public works and risk prediction. This process
ensured that users’ responses reflected their practical
experiences using the system. All interaction
between the user and the system was conducted
entirely in Portuguese.
We employed a combination of research
methods to analyze both quantitative and qualitative
descriptive data gathered from evaluation by users.
To assess participants perspectives quantitatively we
applied techniques to evaluate their responses on
five-point Likert scales. Additionally, we conducted
a qualitative analysis using content analysis
methodology, where writers assessed the feedback.
This approach aimed to explore factors that might
not be adequately captured by the data gathered
from the sample.
Combining quantitative and qualitative
approaches gave us a more comprehensive
understanding of user perceptions. Although Likert
scales helped us measure values, analyzing the final
answers provided qualitatively made it possible to
interpret the quantitative data more deeply. This
combination of quantitative methods improved our
research, contributing to understanding how chat
assistants can be applied effectively in civic
education.
3.2 Data Set
In the development of both the Risk Estimation
Model and the Chat Assistant, we utilized data from
5
Text Assistant Survey Questionnaire: https://osf.io/
7byd9/?view_only=7e2f643daa5b4d09a68b7284e54291
59
Brazilian public works, spanning the period from
2010 to 2019. The selection of this timeframe allows
for considerable completeness of the available data
during these years. By encompassing a decade of
public contracting records, this study ensures a
comprehensive and robust analysis of the dynamics
and patterns within the Brazilian public
administration regarding works procurement. In
total, four databases were used:
Companies’ Registers: this database, provided
by the Federal Revenue Service in Brazil (in
Portuguese, Receita Federal
6
), has data about
companies’ registries, partners, investors,
registration status, the national classification of
economic activities and other data about legal
entities in Brazil. This base is available in a
database format, making it difficult to use
directly. To solve this difficulty, the database
was configured in a local environment.
Status of Public Works Contracts: this
database provided by Tramita
7
consists of a
system for internal processing and managing
electronic documents/processes. It is possible to
access it from the start of procurement until it is
archived. For research on public works, the data
gathered in Tramita is essential, mainly for
collecting information regarding accurate
contract resolutions (whether a work contract
has been successfully completed or terminated
for administrative reasons). Although Tramita
is a system that does not require internal
credentials to access processes and documents,
there is no means to access and download the
complete dataset of contract terminations. To
circumvent such limitation, contractual data
through 2019 was made available by auditors
with special access to the tool.
Contract Attributes: through the SAGRES
8
application it is possible to access information
such as budget execution, bidding,
administrative procurements, registration
information, and personnel payrolls of related
jurisdictional units.
6
Companies in Brazil: https://dados.gov.br/dados/
conjuntos-dados/cadastro-nacional-da-pessoa-juridica--
-cnpj
7
System for electronic processes: https://tramita.tce.
pb.gov.br/tramita/pages/main.jsf.
8
Electronic reporting system SAGRES: https://tce.pb.
gov.br/sagres-online
CSEDU 2024 - 16th International Conference on Computer Supported Education
32
Public Works Attributes: Through the
databases from the Geo-PB
9
and Painel de
Obras
10
, it is possible to access information
regarding the public works carried out in the
state of Paraíba in Brazil. The data in this
system gathers information about geo-
references, contract number, financial amounts
which were paid, type, and built area, among
other details of a public work.
During the cleaning process, incorrect,
incorrectly formatted, duplicated, and incomplete
data were removed from the data set. After cleaning,
the data were integrated into one file, resulting in a
sample with 643 works.
3.3 Risk Prediction Model
Four supervised machine learning algorithms were
evaluated: Artificial Neural Networks (ANN),
Random Forest (RF), Logistic Regression, and
Stochastic Gradient Boosting (SGB). The training
for each was conducted using Scikit-learn
11
library
in Python, streamlining the creation and adjustment
of machine learning models by merely changing
specific parameters. The data sets were divided in
chronological order to prevent information leakage
between training and testing phases. For each model,
cross-validation was conducted with five subsets,
considering metrics such as Precision, Recall, F1-
Score, and ROC-AUC.
In building the ANN, the primary parameters set
were size and decay. The size refers to the number
of neurons in the hidden layer, with values of 1, 3,
and 5, while decay, a regularization parameter, was
set at 0, 0.1, and 0.0001. For the RF models the
number of predictors was randomly selected for each
branch of the trees at each node, with values of 2,
14, and 26. The development of SGB involved a
number of trees ranging between 50, 100, and 150,
and depth values of 1, 2, and 3.
In this study, 10 attributes from relevant
literature were selected, chosen for their importance
and applicability to the subject. The selection was
based on bibliographic reviews and consultations
with experts, aiming to ensure that each attribute
contributed significantly to the study. The attributes
used in each algorithm were:
Total Bids Won: The total number of bids won
by the company (contractor).
9
Information’s about public works: https://tce.pb.
gov.br/noticias/geo-pb-ferramenta-de-controle-de-
obras-e-servicos-de-engenharia
10
Information’s about public works: https://painel
deobras.tce.pb.gov.br/
11
Scikit-learn: https://scikit-learn.org/stable/
Total Bids Disputed: Total number of bids
(including those won) the company participated
in.
Number of Previous Failures: Total number of
failures in the conclusion of the work.
Bid Type: Modality of public procurement
process, classified into six types: competitive
bidding, invitation, price taking, contest, trading
floor, and auction.
Number of Activities: The number of company
activities according to the National
Classification of Economic Activities (CNAE),
indicating the activities performed by the
company.
Age of the Company (Years): The time
elapsed from the creation of the company until
the conclusion of the contract.
Pending Submission of Information to
Control Agencies: History of pending
registrations, such as lack of monitoring, lack of
georeferencing, outdated estimates, among
others.
Number of Districts Served: The number of
districts in which the contractor has worked.
Value of the Work: Total amount destined to
the execution of the work.
Duration: Time estimated for the development
of the work.
After defining and evaluating the algorithms'
hyperparameters, the performances of the four
classification models were assessed and compared
using the original data sample and two artificial
balancing methods (undersampling and
oversampling).
3.4 Chat Assistant
A user interface developed in React
12
(see Figure 3)
was implemented to conduct the empirical
evaluation of the use of LLM and RAG, which
connects to an API built in Flask Python
13
. This API
is responsible for the functions: returning responses
generated by the LLM/RAG model, carrying out the
calculation of risks associated with public works,
and providing information relevant to the work that
the user is viewing on the interface (see Figure 2,
which presents an overview of the architecture
used).
12
React: https://react.dev/learn
13
Flask Python: https://flask.palletsprojects.com/en/3.0.x/
Large Language Models in Civic Education on the Supervision and Risk Assessment of Public Works
33
Figure 3: Demonstration of the chat assistant interface
presented to the user. (1) data provided to the user
regarding the works, including risk estimates; (2) chat that
allows the user to ask questions, including about elements
presented in item 1.
The chat assistant was built using the LLM
algorithm, specifically GPT 3.5 Turbo. We chose
this algorithm because it offers a more affordable
solution compared to GPT-4 but is still effective in
generating coherent responses. To utilize this
algorithm, we integrated it with the OpenAI API
14
.
This integration was implemented to enhance the
experience of experiment participants by ensuring
faster response times through resources. We relied
on the Langchain
15
library to integrate the user
interface and specific system functionalities.
For the responses generated by the system to be
relevant to the context of public works education, it
was necessary to create a prompt. The prompt
created, as shown in Figure 4, was designed to direct
the way the system processes and answers the
questions. As it can be seen, it was developed to act
as a tutor on the topic of public works, having
detailed information about the work being presented
to the user.
A range of external data was made available for
the models. The first refers to information about the
risk prediction model, including data on the input
variables, the methodology used in its construction,
and the variables that have the most significant
impact on the calculated risk results obtained
through the measurement of the importance of each
of them through Random Forest. This integration
14
OpenAI API: https://platform.openai.com/
15
Langchain:
https://python.langchain.com/docs/get_
started/introduction
was added to ensure that when users inquire about
specific aspects of the risk model, the assistant,
through RAG, can provide accurate and detailed
answers. In addition, information was also made
available on local public works development
processes, specifically the processes used in Brazil.
Figure 4: Representation of the prompt used in the chat
assistant: (1) Describes the task that an AI should perform;
(2) Information about the work 𝑝
and 𝑅
displayed in
the interface; (3) User-entered question.
4 RESULTS AND DISCUSSION
4.1 Risk Prediction Model
In a first experiment, the data used for training the
classification models were not balanced, so the
contracts considered as high risk corresponded to
37% of the data.
The Random Forest, Stochastic Gradient
Boosting, and Neural Networks models were the
least effective in distinguishing between the classes,
having a ROC-AUC close to or equal to 0.5. The
small amount of available data is considered to have
compromised the learning process of the models due
to the attempt to minimize classification errors,
resulting in classifying all being low risk.
After class balancing through the undersampling
technique on the training set, the resulting sample
had 192 data items, 96 for each class. It can be seen
from Table 1 that, in this second experiment, the
most effective model, with a ROC-AUC of 0.64,
was Random Forest, correctly classifying 19
contracts as high risk and with 9 false positives.
Despite the Random Forest model having a higher
ROC-AUC, the Neural Networks model correctly
classified 24 contracts as high risk, with only 4 false
positives. Finally, Logistic Regression was the
model with the worst performance among the others,
where the classifier labelled all contracts as high
CSEDU 2024 - 16th International Conference on Computer Supported Education
34
Table 1: Comparison of machine learning model performances with different data balancing methods. The Random Forest
model under Oversampling, highlighted for its top ROC-AUC score, is the best performer.
Balancing method Model Precision Recall F1-Score ROC-AUC
Original Data Sample
RF - 1 0 0.5
ANN 0.32 0.24 0.27 0.54
LR 0.29 0.19 0.22 0.66
SGB - 1 0 0.5
Undersampling
RF 0.68 0.38 0.48 0.64
ANN 0.86 0.27 0.41 0.61
LR 0.26 1 0.41 0.5
SGB 0.33 0.65 0.44 0.62
Oversampling
RF
0.42 0.63 0.51 0.75
ANN
0.71 0.39 0.50 0.66
LR
0.64 0.39 0.48 0.65
SGB
0.47 0.39 0.43 0.67
risk. It is believed that this performance was due to
insufficient data from the balancing technique used.
In the third and last experiment, class balancing
was performed through the oversampling technique
applied to the training set, resulting in a sample with
1094 data items, 547 for each class. Finally, as with
balancing through oversampling, the Logistic
Regression was the model with the worst
performance when the ROC-AUC metric was
observed (see Table 1). Despite the low efficiency
when compared to the other models, the model
correctly classified 18 high-risk contracts, but just
like the Neural Networks model, it obtained a high
value of false negatives compared to Random
Forest.
Among the three experimental scenarios
observed for the risk modelling of public works
procurement, the classifiers that best differentiated
the high and low-risk classes were, based on the
ROC-AUC metric, Random Forest with
Oversampling technique, Random Forest with
undersampling technique, and Logistic Regression
through the original sample without any balancing
treatment.
The oversampling technique, aimed at increasing
the number of records in the class with lower
frequency until the base is balanced, together with
Random Forest, was the treatment with the highest
ROC-AUC among the others, having a confidence
interval ranging from 0.64 to 0.87 (95% CI). Next,
the Logistic Regression algorithm proved to have
better results when used with the original sample
(95% CI: 0.46 - 0.81). Finally, the undersampling
technique, also with the Random Forest algorithm,
was the approach with the lowest results with a
ROC-AUC of 0.64 (95% CI: 0.57 - 0.72). All the
results presented can be seen in Figure 5.
Figure 5: Comparison between the algorithms with the
best results, and their data balancing techniques.
It is possible to use machine learning-based
classifiers to reveal risks in public works, considering
that the ROC-AUC values are higher than the
threshold value of 0.5 (random classifier). Despite
such threshold, as there is an overlap between the
confidence intervals, there is no significant difference
between the approaches used to indicate the most
effective approach. In the best-case scenario, Random
Forest with Oversampling will provide better results
than the others. In contrast, in the worst-case scenario,
the Logistic Regression with the original balancing
may be less effective than the other approaches and,
in the best case, superior to Random Forest with
undersampling. By measuring the importance of each
of the Random Forest with oversampling input
variables, it was possible to identify that the 5 most
important variables for the result were: value of the
work, dimension, number of activities of the
contracted company, pending issues detected by
government agencies and estimated duration.
The above observations come with limitations.
Generalizing our findings faces the threat of data
poor representativeness, as public data we used may
not reflect the full range of projects or adequately
capture the diversity of contexts and conditions in
other countries.
0.45
0.55
0.65
0.75
0.85
LR
Original dataset
RF
Undersampling
RF
Oversampling
ROC-AUC
Large Language Models in Civic Education on the Supervision and Risk Assessment of Public Works
35
4.2 Chat Assistant
The research used questionnaires to evaluate LLM
within the scope of civic education and supervision
of public works. Quantitative and qualitative
analyses were based on responses to a questionnaire
administered to 35 individuals.
The average age of the individuals was 38 years
old with ages ranging from 19 to 54 years. In terms
of gender distribution there were 19 males, 14
females and 2 participants who identified as other.
When it comes to education most participants (16)
had completed high school while 14 had a university
undergraduate degree and 5 held a postgraduate
degree. This indicates that the sample used in this
study exhibits a range of educational levels. The
inclusion of ages and educational backgrounds
showcases the diversity, within this participant
group.
We also included some questions to gauge
participants understanding of AI. The results showed
that 20 participants had no knowledge about it; 10
had a basic understanding; 4 had moderate
knowledge; and only one demonstrated advanced
knowledge. Alongside AI knowledge we also
assessed their familiarity with public works. Most
participants categorized their knowledge of works as
basic (18) while 16 indicated they had no knowledge
and only one said it had “advanced knowledge”.
Figure 6 presents users’ experiences with the
chat assistant, evaluated using a 5-level Likert scale.
The graph highlights that 21 participants strongly
agree with the clarity and ease of understanding of
the chat assistant’s responses. Ease of use also
scored highly, with 34 participants agreeing or
strongly agreeing with its ease of use. For
understanding public works, 32 participants gave
ratings that ranged from neutral to strongly agree. As
can be seen, only a tiny portion of the responses
demonstrate disagreement, which suggests strong
user approval of the chat assistant’s features.
A better understanding of users’ responses was
possible through the joint analysis of quantitative
(Likert questions) and qualitative (open questions)
approaches. This analysis made it possible to
identify that participants with no or basic knowledge
of AI tended to focus on general questions about
public works or superficial questions about risk. In
contrast, participants with more advanced or
moderate knowledge of AI used external data
through RAG more effectively, especially to
understand complex processes such as risk
estimation in public works. This group demonstrated
to take full advantage of RAG’s advanced
capabilities, leveraging its potential for more
complex and detailed tasks (e.g., how the models
were built, algorithms used, and the importance of
each variable).
It was also found that some users rated the
clarity and precision of the system’s responses as
low (strongly disagree or neutral). This classification
occurred especially in situations where they were
looking for unavailable external information, such as
details about tenders that a specific company had
already participated in. The lack of availability of
this data in the databases consulted by RAG led to
the provision of more generic responses, affecting
users' perception of satisfaction with the system.
These same participants highlighted that it would be
indifferent for them to implement such functionality
in government systems.
Analysis of the responses demonstrates that the
RAG and the LLM have distinct but complementary
roles in disseminating information and civic
education. RAG, with its ability to access and
integrate information from external databases,
proved to be particularly efficient on two fronts: the
first was to provide details regarding simple
information presented to the user during the iteration
Figure 6: Graph depicting participant ratings of the chat assistant on five aspects: response accuracy, ease of use,
helpfulness, understanding of public works, and response clarity, using a 5-point Likert scale from 'Strongly Disagree' to
'Strongly Agree'.
Accuracy of Text Assistant Responses
Ease of Using Text Assistant
Helpfulness of Text Assistant Responses
Understanding Public Works via Assistant
Clarity in Text Assistant Responses
Strongly Disagree Disagree Neutral Agree Strong Agree
CSEDU 2024 - 16th International Conference on Computer Supported Education
36
with the public work, such as the value of work and
the type of bidding. The second regarded resolving
more detailed questions with complex subjects, such
as risks estimation. Despite these benefits, the
effectiveness of the RAG is limited by the
availability and scope of information in the
databases to which it has access. The lack of specific
information, such as details about companies
involved in public works projects, resulted in less
satisfactory responses for some users.
Finally, it was observed that, using LLM in
conjunction with the developed prompt, it was
possible to transform technical and complex
information into understandable explanations. By
presenting specialized concepts in a more accessible
way, it was possible to contribute to easing the
comprehension of public works. Such a contribution
can improve civic education and promote greater
understanding among the population regarding the
processes inherent to the management and
supervision of public works projects.
5 CONCLUSIONS
Considering the inherent complexities surrounding
purchases and contracts made by government
agencies, civic education becomes essential to
enable an understanding of how government
resources are being spent. Based on this, the
investigation reported in this article focused on
developing and evaluating the use of LLM and RAG
to simplify the presentation of concepts related to
public works and their contracting process and risk
estimation carried out through AI’s machine
learning (ML). The goal was to make these
advanced ideas easily understandable to citizens and
encourage their informed involvement.
This study initially focused on developing a
predictive model for estimating the risk of public
work failures, using ML methods. Based on ROC-
AUC values, the Random Forest algorithm coupled
with the Oversampling technique for data balancing,
was the most efficient approach among the 4
evaluated, evidenced by the model’s ability to
differentiate between high and low risk contracts.
In addition to the risk prediction model, this
study also evaluated the applicability of LLM and
RAG to civic education in the context of public
works projects. To carry out this assessment,
questionnaires were applied to people of different
ages and educational levels. Through quantitative
and qualitative analysis of the data, it was possible
to verify that: (i) there was a strong approval for the
use of chat assistant resources, where most
participants reported that this resource could be
available in government tools; (ii) participants with
less knowledge of AI tended to focus on general
questions about public works, while those with
moderate or advanced knowledge more effectively
used external data to understand risk estimation
processes; and (iii) in specific cases, the lack of
external data affects user satisfaction due to the
generation of generic responses.
Based on these findings, it is possible to
conclude that it is feasible to develop risk predictors
to predict whether public work will be halted due to
court decisions. To this end, the Random Forest
algorithm proved to be the most efficient among the
algorithms analyzed. Furthermore, LLM models
may also help understand complex information
related to the risk prediction model and public
works. Finally, integrating external information
through RAG enabled the user to have accurate
answers on different topics. However, it is necessary
to make it clear to target users to avoid searches
involving non-existent data, causing inaccurate
answers.
Thus, this study emphasizes the significance of
AI in educational methods and fostering greater
civic engagement and understanding of the
complexities associated with the supervision and
execution of public works.
ACKNOWLEDGEMENTS
This work was carried out with support from the
National Council for Scientific and Technological
Development (CNPq), Brazil. We extend our
gratitude to Jair Guedes for prior discussions and
support on LLMs, and to the anonymous reviewers
for their constructive criticism.
REFERENCES
Auret, L., & Aldrich, C. (2012). Interpretation of nonlinear
relationships between process variables by use of
random forests. Minerals Engineering, 35, (pp. 27-42).
doi:https://doi.org/10.1016/j.mineng.2012.05.008
Barros, L. B., Marcy, M., & Carvalho, M. T. (2018).
Construction cost estimation of Brazilian highways
using artificial neural networks. International Journal
of Structural and Civil Engineering Research, pp. 283-
289.
Bayram, S., & A. S. (2016). Efficacy of estimation
methods in forecasting building projects’ costs.
Large Language Models in Civic Education on the Supervision and Risk Assessment of Public Works
37
Journal of construction engineering and management
v. 142, p. 142.
Bentéjac, C., Csörgő, A., & Martínez-Muñoz, G. (2021).
A comparative analysis of gradient boosting
algorithms. Artificial Intelligence Review, 54, (pp.
1937-1967). doi:https://doi.org/10.1007/s10462-020-
09896-5
Christodoulou, E., Ma, J., Collins, G. S., Steyerberg, E.
W., Verbakel, J. Y., & Van Calster, B. (2019). A
systematic review shows no performance benefit of
machine learning over logistic regression for clinical
prediction models. Journal of clinical epidemiology,
110, pp. 12-22.
Domingos, S. L., Carvalho, R. N., Carvalho, R. S., &
Ramos, G. N. (2016). dentifying it purchases
anomalies in the Brazilian Government Procurement
System using deep learning. 15th IEEE International
Conference on Machine Learning and Applications
(ICMLA) doi: 10.1109/ICMLA.2016.0129, (pp. 722-
727).
Gallego, J., Rivero, G., & Martínez, J. (2021). Preventing
rather than punishing: An early warning model of
malfeasance in public procurement. International
Journal of Forecasting 37.1, pp. 360-377.
García-Peñalvo, F., & Vázquez-Ingelmo, A. (2023). What
do we mean by GenAI? A systematic mapping of the
evolution, trends, and techniques involved in
Generative AI.
Gonzalez, B. (2023). Smart Surveys: An Automatic
Survey Generation and Analysis Tool. In Proceedings
of the 15th International Conference on Computer
Supported Education - Volume 2: CSEDU, (pp. 113-
119). doi:10.5220/0011985400003470
Guangxiang, L., & Chaojun, M. (2023). Measuring EFL
learners’ use of ChatGPT in informal digital learning
of English based on the technology acceptance model.
Innovation in Language Learning and Teaching, (pp.
1-14). doi:10.1080/17501229.2023.2240316
Hond, D., Asgari, H., & Jeffery, D. (2020). Verifying
Artificial Neural Network Classifier Performance
Using Dataset Dissimilarity Measures. 19th IEEE
International Conference on Machine Learning and
Applications (ICMLA) (pp. 115-121). IEEE.
doi:10.1109/ICMLA51294.2020.00027
Huber, M., & Imhof, D. (2019). Machine learning with
screens for detecting bid-rigging cartels. International
Journal of Industrial Organization 65, pp. 277-301.
Ivanov, D., & Nesterov, A. (2019). Identifying bid leakage
in procurement auctions: Machine learning approach.
Proceedings of the 2019 ACM Conference on
Economics and Computation., (pp. 69-70).
doi:https://doi.org/10.1145/3328526.3329642
Jaromir, S., Arav, A., Christopher, B., & Majd, S. (2023).
Large Language Models (GPT) Struggle to Answer
Multiple-Choice Questions About Code. International
Conference on Computer Supported Education.
Kasneci, E., Seßler, K., Küchemann, S., Bannert, M.,
Dementieva, D., Fischer, F., & ... & Kasneci, G.
(2023). ChatGPT for good? On opportunities and
challenges of large language models for education.
Learning and individual differences, (p. 103).
Kung, T. H., Cheatham, M., A., M., Sillos, C., De Leon,
L., Elepaño, C., & Tseng, V. (2023). Performance of
ChatGPT on USMLE: Potential for AI-assisted
medical education using large language models. PLoS
digital health, 2(2).
Leippold, M. (2023). Thus spoke GPT-3: Interviewing a
large-language model on climate finance. Finance
Research Letters. doi:https://doi.org/10.1016/j.
frl.2022.103617
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V.,
Goyal, N., & ... & Kiela, D. (2020). Retrieval-
augmented generation for knowledge-intensive NLP
tasks. Advances in Neural Information Processing
Systems, 33, (pp. 9459-9474).
Pérez, J. Q., Daradoumis, T., & Puig, J. M. (2020).
Rediscovering the use of chatbots in education: A
systematic literature review. Computer Applications in
Engineering Education 28.6, (pp. 1549-1565).
Savelka, J., Agarwal, A., Bogart, C., Song, Y., & Sakr, M.
(2023). Can generative pre-trained transformers (gpt)
pass assessments in higher education programming
courses? In Proceedings of the 28th Annual ACM
Conference on Innovation and Technology in
Computer Science Education, (pp. 117–123).
doi:https://doi.org/10.1145/3587102.3588792
Schröer, C., Kruse, F., & Gómez, J. M. (2021). A
systematic literature review on applying CRISP-DM
process model. Procedia Computer Science, (pp. 526-
534). doi:https://doi.org/10.1016/j.procs.2021.01.199
Sun, T., & Sales, L. J. (2018). Predicting public
procurement irregularity: An application of neural
networks. Journal of Emerging Technologies in
Accounting 15.1, pp. 141-154.
Titirla, M., & Aretoulis, G. (2019). Neural network
models for actual duration of Greek highway projects.
Journal of Engineering, Design and Technology 17.6,
pp. 1323-1339.
Twizeyimana, J. D., & Andersson, A. (2019). The public
value of E-Government–A literature review.
Government information quarterly, pp. 167-178.
Wei, D., Jionghao, L., Hua, J., Tongguang, L., Yi-Shan,
T., Dragan, G., & Guanliang, C. (2023). Can Large
Language Models Provide Feedback to Students? A
Case Study on ChatGPT. IEEE International
Conference on Advanced Learning Technologies
(ICALT) doi: 10.1109/ICALT58122.2023.00100.
Yang, Q., Suh, J., Chen, N. C., & Ramos, G. (2018).
Grounding interactive machine learning tool design in
how non-experts actually build models., (pp. 573-
584).
Yuyan, C., Qiang, F., Yichen, Y., Zhihao, W., Ge, F.,
Dayiheng, L., . . . Yanghua, X. (2023). Hallucination
Detection: Robustly Discerning Reliable Answers in
Large Language Models. In Proceedings of the 32nd
ACM International Conference on Information and
Knowledge Management (CIKM '23), (pp. 245–255.).
doi:https://doi.org/10.1145/3583780.3614905.
CSEDU 2024 - 16th International Conference on Computer Supported Education
38