Handling Inconsistent Government Data: From Acquisition to Entity
Name Matching and Address Standardization
Davyson S. Ribeiro (1,5), Paulo V. A. Fabrício (1,5), Rafael R. Pereira (1,5), Tales P. Nogueira (2,5),
Pedro A. M. Oliveira (3,5), Victória T. Oliveira (1,5), Ismayle S. Santos (4,5) and Rossana M. C. Andrade (1,5)
(1) Federal University of Ceará, Fortaleza, Brazil
(2) University of the International Integration of the Afro-Brazilian Lusophony, Redenção, Brazil
(3) Federal Institute of Education, Science, and Technology of Maranhão, Pedreiras, Brazil
(4) State University of Ceará, Fortaleza, Brazil
(5) Group of Computer Networks, Software Engineering and Systems, Fortaleza, Brazil
Keywords:
Data Science, Big Data, Business Intelligence, Data Analysis, Decision Support Systems, Decision-Making.
Abstract:
The integration of Data Science and Big Data is essential for managing large-scale data, but challenges such
as heterogeneity, inconsistency, and data enrichment complicate this process. This paper presents a flexible
architecture designed to support municipal decision-making by integrating data from multiple sources. To
address inconsistencies, an entity matching algorithm was implemented, along with an address standardization
library, optimizing data processing without compromising quality. The study also evaluates data acquisition
methods (APIs, Web Crawlers, HTTPS requests), highlighting their trade-offs. Finally, we demonstrate the
system’s practical impact through a case study on health data monitoring, showcasing its role in enhancing
data-driven governance.
1 INTRODUCTION
The ethical use of data, information, and knowledge
derived from digital traces left by computational de-
vices has been a growing topic in Information Science
(Rautenberg and do Carmo, 2019). The intelligent
use of data enables the development of innovative ap-
plications and services, such as data integration plat-
forms that provide visualization and predictive analy-
sis tools for decision-making and strategic planning,
contributing to urban development and public policy
improvements. However, handling large volumes of
data presents challenges that traditional database sys-
tems and batch processing struggle to address due to
performance and scalability limitations. In this con-
text, cloud computing offers elasticity and scalability,
making it a key solution for managing and analyzing
massive datasets (Sandhu, 2021).
Significant investments from the private sector
have driven the development of technologies capa-
ble of handling large-scale data. Platforms like Ama-
zon Web Services (AWS) provide specialized tools
for Data Science, including Amazon S3 for storage,
AWS Glue for serverless data integration, and AWS
Athena for interactive analysis (Vines and Tanasescu,
2023).
This study reports the experiences gained dur-
ing the development of a data-driven decision-making
platform designed to support municipal government
managers in Brazil. The project extensively leveraged
cloud computing and Data Science techniques to ad-
dress key challenges, such as data heterogeneity and
inconsistency. The datasets were independently pro-
duced by different municipal departments, including
Education, Health, and Urban Planning, along with
federally managed centralized databases.
Despite the potential of Big Data in government
decision-making, several challenges hinder its practi-
cal implementation. The heterogeneity and inconsis-
tency of data from diverse sources, lack of standard-
ization in collection methods, and the need for scal-
able, automated data pipelines increase operational
costs and compromise analysis quality. This work
proposes a robust framework integrating automation,
preprocessing, validation techniques, and standard-
ization practices to improve the quality and usability
of both textual and geographic data.
The remainder of this paper is structured as fol-
lows: Section 2 reviews related works, discussing
methodologies relevant to data processing challenges.
Section 3 presents the foundational concepts of Data
Science and the key stages of the project’s imple-
mentation. Section 4 details the cloud computing ar-
chitecture, emphasizing scalability, security, and op-
erational advantages. Section 5 describes the func-
tional and non-functional requirements necessary for
efficient Big Data management in government sys-
tems. Section 6 explores the main challenges encoun-
tered and the solutions implemented. Section 7 eval-
uates the system’s applicability, discusses technical
and operational constraints, and presents case stud-
ies demonstrating its impact. Finally, Section 8 con-
cludes with a summary of findings and future research
directions.
2 RELATED WORK
This section presents works that incorporate various
Data Science and Big Data methods, focusing on their
application to decision-making and the development
of new applications. Sarker (2021) discusses the rel-
evance of advanced data analysis methods across dif-
ferent sectors, emphasizing their impact on decision-
making, operational optimization, and trend forecast-
ing. While it highlights the importance of customized
applications in healthcare, smart cities, and cyber-
security, it does not specifically address government
data standardization and harmonization.
Freitas et al. (2023) propose a data warehousing
environment for crime data analysis, supporting pub-
lic security managers in strategic decision-making.
The study identifies key challenges, such as data het-
erogeneity, lack of standardization, and the need for
advanced extraction, transformation, and visualiza-
tion techniques.
Fugini and Finocchi (2020) focus on documental
Big Data processing, introducing an Enterprise Con-
tent Management (ECM) system enhanced with ma-
chine learning for classification and information ex-
traction. Their work defines quality metrics—Textual
Quality Confidence, Classification Confidence, and
Extraction Confidence—to assess system accuracy
and efficiency. These indicators contribute to data in-
tegrity and consistency but do not fully address het-
erogeneous governmental data integration.
Behringer et al. (2023) present SDRank, a deep
learning-based approach for ranking data sources by
similarity, optimizing semantic pattern recognition
and automated data selection. While this technique
improves efficiency and scalability in large-scale data
processing, it does not tackle structural inconsisten-
cies in government datasets.
Furtado et al. (2023) investigate digital transfor-
mation in smart governance, analyzing how Big Data
tools can support policies aimed at vulnerable pop-
ulations. However, their study does not explore the
technical challenges of integrating and standardizing
multiple governmental data sources.
These studies provide valuable insights into Big
Data applications, yet they lack a detailed ar-
chitectural perspective on handling heterogeneous
and inconsistent government data. This work ad-
dresses these gaps by proposing a scalable integration
pipeline, combining automation, entity name match-
ing, and address standardization to enhance data qual-
ity and usability in public sector applications.
3 DATA SCIENCE, BIG DATA AND
PROJECT STAGES
Big Data refers to large, heterogeneous datasets that
exceed the processing capabilities of conventional
methods due to their dynamic and complex nature.
These datasets exhibit characteristics such as vol-
ume, velocity, variety, veracity, variability, and value,
requiring specialized techniques for their manage-
ment. In this study, Big Data encompasses diverse
governmental datasets, used to build analytical tools
that support municipal decision-making. Given these
characteristics, a key challenge is ensuring data stor-
age, cataloging, and availability for decision-making
processes.
Data Science, an interdisciplinary field integrat-
ing statistics, mathematics, and computer science, en-
ables the extraction of valuable insights to support
data-driven decisions (Wu et al., 2021). Identify-
ing patterns, trends, and hidden relationships within
data is complex but essential for predictions, process
optimization, and strategic decision-making (Sarker,
2021). Transforming raw data into actionable knowl-
edge involves several critical stages, as illustrated in
Figure 1.

Figure 1: Steps used to build the product. (Flowchart: a data request is received; raw data is collected; tasks related to the request are created, implemented, and reviewed; results are presented to the POs; data is semi-structured and then structured through the Bronze, Silver, and Gold stages; data products are registered in the system and validated; accepted products become the final product.)

The data acquisition stage involves collecting raw data and metadata from multiple sources
based on municipal department requirements. The
data ingestion phase transforms and loads this raw
data into a centralized repository, structuring it for
easy access and analysis. Next, the data exploration
stage enables the preliminary study of datasets, ren-
dering them semi-structured to facilitate the definition
of workflows and hypotheses. During information ex-
traction, advanced analytical techniques identify rel-
evant patterns and insights, refining the data into ac-
tionable knowledge. The information display stage
ensures clear communication of insights through de-
tailed reports, interactive dashboards, and visualiza-
tions. Finally, in the decision-making phase, analyti-
cal models combine with domain expertise to gener-
ate strategic recommendations, supported by data vi-
sualization and reporting.
4 CLOUD COMPUTING AND
PROJECT ARCHITECTURE
Cloud computing has transformed data storage and
processing, allowing organizations to reduce costs
and enhance scalability by using third-party infras-
tructure (Silva et al., 2023). The cloud services mar-
ket is led by AWS, Microsoft Azure, Google Cloud
Platform (GCP), and IBM Cloud, each offering solu-
tions across Infrastructure as a Service (IaaS), Plat-
form as a Service (PaaS), and Software as a Service
(SaaS) (Gupta et al., 2021). AWS stands out for its
wide service range and global presence, while Azure
integrates seamlessly with Microsoft tools, GCP fo-
cuses on AI and machine learning, and IBM Cloud
specializes in hybrid cloud solutions.
This competition fosters constant innovation and
price reductions, benefiting users by enhancing ser-
vice quality, security, and compliance.
4.1 Using the AWS Platform
AWS was chosen due to its scalability, flexibility, se-
curity, and cost-effectiveness, providing an intuitive
environment for activity monitoring, data structur-
ing, and automated rule enforcement. Key AWS ser-
vices used include Amazon S3 (storage), AWS Glue
(serverless ETL), AWS Athena (querying), and AWS
Lambda (automation).
The architecture was designed to ensure compli-
ance with GDPR and LGPD, adhering to strict data
protection regulations. To achieve this, principles
such as data minimization and storage limitation were
implemented, processing only essential information
for the shortest necessary period. Anonymization and
pseudonymization techniques were applied to safe-
guard personal data while maintaining analytical ac-
curacy.
Access control is managed through AWS Identity
and Access Management (IAM), enforcing granular
permissions. Password rotation, two-factor authenti-
cation, and continuous monitoring via AWS Cloud-
Trail and Amazon CloudWatch ensure system secu-
rity. AWS follows a shared responsibility model,
managing physical infrastructure security while users
configure data protection policies. AWS compliance
with ISO/IEC 27001, SOC 1, 2, 3, and PCI DSS fa-
cilitates adherence to international regulations.
These security measures ensure the integrity, con-
fidentiality, and reliability of the system, providing a
secure environment for data analysis while maintain-
ing ethical and transparent data handling.
4.2 Data Acquisition and Storage
Data acquisition follows three primary methods:
APIs, HTTPS requests, and manual ingestion. AWS
Lambda functions retrieve data based on provider ac-
cess methods, storing raw data in AWS S3 buckets.
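As a minimal illustration of this acquisition pattern, the following sketch shows a Lambda handler that fetches a payload over HTTPS and writes it unchanged to the Bronze bucket. The bucket name, source URL, and key layout are assumptions made for the example and do not reflect the project's actual configuration.

```python
import json
from datetime import datetime, timezone

import boto3
import urllib3

# Illustrative names; the real bucket and endpoint are not given in the paper.
BRONZE_BUCKET = "municipality-datalake-bronze"
SOURCE_URL = "https://example.gov.br/api/health/records"

s3 = boto3.client("s3")
http = urllib3.PoolManager()


def handler(event, context):
    # Retrieve the raw payload from the provider.
    response = http.request("GET", SOURCE_URL)
    if response.status != 200:
        raise RuntimeError(f"Source returned HTTP {response.status}")

    # Store it untouched in the Bronze layer, partitioned by ingestion date.
    key = f"health/raw/{datetime.now(timezone.utc):%Y/%m/%d}/records.json"
    s3.put_object(Bucket=BRONZE_BUCKET, Key=key, Body=response.data)

    return {"stored_key": key, "bytes": len(response.data)}
```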
The system follows the Medallion Architecture,
a structured data lake model that organizes data into
Bronze, Silver, and Gold layers (Kumar et al., 2023).
The Bronze layer stores raw, unprocessed data, main-
taining full historical records for auditing. The Silver
layer applies data cleaning, filtering, and enrichment,
creating a structured dataset. Finally, the Gold layer
contains highly refined and aggregated data, ready for
business intelligence, machine learning, and analyt-
ics.
AWS Glue facilitates data transformation us-
ing ProcessingJob (to convert unstructured data into
semi-structured data) and BusinessJob (to create
structured datasets for BI applications).
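The following reduced sketch suggests what a ProcessingJob of this kind might look like as a PySpark-based Glue script; the S3 paths, column names, and cleaning steps are illustrative assumptions rather than the project's actual code.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)
spark = glue_context.spark_session

# Illustrative paths; the real Bronze and Silver locations are project-specific.
bronze_path = "s3://municipality-datalake-bronze/health/raw/"
silver_path = "s3://municipality-datalake-silver/health/records/"

# Read raw JSON from the Bronze layer.
raw = spark.read.json(bronze_path)

# Minimal cleaning: normalize column names, trim and uppercase an assumed "nome"
# (name) column, and drop exact duplicates.
cleaned = (
    raw.toDF(*[c.strip().lower().replace(" ", "_") for c in raw.columns])
       .withColumn("nome", F.upper(F.trim(F.col("nome"))))
       .dropDuplicates()
)

# Write the semi-structured result to the Silver layer as Parquet.
cleaned.write.mode("overwrite").parquet(silver_path)

job.commit()
```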
4.3 Workflow Automation
Project workflows are automated using AWS Glue
components, including Crawlers and Triggers.
Crawlers scan and classify data, updating the AWS
Glue Data Catalog, which maintains metadata for
querying and ETL jobs. Triggers initiate ETL
processes based on time-based, event-based, or
dependency-based conditions, automating the entire
pipeline workflow.
A well-defined automation strategy ensures effi-
ciency in managing heterogeneous data formats and
resolving inconsistencies, eliminating the need for
manual intervention. Figure 2 illustrates the workflow
automation process.
Figure 2: Process of Automating a Workflow. (Flowchart: pipeline initialization, raw data catalog, raw data processing, catalog of processed data, adequacy of processed data for a business application, and business-level data catalog; activating one flow step activates the subsequent steps.)
The automation process consists of five stages: I.
Raw Data Verification: A trigger checks for new data
in the Bronze bucket and catalogs it using a Bronze
Crawler. II. Processing Job Initiation: Upon detecting
new Bronze data, a ProcessingJob transforms it into a
semi-structured format. III. Silver Data Cataloging: A
trigger detects new processed data in the Silver bucket
and updates the Silver Crawler. IV. Business Rule
Execution: A trigger initiates a BusinessJob that ap-
plies domain-specific transformations to prepare data
for BI use. V. Gold Data Finalization: A final trigger
detects new structured data in the Gold bucket, cata-
loging it for analysis.
By structuring the workflow this way, the sys-
tem ensures consistent, reliable, and automated data
processing, supporting evidence-based public policies
with high-quality, standardized data.
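As an illustration of how such triggers can be wired together, the sketch below creates a time-based trigger for the Bronze crawler and a conditional trigger that starts the ProcessingJob once that crawler succeeds. The trigger names, schedule, and job names are assumptions for the example, not the project's actual configuration.

```python
import boto3

glue = boto3.client("glue")

# Time-based trigger: periodically run the Bronze crawler to catalog new raw data.
glue.create_trigger(
    Name="bronze-crawler-schedule",
    Type="SCHEDULED",
    Schedule="cron(0 */6 * * ? *)",  # every 6 hours
    Actions=[{"CrawlerName": "bronze-crawler"}],
    StartOnCreation=True,
)

# Conditional trigger: start the ProcessingJob once the Bronze crawler succeeds.
glue.create_trigger(
    Name="start-processing-job",
    Type="CONDITIONAL",
    Predicate={
        "Logical": "AND",
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "CrawlerName": "bronze-crawler",
                "CrawlState": "SUCCEEDED",
            }
        ],
    },
    Actions=[{"JobName": "ProcessingJob"}],
    StartOnCreation=True,
)
```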
5 REQUIREMENTS FOR
EFFICIENT BIG DATA
MANAGEMENT IN
GOVERNMENT SYSTEMS
The use of Big Data in public administration is funda-
mental for developing evidence-based policies. This
study proposes an integrated and standardized ap-
proach to handle large volumes of heterogeneous
data, ensuring its interoperability and usability for
advanced analysis and decision-making. The sys-
tem architecture follows a structured workflow cov-
ering data acquisition, standardization, entity match-
ing, processing, and visualization. To support its im-
plementation, a set of functional and non-functional
requirements was identified, ensuring efficiency, se-
curity, and scalability in public sector applications.
5.1 Functional Requirements and
Non-Functional Requirements
The system is structured into three key layers: data ac-
quisition and integration, processing and analysis, and
access and visualization. The data acquisition and in-
tegration layer must collect information from govern-
ment databases, public APIs, structured/unstructured
files, and web scraping. To ensure interoperability,
the system must support automatic data conversion
and standardization, enabling consistent storage. Ad-
ditionally, entity matching techniques must be imple-
mented to unify records referring to the same entity,
such as institutions, locations, or individuals, reduc-
ing duplication and improving data quality.
The processing and analysis layer must handle
large-scale data through scalable pipelines, following
the Medallion Architecture, which structures data into
three layers: Bronze (raw data), Silver (refined data),
and Gold (ready-for-analysis data). Machine learning
and artificial intelligence models must optimize ad-
dress standardization, error correction, and predictive
analytics, while an automated data quality assessment
mechanism should identify and correct inconsisten-
cies before analysis.
The access and visualization layer must present
data in an accessible and interactive manner, provid-
ing dynamic dashboards and geospatial analysis tools
to identify patterns in areas such as urban mobility,
public safety, and healthcare. The system interface
must be intuitive, responsive, and compliant with dig-
ital accessibility guidelines, ensuring broad usabil-
ity. Access control mechanisms should be enforced
to regulate data access, adhering to regulations such
as LGPD and GDPR.
For Non-Functional Requirements, the system
must guarantee performance, scalability, and secu-
rity, meeting robust technical requirements. To ensure
scalability, the architecture must be distributed and
modular, processing large data volumes with low la-
tency and high efficiency, while adapting to increased
demand. Optimized processing pipelines should sup-
port high-concurrency workloads, ensuring system
responsiveness.
Regarding usability and user experience, the inter-
face must facilitate intuitive navigation and provide
continuous training and technical support, enabling
users to fully utilize analytical tools.
Security and data governance are critical aspects.
The system must implement robust encryption to pro-
tect sensitive information and multifactor authentica-
tion to prevent unauthorized access. Additionally, au-
tomated backups must ensure data recovery in case of
failures.
For system management and maintenance, insti-
tutional responsibilities must be well-defined, ensur-
ing systematic data updates and continuous improve-
ments. Automated monitoring and auditing tools
should be implemented to detect anomalies, facilitat-
ing proactive adjustments. Comprehensive documen-
tation must be maintained to support future expan-
sions and integrations.
5.2 Strategic Impact of the Architecture
A Big Data architecture for public administration sig-
nificantly enhances efficiency, transparency, and pol-
icy formulation. By optimizing resource allocation,
reducing waste, and improving decision-making pro-
cesses, the system strengthens the government’s abil-
ity to address complex challenges.
Furthermore, the platform fosters greater trans-
parency, allowing citizens to access and monitor pub-
lic policies. Aligning the architecture with national
and international digital governance frameworks en-
sures compliance with best practices, driving effi-
ciency and innovation.
The integration of artificial intelligence and open
data initiatives positions this model as a cornerstone
for modernizing public administration. The clear def-
inition of functional and non-functional requirements,
combined with a scalable and robust architecture, en-
sures that the system can support advanced analytics,
efficiently manage data, and drive sustainable and im-
pactful public policies.
6 CHALLENGES AND
SOLUTIONS
This section presents the main challenges encoun-
tered related to data acquisition methods, as well as
the solutions employed to address issues of data het-
erogeneity and inconsistency.
6.1 Heterogeneity in Formats and
Access Methods
Data ingestion is a critical step in data-driven projects,
requiring the collection and integration of heteroge-
neous data sources into a centralized storage sys-
tem for analysis. In this project, governmental data
fragmentation posed challenges due to the lack of
standardized acquisition methods across departments,
each using distinct operational protocols and manage-
ment strategies. To address this, three primary data
collection methods were employed: APIs, HTTPS re-
quests, and Web Crawlers.
Web Crawlers were used to extract data from
public health systems such as SIM (Mortality Infor-
mation System) and SINASC (Live Birth Informa-
tion System), both accessible via TABNET (https://datasus.saude.gov.br/informacoes-de-saude-tabnet/). This
method ensured real-time data access, independent
maintenance, and fewer availability issues compared
to APIs. However, it had limitations, including inabil-
ity to access sensitive microdata, high maintenance
costs due to frequent web structure changes, AWS
Lambda storage constraints, and potential legal im-
plications related to terms-of-service compliance.
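The sketch below illustrates the general crawling pattern with Selenium in headless mode. The navigation steps required by TABNET are omitted, and the target URL, bucket, and object key are assumptions for the example.

```python
import boto3
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Illustrative target and bucket; the actual TABNET navigation is not reproduced here.
TARGET_URL = "https://datasus.saude.gov.br/informacoes-de-saude-tabnet/"
BRONZE_BUCKET = "municipality-datalake-bronze"

options = Options()
options.add_argument("--headless=new")  # run without a display, e.g. inside a container
driver = webdriver.Chrome(options=options)

try:
    driver.get(TARGET_URL)
    # In the real crawler, form fields would be filled and the result table parsed here.
    page_html = driver.page_source
finally:
    driver.quit()

# Persist the captured page in the Bronze layer for later processing.
boto3.client("s3").put_object(
    Bucket=BRONZE_BUCKET,
    Key="health/tabnet/raw_page.html",
    Body=page_html.encode("utf-8"),
)
```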
APIs provided structured and reliable data in
JSON format, with advantages such as well-
documented services, access control via tokens, and
secure integration. However, they also presented
schema standardization issues, reliance on external
providers without guaranteed SLAs, and challenges
with large datasets requiring pagination. In cases of
extensive data, AWS Glue Jobs were required, in-
creasing operational costs.
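A simplified sketch of how paginated API responses can be collected before storage is shown below; the endpoint, token handling, and pagination parameters are assumptions, since they vary between providers.

```python
import requests

# Hypothetical endpoint and token; pagination parameters differ between providers.
API_URL = "https://example.gov.br/api/education/enrollments"
TOKEN = "..."  # access token supplied by the data provider


def fetch_all(page_size: int = 500) -> list[dict]:
    """Collect every page of a paginated JSON API into a single list."""
    records, page = [], 1
    headers = {"Authorization": f"Bearer {TOKEN}"}
    while True:
        resp = requests.get(
            API_URL,
            headers=headers,
            params={"page": page, "per_page": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break  # no more pages
        records.extend(batch)
        page += 1
    return records
```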
HTTPS requests were used for direct data down-
loads (e.g., Education data in CSV format). This ap-
proach was simple and flexible, allowing customized
extractions without third-party dependence. How-
ever, it required manual schema adjustments, addi-
tional processing for unstructured data, and lacked
optimization for large datasets, impacting storage
costs.
Static datasets, such as neighborhood HDI data,
were updated infrequently and manually uploaded to
the Bronze storage layer, initiating the data pipeline.
The data acquisition process using these three meth-
ods is illustrated in Figure 3.
6.2 Inconsistency Between Data from
Entities in Different Databases
Data inconsistency is a frequent challenge in multi-
source data integration, often arising from spelling
variations, inconsistent date formats, and missing val-
ues, complicating direct comparisons. To ensure data
accuracy and usability, effective comparison and con-
solidation methods are required.
Figure 3: Data acquisition flowchart. (Flowchart: a Lambda function starts and, depending on where the data originates, makes an HTTP request, calls an API endpoint with specific data, filters a generic API's data, or accesses a source without API access via Selenium; the collected data is compared with the old data, the schema is updated, and new data is stored in S3, otherwise the old data is returned.)
To address spelling variations, an algorithm was
developed using RapidFuzz (https://github.com/rapidfuzz/RapidFuzz) and InDel distance calculations, optimizing the comparison of key fields such as names and birth dates. The QRatio method was selected due to its balance between computational efficiency and accuracy, with a 90% similarity threshold for accepting a match.
To handle large-scale data, a sparse matrix ap-
proach was used, storing only relevant similarity
scores to optimize memory and computational re-
sources. Additionally, an internal comparison process
was implemented to detect duplicate or near-duplicate
records, ensuring that the most complete and recent
version of a record was preserved. The similarity
function implementation is illustrated in Figure 4,
demonstrating key attributes considered during record
comparison.
Figure 4: Similarity Flowchart. (Old data and new data from the Bronze layer are compared on individual attributes such as name, mother's name, father's name, and date of birth, using a 90% acceptance threshold; a sparse matrix is generated and duplicates are removed.)
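To make the matching step concrete, the sketch below combines RapidFuzz's QRatio scorer with the 90% acceptance threshold and keeps only accepted scores in a sparse matrix. The sample records are illustrative and the snippet does not reproduce the project's exact implementation.

```python
import numpy as np
from rapidfuzz import fuzz, process
from scipy.sparse import csr_matrix

# Illustrative records; in the pipeline these come from the old and new datasets.
old_names = ["MARIA DA SILVA", "JOSE PEREIRA LIMA", "ANA BEATRIZ SOUZA"]
new_names = ["Maria da Silva", "Joze Pereira Lima", "Carlos Eduardo Rocha"]

# Pairwise QRatio scores; values below the 90% threshold are returned as 0.
scores = process.cdist(
    new_names,
    old_names,
    scorer=fuzz.QRatio,
    processor=lambda s: s.upper(),
    score_cutoff=90,
    workers=-1,
)

# Keep only accepted matches in a sparse structure to save memory at scale.
sparse_scores = csr_matrix(scores)

for new_idx, old_idx in zip(*sparse_scores.nonzero()):
    print(new_names[new_idx], "->", old_names[old_idx],
          f"(score={sparse_scores[new_idx, old_idx]:.1f})")
```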
This approach provided an efficient and scalable
solution for integrating multiple data sources, sig-
nificantly improving data consistency and reliabil-
ity. By combining similarity calculation techniques
with optimized data structures, this method enhanced
data consolidation and quality, ensuring a structured
dataset for subsequent analyses.
6.3 Address Data
Geographic data in municipal systems often includes
address and neighborhood fields, but manually en-
tered information is prone to inaccuracies such as ty-
pographical errors and outdated names due to urban
changes like street renaming, neighborhood mergers,
and administrative boundary shifts. These inconsis-
tencies challenge data consistency.
To ensure accurate data representation, the plat-
form developed choropleth maps, requiring neighbor-
hood names to match exactly with those in a Geo-
JSON file provided by the City’s Planning Institute.
This file adhered to the latest municipal decree, which
restructured several neighborhoods. Consequently,
non-standard names had to be corrected for proper
data aggregation and visualization.
To address this issue, a dedicated Python package
was created, using the GeoJSON file as a reference to
standardize geographic data throughout the pipeline.
AWS Glue dynamically installed this library, enabling
automated neighborhood name corrections and en-
suring consistent data in the Silver and Gold layers.
Among the main functions provided by the library,
the following stand out:
get_nome_canonico (get canonical name). Returns the official neighborhood name according to the latest municipal decree. The algorithm first "simplifies" the neighborhood name (removing accent marks and converting it to uppercase) and searches for a match among the neighborhoods in the reference GeoJSON file. If an exact match is not found, it performs a fuzzy search and returns the official neighborhood name that is most similar. For example, a neighborhood named "Centro" could appear as "Centr", "Cntro", or "Cêtro". Taking the last example, the function would initially convert "Cêtro" to "CETRO" and subsequently employ a fuzzy search, which would probably identify "Centro" as the canonical name.
get_bairro (get neighborhood). Returns the neighborhood name given a latitude and longitude pair. This function uses the Shapely library (https://github.com/shapely/shapely) to create a Point geometry and checks, for each neighborhood, whether that point is contained within its polygon.
get_localizacao (get location). Returns the location (latitude and longitude pair) given a free-text address. This method uses the GeoPy library (https://github.com/geopy/geopy), specifically the Nominatim (https://nominatim.openstreetmap.org/, OpenStreetMap data) and GoogleV3 (Google Maps data) geocoders, to perform geocoding. The user can choose the desired geocoder. During the project, Google Maps data proved to be more accurate for the region of interest. However, this API requires authorization via a token and may incur costs if numerous queries are performed. This technique proved to be more cost-effective than AWS alternatives, which rely on HERE Technologies and Esri data (https://aws.amazon.com/location/).
get_endereco (get address). Returns address data given a latitude and longitude pair. This method is useful in cases where the exact location is available but the textual address data is missing. A minimal sketch of some of these functions follows this list.
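A condensed sketch of how two of these functions could be implemented on top of the reference GeoJSON file is given below; the file path, the property name used for neighborhood names, and the use of QRatio for the fuzzy search are assumptions made for illustration.

```python
import json
import unicodedata

from rapidfuzz import fuzz, process
from shapely.geometry import Point, shape

# Reference file from the City's Planning Institute; path and property name are assumed.
with open("bairros_oficiais.geojson", encoding="utf-8") as f:
    geojson = json.load(f)

FEATURES = geojson["features"]
OFFICIAL_NAMES = [feat["properties"]["nome"] for feat in FEATURES]


def _simplify(name: str) -> str:
    """Remove accent marks and convert to uppercase, as described in the paper."""
    stripped = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    return stripped.upper().strip()


def get_nome_canonico(name: str) -> str:
    """Return the official neighborhood name, falling back to a fuzzy search."""
    simplified = _simplify(name)
    for official in OFFICIAL_NAMES:
        if _simplify(official) == simplified:
            return official
    best, _score, _idx = process.extractOne(
        simplified, OFFICIAL_NAMES, scorer=fuzz.QRatio, processor=_simplify
    )
    return best


def get_bairro(latitude, longitude):
    """Return the neighborhood whose polygon contains the given point."""
    point = Point(longitude, latitude)  # GeoJSON coordinate order is (lon, lat)
    for feat in FEATURES:
        if shape(feat["geometry"]).contains(point):
            return feat["properties"]["nome"]
    return None


print(get_nome_canonico("Cêtro"))  # likely resolves to "Centro"
```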
Isolating utility functions like those described
above into a dedicated library proved advantageous
for the project, as it centralized neighborhood name
handling and other geolocation functions within a
single repository, thereby avoiding code duplication
and potential maintenance costs associated with up-
dating code across multiple ETL jobs. These func-
tions retained their conceptual integrity throughout
the project, with no changes to their method signa-
tures, demonstrating good cohesion and serving as an
acceptable point of coupling between the library and
the ETL jobs that utilized it.
7 APPLICABILITY, TECHNICAL
AND OPERATIONAL
LIMITATIONS
This section presents one of the significant results obtained from the implementation of the system. Despite the contributions presented, it is essential
to recognize the technical and operational limitations
encountered during its development and implementa-
tion.
7.1 Vaccine Monitoring and Updating
The system’s applicability was demonstrated
through a case study on childhood vaccination moni-
toring in public daycare centers. By integrating health
and education data, the platform provided real-time
insights into vaccination gaps among children aged 0
to 3 years, enabling targeted interventions.
Cross-referencing data identified 11,854 overdue
vaccinations among 4,657 children, highlighting the
need for urgent action. Targeted campaigns based on
this analysis resulted in over 2,000 children updating
their vaccination records within a month, demonstrat-
ing the platform’s efficacy in guiding public health
strategies.
Continuous monitoring enabled public managers
to track progress and measure intervention effective-
ness. Subsequent data collection showed a significant
reduction in overdue vaccinations, with 637 children
achieving full immunization compliance. However,
new enrollments revealed recurring vaccination gaps,
reinforcing the need for ongoing and adaptive public
health initiatives.
Beyond improving immunization coverage, the
platform optimized resource allocation, ensuring that
funding and efforts were focused on high-risk ar-
eas. Additionally, it enhanced collaboration between
health and education departments, fostering a holistic
approach to policy planning.
The case study highlights the transformative
power of data-driven decision-making in public ad-
ministration. By integrating and analyzing multi-
sectoral data, municipal governments can improve
service quality, optimize resources, and implement
evidence-based policies that effectively address crit-
ical public health challenges.
7.2 Operational Challenges
The system was designed to process large-scale het-
erogeneous data, but scalability constraints may arise
as data sources and records grow. Expanding cloud
resources can increase execution time and operational
costs, while AWS Lambda execution limits restrict
real-time processing. Despite AWS’s advanced tools
and scalability, reliance on proprietary cloud services
introduces challenges. Costs for S3, Glue, and Athena
may escalate in data-intensive scenarios, and pipeline
interoperability is limited in organizations using alter-
native cloud or on-premises solutions. Cost manage-
ment is crucial, particularly for government institu-
tions with budget constraints, as on-demand services
may lead to unpredictable expenses without proper
strategies. Additionally, maintaining the pipeline in
a rapidly evolving technology landscape requires fre-
quent updates and specialized expertise, making long-
term scalability and sustainability a challenge.
8 FINAL REMARKS
This work presented an architecture to address het-
erogeneity and inconsistency in large datasets us-
ing cloud computing. The proposed framework im-
proves data analysis quality for decision-making and
is adaptable to various applications. Future research
may explore Infrastructure as Code (IaC) to enhance
automation, scalability, and resource management,
optimizing data processing efficiency.
ACKNOWLEDGMENTS
Finally, we would like to thank FUNCAP for the
project financial support and CNPq for the pro-
ductivity grant awarded to Rossana M. C. Andrade
(306362/2021-0).
REFERENCES
Behringer, M., Treder-Tschechlov, D., Voggesberger, J.,
Hirmer, P., and Mitschang, B. (2023). Sdrank:
A deep learning approach for similarity ranking of
data sources to support user-centric data analysis.
In Proceedings of the 25th International Conference
on Enterprise Information Systems, page 419–428.
SCITEPRESS - Science and Technology Publications.
Freitas, J. B., Clarindo, J., and Aguiar, C. (2023). Ambiente de data warehousing espacial para tomada de decisão sobre dados de crimes. In Anais Estendidos do XXXVIII Simpósio Brasileiro de Bancos de Dados, pages 36–42, Porto Alegre, RS, Brasil. SBC.
Fugini, M. and Finocchi, J. (2020). Quality evaluation
for documental big data. In Proceedings of the 22nd
International Conference on Enterprise Information
Systems, page 132–139. SCITEPRESS - Science and
Technology Publications.
Furtado, L. S., da Silva, T. L. C., Ferreira, M. G. F., de Macedo, J. A. F., and Moreira, J. K. M. L. C. (2023). A framework for digital transformation towards smart governance: using big data tools to target SDGs in Ceará, Brazil. In TEMPLATE'06, 1st International Conference on Template Production. Journal of Urban Management.
Gupta, B., Mittal, P., and Mufti, T. (2021). A review on
amazon web services (aws), microsoft azure & google
cloud platform (gcp) services. In Proc. of the 2nd In-
ternational Conference on ICT for Digital, Smart, and
Sustainable Development.
Kumar, A., Mishra, A., and Kumar, S. (2023). Data
lake, lake house, and delta lake. In Architecting
a Modern Data Warehouse for Large Enterprises:
Build Multi-cloud Modern Distributed Data Ware-
houses with Azure and AWS, pages 95–160. Springer.
Rautenberg, S. and do Carmo, P. R. V. (2019). Big data e ciência de dados: complementariedade conceitual no processo de tomada de decisão. Brazilian Journal of Information Science, 13(1):56–67.
Sandhu, A. K. (2021). Big Data with Cloud Computing:
Discussions and Challenges. Big Data Mining and
Analytics, 5(1):32–40.
Sarker, I. H. (2021). Data science and analytics:
An overview from data-driven smart computing,
decision-making and applications perspective. In
TEMPLATE’06, 1st International Conference on Tem-
plate Production. Springer Science and Business Me-
dia LLC.
Silva, M. E., Pinheiro, F., Bezerra, C., and Coutinho, E. (2023). Modelagem de Ecossistemas de Software das Plataformas de Computação em Nuvem AWS e GCP. In Anais Estendidos do XIX Simpósio Brasileiro de Sistemas de Informação, pages 172–177, Porto Alegre, RS, Brasil. SBC.
Vines, A. and Tanasescu, L. (2023). An Overview of ETL
Cloud Services: An Empirical Study Based on User’s
Experience. In Proceedings of the International Con-
ference on Business Excellence, volume 17, pages
2085–2098.
Wu, Y., Zhang, Z., Kou, G., Zhang, H., Chao, X., Li, C.-
C., Dong, Y., and Herrera, F. (2021). Distributed
linguistic representations in decision making: Taxon-
omy, key elements and applications, and challenges
in data science and explainable artificial intelligence.
Journal of Urban Management.