The Impact of Data Science on Geography: A Review with
Optimization Algorithms
Roberto de Oliveira Machado
a
Universidade Nova de Lisboa, Avenida de Berna, 26-C, 1069-061, Lisboa, Portugal
Keywords: Data Science, Geography, Optimization Algorithms, Supervised Learning, Systematic Review.
Abstract: We conducted a systematic review using the PRISMA methodology, analyzing 2,996 studies and synthesizing
41 to explore the evolution of data science and its integration into geography. Optimization algorithms were
employed to enhance the efficiency and precision of literature selection. Our findings reveal that data science
has evolved over five decades, facing challenges such as the integration of diverse spatial data and the
increasing demand for advanced computational skills. In the field of geography, data science emphasizes
interdisciplinary collaboration and methodological innovation. Techniques like large-scale spatial data
analysis and predictive algorithms hold promise for applications in natural disaster management and
transportation optimization, enabling faster and more effective responses. These advancements highlight data
science’s pivotal role in solving complex spatial problems. This study contributes to the application of
optimization algorithms in systematic reviews and underscores the necessity for deeper integration of data
science into geography. Key contributions include identifying challenges in managing heterogeneous spatial
data and promoting advanced analytical capabilities. The intersection of data science and geography leads to
significant improvements in disaster management and transportation efficiency, fostering more sustainable
and impactful environmental solutions.
1 INTRODUCTION
Data science is an interdisciplinary field that
integrates statistical analysis, machine learning, and
computational techniques to extract significant
insights from large and complex datasets. With the
advent of big data, characterized by high volumes,
velocity, and variety, data science has become
essential in analyzing and interpreting vast amounts
of information. Technologies such as cloud
computing, artificial intelligence, and advanced
algorithms facilitate data processing and analysis,
enabling real-time insights and decision-making
across various domains.
The application of data science in geography has
immense potential to revolutionize the analysis of
spatial data and address complex issues. Predictive
algorithms and spatial analysis techniques can
enhance natural disaster prediction, optimize
transportation networks, and manage resources
sustainably. These innovations provide precise and
actionable information, improving the effectiveness
a
https://orcid.org/0000-0002-8346-4155
of spatial interventions and fostering community
resilience (Yue, 2016; Singleton, 2019). However,
the integration of data science into geography faces
challenges, such as the need to harmonize
heterogeneous spatial data, the demand for
interdisciplinary expertise, and the lack of specialized
tools for advanced spatial analysis, which hinder its
broader adoption in the field of geography.
This study examines the evolution of data science
and its integration into geography through a
systematic review. It employs the PRISMA
(Preferred Reporting Items for Systematic Reviews
and Meta-Analyses) methodology to ensure a
structured and unbiased synthesis of relevant studies.
Additionally, supervised learning techniques, such as
logistic regression and Naïve Bayes algorithms, are
employed to optimize the literature selection process,
reducing manual workload and improving precision.
A bibliometric analysis is conducted using
Bibliometrix to explore trends, knowledge gaps, and
contributions across disciplines.
Machado, R. O.
The Impact of Data Science on Geography: A Review with Optimization Algorithms.
DOI: 10.5220/0013464200003935
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 11th International Conference on Geographical Information Systems Theory, Applications and Management (GISTAM 2025), pages 219-230
ISBN: 978-989-758-741-2; ISSN: 2184-500X
Proceedings Copyright © 2025 by SCITEPRESS Science and Technology Publications, Lda.
219
The primary objectives of this work are to:
(i) conduct a systematic review following the
PRISMA protocol, enhanced by optimization
algorithms, to ensure a comprehensive analysis of
existing knowledge; (ii) examine the evolution of data
science and its applications in geography; and
(iii) provide actionable insights into how data science
can be leveraged to address complex spatial
challenges.
By achieving these objectives, the research
highlights the transformative potential of data science
in addressing spatial challenges and advancing
interdisciplinary collaboration.
The article is structured into five sections:
(i) systematic literature review; (ii) bibliometric
analysis; (iii) results; (iv) discussion; and (v)
conclusion.
2 METHODS
To ensure a structured and unbiased analysis of the
literature, the study adhered to the PRISMA
methodology (Pickering et al., 2014; Teh et al., 2020),
which comprises four key stages: identification,
screening, eligibility, and inclusion. Relevant studies
were retrieved through comprehensive database
searches in Scopus and Web of Science (WOS),
employing carefully selected search terms to capture
the intersection of data science and geography. Titles
and abstracts were screened to exclude irrelevant
articles based on predefined criteria, ensuring both
relevance and methodological rigor. Full-text articles
were subsequently assessed for eligibility, and only
those meeting all criteria were included in the final
analysis.
The screening process was enhanced through the
implementation of supervised learning algorithms,
such as Logistic Regression and Naïve Bayes. These
algorithms automated key stages of the process,
significantly reducing manual effort. Initially,
additional classifiers, including Support Vector
Machines, K-Nearest Neighbors, and Random Forest,
were also evaluated. However, they were excluded
due to their lower performance in terms of accuracy
and false-negative rates. Logistic Regression
exhibited the lowest false-negative rate, ensuring the
inclusion of high-quality and relevant studies, while
Naïve Bayes excelled in computational efficiency and
generalization capability, making them suitable for
handling the available dataset.
A bibliometric analysis conducted using
Bibliometrix provided deeper insights into trends,
knowledge gaps, and interdisciplinary contributions.
Graphs and tables were generated in Excel, while
spatial data visualizations were created in Jupyter
Notebook, ensuring a robust and versatile approach to
data analysis.
To enhance transparency and reproducibility,
rigorous inclusion and exclusion criteria were applied
throughout the process. Potential biases were
mitigated through manual verification (Ólafsdóttir &
Tverijonaite, 2018; Saltz & Dewar, 2019; Saltz &
Krasteva, 2022) of algorithmic outputs and sensitivity
analyses (Kantardzic, 2019) to assess the robustness
of the findings.
2.1 Systematic Literature Review
The review followed six steps: formulation,
identification, selection, eligibility, inclusion, and
categorization. Two research questions were
established: “What is the state of the art in data
science?” and “What is its applicability in
geography?” To ensure a structured selection process,
specific inclusion criteria were defined, considering
articles, conference papers, reviews, books, and book
chapters that contained the relevant search terms in
the title, abstract, and keywords, were aligned with
the research questions, and were written in English.
Conversely, exclusion criteria were applied to filter
out editorials, notes, short surveys, data papers, and
any studies misaligned with the research questions or
written in languages other than English.
The literature search was conducted in the Scopus
and Web of Science (WOS) databases, using the
keywords “Data Science” AND “Geography” OR
“Spatial” OR “Big Data” OR “Geographic” OR
“Geospatial”. The search encompassed all available
studies up to May 2023, focusing on abstracts to
ensure relevance. The results were exported in CSV
format for Scopus and Research Information Systems
(RIS) format for WOS. A normalization process was
then applied to merge and structure data from
different sources into a unified CSV format,
leveraging Python 3.8 and the Pandas library for
efficient data processing.
The next phase involved the manual classification
of a set of articles into relevant and irrelevant
categories. This step was followed by the
implementation of a supervised active learning cycle,
leveraging the following libraries: Pandas;
sklearn.feature_extraction.text.TfidfVectorizer;
sklearn.linear_model.LogisticRegression;
sklearn.naive_bayes.MultinomialNB;
sklearn.metrics.recall_score;
sklearn.metrics.confusion_matrix.
GISTAM 2025 - 11th International Conference on Geographical Information Systems Theory, Applications and Management
220
The Term Frequency - Inverse Document
Frequency (TF-IDF) technique was used to measure
the importance of a word in a document, considering
a collection of documents or corpus. This
measurement was performed by assigning a weight
during the text vectorization process (Bafna et al.,
2016; Chen et al., 2016; Sah et al., 2020). The TF-
IDF calculation for the term 't' in a document 'd' is
defined as shown in Equation (1):
𝑇𝐹 − 𝐼𝐷𝐹
(
𝑡,𝑑,𝐷
)
=𝑇𝐹
(
𝑡,𝑑
)
∗𝐼𝐷𝐹
(
𝑡,𝐷
)
(1)
where:
𝑇𝐹
(
𝑡,𝑑
)
corresponds to the number of times
the term t appears in the document 𝑑;
𝐼𝐷𝐹 (𝑡,𝐷) represents the inverse document
frequency of the term t in the set of
documents D, calculated as: 𝐼𝐷𝐹
(
𝑡,𝐷
)
=
𝑙𝑜𝑔 (𝑁/(1 + 𝑛
)
where:
𝑁 is the total number of documents in the
set;
𝑛
is the number of documents that contain
the term t.
Here is the total number of documents in the set
and is the number of documents that include the term
t. This method assigns higher weights to less frequent
terms and lower weights to more frequent terms
(O’Mara-Eves et al., 2015; Chen et al., 2016).
In the selection procedure, the Logistic
Regression classifier demonstrated superior
performance in classification, identifying higher
proportion of relevant studies compared to the total
number of relevant articles detected. This indicates
that Logistic Regression had a lower false-negative
rate, being more effective in identifying relevant
studies. The performance metrics of active learning
for the two algorithms used (Logistic Regression and
Naïve Bayes) in categorizing the relevance of studies
related to data science concepts. The recall value was
calculated using Equation (2):
𝑇𝑃
𝑇𝑃
+
𝐹𝑁
(2)
where:
𝑇𝑃 represents true positives (relevant
studies correctly identified) and 𝐹𝑁
represents false negatives (relevant studies
incorrectly marked as irrelevant) (O’Mara-
Eves et al., 2015).
Throughout the cycle, the binary Logistic
Regression classifier was used to predict the category
of a dependent variable based on the values of the
independent variables. The result of this process was
0 for "irrelevant" or 1 for "relevant", as shown in
Equation (3):
𝑃
(
𝑥
,𝜃
)
=
1
1
+
𝑒

(3)
where:
𝑃
(
𝑥
,𝜃
)
is the probability that observation
𝑥
belongs to class 𝑦
;
𝑒 is the constant Euler’s number,
approximately 2.71828;
𝑧 is the linear combination of features
weighted by parameters:
𝑧= 𝜃
+ 𝜃
𝑥

+ 𝜃
𝑥

+⋯+ 𝜃
𝑥

;


is the sigmoid function, which
transforms the linear combination 𝑧 into a
value between 0 and 1, interpreted as the
probability of belonging to the positive
class.
Implementing a systematic literature review using
supervised learning can reduce the time required to
identify relevant studies. However, this method is not
without biases, potentially omitting up to 5% of
relevant literature (Ros et al., 2017; Yu & Menzies,
2019; Ferdinands et al., 2020; Wang et al., 2020; van
De Schoot et al., 2021). To mitigate this, the process
followed guidelines proposed by Yang (2018), which
recommend finalizing the review process after
identifying 50 irrelevant records.
After completing the screening process with the
Logistic Regression classifier, an additional selection
was conducted using the Nve Bayes classifier. The
Naïve Bayes classifier employs a probabilistic
classification mechanism based on Bayes' Theorem
(Yang, 2018), expressed by the following Equation
(4) (Zhang & Gao, 2011; Jurafsky & Martin, 2019):
𝑃
(
𝑦
|
𝑥
,𝑥
,…,𝑥
)
∝𝑃
(
𝑦
)
𝑃(𝑥

|
𝑦
)
(4)
where:
𝑃
(
𝑦
|
𝑥
,𝑥
,…,𝑥
)
is the conditional
probability of class 𝑦 given the variables
𝑥
,𝑥
,…,𝑥
;
𝑃
(
𝑦
)
corresponds to the prior probability of
class 𝑦;
𝑃
(
𝑥
|
𝑦
)
represents the conditional
probability of observing the feature 𝑥
given
that the class is 𝑦;
𝑃

indicates the multiplication of all
conditional probabilities for each feature 𝑥
;
= 3.822 is a hyperparameter used to improve
classifier performance.
The results were compared to identify potential
The Impact of Data Science on Geography: A Review with Optimization Algorithms
221
omissions of relevant literature. Following this, a full
reading of the selected studies was conducted, with
particular attention given to frequently cited authors
not included in earlier phases. Relevant studies were
integrated to minimize potential biases. In the
eligibility phase, studies misaligned with the research
questions were excluded from the review process.
In the final categorization phase, information
from eligible studies was extracted. Bibliometric data
were analyzed using the open-source tool
Bibliometrix, which provided disaggregation by
space, time, thematic area, and journals. Graphs and
tables were generated in Excel, and spatial data
visualizations were developed using Jupyter
Notebook. It is important to acknowledge that the
review was conducted by a single reviewer, which
could introduce subjectivity in the inclusion and
exclusion criteria. Additionally, only textual studies
were considered due to limitations in applying text
classifier algorithms to digitized documents. The
Python code for the systematic review is well-
documented and ensured for reproducibility. It is
available for consultation on [GitHub] and [Google
Colab].
3 RESULTS
The results of this systematic review provide a
comprehensive overview of the main trends and
challenges in applying data science to geography. The
initial database search identified 2,996 studies. After
removing duplicates, 1,806 studies were selected for
screening. Title and abstract analysis led to the
selection of 126 documents for full reading. After
examination, 35 studies were included, while the rest
were excluded for not aligning with the research
question. Citation analysis contributed six additional
studies, including three books and three other
documents, two of which were in image format. The
total number of studies for categorization and
synthesis was 41.
3.1 Bibliometric Analysis
This systematic review includes studies from 10
countries. The United States leads with 24 studies,
followed by the Netherlands with four, the United
Kingdom and China with three each, and Australia
with two. The remaining countries contributed one
publication each. The reviewed studies span from
1966 to 2022, with the earliest published in 1966 and
the latest in December 2022. The most representative
years were 2017 (17%), followed by 2018 (15%), and
2016 and 2022 (10% each). Until 2012, studies
focused on the contribution of statistics and
computing to the foundations of data science.
Between 2013 and 2017, research addressed the
field's definition and applicability. From 2018 to
2022, the literature expanded to various topics, from
specific methodologies to guidelines for knowledge
extraction, with a link to geography.
Among the six most representative areas of
knowledge, computer science is the most prominent,
accounting for 35% of occurrences, followed by
social sciences and decision sciences, each at 16%.
Mathematics accounts for 14%, earth and planetary
sciences for 11%, and business, management, and
accounting for 8%. Regarding the distribution of
study types, 46% are articles, 25% are reviews, 17%
are conference papers, and 12% are books.
The most frequent periodicals include
Communications of the ACM, Geographical
Analysis, Geography Compass, and Proceedings of
the National Academy of Sciences, each with two
publications. Among the top authors, van der Aalst
W. leads with three studies, while Smyth P.,
Arribas-
Bel D., and Cao L. each have two. The six most cited
studies cover topics such as data mining (Fayyad et
al., 1996), process mining (van der Aalst, 2016), the
intersection of supply chain management and data
science (Waller & Fawcett, 2013),, the challenge of
defining data science and its link to big data (Provost
& Fawcett, 2013), the role of data scientists
(Davenport & Patil, 2012), and the distinction
between data science and statistics (Dhar, 2013).
3.2 Definition of the Concept of Data
Science
As a starting point, we provide an overview of the
concept of data science as defined in previous studies.
Data science is broadly conceived as a flexible and
practical approach to extracting knowledge or useful
patterns from data (Dhar, 2013; Provost & Fawcett,
2013; Kelleher & Tierney, 2018; Vicario, G., &
Coleman, 2019). It combines three fundamental
components: computing, statistics (Vicario, G., &
Coleman, 2019; Blei, 2017; Cao, 2018; Yan & Davis,
2019), and domain knowledge (Song & Zhu, 2015;
Muller et al., 2019). Computing facilitates large-scale
parallel computing, algorithms, and data mining to
uncover useful information (Provost & Fawcett,
2013; Kelleher, J., & Tierney, 2018; Donoho, 2017).
Statistics provide a quantitative framework, applying
concepts such as mean, standard deviation,
hypothesis testing, and statistical inference (Waller,
& Fawcett, 2013). Domain knowledge is crucial for
GISTAM 2025 - 11th International Conference on Geographical Information Systems Theory, Applications and Management
222
analyzing and extracting knowledge pertinent to
specific areas (Muller et al., 2019).
Vicario et al. (2019), Yan & Davis (2019), and
van der Aalst (2016) emphasize the interdisciplinary
nature of data science, where various disciplines
collaborate to extract knowledge from data. In
contrast, Song & Zhu (2019) describe data science as
multidisciplinary, with each field contributing
uniquely while operating within a common context.
Cao (2018) further classifies data science as
transdisciplinary, where different domains converge
to extract relevant information from data. Therefore,
data science can be inter-, multi-, or transdisciplinary
depending on the project's scope, the individuals
involved, and the evolving nature of the field.
Interdisciplinary and multidisciplinary approaches
keep data science closely linked to its foundational
areas, while achieving a transdisciplinary state
requires robust research that encompasses both
interdisciplinary and multidisciplinary elements.
The Venn diagram illustrates the convergence of
disciplines in data science, emphasizing its
interdisciplinary nature, as noted by Blei & Smyth
(2017). The overlap of computer science and domain
knowledge produces "advanced analysis," where
models are created, and analyses are conducted. The
intersection of mathematics, statistics, and computer
science enables "machine learning" through data and
algorithms. The overlap of domain knowledge with
mathematics and statistics leads to "traditional
analysis," using classical methods to identify patterns
(Provost & Fawcett, 2013; Kelleher & Tierney,
2018). At the core, where all three disciplines
intersect, lies the discovery of knowledge through
integrated data analysis, statistical methods, and
advanced computing.
Song & Zhu propose a distinct perspective,
suggesting that data science comprises three pillars:
data, technologies, and people. They argue that the
third pillar —skilled professionals is essential for
selecting appropriate data and applying advanced
technologies in analysis. Blei & Smyth (2017) also
emphasize the human element in data science,
highlighting the need for data scientists to understand
the problem domain, devise inferential methods, and
effectively use computational tools. Scheider et al.
(2020) further note that such skills are often acquired
through professional interactions and practical
implementation, rather than solely through academic
training (Hicks & Irizarry, 2018).
The proliferation of data, driven by numerous
human activities and technological advancements,
represents a significant revolution in knowledge
production (van der Aalst, 2016). Data science has
emerged as a flexible field capable of analyzing large
volumes of structured and unstructured data.
Structured data, typically stored in SQL, are
tabulated for easier storage and processing, while
unstructured data, in NoSQL systems, are more
irregular and complex to analyze (Kelleher &
Tierney, 2018). To manage this complexity, data
scientists often create an n*m analytical matrix where
'n' represents entities (rows) and 'm' represents
attributes (columns). This matrix, incorporating data
from multiple sources such as databases, data
warehouses (Kelleher & Tierney, 2018), big data
lakes (Carniel & Schneider, 2021), online content,
and social media streams, is fundamental to data
science projects. Significant time and resources are
spent organizing, cleaning, and updating this matrix.
3.3 Origin and Evolution of the
Concept
In the 1960s, statistician John Tukey and computer
scientist Peter Naur laid the foundations of data
science, leading many to view it as a descendant of
statistics and computer science (Blei & Smyth, 2017).
While incorporating methods and approaches from
these fields, data science has evolved to meet
contemporary analytical needs driven by the
omnipresence of data collected ubiquitously (van der
Aalst & Damiani, 2015). Over the past two decades,
this evolution has consolidated data science as a
distinct science.
The term “data science” first appeared in literature
in the preface of Peter Naur's book Concise Survey of
Computer Methods in the 1970s, where he defined it
as the "science of dealing with data" (Cao, 2018, p.
9). However, John Tukey's earlier work in 1962
proposed incorporating data analysis processes
within statistics, advocating for the inclusion of new
theories, computers, visualization tools, and larger
datasets (van der Aalst, 2016; Cao, 2018). This need
for change was echoed in the field of computing.
In 1966, Peter Naur introduced “Datalogy” (Naur,
1966), the systematic study of data, including
representation and automatic processing
mechanisms. Naur characterized this field as broad
and interdisciplinary, essential for tasks across
different domains and data types. Despite its
significance, the role of statistics in defining data
science foundations remains debated, as statisticians
traditionally focused more on formulating theories
than addressing practical challenges associated with
large datasets and computational complexities (van
der Aalst, 2016).
In the 1970s, John Tukey's Exploratory Data
The Impact of Data Science on Geography: A Review with Optimization Algorithms
223
Analysis (Tukey, 1997) introduced techniques aimed
at improving results through hypothesis testing,
unbiased data, and conventional statistics, which later
came to be known as confirmatory data analysis.
Concurrently, the relational data model SQL
revolutionized database information storage,
indexing, and retrieval (Kelleher & Tierney, 2018),
laying the groundwork for “Knowledge Discovery in
Databases” (KDD) in the late 1980s (Cao, 2018). The
seminal 1996 article "Data Mining and Knowledge
Discovery in Databases" by Fayyad et al. (1996)
outlined a five-stage process for knowledge
discovery: selection, pre-processing, transformation,
data mining, and pattern interpretation.
Building on the KDD model, other methodologies
emerged both in academia and industry due to the
growing demand for data analysis tools. Notable
among these was the Cross-Industry Standard Process
for Data Mining (CRISP-DM), organized around a
six-stage lifecycle, providing a flexible and iterative
framework for data mining projects (Martínez-
Plumed, 2021). As data science gained prominence in
the 1990s, it became increasingly associated with
both research and practical applications (Zhu &
Xiong, 2015).
The early 21
st
century witnessed further
advancements, including Cleveland's action plan for
teaching data science and the launch of NoSQL
databases, propelling the field forward. The
increasing of systems led to the rise of big data,
creating the need for new processing frameworks
such as MapReduce and Hadoop (van der Aalst,
2016; Cao, 2018). MapReduce, with its Map and
Reduce functions, and Hadoop, integrating Hadoop
Distributed File System (HDFS) with MapReduce,
enabled efficient storage and distribution of large data
volumes across clusters (Kumar & Mohbey, 2022).
In the second decade of the 21
st
century, data
science became more clearly defined, integrating
computer science, mathematics, statistics, and
domain-specific knowledge (Brady, 2019). The role
of the data scientist, as highlighted by Davenport &
Patil (2012), gained prominence as data science
became critical for problem-solving and deriving
insights. This period saw an increase in publications
on data science infrastructure and processes, such as
van der Aalst's Process Mining Data Science in
Action (2016) and Donoho's concept of “greater data
science” activities (2017). The profession of data
scientist emerged as both essential and desirable,
although discussions on the competencies required
for these professionals continue to evolve in literature
and academia (Vicario & Coleman, 2019).
3.4 Function and Applicability
Our analysis demonstrates that data science has
solidified its position as a key discipline for solving
complex problems through the integration of
statistical and computational methods (Cleveland,
2001; Waller & Fawcett, 2013; Martínez-Plumed et
al., 2021). In addition to addressing the challenges
posed by the big data phenomenon (Song, & Zhu,
2015), data science aims to generate both actional and
actionable knowledge (Donoho, 2017). It employs
automated analyses to understand various
phenomena, converting data into actionable insights,
diagnostics, and automated decisions (Provost &
Fawcett, 2013; van der Aalst, 2017). This
transformation follows a methodology that ranges
from data collection to knowledge acquisition (Cao,
2018). As Brady (2019) notes, data science employs
specific processes to explore and describe data,
identify meaningful patterns, and present them
effectively, adapting and redefining the traditional
objectives of statistics and computer science. Van der
Aalst & Damiani (2015) emphasize that data science
is capable of addressing four fundamental questions:
"What happened?" "Why did it happen?" "What will
happen?" and "What is the best decision to make?"
Notably, although data science is often associated
with large datasets, it is equally effective when
applied to smaller datasets, which can also yield
valuable information for solving specific problems.
Most studies suggest that the knowledge
discovery process in data science follows a lifecycle
encompassing systematic methodologies, open and
interdisciplinary approaches, and synergies from
various disciplines (Song, & Zhu, 2015; van der Aalst
& Damiani; 2015; Cao, 2017; Yu & Kumbier, 2020).
However, there is some discrepancy concerning the
number of stages in this cycle, ranging from four (van
der Aalst, 2017) to eight stages (Yu & Kumbier,
2020). The most referenced model is CRISP-DM
(Provost & Fawcett, 2013;
Song & Zhu, 2015;
Martínez-Plumed, 2021). Its main strength lies in its
independence from specific software or analysis
techniques (Kelleher, & Tierney, 2018). CRISP-DM,
an extension of the KDD model (Martínez-Plumed,
2021), is structured into six stages: (i) understanding
the project objectives and defining the problem; (ii)
collecting data for analysis; (iii) performing cleaning
actions to correct or remove inaccurate or irrelevant
information; (iv) applying algorithms to identify
useful patterns in the data and establish representative
models; (v) evaluating the model and the developed
stages; and (vi) implementing the model, assessing its
practical viability and ability to generate knowledge
GISTAM 2025 - 11th International Conference on Geographical Information Systems Theory, Applications and Management
224
or value from the data (Larose & Larose, 2014). It is
important to note that the CRISP-DM process is
iterative, not necessarily following a linear sequence
of stages.
The technologies used in data science vary across
organizations and projects (Kelleher & Tierney,
2018).
As the size of the organization or the volume
of data increases, so does the complexity of the
ecosystem to be implemented. A typical data science
ecosystem comprises tools and components from
multiple software vendors, capable of processing data
in diverse formats. Commonly used tools include
SQL and NoSQL databases (van der Aalst &
Damiani, 2015), MapReduce for distributed
computing (Blei & Smyth, 2017), and Hadoop for
storage, along with associated components such as
HBase, Hive, Pig, Mahout, Storm, and Spark (Song,
& Zhu, 2015; van der Aalst & Damiani, 2015; Kumar
& Mohbey, 2022; Abdalla, 2022). These components
are critical during the data representation and
computation phase (Donoho, 2017). The
programming languages SQL, Python, and R are
employed for querying, visualization, data cleaning,
and calculations (Brunner & Kim, 2016).
Among the various tools available, Jupyter
Notebook has gained significant prominence as an
open-source web-based interactive development
environment. Jupyter Notebook stands out for its
interface, which facilitates communication,
abstraction, and data manipulation (Carniel &
Schneider, 2021). It enables users to access databases,
configure, and organize workflows for scientific
computing and machine learning with ease.
Moreover, Jupyter Notebook is highly flexible,
allowing seamless integration of extensions, such as
Python and R libraries, for analysis, model
application, and result visualization. This flexibility
and ease of use contribute to a reduction of the time
required for analysis. One of the tool’s most valuable
features is its ability to integrate both text and code in
a single location, a concept known as “literate
programming”. This is particularly beneficial in data
science projects involving multidisciplinary teams,
where not all members may possess a deep
understanding of all tasks, thereby making integrated
documentation an essential feature.
3.5 Relationship with Geography
Data science emerges in geography as a
complementary, though somewhat limited, process
(Singleton & Arribas-Bel, 2019). A review of the
literature analysis reveals no consensus on the
definition of the term, with various designations,
including "earth data science," "geospatial data
science," "spatial data science," and "geographic data
science."
Yue
et al. (2016, p. 84) define "earth data science"
as "the ability and knowledge to apply the data
science paradigm to EO data”. Xie et al. (2017, p. 3)
describe "geospatial data science" as "the science of
data-driven approaches focusing on the geospatial
domain”. Carniel et al. (2021, p. 1) define "spatial
data science" as a "subclass of object-oriented data
science and spatial data". Finally, Andrienko et al.
(2017, p. 15) refer to "geographic data science" as a
field for working "with data that incorporates spatial
and often temporal elements".
Spatial data science is defined as the discipline
that explores data-driven approaches, with a
particular emphasis on the spatial domain. Therefore,
this review adopts the term "spatial data science",
which, according to Singleton Singleton & Arribas-
Bel (2019) and Bowlick & Wright (2018), has the
potential to revitalize quantitative geography and
geocomputation methodologies. This revitalization
occurs through the application of advanced
techniques for collecting, preparing, exploring, and
visualizing large datasets, alongside the use of
machine learning and spatial statistical methods to
detect patterns and model spatial dependencies (Xie
et al., 2017). Spatial data science seeks to extract
valuable information about space (Andrienko et al.,
2017), time, and distance (Carniel, & Schneider,
2021). High-performance computational procedures,
algorithms, and specific programming languages are
essential for extracting this information within an
interdisciplinary or transdisciplinary framework.
One of the most compelling applications of data
science in geography is the integration of EO data
with socio-economic datasets. Recent studies have
demonstrated how combining EO data with social
media and economic data enhances our understanding
of socio-economic-environmental systems, thus
facilitating improved urban planning and disaster
management (Yue et al., 2016).
Visualization tools are essential in geographic
data science. For instance, the Urban Space Explorer
tool uses a coordinated multiple-view interface to
analyze social media posts in relation to population
distribution and points of interest. This tool has
proven invaluable to urban planners, providing
insights into urban dynamics and supporting data-
driven decision-making (Xie et al., 2017).
Xie et al. (2017) view spatial data science as
transdisciplinary, addressing the limitations of
isolated data mining and machine learning
approaches by integrating four key elements:
The Impact of Data Science on Geography: A Review with Optimization Algorithms
225
mathematics, statistics, computer science, and
geospatial techniques. Palomino et al. (2017) argue
that spatial data science is inherently
interdisciplinary, emerging from the intersection of
three distinct fields: Geographic Information Science,
data science, and CyberGIS, which is defined as "GIS
based on advanced cyberinfrastructure and e-science"
(VoPham et al., 2018, p. 2). Carniel & Schneider
(2021) also regard spatial data science as
interdisciplinary, rooted in three core disciplines:
mathematics/statistics, computer science, and
business information/domain specialization.
However, the literature lacks clarity on the specific
techniques employed within the subclasses that
emerge from the intersection of these core disciplines
in spatial data science.
The integration of the three core disciplines of
data science computer science, mathematics and
statistics, and domain knowledge —, can be
effectively visualized using a Venn diagram. In the
context of mathematics and statistics, techniques such
as Kriging, bootstrap, and autocorrelation are
employed. In computer science, programming and
managing large spatial datasets are crucial, utilizing
tools such as Spatial Hadoop and specialized open-
source libraries like PySal, PyMove, Geopandas, and
Moving Pandas. In domain knowledge,
comprehending the spatial context including
location, distance, and spatial and temporal
interactions — is crucial.
The implementation of data science projects
within spatial contexts is often underemphasized in
the literature. Palomino et al. (2017) and VoPham et
al. (2018) advocate for workflows utilizing open-
source and cloud computing technologies. Palomino
et al. (2017) propose a comprehensive framework
consisting of two complementary models: a four-
stage model—setting up the work environment, data
management (collection, cleaning, and
transformation), analysis and visualization, and data
publication, with an emphasis on collaborative
work—and an iterative six-stage workflow, which
builds on the previous model by adding specific steps
for data import, processing, cleaning, transformation,
visualization, modeling, and communication of
results. Carniel & Schneider (2021) introduce a life
cycle based on conventional data science workflows,
comprising four components: problem definition and
understanding, spatial data modeling and
understanding, spatial reasoning, and spatial data
visualization.
In practice, spatial data science benefits from the
inherent flexibility in implementing data science
projects. While establishing models, such as CRISP-
DM, are widely used, scientists retain the flexibility
to adapt the lifecycle or workflow to suit the specific
project context (Martínez-Plumed et al., 2021). The
skills required for spatial data science continue to
evolve, with teamwork, machine learning techniques,
cloud computing, and proficiency in programming
languages such as Python, R, and SQL being crucial
(Jiang & Chen, 2021).
Successfully executing a spatial data science
project requires proficiency in three key areas:
domain knowledge, mathematical and statistical
skills, and computer science (Carniel & Schneider,
2021; Bowlick & Wright, 2018). However, the role of
quantification in geography remains a subject of
ongoing debate, in part due to the fact that many
practitioners lacking a strong background in statistics
and computing. Increasing emphasis on these areas is
crucial for strengthening the connection between
geography and data science, both academically and
professionally.
There is a longstanding tradition of using open-
source programming languages such as Python and R,
in spatial analyses (Anselin & Rey, 2022). The
CyberGIS domain advocates for the application of
cloud computing in geography (Palomino et al.,
2017). Specialized software, such as QGIS and
ARCGIS, has adapted to this trend by integrating
open-source extensions for querying, analyzing, and
visualizing data. In recent decades, geocomputation
and remote sensing have increasingly incorporated
large datasets, algorithms, and open-source code into
their analyses (Arribas-Bel & Reades, 2018).
Geographers are also actively involved in smart city
projects, which presents new opportunities for
analyzing large datasets (Scheider et al., 2020).
In these environments, structured and
unstructured data from diverse sources are frequently
employed for complex analyses (Brady, 2019). This
convergence of factors, coupled with ongoing
developments, can cultivate a stronger relationship
between data science and geography. To date, this
relationship has remained fragmented and indirect
(Singleton, & Arribas-Bel, 2019). Should this
relationship deepen, it could mark a new chapter for
the discipline, reinforcing its position in a data-driven
world, open science (Palomino et al., 2017), and
cloud computing. This scenario is consistent with
contemporary data science approaches to data
collection, transformation, processing, and analysis.
GISTAM 2025 - 11th International Conference on Geographical Information Systems Theory, Applications and Management
226
4 DISCUSSION
4.1 Systematic Review
This study employed an efficient, transparent, and
reproducible process for bibliographic selection,
providing a detailed and technical overview of the
methodologies and algorithms used. Mathematical
equations and detailed discussions on algorithm
performance were provided, along with well-
documented Python code that can be easily adapted
and implemented both locally and on web-based
platforms.
However, several limitations were encountered in
implementing Python pipelines for continuous
bibliography updates. The Scopus and Web of
Science (WOS) APIs imposed functional restrictions,
limiting the number of requests and records that could
be retrieved and increasing query complexity.
Consequently, article selection required direct
interaction with the platforms. Furthermore, WOS
does not allow direct CSV data export, requiring the
conversion of RIS files to CSV using Python. This
process introduces potential errors and
inconsistencies, requiring rigorous standardization
and cleaning procedures to ensure data quality.
4.2 Data Science
Data science plays a pivotal role in addressing
complex problems by utilizing statistical and
computational methodologies. In the era of big data,
characterized by increasing volume, velocity, and
variety, data science has evolved to automate and
enhance data analysis processes, thereby improving
diagnostics, predictive analytics, and decision-
making tools.
By redefining traditional statistical and
computational objectives, data science generates
actionable insights across various domains,
highlighting its transformative potential in addressing
contemporary challenges.
4.3 Application in Geography
In geography, data science emerges as a
complementary yet underutilized tool. The lack of
consensus surrounding terms such as "earth data
science," "geospatial data science," and "spatial data
science" highlights the need for standardized
terminology. Among these, "spatial data science" has
gained prominence as a promising framework to
rejuvenate quantitative geography and
geocomputation methodologies.
Advanced techniques, such as machine learning
and predictive analytics, offer significant potential
across spatial domains. For instance, predictive
algorithms enhance natural disaster management by
providing more accurate forecasts and facilitated
effective response strategies. The analysis of large
datasets optimizes transportation routes, reduces
congestion, and improves logistical efficiency.
Furthermore, data science plays a crucial role in
sustainable resource management by identifying
patterns and trends that foster the adoption of
environmentally friendly practices.
4.4 Challenges and Ethical
Considerations
The application of data science in geography presents
significant challenges, including the integration of
heterogeneous spatial data, the need for advanced
computational and statistical expertise, and ethical
concerns surrounding data privacy and usage.
However, spatial data science offers innovative
solutions by integrating mathematics, statistics,
computer science, and geospatial techniques. Tools
such as open-source software and cloud computing
provide promising pathways for addressing these
challenges.
4.5 Interdisciplinary Collaboration
Interdisciplinary collaboration among geographers,
data scientists, statisticians, and ethicists is crucial for
addressing complex challenges at the intersection of
geography and data science. Previous studies have
shown that multidisciplinary teams optimize the
analysis of large spatial datasets. Recommended
strategies include forming diverse teams, fostering
advanced data analysis skills, and adhering to
rigorous ethical standards to promote methodological
innovation and ensure sustainable solutions.
4.6 Limitations and Future Directions
Future research should prioritize the integration of
advanced data science techniques into geography to
tackle complex spatial challenges. Key areas include
leveraging machine learning algorithms and
incorporating novel data sources, such as Internet of
Things (IoT) sensors, for real-time dynamic analyses.
However, the adoption of these techniques is
hindered by the limited proficiency of many
geographers in programming and computational
tools, which restricts the effective application of these
methods. Bridging this gap will necessitate
The Impact of Data Science on Geography: A Review with Optimization Algorithms
227
interdisciplinary collaboration and targeted training
programs that align technological advancements with
practical needs.
Emerging technologies, including artificial
intelligence and cloud computing, hold
transformative potential for spatial research. These
technologies enhance data processing capabilities,
facilitate the integration of large datasets, and
improve analytical precision. To fully capitalize on
these benefits, it is essential to address skill gaps and
develop user-friendly tools tailored to non-specialists.
By overcoming these challenges, we can enable
innovative, sustainable solutions for critical
geographic issues, such as disaster management,
urban planning, and resource optimization.
5 CONCLUSIONS
This study highlights the application of supervised
learning algorithms, specifically Logistic Regression
and Naïve Bayes, to optimize the systematic review
process. These techniques notably increased
efficiency and precision, while reducing the time
required for literature selection, thereby
demonstrating the transformative potential of data
science methodologies in systematic reviews. By
showcasing these improvements, the study illustrates
how data science approaches can effectively enhance
various research processes.
The investigation reveals that the majority of data
science studies were conducted in the United States
of America. This is largely attributed to country's
advanced technological infrastructure, prestigious
academic institutions, substantial financial
investment in research, supportive policies for
technological development, and the presence of major
technology companies that drive innovation and the
application of data science. In the field of geography,
this infrastructure and support have facilitated the
precise and efficient analysis of large spatial datasets.
Advanced methodologies, including machine
learning and data mining, hold significant potential to
improve natural disaster management, optimize
transportation routes, and promote sustainable
resource management.
However, several significant challenges remain.
The need for advanced computational and statistical
skills, alongside ongoing ethical concerns regarding
privacy and data usage, continues to present
obstacles. Overcoming these challenges requires an
interdisciplinary approach, integrating expertise from
various domains to develop robust and ethical
solutions.
This study highlights several practical
implications for applied geography:
1. Natural Disaster Management: Predictive
algorithms can substantially improve the accuracy of
natural disaster forecasts, enabling more effective and
timely evacuation plans and resource allocation.
Enhanced disaster management can save lives, reduce
economic losses, and strengthen community
resilience.
2. Transportation Optimization: Data science
techniques can facilitate the design of more efficient
transportation networks, reducing congestion and
improving logistics. Optimized transportation can
lead to better urban mobility, reduced environmental
impact, and a higher quality of life for city residents.
3. Sustainable Resource Management: By
identifying patterns and trends in resource use, data
science can promote sustainable practices. Effective
resource management ensures the better conservation
of natural resources and supports environmental
sustainability.
In summary, data science offers new perspectives
and solutions to complex geographical challenges.
This study underscores the importance of
interdisciplinary collaboration and methodological
innovation in harnessing the transformative potential
of data science within geography, offering a solid
foundation for future research and practical
applications.
ACKNOWLEDGMENT
This work was supported by the Fundação para a
Ciência e Tecnologia [UI/BD/151395/2021].
REFERENCES
Abdalla, H. B. (2022). A brief survey on big data:
technologies, terminologies and data-intensive
applications. Journal of Big Data, 9(1).
https://doi.org/10.1186/s40537-022-00659-3
Andrienko, G., Andrienko, N., & Weibel, R. (2017).
Geographic Data Science. IEEE Computer Graphics
and Applications, 37(5), 15–17. https://doi.org/
10.1109/mcg.2017.3621219
Anselin, L., & Rey, S. J. (2022). Open-source software for
spatial data science. Geographical Analysis, 54(3),
429–438. https://doi.org/10.1111/gean.12339
Arribas-Bel, D., & Reades, J. (2018). Geography and
computers: Past, present, and future. Geography
Compass, 12(10), e12403. https://doi.org/10.1111/
gec3.12403
GISTAM 2025 - 11th International Conference on Geographical Information Systems Theory, Applications and Management
228
Bafna, P., Pramod, D., & Vaidya, A. (2016). Document
clustering: TF-IDF approach. In 2016 International
Conference on Electrical, Electronics, and
Optimization Techniques (ICEEOT) (pp. 61-66). IEEE.
http://dx.doi.org/10.1109/ICEEOT.2016.7754750
Blei, D. M., & Smyth, P. (2017). Science and data science.
Proceedings of the National Academy of Sciences of
the United States of America, 114(33), 8689–8692.
https://doi.org/10.1073/pnas.1702076114
Bowlick, F. J., & Wright, D. J. (2018). Digital Data-Centric
Geography: Implications for Geography’s Frontier. The
Professional Geographer, 70(4), 687–694.
https://doi.org/10.1080/00330124.2018.1443478
Brady, H. E. (2019). The challenge of big data and data
science. Annual Review of Political Science, 22(1),
297–323. https://doi.org/10.1146/annurev-polisci-
090216-023229
Brunner, R., & Kim, E. (2016). Teaching data science.
Procedia Computer Science, 80, 1947–1956.
https://doi.org/10.1016/j.procs.2016.05.513
Cao, L. (2017). Data science: A comprehensive overview.
ACM Computing Surveys, 50(3), 1–42.
https://doi.org/10.1145/3076253
Cao, L. (2018). Data Science Thinking: the next scientific,
technological and economic revolution.
https://www.amazon.com/Data-Science-Thinking-
Scientific-Technological/dp/3319950916
Carniel, A., & Schneider, M. (2021). A Survey of Fuzzy
Approaches in Spatial Data Science. In 2021 IEEE
International Conference on Fuzzy Systems (FUZZ-
IEEE), 1-6. IEEE. https://doi.org/10.1109/
FUZZ45933.2021.9494437
Chen, K., Zhang, Z., Long, J., & Zhang, H. (2016). Turning
from TF-IDF to TF-IGM for term weighting in text
classification. Expert Systems with Applications, 66,
245-260. https://doi.org/10.1016/j.eswa.2016.09.009
Cleveland, W. S. (2001). Data Science: an Action Plan for
Expanding the Technical Areas of the Field of
Statistics. International Statistical Review, 69(1), 21–
26. https://doi.org/10.1111/j.1751-
5823.2001.tb00477.x
Davenport, T. & Patil, T. (2012). Data scientist: The sexiest
job of the 21st century. Harvard business review,
90(10), 70-76. https://hbr.org/2012/10/data-scientist-
the-sexiest-job-of-the-21st-century
Dhar, V. (2013). Data science and prediction.
Communications of the ACM, 56(12), 64–73.
https://doi.org/10.1145/2500499
Donoho, D. L. (2017). 50 years of data science. Journal of
Computational and Graphical Statistics, 26(4), 745–
766. https://doi.org/10.1080/10618600.2017.1384734
Fayyad, U. M., Piatetsky-Shapiro, G., & Smyth, P. (1996).
From data mining to knowledge discovery in databases.
Ai Magazine, 17(3), 37–54. https://doi.org/10.1609/
aimag.v17i3.1230
Ferdinands, G., Schram, R., de Bruin, J., Bagheri, A.,
Oberski, D. L., Tummers, L., & van de Schoot, R.
(2020). Active learning for screening prioritization in
systematic reviews - A simulation study.
https://doi.org/10.31219/osf.io/w6qbg
Hicks, S. C., & Irizarry, R. A. (2018). A guide to teaching
data science. The American Statistician, 72(4), 382–
391. https://doi.org/10.1080/00031305.2017.1356747
Jiang, H., & Chen, C. (2021). Data Science Skills and
Graduate Certificates: A Quantitative Text analysis.
Journal of Computer Information Systems, 62(3), 463–
479. https://doi.org/10.1080/08874417.2020.1852628
Jurafsky, D., & Martin, J. H. (2019). Speech and Language
Processing. Pearson.
Kantardzic, M. (2019). Data mining. https://doi.org/
10.1002/9781119516057
Kelleher, J., & Tierney, B. (2018). Data science. MIT Press.
Kumar, S., & Mohbey, K. K. (2022). A review on big data
based parallel and distributed approaches of pattern
mining. Journal of King Saud University - Computer
and Information Sciences, 34(5), 1639–1662.
https://doi.org/10.1016/j.jksuci.2019.09.006
Larose, D., & Larose, C. (2014). Discovering knowledge in
data: an introduction to data mining (Vol. 4). John
Wiley & Sons.
Martínez-Plumed, F., Contreras-Ochando, L., Ferri, C.,
Hernández-Orallo, J., Kull, M., Lachiche, N., Ramírez-
Quintana, M. J., & Flach, P. A. (2021). CRISP-DM
Twenty years Later: From data mining processes to data
science trajectories. IEEE Transactions on Knowledge
and Data Engineering, 33(8), 3048–3061.
https://doi.org/10.1109/tkde.2019.2962680
Muller, M., Lange, I., Wang, D., Piorkowski, D., Tsay, J.,
Liao, Q. V., ... & Erickson, T. (2019). How data science
workers work with data: Discovery, capture, curation,
design, creation. In Proceedings of the 2019 CHI
conference on human factors in computing systems, 1-
15. https://doi.org/10.1145/3290605.3300356
Naur, P. (1966). The science of datalogy. Communications
of the ACM, 9(7), 485. https://doi.org/10.1145/
365719.366510
Ólafsdóttir, R., & Tverijonaite, E. (2018). Geotourism: A
Systematic Literature Review. Geosciences, 8(7), 234.
https://doi.org/10.3390/geosciences8070234
O’Mara-Eves, A., Thomas, J., McNaught, J. et al. (2015).
Using text mining for study identification in systematic
reviews: a systematic review of current approaches.
Systematic Reviews, 4, 5. https://doi.org/10.1186/
2046-4053-4-5
Palomino, J., Muellerklein, O., & Kelly, M. (2017). A
review of the emergent ecosystem of collaborative
geospatial tools for addressing environmental
challenges. Computers, Environment and Urban
Systems, 65, 79–92. https://doi.org/10.1016/
j.compenvurbsys.2017.05.003
Pickering, C., Grignon, J., Steven, R., Guitart, D., & Byrne,
J. (2014). Publishing not perishing: how research
students transition from novice to knowledgeable using
systematic quantitative literature reviews. Studies in
Higher Education, 40(10), 1756–1769.
https://doi.org/10.1080/03075079.2014.914907
Provost, F., & Fawcett, T. (2013). Data Science and its
Relationship to Big Data and Data-Driven Decision
Making. Big Data, 1(1), 51–59. https://doi.org/
10.1089/big.2013.1508
The Impact of Data Science on Geography: A Review with Optimization Algorithms
229
Ros, R., Bjarnason, E., Runeson P. (2017). A Machine
Learning Approach for Semi-Automated Search and
Selection in Literature Studies. In Proceedings of the
21st International Conference on Evaluation and
Assessment in Software Engineering (EASE'17).
Association for Computing Machinery, New York, NY,
USA, 118–127. https://doi.org/10.1145/
3084226.3084243
Saltz, J. S., Dewar, N. (2019). Data science ethical
considerations: a systematic literature review and
proposed project framework. Ethics and Information
Technology, 21, 197–208. https://doi.org/10.1007/
s10676-019-09502-5
Saltz, J. S., & Krasteva, I. (2022). Current approaches for
executing big data science projects—a systematic
literature review. PeerJ. Computer Science, 8, e862.
https://doi.org/10.7717/peerj-cs.862
Scheider, S., Nyamsuren, E., Kruiger, H., & Xu, H. (2020).
Why geographic data science is not a science.
Geography Compass, 14(11). https://doi.org/10.1111/
gec3.12537
Shah, K., Patel, H., Sanghvi, D. J., & Shah, M. (2020). A
comparative analysis of logistic regression, random
forest and KNN models for the text classification.
Augmented Human Research, 5(1).
https://doi.org/10.1007/s41133-020-00032-0
Singleton, A., & Arribas-Bel, D. (2019). Geographic Data
Science. Geographical Analysis, 53(1), 61–75.
https://doi.org/10.1111/gean.12194
Song, I., & Zhu, Y. (2015). Big data and data science: what
should we teach? Expert Systems, 33(4), 364–373.
https://doi.org/10.1111/exsy.12130
Teh, H. Y., Kempa-Liehr, A. W., & Wang, K. I. (2020).
Sensor data quality: a systematic review. Journal of Big
Data, 7(1). https://doi.org/10.1186/s40537-020-0285-1
Tukey, J. W. (1977). Exploratory data analysis. Addison-
Wesley Publishing Company.
van de Schoot, R., De Bruin, J., Schram, R., Zahedi, P., De
Boer, J., Weijdema, F., Kramer, B., Huijts, M.,
Hoogerwerf, M., Ferdinands, G., Harkema, A.,
Willemsen, J., Ma, Y., Fang, Q., Hindriks, S.,
Tummers, L., & Oberski, D. L. (2021). An open source
machine learning framework for efficient and
transparent systematic reviews. Nature Machine
Intelligence, 3(2), 125–133. https://doi.org/10.1038/
s42256-020-00287-7
van der Aalst, W. (2016). Process Mining: Data Science in
action. Springer.
van der Aalst, W. M. P. (2017). Responsible Data Science:
Using event data in a “People friendly” manner. In
Lecture notes in business information processing (pp.
3–28). https://doi.org/10.1007/978-3-319-62386-3_1
van der Aalst, W. M. P., & Damiani, E. (2015). Processes
Meet Big Data: Connecting Data Science with Process
Science. IEEE Transactions on Services Computing,
8(6), 810–819. https://doi.org/10.1109/
tsc.2015.2493732
Vicario, G., & Coleman, S. (2019). A review of data science
in business and industry and a future view. Applied
Stochastic Models in Business and Industry, 36(1), 6–
18. https://doi.org/10.1002/asmb.2488
VoPham, T., Hart, J. E., Laden, F., & Chiang, Y. (2018).
Emerging trends in geospatial artificial intelligence
(geoAI): potential applications for environmental
epidemiology. Environmental Health, 17(1).
https://doi.org/10.1186/s12940-018-0386-x
Waller, M. A., & Fawcett, S. E. (2013). Data science,
predictive analytics, and big data: a revolution that will
transform supply chain design and management.
Journal of Business Logistics, 34(2), 77–84.
https://doi.org/10.1111/jbl.12010
Wang, Z., Nayfeh, T., Tetzlaff, J., O’Blenis, P., & Murad,
M. H. (2020). Error rates of human reviewers during
abstract screening in systematic reviews. PLoS ONE,
15(1), e0227742. https://doi.org/10.1371/
journal.pone.0227742
Xie, Y., Eftelioglu, E., Ali, R. Y., Tang, X., Yan, L., Doshi,
R., & Shekhar, S. (2017). Transdisciplinary
Foundations of Geospatial Data Science. ISPRS
International Journal of Geo-information, 6(12), 395.
https://doi.org/10.3390/ijgi6120395
Yan, D., & Davis, G. K. (2019). A first course in data
science. Journal of Statistics Education, 27(2), 99–109.
https://doi.org/10.1080/10691898.2019.1623136
Yang, F. J. (2018). An implementation of Naive Bayes
classifier. In International Conference on
Computational Science and Computational Intelligence
(pp. 301-306). https://doi.org/10.1109/
CSCI46756.2018.00065
Yu, B. & Kumbier, K. (2020). Veridical Data Science. In
Proceedings of the 13th International Conference on
Web Search and Data Mining (WSDM '20).
Association for Computing Machinery, New York, NY,
USA, 4–5. https://doi.org/10.1073/pnas.190132611
Yu, Z., & Menzies, T. (2019). FAST2: An intelligent
assistant for finding relevant papers. Expert Systems
With Applications, 120, 57–71. https://doi.org/
10.1016/j.eswa.2018.11.021
Yue, P., Ramachandran, R., Baumann, P., Khalsa, S. J. S.,
Deng, M., & Jiang, L. (2016). Recent activities in Earth
data science [technical committees]. IEEE Geoscience
and Remote Sensing Magazine, 4(4), 84-89.
https://doi.org/10.1109/MGRS.2016.2600528
Zhang, W., & Gao, F. (2011). An improvement to naive
bayes for text classification. Procedia Engineering, 15,
2160-2164. https://doi.org/10.1016/
j.proeng.2011.08.404
Zhu, Y., & Xiong, Y. (2015). Towards data science. Data
Science Journal, 14(0), 8. https://doi.org/10.5334/dsj-
2015-008
GISTAM 2025 - 11th International Conference on Geographical Information Systems Theory, Applications and Management
230