Engineers and data scientists are tempted either to collect
“as much data as possible,” which can be costly, or, for
simplicity, to sample from arbitrarily selected vehicles.
They usually do not have complete knowledge of the general
demographics, geographical similarities, vehicle
configurations, etc., as these change over time. Therefore,
it is important to have a centralized system that assists in
collecting the proper amount of data from the required
signals, sampled correctly. Otherwise, studies or machine
learning models can be biased and under-perform, as shown
in (Hasanin et al., 2019) and (Johnson and Khoshgoftaar,
2020).
To solve this problem, we developed an intelligent
sampling system for connected vehicle feature analytics
that combines connected vehicle domain knowledge and
analytical results with data sampling techniques, while
balancing the budget against the desired statistical
significance whenever possible. It assists users in
determining which signals to use, selecting a sampling
technique, and choosing a sample suitable for their studies
while meeting their budget constraints.
This paper is organized as follows. Section 2 de-
scribes common technologies used in vehicles, and
motivates the need for an intelligent sampling sys-
tem. Section 3 describes our system architecture and
components. Section 4 demonstrates using the sys-
tem for analyzing feature usage on different types of
roads. Section 5 describes a case which models fuel
consumption as a function of tire pressure. Section 6
concludes the paper.
2 BACKGROUND
Big data challenges related to our work have been
known for several years, even before cloud solutions
became powerful. As computational power improved,
data collection also increased, and therefore, these
challenges remain. An obvious approach to dealing with
the computational burden created by big data is sampling.
What is not obvious is how to perform the sampling. For
example, in (Casamayor-Pujol et al., 2023), the authors
designed a scalable “Intelligent Sampling” method to assist
in scheduling workloads in a large-scale heterogeneous
computing continuum. This, of course, is abstracted away
from the end users, who are interested in building models,
and it therefore serves as a suitable example of an
intelligent sampling system.
A comprehensive list of sampling techniques is found
in (Djouzi et al., 2023). Some of these methods are
very well known. We review some of the fairly recent
methods in adaptive sampling. In (John and Langley, 1996),
the authors introduced a progressive sampling method and
the concept of “Probably Close Enough” (PCE). The idea
behind PCE is to obtain a sample good enough that using
the entire dataset is very unlikely to improve a mining
algorithm any further. The authors discussed static versus
dynamic sampling, and their work aimed to deal with big
data efficiently. In (Satyanarayana, 2014), the authors
proposed Generalized Dynamic Adaptive Sampling (GDAS),
an adaptive sampling technique that tackles the limitations
of progressive sampling listed in their work. In (Djouzi
et al., 2022), the authors proposed a new adaptive sampling
method, Subsampled Double Bootstrap GDAS (SDBGDAS), which
improves on GDAS (Satyanarayana, 2014) and allows adaptive
methods to scale to big data. In (Loyola R et al., 2016),
various sampling methods are discussed, and the authors
propose a Smart Sampling and Incremental Function Learning
Algorithm to find a Probably Approximately Correct
Computation (PACC) regression model.
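To make the progressive sampling idea concrete, the following is a minimal sketch and not the algorithm of (John and Langley, 1996). As simplifying assumptions, it estimates a mean rather than training a mining algorithm, and it uses a normal-approximation confidence half-width as a stand-in for a "close enough" stopping rule. The function name `progressive_mean`, its parameters, and the synthetic data are all hypothetical.

```python
import random
import statistics


def progressive_mean(population, start=100, growth=2, eps=0.5, z=1.96, seed=0):
    """Grow a random sample geometrically until the mean is 'close enough'.

    Stops when the approximate confidence half-width of the sample mean
    drops to eps or below, or when the whole population has been used.
    Returns (estimated mean, sample size used).
    """
    if len(population) < 2:
        raise ValueError("need at least two observations")
    rng = random.Random(seed)
    # Shuffle once so that every prefix is a simple random sample.
    shuffled = list(population)
    rng.shuffle(shuffled)

    n = max(2, min(start, len(shuffled)))
    while True:
        sample = shuffled[:n]
        mean = statistics.fmean(sample)
        # z times the standard error approximates the half-width.
        half_width = z * statistics.stdev(sample) / n ** 0.5
        if half_width <= eps or n == len(shuffled):
            return mean, n
        n = min(n * growth, len(shuffled))


# Synthetic signal values (assumed): mean 50, standard deviation 10.
data_rng = random.Random(42)
population = [data_rng.gauss(50.0, 10.0) for _ in range(200_000)]
estimate, used = progressive_mean(population, eps=0.5)
print(f"estimate={estimate:.2f} using {used} of {len(population)} records")
```

In this sketch the loop stops after reading only a small fraction of the records, which is the practical appeal of progressive schemes: the cost is driven by the required precision rather than by the size of the full dataset.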
Other work, such as (Zhang and Wang, 2021) and
(Ai et al., 2021), investigated methods to deal with
distributed and massive data. The idea is to optimally
select distributed sub-data, either to calculate summary
statistics on the edge and send them to a central server
or to build generalized linear models (GLM). Fuzzy methods
have also been proposed to reduce sample size, such as in
(He et al., 2015).
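The edge-side summary statistics mentioned above can be illustrated with a short sketch. It covers only the aggregation step, not the optimal sub-data selection of the cited works. It assumes each edge node reports a count, a mean, and a sum of squared deviations, which the server merges pairwise using the standard parallel variance update. The names `Summary`, `summarize`, `merge`, and `sample_variance`, as well as the tire pressure readings, are hypothetical.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Summary:
    n: int       # number of observations
    mean: float  # mean of the observations
    m2: float    # sum of squared deviations from the mean


def summarize(values):
    """Compute a summary on the edge, e.g., per vehicle or per region."""
    n = len(values)
    if n == 0:
        return Summary(0, 0.0, 0.0)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return Summary(n, mean, m2)


def merge(a, b):
    """Combine two summaries on the server without access to raw data."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta ** 2 * a.n * b.n / n
    return Summary(n, mean, m2)


def sample_variance(s):
    """Unbiased sample variance; requires at least two observations."""
    return s.m2 / (s.n - 1)


# Hypothetical tire pressure readings (psi) held by three edge nodes.
partitions = [[31.0, 32.5, 30.8], [29.9, 33.1], [32.0, 31.4, 30.2, 31.9]]
combined = reduce(merge, map(summarize, partitions))
print(combined.n, combined.mean, sample_variance(combined))
```

Only three numbers per node cross the network in this sketch, so the transfer cost is independent of how many raw records each node holds.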
Whether simple random sampling or more advanced methods
are used, it is clear that challenges arise when dealing
with big data, and that good sampling techniques help
address these challenges. As noted earlier, the size of
connected vehicle data in the cloud grows at least
polynomially (ignoring any changes in regulations, consent
agreements, etc.). A proof is offered here before
proceeding to the next section.
2.1 Polynomial Growth of Connected Vehicle Data
To motivate the need for intelligent sampling systems,
we first show that the data will grow at polynomial
rate during the next few years. Let $S_y$ be the number
of connected vehicles sold in year $y$, and assume that
$y_1 < y_2 \implies |S_{y_1}| < |S_{y_2}|$. In other words,
the sales of connected vehicles each year are more than
the previous year (unsaturated market). Note that we only
consider connected vehicles. Therefore, the assumption
$|S_{y_1}| \le |S_{y_2}|$ holds until almost all vehicles on
the road are connected vehicles. Let $d_i$ be the amount
of data collected from model year $y_i$. Assuming $d_i$
is proportional to $S_{y_i}$, we have
$y_1 < y_2 \implies d_1 < d_2$,