Building a Dataset for Trip Style Assessment Based on Real Trip Data

ıs M. P. Loureiro

1,3

, Artur J. Ferreira

1,2 a

and Andr

e R. Lourenc¸o

1,3 b

ISEL, Instituto Superior de Engenharia de Lisboa, Instituto Polit

ecnico de Lisboa, Portugal

Instituto de Telecomunicac¸

oes, Lisboa, Portugal

CardioID Technologies, Portugal

Keywords:

Clustering, Driver Behavior, Feature Engineering, Pay-as-you-Drive, Trip Dataset, Trip Driver Style.

Abstract:

In most countries, to have permission to drive vehicles on public roads one must have insurance against civil

liability for vehicles. In many cases, the insurance fees depend on the age of the driver, the number of years

one holds a driving license, and the driving history. The usual assumption taken by insurance companies that

younger drivers are always more risky than others are not always correct, penalizing young good drivers. In

this paper, we follow a pay-as-you-drive approach based on trip behavior data of different drivers. First, we

build a dataset from real trip data. Then, we apply a two-stage clustering approach to the dataset to identify

trip proﬁles. The experimental results show that we can cluster and identify distinct trip proﬁles in which

many trips have a non-aggressive style, some have an aggressive style and only a few are risky style trips. Our

solution ﬁnds application in fair insurance fee calculation or ﬂeet management tasks, for instance.

1 INTRODUCTION

The European Road Safety Observatory (European

Comission, 2021) reports that from 2010 to 2020

there were approximately 1.000.000 crashes and

1.250.000 injuries, per year, in the European Union.

In most countries, car insurance is required for one to

have permission to drive on public roads (NationMas-

ter, 2014). The owner or driver of a vehicle is respon-

sible for damages it may cause, in case of accident.

The interests of injured parties must be protected, re-

gardless of whether or not the person responsible for

the accident is ﬁnancially able to do so. The manda-

tory insurance against civil liability for motor vehi-

cles assures this protection. The insurance companies

typically deﬁne vehicle insurance fees as functions of

static variables, such as the age of the driver, the num-

ber of years one has a driving license, and the driving

history (Hutson, 2021).

Nowadays, the volume of data which can be ac-

quired and processed has increased exponentially,

yielding new opportunities. The acquisition of driving

behavior data brings opportunities for insurance and

transportation companies to provide solutions which

are more fair than the existing ones.

https://orcid.org/0000-0002-6508-0932

https://orcid.org/0000-0001-8935-9578

1.1 Goals of This Work

The key idea of this work is that the more aggres-

sive/risky the driver actually behaves, the more one

should pay by the insurance. On the other hand, the

more conservative, well-behaved drivers should pay

less money, leading to an overall fair system. A com-

pany that needs to perform ﬂeet management can also

take better planning decisions and to identify the best

drivers, based on the assessment of their real driving

style. This work has the following speciﬁc goals:

(1) from anonymous data records with events gener-

ated on trips from different drivers, perform fea-

ture engineering actions to build a dataset suited

for driver style analysis;

(2) identify and perform the necessary data prepro-

cessing operations on the dataset;

(3) analyze the output of clustering techniques on the

dataset, to assign a style to each trip.

The remainder of this paper is organized as fol-

lows. In Section 2, we overview related work and the

driver data acquisition setup. The developed approach

and the dataset details are described in Section 3. The

experimental results are reported in Section 4. Fi-

nally, Section 5 ends the paper with some concluding

remarks and directions for future work.

Loureiro, L., Ferreira, A. and LourenÃ

go, A.

Building a Dataset for Trip Style Assessment Based on Real Trip Data.

DOI: 10.5220/0012079400003541

In Proceedings of the 12th International Conference on Data Science, Technology and Applications (DATA 2023), pages 287-294

ISBN: 978-989-758-664-4; ISSN: 2184-285X

 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

287

2 RELATED WORK

In this section, we address related work and the state-

of-the-art regarding driver style identiﬁcation (Sec-

tion 2.1), the key aspects and goals of the i-DREAMS

project (Section 2.2) and the trip data acquisition

setup, devices, and events (Section 2.3).

2.1 Driving Style Taxonomy

In the literature, driving styles are established at dif-

ferent levels of speciﬁcation, ranging from simple

and single indicators, such as speeding or hard ac-

celeration, to general concepts, such as aggressive

driving or risky driving. These general concepts are

based on a combination of speciﬁc behavioral indica-

tors. In Taubman-Ben-Ari et al. (2004), eight driving

styles regarding as how one usually drives are identi-

ﬁed. These styles are organized into two main cate-

gories named as safe and unsafe. Within the safe cat-

egory, there are three styles, as follows. The distress-

reduction driving style is related to muscle relaxation

techniques while driving. The patient style refers to

calm and safe behaviors based on the “better safe

than sorry” motto. The careful style addresses cau-

tious behaviors and a state of mind to always re-

act well and quickly to unexpected actions of other

drivers. The unsafe category has ﬁve styles. The dis-

sociative driving style is based on unaware and mis-

judging behaviors. The anxious style is related to ner-

vous and worrying behavior while driving. Risky ap-

plies to drivers that like to take risks. The angry style

means that the driver will respond with aggressive

reactions to other drivers actions. Finally, the high-

velocity style corresponds to an impulsive attitude to-

wards driving faster than allowed by the road condi-

tions or feeling impatient when trafﬁc slows down.

Table 1 presents each style and its category.

A framework in which driving styles are seen in

terms of driving habits established as a result of indi-

vidual dispositions, as well as social norms and cul-

tural values was proposed by Sagberg et al. (2015). In

this framework, a global driving style is composed by

a set of speciﬁc driving styles. A speciﬁc driving style

Table 1: Driving categories (safe/unsafe) and the corre-

sponding driving styles (Taubman-Ben-Ari et al., 2004).

Driving Categories

Safe Unsafe

Distress-reduction Dissociative

Patient Anxious

Careful Risky

- Angry

- High-velocity

is a commonly adopted behavior while driving.

Typically, the set of parameters and variables for

proﬁling drivers cover the most important aspects of

driving (Singh and Kathuria, 2021). These parameters

refer to the dynamic driving environment, based on

crash data gathered from self-reported surveys (Peck

and Kuan, 1983; Singh and Kathuria, 2021) as fol-

lows: (i) speed; (ii) acceleration, braking, and jerks;

(iii) annual mileage; (iv) lateral maneuver, such as

swerving, lane changing, and sharp turns; (v) time and

space factors, such as time of day, day of the week,

and month; (vi) distraction.

Recently, a method to detect city potential hazard

driving zones, using bus tracking data was proposed

by Almeida et al. (2022). The approach consists in

mapping geolocation coordinates into road segments,

using bus speed, bus maximum allowed speed, and

bus acceleration to classify the driving behavior into

one of six categories to evaluate the driving behavior.

They conclude that some roads exhibit problems that

occur in the same period of the day while other roads

present circulation issues regardless of the time period

or day.

2.2 The i-DREAMS Project

This work arises as part of the European Hori-

zon2020 i-DREAMS project (i-DREAMS Team,

2021). The project proposes a solution with a set of

devices installed in vehicles to acquire data about the

driver’s status and driving behavior. The i-DREAMS

project aims to determine the safety tolerance zone

(STZ) for driving and interventions for driver-vehicle-

environment interactions.

Figure 1 depicts a global overview of the i-

DREAMS project. The Gateway (GW, in Figure 1)

is the central element of the system, by collecting the

data originated from other components and handling

data connectivity and transmission. It is an edge com-

puting device that determines the STZ, allows the trig-

gering of alarms, and real-time communication with

an application programming interface (API) or stor-

age for post-trip synchronization. The access to the

data is carried out through this API, which deﬁnes

that each data collection system is assigned to a ve-

hicle and not with a speciﬁc driver. Data is acquired

anonymously during trips from the moment the vehi-

cle’s engine is activated until the moment it is deac-

tivated. By interacting with this API, we retrieve trip

event data to build a dataset.

DATA 2023 - 12th International Conference on Data Science, Technology and Applications

288

Figure 1: Global overview of the i-DREAMS on-vehicle

systems (i-DREAMS Team, 2021).

2.3 Trip Data Acquisition

The devices in the i-DREAMS on-vehicle systems

produce data events for the identiﬁcation of the driv-

ing style. After data collection, the gateway makes it

available through an API. The types of data provided

by the main elements of the i-DREAMS architecture,

are summarized in Table 2. The project also includes

a smartphone application to detect driver distraction.

The CardioWheel device is installed on the steer-

ing wheel and the Mobileye (Mobileye, 2021) device

is placed on the windshield. Figure 2 shows the sen-

sors installed on a vehicle. CardioWheel can pro-

duce hands-on wheel detection events. The Drowsi-

ness Detector device detects events of the driver’s

drowsiness level according to the Karolinska sleepi-

ness scale (KSS) (Shahid et al., 2011). KSS is a 10-

point Likert scale (Joshi et al., 2015), to classify the

states of human sleepiness in periods of ﬁve minutes,

as described in Table 3.

Table 2: Data acquisition devices and data identiﬁers of the

i-DREAMS project.

Device Data Identiﬁer

CardioWheel LOD Event

Driver App Distraction

Drowsiness Detector Drowsiness

Gateway Ignition

Gateway Motion Sensor DrivingEvents

Global Navigation

Satellite System (GNSS) Geolocation Coordinates

Mobileye ME AWS

ME TSR

ME CAR

Safety iDreams Fatigue

Tolerance iDreams Headway

Zone (STZ) iDreams Overtaking

iDreams Speeding

The Gateway generates data for ignition and driv-

ing events, corresponding to harsh driving detec-

tion events. The GNSS module provides satellite-

based geolocation data, at a rate of one sample per

second. The Mobileye device produces data about

the safety and warning states in a certain timestamp

(ME AWS). It also generates data about the trafﬁc

signs (ME TSR) and vehicle parameters (ME CAR).

Finally, the STZ device produces fatigue, headway,

overtaking, and speeding events.

3 THE DEVELOPED APPROACH

In this section, we describe the key aspects of the de-

veloped approach. We begin with the overview of

the solution architecture, functionalities, and gener-

ated events (Section 3.1 and Section 3.2). Then, we

describe the key steps and procedures to build the

dataset from the events generated by the devices (Sec-

tion 3.3). The dataset pre-processing steps are ad-

dressed in Section 3.4. Finally, Section 3.5 describes

the clustering techniques we apply on the dataset.

3.1 Block Diagram of Our Approach

Figure 3 depicts the block diagram of the devel-

oped trip proﬁling system. It shows an example of

some data types obtained by interacting with the i-

DREAMS API. From the exported driving data events

trough the API, we perform a feature engineering step

to build a dataset. From the set of events described in

Table 2, we establish a set of features and we build

an unlabeled dataset. Then, we apply unsupervised

learning to cluster trips. The clustering output is ana-

lyzed by human experts to assess the trip style.

3.2 The i-DREAMS Generated Events

The CardioWheel device produces Hands-On Detec-

tion (LOD Event) events that signal the presence of

Table 3: The 10-point Karolinska Sleepiness Scale

(KSS) (Shahid et al., 2011).

Level Description

1 Extremely alert

2 Very alert

3 Alert

4 Rather alert

5 Neither alert nor sleepy

6 Some signs of sleepiness

7 Sleepy, but no effort to keep awake

8 Sleepy, but some effort to keep awake

9 Very sleepy, great effort to keep awake, ﬁghting sleep

10 Extremely sleepy, can’t keep awake

Building a Dataset for Trip Style Assessment Based on Real Trip Data

289

Figure 2: CardioWheel on the steering wheel (left). Mobileye and dashcam seen from the inside (middle) and from the outside

of the vehicle (right).

Figure 3: Proposed solution architecture to build and use a dataset with trip data for generic purposes. In this work, we also

aim to identify the driving style for each trip, using the dataset.

the driver’s hands on the steering wheel. Each event

consists of a timestamp and the Hands-On state (with

four possible values). These events allow us to deter-

mine how much time the driver spent with no hands,

left hand, right hand and both hands on the steering

wheel. The Driver App provides real-time mobile

phone use events, which are an indicator of distrac-

tion. Each sample is mapped by the type of event, mo-

bile usage start, and mobile usage end. These events

signal any distractions while driving. The Drowsiness

Detector system is able to detect events of the driver

drowsiness level based on the KSS scale, described in

Table 3, at certain timestamps.

The Gateway Motion Sensor system generates

driving behavior (DrivingEvents) data, which indicate

the occurrence of harsh acceleration, harsh braking,

and harsh cornering behaviors. Each event stores the

timestamp and the type of event (harsh acceleration,

braking, and cornering). Simultaneously, the maxi-

mum acceleration (in g-force) during an event is reg-

istered with the total duration of the event in seconds,

and the event severity level is measured with one of

three values: low, medium, or high. This indicator is

useful to calculate the number of times each type of

event occurs on a given trip. These events are widely

used in the literature for driver proﬁle classiﬁcation

applications and are mostly associated with aggres-

sive driving. The Gateway device is able to provide

events regarding if the ignition is on or off.

The GNSS system offers satellite-based geolo-

cation data, with a rate of about one sample per

second. The Mobileye Advanced Warning System

(ME AWS) produces data about the safety and warn-

ing state of the Mobileye system in a given timestamp.

A signiﬁcant part of this data refers to the system us-

age; for this work, the relevant events are:

• fcw - if true, forward collision warning is active;

• hw level - headway monitoring level (no car de-

tected, green car, red car, warning);

• ldw - if true, lane departure warning is active (left

or right);

• ldw left - if true, left lane departure warning is

active;

• ldw right - if true, right lane departure warning is

active;

• pcw - if true, pedestrian collision warning is ac-

tive;

• pedestrian dz - if true, pedestrian detected in dan-

ger zone;

• time indicator - states the lighting conditions

(day, dusk, or night);

DATA 2023 - 12th International Conference on Data Science, Technology and Applications

290

• tsr level - trafﬁc sign recognition level. Level of

warning according to units over the speed limit (in

km/h or mph);

• zero speed - if true, the vehicle is stopped.

The fcw parameter corresponds to the forward col-

lision warning (FCW), which is activated when the

system detects an imminent collision danger with cars

ahead and triggers visible and audible warnings. The

hw related parameters correspond to headway moni-

toring & warning (HMW), triggered when the vehicle

is too close to the vehicle ahead and will raise visible

and audible warnings. The ldw related parameters re-

fer to the lane departure warning (LDW) event, acti-

vated when the vehicle departs from the current lane

without turning signals and will also raise visible and

audible warnings.

The Mobileye system also provides information

about the vehicle parameters (ME CAR). Each data

sample holds the current speed of the vehicle (ex-

pressed in km/h) and the following parameters:

• brakes - if brakes are on or off;

• low beam - if the low beam is on or off;

• high beam - if the high beam is on or off;

• signal left - if the left turn signal is on or off;

• signal right - if the right turn signal is on or off;

• wipers - if wipers are on or off.

In this work, we are not using the Mobileye Trafﬁc

Sign Recognition (ME TSR) events.

The STZ device generates real-time fatigue inter-

vention (iDreams Fatigue) events with the following

levels:

• 0 - no warning (normal driving);

• 1 - visual warning (dangerous driving);

• 2 - visual and auditory warning (dangerous driv-

ing);

• 3 - frequent warnings (avoidable accident).

STZ also generates the following real-time head-

way intervention (iDreams Headway) events:

• -1 - no vehicle detected (normal driving);

• 0 - vehicle detected, but headway >= 2.5 (normal

driving);

• 1 - vehicle detected, headway < 2.5, but above

warning threshold (normal driving);

• 2 - ﬁrst warning stage (dangerous driving);

• 3 - second warning stage (avoidable accident).

STZ also provides real-time overtaking interven-

tion (iDreams Overtaking) events with these levels:

• 0 - no warning (normal driving);

• 1 - visual warning (normal driving);

• 2 - visual and auditory warning (dangerous driv-

ing);

• 3 - frequent warnings (avoidable accident).

Finally, we have the real-time speeding interven-

tion (iDreams Speeding) events:

• 0 - no warning (normal driving);

• 1 - visual indication (normal driving);

• 2 - visual speeding warning (dangerous driving);

• 3 - visual and auditory warning (avoidable acci-

dent).

3.3 Dataset Construction

After the analysis of the available events, the next step

is to build a dataset through these events. To do this,

for each event we devised several features, which we

consider to be the most appropriate to determine the

style of driving on a given trip. The events provided

vary from trip to trip, since we are dealing with a nat-

uralistic driving environment (et al, 2011). This in-

cludes aspects of vehicle movement, driver behavior,

and the direct environment. As a result, there may be

sensor failures on a given trip and as a consequence

some features may have missing values, handled as

described in Section 3.4. Although some events may

not be available, all trips are represented by the fol-

lowing mandatory set of features:

• trip start - date and time trip started; this feature

is also relevant to input missing values for other

features such as the trip lighting condition;

• trip end - date and time trip ended; this feature

is also relevant to input missing values for other

features such as the trip lighting condition;

• distance - trip distance in kilometers;

• duration - trip duration in seconds.

The available data was collected and the features

were designed. After a deeper analysis of the data,

only the most relevant features were selected accord-

ing to the literature. Through interaction with the

i-DREAMS API, we gather trip data from April 1,

2021, to July 20, 2022, yielding a total of 17138 trips.

The designed features are organized by the type of

system that originated them. According to the four

mandatory features, for each of the 17138 trips, 75

features were generated. However, for this collection

of trips, the Hands-On and Drowsiness systems were

not available, so only 67 features were kept. As a

consequence, at this stage the dataset was composed

of n = 17138 instances (trips) and d = 67 features.

Building a Dataset for Trip Style Assessment Based on Real Trip Data

291

3.4 Dataset Pre-Processing

After constructing the dataset, we performed data pre-

processing by summarizing data and imputing miss-

ing values. We removed trips with less than one

minute, trips with less than 1.5 km, and trips in which

all features had missing values. To handle missing

values, we locate features with such a problem. Sub-

sequently, these values were imputed as follows:

• if the percentage of missing values for feature X

is less than or equal to 50%, then these values are

ﬁlled with the mean, median, or mode of the re-

maining values on that column;

• if this percentage is greater than or equal to 50%,

the feature is removed.

We applied data normalization techniques because

many clustering algorithms are based on distances,

and the different orders of magnitude of the fea-

ture values inﬂuences the result (Witten et al., 2016).

Thus, we normalize the data by trip distance or by

trip duration, separately. Then, we evaluate both and

choose the one with the best results. We have also

removed the features unrelated with the objective of

this work (to identify a trip style), yielding a dataset

with n = 15002 instances (trips) and d = 53 features.

3.5 Clustering Algorithms

The clustering phase is organized into two cluster-

ing stages. In the ﬁrst, we applied different cluster-

ing techniques. Initially, we use the K-Means algo-

rithm (Hartigan and Wong, 1979). The strategy con-

sisted in initially trying to identify the best value of K

through the elbow and silhouette methods. Density-

based spatial clustering (DBScan) (Ester et al., 1996)

was also applied to check for outliers in the dataset.

The parameters were set as in Sander et al. (1998).

Another technique used was gaussian mixture models

(GMM) (Reynolds, 2009) that states the probability

that each instance of the dataset belongs to a Gaussian

distribution. The advantage of using this type of clus-

tering is that it can better deal with different shapes of

the data distribution. The K-Means algorithm with

Euclidean distance handles spherical shapes better.

The number of components/gaussians used was the

same as K in K-Means. Finally, evidence accumula-

tion clustering (EAC) by Fred and Jain (2002) with

K-Means (Okun and Valentini, 2008) was also evalu-

ated.

The objective of the second stage of clustering was

to use the best combination of the ﬁrst stage, to check

if the clusters found could be further decomposed into

sub-clusters, given more details on the drivers pro-

ﬁles. After running the clustering algorithm, we as-

sign a trip style to each cluster, by human analysis

and inspection.

4 EXPERIMENTAL EVALUATION

In this section, we address the experimental evalua-

tion of the developed solution, focusing on the goal to

devise trip styles with clustering techniques, from the

built dataset. The source code was written in Python.

Section 4.1 describes the standard clustering evalua-

tion metrics. In Section 4.2, we address the use of

dimensionality reduction techniques. Finally, Sec-

tion 4.3 reports the experimental results of clustering

and the identiﬁcation of trip styles.

4.1 Clustering Evaluation Metrics

The clustering algorithms results were evaluated with

the Calinski-Harabasz (CH) (Cali

nski and Harabasz,

1974), Silhouette (Sil) (Rousseeuw, 1987), and

Davies-Bouldin (DB) (Davies and Bouldin, 1979)

scores. The CH metric, also known as the variance

ratio criterion, is the ratio of the sum of between-

clusters dispersion and inter-cluster dispersion for all

clusters; the higher the score, the better the perfor-

mance. The Silhouette score ranges from -1 (worst)

to 1 (best); it is computed as functions of the average

intra-cluster distance and the average inter-cluster dis-

tance. The DB score is a function of the ratio of the

within cluster scatter to the between cluster separation

and a lower value means better clustering.

4.2 Dimensionality Reduction

We have considered the use of dimensionality re-

duction techniques to reduce the number of features,

and avoid the curse of dimensionality (Bishop and

Nasrabadi, 2006). Since the dataset does not have

pre-labeled data, we chose an unsupervised feature

reduction (FR) technique, namely principal compo-

nent analysis (PCA) (Jolliffe, 2002). The use of PCA

yielded the following results: (i) with data normaliza-

tion, the number of features was reduced from d = 53

to m = 17; (ii) without normalization, we got m = 1

which is an extreme dimensionality reduction causing

loss of information. Thus, we opted to consider the

data with normalization by distance or by duration.

4.3 Clustering the Trip Dataset

For the ﬁrst stage clustering, we use the normalization

and dimensionality reduction techniques. The experi-

DATA 2023 - 12th International Conference on Data Science, Technology and Applications

292

mental results of the best combination, for each algo-

rithm, are reported in Table 4.

Table 4: First stage clustering results of the n = 15002 trips,

by K-Means, DBScan, GMM, and EAC algorithms evalu-

ated with the CH, Sil, and DB scores. We aim to maximize

the CH and Sil scores (denoted by the ↑ symbol) and to

minimize the DB score (the ↓ symbol).

Scores K-Means DBScan GMM EAC

CH ↑ 7258.919 2365.621 5602.887 1965.691

Sil ↑ 0.488 0.650 0.382 0.807

DB ↓ 1.164 1.372 1.377 0.482

# cluster 0 11916 14550 5287 54

# cluster 1 3086 452 9715 14948

Total 15002 15002 15002 15002

Although the results state that EAC has better

DB and Silhouette scores, further cluster analysis has

shown that its results yield large imbalance between

clusters on different runs of the algorithm over the

training data. We choose the K-Means algorithm be-

cause it was more consistent on successive runs and

the scores were better overall when comparing to DB-

Scan and GMM. In summary, normalization by dis-

tance, dimensionality reduction by PCA, and cluster-

ing with K-Means yield the best results. Moreover,

using PCA evaluation we found that the most im-

portant features (for the ﬁrst PCA component) were:

speed, number of speeding events, number of times

breaks are on, number of harsh cornering events, and

the number of harsh acceleration events.

The second stage of clustering consisted in apply-

ing K-Means individually to cluster 0 and to cluster 1,

in Table 4, to check if they can be further decomposed

into sub-clusters. The experimental results showed

that cluster 1 attained by K-Means was further decom-

posed in two clusters. Table 5 provides the scores for

the second stage of clustering.

Table 5: Second stage clustering results with CH, Sil, and

DB scores. The 3086 instances of # cluster 1 by K-Means,

in Table 4, are further split into two clusters with 3033 and

53 instances, respectively.

CH ↑ Sil↑ DB↓ # cluster 1.1 # cluster 1.2

1144.656 0.735 0.548 3033 53

Table 6 and Figure 4 report the ﬁnal results for

unsupervised learning. After human inspection of the

clusters and feature values, cluster 0 was assigned as

aggressive trips while cluster 1 was deﬁned as non-

aggressive trips, and cluster 2 holds the risky trips. In

detail, cluster 0 represents trips with low-speed val-

ues per km and low speeding events per km. Cluster

1 represents trips with low to medium speed per km

and low to medium speeding events per km. In terms

of the number of braking events per km, cluster 1 has

more events. For cluster 2, the number of events ex-

ceeding the speed limit was the most important fea-

ture to differentiate, aggressive trips from risky trips.

Risky trips involve more speeding events than aggres-

sive trips.

Table 6: Second stage ﬁnal clustering results - distribution

of the n = 15002 trips per K = 3 clusters, holding the trip

styles, after human expert analysis.

Cluster ID Description Number of instances

0 Aggressive trips 3033

1 Non-Aggressive trips 11916

2 Risky trips 53

Total 15002

Figure 4: A 3D visualization of the ﬁnal clustering results,

with the aggressive trips (0), non-aggressive trips (1), and

risky (2) trip styles.

5 CONCLUSIONS

The monitoring of driver behavior based on real trip

data acquired from a vehicle yields many useful ap-

plications. In this paper, our key goal was to show

that it is possible to devise a driver style identiﬁcation

solution, from a built dataset from the i-DREAMS

project devices. Using anonymous trip data, our ap-

proach was to acquire data from a set of trip events

and to perform feature extraction to build a dataset.

Then, we apply some pre-processing techniques on

the dataset. Afterwards, we perform clustering, eval-

uated with standard metrics. The dataset construction

and pre-processing phases were the ones that most in-

ﬂuenced the clustering results. By normalizing the

trips by distance and performing dimensionality re-

duction by principal component analysis we achieved

the best results. We applied a two-stage clustering

Building a Dataset for Trip Style Assessment Based on Real Trip Data

293

strategy, in which the K-Means algorithm has shown

to be the best. The most challenging part of the clus-

tering phase was to assign meaning (trip style) to the

clusters, which implied further human analysis and

inspection on the clustering results. Thus, the pro-

posed solution was found to be able to identify the

trip styles in a satisfactory way.

As future work, we may use different clustering

algorithms and perform a deeper analysis of the ev-

idence accumulation algorithm. We can use feature

selection techniques instead of feature reduction, to

identify the most relevant original features for the

trip style identiﬁcation task and the smallest subset

of these features that yield robust classiﬁcation.

ACKNOWLEDGEMENTS

This work was co-funded by the EU H2020

i-DREAMS project (Project Number: 814761)

funded by European Commission under the MG-

2-1-2018 Research and Innovation Action (RIA),

www.idreamsproject.eu, and by the European Re-

gional Development Fund (FEDER), through Por-

tugal 2020, under the Operational Competitiveness

and Internationalization (COMPETE 2020) and Lis-

boa 2020 programmes (grants no. 069918 “Cardi-

oLeather”).

REFERENCES

Almeida, A., Br

as, S., Sargento, S., and Oliveira, I. (2022).

Using bus tracking data to detect potential hazard driv-

ing zones. In Pattern Recognition and Image Analysis,

pages 667–679, Cham. Springer International Pub-

lishing.

Bishop, C. and Nasrabadi, N. (2006). Pattern recognition

and machine learning, volume 4. Springer.

Cali

nski, T. and Harabasz, J. (1974). A dendrite method

for cluster analysis. Communications in Statistics,

3(1):1–27.

Davies, D. and Bouldin, D. (1979). A cluster separation

measure. IEEE Transactions on Pattern Analysis and

Machine Intelligence, PAMI-1(2):224–227.

Ester, M., Kriegel, H., Sander, J., and Xu, X. (1996). A

density-based algorithm for discovering clusters in

large spatial databases with noise. In Knowlede Dis-

covery in Databases (KDD), volume 96, pages 226–

231.

et al, I. S. (2011). Towards a large scale European naturalis-

tic driving study: ﬁnal report of PROLOGUE: deliver-

able D4.2. Technical report, SWOV Institute for Road

Safety Research.

European Comission (2021). Road safety,

https://ec.europa.eu/transport.

Fred, A. and Jain, A. (2002). Data clustering using evidence

accumulation. In International Conference on Pattern

Recognition, volume 4, pages 276–280 vol.4.

Hartigan, J. and Wong, M. (1979). Algorithm as 136:

A k-means clustering algorithm. Journal of the

Royal Statistical Society. series c (applied statistics),

28(1):100–108.

Hutson, D. (2021). How car insurance premiums are

calculated https://www.comparethemarket.com/car-

insurance/content/what-impacts-upon-your-car-

insurance/.

i-DREAMS Team (2021). A smart driver and road environ-

ment assessment and monitoring system.

Jolliffe, I. (2002). Principal component analysis. Springer

Series in Statistics.

Joshi, A., Kale, S., Chandel, S., and Pal, D. (2015). Lik-

ert scale: Explored and explained. British Journal of

Applied Science & Technology, 7(4):396–403.

Mobileye (2021). Autonomous driving & ADAS (Ad-

vanced Driver Assistance Systems).

NationMaster (2014). Motor vehicle ownership per 1000

inhabitants, https://ourworldindata.org/.

Okun, O. and Valentini, G. (2008). Supervised and Un-

supervised ensemble methods and their applications,

volume SCI, 126. Springer.

Peck, R. and Kuan, J. (1983). A statistical model of individ-

ual accident risk prediction using driver record, terri-

tory and other biographical factors. Accident Analysis

& Prevention, 15(5):371–393.

Reynolds, D. (2009). Gaussian mixture models. Encyclo-

pedia of biometrics, 741(659-663).

Rousseeuw, P. (1987). Silhouettes: A graphical aid to

the interpretation and validation of cluster analysis.

Journal of Computational and Applied Mathematics,

20(1):53–65.

Sagberg, F., Selpi, Piccinini, G., and Engstr

om, J. (2015).

A review of research on driving styles and road safety.

Human factors, 57(7):1248–1275.

Sander, J., Ester, M., Kriegel, H., and Xu, X. (1998).

Density-based clustering in spatial databases: The al-

gorithm GDBSCAN and its applications. Data Mining

and Knowledge Discovery, 2(2):169–194.

Shahid, A., Wilkinson, K., Marcu, S., and Shapiro, C.

(2011). Karolinska sleepiness scale (KSS) - stop, that

and one hundred other sleep scales. A. Shahid and K.

Wilkinson and S. Marcu and C. Shapiro (eds), pages

209–210.

Singh, H. and Kathuria, A. (2021). Proﬁling drivers to as-

sess safe and eco-driving behavior–a systematic re-

view of naturalistic driving studies. Accident Analysis

& Prevention, 161:106349.

Taubman-Ben-Ari, O., Mikulincer, M., and Gillath,

O. (2004). The multidimensional driving style

inventory–scale construct and validation. Accident

Analysis & Prevention, 36(3):323–332.

Witten, I., Frank, E., Hall, M., and Pal, C. (2016). Data min-

ing: practical machine learning tools and techniques.

Morgan Kauffmann, 4th edition.

DATA 2023 - 12th International Conference on Data Science, Technology and Applications

294