An Automated Clustering Process for Helping Practitioners to Identify

Similar EV Charging Patterns across Multiple Temporal Granularities

René Richard (1, a), Hung Cao (1, b) and Monica Wachowicz (1, 2, c)

(1) People in Motion Lab, University of New Brunswick, Canada

(2) RMIT, Australia

Keywords:

Agglomerative Hierarchical Clustering, EV Adoption, Charging Infrastructure Usage Patterns, Clustering

Process, Cluster Validity Indices.

Abstract:

Electric vehicles (EVs) are part of the solution towards cleaner transport and cities. Clustering EV charging

events has been useful for ensuring service consistency and increasing EV adoption. However, clustering

presents challenges for practitioners when ﬁrst selecting the appropriate hyperparameter combination for an

algorithm and later when assessing the quality of clustering results. Ground truth information is usually not

available for practitioners to validate the discovered patterns. As a result, it is harder to judge the effectiveness

of different modelling decisions since there is no objective way to compare them. In this work, we propose

a clustering process that allows for the creation of relative rankings of similar clustering results. The overall

goal is to support practitioners by allowing them to compare a cluster of interest against other similar clusters

over multiple temporal granularities. The efﬁcacy of this analytical process is demonstrated with a case study

using real-world Electric Vehicle (EV) charging event data from charging station operators in Atlantic Canada.

1 INTRODUCTION

Globally, national and local government commit-

ments to electrify the transport sector will have a posi-

tive impact on smart cities. The vision for smart cities

fosters advanced and modern urbanization, which re-

sults in a core infrastructure that enables a good qual-

ity of life for citizens and the sustainable management

of natural resources. Supporting the usage of EVs

contributes to improved air quality, sustainable mo-

bility and therefore contributes to this vision.

The high capital costs of setting up public charging infrastructure and the usage of public funds to foster a shift to EVs necessitate informed decision making at all stages of the adoption life-cycle. Given early EV adoption challenges, some charging stations can be under-utilized while others serve a disproportionate number of users. Clustering stations together based on

utilization patterns is a useful planning tool for opera-

tors. Additionally, as vehicle electriﬁcation grows, so

does the demand for electricity and the possible strain

on power grids. Utilities and other power generators

need to prepare for increased demand. Accurate load forecasting is one tool which can help operators ensure service consistency.

(a) https://orcid.org/0000-0002-1342-6225

(b) https://orcid.org/0000-0002-0788-4377

(c) https://orcid.org/0000-0002-4659-0101

Clustering is an unsupervised learning method

which assists practitioners in discovering hidden pat-

terns from a data set. It has been utilized by prac-

titioners in the energy domain to group similar con-

sumers, predict future demand, and increase EV

adoption. Statistical models built with data from charging stations that have similar charging patterns reportedly achieve superior accuracy (Straka and Buzna, 2019). Therefore, energy load forecasting methods might perform better when applied to homogeneous clusters of stations than when applied to all stations. The patterns in energy usage behavior are core

to improving services provided by utility companies,

which are responsible for managing peaks and imbal-

ances in charging infrastructure usage patterns (Igle-

sias and Kastner, 2013).

Although clustering is widely used in many knowledge domains, it remains arduous for practitioners to select the proper clustering algorithm and hyperparameter combination and later assess the quality of clustering results. The subjectivity of the expert knowledge needed to determine the level of "success" achieved during clustering is likely to be one of the main reasons why existing AutoML frameworks tend to focus on supervised learning tasks that require labeled data as input (Oliveira, 2019). One of the challenges is that the identification of the most similar clusters can be subjective, and it usually requires multiple approaches to automate this process (Poulakis, 2020). The difficulty in clustering is finding the results that align with a practitioner's needs because, in many complex data sets, there are several plausible clusters, and practitioners may have different priorities and preferences. An unsupervised clustering algorithm has no way to intrinsically infer which clusters embody desired priorities and preferences (Bae et al., 2020).

Richard, R., Cao, H. and Wachowicz, M.

An Automated Clustering Process for Helping Practitioners to Identify Similar EV Charging Patterns across Multiple Temporal Granularities.

DOI: 10.5220/0010485000670077

In Proceedings of the 10th International Conference on Smart Cities and Green ICT Systems (SMARTGREENS 2021), pages 67-77

ISBN: 978-989-758-512-8

Additionally, in data with a temporal component, such as EV charging events, assessing the structural consistency of discovered clusters over different temporal granularities is often a lengthy manual undertaking. Metrics such as inter-cluster separation, intra-cluster homogeneity, density, and uniformity of cluster sizes can be computed to determine structural consistency. However, the question of how to select

a particular clustering result that is more meaningful

than another based on user priorities and preferences,

still depends on the practitioner's capacity to distinguish similar clusters. To address this challenge, this research work explores whether, given a clustering result of interest, the process of objectively highlighting and recommending similar clustering results can be automated. Such automation would support practitioners in evaluating how clustering patterns persist over multiple temporal granularities and in finding meaningful clusters according to their preferences and priorities. The overall motivation of this

work is to assist the practitioner in navigating multiple

clustering results for different temporal partitions of

the same data. Providing the practitioner with an ini-

tial ranked list of clustering results and a mechanism

to identify clustering similarities can assist practition-

ers in downstream analytical tasks such as improving

regression or classiﬁcation model performance.

Therefore, we propose a clustering process which uses internal cluster validity indices to enable the identification of similar clustering results across various temporal slices of data. Of primary concern in this work is the comparison of clustering results from a priori selected temporal granularities (e.g., weekly, monthly and seasonal) and how to support practitioners in identifying similar results using a reference result of interest. A case study using real-world charging event data from EV station operators in Atlantic Canada is used to evaluate the proposed clustering process in identifying similar clusters of charging stations according to their usage patterns (e.g., high vs. low usage).

The scientiﬁc contributions of this paper are as fol-

lows.

• Our work is unique in proposing a combination of eight internal cluster validity indices to characterize clusters at different temporal granularities (e.g., weekly, monthly or seasonally). Previous research work has usually used these indices in isolation.

• These internal validity indices are then used to

compute a proximity measure (i.e. Euclidean dis-

tance) for helping practitioners to identify similar

clusters. To the best of our knowledge, this clus-

tering procedure has never been used as an objec-

tive measure to reduce the cognitive load of prac-

titioners in understanding clustering results.

• The use of real-world data from EV charging sta-

tions advances the understanding of charging be-

havior. To the best of our knowledge, no previous

work has implemented an end-to-end automated

clustering process that facilitates the comparison

of clustering results by practitioners with differ-

ent priorities and preferences.

The rest of the paper is organized as follows. In

Section 2, previous research work is described. Sec-

tion 3 describes the proposed clustering process un-

derpinning our work. Section 4 provides a detailed

description of the real-world EV charging event data

and the end-to-end automated implementation of our

proposed clustering process. In Section 5, we discuss

the results. Finally, Section 6 concludes and indicates

future research work.

2 RELATED WORK

In clustering, various steps must be taken by a prac-

titioner such as the selection of an appropriate algo-

rithm and its hyperparameters, the choice of an ad-

equate proximity measure, and how to validate the

modeling results. Fig. 1 outlines a typical cluster anal-

ysis process.

Additionally, the temporal granularity of an algo-

rithm’s input data can generate different clusters over

time. A common problem in clustering is how to ob-

jectively and quantitatively evaluate the results. Clus-

ter validation is an important task in the clustering

process because it aims to compare clustering results

and solve the question of optimal cluster count. Many

internal validity indices have been proposed to as-

sess the level of “success” that a clustering algorithm

achieves in ﬁnding the natural clusters in data without

any class label information (Rendón et al., 2011; Liu et al., 2010).

SMARTGREENS 2021 - 10th International Conference on Smart Cities and Green ICT Systems

Figure 1: The Main Tasks of a Clustering Process as De-

scribed in (Messina, nd).

The preponderance of studies validating cluster

results have been focused on the computation of indi-

vidual cluster validity indices (CVI), which are usu-

ally selected to determine the relative performance of

clustering results. In (Arbelaitz et al., 2013), Arbe-

laitz et al. perform an extensive comparative study

of 30 CVI that are evaluated by using an experimen-

tal setup which recommends the “best” partitioning

in multiple data sets where ground truth information

exists. The optimal suggested number of partitions is

deﬁned as the one that is the most similar to the cor-

rect one measured by partition similarity measures.

The authors found that noise and cluster overlap had

the greatest impact on CVI performance. Some in-

dices performed well with high dimensionality data

sets and in cases where homogeneity of the cluster

densities disappeared. The conclusion in this work

suggests using several CVI to obtain robust results.

In the energy domain, clustering has played an important role in revealing new insights into energy usage behavior, in particular EV charging demand (Al-Ogaili et al., 2019). For example, in (Straka and Buzna, 2019), the authors demonstrated the potential of clustering to understand the usage patterns related to segments of charging stations by comparing k-means, hierarchical, and DBSCAN algorithms. The clustering algorithms successfully identified four groups of EV charging stations characterized by distinct usage patterns.

In contrast, very few attempts have been found in

exploring CVI for evaluating the clustering results.

In (Xydas et al., 2016), the Davies-Bouldin index

is used to determine the best value for the cluster

count parameter using the k-means algorithm. Sun

et al. (Sun et al., 2020) proposed a time series clus-

tering method using a modiﬁed Euclidean distance to

group the similar charging tails from ACN-Data col-

lected from smart EV charging stations. In this work,

they evaluated their clustering results with Dynamic

Time Warping distance (DTW) and Euclidean dis-

tance method using the silhouette coefﬁcient.

In summary, the traditional usage of CVI has been

for validation purposes. However, utilizing multiple

CVI together in combination with a proximity mea-

sure such as Euclidean distance has a strong potential

to offer a new pairwise similarity measure that can

enhance the comparison of clustering results by practitioners. This is not yet a common practice in data science or in the energy domain.

3 THE PROPOSED CLUSTERING PROCESS

Our proposed clustering process extends the well-

known process introduced in the previous section.

Fig. 2 provides a conceptual overview of the main

tasks of our proposed clustering process. The num-

bered items in the ﬁgure link back to individual

Python scripts described in detail in the implemen-

tation section. At the end of the process, a database

is used to persist all clustering results and a RESTful

Application Programming Interface (API) facilitates

querying these results by different practitioners.

Figure 2: Our Proposed Clustering Process.

3.1 Data Preprocessing and Fusion

The data preprocessing and fusion task uses raw data

from the public EV charging stations. Preprocess-

ing consists of data cleaning and consolidation steps.

Data cleaning ensures good data quality and produces a set of cleaned files by eliminating errors, inconsistencies, duplicated and redundant data rows,

and handling missing data. Data consolidation com-

bines data from various data ﬁles into a single data set.

A variety of ﬁles from the cleaned data set are used as

the input for this operation. The output of these steps

is a unique ﬁle that merges all attributes into one big

table.

Moreover, data fusion consists of combining mul-

tiple data sources followed by a reduction or replace-

ment for the purpose of better inference. In our

proposed clustering process, consolidated station lo-

cation information and charging event data ﬁles are

combined to produce more consistent, accurate, and

useful data ﬁles.

3.2 Feature Generation and Selection

The aim of the feature generation and selection task is

to enrich pre-processed and fused data ﬁles by adding

new attributes to each data row according to a spe-

ciﬁc context. This task is deﬁned by a contextual-

ization function that can produce a set of new data

rows using contextualization parameters to add new

attributes to the fused data rows. Transformed data is

then partitioned using multiple temporal granularities

(e.g., weekly, monthly or seasonally).

3.3 Clustering

The aim of the clustering task is to ﬁnd the pat-

terns from transformed input data using a hierarchical

agglomerative clustering algorithm. The algorithm

seeks to build a hierarchy of clusters by repeatedly merging the pair of mutually closest clusters, starting from individual data points, until all the data points have been used in the computation.

The measure of inter-cluster similarity is updated af-

ter each step using complete Ward linkage. This a pri-

ori selected algorithm is utilized to ﬁt the various tem-

poral granularities of input data, producing multiple

clustering results. Internal clustering validity indices

are recorded during each application of the clustering

algorithm.
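The merging loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses Ward's merge cost (the increase in within-cluster sum of squares), the helper names `centroid`, `ward_cost` and `agglomerate` are hypothetical, and the station feature values are made up. It is not the code used in this work.

```python
from itertools import combinations

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist

def agglomerate(points, n_clusters):
    """Merge the cheapest pair of clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > n_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # i < j, so pop(j) is safe
        clusters.pop(j)
    return clusters

# Hypothetical station features: (share of events, share of energy).
stations = [(0.02, 0.03), (0.03, 0.02), (0.04, 0.04),
            (0.20, 0.22), (0.22, 0.20), (0.25, 0.24)]
results = {k: agglomerate(stations, k) for k in range(2, 5)}
```

The naive pairwise search is quadratic per merge, which is acceptable for a few dozen stations; a library implementation would be preferable for larger inputs.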

3.4 Harvesting and Processing Validity Indices

Each application of the clustering algorithm gener-

ates a record consisting of the cluster count param-

eter value, the various cluster validity index values

and the input data used to generate the clusters. Pro-

cessing the validity indices involves selecting and nor-

malizing the index values in preparation for Euclidean

distance computations. This task utilizes the combi-

nation of eight cluster validity indices which are de-

scribed as follows:

3.4.1 Silhouette Index

The silhouette width of a data point measures how similar the data point is to its own cluster compared to other clusters. For clusters $X_j$ ($j = 1, \dots, c$), the silhouette width of the $i$-th data point in cluster $X_j$ is defined as (Rendón et al., 2011):

$$S(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \tag{1}$$

where $a(i)$ is the average distance between the $i$-th data point and all data points included in $X_j$; $b(i)$ is the minimum average distance between the $i$-th data point and all of the data points clustered in $X_k$ ($k = 1, \dots, c$; $k \neq j$).

From individual silhouette width calculations, an

aggregated global silhouette index is obtained (Petro-

vic, 2006). The silhouette index values range from

-1 to 1 where a value closer to 1 indicates clusters

are well separated and clearly distinguished. A value

closer to -1 indicates data points are not properly clus-

tered.
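As a worked illustration of Eq. (1), the sketch below computes the silhouette width of every point and averages the widths into a global index. It assumes Euclidean distance, excludes the point itself when computing a(i), and assigns a width of 0 to points in singleton clusters; the function name and the toy data are hypothetical.

```python
import math

def silhouette_widths(points, labels):
    """Return S(i) for every point, following Eq. (1)."""
    widths = []
    for i, p in enumerate(points):
        # Group the distances from point i by the cluster of the other point.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:  # singleton cluster or a single cluster
            widths.append(0.0)
            continue
        a = sum(own) / len(own)                                 # a(i)
        b = min(sum(d) / len(d) for d in by_cluster.values())   # b(i)
        widths.append((b - a) / max(a, b))
    return widths

# Toy data: two tight, well-separated clusters.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
widths = silhouette_widths(points, labels)
silhouette_index = sum(widths) / len(widths)  # close to 0.80 for this data
```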

3.4.2 Caliński-Harabasz Index

The Caliński-Harabasz (CH) index is expressed as a ratio of between-cluster variance and the overall within-cluster variance. A recent comparative study of available clustering indices demonstrated this index as one of the best cluster validity indices (Arbelaitz et al., 2013). Well defined clusters yield high values of this index. Therefore, the maximum value of the index is used to select the best partition. For $n$ data points and $k$ clusters, where $B$ and $W$ are the between-cluster and within-cluster scatter matrices, the index is computed as (Gurrutxaga et al., 2011):

$$CH = \frac{\operatorname{trace} B / (k - 1)}{\operatorname{trace} W / (n - k)} \tag{2}$$
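A small sketch of Eq. (2) follows. It relies on the fact that the trace of a scatter matrix equals the corresponding sum of squared Euclidean distances, so no matrices need to be built. The function name and the toy data are illustrative assumptions.

```python
def calinski_harabasz(points, labels):
    """CH index of Eq. (2) using squared Euclidean distances."""
    def mean_point(pts):
        return tuple(sum(col) / len(pts) for col in zip(*pts))

    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    n, k = len(points), len(clusters)
    overall = mean_point(points)
    # trace(B): size-weighted squared distances of centroids to the overall mean.
    trace_b = sum(len(c) * sq_dist(mean_point(c), overall) for c in clusters.values())
    # trace(W): squared distances of points to their own centroid.
    trace_w = sum(sq_dist(p, mean_point(c)) for c in clusters.values() for p in c)
    return (trace_b / (k - 1)) / (trace_w / (n - k))

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
ch = calinski_harabasz(points, labels)  # trace B = 25, trace W = 1
```

The index is undefined when k = 1, k = n, or all points coincide with their centroids; a production version would need to guard those cases.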

3.4.3 Davies-Bouldin Index

The Davies-Bouldin (DB) index is defined as follows (Gurrutxaga et al., 2011):

$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j = 1, \dots, k;\ j \neq i} (R_{ij}) \tag{3}$$

where

$$R_{ij} = \frac{s_i + s_j}{d(c_i, c_j)} \tag{4}$$

In this formula, $k$ is the number of clusters, $s_i$ is the average distance of all data points in cluster $i$ to their cluster centroid and $d(c_i, c_j)$ is the distance between the centroids of clusters $i$ and $j$. With this index, a minimum value denotes the best partitioning of the data.
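The following sketch evaluates Eqs. (3) and (4) directly, assuming Euclidean distance and at least two clusters with distinct centroids. The function name and toy data are hypothetical.

```python
import math

def davies_bouldin(points, labels):
    """DB index of Eqs. (3) and (4) with Euclidean distances."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    centroids, scatter = {}, {}
    for lab, pts in clusters.items():
        c = tuple(sum(col) / len(pts) for col in zip(*pts))
        centroids[lab] = c
        scatter[lab] = sum(math.dist(p, c) for p in pts) / len(pts)  # s_i
    total = 0.0
    for i in clusters:
        # Worst-case similarity R_ij between cluster i and any other cluster j.
        total += max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                     for j in clusters if j != i)
    return total / len(clusters)

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
db = davies_bouldin(points, labels)  # s_0 = s_1 = 0.5, centroid distance = 5
```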

3.4.4 Cohesion

Cohesion is measured by the sum of squared distances from each data point to its respective centroid. Also referred to as the within sum of squares (WSS), it measures how closely related data points are in a cluster. The WSS is defined as (López et al., 2017):

$$WSS = \sum_{i=1}^{N_C} \sum_{x \in C_i} d(x, \bar{X}_{C_i})^2 \tag{5}$$

where $C_i$ is the cluster, $N_C$ is the number of clusters, $\bar{X}_{C_i}$ is the cluster centroid, and $\bar{X}$ is the sample mean. The goal in clustering is to minimize the value of WSS.

3.4.5 Separation

Separation measures how distinct or well separated a cluster is from other clusters. Calculated as the sum of the squared deviations between the groups, it is defined as (López et al., 2017):

$$BSS = \sum_{i=1}^{N_C} |C_i| \cdot d(\bar{X}_{C_i}, \bar{X})^2 \tag{6}$$

In this formula, $|C_i|$ is the size of the cluster, $N_C$ is the number of clusters, $\bar{X}_{C_i}$ is the cluster centroid, and $\bar{X}$ is the sample mean. An optimal clustering will have a higher value of BSS.
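Cohesion and separation can be computed together, as in the sketch below. It assumes Euclidean distance, and it also checks the standard identity that WSS and BSS add up to the total sum of squares around the sample mean. Helper names and data are illustrative only.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def mean_point(points):
    """Component-wise mean (centroid) of a list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def cohesion_and_separation(points, labels):
    """Return (WSS, BSS) for a crisp partition."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    overall = mean_point(points)
    wss = sum(sq_dist(p, mean_point(c)) for c in clusters.values() for p in c)
    bss = sum(len(c) * sq_dist(mean_point(c), overall) for c in clusters.values())
    return wss, bss

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
wss, bss = cohesion_and_separation(points, labels)
# Total sum of squares around the sample mean; equals WSS + BSS.
tss = sum(sq_dist(p, mean_point(points)) for p in points)
```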

3.4.6 Root Mean Square Standard Deviation

The Root Mean Square Standard Deviation (RMSSTD) measures homogeneity within clusters. A lower RMSSTD value means a better separation of clusters. Large values of RMSSTD indicate that clusters are not homogeneous. The metric is computed as (Rujasiri and Chomtee, 2009):

$$RMSSTD = \sqrt{\frac{\sum_{i=1..k,\ j=1..p} \sum_{a=1}^{n_{ij}} (x_a - \bar{x}_{ij})^2}{\sum_{i=1..k,\ j=1..p} (n_{ij} - 1)}} \tag{7}$$

where $k$ is the number of clusters, $p$ is the number of independent variables in the data set, $\bar{x}_{ij}$ is the mean of values in variable $j$ and cluster $i$, and $n_{ij}$ is the number of data points which are in variable $j$ and cluster $i$.

3.4.7 R-squared

The R-squared (RS) value captures whether there is a significant difference among data points in different clusters and whether data points in the same cluster have high similarity. RS values range from 0 to 1, where a value of zero indicates there is no difference between clusters. On the other hand, if the value is closer to 1, then the partitioning of clusters is closer to an optimal allotment. The metric is computed as (Rujasiri and Chomtee, 2009):

$$RS = \frac{SS_t - SS_w}{SS_t} \tag{8}$$

$$SS_t = \sum_{j=1}^{p} \sum_{a=1}^{n_j} (x_a - \bar{x}_j)^2 \tag{9}$$

$$SS_w = \sum_{i=1}^{k} \sum_{j=1}^{p} \sum_{a=1}^{n_{ij}} (x_a - \bar{x}_{ij})^2 \tag{10}$$

In these equations, $SS_t$ is the sum of squared distances among all variables, $SS_w$ is the sum of squared distances among all data points in the same cluster, $k$ is the number of clusters, $p$ is the number of independent variables in the data set, $\bar{x}_j$ is the mean of data in variable $j$, $\bar{x}_{ij}$ is the mean of the data in variable $j$ and cluster $i$, and $n_j$ and $n_{ij}$ are the numbers of data points in variable $j$ overall and in variable $j$ and cluster $i$, respectively.
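Because RMSSTD and RS share the within-cluster sum of squares, both can be obtained in one pass. The sketch below follows Eqs. (7) to (10) under the assumption that every cluster holds at least two points (otherwise the RMSSTD denominator can be zero). The function name and toy data are hypothetical.

```python
import math

def rs_and_rmsstd(points, labels):
    """Return (RS, RMSSTD) following Eqs. (7)-(10)."""
    n_vars = len(points[0])
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    def sum_of_squares(rows):
        """Sum over variables of squared deviations from the variable mean."""
        total = 0.0
        for j in range(n_vars):
            col = [r[j] for r in rows]
            mean_j = sum(col) / len(col)
            total += sum((x - mean_j) ** 2 for x in col)
        return total

    ss_t = sum_of_squares(points)                                # Eq. (9)
    ss_w = sum(sum_of_squares(rows) for rows in clusters.values())  # Eq. (10)
    # Denominator of Eq. (7): sum of (n_ij - 1) over clusters and variables.
    dof = sum(n_vars * (len(rows) - 1) for rows in clusters.values())
    return (ss_t - ss_w) / ss_t, math.sqrt(ss_w / dof)

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
rs, rmsstd = rs_and_rmsstd(points, labels)  # SS_t = 26, SS_w = 1
```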

3.4.8 Xie-Beni Index

The Xie-Beni (XB) index is applicable to fuzzy and crisp clustering results. It is defined as the quotient between the mean quadratic error and the minimum of the minimal squared distances between the points in the clusters. The index is defined as (Chakrabarty, 2010):

$$XB(K) = \frac{\sum_{k=1}^{K} \sum_{j=1}^{n} (\mu_{kj})^m \lVert x_j - z_k \rVert^2}{n \times \min_{1 \le i \le K,\ 1 \le j \le K,\ i \neq j} \lVert z_i - z_j \rVert^2} \tag{11}$$

where the numerator measures cluster compactness and the denominator measures the separation between different cluster centers. The value of the XB index should be minimum for the optimum number of clusters in the data. The parameter $m$ is called the fuzzifier and is usually set between 1 and 2.
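For a crisp partition, each membership value is either 0 or 1, so the fuzzifier has no effect and the numerator of Eq. (11) reduces to the within-cluster sum of squares. The sketch below covers only this crisp case, reads $x_j$ as a data point and $z_k$ as a cluster center, and assumes at least two clusters with distinct centers; names and data are illustrative.

```python
from itertools import combinations

def xie_beni_crisp(points, labels):
    """XB index of Eq. (11) for crisp memberships (each mu is 0 or 1)."""
    def mean_point(pts):
        return tuple(sum(col) / len(pts) for col in zip(*pts))

    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    centers = {lab: mean_point(pts) for lab, pts in clusters.items()}
    # Numerator: squared distance of every point to its own cluster center.
    compactness = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    # Denominator: n times the smallest squared distance between two centers.
    separation = min(sq_dist(a, b) for a, b in combinations(centers.values(), 2))
    return compactness / (len(points) * separation)

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
xb = xie_beni_crisp(points, labels)  # 1 / (4 * 25)
```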

3.5 Similarity Computations

Our work uses a proximity measure in the clustering task and in the computation of the results similarity matrix. Selecting a measure to determine how similar or dissimilar two data points are is an important step in any clustering process. Proximity measures affect the shape of clusters, as some data points may be close to one another according to one measure and far from each other according to another. Euclidean distance is a preferred distance measure by researchers in the field of clustering and is defined as (Chakrabarty, 2010):

$$D(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{12}$$

In addition to the clustering task, the similarity computation task uses Euclidean distance as the proximity measure between clustering results. All index values of each clustering result (i.e., multidimensional points in Euclidean space) are used in the distance computations. The pair-wise similarity comparisons (i.e., the similarity matrix) are then persisted in a database for downstream results exploration via a RESTful API.

The similarity matrix is stored in the database using two tables. The first table summarizes clustering results with rows consisting of a unique clustering result ID (result_id) and metadata about running the algorithm (e.g., input file name, clustering execution time, all validity index values, etc.). The second table, which is linked to the first table, contains rows consisting of a source result ID (from_result_id), a target result ID (to_result_id) and a Euclidean distance. Links between result IDs are not duplicated as directionality is not considered.
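The normalization and pairwise distance steps can be sketched as follows. The text does not state which normalization is applied, so min-max scaling is an assumption here, as are the function names, the result IDs and the index values. Each unordered pair of results is emitted once, mirroring the undirected rows of the second table.

```python
import math
from itertools import combinations

def min_max_scale(rows):
    """Rescale each index (column) to [0, 1] across all clustering results."""
    cols = list(zip(*rows))
    lows, highs = [min(c) for c in cols], [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in rows]

def similarity_edges(index_rows):
    """index_rows: {result_id: [index values]} -> (from_id, to_id, distance) rows."""
    ids = sorted(index_rows)
    scaled = dict(zip(ids, min_max_scale([index_rows[i] for i in ids])))
    # combinations() yields each unordered pair once, so links are not duplicated.
    return [(a, b, math.dist(scaled[a], scaled[b])) for a, b in combinations(ids, 2)]

# Hypothetical results described by three indices (e.g. Sil, CH, DB).
index_rows = {"r1": [0.60, 51.0, 0.50],
              "r2": [0.60, 49.0, 0.60],
              "r3": [0.70, 59.0, 0.55]}
edges = similarity_edges(index_rows)
closest = min(edges, key=lambda e: e[2])
```

Scaling before the distance computation keeps an index with a large numeric range, such as CH, from dominating the Euclidean distance.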

4 IMPLEMENTATION

This work makes use of real operational data from public EV charging stations provided by the New Brunswick Power Corporation. The analysis included 9,505 EV charging events that occurred between April 2019 and April 2020 at Level-2 (L2) and Level-3 (L3) public charging stations. Table 1 describes the raw EV charging data set features. Our practitioners are managers and planners of a utility company who are responsible for coordinating various projects including EV charging station condition assessments, operating and capital budget forecasting, and maintenance and operation practices development. Fig. 3 describes the overall end-to-end implementation of our EV use case.

Custom-written Python code and a scientific Python stack were leveraged to implement the proposed clustering process. Task elements were executed in sequence from a centralized management script (Richard et al., 2020). The software programs used in this work were packaged using a Docker (Boettiger, 2015) container in order to ensure a reproducible and consistent computational environment.

Table 1: Raw Data.

| Column Name | Description |
|---|---|
| Connection ID | Unique identifier for a connection |
| Recharge start time (local) | Timestamp denoting start of charging event |
| Recharge end time (local) | Timestamp denoting end of charging event |
| Account name | Unused (all null) |
| Card identifier | Unique identifier for a charging plan member |
| Recharge duration (hours:minutes) | Duration of charge event |
| Connector used | Connection used during charge event |
| Start state of charge (%) | State of charge % at beginning of charging event |
| End state of charge (%) | State of charge % after charging event is complete |
| End reason | Charge event end reason |
| Total amount | Unused (all null) |
| Currency | Unused (all null) |
| Total kWh | Energy transferred to vehicle during charging event |
| Station | Unique identifier for charging station |

Fig. 4 highlights noteworthy aspects of the implementation. The numbered boxes represent individual parameterized Python scripts. The data flow is

such that the output of one script is the input for the

next script. Input and output ﬁle names contain pa-

rameter values that were used when calling the work-

ﬂow’s scripts. The grey elements represent a job’s in-

put ﬁle(s). The blue elements represent a job’s output

ﬁle(s). The detailed implementation of each script is

described as follows:

• Script (1): The one_way_hash.py script imports raw event data and casts column elements to appropriate types. Additionally, a one-way hash function is applied to the Card identifier column.

• Script (2): The locations_to_parquet.py script imports raw station location data and integrates multiple input files into one.

• Script (3): The fuse_location_w_events.py script fuses event data with charging station location information.

• Script (4): This work focuses on recharge report event data in the downstream analysis. The feat_eng_rech_report.py script creates new (contextualized) features based on calculations involving existing data attributes and removes events with a duration of 5 minutes or less (eliminating 11% of the raw records).

• Script (5): The create_batch_ranges.py script creates temporal partitions of the data. These partitions facilitate the cluster analysis based on charging events occurring during a particular week, month or season of the year.

• Script (6): The generate_ev_station_features.py script prepares the input data for clustering by calculating, for each charging station, station type and temporal granularity, the proportion of total charging events and the proportion of total power used to charge vehicles relative to all stations.

• Script (7): The cluster_data.py script applies the agglomerative clustering algorithm to all temporal slices of the data produced in the previous task. This is done for a cluster count hyperparameter that varies from 2 to 7. Other hyperparameter settings are kept constant to simplify the experimental setup. Internal clustering validity indices are recorded during each application of the clustering algorithm (see Table 2 for the list of indices).

• Script (8): The scale_indices.py script normalizes the internal clustering validity indices in preparation for the downstream Euclidean distance computations.

• Script (9): The similarity_matrix.py script performs pairwise Euclidean distance computations for each clustering result. All index values of each clustering result (i.e., multidimensional points in Euclidean space) are used in the distance computations.

• Script (10): The load_data.py script persists the similarity matrix data produced in the previous task in a relational database to enable querying of clustering results and corresponding similarities across months, weeks and seasons. The database query functionality is made available via a RESTful API.

Figure 3: Overview of Our Implemented EV Use Case.
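The temporal partitioning and station feature steps (Scripts 5 and 6) can be illustrated with the sketch below. It is not the code of those scripts: the function names, the event tuples and the meteorological season mapping are assumptions, and the station type dimension is left out for brevity.

```python
from collections import defaultdict
from datetime import date

# Assumed meteorological seasons; the season definition is not given in the text.
SEASON_OF_MONTH = {12: "winter", 1: "winter", 2: "winter",
                   3: "spring", 4: "spring", 5: "spring",
                   6: "summer", 7: "summer", 8: "summer",
                   9: "autumn", 10: "autumn", 11: "autumn"}

def partition_key(day, granularity):
    """Map a date to the identifier of its temporal partition."""
    if granularity == "weekly":
        iso = day.isocalendar()
        return (iso[0], iso[1])  # ISO year and ISO week number
    if granularity == "monthly":
        return (day.year, day.month)
    if granularity == "seasonal":
        return SEASON_OF_MONTH[day.month]
    raise ValueError(f"unknown granularity: {granularity}")

def station_features(events, granularity):
    """events: iterable of (station, day, kwh).
    Returns {partition: {station: (share_of_events, share_of_kwh)}}."""
    counts = defaultdict(lambda: defaultdict(int))
    energy = defaultdict(lambda: defaultdict(float))
    for station, day, kwh in events:
        key = partition_key(day, granularity)
        counts[key][station] += 1
        energy[key][station] += kwh
    features = {}
    for key in counts:
        n_events = sum(counts[key].values())
        total_kwh = sum(energy[key].values())
        features[key] = {
            s: (counts[key][s] / n_events,
                energy[key][s] / total_kwh if total_kwh else 0.0)
            for s in counts[key]}
    return features

# Hypothetical charging events: (station, date, kWh).
events = [("S1", date(2019, 5, 27), 10.0), ("S1", date(2019, 5, 28), 30.0),
          ("S2", date(2019, 5, 29), 20.0), ("S2", date(2019, 6, 5), 5.0)]
weekly = station_features(events, "weekly")
monthly = station_features(events, "monthly")
```

Each partition then yields one two-dimensional point per station, which is the kind of input the clustering step consumes.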

After results are generated and persisted (i.e., once Script (10) in Fig. 4 is complete), the practitioner can navigate these results via a RESTful interface. Fig. 5 illustrates how the practitioner interacts with the results system. First, the practitioner requests ranked station clustering results for either L2 or L3 station types (Step 1). The system then returns a list of clustering results sorted by silhouette score (Step 2). From this list, the practitioner selects one result as the reference result for which comparable results are desired and then requests these comparable results from the system (Step 3). Finally, the system returns a list of comparable clustering results sorted by Euclidean distance (Step 4). This sorted list contains result-specific artefacts such as scatter plots, mapped station cluster memberships and silhouette plots.
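The two ranking steps of this query sequence can be sketched without the database or the API layer. The function names and the record layout are hypothetical; the silhouette scores and distances are the ones reported in Table 3 for the L3 case study.

```python
def rank_by_silhouette(results):
    """Step 2: return clustering results with the best silhouette score first."""
    return sorted(results, key=lambda r: r["silhouette_score"], reverse=True)

def comparable_results(edges, reference_id):
    """Step 4: other results ordered by Euclidean distance to the reference."""
    neighbours = []
    for from_id, to_id, dist in edges:  # undirected links, each stored once
        if reference_id in (from_id, to_id):
            other = to_id if from_id == reference_id else from_id
            neighbours.append((other, dist))
    return sorted(neighbours, key=lambda pair: pair[1])

# Silhouette scores and distances taken from Table 3 (week start dates as IDs).
results = [{"result_id": "2019-05-27", "silhouette_score": 0.60},
           {"result_id": "2020-02-17", "silhouette_score": 0.60},
           {"result_id": "2020-03-02", "silhouette_score": 0.65},
           {"result_id": "2019-07-29", "silhouette_score": 0.60},
           {"result_id": "2019-12-02", "silhouette_score": 0.63}]
edges = [("2019-05-27", "2019-12-02", 0.177),
         ("2019-05-27", "2020-02-17", 0.081),
         ("2020-03-02", "2019-05-27", 0.101),
         ("2019-05-27", "2019-07-29", 0.105)]

ranked = rank_by_silhouette(results)
similar = comparable_results(edges, "2019-05-27")
```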

The clustering process implementation and RESTful API facilitate the comparison of clustering result similarities across various temporal granularities. This process is useful in identifying avenues for further analysis. One Level 3 station clustering result for the week of May 27, 2019 has been selected as a case study to demonstrate our approach. The case study is presented in the next section.

Figure 4: Data Flow Between Python Scripts of the Clustering Process Implementation.

Figure 5: Results Query Sequence.

5 RESULTS AND DISCUSSION

This section highlights the results of our proposed approach in identifying similar station clusterings over multiple weeks with a case study. Table 3 highlights similar clustering results relative to station clusterings for a target week starting on May 27, 2019. In all results, the number of clusters is 2 and the station type is L3. The table is sorted in ascending order by Euclidean distance relative to the target week. According to the multi-dimensional pairwise distance calculations obtained using the features described in Table 2, the most similar clustering result to the week starting on May 27, 2019 is the result for the week starting on February 17, 2020. The least similar clustering result is the result for the week starting on December 2, 2019.

Table 2: Clustering Validity Index Data.

Column Name        Description
-----------------  ----------------------------------------------------------------------
file name          File name for clustering results for station type and time granularity
n cluster          K parameter value used in applying the clustering algorithm
silhouette score   Silhouette index value for clustering result
calinski harabasz  Calinski-Harabasz index for clustering result
davies bouldin     Davies-Bouldin index for clustering result
cohesion           Cohesion index for clustering result
separation         Separation index for clustering result
RMSSTD             Root mean square standard deviation index for clustering result
RS                 R-squared index for clustering result
XB                 Xie-Beni index for clustering result
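As an illustration of how two of the indices in Table 2 can be obtained, the sketch below computes R-squared (RS) and RMSSTD from a clustering given as lists of points. It uses the commonly cited definitions, RS = (SST - SSW) / SST and RMSSTD = sqrt(SSW / (P * sum(n_i - 1))), where P is the number of features; the function names are hypothetical and the exact formulas used in our implementation may differ.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    return [sum(col) / len(points) for col in zip(*points)]


def sum_sq(points, center):
    """Sum of squared Euclidean distances from each point to center."""
    return sum(math.dist(p, center) ** 2 for p in points)


def rs_and_rmsstd(clusters):
    """Return (RS, RMSSTD) for a list of clusters, each a list of points.

    At least one cluster must contain more than one point.
    """
    all_points = [p for cluster in clusters for p in cluster]
    n_features = len(all_points[0])
    sst = sum_sq(all_points, centroid(all_points))        # total sum of squares
    ssw = sum(sum_sq(c, centroid(c)) for c in clusters)   # within-cluster sum of squares
    dof = n_features * sum(len(c) - 1 for c in clusters)  # pooled degrees of freedom
    return (sst - ssw) / sst, math.sqrt(ssw / dof)


# Toy data: two tight, well-separated clusters.
toy = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
rs, rmsstd = rs_and_rmsstd(toy)
print(f"RS = {rs:.4f}, RMSSTD = {rmsstd:.4f}")
```

For the toy data, SST is 104 and SSW is 4, so RS is 100/104 and RMSSTD is 1.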

A corresponding visual presentation of the cluster-

ing results found in Table 3 can be seen in Figures 6

through 10. Each figure contains a silhouette plot and a scatter plot describing the clustered data. In the silhouette plots, an observation with a silhouette width near 1 is well placed in its cluster; an observation with a silhouette width closer to -1 likely belongs in some other cluster.
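The interpretation above follows from the standard definition of the silhouette width, s(i) = (b - a) / max(a, b), where a is the mean distance from observation i to the other members of its own cluster and b is the mean distance to the members of the nearest other cluster. A minimal sketch on toy data, with a hypothetical function name and one point deliberately assigned to the wrong cluster:

```python
import math


def silhouette_widths(points, labels):
    """Return the silhouette width s(i) = (b - a) / max(a, b) for each point.

    Assumes at least two clusters and no duplicate points.
    """
    widths = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Mean distance from p to the members of each cluster, excluding p itself.
        mean_dist = {}
        for lab in set(labels):
            members = [q for j, (q, other) in enumerate(zip(points, labels))
                       if other == lab and j != i]
            if members:
                mean_dist[lab] = sum(math.dist(p, q) for q in members) / len(members)
        if own not in mean_dist:  # singleton cluster: width is taken as 0
            widths.append(0.0)
            continue
        a = mean_dist[own]                                         # own-cluster distance
        b = min(d for lab, d in mean_dist.items() if lab != own)   # nearest other cluster
        widths.append((b - a) / max(a, b))
    return widths


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (9.0, 0.5)]
labels = [0, 0, 1, 1, 0]  # the last point sits next to cluster 1 but is labelled 0
widths = silhouette_widths(points, labels)
for point, label, width in zip(points, labels, widths):
    print(point, label, round(width, 3))
```

The misassigned point receives a negative width, which is the pattern that appears as bars extending to the left in a silhouette plot.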

Table 3: Clustering Similarities - L3 - May 27, 2019.

WEEK      Sil   CH     DB    C     S     RMS   RS    XB    Dist
27/05/19  0.60  51.37  0.51  1.12  2.40  0.15  0.68  0.09  N/A
17/02/20  0.60  49.35  0.57  0.19  2.44  0.16  0.67  0.10  0.081
02/03/20  0.65  55.51  0.52  1.14  2.63  0.15  0.70  0.07  0.101
29/07/19  0.60  55.82  0.53  0.99  2.30  0.14  0.70  0.11  0.105
02/12/19  0.63  56.55  0.58  1.26  2.97  0.16  0.70  0.09  0.177

SMARTGREENS 2021 - 10th International Conference on Smart Cities and Green ICT Systems

Column Name Abbreviations for Table 3

Sil  : Silhouette index
CH   : Calinski-Harabasz index
DB   : Davies-Bouldin index
C    : Cohesion
S    : Separation
RMS  : Root mean square standard deviation
RS   : R-squared
XB   : Xie-Beni index
Dist : Euclidean distance between current and previous row

Figure 6: L3 Station Clusters - MAY-27-2019. (a) Silhouette plot. (b) Scatter plot.

Figure 7: L3 Station Clusters - FEB-17-2020. (a) Silhouette plot. (b) Scatter plot.

We can see from Figures 6 and 7 that reasonable structures in the data have been found. Stations are grouped in terms of relatively higher or lower utilization rates. The average silhouette score is 0.60 in both clustering results. The number of observations in each cluster is also the same in both results. Cluster 0 in both results has more stations with relatively lower utilization rates. Results for the week of May 27, 2019 are slightly better when considering all cluster validation indices. This can also be observed visually: data points appear closer together in the scatter plot of Fig. 6b than in Fig. 7b. The between-cluster separation is similar in both results.

The silhouette plot in Fig. 8a suggests a less op-

timal clustering. This plot indicates that some obser-

vations would seemingly belong to clusters other than

the one they are in; these observations have a negative

silhouette width value.

Figure 8: L3 Station Clusters - MAR-02-2020. (a) Silhouette plot. (b) Scatter plot.

Figure 9: L3 Station Clusters - JUL-29-2019. (a) Silhouette plot. (b) Scatter plot.

Figure 10: L3 Station Clusters - DEC-02-2019. (a) Silhouette plot. (b) Scatter plot.

The silhouette plot in Fig. 9a and the average silhouette score of 0.60 suggest that a reasonable structure in the data has also been found in this week. The number of observations in each cluster differs between the two clustering results. Based on the various indices, clustering results for the week of July 29, 2019 are better in some aspects and inferior in others to results for the week of May 27, 2019. This result was identified as being the third most similar result for our target week.

The decreasing relative similarity of results is especially visible when comparing the results for the week of May 27, 2019 with the least similar results (i.e., results for the week of December 2, 2019). In Fig. 10a we can see that all members of cluster 0 have below-average silhouette scores and the clustering of stations is much less similar than the other clusterings.

Individual index calculations embed implicit trade-offs on what is prioritized when expressing inter-cluster separation, intra-cluster homogeneity, density, and compactness as one numeric value. One can view the various indices as averages where a certain precision is lost in the summary. This can lead to situations where one index will suggest a better clustering relative to another grouping and another index will reverse this assessment. This is illustrated in Table 3 where, for example, the silhouette and Calinski-Harabasz index values for the week starting on December 2 suggest a better clustering than for the week starting on May 27. However, the Davies-Bouldin and R-squared index values reverse this assessment.
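The disagreement between indices can be made concrete with a small sketch that compares two rows of Table 3 index by index. It is restricted to four indices whose usual reading is unambiguous (higher is better for the silhouette and Calinski-Harabasz indices, lower is better for the Davies-Bouldin and Xie-Beni indices); the remaining indices are omitted because their preferred direction depends on how they are defined. The names used are hypothetical.

```python
# Values transcribed from Table 3 for two of the weekly results.
MAY_27 = {"Sil": 0.60, "CH": 51.37, "DB": 0.51, "XB": 0.09}
DEC_02 = {"Sil": 0.63, "CH": 56.55, "DB": 0.58, "XB": 0.09}

# Usual reading of each index: True means "higher is better".
HIGHER_IS_BETTER = {"Sil": True, "CH": True, "DB": False, "XB": False}


def preferred(index, first, second):
    """Return which result the index favours: 'first', 'second' or 'tie'."""
    a, b = first[index], second[index]
    if a == b:
        return "tie"
    first_wins = (a > b) if HIGHER_IS_BETTER[index] else (a < b)
    return "first" if first_wins else "second"


verdicts = {name: preferred(name, MAY_27, DEC_02) for name in HIGHER_IS_BETTER}
print(verdicts)
# {'Sil': 'second', 'CH': 'second', 'DB': 'first', 'XB': 'tie'}
```

Here the silhouette and Calinski-Harabasz indices favour the December 2 result, the Davies-Bouldin index favours the May 27 result, and the Xie-Beni index does not separate them, which is why a single index is a fragile basis for comparing results.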

Capital investments in public charging infrastructure involve the use of public funds and necessitate robust, informed decision making. Identifying similar

station utilization patterns over multiple weeks can be

useful planning information for station operators. The

cluster analysis presented in our case study provides

useful insights by identifying similar groupings of EV

charging stations according to their usage patterns in

time.

The results highlighted in the case study provided

in this section demonstrate that given a clustering re-

sult of interest, a process of objectively highlighting

and recommending similar clustering results can in-

deed be automated in order to support the practitioner

in evaluating how structure in data persists over mul-

tiple time slices in a data set with temporal proper-

ties. The relative ranking of similar clustering results

that our approach affords makes it easy to objectively

identify similar station groupings over multiple weeks

based on a reference week. Not highlighted in the case study are the clustering results for other a priori selected temporal partitions in the data, which are also available as reference points for exploring monthly or seasonal clustering similarities. For example, Fig. 11 shows silhouette plots representing a reference month (where K = 4) and a reference season (where K = 3).
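The a priori temporal partitioning mentioned above can be sketched as a mapping from a charging event date to a weekly, monthly or seasonal slice. The sketch assumes weeks that start on Monday (consistent with the week labels used in the case study) and meteorological seasons for the northern hemisphere; the season boundaries, key formats and function names are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta

# Assumed season boundaries (meteorological, northern hemisphere).
SEASONS = {12: "Winter", 1: "Winter", 2: "Winter",
           3: "Spring", 4: "Spring", 5: "Spring",
           6: "Summer", 7: "Summer", 8: "Summer",
           9: "Fall", 10: "Fall", 11: "Fall"}


def week_start(day):
    """Monday of the week containing day."""
    return day - timedelta(days=day.weekday())


def partition_key(day, granularity):
    """Map a charging event date to its weekly, monthly or seasonal slice."""
    if granularity == "week":
        return week_start(day).isoformat()
    if granularity == "month":
        return day.strftime("%Y-%m")
    if granularity == "season":
        return SEASONS[day.month]
    raise ValueError(f"unknown granularity: {granularity}")


def partition(event_dates, granularity):
    """Group event dates by the chosen temporal granularity."""
    groups = defaultdict(list)
    for day in event_dates:
        groups[partition_key(day, granularity)].append(day)
    return dict(groups)


events = [date(2019, 5, 27), date(2019, 5, 30), date(2019, 8, 14), date(2020, 2, 19)]
weekly = partition(events, "week")
for key, days in weekly.items():
    print(key, len(days))
```

Each slice would then be clustered separately, so that a result for one slice can serve as the reference against which the other slices of the same granularity are ranked.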

(a) (b)

Figure 11: L3 Station Clustering References - August and

Spring.

6 CONCLUSIONS

Although clustering has become a routine analytical

task in many research domains, it remains arduous for

practitioners to select a good algorithm with adequate

hyperparameters and to assess the quality of cluster-

ing and the consistency of identiﬁed structures over

various temporal slices of data. The process of clus-

tering data is often an iterative, lengthy, manual and

cognitively demanding task. The subjectivity in deter-

mining the level of “success” that unsupervised learn-

ing approaches are able to achieve and the required

expert knowledge during the modeling phase suggest

that a human-in-the-loop process of supporting the

practitioner during this activity would be beneﬁcial.

Ascertaining whether a particular clustering of data is meaningful or not requires expertise and effort. Doing this for multiple results on data that has been sliced into weekly, monthly or seasonal partitions prior to applying the clustering algorithm would be very time-consuming. Manually identifying one meaningful result of interest and then having an automated mechanism select similar results reduces the effort required to identify avenues that merit further analysis, and can assist downstream analytical tasks such as improving regression or classification model performance.

A case study using real-world charging event data

from EV station operators in Atlantic Canada was

used to validate the approach and identify similar

groupings of charging stations according to their us-

age patterns. Our work demonstrates that given a

clustering result of interest, the process of objectively

highlighting and recommending similar clustering re-

sults can be automated in order to support the prac-

titioner in evaluating how structure in data persists

over multiple time slices and reduce the cognitive load

of identifying multiple meaningful clustering results

from a large number of modeling artifacts.

Presenting the practitioner with an initial ranked

list of clustering results leveraging all index values

simultaneously instead of just using the silhouette

scores (as described in Step 1 of Fig. 5) may improve

the initial results exploration process. Framing the

creation of the initial ranked list of results as a Multi-

ple Criteria Decision Making (MCDM) problem will

be included in future work. Additionally, we will ex-

plore if an expert can label a portion of the model-

ing artifacts as meaningful or not and whether a semi-

supervised or other algorithm can automatically label

the rest of the unseen modeling results from the labels

provided by the practitioner. Finally, other avenues

will explore whether this work can be adapted to im-

plement a novel change point detection approach in

identifying signiﬁcant changes in station groupings in

temporal slices of the data.

ACKNOWLEDGEMENTS

The authors of this paper would like to thank the New

Brunswick Power Corporation for providing access

to station operator users and the EV charging data


referenced in this research. This work was partially

supported by the NSERC/Cisco Industrial Research

Chair, Grant IRCPJ 488403-1.
