Hadoop, used to manage large amounts of data on clusters of servers. This framework is attractive because it provides a simple programming model, making it easier for users to express relatively sophisticated distributed programs. The MapReduce programming model is recommended for the parallel processing of large data volumes on computational clusters (Lin and Dyer, 2010). It is also scalable and fault tolerant. The MapReduce platform divides a task into small activities and materializes its intermediate results locally; when a fault occurs in this process, only the failed activities are re-executed.
This paper aims at identifying, in a large dataset of mobility data, the traffic jam areas of a city. To this end, a parallel version of the DBScan algorithm, based on the MapReduce platform, is proposed as a solution. Related works, such as (He et al., 2011) and (Dai and Lin, 2012), also use MapReduce to parallelize the DBScan algorithm, but with strategies and scenarios different from the ones presented in this paper.
The main contributions of this paper are: (1) our partitioning strategy is less costly than the one proposed in (Dai and Lin, 2012), which builds a grid to partition the data; our approach is traffic-data aware and partitions the data based on the values of a single attribute (in our experiments, the street name); (2) to gather clusters from different partitions, our merge strategy needs no data replication, as opposed to (He et al., 2011) and (Dai and Lin, 2012), and it finds the same clusters as the centralized DBScan; (3) we prove in Section 4 that our distributed DBScan algorithm is correct.
The remainder of this paper is organized as follows. Section 2 introduces the basic concepts needed to understand our solution. Section 3 discusses related work. Section 4 presents the methodology and implementation of our solution. The experiments are described in Section 5.
2 PRELIMINARIES
2.1 MapReduce
The need to manage, process, and analyze large amounts of data efficiently is a key issue in the Big Data scenario. To address this problem, different solutions have been proposed, including migrating or building applications for cloud computing environments and systems based on Distributed Hash Tables (DHTs) or multidimensional array structures (Sousa et al., 2010). Among these solutions is the MapReduce paradigm (Dean and Ghemawat, 2008), designed to support the distributed processing of large datasets on clusters of servers, together with its open-source implementation Hadoop (White, 2012).
The MapReduce programming model is based on two primitives of functional programming: Map and Reduce. A MapReduce execution is carried out as follows: (i) the Map function takes a list of key-value pairs (K_1, V_1) as input and produces a list of intermediate key-value pairs (K_2, V_2) as output; (ii) the pairs (K_2, V_2), defined according to the implementation of the Map function provided by the user, are collected by a master node at the end of each Map task and sorted by key; the keys are then divided among all the Reduce tasks, and pairs that share the same key are assigned to the same Reduce task; (iii) the Reduce function receives as input all values V_2 associated with the same key K_2 and produces as output key-value pairs (K_3, V_3) that represent the outcome of the MapReduce process. Each Reduce task processes one key at a time; the way in which values are combined is determined by the Reduce function code given by the user.
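To make the roles of (K_1, V_1), (K_2, V_2), and (K_3, V_3) concrete, the listing below sketches a minimal word-count Mapper and Reducer written against Hadoop's Java API. The word-count task and the class names are illustrative assumptions for this section only; they are not part of the algorithm proposed in this paper.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: each input pair (K1 = byte offset, V1 = text line) is turned into
// intermediate pairs (K2 = word, V2 = 1).
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE); // emit (K2, V2)
      }
    }
  }
}

// Reduce: all values V2 sharing the same key K2 arrive together and are
// combined into one output pair (K3 = word, V3 = total count).
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum)); // emit (K3, V3)
  }
}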
Hadoop is an open-source framework, developed by the Apache Software Foundation, that implements MapReduce along with a distributed file system called HDFS (Hadoop Distributed File System). What makes MapReduce attractive is the ability to manage large-scale computations with fault tolerance: the developer only needs to implement the two functions Map and Reduce, while the system manages the parallel execution, coordinates the Map and Reduce tasks, and handles failures during execution.
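As a companion to the classes above, the sketch below shows the kind of job driver the developer writes: only the Map and Reduce classes are wired in, while Hadoop schedules the tasks, moves the intermediate data, and re-executes failed tasks. The class name and the argument paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word-count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenMapper.class);  // user-supplied Map
    job.setReducerClass(SumReducer.class);  // user-supplied Reduce
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output live in HDFS; args[0] and args[1] are placeholder paths.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // waitForCompletion() launches the tasks; the framework re-runs any that fail.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}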
2.2 DBScan
DBScan is a clustering method widely used in the scientific community. Its main idea is to grow clusters from each point that has at least a certain number of neighbors (minPoints) within a specified distance (eps), where minPoints and eps are input parameters. Finding suitable values for both can be a problem in itself, since they depend on the data being manipulated and on the knowledge to be discovered. The following definitions are used in the DBScan algorithm and will be needed in Section 4:
• Card(A): the cardinality of the set A.
• N_eps(o): p ∈ N_eps(o) if and only if the distance between p and o is less than or equal to eps.
• Directly Density-Reachable (DDR): o is DDR from p if o ∈ N_eps(p) and Card(N_eps(p)) ≥ minPoints.
• Density-Reachable (DR): o is DR from p if there is a chain of points {p_1, ..., p_n}, where p_1 = p, p_n = o, and p_{i+1} is DDR from p_i.
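To illustrate these definitions, the fragment below sketches the eps-neighborhood query N_eps(o) and the DDR test on which DBScan is built. It is a minimal in-memory sketch with an assumed Point type and Euclidean distance, not the distributed implementation proposed in this paper.

import java.util.ArrayList;
import java.util.List;

// Minimal 2D point; in this paper's scenario a point would also carry
// mobility-record attributes such as the street name.
record Point(double x, double y) {}

class DbscanDefinitions {
  // N_eps(o): all points p with distance(p, o) <= eps.
  static List<Point> epsNeighborhood(List<Point> data, Point o, double eps) {
    List<Point> neighbors = new ArrayList<>();
    for (Point p : data) {
      if (distance(p, o) <= eps) {
        neighbors.add(p);
      }
    }
    return neighbors;
  }

  // o is directly density-reachable (DDR) from p when o lies in N_eps(p)
  // and Card(N_eps(p)) >= minPoints, i.e. p is a core point.
  static boolean directlyDensityReachable(List<Point> data, Point o, Point p,
                                          double eps, int minPoints) {
    List<Point> np = epsNeighborhood(data, p, eps);
    return np.contains(o) && np.size() >= minPoints;
  }

  static double distance(Point a, Point b) {
    return Math.hypot(a.x() - b.x(), a.y() - b.y());
  }
}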