Hadoop, used to manage large amounts of data on clusters of servers. This framework is attractive because it provides a simple programming model, making it easier for users to express relatively sophisticated distributed programs. The MapReduce programming model is recommended for the parallel processing of large data volumes on computational clusters (Lin and Dyer, 2010). It is also scalable and fault tolerant. The MapReduce platform divides a task into small activities and materializes its intermediate results locally; when a fault occurs in this process, only the failed activities are re-executed.
This paper aims at identifying, in a large dataset of mobility data, the traffic jam areas of a city. To this end, a parallel version of the DBScan algorithm, based on the MapReduce platform, is proposed as a solution. Related works, such as (He et al., 2011) and (Dai and Lin, 2012), also use MapReduce to parallelize the DBScan algorithm, but with strategies and scenarios different from the ones presented in this paper.
The main contributions of this paper are: (1) our partitioning strategy is less costly than the one proposed in (Dai and Lin, 2012), which builds a grid to partition the data; our approach is traffic-data aware and partitions the data based on the values of a single attribute (in our experiments, the street name); (2) to gather clusters from different partitions, our merge strategy needs no data replication, as opposed to (He et al., 2011) and (Dai and Lin, 2012), and it finds the same clusters as the centralized DBScan; (3) we prove in Section 4 that our distributed DBScan algorithm is correct.
The remainder of this paper is organized as follows. Section 2 introduces the basic concepts needed to understand our solution. Section 3 discusses related work. Section 4 presents the methodology and implementation of our solution. The experiments are described in Section 5.
2 PRELIMINARIES
2.1 MapReduce
The need to manage, process, and analyze large amounts of data efficiently is a key issue in the Big Data scenario. To address this problem, different solutions have been proposed, including migrating or building applications for cloud computing environments and systems based on Distributed Hash Tables (DHTs) or multidimensional array structures (Sousa et al., 2010). Among these solutions is the MapReduce paradigm (Dean and Ghemawat, 2008), designed to support the distributed processing of large datasets on clusters of servers, together with its open-source implementation Hadoop (White, 2012).
The MapReduce programming model is based on two primitives of functional programming: Map and Reduce. A MapReduce execution is carried out as follows: (i) the Map function takes a list of key-value pairs (K_1, V_1) as input and produces a list of intermediate key-value pairs (K_2, V_2) as output; (ii) the pairs (K_2, V_2), defined according to the implementation of the Map function provided by the user, are collected by a master node at the end of each Map task and sorted by key; the keys are then divided among all the Reduce tasks, and pairs that share the same key are assigned to the same Reduce task; (iii) the Reduce function receives as input all values V_2 associated with the same key K_2 and produces as output key-value pairs (K_3, V_3) that represent the outcome of the MapReduce process. Each Reduce task processes one key at a time; the way in which values are combined is determined by the Reduce function code given by the user.
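To make the roles of (K_1, V_1), (K_2, V_2), and (K_3, V_3) concrete, the listing below sketches a minimal word-count Mapper and Reducer written against Hadoop's Java API. The word-count task and the class names are illustrative assumptions for this section only; they are not part of the algorithm proposed in this paper.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: each input pair (K1 = byte offset, V1 = text line) is turned into
// intermediate pairs (K2 = word, V2 = 1).
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE); // emit (K2, V2)
      }
    }
  }
}

// Reduce: all values V2 sharing the same key K2 arrive together and are
// combined into one output pair (K3 = word, V3 = total count).
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum)); // emit (K3, V3)
  }
}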
Hadoop is an open-source framework, developed by the Apache Software Foundation, that implements MapReduce along with a distributed file system called HDFS (Hadoop Distributed File System). What makes MapReduce attractive is the ability to manage large-scale computations with fault tolerance: the developer only needs to implement the two functions Map and Reduce, while the system manages the parallel execution, coordinates the Map and Reduce tasks, and handles failures during execution.
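As a companion to the classes above, the sketch below shows the kind of job driver the developer writes: only the Map and Reduce classes are wired in, while Hadoop schedules the tasks, moves the intermediate data, and re-executes failed tasks. The class name and the argument paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word-count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenMapper.class);  // user-supplied Map
    job.setReducerClass(SumReducer.class);  // user-supplied Reduce
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output live in HDFS; args[0] and args[1] are placeholder paths.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // waitForCompletion() launches the tasks; the framework re-runs any that fail.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}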
2.2 DBScan
DBScan is a clustering method widely used in the scientific community. Its main idea is to grow clusters from each point that has at least a certain number of neighbors (minPoints) within a specified distance (eps), where minPoints and eps are input parameters. Finding suitable values for both can be a problem in itself, since they depend on the data being manipulated and on the knowledge to be discovered. The following definitions are used in the DBScan algorithm and will be needed in Section 4:
• Card(A): the cardinality of the set A.
• N_eps(o): p ∈ N_eps(o) if and only if the distance between p and o is less than or equal to eps.
• Directly Density-Reachable (DDR): o is DDR from p if o ∈ N_eps(p) and Card(N_eps(p)) ≥ minPoints.
• Density-Reachable (DR): o is DR from p if there is a chain of points {p_1, ..., p_n}, where p_1 = p, p_n = o, and p_{i+1} is DDR from p_i.
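To illustrate these definitions, the fragment below sketches the eps-neighborhood query N_eps(o) and the DDR test on which DBScan is built. It is a minimal in-memory sketch with an assumed Point type and Euclidean distance, not the distributed implementation proposed in this paper.

import java.util.ArrayList;
import java.util.List;

// Minimal 2D point; in this paper's scenario a point would also carry
// mobility-record attributes such as the street name.
record Point(double x, double y) {}

class DbscanDefinitions {
  // N_eps(o): all points p with distance(p, o) <= eps.
  static List<Point> epsNeighborhood(List<Point> data, Point o, double eps) {
    List<Point> neighbors = new ArrayList<>();
    for (Point p : data) {
      if (distance(p, o) <= eps) {
        neighbors.add(p);
      }
    }
    return neighbors;
  }

  // o is directly density-reachable (DDR) from p when o lies in N_eps(p)
  // and Card(N_eps(p)) >= minPoints, i.e. p is a core point.
  static boolean directlyDensityReachable(List<Point> data, Point o, Point p,
                                          double eps, int minPoints) {
    List<Point> np = epsNeighborhood(data, p, eps);
    return np.contains(o) && np.size() >= minPoints;
  }

  static double distance(Point a, Point b) {
    return Math.hypot(a.x() - b.x(), a.y() - b.y());
  }
}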