Authors:
Sheng-Tzong Cheng; Jian-Ting Chen and Yin-Chun Chen
Affiliation:
Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan
Keyword(s):
Big Data, Deduplication, In-Memory Computing, Spark.
Abstract:
In the current information era, it is difficult to exploit and process large volumes of data efficiently.
MapReduce is no longer adequate for handling more data in less time, let alone in real time. In-Memory
Computing (IMC) was therefore introduced to address this limitation of Hadoop MapReduce. As its name
suggests, IMC performs computation in memory to avoid the cost of Hadoop's excessive disk access, and it
can be distributed to perform iterative operations. However, distributed IMC still faces a bottleneck:
network bandwidth, which restricts the speed of receiving information from the source and dispersing it
to each node. We observe that some data from sensor devices may be duplicated because of temporal or
spatial dependence. Deduplication technology is therefore a promising solution.
Deduplication, a technique for eliminating duplicated data, can improve data utilization. This study
presents an optimization of Spark Streaming, a distributed real-time IMC platform. It uses deduplication
to eliminate possible duplicate blocks at the source, and is expected to reduce redundant data
transmission and improve the throughput of Spark Streaming.
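
The idea of eliminating duplicate blocks at the source can be illustrated with a minimal sketch. The abstract does not specify the block size, the fingerprint function, or the message format, so everything below is an assumption made for illustration: fixed-size blocks, SHA-256 fingerprints, and the hypothetical names `deduplicate` and `restore`. It is not the authors' implementation and does not use any Spark API.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical block size in bytes, chosen small for the demo


def deduplicate(data: bytes, seen: set, block_size: int = BLOCK_SIZE) -> list:
    """Sender side: emit each new block once, and a short reference for repeats."""
    messages = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest in seen:
            # Block was already sent: transmit only its fingerprint.
            messages.append(("ref", digest))
        else:
            seen.add(digest)
            messages.append(("data", block))
    return messages


def restore(messages: list, store: dict) -> bytes:
    """Receiver side: rebuild the original bytes from data and reference messages."""
    parts = []
    for kind, payload in messages:
        if kind == "data":
            # Remember the block under its fingerprint for later references.
            store[hashlib.sha256(payload).hexdigest()] = payload
            parts.append(payload)
        else:
            parts.append(store[payload])
    return b"".join(parts)


if __name__ == "__main__":
    sent = deduplicate(b"AAAABBBBAAAA", set())
    print([kind for kind, _ in sent])
    print(restore(sent, {}))
```

In this sketch the third block repeats the first, so only its fingerprint is transmitted. Whether that saves bandwidth in practice depends on the block size being larger than the fingerprint, which is a design choice the abstract leaves open.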