The edge computing paradigm is analyzed in section
3, where we explain the use of reservoir sampling
(algorithm R) in the proposed scheme and attempt to
quantify the era duration of an event. In section 4 we
discuss implementation issues such as the use of
Varnish Cache. Section 5 describes the metrics and
results of our simulation. Finally, section 6 concludes
the paper.
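Since the proposed scheme relies on reservoir sampling (algorithm R), a minimal Python sketch of that classic algorithm is given below for orientation; the stream and the reservoir size k are placeholders, not the parameters used in section 3.

    import random

    def reservoir_sample(stream, k):
        # Algorithm R: maintain a uniform random sample of k items
        # from a stream of unknown length.
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                # Fill the reservoir with the first k items.
                reservoir.append(item)
            else:
                # Replace a random slot with probability k / (i + 1).
                j = random.randint(0, i)
                if j < k:
                    reservoir[j] = item
        return reservoir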
2 RELATED WORK
Related work primarily addresses the problem of
turning big data into small data to meet scalability
requirements across different infrastructure types.
Di Martino, Aversa, Cretella, Esposito &
Kolodziej (2014) survey developments in cloud
computing concerning the big data issue, offer a
critical analysis, and point toward a new generation
of multi-datacenter cloud architectures for storage
and management. The survey presents several cloud
platforms offering big data-oriented services, such as
PiCloud, Google BigQuery, and Amazon Elastic
MapReduce. Furthermore, it attempts to classify the
services related to big data management, such as data
collection, curation, integration and aggregation,
storage, and analysis and interpretation, across the
different cloud providers. It concludes that
distributing data applications across geographically
dispersed data centers appears to be a good solution
for the efficient management of big data in the clouds.
Tao, Jin, Tang, Ji & Zhang (2020) address the
problem of network resource redundancy and
overload in the IoT architecture. They propose a
cloud-edge collaborative architecture that combines
cloud and edge computing, centralized and
decentralized respectively, aiming to satisfy the
requirements for computing power and real-time
analysis of large volumes of local data. Moreover,
they combine the complex network and data access
with the management requirements of the IoT. The
Power IoT architecture uses four layers: the
perception layer, the network layer, the platform
layer, and the application layer. Nevertheless,
collaboration and coordination of computing tasks
remain problematic between the platform layer, the
application layer, and the edge computing network,
not to mention the increasing cost of constructing,
operating, and maintaining the system.
Zhou, Liu & Li (2013) examine the net effect of
using deduplication for big data workloads,
considering the increasing complexity of the data
handling process, and elaborate on the advantages
and disadvantages of different deduplication layers
(local and global). The term ‘local deduplication
layer’ means that deduplication is applied only
within a single VM, so the mechanism can detect
replicas only within a single node. The term ‘global
deduplication layer’ means that the deduplication
technique is applied across different VMs. In the first
case, different VMs are assigned to different ZFS (a
file system with built-in deduplication) pools, and in
the second case, all VMs are assigned to the same
ZFS pool. Local deduplication cannot remove all
replicas, so its performance degrades as the active
dataset grows; it improves slightly when more nodes
are deployed, because local deduplication can exploit
parallelism for hash computation and indexing, and
it also maintains data availability. Global
deduplication, by contrast, removes replicas across
VMs and yields a positive performance result.
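To make the local/global distinction concrete, the following Python sketch shows fingerprint-based deduplication in which only the scope of the index differs; the chunk representation, VM names, and index structures are illustrative assumptions, not the actual setup evaluated by Zhou, Liu & Li.

    import hashlib

    def dedup(chunks, index):
        # Store a chunk only if its fingerprint is not already indexed.
        stored = []
        for chunk in chunks:
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in index:
                index[fp] = True
                stored.append(chunk)
        return stored

    # Local deduplication: one index per VM, so replicas shared
    # between VMs go undetected.
    local_indexes = {"vm1": {}, "vm2": {}}
    dedup([b"block-a", b"block-a"], local_indexes["vm1"])  # one copy kept
    dedup([b"block-a"], local_indexes["vm2"])              # duplicate stored again

    # Global deduplication: all VMs share one index, so the
    # cross-VM replica above would have been eliminated.
    global_index = {}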
Xia et al. (2011) present a near-exact
deduplication system, named SiLo, which, under
various workload conditions, exploits similarity and
locality in order to achieve high throughput and
duplicate elimination with low RAM usage. SiLo
exploits similarity by grouping correlated small files
and segmenting large files. In addition, it exploits
locality in the backup stream by grouping contiguous
segments into blocks in order to capture duplicate or
similar data missed by similarity detection, as
sketched below.
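The grouping can be sketched roughly as follows; the segment and block sizes and the flat file representation are assumptions made for illustration and do not reproduce SiLo's actual data structures.

    def build_segments(files, segment_size):
        # Similarity: pack small files together and split large files
        # so that every segment has roughly the same size.
        segments, current, size = [], [], 0
        for name, data in files:
            for i in range(0, len(data), segment_size):
                piece = data[i:i + segment_size]
                current.append((name, piece))
                size += len(piece)
                if size >= segment_size:
                    segments.append(current)
                    current, size = [], 0
        if current:
            segments.append(current)
        return segments

    def build_blocks(segments, segments_per_block):
        # Locality: keep contiguous segments of the backup stream
        # together so duplicates missed by similarity detection
        # can still be caught within a block.
        return [segments[i:i + segments_per_block]
                for i in range(0, len(segments), segments_per_block)]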
Hillman, Ahmad, Whitehorn & Cobley (2014)
elaborate on a near real-time solution for big data
preprocessing using Hadoop and MapReduce. The
basic idea is to use parallel compute clusters and
programming methods to handle large data volumes
and complexity within a reasonable time frame. The
case study draws on the vast volume of data
generated in the study of genes and their product
proteins, which must be preprocessed. Hadoop
handles the raw data, while Java code and
MapReduce perform the preprocessing, identifying
2D and 3D peaks in the Gaussian curves produced by
mass spectrometer data. As a result, a Map task
greatly reduces the datasets, and completion times
drop substantially compared to a conventional
PC-based process.
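As a simplified illustration of the reduction performed in the Map stage, the sketch below keeps only local intensity maxima above a threshold; the one-dimensional peak test and the threshold are assumptions, far simpler than the 2D and 3D Gaussian peak detection described by Hillman et al.

    def map_peaks(records, threshold=0.5):
        # Map task: emit only (m/z, intensity) points that are local
        # maxima above the threshold, discarding the bulk of the
        # raw spectrometer readings.
        peaks = []
        for (_, y0), (x1, y1), (_, y2) in zip(records, records[1:], records[2:]):
            if y1 > y0 and y1 > y2 and y1 >= threshold:
                peaks.append((x1, y1))
        return peaks

    # Five raw readings shrink to the two genuine peaks.
    raw = [(1, 0.1), (2, 0.9), (3, 0.2), (4, 1.5), (5, 0.3)]
    print(map_peaks(raw))  # [(2, 0.9), (4, 1.5)]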
Using preprocessing tools and a cloud
environment, Sugumaran, Burnett & Blinkmann
(2012) developed and implemented a web-based
LiDAR (Light Detection and Ranging) data
processing system. The implementation of this
system, called CLiPS (Cloud Computing-based
LiDAR Processing System), showed that the
processing time for three types of