Authors:
Parijat Shukla ¹ and Arun K. Somani ²
Affiliations:
¹ Xilinx, Inc., HITEC City, Hyderabad, India; ² Dept. of Electrical and Computer Engineering, Iowa State University, Ames, Iowa, U.S.A.
Keyword(s):
Deduplication, Semi-structured Data, NoSQL, Big Data, Parallel Processing, GPGPU, Data Shaping.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Business Analytics; Data Analytics; Data Engineering; Data Reduction and Quality Assessment; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Mining Text and Semi-Structured Data; Pre-Processing and Post-Processing for Data Mining; Symbolic Systems
Abstract:
Several Big Data problems involve computing similarities between entities, such as records and documents, in a timely manner. Recent studies indicate that similarity-based deduplication techniques are efficient for document databases. Delta encoding-like techniques are commonly used to compute similarities between documents, and operational requirements impose low-latency constraints. Previous research does not consider parallel computing to deliver low-latency delta encoding solutions. This paper makes a two-fold contribution in the context of the delta encoding problem in document databases: (1) it develops a parallel processing-based technique to compute similarities between documents, and (2) it designs a GPU-based document cache framework to accelerate the delta encoding pipeline. We experiment with real datasets and achieve a throughput of more than 500 similarity computations per millisecond. The similarity computation framework achieves a throughput in the range of 237-312 MB per second, which is up to 10X higher than that of hashing-based approaches.
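As a rough illustration of what a delta encoding-like similarity between documents can look like, the sketch below scores a candidate document by the fraction of its bytes that could be copied from a source document, and evaluates several candidates concurrently. It is not the authors' technique or their GPU framework: the function names `delta_similarity` and `pairwise_similarities`, the use of Python's `difflib`, and the thread pool are all assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from typing import List, Sequence


def delta_similarity(source: bytes, target: bytes) -> float:
    """Fraction of `target` covered by in-order matches with `source` (0.0 to 1.0).

    Hypothetical stand-in for a delta encoder: matched bytes are the ones a
    delta could express as "copy from source" instead of storing literally.
    """
    if not target:
        return 1.0  # an empty target needs no literal bytes
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    copied = sum(block.size for block in matcher.get_matching_blocks())
    return copied / len(target)


def pairwise_similarities(
    source: bytes, candidates: Sequence[bytes], workers: int = 4
) -> List[float]:
    """Score every candidate against `source`, one task per candidate.

    A thread pool only illustrates the task structure (each comparison is
    independent); it does not model the GPU parallelism the abstract describes.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda doc: delta_similarity(source, doc), candidates))


if __name__ == "__main__":
    base = b'{"user": "alice", "city": "Ames", "visits": 10}'
    docs = [
        b'{"user": "alice", "city": "Ames", "visits": 11}',
        b'{"user": "bob", "city": "Hyderabad", "visits": 3}',
    ]
    for doc, score in zip(docs, pairwise_similarities(base, docs)):
        print(f"{score:.2f}  {doc.decode()}")
```

Because each source-candidate comparison is independent of the others, the workload splits naturally into parallel tasks, which is the property a parallel or GPU-based implementation would exploit.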