Authors: Sanzhar Aubakirov 1 ; Paulo Trigo 2 and Darhan Ahmed-Zaki 1

Affiliations: 1 al-Farabi Kazakh National University, Kazakhstan ; 2 Instituto Superior de Engenharia de Lisboa and Biosystems and Integrative Sciences Institute / Agent and Systems Modeling, Portugal

Keyword(s): distributed computing, text processing, n-gram extraction

Related Ontology Subjects/Areas/Topics: Business Analytics ; Data Engineering ; Data Management and Quality ; Statistics Exploratory Data Analysis ; Text Analytics

Abstract: In this paper we compare different technologies that support distributed computing as a means to address complex tasks. We address the task of n-gram text extraction, which is a big computational task given a large amount of textual data to process. To deal with such complexity we have to adopt and implement parallelization patterns. Nowadays there are several patterns, platforms and even languages that can be used for the parallelization task. We implemented this task on three platforms: (1) MPJ Express, (2) Apache Hadoop, and (3) Apache Spark. The experiments were implemented using two kinds of datasets composed of: (A) a large number of small files, and (B) a small number of large files. Each experiment uses both datasets and is repeated for a set of different file sizes. We compared performance and efficiency among MPJ Express, Apache Hadoop and Apache Spark. As a final result we are able to provide guidelines for choosing the platform that is best suited for each kind of data set with regard to its overall size and the granularity of the input data.
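To make the task in the abstract concrete, the following is a minimal single-machine sketch of n-gram extraction split into a per-document counting step and a merging step, which is the general shape that map/reduce-style platforms parallelize. It is an illustration only, not the authors' implementation: the function names (`extract_ngrams`, `count_ngrams`, `merge_counts`), the lowercasing, and the whitespace tokenization are all assumptions introduced here.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def extract_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous window of n tokens (empty if too few tokens)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(text: str, n: int) -> Counter:
    """Per-document step: count the n-grams of one text.

    Lowercasing and whitespace tokenization are assumptions of this sketch.
    """
    return Counter(extract_ngrams(text.lower().split(), n))


def merge_counts(partials: Iterable[Counter]) -> Counter:
    """Merging step: sum per-document counts into one global count."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    docs = ["a b a b", "a b c"]  # hypothetical toy input
    bigrams = merge_counts(count_ngrams(d, 2) for d in docs)
    print(bigrams.most_common(3))
```

Because each document is counted independently, the per-document step can in principle be distributed across workers, with only the merged counts needing to be combined afterwards.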

CC BY-NC-ND 4.0

Paper citation in several formats:
Aubakirov, S.; Trigo, P. and Ahmed-Zaki, D. (2016). Comparison of Distributed Computing Approaches to Complexity of n-gram Extraction. In Proceedings of the 5th International Conference on Data Management Technologies and Applications - DATA; ISBN 978-989-758-193-9; ISSN 2184-285X, SciTePress, pages 25-30. DOI: 10.5220/0005943000250030

@conference{data16,
author={Sanzhar Aubakirov and Paulo Trigo and Darhan Ahmed{-}Zaki},
title={Comparison of Distributed Computing Approaches to Complexity of n-gram Extraction},
booktitle={Proceedings of the 5th International Conference on Data Management Technologies and Applications - DATA},
year={2016},
pages={25--30},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005943000250030},
isbn={978-989-758-193-9},
issn={2184-285X},
}

TY - CONF

JO - Proceedings of the 5th International Conference on Data Management Technologies and Applications - DATA
TI - Comparison of Distributed Computing Approaches to Complexity of n-gram Extraction
SN - 978-989-758-193-9
IS - 2184-285X
AU - Aubakirov, S.
AU - Trigo, P.
AU - Ahmed-Zaki, D.
PY - 2016
SP - 25
EP - 30
DO - 10.5220/0005943000250030
PB - SciTePress