words $S_n = (W_1, W_2, W_3, \ldots, W_m)$, where $W_n$ is a single word.
Algorithm 1: Pseudo-code for n-gram extraction.
1: function MAP(list sentence, int size)            ▷ map
2:     return rangeClosed(1, size)
3:         .map(n → sliding(sentence, n, size))
4: end function
5: function REDUCE(stream ngrams)                   ▷ reduce
6:     return ngrams.collect(groupByCount())
7: end function
8: function SLIDING(list sentence, int n, int size)
9:     return rangeClosed(0, size - n)
10:        .map(idx → join(" ", sentence.subList(idx, idx + n)))
11: end function
We implemented the sliding(), map(), and reduce() functions; the pseudo-code is shown in Algorithm 1. The map() function takes the list of sentences $S$ and, for each sentence $S_i$, executes the sliding() function with the parameter $n = (1, 2, \ldots, m)$, where $n$ is the size of the slides (n-grams) that the function will produce and $m$ is the number of words $W$ in sentence $S_i$. The reduce() function takes the output of map(), which is the list of n-grams (a list of lists of words), and counts identical ones. As a result, it returns a list of pairs $(\text{n-gram}, v)$, usually called a Map, where $v$ is the frequency of a particular n-gram in the text. This approach allows the map() calls to execute independently and avoids communication between nodes until the reduce() stage.
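To make the pseudo-code concrete, Algorithm 1 can be written as a small, self-contained Java program using the streams API. The sketch below is illustrative only; the class name and the example sentence are assumptions, not part of the library described above.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class NgramExtraction {

    // sliding(): all n-grams of length n from a sentence of the given size.
    static Stream<String> sliding(List<String> sentence, int n, int size) {
        return IntStream.rangeClosed(0, size - n)
                .mapToObj(idx -> String.join(" ", sentence.subList(idx, idx + n)));
    }

    // map(): for one sentence, emit n-grams for every n from 1 to the sentence length.
    static Stream<String> map(List<String> sentence) {
        int size = sentence.size();
        return IntStream.rangeClosed(1, size).boxed()
                .flatMap(n -> sliding(sentence, n, size));
    }

    // reduce(): group equal n-grams and count occurrences, yielding (n-gram, v) pairs.
    static Map<String, Long> reduce(Stream<String> ngrams) {
        return ngrams.collect(
                Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList("to", "be", "or", "not", "to", "be");
        reduce(map(sentence))
                .forEach((ngram, v) -> System.out.println(ngram + " -> " + v));
        // prints, among others: "to be" -> 2, "not" -> 1, "or not to" -> 1
    }
}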
2.1 MPJ Express Implementation
We use a transactional queue synchronization pattern for process and task management in MPJ Express. Transactions guarantee message delivery to a single recipient; thus, in case of failure, a transaction can be aborted and the message delivered to another worker. We fill the queue with file names; each node takes a file name from the queue, processes the corresponding file, and then takes another one. This solution provides reliability, fault tolerance, and scalability. A further benefit is that workers can be implemented in any language or on any platform, not only as JVM-based applications.
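A minimal sketch of this master/worker queue, assuming MPJ Express's mpiJava-style API, is shown below. The tags, message layout, and file names are assumptions, and the transactional abort/redelivery logic is omitted for brevity: rank 0 owns the queue and workers pull one file name at a time.

import java.util.ArrayDeque;
import java.util.Deque;
import mpi.MPI;
import mpi.Status;

public class QueueWorkers {
    static final int TAG_REQUEST = 1, TAG_TASK = 2, TAG_STOP = 3;

    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int size = MPI.COMM_WORLD.Size();
        String[] buf = new String[1];

        if (rank == 0) {
            // Master: owns the queue of file names, hands out one task per request.
            Deque<String> queue = new ArrayDeque<>();
            queue.add("part-0001.txt");  // illustrative file names
            queue.add("part-0002.txt");
            int active = size - 1;
            while (active > 0) {
                Status s = MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.OBJECT,
                                               MPI.ANY_SOURCE, TAG_REQUEST);
                if (!queue.isEmpty()) {
                    buf[0] = queue.poll();
                    MPI.COMM_WORLD.Send(buf, 0, 1, MPI.OBJECT, s.source, TAG_TASK);
                } else {
                    // Queue drained: tell this worker to stop.
                    MPI.COMM_WORLD.Send(buf, 0, 1, MPI.OBJECT, s.source, TAG_STOP);
                    active--;
                }
            }
        } else {
            // Worker: request a file name, process it, repeat until told to stop.
            while (true) {
                MPI.COMM_WORLD.Send(buf, 0, 1, MPI.OBJECT, 0, TAG_REQUEST);
                Status s = MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.OBJECT, 0, MPI.ANY_TAG);
                if (s.tag == TAG_STOP) break;
                processFile(buf[0]);  // n-gram extraction over one file
            }
        }
        MPI.Finalize();
    }

    static void processFile(String name) { /* extract n-grams from this file */ }
}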
2.2 Apache Hadoop Implementation
The word count application is probably the most common example of Apache Hadoop usage. We use the algorithm from the official Apache Hadoop community site as the basis for our task. The counting part remains almost the same, with adaptations for the n-gram extraction task. Pseudo-code of the map() and reduce() functions is shown in Algorithm 2.
Algorithm 2: Pseudo-code of the Apache Hadoop implementation.
1: function MAP(key, value, context)                ▷ map
2:     list sentences = getSentences(value)
3:     for (sentence : sentences) do
4:         for (n ← 1, sentence.size()) do
5:             ngrams = sliding(sentence, n)
6:             ngrams.forEach(ngram → write(ngram, 1))
7:         end for
8:     end for
9: end function
10: function REDUCE(key, values, context)           ▷ reduce
11:     sum = 0
12:     for (value : values) do
13:         sum += value.get()
14:     end for
15:     result.set(sum)
16:     write(key, result)
17: end function
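A sketch of how Algorithm 2 maps onto Hadoop's Java API is given below. The class names are illustrative, and the naive getSentences() and sliding() helpers stand in for the library calls used above.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NgramJob {

    // Mapper: split each input value into sentences, emit every n-gram with count 1.
    public static class NgramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text ngramKey = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (List<String> sentence : getSentences(value.toString())) {
                for (int n = 1; n <= sentence.size(); n++) {
                    for (String ngram : sliding(sentence, n)) {
                        ngramKey.set(ngram);
                        context.write(ngramKey, ONE);
                    }
                }
            }
        }
    }

    // Reducer: sum the counts emitted for each distinct n-gram.
    public static class NgramReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) sum += value.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    // Naive stand-ins for the library helpers: split on '.' and whitespace.
    static List<List<String>> getSentences(String text) {
        List<List<String>> out = new ArrayList<>();
        for (String s : text.split("\\.")) {
            String t = s.trim();
            if (!t.isEmpty()) out.add(Arrays.asList(t.split("\\s+")));
        }
        return out;
    }

    static List<String> sliding(List<String> sentence, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= sentence.size(); i++)
            out.add(String.join(" ", sentence.subList(i, i + n)));
        return out;
    }
}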
2.3 Apache Spark Implementation
Applications based on the Apache Spark framework can be implemented in one of three programming languages: Python, Scala, or Java. Scala is a functional programming language that executes entirely on the JVM. It provides advantages such as immutable data structures, type safety, and pure functions. These language features simplify code parallelization by reducing the biggest headaches in distributed programming, such as race conditions, deadlocks, and other well-known problems. As a result, code written in Scala becomes cleaner, shorter, and easier to read. Our Apache Spark application was implemented in Scala; pseudo-code of the implementation is given in Algorithm 3.
Algorithm 3: Pseudo-code of the Apache Spark map and reduce functions.
1: function MAP(sentence, N)                        ▷ map
2:     (1 to N).toStream.flatMap(n ⇒ sentence.sliding(n))
3: end function
4: function REDUCE(text)                            ▷ reduce
5:     text.flatMap(sentence ⇒ sliding(sentence))
6:         .map(ngram ⇒ (ngram, 1))
7:         .reduceByKey((a, b) ⇒ a + b)
8:         .saveAsTextFile()
9: end function
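As noted above, Spark applications may also be written in Java, so for illustration the same pipeline can be expressed through the Spark 2.x Java API. The sketch below is an assumption-laden equivalent, not our Scala implementation: the input and output paths and the sliding() helper are hypothetical, and each input line is treated as one sentence.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkNgrams {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ngram-extraction");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each input line is treated as one sentence (illustrative HDFS paths).
        JavaPairRDD<String, Integer> counts = sc.textFile("hdfs:///input/corpus.txt")
                .flatMap(sentence -> sliding(sentence).iterator())  // all n-grams
                .mapToPair(ngram -> new Tuple2<>(ngram, 1))         // (n-gram, 1)
                .reduceByKey((a, b) -> a + b);                      // sum per n-gram
        counts.saveAsTextFile("hdfs:///output/ngrams");
        sc.stop();
    }

    // All n-grams (n = 1 .. word count) of one sentence, space-joined.
    static List<String> sliding(String sentence) {
        String[] words = sentence.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= words.length; n++)
            for (int i = 0; i + n <= words.length; i++)
                out.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        return out;
    }
}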
2.4 Assumptions
We implemented an n-gram extraction library that is used by Hadoop and MPJ, while Spark uses built-in functions. All implementations use HDFS for