In fact, $\tau = \tau_1 + (N-1)\cdot\tau_j = 10{,}000 + (N-1)\cdot 1 = 11{,}999$. The
estimation error of $\tau$ is large in options 1 and 2.
Let us show that with the same initial data, the
LA-AQP method allows estimating τ with a much
smaller error. In this case, we obtain (see item 3 of
Algorithm 2 in Section 4):
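$$\pi_1 = \frac{\tau_1 P_1}{\tau_1 P_1 + (N-1)\,\tau_j P_j} = \frac{10{,}000\cdot P_1}{10{,}000\cdot P_1 + (N-1)\cdot 1\cdot P_j} = 0.83;$$
$$\pi_j = \frac{\tau_j P_j}{\tau_1 P_1 + (N-1)\,\tau_j P_j} = \frac{1\cdot P_j}{10{,}000\cdot P_1 + (N-1)\cdot 1\cdot P_j} = 8.33\times 10^{-5}, \quad j = 2, \ldots, N.$$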
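The two probabilities above can be checked numerically. The sketch below is only an illustration: it assumes N = 2000 (inferred from τ = 11,999 = 10,000 + (N-1)) and equal values of P for all segments, which is what the reported 0.83 and 8.33E-5 imply. The function name `selection_probabilities` is hypothetical.

```python
def selection_probabilities(tau_est, p):
    """Return pi_g = tau_g * P_g / sum_k(tau_k * P_k) for every segment g."""
    weights = [t * q for t, q in zip(tau_est, p)]
    total = sum(weights)
    return [w / total for w in weights]


N = 2000                                 # assumption: inferred from tau = 11,999
tau_est = [10_000.0] + [1.0] * (N - 1)   # segment 1 holds 10,000, the rest hold 1
p = [1.0] * N                            # assumption: P_g is the same for all segments

pi = selection_probabilities(tau_est, p)
print(f"pi_1 = {pi[0]:.2f}, pi_j = {pi[1]:.2e}")
```

Because the probabilities are normalized by the same denominator, they sum to one regardless of the actual values of P.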
The probability that the 1st segment will not be
selected in n = 100 trials is $(1-\pi_1)^n = 1.1\times 10^{-77}$
(i.e., it is practically 0). We derive an aggregate value
estimate from formula (5):
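Option 1 - the 1st segment is selected only
once out of n = 100:
$$\tau(n) = \frac{1}{n}\left(\frac{\tau_1}{\pi_1} + (n-1)\cdot\frac{\tau_j}{\pi_j}\right) = \frac{1}{n}\left(\frac{10{,}000}{\pi_1} + (n-1)\cdot\frac{1}{\pi_j}\right) = 12{,}005;$$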
Option 2 - the 1st segment is selected 100
times out of n = 100:
$$\tau(n) = \frac{1}{n}\cdot n\cdot\frac{\tau_1}{\pi_1} = \frac{10{,}000}{\pi_1} = 12{,}048.$$
Option 3 - the 1st segment is not selected (this is
an almost impossible event):
$$\tau(n) = \frac{1}{n}\cdot n\cdot\frac{\tau_j}{\pi_j} = \frac{1}{\pi_j} = 12{,}004.$$
Therefore, for LA-AQP the error in calculating the
aggregate is small in options 1, 2 and 3 (the exact
value is τ = 11,999). The sample size n is not important
in this example and can even be equal to one. Nor does
it matter which segments are selected for processing:
the calculation error of the LA-AQP method will be
small in any case.
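The three options can be reproduced with a few lines of code. This is a minimal sketch, assuming formula (5) has the form τ(n) = (1/n)·Σ τ_i/π_i over the n selected segments, as the expressions for the options suggest. The function name `estimate_aggregate` is hypothetical, and N = 2000 with equal P is the same assumption as in the example.

```python
def estimate_aggregate(sampled_tau, sampled_pi):
    """Average of tau_i / pi_i over the n segments drawn with replacement."""
    n = len(sampled_tau)
    return sum(t / q for t, q in zip(sampled_tau, sampled_pi)) / n


n = 100
pi_1, pi_j = 0.83, 8.33e-5  # rounded probabilities, as used in the text

# Probability that segment 1 is never drawn in n trials.
miss = (1 - pi_1) ** n

option_1 = estimate_aggregate([10_000] + [1] * (n - 1), [pi_1] + [pi_j] * (n - 1))
option_2 = estimate_aggregate([10_000] * n, [pi_1] * n)
option_3 = estimate_aggregate([1] * n, [pi_j] * n)

# With unrounded probabilities (equal P, N = 2000) every ratio tau_g / pi_g
# equals the exact total, so any sample of any size returns 11,999.
total = 10_000 + 1999 * 1
exact_single_draw = estimate_aggregate([1], [1 / total])

print(f"miss probability: {miss:.1e}")
print(f"options: {option_1:.1f}, {option_2:.1f}, {option_3:.1f}")
print(f"single draw, unrounded: {exact_single_draw:.1f}")
```

Under these assumptions, the small differences between the three options come only from rounding π to two or three significant digits; with unrounded probabilities all three would coincide with the exact value.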
To achieve the same level of error in Sapprox,
it is necessary to significantly increase the sample
size n. It should be comparable to N.
The LA-AQP method allows specifying a more
complex search condition and a GROUP BY clause
in a query (see (7)). Sapprox allows only a conjunction
(AND) of elementary conditions (see (4)).
7 CONCLUSIONS
Existing systems with a lambda architecture
require constantly repeated batch updates to
accelerate the execution of new analytical queries.
This is time-consuming because each update scans a
large database. The developed approach avoids
creating batch views by introducing a metadata
level. The queries are executed promptly but with a
certain error. The developed LA-AQP method reduces
this error.
Expression (5) gives an unbiased estimate of the
aggregate $\tau$ for any probability distribution
$\{\pi_g\}$ with $\pi_g > 0$. The issue is the sample size. The
developed method for calculating $\{\pi_g\}$ makes it
possible to obtain good aggregate estimation
accuracy for small values of n. This is achieved
because estimates of the aggregate values in the
segments are used when calculating $\{\pi_g\}$. As
a result, the values $\tau_j/\pi_j$ in (5) become approximately
the same, which minimizes the sample
variance D(n) (see (6)).
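The effect of the choice of $\{\pi_g\}$ on the variance can be illustrated numerically. The sketch below assumes that (6) has the standard with-replacement form D(n) = (1/n)·Σ π_g·(τ_g/π_g - τ)², which is not reproduced in this section, and reuses the data of the example (N = 2000 is inferred). The uniform distribution is only an illustrative baseline, not a description of any particular system, and the function name is hypothetical.

```python
def expectation_and_variance(tau, pi):
    """Mean and variance of one draw tau_g / pi_g, where g is drawn with probability pi_g."""
    mean = sum(q * (t / q) for t, q in zip(tau, pi))
    var = sum(q * (t / q - mean) ** 2 for t, q in zip(tau, pi))
    return mean, var


N = 2000                                 # assumption: inferred from the example
tau = [10_000.0] + [1.0] * (N - 1)
total = sum(tau)

uniform = [1.0 / N] * N                  # illustrative baseline
proportional = [t / total for t in tau]  # pi_g proportional to the segment aggregate

n = 100
for name, pi in (("uniform", uniform), ("proportional", proportional)):
    mean, var = expectation_and_variance(tau, pi)
    # Variance of the n-draw average is the single-draw variance divided by n.
    print(f"{name}: E = {mean:.1f}, D(n) = {var / n:.3g}")
```

Both distributions give the same expectation, which is the unbiasedness property. Only the proportional one makes every ratio τ_g/π_g equal, so its variance vanishes up to floating-point error.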
The LA-AQP method allows executing queries
with a general search condition and with a grouping
(see (7)). For calculations, aggregate values are used
at the level of individual attributes and segments. The
overhead costs of obtaining such aggregates are low:
they accumulate as data arrives in the stream. The
accuracy of the general aggregates increases
approximately twofold as compared with the Sapprox
method.
Future work includes developing a method
for processing queries at the Speed Layer of a Lambda
Architecture system.
REFERENCES
Agarwal, S., Mozafari, B., Panda, A., Milner, H., Madden,
S., and Stoica, I. (2013). BlinkDB: Queries with
bounded errors and bounded response times on very
large data. In Proceedings of the 8th ACM European
Conference on Computer Systems, EuroSys '13, pages
29–42, New York, NY, USA. ACM.
Cormode, G. et al. (2011). Synopses for massive data:
Samples, histograms, wavelets, sketches. Foundations
and Trends in Databases, 4(1–3), pages 1–294.
Cox-Buday, K. (2017). Concurrency in Go: Tools and
Techniques for Developers. O'Reilly Media, Inc.
Donovan, A. A. A. and Kernighan, B. W. (2015). The Go
Programming Language. Addison-Wesley Professional.
Gribaudo, M., Iacono, M., Kiran M. A. (2018).
Performance modeling framework for lambda
architecture based applications. Future Generation
Computer Systems, 86, pages 1032–1041.
Goiri, I., Bianchini, R., Nagarakatte, S., and Nguyen, T. D.
(2015). ApproxHadoop: Bringing approximations to
MapReduce frameworks. In Proceedings of the
Twentieth International Conference on Architectural
Support for Programming Languages and Operating