which has four arguments (Fig. 3):
DcColumn total = createColumn(
    "total", Categories, Double
);
total.setFormula("ACCUMULATE(
    facts=[LineItems],
    groups=[product].[category],
    measure=[amount],
    accumulator=SUM)");
The facts parameter specifies the table containing all
the records to be processed. The groups parameter is a
definition of a column of the facts table that returns
a group for each record. Note that in this example we
use an intermediate table to compute the group for each
line item: a line item has a product, which in turn
belongs to some category. The measure parameter is also
a column definition of the facts table, but its purpose
is to return the value to be accumulated. The fourth
parameter, accumulator, is essentially the definition of
the new total column: it specifies how the currently
stored value is updated. In this case, we used the
predefined function name SUM, which means that the total
column adds each new measure value to the currently
stored value for each new group element. In the general
case, it can be an arbitrary expression that updates the
current value of the column.
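The role of the four parameters can be illustrated with a minimal Java sketch. It is not DataCommandr code: the classes Product, LineItem and AccumulateSketch, the methods accumulate, totalsByCategory and sampleItems, and the sample data are all hypothetical names introduced here for illustration. The sketch also assumes an explicit initial value (0.0 for SUM) and keeps the result in a map that contains only the groups that actually occur, whereas in the text the total column belongs to the Categories table.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.DoubleBinaryOperator;
import java.util.function.Function;
import java.util.function.ToDoubleFunction;

// Hypothetical stand-ins for the Products and LineItems tables (not DataCommandr classes).
final class Product {
    final String name;
    final String category;
    Product(String name, String category) { this.name = name; this.category = category; }
}

final class LineItem {
    final Product product;
    final double amount;
    LineItem(Product product, double amount) { this.product = product; this.amount = amount; }
}

final class AccumulateSketch {

    // Illustrative analogue of ACCUMULATE(facts, groups, measure, accumulator).
    // The explicit 'initial' value is an assumption of this sketch.
    static <F, G> Map<G, Double> accumulate(
            List<F> facts,
            Function<F, G> groups,
            ToDoubleFunction<F> measure,
            DoubleBinaryOperator accumulator,
            double initial) {
        Map<G, Double> column = new LinkedHashMap<>();
        for (F fact : facts) {
            G group = groups.apply(fact);                      // which group this fact belongs to
            double current = column.getOrDefault(group, initial);
            double value = measure.applyAsDouble(fact);        // value to be accumulated
            // Update the stored value instead of overwriting it.
            column.put(group, accumulator.applyAsDouble(current, value));
        }
        return column;
    }

    // Mirrors the formula in the text: group by product.category, sum the amounts.
    static Map<String, Double> totalsByCategory(List<LineItem> lineItems) {
        return accumulate(
                lineItems,
                item -> item.product.category,   // groups=[product].[category]
                item -> item.amount,             // measure=[amount]
                Double::sum,                     // accumulator=SUM
                0.0);
    }

    // Made-up sample data for the usage example.
    static List<LineItem> sampleItems() {
        Product tea = new Product("tea", "drinks");
        Product coffee = new Product("coffee", "drinks");
        Product bread = new Product("bread", "food");
        return Arrays.asList(
                new LineItem(tea, 2.5),
                new LineItem(coffee, 4.0),
                new LineItem(bread, 3.0));
    }

    public static void main(String[] args) {
        System.out.println(totalsByCategory(sampleItems())); // prints {drinks=6.5, food=3.0}
    }
}
```

Passing a different accumulator, for example Math::max together with a suitable initial value, changes how the stored value is updated without touching the grouping or measure definitions.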
This approach to aggregation has the following
distinguishing features:
• Both the grouping criterion and the measure can
  be COEL expressions as opposed to using only
  primitive columns for groups and measures in
  the conventional group-by operator. This feature
  is especially important for complex ad-hoc
  analytics.
• Aggregation is a column definition rather than a
  special query construct. Such columns update
  their currently stored value for each new group
  element rather than overwrite the previous value.
7 CONCLUSIONS
In this paper, we presented the concept and
described an implementation of a novel approach to
data integration, transformation and analysis, called
DataCommandr. It is aimed at ad-hoc, agile and
explorative data processing but as a general-purpose
technology, it can be applied to a wider range of
tasks. This approach is based on the concept-
oriented model of data and its main distinguishing
feature is that it relies on column transformations as
opposed to table or cell transformations.
There are two major benefits of using
DataCommandr:
• Development Time. It reduces development time
  and maintenance costs while improving semantic
  clarity and code quality. COEL is not only a
  concise language, it also allows for better
  modularity of code. COEL is a simpler and more
  natural language that is very close to how
  spreadsheet applications work, yet it has the
  power of relational query languages when working
  with multiple tables and complex relationships.
• Run Time. DataCommandr can increase
  performance at run time because operations on
  columns are known to be much faster for
  analytical workloads than operations on
  row-oriented data. The new mechanisms of links
  and aggregation can decrease data processing
  time by avoiding unnecessary copy operations.
In this paper, we focused on the conceptual and
logical organization, which is important for the
agility of ad-hoc analytics. In the future, we plan to
focus on run-time issues such as the performance of
in-memory operations, partitioning, job management,
fault tolerance and scalability.
REFERENCES
Atzeni, P., Jensen, C.S., Orsi, G., Ram, S., Tanca, L., &
Torlone, R., 2013. The relational model is dead, SQL
is dead, and I don’t feel so good myself. ACM
SIGMOD Record, 42(2), 64–68.
Abadi, D.J., 2007. Column stores for wide and sparse data.
In Proceedings of the Conference on Innovative Data
Systems Research (CIDR), 292–297.
Boncz, P. (Ed.), 2012. Column store systems [Special
issue]. IEEE Data Eng. Bull., 35(1).
Chaudhuri, S., Dayal, U. & Narasayya, V., 2011. An
overview of Business Intelligence technology.
Communications of the ACM, 54(8), 88–98.
Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J.M.,
Welton, C., 2009. Mad skills: New analysis practices
for big data. In Proc. 35th International Conference
on Very Large Data Bases (VLDB 2009), 1481–1492.
Copeland, G.P., Khoshafian, S.N., 1985. A decomposition
storage model. In SIGMOD 1985, 268–279.
Dean, J., Ghemawat, S., 2004. MapReduce: Simplified
data processing on large clusters. In Sixth Symposium
on Operating System Design and Implementation
(OSDI'04), 137–150.
Kandel, S., Paepcke, A., Hellerstein, J., Heer, J., 2011.
Wrangler: Interactive Visual Specification of Data
Transformation Scripts. In Proc. ACM Human Factors
in Computing Systems (CHI), 3363–3372.
Krawatzeck, R., Dinter, B., Thi D.A.P., 2015. How to