order consisting of a number of order items can be
computed using the following SQL query:
SELECT SUM(i.amount) AS total
FROM OrderItems AS i
GROUP BY i.order -- Group definition
A vertical (aggregation) operation can also be
formally represented as a function. However, this
function takes set-valued arguments rather than the
single values taken by horizontal operations.
In the above example, this function is written as
the following expression:
total(group) = SUM(group)
where group is a subset of order items from the
OrderItems table belonging to one order. Vertical
operations are used
for complex data processing and analysis. They are
traditionally more difficult to understand and use
than horizontal operations, and most of the
difficulties are due to the grouping mechanism and
user-defined aggregate functions.
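The correspondence between the group-by query and the function total(group) = SUM(group) can be illustrated with a minimal sketch. It assumes an in-memory SQLite database and invented sample rows, and it names the grouping column order_id (a hypothetical name, chosen because ORDER is a reserved word in SQL).

```python
import sqlite3

# Illustrative data only: three order items belonging to two orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE OrderItems (order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO OrderItems VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

# Vertical operation expressed with group-by (order_id is an assumed name).
rows = conn.execute(
    "SELECT i.order_id, SUM(i.amount) AS total "
    "FROM OrderItems AS i "
    "GROUP BY i.order_id "
    "ORDER BY i.order_id"
).fetchall()
sql_totals = dict(rows)


def total(group):
    """Aggregate function taking a set-valued argument (a whole group)."""
    return sum(group)


# Build the groups explicitly, then apply the function to each group.
groups = {}
for order_id, amount in conn.execute("SELECT order_id, amount FROM OrderItems"):
    groups.setdefault(order_id, []).append(amount)
fn_totals = {order_id: total(group) for order_id, group in groups.items()}
```

Both sql_totals and fn_totals map each order to the sum of its items. The only difference is where the groups are defined: in the GROUP BY clause or in the explicit dictionary of groups.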
1.2 Related Work
Relational algebra (Codd, 1970) is intended for
manipulating relations, that is, it provides operations
which take relations as input and produce a new
relation as output. This formalism does not provide
dedicated means for data aggregation because it
belongs to a set-oriented approach where the main
unit is a set rather than a value. In particular,
it is not obvious how to define and manipulate
dynamically defined groups of tuples, that is, subsets
which depend on values in other tuples.
Theoretically, it could be done by introducing nested
relations, complex objects and relation-valued
attributes (see e.g. Abiteboul et al., 1989) but any
such modification makes the model significantly
more complicated and actually quite different from
the original relational approach.
Since aggregation is obviously a highly
important operation, aggregate functions were
introduced in early relational DBMSs, and support
for them was added to SQL (Database Languages - SQL,
2003) in the form of a dedicated group-by operator.
Importantly,
group-by is not a formal part of the relational model
but rather is a construct of a query language that
supports this model but can also support other data
models. In other words, group-by is not a specific
feature of the relational model and actually does not
rely on its main principles. In particular, it has been
successfully implemented in many other models,
query languages, database management systems and
data processing frameworks. Nowadays, in the
absence of other approaches, group-by is not merely
a formal operation but rather a dominant pattern of
thought for the concept of data aggregation.
An alternative approach to aggregation is based
on using correlated queries where the inner query is
parameterized by a value provided by the outer
query. This parameter is interpreted as a group
identifier so that the outer query iterates through all
the groups while the inner query iterates through the
group members by aggregating all of them into one
value. This approach still needs the group-by
operator, but it is conceptually interesting
because it better separates the different aspects
of data aggregation.
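A correlated query of this kind might look as follows. This is a sketch under stated assumptions: the Orders table, the order_id column and the sample rows are all hypothetical, and SQLite is used only because it ships with Python. The inner query has no explicit GROUP BY clause; SQL treats the rows selected by the correlation condition as one implicit group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: Orders provides the group identifiers,
# OrderItems provides the group members.
conn.execute("CREATE TABLE Orders (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE OrderItems (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO Orders VALUES (?)", [(1,), (2,), (3,)])
conn.executemany(
    "INSERT INTO OrderItems VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

# The outer query iterates through the groups (orders); the inner query is
# parameterized by o.id and aggregates the members of that one group.
correlated_totals = conn.execute(
    """
    SELECT o.id,
           (SELECT SUM(i.amount)
            FROM OrderItems AS i
            WHERE i.order_id = o.id) AS total
    FROM Orders AS o
    ORDER BY o.id
    """
).fetchall()
```

Unlike a plain group-by over OrderItems, the outer query here also returns order 3, which has no items; its total is NULL (None in Python) because the inner query aggregates an empty group.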
Aggregation is also a crucial part of the map-
reduce data processing paradigm (Dean and
Ghemawat, 2004) where map is a horizontal
operation and reduce is a vertical operation. Its main
advantage is that it allows for almost arbitrarily
complex data processing scenarios due to the
complete control over data aggregation and the
natively supported mechanism of user-defined
aggregations. Yet, map-reduce is much closer to
programming than to data processing because
manual loops are required with direct access to the
data being processed.
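The division of labour between map and reduce can be sketched in plain Python. This is a single-process illustration and not the distributed system described by Dean and Ghemawat; the function names map_item, reduce_amounts and map_reduce, as well as the sample records, are hypothetical.

```python
from collections import defaultdict


def map_item(item):
    """Horizontal operation: one record in, one (key, value) pair out."""
    return item["order_id"], item["amount"]


def reduce_amounts(order_id, amounts):
    """Vertical operation: a user-defined aggregation over one group.

    The explicit loop has direct access to every group member.
    """
    total = 0.0
    for amount in amounts:
        total += amount
    return order_id, total


def map_reduce(items):
    # Shuffle step: collect the mapped values under their keys.
    shuffled = defaultdict(list)
    for item in items:
        key, value = map_item(item)
        shuffled[key].append(value)
    # Reduce step: aggregate each group into a single value.
    return dict(reduce_amounts(key, values) for key, values in shuffled.items())


order_items = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 1, "amount": 5.0},
    {"order_id": 2, "amount": 7.5},
]
totals = map_reduce(order_items)
```

The body of reduce_amounts can be replaced by arbitrary code, which is the source of the flexibility noted above, but it also means the aggregation is written as a manual loop rather than declared.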
Aggregation is an integral part of many other
data processing frameworks like pandas (McKinney,
2010; McKinney, 2011), R or Spark SQL (Armbrust et
al., 2015), which rely on data frames as the primary
data structure. One of their specific features is that
they provide a separate operation for grouping
elements of a data frame so that different aggregate
functions can then be applied to these groups in a
subsequent operation. This approach is also closer
to programming models than to data models
because user-defined aggregations still require direct
access to and explicit loops through the group
elements.
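The two-step pattern (group first, aggregate afterwards) can be sketched with pandas. The data frame contents and the amount_range function are illustrative assumptions and do not come from the frameworks' documentation or from this paper.

```python
import pandas as pd

# Illustrative data: three order items belonging to two orders.
order_items = pd.DataFrame(
    {"order_id": [1, 1, 2], "amount": [10.0, 5.0, 7.5]}
)

# Step 1: grouping is a separate operation that yields a grouped object.
groups = order_items.groupby("order_id")["amount"]

# Step 2: different built-in aggregate functions are applied to the groups.
totals = groups.sum()
largest = groups.max()


def amount_range(amounts):
    """User-defined aggregation with an explicit loop over group elements."""
    lowest = highest = None
    for amount in amounts:
        if lowest is None or amount < lowest:
            lowest = amount
        if highest is None or amount > highest:
            highest = amount
    return highest - lowest


# A user-defined aggregation is passed to the same grouped object.
ranges = groups.agg(amount_range)
```

The grouped object is reused for several aggregations, while the user-defined one still has to loop through the elements of each group directly.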
1.3 Goals and Contribution
This paper is devoted to the problem of data
aggregation. We discuss this mechanism at the logical
level of data representation and processing, not at
the physical level, where numerous implementations
and optimization techniques exist that take into
account various hardware architectures and network
properties. The main mechanism for data aggregation,
which has dominated other approaches for decades, is
the group-by
operation. Yet, despite its wide adoption, this
operation has some serious conceptual drawbacks. In
particular, group-by does not naturally fit into the
relational (set-oriented) setting and looks more like a