NRank: A Unified Platform Independent Approach
for Top-K Algorithms
Martin Čech
1
and Jaroslav Pokorný
2
1
Lekis s.r.o., Pražská 126, 256 01 Benešov, Czech Republic
2
Faculty of Mathematics and Physics, Charles University, Malostranské nám. 25, Prague, Czech Republic
Keywords: Top-K Query, Relational Database, Rank Join.
Abstract: Due to increasing capacity of storage devices and speed of computer networks during last years, it is still
more required to sort and search data effectively. A query result containing thousands of rows from a
relational database is usually useless and unreadable. In that situation, users may prefer to define constraints
and sorting priorities in the query, and see only several top rows from the result. This paper deals with top-k
queries problems, extension of relational algebra by new operators and their implementation in a database
system. It focuses on optimization of operations join and sort. The work also includes implementation and
comparison of some algorithms in standalone .NET library NRank.
1 INTRODUCTION
Results of relational queries containing more than
hundreds of tuples are often not clearly arranged and
hence not usable for a user. Therefore, it is suitable
to sort these results according to a criterion and to
offer to users only the best ones. This reminds the
behaviour of search engines, where a user usually
defines neither a way of sorting results (this is, after
all, fairly heavy task), nor restriction on the result
size. The paper is focused on such queries, whose
criteria of sorting as well as the required result size
are defined by a user. Such queries are overall
denoted by top-k queries (we define them formally
in Section 2).
Top-k queries need not be restricted only to
relational databases. They can concern also an
aggregation of several web services; each of them
provides some data, whereas central application
performs joins on the data, their filtering and sorting
(Horničák et al., 2011). Principles of optimization
are the same and, moreover, they can take into
account speeds of particular services or the fact that
data can be loaded in parallel (Abid and
Tagliasacchi, 2011).
A good introduction to top-k queries and basic
algorithms for their processing is in the works
(Bruno et al., 2002), (Fagin et al., 2003). Solutions
in context of relational databases can be found in
(Ilias et al., 2003) and later in the survey (Ilias et al.,
2008).
In principle, two approaches can occur in
practice:
plug-in in which a top-k functionality is
built on top of the database engine, and
native which directly manipulate the
database engine by injecting new preference
operators
We follow the former in this paper. We focus on
so called top-k join queries (or shortly rank joins) in
our work, i.e. queries doing join of two or more
(ranked) relations with output ordered by values of a
scoring function calculated from tuple scores of
input relations. We use some algorithms described in
(Ilias et al., 2004), (Ilias et al., 2008). There are
other attempts to top-k joins, based on modification
of so called “no random accesses” (NRA) algorithms
(Mamoulis et al., 2007).
We developed a .NET library NRank enabling a
simple formulation and effective evaluation of top-k
queries. The library is primarily intended for
querying over a relational database, but it can be
used also for querying over other data structures.
Section 2 concerns a definition and properties of
top-k queries. Section 3 is devoted to description of
four representative top-k algorithms used in the
NRank. NRank is a unified platform independent
system for top-k algorithms. Section 4 describes its
implementation. In Section 5 we compare these
algorithms and in Section 6 we summarize achieved
111
ˇ
Cech M. and Pokorný J..
NRank: A Unified Platform Independent Approach for Top-K Algorithms.
DOI: 10.5220/0003966901110116
In Proceedings of the International Conference on Data Technologies and Applications (DATA-2012), pages 111-116
ISBN: 978-989-8565-18-1
Copyright
c
2012 SCITEPRESS (Science and Technology Publications, Lda.)
results.
2 TOP-K QUERIES
A top-k query is the 4-tuple (S, c, f, k), where S =
{S
i
} is a set data sources (a data source can be an
arbitrary set of objects), c = S
i
{0, 1} is a
scoring condition, f = S
i
R is a scoring function
(R is the set of real numbers), and k N is the
maximum result size (the required number of
returned objects).
This is the most general form of top-k query that
we can meet in most of top-k query models. It
includes both top-k join queries and top-k selection.
In top-k selection queries, the goal is to apply a
scoring function on multiple attributes of the same
relation to select tuples ranked on their combined
score (Ilyas et al., 2003).
The result of a top-k query over sources
instances S
*
is then an ordered set O S
i
*
such
that:
t O: c(t) =1,
t
i
, t
j
O: i < j f(t
i
) f(t
j
),
t O, s S
i
*
O: c(s) =1 f(t) f(s)
O = k .
The first condition only restricts the condition on
tuples in the result (that is equivalent to SQL clause
JOIN ON c and/or WHERE c’). The second
condition defines a set ordering according to the
scoring function f. The third condition says that the
result contains tuples with the highest possible score.
The last condition restricts the result size to k tuples.
Of course, the number of tuples in result can be less
than k. Then it is possible to modify the definition to
ensure the result set was as large as possible.
Top-k algorithms solve the top-k query without
reading all of the input. Their optimality is often
measured in the number of input objects from
particular sources, denoted as a depth. In fact, it is
the minimal depth into which the algorithm must
access its inputs to report the result. Other criterion
of optimality is a space complexity; the algorithms
distinguish significantly mutually in the amount of
data they need to be hold in memory.
Determining time complexity of these algorithms
is more complicated. Asymptotic time complexity of
these algorithms is usually the same. Because of the
sort-based approach these algorithms have no other
possibility than reading inputs in linear time.
Algorithms then distinguish only in processing data
in buffers, but from the view of asymptotic
complexity they are mostly „equally fast“.
Asymptotic complexity is therefore too weak tool
for comparing the effectiveness of top-k algorithms.
The recent approaches to rank join problem
focus mostly on instance optimal algorithms in terms
of I/O cost. Then this cost is a quantity proportional
to the overall number of fetched tuples. The notion
of an instance optimal algorithm (originally
introduced in (Fagin et al., 2003) expresses a relative
optimality of an algorithm with respect to a class of
algorithms A and a class of instances D. An
algorithm b is instance optimal over A and D if for
every a A and every d D we have t(b, d) =
O(t(a, d)) (where t(b, d) is the number of algorithm
steps), in other words, there are two constants c
1
and
c
2
such that t(b, d) c
1
*t(a)+c
2
for every aA and
every dD. c
1
is denoted as the optimality ratio.
When the algorithm complexity is determined,
usually the worst or average case is studied only.
However, while an algorithm is instance optimal, it
is guaranteed, that it will be „reasonably fast“ in all
defined cases. While the sets A and D are well
defined, the instance optimality is a very strong
criterion. For example, the Hash Rank Join
algorithm (see Section 3) is an instance optimal
algorithm.
An interesting finding is that most of instance
optimal algorithms are in practice slower than other
algorithms. Although they have a good asymptotic
complexity, their speed is really low, more precisely
lower, than the speed of some algorithms, which are
not instance optimal. During determination of the
algorithms complexity we often meet cases, where
the algorithm has a good asymptotic complexity, but
its real speed is low; top-k algorithms, which are
instance optimal, have often the same asymptotic
complexity, but their real speed is often higher.
Therefore we are often driven by a quest to find a
reasonable compromise between instance optimality,
which guarantees a robustness, and real algorithm
effectiveness. Some works, e.g. (Finger and
Polyzotis, 2009), endeavour to find compromises
between these extremes.
In the case of rank joins we use functions f
aggregating (or combining) ranked inputs. We will
consider only monotone aggregation functions to
ensure some properties of algorithms followed. The
main idea of the algorithms is sorting the inputs in
descending order according to ranking and quest to
provide the result as soon as possible.
3 TOP-K JOIN ALGORITHMS
We will consider four algorithms based on hashing
AA   nternational onference on ata echnoloies and Applications

and nested loops.
3.1 Hash Rank Join
The algorithm Hash Rank Join (Ilyas et al., 2003),
(Ilyas et al., 2004) requires ranked data (in a
descending order according to f) in both inputs L and
R and join condition only of form L.X = R.Y. The
scoring function for calculating the combined score
is L.score + R.score.
The algorithm alternately reads both inputs and
stores them into hash tables HL and HR, where the
hash value is calculated from the value of attribute,
which occurs in equality condition. After reading
new values from an input, in the hash table of the
second input there are looked up the tuples, which
are joinable. The join results are inserted into the
priority queue, where they are ordered by their
combined score. Furthermore, the algorithm
maintains the top and bottom scores of retrieved
inputs L
max
, L
min
, R
max
, R
min
, respectively, from which
it calculates so called score threshold
t = max(f(L
max
, R
min
), f(L
min
, R
max
))
This is an upper-bound on the score of all join
combinations not yet seen. The soundness of the
estimation is trivial and follows from the
monotonicity of f. Values L
max
and R
max
are
initialized during reading the first tuple of L and R,
respectively, values L
max
and R
max
change
continuously with new tuples from L and R being
fetched. All results, which have score greater than t,
can be immediately reported. Obviously, the value
of t is also computed from couples, which do not
fulfil the join condition.
Algorithm Hash Rank Join:
Input: data sources L and R,
aggregation function f:(L×R)->R,
join condition c:(L×R) -> {0, 1} of
form L.X = R.Y,
result size k
Output: the first k tuples from L*cR,
ordered in descending order
according to ranking f.
Variables: hash tables HL and HR, the
hash key (L.X or R.Y), L_min,
L_max, R_min, R_max calculated
from retrieved tuples, score
threshold t, priority queue T
1. HL = HR = Ø;
2. L_min = R_min = ∞;
3. L_max = R_max = -∞;
4. while (L≠Ø nebo R≠Ø)
5. {
I := one of inputs L, R
(inputs are selected alternately.
The second input is denoted J,
its hash table HJ);
6. read the next i I;
7. foreach j HJ[i]:
T.Insert(join(i, j));
8. I_min = i.Score;
9. I_max = max(I_max, i.Score);
10. update t value;
11. while(T≠Ø and T[0].Score>=t)
12. {report T[0] as output and
remove it from T;
13. if k tuples are reported
then stop;
14. }
15. }
// emptying out the queue
16. while(T≠Ø)
17. {
report T[0] as output and
remove it from T;
18. if k tuples are reported
then stop;
19. }
The algorithm must hold in memory all no
reported results (in priority queue) and all retrieved
data from both inputs (in the hash table). While the
number of resulted tuples k is not known, the queue
can grow without restrictions. Otherwise, while k is
known, the queue can hold only k of best tuples. A
compromise is possible the queue can be restricted
on a value greater than k (e.g., for k = 10 we restrict
the queue to 100 objects). The user then gains top k
tuples, but he/she can require additional tuples, up to
maximal queue size. This ensures that the queue
does not exceed the given size.
The basic variant of Hash Rank Join can be
optimized by the input choice in particular steps of
the algorithm. Since the basic variant alternates the
both inputs mechanically, it becomes easily, that
after reading the next object the score threshold t
does not decrease. Minimization of t essentially
influences a speed of reporting the results, i.e. the
overall algorithm speed. During calculating t the
maximum from two virtual values t
1
= f(L
max
, R
min
)
and t
2
= f(L
min
, R
max
) is chosen. The threshold
decreases only in case that in the next step the
greater from values t
1
and t
2
decreases (if t
1
>t
2
, then
the right input is chosen, otherwise the left one) in
other words, it is not necessary to decrease the lesser
from values t
1
, t
2
, the resulted t will be not changed.
The optimization lies in strategy, in which the input
is chosen from L and R in the way that the value t
always decreases. This strategy is further denoted as
adaptive.
However, this strategy of minimizing t is not
always better than classical alternating the inputs
(round robin strategy). There are inputs, where
NRank A nified Platform ndependent Approach for op Alorithms

alternating inputs is more suitable, as it is mentioned
in (Finger and Polyzotis, 2009).
3.2 General Rank Join
The algorithm Hash Rank Join can be generalized in
the way that the join condition need not be
necessarily only equality condition, but arbitrary. In
such case algorithm will not store output into hash
tables (moreover, there is no key to be hashed), but
simply into a buffer (each input has its buffer).
When a tuple is read from one input, all tuples in
second buffer must be iterated and join condition
must be checked at each step. This process is
obviously slower than the Hash Rank Join, which
does not check join condition at all (the hash table
provides just the right tuples to join). The General
Rank Join is more universal, but slower.
3.3 Nested Loop Rank Join
The algorithm (Ilyas et al., 2003), (Ilyas et al., 2004)
is a variant of the classical nested loop join. One
input (denote it L) is read consecutively. For each
tuple l L, the algorithm iterates through all tuples
from the second input (R), tries to join them with l
and the results stores again into the priority queue.
Similarly as in the general rank join, the threshold
value t is maintained and updated with each reading
a tuple from L. Resulted tuples in the priority queue,
which have the resulted score greater than t, are
immediately outputted.
In fact, the Nested Loop Rank Join is a variant of
General Rank Join, but firstly all tuples from R are
read and stored into a buffer and after that the tuples
from L are read.
Input R does not have to be ordered, since all
tuples are loaded and stored in a buffer, and thus
R
max
and R
min
are known in advance and the
threshold score can be easily estimated; in addition,
the values R
max
and R
min
do not change during
reading L. Tuples from L do not have to be
maintained in memory; on the other hand, as all data
from R are held in memory, this algorithm is
suitable rather in case, where data from R are rather
sparse, or when it is possible to access them directly
(random access), so that it is not necessary to hold
data from R in memory.
There is one more interesting optimization. The
value L
max
can be decreased in each step to the value
L
min
. After reading a tuple l L there are created all
result tuples, where l can occur (tuple l is even
forgotten in the next step, since it is not stored in the
buffer). Therefore the value L
max
can be decreased
on the score of tuple l. In each step of the algorithm,
L
max
= L
min
= f(l). Then the value of t decreases
faster, than using the Hash Rank Join. Consequently,
this algorithm can be faster in some situations.
3.4 Hash Nested Loop Rank Join
The algorithm Nested Loop Rank Join can be
optimized for such top-k queries, where the join
condition is an equality condition. Similarly to the
Hash Rank Join algorithm can store the retrieved
data in hash tables. In this case a tedious throughput
of the whole buffer drops out. The hash table of the
input occurring in memory is created during
initialization of the algorithm; the hash table of the
second input is filled by sequel during reading
particular tuples, in the same way as in Hash Rank
Join.
4 IMPLEMENTATION
The library NRank is implemented in .NET/C# 4.0.
Its main objective is to provide all the mentioned
algorithms to users, while it stays independent on the
data sources. Whereas most implementations focus
on a specific algorithm and are closely linked to
some DBMS, we have implemented generic versions
of the algorithms, which accept any data collection
as a data source.
This approach opens up many new possibilities,
i.e. performing the algorithms on two completely
independent web sources, combining data from
different databases, etc. Inspired by C# query
language LINQ, we also stressed simple usage.
When performing a join query in LINQ, a user
typically specifies the collections to be joined, the
join condition, and the result function. In NRank, the
calling convention is very similar and is extended
only by ranking functions, so that algorithms are
able to count tuple’s rank.
While we cannot achieve such fast algorithms as
if they were integrated into a database, we still
obtain very interesting results. A big difference in
using traditional LINQ queries and NRank queries is
that whereas LINQ queries are translated to SQL (if
called on database objects), NRank queries are
performed with no translation and optimization. That
means, when LINQ query is called, user typically
does not care about what algorithm is used. When
NRank query is called, user must specify the
algorithm; this step has crucial impact on evaluation
speed.
The library is accessible at address http://code.
google.com/p/nrank/.
AA   nternational onference on ata echnoloies and Applications

5 COMPARISON OF
ALGORITHMS
The project includes a benchmark database
containing tables Town (50.000 tuples), Hotel
(1.000.000 tuples), Club (1.000.000 tuples), and
Room (20.000.000 tuples). Experiments have been
done on Microsoft SQL Server 2008 Express
Edition, OS Windows 7 64bit, Intel ®Core™ 2 Duo,
CPU 8400 2.26GHz, 4GB RAM. First, the
algorithms evaluated a top-k binary join that can be
expressed by the SQL query
SELECT T.IdTown, C.IdClub,
T.Score + C.Score
FROM Town T JOIN Club C ON
T.IdTown = C.IdTown
ORDER BY DESC T.Score + C.Score
STOP AFTER k
Figure 1: Comparison of General Rank Join and Hash
Rank Join.
We varied the values of k in the sequence 1, 5, 10,
50, 100, 500, and 1000.
Experiments represented by Figures 1 and 2
show, that top-k algorithms exploiting hash tables
for cashing input data are much faster than top-k
algorithms exploiting only a simple buffer. The use
of buffer even for small values of k significantly
slows down the algorithm. The only difference
between general and hash algorithms is in the join
condition.
Our experiments also shown that concerning the
General Rank Join algorithm, the strategy of input
choice is crucial only for higher values of k. Time
saving is subtle for small values of k since both
algorithms read from their inputs only small amount
of data. The Hash Rank Join almost does not depend
on a strategy of input choice. The use of hashing
brings a significant saving while the strategy in itself
does. Due to the algorithms after reading an object
from input need not look through a huge amount of
data, both strategies of input choice are comparably
fast. Both algorithms read order of magnitude the
same amount of data independently on the strategy
used. Thus, in use of Hash Rank Join, the strategy of
input choice has a negligible influence.
Figure 2: Comparison of Nested Loop Rank Join and Hash
Nested Loop Rank Join.
We also tested the algorithms evaluating top-k
join of three tables (see Figure 3) with the query
SELECT T.IdTown, H.IdHotel, C.IdClub
FROM Town T JOIN Club C ON
T.IdTown = C.IdTown
JOIN Hotel H ON T.IdTown = H.IdTown
ORDER BY DESC
T.Score + C.Score + H.Score
STOP AFTER k
For small k values the algorithm Nested Loop
Rank Join is the fastest. Other algorithms are also
usable, nevertheless, e.g., Nested Loop Rank Join
has higher demands on the memory (keeps all data
from one input in memory). Algorithms General
Rank Join and Nested Loop Rank Join, i.e.
algorithms, which do not use hash tables, are for
larger values of k not usable.
In general, the compared top-k algorithms are
beneficial in situations with joins of more tables (or
other data sources) and for smaller values of k.
While the join condition is formulated as equality,
hash algorithms are fast also for large values of k.
NRank A nified Platform ndependent Approach for op Alorithms

Figure 3: Comparison of all algorithms for joining three
tables.
6 CONCLUSIONS
Including of top-k algorithms into a real relational
environment is not usual by that time. For exclusion
see, e.g., (Li, Chang, Ilyas, and Song, 2005)
extending relational algebra and query optimization.
Other examples are prototype implementations in
PostgreSQL (Khalefa et al., 2011), (Kini and
Naughton, 2007).
We have made comparison of several algorithms
implementing rank join operator. Our results
confirmed that huge processing time can be saved
using optimal algorithm and appropriate source-
choosing strategy. In simple join conditions it is
possible to boost the algorithms with use of a hash
table, which brings significant improvement,
especially for higher values of k.
We have developed the .NET library NRank,
which implements these algorithms and is ready to
be used in a real-world application. For a future
work, a rank-aware optimization framework would
be beneficial enabling to use data statistics stored in
a usual RDBMS. The library can be used practically
for processing arbitrary data.
ACKNOWLEDGEMENTS
This research has been partially supported by the
grant GACR No. P202/10/0761.
REFERENCES
Abid, A. and Tagliasacchi, M. (2011). Parallel Data
Access for Multiway Rank Joins. In ICWE 2011, S.
Auer, O. Diaz, and G.A. Papadopoulos (Eds.), LNCS
6757, 4458.
Bruno, N., Chaudhuri, S., and Gravano, L. (2002). Top-k
Scoring Queries over Databases: Mapping strategies
and Performance Evaluation. ACM Transactions on
Database Systems, Vol. 27, No. 2, 153-187.
Fagin, R., Lotem, A. and Naor, M. (2003). Optimal
aggregation algorithms for middleware. Journal of
Computer and System Sciences 66, 614656.
Finger, J. and Polyzotis, N. (2009). Robust and Efficient
Algorithms for Rank Join Evaluation. In ACM
SIGMOD International Conference on Management of
Data, Providence, Rhode Island, USA, June 29 - July
2, 415-428.
Horničák, E., Ondreička, M., Pokorný, J., Vojtáš, P.
(2011). Multi-user Searching of Top-k Objects with
Data on Remote Servers. In: ADBIS 2011 Research
Communications, J. Eder, M. Bielikova, and A Min
Thoa (Eds.), Oesterreichische Comp. Gesellschaft,
Vienna, Austria, Sep. 20-30, 200-211.
Ilyas, I. F., Aref, W. G. and Elmagarmid, A. K. (2004).
Supporting Top-k Join Queries in Databases. The
VLDB Journal (2004) 13: 207221.
Ilyas, I. F., Shah, R., Aref, W. G., Vitter, J. S. and
Elmagarmid, A. K. (2004). Rank-aware query
optimization. In ACM SIGMOD International
Conference on Management of Data, Paris, France,
June 13-18, 203-214.
Ilyas, I. F., Beskales, G. and Soliman, M. A. (2008). A
Survey of Top-k Query Processing Techniques in
Relational Database Systems. ACM Computing
Surveys, Vol. 40, No. 4, Article 11, 58 p.
Khalefa, M. E., Mokbel, M. F. and Levandoski, J. J.
(2011). PrefJoin: An Efficient Preference-aware Join
Operator. In 27th International Conference on Data
Engineering (ICDE), IEEE, 995-1006.
Kini, A. and Naughton, J. F. (2007). Database Support for
Weighted Match Joins. In: Proceedings of 19th
International Conference on Scientific and Statistical
Database Management (SSDBM 2007), IEEE, 10 p.
Li, Ch., Chang, K. Ch., Ilyas, I. F. and Song, S. (2005).
RankSQL: Query Algebra and Optimization for Top-k
Queries. In ACM SIGMOD International Conference
on Management of Data, Baltimore, Maryland, USA,
June 14-16, 131-142.
Mamoulis, N., Yiu, M. L., Cheng, K. H., Cheung, D. W.
(2007). Efficient Top-k Aggregation of Ranked Inputs.
ACM Transactions on Database Systems, Vol. 32,
August, Article 19.
AA   nternational onference on ata echnoloies and Applications
