• other data structures than the obvious or
traditional ones are advantageous;
• programming instead of using a query language
can improve performance.
The remainder of this paper is structured as
follows: In Section 2, we collect some related work
to underline the novelty of our investigation.
Afterwards, we elaborate in Section 3 upon
unfairness of existing comparisons and various
influencing factors before we conduct performance
measurements on PostgreSQL for common Neo4j
test scenarios in Section 4. In Section 5, we deduce
criteria for achieving a fair performance comparison.
Section 6 concludes the investigation and
presents some future work.
2 RELATED WORK
An early paper by (Hohenstein, et al., 1997)
criticizes standard benchmarks for object-oriented
database management systems (ODBMSs) like OO7
(Carey, et al., 1994) and statements made therein
like “ODBMSs are faster by factor 100”. In a case
study, a real application using Oracle was
ported to several ODBMSs. The surprising
result of the performance measurements conducted
was that only a single ODBMS-based
implementation had the potential to be faster than the
original Oracle-based solution, while one ODBMS
was definitely much slower. Consequently, the
authors state that the best benchmark is the
application itself. The paper presents a methodology
for deriving application-specific benchmarks.
To our knowledge, no further work on fairness
of performance comparisons has been published so
far. Indeed, there are only some critical statements,
such as that of (Baach, 2015): “Comparing Neo4j to MySQL
without the use of Cypher is comparing apples and
oranges”. In the forum (Hacker, 2010), others also
argue that some comparisons are meaningless.
However, several performance investigations can
be found in the literature, which we try to assess
in the following.
(Khan, 2016) states that graph database technology
is superior to RDBMSs and explains why
joins are ill-suited to graph structures. He uses a simple
scenario that consists of Employees (E), Payments
(P) and Departments (D), related by one-to-many
relationships E-P and P-D. Then, qualifying two
departments by a query, the related payments are
retrieved via the employees. The complexity is
evaluated in Big-O notation. While RDBMSs achieve
O(|E|*|P|) with nested loop joins and O(|E|+|P|) with
hash joins, Neo4j has an O(k) behavior. Neo4j’s
constant behavior is explained as follows: “Using
hash indexing this gives O(1). Then the graph is
walked to find all the relevant payments, by first
visiting all employees in the departments, and
through them, all relevant payments. If we assume
that the number of payment results are k, then this
approach takes O(k).” However, it remains unclear
what “visiting all employees” in Neo4j means and
how the internal data structures contribute to a better
performance compared to hash indexes in RDBMSs.
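
To make the quoted complexity classes concrete, the following Python sketch contrasts a nested loop join, a hash join, and a walk along stored references for the Employee-Payment relationship. It is only an illustration under our own assumptions: the toy data, the tuple layouts and the function names are hypothetical, and the sketch is neither taken from (Khan, 2016) nor a description of how any RDBMS or Neo4j works internally.

```python
# Hypothetical toy data: (employee_id, department) and
# (payment_id, employee_id, amount). Not taken from (Khan, 2016).
employees = [(1, "sales"), (2, "sales"), (3, "research")]
payments = [(10, 1, 100), (11, 1, 150), (12, 2, 200), (13, 3, 300)]

# Hypothetical adjacency structure: each employee points directly to its payments.
payments_of = {1: [(10, 100), (11, 150)], 2: [(12, 200)], 3: [(13, 300)]}


def nested_loop_join(employees, payments):
    """Compare every employee with every payment: O(|E| * |P|) comparisons."""
    result = []
    for emp_id, dept in employees:
        for pay_id, pay_emp_id, amount in payments:
            if emp_id == pay_emp_id:
                result.append((emp_id, dept, pay_id, amount))
    return result


def hash_join(employees, payments):
    """Build a hash table on employees, then probe it once per payment: O(|E| + |P|)."""
    dept_by_id = {emp_id: dept for emp_id, dept in employees}  # build phase
    result = []
    for pay_id, pay_emp_id, amount in payments:  # probe phase
        dept = dept_by_id.get(pay_emp_id)
        if dept is not None:
            result.append((pay_emp_id, dept, pay_id, amount))
    return result


def adjacency_walk(employee_ids, payments_of):
    """Follow stored references from the given employees to their payments.

    The work is proportional to the number of start employees plus the
    number k of payments found, independent of the total table sizes.
    """
    result = []
    for emp_id in employee_ids:
        for pay_id, amount in payments_of.get(emp_id, []):
            result.append((emp_id, pay_id, amount))
    return result
```

Both joins return the same rows, for example `sorted(nested_loop_join(employees, payments)) == sorted(hash_join(employees, payments))`. Note that the walk still has to visit every start employee, so the sketch also shows why the cost of “visiting all employees” matters for the overall estimate.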
(Rodriguez, 2011) uses 1,000,000 nodes and
4,000,000 edges with a synthetic distribution:
Despite an average fan-out of 4, some nodes have a
higher number of edges. A test measures traversal
from a starting node to related nodes via 1 to 5 hops.
The result reveals that Neo4j is more than twice as
fast for 4 hops. For 5 hops, Neo4j required 14.37
minutes while MySQL was stopped after 2 hours.
The test of (Adell, 2013) detects if one person is
connected to another in 4 or fewer hops. The data set
contains 1,000,000 users with an average of 50
friends. Neo4j required 2 ms for the check, while an
RDBMS was stopped after running several days.
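
The kind of check measured in such tests can be sketched as a breadth-first search that stops at a hop limit. The Python code below is a minimal sketch under our own assumptions: the function name `connected_within`, the adjacency-dictionary representation and the toy data are hypothetical and are not the implementation used by (Adell, 2013) or by either database system.

```python
from collections import deque


def connected_within(friends, source, target, max_hops):
    """Return True if target is reachable from source via at most max_hops edges."""
    if source == target:
        return True
    visited = {source}
    frontier = deque([(source, 0)])
    while frontier:
        person, hops = frontier.popleft()
        if hops == max_hops:
            continue  # hop limit reached, do not expand this node further
        for friend in friends.get(person, ()):
            if friend == target:
                return True
            if friend not in visited:
                visited.add(friend)
                frontier.append((friend, hops + 1))
    return False


# Hypothetical toy friendship chain: a - b - c - d - e - f
friends = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d", "f"],
    "f": ["e"],
}
```

In this toy graph, `connected_within(friends, "a", "e", 4)` holds, whereas "f" is five hops away from "a" and is therefore not found with a limit of 4. Because the search stops at the first hit and never revisits a person, it does not have to enumerate all paths, which a chain of self-joins in SQL would typically do.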
Another comparison (Baach, 2015) uses 100,000
and 1,000,000 nodes with exactly 50 edges each. A
test counts the number of friends up to 5 hops. As a
surprising result, MySQL was about 6 times faster
than Neo4j. One potential reason for that might be
the use of the Cypher query language to perform
queries, while (Rodriguez, 2011) sticks to the Pipes
framework, which seems to be very beneficial.
(Baach, 2015) considers a comparison of SQL vs. the
Cypher language as fair, whereas SQL vs. Pipes
is considered unfair. Another reason for the result might be
that more thought was put into configuring MySQL.
(Vicknair, et al., 2010) experiment with data sets
of size 1,000, 5,000, 10,000 and 100,000 nodes. In
contrast to others, they set up a directed acyclic graph.
Several tests traverse the graph and count the nodes
reachable within 4 and 128 hops, count the number of nodes
with a certain payload, particularly with “<”
comparisons, and find all the orphan nodes. In
general, the execution times are less than 200 ms,
and do not show huge differences between Neo4j
and MySQL. In fact, the data sets are small and
enable in-memory processing.
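
For data sets of this size, two of the test types can be expressed in a few lines of in-memory code. The following Python sketch is only an illustration: the toy graph, the function names, and the assumption that an orphan is a node without any incident edge are ours and are not taken from (Vicknair, et al., 2010).

```python
# Hypothetical toy DAG: an integer payload per node and (parent, child) edges.
payload = {"n1": 5, "n2": 42, "n3": 17, "n4": 99, "n5": 3}
edges = [("n1", "n2"), ("n1", "n3"), ("n2", "n4")]


def orphan_nodes(nodes, edges):
    """Return all nodes that occur in no edge (assumed meaning of 'orphan')."""
    connected = set()
    for parent, child in edges:
        connected.add(parent)
        connected.add(child)
    return sorted(node for node in nodes if node not in connected)


def count_payload_below(payload, limit):
    """Count the nodes whose payload satisfies the comparison payload < limit."""
    return sum(1 for value in payload.values() if value < limit)
```

Both functions make a single pass over the nodes and edges, so their runtime at 100,000 nodes is dominated by how quickly the data can be brought into memory rather than by the data model.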
(Khan, et al., 2017) compare Oracle 11g and
Neo4j using a Medical Diagnostic System. The data
set comprises about 28,000 patients, 625,721 patient
visits and 869,666 patient-IssueMed records, to name
the main tables. Five count queries join two or three
tables. While Oracle executes the queries in a few
seconds (depending on the query), Neo4j requires
about 0.3 s.