actly two outgoing edges (t, a, t′) and (t, b, t″) that are in the same set of C_or while u has two outgoing edges (u, a, u′) and (u, b, u″) such that t′ (resp., t″) matches u′ (resp., u″), then t cannot match u.
2. ISUSABLEONCE on line 9 checks whether t can match u without violating the constraints on the edge cardinality of q other than C_int. That is, if an edge incident to t does not appear in C_int, then the corresponding edge incident to u must exist at most once. The function checks whether this condition holds for u and t.
3. ISUSABLENTIMES on line 10 checks whether t can match u without violating C_int. For example, suppose that (t, a, t′) ∈ C_[0,k]. Then the number of edges in E_u that match (t, a, t′) must be no more than k. The function checks this condition by comparing E_u and E_t along with C_int.
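The two cardinality checks in items 2 and 3 can be illustrated with a minimal Python sketch. This is not the authors' Ruby implementation: the data layout (edges of t as a set of (label, target type) pairs, C_int as a dictionary mapping such a pair to an interval (lo, hi), and a `typing` dictionary giving the type of each neighbour of u) and all function names are assumptions made for illustration. The disjunction check of item 1 is omitted.

```python
from collections import Counter

def count_matches(u_edges, typing):
    """Count u's outgoing edges per (label, type of the target node).

    Assumption: `typing` maps neighbour nodes of u to schema types;
    edges whose target has no type yet are ignored.
    """
    return Counter((label, typing[target])
                   for label, target in u_edges if target in typing)

def is_usable_once(t_edges, c_int, u_edges, typing):
    """Edges of t that are NOT in C_int may be matched by at most one edge of u."""
    counts = count_matches(u_edges, typing)
    return all(counts[edge] <= 1 for edge in t_edges if edge not in c_int)

def is_usable_n_times(t_edges, c_int, u_edges, typing):
    """Edges of t in C_int with interval (lo, hi) may be matched at most hi times."""
    counts = count_matches(u_edges, typing)
    return all(counts[edge] <= c_int[edge][1] for edge in t_edges if edge in c_int)

# Hypothetical example: type t has a "name" edge (usable once) and a
# "knows" edge that C_int allows between 0 and 2 times.
T_EDGES = {("name", "Str"), ("knows", "Person")}
C_INT = {("knows", "Person"): (0, 2)}
U_EDGES = [("name", "n1"), ("knows", "p1"), ("knows", "p2"), ("knows", "p3")]
TYPING = {"n1": "Str", "p1": "Person", "p2": "Person", "p3": "Person"}

print(is_usable_once(T_EDGES, C_INT, U_EDGES, TYPING))     # True: one "name" edge
print(is_usable_n_times(T_EDGES, C_INT, U_EDGES, TYPING))  # False: three "knows" edges, k = 2
```

In this sketch both functions only enforce upper bounds, mirroring the "at most once" and "no more than k" conditions stated above; lower bounds are not checked.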
If all the above checks are passed, then the pair (u, t) is added to M by UPDATESTATE on line 11. On line 12, FINDMATCH is called recursively to find matches for the remaining nodes of q and G_S. If no answer is found by this call of FINDMATCH, then RESTORESTATE on line 13 restores M, i.e., (u, t) is deleted from M, and control returns to line 7. Finally, if no answer is found after all the nodes in q have been examined, the algorithm outputs “unsatisfiable” on line 14.
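The update, recurse, and restore steps described above follow the usual backtracking pattern, which can be sketched as follows. This is a simplified Python illustration, not the authors' code: `find_match`, `make_check`, the example schema, and the `is_compatible` callback (which stands in for the three checks on lines 8 to 10) are all hypothetical names and assumptions.

```python
def find_match(query_nodes, types, is_compatible, M=None):
    """Return a dict (query node -> type) covering all nodes, or None."""
    if M is None:
        M = {}
    if len(M) == len(query_nodes):
        return dict(M)                    # every query node is matched
    u = next(n for n in query_nodes if n not in M)
    for t in types:
        if is_compatible(u, t, M):
            M[u] = t                      # corresponds to UPDATESTATE
            result = find_match(query_nodes, types, is_compatible, M)
            if result is not None:
                return result
            del M[u]                      # corresponds to RESTORESTATE
    return None                           # caller reports "unsatisfiable"

# Hypothetical schema given as (source type, label, target type) edges.
SCHEMA_EDGES = {("Person", "knows", "Person"), ("Person", "name", "Str")}
TYPES = ["Person", "Str"]

def make_check(query_edges):
    """Simplified stand-in check: every query edge whose two endpoints
    are already typed must correspond to some schema edge."""
    def is_compatible(u, t, M):
        trial = {**M, u: t}
        return all((trial[s], a, trial[o]) in SCHEMA_EDGES
                   for s, a, o in query_edges
                   if s in trial and o in trial)
    return is_compatible

satisfiable = find_match(["x", "y"], TYPES, make_check([("x", "knows", "y")]))
unsatisfiable = find_match(["x", "y"], TYPES,
                           make_check([("x", "name", "y"), ("y", "knows", "x")]))
print(satisfiable)                                   # {'x': 'Person', 'y': 'Person'}
print("unsatisfiable" if unsatisfiable is None else unsatisfiable)
```

The search stops at the first complete assignment, which matches the early-termination behaviour the text relies on for satisfiable queries.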
We now briefly consider the computational complexity of the problem. In theory, the problem cannot be solved efficiently unless P = NP.
Theorem 1. Detecting unsatisfiable pattern queries is NP-hard under both s-typing and m-typing semantics.
The algorithm checks whether each node u of q is matched by a node t of S. Thus the time complexity may become exponential in the worst case, which is unavoidable (unless P = NP) given the above theorem. However, the size of a schema is much smaller than that of a data graph, and the algorithm terminates as soon as one satisfiable matching is found. Thus, although the problem is NP-hard, the algorithm can be executed highly efficiently in practice, as shown in the next section.
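As a rough back-of-the-envelope illustration of the worst case, assume a plain backtracking search that assigns exactly one schema type to each query node. The symbols T (the set of schema types) and V_q (the set of query nodes) are introduced here for illustration only, and the 5-node query is a hypothetical size; the 11 types correspond to the SP²Bench schema used in Section 4.

```latex
\[
  \#\,\text{candidate typings} \;\le\; |T|^{|V_q|},
  \qquad \text{e.g. } 11^{5} = 161{,}051 .
\]
```

The bound grows exponentially in the number of query nodes but only polynomially in the number of types, which is consistent with the observation that small schemas keep the search manageable.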
4 PRELIMINARY EXPERIMENTS
We conducted preliminary experiments to evaluate our algorithm. To detect unsatisfiable queries, the algorithm has to be executed before queries are executed over RDF data. Therefore, we need to verify that the execution time of our algorithm is sufficiently small compared with the query execution time over RDF data. The algorithm was implemented in Ruby 2.5.1, and all the experiments were executed on a machine with an Intel(R) Core(TM) m3-7Y30 CPU at 1.60GHz, 4.00GB RAM, and Windows 10 Home.
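The kind of comparison described above can be sketched with a small timing helper. This is an illustrative Python sketch rather than the authors' Ruby measurement code; `mean_seconds` and `overhead_ratio` are hypothetical names, and the numbers in the usage line are made up.

```python
import time

def mean_seconds(fn, inputs):
    """Average wall-clock seconds of fn(x) over the given inputs."""
    start = time.perf_counter()
    for x in inputs:
        fn(x)
    return (time.perf_counter() - start) / len(inputs)

def overhead_ratio(check_seconds, query_seconds):
    """Time of the unsatisfiability check as a fraction of query execution time."""
    return check_seconds / query_seconds

# Made-up numbers: a 0.5 s check before a 100 s query adds 0.5% overhead.
print(overhead_ratio(0.5, 100.0))
```

A small ratio means that running the check before every query costs little, while an unsatisfiable query avoids the full execution time.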
We used two datasets. The first one was generated by SP²Bench (Schmidt et al., 2008) and the second one was generated by BSBM (Bizer and Schultz, 2009). SP²Bench is a well-known SPARQL performance benchmark tool based on DBLP. For the SP²Bench dataset, we generated RDF data of size 1,087,517 bytes (10,291 triples) and 5,400,376 bytes (50,168 triples). Since SP²Bench does not have a ShEx schema, we manually created one (type: 11, edge: 69) based on (Schmidt et al., 2008). BSBM is also a well-known SPARQL performance benchmark tool, whose data is based on an e-commerce use case. For the BSBM dataset, we generated RDF data of size 2,583,293 bytes (10,250 triples) and 10,216,303 bytes (40,377 triples). BSBM does not have a ShEx schema either; therefore, we created one (node: 10, edge: 71) based on (Bizer and Schultz, 2009). Since both SP²Bench and BSBM implicitly assume s-typing semantics, the experiments were conducted under s-typing semantics.
As for pattern queries, we wrote a Ruby program for generating queries. In short, this program randomly selects labels and types from a given ShEx schema and generates nodes and edges; we then manually checked the unsatisfiability of the generated queries. We generated 50 different unsatisfiable queries (10 queries for each of 5 different query sizes) for each dataset. We also wrote a Ruby program to execute queries based on Ullmann's algorithm (Ullmann, 1976). Note that, although a number of pattern matching algorithms have been proposed, the data used in these experiments are very small, and thus the choice of algorithm hardly affects execution time. In fact, in such a case “preprocessing”, such as reading data and registering nodes and edges into lists/arrays, accounts for most of the execution time, and this is common to any kind of pattern matching algorithm.
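The random query generation described above can be sketched as follows. This is a hedged Python illustration, not the authors' Ruby generator: the function name `generate_query`, its parameters, the output format, and the example types and labels are all assumptions. As in the text, a generated query is not guaranteed to be unsatisfiable and would still need to be checked separately.

```python
import random

def generate_query(types, labels, n_nodes, n_edges, seed=None):
    """Build a random pattern query from a schema's types and labels.

    Returns (nodes, edges): `nodes` maps a node name to a randomly chosen
    type, and `edges` is a list of (source, label, target) triples.
    """
    rng = random.Random(seed)             # seeded for reproducibility
    nodes = {f"v{i}": rng.choice(types) for i in range(n_nodes)}
    names = list(nodes)
    edges = [(rng.choice(names), rng.choice(labels), rng.choice(names))
             for _ in range(n_edges)]
    return nodes, edges

# Hypothetical types and labels; real ones would come from the ShEx schema.
nodes, edges = generate_query(["Person", "Article"], ["creator", "name"],
                              n_nodes=3, n_edges=4, seed=1)
print(nodes)
print(edges)
```

Because labels and types are drawn independently, many generated queries combine them in ways the schema does not permit, which is what makes such a generator a convenient source of candidate unsatisfiable queries.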
Tables 1 and 2 show the results. All execution times were measured in seconds. Each query execution time in the tables is the average over 10 queries. As shown in the tables, the execution time of our algorithm is much smaller than the query execution time over RDF data; for example, we can save about 300 seconds for the larger data of the SP²Bench dataset. Also, the ratio values are almost negligible. Note that, since the size of the RDF data used in the experiments is rather small, the ratio would become much smaller if we used larger RDF data. Therefore, if a user tries to execute a query and it is unsatisfiable, our algorithm can save a lot of time by detecting the unsatisfiability of the query. And even if it is
Detecting Unsatisfiable Pattern Queries under Shape Expression Schema