Bottom-up Discovery of Context-aware Quality Constraints for Heterogeneous Knowledge Graphs

Xander Wilcke (1, a), Maurice de Kleijn (2, b), Victor de Boer (1, c), Henk Scholten and Frank van Harmelen (1, d)

(1) Dept. of Computer Science, Vrije Universiteit Amsterdam, The Netherlands
(2) Dept. of Spatial Economics, Vrije Universiteit Amsterdam, The Netherlands

Keywords: Knowledge Graphs, Data Validation, Data Quality, Constraints, Pattern Mining.

Abstract: As knowledge graphs are increasingly adopted, the question of how to maintain the validity and accuracy of our knowledge becomes ever more relevant. We introduce context-aware constraints as a means to help preserve knowledge integrity. Context-aware constraints offer more fine-grained control over the domain onto which we impose restrictions. We also introduce a bottom-up anytime algorithm to discover context-aware constraints directly from heterogeneous knowledge graphs: graphs made up of entities and literals of various (data) types which are linked using various relations. Our method is embarrassingly parallel and can exploit prior knowledge in the form of schemas to reduce computation time. We demonstrate our method on three different datasets and evaluate its effectiveness by letting experts on knowledge validation and management assess candidate constraints in a real-world knowledge validation use case. Our results show that overall, context-aware constraints are to an extent useful for knowledge validation tasks, and that the majority of the generated constraints are well balanced with respect to complexity.

1 INTRODUCTION

Knowledge graphs have ceased to be the academic experiment that they once were. They are now confidently present in the working environment of many different institutes, museums, and businesses around the globe, such as the Smithsonian American Art Museum (Szekely et al., 2013) and taxi service Uber (Hamad et al., 2018), and even internet giants such as Google (Singhal, 2012) and Facebook (Sun and Iyer, 2013) have firmly embedded knowledge graphs into their services. With this newly conquered position it becomes ever more important to look not only at how to engineer this knowledge, but also at how to maintain the quality of this knowledge across its entire life cycle, every step of which is prone to suffer from a loss in said quality by the introduction of various artefacts (Fürber, 2015). These artefacts come in many forms, ranging from false, illegal, and missing attribute values to incorrect, inconsistent, and contradictory relationships. Failure to correct these artefacts can have severe negative effects on operations and decision-making processes, which is why quality control is a vital step in any knowledge management process (Tayi and Ballou, 1998).

(a) https://orcid.org/0000-0003-2415-8438
(b) https://orcid.org/0000-0003-2379-191X
(c) https://orcid.org/0000-0001-9079-039X
(d) https://orcid.org/0000-0002-7913-0048

A key component of a modern quality control process is the quality constraint: an externally given rule which specifies criteria that correspond to high-quality knowledge, and which can be used to validate a knowledge base in an automated fashion (Fürber and Hepp, 2011). For knowledge graphs, simple quality constraints can be defined using OWL (see https://www.w3.org/TR/owl-ref/), or, if more sophisticated constraints are needed, by using constraint languages such as ShEx (see https://shex.io/) or the more recent SHACL (see https://www.w3.org/TR/shacl/). Constraint languages such as these apply restrictions on the schema level, for instance to all members of a certain class or to every value of a certain attribute. This is analogous to constraint languages for relational databases, and works well if the members of a class form a single homogeneous group. If this is not the case, however, and the members within a class form two or more distinct clusters with their own peculiarities, it may occur that constraints which apply to one cluster do not necessarily apply to the other(s).

Wilcke, X., de Kleijn, M., de Boer, V., Scholten, H. and van Harmelen, F.
Bottom-up Discovery of Context-aware Quality Constraints for Heterogeneous Knowledge Graphs.
DOI: 10.5220/0010113500810092
In Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2020) - Volume 1: KDIR, pages 81-92
ISBN: 978-989-758-474-9
Copyright 2020 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved

[Figure 1: graph drawing not reproduced. Labels in the left subgraph: Bridge, River, Steel, "0.05", with relations type, material, crosses, salinity. Labels in the right subgraph: Road, Highway, WMA, "21.5", with relations type, material, function, max_load.]

Figure 1: Two example subgraphs from the asset management domain, with (left) a steel bridge crossing a salt-water river, and (right) a section of road on the highway. Circles represent entities, with open circles depicting focal entities. Literals are shown as strings.

Consider for example the two subgraphs in Figure 1 from a knowledge graph about asset management. On the left we can see a steel bridge which crosses a salt-water river, whereas on the right we can see a section of road on a highway which is made from WMA (a type of asphalt) and which has a certain load-bearing capacity (in metric tons). For the bridge example, a schema-level constraint might state that all bridges must be constructed from a certain building material. Similarly, a schema-level constraint for the road example might tell us that the value of attribute max_load must lie between zero and one hundred, and must be of the data type float. Constraints such as these work well for identifying illegal or missing values and relationships, but at the same time overlook the different characteristics that the members of a class are likely to have: a bridge might have different material demands depending on the salinity levels of its environment, and the load-bearing capacity of a road might vary depending on its material and usage.

To impose restrictions on this more fine-grained level it is necessary to condition constraints not on the schema level, but rather on the level of the clusters whose members share similar characteristics (Bohannon et al., 2007). This can be achieved by generalizing constraints over nodes with similar contexts. We call such constraints context aware. In this work, we introduce a new formalism to define context-aware quality constraints on heterogeneous knowledge graphs. Constraints of this kind offer fine-grained control over the domain upon which to impose restrictions. This domain is determined by a so-called contextual pattern that describes a special graph motif that the nodes need to match. Contextual patterns can contain entities (by IRI) and/or literals (by value), and also offer means to generalize to classes, data types, and value patterns (e.g. ranges and regular expressions). These same options are also available to restrictions.

Context-aware constraints can be defined by hand, or top down, but doing so quickly becomes infeasible as the dimensions and the diversity of the knowledge grow. An alternative is to learn suitable quality constraints from the knowledge itself, or bottom up, by mining frequent patterns in the graph and by encoding these patterns as constraints (Tayi and Ballou, 1998). This works on the supposition that the large majority of the knowledge is valid and accurate, and that these qualities can be captured in a set of patterns. We apply this approach in this paper. For this purpose, we introduce a bottom-up anytime algorithm to discover context-aware constraints directly from heterogeneous knowledge graphs. Our algorithm is embarrassingly parallel and generates constraints by exploring and testing increasingly more complex contextual patterns in a breadth-first fashion. Special attention is given to the multimodal nature of many knowledge graphs by enabling our algorithm to learn patterns over various data types, such as dates, numbers, and texts.

We evaluate our method in two ways. Firstly, from an algorithmic perspective, for which we demonstrate and test an implementation of our method on three different datasets and evaluate the constraints it is able to generate. Secondly, from a user perspective, by letting knowledge management experts assess the generated constraints in a real-world knowledge validation task.

To summarize, our main contributions are 1) a novel graph-based constraint formalism to define restrictions on the contextual level, 2) an anytime algorithm for the bottom-up generation of context-aware constraints from heterogeneous knowledge graphs, and 3) a user-driven evaluation of the method and constraints by experts in a real-world knowledge validation use case.

2 RELATED WORK

Several mature standards exist with which constraints for knowledge graphs can be defined. One of these standards is the Web Ontology Language, better known as OWL, which supports simple value and cardinality constraints. More expressive constraints can be defined using dedicated constraint languages such as ShEx or SHACL, which offer capabilities similar to their counterparts for relational databases. A subset of these capabilities is also supported by our work, such as placing restrictions on values and datatypes. However, ShEx and SHACL are designed around a different paradigm in which the focus lies on schema-level constraints, whereas the constraints proposed in this work operate on the contextual level. A behaviour similar to context-aware constraints can nevertheless be achieved using SHACL by specifying the filter shapes introduced by SHACL's advanced features.

KDIR 2020 - 12th International Conference on Knowledge Discovery and Information Retrieval

Table 1: All five variants of assertion patterns with their corresponding domains in set-builder notation. In all cases, the left-hand side object-type variable can be substituted for $\upsilon_*$, which matches all types.

     Assertion pattern                     Domain
  1  $p_k(\upsilon_t, e_j)$                $\{e_i \in \mathcal{E} \mid \mathrm{type}(e_i, t) \wedge (\exists e_j \in \mathcal{E})\,[\,p_k(e_i, e_j)\,]\}$
  2  $p_k(\upsilon_t, \upsilon_{t'})$      $\{e_i \in \mathcal{E} \mid \mathrm{type}(e_i, t) \wedge (\exists e_j \in \mathcal{E})\,[\,p_k(e_i, e_j) \wedge \mathrm{type}(e_j, t')\,]\}$
  3  $p_k(\upsilon_t, l_j)$                $\{e_i \in \mathcal{E} \mid \mathrm{type}(e_i, t) \wedge (\exists l_j \in \mathcal{L})\,[\,p_k(e_i, l_j)\,]\}$
  4  $p_k(\upsilon_t, \upsilon_{t_d})$     $\{e_i \in \mathcal{E} \mid \mathrm{type}(e_i, t) \wedge (\exists l_j \in \mathcal{L})\,[\,p_k(e_i, l_j) \wedge \mathrm{dtype}(l_j, t_d)\,]\}$
  5  $p_k(\upsilon_t, \upsilon_s)$         $\{e_i \in \mathcal{E} \mid \mathrm{type}(e_i, t) \wedge (\exists l_j \in \mathcal{L})\,[\,p_k(e_i, l_j) \wedge \mathrm{match}(l_j, s)\,]\}$

While ShEx and SHACL managed to grow into mature standards, they are not the first to introduce more expressive constraints for knowledge graphs. The work in (Cortés-Calabuig and Paredaens, 2012) already discusses the different types of constraints that can be defined on knowledge graphs from a theoretical perspective, together with their satisfaction and entailment problems. In (Lausen et al., 2008), the authors show how the popular query language SPARQL can be used to retain knowledge integrity when converting relational databases to knowledge graphs. Both these studies consider only schema-level constraints similar to those of ShEx and SHACL.

Association rules have been the interest of several works trying to adapt them to knowledge graphs. Association rules are implications of the form $X \Rightarrow y$, where the presence of a set of instances $X$ implies the presence of another instance $y$. Generalized association rules work largely the same, except that $X$ holds the types associated with these instances. Both variants can be expressed using context-aware constraints. A straightforward approach to bring association rules to knowledge graphs is shown in (Anbutamilazhagan and Selvaraj, 2014), which flattens the graph into transactions and feeds these to the Apriori algorithm. This is different from the approach used in our work, which is specifically tailored to graphs. A more similar method is presented in (Ramezani et al., 2014), which operates directly on graphs and which allows for multi-relational patterns. Such patterns can be seen as selective contexts, whereas context-aware constraints consider the entire context. In (Barati et al., 2016), the authors introduce a graph-based approach which can exploit common RDF and RDFS semantics to infer type hierarchies. Exploiting common semantics is also part of our method, but is used to infer direct types and datatypes rather than generalizations thereof.

Quite some work has been done on bringing functional dependencies (FDs) to knowledge graphs, e.g. (Akhtar et al., 2010; Calvanese et al., 2014; Hellings et al., 2016). An FD $X \rightarrow Y$ expresses that entities with the same values for all attributes in $X$ must also have the same values for those in $Y$. This behaviour can be approached by context-aware constraints, but only for values which are already present in the graph. In (He et al., 2014; Yu and Heflin, 2011), the authors extend FDs with paths, which can be thought of as selective or pruned contexts. The work in (Fan and Lu, 2017; Fan et al., 2016) is closest to context-aware constraints by letting FDs consist of graph motifs with support for entities, literals, and variables. In (Yu and Heflin, 2011), the authors introduce FDs with numeric patterns by clustering values using k-means. We employ a similar strategy to learn patterns for numbers, dates, and strings (see Sec. 4.1.1).

Some work has been done on automatic constraint discovery from knowledge graphs. In (He et al., 2014), the authors accomplish this by first flattening a graph into transactions, from which they mine frequent patterns that are fed to an off-the-shelf algorithm for discovering FDs. This differs from our approach, which is specifically developed for graphs. More similar methods are used by (Fan et al., 2018; Yu and Heflin, 2011), which start out with minimal constraints and extend these iteratively until all options are exhausted. However, these algorithms only consider FDs.

General rule miners based on inductive logic programming (Tresp et al., 2008) or frequent-pattern mining (e.g. (Galárraga et al., 2013; Meilicke et al., 2019)) can also be used to discover constraints. However, these methods generally focus on the relational structure of a graph and its underlying schemas without considering contextual dependencies and/or literal values.

To the best of our knowledge, we are the first to evaluate work of this kind from a user perspective. All other reviewed work employs a theoretical and/or data-driven evaluation.


Algorithm 1: Initialization of the generation forest (simplified). Returns all constraints of size 1 with minimal support and confidence. Support for object/data type and value patterns is omitted here, but is similar to lines 7–10 with a few additional steps. In line 10, $\upsilon_t$ is used as shorthand for $\mathrm{type}(\cdot, t)$, and a dummy self relation is added which is needed in Alg. 2.

 1: function INITGENERATIONFOREST($G$, $supp_{min}$, $conf_{min}$)
 2:   $types := \{t \in \mathcal{E} \mid (\exists e \in \mathcal{E})\,[\mathrm{type}(e, t)]\}$
 3:   for type $t$ in $types$ do
 4:     $\Omega(t, 0) := \emptyset$
 5:     if $|\{e \in \mathcal{E} \mid \mathrm{type}(e, t)\}| \geq supp_{min}$ then
 6:       for $p \in \mathcal{P}$ do
 7:         $S := \{p(e, r) \mid (\exists e \in \mathcal{E}, \exists r \in \mathcal{R})\,[p(e, r) \in \mathcal{A} \wedge \mathrm{type}(e, t)]\}$
 8:         for $p(\cdot, r) \in S$ do
 9:           if $|p(\cdot, r) \in S| \geq conf_{min}$ then
10:             $\phi := p(\upsilon_t, r) \leftarrow \{self(\upsilon_t, \upsilon_t)\}$
11:             $\Omega(t, 0) := \Omega(t, 0) \cup \{\phi\}$
12:   return $\Omega$

3 DEFINING CONSTRAINTS

In this section, we provide a definition of context-aware constraints. Let $G = (\mathcal{R}, \mathcal{P}, \mathcal{A})$ be a knowledge graph with the set of all resources $\mathcal{R} = \mathcal{E} \cup \mathcal{L}$, the set of all predicates $\mathcal{P}$, and with $\mathcal{A}$ the set of all assertions $p_k(e_i, r_j)$ that make up $G$, with $p_k \in \mathcal{P}$, $e_i \in \mathcal{E}$, and $r_j \in \mathcal{R}$. Disjoint sets $\mathcal{E}$ and $\mathcal{L}$ consist of all entities and literals in $\mathcal{R}$, respectively.

A constraint $\phi = c \leftarrow A$ states that every entity $e \in \mathcal{E}$ which satisfies antecedent $A = a_1 \wedge a_2 \wedge \cdots \wedge a_n$ must also satisfy consequent $c$. We can more intuitively think of this as the restriction $c$ we wish to impose upon the domain $\mathcal{E}' \subseteq \mathcal{E}$, with $\mathcal{E}'$ encompassing all entities that satisfy the condition(s) in $A$. Restriction $c$ and every condition $a$ in $A$ take the form of assertion patterns $p_k(\cdot, \cdot)$, which generalize the assertions in $\mathcal{A}$ by substituting the left and/or right-hand side resource with a pattern variable $\upsilon$. Pattern variables match any resource which fits their pattern and come in three different flavours: object-type patterns $\upsilon_t$ which match all entities of type $t$ (e.g. Bridge or Road), data-type patterns $\upsilon_{t_d}$ which match all literals of data type $t_d$ (e.g. String or Geometry), and value patterns $\upsilon_s$ which match all literals with a value that matches regular expression $s$ (e.g. “[:digit:]{2}” or “^[:alnum:]$”). We also introduce the syntactic shorthand $\upsilon_*$ which matches entities of any type.
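The three flavours of pattern variable can be illustrated with a minimal Python sketch. The class names, the toy `ENTITY_TYPES` mapping, and the choice to require a full regular-expression match are assumptions made for illustration only; they are not taken from the authors' implementation. Python's `re` module does not support POSIX classes such as `[:digit:]`, so the example uses `\d` instead.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical toy data: maps each entity identifier to its type.
ENTITY_TYPES = {"bridge_1": "Bridge", "road_7": "Road"}


@dataclass(frozen=True)
class Literal:
    value: str
    datatype: str


@dataclass(frozen=True)
class ObjectTypeVar:
    """Matches all entities of type t; t=None acts as the 'any type' shorthand."""
    t: Optional[str] = None

    def matches(self, resource) -> bool:
        if not isinstance(resource, str) or resource not in ENTITY_TYPES:
            return False
        return self.t is None or ENTITY_TYPES[resource] == self.t


@dataclass(frozen=True)
class DataTypeVar:
    """Matches all literals of a given data type."""
    t: str

    def matches(self, resource) -> bool:
        return isinstance(resource, Literal) and resource.datatype == self.t


@dataclass(frozen=True)
class ValueVar:
    """Matches all literals whose value fully matches regular expression s."""
    s: str

    def matches(self, resource) -> bool:
        return (isinstance(resource, Literal)
                and re.fullmatch(self.s, resource.value) is not None)


load = Literal("21.5", "xsd:float")
print(ObjectTypeVar("Bridge").matches("bridge_1"))   # entity of type Bridge
print(DataTypeVar("xsd:float").matches(load))        # literal of type float
print(ValueVar(r"\d{2}\.\d").matches(load))          # value pattern
```

Whether a value pattern should match the whole literal or only part of it is a design choice that the text leaves open; the sketch picks the stricter reading.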

Together with the existing resources, the three pattern variables allow us to construct five different assertion patterns (Table 1). In all cases, the left-hand side is an object-type variable because placing literals (or variables thereof) or entities in that spot results in illegal or unnecessary assertion patterns. For literals and data-type/value variables this is because these can never be the subject of an assertion. For entities, the resulting assertions would either apply to a single entity if they are used as consequent $c$, or, if used in antecedent $A$, they would not help us reduce the domain any further than if we would just omit them (compare $p_k(\upsilon_t, e_j) \wedge p_l(e_j, \cdot)$ to only $p_k(\upsilon_t, e_j)$).

Antecedent $A$ can consist of one or more conditions. These conditions can apply directly to arbitrary entities (i.e. $p_k(\upsilon_*, \cdot)$), in which case we call them depth-1 conditions. If the right-hand side of a depth-1 condition is an object-type variable we can also chain two or more conditions to form depth-n conditions: $p_k(\upsilon_*, \upsilon_t) \wedge p_l(\upsilon_t, \cdot) \wedge \ldots$. The longest chain is called the depth of $A$, whereas its width equals the maximum number of conditions per variable. The size of $A$ is the total number of conditions.

Each constraint $\phi$ is accompanied by two measures of relevance: its support and confidence. The support tells us the size of the domain, and equals the number of entities which satisfy $A$. The confidence tells us for how many members of the domain the restriction holds as well, and equals the number of entities which satisfy both $A$ and $c$.
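As a small worked example of these two measures, the sketch below counts support and confidence on a hypothetical toy graph. It only handles depth-1 conditions with concrete right-hand sides, and the triple layout `(predicate, subject, object)` and all function names are assumptions chosen for illustration.

```python
# Hypothetical toy graph: assertions as (predicate, subject, object) triples.
ASSERTIONS = {
    ("type", "b1", "Bridge"), ("type", "b2", "Bridge"), ("type", "b3", "Bridge"),
    ("crosses", "b1", "r1"), ("crosses", "b2", "r1"), ("crosses", "b3", "r2"),
    ("material", "b1", "Steel"), ("material", "b2", "Steel"),
    ("material", "b3", "Wood"),
}


def satisfies(entity, pattern, assertions):
    """True if the entity is the subject of an assertion with this predicate and object."""
    predicate, obj = pattern
    return (predicate, entity, obj) in assertions


def domain(antecedent, assertions):
    """All subjects that satisfy every condition in the antecedent."""
    subjects = {s for (_, s, _) in assertions}
    return {e for e in subjects
            if all(satisfies(e, a, assertions) for a in antecedent)}


def support(antecedent, assertions):
    """Number of entities that satisfy the antecedent."""
    return len(domain(antecedent, assertions))


def confidence(consequent, antecedent, assertions):
    """Number of entities that satisfy both the antecedent and the consequent."""
    return sum(1 for e in domain(antecedent, assertions)
               if satisfies(e, consequent, assertions))


# Constraint: material(v, Steel) <- type(v, Bridge) AND crosses(v, r1)
antecedent = [("type", "Bridge"), ("crosses", "r1")]
print(support(antecedent, ASSERTIONS))                           # 2
print(confidence(("material", "Steel"), antecedent, ASSERTIONS))  # 2
```

Dropping the `crosses` condition widens the domain to all three bridges while the confidence stays at two, which shows how an added condition can isolate the cluster for which a restriction holds without exception.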

We only consider constraints with a single restriction $c$ because this offers more flexibility when choosing which restrictions to apply and because it makes the measures of relevance more easily interpretable. If constraints with more than one restriction are desired we can obtain this by grouping constraints that have the same domain.

4 DISCOVERING CONSTRAINTS

Where in the previous section we provide a definition of context-aware constraints, we here provide a bottom-up anytime algorithm to efficiently discover said constraints. To do so, our algorithm starts out with all constraints that have a single condition (|A| = 1), which are then used as parents from which more complex constraints (|A| > 1) are derived by adding new conditions. This second step is the main loop of our algorithm and operates by exploring, for every parent constraint, all sensible diagonal combinations of candidate endpoints and candidate extensions. Candidate endpoints are assertion patterns with an object-type variable on the right-hand side (Table 1, pattern 2) which represent the leaf nodes to which we can connect another assertion pattern. This other assertion pattern is the candidate extension and can take the form of any of the assertion patterns listed in Table 1.

Constraints are derived breadth first, which ensures that we only derive new constraints from parents that meet the minimal requirements, preventing unnecessary work, and that the complexity of these new constraints increases linearly. This latter characteristic gives our algorithm an anytime property, although rather than finding “better” answers when left running, it finds ever smaller domains as more conditions are added. Put differently: the longer we let the algorithm run, the more fine grained the constraints become.

Our algorithm is embarrassingly parallel because every constraint creates a new branch of which the vertices can be computed independently of each other. The only caveat is that we need the original graph to calculate the measures of relevance for each constraint we mine. However, because the domain of a child constraint is always a subset of its parent's domain, we can largely avoid this problem by letting parents keep a record of the entities in their domain and calculating the measures using only these.
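The subset property can be sketched as follows: a child's domain is obtained by filtering the parent's recorded domain on the newly added condition only, and sibling branches can be evaluated independently. The toy data, the function name `child_domain`, and the use of a thread pool are assumptions for illustration; threads merely demonstrate the independence of the branches here, and a real implementation might prefer processes or a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy graph: assertions as (predicate, subject, object) triples.
ASSERTIONS = {
    ("crosses", "b1", "r1"), ("crosses", "b2", "r1"), ("crosses", "b3", "r2"),
    ("material", "b1", "Steel"), ("material", "b2", "Steel"),
    ("material", "b3", "Wood"),
}


def child_domain(parent_domain, condition, assertions):
    """Filter the parent's recorded domain on the added condition only."""
    predicate, obj = condition
    return frozenset(e for e in parent_domain
                     if (predicate, e, obj) in assertions)


parent_domain = frozenset({"b1", "b2", "b3"})   # recorded by the parent
conditions = [("crosses", "r1"), ("material", "Steel"), ("crosses", "r2")]

# Each branch depends only on the parent's record, so branches can run
# side by side without coordinating with each other.
with ThreadPoolExecutor() as pool:
    results = pool.map(
        lambda c: child_domain(parent_domain, c, ASSERTIONS), conditions)
    children = dict(zip(conditions, results))

for condition, entities in children.items():
    print(condition, sorted(entities))
```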

For the remainder of this work we let all constraints be specific to object-type variables. For this reason, we will omit condition $\mathrm{type}(\upsilon_*, t)$ from $A$ and change the left-hand side of restriction $c$ from $\upsilon_*$ to $\upsilon_t$. This effectively fixes the type to which constraints can apply, irrespective of their conditions. We limit ourselves to these cases because validation workflows are typically designed around object types. From here on, we consider $A$ as a set of conditions $\{a_1, a_2, \ldots, a_n\}$ that all need to be satisfied.

4.1 Components

We can identify three main components in our algorithm: the main loop (Sec. 4.1.2), the exploration stage (Sec. 4.1.3), and the generation forest, which helps us keep track of the constraints we discover (Sec. 4.1.1). We will discuss each of these next. We also provide simplified pseudocode which omits pruning and most optimization steps, and which does not show the generation of constraints with pattern variables (cases 2, 4, and 5 in Table 1). However, these parts are slight variations on those shown and can easily be derived from them.

4.1.1 Generation Forest

The generation forest $\Omega$ is a data structure (e.g. a map or dictionary) which holds all discovered constraints divided over numerous generation trees. Each generation tree has a different constraint of size 1 as root, with depth d + 1 of the tree containing the children constraints that are obtained by adding new conditions to their parent constraints of depth d. Root constraints are of the form $p_k(\upsilon_t, \cdot) \leftarrow \{self(\upsilon_t, \upsilon_t)\}$, and are generated for each entity type $t$ for which assertion pattern $p_k(\upsilon_t, \cdot)$ meets the minimal support and confidence. An identity condition $self(\cdot, \cdot)$ is added to serve as initial candidate endpoint for Algorithm 2 (Alg. 2, line 9).

The initialization of the generation forest is shown in Algorithm 1. For each entity type in a graph of which the number of members meets the minimal support, we collect the assertions $p_k(\cdot, r_j)$ which occur for at least as many members as the minimal confidence requires. The assertion patterns corresponding to these assertions are combined with the entity types to form the root constraints. Algorithm 1 only shows this for assertion patterns of the form $p_k(\upsilon_t, e_j)$ and $p_k(\upsilon_t, l_j)$.

Type constraints (cases 2 and 4 in Table 1) and value constraints are generated similarly, but add an additional step. For type constraints, this step involves inferring the type of object r. For entities, this is achieved by exploiting the rdf:type relations, whereas the xsd:datatype declarations are used for literals. If no (data) type is found we default to super type rdfs:Class and datatype xsd:anyType for entities and literals, respectively.

Value pattern constraints (case 5 in Table 1) are generated by clustering all values r using k-means and by translating these clusters into patterns. The optimal number of clusters is automatically determined using the elbow method. How the patterns are generated depends on the datatype. For numerical values, these patterns take the form of a range between the two outer values of a cluster. Ranges are also used for dates and timecodes, which we convert to natural numbers by encoding these as unix timestamps. For strings, the patterns consist of regular expressions that match all values in a certain cluster.
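The translation from clusters to range patterns can be sketched as below. This is a simplified stand-in, not the authors' procedure: the number of clusters `k` is passed in by hand instead of being chosen with the elbow method, the k-means routine is a plain one-dimensional Lloyd's iteration with deterministic starting centres, and interpreting dates as UTC is an assumption. The sketch expects `1 <= k <= len(values)`.

```python
from datetime import datetime, timezone


def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on a line, with initial centres spread over the sorted values."""
    data = sorted(values)
    centres = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        updated = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
        if updated == centres:
            break
        centres = updated
    return [c for c in clusters if c]


def range_patterns(values, k):
    """Translate each cluster into a (lowest, highest) range pattern."""
    return sorted((min(c), max(c)) for c in kmeans_1d(values, k))


def to_unix(date_string):
    """Encode an ISO date as a unix timestamp (UTC is an assumption of this sketch)."""
    moment = datetime.fromisoformat(date_string).replace(tzinfo=timezone.utc)
    return int(moment.timestamp())


max_loads = [18.0, 20.5, 21.5, 55.0, 60.0]   # hypothetical max_load values
print(range_patterns(max_loads, k=2))

years = [to_unix(d) for d in ("1990-05-01", "1991-03-15", "2019-11-30")]
print(range_patterns(years, k=2))
```

Regular-expression patterns for string clusters are left out because the text does not describe how they are derived.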

The implementation of the algorithm is available at https://gitlab.com/wxwilcke/cckg

Algorithm 2: The main loop of the algorithm to discover constraints (simplified). Returns all constraints up to depth $d_{max}$ with minimal support and confidence. Pruning and optimization steps are omitted. The longest path in $A$ to variable $u$ is given by $\Delta(u)$.

 1: function DISCOVER($G$, $d_{max}$, $supp_{min}$, $conf_{min}$)
 2:   $\Omega$ := INITGENERATIONFOREST($G$, $supp_{min}$, $conf_{min}$)
 3:   $d := 0$
 4:   while $d < d_{max}$ do
 5:     for type $t$ in $\Omega$.types() do
 6:       $E := \emptyset$
 7:       for $\phi = c \leftarrow A$ in $\Omega(t, d)$ do
 8:         $C := \emptyset$
 9:         $I := \{a \in A \mid a = p_i(\cdot, \upsilon_{t'}) \wedge \Delta(\upsilon_{t'}) = d\}$
10:         for $a_i = p_i(\cdot, \upsilon_{t'}) \in I$ do
11:           $J := \{a \mid \psi \in \Omega(t', 0) \wedge \psi = a \leftarrow A'\}$
12:           for $a_j = p_j(\upsilon_{t'}, \cdot) \in J$ do
13:             $C := C \cup \{(a_i, a_j)\}$
14:         $E := E \cup$ EXPLORE($\phi$, $C$)
15:       $\Omega(t, d + 1) := E$
16:     $d := d + 1$
17:   return $\Omega$

4.1.2 Main Loop

The algorithm begins by generating the root constraints, which are then extended by a single level each iteration until the maximum depth is reached (Alg. 2). To do so, we begin each iteration by retrieving the previous generation of constraints of depth d, which form the parents from which we derive new constraints of depth d + 1. The result of each iteration, E, is stored back in the generation forest to be used by the next iteration.

To derive new constraints from parent constraints we first retrieve the set of candidate endpoints $I$ of a parent. The endpoints of a constraint $\phi = c \leftarrow A$ are the assertion patterns in $A$ that are leaves and have an object-type variable as object, $p_i(\cdot, \upsilon_{t'})$ (and thus can be extended). For each of the endpoints, the matching candidate extensions $J$ are the consequents $p_j(\upsilon_{t'}, \cdot)$ of the root constraints for type $t'$. These have been generated during initialization and are therefore ensured to have the required support and confidence. Together with the endpoints, the candidate extensions are passed as pairs $C$ to Algorithm 3, where they are used to extend the parent constraints.
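The pairing of endpoints with extensions can be sketched as follows. The `Condition` tuple, the way depth is stored next to each condition, and the `ROOTS` lookup table are hypothetical simplifications; the sketch only shows how an endpoint's right-hand type selects the root consequents that may be attached to it.

```python
from typing import NamedTuple


class Condition(NamedTuple):
    predicate: str
    lhs_type: str       # type of the left-hand object-type variable
    rhs: str            # a type name if rhs_is_var, else a concrete resource
    rhs_is_var: bool    # True when the right-hand side is an object-type variable


def candidate_pairs(antecedent, root_consequents, d):
    """Pair each endpoint at depth d with the root consequents of its right-hand type.

    antecedent: list of (Condition, depth) tuples, where depth is the length
    of the longest path to the condition's right-hand variable.
    root_consequents: {type: [Condition, ...]} taken from the size-1 constraints.
    """
    endpoints = [a for a, depth in antecedent if a.rhs_is_var and depth == d]
    return [(a_i, a_j)
            for a_i in endpoints
            for a_j in root_consequents.get(a_i.rhs, [])]


# Hypothetical root consequents per type.
ROOTS = {
    "Bridge": [Condition("crosses", "Bridge", "River", True),
               Condition("material", "Bridge", "Steel", False)],
    "River": [Condition("salinity", "River", "0.05", False)],
}

self_condition = Condition("self", "Bridge", "Bridge", True)
parent = [(self_condition, 0)]
pairs_d0 = candidate_pairs(parent, ROOTS, d=0)

# After attaching crosses(v_Bridge, v_River), that condition becomes the new leaf.
child = parent + [(ROOTS["Bridge"][0], 1)]
pairs_d1 = candidate_pairs(child, ROOTS, d=1)

print(pairs_d0)
print(pairs_d1)
```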

4.1.3 Explore

The exploration step searches through all possible diagonal combinations of parent constraint $\phi$ and its candidate extensions in a breadth-first fashion (Alg. 3). Concretely, if $A$ has size n, we first explore derivatives of size n + 1 by adding a single extension, of which the resulting constraints form the parents from which to explore size n + 2. This continues until all combinations are exhausted, after which the results are returned.

Algorithm 3: Explore and extend all candidate endpoints $a_i$ of parent constraint $\phi$ with candidate extensions $a_j$ to create derived constraint $\chi$ (simplified).

function EXPLORE($\phi$, $C$)
  $E$ := empty set
  $Q$ := empty queue
  $Q$.enqueue($\phi$)
  while $Q \neq \emptyset$ do
    $\psi$ := $Q$.dequeue()                  ▷ $\psi := c \leftarrow A$
    for $a_i, a_j \in C$ do
      $A' := A \cup \{a_j\}$                 ▷ $a_i$ and $a_j$ are incident
      $\chi := c \leftarrow A'$
      if $supp(\chi) \geq supp_{min} \wedge conf(\chi) \geq conf_{min}$ then
        $E := E \cup \{\chi\}$
        $Q$.enqueue($\chi$)
  return $E$

A new constraint $\chi$ is generated by adding the candidate extension $a_j = p_j(\upsilon_{t'}, \cdot)$ to the parent constraint at the corresponding endpoint $a_i = p_i(\cdot, \upsilon_{t'})$. The derived constraint is only returned if it meets the minimum support and confidence, and if these values are not equal to those of the parent (not shown in Alg. 3).
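A breadth-first exploration of this kind might look like the sketch below. The constraint representation (a consequent paired with a frozenset of conditions) and the caller-supplied `supp` and `conf` scoring functions are assumptions. Two checks are added that the simplified pseudocode leaves implicit: an extension is only attached if its endpoint is already part of the antecedent, and a derived constraint is skipped if it adds nothing new or was already found, which also guarantees termination.

```python
from collections import deque


def explore(parent, pairs, supp, conf, supp_min, conf_min):
    """Breadth-first extension of a parent constraint (consequent, antecedent).

    antecedent is a frozenset of conditions; pairs holds (endpoint, extension)
    tuples; supp and conf are caller-supplied functions that score a constraint.
    """
    consequent, _ = parent
    found = set()
    queue = deque([parent])
    while queue:
        _, antecedent = queue.popleft()
        for endpoint, extension in pairs:
            # Only extend at endpoints that exist, and never re-add a condition.
            if endpoint not in antecedent or extension in antecedent:
                continue
            child = (consequent, antecedent | {extension})
            if child in found:
                continue
            if supp(child) >= supp_min and conf(child) >= conf_min:
                found.add(child)
                queue.append(child)
    return found


# Toy scoring: every added condition shrinks support and confidence by one.
def score(constraint):
    return 5 - len(constraint[1])


parent = ("c", frozenset({"self"}))
pairs = [("self", "x"), ("self", "y")]
result = explore(parent, pairs, score, score, supp_min=2, conf_min=2)
for constraint in sorted(result, key=lambda chi: sorted(chi[1])):
    print(constraint[0], sorted(constraint[1]))
```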

4.2 Optimization

Our algorithm includes several optimization steps to reduce the search space. The most important steps are listed below:

• Constraints which apply to the same entities as their parent are pruned. This follows from the intuition that if the less restricted constraint has the same domain as the more restricted constraint, then the latter does not add anything over the former.

• Constraints which have already been tried via another route are excluded from creation. This can occur when their parents differ on exactly the conditions that these constraints now include.

• Sibling constraints that all have the same support and confidence values are pruned. This follows from the intuition that if the same restriction applies to overlapping domains which differ only by a single condition, then this separation between domains does not add any new information.

• Conditions that equal the restriction exactly or which are variations thereof (e.g. $p_k(\upsilon_t, e_j)$ and $p_k(\upsilon_t, \upsilon_{t'})$ where $\mathrm{type}(e_j, t')$) are never added. The same holds for conditions that are incident on the subject.

• Combinations of candidate endpoints and extensions for which we know (from a previous iteration) that they do not meet the minimal requirements are skipped. Assertion patterns for which this is the case are already filtered during the initialization of the generation forest.

Constraints considered for pruning are not removed immediately. Instead, we still allow these constraints to become parents for the next iteration before removal because we would otherwise lose potentially interesting (grand)children further down the branch. We call this delayed pruning.

5 EXPERIMENTS

We evaluate our method in two ways: firstly, from an algorithmic perspective, during which we test an implementation of our method and examine the constraints it is able to generate, and secondly, from a user perspective, by generating constraints from an in-use dataset and by letting experts assess them in a real-world knowledge validation use case.

5.1 Datasets

The constraints in our experiments are generated from three different datasets. We here provide a concise description of each of them. Table 2 lists basic statistics for each dataset.

AIFB. The AIFB dataset is a benchmark dataset for machine learning on knowledge graphs (Ristoski et al., 2016), and contains information about the staff and publications of a research institute. This dataset is the smallest of the three. Note that a modified version is used in this paper, which includes the datatype declarations needed to accurately determine the literals' modalities. These declarations are missing in the original version.

MUTAG. The MUTAG dataset is another benchmark dataset from (Ristoski et al., 2016), and describes complex molecules by their characteristics and shape, with the focus on their carcinogenic properties. This is the largest of the three datasets used in this paper.

The modified version of AIFB is available at https://gitlab.com/wxwilcke/mmkg

Table 2: Datasets used in the experiments. AIFB is modified to include data type declarations.

Dataset       AIFB      RWS       MUTAG
Assertions    29,219    56,364    74,567
Relations     45        305       23
Entities      6,072     3,895     32,621
Literals      5,468     12,844    1,104

RWS. The RWS dataset contains detailed knowledge about road and water constructions, including interchanges, bridges, tunnels, and many more (see e.g. Fig. 1). This knowledge consists, among others, of general characteristics (year of construction, dimensions, location, etc.), maintenance reports, and administrative information. The dataset contains legacy data and has been, and still actively is, worked on by many people from several departments of Rijkswaterstaat, the Dutch government agency responsible for the construction and management of major infrastructure facilities in the Netherlands. Because of its long and active use, it is prone to artefacts caused by invalid or inaccurate entries, by changes in procedures over time, or by past integration or conversion issues. These aspects make this dataset a suitable choice for the task of knowledge validation.

Due to the sensitive nature of this information we

are unfortunately prohibited from sharing this dataset.

5.2 Constraint Discovery

With this experiment we demonstrate our algorithm's ability to discover context-aware constraints from heterogeneous knowledge graphs, with the intent to show the trade-off between the chosen support and confidence values and the resulting number of constraints. We also show the effect of pruning on this number. An analysis of the computation time of our algorithm is omitted, because the experiments were run in a shared environment outside our control, which made the timings unreliable.

For each of the datasets listed in Table 2, constraints are generated up to depth 3 under several different support and confidence requirements. In each case, the support and confidence values are varied between 300 and 500 in steps of 100, resulting in 6 combinations. These combinations are chosen based on preliminary tests, which indicated that this range was supported by all three datasets without resulting in cases where no suitable constraints could be found or where the number of constraints became unmanageable. No limits are placed on the width and restrictions of constraints.
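To see why three values per threshold give 6 rather than 3 × 3 = 9 settings, the sketch below enumerates the pairs. It assumes, which the text does not state explicitly, that mirrored pairs are counted only once; the variable names are illustrative, and which element of a pair plays the role of support is deliberately left open.

```python
from itertools import combinations_with_replacement

# Threshold values from the text: 300 to 500 in steps of 100.
thresholds = list(range(500, 299, -100))  # [500, 400, 300]

# Assumption: each unordered pair of thresholds is used once, which is the
# reading that yields 6 settings instead of the full 3 x 3 grid.
settings = list(combinations_with_replacement(thresholds, 2))

for first, second in settings:
    print(first, second)

print(len(settings))  # 6
```

The resulting triangular shape matches the layout of Tables 3, 4, and 5, which contain three, two, and one entries per row.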

www.rijkswaterstaat.nl


Table 3: Number of generated constraints for AIFB as a function of the chosen support and confidence values. The number of pruned constraints is listed in parentheses.

Support

Conf.

500 400 300

500 79 (64) 112 (102) 193 (206)

400 234 (186) 315 (290)

300 498 (476)

Table 4: Number of generated constraints for MUTAG as a function of the chosen support and confidence values. The number of pruned constraints is listed in parentheses.

Support

Conf.

500 400 300

500 10 (0) 11 (0) 13 (0)

400 11 (0) 14 (0)

300 28 (22)

5.2.1 Results & Discussion

Tables 3, 4, and 5 list the number of generated constraints as a function of the chosen support and confidence values for AIFB, MUTAG, and RWS, respectively. A stark difference is visible in the number of constraints generated for each dataset. Where this number is rather small for MUTAG and slightly larger for AIFB, it far exceeds the number deemed manageable for RWS at confidence values lower than 500.

The results indicate that, as expected, the number of generated constraints grows strongly as the support and confidence values are lowered. However, there seems to be no direct relationship between this number and the size of the datasets: MUTAG, the largest dataset, has very few constraints, whereas RWS, which is considerably smaller, has the largest number of constraints. Instead, the statistics in Table 2 suggest that the number of relations is a more likely indicator of the number of generated constraints.
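The comparison can be made concrete with the figures from Tables 2 to 5. The sketch below ranks the three datasets by the number of constraints generated at the lowest setting (300/300) and compares that ranking with the rankings by assertion count and by relation count. The helper `ranking` is an illustrative name, and with only three datasets the outcome is suggestive at best.

```python
# Statistics copied from Table 2.
stats = {
    "AIFB": {"assertions": 29219, "relations": 45},
    "RWS": {"assertions": 56364, "relations": 305},
    "MUTAG": {"assertions": 74567, "relations": 23},
}

# Constraints generated at the lowest setting (300/300) in Tables 3, 4 and 5.
constraints = {"AIFB": 498, "MUTAG": 28, "RWS": 375326}


def ranking(values):
    """Return the dataset names ordered from largest to smallest value."""
    return sorted(values, key=values.get, reverse=True)


by_constraints = ranking(constraints)
by_relations = ranking({name: s["relations"] for name, s in stats.items()})
by_assertions = ranking({name: s["assertions"] for name, s in stats.items()})

print(by_constraints)  # ['RWS', 'AIFB', 'MUTAG']
print(by_relations)    # ['RWS', 'AIFB', 'MUTAG']
print(by_assertions)   # ['MUTAG', 'RWS', 'AIFB']
```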

The number of pruned constraints grows as the number of generated constraints rises, and by a similar factor. This is an expected outcome of our pruning strategy and suggests that this strategy is to an extent effective. Noteworthy is again the difference between datasets. With MUTAG and AIFB, the number of generated constraints exceeds the number of pruned constraints, whereas the reverse is true for RWS.
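One way to inspect how pruning relates to the number of generated constraints is to compute, per table cell, the ratio of pruned to generated constraints. The sketch below does this with the values from Tables 3, 4, and 5, read row by row; the ratio itself is an illustrative measure and not one defined in this paper.

```python
# (generated, pruned) pairs copied from Tables 3, 4 and 5, in reading order.
tables = {
    "AIFB": [(79, 64), (112, 102), (193, 206), (234, 186), (315, 290), (498, 476)],
    "MUTAG": [(10, 0), (11, 0), (13, 0), (11, 0), (14, 0), (28, 22)],
    "RWS": [(27, 56769), (27, 56769), (28, 58725),
            (76490, 446109), (76491, 448065), (375326, 732497)],
}

# Ratio of pruned to generated constraints; a value above 1 means that more
# constraints were pruned than were kept.
ratios = {
    name: [round(pruned / generated, 2) for generated, pruned in cells]
    for name, cells in tables.items()
}

for name, values in ratios.items():
    print(name, values)
```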

Table 6 shows five constraints that were sampled from the AIFB and MUTAG output sets. The first example has a value pattern as consequent (shown simplified as a range) and tells us that 1403/1841 ≈ 76% of all entities of the type Carbon-22 have a charge which lies between −0.158 and 0.063. The second example shows that 79% of the publications about ID70Instance (a certain individual) are also about ID69Instance (another individual). Examples 3 and 4 tell us that a compound that is mutagenic has a carbon-10 atom in 92% of the cases, while a compound which is not mutagenic has a hydrogen-3 atom in an equal share of the cases. The final constraint shows a value pattern with a regular expression (matching e.g. “123-456”), which holds for 30% of all manuscripts that are listed in titled proceedings. The relatively low confidence-to-support ratio of this last example limits its usefulness and makes it a candidate for removal.
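The percentages quoted above follow directly from the support and confidence counts in Table 6, as the sketch below shows. It also checks the page pattern of example 5 against the sample string. Since Python's `re` module has no POSIX character classes, the translation of `[:digit:]` and `[:punct:]` used here is an assumption, with `string.punctuation` standing in for the punctuation class.

```python
import re
import string

# (support, confidence) counts copied from Table 6.
rows = {1: (1841, 1403), 2: (127, 100), 3: (129, 119), 4: (129, 119), 5: (431, 131)}

# Percentage of the entities in each context that satisfy the consequent.
shares = {i: round(100 * conf / supp) for i, (supp, conf) in rows.items()}
print(shares)  # {1: 76, 2: 79, 3: 92, 4: 92, 5: 30}

# Assumed Python translation of the pattern in example 5:
# three digits, one punctuation character, three digits.
punct = "[" + re.escape(string.punctuation) + "]"
pages = re.compile(r"^[0-9]{3}" + punct + r"{1}[0-9]{3}$")

print(bool(pages.match("123-456")))  # True
print(bool(pages.match("1234-56")))  # False
```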

5.3 User Study

The user study takes the form of a half-day workshop with a questionnaire at the end. The participants are experts on knowledge management and validation who are employed at Rijkswaterstaat. During the workshop, these participants are given a presentation which explains the constraint generation process as well as the constraints themselves. After the presentation, participants are provided with a questionnaire and asked to fill it in individually.

The questionnaire is designed to investigate the trade-off between the context granularity of the generated constraints and their perceived effectiveness in capturing relevant patterns in the knowledge. Constraints with increasingly finer-grained contexts are generated and presented to the participants, who are asked to rate these constraints on how relevant they are for the task of knowledge quality control in the domain of asset management. Here, relevance is expressed as whether the presented constraints are too fine grained, too coarse grained, or somewhere in between.

Too fine-grained constraints have a relatively large number of conditions, which translates to a relatively small domain. Constraints such as these are undesirable because their use is limited to only a few data points. These constraints are also more likely to capture outliers, are difficult to transfer to unseen data, and can increase the total number of constraints to unmanageable amounts. Too coarse-grained constraints have few conditions and apply to a relatively large domain, and are undesirable because they limit our ability to distinguish between subsets of similar data points and can result in an increase in the number of false positives and/or negatives. Between too fine-grained and too coarse-grained lie the constraints which our participants perceive as balanced and most effective for knowledge maintenance.
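The relation between the number of conditions and the size of the domain can be illustrated with a small sketch. The records and property names below are hypothetical toy data and do not come from the RWS dataset; the function `domain` is likewise an illustrative helper, not part of our algorithm.

```python
# Hypothetical toy records; none of these values come from the RWS dataset.
assets = [
    {"type": "Bridge", "material": "steel", "built": 1965},
    {"type": "Bridge", "material": "concrete", "built": 1998},
    {"type": "Bridge", "material": "steel", "built": 2004},
    {"type": "Tunnel", "material": "concrete", "built": 1998},
    {"type": "Lock", "material": "steel", "built": 1931},
]


def domain(records, conditions):
    """Return the records that satisfy every condition of a context."""
    return [r for r in records if all(r[k] == v for k, v in conditions.items())]


coarse = domain(assets, {})                  # no conditions: everything matches
medium = domain(assets, {"type": "Bridge"})  # one condition
fine = domain(assets, {"type": "Bridge", "material": "steel", "built": 2004})

print(len(coarse), len(medium), len(fine))  # 5 3 1
```

Each added condition can only keep or shrink the domain, which is why constraints with many conditions tend to apply to only a few data points.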

KDIR 2020 - 12th International Conference on Knowledge Discovery and Information Retrieval

Table 5: Number of generated constraints for RWS as a function of the chosen support and confidence values. The number of pruned constraints is listed in parentheses.

Support

Conf.

500 400 300

500 27 (56,769) 27 (56,769) 28 (58,725)

400 76,490 (446,109) 76,491 (448,065)

300 375,326 (732,497)

Table 6: A sample of five hand-picked context-aware constraints with their support and confidence values. All examples are simplified by omitting URIs and identity conditions, and are ordered by depth.

   Supp.  Conf.  Constraint
1  1841   1403   charge(υ, [−0.158 ≤ v ≤ 0.063]) ← type(υ, Carbon-22)
2  127    100    about(υ, ID69Instance) ← type(υ, Publication e) ∧ about(e, ID70Instance)
3  129    119    hasAtom(υ, Hydrogen-3) ← type(υ, Compound e) ∧ isMutagenic(e, False)
4  129    119    hasAtom(υ, Carbon-10) ← type(υ, Compound e) ∧ isMutagenic(e, True)
5  431    131    pages(υ, “^[:digit:]{3}[:punct:]{1}[:digit:]{3}$”) ← type(υ, InProceedings e) ∧ bookTitle(e, dtype) ∧ type(dtype, [XSD:string])

In addition to the above, participants are also

asked about their familiarity with relevant topics, and

about their opinion on the usefulness of context-aware

constraints as a whole.

The questionnaire contains 3×4 constraints. Each

group of 4 represents a different level of granularity,

and is sampled by dividing the full set of constraints,

generated from the RWS dataset, into groups of low,

average, and high complexity. Low-complexity con-

straints are of depth 1, whereas average- and high-

complexity constraints are of depth 2 and 3, respec-

tively. In all cases, the context width varies between

1 and 4. No limit is placed on the type of restriction:

any of those listed in Table 1 is allowed to occur. A

5-point Likert scale is used for all questions, with an

additional sixth option unsure for the constraint gran-

ularity questions to prevent unreliable answers.

All constraints are presented as if-then business rules in natural language to prevent unfamiliarity with knowledge graph terminology and/or the constraint syntax from confounding the results.
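As an illustration of such a presentation, the sketch below renders a constraint as an if-then sentence. The template, the function name `to_business_rule`, and the wording are assumptions made for this example; the actual phrasing used in the questionnaire is not reproduced here. The input is example 4 from Table 6.

```python
def to_business_rule(context, consequent):
    """Render a constraint as an if-then sentence (illustrative wording)."""
    conditions = " and ".join(f"{prop} is {value}" for prop, value in context)
    prop, value = consequent
    return f"IF {conditions} THEN {prop} is {value}"


# Example 4 from Table 6: mutagenic compounds tend to have a carbon-10 atom.
rule = to_business_rule(
    [("type", "Compound"), ("isMutagenic", "True")],
    ("hasAtom", "Carbon-10"),
)
print(rule)  # IF type is Compound and isMutagenic is True THEN hasAtom is Carbon-10
```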

5.3.1 Results & Discussion

A total of 21 experts on knowledge management and validation participated in our user study. Table 7 lists the median and mode familiarity of these participants with relevant topics, on a scale that ranges from fully disagree to fully agree. Krippendorff's alpha is used to assess inter-rater agreement. Overall, the participants are moderately to very confident about their familiarity with each of the topics, but, with only a fair agreement (α = 0.26), it seems that this level of confidence is not uniformly distributed over all participants. Regardless, the confidence is especially strong for their knowledge of database terminology. In contrast, participants seem only moderately confident about their familiarity with the domain, which can be explained by the different departments the participants work at and the different subsets of the data these departments focus on. Nevertheless, the overall and individual confidence levels are strong enough to ensure that we can trust the answers our participants provide.

Table 8 shows the perceived complexity as a portion of the scores for our 12 constraints combined, and for each of the three complexity groups separately. We left out the unsure answers to improve reliability. Overall, slightly more than half of the participants thought the complexity was well balanced, with the other four score levels dividing the remaining portion roughly equally, with values between 0.08 and 0.16 each. This suggests that the generated constraints are to an extent suited for the task of knowledge validation.

Little difference between low-, average-, and high-complexity constraints is visible for all five score levels, with a minimal and maximal difference between groups of 0.03 for slightly too fine grained and 0.09 for far too fine grained, respectively. This minor difference implies that complexity does not affect the relevance of the constraints, or that the different complexity groups differ too slightly to have an impact on said relevance. This is supported by significance tests (Tab. 9), which indicate little to no difference in distributions.

Table 7: Familiarity of the participants (α = 0.26) with the domain, with data validation and data quality rules (of any form), with database terminology, and with knowledge graphs. The last column shows the correlation (Kendall's tau) with perceived usefulness (Tab. 10).

Familiarity   Median        Mode          τ
Domain        neutral       agree         0.45
Data Val.     agree         agree         0.32
Data QR       agree         agree         0.41
DB Terms      fully agree   fully agree   0.07
Kn. Graphs    agree         agree         0.25

Table 8: Relevance shown as portions of scores given by participants for constraints of low, average, and high complexity, and for all forms combined. Scores range from far too fine grained (FG) to far too coarse grained (CG). Mode and median are balanced for all cases. Unsure scores are omitted.

Score        Low    Average   High   Comb.
far too FG   0.07   0.16      0.14   0.12
sl. too FG   0.17   0.14      0.14   0.16
balanced     0.51   0.53      0.57   0.53
sl. too CG   0.13   0.08      0.11   0.11
far too CG   0.11   0.08      0.04   0.08
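The minimal and maximal differences between the complexity groups can be recomputed from the portions in Table 8. The sketch below takes, for each score level, the gap between the largest and smallest portion over the three groups; the variable names are illustrative.

```python
# Portions copied from Table 8: (low, average, high) per score level.
table8 = {
    "far too FG": (0.07, 0.16, 0.14),
    "sl. too FG": (0.17, 0.14, 0.14),
    "balanced": (0.51, 0.53, 0.57),
    "sl. too CG": (0.13, 0.08, 0.11),
    "far too CG": (0.11, 0.08, 0.04),
}

# Largest gap between any two complexity groups, per score level.
# Rounding removes floating-point noise from the subtraction.
spread = {score: round(max(p) - min(p), 2) for score, p in table8.items()}

smallest = min(spread, key=spread.get)
largest = max(spread, key=spread.get)

print(smallest, spread[smallest])  # sl. too FG 0.03
print(largest, spread[largest])    # far too FG 0.09
```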

There is an overall fair to moderate agreement (α = 0.34) on relevance between participants when looking at the combined scores (Tab. 9). However, this agreement varies considerably when we take the complexity group into account, from only a slight to fair agreement for low-complexity constraints (α = 0.16) to a substantial agreement for high-complexity constraints (α = 0.63). This stark difference seems to contrast with the minor difference seen in Table 8, which suggests that more participants answered unsure (answers which were filtered out) as the complexity increased.

Participants have a neutral to positive stance with respect to the overall usefulness of our method (Tab. 10). However, a considerable portion of the participants seems unsure about this usefulness, which supports our earlier assumption that participants became less confident as the complexity increased. Correlation analysis (Tab. 7) suggests that this effect may be partly caused by (the lack of) participants' familiarity with the domain and with quality rules, both of which have a moderate positive relationship with the perceived usefulness. Because unsure has the lowest position on our Likert scale, this suggests that participants who are unfamiliar with the domain and with quality rules are also more likely to be unsure about the usefulness of the method.

Table 9: Inter-rater agreement (Krippendorff's alpha) and p-values (Kruskal-Wallis at significance level 0.05) for constraints of low, average, and high complexity, and for all forms combined. Unsure scores are omitted.

Complexity   α      p-value
Low          0.16   0.20
Average      0.42   0.43
High         0.63   0.58
Combined     0.34   -

Table 10: Usefulness of the method as perceived by participants in relative numbers.

Score           Portion
fully disagree  0.00
disagree        0.10
neutral         0.24
agree           0.24
fully agree     0.05
unsure          0.38

6 CONCLUSION

In this work, we introduced context-aware constraints for knowledge quality control, which offer more fine-grained control over the domains on which we want to impose restrictions. We also introduced a bottom-up, anytime, and easily parallelizable algorithm to discover context-aware constraints directly from heterogeneous knowledge graphs.

We demonstrated our method on three different

datasets, which showed that there is no direct rela-

tionship between the size of a dataset and the number

of generated constraints, making it difﬁcult to apply

a rule of thumb to the chosen support and conﬁdence

values. However, our results do suggest a positive cor-

relation between the relation count and the number of

generated constraints.

Our evaluation consisted of a user study amongst experts on knowledge engineering and maintenance, who were invited to a workshop and asked to assess various constraints on asset management. Their answers indicate that, overall, context-aware constraints are to an extent useful for knowledge validation tasks, and that the majority of the constraints were well balanced with respect to complexity. However, a considerable number of participants were nevertheless unsure about the usefulness of the method. Our analysis suggests that a lack of familiarity with the domain and with quality rules might be the cause, although a more in-depth study is needed.

Our algorithm has a few noteworthy limitations. A practical limitation concerns scalability, as our algorithm needs to evaluate a great many combinations. This problem is slightly reduced by our pruning and other optimization methods, and can also be alleviated by parallelizing the task, but will nevertheless remain a challenge as the dataset increases in size and, most particularly, in the number of relations. Another possible limitation lies with our assumption that the majority of the knowledge is valid and accurate. An insufficiently large ratio between valid/accurate and invalid/inaccurate knowledge can result in a relatively high number of false positives and negatives, reducing the usefulness of our method. A final noteworthy limitation is the high sensitivity to the provided support and confidence values, which, depending on the characteristics of the dataset, can result in too few or in an unmanageable number of constraints. However, this is a common problem in this field of research.

We identified several potential extensions to our method, which we offer as suggestions for future work. Firstly, our algorithm currently only generates a proper subset of the constraints expressible by constraint languages such as ShEx and SHACL, missing support for e.g. cardinality restrictions. Adding support for these constraints would make our method more useful for real-world knowledge validation tasks. Another angle worth pursuing, but which fell outside our current scope, is the analysis of our algorithm's time complexity, the theoretical speedup which can be obtained through parallelization, and how it deals with the satisfaction and entailment problems.

ACKNOWLEDGEMENTS

We express our gratitude to Jaap Bakker, coordinating

specialist advisor on asset management and data inte-

gration at Rijkswaterstaat, for providing us access to

the data infrastructure, experts, and facilities needed

to complete our research. This research was made

possible with the help of Rijkswaterstaat, The Nether-

lands.

REFERENCES

Akhtar, W., Cortés-Calabuig, A., and Paredaens, J. (2010). Constraints in rdf. In International Workshop on Semantics in Data and Knowledge Bases, pages 23–39. Springer.

Anbutamilazhagan, T. and Selvaraj, M. K. (2014). A novel

model for mining association rules from semantic web

data. Elysium Journal, 1(2).

Barati, M., Bai, Q., and Liu, Q. (2016). SWARM: An Ap-

proach for Mining Semantic Association Rules from

Semantic Web Data, pages 30–43. Springer Interna-

tional Publishing, Cham.

Bohannon, P., Fan, W., Geerts, F., Jia, X., and Kementsi-

etsidis, A. (2007). Conditional functional dependen-

cies for data cleaning. In 2007 IEEE 23rd interna-

tional conference on data engineering, pages 746–

755. IEEE.

Calvanese, D., Fischl, W., Pichler, R., Sallinger, E., and

Simkus, M. (2014). Capturing relational schemas and

functional dependencies in rdfs. In Twenty-Eighth

AAAI Conference on Artiﬁcial Intelligence.

Cortés-Calabuig, A. and Paredaens, J. (2012). Semantics of constraints in rdfs. In AMW, pages 75–90. Citeseer.

Fan, W., Hu, C., Liu, X., and Lu, P. (2018). Discover-

ing graph functional dependencies. In Proceedings of

the 2018 International Conference on Management of

Data, pages 427–439. ACM.

Fan, W. and Lu, P. (2017). Dependencies for graphs. In Pro-

ceedings of the 36th ACM SIGMOD-SIGACT-SIGAI

Symposium on Principles of Database Systems, pages

403–416. ACM.

Fan, W., Wu, Y., and Xu, J. (2016). Functional dependen-

cies for graphs. In Proceedings of the 2016 Inter-

national Conference on Management of Data, pages

1843–1857. ACM.

Fürber, C. (2015). Data quality management with semantic technologies. Springer.

Fürber, C. and Hepp, M. (2011). Towards a vocabulary for data quality management in semantic web architectures. In Proceedings of the 1st International Workshop on Linked Web Data Management, pages 1–8.

Galárraga, L. A., Teflioudi, C., Hose, K., and Suchanek, F. (2013). Amie: association rule mining under incomplete evidence in ontological knowledge bases. In Proceedings of the 22nd international conference on World Wide Web, pages 413–422.

Hamad, F., Liu, I., and Zhang, X. X. (2018). Food

discovery with uber eats: Building a query

understanding engine. https://eng.uber.com/

uber-eats-query-understanding/. Accessed: 2020-05-

20.

He, B., Zou, L., and Zhao, D. (2014). Using conditional

functional dependency to discover abnormal data in

rdf graphs. In Proceedings of Semantic Web Informa-

tion Management on Semantic Web Information Man-

agement, pages 1–7. ACM.

Hellings, J., Gyssens, M., Paredaens, J., and Wu, Y. (2016). Implication and axiomatization of functional and constant constraints. Annals of Mathematics and Artificial Intelligence, 76(3-4):251–279.

Lausen, G., Meier, M., and Schmidt, M. (2008). Sparqling

constraints for rdf. In Proceedings of the 11th inter-

national conference on Extending database technol-

ogy: Advances in database technology, pages 499–

509. ACM.

Meilicke, C., Chekol, M. W., Ruffinelli, D., and Stuckenschmidt, H. (2019). An introduction to anyburl. In Joint German/Austrian Conference on Artificial Intelligence (Künstliche Intelligenz), pages 244–248. Springer.

Ramezani, R., Saraee, M., and Nematbakhsh, M. (2014).

Swapriori: a new approach to mining association rules

from semantic web data. Journal of Computing and

Security, 1:16.

Ristoski, P., De Vries, G. K. D., and Paulheim, H. (2016). A

collection of benchmark datasets for systematic eval-

uations of machine learning on the semantic web. In

International Semantic Web Conference, pages 186–

194. Springer.

Singhal, A. (2012). Introducing the knowledge graph:

Things, not strings. https://googleblog.blogspot.com/

2012/05/introducing-knowledge-graph-things-not.

html. Accessed: 2020-05-20.

Sun, E. and Iyer, V. (2013). Under the

hood: The entities graph. https://www.

facebook.com/notes/facebook-engineering/

under-the-hood-the-entities-graph/

10151490531588920. Accessed: 2020-05-20.

Szekely, P., Knoblock, C. A., Yang, F., Zhu, X., Fink, E. E.,

Allen, R., and Goodlander, G. (2013). Connecting the

smithsonian american art museum to the linked data

cloud. In Extended Semantic Web Conference, pages

593–607. Springer.

Tayi, G. K. and Ballou, D. P. (1998). Examining data qual-

ity. Communications of the ACM, 41(2):54–57.

Tresp, V., Bundschus, M., Rettinger, A., and Huang, Y.

(2008). Towards machine learning on the semantic

web. In Ursw (lncs vol.), pages 282–314. Springer.

Yu, Y. and Heﬂin, J. (2011). Extending functional depen-

dency to detect abnormal data in rdf graphs. In Inter-

national Semantic Web Conference, pages 794–809.

Springer.
