On the Automatic Generation of Knowledge Connections

Felipe Poggi A. Fraga¹ᵃ, Marcus Poggi¹ᵇ, Marco A. Casanova¹ᶜ and Luiz André P. Paes Leme²ᵈ

¹ Pontifical Catholic University of Rio de Janeiro, Rio de Janeiro RJ, Brazil
² Universidade Federal Fluminense, Niterói RJ, Brazil

ᵃ https://orcid.org/0000-0002-1385-9098
ᵇ https://orcid.org/0000-0003-0827-7111
ᶜ https://orcid.org/0000-0003-0765-9636
ᵈ https://orcid.org/0000-0001-6014-7256
Keywords: Personal Knowledge Management, Natural Language Processing, Semantic Similarity, Concept Extraction, Text Relatedness, Connections Generation, Note-Taking Apps, Bi-Directional Hyperlinks.
Abstract: Recently, the topic of Personal Knowledge Management (PKM) has seen a surge in popularity. This is illustrated by the accelerated growth of apps such as Notion, Obsidian, and Roam Research, as well as the appearance of books like “How to Take Smart Notes” and “Building a Second Brain”. However, the area of PKM has not seen much integration with Natural Language Processing (NLP). This opens up an interesting opportunity to apply NLP techniques to the way people operate with knowledge. This paper proposes a methodology that uses NLP and networked note-taking apps to transform a siloed text collection into an interconnected and inter-navigable text collection. The navigation mechanisms are based on shared concepts and semantic relatedness between texts. The paper presents the methodology, demonstrates it using examples, and describes an evaluation to determine whether the system functions correctly and whether the proposed connections are coherent.
1 INTRODUCTION
The recent surge in popularity of the Personal Knowl-
edge Management (PKM) field has led to a myriad
of new note-taking tools being released since 2016.
These innovative tools provide the general public with
brand-new functionalities to operate with knowledge
that were not previously available, an example being
bidirectional hyperlinks.
Even though advancements in note-taking tools are
happening simultaneously with the accelerated devel-
opment of Natural Language Processing (NLP), as of
today, these two fields still interact in a very superfi-
cial way. This presents an opportunity to combine the
fields of PKM and NLP, by enhancing the features of
note-taking tools using Artificial Intelligence.
More concretely, this work is inspired by the intersection of three elements: (1) Personal Knowledge Management; (2) the new generation of note-taking tools; (3) Natural Language Processing.
Personal Knowledge Management deals with creat-
ing an external and persistent collection of a person’s
knowledge. Recently, PKM has become increasingly popular, as evidenced by the popularity of two books: “How to Take Smart Notes” (Ahrens, 2017), which explains the Zettelkasten, the note-taking process of the prolific sociologist Niklas Luhmann; and “Building a Second Brain” (Forte, 2022), which explains how to build an external collection of knowledge in order to work faster and with better quality.
Another piece of evidence for this increase in pop-
ularity is the appearance of a new category of note-
taking tools. Tools such as Roam Research, Obsidian, and Tana present new features for knowledge organization based on networked note-taking. Networked note-taking contrasts with the hierarchical, folder-document systems used by traditional note-taking. The feature that stands out as essential for this work is the use of bidirectional hyperlinks to connect notes.
Representing both directions of the link, front-links
(outgoing) and back-links (incoming), provides a fun-
damentally different type of knowledge to work with.
The main contributions and possible impacts on
the field of Personal Knowledge Management are:
Contribution 1: A methodology that automatically generates connections between texts. Connections between texts are generated following two different options: Semantic Relatedness between texts and Shared Concepts. The connections are
considered to be coherent and reliable for the intended
functions of Recall, Elaboration, and New Insight.
Contribution 2: A theoretical and practical workflow for introducing NLP capabilities into modern note-taking tools. To the best of our knowledge, this is the first work to explicitly unite the fields of note-taking apps and Natural Language Processing. The methodology presented in the paper contributes to the design and implementation of note-taking applications that use NLP.
Contribution 3: A technique for generating connections between texts using Shared Concepts. The proposed navigation using concept nodes provides relevant information about concepts and creates bridges between texts that mention the same concept, with the added possibility of calculating text relatedness using the concepts mentioned in the texts.
The possible implications are, therefore, to lower the barrier to entry for applying Natural Language Processing to Knowledge Management, which may enhance collective intelligence and our capability to overcome the challenges that humanity must soon face.
The rest of this paper is organized as follows. Sec-
tion 2 defines the problem to be solved. Section 3
covers related work. Section 4 outlines the proposed
methodology. Section 5 explains the experiments that
evaluate the performance of the proposed methodol-
ogy. Section 6 discusses the implications of this work.
Finally, Section 7 contains the conclusions.
2 PROBLEM STATEMENT
The objective of this work is to apply NLP techniques to leverage and enhance the current functionalities of note-taking apps, in order to facilitate, empower, or replace human effort in operating with knowledge, specifically inside networked note-taking apps.
The approach is intended to help users explore a
text collection by automatically proposing connections
between texts. An essential element of the approach
is to use networked note-taking apps to visualize and
navigate through the connections.
Furthermore, the problem statement and the methodology are designed to ensure that users retain an active role, since automatically generated connections could otherwise nudge users into a passive one, with Artificial Intelligence doing all the “thinking”.
Given an initial text collection, a user is responsible for selecting the highlights s/he wants to insert into the system for the automatic generation of connections. Hence, the user has a direct influence on the result of the methodology. A highlight is a selected passage from a text in the initial text collection that is used to create a set of highlights, H, to which the methodology will be applied.
Two types of connections are proposed to navigate through the highlights:

1. Concept Connections: Navigation between highlights through shared concepts, by first navigating from a highlight h to a concept c mentioned in h, and then from c to other highlights that mention c.

2. Text Relatedness Connections: Navigation based on a recommendation system, with texts suggested according to Semantic Relatedness.
The problem addressed is informally defined as:

What: Automatically generate connections to create a highlights collection that is interconnected and inter-navigable, represented by a graph.
How: Create connections between highlights using shared concepts and semantic relatedness.
Where: Use networked note-taking tools to navigate.
In what follows, recall that, in an undirected graph, an edge is an unordered pair {x, y} of graph nodes (contrasting with a directed graph, where a directed arc is an ordered pair (x, y) of nodes). Given two sets X and Y of nodes, let XY denote the set of all edges {x, y} such that x ∈ X and y ∈ Y.
The problem is then more precisely defined as follows. Given a set of highlights H, create an undirected graph G_H = (V, E), called an interconnection graph for H, such that:

V ⊆ H ∪ C ∪ A   (1)

where H is a set of Highlight nodes, C is a set of Concept nodes, and A is a set of Author nodes, and

E ⊆ HH ∪ HC ∪ CC ∪ AA ∪ AC ∪ AH   (2)

where the edges represent connections between the nodes and are always bidirectional.

We assume that all edges in AH are given, while the others must be computed, as well as the Concept nodes, C.
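To make the structure concrete, the following is a minimal sketch of how an interconnection graph G_H could be represented in code, using the networkx library. The node names, attribute conventions, and example data are illustrative assumptions, not part of the paper's implementation.

```python
import networkx as nx

# Build a small interconnection graph G_H (illustrative data).
G = nx.Graph()  # undirected: every edge is bidirectional

# Node types: Highlight (H), Concept (C), Author (A).
G.add_node("h1", kind="highlight", text="Knowledge compounds when ideas are linked.")
G.add_node("h2", kind="highlight", text="Wisdom emerges from connected information.")
G.add_node("c_knowledge", kind="concept", uri="http://dbpedia.org/resource/Knowledge")
G.add_node("a_forte", kind="author", name="Tiago Forte")

# Edge types: AH is given; HC, HH, etc. are computed by the methodology.
G.add_edge("a_forte", "h1", kind="AH")
G.add_edge("h1", "c_knowledge", kind="HC")  # h1 mentions the concept
G.add_edge("h2", "c_knowledge", kind="HC")
G.add_edge("h1", "h2", kind="HH", score=0.72)  # text relatedness connection

# Concept navigation (h -> c -> h'): highlights reachable via a shared concept.
def highlights_via_concept(g: nx.Graph, h: str) -> set[str]:
    concepts = [n for n in g.neighbors(h) if g.nodes[n]["kind"] == "concept"]
    return {h2 for c in concepts for h2 in g.neighbors(c)
            if g.nodes[h2]["kind"] == "highlight" and h2 != h}

print(highlights_via_concept(G, "h1"))  # {'h2'}
```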
3 RELATED WORK
3.1 Frameworks and Systems
Becker et al. (2021b) presents CO-NNECT, a frame-
work that proposes connection paths between sen-
tences according to the concepts mentioned and their
relations. This work uses concepts in ConceptNet and
language models trained on knowledge relations from
ConceptNet. The authors suggest this framework could be used to enrich texts and knowledge bases.
Ilkou (2022) uses Entity Extraction and the DBpedia knowledge graph to generate Personal Knowledge Graphs containing e-learning users' personal information regarding learning profiles and activities, aiming to enhance the learning experience.
Blanco-Fernández et al. (2020) presents a system
that, given a question and a right answer, automati-
cally generates wrong answers to distract the user in
multiple-choice questions. The system uses knowl-
edge bases and semantic relatedness between texts.
3.2 Concept Recognition
This section briefly outlines works that focus on ex-
tracting concepts.
(Mendes et al., 2011) introduces DBpedia Spot-
light, which approaches concept extraction as a text
annotation task, and can annotate mentions with DB-
pedia resources (which include abstract concepts).
Becker et al. (2021a) extracts ConceptNet concepts
from natural text using a series of semantic manipu-
lations to form candidate phrases, which are matched
and mapped to the ConceptNet concepts.
Fang et al. (2021) proposes GACEN (Guided At-
tention Concept Extraction Network), a technique of
attention networks feeding a CRF to extract concepts
using the title, topic, and clue words.
Recent Surveys on Named Entity Recognition (Li
et al., 2020) and (Canales and Murillo, 2017) point
to a set of industry-based tools. One tool mentioned
in both surveys is Dandelion API, (SpazioDati, 2012),
which can identify conceptual entities. Dandelion API
performs a high-quality identification of entities, as
well as entity linking to the DBpedia knowledge base.
3.3 Using Knowledge Graphs
This section briefly outlines a few works that use
knowledge bases instead of building them.
Resnik (1995) presents a commonly used seman-
tic similarity measure using the is-A taxonomy from
WordNet to provide Information Content. The algo-
rithm measures the semantic similarity between con-
cepts by how much information they have in common.
Piao and Breslin (2015) presents Resim, a Resource Similarity metric between DBpedia resources based on Linked Data Semantic Distance (LDSD).
Leal et al. (2012) proposes a Semantic Related-
ness approach between concepts using the paths on an
ontological graph extracted from DBpedia.
Anand and Kotov (2015) utilizes DBpedia and Con-
ceptNet to perform query expansions and retrieve ad-
ditional results to a given query.
3.4 Text Semantic Relatedness
This section presents works that perform the tasks of
Text Semantic Relatedness and Text Semantic Simi-
larity. This task is usually divided into two main cate-
gories of algorithms: Knowledge-based methods and
Corpus-based methods (Gomaa and Fahmy, 2013).
3.4.1 Knowledge-Based Methods
Speer et al. (2017) describes the ConceptNet Number-
batch, a relatedness measure between concepts using
the ConceptNet knowledge base.
Yazdani and Popescu-Belis (2013) presents a relat-
edness metric based on Visiting Probability using Ran-
dom Walks between two texts. Visiting probability is
calculated using relations matrices between concepts.
The metric uses two different relation types, Wikipedia
links and a relatedness score between concepts.
Ni et al. (2016) builds a Concept2Vector represen-
tation of concepts and then computes a cosine similar-
ity to determine the similarity between concepts and
eventually between documents by combining distances
between concepts in each document.
3.4.2 Corpus-Based Methods
With the introduction of word embeddings, (Mikolov
et al., 2013), similarity metrics shifted to using the
semantic meaning of the words. Nonetheless, pre-
trained word embeddings still have limitations when
dealing with polysemic words, which have multiple
meanings, such as “bank” which can mean a river bank
or a financial bank.
One solution to this problem is BERT, presented in Devlin et al. (2019). BERT (Bidirectional Encoder Representations from Transformers) uses the encoder stack of the Transformer architecture (Vaswani et al., 2017) to create word embeddings that capture the meaning of surrounding words. BERT pre-trains language models considering both directions of the text, so that words further along in the text still influence the vectorial representation of earlier words.
A notable work that builds on the original BERT model is presented in Reimers and Gurevych (2019). Sentence-BERT, or SBERT, generates sentence embeddings without feeding every pair of texts through the full BERT architecture, which, in the authors' benchmark of finding the most similar pair among 10,000 sentences, reduces computation time from about 65 hours to roughly 5 seconds.
Figure 1: Overview of the proposed Methodology.
4 METHODOLOGY FOR CONSTRUCTION OF INTERCONNECTION GRAPHS
4.1 Overview of the Methodology
The goal of the proposed methodology is to create an interconnected, human-readable, and human-navigable collection of highlights, G_H. Instead of being a one-time execution, the methodology is recurrently applied to continuously updated sets of highlights H_i selected from the initial text collection.
The methodology creates a first interconnection graph G_{H_0} and then updates the initial graph with newly added highlights, G_{H_i} → G_{H_{i+1}}. This is done to allow the highlights set H_i to change constantly as users add new texts and passages.
Each interconnection graph, G_{H_i}, can be navigated by transforming the nodes in the graph into pages inside a networked note-taking tool that enables navigation and editing. The chosen tool for this task is Obsidian (Li and Xu, 2020).

Figure 1 details the methodology overview, including this background check for new highlights. The remainder of this section details the processes for a single pass of the methodology for a fixed set of highlights, H. This process occurs between stages 2-4.
To create an interconnection graph, G_H, the underlying problem is simplified to solving for 9 graph components (3 node types and 6 edge types). We assume that H, A, and AH are given and known in advance. The six remaining components (1 node type and 5 edge types) are obtained from the two paths described in Section 2:

1. Concept Connections
   - Concept Recognition: C, HC, and AC
   - Concept Relationships: CC
2. Text Relatedness Connections: HH and AA
Concept Connections represent paths between any given highlight and other highlights that mention the same concept. Text Relatedness Connections create a direct bridge between highlights, providing a clear picture of other highlights and ideas that are related to any given highlight.
4.2 Concept Connections
Concept Connections refer to all the connections that
involve concepts, the most important type being be-
tween Highlights and Concepts, which is identified
through the task of Concept Recognition. Concept Re-
lationship, subsubsection 4.2.2, plays an exploratory
role and collects important information for Section 4.3
on Text Relatedness.
4.2.1 Concept Recognition
This section details the tasks of entity recognition and
the filtering of these entities to obtain concepts. The
pipeline starts with entities, and as the filtering process
occurs, the term concepts will be employed. Concepts
are defined as conceptual entities, such that every con-
cept is also an entity, but not all entities are concepts,
Equation 3. In special, for the scope of this paper,
Named Entities are not considered to be concepts.
C E (3)
where C is the set of concepts and E is the set of
entities.
The first task in the pipeline for this section is
identifying the entities mentioned in a highlight’s text.
Since all types of entities are initially extracted, the
task is described as Entity Recognition.
The task is the joint task of Entity Recognition
together with Entity Linking, which is formalized as
follows:
1. Spot mentions in the text.
2. Collect candidate entities for each mention.
3.
Select the most likely entity represented by each
mention.
Several tools were considered for the joint task, and two were chosen.

The Dandelion API (SpazioDati, 2012) was used for the initial Entity Recognition for three reasons. First, it has the functionality of identifying concepts; second, it provides an exact confidence level for each mention-entity occurrence; and third, it links each mention to a DBpedia resource (entity).

The other tool is DBpedia Spotlight (Mendes et al., 2011), which also links mentions to DBpedia resources and is used to identify any mention-entity occurrences that may have been missed by the Dandelion API. DBpedia Spotlight is used for this task because it has a broader reach and captures more mentions.
It is worth mentioning that neither tool requires the traditional data cleaning of removing stop words and lemmatizing the input text before Entity Recognition.
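As an illustration of this step, the snippet below queries the public DBpedia Spotlight REST endpoint for one highlight. The endpoint URL and response fields follow Spotlight's documented JSON format; the confidence value, the helper name, and the example text are assumptions for the sketch.

```python
import requests

def spot_entities(text: str, confidence: float = 0.6) -> list[dict]:
    """Annotate a highlight with DBpedia Spotlight and return mention records."""
    resp = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    return [
        {
            "mention": r["@surfaceForm"],       # surface form in the text
            "entity": r["@URI"],                # linked DBpedia resource
            "offset": int(r["@offset"]),        # location in the text
            "confidence": float(r["@similarityScore"]),
            "types": r.get("@types", ""),       # DBpedia types, used for filtering
        }
        for r in resp.json().get("Resources", [])
    ]

mentions = spot_entities("Knowledge compounds when ideas are connected.")
```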
Upon extraction of the information regarding the mentions and corresponding entities, a mentions database is built. The database contains important information that is used in the filtering process, including mention surface form, entity name, entity DBpedia URL, location in the text, and confidence score.
The following step is filtering the mentions
database according to unwanted entities. This step
marks the transition from dealing with entities to deal-
ing with concepts. The term “concepts” refers to the
set of entities after the filtering process, which priori-
tizes conceptual entities.
Let H be a collection of selected highlights and M be a set of mention-entity occurrences, denoted (m, e), identified in the text corpus of the highlights, where m denotes the text fragment used to represent the entity e. Let also α(e) be the set of taxonomic categories that classify e, β(m, e) be the confidence of m denoting the entity e, and γ(e) be the frequency of the entity e in the texts, i.e., γ(e) = 1 means that the entity occurs in just one text t ∈ T.

The filtering expression for the set M is as follows:

α(e) ∩ L = ∅  ∧  β(m, e) > 0.6  ∧  γ(e) > 1   (4)

where L is a set of unwanted categories (such as the DBpedia categories /Film, /Band, /Magazine, and /TelevisionShow), α is a DBpedia-type filter, β is a confidence-level filter, and γ is a single-occurrence filter. There are user-defined parameters for each of these filters.
After the filtering process, a final concepts list, C, is defined with the remaining entities' DBpedia URLs.
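A direct translation of the filter in Equation 4 over the mentions database might look like the sketch below. The record fields match the hypothetical `spot_entities` output above, with the added assumption that each record also carries a `text_id` identifying the highlight it came from; the unwanted-category list and thresholds are user-defined parameters.

```python
UNWANTED = ("Film", "Band", "Magazine", "TelevisionShow")  # L: unwanted categories

def filter_concepts(mentions: list[dict], min_conf: float = 0.6) -> list[str]:
    """Keep entities that satisfy Equation 4 and return their DBpedia URLs."""
    # gamma(e): number of distinct texts that mention entity e
    texts_per_entity: dict[str, set] = {}
    for m in mentions:
        texts_per_entity.setdefault(m["entity"], set()).add(m["text_id"])

    concepts = set()
    for m in mentions:
        alpha_ok = not any(cat in m["types"] for cat in UNWANTED)  # alpha(e) ∩ L = ∅
        beta_ok = m["confidence"] > min_conf                       # beta(m, e) > 0.6
        gamma_ok = len(texts_per_entity[m["entity"]]) > 1          # gamma(e) > 1
        if alpha_ok and beta_ok and gamma_ok:
            concepts.add(m["entity"])
    return sorted(concepts)
```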
4.2.2 Concept Relationships
This section details the procedures to identify relation-
ships between the concepts in the graph. All entities
that passed the filtering process are regarded as con-
cepts from here onward.
This task consists of finding all relationships between the concepts contained in the graph G_H, given by the concepts list, C. To find all relationships between each possible pair of concepts, information is retrieved from two Knowledge Graphs, DBpedia (Lehmann et al., 2015) and ConceptNet (Speer et al., 2017). Queries are made to each knowledge graph to determine all relationships r ∈ R between each concept pair in the concepts list, C. The data is gathered by posting queries to both knowledge graphs and collecting the results. Queries and their results follow the triples format (subject, predicate, object).
(∀ c₁, c₂ ∈ C)(find all r ∈ R) such that:   (5)

r is the predicate p in a triple (c₁, p, c₂)   (6)

where R is the chosen set of relationships and (c₁, p, c₂) is the result triple with predicate p. The set of relationships considered, R, is composed of relations from DBpedia, R_D, and ConceptNet, R_C.
DBpedia was accessed to capture relations between concepts based on Wikipedia page links: a DBpedia relation holds when one concept's Wikipedia page links to the other's. These relations are the least specific and provide the lowest-value information.

DBpedia relations (R_D):
- dbo:wikiPageWikiLink (TO)
- dbo:wikiPageWikiLink (FROM)
- dbo:wikiPageWikiLink (BOTH)
ConceptNet was used to obtain commonsense
knowledge between concepts. Commonsense knowl-
edge presents more granular relationship types be-
tween concepts, providing richer information on how
concepts connect with one another.
ConceptNet relations (R_C):
Causality
- /r/Causes
- /r/CapableOf
- /r/MotivatedByGoal
Equivalency
- /r/Synonym
- /r/SimilarTo
Opposition
- /r/Antonym
- /r/DistinctFrom
Dependency
- /r/HasPrerequisite
- /r/HasContext
- /r/HasProperty
- /r/PartOf
General
- /r/IsA
- /r/RelatedTo
It is worth noting that all relationships are used as
bidirectional links between concepts. Technically, the
inverse of each relation is also captured.
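For reference, the sketch below shows one way to retrieve these relations: a SPARQL ASK query against the public DBpedia endpoint for wikiPageWikiLink, and a call to the ConceptNet REST API for the commonsense relations. The endpoint URLs are the public ones; the helper names and the example concept pair are illustrative.

```python
import requests
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_linked(c1: str, c2: str) -> bool:
    """ASK whether a wikiPageWikiLink exists between two DBpedia resources."""
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        ASK {{ {{ <{c1}> dbo:wikiPageWikiLink <{c2}> }}
              UNION {{ <{c2}> dbo:wikiPageWikiLink <{c1}> }} }}
    """)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["boolean"]

def conceptnet_relations(c1: str, c2: str) -> list[str]:
    """Return relation labels linking two ConceptNet nodes (either direction)."""
    resp = requests.get("https://api.conceptnet.io/query",
                        params={"node": c1, "other": c2}, timeout=30)
    resp.raise_for_status()
    return [edge["rel"]["@id"] for edge in resp.json()["edges"]]

print(dbpedia_linked("http://dbpedia.org/resource/Knowledge",
                     "http://dbpedia.org/resource/Wisdom"))
print(conceptnet_relations("/c/en/knowledge", "/c/en/wisdom"))
```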
4.3 Text Relatedness Connections
Text Relatedness Connections are direct connections
between highlights. They represent an important
bridge between ideas, creating direct pathways be-
tween highlights. This section details the creation of
connections between highlights using two different re-
latedness metrics: Knowledge-based Shared Concepts
Relatedness (kbr) using concepts; and Corpus-based
Semantic Relatedness (cbr). A final Relatedness Score
is achieved by combining the two metrics:
S(t₁, t₂) = ( kbr(t₁, t₂) + cbr(t₁, t₂) ) / 2   (7)

where S is the final relatedness score, kbr is the knowledge-based relatedness, and cbr is the corpus-based relatedness.
It is worth noting that recommendations to other highlight nodes are also divided between highlights from the same author and highlights from different authors. This is done to provide multiple options when navigating between highlight nodes.
4.3.1 Knowledge-Based Relatedness
Knowledge-based shared concepts relatedness be-
tween highlights is calculated according to the con-
cepts that are mentioned in each highlight’s text to-
gether with the relatedness between each concept.
kbr(h₁, h₂) = ( Σ_{c₁ ∈ h₁, c₂ ∈ h₂} rel(c₁, c₂) ) / ( |h₁| · |h₂| )   (8)

rel(c₁, c₂) = ( nr(c₁, c₂) + sr(c₁, c₂) ) / 2   (9)

where |h| is the number of concepts in highlight h, nr(c₁, c₂) is the numerical relatedness, and sr(c₁, c₂) is the shared relationship between concepts.
Numerical Relatedness is a quantitative score that
represents the direct semantic relatedness between con-
cepts, and Shared Relationship is a binary value that
signals the presence of any descriptive relation be-
tween the concepts.
The numerical relatedness score, nr(c₁, c₂), is a score that indicates the relatedness between two concepts, as defined by ConceptNet Numberbatch (Speer et al., 2017), which calculates word embeddings based on shared neighbors in the ConceptNet graph, with additional retrofitting using GloVe and word2vec embeddings. This score was further normalized so that most relatedness values lie between 0 and 1.
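A possible way to obtain nr(c₁, c₂) is sketched below, assuming the pre-trained English Numberbatch vectors in word2vec text format and gensim for loading. The normalization shown (clipping cosine similarity to [0, 1]) is one plausible choice, since the paper does not specify its normalization; the file name is the published 19.08 release.

```python
from gensim.models import KeyedVectors

# ConceptNet Numberbatch (English) ships in word2vec text format.
vectors = KeyedVectors.load_word2vec_format("numberbatch-en-19.08.txt.gz", binary=False)

def nr(c1: str, c2: str) -> float:
    """Numerical relatedness between two concept labels, clipped to [0, 1]."""
    if c1 not in vectors or c2 not in vectors:
        return 0.0  # unseen concepts contribute no relatedness
    return max(0.0, float(vectors.similarity(c1, c2)))

print(nr("knowledge", "wisdom"))   # roughly 0.44, in line with Table 1
print(nr("knowledge", "pricing"))  # negative cosine clipped to 0.0
```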
An example of the two concept-relatedness metrics is presented in Table 1, which shows the relatedness of the concept “Knowledge” to a selection of other concepts.
Table 1: Relatedness scores for the concept of “Knowledge”.

Concept         Numerical Relatedness   Shared Relationship
Information     0.460                   1
Wisdom          0.442                   1
Understanding   0.434                   1
Intelligence    0.332                   1
Learning        0.328                   1
Memory          0.181                   1
Thought         0.120                   1
Biology         0.063                   0
Innovation      0.017                   1
Nutrient        -0.007                  0
Pricing         -0.080                  0
Shared Relationship occurs when two concepts share one or more relationships with each other. A relationship happens when two concepts are part of a relation triple, (subject, predicate, object), extracted in Section 4.2.2, i.e., one concept appears in the subject position and the other in the object position, sharing a predicate (relation) between them.

The presence of a relationship between concepts is used to create a connection matrix, similar to what is proposed in Yazdani and Popescu-Belis (2013). A connection matrix is a binary matrix composed of only 0s and 1s; the value 1 represents that a relationship between two given concepts exists, and 0 that there is none.
sr(c₁, c₂) = max_{r ∈ R} 1[c₁ and c₂ are linked by r]   (10)

where 1[·] is the indicator function and R is the set of all relationships considered.
The relationship types R considered were detailed in Section 4.2.2 and were selected from DBpedia and ConceptNet. These relationships are applied in both directions, meaning that if concept A is related to concept B, then concept B is automatically related to concept A. This is done so that the measure satisfies the symmetry property of a distance metric (the others being non-negativity, the identity of indiscernibles, and the triangle inequality).
A Knowledge-based Shared Concept Relatedness Matrix is created by calculating the overall relatedness between each pair of highlights in the collection. The method to calculate the relatedness between two given highlights is based on the average connection strength between the concepts of each highlight's text. This is inspired by one of the features for document similarity using concepts presented in (Huang et al., 2012).

For each of the two metrics, the following procedure is performed. First, the concepts mentioned in each of the highlights are identified. Next, an average relatedness is calculated considering all the relationships between the concepts in highlight A and the concepts in highlight B; when a concept appears in both highlights, its relatedness score is 1. The final relatedness between the two highlights is the average of the two relatedness metrics, arriving at a unified average connection strength between the concepts present in each of the two texts.
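Under the definitions in Equations 8 and 9, the knowledge-based relatedness could be computed as in the sketch below. `nr_matrix` and `sr_matrix` stand for precomputed Numberbatch scores and the binary connection matrix (both hypothetical names); treating a concept shared by both highlights as relatedness 1 follows the procedure described above.

```python
def kbr(h1_concepts: list[str], h2_concepts: list[str],
        nr_matrix: dict, sr_matrix: dict) -> float:
    """Knowledge-based relatedness between two highlights (Equations 8-9)."""
    if not h1_concepts or not h2_concepts:
        return 0.0
    total = 0.0
    for c1 in h1_concepts:
        for c2 in h2_concepts:
            if c1 == c2:
                total += 1.0  # shared concept counts as full relatedness
            else:
                nr = nr_matrix.get((c1, c2), 0.0)  # numerical relatedness
                sr = sr_matrix.get((c1, c2), 0)    # binary shared relationship
                total += (nr + sr) / 2             # rel(c1, c2), Equation 9
    return total / (len(h1_concepts) * len(h2_concepts))  # Equation 8

# The final score of Equation 7 is then S = (kbr + cbr) / 2.
```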
4.3.2 Corpus-Based Semantic Relatedness
Corpus-based Semantic Relatedness is defined as the
similarity between the semantic meaning of two texts.
This relatedness metric is calculated using SBERT
Sentence Transformers (Reimers and Gurevych, 2019),
an adaptation that generates Sentence Embeddings
using the BERT architecture:
cbr(h₁, h₂) = cos(SB(h₁), SB(h₂))   (11)

where SB(h) is the text embedding for highlight h.
The process consists of two stages: (1) Encode the
highlights’ texts; (2) Create a Relatedness Matrix.
Encode the Text
To calculate the relatedness (or distance) between
two given highlights, it is necessary to encode each
highlight’s text into a vectorial representation. This
numerical format, a tensor, represents the semantic
meaning of text across multiple dimensions so that it
is possible to apply similarity metrics to compare two
highlights.
Sentence-BERT (SBERT) is a model trained only to generate embeddings, which means it contains only the encoding architecture and no decoder component. By focusing only on encoding the information and being fine-tuned for this specific task, the Sentence Transformers derived from Sentence-BERT are computationally very efficient and run in a fraction of the time needed to run the entire BERT architecture. SBERT also does not require the removal of stop words or lemmatization.
Sentence-BERT works by automatically adding a
pooling operation to the output of BERT. It performs
fine-tuning of a neural network architecture composed
of siamese and triplet networks to produce sentence
embeddings that are meaningful and can be compared
using similarity metrics.
Sentence-BERT outputs fixed-sized Sentence Em-
beddings,
SB(h)
, that can be easily explored to calcu-
late the relatedness between them. All the embeddings
generated with the same model checkpoint will be
compatible with one another. This does not depend on
the texts being encoded together in the same batch.
Creating a Relatedness Matrix
Once a vectorial representation, or a Text Embed-
ding, is generated for all highlights in the text collec-
tion, the relatedness between these vectors is calcu-
lated to generate a relatedness matrix.
The relatedness between texts is a generalization
of similarity and is the inverse of the distance between
two texts. Computing the relatedness between two
texts is equivalent to finding the similarity between the
embeddings representing each text.
To build a relatedness matrix, the cosine similarity
was calculated between the vectors representing all
pairs of highlights in the collection.
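Both stages can be expressed compactly with the sentence-transformers library, as in the sketch below. The model checkpoint named here is an assumption, since the paper does not state which checkpoint was used, and the example highlights are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

# Any SBERT checkpoint works; all-MiniLM-L6-v2 is a common lightweight choice.
model = SentenceTransformer("all-MiniLM-L6-v2")

highlights = [
    "Knowledge compounds when ideas are connected.",
    "Wisdom emerges from linking pieces of information.",
    "Pricing strategy depends on market conditions.",
]

# Stage 1: encode each highlight into a fixed-size sentence embedding SB(h).
embeddings = model.encode(highlights, convert_to_tensor=True)

# Stage 2: pairwise cosine similarity yields the relatedness matrix cbr (Eq. 11).
cbr_matrix = util.cos_sim(embeddings, embeddings)
print(cbr_matrix)
```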
4.4 Methodology End Result

The result of the methodology is an interconnection graph, G_H, with the proposed connections. For users to actively use this graph, the nodes and edges are respectively “translated” to pages and hyperlinks in the networked note-taking tool Obsidian (Li and Xu, 2020).

Creating pages and hyperlinks turns the graph into an accessible collection of highlights. Two figures briefly illustrate what this collection looks like. Figure 2 shows the Highlight Node page for an example highlight; it shows the different alternatives for navigation, where red rectangles take users to Concept Nodes, and blue ones take users to Highlight Nodes. Figure 3 presents the Concept Node page for an example concept, again demonstrating the options for navigation following the same key.

Figure 2: Example Highlight Node page and hyperlinks.
Figure 3: Example Concept Node page and hyperlinks.
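As an illustration of this translation step, the sketch below writes each graph node as a Markdown page in an Obsidian vault, using Obsidian's [[wikilink]] syntax so that every edge becomes a hyperlink (Obsidian resolves the backlinks automatically, giving both directions). The file layout and naming are assumptions.

```python
from pathlib import Path
import networkx as nx

def export_to_obsidian(g: nx.Graph, vault: Path) -> None:
    """Write one Markdown page per node; edges become [[wikilinks]]."""
    vault.mkdir(parents=True, exist_ok=True)
    for node, attrs in g.nodes(data=True):
        lines = [f"# {node}", ""]
        if attrs.get("kind") == "highlight":
            lines += [attrs.get("text", ""), ""]
        lines.append("## Connections")
        for neighbor in g.neighbors(node):
            edge_kind = g.edges[node, neighbor].get("kind", "")
            lines.append(f"- {edge_kind}: [[{neighbor}]]")  # Obsidian wikilink
        (vault / f"{node}.md").write_text("\n".join(lines), encoding="utf-8")

# export_to_obsidian(G, Path("MyVault/Highlights"))  # G from the earlier sketch
```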
5 EVALUATION
The objective of the evaluation is to analyze whether
the methodology implementation successfully solves
the defined problem. Two metrics were selected as
being necessary to consider the system a successful
implementation:
1. Accurate Graph Representation.
2. Coherence of Knowledge Connections.
To carry out the experiment, two different-sized subsets of the highlight collection were used. The original text collection T was composed of several different books. The selected highlights H were book passages collected using the Kindle digital reader for each book. The subsets of the text collection were two highlight selections:
The small test set had 52 highlights from two books, with connections that can be easily interpreted. This small test set makes it easier to analyze the fundamental functioning of the system in a limited scope, where the connections can be manually analyzed and interpreted.

The medium test set had 182 highlights from eight books, with connections that were less obvious to interpret, but still tractable for human interpretation. This test set was selected to see how the fundamental features of the system scale to a slightly larger highlights collection.
5.1 Accurate Graph Representation

Question 1: Are all of the node and edge types in the mathematical problem statement represented in the graph view of Obsidian?

This question checks whether all the node and edge types proposed in the problem statement are actually present in the final structure of the interconnection graph G_H.
The note-taking tool Obsidian, which is used to
navigate the generated connections, has a functionality
called the graph view, where it is possible to visualize
the generated graph. This view mode is further en-
hanced by an extension called Juggl, which makes it
possible to include additional information to identify
the node and edge types in the graph.
The chosen approach to answering this question
is to look individually at the local graph views rep-
resenting the three node types. The Highlight nodes
are represented by blue circles, the Concept nodes are
shown as red pentagons, and the Author nodes as green
triangles.
Being able to view the three node types individually is enough to confirm that the three node types, Highlight, Concept, and Author, are present in the graph representation. The question then translates into determining whether the six edge types defined in Equation 2 are present within the graph views for each node type.
Figure 4: Local graph view for an example Highlight Node.
When looking at the local graph view for the Highlight node of an example text, in Figure 4, it is possible to identify all three edge types involving Highlight nodes: HH, between blue circles; HC, between blue circles and red pentagons; and AH, between the green triangle and blue circles.
Figure 5: Local graph view for an example Author Node.
When looking at the local graph view for the Author node of Tiago Forte, in Figure 5, it is possible to identify the three edge types that involve Author nodes: AA, between the green triangles; AC, between the big green triangle and red pentagons; and once more AH, between the big green triangle and blue circles.
Figure 6: Local graph view for an example Concept Node.
Finally, when looking at the local graph view for the Concept node of “Wisdom”, in Figure 6, it is possible to identify the final edge type needed to conclude that the implementation has all the proposed elements: the edges CC, between the red pentagons. It is also possible to observe the edges HC, between blue circles and red pentagons.
After analyzing the local graph views for each node
type, it is possible to assert that all elements proposed
in the problem statement are present in the implemen-
tation.
5.2 Coherence of Knowledge
Connections
Question 2: What is the coherence between the two connection types, i.e., between the relatedness matrices generated by each method?
This evaluation determines if the generated knowl-
edge connections are appropriate and serve their in-
tended purpose.
The Coherence between Connections is defined as the coherence, or similarity, between the two types of proposed connections: Shared Concepts (knowledge-based) and Semantic Relatedness (corpus-based).
The metric for coherence between the two types of
connections is calculated by comparing their respec-
tive relatedness matrices using the Mantel Test (Man-
tel, 1967). The Mantel Test is a popular statistical test
that returns a measure of the correlation between two
matrices, ranging from -1 to 1.
The Mantel Test evaluates the similarity between
the matrices. If the connections proposed by the Se-
mantic Relatedness metric are similar to the ones pro-
posed by the Shared Concepts metric, then they are
coherent. A positive correlation between the two re-
latedness matrices implies that the two relatedness
metrics are similar and, thus, coherent with one an-
other.
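A self-contained version of this comparison is sketched below: a permutation-based Mantel test that correlates the upper triangles of the two relatedness matrices. This is a generic implementation of the test, not the authors' code, and the matrix names are illustrative.

```python
import numpy as np

def mantel_test(a: np.ndarray, b: np.ndarray,
                permutations: int = 9999, seed: int = 0):
    """Mantel correlation between two square matrices plus a permutation p-value."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(a, k=1)          # off-diagonal upper triangle
    r_obs = np.corrcoef(a[iu], b[iu])[0, 1]    # observed Pearson correlation
    count = 0
    n = a.shape[0]
    for _ in range(permutations):
        perm = rng.permutation(n)              # relabel rows/columns of a together
        r_perm = np.corrcoef(a[perm][:, perm][iu], b[iu])[0, 1]
        if abs(r_perm) >= abs(r_obs):
            count += 1
    p_value = (count + 1) / (permutations + 1)
    return r_obs, p_value

# coherence, p = mantel_test(kbr_matrix, cbr_matrix)
```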
A simple grid search is also performed when calculating the relatedness matrices, to find the best coherence while varying the parameter β(m, e), the confidence threshold for concept extraction.

Table 2: Grid search of coherence scores while varying the confidence threshold for Entity Extraction.

              Coherence Score
Confidence   Small Test Set   Medium Test Set
0.55         0.679            0.519
0.60         0.665            0.499
0.65         0.642            0.508
0.70         0.681            0.481
Following the proposed metric, all results in Table 2 present a positive correlation. According to (Swinscow and Campbell, 2002), correlations from 0.40 to 0.59 are considered moderate, and from 0.60 to 0.79, strong. This means that the relatedness metrics are strongly correlated for the small test set and moderately correlated for the medium test set.

By applying the significance test with P < 0.001, the correlation coefficients are considered highly statistically significant for all the obtained results. This signals that the two different approaches for generating connections are indeed coherent with one another.
The positive correlation between the different con-
nection types suggests that the methodology presented
throughout this paper is valid. The generated connec-
tions are quantitatively coherent with one another.
With regard to the grid search for the optimal confidence, considering both test sets, lower confidence thresholds when extracting concepts tend to lead to higher coherence between the relatedness metrics. This makes intuitive sense, because lower thresholds provide more concepts to work with, which increases the information available to the concept-based relatedness.

Regardless of the reason for the improvement, the confidence threshold of 0.55 presents the best overall coherence. Even so, the chosen default parameter for confidence was 0.60, because of the computational cost of running the system: more concepts mean a longer runtime, so running the methodology with a lower confidence threshold could lead to very long runtimes for larger highlight collections.
6 DISCUSSION
This section presents a discussion regarding the eval-
uation of the results and a reflection on the proposed
methodology, with special attention to the potential
use cases for this technology.
The proposed methodology was successfully realized in a functional version of the system: the final implementation corresponds to the proposed graph representation, and the connections can be considered coherent with one another.
6.1 Proposed Methodology Overview
Two questions are proposed to carry out additional
reflections on the implementation of the methodology.
1. Can the combination of NLP with networked note-taking tools improve the Knowledge Management functions of Recall, Elaboration, and New Insight?

2. Are Concept Nodes a useful mechanism for navigating a highlights collection?
The first question is directed at the inspirations
and reasons for proposing the methodology and im-
plementing it. The second question is directed at the
utility of introducing the Concept Nodes as means of
connecting and enriching a highlights collection.
When analyzing the proposed methodology
through the lens of the three intended functions of
Knowledge Visualization – Recall, Elaboration, and
New Insight – it is interesting to mention some charac-
teristics of the envisioned system.
The three functions share a common duality, which
is the presence of divergent and convergent aspects.
Each function has an element of divergence, of spread-
ing out wide and exploring new connections, while
also presenting the convergence element, of collapsing
to one tangible connection or event.
The function of Recall has the divergence of search-
ing for a specific item across a wide range of options
and eventually converging to the retrieval of the de-
sired item(s). Elaboration presents different options
for elaboration and eventually converges to one of the
possibilities to elaborate on a given topic. New Insight is the most divergent of the three functions, seeking mainly exposure to new ideas and possibilities that were not previously known. The convergent aspect of insights is when two or more ideas connect and form a new piece of knowledge.
The proposed methodology was inspired by these
three functions, which means it was designed to seek
divergence and convergence in a balanced way.
The two connection types used to navigate the highlights collection have different levels of divergence and convergence. The Concept Connections are directed at divergent thinking by uniting related ideas that use common concepts, with the important feature of not discriminating between ideas according to the general topic of the highlights. The convergent aspect of the Shared Concepts Connections comes from connecting different highlights through a common concept between them.

On the other hand, the Text Relatedness Connections seek to initially promote convergent thinking by generating connections between highlights that are indeed related to each other. In turn, they also present elements of divergent thinking, with the possibility of contrasting ideas from different authors.
The two modes of operation, divergence and convergence, are essential to the proposed system, as they play a part in both connection types and are designed into the system to simultaneously support the three functions of Recall, Elaboration, and New Insight.
Whenever possible, the chosen priority for the sys-
tem is to provide solid options for divergence and
exploration instead of convergence and precision. The
navigation mechanism provided by Concept Nodes is
considered to be a key factor for this divergence and
exploration.
The focus of the proposed methodology is to provide the structure for semi-organized divergent thinking, while also providing the tools and context for the human user to bring forth convergence on their own terms. This means humans are still responsible for interpreting connections and selecting the most relevant ones, something humans do inherently well.
The methodology promotes an active role for humans, with users actively co-creating connections and being an inherent part of the puzzle. This prevents users from becoming too passive, an important consideration in the development of AI technologies.
6.2 Use Cases for the Proposed
Technology
The methodology is supposed to be used by human
users and depends directly on user input. It is expected
that the user would deploy the methodology with a
specific outcome in mind, in the format of a tangible
project or even a specific reflection or contemplation
of ideas.
The first category of use cases is applying the
methodology for Personal Knowledge Management
(PKM). The idea behind a PKM System is to store
a person’s past knowledge in a safe place to be later
recycled and reused.
A potential combination would be integrating the interconnected highlights collection G_H into a given user's PKM System. This way, the user could access their highlights through the generated connections and use them as catalysts for retrieving and “applying” the knowledge contained in the highlights for several more granular use cases.
A clear use case for this combination would be
creation. This could be creating a specific deliverable
for work, a research project, and of course, the epitome
of creation, writing.
Another use case within the overarching use case
of Personal Knowledge Management is to help users
retrieve any piece of knowledge related to specific
texts, concepts, or topics in a fast and efficient way.
Another potential use case within PKM is learn-
ing. Generating connections automatically can help
users compare a new piece of information with pre-
viously acquired knowledge in an easier, faster, and
more powerful way.
A very interesting use case would be organizing knowledge from different topics. A specific example of this would be students organizing their notes across different disciplines using automatic connections between passages from digital textbooks, online articles, and personal notes.
A generalization applied to any non-student would
be to use the system to read and study multiple sources
at once, while easily navigating between the ideas
presented in them. By highlighting the passages that
present important ideas that the user wishes to ponder
about or study more deeply, the user would be able
to receive an interconnected version with connections
between the selected highlights.
The interconnected text collection may then be navigated and, most importantly, edited and incremented. Since the system is hosted inside a note-taking app, the user may easily use the software to create new notes and elaborate on the initially collected ideas.
A final, important use case for this technology is connecting ideas from different people. Using the system to connect and compare texts from a group of people is very aligned with what Doug Engelbart defines as the two most important aspects of the Collective IQ level (Engelbart, 1992). First, the process: how well a group develops, integrates, and applies its knowledge. Second, the assets produced by that process: how effective the group's shared repository of knowledge is and how easily information can be synthesized, stored, retrieved, and updated.
7 CONCLUSIONS
Recent advances in Natural Language Processing and Personal Knowledge Management present an opportunity to combine these two fields, specifically by providing Artificial-Intelligence-based features to note-taking tools, using the recently available functionalities as a starting point.
This paper proposes and successfully implements a methodology to automatically generate connections between highlights. The proposed methodology employs a combination of NLP tools to transform a given highlights collection, H, into an interconnected and navigable graph representing the same highlights, G_H.
Interconnectedness is present in the graph where
highlights become Highlight nodes, and connections
are added with multiple edge types and Concept Nodes.
In turn, navigation is added to the text collection us-
ing Obsidian, a note-taking tool that combines the
hierarchical organization of files and folders with the
networked organization of bidirectional hyperlinks.
The results and evaluation suggest that the end result of the proposed methodology is an adequate representation of the proposed graph, G_H, from the Problem Statement (Section 2). Finally, the two different paths for generating connections are coherent with one another, which suggests that the connections generated by the system are reliable.
REFERENCES
Ahrens, S. (2017). How to take smart notes: One simple technique to boost writing, learning and thinking. Sönke Ahrens.
Anand, R. and Kotov, A. (2015). An empirical comparison
of statistical term association graphs with dbpedia and
conceptnet for query expansion. In Proc. 7th forum for
information retrieval evaluation, pages 27–30.
Becker, M., Korfhage, K., and Frank, A. (2021a). COCO-
EX: A tool for linking concepts from texts to Concept-
Net. In Proc. 16th Conference of the European Chapter
of the Association for Computational Linguistics: Sys-
tem Demonstrations, pages 119–126.
Becker, M., Korfhage, K., Paul, D., and Frank, A. (2021b).
CO-NNECT: A Framework for Revealing Common-
sense Knowledge Paths as Explicitations of Implicit
Knowledge in Texts. In Proc. 14th International Con-
ference on Computational Semantics (IWCS), pages
21–32.
Blanco-Fernández, Y., Gil-Solla, A., Pazos-Arias, J. J., Ramos-Cabrer, M., Daif, A., and López-Nores, M. (2020). Distracting users as per their knowledge: Combining linked open data and word embeddings to enhance history learning. Expert Systems with Applications, 143:113051.
Canales, R. F. and Murillo, E. C. (2017). Evaluation of
Entity Recognition Algorithms in Short Texts. CLEI
Electronic Journal, 20(1):13.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019).
BERT: pre-training of deep bidirectional transformers
for language understanding. In Proc. 2019 Conference
of the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies, NAACL-HLT, pages 4171–4186.
Engelbart, D. C. (1992). Toward high-performance orga-
nizations: A strategic role for groupware. In Proc.
GroupWare, volume 92, pages 3–5. Citeseer.
Fang, S., Huang, Z., He, M., Tong, S., Huang, X., Liu,
Y., Huang, J., and Liu, Q. (2021). Guided Attention
Network for Concept Extraction. volume 2, pages
1449–1455. ISSN: 1045-0823.
Forte, T. (2022). Building a second brain: A proven method to organize your digital life and unlock your creative potential. Atria Books.
Gomaa, W. H. and Fahmy, A. A. (2013). A Survey of
Text Similarity Approaches. International Journal of
Computer Applications, 68(13):13–18.
Huang, L., Milne, D., Frank, E., and Witten, I. H. (2012).
Learning a concept-based document similarity mea-
sure. Journal of the American Society for Information
Science and Technology, 63(8).
Ilkou, E. (2022). Personal knowledge graphs: Use cases in
e-learning platforms. In Proc. WWW ’22: Compan-
ion Proceedings of the Web Conference 2022, page
344–348.
Leal, J. P., Rodrigues, V., and Queirós, R. (2012). Computing semantic relatedness using DBpedia. In Proc. 1st Symposium on Languages, Applications and Technologies. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kon-
tokostas, D., Mendes, P. N., Hellmann, S., Morsey, M.,
Van Kleef, P., and Auer, S. (2015). Dbpedia–a large-
scale, multilingual knowledge base extracted from
wikipedia. Semantic web, 6(2):167–195. Publisher:
IOS Press.
Li, J., Sun, A., Han, J., and Li, C. (2020). A Survey on
Deep Learning for Named Entity Recognition. IEEE
Transactions on Knowledge and Data Engineering,
34(1):50–70.
Li, S. and Xu, E. (2020). Obsidian [Computer software].
Mantel, N. (1967). The detection of disease clustering and
a generalized regression approach. Cancer research,
27(2 Part 1):209–220. Publisher: AACR.
Mendes, P. N., Jakob, M., García-Silva, A., and Bizer, C. (2011). DBpedia Spotlight: shedding light on the web of documents. In Proc. 7th International Conference on Semantic Systems, pages 1–8.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and
Dean, J. (2013). Distributed representations of words
and phrases and their compositionality. Advances in
neural information processing systems, 26.
Ni, Y., Xu, Q. K., Cao, F., Mass, Y., Sheinwald, D., Zhu,
H. J., and Cao, S. S. (2016). Semantic Documents Re-
latedness using Concept Graph Representation. In Proc.
9th ACM International Conference on Web Search and
Data Mining, pages 635–644.
Piao, G. and Breslin, J. G. (2015). Computing the semantic
similarity of resources in dbpedia for recommendation
purposes. In Joint International Semantic Technology
Conference, pages 185–200. Springer.
Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sen-
tence Embeddings using Siamese BERT-Networks. In
Proc. 2019 Conference on Empirical Methods in Natu-
ral Language Processing and the 9th International
Joint Conference on Natural Language Processing,
pages 3982–3992.
Resnik, P. (1995). Using information content to evaluate
semantic similarity in a taxonomy. In Proc. 14th Inter-
national Joint Conference on Artificial Intelligence -
Volume 1 – IJCAI’95, page 448–453.
SpazioDati (2012). Dandelion API | Semantic Text Analytics as a service.
Speer, R., Chin, J., and Havasi, C. (2017). Conceptnet 5.5:
An open multilingual graph of general knowledge. In
Proc. 31st AAAI Conference on Artificial Intelligence –
AAAI’17, page 4444–4451.
Swinscow, T. D. V. and Campbell, M. J. (2002). Statistics at square one. BMJ, London.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017).
Attention is all you need. In Proc. 31st International
Conference on Neural Information Processing Systems
– NIPS’17, page 6000–6010.
Yazdani, M. and Popescu-Belis, A. (2013). Computing text
semantic relatedness using the contents and links of a
hypertext encyclopedia. Artif. Intell., 194:176–202.