achieves good performance. According to (Kejriwal et al., 2021), the advantage of deep learning models for NER is that they do not normally require domain-specific resources such as lexicons or ontologies, and they can scale more easily without significant manual tuning.
For detecting entity names, in addition to supervised methods, there are semi-supervised and unsupervised methods. Regarding semi-supervised methods, the authors of (Kejriwal et al., 2021) point out that weak supervision has recently gained interest because it requires only a limited amount of human supervision, usually at the very beginning, when the system designer provides a starting set of seeds, which are then used to bootstrap the model.
The last group of ML-based methods is based on unsupervised learning and therefore does not require labeled texts during training to recognize entities. The goal of unsupervised learning is to build representations from data; typically, clustering algorithms are used to find similarities in the data during training (Eltyeb and Salim, 2014). Due to their simplicity, these algorithms are suitable for simple tasks (Bose et al., 2021) and are not popular for the NER task.
Other ML-based approaches use language models based on deep neural networks. Neural language models have made notable advances in several NLP tasks by improving their
performance and scalability. Pre-trained language
models like BERT are created with large amounts
of training data and are effective in automati-
cally learning useful representations and underly-
ing factors from raw data. In (Li et al., 2022), the
authors present the most representative methods
of deep learning for NER.
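Neural NER models such as fine-tuned BERT are typically trained as token classifiers that emit per-token labels (commonly in the BIO scheme). As a minimal illustration of the post-processing step such systems share, the sketch below decodes a BIO tag sequence into entity spans; the tokens, tags, and function name are illustrative, not taken from any of the cited works.

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    entities, current, cur_type = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # a new entity begins
            if current:
                entities.append((" ".join(current), cur_type))
            current, cur_type = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == cur_type:
            current.append(tok)             # continuation of the open entity
        else:                               # "O" tag or inconsistent "I-" tag
            if current:
                entities.append((" ".join(current), cur_type))
            current, cur_type = [], None
    if current:                             # flush a trailing entity
        entities.append((" ".join(current), cur_type))
    return entities

tokens = ["Marie", "Curie", "worked", "in", "Paris"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
entities = decode_bio(tokens, tags)
```

In a real pipeline the tag sequence would come from the language model's classification head rather than being given by hand.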
By combining supervised and unsupervised models in NER, we can leverage the advantages of each approach (Uronen et al., 2022).
• Rule-based systems. We can use a formal rule language to define the extraction rules for entities. The rules can be based on regular expressions or references to a dictionary, or we can reuse custom extractors. Two main types of rules are used: pattern-based rules, which depend on the morphological pattern of the words, and context-based rules, which depend on the context in which a word is used in the given text document (Dathan, 2021). This approach may be appropriate when the entity names of a certain type share a spelling pattern; for example, in general any university has the term "university" in its name.
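The university example above can be sketched as a single pattern-based rule; the regular expression and sample text below are illustrative assumptions, and a production system would combine many such rules with dictionary lookups.

```python
import re

# Pattern-based rule for university names: one or more capitalized words
# followed by "University", or "University of" followed by a capitalized word.
UNIVERSITY_RULE = re.compile(
    r"\b(?:[A-Z][a-z]+ )+University\b"
    r"|\bUniversity of (?:[A-Z][a-z]+)\b"
)

text = ("She studied at Stanford University before moving "
        "to the University of Oslo.")
# Only non-capturing groups are used, so findall returns full matches.
matches = UNIVERSITY_RULE.findall(text)
```

Context-based rules would instead inspect the surrounding words (e.g., "studied at ...") rather than the internal spelling of the name.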
2.2.2 Relation Extraction (RE)
When constructing KGs, EE is used to obtain the nodes of a knowledge graph, while relation extraction can be used to obtain the edges, i.e., the relationships that connect pairs of nodes (entities) in the graph. RE is therefore the problem of detecting and classifying relationships between entities extracted from the text, and it is a significantly more challenging task than NER (Kejriwal et al., 2021).
ML-based approaches include some supervised
and semi-supervised techniques:
1. Supervised RE methods require labeled data
where each pair of entity mentions is tagged with
one of the predefined relation types. According
to (Kejriwal et al., 2021), there are two kinds of
supervised methods: feature-based supervised RE
and kernel-based supervised RE. On one hand,
feature-based methods define the RE problem as
a classification problem. Namely, for each pair
of entity mentions, a set of features is generated,
and a classifier is used to predict the relation, often probabilistically. On the other hand, kernel-based supervised methods generally depend heavily on the features extracted from the mention pairs and the sentence; word embeddings can be used to add more global context. Kernel methods are based on the idea of kernel functions; common examples include the sequence kernel, the syntactic kernel, the dependency tree kernel, the dependency graph path kernel, and composite kernels.
2. Semi-supervised Relation Extraction. There are two motivations for using this type of method: (1) acquiring labeled data at scale is a challenging task, and (2) the large amounts of unlabeled data currently available on the web can be leveraged without necessarily requiring labeling effort. One semi-supervised, or weakly supervised, method is bootstrapping, which starts from a small set of seed relation instances and iteratively learns more relation instances and extraction patterns. Another paradigm is distant supervision, which uses a large number of known relationship instances in existing large knowledge bases to create a proxy for actual training data. Besides bootstrapping and distant supervision, other ML methods include active learning, label propagation, and multitask transfer learning (Kejriwal et al., 2021).
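To make the feature-based supervised RE setting in item 1 concrete, the sketch below builds a feature dictionary for a single entity-mention pair, the kind of input a relation classifier would consume. The specific features (head words, surface string between the mentions, token distance) and the example sentence are illustrative assumptions, not a prescribed feature set.

```python
def pair_features(tokens, e1_span, e2_span):
    """Illustrative features for one entity-mention pair.

    e1_span and e2_span are (start, end) token indices, end-exclusive,
    with e1 assumed to occur before e2 in the sentence.
    """
    _, e1_end = e1_span
    e2_start, e2_end = e2_span
    between = tokens[e1_end:e2_start]       # words separating the mentions
    return {
        "e1_head": tokens[e1_end - 1],      # last token of the first mention
        "e2_head": tokens[e2_end - 1],      # last token of the second mention
        "between": " ".join(between),       # lexical context between mentions
        "distance": e2_start - e1_end,      # token distance between mentions
    }

tokens = ["Marie", "Curie", "was", "born", "in", "Warsaw"]
feats = pair_features(tokens, (0, 2), (5, 6))
```

A classifier trained on such dictionaries (e.g., via one-hot encoding of the lexical features) would then predict one of the predefined relation types, such as born-in, for each pair.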
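The bootstrapping loop in item 2 can be sketched in a few lines: starting from seed pairs, induce the textual patterns that connect them, then apply those patterns to harvest new pairs. The toy corpus, the single located-in relation, and the simplistic "capitalized word" entity matcher are all assumptions made for illustration; real systems add confidence scoring to avoid semantic drift.

```python
import re

def bootstrap(corpus, seeds, rounds=2):
    """Toy bootstrapping for one relation from seed (e1, e2) pairs."""
    known = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # Step 1: induce connecting patterns from known pairs.
        for sent in corpus:
            for e1, e2 in known:
                m = re.search(re.escape(e1) + r"\s+(.+?)\s+" + re.escape(e2),
                              sent)
                if m:
                    patterns.add(m.group(1))
        # Step 2: apply patterns to extract new pairs
        # (here, naively, any capitalized word counts as an entity).
        for sent in corpus:
            for p in patterns:
                m = re.search(r"([A-Z]\w+)\s+" + re.escape(p)
                              + r"\s+([A-Z]\w+)", sent)
                if m:
                    known.add((m.group(1), m.group(2)))
    return known, patterns

corpus = [
    "Paris is located in France",
    "Berlin is located in Germany",
]
pairs, pats = bootstrap(corpus, {("Paris", "France")})
```

From the single seed ("Paris", "France"), the loop induces the pattern "is located in" and uses it to extract the previously unseen pair ("Berlin", "Germany").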
Besides supervised and semi-supervised methods,
there are other approaches to extract relationships be-
tween two named entities.
• Syntactic patterns or rule-based. This approach tries to discover a pattern for a new relation by collecting
Using NLP to Enrich Scientific Knowledge Graphs: A Case Study to Find Similar Papers