
signature. In the second step, we compute the mean vector µ along each dimension. Thirdly, we initialize PCA with the number of components given by the hyperparameter nc, subtract µ from each (normalized) vector signature in X, and compute the singular value decomposition, which yields principal components of the same vector length v̄. Subsequently, we project the input data set X onto the lower-dimensional principal components, resulting in the encoded |E| × nc matrix we define as Z. Next, we apply the decoder operation to generate X̂, which consists of the dot product between Z and the principal components plus the entity-wise addition of the mean µ. Lastly, we compute the mean-squared error (MSE) between X and X̂ and use these values as the entity scores (s_i). The time complexity of PCA is O(|E| · v̄² + v̄³). We refer to the
tutorial paper (Shlens, 2014) for more details on PCA.
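As a minimal sketch of this scoring step, the following assumes X is the |E| × v̄ matrix of normalized entity signatures and uses scikit-learn's PCA; the function name is illustrative and not part of our formal notation.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_entity_scores(X, nc):
    """Per-entity reconstruction error (MSE) used as scoping score s_i."""
    pca = PCA(n_components=nc)
    Z = pca.fit_transform(X)           # subtract mean, project onto nc components -> |E| x nc
    X_hat = pca.inverse_transform(Z)   # decode: dot product with components plus mean
    return np.mean((X - X_hat) ** 2, axis=1)  # one score per entity
```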
Autoencoders: Autoencoders are a special type of neural network trained to encode data into a meaningful representation and to decode it back to its original state. These models are considered self-supervised because the data serves as both training input and output. Similar to PCA, we follow the assumption that a trained autoencoder learns the relevant patterns of normally distributed entity signatures more efficiently than those of anomalous and unlinkable ones. Moreover, autoencoders with one latent layer and linear activation functions generalize PCA. In the following, we provide a summary of autoencoders in the context of anomaly detection using the reconstruction error for scoping. More details on autoencoders can be found in the survey by Bank et al. (2021).
We formally denote the encoder function as A(V(E) = X) ⇒ Z, mapping the set of normalized entity signatures into a latent lower-dimensional representation. The decoder function B(Z) ⇒ X̂ aims to transform the latent representation back into the original input. Both functions A and B are trained over a number of epochs ep in order to minimize the mean reconstruction error, converging to

    arg min_{A,B} [ MSE(X, B(A(X))) ]_ep .
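A minimal sketch of such an encoder-decoder pair, assuming Keras and the normalized signature matrix X; the layer sizes, activations, optimizer, and batch size are illustrative choices rather than our evaluated configuration.

```python
import numpy as np
import tensorflow as tf

def autoencoder_entity_scores(X, nc, ep):
    """Train A and B jointly on X (input == target) and return per-entity MSE scores."""
    v_bar = X.shape[1]
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(v_bar,)),
        tf.keras.layers.Dense(nc, activation="relu"),       # encoder A: latent bottleneck Z
        tf.keras.layers.Dense(v_bar, activation="sigmoid"), # decoder B: reconstruction X_hat
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, X, epochs=ep, batch_size=32, verbose=0)    # self-supervised: input equals output
    X_hat = model.predict(X, verbose=0)
    return np.mean((X - X_hat) ** 2, axis=1)                # one score per entity
```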
In contrast to PCA, normalization of the entity signatures not only simplifies the computation but also allows the use of non-linear activation functions. Both the encoder and decoder functions can therefore construct more expressive non-linear hyperplanes. At the same time, non-linear hyperplanes tend to overfit. For this reason, different types of regularization beyond the lower-dimensional bottleneck must be considered, depending on the number of entities |E|, the signature length v̄, and the degree of deviations. Possible configurations of the autoencoder include the network's depth or shallowness, the number of epochs, layers, and neurons, the activation functions, the optimization algorithm, the loss, and the validation sampling configuration. The computational time complexity depends on these architectural choices; a single O(·) bound therefore varies with each configuration. Due to the cost of backpropagating the weights of every neuron in each hidden layer over multiple epochs, we assume autoencoders to have a higher time complexity than the previously presented ranking methods. In the scoping context, we generally recommend preventing overfitting with regularization, as an overfitted model reconstructs anomalous signatures as well as normal ones and thus generates near-identical entity scores that are not useful for scoping.
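As an illustration of regularization beyond the bottleneck, the following Keras sketch combines an L2 weight penalty, dropout, and early stopping on a validation split; these specific layers and values are assumptions for demonstration, not the configuration used in our evaluation.

```python
import tensorflow as tf

def regularized_autoencoder(v_bar, nc):
    """Bottleneck autoencoder with illustrative regularization choices."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(v_bar,)),
        tf.keras.layers.Dense(nc, activation="relu",
                              kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
        tf.keras.layers.Dropout(0.1),
        tf.keras.layers.Dense(v_bar, activation="sigmoid"),
    ])

# Early stopping on a held-out validation split further limits overfitting, e.g.:
# model.fit(X, X, epochs=ep, validation_split=0.1,
#           callbacks=[tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)])
```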
4 EVALUATION
We evaluate the scoping approach on a real-world multi-sourced entity linkage dataset. We first present the performance metrics we use, then describe the dataset, elaborate on the chosen signature strategies, and detail the configuration of the ranking methods. All experiments were performed in Python Jupyter notebooks hosted on Google Colaboratory (https://colab.research.google.com). The dataset and code are available at https://github.com/leotraeg/scoping.
Performance Metrics: To measure the effectiveness of the algorithms for generating scoped entity collections E′ from the original collections E, we adopt typical metrics used in ER (a short computation sketch follows the list):
• Reduction Ratio (RR(E′, E)) reflects the time efficiency of scoping, independently of the ground-truth linkages. It expresses the reduction in the number of entity comparisons between the scoped entity collections and the original ones: 1 − ||B(E′_1, ..., E′_n)|| / ||B(E_1, ..., E_n)||.
• Pair Completeness (PC(E′, E)) estimates the number of potentially true entity linkages within the scoped entity collections with respect to the number of ground-truth entity linkages within the original entity collections: ||L(E′_1, ..., E′_n)|| / ||L(E_1, ..., E_n)||.
• Harmonic-Mean RR-PC (HM(E′, E)) represents a combined metric between the two competing objectives of reduction ratio and pair completeness: 2 · RR · PC / (RR + PC).
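A short computation sketch of the three metrics, assuming the comparison counts ||B(·)|| and ground-truth linkage counts ||L(·)|| have already been determined; the function and argument names are illustrative.

```python
def reduction_ratio(scoped_comparisons, original_comparisons):
    """RR: relative reduction in the number of entity comparisons."""
    return 1 - scoped_comparisons / original_comparisons

def pair_completeness(scoped_true_links, original_true_links):
    """PC: share of ground-truth linkages retained in the scoped collections."""
    return scoped_true_links / original_true_links

def harmonic_mean_rr_pc(rr, pc):
    """HM: harmonic mean of the two competing objectives."""
    return 2 * rr * pc / (rr + pc) if (rr + pc) > 0 else 0.0
```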
The threshold p affects the collections of scoped entities E′ in a major way. Knowing its value before-