Table 2: Measures on the sets of poor clones and rich clones for the information retrieval task, and on rich clones for the classification task. Models followed by "*" are variants adapted to knowledge graphs.

                    Poor clones              Rich clones              Rich clones - classification
Model          MRR    Hit@1  Hit@100   MRR    Hit@1  Hit@100   Acc.   Precision  Recall  F1
GATv2          0.636  0.582  0.850     0.810  0.774  0.936     0.542  0.491      0.924   0.641
GAT            0.257  0.206  0.534     0.532  0.472  0.764     0.320  0.258      0.957   0.406
GATv2*         0.720  0.678  0.886     0.752  0.711  0.903     0.565  0.513      0.978   0.673
GraphSAGE      0.449  0.388  0.740     0.783  0.752  0.913     0.431  0.423      0.993   0.593
GraphSAGE*     0.484  0.432  0.764     0.820  0.793  0.928     0.431  0.423      0.993   0.593
HinSAGE        0.110  0.091  0.235     0.521  0.489  0.699     0.764  0.472      0.705   0.566
HGT            0.141  0.119  0.281     0.450  0.426  0.600     0.564  0.297      0.890   0.446
PIKA-simple    0.817  0.783  0.928     0.918  0.899  0.971     0.653  0.617      0.960   0.751
PIKA           0.856  0.828  0.941     0.959  0.949  0.988     0.786  0.669      0.952   0.786
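For reference, the retrieval measures reported in Table 2 follow the standard definitions of MRR and Hit@k over the 1-based rank of each true clone among the returned candidates. The sketch below is ours, not the paper's evaluation code:

```python
def retrieval_metrics(ranks, ks=(1, 100)):
    """Compute MRR and Hit@k from the 1-based rank of each true clone.

    ranks: one rank per query (rank of the searched entity in the results).
    """
    n = len(ranks)
    # MRR: mean of the reciprocal ranks.
    mrr = sum(1.0 / r for r in ranks) / n
    # Hit@k: fraction of queries whose true clone appears in the top k.
    hits = {k: sum(r <= k for r in ranks) / n for k in ks}
    return mrr, hits

mrr, hits = retrieval_metrics([1, 3, 1, 120, 2])
```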
relations. Since the model weights are not specific to nodes but to attributes and relationships, it is applicable to other graphs with the same types of attributes and edges.
For a more specific application, our method could easily be adapted to assign higher weights to the attributes of interest. By submitting pairs of clones with targeted modifications during training, the user could restrict similarity detection to those attributes.
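One possible reading of this pair-generation scheme is sketched below; the entity, attribute names, and perturbation function are all hypothetical, not taken from the paper. A clone pair is built by modifying exactly the attributes whose variations the model should learn to tolerate, so that only the remaining, targeted attributes drive similarity:

```python
def make_training_pair(entity, tolerated_attrs, perturb):
    """Build a (entity, clone) training pair.

    The clone differs from the entity only on the attributes whose
    variations should be tolerated; the model therefore learns to match
    entities based on the untouched, targeted attributes.
    """
    clone = dict(entity)  # shallow copy; attribute values are strings here
    for attr in tolerated_attrs:
        if attr in clone:
            clone[attr] = perturb(clone[attr])
    return entity, clone

# Illustrative entity: similarity should ignore changes to "mass".
entity = {"name": "Falcon 9", "mass": "549054", "operator": "SpaceX"}
orig, clone = make_training_pair(entity, ["mass"], lambda v: v + "0")
```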
By applying iterative training, we obtained a model that is more robust to ambiguous entity pairs. Nevertheless, embedding a complex entity prevents an efficient atomic comparison of its attributes (as performed in entity alignment methods), since the information is smoothed into a single vector. Still, the results showed that the searched entities, when not returned in first position, are almost always in the top 100 results. Given this, our model could naturally serve as a pre-filtering step on a large database, before performing an entity-by-entity comparison or applying a more accurate but slower method to the remaining candidates.
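This two-stage use can be sketched as follows. It is a toy implementation with plain cosine similarity over precomputed embedding lists; in practice, the model's entity embeddings and an approximate nearest-neighbour index would replace the brute-force search:

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def prefilter(query_vec, db_vecs, k=100):
    """Stage 1: fast embedding search, keep the k most similar entities."""
    scores = [(cosine(query_vec, v), i) for i, v in enumerate(db_vecs)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]

def find_duplicates(query_vec, db_vecs, slow_compare, k=100):
    """Stage 2: run the expensive entity-by-entity comparison only on
    the k candidates retained by the embedding-based pre-filter."""
    return [i for i in prefilter(query_vec, db_vecs, k) if slow_compare(i)]
```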
Finally, numerical values are not processed optimally by the encoder, since they appear in the database as strings. Text encoders such as USE or MPNet do not specifically handle numerical values: we believe dedicated processing of such attributes is required to reach acceptable quality on the clone classification task.
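A minimal sketch of what such dedicated processing could look like (our assumption, not the paper's method): attribute values that parse as numbers bypass the text encoder and receive a small numeric feature vector, so that nearby values such as 9.9 and 10.0 stay close in feature space instead of being embedded as unrelated strings:

```python
import math

def encode_attribute(value, text_encoder):
    """Encode an attribute stored as a string.

    Numeric strings get a dedicated numeric feature vector; everything
    else falls back to the text encoder (e.g. USE or MPNet embeddings).
    """
    try:
        x = float(value)
    except ValueError:
        return text_encoder(value)
    # Signed log scaling keeps very large and very small magnitudes
    # in a comparable range; the flag marks exact zeros.
    return [math.copysign(math.log1p(abs(x)), x), float(x == 0.0)]
```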
REFERENCES
Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and Yakhnenko, O. (2013). Translating embeddings for modeling multi-relational data. Advances in Neural Information Processing Systems, 26.

Brody, S., Alon, U., and Yahav, E. (2021). How attentive are graph attention networks? arXiv preprint arXiv:2105.14491.

Brunner, U. and Stockinger, K. (2020). Entity matching with transformer architectures: a step forward in data integration. In International Conference on Extending Database Technology, Copenhagen, 30 March-2 April 2020. OpenProceedings.

Cer, D., Yang, Y., Kong, S.-y., Hua, N., Limtiaco, N., John, R. S., Constant, N., Guajardo-Céspedes, M., Yuan, S., Tar, C., et al. (2018). Universal sentence encoder. arXiv preprint arXiv:1803.11175.

Dai, Y., Wang, S., Xiong, N. N., and Guo, W. (2020). A survey on knowledge graph embedding: Approaches, applications and benchmarks. Electronics, 9(5).

Daza, D., Cochez, M., and Groth, P. (2020). Inductive entity representations from text via link prediction. CoRR, abs/2010.03496.

Dettmers, T., Minervini, P., Stenetorp, P., and Riedel, S. (2018). Convolutional 2D knowledge graph embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence.

Dong, Y., Chawla, N. V., and Swami, A. (2017). metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 135–144.

Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3-5):75–174.

Gadek, G. (2019). From community detection to topical, interactive group detection in online social networks. In IEEE/WIC/ACM International Conference on Web Intelligence - Companion Volume, pages 176–183.

Hamilton, W. L., Ying, R., and Leskovec, J. (2017). Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 1025–1035.

Kipf, T. N. and Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Li, Y., Li, J., Suhara, Y., Doan, A., and Tan, W.-C. (2020). Deep entity matching with pre-trained language models. arXiv preprint arXiv:2004.00584.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Duplicate Detection in a Knowledge Base with PIKA