PyResolveMetrics: A Standards-Compliant and Efficient Approach to
Entity Resolution Metrics
Andrei Olar and Laura Dioșan
Faculty of Mathematics and Computer Science, Babeș-Bolyai University, Romania
Keywords:
Entity Resolution, Metrics, Library, Open Source, Standards-Compliant, Theoretical Model, Efficiency.
Abstract:
Entity resolution, the process of discerning whether multiple data refer to the same real-world entity, is crucial
across various domains, including education. Its quality assessment is vital due to the extensive practical
applications in fields such as analytics, personalized learning or academic integrity. With Python emerging as
the predominant programming language in these areas, this paper attempts to fill in a gap when evaluating the
qualitative performance of entity resolution tasks by proposing a novel consistent library dedicated exclusively
for this purpose. This library not only facilitates precise evaluation but also aligns with contemporary research
and application trends, making it a significant tool for practitioners and researchers in the field.
1 INTRODUCTION
The field of entity resolution (ER), integral to nat-
ural language processing and emerging technologies
in education, is pivotal in understanding and link-
ing data across multiple sources. Some definitions view it as identifying and linking data from multiple sources (Qian et al., 2017). However, it has been argued that this identification and linking is a more specialized process (Talburt, 2011).
ER, also known as record linkage, data dedupli-
cation, merge-purge, named entity recognition, entity
alignment, and entity matching, plays a significant
role in educational communication and collaboration
tools, facilitating the information exchange between
parents, teachers and students. It can also be useful
to track the progression of alumni, by linking var-
ious data sources to provide meaningful insights on
the career trajectories of graduates. It is a key com-
ponent in AI literacy, particularly in those machine
learning tasks that involve identifying how disparate
pieces of information correlate to the same real-world
entity. ER itself has numerous implementations that
rely on machine learning and is important for AI lit-
eracy for that reason, too (Li et al., 2020). We rec-
ognize the importance of accurate and efficient ER in
both individual learning outcomes and broader soci-
etal impacts. Enhancing research integrity through
plagiarism detection or enabling personalized learn-
ing by tracking a student’s preferences across learning
domains might qualify as fields of study in their own
right. For example, the work of Chen et al. (2021) points out the importance of university and professional information for career exploration. Their sur-
vey highlights that matching the educational offering
to fit student goals leads to a higher chance of students
developing successful careers. One can envision sys-
tems that automatically create educational offerings
based on the profiles of students. In this scenario, ER
has the role of automatically building the student pro-
file from heterogeneous information sources. Another
use case for automatically generated profiles might be
career recommender systems, for example. The ideal
outcome from using the information stored in these
profiles would be finding the best possible match be-
tween educational offering and student aspirations.
What if ER provides us with a misleading pro-
file? At best, we realize this is the case, stop trust-
ing the ER system and revert to a state where we
don’t benefit from the information stored in the pro-
files generated through ER. At worst, we do not re-
alize the system error and proceed with career exploration based on misleading profiles. This leads to bad career recommendations and more severe risks related
to wasted time, financial misfortune, professional dis-
satisfaction, stagnation in personal development or
even health and relationship concerns. Adopting the
right ER system is desirable for obtaining the initial
benefits of making more informed decisions faster.
ER systems should not be adopted and cannot be
maintained without measuring the quality of their out-
comes.
In this context, the paper introduces a new Open
Source library that is hosted in a Git repository on
GitHub (PyResolveMetrics, 2023). The library offers implementations of well-known metrics for evaluating
ER, contributing significantly to metrics and evalua-
tion in educational technologies. Thus, this library
could have a role in advancing tools and methodolo-
gies in the realm of education and technology by al-
lowing a more informed process for developing tools
that make use of ER. The book (Schütze et al., 2008)
studies various methods for evaluating the perfor-
mance of information retrieval systems that help in
assessing how effective these systems are in search-
ing, identifying, and retrieving relevant information
from large datasets. The metrics revolve around the
notion of relevant and irrelevant information that is
retrieved by the system. It is asserted that what is rel-
evant is stipulated in a ground truth which is dependent upon an information need. ER systems partly
function as information retrieval systems, as they de-
termine whether multiple data points refer to the same
real-world entity. This capability to discern data iden-
tity within a context is the fundamental information
need of any ER system. Is it therefore fitting to use in-
formation retrieval metrics for entity resolution? This
seems to be the consensus drawn in the scientific lit-
erature as we shall see in the next section. Our li-
brary implements various information retrieval met-
rics adapted for ER.
It is also important to acknowledge the distinct en-
tity resolution models. The library sets itself apart by
organizing metrics based on their compatibility with
ER models, influenced by the underlying differences
in data structures that are characteristic of each model.
Special attention is given to interoperability and the
seamless integration of the library into the Python
programming language ecosystem. Its key features
are: embracing an open-source licensing model, efficient implementation using state-of-the-art libraries on a very popular platform, and a design that is
agnostic to the ER implementation.
After this introduction, we overview two existing
mathematical models for ER which are widely used
and still represent the state of the art. Then we go
through other work that relates to this paper. Subse-
quently, we present the new library and pay special
attention to the reasons for implementing it. We go
over the metrics that are implemented, the technolog-
ical and design choices that were made, an example
of using the library and a performance evaluation of
the functions implemented by the library. In the end
we offer some conclusions and present aspects that
require more work.
2 ENTITY RESOLUTION
MODELS
Fellegi-Sunter Model. In the late 1960s Ivan Fel-
legi and Alan Sunter wrote the seminal paper (Fellegi and Sunter, 1969) on record linkage, which would later become known as ER. To this day, their mathematical
model based on probability theory is the most popu-
lar way of formalizing the ER problem. In this math-
ematical model, ER is a function that aids in prob-
abilistic decision making. Here, the
process primarily involves comparing data from two
sources. The essential step is matching two items, one from each source, after which a decision is made
to categorize the match as a ‘link’, ‘non-link’, or ‘pos-
sible link’. Consequently, any matching algorithm un-
der this model typically returns pairs of items from
the original data sources, each tagged as one of these
categories. However, in practical applications today,
this process is often simplified to just returning a list
of pairs labeled as ‘links’. This intuitive explanation
gives us the structure of the input we can expect when
we use the Fellegi-Sunter ER model: an iterable se-
quence of pairs.
The metrics that are implemented with the Fellegi-
Sunter model of entity resolution in mind will accept
iterable sequences of pairs as input where the ground
truth and the result of the ER task are concerned.
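To make the expected input shape concrete, here is a minimal sketch with hypothetical record identifiers; the names are illustrative and not part of the library's API:

# Under the Fellegi-Sunter model, both the ground truth and the ER result are
# iterables of pairs, each pair linking one item from each data source.
ground_truth = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
er_result = [("a1", "b1"), ("a2", "b3")]  # one correct link and one spurious link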
Algebraic Model. The algebraic model for ER, ini-
tially conceived for assessing information quality in
large datasets (Talburt et al., 2007), was later refined to describe the ER process itself (Talburt, 2011). This
model treats ER as an algebraic equivalence relation
over a given input set, which can include data from as
many original sources as necessary. The unique as-
pect of this model lies in the characteristics of equiva-
lence relations (Halmos, 1960). These relations create
partitions over the input set, with each partition component corresponding to an equivalence class of the relation (Talburt, 2011). Conversely, a partition over a set
can also induce an equivalence relation. With this in
mind, evaluating the outcome of an ER task becomes
as easy as comparing two partitions: the partition that
induces the ideal equivalence relation (the gold stan-
dard or ground truth) to the partition that is produced
by the ER task.
The library supports a few metrics for comparing
partitions, all of which expect that a partition is repre-
sented as a list of sets.
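As a minimal illustration (with hypothetical element identifiers, not the library's API), a partition is simply a list of sets, each set being one equivalence class:

# Ground-truth partition and a hypothetical ER-produced partition over the same elements.
ground_truth_partition = [{"a1", "b1"}, {"a2", "b2"}, {"a3", "b3"}]
er_partition = [{"a1", "b1", "b2"}, {"a2"}, {"a3", "b3"}]  # "b2" was clustered incorrectly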
3 RELATED WORK
Measuring ER quality has been a subject of interest ever since the first paper on the subject surfaced (Newcombe et al., 1959). It speaks about ac-
curacy and contamination similarly to the current no-
tions of true and false positives. The fundamental the-
ory of record linkage(Fellegi and Sunter, 1969) of-
fers a probabilistic approach to evaluating the success
of an ER task. It suggests methods to affect the re-
sults through the selection of suitable thresholds for
defining success and failure. It also provides mecha-
nisms for properly weighting independent probability variables. The literature expands on these tech-
niques in subsequent papers (Winkler, 1990). Some
of the ER evaluation metrics that are a direct result
of this theoretical foundation include match accuracy,
match rate (Jaro, 1989), error rate estimation, rate of clerical disambiguation (Winkler, 1990) or relative distinguishing power of matching variables (Winkler, 2014). A lot of effort is spent on estimating and mea-
suring the effectiveness of blocking techniques to re-
duce the input size of the data set used for evaluation
purposes (Winkler, 1990; Jaro, 1989). Measuring ER
performance was and remains a computationally in-
tensive task.
Concerns about using accuracy and match rate are
also voiced (Goga et al., 2015). Thus, we see a shift to-
wards metrics used in the related field of information
retrieval. The probabilistic model for ER aligns well
with concepts such as true/false positives/negatives.
Given the extensive history of using ground truths to
assess entity resolution quality, there is a natural fit
for using information retrieval quality metrics. Most
literature on this topic focuses on using information
retrieval metrics where the order in which results are
retrieved is not relevant (Schütze et al., 2008).
Besides the original statistical model for ER, other
models have evolved from it or alongside it. The work
of the InfoLab at Stanford on their Stanford Entity
Resolution Framework (Benjelloun et al., 2009) and
that of the Center for Entity Resolution and Informa-
tion Quality at the University of Arkansas in Little
Rock (Talburt et al., 2007) stand out. These models of
ER also propose new metrics for evaluating ER qual-
ity (Menestrina et al., 2010; Talburt, 2011).
There is ample coverage of the metrics used
for ER in syntheses on the subject (Köpcke et al., 2010; Maidasani et al., 2012; Talburt, 2011). Clus-
tering metrics such as pairwise and cluster met-
rics (Menestrina et al., 2010; Huang et al., 2006) or the Rand index (Talburt et al., 2007) seem to be used
more frequently to measure ER quality as time passes.
Numerous systems to perform ER are available.
Some of them include modules to evaluate the per-
formance of a particular ER solution (Köpcke et al., 2009; Doan et al., 2020; University of Arkansas Lit-
tle Rock, 2012). There are also other Python packages
that implement some or all of the metrics provided by
our library (paulboosz, 2018; Virtanen et al., 2020).
4 PyResolveMetrics
In this context, yet another specialized library dedicated to computing ER metrics might seem redundant. This skepticism is rooted in
the expectation that Python, being a highly popu-
lar programming platform, should already offer high-
quality, reusable tools available for a wide range of
applications — including evaluating ER results.
Upon closer examination of the tools available for evaluating entity resolution tasks, the limits of this expectation become apparent. There
are indeed numerous libraries offering packages for
computing entity resolution metrics. However, us-
ing a general-purpose library like SciPy raises con-
cerns about interoperability and efficiency. This is
particularly relevant when the sole requirement is to
compute entity resolution metrics, and the additional
features of a comprehensive library are unnecessary.
The challenge of seamlessly integrating ER evalu-
ation routines into a custom-built project becomes
even more pronounced when attempting to use the
ones packaged with established ER systems (University of Arkansas Little Rock, 2012; Papadakis et al., 2017; Li et al., 2020; Doan et al., 2020).
Conversely, when specifically searching for li-
braries that only offer ER metrics, it becomes evident
that some of the essentials for effectively evaluating
ER tasks may be absent (paulboosz, 2018).
Approaching the issue from a different angle, us-
ing metrics from a general-purpose algorithmic li-
brary like SciPy (specifically scipy.metrics) for
ER evaluation requires strict adherence to certain de-
sign choices imposed by the library. For example,
to calculate the Rand index, data clusters must be
mapped to labels, and these labels must be provided
as input. While this might seem simple, the user-
friendliness of such an approach is debatable. The
complexity of adapting existing data and managing
the necessary labels for the package could potentially
rival the complexity of computing the Rand index it-
self, negating the benefit of using the package. Furthermore, ad-
ditional memory and compute time are also required
to perform the mapping between one's own data structures
and the ones required by the API contract.
In short, here are the reasons we chose to imple-
ment such a library:
- Architecturally, adhering to the principle of ‘do one thing and do it well’ is beneficial. This approach avoids the biases and dependencies of general-purpose libraries like SciPy, which can complicate integration into our custom-designed software.
- Historically, ER has adapted evaluation techniques from statistics, information retrieval, and graph theory, tailoring these methods to suit its specific needs. It seems desirable to standardize these methods into forms specific to ER.
- Currently, there appears to be no implementation that consolidates all the metrics useful for ER evaluation, as identified in the scientific literature, into a single, cohesive unit.
- Our work has a significant component of evaluating ER outcomes.
Our opinion is that using mathematical models
specific to ER is the best approach for guiding the
library’s design. Since each model significantly im-
pacts the data structures used in evaluation, the li-
brary’s functions are categorized based on the type
of input they support and, implicitly, by the mathe-
matical model they align with. There are a couple of
important assumptions that the library makes, regard-
less of the ER model. One such assumption is that the
quality of the ER output is always measured against a
ground truth (Schütze et al., 2008). The other assump-
tion is that the ground truth and the ER result are both
structured under the same mathematical model.
4.1 Supported Metrics
Statistical quality metrics, extensively detailed in
the literature (Schütze et al., 2008; Maidasani et al., 2012), are the most common method for measuring ER performance, as evidenced by their almost ubiquitous usage (Köpcke et al., 2009; Goga et al., 2015;
Li et al., 2020; Obraczka et al., 2021). These met-
rics are linked to the Fellegi-Sunter model for ER
which provides clear definitions of Type I and Type
II errors (Winkler, 1990). Type I and Type II errors
clarify the concepts of true positives, true negatives,
false positives, and false negatives as they are used in
entity resolution. Understanding these concepts ne-
cessitates referencing the M (matches) and U (non-
matches) sets as defined in the seminal paper on the
model.
Depending on the expected location of a pair pro-
duced by the entity resolution function, we define:
- true positives, as pairs predicted to be in M that should be in M,
- false positives, or type I errors, as pairs predicted to be in M but that should be in U,
- true negatives, as pairs predicted to be in U that should be in U, and
- false negatives, or type II errors, as pairs predicted to be in U but that should be in M.
Several metrics based on these concepts exist,
though the effectiveness of some has been ques-
tioned (Goga et al., 2015). With this in mind, we finally
define the three quality metrics that are supported by
our library:
\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \quad (1)

\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \quad (2)

F_1 \text{ Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \quad (3)
Precision (or the positive predictive value) is de-
fined as the number of correct predictions that were
made in relation to the total number of predictions
that were made. Recall (or sensitivity) is defined as
the number of correct predictions that were made in
relation to the total number of positive predictions
that could have been made (which corresponds to the
number of items in the ground truth). The F1 score is the harmonic mean of the precision and the recall, and it is used to capture the tradeoff between precision and recall (Maidasani et al., 2012).
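The following sketch shows how these three metrics can be computed from two collections of linked pairs; it is an illustration of the definitions above under the Fellegi-Sunter input convention, not the library's actual implementation:

from typing import Hashable, Iterable, Tuple

Pair = Tuple[Hashable, Hashable]

def precision_recall_f1(truth: Iterable[Pair], result: Iterable[Pair]) -> Tuple[float, float, float]:
    # Treat both inputs as sets of linked pairs; their intersection holds the true positives.
    truth_set, result_set = set(truth), set(result)
    true_positives = len(truth_set & result_set)
    precision = true_positives / len(result_set) if result_set else 0.0  # TP / all predicted links
    recall = true_positives / len(truth_set) if truth_set else 0.0       # TP / all true links
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1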
Algebraic metrics is the generic name we use for
‘cluster metrics’ (Rand, 1971; Maidasani et al., 2012) and ‘pairwise metrics’ (Maidasani et al., 2012; Menestrina et al., 2010) because their foundation is al-
gebraic and because they are linked to the algebraic
model for ER. Most of the algebraic metrics imple-
mented by the library are an exercise in using opera-
tions on sets, while the rest focus on matrix operations
with a dash of combinatorics:
- Pairwise metrics (precision, recall and F-measure) (Menestrina et al., 2010; Maidasani et al., 2012)
- Cluster metrics (precision, recall and F-measure) (Huang et al., 2006; Maidasani et al., 2012)
- Talburt-Wang Index (Talburt et al., 2007)
- Rand Index (Rand, 1971) and Adjusted Rand Index (Hubert and Arabie, 1985)
Their input arguments (the ground truth and the ER
result) are represented as partitions over the same set.
The Rand index is one of the first metrics used
to compare the similarity between two different data
clusterings. It quantifies the agreement or disagree-
ment between these clusterings by considering pairs
of elements.
\text{RandIndex} = \frac{a + b}{\binom{n}{2}} \quad (4)

The main components of the Rand index are as follows: a represents the number of times a pair of elements belongs to the same cluster across both clustering methods; b represents the number of times a pair of elements belongs to different clusters across both clustering methods; \binom{n}{2} denotes the number of unordered pairs in a set of n elements.
The Rand index always takes values in the [0, 1] interval.
A variation on the Rand Index is the Adjusted
Rand Index, which corrects for chance grouping of elements. It accounts for agreements between data clusterings that
occur due to chance (Yeung and Ruzzo, 2001). The
Adjusted Rand Index is calculated by using the fol-
lowing formula:
\text{ARI} = \frac{\text{RandIndex} - E}{\max(\text{RandIndex}) - E}, \quad (5)

where E is the expected value of the RandIndex. The Adjusted Rand Index is valued in the interval [-1, 1].
For a comprehensive understanding of the Adjusted
Rand Index and its calculation, we recommend the detailed work of Warrens and van der Hoef (2022). For both of these indexes, higher scores indi-
cate a closer alignment between the compared parti-
tions.
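Assuming both partitions cover the same element set, the Rand index and its adjusted variant can be computed by pair counting, as in the sketch below; this mirrors the formulas above but is not the library's own code:

from itertools import combinations
from math import comb

def rand_and_adjusted_rand(a_partition, b_partition):
    # Map every element to the index of its equivalence class in each partition.
    label_a = {e: i for i, cluster in enumerate(a_partition) for e in cluster}
    label_b = {e: j for j, cluster in enumerate(b_partition) for e in cluster}
    elements = list(label_a)
    n_pairs = comb(len(elements), 2)
    # a + b from equation (4): pairs grouped together in both partitions or separated in both.
    agreements = sum(
        1 for x, y in combinations(elements, 2)
        if (label_a[x] == label_a[y]) == (label_b[x] == label_b[y])
    )
    rand_index = agreements / n_pairs
    # Adjusted Rand Index via the usual pair-counting contingency table.
    sum_a = sum(comb(len(c), 2) for c in a_partition)
    sum_b = sum(comb(len(c), 2) for c in b_partition)
    sum_ab = sum(comb(len(ca & cb), 2) for ca in a_partition for cb in b_partition)
    expected = sum_a * sum_b / n_pairs
    max_index = (sum_a + sum_b) / 2
    ari = (sum_ab - expected) / (max_index - expected) if max_index != expected else 1.0
    return rand_index, ari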
A metric that attempts to approximate the Rand
Index is the Talburt-Wang Index, which counts the
number of overlapping subsets of two partitions over
the same input set. Assuming A and B are two parti-
tions over the same input set of elements, the Talburt-
Wang Index is given by the formula:
TW(A, B) = \frac{\sqrt{|A| \cdot |B|}}{\Phi(A, B)} \quad (6)

where \Phi(A, B) = \sum_{i=1}^{|A|} \left|\{ B_j \in B \mid B_j \cap A_i \neq \emptyset \}\right|.
This metric approximates the Rand Index with-
out requiring the expensive counting of true positives,
false positives, true negatives or false negatives (Tal-
burt et al., 2007). It is valued within the same interval
as the Rand Index.
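A sketch of this computation, assuming the square-root form of equation (6) and partitions represented as lists of sets (again, not the library's own implementation):

from math import sqrt

def talburt_wang_index(a_partition, b_partition):
    # Phi(A, B): count the class pairs from the two partitions that share at least one element.
    overlaps = sum(1 for a in a_partition for b in b_partition if a & b)
    return sqrt(len(a_partition) * len(b_partition)) / overlaps

Because it only counts non-empty intersections, this computation avoids the pair-level bookkeeping that the Rand index requires.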
Our library implements other popular metrics that
can be used for comparing partitions: pairwise pre-
cision, pairwise recall and their harmonic mean (the
pairwise F1 measure) (Maidasani et al., 2012).
If we have two partitions X and Y, the pairwise precision is given by the ratio of pairs that are in both sets to the total number of pairs of the reference set.

PP(X, Y) = \frac{|\text{Pairs}(X) \cap \text{Pairs}(Y)|}{|\text{Pairs}(X)|} \quad (7)

The pairwise recall is given by the ratio of pairs that are in both sets to the number of pairs in the comparison set (Maidasani et al., 2012).

PR(X, Y) = \frac{|\text{Pairs}(X) \cap \text{Pairs}(Y)|}{|\text{Pairs}(Y)|} \quad (8)
The pairwise F-measure is given by the harmonic
mean of the pairwise precision and pairwise recall.
PF = \frac{2 \cdot PP \cdot PR}{PP + PR} \quad (9)
The library computes partition metrics by iteratively
analyzing equivalence classes within each partition
generated by the ER equivalence relation and extract-
ing element pairs from each subset.
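A sketch of that procedure is shown below, treating the first argument as the ground-truth partition and the second as the ER result, with precision computed over the result's pairs and recall over the ground truth's pairs; this illustrates equations (7)-(9) and is not the library's code:

from itertools import combinations

def pairs_of(partition):
    # All unordered element pairs drawn from each equivalence class of a partition.
    return {frozenset(p) for cluster in partition for p in combinations(cluster, 2)}

def pairwise_metrics(truth_partition, result_partition):
    truth_pairs, result_pairs = pairs_of(truth_partition), pairs_of(result_partition)
    common = truth_pairs & result_pairs
    pp = len(common) / len(result_pairs) if result_pairs else 0.0  # pairwise precision
    pr = len(common) / len(truth_pairs) if truth_pairs else 0.0    # pairwise recall
    pf = 2 * pp * pr / (pp + pr) if pp + pr else 0.0               # pairwise F-measure
    return pp, pr, pf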
Finally, our library supports ‘cluster mea-
sures’ (Maidasani et al., 2012). Cluster precision is the
ratio of the number of completely correct clusters to
the total number of clusters resolved, whereas cluster
recall is the portion of true clusters resolved (Huang
et al., 2006). The harmonic mean of the cluster pre-
cision and cluster recall is typically called the cluster
F-measure. In this paragraph ‘clusters’ refer to the
equivalence classes of the entity resolution relation as
it is formalized in the algebraic model.
Given two partitions A and B, the cluster measures
are given by the following formulae:
CP(A, B) = \frac{|A \cap B|}{|A|} \quad (10)

CR(A, B) = \frac{|A \cap B|}{|B|} \quad (11)

CF = \frac{2 \cdot CP \cdot CR}{CP + CR} \quad (12)
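A sketch of the cluster measures under these definitions, where a resolved cluster counts as correct only if it matches a ground-truth cluster exactly (an illustration, not the library's implementation):

def cluster_metrics(truth_partition, result_partition):
    # Compare the two partitions as sets of (frozen) equivalence classes.
    truth_clusters = {frozenset(c) for c in truth_partition}
    result_clusters = {frozenset(c) for c in result_partition}
    correct = len(truth_clusters & result_clusters)
    cp = correct / len(result_clusters) if result_clusters else 0.0  # cluster precision
    cr = correct / len(truth_clusters) if truth_clusters else 0.0    # cluster recall
    cf = 2 * cp * cr / (cp + cr) if cp + cr else 0.0                 # cluster F-measure
    return cp, cr, cf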
4.2 Technology
The technology used for implementing our library is
described in the Appendix (Olar, 2024) available on-
line.
4.3 Example Usage
To provide a visual overview of the metrics provided by our library, we use a toy data set (Olar, 2023) containing near duplicates and the PPJoin (Xiao et al., 2011) entity matching algorithm to perform ER.
The PPJoin algorithm matches items by using prefix
lengths determined by a Jaccard similarity threshold t.
We split the data in the toy data set into two data
sets by column. The resulting data sets are:
- DG1: with ‘name’, ‘manufacturer’, ‘price’, ‘id’, and
- DG2: with ‘description’, ‘name’, ‘id’.
Because we have split the data column-wise, we
know exactly what the ground truth should be for each
of the metrics, assuming that each row in the original
toy data set refers to a distinct real-world entity. Be-
cause we are working with two data sets, the ground
truth for the statistical model is the same as the ground
truth for the algebraic model: a list of pairs of match-
ing items obtained by iterating over DG1 and DG2
using the same cursor.
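As an illustration of this construction (with toy identifiers standing in for the actual rows of DG1 and DG2), both ground-truth representations can be derived from the shared row order:

# Toy stand-ins for the row identifiers of DG1 and DG2, kept in the original row order.
dg1_ids = ["dg1-0", "dg1-1", "dg1-2"]
dg2_ids = ["dg2-0", "dg2-1", "dg2-2"]

truth_pairs = list(zip(dg1_ids, dg2_ids))           # Fellegi-Sunter ground truth: matching pairs
truth_partition = [{a, b} for a, b in truth_pairs]  # algebraic ground truth: one class per row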
All that’s left is to apply the PPJoin algorithm on
DG1 and DG2 and plot the values of the metrics pro-
vided by the library for values of t in the interval [0, 1)
at increments of 0.01. The plots that show the out-
come are available online in the accompanying Ap-
pendix (Olar, 2024).
4.4 Performance
CPU performance is usually evaluated by through-
put (e.g., millions of instructions per second, or MIPS). However, it is meaningless to compare throughput on
different CPU architectures (Jain, 1991).
A similar concern can be raised about memory
profiling in relation to the underlying operating sys-
tem. Due to the variability of the outcomes during ex-
perimentation and the fact that memory consumption is highly dependent on the operating system and
standard C library used for compiling and linking the
Python interpreter, we found memory profiling not to
provide great insights.
Under these circumstances, we have devised a method of profiling the CPU usage of the library which is agnostic to the underlying hardware. CPU profiling is useful in the context of judging the metrics provided by the library relative to one another.
To prevent the ER task from interfering with pro-
filing the metrics library, we run our experiment in
two stages. In the first stage we run the ER task and
store its result in a file along with the ground truth.
The second stage loads the results and ground truth
from the file and runs the entity resolution metrics
while profiling the CPU usage.
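A sketch of this two-stage setup using only standard-library tooling (the file name, the pickle format and the profiled call are illustrative assumptions; the paper does not prescribe specific tools):

import cProfile
import pickle
import pstats

# Stage 1: persist the ER result and the ground truth so the ER task itself
# does not show up in the metric profiles (partitions as in the earlier sketches).
with open("er_output.pkl", "wb") as fh:
    pickle.dump({"truth": ground_truth_partition, "result": er_partition}, fh)

# Stage 2: load the stored data and profile only the metric computation.
with open("er_output.pkl", "rb") as fh:
    data = pickle.load(fh)

profiler = cProfile.Profile()
profiler.enable()
pairwise_metrics(data["truth"], data["result"])  # stand-in for any library metric call
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)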
The performance analysis is available online in the
Appendix (Olar, 2024) to this article. Perhaps the
most important lesson to learn from profiling our li-
brary is that algebraic metrics are an order of magni-
tude more expensive to compute than statistical met-
rics. Moreover, not all algebraic metrics were created
equal: the Rand indexes are an order of magnitude
more expensive to compute than the other algebraic
metrics. Therefore, because the Talburt-Wang index
approximates the Rand index well, it might be a prefer-
able choice to measure how well an ER algorithm per-
forms clustering.
5 CONCLUSIONS AND FUTURE
WORK
We have introduced a library for evaluating ER re-
sults that is based on standards and Python protocols,
making it highly interoperable. The API exposed by
this library is deeply rooted in the mathematical mod-
els fundamental to ER, making it more familiar to ER
users.
The performance of the library is sound because
it externalizes computationally expensive tasks to na-
tive code. The accuracy of the implemented metrics
is verified automatically through unit tests.
These attributes render the library not only highly
beneficial but also low maintenance, making it an in-
valuable asset for ER. This does not preclude additional
work.
Missing Metrics. The ER models we have
touched upon support many more metrics that the
library does not currently implement. The work
by Maidasani et al. (2012) provides an insight-
ful overview. For well-rounded support of the ER
models mentioned so far, the library should imple-
ment at least additional cluster comparisons, such
as the Closest Cluster F1, the MUC F1, the B³ F1 and the CEAF F1 (Maidasani et al., 2012), and additional
Rand-like indexes (Warrens and van der Hoef, 2022).
Missing Models. Besides the models we have
covered herein, ER has been theorized to be a graph
problem (Obraczka et al., 2021) or an exercise in lat-
tice theory with an ordering relation based on “merge
dominance” (Benjelloun et al., 2009). More work is
required to distil the metrics that become available for
evaluating ER under those models and the data struc-
tures that are used in the evaluation process.
REFERENCES
Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q.,
Whang, S. E., and Widom, J. (2009). Swoosh: a
generic approach to entity resolution. The VLDB Jour-
nal, 18:255–276.
Chen, H., Liu, F., Wen, Y., Ling, L., Chen, S., Ling, H.,
and Gu, X. (2021). Career exploration of high school
students: Status quo, challenges, and coping model.
Frontiers in Psychology, 12.
Doan, A., Konda, P., Suganthan GC, P., Govind, Y.,
Paulsen, D., Chandrasekhar, K., Martinkus, P., and
Christie, M. (2020). Magellan: toward building
ecosystems of entity matching solutions. Communi-
cations of the ACM, 63(8):83–91.
Fellegi, I. P. and Sunter, A. B. (1969). A theory for record
linkage. Journal of the American Statistical Associa-
tion, 64:1183–1210.
Goga, O., Loiseau, P., Sommer, R., Teixeira, R., and Gum-
mandi, K. P. (2015). On the reliability of profile
matching across large online social networks. In On
the Reliability of Profile Matching Across Large On-
line Social Networks, Sydney.
Halmos, P. R. (1960). Naive set theory. van Nostrand.
Huang, J., Ertekin, S., and Giles, C. L. (2006). Efficient
name disambiguation for large-scale databases. In Eu-
ropean conference on principles of data mining and
knowledge discovery, pages 536–544. Springer.
Hubert, L. and Arabie, P. (1985). Comparing partitions.
Journal of classification, 2:193–218.
Jain, R. K. (1991). The Art of Computer Systems Perfor-
mance Analysis: Techniques for Experimental Design,
Measurement, Simulation, and Modeling, volume 1.
Wiley New York, 1 edition.
Jaro, M. A. (1989). Advances in record-linkage methodol-
ogy as applied to matching the 1985 census of tampa,
florida. Journal of the American Statistical Associa-
tion, 84(406):414–420.
Köpcke, H., Thor, A., and Rahm, E. (2009). Comparative
evaluation of entity resolution approaches with fever.
Proceedings of the VLDB Endowment, 2(2):1574–
1577.
Köpcke, H., Thor, A., and Rahm, E. (2010). Evaluation
of entity resolution approaches on real-world match
problems. Proceedings of the VLDB Endowment, 3(1-
2):484–493.
Li, Y., Li, J., Suhara, Y., Doan, A., and Tan, W.-C. (2020).
Deep entity matching with pre-trained language mod-
els. arXiv preprint arXiv:2004.00584.
Maidasani, H., Namata, G., Huang, B., and Getoor, L.
(2012). Entity resolution evaluation measures. Uni-
versity of Maryland, Tech. Rep.
Menestrina, D., Whang, S. E., and Garcia-Molina, H.
(2010). Evaluating entity resolution results. In Evalu-
ating entity resolution results.
Newcombe, H. B., Kennedy, J. M., Axford, S. J., and James,
A. P. (1959). Automatic linkage of vital records. Sci-
ence, 130(3381):954–959.
Obraczka, D., Schuchart, J., and Rahm, E. (2021). Ea-
ger: embedding-assisted entity resolution for knowl-
edge graphs. arXiv preprint arXiv:2101.06126.
Olar, A. (2023). Experiment data. https://github.com/matchescu/experiment-data. Online; accessed 25.10.2023.
Olar, A. (2024). PyResolveMetrics appendix. https://matchescu.github.io/py-resolve-metrics/article/02_appendix.pdf. Online; accessed 07.03.2024.
Papadakis, G., Tsekouras, L., Thanos, E., Giannakopoulos,
G., Palpanas, T., and Koubarakis, M. (2017). Jedai:
The force behind entity resolution. In The Semantic
Web: ESWC 2017 Satellite Events: ESWC 2017 Satel-
lite Events, Portoro
ˇ
z, Slovenia, May 28–June 1, 2017,
Revised Selected Papers 14, pages 161–166. Springer.
paulboosz (2018). entity-resolution-evaluation. https://github.com/entrepreneur-interet-general/entity-resolution-evaluation/README.md. Accessed 2023-09-22.
PyResolveMetrics (2023). PyResolveMetrics. https://github.com/matchescu/er-metrics. Online; accessed 26.11.2023.
Qian, K., Popa, L., and Prithviraj, S. (2017). Active learning
for large-scale entity resolution. In Active learning for
large-scale entity resolution, pages 1379–1388.
Rand, W. M. (1971). Objective criteria for the evaluation of
clustering methods. Journal of the American Statisti-
cal association, 66(336):846–850.
Schütze, H., Manning, C. D., and Raghavan, P. (2008). In-
troduction to information retrieval, volume 39. Cam-
bridge University Press Cambridge.
Talburt, J., Wang, R., Hess, K., and Kuo, E. (2007). An
algebraic approach to data quality metrics for entity
resolution over large datasets. In Information quality
management: Theory and applications, pages 1–22.
IGI Global.
Talburt, J. R. (2011). Entity resolution and information
quality. Elsevier.
University of Arkansas Little Rock, E. (2012). Oyster. https://bitbucket.org/oysterer/oyster/src/master/README.md. Accessed 2023-09-22.
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M.,
Reddy, T., Cournapeau, D., Burovski, E., Peterson,
P., Weckesser, W., Bright, J., et al. (2020). Scipy
1.0: fundamental algorithms for scientific computing
in python. Nature methods, 17(3):261–272.
Warrens, M. J. and van der Hoef, H. (2022). Understanding
the adjusted rand index and other partition compari-
son indices based on counting object pairs. Journal of
Classification, 39(3):487–509.
Winkler, W. E. (1990). String comparator metrics and en-
hanced decision rules in the fellegi-sunter model of
record linkage. Non-Journal.
Winkler, W. E. (2014). Matching and record linkage.
WIREs Computational Statistics, 6(5):313–325.
Xiao, C., Wang, W., Lin, X., Yu, J. X., and Wang, G.
(2011). Efficient similarity joins for near-duplicate
detection. ACM Transactions on Database Systems
(TODS), 36(3):1–41.
Yeung, K. Y. and Ruzzo, W. L. (2001). Details of the ad-
justed rand index and clustering algorithms, supple-
ment to the paper an empirical study on principal com-
ponent analysis for clustering gene expression data.
Bioinformatics, 17(9):763–774.