way that similar patterns are clustered together. The
patterns are thereby managed into a well-formed
evaluation that designates the population being
sampled (Winkler, W. E., 1990). Binary entity
resolution (ER) and cluster-based ER represent
different approaches to resolving entities in datasets.
In binary ER, the focus is on comparing individual
pairs of references from separate files to determine
equivalence. Unlike cluster-based ER, binary ER
does not use transitive closure; that is, it does not
automatically link references that are only indirectly
related. The output of binary ER consists of linked
pairs, with each pair representing a match between a
reference in one file and a reference in the other.
Cluster ER, by contrast, begins by identifying pairs of
references that match and then applies transitive
closure to group them into clusters.
The methodology for computing metrics in binary
Entity Resolution (ER) shares similarities with
cluster-based ER, yet there are notable distinctions.
The first key difference lies in the data sources:
cluster ER operates on a unified dataset or single file,
whereas binary ER is executed across two distinct
datasets. The second distinction pertains to the
application of the transitive closure principle; cluster
ER integrates this principle to identify related entities
across multiple records, while binary ER does not
incorporate transitive closure, focusing instead on
direct comparisons between the two datasets to help
identify pairs that represent the same entity. These
differences highlight the unique challenges and
considerations inherent to each ER approach.
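The two distinctions above can be illustrated with a minimal sketch. It is
not the implementation used in this work: the match rule (shared email or
shared phone), the record fields, and the function names binary_er and
cluster_er are all assumptions chosen for illustration. Binary ER compares
only across the two files and returns pairs, while cluster ER compares
within one file and merges matches with a union-find structure, which is
one common way to compute transitive closure.

```python
from itertools import combinations, product

def is_match(r1, r2):
    # Toy match rule (an assumption): same non-empty email or same non-empty phone.
    same_email = r1["email"] is not None and r1["email"] == r2["email"]
    same_phone = r1["phone"] is not None and r1["phone"] == r2["phone"]
    return same_email or same_phone

def binary_er(file_a, file_b):
    # Compare every A reference with every B reference; no transitive closure.
    return [(a["id"], b["id"]) for a, b in product(file_a, file_b) if is_match(a, b)]

def cluster_er(records):
    # Compare all pairs within a single file, then merge matches transitively.
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for r1, r2 in combinations(records, 2):
        if is_match(r1, r2):
            parent[find(r1["id"])] = find(r2["id"])

    clusters = {}
    for r in records:
        clusters.setdefault(find(r["id"]), set()).add(r["id"])
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical data: A1 matches B1 by email, B1 matches B2 by phone,
# but A1 and B2 share no field directly.
file_a = [{"id": "A1", "email": "jo@example.org", "phone": None}]
file_b = [
    {"id": "B1", "email": "jo@example.org", "phone": "555-0100"},
    {"id": "B2", "email": None, "phone": "555-0100"},
]

binary_links = binary_er(file_a, file_b)   # [("A1", "B1")]
clusters = cluster_er(file_a + file_b)     # [["A1", "B1", "B2"]]
```

In this example binary ER links A1 only to B1, whereas cluster ER over the
combined records also pulls in B2 through the indirect B1-B2 match.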
The initial objective of our approach is to identify the
cascading methods that can help uniquely identify
links between people, places, and things (e.g., a Social
Security Number for a person versus a Part Number
for equipment) in order to establish potential matches
(Fellegi, I. P., & Sunter, A. B., 1969). The groundwork
around finding the most apparent connections is
fundamental to any ER process. Following this, the
process employs a series of less stringent filters, such
as comparisons on name, address, date-of-birth, and
other demographics of a person. After direct pair
linking, indirect methods such as household
connections may be employed to find additional
links (Mohammed, O.K. et al, 2024). By
methodically narrowing down from the most accurate
identifiers to broader characteristics, the cascade
approach enhances the integrity and utility of data
linkage, making it an essential tool in efficient data
management and integration tasks. Employing a
tiered approach to implement cascading enables
precise linking and helps to ensure that only
equivalent pairs of references with a high confidence
are brought together.
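The tiered cascade described above can be sketched as follows. This is a
simplified illustration under stated assumptions, not the authors' pipeline:
the tier definitions (an exact SSN tier followed by a name plus date-of-birth
tier), the field names, and the rule of accepting a link only when exactly
one unlinked candidate shares the key are all hypothetical choices. Each
tier runs only on references that earlier, stricter tiers left unlinked.

```python
def ssn_key(record):
    # Tier 1 key (assumption): exact SSN, or None when missing.
    return record.get("ssn") or None

def name_dob_key(record):
    # Tier 2 key (assumption): normalized name plus date of birth.
    if not record.get("name") or not record.get("dob"):
        return None
    return (record["name"].strip().lower(), record["dob"])

def cascade_link(file_a, file_b, tiers):
    links, linked_a, linked_b = [], set(), set()
    for tier_name, key_fn in tiers:
        # Index the still-unlinked B references by this tier's key.
        index = {}
        for b in file_b:
            key = key_fn(b)
            if b["id"] not in linked_b and key is not None:
                index.setdefault(key, []).append(b["id"])
        for a in file_a:
            key = key_fn(a)
            if a["id"] in linked_a or key is None:
                continue
            candidates = [b_id for b_id in index.get(key, []) if b_id not in linked_b]
            if len(candidates) == 1:  # accept only unambiguous matches
                links.append((a["id"], candidates[0], tier_name))
                linked_a.add(a["id"])
                linked_b.add(candidates[0])
    return links

# Hypothetical records; the SSN value is fictitious.
file_a = [
    {"id": "A1", "ssn": "000-00-0001", "name": "Ann Lee", "dob": "1980-01-02"},
    {"id": "A2", "ssn": None, "name": "Bob Ray", "dob": "1975-05-06"},
    {"id": "A3", "ssn": None, "name": "Cy Doe", "dob": "1990-09-09"},
]
file_b = [
    {"id": "B1", "ssn": "000-00-0001", "name": "Ann L.", "dob": "1980-01-02"},
    {"id": "B2", "ssn": None, "name": "bob ray ", "dob": "1975-05-06"},
    {"id": "B3", "ssn": None, "name": "Dee Fox", "dob": "1990-09-09"},
]
tiers = [("ssn", ssn_key), ("name_dob", name_dob_key)]

links = cascade_link(file_a, file_b, tiers)
# [("A1", "B1", "ssn"), ("A2", "B2", "name_dob")]
```

Recording the tier name with each link keeps track of how strict the
evidence behind it was, which may be useful when links are reviewed later.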
Binary ER can support two types of pairwise
linking (Mohammed, O.K. et al, 2024):
One-to-one: One reference in file A matches one
reference in file B.
One-to-many: One reference in file A could match
more than one reference in file B, but each
reference in file B has at most one matching reference
in file A.
These different types of pairwise linking are
crucial for accurately capturing the complexity of
real-world relationships between data records. The
one-to-many case is particularly important when
duplicate records exist in the database, since it
ensures that all possible matches are identified. In
other cases, such as when generating credit files at a
credit score company, only one matching entity may
be sent to the user, so the system must handle
one-to-one matching with precision. By supporting
both levels of linkage, our Binary ER approach
ensures flexibility and precision in matching records,
which is essential for improving the overall accuracy
and effectiveness of the entity resolution process in
this work.
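The two linkage types can be contrasted with a short sketch. It assumes
that candidate pairs already carry a match score, and it resolves conflicts
by keeping the highest-scoring pair; both the scores and this greedy
selection rule are illustrative assumptions rather than the method used in
this work. The function names one_to_many and one_to_one are hypothetical.

```python
def one_to_many(scored_pairs):
    # Each B reference keeps only its best-scoring A reference,
    # but an A reference may be linked to several B references.
    best_for_b = {}
    for a, b, score in scored_pairs:
        if b not in best_for_b or score > best_for_b[b][1]:
            best_for_b[b] = (a, score)
    return sorted((a, b) for b, (a, _) in best_for_b.items())

def one_to_one(scored_pairs):
    # Greedy selection (an assumption): take pairs in descending score
    # order and skip any pair whose A or B reference is already linked.
    used_a, used_b, links = set(), set(), []
    for a, b, _ in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if a not in used_a and b not in used_b:
            links.append((a, b))
            used_a.add(a)
            used_b.add(b)
    return sorted(links)

# Hypothetical candidate pairs: (reference in A, reference in B, score).
pairs = [
    ("A1", "B1", 0.95),
    ("A1", "B2", 0.90),  # B2 may be a duplicate of B1 in file B
    ("A2", "B2", 0.60),
    ("A2", "B3", 0.80),
]

many = one_to_many(pairs)  # [("A1", "B1"), ("A1", "B2"), ("A2", "B3")]
one = one_to_one(pairs)    # [("A1", "B1"), ("A2", "B3")]
```

Here the one-to-many result keeps both B1 and B2 for A1, which suits the
duplicate-record scenario, while the one-to-one result returns a single
match per reference, as in the credit file scenario.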
2 LITERATURE REVIEW
Entity Resolution (ER), also known as record linkage
or deduplication, is a critical process in data
management, aimed at identifying and linking records
that refer to the same real-world entity across multiple
datasets. Over the years, various approaches have
been developed to address the challenges of ER,
ranging from traditional rule-based systems to more
advanced machine learning methods. One of the
foundational methods in ER is the Fellegi-Sunter
model, which introduced probabilistic techniques to
resolve records based on a set of matching criteria and
decision rules (Fellegi, I. P., & Sunter, A. B., 1969).
This model laid the groundwork for many subsequent
ER systems by formalizing the process of comparing
and linking records. However, traditional
probabilistic models often struggle with complex and
large-scale datasets, where the sheer volume of
records and variations in data can lead to inaccurate
matches and inefficiencies.