the encryption code is resizable, and all white spaces
are removed during encoding. The previous
comparison showed that the modified version of the
Spanish Phonetic Algorithm performed better in
terms of precision. In the present research,
however, we have implemented two more Spanish
encoding functions: the Spanish Metaphone
algorithm (Philips, 2000), (Mosquera, 2012), and a
second version of that algorithm, which assigns the
same code to similar sounds that arise from very
common misspellings.
The present paper is organized as follows. The
next section briefly explains the data matching
process. Section 3 explains the phonetic encoding
functions proposed in previous research, the
enhancements we have implemented on some of
them, and their role within the data matching
process. Section 4 presents the experiments carried
out and analyses the results. Finally, the last section
presents the conclusions on the performance of the
encoding functions and outlines the future work to
be done.
2 RELATED WORK
The data matching process is mainly concerned with
comparing records across databases in order to
determine whether a pair of records corresponds to
the same entity or not (Christen, 2012). It is also
called record linkage or de-duplication. In general
terms, this process consists of the following tasks:
A standardization process (Christen, 2012),
which refers to the conversion of input data from
multiple databases into a format that allows correct
and efficient record correspondence between two
data sources.
Phonetic encoding is a type of algorithm that
converts a string into a code that represents the
pronunciation of that string. Encoding the phonetic
sound of names avoids most of the problems caused
by misspellings or alternate spellings, which are very
common in low-quality data sources.
The indexing process aims to discard pairs of
records that are unlikely to correspond to the same
real-world entity, and to retain in the same block,
for comparison, those records that probably do
correspond; this reduces the number of record
comparisons. Record similarity depends on the data
types involved, because records can be
phonetically, numerically or textually similar. Some
of the methods implemented within our prototype
SEUCAD are, for instance, Soundex (Odell, 1918),
Phonex, Phonix (Christen, 2012), NYSIIS
(Borgman, 1992) and Double Metaphone (Philips,
2000).
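The blocking idea described above can be illustrated
with a minimal Python sketch. It is not the SEUCAD
implementation: the function names and the key
function toy_key are hypothetical, and the key is a
deliberately crude stand-in for a real phonetic code.

```python
from collections import defaultdict
from itertools import combinations


def toy_key(name):
    """Hypothetical blocking key (an assumption, not a real phonetic code):
    first letter plus the following consonants, cut to four characters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    tail = [c for c in letters[1:] if c not in "AEIOU"]
    return (letters[0] + "".join(tail))[:4]


def candidate_pairs(records, key=toy_key):
    """Group (id, name) records by key and pair records only inside a block."""
    blocks = defaultdict(list)
    for rec_id, name in records:
        blocks[key(name)].append(rec_id)
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


records = [(1, "Gonzalez"), (2, "Gonzales"), (3, "Martinez"), (4, "Gonsalez")]
# Comparing everything would need 6 pairs; blocking keeps only one here.
print(candidate_pairs(records))  # [(1, 2)]
```

Record 4 ("Gonsalez") falls into a different block
because the toy key does not treat S and Z as the same
sound, which is the kind of miss a phonetic key is
meant to reduce.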
Field and record comparison methods provide
degrees of similarity and define thresholds
depending on their semantics or data types. In the
prototype, the Q-gram, Jaro-Winkler distance
(Jaro, 1989), (Winkler, 1990) and longest common
substring comparison algorithms are already
implemented.
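As an illustration of a field comparison function, the
following Python sketch computes a q-gram similarity.
The text does not specify which q-gram variant the
prototype uses, so the choices here (unpadded bigrams
and the Dice coefficient) and the function names are
assumptions made only for this example.

```python
from collections import Counter


def qgrams(text, q=2):
    """Return the list of overlapping q-grams of text (no padding)."""
    text = text.lower()
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def qgram_similarity(a, b, q=2):
    """Dice coefficient over q-gram multisets, in the range [0, 1]."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a or not grams_b:
        return 0.0
    # Multiset intersection counts each shared q-gram at most as often
    # as it occurs in both strings.
    common = sum((Counter(grams_a) & Counter(grams_b)).values())
    return 2.0 * common / (len(grams_a) + len(grams_b))


# "gonzalez" and "gonzales" share 6 of their 7 bigrams each: 12 / 14.
print(round(qgram_similarity("Gonzalez", "Gonzales"), 3))  # 0.857
```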
The classification of pairs of records grouped
and compared during previous steps is mainly based
on the similarity values that were already obtained,
since it is assumed that the more similar two records
are, the more likely it is that they belong to the
same real-world entity. The records are classified
into matches, non-matches or possible matches.
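A threshold-based version of this three-way
classification could look like the Python sketch
below. The function name and the two threshold values
are hypothetical; they are not taken from the
prototype and would need to be tuned per data set.

```python
def classify(similarity, lower=0.60, upper=0.85):
    """Map a similarity in [0, 1] to one of three classes.

    The thresholds lower and upper are illustrative assumptions.
    """
    if similarity >= upper:
        return "match"
    if similarity >= lower:
        return "possible match"
    return "non-match"


for score in (0.95, 0.70, 0.20):
    print(score, classify(score))
```

Pairs that land between the two thresholds are the
ones typically left for manual review.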
The aim of the following section is to briefly
explain the phonetic encoding functions that we
have implemented and enhanced in order to quantify
and compare their performance during the record
linkage process.
3 PHONETIC ENCODING
PROPOSALS TO COMPARE
3.1 Phonetic Coding Functions
Phonetic encoding is a type of algorithm that
converts a string (generally assumed to correspond
to a name) into a code that represents the
pronunciation of that string. Encoding the phonetic
sound of names avoids most of the problems caused
by misspellings or alternate spellings, which are very
common in low-quality data sources.
3.2 Spanish Phonetic
The Spanish phonetic coding function compared in
the present document is a variation of the Soundex
algorithm. Soundex is a phonetic encoding algorithm
developed by Robert Russell and Margaret Odell
(Odell, 1918), and patented in 1918 and 1922. It
converts a word into a code (Willis, 2002). Soundex
keeps the first letter of a word and replaces the
remaining consonants with digits; if necessary, zeros
are added to the end to form a four-character code.
Soundex groups characters according to their place
of articulation in the English language.
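For reference, the classic English Soundex can be
sketched in Python as follows. This follows the
commonly described American Soundex rules and is not
the Spanish variation evaluated in this paper; the
function name is an assumption, and letters outside
A-Z (for example Ñ) are simply skipped in this sketch.

```python
# Digit groups of American Soundex, based on English articulation.
CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in group:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex code of word, or "" if empty."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]                  # the first letter is kept as is
    prev = CODES.get(letters[0], "")   # its digit still blocks a repeat
    for ch in letters[1:]:
        digit = CODES.get(ch, "")      # vowels, H, W and Y have no digit
        if digit and digit != prev:
            code += digit
        if ch not in "HW":             # H and W do not separate repeats
            prev = digit
    return (code + "000")[:4]          # pad with zeros, cut to 4 characters


print(soundex("Robert"), soundex("Rupert"))      # R163 R163
print(soundex("Gonzalez"), soundex("Gonzales"))  # G524 G524
```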
The limitations of the Soundex algorithm have
been extensively documented and have resulted in
several improvements, but none oriented to the
Fifth International Conference on Telecommunications and Remote Sensing