Approximate String Matching Techniques
Taoxin Peng¹ and Calum Mackay²
¹School of Computing, Edinburgh Napier University, 10 Colinton Road, Edinburgh, U.K.
²KANA, Greenock Road, Renfrewshire, U.K.
Keywords: String Matching, Data Quality, Record Matching, Record Linkage, Data Warehousing.
Abstract: Data quality is key to success for all kinds of businesses that involve information applications, such as data integration for data warehouses, text and web mining, information retrieval, and search engines for web applications. In such applications, matching strings is a common task. A number of approximate string matching techniques are available. However, one problem remains open: for a given dataset, how to select an appropriate technique and the threshold value that technique requires for the purpose of string matching. To address this problem, this paper analyses and evaluates a set of popular token-based string matching techniques on several carefully designed datasets. A thorough experimental comparison confirms the statement that there is no clear overall best technique. However, some techniques do perform significantly better in some cases. Some suggestions are presented, which can serve as guidance for researchers and practitioners selecting an appropriate string matching technique and a corresponding threshold value for a given dataset.
1 INTRODUCTION
Data quality is key to success for all kinds of businesses that involve information applications, such as data integration for data warehouse applications, text and web mining, and information retrieval and search engines for web applications. When data need to be integrated from multiple sources, one problem always arises: how to identify data records that refer to equivalent entities. This is called the duplicate detection problem. The "approximate string matching problem" is generally treated as a black-box function with some threshold parameter in the context of duplicate detection. In general, string matching deals with the problem of deciding whether two strings refer to the same entity.
String matching, also referred to as record matching or record linkage in the statistics community, has been a persistent and well-known problem for decades (Elmagarmid et al., 2007). For a brief review of the topic of record linkage, see the work by Herzog et al. (2010).
To deal with string matching, there are two main types of matching techniques: character-based and token-based. Character-based similarity techniques are designed to handle typographical errors well. However, typographical conventions often lead to the rearrangement of words, e.g., "John Smith" vs. "Smith, John". In such cases, character-based techniques fail to capture the similarity of the entities. Token-based techniques are designed to compensate for this problem (Elmagarmid et al., 2007).
Therefore, generally speaking, character-based similarity techniques are good for single-word problems, such as name matching, while token-based techniques suit matching strings of more than one word. For instance, for the e-commerce datasets used in (Köpcke et al., 2010), token-based techniques could be more appropriate, given the length of the fields.
This paper focuses on popular token-based techniques. Although these techniques are designed for this kind of problem, since there is no clear overall best string matching technique for all kinds of datasets (Christen, 2006), a question remains unanswered for researchers and practitioners: for a given dataset, how to select an appropriate technique and the threshold value that technique requires for the purpose of string matching. In the past decade, several researchers have tackled this problem (Bilenko et al., 2003; Christen, 2006; Cohen, Ravikumar and Fienberg, 2003; Hassanzadeh et al., 2007; Peng et al., 2012). However, none of them has carried out the
comprehensive analysis and comparison of token-based techniques that is done in this paper. The contributions of this paper are to review 12 popular token-based string matching techniques and, using sixty-three carefully designed datasets, to evaluate whether the following factors affect performance: the error rate in a dataset, the threshold value, the selected type of strings in a dataset, and the size of a dataset.
The rest of this paper is structured as follows. Related work is described in the next section. Section 3 introduces the techniques that will be examined. The main contribution of this paper is presented in Section 4, which describes the preparation of the datasets, the experiments, and the analysis and comparisons. Finally, this paper is concluded and future work pointed out in Section 5.
2 RELATED WORK
Bilenko et al. (2003) and Cohen et al. (2003) evaluated and compared a set of existing string matching techniques, including popular character-based, token-based and hybrid techniques. They claimed that Monge-Elkan performed best on average and SoftTFIDF performed best overall. However, their work did not consider the effect on performance of the error rate, the type of errors in a string, the token length in a string, or the size of a dataset. Besides, regarding the threshold value used for matching, their work only mentioned that a suitable threshold value was chosen, but not how it was chosen or whether this value was universal for all the techniques considered.
Christen (2006) thoroughly discussed the characteristics of personal names and the potential sources of variations and errors in them, evaluated a number of commonly used name matching techniques, considering given names, surnames and full names separately, and proposed nine useful recommendations for technique selection when dealing with name matching problems. In particular, the author pointed out the importance of choosing a suitable threshold value, arguing that selecting a proper threshold is a difficult task and that even small changes of the threshold can result in dramatic drops in matching quality. However, his focus was on personal name matching techniques. Also, similar to Cohen et al.'s (2003) work, the author did not consider any effect of the error rate, the token length in a string, or the size of a dataset on performance.
Hassanzadeh et al. (2007) presented an overview of several string matching techniques and thoroughly evaluated their accuracy on several datasets with different characteristics and common quality problems. Like this paper, that work focused on token-based string matching techniques. The effects of both the types and the amount of errors were considered; the error types included edit errors, token swaps and abbreviation replacements. Their experimental results showed that both the types and the amount of errors had a significant effect on performance. It was claimed that the threshold value used for the matching task would influence the individual performance of matching techniques. However, the token length in a string and the size of datasets were not considered.
Recently, Peng et al. (2012) presented an evaluation of techniques for name matching. The work considered a variety of factors that might affect the performance of such techniques, such as the error rate and the size of a dataset. Their preliminary experimental results confirmed that there is no overall clear best technique. The work claimed that the error rate in the dataset affects threshold values. However, they only considered character-based matching techniques.
3 STRING MATCHING
TECHNIQUES
String matching that allows errors is also called approximate string matching. The problem, in general, is to find the positions of a text where a given pattern occurs, allowing a limited number of "errors" in the matches (Navarro, 2001). Each technique uses a different error model, which defines how different two strings are.
In token-based similarities, two strings s and t can be converted into two token sets, where each token is a word. A similarity function sim() is used to decide whether strings s and t are similar or not, based on a given threshold value. In this paper, 12 token-based similarity techniques are considered. In the rest of this section, if no reference is given, the description of an algorithm's formula follows the work in (Hassanzadeh et al., 2007).
3.1 Notations
Given two relations $R = \{r_i : 1 \le i \le N_1\}$ and $S = \{s_j : 1 \le j \le N_2\}$, where $|R| = N_1$ and $|S| = N_2$, for a similarity function $sim()$, a pair of records $(r_i, s_j) \in R \times S$ is considered to be similar if $sim(r_i, s_j) \ge \theta$, where $\theta$ is a given threshold value.
3.2 Edit Similarity or Levenshtein
The Levenshtein distance (Navarro, 2001) is defined to be the minimum number of edit operations required to transform string $s_1$ into $s_2$. Edit operations are delete, insert, substitute and copy. It can be calculated by:

$$sim_{Leven}(s_1, s_2) = 1.0 - \frac{dist(s_1, s_2)}{\max(|s_1|, |s_2|)}$$

where $dist(s_1, s_2)$ refers to the actual Levenshtein distance function, which returns 0 if the strings are the same or a positive number of edits if they differ. The value of this measure lies between 0.0 and 1.0, where a larger value indicates greater similarity between the two strings.
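For illustration, the following is a minimal Python sketch of this measure (our own, not from any cited implementation), assuming unit cost for the insert, delete and substitute operations:

```python
def levenshtein_distance(s1: str, s2: str) -> int:
    """Dynamic-programming edit distance (insert/delete/substitute cost 1)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1           # "copy" vs "substitute"
            curr.append(min(prev[j] + 1,          # delete from s1
                            curr[j - 1] + 1,      # insert into s1
                            prev[j - 1] + cost))  # substitute / copy
        prev = curr
    return prev[-1]

def sim_leven(s1: str, s2: str) -> float:
    """sim = 1.0 - dist / max(|s1|, |s2|); 1.0 means identical strings."""
    if not s1 and not s2:
        return 1.0
    return 1.0 - levenshtein_distance(s1, s2) / max(len(s1), len(s2))

print(sim_leven("Colinton Road", "Colington Road"))  # ~0.93
```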
3.3 GES
The Gap Edit Similarity (GES) measure uses gap penalties (Chaudhuri et al., 2006). GES defines the similarity between two strings as the minimum cost required to convert $s_1$ to $s_2$ and is given by:

$$sim_{GES}(s_1, s_2) = 1 - \min\left(\frac{tc(s_1, s_2)}{wt(s_1)},\ 1.0\right)$$

where $wt(s_1)$ is the sum of weights of all tokens in $s_1$ and $tc(s_1, s_2)$ is the cost of a sequence of the following transformation operations (a minimal sketch follows this list):
- Token insertion: inserting a token $t$ into $s_1$ with cost $w(t) \cdot c_{ins}$, where $c_{ins}$ is the insertion factor constant, in the range between 0 and 1. In our experiments, $c_{ins} = 1$;
- Token deletion: deleting a token $t$ from $s_1$ with cost $w(t)$;
- Token replacement: replacing a token $t_1$ by $t_2$ in $s_1$ with cost $(1 - sim_{Leven}(t_1, t_2)) \cdot w(t)$.
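A minimal Python sketch of GES follows, assuming uniform token weights w(t) = 1 and c_ins = 1 (the original measure uses frequency-based token weights); it reuses sim_leven from the sketch in Section 3.2:

```python
def transformation_cost(tokens1, tokens2):
    """Token-level edit distance where replacing t1 by t2 costs 1 - sim_leven."""
    m, n = len(tokens1), len(tokens2)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)                        # token deletions, cost w(t)=1
    for j in range(n + 1):
        d[0][j] = float(j)                        # token insertions, w(t)*c_ins=1
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            repl = 1.0 - sim_leven(tokens1[i - 1], tokens2[j - 1])
            d[i][j] = min(d[i - 1][j] + 1.0,      # delete token from s1
                          d[i][j - 1] + 1.0,      # insert token into s1
                          d[i - 1][j - 1] + repl) # replace t1 by t2
    return d[m][n]

def sim_ges(s1: str, s2: str) -> float:
    t1, t2 = s1.split(), s2.split()
    wt = float(len(t1)) or 1.0                    # sum of (unit) token weights in s1
    return 1.0 - min(transformation_cost(t1, t2) / wt, 1.0)

print(sim_ges("10 Colinton Road", "10 Colington Rd"))
```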
3.4 Jaccard/Weighted Jaccard
Jaccard similarity is the fraction of tokens in $s_1$ and $s_2$ that are present in both strings. Weighted Jaccard similarity is the weighted version of Jaccard similarity. The similarity measure for two strings $s_1$, $s_2$ can be calculated by:

$$sim_{WJaccard}(s_1, s_2) = \frac{\sum_{t \in s_1 \cap s_2} w_R(t)}{\sum_{t \in s_1 \cup s_2} w_R(t)}$$

where $w_R(t)$ is a weight function that reflects the frequency of token $t$ in relation $R$:

$$w_R(t) = \log\left(\frac{n - n_t + 0.5}{n_t + 0.5}\right)$$

where $n$ is the number of tuples in relation $R$ and $n_t$ is the number of tuples in $R$ containing the token $t$.
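The following is a minimal, self-contained Python sketch of Weighted Jaccard over a small toy relation of our own (the weights stay positive here because no token occurs in more than half of the records):

```python
import math
from collections import Counter

# Toy relation R (our own example data); n_t counts the records containing t.
R = ["10 colinton road edinburgh", "12 colinton road edinburgh",
     "5 greenock road renfrewshire", "7 high street glasgow",
     "22 castle terrace stirling", "3 queen street aberdeen",
     "14 mill lane dundee", "9 park avenue perth"]
n = len(R)
n_t = Counter(t for rec in R for t in set(rec.split()))

def w_R(t: str) -> float:
    """Frequency-based token weight from above: rarer tokens weigh more."""
    return math.log((n - n_t[t] + 0.5) / (n_t[t] + 0.5))

def sim_wjaccard(s1: str, s2: str) -> float:
    t1, t2 = set(s1.split()), set(s2.split())
    num = sum(w_R(t) for t in t1 & t2)
    den = sum(w_R(t) for t in t1 | t2)
    return num / den if den else 0.0

print(sim_wjaccard(R[0], R[1]))  # shared rare tokens dominate: ~0.42
```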
3.5 TF-IDF
TF-IDF, term frequency-inverse document frequency, is a similarity measure widely used in the information retrieval community (Cohen, Ravikumar and Fienberg, 2003). The similarity measure for two strings $s_1$, $s_2$ can be calculated by:

$$sim_{TFIDF}(s_1, s_2) = \sum_{w \in s_1 \cap s_2} V(w, s_1) \cdot V(w, s_2)$$

together with:

$$V'(w, s) = \log(TF_{w,s} + 1) \cdot \log(IDF_w)$$

and

$$V(w, s) = \frac{V'(w, s)}{\sqrt{\sum_{w'} V'(w', s)^2}}$$

where $TF_{w,s}$ is the frequency of token $w$ in $s$, and $IDF_w$ is the inverse of the fraction of names in the corpus that contain $w$.
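A minimal Python sketch of this measure follows (our own, over the same kind of toy relation as above):

```python
import math
from collections import Counter

R = ["10 colinton road edinburgh", "12 colinton road edinburgh",
     "5 greenock road renfrewshire", "7 high street glasgow"]
n = len(R)
df = Counter(t for rec in R for t in set(rec.split()))  # document frequency

def tfidf_vector(s: str) -> dict:
    """V(w, s): log-scaled tf times log idf, L2-normalized over the string."""
    tf = Counter(s.split())
    v = {t: math.log(tf[t] + 1) * math.log(n / df.get(t, 1)) for t in tf}
    norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
    return {t: x / norm for t, x in v.items()}

def sim_tfidf(s1: str, s2: str) -> float:
    v1, v2 = tfidf_vector(s1), tfidf_vector(s2)
    return sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())

print(sim_tfidf(R[0], R[1]))  # ~0.35 on this toy corpus
```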
3.6 SoftTFIDF
SoftTFIDF is a hybrid similarity measure by Cohen et al. (2003), in which similar tokens are considered as well as the tokens in $s_1 \cap s_2$. The similarity measure for two strings $s_1$, $s_2$ can be calculated by:

$$sim_{SoftTFIDF}(s_1, s_2) = \sum_{w \in CLOSE(\theta, s_1, s_2)} V(w, s_1) \cdot V(w, s_2) \cdot D(w, s_2)$$

where $CLOSE(\theta, s_1, s_2)$ is the set of tokens $w \in s_1$ such that for some token $v \in s_2$, $sim'(w, v) > \theta$, where $sim'()$ is a secondary similarity function, and $D(w, s_2) = \max_{v \in s_2} sim'(w, v)$ for $w \in CLOSE(\theta, s_1, s_2)$. In the experiments, the Jaro-Winkler similarity function is used as the secondary similarity function, with $\theta = 0.9$.
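A minimal Python sketch follows; for self-containment it substitutes the normalized Levenshtein similarity from Section 3.2 for Jaro-Winkler as the secondary function (so the scores will differ from the configuration used in the experiments), and it reuses tfidf_vector from the previous sketch:

```python
def sim_soft_tfidf(s1: str, s2: str, theta: float = 0.9) -> float:
    v1, v2 = tfidf_vector(s1), tfidf_vector(s2)
    score = 0.0
    for w in v1:                                   # tokens of s1
        best, best_v = 0.0, None
        for v in v2:                               # find closest token of s2
            s = sim_leven(w, v)
            if s > best:
                best, best_v = s, v
        if best > theta:                           # w is in CLOSE(theta, s1, s2)
            score += v1[w] * v2[best_v] * best     # V(w,s1) * V(w,s2) * D(w,s2)
    return score

print(sim_soft_tfidf("10 colinton road", "10 colington road"))
```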
3.7 Cosine TF-IDF
Cosine TF-IDF is a well-established technique used in information retrieval; it represents the strings as vectors and calculates the similarity between the pattern and the string as the cosine of the angle between these vectors. The similarity measure, together with the definition of the vector weights, is:

$$sim_{Cosine}(s_1, s_2) = \sum_{t \in s_1 \cap s_2} w_{s_1}(t) \cdot w_{s_2}(t)$$

where $w_{s_1}(t)$ and $w_{s_2}(t)$ are normalized tf-idf weights for each common token in $s_1$ and $s_2$ respectively. The normalized tf-idf weight of token $t$ in a given string $s$ is defined as follows:

$$w_s(t) = \frac{tf_s(t) \cdot idf(t)}{\sqrt{\sum_{t' \in s} \left(tf_s(t') \cdot idf(t')\right)^2}}$$

where $tf_s(t)$ is the term frequency of token $t$ within string $s$ and $idf(t)$ is the inverse document frequency with respect to the entire relation $R$.
3.8 BM25
BM25 is an efficient probabilistic ranking function, widely used in information retrieval, that scores the similarity of strings by combining token frequencies with document frequency statistics. The similarity measure for two strings $s_1$, $s_2$ can be calculated by:

$$sim_{BM25}(s_1, s_2) = \sum_{t \in s_1 \cap s_2} \hat{w}_{s_1}(t) \cdot w_{s_2}(t)$$

where

$$\hat{w}_{s_1}(t) = \frac{(k_2 + 1) \cdot tf_{s_1}(t)}{k_2 + tf_{s_1}(t)}$$

$$w_{s_2}(t) = \log\left(\frac{n - n_t + 0.5}{n_t + 0.5}\right) \cdot \frac{(k_1 + 1) \cdot tf_{s_2}(t)}{K(s_2) + tf_{s_2}(t)}$$

and

$$K(s) = k_1 \left((1 - b) + b \cdot \frac{|s|}{avg_s}\right)$$

where $tf_s(t)$ is the frequency of the token $t$ in string $s$, $|s|$ is the number of tokens in $s$, $avg_s$ is the average number of tokens per record in relation $R$, $n_t$ is the number of records containing the token $t$, and $k_1$, $k_2$ and $b$ are independent parameters, with $k_1 \in [1, 2]$, $k_2 = 8$ and $b \in [0.6, 0.75]$.
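A minimal Python sketch of this measure follows (our own toy relation; k1 = 1.2 is an assumed value within the quoted range):

```python
import math
from collections import Counter

R = ["10 colinton road edinburgh", "12 colinton road edinburgh",
     "5 greenock road renfrewshire", "7 high street glasgow",
     "22 castle terrace stirling", "3 queen street aberdeen",
     "14 mill lane dundee", "9 park avenue perth"]
n = len(R)
n_t = Counter(t for rec in R for t in set(rec.split()))
avg_len = sum(len(rec.split()) for rec in R) / n

k1, k2, b = 1.2, 8.0, 0.75   # k1 assumed within [1, 2]; k2 = 8, b within range

def sim_bm25(s1: str, s2: str) -> float:
    tf1, tf2 = Counter(s1.split()), Counter(s2.split())
    K = k1 * ((1 - b) + b * len(s2.split()) / avg_len)
    score = 0.0
    for t in tf1.keys() & tf2.keys():
        # Note: on a tiny corpus this idf can go negative for common tokens.
        idf = math.log((n - n_t[t] + 0.5) / (n_t[t] + 0.5))
        w1 = (k2 + 1) * tf1[t] / (k2 + tf1[t])          # query-side weight
        w2 = idf * (k1 + 1) * tf2[t] / (K + tf2[t])     # document-side weight
        score += w1 * w2
    return score

print(sim_bm25(R[0], R[1]))
```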
3.9 Hidden Markov
The Hidden Markov Model (HMM) approach assigns a language model to each individual document, and the probability that one string would be generated is measured against the other string. The following formula gives the HMM score used in this experiment:

$$sim_{HMM}(s_1, s_2) = \prod_{t \in s_1} \left(a_0 P(t \mid GE) + a_1 P(t \mid s_2)\right)$$

where $a_0$ and $a_1 = 1 - a_0$ are the transition state probabilities of the Markov model, and $P(t \mid GE)$ and $P(t \mid s_2)$ are determined by:

$$P(t \mid GE) = \frac{\sum_{s \in R} tf_s(t)}{\sum_{s \in R} |s|}, \qquad P(t \mid s_2) = \frac{tf_{s_2}(t)}{|s_2|}$$
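A minimal Python sketch of the HMM score follows (our own toy relation; the mixing weight a0 = 0.2 is an assumption, not a value from the paper):

```python
from collections import Counter

R = ["10 colinton road edinburgh", "12 colinton road edinburgh",
     "5 greenock road renfrewshire", "7 high street glasgow"]
corpus_tf = Counter(t for rec in R for t in rec.split())
corpus_len = sum(corpus_tf.values())

def sim_hmm(s1: str, s2: str, a0: float = 0.2) -> float:
    a1 = 1.0 - a0
    tf2 = Counter(s2.split())
    len2 = len(s2.split())
    score = 1.0
    for t in s1.split():
        p_ge = corpus_tf.get(t, 0) / corpus_len          # P(t | GE)
        p_s2 = tf2.get(t, 0) / len2 if len2 else 0.0     # P(t | s2)
        score *= a0 * p_ge + a1 * p_s2
    return score

print(sim_hmm("colinton road", R[0]))  # ~0.05 on this toy corpus
```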
3.10 WHIRL
The similarity of two document vectors $\vec{v}$ and $\vec{w}$ is usually interpreted as the cosine of the angle between $\vec{v}$ and $\vec{w}$. The value of $sim_{WHIRL}(\vec{v}, \vec{w})$ is always between zero and one and is given by the following formula (Cohen, 2000):

$$sim_{WHIRL}(\vec{v}, \vec{w}) = \sum_{t} \frac{v_t \cdot w_t}{\|\vec{v}\| \cdot \|\vec{w}\|}$$

where $v_t$ is related to the "importance" of the term $t$ in the document represented by $\vec{v}$, and $w_t$ is related to the relevance of term $t$ in the document represented by $\vec{w}$; $\|\vec{v}\|$ and $\|\vec{w}\|$ are the lengths of the vectors $\vec{v}$ and $\vec{w}$. Two documents are similar when they share many "important" terms (Cohen, 2000).
3.11 Affine Gap
Some sequences are much more likely to have a big
gap, rather than many small gaps. For example, a
biological sequence is much more likely to have one
big gap of length 10, due to a single insertion or
deletion event, than it is to have 10 small gaps of
length 1. Affine gap penalties use a gap opening
penalty, o, and a gap extension penalty, e. A gap of
length l is then given a penalty o + (l-1)e. So that
gaps are discouraged, o is almost always negative.
Because a few large gaps are better than many small
gaps, e, though negative, is almost always less
negative than o, so as to encourage gap extension,
rather than gap introduction.
In the Affine Gap Penalty model, a gap is given a weight $W_g$ to "open the gap" and another weight $W_s$ to "extend the gap". The formula for the model is:

$$W(q) = W_g + q \cdot W_s$$

where $q$ is the length of the gap, $W_g$ is the weight to "open the gap" and $W_s$ is the weight to "extend the gap" with one more space (Vingron and Waterman, 1994).
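As a worked illustration (with assumed weights W_g = -5 and W_s = -1), one gap of length 10 is penalized far less than ten gaps of length 1, since the opening weight is paid only once:

```python
W_g, W_s = -5.0, -1.0   # assumed open-gap and extend-gap weights (both negative)

def gap_weight(q: int) -> float:
    """Total weight of a single gap of length q: W_g + q * W_s."""
    return W_g + q * W_s

print(gap_weight(10))        # one gap of length 10: -15.0
print(10 * gap_weight(1))    # ten gaps of length 1: -60.0
```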
3.12 Fellegi-Sunter
Fellegi-Sunter compares the recorded characteristics and values in two records and makes a decision based on whether or not the comparison pair represents the same object or event (Fellegi and Sunter, 1969).
Decisions are made according to three types, referred to as $A_1$ (a link), $A_2$ (a possible link) and $A_3$ (a non-link). The score is based on the probabilities of realizing a comparison vector $\gamma$ among matched ($M$) and unmatched ($U$) pairs:

$$m(\gamma) = P(\gamma \mid (a, b) \in M), \qquad u(\gamma) = P(\gamma \mid (a, b) \in U)$$

where $m(\gamma)$ and $u(\gamma)$ are the probabilities of realizing and discovering the comparison vector $\gamma$ for matched and unmatched pairs respectively; these probabilities are summed over the total space $\Gamma$ of all possible realizations.
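The following Python sketch illustrates the standard Fellegi-Sunter decision logic under the usual conditional-independence simplification; all probabilities and cut-offs here are assumed values for illustration, not from the original paper:

```python
import math

m = {"street": 0.95, "city": 0.90, "postcode": 0.98}  # P(agree | match), assumed
u = {"street": 0.05, "city": 0.20, "postcode": 0.01}  # P(agree | non-match), assumed

UPPER, LOWER = 3.0, -3.0    # assumed log-likelihood-ratio cut-offs

def decide(gamma: dict) -> str:
    """gamma maps field -> True/False agreement; returns the decision type."""
    log_ratio = 0.0
    for field, agrees in gamma.items():
        if agrees:
            log_ratio += math.log(m[field] / u[field])
        else:
            log_ratio += math.log((1 - m[field]) / (1 - u[field]))
    if log_ratio >= UPPER:
        return "A1 (link)"
    if log_ratio <= LOWER:
        return "A3 (non-link)"
    return "A2 (possible link)"

print(decide({"street": True, "city": True, "postcode": False}))  # A2
```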
4 EXPERIMENTS AND
EVALUATION
In our experiments, we focus on the performance of the 12 popular token-based string matching techniques described above. The datasets designed cover a wide range of characteristics, such as the level of "dirtiness", the token length in a string, the size of the dataset and the type of errors. It is expected that the experimental results will show that all these characteristics have a significant effect on performance and on optimal threshold values.
4.1 Datasets Preparation
In the absence of common benchmark datasets for data cleaning, we prepared our experimental data as follows.
The datasets used are based on real Electoral Roll data. First, a one-million-record dataset was extracted, from which an address list was created, including House number, Street, City, County and Postcode. This list contains 10000 clean, non-duplicate addresses, with an ID associated with each record.
Erroneous records were introduced by applying the following five operations manually to the address fields of records: inserting, deleting, substituting and replacing characters, and swapping tokens. The replacement type of error includes abbreviation errors, such as replacing St. with Street or vice versa. The level of dirtiness of a dataset is divided into three levels: Low, Medium and High. Static percentages are used as a guideline for this classification: for low-error datasets, the percentage was set to 20%; for medium, 50%; and for high, 80%.
The dataset sizes considered were 500, 1000, 2000, 4000, 8000 and 10000 records. The average record length was approximately 8.2 words per record. The errors in the datasets were distributed uniformly. Experiments have also been run on datasets generated using different parameters. For each size, three datasets were generated with mixed types of errors, each with a different error rate. This contributes eighteen datasets with mixed types of errors.
In addition, datasets with a single type of error were also considered in our experiments. Insertion and deletion errors were each incorporated solely into their own datasets, and the algorithms were measured against each other to see how each technique performed. These error types were distributed amongst the data as follows: insertion errors were applied to a single character per record, for the percentage of records corresponding to the error rate being analyzed. For example, in the low-error dataset of 1000 records, 20% of the records had a single "insertion" error applied; that is, 200 of those 1000 records had a character "added" to simulate an error. There were 18 datasets incorporating only insertion errors.
Deletion followed a similar approach for each error rate, taking a single character away from the number of records corresponding to the respective percentage error rate. There were 18 such datasets as well.
Token lengths were measured for the 8000-record datasets with the "low", "medium" and "high" error rates. Three different token lengths were used: "Short", "Medium" and "Long". For the short tokens, the only fields considered for each record were Street and City. The medium length added County, and the long length added Postcode, which restored the original size of the record. The average length for short was 2.1 tokens per record, for medium 5.6, and for long 8.2. In total, therefore, there are sixty-three datasets for the experiment.
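The paper's error injection was performed manually; the following Python sketch (entirely our own) automates a comparable procedure for a given error rate:

```python
import random

ABBREV = {"street": "st.", "road": "rd."}   # assumed abbreviation table

def corrupt(record: str, rng: random.Random) -> str:
    """Apply one random error: char insert/delete/substitute, abbreviation
    replacement, or token swap."""
    tokens = record.split()
    op = rng.choice(["insert", "delete", "substitute", "replace", "swap"])
    if op == "swap" and len(tokens) > 1:
        i = rng.randrange(len(tokens) - 1)
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
        return " ".join(tokens)
    if op == "replace":
        return " ".join(ABBREV.get(t, t) for t in tokens)
    i = rng.randrange(len(record))
    c = rng.choice("abcdefghijklmnopqrstuvwxyz")
    if op == "insert":
        return record[:i] + c + record[i:]
    if op == "delete":
        return record[:i] + record[i + 1:]
    return record[:i] + c + record[i + 1:]            # substitute (fallback)

def make_dirty(records, error_rate: float, seed: int = 0):
    """Corrupt a fraction of records, e.g. error_rate = 0.2, 0.5 or 0.8."""
    rng = random.Random(seed)
    k = int(len(records) * error_rate)
    dirty_ids = set(rng.sample(range(len(records)), k))
    return [corrupt(r, rng) if i in dirty_ids else r
            for i, r in enumerate(records)]

print(make_dirty(["10 colinton road edinburgh"] * 5, error_rate=0.5))
```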
4.2 Measures
A target string is a positive if it is returned by a technique; otherwise it is a negative. A positive is a true positive if the match does in fact denote the same entity; otherwise it is a false positive. A negative is a false negative if the unmatched pair does in fact denote the same entity; otherwise it is a true negative.
We evaluate the matching quality using the F1-measure, which is based on precision and recall:

$$F_1 = \frac{2PR}{P + R}$$

with $P$ (precision) and $R$ (recall) defined as:

$$P = \frac{|TP|}{|TP| + |FP|}, \qquad R = \frac{|TP|}{|TP| + |FN|}$$

where $TP$, $FP$ and $FN$ denote the sets of true positives, false positives and false negatives respectively.
Clearly, a trade-off between recall and precision exists: if most targets are matched, recall will be high but precision will be low; conversely, if precision is high, recall will be low. The F1-measure is a way of combining recall and precision into a single measure of overall performance (Rijsbergen, 1979). In our experiments, precision, recall and F1-measure are measured against different values of the similarity threshold θ. For the comparison of different techniques, the maximum F1-measure score across different thresholds is used, computed as sketched below.
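A minimal Python sketch of this evaluation loop, where sim can be any of the similarity functions sketched in Section 3 and gold is the set of truly matching pairs:

```python
def best_f1(pairs, gold, sim, thresholds=None):
    """pairs: candidate (s1, s2) tuples; gold: set of truly matching pairs.
    Returns (max F1, optimum threshold)."""
    thresholds = thresholds or [i / 100 for i in range(0, 101, 5)]
    best = (-1.0, 0.0)
    scores = [(p, sim(*p)) for p in pairs]      # score each pair once
    for theta in thresholds:
        matched = {p for p, s in scores if s >= theta}
        tp = len(matched & gold)                # true positives
        fp = len(matched - gold)                # false positives
        fn = len(gold - matched)                # false negatives
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        best = max(best, (f1, theta))
    return best
```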
Figure 1: Optimum threshold value for different techniques on datasets with three different error rates.
Figure 2: Maximum F1-score for different techniques on datasets of 8000 records with three different error rates.
Figure 3: Maximum F1-score for different techniques on datasets of 8000 records with three different token lengths, associated with medium error rates.
Figure 4: Maximum F1-score for different techniques on datasets of 10000 records having only insertion errors, with three different error rates.
4.3 Results
In this section, the test results on the sixty-three carefully designed datasets are analysed and evaluated. Results show that, in general, the accuracy (the best F1-score) is not significantly sensitive to the size of a dataset, given appropriate threshold values.
4.3.1 Effect of Error Rate on Threshold
Values
Figure 1 shows the optimum threshold values for all 12 techniques on datasets with three different error rates. The results confirm that the level of "dirtiness" in a dataset has a significant effect on threshold selection. The figure shows that the higher the error rate in the dataset, the lower the threshold value required to achieve the maximum F1-score, except in the case of BM25. For example, W/Jaccard achieves the maximum F1-score on datasets with a low error rate at a threshold value of 0.85, and with medium and high error rates at threshold values of 0.57 and 0.47 respectively. However, BM25 achieves the maximum F1-score on datasets with a high error rate at a threshold value of 0.60, but with a medium error rate at a threshold value of 0.59.
4.3.2 Effect of Error Rate on Performance
Experimental results show that the "dirtiness" of a dataset has a great effect on overall performance. As an example, Figure 2 shows the maximum F1-scores for the 12 different techniques on datasets of size 8000 with three different error rates. It shows that for datasets with a low error rate, HMM, BM25 and Cosine TF-IDF are the best three performers among the 12 algorithms, while Affine Gap, WHIRL and SoftTFIDF are the worst. For datasets with medium and high error rates, Affine Gap joins HMM and BM25 as the top three performers, while Cosine TF-IDF drops out of the best five. This shows that Affine Gap is suitable for datasets with a higher level of "dirtiness", and Cosine TF-IDF for datasets with a lower level of "dirtiness". In general, performance decreases as the "dirtiness" of a dataset increases, with the exception of Affine Gap, which has a higher maximum F1-score on datasets with a medium error rate than on datasets with a low error rate.
4.3.3 Effect of Size of Datasets on
Performance
Our experimental results show that the size of a dataset does not have a significant effect on the maximum F1-score, given optimum threshold values.
4.3.4 Effect of Length of Strings on Performance
Figure 3 shows the maximum F1-score for different techniques on datasets of 8000 records with three different token lengths, associated with medium error rates. It shows that, in general, performance decreases as the token length increases, except for Cosine TF-IDF, whose performance on long-length datasets is slightly better than on medium-length ones. From the results for the different techniques on datasets with medium-length and long-length tokens respectively, associated with three different error rates, it can be seen that BM25, HMM, Cosine TF-IDF and W/Jaccard performed better than the others, and that Affine Gap, WHIRL and SoftTFIDF performed badly when the token length is long. However, the performance of Affine Gap, WHIRL and SoftTFIDF increased significantly when the token length is medium.
4.3.5 Effect of Type of Errors on
Performance
Figure 4 shows the maximum F1-score for different techniques on datasets of 10000 records having only insertion errors, with three different error rates. For datasets with a low error rate, W/Jaccard, GES and Edit Similarity are the best three performers, while Fellegi-Sunter, Cosine TF-IDF and BM25 perform worse than the rest. The performance of BM25 increases with the level of "dirtiness" in the datasets, while the performance of GES and Edit Similarity decreases. Notably, BM25 is among the best three performers in most cases when the error types in the datasets are mixed (see Section 4.3.2).
4.3.6 Timing
Results also show that, when the error rate is low, Affine Gap, GES, Fellegi-Sunter and Edit Similarity cost the least time among the twelve algorithms, whereas SoftTFIDF costs the most. The pattern is similar when the error rate is medium. However, SoftTFIDF costs the least time when the error rate is high; HMM, TF-IDF and WHIRL also cost less time than the rest in that case. Our experimental results confirm that smaller datasets cost less time, and that the time used increases as the error rate increases. In particular, the time used on datasets with a high error rate is significantly more than on datasets with a low or medium error rate.
5 CONCLUSIONS AND FUTURE WORK
This paper has analysed and evaluated twelve popular token-based string matching techniques. A comprehensive comparison of the twelve techniques has been carried out through a series of experiments on 63 carefully designed datasets with different characteristics, such as the rate of errors, the type of errors, the number and length of tokens in a string, and the size of a dataset. The comparison results confirm the statement that there is no clear best technique. All the characteristics considered have a significant effect on the performance of these techniques, except the size of a dataset. In general, HMM and BM25 perform better than the others, especially on smaller datasets, but consume much more time. Cosine TF-IDF and TF-IDF are better on larger datasets with a higher error rate. Results also show that techniques that perform well on datasets with mixed types of errors do not necessarily perform similarly on datasets with a single type of error. For example, BM25 did not perform well on low-error-rate datasets containing only insertion errors. Similarly, HMM did not perform well on low-error-rate datasets containing only deletion errors. The token length also affects performance. For example, some techniques, such as Affine Gap, WHIRL and SoftTFIDF, performed much better when the token length is medium than when it is long.
Regarding the threshold value, the results show that the level of "dirtiness" in a dataset has a significant effect on threshold selection. In general, the higher the error rate in the dataset, the lower the threshold value required to achieve the maximum F1-score.
This work suggests a number of further investigations, including: 1) more experiments on datasets with further characteristics, such as the number of tokens in strings; and 2) further analysis to evaluate whether there is a method for selecting a threshold value for any of the matching techniques on a given dataset.
REFERENCES
Bilenko, M., Mooney, R., Cohen, W., Ravikumar, P. and
Fienberg, S., 2003. Adaptive Name Matching in
Information Integration, IEEE Intelligent Systems, vol.
18, no. 5, pp. 16-23.
Chaudhuri, S., Ganti, V. and Kaushik, R., 2006. A
primitive operator for similarity joins in data cleaning.
In Proceedings of International Conference on Data
Engineering.
Christen, P., 2006. A Comparison of Personal Name
Matching: Techniques and Practical Issues. In
Proceedings of the Sixth IEEE International
Conference on Data Mining - Workshops (ICDMW
'06). IEEE Computer Society, Washington, DC, USA,
pp.290-294.
Cohen, W., 2000. WHIRL: A word-based information
representation language. Artificial Intelligence,
Volume 118, Issues 1-2, pp. 163-196.
Cohen, W., Ravikumar, P. and Fienberg, S., 2003. A comparison of string distance metrics for name-matching tasks. In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web, pp. 73-78.
Elmagarmid, A., Ipeirotis, P. and Verykios, V., 2007. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng., Vol. 19, No. 1, pp. 1-16.
Fellegi, I. and Sunter, A., 1969. A Theory for Record Linkage. Journal of the American Statistical Association, 64(328), pp. 1183-1210.
Hassanzadeh, O., Sadoghi, M. and Miller, R., 2007.
Accuracy of Approximate String Joins Using Grams.
In Proceedings of QDB'2007, pp. 11-18.
Herzog, T., Scheuren, F. and Winkler, W., 2010. Record Linkage. In D. W. Scott, Y. Said and E. Wegman, eds., Wiley Interdisciplinary Reviews: Computational Statistics, 2(5), New York, N.Y.: Wiley, pp. 535-543.
Köpcke, H., Thor, A. and Rahm, E., 2010. Evaluation of Entity Resolution Approaches on Real-world Match Problems. In Proceedings of the VLDB Endowment, Vol. 3, No. 1.
Navarro, G., 2001. A Guided Tour to Approximate String Matching. ACM Computing Surveys, Vol. 33, No. 1, pp. 31-88.
Peng, T., Li, L. and Kennedy, J., 2012. A Comparison of Techniques for Name Matching. GSTF International Journal on Computing, Vol. 2, No. 1, pp. 55-61.
Rijsbergen, C., 1979. Information Retrieval. 2nd ed., London: Butterworths.
Vingron, M. and Waterman, M., 1994. Sequence alignment and penalty choice. Review of concepts, case studies and implications. Journal of Molecular Biology, 235(1), pp. 1-12.
ICEIS2014-16thInternationalConferenceonEnterpriseInformationSystems
224