value must be constructed to be acceptable. For
instance, a British phone number has eleven digits:
the first is usually a zero and the second
should be one, two, or seven. This is a precise rule
based on a defined pattern. Other rules are
more general: a first name is composed of letters and
may contain some special characters (e.g. “-”), and
the number of characters in a name is not
fixed as it is in a phone number. It is important to note
that the rules are country-based. For example, the
number of digits in a phone number differs between
France (10) and Britain (11). Therefore, the
rules should be sorted by country. The
general algorithm is based on character
analysis: each character is assessed against the
relevant rule. Because of slight differences between
the kinds of rule, there are two possible algorithms:
Defined pattern
A defined pattern has a precise size and the exact
location of all the characters is known. A rule is for
instance:
The French postcode has five and only five
digits (numbers from 0 to 9)
Therefore, the corresponding pattern is:
NNNNN (with N a digit from 0 to 9)
The algorithm compares the postcode to the
pattern, character by character. If any character fails
to match, the postcode is not valid. In some cases there
is more than one pattern. For instance:
The British customer phone number has eleven
digits, the first one is zero and the second one can be
one, two or seven
The corresponding patterns are then:
01NNNNNNNNN
02NNNNNNNNN
07NNNNNNNNN
The algorithm first compares the phone number
to the first pattern. If this pattern does not match, the
algorithm tries the second pattern and then the third
one. If none of them matches, the phone number is
not valid.
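The defined-pattern check described above can be sketched in Java as follows. This is a minimal illustration, not the authors' implementation: the class name `DefinedPatternMatcher`, its method names, and the `UK_PHONE` constant are hypothetical, and the sketch assumes that `N` stands for any digit while every other pattern character must match literally.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating the defined-pattern check.
final class DefinedPatternMatcher {

    // The three British phone patterns listed in the text.
    static final List<String> UK_PHONE =
            Arrays.asList("01NNNNNNNNN", "02NNNNNNNNN", "07NNNNNNNNN");

    // Compares value to pattern character by character.
    // Assumption: 'N' means any digit 0-9; other characters are literals.
    static boolean matches(String value, String pattern) {
        if (value.length() != pattern.length()) {
            return false; // a defined pattern has a precise size
        }
        for (int i = 0; i < pattern.length(); i++) {
            char expected = pattern.charAt(i);
            char actual = value.charAt(i);
            if (expected == 'N') {
                if (actual < '0' || actual > '9') {
                    return false;
                }
            } else if (expected != actual) {
                return false;
            }
        }
        return true;
    }

    // Tries each pattern in order; the value is valid if any one matches.
    static boolean matchesAny(String value, List<String> patterns) {
        for (String pattern : patterns) {
            if (matches(value, pattern)) {
                return true;
            }
        }
        return false;
    }
}
```

For example, `DefinedPatternMatcher.matches("75001", "NNNNN")` accepts a five-digit French postcode, while a phone number starting with 05 is rejected by `matchesAny` because none of the three patterns match.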
General pattern
A general pattern has no particular size, and only the
type of the allowed characters is known. A rule is for
instance:
A last name has only letters and the special
characters “-” and “.”
The corresponding general pattern is:
Letters – .
The algorithm checks each character of the last
name to determine whether it is a letter, “.” or “-”. If any
character does not match, the last name is not
valid.
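A general-pattern check of this kind might look like the sketch below. The class and method names are hypothetical, and two details are assumptions not stated in the text: letters are recognised with `Character.isLetter` (so accented letters are accepted), and an empty value is treated as invalid.

```java
// Hypothetical helper illustrating the general-pattern check.
final class GeneralPatternMatcher {

    // Returns true if every character is a letter or one of the
    // allowed special characters (e.g. "-." for a last name).
    static boolean matches(String value, String allowedSpecials) {
        if (value.isEmpty()) {
            return false; // assumption: an empty field is not a valid name
        }
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            boolean allowed = Character.isLetter(c)
                    || allowedSpecials.indexOf(c) >= 0;
            if (!allowed) {
                return false; // one bad character invalidates the value
            }
        }
        return true;
    }
}
```

With this sketch, `GeneralPatternMatcher.matches("Smith-Jones", "-.")` is accepted and `GeneralPatternMatcher.matches("Sm1th", "-.")` is rejected because of the digit.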
5.3 Implementation of the Defined Rules
To measure the usefulness of the defined metrics, a
Java application was designed and implemented. All
the rules are stored in XML files as patterns and
sorted by country. To check a field, the algorithm
retrieves the rule corresponding to the country from the
XML file, using the SAX (Simple API for XML)
parser. The field value is then compared to the
pattern (or patterns), i.e. each character is checked
against the rule. If there is an error (i.e. the field breaks
the rule), the algorithm returns false; otherwise it
returns true. The Accuracy attribute depends directly on
this result: it is equal to 0 if the algorithm returns
false and to 1 if it returns true.
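The rule lookup and the resulting Accuracy value could be sketched as follows. The paper does not give the XML layout, so the `rules`, `rule` and `pattern` elements and the `country` and `field` attributes are purely illustrative assumptions, as are the class `RuleStore` and its methods. Only the standard JAXP SAX classes are used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: the XML layout below is an assumption.
final class RuleStore {

    static final String SAMPLE_RULES =
          "<rules>"
        + "<rule country=\"FR\" field=\"postcode\"><pattern>NNNNN</pattern></rule>"
        + "<rule country=\"GB\" field=\"phone\">"
        + "<pattern>01NNNNNNNNN</pattern><pattern>02NNNNNNNNN</pattern>"
        + "<pattern>07NNNNNNNNN</pattern></rule>"
        + "</rules>";

    // Collects the patterns of the rule matching the country and field.
    static List<String> patternsFor(String xml, String country, String field) {
        List<String> patterns = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inWantedRule = false;
            private StringBuilder text = null; // non-null while inside a wanted <pattern>

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (qName.equals("rule")) {
                    inWantedRule = country.equals(atts.getValue("country"))
                            && field.equals(atts.getValue("field"));
                } else if (qName.equals("pattern") && inWantedRule) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("pattern") && text != null) {
                    patterns.add(text.toString().trim());
                    text = null;
                } else if (qName.equals("rule")) {
                    inWantedRule = false;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not read the rules", e);
        }
        return patterns;
    }

    // Accuracy is 1 if the value matches any pattern, otherwise 0.
    static int accuracy(String value, List<String> patterns) {
        for (String pattern : patterns) {
            if (matches(value, pattern)) {
                return 1;
            }
        }
        return 0;
    }

    // Assumption: 'N' is any digit; other pattern characters are literals.
    private static boolean matches(String value, String pattern) {
        if (value.length() != pattern.length()) {
            return false;
        }
        for (int i = 0; i < pattern.length(); i++) {
            char p = pattern.charAt(i);
            char c = value.charAt(i);
            if (p == 'N' ? (c < '0' || c > '9') : p != c) {
                return false;
            }
        }
        return true;
    }
}
```

SAX suits this lookup because the handler can pick out one country's rule while streaming through the file, without building the whole document in memory.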
5.3.1 Meaning Algorithms
The main concern regarding the meaning attribute is
to assess the value of a field and decide whether this
value has a meaning. Different strategies
are therefore needed for different fields. For instance, a
strategy for checking the meaning of a phone
number may differ from a strategy for
assessing the meaning of a first name. These strategies
can be considered as indicators that indirectly assess the
meaning attribute; the interpretation of the
results is therefore very important. The criteria used to
check names (first names, last names and cities) are
explained in the following sections. The main idea is
that real names (first names, last names and
cities, for instance) have particular values for these criteria.
Vowel ratio
This algorithm compares the number of vowels to
the total number of letters. The result is the number
of vowels divided by the number of letters,
displayed as a percentage (e.g. 50% means that half
the letters are vowels). A high value means that the
name (or word) has more vowels than consonants.
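A minimal sketch of the vowel ratio is given below. The class name is hypothetical, and the sketch assumes that the vowels are a, e, i, o and u (so “y” counts as a consonant), that non-letter characters such as “-” are ignored, and that a value with no letters yields 0.

```java
// Hypothetical helper illustrating the vowel ratio indicator.
final class VowelRatio {

    // Returns the share of vowels among the letters, as a percentage.
    static double percent(String name) {
        int letters = 0;
        int vowels = 0;
        for (int i = 0; i < name.length(); i++) {
            char c = Character.toLowerCase(name.charAt(i));
            if (!Character.isLetter(c)) {
                continue; // assumption: special characters are not counted
            }
            letters++;
            if ("aeiou".indexOf(c) >= 0) { // assumption: 'y' is a consonant
                vowels++;
            }
        }
        return letters == 0 ? 0.0 : 100.0 * vowels / letters;
    }
}
```

Under these assumptions, “Anna” gives 50% (two vowels out of four letters), whereas a keyboard-mashed string such as “dklsajl” gives a much lower value (one vowel out of seven letters).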
Pattern redundancy
The pattern redundancy algorithm calculates the
frequency of groups of letters that occur more
than once. These groups are called patterns and can
have any size. The algorithm returns the size of the
most frequent pattern multiplied by its frequency and
divided by the number of letters in the name. This
number may be considered as the “surface” of the
pattern. A high value means that there is a recurrent
pattern, which is unusual in a real-world name.
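One possible reading of this description is sketched below. The text leaves several details open, so the following are assumptions rather than the authors' choices: a pattern has at least two letters, occurrences are counted without overlap, ties in frequency go to the longer pattern, case and non-letter characters are ignored, and a name with no repeated group scores 0. The class name is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper illustrating the pattern redundancy indicator.
final class PatternRedundancy {

    static final int MIN_SIZE = 2; // assumption: single letters are not patterns

    // Returns size * frequency / letters for the most frequent repeated group.
    static double score(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            if (Character.isLetter(c)) {
                sb.append(Character.toLowerCase(c));
            }
        }
        String s = sb.toString();

        int bestFreq = 1; // a group must occur more than once to count
        int bestSize = 0;
        Set<String> seen = new HashSet<>();
        // A group longer than half the name cannot repeat without overlap.
        for (int size = MIN_SIZE; size <= s.length() / 2; size++) {
            for (int start = 0; start + size <= s.length(); start++) {
                String group = s.substring(start, start + size);
                if (!seen.add(group)) {
                    continue; // this group was already counted
                }
                int freq = countNonOverlapping(s, group);
                boolean moreFrequent = freq > bestFreq;
                boolean longerTie = freq == bestFreq && freq > 1 && size > bestSize;
                if (moreFrequent || longerTie) {
                    bestFreq = freq;
                    bestSize = size;
                }
            }
        }
        return bestSize == 0 ? 0.0 : (double) bestSize * bestFreq / s.length();
    }

    // Counts occurrences of group in s, scanning left to right without overlap.
    static int countNonOverlapping(String s, String group) {
        int count = 0;
        int from = 0;
        while ((from = s.indexOf(group, from)) >= 0) {
            count++;
            from += group.length();
        }
        return count;
    }
}
```

With these choices the score stays between 0 and 1: “asdasdasd” scores 1.0 because the group “asd” covers the whole string three times, while “Smith” scores 0 because no group repeats.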
Keyboard algorithm
The keyboard algorithm is based on the location of
the keys on a keyboard. In fact few real world names
depend only on the second line of the keyboard
(a,s,d,f…), but fake names (e.g. “dklsajl”) are very
ICEIS 2007 - International Conference on Enterprise Information Systems