“DOE” is not in the table. The name “JOHN DOE,
JUNIOR” would generate the mask “GW,J”, which
would be mapped by the domain expert as “JOHN” to
the given name attribute, “DOE” to the surname
attribute, and “JUNIOR” to the generational suffix
attribute. On the other hand, the name “JUNIOR
DOE”, while generating the mask “JW”, would be
mapped by the domain expert as “JUNIOR” to the
given name attribute and “DOE” to the surname
attribute, despite the “J” token type indicating a
generational suffix.
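The name-mask scheme above can be sketched as follows. This is a minimal illustration, not the actual implementation: the clue-word tables, the token types “G” (given-name clue), “J” (generational-suffix clue), and “W” (unknown word), and the mask-mapping entries are all assumptions reconstructed from the examples in the text.

```python
# Hypothetical clue-word tables (assumed from the examples above).
SUFFIX_CLUES = {"JUNIOR", "JR", "SENIOR", "SR"}   # token type "J"
GIVEN_NAME_CLUES = {"JOHN", "JUNIOR"}             # token type "G"

def token_type(tok):
    """Classify one name token; words in no clue table get type "W"."""
    if tok == ",":
        return ","
    if tok in SUFFIX_CLUES:
        return "J"
    if tok in GIVEN_NAME_CLUES:
        return "G"
    return "W"

# Mask mappings as a domain expert might record them; None skips a token.
MASK_MAPPINGS = {
    "GW,J": ("given_name", "surname", None, "suffix"),
    "JW":   ("given_name", "surname"),
}

def parse_name(name):
    tokens = name.replace(",", " , ").split()
    mask = "".join(token_type(t) for t in tokens)
    mapping = MASK_MAPPINGS.get(mask)
    if mapping is None:
        return None  # unmatched mask: would go to the exceptions file
    return {attr: tok for attr, tok in zip(mapping, tokens) if attr}
```

Here `parse_name("JOHN DOE, JUNIOR")` yields the given name, surname, and suffix, while `parse_name("JUNIOR DOE")` maps “JUNIOR” to the given name even though it tokenizes as “J” — the expert-supplied mapping for the whole mask, not the individual token type, decides the attribute.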
The third sub-component processes the address
tokens following a similar scheme. This component
has a separate, address-specific clue word table. In
addition, there are currently 6 address token types
used to identify 15 address word categories, as shown
in Table 1. So, for example, the “D” token type
identifies a directional token such as “N” for north
and “SE” for southeast.
However, a token identified as type “D” may be
used as a pre-directional address attribute, e.g., “N
OAK ST”, or a post-directional address attribute, e.g.,
“E ST NORTH”. These examples again illustrate the
need for pattern interpretation of the mask by a
domain expert.
In the street address “N OAK ST”, the “N” token
is identified as a “D” type token and would be mapped
to the pre-directional attribute. In “E ST NORTH”,
however, both the “E” token and the “NORTH” token
would be identified as “D” type tokens; the “E” token
would be mapped to the street name attribute, and the
“NORTH” token to the post-directional attribute.
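The same mask-driven disambiguation can be sketched for addresses. Again this is a hypothetical sketch: the “D” (directional) type follows the text, while the “T” (street type) and “W” (word) types, the word lists, and the two mappings are assumptions for illustration.

```python
# Hypothetical type tables; "D" is directional as described in the text,
# "T" (street type) and "W" (unknown word) are assumed for illustration.
DIRECTIONALS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW",
                "NORTH", "SOUTH", "EAST", "WEST"}
STREET_TYPES = {"ST", "AVE", "RD", "BLVD"}

def addr_token_type(tok):
    if tok in DIRECTIONALS:
        return "D"
    if tok in STREET_TYPES:
        return "T"
    return "W"

# Expert-supplied mappings: the same "D" token resolves differently
# depending on the overall mask pattern it appears in.
ADDR_MASK_MAPPINGS = {
    "DWT": ("pre_directional", "street_name", "street_type"),
    "DTD": ("street_name", "street_type", "post_directional"),
}

def parse_street(addr):
    tokens = addr.split()
    mask = "".join(addr_token_type(t) for t in tokens)
    mapping = ADDR_MASK_MAPPINGS.get(mask)
    return dict(zip(mapping, tokens)) if mapping else None
```

With these tables, “N OAK ST” produces mask “DWT” and maps “N” to the pre-directional attribute, while “E ST NORTH” produces “DTD” and maps “E” to the street name despite its “D” type.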
For both the name parsing and the address parsing
to complete successfully, the token mask generated
by the lookup table process must be found in a
knowledge base of previously created mask
mappings. If either the name mask or the address
mask is not found, the parsing operation fails for that
input, and the input and mask are both written to an
exceptions file that serves as input to the Exception
Processing System. The mask-mapping entry created
there is then inserted into the mask-mapping
knowledge base so that, thereafter, any input
generating the same mask will be processed
automatically by the parsing system. This step is
important because not all addresses conform to the
standard format used in the dictionary, and some
addresses may contain non-standard or ambiguous
components that cannot be matched to a specific
address field.
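The fail-then-learn cycle above can be summarized in a few lines. This is a sketch under assumed names (`knowledge_base`, `exceptions_file`, `learn`), not the system's actual interfaces:

```python
# Sketch of the fail-then-learn cycle: unmatched masks go to an
# exceptions file; once the Exception Processing System yields a
# mapping, the knowledge base grows and the mask parses automatically.
knowledge_base = {}      # mask -> attribute mapping
exceptions_file = []     # input to the Exception Processing System

def process(tokens, mask):
    mapping = knowledge_base.get(mask)
    if mapping is None:
        exceptions_file.append((tokens, mask))  # parsing fails for this input
        return None
    return dict(zip(mapping, tokens))

def learn(mask, mapping):
    """Insert a newly created mask mapping into the knowledge base."""
    knowledge_base[mask] = mapping
```

After `learn("DWT", ("pre_directional", "street_name", "street_type"))`, any later input generating mask “DWT” is parsed without further expert intervention.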
(i) Searching for a Match:
When a token is generated and the corresponding
mask does not match any of the pre-defined masks in
the dictionary, the program searches for a match by
comparing the token to a list of common address
components. This list may include common street
names, city names, and state abbreviations. The
program also uses Levenshtein similarity scores to
find the closest match; for example, “JUNIR”, which
is missing an “o”, can still be matched to “JUNIOR”,
giving a robust mechanism for assigning tokens.
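A minimal version of this closest-match search, using the standard dynamic-programming formulation of Levenshtein distance (the helper names are illustrative, not the program's own):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def closest_match(token, candidates):
    """Return the candidate with the smallest edit distance to the token."""
    return min(candidates, key=lambda c: levenshtein(token, c))
```

For the example in the text, `levenshtein("JUNIR", "JUNIOR")` is 1 (one missing letter), so `closest_match("JUNIR", ...)` over a list of common components recovers “JUNIOR”.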
(ii) Assigning to an Exception:
If a match is still not found after searching the list of
common components, the program will assign the
token to an exception. This is a catch-all category that
represents any component of the address that cannot
be matched to a specific field. Examples of
exceptions may include apartment or suite numbers,
building names, or unusual address formats.
(iii) Adding the Exception to the US Address:
Once the token has been assigned to an exception, the
program will add it to the US address components as
a separate field. This allows the exception to be
included in the final output, even if it cannot be
matched to a specific address field.
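Steps (ii) and (iii) amount to keeping unmatched tokens in a catch-all field. A sketch, with an assumed dictionary representation of the US address components:

```python
# Hypothetical representation: unmatched tokens land in a catch-all
# "exception" field so they still appear in the final output.
def add_exception(address, token):
    address.setdefault("exception", []).append(token)
    return address

addr = {"street_name": "MAIN", "street_type": "ST"}
add_exception(addr, "BLDG 7")
# addr now carries the exception alongside the matched fields
```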
The fifth step of the program involves adding the
address tokens that were generated in the second step
to the US address components. This step builds on the
previous steps by combining the cleaned and
tokenized address components with the assigned
address fields to create a complete US address.
(i) Assigning Tokens to Address Fields:
Based on the comparison of masks in the third step,
the program assigns each token to a specific address
field, such as the street name, city, state, and zip code.
This creates a set of address fields, where each field
corresponds to a specific component of the address.
(ii) Adding Address Tokens to the Address Fields:
Once the tokens have been assigned to the address
fields, the program then adds these tokens to the
appropriate address field in the US address
components. For example, if the token "Main" was
assigned to the street name field, the program would
add "Main" to the street name component in the US
address.
(iii) Building the Complete US Address:
Once all tokens have been added to the appropriate
address fields, the program can then combine these
components to create a complete US address. This
involves concatenating the address components in the
correct order and separating them with appropriate
punctuation, such as commas and spaces.
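The final assembly step can be sketched as follows. The field names, ordering, and punctuation here are illustrative assumptions, not the program's actual output conventions:

```python
# Illustrative field order for the street portion of a US address.
FIELD_ORDER = ["house_number", "pre_directional", "street_name",
               "street_type", "post_directional"]

def build_address(fields):
    """Concatenate present components in order, comma before the locality."""
    street = " ".join(fields[f] for f in FIELD_ORDER if f in fields)
    locality = " ".join(fields[f] for f in ("city", "state", "zip") if f in fields)
    return ", ".join(part for part in (street, locality) if part)
```

For example, `build_address({"house_number": "123", "street_name": "MAIN", "street_type": "ST", "city": "SPRINGFIELD", "state": "IL", "zip": "62701"})` produces `"123 MAIN ST, SPRINGFIELD IL 62701"`.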