syllable to be compared and boundary syllables (i.e.,
FS_{i,j} and LS_{p,q}, which we call LB and UB
respectively in the rest of this paper) of the Korean
search pattern: one is comparing them in Unicode
code points and the other is comparing them in the
UTF-8 encoding scheme. Korean characters are
assigned to the Basic Multilingual Plane (BMP)
region of Unicode. Therefore, even though the first
scheme has the burden of transforming byte
sequences into Unicode code points, each code
point fits in a single variable of unsigned short
data type, so the comparison itself can be done
quickly. The second scheme keeps LB and UB in
byte sequences according to the UTF-8 encoding
scheme. Therefore, it does not have the burden of
transforming Korean syllables into Unicode code
points. However, for the comparison of byte
sequences within each code unit, (1) in the case of
encoding schemes of UTF-8, UTF-16BE, and UTF-
32BE, those bytes have to be compared in the
forward direction and (2) in the case of encoding
schemes of UTF-16LE and UTF-32LE, those bytes
have to be compared in the reverse direction.
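
The two comparison schemes can be illustrated with a minimal sketch. The function names and the sample range (U+AC00 to U+AE4B) are assumptions made for illustration only; they are not taken from this paper.

```python
def decode_utf8_3byte(b: bytes) -> int:
    """Decode one 3-byte UTF-8 sequence (U+0800..U+FFFF) into a code point."""
    if (len(b) != 3 or (b[0] & 0xF0) != 0xE0
            or (b[1] & 0xC0) != 0x80 or (b[2] & 0xC0) != 0x80):
        raise ValueError("not a 3-byte UTF-8 sequence")
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


def in_range_by_code_point(syl: bytes, lb_cp: int, ub_cp: int) -> bool:
    """First scheme: transform the bytes into a code point, then compare integers."""
    return lb_cp <= decode_utf8_3byte(syl) <= ub_cp


def in_range_by_utf8_bytes(syl: bytes, lb: bytes, ub: bytes) -> bool:
    """Second scheme: compare UTF-8 bytes directly, in the forward direction.

    Python compares bytes objects lexicographically from the first byte,
    which for UTF-8 agrees with code point order.
    """
    return lb <= syl <= ub


def in_range_utf16le(syl: bytes, lb: bytes, ub: bytes) -> bool:
    """Little-endian code unit: compare in the reverse direction, so that the
    most significant byte is examined first."""
    return lb[::-1] <= syl[::-1] <= ub[::-1]
```

For a little-endian encoding, a forward comparison would examine the least significant byte first and could accept a syllable that lies outside the range, which is why the direction matters.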
We take the second scheme for the comparison of
Korean syllables, and it works as follows. The
searcher of the Korean search pattern, which is given
in the UTF-8 encoding scheme, is first transformed
into a Unicode code point. From that code point, the
type of the pattern and column_index (if available) are
identified, LB and UB of the searcher are decided in
Unicode code points, and then these two values are
transformed into byte sequences of the UTF-8
encoding scheme. For the matching of Korean
syllables in the string to be compared, those
syllables are compared directly with LB and UB. For
the generation of column_index (if necessary), the
syllables are transformed into Unicode code points.
In the rest of this paper, we assume that all data
structures and algorithms take this policy, and all
characters in the string pattern and the stored data
are either ASCII characters or modern Korean
characters.
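
A minimal sketch of this policy follows. The rule used here to decide LB and UB (all syllables that share the initial consonant of the searcher) and the use of the Unicode vowel index as column_index are assumptions chosen only to make the example concrete; the actual schemes are those of Section 3. The names prepare_bounds, syllable_matches, and column_index_of are hypothetical.

```python
HANGUL_BASE = 0xAC00   # first precomposed Hangul syllable
HANGUL_LAST = 0xD7A3   # last precomposed Hangul syllable
FINALS = 28            # final-consonant slots per vowel
BLOCK = 21 * FINALS    # 588 syllables per initial consonant


def prepare_bounds(searcher_utf8: bytes):
    """Done once per search pattern: code point -> LB/UB -> UTF-8 bytes.

    Assumed rule (illustration only): the range covers every syllable
    with the same initial consonant as the searcher.
    """
    cp = ord(searcher_utf8.decode("utf-8"))
    if not HANGUL_BASE <= cp <= HANGUL_LAST:
        raise ValueError("searcher is not a modern Hangul syllable")
    lb_cp = HANGUL_BASE + ((cp - HANGUL_BASE) // BLOCK) * BLOCK
    ub_cp = lb_cp + BLOCK - 1
    return chr(lb_cp).encode("utf-8"), chr(ub_cp).encode("utf-8")


def syllable_matches(data: bytes, pos: int, lb: bytes, ub: bytes) -> bool:
    """Done per stored syllable: compare the raw bytes with LB and UB.

    pos is assumed to be a character boundary in valid UTF-8 data.
    No transformation into a code point is needed here.
    """
    return lb <= data[pos:pos + 3] <= ub


def column_index_of(syllable_utf8: bytes) -> int:
    """Only here is the stored syllable transformed into a code point.

    Assumption: column_index is the vowel index of the syllable.
    """
    cp = ord(syllable_utf8.decode("utf-8"))
    return ((cp - HANGUL_BASE) % BLOCK) // FINALS
```

The point of the split is that the transformation cost is paid once for the searcher, while the per-record work is a plain byte comparison.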
The string pattern of the operator LIKE should be
normalized before performing any matching
operation. For that purpose, the string pattern is
stored in an array, say StringPattern, and the
normalized string pattern is kept in an array, say
zPattern. In this paper, we consider normal
characters, reserved characters (‘%’ and ‘_’), and
escape characters for the string patterns. We do not
consider string patterns like ‘[ ]’ and ‘[^]’, which are
supported by some commercial DBMSs such as MS
SQL Server. To support the Korean search
pattern, we add an array zPatternFlag to keep the types
of Korean search patterns, add arrays LBS and UBS
to keep LB and UB of each Korean search pattern
respectively, and apply the following two additional
rules for the normalization. Note that each Korean
character takes three bytes in the UTF-8 encoding
scheme.
(1) Let zPattern_k represent the k-th character in the
array zPattern. zPatternFlag_k takes the same
number of bytes as zPattern_k holds. If zPattern_k
is not a Korean search pattern, zPatternFlag_k
takes 0. However, if zPattern_k is a Korean search
pattern, zPatternFlag_k holds the information of
<type, range_index, column_index>, where type
means the type (1 for Type_ROW, 2 for
Type_CELL, and 3 for Type_COLUMN) of the
Korean search pattern, range_index identifies the
index of arrays LBS and UBS that hold LB and
UB of the pattern zPattern_k, and column_index
holds the column_index of the search pattern
only when the search pattern is one of type
Type_COLUMN.
(2) For each Korean search pattern in the string
pattern, the following steps have to be done. The
predecessor of the pattern is not stored in
zPattern and only the searcher of the pattern is
stored in zPattern. Let the searcher to be stored
in zPattern be the k-th character in zPattern. First
of all, LB and UB of the searcher are found
according to the schemes shown in Section 3 and
they are appended into arrays LBS and UBS. Let
the index of the values appended in the arrays be
range_index. If the type of the search pattern is
Type_ROW or Type_CELL, <1, range_index,
0> or <2, range_index, 0> is assigned to
zPatternFlag_k, respectively. However, if the type
of the search pattern is Type_COLUMN,
column_index of the vowel is calculated and then
<3, range_index, column_index> is assigned to
zPatternFlag_k.
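
The two rules can be sketched as follows. This is an illustration under stated assumptions, not the implementation described in this paper: the input is assumed to be already split into tokens, with each Korean search pattern given as a hypothetical KoreanSearch record whose LB, UB, and column_index were decided beforehand, and each field of the triple is assumed to occupy one byte, so that a flag has the same three-byte length as a Korean character.

```python
from dataclasses import dataclass

TYPE_ROW, TYPE_CELL, TYPE_COLUMN = 1, 2, 3


@dataclass
class KoreanSearch:
    """Hypothetical token for one Korean search pattern (predecessor removed)."""
    type: int
    searcher: str          # a single Korean syllable
    lb: str                # lower boundary syllable (from the Section 3 schemes)
    ub: str                # upper boundary syllable (from the Section 3 schemes)
    column_index: int = 0  # meaningful only for TYPE_COLUMN


def normalize(tokens):
    """Build zPattern, zPatternFlag, LBS, and UBS from a token list."""
    z_pattern = bytearray()
    z_pattern_flag = bytearray()
    lbs, ubs = [], []
    for tok in tokens:
        if isinstance(tok, KoreanSearch):
            # Rule (2): append LB and UB, remember their index.
            lbs.append(tok.lb.encode("utf-8"))
            ubs.append(tok.ub.encode("utf-8"))
            range_index = len(lbs) - 1  # must stay below 256 in this sketch
            col = tok.column_index if tok.type == TYPE_COLUMN else 0
            z_pattern += tok.searcher.encode("utf-8")
            # Rule (1): <type, range_index, column_index>, one byte each.
            z_pattern_flag += bytes([tok.type, range_index, col])
        else:
            enc = tok.encode("utf-8")
            z_pattern += enc
            # Not a Korean search pattern: the flag is 0, same byte length.
            z_pattern_flag += bytes(len(enc))
    return z_pattern, z_pattern_flag, lbs, ubs
```

With this layout the flag byte at the start index of a pattern is nonzero exactly when that pattern is a Korean search pattern, and the boundary values are found by a single lookup with range_index.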
We keep LB and UB in arrays LBS and UBS, and
store column_index in the array zPatternFlag,
because a large number of database records have to
be compared with the given string pattern. If these
values were not stored, they would have to be
re-calculated whenever a new database record is
met for the string match, which is wasteful.
Because of the reserved character ‘%’, the
algorithm that executes the matching operation
between a string pattern and a string to be compared
could be a recursive one. Since discussing the
algorithm itself is beyond the scope of this paper, the
string match algorithm is simply summarized within
the scope of the Korean search pattern. Let the start
index of the current pattern in zPattern be k. If
zPatternFlag[k] is not 0, the pattern is a Korean
search pattern. If zPatternFlag[k] is either 1 (i.e.,