
 
syllable to be compared and boundary syllables (i.e.,
FS_{i,j} and LS_{p,q}, which we call LB and UB
respectively in the rest of this paper) of the Korean
search pattern: one compares them as Unicode code
points and the other compares them as byte sequences
in the UTF-8 encoding scheme. Korean characters are
assigned to the Basic Multilingual Plane (BMP) of
Unicode. Therefore, although the first scheme carries
the burden of transforming byte sequences into Unicode
code points, the code points fit in variables of the
unsigned short data type, so the comparison itself is
fast. The second scheme keeps LB and UB as byte
sequences in the UTF-8 encoding scheme and therefore
avoids transforming Korean syllables into Unicode code
points. However, the byte sequence of each encoded
character has to be compared in a direction that
depends on the encoding scheme: (1) for UTF-8,
UTF-16BE, and UTF-32BE, the bytes have to be compared
in the forward direction, and (2) for UTF-16LE and
UTF-32LE, they have to be compared in the reverse
direction.
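The two comparison schemes can be illustrated with a minimal Python sketch. The function names and the range 가 (U+AC00) to 깋 (U+AE4B) are assumptions chosen for illustration only; they are not bounds taken from this paper.

```python
def in_range_by_code_point(syllable: str, lb: str, ub: str) -> bool:
    """Scheme 1: decode to code points, then compare integers."""
    return ord(lb) <= ord(syllable) <= ord(ub)


def in_range_by_bytes(syllable: bytes, lb: bytes, ub: bytes,
                      forward: bool = True) -> bool:
    """Scheme 2: compare the encoded bytes of one character directly.

    forward=True suits UTF-8 and the big-endian forms; forward=False
    reverses the bytes first, which suits the little-endian forms.
    """
    if not forward:
        syllable, lb, ub = syllable[::-1], lb[::-1], ub[::-1]
    # Python compares bytes objects lexicographically by unsigned value.
    return lb <= syllable <= ub


lb, ub = "가", "깋"  # illustrative range only (U+AC00 .. U+AE4B)
for ch in ("강", "나"):
    by_cp = in_range_by_code_point(ch, lb, ub)
    by_utf8 = in_range_by_bytes(
        ch.encode("utf-8"), lb.encode("utf-8"), ub.encode("utf-8"))
    by_utf16le = in_range_by_bytes(
        ch.encode("utf-16-le"), lb.encode("utf-16-le"),
        ub.encode("utf-16-le"), forward=False)
    print(ch, by_cp, by_utf8, by_utf16le)
```

For BMP characters, all three checks agree, because the forward byte order of UTF-8 and of the big-endian forms preserves code point order.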
We take the second scheme for the comparison of
Korean syllables, and it works as follows. The searcher
of the Korean search pattern, given in the UTF-8
encoding scheme, is first transformed into a Unicode
code point. From that code point, the type of the
pattern and column_index (if available) are identified,
and LB and UB of the searcher are decided in Unicode
code points; these two values are then transformed into
byte sequences of the UTF-8 encoding scheme. To match
Korean syllables in the string to be compared, those
syllables are compared directly with LB and UB as byte
sequences. Only when column_index has to be generated
are the syllables transformed into Unicode code points.
In the rest of this paper, we assume that all data
structures and algorithms follow this policy, and that
all characters in the string pattern and the stored data
are either ASCII characters or modern Korean
characters.
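A minimal Python sketch of this policy follows. It assumes that LB and UB have already been decided as code points (the values 0xAC00 and 0xAE4B below are placeholders, not results of the schemes in Section 3), and all function names are hypothetical. The bounds are encoded once into three-byte UTF-8 sequences, and the stored data is then compared byte-wise without decoding.

```python
def utf8_to_code_point(b: bytes) -> int:
    """Decode one three-byte UTF-8 sequence (U+0800..U+FFFF)."""
    assert len(b) == 3 and (b[0] & 0xF0) == 0xE0
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


def code_point_to_utf8(cp: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF as three UTF-8 bytes."""
    assert 0x0800 <= cp <= 0xFFFF
    return bytes([0xE0 | (cp >> 12),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def syllable_matches(data: bytes, pos: int, lb: bytes, ub: bytes) -> bool:
    """Compare the three bytes at data[pos:pos+3] with LB and UB directly."""
    unit = data[pos:pos + 3]
    return len(unit) == 3 and lb <= unit <= ub


# Assumed bounds: decided once in code points, then kept as UTF-8 bytes.
lb = code_point_to_utf8(0xAC00)  # 가
ub = code_point_to_utf8(0xAE4B)  # 깋

data = "a강나".encode("utf-8")   # 1 ASCII byte + two 3-byte syllables
print(syllable_matches(data, 1, lb, ub))   # 강 lies inside the range
print(syllable_matches(data, 4, lb, ub))   # 나 lies outside the range
print(hex(utf8_to_code_point(data[1:4])))  # decoded only when needed
```

Decoding is kept as a separate step so that it is paid only when a code point is actually required, for example to derive column_index.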
The string pattern of the operator LIKE should be
normalized before any matching operation is performed.
For that purpose, the string pattern is stored in an
array, say StringPattern, and the normalized string
pattern is kept in an array, say zPattern. In this
paper, we consider normal characters, reserved
characters (‘%’ and ‘_’), and escape characters for the
string patterns. We do not consider string patterns
like ‘[ ]’ and ‘[^]’, which are supported by some
commercial DBMSs such as MS SQL Server. To include the
Korean search pattern, we introduce an array
zPatternFlag to keep the types of Korean search
patterns, introduce arrays LBS and UBS to keep LB and
UB of each Korean search pattern respectively, and add
the following two rules to the normalization. Note that
each Korean character takes three bytes in the UTF-8
encoding scheme.
(1) Let zPattern_k represent the k-th character in the
array zPattern. zPatternFlag_k takes the same number of
bytes as zPattern_k holds. If zPattern_k is not a
Korean search pattern, zPatternFlag_k takes 0. However,
if zPattern_k is a Korean search pattern,
zPatternFlag_k holds the information of
<type, range_index, column_index>, where type means the
type (1 for Type_ROW, 2 for Type_CELL, and 3 for
Type_COLUMN) of the Korean search pattern, range_index
identifies the index of arrays LBS and UBS that hold LB
and UB of the pattern zPattern_k, and column_index
holds the column_index of the search pattern only when
the search pattern is of type Type_COLUMN.
(2) For each Korean search pattern in the string
pattern, the following steps have to be done. The
predecessor of the pattern is not stored in zPattern
and only the searcher of the pattern is stored in
zPattern. Let the searcher to be stored in zPattern be
the k-th character in zPattern. First of all, LB and UB
of the searcher are found according to the schemes
shown in Section 3 and they are appended to arrays LBS
and UBS. Let the index of the values appended to the
arrays be range_index. If the type of the search
pattern is Type_ROW or Type_CELL, <1, range_index, 0>
or <2, range_index, 0> is assigned to zPatternFlag_k,
respectively. However, if the type of the search
pattern is Type_COLUMN, column_index of the vowel is
calculated and then <3, range_index, column_index> is
assigned to zPatternFlag_k.
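The two rules above can be sketched in Python as follows. This is only one possible reading, under several assumptions: the string pattern is assumed to be already split into tokens, the class KoreanSearcher and the function normalize are hypothetical names, the bounds and column_index are supplied by the caller rather than derived as in Section 3, and the triple <type, range_index, column_index> is assumed to occupy the three flag bytes that parallel the three UTF-8 bytes of the searcher (so each value must fit in one byte). Reserved and escape characters are passed through unchanged here.

```python
from dataclasses import dataclass

TYPE_ROW, TYPE_CELL, TYPE_COLUMN = 1, 2, 3


@dataclass
class KoreanSearcher:
    """Hypothetical pre-parsed Korean search pattern (predecessor dropped)."""
    searcher: str          # the searcher character to store in zPattern
    type: int              # TYPE_ROW, TYPE_CELL, or TYPE_COLUMN
    lb: int                # lower bound as a code point (supplied by caller)
    ub: int                # upper bound as a code point (supplied by caller)
    column_index: int = 0  # meaningful only for TYPE_COLUMN


def normalize(tokens):
    """Build zPattern, zPatternFlag, LBS, and UBS from pre-parsed tokens."""
    z_pattern = bytearray()
    z_pattern_flag = bytearray()
    lbs, ubs = [], []
    for tok in tokens:
        if isinstance(tok, KoreanSearcher):
            encoded = tok.searcher.encode("utf-8")
            assert len(encoded) == 3  # a Korean character takes three bytes
            range_index = len(lbs)    # index of the bounds appended below
            lbs.append(chr(tok.lb).encode("utf-8"))
            ubs.append(chr(tok.ub).encode("utf-8"))
            col = tok.column_index if tok.type == TYPE_COLUMN else 0
            z_pattern += encoded
            z_pattern_flag += bytes([tok.type, range_index, col])
        else:
            encoded = tok.encode("utf-8")
            z_pattern += encoded
            z_pattern_flag += bytes(len(encoded))  # zeros: not a Korean pattern
    return bytes(z_pattern), bytes(z_pattern_flag), lbs, ubs


# Illustrative placeholders only, not values derived from Section 3.
tokens = ["a", KoreanSearcher("ㄱ", TYPE_ROW, 0xAC00, 0xAE4B), "%"]
z_pattern, z_pattern_flag, lbs, ubs = normalize(tokens)
print(list(z_pattern_flag), lbs, ubs)
```

Keeping zPatternFlag byte-aligned with zPattern means that a single index addresses both arrays during matching.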
We keep LB and UB in the arrays LBS and UBS and store
column_index in the array zPatternFlag because many
database records have to be compared with the same
string pattern. If these values were not stored, they
would have to be recalculated every time a new database
record is examined for a match, which would be
wasteful.
Because of the reserved character ‘%’, the algorithm
that matches a string pattern against a string to be
compared can be recursive. Since discussing the
algorithm itself is beyond the scope of this paper, the
string match algorithm is only summarized here as far
as it concerns the Korean search pattern. Let the start
index of the current pattern in zPattern be k. If
zPatternFlag[k] is not 0, the pattern is a Korean 
search pattern. If zPatternFlag[k] is either 1 (i.e., 