S is the set of possible start consonants, E the set of
possible end consonants and ot an optional tone mark
symbol. Hn shows all possible combinations of Hoh-
nahm with other consonants. Kc is a shorthand for all
possible K-clusters and Pc for the possible P-clusters.
A possible pattern action pair is:
{S}M{ot}{E} printf("<S01V03E02>");
If the scanner detects a single start consonant,
represented by the macro {S}, followed by a vowel
symbol, represented here by M, then zero or one tone
mark, given by the macro {ot}, and finally an end
consonant from the set of possible end consonants {E},
it outputs <S01V03E02>.
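The macro expansion behind such a pattern can be
sketched in Python. This is only an illustration and
not the flex source of dinvow: the contents of the
sets S, E and ot below are assumptions made for the
sketch, and the helper names expand and action are
hypothetical.

```python
import re

# Hypothetical macro bodies; the real sets are defined in the scanner itself.
MACROS = {
    "S": "[SWmLG]",   # assumed start consonants
    "E": "[SG]",      # assumed end consonants
    "ot": "[1-4]?",   # assumed tone-mark symbols, zero or one of them
}

def expand(pattern):
    """Replace every {name} by its macro body, grouped so it stays one unit."""
    return re.sub(r"\{(\w+)\}",
                  lambda m: "(?:" + MACROS[m.group(1)] + ")",
                  pattern)

# The pattern half of the pattern-action pair {S}M{ot}{E}.
RULE = re.compile(expand("{S}M{ot}{E}"))

def action(text):
    """Return the fixed blueprint of the paper's action if text matches."""
    return "<S01V03E02>" if RULE.fullmatch(text) else None
```

Under these assumptions action("WMS") yields the
blueprint, while action("WM") yields nothing because
the end consonant is missing.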
Another possible pattern action pair is:
e{Kc}I{ot}Y printf("<S04V27!");
The pattern matches a K-cluster Kc in combination
with a compound vowel symbol, represented here by
e-IY (where the dash is the place of the start
consonant or cluster), with zero or one tone mark.
The action generates the output <S04V27!. This
syllable could be complete, but an end consonant may
still follow. This situation is resolved in flexpass,
which is also a flex-based scanner but uses fewer
patterns than dinvow. The actions of flexpass are
discussed in the example of the next section.
7 EXAMPLES
To explain in more detail the working of the scan-
ner, in this section two examples will be given. Let
us start with the famous "Hello world" example. In
written Thai, Hello world looks like สวัสดีโลก and is
pronounced sawatdee lohk, phonetic [sawatdi: lo:k].
The scanner takes the following steps to arrive at a
blueprint from which this phonetic result can be
derived. First the Thai string is duplicated with
dupthai, and then prelex translates the Thai
characters to standard ASCII characters. Ignoring the
duplicate of the original Thai text, this results in:
[SWMSmIOLG]
The [ and ]-signs are used as delimiters. Every char-
acter in this string represents a Thai symbol, a Thai
vowel or a set (mostly with one member) of Thai con-
sonants. Now we are ready to use dinvow to detect
the vowels and associated consonants. The result is:
[S<S01V03E02><S01V10!<S01V18!G]
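How a first pass of this kind could arrive at that
string can be sketched as follows. The sketch imitates
the longest-match behaviour of a flex scanner in
Python. Only the first rule is taken from the text;
the two other rules, the character sets and the
function name are assumptions chosen so that the
example string is reproduced.

```python
import re

S = "[SWmLG]"    # assumed start consonants
E = "[SG]"       # assumed end consonants
OT = "[1-4]?"    # assumed optional tone mark

RULES = [
    (re.compile(S + "M" + OT + E), "<S01V03E02>"),  # pair given in the text
    (re.compile(S + "I" + OT), "<S01V10!"),         # assumed rule
    (re.compile("O" + S + OT), "<S01V18!"),         # assumed rule
]

def dinvow_sketch(text):
    """Scan left to right; the longest match wins, ties go to the first rule."""
    out = []
    pos = 0
    while pos < len(text):
        best_len, best_out = 0, None
        for regex, output in RULES:
            m = regex.match(text, pos)
            if m and m.end() - pos > best_len:
                best_len, best_out = m.end() - pos, output
        if best_out is None:
            out.append(text[pos])   # no rule applies: copy the symbol
            pos += 1
        else:
            out.append(best_out)    # emit the blueprint fragment
            pos += best_len
    return "".join(out)
```

Symbols that no rule consumes, such as the delimiters
and the dangling S and G, are copied unchanged, which
is why they are still present for the next pass.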
Three vowel symbols of written Thai are involved in
this phrase. The first one always needs an end
consonant, so its syllable is finished. The second and
the third one may or may not take an end consonant.
This is why these syllables are closed not by a > but
by a ! symbol. At the beginning as well as at the end
there is still a dangling consonant (symbols S and G).
In the next step (flexpass) the dangling consonant at
the beginning is treated as a syllable of its own,
with an inherent a-sound and no end consonant. The
dangling consonant at the end is combined with the
still open syllable before it. The syllable denoted by
S01V10 turns out to have no end consonant, so the
combination !< is translated to ><, with E00 marking
the absent end consonant. The result is:
[<S01V02E00><S01V03E02><S01V10E00><S01V18E02>]
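The closing step can be illustrated with a small
Python sketch. It is not the flexpass source: the
table END_CODES holds only the one end consonant seen
in this example, and the three rules are inferred from
the description above, so they should be read as
assumptions.

```python
import re

# Hypothetical table: end-consonant symbol -> end code (only G is known here).
END_CODES = {"G": "E02"}

# A token is a blueprint fragment (closed by > or open with !) or one symbol.
TOKEN = re.compile(r"<[^<>!]*[>!]|.")

def flexpass_sketch(text):
    tokens = TOKEN.findall(text)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("<") and tok.endswith("!"):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            if nxt in END_CODES:
                # Open syllable followed by an end consonant: absorb it.
                out.append(tok[:-1] + END_CODES[nxt] + ">")
                i += 2
                continue
            # Open syllable without end consonant: close it with E00.
            out.append(tok[:-1] + "E00>")
        elif tok.startswith("<") or tok in "[]":
            out.append(tok)             # finished syllable or delimiter
        else:
            out.append("<S01V02E00>")   # dangling consonant, inherent a-sound
        i += 1
    return "".join(out)
```

Applied to the output of the first pass, the sketch
gives the four closed syllables shown above. A
blueprint that is already finished passes through
unchanged.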
Now we have a complete blueprint of the phrase. In
the section 'sound generation' a tool will be
introduced that makes a phonetic representation from
it. The difficult part, however, has now been done by
the scanner.
As a second example we consider the word เปรี้ยว,
meaning sour and pronounced priaw. This word consists
of the consonant cluster pr in combination with a
compound vowel consisting of four symbols and a tone
symbol. To make things a bit more complicated, this
compound vowel also uses consonant symbols. Using
prelex, the Thai characters are translated to:
[exxI2YW]
Using dinvow gives the result:
<S05V33E00>
This blueprint describes a word with a P-cluster
(either pl or pr, according to table 1), a compound
vowel V33 (with the sound iaw) and no end consonant.
In this case the next pass of the scanner makes no
changes, because the blueprint is already finished.
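Reading a blueprint back into its fields is
straightforward. The following Python sketch assumes
the layout seen in the examples (S, V and an optional
E, each followed by two digits, closed by > or left
open with !). The names Syllable and parse_blueprint
are hypothetical.

```python
import re
from typing import NamedTuple, Optional

class Syllable(NamedTuple):
    start: str            # start consonant or cluster code, e.g. "05"
    vowel: str            # vowel code, e.g. "33"
    end: Optional[str]    # end code, None while the syllable is still open
    closed: bool          # True for '>', False for '!'

SYLLABLE = re.compile(r"<S(\d{2})V(\d{2})(?:E(\d{2}))?([>!])")

def parse_blueprint(text):
    """Return every syllable fragment found in a blueprint string."""
    return [
        Syllable(m.group(1), m.group(2), m.group(3), m.group(4) == ">")
        for m in SYLLABLE.finditer(text)
    ]
```

For <S05V33E00> this gives one closed syllable with
start code 05, vowel code 33 and end code 00. What
these codes stand for has to be looked up in the
tables of the paper.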
8 ERROR RECOVERY
There are situations where the scanner fails to produce
the correct result. In such a case a hint can be given
to tell the scanner where a syllable actually ends.
The scanner uses these hints as extra information,
which can lead to a correct transcription. Hinting is
used in words whose syllables are not easy to
determine. An example is the word for birdcage, which
is written in Thai using only five consonants, กรงนก,
and is pronounced [krong nok]. In this situation the
scanner has the choice of using an inherent 'o'
between two consonants, an inherent 'a' after every
single consonant, or perhaps a combination of inherent
'o' and 'a' sounds. K-R-NG-N-K should however result in
GENERATING PHONEMES FROM WRITTEN THAI USING LEXICAL ANALYSIS BASED ON REGULAR
EXPRESSIONS