Another disadvantage of such approaches is that
they are language dependent, since the rules for syl-
labification differ for each language.
Data-driven methods are a more recent approach to
syllabification, with both supervised and unsupervised
techniques being investigated.
Syllabification by Analogy (SbA) (Marchand and
Damper, 2000) performs a full matching between a
new word and the dictionary entries, by means of a
directed graph, and identifies the best candidate syl-
labification by finding the shortest path between the
start and end nodes; ties are resolved via several
scoring strategies. Daelemans et al. (1997) apply
instance-based learning to identify the closest N-gram
for each juncture. Different feature weighting functions
are investigated, as well as different N-gram sizes.
More recently, the task of syllabification has
been formulated as a structured classification problem
(Bartlett et al., 2008), which is solved via structured
Support Vector Machines (SVM-HMM). This problem
formulation requires a tagging scheme that marks the
relevant features. Positional tags (Not Boundary,
Boundary: NB) and structural tags (Onset, Nucleus, and
Coda: ONC) are investigated. Each syllable is composed
of a sequence of phones: a nucleus (vowel), which may be
preceded by an onset (consonant) and followed by a coda
(consonant). From the phonetic point of view, the nucleus
and coda together form the rhyme.
Probabilistic methods based on Conditional Random
Fields have also been investigated, again modelling
the task as a sequence learning problem (Rogova
et al., 2013).
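To make the positional NB scheme mentioned above concrete, the following is a minimal Python sketch. It assumes that tags are assigned per letter and that B marks a letter immediately followed by a syllable boundary; the exact convention used in the cited work may differ, and the function names nb_tags and syllables_from_tags are purely illustrative.

```python
def nb_tags(syllables):
    """Return one tag per letter: 'B' if a syllable boundary follows it, else 'N'.

    Assumption: the final letter of the word is tagged 'N', since no
    word-internal boundary follows it.
    """
    tags = []
    for idx, syllable in enumerate(syllables):
        is_last_syllable = idx == len(syllables) - 1
        for pos in range(len(syllable)):
            ends_syllable = pos == len(syllable) - 1
            tags.append("B" if ends_syllable and not is_last_syllable else "N")
    return tags


def syllables_from_tags(word, tags):
    """Rebuild the syllables of `word` from a per-letter N/B tag sequence."""
    syllables, current = [], ""
    for letter, tag in zip(word, tags):
        current += letter
        if tag == "B":  # a boundary follows this letter: close the syllable
            syllables.append(current)
            current = ""
    if current:  # the remaining letters form the last syllable
        syllables.append(current)
    return syllables


tags = nb_tags(["wa", "ter"])
print(tags)                                # ['N', 'B', 'N', 'N', 'N']
print(syllables_from_tags("water", tags))  # ['wa', 'ter']
```

Under this encoding, syllabifying a word amounts to predicting one tag per letter, which is what allows the task to be treated as sequence labelling.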
For the Romanian language, Dinu (2004) presents the
process of building a database of the syllables in the
language, while Barbu (2008) introduces a syllabification
dictionary that also covers morphological aspects. More
recently, Conditional Random Fields have been applied to
Romanian syllabification (Dinu et al., 2013), reaching a
word-level accuracy above 95%.
3 SYLLABLE-BASED SEQUENCE PATTERN MINING
This section presents the basic notions on which the
proposed syllabification solution is built, by answering
the following questions: What is a syllable?
Is it possible to syllabify words based on a strict set of
rules? What are frequent sequence patterns and how
could they be applied to words?
3.1 Syllable. Syllabification
Syllabification can be viewed as a procedure which
receives as input a word and outputs a sequence of
parts of that word, called syllables.
A syllable is a unit of organization for a sequence
of speech sounds. For example, the word water is
composed of two syllables: wa and ter. A syllable is
typically made up of a syllable nucleus (most often
a vowel) with optional initial and final margins
(typically, consonants).¹
Note that the concept of syllable is not defined by
a rigid set of rules, and multiple types of
syllabification are possible. The root causes of this
ambiguity are the following:
1. languages are not rigid artificial constructs and
ambiguity is one attribute of natural constructs.
2. the use-cases for syllabification vary: phonetic
syllabification may differ from the orthographic
syllabification (also known as hyphenation).
The definition above is essentially a phonetic
definition of the syllable, whereas orthographic rules
may even have aesthetic roots. To illustrate the
difference, consider the following example:
From a phonetic perspective, the Romanian word
inegal is split as i-ne-gal, but its hyphenated form is
in-e-gal.
3.2 Mining Frequent Sequence Patterns
Frequent sequence patterns are patterns that occur at
least a minimum number of times (minimum support)
within a collection of sequences.
We introduce the formal definition of such pat-
terns and then we show how syllabified words could
be mined for frequent sequence patterns.
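As an informal illustration of minimum support, the sketch below enumerates patterns by brute force. It rests on several assumptions that are not stated in the text: patterns are taken to be contiguous runs of elements, the support of a pattern is the number of sequences in which it occurs at least once, and the function name frequent_patterns and its max_len parameter are hypothetical. It is not meant to reproduce any particular mining algorithm.

```python
from collections import Counter


def frequent_patterns(sequences, min_support, max_len=3):
    """Return {pattern: support} for contiguous patterns of up to max_len elements.

    Assumption: support counts the sequences that contain the pattern,
    so repeated occurrences inside one sequence are counted once.
    """
    support = Counter()
    for seq in sequences:
        seen = set()  # patterns found in this sequence, each counted once
        for start in range(len(seq)):
            for length in range(1, max_len + 1):
                if start + length > len(seq):
                    break
                seen.add(tuple(seq[start:start + length]))
        support.update(seen)
    return {p: c for p, c in support.items() if c >= min_support}


# Toy sequences over the alphabet {a, b, c, d}.
toy = ["bba", "abc", "dba"]
result = frequent_patterns(toy, min_support=2, max_len=2)
# ('b', 'a') occurs in two of the three sequences, so it is kept.
print(result[("b", "a")])  # 2
```

Exhaustive enumeration is only practical for short sequences such as words; for larger collections a dedicated sequence mining algorithm would be the more reasonable choice.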
Let $I = \{i_1, i_2, \ldots, i_n\}$ be an alphabet (a set of
elements) for a collection of sequences. A sequence $s$ is
defined as an ordered collection of elements from $I$:
$$s = \langle i_{k_1}, i_{k_2}, \ldots, i_{k_m} \rangle, \quad \text{where } \forall j,\ 1 \le j \le m,\ i_{k_j} \in I. \tag{1}$$
For example: let I = {a, b, c, d}, a possible se-
quence is < b, b, a >.
A sequence dataset $S_{db}$ is a set of tuples of the
form $(id, s)$, where $id$ is a unique identifier and $s$ is a
sequence.
For example, such a dataset is illustrated in Table 1.
A sequence $s_1 = \langle t_1, t_2, \ldots, t_n \rangle$ is contained by
$s_2 = \langle l_1, l_2, \ldots, l_m \rangle$ if $n \le m$ and $\exists i, j$ such that
$t_1 = l_i$, $t_n = l_j$ and $\forall k,\ i \le k \le j \implies t_{k-i+1} = l_k$.
We will use $s_1 \subseteq s_2$ to denote that $s_2$ contains $s_1$.
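The containment relation can be checked directly in code. The sketch below reads the definition as requiring the elements of the first sequence to appear in the second as one contiguous block; the function name is_contained is illustrative, and treating the empty sequence as contained in every sequence is an assumption, since the definition does not cover that case.

```python
def is_contained(s1, s2):
    """Return True if s1 occurs in s2 as a contiguous block of elements."""
    s1, s2 = list(s1), list(s2)
    n, m = len(s1), len(s2)
    if n > m:  # a longer sequence cannot be contained in a shorter one
        return False
    # Try every start position i in s2 and compare the window of length n.
    return any(s2[i:i + n] == s1 for i in range(m - n + 1))


print(is_contained(["b", "a"], ["b", "b", "a"]))  # True
print(is_contained(["a", "b"], ["b", "b", "a"]))  # False
```

The first call succeeds because the window starting at the second element of the longer sequence matches; the second fails because no window of length two equals the candidate.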
¹ https://en.wikipedia.org/wiki/Syllable