the state merging phase. Condition 2(b) is not guaranteed if the transduction scheme is not a total function (Oncina et al., 1993), i.e., it is not sufficient for making the merge decision when the transduction scheme is a partial function. To overcome this issue, our proposed algorithm uses the frequencies of the given data. Condition 2(a) ensures that the relative frequencies of the observed data are sufficient to distinguish non-mergeable states by means of the Hoeffding bound. The inequality shown in condition 2(a) follows from Inequality 2.
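For concreteness, the frequency-based compatibility test can be sketched as follows. This is a minimal illustration in the style of the Hoeffding test of Carrasco and Oncina (1994); the function name is ours and the exact form of Inequality 2 may differ, so the threshold below should be read as an assumption rather than as the precise test used in APTI2.

from math import sqrt, log

def hoeffding_compatible(f1, n1, f2, n2, delta):
    """Return True if the relative frequencies f1/n1 and f2/n2 are
    statistically indistinguishable at confidence parameter delta
    (ALERGIA-style Hoeffding test; the inequality in condition 2(a)
    may differ in its constants)."""
    if n1 == 0 or n2 == 0:
        return True  # no evidence available to distinguish the states
    bound = sqrt(0.5 * log(2.0 / delta)) * (1.0 / sqrt(n1) + 1.0 / sqrt(n2))
    return abs(f1 / n1 - f2 / n2) < bound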
Notice that condition 2(a) depends on the value of δ. Carrasco and Oncina (1994) have discussed how a large or a small value of δ affects the merge decision. Basically, if the size of the learning sample $S_n$ is significantly large, one can keep δ negligibly small. On the contrary, for a relatively small learning sample $S_n$, δ needs to be sufficiently large.
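As a rough numerical illustration of this trade-off, using the hypothetical test sketched above with equal counts on both states:

# delta = 0.05: the distinguishing threshold tightens as the sample grows.
hoeffding_compatible(30, 100, 45, 100, delta=0.05)     # bound ~ 0.27, difference 0.15 -> states look compatible
hoeffding_compatible(3000, 10000, 4500, 10000, 0.05)   # bound ~ 0.027, difference 0.15 -> merge rejected

Intuitively, with a small sample only large frequency gaps block a merge, which is why δ must then be chosen larger; with a large sample even small gaps are detected, so δ can safely be kept small.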
The runtime complexity of algorithm APTI2 is given by $O(\|S_n\|^3 (m+n) + \|S_n\| m n)$, where $\|S_n\| = \sum_{(u,v) \in S_n} |u|$ and $m = \max\{|u| : (u,v) \in S_n\}$. The FPTST can be built in linear time w.r.t. $\|S_n\|$.
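For clarity, these two quantities are computed directly from the sample; below is a minimal sketch, assuming the sample is represented as a list of (input, output) string pairs (the function name is ours):

def sample_size_and_max_input(sample):
    """sample: a list of (u, v) string pairs.
    Returns (||S_n||, m): the total input length and the length of
    the longest input string, as defined above."""
    total_input_length = sum(len(u) for u, _ in sample)   # ||S_n||
    max_input_length = max(len(u) for u, _ in sample)     # m
    return total_input_length, max_input_length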
We will now analyze the outermost while loop of the APTI2 algorithm. Pessimistically, there are at most $\|S_n\|$ states in the FPTST. In the worst case, if no merges are accepted, there are $O(\|S_n\|)$ executions of the outermost while loop and $O(\|S_n\|^2)$ executions of the inner for loop, resulting in $O(\|S_n\|^3)$ executions of the core algorithm. In each of these executions, the lcp operation can be implemented in $O(m)$ time and the pushback operation in $O(n)$ time. Assuming that all arithmetic operations are computed in unit time, the total core operation of APTI2 can be bounded by $O(\|S_n\|^3 (m+n) + \|S_n\| m n)$. This runtime complexity is pessimistic, and the runtime of APTI2 is much lower in practice. Experimental evidence of the runtime of APTI2 is presented in the next section.
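To make the per-iteration costs concrete, the two string primitives mentioned above can be realized as follows. This is a sketch of the standard string-level operations behind OSTIA-style learning, not the exact routines of APTI2, and the names are our own:

def lcp(x, y):
    """Longest common prefix of two strings; runs in O(m) time,
    where m bounds the input lengths."""
    i = 0
    while i < len(x) and i < len(y) and x[i] == y[i]:
        i += 1
    return x[:i]

def push_back(output, emitted_prefix):
    """Strip an already-emitted prefix from an output string so the
    remaining suffix can be pushed further down the transducer;
    runs in O(n) time, where n bounds the output lengths."""
    assert output.startswith(emitted_prefix)
    return output[len(emitted_prefix):]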
7 EXPERIMENTAL RESULTS
We conduct our experiments with two types of data sets: 1) artificial data sets generated from random transducers, and 2) data generated from the Miniature Language Acquisition (MLA) task (Feldman et al., 1990) adapted to English-French translations.
For the artificial data sets, we first generate a random PST with $m$ states. The states are numbered from $q_0$ to $q_{m-1}$, where $q_0$ is the initial state. The states are connected randomly, and the labels on the transitions preserve the deterministic property. The unreachable states are then removed. The outputs are assigned as random strings drawn from a uniform distribution over $\Omega^{\leq k}$, for an arbitrary value of $k$. The probabilities of the edges are assigned randomly while making sure that the following condition holds:

$$\forall q_i \in Q, \quad \sum_{e \in E[q_i]} \mathrm{prob}[e] = 1 \qquad (3)$$
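A minimal sketch of this generation procedure is given below. The representation (a dictionary mapping each state and input symbol to a target state, an output string, and a probability) and the helper name are our own choices; for simplicity the sketch gives every state one outgoing edge per input symbol, approximates the uniform draw over $\Omega^{\leq k}$ by a uniform length followed by uniform symbols, and omits the removal of unreachable states and any stopping mechanism:

import random

def random_pst(num_states, sigma, omega, k, seed=0):
    """Generate a random deterministic transducer skeleton:
    transitions[q][a] = [next_state, output_string, probability],
    with the edge probabilities of every state summing to 1 (Eq. 3)."""
    rng = random.Random(seed)
    transitions = {q: {} for q in range(num_states)}
    for q in range(num_states):
        for a in sigma:                      # one edge per input symbol keeps the machine deterministic
            target = rng.randrange(num_states)
            out_len = rng.randint(0, k)      # output drawn from Omega^{<=k}
            output = "".join(rng.choice(omega) for _ in range(out_len))
            transitions[q][a] = [target, output, 0.0]
        weights = [rng.random() for _ in sigma]
        total = sum(weights)
        for a, w in zip(sigma, weights):     # normalize so that the probabilities leaving q sum to 1
            transitions[q][a][2] = w / total
    return transitions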
Using the target PST, the training sample is generated by following the paths of the PST. The test data is generated in a similar manner. In order to test the algorithm with unseen examples, we make sure that the test set and the training set are disjoint.
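The sampling step can be sketched as a probability-weighted random walk over the transition dictionary produced above; the randomly chosen walk length is our own simplification, since the stopping criterion of the actual generator is not detailed here:

def sample_pair(transitions, rng, max_len=20):
    """Draw one (input, output) pair by walking the random PST and
    choosing each outgoing edge according to its probability.
    The walk length is capped artificially to keep the pair finite."""
    q, u, v = 0, "", ""
    for _ in range(rng.randint(1, max_len)):
        symbols = list(transitions[q])
        weights = [transitions[q][a][2] for a in symbols]
        a = rng.choices(symbols, weights=weights, k=1)[0]
        target, output, _ = transitions[q][a]
        u += a
        v += output
        q = target
    return u, v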
As a measure of correctness we compute two metrics: word error rate (WER) and sentence error rate (SER). Intuitively, the WER is the percentage of symbol errors in the hypothesis translation w.r.t. the reference translation. For each test pair, the Levenshtein distance (Levenshtein, 1966) between the reference translation and the hypothesis translation is computed and divided by the length of the reference string. The mean of the scores computed over all test pairs is reported as the WER. The SER, on the other hand, is stricter: it is the percentage of hypothesis translations that do not exactly match their reference translations.
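A minimal sketch of these two metrics is given below, assuming translations are plain strings of output symbols and non-empty references; the Levenshtein routine is the textbook dynamic program, not necessarily the implementation used in our experiments:

def levenshtein(a, b):
    """Classical edit distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def wer_ser(references, hypotheses):
    """references, hypotheses: aligned lists of translation strings.
    Returns (WER, SER) in percent."""
    wer_sum, wrong = 0.0, 0
    for ref, hyp in zip(references, hypotheses):
        wer_sum += levenshtein(ref, hyp) / len(ref)   # normalized by reference length
        wrong += (hyp != ref)
    n = len(references)
    return 100.0 * wer_sum / n, 100.0 * wrong / n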
The objectives of the first experiment (see Fig-
ure 2(a) and Figure 2(b)) are to demonstrate the cor-
rectness of our algorithm and to study the practical
runtime. Figure 2(a) shows experiments conducted
on randomly generated PSTs with 5 states and |Σ| =
|Ω| = 2. We start with a training sample of size 200 and keep incrementing the training size by 200 up to a size of 20000. For every training size the experiment is repeated 10 times by generating new datasets, and the mean of these 10 results is reported. We have conducted the experiment for 10 random PSTs; thus, in total we have conducted 10000 trials. Figure 2(a) shows the mean of the results obtained over the 10 random PSTs. As Figure 2(a) shows, the error rate approaches zero, and, as expected, the WER in most cases remains below the SER. The execution time
for this experiment is reported in Figure 2(b). The
results of this experiment tell us that APTI2 achieves an acceptable error rate with 5000 training examples, and from a training size of 10000 the error rate is close to zero. The runtime in practice is much lower than the theoretical bound and almost linear.
In our second experiment (Figures 2(c) and 2(d)),
we aim to learn slightly larger PSTs. In this second experiment we use randomly generated PSTs with 10 states and |Σ| = |Ω| = 2. We start with a training sample of size 1000 and keep incrementing the training size by 1000 up to a size of 40000. As in the previous experiment, the experiments for each training sample size are repeated 10 times, and the experiment is conducted for 10 random PSTs. The results of this ex-