benchmark, only the S_k model failed to achieve perfect scores. Clearly, there are no significant differences between the classifiers, as all four performed identically for every model.
Table 5: Best accuracy and corresponding F1-score metrics obtained by the NFAs for the analyzed classifiers.

Model      |        Accuracy        |        F1-score
           | C_MM  C_MA  C_SM  C_SA | C_MM  C_MA  C_SM  C_SA
P_k        | 1.00  1.00  1.00  1.00 | 1.00  1.00  1.00  1.00
P_(k+2)    | 1.00  1.00  1.00  1.00 | 1.00  1.00  1.00  1.00
P⋆_k       | 1.00  1.00  1.00  1.00 | 1.00  1.00  1.00  1.00
P⋆_(k+2)   | 1.00  1.00  1.00  1.00 | 1.00  1.00  1.00  1.00
S_k        | 0.91  0.91  0.91  0.91 | 0.90  0.90  0.90  0.90
S_(k+2)    | 1.00  1.00  1.00  1.00 | 1.00  1.00  1.00  1.00
S⋆_k       | 1.00  1.00  1.00  1.00 | 1.00  1.00  1.00  1.00
S⋆_(k+2)   | 1.00  1.00  1.00  1.00 | 1.00  1.00  1.00  1.00
5 CONCLUSIONS
In this paper, we have proposed a method to transform an NFA with three types of states (accepting, rejecting, and non-conclusive) into a weighted frequency automaton, which can be further transformed into a probabilistic NFA. The transformation process is generic, since it allows one to control the relative importance of the different types of states and/or transitions through customizable weights.
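To make the idea concrete, the following Python sketch shows one plausible form of the transformation; it is not the paper's exact algorithm, and the names to_probabilistic, freq, state_kind, and weights are hypothetical. Each transition's observed frequency is scaled by a weight attached to the type of its target state, and the scaled values are then normalized per source state to yield transition probabilities.

# A minimal sketch, not the paper's exact algorithm: derive transition
# probabilities from weighted transition frequencies. All names are
# illustrative assumptions.
from collections import defaultdict

def to_probabilistic(transitions, freq, state_kind, weights):
    # transitions: set of (src, symbol, dst) triples of the NFA.
    # freq[t]: how often transition t is used when parsing the sample.
    # state_kind[q]: 'accepting' | 'rejecting' | 'non-conclusive'.
    # weights: tunable importance factor per state kind.
    weighted = {t: weights[state_kind[t[2]]] * freq[t] for t in transitions}
    totals = defaultdict(float)
    for (src, sym, dst), w in weighted.items():
        totals[src] += w  # total weighted mass leaving each state
    # Normalize so that outgoing probabilities of each state sum to 1.
    return {t: (w / totals[t[0]] if totals[t[0]] else 0.0)
            for t, w in weighted.items()}

# Example: favoring transitions that enter accepting states.
prob = to_probabilistic(
    transitions={(0, 'a', 1), (0, 'a', 2)},
    freq={(0, 'a', 1): 3, (0, 'a', 2): 1},
    state_kind={1: 'accepting', 2: 'rejecting'},
    weights={'accepting': 2.0, 'rejecting': 0.5, 'non-conclusive': 1.0})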
We have evaluated the proposed probabilistic automata on a classification task over two distinct benchmarks. The first one, based on real-life samples of peptide sequences, proved to be quite challenging, yielding relatively low quality metrics. The second benchmark, based on a random sampling of a language described by a regular expression, enabled us to show the power of probabilistic NFAs, producing accuracy scores of 0.81–1.00 and F1-scores ranging from 0.69 to 1.00. It also allowed us to show that, given a representative sample of the underlying language, a probabilistic NFA can achieve very good classification quality even without sophisticated parameter tuning.
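To illustrate how a probabilistic NFA can act as a classifier, the sketch below propagates probability mass through the automaton and accepts a word when the mass reaching accepting states exceeds a threshold. This forward-style computation is one plausible reading, not necessarily how the paper combines probabilities; classify, initial, and accepting are assumed names.

# Illustrative sketch only: score a word with a probabilistic NFA by
# forward propagation of probability mass over its transitions.
def classify(word, prob, initial, accepting, threshold=0.5):
    # prob: dict mapping (src, symbol, dst) to a transition probability.
    mass = {initial: 1.0}
    for sym in word:
        nxt = {}
        for (src, s, dst), p in prob.items():
            if s == sym and src in mass:
                # Accumulate mass over all nondeterministic paths.
                nxt[dst] = nxt.get(dst, 0.0) + mass[src] * p
        mass = nxt
    score = sum(p for q, p in mass.items() if q in accepting)
    return score >= threshold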
In the future, we plan to apply heuristics to tune the weights so that the classifiers perform even better, especially on real-life benchmarks. Given the generic nature of the proposed weighted frequency automata, we also plan to consider a parallel ensemble of classifiers, differing not only in their weights, but also in how probabilities are combined.
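Such an ensemble could be sketched as follows, under the assumption that each member maps a word to a score in [0, 1]; simple averaging stands in here for the probability-combination schemes to be explored.

# Hypothetical ensemble sketch: combine the scores of several
# probabilistic-NFA classifiers, e.g. built with different weights.
def ensemble_classify(word, scorers, threshold=0.5):
    # scorers: callables mapping a word to a probability in [0, 1].
    scores = [score(word) for score in scorers]
    return sum(scores) / len(scores) >= threshold  # averaged vote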