WANN-TAGGER
A Weightless Artificial Neural Network Tagger for the Portuguese Language
Hugo C. C. Carneiro, Felipe M. G. França
Systems Engineering and Computer Science Program/COPPE, Universidade Federal do Rio de Janeiro
Caixa Postal 68511, 21941-972, Rio de Janeiro, RJ, Brazil
Priscila M. V. Lima
Department of Mathematics, Universidade Federal Rural do Rio de Janeiro, BR-465, Km 47
Campus Universitário, 23890-000, Seropédica, RJ, Brazil
Keywords: WiSARD, DRASiW, POS-Tagger, Weightless artificial neural networks.
Abstract: Weightless Artificial Neural Networks have proved to be a promising paradigm for classification tasks. This
work introduces the WANN-Tagger, which makes use of weightless artificial neural networks to label
Portuguese sentences, tagging each of their terms with its respective part-of-speech. A first experimental
evaluation using the CETENFolha corpus indicates the usefulness of this paradigm, showing that it
outperforms traditional feedforward neural networks in both accuracy and training time and that it is
competitive in accuracy with the Hidden Markov Model in some cases. Additionally, the WANN-Tagger
proves capable of incrementally learning new tagged sentences during runtime.
1 INTRODUCTION
In many fields involving text mining and information extraction, several steps must be taken in order to extract the main information from a text. The first of these steps is part-of-speech tagging, which consists of assigning a grammatical tag to each word in a sentence.
This paper presents the WANN-Tagger, a Weightless Artificial Neural Network (WANN) model that makes use of the WiSARD (Wilkie, Stonham and Aleksander's Recognition Device) architecture and its advantages (Aleksander et al., 1984) in order to create a tagger that is both faster and more accurate than traditional connectionist models, and that is also capable of learning new tagged sentences almost instantly during runtime.
The next section introduces some basic concepts and provides background knowledge for the subsequent sections: in Section 2, the concepts of weightless artificial neural networks, the WiSARD model and DRASiW (an important WiSARD generalization) are discussed, and some related works are described. In Section 3 the specification of the WANN-Tagger is presented, and its application to POS tagging is explained together with a discussion of the main changes made to the original WiSARD model so that it can perform POS tagging. An experimental framework is presented in Section 4, which explains how the CETENFolha corpus (Linguateca, 2009) is used to train the WANN-Tagger, compares the tagger with a feedforward neural network and with a Hidden Markov Model, and discusses the experimental results. Section 5 offers some conclusions on the effectiveness of the WANN-Tagger.
2 BASIC CONCEPTS
2.1 Weightless Artificial Neural Networks and the WiSARD Model
Weightless Artificial Neural Networks (WANNs) are a family of artificial neural network models in which no synaptic weight adjustment takes place during the training phase. The main idea behind a WANN is that knowledge does not reside strictly in the synapses.
There is a large variety of WANNs, for instance: WiSARD (Aleksander et al., 1984); Sparse
Distributed Memory (Kanerva, 1988); Probabilistic Logic Nodes (Aleksander and Kan, 1987); Goal Seeking Neuron (Carvalho Filho et al., 1991); Generalizing Random Access Memory (Aleksander, 1990); General Neural Unit (Aleksander and Morton, 1991).
In this paper, the pioneering WiSARD (Wilkie, Stonham and Aleksander's Recognition Device) model (Aleksander et al., 1984) is used. It employs a set of N RAM-discriminators (see Figure 1), where N is the number of existing classes, to determine
which class a pattern belongs to. Each RAM-discriminator consists of a set of X Boolean RAMs, or Boolean neurons, with n input bits each, and a summation device Σ (see Figure 2). Each RAM-discriminator (or simply discriminator) receives nX inputs that are pseudo-randomly obtained from the input retina, the binary vector (see Figure 2) used in both the training and recognition phases of the WiSARD model. The summation device integrates all X Boolean RAM outputs and is activated during the recognition phase.
2.1.1 Training Phase
In order to train a target discriminator, one must first set all of its memory locations to “0”; then, for each dataset entry, a “1” is written in the memory locations addressed by the given input pattern. Unlike artificial neural network models based on synaptic weights, the training phase of a given discriminator ends after all the input entries of the corresponding training set have been presented exactly once; no recursive iterations are required to achieve convergence.
2.1.2 Recognition Phase
Once all discriminators are trained, one can present an unseen input pattern to all of them. Each discriminator produces a response r to that input, and the discriminator with the highest response indicates the class the input pattern belongs to, with confidence d (see Figure 1), which is usually defined as the difference between the two highest discriminator responses. The performance of WiSARD strongly depends on n: its generalisation capability grows as n decreases.
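For concreteness, the following is a minimal Python sketch of the two phases (an illustration, not the authors' implementation; the class names and the fixed-seed pseudo-random mapping are assumptions):

```python
import random

class Discriminator:
    """One WiSARD RAM-discriminator: X Boolean RAMs with n inputs each."""
    def __init__(self, retina_size, n, seed=42):
        assert retina_size % n == 0          # X = retina_size / n RAMs
        rng = random.Random(seed)            # same mapping for every discriminator
        bits = list(range(retina_size))
        rng.shuffle(bits)                    # pseudo-random retina-to-RAM mapping
        self.tuples = [bits[i:i + n] for i in range(0, retina_size, n)]
        self.rams = [set() for _ in self.tuples]   # addresses holding "1"

    def _addresses(self, retina):
        for ram, tup in zip(self.rams, self.tuples):
            yield ram, tuple(retina[i] for i in tup)

    def train(self, retina):
        for ram, addr in self._addresses(retina):
            ram.add(addr)                    # write "1" at the addressed location

    def response(self, retina):
        # number of RAMs whose addressed location holds "1"
        return sum(addr in ram for ram, addr in self._addresses(retina))

def classify(discriminators, retina):
    """Pick the class of the highest-responding discriminator; the confidence
    d is the difference between the two highest responses."""
    scores = {c: d.response(retina) for c, d in discriminators.items()}
    ranked = sorted(scores.values(), reverse=True)
    return max(scores, key=scores.get), ranked[0] - ranked[1]
```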
2.2 DRASiW
When one trains on a large dataset, ties usually occur, i.e., different discriminators produce the same (highest) response during the recognition phase. In order to counter this saturation effect (also named overfitting), DRASiW (Soares et al., 1998), an extension of the WiSARD model, is provided with the ability to produce pattern examples, or prototypes, derived from learned categories, which has proved helpful in such a disambiguation process (Grieco et al., 2010).
Figure 1: A 10 RAM-discriminator architecture (from
(Grieco et al., 2010)).
Figure 2: The retina and a RAM-discriminator (from
(Grieco et al., 2010)).
Figure 3: Training samples and the respective mental
image.
A very simple modification to the WiSARD training process is needed in order to have DRASiW disambiguating discriminator responses: when an input pattern is presented, the RAM locations addressed by it are incremented by 1, instead of simply storing a “1” there. As a result, RAM locations that are accessed more often during the training step hold higher values than those accessed less often.
The change above allows the WiSARD model to build “mental images” that are no longer black (“1”) and white (“0”), but “greyscale” (see Figure 3). To retrieve a black-and-white mental image one must, at the end of the training step, assume the value “1” in the memory locations whose contents are higher than or equal to a luminance threshold l, and the value “0” in the other memory locations. Having l = 0 as an initial condition, one can iteratively increase l upon detecting ties between discriminator responses, until l reaches a value at which a single winner discriminator exists (Grieco et al., 2010).
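A sketch of this counting variant and of the iterative threshold increase is given below; it reuses the illustrative interface of the previous fragment (the input-to-RAM mapping tuples is assumed to be generated as before):

```python
from collections import Counter

class DRASiWDiscriminator:
    """A WiSARD discriminator whose RAM locations count accesses."""
    def __init__(self, tuples):
        self.tuples = tuples                     # pseudo-random retina mapping
        self.rams = [Counter() for _ in tuples]  # access counts per address

    def train(self, retina):
        for ram, tup in zip(self.rams, self.tuples):
            ram[tuple(retina[i] for i in tup)] += 1   # increment, not just set

    def response(self, retina, l=1):
        # a RAM fires only if its addressed count >= luminance threshold l;
        # l = 1 reproduces the original WiSARD behaviour
        return sum(ram[tuple(retina[i] for i in tup)] >= l
                   for ram, tup in zip(self.rams, self.tuples))

def classify_increasing_l(discriminators, retina):
    """Raise l by one until a single winner remains (Grieco et al., 2010)."""
    l = 1
    while True:
        scores = {c: d.response(retina, l) for c, d in discriminators.items()}
        top = max(scores.values())
        winners = [c for c, s in scores.items() if s == top]
        if len(winners) == 1 or top == 0:   # resolved, or unresolvable tie
            return winners[0]
        l += 1
```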
2.3 Related Works
Most part-of-speech (POS) taggers are implemented
using probabilistic and statistical frameworks, e.g.
Hidden Markov Models (Rabiner, 1989)
(Villavicencio et al., 1995), Maximum Entropy
Markov Models (McCallum et al., 2000) and
Conditional Random Fields (Lafferty et al., 2001).
These approaches, although discovering parts-of-speech with high accuracy, lack a cognitive background and do not represent how human beings deduce the parts-of-speech of a sentence.
Other approaches include rule-based taggers, such as the Constraint Grammar approach (Karlsson et al., 1995), in which the tags are obtained based on the word being tagged and on other constraints. There are yet others, such as transformation-based learning (TBL) taggers (Brill, 1995) and neural taggers (Schmid, 1994) (Marques and Lopes, 1996). The former work like rule-based taggers, but their rules are automatically induced from the data, as in some HMM taggers. The latter make use of artificial neural networks, mainly feedforward ones.
3 WANN-TAGGER
3.1 General Architecture
The WANN-Tagger is a part-of-speech tagger that makes use of the WiSARD model in order to tag sentences in Portuguese. A word is represented as a probability vector with values p(tag_i | word) when the word is relevant to the corpus, i.e., it appears in at least 0.5% of the corpus, and as p(tag_i | word suffix) otherwise.
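These probabilities are relative frequencies estimated from a tagged corpus. A minimal sketch of such a lexicon follows; the 0.5% relevance criterion comes from the paper, while the suffix length and the lower-casing are assumptions made for illustration:

```python
from collections import Counter, defaultdict

def build_lexicon(tagged_corpus, tagset, min_rel_freq=0.005, suffix_len=3):
    """tagged_corpus: a list of sentences, each a list of (word, tag) pairs.
    Returns lookup(word) -> probability vector [p(tag_i | word or suffix)]."""
    word_tag = defaultdict(Counter)
    suffix_tag = defaultdict(Counter)
    total = 0
    for sentence in tagged_corpus:
        for word, tag in sentence:
            w = word.lower()                    # lower-casing is an assumption
            word_tag[w][tag] += 1
            suffix_tag[w[-suffix_len:]][tag] += 1   # suffix_len is an assumption
            total += 1

    def prob_vector(counts):
        s = sum(counts.values())
        return [counts[t] / s if s else 0.0 for t in tagset]

    def lookup(word):
        w = word.lower()
        if sum(word_tag[w].values()) >= min_rel_freq * total:  # "relevant" word
            return prob_vector(word_tag[w])
        return prob_vector(suffix_tag[w[-suffix_len:]])        # suffix fallback
    return lookup
```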
Besides the probability of a word having a particular tag, the WANN-Tagger also takes into account the context in which the word appears. To this end, the WANN-Tagger employs a context window comprising the b words before and the a words after the current one, plus the current word itself. Thus, the WANN-Tagger input is composed of a+b+1 words, i.e. a+b+1 vectors with N probabilities each. When there are fewer than b words before the current word, or fewer than a words after it, the probability vectors corresponding to the positions of the context window that lack words are filled with zeros.
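A sketch of the window construction (the function name and the zero-padding layout are illustrative):

```python
def context_window(prob_vectors, i, b=1, a=1, num_tags=15):
    """Concatenate the probability vectors of words i-b .. i+a around word i,
    zero-padding the window positions that fall outside the sentence."""
    zeros = [0.0] * num_tags
    window = []
    for j in range(i - b, i + a + 1):
        window.extend(prob_vectors[j] if 0 <= j < len(prob_vectors) else zeros)
    return window    # (a + b + 1) * num_tags probabilities
```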
The WANN-Tagger, as a Boolean neural network, does not support floating-point inputs, so the probability vectors must be discretized. In this discretization process, probabilities are encoded as Boolean vectors according to the following condition: given two distinct probabilities, the one with the higher value is encoded as a Boolean vector with more 1's than the other.
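The paper does not name the encoding, but the condition above describes a thermometer (unary) code; a sketch with discretization degree d (the parameter name follows Section 3.2, the rounding rule is an assumption):

```python
def thermometer(p, d=8):
    """Encode a probability p in [0, 1] as d bits: a higher p yields more 1's."""
    ones = round(p * d)
    return [1] * ones + [0] * (d - ones)

def discretize(window, d=8):
    """Flatten a window of probabilities into the Boolean input retina."""
    retina = []
    for p in window:
        retina.extend(thermometer(p, d))
    return retina
```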
The WANN-Tagger has a large variety of classes, which are listed in its tagset. As mentioned in the previous section, ties often happen during the classification step when a WiSARD model trained on a large dataset is used. In order to avoid ties, the WANN-Tagger extensively uses the DRASiW mechanism to find the best value of the luminance threshold l that prevents them.
In (Grieco et al., 2010) it is stated that these ties can be avoided by gradually increasing the luminance threshold. However, when one uses a large variety of classes, this can take too long before no tie is found. To avoid this cost, a “binary search” luminance threshold adjustment is used, an iterative process consisting of the following steps (a code sketch is given after the list):
- In the first iteration of the “binary search” luminance threshold adjustment, set the maximum threshold to the number of times the most accessed memory location was accessed; test the WANN-Tagger using l = 1, i.e., if an entry in a memory location is greater than or equal to “1”, assume the value “1” in that memory location, otherwise assume the value “0”; check which discriminator produced the highest response and call this value V.
- If there were ties in the previous iteration and the greatest value was V, set the maximum threshold to the arithmetic mean of the minimum threshold and the former maximum threshold, and test the WANN-Tagger with the newly adjusted maximum threshold as l.
- If the greatest result in the previous iteration was less than V, set the minimum threshold to the arithmetic mean of the former minimum
threshold and the maximum threshold, and test the WANN-Tagger with the newly adjusted minimum threshold as l.
This process is repeated until a result is obtained in which the greatest value is V and there is no tie.
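The description above leaves some details open (in particular, the direction of the minimum/maximum updates), so the following sketch implements one coherent reading of it: binary-search for the largest l at which the best response still equals the reference value V obtained at l = 1, in the hope that the tie breaks along the way. It assumes the DRASiW interface sketched in Section 2.2:

```python
def tag_with_binary_bleaching(discriminators, retina):
    """One reading of the 'binary search' threshold adjustment (a sketch,
    not the authors' exact procedure)."""
    def best(l):
        scores = {c: d.response(retina, l) for c, d in discriminators.items()}
        top = max(scores.values())
        return top, [c for c, s in scores.items() if s == top]

    V, winners = best(1)                        # reference response at l = 1
    lo = 1                                      # largest l known to reach V
    hi = max(max(ram.values(), default=1)       # count of the most accessed
             for d in discriminators.values()   # memory location
             for ram in d.rams)
    while len(winners) > 1 and lo < hi:
        mid = (lo + hi + 1) // 2                # midpoint, rounded up
        top, cands = best(mid)
        if top == V:
            lo, winners = mid, cands            # V still reached: raise lo
        else:
            hi = mid - 1                        # responses fell below V
    return winners[0]
```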
3.2 Architecture for Portuguese
The WANN-Tagger has been implemented and tested for Portuguese, in order to verify its capability of accurately tagging a highly inflected language with a somewhat free word order.
Several experiments have been carried out to discover the WANN-Tagger parameter values that maximize its accuracy. They employ a dataset containing 2000 sentences randomly extracted from the CETENFolha corpus (Linguateca, 2009) and vary the context window, the probability discretization degree and the number of inputs of each RAM in the RAM-discriminators.
The results show that the context window with which the WANN-Tagger obtains the best accuracy is W = (1,1), that the best probability discretization degree is d = 8, and that the best number of inputs in each RAM of the RAM-discriminators is n = 16. Any context window larger than W = (1,1), or values of d and n greater than those obtained in the experiments, cause the WANN-Tagger to overfit.
4 METHODOLOGY AND EXPERIMENTS
4.1 Dataset
In order to test the WANN-Tagger, different samples from the CETENFolha corpus (Linguateca, 2009) have been used. These samples were created by extracting random sentences from the corpus: the corpus was divided into N parts with the same number of sentences, and one random sentence was extracted from each part. The sample sizes grow logarithmically.
The tagset used in this paper is a simplified version of the CETENFolha corpus tagset, as shown in Table 1. This simplified version was created because the CETENFolha tagset, produced by PALAVRAS (Bick, 2000), contains very fine-grained tags, which encode even gender and number (or finiteness, person, tense and mood in the case of a verb).
Table 1: The tagset used in the experiments.

Tag   Description          Tag   Description
PROP  Proper Noun          ADV   Adverb
N     Noun                 PRP   Preposition
PERS  Personal Pronoun     KC    Coordinate Conjunction
POSS  Possessive Pronoun   KS    Subordinate Conjunction
DET   Determiner           NUM   Numeral
ADJ   Adjective            IN    Interjection
V     Verb                 PT    Punctuation
SPEC  Specifier
4.2 Experiments
To assess the effectiveness of the WANN-Tagger, it has been compared with (i) a feedforward neural network trained via backpropagation, and (ii) a Hidden Markov Model (HMM).
The feedforward neural network uses the architecture presented in (Schmid, 1994). As in the WANN-Tagger, the probabilities of a word (or a suffix) possessing a particular tag are obtained from their relative frequencies in the corpus. The epoch size is equal to the number of sentences of each sample used in the experiments. The learning rate is constant and equal to 0.001. The halting criteria are either the convergence of the training or reaching 1000 epochs.
Several experiments have been carried out, and their results show that the context window that best fits the feedforward neural network model is W = (1, 1), i.e., one word before the current one and one word after it, and that the number of nodes in the hidden layer producing the best accuracy is n = 14. Together, these results imply that the feedforward topology presenting the best accuracy rates is 45-14-15 (45 inputs, corresponding to 3 words times 15 tag probabilities; 14 hidden nodes; 15 outputs).
The Hidden Markov Model used is the one provided by LingPipe (Alias-i, 2009), a Java library suite for natural language processing. It was set to use Laplace smoothing, which allows it to estimate the part-of-speech of words absent from its vocabulary based only on their contexts. To measure the accuracy of all the models, 10-fold cross-validation was used.
4.3 Results and Discussion
Testing these three models shows that the WANN-Tagger is considerably better than the feedforward neural network in both accuracy and training time, and that, although it is not as good as the Hidden Markov Model in accuracy rate, the standard deviation of its accuracy is very small and
it is therefore more reliable. The HMM can achieve accuracy rates of 94%, but these vary from 65% to 100%, showing that it can be outperformed by the WANN-Tagger in some cases, as can be seen in Figure 4. The training time of the HMM, however, is always shorter than that of the WANN-Tagger, as shown in Figure 5.
Figure 4: Accuracy rate. The horizontal axis represents the number of sentences in the sample; the vertical axis represents the accuracy achieved by the model. Each curve indicates the mean accuracy and its 95% confidence interval.
One should note, however, that the HMM performed better than the other two models because it employs heuristics that are used by neither the WANN-Tagger nor the feedforward neural network. Through the forward-backward procedure underlying the Baum-Welch algorithm (Baum, 1972), a special case of the Expectation-Maximization (EM) algorithm (Dempster et al., 1977), the HMM is able to obtain the Markov model that maximizes the probability of a POS sequence given a sequence of words. This algorithm allows the HMM to take the whole clause into account, whereas both the WANN-Tagger and the feedforward neural network model only see the part of the clause within the context window.
One may also note that, for each word in a sentence, the HMM must read only two entries: the word itself and the tag of the previous word (already stored in memory). The WANN-Tagger, on the other hand, has a context window containing three words, so for each word in a sentence it must discretize a number of elements equal to three times the number of classes, 15 in this case. The feedforward neural network, however, has no discretization process at all; the main reason why it is outperformed in training time by both the HMM and the WANN-Tagger is the nature of its training convergence.
It is important to note that, as the WANN-Tagger makes use of WiSARD, it does not require recursive iterations to achieve convergence. Therefore, unlike feedforward neural networks, when it receives a new tagged sentence to be trained it does not need to retrain itself; it only needs to increment the values in the memory locations addressed by this input pattern. As it is training on a single sentence, its training time is almost zero, as can be seen in Figure 5, making it capable of learning new tagged sentences during runtime.
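In terms of the DRASiW sketch of Section 2.2, learning a newly tagged sentence at runtime amounts to a single pass of counter increments (encode() stands for the hypothetical lexicon-window-discretization pipeline sketched in Section 3):

```python
# No retraining pass is needed: only the counters addressed by each word's
# retina are incremented (encode() is a placeholder, see Section 3).
for retina, tag in encode(new_tagged_sentence):
    discriminators[tag].train(retina)
```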
Figure 5: Time spent during the experiments. The horizontal axis represents the number of sentences in the sample; the vertical axis represents the time spent, in seconds, in each fold. Each curve indicates the mean time spent and its 95% confidence interval.
In addition, it is worth noticing that Portuguese, as a highly inflected language, has a flexible word order in its sentences. This flexibility is the reason why the accuracy rates presented in Figure 4 are a little lower than those obtained in papers that use less inflected languages, such as English (Schmid, 1994).
Tests with a corpus written in English, the Brown Corpus (Francis and Kučera, 1982), have also been carried out, and the results show that the WANN-Tagger performs almost as well as when it is trained with CETENFolha, achieving 79.56% accuracy with the same configuration used for Portuguese. It is worth noting that the Brown Corpus uses far more tags than the tagset adopted for Portuguese; with the same number of tags, the WANN-Tagger's accuracy would possibly be higher.
5 CONCLUSIONS
A weightless artificial neural network model for
tagging Portuguese language sentences was introduced in this paper. The model showed itself to be an alternative to feedforward neural network part-of-speech taggers, as it is faster and achieved a better accuracy rate. The WANN-Tagger also showed that its accuracy rate is, in some cases, as high as that of Hidden Markov Models. However, the WANN-Tagger is not competitive with the Hidden Markov Model in training time, as it needs a context window and a large number of inputs for each of its RAMs.
In order to make the model more competitive with the Hidden Markov Model, the possibility of adding Markov information to the WANN-Tagger inputs, for instance the tag assigned to the previous word during the tagging phase, is left for future work. This way, the model would know more about the whole sentence, and not only about the small part of it that lies within the context window.
In addition, it is worth noting that the WANN-Tagger performed well with both a free and a fixed word order language, and showed itself to be a good model for settings in which training is required during runtime.
ACKNOWLEDGEMENTS
This work was partially supported by CNPq
(306070/2007-3) and FAPERJ (E-26/102.957/2008)
Brazilian research agencies.
REFERENCES
Aleksander, I., Thomas, W. V., Bowden, P. A., 1984. WiSARD: a radical step forward in image recognition. In Sensor Review, pp. 120-124.
Aleksander, I., Kan, W. W., 1987. A probabilistic logic neuron network for associative learning. In IEEE Proceedings of the First International Conference on Neural Networks, pp. 541-548.
Aleksander, I., 1990. Ideal neurons for neural computers. In Parallel Processing in Neural Systems and Computers, North-Holland, Amsterdam, pp. 225-228.
Aleksander, I., Morton, H., 1991. General neural unit: retrieval performance. In IEE Electronics Letters, 27, pp. 1176-1178.
Alias-i, 2009. LingPipe 3.9.2 http://alias-i.com/lingpipe
(accessed October 1, 2009).
Baum, L. E., 1972. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. In Shisha, O. (Ed.), Inequalities III: Proceedings of the 3rd Symposium on Inequalities, University of California, Los Angeles, pp. 1-8. Academic Press.
Bick, E., 2000. The Parsing System Palavras – Automatic
Grammatical Analysis of Portuguese in a Constraint
Grammar Framework. Århus University Press.
Brill, E., 1995. Transformation-based error-driven
learning and natural language processing: a case study
in part-of-speech tagging. In Computational
Linguistics, 21(4), pp. 543-566.
Carvalho Filho, E. C. B., Fairhurst, M. C., Bisset, D. L., 1991. Adaptive pattern recognition using goal-seeking neurons. In Pattern Recognition Letters, 12, pp. 131-138.
Dempster, A. P., Laird, N. M., Rubin, D. B., 1977. Maximum likelihood from incomplete data via the EM algorithm. In Journal of the Royal Statistical Society, 39(1), pp. 1-21.
Francis, W. N., Kučera, H., 1982. Frequency Analysis of English Usage. Houghton Mifflin, Boston.
Grieco, B. P. A., Lima, P. M. V., De Gregorio, M., França, F. M. G., 2010. Producing pattern examples from “mental” images. In Neurocomputing, 73, pp. 1057-1064.
Kanerva, P., 1988. Sparse Distributed Memory. MIT
Press.
Karlsson, F., Voutilainen, A., Heikkilä, J., Anttila, A., 1995. Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter.
Lafferty, J., McCallum, A., Pereira, F., 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning.
Linguateca, 2009. CETENFolha http://www.linguateca.pt/
CETENFolha/index_info.html (accessed October 1,
2009).
Marques, N. C., Lopes, G. P., 1996. Using neural nets for Portuguese part-of-speech tagging. In Proceedings of the 5th International Conference on the Cognitive Science of Natural Language Processing.
McCallum, A., Freitag, D., Pereira, F., 2000. Maximum entropy Markov models for information extraction and segmentation. In Proceedings of the 17th International Conference on Machine Learning.
Rabiner, L. R., 1989. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, 77(2), pp. 257-286.
Schmid, H., 1994. Part-of-speech tagging with neural networks. In Proceedings of the 15th Conference on Computational Linguistics, 1, pp. 172-176.
Soares, C. M., da Silva, C. L. F., De Gregorio, M., França,
F. M. G., 1998. Uma implementação em software do
classificador WiSARD. In Proceedings of the V
Simpósio Brasileiro de Redes Neurais (SBRN 1998),
2, pp. 225-229. (In Portuguese).
Villavicencio, A., Marques, N. M., Lopes, J. G. P., Villavicencio, F., 1995. Part-of-speech tagging for Portuguese texts. In Proceedings of the 12th Symposium on Artificial Intelligence, 991, pp. 323-332.