TOWARD A SILENT SPEECH INTERFACE BASED ON UNSPOKEN
SPEECH
Alejandro Antonio Torres García, Carlos Alberto Reyes García and Luis Villaseñor Pineda
Computer Science Department, National Institute of Astrophysics, Optics and Electronics (INAOE)
Luis Enrique Erro # 1, Tonantzintla, México
Keywords:
Silent Speech Interfaces (SSI), Electroencephalograms (EEG), Unspoken Speech, Discrete Wavelet Transform
(DWT), Classification.
Abstract:
This work aims to interpret the EEG signals associated with imagining the pronunciation of words that belong to a reduced vocabulary, without moving the articulatory muscles and without uttering any audible sound (unspoken speech). Specifically, the vocabulary reflects movements to control the cursor on a computer. We have recorded EEG signals from 21 subjects using a basic marker-based protocol. The discrete wavelet transform (DWT) is used to extract features from the delimited windows, and a subset of them with frequency ranges below 32 Hz is further selected. These subsets are used to train four classifiers: Naive Bayes (NB), Random Forests (RF), Support Vector Machines (SVM), and Bagging-RF. The results are still preliminary but encouraging, because the accuracy rates are above 20%, i.e., above the chance level for five classes. The implementation process, as well as some experiments and their corresponding results, is described.
1 INTRODUCTION
Oral communication is the natural way in which hu-
mans interact. However, in some circumstances, it is
not possible to emit an intelligible acoustic signal, or
it is desired to communicate without making sounds.
Under these conditions, systems that enable spoken communication in the absence of an acoustic signal are desirable. Such systems belong to a recent research area called silent speech interfaces (SSI).
Among the SSI technologies described by Denby (Denby et al., 2010), of particular interest are those that use EEG, because they are non-invasive, relatively simple, economical, and insensitive to environments with large amounts of audible noise. We are particularly interested in those associated with unspoken speech, also referred to as internal or imagined speech.
Works on unspoken speech can be divided into two approaches: by words and by syllables. The first approach is followed in (Porbadnigk, 2008; Suppes et al., 1997; Wester, 2006), while in (Brigham and Kumar, 2010; DaSalla et al., 2009; D'Zmura et al., 2009) only syllables are treated. In the specific case of works that explore words, where this study falls, the following problems have been identified. In (Suppes et al., 1997), a prototype-based method was presented that is unsuitable for real-time processing. In (Porbadnigk, 2008; Wester, 2006), it is assumed that the extracted features can be recognized with existing models for common speech recognition; nonetheless, the acoustic speech signal and EEG signals have very different characteristics. Within the word-based approach, the most recent work, described in (Porbadnigk, 2008), uses a five-word vocabulary; EEG signals from seven subjects were recorded, and each word of the vocabulary was repeated 20 times.
This research aims to interpret the EEG signals associated with unspoken speech. Specifically, it aims to interpret the signals to recognize five unspoken words of the Spanish language: "arriba" (up), "abajo" (down), "izquierda" (left), "derecha" (right), and "seleccionar" (select), each repeated 33 times by every subject. These words were chosen because they could be used to control a computer screen cursor.
2 METHODOLOGY
The stages of the proposed methodology are the following: EEG signal acquisition, EEG signal enhancement (pre-processing), feature extraction, feature selection, and classification.
For EEG signal acquisition, an EMOTIV kit is used.
This kit is wireless and has fourteen high-resolution electrodes (channels) with a sampling frequency of 128 Hz.
2.1 Brain Signals Acquisition
In this stage the EEG kit is used to acquire the
brain signals. According to the Geschwind-Wernicke
model, EEG signals related to speech production
come from specific areas in the left side of the brain
(Geschwind, 1972). Particularly, channels F7, FC5,
T7 and P7 are of interest because they are the nearest
to Geschwind-Wernicke’s model areas.
Moreover, a basic protocol is used to acquire the EEG signals from each subject: a mouse is used to send markers to the EEG signal acquisition software, delimiting the start and end of the imagined pronunciation of each word (see Figure 1). The set of samples between both markers is called a window or instance.
This way, it is known in which part of the EEG signal to search for the patterns associated with the imagined pronunciation of a word.
Figure 1: EEG signal from channel F7 of subject S1 while imagining the pronunciation of the word "abajo", following the data acquisition protocol.
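As an illustration, a minimal sketch of this windowing step is shown below. The marker representation (a list of (start, end, word) tuples per recording) is an assumption, since the paper only states that mouse clicks delimit each imagined word:

```python
def segment_windows(samples, markers):
    """Cut one channel's raw EEG stream into word-labeled windows.

    `markers` is an assumed representation: a list of
    (start_index, end_index, word) tuples derived from the
    mouse-click markers of the acquisition protocol.
    """
    return [(samples[start:end], word) for start, end, word in markers]
```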
2.2 Pre-processing
In this stage, the EEG signals obtained from the channels of interest (F7, FC5, T7, and P7) are filtered using a finite impulse response (FIR) band-pass filter in the 4 to 25 Hz range.
It is noteworthy that, similarly to conventional speech, the duration of the unspoken speech windows for each word is variable, both within one subject and across different subjects. Thus, it is necessary to establish an equal size for all windows.
At the end of this stage, windows with 256 samples and a frequency range between 4 and 25 Hz are kept and used to create the experimental database. Windows shorter than 256 samples are zero-padded, and those with more than 256 samples are discarded.
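A minimal sketch of this pre-processing stage, assuming SciPy, is given below. The filter order (`numtaps`) and the use of a causal FIR filter are assumptions, since the paper does not specify them:

```python
import numpy as np
from scipy.signal import firwin, lfilter

FS = 128        # EMOTIV sampling rate in Hz (from the paper)
WIN_LEN = 256   # target window length: two seconds at 128 Hz

def preprocess(window, numtaps=65):
    """Band-pass a window to 4-25 Hz and normalize its length to 256 samples.

    `numtaps` (the FIR filter order) is an assumption; the paper does not
    report it. Windows longer than 256 samples are discarded (returns None).
    """
    if len(window) > WIN_LEN:
        return None                                  # discarded, as in the paper
    taps = firwin(numtaps, [4.0, 25.0], pass_zero=False, fs=FS)
    filtered = lfilter(taps, 1.0, window)            # causal FIR band-pass
    if len(filtered) < WIN_LEN:                      # zero-pad short windows
        filtered = np.pad(filtered, (0, WIN_LEN - len(filtered)))
    return filtered
```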
2.3 Feature Extraction
In (Lotte et al., 2007) it is mentioned that the features to be used in BCI are non-stationary and contain time information, which makes an adequate representation necessary. The discrete wavelet transform
(DWT) provides a highly efficient wavelet represen-
tation by restricting the variation in translation and
scale, usually to powers of two.
In consequence, in this work the discrete wavelet transform (DWT) with six decomposition levels is applied, using a second-order Daubechies (db2) mother wavelet. With this, a vector with 269 wavelet coefficients is obtained for each window in each channel of interest. Subsequently, the coefficients in the same time interval belonging to the four channels of interest are concatenated in the order F7-FC5-T7-P7. At the end of this stage, a vector with 1076 features, together with its corresponding class label, is obtained.
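The following sketch reproduces this step with PyWavelets; the dict-based representation of a pre-processed instance is an assumption:

```python
import numpy as np
import pywt  # PyWavelets

CHANNELS = ["F7", "FC5", "T7", "P7"]  # channels of interest, in concatenation order

def extract_features(window_by_channel):
    """Compute the 1076-dimensional DWT feature vector for one instance.

    `window_by_channel` is an assumed structure: a dict mapping each
    channel name to its 256-sample pre-processed window.
    """
    features = []
    for ch in CHANNELS:
        # 6-level DWT with the db2 mother wavelet; for a 256-sample window
        # this yields [A6, D6, D5, D4, D3, D2, D1] = 269 coefficients.
        coeffs = pywt.wavedec(window_by_channel[ch], "db2", level=6)
        features.append(np.concatenate(coeffs))
    return np.concatenate(features)   # 4 channels x 269 = 1076 features
```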
2.4 Feature Selection
The feature selection problem implies selecting a minimum subset of $M$ features, $S = (S_1, \ldots, S_M)$, from the original feature set $F = (F_1, \ldots, F_N)$, where $M \leq N$ and $S \subseteq F$, so that the feature space is optimally reduced and the classification performance is maintained, improved, or not significantly degraded.
At this stage, the subset of features corresponding to frequencies above the 25 Hz filter cutoff is discarded; these are the first-level detail coefficients (D1), which cover the 32-64 Hz band. Therefore, the selected feature subset consists of the detail coefficients D2 to D6 and the approximation A6, which reduces the dimension of the feature vectors and, at the same time, reduces the impact of the curse of dimensionality in the classification stage. With this, each window of each channel is represented by 140 wavelet coefficients. Then, the DWT coefficients of the windows in the same time interval are concatenated as in the feature extraction stage.
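Continuing the sketch above, the reduced vector can be obtained simply by dropping the first-level detail coefficients before concatenation:

```python
import numpy as np
import pywt

CHANNELS = ["F7", "FC5", "T7", "P7"]

def reduced_features(window_by_channel):
    """140 coefficients per channel: A6 and D6..D2, with D1 discarded.

    pywt.wavedec returns [A6, D6, D5, D4, D3, D2, D1], so dropping the
    last element removes the 32-64 Hz detail band.
    """
    feats = [np.concatenate(pywt.wavedec(window_by_channel[ch], "db2", level=6)[:-1])
             for ch in CHANNELS]
    return np.concatenate(feats)
```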
2.5 Classification
In this work, the following three classifiers are trained and tested: Support Vector Machines (SVM), Random Forests (RF), and Naive Bayes (NB). After evaluating the individual classifiers, the one with the highest accuracy is selected as the base classifier for a Bagging ensemble.
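A possible realization of this evaluation with scikit-learn is sketched below; all hyperparameters (numbers of trees, SVM kernel, etc.) are assumptions, since the paper does not report them:

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

def evaluate_classifiers(X, y):
    """Return the 10-fold cross-validated accuracy of the four classifiers.

    X: feature matrix (one row per window); y: word labels.
    Hyperparameters are illustrative assumptions only.
    """
    classifiers = {
        "NB": GaussianNB(),
        "RF": RandomForestClassifier(n_estimators=100),
        "SVM": SVC(),
        "Bagging-RF": BaggingClassifier(RandomForestClassifier(), n_estimators=10),
    }
    return {name: cross_val_score(clf, X, y, cv=10).mean()
            for name, clf in classifiers.items()}
```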
3 EXPERIMENTATION AND
RESULTS
3.1 Preliminary Experiments
The preliminary experiments consisted in training and testing the three described classifiers (NB, RF, and SVM) with the EEG signals recorded from three subjects (S1, S2, and S3), for which both the complete and the reduced feature vectors were previously obtained. These experiments aim to evaluate the convenience of using the complete or the reduced vectors, and to select the classifier to be used as the base for Bagging. For this purpose, the classifiers are evaluated in terms of accuracy, obtained through 10-fold cross-validation.
The accuracy percentages for each of the three classifiers using the complete feature vectors are shown in Table 1.
Table 1: Accuracy percentages obtained by the classifiers using the complete feature vectors (1076 features).

Subject | NB    | RF    | SVM
S1      | 23.35 | 24.08 | 23.35
S2      | 17.09 | 31.63 | 24.78
S3      | 35.75 | 41.21 | 18.18
Table 2 presents the accuracy percentage for each
of the three classifiers using the reduced feature vec-
tors.
Table 2: Accuracy percentages obtained by the classifiers using the reduced feature vectors (540 features).

Subject | NB    | RF    | SVM
S1      | 24.08 | 43.78 | 21.90
S2      | 18.80 | 38.46 | 21.37
S3      | 33.94 | 43.64 | 19.39
The results in Tables 1 and 2 show that, in general, better results are obtained with the reduced feature vectors than with the complete ones. In the few cases where the complete vectors perform better, the improvement is not significant.
In addition, Tables 1 and 2 show that the classifier obtaining the best accuracy percentages is RF, so it was selected as the base classifier for Bagging in the following experiments. From here on, this ensemble is denoted as Bagging-RF.
3.2 Experiment with the Whole Corpus
Twenty-one right-handed subjects (S1-S21) participated in these experiments to collect a data corpus. From each of them, 33 instances of each of the five imagined words were recorded. However, instances with more than 256 samples (two seconds long) were discarded in the experimental phase. After this selection process, the remaining instances pass through all the stages of the methodology.
In the first experiment, the instances from each of the twenty-one subjects (S1-S21) are used separately to train and test the four evaluated classifiers (RF, SVM, NB, and Bagging-RF). The accuracy percentages, obtained using 10-fold cross-validation, are shown in Figure 2.
Figure 2: Accuracy percentages for each classifier, obtained with 10-fold cross-validation on the data of each subject.
Figure 2 shows that, in general, the accuracy percentages obtained by the four classifiers are above chance for five classes, which is 20%. This accuracy rate is taken as a lower bound because, according to (Dietterich, 2000), "an accurate classifier is one that has an error rate better than chance at the stage of generalization (testing)". Furthermore, Figure 2 shows that, according to the accuracy percentages, the best classifier is Bagging-RF and the worst is SVM. It is also important to note that, for all subjects, both RF and Bagging-RF remain above chance for the five classes.
Next, word-level results obtained by RF and Bagging-RF are presented in terms of the F-measure. For each subject's dataset, an F-measure value is computed for each of the words. Figure 3 shows the average F-measure obtained by RF and Bagging-RF for each word. In the case of RF, the words ordered by F-measure, from highest to lowest, are: "arriba", "izquierda", "seleccionar", "abajo", and "derecha"; in the case of Bagging-RF, the order is: "seleccionar", "arriba", "derecha", "abajo", and "izquierda".
Figure 3: Average F-measure for each word, obtained by RF and Bagging-RF with 10-fold cross-validation on each subject's data.
It is important to note that the words "seleccionar" and "arriba", as classified by Bagging-RF, have an F-measure above 0.4, which is twice the chance level.
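A sketch of how such per-word F-measures can be computed with scikit-learn follows; the use of cross-validated predictions is an assumption about how the per-class scores were aggregated:

```python
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

WORDS = ["arriba", "abajo", "izquierda", "derecha", "seleccionar"]

def per_word_f_measure(clf, X, y):
    """Per-class F-measure from 10-fold cross-validated predictions.

    Assumes y contains the word strings themselves as labels.
    """
    y_pred = cross_val_predict(clf, X, y, cv=10)
    return dict(zip(WORDS, f1_score(y, y_pred, labels=WORDS, average=None)))
```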
Finally, it is worth mentioning that the results obtained in this work are roughly comparable to those of similar state-of-the-art works, such as (Porbadnigk, 2008), where classification was evaluated only in terms of accuracy, reporting 45.95% for five words. This comparison is made bearing in mind the differences described in Section 1.
4 CONCLUSIONS AND FUTURE
WORK
The acoustic speech signal and EEG signals have different features, which makes them naturally dissimilar. In consequence, we explored an alternative processing and classification approach to treat the EEG signals, in particular those related to unspoken speech. Indeed, the problem of interpreting unspoken speech is still far from being solved. However, from our experiments we obtained evidence to affirm that the EEG signals actually carry useful information that allows the classification of unspoken words. We conclude this based on the classification accuracy percentages of the four classifiers, which are above chance for five classes (see Figure 2). Our results and experimental procedures are consistent with those reported in the state of the art, because we performed experiments with more than one classifier, explored a language other than English, used a reduced vocabulary with more semantic meaning, and worked with features obtained through a feature selection approach instead of a dimensionality reduction approach. The average F-measure, however, was below the chance level for five classes.
To improve the reported results, we propose to explore how to use and compare all windows regardless of their size. We propose to apply independent component analysis (ICA) and to assess each independent component using the Hurst exponent, in order to eliminate artifacts such as blinks and heartbeats. Selecting another wavelet family could also help. We also plan to test other EEG signal representations and to combine them with the DWT coefficients. Finally, it is still possible to use hybrid intelligent systems and other ensemble schemes to improve the classification results.
ACKNOWLEDGEMENTS
This work was done with the partial support of CONACyT (scholarship #234705) and INAOE.
REFERENCES
Brigham, K. and Kumar, B. (2010). Imagined speech classification with EEG signals for silent communication: A preliminary investigation into synthetic telepathy. In Bioinformatics and Biomedical Engineering (iCBBE), 2010 4th International Conference on, pages 1-4. IEEE.

DaSalla, C. S., Kambara, H., Koike, Y., and Sato, M. (2009). Spatial filtering and single-trial classification of EEG during vowel speech imagery. In i-CREATe '09: Proceedings of the 3rd International Convention on Rehabilitation Engineering & Assistive Technology, pages 1-4, New York, NY, USA. ACM.

Denby, B., Schultz, T., Honda, K., Hueber, T., Gilbert, J., and Brumberg, J. (2010). Silent speech interfaces. Speech Communication, 52(4):270-287.

Dietterich, T. (2000). Ensemble methods in machine learning. Multiple Classifier Systems, pages 1-15.

D'Zmura, M., Deng, S., Lappas, T., Thorpe, S., and Srinivasan, R. (2009). Toward EEG sensing of imagined speech. Human-Computer Interaction. New Trends, pages 40-48.

Geschwind, N. (1972). Language and the brain. Scientific American.

Lotte, F., Congedo, M., Lécuyer, A., Lamarche, F., and Arnaldi, B. (2007). A review of classification algorithms for EEG-based brain-computer interfaces. Journal of Neural Engineering, 4(2):R1-R13.

Porbadnigk, A. (2008). EEG-based Speech Recognition: Impact of Experimental Design on Performance. Master's thesis, Institut für Theoretische Informatik, Universität Karlsruhe (TH), Karlsruhe, Germany.

Suppes, P., Lu, Z., and Han, B. (1997). Brain wave recognition of words. Proceedings of the National Academy of Sciences of the United States of America, 94(26):14965.

Wester, M. (2006). Unspoken Speech - Speech Recognition Based On Electroencephalography. Master's thesis, Institut für Theoretische Informatik, Universität Karlsruhe (TH), Karlsruhe, Germany.