On the Automatic Classiﬁcation of Reading Disorders

Andreas Maier

, Caroline Parchmann

, Tobias Bocklet

, Florian H¨onig

Oliver Kratz

, Stefanie Horndasch

, Elmar N¨oth

and Gunther Moll

Lehrstuhl f¨ur Mustererkennung (Informatik 5)

Friedrich-Alexander Universit¨at Erlangen N¨urnberg

Martensstr.3, 91058 Erlangen, Germany

Kinder- und Jugendabteilung f¨ur Psychische Gesundheit, Universit¨atsklinikum Erlangen

Schwabachanlage 6 und 10, 91054 Erlangen, Germany

Abstract. In this paper, we present an automatic classiﬁcation approach to iden-

tify reading disorders in children. This identiﬁcation is based on a standardized

test. In the original setup the test is performed by a human supervisor who mea-

sures the reading duration and notes down all reading errors of the child at the

same time. In this manner we recorded tests of 38 children who were suspected

to have reading disorders. The data was confronted to an automatic system which

employs speech recognition to identify the reading errors. In a subsequent classi-

ﬁcation experiment — based on the speech recognizer’s output and the duration

of the test — 94.7% of the children could be classiﬁed correctly.

1 Introduction

The state-of-the-art approach to examine children for reading disorders is a perceptual

evaluation of the children’s reading abilities. In all of these reading tests, a list of words

or sentences is presented to the child. The child has to read all of the material as fast and

as accurate as possible. In order to determine whether the child has a reading disorder

two variables are investigated by a human supervisor during the test procedure:

– The duration of the test, i.e. the ﬂuency, and

– The number of reading errors during the reading of the test material, i.e., the accu-

racy.

Both variables, however, are dependent on the age of the child and related to each other.

If a child tries to read very fast, the number of reading errors will increase and vice

versa [1]. Furthermore, with increasing age the reading ability of children increases.

Hence, appropriate test material has to be chosen according to the age and reading

ability of the child. Therefore, reading tests often consist of different sub-tests. While

younger children are tested with really existing words and only short sentences, the

older children have to be tested with more difﬁcult tasks, such as long complex sen-

tences and pseudo words which may or may not resemble real words. Appropriate sub-

tests are then selected for each tested child. Often this is linked to the child’s progress

in school.

One major drawback of the testing procedure is the intra-rater variablity in the per-

ceptual evaluation procedure. Although the test manual often deﬁnes how to differen-

tiate reading errors from normal disﬂuencies and “allowed” pronunciation alternatives,

Maier A., Parchmann C., Bocklet T., Hönig F., Kratz O., Horndasch S., Nöth E. and Moll G. (2009).

On the Automatic Classiﬁcation of Reading Disorders.

In Proceedings of the 9th International Workshop on Pattern Recognition in Information Systems, pages 18-27

DOI: 10.5220/0002174700180027

 SciTePress

there is no exact deﬁnition of a reading error in terms of its acoustical representation.

In order to solve this problem, we propose the use of a speech recognition system to

detect the reading errors. This procedure has two major advantages:

– The intra-rater variability of the speech recognizer is zero because it will always

produce the same result given the same input.

– The deﬁnition of reading errors is standardized by the parameters of the speech

recognition system, i.e., the reading ability test can also be performed by lay per-

sons with only little experience in the judgment of readings disorders.

In the literature, different automatic approaches to determine the “reading level” of a

child exist. Often the reading level is linked to the perceptual evaluation of expert listen-

ers using ﬁve to seven classes. In [2] Black et al. estimate a reading level between 1 and

7 using pronunciation veriﬁcation methods based on Bayesian Networks. Compared to

the human evaluation they achieve correlations between their automatic predictions and

the human experts of up to 0.91 on 13 speakers. In [3] the use of ﬁnite-state-transducers

is proposed to obtain a “reading level” between “A” (best) and “E” (worst). For this

ﬁve-class problem absolute recognition rates of up to 73.4% for real words and 62.8%

for pseudo words are reported. In order to remove age-dependent effects from the data,

80 children in the 2nd grade were investigated. Both papers focus on the creation of a

“reading tutor” in order to improve children’s reading abilities.

In contrast to these studies, we are interested in the diagnosis of reading disorders

as they are relevant in a clinical point of view. Currently, we are developing PEAKS

(Program for the Evaluation of All Kinds of Speech Disorders [4]) — a client-server-

based speech evaluation framework — which was already used to evaluate speech in-

telligibility in children with cleft lip and palate [5], patients after removal of laryngeal

cancer [6], and patients after the removal of oral cancer [7]. PEAKS features interfaces

and tools to integrate standardized speech tests easily. After integration of a new test,

PEAKS can be used for recording from any PC which is connected to the Internet if

Java Runtime Environment version 1.6 or higher is installed. All analyses performed by

PEAKS are fully automatic and independent of the supervising person. Hence, it is an

ideal framework to integrate an automatic reading disorder classiﬁcation system.

The paper is organized as follows. First the test material, the recorded speech data

and its annotation is described and discussed. Next, the automatic evaluation methods,

i.e., the speech recognizer and the classiﬁers, are reported. In the results section the clas-

siﬁcation accuracy is presented in detail. The subsequent section discusses the outcome

of the experiments. The paper is concluded by a summary.

2 Speech Data

In order to be able to interpret the results and to compare them to other studies’ test

material, speech data, and its annotation is described in detail here. Special attention is

given to the annotation procedure since the automatic evaluation algorithm aims to be

used for clinical diagnosis. Therefore, the annotation should meet clinical standards.

Table 1. Structure of the SLRT test: Thetable reports all sub-tests of the SLRT with their contents,

their number of words, and the school grades in which the respective sub-test is suitable.

sub-test content # of words grade

SLRT1 A short list of bisyllabic, single, real words to introduce the test.

This part is not analyzed according to the protocol of the test.

8 1–4

SLRT2 A list of mono- and bisyllabic real words 30 1–4

SLRT3 A list of compound words with two to three compounds each 11 3–4

SLRT4 A short story with only mono- and bisyllabic words 30 1–2

SLRT5 A longer story with mainly mono- and bisyllabic words but also

a few compound words

57 3–4

SLRT6 A short list of pseudo words with two to three syllables to intro-

duce the pseudo words. This part is not analyzed according to

the protocol of the test.

6 3–4

SLRT7 A list of pseudo words with two to three syllables 24 1–4

SLRT8 A list of mono- and bisyllabic pseudo words which resemble

real words

30 2–4

2.1 Test Material

The recorded test data is based on a German standardized reading disorder test — the

“Salzburger Lese-Rechtschreib-Test” (SLRT, [8]). In total the SLRT consists of eight

sub-tests (cf. Table 1). All sub-tests contain 196 words of which 170 are disjoint.

The test is standardized according to the instructions and the evaluation. The test is

presented in form of a small book, which is handed to the children to read in. They get

the instruction to read the text as fast as possible while doing as little reading mistakes

as possible.

In the original setup the supervisor of the test has to measure the time for all sub-

tests separately while noting down the reading errors of the child.

We will only report the results obtained for the SLRT7 and SLRT8 sub-tests in the

following.

On the one hand, the setup of the perceptual evaluation for all sub-tests is very

similar. Therefore, it is not necessary to report the results of all sub-tests. On the other

hand, the investigation of pseudo words using automatic systems is described as the

most challenging task in the literature [2]. The sub-test SLRT6 also contains pseudo

words, but no perceptual evaluations are conducted according to the test manual.

2.2 Recording Setup

In order to be able to collect the data directly at the PC, the test had to be modiﬁed. In-

stead of a book, the text was presented as a slide on the screen of a PC. The instructions

to the child were the same as in the original setup.

All children were recorded with a head-mounted microphone (Plantronics USB

510) at the University Clinic Erlangen. The recordings took place in a separate quiet

room without background noises. Hence, appropriate audio quality was achieved in all

recordings.

Table 2. 38 Children were recorded with the SLRT: The table shows mean value, standard de-

viation, minimum, and maximum of the age of the children and the count (#) in the respective

group.

group # mean std. dev. min max

all 38 9.7 0.9 7.8 11.3

girls 12 10.2 0.7 9.0 11.3

boys 26 9.5 0.9 7.8 11.3

Table 3. Overview on the limits of pathology for the SLRT7 and SLRT8 sub-tests

SLRT 7 SLRT 8

grade # of errors duration [s] # of errors duration [s]

1st 8 144 - -

2nd 7 113 6 100

3rd 6 78 5 70

4th 5 62 4 55

In total 38 children (26 boys and 12 girls) were recorded. The average age of the

children was 10.2 ± 0.9 years. A detailed overview regarding the statistics of the chil-

dren’s ages is given in Table 2. All of the children were speculated to have a reading

disorder.

2.3 Perceptual Evaluation

For each child the decision whether its reading ability was pathologic or not was de-

termined according to the manual of the SLRT [8]. A child’s reading ability is deemed

pathologic

– if the duration of the test is longer than an age-dependent standard value or

– if the number of reading errors exceeds an age-dependent standard value.

These limits differ for each sub-test according to the SLRT. Table 3 reports these limits

for the sub-tests SLRT7 and SLRT8. In the SLRT7 and the SLRT8 sub-test 30 children

were above the time limit.

We assigned each child two different labels: “reading error/normal”and “pathologic/non-

pathologic”. If only the number of misread words is exceeded, the child is assigned the

label “reading error”, otherwise “normal”. Reading errors are regarded as soon as a sin-

gle phonemic deviation is found. Errors of the accentuation of the word are also counted

as reading errors as described in the manual of the test [8]. For the case of the SLRT7

data, 12 children exceeded the limit of reading errors while 14 children were above the

error limit in the SLRT8 data.

If either of these two boundaries is exceeded by the child, the child is assigned

the label “pathologic”. In both sub-tests 32 of the 38 children were diagnosed to have

pathologic reading.

3 Automatic Evaluation System

The automatic evaluation is based on three information sources:

– The total duration of the test

– The reading error and duration limits (cf. Table 3)

– The word accuracy computed by a speech recognition system

The test duration can be easily accessed as PEAKS tracks this information automati-

cally during the recording. Prior information about the child — namely the child’s age

and the respective duration and error limits — can also easily be obtained (cf. Table 3).

3.1 Speech Recognition Engine

For the objective measurement of the reading accuracy, we use an automatic speech

recognition system based on Hidden Markov Models (HMM). It is a word recognition

system developed at the Chair of Pattern Recognition (Lehrstuhl f¨ur Mustererkennung)

of the University of Erlangen-Nuremberg. In this study, the latest version as described

in detail in [9] and [10] was used.

As features we use 11 Mel-Frequency Cepstrum Coefﬁcients (MFCCs) and the en-

ergy of the signal plus their ﬁrst-order derivatives. The short-time analysis applies a

Hamming window with a length of 16ms, the frame rate is 10ms. The ﬁlter bank for

the Mel-spectrum consists of 25 triangular ﬁlters. The 12 delta coefﬁcients are com-

puted over a context of 2 time frames to the left and the right side (56 ms in total).

The recognition is performed with semi-continuous HMMs. The codebook con-

tains 500 full covariance Gaussian densities which are shared by all HMM states. The

elementary recognition units are polyphones [11], a generalization of triphones. Poly-

phones use phones in a context as large as possible which can still statistically be mod-

eled well, i.e., the context appears more often than 50 times in the training data. The

HMMs for the polyphones have three to four states.

We used a unigram language model to weigh the outcome of each word model. It

was trained with the reference of the tests. For our purpose it was necessary to em-

phasize the acoustic features in the decoding process. In [12] a comparison between

unigram and zerogram language models was conducted. It was shown that intelligibil-

ity can be predicted using word recognition accuracies computed using either zero- or

unigram language models. The unigram, however,is computationally more efﬁcient be-

cause it can be used to reduce the search space. The use of higher n-gram models was

not beneﬁcial.

The result of the recognition is a word lattice. In order to get an estimate of the

quality of the recognition, the word accuracy (WA) is computed. Based on the number

of correctly recognized words C and the number of words R in the reference, the WA

is further dependent on the number or wrongly inserted words I:

WA =

C − I

· 100 %

Hence, the WA can take values between minus inﬁnity and 100%.

The speech recognition system had been trained with acoustic information from 23

male and 30 female children from a local school who were between 10 and 14 years

old (6.9 hours of speech). To make the recognizer more robust, we added data from 85

male and 47 female adult speakers from all over Germany (2.3 hours of spontaneous

speech from the VERBMOBIL project, [13]). The data were recorded with a close-talk

microphone with 16 kHz sampling frequency and 16 bit resolution. The adult speakers

were from all over Germany and thus covered most dialect regions. However, they were

asked to speak standard German. The adults’ data were adapted by vocal tract length

normalization as proposed in [14]. During training an evaluation set was used that only

contained children’s speech. MLLR adaptation (cf. [15,16]) with the patients’ test data

led to further improvement of the speech recognition system.

3.2 Classiﬁcation System

Classiﬁcation was performed in a leave-one-speaker-out (LOO) manner since there was

only little training and test data available. We chose three popular measures in order to

report the classiﬁcation accuracy.

– CL: The class-wise-averaged recognition rate, or so-called average recall. It is de-

termined as

CL =

recall(k)

· 100 % (1)

where K is the number of classes. The CL is useful if the distribution of the classes

is biased.

– RR: The total recognition rate determined as the fraction of correctly identiﬁed

samples c divided by the number of samples n:

RR =

· 100 % (2)

The RR reports the overall performance of the classiﬁer including the class distri-

bution of the data.

– ROC denotes the area under the Receiver-Operating-Characteristic (ROC) curve

[17]. A random classiﬁer yields an area of 0.5 while the perfect classiﬁer would

yield an area of 1.0.

As classiﬁcation system we decided for Ada-Boost [18] in combination with an LDA-

Classiﬁer as simple classiﬁer as it was already successfully applied in [19].

4 Results

Table 4 shows the results of the classiﬁcation experiment “reading error”. The task

was to determine automatically whether the age-dependent limit of reading errors was

exceeded or not. For both sub-tests, the classiﬁcation performance using duration infor-

mation and WA only is moderate. If the age-dependent limit which is dependent on the

school grade of the child is also provided to the classiﬁer, the performance increases for

both sub-tests (90.1 % CL for SLRT7 and 68.2% CL for SLRT8). The actual age —

deﬁned by the date of birth of the child and the date and time of recording — did not

yield any improvement for this classiﬁcation task.

As a second experiment, the use of the classiﬁcation system to automatically de-

termine reading disorders was investigated. Now, the task was to classify whether the

Table 4. Overview on the classiﬁcation results for the task “reading error”. CL is the average

recall, RR the absolute recognition rate and ROC the area under the ROC curve.

SLRT 7 SLRT 8

feature set CL [%] RR [%] ROC CL [%] RR [%] ROC

duration and accuracy 59.5 65.8 0.74 61.6 68.4 0.74

+ age-dependent limits 90.1 89.5 0.97 68.2 71.1 0.62

+ actual age 77.6 81.6 0.81 50.3 57.9 0.56

Table 5. Overview on the classiﬁcation results for the task “pathologic”. CL is the average recall,

RR the absolute recognition rate and ROC the area under the ROC curve.

SLRT 7 SLRT 8

feature set CL [%] RR [%] ROC CL [%] RR [%] ROC

duration and accuracy 90.1 94.7 0.99 68.7 81.6 0.66

+ age-dependent limits 83.3 84.7 0.99 70.3 84.2 0.76

+ actual age 81.8 91.1 0.99 88.5 92.1 0.83

child has a reading disorder or not. Table 5 reports the results. Using only duration and

WA, high recognition rates can already be obtained for the SLRT7 sub-test. For the

case of the SLRT8 sub-test, more information is required to obtain such high classiﬁca-

tion rates. If the age-dependent limits of the SLRT8 and the actual age of the child are

supplied to the classiﬁer, a CL of 88.5% is achieved.

5 Discussion

The scope of this paper was the automatic detection and classiﬁcation of reading disor-

ders in children. Therefore, we chose a clinical standard test and recorded 38 children

who were speculated to have reading disorders.

In order to diagnose a reading disorder,the time of the test has to be investigated and

the number of reading errors has to be determined because both variables are related.

This was performed according to the manual of the SLRT test.

Next, these data were confronted to an automatic evaluation routine based on an

automatic speech recognition system. For the SLRT7 test, using the test duration, the

WA, and the age-dependent limits of the test, the automatic system could already deter-

mine whether the child exceeded the number of reading errors or not at a CL of 90.1%

(89.5% RR). For the SLRT8, however, only 68.2% accuracy were achieved. Figure 1

shows the relation between WA and the duration for the SLRT8 sub-test. Although both

are correlated, duration and WA can be used to distinguish children with many reading

errors from children with only few reading errors. However, both clusters are scattered

into each other. If the reading error limit is also supplied the performance of the clas-

siﬁcation system increases. The actual age of the children did not contribute. This may

be related to the boosting algorithm. It emphazises the wrong features and classiﬁes

according to the age instead of the grade.

In a second experiment we investigated whether these classiﬁcation rates were al-

ready enough to determine a reading pathology automatically. In the SLRT7 sub-test,

-350

-300

-250

-200

-150

-100

-50

100

0 50 100 150 200 250 300 350

word accuracy [%]

duration [s]

X (reading error)

O (normal)

regression line X

regression line O

Fig.1. The plot shows the children of the SLRT8 sub-test: The regression lines show that the

additional use of speech recognition helps to differentiate between the children with many reading

errors and the ones with few reading errors.

90.1% CL (94.7% RR) were achieved.Investigationof the SLRT8 sub-test showed that

a high classiﬁcation rate of 88.5 % CL (92.1% RR) could also be achieved. Hence, the

proposed system is suitable for the automatic classiﬁcation of reading disorders, even

though the classiﬁcation of the reading errors is not perfect. Figure 2 shows this pro-

cess: If duration and reading errors are taken into account the group of non-pathologic

children can be found on the top left of the plot. However,both groups still overlap. Fur-

ther prior information helps to distinguish them from each other. In this experiment the

limits and the true age of the child improve the classiﬁcation further. In further experi-

ments using the other sub-tests and a control group of non-pathologic school children,

we will investigate this effect in more detail.

In the future this procedure will help in the diagnosis of reading disorders in chil-

dren. The system can also be used by lay persons with only little understanding of read-

ing disorders. Screening of reading disorders is also within the reach of the proposed

system.

6 Summary

In this paper we presented an automatic approach for the classiﬁcation of reading disor-

ders based on automatic speech recognition. The evaluation is performed on a standard-

ized German reading capability test that contains pseudo words. To our knowledge such

-350

-300

-250

-200

-150

-100

-50

100

0 50 100 150 200 250 300 350

word accuracy [%]

duration [s]

pathologic

non-pathologic

Fig.2. The plot shows the children of the SLRT8 sub-test: The non-pathologic children can be

found in the upper left corner of the diagram. The data points of the pathologic children are scat-

tered to the bottom right. Duration and word accuracy (WA) alone are not sufﬁcient to separate

both clusters.

a system has not been published before. The system is web-based and can be accessed

from any PC which is connected to the Internet.

Using a database with 38 children classiﬁcation rates of up to 94.7% (RR) could be

achieved. The system is suitable for the automatic classiﬁcation of reading disorders.

References

1. I. Dennis and J. St. B. T. Evans, “The speed-error trade-off problem in psychometric testing,”

British Journal of Psychology, vol. 87, pp. 105–129, 1996.

2. M. Black, J. Tepperman, S. Lee, and S. Narayanan, “Estimation of children’s reading ability

by fusion of automatic pronunciation veriﬁcation and ﬂuency detection,” in Interspeech 2008

– Proc. Int. Conf. on Spoken Language Processing, 11th International Conference on Spoken

Language Processing, September 25-28, 2008, Brisbane, Australia, Proceedings, 2008, pp.

2779–2782.

3. J. Duchateau, L. Cleuren, H. Van Hamme, and P. Ghesquiere, “Automatic assessment of

children’s reading level,” in Interspeech 2007 – Proc. Int. Conf. on Spoken Language Pro-

cessing, 10th European Conference on Spoken Language Processing, August 27-31, 2007,

Antwerp, Belgium, Proceedings, 2007, pp. 1210–1213.

4. A. Maier, T. Haderlein, U. Eysholdt, F. Rosanowski, A. Batliner, M. Schuster, and E. N¨oth,

“PEAKS – A System for the Automatic Evaluation of Voice and Speech Disorders,” Speech

Communication, vol. 51, no. 5, pp. 425–437, 2009.

5. A. Maier, E. N¨oth, A. Batliner, E. Nkenke, and M. Schuster, “Fully Automatic Assessment

of Speech of Children with Cleft Lip and Palate,” Informatica, vol. 30, no. 4, pp. 477–482,

2006.

6. M. Schuster, T. Haderlein, E. N¨oth, J. Lohscheller, U. Eysholdt, and F. Rosanowski, “Intel-

ligibility of laryngectomees’ substitute speech: automatic speech recognition and subjective

rating,” Eur Arch Otorhinolaryngol, vol. 263, no. 2, pp. 188–193, 2006.

7. M. Windrich, A. Maier, R. Kohler, E.N¨oth, E. Nkenke, U. Eysholdt, and M. Schuster, “Au-

tomatic Quantiﬁcation of Speech Intelligibility of Adults with Oral Squamous Cell Carci-

noma,” Folia Phoniatr Logop, vol. 60, pp. 151–156, 2008.

8. K. Landerl, H. Wimmer, and E. Moser, Salzburger Lese- und Rechtschreibtest. Verfahren zur

Differentialdiagnose von St¨orungen des Lesens und des Schreibens f¨ur die 1. bis 4. Schul-

stufe, Huber, Bern, 1997.

9. F. Gallwitz, Integrated Stochastic Models for Spontaneous Speech Recognition, vol. 6 of

Studien zur Mustererkennung, Logos Verlag, Berlin (Germany), 2002.

10. G. Stemmer, Modeling Variability in Speech Recognition, vol. 19 of Studien zur Muster-

erkennung, Logos Verlag, Berlin (Germany), 2005.

11. E.G. Schukat-Talamazzini, H. Niemann, W. Eckert, T. Kuhn, and S. Rieck, “Automatic

Speech Recognition without Phonemes,” in Proc. European Conf. on Speech Communication

and Technology (Eurospeech), Berlin (Germany), 1993, vol. 1, pp. 129–132.

12. K. Riedhammer, G. Stemmer, T. Haderlein, M. Schuster, F. Rosanowski, E. N¨oth, and

A. Maier, “Towards Robust Automatic Evaluation of Pathologic Telephone Speech,” in

Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU),

Kyoto, Japan, 2007, pp. 717–722, IEEE Computer Society Press.

13. W. Wahlster, Ed., Verbmobil: Foundations of Speech-to-Speech Translation, Springer, Berlin

(Germany), 2000.

14. G. Stemmer, C. Hacker, S. Steidl, and E. N¨oth, “Acoustic Normalization of Children’s

Speech,” in Proc. European Conf. on Speech Communication and Technology, Geneva,

Switzerland, 2003, vol. 2, pp. 1313–1316.

15. M. Gales, D. Pye, and P. Woodland, “Variance compensation within the MLLR framework

for robust speech recognition and speaker adaptation,” in Proceedings of the International

Conference on Speech Communication and Technology (Interspeech), Philadelphia, USA,

1996, vol. 3, pp. 1832–1835, ISCA.

16. A. Maier, T. Haderlein, and E. N¨oth, “Environmental Adaptation with a Small Data Set of the

Target Domain,” in 9th International Conf. on Text, Speech and Dialogue (TSD), P. Sojka,

I. Kopeˇcek, and K. Pala, Eds., Berlin, Heidelberg, New York, 2006, vol. 4188 of Lecture

Notes in Artiﬁcial Intelligence, pp. 431–437, Springer.

17. A. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, pp.

861–874, 2006.

18. Yoav Freund and Robert E. Schapire, “Experiments with a new boosting algorithm,” in

Thirteenth International Conference on Machine Learning, San Francisco, 1996, pp. 148–

156, Morgan Kaufmann.

19. C. Hacker, T. Cincarek, A. Maier, A. Heßler, and E. N¨oth, “Boosting of Prosodic and Pronun-

ciation Features to Detect Mispronunciations of Non-Native Children,” in Proceedings of the

International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Hawaii,

USA, 2007, vol. 4, pp. 197–200, IEEE Computer Society Press.