VOICE BIOMETRICS WITHIN THE FAMILY: TRUST, PRIVACY

AND PERSONALISATION

Delphine Charlet

France Telecom

2 avenue Pierre Marzin, 22307 Lannion Cedex, France

Victor Peral Lecha

France Telecom

Building 10, Chiswick Business Park, 566 Chiswick High Road, London W4 5XS, United-Kingdom

Keywords:

Voice biometrics, speaker recognition, privacy, trust, personalisation, family voices evaluation.

Abstract:

Driven by an increasing need for personalising and protecting access to voice services, the France Telecom

R&D speaker recognition system has been used as a framework for experimenting with voices in a family

context. With the aim of evaluating this task, 33 families were recruited. Particular attention was given to 2

main scenarios: a name-based and a common sentence-based scenario. In each case, members of the family

pronounce their complete name or a common sentence respectively. Moreover, this paper presents a database

collection and ﬁrst experiments for family voices.

The results of this preliminary study show the particular difﬁculties of speaker recognition within the family,

depending on the scenario, the genre and age of the speaker, and the physiological nature of the impersonation.

1 INTRODUCTION

Privacy within a family is a sensible subject: each

member of a family may require some privacy,

but achieving privacy protection through traditional

means (such as pin code) may be resented by the

other members of the family. Indeed, traditional pri-

vacy protection is based on “what you know” infor-

mation, thus, to be effective, it needs to be hidden

from the others. So the use of traditional authenti-

cation password or PIN is inadequate with the global

feeling of mutual trust that often exists within the

family. Hence, speaker recognition appears to be an

interesting way for privacy protection since it is based

on “what-you-are” and not on “what-you-hide”. On

the other hand, the genetic links that exist between

parents and children as well as the same cultural and

geographical contexts they share may hamper speaker

recognition, as it is pointed by (van Leeuwen, 2003).

Moreover, children constitute a difﬁcult target for

speech recognition in general (Wilpon and Jacobsen,

1996).

As far as we know, no published study has been

done on this subject. Thus, for the purpose of this

preliminary study we collected a database of family

voices in order to perform technological evaluation of

the performances of speaker recognition in the con-

text of family voices.

2 APPLICATION DESIGN

We have deﬁned a standard application to evalu-

ate the feasability to personalise home telecoms ser-

vices for each member of a family, based on voice

recognition. It has been decided that personalisation

should be done through an explicit step of identiﬁca-

tion/authentication. Thus, if this step is explicit, it has

to be as short as possible. These ergonomic consider-

ations lead us to two technological choices:

• Identiﬁcation and authentication performed on the

same acoustical utterance.

• Text-dependent speaker recognition: as it achieves

acceptable performances for short utterances (less

than 2 seconds).

Two scenarios were deﬁned, depending on the con-

tent of the utterance pronounced by the speaker.

2.1 Name-based scenario

In this scenario, the user pronounces his/her ﬁrst and

last name to be identiﬁed and authenticated. The main

characteristics of such an application are:

• It is easy to recall.

• It is usually rather short.

• It enables deliberate impostor attempts: e.g. a

mother can claim to be her daughter.

330

Charlet D. and Peral Lecha V. (2005).

VOICE BIOMETRICS WITHIN THE FAMILY: TRUST, PRIVACY AND PERSONALISATION.

In Proceedings of the Second International Conference on e-Business and Telecommunication Networks, pages 330-334

DOI: 10.5220/0001407003300334

 SciTePress

As for each member of the family, the phonetic

content of the voiceprint is different, the identiﬁca-

tion process is performed on both the voice and the

phonetic context, although there is no explicit name

recognition process. Thus it can be expected that the

identiﬁcation error rate will be negligeable. On the

other hand, as the authentication is performed on a

short utterance (e.g. 4 syllables), the authentication

performances are expected to be rather poor.

2.2 Common sentence-based

scenario

In this scenario, the user pronounces a sentence,

which is common to all the members of the family.

The main characteristics of such an application, com-

pared to the name-based scenario are:

• It is less easy to recall: the sentence should be

prompted to the user, if he does not remember.

• The length of the sentence may be longer than a

simple name.

• It prevents deliberate impostor attempts: e.g. a

mother can not claim to be her daughter.

As for each member of the family the phonetic con-

tent of the voiceprint is the same, the identiﬁcation

process is performed only on the voice. Thus it can

be expected that the identiﬁcation error rate will be

higher than in the context of name-based recognition.

On the other hand, as the authentication is performed

on a longer sentence (10 syllables), the authentication

performances are expected to be better than those of

the name-based scenario.

3 DATABASE DESIGN

3.1 Family proﬁle

The families were required to be composed of 2 par-

ents and 2 children. The children are older than 10.

They all live in the area of Lannion, where this work

was conducted. 33 families were recruited: 19 with

one son and one daughter, 10 with 2 sons and 4 with 2

daughters. They were asked to perform 10 calls from

home, with their landline phone, during a period of

one month. Hence, factors such as voice evolution

over time or sensitivity to call conditions are not stud-

ied in this work.

3.2 Name-based scenario

In this scenario, the key point is the fact that there

might be deliberate impostor attempts on a target

speaker, as the user claims an identity to be recog-

nised.

3.2.1 Training

For each member of a family, training is performed

with 3 repetitions of his/her complete name (ﬁrst +

last name). This number of repetitions represents a

good trade-off between performances and tediousness

of the task.

3.2.2 Within family attempts

Each member of a family is asked to perform attempts

on his own name, and also attempts on the name of

each of the other members of his family.

3.2.3 External impostor attempts

A set of impostor attempts from people who do not

belong to the family is collected. The external im-

postors are composed of members of other families

who pronounced the name of the member of the tar-

get family.

3.2.4 Collected Database

As we focus on speaker recognition, we only retained

the utterances where the complete name was correctly

pronounced.

• 16 families completed their training phase.

• 13 families completed the testing phase also.

• 672 true speaker attempts collected for the 13 fam-

ilies, that makes an average of 52 true speaker

attempts per family, thus an average of 13 true

speaker attempts per user.

• 582 within family impostor attempts collected for

the 13 families, that makes an average of 45 within

family impostor attempts per family, thus an aver-

age of 11 within family impostor attempts per user.

• 2173 external impostor attempts (impostor at-

tempts are performed on all the families who have

completed the training phase)

3.3 Common sentence-based

scenario

In this scenario, the key point is the fact that there

can not be deliberate impostor attempts on a target

speaker, as the user does not claim identity to be

recognised.

3.3.1 Training

For each member of a family, training is performed

with 3 repetitions of the common sentence.

VOICE BIOMETRICS WITHIN THE FAMILY: TRUST, PRIVACY AND PERSONALISATION

331

3.3.2 Within family attempts

Each member performs attempts by pronouncing the

common sentence.

3.3.3 External impostor attempts

The attempts of other families are used to perform ex-

ternal impostor attempts.

3.3.4 Collected Database

As we focus on speaker recognition, we only retain

the utterances where the sentence was correctly pro-

nounced.

• 25 families completed their training phase.

• 17 families completed the testing phase also.

• 614 within family attempts collected for the 17

families, that makes an average of 36 attempts per

family, thus an average of 9 attempts per user.

• 13549 external impostor attempts: all within family

attempts are used to perform impostor attempts on

the other families (impostor attempts are performed

on all the families who have completed the training

phase).

4 EXPERIMENTS

4.1 Speaker recognition system

For these experiments, the text-dependent speaker

recognition system developed in France Telecom

R&D is used. It basically relies on HMM modelling

on cepstral features, with a special care on variances,

with a speaker-independant contextual phone loop as

reject model (Charlet et al., 1997). The task is open-

set speaker identiﬁcation. There was no particular

tuning of the system to this new task. The rejection

thresholds are set a posteriori, common to all fami-

lies.

4.2 Typology of errors

4.2.1 Name-based scenario

In this scenario, the identity claim is taken into ac-

count in the typology of errors. Within the family,

when the speaker A claims to be speaker A (pro-

nouncing the name of speaker A), the system can:

• accept the speaker as being speaker A: correct ac-

ceptance (CA)

• accept the speaker as being speaker B: false identi-

ﬁcation error (FI)

• reject the speaker: false rejection error (FR)

Moreover, in this scenario, within the family, a

speaker can make deliberate impostor attempts on a

target speaker. When the speaker A claims to be

speaker B, the system can:

• reject the speaker: correct rejection on internal im-

personation (CRII)

• accept the speaker as being speaker B: wanted false

acceptance on claimed speaker (WFA)

• accept the speaker as being speaker A: unwanted

correct acceptance (UCA)

• accept the speaker as being speaker C: unwanted

false acceptance (UFA)

• Internal False Acceptance is deﬁned as: IFA =

WFA + UFA

Outside the family, when an impostor claims to be

speaker A, the system can:

• reject the speaker: correct external rejection (CER)

• accept the speaker as being speaker A: external

false acceptance on claimed speaker (EWFA)

• accept the speaker as being speaker B: external un-

wanted false acceptance (EUFA)

• External False Acceptance is deﬁned as: EFA =

EWFA + UFA

4.2.2 Common sentence-based scenario

This is the classical typology of open-set speaker

identiﬁcation. Within the family, when the speaker

A pronounces the common sentence, the system can:

• accept the speaker as being speaker A: correct ac-

ceptance (CA)

• accept the speaker as being speaker B: false identi-

ﬁcation error (FI)

• reject the speaker : false rejection error (FR)

Outside the family, when an impostor pronounces

the common sentence, the system can:

• accept the speaker as being speaker A: external

false acceptance error (EFA)

• reject the speaker: correct external rejection (CER)

5 RESULTS

5.1 name-based scenario

Figure 1 plots EFA, IFA and FI as a function of FR

for the name-based system.

For the particular operating point where {

FR=7.9%; FI=0.4%; IFA=21.1%; EFA=8.1%}, let us

see the details of errors:

ICETE 2005 - SECURITY AND RELIABILITY IN INFORMATION SYSTEMS AND NETWORKS

332

7 8 9 10 20 30

false rejection rate

0.4

0.7

1.0

2.0

4.0

7.0

10.0

20.0

error rate

IFA

EFA

Figure 1: Name-based speaker identiﬁcation within the

family

• FI=0.4% is low and due to 2 types of surprising

errors: a man identiﬁed as his wife and a 10-year-

old girl identiﬁed as his father.

• UCA=20.6% : when a member within the family

claims to be another member, he is “unmasked”

(i.e. truly identiﬁed) in 20% of the cases. This is

mainly due to the fact that, as the voiceprint of the

members of the family shares important part of the

phonetic content (same last name), an attempt with

another ﬁrst name may match reasonably well the

voiceprint of the speaker if the ﬁrst names are not

too different.

• IFA=21.1% - EFA=8.1% : the rate of IFA (false ac-

ceptance from deliberate impostor attempts within

the family) is more than 2.5 times as much as EFA

(from deliberate impostor outside the family, with

the same age/gender distribution as the impostors

within the family). Although we cannot quantify

what is due to line effects from what is due to phys-

iological resemblance, we can note that the system

is much more fragile to impostor within the same

family than from outside the family.

IFA is decomposed into WFA=18.0% and

UFA=3.1%. Table 1 presents the false acceptance

rate for deliberate impersonation (WFA : when

speaker A claims the name of speaker B and is

accepted as being speaker B) within the family, for

the different types of speaker. “Son” can do impostor

attempts on “son” when they are brothers. In the

same way, “daughter” can do impostor attempts on

“daughter” when they are sisters.

We analyze each type of speaker (father, mother,

daughter, son) with respect to their ability to defeat

the system and to their “fragility” to impostor at-

tempts (Doddington et al., 1998). In the table, be-

tween brackets are given the numbers of tests of each

type of imposture, in order to give an idea of the reli-

ability of the results.

Table 1: false alarm rate according to the type of impostor

and target speaker

Impostor target speaker

speaker father mother son daughter

father – 11.8 12.3 2.2

[51] [57] [45]

mother 0 – 14.8 18.6

[52] [54] [43]

son 7.4 29.8 29.4 53.3

[54] [47] [17] [30]

daughter 0 32.6 50 50

[45] [43] [36] [8]

From the table, the following remarks can be done:

• the father, as an impostor, gets a moderate and

equal success rate on his wife and his son, and a low

success on his daughter. As a target, he is “fragile”

against his son at a moderate level.

• the mother, as an impostor, gets a better success

rate than her husband on her children, and a com-

plete failure on her husband. The difference of suc-

cess rate between the son target and the daughter

target is not high. As a target, she is fragile against

her husband at a moderate level, and she is equally

fragile at a high level against her children.

• the son, as an impostor, has a low success rate on

his father, a high level on his mother and brother

and a very high level on his sister (the difference

between brother and sister success rate might not

be signiﬁcant because of the small number of at-

tempts in the case of 2 brothers attempts). As a

target, he is moderately fragile against his parents

and highly against his sister.

• the daughter, as an impostor, has a high success rate

on her mother and a very high success on her sister

or brother, and a complete failure on her father. As

a target, she is not fragile against her father, mod-

erately against her mother and very highly against

her sister or brother.

5.2 common sentence-based scenario

Figure 2 shows the global performances of the com-

mon sentence-based system, for different operating

points.

For a particular operating point where globally

{FR=6.1%; FI=1.2%; EFA=5.8%;}, analyzing the re-

VOICE BIOMETRICS WITHIN THE FAMILY: TRUST, PRIVACY AND PERSONALISATION

333

2 4 6 8 10 20

false rejection rate

0.4

0.7

1.0

2.0

4.0

7.0

10.0

20.0

error rate

EFA

Figure 2: Common sentence-based speaker identiﬁcation

within the family

sults family per family, we observe that over the 17

families who made identiﬁcation attempts:

• 13 families have no false identiﬁcation : FI=0%

• 3 families get a false identiﬁcation rate : FI=2-2.5%

(actually 1 observed error)

– family #13: father identiﬁed as his 17-year-old

son

– family #16: 14-year-old girl identiﬁed as her

mother

– family #20: mother identiﬁed as her 20-year-old

girl

• 1 family get a false identiﬁcation rate of FI=7% due

to one unique cause of error: a 12-year-old girl was

identiﬁed as her mother

Speaker identiﬁcation within a family appears to be

easily achievable for a majority of the families tested

(13/17). For the remaining families, where errors

were observed, the errors are not frequent and always

concern a speciﬁc pair parent/same-sex teenager.

6 DISCUSSION

In the name-based scenario, there is deliberate imper-

sonation within the family. As the speech utterance

to perform speaker authentication on is very short

(on average 4 syllables) performances are very poor.

However, the analysis of the differences of perfor-

mances according to the type of impostor and target

speaker is interesting. It shows that the most “fragile”

to impersonation are the children, between them, and

that they are also the most effective impostors.

In the common sentence-based scenario, there can-

not be deliberate impersonation, as there is no identiy

claim. Except for some rare cases where there is an

identiﬁcation error, it seems to be the most effective

and ergonomic way to perform speaker recognition

within the family. Considering the case of identiﬁca-

tion errors, as they always occured on the same pair,

it should be possible to develop special training pro-

cedure to prevent them.

7 CONCLUSION

In this paper, we have presented a database of fam-

ily voices collected for voice recognition within the

family. Two scenarios are studied. The number of

recorded attempts is enough to perform experiments

and draw ﬁrst conclusions about the particular difﬁ-

culties of speaker recognition within the family.

Despite the novel nature of the application, voice

biometrics seems the most natural way to manage the

privacy and trust issues when accessing to voice ser-

vices in a family context. Further research will at-

tempt to improve the reliability of the system when

dealing with the voices of children.

REFERENCES

Charlet, D., Jouvet, D., and Collin, O. (1997). Speaker veri-

ﬁcation with user-selectable password. In Proc. of the

COST 250 Workshop, Rhodos, Greece.

Doddington, G., Liggett, W., Martin, A., Przybocki, M.,

and Reynolds, D. (1998). Sheeps, goats, lambs and

wolves, a statistical analysis of speaker performance

in the nist 1998 speaker recognition evaluation. In

Proc. of the ICSLP 1998, Sydney, Australia.

van Leeuwen, D. (2003). Speaker veriﬁcation systems and

security considerations. In Proc. of Eurospeech 2003,

pages 1661–1664, Geneva, Switzerland.

Wilpon, J. and Jacobsen, C. (1996). A study of speech

recognition for children and the elderly. In Proc. of

the ICASSP 1996, pages 349–352, Atlanta, USA.

ICETE 2005 - SECURITY AND RELIABILITY IN INFORMATION SYSTEMS AND NETWORKS

334