Data-Centric Optimization of Enrollment Selection in Speaker Identification

Long-Quoc Le 1,2,a and Minh-Nhut Ngo 1,2,∗,b

1 Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam

2 Vietnam National University, Ho Chi Minh City, Vietnam

Keywords: Data-Centric, Speaker Identification, Enrollment Selection.

Abstract: In this paper, we introduce a novel method for optimizing enrollment selection in speaker identification systems, with a particular focus on low-resource languages. Unlike traditional approaches that rely on random enrollment samples, our method systematically analyzes pairwise similarities between enrollment utterances to eliminate poor-quality samples, which are often affected by noise or adverse environments. By retaining only high-quality and representative utterances, we obtain a more robust speaker profile. Applied to the Vietnam-Celeb dataset with the state-of-the-art ECAPA-TDNN model, the approach delivers substantial performance improvements: with 10 enrollment samples, accuracy rises from 73.38% in the Bad Case scenario to 93.62%, and the F1-score rises from 72.91% to 95.48%, demonstrating the effectiveness of quality-driven enrollment selection even in low-resource contexts.

1 INTRODUCTION

Speech communication has become an increasingly

popular interface in virtual assistants, especially with

advancements in large language models that enable

understanding of high-level knowledge. Speaker

recognition has garnered signiﬁcant attention as it en-

hances speech communication by providing added

functionality and security. By enabling speaker

recognition, virtual assistants and smart interaction

systems can respond more naturally and customize

interactions for speciﬁc users, improving the overall

user experience (Mohd Hanifa et al., 2021). Addi-

tionally, speaker recognition strengthens security by

preventing unauthorized users from executing critical

commands.

Prior to the deep learning era, most speaker recog-

nition systems relied on i-vector based models (De-

hak et al., 2010), which utilized Mel-Frequency Cep-

stral Coefﬁcients (MFCC) and universal background

models (UBM) built with Gaussian Mixture Models

(GMM). These i-vector approaches projected speaker

information from a high-dimensional UBM space

into a lower-dimensional speaker space. However, i-vector models suffered from limited performance due to their reliance on handcrafted features, which struggled to capture the complex variations in human voice characteristics, especially under challenging conditions.

a https://orcid.org/0009-0007-9838-2260

b https://orcid.org/0009-0001-1185-2394

∗ Corresponding author.

The advent of deep learning has since driven

remarkable advancements in speaker recognition per-

formance, with early deep neural embedding-based

models such as x-vectors (Snyder et al., 2018)

marking a notable leap from traditional i-vector ap-

proaches. The x-vector model pioneered the use

of deep neural networks for generating speaker embeddings, laying the groundwork for later innovations. Building on the Time Delay Neural Network (TDNN) (Peddinti et al., 2015), a frame-level feature extractor, the ECAPA-TDNN architecture (Dawalatabad et al., 2021) introduced refined feature extraction layers and, when combined with the ArcFace loss function (Deng et al., 2019), achieved an Equal Error Rate (EER) of 0.87% on the VoxCeleb1 test set (Zeinali et al., 2019), a considerable improvement in accuracy. In addition, ResNet

architectures (He et al., 2016), adapted speciﬁcally for

speaker recognition, have demonstrated remarkable

performance by utilizing their powerful feature ex-

traction capabilities. Various ResNet conﬁgurations

have yielded impressive outcomes. For instance, the

Thin ResNet-34 model (Chung et al., 2019), paired with the Angular Prototypical loss function (Chung et al., 2020), achieved an EER of 2.21% on the VoxCeleb1 test set. Further pushing the boundaries, RawNet3 (Jung et al., 2022) utilized raw audio signals directly through 1D convolutional layers, attaining an EER of 0.89% on VoxCeleb1. These models underscore the remarkable evolution and effectiveness of end-to-end deep learning architectures in speaker recognition.

Le, L.-Q. and Ngo, M.-N.

Data-Centric Optimization of Enrollment Selection in Speaker Identification.

DOI: 10.5220/0013256800003905

In Proceedings of the 14th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2025), pages 344-351

ISBN: 978-989-758-730-6; ISSN: 2184-4313

Although deep learning models have advanced speaker recognition, significant challenges persist. Speaker recognition systems face the common challenge of capturing the natural diversity in human vocal characteristics, including variations in tone, pitch, and speaking style within the same individual. This challenge becomes especially pronounced in low-resource languages, where data scarcity limits the ability to create comprehensive speaker representations.

To address this issue, we propose a data-centric

approach that emphasizes selecting high-quality and

representative enrollment samples, especially tailored

for low-resource languages. Our method prioritizes

samples that accurately capture the speaker’s dom-

inant vocal characteristics and eliminates those af-

fected by noise or distortions. This strategy is partic-

ularly crucial in low-resource settings, where limited

data prevents comprehensive coverage of all vocal

traits. Instead, our approach ensures a well-deﬁned

speaker proﬁle by selecting only the most consistent

and reliable vocal traits, which in turn, enhances the

system’s robustness and reliability for speaker identi-

ﬁcation tasks.

In this work, we introduce an empirical method

to optimize enrollment sample selection for speaker

identiﬁcation, aimed at maximizing the effectiveness

of limited data in low-resource settings. Our experi-

ments demonstrate that selecting high-quality enroll-

ment samples leads to signiﬁcant performance im-

provements in speaker identiﬁcation. The proposed it-

erative selection method identiﬁes and removes poor-

quality samples, resulting in a high-conﬁdence en-

rollment set. By conﬁguring the total number of en-

rollment samples, our approach also allows for cus-

tomization based on application needs.

The rest of this paper is organized as follows.

Section 2 discusses the existing literature and related

work in models of speaker recognition. Section 3 de-

scribes the proposed method of optimizing selection

of enrolling utterances in speaker identiﬁcation, illus-

trating our idea, explaining the underlying principles,

and detailing the key steps. Section 4 presents the

experimental setup and conﬁgurations, the results of

evaluations, and analyzes the results. Section 5 gives

some concluding remarks and suggests some direc-

tions for future work.

2 RELATED WORK

Before the era of deep learning, speaker recogni-

tion models predominantly used the i-vector method

(Dehak et al., 2010), which extracted speaker fea-

tures based on Mel-Frequency Cepstral Coefﬁcients

(MFCC) and universal background models (UBM).

In this approach, a Gaussian Mixture Model (GMM)

was employed to map the high-dimensional UBM

space to a lower-dimensional i-vector space, provid-

ing a compact representation of each speaker. Despite

its utility, the i-vector approach was limited by its vul-

nerability to variations in speaking style, background

noise, and other environmental conditions. These

limitations reduced its robustness in real-world ap-

plications, as it struggled to consistently differenti-

ate speakers across diverse and unpredictable acoustic

environments.

The shift to deep learning introduced a new era

of speaker recognition architectures, beginning with

x-vectors (Snyder et al., 2018), a deep neural net-

work (DNN) embedding model designed for text-

independent speaker recognition. The x-vector model

consists of three main components: a Time Delay

Neural Network (TDNN) (Peddinti et al., 2015) for

frame-level feature extraction from MFCC inputs, a

statistics pooling layer that aggregates segment-level

statistics, and a soft-max output layer trained with

cross-entropy loss to classify speakers. This archi-

tecture laid the foundation for further advancements

in speaker embedding. Building on this structure,

the ECAPA-TDNN architecture (Dawalatabad et al.,

2021) introduced reﬁned feature extraction layers,

further enhancing accuracy, particularly when paired

with the ArcFace loss function (Deng et al., 2019),

achieving an EER of 0.87% on the VoxCeleb1 test set

(Zeinali et al., 2019).

ResNet (He et al., 2016), initially popularized in

computer vision, has also become a prominent archi-

tecture in speaker recognition. Unlike its use in im-

age tasks, ResNet in audio processing is customized

to work with speech spectrograms, capturing speaker-

speciﬁc patterns effectively. Several ResNet conﬁgu-

rations, such as Thin ResNet-34 (Chung et al., 2019)

combined with the Angular Prototypical loss function

(Chung et al., 2020), have demonstrated impressive

performance, with Thin ResNet-34 achieving an EER

of 2.21% on the VoxCeleb1 test set.

Further advancements in deep learning for speaker

recognition include the use of raw audio data, exemplified by RawNet3 (Jung et al., 2022). This model

employs 1D convolutional layers to directly process

raw audio signals, eliminating the need for spectro-

gram conversion and enabling the capture of more

granular acoustic features. Combined with the Arc-

Face loss, RawNet3 achieved an EER of 0.89% on

the VoxCeleb1 test set, underscoring the potential of

end-to-end deep learning models in speaker recogni-

tion.

Several adaptation and normalization techniques have been proposed to deal with limited enrollment data (Kimball et al., 1997) and with mismatch between training, enrollment, and test conditions (Mak et al., 2006; Glembek et al., 2014; Wang et al., 2018; Aronowitz, 2014; Li et al., 2022). Other approaches address how utterances are enrolled for later use in verification or identification. Li et al. (2024) proposed an augmentation technique applied to enrollment utterances that yields consistent performance improvements. Mingote et al. (2020) directly trained a speaker enrollment model for each speaker by leveraging an embedding dictionary stored during the training phase in the last layer of a deep neural network. The verification scores are obtained directly from the speaker enrollment models without using another comparison metric.

Our method prioritizes the selection of high-

quality samples that accurately reﬂect the speaker’s

dominant vocal traits, while ﬁltering out those af-

fected by noise or inconsistencies. By capturing the

most consistent and reliable vocal characteristics, we

create an effective and representative speaker proﬁle

for recognition tasks.

3 PROPOSED METHOD

Figure 1 illustrates the 3D representation of embed-

dings for seven randomly selected speakers from the

Vietnam-Celeb (Thanh et al., 2023) dataset, after di-

mensionality reduction through the Principal compo-

nent analysis (PCA) (Kurita, 2019) algorithm. These

embeddings were generated using the ECAPA-TDNN

model (Dawalatabad et al., 2021), ﬁne-tuned on the

Vietnam-Celeb (Thanh et al., 2023) data. While there

is clear separation among most speakers, some points

remain indistinct, reﬂecting cases where embeddings

overlap. Further auditory analysis reveals that these

ambiguous samples are often of lower quality, likely

due to noise or variations in voice tone and pronun-

ciation that deviate from the speaker’s usual patterns.

These represent outlier embeddings that could beneﬁt

from ﬁltering in the data preprocessing phase.

Figure 1: Representation of Speaker Utterances.

Our proposed approach, which emphasizes the careful selection of high-quality samples, is particularly advantageous in low-resource languages, where it enables performance levels comparable to systems using extensive datasets. This efficiency makes the method well suited to contexts in which obtaining large datasets is challenging. The method does require tonal consistency during enrollment, which can lead to authentication rejections when a user’s voice tone varies significantly. We consider this trade-off acceptable in low-resource language contexts, because consistent enrollment data significantly improves performance and robustness without requiring large datasets.

Based on these observations, our method relies on

two key assumptions:

• Higher similarity among samples within the en-

rollment set likely indicates convergence toward

the speaker’s standard voice under optimal condi-

tions, minimizing variability in vocal traits.

• Increasing high-quality samples in the enroll-

ment set enhances system robustness, allowing

a smaller set of samples in low-resource lan-

guages to achieve comparable robustness to larger

datasets in high-resource languages.

Initially, we hypothesized that high-quality sam-

ples alone would be sufﬁcient. However, our ﬁnd-

ings showed that a more nuanced approach is neces-

sary. Thus, we propose a data-centric solution that

involves optimizing sample selection based on a ﬁne-

tuned positive threshold.

The positive threshold is tuned during model development to minimize the Equal Error Rate (EER) on the validation set (Section 4.1.3), and every pair of selected enrollment samples must have a similarity that meets or exceeds this threshold. This approach creates a consistent and representative enrollment set, reducing noise and environmental variability.

In our experiments, we also observed diminish-

ing returns with increasing sample quantity, reveal-

ing that more data does not necessarily improve per-

formance. We therefore propose selecting an opti-

mal sample quantity at which performance stabilizes,

achieving a balance between complexity and accu-

racy. In summary, our solution comprises three key

steps:

• Deﬁne a positive threshold during training to en-

sure high-quality, consistent samples.

• Ensure each sample pair in the enrollment set

meets or exceeds the positive threshold, forming

a homogeneous and representative set.

• Determine the optimal sample quantity to prevent

redundancy or noise, balancing performance and

complexity.

Figure 2 presents an example enrollment set con-

taining four utterances. When adding a new sample

(5), we calculate its similarity to each existing sam-

ple. Sample (5) is closely aligned with samples (1)

and (2), indicated by solid connections, while sample

(2) is also close to (1). These samples form a high-

similarity cluster, from which we can select (1), (2),

and (5) as candidates. However, sample (5) lacks sim-

ilarity with samples (3) and (4), represented by dashed

lines, and thus does not meet the positive threshold for

a cohesive enrollment set.

Figure 2: Example of Enrollment Selection Based on Pairwise Similarity.

These steps aim to create a robust enrollment pro-

cess that represents dominant vocal traits, enhanc-

ing system performance while resisting variations in

speech style and environmental factors.
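The selection idea illustrated in Figure 2 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `select_consistent_subset`, the precomputed similarity matrix `sim`, and the greedy ordering by "support" (how many other utterances each one agrees with) are all illustrative choices that the paper does not specify.

```python
def select_consistent_subset(sim, threshold, max_size):
    """Greedily pick utterance indices whose pairwise similarities all meet the threshold.

    sim is a symmetric matrix (list of lists) of similarity scores.
    Hypothetical helper: the ordering heuristic is an assumption, not the paper's rule.
    """
    n = len(sim)
    # Count, for each utterance, how many other utterances it is similar enough to.
    support = [
        sum(1 for j in range(n) if j != i and sim[i][j] >= threshold)
        for i in range(n)
    ]
    # Visit well-supported utterances first (sorted is stable, so ties keep index order).
    order = sorted(range(n), key=lambda i: -support[i])
    selected = []
    for i in order:
        if len(selected) == max_size:
            break
        # Admit a candidate only if it meets the threshold with every selected utterance.
        if all(sim[i][j] >= threshold for j in selected):
            selected.append(i)
    return sorted(selected)
```

With a toy matrix in which utterances 0, 1, and 4 are mutually similar and utterances 2 and 3 are not, the function keeps the first group and drops the rest, which mirrors the cluster of samples (1), (2), and (5) in the figure. A greedy pass does not guarantee the largest possible consistent set; it is only a cheap approximation.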

4 EXPERIMENTS

4.1 Dataset and Experimental Setup

In this study, we used the Vietnam-Celeb dataset

(Thanh et al., 2023), which consists of voice samples

collected from 1,000 distinct speakers. This dataset

was split into three subsets: 900 speakers for train-

ing, 50 speakers for validation, and 50 speakers for

testing. The primary goal of our experiments was to evaluate the effectiveness of our proposed enrollment selection method with the ECAPA-TDNN speaker recognition model (Dawalatabad et al., 2021), which was found to be efficient for Vietnamese (Ngo and Le, 2024).

We chose this dataset because Vietnamese is a

low-resource language, making it an ideal choice to

rigorously evaluate our algorithm’s effectiveness un-

der challenging conditions. By using Vietnamese, we

can accurately assess the algorithm’s performance in

optimizing enrollment selection and speaker recogni-

tion accuracy when data is inherently limited.

4.1.1 Data Quality Control and Mislabeled Samples

During the initial data inspection, we discovered that

a signiﬁcant number of samples in the test set were

mislabeled. This issue posed a potential threat to

the reliability of the experimental results, as inaccu-

rate labeling could lead to biased evaluation metrics.

Upon further investigation, we found that out of 7,351

utterances in the test set, 182 were mislabeled, which

accounts for 2.48% of the total test data.

To ensure the validity of the evaluation, we man-

ually re-labeled the entire test set, correcting the mis-

labeling errors. The re-labeling process was critical

for the integrity of the results, as it directly impacted

the accuracy of performance metrics like Equal Er-

ror Rate (EER). However, due to resource constraints,

we did not re-label the training and validation sets.

We believe that the impact of mislabeled data in the

training and validation sets is minimal, given that only

2.48% of the test set samples were mislabeled. This

relatively small percentage of mislabels is unlikely

to signiﬁcantly inﬂuence the model’s ability to learn.

The ECAPA-TDNN model (Dawalatabad et al., 2021)

is robust and can effectively generalize, allowing it to

focus on key speaker characteristics and ignore occa-

sional mislabeled samples. Furthermore, the model’s

resilience to mislabeled data is enhanced when work-

ing with larger datasets, as the model can still capture

meaningful patterns from the majority of correctly la-

beled data.

4.1.2 Embedding Model and Similarity Measurement

For embedding extraction, we used the ECAPA-TDNN model (Dawalatabad et al., 2021), a state-of-the-art architecture widely recognized for its superior performance in speaker recognition tasks. This

model was speciﬁcally designed to effectively cap-

ture speaker-speciﬁc features and handle variations

in speech signals, making it ideal for this task. The

ECAPA-TDNN model (Dawalatabad et al., 2021) was

chosen not only for its strong performance but also

because it was used in the original Vietnam-Celeb pa-

per (Thanh et al., 2023), ensuring consistency in com-

parison and enabling a fair evaluation of results across

different studies.

The ECAPA-TDNN model (Dawalatabad et al.,

2021) was trained using the 79,789 training samples,

which consist of voice data from 900 distinct speak-

ers. The model’s architecture was ﬁne-tuned during

the training process to optimize its ability to differ-

entiate between speakers based on their unique vocal

characteristics.

To measure similarity between speaker embed-

dings, we employed cosine similarity. Cosine simi-

larity calculates the cosine of the angle between two

embedding vectors, producing a score in the range

[−1, 1], where values closer to one indicate higher

similarity. This metric allows for a precise compari-

son of embeddings, as it quantiﬁes the alignment of

speaker-speciﬁc features in the enrollment and test

samples, contributing to accurate speaker identiﬁca-

tion.
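As a small illustration of the metric described above, cosine similarity can be written with the Python standard library alone. The function name and the error handling for zero-length vectors are assumptions added for the sketch; in practice the embeddings would come from the ECAPA-TDNN model rather than from hand-written lists.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors, in the range [-1, 1]."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Because the score depends only on the angle between the vectors, scaling an embedding by a positive constant leaves the similarity unchanged.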

4.1.3 Threshold Optimization and Evaluation

After calculating the similarity between two embed-

dings, a positive threshold is applied to determine

whether they belong to the same person. If the sim-

ilarity score between two embeddings meets or ex-

ceeds this threshold, they are classiﬁed as belonging

to the same individual; otherwise, they are considered

as coming from different individuals. Setting an ap-

propriate positive threshold is essential for balancing

the False Acceptance Rate (FAR), which measures the

rate of mistakenly accepting embeddings from differ-

ent individuals and the False Rejection Rate (FRR),

which represents the rate of incorrectly rejecting em-

beddings from the same individual.

To optimize this threshold, we ﬁne-tuned it using a

validation set of 50 speakers, aiming to minimize the

Equal Error Rate (EER) — the point where FAR and

FRR are equal. By iteratively adjusting the thresh-

old, we identiﬁed the value that balances these error

rates, thus reducing both types of errors and improv-

ing the system’s overall accuracy. This process en-

sures that the model is neither overly lenient (which

would increase FAR) nor too strict (which would in-

crease FRR), resulting in a robust and reliable classi-

ﬁcation.
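The threshold search described above can be sketched as follows. This is an illustrative sketch, not the authors' tuning code: it assumes two hypothetical lists of validation scores, `genuine` (same-speaker pairs) and `impostor` (different-speaker pairs), and it scans the observed scores as candidate thresholds, picking the one where FAR and FRR are closest.

```python
def far_frr(genuine, impostor, threshold):
    """Return (FAR, FRR) when scores >= threshold are accepted as the same speaker."""
    far = sum(s >= threshold for s in impostor) / len(impostor)  # impostors accepted
    frr = sum(s < threshold for s in genuine) / len(genuine)     # genuine pairs rejected
    return far, frr


def eer_threshold(genuine, impostor):
    """Return (threshold, eer_estimate) at the candidate where |FAR - FRR| is smallest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Estimate the EER as the midpoint of the two error rates at this threshold.
            best = (gap, t, (far + frr) / 2)
    return best[1], best[2]
```

On finite data FAR and FRR rarely coincide exactly, so the midpoint of the two rates at the closest threshold is used here as the EER estimate; other interpolation schemes are equally plausible.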

Once the optimal positive threshold was estab-

lished, we evaluated the model’s performance on a

carefully re-labeled test set to ensure data accuracy.

This ﬁnal evaluation provided a comprehensive as-

sessment of the model’s capability to correctly clas-

sify identities under realistic conditions. The model’s

performance was then compared with other state-of-

the-art methods, demonstrating the effectiveness of

our approach in terms of both accuracy and error

rates.

4.2 Experimental Conﬁgurations

Our experiments were divided into three main conﬁg-

urations, each designed to assess the impact of enroll-

ment data quality and consistency on speaker identiﬁ-

cation performance:

• Bad Case: In this conﬁguration, enrollment sam-

ples have low pairwise similarity, with values

falling below a predeﬁned positive threshold. This

threshold represents the minimum similarity score

needed to consider samples as consistent repre-

sentations of the speaker’s primary vocal charac-

teristics. The Bad Case simulates a worst-case

scenario, where enrollment samples are likely to

be impacted by noise, distortions, or inconsistent

speaker tones. Such poor-quality data introduces

variability and can compromise the reliability of

the speaker proﬁle, leading to decreased identiﬁ-

cation accuracy and increased error rates.

• Random Case: In the Random Case, enrollment

samples are selected without any consideration

for pairwise similarity, meaning samples are cho-

sen randomly from the dataset. This setup repre-

sents a typical real-world scenario where no spe-

ciﬁc enrollment strategy is applied, serving as a

baseline for comparison. Some samples may, by

chance, meet the positive threshold, while others

may fall below it, creating inconsistencies in data

quality. The mixed-quality dataset in this con-

ﬁguration reﬂects common, unsupervised enroll-

ment conditions and helps gauge how our method

compares against a standard, uncontrolled selec-

tion process.

• Optimal Case: In the Optimal Case, only sam-

ples that meet or exceed the positive threshold are

included in the enrollment set. This careful se-

lection process ensures a high level of quality and

consistency, producing a dataset that strongly rep-

resents the speaker’s dominant vocal characteris-

tics. By ﬁltering out samples that do not meet the

similarity criteria, this configuration aims to provide the most reliable speaker profile, with minimal noise or variability. As a result, the Optimal Case is expected to deliver the highest performance in speaker identification, highlighting the

beneﬁts of a controlled, similarity-based enroll-

ment strategy.
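One way to read the three configurations is as labels on a candidate enrollment set, decided by comparing every pair of its utterances with the positive threshold. The sketch below makes that reading concrete. It is an interpretation rather than the paper's procedure: the function name and the "mixed" and "undetermined" labels are assumptions, and the text does not say how sets containing a single utterance were assigned to a configuration.

```python
def classify_enrollment_set(sim, indices, threshold):
    """Label a candidate enrollment set by its pairwise similarities.

    sim is a symmetric similarity matrix; indices selects the utterances in the set.
    """
    pairs = [(i, j) for k, i in enumerate(indices) for j in indices[k + 1:]]
    if not pairs:
        return "undetermined"  # a single utterance has no pair to compare
    meets = [sim[i][j] >= threshold for i, j in pairs]
    if all(meets):
        return "optimal"  # every pair meets or exceeds the threshold
    if not any(meets):
        return "bad"      # every pair falls below the threshold
    return "mixed"        # what random selection will often produce
```

Under this reading, the Random Case is not a fourth label: a randomly drawn set can turn out to be "optimal", "bad", or "mixed" by chance.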

An illustration of the algorithm setup for all con-

ﬁgurations can be found in Figure 3. For each conﬁg-

uration, we systematically varied the number of en-

rollment samples from one to 10 to observe its effect

on performance metrics. This approach allowed us

to analyze how the quality and quantity of enrollment

data interact to inﬂuence accuracy and to determine

the optimal number of samples required to achieve

high identiﬁcation accuracy with a minimal enroll-

ment set size.

Figure 3: Illustration of the algorithm setup across configurations.

The experimental results, summarized in Table 1,

Table 2, and Table 3, demonstrate the substantial im-

pact of enrollment data quality on speaker identiﬁ-

cation performance across three conﬁgurations: Bad

Case, Random Case, and Optimal Case. For each con-

ﬁguration, we evaluated the model’s performance by

analyzing Accuracy, Precision, Recall, and F1-score

as the number of enrollment samples increased from

one to 10. For each sample size, we repeated the random selection of enrollment samples 30 times and calculated the mean and standard deviation (std) of the scores. A clear

trend is observed: the Optimal Case yields the high-

est performance across all metrics, substantiating the

effectiveness of our proposed solution.
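The repeated-trial protocol behind the mean and std columns can be sketched as below. The names `summarize_trials`, `pool`, and `evaluate` are hypothetical: `evaluate` stands in for whatever routine scores one enrollment draw (for example, identification accuracy on the test set). The sketch uses the sample standard deviation; the paper does not state whether the sample or population form was reported.

```python
import random
import statistics


def summarize_trials(pool, size, evaluate, iterations=30, seed=0):
    """Draw `size` enrollment utterances `iterations` times and summarize the scores.

    pool: candidate utterances for one configuration (any list).
    evaluate: callable mapping one drawn list of utterances to a numeric score.
    Returns (mean, sample standard deviation) over the iterations.
    """
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    scores = [evaluate(rng.sample(pool, size)) for _ in range(iterations)]
    return statistics.mean(scores), statistics.stdev(scores)
```

Fixing the seed is a design choice for reproducibility; varying it gives a quick check on how sensitive the reported spread is to the particular draws.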

4.3 Results and Analysis

In the Bad Case conﬁguration, where enrollment sam-

ples fall below the positive threshold, model per-

formance is notably reduced. Accuracy begins at

68.73% with one sample, increasing only marginally

to 73.38% with 10 samples, while high standard devi-

ations across metrics reveal unstable and inconsistent

outcomes. These results underscore the adverse ef-

fects of low-quality and inconsistent samples, which

introduce variability and noise, ultimately degrading

identiﬁcation accuracy.

Table 1: Performance Metrics for Different Enrollment Sample Sizes in Bad Case.

Size  Accuracy       Precision      Recall         F1-score
      Mean   Std     Mean   Std     Mean   Std     Mean   Std
1     0.687  0.016   0.868  0.028   0.646  0.019   0.694  0.018
2     0.700  0.020   0.887  0.034   0.658  0.022   0.705  0.021
3     0.708  0.023   0.899  0.035   0.665  0.024   0.712  0.023
4     0.715  0.024   0.900  0.033   0.673  0.026   0.719  0.024
5     0.720  0.024   0.896  0.033   0.677  0.026   0.722  0.023
6     0.724  0.024   0.892  0.033   0.681  0.025   0.724  0.022
7     0.727  0.023   0.887  0.033   0.684  0.024   0.726  0.021
8     0.730  0.023   0.881  0.035   0.687  0.024   0.727  0.020
9     0.732  0.023   0.877  0.036   0.689  0.024   0.729  0.019
10    0.734  0.022   0.873  0.036   0.691  0.023   0.729  0.019

The Random Case conﬁguration, representing

typical real-world conditions without speciﬁc enroll-

ment criteria, achieves moderate improvements over

the Bad Case. Accuracy increases from 84.88% with

one sample to 89.38% with 10 samples; however,

these values remain consistently lower than those in

the Optimal Case. F1-score and other metrics sim-

ilarly improve as sample quantity increases, but rel-

atively high standard deviations indicate limited ro-

bustness compared to the Optimal Case conﬁguration.

This baseline scenario illustrates that random sample

selection, while beneﬁcial, lacks the consistency and

quality necessary for optimal performance.

Table 2: Performance Metrics for Different Enrollment Sample Sizes in Random Case.

Size  Accuracy       Precision      Recall         F1-score
      Mean   Std     Mean   Std     Mean   Std     Mean   Std
1     0.849  0.022   0.943  0.019   0.840  0.026   0.868  0.026
2     0.863  0.024   0.954  0.019   0.855  0.027   0.882  0.026
3     0.872  0.024   0.959  0.018   0.864  0.026   0.891  0.026
4     0.878  0.024   0.962  0.017   0.871  0.026   0.897  0.025
5     0.882  0.024   0.965  0.016   0.875  0.026   0.900  0.024
6     0.886  0.023   0.966  0.016   0.878  0.025   0.903  0.024
7     0.888  0.023   0.967  0.015   0.881  0.024   0.906  0.023
8     0.890  0.022   0.969  0.015   0.883  0.024   0.908  0.022
9     0.892  0.022   0.969  0.014   0.885  0.024   0.909  0.022
10    0.894  0.021   0.970  0.014   0.887  0.023   0.911  0.022

The Optimal Case conﬁguration, where enroll-

ment samples meet or exceed the positive threshold,

consistently demonstrates the highest results across all metrics. With only one sample, the model attains

an Accuracy of 92.05%, which further improves to

93.62% with 10 samples. The F1-score shows a sim-

ilar progression, increasing from 94.15% to 95.48%,

with low standard deviations reﬂecting high reliabil-

ity and robustness. These ﬁndings conﬁrm that pri-

oritizing high pairwise similarity in enrollment data

produces a stable and accurate speaker proﬁle, effec-

tively mitigating the impact of noise and variability.

Table 3: Performance Metrics for Different Enrollment Sample Sizes in Optimal Case.

Size  Accuracy       Precision      Recall         F1-score
      Mean   Std     Mean   Std     Mean   Std     Mean   Std
1     0.921  0.008   0.966  0.005   0.927  0.008   0.942  0.006
2     0.926  0.009   0.970  0.006   0.932  0.008   0.947  0.007
3     0.930  0.009   0.973  0.006   0.936  0.009   0.950  0.008
4     0.932  0.009   0.975  0.005   0.937  0.008   0.951  0.007
5     0.933  0.008   0.974  0.005   0.939  0.008   0.952  0.007
6     0.934  0.008   0.975  0.005   0.940  0.008   0.953  0.007
7     0.935  0.009   0.976  0.005   0.940  0.007   0.954  0.006
8     0.935  0.008   0.976  0.005   0.941  0.007   0.954  0.006
9     0.936  0.007   0.976  0.005   0.941  0.007   0.955  0.006
10    0.936  0.007   0.977  0.005   0.941  0.006   0.955  0.006

Across all the conﬁgurations, increasing the num-

ber of enrollment samples generally enhances perfor-

mance; however, the rate of improvement diminishes

beyond 5 samples, particularly in the Optimal Case.

This suggests that, above a certain threshold, addi-

tional samples contribute minimally to performance,

especially when data quality is already high, conﬁrm-

ing that sample quality is more critical than quantity

in enrollment selection.
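The diminishing returns can be checked directly against the mean accuracy columns of Tables 1 to 3. The short script below is only a worked calculation on those reported values; the dictionary layout and the helper name `gain` are illustrative.

```python
# Mean accuracy for 1..10 enrollment samples, copied from Tables 1-3.
ACCURACY = {
    "bad": [0.687, 0.700, 0.708, 0.715, 0.720, 0.724, 0.727, 0.730, 0.732, 0.734],
    "random": [0.849, 0.863, 0.872, 0.878, 0.882, 0.886, 0.888, 0.890, 0.892, 0.894],
    "optimal": [0.921, 0.926, 0.930, 0.932, 0.933, 0.934, 0.935, 0.935, 0.936, 0.936],
}


def gain(values, start_size, end_size):
    """Accuracy gained when growing the enrollment set from start_size to end_size."""
    return round(values[end_size - 1] - values[start_size - 1], 3)


for case, values in ACCURACY.items():
    # Compare the gain over the first five samples with the gain over the next five.
    print(case, gain(values, 1, 5), gain(values, 5, 10))
```

In the Optimal Case the gain is 0.012 from one to five samples and 0.003 from five to 10, so roughly four fifths of the total improvement is obtained by the fifth sample.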

Although a large number of utterances were used

for evaluations in our experiments, in practice, we can

iteratively choose enrolling utterances that maximize

pairwise similarity instead of random utterances. This

process can be repeated over several iterations until

we get a suitable number of qualiﬁed utterances, e.g.,

ﬁve utterances.
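A possible form of this practical, iterative selection is sketched below. It is an assumption-laden illustration rather than the procedure used in the experiments: the function name `pick_most_similar` and the rule "add the utterance whose lowest similarity to the chosen ones is highest" are choices made for the sketch, and `sim` is again a precomputed symmetric similarity matrix.

```python
def pick_most_similar(sim, k):
    """Greedily choose k utterances with high mutual similarity.

    Seeds with the most similar pair, then repeatedly adds the utterance whose
    lowest similarity to the already chosen ones is highest.
    """
    n = len(sim)
    if k < 2 or k > n:
        raise ValueError("k must be between 2 and the number of utterances")
    # Seed: the pair with the highest similarity score.
    _, i, j = max((sim[a][b], a, b) for a in range(n) for b in range(a + 1, n))
    chosen = [i, j]
    while len(chosen) < k:
        rest = [c for c in range(n) if c not in chosen]
        # The weakest link to the chosen set decides how well a candidate fits.
        best = max(rest, key=lambda c: min(sim[c][s] for s in chosen))
        chosen.append(best)
    return sorted(chosen)
```

A deployment would typically also compare the weakest link against the positive threshold and ask the user for a new recording when no remaining candidate qualifies; that check is omitted here to keep the sketch short.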

4.4 Summary of Experimental Findings

Our experimental ﬁndings underscore the critical role

of data quality in enrollment selection, with clear ev-

idence that high pairwise similarity among enroll-

ment samples signiﬁcantly boosts speaker identiﬁca-

tion performance. The Optimal Case, which empha-

sizes selecting samples that meet a predeﬁned pos-

itive similarity threshold, consistently outperformed

the Bad Case and Random Case. Speciﬁcally, with

5 enrollment samples, the Optimal Case achieved an

F1-score of 95.21%, compared to 72.20% in the Bad

Case and 90.02% in the Random Case. These results

demonstrate that carefully curated, high-quality sam-

ples are essential for creating robust speaker proﬁles.

In addition to quality, our ﬁndings indicate that in-

creasing the number of enrollment samples can im-

prove performance, but only up to a certain point.

The rate of improvement diminishes beyond 5 sam-

ples, particularly in the Optimal Case, suggesting that

a modest number of high-quality samples is sufﬁ-

cient for reliable identiﬁcation. This observation con-

ﬁrms that sample quality is more critical than quan-

tity, as adding more samples beyond a certain thresh-

old yields minimal beneﬁts.

Based on these insights, the most effective ap-

proach for enrollment selection is to maintain a mod-

est sample size of approximately 5 high-quality ut-

terances per user, with each sample meeting the pos-

itive threshold. This solution optimizes the balance

between performance and resource efﬁciency, deliv-

ering a robust speaker identiﬁcation system that max-

imizes accuracy with minimal data. Such a strategy is

especially valuable in low-resource scenarios, where

data quality is prioritized over quantity to achieve op-

timal results.

5 CONCLUSION

In this paper, we presented a data-centric approach

for optimizing the enrollment selection process in

speaker identiﬁcation systems, with a particular focus

on low-resource languages such as Vietnamese. Our

proposed method emphasizes the importance of se-

lecting high-quality and representative samples dur-

ing the enrollment phase to mitigate the challenges

posed by variability in voice characteristics and envi-

ronmental factors. This approach is especially effec-

tive in low-resource settings, where the availability of

large and diverse datasets is limited, making it difficult

to capture all aspects of a speaker’s vocal characteristics.

Through a series of experiments, we demonstrated

that by ﬁltering out low-quality and inconsistent sam-

ples, we can create robust speaker proﬁles that en-

hance the accuracy and reliability of the speaker iden-

tiﬁcation system. Our method consistently outper-

formed random and bad enrollment selection strate-

gies, showing signiﬁcant improvements in key per-

formance metrics such as accuracy and F1-score.

The results validate the effectiveness of leveraging a

small set of high-quality samples to achieve performance

comparable to that of systems that require much larger

datasets in high-resource scenarios.

Additionally, we acknowledge a trade-off in our

approach: the system requires consistency in the

user’s voice tone between enrollment and authentication.

While this constraint may limit flexibility, it

ICPRAM 2025 - 14th International Conference on Pattern Recognition Applications and Methods


signiﬁcantly enhances the robustness of the system,

which is particularly important in low-resource lan-

guages.

Overall, our ﬁndings highlight the potential of a

data-centric approach to overcome the challenges in-

herent in low-resource speaker recognition, paving

the way for more efﬁcient and effective systems in

this domain. Future work may explore further opti-

mizations in the enrollment process and investigate

how additional techniques, such as data augmenta-

tion, can be applied to further improve performance

in low-resource settings.

