Towards High-Fidelity ECG Generation: Evaluation via Quality Metrics and Human Feedback

Maria Russo [1,a], Joana Rebelo [1,b], Nuno Bento [1,c] and Hugo Gamboa [1,2,d]

[1] Fraunhofer Portugal AICOS, Rua Alfredo Allen 455/461, 4200-135 Porto, Portugal

[2] Laboratório de Instrumentação, Engenharia Biomédica e Física da Radiação (LIBPhys-UNL), Departamento de Física, Faculdade de Ciências e Tecnologia da Universidade Nova de Lisboa, Monte da Caparica, 2829-516 Caparica, Portugal

Keywords:

ECG Synthesis, Deep Generative Models, Synthetic Data Evaluation, Human Feedback.

Abstract:

Access to medical data, such as electrocardiograms (ECGs), is often restricted due to privacy concerns and

data scarcity, posing challenges for research and development. Synthetic data offers a promising solution

to these limitations. However, ensuring that synthetic medical data is both realistic and clinically relevant

requires evaluation methods that go beyond general quality metrics. This study aims to overcome such chal-

lenges by advancing high-ﬁdelity ECG data generation and evaluation, presenting an approach for generating

realistic ECG signals using a diffusion model and introducing a novel evaluation metric based on a deep learn-

ing evaluator model. The state-of-the-art Structured State Space Diffusion (SSSD-ECG) model was reﬁned

through hyperparameter optimization, and the ﬁdelity of the generated signals was assessed using quantitative

metrics and expert feedback. Complementary evaluations of diversity and utility ensured a comprehensive

assessment. The evaluator model was developed to classify individual synthetic ECG signals into four quality

classes and was trained on a custom-developed quality dataset designed for the generation of 12-lead ECG

signals. The results demonstrate the successful generation of high-fidelity ECG data, validated by evaluation metrics and expert feedback. Correlation studies confirmed an alignment between the evaluator model and fidelity metrics, highlighting its potential as a valid tool for quality assessment.

1 INTRODUCTION

Electrocardiograms (ECGs) are a cornerstone of car-

diovascular diagnostics, offering vital insights into the

electrical activity of the heart and playing a crucial

role in detecting a broad spectrum of cardiac condi-

tions (Di Costanzo et al., 2024). The accuracy and

reliability of these diagnoses depend heavily on ac-

cess to high-quality ECG data. However, the acquisi-

tion of real recordings is often constrained by privacy

concerns and data scarcity (Monachino et al., 2023).

To address these limitations, deep generative mod-

els have emerged as a promising solution, capable of

replicating data with similar structural patterns and

statistical characteristics. However, synthetic medical

data must be highly realistic, encompassing not only

statistical ﬁdelity but also clinical interpretability and

practical utility (Murtaza et al., 2023).

Evaluating the quality of synthetic data is, therefore, a critical step. Current evaluation metrics for time series data often focus on statistical comparisons between synthetic and real datasets, potentially overlooking complex signal features essential for accurate medical interpretation. Clinicians may also struggle to interpret statistical criteria in a clinical setting, highlighting the need for more sophisticated evaluation methods. These methods should ideally assess data quality at the sample level rather than collectively (Murtaza et al., 2023).

[a] https://orcid.org/0009-0001-9482-4566

[b] https://orcid.org/0000-0003-0385-053X

[c] https://orcid.org/0000-0001-7279-1890

[d] https://orcid.org/0000-0002-4022-7424

Alongside quantitative assessments, researchers

have emphasized the importance of qualitative eval-

uation by medical experts to identify discrepancies in

synthetic samples (Murtaza et al., 2023). As the most

reliable source of “ground truth”, clinical profession-

als provide invaluable insights into the realism of syn-

thetic ECG data (Stein et al., 2024).

In response to these needs, this study aims to evaluate and enhance the generation of synthetic ECG data using deep generative models, with a focus on achieving high realism. The approach includes refining a state-of-the-art generative model using metrics that assess fidelity, diversity and utility. Additionally, a novel sample-level evaluation metric is introduced, emphasizing generation quality over artifacts and noise. Finally, the fidelity of the generated data is validated using the newly developed metric and expert human feedback.

Russo, M., Rebelo, J., Bento, N. and Gamboa, H.

Towards High-Fidelity ECG Generation: Evaluation via Quality Metrics and Human Feedback.

DOI: 10.5220/0013400500003911

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 18th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2025) - Volume 1, pages 1154-1165

ISBN: 978-989-758-731-3; ISSN: 2184-4305

2 RELATED WORK

Over the years, various methods have been devel-

oped to generate synthetic ECG signals, with recent

deep learning (DL) advancements signiﬁcantly sur-

passing traditional approaches and driving progress in

the biomedical ﬁeld. Wulan et al. (2020) introduced a

Deep Convolutional Generative Adversarial Network

to generate realistic ECG signals, including various

heartbeat types. However, challenges like the require-

ment for R-peak-centered segments and limited scal-

ability to longer signals persisted. Dissanayake et al.

(2022) extended adversarial models to include inde-

pendent peak annotations and longer synthetic signals

with multiple R-peaks, addressing these limitations.

Similarly, Belo et al. (2017) utilized a Deep Neural

Network (DNN) with Gated Recurrent Units to syn-

thesize biosignals, including ECG, capturing subject-

speciﬁc traits and morphological details. Nishikimi et

al. (2023) further explored DNNs, leveraging a con-

ditional Variational Autoencoder to synthesize ECGs

efﬁciently using cardiac parameters.

More recently, diffusion models have introduced

remarkable approaches for time series modeling,

demonstrating outcomes that surpass their competi-

tors. Alcaraz and Strodthoff (2023) proposed the

Structured State Space Diffusion ECG (SSSD-ECG)

framework, which combines a conditional diffusion

model with structured state space sequences to syn-

thesize short 12-lead ECG signals. Their approach

excels in quantitative, qualitative, and human evalua-

tions. Inspired by SSSD-ECG, Zama and Schwenker

(2023) developed the Diffusion State Space Aug-

mented Transformer model, which also generates

conditional 12-lead ECG data, replacing S4 layers

with State Space Augmented Transformer layers. Ad-

ditionally, Neifar et al. (2023) developed a versa-

tile framework based on Diffusion Denoising Prob-

abilistic Models for ECG signal generation, imputa-

tion, and forecasting. Their approach uses efﬁcient

conditioning encoding for seamless task transitions,

achieving promising results.

As generative models advance, it becomes in-

creasingly important to establish robust methods for

evaluating the quality of synthetic samples. Vari-

ous metrics have been proposed, but the choice de-

pends on the speciﬁc problem and domain. Stenger

et al. (2024) suggested categorizing these met-

rics into distribution-level, which assess data col-

lectively, and sample-level, which evaluate individ-

ual samples. Common distribution-level metrics in-

clude Average Euclidean Distance, Jensen-Shannon

Distance, and Maximum Mean Discrepancy. Sajjadi

et al. (2018) proposed a novel deﬁnition of preci-

sion and recall for distributions, based on the esti-

mated supports of real and synthetic data, separately

assessing quality (precision) and diversity (recall).

Kynkäänniemi et al. (2019) addressed limitations in

the previous metrics by introducing improved preci-

sion and improved recall, which better estimate real

and synthetic data distributions using non-parametric

methods, pairwise Euclidean distances, and k-nearest

neighbors in a high-dimensional feature space. More

recently, Naeem et al. (2020) highlighted the unre-

liability of newer precision and recall metrics, intro-

ducing density and coverage metrics as alternative ap-

proaches designed to be less vulnerable to outliers and

more computationally efﬁcient.

For sample-level metrics, Dynamic Time Warp-

ing is commonly used for time series, as it captures

ﬂexible similarities under time distortions. How-

ever, it can be sensitive to noise and outliers. Alaa

et al. (2022) proposed α-Precision and β-Recall,

which build on the metrics proposed by Sajjadi et al.

through a reﬁned soft-boundary classiﬁcation. How-

ever, the authors of the SSSD-ECG framework have

raised concerns about these metrics, citing issues with

instability during the training of one-class embed-

dings, which signiﬁcantly affected the results.

Turning the attention to ECG quality assessment,

both for real and synthetic signals, machine learn-

ing (ML) and DL techniques provide a more granular

approach to evaluating signal quality. Several stud-

ies have employed these techniques to assess various

quality aspects of ECG signals. For example, C. Liu

et al. (2018) and Athif and Daluwatte (2017) trained

ML classiﬁers to evaluate background noise, beat

consistency (detecting unexpected events), amplitude

range, and the identiﬁcation of signals with missing

leads. Non-feature-based approaches, explored by G.

Liu et al. (2021) and Zhang et al. (2018), also ad-

dress these issues. These studies, which rely on the

PhysioNet/Computing in Cardiology Challenge 2011

dataset, indicate a shared focus on artifact and noise

detection, limiting their applicability in assessing the

quality of synthetic ECG signals. The evaluation met-

ric proposed in this study sets itself apart by speciﬁ-

cally targeting the realism of individual ECG samples,

concentrating on the quality of the generated signals

rather than merely identifying noise and artifacts.


3 METHODS

As illustrated in Figure 1, the methodology of this

study was structured around several key stages, in-

cluding data preprocessing, generative model im-

plementation, quality dataset construction, evaluator

model development, and a comprehensive evaluation.

ECG signals were initially preprocessed for train-

ing and evaluation purposes. The SSSD-ECG was

employed to produce highly realistic synthetic ECG

signals, which were subjected to both quantitative and

qualitative assessments. To train the proposed evalua-

tion metric, referred to as Evaluator Model, a custom

quality dataset was created. This model was speciﬁ-

cally designed to classify synthetic ECG signals into

four distinct quality levels.

3.1 ECG Dataset

The dataset used in this study was sourced from the

“Will Two Do? Varying Dimensions in Electrocar-

diography: The PhysioNet/Computing in Cardiology

Challenge 2021” (Reyna et al., 2021), speciﬁcally

the Physikalisch-Technische Bundesanstalt (PTB)

source, for training and evaluating generative models.

The PTB-XL dataset was selected for its exten-

sive size and diversity, featuring 21,837 annotated 12-

lead ECG recordings, each 10 seconds long, collected

from 18,885 patients. Its gender-balanced composi-

tion, wide age range, and comprehensive pathology

coverage make it suitable for training robust models.

Each record was annotated by one or two cardiolo-

gists, who assigned multiple ECG statements based

on the SCP-ECG standard, covering form, rhythm,

and diagnostic categories. This research focused on

the diagnostic labels, which are organized hierarchically into five broad superclasses: Conduction Disturbance (CD), Myocardial Infarction (MI), Hypertrophy (HYP), Normal (NORM), and ST/T Change (STTC) (Wagner et al., 2020).

While the PTB-XL dataset offers many advan-

tages, its multi-labeled signals presented a challenge

for this study, which focused exclusively on the ﬁve

diagnostic superclasses. To address this, only single-

label signals were selected, reducing the dataset size.

3.2 Data Preprocessing

The ECG signals from the PTB-XL dataset were pre-

processed to optimize data organization and prepare

labels for model training. First, signals were resampled from 500 Hz to 100 Hz per lead, significantly reducing data size while preserving essential features.

A moving average ﬁlter with a kernel size of 101 was

then applied to remove baseline wander by smooth-

ing the signals and subtracting the baseline. Each

ECG channel was standardized using z-score normal-

ization, centering the data around a mean of zero and

scaling it to a standard deviation of one. This ensured

uniform amplitude across all signals.
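The three preprocessing steps above can be sketched for a single lead as follows. This is a minimal illustration using only the standard library, not the authors' pipeline: the paper does not name its resampling method, so block averaging is an assumption, and the function names and edge handling of the moving average are likewise hypothetical.

```python
from statistics import fmean, pstdev


def downsample(x, factor=5):
    """Average non-overlapping blocks (500 Hz -> 100 Hz when factor=5).

    Assumption: the paper does not state how resampling was done.
    """
    n = len(x) // factor
    return [fmean(x[i * factor:(i + 1) * factor]) for i in range(n)]


def moving_average(x, kernel=101):
    """Centered moving average; the window is truncated at the signal edges."""
    half = kernel // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        out.append(fmean(x[lo:hi]))
    return out


def remove_baseline(x, kernel=101):
    """Estimate baseline wander by smoothing, then subtract it."""
    baseline = moving_average(x, kernel)
    return [v - b for v, b in zip(x, baseline)]


def zscore(x):
    """Scale one lead to zero mean and unit standard deviation."""
    mu, sigma = fmean(x), pstdev(x)
    if sigma == 0:
        return [0.0] * len(x)
    return [(v - mu) / sigma for v in x]


def preprocess_lead(x_500hz):
    """Resample, remove baseline wander, then standardize (order as in the text)."""
    x = downsample(x_500hz, factor=5)
    x = remove_baseline(x, kernel=101)
    return zscore(x)
```

At 100 Hz, a kernel of 101 samples corresponds to a window of roughly one second, which is why the smoothed signal tracks slow drift rather than individual beats.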

To reduce label complexity, signals with multiple

diagnostic class labels were excluded, resulting in a

dataset where each sample was assigned to a single

diagnostic superclass. The PTB-XL diagnostic la-

bels, originally based on SNOMED-CT codes, were

mapped to the ﬁve broad diagnostic superclasses and

then one-hot encoded to structure the model inputs.

Focusing on these ﬁve superclasses addressed practi-

cal constraints, as evaluating all 71 annotations avail-

able in the PTB-XL dataset would have been imprac-

tical for clinical experts.
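A small sketch of the label handling described above, assuming each record has already been mapped to a set of superclass names. The ordering of `SUPERCLASSES`, the record identifiers, and the function names are illustrative assumptions; the SNOMED-CT mapping itself is not reproduced here.

```python
# Order of the one-hot positions is an assumption, not taken from the paper.
SUPERCLASSES = ["CD", "HYP", "MI", "NORM", "STTC"]


def one_hot(label):
    """Encode a single superclass name as a 5-element 0/1 vector."""
    return [1 if label == c else 0 for c in SUPERCLASSES]


def keep_single_label(records):
    """Keep only records with exactly one diagnostic superclass.

    records: iterable of (record_id, set_of_superclass_names).
    Returns {record_id: one_hot_vector} for the retained records.
    """
    kept = {}
    for rec_id, labels in records:
        labels = set(labels) & set(SUPERCLASSES)
        if len(labels) == 1:
            kept[rec_id] = one_hot(next(iter(labels)))
    return kept
```

Records with several superclasses, or with none, are dropped, which is what reduces the dataset size.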

3.3 Quality Dataset

Developing an evaluation metric capable of assessing the quality of synthetic ECG data on a sample-by-sample basis required a dataset meeting two criteria: (1) a large number of 12-lead ECG records to support the training of the DL model and (2) clear, detailed descriptions of quality levels to ensure the metric focuses on generation quality rather than artifacts or noise.

Two databases were initially considered: the Phy-

sioNet/Computing in Cardiology Challenge 2011 and

the Brno University of Technology ECG Quality

Database (BUT QDB). Unfortunately, neither dataset

fully met these criteria, as each lacked one of the two

essential requirements. A custom quality dataset was therefore constructed from scratch.

The custom dataset was inspired by the classiﬁca-

tion system of the BUT QDB and organized into four

distinct classes based on ECG characteristic waves.

Examples of each class are illustrated in Figure 2 and

described as follows:

Figure 1: Overview of the methodology.

SyntBioGen 2025 - Special Session on Synthetic biosignals generation for clinical applications


• Class 1. Signals that do not resemble ECGs.

• Class 2. Signals similar to ECGs that show only discernible R peaks, with other waves obscured by noise.

• Class 3. Signals that resemble ECGs with visible

periodic R waves and most other waves observ-

able, but containing conceptual errors that result

in highly improbable ECG patterns.

• Class 4. Real ECG signals.

For Class 1, signals were created using a mix of

basic wave functions, such as sine, triangular, rect-

angular, and sawtooth waves, each with varying lev-

els of noise. Class 2 samples were generated using a

GAN model trained with a reduced number of epochs.

Class 3 was produced with a speciﬁc conﬁguration of

the SSSD-ECG model to ensure higher ﬁdelity, de-

tailed in Section 3.4. Class 4 consisted of real signals

from the PTB-XL database. Every class has approxi-

mately 10,000 samples, except for Class 3 which has

176 samples, due to the manual selection of the sam-

ples that met the required characteristics.
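The Class 1 construction can be illustrated with a short generator. This is a hedged sketch only: the frequency range, noise range, the choice of one waveform per lead (rather than a sum of waveforms), and all function names are assumptions, since the paper gives no parameter values.

```python
import math
import random


def sine(phase):
    return math.sin(2 * math.pi * phase)


def triangular(phase):
    # Peaks at +1 on half-integer phases and dips to -1 on integer phases.
    return 4 * abs(phase - math.floor(phase + 0.5)) - 1


def rectangular(phase):
    return 1.0 if phase % 1.0 < 0.5 else -1.0


def sawtooth(phase):
    return 2 * (phase % 1.0) - 1


WAVES = [sine, triangular, rectangular, sawtooth]


def class1_lead(n_samples=1000, fs=100, rng=random):
    """One non-ECG lead: a basic waveform plus Gaussian noise.

    The frequency and noise ranges below are hypothetical.
    """
    wave = rng.choice(WAVES)
    freq_hz = rng.uniform(0.5, 5.0)
    noise_std = rng.uniform(0.0, 0.5)
    return [wave(freq_hz * t / fs) + rng.gauss(0.0, noise_std)
            for t in range(n_samples)]


def class1_signal(n_leads=12, n_samples=1000, fs=100, seed=None):
    """A 12-lead Class 1 sample; each lead is drawn independently."""
    rng = random.Random(seed)
    return [class1_lead(n_samples, fs, rng) for _ in range(n_leads)]
```

Passing a `seed` makes a sample reproducible, which is convenient when a fixed quality dataset has to be regenerated.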

3.4 Structured State Space Diffusion ECG

The SSSD-ECG model, developed by Alcaraz and

Strodthoff (2023), represents a state-of-the-art frame-

work for ECG generation, leveraging conditional dif-

fusion models and structured state space dynamics. In

their original paper, the model excelled across various

evaluation contexts, including qualitative, quantita-

tive, and expert assessments. This success was the pri-

mary reason for its selection in this study. The SSSD-

ECG was applied with two speciﬁc objectives: (1)

produce signals for Class 3 in the quality dataset and

(2) generate highly realistic ECG samples for subse-

quent evaluation by clinical experts.

To accomplish the desired results for both ob-

jectives, several hyperparameters were adjusted and

tested across different conﬁgurations. Table 1 pro-

vides an overview of the hyperparameters explored

during the experiments, along with their respective

tested values. To isolate the impact of each vari-

able, only one hyperparameter was modiﬁed at a time.

Synthetic ECG samples were then generated for each

conﬁguration and evaluated using the improved preci-

sion, improved recall, density, and coverage metrics.

Table 1: SSSD-ECG hyperparameters tested during optimization and their respective values.

Hyperparameter                      Values
Diffusion Steps T                   300, 1000
Residual Layers                     24, 48
Label Embedding Dimension           256
Batch Size                          4
Diffusion Step Embedding Dim. In    256
S4 State Dimension                  128
S4 Dropout                          0.2
S4 Layer Normalization              0 (disable)
S4 Bidirectional                    0 (disable)

One of the most effective configurations was determined by combining the hyperparameters that yielded the best results based on the quantitative evaluation metrics. The key hyperparameters identified were the number of diffusion time steps, the number of residual layers, and the dimension of the label embedding. While these three parameters significantly enhanced the generation capacity of the model, further optimization was achieved by increasing the number of diffusion time steps. This refined configuration, referred to as the best hyperparameter combination, is detailed in Table 2.

Figure 2: Representative examples for each class in the Quality Dataset.

In order to accelerate the clinical evaluation pro-

cess, the samples provided to the experts were gen-

erated using the model that demonstrated the highest

performance at the time, which was conﬁgured with

the original hyperparameter settings. Subsequent ex-

periments focused on further reﬁning the model, ulti-

mately leading to the identiﬁcation of the best hyper-

parameter combination.

For generating Class 3 signals, the conﬁguration

with 24 residual layers was speciﬁcally chosen based

on visual inspection of the generated signals. This

setup was selected as it best met the criteria for accu-

rately populating this class.

Table 2: SSSD-ECG best hyperparameter configuration.

Hyperparameter                 Value
Diffusion Steps T              1000
Residual Layers                48
Residual Channels              256
Skip Channels                  256
Diffusion Embedding Dim. 1     128
Diffusion Embedding Dim. 2     512
Diffusion Embedding Dim. 3     512
S4 State Dimension             64
S4 Dropout                     0
S4 Layer Normalization         1
S4 Bidirectional               1
Label Embedding Dimension      256

3.5 Evaluator Model

The proposed evaluator model represents a novel

evaluation metric designed to assess the quality of

synthetic ECG data at the sample level. Unlike con-

ventional metrics, which often require manual feature

extraction and statistical comparisons across datasets,

this model classiﬁes each signal individually into one

of the four classes from the quality dataset.

The model was developed using ensemble DL

techniques and consists of ﬁve neural networks, each

initialized with a random seed from the range [42–46]

to ensure diversity. While all networks share the

same architecture, they are independently initialized.

Each network comprises ﬁve one-dimensional convo-

lutional layers, followed by Leaky ReLU activation

functions with a negative slope of 0.2 and dropout

layers with a rate of 0.3. The convolutional layers

use a kernel size of 4, a stride of 2, and padding of 1,

except for the ﬁnal convolutional layer, which uses a

stride of 1 and no padding. A ﬂatten operation is then

applied to convert the output into a one-dimensional

vector for classiﬁcation. During training, signals are

passed through the networks, and the output is com-

pared to the target label using cross-entropy loss. The

Adam optimizer is then used to adjust the weights and

biases, minimizing this loss.
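To make the layer geometry concrete, the sketch below traces the temporal length of a signal through the five convolutions using the standard output-length formula for a one-dimensional convolution (dilation 1). The input length of 1000 samples (10 s at 100 Hz) is an assumption based on the preprocessing described earlier; channel counts are not given in the paper and are not modeled.

```python
def conv1d_out_len(length, kernel, stride, padding):
    """Output length of a 1-D convolution with dilation 1."""
    return (length + 2 * padding - kernel) // stride + 1


# (kernel, stride, padding) for the five layers described in the text:
# four layers with stride 2 and padding 1, then a final layer with
# stride 1 and no padding.
LAYERS = [(4, 2, 1)] * 4 + [(4, 1, 0)]


def trace_lengths(input_len):
    """Return the temporal length before and after each convolution."""
    lengths = [input_len]
    for kernel, stride, padding in LAYERS:
        lengths.append(conv1d_out_len(lengths[-1], kernel, stride, padding))
    return lengths


print(trace_lengths(1000))  # [1000, 500, 250, 125, 62, 59]
```

Under this assumption, each strided layer roughly halves the sequence, and the flatten operation would act on 59 time steps per output channel.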

The training data used to develop this model was

sourced from the custom quality dataset described

earlier, where it was observed that Class 3 contained

fewer signals than other classes. To address this im-

balance, class weights were calculated and applied

during the training process. This adjustment ensured

that the underrepresented classes were given propor-

tionally higher weights, allowing the model to learn

better from the fewer signals available and reducing

bias toward the more frequent classes.
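One common way to compute such weights is inverse class frequency, sketched below. The paper does not state which weighting formula was used, so this "balanced" heuristic is an assumption, and the counts for Classes 1, 2 and 4 are the approximate figure of 10,000 quoted in Section 3.3.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: w_c = N / (K * n_c).

    counts: {class_label: number_of_samples}.
    N is the total sample count and K the number of classes.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {c: total / (n_classes * n) for c, n in counts.items()}


# Approximate class sizes of the quality dataset (Class 3 has 176 samples).
counts = {1: 10_000, 2: 10_000, 3: 176, 4: 10_000}
weights = balanced_class_weights(counts)
```

With these counts, Class 3 receives a weight of about 42.9 against about 0.75 for the other classes, so each of its samples contributes far more to the weighted cross-entropy loss.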

Since ensemble learning enhances prediction per-

formance by combining multiple models, it was es-

sential to deﬁne an effective strategy for aggregating

predictions. Therefore, soft voting was implemented,

averaging the class label probabilities across all mod-

els. The class with the highest average probability is

then selected as the ﬁnal prediction, effectively con-

sidering the conﬁdence levels of all model predictions

(Mahajan et al., 2023).
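Soft voting for a single signal can be written in a few lines. The sketch assumes each of the five networks outputs a probability vector over the four quality classes; the function name and the example probabilities are illustrative only.

```python
def soft_vote(prob_sets):
    """Average per-class probabilities across models and pick the best class.

    prob_sets: one probability vector per ensemble member, all for the
    same signal. Returns (predicted_class, averaged_probabilities), with
    classes numbered from 1 to match the quality dataset.
    """
    n_models = len(prob_sets)
    n_classes = len(prob_sets[0])
    avg = [sum(p[c] for p in prob_sets) / n_models for c in range(n_classes)]
    best = max(range(n_classes), key=avg.__getitem__)
    return best + 1, avg


# Hypothetical outputs of the five networks for one synthetic ECG.
example = [
    [0.1, 0.2, 0.6, 0.1],
    [0.1, 0.5, 0.3, 0.1],
    [0.0, 0.4, 0.5, 0.1],
    [0.2, 0.2, 0.5, 0.1],
    [0.1, 0.6, 0.2, 0.1],
]
label, avg = soft_vote(example)
```

Unlike majority (hard) voting, the averaged vector keeps a graded confidence for every class, which can be inspected alongside the final label.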

This approach adds a layer of privacy protection by reducing the need for access to sensitive real data during evaluation. Traditional metrics typically

require access to both real and synthetic datasets,

which poses privacy risks, especially when the real

data contains sensitive information. In contrast, this

method relies solely on the model’s weights and the

generated synthetic data for evaluation. Although the

model is trained using both real and synthetic data to

capture the underlying patterns effectively, it does not

expose the raw features or contents of the real dataset

during the evaluation phase.

Additionally, while prior works such as G. Liu et

al. (2021) and Zhang et al. (2018) have used DL

techniques to assess ECG quality, their focus was pri-

marily on noise and artifact detection, which limits

their applicability to synthetic signals. In contrast,

the evaluator model was speciﬁcally designed with

diverse waveform characteristics and targets the real-

ism of the generated ECG samples, providing a more

comprehensive assessment of signal quality beyond

noise and artifacts.


3.6 Evaluation

Assessing the quality of time series generation is a

multidimensional task, covering various aspects such

as ﬁdelity, diversity, and utility (Stenger et al., 2024).

The main goal of this work was to produce realistic

synthetic ECG samples using the SSSD-ECG model.

Therefore, the focus was primarily on ﬁdelity, by

evaluating how closely the generated samples resem-

ble real ECG signals. In addition, the diversity of

the synthetic dataset was also evaluated to ensure that

the samples represent the full variability of the real

data. Moreover, the utility of the synthetic data was

assessed through several classiﬁcation tasks.

To complement the quantitative metrics, the gen-

erated signals were also subjected to qualitative evalu-

ation by clinical experts through a questionnaire, pro-

viding expert feedback on the realism of the data.

Finally, classiﬁcation metrics, including accuracy,

F1-score, and the confusion matrix, were used to as-

sess the performance of the evaluator model. The cor-

relation between the evaluation metrics for synthetic

data and the evaluator was analyzed to determine if

the model aligns with the state-of-the-art metrics. Ad-

ditionally, the relationship between the human evalu-

ation and the evaluator was also studied.

3.6.1 Fidelity and Diversity

The metrics used to assess the ﬁdelity of the gener-

ated data were improved precision and density, while

improved recall and coverage metrics were used to

evaluate diversity. Density and coverage were pro-

posed by Naeem et al. (2020), whereas improved

precision and improved recall were introduced by

Kynkäänniemi et al. (2019). For simplicity,

throughout this work, improved precision and im-

proved recall will be referred to as precision and re-

call, respectively. The implementation was adapted to

use 5 nearest neighbors (k=5) and 200 samples from

each diagnostic class for both real and synthetic data,

ensuring a balanced dataset.
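The four metrics can be sketched directly from their k-nearest-neighbor definitions. The code below is a simplified, standard-library illustration following the definitions of Kynkäänniemi et al. (2019) and Naeem et al. (2020), not the implementation used in this study; `real` and `fake` stand for lists of feature vectors, and the function names are assumptions.

```python
from math import dist


def kth_nn_radius(points, k):
    """Distance from each point to its k-th nearest neighbor in the same set."""
    radii = []
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(d[k - 1])
    return radii


def prdc(real, fake, k=5):
    """Improved precision/recall plus density/coverage on feature vectors."""
    r_real = kth_nn_radius(real, k)
    r_fake = kth_nn_radius(fake, k)
    n_real, n_fake = len(real), len(fake)
    # d[i][j]: distance between real sample i and synthetic sample j.
    d = [[dist(r, f) for f in fake] for r in real]

    # Precision: share of synthetic samples inside at least one real k-NN ball.
    precision = sum(
        any(d[i][j] <= r_real[i] for i in range(n_real)) for j in range(n_fake)
    ) / n_fake
    # Recall: share of real samples inside at least one synthetic k-NN ball.
    recall = sum(
        any(d[i][j] <= r_fake[j] for j in range(n_fake)) for i in range(n_real)
    ) / n_real
    # Density: average number of real balls containing each synthetic
    # sample, normalized by k; it is not bounded above by 1.
    density = sum(
        d[i][j] <= r_real[i] for i in range(n_real) for j in range(n_fake)
    ) / (k * n_fake)
    # Coverage: share of real samples whose ball holds a synthetic sample.
    coverage = sum(min(d[i]) <= r_real[i] for i in range(n_real)) / n_real
    return {"precision": precision, "recall": recall,
            "density": density, "coverage": coverage}
```

Because density counts how many real neighborhoods contain each synthetic sample, values above 1 indicate synthetic samples concentrated in regions that are densely populated by real data.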

During the experiments conducted to optimize

the performance of the SSSD-ECG model, each ex-

periment produced corresponding ﬁdelity and diver-

sity results for the synthetic data generated. The

real dataset used for comparison remained consistent

across all experiments.

To compute the metrics, features from multiple

domains, such as statistical, spectral, and temporal,

were extracted using the Time Series Feature Extrac-

tion Library (TSFEL) version 0.1.7, a Python package

optimized for automatic feature extraction from time

series data (Barandas et al., 2020).

3.6.2 Utility

The utility of the synthetic dataset was evaluated

through a classiﬁcation task using the Train on Real,

Test on Synthetic (TRTS) and Train on Synthetic, Test

on Real (TSTR) metrics proposed by Esteban et al. (2017), as well as the additional Train on Synthetic, Test on Synthetic (TSTS) metric, introduced in the work of Fekri et al. (2019).

The supervised classiﬁcation task was carried out

using a Random Forest classiﬁer, and features were

extracted from both the real and synthetic datasets us-

ing the TSFEL library (Barandas et al., 2020). For

baseline comparisons, the classiﬁer was trained and

tested on real data, with the dataset divided into train-

ing and test sets. A test size of 30% was consistently

used across all classiﬁcation tasks.

From these evaluations, three performance mea-

sures were derived and analyzed:

• TSTR. This metric assesses the capacity of syn-

thetic data to replace real data by evaluating how

well a model trained on generated samples per-

forms when tested on real ones.

• TRTS. This metric measures the realism of syn-

thetic samples by training the classiﬁer on real

data and evaluating its performance on synthetic

data.

• TSTS. This metric evaluates the internal consis-

tency of the synthetic dataset by measuring how

well a model trained on synthetic samples gener-

alizes to unseen synthetic data.
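Each train/test setting produces one score, and Table 4 reports these as macro-averaged F1. A minimal sketch of that score and of the four settings follows; the helper name and the `SETTINGS` dictionary are illustrative and do not come from the paper.

```python
# (training source, test source) for the baseline and the three metrics.
SETTINGS = {
    "baseline": ("real", "real"),
    "TRTS": ("real", "synthetic"),
    "TSTR": ("synthetic", "real"),
    "TSTS": ("synthetic", "synthetic"),
}


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Per class: F1 = 2*TP / (2*TP + FP + FN), taken as 0 when undefined.
    """
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Macro averaging gives each diagnostic superclass equal weight regardless of how many test samples it has.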

3.6.3 Human Evaluation

The human evaluation speciﬁcally targeted the real-

ism of the synthetic dataset, as realism is a prop-

erty for which humans can provide an unequivocal

“ground truth” (Stein et al., 2024). To validate the

realism of synthetic ECG samples generated by the

SSSD-ECG model, a structured questionnaire was de-

veloped using Microsoft Forms. The questionnaire

featured 20 images of ECG tracings – 10 synthetic

signals from the generative model and 10 real signals

from the PTB-XL database. The study involved eval-

uations by three clinical experts: a cardiologist with

over 10 years of experience, an internist with less than

5 years of experience, and a ﬁnal-year medical stu-

dent.

Each tracing was paired with a set of questions,

beginning with an inquiry about the nature of the sig-

nal. The respondents were asked to indicate whether

they believed the tracing to be an ECG or not. If

uncertain, they could select the ‘Not sure’ option,

which allowed them to proceed to the next image.


For tracings identiﬁed as ECG, participants were then

asked to classify the tracing into one of several di-

agnostic categories: Normal, Myocardial Infarction,

ST/T change, Hypertrophy, or Conduction Distur-

bance. These categories correspond to the ﬁve super-

classes used to classify the PTB-XL data in terms of

disease diagnosis.

If a tracing was not recognized as an ECG, the

clinical experts were asked to evaluate its quality by

selecting one of the following options, which cor-

respond to the quality levels deﬁned in the quality

dataset:

• Noise (Class 1). The tracing does not resemble an ECG, and the R waves are not reliably observable.

• Clearly not an ECG (Class 2). Periodic R waves are visible in some leads, but other ECG waves are not clearly identifiable.

• Almost an ECG (Class 3). Periodic R waves are visible, and most of the waves can be observed, but there are conceptual errors resulting in highly unlikely ECG patterns.

The signals selected for the questionnaire were

chosen to represent the diversity within the dataset.

To achieve this, a method employing a nearest neigh-

bors model was used. This approach measured the

dissimilarity between samples using Euclidean dis-

tance in high-dimensional feature space, with the goal

of iteratively selecting the most unique signals. A to-

tal of 20 signals were selected, with two signals from

each of the ﬁve diagnostic superclasses for both real

and synthetic signals, ensuring balanced representa-

tion. This selection promoted diversity across the

dataset while limiting the total number of signals to

20 to avoid overburdening human evaluators during

the questionnaire. Each selected sample was reviewed

to ensure it accurately reﬂected the diverse character-

istics of the dataset.
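One way to realize such an iterative selection is a greedy farthest-point strategy, sketched below. The paper does not specify the exact procedure, so the starting rule (the sample farthest from the centroid) and the max-min criterion are assumptions; `features` stands for the feature vectors of one group, for example the real signals of a single superclass.

```python
from math import dist


def select_diverse(features, n_select):
    """Greedy farthest-point selection under Euclidean distance.

    Starts from the sample farthest from the centroid, then repeatedly
    adds the sample whose distance to its nearest already-selected
    sample is largest. Returns the selected indices in pick order.
    """
    n = len(features)
    dim = len(features[0])
    centroid = [sum(f[d] for f in features) / n for d in range(dim)]
    first = max(range(n), key=lambda i: dist(features[i], centroid))
    selected = [first]
    # min_dist[i]: distance from sample i to its closest selected sample.
    min_dist = [dist(f, features[first]) for f in features]
    while len(selected) < min(n_select, n):
        candidates = (i for i in range(n) if i not in selected)
        nxt = max(candidates, key=min_dist.__getitem__)
        selected.append(nxt)
        min_dist = [min(m, dist(f, features[nxt]))
                    for m, f in zip(min_dist, features)]
    return selected
```

Calling `select_diverse(group_features, 2)` once per superclass and per data source would yield the two-per-group layout of the questionnaire.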

4 RESULTS AND DISCUSSION

Considering the main goal of this work was to achieve

high ﬁdelity, the synthetic signals were evaluated us-

ing metrics speciﬁcally focused on this aspect, while

diversity and utility were assessed as complementary

measures. Next, the realism of the generated dataset,

evaluated by medical experts, was analyzed. Finally,

the performance of the proposed evaluation metric

was assessed, with a focus on its alignment with qual-

ity metrics and human evaluators.

4.1 Fidelity and Diversity

In initial experiments, the original hyperparameters

from the SSSD-ECG paper were used to gener-

ate synthetic ECG signals, which were subsequently

provided to clinical experts for qualitative assess-

ment. While awaiting feedback, additional experi-

ments were conducted to enhance the realism of the

synthetic signals, resulting in the identiﬁcation of a

best set of hyperparameters, detailed in Section 3.4.

Fidelity and diversity were quantitatively assessed

using precision, recall, density, and coverage met-

rics, comparing the two hyperparameter conﬁgura-

tions across the ﬁve diagnostic classes. The results

are detailed in Table 3.

The average precision of the synthetic ECG signals increased substantially from 0.57 with the original hyperparameters to 0.94 with the best configuration, with precision rising for every class except MI (0.95 to 0.89). In addition, the density metric improved across all diagnostic classes, with every class exceeding 1 under the best configuration. Consequently, the overall average density increased from 0.80 to 3.85. These values indicate that the model is generating more synthetic samples in proximity to real data points.

While these improvements in ﬁdelity are signif-

icant, examining the diversity of the generated sig-

nals is essential for a holistic comprehension of the

performance of the model. Although recall improved

with the best set of hyperparameters, it remained low.

In contrast, the coverage metric showed notable improvements across all diagnostic categories. These results suggest that, although many real samples lie outside the neighborhoods of the synthetic samples (reflected by low recall), the model is still capable of generating a diverse set of samples that fall close to the majority of real data points (reflected by high coverage).

After analyzing both the fidelity and diversity results, it is evident that the best hyperparameter configuration achieved the goal of generating synthetic ECG signals with statistical characteristics similar to those of real ones, as confirmed by the high precision and density values. However, the low recall and high coverage scores indicate that, while the model generates a broad array of signals (high coverage), many real points are still not represented in the synthetic dataset (low recall). This limitation highlights the need for future work to enhance the diversity of the synthetic signals to better capture the full range of characteristics present in real data.

SyntBioGen 2025 - Special Session on Synthetic biosignals generation for clinical applications


Table 3: Comparison of precision, recall, density, and coverage values across the diagnostic classes for two hyperparameter configurations: the original and the best-performing.

                   Original Hyperparameters                  Best Hyperparameters
Diagnostic Class   Precision  Density  Recall  Coverage      Precision  Density  Recall  Coverage
CD                 0.65       0.99     0.00    0.25          0.97       4.15     0.01    0.83
HYP                0.03       0.01     0.01    0.03          0.95       3.13     0.02    0.95
MI                 0.95       1.41     0.00    0.32          0.89       2.73     0.03    0.79
NORM               0.28       0.17     0.00    0.06          0.90       3.99     0.04    0.91
STTC               0.95       1.43     0.00    0.27          0.98       5.25     0.06    0.99
Mean               0.57       0.80     0.00    0.19          0.94       3.85     0.03    0.89
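
To make the four metrics in Table 3 concrete, the following is a minimal NumPy sketch of how precision, recall, density, and coverage can be computed from k-nearest-neighbor balls, following the definitions of Kynkäänniemi et al. (2019) and Naeem et al. (2020). It is not the implementation used in this study: the function names, the use of plain Euclidean distance on generic feature vectors, and the choice k = 5 are assumptions made for illustration only.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distances between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def kth_neighbor_radius(x, k):
    """Distance from each row of x to its k-th nearest neighbor within x."""
    d = pairwise_distances(x, x)
    # After sorting, column 0 is the zero self-distance, so column k
    # holds the distance to the k-th nearest *other* sample.
    return np.sort(d, axis=1)[:, k]


def prdc(real, fake, k=5):
    """Precision, recall, density, and coverage (k is an assumed default)."""
    r_real = kth_neighbor_radius(real, k)
    r_fake = kth_neighbor_radius(fake, k)
    d = pairwise_distances(real, fake)  # shape (n_real, n_fake)

    in_real_ball = d < r_real[:, None]  # fake j lies inside the ball of real i
    in_fake_ball = d < r_fake[None, :]  # real i lies inside the ball of fake j

    return {
        # share of synthetic samples inside at least one real ball
        "precision": float(in_real_ball.any(axis=0).mean()),
        # share of real samples inside at least one synthetic ball
        "recall": float(in_fake_ball.any(axis=1).mean()),
        # average number of real balls containing a synthetic sample, over k
        "density": float(in_real_ball.sum(axis=0).mean() / k),
        # share of real samples with at least one synthetic sample in their ball
        "coverage": float(in_real_ball.any(axis=1).mean()),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(200, 8))             # placeholder "real" features
    fake = rng.normal(scale=0.3, size=(200, 8))  # placeholder: tight cluster near the mode
    print(prdc(real, fake))
```

Because density counts how many real neighborhoods contain each synthetic sample and divides by k, it is not bounded by 1, which is why values such as 5.25 can appear in Table 3. Recall and coverage differ in whose neighborhoods are used: recall uses balls around synthetic samples, while coverage uses balls around real samples.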

4.2 Utility

Synthetic datasets are often designed for specific ML applications, and their usefulness can be assessed by evaluating how effectively they support these applications. In this study, the utility of the synthetic data was evaluated by performing several classification tasks with a Random Forest classifier, as detailed in Section 3.6.2 and summarized in Table 4.

Table 4: Macro average F1-score for classification on real and synthetic datasets.

                     Test on Real   Test on Synthetic
Train on Real        56.58%         57.03%
Train on Synthetic   40.84%         78.00%

The classifier trained on real data performs nearly identically when tested on real (56.58%) and synthetic data (57.03%). This result suggests that the synthetic dataset preserves the class-relevant characteristics of the real one, supporting the realism of the generated samples.

The model trained on synthetic data performed considerably better on synthetic data (78.00%) than on real data (40.84%). This suggests that while the conditional generation produces internally consistent classes, a model trained on them may generalize poorly to real-world scenarios. This limitation may be due to the limited diversity of the synthetic data, reflected in the low recall reported in Section 4.1. Nevertheless, the synthetic data still displays some quality, despite not being able to fully replace real data in practical applications.

Overall, the similar performance of the real-trained classifier on real and synthetic data suggests that the synthetic dataset replicates many patterns from the real dataset. This is a positive indication of its quality and aligns with the main goal of this work. However, its utility is more limited for training models intended for real-world applications.
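
The four cells of Table 4 correspond to a train/test grid over the real and synthetic datasets. A minimal sketch of such a grid with scikit-learn is given below. It is illustrative only: the hold-out split, the number of trees, the function name `utility_grid`, and the toy data (a noisy copy of the "real" features standing in for generated samples) are assumptions, not the setup of Section 3.6.2.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def utility_grid(real_X, real_y, syn_X, syn_y, seed=0):
    """Macro F1 for every train/test pairing of real and synthetic data."""
    datasets = {"real": (real_X, real_y), "synthetic": (syn_X, syn_y)}

    # Hold out 20% of each dataset (an assumed ratio) so that no model is
    # scored on rows it was trained on.
    splits = {
        name: train_test_split(X, y, test_size=0.2, stratify=y, random_state=seed)
        for name, (X, y) in datasets.items()
    }

    scores = {}
    for train_name, (X_tr, _, y_tr, _) in splits.items():
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        clf.fit(X_tr, y_tr)
        for test_name, (_, X_te, _, y_te) in splits.items():
            pred = clf.predict(X_te)
            scores[(train_name, test_name)] = f1_score(y_te, pred, average="macro")
    return scores


# Toy stand-in data: five classes, as in the five diagnostic classes.
real_X, real_y = make_classification(
    n_samples=500, n_features=20, n_informative=8, n_classes=5, random_state=0
)
rng = np.random.default_rng(0)
syn_X = real_X + rng.normal(scale=0.5, size=real_X.shape)  # placeholder, not a generative model
syn_y = real_y.copy()

if __name__ == "__main__":
    for (train_name, test_name), score in utility_grid(real_X, real_y, syn_X, syn_y).items():
        print(f"train on {train_name:9s} / test on {test_name:9s}: macro F1 = {score:.3f}")
```

Macro averaging gives each class the same weight regardless of its size, so a class that is rarely predicted correctly lowers the score as much as a frequent one.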

4.3 Human Expert Evaluation

To complement the quantitative metrics, three clinical experts assessed the realism of the synthetic signals through a questionnaire detailed in Section 3.6.3. The primary task was to classify each ECG tracing as either real or synthetic, with follow-up questions tailored to their responses.

Individual evaluations were first analyzed, categorizing the outcomes into four groups: real signals correctly identified, real signals misclassified as synthetic, synthetic signals misidentified as real, and synthetic signals correctly classified, as illustrated in Figure 3. The responses were then collectively analyzed using majority voting. Notably, experts could select ‘Not sure’ when uncertain about the nature of the signal. Although only one expert chose this option, for statistical analysis, ‘Not sure’ was treated as a positive classification, indicating that the signal had sufficiently realistic characteristics to cause indecision and was therefore considered real.

Examining individual cases, medical expert A classified all 20 signals as real, without considering any as synthetic. Clinician B correctly identified 8 real signals but also classified 8 synthetic signals as real. The third expert classified 5 real ECG tracings as real but labeled the other 5 as synthetic, and classified 4 synthetic ECGs as real. These results highlight the realistic characteristics and patterns of the synthetic signals, as most were perceived as real (22 of the 30 individual judgments on synthetic signals).

Taking a holistic view, the majority of the three clinicians identified 8 out of 10 synthetic signals as real, while 2 out of 10 real signals were misclassified as synthetic. This underscores the high degree of realism in the synthetic data, aligning with previously evaluated metrics of precision and density. Moreover, the difficulty clinicians faced in distinguishing real from synthetic signals highlights the challenge posed by the realistic nature of the generated data.
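
The aggregation rule described above (majority voting across the three experts, with ‘Not sure’ counted as real) can be sketched in a few lines of Python. The function names and the example answers are hypothetical and are not the responses collected in this study.

```python
from collections import Counter

VALID_ANSWERS = {"real", "synthetic", "not sure"}


def perceived_label(answer):
    """Map one expert answer to 'real' or 'synthetic'."""
    if answer not in VALID_ANSWERS:
        raise ValueError(f"unexpected answer: {answer!r}")
    # 'Not sure' is treated as 'real', following the rule stated in the text.
    return "synthetic" if answer == "synthetic" else "real"


def majority_vote(answers):
    """Label chosen by most experts for a single ECG tracing."""
    counts = Counter(perceived_label(a) for a in answers)
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical answers from three experts for two tracings.
    print(majority_vote(["real", "not sure", "synthetic"]))   # real
    print(majority_vote(["synthetic", "synthetic", "real"]))  # synthetic
```

With three experts and two possible outcomes a tie cannot occur; with an even number of raters a tie-breaking rule would have to be defined explicitly.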

Figure 3: Classification of real and synthetic ECG signals by clinical experts.

For the analysis of the second set of follow-up questions, only the feedback from two medical experts was considered, as one clinician provided no information. As mentioned, 8 out of 10 synthetic signals were mistaken for real ones, while the remaining two were correctly classified as synthetic. According to the evaluators, these two synthetic samples fell into the ‘Noise’ quality level (Class 1), characterized by the absence of observable R waves. This finding indicates that although most synthetic signals successfully reproduce the characteristics of real signals, those with lower realism are readily recognized as synthetic. Furthermore, the analysis of the synthetic signals classified as real revealed a lack of consensus among the clinicians regarding the assigned diagnostic categories. This inconsistency suggests that the conditional aspect of the generative model may not be functioning as intended.

In conclusion, human evaluation provides preliminary evidence supporting the effectiveness of the SSSD-ECG model in generating realistic ECG signals. Although the evaluation involved only three clinicians, the results suggest that the synthetic data demonstrates sufficient quality to merit further exploration.

4.4 Evaluator Model Assessment

The evaluator model's performance in classifying synthetic ECG signals into four quality classes was assessed, achieving a mean accuracy of 99.99% and an average F1-score of 99.70%.

Another approach to assessing the performance of the evaluator model involved exploring its relationship with key evaluation metrics such as precision, density, recall, and coverage, through the Pearson correlation method. The correlation values presented in Table 5 show that the evaluator model exhibits strong correlations with precision (0.88) and density (0.72), the metrics that emphasize fidelity. This alignment underscores the evaluator's capacity as a sample-level fidelity assessment tool.

Table 5: Correlation values between evaluator model and evaluation metrics precision and density.

                  Precision (p-value)   Density (p-value)
Evaluator model   0.88 (p < 0.001)      0.72 (p < 0.01)
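
For reference, the Pearson coefficient reported in Table 5 can be computed as in the sketch below. What constitutes one paired observation is not restated in this section, so the example assumes, hypothetically, one evaluator-model score and one precision value per generated dataset; the numbers are placeholders and not results from this study.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape or x.ndim != 1:
        raise ValueError("x and y must be one-dimensional and of equal length")
    xc = x - x.mean()  # center both variables
    yc = y - y.mean()
    # Covariance divided by the product of the standard deviations;
    # the normalizing constants cancel, so plain sums are enough.
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


if __name__ == "__main__":
    # Placeholder values: one pair per hypothetical generated dataset.
    evaluator_score = [0.21, 0.35, 0.48, 0.60, 0.77, 0.90]
    precision = [0.30, 0.41, 0.39, 0.66, 0.81, 0.95]
    print(f"r = {pearson_r(evaluator_score, precision):.2f}")
```

The p-values in Table 5 additionally require a significance test; `scipy.stats.pearsonr` returns both the coefficient and a two-sided p-value.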

Another interesting perspective emerged from analyzing the relationship between the evaluator model and the medical experts, since both classified samples individually. This shared sample-level view made it logical to evaluate the fidelity of synthetic ECG signals by comparing the performance of the evaluator model to that of the experts on the same classification task.

The results, illustrated in Figure 4, reveal that the evaluator model correctly identified 7 real signals and classified 7 synthetic signals as real, demonstrating a notable degree of similarity with the medical experts, who by majority vote classified 8 synthetic signals as real. In addition, both the evaluator model and the experts had some difficulty recognizing certain real signals as real. The alignment in performance between the evaluator model and the human evaluators supports the conclusion that the synthetic data closely resembles genuine ECG tracings, reinforcing its fidelity.


Figure 4: Confusion matrix for the classification of signals by the evaluator model.

In summary, the strong correlation with established evaluation metrics and the similar performance with clinical experts reinforce the potential of the evaluator model as a robust tool for assessing the quality of synthetic ECG signals at a sample level.

5 CONCLUSION

In healthcare, synthetic data has shown potential to improve patient care by supporting clinical research and advancing the development and training of ML models for diagnostic support systems. However, medical data must be of high quality and have clinical relevance, as it can significantly impact patient outcomes. As a result, evaluating the quality of generated data becomes a crucial yet ambiguous step, since there is no standard procedure for assessing the quality of synthetic datasets.

Considering the challenges outlined above, this work introduces an approach for generating and evaluating highly realistic ECG signals. The SSSD-ECG model successfully produced synthetic samples that closely resemble real ECG samples, with validation from quantitative metrics and expert feedback. However, while the synthetic data demonstrated high fidelity, its utility in real-world applications for training models was more limited, likely due to issues with diversity. Despite these limitations, the research prioritized realism, and several criteria support the conclusion that the synthetic ECG data is sufficiently realistic, demonstrating its potential for further exploration.

This study also introduced a novel evaluator model capable of assessing synthetic ECG signals at the sample level, offering a different perspective from traditional distribution-based metrics. The alignment of this model's results with expert evaluations and with established fidelity metrics underscores its effectiveness. These findings not only support the quality of the synthetic data but also demonstrate the evaluator model's capacity as a potential tool for fidelity assessment. The evaluator model was trained using a quality dataset also developed in this research.

Although the results are promising, there are certain limitations and opportunities for future research to address. The SSSD-ECG model, while effective in generating realistic ECG signals, still faces challenges with the diversity of the generated samples. This limitation is reflected in the low recall values, which suggest that the model struggles to fully replicate the variety of real ECG data. Moreover, the small number of clinical evaluators involved in the validation process limits the robustness of the results; future work should therefore include a larger pool of experts. Another area for improvement is the evaluator model. Expanding its capabilities to assess whether diagnostic labels of real signals are correctly assigned would enhance the evaluation of the conditional component of the model. Furthermore, exploring other sample-level metrics for synthetic data evaluation could provide a more nuanced understanding of data quality.

In conclusion, this work addresses challenges in generating and evaluating synthetic ECG data. While there are areas for improvement, high-quality medical data remains essential for research and development of models for real-world applications. By advancing towards high-fidelity ECG data generation and evaluation, this research paves the way for future innovations in the field.

ACKNOWLEDGEMENTS

This work was funded by AISym4Med project number 101095387, supported by the European Health and Digital Executive Agency (HADEA) under the authority delegated by the European Commission.

REFERENCES

Alaa, A., Van Breugel, B., Saveliev, E. S., and van der Schaar, M. (2022). How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models. In International Conference on Machine Learning, pages 290–306. PMLR.

Alcaraz, J. M. L. and Strodthoff, N. (2023). Diffusion-based conditional ECG generation with structured state space models. Computers in Biology and Medicine, page 107115.

Athif, M. and Daluwatte, C. (2017). Combination of rule based classification and decision trees to identify low quality ECG. In 2017 IEEE International Conference on Industrial and Information Systems (ICIIS), pages 1–4. IEEE.

Barandas, M., Folgado, D., Fernandes, L., Santos, S., Abreu, M., Bota, P., Liu, H., Schultz, T., and Gamboa, H. (2020). TSFEL: Time series feature extraction library. SoftwareX, 11:100456.

Belo, D., Rodrigues, J., Vaz, J. R., Pezarat-Correia, P., and Gamboa, H. (2017). Biosignals learning and synthesis using deep neural networks. Biomedical engineering online, 16:1–17.

Di Costanzo, A., Spaccarotella, C. A. M., Esposito, G., and Indolfi, C. (2024). An artificial intelligence analysis of electrocardiograms for the clinical diagnosis of cardiovascular diseases: a narrative review. Journal of Clinical Medicine, 13(4):1033.

Dissanayake, T., Fernando, T., Denman, S., Sridharan, S., and Fookes, C. (2022). Generalized generative deep learning models for biosignal synthesis and modality transfer. IEEE Journal of Biomedical and Health Informatics, 27(2):968–979.

Esteban, C., Hyland, S. L., and Rätsch, G. (2017). Real-valued (medical) time series generation with recurrent conditional GANs. arXiv preprint arXiv:1706.02633.

Fekri, M. N., Ghosh, A. M., and Grolinger, K. (2019). Generating energy data for machine learning with recurrent generative adversarial networks. Energies, 13(1):130.

Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., and Aila, T. (2019). Improved precision and recall metric for assessing generative models. Advances in neural information processing systems, 32.

Liu, C., Zhang, X., Zhao, L., Liu, F., Chen, X., Yao, Y., and Li, J. (2018). Signal quality assessment and lightweight QRS detection for wearable ECG smartvest system. IEEE Internet of Things Journal, 6(2):1363–1374.

Liu, G., Han, X., Tian, L., Zhou, W., and Liu, H. (2021). ECG quality assessment based on hand-crafted statistics and deep-learned S-transform spectrogram features. Computer Methods and Programs in Biomedicine, 208:106269.

Mahajan, P., Uddin, S., Hajati, F., and Moni, M. A. (2023). Ensemble learning for disease prediction: A review. In Healthcare, volume 11, page 1808. MDPI.

Monachino, G., Zanchi, B., Fiorillo, L., Conte, G., Auricchio, A., Tzovara, A., and Faraci, F. D. (2023). Deep generative models: The winning key for large and easily accessible ECG datasets? Computers in biology and medicine, page 107655.

Murtaza, H., Ahmed, M., Khan, N. F., Murtaza, G., Zafar, S., and Bano, A. (2023). Synthetic data generation: State of the art in health care domain. Computer Science Review, 48:100546.

Naeem, M. F., Oh, S. J., Uh, Y., Choi, Y., and Yoo, J. (2020). Reliable fidelity and diversity metrics for generative models. In International Conference on Machine Learning, pages 7176–7185. PMLR.

Neifar, N., Ben-Hamadou, A., Mdhaffar, A., and Jmaiel, M. (2023). DiffECG: A versatile probabilistic diffusion model for ECG signals synthesis. arXiv preprint arXiv:2306.01875.

Nishikimi, R., Nakano, M., Kashino, K., and Tsukada, S. (2023). Variational autoencoder–based neural electrocardiogram synthesis trained by FEM-based heart simulator. Cardiovascular Digital Health Journal.

Reyna, M. A., Sadr, N., Alday, E. A. P., Gu, A., Shah, A. J., Robichaux, C., Rad, A. B., Elola, A., Seyedi, S., Ansari, S., et al. (2021). Will two do? Varying dimensions in electrocardiography: the PhysioNet/Computing in Cardiology Challenge 2021. In 2021 Computing in Cardiology (CinC), volume 48, pages 1–4. IEEE.

Sajjadi, M. S., Bachem, O., Lucic, M., Bousquet, O., and Gelly, S. (2018). Assessing generative models via precision and recall. Advances in neural information processing systems, 31.

Stein, G., Cresswell, J., Hosseinzadeh, R., Sui, Y., Ross, B., Villecroze, V., Liu, Z., Caterini, A. L., Taylor, E., and Loaiza-Ganem, G. (2024). Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. Advances in Neural Information Processing Systems, 36.

Stenger, M., Leppich, R., Foster, I., Kounev, S., and Bauer, A. (2024). Evaluation is key: a survey on evaluation measures for synthetic time series. Journal of Big Data, 11(1):66.

Wagner, P., Strodthoff, N., Bousseljot, R.-D., Kreiseler, D., Lunze, F. I., Samek, W., and Schaeffter, T. (2020). PTB-XL, a large publicly available electrocardiography dataset. Scientific data, 7(1):1–15.

Wulan, N., Wang, W., Sun, P., Wang, K., Xia, Y., and Zhang, H. (2020). Generating electrocardiogram signals by deep learning. Neurocomputing, 404:122–136.

Zama, M. H. and Schwenker, F. (2023). ECG synthesis via diffusion-based state space augmented transformer. Sensors, 23(19):8328.

Zhang, J., Wang, L., Zhang, W., and Yao, J. (2018). A signal quality assessment method for electrocardiography acquired by mobile device. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1–3. IEEE.

APPENDIX

This appendix provides supplementary visualizations of two examples of Normal 12-lead ECG tracings included in the human evaluation questionnaire. Figure 5 shows a real ECG sourced from the PTB-XL database, while Figure 6 depicts a synthetic ECG generated by the SSSD-ECG model.


Figure 5: Real ECG sourced from the PTB-XL database.

Figure 6: Synthetic ECG generated by the SSSD-ECG.
