Hybrid Classical Quantum Learning Model Framework for Detection of Deepfake Audio

Atul Pandey^a and Bhawana Rudra^b
Department of Information Technology, National Institute of Technology,
Srinivasnagar-575025, Mangalore, Dakshina Kannada, Karnataka, India
{atulpandey.233it001, bhawanarudra}@nitk.edu.in
^a https://orcid.org/0009-0004-3106-4995
^b https://orcid.org/0000-0001-7651-3820
Keywords: Quantum Machine Learning, Quantum Deep Learning, Quantum-Deepfake Audio, Hybrid Classical Quantum Model, Deepfake Audio.
Abstract: Artificial intelligence (AI) has simplified many everyday tasks. However, it also enables the creation of fake images, audio, and videos that can be misused to tarnish a person's reputation on social media. The rapid advancement of deepfake technology presents significant challenges in detecting such fabricated content. In this paper, we therefore focus on deepfake audio detection. Many Classical models exist to detect deepfake audio, but they often overlook critical audio features, and training them can be computationally resource-intensive. To address this issue, we use a real-time AI-generated fake speech dataset, which includes the features required to train the models, and apply Quantum Machine Learning (QML) techniques, which follow the principles of quantum mechanics to process data in superposition. We propose a hybrid Classical-Quantum Learning Model that takes advantage of both Classical and Quantum Machine Learning. The hybrid model is trained on the real-time AI-generated fake speech dataset, and we compare its performance with existing Classical and Quantum models in this area. Our results show that the hybrid Classical-Quantum model achieves an accuracy of 98.81%, outperforming the Quantum Support Vector Machine (QSVM) and Quantum Neural Network (QNN).
1 INTRODUCTION
Technological innovation continues to simplify human tasks, with one key advancement being Artificial Intelligence (AI), which enables people to work more efficiently and intelligently. AI can be used to generate images, videos, digital avatars, and even video dubbing (Nguyen et al., 2022). Unfortunately, this technology is sometimes exploited to tarnish the reputation of individuals by creating fake content, such as forged voices, images, and videos, using deepfake techniques, in which deep learning models are employed to fabricate such content (Dagar and Vishwakarma, 2022). In a recent case, fraudsters utilized AI-driven software to imitate the voice of a company's CEO, successfully extorting USD 243,000 (The Wall Street Journal, 2019). As a result, there is growing interest in developing methods for detecting fraudulent voices (Khochare et al., 2021). People often rely on their knowledge and environmental awareness to identify fake audio. However, the rapid advancements in deepfake audio generation have underscored the importance of addressing these challenges. One specific type of deepfake audio is voice conversion, where the voice of one person is swapped with another's (Yi et al., 2023).
Researchers commonly rely on Classical Deep Learning models to identify deepfake content. However, training these models demands significant computational resources, even when using high-performance hardware such as Graphics Processing Units (GPUs). While GPUs offer parallel processing capabilities, their performance is limited by the number of cores, leading to slower processing than Quantum computers. Quantum computers leverage the principles of Quantum mechanics, such as superposition, entanglement, and interference, to process data far more efficiently. These systems operate on Quantum bits (Qubits), each of which can exist in a superposition of Quantum states. Qubits provide exponential computational speed-up by simultaneously accessing multiple Quantum states: a system with N Qubits can access 2^N states concurrently (Zaman et al., 2024).
Figure 1: Four Sub-Categories of QML (Aïmeur et al., 2006).
Quantum Machine Learning (QML) offers a range of methods to address complex problems efficiently. Problems in QML can be approached using four distinct methods, as illustrated in Figure 1. The four sub-categories of QML use a Quantum-inspired machine learning algorithm to process the Classical or Quantum data involved in the problem, and the processing is carried out on either a Classical or a Quantum computer (Aïmeur et al., 2006). Each sub-category is outlined below:
CC Approach. Classical data is processed on a
Classical computer using a Quantum-inspired ma-
chine learning algorithm.
CQ Approach. Classical data is processed on
a Quantum computer using a Quantum-inspired
machine learning algorithm.
QC Approach. Quantum data is processed on a
Classical computer using a Quantum-inspired ma-
chine learning algorithm.
QQ Approach. Quantum data is processed on
a Quantum computer using a Quantum-inspired
machine learning algorithm.
This paper is structured as follows: Section 2 discusses the existing Classical and Quantum models for deepfake audio, along with the background of Quantum circuits. Section 3 presents a method to detect deepfake audio and describes the proposed hybrid model. Section 4 provides a comparative analysis of Quantum and Classical systems. Section 5 provides insights into the implementation and an overview of model performance. Section 6 provides a detailed discussion. Section 7 concludes the paper and outlines future work.
2 LITERATURE REVIEW
Several research papers have explored deepfake audio classification using machine learning and deep learning models. The authors (Wu et al., 2018) proposed a Convolutional Neural Network (CNN) based classifier, Light CNN, that filters noise in voice signals while preserving key information. Convolutional-Recurrent Neural Network (CRNN) based spoofing detection uses five 1D convolutional layers, a Long Short-Term Memory (LSTM) layer, and two fully connected layers to perform end-to-end detection of deepfake audio (Chintha et al., 2020). The authors (Khochare et al., 2021) proposed two approaches: one uses audio features for classification via machine learning models, while the other classifies images of the audio signals using a temporal convolutional network and a spatial transformer network. While these deep learning models achieve better results, they did not consider the Short-Time Fourier Transform (STFT) and Mel-Frequency Cepstral Coefficients (MFCCs), two of the most effective features of the audio signal. The authors (Zhang et al., 2021) proposed a Squeeze and Excitation Network (SENet) that captures interdependencies between channels but requires more computational time for training. The authors (Hamza et al., 2022) presented a method for handling large datasets and classifying them using various machine learning algorithms. The Support Vector Machine (SVM) performed well on the For-rece and For-2-sec datasets, but for the For-norm dataset, gradient boosting produced better results. However, this work did not address fluctuations and distortions in the audio signals. The authors (Mcuba et al., 2023) extracted various features from the fake audio files, such as MFCCs, Mel-Spectrum, Chromagram, and Spectrogram, and converted them into images. A custom model and a VGG model were trained on these audio features; the results show that VGG performed well on the MFCCs feature, while the custom model performed better on the remaining features. The authors (Doan et al., 2023) proposed a breathing-talking-silence encoder to detect deepfake audio using the ASVspoof 2019 and 2021 datasets; their results show that the performance of the classifier increased by 40%. The authors (Wu et al., 2024) proposed a deepfake detector based on contrastive learning. This method minimizes the variation in audio caused by manipulation, which increases the robustness of the model for deepfake audio detection. The authors (Pham et al., 2024) used the ASVspoof 2019 benchmark dataset and extracted Spectrograms from the audio. A CNN-based model, various pre-trained models, and ensemble models were trained on the Spectrograms; the results show that the ensemble model performs better than the other models. The authors (Li et al., 2024) proposed the SafeEar framework to detect deepfake audio without relying on semantic content, so that private content in the audio remains secure. They introduced a neural audio codec that separates semantic and acoustic information, and they rely only on the acoustic information. The framework was tested on 4 datasets and achieved an error rate as low as 2.02%, making it suitable for anti-deepfake and anti-content-recovery applications. However, the method is limited to acoustic features, which makes it less effective against nuanced manipulations that mimic natural patterns. The authors (Saha et al., 2024) proposed a method to execute machine learning and deep learning programs for deepfake audio detection on the Central Processing Unit (CPU). Their framework utilizes a self-supervised learning-based pre-trained model, and the results show a 0.90% error rate with 1000 trainable parameters. Most papers focus on Classical models, and only a few have explored Quantum learning models, mostly for deepfake images. Therefore, this paper focuses on deepfake audio.
The authors (Mittal et al., 2020) proposed a method to detect fake images based on feature extraction using a Quantum-inspired evolutionary algorithm, though it lacks parameter fine-tuning and noise filtering. The authors (Mishra and Samanta, 2022) introduced a Quantum-based transfer learning approach to detect deepfake images, where features are extracted from a pre-trained ResNet-18 model and classified using a Quantum Neural Network (QNN). The authors (Pandey and Rudra, 2024) proposed a method to detect deepfake audio speech using a Quantum Support Vector Machine (QSVM) and a QNN. However, the performance of these Quantum models did not match that of their Classical counterparts.
The challenge in detecting deepfake audio lies in recognizing the features within an audio signal that distinguish genuine from fake content. The literature shows that AI-based models, built on Classical Deep Learning techniques, are capable of effectively learning and predicting audio authenticity. Many authors have proposed Quantum models to exploit the Quantum advantage and overcome the limitations of Classical computers. However, the Quantum models for detecting deepfake audio did not perform well compared to the Classical models. Therefore, to improve model performance, we propose a hybrid Classical-Quantum learning model that takes advantage of both Classical and Quantum Machine Learning.
2.1 Quantum Preliminaries
Figure 2: Parametrized Quantum Circuit (Benedetti et al., 2019).

A Quantum circuit comprises two essential components: the feature map and the variational form. The feature map encodes Classical data into a Quantum state, while the variational form adjusts this Quantum state towards the desired target state by iteratively tuning parameters.
Feature Maps. There are several methods for embedding Classical data into Quantum states through feature mapping. A common technique is Angle Embedding.

Angle Embedding. Angle Embedding is one of the simplest approaches for encoding floating-point data. It transforms a single floating-point value x ∈ R into a Quantum state using equation (1):

R_k(x)|0⟩ = e^{-i(x/2)σ_k}|0⟩   (1)

Here, k ∈ {x, y, z} denotes the rotation axis on the Bloch sphere, implemented through the Pauli rotation gates. These rotations are applied to the data being encoded. In the case of Angle Embedding, the number of rotations corresponds to the number of features in the dataset (Schuld and Petruccione, 2018).
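As an illustration, the following is a minimal sketch of Angle Embedding in PennyLane; the qubit count and feature values are illustrative assumptions, not the exact configuration used in this paper.

import pennylane as qml
from pennylane import numpy as np

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def encode(features):
    # One Pauli-Y rotation per feature: each value x_j is encoded as R_Y(x_j)|0>.
    qml.AngleEmbedding(features, wires=range(n_qubits), rotation="Y")
    return qml.state()

state = encode(np.array([0.1, 0.5, 0.9, 1.3]))  # one rotation per feature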
Parametrized Quantum Circuits (PQCs). Variational Quantum Circuits (VQCs), or Parametrized Quantum Circuits (PQCs), are quantum algorithms characterized by their reliance on free parameters. In QML, VQCs encode Classical data into a Quantum state using the feature maps discussed in section 2.1 and then apply a variational form to create the QNN. The parameters used in the variational form are optimized through an iterative process. Measurement of a Quantum circuit yields stochastic output, so the experiment is repeated multiple times to obtain expectation values, resulting in a probability distribution over the basis states. This probability distribution is passed to a Classical algorithm that computes the loss (or cost) function, which measures the difference between the predicted and true labels. The result is given to a Classical optimizer, which updates the parameters of the Quantum circuit to minimize the loss function. Figure 2 shows the working principle of PQCs (Benedetti et al., 2019), consisting of feature mapping, variational forms, and optimization.
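A minimal sketch of this optimization loop in PennyLane is shown below; the circuit layout, squared-error loss, and gradient-descent optimizer are illustrative assumptions rather than the exact training setup used later in the paper.

import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev)
def circuit(x, theta):
    qml.AngleEmbedding(x, wires=[0, 1])   # feature map
    qml.RY(theta[0], wires=0)             # variational form
    qml.RY(theta[1], wires=1)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(1))      # expectation from repeated measurements

def loss(theta, X, y):
    # Squared difference between predictions and true labels.
    errors = [(circuit(x, theta) - yi) ** 2 for x, yi in zip(X, y)]
    return sum(errors) / len(errors)

opt = qml.GradientDescentOptimizer(stepsize=0.1)   # classical optimizer
theta = np.array([0.01, 0.01], requires_grad=True)
X, y = np.array([[0.1, 0.4], [1.2, 0.7]]), np.array([1.0, -1.0])
for _ in range(50):                                # iterative parameter updates
    theta = opt.step(lambda t: loss(t, X, y), theta)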
Figure 3: Proposed Methodology.
Algorithm 1: TwoLocal(n, k, θ).
Input: n, k, θ
for r = 0 to k do
    Add the r-th rotation layer:
    for j = 1 to n do
        Apply an R_Y(θ_{rj}) gate on qubit j.
    end
    Create entanglement between layers:
    if r < k then
        for t = 1 to n − 1 do
            Apply a CNOT gate with control on qubit t and target on qubit t + 1.
        end
    end
end
Variational Form. The variational form of a Quantum Neural Network (QNN) mimics the layered architecture of Classical Neural Networks. It relies on optimizable parameters θ and introduces entanglement between Qubits through a parameter-independent circuit U_ent. Multiple layers (or repetitions) can be stacked in the variational circuit. In our model, we employ a TwoLocal variational form with n Qubits and k repetitions; the total number of parameters to optimize is n × (k + 1). These parameters, denoted θ_{rj}, are indexed by r (from 0 to k) and j (from 1 to n). Note that this creates k entangling layers, not k + 1. Algorithm 1 defines the construction of the TwoLocal variational form (Fernández-Combarro Álvarez, 2023).
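The following is a sketch of Algorithm 1 as a PennyLane subroutine; the function name is ours, while the RY-rotation and CNOT-chain gate choices follow the algorithm above.

import pennylane as qml

def two_local(theta, n, k):
    # theta has shape (k + 1, n), matching the n x (k + 1) trainable parameters.
    for r in range(k + 1):
        for j in range(n):
            qml.RY(theta[r, j], wires=j)      # r-th rotation layer
        if r < k:                             # entangle between layers only
            for t in range(n - 1):
                qml.CNOT(wires=[t, t + 1])    # control on qubit t, target on t + 1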
Quantum Kernel. Consider a Quantum model f(x) defined as in equation (2):

f(x) = ⟨ψ(x)|M|ψ(x)⟩   (2)

In equation (2), |ψ(x)⟩ is a Quantum state generated by an embedding circuit that encodes the input data x, and M is a chosen observable. ⟨ψ(x)| is the conjugate transpose of |ψ(x)⟩, i.e., ⟨ψ(x)| = (|ψ(x)⟩)†. This formulation encompasses variational QML models, because the observable M can be realized through a simple measurement preceded by a variational circuit. Instead of training the function f using variational methods, we can often achieve the same result by employing a Classical kernel method, where the kernel is computed on a Quantum device. Equation (3) gives the Quantum kernel, determined by the overlap between two Quantum states encoding different data points:

κ(x, x′) = |⟨ψ(x′)|ψ(x)⟩|²   (3)

By using this kernel-based approach, we avoid the need for processing and measuring the typical variational circuits, focusing solely on the data encoding (Schuld and Killoran, 2018).
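A minimal sketch of equation (3) in PennyLane follows; using Angle Embedding as the encoding circuit is an assumption for illustration. The overlap is obtained by applying the embedding for x, then the adjoint of the embedding for x′, and reading off the probability of the all-zeros state.

import pennylane as qml
from pennylane import numpy as np

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def overlap_circuit(x1, x2):
    qml.AngleEmbedding(x1, wires=range(n_qubits))               # prepares |psi(x1)>
    qml.adjoint(qml.AngleEmbedding)(x2, wires=range(n_qubits))  # applies <psi(x2)|
    return qml.probs(wires=range(n_qubits))

def kernel(x1, x2):
    # P(|0...0>) equals |<psi(x2)|psi(x1)>|^2, i.e., kappa(x1, x2).
    return overlap_circuit(x1, x2)[0]

k = kernel(np.array([0.1, 0.2, 0.3, 0.4]), np.array([0.1, 0.2, 0.3, 0.5]))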
3 METHODOLOGY
The authors (Pandey and Rudra, 2024) proposed deepfake speech detection using Quantum models, namely QSVM and QNN, and compared their performance with Classical models such as the Support Vector Machine (SVM) and Artificial Neural Networks (ANN). To improve the detection performance of the Quantum model, we propose a hybrid Classical-Quantum model that takes advantage of both Classical and Quantum Machine Learning. We also train a Classical 1D CNN to compare against the proposed model. A numerical dataset containing features extracted from audio speech is used to train the models. Figure 3 illustrates the proposed methodology, which is broken down into three stages:
Input Phase. In this stage, the dataset is taken as input and undergoes preprocessing.

Training and Testing Phase. This stage encodes the Classical data into Quantum states using the embedding technique and applies the variational form to create the QNN (refer to section 2.1). The hybrid model is trained, and its performance is evaluated on the test dataset.

Output Phase. This stage compares the results of the Classical models, the Quantum models, and the hybrid model.

Figure 4: Hybrid Classical 1D Convolution Quantum Neural Network.
Figure 5: Quantum Layer Circuit of Hybrid Model (HC1CQNN).
3.1 Dataset
We consider the recent audio numerical dataset "Real-time detection of AI-generated speech for deepfake voice conversion", published in 2023. J. J. Bird and A. Lotfi applied the two-stem model (Hennequin et al., 2020) from Spleeter to separate actual speech into natural vocals and accompaniment (background noise). The Spleeter model comprises 12 layers, organized into two sets of 6 layers each for the encoder-decoder Convolutional Neural Network (CNN) within a U-Net architecture. Following this, the unprocessed vocals were converted into synthesized vocals of distinct individuals using the Retrieval-Based Voice Conversion (RVC) model. The background noise and RVC-generated vocals were then combined to produce synthetic speech. The authors employed the Python-based Librosa library (McFee et al., 2015) to extract 26 different features: the chromagram (chroma_stft), spectral centroid, spectral bandwidth, spectral rolloff, root mean square (RMS), and twenty Mel-Frequency Cepstral Coefficients (MFCCs) (Bird and Lotfi, 2023).
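For concreteness, a minimal sketch of extracting these 26 features with Librosa is shown below; aggregating each feature as its mean over frames is our assumption about how the published dataset summarizes a clip, and the function name is illustrative.

import librosa
import numpy as np

def extract_features(path):
    y, sr = librosa.load(path, sr=None)
    feats = [
        librosa.feature.chroma_stft(y=y, sr=sr).mean(),         # chromagram
        librosa.feature.spectral_centroid(y=y, sr=sr).mean(),
        librosa.feature.spectral_bandwidth(y=y, sr=sr).mean(),
        librosa.feature.spectral_rolloff(y=y, sr=sr).mean(),
        librosa.feature.rms(y=y).mean(),                        # root mean square
    ]
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)         # 20 MFCCs
    feats.extend(mfccs.mean(axis=1))                            # 26 features in total
    return np.array(feats)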
3.2 Preprocessing
To ensure that the model does not become biased towards any particular class, the dataset is shuffled to introduce randomness. It is then divided into 80% for training and 20% for testing. Since some features in the dataset have varying ranges, feature scaling is applied to bring them to a uniform scale, which improves the algorithm's convergence speed. The Min-Max scaling method normalises the features while preserving the original data range and enhancing interpretability. The scaler is fitted on the training data and then applied to transform the test data, thereby avoiding data leakage during model evaluation.
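A minimal sketch of this split-then-scale procedure with scikit-learn follows, assuming the features and labels are already loaded into arrays X and y; the random seed is an illustrative choice.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)  # shuffled 80/20 split

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)  # fit on training data only
X_test = scaler.transform(X_test)        # reuse training statistics, no leakage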
3.3 Hybrid Classical Quantum Learning Model
The literature shows that a CNN helps a model learn the necessary features from the data, which are extracted using the convolution operation. To take advantage of the CNN, we propose a Hybrid Classical 1D Convolution Quantum Neural Network (HC1CQNN), shown in figure 4. This hybrid model combines four layers: a 1D Convolution Layer, a Classical Neurons Layer, a Quantum Layer, and a Single Classical Neuron Layer. The Classical 1D Convolution extracts the important features from the audio dataset and feeds them as input to the Classical Neurons. This reduces the dimension of the data, which is essential for the Quantum Layer: we cannot feed all the extracted features into the Quantum Layer because of the limitations of current Quantum simulators. The final layer (a Single Classical Neuron) of the hybrid model classifies the audio speech as real or fake. The hybrid model is trained as a single unit.
The hybrid model in figure 4 performs the convolution operation on the input of size 1x26 with 32 filters of size 3, which results in 32 feature maps of size 1x26, and then applies a Max Pooling Layer (size 2) to the feature maps, producing 32 feature maps of size 1x13. Each feature map contains 13 features, which yields 416 features after flattening. These 416 features cannot be fed directly into the Quantum Layer because of the restriction on the number of qubits and the limitations of Quantum simulators. To handle this issue, we apply 32 Classical Neurons followed by 4 Classical Neurons after flattening, and their output becomes the input to a Quantum Layer with 4 qubits, avoiding system crashes. The Quantum Layer converts the Classical data into a Quantum state using Angle Embedding, learns the patterns in the embedded data using the TwoLocal variational form discussed in section 2.1, and then performs measurements on all the qubits. Figure 5 shows the Quantum Layer circuit diagram of HC1CQNN, which includes the Angle Embedding, the TwoLocal variational form, and the measurement of the circuit. The measurement result is the input to the Single Classical Neuron, which performs the final classification of fake and real audio.
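A sketch of this architecture with TensorFlow and PennyLane's Keras interface follows, using the layer sizes from the text; the activations, the number of variational repetitions, and the use of BasicEntanglerLayers with RY rotations (a CNOT-ring analogue of the TwoLocal chain) are our assumptions rather than the exact training configuration.

import pennylane as qml
import tensorflow as tf

n_qubits, n_layers = 4, 2
dev = qml.device("lightning.qubit", wires=n_qubits)

@qml.qnode(dev)
def quantum_layer(inputs, weights):
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits), rotation=qml.RY)
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]  # measure all qubits

weight_shapes = {"weights": (n_layers, n_qubits)}
qlayer = qml.qnn.KerasLayer(quantum_layer, weight_shapes, output_dim=n_qubits)

model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(32, 3, padding="same", activation="relu",
                           input_shape=(26, 1)),    # 32 feature maps of size 1x26
    tf.keras.layers.MaxPooling1D(2),                # 32 feature maps of size 1x13
    tf.keras.layers.Flatten(),                      # 416 features
    tf.keras.layers.Dense(32, activation="relu"),   # 32 Classical Neurons
    tf.keras.layers.Dense(4, activation="tanh"),    # 4 Neurons feed the 4 qubits
    qlayer,                                         # Quantum Layer
    tf.keras.layers.Dense(1, activation="sigmoid"), # Single Classical Neuron
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])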
4 QUANTUM AND CLASSICAL SYSTEM ANALYSIS
This section analyzes a Quantum system Q and a Classical system C to evaluate their computational performance in deep learning tasks. Assume that both systems take classical data x consisting of n bits as input.

x : Classical data
n : Number of bits

The system Q processes x using the PQC discussed in section 2.1 and operates on all 2^n states simultaneously. Assume that Q requires p units to process these 2^n states, from encoding to measurement. The measurement results are then fed into a classical optimizer, which adjusts the parameters used in Q. Assume the optimizer takes q units per optimization iteration. Thus, the PQC requires (p + q) units for a single run. To reach the desired minimum loss, the PQC runs z times.

p : Units to process Q
q : Units taken by the classical optimizer
p + q : Units taken by the PQC for a single run
z : Number of times the PQC runs

The total processing time for system Q is:

T_Q = (p + q) × z   (4)

Now assume the Classical system C with a GPU can handle m states concurrently. With 2^n states to process, C needs to perform approximately 2^n / m sequential processing steps. Each processing step requires r units, and its result is fed into the classical optimizer, which adjusts the parameters for C. Thus, system C requires (2^n / m) × (r + q) units for a single run. To achieve the minimum loss, system C also runs z times.

m : States handled by the GPU
2^n / m : Sequential processing steps
r : Units required per processing step
(2^n / m) × (r + q) : Units taken by system C for a single run
z : Number of times system C runs

The total processing time for the Classical system C with a GPU is:

T_C^GPU = (2^n / m) × (r + q) × z   (5)

For the Quantum system Q to outperform the Classical system C with GPU support, we require:

(p + q) × z < (2^n / m) × (r + q) × z

Dividing by z (assuming z ≠ 0) gives:

p + q < (2^n / m) × (r + q)

Based on this analysis, we observe that the Quantum system Q maintains an advantage as n grows. Since 2^n grows exponentially, (2^n / m) × (r + q) becomes large even with significant GPU parallelism. Thus, adding a GPU to the Classical system C still does not eliminate the exponential scaling challenge faced by Classical processing.
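As a toy numeric check of this inequality, the short sketch below compares T_Q and T_C^GPU under assumed unit costs; all values of p, q, r, m, and z are illustrative placeholders, not measurements.

def t_quantum(p, q, z):
    return (p + q) * z                    # equation (4)

def t_classical_gpu(n, m, r, q, z):
    return (2 ** n / m) * (r + q) * z     # equation (5)

p, q, r, z = 50, 5, 1, 100                # assumed cost units and run count
m = 10_000                                # states the GPU handles concurrently
for n in (10, 20, 30):
    print(n, t_quantum(p, q, z), t_classical_gpu(n, m, r, q, z))
# The classical term grows with 2^n while the quantum term stays flat.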
Figure 6: Loss Graph of 1D Convolution Neural Network During Training.
Figure 7: Loss Graph of Hybrid Classical 1D Convolution Quantum Neural Network During Training.
5 IMPLEMENTATION & RESULTS
The authors (Bird and Lotfi, 2023) created a .csv file comprising both real and fake voice samples. This file contains 11,778 data points (rows) and 26 features (columns). The authors (Pandey and Rudra, 2024) subsequently reduced the dataset to 2,000 data points and applied Principal Component Analysis (PCA) to reduce the feature count from 26 to 13 for training the Quantum models (QSVM and QNN), in order to prevent system crashes. We utilize the entire dataset to enhance the performance of the Quantum model through a hybrid model approach. We employed an NVIDIA RTX A4000 GPU to run the Classical and hybrid models. Both the hybrid model and the Classical 1D CNN were trained on all 11,778 data points with 26 features to ensure a fair comparison of the models on the same scale. The Classical 1D CNN and the hybrid model were implemented using TensorFlow version 2.15.0. For the hybrid model, we used PennyLane (Bergholm et al., 2022) version 0.36.0 to encode the classical data and construct the Quantum Layer through PennyLane's TensorFlow interface. The hybrid model was trained on the lightning.qubit simulator provided by PennyLane, which offers efficient linear algebra computation and differentiation methods to train the hybrid model effectively. These simulators use Quantum algorithms that leverage Quantum properties to execute QML programs. However, at the hardware level, simulators run on Classical computers, which may require several days or even weeks to complete QML tasks, resulting in long execution times and, sometimes, system crashes. Table 1 presents the classification metrics for both the Classical (SVM and ANN) and Quantum (QSVM and QNN) models, as discussed in (Pandey and Rudra, 2024). The QSVM achieves 90.02% accuracy, 95.97% precision, 83.50% recall, and an 89.30% F1-score; the QNN achieves 70.07%, 72.47%, 64.50%, and 68.25%, respectively. These results indicate that the reduced dataset and simulator limitations lead to performance degradation for the QSVM and QNN. Our hybrid model implementation overcomes these issues and yields improved results. Table 2 provides the classification metrics of the 1D CNN and the hybrid model (HC1CQNN) on the test dataset. The training loss graphs are shown in figures 6 and 7. We observe that all the metrics range from 98-99% for both the 1D CNN and the hybrid model (HC1CQNN), and that the training loss curve of the hybrid model closely matches that of the Classical 1D CNN, indicating that the hybrid model is trained well. This shows that the hybrid model has improved its performance over the earlier Quantum models.
6 DISCUSSION
In this study, we evaluate both Classical and Quantum Machine Learning for deepfake audio detection. To leverage the Quantum advantage, the authors (Pandey and Rudra, 2024) used Quantum models to perform deepfake audio detection. However, the performance of the Quantum models lags behind that of the Classical models, as shown in Table 1. This is likely due to the use of fewer features and rows from the dataset (Bird and Lotfi, 2023), as well as the limitations of simulators. Therefore, to improve detection performance, we proposed the hybrid model, which leverages the advantages of both the Classical and Quantum models. The results in Table 2 show that the hybrid model achieves results nearly identical to the Classical 1D CNN. This indicates that the performance of the Quantum models (QSVM and QNN) can be enhanced through a hybrid approach.
Table 1: Classification Metric Results of Classical and Quantum Models.

Model Type      | Models                                  | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%)
Classical Model | Classical SVM (Pandey and Rudra, 2024)  | 83.79        | 86.43         | 81.90      | 84.10
Classical Model | Classical ANN (Pandey and Rudra, 2024)  | 95.00        | 96.27         | 93.29      | 94.76
Quantum Model   | QSVM (Pandey and Rudra, 2024)           | 90.02        | 95.97         | 83.50      | 89.30
Quantum Model   | QNN (Pandey and Rudra, 2024)            | 70.07        | 72.47         | 64.50      | 68.25
Table 2: Classification Metric Results of Classical and Hybrid Model.

Model Type      | Models                  | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%)
Classical Model | Classical 1D CNN        | 99.19        | 98.89         | 99.48      | 99.19
Quantum Model   | Hybrid Model (HC1CQNN)  | 98.81        | 98.39         | 99.23      | 98.80
Our analysis in section 4 suggests that Quantum systems offer faster computation than Classical systems. However, in practice, this advantage is not fully realized due to the current limitations of Quantum simulators.
7 CONCLUSION AND FUTURE WORK

This paper demonstrates the application of Quantum models for deepfake audio detection, utilizing the computational advantages offered by Quantum processing. Quantum approaches were considered because Classical computers encounter significant computational challenges, particularly the extensive resources required for training deep learning models. However, the literature shows that Quantum models do not measure up to Classical models, owing to the use of fewer features and data points from the dataset, as well as the limitations of Quantum simulators. Therefore, to improve the performance of the Quantum models (QSVM and QNN), we propose a hybrid approach that leverages the strengths of both Classical and Quantum models. The results indicate that the hybrid model performs almost as well as the Classical 1D CNN model. Our analysis shows that Quantum systems have the potential to perform faster computations than Classical systems; however, this advantage remains constrained in practice by the limitations of current Quantum simulators. Deploying deepfake audio detection with a Quantum model effectively requires large datasets and improved Quantum simulators to prevent system crashes. With existing technology, our hybrid model demonstrates improved performance by combining Classical and Quantum Machine Learning techniques, but achieving optimal performance with purely Quantum models will require further development of Quantum simulators. In the future, we will explore the hybrid Classical-Quantum Model approach for other areas of deepfake detection, such as video deepfake detection.
REFERENCES

Aïmeur, E., Brassard, G., and Gambs, S. (2006). Machine learning in a quantum world. In Lamontagne, L. and Marchand, M., editors, Advances in Artificial Intelligence, pages 431-442, Berlin, Heidelberg. Springer Berlin Heidelberg.

Benedetti, M., Lloyd, E., Sack, S., and Fiorentini, M. (2019). Parameterized quantum circuits as machine learning models. Quantum Science and Technology, 4(4):043001.

Bergholm, V., Izaac, J., Schuld, M., Gogolin, C., Ahmed, S., et al. (2022). PennyLane: Automatic differentiation of hybrid quantum-classical computations.

Bird, J. J. and Lotfi, A. (2023). Real-time detection of AI-generated speech for deepfake voice conversion.

Chintha, A., Thai, B., et al. (2020). Recurrent convolutional structures for audio spoof and video deepfake detection. IEEE Journal of Selected Topics in Signal Processing, 14:1024-1037.

Dagar, D. and Vishwakarma, D. (2022). A literature review and perspectives in deepfakes: generation, detection, and applications. International Journal of Multimedia Information Retrieval, 11.

Doan, T.-P., Nguyen-Vu, L., Jung, S., and Hong, K. (2023). BTS-E: Audio deepfake detection using breathing-talking-silence encoder. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1-5.

Fernández-Combarro Álvarez, E. and González-Castillo, S. (2023). A Practical Guide to Quantum Machine Learning and Quantum Optimization. Packt, 1st edition.

Hamza, A., Javed, A. R., Iqbal, F., Kryvinska, N., Almadhor, A. S., Jalil, Z., and Borghol, R. (2022). Deepfake audio detection via MFCC features using machine learning. IEEE Access, 10:134018-134028.

Hennequin, R., Khlif, A., Voituret, F., and Moussallam, M. (2020). Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software, 5(50):2154.

Khochare, J., Joshi, C., Yenarkar, B., Suratkar, S., and Kazi, F. (2021). A deep learning framework for audio deepfake detection. Arabian Journal for Science and Engineering, 47.

Li, X., Li, K., Zheng, Y., Yan, C., Ji, X., et al. (2024). SafeEar: Content privacy-preserving audio deepfake detection.

McFee, B., Raffel, C., et al. (2015). librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, volume 8, pages 18-25.

Mcuba, M., Singh, A., Ikuesan, R. A., et al. (2023). The effect of deep learning methods on deepfake audio detection for digital investigation. Procedia Computer Science, 219:211-219. CENTERIS International Conference on ENTERprise Information Systems / ProjMAN International Conference on Project MANagement / HCist International Conference on Health and Social Care Information Systems and Technologies 2022.

Mishra, B. and Samanta, A. (2022). Quantum transfer learning approach for deepfake detection. Sparklinglight Transactions on Artificial Intelligence and Quantum Computing.

Mittal, H., Saraswat, M., Bansal, J. C., and Nagar, A. (2020). Fake-face image classification using improved quantum-inspired evolutionary-based feature selection method. In 2020 IEEE Symposium Series on Computational Intelligence (SSCI), pages 989-995.

Nguyen, T. T., Nguyen, Q. V. H., Nguyen, D. T., Nguyen, D. T., Huynh-The, T., Nahavandi, S., Nguyen, T. T., Pham, Q.-V., and Nguyen, C. M. (2022). Deep learning for deepfakes creation and detection: A survey. Computer Vision and Image Understanding, 223:103525.

Pandey, A. and Rudra, B. (2024). Deepfake audio detection using quantum learning models. In Proceedings of the IEEE Middle East Conference on Communications and Networking.

Pham, L., Lam, P., Nguyen, T., Nguyen, H., and Schindler, A. (2024). Deepfake audio detection using spectrogram-based feature and ensemble of deep learning models.

Saha, S., Sahidullah, M., and Das, S. (2024). Exploring green AI for audio deepfake detection.

Schuld, M. and Killoran, N. (2018). Quantum machine learning in feature Hilbert spaces. Physical Review Letters, 122(4):040504.

Schuld, M. and Petruccione, F. (2018). Supervised Learning with Quantum Computers. Springer Publishing Company, Incorporated, 1st edition.

The Wall Street Journal (2019). Fraudsters use AI to mimic CEO's voice in unusual cybercrime case. Accessed on May 28, 2024.

Wu, H., Chen, J., Du, R., Wu, C., He, K., Shang, X., Ren, H., and Xu, G. (2024). CLAD: Robust audio deepfake detection against manipulation attacks with contrastive learning.

Wu, X., He, R., et al. (2018). A light CNN for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security, 13(11):2884-2896.

Yi, J., Wang, C., Tao, J., Zhang, X., Zhang, C. Y., et al. (2023). Audio deepfake detection: A survey.

Zaman, K., Marchisio, A., Hanif, M. A., et al. (2024). A survey on quantum machine learning: Current trends, challenges, opportunities, and the road ahead.

Zhang, Y., Wang, W., et al. (2021). The effect of silence and dual-band fusion in anti-spoofing system. In Interspeech.