QMamba: Quantum Selective State Space Models for Text Generation
Gerhard Stenzel, Michael Kölle, Tobias Rohe, Maximilian Balthasar Mansky, Jonas Nüßlein and Thomas Gabor
LMU Munich, Munich, Germany
Keywords:
Quantum Machine Learning, Quantum Generative Models, State Space Models, Variational Quantum
Circuits, Quantum Computing, Sequence Modeling.
Abstract:
Quantum machine learning offers novel paradigms to address limitations in traditional natural language pro-
cessing models, such as fixed context lengths and computational inefficiencies. In this work, we propose
QMamba, the first quantum adaptation of the Mamba architecture, integrating selective state space models
with quantum computation for efficient and scalable text generation. QMamba leverages quantum principles
like superposition and entanglement to enable unbounded context sizes and reduced computational complex-
ity. Our contributions include the development of a quantum generative model optimized for hardware con-
straints, advancements in encoding, embedding, and measurement techniques, and the demonstration of its
performance on pattern reproduction and context-challenging tasks like "Needle in a Haystack". Experimen-
tal results confirm QMamba’s potential to maintain high efficiency and performance across varying sequence
lengths, laying the groundwork for future explorations in quantum-enhanced natural language processing.
1 INTRODUCTION
The rapid advancements in large language models
(LLMs) have revolutionized natural language pro-
cessing (NLP). However, their current reliance on
transformer architectures comes with inherent limi-
tations, such as fixed context lengths, high compu-
tational complexity, and inefficiencies in capturing
long-range dependencies. These limitations create
a bottleneck, particularly for tasks requiring exten-
sive sequence analysis and efficient processing. State
space models (SSMs) have emerged as an alternative
architecture, offering linear computational complex-
ity and better scalability for long sequences. How-
ever, their inability to focus on contextually relevant
input significantly limits their performance on ad-
vanced tasks. To address these shortcomings, archi-
tectures like Mamba have introduced selective mech-
anisms, improving context handling while maintain-
ing efficiency.
Quantum machine learning (QML) introduces a
novel paradigm capable of addressing these compu-
tational bottlenecks. By leveraging quantum princi-
ples like superposition and entanglement, QML of-
fers the potential to reduce computational overhead
and enhance model performance. Yet, despite this
potential, QML in NLP remains underexplored, and
existing quantum generative models often suffer from
limited context sizes and scalability issues.
This work proposes QMamba, the first quantum
adaptation of the Mamba architecture, to bridge these
gaps. By integrating the strengths of state space
models with quantum computation, QMamba aims to
overcome the context and efficiency limitations of ex-
isting architectures, enabling scalable and effective
text generation. This innovation not only demon-
strates the viability of quantum approaches in NLP
but also establishes a foundation for future research
in quantum generative models.
In section 2, we review related work, cover-
ing transformers, state space models (SSMs), the
Mamba architecture, and quantum machine learn-
ing. Section 3 introduces our novel Quantum Mamba
(QMamba) architecture, detailing its encoding and
embedding (Section 3.1), measuring (Section 3.2),
and circuit architecture (Section 3.3). Section 4 out-
lines our experimental setup and evaluation tasks. We
include a simple pattern reproduction test for com-
pression capabilities and the "Needle in a Haystack"
challenge to assess handling of unbounded context
sizes. Finally, Sections 5 and 6 discuss results and
propose directions for future research.
This paper makes the following contributions:
- We present the theoretical foundations of quantum state space models, providing a comprehensive framework for understanding their principles and applications.
- We propose efficient methods for token encoding and embedding, leveraging quantum properties to enhance computational efficiency.
- We introduce novel techniques for token decoding and measuring for NLP tasks on hardware-constrained systems, optimizing the performance of quantum generative models.
- We develop a new quantum generative model, QMamba, which integrates quantum computing with state space models, paving the way for more efficient and scalable quantum text generation models.
2 RELATED WORK
2.1 Transformers
The previously rather stale field of models generating
human-like language has seen a massive boost in per-
formance with the introduction of the attention mech-
anism (Bahdanau et al., 2016). This allowed models to focus on the most relevant parts of the input sequence, enabling a much better understanding of context. Building on this, transformers using
self-attention mechanisms (Vaswani et al., 2017) like
BERT (Devlin et al., 2019) and ChatGPT (OpenAI,
2024) have become the state-of-the-art in many NLP
tasks (Brown et al., 2020; Touvron et al., 2023) and
even some multi-modal tasks (Llama Team, 2024;
OpenAI, 2024; Gemini Team, 2024; Qwen Team,
2024). As with most models, the solution quality
of transformers roughly scales with their size (Ka-
plan et al., 2020). The main bottleneck of transform-
ers is their context length, which limits the maximum
amount of information that the model can use to gen-
erate the next token (Dao et al., 2022). Even though
there are numerous ways to improve the efficiency of
transformers using better attention mechanisms (Dao
et al., 2022; Dao, 2023; Llama Team, 2024; Gemini
Team, 2024; Yang et al., 2024; Tay et al., 2020), the
search for other model architectures is still ongoing.
2.2 State Space Models
State space models (SSMs) are an end-to-end neural network architecture with foundations in statistics that is also usable for sequence generation tasks. An SSM consists of two parts, the current input x_t (with t being the current time step) and the current state h_t, plus three parameters A, B, C that define the model. State space models can be interpreted as continuous functions, with the state being computed as h_t = A h_t + B x_t and the output as x_{t+1} = C h_t (we use a slightly differing notation compared to (Gu et al., 2021; Gu et al., 2022) to keep consistency with the rest of this paper). The continuous formulation is however not usable for training or inference of the model, as it is slow to compute (due to differential equations) and not parallelizable (Gu et al., 2021; Gu et al., 2022).

They can however be discretized using discretizing functions like zero-order hold, expanding the model to four main parameters Ā, B̄, C, Δ (equation (1)). This allows for significantly faster inference, as the model can be executed in a recurrent manner, with the state being updated at each time step, while maintaining an O(L) computational complexity (with L being the length of the input sequence). This makes SSMs a very efficient model for long sequences, compared to the O(L²) complexity of transformers. SSMs can be trained very effectively by precomputing multiple steps of the model and then using a convolution operation to compute the output x_{t+1} = C Ā^t B̄ x_0 + C Ā^{t-1} B̄ x_1 + ··· + C Ā B̄ x_{t-1} + C B̄ x_t, as the parameters Ā, B̄, C, Δ are not time-dependent (described as Linear Time Invariance in (Gu et al., 2021)). The convolutional interpretation allows for parallelizable initial training. SSMs however lack performance on advanced tasks, as they are not able to focus on the most relevant parts of the input sequence like transformers (and other attention-based models) can. This has led to the development of several derivative models, like Mamba (Gu and Dao, 2024). (Gu et al., 2021; Gu et al., 2022; Gu and Dao, 2024)
    Ā = exp(ΔA)
    B̄ = (ΔA)^{-1} (Ā - I) · ΔB
    h_{t+1} = Ā h_t + B̄ x_t
    x_{t+1} = C h_{t+1}                                  (1)
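To make the discretization and the recurrent execution concrete, the following NumPy sketch applies a zero-order-hold discretization in the sense of equation (1) and runs the resulting recurrence. It is an illustrative example with assumed shapes and random test data, not the implementation used in this work.

```python
import numpy as np
from scipy.linalg import expm

def discretize_zoh(A, B, delta):
    """Zero-order-hold discretization in the sense of equation (1):
    A_bar = exp(delta*A),  B_bar = (delta*A)^-1 (A_bar - I) delta*B."""
    dA = delta * A
    A_bar = expm(dA)
    B_bar = np.linalg.inv(dA) @ (A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

def ssm_recurrence(A_bar, B_bar, C, xs):
    """Run the discrete SSM recurrently: h_{t+1} = A_bar h_t + B_bar x_t,
    output_{t+1} = C h_{t+1}; one state update per step, hence O(L) overall."""
    h = np.zeros(A_bar.shape[0])
    outputs = []
    for x in xs:
        h = A_bar @ h + B_bar[:, 0] * x   # scalar input x per time step
        outputs.append(C @ h)
    return np.array(outputs)

# toy usage with assumed sizes: state dimension N = 4, sequence length L = 8
rng = np.random.default_rng(0)
A = -np.eye(4) + 0.1 * rng.standard_normal((4, 4))   # a roughly stable state matrix
B = rng.standard_normal((4, 1))
C = rng.standard_normal((1, 4))
A_bar, B_bar = discretize_zoh(A, B, delta=0.1)
print(ssm_recurrence(A_bar, B_bar, C, rng.standard_normal(8)).shape)  # (8, 1)
```

Each step touches the state exactly once, which is where the O(L) cost over a length-L sequence comes from.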
2.3 Mamba
Mamba (Gu and Dao, 2024) builds upon the SSM architecture, expanding it to a selective state space model (SSSM). Instead of using the same parameters for all input tokens, selective state space models allow the model to influence the B, C and Δ parameters (see figure 1; A does not have to be selective, as it is directly influenced by Δ). This is achieved by creating a projection of the current input token into a high-dimensional space and then supplying this additional information to the model. Δ, which controls the amount of influence of the previous state on the current state, is now additionally influenced by the input token. This allows the model to focus on the most relevant parts of the input sequence and discard less relevant parts, making it significantly more efficient than the original SSM. As this violates the Linear Time Invariance property of SSMs, the model cannot be parallelized using the convolution method described above. Instead, the paper proposes other hardware-efficient methods to train the model, like optimizing which parts of the model should be stored in the GPU's high-bandwidth memory (large but slower to access) and which in its fast but small on-chip memory. In combination with the parallel associative scan technique (Harris et al., 2007), this allows for training and inference efficiencies comparable to or even better than transformers. (Gu and Dao, 2024)
Figure 1: Structured State Space Models have four groups
of parameters, marked in green. The selection mechanism
is marked in blue. Activations are not shown for clarity.
Figure modified from Mamba (Gu and Dao, 2024).
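For intuition, a minimal NumPy sketch of one selective step in the spirit of this mechanism: Δ, B, and C are derived from the current input, and Δ sets the step size of the discretization. The scalar input channel, the softplus parameterization, and the projection shapes are simplifications chosen for the example, not Mamba's exact parameterization.

```python
import numpy as np
from scipy.linalg import expm

def selective_step(h, x, A, w_delta, W_B, W_C):
    """One selective state space step for a scalar input x: Delta, B and C
    are made input-dependent, while A stays fixed (it is only affected
    through Delta, cf. figure 1). All projections here are toy assumptions."""
    delta = np.log1p(np.exp(w_delta * x))        # softplus keeps the step size positive
    B = W_B * x                                   # input-dependent input projection, shape (N,)
    C = W_C * x                                   # input-dependent output projection, shape (N,)
    dA = delta * A
    A_bar = expm(dA)
    B_bar = np.linalg.inv(dA) @ (A_bar - np.eye(len(h))) @ (delta * B)
    h_new = A_bar @ h + B_bar * x                 # small delta ~ keep state, large delta ~ absorb input
    return h_new, C @ h_new

N = 4
rng = np.random.default_rng(1)
A = -np.diag(np.arange(1.0, N + 1.0))             # a simple stable, diagonal state matrix
h = np.zeros(N)
for x in [0.3, 1.2, -0.7]:                        # toy scalar input sequence
    h, y = selective_step(h, x, A, w_delta=0.5,
                          W_B=rng.standard_normal(N), W_C=rng.standard_normal(N))
print(round(float(y), 3))
```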
MambaByte builds upon Mamba, but skips the tokenization of the inputs and outputs and instead operates on the byte level. Such an approach is infeasible for transformer-based models, as it would waste their precious context size, but is feasible on the Mamba architecture, as the model can use the selection mechanism to discard irrelevant parts efficiently (Wang et al., 2024). RWKV is another novel approach, combining concepts of transformers (using a self-attention mechanism) and Mamba (using a selective state space model), allowing for unbounded context size in a very efficient manner (Peng et al., 2023; Peng et al., 2024).
2.4 Quantum Machine Learning
Quantum machine learning (QML) harnesses quan-
tum computing to address the increasing computa-
tional demands of traditional machine learning algo-
rithms (Biamonte et al., 2017; Gabor et al., 2020;
Rohe et al., 2024). Central to QML, variational
quantum circuits (VQCs) serve as quantum analogs
of classical neural networks, functioning as effec-
tive function approximators. VQCs apply parameter-
ized unitary gates to qubits (Barenco et al., 1995),
leveraging quantum phenomena such as superposi-
tion, entanglement, and interference. These param-
eters, typically rotation angles, are optimized using
standard machine learning techniques. A VQC ar-
chitecture consists of three key components: input
state encoding, a trainable quantum circuit, and mea-
surement output decoding. Notably, VQCs can efficiently process high-dimensional inputs with only log₂(N) qubits (Lloyd et al., 2013). However, they currently face challenges like substantial qubit overhead and high error rates on real quantum devices in the Noisy Intermediate-Scale Quantum (NISQ) era (Preskill, 2018). Nonetheless, ongoing advancements are expected to strengthen QML's essential role in the future (Preskill, 2018; Gabor et al., 2020; Kölle et al., 2024).
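As a generic illustration of these three components (input encoding, trainable circuit, measurement), the following sketch builds a small VQC in PennyLane. PennyLane is used here only because its API is widely known; the circuit layout and sizes are arbitrary example choices, not an architecture from this paper.

```python
import pennylane as qml
from pennylane import numpy as pnp

n_qubits = 3
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def vqc(inputs, weights):
    # 1) input state encoding: angle-encode the classical features
    for w in range(n_qubits):
        qml.RX(inputs[w], wires=w)
    # 2) trainable quantum circuit: parameterized rotations plus entangling CNOTs
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    # 3) measurement / output decoding: one expectation value per qubit
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

shape = qml.StronglyEntanglingLayers.shape(n_layers=2, n_wires=n_qubits)
weights = pnp.random.uniform(0, 2 * pnp.pi, size=shape)
print(vqc(pnp.array([0.1, 0.5, 0.9]), weights))
```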
2.5 Quantum Generative Models
For the topic of quantum-machine-learning-based text generation models, the focus has so far been on quantum transformers.
The quantum transformer proposed by (Di Sipio et al., 2022) performs sentiment analysis by replacing each of the well-known attention mechanism's parameters with quantum gates, similar to the quantum vision transformer used for quark–gluon classification (Comajoan Cara et al., 2024). Quixer (Khatri et al., 2024) is a quantum transformer capable of text generation, albeit with a very limited context size of only 32 tokens on the Penn Tree Bank dataset (Marcus et al., 1993).
To the best of our knowledge, this paper is the first
to propose porting the Mamba architecture to quan-
tum machine learning.
3 QUANTUM MAMBA
3.1 Encoding and Embedding
All input tokens are encoded into a unique combination of rotations across q_in qubits, utilizing k possible rotation angles. These angles are evenly distributed from 0 (inclusive) to 2π (exclusive), forming the sequence Θ as defined in equation (2). Subsequently, the tokens are embedded using rotational gates such as R_x, R_y, or R_z. This method allows for the encoding of up to k^q_in different tokens within a circuit comprising q_in qubits. When k = 2 and the R_x gate is used for embedding, this corresponds to basis embedding. For higher values of k, this approach facilitates significant input compression. Obvious limitations include the increasing difficulty for the model to distinguish between tokens as the number of possible rotation angles k grows; therefore, a trade-off between the number of qubits q_in (and thus execution speed) and the number of rotation angles k must be made. Additionally, when executing on actual quantum hardware in the NISQ era, a high number of angles k is negatively impacted by precision errors both during encoding and measurement.

    Θ = { 2πn/k : n ∈ {0, 1, 2, . . . , k - 1} }                                  (2)
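The following sketch reflects our reading of this encoding: a token index is decomposed into base-k digits over q_in qubits, and each digit is mapped to an angle from Θ. The digit ordering and the helper names are assumptions made for illustration.

```python
import numpy as np

def token_to_angles(token_id: int, q_in: int, k: int) -> np.ndarray:
    """Encode a token index as q_in rotation angles drawn from
    Theta = {2*pi*n/k : n = 0, ..., k-1} (equation (2)).
    Supports up to k**q_in distinct tokens; the digit order is a convention."""
    assert 0 <= token_id < k ** q_in, "token index out of range"
    digits = []
    for _ in range(q_in):                 # base-k decomposition, least significant digit first
        digits.append(token_id % k)
        token_id //= k
    return 2 * np.pi * np.array(digits) / k

# k = 2 with an R_x embedding reduces to basis embedding: angles are 0 or pi
print(token_to_angles(5, q_in=3, k=2) / np.pi)   # [1. 0. 1.]  ->  |101> up to qubit ordering
```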
The work by (Comajoan Cara et al., 2024) em-
ploys patch angle embedding (thus hindering paral-
lelism), whereas (Khatri et al., 2024) utilizes clas-
sical tokenization and classical weights to create in-
put embeddings (which limits the potential benefits of
quantum models by requiring classical pre- and post-
processing).
3.2 Measuring
The output is processed in a manner analogous to the inputs: the probability p of each qubit being in the state |1⟩ is measured in a predefined basis. Subsequently, the sequence δ (not to be confused with Δ, which is a sub-circuit of Mamba) is computed as defined in equation (3), which enumerates the discrepancies between the predicted and target values. The index corresponding to the smallest error is identified as the predicted output.
The overall confidence of an output token is quantified by summing the discrepancies across all q_out qubits. This measurement allows for the introduction of a temperature parameter τ, which facilitates controlled randomness in the output by scaling the confidence scores of the output tokens. Specifically, when τ = 0, the model deterministically selects the most confident token.

    δ = { | n/(k - 1) - p | : n ∈ {0, 1, 2, . . . , k - 1} }                      (3)
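A sketch of this decoding step under our reading of equation (3): per-qubit discrepancies are computed against the k candidate levels, summed into per-token errors, and turned into a sampling distribution via τ. The softmax over negative errors used for τ > 0 is our assumption, not a formula stated in the paper.

```python
import numpy as np

def decode_digit(p: float, k: int) -> int:
    """Equation (3): discrepancies between the measured |1> probability p and
    the k candidate levels n/(k-1); the smallest discrepancy gives the digit."""
    delta = np.abs(np.arange(k) / (k - 1) - p)
    return int(np.argmin(delta))

def sample_token(qubit_probs, k: int, tau: float, rng=None) -> int:
    """Sum per-qubit discrepancies into per-token errors; tau = 0 returns the
    lowest-error token deterministically, tau > 0 samples non-deterministically
    (here via a softmax over negative errors, an assumption for this sketch)."""
    q_out = len(qubit_probs)
    errors = np.zeros(k ** q_out)
    for token in range(k ** q_out):
        digits = [(token // k ** j) % k for j in range(q_out)]   # base-k digits of the token
        errors[token] = sum(abs(d / (k - 1) - p) for d, p in zip(digits, qubit_probs))
    if tau == 0:
        return int(np.argmin(errors))
    rng = rng or np.random.default_rng()
    weights = np.exp(-errors / tau)
    return int(rng.choice(len(errors), p=weights / weights.sum()))

print(decode_digit(0.93, k=2))                           # 1
print(sample_token(np.array([0.05, 0.95]), k=2, tau=0))  # 2: digits (0, 1) across the two qubits
```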
3.3 Circuit Architecture
The naive quantum implementation of the Mamba architecture, depicted in figure 2, utilizes e + k + m + s qubits. Here, e represents the number of qubits required for embedding the information, m (approximately equal to e) for intermediate processing, k for extracting the most relevant information from the input (also approximately equal to e), and s for storing the entire state of the model. Consequently, this configuration results in a large circuit with roughly 3e + s qubits. We define h_0 to be |0⟩. The Δ circuit is executed first. It is parameterized with x_t using re-uploading. This is comparable to the Δ matrices in the non-quantum Mamba (Gu and Dao, 2024), which predict the impact of the current token x_t on the state. This impact is then used as a weight between the A and B sub-circuits. By using this trade-off, the model can learn to distinguish between relevant and less relevant features. The non-quantum Mamba draws inspiration from the discretization of a continuous signal with an adaptive parameter Δ choosing the step size. For the inherently continuous quantum model, this is achieved by interpreting the A and B gates as controlled by the result of the Δ circuit. Notably, their control basis is complementary, with one activating when the other is not. This, over the course of multiple discrete shots or a full quantum state simulation, allows the model to learn the trade-off between the A and B sub-circuits. The A circuit stretches over multiple qubit groups, spanning the qubits holding the last state |h_t⟩, the result of the input group B and the intermediate group m. Just like in the classical Mamba, the A circuit does not contain re-uploading of the input token, as the input tokens are already encoded into B. The C sub-circuit (parameterized with re-uploading just as Δ and B) transforms the output of A and is followed by a measurement. Out of the e + k + m + s qubits, only m get measured, k + e get reset, and s holds the state for the next Mamba block.
Figure 2: The naive QMamba circuit with the A, B, C, Δ sub-circuits requires e + k + m + s qubits. The A and B sub-circuits are controlled by the result of the Δ sub-circuit.
The circuit efficiency can be improved by reusing
the qubits of the C sub-circuit, thus reducing the num-
ber of required qubits to 2k + s (depicted in figure 3).
Even further improvements can be achieved by executing the Δ operation on a separate circuit and inputting its result using an R_x gate. This optimization reduces the qubit requirement to k + s + 1. A and B are controlled and anti-controlled as before.
Figure 3: A more efficient QMamba circuit, using only 2k +
s qubits.
We have implemented our model using the Qandle (Stenzel et al., 2024) framework. The trainable parameters are packed into strongly entangling layers of the sub-circuits as R_y R_z R_y combinations followed by a CNOT gate. The individual model configurations are listed in the appendix.
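To convey the structure of one QMamba block, the following rough sketch uses PennyLane (rather than Qandle, whose exact API we do not reproduce here) to wire a Δ sub-circuit whose qubit anti-controls B and controls A, followed by a C sub-circuit and a measurement of the input register. Register sizes, the choice of control qubit, the layer shapes, and the wiring details are assumptions for illustration and do not reproduce the exact circuits of figures 2 and 3.

```python
import numpy as np
import pennylane as qml

k, s = 2, 2                                   # assumed register sizes: k input qubits, s state qubits
inp = list(range(k))                          # input/working register (reused by Delta, the controls and C)
state = list(range(k, k + s))                 # state register carrying h_t between blocks
dev = qml.device("default.qubit", wires=k + s)

def sub_circuit(weights, wires, x=None):
    """A trainable sub-circuit: optional re-uploading of the token angles,
    followed by strongly entangling layers (parameterized rotations + CNOTs)."""
    if x is not None:
        qml.AngleEmbedding(x, wires=wires, rotation="X")
    qml.StronglyEntanglingLayers(weights, wires=wires)

@qml.qnode(dev)
def qmamba_block(x_t, w_delta, w_B, w_A, w_C):
    # Delta sub-circuit on the input register, parameterized with x_t (re-uploading)
    sub_circuit(w_delta, inp, x=x_t)
    # Complementary controls on one Delta qubit realize the weighted trade-off:
    # B (anti-controlled) writes the current token into the state register,
    # A (controlled) propagates the previous state h_t.
    qml.ctrl(sub_circuit, control=inp[0], control_values=[0])(w_B, state, x=x_t)
    qml.ctrl(sub_circuit, control=inp[0], control_values=[1])(w_A, state)
    # C sub-circuit (re-uploading, like Delta and B) prepares the output, which is measured
    sub_circuit(w_C, inp, x=x_t)
    return qml.probs(wires=inp)               # the state register keeps h_{t+1} for the next block

shape = qml.StronglyEntanglingLayers.shape(n_layers=1, n_wires=2)
weights = [np.random.uniform(0, np.pi, size=shape) for _ in range(4)]
print(qmamba_block(np.array([0.0, np.pi]), *weights))
```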
4 EXPERIMENTS
We want to evaluate our QMamba model on two main tasks: text generation based on easy-to-learn patterns and the "Needle in a Haystack" task. For pattern reproduction, the intent is to learn a simple pattern, testing the model's capability to compress several tokens into as few qubits as possible, using a relatively small context size. The "Needle in a Haystack" task is an advanced challenge to test the "lost in the middle" problem (Liu et al., 2023). It has been shown that large language models boasting very high potential context sizes cannot utilize all of it, causing the models to ignore parts of the context, usually snippets located in the middle (Hsieh et al., 2024). The Mamba architecture has a theoretically unbounded context size due to its selection mechanism, allowing it to keep relevant parts in the context indefinitely (Gu and Dao, 2024). We want to test whether the quantum version of Mamba can keep up with this promise.
The loss is computed as the mean squared error
between the predicted and target values, attempting
to minimize the discretization error δ (as defined in
equation (3)) for each token. The different tasks have
been evaluated on separately trained models.
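A minimal sketch of this loss under our reading: the measured |1⟩ probabilities are compared against the target token's per-qubit levels (the same levels that appear in equation (3)). The target construction is an assumption for illustration.

```python
import numpy as np

def mse_loss(qubit_probs, target_digits, k: int) -> float:
    """Mean squared error between the measured |1> probabilities and the
    target token's per-qubit levels n/(k-1), cf. equation (3)."""
    targets = np.asarray(target_digits) / (k - 1)
    return float(np.mean((np.asarray(qubit_probs) - targets) ** 2))

print(mse_loss([0.1, 0.9], [0, 1], k=2))   # ~0.01 for a nearly correct prediction
```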
4.1 Datasets
For the pattern reproduction task, we create a simple synthetic dataset containing the words "zero", "one", . . . , "nine". The words are always presented in the same looping order. As the model operates on the character level, the task can be trivially described as predicting the next logical character in the sequence. During training, the model is presented with a random starting character, and the task is to predict the next character in the sequence. Independent of the produced character, the model is always presented with the correct next character of the sequence in parallel. As the dataset is trivial to test, we can predict the error rate of the model per token, e.g., the model should always predict "e" after "z" (from "zero") and a space after "x" ("six").
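A possible way to generate such a looping character-level dataset is sketched below; the sequence length and offsets are arbitrary choices for the example.

```python
import itertools

WORDS = ["zero", "one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine"]
STREAM = " ".join(WORDS) + " "            # "zero one two ... nine " looping forever

def pattern_sample(start: int, length: int) -> tuple[str, str]:
    """Return (input characters, next-character targets) starting at an
    offset into the looping stream, for character-level next-token prediction."""
    chars = list(itertools.islice(itertools.cycle(STREAM), start, start + length + 1))
    return "".join(chars[:-1]), "".join(chars[1:])

x, y = pattern_sample(start=3, length=10)
print(repr(x), "->", repr(y))             # 'o one two ' -> ' one two t'
```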
The "Needle in a Haystack" task uses a different setup. We pick n different tokens. Each sequence consists of a fixed number of tokens (the training context size), with one of the tokens being replaced with a globally fixed pointer token. The task for the model is to output the token directly following the pointer token. During testing, sequences of greater length are generated and primed with an indicator token. All intermediate outputs are discarded, as the only relevant output follows after inputting the entire test string. The model is evaluated on the accuracy of the output token, depending on the test context size and the pointer position, ignoring all other outputs.
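The following sketch shows one way to generate such sequences as we understand the setup; the token vocabulary, the pointer token id, and the sizes are placeholders, not the configuration used in the experiments.

```python
import random

POINTER = 0                                    # globally fixed pointer token (placeholder id)

def haystack_sample(n_tokens: int, length: int, seed=None):
    """Build a sequence of random filler tokens with one pointer token inserted;
    the target is the token that directly follows the pointer (the "needle")."""
    rng = random.Random(seed)
    fillers = list(range(1, n_tokens))         # tokens 1..n-1 act as filler
    seq = [rng.choice(fillers) for _ in range(length)]
    pos = rng.randrange(length - 1)            # leave room for a token after the pointer
    seq[pos] = POINTER
    return seq, seq[pos + 1]

seq, needle = haystack_sample(n_tokens=8, length=10, seed=42)
print(seq, "->", needle)
```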
4.2 Results
Figure 4: The QMamba model produces high probabilities
for the correct tokens in the pattern reproduction task. The
most likely token is listed in the first row, less likely to-
kens listed below. The color indicates the probability of the
token. For a τ > 0, non-maximum tokens can be chosen
non-deterministically, weighted by their probability.
On the pattern reproduction task, our QMamba models show strong performance. Analyzing the predictions of the model in figure 4, we see the expected behavior: as the beginning of the training strings has been chosen at random, the model is initially unsure which token to produce, outputting a mixed encoding containing multiple possible tokens with near-equal confidence (we measure the error from the continuous model outputs to the individual discrete token encodings as the confidence in each token). As we have selected τ = 0, we deterministically choose the most likely token and feed it back into the model for the next iteration. With a rising number of iterations and therefore growing context size, the model becomes more confident in outputting the correct tokens.
Figure 5: QMamba’s confusion matrix for the pattern re-
production task shows very low error rates. Dark colors
indicate a higher probability.
The strong quality is underlined by the confusion matrix in figure 5, showing a very strong diagonal distribution for single-token continuation. When tasked with predicting only the next token, the correct token is chosen with a very high probability. The few outliers fluctuate from run to run while remaining at a very acceptable level.
On the "Needle in a Haystack" task (Hsieh et al., 2024), our QMamba performs very well (evaluation in figure 6). On low document lengths like 10 or 20 tokens (10 was the training context size), the models manage to retrieve close to all target tokens. As the document length increases, the accuracy decreases, but reaches a stable level at around 65% accuracy for a document length of 100 tokens (ten times the training context length). The retrieval quality for needles in the last 20% of the document does not deteriorate, showing that the model does not suffer from a polluted previous context. The stable (albeit lowered) accuracy on long contexts for needles both at the beginning and at the end of the document shows that the model does not suffer from the "lost in the middle" problem, as the difficulty stems mostly from the increased noise of tokens after the needle rather than from the structural learning problem seen in many classical transformer-based models (Liu et al., 2023).
5 DISCUSSION
We have introduced the theoretical foundations of quantum state space models, allowing our quantum model to repeatedly decide, based on the current token, how much the state may be updated, which enables theoretically unbounded context sizes. We have experimentally shown the effectiveness of our token encoding, embedding, and measurement strategies and have laid out the path for non-deterministic sampling. The model has shown impressively low error rates, with a very strong diagonal in the confusion matrix. From the very promising results on the "Needle in a Haystack" task (on which the classical Mamba models and the RWKV family have already shown impressive results despite huge context lengths (Peng et al., 2024)), we conclude that the strengths of selective state space models are not lost in the quantum translation. The models allow efficient training on smaller context sizes, losing only little performance when running on large context sizes, such as ten times their original size (as seen in figure 6). In summary, our QMamba model manages to reproduce the classical Mamba's impressive performance on current quantum simulators, showing the untapped potential of quantum machine learning in natural language processing.
6 FUTURE WORK
Our techniques for embedding and measuring natural language in quantum circuits have shown impressive results. We have, however, noticed a degradation in the results for more than four discrete embedding angles, with the model's abilities dropping to a minimum for more than five angles. In a four-qubit setup, this would still allow embedding a very impressive 4⁵ = 1024 different tokens, but a deeper analysis of the drop-off is needed.
As all models operated on synthetic datasets, bigger models trained on real-world datasets like the Penn Tree Bank (Marcus et al., 1993) or WikiText-103 (Merity et al., 2016) are needed to evaluate the approach on real-world tasks. The models should be evaluated on perplexity scores (Jurafsky and Martin, 2008) to compare them to classical models and to fully bridge the gap between quantum and classical text generation models. These bigger models should be evaluated with the same time budgets as comparable classical models to allow for a fair comparison in the NISQ era (Preskill, 2018).
In preparation for the post-NISQ era, the models should be tested for their suitability to run on large-scale, fault-tolerant quantum computers. As QMamba only builds upon basic quantum gates like R_x, R_y, controlled-NOT, and controlled groups, our models can be easily ported, allowing for even improved performance on real quantum hardware. Implementations using customized encodings in combination with non-standard circuit layouts could lead to further improvements.

Figure 6: The QMamba model performs well on the "Needle in a Haystack" task. Dark green indicates a perfect score, while dark red (not present in the figure) would indicate a total failure with no correct outputs. The score is averaged from 3 models, with 1000 test sequences each.
ACKNOWLEDGEMENTS
This paper was partially funded by the German Federal Ministry of Education and Research through the funding program "quantum technologies - from basic research to market" (contract number: 13N16196).
REFERENCES
Bahdanau, D., Cho, K., and Bengio, Y. (2016). Neural
Machine Translation by Jointly Learning to Align and
Translate.
Barenco, A., Bennett, C. H., Cleve, R., DiVincenzo, D. P.,
Margolus, N., Shor, P., Sleator, T., Smolin, J. A., and
Weinfurter, H. (1995). Elementary gates for quantum
computation. Physical Review A, 52(5):3457–3467.
Biamonte, J., Wittek, P., Pancotti, N., Rebentrost, P., Wiebe,
N., and Lloyd, S. (2017). Quantum machine learning.
Nature, 549(7671):195–202.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D.,
Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A., Agarwal, S., Herbert-Voss, A., Krueger,
G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.,
Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E.,
Litwin, M., Gray, S., Chess, B., Clark, J., Berner,
C., McCandlish, S., Radford, A., Sutskever, I., and
Amodei, D. (2020). Language Models are Few-Shot
Learners. In Larochelle, H., Ranzato, M., Hadsell, R.,
Balcan, M. F., and Lin, H., editors, Advances in Neu-
ral Information Processing Systems, volume 33, pages
1877–1901. Curran Associates, Inc.
Comajoan Cara, M., Dahale, G. R., Dong, Z., Forestano,
R. T., Gleyzer, S., Justice, D., Kong, K., Magorsch, T.,
Matchev, K. T., Matcheva, K., and Unlu, E. B. (2024).
Quantum Vision Transformers for Quark–Gluon Clas-
sification. Axioms, 13(5):323.
Dao, T. (2023). FlashAttention-2: Faster Attention with
Better Parallelism and Work Partitioning.
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding.
Di Sipio, R., Huang, J.-H., Chen, S. Y.-C., Mangini, S.,
and Worring, M. (2022). The Dawn of Quantum Nat-
ural Language Processing. In ICASSP 2022 - 2022
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pages 8612–8616.
IEEE.
Gabor, T., Sünkel, L., Ritz, F., Phan, T., Belzner, L., Roch, C., Feld, S., and Linnhoff-Popien, C. (2020). The Holy Grail of Quantum Artificial Intelligence: Major Challenges in Accelerating the Machine Learning Pipeline.
Gemini Team (2024). Gemini: A Family of Highly Capable
Multimodal Models.
Gu, A. and Dao, T. (2024). Mamba: Linear-Time Sequence
Modeling with Selective State Spaces.
Gu, A., Goel, K., and Ré, C. (2022). Efficiently Modeling Long Sequences with Structured State Spaces.
Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra, A., and Ré, C. (2021). Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers.
Harris, M., Sengupta, S., and Owens, J. (2007). Chapter 39. Parallel Prefix Sum (Scan) with CUDA. NVIDIA Developer.
Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D.,
Jia, F., Zhang, Y., and Ginsburg, B. (2024). RULER:
What’s the Real Context Size of Your Long-Context
Language Models?
Jurafsky, D. and Martin, J. (2008). Speech and Language
Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech
Recognition, volume 2.
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B.,
Chess, B., Child, R., Gray, S., Radford, A., Wu, J.,
and Amodei, D. (2020). Scaling Laws for Neural Lan-
guage Models.
Khatri, N., Matos, G., Coopmans, L., and Clark, S. (2024).
Quixer: A Quantum Transformer Model.
Kingma, D. P. and Ba, J. (2014). Adam: A Method for
Stochastic Optimization.
Kölle, M., Giovagnoli, A., Stein, J., Mansky, M. B., Hager, J., and Linnhoff-Popien, C. (2022). Improving Convergence for Quantum Variational Classifiers using Weight Re-Mapping.
Kölle, M., Stenzel, G., Stein, J., Zielinski, S., Ommer, B., and Linnhoff-Popien, C. (2024). Quantum Denoising Diffusion Models.
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua,
M., Petroni, F., and Liang, P. (2023). Lost in the Mid-
dle: How Language Models Use Long Contexts.
Llama Team (2024). Llama 3.2: Revolutionizing edge AI
and vision with open, customizable models.
Lloyd, S., Mohseni, M., and Rebentrost, P. (2013). Quan-
tum algorithms for supervised and unsupervised ma-
chine learning.
Marcus, M., Santorini, B., and Marcinkiewicz, M. A.
(1993). Building a Large Annotated Corpus of En-
glish: The Penn Treebank. Computational Linguis-
tics, 19(2):313–330.
Merity, S., Xiong, C., Bradbury, J., and Socher, R. (2016).
Pointer Sentinel Mixture Models.
OpenAI (2024). GPT-4.
Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcad-
inho, S., Biderman, S., Cao, H., Cheng, X., Chung,
M., Grella, M., GV, K. K., He, X., Hou, H., Lin, J.,
Kazienko, P., Kocon, J., Kong, J., Koptyra, B., Lau,
H., Mantri, K. S. I., Mom, F., Saito, A., Song, G.,
Tang, X., Wang, B., Wind, J. S., Wozniak, S., Zhang,
R., Zhang, Z., Zhao, Q., Zhou, P., Zhou, Q., Zhu, J.,
and Zhu, R.-J. (2023). RWKV: Reinventing RNNs for
the Transformer Era.
Peng, B., Goldstein, D., Anthony, Q., Albalak, A., Alcaide, E., Biderman, S., Cheah, E., Du, X., Ferdinan, T., Hou, H., Kazienko, P., GV, K. K., Kocoń, J., Koptyra, B., Krishna, S., au2, R. M. J., Lin, J., Muennighoff, N., Obeid, F., Saito, A., Song, G., Tu, H., Wirawan, C., Woźniak, S., Zhang, R., Zhao, B., Zhao, Q., Zhou, P., Zhu, J., and Zhu, R.-J. (2024). Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence.
Preskill, J. (2018). Quantum Computing in the NISQ era
and beyond.
Qwen Team (2024). Qwen2.5: A Party of Foundation Mod-
els.
Rohe, T., Grätz, S., Kölle, M., Zielinski, S., Stein, J., and Linnhoff-Popien, C. (2024). From Problem to Solution: A general Pipeline to Solve Optimisation Problems on Quantum Hardware.
Stenzel, G., Zielinski, S., Kölle, M., Altmann, P., Nüßlein, J., and Gabor, T. (2024). Qandle: Accelerating State Vector Simulation Using Gate-Matrix Caching and Circuit Splitting.
Tay, Y., Dehghani, M., Abnar, S., Shen, Y., Bahri, D., Pham,
P., Rao, J., Yang, L., Ruder, S., and Metzler, D. (2020).
Long Range Arena: A Benchmark for Efficient Trans-
formers.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. (2023). LLaMA: Open and Efficient Foundation Language Models.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems (NIPS 2017).
Wang, J., Gangavarapu, T., Yan, J. N., and Rush, A. M.
(2024). MambaByte: Token-free Selective State
Space Model.
Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou,
C., Li, C., Li, C., Liu, D., Huang, F., and others
(2024). Qwen2 technical report. arXiv preprint
arXiv:2407.10671.
APPENDIX
Hyperparameters
Hyperparameters have been determined by using a constrained grid search over learning rate, optimizer, and number of parameters in a constrained time frame per run. All models used quantum weight re-mapping for limiting parameter ranges (Kölle et al., 2022) and the Adam optimizer (Kingma and Ba, 2014). We used a parameter distribution with a 2:1:1:1 ratio, spread over the A, B, C, and Δ sub-circuits (picked after preliminary trials). For the haystack task, we chose a learning rate of 2e-3 and around 1000 parameters in total.
Encoding and Decoding
The mapping between angle representation and in-
put/output tokens is distinct for each token set. We
enumerate both the token set and the possible angle
combinations and naively match them up. More ad-
vanced approaches could include a statistical analy-
sis of frequent patterns in the training data, matching
them up with more similar encodings. This would
essentially imply mapping the encoding space to a q-
dimensional space, with the Hamming distance be-
tween two angle embeddings correlating with the fre-
quency of consecutive tokens. This would enable the
model to generate specific patterns more easily, im-
proving short-term performance. We did not pursue
this idea further, as we worried about a possible lack
of output diversity. However, we are confident that in-
vestigating this idea further could lead to a significant
improvement in the model’s performance.