
Gu, A., Goel, K., and Ré, C. (2022). Efficiently Modeling Long Sequences with Structured State Spaces.
Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra, A., and Ré, C. (2021). Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers.
Harris, M., Sengupta, S., and Owens, J. (2007). Parallel Prefix Sum (Scan) with CUDA. GPU Gems 3, Chapter 39. NVIDIA Developer.
Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D.,
Jia, F., Zhang, Y., and Ginsburg, B. (2024). RULER:
What’s the Real Context Size of Your Long-Context
Language Models?
Jurafsky, D. and Martin, J. (2008). Speech and Language
Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech
Recognition, volume 2.
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B.,
Chess, B., Child, R., Gray, S., Radford, A., Wu, J.,
and Amodei, D. (2020). Scaling Laws for Neural Lan-
guage Models.
Khatri, N., Matos, G., Coopmans, L., and Clark, S. (2024).
Quixer: A Quantum Transformer Model.
Kingma, D. P. and Ba, J. (2014). Adam: A Method for
Stochastic Optimization.
Kölle, M., Giovagnoli, A., Stein, J., Mansky, M. B., Hager, J., and Linnhoff-Popien, C. (2022). Improving Convergence for Quantum Variational Classifiers using Weight Re-Mapping.
Kölle, M., Stenzel, G., Stein, J., Zielinski, S., Ommer, B., and Linnhoff-Popien, C. (2024). Quantum Denoising Diffusion Models.
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua,
M., Petroni, F., and Liang, P. (2023). Lost in the Mid-
dle: How Language Models Use Long Contexts.
Llama Team (2024). Llama 3.2: Revolutionizing edge AI
and vision with open, customizable models.
Lloyd, S., Mohseni, M., and Rebentrost, P. (2013). Quan-
tum algorithms for supervised and unsupervised ma-
chine learning.
Marcus, M., Santorini, B., and Marcinkiewicz, M. A.
(1993). Building a Large Annotated Corpus of En-
glish: The Penn Treebank. Computational Linguis-
tics, 19(2):313–330.
Merity, S., Xiong, C., Bradbury, J., and Socher, R. (2016).
Pointer Sentinel Mixture Models.
OpenAI (2024). GPT-4.
Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcad-
inho, S., Biderman, S., Cao, H., Cheng, X., Chung,
M., Grella, M., GV, K. K., He, X., Hou, H., Lin, J.,
Kazienko, P., Kocon, J., Kong, J., Koptyra, B., Lau,
H., Mantri, K. S. I., Mom, F., Saito, A., Song, G.,
Tang, X., Wang, B., Wind, J. S., Wozniak, S., Zhang,
R., Zhang, Z., Zhao, Q., Zhou, P., Zhou, Q., Zhu, J.,
and Zhu, R.-J. (2023). RWKV: Reinventing RNNs for
the Transformer Era.
Peng, B., Goldstein, D., Anthony, Q., Albalak, A., Alcaide, E., Biderman, S., Cheah, E., Du, X., Ferdinan, T., Hou, H., Kazienko, P., GV, K. K., Kocoń, J., Koptyra, B., Krishna, S., McClelland Jr., R., Lin, J., Muennighoff, N., Obeid, F., Saito, A., Song, G., Tu, H., Wirawan, C., Woźniak, S., Zhang, R., Zhao, B., Zhao, Q., Zhou, P., Zhu, J., and Zhu, R.-J. (2024). Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence.
Preskill, J. (2018). Quantum Computing in the NISQ era
and beyond.
Qwen Team (2024). Qwen2.5: A Party of Foundation Mod-
els.
Rohe, T., Grätz, S., Kölle, M., Zielinski, S., Stein, J., and Linnhoff-Popien, C. (2024). From Problem to Solution: A general Pipeline to Solve Optimisation Problems on Quantum Hardware.
Stenzel, G., Zielinski, S., Kölle, M., Altmann, P., Nüßlein, J., and Gabor, T. (2024). Qandle: Accelerating State Vector Simulation Using Gate-Matrix Caching and Circuit Splitting.
Tay, Y., Dehghani, M., Abnar, S., Shen, Y., Bahri, D., Pham,
P., Rao, J., Yang, L., Ruder, S., and Metzler, D. (2020).
Long Range Arena: A Benchmark for Efficient Trans-
formers.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. (2023). LLaMA: Open and Efficient Foundation Language Models.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems.
Wang, J., Gangavarapu, T., Yan, J. N., and Rush, A. M.
(2024). MambaByte: Token-free Selective State
Space Model.
Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou,
C., Li, C., Li, C., Liu, D., Huang, F., and others
(2024). Qwen2 technical report. arXiv preprint
arXiv:2407.10671.
APPENDIX
Hyperparameters
Hyperparameters were determined by a constrained grid search over learning rate, optimizer, and number of parameters, with a fixed time budget per run. All models used quantum weight remapping to limit parameter ranges (Kölle et al., 2022) and the Adam optimizer (Kingma and Ba, 2014). Trainable parameters were distributed in a 2:1:1:1 ratio over the A, B, C, and ∆ sub-circuits (chosen after preliminary trials). For the haystack task, we chose a learning rate of 2e−3 and around 1000 parameters in total.
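As an illustration of this setup, the following sketch (in PyTorch, with hypothetical names; the exact parameter budget, sub-circuit keys, and tanh-based remapping are our assumptions, not the paper's code) shows how the 2:1:1:1 parameter split, weight remapping, and Adam configuration described above could be wired together:

import math
import torch

# Roughly 1000 trainable parameters split 2:1:1:1 over the A, B, C, and Delta
# sub-circuits (budget and key names are illustrative assumptions).
TOTAL_PARAMS = 1000
RATIOS = {"A": 2, "B": 1, "C": 1, "delta": 1}
ratio_sum = sum(RATIOS.values())
budget = {name: round(TOTAL_PARAMS * r / ratio_sum) for name, r in RATIOS.items()}

# One flat parameter tensor per sub-circuit, standing in for the circuit's rotation angles.
params = {name: torch.nn.Parameter(torch.randn(n)) for name, n in budget.items()}

def remap(theta: torch.Tensor) -> torch.Tensor:
    # Weight remapping (Koelle et al., 2022): squash unbounded weights into a bounded
    # interval before using them as rotation angles; tanh to [-pi, pi] is one common choice.
    return math.pi * torch.tanh(theta)

# Adam with the learning rate used for the haystack task.
optimizer = torch.optim.Adam(params.values(), lr=2e-3)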