
(2024). Google colaboratory.
https://colab.research.google.com. Accessed: 2024-01-07.
(2024). Harfbuzz. https://harfbuzz.github.io/. Accessed:
2024-01-07.
(2024). Librosa. https://librosa.org/doc/latest/feature.html.
Accessed: 2024-01-07.
(2024). Musicbrainz. https://musicbrainz.org. Accessed:
2024-01-07.
(2024). Openai. https://openai.com/. Accessed:
2024-01-07.
(2024). Pango. https://pango.gnome.org/. Accessed:
2024-01-07.
(2024). Ru dall-e. https://rudalle.ru/. Accessed:
2024-01-07.
(2024). Spectral peaks.
https://ccrma.stanford.edu/~jos/parshl/Peak_Detection_Steps_3.html.
Accessed: 2024-01-07.
(2024). Wikidata. https://www.wikidata.org. Accessed:
2024-01-07.
(2024). Xml documentation. https://www.w3.org/TR/xml/.
Accessed: 2024-01-07.
(2024). Yandex toloka. https://toloka.ai/. Accessed:
2024-01-07.
Arjovsky, M., Chintala, S., and Bottou, L. (2017).
Wasserstein generative adversarial networks. In
International conference on machine learning, pages
214–223. PMLR.
Bogdanov, D., Wack, N., Gómez, E., Gulati, S., Herrera, P.,
Mayor, O., Roma, G., Salamon, J., Zapata, J. R., and
Serra, X. (2013). Essentia: An audio analysis library
for music information retrieval. In ISMIR.
Bond-Taylor, S., Leach, A., Long, Y., and Willcocks, C. G.
(2021). Deep generative modelling: A comparative
review of vaes, gans, normalizing flows, energy-based
and autoregressive models. arXiv preprint
arXiv:2103.04922.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.,
Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A., Agarwal, S., Herbert-Voss, A., Krueger,
G., Henighan, T., Child, R., Ramesh, A., Ziegler,
D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler,
E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner,
C., McCandlish, S., Radford, A., Sutskever, I., and
Amodei, D. (2020). Language models are few-shot
learners. CoRR, abs/2005.14165.
Canny, J. (1986). A computational approach to edge
detection. IEEE Transactions on pattern analysis and
machine intelligence, (6):679–698.
Carlier, A., Danelljan, M., Alahi, A., and Timofte,
R. (2020). Deepsvg: A hierarchical generative
network for vector graphics animation. CoRR,
abs/2007.11301.
Choi, K., Fazekas, G., Cho, K., and Sandler, M. B. (2017).
A tutorial on deep learning for music information
retrieval. CoRR, abs/1709.04396.
Cohn, R., Dodds, D., Donoho, A. W., Duce, D. A., Evans,
J., Ferraiolo, J., Furman, S., Graffagnino, P., Graham,
R., Henderson, L., Hester, A., Hopgood, B., Jolif, C.,
Lawrence, K. R., Lilley, C., Mansfield, P., McCluskey,
K., Nguyen, T., Sandal, T., Santangeli, P., Sheikh,
H. S., Stevahn, R. E., and Zhou, S. (2000). Scalable
vector graphics svg 1.0 specification.
Cook, K. (2013). Music industry market research-the effect
of cover artwork on the music industry.
Davis, S. and Mermelstein, P. (1980). Comparison of
parametric representations for monosyllabic word
recognition in continuously spoken sentences. IEEE
transactions on acoustics, speech, and signal processing,
28(4):357–366.
Duarte, A., Roldan, F., Tubau, M., Escur, J., Pascual, S.,
Salvador, A., Mohedano, E., McGuinness, K., Torres,
J., and Giro-i Nieto, X. (2019). Wav2pix:
Speech-conditioned face generation using generative
adversarial networks. In ICASSP, pages 8633–8637.
Frans, K., Soros, L. B., and Witkowski, O. (2021).
Clipdraw: Exploring text-to-drawing synthesis through
language-image encoders. CoRR, abs/2106.14843.
Freeman, S. (2020). Musical variables and color association
in classical music. Journal of Student Research, 9(1).
Gao, R., Oh, T.-H., Grauman, K., and Torresani, L. (2020).
Listen to look: Action recognition by previewing
audio. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
10457–10467.
Gavelin, D. (2023). Rocklou album cover generator.
https://www.rocklou.com/albumcovergenerator. Accessed:
2023-01-15.
Gillotte, J. L. (2019). Copyright infringement in
ai-generated artworks. UC Davis L. Rev., 53:2655.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and
Bengio, Y. (2014). Generative adversarial nets. Advances
in neural information processing systems, 27.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and
Courville, A. (2017). Improved training of
wasserstein gans. arXiv preprint arXiv:1704.00028.
Hepburn, A., McConville, R., and Santos-Rodríguez, R.
(2017). Album cover generation from genre tags.
In 10th International Workshop on Machine Learning
and Music.
Hertzmann, A. (2020). Visual indeterminacy in gan art.
Leonardo, 53(4):424–428.
Jiang, D.-N., Lu, L., Zhang, H., Tao, J., and Cai, L. (2002).
Music type classification by spectral contrast feature.
Proceedings. IEEE International Conference on
Multimedia and Expo, 1:113–116 vol.1.
Kingma, D. P. and Ba, J. (2015). Adam: A method for
stochastic optimization. CoRR, abs/1412.6980.
Kingma, D. P. and Welling, M. (2019). An
introduction to variational autoencoders. arXiv preprint
arXiv:1906.02691.
Korzeniowski, F. and Widmer, G. (2016). Feature learning
for chord recognition: The deep chroma extractor. In
ISMIR.
Li, T.-M., Lukáč, M., Michaël, G., and Ragan-Kelley, J.
(2020). Differentiable vector graphics rasterization
for editing and learning. ACM Trans. Graph. (Proc.
SIGGRAPH Asia), 39(6):193:1–193:15.
VISAPP 2024 - 19th International Conference on Computer Vision Theory and Applications