ACKNOWLEDGEMENT
Work partially supported by the Italian Ministry of Education, University and Research (MIUR) in the framework of the CrossLab project (Departments of Excellence).
REFERENCES
Brock, A., Donahue, J., and Simonyan, K. (2018). Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197.
Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic
meta-learning for fast adaptation of deep networks.
In International Conference on Machine Learning,
pages 1126–1135. PMLR.
Galatolo, F. A. (2021). CLIP-GLaSS repository on GitHub. https://github.com/galatolofederico/clip-glass.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial networks. arXiv preprint arXiv:1406.2661.
Hu, D. (2019). An introductory survey on attention mechanisms in NLP problems. In Proceedings of SAI Intelligent Systems Conference, pages 432–448. Springer.
Karras, T., Laine, S., and Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410.
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. (2020). Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119.
Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C., and Socher, R. (2019). CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.
Li, A., Jabri, A., Joulin, A., and van der Maaten, L. (2017). Learning visual n-grams from web data. In Proceedings of the IEEE International Conference on Computer Vision, pages 4183–4192.
Mansimov, E., Parisotto, E., Ba, J. L., and Salakhutdinov, R. (2015). Generating images from captions with attention. arXiv preprint arXiv:1511.02793.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.
Radford, A., Wu, J., Amodei, D., Amodei, D., Clark, J., Brundage, M., and Sutskever, I. (2019). Better language models and their implications. OpenAI Blog, https://openai.com/blog/better-language-models.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252.
Sennrich, R., Haddow, B., and Birch, A. (2015). Neural
machine translation of rare words with subword units.
arXiv preprint arXiv:1508.07909.
Socher, R., Ganjoo, M., Sridhar, H., Bastani, O., Manning, C. D., and Ng, A. Y. (2013). Zero-shot learning through cross-modal transfer. arXiv preprint arXiv:1301.3666.
Wang, Z., She, Q., and Ward, T. E. (2019). Generative adversarial networks in computer vision: A survey and taxonomy. arXiv preprint arXiv:1906.01529.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., and Le, Q. V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., and Xiao, J. (2015). LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365.
Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., and Metaxas, D. N. (2018). StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1947–1962.