
of the IEEE International Conference on Computer Vision, pages 5803–5812.
Avrahami, O., Hertz, A., Vinker, Y., Arar, M., Fruchter, S., Fried, O., Cohen-Or, D., and Lischinski, D. (2023). The chosen one: Consistent characters in text-to-image diffusion models. arXiv preprint arXiv:2311.10093.
Brooks, T., Holynski, A., and Efros, A. A. (2022). InstructPix2Pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800.
Cho, J., Zala, A., and Bansal, M. (2023). DALL-Eval: Probing the reasoning skills and social biases of text-to-image generation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3043–3054.
Crawford, J., Dillon, D., Petrisor, B., Schneider, F. W., and Teague, E. (2020). Tasha's Cauldron of Everything. Wizards of the Coast.
Esser, P., Rombach, R., and Ommer, B. (2021). Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883.
Frans, K., Soros, L., and Witkowski, O. (2022). CLIPDraw: Exploring text-to-drawing synthesis through language-image encoders. Advances in Neural Information Processing Systems, 35:5207–5218.
Gupta, T., Schwenk, D., Farhadi, A., Hoiem, D., and Kembhavi, A. (2018). Imagine this! Scripts to compositions to videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 598–613.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30.
Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Jeong, H., Kwon, G., and Ye, J. C. (2023). Zero-shot generation of coherent storybook from plain text story using diffusion models. arXiv preprint arXiv:2302.03900.
Jocher, G., Chaurasia, A., Stoken, A., Borovec, J., Kwon, Y., Michael, K., Fang, J., Yifu, Z., Wong, C., Montes, D., et al. (2022). ultralytics/yolov5: v7.0 - YOLOv5 SOTA realtime instance segmentation. Zenodo.
Li, J., Li, D., Xiong, C., and Hoi, S. (2022). BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR.
Li, Y., Gan, Z., Shen, Y., Liu, J., Cheng, Y., Wu, Y., Carin, L., Carlson, D., and Gao, J. (2019). StoryGAN: A sequential conditional GAN for story visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6329–6338.
Liang, J., Pei, W., and Lu, F. (2019). CPGAN: Full-spectrum content-parsing generative adversarial networks for text-to-image synthesis. arXiv preprint arXiv:1912.08562.
Liu, H., Li, C., Wu, Q., and Lee, Y. J. (2023). Visual instruction tuning. arXiv preprint arXiv:2304.08485.
Maharana, A., Hannan, D., and Bansal, M. (2021). Improving generation and evaluation of visual stories via semantic consistency. arXiv preprint arXiv:2105.10026.
Maharana, A., Hannan, D., and Bansal, M. (2022). StoryDALL-E: Adapting pretrained text-to-image transformers for story continuation. In European Conference on Computer Vision, pages 70–87. Springer.
Pan, X., Qin, P., Li, Y., Xue, H., and Chen, W. (2022). Synthesizing coherent story with auto-regressive latent diffusion models. arXiv preprint arXiv:2211.10950.
Peiris, A. and de Silva, N. (2022). Synthesis and evaluation of a domain-specific large data set for Dungeons & Dragons. arXiv preprint arXiv:2212.09080.
Peiris, A. and de Silva, N. (2023). SHADE: Semantic hypernym annotator for domain-specific entities - DnD domain use case. In 2023 IEEE 17th International Conference on Industrial and Information Systems (ICIIS), page 6, Peradeniya, Sri Lanka.
Perkins, C., Hickman, T., and Hickman, L. (2016). Curse of Strahd. Wizards of the Coast.
Rahman, T., Lee, H.-Y., Ren, J., Tulyakov, S., Mahajan, S., and Sigal, L. (2023). Make-A-Story: Visual memory conditioned consistent story generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2493–2502.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695.
Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., and Aberman, K. (2023). DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510.
Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., and Catanzaro, B. (2018). High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8798–8807.
Weerasundara, G. and de Silva, N. (2023). Comparative analysis of named entity recognition in the Dungeons and Dragons domain.
Zermani, M., Larabi, M.-C., and Fernandez-Maloigne, C. (2021). A comprehensive assessment of the structural similarity index. Signal Processing: Image Communication, 99:116336.
Zhang, L. and Agrawala, M. (2023). Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543.