dataset was encoded into a fixed-length sentence vector using the Universal Sentence Encoder (Cer et al., 2018).
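As a minimal sketch of this encoding step, the TensorFlow Hub release of the Universal Sentence Encoder can be used as below; the module URL and the example sentences are illustrative assumptions, not the preprocessing code used in this work.

    # Minimal sketch: encoding story sentences into fixed-length vectors
    # with the Universal Sentence Encoder (Cer et al., 2018).
    # The TF-Hub module URL and example sentences are illustrative assumptions.
    import tensorflow_hub as hub

    # Load the pre-trained encoder from TensorFlow Hub.
    embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

    story_sentences = [
        "Pororo and Crong play on the snowy hill.",
        "Eddy builds a new robot in his house.",
    ]

    # Each sentence is mapped to a 512-dimensional vector.
    sentence_vectors = embed(story_sentences)
    print(sentence_vectors.shape)  # (2, 512)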
5 EXPERIMENTS
We next present the results of experiments on generating story videos from story sentences. We trained our network for 120 epochs on the 2,500 training sets described in Section 4 and tested the trained network on 495 test sets.
We first show video images generated from test story sentences. Fig. 6 (a), (b), and (c) show three different story sentences and the video images generated by our method. For comparison, we also show the results of StoryGAN (Li et al., 2019) and of our method trained without the caption loss. As shown in this figure, the proposed method can generate a series of short videos from multiple sentences, whereas StoryGAN can only generate a single keyframe image from each sentence and cannot visualize the motion of the characters and scenes described in the sentence. We can also see that the characters mentioned in the sentences are properly generated in the videos obtained from the proposed method. Comparing the variants with and without the caption loss, the model trained with the caption loss generates videos with larger changes, indicating that the generated videos have richer expressions.
To evaluate the accuracy of the generated videos quantitatively, we captioned each generated video with a captioning network (Vladimir Iashin, 2020) trained on the Pororo dataset and computed the cosine similarity between the resulting caption and the input sentence. Table 1 shows the cosine similarity between the input sentences and the sentences obtained from the generated videos. For comparison, we also report the cosine similarity obtained when the caption loss is not used in our method. As shown in Table 1, the caption loss improves the quality of the generated videos.
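As a reference for how such a score can be computed, the following sketch embeds an input sentence and a caption predicted from the generated video with the same Universal Sentence Encoder and compares them by cosine similarity; the function and the example strings are illustrative assumptions, not the evaluation code used in this work.

    # Minimal sketch of the evaluation: compare the input sentence with the
    # caption predicted from the generated video via cosine similarity of
    # their sentence embeddings. Names and example strings are illustrative.
    import numpy as np
    import tensorflow_hub as hub

    embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

    def cosine_similarity(u, v):
        # Cosine similarity between two 1-D vectors.
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    input_sentence = "Pororo and Crong play on the snowy hill."
    predicted_caption = "Pororo plays with Crong on the snow."

    # Embed both sentences and compare them.
    u, v = embed([input_sentence, predicted_caption]).numpy()
    print(cosine_similarity(u, v))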
6 CONCLUSION
In this paper, we proposed a method for generating videos that represent stories described in multiple sentences.
Existing text-to-video methods can only generate short videos from texts, and generating long story videos from long story sentences is a challenging problem. In this paper, we extended StoryGAN and showed that story videos can be generated from story sentences. Furthermore, we showed that the caption loss further improves the accuracy of the generated story videos.
Our method replicates the human ability to imagine and visualize the scenes described in sentences while reading a novel. On the other hand, a large amount of experience is required to imagine rich scenes from story sentences, so using a foundation model may further improve the ability to perform this task.
REFERENCES
Brock, A., Donahue, J., and Simonyan, K. (2019). Large scale gan training for high fidelity natural image synthesis. In Proc. ICLR.
Cer, D., Yang, Y., Kong, S.-y., Hua, N., Limtiaco, N., John, R. S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., Sung, Y.-H., Strope, B., and Kurzweil, R. (2018). Universal sentence encoder. In arXiv:1803.11175.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In Proc. ICLR.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Proc. Advances in Neural Information Processing Systems, pages 2672–2680.
Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. In Proc. Conference on Neural Information Processing Systems.
Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., and Fleet, D. J. (2022). Video diffusion models. In arXiv:2204.03458.
Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134.
Kim, K., Heo, M., Choi, S., and Zhang, B. (2017). Deepstory: Video story qa by deep embedded memory networks. In Proc. of International Joint Conference on Artificial Intelligence, pages 2016–2022.
Lei, J., Wang, L., Shen, Y., Yu, D., Berg, T. L., and Bansal, M. (2020). Mart: Memory-augmented recurrent transformer for coherent video paragraph captioning. In arXiv:2005.05402.
Li, C., Kong, L., and Zhou, Z. (2020). Improved-storygan for sequential images visualization. Journal of Visual Communication and Image Representation, 73:102956.
Li, Y., Gan, Z., Shen, Y., Liu, J., Cheng, Y., Wu, Y., Carin, L., Carlson, D., and Gao, J. (2019). Storygan: A sequential conditional gan for story visualization. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pages 6329–6338.