from our system is visually acceptable and outper-
forms the competition.
The contributions of this paper are:
• To the best of our knowledge, AnimaChaotic is
the first system to convert stories to animated 3D
videos while supporting a wide range of visualiz-
able information such as scenes, objects, actors,
actions, emotions, dialogues, and weather con-
ditions. The system is extensible: new scenes, objects, actors, and emotions can easily be supported by extending the database.
• The system handles word-sense and verb disambiguation to extract visualizable actions or events.
• We develop a robust graph-based object position-
ing algorithm in order to handle constraints be-
tween the scene objects.
• We support dynamic actions by defining action schemas (preconditions, execution mechanisms, and termination conditions), and we use steering behaviors to implement actor navigation in the scene; a simplified sketch of such a schema is given after this list.
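To make the action-schema idea concrete, the following minimal Python sketch models a hypothetical "walk to" action with preconditions, an execution step driven by a simple seek steering behavior, and a termination condition. All class and method names here are illustrative assumptions, not the system's actual implementation.

from dataclasses import dataclass

@dataclass
class Actor:
    name: str
    position: tuple = (0.0, 0.0)
    speed: float = 1.5  # distance covered per simulation step

@dataclass
class WalkToAction:
    """Illustrative action schema: preconditions, execution, termination."""
    actor: Actor
    target: tuple

    def preconditions_met(self) -> bool:
        # e.g. the actor must not already be at the target position
        return self.actor.position != self.target

    def execute_step(self) -> None:
        # Simple 'seek' steering behavior: move one step toward the target.
        ax, ay = self.actor.position
        tx, ty = self.target
        dx, dy = tx - ax, ty - ay
        dist = (dx * dx + dy * dy) ** 0.5
        if dist > 0:
            step = min(self.actor.speed, dist)
            self.actor.position = (ax + dx / dist * step, ay + dy / dist * step)

    def terminated(self) -> bool:
        ax, ay = self.actor.position
        tx, ty = self.target
        return ((tx - ax) ** 2 + (ty - ay) ** 2) ** 0.5 < 1e-3

# Usage: run the action until its termination condition holds.
alice = Actor("Alice")
action = WalkToAction(alice, target=(10.0, 5.0))
if action.preconditions_met():
    while not action.terminated():
        action.execute_step()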
The rest of this paper is organized as follows. Sec-
tion 2 presents related work. Section 3 details our
methodology. Section 4 evaluates the system using
a variety of techniques. Finally, Section 5 concludes
the paper and summarizes future work.
2 RELATED WORK
Both NLP and Computer Graphics have recently seen rapid progress, but their integration for story visualization has not been extensively studied. Our work resides at this intersection.
Information extraction is a crucial part of story visualization systems, as it identifies visualizable elements in the story such as actors and actions. ClausIE (Corro and Gemulla, 2013) and OLLIE (Mausam et al., 2012) are open information
extraction systems that are used to extract triplets
(subjects, verbs, and objects) from text. However,
these systems do not generalize well to story text as
they have been developed using factual texts such as
Wikipedia. He et al. (He et al., 2017) use semantic role labeling for information extraction, but their system is trained on a news dataset, a domain far removed from stories. This motivated us to build our own information extraction system specifically designed to handle stories.
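For illustration of what such subject-verb-object triplets look like, the sketch below performs a minimal dependency-based extraction with spaCy (assuming the en_core_web_sm model is installed). It is only an illustrative example, not ClausIE, OLLIE, or the extraction system proposed in this paper.

import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_triplets(text):
    """Return rough (subject, verb, object) triplets from the text."""
    triplets = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [c.text for c in token.children
                            if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c.text for c in token.children
                           if c.dep_ in ("dobj", "obj", "attr")]
                for s in subjects:
                    for o in objects:
                        triplets.append((s, token.lemma_, o))
    return triplets

print(extract_triplets("The little girl opened the wooden door."))
# e.g. [('girl', 'open', 'door')]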
Several similar solutions (Ma, 2002; Moens et al.,
2015; Marti et al., 2018; Gupta et al., 2018) have
been proposed to convert stories into animated videos.
However, our system differs from them in several ways. CONFUCIUS (Ma, 2002) supports only actors and actions and focuses mainly on converting language to humanoid animation. MUSE (Moens et al., 2015) is
limited to a pre-determined graphical scene and does
not focus on scene generation. CARDINAL (Marti
et al., 2018) requires the input text to follow the stan-
dardized format of movie scripts, and it only consid-
ers subjects, verbs, and objects. Gupta et al. (Gupta
et al., 2018) developed a retrieval storytelling system
in which the system retrieves a short video from a
database to represent each sentence in the input story.
These short videos are then processed and concate-
nated to produce the final video. One downside of this approach is that it is limited by the video frames stored in the database and by the nature of the input stories.
Other solutions have been proposed to convert text
into static 3D scenes or images using a variety of tech-
niques. These solutions include WordsEye (Coyne
and Sproat, 2001), Text2Scene (Tan et al., 2019), the
system proposed by Lee et al. (Lee et al., 2018), Sce-
neSeer (Chang et al., 2017), and the system intro-
duced by Chang et al. (Chang et al., 2014) which fo-
cuses on spatial relations between objects and infer-
ring missing objects in the scene. These systems are limited to static scene generation, i.e., their output is neither dynamic nor animated as in our system.
3 METHODOLOGY
Figure 1 shows the architecture of our proposed system, which is divided into an NLP pipeline and a graphics pipeline. The story is entered as text. It is then processed by the NLP pipeline, which extracts
all the visualizable information. This information is
structured in a special format and fed into the graph-
ics pipeline. This pipeline handles loading the scene,
positioning objects in the scene, and applying the ap-
propriate animations. The following subsections ex-
plain the components of each pipeline in detail.
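As a rough sketch of this two-stage design, the Python outline below shows the overall flow from story text through the structured intermediate format to the rendered video; the function names and ellipsis bodies are illustrative assumptions, not the actual implementation.

from typing import Any, Dict, List

def nlp_pipeline(story_text: str) -> List[Dict[str, Any]]:
    """Extract the visualizable information from the story text and return it
    as a list of structured scene descriptions."""
    ...  # parsing, disambiguation, and information extraction

def graphics_pipeline(scene_descriptions: List[Dict[str, Any]]) -> None:
    """Load each scene, position its objects, and apply the animations."""
    ...  # scene loading, object positioning, animation

def story_to_video(story_text: str) -> None:
    # Overall flow: text -> NLP pipeline -> structured format -> graphics pipeline.
    graphics_pipeline(nlp_pipeline(story_text))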
3.1 NLP Pipeline
This pipeline extracts the information that can be vi-
sualized and animated from the story text. This in-
formation represents objects, actors, positioning con-
straints, and events. Objects can have characteristics
such as color and shape. Actors can be described by
their gender, age, height, physical appearance, and
clothes. Positioning constraints define the spatial rela-
tions between objects and actors in the scene. Events
can be changes in the weather conditions, changes in
the emotions of the actors, or actions performed by the actors.
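A minimal sketch of how this extracted information could be represented is shown below; the class and field names are assumptions for illustration, not the system's actual data model.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SceneObject:
    name: str                      # e.g. "door"
    color: Optional[str] = None    # e.g. "red"
    shape: Optional[str] = None    # e.g. "round"

@dataclass
class StoryActor:
    name: str
    gender: Optional[str] = None
    age: Optional[str] = None      # e.g. "old", "young"
    height: Optional[str] = None
    appearance: List[str] = field(default_factory=list)
    clothes: List[str] = field(default_factory=list)

@dataclass
class PositionConstraint:
    subject: str                   # object or actor being placed
    relation: str                  # e.g. "on", "under", "next to"
    reference: str                 # object or actor it is placed relative to

@dataclass
class Event:
    kind: str                      # "weather", "emotion", or "action"
    description: str               # e.g. "rain", "happy", "walk to the tree"
    actor: Optional[str] = None    # actor affected by or performing the event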