depend on the captured videos’ lengths. Table 1 lists the durations of the individual pipeline steps, based on 104 captured videos recorded at 24 fps with a length of 30 seconds. Furthermore, it is assumed that all 720 frames per video are used in the photogrammetry reconstruction, resulting in a total pipeline duration of 4.16 hours. Per time step of the animation, the mesh generation based on 104 images takes 8.7 minutes, which would amount to approximately 104.4 hours on a single computer. As each frame
of the video can be reconstructed independently based
on the same camera alignment, we parallelized this step
on a 50-node cluster. As a consequence, we were able to reduce the effective mesh generation time significantly, to approximately 0.17 minutes per frame (8.7 minutes distributed across 50 nodes), resulting in 2 hours for the complete video, as shown in step 5 of Table 1.
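To make the parallelization concrete, the following minimal sketch distributes the independent per-time-step reconstructions over a pool of workers. It is an illustration only: the directory layout, the reconstruct_frame.py wrapper around the photogrammetry software, and its command-line flags are assumed placeholders, and our actual pipeline submits the same kind of independent jobs to a 50-node cluster rather than to local processes.

# Sketch of the per-frame parallelization idea: every time step owns one
# directory with the 104 camera images and can be reconstructed independently
# once the shared camera alignment exists. Paths and the reconstruct_frame.py
# worker script are hypothetical placeholders.
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

FRAME_ROOT = Path("capture/frames")        # frames/0000 ... frames/0719
ALIGNMENT = Path("capture/alignment.psx")  # shared camera alignment
OUTPUT_ROOT = Path("capture/meshes")

def reconstruct(frame_dir: Path) -> Path:
    """Build the mesh for a single time step from its 104 images."""
    out = OUTPUT_ROOT / f"{frame_dir.name}.obj"
    subprocess.run(
        ["python", "reconstruct_frame.py",   # hypothetical wrapper script
         "--images", str(frame_dir),
         "--alignment", str(ALIGNMENT),
         "--output", str(out)],
        check=True,
    )
    return out

if __name__ == "__main__":
    frame_dirs = sorted(FRAME_ROOT.iterdir())
    # On a single machine this only uses local cores; the pipeline instead
    # submits the same independent jobs to a 50-node cluster.
    with ProcessPoolExecutor(max_workers=8) as pool:
        for mesh in pool.map(reconstruct, frame_dirs):
            print("finished", mesh)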
6 LIMITATIONS
The capture space is confined by the photogrammetry cage’s footprint, limiting the types of movements that can be recorded. In our cage, subjects can only move within a narrow volume of about 0.8 m³. Although this is sufficient for the animations required in virtual face-to-face conversations or interactions, expanding the available space would broaden the system’s range of use cases.
As the capture frame rate of the low-cost cameras used is limited, fast movements result in partially blurred frames. As a consequence, rapidly moving body parts, such as the hands during waving, or details like flying hair during head movements, can be missing in the final mesh, preventing a detailed digitization of the subject per frame. Thus, the visual quality and naturalness of the resulting animated avatar might decrease, and mesh discontinuities between succeeding frames might occur. However, for most gestures and facial expressions required in face-to-face interactions (see Fig. 1), the achieved recording speed is sufficient. To make our pipeline applicable to faster performances, two approaches are reasonable: (a) An increased capture frame rate would improve the quality of the reconstructed imagery.
However, the specific setup of our capture cage restricts
the potential frame rates: The placement of the cameras
within the cage was designed around a 4:3 image aspect
ratio, but the Raspberry Pi cameras used are not capable
of capturing in 4:3 at a high frame rate, so a 16:9 aspect
ratio resolution was used instead. This results in less
overlap between neighboring imagery, thereby nega-
tively affecting the final quality of the reconstructed
video. (b) Mesh tracking across the individual time steps should be integrated to improve the generated mesh quality. By using mesh information from previous and succeeding frames, temporally coherent meshes can be created. As a consequence, missing information on body parts or on the dynamics of hair and clothing, which normally leads to mesh discontinuities, can be supplemented.
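As a rough illustration of what such a tracking step would enable, the following sketch smooths corresponded vertex positions over neighboring time steps. It assumes that tracking has already established a one-to-one vertex correspondence between succeeding frames; the per-frame meshes produced by plain photogrammetry do not share such a correspondence, so this correspondence is precisely what the proposed step would have to provide.

# Illustrative sketch only: temporal smoothing of vertex positions, assuming
# a shared vertex correspondence across all time steps.
import numpy as np

def smooth_vertex_trajectories(frames: np.ndarray, window: int = 3) -> np.ndarray:
    """frames: (T, V, 3) array of corresponded vertex positions over T time steps.

    Returns a temporally smoothed copy in which each vertex position is the
    average of its positions in the surrounding `window` frames, damping
    frame-to-frame discontinuities such as popping geometry."""
    T = frames.shape[0]
    half = window // 2
    smoothed = np.empty_like(frames)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        smoothed[t] = frames[lo:hi].mean(axis=0)
    return smoothed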
Besides enabling a high-quality and detailed
reconstruction of fast performances of a captured
subject, accelerating the pipeline itself will improve
our work. A small speed-up can be achieved by
parallelizing the third step of our pipeline, the frame
extraction routine (cf. Table 1). Even more essential with respect to a speed-up, however, is the frame sorting (see step 4b in Table 1). In our current design, the mesh generation routine requires, per time step, one distinct directory containing the frames from all perspectives. This structure is produced by the frame sorting routine, which copies the individual frames into these directories based on the detected flashes. This brute-force copying should be replaced by a more advanced approach using a dedicated data structure that groups the individual frames into time steps, e.g., via index shifts.
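A possible shape of such a data structure is sketched below. It is an assumed design, not the implemented routine: the extracted frames stay in their per-camera directories, and the frame that a camera contributes to a given time step is resolved purely through an index offset derived from that camera’s detected flash, so no files need to be copied. The directory layout and naming are hypothetical.

# Sketch of the proposed index-shift bookkeeping (assumed design).
from pathlib import Path

def build_frame_index(flash_frame: dict[str, int], num_steps: int) -> dict[int, dict[str, Path]]:
    """flash_frame maps a camera id to the frame number in which the
    synchronization flash was detected in that camera's video.

    Returns, for every time step, the path of the frame each camera
    contributes, without moving a single file on disk."""
    index = {}
    for t in range(num_steps):
        index[t] = {
            cam: Path("frames") / cam / f"{flash + t:05d}.png"  # assumed layout
            for cam, flash in flash_frame.items()
        }
    return index

# Example: cameras whose flashes were detected at slightly different frame
# numbers still line up on the same time step.
frames_for_step = build_frame_index({"cam01": 12, "cam02": 14}, num_steps=720)
print(frames_for_step[0]["cam02"])  # frames/cam02/00014.png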
7 SUMMARY
Based on photogrammetry cages, as utilized for generating a realistic digital double of a captured human body, we configured a 3D pipeline that allows for volumetric video capture of a human subject’s performance. Our results indicate that the pipeline is a valuable first step towards fast and cost-efficient approaches for recording authentic, high-quality animations of digital doubles. For a successful future deployment, the pipeline duration has to be shortened further, and an additional mesh tracking step should be integrated to enhance the mesh quality per animation time step.