vided up into two classes. The first is the degree of realism of the appearance of objects in a scene, which, in the case of an RGB sensor, is the similarity between the synthetic image and the output of a real camera sensor. The second is the similarity of the statistics of object constellations and parameterizations between the virtual and the real world. In a scene or room containing a table, a person, a chair, and a laptop, we would expect certain constellations to appear more often than others. The chair would most likely be somewhere around the table in close proximity, the person could be expected anywhere in the room, and the laptop would most likely be on the table. Conversely, a situation where the person is standing inside the table or the chair would be highly unlikely or simply impossible.
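Such preferences can be expressed as a joint density over the scene parameters. As a purely illustrative sketch (the factorization and independence assumptions below are our hypothetical example, not a formula from the framework), the object positions x might be modeled as

  p(x_{table}, x_{chair}, x_{person}, x_{laptop}) = p(x_{table}) p(x_{chair} | x_{table}) p(x_{laptop} | x_{table}) p(x_{person}),

where p(x_{chair} | x_{table}) concentrates its probability mass in close proximity to the table and p(x_{person}) assigns zero density to positions intersecting the table or the chair.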
In our approach we tackle the second type of similarity. Our goal is to provide a framework in which constellations and parameterizations can be modeled probabilistically by domain experts as a mixed joint density function. To produce similar statistics when generating a large amount of data, the joint density function is used to sample scene instances in a virtual environment, which then form the basis for the generation of synthetic sensor data.
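As a minimal sketch of this sampling step (all names, distributions, and parameter values below are hypothetical assumptions for illustration, not the actual interface of our framework), scene instances could be drawn by ancestral sampling along a factorization such as the one above:

  import math
  import random

  # Hypothetical sketch: ancestral sampling of one scene instance from an
  # expert-modeled mixed joint density (a discrete presence flag plus
  # continuous positions). All distributions and parameters are assumptions.
  def sample_scene():
      scene = {}
      # Root variable: table position in room coordinates (meters).
      tx, ty = random.uniform(1.0, 4.0), random.uniform(1.0, 3.0)
      scene["table"] = (tx, ty)
      # Chair conditioned on the table: close proximity, arbitrary side.
      angle = random.uniform(0.0, 2.0 * math.pi)
      dist = abs(random.gauss(0.8, 0.15))  # chairs cluster ~0.8 m from the table
      scene["chair"] = (tx + dist * math.cos(angle), ty + dist * math.sin(angle))
      # Discrete component: the laptop is present (on the table) only sometimes.
      if random.random() < 0.7:
          scene["laptop"] = (tx + random.gauss(0.0, 0.2), ty + random.gauss(0.0, 0.2))
      # Person anywhere in the room, rejected if "standing in" the table.
      while True:
          px, py = random.uniform(0.0, 5.0), random.uniform(0.0, 4.0)
          if abs(px - tx) > 0.6 or abs(py - ty) > 0.4:
              break
      scene["person"] = (px, py)
      return scene

  # A large set of such instances would then drive the synthetic rendering step.
  scenes = [sample_scene() for _ in range(1000)]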
The remainder of this paper is organized as follows. Section 2 presents related work concerning the application of synthetic data in machine vision applications. Section 3 describes the generic conceptual design of our probabilistic modeling scheme. Section 4 illustrates the application and usefulness of the generic concept with a use case of reduced complexity. Finally, Section 5 draws a conclusion and gives hints for future work.
2 RELATED WORK
In this section we focus on work done by other authors on the pros and cons of using artificially created images and ground truth for the evaluation of computer vision algorithms. As stated by Rachkovskij and Kussul (Rachkovskij and Kussul, 1998), Frasch et al. (Frasch et al., 2011), and Kondermann (Kondermann, 2013), the selection, training, and evaluation of computer vision and machine learning algorithms requires a representative set of image data to be processed together with the results expected from processing, the ground truth. For computer vision applications, sample image data can either be captured with real-world imaging sensors or created artificially, based on direct modeling of image statistics or rendered from 3D scenes using computer graphics. Major drawbacks of using real-world sensor data for this task have been identified by the aforementioned and other authors:
1. Creating these datasets with real-world sensors is, for many applications, expensive in terms of equipment and time, or, in cases where humans could be injured or material damaged, even impossible (Meister et al., 2012; Geiger et al., 2012).
2. Manual effort is required to label objects in the images and/or to add high-level context information (Utasi and Benedek, 2012). Human observers have to annotate each frame with pixel-wise ground truth information, or at least verify the correctness of semi-automatic labeling.
3. Manually added ground truth information is highly subjective, as shown by Martin et al. (Martin et al., 2001) for the Berkeley Segmentation Dataset and Benchmark BSDS300 (http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/). Moreover, during image capture by the real sensor, much of the information present in the scene is lost for later processing, especially contextual information and surface data of occluded parts. A human annotating a given image interprets the scene based on his or her knowledge and may annotate it differently from the original scene.
4. It is not fully known how well the sample data reflects the statistical characteristics of the real application, especially in terms of edge cases (Frasch et al., 2011). One would have to manually evaluate the statistical properties of the whole dataset and would need to know the setup of the real-world scenes to gauge how well the sample dataset covers the statistics of real scenes.
Although there are numerous drawbacks to using real sensors to create sample data for vision and machine learning applications, far more such datasets are in use than synthetic ones (Frasch et al., 2011). According to Kondermann (Kondermann, 2013), this is because artificially created data has been considered too unrealistic. For a discussion of the impact of visual object appearance, sensor models, and light propagation in the scene on creating synthetic datasets and ground truth, we refer to the work of Meister and Kondermann (Meister and Kondermann, 2011). With the recent advances in physically correct rendering in the field of computer graphics, this prejudice is diminishing, and synthetic datasets are moving more and more into the focus of computer vision researchers (Frasch et al., 2011). An approach to evaluating driver assistance systems using synthetic image data was presented by von Neumann-Cosel, who described the simulation environment Virtual Test Drive (Neumann-Cosel et al.,