Synthesising Light Field Volumetric Visualizations in Real-time using a

Compressed Volume Representation

an Bruton, David Ganter and Michael Manzke

Graphics, Vision and Visualisation (GV2), Trinity College Dublin, the University of Dublin, Ireland

Keywords:

Volumetric Data, Visualization, View Synthesis, Light Field, Convolutional Neural Network.

Abstract:

Light ﬁeld display technology will permit visualization applications to be developed with enhanced perceptual

qualities that may aid data inspection pipelines. For interactive applications, this will necessitate an increase

in the total pixels to be rendered at real-time rates. For visualization of volumetric data, where ray-tracing

techniques dominate, this poses a signiﬁcant computational challenge. To tackle this problem, we propose

a deep-learning approach to synthesise viewpoint images in the light ﬁeld. With the observation that image

content may change only slightly between light ﬁeld viewpoints, we synthesise new viewpoint images from a

rendered subset of viewpoints using a neural network architecture. The novelty of this work lies in the method

of permitting the network access to a compressed volume representation to generate more accurate images

than achievable with rendered viewpoint images alone. By using this representation, rather than a volumetric

representation, memory and computation intensive 3D convolution operations are avoided. We demonstrate

the effectiveness of our technique against newly created datasets for this viewpoint synthesis problem. With

this technique, it is possible to synthesise the remaining viewpoint images in a light ﬁeld at real-time rates.

1 INTRODUCTION

Volumetric information is increasingly used across

many domains including civil engineering, radiology

and meteorology. Visualization of this volumetric in-

formation is an essential part of workﬂows in these

areas, with users tasked with inspecting the visualiza-

tion for patterns of interest. With the development of

advanced display technologies, such as 3D displays

and virtual reality, this visual inspection process may

be performed in a more natural manner, potentially

leading to easier discovery of salient patterns.

These future displays are likely to be enabled by

the use of light ﬁeld technology (Wetzstein et al.,

2012; Lanman and Luebke, 2013). Displays employ-

ing light ﬁelds permit added perceptual quality, with

parallax, accommodation and convergence cues mak-

ing interactions with the displays more compelling

and natural. In a medical inspection setting, it has

been reported that light ﬁeld displays add impressive

depth perception of structures in 3D scan data, and fa-

cilitate easier discrimination by the physician (Agus

et al., 2009).

A light ﬁeld is a function that describes the direc-

tion and wavelength of all light rays in a scene (Levoy

and Hanrahan, 1996). Imaging technologies for light

ﬁelds, such as the Lytro Illum camera, simplify this

ﬁve-dimensional function to four dimensions, allow-

ing capturing of a light ﬁeld by a discrete rectangular

grid of camera lenses. In a virtual setting, this light

ﬁeld formulation can be represented by a 2D grid of

aligned cameras, where the image size at each cam-

era is given by the spatial resolution, and the num-

ber of cameras in the 2D grid is given by the angular

resolution. Thus, to render such a light ﬁeld within

ﬁxed computational bounds, there exists a trade-off

between the spatial resolution and the angular resolu-

tion. In this work, we seek to efﬁciently increase the

angular resolution of a light ﬁeld rendering of volu-

metric data.

Rendering this volumetric data is itself a non-

trivial problem. The volume data can be highly de-

tailed, with multiple dimensions represented, such as

temperature and pressure in the case of a meteorolog-

ical simulation. This richness necessitates advanced

rendering techniques to be used (Philips et al., 2018).

The choice of methods, such as isosurface extraction,

and parameters, such as the transfer function, are de-

pendent on the domain and purpose of the visualiza-

tion. Furthermore, due to the dense nature of the vol-

ume, as opposed to sparse mesh representations, in-

formative visualizations often require expensive ray

Bruton S., Ganter D. and Manzke M.

Synthesising Light Field Volumetric Visualizations in Real-time using a Compressed Volume Representation.

DOI: 10.5220/0007407200960105

In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 10th International Conference on

Information Visualization Theory and Applications), pages 96-105

ISBN: 978-989-758-354-4

tracing techniques to permit observation of salient in-

ternal substructures. Thus, rendering multiple visu-

alization frames at real-time rates is very demanding

for modest hardware.

For virtual light ﬁeld cameras, in which the sen-

sors are arranged in a closely-spaced array, the ren-

dered image may change only slightly between differ-

ent sensors’ viewpoints. As such, the rendering power

used to produce one light ﬁeld viewpoint image can

be re-used to help produce another. Advances in video

interpolation and viewpoint synthesis have shown that

such a problem is becoming more tractable, espe-

cially with the use of convolutional neural networks

that can learn how to estimate the dynamics of scene

components between views. In such videos, the nat-

ural scene is often composed of opaque objects, and

so unique correspondences between pixels in pairs of

images can be determined. Such correspondences can

be used to calculate optical ﬂow and disparity, which

are key inputs of many of these interpolation tech-

niques. However, in the case of volumetric rendering,

the use of alpha values in the transfer function violates

the optical ﬂow assumption of constant brightness of

object points. Furthermore, due to the contribution

of multiple object points, with differing alpha values,

to a ﬁnal pixel value, a disparity value between two

images cannot be uniquely deﬁned. Thus to obtain a

correct pixel value for a generated image, it is nec-

essary to understand the dynamics of multiple dense

volumetric scene components.

Another difference between this image interpola-

tion problem and that of natural scenes, is that the vol-

ume data is available to us. Using this data as an input

could potentially improve the quality of the generated

images. However, if a neural network approach to

the image interpolation problem is used, it is not clear

how best to exploit this volume data. Performing 3D

convolutional operations over the typically large vol-

umes would be highly demanding and may preclude

the use of this technique as part of a real-time system.

Accordingly, we propose the use of a compressed im-

age representation of the volume, via a rank pooling

of volume slices. We show that such ranked volume

images can be used as part of a neural network ap-

proach to improve image synthesis results.

The contributions of this work are as follows:

• We are the ﬁrst to demonstrate a neural network

approach to synthesising light ﬁelds for volume

visualizations.

• To exploit the availability of the volumetric data,

we propose the use of a compressed representa-

tion of dense real-valued volumetric data. This

avoids the prohibitive cost of 3D convolutional ﬁl-

ters for real-time performance and is shown to im-

prove image synthesis results.

• Our neural network approach, generating a light

ﬁeld of angular resolution 6 × 6, performs at real-

time rates, signiﬁcantly reducing the time taken

using traditional ray-tracing techniques.

• We prepare and release two datasets for future

use in tackling this problem. All experimental

code and data appears in our code repository

(https://github.com/leaveitout/deep

light ﬁeld interp).

We envisage that the compressed volume repre-

sentation may have uses in other volume visualization

problems and intend to examine such solutions in fu-

ture work.

2 RELATED WORK

Convolutional neural networks (CNNs) have under-

gone a signiﬁcant renaissance in recent years, and

since their application to image classiﬁcation tasks

(Krizhevsky et al., 2012) the nonlinear complexity

that can be modelled by a CNN has been shown to be

useful for many image generation tasks (Park et al.,

2017; Liu et al., 2017; Zhou et al., 2016). Techniques

for the generation of images based on an input im-

age, often utilise ﬂow or disparity information. In the

closely-related case of light ﬁeld viewpoint genera-

tion, an estimate of disparity has been integrally used

to generate viewpoint images using deep learning ap-

proaches (Kalantari et al., 2016; Srinivasan et al.,

2017). Optical ﬂow has been used as part of novel

view synthesis techniques (Zhou et al., 2016) and as

part of video interpolation techniques to guide pixel

determination from a pair of images (Liu et al., 2017).

Another recent technique of video interpolation does

not rely on ﬂow information, but rather on learning

global convolutional ﬁlters that operate on the pair of

images to be interpolated (Niklaus et al., 2017). These

approaches are developed, however, for the purposes

of generating natural images and so do not make use

of data being visualized, as in our case.

Convolutional architectures for the speciﬁc pro-

cessing of volumetric data have been developed (Wu

et al., 2015; Maturana and Scherer, 2015). These ini-

tial works made use of 3D convolutional neural net-

works as natural extensions of 2D convolutional net-

works. It was observed, however, that this leads to

an increase in convolutional ﬁlter size, hence requir-

ing more data for training and precluding deeper net-

works under a ﬁxed computation budget (Qi et al.,

2016). Furthermore, 3D CNNs were outperformed by

2D CNNs trained on multiple views for object recog-

nition tasks (Su et al., 2015). Other techniques for

Synthesising Light Field Volumetric Visualizations in Real-time using a Compressed Volume Representation

efﬁcient learning from volumetric data have utilised

probing techniques (Li et al., 2016), tree representa-

tions (Wang et al., 2017b; Riegler et al., 2017; Klokov

and Lempitsky, 2017) and point cloud representations

(Qi et al., 2017). Each of these approaches assume

a binary occupancy volume and have not yet been

shown to work with real-valued volumes or indeed

multi-dimensional volumes.

Ranking is a commonly used machine learning

task encountered in information retrieval. Image fea-

tures extracted from video frames (Fernando et al.,

2017) have been ranked according to their temporal

order. The ranking representation allows videos of

variable lengths to be encoded into a ﬁxed size de-

scriptor that can then be used for classiﬁcation tasks,

such as action recognition. Recently, this technique

has been adapted for use on pixel data (Bilen et al.,

2016), with the ranking descriptor taking the form of

a single image, a compressed representation of the dy-

namics of the video. This “dynamic image” represen-

tation has been shown to beneﬁt action recognition

pipelines when trained on colour, optical ﬂow, and

scene ﬂow information (Wang et al., 2017a). Such

ranking representations have yet to be applied to vol-

umetric information as part of a visualisation pipeline.

We show in this work, that these representations are

effective in extracting details from volume data to im-

prove image synthesis quality.

A number of previous works have focussed on the

speciﬁc topic of rendering volumetric data for light

ﬁeld displays (Mora et al., 2009; Birklbauer and Bim-

ber, 2012). One work looks at volume rendering using

emissive volumetric displays and addresses the inher-

ent challenges of using such a display for this purpose

(Mora et al., 2009). Volume rendering for light ﬁelds

has also been performed on other speciﬁc light ﬁeld

displays (Agus et al., 2009). This method iteratively

reﬁnes a rendering when the viewpoint changes and

is thus susceptible to slow rendering when the view-

point changes signiﬁcantly. Another work speeds up

volume rendering for light ﬁelds by maintaining a

render cache, based on the current viewpoint, that is

ﬁlled during idle times and used to compose the ﬁ-

nal renders (Birklbauer and Bimber, 2012). Our work

seeks to make light ﬁeld volume rendering tractable

for large display sizes, and independent of viewpoint,

by using a neural network approach to increase the

angular resolution.

3 METHOD

Here, we detail our approach of synthesising light

ﬁeld viewpoint images using a subset of rendered

Figure 1: Volume ranked images calculated across entire

coordinate axes for a volume dataset of a magnetic reso-

nance imaging scan of a person’s head. The entropy-order

images of the top row appear to capture greater detail than

appears in the bottom row of axis-order images. In partic-

ular, we observe that the axis-order representation devotes

more features to discriminating elements such as the nose

and ear, and less to internal structures. Here, the images

are encoded in a single image, however, it is possible to use

more than one image in our ﬁnal encoding, where we will

test whether entropy or order is more effective as part of our

synthesis method.

viewpoints and a compressed volume representation.

We ﬁrst describe how we formulate these compressed

volume representations, before describing how it is

used in our neural network architecture to synthesise

the images.

3.1 Volumetric Representation

The rank pooling method of generating a compressed

representation of a video sequence is applied to a

dense volume to produce the volumetric equivalent

of a dynamic image (Bilen et al., 2016). Describ-

ing the volume, V , as a sequence of ordered slices

, .. . , v

], we let ψ(v

) ∈ R

be a representation of

an arbitrary slice. In our case, we use the slice infor-

mation itself, and so ψ(v

) = v

. Pairwise linear rank-

ing is characterised by parameters u ∈ R

, such that a

ranking function S(t|u) = u

· v

produces a score for

the slice at t. As such, u as a matrix that when multi-

plied by a slice v

, outputs a score indicating the rank

of this slice.

The parameters u are optimised such that the

given ordering of the slices is represented in the

scores, i.e. q > p =⇒ S(q|u) > S(p|u). Such an op-

timisation can be formulated as a convex problem and

solved using Rank Support Vector Machines (Smola

and Schlkopf, 2004):

∗

= ρ(v

, . . . , v

;ψ) = argmin

E(u) (1)

IVAPP 10th International Conference on Information Visualization Theory and Applications - 10th International Conference on Information

Visualization Theory and Applications

E(u) =

n(n − 1)

∑

q>p

max{0, 1 − S(q|u) + S(p|u)}

kuk

(2)

In Equation 1, we see that the optimal repre-

sentation, u

∗

, is a function of the slices, v

, . . . , v

and the representation type ψ. This can be framed

as an energy minimisation of the energy term E(u).

The ﬁrst term in Equation 2 is the hinge-loss soft-

counting of the number of incorrectly ranked pairs,

and the second term is the quadratic regularization

term of support vector machines, where the regular-

ization is controlled by hyperparameter λ. The hinge-

loss soft-counter term ensures that there is at least unit

score difference between correctly ranked pairs, i.e.

S(q|u) > S(p|u) + 1.

The optimised ranking representation u

∗

thus en-

codes information regarding the ordering of the vol-

ume slices and can be seen as the output of a function

ρ(v

, . . . , v

;ψ) that maps an arbitrarily sized volume

to a smaller size representation. Furthermore, this

representation is 2D and so 2D CNNs can be used to

learn from the volume data, instead of expensive 3D

CNNs if we were to learn from the volume directly.

In our volumetric representation, we seek com-

plex features present in the volume to persist in the

compressed representation. Thus, in addition to cal-

culating volumetric representations based on a rank-

ing of the order of slices along an axis, we propose

an alternative ranking method. This ranking is per-

formed according to the Shannon entropy of the slice,

H = −

∑

log p

, where p

is the normalised proba-

bility of a speciﬁc cell value in a slice. Entropy pro-

vides a heuristic of the complexity of the slice and

thus the ranking learns to discriminate this aspect,

rather than the arbitrarily positive or negative axis or-

der. In Figure 1, we show example images for the

two ranking formulations. We shall refer to the com-

pressed volume representations for axis-order rank-

ing and for entropy ranking as Order-Ranked Volume

Images (ORVI) and Entropy-Ranked Volume Images

(ERVI), respectively.

3.2 Viewpoint Synthesis

In our problem formulation, we assume that a subset

of the viewpoint images of the light ﬁeld are rendered

using traditional ray-tracing approaches. We select as

this subset the four corner images of the light ﬁeld ar-

ray as per Figure 2. We wish to efﬁciently synthesise

all the remaining viewpoint images. In recent work

(Kalantari et al., 2016), a convolutional autoencoder

approach was shown capable of performing such im-

age synthesis problems for a single viewpoint image

in light ﬁeld photographs. We thus investigate such

solutions for our problem of synthesising an entire

light ﬁeld.

By estimating scene geometry, it is possible to

generate an entire light ﬁeld for natural images from a

single input image (Srinivasan et al., 2017). However,

as a mapping between pixel values in a pair of light

ﬁeld volume renderings with varying alpha values is

not injective (one-to-one), disparity or ﬂow-based ap-

proaches of image interpolation are not ideal. Hence,

our network cannot rely on the presence or calculation

of these features.

Instead, we use a more direct approach of interpo-

lating between the rendered images, using a convolu-

tional autoencoder architecture. This architecture is

composed of successive dimension reducing encoder

blocks and dimension increasing decoder blocks. For

the design of these blocks, we follow the proven or-

der of operations for the similar problem of video

frame interpolation (Niklaus et al., 2017). In an En-

coder block, the order of operations is a convolution

with 3 × 3 ﬁlter, Rectiﬁed Linear Unit (ReLU) acti-

vation ( f (x) = max(0, x)), followed by another con-

volution, ReLU activation, and a ﬁnal average pool-

ing with 2 × 2 window and stride of 2. The second

convolutional operation maintains an equal number of

output channels as the previous convolutional opera-

tion. A Decoder block differs in that instead of an av-

erage pooling operation, a bilinear upsampling ﬁlter

is applied to double the output dimensions, followed

by convolution and ReLU operations. The convolu-

tion operations again maintain the number of output

channels. Bilinear sampling has the advantage of be-

ing differentiable, thus permitting back propagation

through the entire network.

As a baseline, we compose a network architecture

that takes as input solely the rendered visualisation

images. We refer to this approach as Light Field Syn-

thesis Network - Direct (LFSN-Direct). The number

of input and output features of the Encoder and De-

coder blocks in this architecture, which determine the

number of convolutional ﬁlters according the above

block deﬁnitions, are shown in Table 1.

In the architecture, residual connections (He et al.,

2016) are used between Encoder and Decoder blocks

to encourage less blurry results. Residual connections

formulate the intervening operations between a con-

nection as a function, F , that is tasked with learning

the change (or residual) required to the input, x, to

improve the output y, i.e.

y = F (x) + x, (3)

rather than being tasked with learning the entire out-

Synthesising Light Field Volumetric Visualizations in Real-time using a Compressed Volume Representation

Figure 2: An illustration of the convolutional autoencoder architecture used to synthesise light ﬁeld viewpoint images. One

network operates on rendered views, while the other operates on ranked volume images. In the network, green represents a

convolutional layer, blue an average pooling layer, purple a bilinear upsampling layer and red arrows represent skip connec-

tions.

put. Connections are made between the output of the

ﬁnal convolution of the Encoder block and the ﬁnal

output of the Decoder block, for corresponding block

numbers in Table 1.

Table 1: The network structure for the Light Field Synthe-

sis Network. The number of input features of the network

corresponds to four RGBA images concatenated along the

channel dimension, representing the rendered images. The

number of output features of the network corresponds to 32

RGBA images concatenated along the channel dimension,

representing the 32 remaining images of the 6 ×6 light ﬁeld.

Block # input

features

# output

features

Encoder 1 16 32

Encoder 2 32 64

Encoder 3 64 128

Encoder 4 128 256

Decoder 4 256 256

Decoder 3 256 128

Decoder 2 128 128

Decoder 1 128 128

Residual connections are made using an addition

operation, and thus it is required that both operands

are of the same dimension. As can be seen in Ta-

ble 1, the number of output features do not match be-

tween all corresponding Encoder and Decoder blocks

for residual connections. The Decoder blocks have a

larger number of output features to facilitate the gen-

eration of the larger number of output images. To

overcome this, we repeat the outputs of the corre-

sponding convolution operation in the Encoder block

along the channel dimension before addition to the

output of the Decoder block. For example, the resid-

ual connection between Encoder 1 and Decoder 1 re-

quires that the output of the convolution operation

is repeated four times. Using the same notation as

above, this residual connection can be expressed as

y = F (x) + [x, x, x, x]. (4)

This concatenation of the input may have potential

beneﬁts as it would encourage Encoder 1 to learn fea-

tures that output an x that is more generally applicable

across the light ﬁeld due to the joint effect of x upon

multiple output images.

We introduce a parallel network branch to learn

from ranked volume image representations, as shown

in Figure 2. This network branch has identical En-

coder and Decoder blocks to the LFSN-Direct net-

work, apart from the Encoder 1 block which may

have a different number of input features. This is de-

pendent upon the number of ranked volume images

used to encode a volume. We refer to this network ar-

chitecture as LFSN-ORVI or LFSN-ERVI, depend-

ing on whether ORVI or ERVI images are used, re-

spectively. In addition to those described previously,

residual connections are used in this architecture to

connect the branch processing the rendered images

(render branch) to the branch processing the ranked

volume images (volume branch). These connections

are made between corresponding Encoder blocks in

the render branch and Decoder blocks in the volume

branch. The ﬁnal output of the volume branch is

added to the output of the render branch. This en-

courages the volume branch to learn to speciﬁcally

generate the additional details that cannot be gener-

ated from the rendered images alone.

4 IMPLEMENTATION

To demonstrate applicability to a speciﬁc use case, we

restrict our datasets to two visualizations of a mag-

netic resonance imaging volume of a human, one of

IVAPP 10th International Conference on Information Visualization Theory and Applications - 10th International Conference on Information

Visualization Theory and Applications

100

Figure 3: An example light ﬁeld rendering for the MRI-

Head dataset.

the torso and one of the head referred to as MRI-Head

and MRI-Torso, respectively. The Interactive Visual-

ization Workshop (Sunden et al., 2015) was used to

create each dataset. Virtual cameras were arranged in

a planar 6 × 6 array, with optical axes parallel and a

ﬁxed spacing between adjacent cameras. 2,000 sets of

light ﬁeld visualizations were collected for each, with

spatial resolution of 256 × 256. To expose the inter-

nal structures of the volume, and increase the number

of alpha-blended image components, the MRI-Head

volume was clipped along 400 different planes and 5

light ﬁeld images were captured at randomized cam-

era locations for each different clipping. The clipping

planes were selected as plausible planes for inspection

purposes.

The transfer function was kept ﬁxed throughout

collection of each dataset. For the MRI-Torso dataset,

we include global illumination effects and an ex-

tracted isosurface to make the synthesis problem more

challenging.

For each dataset, we need to select an axis along

which to calculate the ranked volume images. We se-

lect the side proﬁle axis for the MRI-Head dataset and

the front proﬁle axis for the MRI-Torso dataset. The

ERVI images were calculated using the entropy val-

ues of volume slices along the chosen axis. Ranked

volume images were calculated for sets of adjacent

sub-volumes of sizes 16 × 256 × 256 and concate-

nated to form the compressed representation. This

represents a compression rate of 16 times compared

to the full volumetric data. Here, a trade-off is made

between compressing the volumetric information and

minimising the training and inference cost. This cost

increases due to the ﬁrst convolutional layer in the

Encoder 1 block in the volume branch, which is pro-

portional to the number of input ranked volume im-

ages. Ranked volume image calculated across the en-

tire volume can be seen in Figure 1.

To train the network, we use images from four

corner cameras of the light ﬁeld array as inputs. The

network is trained to output the remaining 32 images

of the light ﬁeld of angular resolution of 6 × 6. The

dataset was split into 1,600 sets of training images,

with the ﬁnal 400 reserved for testing. The network is

trained for a total of 250 epochs based on observations

of overﬁtting if trained further.

To train the network, the Adamax learning rate

scheduler was used with the recommended parame-

ters (Kingma and Ba, 2015). The data was trans-

formed to lie in the [0, 1] range prior to use during

training. Batch normalisation was found to adversely

affect results and so is not used. The loss function

used was

= kI − I

(5)

where I and I

are the synthesised and ground truth

light ﬁeld images respectively.

The Pytorch framework was used for training all

the neural network models, and the networks were

trained on a Nvidia Titan V card. To improve the

speed of training and inference, and to reduce mem-

ory requirements, we use half-precision ﬂoating point

representations for our entire network.

5 EVALUATION

5.1 Qualitative

We inspect the outputs of the three tested approaches

to compare their performances qualitatively. We note,

expectedly, that internal light ﬁeld viewpoint images

perform worse, with images appearing more blurred

as the distance from the input rendered images in-

creases.

In the case of the MRI-Head model, all models

perform well and it is difﬁcult to distinguish any ren-

dering artefacts. On inspection of Mean Average Er-

ror images, however, we do note that the edges of sur-

faces are synthesised more accurately with the LFSN-

ORVI and LFSN-ERVI models.

In the case of the MRI-Torso model, the LFSN-

ORVI and LFSN-ERVI models perform markedly

better, as can be seen in Figure 4. The ﬁner details ap-

pear sharper overall than the LFSN-Direct approach

in these images, however a slight blur effect is notice-

able across all three model outputs. This may be due

Synthesising Light Field Volumetric Visualizations in Real-time using a Compressed Volume Representation

101

Figure 4: Result images of the viewpoint synthesis methods for the MRI-Torso dataset, generating the light ﬁeld viewpoint at

(3, 3) in the angular dimension. Synthesised images appear slightly blurry compared to the ground truth rendering, speciﬁcally

the LFSN-Direct result. Inspecting the Mean Average Error (MAE), we see that overall the LFSN-ORVI and LFSN-ERVI

produce sharper images, with their errors more contained within speciﬁc difﬁcult regions, such as specular areas.

to the use of the L

loss for training, as has been ob-

served in other works (Niklaus et al., 2017). Further

example output images are shown in Figure 5.

5.2 Quantitative

To evaluate our technique quantitatively, we calculate

Structural Similarity (SSIM), Peak Signal to Noise

(PSNR) and Mean Squared Error (MSE) metrics for

the synthesised test set images against the ground

truth.

Table 2: Metrics calculated for the test set of images on

MRI-Head dataset.

Network type SSIM PSNR MSE

LFSN-Direct 0.9795 35.98 16.89

LFSN-ORVI 0.9906 40.22 6.33

LFSN-ERVI 0.9886 39.38 7.67

Our results for the MRI-Head dataset are pre-

sented in Table 2. It was found that the additional

information available via the ranked volume images

increased performance, however only marginally. In

Table 3: Metrics calculated for the test set of images on

MRI-Torso dataset.

Network type SSIM PSNR MSE

LFSN-Direct 0.9182 34.78 27.86

LFSN-ORVI 0.9408 36.87 17.72

LFSN-ERVI 0.9417 36.90 17.35

this case, The axis ordered representation outperforms

entropy, which may indicate that it encodes the vol-

ume data better when split into smaller sub-volumes.

Our results for the MRI-Torso data are presented

in Table 3. This dataset is shown to be more chal-

lenging for the LFSN-Direct to synthesise. The pres-

ence of isosurface meshes and global illumination

are likely responsible for the lower metrics overall.

We note that the two ranked volume models outper-

form the baseline approach to a greater margin in this

case. This indicates that the volume branch of the net-

work is effective at learning to generate missing de-

tails from the ranked volume representation. We note

that in this case, the LFSN-ERVI outperforms LFSN-

ORVI. The difference is small, however, and as such

further investigation on optimal methods of ordering

the volume slices may be warranted.

5.3 Performance Speed

A key aspect to consider is the speed at which images

can be generated. For a neural network approach to

light ﬁeld synthesis for volume visualisation to be fea-

sible, it needs to signiﬁcantly improve upon the time

taken to render the light ﬁeld using a traditional ap-

proach. To measure this, we recorded the time to ren-

der all light ﬁeld images in the MRI-Torso dataset.

We use the same computer, under similar load, with

Nvidia Titan V graphics card used for both the light

ﬁeld synthesis and rendering timings. The approach

that we compare against (called Baseline in Table 4)

IVAPP 10th International Conference on Information Visualization Theory and Applications - 10th International Conference on Information

Visualization Theory and Applications

102

Figure 5: Result images of the viewpoint synthesis methods for the MRI-Torso dataset (top one) and MRI-Head dataset

(bottom two), generating the light ﬁeld viewpoint at (3, 3) in the angular dimension. It is difﬁcult to determine artefacts for

the bottom two, however, the Mean Average Error images indicate better performance by LFSN-ORVI and LFSN-ERVI with

regards to edges present in the data.

Synthesising Light Field Volumetric Visualizations in Real-time using a Compressed Volume Representation

103

uses ray-tracing as part of the volume rendering, as

implemented in the Interactive Visualization Work-

shop.

Table 4: The timing of the different approaches of out-

putting an entire light ﬁeld for the MRI-Torso dataset. The

timing is calculated as the average over the entire test set

of 400 light ﬁelds. The errors are reported as the standard

deviation over the test set. Note that we include the time to

render the four corner images as part of all LFSN variants.

This amounts to a time of 0.08 seconds on average.

Network type Time (s)

Baseline 0.716 ± 0.223

LFSN-Direct 0.092 ± 0.026

LFSN-ORVI 0.097 ± 0.026

LFSN-ERVI 0.097 ± 0.026

The results in Table 4 show that we substantially

reduce the time taken to generate a light ﬁeld. The

LFSN-ORVI and LFSN-ERVI increase frames rates

by a multiplier of seven, bringing frame rates to

greater than 10 frames per second. Each of the neural

network synthesis methods synthesise images in ﬁxed

timings due to the network acting as a ﬁxed operation

once trained. This property will also be maintained

across dataset complexity and size, as along as size of

the input images to the network are maintained. In the

above results, we note that the majority of time taken

for the LFSN variants is spent rendering the four cor-

ner images. Thus if further speed improvements are

required, we should investigate methods of removing

the requirement of these rendered corner images.

6 CONCLUSIONS AND FUTURE

WORK

In this work, we proposed the use of convolutional

neural networks to synthesise views of dense volu-

metric data for applications of light ﬁeld rendering.

We propose the use of a ranked volume representation

to allow a network access to the available volumetric

data to improve synthesis results. We have demon-

strated that this approach improves synthesis results,

moreso for a more challenging dataset that we col-

lected. A key advantage of this approach is the speed

with which a light ﬁeld can be synthesised, making

real-time light ﬁeld generation achievable. As part of

this work we also release the code and data to encour-

age other researchers to tackle this and other related

problems in volume visualization.

We demonstrated our work for visualisation tasks

involving medical imaging data. In this domain, there

may be concerns with using such a predictive ap-

proach for visual inspection of data. For example, the

network may synthesise images that are missing iden-

tifying properties of a tumour that is present in the

data. This represents a key challenge for the type of

approach introduced in this work. Future work could

address this challenge by improving performance to

within an acceptable limit on a range of datasets, or

by characterising uncertainty in the synthesised im-

ages.

In future work, we wish to explore other ap-

proaches of synthesis, perhaps by generating coarse

synthesis estimates via geometrical heuristics, prior

to reﬁnement with a neural network. The proposed

pipeline for visualization is not restricted to light ﬁeld

applications. In future work, we hope to demonstrate

the applicability of the technique to frame extrapo-

lation, potentially leading to increased visualization

frame rates. Further avenues of investigation include

extending the technique to work for time-varying vol-

umetric data.

ACKNOWLEDGEMENTS

The authors would also like to thank the anonymous

referees for their valuable comments and helpful sug-

gestions. This research has been conducted with the

ﬁnancial support of Science Foundation Ireland (SFI)

under Grant Number 13/IA/1895.

REFERENCES

Agus, M., Bettio, F., Giachetti, A., Gobbetti, E., Igle-

sias Guitin, J. A., Marton, F., Nilsson, J., and Pin-

tore, G. (2009). An interactive 3d medical visualiza-

tion system based on a light ﬁeld display. The Visual

Computer, 25(9):883–893.

Bilen, H., Fernando, B., Gavves, E., Vedaldi, A., and Gould,

S. (2016). Dynamic image networks for action recog-

nition. In 2016 IEEE Conference on Computer Vision

and Pattern Recognition (CVPR), pages 3034–3042.

Birklbauer, C. and Bimber, O. (2012). Light-ﬁeld supported

fast volume rendering. In ACM SIGGRAPH 2012

Posters on - SIGGRAPH ’12, page 1, Los Angeles,

California. ACM Press.

Fernando, B., Gavves, E., M, J. O., Ghodrati, A., and Tuyte-

laars, T. (2017). Rank pooling for action recognition.

IEEE Transactions on Pattern Analysis and Machine

Intelligence, 39(4):773–787.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-

ual learning for image recognition. In 2016 IEEE Con-

ference on Computer Vision and Pattern Recognition

(CVPR), pages 770–778.

Kalantari, N. K., Wang, T.-C., and Ramamoorthi, R. (2016).

Learning-based view synthesis for light ﬁeld cameras.

ACM Trans. Graph., 35(6):193:1–193:10.

IVAPP 10th International Conference on Information Visualization Theory and Applications - 10th International Conference on Information

Visualization Theory and Applications

104

Kingma, D. P. and Ba, J. (2015). Adam: A method for

stochastic optimization. In Proceedings of the 3rd In-

ternational Conference on Learning Representations

(ICLR), San Diego.

Klokov, R. and Lempitsky, V. (2017). Escape from cells:

Deep kd-networks for the recognition of 3d point

cloud models. In Computer Vision (ICCV), 2017 IEEE

International Conference on, pages 863–872. IEEE.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012).

Imagenet classiﬁcation with deep convolutional neu-

ral networks. In Pereira, F., Burges, C. J. C., Bottou,

L., and Weinberger, K. Q., editors, Advances in Neu-

ral Information Processing Systems 25, pages 1097–

1105. Curran Associates, Inc.

Lanman, D. and Luebke, D. (2013). Near-eye light ﬁeld dis-

plays. In ACM SIGGRAPH 2013 Emerging Technolo-

gies, SIGGRAPH ’13, pages 11:1–11:1, New York,

NY, USA. ACM.

Levoy, M. and Hanrahan, P. (1996). Light ﬁeld render-

ing. In Proceedings of the 23rd Annual Conference

on Computer Graphics and Interactive Techniques,

SIGGRAPH ’96, pages 31–42, New York, NY, USA.

ACM.

Li, Y., Pirk, S., Su, H., Qi, C. R., and Guibas, L. J. (2016).

Fpnn: Field probing neural networks for 3d data.

In Proceedings of the 30th International Conference

on Neural Information Processing Systems, NIPS’16,

pages 307–315, USA. Curran Associates Inc.

Liu, Z., Yeh, R. A., Tang, X., Liu, Y., and Agarwala, A.

(2017). Video frame synthesis using deep voxel ﬂow.

In 2017 IEEE International Conference on Computer

Vision (ICCV), pages 4473–4481.

Maturana, D. and Scherer, S. (2015). Voxnet: A 3d convolu-

tional neural network for real-time object recognition.

In 2015 IEEE/RSJ International Conference on Intel-

ligent Robots and Systems (IROS), pages 922–928.

Mora, B., Maciejewski, R., Chen, M., and Ebert, D. S.

(2009). Visualization and computer graphics on

isotropically emissive volumetric displays. IEEE

Transactions on Visualization and Computer Graph-

ics, 15(2):221–234.

Niklaus, S., Mai, L., and Liu, F. (2017). Video frame inter-

polation via adaptive separable convolution. In 2017

IEEE International Conference on Computer Vision

(ICCV), pages 261–270.

Park, E., Yang, J., Yumer, E., Ceylan, D., and Berg, A. C.

(2017). Transformation-grounded image generation

network for novel 3d view synthesis. In 2017 IEEE

Conference on Computer Vision and Pattern Recogni-

tion (CVPR), pages 702–711.

Philips, S., Hlawitschka, M., and Scheuermann, G. (2018).

Slice-based visualization of brain ﬁber bundles - a lic-

based approach. pages 281–288.

Qi, C. R., Su, H., Niessner, M., Dai, A., Yan, M., and

Guibas, L. J. (2016). Volumetric and multi-view cnns

for object classiﬁcation on 3d data. arXiv:1604.03265

[cs].

Qi, C. R., Yi, L., Su, H., and Guibas, L. J. (2017). Point-

net++: Deep hierarchical feature learning on point sets

in a metric space. arXiv:1706.02413 [cs].

Riegler, G., Ulusoy, A. O., and Geiger, A. (2017). Octnet:

Learning deep 3d representations at high resolutions.

In The IEEE Conference on Computer Vision and Pat-

tern Recognition (CVPR).

Smola, A. J. and Schlkopf, B. (2004). A tutorial on sup-

port vector regression. Statistics and Computing,

14(3):199–222.

Srinivasan, P. P., Wang, T., Sreelal, A., Ramamoorthi, R.,

and Ng, R. (2017). Learning to synthesize a 4d rgbd

light ﬁeld from a single image. In 2017 IEEE Interna-

tional Conference on Computer Vision (ICCV), pages

2262–2270.

Su, H., Maji, S., Kalogerakis, E., and Learned-Miller, E.

(2015). Multi-view convolutional neural networks for

3d shape recognition. In Proceedings of the 2015

IEEE International Conference on Computer Vision

(ICCV), ICCV ’15, pages 945–953, Washington, DC,

USA. IEEE Computer Society.

Sunden, E., Steneteg, P., Kottravel, S., Jonsson, D., En-

glund, R., Falk, M., and Ropinski, T. (2015). Inviwo

- an extensible, multi-purpose visualization frame-

work. In 2015 IEEE Scientiﬁc Visualization Confer-

ence (SciVis), pages 163–164.

Wang, P., Li, W., Gao, Z., Zhang, Y., Tang, C., and Ogun-

bona, P. (2017a). Scene ﬂow to action map: A new

representation for rgb-d based action recognition with

convolutional neural networks. In 2017 IEEE Con-

ference on Computer Vision and Pattern Recognition

(CVPR), pages 416–425.

Wang, P.-S., Liu, Y., Guo, Y.-X., Sun, C.-Y., and Tong, X.

(2017b). O-cnn: Octree-based convolutional neural

networks for 3d shape analysis. ACM Trans. Graph.,

36(4):72:1–72:11.

Wetzstein, G., Lanman, D., Hirsch, M., and Raskar, R.

(2012). Tensor displays: Compressive light ﬁeld syn-

thesis using multilayer displays with directional back-

lighting. ACM Trans. Graph., 31(4):80:1–80:11.

Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X.,

and Xiao, J. (2015). 3d shapenets: A deep representa-

tion for volumetric shapes. In Proceedings of the IEEE

Conference on Computer Vision and Pattern Recogni-

tion, pages 1912–1920.

Zhou, T., Tulsiani, S., Sun, W., Malik, J., and Efros, A. A.

(2016). View synthesis by appearance ﬂow. In Com-

puter Vision - ECCV 2016, Lecture Notes in Computer

Science, pages 286–301. Springer, Cham.

Synthesising Light Field Volumetric Visualizations in Real-time using a Compressed Volume Representation

105