(Stingl et al., 2013). Our simulated retinal implant therefore consists of 40 × 40 electrode/photodiode pairs, where each electrode is circular with a radius of 50 µm. The electrodes were arranged evenly on a square grid with a center-to-center distance of 70 µm. Pulse trains were generated for a stimulus duration of 200 ms using monophasic cathodic pulses of 1 ms.
The temporal sampling step was set to 0.4 ms.
Thus, one still input image was represented as a pulse train of shape (40, 40, 500), i.e., a spatial input resolution of 40 × 40 with 500 simulated time steps. Pulses occurred at a working frequency of 5 Hz. Regarding the output of the network, we traded off computational complexity against model accuracy by sampling the retinal surface every 25 µm, yielding a spatial output resolution of 112 × 112 px. To obtain a single-time-step output from the neurophysiological network, we followed (Beyeler et al., 2017) and extracted the response at the time step with the highest output response. Please note that this network has no trainable parameters; its convolutional kernels are fixed as described in (Nanduri et al., 2012).
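To make the stimulus encoding concrete, the following is a minimal NumPy sketch of how a grayscale image could be turned into such a pulse-train tensor and how a single frame is read out. The linear pixel-to-amplitude mapping and all function names are illustrative assumptions, not the exact implementation used in the paper.

```python
import numpy as np

def encode_pulse_train(image, duration_ms=200.0, dt_ms=0.4,
                       pulse_width_ms=1.0, freq_hz=5.0):
    """Encode a 40x40 grayscale image (values in [0, 1]) as monophasic
    cathodic pulse trains of shape (40, 40, n_steps).

    ASSUMPTION: pixel intensity linearly scales the pulse amplitude."""
    n_steps = int(round(duration_ms / dt_ms))            # 500 time steps
    period_steps = int(round(1000.0 / freq_hz / dt_ms))  # steps per 5 Hz period
    width_steps = max(1, int(round(pulse_width_ms / dt_ms)))
    train = np.zeros(image.shape + (n_steps,), dtype=np.float32)
    for onset in range(0, n_steps, period_steps):
        # cathodic (negative) pulse, amplitude modulated per pixel
        train[..., onset:onset + width_steps] = -image[..., np.newaxis]
    return train

def max_response_frame(percept):
    """Collapse a (H, W, T) simulated percept to a single frame by taking
    the time step with the largest overall response (cf. Beyeler et al., 2017)."""
    t_max = np.argmax(percept.reshape(-1, percept.shape[-1]).sum(axis=0))
    return percept[..., t_max]
```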
2.2.1 Image Transformation Network
The original input image is processed by the image transformation network (see Figure 3 for details). Since we are interested in generally applicable transformations based on a rather small local neighbourhood (and want to avoid transformations based on image semantics), it consists of only four convolutional blocks with kernels of size 3 × 3. The first three convolutional layers comprise 32 trainable kernels each; the last layer has a single kernel to restore the shape of the original input image.
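Under these constraints, the network could be expressed as follows in PyTorch; the ReLU activations and zero padding are assumptions, since the text only specifies kernel counts and sizes.

```python
import torch.nn as nn

# Minimal sketch of the described four-block architecture (32-32-32-1 kernels,
# all 3x3); activation and padding choices are illustrative assumptions.
transformation_net = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, kernel_size=3, padding=1),  # back to single-channel 40x40
)
```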
2.2.2 Image Reconstruction Task
For the task of image reconstruction, the input image is fed into our proposed image transformation network, and its output is subsequently transformed into a simulated percept using the implemented neurophysiological tensor network (Figure 3, green path) with the parameters described in Section 2.2. Since the output resolution of the transformed visual percept does not match the shape of the original input image (40 × 40 electrodes as input, 112 × 112 sampled positions on the retina as output), the original image is bilinearly interpolated to match the output shape of the neurophysiological network. The dissimilarity between the two is then assessed using the mean-squared-error.
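A minimal sketch of this loss computation, assuming PyTorch tensors in (N, C, H, W) layout; the function name is ours:

```python
import torch.nn.functional as F

def reconstruction_loss(input_image, percept):
    """MSE between the simulated percept (N, 1, 112, 112) and the original
    input (N, 1, 40, 40), bilinearly upsampled to the percept resolution."""
    target = F.interpolate(input_image, size=percept.shape[-2:],
                           mode='bilinear', align_corners=False)
    return F.mse_loss(percept, target)
```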
2.2.3 Classification Task
We chose a simple image classification task with 10 object classes to evaluate the general plausibility of our system. For the task of object classification, we seek a transformation of the input images such that their corresponding visual percepts, generated by the neurophysiological network, lead to an increased classification accuracy compared to their unaltered counterparts. Therefore, after feeding the input image through the transformation network and the spatiotemporal network, the output of the latter is fed to a standard classification convolutional neural network consisting of convolutional blocks followed by a multilayer perceptron (please refer to Figure 3 for an overview). Here, categorical cross-entropy is used as the objective function.
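As a rough illustration, such a classifier head could be sketched as follows; the specific depths, channel widths, and layer sizes are hypothetical, since the text only specifies the overall structure (convolutional blocks plus a multilayer perceptron, trained with cross-entropy on the logits).

```python
import torch.nn as nn

# Hypothetical classifier operating on the 112x112 single-channel percept.
classifier = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 112 -> 56
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 56 -> 28
    nn.Flatten(),
    nn.Linear(32 * 28 * 28, 128), nn.ReLU(),
    nn.Linear(128, 10),  # 10 object classes; pair with nn.CrossEntropyLoss
)
```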
3 EVALUATION
3.1 Image Reconstruction
The proposed image reconstruction task was tested on the popular MNIST data set (Y. LeCun, 1998), comprising binary images of handwritten digits. This data set is of particular interest since, due to its clear figure/ground separation, the qualitative assessment of the learned transformations is easy to grasp. Furthermore, enhancing the visual percepts of digits (and even more so of letters) is an everyday visual task of potentially great importance for patients suffering from RP and treated with a retinal implant. The mean-squared-error was used as the objective function, assessing the dissimilarity between the input image and its virtually perceived version.
For the training of the network, the training set comprised 50000 images belonging to 10 classes of digits (0–9). Training was performed batch-wise (n = 128) for 500 epochs, and a validation set of 10000 different images was evaluated after each epoch. Standard stochastic gradient descent was used for optimization with a fixed learning rate of 0.01.
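A schematic PyTorch training loop with this configuration might look as follows; `train_loader`, the fixed (non-trainable) percept simulator `neuro_net`, and the names `transformation_net` and `reconstruction_loss` refer to the earlier sketches and are assumptions, not the authors' code.

```python
import torch

# SGD with the stated fixed learning rate; only the transformation network
# has trainable parameters, the percept simulator stays frozen.
optimizer = torch.optim.SGD(transformation_net.parameters(), lr=0.01)

for epoch in range(500):
    for images, _ in train_loader:      # batches of n = 128 MNIST images
        percept = neuro_net(transformation_net(images))
        loss = reconstruction_loss(images, percept)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```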
Figure 4a) shows the mean-squared-error (mse) over time, using a logarithmic scale for better visibility. As can be seen, training and validation loss decrease significantly until the validation loss appears to saturate after around 300 epochs. As a quantitative reference, the baseline mse (without image transformation) over the validation set is 0.067, whereas it drops to 0.035 after 500 epochs.
For a qualitative visual comparison, the last two
rows of Figure 5 show exemplary results given an in-
put image from the validation set (first column), its