Therefore, a modal transformation neural network that transforms each modal image into another modal image is also trained, and this is also used as a condition for image generation. Furthermore, by sharing some of these networks, a multi-task learning framework is applied to estimate the multimodal images with high accuracy.
In the following, to simplify the discussion, we consider the case where images of different modalities are captured using filters that transmit light of red and blue wavelengths, respectively, as shown in Fig.3. Let $I_i$ be the modal mixed image taken at viewpoint $i$. It is also assumed that, at each viewpoint, it is known which modal information was acquired at which pixel, and that a mask $M^R_i$ that sets the non-red region to 0 and a mask $M^B_i$ that sets the non-blue region to 0 are available. Under these conditions, the objective is to bring the outputs $\hat{I}^R_i = P(\theta^R_i, N)$ and $\hat{I}^B_i = P(\theta^B_i, N)$ obtained by Deep Image Prior close to their respective modal images. In this section, we consider, in particular, the constraints available when focusing on $\hat{I}^R_i$.
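To make this setup concrete, the following is a minimal PyTorch sketch of how the Deep Image Prior generation $P(\theta, N)$ can be arranged: a fixed noise tensor $N$ is passed through two separately parameterized networks to obtain the candidate modal images. The architecture, its size, and all names are illustrative assumptions, not the configuration used in this work.

```python
import torch
import torch.nn as nn

class DIPGenerator(nn.Module):
    """Minimal Deep-Image-Prior-style generator P(theta, N).

    A small convolutional stand-in; the actual architecture used in the
    paper is not specified here, so this is only illustrative.
    """
    def __init__(self, noise_channels=32, out_channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(noise_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_channels, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, noise):
        return self.net(noise)

# Fixed random input N, shared by both modal generators at viewpoint i.
H, W = 256, 256
N = torch.randn(1, 32, H, W)

P_R = DIPGenerator()  # parameters theta_i^R
P_B = DIPGenerator()  # parameters theta_i^B

I_hat_R = P_R(N)      # candidate red-modal image  \hat{I}_i^R
I_hat_B = P_B(N)      # candidate blue-modal image \hat{I}_i^B
```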
3.3 Image Inpainting Constraint
First, consider the constraint in which image inpainting is used to generate an image of the target modality from the modal mixed image. In this case, the error in the unmasked region becomes the evaluation function for image generation, as follows:

$$\varepsilon_P = \left\| M^R_i \hat{I}^R_i - M^R_i I_i \right\|^2 \qquad (3)$$
Minimizing this evaluation function yields an image that reproduces the input in regions where red information is captured directly, and an image interpolated from the input in regions where it is not.
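As a minimal sketch of Eq. (3), assuming the mixed image, the mask, and the generated image are PyTorch tensors of the same shape (all names and sizes below are illustrative placeholders):

```python
import torch

def inpainting_loss(I_hat_R, I_i, M_R_i):
    """Masked L2 error of Eq. (3): || M_i^R * I_hat_i^R - M_i^R * I_i ||^2.

    M_R_i is 1 where red information was actually captured and 0 elsewhere,
    so only directly observed pixels contribute to the error.
    """
    diff = M_R_i * I_hat_R - M_R_i * I_i
    return (diff ** 2).sum()

# Placeholder tensors; in practice I_hat_R comes from the Deep Image Prior
# generator and M_R_i from the known filter layout at viewpoint i.
H, W = 256, 256
I_hat_R = torch.rand(1, 1, H, W, requires_grad=True)
I_i = torch.rand(1, 1, H, W)
M_R_i = (torch.rand(1, 1, H, W) > 0.5).float()
eps_P = inpainting_loss(I_hat_R, I_i, M_R_i)
```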
3.4 Constraint from Viewpoint Transformation
Next, we consider using information from images taken from different viewpoints by estimating the disparity between the images. In this study, we extend the method of Luo et al. (Luo et al., 2018) to estimate the disparity. In this method, multiple copies of the image to be viewpoint transformed are prepared in advance, each shifted by $k$ pixels; these are called the shifted image set $S_k(I)$. A weight map $W_k$ representing the weight of each pixel is estimated for each image in this shifted image set, and the weighted average of the shifted images under these weight maps is computed to generate the viewpoint-transformed image. The weight map $W_k$ indicates, for each pixel of the viewpoint-transformed image, from which shifted image the pixel value is referenced, and by optimizing this map, an image that is appropriately shifted according to the disparity can be generated.

Figure 4: Overview of the Viewpoint Transformation Using Shifted Image Set.
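The following is a minimal sketch of this shift-and-weight operation, assuming horizontal integer-pixel shifts implemented with torch.roll and a softmax over the shift candidates so that the per-pixel weights sum to one; the shift range, the normalization, and all names are illustrative assumptions rather than the exact formulation of Luo et al.:

```python
import torch
import torch.nn.functional as F

def shifted_image_set(image, shifts):
    """Build S_k(I): copies of `image` shifted horizontally by k pixels."""
    # image: (B, C, H, W); returns (K, B, C, H, W)
    return torch.stack([torch.roll(image, shifts=k, dims=-1) for k in shifts])

def viewpoint_transform(image, weight_logits, shifts):
    """Weighted average of the shifted images, sum_k W_k * S_k(I).

    weight_logits: (K, B, 1, H, W); softmax over K makes the per-pixel
    weights a convex combination of the shift candidates.
    """
    S = shifted_image_set(image, shifts)   # (K, B, C, H, W)
    W = F.softmax(weight_logits, dim=0)    # (K, B, 1, H, W)
    return (W * S).sum(dim=0)              # (B, C, H, W)

# Example with placeholder sizes.
B, C, H, W = 1, 1, 64, 64
shifts = list(range(-4, 5))                # candidate disparities in pixels
image = torch.rand(B, C, H, W)
weight_logits = torch.zeros(len(shifts), B, 1, H, W, requires_grad=True)
warped = viewpoint_transform(image, weight_logits, shifts)
```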
In the method of Luo et al., as shown in Fig.4, the viewpoint transformation is adapted to the input by learning the relationship between the input and $W_k$ in advance. In this study, the viewpoint transformation is instead performed by generating this weight map with Deep Image Prior. However, since the images handled in this study are modal mixed images, the number of pixels that can be compared is limited when the input images are compared directly. Therefore, the viewpoint-transformed image is estimated by comparing the generated image $\hat{I}^m_i$ ($m \in \{R, B\}$) at viewpoint $i$ with the generated image $\hat{I}^m_j$ at viewpoint $j$. The weight map $W^k_j$, which represents the $k$-pixel shift for converting the viewpoint-$j$ image to the viewpoint-$i$ image, is thereby optimized as follows:
$$\varepsilon^V_{j \to i} = \sum_{m \in M} \left\| \hat{I}^m_i - \sum_k W^k_j\, S_k\!\left(\hat{I}^m_j\right) \right\|^2 \qquad (4)$$
where $M = \{R, B\}$. When the generated images are fixed, this function serves as an evaluation function for the viewpoint transformation. Conversely, if $W$ is fixed and the generated images are variable, it becomes a constraint on image generation that takes the result of the viewpoint transformation into account. In this study, $W$ and $\hat{I}$ are optimized jointly, so that viewpoint transformation and image generation are performed simultaneously.
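A minimal sketch of Eq. (4), combining the shift-and-weight operation sketched above with a sum over the two modalities; sharing one weight map across modalities reflects that the disparity between viewpoints does not depend on the modality (names, shapes, and the softmax normalization are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def warp(image, weight_logits, shifts):
    """sum_k W_j^k * S_k(image): weighted average over horizontally shifted copies."""
    S = torch.stack([torch.roll(image, shifts=k, dims=-1) for k in shifts])
    W = F.softmax(weight_logits, dim=0)  # per-pixel convex weights over shift candidates
    return (W * S).sum(dim=0)

def viewpoint_loss(I_hat_i, I_hat_j, weight_logits, shifts):
    """Eq. (4): sum_m || I_hat_i^m - sum_k W_j^k S_k(I_hat_j^m) ||^2, m in {R, B}.

    I_hat_i, I_hat_j: dicts mapping modality ('R', 'B') to generated images at
    viewpoints i and j. A single weight map W_j is shared by both modalities.
    """
    loss = 0.0
    for m in ('R', 'B'):
        warped_j = warp(I_hat_j[m], weight_logits, shifts)
        loss = loss + ((I_hat_i[m] - warped_j) ** 2).sum()
    return loss

# Placeholder generated images; in practice these come from the Deep Image Prior
# generators, and weight_logits is optimized jointly with them.
B, C, H, W = 1, 1, 64, 64
shifts = list(range(-4, 5))
I_hat_i = {m: torch.rand(B, C, H, W) for m in ('R', 'B')}
I_hat_j = {m: torch.rand(B, C, H, W, requires_grad=True) for m in ('R', 'B')}
weight_logits = torch.zeros(len(shifts), B, 1, H, W, requires_grad=True)
eps_V = viewpoint_loss(I_hat_i, I_hat_j, weight_logits, shifts)
```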
3.5 Modality Transformation by Neural Network
Next, consider how to use images from other viewpoints and other modalities for image generation. As shown in Fig.3, since each modal mixed image is taken from a different viewpoint (subaperture), the filter pass points are different. Therefore, there are over-