2013; Murala et al., 2010; Kumar and Pal, 2004; Byrd
et al., 2003). Kumar and Pal (Kumar and Pal, 2004)
and Byrd et al. (Byrd et al., 2003) proposed recording
systems that place a camera in the surgery room envi-
ronment. However, the view is easily occluded by the
surgeon’s head or body. Observing the surgical field
with a single camera without any occlusion is a diffi-
cult task. Matsumoto et al. and Murala et al. proposed recording systems that ask surgeons to wear cameras. However, these systems are not only limited by hardware in their ability to produce high-quality videos, but the cameras are also uncomfortable for surgeons to wear.
To solve this problem, Shimizu et al. proposed a
new system that embeds cameras on a surgical lamp.
This system not only allows one of the cameras to record the surgical field while one of the light bulbs illuminates it, but it also does not interfere with the surgeons during surgery.
2.2 Camera Switching System
Because the cameras obtain multiple videos of a single surgery, Shimizu et al. proposed a method that generates a single video by automatically selecting, at each moment, the image with the best view of the surgical field; the selection uses Dijkstra's algorithm based on the size of the surgical field. Hachiuma et al. (Hachiuma
et al., 2020) proposed Deep Selection, which selects
the camera with the best view of the surgery using
a fully supervised deep neural network. However, a
problem with these methods is that the video quality
is often low due to frequent changes in the viewing
direction.
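To make this formulation concrete, camera selection of this kind can be cast as a shortest-path problem over per-frame camera choices. The following sketch is only an illustration of that idea and is not the authors' implementation: the per-frame surgical-field sizes, the switching penalty, and the function name select_cameras are all assumptions made for this example.

import heapq

def select_cameras(field_sizes, switch_penalty=1.0):
    # Pick one camera per frame via a shortest path through a layered graph.
    # field_sizes[t][c]: visible surgical-field size for camera c at frame t
    # (larger is better); switch_penalty: cost for changing cameras between
    # consecutive frames.
    n_frames, n_cams = len(field_sizes), len(field_sizes[0])
    max_size = max(max(row) for row in field_sizes)

    def node_cost(t, c):
        # Turn "bigger field is better" into a non-negative cost.
        return max_size - field_sizes[t][c]

    # Dijkstra over states (frame, camera); each entry carries its camera path.
    heap = [(node_cost(0, c), 0, c, (c,)) for c in range(n_cams)]
    heapq.heapify(heap)
    settled = set()
    while heap:
        cost, t, c, path = heapq.heappop(heap)
        if (t, c) in settled:
            continue
        settled.add((t, c))
        if t == n_frames - 1:
            return list(path)  # first settled node in the last frame is optimal
        for nc in range(n_cams):
            step = node_cost(t + 1, nc) + (switch_penalty if nc != c else 0.0)
            heapq.heappush(heap, (cost + step, t + 1, nc, path + (nc,)))

# Toy example: 4 frames observed by 3 cameras.
sizes = [[0.9, 0.2, 0.4],
         [0.1, 0.8, 0.5],
         [0.2, 0.9, 0.4],
         [0.7, 0.3, 0.6]]
print(select_cameras(sizes, switch_penalty=0.5))

Raising the switching penalty trades view quality for temporal stability, which is exactly the tension noted above: frequent switches keep the best view but make the resulting video harder to watch.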
2.3 Novel View Synthesis
Novel view synthesis is a fundamental functionality and a long-standing problem in computer vision (Gortler et al., 1996; Levoy and Hanrahan, 1996; Davis et al., 2012).
For Non-medical Images: Recently, novel view
synthesis methods for everyday scenes have made sig-
nificant progress by using neural networks. Milden-
hall et al. (Mildenhall et al., 2020) proposed NeRF, a
method for synthesizing novel views of static, com-
plex scenes from a set of input images with known
camera poses. Wang et al. proposed IBRNet (Wang
et al., 2021), which introduced the Ray Transformer.
This method enables learning a generic view interpo-
lation function that generalizes to novel scenes, unlike
previous neural scene representation work that opti-
mized per-scene functions for rendering.

Figure 3: Examples of the data captured by the recording system proposed by Shimizu et al. (Shimizu et al., 2020). (a) Examples used for training, in which no cameras were obstructed by the surgeon's head. (b) A bad example in which the surgical area was hidden by the surgeon's head (Cam3).

Varma T. et al. proposed the Generalizable NeRF Transformer (GNT) (Wang et al., 2022), which introduced the View Transformer as the first stage. They also demonstrated that depth and occlusion could be inferred
from the learned attention maps, which implies that
the pure attention mechanism is capable of learning a
physically-grounded rendering process. Furthermore,
this transformer-based architecture makes it possible
to generate unseen scenes from source-view images.
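For reference, the physically grounded rendering process mentioned above is NeRF-style volume rendering, which composites colour samples along each camera ray according to their densities. The sketch below is a generic illustration with made-up sample values, not code from NeRF, IBRNet, or GNT.

import numpy as np

def volume_render(densities, colors, deltas):
    # NeRF-style alpha compositing along a single ray.
    # densities: (N,) non-negative density sigma at each sample point
    # colors:    (N, 3) RGB colour predicted at each sample point
    # deltas:    (N,) distance between consecutive sample points
    alphas = 1.0 - np.exp(-densities * deltas)                      # per-sample opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))  # transmittance
    weights = trans * alphas                                        # per-sample contribution
    rgb = (weights[:, None] * colors).sum(axis=0)                   # composited pixel colour
    return rgb, weights

# Toy ray with four hypothetical samples.
rgb, weights = volume_render(
    densities=np.array([0.1, 2.0, 5.0, 0.5]),
    colors=np.array([[0.9, 0.1, 0.1],
                     [0.1, 0.8, 0.1],
                     [0.1, 0.1, 0.9],
                     [0.5, 0.5, 0.5]]),
    deltas=np.full(4, 0.25),
)

The per-sample weights computed here play the role that attention weights learn to approximate in the transformer-based architectures discussed above.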
For Medical Images: Novel view synthesis methods are beginning to be used for medical scenes. Masuda et al. (Masuda et al., 2022) proposed C-BARF, which can synthesize novel view images of surgical scenes. They used the relative positions of the cameras to estimate the camera poses more accurately, and they achieved novel view synthesis for surgical videos that consist of only a small number of images.
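The general idea of exploiting known relative camera positions can be illustrated by chaining relative transforms into absolute poses; the sketch below is only a simplified illustration of that idea and does not reproduce C-BARF's actual formulation.

import numpy as np

def chain_relative_poses(first_pose, relative_poses):
    # Compose known relative transforms into absolute camera poses.
    # first_pose:     4x4 camera-to-world pose of the first camera
    # relative_poses: list of 4x4 transforms mapping camera i+1's frame
    #                 into camera i's frame
    poses = [first_pose]
    for rel in relative_poses:
        poses.append(poses[-1] @ rel)
    return poses

# Toy example: three cameras arranged along the x-axis, 10 cm apart.
shift = np.eye(4)
shift[0, 3] = 0.1
poses = chain_relative_poses(np.eye(4), [shift, shift])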
3 METHOD
Our objective is to generate the surgical areas occluded by the surgeon's or nurse's head in order to
create a comprehensive video for reviewing the surgi-
cal procedure. Our methodology can be divided into
five distinct steps.
The first step is to prepare several sets of multi-view training videos using the camera recording system proposed by Shimizu et al. (Shimizu et al., 2020). In order to train GNT on a diversity of surgeries and surgical procedures, we recorded surgeries of multiple types covering various surgical areas. These recordings contain numerous frames with occlusions, as depicted in Fig. 3-(b).
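As an illustration of how such multi-view training data might be organized, the sketch below groups synchronized views with their camera parameters and an occlusion flag. All field names and the MultiViewSample structure are assumptions made for this example rather than part of the recording system or of GNT.

from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class ViewRecord:
    # One camera's observation of a single synchronized frame.
    image_path: str          # path to the RGB frame from this camera
    intrinsics: np.ndarray   # 3x3 camera matrix
    extrinsics: np.ndarray   # 4x4 camera-to-world pose
    occluded: bool           # True if the surgical field is hidden (cf. Fig. 3-(b))

@dataclass
class MultiViewSample:
    # All synchronized views of one moment, used as one training unit.
    views: List[ViewRecord]

    def source_views(self):
        # Only unobstructed views (cf. Fig. 3-(a)) are useful as source images.
        return [v for v in self.views if not v.occluded]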