up to the first 10 minutes, one to three phases (from P1-P4) may occur, and the same holds after the 50th minute (for P5-P8). From the bottom diagram it can be seen that the order of the phases is relatively fixed and that their duration is limited (e.g. P1-P5).
Furthermore, some phases present almost no
temporal overlap (e.g. P1 vs. P4-P8 and P2 vs. P6-
P8), whereas some others present moderate overlap
(e.g. P1 vs. P3 and P2 vs. P5). From this analysis it is evident that the absolute temporal position of a video frame (with respect to the beginning of the operation) is a crucial factor that should be considered in phase recognition.
Based on the aforementioned dataset, we constructed the video shot dataset employed for classification. In particular, for each video and for each of the 8 phases, we extracted 2 non-overlapping shots of 10 s duration (i.e. 250 frames).
The video shots were extracted from random
temporal positions of each phase, ensuring that the
first/last frame of each shot was within the temporal
limits of each phase. In two videos, P7 was absent; thus, in order to have an equal number of shots per phase, we randomly selected 50 (out of 54) shots for each of phases P1-P6 and P8, leading to a total of 400 video shots, equally distributed across the 8 phases.
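For illustration, the sampling step can be sketched as follows (a minimal Python sketch; the annotation format, i.e. one start/end time in seconds per phase, and the 25 fps frame rate are assumptions of the example):

    import random

    FPS = 25            # 10 s shots therefore span 250 frames
    SHOT_LEN_S = 10
    SHOTS_PER_PHASE = 2

    def sample_shots(phase_start_s, phase_end_s, rng=random):
        # Draw SHOTS_PER_PHASE non-overlapping 10 s windows lying entirely
        # inside the phase (the phase is assumed long enough to hold them).
        shots = []
        while len(shots) < SHOTS_PER_PHASE:
            start = rng.uniform(phase_start_s, phase_end_s - SHOT_LEN_S)
            end = start + SHOT_LEN_S
            if all(end <= s or start >= e for s, e in shots):
                shots.append((start, end))
        # Convert shot boundaries from seconds to frame indices
        return [(int(s * FPS), int(e * FPS)) for s, e in shots]

    # e.g. a phase annotated between 120 s and 300 s of a video
    print(sample_shots(120.0, 300.0))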
2.2 CNN Feature Extraction
Feature extraction was based on transfer learning, using ‘off-the-shelf’ features extracted from four state-of-the-art CNNs: AlexNet, VGG19, GoogLeNet, and ResNet101. These network architectures were chosen as they are known to perform well on surgical endoscopy images (Petscharnig and Schöffmann 2018). Transfer learning means that the CNNs were pretrained, in this case on the ImageNet database, which contains millions of natural images distributed across 1000 classes. Although surgical images are substantially different, given the powerful architecture of the CNNs and the huge volume of ImageNet, transfer learning has proved to be a simple yet effective approach for content-based description of surgical images (Petscharnig and Schöffmann 2017). Moreover, our dataset is far too small to train these CNNs from scratch. However, as will be discussed later, we do perform training to model the temporal variation of the extracted CNN features.
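As a rough illustration of this transfer-learning setup (the framework is not specified in the text, so torchvision is an assumption of the example), the four ImageNet-pretrained networks can be loaded and kept frozen as follows:

    import torchvision.models as models

    # ImageNet-pretrained weights (older torchvision versions use pretrained=True)
    backbones = {
        'alexnet':   models.alexnet(weights='IMAGENET1K_V1'),
        'vgg19':     models.vgg19(weights='IMAGENET1K_V1'),
        'googlenet': models.googlenet(weights='IMAGENET1K_V1'),
        'resnet101': models.resnet101(weights='IMAGENET1K_V1'),
    }
    for net in backbones.values():
        net.eval()                       # inference only: 'off-the-shelf' features
        for p in net.parameters():
            p.requires_grad = False      # the pretrained weights stay frozen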
AlexNet consists of eight layers: 5 convolutional layers followed by 3 fully-connected (FC) layers. For each frame in a video shot, we extracted features from layer fc7, which is the before-final-FC (BFFC) layer, with length n1=4096. VGG19 is much deeper, consisting of 16 convolutional layers followed by 3 fully-connected layers. We again extracted features from the BFFC layer (fc7, n2=4096). GoogLeNet differs from AlexNet and VGG19: it includes several Inception modules with dimensionality reduction and only one fully-connected layer followed by a softmax layer (22 layers in total). For each frame we used the features extracted from the BFFC layer: pool5-7x7_s1 (n3=1024). Finally, the ResNet101 model is the deepest of the four (101 layers); it stacks several residual blocks, whose skip connections aim to alleviate the vanishing gradient problem usually encountered when many convolutional layers are stacked together. For ResNet101 we used the bottleneck features extracted from the BFFC layer: pool5 (n4=2048).
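Continuing the sketch above, the per-frame BFFC activations could be collected with forward hooks. Note that the torchvision layer names differ from the Caffe-style names quoted in the text; the mapping used below (AlexNet fc7 to classifier[4], ResNet101 pool5 to avgpool) is an assumption of the example:

    import torch

    def bffc_extractor(net, layer):
        # Returns a function mapping a batch of frames to their BFFC descriptors.
        feats = {}
        layer.register_forward_hook(
            lambda module, inp, out: feats.update(x=out.flatten(1)))
        def run(frames):                          # frames: (N, 3, 224, 224) tensor
            with torch.no_grad():
                net(frames)
            return feats['x']                     # (N, feature_dim)
        return run

    fc7 = bffc_extractor(backbones['alexnet'], backbones['alexnet'].classifier[4])
    pool5 = bffc_extractor(backbones['resnet101'], backbones['resnet101'].avgpool)

    shot = torch.rand(250, 3, 224, 224)           # one 10 s shot (random example)
    print(fc7(shot).shape)                        # torch.Size([250, 4096])
    print(pool5(shot).shape)                      # torch.Size([250, 2048])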
Based on the aforementioned approach we
extracted feature descriptors from each video frame.
In order to achieve a compact feature representation of the video shot, we concatenate the per-frame descriptors along the temporal dimension and apply two temporal pooling mechanisms: max-pooling and average-pooling. The former keeps the maximum value of each dimension of the BFFC layer across the frames of the shot, whereas the latter outputs the corresponding average. For each CNN architecture employed, both approaches result in a single feature descriptor per shot, with length equal to the size of the corresponding BFFC layer.
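Since the temporal pooling step is simply a reduction over the frame axis, a minimal sketch (using NumPy, with the per-frame descriptors of a shot stacked row-wise) is:

    import numpy as np

    def temporal_pool(frame_features, mode='max'):
        # frame_features: (n_frames, feature_dim) matrix of per-frame descriptors
        if mode == 'max':
            return frame_features.max(axis=0)     # max over the temporal axis
        return frame_features.mean(axis=0)        # average over the temporal axis

    shot = np.random.rand(250, 4096)              # e.g. 250 AlexNet fc7 vectors
    print(temporal_pool(shot, 'max').shape)       # (4096,)
    print(temporal_pool(shot, 'avg').shape)       # (4096,)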
2.3 CNN Input and Saliency Maps
The size of the input layer of the aforementioned CNNs is 227×227 (AlexNet) and 224×224 (VGG19, GoogLeNet, and ResNet101). In previous works, the original image is either resized to match the CNN's input, or resized so that its smaller side matches one side of the CNN input layer, after which the center crop is used as input to the CNN (Petscharnig and Schöffmann 2017), (Varytimidis et al. 2016). However, both approaches have some limitations. Considering that the original video has a 16:9 aspect ratio, the former case leads to spatial distortion of the original image, as the aspect ratio is forced to 1 (see Figure 3). In the latter case, image resizing does not affect the aspect ratio, but extracting features from the center crop may not lead to an efficient representation of the original frame, since the structures of interest are not always located in the center (see Figure 1: in P1 the trocar is located towards the upper-right corner, whereas in P4 the clips/tool-tip are at the bottom).
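For clarity, the two standard preprocessing options discussed above can be expressed with torchvision transforms (illustrative only; this is not the exact code of the cited works):

    from torchvision import transforms

    # (a) resize directly to the CNN input: the 16:9 frame is squeezed to 1:1
    resize_distort = transforms.Compose([
        transforms.Resize((224, 224)),            # aspect ratio forced to 1
        transforms.ToTensor(),
    ])

    # (b) resize the shorter side to 224 and take the 224x224 center crop:
    #     the aspect ratio is preserved, but off-center structures may be lost
    resize_center_crop = transforms.Compose([
        transforms.Resize(224),                   # shorter side -> 224
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])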