and precise segmentation of biomedical images. As
described in (Ronneberger et al., 2015), it consists of a
contracting path and an expansive path. The contracting
path consists of the repeated application of two 3 x 3
convolutions (same padding), each followed by a rectified
linear unit (ReLU), and a 2 x 2 max pooling operation for
downsampling. At each downsampling step, the number of
feature channels is doubled. To avoid overfitting, we apply
dropout before each max pooling operation, which acts as a
regularizer when training the network. After each
convolution layer, we apply batch normalization to reduce
internal covariate shift, allowing each layer to learn
somewhat more independently of the other layers.
Figure 3: U-Net architecture (Ronneberger et al., 2015).
Every step in the expansive path consists of an upsampling
of the feature map followed by a 3 x 3 up-convolution
(same padding) that halves the number of feature channels,
a concatenation with the corresponding feature map from
the contracting path, and two 3 x 3 convolutions, each
followed by a ReLU. At the final layer, a 1 x 1 convolution
is used to map each 16-component feature vector to the
desired (binary) segmented image.
3.1.2 Training U-Net
Each CT volume is resampled to a voxel size of 1 mm x
1 mm x 1 mm. A window of width 550 HU and level 25 HU
is applied to the resampled CT volume: all values less
than -250 HU are mapped to -1000 HU, and those above
300 HU are mapped to 1000 HU. The network is trained on
axial slices of the CT volume. (Annotations in the dataset
included only the shortest axis of each LN in the axial
view, in accordance with RECIST criteria.) For a given CT
volume, the mask volume is generated programmatically by
drawing an ellipsoid/sphere whose center and shortest
axis/diameter follow the given annotations. Thereafter,
axial slices with at least one annotated LN are cropped to
256 x 256 pixels about the center so that they contain the
mediastinum with a sufficient margin. Alternatively, the
mediastinum can be extracted by segmenting the lungs using
a number of morphological operations followed by
appropriate cropping of the region between the segmented
lungs. Random rotations, horizontal flips, shear, zoom
(0.9-1.1), and horizontal and vertical translations are
used for data augmentation. Both the input axial slice and
the corresponding mask are augmented dynamically during
training.
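A short sketch of this windowing and mask generation is shown below, assuming NumPy arrays of Hounsfield units and isotropic 1 mm voxels after resampling; the function and variable names are our own.

    import numpy as np

    def window_volume(vol_hu):
        # window of width 550 HU at level 25 HU: values below -250 HU are
        # mapped to -1000 HU, values above 300 HU to 1000 HU
        vol = vol_hu.astype(np.float32)
        vol[vol < -250] = -1000.0
        vol[vol > 300] = 1000.0
        return vol

    def sphere_mask(shape_zyx, center_zyx, diameter_mm):
        # binary mask volume with a sphere of the annotated shortest
        # diameter drawn at the annotated center (1 mm isotropic voxels)
        zz, yy, xx = np.ogrid[:shape_zyx[0], :shape_zyx[1], :shape_zyx[2]]
        dist2 = ((zz - center_zyx[0]) ** 2 + (yy - center_zyx[1]) ** 2
                 + (xx - center_zyx[2]) ** 2)
        return (dist2 <= (diameter_mm / 2.0) ** 2).astype(np.uint8)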
The output image is thresholded at 0.35-0.5, depending on
the required sensitivity level (a value >= 0.35 indicates
part of a detected LN). Subsequently, the centroids of the
detected LNs are found by calculating the moments of the
contours in the binarized image and translating the
coordinates from the cropped 256 x 256 image back to the
original 512 x 512 axial view. The network is trained
until the “test” Dice coefficient reaches the [0.60, 0.65]
range. A much higher Dice coefficient may defeat the
purpose of Section 3.1, which is to maintain high
sensitivity while detecting LNs, and could otherwise
adversely affect Section 3.2 in the aforementioned
pipeline.
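A sketch of this post-processing step, using OpenCV contour moments, appears below; the crop offset argument (the corner of the 256 x 256 crop within the full 512 x 512 slice) is an assumed input.

    import cv2
    import numpy as np

    def detect_centroids(prob_map_256, crop_offset_xy, threshold=0.35):
        # threshold the network output, compute contour moments to obtain
        # candidate centroids, and translate them back into the original
        # 512 x 512 axial frame
        binary = (prob_map_256 >= threshold).astype(np.uint8)
        contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        centroids = []
        for contour in contours:
            m = cv2.moments(contour)
            if m["m00"] == 0:
                continue
            cx = m["m10"] / m["m00"] + crop_offset_xy[0]
            cy = m["m01"] / m["m00"] + crop_offset_xy[1]
            centroids.append((cx, cy))
        return centroids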
3.1.3 Preliminary CADe
For comparison between different models (explained later)
and previous works on mediastinal LN detection, we use a
preliminary CADe system (Liu et al., 2014) for detecting
LN candidates from mediastinal CT volumes, in which the
lungs are segmented automatically and shape features based
on Hessian analysis, local scale, and circular
transformation are computed at the voxel level. The system
uses a spatial prior of anatomical structures (lung, spine,
esophagus, heart, etc.) via multi-atlas label fusion before
detecting LN candidates using a Support Vector Machine
(SVM) for classification.
3.2 False Positive Reduction
Corresponding to each candidate, a 3D VOI of 32 x 32 x 32
voxels is extracted, centered on the candidate centroid.
In order to increase training data variation and to avoid
overfitting, each normalized VOI is also flipped
(horizontally and vertically), translated, and rotated
about a random vector in 3D space. Furthermore, each
flipped, translated, and rotated VOI is augmented with
Gaussian Noise, Gaussian Offset, and Elastic Transform a
varying number of times, depending on the scale of data
augmentation. Different sampling and augmentation rates
for positive and negative VOIs are used to obtain a
reasonably balanced training set.
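As an illustration, a minimal VOI extraction and flip augmentation sketch is given below; the z-y-x axis order and the border padding with air (-1000 HU) are our assumptions.

    import numpy as np

    def extract_voi(volume, centroid_zyx, size=32):
        # 32 x 32 x 32 VOI centered on the candidate centroid; the volume
        # is padded so candidates near the border still yield a full cube
        half = size // 2
        padded = np.pad(volume, half, mode="constant", constant_values=-1000)
        z, y, x = (int(round(c)) + half for c in centroid_zyx)
        return padded[z - half:z + half, y - half:y + half, x - half:x + half]

    def random_flip(voi, rng=np.random):
        # horizontal and vertical flips used during training-time augmentation
        if rng.rand() < 0.5:
            voi = voi[:, ::-1, :]
        if rng.rand() < 0.5:
            voi = voi[:, :, ::-1]
        return voi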
When classifying an unseen VOI, we make use of
Test Time Augmentation (TTA). It involves creating