It can be seen that, when the indexes are applied to the same segmentation, $\mathrm{Prec}(T, T) = 1$, $\mathrm{Rec}(T, T) = 1$, and $\mathrm{Cnt}(T, T) = 0$. Intuitively, the more similar the two segmentations $T$ and $T'$, the closer the precision and recall indexes are to 1 and the closer the count error index is to 0. In the extreme case where $T = \{I\}$, i.e., $T$ consists of a single region covering the full image, recall is 1, whereas precision may be low and count error may be high; in the opposite case, if $T = \{\{i\} : i \in I\}$, i.e., if regions of $T$ correspond to single pixels of $I$, then precision is 1, recall may be low, and count error may be high.
Let $T^\star$ be the unknown desired segmentation of a mosaic image $I$ in which each region exactly corresponds to a tile in the image. The goal is to find a method that, for any image $I$ of a mosaic, outputs a segmentation $T$ which maximizes $\mathrm{Prec}(T, T^\star)$ and $\mathrm{Rec}(T, T^\star)$ and minimizes $\mathrm{Cnt}(T, T^\star)$.
3 U-net FOR MOSAIC SEGMENTATION
We propose a solution for the mosaic image segmentation problem described in the previous section that is based on a kind of Convolutional Neural Network (CNN). We assume that a learning set composed of images of mosaics and the corresponding desired segmentations is available. In a learning phase, to be performed just once, the learning set is used to learn the values of the parameters of the network. Then, once learned, the network is used in a procedure that takes any image I as input and outputs a segmentation T.
The CNN used in this study is known as U-net,
the name deriving from the shape of the ANN archi-
tecture. U-net was introduced by Ronneberger et al.
(2015) who used it for the segmentation of neuronal
structures in electron microscopic stacks: according
to the cited study, U-net experimentally outperformed
previous approaches.
When applied to an image, a U-net works as a binary classifier at the pixel level, i.e., it takes as input a 3-channel (RGB) image and returns as output a two-channel image. In the output image, the two channels correspond to the two classes and together encode whether each pixel belongs to the artifact of interest (in our case, a tile of the mosaic).
In order to obtain a segmentation from the output of the U-net, we (i) consider the single-channel image that is obtained by applying the softmax function pixel-wise to the two channels of the ANN output and keeping just the first value, which we call the pixel intensity and denote by p(i); (ii) compare each pixel intensity against a threshold τ; (iii) merge sets of adjacent pixels that exceed the threshold, hence obtaining connected regions. We discuss this procedure in detail in Section 3.2.
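The three steps (i) to (iii) can be illustrated with a minimal, dependency-free sketch. It is not the implementation used in this study: the function names, the strict comparison against τ, the default τ = 0.5, and the use of 4-connectivity to decide which pixels are adjacent are all assumptions made for illustration, since the actual procedure is specified in Section 3.2.

```python
import math
from collections import deque


def pixel_intensity(logits):
    """Step (i): softmax over the two channels of one pixel, keeping the first value."""
    a, b = logits
    m = max(a, b)  # subtract the maximum for numerical stability
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb)


def segment(output, tau=0.5):
    """Turn an H x W grid of (channel0, channel1) pairs into a list of regions.

    Each region is a set of (row, col) pixels. The threshold tau and the
    4-connectivity rule are illustrative assumptions.
    """
    h, w = len(output), len(output[0])
    # Step (ii): compare each pixel intensity against the threshold.
    above = [[pixel_intensity(output[r][c]) > tau for c in range(w)] for r in range(h)]
    seen = [[False] * w for _ in range(h)]
    regions = []
    # Step (iii): merge adjacent above-threshold pixels with a breadth-first flood fill.
    for r in range(h):
        for c in range(w):
            if not above[r][c] or seen[r][c]:
                continue
            region, queue = set(), deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.add((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and above[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions
```

Calling `segment(output, tau)` on a grid of two-channel values returns one set of pixel coordinates per connected region; pixels whose intensity does not exceed τ belong to no region.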
Internally, the U-net is organized as follows: a contracting path, made of a series of 3 × 3 un-padded convolutions followed by max-pooling layers, captures context, while an expanding path, consisting of transposed convolutions and cropping operations, ensures precise feature localization (Ronneberger et al., 2015).
In our study we used an instance of the U-net tailored to input images of 400 × 400 pixels. In the contracting path, two 2-D un-padded convolution steps of size 3, each with 32 filters and followed by a rectified linear unit (ReLU), precede a max-pooling layer with a 2 × 2 pool size. The same structure is repeated four times, each time increasing the number of filters, to 64, 128, 256, and 512. At the end of the contraction phase, the 400 × 400 pixel input image is reshaped into a 17 × 17 × 512 tensor. In the expansion path, we start with a 2 × 2 up-sampling 2-D layer applied to the feature map, followed by a concatenation with the correspondingly cropped feature map from the contracting phase and by two 2-D un-padded 3 × 3 convolution steps, each with ReLU activation. The same procedure is repeated four times, each time halving the number of convolution filters, leading to a tensor of shape 216 × 216 × 32. Finally, a zero-padding 2-D layer reshapes the tensor to 400 × 400 × 32, prior to a 1-D convolution step with two filters that outputs a 400 × 400 × 2 tensor, which constitutes the output of the U-net. The output is then used to compute pixel intensities and hence the segmentation, as briefly sketched above and detailed in Section 3.2.
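The spatial size reached at the end of the contracting path can be checked with a few lines of arithmetic. The sketch below is only a bookkeeping aid, not part of the network code: the function name is hypothetical, and it assumes that each un-padded convolution of size 3 shrinks the side by 2 pixels, that max-pooling follows every block except the last one, and that pooling an odd side rounds down.

```python
def contracting_path_sizes(size=400, filters=(32, 64, 128, 256, 512), kernel=3, pool=2):
    """Return (spatial side, number of filters) after each block of the contracting path."""
    shapes = []
    for k, n in enumerate(filters):
        # Two un-padded convolutions: each one shrinks the side by kernel - 1.
        size -= 2 * (kernel - 1)
        shapes.append((size, n))
        # Max-pooling after every block except the last (assumption); odd sides are floored.
        if k < len(filters) - 1:
            size //= pool
    return shapes


print(contracting_path_sizes())
# [(396, 32), (194, 64), (93, 128), (42, 256), (17, 512)]
```

Under these assumptions the last entry, a side of 17 with 512 filters, matches the tensor size stated for the end of the contraction phase.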
3.1 Learning
Let $L = \{(I_1, T^\star_1), \dots, (I_m, T^\star_m)\}$ be the learning set composed of $m$ pairs, each consisting of a mosaic image $I_i$ and the corresponding desired segmentation $T^\star_i$, obtained by manual annotation. The outcome of the learning phase consists of the weights $\theta$ of the U-net.
We first preprocess the pairs in the learning set $L$ as follows, obtaining a different learning set $L'$, for which $|L'| = |L|$ does not generally hold.
1. We rescale each pair $(I, T^\star) \in L$ so as to obtain a given tile density $\rho_0 = \frac{|T^\star|}{|I|}$, i.e., a given ratio between the number of tiles in the image and the image size; $\rho_0$ is a parameter of our method. We use bicubic interpolation over a 4 × 4 pixel neighborhood.