Mosaic Images Segmentation using U-net

Gianfranco Fenu

, Eric Medvet

, Daniele Panﬁlo and Felice Andrea Pellegrino

Dipartimento di Ingegneria e Architettura, Universit

a degli Studi di Trieste, Trieste, Italy

Keywords:

Cultural Heritage, Computer Vision, Deep Learning, Convolutional Neural Networks.

Abstract:

We consider the task of segmentation of images of mosaics, where the goal is to segment the image in such a

way that each region corresponds exactly to one tile of the mosaic. We propose to use a recent deep learning

technique based on a kind of convolutional neural networks, called U-net, that proved to be effective in seg-

mentation tasks. Our method includes a preprocessing phase that allows to learn a U-net despite the scarcity

of labeled data, which reﬂects the peculiarity of the task, in which manual annotation is, in general, costly. We

experimentally evaluate our method and compare it against the few other methods for mosaic images segmen-

tation using a set of performance indexes, previously proposed for this task, computed using 11 images of real

mosaics. In our results, U-net compares favorably with previous methods. Interestingly, the considered meth-

ods make errors of different kinds, consistently with the fact that they are based on different assumptions and

techniques. This ﬁnding suggests that combining different approaches might lead to an even more effective

segmentation.

1 INTRODUCTION AND

RELATED WORKS

Cultural heritage is one of the most important assets

of the society. Its preservation and restoration are

time-consuming activities performed by experts and

often consist in manual analysis of ﬁne details of the

works. It is hence natural that these tasks, as many

others where human experts are involved in some

form of data processing, are subjected to automa-

tion using machine learning techniques. Differently

than other domains, however, tasks concerning cul-

tural heritage may be harder because of the scarcity

of labeled data and nature of the data itself. Despite

these limitations, successful examples of applications

exist, e.g., (Assael et al., 2019), and progresses in the

techniques for different kinds of data pave the way for

other successful applications.

In this work we consider a particular kind of artis-

tic works, i.e., mosaics. Mosaics are assemblies of

small pieces of stone or similar materials, called tiles

or tessellae, glued together with some binder or ﬁller,

such that the overall appearance of the assembly looks

like a painting or some decorative pattern. Mosaics

https://orcid.org/0000-0003-0867-8388

https://orcid.org/0000-0001-5652-2113

https://orcid.org/0000-0002-4423-1666

constitute an essential component of the cultural her-

itage for many (ancient) civilizations. Preservation,

and, to some degree, restoration of mosaics might be

enhanced if digital versions of the works were avail-

able. Moreover, the access to the artistic works might

be eased using digital means, possibly as part of a pro-

cess in which hard copies are obtained starting from

digital copies, hence enlarging the portion of popula-

tion that can access mosaics, regardless of their phys-

ical location (Neum

uller et al., 2014). There have

been a couple of approaches, namely (Youssef and

Derrode, 2008; Bartoli et al., 2016), that proposed

automatic methods for obtaining a digital version of

the mosaic. All of them take as input an image of

the mosaic, that can be cheaply obtained also for not-

relocable mosaics, and output a segmentation of the

image in which regions should correspond to tiles.

Starting from the segmentation, a digital version of

the mosaic may be obtained straightforwardly, hence

easing the mosaic preservation and restoration and

making it more accessible (Comes et al., 2014).

Here we propose a novel technique for mosaic im-

age segmentation that is based on a recently proposed

kind of convolutional neural networks (CNN), called

U-net (Ronneberger et al., 2015). Our approach dif-

fers from the previous ones in the way the mosaic im-

age is processed. The U-net processes the image at

the pixel level, differently than the proposal by Bartoli

Fenu, G., Medvet, E., Panﬁlo, D. and Pellegrino, F.

Mosaic Images Segmentation using U-net.

DOI: 10.5220/0008967404850492

In Proceedings of the 9th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2020), pages 485-492

ISBN: 978-989-758-397-1; ISSN: 2184-4313

485

et al. (2016), but permits, by design, that some pixels

are not associated with any region, differently than the

approach of Youssef and Derrode (2008): this means

that using U-net for segmentation allows to model the

presence of the ﬁller. A key component of our ap-

proach is in the preprocessing phase that is part of the

learning process: we propose a method for augment-

ing the dataset in such a way that the learning of U-net

parameters is effective even when a small number of

annotated examples are available. In facts, manual

annotating mosaic images is a costly process (Bartoli

et al., 2016).

We assess experimentally our approach applying

it to 11 images of real mosaics, differing in style, age,

and quality (both of the image and of the mosaic it-

self in terms of wear). We compare the segmentation

based on U-net against previous methods using a set

of established performance indexes suitable for the

mosaic image segmentation task and we found that

our method outperforms the other ones in the most

relevant index. Moreover, we show that the way in

which the three methods make errors in analyzing the

image varies consistently with the fact that the meth-

ods are based on different underlying assumptions.

This ﬁnding opens an opportunity for designing an

even more effective method where U-net segmenta-

tion is a step of a more complex procedure which in-

volves also other processing steps, eventually result-

ing in a better segmentation effectiveness.

Despite the availability of a “digital model” could

be very useful, in the literature only few segmenta-

tion algorithms have been proposed, taking into ac-

count the speciﬁc structure features of a mosaic, i.e.,

shape, organisation, color of tiles, and the presence

of the ﬁller. In particular, in (Youssef and Derrode,

2008) the proposed approach aims to detect and to

extract the tile from the ﬁller, using the well-known

watershed algorithm (Vincent and Soille, 1991) and

some mosaic-speciﬁc preprocessing. In (Bartoli et al.,

2016) the authors proposal goal is the same, but they

employed deformable models as ﬂexible shapes to be

superimposed on the mosaic picture and to be adapted

to the effective shapes of the tiles. The optimization of

such deformable shapes has been performed by means

of a genetic algorithm.

In addition to these approaches, many other tech-

niques have been applied with the aim to obtain a

digital model of a mosaic: among others laser scan-

ners and photogrammetry (Fazio et al., 2019), seg-

mentation based on already available mosaic cartoons

(Monti and Maino, 2011). We refer the reader to

(Benyoussef and Derrode, 2011) for a detailed review.

Regarding the U-nets, there are many applications

in biomedical image segmentation, e.g., (Falk et al.,

2019). Variations of the U-nets have also been applied

to volumetric segmentation from sparsely annotated

volumetric images (C¸ ic¸ek et al., 2016), road extrac-

tion from aerial images (Zhang et al., 2018), and in

case of ambiguous images, i.e., when many different

annotations are available for every single image (Kohl

et al., 2018).

2 PROBLEM STATEMENT

The goal of this work is to propose a method for seg-

menting an image of a mosaic in such a way that, for

each tile of the mosaic in the image, all and only the

corresponding pixels are assigned to the same region

of the segmentation.

More formally, we call a region of the image I a

subset of adjacent pixels of I. We call a segmentation

of an image I a set T = {T

, . . . , T

} of disjoint regions

of I, i.e., ∀i, j, T

∩ T

Let T and T

be two segmentations of the same

image I. Accordingly to (Fenu et al., 2015), we deﬁne

the following three indexes:

Prec(T , T

) =

|T |

∑

T ∈T

max

∈T

|T ∩ T

|T |

(1)

Rec(T , T

) =

∑

∈T

max

T ∈T

|T ∩ T

(2)

Fm(T , T

) = 2

Prec(T , T

)Rec(T , T

)

Prec(T , T

) + Rec(T , T

)

(3)

Cnt(T , T

) =

abs(|T

| − |T |)

(4)

where |T | is the number of pixels in the region T, |T |

is the number of regions in the segmentation T , and

T ∩ T

is the set of pixels which belong to both T and

The precision index Prec(T , T

) is the average

precision of regions in T , where the precision of a

region T is the largest ratio

|T ∩T

|T |

among different

∈ T

, i.e., the proportion of T pixels which be-

long to the region of T

with which T overlaps most.

The recall index Rec(T , T

) is the average recall of

regions in T

, where the recall of a region T

is the

largest ratio

|T ∩T

among different T ∈ T , i.e., the

proportion of T

pixels which belong to the region

of T with which T

overlaps most—it can be noted

that Prec(T , T

) = Rec(T

, T ).. The F-measure (also

known as F-1 score) is the harmonic mean of pre-

cision and recall. Finally, the count error index

Cnt(T , T

) is the normalized absolute difference be-

tween the number of regions in T

and the number of

regions in T .

ICPRAM 2020 - 9th International Conference on Pattern Recognition Applications and Methods

486

It can be seen that, when the indexes are ap-

plied to the same segmentation, Prec(T , T ) = 1,

Rec(T , T ) = 1, and Cnt(T , T ) = 0. Intuitively, the

more similar the two segmentations T and T

, the

closer the precision and recall indexes to 1 and the

closer the count error index to 0. In the extreme case

where T = {I}, i.e., T consists of a single region

covering the full image, recall is 1, whereas precision

may be low and count error may be high; on the op-

posite case, if T = {{i} : i ∈ I}, i.e., if regions of T

correspond to single pixels of I, then precision is 1,

recall may be low, and count error may be high.

Let T

be the unknown desired segmentation of

a mosaic image I in which each region exactly cor-

responds to a tile in the image. The goal is to ﬁnd

a method that, for any image I of a mosaic, outputs

a segmentation T which maximizes Prec(T , T

) and

Rec(T , T

) and minimizes Cnt(T , T

3 U-net FOR MOSAIC

SEGMENTATION

We propose a solution for the mosaic image seg-

mentation problem described in the previous section

which is based on a kind of Convolutional Neural Net-

work (CNN). We assume that a learning set com-

posed of images of mosaics and the corresponding

desired segmentations are available. In a learning

phase, to be performed just once, the learning set is

used to learn the values of the parameters of the net-

work. Then, once learned, the network is used in a

procedure that can take any image I as input and out-

puts a segmentation T .

The CNN used in this study is known as U-net,

the name deriving from the shape of the ANN archi-

tecture. U-net was introduced by Ronneberger et al.

(2015) who used it for the segmentation of neuronal

structures in electron microscopic stacks: according

to the cited study, U-net experimentally outperformed

previous approaches.

When applied to an image, a U-net works as a bi-

nary classiﬁer at the pixel level, i.e., it takes as input a

3-channels (RGB) image and returns as output a two-

channels image. In the output image, the two channels

correspond to the two classes and encodes, together,

the fact that the pixel belongs or does not belong to the

artifact of interest—in our case, a tile of the mosaic.

In order to obtain a segmentation from the output

of the U-net, we (i) consider the single-channel image

that is obtained by applying pixel-wise the softmax

function to the two channels of the ANN output and

considering just the ﬁrst value, that we call the pixel

intensity and denote by p(i); (ii) compare each pixel

intensity against a threshold τ; (iii) merge sets of adja-

cent pixels that exceed the threshold, hence obtaining

connected regions. We discuss in detail this procedure

in Section 3.2.

Internally, the U-net is organized as follows: a

contracting path made of a series of 3 × 3 un-padded

convolutions followed by max-pooling layers enables

the context capturing while the expanding path con-

sisting of transposed convolutions and cropping op-

erations ensures precise features localization (Ron-

neberger et al., 2015).

In our study we used an instance of the U-net tai-

lored to input images of 400×400. In the contracting

path, we used two 2-D un-padded convolutions steps

of size 3, both made of 32 ﬁlters and followed by

a rectiﬁed linear unit (reLU) precede a max-pooling

layer with 2 × 2 pool-size. The same structure is re-

peated four times every time increasing the number of

ﬁlters to 64, 128, 256, and 512. At the end of the con-

traction phase the 400 × 400 pixels input image in re-

shaped in a 17×512 tensor. In the expansion path, we

started with an up-sampling 2-D layer of 2×2 size of

the features map followed by a concatenation with the

correspondingly cropped feature map from the con-

tracting phase and two 2-D un-padded convolutions

steps of size 3 × 3 each with reLU activation func-

tion. The same procedure is repeated also four times

every time reducing the number of convolutions ﬁlter

by half leading to a tensor of shape 216 × 32. Fur-

thermore a zero-padding 2-D layer reshapes the tensor

in a 400 × 400 × 32 shape prior to a 1-D convolution

steps composed of two ﬁlters that gives in output a

400 × 400 ×2 tensor that constitutes the output of the

U-net. The output is then used to compute pixel inten-

sities and hence the segmentation as brieﬂy sketched

above and detailed in Section 3.2.

3.1 Learning

Let L = {(I

, T

), . . . , (I

, T

)} be the learning set

composed of m pairs, each consisting of a mosaic

image I

and the corresponding desired segmentation

, obtained by manual annotation. The outcome of

the learning phase consists of the weights θ of the U-

net.

We ﬁrst preprocess the pairs in the learning set L

as follows, obtaining a different learning set L

, for

which |L

| = |L| does not generally hold.

1. We rescale each pair (I, T

) ∈ L so as to obtain

a given tile density ρ

|I|

, i.e., a given ratio

between the number of tiles in the image and the

image size; ρ

is a parameter of our method. We

use a bicubic interpolation over 4 × 4 pixel neigh-

borhood.

Mosaic Images Segmentation using U-net

487

2. From each pair (I, T

) ∈ L, we obtain a num-

ber of pairs by cropping square regions of I

of size l × l (crops) that overlap for half of

their size; l is a parameter of our method. Let

w × h be the size of the image I of the pair,

the number of pairs obtained by cropping is







−







−



. We build a

set L

including the resulting pairs, each one con-

sisting of a square image of size l × l and a seg-

mentation with, on average, approximately ρ

regions.

3. Finally, we augment L

by adding, for each of its

pair, few pairs obtained by common image data

augmentation techniques, i.e., rotation, horizontal

and vertical ﬂipping.

We remark that, when building L

from L, segmenta-

tions T

are processed accordingly to the processing

of the corresponding images I.

In order to learn the weights θ of the U-net, we

consider a subset L

train

of L

that contains 90% of the

pairs in L

, chosen with uniform probability.

Then, we use the Adam optimizer (Kingma and

Ba, 2015) on image pairs in L

train

to learn the weights

θ. We feed Adam with batches of 8 images and drive

it by the following weighted binary cross entropy loss

function:

Loss(θ) = −

∑

I∈|L

∑

i∈I



wq(i)log p(i)

+ (1 − q(i))log(1 − p(i))



(5)

where L

is the batch of pairs (I, T

), w is a weight-

ing factor, p(i) is the pixel intensity of i obtained by

applying the U-net with weights θ, and q(i) is an indi-

cator function that encodes in {1, 0} the fact that the

pixel i belongs or does not belong to a region of T

q(i) =

(

1 if ∃T ∈ T

, i ∈ T

0 otherwise

(6)

We use the weighting factor w in Equation (5)

in order to weight differently classiﬁcation errors for

pixels of tiles and ﬁller. The parameter w hence per-

mits to cope with the fact that image of mosaics are

in general highly unbalanced: much more pixel are

associated to tiles than to the ﬁller. We experimented

with three different values of w: 0.5, 0.1, and 0.01,

the former corresponding to weighting the two classes

equally.

We set Adam to run the optimization for n

epoch

epochs, with a learning rate that varies at every epoch

using an exponential decay function. During the op-

timization, we use the remaining 10% of the set L

for monitoring the progress, by computing the loss of

Equation (5) on L

\ L

train

3.2 Segmentation

In the segmentation phase, we use a learned U-net to

obtain a segmentation T out of an image I, as follows.

1. We rescale the input image such that its estimated

tile density ρ is approximately equal to the ρ

value used during the learning. For computing

ρ, and hence for performing the scaling, we as-

sume that a raw estimate of the number of tiles in

the image I is available: in practice, this estimate

might be obtained by visual inspection of a small

portion of the image.

2. We apply the U-net to I obtaining a single-channel

image of pixel intensities that we threshold at 0.5,

hence obtaining a binary image of the same size

of I. We call this image the output mask.

3. We consider the subset I

= {i ∈ I, p(i) ≥ 0.5} of

pixels of I that are classiﬁed by the U-net as a be-

longing to tiles.

4. Finally, we partition I

in subsets composed of ad-

jacent pixels, hence obtaining the segmentation

T = {T

, T

, . . . }.

4 EXPERIMENTAL EVALUATION

We performed an extensive experimental evaluation

aimed at assessing our method effectiveness in terms

of precision, recall, and count error on images of mo-

saics not used for learning. To this end, we considered

a set of images of real mosaics, that we manually an-

notated to obtain the corresponding desired segmen-

tations (i.e., the ground-truth segmentations), and ap-

plied our method.

We collected a dataset of 12 images of real mo-

saics including both images that we acquired with a

consumer camera and images that we obtained online.

Part of this dataset has already been used by Bartoli

et al. (2016).

The mosaics depicted in the images of our dataset

belong to different ages in time and also differ in tile

density and color. The annotation required for the

training was performed manually. Despite the exten-

sive effort and attention devoted to the process, some

dissimilarities between a mosaic image and its cor-

responding ground-truth segmentation may still exist.

Nonetheless, the manually annotated mask looks vi-

sually correct.

Figure 1 shows the images of our dataset. It can

be seen that the images greatly vary in the density and

size of the tiles as well as in the visually perceived

sharpness of tile edges.

ICPRAM 2020 - 9th International Conference on Pattern Recognition Applications and Methods

488

(a) Image 0. (b) Image 1. (c) Image 2. (d) Image 3. (e) Image 4. (f) Image 5.

(g) Image 6.

(h) Image 7.

(i) Image 8.

(j) Image 9.

(k) Image 10.

(l) Image 11.

Figure 1: The images of the dataset. Images from 7 to 11 has been used also in Bartoli et al. (2016); Fenu et al. (2015).

4.1 Procedure

We evaluated our method using a leave-one-out pro-

cedure on the images (and corresponding desired seg-

mentations) of our dataset, as follows. For each pair

(I, T

) in the dataset D, we (i) performed the learn-

ing on L = D \ (I, T

), hence obtaining a learned U-

net, (ii) used the learned U-net for obtaining the seg-

mentation T of I (i.e., the image of the left-out pair),

and (iii) computed the precision Prec(T , T

), recall

Rec(T , T

), and count error Cnt(T , T

Concerning the method parameters, we set ρ

15 × 10

−5

, l = 400, and n

epoch

= 10. We chose the

values for ρ

and l based on the minimum dimension

and tile density of the images in our dataset. In this

way we obtained crops of 400 × 400 pixels with ap-

proximately l

= 24 tiles in each crop. In the seg-

mentation phase, we set ρ = ρ

and computed ρ using

the actual number T

of tiles.

We run the experiments using an implementation

of the method based on Python 3.6 with Keras and

Tensorﬂow; we executed it on some p3.8xlarge AWS

EC2 instances, each equipped with 64 vCPU based on

2.3 GHz Intel Xeon E5-2686 v4 with 244 GB RAM

and with 4 GPUs based on NVIDIA Tesla V100 with

32 GB RAM. In these settings, the learning time for

one repetition of a leave-one-out procedure is 30 min

and the segmentation time is in the order of few sec-

onds.

4.2 Results and Discussion

We show in Table 1 the results in terms of the salient

segmentation effectiveness indexes presented in Sec-

tion 2 (precision, recall, F-measure, and count error)

for each mosaic image, i.e., each repetition of the

leave-one-out procedure.

Table 1 shows that average precision, recall, and

F-measure at w = 0.5 are 0.60, 0.70, and 0.64 respec-

tively, whereas the average count error is 0.36. By

looking at the ﬁgures of single images, it can be seen

that the effectiveness of segmentation varies among

mosaic images, with the F-measure ranging from 0.51

of image 2 to 0.73 of image 7 and the count error

ranging from 0.75 of image 2 to 0.15 of image 4. We

carefully compared the numerical features of Table 1

with the corresponding mosaic images (see Figure 1)

and found that the segmentation effectiveness is con-

sistent with the subjectively perceived quality of the

images. Good numerical results are obtained by our

method for images 7 and 11, while the worst result is

obtained for image 2 which exhibits poor sharpness.

Concerning the impact of the weighting parame-

ter w, it can be seen from the three sections of Ta-

ble 1 that it act consistently with its semantic. As

w decreases, the balancing between precision and re-

call varies, namely precision increases and recall de-

creases: in facts, a lower value for w corresponds

to a lower contribution, in the loss used during the

learning (see Equation (5)), of the errors in classify-

ing pixel belonging to the actual tiles. As a result,

Mosaic Images Segmentation using U-net

489

Table 1: Results obtained with the three U-nets with different w values.

U-net with w = 0.5 U-net with w = 0.1 U-net with w = 0.01

Im. # Cnt Prec Rec Fm Cnt Prec Rec Fm Cnt Prec Rec Fm

0 0.49 0.56 0.71 0.62 0.18 0.62 0.54 0.57 0.19 0.74 0.40 0.51

1 0.52 0.57 0.69 0.62 0.20 0.70 0.56 0.61 0.19 0.81 0.30 0.43

2 0.75 0.41 0.76 0.51 0.40 0.51 0.52 0.51 0.40 0.51 0.40 0.43

3 0.18 0.64 0.73 0.68 0.07 0.79 0.67 0.72 0.17 0.96 0.35 0.51

4 0.15 0.66 0.61 0.64 0.10 0.68 0.57 0.62 0.08 0.79 0.38 0.51

5 0.31 0.66 0.62 0.64 0.26 0.68 0.57 0.62 0.21 0.86 0.23 0.36

6 0.32 0.58 0.61 0.59 0.17 0.63 0.60 0.61 0.11 0.81 0.32 0.46

7 0.30 0.67 0.80 0.73 0.26 0.73 0.76 0.75 0.41 0.73 0.70 0.71

8 0.28 0.59 0.71 0.64 0.29 0.62 0.67 0.64 0.27 0.68 0.53 0.59

9 0.21 0.62 0.73 0.67 0.23 0.67 0.60 0.62 0.13 0.73 0.52 0.60

10 0.52 0.65 0.70 0.67 0.41 0.70 0.62 0.65 0.31 0.88 0.31 0.45

11 0.29 0.65 0.72 0.69 0.25 0.68 0.68 0.68 0.24 0.83 0.23 0.35

Avg. 0.36 0.60 0.70 0.64 0.23 0.67 0.61 0.63 0.22 0.78 0.39 0.49

the learned U-net tends to outputs smaller regions that

have a lower recall and a greater precision. Another

effect is that the count error is lower with lower val-

ues of w, because there are fewer regions of the output

segmentation in which tiles are “glued” together (see

also later discussion). These ﬁnding on the impact

of w on the output segmentations suggests that it can

be used as a parameter to tailor the output to the spe-

ciﬁc usage intended by the user. However, since in

our experiments w = 0.5 delivers the best F-measure,

we report in the following only the results obtained in

this settings.

4.2.1 Comparison with Other Methods

In order to put our results in perspective, we compared

them with those obtained by the two other existing

methods for mosaic image segmentation, i.e., (Bartoli

et al., 2016) and (Youssef and Derrode, 2008), that

we here denote by GA and TOS, respectively. Table 2

shows the values of the four indexes for the mosaic

images of our dataset which were also processed with

GA and TOS (for these methods, the ﬁgures are taken

from (Bartoli et al., 2016)). For each image, we high-

light in Table 2 the best Fm and Cnt ﬁgure among the

three methods.

The foremost ﬁnding is that our method outper-

forms both GA and TOS in terms of average F-

measure, with 0.67 vs. 0.54 and 0.66, respectively:

considering Fm on the single images, our method ob-

tains the best result in 3 on 5 images. Concerning the

count error, U-net scores better than TOS (0.30 vs.

0.33) and worse than GA (0.30 vs. 0.03): we note,

however, that the latter method is designed to output

a number of tiles corresponding to the user-provided

estimate.

Another ﬁnding concerns how the errors in seg-

mentation are distributed between precision and re-

call. For all the three methods, recall is in general

larger than precision, meaning that tiles in the com-

puted segmentation are in general “larger” than the

corresponding tiles in the desired segmentation. The

unbalancing is, however, much greater in TOS and

GA than in U-net, the difference between recall and

precision being 0.09, 0.21, and 0.20 respectively for

our method, GA and TOS. We think that this differ-

ence can be explained by the way the three methods

work. In TOS, the segmentation does not allow to ob-

tain regions which are not tiles: this means that the

ﬁller is always included in a tile, resulting in a low

precision and good recall. In GA, the overlapping of

tiles is not explicitly forbidden or discouraged, thus

the precision is very low, on average, because the

output segmentation often contains tiles which span

across many desired tiles. In our method, instead,

the network is trained to discriminate between pix-

els belonging or not belonging to a tile in the desired

segmentation: the way the loss is computed during

the training of the U-net (see Equation (5)) favors a

good balancing between false positive and false neg-

ative classiﬁcation at the level of pixels and, hence,

between precision and recall.

In Figure 2 we compare the visual results of the

segmentation of image 11 using the three methods.

Due to the aforementioned differences between the

algorithms, the number of tiles in the TOS segmen-

tation is higher while the size of the tiles tend to be

smaller when compared to the other methods. In GA,

since the algorithm allows for tiles overlapping, many

of the predicted tiles share the same area. In the U-net

segmentation some tiles are not properly separated,

however position, size, and count are visually closer

to those in the original image.

ICPRAM 2020 - 9th International Conference on Pattern Recognition Applications and Methods

490

Table 2: Results obtained with our method and with GA and TOS for a subset of the dataset.

U-net with w = 0.5 GA TOS

Im. # Cnt Prec Rec Fm Cnt Prec Rec Fm Cnt Prec Rec Fm

7 0.30 0.67 0.80 0.73 0.03 0.50 0.76 0.60 0.14 0.64 0.87 0.74

8 0.28 0.59 0.71 0.64 0.03 0.42 0.63 0.50 0.54 0.56 0.72 0.63

9 0.21 0.62 0.73 0.67 0.01 0.41 0.66 0.51 0.03 0.53 0.82 0.64

10 0.52 0.65 0.70 0.67 0.07 0.50 0.63 0.56 0.06 0.49 0.68 0.57

11 0.29 0.65 0.72 0.69 0.03 0.46 0.67 0.55 0.90 0.63 0.78 0.70

Avg. 0.32 0.64 0.73 0.68 0.03 0.46 0.67 0.54 0.33 0.57 0.77 0.66

(a) U-net (b) GA (c) TOS

Figure 2: Example of segmentation of image 11 overlapped on the original image with the three methods.

5 CONCLUSIONS AND FUTURE

WORK

We considered the problem of the segmentation of

mosaic images and proposed a method based on deep

learning, namely U-net. We experimentally evaluated

our proposal on a set of 11 images of real mosaics

acquired in different conditions, with different image

quality, and with different building properties. The re-

sults suggest that our method is effective, scoring the

better value for the most relevant index on the major-

ity of images used in the comparison.

We think that our results constitute a further ev-

idence that modern deep learning systems can help

solving tasks in a variety of ﬁelds, here in digital hu-

manities.

We believe that further improvements in mosaic

image segmentation might be obtained. The most

promising way to achieve them might be merging to-

gether two radically different techniques: the one pre-

sented in the present paper, based on deep learning,

and the one designed by Bartoli et al. (2016), based

on a different form of optimization which includes,

in the solution presentation, some domain knowledge

concerning the shape of the tiles.

ACKNOWLEDGMENTS

The experimental evaluation of this work has been

done on Amazon AWS within the “AWS Cloud Cred-

its for Research” program.

REFERENCES

Assael, Y., Sommerschield, T., and Prag, J. (2019). Restor-

ing ancient text using deep learning: a case study on

greek epigraphy. arXiv preprint arXiv:1910.06262.

Bartoli, A., Fenu, G., Medvet, E., Pellegrino, F. A., and

Timeus, N. (2016). Segmentation of mosaic im-

ages based on deformable models using genetic al-

gorithms. In International Conference on Smart Ob-

jects and Technologies for Social Good, pages 233–

242. Springer, Springer.

Benyoussef, L. and Derrode, S. (2011). Analysis of an-

cient mosaic images for dedicated applications. In

Stanco, F., Battiato, S., and Gallo, G., editors, Digital

Imaging for Cultural Heritage Preservation: Analy-

sis, Restoration, and Reconstruction of Ancient Art-

works. CRC Press.

C¸ ic¸ek,

O., Abdulkadir, A., Lienkamp, S. S., Brox, T., and

Ronneberger, O. (2016). 3d u-net: learning dense vol-

umetric segmentation from sparse annotation. In In-

ternational conference on medical image computing

and computer-assisted intervention, pages 424–432.

Springer.

Mosaic Images Segmentation using U-net

491

Comes, R., Buna, Z., and Badiu, I. (2014). Creation and

preservation of digital cultural heritage. Journal of

Ancient History and Archaeology, 1(2).

Falk, T., Mai, D., Bensch, R., C¸ ic¸ek,

O., Abdulkadir, A.,

Marrakchi, Y., B

ohm, A., Deubner, J., J

ackel, Z., Sei-

wald, K., et al. (2019). U-net: deep learning for cell

counting, detection, and morphometry. Nature meth-

ods, 16(1):67.

Fazio, L., Brutto, M. L., and Dardanelli, G. (2019). Sur-

vey and virtual reconstruction of ancient roman ﬂoors

in an archaeological context. International Archives

of the Photogrammetry, Remote Sensing and Spatial

Information Sciences, 42(2/W11).

Fenu, G., Jain, N., Medvet, E., Pellegrino, F. A., and Pi-

lutti Namer, M. (2015). On the assessment of seg-

mentation methods for images of mosaics. In In-

ternational Conference on Computer Vision Theory

and Applications-VISAPP, volume 3, pages 130–137.

SciTePress, SciTePress.

Kingma, D. P. and Ba, J. L. (2015). Adam: A method for

stochastic optimization. In Proceedings of the 3rd In-

ternational Conference on Learning Representations

(ICLR).

Kohl, S., Romera-Paredes, B., Meyer, C., De Fauw, J., Led-

sam, J. R., Maier-Hein, K., Eslami, S. A., Rezende,

D. J., and Ronneberger, O. (2018). A probabilis-

tic u-net for segmentation of ambiguous images. In

Advances in Neural Information Processing Systems,

pages 6965–6975.

Monti, M. and Maino, G. (2011). Image processing and

a virtual restoration hypothesis for mosaics and their

cartoons. In International Conference on Image Anal-

ysis and Processing, pages 486–495. Springer.

Neum

uller, M., Reichinger, A., Rist, F., and Kern, C.

(2014). 3d printing for cultural heritage: Preserva-

tion, accessibility, research and education. In 3D Re-

search Challenges in Cultural Heritage, pages 119–

134. Springer.

Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net:

Convolutional networks for biomedical image seg-

mentation. In International Conference on Medical

image computing and computer-assisted intervention,

pages 234–241. Springer.

Vincent, L. and Soille, P. (1991). Watersheds in digital

spaces: an efﬁcient algorithm based on immersion

simulations. IEEE Transactions on Pattern Analysis

and Machine Intelligence, 13(6):583–598.

Youssef, L. B. and Derrode, S. (2008). Tessella-oriented

segmentation and guidelines estimation of ancient

mosaic images. Journal of Electronic Imaging,

17(4):043014.

Zhang, Z., Liu, Q., and Wang, Y. (2018). Road extraction

by deep residual u-net. IEEE Geoscience and Remote

Sensing Letters, 15(5):749–753.

ICPRAM 2020 - 9th International Conference on Pattern Recognition Applications and Methods

492