
Based on analyses of the literature, attention can be drawn to the need for newer models that enable better data segmentation. In this work, we propose a new solution based on the U-Net network model, incorporating the squeeze-and-excitation mechanism, which enables the analysis of dependencies between features in feature maps. Additionally, we introduce Pyramid-Pooling into these blocks to take into account information from objects of different scales and sizes and to increase the importance of image context in the analysis of the Moon's surface. The main contributions of this research are:
• a new U-Net model for boulder/rock segmentation,
• a novel block type that combines Pyramid-Pooling with Squeeze and Excitation.
2 METHODOLOGY
In this section, we propose a modified U-Net model
enhanced with PPS-CE blocks for multi-class seman-
tic segmentation tasks. The overview illustration of
the model is presented in Fig. 1. The contraction path consists of 5 doubled 3 × 3 convolutional layers (with ReLU activation functions) and dropout between them. These are followed by 2 × 2 MaxPooling layers. In the expansive path, to enhance the performance of the model, we propose utilizing PPS-CE blocks after each transpose convolution and copy path. The final output is obtained using a 1 × 1 convolution with a Softmax activation function (to obtain a probability distribution over the classes).
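The building blocks described above can be sketched as follows; this is a minimal illustration, with the channel widths, dropout rate, and 4-class output head chosen here as placeholders rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

def double_conv(c_in, c_out, p=0.1):
    """Two 3x3 convolutions with ReLU activations and dropout between them,
    the basic unit of the contraction path (dropout rate p is an assumption)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
        nn.Dropout2d(p),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(),
    )

# One contraction step: double convolution followed by 2x2 max pooling.
down = nn.Sequential(double_conv(3, 64), nn.MaxPool2d(2))
# Output head: 1x1 convolution with Softmax over the class axis.
head = nn.Sequential(nn.Conv2d(64, 4, 1), nn.Softmax(dim=1))

x = torch.randn(1, 3, 64, 64)
y = head(down(x))  # per-pixel class probabilities, shape (1, 4, 32, 32)
```

In the full model, five such contraction steps feed an expansive path of transpose convolutions and skip connections, with a PPS-CE block after each.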
2.1 Pyramid Pooling
Pyramid pooling is a unique pooling technique that al-
lows the model to gather more contextual information
by capturing information at multiple scales. The prin-
ciple of this method is based on dividing the input fea-
ture map into regions of different sizes. Then, for each
divided feature map obtained this way, average pool-
ing is performed. The result of each pooling segment
is then concatenated, creating a unified representation
that carries multi-scale information. Mathematically,
this can be presented as processing the input feature map $X = (x_{h,w,c})$, where $h$, $w$, and $c$ are respectively the height, width, and number of channels of the feature map $X$. Given the set of scales $L$, for each scale $l$ in the set, a divided feature map is created according to the following equation:

$$X^{l} = (x_{h_1, w_1, c}), \qquad (1)$$
where $h_1 = \lfloor h/l \rfloor$ and $w_1 = \lfloor w/l \rfloor$. For each obtained $X^{l}$, the average pooling operation is performed.
Lastly, the pooling results at all scales are concatenated, resulting in the final feature map $Y$, whose dimensionality is presented as:

$$\dim(Y) = \left( \sum_{i=1}^{|L|} h_{l_i} \times \sum_{i=1}^{|L|} w_{l_i} \times c \right). \qquad (2)$$
As previously mentioned, Pyramid-Pooling allows the model to gather extended contextual information by utilizing many different receptive field sizes. This provides better robustness with respect to object scale, which is especially important in semantic segmentation tasks.
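The per-scale pooling of Eqs. (1) and (2) can be sketched as follows; the scale set (1, 2, 4) and the use of PyTorch's `adaptive_avg_pool2d` are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def pyramid_pool(x, scales=(1, 2, 4)):
    """Average-pool a feature map at each scale l, producing maps of spatial
    size (h//l, w//l) as in Eq. (1); the per-scale results are returned
    for later concatenation as in Eq. (2)."""
    _, _, h, w = x.shape
    return [F.adaptive_avg_pool2d(x, (h // l, w // l)) for l in scales]

x = torch.randn(1, 16, 32, 32)
maps = pyramid_pool(x)
print([tuple(m.shape[-2:]) for m in maps])  # [(32, 32), (16, 16), (8, 8)]
```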
2.2 Pyramid-Pooling Squeeze and
Convolutional Excitation Blocks
Squeeze-and-excitation (SE) blocks are a mechanism
that improves the representational power of the con-
volutional layers by analyzing the dependencies be-
tween various channels in feature maps passed from
the convolutional layer and assigning them weights
based on the impact they have on the further assess-
ment of the model. This is one of many types of
attention mechanisms used in neural networks, high-
lighting the more influential channels, while also sup-
pressing less informative ones. This process improves
the overall feature representation. The basic SE block
first performs average global pooling as the squeeze
operation, obtaining 1 × 1 × c (c indicating the num-
ber of channels in the input feature map) vector. In
the excitation operation, the vector is then passed through two dense layers, the former having a ReLU (introducing non-linearity) and the latter having a Sigmoid activation function. The output of these layers is then used to rescale the original feature map. In this paper, we propose Pyramid-Pooling
Squeeze and Convolutional Excitation blocks (PPS-
CE), utilizing Pyramid-Pooling in the Squeeze operation and a double 1 × 1 convolution instead of dense layers in the Excitation operation. The main advantages of this approach are the benefits of using Pyramid-Pooling and the fact that convolutional layers have fewer trainable parameters than dense layers. In the squeeze operation, each divided $X^{l}$ feature map is processed using a 1 × 1 convolution with a ReLU activation function. In this paper, we propose that each convolution has a number of output channels equal to $c_{\text{conv}} = \lfloor c/r \rfloor$, with the parameter $r$ set to 16. The output of each convolutional layer is then concatenated along the channel axis. Next, the concatenated feature maps from Pyramid-Pooling are passed through two 1 × 1 convolutional layers, the first of
Semantic Segmentation for Moon Rock Recognition Using U-Net with Pyramid-Pooling-Based SE Attention Blocks
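A minimal sketch of the PPS-CE block described in this section is given below. It assumes the two excitation convolutions follow the standard SE pattern (ReLU then Sigmoid) and that each per-scale squeeze output is reduced to a 1 × 1 descriptor before channel-wise concatenation; both points, along with the scale set (1, 2, 4), are assumptions of this sketch rather than details confirmed by the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPSCE(nn.Module):
    """Sketch of a Pyramid-Pooling Squeeze and Convolutional Excitation block.

    Squeeze: average-pool the input at each scale l, apply a 1x1 convolution
    with c//r output channels (ReLU), then reduce each result to a 1x1
    descriptor so the scales can be concatenated along the channel axis
    (the 1x1 reduction is an assumption of this sketch).
    Excitation: two 1x1 convolutions, ReLU then Sigmoid, producing
    per-channel weights that rescale the original feature map.
    """
    def __init__(self, c, scales=(1, 2, 4), r=16):
        super().__init__()
        c_conv = max(c // r, 1)
        self.scales = scales
        self.squeeze = nn.ModuleList(nn.Conv2d(c, c_conv, 1) for _ in scales)
        self.excite = nn.Sequential(
            nn.Conv2d(c_conv * len(scales), c_conv, 1), nn.ReLU(),
            nn.Conv2d(c_conv, c, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        _, _, h, w = x.shape
        descriptors = []
        for conv, l in zip(self.squeeze, self.scales):
            pooled = F.adaptive_avg_pool2d(x, (h // l, w // l))
            # 1x1 conv with ReLU per scale, then reduce to a 1x1 descriptor
            descriptors.append(F.adaptive_avg_pool2d(F.relu(conv(pooled)), 1))
        weights = self.excite(torch.cat(descriptors, dim=1))
        return x * weights
```

Because the Sigmoid bounds the excitation weights in (0, 1), the block can only attenuate channels, never amplify them, mirroring the behaviour of a standard SE block.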