
which participated in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2014 and achieved notable success. It can more accurately express the characteristics of the dataset when identifying and classifying images. This model has 16 weight layers: 13 convolutional layers and 3 fully connected layers. It represented an advance on previous models, using convolutional layers with smaller convolution kernels (3×3) than had previously been the case.
During model training, the input to the first convolutional layer is an RGB image of size 224×224. All convolutional layers use 3×3 kernels. These convolutional layers are interleaved with max-pooling layers, each 2×2 in size, which reduce the spatial size of the feature maps during learning.
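As an illustration of this configuration, the following is a minimal PyTorch sketch (assuming a recent torchvision; the model is left untrained) that instantiates the VGG16 convolutional base and verifies the feature-map size for a 224×224 RGB input:

```python
import torch
from torchvision.models import vgg16

# Untrained VGG16: 13 convolutional layers with 3x3 kernels,
# interleaved with 2x2 max-pooling layers.
model = vgg16(weights=None)

x = torch.randn(1, 3, 224, 224)   # one RGB image of size 224x224
features = model.features(x)      # convolutional base only
print(features.shape)             # torch.Size([1, 512, 7, 7])
```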
In the proposed method, the model, as shown in Figure 1, is inspired by (Moung et al., 2021). The last max-pooling layer was therefore removed and replaced with an average pooling layer. This introduces some feature generalization, since averaging over pixels retains fine-grained details in the final convolutions. During the classification phase, the output features from the average pooling layer of the VGG16 model are fed into new fully connected layers. The new classifier part of the VGG16 model consists of one flatten layer and two dense layers, generating 100 and 2 outputs, respectively. The first dense layer has a Rectified Linear Unit (ReLU) activation function with 0.5 dropout. The output layer, which is the last dense layer, has a Sigmoid activation function. The feature map of the generated VGG16 features used in this work has 25,088 × 1 dimensions per input image.
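The following sketch illustrates one way to assemble this modified architecture in PyTorch (the class name ModifiedVGG16 is illustrative, not released code):

```python
import torch.nn as nn
from torchvision.models import vgg16

class ModifiedVGG16(nn.Module):
    def __init__(self):
        super().__init__()
        base = vgg16(weights=None).features
        # Drop the final max-pooling layer and replace it with average pooling.
        self.features = nn.Sequential(*list(base.children())[:-1],
                                      nn.AvgPool2d(kernel_size=2, stride=2))
        self.classifier = nn.Sequential(
            nn.Flatten(),                  # 7x7x512 -> 25,088 features
            nn.Linear(512 * 7 * 7, 100),   # first dense layer
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(100, 2),             # output layer, two classes
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```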
In this work, the proposed approach is built using a convolutional neural network (CNN) with batch normalization (BN), a popular and effective technique that consistently accelerates the convergence of deep networks.
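As an illustration, a generic convolutional block with BN might look as follows; the exact placement of BN within the network is one design choice among several and is not prescribed here:

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    # One 3x3 convolution followed by batch normalization and ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```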
3.2 Proposed Method
Our proposed method consists of three phases: feature extraction, feature concatenation, and classification. Figure 2 illustrates the architecture of the proposed method. We detail each phase in the following subsections.
3.2.1 Feature Extraction
In the feature extraction step, two types of features are extracted: deep learning (DL) features and moment invariant (MI) features. For DL feature extraction, VGG16 is utilised in this work, while the MI-based features are extracted using the moment invariant method of (Ghorbel et al., 2006). These two feature sets (DL-based and MI-based) are then concatenated using joint fusion. Lastly, classification based on the fused features is performed using the fully connected (FC) layers.
VGG16 Network Architecture
In this work, we use the VGG16 CNN model (see Section 3.1) to extract essential features. This model is composed of two parts: a convolutional base and a classifier. The convolutional base comprises convolutional and pooling layers that generate features, whereas the classifier categorises the image based on the extracted features. We initialized the model weights using Kaiming initialization for the convolutional layers and normal initialization for the fully connected layers. Additionally, we did not employ any transfer learning in our experiments. VGG16 outputs features of shape 7×7×512 at the final max-pooling layer.
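A sketch of this initialization scheme in PyTorch follows; whether the Kaiming variant is normal or uniform, and the standard deviation of the fully connected initialization, are illustrative choices here:

```python
import torch.nn as nn
from torchvision.models import vgg16

def init_weights(m):
    # Kaiming initialization for convolutional layers,
    # normal initialization for fully connected layers.
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)  # std=0.01 is illustrative
        nn.init.zeros_(m.bias)

model = vgg16(weights=None)   # no transfer learning: random initialization
model.apply(init_weights)
```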
Complete Complex Invariant Descriptors
Based on the work of (Ghorbel et al., 2006), the MI-based features of each CT image are extracted using complex moment invariants (see Section 2), which are invariant to translation, rotation, and scale and are obtained through linear combinations of complex moments. We use both the modulus and phase values of the obtained complex invariant features, so that no information is lost. More details are presented in the experimental results.
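As a simplified illustration, the following sketch computes raw centred complex moments of an image and splits them into modulus and phase; the specific invariant combinations of (Ghorbel et al., 2006) are given in Section 2 and are not reproduced here:

```python
import numpy as np

def complex_moments(img, max_order=3):
    """Raw complex moments c_pq = sum (x+iy)^p (x-iy)^q f(x,y), centred on
    the image centroid for translation invariance. The Ghorbel invariants
    are linear combinations of these and are omitted in this sketch."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    m00 = img.sum()
    # Centre coordinates on the intensity centroid.
    xs, ys = xs - (xs * img).sum() / m00, ys - (ys * img).sum() / m00
    z, zc = xs + 1j * ys, xs - 1j * ys
    return {(p, q): (z**p * zc**q * img).sum()
            for p in range(max_order + 1) for q in range(max_order + 1)}

# Keep both modulus and phase so that no information is lost.
c = complex_moments(np.random.rand(64, 64))
feat = np.concatenate([[abs(v), np.angle(v)] for v in c.values()])
```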
3.2.2 Feature Concatenation
In (Huang et al., 2020), the authors describe the different fusion strategies used with deep learning. In this work, we use joint fusion, as shown in Figure 3: the process of joining learned feature representations from intermediate layers of neural networks with features from other modalities as input to a final model. Thus, for each batch of images, the features from the VGG16 model are concatenated with the Ghorbel moment features into a composite vector. The features extracted from VGG16 are multi-dimensional tensors of shape (7×7×512), which are flattened to (25,088×1), while the extracted Ghorbel complex moment invariant values are presented as a one-dimensional row vector. To match the deep learning feature, which is a real-valued vector in R^(25,088×1), we extract the modulus and phase information from the Ghorbel complex invariant features, and then perform the fusion.
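The following sketch illustrates the concatenation step for one batch (vgg_features and ghorbel_features are placeholder names for the outputs described above):

```python
import torch

# vgg_features: (batch, 512, 7, 7) from the modified convolutional base.
# ghorbel_features: (batch, k) modulus/phase values of the complex invariants.
def joint_fusion(vgg_features, ghorbel_features):
    deep = torch.flatten(vgg_features, start_dim=1)    # (batch, 25088)
    return torch.cat([deep, ghorbel_features], dim=1)  # (batch, 25088 + k)
```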
sion process. The size of the final vector will vary