Convolutional Neural Network (CNN)-Based Medical Image Analysis
Leyan Li
Department of Computer Science, University of Liverpool, Liverpool, U.K.
Keywords: CNNs, Medical Image Processing, Residual Network, Transformers.
Abstract: This paper provides an exhaustive examination of convolutional neural networks (CNNs) in medical image
processing, recognizing their pivotal role in healthcare diagnostics. As CNNs continue to evolve, they offer
promising avenues for enhancing accuracy and efficiency in image analysis. The primary objective of this
study is to scrutinize and assess the performance of both classic and contemporary CNN models across a
spectrum of pathological datasets. The methodology entails a comprehensive analysis of various CNN
architectures, ranging from well-established models to more advanced approaches. Emphasis is placed on
their efficacy in disease classification and feature extraction tasks. Experiments conducted on datasets
underscore the models' adeptness in handling intricate medical images. The findings indicate CNNs'
superiority in feature extraction, the proficiency of Residual Network (ResNet) in managing depth and
ensuring robust training, and Transformers' effectiveness in navigating high-dimensional data through their
attention mechanisms. These insights hold profound implications for medical diagnostics, promising
significant advancements in accuracy and timeliness of health interventions.
1 INTRODUCTION
Medical imaging technology is an effective means of
understanding pathological processes that affect
human health (Jannin, 2006). Compared to natural
images, medical images (such as slices or patches
from different modalities) contain richer information
due to their more organized and similar visual
representations of human organs (Dai, 2021). With
the advancement of technology, the main tasks of
medical image processing can be summarized as
generating new images from original ones, computing
features and measurements (known as image
analysis), or extracting high-level descriptions
(referred to as image understanding) (Jannin, 2006).
In the medical field, the quality and accuracy of
image processing have become benchmarks, and
these processing results are crucial for medical
decision-making (Sonka, 2000).
Advances in algorithms in the domain of medical
imaging technology often stem from the need for new
image analysis capabilities (Jannin, 2006). For
instance, since Krizhevsky et al. (Krizhevsky, 2012) introduced the convolutional neural network (CNN) AlexNet, which won the ImageNet image classification challenge in 2012, CNNs have shown tremendous advantages in disease detection and classification and in local feature extraction from
images. Classic CNN models such as AlexNet; Network in Network (NIN), which reduces the risk of overfitting through global average pooling (Lin, 2013); and the Visual Geometry Group network (VGGNet) and GoogLeNet, which improved accuracy on the ImageNet dataset (Simonyan, 2014), have all performed well. In 2015, He et al. introduced
Spatial Pyramid Pooling (SPP), which addressed the
strict input size requirements of CNNs (He, 2015),
and in the subsequent year introduced the residual
network (ResNet) to address the issue of model
degradation. Recently, the Transformer has excelled
in tasks needing a deep understanding of visual
contexts and details, thanks to its capability to handle
high-dimensional data and synthesize details from
different image sections.
This paper comprehensively reviews and
summarizes the current research status of utilizing
CNNs for medical image processing. Chapter Two
analyzes and explains classic and current mainstream
network models based on CNNs in deep learning. It
delves into the architecture, training strategies, and
applications of these models in medical imaging tasks.
Chapter Three provides a detailed analysis and
comparison of the results obtained by different
models using various pathological datasets. It
evaluates the performance metrics, including
accuracy, sensitivity, and specificity, to assess the
effectiveness of CNN-based approaches in medical
image analysis. In Chapter Four, the paper
summarizes the advantages of CNNs in medical
imaging, highlighting their ability to extract
meaningful features and their reliance on large
datasets for training. Furthermore, it explores
potential future trends, such as the development of
CNN architectures that require less data for training
or the utilization of artificial intelligence techniques
to generate novel models tailored to specific medical
imaging tasks.
2 METHODOLOGIES
2.1 Dataset Description and
Preprocessing
Interstitial Lung Disease (ILD) encompasses various
pulmonary diseases affecting the lung parenchyma,
associated with significant morbidity and mortality
(ildgdb, 2024). Recent advances have led to a substantial collection of genetic and disease data being integrated into the interstitial lung disease genetic database (ildgdb.org, 2024 version), featuring 20 types of ILD across four species, with over 600 genes and 2,018 entries.
of species, disease types, gene symbols, and primary
references. The study utilized image blocks where at
least 75% of pixels are within Regions of Interest
(ROI), involving 16,220 blocks from 92 high-
resolution computed tomography (HRCT) image sets.
The image blocks were divided into ten groups, with one group used for testing and the remaining nine for training in each round (i.e., ten-fold cross-validation), and random image shifting was applied to enhance diversity and prevent overfitting.
The MRNet dataset includes 1,370 magnetic resonance imaging (MRI) knee examinations, categorized by conditions such as anterior cruciate ligament (ACL) tears (Stanford, 2024). It
splits into 1,130 training, 120 validation, and 120
testing cases and uses three MRI scan types: T1, T2,
and proton density, with resolutions reformed into 3D
stacks for T1 and T2 scans. The preprocessing
includes the OTSU algorithm for background
separation, image alignment, and formatting into
stacks of 36x448x448. The augmentation techniques
include random flipping and Gaussian noise to
improve dataset robustness.
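As an illustration of the OTSU background-separation step described above, the following is a minimal sketch using OpenCV; the function name and the random test slice are hypothetical stand-ins, not the actual MRNet pipeline.

```python
import cv2
import numpy as np

def separate_background(slice_8bit: np.ndarray) -> np.ndarray:
    """Split foreground from background with Otsu's threshold (a sketch)."""
    # Otsu's method picks the threshold minimizing intra-class variance.
    _, mask = cv2.threshold(slice_8bit, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Keep only pixels inside the foreground mask.
    return cv2.bitwise_and(slice_8bit, slice_8bit, mask=mask)

# Hypothetical usage on one 448x448 slice of an MRI stack.
slice_8bit = np.random.randint(0, 256, (448, 448), dtype=np.uint8)
foreground = separate_background(slice_8bit)
```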
2.2 Proposed Approach
This study aims to comprehensively review and
analyze the applications of CNN in medical image
processing, with a focus on evaluating the efficacy of
both classic and contemporary mainstream deep
learning models based on CNNs. The performance of
these models is assessed across a range of
pathological datasets, employing detailed
methodologies and comparisons to elucidate each
model's strengths and applicable scopes. Additionally,
the principal flowchart of the main process is outlined
in Figure 1 to provide a visual representation of the
workflow.
Figure 1: The pipeline of the model (Photo/Picture credit:
Original).
2.2.1 Introduction to Basic Techniques
CNNs represent a form of deep learning model that evolved from the multilayer perceptron, as shown in Figure 2. They simplify the learning process by
using smaller kernel filters to incorporate weights,
which speeds up operations and enhances robustness.
Due to their ability to automatically and efficiently
learn intrinsic features from blocks of medical images
and their strong generalization capability, CNNs are
widely used in medical image processing. The main
structures of CNNs are as follows.
Data input layer: this layer preprocesses the raw image data, including mean subtraction (centring the data around zero to reduce variation between samples), normalization (aligning the amplitude range across different dimensions), and Principal Component Analysis (PCA)/whitening (reducing dimensionality and normalizing the amplitude of the data feature axes).
Convolutional (CONV) layer and rectified linear unit (ReLU) activation layer: the convolutional layer applies multiple filters through local connections and a sliding-window mechanism, with each neuron acting as a filter. Key parameters include filter size, stride, and padding: the stride controls how far the filter moves over the input, while zero padding adds zeros at the input boundaries to maintain consistent spatial dimensions
between the input and output. With input width $W$, filter size $F$, padding $P$, and stride $S$, the output size is calculated as:

$$\text{Output Size} = \frac{W - F + 2P}{S} + 1 \qquad (1)$$
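As a quick check of equation (1), the following sketch computes the spatial output size; the helper name conv_output_size is illustrative.

```python
def conv_output_size(w: int, f: int, p: int, s: int) -> int:
    """Spatial output size of a convolution, per equation (1)."""
    return (w - f + 2 * p) // s + 1

# A 32x32 ILD block convolved with a 3x3 filter, padding 1, stride 1
# keeps its spatial size: (32 - 3 + 2) / 1 + 1 = 32.
assert conv_output_size(32, 3, 1, 1) == 32
# With stride 2 the output halves: (32 - 3 + 2) / 2 + 1 = 16.
assert conv_output_size(32, 3, 1, 2) == 16
```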
The ReLU activation layer provides a non-linear mapping, enhancing the model's non-linear expressive capability and convergence speed.
Pooling layer: situated between successive convolutional layers, its primary function is to compress the data and parameter count, thereby reducing overfitting. Typical techniques are average pooling and max pooling, with max pooling selecting the maximum value in each window as the output.
Fully connected (FC) layer: the last layer of a CNN typically comprises an FC layer, in which each neuron is connected to every neuron in the preceding layer, producing the network's final output.
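For concreteness, the following is a minimal PyTorch sketch of this CONV/ReLU/pooling/FC stack, sized for the 32x32 ILD blocks described in Section 2.1; the channel widths and layer counts are illustrative assumptions, not a specific published architecture.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # CONV layer
            nn.ReLU(),                                   # non-linear mapping
            nn.MaxPool2d(2),                             # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # FC layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(x.flatten(1))

# Four grayscale 32x32 blocks -> logits of shape (4, 5).
logits = SmallCNN()(torch.randn(4, 1, 32, 32))
```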
Figure 2: The structure of the FC layer (Photo/Picture
credit: Original).
Together, these components enable CNNs to process medical images effectively and efficiently, demonstrating their strong potential in the medical domain.
2.2.2 ResNet
CNNs have shown exceptional effectiveness in object
recognition and have gradually become the preferred
method for image analysis, as shown in Figure 3.
He et al. introduced ResNet, which
effectively addresses challenges related to vanishing
gradients and network degradation stemming from
increased network depth. This network structure
significantly speeds up the training of neural
networks and greatly enhances their generalization
capabilities and robustness.
Comprising multiple residual units, Residual
Neural Networks include a convolutional layer (conv
layer), batch normalization layer (BN), and ReLU in
each unit. At the heart of a residual unit lies the direct passage of the input to the output. If the input to the unit is denoted $x$ and the desired underlying mapping $H(x)$, the residual function is $F(x) = H(x) - x$, and the unit's output is $F(x) + x = H(x)$; the network's learning target thus becomes the residual $H(x) - x$. By fitting the residual mapping, ResNet simplifies the learning process, making the optimization of deep networks easier and alleviating gradient vanishing and degradation with increased depth. Another key feature of residual units is the shortcut connection (identity mapping), which changes the learning target from the direct mapping $H(x)$ to the residual $H(x) - x$.
When the input and output dimensions are
consistent, the shortcut connections can directly add
the input to the output. If dimensions are inconsistent,
there are two strategies for handling this:
Zero-padding is used to increase dimensions,
usually combined with pooling operations with a
stride of 2 for downsampling. This method does not
add extra parameters. Projection shortcuts are
typically adjusted through 1×1 convolution to change
dimensions, which increases some parameters and
computational load. These shortcut connection
strategies not only maintain the network’s parameters
and computational load but also significantly improve
the model's training speed and efficiency in deeper network structures, effectively preventing performance degradation.
Figure 3: The structure of ResNet (Photo/Picture credit: Original).
With these innovative
designs, ResNet has shown immense potential and
practical utility, particularly in deep medical image
processing in the field of deep learning.
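The following is a minimal PyTorch sketch of one such residual unit, including both the identity shortcut and the 1x1 projection shortcut for mismatched dimensions; the channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """One conv-BN-ReLU residual unit with a shortcut connection (a sketch)."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Identity shortcut when dimensions match; otherwise a 1x1
        # projection shortcut adjusts the dimensions.
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The body learns the residual F(x) = H(x) - x; the input is added back.
        return torch.relu(self.body(x) + self.shortcut(x))
```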
2.2.3 Application of Transformers in
Medical Imaging
Transformers, initially designed for Natural
Language Processing, effectively capture long-range
dependencies using self-attention mechanisms. This
technology has been adapted for visual tasks such as
object detection with Detection transformer (DETR),
semantic segmentation, and image classification with
vision in transformer (ViT) (Touvron, 2021). In
multimodal medical imaging, where capturing long-
range interactions is essential, Transformers enhance
deep learning models by effectively integrating
multimodal data, outperforming traditional CNNs
that excel in local feature extraction but struggle with
distant relationships (Touvron, 2021).
The fundamental element of the Transformer is the self-attention (SA) mechanism: SA allows the model to enhance its predictions by drawing on other parts of a data sample during processing. In the self-attention layer, the input X is transformed into three distinct matrices: the query matrix Q, the key matrix K, and the value matrix V, as shown in Figure 4.
Weights are assigned based on the dot product of
queries and their respective keys. The attention
function is calculated as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (2)$$

where $d_k$ is the dimension of the key vectors; this normalization helps stabilize the gradients.
Multi-head self-attention (MSA) (Figure 5): MSA is
central to the Transformer architecture, enhancing the
model's ability to learn information from different
representational subspaces by splitting the input into
multiple parts and processing them in parallel. The
computation of MSA is expressed as:
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$
$$\mathrm{MSA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)\,W^{O} \qquad (3)$$

where the projection matrices $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are trainable parameters.
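A minimal PyTorch sketch of equations (2) and (3) follows; the tensor shapes are illustrative assumptions, and torch.nn.MultiheadAttention stands in for the per-head projections of equation (3).

```python
import math
import torch

def self_attention(q, k, v):
    """Scaled dot-product attention, per equation (2)."""
    d_k = q.size(-1)
    weights = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
    return weights @ v

# Multi-head attention splits the model dimension across h parallel heads;
# PyTorch bundles the projections W_i^Q, W_i^K, W_i^V and W^O of equation (3).
mha = torch.nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 16, 64)   # (batch, tokens, dim), illustrative sizes
out, _ = mha(x, x, x)        # self-attention: Q = K = V = x
```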
Multi-layer perceptron (MLP): located above the MSA layer and composed of linear layers and activation functions (such as GELU), the MLP provides the model with non-linear processing capability. As in ResNet, the MLP and MSA layers integrate layer normalization and skip connections to aid the training of deep networks.
Figure 4: The structure of the SA (Photo/Picture credit: Original).
Figure 5: The structure of the MSA (Photo/Picture credit: Original).
The input layer includes several embeddings and tokens: patch embeddings (which may be produced by a CNN), positional embeddings (encoding spatial information), a class embedding (a trainable vector), and the corresponding patch and class tokens. The class token, prepended to the patch tokens, passes through the Transformer layers and is fed to a fully connected layer for classification. This design allows the Transformer to excel at handling multimodal medical images with complex, long-range dependencies.
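The following sketch illustrates this input construction in PyTorch; the image size, patch size, and embedding dimension are illustrative assumptions, as the cited works use varying configurations.

```python
import torch
import torch.nn as nn

class ViTInput(nn.Module):
    """Patch, class, and positional embeddings for a ViT-style input (a sketch)."""

    def __init__(self, img: int = 224, patch: int = 16, dim: int = 192):
        super().__init__()
        n = (img // patch) ** 2                          # number of patch tokens
        # A strided conv is a common way to compute patch embeddings.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))       # class token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))   # positional emb.

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.proj(x).flatten(2).transpose(1, 2)  # (B, n, dim)
        cls = self.cls.expand(x.size(0), -1, -1)          # one class token per image
        return torch.cat([cls, tokens], dim=1) + self.pos

# Two RGB images -> 196 patch tokens + 1 class token of dimension 192.
tokens = ViTInput()(torch.randn(2, 3, 224, 224))  # shape (2, 197, 192)
```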
3 RESULTS AND DISCUSSION
This section analyzes and compares the performance
of different deep learning models such as CNN,
ResNet, and Transformer on various pathological
datasets, revealing the characteristics and advantages
of each model.
3.1 Performance of CNN Compared
with Different Models
In this study, the ILD database (Li, 2014) was used, consisting of 113 HRCT lung image sets with 2,062 2D regions of interest (ROI) classified into five
ILD types: Normal (N), Emphysema (E), Ground
Glass (G), Fibrosis (F), and Micronodules (M). CT
slices were segmented into 32×32-pixel semi-
overlapping blocks, using only those where at least
75% of pixels were within ROIs, totaling 16,220
blocks. Three feature extraction methods—Scale
Invariant Feature Transform (SIFT), which identifies
central key points; rotation-resistant Local Binary
Patterns (LBP) at varying resolutions; and Restricted
Boltzmann Machine (RBM) for unsupervised
learning—were compared using a support vector machine (SVM) classifier. In contrast, the CNN directly
classifies through three neural network layers,
optimizing performance via parameter fine-tuning
and backpropagation, without needing a separate
classifier.
The classification outcomes were assessed
through precision and recall metrics. Figures 6 and 7
illustrate that the CNN method delivered superior
classification performance, surpassing both SIFT and
LBP, demonstrating CNN's clear advantage in
automatic feature learning in medical imaging.
Despite challenges such as ambiguous visual
structures and limited training data, overfitting issues
can be effectively mitigated by designing appropriate
network architectures and applying techniques like
dense dropout and input distortion.
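For reference, the precision and recall metrics used above can be computed as in the following sketch; the label vectors shown are hypothetical examples over the five ILD classes, not the study's data.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical per-block predictions over the five ILD classes
# (N, E, G, F, M); macro averaging weights each class equally.
y_true = ["N", "E", "G", "F", "M", "N", "G", "F"]
y_pred = ["N", "E", "G", "G", "M", "N", "G", "F"]
precision = precision_score(y_true, y_pred, average="macro")
recall = recall_score(y_true, y_pred, average="macro")
print(f"precision={precision:.2f}, recall={recall:.2f}")
```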
Figure 6: Classification results focused on recall metrics
(Li, 2014).
Figure 7: Classification results focused on Precision metrics
(Li, 2014).
3.2 Different Configurations of ResNet
Models in Medical Image
Processing
This section aims to evaluate the performance of
different configurations of ResNet models in medical
image processing. Three different data allocation strategies were used to train and test the ResNet models: Approach 1 employs a 60% training / 40% testing split; Approach 2 a 75% / 25% split; and Approach 3 an 80% / 20% split. At
the start of the experiments, all images were
converted to grayscale and enhanced for contrast
using the Contrast Limited Adaptive Histogram
Equalization (CLAHE) algorithm to standardize
initial inputs. During the learning process of the
ResNet model, in addition to the basic convolutional
layers, batch normalization layers, ReLU activation
layers, and pooling layers were included. Each ResNet architecture contains multiple residual blocks organized into five stages, with pooling layers used primarily during feature extraction and before the classification layer.
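A minimal sketch of the grayscale-plus-CLAHE preprocessing step with OpenCV follows; the clip limit, tile grid size, and file name are illustrative assumptions, since the study's exact parameters are not stated.

```python
import cv2

# Hypothetical input image; the dataset's actual files are not specified here.
image = cv2.imread("colon_slide.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# CLAHE: contrast limited adaptive histogram equalization.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(gray)
```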
All models were optimized using the SGD optimizer with momentum, accompanied by learning-rate adjustments and with binary cross-entropy as the objective function.
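A sketch of this training configuration in PyTorch follows; the learning rate, momentum, and input size are illustrative assumptions, not the study's reported settings.

```python
import torch
import torchvision

# ResNet-18 with a single logit for binary classification.
model = torchvision.models.resnet18(num_classes=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = torch.nn.BCEWithLogitsLoss()  # binary cross-entropy on logits

# One illustrative training step on a dummy batch.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8, 1)).float()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```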
Table 1 shows the performance of the ResNet-18 architecture under the different training/testing ratios: Approach 3 achieved the highest accuracy (85%), while Approach 2 achieved the highest sensitivity (96%).
Table 1: ResNet-18 evaluation measures across various test datasets (Sarwinda, 2021).

Configuration | Accuracy | Specificity | Sensitivity
Approach 1    | 73%      | 83%         | 64%
Approach 2    | 81%      | 63%         | 96%
Approach 3    | 85%      | 87%         | 83%
Table 2 illustrates the efficacy of ResNet-50, whose peak accuracy of 88% stems from the 75% and 80% training configurations. The Approach 1 configuration achieved the highest specificity (92%). Comparing the
results of ResNet-18 and ResNet-50 shows that
ResNet-50 has better accuracy and sensitivity on the
same dataset, indicating that stacking more
convolutional layers can enhance the ability to learn
features.
Table 2: Evaluation measures across various test datasets for ResNet-50 (Sarwinda, 2021).

Configuration | Accuracy | Specificity | Sensitivity
Approach 1    | 77%      | 92%         | 60%
Approach 2    | 88%      | 87%         | 89%
Approach 3    | 88%      | 83%         | 93%
Data from Table 3 indicate that the training and
testing times for ResNet-18 are generally lower than
for ResNet-50, primarily due to differences in the
number of layers in the architecture. Additionally,
performance analysis of both models shows that the
ResNet variants achieve an accuracy range from 73%
to 88% and a sensitivity range from 64% to 96% in
colorectal cancer detection, proving the effectiveness
of the ResNet architecture in such applications.
Table 3: Evaluation of execution time per epoch for ResNet-18 and ResNet-50 (Sarwinda, 2021).

Configuration | T_train (seconds) | T_test (seconds)
Approach 1    | 77%               | 60%
Approach 2    | 88%               | 89%
Approach 3    | 88%               | 93%
This detailed performance assessment of ResNet
models underscores their adaptability and efficiency
in handling complex medical imaging tasks across
different configurations and datasets.
3.3 Study on the Use of Transformers
in CT Medical Imaging
This section explores the application of Transformer
models in medical imaging processing, particularly
focusing on their performance in computed tomography (CT) image analysis. CT, especially for diagnosing chest diseases, provides an ideal scenario for Transformers due to the high contrast between gas and tissue.
Table 4: Evaluation of Transformers for computed tomography (He, 2023).

Citation     | Dataset     | Accuracy (%) | Illness  | Body part
COVID-ViT    | COV19-CT-DB | 96.0         | COVID-19 | Lung
Zhang et al. | COV19-CT-DB | 76.6         | COVID-19 | Lung
Than et al.  | COVID-CTset | -            | COVID-19 | Lung
Li et al.    | -           | 98.0         | COVID-19 | Lung
As shown in Table 4, Than et al. studied the
impact of patch size on ViT's performance in
classifying COVID-19 and other lung pathologies.
They found that a patch size of 32x32 achieved the
best accuracy, revealing a trade-off between patch
size and model performance. Li et al. developed a
ViT-based COVID-19 diagnostic platform that
converts CT images into streamlined patches suitable
for ViT input requirements. Using a teacher-student
model strategy, they enhanced the model's diagnostic
capabilities by distilling knowledge from CNNs pre-
trained on natural images. Zhang et al. first
segmented the lung areas in CT images using U-Net, then fed the segmented lung regions into a Swin Transformer for feature extraction. This strategy
significantly reduced the computational load of the
Transformer model. The above studies highlight the
role of pretraining in CT image classification. Using attention mechanisms to reduce
computational complexity is particularly crucial for
processing large-volume images.
This chapter has reviewed three significant deep learning models—CNN, ResNet, and Transformer—in medical imaging. CNNs are superior in automatic
feature extraction, outperforming traditional
techniques in classifying complex lung images, with
network enhancements like dense dropout improving
accuracy and recall. ResNet, using deep architectures
and residual learning, excels in tasks requiring the
detection of subtle differences, benefiting from its
efficient data use. Transformers handle complex CT
scans effectively, including those for COVID-19, by
managing long-range dependencies with attention
mechanisms and adaptability to various patch sizes
and pretraining approaches. This analysis highlights
the distinct advantages and contributions of each
model to medical imaging technology.
4 CONCLUSIONS
This research focuses on evaluating deep learning
models such as CNN, ResNet, and Transformer in
medical image processing, with the objective of
enhancing diagnostic accuracy across various
imaging modalities. The study involves
methodological applications and analyses of each
model on different pathological datasets, including
interstitial lung diseases and knee joint injuries, examined through HRCT and MRI scans respectively.
Extensive experiments were conducted to
evaluate the proposed methods. The experimental
results revealed that CNN excels in automatic feature
extraction, particularly in environments with limited
data and ambiguous visual structures. ResNet
demonstrated superior performance in managing
depth and complexity, significantly enhancing the
model's training and generalization capabilities in
deeper network architectures. Meanwhile,
Transformers displayed their advantage in handling
complex, high-dimensional image data, utilizing their
attention mechanisms to enhance model predictive
capabilities on large and diverse datasets.
Future research will explore integrating
multimodal imaging data to analyze the combined
effects of various imaging modalities using advanced
machine learning frameworks. This aims to enhance
diagnostic precision and robustness, addressing the
limits of single-modality analysis and advancing AI-
driven diagnostic tools in clinical settings, potentially
improving patient outcomes and healthcare efficiency.
REFERENCES
Dai, Y., Gao, Y., & Liu, F. 2021. Transmed: Transformers
advance multi-modal medical image classification.
Diagnostics, 11(8), 1384.
He, K., Gan, C., Li, Z., Rekik, I., Yin, Z., Ji, W., ... & Shen,
D. 2023. Transformers in medical image analysis.
Intelligent Medicine, 3(1), 59-78.
He, K., Zhang, X., Ren, S., & Sun, J. 2015. Spatial pyramid
pooling in deep convolutional networks for visual
recognition. IEEE transactions on pattern analysis and
machine intelligence, 37(9), 1904-1916.
Ildgdb. 2024. Dataset. http://ildgdb.org/
Jannin, P., Krupinski, E., & Warfield, S. K. 2006.
Validation in medical image processing. IEEE
Transactions on Medical Imaging, 25(11), 1405-9.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. 2012.
Imagenet classification with deep convolutional neural
networks. Advances in neural information processing
systems, 25.
Lin, M., Chen, Q., & Yan, S. 2013. Network in network.
arXiv preprint arXiv:1312.4400.
Li, Q., Cai, W., Wang, X., Zhou, Y., Feng, D. D., & Chen,
M. 2014. Medical image classification with
convolutional neural network. In 2014 13th
international conference on control automation robotics
& vision (ICARCV) (pp. 844-848). IEEE.
Sarwinda, D., Paradisa, R. H., Bustamam, A., & Anggia, P.
2021. Deep learning in image classification using
residual network (ResNet) variants for detection of
colorectal cancer. Procedia Computer Science, 179,
423-431.
Simonyan, K., & Zisserman, A. 2014. Very deep
convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556.
Sonka, M., & Fitzpatrick, J. M. 2000. Handbook of medical
imaging: Volume 2, Medical image processing and
analysis. SPIE.
Stanford. 2024. MRNet. https://stanfordmlgroup.github.io/competitions/mrnet/
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles,
A., & Jégou, H. 2021. Training data-efficient image
transformers & distillation through attention. In
International conference on machine learning (pp.
10347-10357). PMLR.