Brain MRI Images Pre-processing of Heterogeneous Data-sets for Deep

Learning Applications

S. Ostellino

, A. Benso

and G. Politano

Politecnico di Torino, Computer Science and Automation Department, Torino, Italy

Keywords:

Multiple Sclerosis, MRI, Imaging, Pre-processing, Deep Learning, Data Preparation, Heterogeneous

Data-sets, Real Clinical Data.

Abstract:

Automatic segmentation of tissues and lesions is a very important step in any Artiﬁcial Intelligence pipeline de-

signed to analyze medical images (especially MRI). This is particularly true for brain MRI images of patients

affected by neurological pathologies like Multiple Sclerosis (MS). To perform well, cutting edge Artiﬁcial

Intelligence approaches like Deep Learning need a huge amount of training data. Unfortunately, available

data-sets of MRI medical images often lack annotations, standardized acquisition protocols, formats and di-

mensions. This heterogeneity in the data-sets makes it often very difﬁcult to use and integrate different data-

sets in the same pipeline. Available image pre-processing tools have speciﬁc requirements and might not be

adequate for extensive usage with heterogeneous data-sets. This paper presents an on-going work on a com-

prehensive and consistent brain MRI images pre-processing pipeline for Deep Learning applications enabling

the creation of a congruous data-set. The pipeline was tested with the public available ISBI2015 data-set.

1 INTRODUCTION

Magnetic resonance imaging (MRI) is a non-invasive

and fundamental diagnostic and monitoring tool for

many of the existing neurological conditions. Among

them, Multiple Sclerosis (MS) is a chronic autoim-

mune disease for which MRI is particularly impor-

tant as the disease needs to be carefully monitored

with at least one MRI per-year. The progression of

MS is variable between patients and a good monitor-

ing is crucial for correct therapeutic choices: MRI

is a support during disease diagnosis and follow-

up, together with others indicators of disease sta-

tus (Inojosa, 2021). MRI allows to visualize differ-

ent brain tissues depending on the chosen acquisi-

tion sequence and acquisition protocol. The modal-

ities that allow MS monitoring are the T1-weighted

(T1w), that allows easy annotation of healthy tissues,

and T2-weighted (T2w) and FLAIR images that are

used for detecting inﬂammatory lesions, indicators of

disease activity. Images are typically visually exam-

ined by neuro-radiologists that ﬁll a written clinical

report that in many cases is affected by intra-and-

https://orcid.org/0000-0002-6275-3214

https://orcid.org/0000-0003-3433-7739

https://orcid.org/0000-0001-5268-9899

inter reader variability; therefore, a lot of attention

is now being paid to tools and methodologies for the

automatic segmentation of tissues and lesions (Kaur,

2021) (Zeng, 2020). The current problem with Deep

Learning pipelines, is that not only they require very

large training sets, but also they enforce strict require-

ments in the input data-sets’ format, size, and image

quality. This makes it often impossible to use differ-

ent data-sets as inputs for the same pipeline because

the differences in the data-sets can be easily result in

biases in the segmentation results.

The purpose of this paper is to present a compre-

hensive pre-processing pipeline able to prepare raw

MRI brain images data-sets (with the corresponding

lesions masks, if present) so that they can be directly

fed into Deep Learning architectures. The pipeline

has been implemented entirely in Python and partic-

ular attention has been paid in giving the possibility

to directly access each functions’ parameters to allow

further customizations/optimizations (see 3.1).

2 PROBLEMS DEFINITION

Several issues can negatively inﬂuence the perfor-

mances of a segmentation strategy: not only non-

brain tissues are a source of errors and need to be

Ostellino, S., Benso, A. and Politano, G.

Brain MRI Images Pre-processing of Heterogeneous Data-sets for Deep Learning Applications.

DOI: 10.5220/0010828500003123

In Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2022) - Volume 3: BIOINFORMATICS, pages 115-120

ISBN: 978-989-758-552-4; ISSN: 2184-4305

115

removed, but MS lesions come in different locations

and sizes. Moreover, brain MRI images suffer the

presence of noise artifacts, non-uniformities, and are

affected by the intrinsic differences in the anatomy of

human brains; the lack of annotated data (MRI im-

ages without a corresponding lesion mask) is also se-

rious limitation that can cause misclassiﬁcation prob-

lems, and result in a reduction of performances in

lesion identiﬁcation, both in Machine Learning and

Deep Learning approaches. To overcome these prob-

lems, Artiﬁcial Intelligence requires very large train-

ing sets to correctly ”learn” to identify the relevant

features (tissues/lesions) on the image. Unfortunately,

on top of the physiological variability of the human

brains and MS lesions, different data-sets also come

in different formats, have been generated by differ-

ent equipment that introduce different artifacts, have

different dimensions, number of slices, ﬁle formats.

On the other hand, Deep learning methods require, as

input, images with a certain standard images’ ﬁle for-

mat (such as PNG), so it is not possible to feed a deep

learning architecture directly with an image stored

in a typical medical image format (such as NIfTI or

DICOM). Existing software and tools for image pre-

processing have some limitations. Firstly, they are of-

ten maintained by separated groups: when updating to

a new version of one of these tools, versioning prob-

lems and inconsistencies may occur if such update is

not supported by the other tools or plugins, such as

CBS Tools, JIST, TOADS-CRUISE, and BrainSuite.

Secondly, they lack in easy customization, and have

often strict requirements in terms of settings, making

it difﬁcult to simply obtaining homogenous and weel

structured data-sets. All the mentioned problems of-

Figure 1: Steps of the proposed pipeline.

ten force scientists to rely on a single data-set, hope-

lessly affecting the quality of the Machine Learning

or Deep Learning pipeline performance.

3 MATERIALS AND METHODS

Pre-processing is crucial for any data-analysis

pipeline, and this is particularly true for medical im-

ages. The proposed pipeline (summarized in Figure

1) includes the following typical pre-processing steps

for brain MRI images (which are usually performed

by separate and independent tools):

• MRI sequence selection: consists in selecting

the MRI modality that will be processed. Our

pipeline focuses on T2w images, as MS lesions

are mostly visible in such modality;

• Image registration: registering an image means,

in this case, matching it with a reference model.

It is a key step and a prerequisite for all applica-

tions that want to compare data-sets among sub-

jects or across time (Toga, 2019); raw MRI im-

ages are not registered and may have different

spacing and slice resolution (Alam, 2016). The

registration step consists in a set of transformation

of the raw image that optimises a similarity index

with the reference image. The registered image

is obtained linearly interpolating the initial image

domain into the new domain, as image ﬁles con-

sist of a variable number of slices, each slice cor-

responding to a different longitudinal brain sec-

tion: when the number of slices in the original im-

age is not consistent with the number of images in

the atlas, the missing slices are interpolated. Our

pipeline registers each MRI image and then saves

it back as NIfTI ﬁles; moreover, this step makes

sure the MRI image contains the same number of

slices as the reference image by (if necessary) in-

terpolating missing slices;

• Brain-extraction: also known as skull-stripping,

it is a step which removes tissues that are not

of interest, such as skull and dura mater. In-

cluding non-brain tissues is a known source of

errors (Rehman, 2020); there are several possi-

ble brain-extraction methods, for instance rely-

ing on deep learning techniques or on traditional

morphological operations (Kalavathi, 2016). The

brain-extraction method proposed in this paper is

an adaptation of the method proposed by Gam-

bino et al. (Gambino, 2011), and uses a combina-

tion of morphological operations;

• Bias ﬁeld correction and noise reduction: it cor-

rects the bias ﬁeld, that is a low frequency inten-

sity nonuniformity present in the image data as

inhomogeneity and illumination nonuniformity.

• Final data-set creation: images (and correspond-

ing masks) are saved back as NIfTI ﬁles and as

BIOINFORMATICS 2022 - 13th International Conference on Bioinformatics Models, Methods and Algorithms

116

PNG ﬁles in the correct format required by the

chosen neural network pipeline.

The pipeline was tested on the raw longitudinal

T2w images of the public avaiable ISBI2015 data-set

that consists in MRI (acquired at 1 to 4 different time

points) images of 5 subjects diagnosed with differ-

ent MS subtypes. Lesions masks of two independent

readers are also given. The data-set includes, besides

the raw images, the images that were processed when

creating the ISBI2015 data-set with the MIPAV soft-

ware, integrated with several plugins such TOADS-

CRUISE plugins. We used the latter images as a com-

parison to evaluate the results of our pipeline. The

MIPAV package, as many others such as CBS Tools,

JIST, and TOADS-CRUISE, are maintained by sepa-

rate groups: when a new version of one is released,

instabilities may occur if such update is not supported

by the other tools or plugins.

The proposed pipeline addresses this limitation by

using only stable Python libraries, chosen based on

the clarity of their documentation and on their perfor-

mances:

• ANTsPy for Registration, Brain extraction, and

Bias Field Correction;

• Dicom2nifti for DICOM to NIfTI conversion;

• SimpleITK for NIfTI ﬁles storage.

As previously pointed out, using different data-

sets acquired with different purposes and in differ-

ent centers, is fundamental for developing accurate

and efﬁcient automatic segmentation methods based

on Deep Learning.

3.1 Pipeline Description

ANTsPy library was chosen for implementing the

most important steps of the pipeline because of its

good performances and extensive documentation. To

give the reader an idea, Table 1 shows the NIfTI to

NumPy (NumPy Python library) conversion times of

three common libraries on a MacBook Pro laptop.

This time is important because in each MRI data-set

there are thousands of images to be converted.

Table 1: Libraries performances.

ITK time 4.375 seconds

NIBABEL time 2.902 seconds

ANTS time 1.713 seconds

It is possible to easy access speciﬁc parameters as

shown in Table 2 thus allowing the pipeline steps to

be customized as well as optimized. Such parameters

directly affect the core points of the processing func-

tions.

Image Conversion

Image format conversion is fundamental to improve

the ability to generalize and work with more diversi-

ﬁed data. The present pipeline supports, if needed,

the conversion from a series of DICOM images be-

longing to a single scan into a single NIfTI ﬁle, as the

DICOM format is another widespread medical image

format.

Table 2: Pipeline parameters.

Registration Transform, atlas

Brain extraction Iterations, kernels

Bias ﬁeld correction Correction parameters

Image Registration

Images were registered to a reference atlas that was

chosen accordingly to the registration step proposed

by the ISBI data-set guidelines as a matter of con-

sistency: as they performed manual segmentation on

FLAIR images with the help of T1w and T2w im-

ages, and as there is no matching standard atlas for

FLAIR images, we rigidly registered T2w images to

the corresponding reference T2 standard atlas in the

MNI space

, so that the correspondence between im-

ages and lesion masks was obtained. For the purposes

of this paper and its future implementations for the

development of segmentation via deep learning, only

T2w images are considered at this point (Abderrahim,

2020). The atlas that was used is the ICBM Average

Brain linearly transformed to Talairach space, adapted

for use with the MNI Linear Registration Package.

The pipeline is designed to easily choose to use dif-

ferent atlases too, as it is sufﬁcient to download and

import the desired ﬁle. Registration is done by deter-

mining the transformation needed to match the source

image with the target atlas, optimizing the similarity

index between them. The registered image is then ob-

tained linearly interpolating the initial image domain

into the new domain. The registration step is particu-

larly important when different images are put together

in a comprehensive set. Due to different acquisition

parameters and settings, the number of slices (being

each slice a section of the brain) between different

image ﬁles can vary: registration makes it possible

to uniform the number of slices to an atlas, conse-

quently matching the resolution of the image to atlas

resolution (in this case with a resolution of 1mm

) via

interpolation, so that the different scans match each

other in terms of anatomical references and in terms

of number of images per-scan. This step is extremely

http://nist.mni.mcgill.ca/icbm-152lin/

Brain MRI Images Pre-processing of Heterogeneous Data-sets for Deep Learning Applications

117

important to reduce the input variability, which is fun-

damental in the development of neural networks. As

an example, the original ISBI2015 T2w NIfTI ﬁles

contain 70 slices, while the registered image consists

of 181 slices.

Figure 2: Brain extraction steps.

Brain Extraction

The steps that compose the brain extraction are brieﬂy

summarized in Figure 2. Such steps were imple-

mented adapting the ﬁndings of Gambino et al. (Gam-

bino, 2011). Skull-stripping is usually implemented

on T1w images, but in our pipeline it is adapted to

work directly on T2w images, without requiring inter-

mediate steps that would need to further register T2w

images to T1w images, in order to apply the brain

mask obtained with the T1w images to the T2w scan.

Brain extraction develops in ﬁve simple steps; again

all parameters such as the number of iterations or the

features of morphological operators can be easily cus-

tomized (see Table 2):

1. n (n=3) erosions with a cross kernel,

2. n-1 dilatations with a cross kernel,

3. brain mask extraction,

4. ﬁnal erosion,

5. the original image is multiplied by the brain mask,

in order to obtain the brain.

Such parameters can be adapted in order to obtain

better performances as future implementations, and

most of them can be adapted in order to be able to

work independently from the MRI image modalities.

This is not true, for example, for tools such as MIPAV

MP2RAGE that, in order to obtain the skull stripping

of T2w images, require both T1w images and T2w

images as input.

Bias Field Correction

The N4 Bias Field Correction method (Tustison,

2010) was applied for estimating and correcting the

bias, giving as output the corrected image.

Deep Learning Data-set Creation

Once the images are processed, it is possible to con-

vert them into PNG ﬁles that can be then organised in

folders, and directly fed to a neural network. As not

all the images might be needed (for example, those

for which the pre-processing has failed, or those that

only partially show the cerebellum), the pipeline also

allows to store only images of interest, selecting their

corresponding identiﬁcation number. Particular atten-

tion was paid to preserve image orientations during

the creation of the data-set, as multiple conversions

between formats is needed for the pipeline itself to

work.

3.2 Working Example

The steps previously described are exempliﬁed in

Figure 3 and in Figure 4 where an image from the

ISBI2015 data-set was processed and then saved as

.png ﬁle for later use.

(a) Original image. (b) Brain mask.

Figure 3: Extraction of the brain.

Figure 3 shows a slice of the image that is fed to

the pipeline for the pre-processing: the brain mask

represents the result of the brain-extraction step. Such

mask is then multiplied to the initial slice, giving the

desired brain-stripped result.

In Figure 4 the lesion mask (coloured in blue) is

superimposed over the skull-stripped brain, and such

mask will serve as ground truth for a segmentation

algorithm. The masks were given together with the

ISBI2015 dataset, and they were obtained by two ex-

pert readers that annotated the MRIs previously pre-

processed with the MIPAV software. The correspon-

dence of such lesion masks with the brain images pro-

cessed with our pipeline indicates that it works as ex-

pected and gives comparable results.

BIOINFORMATICS 2022 - 13th International Conference on Bioinformatics Models, Methods and Algorithms

118

(a) Processed. (b) With lesions.

Figure 4: Skull-stripping.

3.2.1 Processing Time

We report here the processing times of the execu-

tion of the pipeline to the data-set. The processing

times

of the registration step and of the rest of the

process are separately reported, as the registration is

performed on the original raw image, while brain-

extraction and bias ﬁeld correction are done on the

registered image.

Table 3: Performances with 1 MRI scan.

Registration 4.5 sec.

Processing 143.52 sec.

Values in Table 3 refer to the processing of a single

NIfTI ﬁle that corresponds to a T2w MRI scan of a

subject. Each NIfTI ﬁle contains, once it is registered

with the atlas, 181 images.

4 DISCUSSION

Deep Learning pipelines require large training sets,

and have strict requirements in the input format, size,

and image quality. Moreover, Deep Learning per-

formances beneﬁt from homogenous and weel struc-

tured data-sets. This makes it often impossible to

use different data-sets directly as inputs because their

differences can be easily result in biases in the seg-

mentation results, affecting performances. This pa-

per presents a comprehensive pre-processing pipeline

able to prepare raw MRI brain images data-sets so

that they can be directly fed into deep learning ar-

chitectures. The pipeline has been implemented en-

tirely in Python and it guarantees the possibility to

directly access each functions’ parameters to allow

further customizations/optimizations. Several other

approaches recommended for MRI images process-

ing have strict constrains in terms of the images that

they can process: some are limited to 3D images,

Running on a MacBook Pro - macOS Bis Sur - 2.6GHz

Intel Core i7 6 core

and some require speciﬁc MRI modalities to function

properly. The presented pipeline differs from other

existing software as it incorporates, with ﬂexibility

and customizability, all the steps that are needed when

preparing heterogenous data-sets for deep learning

application, from the raw MRI scan (in different im-

age formats) to the image that will be the input of a

deep learning algorithm. Furthermore, it differs from

other known tools such as BrainSuite as it can be di-

rectly and easily incorporated into the development of

any automatic medical images analysis systems based

on Deep Learning architectures, without limiting it-

self to a processing or visualising tool for MRI im-

ages and, above all, relying on stable Python libraries

without depending on many plug-ins that can cause

inconsistencies and versioning problems. This ap-

proach can help with the task of data preparation and

image pre-processing, that cannot be ignored or un-

derestimated when constructing data-sets for Machine

Learning (and in particular Deep Learning) pipelines.

Future implementations will include an optimization

of the performances to include as input more MRI

modalities and formats, increasing its versatility, and

the incorporation of the pipeline as the backbone of

an innovative Deep Learning architecture targeted for

application in real clinical practice.

5 CONCLUSIONS

Handling the variability of MRI medical images

needs an efﬁcient pre-processing pipeline aimed at

solving practical issues that include but are not limited

to raw images format and dimensions, physiological

differences, image artifacts. This paper introduces a

pipeline to ﬁll the gap between heterogeneous data-

sets and their practical integration and usability in the

same Artiﬁcial Intelligence pipeline. We successfully

tested the pipeline to show how it helps ﬁlling the gap

between heterogeneous data-sets and their practical

use. Future work will include extensive comparative

evaluations of the pipeline, and an optimization of the

performances to include as input more MRI modali-

ties and formats.

Finally, it is important to point out that the pro-

posed approach is not necessarily strictly related to

brain MRI images, but could be easily adapted to

other MRI scans, such as chest imaging.

REFERENCES

Abderrahim, Marwa, e. a. (2020). Comparative study of rel-

evant methods for mri/x brain image registration. The

Brain MRI Images Pre-processing of Heterogeneous Data-sets for Deep Learning Applications

119

Impact of Digital Technologies on Public Health in

Developed and Developing Countries, page 338–47.

Alam, Fakhre, e. a. (2016). Evaluation of medical image

registration techniques based on nature and domain of

the transformation. Journal of Medical Imaging and

Radiation Sciences, 47(2):178–93.

Gambino, Orazio, e. a. (2011). Automatic skull stripping in

mri based on morphological ﬁlters and fuzzy c-means

segmentation. 2011 Annual International Conference

of the IEEE Engineering in Medicine and Biology So-

ciety, page 5040–43.

Inojosa, Hernan, e. a. (2021). Should we use clinical tools to

identify disease progression? Frontiers in Neurology,

11:1890.

Kalavathi, P., e. V. B. S. P. (2016). Methods on skull strip-

ping of mri head scan images — a review. Journal of

Digital Imaging, 29(3):365–79.

Kaur, Amrita, e. a. (2021). State-of-the-art segmentation

techniques and future directions for multiple sclerosis

brain lesions. Archives of Computational Methods in

Engineering, 28(3):951–77.

Rehman, Haﬁz Zia Ur, e. a. (2020). Conventional and deep

learning methods for skull stripping in brain mri. Ap-

plied Sciences, 10(5):1773.

Toga, A. W., e. P. M. T. (2019). The role of image regis-

tration in brain mapping. Journal of Big Data, 19(1-

2):3–24.

Tustison, Nicholas J., e. a. (2010). N4itk: Improved n3 bias

correction. IEEE transactions on medical imaging,

29(6):1310–20.

Zeng, Chenyi, e. a. (2020). Review of deep learning ap-

proaches for the segmentation of multiple sclerosis

lesions on brain mri. Frontiers in Neuroinformatics,

14:55.

APPENDIX

The code of the presented pipeline can be downloaded

from GitHub at the following address: https://github.

com/aSofworkOrange/BrainMRI-preproc

BIOINFORMATICS 2022 - 13th International Conference on Bioinformatics Models, Methods and Algorithms

120