riched dataset, given otherwise identical training configurations. We also argue that aggressive augmentation, for a consistent number of epochs, achieves results similar to or better than enrichment, despite the class balance achieved by the latter. Moreover, a combination of several state-of-the-art techniques, such as a modified Stochastic Weight Averaging with a Cosine Annealing scheduler used for training the segmentation module, further improves the performance.
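To make this training recipe concrete, the sketch below shows a generic combination of Stochastic Weight Averaging with a cosine-annealed learning rate, using the utilities shipped with PyTorch (>= 1.6). It is an illustrative baseline under assumed settings, not the exact modified SWA variant used in this work; the tiny network, random data, epoch counts, and learning rates are placeholders.

```python
# Minimal sketch (assumed settings): SWA + cosine annealing in PyTorch.
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel, update_bn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 1, 1))                  # stand-in segmenter
data = TensorDataset(torch.randn(32, 1, 64, 64),
                     torch.randint(0, 2, (32, 1, 64, 64)).float())
loader = DataLoader(data, batch_size=8)
criterion = nn.BCEWithLogitsLoss()

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
swa_model = AveragedModel(model)   # keeps the running average of the weights
swa_start = 5                      # epoch after which averaging begins (assumed)

for epoch in range(10):
    for images, masks in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimizer.step()
    scheduler.step()               # cosine-annealed learning rate
    if epoch >= swa_start:
        swa_model.update_parameters(model)

update_bn(loader, swa_model)       # recompute BatchNorm stats (no-op: no BN here)
```

In the standard SWA recipe it is the averaged weights (swa_model) that are evaluated at test time, rather than the last SGD iterate.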
A powerful characteristic of our pipeline resides in its replaceable modules: as the state of the art advances, both the classification and the segmentation modules can be swapped for improved versions, potentially leading to better results. We recommend such a pipeline for medical segmentation problems in general; while we consider the deep learning modules indispensable (given a sufficiently complex dataset), the inclusion of the small-regions-of-interest elimination component is debatable: depending on the exact nature of the medical problem, the elimination threshold can vary considerably if the module is implemented.
As possible improvements, we consider that the Tversky Loss (Salehi et al., 2017) could improve the final results, since it has shown promise on both 2D and 3D image segmentation; a minimal sketch is given below. In addition, it would be relevant to investigate whether class weights that penalize false negative errors more heavily could also contribute to an increased recall. Last but not least, test-time augmentation has been widely used recently and could further increase performance; a sketch of this technique follows as well.
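As an illustration of the first two suggestions, the following is a minimal sketch of a binary Tversky loss in PyTorch. The alpha and beta hyper-parameters weight false positives and false negatives respectively, so choosing beta greater than alpha penalizes false negatives more heavily and should favour recall. The default values and the per-sample flattening are illustrative assumptions, not settings validated in this work.

```python
# Minimal sketch of a binary Tversky loss (after Salehi et al., 2017).
import torch

def tversky_loss(logits, targets, alpha=0.3, beta=0.7, eps=1e-6):
    probs = torch.sigmoid(logits).flatten(1)       # (batch, pixels)
    targets = targets.flatten(1)
    tp = (probs * targets).sum(dim=1)              # soft true positives
    fp = (probs * (1 - targets)).sum(dim=1)        # soft false positives
    fn = ((1 - probs) * targets).sum(dim=1)        # soft false negatives
    tversky = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return (1 - tversky).mean()

# usage on dummy predictions and masks
logits = torch.randn(4, 1, 64, 64)
masks = torch.randint(0, 2, (4, 1, 64, 64)).float()
print(tversky_loss(logits, masks))
```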
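Similarly, a minimal sketch of test-time augmentation for a binary segmentation model is given below: predictions are averaged over the original image and its horizontal flip, and further transforms (scaling, small rotations) could be added in the same way. The stand-in model and random input are placeholder assumptions.

```python
# Minimal sketch of test-time augmentation via horizontal flipping.
import torch
import torch.nn as nn

def predict_tta(model, images):
    model.eval()
    with torch.no_grad():
        pred = torch.sigmoid(model(images))
        flipped = torch.sigmoid(model(torch.flip(images, dims=[-1])))
        pred_flip = torch.flip(flipped, dims=[-1])   # undo the flip
    return (pred + pred_flip) / 2                    # average the two views

model = nn.Conv2d(1, 1, 3, padding=1)                # stand-in segmenter
masks = predict_tta(model, torch.randn(2, 1, 64, 64))
```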
REFERENCES
Chen, L., Zhu, Y., Papandreou, G., Schroff, F., and Adam,
H. (2018). Encoder-decoder with atrous separable
convolution for semantic image segmentation. CoRR,
abs/1802.02611.
Chollet, F. (2016). Xception: Deep learning with depthwise
separable convolutions. CoRR, abs/1610.02357.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler,
M., Benenson, R., Franke, U., Roth, S., and Schiele,
B. (2016). The cityscapes dataset for semantic urban
scene understanding. In Proc. of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR).
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on com-
puter vision and pattern recognition, pages 248–255.
IEEE.
Geiger, A., Lenz, P., Stiller, C., and Urtasun, R. (2013).
Vision meets robotics: The kitti dataset. International
Journal of Robotics Research (IJRR).
Hu, J., Shen, L., and Sun, G. (2017). Squeeze-and-
excitation networks. CoRR, abs/1709.01507.
Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S.,
Chute, C., Marklund, H., Haghgoo, B., Ball, R. L.,
Shpanskaya, K. S., Seekins, J., Mong, D. A., Halabi,
S. S., Sandberg, J. K., Jones, R., Larson, D. B., Lan-
glotz, C. P., Patel, B. N., Lungren, M. P., and Ng, A. Y.
(2019). Chexpert: A large chest radiograph dataset
with uncertainty labels and expert comparison. CoRR,
abs/1901.07031.
Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D. P.,
and Wilson, A. G. (2018). Averaging weights leads
to wider optima and better generalization. CoRR,
abs/1803.05407.
Kingma, D. P. and Ba, J. (2014). Adam: A method for
stochastic optimization. CoRR, abs/1412.6980.
Lin, T., Dollár, P., Girshick, R. B., He, K., Hariharan, B., and Belongie, S. J. (2016). Feature pyramid networks for object detection. CoRR, abs/1612.03144.
Lin, T., Goyal, P., Girshick, R. B., He, K., and Dollár, P. (2017). Focal loss for dense object detection. CoRR, abs/1708.02002.
Loshchilov, I. and Hutter, F. (2016). SGDR: Stochastic gradient descent with warm restarts. CoRR, abs/1608.03983.
Neuhold, G., Ollmann, T., Bulo, S. R., and Kontschieder,
P. (2018). The mapillary vistas dataset for semantic
understanding of street scenes. In 2017 IEEE Interna-
tional Conference on Computer Vision (ICCV), vol-
ume 00, pages 5000–5009.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-
net: Convolutional networks for biomedical image
segmentation. In Medical Image Computing and
Computer-Assisted Intervention (MICCAI), volume
9351 of LNCS, pages 234–241. Springer. (available
on arXiv:1505.04597 [cs.CV]).
Salehi, S. S. M., Erdogmus, D., and Gholipour, A. (2017).
Tversky loss function for image segmentation us-
ing 3d fully convolutional deep networks. CoRR,
abs/1706.05721.
Society for Imaging Informatics in Medicine (SIIM), American College of Radiology (ACR), S. o. T. R. S. M. (2019). SIIM-ACR pneumothorax segmentation. https://www.kaggle.com/c/siim-acr-pneumothorax-segmentation.
Szegedy, C., Ioffe, S., and Vanhoucke, V. (2016). Inception-
v4, inception-resnet and the impact of residual con-
nections on learning. CoRR, abs/1602.07261.
Tan, M. and Le, Q. V. (2019). Efficientnet: Rethink-
ing model scaling for convolutional neural networks.
CoRR, abs/1905.11946.
Timbus, C., Miclea, V.-C., and Lemnaru, C. (2018). Se-
mantic segmentation-based traffic sign detection and
recognition using deep learning techniques. pages
325–331.
Wu, Y. and He, K. (2018). Group normalization. CoRR,
abs/1803.08494.
Xie, S., Girshick, R. B., Dollár, P., Tu, Z., and He, K. (2016). Aggregated residual transformations for deep neural networks. CoRR, abs/1611.05431.
Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. (2016). Pyra-
mid scene parsing network. CoRR, abs/1612.01105.
Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. (2017).
Learning transferable architectures for scalable image
recognition. CoRR, abs/1707.07012.