
forms when tested on a separate, unseen dataset, high-
lighting its robustness and adaptability.
Tables 1, 2, and 3 present the DSC and IoU scores
of the network architectures on the test sets (CS v7,
C1000 v6, and BMGS B3) using BCE, Dice, and
STR loss functions, respectively. The tables also
show each architecture’s average (AVG) performance
across the three datasets. The best result for each test
is bolded and the second best is underlined. The first
noteworthy observation is that all tested methods perform
consistently well on all datasets, with only minor
differences between them. This indicates that all
segmentation networks generalize well across datasets
and are robust to overfitting.
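For reference, the two metrics reported in the tables can be sketched as follows. This is a minimal illustration for binary masks stored as nested lists of 0/1, not the evaluation code used in the experiments; the function name `dsc_and_iou` and the smoothing term `eps` are assumptions introduced here.

```python
def dsc_and_iou(pred, target, eps=1e-7):
    """Return (DSC, IoU) for two binary masks given as nested lists of 0/1.

    eps is a small smoothing term (an assumption of this sketch) that
    avoids division by zero when both masks are empty.
    """
    # Flatten both masks so they can be compared pixel by pixel.
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    if len(p) != len(t):
        raise ValueError("masks must have the same number of pixels")

    intersection = sum(a * b for a, b in zip(p, t))
    pred_area, target_area = sum(p), sum(t)
    union = pred_area + target_area - intersection

    dsc = (2 * intersection + eps) / (pred_area + target_area + eps)
    iou = (intersection + eps) / (union + eps)
    return dsc, iou


# Toy example: 2 overlapping pixels, 3 predicted, 3 in the ground truth.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
dsc, iou = dsc_and_iou(pred, target)
print(f"DSC={dsc:.4f} IoU={iou:.4f}")
```

For a single image the two metrics are monotonically related (IoU = DSC / (2 - DSC)), but this relation does not generally hold for scores averaged over a dataset.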
When using the BCE loss function (Table 1), the
CAFE-Net model consistently outperformed the other
architectures, achieving the highest DSC and IoU
scores across all datasets. It achieved an average DSC
of 0.9483 and an average IoU of 0.9151. The HSNet
model also produced competitive results, with an av-
erage DSC of 0.9459 and an average IoU of 0.9108.
The PVT model matched the HSNet’s average DSC
of 0.9459. A similar trend was observed with the STR
loss function (Table 3): CAFE-Net and PVT emerged as
the best models in terms of DSC and IoU, with average
values of 0.9491 DSC and 0.9168 IoU for CAFE-Net, and
0.9457 DSC and 0.9106 IoU for PVT.
Considering the Dice loss function (Table 2), the
best per-dataset results were split among the U-Net,
DeepLabV3, and CAFE-Net architectures. The U-Net model
achieved the highest scores on the BMGS B3 dataset,
with a DSC of 0.9678 and an IoU of 0.9385. The
DeepLabV3 model achieved a DSC of 0.9327 and
an IoU of 0.9022 on the CS v7 dataset. The CAFE-
Net model achieved a DSC of 0.9369 and an IoU of
0.8906 on the C1000 v6 dataset. Despite the indi-
vidual best performance of these methods, the PVT
model achieved the highest average DSC and IoU
scores across all datasets, with an average DSC of
0.9450 and an average IoU of 0.8868. CAFE-Net also
achieved strong results, ranking second in both average
DSC and IoU, with an average DSC of 0.9436 and an
average IoU of 0.9066 across all datasets.
Across all models and loss functions, the highest
performance was achieved by the CAFE-Net model with the
STR loss function, with an average DSC of 0.9491 and an
average IoU of 0.9168, followed by CAFE-Net with the
BCE loss function, with an average DSC of 0.9483
and an average IoU of 0.9151. The PVT model with
Dice loss function was the third-best model, with
an average DSC of 0.9450 and an average IoU of
0.8868. The results demonstrate the effectiveness of
the transformer-based model in capturing long-range
dependencies and enhancing the segmentation accu-
racy of cattle in complex farm environments.
Table 4 shows the complexity (number of parameters)
and inference time of the tested networks. Notably,
the transformer-based models have a similar number of
parameters to the CNN-based models, although their
inference time is slightly higher. It is also worth
highlighting that HSNet and PVT have fewer parameters
than HarDNet-MSEG, and that PVT also has a shorter
inference time while achieving better results.
3.1 Ensemble Results
We selected all models trained with BCE and, for the
other losses, only HSNet and CAFE-Net (the top two
performers with BCE). We then evaluated the diversity
among these models by calculating the dissimilarity
metric (Equation 9) for each pair of classifiers on the
validation set of CS v7. The results, shown in Figure 3,
reveal high diversity for several model pairs,
suggesting that combining them might improve the
robustness of model ensembles.
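The pairwise diversity analysis can be illustrated with a short sketch. Equation 9 is not reproduced in this section, so the code below assumes a simple disagreement measure, namely the fraction of pixels on which two binary predictions differ, averaged over the validation images; the actual metric may be defined differently. The function names and the toy model names are hypothetical.

```python
from itertools import combinations


def disagreement(mask_a, mask_b):
    """Fraction of pixels on which two binary masks differ (assumed metric)."""
    a = [v for row in mask_a for v in row]
    b = [v for row in mask_b for v in row]
    if not a or len(a) != len(b):
        raise ValueError("masks must be non-empty and of equal size")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pairwise_dissimilarity(predictions):
    """Average disagreement for every pair of models.

    predictions maps a model name to its list of predicted masks,
    one mask per validation image, in the same image order.
    """
    result = {}
    for m1, m2 in combinations(sorted(predictions), 2):
        scores = [
            disagreement(a, b)
            for a, b in zip(predictions[m1], predictions[m2])
        ]
        result[(m1, m2)] = sum(scores) / len(scores)
    return result


# Toy example with three hypothetical models and one 2x2 validation image.
toy_predictions = {
    "model_a": [[[1, 1], [0, 0]]],
    "model_b": [[[1, 0], [0, 0]]],
    "model_c": [[[0, 0], [1, 1]]],
}
diversity = pairwise_dissimilarity(toy_predictions)
for pair, score in diversity.items():
    print(pair, score)
```

Under this assumed measure, 0 means the two models always agree and 1 means they disagree on every pixel, so higher values indicate more complementary predictions.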
In Figure 4, we report the average DSC and IoU
scores for each pair of classifiers across the three
test sets, with standalone (single-model) results
along the diagonal. The best ensemble was obtained
by combining CAFE-Net (trained with STR loss), the
best standalone model, with PVT (trained with BCE).
Interestingly, PVT is not the second-best standalone
model, but it exhibits a high degree of diversity with
respect to CAFE-Net. This two-model ensemble
outperforms all standalone methods.
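A two-model ensemble of this kind can be sketched as below. The fusion rule is not stated in this section, so the example assumes the simplest option: averaging the per-pixel foreground probabilities of the two models and thresholding at 0.5. The function name `ensemble_mask` and the toy probability maps are illustrative only.

```python
def ensemble_mask(prob_a, prob_b, threshold=0.5):
    """Fuse two probability maps by averaging, then binarize.

    prob_a and prob_b are nested lists of per-pixel foreground
    probabilities in [0, 1] with identical shapes. Averaging followed by
    thresholding is an assumption of this sketch.
    """
    if len(prob_a) != len(prob_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(prob_a, prob_b)
    ):
        raise ValueError("probability maps must have the same shape")
    return [
        [1 if (pa + pb) / 2 >= threshold else 0 for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(prob_a, prob_b)
    ]


# Toy 2x2 probability maps from two hypothetical models.
probs_model_1 = [[0.9, 0.4], [0.2, 0.6]]
probs_model_2 = [[0.7, 0.7], [0.1, 0.3]]
fused = ensemble_mask(probs_model_1, probs_model_2)
print(fused)
```

In the toy example the two models disagree on two pixels, and the averaged probability decides each case in favor of the more confident model, which is one intuition for why diverse pairs can complement each other.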
4 DISCUSSION
The experimental results demonstrate the proposed
models’ effectiveness in segmenting cattle in com-
plex farm environments. The models achieved high
performance across all datasets, with the CAFE-Net
model consistently outperforming the other architectures.
The transformer-based models, particularly
CAFE-Net and PVT, outperformed the CNN-based models,
achieving the highest average DSC and IoU scores
across all datasets.
The results indicate that the transformer-based models
are well-suited for capturing long-range dependencies
and enhancing the segmentation accuracy of cattle in
complex farm environments.