as works on outliers in circular data, discriminant
analysis, experimental design, multivariate data,
generalized linear models, distributions other than
normal, time series, etc. Markou and Singh (2003)
provided a state-of-the-art review of novelty detection
based on statistical approaches. Sajesh and Srinivasan
(2012) reviewed multivariate outlier detection methods,
especially robust distance-based methods. They also
proposed a computationally efficient outlier detection
method based on the comedian approach, with a high
breakdown value and low computation time. More
recently, Samariya and Thakkar (2021) listed different
types of outlier detection algorithms and their
application domains, as well as some evaluation measures.
Indeed, several outlier detection algorithms have
been applied to MET data, such as Cook’s distance,
model statistics based on confidence ellipsoids
(Cook, 1977; Christensen et al., 1992), and the
locally centered Mahalanobis distance, which centers
the covariance matrix at each sample (Todeschini
et al., 2013). However, to the best of our knowledge,
no comparison has been made between different
outlier detection methods, and no outlier detection
method has been strongly recommended for identifying
anomalies in MET data, which can be very complex and
challenging to clean (DeLacy et al., 1996).
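As an illustration of distance-based screening, the following minimal sketch computes the classical (globally centered) Mahalanobis distance and flags samples beyond a chi-square cutoff. It is not the locally centered variant of Todeschini et al. (2013), and it is not taken from the study's code. The synthetic two-feature data, the planted outlier, the 0.975 cutoff level and the function name `mahalanobis_sq` are all assumptions made for illustration.

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X from the sample mean."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)          # sample covariance (ddof=1)
    cov_inv = np.linalg.inv(cov)
    # d_i^2 = (x_i - mean)^T S^{-1} (x_i - mean), evaluated row by row
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)

# Synthetic stand-in for a two-trait sample (assumption, not MET data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[0] = [8.0, -8.0]                          # one planted gross outlier

d2 = mahalanobis_sq(X)
# For 2 features the chi-square quantile has a closed form: -2 * ln(1 - q).
cutoff = -2.0 * np.log(1.0 - 0.975)
flagged = np.flatnonzero(d2 > cutoff)
```

Because the mean and covariance are estimated from the contaminated sample itself, gross outliers inflate the covariance and can mask milder ones, which is one motivation for the robust and locally centered variants cited above.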
To bridge this gap, this study provides a critical
comparison, on this specific task, of various
multivariate outlier detection algorithms drawn from
different approaches: hierarchical clustering or
connectivity (e.g., Mahalanobis Distance, Mean Shift
Outlier Model), influential (e.g., Cook’s Distance),
distribution (e.g., One-Class Support Vector Machine),
centroid (e.g., K-Means Clustering), ensemble (e.g.,
Isolation Forest), density (e.g., Density-Based Spatial
Clustering of Applications with Noise), probabilistic
(e.g., Gaussian Mixture Model), and subspace (e.g.,
Subspace Outlier Detection Algorithm, Auto-Encoders
for Outlier Detection). The aim is to determine which
ones are best suited to identifying outliers
(especially mild ones) in MET samples. To conduct this
comparison while taking into account the aggressiveness
and robustness of each method, we consider two
scenarios. In the first, we compute the genomic
prediction (GP) without identifying the anomalies, and
then recompute it with the anomalies identified and
removed. In the second, we inject artificial anomalies
into the samples and use the same methods to retrieve
them. All scenarios and methods were run on each of the
eleven MET data sets. The method with the best score in
both scenarios is considered the most appropriate one.
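The injection scenario can be sketched as follows, under stated assumptions. The text does not specify how anomalies are injected or how retrieval is scored, so the injection rule (shifting one randomly chosen feature by a multiple of its standard deviation), the precision and recall scoring, and the simple median/MAD detector standing in for the compared methods are all hypothetical, as are the names `inject_anomalies`, `mad_detector` and `precision_recall`.

```python
import numpy as np

def inject_anomalies(X, n_anomalies, shift=6.0, seed=None):
    """Return a copy of X with n_anomalies perturbed rows, plus their indices."""
    rng = np.random.default_rng(seed)
    X_out = np.array(X, dtype=float)        # np.array copies by default
    rows = rng.choice(X_out.shape[0], size=n_anomalies, replace=False)
    cols = rng.integers(0, X_out.shape[1], size=n_anomalies)
    med = np.median(X_out, axis=0)
    std = X_out.std(axis=0)
    # Push each chosen value away from its column median (assumed rule).
    direction = np.where(X_out[rows, cols] >= med[cols], 1.0, -1.0)
    X_out[rows, cols] += direction * shift * std[cols]
    return X_out, np.sort(rows)

def mad_detector(X, threshold=3.5):
    """Placeholder detector: flag rows with any robust z-score above threshold.

    Assumes continuous features, so that no column has a MAD of zero.
    """
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    robust_z = 0.6745 * np.abs(X - med) / mad
    return np.flatnonzero((robust_z > threshold).any(axis=1))

def precision_recall(flagged, truth):
    """Precision and recall of flagged row indices against injected ones."""
    hits = len(np.intersect1d(flagged, truth))
    precision = hits / len(flagged) if len(flagged) else 0.0
    recall = hits / len(truth) if len(truth) else 0.0
    return precision, recall

# Synthetic stand-in for one MET sample (assumption, not the study's data).
X_clean = np.random.default_rng(42).normal(size=(300, 4))
X_dirty, truth = inject_anomalies(X_clean, n_anomalies=10, seed=1)
flagged = mad_detector(X_dirty)
precision, recall = precision_recall(flagged, truth)
```

Any detector that returns row indices could replace `mad_detector`, so the same injected sample and the same scoring apply to every method. Rows flagged without having been injected lower precision, which gives one possible reading of a method's aggressiveness.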
The rest of this paper is organized as follows.
Section 2 (Materials and Methods) presents the data
sets we worked with, summarizes the outlier detection
algorithms considered, and defines the genomic
prediction method, the comparison metrics and the
validation methodologies. In Section 3, we present and
discuss the results. In Section 4, we draw some
conclusions.
2 MATERIALS AND METHODS
2.1 Data Summary
For this study of multivariate anomalies within MET,
we inherited a few historical datasets from two
different sources:
• RAGT - A European seed company for field
crops and livestock: soft winter wheat, durum
wheat, grain maize, rapeseed, sunflower,
soybeans, sorghum and maize.¹ These MET were
conducted separately in three countries
(France, Hungary and Ukraine), in up to
thirty-one (31) locations, on hybrid grain maize
from 2014 to 2021. The MET data are the result of
a manual annotation process carried out by the
company’s experts.
• 2016 CAIGE - 2016 CIMMYT Australia
ICARDA Germplasm Evaluation (CAIGE).² These
datasets relate to bread wheat trials, which were
conducted at eight locations in Australia, where
240 varieties were tested in seven trials. Each
experiment employed a partially replicated design
with two blocks and p ranging from 0.23 to 0.39.
We have built up a bank of eleven real-world MET
datasets from the two sources presented above. Each
sample is a combination of phenotypes and genotypes
(single nucleotide polymorphism genetic markers).
The samples vary in size and cover varieties of maize
and bread wheat.
The code and data from this study are available in
a public repository.³
2.2 Outlier Detection Methods
Below is a list of outlier detection algorithms from
different approaches that can be used to identify
¹ https://ragt-semences.fr/fr-fr (accessed on June 12, 2023).
² https://www.caigeproject.org.au/icarda-data-2016shipment (accessed on June 12, 2023).
³ https://github.com/charlesdupuyrony/outlierDetectionComparison
ICAART 2024 - 16th International Conference on Agents and Artificial Intelligence