cluding heterogeneous high-dimensional data, intro-
duce challenges to existing ML methods (Karim et al.,
2021), which are increasingly being used successfully
for data analysis and interpretation.
To date, the use of AI techniques in Bioinformatics
has two main limitations. The first concerns
explainable AI (XAI), i.e., the ability of a method
to at least partially explain or motivate its behavior;
the second concerns usable AI, i.e., the actual use
of such systems in real-world scenarios.
While ML models are able to address complex
problems, their “black-box” nature raises concerns
about transparency and accountability, which can
overshadow their ability to solve the problems
themselves. The field of XAI aims to make AI systems
more transparent by explaining how they make
decisions, thereby improving human comprehensibility,
reasoning, transparency, and accountability.
As mentioned earlier, another strong limitation of
the use of AI in Bioinformatics concerns the actual
“usability” of such systems in real-world scenarios.
Advanced ML models facing really complex problems
often suffer from scalability problems. In some cases,
the cause can be found in the “classification”
paradigm used to formulate the problem: there
are n classes of samples, and the model is trained on a
training set to assign a new sample to one of these n
classes. This approach, especially in Bioinformatics,
can suffer from several issues, including the
enormous amount of data on which the model must be
trained, the strong class imbalance that can
arise when working on real data, and above all the
problem of scaling the model when new classes of
samples must be classified. In this case, the model
must be retrained on the whole dataset, with a
severe impact on the computational effort, which is
especially problematic in contexts where a timely
response is crucial.
Proposed Strategy. We propose a novel ML-based
cancer-type detection system with the aim of
integrating it with explainability and usability
techniques. We first formulate the problem in terms
of similarity-based classification (Chen et al., 2009).
Given a cancer sample, we assume that a set of
somatic mutation features is available, which can be
interpreted as a cancer mutational view of the sample
itself. Then, following the central idea of the
similarity-based classification paradigm, we define a
model that does not simply learn to classify a cancer
sample from its cancer mutational view, but instead
learns a similarity function from a set of sample
pairs and can therefore tell whether two samples are
similar or not.
Clearly, the more representative the starting set of
samples is of the problem, the more accurate the
function is. The advantage of this approach is that,
once the similarity function has been learned, the model
can also be applied to new samples (even of a cancer
type never seen during training) to find out which
classes they are most similar to. Furthermore,
to make the system scalable to large amounts of data,
we keep track of a single representative view for each
cancer-type class and use these views to find out
which classes are most similar to a test view, with
great benefits in terms of both memory and privacy.
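The representative-based lookup described above can be illustrated with a minimal sketch. It is not the system proposed in this paper: the similarity function here is plain cosine similarity standing in for a learned function, the representative of each class is assumed to be the element-wise mean of its views, and all names and toy values (`class_representatives`, `rank_classes`, `type_A`, `type_B`) are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for all-zero views
    return dot / (norm_u * norm_v)


def class_representatives(views_by_class):
    """Keep one view per class. Assumption: the element-wise mean of its views."""
    reps = {}
    for label, views in views_by_class.items():
        n = len(views)
        reps[label] = [sum(column) / n for column in zip(*views)]
    return reps


def rank_classes(test_view, reps, similarity=cosine_similarity):
    """Return (label, score) pairs sorted from most to least similar."""
    scores = [(label, similarity(test_view, rep)) for label, rep in reps.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy mutational views (hypothetical values, three features per sample).
views_by_class = {
    "type_A": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]],
    "type_B": [[0.0, 0.2, 0.8], [0.1, 0.1, 0.8]],
}
reps = class_representatives(views_by_class)
ranking = rank_classes([0.7, 0.3, 0.0], reps)
print(ranking[0][0])  # most similar class for the test view
```

Under these assumptions, supporting a new cancer type only requires adding one more entry to `reps`; nothing is retrained, and only one vector per class is stored.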
There are numerous examples of works in Bioinformatics
based on the similarity-based classification
paradigm (Mathai and Kirchmair, 2020). In this paper,
we propose the use of ML models specifically designed
for learning similarity functions, i.e., Siamese
Neural Networks (SNNs). We define a novel SNN
which, given a pair of cancer mutational views,
outputs a similarity score that can be used to verify
whether they are similar. The proposed solution is
based on the following two main ideas that, in our
opinion, could mitigate the issues discussed above.
First, the somatic mutation features of a cancer
sample can be used as a “similarity view” that can be
exploited as an effective feature embedding for ML
methods. Second, we show that the SNN increases the
discriminative power of the proposed cancer mutational
views (Bell and Bala, 2015).
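To make the idea of a Siamese model concrete, the following is a minimal sketch of the forward pass only, not the architecture proposed in this paper. It assumes a single dense layer with tanh as the shared encoder, fixed untrained toy weights, and a score defined as exp(-distance) between the two embeddings; the helper `make_pairs` shows one plausible way to build labelled pairs for training. All names and values are hypothetical.

```python
import math
from itertools import combinations


def encode(view, weights, biases):
    """Shared encoder: one dense layer with tanh activation (an assumption)."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, view)) + b)
        for row, b in zip(weights, biases)
    ]


def similarity_score(view_a, view_b, weights, biases):
    """Siamese forward pass: both views go through the SAME encoder weights."""
    emb_a = encode(view_a, weights, biases)
    emb_b = encode(view_b, weights, biases)
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)))
    # Map distance to (0, 1]: 1.0 for identical embeddings, smaller as they diverge.
    return math.exp(-distance)


def make_pairs(samples):
    """Build (view_a, view_b, target) pairs; target is 1 for same class, else 0."""
    return [
        (view_a, view_b, 1 if label_a == label_b else 0)
        for (view_a, label_a), (view_b, label_b) in combinations(samples, 2)
    ]


# Untrained toy parameters: 3 input features -> 2 embedding dimensions.
weights = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.6]]
biases = [0.0, 0.1]

# Toy labelled views (hypothetical values).
samples = [
    ([1.0, 0.0, 0.0], "type_A"),
    ([0.9, 0.1, 0.0], "type_A"),
    ([0.0, 0.0, 1.0], "type_B"),
]

for view_a, view_b, target in make_pairs(samples):
    score = similarity_score(view_a, view_b, weights, biases)
    print(target, round(score, 3))
```

In an actual SNN the encoder parameters would be learned from such pairs, typically with a loss that pushes the score up for same-class pairs and down for different-class pairs; that training step is omitted here.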
Several studies in the literature address the problem
of using ML techniques to determine tumour organ of
origin and histology from the patterns of somatic
mutation identified by whole-genome DNA sequencing,
e.g., (Jiao et al., 2020). However, most of these are
based on the classification paradigm. Furthermore,
several works use SNNs in Bioinformatics (Bechar et
al., 2023; Narmatha et al., 2023), but to the best of
our knowledge this is the first attempt to propose a
similarity-based classification paradigm based on SNNs
that exploits somatic mutation features for the
cancer-type detection problem.
Our Contributions:
• A novel cancer-type detector that integrates
explainability and usability techniques and is based on
cancer mutational views used to train SNNs to verify
the similarity between cancer samples.
• Preliminary experiments to assess the effectiveness
of the proposed method; results obtained on
a dataset of somatic mutation features show an
accuracy of 89.25%, precision of 97.60%, recall of
97.63%, and F1 score of 97.63%, highlighting the
advantages of the similarity-based classification paradigm.
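For reference, the four metrics named above follow the standard definitions over true/false positives and negatives. The sketch below computes them from toy counts; the counts are illustrative only and are not the data behind the reported figures, and the function name `verification_metrics` is hypothetical.

```python
def verification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy counts (not taken from the experiments).
metrics = verification_metrics(tp=8, fp=2, tn=9, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```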
Visual Insights in Human Cancer Mutational Patterns: Similarity-Based Cancer Classification Using Siamese Networks