Figure 8: Correlation Improvement.
models on shifted datasets, the quality of the training
would be positively affected.
6 FUTURE WORK
Though PUMP provides an accessible way to analyze
underspecification, it does not yet offer all of the
functionality we would like. Beyond the tool itself, there
are many avenues for minimizing underspecification and
improving how well models generalize.
6.1 System Improvements
As currently constructed, PUMP relies heavily on
clustering algorithms both to perform analysis and to
generate shifted datasets. While relatively effective,
clustering is not the only way to accomplish either
task. In the future, we hope to introduce a greater
variety of algorithms for both so that underspecification
analysis generalizes more readily to other datasets.
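To make the idea of a cluster-derived shifted dataset concrete, the following is a minimal sketch and not PUMP's actual implementation. It assumes that a shifted test set can be formed by holding out one cluster entirely, so that the held-out samples come from a region of feature space the model never saw during training. The function names (`kmeans`, `shifted_split`), the choice of plain Lloyd's k-means, and the Euclidean distance are all illustrative assumptions.

```python
import random
from math import dist


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means (illustrative); returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct samples
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels


def shifted_split(points, k, held_out_cluster, seed=0):
    """Train on every cluster except one; test on the held-out cluster (hypothetical helper)."""
    labels = kmeans(points, k, seed=seed)
    train_idx = [i for i, lab in enumerate(labels) if lab != held_out_cluster]
    test_idx = [i for i, lab in enumerate(labels) if lab == held_out_cluster]
    return train_idx, test_idx, labels
```

Under these assumptions, swapping `kmeans` for any other function that returns one group label per sample would leave `shifted_split` unchanged, which is one way a greater variety of algorithms could be supported.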
Beyond additional algorithms, another area of
improvement is the range of supported models. Though
SVMs, Random Forests, and Neural Networks (MLPs)
cover many use cases, many other types of predictive
models exist. In the future, we hope to offer a more
extensive list of models to select from for performance
evaluation.
6.2 Minimizing Underspecification
Underspecification also remains an open problem
outside the scope of PUMP. In our extended application
of PUMP, we illustrate how particular knowledge can be
applied to breast cancer subtyping and show that there
is a good correlation between that knowledge and
better-specified models.
In the application of sample-sample graphs, biological
insight can perhaps be leveraged in machine learning
models by incorporating that knowledge directly
into their training via the loss function. Examples of
such models exist, such as neural graph machines
(Bui et al., 2017) and graph convolution networks
(Kipf and Welling, 2017). Of course, models similar
to these have already been used in biomedical appli-
cations; however, the introduction of PUMP allows
for a simple way to ensure that underspecification in a
dataset for a particular model is not a prevalent issue.
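As an illustration of what incorporating a sample-sample graph into the loss function can look like, the sketch below adds a graph smoothness penalty to an ordinary supervised loss, in the spirit of neural graph machines (Bui et al., 2017). It is a simplified assumption-laden example: it uses a single weight `alpha` for all edges, squared Euclidean distance between hidden representations, and cross-entropy as the supervised term. The function and parameter names are hypothetical and do not come from PUMP or from any cited implementation.

```python
from math import log


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class (clipped to avoid log(0))."""
    return -log(max(probs[label], 1e-12))


def squared_distance(a, b):
    """Squared Euclidean distance between two hidden representations."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def graph_regularized_loss(probs, hidden, labels, edges, alpha=0.1):
    """Supervised loss plus a penalty that pulls graph neighbors together.

    probs[i]  : predicted class probabilities for sample i
    hidden[i] : hidden representation of sample i
    labels    : dict mapping labeled sample indices to class indices
    edges     : list of (u, v, weight) tuples from the sample-sample graph
    alpha     : strength of the graph penalty (assumed single value)
    """
    supervised = sum(cross_entropy(probs[i], y) for i, y in labels.items())
    smoothness = sum(w * squared_distance(hidden[u], hidden[v]) for u, v, w in edges)
    return supervised + alpha * smoothness
```

In this form, samples connected by a high-weight edge (for example, samples that prior biological knowledge marks as similar) are penalized when the model places them far apart in its hidden space, while unlabeled samples still contribute through the edge term.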
ACKNOWLEDGEMENTS
This study was made possible thanks to the William and
Linda Frost Fund and the College of Science and
Math at California Polytechnic State University, San
Luis Obispo. Additionally, we would like to thank the
College of Engineering for their support of this work.
We would also like to acknowledge and thank the Eu-
ropean Genome-Phenome Archive for access to the
METABRIC transcriptomics and metadata dataset as
well as the authors of the DeepType study for the ini-
tial classifier algorithm. We are grateful to the Cal
Poly Bioinformatics Research Group for their assis-
tance in this research project and review of this arti-
cle.
REFERENCES
Anderson, P., Gadgil, R., Johnson, W. A., Schwab, E., and
Davidson, J. M. (2021). Reducing variability of breast
cancer subtype predictors by grounding deep learning
models in prior knowledge. Computers in Biology and
Medicine, 138:104850.
Apic, G., Ignjatovic, T., Noyer, S., and Russell, R. (2005).
Illuminating drug discovery with biological pathways.
ScienceDirect.
Bui, T., Ravi, S., and Ramavajjala, V. (2017). Neural graph
machines: Learning neural networks using graphs.
arXiv.
Burstein, M., Tsimelzon, A., Poage, G., Covington, K.,
Contreras, A., Fuqua, S., Savage, M., Osborne, K.,
Hilsenbeck, S., Chang, J., Mills, G., Lau, C., and
Brown, P. (2014). Comprehensive genomic analysis
identifies novel subtypes and targets of triple-negative
breast cancer. NCBI.
Cascianelli, S., Molineris, I., Isella, C., Masseroli, M.,
and Medico, E. (2020). Machine learning for RNA
sequencing-based intrinsic subtyping of breast cancer.
Nature.
Chen, R., Yang, L., Goodison, S., and Sun, Y. (2020). Deep-
learning approach to identifying cancer subtypes us-
ing high-dimensional genomic data. Oxford Aca-
demic.
Cornen, S., Guille, A., Adelaide, J., Addou-Klouche, L.,
Finetti, P., Saade, M.-R., Manai, M., Carbuccia, N.,
PUMP: An Underspecification Analysis Tool