Authors:
Marco Nicolini 1; Dario Malchiodi 1; Alberto Cabri 1; Emanuele Cavalleri 1; Marco Mesiti 1; Alberto Paccanaro 2; Peter Robinson 3; Justin Reese 4; Elena Casiraghi 1,5 and Giorgio Valentini 1,5
Affiliations:
1 AnacletoLab, Dept. of Computer Science, University of Milan, Italy
2 School of Applied Mathematics (EMAp) - FGV, Rio de Janeiro, Brazil
3 Berlin Institute of Health at Charité (BIH), Germany
4 Environmental Genomics and Systems Biology Bioscience, Lawrence Berkeley National Laboratory, U.S.A.
5 ELLIS European Laboratory for Learning and Intelligent Systems
Keyword(s):
Large Language Models, Protein Language Models, Conditional Transformers, Protein Design and Modeling.
Abstract:
Conditional transformers improve the generative capabilities of large language models (LLMs) by processing control tags that steer generation toward texts with specific features. Recently, a similar approach has been applied to the generation of functionally characterized proteins: tags are added to the protein sequence to qualify its function (e.g., Gene Ontology terms) or other characteristics (e.g., its family or the species to which it belongs). In this work, we show that fine-tuning conditional transformers, pre-trained on large corpora of proteins, on specific protein families can significantly enhance the prediction accuracy of the pre-trained models and can also generate new, potentially functional proteins that could enlarge the protein space explored by natural evolution. We obtained encouraging results on the phage lysozyme family of proteins, achieving statistically significantly better prediction results than the original pre-trained model. A comparative analysis of the primary and tertiary structures of the synthetic proteins generated by our model and of the natural ones shows that the fine-tuned model is able to generate biologically plausible proteins. Our results suggest that fine-tuned conditional transformers can be applied to other functionally characterized proteins, with possible industrial and pharmacological applications.