
and assess their potential impact on human beings.
However, the analysis is not always definitive due to
the evolving nature of scientific knowledge, making
the use of up-to-date databases critically important.
Various tools for annotating VCF files have been
proposed in the literature, with ANNOVAR being one
of the most widely used. Despite its popularity,
ANNOVAR supports only a limited range of databases.
In this paper, we propose VCFAnnotator to address
this limitation. It facilitates the use of external
databases by automatically preparing them for
compatibility with ANNOVAR, and it includes a
scraping feature to detect updated databases.
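To illustrate how a check for updated databases might work, the following is a minimal Python sketch and not VCFAnnotator's actual implementation. It assumes that a download page lists database files under a hypothetical naming scheme `<build>_<database>_<YYYYMMDD>.txt.gz`; the names `LinkCollector`, `latest_releases`, and `find_updates` are illustrative only. The sketch relies solely on the standard library and takes the page content as a string, so it makes no network requests.

```python
import re
from html.parser import HTMLParser

# Assumed (hypothetical) naming scheme: <build>_<database>_<YYYYMMDD>.txt.gz
NAME_RE = re.compile(
    r"^(?P<build>hg\d+)_(?P<db>[A-Za-z0-9]+)_(?P<date>\d{8})\.txt\.gz$"
)


class LinkCollector(HTMLParser):
    """Collect the href target of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def latest_releases(index_html):
    """Map (build, database) to the newest release date in the listing."""
    parser = LinkCollector()
    parser.feed(index_html)
    latest = {}
    for href in parser.links:
        # Only the file name matters, so drop any leading path.
        match = NAME_RE.match(href.rsplit("/", 1)[-1])
        if match is None:
            continue
        key = (match["build"], match["db"])
        # YYYYMMDD strings sort chronologically, so string comparison works.
        if key not in latest or match["date"] > latest[key]:
            latest[key] = match["date"]
    return latest


def find_updates(index_html, installed):
    """Return {(build, database): (installed_date, remote_date)} for every
    installed database that has a newer release in the listing."""
    remote = latest_releases(index_html)
    updates = {}
    for key, local_date in installed.items():
        remote_date = remote.get(key)
        if remote_date is not None and remote_date > local_date:
            updates[key] = (local_date, remote_date)
    return updates
```

Comparing only the date embedded in the file name keeps the check cheap, since no database needs to be downloaded to learn that a newer release exists. A real deployment would have to adapt the pattern to each provider's conventions.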
This study marks an initial step towards developing
an automated system for streamlining various tasks
in genetic research processes. As future work, we
aim to enable researchers to selectively re-annotate
specific entries of a VCF file based on predefined
criteria and to compare the new annotations with
previous ones. Additionally, we aim to explore the
steps required to certify VCFAnnotator for diagnostic
use (Bombarda et al., 2022; Bombarda et al., 2021).
ACKNOWLEDGEMENTS
This work was funded by PNRR - ANTHEM (AdvaNced
Technologies for Human-centrEd Medicine) - Grant
PNC0000003 – CUP: B53C22006700001 - Spoke 1 -
Pilot 1.4. We would like to thank Fabio Assolari and
Simone Ronzoni for their preliminary work on this
project during their B.Sc. theses.
REFERENCES
Adzhubei, I. et al. (2013). Predicting functional effect of
human missense mutations using PolyPhen-2. Current
Protocols in Human Genetics, 76(1).
Battista, R., Blancquaert, I., et al. (2011). Genetics in health
care: an overview of current and emerging models.
Public Health Genomics, 15(1):34–45.
Bombarda, A., Bonfanti, S., et al. (2021). Lessons learned
from the development of a mechanical ventilator for
COVID-19. In 2021 IEEE 32nd International Symposium
on Software Reliability Engineering (ISSRE),
pages 24–35. IEEE.
Bombarda, A., Bonfanti, S., et al. (2022). Guidelines for the
development of a critical software under emergency.
Information and Software Technology, 152:107061.
Chen, S., Francioli, L. C., Goodrich, J. K., et al. (2024). A
genomic mutational constraint map using variation in
76,156 human genomes. Nature, 625(7993):92–100.
Cingolani, P., Platts, A., et al. (2012). A program for
annotating and predicting the effects of single nucleotide
polymorphisms, SnpEff: SNPs in the genome
of Drosophila melanogaster strain w1118; iso-2; iso-3.
Fly, 6(2):80–92.
Cooper, D. (1998). The human gene mutation database.
Nucleic Acids Research, 26(1):285–287.
Danecek, P., Auton, A., et al. (2011). The variant call format
and VCFtools. Bioinformatics, 27(15):2156–2158.
Danecek, P., Bonfield, J. K., et al. (2021). Twelve years of
SAMtools and BCFtools. GigaScience, 10(2).
Davydov, E. V., Goode, D. L., Sirota, M., et al. (2010).
Identifying a high fraction of the human genome to be
under selective constraint using GERP++. PLoS
Computational Biology, 6(12):e1001025.
Hajba, G. L. (2018). Using Beautiful Soup, pages 41–96.
Apress.
Hamosh, A. (2004). Online Mendelian Inheritance in Man
(OMIM), a knowledgebase of human genes and genetic
disorders. Nucleic Acids Research, 33(Database
issue):D514–D517.
Hart, S. N., Duffy, P., et al. (2015). VCF-Miner: GUI-based
application for mining variants and annotations
stored in VCF files. Briefings in Bioinformatics,
17(2):346–351.
Kircher, M. et al. (2014). A general framework for estimating
the relative pathogenicity of human genetic variants.
Nature Genetics, 46(3):310–315.
Landrum, M. J. et al. (2015). ClinVar: public archive of
interpretations of clinically relevant variants. Nucleic
Acids Research, 44(D1):D862–D868.
McLaren, W. et al. (2010). Deriving the consequences of
genomic variants with the Ensembl API and SNP Effect
Predictor. Bioinformatics, 26(16):2069–2070.
Myers, C., Paulk, N., and Dudlak, C. (2001). Genomics:
implications for health systems. Frontiers of Health
Services Management, 17(3):3–16.
Ng, P. C. (2003). SIFT: predicting amino acid changes
that affect protein function. Nucleic Acids Research,
31(13):3812–3814.
O’Leary, N. A., Wright, M. W., et al. (2015). Reference
sequence (RefSeq) database at NCBI: current status,
taxonomic expansion, and functional annotation. Nucleic
Acids Research, 44(D1):D733–D745.
Pedersen, B. S., Layer, R. M., and Quinlan, A. R. (2016).
Vcfanno: fast, flexible annotation of genetic variants.
Genome Biology, 17(1).
Pei, B., Sisu, C., Frankish, A., et al. (2012). The GENCODE
pseudogene resource. Genome Biology, 13(9):R51.
Salgado, D., Bellgard, M. I., et al. (2016). How to identify
pathogenic mutations among all those variations: variant
annotation and filtration in the genome sequencing
era. Human Mutation, 37(12):1272–1282.
Wang, K., Li, M., et al. (2010). ANNOVAR: functional
annotation of genetic variants from high-throughput
sequencing data. Nucleic Acids Research, 38(16):e164.
Yang, H. and Wang, K. (2015). Genomic variant annotation
and prioritization with ANNOVAR and wANNOVAR. Nature
Protocols, 10(10):1556–1566.