A Flexible and Open-Source Tool for Genetic Variant Annotation

Andrea Bombarda (1, a), Matteo Bellini (2, b), Maria Iascone (2, c) and Domenico Fabio Savo (1, d)

(1) Department of Management, Information, and Production Engineering, University of Bergamo, Bergamo, Italy

(2) Medical Genetics Lab., ASST Papa Giovanni XXIII, Bergamo, Italy

{andrea.bombarda, domenicofabio.savo}@unibg.it, {m.bellini, miascone}@asst-pg23.it

Keywords: Medical Genetics, Variant Annotation, Rare Genetic Disease Research.

Abstract: Advances in genomic research have significantly enhanced our understanding of the genetic factors influencing human health. A key output of this research is the VCF (Variant Call Format) file, which documents the genetic variations detected through DNA sequencing. These files, however, provide limited information, making it challenging to interpret the biological significance of the variants without additional data. Annotation, the process of enriching VCF files with information from publicly available biomedical datasets, is essential for facilitating variant interpretation in research. In this paper, we present VCFAnnotator, a tool developed to adapt the ANNOVAR software used in genetic research, enabling the annotation of entire directories with a single command and facilitating the use of any relevant external database. Additionally, VCFAnnotator can scrape the websites of the biomedical databases in use, ensuring that researchers remain informed of any updates.

1 INTRODUCTION

Genomic research has transformed our understand-

ing of human biology, providing crucial insights into

the genetic factors that inﬂuence human health (My-

ers et al., 2001; Battista et al., 2011). The study of

genetic variants is now pivotal in medical research,

helping to identify new biological pathways, elucidate

disease mechanisms, and inform the development of

innovative therapeutic approaches. As the volume of

genomic data continues to grow, the demand for efﬁ-

cient tools to process, analyze, and interpret this infor-

mation has become increasingly critical to advancing

research in medical genetics.

A crucial output of the genetic analysis process is the VCF (Variant Call Format) file. These files, produced through DNA sequencing and alignment, document genetic variations in comparison to a reference genome. Some variants may be associated with the development of diseases, but not all are pathogenic. It is up to the geneticist, based on current medical research, to evaluate whether specific symptoms can be linked to these genetic variants.

To assist genetics researchers in their work, the data contained in the VCF file are enriched with additional information about each variant by using data from publicly available biomedical datasets. This process, called annotation, is considered crucial for interpreting genetic data and assessing the potential impact of variants on human health (Salgado et al., 2016).

(a) https://orcid.org/0000-0003-4244-9319

(b) https://orcid.org/0009-0001-4297-9160

(c) https://orcid.org/0000-0002-4707-212X

(d) https://orcid.org/0000-0002-8391-8049

In fact, apart from detailing the chromosome po-

sition and the nucleotides in a speciﬁc variant, VCF

ﬁles provide very little information. As a result, with-

out performing annotation, it becomes challenging to

address key questions, such as whether a variant has

already been identiﬁed in other studies, what its ef-

fects might be, or whether it is widespread in the pop-

ulation and therefore unlikely to be pathogenic. In the

context of genetic research, where new correlations

between variants and diseases are continuously dis-

covered, leveraging up-to-date information is imper-

ative for increasing the chances of achieving prompt

and accurate results. For this reason, several tools that help researchers annotate VCF files have been proposed (Wang et al., 2010; Yang and Wang, 2015; Pedersen et al., 2016; Cingolani et al., 2012; McLaren et al., 2010). However, despite their efficiency, most of them do not tell the user whether the data sources in use are up to date, and they are limited to annotations performed using only a subset of the available biomedical databases.

In this paper, we present VCFAnnotator, a tool developed in collaboration with the medical genetics department of Papa Giovanni XXIII Hospital in Bergamo, to address the issues they observed in existing annotation tools. Specifically, VCFAnnotator acts as a wrapper for the ANNOVAR (Wang et al., 2010) tool, enabling the automatic verification of the up-to-date status of the data sources used to perform the annotation. Moreover, it allows for working with databases that are not natively compatible with ANNOVAR by automatically converting selected resources into a format accepted by the ANNOVAR annotation tool. With these features, VCFAnnotator integrates seamlessly with ANNOVAR, allowing researchers to perform VCF annotation more quickly and efficiently, both for new cases and for those previously analyzed but considered inconclusive.

Bombarda, A., Bellini, M., Iascone, M. and Savo, D. F.
A Flexible and Open-Source Tool for Genetic Variant Annotation.
DOI: 10.5220/0013111600003911
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 18th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2025) - Volume 2: HEALTHINF, pages 406-413
ISBN: 978-989-758-731-3; ISSN: 2184-4305

The rest of the paper is structured as follows. Sect. 2 provides the background on the variant identification and annotation process, as well as the most used biomedical databases, and presents ANNOVAR, which is at the base of VCFAnnotator. Then, Sect. 3 and Sect. 4 present the requirements and design of VCFAnnotator and its implementation details and functionalities, respectively. Finally, Sect. 5 discusses related work, and Sect. 6 concludes the paper.

2 BACKGROUND

In this section, we describe the analysis and investigation process that, starting from the biological material of a subject, allows for the extraction of their genetic variants, as well as the subsequent annotation process. Moreover, we provide an overview of the software tools currently used to support the annotation phase and highlight their limitations.

2.1 Variant Identiﬁcation

In Next Generation Sequencing (NGS), the journey

from biological material to a VCF ﬁle is a complex,

multistep process that integrates both wet-lab proce-

dures and computational bioinformatics. This pro-

cess aims to identify genetic variants, such as Sin-

gle Nucleotide Variants (SNVs), insertions, deletions,

Copy Number Variations (CNVs), and other muta-

tions, which are fundamental for genetic analysis and

therefore for formulating a diagnosis. In cases involv-

ing pediatric subjects, analyses typically involve data

from the child, mother, and father, sometimes includ-

ing other relatives.

Fig. 1 depicts a general overview of the process. In the first step, biological material (e.g., blood, saliva, or tissue) is collected, and DNA is extracted, fragmented, and tagged with adapters for sequencing. The fragments are then processed through a sequencing platform, producing raw data (short or long reads) in FASTQ format with nucleotide sequences and quality scores. These reads are aligned to a reference genome, generating a binary alignment map (BAM) file with alignment details and quality metrics.

[Figure 1: Overview of the Genetics Process. Stages shown: Raw Biological Material; Sequencing Reads; Mapping and Alignment; BAM; Variant Calling; VCF File; Annotation Tool + Variant Quality Control; Annotated VCF File; Clinically Relevant Variants; Conclusive Analysis; Inconclusive Analysis.]

Then, variant calling is performed, and statisti-

cally signiﬁcant differences (SNVs, insertions, dele-

tions, CNVs, and structural variants) between the

sample’s DNA and the reference genome are identi-

ﬁed. The result is a VCF

ﬁle (Danecek et al., 2011),

listing all detected variants along with auxiliary in-

formation. An example of a VCF ﬁle is reported in

Listing 1. It is a tab-separated ﬁle containing chromo-

some, variant position, ID, reference and altered nu-

cleotides, quality score, ﬁlter info, read details (e.g.,

depth, genotype), ﬁeld format, and case-speciﬁc data

for each individual.
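The column layout just described can be illustrated with a minimal Python sketch that splits one header line and one data line of a VCF file into structured fields. The names `VariantRecord` and `parse_vcf_line` are hypothetical and are not part of VCFAnnotator or ANNOVAR; the sketch assumes a well-formed, tab-separated record and ignores the meta-information lines starting with `##`.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class VariantRecord:
    """One data line of a VCF file (illustrative structure, not a library type)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: str
    filter_status: str
    info: Dict[str, Union[str, bool]]   # INFO key/value pairs; flags map to True
    samples: Dict[str, Dict[str, str]]  # sample name -> {FORMAT key: value}


def parse_vcf_line(header_line: str, data_line: str) -> VariantRecord:
    # The header starts with '#CHROM'; stripping '#' yields the column names.
    header = header_line.lstrip("#").rstrip("\n").split("\t")
    fields = data_line.rstrip("\n").split("\t")
    row = dict(zip(header, fields))

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    info: Dict[str, Union[str, bool]] = {}
    for item in row["INFO"].split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True

    # Columns after FORMAT (index 9 onwards) hold one ':'-separated entry per sample.
    format_keys = row.get("FORMAT", "").split(":")
    samples = {
        name: dict(zip(format_keys, row[name].split(":")))
        for name in header[9:]
    }

    return VariantRecord(
        chrom=row["CHROM"],
        pos=int(row["POS"]),
        ref=row["REF"],
        alt=row["ALT"],
        qual=row["QUAL"],
        filter_status=row["FILTER"],
        info=info,
        samples=samples,
    )
```

Keeping INFO and the per-sample fields as dictionaries reflects the fact that their keys vary from file to file, whereas the first eight columns are fixed by the VCF specification.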

To aid researchers, VCF ﬁles undergo further pro-

cessing, including variant quality control to remove

low-quality variants and false positives, and annota-

tion to enrich data with information about identiﬁed

variants. During annotation, each variant is enriched with information taken from relevant biomedical databases (see Sect. 2.2), such as its potential functional effects, known associations with diseases, and spread in the population. This phase is carried out through the use of

annotation tools such as ANNOVAR (see Sect. 2.3).

As a result, a VCF ﬁle similar to the one we report in

Sect. 4 is produced. The annotation of VCF ﬁles is

essential because it transforms a simple list of genetic

variants into a tool rich in clinical and biological in-

formation, enabling their ﬁltering and prioritization.

Given the importance of the annotation process in research, several biomedical databases have been released by the scientific community (see Sect. 2.2). Moreover, several tools are available to perform this activity. While most sequencing machine manufacturers offer proprietary analysis tools, these often require subscriptions and incur per-use fees. Consequently, open-source and free alternatives, e.g., ANNOVAR (Sect. 2.3), have emerged as viable options.

The complete v4.3 specification of the VCF is available at https://samtools.github.io/hts-specs/VCFv4.3.pdf

Listing 1: Excerpt of a VCF file.

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT W3HH-AFA-I
chr1 941119 . A G 568.23 PASS AC=2;AF=1;AN=2;DP=117;FS=0;MQ=250;QD=4.86;SOR=1.981 GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP:PP:DN 1/1:0,28:1:28:81:PASS:0,13:0,15:169,84,0:130.82,80.816,0:329,164,0:Inherited [...]

2.2 Genetics Databases

As discussed in Sect. 2.1, the usefulness and ef-

fectiveness of the annotation process depend on the

databases used. Each of the available databases fo-

cuses on speciﬁc information, such as the frequency

of a variant in the population or the predicted effect of

the variant on the proteins. In the following, some of

the most common databases are introduced.

OMIM: The Online Mendelian Inheritance in Man

(OMIM) database (Hamosh, 2004) is a comprehen-

sive resource that catalogs human genes and genetic

disorders. It focuses on the relationships among genes and the diseases they cause, as well as their inheritance patterns. More specifically, it is used to identify the inheritance patterns (dominant, recessive, etc.) thanks to the link of each genetic variant to the scientific literature, helping clinicians and researchers better understand the genetic basis of diseases.

GnomAD: The Genome Aggregation Database

(GnomAD) (Chen et al., 2024) is a large public

dataset cataloging human genetic variants. It is

widely used in genomics research to help understand

the spectrum of genetic variants across diverse populations and their frequency. Given that most genetics laboratories work on rare diseases, GnomAD is used to exclude all variants with high occurrence rates in the population, which are therefore unlikely to be pathogenic.

ClinVar: ClinVar (Landrum et al., 2015) is a public, freely accessible database archiving reports on human genetic variants linked to diseases, along with supporting evidence. It enables easy access to information

about the relationships between genetic variants and

speciﬁc health conditions. ClinVar processes submis-

sions including variants identiﬁed in patient samples,

classiﬁcations related to diseases and drug responses,

and additional supporting data, and provides a classi-

ﬁcation based on the predicted biological effect of a

mutation (benign, pathogenic, etc.).

Gencode: The Gencode database (Pei et al., 2012)

is a collection of annotations of human and mouse

genomes. It provides detailed information on gene

structures, including protein-coding genes, noncoding

RNA genes, and other functional elements. It is used

for linking variants (both the position and the speciﬁc

nucleotide alteration) to speciﬁc genes and regions.

RefSeq: RefSeq (Reference Sequence) (O’Leary

et al., 2015) is a database curated by the National

Center for Biotechnology Information (NCBI) that

provides complete, annotated reference sequences for

genomes, transcripts, and proteins. It is used as a stan-

dard for genomic research and comparative studies,

offering accurate and non-redundant representations

of genetic sequences from humans and other species.

HGMD: The Human Gene Mutation Database

(HGMD) (Cooper, 1998) is a comprehensive resource

collecting data on clinically relevant variants that are

associated with genetic disorders. It can be used as

a key reference for researchers by providing detailed

information on variants that cause or may cause in-

herited diseases. The key advantage of this database

is that it contains entries that come only from scien-

tiﬁcally proven sources. Unlike the other previously

presented databases, HGMD is not freely available.

SIFT: SIFT (Sorting Intolerant From Tolerant) (Ng,

2003) is a database used to determine whether an

amino acid substitution in a protein will affect its

function. It analyzes sequence homology and the

physical properties of amino acids to assess whether a

mutation is likely to be deleterious (damaging) or tol-

erated (neutral). It is commonly used to evaluate the

potential impact of genetic variations.

2.3 ANNOVAR

ANNOVAR (Wang et al., 2010) is a tool that can be

used to annotate Single Nucleotide Variants (SNVs)

and insertions/deletions, such as examining their

functional consequences on genes, reporting func-

tional importance scores, or ﬁnding variants in con-

served regions. It is available as a CLI software tool

and can be used as a standalone application on sys-

tems in which standard Perl is supported. ANNO-

VAR works with text-based input VCF ﬁles (e.g., the

example reported in Listing 1), whereby each line cor-

responds to a genetic variant and reports its character-

istics. Then, to annotate variants, ANNOVAR needs

to download gene annotation databases, such as those

described in Sect. 2.2, and to save them to a local disk.

ANNOVAR is not the only tool available for annotating VCF files, and we describe most of the others in Sect. 5. However, being open-source and supported by the community makes ANNOVAR one of the most popular choices among researchers. Nevertheless, it still has some limitations, which we discuss in the following section and try to address with VCFAnnotator.

2.3.1 Limitations

Although ANNOVAR is very powerful, during our experiments we discovered several limitations that our work aims to solve. First, the efficacy of the annotation process is closely linked to the use of up-to-date

biomedical databases (see Sect. 2.2). Indeed, because

new correlations between gene variants and patholo-

gies are frequently discovered, the databases must be

in their latest available versions for the annotation to

be effective. However, ANNOVAR lacks a method to

automatically check for newer database versions.

Additionally, ANNOVAR requires the databases used for annotating VCFs to be in a specific format, which is not always the one used by those databases' creators. To mitigate this issue, ANNOVAR provides a set of already adapted databases, but most of these are not up-to-date and some required by genetics laboratories are not available (e.g., OMIM; see Sect. 2.2). Thus, to get all the needed and up-to-date information, working with "external" databases (possibly in a different format) is required. A similar limitation of ANNOVAR is that it can perform the annotation by exploiting only databases with the same format. This means that if one of the biomedical databases used for the annotation is stored in a txt file, all the other databases must be provided as txt as well. In practice, biomedical databases are distributed in a variety of formats, and this limits the usability of ANNOVAR as it is. Finally, ANNOVAR was designed to annotate only a single VCF file at a time, making it unsuitable for the re-analysis process, in which a set of previously analyzed cases requires reanalysis at the same time.

3 SOFTWARE DESIGN

In this section, we begin by introducing the re-

quirements we identiﬁed for VCFAnnotator that were

speciﬁcally tailored for overcoming the limitations

identiﬁed for ANNOVAR. Then, we present the ar-

chitecture we developed to allow the highest conﬁg-

urability and ﬂexibility for the tool.

ANNOVAR download page (already adapted databases): https://annovar.openbioinformatics.org/en/latest/user-guide/download/

3.1 Software Requirements

At the beginning of the project, thanks to meet-

ings with genetics researchers we collaborated with,

we identiﬁed several requirements for VCFAnnota-

tor. More speciﬁcally, we identiﬁed some general re-

quirements and three modes.

In terms of general requirements, VCFAnnotator should be as configurable as possible because of the evolving environment in the genetics laboratory and the scientific state of the art. This means that the databases used for VCF annotation should be set by the user, allowing for different annotation operations and different file formats. Similarly, the paths in which the annotated VCF and biomedical databases are stored should

be conﬁgurable to make VCFAnnotator usable even

when additional storage is added and the ﬁle system

structure changes. In terms of modes, we identiﬁed

scraping, DB preparation, and annotation modes:

• When in scraping mode, VCFAnnotator should be able to automatically check the websites of the chosen biomedical databases to discover whether new versions are available. It should be possible to launch this check manually or to have it run automatically every time a new annotation is performed, but the update itself should always be manual to avoid unwanted overwrites or databases left in an inconsistent state, such as incomplete downloads.

• When in DB preparation mode, the downloaded databases should be adapted to make them suitable for use with ANNOVAR. This phase is pivotal, as biomedical databases are available in three different formats (vcf, txt, and gff3) and adopt different encodings. More specifically, VCFAnnotator should be able to remove all useless comments from txt files, convert vcf files into txt files, and substitute the <DEL> string, used when the variant implies a deletion, with a dot. The user should be able to set, for each database, the required preparation operations.

• When in annotation mode, VCFAnnotator

should annotate the input VCF ﬁles using the infor-

mation taken from selected databases. More speciﬁ-

cally, VCFAnnotator should support working at least

with the Gencode, Clinvar, GnomAD, OMIM and

HGMD databases and, in general, with databases in

the vcf, gff3, and txt format. Furthermore, VC-

FAnnotator should annotate a single ﬁle, a portion of

a VCF ﬁle, or all ﬁles in a speciﬁed folder to make

re-analyses possible on previously inconclusive cases.

At the end of annotation, VCFAnnotator should pro-

duce a single ﬁle with database information stored in

separate columns, enabling easier and up-to-date vari-

ant analysis.

[Figure 2: VCFAnnotator software architecture. Elements shown: VCFAnnotator; the components Input Parser, WebScraper, and DBPreparation; the annotation tool Annovar; the databases GnomAD, HGMD, Clinvar, Gencode, and OMIM.]

[Figure 3: VCFAnnotator usage flow. Labels shown: Functionalities (Scraping, Annotation, Database Adaptation, Preparation); Check Database Update; Automatic Scraping; Inputs (VCF Folder or VCF File, Databases Folder Path (Optional), Output Path (Optional)); Output File for Each Database Annotation; Annotated VCF File; Merging and Column Formatting; Ready-to-Use Databases.]

3.2 Software Architecture

Fig. 2 shows the VCFAnnotator software ar-

chitecture. The tool, implemented in Python

and available at https://github.com/ANTARES-PRJ/

VCFAnnotator, provides an Input Parser compo-

nent, which parses the input parameters and conﬁgu-

ration ﬁles. Depending on the conﬁguration and the

user request, one of the three functionalities is started.

The annotation functionalities are carried out by

ANNOVAR (see Sect. 2.3). This takes as input the

VCF ﬁle (or ﬁles) to be annotated and, exploiting

the selected biomedical databases, performs the an-

notation. Note that ANNOVAR is executed as a

command-line tool. In this way, we kept the architec-

ture as modular as possible, and substituting it with

a different (and possibly more powerful or efﬁcient)

annotation tool is possible in the future. In Fig. 2,

only ﬁve databases (i.e., those identiﬁed in the soft-

ware requirement analysis and previously explained

in Sect. 2.2) are reported, but the architecture is ﬂexi-

ble, and new databases can be added on the ﬂy.
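To make the command-line coupling concrete, the following Python sketch assembles an argument list for ANNOVAR's `table_annovar.pl` script. The paper does not show the exact invocation used by VCFAnnotator, so this is only an illustration under stated assumptions: the function name `build_table_annovar_command` is hypothetical, the protocol names in the usage note are placeholders that must match databases actually present in the database folder, and the flags are the ones documented for `table_annovar.pl`.

```python
from typing import List, Sequence, Tuple


def build_table_annovar_command(
    vcf_path: str,
    db_dir: str,
    out_prefix: str,
    protocols: Sequence[Tuple[str, str]],
    build: str = "hg38",
) -> List[str]:
    """Return the argument list for one table_annovar.pl run.

    `protocols` pairs each database name with its operation type
    (g, gx, r, or f), in the order ANNOVAR should apply them.
    """
    names = ",".join(name for name, _ in protocols)
    operations = ",".join(op for _, op in protocols)
    return [
        "perl", "table_annovar.pl", vcf_path, db_dir,
        "-buildver", build,        # genome build of the databases
        "-out", out_prefix,        # prefix of the output files
        "-remove",                 # delete temporary files
        "-protocol", names,
        "-operation", operations,
        "-nastring", ".",          # placeholder for missing annotations
        "-vcfinput",               # the input is a VCF file
    ]
```

A caller could pass the resulting list to `subprocess.run(cmd, check=True)`, for example with `protocols=[("refGene", "g"), ("clinvar_20221231", "f")]`. Building the command as a list rather than a shell string avoids quoting problems with paths, and keeping it in one function is what would make swapping in a different annotation tool a local change.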

The web scraping functionalities are carried out by the WebScraper component, which relies on the Beautiful Soup Python library (Hajba, 2018). It is devoted to fetching the database websites and signaling whether new versions of each configured database are available.

Finally, the database preparation is performed by

the DBPreparation component. As for the ANNO-

VAR tool, despite only ﬁve databases being connected

to the module in Fig. 2, the component can interact

with all databases, depending on user conﬁguration.

4 APP PROTOTYPE

In this section, we present the usage of VCFAnnotator, as shown in Fig. 3. VCFAnnotator supports three different types of operations, each fulfilling one specific task, namely the automatic scraping of the database websites, the preparation of the databases, and the actual annotation.

Configuration. The properties of VCFAnnotator and the configurations used in each of the three supported functionalities are set by using the config.yaml file reported in Listing 2. More specifically, the file contains two relevant paths: the db_path, where the databases used during the annotation phase are stored, and the destination_path, where the annotated files are saved (lines 1 and 2). Then, the configuration information for each database is described. As previously discussed, VCFAnnotator supports databases in the txt, vcf, and gff3 formats. Thus, according to the format, each database is reported in a different list (databasesTXT at line 3, databasesVCF at line 10, and databasesGFF3 at line 17). For each entry, the configuration file contains the id, the file name, and the operation type to be used while annotating a VCF file with the ANNOVAR subcomponent.
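As an illustration of how the three per-format lists might be consumed, the sketch below flattens them into a single list of database entries and validates the operation codes. It is not taken from the VCFAnnotator code base: the dictionary is a hand-written stand-in for the parsed config.yaml (the real tool reads the YAML file), the underscored key names `db_path` and `destination_path` are assumed, and `DatabaseEntry` and `collect_databases` are hypothetical names.

```python
from typing import Dict, List, NamedTuple

# Hand-written stand-in for the parsed configuration file (assumed key names).
CONFIG = {
    "db_path": "humandb/",
    "destination_path": "result/",
    "databasesTXT": [
        {"id": "Clinvar", "file": "clinvar", "operation": "f"},
        {"id": "OMIM", "file": "omim", "operation": "f"},
    ],
    "databasesVCF": [
        {"id": "gnomAD", "file": "hg38_gnomad.vcf", "operation": "f"},
        {"id": "HGMD", "file": "hg38_hgmd.vcf", "operation": "f"},
    ],
    "databasesGFF3": [
        {"id": "Gencode", "file": "hg38_gencode.gff3", "operation": "r"},
    ],
}

# Operation codes accepted in the configuration: gene-based, gene-based with
# cross-reference annotation, region-based, and filter-based.
VALID_OPERATIONS = {"g", "gx", "r", "f"}

# Database format -> name of the configuration list holding its entries.
FORMAT_LISTS = {"txt": "databasesTXT", "vcf": "databasesVCF", "gff3": "databasesGFF3"}


class DatabaseEntry(NamedTuple):
    db_id: str
    file: str
    operation: str
    fmt: str


def collect_databases(config: Dict) -> List[DatabaseEntry]:
    """Flatten the per-format lists into one list, validating each operation."""
    entries: List[DatabaseEntry] = []
    for fmt, key in FORMAT_LISTS.items():
        for item in config.get(key) or []:
            if item["operation"] not in VALID_OPERATIONS:
                raise ValueError(
                    f"{item['id']}: unknown operation {item['operation']!r}"
                )
            entries.append(
                DatabaseEntry(item["id"], item["file"], item["operation"], fmt)
            )
    return entries
```

Carrying the format alongside each entry lets later steps decide per database whether a conversion is needed before ANNOVAR is called.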

Automatic Scraping. The purpose of the automatic

scraping procedure is to discover, for each of the

databases used during the annotation phase, whether

new versions are available. This is pivotal for the ac-

curacy and effectiveness of the diagnosis provided by

a genetics laboratory. The scraping operation can be

performed automatically whenever a new annotation

is launched (autoCheck parameter at line 27 in List-

ing 2) or launched manually. To manually execute

this operation, VCFAnnotator can be called by using

the --checkDB (or -c) option:

$ python vcf_annotator.py --checkDB

In this way, the tool automatically checks for database updates and prints a table in which updates are indicated with a [!], as illustrated by the screenshot in Fig. 4. Note that, at the moment, the update of the databases is not automatic, and the user has to download the new files and manually substitute the old versions. More specifically, after having downloaded the new files and substituted them in the chosen db_path (see line 1 in Listing 2), the user may update the information in the config.yaml file (see Listing 2, starting from line 28) by adding for each database id the relevant version information (e.g., release number or release date), the website that has to be scraped to find whether new versions are available, and the text to be searched in a specific HTML tag. The update process was made manual to ensure researchers know the current database version and avoid annotations with inconsistently updated databases, which risk incomplete information. Furthermore, we emphasize that in the screenshot in Fig. 4, only four of the five databases used by VCFAnnotator are reported. Indeed, looking at the scraping section in Listing 2, the HGMD database is not set as one of those to be scraped because, as explained in Sect. 2.2, it is the only one not freely available.

Figure 4: Output table for the automatic scraping procedure.

The following values can be used as operation: g for gene-based, gx for gene-based with cross-reference annotation, r for region-based, and f for filter-based.

Listing 2: Configuration file for VCFAnnotator.

 1 db_path: "humandb/"
 2 destination_path: "result/"
 3 databasesTXT: # Name of the DBs
 4   - id: Clinvar
 5     file: "clinvar"
 6     operation: "f"
 7   - id: OMIM
 8     file: "omim"
 9     operation: "f"
10 databasesVCF:
11   - id: gnomAD
12     file: "hg38_gnomad.vcf"
13     operation: "f"
14   - id: HGMD
15     file: "hg38_hgmd.vcf"
16     operation: "f"
17 databasesGFF3:
18   - id: Gencode
19     file: "hg38_gencode.gff3"
20     operation: "r"
21 convertFromVCFToTxt: # convert from vcf to txt
22   - id: Clinvar
23 removeDEL: # substitute <DEL> with the .
24   - id: HGMD
25 clean: # remove the comments from txt files
26   - id: OMIM
27 autoCheck: false # Scraping
28 scraping:
29   - id: Gencode
30     release: "44"
31     website: "https://www.gencodegenes.org/human"
32     textToSearch: "Release "
33     tag: "h1"
34   - id: Clinvar
35     date: "2024-04-22"
36     website: "https://ftp.ncbi.nlm.nih.gov/pub/clinvar
37       /vcf_GRCh38/"
38     textToSearch: "clinvar_"
39     tag: "a"
40   - id: GnomAD
41     release: "4.1"
42     website: "https://gnomad.broadinstitute.org/news/
43       category/release/"
44     textToSearch: "gnomAD"
45     tag: "h2"
46   - id: OMIM
47     date: "2024-09-20"
48     website: "https://omim.org"
49     textToSearch: "Updated "
50     tag: "h5"
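The role of the tag and textToSearch settings can be sketched as follows. This is a simplified illustration, not the tool's implementation: VCFAnnotator relies on Beautiful Soup, whereas this sketch deliberately uses Python's standard-library `html.parser` to stay dependency-free, works on an HTML string that has already been downloaded, and assumes that the version label is the first whitespace-delimited token following the searched text. The names `TagTextCollector`, `find_latest_version`, and `has_update` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TagTextCollector(HTMLParser):
    """Collect the text content of every occurrence of one HTML tag."""

    def __init__(self, tag: str) -> None:
        super().__init__()
        self.tag = tag.lower()       # HTMLParser reports tag names in lowercase
        self.depth = 0               # > 0 while inside the target tag
        self.buffer: List[str] = []
        self.texts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.texts.append("".join(self.buffer).strip())
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def find_latest_version(html: str, tag: str, text_to_search: str) -> Optional[str]:
    """Return the token following `text_to_search` in the first matching tag."""
    collector = TagTextCollector(tag)
    collector.feed(html)
    for text in collector.texts:
        if text_to_search in text:
            rest = text.split(text_to_search, 1)[1].strip()
            return rest.split()[0] if rest else None
    return None


def has_update(html: str, tag: str, text_to_search: str, current: str) -> bool:
    """True when a version label is found and differs from the configured one."""
    latest = find_latest_version(html, tag, text_to_search)
    return latest is not None and latest != current
```

Comparing for inequality rather than ordering keeps the check independent of the version scheme, since the configured databases mix release numbers and release dates.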

Database Preparation. As introduced in Sect. 3,

VCFAnnotator allows for annotating VCF ﬁles by us-

ing different databases. This operation is made pos-

sible by its integration with ANNOVAR. However,

we have found that the formatting conventions used

by ANNOVAR and those used by major databases

may differ and, thus, a preparation of each database

is needed. To execute this operation, VCFAnnotator

can be called by using the --prepare (or -p) option:

$ python vcf_annotator.py --prepare

This operation encompasses three different steps and is performed depending on the configuration set in the config.yaml file reported in Listing 2 (from line 21 to line 26). The databases in the convertFromVCFToTxt list are translated into the .txt format, which is supported by ANNOVAR. Although ANNOVAR is supposed to work correctly with vcf files, we have found that for some of them, it does not. For example, in our experiments, we have seen that the Clinvar database is not supported by ANNOVAR if used as a vcf file, while the GnomAD one works fine. Thus, the databases that ANNOVAR cannot handle as vcf files are converted into txt files. Then, any <DEL> string, which indicates a deletion, found in the databases in the removeDEL list is converted into a dot. Finally, all lines starting with # are removed from the databases in the clean list as they may be interpreted as the header by ANNOVAR.
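Two of the three preparation steps are simple line-level transformations and can be sketched in a few lines of Python. The function names `drop_comment_lines` and `replace_del_marker` are hypothetical, and the sketch is an assumption about how such steps could be written rather than the tool's actual code; the vcf-to-txt conversion is left out because the paper does not detail the target column layout.

```python
from typing import Iterable, Iterator


def drop_comment_lines(lines: Iterable[str]) -> Iterator[str]:
    """'clean' step: skip every line starting with '#'."""
    for line in lines:
        if not line.startswith("#"):
            yield line


def replace_del_marker(lines: Iterable[str]) -> Iterator[str]:
    """'removeDEL' step: substitute the <DEL> string with a dot."""
    for line in lines:
        yield line.replace("<DEL>", ".")
```

Both functions are generators, so a large database file can be streamed line by line, for instance by passing an open file object as `lines` and writing the results to a new file, without loading it into memory.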

VCF Annotation. This mode provides the core functionalities of VCFAnnotator, i.e., the annotation of VCF files. To execute this task, VCFAnnotator can be called using the --annotateVCF (or -a) option:

$ python vcf_annotator.py --annotateVCF input.vcf

This operation requires the user to specify an input vcf file or the path of a folder containing multiple files. In both cases, for each input file, the annotation is performed using all databases reported in the config.yaml file and, for each of them, with the specific operation. We emphasize that we designed VCFAnnotator to work even with incomplete vcf files, and with files composed by merging rows belonging to multiple subjects. In such a way, researchers can also annotate a reduced set of variants, e.g., those under investigation for finding a specific pathology. To maintain the traceability of all operations, VCFAnnotator calls ANNOVAR iteratively on all databases, so that, at the end of the annotation process, multiple annotated vcf files are available with the name DBName_VCFInputName_YYYY-mm-dd_HH_MM_SS. In addition, a merged file with the name VCFInputName_YYYY-mm-dd_HH_MM_SS is produced. It contains all annotations from all databases, each one in one or more columns with the same name (or starting with a prefix) as the used database.

During annotation, the databases are taken from the db_path folder (line 1 in Listing 2), and the results are stored in the destination_path folder (line 2 in Listing 2). However, different paths can be specified when executing VCFAnnotator from the command line. More specifically, the database path can be specified in the command after the option --DBPath (or -db), while the destination path can be specified after the option --DestinationPath (or -d). An example of an annotated VCF obtained using VCFAnnotator with the VCF previously shown in Listing 1 is reported in Listing 3. It can be seen that the annotation process adds several columns (e.g., all those starting with CLN for the Clinvar database and the one marked as Gencode for the Gencode database).

Listing 3: Example of an annotated VCF file.

Chr Start End Ref Alt CLNALLELEID CLNDN CLNDISDB CLNREVSTAT CLNSIG Id Qual Filter Info Format W3HH-AFA-I Gencode MIM Number Gene/Locus And Other Related Symbols Gene Name Approved Gene Symbol Entrez Gene ID Ensembl Gene ID Comments Phenotypes Mouse Gene Symbol/ID AC AN AF CLASS MUT GENE STRAND DNA PROT DB PHEN RANKSCORE SVTYPE END SVLEN
chr1 941119 941119 A G . . . . . . 568.23 PASS AC=2;AF=1;AN=2;DP=117;FS=0;MQ=250;QD=4.86;SOR=1.981 GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP:PP:DN 1/1:0,28:1:28:81:PASS:0,13:0,15:169,84,0:130.82,80.816,0:329,164,0:Inherited Name=ENSG00000187634.13,ENST00000618323.5,ENST00000474461.1,ENST00000478729.1,ENST00000618181.5,ENST00000618779.5,ENST00000622503.5,ENST00000342066.8,ENST00000616125.5,ENST00000341065.8,ENST00000616016.5,ENST00000455979.1,exon:ENST00000474461.1:1,ENST00000617307.5 . . . . . . . . . . . . . . . . . . . . . . . . [...]
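The handling of single-file versus folder inputs and the timestamped output names can be sketched as follows. This is only an illustration under assumptions: the helper names `collect_vcf_inputs` and `output_name` are hypothetical, the components of the output name are assumed to be joined with underscores, and the input name is assumed to be used without its extension.

```python
from datetime import datetime
from pathlib import Path
from typing import List


def collect_vcf_inputs(path: str) -> List[Path]:
    """Return the .vcf files in a folder, or the single file given."""
    p = Path(path)
    if p.is_dir():
        return sorted(f for f in p.iterdir() if f.is_file() and f.suffix == ".vcf")
    return [p]


def output_name(vcf_file: Path, when: datetime, db_name: str = "") -> str:
    """Build '[DBName_]VCFInputName_YYYY-mm-dd_HH_MM_SS' (assumed separators)."""
    stamp = when.strftime("%Y-%m-%d_%H_%M_%S")
    parts = [vcf_file.stem, stamp]
    if db_name:
        # Per-database outputs are prefixed with the database name.
        parts.insert(0, db_name)
    return "_".join(parts)
```

Passing the timestamp in as an argument, instead of calling `datetime.now()` inside the function, lets all per-database files and the merged file of one run share the same stamp.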

5 RELATED WORK

Several efficient and effective annotation tools have been proposed in the literature. One of the most

used is ANNOVAR (Wang et al., 2010), which is

the tool underlying the annotation functionalities of

VCFAnnotator. Over the years, it has been continu-

ously updated and extended, including the addition

of a web-based annotation environment, wANNO-

VAR (Yang and Wang, 2015). Similarly, in (Ped-

ersen et al., 2016), the Vcfanno tool was proposed.

While powerful, its complexity makes it less suitable

for non-expert users. Future work could explore au-

tomating its conﬁguration as our architecture (Sect. 3)

supports replacing the annotation component without

affecting others. The SnpEff and VEP tools were

proposed in (Cingolani et al., 2012) and (McLaren

et al., 2010), respectively. They annotate variants

by genomic location and predict coding effects but

lack additional needed information, making SnpEff or

VEP unsuitable replacements for ANNOVAR. Simi-

larly, BCFTools (Danecek et al., 2021) can perform

VCF annotation, but it works only by using a sin-

gle reference database, and only in the GFF3 for-

mat. Instead, VCFAnnotator allows users to use more

databases with multiple input formats. Finally, most

companies producing sequencers provide their own

annotation tools, such as for Illumina.

However, in

most cases, the tools are not open-source and are ex-

pensive and inﬂexible. Other tools, such as VCF-

Miner (Hart et al., 2015), was proposed in the liter-

ature, but they allow only for mining information into

already annotated VCFs.
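The remark above about replacing the annotation component without affecting the others can be illustrated with a small interface sketch. Everything in it is hypothetical: `AnnotationBackend`, `RecordingBackend`, and `run_pipeline` are names invented for the example and do not describe the actual VCFAnnotator classes, and the stand-in backend runs no external tool.

```python
from typing import Dict, List, Protocol


class AnnotationBackend(Protocol):
    """Hypothetical interface: annotate one VCF with one database."""

    def annotate(self, vcf_path: str, db_name: str) -> str:
        """Return an identifier of the annotated output."""
        ...


class RecordingBackend:
    """Stand-in backend for illustration only; it calls no real annotator."""

    def __init__(self, label: str) -> None:
        self.label = label

    def annotate(self, vcf_path: str, db_name: str) -> str:
        return f"{self.label}:{db_name}:{vcf_path}"


def run_pipeline(
    backend: AnnotationBackend, vcf_path: str, db_names: List[str]
) -> Dict[str, str]:
    """The surrounding steps depend only on the interface, not on the backend."""
    return {db: backend.annotate(vcf_path, db) for db in db_names}
```

Under this assumption, swapping an ANNOVAR-based backend for, say, a Vcfanno-based one would only require a new class with the same `annotate` method.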

In addition to what is mentioned in Sect. 2.2, other tools can enrich the annotated VCF and help prioritize variants. GERP (Genomic Evolutionary Rate Profiling) (Davydov et al., 2010) detects conserved regions by analyzing evolutionary constraints across species. CADD (Combined Annotation-Dependent Depletion) (Kircher et al., 2014) predicts variant impact by combining annotations into a single score, identifying the variants likely to affect gene function. PolyPhen-2 (Polymorphism Phenotyping v2) (Adzhubei et al., 2013) assesses the impact of amino acid substitutions on protein structure and function, which helps in evaluating the pathogenicity of genetic variants.

6 CONCLUSIONS

Genetic variant annotation plays a crucial role in research on genetic diseases. This process enables genetics laboratory researchers to filter variants and assess their potential impact on human beings. However, the analysis is not always definitive due to the evolving nature of scientific knowledge, making the use of up-to-date databases critically important.

Various tools have been introduced in the literature to annotate VCF files, with ANNOVAR being one of the most widely used. Despite its popularity, ANNOVAR has certain limitations regarding the range of databases it supports. In this paper, we propose VCFAnnotator to address these limitations. It facilitates the use of external databases, automatically preparing them for compatibility with ANNOVAR, and includes a scraping feature to detect updated databases.

This study marks an initial step towards developing an automated system for streamlining various tasks in genetic research processes. As future work, we aim to enable researchers to selectively re-annotate specific entries of the VCF file based on predefined criteria and to facilitate comparisons between these new annotations and previous ones. Additionally, we aim to explore the steps required for VCFAnnotator's certification for diagnostic use (Bombarda et al., 2022; Bombarda et al., 2021).

ACKNOWLEDGEMENTS

This work was funded by PNRR - ANTHEM (AdvaNced Technologies for Human-centrEd Medicine) - Grant PNC0000003 – CUP: B53C22006700001 - Spoke 1 - Pilot 1.4. We would like to thank Fabio Assolari and Simone Ronzoni for the preliminary work that they did for this project during their B.Sc. theses.

REFERENCES

Adzhubei, I. et al. (2013). Predicting functional effect of human missense mutations using PolyPhen-2. Current Protocols in Human Genetics, 76(1).

Battista, R., Blancquaert, I., et al. (2011). Genetics in health care: an overview of current and emerging models. Public Health Genomics, 15(1):34–45.

Bombarda, A., Bonfanti, S., et al. (2021). Lessons learned from the development of a mechanical ventilator for COVID-19. In 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE), pages 24–35. IEEE.

Bombarda, A., Bonfanti, S., et al. (2022). Guidelines for the development of a critical software under emergency. Information and Software Technology, 152:107061.

Chen, S., Francioli, L. C., Goodrich, J. K., et al. (2024). A genomic mutational constraint map using variation in 76,156 human genomes. Nature, 625(7993):92–100.

Cingolani, P., Platts, A., et al. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly, 6(2):80–92.

Cooper, D. (1998). The human gene mutation database. Nucleic Acids Research, 26(1):285–287.

Danecek, P., Auton, A., et al. (2011). The variant call format and VCFtools. Bioinformatics, 27(15):2156–2158.

Danecek, P., Bonfield, J. K., et al. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2).

Davydov, E. V., Goode, D. L., Sirota, M., et al. (2010). Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Computational Biology, 6(12):e1001025.

Hajba, G. L. (2018). Using Beautiful Soup, pages 41–96. Apress.

Hamosh, A. (2004). Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research, 33(Database issue):D514–D517.

Hart, S. N., Duffy, P., et al. (2015). VCF-Miner: GUI-based application for mining variants and annotations stored in VCF files. Briefings in Bioinformatics, 17(2):346–351.

Kircher, M. et al. (2014). A general framework for estimating the relative pathogenicity of human genetic variants. Nature Genetics, 46(3):310–315.

Landrum, M. J. et al. (2015). ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Research, 44(D1):D862–D868.

McLaren, W. et al. (2010). Deriving the consequences of genomic variants with the Ensembl API and SNP effect predictor. Bioinformatics, 26(16):2069–2070.

Myers, C., Paulk, N., and Dudlak, C. (2001). Genomics: implications for health systems. Frontiers of Health Services Management, 17(3):3–16.

Ng, P. C. (2003). SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Research, 31(13):3812–3814.

O’Leary, N. A., Wright, M. W., et al. (2015). Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Research, 44(D1):D733–D745.

Pedersen, B. S., Layer, R. M., and Quinlan, A. R. (2016). Vcfanno: fast, flexible annotation of genetic variants. Genome Biology, 17(1).

Pei, B., Sisu, C., Frankish, A., et al. (2012). The GENCODE pseudogene resource. Genome Biology, 13(9):R51.

Salgado, D., Bellgard, M. I., et al. (2016). How to identify pathogenic mutations among all those variations: variant annotation and filtration in the genome sequencing era. Human Mutation, 37(12):1272–1282.

Wang, K., Li, M., et al. (2010). ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research, 38(16):e164.

Yang, H. and Wang, K. (2015). Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR. Nature Protocols, 10(10):1556–1566.
