MUTATIONAL DATA LOADING ROUTINES FOR HUMAN GENOME DATABASES

The BRCA1 Case

Matthijs van der Kroon, Ignacio Lereu Ramirez, Ana M. Levin, Oscar Pastor

Centro de Investigación en Métodos de Producción de Software, Universidad Politécnica de Valencia

Camino de Vera s/n, 46022 Valencia, Spain

Sjaak Brinkkemper

Department of Information and Computing Sciences Utrecht University, Utrecht, The Netherlands

Keywords:

BRCA1, Conceptual model, Data integration, Human genome.

Abstract:

Over the last decades a large amount of research has been done in the genomics domain, which has generated and continues to generate terabytes, if not exabytes, of information that is stored globally in a very fragmented way. Different databases use different ways of storing the same data, resulting in undesired redundancy and restricted information transfer. In addition, keeping the existing databases consistent and maintaining data integrity is mainly left to human intervention, which is costly in both time and money as well as error prone. Identifying a fixed conceptual dictionary in the form of a conceptual model thus seems crucial. This paper presents an effort to integrate the mutational data from the established genomic data source HGMD into a conceptual-model-driven database, HGDB, thereby providing useful lessons for improving the existing conceptual model of the human genome.

1 INTRODUCTION

From an information systems point of view, the human genome is an extremely complex system in which a lot of ambiguity exists. For example, basic concepts such as what exactly defines a gene are still not explicitly described by the domain. Biology largely depends on domain experts interpreting data in order for knowledge to emerge. Combining the lack of proper data structure with the very large amounts of data generated, a clear problem emerges: how can domain experts dedicate their limited time to the right pieces of information if these are buried in noise? Computers excel at processing large amounts of data, and thus a logical step would be to apply this strength to the present-day problem in genetics, separating potentially useful information from the noise. For this process to take place, a conceptual modeling approach is essential: it allows for an adequate representation of the domain. Present-day solutions that aim to do exactly this (e.g. ontologies) usually provide controlled vocabularies instead of fixing a conceptual gamut.

A proper conceptual model is expected to provide a clear data structure, enabling efficient and effective access to genomic data and thereby offering ways for pharmaceutical, medical and research institutes to reuse previously researched data, as mentioned by (Pastor, 2008). Also, the paradigm shift implied by considering the genome as a complex information system is expected to allow for exciting new views. To date, most bioinformatics research is located in the solution space, attempting to interpret the data that comes out of 'the black box', for instance by applying powerful sequence alignment tools like BLAST and BLAT. Another point of view is offered by (Pastor, 2008), whose efforts are directed at tracing and understanding the processes that actually lead to these data. Seen from an informatics point of view, this essentially amounts to finding the source code, by analyzing the object code, of what may very well be the most sophisticated software ever to be analyzed: life itself.

In section 2 earlier work will be discussed. For a

detailed description of the conceptual model of the

Human Genome (CSHG) often referred to in this

van der Kroon M., Lereu Ramirez I., M. Levin A., Pastor Ó. and Brinkkemper S..

MUTATIONAL DATA LOADING ROUTINES FOR HUMAN GENOME DATABASES - The BRCA1 Case.

DOI: 10.5220/0003130902660269

In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2011), pages 266-269

ISBN: 978-989-8425-36-2

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

work please consult (Pastor et al., 2010a) and (van der

Kroon et al., 2009). Section 3 will discuss the results

of the extraction of data from the external sources,

listing the encountered problems and resulting adjust-

ments to the conceptual model. Ultimately, in section

4 conclusions will be drawn, along with suggestions

for further research.

2 RELATED WORK

Other solutions to the ambiguity problems associated with the genetics domain include ontologies (Ashburner et al., 2000). To understand why ontologies alone do not suffice for obtaining a full understanding of any given domain, some background information is necessary. Definitions exist on two levels: the conceptual and the semantic. The semantic aspect refers to instances of concepts, e.g. the BRCA2 gene, which is an instance of the abstract "gene" concept. A typical problem at this level is ambiguity in naming conventions; for instance, the BRCA2 gene is also known as "Fancd1" and "RAB163". The conceptual aspect is more abstract and handles questions like "What is a gene?". It is our strong belief that both are vital for the proper and complete understanding of any given domain.

An information systems approach to this speciﬁc

problem space is not entirely new. (Okayama et al.,

1998) describes the conceptual schema of a DNA

database using an extended entity-relationship model.

(Paton et al., 2000) built on these efforts by presenting a first attempt at conceptually modeling the S. cerevisiae genome, proposing a collection of conceptual data models for genomic data. Among these

conceptual models are a basic schema diagram for

genomic data, a protein-protein interaction model, a

model for transcriptome data and a schema for allele

modeling.

Whereas (Paton et al., 2000) provides a broader view by presenting conceptual models for describing both genome sequences and related functional data sets, (Pastor, 2008) focused on the basic schema diagram for genomic data, adapting it to the human genome, and eventually produced a database, the human genome database (HGDB), corresponding to this model and following the standard rules of logical design. This database is now in the prototype phase and the first two genes, NF1 and BRCA1, have been partially loaded. (Pastor et al., 2009) describes the evolution HGDB went through during the process of conceptually mapping HGDB and HGMD to each other. (Pastor, 2008) describes the evolution of the model more generally and provides a descriptive overview of how the model came to be and how it evolved into what it is now.

3 RESULTS

(Pastor et al., 2010b) reports a study comparing the HGMD to the CSHG in order to identify a conceptual mapping between the two. It is this mapping that is followed in this document, and this section reports the problems encountered when actually loading the information from the HGMD into the HGDB for the BRCA1 gene. Roughly, the problems can be separated into two categories: intrinsic data properties and data representation. Verifiably incorrect, inconsistent or incomplete data (tuples) are examples of problems with the actual data, i.e. intrinsic data properties. Difficulties associated with the process of extracting the data from the external source and ambiguous descriptions of mutation properties are typical examples of data representation problems. Naturally, the division between the two categories is not strict and some overlap exists. It is however useful to keep in mind that intrinsic data property problems tend to affect the entire genetics domain, while the data representation difficulties are restricted to HGMD.

3.1 Data Loading Problems

HGMD distinguishes 10 mutation types: Missense/nonsense, Splicing, Regulatory, Small Deletions, Small Insertions, Small Indels, Gross Deletions, Gross Insertions, Complex Rearrangements and Repeat Variations. Nearly all of these types can be mapped to the Variation and Precise concepts of the CSHG; the exceptions are Gross Deletions, Gross Insertions, Complex Rearrangements and Repeat Variations. These latter types are described in a very unstructured manner, close to natural language, and are thus considered impossible to process automatically. The CSHG accommodates these tuples as Imprecise, which merely stores a description of the mutation.
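As an illustration, the routing of HGMD mutation types to CSHG concepts described above could be sketched as follows. This is a minimal sketch only: the enum, the set names and the function `target_concept` are hypothetical and are not part of HGMD or of the HGDB loading routines.

```python
from enum import Enum


class CshgConcept(Enum):
    """CSHG concepts named in the text; the enum itself is illustrative."""
    PRECISE = "Precise"
    IMPRECISE = "Imprecise"


# The four HGMD types the text describes as too unstructured to process.
IMPRECISE_TYPES = frozenset({
    "Gross Deletions",
    "Gross Insertions",
    "Complex Rearrangements",
    "Repeat Variations",
})

# The remaining six HGMD types, which can be mapped to Precise variations.
PRECISE_TYPES = frozenset({
    "Missense/nonsense",
    "Splicing",
    "Regulatory",
    "Small Deletions",
    "Small Insertions",
    "Small Indels",
})


def target_concept(hgmd_type: str) -> CshgConcept:
    """Return the CSHG concept an HGMD mutation type would be loaded into."""
    if hgmd_type in IMPRECISE_TYPES:
        return CshgConcept.IMPRECISE
    if hgmd_type in PRECISE_TYPES:
        return CshgConcept.PRECISE
    raise ValueError(f"unknown HGMD mutation type: {hgmd_type!r}")
```

Raising an error on an unknown type, rather than defaulting to Imprecise, is a design choice of this sketch: it keeps unexpected input visible instead of silently storing it as free text.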

3.1.1 Intrinsic Data Properties

In some cases entries are missing from the HGMD mutational data. For instance, the splicing mutations overview provided by HGMD mentions 5 mutations in intron 22, while (Panguluri et al., 1999) reports at least 2 other mutations: IVS22+67 (T>C) and IVS22+8 (T>A). Three concrete examples of this problem were encountered, all three in Splicing mutations. However,

this particular type of problem is very difficult to detect, since finding such omissions involves rereading the articles HGMD cites, which is hard to automate. Thus, although only three concrete occurrences of this problem have been encountered, it is likely that more exist.

Splicing mutation CS961492 describes a C>T mutation, for which HGMD indicates Breast cancer as a possible phenotype. However, the corresponding article (Langston et al., 1996) does not mention breast cancer in combination with this mutation even once. The article does mention the mutation as being found in men suffering from prostate cancer. Thus, judging from the rather limited information made available by HGMD on this specific mutation, we conclude that HGMD made an error during data entry.

Splicing mutations CS063247 and CS011027 are reported as located near intron 4. According to the splice junctions overview HGMD provides, however, neither an intron 4 nor an exon 4 exists. The literature explains this discrepancy as the result of the misidentification of an inserted Alu element (Smith et al., 1996).

3.1.2 Data Representation

Some data is provided in natural language. For instance, the fact that the first two BRCA1 exons are alternative non-coding exons is only mentioned in the header of the Splice Junctions overview. In addition, in Small Deletions (2 instances) and in Small Insertions (3 instances) some mutations are located through mouse-over tags; the information communicated by these tags is unstructured to such a degree that we might call it natural language as well. Also, in the case of imprecise mutations (Gross Deletions, Gross Insertions, Complex Rearrangements and Repeat Variations), the greater part of the information presented by HGMD is in natural language, severely impeding an automated approach in the affected cases.

In some cases, the HGMD database uses different ways of locating mutations within the same type of mutation. For instance, Small Insertion mutations CI030168, CI962219 and CI022582 occur in non-coding areas of the gene, just like the Small Deletion mutations CD991644 and CD994433. Since HGMD generally locates these types of mutations by cDNA codon reference, and given that non-coding sequences simply do not exist in the cDNA, HGMD locates these mutations in a different way. In the case of Small Insertions, HGMD provides a Splice Junction reference, very much like the method used to locate Splicing mutations. The CI030168, CI962219 and CI022582 mutations are located at IVS20+21, IVS20+48 and IVS20+64 respectively. Here IVS20 indicates the intron number and +21 the offset; however, since no acceptor/donor information is provided, it is unclear from which side of the intron the offset should be counted. In the case of Small Deletion mutations CD991644 and CD994433, at first sight no indication of how to locate them is provided. This information turns out to be provided through mouse-over tags, in the Splice Junction referenced form described earlier. CD991644 is thus located by "I7E8-24, aka IVS7 -15 del10" and CD994433 by "I12+34 / polymorphism ?". This problem was encountered 3 times in Small Insertions and 2 times in Small Deletions, making a total of 5 occurrences.
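To make the intron-referenced notation concrete, a loader might extract the intron number and signed offset along the following lines. This is a hedged sketch: the names `IntronLocation` and `parse_ivs` are hypothetical, and the regular expression only covers the `IVS<n>+/-<offset>` form quoted above. It deliberately does not decide from which side of the intron the offset is counted, since that information is not available.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IntronLocation:
    intron: int   # intron number, e.g. 20 for IVS20
    offset: int   # signed offset as written; its reference side is unknown


# Matches a leading "IVS<number>", a sign, and an offset, allowing spaces.
_IVS_PATTERN = re.compile(r"^IVS(\d+)\s*([+-])\s*(\d+)")


def parse_ivs(text: str) -> Optional[IntronLocation]:
    """Parse strings such as 'IVS20+21'; return None if the form differs."""
    match = _IVS_PATTERN.match(text.strip())
    if match is None:
        return None
    sign = 1 if match.group(2) == "+" else -1
    return IntronLocation(intron=int(match.group(1)),
                          offset=sign * int(match.group(3)))
```

Strings in other forms, such as the mouse-over text "I12+34 / polymorphism ?", yield `None` in this sketch and would still have to be handled manually.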

In splicing mutations, HGMD uses a different way of locating mutations. Here mutations are located by referring to splice junctions. An offset is given to indicate the number of nucleotides between the indicated splice junction and the actual mutation. In a so-called splicing mutations overview, HGMD then provides a sample sequence for each intron/exon junction contained in the gene. This method of locating mutations is used primarily in splicing mutations (80 instances), but in some exceptional cases HGMD also uses this notation to provide locational data for other types of mutations, for instance in Small Deletions (2 instances) and in Small Insertions (3 instances).

Phenotype ambiguity exists in the HGMD, i.e. mutations may or may not result in a certain phenotype. This is indicated by a question mark following the supposed phenotype. However, no probability scores are stated, and a mutation without a (noticeable) phenotype is considered to be a variation with neutral effect. Since variations and mutations are considered to be two different concepts in the conceptual model of the human genome, this poses problems for loading the database correctly. 94 instances of this problem have been identified: missense/nonsense mutations account for most instances (73), splicing mutations for another 16, small deletion mutations for 2 and small insertion mutations for 3.
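The question-mark convention could be detected during loading roughly as sketched below, so that uncertain phenotypes are flagged for expert review rather than loaded as established mutations. The `Phenotype` type and `parse_phenotype` function are hypothetical names, and the assumption that the marker is always a trailing "?" is based only on the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phenotype:
    name: str         # phenotype text without the uncertainty marker
    uncertain: bool   # True when the source text ended in a question mark


def parse_phenotype(raw: str) -> Phenotype:
    """Split an HGMD-style phenotype string into name and uncertainty flag."""
    text = raw.strip()
    uncertain = text.endswith("?")
    name = text.rstrip("?").strip()
    return Phenotype(name=name, uncertain=uncertain)


# Instance counts per mutation type as reported in the text.
AMBIGUOUS_COUNTS = {
    "Missense/nonsense": 73,
    "Splicing": 16,
    "Small Deletions": 2,
    "Small Insertions": 3,
}
TOTAL_AMBIGUOUS = sum(AMBIGUOUS_COUNTS.values())
```

The per-type counts add up to the 94 instances reported in the text.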

4 CONCLUSIONS

In this document we have confirmed the primary reason of existence for conceptual modeling techniques. The HGMD is considered an extremely useful source of data about genetic mutations in the field. Because it is curated, it is also considered to be highly reliable. However, this document shows that a lot remains to be desired. The apparent lack of a thorough conceptual modeling approach seems to leave its traces on the service. Every tuple in the HGMD is supposed to

represent a genetic variation known to be associated with disease. This quite rigorous definition is compromised in cases where indicated variations 'might' be associated with disease, as indicated in the HGMD by the question mark. Indeed, a variation that is not associated with disease should not be considered a mutation and thus should not enter the dataset as is. The CSHG handles these cases nicely by providing the neutral polymorphism dimension for the Variation concept. Another point of improvement is the lack of a proper way of accommodating the various reference sequences in common use in research papers. To illustrate, a certain mutation might be located at position 131 in reference sequence X, but correspond to position 125 in reference sequence Y. The HGMD provides its own cDNA sequence, relative to which it locates the majority of its mutations. However, this cDNA sequence is 'based' on an NCBI sequence, and can thus differ from it.

For optimal use of the data provided by HGMD, the above means that in many cases an expert still needs to evaluate and interpret the data. This is expensive in both time and money. Aligning the HGMD set of mutations to the NCBI reference sequence, which is considered to be the 'gold standard', thus seems a logical step. Concretely, we suggest two major changes to the HGMD: (i) provide a more elaborate way of handling the associated phenotype, perhaps linking directly to the Online Mendelian Inheritance in Man (OMIM) database; and (ii) add a new column in which the reference sequence indicated by the source paper is also stored. This will allow for a much easier and more efficient use of the HGMD data set. Considering that data is acquired manually from the papers, adding this element of extracted data seems to be relatively low cost.
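Suggestion (ii) could be sketched as a record that carries its reference sequence alongside the position, so that positions can be translated explicitly. Everything here is illustrative: the record layout, the names `MutationRecord`, `OFFSETS` and `convert`, and the accession "EXAMPLE1" are hypothetical. Treating the difference between two reference sequences as one constant offset is a strong simplification; in practice a sequence alignment would be needed.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class MutationRecord:
    accession: str            # identifier of the mutation entry
    position: int             # position within the reference sequence
    reference_sequence: str   # the proposed extra column


# Hypothetical table: position in target = position in source + offset.
OFFSETS: Dict[Tuple[str, str], int] = {("X", "Y"): -6}


def convert(record: MutationRecord, target: str) -> Optional[MutationRecord]:
    """Re-express a record in another reference sequence, if an offset is known."""
    if record.reference_sequence == target:
        return record
    offset = OFFSETS.get((record.reference_sequence, target))
    if offset is None:
        return None  # no known relation; leave to expert interpretation
    return replace(record,
                   position=record.position + offset,
                   reference_sequence=target)
```

With the offset above, a record at position 131 in reference sequence X is re-expressed at position 125 in reference sequence Y, matching the illustration given earlier in this section.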

When we look at the HGMD we cannot help but notice that, although it is very useful, a lot is still to be desired from an information systems point of view. It is our strong belief that accurately representing any data, and perhaps genetic data in particular, can only be done by means of careful analysis of the domain. The CSHG aims to do exactly this by applying a conceptual modeling approach.

REFERENCES

Ashburner, M., Ball, C., and Blake, J. (2000). Gene ontology: tool for the unification of biology. Nature Genetics, 25(1):25–30.

Langston, A., Stanford, J., Wicklund, K., Thompson, J., Blazej, R., and Ostrander, E. (1996). Germ-line BRCA1 mutations in selected men with prostate cancer. American Journal of Human Genetics, 58:881–885.

Okayama, T., Tamura, T., Gojobori, T., Tateno, Y., Ikeo, K., Miyazaki, S., Fukami-Kobayashi, K., and Sugawara, H. (1998). Formal design and implementation of an improved DDBJ DNA database with a new schema and object-oriented library. Bioinformatics, 14(6):472.

Panguluri, R., Dunston, G., Brody, L., Modali, R., Utley, K., Adams-Campbell, L., Day, A., and Whitfield-Broome, C. (1999). BRCA1 mutations in African Americans. Human Genetics, 105(1-2):28–31.

Pastor, O. (2008). Conceptual modeling meets the human genome. In Conceptual Modeling - ER 2008, volume 5231 of Lecture Notes in Computer Science, pages 1–11. Springer-Verlag Berlin Heidelberg.

Pastor, O., Levin, A., Casamayor, J., Celma, M., Virrueta,

A., and Eraso, L. (2009). The Evolution of Concep-

tual Modeling, chapter Model driven-based engineer-

ing applied to the interpretation of the human genome.

Springer-Verlag.

Pastor, O., Levin, A., Celma, M., Casamayor, J., Schattka,

L. E., Villanueva, M., and Perez-Alonso, M. (2010a).

Proceedings of the IVth Int. Conference on Research

Challenges in Information Science, chapter Enforcing

Conceptual Modeling to Improve the Understanding

of Human Genome. IEEE Press.

Pastor, O., Pastor, M., and Burriel, V. (2010b). Conceptual

modeling of human genome mutations: a dichotomy

between what we have and what we should have. In

Proceedings of Bioinformatics 2010, pages 160–166.

BIOSTEC Bioinformatics.

Paton, N., Khan, S., Hayes, A., Moussouni, F., Brass, A.,

Eilbeck, K., Goble, C., Hubbard, S., and Oliver, S.

(2000). Conceptual modeling of genomic information.

Bioinformatics, 16(6):548–557.

Smith, T., Lee, M., Jerome, N., McEuen, M., Taylor, M., Hood, L., and King, M. (1996). Complete genomic sequence and analysis of 117 kb of human DNA containing the gene BRCA1. Genome Research, 6:1029–1049.

van der Kroon, M., Ramirez, I. L., Levin, A., Pastor, O., and Brinkkemper, S. (2009). Mutational data loading routines for human genome databases: the BRCA1 case. Report UUCS2009020, Utrecht University.
