start_position and end_position respectively
describe the beginning and end of the allele with respect
to the chromosome, and strand stands for one of the
two DNA chains (plus or minus).
The relationship between the Gene class and the Allele
class identifies the alleles of a gene in the
information system. It allows a gene to have no
allelic information.
The DataBank class represents the different public
databases used to load our database. The
GeneDataBankIdentification class represents the
identification of a gene in the different public
databases, and the AlleleDataBankIdentification class
represents the identification of an allele in them.
The Allelic Variant class and the Allelic
Reference Type class are specializations of the
Allele class. The Allelic Reference Type class
represents the alleles that are used as references in
the consulted data sources, while the Allelic Variant
class represents allelic variations of a reference
allele. Both classes include a sequence attribute. For
the Allelic Reference Type class this attribute stores
the complete DNA sequence of the allele; for the
Allelic Variant class it can be obtained by derivation,
in a way that is explained later. An association
between Allelic Reference Type and Allelic Variant
represents the relation between a reference allele and
its variations.
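The specialization described above can be sketched in
Python as follows. This is only an illustration under
stated assumptions, not the CSHG implementation: the
Python types, the field name reference and the example
values are hypothetical; only the class names and the
attributes start_position, end_position, strand and
sequence come from the text.

```python
from dataclasses import dataclass


@dataclass
class Allele:
    start_position: int  # position with respect to the chromosome
    end_position: int
    strand: str          # "plus" or "minus"


@dataclass
class AllelicReferenceType(Allele):
    sequence: str        # complete DNA sequence, stored directly


@dataclass
class AllelicVariant(Allele):
    # Association to the reference allele (field name is an assumption).
    # The variant's own sequence is not stored here; it is derived.
    reference: AllelicReferenceType


# Made-up example values, for illustration only
reference = AllelicReferenceType(100, 107, "plus", "ACGTACGT")
variant = AllelicVariant(100, 107, "plus", reference)
```

The variant keeps only a link to its reference allele,
which mirrors the statement that its sequence is
obtained by derivation rather than stored.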
The main innovation in this CSHG version is the
Variation class. It represents the changes shown by
different DNA sequences compared with a reference
type. This class has the following attributes:
id_variation, a local identifier; description, a
description of the variation; id_variation_db, the
identifier given to the variation by the database
from which it was taken; type, which ranges
over the values ‘mutant’, ‘neutral polymorphism’ and
‘unknown consequence’; and location, which can be
‘chromosomal’ or ‘genic’.
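As an illustration, the attributes of the Variation
class could be written as a small Python data class.
This is a sketch, not the authors' schema: the enum
class names, the Python types and the example record
are assumptions; the attribute names and the permitted
values of type and location are taken from the text.

```python
from dataclasses import dataclass
from enum import Enum


class VariationType(Enum):
    MUTANT = "mutant"
    NEUTRAL_POLYMORPHISM = "neutral polymorphism"
    UNKNOWN_CONSEQUENCE = "unknown consequence"


class VariationLocation(Enum):
    CHROMOSOMAL = "chromosomal"
    GENIC = "genic"


@dataclass
class Variation:
    id_variation: int        # local identifier (integer type is an assumption)
    description: str         # description of the variation
    id_variation_db: str     # identifier used by the source database
    type: VariationType
    location: VariationLocation


# Hypothetical record; every value here is made up for illustration
example = Variation(
    id_variation=1,
    description="illustrative entry",
    id_variation_db="SRC-0001",
    type=VariationType("mutant"),
    location=VariationLocation("genic"),
)
```

Using enumerations restricts type and location to the
listed values: looking up any other string raises a
ValueError.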
The Variation class is associated with the Data Bank
class and with Bibliography Reference. It is also
associated with the Allelic Reference Type class
through the Referred association and with the
Allelic Variant class through the Changes association,
which expresses that each variation is necessarily
shown by an allelic variant with respect to an allelic
reference type. The Changes association allows
an allelic variant to include several variations with
respect to an allelic reference type. It is now clear
how the sequence of the Allelic Variant class can be
derived if all the variations with respect to the
Allelic Reference Type are known; conversely, the
variations of an Allelic Variant with respect to its
Allelic Reference Type can be derived if the two
allelic sequences are known. As a result of the new
class, the specialization hierarchy describing the
different types of valid variations belongs to the
Variation class, instead of to the Allelic Variation
class as it was in the previous version.
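The two derivations mentioned above can be illustrated
with a short Python sketch. It is not the procedure used
in the CSHG: the edit representation (start, end,
replacement), the 0-based positions, the function names
and the use of the standard difflib module to compare
the two sequences are all assumptions made for the
example.

```python
from difflib import SequenceMatcher


def derive_variations(reference, variant):
    """Return (start, end, replacement) edits that turn reference into variant.

    start and end are 0-based offsets into the reference sequence.
    """
    matcher = SequenceMatcher(None, reference, variant, autojunk=False)
    return [
        (i1, i2, variant[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def derive_sequence(reference, variations):
    """Rebuild the variant sequence from the reference and its edits.

    Edits are applied from right to left so that earlier offsets stay valid.
    """
    sequence = reference
    for start, end, replacement in sorted(
        variations, key=lambda edit: edit[0], reverse=True
    ):
        sequence = sequence[:start] + replacement + sequence[end:]
    return sequence


# Made-up sequences, for illustration only
reference = "ACGTACGTAC"
variant = "ACGTTTACGAC"
variations = derive_variations(reference, variant)
assert derive_sequence(reference, variations) == variant
```

The round trip in the last line shows why only one of
the two pieces of information needs to be stored: the
variations and the variant sequence determine each
other once the reference sequence is fixed.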
This specialization hierarchy classifies variations by
different criteria represented in four specializations:
• Location specialization represents
whether the variation affects only one
gene or a part of the chromosome.
• Description specialization models the
degree of knowledge of the variation.
• Effect specialization classifies the variation
by its effect on the phenotype.
• Load specialization identifies the variations
for which data inconsistencies have been
found at the source.
The Chromosomal specialized class includes the
variations that affect more than one gene. The
Chrom.Mutation class describes the chromosomal
variation.
In the Description specialization, a variation is
classified as Imprecise or Precise. When details
about the variation are not known, it is classified as
imprecise, and the Imprecise class has a description
attribute. When a variation is precise, its position is
stored in the position attribute and the variation is
classified into one of four specialized classes. If the
type is Insertion, a sequence (sequence attribute) has
been introduced n times (repetition attribute). If the
type is Deletion, certain nucleotides were deleted
from the specified position. If the type is Indel, n
nucleotides were deleted and then m nucleotides were
inserted p times. Inversion means that the nucleotide
chain was flipped at the position given in the position
attribute.
In the Effect specialization, variations with a mutant
effect on the phenotype are represented by the Mutant
specialized class. They have a type attribute ranging
over the values ‘splicing’, ‘missense’, ‘regulatory’ and
‘others’. The regulatory variations have three
attributes, among them sequence, which contains the
substitution sequences, and origin, which denotes the
point from which the relative position of the mutation
is given.
Finally, in the Load specialization, variations are
classified as Problematic Load when the data from the
sources contain inconsistencies. This class has three
attributes: original_data, where the incorrect data is
stored; cause, an explanation of the inconsistency
found; and correction_date, which is empty until the
inconsistency is repaired and then holds the date of
the repair.
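The precise variation types described above can be
sketched as Python classes that apply a change to a
reference sequence. This is an illustration only: the
0-based positions, the apply method, the length
attribute of Deletion and the deleted attribute of
Indel are assumptions, since the text does not name
them. Inversion is left out because the text does not
state the extent of the flipped segment.

```python
from dataclasses import dataclass


@dataclass
class Insertion:
    position: int
    sequence: str        # inserted sequence
    repetition: int = 1  # number of times the sequence is inserted

    def apply(self, reference):
        inserted = self.sequence * self.repetition
        return reference[:self.position] + inserted + reference[self.position:]


@dataclass
class Deletion:
    position: int
    length: int          # hypothetical attribute: nucleotides removed

    def apply(self, reference):
        return reference[:self.position] + reference[self.position + self.length:]


@dataclass
class Indel:
    position: int
    deleted: int         # n nucleotides deleted (hypothetical attribute name)
    sequence: str        # the m nucleotides inserted
    repetition: int = 1  # p repetitions of the inserted sequence

    def apply(self, reference):
        # A deletion followed by an insertion at the same position
        shortened = Deletion(self.position, self.deleted).apply(reference)
        return Insertion(self.position, self.sequence, self.repetition).apply(shortened)


# Made-up example: "GA" inserted twice at offset 2 of "ACGT"
result = Insertion(2, "GA", 2).apply("ACGT")
```

Modelling Indel as a deletion followed by an insertion
follows the wording of the text directly and keeps the
three classes consistent with one another.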
The Phenotype class represents the different
external features that can be associated with variations;
BIOINFORMATICS 2010 - International Conference on Bioinformatics