ations reports. Based on this conceptual model, a textual
format is defined using the XML Schema specification
(Biron and Malhotra, 2004). For validation purposes, this
textual format has been applied to the integration of two
alignment tools and one variation analysis tool.
The rest of the paper is organized as follows. Section 2
presents the related work. Section 3 reviews the alignment
tools. Section 4 presents a conceptual model that formalizes
the variation reports, together with the proposed XML format.
In Section 5, the contributions of this work are applied to a
real integration scenario. Finally, Section 6 presents the
concluding remarks.
2 RELATED WORK
In order to formalize the heterogeneity of the concepts
in the genomic domain, several works propose the use of
conceptual models. For instance, Paton et al. (2000)
describe a collection of conceptual models for yeast,
covering general genomic data as well as data related to
experiments, proteins or alleles. Another approach,
PaGE-OM (Brookes et al., 2009), proposes a conceptual
model that represents genomic data in relation to the
assays performed by biologists. The Atlas project (Shah
et al., 2005) is an integration effort that defines the
genomic data models to be integrated from different
databases. Finally, the Gene Ontology (Gene Ontology
Consortium, 2004) defines a set of vocabularies and
classifications related to biological functions, processes,
and cellular components.
A common issue in these approaches is that the proposed
conceptualizations are closely tied to the experimental
data, the technologies used or the representation formats.
As a consequence, these approaches cannot be easily adopted
by variation analysis tools. The purpose of this work is to
use conceptual models to achieve a domain representation
that considers only the precise biological concepts.
Focusing on the problem of alignment output representation,
other attempts to address the lack of standards can be found
as well. For example, the Sequence Alignment Map (SAM)
format (Li et al., 2009) is a compact format to express
variation results from alignments. Its main drawbacks are
the complexity of the syntax and the need to implement a
low-level mechanism to extract the data. Our proposal
overcomes these drawbacks by using a conceptual model that
is easier for biologists to understand. The complexity of
data representation is reduced thanks to the formalization
of the variation detection domain. As one implementation of
this conceptual model, a textual format based on XML is
presented; XML is a standard language supported by several
software development environments. The implementation of
the software integration components is simplified by the
conceptual model and the corresponding XML format, which
can be used inside a model-driven software development
process.
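To make the point about low-level extraction concrete, the following Python sketch shows the kind of decoding a tool has to perform to recover insertions and deletions from a single SAM alignment line. It is an illustration only, not code from the reviewed tools or from this work: the function names and the sample line are assumptions, and only the eleven mandatory SAM columns and the standard CIGAR operation letters are taken from the SAM format itself.

```python
import re

# The eleven mandatory, tab-separated SAM columns, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line):
    """Split one alignment line into a dict keyed by column name (hypothetical helper)."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, values[:11]))
    record["POS"] = int(record["POS"])  # 1-based leftmost reference position
    return record


def indels_from_cigar(pos, cigar):
    """Return (kind, reference_position, length) for each insertion or deletion.

    For an insertion, reference_position is the reference base that
    follows the inserted bases; for a deletion, the first deleted base.
    """
    events = []
    ref_pos = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I":
            events.append(("insertion", ref_pos, length))
        elif op == "D":
            events.append(("deletion", ref_pos, length))
        # Only these operations consume reference positions.
        if op in "MDN=X":
            ref_pos += length
    return events


# Made-up alignment line, used purely for illustration.
sample = "read1\t0\tref\t100\t60\t5M2I3M1D4M\t*\t0\t0\tACGTACGTACGTAC\t*"
record = parse_sam_line(sample)
print(indels_from_cigar(record["POS"], record["CIGAR"]))
```

Even in this reduced form, the consumer has to know the column order, the coordinate convention and which CIGAR operations advance along the reference, which is the kind of knowledge a conceptual model is meant to make explicit.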
3 ALIGNMENT TOOLS REVIEW
In order to detect the most relevant concepts that
alignment tools use in their reports, a set of the most
representative tools has been reviewed: Sequencher (Gene
Codes Corporation, 2010), SeqScape (Applied Biosystems,
2010), Mutation Surveyor (Softgenetics, 2010), CodonCode
Aligner (Codon Code Corporation, 2010), Polyphred
(Department Genomic Sciences, 2010), InSNP (Manaster et
al., 2005) and the NCBI BLAST web tool (NCBI, 2010).
To perform this review, a real test has been carried
out with these tools. Real samples of the BRCA1 gene
were provided by a research laboratory to give value
to the results. The strategy followed in this test was:
1. Installation of the tools on a computer running
Windows 7 (Sequencher, SeqScape, Mutation Surveyor,
CodonCode Aligner, old versions of Staden and InSNP).
The tools supported only on Linux were installed on
another computer running Ubuntu v8.04 (Polyphred).
2. Reading of the introductory tutorials and user
guides to understand the general principles of each
tool, its graphical user interface and the supported
functionality.
3. Checking of the functionality of each tool, using
the samples provided.
4. Searching for variations within the samples in order
to compare the results and limitations under the same
conditions.
While working with these tools, the required concepts
around variations have been gathered and three main
issues have been detected. First, it is not possible to
input a complete DNA sequence due to technological
limitations: sequencing machines are constrained to a
maximum sequence length, so the sequenced region must be
split up into small pieces called contigs. Second, the
limitations of the sequencing process produce erroneous
bases, so, in order to improve the analysis quality, this
process must be performed several times.
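The first two issues can be illustrated with a minimal Python sketch: cutting a region into overlapping pieces that respect a maximum length, and combining repeated readings of the same piece by a per-position majority vote. This is a simplified stand-in chosen for illustration, not the method used by the reviewed tools (which typically also take base quality into account); the function names, the overlap parameter and the sample sequences are assumptions.

```python
from collections import Counter


def split_into_fragments(sequence, max_length, overlap):
    """Cut a sequence into overlapping pieces no longer than max_length.

    Returns a list of (start_offset, fragment) pairs.
    """
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be non-negative and smaller than max_length")
    step = max_length - overlap
    fragments = []
    start = 0
    while True:
        fragments.append((start, sequence[start:start + max_length]))
        if start + max_length >= len(sequence):
            break  # the last fragment reaches the end of the sequence
        start += step
    return fragments


def consensus(reads):
    """Majority vote per position over equal-length reads of the same piece.

    Ties are resolved in favour of the base seen first at that position.
    """
    if len({len(read) for read in reads}) != 1:
        raise ValueError("reads must be non-empty and of equal length")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


# Made-up data: a short region and three readings of one piece,
# one of which carries an erroneous last base.
print(split_into_fragments("ACGTACGTAC", max_length=4, overlap=1))
print(consensus(["ACGT", "ACGA", "ACGT"]))
```

Repeating the sequencing run is what makes the vote possible: a single erroneous base in one reading is outvoted by the other readings of the same position.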
BIOINFORMATICS 2011 - International Conference on Bioinformatics Models, Methods and Algorithms