3 JUNCTION BOUNDARY DETECTION
The large number of putative fused genes is filtered
according to a set of criteria reflecting an accurate
model of gene fusion. The remainder of this section
details the most relevant criteria defining the model.
Insert Size Coherency. In paired-end RNA-Seq data,
the insert size is not fixed a priori; it varies
according to the specific protocol adopted for
sequencing. The distribution of the insert fragment
length of the aligned paired ends mostly concentrates
around a mean value with a given standard deviation.
However, as emphasized in (Sboner, 2010), the
preparation of the biological sample produces gene
fusion artifacts that present an abnormal insert size
between the sequenced ends. Therefore, in order to
remove fusion artifacts, the proposed methodology
estimates the insert distance of the reads encompassing
a gene fusion candidate and removes those reads whose
insert distance is an outlier in the fragment inner
size distribution.
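The outlier removal described above can be sketched as follows. This is only an illustration: the text does not state which outlier rule is used, so the sketch assumes a simple "mean plus or minus k standard deviations" rule, with the distribution estimated from a background set of insert sizes (for example, those of concordantly aligned pairs). The function name, the dictionary layout of a read pair and the default k are all hypothetical.

```python
from statistics import mean, stdev


def filter_by_insert_size(pairs, background_sizes, k=3.0):
    """Keep the read pairs whose insert size is not an outlier.

    pairs            -- list of dicts with an "insert_size" key (hypothetical layout)
    background_sizes -- insert sizes used to estimate the distribution
                        (at least two values are needed by stdev)
    k                -- assumed cutoff, in standard deviations from the mean
    """
    mu = mean(background_sizes)
    sigma = stdev(background_sizes)
    lower, upper = mu - k * sigma, mu + k * sigma
    # A pair survives only if its insert size lies inside [lower, upper].
    return [p for p in pairs if lower <= p["insert_size"] <= upper]
```

A robust estimator (median and median absolute deviation) could replace the mean and standard deviation if the background set itself may contain artifacts.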
Asymmetric Encompassing Read Distribution. As
recently investigated in (Edgren, 2011), fusions due to
PCR artifacts present an alignment of the encompassing
reads that is asymmetric across the involved genes.
Specifically, the mates encompassing a fused gene may
align over a long range on one of the two candidates
while concentrating in a short range of base pairs on
the other gene. In the presence of an asymmetric
encompassing read distribution, the insert size of the
encompassing reads varies over a wide range.
Therefore, the proposed methodology exploits the
computed insert distances to detect asymmetric
encompassing read distributions and thereby remove
gene fusion artifacts due to PCR amplification.
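The asymmetry itself can be illustrated with a small sketch. Note that this is not the insert-distance computation mentioned in the text: as a simpler stand-in, it compares directly how widely the mate start positions spread on each of the two genes. The function names and the ratio threshold are hypothetical.

```python
def position_range(positions):
    """Width, in base pairs, of the region covered by the mate start positions."""
    return max(positions) - min(positions)


def is_asymmetric(positions_a, positions_b, max_ratio=5.0):
    """Flag a candidate whose mates spread widely on one gene
    but pile up in a short range on the other.

    positions_a, positions_b -- mate start positions on the two genes
    max_ratio                -- assumed threshold on the ratio of the two ranges
    """
    range_a = position_range(positions_a)
    range_b = position_range(positions_b)
    wide, narrow = max(range_a, range_b), min(range_a, range_b)
    # Clamp the narrow range to 1 bp to avoid a division by zero
    # when all mates start at the same position.
    return wide / max(narrow, 1) > max_ratio
```

For instance, mates starting at 100, 180, 260 and 400 on one gene and at 5000, 5003 and 5006 on the other would be flagged, since the ranges are 300 bp and 6 bp.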
Homologous Sequence Artifacts Filter. Multiple
mate matches occur due to homologies in the reference
genome. Homologous sequences affect the fusion
detection analysis because mate pairs that would
normally match on the same gene match discordantly
on two distinct but similar genes. Homologous regions
may be due both to paralogous genes that share long
sequence regions and to short similar sequences. The
proposed flow implements a different policy for each
of the two cases. For the long homologous sequences
due to paralogous genes, a filter that queries the
TreeFam (Li, 2006) database has been implemented.
For short homologous sequences, the filter extracts
and reversely maps the read mates on the same genes.
If the reads reversely map onto the gene candidates,
the reads encompass the candidates because of a
homologous subsequence.
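One possible reading of the short-homology check is sketched below: each mate is mapped onto the partner gene, and the pair is flagged if either mate also matches there. This interpretation is an assumption, as is the naive sliding-window comparison, which merely stands in for a real aligner and ignores reverse-complement matches. All names and the mismatch tolerance are hypothetical.

```python
def maps_to(read, gene_seq, max_mismatches=2):
    """Return True if the read matches some window of gene_seq
    with at most max_mismatches substitutions (no indels)."""
    n = len(read)
    for start in range(len(gene_seq) - n + 1):
        window = gene_seq[start:start + n]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            return True
    return False


def is_homology_artifact(mate_a, mate_b, gene_a_seq, gene_b_seq,
                         max_mismatches=2):
    """mate_a was aligned on gene A and mate_b on gene B.

    The pair is flagged when either mate also maps onto the partner gene,
    i.e. the discordant alignment can be explained by a shared subsequence.
    """
    return (maps_to(mate_a, gene_b_seq, max_mismatches)
            or maps_to(mate_b, gene_a_seq, max_mismatches))
```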
Encompassing-Spanning Read Coherency. According
to the definition of encompassing and spanning
reads, a true gene fusion sequence results from
the consensus between encompassing and spanning
reads. If the sets of encompassing and spanning reads
are located in largely different gene regions, the
candidate must be discarded, since an incoherent gene
sequence would be produced. Therefore, this criterion
preserves only those gene fusions with overlapping
spanning and encompassing regions.
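The overlap requirement can be sketched as an interval test. The sketch assumes that, on each gene, the encompassing reads and the spanning reads are each summarized by a single (start, end) interval, and that a candidate is coherent only if the two intervals overlap on both genes. The text does not specify this representation, so the names and the per-gene rule are hypothetical.

```python
def overlaps(a, b):
    """True if the closed intervals a = (start, end) and b = (start, end) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def is_coherent(enc_5p, span_5p, enc_3p, span_3p):
    """Keep a fusion candidate only if the encompassing and spanning
    regions overlap on both the 5' gene and the 3' gene.

    Each argument is a (start, end) interval in gene coordinates.
    """
    return overlaps(enc_5p, span_5p) and overlaps(enc_3p, span_3p)
```

For example, encompassing regions (100, 400) and (2000, 2300) with spanning regions (350, 420) and (1990, 2050) would pass, whereas a spanning region of (900, 950) on the 5' gene would cause the candidate to be discarded.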
4 RESULTS
In order to evaluate the efficiency of the proposed
flow in detecting chimeric transcripts, we analyzed
the publicly available RNA-Seq data sets from the
NCBI database (submission number SRA009053). It
is worth noting that the gene fusions occurring in
the aforementioned data set have been validated
through RT-PCR, as reported in (Berger, 2010). Table 1
demonstrates the capability of the proposed
methodology to reveal the RT-PCR validated fusions.
These samples have a coverage of at most 16
million reads, a read length of 50 bp and a fragment
length ranging from 350 to 500. All the 14 fusions
validated in the 7 samples of melanoma cells (Berger,
2010) have been successfully detected. Table 2 shows
some details of the detected gene fusions: for
each sample, the names of the 5’ and 3’ genes are
reported. Moreover, the table highlights for each
fusion the number of encompassing and spanning reads.
This information is extremely important in the
analysis of chimeric transcripts, since the number of
spanning and encompassing reads across the fused
junction is directly correlated with the sequencing
coverage of the experiment. Therefore, the proposed
analysis flow is able to detect gene fusions even in
case of low coverage, where the number of spanning and
encompassing reads is reduced.
Moreover, building the chimeric transcript analysis
flow on top of the TopHat and Cufflinks tools
represents the major novelty of the proposed
methodology. In fact, the adoption of TopHat
and Cufflinks makes it possible to detect novel
transcript isoforms that can be recombined with known
transcripts in a new chimeric gene. Therefore, in
order to demonstrate the effectiveness of the proposed
flow in detecting fused genes involving an unknown
transcript isoform, we report the results of the
analysis conducted on the sample SRR018259 (see
Table 3). Specifically, the second and third columns
report the name and the ge-