cent studies mentioned above (Abate, 2012) (Carrara, 2013). These
results, which are modest given the number of samples examined and
the mean number of gene fusions usually identified in RNA-Seq
samples, are essentially due to the poor coverage of the samples.
It is worth noting, however, that in general the higher the coverage
of the samples, the higher the number of detected gene fusions, at a
considerably higher computational cost for the detection. It is also
worth noting that not all 299 fusions can be tested in the laboratory
by PCR, because of well-known cost and time constraints. The proposed
pipeline was therefore applied to prioritize the 299 identified
chimeric transcripts. In the following, we refer to the chimeric
transcripts using pairs of capital letters (the actual gene names
cannot be disclosed because the biological results of this research
are currently under review).
For the first filtering stage, a threshold of one split read was
imposed: thirteen gene fusions were selected because they are
supported by at least one split read and present at least one of the
other features described in Subsection First filtering stage: Gene
fusions annotation and selection. The two thresholds were chosen to
be as conservative as possible in evaluating the candidates, i.e. to
preserve as many of them as possible. These parameter values can,
however, be tuned to satisfy specific needs. Furthermore, even though
none of the initial 299 fusion genes has previously been detected in
cancer samples, all the genes involved in the thirteen fusions are
characterized by mutational states related to cancer development and
progression (information derived from literature sources and the
COSMIC database (Simon, 2010)). Moreover, three of the thirteen
chimeras that passed the first filtering stage have a partner gene
that has been found fused with other genes in cancer.
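The selection rule of the first filtering stage can be sketched as
follows. This is a minimal illustration in Python, not the pipeline's
actual code: the `Candidate` record, the feature label and the example
read counts are hypothetical, and only the two default thresholds (at
least one split read, at least one additional annotation feature) come
from the text.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one candidate fusion (not the pipeline's data model)."""
    name: str
    split_reads: int
    features: FrozenSet[str] = frozenset()


def first_stage(candidates: List[Candidate],
                min_split_reads: int = 1,
                min_features: int = 1) -> List[Candidate]:
    """Keep candidates that meet both thresholds; both values are tunable."""
    return [
        c for c in candidates
        if c.split_reads >= min_split_reads and len(c.features) >= min_features
    ]


# Illustrative input only: names, counts and the feature label are invented.
example = [
    Candidate("X1-Y1", split_reads=2, features=frozenset({"cancer_related_gene"})),
    Candidate("X2-Y2", split_reads=0, features=frozenset({"cancer_related_gene"})),
    Candidate("X3-Y3", split_reads=3),
]
kept = first_stage(example)
print([c.name for c in kept])
```

Raising `min_split_reads` or `min_features` makes the stage stricter,
which corresponds to the tuning of the parameter values mentioned
above.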
For each of these chimeric transcripts, the fusion sequence was then
retrieved and analyzed in order to understand the biological
mechanism underlying the recombination. The criteria of the second
filtering stage reduced the previous list to only eight gene fusions,
characterized by the presence of a Kozak sequence (or an ATG triplet)
at the 5’ end or by an in-frame configuration. Frame-shifted
configurations could also be of interest when the 3’ partner gene is
a tumor suppressor; this scenario was not considered in the proposed
pipeline because no tumor suppressor genes were detected among the
partner genes of the identified chimeric transcripts.
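A minimal sketch of the second-stage rule, assuming that the sequence
of the 5’ partner and the coding offsets of the two breakpoints are
available. The function names, the simplified Kozak pattern (purine at
position -3, G at position +4) and the frame test based on coding
offsets are assumptions made for illustration; the text only states
that a fusion is kept if it shows a Kozak sequence (or an ATG triplet)
at the 5’ end or an in-frame configuration.

```python
import re
from typing import Optional

# Simplified Kozak context: purine at -3, two arbitrary bases, ATG, G at +4.
# This pattern is an assumption; the source does not define the motif it uses.
KOZAK = re.compile(r"[AG][ACGT]{2}ATGG")


def start_signal(five_prime_seq: str) -> Optional[str]:
    """Return 'kozak', 'atg' or None for the 5' portion of a fusion sequence."""
    seq = five_prime_seq.upper()
    if KOZAK.search(seq):
        return "kozak"
    if "ATG" in seq:
        return "atg"
    return None


def is_in_frame(cds_bases_from_5p: int, cds_offset_in_3p: int) -> bool:
    """One common way to express the in-frame condition: the number of coding
    bases kept from the 5' gene and the 0-based coding offset at which the
    3' gene resumes must have the same remainder modulo 3."""
    return cds_bases_from_5p % 3 == cds_offset_in_3p % 3


def second_stage(five_prime_seq: str,
                 cds_bases_from_5p: int,
                 cds_offset_in_3p: int) -> bool:
    """Keep a fusion with a start signal at the 5' end or an in-frame junction."""
    return (start_signal(five_prime_seq) is not None
            or is_in_frame(cds_bases_from_5p, cds_offset_in_3p))
```

For example, `second_stage("CCCTTTCCC", 301, 121)` keeps the candidate
because the junction preserves the reading frame, even though no start
signal is found in the 5’ portion.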
Table 1 reports the results of the third and fourth filtering stages.
In particular, a threshold of at least one paired-end read was fixed
for a fusion to be passed to the next phases of the pipeline, again
in order to be as conservative as possible: only one gene fusion
(i.e. gene fusion G-H) was discarded because it is not supported by
paired-end reads, as shown in Column 3 of Table 1. The differences
among the breakpoint sequence lengths (Column 2 of Table 1) can be
attributed to the fact that they depend on the number of reads used
to define the fusion sequence. The higher the number of reads mapped
by the chimeric transcript discovery tool on the putative breakpoint
sequence, the longer the reported sequence and the higher the
probability that the proposed pipeline finds paired-end reads
aligning to it. The absence of mates removed by the mouse removal
filter, shown in Column 5 of Table 1, confirms that the samples were
composed exclusively of human tumor cells. On the other hand, in most
cases PCR produced a remarkable number of artifacts, as can be seen
from Column 4 of Table 1.
After the removal of PCR artifacts and mouse mates, the supporting
paired-end reads, if present, are reconstructed for each of the
remaining seven gene fusions. This step is followed by the
identification of the so-called paired-end spanning reads, if any.
The results are shown in Columns 2 and 3 of Table 2.
Two of the previous seven gene fusions were removed because they are
not supported by spanning reads (i.e. gene fusions I-L and M-N). A
threshold of one spanning read was indeed imposed for a gene fusion
to be retained. As already discussed, this value reflects the
intention to be very conservative, since the previous filtering
stages, which concern functional and biological properties of the
fusion genes, have already proved capable of removing a conspicuous
number of non-functional chimeric transcripts. It is worth noting,
however, that this parameter can be set according to specific
requirements.
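The read-support thresholds described so far (paired-end reads, then
spanning reads) act as successive filters with tunable minimums. The
sketch below is an illustration under assumptions: the `ReadSupport`
record, the candidate names and all counts are invented and do not
reproduce Table 1 or Table 2; only the default thresholds of one read
come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReadSupport:
    """Hypothetical per-candidate read counts (illustrative data model)."""
    name: str
    paired_end: int
    spanning: int


def rejecting_filter(c: ReadSupport,
                     min_paired_end: int = 1,
                     min_spanning: int = 1) -> Optional[str]:
    """Return the name of the first filter that rejects the candidate,
    or None if it passes both. Filters are applied in pipeline order."""
    if c.paired_end < min_paired_end:
        return "paired-end"
    if c.spanning < min_spanning:
        return "spanning"
    return None


# Invented counts, used only to exercise the two filters.
example = [
    ReadSupport("X1-Y1", paired_end=12, spanning=4),
    ReadSupport("X2-Y2", paired_end=0, spanning=0),
    ReadSupport("X3-Y3", paired_end=5, spanning=0),
]
outcome = {c.name: rejecting_filter(c) for c in example}
print(outcome)
```

Reporting which filter rejects a candidate, rather than a plain
pass/fail flag, makes it easier to see the effect of changing a
threshold.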
The fourth column of Table 2 reports the number of split mates
supporting the remaining five chimeric transcripts. In the last
filtering stage, a threshold of at least one split mate was imposed
for a gene fusion to be considered for laboratory validation. As for
the spanning read threshold, this value was chosen to be as
conservative as possible. Of the initial 299 gene fusions, five were
prioritized at the end of the pipeline (i.e. gene fusions A-B, C-D,
E-F, O-P and Q-R). Three of these five have been validated in the
laboratory by PCR and confirmed as true gene fusions. Table 3
summarizes the number of fusion genes remaining after each filtering
stage.
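The counts reported in the text can be collected into a small worked
computation. The sketch below only restates figures given above (299,
13, 8, 7, 5 and 5 candidates, with 3 of the 5 prioritized fusions
validated); the stage labels and the helper name `funnel` are
illustrative.

```python
from typing import List, Tuple

# (stage label, candidates remaining after the stage), as stated in the text.
STAGES: List[Tuple[str, int]] = [
    ("identified by the discovery tool", 299),
    ("split reads and annotation features", 13),
    ("Kozak/ATG or in-frame", 8),
    ("paired-end read support", 7),
    ("spanning read support", 5),
    ("split mate support", 5),
]
VALIDATED, PRIORITIZED = 3, 5


def funnel(stages: List[Tuple[str, int]]) -> List[Tuple[str, int, int, float]]:
    """For each stage: label, remaining, removed by this stage,
    and fraction of the initial candidates still remaining."""
    initial = stages[0][1]
    rows = []
    previous = initial
    for label, remaining in stages:
        rows.append((label, remaining, previous - remaining, remaining / initial))
        previous = remaining
    return rows


for label, remaining, removed, fraction in funnel(STAGES):
    print(f"{label:38s} {remaining:4d}  removed {removed:3d}  ({fraction:.1%} of initial)")
print(f"validated by PCR: {VALIDATED} of {PRIORITIZED} prioritized fusions")
```

The first filtering stage removes by far the largest share of
candidates (286 of 299), while the last stage removes none.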
BIOINFORMATICS 2014 - International Conference on Bioinformatics Models, Methods and Algorithms