but this originates from the difference in the analyses
performed. FastX only saves information concerning
the nucleotide composition and the quality
distribution per cycle, whereas quasi-qa additionally
analyzes the read length distribution and the
quality distribution per base.
Table 1: Performance benchmark of the quality
assessment. All calculations were performed on an Apple
MacBook Pro (late 2011) equipped with 8 GB of RAM,
a 2.3 GHz Intel Core i5 and an SSD. The
measurements were obtained with the GNU time command,
which is available on most Unix-like systems. It should be
noted that the tools do not offer the same range of functions.
Filesize  Tool          Wall time (hh:mm:ss)  Total RAM
1.1 GB    fastqc 0.10   00:01:05              568 MB
1.1 GB    fastx 0.0.13  00:00:09              21 MB
1.1 GB    quasi-qa      00:00:10              2 MB
2.2 Alignment
The proper alignment tool must be chosen according
to the nature of the experiment. If total mRNA was
used for sequencing (i.e. RNA-Seq), TopHat,
SOAPsplice or other splice-junction-aware aligners
need to be chosen over splice-junction-unaware
aligners, as the latter are only able to align
intra-exonic reads back to the reference. Using
splice-junction-unaware aligners would cause all
junction-spanning reads to be incorrectly discarded as
unmappable, and many counts would therefore be lost.
If specific tags are sequenced, as is the case in
shRNA-Seq, splice-junction-unaware aligners such
as Bowtie (Langmead et al., 2009), BWA (Li and
Durbin, 2009) or SOAP3 (Liu et al., 2012) are more
than sufficient.
The alignment script adheres to common
standards, thus only accepting FASTQ-formatted
files as input and writing alignments in the
well-known SAM format.
The reads of a sample-specific FASTQ file are
aligned to a predefined reference data set containing
the relevant sequences of all shRNAs used in the
RNAi screening experiment.
Multiple CPU cores are automatically
detected and are assumed to be available. Using
multiple cores during the alignment reduces the
total runtime almost linearly with the number of cores.
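How the detected core count could be passed on to an aligner can be sketched as follows. This is a minimal illustration and not the pipeline's actual alignment script: the function name, the index prefix and the file names are hypothetical, and Bowtie 1 is assumed as the aligner (`-q` for FASTQ input, `-S` for SAM output, `-p` for the number of threads).

```python
import os
import shlex


def build_bowtie_command(index_prefix, fastq_path, sam_path, threads=None):
    """Assemble a Bowtie 1 command line (hypothetical helper, not part of the pipeline)."""
    if threads is None:
        # os.cpu_count() returns None if the number of cores cannot be determined.
        threads = os.cpu_count() or 1
    return [
        "bowtie",
        "-q",                # reads are supplied as FASTQ
        "-S",                # write alignments in SAM format
        "-p", str(threads),  # number of parallel alignment threads
        index_prefix,        # assumed: index built from the shRNA reference sequences
        fastq_path,
        sam_path,
    ]


# The command is only assembled and printed here, not executed.
cmd = build_bowtie_command("shrna_index", "sample1.fastq", "sample1.sam")
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run` once Bowtie and a matching index are available.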
A pre-defined set of parameters has been chosen
for the alignment tools. However, the set of
parameters can be adjusted by the user if necessary.
2.3 Quantification
The tool quasi-count must be given one or
more SAM files, which are analyzed
sequentially. It counts the number of reads
that were assigned to each reference sequence during
the alignment step. The resulting counts are
saved in a matrix-style text file, which is later
used for the inference of statistically significant
changes in shRNA frequencies.
The only other requirement when using quasi-count
is that the header section of the SAM file is
intact, as the tool uses the information given therein
to identify the sequenced shRNAs.
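The counting step described above can be sketched as follows. This is an illustrative reimplementation, not the source of quasi-count: the function names, the choice to skip unmapped, secondary and supplementary records, and the tab-separated output layout are all assumptions. Reference names are taken from the `SN:` field of the `@SQ` header lines, which is why an intact header is needed.

```python
import io


def count_sam(handle):
    """Count aligned reads per reference sequence in one SAM stream."""
    counts = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # @SQ header lines declare the reference sequences (SN:<name>).
            if line.startswith("@SQ\t"):
                for field in line.split("\t")[1:]:
                    if field.startswith("SN:"):
                        counts[field[3:]] = 0
            continue
        fields = line.split("\t")
        flag, rname = int(fields[1]), fields[2]
        # Assumption: ignore unmapped (0x4), secondary (0x100) and
        # supplementary (0x800) records.
        if flag & (0x4 | 0x100 | 0x800):
            continue
        if rname in counts:
            counts[rname] += 1
    return counts


def count_matrix(samples):
    """Turn {sample name: SAM handle} into tab-separated matrix rows."""
    per_sample = {name: count_sam(handle) for name, handle in samples.items()}
    names = list(per_sample)
    refs = []
    for counts in per_sample.values():
        for ref in counts:
            if ref not in refs:
                refs.append(ref)
    rows = ["\t".join(["shRNA"] + names)]
    for ref in refs:
        values = [str(per_sample[name].get(ref, 0)) for name in names]
        rows.append("\t".join([ref] + values))
    return rows


# Toy data with hypothetical shRNA names; SEQ and QUAL are left as "*".
HEADER = "@HD\tVN:1.0\tSO:unsorted\n@SQ\tSN:shRNA_A\tLN:21\n@SQ\tSN:shRNA_B\tLN:21\n"


def record(name, flag, ref):
    """Build a minimal 11-column SAM record for the toy example."""
    mapped = ref != "*"
    cols = [name, str(flag), ref, "1" if mapped else "0",
            "255" if mapped else "0", "21M" if mapped else "*",
            "*", "0", "0", "*", "*"]
    return "\t".join(cols) + "\n"


sam1 = HEADER + record("r1", 0, "shRNA_A") + record("r2", 16, "shRNA_A") + record("r3", 4, "*")
sam2 = HEADER + record("r1", 0, "shRNA_B")
matrix_rows = count_matrix({"s1": io.StringIO(sam1), "s2": io.StringIO(sam2)})
print("\n".join(matrix_rows))
```

Each row of the resulting matrix holds one shRNA and each column one sample, which is the layout that count-based differential abundance tools typically expect.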
2.4 Statistical Inference
This part of the pipeline is implemented in the
programming language R. The R script contains
functions to read in the quality assessment data and
print them to a single PDF file, to read in the count
matrix text file and start the differential abundance
analysis, and to visualize the Pearson correlation
between samples.
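The between-sample Pearson correlation can be illustrated with a short sketch. The pipeline itself performs this step in R; the Python code below only shows the computation, and the function names, sample names and counts are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long count vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Pairwise correlations for {sample name: list of counts}."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical counts for four shRNAs in three samples.
example_counts = {
    "control_1": [120, 30, 0, 55],
    "control_2": [110, 35, 2, 60],
    "treated_1": [15, 90, 40, 5],
}
for name, row in correlation_matrix(example_counts).items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

Replicates of the same condition would be expected to correlate more strongly with each other than with samples from a different condition.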
Differential abundance analysis is done with the
freely available R packages DESeq (Anders and
Huber, 2010), edgeR (Robinson et al., 2010) or
baySeq (Hardcastle and Kelly, 2010). All three
packages assume that the counts follow a
negative binomial rather than a Poisson
distribution. A Poisson distribution is not
applicable in this case because it forces the variance
to equal the mean and therefore cannot account for
the additional variance (overdispersion) introduced
by biological replicate samples, as has been
shown by Lu et al. (2005). The resulting
underestimation of the variance leads to an increased
number of type I errors, that is, false positive
discoveries of differential abundance.
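The effect of overdispersion can be made concrete with a small numeric sketch. It assumes a common parameterization of the negative binomial distribution in which the variance is mean + dispersion * mean^2; the dispersion value of 0.25 and the function names are purely illustrative and are not taken from any of the three packages.

```python
def poisson_variance(mean):
    """Under a Poisson model the variance equals the mean."""
    return float(mean)


def negative_binomial_variance(mean, dispersion):
    """Variance under the assumed parameterization: mean + dispersion * mean^2."""
    if mean < 0 or dispersion < 0:
        raise ValueError("mean and dispersion must be non-negative")
    return mean + dispersion * mean ** 2


# Compare both models at increasing mean counts (dispersion value is hypothetical).
for mean in (10, 100, 1000):
    nb = negative_binomial_variance(mean, dispersion=0.25)
    ratio = nb / poisson_variance(mean)
    print(f"mean={mean}: Poisson var={poisson_variance(mean)}, NB var={nb}, ratio={ratio}")
```

With a dispersion of zero both models coincide. For any positive dispersion the gap widens as the mean grows, so a Poisson-based test becomes increasingly overconfident for highly abundant shRNAs.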
Figure 3: Example Venn diagram of significantly
differentially abundant shRNAs inferred by baySeq,
edgeR and DESeq.
BIOINFORMATICS 2013 - International Conference on Bioinformatics Models, Methods and Algorithms