On using Longer RNA-seq Reads to Improve Transcript Prediction Accuracy

Anna Kuosmanen; Ahmed Sobih; Romeo Rizzi; Veli Mäkinen; Alexandru I. Tomescu

doi:10.5220/0005819702720277

2016

Abstract

Over the past decade, sequencing read lengths have increased from tens to hundreds and then to thousands of bases. Current cDNA synthesis methods prevent RNA-seq reads from being long enough to capture every RNA transcript in its entirety, but long reads can still provide connectivity information on chains of multiple exons that are included in transcripts. We demonstrate that exploiting full connectivity information leads to significantly higher prediction accuracy, as measured by the F-score. For this purpose we implemented the solution to the Minimum Path Cover with Subpath Constraints problem introduced in (Rizzi et al., 2014), which extends the classical Minimum Path Cover problem and was shown to be solvable by min-cost flows. We show that, under hypothetical conditions of perfect sequencing, our approach is able to use long reads more effectively than two state-of-the-art tools, StringTie and FlipFlop. Even in this setting the problem is not trivial, and errors in the underlying flow graph introduced by sequencing and alignment errors complicate the problem further. As such, our work also demonstrates the need to develop a good spliced read aligner for long reads. Our proof-of-concept implementation is available at http://www.cs.helsinki.fi/en/gsa/traphlor.
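The abstract reports accuracy as an F-score, the harmonic mean of precision and recall. The following is a minimal sketch of that computation. It assumes that each transcript is represented as a tuple of exon intervals and that a prediction counts as correct only when its exon chain is identical to an annotated one. The matching criterion, the function name `f_score`, and the toy coordinates are illustrative assumptions, not the evaluation procedure used in the paper.

```python
def f_score(predicted, annotated):
    """Return (precision, recall, F-score) for two collections of transcripts.

    Assumption: a transcript is a hashable tuple of (start, end) exon
    intervals, and a prediction is a true positive only on an exact match.
    """
    predicted_set = set(predicted)
    annotated_set = set(annotated)
    true_positives = len(predicted_set & annotated_set)

    # Guard against empty inputs so the ratios are always defined.
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(annotated_set) if annotated_set else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy data: two annotated and two predicted transcripts,
# of which one predicted exon chain matches the annotation exactly.
annotated = [((100, 200), (300, 400), (500, 600)), ((100, 200), (500, 600))]
predicted = [((100, 200), (300, 400), (500, 600)), ((100, 200), (300, 400))]

precision, recall, f = f_score(predicted, annotated)
```

With this toy input, one of the two predictions is correct and one of the two annotated transcripts is recovered, so precision, recall, and the F-score all equal 0.5.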

References

Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biol, 11(10):R106.
Bao, E., Jiang, T., and Girke, T. (2013). BRANCH: boosting RNA-Seq assemblies with partial or related genomic sequences. Bioinformatics, 29(10):1250-1259.
Bernard, E., Jacob, L., Mairal, J., and Vert, J.-P. (2014). Efficient RNA isoform identification and quantification from RNA-Seq data with network flows. Bioinformatics, 30(17):2447-2455.
Florea, L., Di Francesco, V., Miller, J., Turner, R., Yao, A., Harris, M., Walenz, B., Mobarry, C., Merkulov, G. V., Charlab, R., Dew, I., Deng, Z., Istrail, S., Li, P., and Sutton, G. (2005). Gene and alternative splicing annotation with AIR. Genome Res, 15(1):54-66.
Gelfand, M. S., Mironov, A. A., and Pevzner, P. A. (1996). Gene recognition via spliced sequence alignment. Proc Natl Acad Sci U S A, 93(17):9061-9066.
Glaus, P., Honkela, A., and Rattray, M. (2012). Identifying differentially expressed transcripts from RNAseq data with biological variation. Bioinformatics, 28(13):1721-1728.
Griebel, T., Zacher, B., Ribeca, P., Raineri, E., Lacroix, V., Guigó, R., and Sammeth, M. (2012). Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucleic Acids Res, 40(20):10073-10083.
Guttman, M., Garber, M., Levin, J. Z., Donaghey, J., Robinson, J., Adiconis, X., Fan, L., Koziol, M. J., Gnirke, A., Nusbaum, C., Rinn, J. L., Lander, E. S., and Regev, A. (2010). Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotechnol, 28(5):503-510.
Heber, S., Alekseyev, M., Sze, S.-H., Tang, H., and Pevzner, P. A. (2002). Splicing graphs and EST assembly problem. Bioinformatics, 18 Suppl 1:S181-S188.
Holland, M. J. (2002). Transcript abundance in yeast varies over six orders of magnitude. J Biol Chem, 277(17):14363-14366.
Kopylova, E. (2013). New algorithmic and bioinformatic approaches for the analysis of data from high throughput sequencing. PhD thesis, Université des Sciences et Technologie de Lille-Lille I.
Lemon (2014). Library for Efficient Modeling and Optimization in Networks. http://lemon.cs.elte.hu/.
Li, B. and Dewey, C. N. (2011). RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12:323.
Li, W. (2012). http://alumni.cs.ucr.edu/~liw/rnaseqreadsimulator.html.
Li, W., Feng, J., and Jiang, T. (2011). IsoLasso: a LASSO regression approach to RNA-Seq based transcriptome assembly. J Comput Biol, 18(11):1693-1707.
Mäkinen, V., Belazzougui, D., Cunial, F., and Tomescu, A. I. (2015). Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press. URL www.genome-scale.info.
Ntafos, S. C. and Hakimi, S. L. (1979). On path cover problems in digraphs and applications to program testing. IEEE Transactions on Software Engineering, SE-5(5):520-529.
Pertea, M., Pertea, G. M., Antonescu, C. M., Chang, T.-C., Mendell, J. T., and Salzberg, S. L. (2015). StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol, 33(3):290-295.
Quinlan, A. R. and Hall, I. M. (2010). Bedtools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6):841-842.
Rizzi, R., Tomescu, A. I., and Mäkinen, V. (2014). On the complexity of Minimum Path Cover with Subpath Constraints for multi-assembly. BMC Bioinformatics, 15(S-9):S5.
Robinson, M. D., McCarthy, D. J., and Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139-140.
Sharon, D., Tilgner, H., Grubert, F., and Snyder, M. (2013). A single-molecule long-read survey of the human transcriptome. Nat Biotechnol, 31(11):1009-1014.
Song, L. and Florea, L. (2013). CLASS: constrained transcript assembly of RNA-seq reads. BMC Bioinformatics, 14 Suppl 5:S14.
Tomescu, A. I., Kuosmanen, A., Rizzi, R., and Mäkinen, V. (2013). A novel min-cost flow method for estimating transcript expression with RNA-Seq. BMC Bioinformatics, 14 Suppl 5:S15.
Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M. J., Salzberg, S. L., Wold, B. J., and Pachter, L. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol, 28(5):511-515.
Vyverman, M. (2014). ALFALFA: Fast and Accurate Mapping of Long Next Generation Sequencing Reads. PhD thesis, Ghent University.
Wu, T. D. and Watanabe, C. K. (2005). GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics, 21(9):1859-1875.

Paper Citation

in Harvard Style

Kuosmanen A., Sobih A., Rizzi R., Mäkinen V. and Tomescu A. (2016). On using Longer RNA-seq Reads to Improve Transcript Prediction Accuracy. In Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2016) ISBN 978-989-758-170-0, pages 272-277. DOI: 10.5220/0005819702720277

in Bibtex Style

@conference{bioinformatics16,
author={Anna Kuosmanen and Ahmed Sobih and Romeo Rizzi and Veli Mäkinen and Alexandru I. Tomescu},
title={On using Longer RNA-seq Reads to Improve Transcript Prediction Accuracy},
booktitle={Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2016)},
year={2016},
pages={272-277},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005819702720277},
isbn={978-989-758-170-0},
}

in EndNote Style

TY - CONF
JO - Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2016)
TI - On using Longer RNA-seq Reads to Improve Transcript Prediction Accuracy
SN - 978-989-758-170-0
AU - Kuosmanen A.
AU - Sobih A.
AU - Rizzi R.
AU - Mäkinen V.
AU - Tomescu A.
PY - 2016
SP - 272
EP - 277
DO - 10.5220/0005819702720277