Prediction of Essential Genes based on Machine Learning and Information Theoretic Features
Dawit Nigatu, Werner Henkel
2017
Abstract
Computational tools make it relatively simple to predict essential genes (EGs), which would otherwise have to be identified through costly and tedious gene knockout experiments. We present a machine-learning-based predictor that uses information-theoretic features derived exclusively from DNA sequences. The features are entropy, mutual information, conditional mutual information, and Markov chain models. We employed a support vector machine (SVM) classifier and predicted the EGs in 15 prokaryotic genomes. Fivefold cross-validation on the bacteria E. coli, B. subtilis, and M. pulmonis yielded AUC scores of 0.85, 0.81, and 0.89, respectively. In cross-organism prediction, the EGs of a given bacterium are predicted using a model trained on the remaining bacteria; this gave AUC scores ranging from 0.66 to 0.9, with an average of 0.8. The average AUC of the classifier in one-to-one prediction among E. coli, B. subtilis, and Acinetobacter is 0.85. The performance of our predictor is comparable with that of recent state-of-the-art predictors. Given that we used only sequence information on a problem that is considerably more complicated, these results are very good.
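As a rough illustration of what sequence-only information-theoretic features can look like, the following minimal sketch computes the Shannon entropy of a gene's nucleotide distribution and the mutual information between nucleotides a fixed distance apart. The function names, the gap parameter k, and the choice of max_gap are assumptions made for this example; they are not taken from the paper, whose exact feature definitions are not given in the abstract.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per symbol) of the nucleotide distribution in seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(seq: str, k: int = 1) -> float:
    """Mutual information (bits) between nucleotides k positions apart.

    Probabilities are plain relative frequencies over all (x_i, x_{i+k}) pairs.
    """
    pairs = list(zip(seq, seq[k:]))
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written with counts to avoid extra divisions
        mi += p_ab * math.log2(p_ab * n * n / (left[a] * right[b]))
    return mi


def feature_vector(seq: str, max_gap: int = 3) -> list:
    """Hypothetical per-gene feature vector: entropy plus MI at gaps 1..max_gap."""
    return [shannon_entropy(seq)] + [
        mutual_information(seq, k) for k in range(1, max_gap + 1)
    ]
```

Vectors of this kind, one per gene, could then be passed to an SVM classifier and scored by AUC under cross-validation, in the spirit of the evaluation summarized above.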
Paper Citation
in Harvard Style
Nigatu D. and Henkel W. (2017). Prediction of Essential Genes based on Machine Learning and Information Theoretic Features. In Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2017) ISBN 978-989-758-214-1, pages 81-92. DOI: 10.5220/0006165700810092
in Bibtex Style
@conference{bioinformatics17,
author={Dawit Nigatu and Werner Henkel},
title={Prediction of Essential Genes based on Machine Learning and Information Theoretic Features},
booktitle={Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2017)},
year={2017},
pages={81-92},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006165700810092},
isbn={978-989-758-214-1},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2017)
TI - Prediction of Essential Genes based on Machine Learning and Information Theoretic Features
SN - 978-989-758-214-1
AU - Nigatu D.
AU - Henkel W.
PY - 2017
SP - 81
EP - 92
DO - 10.5220/0006165700810092