Recognition of genome special regions by machine learning methods
- Authors: Djukova A.P., Djukova E.V.
- Affiliations: Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences
- Issue: No 4 (2024)
- Pages: 45-54
- Section: Computational Intelligence
- URL: https://journal-vniispk.ru/2071-8594/article/view/278195
- DOI: https://doi.org/10.14357/20718594240404
- EDN: https://elibrary.ru/WMCQXO
- ID: 278195
Abstract
The article studies the recognition of special structural segments of genomes called promoters. For the first time, machine learning methods based on logical analysis and classification of data are applied to the promoter recognition problem. These methods search for informative fragments in the feature descriptions of precedents (training examples) and are designed for integer-valued data in which each feature takes few distinct values. The fragments found are easy to interpret and make it possible to distinguish promoters from other regions of the genome; however, finding them is time-consuming. Experimental results on a large imbalanced sample are presented for two settings: the traditional method of feature formation using k-mers, and direct application of the logical classifier to the original data. In the second setting, the quality of logical classification is significantly higher and reaches a ROC-AUC of 94.3% with the ensemble approach. The best result, a ROC-AUC of 95.1%, was obtained by the CatBoost classifier applied directly to the original sample. With the traditional method of feature generation, the corresponding figure for CatBoost is 94.8%.
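The two feature representations compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `kmer_counts` and `direct_encoding`, the integer codes assigned to nucleotides, the choice of k = 2, and the sequence `TATAAT` are all assumptions made for illustration. The first function builds the traditional k-mer count vector; the second maps each position of the sequence to a small integer, giving one feature per nucleotide.

```python
from itertools import product

NUCLEOTIDES = "ACGT"  # assumed alphabet; real data may contain other symbols


def kmer_counts(seq, k=2):
    """Return counts of every possible k-mer, using overlapping windows."""
    vocab = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(vocab)}
    counts = [0] * len(vocab)
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if kmer in index:  # windows with symbols outside ACGT are skipped
            counts[index[kmer]] += 1
    return counts


def direct_encoding(seq):
    """Map each position to a small integer code (hypothetical coding)."""
    codes = {"A": 0, "C": 1, "G": 2, "T": 3}
    # Any symbol outside ACGT is given the extra code 4 (an assumption).
    return [codes.get(ch, 4) for ch in seq]


if __name__ == "__main__":
    example = "TATAAT"  # illustrative sequence only
    print(kmer_counts(example, k=2))
    print(direct_encoding(example))
```

The k-mer vector has a fixed length of 4^k regardless of sequence length and discards positional information, whereas the direct encoding keeps one low-valued integer feature per position, which is the kind of input the abstract describes the logical classifier as being designed for.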
About the authors
Anastasia P. Djukova
Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences
Author for correspondence.
Email: anastasia.d.95@gmail.com
Postgraduate student
Russian Federation, Moscow
Elena V. Djukova
Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences
Email: edjukova@mail.ru
Doctor of Science in physics and mathematics, Chief researcher
Russian Federation, Moscow
