A BERT-Based Classification Model: The Case of Russian Fairy Tales

Valery Dmitrievich Solovyev; Marina Ivanovna Solnyshkina; Andrey Ten; Nikolai Arkadievich Prokopyev

doi:10.17323/jle.2024.24030

Authors: Solovyev V.D.¹, Solnyshkina M.I.¹, Ten A.², Prokopyev N.A.³
Affiliations:
1. Kazan University
2. Nobilis.Team
3. Academy of Sciences of the Republic of Tatarstan
Issue: Vol. 10, No. 4 (2024)
Pages: 98-111
Section: Original research
URL: https://journal-vniispk.ru/2411-7390/article/view/356612
DOI: https://doi.org/10.17323/jle.2024.24030
ID: 356612

Abstract

Introduction: Automatic text profiling and genre classification play a key role in assessing whether texts are fit for purpose and have been widely used for more than a decade in education, information retrieval, sentiment analysis, and machine translation. Among all genres, fairy tales are one of the most challenging and valuable objects of study because of their heterogeneity and many implicit features. Traditional classification methods, including stylometric and parametric algorithms, are not only labor-intensive and time-consuming but also struggle to identify suitable classifying features. Research in this area is scarce, and its findings remain contradictory and contested.

Purpose: Our study aims to fill this important gap and proposes an algorithm for classifying Russian fairy tales based on predefined parameters. We present a state-of-the-art BERT-based classification model for Russian fairy tales, test the hypothesis that BERT has potential for classifying Russian texts, and evaluate it on a representative corpus of 743 Russian fairy tales.

Method: We pre-train BERT on a dataset consisting of three classes of documents and fine-tune it for a specific applied task. Focusing on tokenization and the construction of embeddings as the key components of text processing in BERT, the study also evaluates the standard metrics used to train classification models and analyzes difficult cases, possible errors, and improvement algorithms, thereby raising the accuracy of the models. Model performance is evaluated on the basis of the loss function, prediction accuracy, precision, and recall.
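The evaluation criteria named above (loss function, accuracy, precision, recall) can be illustrated with a minimal sketch. It assumes a cross-entropy loss and macro-averaged precision and recall over three classes; the abstract does not state which loss or averaging scheme the authors used, and the labels and predictions below are hypothetical toy data, not results from the study.

```python
import math


def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    eps = 1e-12  # guards against log(0)
    total = sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels))
    return -total / len(labels)


def evaluate(y_true, y_pred, classes):
    """Return accuracy plus macro-averaged precision and recall."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    macro_precision = sum(precisions) / len(classes)
    macro_recall = sum(recalls) / len(classes)
    return accuracy, macro_precision, macro_recall


# Hypothetical toy labels for three document classes (0, 1, 2).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
accuracy, macro_precision, macro_recall = evaluate(y_true, y_pred, [0, 1, 2])
```

Macro averaging weights each class equally, which is one reasonable choice when the classes are of unequal size; a weighted average would be an equally valid assumption.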

Results: We confirmed BERT's potential for classifying Russian texts and its ability to improve the performance and quality of existing NLP models. Our experiments with cointegrated/rubert-tiny, ai-forever/ruBert-base, and DeepPavlov/rubert-base-cased-sentence on various classification tasks showed that our models achieve state-of-the-art results. The highest accuracy (95.9%) was achieved with cointegrated/rubert-tiny, which significantly outperforms the other two models. The classification accuracy achieved with AI is high enough to compete with human expert judgment.
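As a rough illustration of how the three checkpoints named above might be loaded for a three-class task, here is a sketch using the Hugging Face `transformers` library. Only the checkpoint identifiers come from the abstract; the helper names `load_classifier` and `classify`, the `num_labels=3` setting, and the 512-token limit are assumptions made for this example, and the authors' actual training setup is not described here.

```python
# Checkpoint identifiers as listed in the abstract.
MODEL_NAMES = [
    "cointegrated/rubert-tiny",
    "ai-forever/ruBert-base",
    "DeepPavlov/rubert-base-cased-sentence",
]


def load_classifier(model_name, num_labels=3):
    """Load a tokenizer and a model with a new classification head.

    num_labels=3 is an assumption based on the three document classes
    mentioned in the abstract.
    """
    # Imported lazily so the module can be read without the library installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )
    return tokenizer, model


def classify(text, tokenizer, model, max_length=512):
    """Return the index of the highest-scoring class for one text."""
    import torch

    # Long tales are truncated to the assumed maximum input length.
    inputs = tokenizer(
        text, truncation=True, max_length=max_length, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item())
```

The classification head created by `from_pretrained` is randomly initialized, so predictions are meaningless until the model has been fine-tuned on labeled fairy tales.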

Conclusion: The results highlight the importance of fine-tuning for classification models. BERT shows considerable potential for improving natural language processing technologies, contributing to the quality of automatic text analysis and opening new opportunities for research and application in various fields, including identifying and ordering texts by content, which supports decision-making. The designed and validated algorithm can be scaled to classify both complex, ambiguous discourse and fiction, improving our understanding of text categories. Further development of these approaches requires considerably larger datasets.

Keywords

BERT model, fairy tales, text classification, neural networks
