Sentence splitters benchmark

A. P. Zavyalova; P. A. Martynyuk; R. S. Samarev

doi:10.14357/20790279230119

Authors: Zavyalova A.P.¹, Martynyuk P.A.¹, Samarev R.S.¹
Affiliations:
1. Bauman Moscow State Technical University
Issue: Vol. 73, No. 1 (2023)
Pages: 167-175
Section: Computer Analysis of Texts
URL: https://journal-vniispk.ru/2079-0279/article/view/286903
DOI: https://doi.org/10.14357/20790279230119
ID: 286903

Abstract

There are many implementations of sentence splitters, including open-source libraries and tools, but their segmentation quality and performance differ considerably. Moreover, NLP developers find it convenient to have all libraries written in the same programming language, unless some kind of integration programming language is used. This paper considers two aspects: on the one hand, building a uniform framework and assessing the language features of Julia, a modern and popular programming language; on the other hand, estimating the performance of sentence-splitting libraries as they are. The paper contains detailed performance results, samples of texts after splitting, and a list of typical issues related to sentence splitting.

Keywords

segmentation, sentence, splitting, NLP, Julia language, benchmark, text analysis

Associate Professor

Russia, ul. Baumanskaya 2-ya, 5, Moscow, 105005
