Methods of extracting biomedical information from patents and scientific publications (on the example of chemical compounds)

N. A. Kolpakov; Колпаков Николай Алексеевич; A. I. Molodchenkov; Молодченков Алексей Игоревич; A. V. Lukin; Лукин Антон В.

doi:10.14357/20790279230118

Authors: Kolpakov N.A.¹, Molodchenkov A.I.²,³, Lukin A.V.³
Affiliations:
1. Moscow Institute of Physics and Technology
2. Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences
3. Peoples’ Friendship University of Russia
Issue: Vol. 73, No. 1 (2023)
Pages: 159-166
Section: Text Mining
URL: https://journal-vniispk.ru/2079-0279/article/view/286896
DOI: https://doi.org/10.14357/20790279230118
ID: 286896

Abstract

This article proposes a machine-learning-based algorithm for extracting information from biomedical patents and scientific publications. Experiments were carried out on patents from the USPTO database and showed that a model based on BioBERT achieved the best extraction quality.

Keywords

machine learning, natural language processing, named entity recognition, biomedical text processing
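
The abstract does not describe the output format of the extraction model. As a minimal sketch, assuming the task is framed as token-level named entity recognition with BIO tags (a common convention, not confirmed by the source), the Python snippet below shows how per-token labels could be collapsed into chemical compound mentions. The function name `decode_bio`, the `CHEM` label, and the example sentence are hypothetical.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse per-token BIO tags into (entity_text, entity_type) pairs."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span, closing any open one.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # An I- tag of the same type extends the open span.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open span, closes the span;
            # the stray token itself is not kept.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = []
            current_type = None

    # Flush a span that runs to the end of the sentence.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hypothetical sentence and hand-written tags (assumption: a "CHEM" label).
tokens = ["Tablets", "containing", "acetylsalicylic", "acid", "and", "caffeine", "."]
tags = ["O", "O", "B-CHEM", "I-CHEM", "O", "B-CHEM", "O"]
entities = decode_bio(tokens, tags)
print(entities)
```

In a full pipeline the tags would come from a trained tagger, for example a BioBERT-based token classifier; they are hard-coded here only to keep the sketch self-contained.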

About the authors

N. Kolpakov

Moscow Institute of Physics and Technology

Email: kolpakov.na@phystech.edu

Bachelor

Russia, 1A, building 1, Kerch str., Moscow, 117303

A. Molodchenkov

Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences; Peoples’ Friendship University of Russia

Corresponding author
Email: aim@tesyan.ru

PhD

Russia, 44/2 Vavilova str., Moscow, 119333; 6, Miklukho-Maklaya str., Moscow, 117198

A. Lukin

Peoples’ Friendship University of Russia

Email: antonvlukin@gmail.com

Russia, 6, Miklukho-Maklaya str., Moscow, 117198
