Methods for Cross-Lingual Retrieval of Similar Documents in Legal Domain Based on Machine Learning

Vladimir V. Zhebel; Жебель Владимир Викторович; Dmitry A. Devyatkin; Девяткин Дмитрий Алексеевич; Denis V. Zubarev; Зубарев Денис Владимирович; Ilya V. Sochenkov; Соченков Илья Владимирович

doi:10.14357/10.14357/20718594220203

Authors: Zhebel V.V.¹, Devyatkin D.A.², Zubarev D.V.², Sochenkov I.V.²,³,⁴
Affiliations:
1. Limited liability company «Technologies for systems analysis»
2. Federal Research Center «Computer Science and Control» of the Russian Academy of Sciences
3. Innopolis University
4. Ivannikov Institute for System Programming of the Russian Academy of Sciences
Issue: No 2 (2022)
Pages: 27-35
Section: Analysis of Textual and Graphical Information
URL: https://journal-vniispk.ru/2071-8594/article/view/270288
DOI: https://doi.org/10.14357/10.14357/20718594220203
ID: 270288

Abstract

Studying international experience in order to improve legislation requires information retrieval systems that perform well in the multilingual legal domain. One possible solution is the retrieval of thematically similar documents. An important part of this task, however, is the transfer between languages, so that a user can submit a document in one language and receive search results in another. The paper describes different approaches to this problem, from classical mediator-based methods to modern methods of distributional semantics. The UN digital library serves as the test collection. The combination of the extended translation model and the BM25 ranking function demonstrates the best results.
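
As a rough illustration of how a translation model can be combined with BM25, the sketch below expands each source-language query term into weighted target-language terms before applying the usual BM25 formula. It is not the authors' implementation: the function name `bm25_scores`, the `translations` table with its weights, the toy documents, and the parameter values `k1=1.2` and `b=0.75` are all assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, translations=None, k1=1.2, b=0.75):
    """Score tokenized documents against query terms with Okapi BM25.

    translations is a hypothetical stand-in for a translation model:
    a dict mapping a source-language term to {target_term: weight}.
    Terms without an entry are matched literally with weight 1.0.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    counts = [Counter(d) for d in docs]
    scores = [0.0] * n_docs
    for term in query_terms:
        # Expand the query term into weighted target-language terms.
        expansions = (translations or {}).get(term, {term: 1.0})
        # "Translated" term frequency: weighted sum over all expansions.
        tfs = [sum(w * c[t] for t, w in expansions.items()) for c in counts]
        df = sum(1 for tf in tfs if tf > 0)
        if df == 0:
            continue  # no document contains any expansion of this term
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        for i, tf in enumerate(tfs):
            # Length normalization keeps long documents from dominating.
            norm = k1 * (1 - b + b * len(docs[i]) / avg_len)
            scores[i] += idf * tf * (k1 + 1) / (tf + norm)
    return scores


# Toy data (assumed): a Russian query term scored against English documents.
docs = [["court", "law", "treaty"], ["weather", "rain"]]
translations = {"суд": {"court": 0.9, "tribunal": 0.6}}
scores = bm25_scores(["суд"], docs, translations)
```

With this toy table only the first document receives a non-zero score, because it is the only one that contains a translation of the query term. Without the table the query matches nothing, which is the gap between languages that the translation step is meant to close.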

Keywords

Cross-Lingual Document Retrieval, Distributional Semantics, Information Retrieval in the Legal Domain

About the authors

Vladimir V. Zhebel

Limited liability company «Technologies for systems analysis»

Author for correspondence.
Email: zhebel@isa.ru

Research fellow

Russian Federation, Moscow

Dmitry A. Devyatkin

Federal Research Center «Computer Science and Control» of the Russian Academy of Sciences

Email: devyatkin@isa.ru

Research fellow

Russian Federation, Moscow

Denis V. Zubarev

Federal Research Center «Computer Science and Control» of the Russian Academy of Sciences

Email: zubarev@isa.ru

Junior research fellow

Russian Federation, Moscow

Ilya V. Sochenkov

Federal Research Center «Computer Science and Control» of the Russian Academy of Sciences; Innopolis University; Ivannikov Institute for System Programming of the Russian Academy of Sciences

Email: sochenkov@isa.ru

Candidate of physical and mathematical sciences, Leading Expert Consultant, Lead Research Fellow, Junior research technician

Russian Federation, Moscow; Kazan; Moscow
