Methods for Cross-Lingual Retrieval of Similar Documents in Legal Domain Based on Machine Learning
- Authors: Zhebel V.V.1, Devyatkin D.A.2, Zubarev D.V.2, Sochenkov I.V.2,3,4
- Affiliations:
- Limited liability company «Technologies for systems analysis»
- Federal Research Center «Computer Science and Control» of the Russian Academy of Sciences
- Innopolis University
- Ivannikov Institute for System Programming of the Russian Academy of Sciences
- Issue: No 2 (2022)
- Pages: 27-35
- Section: Analysis of Textual and Graphical Information
- URL: https://journal-vniispk.ru/2071-8594/article/view/270288
- DOI: https://doi.org/10.14357/20718594220203
- ID: 270288
Abstract
Improving legislation requires studying international experience, which in turn calls for information retrieval systems that perform well in the multilingual legal domain. One possible solution is the retrieval of thematically similar documents. An important part of this task is cross-lingual transfer: the user submits a document in one language and receives search results in another. The paper describes different approaches to this problem, from classical mediator-based methods to modern methods of distributional semantics. The UN Digital Library served as the test collection. The combination of the extended translation model and the BM25 ranking function demonstrates the best results.
About the authors
Vladimir V. Zhebel
Limited liability company «Technologies for systems analysis»
Author for correspondence.
Email: zhebel@isa.ru
Research fellow
Russian Federation, Moscow
Dmitry A. Devyatkin
Federal Research Center «Computer Science and Control» of the Russian Academy of Sciences
Email: devyatkin@isa.ru
Research fellow
Russian Federation, Moscow
Denis V. Zubarev
Federal Research Center «Computer Science and Control» of the Russian Academy of Sciences
Email: zubarev@isa.ru
Junior research fellow
Russian Federation, Moscow
Ilya V. Sochenkov
Federal Research Center «Computer Science and Control» of the Russian Academy of Sciences; Innopolis University; Ivannikov Institute for System Programming of the Russian Academy of Sciences
Email: sochenkov@isa.ru
Candidate of physical and mathematical sciences, Leading Expert Consultant, Lead Research Fellow, Junior research technician
Russian Federation, Moscow; Kazan; Moscow