Vol. 73, No. 1 (2023)
Information Technologies
Classes of objects and relations in the Common Digital Space of Scientific Knowledge
Abstract
There are many global and local information systems around the world, each focused on solving particular problems. The Common Digital Space of Scientific Knowledge (CDSSK) can be considered an integrator that makes it possible to solve complex information problems at the intersection of sciences and of the application areas of existing information systems, making maximum use of the information resources accumulated in them. The article presents the structure of the CDSSK, the requirements for its functionality and the structure of its software shell, which follows the principles of the Semantic Web. All objects reflected in the CDSSK are divided into two classes: universal and local. Relationships between objects are also divided into two groups: universal and specific. The paper proposes a list of universal classes of objects, defines universal types of relations between them, and gives examples of specific relations and of approaches to identifying local classes and subclasses of objects in a particular field of science.



Curation of bibliographic metadata of the institutional repository on the Invenio-JOIN² platform
Abstract
Populating an institutional repository and keeping the entered data up to date is a very resource-intensive task that requires the coordinated work of operators who enter data into an information system (IS). The curation of bibliographic metadata helps to address this task: it is a set of actions and measures aimed at updating, managing and preserving digital objects throughout their life cycle in the educational and scientific interests of the community. This work considers the curation of bibliographic descriptions of publications by employees of JINR (Joint Institute for Nuclear Research) and the enrichment of the metadata entered into the JINR institutional repository with data from external sources: the Scopus bibliographic and abstract database, the Web of Science search platform, and the INSPIREHEP information platform for high energy physics. The development of information services for keeping current records of the publication activity of JINR staff is described.



Development of an Information-analytical System for the Support and Maintenance of Licenses at MLIT JINR
Abstract
The License Management System (LMS) was developed at the JINR Information Technology Laboratory. The purpose of the LMS is to automate the management, acquisition, maintenance and use of licensed software products. The system consists of a network licensing system (NLS), databases and a web interface. The NLS provides network license management and collects information about which network license was used by which user and at what time, transmitting it to a time series database. The specifics of collecting this type of data are described. The data is used for monitoring implemented on the Grafana platform. The main LMS database stores data on corporate, private and other types of licenses, as well as the necessary data on license users. The database is implemented in PostgreSQL. The system supports workflows such as ordering new licenses that users need and requesting additions to the catalog of purchased licenses, among other functions. The LMS web interface is implemented in the development environment of the Electronic Document Management System "EDMS Dubna" using the LegoToolkit web application. A website has been developed for LMS users.



On the analysis of individual data on transport usage
Abstract
More than 50% of the world's population currently lives in cities, and according to UN forecasts this share will increase. Urban infrastructure must develop along with population growth. This article provides an overview of methods for improving a city's transport infrastructure based on data analysis. It presents methods for reducing harmful emissions, optimizing the operation of taxis and public transport, recognizing transportation modes, and some other tasks. These methods operate on data describing the transport behavior of individual users of the transport network. The sources of such data are smart card validators, GPS sensors and smartphone accelerometers. The article discusses the advantages and disadvantages of each data type and presents alternative ways to obtain them. These methods, along with methods for aggregated data analysis, can become the main part of a single platform that supports city authorities in improving the transport infrastructure. We propose an architecture for this platform that allows developers to extend the range of available algorithms and methods dynamically.



A Repository of Simulated Typical Signals as a Basis for Developing Fast Algorithms
Abstract
The article is devoted to the structure of a repository of multidimensional typical signals and processes for the digital filtering problem, which is based on fast simulation algorithms and the operator of the universal adaptive matrix transformation. The repository contains up-to-date simulation equations; simulated signals in the form of tables of values and images; original code in Python, C++ and other programming languages; and additional licensing information. The article considers various approaches to creating the repository. The chosen approach is to build a dataset, publish it on GitHub and construct an ontology over this dataset. The principles of structuring the data and organizing such an ontology are presented in the article. A ready-made published dataset is presented.



Data Mining
AutoML: A Study of Existing Software Implementations and Identification of a Common Internal Structure of Solutions
Abstract
The article examines various software implementations that automate the machine learning process for solving regression problems. It considers the internal design and capabilities of several existing and widely used automated machine learning tools, such as LightAutoML (LAMA), TPOT, Auto-Sklearn, H2O AutoML and MLJAR. The capabilities of these software systems were studied on regression problems over several datasets. As a result of the study, a general structure of an automated machine learning software solution was derived, which can serve as a basis for the further design and development of one's own software product; the accuracy with which the systems predicted the values of the target feature was also analyzed.



Web Application with GUI for Data Analysis Automation
Abstract
In the current digital age, the world holds a huge amount of data. As a result, people increasingly encounter methods such as data analysis and machine learning, and many consider using machine learning algorithms for their own purposes. However, data analysis is a complex process that can hardly be carried out by people who lack sufficient knowledge both of this field and of programming. This paper presents an approach that gives non-expert users the ability to apply machine learning algorithms to their datasets through an application with a graphical interface. Creating ML solutions involves many challenges even when existing ML algorithms are used: feature engineering, outlier detection, filling in missing values, hyperparameter optimization of the ML method, and so on. The main point of the research is to find a balance in solving these complex tasks and to provide a web-based user interface that enables inexperienced users to apply ML methods in an automatic or semi-automatic way. The practical outcome is an information system that consists of three interrelated parts: a web application, an API and several microservices that implement ML algorithms from the Scikit-learn library.



The System.AI Project: A Fully Managed Machine Learning and Data Analysis Stack for the .NET Ecosystem
Abstract
In recent years, machine learning technologies have become increasingly widespread in such well-known tasks as image stylization, colorization of black-and-white images, image super-resolution, detection of fake data, and voice and image recognition. This creates a need for a set of tools for integrating artificial intelligence systems into applications for mobile devices, smart home devices and home PCs. The article is devoted to a solution that allows developers to integrate data analysis and artificial intelligence systems directly into an application, yielding a lightweight, portable, cross-platform, monolithic software product, which is often impossible with existing solutions. The main features of the proposed solution are its focus on the Microsoft .NET ecosystem [1] and its use of only the standard facilities of the BCL and the C# language. The implemented toolkit is fully cross-platform and hardware-independent. The API largely matches that of similar solutions for Python, which makes it possible to quickly port Python code to a .NET project.



On the Practical Generation of Counterfactual Examples
Abstract
One of the important elements in evaluating the stability of machine learning systems is the so-called adversarial example. Adversarial examples are specially selected or artificially created input data that interfere with the normal operation of a machine learning system and are interpreted or processed incorrectly. Most often, such data are obtained through formal modifications of real source data. This article considers a different approach to creating such data, one that takes into account the semantic significance (meaning) of the modified data: counterfactual examples. The purpose of the work is to present practical solutions for generating counterfactual examples. The discussion is based on the real use of counterfactual examples in assessing the robustness of machine learning systems.



A Survey of Model Inversion Attacks and Countermeasures
Abstract
This article provides a detailed overview of the so-called Model Inversion (MI) attacks. These attacks target Machine-Learning-as-a-Service (MLaaS) platforms; the goal is to use well-prepared adversarial samples to attack target models and extract sensitive information from them, such as items from the dataset on which the ML model was trained or the model's parameters. This kind of attack has become a serious threat to ML models. It is therefore necessary to study it and understand how it affects ML models; based on this knowledge, strategies can be proposed that may improve the robustness of ML models.



Research the Stability of Decision Trees Using Distances on Graphs
Abstract
The article deals with the problem of stability of classifiers based on decision trees for the problem of text attribution. Such a task arises, for example, in the study of the authorship of articles from the pre-revolutionary journals “Time” (1861–1863), “Epoch” (1864–1865) and the weekly “Citizen” (1873–1874). The texts were divided into separate parts of different sizes using the sliding window method, then the frequency of n-grams (encoded sequences of parts of speech) in each fragment was determined. Further, these indicators were used to build various classifiers. The resulting decision trees were compared with each other using the tree edit distance. For this purpose, a procedure for processing, comparing and visualizing graphs was implemented in the SMALT software package. As a result of experiments using different weights for editing operations, patterns were revealed between the parameters for constructing text fragments and the decision trees obtained on their basis.
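The sliding-window feature extraction described in this abstract can be illustrated with a minimal sketch. It is not the procedure implemented in SMALT: the function name, the window and step parameters, and the sample tag sequence are all hypothetical, and the sketch only assumes that each text has already been encoded as a sequence of part-of-speech tags.

```python
from collections import Counter


def window_ngram_frequencies(tags, window, step, n):
    """Return one dict of relative n-gram frequencies per sliding window.

    tags   - sequence of part-of-speech tags (hypothetical encoding)
    window - fragment length in tags
    step   - shift between consecutive fragments
    n      - n-gram order
    """
    if n > window:
        raise ValueError("n-gram order must not exceed the window size")
    features = []
    for start in range(0, len(tags) - window + 1, step):
        fragment = tags[start:start + window]
        # All contiguous n-grams inside the current fragment.
        ngrams = [tuple(fragment[i:i + n]) for i in range(window - n + 1)]
        counts = Counter(ngrams)
        total = len(ngrams)
        features.append({gram: count / total for gram, count in counts.items()})
    return features


# Hypothetical tag sequence, for illustration only.
tags = ["NOUN", "VERB", "NOUN", "VERB", "ADJ", "NOUN"]
features = window_ngram_frequencies(tags, window=4, step=2, n=2)
```

Each resulting dictionary could then be turned into a feature vector for a decision tree classifier; changing `window` and `step` changes the set of fragments on which the trees are built.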



Economic Cycle Prediction using Machine Learning – Russia Case Study
Abstract
The long-term development of the world economy is cyclical. To date, there is no single accepted approach to describing the nature of the economic cycle, so studies of economic and political cycles are one of the key areas of economic theory. Econometrics and machine learning have a common goal: to build a predictive model for a target variable using explanatory variables. This research aims to identify the economic cycle in the Russian Federation using collective factors. It takes an approach different from classical econometric techniques and shows how machine learning (ML) techniques can improve the accuracy of forecasts. We used three machine learning algorithms: k-Nearest Neighbors (kNN), Random Forests (RF) and Support Vector Machines (SVM). The research is based on 30 economic factors for the period 1990-2020 from FRED, the World Bank, the WTO, the Federal State Statistics Service, the Bank of Russia and other sources. The results indicate that the Russian economy would be very active (at a peak) in the following quarters. This approach could offer a new way of providing policy recommendations to authorities and, in particular, to financial institutions.



Quantitative large-scale study of school student’s academic performance peculiarities during distance education caused by COVID-19
Abstract
The paper presents the results of a large-scale analysis of the impact of distance learning caused by COVID-19 on school students' academic performance. This multidisciplinary study is based on a large amount of raw data containing school students' grades for the academic years 2015 to 2021, taken from the “Electronic education in Tatarstan Republic” system. The analysis applies Big Data and mathematical statistics methods implemented in the Python programming language. The Dask framework is used for parallel cluster-based computation and the Pandas library for data manipulation and large-scale data analysis. One of the main priorities of this paper is to identify the impact of different factors of the educational system on school students' academic performance. For that purpose, the quantile regression method was used. This method is widely used in modern data science for processing large-scale data from various experiments. Quantile regression models are designed to estimate conditional quantile functions. The method is therefore especially suitable for examining conditional effects at various locations of the outcome distribution, e.g., the lower and upper tails. The study-related conditional factors include students' marks from previous academic years, the types of lessons in which grades were obtained, and teacher parameters such as age, gender and qualification category.
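The idea behind quantile regression can be sketched with the pinball (check) loss. Minimizing this loss for a quantile level tau yields an estimate of the tau-quantile instead of the mean. The example below is only an illustration under simplifying assumptions: it fits a single constant rather than a regression on covariates, the grade values are invented, and it is not the authors' Dask/Pandas pipeline.

```python
def pinball_loss(y_true, y_pred, tau):
    """Average pinball loss for quantile level tau (0 < tau < 1)."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction is weighted by tau, over-prediction by (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1) * diff
    return total / len(y_true)


def best_constant(y, tau):
    """Pick the observed value that minimizes the pinball loss as a constant fit."""
    return min(y, key=lambda c: pinball_loss(y, [c] * len(y), tau))


# Hypothetical grades, for illustration only.
grades = [2, 3, 3, 4, 4, 4, 5, 5]
lower_tail = best_constant(grades, tau=0.1)
median = best_constant(grades, tau=0.5)
upper_tail = best_constant(grades, tau=0.9)
```

In a full quantile regression the constant is replaced by a function of the explanatory factors, and the same loss is minimized separately for each tau, which is what allows effects in the lower and upper tails to differ.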



Computational Text Analysis
Methods for Extracting Biomedical Information from Patents and Scientific Publications (the Case of Chemical Compounds)
Abstract
This article proposes an algorithm for extracting information from biomedical patents and scientific publications. The proposed algorithm is based on machine learning methods. Experiments were conducted on patents from the USPTO database. The experiments showed that the model built on BioBERT achieved the best extraction quality.



Sentence splitters benchmark
Abstract
There are multiple implementations of sentence splitters, including open source libraries and tools, but the quality of segmentation and the performance of each tool differ considerably. Moreover, it is convenient for NLP developers to have all libraries written in the same programming language, unless some kind of integration programming language is used. This paper considers two aspects: on the one hand, building a uniform framework and assessing the language features of the modern and popular programming language Julia; on the other, estimating the performance of sentence splitting libraries as is. The paper contains detailed performance results, samples of texts after splitting, and a list of typical issues related to sentence splitting.



The Conceptual Modeling System Based on Metagraph Approach
Abstract
The article presents an approach to building a conceptual modeling system that covers both recognizing text into a conceptual structure and generating text from a conceptual structure. A metagraph is used as the conceptual structure. An architecture for the conceptual modeling system is proposed. The metagraph model is considered as the data model for conceptual modeling. The main ideas behind the text parsing module and the text generation module are described.



Methods and Models in the Natural Sciences
Structure and Evolution of Open Star Clusters: Theory and Observations Based on Gaia Data
Abstract
The structure and evolution of open star clusters (OSCs) are considered using the examples of the Pleiades and a group of OSCs in the Orion's Sword region. Stars were selected using Gaia data. The connection between the Orion's Sword clusters and molecular clouds is traced using data from the Herschel spacecraft. The place of the objects considered in the general scheme of evolution that we compiled earlier is shown. We conclude that an extension of the OSC classification is overdue. The Pleiades stellar system was found to have an extensive stellar halo. The Pisces–Eridanus stellar stream found in the vicinity of the cluster is probably genetically related to the Pleiades and, together with it, represents the remnants of a disintegrated OB association. The young OSCs observed in the Orion's Sword region are in all likelihood associated with molecular clouds. Orion's Sword is a disk structure seen edge-on, the product of a collision between two giant molecular clouds. Data on OSCs are being replenished rapidly, and the number of OSCs is growing as they are identified in Gaia surveys. The analysis of this region can be repeated and extended as the volume of data increases over time, using proven techniques, which fits the concept of data management in data-intensive domains.



Astronomical observation planner
Abstract
One of the tasks in the robotization of astronomical observations is the creation of programs that distribute observing time optimally depending on the position of the Sun (efficient use of twilight time) and on the position and phases of the Moon. An important requirement for such a program is that it works autonomously, independent of external Internet resources. To solve this problem, an autonomous astronomical calendar was developed that makes it possible to estimate the rise and set times of the Sun and the Moon (as well as the Moon's phases) and the onset and end of twilight. This subroutine is the first step in automating the planning of astronomical observations. The next important step is to develop software that can plan observations in an optimal way. The targets for observations are discussed, and the initial parameters needed to form an observation schedule for telescopes automatically are specified.



A logical model for integration of heterogeneous experimental data in soil research
Abstract
Extracting knowledge from fast-growing heterogeneous datasets is an undoubted challenge for science. In particular, details of experimental setups are insufficiently formalized and cannot be easily inserted into databases, which makes it difficult to use these details in data integration and meta-analyses. To address this, we developed a formalization scheme for object descriptions together with their origin and with protocols for field and laboratory measurements (including instruments and experimental conditions). It allows larger amounts of data to be integrated while accounting for the specifics of their acquisition, for example, by applying adjustments, assigning weights to data sources (based on their reliability, method precision and experimental uncertainty) or directly accounting for experimental conditions in models. This formalization is currently used to develop an electronic laboratory journal for soil research, intended for the detailed description of a conducted or planned experiment. The study aims to increase the reproducibility of scientific research results, allow automatic data processing and error detection and, most importantly, enable effective soil data mining for decision support systems.



Conditions for the Effective Application of Artificial Intelligence Technologies in the Agro-Industrial Complex of the EAEU
Abstract
Solutions for reducing environmental hazards in agriculture within the single agro-industrial space of the EAEU are considered. A mechanism is proposed for forming such a space, which makes it possible to resolve the geopolitical, economic, social and environmental problems that have arisen. It is a unified digital management platform that includes the option of cloud deployment based on mathematical and ontological modeling and on unified digital standards (the structure of a sub-platform for collecting, storing and integrating operation-level primary accounting information of all participants in a single database; the structure of a sub-platform for technological accounting; and the structure of a sub-platform of algorithms that process the data of the first two sub-platforms for production management purposes). With this approach, the application of artificial intelligence technologies will bring the greatest effect, ensure maximum cross-industry traceability of products and minimize the negative impact of natural and anthropogenic environmental hazard factors on the environment, on the products of the agro-industrial complex and on people themselves.


