Vol 73, No 1 (2023)
Information Technologies
Classes of objects and relations in the Common Digital Space of Scientific Knowledge
Abstract
Around the world there are many information systems, both global and local, focused on solving various problems. The Common Digital Space of Scientific Knowledge (CDSSK) can be considered an integrator that makes it possible to solve complex information problems at the intersection of sciences and the application areas of existing information systems, making maximum use of the information resources accumulated in them. The article describes the structure of the CDSSK, the requirements for its functionality, and the structure of its software shell, which corresponds to the principles of the Semantic Web. All objects reflected in the CDSSK are divided into two classes, universal and local. Relationships between objects are likewise divided into two groups, universal and specific. The paper proposes a list of universal object classes, defines universal types of relations between them, and gives examples of specific relations and of approaches to identifying local classes and subclasses of objects in a particular field of science.



Curation of bibliographic metadata of the institutional repository on the Invenio-JOIN² platform
Abstract
Filling an institutional repository with content and keeping the entered data up to date is a very resource-intensive task that requires coordinating the actions of the operators who enter data into an information system (IS). Curation of bibliographic metadata helps to solve this task: a set of actions and measures aimed at updating, managing and preserving digital objects throughout their life cycle in the educational and scientific interests of the community. This work considers the curation of bibliographic descriptions of publications by JINR (Joint Institute for Nuclear Research) employees and the enrichment of metadata entered into the JINR institutional repository from external sources: the Scopus bibliographic and abstract database, the Web of Science search platform, and the INSPIREHEP information platform in high energy physics. The development of information services for the current accounting of the publication activity of JINR staff is described.



Development of an Information-analytical System for the Support and Maintenance of Licenses at MLIT JINR
Abstract
The License Management System (LMS) was developed at the JINR Information Technology Laboratory. The purpose of creating the LMS is to automate the management, acquisition, maintenance and use of licensed software products. The system consists of a network licensing system (NLS), databases and a web interface. The NLS manages network licenses and collects and transmits to a time series database information about which network license was used, by which user and at what time. The features of collecting this type of data are described. These data are used for monitoring implemented on the basis of the Grafana platform. The main LMS database stores data related to corporate, private and other types of licenses, as well as the necessary data about license users; it is implemented in PostgreSQL. The system can process workflows such as ordering new licenses that users need, requesting additions to the catalog of purchased licenses, and other functions. The LMS web interface is implemented in the development environment of the Electronic Document Management System "EDMS Dubna" using the LegoToolkit web application. A website has been developed for LMS users.



On the analysis of individual data on transport usage
Abstract
The percentage of the world's population living in cities currently exceeds 50% and, according to UN forecasts, will continue to increase. Urban infrastructure must develop along with population growth. This article provides an overview of methods for improving a city's transport infrastructure based on data analysis. The article presents methods for reducing harmful emissions, optimizing the operation of taxis and public transport, recognizing transportation modes, and some other tasks. These methods operate on data describing the transport behavior of individual users of the transport network. The sources of such data are smart card validators, GPS sensors and smartphone accelerometers. The article discusses the advantages and disadvantages of each data type and presents alternative ways to obtain them. These methods, along with methods for analyzing aggregated data, can become the core of a single platform that will assist city authorities in improving the transport infrastructure. We propose an architecture for this platform that allows developers to dynamically extend the range of available algorithms and methods.



Storage of simulated reference signals: a basis for creating fast algorithms
Abstract
The paper is devoted to organizing a storage structure for multidimensional reference signals and processes within the framework of digital filtering based on fast simulation algorithms and a universal adaptive matrix transformation operator. The storage contains the relevant simulation equations, simulated signals as value tables and images, and source code in Python, C++ and other languages, as well as supplementary license data. Different approaches to creating the storage are analyzed in the paper. The chosen route consists of structuring all the data into a dataset, publishing the dataset on GitHub and building an ontology on top of the dataset. The principles for structuring the data and organizing such an ontology are given. The resulting published dataset is presented.



Data Mining
AutoML: Examining Existing Software Implementations and Determining the Overall Internal Structure of Solutions
Abstract
The article discusses various software implementations of automated machine learning applied to the regression problem. The internal structure and capabilities of a number of existing and widely used automated machine learning tools, such as LightAutoML (LAMA), TPOT, AutoSklearn, H2O AutoML and MLJAR, are considered. The capabilities of these software systems were explored by solving the regression problem on multiple datasets.
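As an illustration of what such frameworks automate internally, the following sketch runs a minimal search loop over candidate regressors with cross-validation. It is not taken from any of the listed tools; the model names and parameters are illustrative.

```python
# Minimal sketch of an AutoML-style search: try several candidate
# regressors with cross-validation and keep the best-scoring one.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean R^2 over 5 folds for each candidate model.
scores = {name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(X, y)
```

Real AutoML tools add far more on top of this loop (preprocessing search, ensembling, time budgets), but the select-by-validation-score core is the same.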



Web Application with GUI for Data Analysis Automation
Abstract
In the current digital age the world generates a huge amount of data, so people are increasingly confronted with methods such as data analysis and machine learning, and many are considering using machine learning algorithms for their own purposes. However, data analysis is a complex process that can hardly be carried out by people who lack sufficient knowledge of both the field and programming. This paper presents an approach that gives non-expert users the ability to apply machine learning algorithms to their datasets through an application with a graphical interface. Creating ML solutions involves many challenges even when existing ML algorithms are reused: feature engineering, outlier detection, filling in missing values, hyperparameter optimization of ML methods, and so on. The main point of the research is to find a balance in solving these complex tasks and to provide a web-based user interface that enables inexperienced users to utilize the power of ML methods in an automatic or semi-automatic way. The practical outcome is the development of an information system that consists of three interrelated parts: a web application, an API and several microservices that implement ML algorithms from the Scikit-learn library.
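The kind of automation the abstract describes can be sketched with a standard Scikit-learn pipeline; the steps and parameter grid below are illustrative assumptions, not the system's actual microservice code.

```python
# Sketch of an automated preprocessing + tuning step: impute missing
# values, scale features, and tune a hyperparameter by grid search.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # labels from clean features
X[rng.random(X.shape) < 0.1] = np.nan     # then simulate missing values

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
```

Wrapping such a pipeline behind an API endpoint is one natural way to expose ML methods to non-programmers.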



The System.AI Project: Fully Managed Cross-Platform Machine Learning and Data Analysis Stack for .NET Ecosystem
Abstract
In recent years, machine learning technologies have become increasingly popular in widespread tasks such as image stylization, colorization of black-and-white images, image super-resolution, fake data detection, and voice and image recognition. This creates a need for a set of tools for integrating artificial intelligence systems into applications for mobile devices, smart home devices, and home PCs. The paper describes a solution that allows developers to integrate data analysis and machine learning systems directly into a user application, making it possible to produce a lightweight, portable, cross-platform monolithic application, which is often not possible with existing solutions. The main features of the proposed solution are its focus on the Microsoft .NET [1] ecosystem and its use of exclusively standard features of the BCL and the C# programming language. The implemented toolset is completely cross-platform and hardware-independent. The API is similar in many ways to its Python counterparts, which allows Python code to be migrated quickly into a .NET project.



On the Practical Generation of Counterfactual Examples
Abstract
One of the important elements in evaluating the stability of machine learning systems is the so-called adversarial example: specially selected or artificially created input data that interfere with the normal operation of such systems and are interpreted or processed incorrectly. Most often, such data are obtained through formal modifications of real source data. This article considers a different approach to creating such data, one that takes into account the semantic significance (meaning) of the modified data: counterfactual examples. The purpose of the work is to present practical solutions for generating counterfactual examples. The consideration is based on the real use of counterfactual examples in assessing the robustness of machine learning systems.
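A minimal sketch of the counterfactual idea, assuming a greedy single-feature search against a toy classifier (the paper's actual generation procedures are not reproduced here):

```python
# Illustrative greedy counterfactual search: nudge one feature of an
# input until the classifier's predicted label flips.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)            # class depends on feature 0
clf = LogisticRegression().fit(X, y)

def counterfactual(x, model, feature=0, step=0.05, max_steps=200):
    """Shift `feature` in small steps until the predicted class changes."""
    original = model.predict([x])[0]
    xc = np.array(x, dtype=float)
    for _ in range(max_steps):
        xc[feature] += step if original == 0 else -step
        if model.predict([xc])[0] != original:
            return xc                    # minimal found counterfactual
    return None

x = np.array([-0.5, 0.2])
xc = counterfactual(x, clf)
```

The point of a counterfactual, as opposed to an arbitrary adversarial perturbation, is that the changed feature is a meaningful quantity a user could actually act on.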



A Survey of Model Inversion Attacks and Countermeasures
Abstract
This article provides a detailed overview of so-called Model Inversion (MI) attacks. These attacks target Machine-Learning-as-a-Service (MLaaS) platforms: the goal is to use well-prepared adversarial samples to attack target models and extract sensitive information from them, such as items from the dataset on which the ML model was trained or the model's parameters. This kind of attack has become an enormous threat to ML models. It is therefore necessary to study the attack and understand how it affects ML models; based on this knowledge, strategies can be proposed that may improve the robustness of ML models.



Researching the Stability of Decision Trees Using Distances on Graphs
Abstract
The article deals with the problem of the stability of classifiers based on decision trees for the problem of text attribution. Such a task arises, for example, in studying the authorship of articles from the pre-revolutionary journals “Time” (1861–1863), “Epoch” (1864–1865) and the weekly “Citizen” (1873–1874). The texts were divided into separate parts of different sizes using the sliding window method, and then the frequency of n-grams (encoded sequences of parts of speech) in each fragment was determined. These indicators were then used to build various classifiers. The resulting decision trees were compared with each other using the tree edit distance. For this purpose, a procedure for processing, comparing and visualizing graphs was implemented in the SMALT software package. As a result of experiments with different weights for the editing operations, relationships were revealed between the parameters used to construct text fragments and the decision trees obtained from them.
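The fragment-construction step described above can be sketched as follows; the tags, window size and stride are illustrative and not taken from the SMALT package:

```python
# Sliding-window fragments over an encoded part-of-speech sequence,
# with relative n-gram frequencies computed per fragment.
from collections import Counter

def sliding_windows(tags, size, stride):
    """Split a tag sequence into overlapping fragments."""
    return [tags[i:i + size] for i in range(0, len(tags) - size + 1, stride)]

def ngram_profile(tags, n=2):
    """Relative frequency of POS n-grams in one fragment."""
    grams = [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

# Toy POS tags: N = noun, V = verb, A = adjective, P = preposition.
tags = ["N", "V", "N", "A", "N", "V", "P", "N", "V", "N"]
windows = sliding_windows(tags, size=6, stride=2)
profiles = [ngram_profile(w) for w in windows]
```

Each profile then serves as a feature vector for training one decision tree, and the trees are compared afterwards by edit distance.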



Economic Cycle Prediction using Machine Learning – Russia Case Study
Abstract
The long-term development of the world economy is characterized by cyclicality. To date, there is no single accepted approach to describing the nature of the economic cycle; therefore, studies of economic and political cycles remain one of the key areas of economic theory. Econometrics and machine learning share a common goal: to build a predictive model for a target variable using explanatory variables. This research aims to identify the economic cycle in the Russian Federation using collective factors. It takes an approach different from classical econometric techniques and shows how machine learning (ML) can improve the accuracy of forecasts. We used three machine learning algorithms: k-Nearest Neighbors (kNN), Random Forests (RF) and Support Vector Machines (SVM). The research is based on 30 economic factors for the period 1990-2020 from FRED, the World Bank, the WTO, the Federal State Statistics Service, the Bank of Russia and other sources. The results indicate that the Russian economy would be very active (at a peak) in the coming quarters. This result could offer a new approach to providing policy recommendations to authorities and financial institutions in particular.
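The model comparison can be sketched with Scikit-learn; the synthetic data below merely stands in for the 30 macroeconomic indicators, and the hyperparameters are illustrative rather than the study's tuned values.

```python
# Compare kNN, Random Forest and SVM classifiers by cross-validated
# accuracy on synthetic data shaped like a quarterly indicator panel.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# ~124 quarterly observations (1990-2020), 30 features.
X, y = make_classification(n_samples=124, n_features=30,
                           n_informative=10, random_state=0)

models = {
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(kernel="rbf"),
}
accuracy = {name: cross_val_score(m, X, y, cv=5).mean()
            for name, m in models.items()}
```

With so few observations, cross-validation rather than a single train/test split is essential for a fair comparison.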



Quantitative large-scale study of school students’ academic performance peculiarities during distance education caused by COVID-19
Abstract
The paper presents the results of a large-scale analysis of the impact of COVID-19-induced distance learning on school students' academic performance. This multidisciplinary study is based on a large amount of raw data containing school students' grades from the 2015 to 2021 academic years, taken from the “Electronic education in Tatarstan Republic” system. The analysis applies Big Data and mathematical statistics methods implemented in the Python programming language: the Dask framework for parallel cluster-based computation and the Pandas library for data manipulation and large-scale analysis. One of the main priorities of this paper is to identify the impact of different factors of the educational system on school students' academic performance. For that purpose, the quantile regression method was used. This method is widely applied to large-scale data from various experiments in modern data science. Quantile regression models are designed to estimate conditional quantile functions, which makes the method especially suitable for examining conditional effects at various locations of the outcome distribution, e.g., the lower and upper tails. The study-related conditional factors include students' marks from previous academic years, the types of lessons in which grades were obtained, and teacher parameters such as age, gender and qualification category.



Text Mining
Methods of extracting biomedical information from patents and scientific publications (on the example of chemical compounds)
Abstract
This article proposes an algorithm for solving the problem of extracting information from biomedical patents and scientific publications. The proposed algorithm is based on machine learning methods. Experiments were carried out on patents from the USPTO database and showed that the best extraction quality was achieved by a model based on BioBERT.



Sentence splitters benchmark
Abstract
There are multiple implementations of sentence splitters, i.e., tools that segment text into sentences, including open source libraries and tools. But the segmentation quality and performance of these tools differ greatly. Moreover, it is convenient for NLP developers to have all libraries written in the same programming language, unless some kind of integration language is used. This paper considers two aspects: on the one hand, building a uniform framework and evaluating the language features of the modern and popular Julia programming language; on the other, estimating the performance of sentence splitting libraries as such. The paper contains detailed performance results, samples of texts after splitting, and a list of some typical issues related to sentence splitting.
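A toy version of such a benchmark (in Python rather than the paper's Julia; the splitter and sample text are illustrative, not from the paper's suite) times a naive regex splitter and exposes a typical failure on abbreviations:

```python
# Time a naive regex sentence splitter and show a classic failure:
# the abbreviation "Dr." is wrongly treated as a sentence boundary.
import re
import time

def naive_split(text):
    """Split after ., ! or ? when followed by whitespace and a capital."""
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())

text = "Dr. Smith arrived. He was late! Was the meeting over? Yes."
start = time.perf_counter()
sentences = naive_split(text)
elapsed = time.perf_counter() - start
# "Dr." becomes its own "sentence", so 5 pieces instead of 4.
```

A real benchmark runs each library over a large annotated corpus and reports both wall-clock time and agreement with gold sentence boundaries.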



The Conceptual Modeling System Based on Metagraph Approach
Abstract
The article is devoted to an approach to building a conceptual modeling system that includes recognizing text into a conceptual structure and generating text from a conceptual structure. The metagraph is used as the conceptual structure. The architecture of the conceptual modeling system is proposed, and the metagraph model is considered as the data model for conceptual modeling. The main ideas behind the text parsing module and the text generation module are considered.
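A minimal sketch of a metagraph-style data structure, with illustrative names not taken from the paper: a metavertex may contain nested vertices, so a fragment of the graph can act as a single node.

```python
# Toy metagraph node: a metavertex whose contents may themselves be
# metavertices, allowing hierarchical conceptual structures.
from dataclasses import dataclass, field

@dataclass
class MetaVertex:
    name: str
    inner: list = field(default_factory=list)  # nested (meta)vertices

    def flatten(self):
        """Collect the ordinary (leaf) vertices inside this metavertex."""
        result = []
        for v in self.inner:
            result.extend(v.flatten() if v.inner else [v])
        return result or [self]

# A sentence modeled as a metavertex containing nested metavertices.
sentence = MetaVertex("sentence", [
    MetaVertex("subject", [MetaVertex("noun")]),
    MetaVertex("predicate", [MetaVertex("verb"), MetaVertex("object")]),
])
leaves = [v.name for v in sentence.flatten()]
```

The key property is that edges (omitted here for brevity) can connect a metavertex as a whole, not just its leaf vertices.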



Methods and Models in Natural Sciences
The Structure and Evolution of Open Star Clusters: Theory and Observations Based on Gaia Data
Abstract
The structure and evolution of open star clusters (OSCs) are considered using the Pleiades OSC and the OSC group in the Orion Sword region as examples. The stars were selected according to Gaia data. The relationship between the Orion Sword clusters and molecular clouds is traced using data from the Herschel spacecraft. The place of the considered objects in the general scheme of evolution that we compiled earlier is shown. It is concluded that there is an urgent need to expand the OSC classification. The Pleiades star system shows the presence of an extensive stellar halo. The Pisces-Eridanus stellar stream found in the vicinity of the Pleiades is probably genetically related to the Pleiades and, together with it, represents the remnants of a disintegrated OB association. In the Orion Sword region, the observed young OSCs are most likely associated with molecular clouds; young clusters associated with cold dust (15–35 K) and hot (10 000 K) gas stand out. Data on OSCs are being rapidly replenished, and the number of known OSCs is growing due to their detection in Gaia surveys. The analysis in this area can be iterated and extended over time with proven methodologies, in line with data management concepts in data-intensive areas.



Astronomical observation planner
Abstract
One of the tasks in robotizing astronomical observations is creating programs for the optimal distribution of observing time depending on the position of the Sun (efficient use of twilight time) and the position and phases of the Moon. An important requirement for such a program is autonomy: it must work independently of external Internet resources. To solve this problem, an autonomous astronomical calendar was developed that estimates the times of sunrise and sunset, moonrise and moonset (as well as lunar phases), and the beginning and end of twilight. This subroutine is the first step in automating the planning of astronomical observations. The next important step is to develop software that can plan observations in an optimal way. The targets for observations are discussed, and the initial parameters needed for these purposes are indicated, which make it possible to form a schedule of observations on telescopes in automatic mode.
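The kind of offline computation such a calendar needs can be sketched with the textbook hour-angle formula; this is a simplification that ignores refraction and the apparent size of the solar disk, not the planner's actual code.

```python
# Approximate daylight duration from latitude and solar declination
# using the sunset hour-angle relation cos(h) = -tan(lat) * tan(dec).
import math

def day_length_hours(latitude_deg, declination_deg):
    """Daylight hours; handles polar day and polar night edge cases."""
    lat = math.radians(latitude_deg)
    dec = math.radians(declination_deg)
    cos_h = -math.tan(lat) * math.tan(dec)
    if cos_h >= 1.0:
        return 0.0                        # Sun never rises (polar night)
    if cos_h <= -1.0:
        return 24.0                       # Sun never sets (polar day)
    h = math.degrees(math.acos(cos_h))    # hour angle at sunset, degrees
    return 2.0 * h / 15.0                 # 15 degrees of hour angle per hour

equinox = day_length_hours(56.0, 0.0)     # ~12 h at any latitude at equinox
```

Because it depends only on geometry, such a routine satisfies the autonomy requirement: no Internet resources are needed.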



A logical model for integration of heterogeneous experimental data in soil research
Abstract
An undoubted challenge for science is the extraction of knowledge from fast-growing heterogeneous datasets. In particular, the details of experimental setups are insufficiently formalized and cannot easily be inserted into databases, so there is a problem of using these details in data integration and meta-analyses. For this purpose, we developed a formalization scheme for object descriptions covering their origin and the protocols for field and laboratory measurements (including instruments and experimental conditions). It allows larger amounts of data to be integrated while accounting for the specifics of their acquisition, for example by applying adjustments, assigning weights to data sources (based on reliability, method precision and experimental uncertainty) or directly accounting for experimental conditions in models. This formalization is currently used to develop an electronic laboratory journal for soil research, intended for the detailed description of a conducted or planned experiment. The study aims to increase the reproducibility of scientific research results, allow automatic data processing and error detection and, most importantly, enable effective soil data mining for decision support systems.



Conditions for the effective application of artificial intelligence technologies in the agro-industrial complex of the EAEU
Abstract
Solutions for reducing environmental hazards in EAEU agriculture through the formation of a common agro-industrial space are considered, and a mechanism for forming such a space is proposed. This mechanism allows the emerging geopolitical, economic, social and environmental problems to be resolved. It is a single digital management platform that can be built in the cloud on the basis of mathematical and ontological modeling and common digital standards (a subplatform for collecting, storing and integrating the operational primary accounting information of all participants in a single database; a subplatform for technological accounting; and a subplatform of algorithms that process the data of the first two subplatforms for production management). The use of artificial intelligence technologies will bring the greatest effect and will ensure maximum cross-industry traceability of products, while minimizing the negative impact of natural and anthropogenic environmental hazards on the environment, on the products of the agro-industrial complex and on people themselves.


