Research the Stability of Decision Trees Using Distances on Graphs
- Authors: Moskin N.D.1, Kulakov K.A.1, Rogov A.A.1, Abramov R.V.2
-
Affiliations:
- Petrozavodsk State University
- ITMO University
- Issue: Vol 73, No 1 (2023)
- Pages: 94-100
- Section: Data Mining
- URL: https://journal-vniispk.ru/2079-0279/article/view/286875
- DOI: https://doi.org/10.14357/20790279230111
- ID: 286875
Cite item
Full Text
Abstract
The article deals with the problem of stability of classifiers based on decision trees for the problem of text attribution. Such a task arises, for example, in the study of the authorship of articles from the pre-revolutionary journals “Time” (1861–1863), “Epoch” (1864–1865) and the weekly “Citizen” (1873–1874). The texts were divided into separate parts of different sizes using the sliding window method, then the frequency of n-grams (encoded sequences of parts of speech) in each fragment was determined. Further, these indicators were used to build various classifiers. The resulting decision trees were compared with each other using the tree edit distance. For this purpose, a procedure for processing, comparing and visualizing graphs was implemented in the SMALT software package. As a result of experiments using different weights for editing operations, patterns were revealed between the parameters for constructing text fragments and the decision trees obtained on their basis.
About the authors
N. D. Moskin
Petrozavodsk State University
Author for correspondence.
Email: moskin@petrsu.ru
PhD in Technics, Associate Professor
Russian Federation, 33 Lenin str., Petrozavodsk, 185910K. A. Kulakov
Petrozavodsk State University
Email: kulakov@cs.karelia.ru
PhD in Physics and Mathematics, Associate Professor
Russian Federation, 33 Lenin str., Petrozavodsk, 185910A. A. Rogov
Petrozavodsk State University
Email: rogov@petrsu.ru
Russian Federation, 33 Lenin str., Petrozavodsk, 185910
R. V. Abramov
ITMO University
Email: monset008@gmail.com
Russian Federation, 49 Kronverksky Pr., bldg. A, Saint Petersburg, 197101
References
- Abramov R. V., Kulakov K. A., Lebedev A. A., Moskin N. D., Rogov A. A. 2021. Research of features of Dostoevsky’s publicistic style by using n-grams based on the materials of the “Time” and “Epoch” magazines. Vestnik of Saint Petersburg University. Applied Mathematics. Computer Science. Control Processes. Saint Petersburg, 17(4): 389–396.
- Safavian S. R., Landgrebe D. 1991. A survey of decision tree classifier methodology. IEEE transactions on systems, man, and cybernetics, 21(3): 660–674.
- Pedregosa F., et al. 2011. Scikit-learn: Machine learning in python. The Journal of Machine Learn- ing Research, 12: 2825–2830.
- Lewis R. J. 2000. An introduction to classification and regression tree (CART) analysis. Annual meet- ing of the society for academic emergency medi- cine in San Francisco, California. Citeseer 14.
- Coppersmith D., Hong S. J., Hosking J. R. M. 1999. Partitioning nominal attributes in decision trees. Data Mining and Knowledge Discovery, 3(2): 197–217.
- Conte D., Foggia P., Sansone C., Vento M. 2004. Thirty years of graph matching in pattern recogni- tion. International Journal of Pattern Recognition and Artificial Intelligence, 18(3): 265–298.
- Jiang X., Bunke H. 2008. Graph matching. Case- Based Reasoning on Images and Signals. Vol. 73 of Studies in Computational Intelligence. Springer, 149–173.
- Riesen K. 2015. Structural Pattern Recognition with Graph Edit Distance: Approximation Algorithms and Applications. Advances in Computer Vision and Pattern Recognition. Springer, Heidelberg. 158 p.
- Bille P. 2003. Tree edit distance, alignment distance and inclusion. IT University of Copenhagen. Tech- nical Report Series, TR-2003-23.
- Jiang T., Wang L., Zhang K. 1995. Alignment of trees – an alternative to tree edit. Theoretical Computer Science, 143(1): 137–148.
- Torsello A., Hidovic D., Pelillo M. 2004. Four metrics for efficiently comparing attributed trees. Proc. ICPR’04 – 17th International Conference on Pattern Recognition, 2: 467–470.
- Torsello A., Hidovic D., Pelillo M. 2005. Polyno- mial time metrics for attributed trees. IEEE Trans- actions on Pattern Analysis and Machine Intelli- gence, 27(7): 1087–1099.
- Kuznetsov A. V. 2017. Mera neskhodstva na mnozhestve grafov i ee prilozheniya [A measure of dissimilarity on a set of graphs and its applica- tions]. Vestnik Voronezhskogo gosudarstvennogo universiteta. Seriya: sistemnyj analiz i informa- cionnye tekhnologii [Bulletin of the Voronezh State University. Series: system analysis and in- formation technology], 1: 125–131.
- Isert C. 1999. The editing distance between trees. Ferienakademie Bäume: Algorithmik und Kombi- natorik. Sarntal, Italy.
- Rogov A. A., Abramov R. V., Buchneva D. D., Zakharova O. V., Kulakov K. A., Lebedev A. A., Moskin N. D., Otlivanchik A. V., Savinov E. D., Sidorov Y. V. 2021. Problema atribucii v zhurnalah “Vremya”, “Epoha” i ezhenedel’nike “Grazhd- anin” [The problem of attribution in the magazines “Time”, “Epoch” and the weekly “Citizen”]. Petrozavodsk: Islands, 391 p.
- Wu L., Chen Y., Shen K., Guo X., Gao H., Li S., Pei J., Long B. 2021. Graph Neural Networks for Natural Language Processing: A Survey. ArXiv abs/2106.06090.
- Hlaoui A., Wang S. 2003. A New Median Graph Algorithm. Graph Based Representations in Pat- tern Recognition (GbRPR 2003). Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, 2726: 225–234.
Supplementary files
