Information-Theoretic Method for Classification of Texts


Abstract

We consider a method for automatic text classification (i.e., classification without human involvement) based on universal source coding (or “data compression”). We show that, under certain restrictions, the proposed method is consistent, i.e., the classification error tends to zero as the text length increases. As a practical example, we consider the classification of scientific texts (research papers, books, etc.). Experiments show the proposed method to be highly efficient.
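
The abstract does not spell out the procedure, so the following is only a minimal sketch of one common compression-based classification scheme, not necessarily the authors' method. It assumes each class is represented by a reference corpus and assigns a text to the class whose corpus lets the text be compressed most cheaply. The general-purpose zlib compressor stands in for a universal code, and the names compressed_size, extra_cost, and classify are hypothetical.

```python
import zlib
from typing import Dict


def compressed_size(data: bytes) -> int:
    """Number of bytes zlib needs for `data` at the highest compression level."""
    return len(zlib.compress(data, 9))


def extra_cost(text: str, corpus: str) -> int:
    """Extra compressed bytes needed to encode `text` after `corpus`.

    Assumption: this difference approximates the code length of the text
    given the class corpus; a real universal code would replace zlib.
    """
    base = corpus.encode("utf-8")
    return compressed_size(base + text.encode("utf-8")) - compressed_size(base)


def classify(text: str, corpora: Dict[str, str]) -> str:
    """Return the label whose reference corpus gives the smallest extra cost."""
    if not corpora:
        raise ValueError("at least one class corpus is required")
    return min(corpora, key=lambda label: extra_cost(text, corpora[label]))


if __name__ == "__main__":
    # Toy corpora, invented purely for illustration.
    corpora = {
        "math": "theorem proof lemma integral matrix eigenvalue " * 20,
        "bio": "cell protein genome enzyme membrane organism " * 20,
    }
    print(classify("proof lemma integral matrix eigenvalue theorem", corpora))
```

Note that zlib only looks back over a 32 KB window, so with long reference corpora only the tail of each corpus would influence the result; a compressor with a larger context would be needed in practice.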

About the authors

B. Ya. Ryabko

Institute of Computational Technologies; Novosibirsk State University

Author for correspondence.
Email: boris@ryabko.net
Russian Federation, Novosibirsk; Novosibirsk

A. E. Gus’kov

Institute of Computational Technologies; Russian National Public Library for Science and Technology

Email: boris@ryabko.net
Russian Federation, Novosibirsk; Novosibirsk

I. V. Selivanova

Novosibirsk State University; Russian National Public Library for Science and Technology

Email: boris@ryabko.net
Russian Federation, Novosibirsk; Novosibirsk

Copyright (c) 2017 Pleiades Publishing, Inc.