International Conference «Mathematical and Information Technologies, MIT-2016»

28 August – 5 September 2016

Vrnjacka Banja, Serbia – Budva, Montenegro

Guskov A., Ryabko B., Selivanova I.

Information-Theoretic Approach to Classification of Scientific Documents

Reporter: Ryabko B.

The classification of scientific documents is an increasingly important problem, because the flow of scientific documents is growing almost exponentially. The development of methods for the automatic classification of scientific documents attracts the attention of many researchers around the world, see [1-6]. One of the most difficult tasks is automating thematic classification, whose result is the assignment of a document to one or more classes (e.g. mathematics, physics, chemistry). In spite of many efforts, an efficient automatic method for the thematic classification of scientific documents does not yet exist.
In this report we propose using data compression methods to automatically determine the thematic affiliation of scientific texts. The main idea of the suggested method is quite natural: scientific texts (articles, books, etc.) use similar terminology if they belong to the same area. A data compressor, in turn, exploits the frequencies of occurrence of words in the text and compresses the data better the more words are repeated. Based on this observation, we suggest the following classification scheme: for each scientific area we form a set of papers that represents the area. A new text is then compressed together with each of the sets representing the thematic areas, and it is assigned to the area for which the compressed size is smallest.
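The scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the compressor, so zlib from the Python standard library is used here purely as an assumption, and the function names, the toy corpora, and the sample text are all hypothetical. The sketch also ranks areas by the increase in compressed size when the new text is appended, rather than by the total compressed size, so that reference sets of different lengths remain comparable; this normalisation is likewise an assumption.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (compressor choice is an assumption)."""
    return len(zlib.compress(data, 9))


def added_size(text: str, reference: str) -> int:
    """Extra compressed bytes needed when `text` is appended to `reference`."""
    ref = reference.encode("utf-8")
    doc = text.encode("utf-8")
    return compressed_size(ref + b"\n" + doc) - compressed_size(ref)


def classify(text: str, corpora: dict[str, str]) -> str:
    """Return the area whose reference set makes `text` cheapest to compress."""
    return min(corpora, key=lambda area: added_size(text, corpora[area]))


# Hypothetical toy reference sets; the experiment in the abstract used
# 100 full documents per research field.
corpora = {
    "cryptography": (
        "public key encryption scheme with a secret key and a digital signature; "
        "the block cipher resists a chosen ciphertext attack by the adversary"
    ),
    "information theory": (
        "the entropy of a source bounds the code length; channel capacity and "
        "mutual information determine the rate of a noiseless channel"
    ),
}

sample = ("a digital signature and a secret key protect the block cipher "
          "against a chosen ciphertext attack")
print(classify(sample, corpora))
```

Note that zlib's DEFLATE format only looks back 32 KB, so for reference sets the size of real paper collections a compressor with a much longer context would be needed; the sketch only shows the decision rule.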
To assess the possible practical applications of this method, we conducted an experiment. We used data provided on the arXiv website to select subject domains and to form the sets of texts describing them. The arXiv contains more than a million articles pertaining to various areas of science. When placing an article on the site, the author assigns the work to one or more scientific sections. We call the first section indicated by the author "the main category" and the others "secondary".
For our experiment we chose thirty research fields represented in the arXiv (for example, information theory, logic in computer science, artificial intelligence, and cryptography and security). For each field we formed a set of 100 documents belonging to that field. We then randomly chose 20 test files from every category, none of which belonged to these sets, and applied the described method for automatic classification. The total number of errors turned out to be 21 out of 600 (3.5%). In this report we describe the experimental results in detail and show that the suggested method is quite efficient.
[1] R. Baghel, R. Dhir, "A Frequent Concepts Based Document Clustering Algorithm," IJCA 4, pp. 6-12 (2010).
[2] S. E. Schaeffer, "Graph Clustering," Computer Science Review 1, pp. 27-64 (2007).
[3] Z. Wang, Y. He, M. Jiang, "A Comparison Among Three Neural Networks for Text Classification," Proc. of the IEEE 8th International Conference on Signal Processing 3 (2006).
[4] V. Bobicev, "Text Classification Using Word-Based PPM Models," CSJM 14, pp. 183-201 (2006).
[5] S. Kim, K. Han, H. Rim, S. H. Myaeng, "Some Effective Techniques for Naïve Bayes Text Classification," IEEE Transactions on Knowledge and Data Engineering 18, pp. 1457-1466 (2006).
[6] M. Zhang, D. Zhang, "Trained SVMs Based Rules Extraction Method for Text Classification," Proc. of the IEEE International Symposium on IT in Medicine and Education, pp. 16-19 (2008).

