Thanh Tram Tran1, Dinh Bay Vo2*
1University of Information Technology, Vietnam National University Ho Chi Minh City
2Faculty of Information Technology, Ho Chi Minh City University of Technology
Abstract:
Text clustering is an important topic in text mining. One of the most effective methods for text clustering is the frequent-itemset-based approach. Many frequent-itemset-based algorithms have been proposed to improve the accuracy of text clustering. However, these algorithms do not take the weight of terms in documents into account. In fact, the frequency of each term in each document has a great impact on the results. In this paper, we propose a new method for text clustering based on frequent weighted utility itemsets. First, we calculate the TF (Term Frequency) of each term in each document to create a numeric matrix for the documents. The weights of the terms in the documents are based on the IDF (Inverse Document Frequency). Next, we use the MWIT-FWUI algorithm to mine frequent weighted utility itemsets from the numeric matrix and the term weights. Finally, based on the frequent weighted utility itemsets, we cluster the documents using the MC (Maximum Capturing) algorithm. We have tested this method on a corpus of 1,600 documents covering 16 different topics. Experimental results show that our method significantly improves the accuracy of text clustering compared with the clustering method that uses frequent itemsets.
Keywords:
frequent itemsets, frequent weighted utility itemsets, quantitative databases, text clustering, weight of terms.
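
To make the first two steps of the pipeline described in the abstract concrete, the following is a minimal sketch of building a TF matrix and IDF-based term weights. It is an illustration under stated assumptions, not the implementation used in the paper: the abstract does not give the exact formulas, so TF is assumed to be the raw term count and IDF is assumed to be log(N / df). The function names `build_tf_matrix` and `idf_weights`, the whitespace tokenizer, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def build_tf_matrix(docs):
    """Return (vocab, matrix), where matrix[i][j] is the raw count of
    term vocab[j] in document i. Raw counts are an assumption; the
    paper may normalize TF differently."""
    # Assumed tokenizer: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


def idf_weights(vocab, matrix):
    """Return a dict mapping each term to log(N / df), where N is the
    number of documents and df is the number of documents containing
    the term. This IDF variant is an assumption."""
    n_docs = len(matrix)
    weights = {}
    for j, term in enumerate(vocab):
        df = sum(1 for row in matrix if row[j] > 0)
        weights[term] = math.log(n_docs / df)
    return weights


# Toy corpus (hypothetical), only to show the shapes of the outputs.
docs = ["data mining text", "text clustering text", "data clustering"]
vocab, tf = build_tf_matrix(docs)
weights = idf_weights(vocab, tf)
```

In this sketch, `tf` plays the role of the numeric (quantitative) matrix and `weights` plays the role of the term weights that a weighted utility itemset miner such as MWIT-FWUI would take as input; the mining and Maximum Capturing steps themselves are not shown.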