Abstract

TF-IDF vector space representations suffer from high dimensionality, sparse information, and no account of the semantic relations between words. To address these problems, this paper proposes a method that uses semantic word sets as features to reduce dimensionality and increase information density. Latent semantic analysis is used to obtain the semantic relations between words, and a semantic dictionary is built with ESD. The word sets in this dictionary then serve as features for representing texts, and combining this representation with a clustering algorithm yields TCSD, which is used to cluster the corpus. Experimental results show a precision of 94.29% and a recall of 94.28%, indicating that TCSD outperforms algorithms that use individual words as features.
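The idea of replacing word features with word-set features can be illustrated with a minimal sketch. It is not the paper's ESD or TCSD procedure, whose details the abstract does not give. It assumes NumPy is available, uses raw term counts instead of TF-IDF weights, and substitutes a simple greedy cosine-similarity grouping for the dictionary-building step. All function names, the toy corpus, the rank `k`, and the threshold are hypothetical.

```python
import numpy as np

def term_doc_matrix(docs):
    """Build a term-by-document count matrix from whitespace-tokenized texts."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            X[index[w], j] += 1.0
    return vocab, X

def lsa_word_vectors(X, k):
    """Latent semantic analysis: truncated SVD of X, one unit-length row per word."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    vecs = U[:, :k] * s[:k]
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for words with no projection
    return vecs / norms

def build_word_sets(vocab, vecs, threshold):
    """Assumed stand-in for the dictionary step: a word joins the first set
    whose seed word has cosine similarity >= threshold, else it starts a new set."""
    seeds, sets = [], []
    for i, w in enumerate(vocab):
        for s_idx, seed in enumerate(seeds):
            if float(vecs[i] @ vecs[seed]) >= threshold:
                sets[s_idx].append(w)
                break
        else:
            seeds.append(i)
            sets.append([w])
    return sets

def word_set_features(docs, sets):
    """Represent each document by how often each word set occurs in it."""
    lookup = {w: k for k, group in enumerate(sets) for w in group}
    F = np.zeros((len(docs), len(sets)))
    for j, d in enumerate(docs):
        for w in d.split():
            if w in lookup:
                F[j, lookup[w]] += 1.0
    return F

# Toy corpus (hypothetical data, for illustration only).
docs = [
    "car engine wheel road",
    "car road driver wheel",
    "bank loan money interest",
    "money bank interest credit",
]
vocab, X = term_doc_matrix(docs)
vecs = lsa_word_vectors(X, k=2)
sets = build_word_sets(vocab, vecs, threshold=0.9)
F = word_set_features(docs, sets)
```

The matrix `F` has one column per word set rather than one per word, so it is never wider than the original term matrix. Any standard clustering algorithm could then be applied to its rows.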

Full Text