Data-driven Feature Selection Methods for Text Classification: an Empirical Evaluation
Rogerio C. P. Fragoso (Universidade Federal de Pernambuco, Brazil)
Roberto H. W. Pinheiro (Universidade Federal do Cariri, Brazil)
George D. C. Cavalcanti (Universidade Federal de Pernambuco, Brazil)
Abstract: Dimensionality reduction is a crucial task in text classification. The most adopted strategy is feature selection using filter methods. This approach presents a difficulty in determining the best size for the final feature vector. At Least One FeaTure (ALOFT), Maximum f Features per Document (MFD), Maximum f Features per Document-Reduced (MFDR) and Class-dependent Maximum f Features per Document-Reduced (cMFDR) are feature selection methods that define automatically the number of features per Corpus. However, MFD, MFDR, and cMFDR require a parameter that defines the number of features to be selected per document. Automatic Feature Subsets Analyzer (AFSA) is an auxiliary method that automates such configuration. In this paper, we evaluate dimensionality reduction, classification performance and execution time of this family of methods: ALOFT, MFD, MFDR, cMFDR and AFSA. The experiments are conducted using three feature evaluation functions and twenty databases. MFD obtained the best results among the feature selection methods. In addition, the experiments showed that the use of AFSA does not significantly affect the classification performances or the dimensionality reduction rates of the feature selection methods, but considerably reduces their execution times.
Keywords: feature selection, text classification
Categories: I.5.2, I.5.4