Volume 25 / Issue 4

DOI:   10.3217/jucs-025-04-0334


Data-driven Feature Selection Methods for Text Classification: an Empirical Evaluation

Rogerio C. P. Fragoso (Universidade Federal de Pernambuco, Brazil)

Roberto H. W. Pinheiro (Universidade Federal do Cariri, Brazil)

George D. C. Cavalcanti (Universidade Federal de Pernambuco, Brazil)

Abstract: Dimensionality reduction is a crucial task in text classification. The most widely adopted strategy is feature selection using filter methods, but with this approach it is difficult to determine the best size for the final feature vector. At Least One FeaTure (ALOFT), Maximum f Features per Document (MFD), Maximum f Features per Document-Reduced (MFDR) and Class-dependent Maximum f Features per Document-Reduced (cMFDR) are feature selection methods that automatically define the number of features per corpus. However, MFD, MFDR and cMFDR require a parameter that defines the number of features to be selected per document. Automatic Feature Subsets Analyzer (AFSA) is an auxiliary method that automates this configuration. In this paper, we evaluate the dimensionality reduction, classification performance and execution time of this family of methods: ALOFT, MFD, MFDR, cMFDR and AFSA. The experiments are conducted using three feature evaluation functions and twenty databases. MFD obtained the best results among the feature selection methods. In addition, the experiments showed that using AFSA does not significantly affect the classification performance or the dimensionality reduction rates of the feature selection methods, but considerably reduces their execution times.
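
The abstract does not spell out the selection procedures, so the following is only a minimal sketch of the general idea of per-document selection. It assumes that each document contributes its f highest-scoring terms, as ranked by some feature evaluation function, and that the final feature set is the union of these contributions, so the size of the final vector is not fixed in advance. With f = 1, every document that contains at least one scored term contributes exactly one feature. The function name, the toy documents and the scores are all hypothetical, and the sketch does not attempt to reproduce MFDR, cMFDR or AFSA.

```python
from typing import Dict, List, Set


def select_features_per_document(
    documents: List[Set[str]],
    scores: Dict[str, float],
    f: int = 1,
) -> Set[str]:
    """Return the union of the f highest-scoring terms found in each document.

    `scores` stands in for the output of a feature evaluation function
    (hypothetical values here); terms without a score are ignored.
    """
    selected: Set[str] = set()
    for doc in documents:
        # Rank only the terms that occur in this document and have a score.
        # Ties are broken alphabetically so the result is deterministic.
        ranked = sorted(
            (term for term in doc if term in scores),
            key=lambda term: (-scores[term], term),
        )
        selected.update(ranked[:f])
    return selected


# Toy corpus: each document is represented as a set of terms.
docs = [
    {"goal", "match", "the"},
    {"vote", "the", "senate"},
    {"match", "vote"},
]
# Hypothetical feature evaluation scores (higher means more discriminative).
fef = {"goal": 0.9, "match": 0.7, "vote": 0.8, "senate": 0.6, "the": 0.1}

one_per_doc = select_features_per_document(docs, fef, f=1)
two_per_doc = select_features_per_document(docs, fef, f=2)
```

In this toy setting the number of selected features grows with f but is never set directly, which mirrors the abstract's point that only the per-document parameter has to be configured.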

Keywords: feature selection, text classification

Categories: I.5.2, I.5.4