Mining HTML pages to support document sharing in a cooperative system

Malerba, Donato; Esposito, Floriana; Ceci, Michelangelo

doi:10.1007/3-540-36128-6_25

In this paper, the problem of classifying HTML documents is investigated in the context of a client-server application, named WebClass, developed to support the search activity of a geographically distributed group of people with common interests. The two main issues studied are the selection of features to represent HTML documents and the construction of the classifiers. A new feature selection technique is presented, and its interaction with different classifiers is studied experimentally. Results show that performance improves even with simple classifiers, and that the proposed feature selection technique compares favorably with other well-known approaches.