Automated Classification of Web Documents into a Hierarchy of Categories

Ceci, Michelangelo; Esposito, Floriana; Malerba, Donato
2003

Abstract

In this paper, the problem of classifying HTML documents into a hierarchy of categories is investigated in the context of a cooperative information repository named WebClassII. The hierarchy of categories is involved in all aspects of automated document classification, namely feature extraction, learning, and classification of a new document. Innovative aspects of this work are: a) an experimental study on actual Web documents, which can be associated with any node in the hierarchy; b) the feature selection process; c) the automated selection of thresholds for the score returned by a classifier; d) the comparison of three different techniques (flat, hierarchical with proper training sets, hierarchical with hierarchical training sets); e) the definition of new measures for the evaluation of system performance. Results show that the use of hierarchical training sets improves the hierarchical techniques.
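The abstract does not spell out the classification procedure, so the following is only a minimal sketch of one common way to classify a document top-down through a category hierarchy with per-node score thresholds, so that a document may stop at an internal node. Everything in it is an assumption made for illustration: the `Category` class, the keyword-overlap score, the hand-set thresholds, and the example tree are hypothetical and are not taken from WebClassII.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """One node of a hypothetical category hierarchy."""
    name: str
    keywords: frozenset = frozenset()  # toy per-node feature set (assumption)
    threshold: float = 0.0             # minimum score needed to descend into this node
    children: list = field(default_factory=list)

    def score(self, tokens):
        """Toy score: fraction of this node's keywords found in the document."""
        if not self.keywords:
            return 0.0
        return len(self.keywords & tokens) / len(self.keywords)


def classify(root, text):
    """Walk down from the root; stop when no child reaches its threshold.

    Returns the path of category names from the root to the node where
    the document stops, which may be an internal node.
    """
    tokens = set(text.lower().split())
    node = root
    path = [root.name]
    while node.children:
        # Pick the best-scoring child at this level.
        best = max(node.children, key=lambda c: c.score(tokens))
        if best.score(tokens) < best.threshold:
            break  # no child is confident enough: stay at the current node
        node = best
        path.append(node.name)
    return path


# Hypothetical example hierarchy with hand-set thresholds.
physics = Category("Physics", frozenset({"quantum", "particle", "physics"}), 0.5)
biology = Category("Biology", frozenset({"cell", "gene", "biology"}), 0.5)
science = Category("Science", frozenset({"physics", "biology", "research"}), 0.3,
                   [physics, biology])
sports = Category("Sports", frozenset({"football", "match", "team"}), 0.3)
root = Category("Root", children=[science, sports])

print(classify(root, "quantum particle physics research"))  # ['Root', 'Science', 'Physics']
print(classify(root, "new research funding announced"))     # ['Root', 'Science']
print(classify(root, "weather today"))                      # ['Root']
```

In this sketch the thresholds are fixed by hand. The paper instead reports selecting them automatically, and a flat classifier would score all categories at once rather than descending level by level.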
ISBN: 3-540-00843-8

Use this identifier to cite or link to this document: https://hdl.handle.net/11586/136697