Unsupervised Discretization Using Kernel Density Estimation

ESPOSITO, Floriana; BASILE, Teresa Maria
2007

Abstract

Discretization, defined as a set of cuts over the domains of attributes, is an important pre-processing task for numeric data analysis. Some Machine Learning algorithms require a discrete feature space, but real-world applications often involve continuous attributes that must be handled. To deal with this problem, many supervised discretization methods have been proposed, but little has been done to develop unsupervised discretization methods for domains where no class information is available. Furthermore, existing methods such as equal-width or equal-frequency binning are not well-principled, which raises the need for more sophisticated methods for the unsupervised discretization of continuous features. This paper presents a novel unsupervised discretization method that uses non-parametric density estimators to automatically adapt sub-interval dimensions to the data. The proposed algorithm searches for the next two sub-intervals to produce, evaluating the best cut-point on the basis of the density induced in the sub-intervals by the current cut and the density given by a kernel density estimator for each sub-interval. It uses cross-validated log-likelihood to select the maximal number of intervals. The proposed method is compared with equal-width and equal-frequency discretization methods through experiments on well-known benchmark data.
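
The cut-point evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the exact criterion, so the score used here (the squared gap between the constant density a bin induces and a Gaussian kernel density estimate, summed over the data points) is an assumption, as are the function names `kde`, `bin_density`, `cut_mismatch`, and `best_cut`, the fixed bandwidth, and the choice of midpoints between distinct values as candidate cuts. The cross-validated log-likelihood step that limits the number of intervals is not shown.

```python
import math


def kde(sample, bandwidth):
    """Return a function x -> Gaussian kernel density estimate from `sample`."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample
        )

    return density


def bin_density(values, left, right):
    """Constant density induced on [left, right) by the points falling in it."""
    inside = sum(1 for v in values if left <= v < right)
    return inside / (len(values) * (right - left))


def cut_mismatch(values, lo, hi, cut, density):
    """Assumed score: squared gap between bin density and KDE at each point."""
    total = 0.0
    for left, right in ((lo, cut), (cut, hi)):
        flat = bin_density(values, left, right)
        total += sum(
            (flat - density(v)) ** 2 for v in values if left <= v < right
        )
    return total


def best_cut(values, lo, hi, bandwidth):
    """Pick the candidate cut in [lo, hi) with the smallest mismatch, or None."""
    xs = sorted(set(v for v in values if lo <= v < hi))
    # Candidate cuts: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
    if not candidates:
        return None
    density = kde(values, bandwidth)
    return min(
        candidates, key=lambda c: cut_mismatch(values, lo, hi, c, density)
    )


if __name__ == "__main__":
    # Toy data with two well-separated clusters (illustrative only).
    data = [1.0, 1.1, 1.2, 1.3, 5.0, 5.1, 5.2, 5.3]
    print(best_cut(data, 1.0, 5.4, bandwidth=0.3))
```

On this toy data the sketch selects the midpoint of the gap between the two clusters. Applying `best_cut` recursively to each resulting sub-interval would give a top-down discretizer in the spirit of the description, with the stopping rule left to a model-selection step such as the cross-validated log-likelihood mentioned in the abstract.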
ISBN: 978-1-57735-516-8

Use this identifier to cite or link to this document: https://hdl.handle.net/11586/70657