Self-training for multi-target regression with tree ensembles

Levatić, Jurica; Ceci, Michelangelo; Kocev, Dragi; Džeroski, Sašo

2017-01-01

Abstract

Semi-supervised learning (SSL) aims to use unlabeled data as an additional source of information in order to improve upon the performance of supervised learning methods. Labeled data are often scarce because annotation is expensive and/or tedious, while unlabeled data are often readily available in large amounts. This is particularly true for predictive modelling problems with a structured output space. In this study, we address the task of SSL for multi-target regression (MTR), where the output space consists of multiple numerical values. We extend the self-training approach to perform SSL for MTR by using a random forest of predictive clustering trees. In self-training, a model iteratively uses its own most reliable predictions; hence, a good measure of the reliability of predictions is essential. Given that reliability estimates for MTR predictions have not yet been studied, we propose four such estimates, based on mechanisms provided within ensemble learning. In addition to these four scores, we use two benchmark scores (oracle and random) to empirically determine the performance limits of self-training. We also propose an approach to automatically select a threshold for identifying the most reliable predictions to be used in the next iteration. An empirical evaluation on a large collection of datasets for MTR shows that self-training with any of the proposed reliability scores consistently improves over supervised random forests and multi-output support vector regression. This is also true when the reliability threshold is selected automatically.
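The self-training loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: scikit-learn's `RandomForestRegressor` stands in for a random forest of predictive clustering trees, the reliability score (agreement among the trees, computed from the variance of per-tree predictions) is only one plausible ensemble-based estimate and may differ from the four scores proposed in the paper, and the fixed `threshold` replaces the automatic threshold selection. The names `tree_variance_reliability` and `self_train`, and all parameter values, are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def tree_variance_reliability(forest, X):
    """Hypothetical reliability score: higher when the trees agree more."""
    X = np.asarray(X)
    per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
    # Shape (n_trees, n_samples, n_targets), also when there is one target.
    per_tree = per_tree.reshape(len(forest.estimators_), X.shape[0], -1)
    # Variance across trees, averaged over the targets.
    variance = per_tree.var(axis=0).mean(axis=1)
    return 1.0 / (1.0 + variance)  # maps [0, inf) onto (0, 1]


def self_train(X_lab, Y_lab, X_unlab, threshold=0.9, max_iter=5, seed=0):
    """Self-training sketch; assumes Y_lab has shape (n, n_targets >= 2)."""
    X_lab, Y_lab, X_unlab = map(np.asarray, (X_lab, Y_lab, X_unlab))
    forest = RandomForestRegressor(n_estimators=50, random_state=seed)
    forest.fit(X_lab, Y_lab)
    for _ in range(max_iter):
        if len(X_unlab) == 0:
            break
        scores = tree_variance_reliability(forest, X_unlab)
        reliable = scores >= threshold  # assumed fixed threshold
        if not reliable.any():
            break
        # Move the most reliable predictions into the training set.
        pseudo_labels = forest.predict(X_unlab[reliable])
        X_lab = np.vstack([X_lab, X_unlab[reliable]])
        Y_lab = np.vstack([Y_lab, pseudo_labels])
        X_unlab = X_unlab[~reliable]
        forest.fit(X_lab, Y_lab)  # retrain on labeled + pseudo-labeled data
    return forest, len(X_lab)


# Synthetic two-target data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 2] - X[:, 3]])
forest, n_train = self_train(X[:30], Y[:30], X[30:], threshold=0.8)
print(n_train, forest.predict(X[:2]).shape)
```

Mapping the variance to a score in (0, 1] is a design choice made here so that a single threshold applies across iterations. Replacing `scores` with values derived from the true labels, or with random values, would give analogues of the oracle and random benchmark scores mentioned in the abstract.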

Year

2017

Appears in types:

1.1 Journal article

Files in this item:

File	Availability	Type	License	Size	Format
Self-training.pdf	Not available	Publisher's version	Not public (private/restricted access)	1.74 MB	Adobe PDF
Self_training_for_MTR_with_tree_ensembles__KBS_ (1).pdf	Open access	Pre-print	Creative Commons	455.42 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11586/187727
