Semi-supervised regression trees with application to QSAR modelling

Ceci M.; Kocev D.
2020-01-01

Abstract

Although data about many phenomena are now easy to collect in abundance, obtaining the labeled data needed to learn models with high predictive performance remains difficult and expensive in many domains. The problem is particularly acute in the analysis of scientific data, where obtaining labels typically requires costly experiments. In this setting, a further issue is of fundamental importance: the interpretability of the models and the explainability of their decisions. With these considerations in mind, we propose a novel semi-supervised method for learning regression trees. Because it is semi-supervised, the method exploits information from both labeled and unlabeled data, thus alleviating the shortage of labeled data. The method is based on the predictive clustering trees paradigm, which extends regression trees towards structured output prediction, and it yields interpretable regression trees. The proposed method is particularly suited to the chemoinformatics task of quantitative structure-activity relationship (QSAR) modelling, which is the main application context considered in this paper. Specifically, we evaluate the method on 4 QSAR modelling datasets and illustrate its use in a case study on predicting farnesyltransferase inhibitors. We also evaluate our approach on 10 benchmark datasets unrelated to QSAR modelling.
The evaluation reveals the following: semi-supervised trees and ensembles thereof have better predictive performance than their supervised counterparts (especially when the number of labeled examples is very small); different datasets and different amounts of labeled data require different amounts of unlabeled data to be included in the learning process; and the learned semi-supervised regression trees can be used to better understand the problem at hand and how predictions are made.
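
The abstract does not spell out how unlabeled examples enter tree construction. As a rough illustration only, the sketch below assumes one plausible split heuristic: the impurity of a node blends the variance of the target (computed on labeled rows only) with the variance of the descriptive features (computed on all rows), controlled by a weight w. The function names, the weight w, the normalization by the parent's variances, and the toy data are all hypothetical choices made for this example, not details taken from the paper.

```python
from statistics import pvariance


def impurity(X, y, w, ref_x, ref_y):
    """Blend target variance (labeled rows only) with feature variance (all rows)."""
    labeled = [t for t in y if t is not None]  # None marks an unlabeled row
    # Assumption: a node without labeled rows counts as fully impure on the target.
    target_part = pvariance(labeled) / ref_y if labeled else 1.0
    feature_part = sum(
        pvariance([row[j] for row in X]) / ref_x[j] for j in range(len(ref_x))
    ) / len(ref_x)
    return w * target_part + (1.0 - w) * feature_part


def best_split(X, y, w):
    """Return (feature index, threshold, score) with the lowest size-weighted impurity.

    Assumes at least one labeled row and at least two distinct values in some feature.
    """
    n, d = len(X), len(X[0])
    # Parent variances normalize both parts; `or 1.0` guards constant columns.
    ref_x = [pvariance([row[j] for row in X]) or 1.0 for j in range(d)]
    ref_y = pvariance([t for t in y if t is not None]) or 1.0
    best = None
    for j in range(d):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0  # candidate threshold between neighbouring values
            left = [i for i in range(n) if X[i][j] <= thr]
            right = [i for i in range(n) if X[i][j] > thr]
            score = sum(
                len(idx) / n
                * impurity([X[i] for i in idx], [y[i] for i in idx], w, ref_x, ref_y)
                for idx in (left, right)
            )
            if best is None or score < best[2]:
                best = (j, thr, score)
    return best


# Toy data: feature 0 separates two groups, feature 1 is uninformative.
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 1.0], [10.0, 2.0], [11.0, 1.0], [12.0, 2.0]]
y = [1.0, 1.2, None, None, 9.0, 9.5]
split = best_split(X, y, w=0.5)
print(split[0], split[1])  # 0 6.5
```

In this sketch, w = 1.0 reduces the score to the usual supervised variance criterion, so the unlabeled rows stop influencing the choice of split, while smaller values of w let the structure of the feature space guide the split as well.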
Files in this record:

1-s2.0-S0957417420303936-main.pdf
Availability: not available
Type: Publisher's version
License: Not public (private/restricted access)
Size: 724.24 kB
Format: Adobe PDF

SSLforRegression__ESWA_.pdf
Availability: open access
Type: Pre-print
License: Creative Commons
Size: 693.08 kB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11586/345356