Semi-supervised regression trees with application to QSAR modelling

Levatic J.;Ceci M.;Stepisnik T.;Dzeroski S.;Kocev D.

2020-01-01

Abstract

Although data about many phenomena are easy to collect in abundance, obtaining the labeled data needed to learn models with high predictive performance remains difficult and expensive in many domains. The problem is particularly acute in the analysis of scientific data, where labeling typically requires expensive experiments. In this setting, a second issue is of fundamental importance: the interpretability of the models and the explainability of their decisions. Taking these considerations into account, we propose a novel semi-supervised method for learning regression trees. Because it is semi-supervised, the method exploits information from both labeled and unlabeled data, thus alleviating the shortage of labeled data. The method is based on the predictive clustering trees paradigm, which extends regression trees towards structured output prediction, and it yields interpretable regression trees. The proposed method is particularly suited to the chemoinformatics task of quantitative structure-activity relationship (QSAR) modelling, which is the main application context considered in this paper. Specifically, we evaluate the method on 4 QSAR modelling datasets and illustrate its use in a case study of predicting farnesyltransferase inhibitors. We also evaluate our approach on 10 benchmark datasets not related to QSAR modelling.
The evaluation reveals the following: semi-supervised trees and ensembles thereof have better predictive performance than their supervised counterparts (especially when the number of labeled examples is very small); different datasets and different amounts of labeled data require different amounts of unlabeled data to be included in the learning process; and the learned semi-supervised regression trees can be used to better understand the problem at hand and the way predictions are made.

Year

2020

Appears in types:

1.1 Journal article

Files in this item:

File	Access	Type	License	Size	Format
1-s2.0-S0957417420303936-main.pdf	Not available	Publisher's version	NOT PUBLIC - private/restricted access	724.24 kB	Adobe PDF
SSLforRegression__ESWA_.pdf	Open access	Pre-print	Creative Commons	693.08 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11586/345356
