Although abundant data about various phenomena is easy to collect, obtaining the labeled data needed to learn models with high predictive performance remains difficult and expensive in many domains. The problem is particularly acute in the analysis of scientific data, where obtaining labeled data typically requires expensive experiments. In that setting, another issue is of fundamental importance: the interpretability of the models and the explainability of their decisions. Taking these considerations into account, we propose a novel semi-supervised method for learning regression trees. Being semi-supervised, the method exploits information from both labeled and unlabeled data, thus alleviating the shortage of labeled data. The method is based on the predictive clustering trees paradigm, which extends regression trees towards structured output prediction. This allows us to obtain interpretable regression trees. The proposed method is particularly suited to the chemoinformatics task of quantitative structure-activity relationship (QSAR) modeling, which is the main application context considered in this paper. Specifically, we evaluate the proposed method on 4 QSAR modeling datasets and illustrate its use in a case study on predicting farnesyltransferase inhibitors. We also evaluate our approach on 10 benchmark datasets not related to QSAR modeling.
The evaluation reveals the following: semi-supervised trees and ensembles thereof have better predictive performance than their supervised counterparts (especially when the number of labeled examples is very small); different datasets and different amounts of labeled data require different amounts of unlabeled data to be included in the learning process; and the learned semi-supervised regression trees can be used to better understand the problem at hand and the way predictions are being made.
Semi-supervised regression trees with application to QSAR modelling
Ceci M.;Kocev D.
2020-01-01