Leveraging Explainable Artificial Intelligence for Genotype-to-Phenotype Prediction: A Case Study in Arabidopsis thaliana
Novielli, Pierfrancesco;Pavan, Stefano;Delvento, Chiara;Diacono, Domenico;Bellotti, Roberto;Tangaro, Sabina
2025-01-01
Abstract
Predicting phenotypes from genomic data can significantly advance agriculture. Genomic selection (GS), which uses genome-wide DNA markers to identify individuals with high genetic value, enhances the accuracy of breeding programs. While linear models are routinely used for GS, machine learning (ML) models offer complementary potential. In this study, we developed robust ML-based models to predict five phenotypic traits (three related to flowering time and two to leaf number) in Arabidopsis thaliana, a model plant with a fully sequenced genome. Using explainable artificial intelligence (XAI), specifically SHapley Additive exPlanations (SHAP) values, we identified the SNPs that contributed most to trait prediction. Many of these SNPs were located in or near genes known to regulate flowering and stem elongation, such as DOG1 and VIN3, supporting the biological plausibility of the models. SHAP also enabled local interpretability at the single-plant level, revealing the genotypic basis of individual predictions. Our results indicate that integrating ML with XAI improves model interpretability while providing predictive performance comparable to traditional methods. This approach confirms known genotype–phenotype relationships and highlights new candidate loci, paving the way for functional validation. The proposed methodology offers promising applications in precision breeding and in translating insights from Arabidopsis to crop species.


