Performance and limitations of a supervised deep learning approach for the histopathological Oxford Classification of glomeruli with IgA nephropathy

Altini, N.; Turkevi-Nagy, S.; Pesce, F.; Pontrelli, P.; Prencipe, B.; Berloco, F.; Seshan, S.; Gibier, J.-B.; Pedraza Dorado, A.; Bueno, G.; Peruzzi, L.; Rossi, M.; Eccher, A.; Li, F.; Koumpis, A.; Beyan, O.; Barratt, J.; Vo, H. Q.; Mohan, C.; Nguyen, H. V.; Cicalese, P. A.; Ernst, A.; Gesualdo, L.; Bevilacqua, V.; Becker, J. U.

doi:10.1016/j.cmpb.2023.107814

Background and Objective: The Oxford Classification for IgA nephropathy is the most successful example of an evidence-based nephropathology classification system. The aim of our study was to replicate the glomerular components of Oxford scoring with an end-to-end deep learning pipeline that involves automatic glomerular segmentation followed by classification for mesangial hypercellularity (M), endocapillary hypercellularity (E), segmental sclerosis (S) and active crescents (C). Methods: A total of 1056 periodic acid–Schiff (PAS) whole slide images (WSIs) from 386 kidney biopsies were annotated. Several detection models for glomeruli, based on the Mask R-CNN architecture, were trained on 587 WSIs, validated on 161 WSIs, and tested on 127 WSIs. For the development of segmentation models, 20,529 glomeruli were annotated, of which 16,571 formed the training set and 3958 the validation set. The test set of the segmentation module comprised 2948 glomeruli. For the Oxford classification, 6206 expert-annotated glomeruli from 308 PAS WSIs were labelled for M, E, S and C and split into a training set of 4298 glomeruli from 207 WSIs and a test set of 1908 glomeruli. We chose the best-performing models to construct an end-to-end pipeline, which we named MESCnn (MESC classification by neural network), for the glomerular Oxford classification of WSIs. Results: Instance segmentation yielded excellent results, with an AP50 of 78.2–80.1 % (79.4 ± 0.7 %) on the validation set and 75.1–77.7 % (76.5 ± 0.9 %) on the test set. The aggregated Jaccard Index was 73.4–75.9 % (75.0 ± 0.8 %) on the validation set and 69.1–73.4 % (72.2 ± 1.4 %) on the test set.
At the granular glomerular level, the Oxford Classification was best replicated for M with EfficientNetV2-L, with a mean ROC-AUC of 90.2 % and a mean precision-recall area under the curve (PR-AUC) of 81.8 %; for E with MobileNetV2 (ROC-AUC 94.7 %) and ResNet50 (PR-AUC 75.8 %); for S with EfficientNetV2-M (mean ROC-AUC 92.7 %, mean PR-AUC 87.7 %); and for C with EfficientNetV2-L (ROC-AUC 92.3 %) and EfficientNetV2-S (PR-AUC 54.7 %). At the biopsy level, correlation between expert and deep learning labels fulfilled the demands of the Oxford Classification. Conclusion: We designed an end-to-end pipeline for glomerular Oxford Classification at both the granular glomerular level and the whole-biopsy level. Both the glomerular segmentation and the classification modules are freely available to the renal medicine community for further development.
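As an illustration of how glomerulus-level calls could be rolled up into biopsy-level scores, the sketch below applies the commonly cited Oxford cut-offs: M1 when more than half of the glomeruli show mesangial hypercellularity, E1 and S1 when any glomerulus is affected, C1 for crescents in fewer than 25 % of glomeruli and C2 for 25 % or more. It is not taken from MESCnn. The `GlomerulusPrediction` container, the function name, and the assumption that every listed glomerulus is scorable and already binarised are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GlomerulusPrediction:
    """Binary per-glomerulus lesion calls (hypothetical container, not the MESCnn API)."""
    m: bool  # mesangial hypercellularity
    e: bool  # endocapillary hypercellularity
    s: bool  # segmental sclerosis
    c: bool  # active crescent


def biopsy_level_scores(gloms: List[GlomerulusPrediction]) -> Dict[str, int]:
    """Aggregate glomerulus-level calls into biopsy-level M, E, S, C scores.

    Assumption: all glomeruli passed in are scorable, so they all count
    towards the denominators.
    """
    if not gloms:
        raise ValueError("at least one glomerulus is required")

    n = len(gloms)
    frac_m = sum(g.m for g in gloms) / n
    frac_c = sum(g.c for g in gloms) / n

    return {
        # M1 only if MORE than half of the glomeruli are hypercellular
        "M": 1 if frac_m > 0.5 else 0,
        # E1 / S1 if the lesion is present in any glomerulus
        "E": 1 if any(g.e for g in gloms) else 0,
        "S": 1 if any(g.s for g in gloms) else 0,
        # C0: none, C1: below 25 % of glomeruli, C2: 25 % or more
        "C": 0 if frac_c == 0 else (1 if frac_c < 0.25 else 2),
    }


if __name__ == "__main__":
    example = [GlomerulusPrediction(True, False, False, False)] * 3
    example.append(GlomerulusPrediction(False, True, False, True))
    print(biopsy_level_scores(example))
```

In practice the per-glomerulus booleans would come from thresholding classifier probabilities; the choice of threshold is a separate decision that this sketch leaves open.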