Unveiling Visual Features in Artwork Classification: Towards Explainable Vision Transformers in the Arts

Scaringi, Raffaele; Fanelli, Nicola; Vessio, Gennaro; Castellano, Giovanna
2026-01-01

Abstract

Recent advances in deep learning have enabled accurate artwork classification using models such as Vision Transformers (ViTs). However, interpreting the internal mechanisms behind such decisions remains challenging, especially in the abstract and symbolic domain of visual arts. We propose an interpretability framework that combines feature visualization via activation maximization with natural language grounding through a Multimodal Large Language Model. Our method extracts class-specific visual patterns learned by ViTs, synthesizes prototype images that activate key features, and generates human-readable descriptions. Applied to a large-scale art dataset, the approach reveals that ViTs attend to subtle and abstract cues, such as texture, shape, and composition, in ways that differ from natural-image tasks. The resulting visual and textual explanations offer valuable insight into model behavior and move toward more transparent, human-aligned AI systems for art analysis.
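The core idea behind the prototype synthesis described above is activation maximization: gradient ascent on the input to maximize a chosen unit or class logit. The paper's actual pipeline is not available on this record, so the following is only a minimal illustrative sketch using a toy linear "classifier" in place of a ViT; the function name, regularization weight, and learning rate are assumptions for illustration.

```python
import numpy as np

# Toy stand-in for a trained classifier: logits = W @ x.
# In the real setting, W @ x would be a forward pass through a ViT.
rng = np.random.default_rng(0)
num_classes, dim = 3, 8
W = rng.normal(size=(num_classes, dim))

def activation_maximization(W, target_class, steps=200, lr=0.1, l2=0.01):
    """Gradient ascent on the input to maximize one class logit.

    Illustrative sketch: the L2 penalty keeps the synthesized "prototype"
    input from diverging. Real feature-visualization pipelines add stronger
    image priors (jitter, blurring, frequency-domain penalties).
    """
    x = rng.normal(scale=0.01, size=dim)  # start from small noise
    for _ in range(steps):
        # Objective: W[target] @ x - l2 * ||x||^2; analytic gradient w.r.t. x:
        grad = W[target_class] - 2 * l2 * x
        x += lr * grad
    return x

proto = activation_maximization(W, target_class=1)
# The synthesized input should align strongly with the target class's weights.
cosine = W[1] @ proto / (np.linalg.norm(W[1]) * np.linalg.norm(proto))
print(round(float(cosine), 3))
```

With a differentiable deep model, the analytic gradient would instead come from automatic differentiation (e.g. backpropagation to the input pixels), but the optimization loop has the same shape.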
ISBN: 9783032113160; 9783032113177
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11586/562283
Warning: the data shown have not been validated by the university.

Citations
  • PMC: n/a
  • Scopus: 0
  • Web of Science: n/a