Unveiling Visual Features in Artwork Classification: Towards Explainable Vision Transformers in the Arts

Scaringi, Raffaele; Fanelli, Nicola; Vessio, Gennaro; Castellano, Giovanna
2026-01-01

Abstract

Recent advances in deep learning have enabled accurate artwork classification using models such as Vision Transformers (ViTs). However, interpreting the internal mechanisms behind such decisions remains challenging, especially in the abstract and symbolic domain of visual arts. We propose an interpretability framework that combines feature visualization via activation maximization with natural language grounding through a Multimodal Large Language Model. Our method extracts class-specific visual patterns learned by ViTs, synthesizes prototype images that activate key features, and generates human-readable descriptions. Applied to a large-scale art dataset, the approach reveals that ViTs attend to subtle and abstract cues, such as texture, shape, and composition, in ways that differ from natural-image tasks. The resulting visual and textual explanations offer valuable insight into model behavior and move toward more transparent, human-aligned AI systems for art analysis.
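The core idea behind the prototype synthesis described above is activation maximization: gradient ascent on the input to maximize a chosen unit or class logit. The paper's actual pipeline is not available on this record, so the following is only a minimal illustrative sketch using a toy linear "classifier" in place of a ViT; the function name, regularization weight, and learning rate are assumptions for illustration.

```python
import numpy as np

# Toy stand-in for a trained classifier: logits = W @ x.
# In the real setting, W @ x would be a forward pass through a ViT.
rng = np.random.default_rng(0)
num_classes, dim = 3, 8
W = rng.normal(size=(num_classes, dim))

def activation_maximization(W, target_class, steps=200, lr=0.1, l2=0.01):
    """Gradient ascent on the input to maximize one class logit.

    Illustrative sketch: the L2 penalty keeps the synthesized "prototype"
    input from diverging. Real feature-visualization pipelines add stronger
    image priors (jitter, blurring, frequency-domain penalties).
    """
    x = rng.normal(scale=0.01, size=dim)  # start from small noise
    for _ in range(steps):
        # Objective: W[target] @ x - l2 * ||x||^2; analytic gradient w.r.t. x:
        grad = W[target_class] - 2 * l2 * x
        x += lr * grad
    return x

proto = activation_maximization(W, target_class=1)
# The synthesized input should align strongly with the target class's weights.
cosine = W[1] @ proto / (np.linalg.norm(W[1]) * np.linalg.norm(proto))
print(round(float(cosine), 3))
```

With a differentiable deep model, the analytic gradient would instead come from automatic differentiation (e.g. backpropagation to the input pixels), but the optimization loop has the same shape.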
ISBN: 9783032113160; 9783032113177
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11586/562283
Warning: the data shown have not been validated by the university.

Citations
  • PMC: n/a
  • Scopus: 0
  • Web of Science: n/a