Multimodal Artwork Topic Modeling via Fine-Tuned CLIP and Knowledge-Driven Prompts
Scaringi, Raffaele; Fanelli, Nicola; Vessio, Gennaro; Castellano, Giovanna
2025-01-01
Abstract
We propose a novel multimodal topic modeling framework to extract and explain latent themes in extensive collections of digitized artworks. Our approach leverages CLIP's contrastive pre-training to encode images and textual metadata into a shared semantic space. We fine-tune CLIP on a domain-specific dataset built from ArtGraph, an art-domain knowledge graph containing over 100k artworks enriched with curated metadata. Using the resulting multimodal embeddings, we perform clustering to uncover latent visual topics and associate each cluster with descriptive terms via cosine similarity with templated textual prompts. Finally, to further interpret the discovered topics, we employ LLaVA to generate textual summaries based on representative images. Our framework demonstrates promising performance in terms of topic coherence and diversity, evaluated through both visual and textual metrics. The method is unsupervised, easily adaptable, and provides interpretable outputs, making it suitable for applications in digital humanities, cultural heritage analysis, and content-based art retrieval.
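The cluster-labeling step described in the abstract — associating each cluster with descriptive terms via cosine similarity between cluster centroids and templated textual prompts — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy 2-D vectors stand in for embeddings produced by the fine-tuned CLIP encoders, and the function and variable names (`label_clusters`, `centroids`, `prompts`) are hypothetical. In the real pipeline, `prompt_embeddings` would come from encoding templated strings (e.g. a prompt built around each candidate term) with CLIP's text encoder.

```python
import numpy as np

def label_clusters(cluster_centroids, prompt_embeddings, terms, top_k=3):
    """Assign each cluster the top-k terms whose prompt embeddings are
    most cosine-similar to the cluster centroid (hypothetical sketch)."""
    # L2-normalize rows so that the dot product equals cosine similarity
    c = cluster_centroids / np.linalg.norm(cluster_centroids, axis=1, keepdims=True)
    p = prompt_embeddings / np.linalg.norm(prompt_embeddings, axis=1, keepdims=True)
    sims = c @ p.T                          # shape: (n_clusters, n_terms)
    top = np.argsort(-sims, axis=1)[:, :top_k]
    return [[terms[j] for j in row] for row in top]

# Toy 2-D stand-ins for CLIP embeddings.
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
prompts = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
terms = ["portrait", "landscape", "still life"]
print(label_clusters(centroids, prompts, terms, top_k=1))
# → [['portrait'], ['landscape']]
```

The normalization step matters: without it, prompt embeddings with larger norms would dominate the ranking regardless of direction, which is what cosine similarity is meant to avoid.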


