Multimodal Artwork Topic Modeling via Fine-Tuned CLIP and Knowledge-Driven Prompts
Scaringi, Raffaele; Fanelli, Nicola; Vessio, Gennaro; Castellano, Giovanna
2025-01-01
Abstract
We propose a novel multimodal topic modeling framework to extract and explain latent themes in extensive collections of digitized artworks. Our approach leverages CLIP's contrastive pre-training to encode images and textual metadata into a shared semantic space. We fine-tune CLIP on a domain-specific dataset built from ArtGraph, an art-domain knowledge graph containing over 100k artworks enriched with curated metadata. Using the resulting multimodal embeddings, we perform clustering to uncover latent visual topics and associate each cluster with descriptive terms via cosine similarity with templated textual prompts. Finally, to further interpret the discovered topics, we employ LLaVA to generate textual summaries based on representative images. Our framework demonstrates promising performance in terms of topic coherence and diversity, evaluated through both visual and textual metrics. The method is unsupervised, easily adaptable, and provides interpretable outputs, making it suitable for applications in digital humanities, cultural heritage analysis, and content-based art retrieval.
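The cluster-labeling step described in the abstract — associating each cluster with descriptive terms via cosine similarity between cluster centroids and templated textual prompts — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy 2-D vectors stand in for embeddings produced by the fine-tuned CLIP encoders, and the function and variable names (`label_clusters`, `centroids`, `prompts`) are hypothetical. In the real pipeline, `prompt_embeddings` would come from encoding templated strings (e.g. a prompt built around each candidate term) with CLIP's text encoder.

```python
import numpy as np

def label_clusters(cluster_centroids, prompt_embeddings, terms, top_k=3):
    """Assign each cluster the top-k terms whose prompt embeddings are
    most cosine-similar to the cluster centroid (hypothetical sketch)."""
    # L2-normalize rows so that the dot product equals cosine similarity
    c = cluster_centroids / np.linalg.norm(cluster_centroids, axis=1, keepdims=True)
    p = prompt_embeddings / np.linalg.norm(prompt_embeddings, axis=1, keepdims=True)
    sims = c @ p.T                          # shape: (n_clusters, n_terms)
    top = np.argsort(-sims, axis=1)[:, :top_k]
    return [[terms[j] for j in row] for row in top]

# Toy 2-D stand-ins for CLIP embeddings.
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
prompts = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
terms = ["portrait", "landscape", "still life"]
print(label_clusters(centroids, prompts, terms, top_k=1))
# → [['portrait'], ['landscape']]
```

The normalization step matters: without it, prompt embeddings with larger norms would dominate the ranking regardless of direction, which is what cosine similarity is meant to avoid.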


