Assessing and Improving the Multilingual Visual Word Sense Disambiguation Ability of Vision-Language Models
Lucia Siciliani; Pierpaolo Basile; Giovanni Semeraro
2025-01-01
Abstract
Vision-Language Models (VLMs) have demonstrated remarkable multimodal understanding. Due to their extensive training, they excel in tasks such as visual question answering and image retrieval. Their impressive generalization ability enables them to address novel and complex challenges. In this study, we evaluate the capability of VLMs for the Visual Word Sense Disambiguation (VWSD) task. Specifically, we examine their ability to select the correct image from a set of candidates for a given lemma based on minimal contextual information (a few additional words). Additionally, we evaluate the ability of VLMs to solve this task across multiple languages and analyze the performance of multimodal encoder-based and generative VLMs.


