Extending Large Language Models to multimodality for non-English languages

Siciliani L.;Basile P.;Semeraro G.
2026-01-01

Abstract

The growing popularity of Large Vision-Language Models (LVLMs) has highlighted and intensified one of the best-known challenges in the field of Large Language Models (LLMs): training is conducted mainly, and often exclusively, on English data. Consequently, the resulting models are more error-prone on non-English tasks, and the issue is exacerbated in multimodal settings, which are more complex and rely on task-specific datasets. Research on LLMs has therefore turned toward adapting them to non-English languages, but the scarcity of open, curated resources for these languages poses a significant limitation. In this work, we tackle this challenge by exploring the adaptation of LVLMs to non-English languages, using machine translation to overcome the lack of curated data. We also analyze how evaluation results are affected when a vision-to-text adapter is trained across different languages, examining the performance variations and challenges associated with multilingual adaptation. Finally, we highlight the importance of using open resources to ensure transparency and reproducibility of the results. Following this philosophy, we provide open access to the entire codebase of the adaptation pipeline, along with the trained models and dataset, to foster further research (https://github.com/swapUniba/LVLMs-NonEnglish).

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11586/575352