LLaVA-NDiNO: Empowering LLMs with Multimodality for the Italian Language
Siciliani L.;Basile P.;Semeraro G.
2024-01-01
Abstract
Since their inception, large language models (LLMs) have undergone many innovations, one of which is multimodality. Several adaptation strategies have been developed to extend LLMs to process multimodal signals. However, in the current literature these multimodal models are trained on English-only vision-language datasets, which limits their capabilities in other languages. This work proposes the first family of large multimodal models (LMMs) for the Italian language. We trained them using state-of-the-art backbone models and datasets translated into Italian with the most up-to-date machine translation model available. In support of open science, we publicly release the data, models, and code used to develop these models.


