Reproducibility of LLM-based Recommender Systems: the Case Study of P5 Paradigm

Pasquale Lops (Conceptualization); Antonio Silletti (Software); Marco Polignano (Methodology); Cataldo Musto (Investigation); Giovanni Semeraro (Supervision)
2024-01-01

Abstract

Recommender systems can significantly benefit from the availability of pre-trained large language models (LLMs), which can serve as a basic mechanism for generating recommendations based on detailed user and item data, such as text descriptions, user reviews, and metadata. On the one hand, this new generation of LLM-based recommender systems paves the way for dealing with traditional limitations, such as cold-start and data sparsity. On the other hand, it poses fundamental challenges for their accountability. Reproducing experiments in the new context of LLM-based recommender systems is challenging for several reasons. New approaches are published at an unprecedented pace, which makes it difficult to obtain a clear picture of the main protocols and good practices in experimental evaluation. Moreover, the lack of proper frameworks for developing and evaluating LLM-based recommenders makes benchmarking models complex and uncertain. In this work, we discuss the main issues encountered when trying to reproduce P5 (Pretrain, Personalized Prompt, and Predict Paradigm), one of the first works unifying different recommendation tasks in a shared language modeling and natural language generation framework. Starting from this study, we developed LaikaLLM, a framework for training and evaluating LLMs specifically for recommendation tasks. We used it to run several experiments assessing the impact of different LLMs, different personalization strategies, and a novel set of more informative prompts on overall recommendation performance, in a fully reproducible environment.
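The abstract frames recommendation as a text-to-text language modeling problem, which is the core idea behind the P5 paradigm and, by extension, LaikaLLM. The sketch below is not the authors' code: it is a minimal illustration, assuming an off-the-shelf T5 checkpoint from the Hugging Face transformers library and an invented prompt wording, of how a sequential recommendation task can be phrased as a personalized prompt and answered by text generation. A usable model would first be fine-tuned on such prompts.

```python
# Minimal sketch (not the authors' LaikaLLM implementation): framing a
# sequential recommendation task as a text-to-text prompt, in the spirit of P5.
# The model choice and the prompt wording below are illustrative assumptions.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# A P5-style personalized prompt: the user's interaction history is expressed
# as plain text, and the next item is requested as generated text.
prompt = (
    "User_7 has purchased items 1013, 452, 87 in this order. "
    "Predict the next item the user will interact with:"
)

inputs = tokenizer(prompt, return_tensors="pt")
# Without fine-tuning on recommendation prompts the generated answer is not
# meaningful; this only shows the text-in / text-out interface the paradigm
# relies on.
output_ids = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```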

Use this identifier to cite or link to this document: https://hdl.handle.net/11586/515181