Reproducibility of LLM-based Recommender Systems: the Case Study of P5 Paradigm
Pasquale Lops (Conceptualization)
Antonio Silletti (Software)
Marco Polignano (Methodology)
Cataldo Musto (Investigation)
Giovanni Semeraro (Supervision)
2024-01-01
Abstract
Recommender systems can significantly benefit from the availability of pre-trained large language models (LLMs), which can serve as a basic mechanism for generating recommendations based on detailed user and item data, such as text descriptions, user reviews, and metadata. On the one hand, this new generation of LLM-based recommender systems paves the way for dealing with traditional limitations, such as cold-start and data sparsity; on the other hand, it poses fundamental challenges for their accountability. Reproducing experiments in the new context of LLM-based recommender systems is challenging for several reasons. New approaches are published at an unprecedented pace, which makes it difficult to have a clear picture of the main protocols and good practices in experimental evaluation. Moreover, the lack of proper frameworks for LLM-based recommendation development and evaluation makes the process of benchmarking models complex and uncertain. In this work, we discuss the main issues encountered when trying to reproduce P5 (Pretrain, Personalized Prompt, and Prediction Paradigm), one of the first works unifying different recommendation tasks in a shared language modeling and natural language generation framework. Starting from this study, we have developed LaikaLLM, a framework for training and evaluating LLMs specifically for the recommendation task. It has been used to perform several experiments assessing the impact of different LLMs, different personalization strategies, and a novel set of more informative prompts on the overall performance of recommendations, in a fully reproducible environment.
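To make the paradigm described in the abstract more concrete, the following minimal sketch shows how a recommendation task can be cast as text-to-text generation with a pretrained seq2seq model, which is the general idea behind P5-style approaches. This is an illustrative example only: the prompt template, the user and item identifiers, and the choice of the t5-small checkpoint are hypothetical assumptions and do not reflect the actual P5 or LaikaLLM code.

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Illustrative sketch, not the official P5 or LaikaLLM implementation.
model_name = "t5-small"  # any pretrained seq2seq LLM could be swapped in
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# A personalized prompt built from a (hypothetical) user interaction history.
user_history = ["item_1024", "item_87", "item_553"]
prompt = (
    "sequential recommendation: user_42 has interacted with "
    + ", ".join(user_history)
    + ". Predict the next item."
)

# The model generates the recommendation as free text; in a trained system
# the generated string would be mapped back to an item identifier.
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))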