Balancing Fine-tuning and RAG: A Hybrid Strategy for Dynamic LLM Recommendation Updates


Abstract

Large Language Models (LLMs) empower recommendation systems through their advanced reasoning and planning capabilities. However, the dynamic nature of user interests and content poses a significant challenge: While initial fine-tuning aligns LLMs with domain knowledge and user preferences, it fails to capture such real-time changes, necessitating robust update mechanisms. This paper investigates strategies for updating LLM-powered recommenders, focusing on the trade-offs between ongoing fine-tuning and Retrieval-Augmented Generation (RAG). Using an LLM-powered user interest exploration system as a case study, we perform a comparative analysis of these methods across dimensions like cost, agility, and knowledge incorporation. We propose a hybrid update strategy that leverages the long-term knowledge adaptation of periodic fine-tuning with the agility of low-cost RAG. We demonstrate through live A/B experiments on a billion-user platform that this hybrid approach yields statistically significant improvements in user satisfaction, offering a practical and cost-effective framework for maintaining high-quality LLM-powered recommender systems.

CCS Concepts: • Information systems → Information retrieval; • Computing methodologies → Artificial intelligence.

1 Introduction

The emergence of Large Language Models (LLMs) is transforming the landscape of recommendation systems with their extensive world knowledge and reasoning capabilities. LLM-powered recommenders [1]–[3] utilize the deep semantic understanding and generative strengths of these models to deliver more personalized, explainable, and context-aware suggestions.

A common approach involves initially fine-tuning an LLM on domain knowledge and historical user-item interactions to tailor it for specific recommendation tasks [4]. However, the environments where these systems operate are inherently dynamic [5], [6]. User interests evolve, new items emerge constantly, and underlying data patterns shift; for instance, analysis of user transitions between interest clusters often reveals significant temporal variability. An LLM fine-tuned only on past data captures a static snapshot and cannot inherently reflect these real-time dynamics.

To address this challenge, two prominent techniques for adapting and updating LLMs are fine-tuning [7], [8] and Retrieval-Augmented Generation (RAG) [9], [10]. Fine-tuning involves further training a pre-trained LLM on a specific dataset to adjust its internal parameters, tailoring its knowledge or behavior. RAG, conversely, connects the LLM to external knowledge sources at inference time, retrieving relevant information to augment the prompt and ground the model’s generation in specific, often up-to-date, data without altering the model’s parameters.

This paper conducts a comparative analysis of fine-tuning and RAG as methodologies for adapting LLM-powered recommendation systems to dynamic updates. Our investigation is grounded in a deployed LLM-powered user interest recommendation system [4], [7]. While interest exploration systems [11]–[15] aim to diversify recommendations, effectively introducing novel interests poses a significant challenge. In our case study, the fine-tuned LLM generates potential novel interest clusters from user history; the core update challenge we address is enabling this model to accurately reflect the changing popularity and relationships between these clusters over time.

This challenge leads to our central hypothesis: In a highly dynamic domain like short-form video recommendation, a static, fine-tuned LLM is insufficient to maintain recommendation quality over time. We hypothesize that a hybrid strategy, combining periodic fine-tuning with frequent RAG-based updates, will more effectively adapt to shifting user interest patterns and result in superior online performance. This paper tests this hypothesis through an LLM-powered user interest exploration system [7]. We therefore compare fine-tuning and RAG specifically for this task, discussing their respective system designs, processes, strengths, limitations, effectiveness, and cost, using both offline and live experimental results.

2 Method

This section first provides the necessary preliminaries and the motivation for our work, then describes the interest exploration system, and finally details the fine-tuning and RAG designs.

2.1 Preliminaries

Motivation. To effectively model the dynamic nature of user interests, we represent them using clusters, following the methodology in [16]. To assess the evolution of user interest transitions, we first define a ‘successor interest’. From user interaction logs, we construct sequences of three consecutive, distinct item clusters a user engages with, denoted as \((c_1, c_2, c_{next})\). Here, \(c_{next}\) is the ‘successor interest’ to the preceding pair \((c_1, c_2)\). We then measure the month-over-month stability of the top-5 most frequent successor interests using Jaccard similarity, quantifying the overlap between these top-5 sets. Our analysis revealed a low mean Jaccard score of 0.17 (variance 0.07), demonstrating substantial monthly variability in prevalent user transition patterns. This observed dynamism highlights the critical need to efficiently incorporate refreshed user feedback.
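For concreteness, a minimal sketch of this stability analysis, assuming monthly triples are already available as \((c_1, c_2, c_{next})\) tuples; all function names here are illustrative, not our production code:

```python
from collections import Counter

def top_k_successors(triples, k=5):
    """Map each (c1, c2) pair to the set of its k most frequent successor interests."""
    counts = {}
    for c1, c2, c_next in triples:
        counts.setdefault((c1, c2), Counter())[c_next] += 1
    return {pair: {c for c, _ in ctr.most_common(k)}
            for pair, ctr in counts.items()}

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| between two successor sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def monthly_stability(triples_month1, triples_month2, k=5):
    """Mean Jaccard overlap of top-k successor sets across two months,
    computed over cluster pairs observed in both months."""
    t1 = top_k_successors(triples_month1, k)
    t2 = top_k_successors(triples_month2, k)
    shared = t1.keys() & t2.keys()
    scores = [jaccard(t1[p], t2[p]) for p in shared]
    return sum(scores) / len(scores) if scores else 0.0
```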

Interest Exploration System. In the LLM-powered system [7], each user’s recent interaction history is represented as a sequence of \(k\) interest clusters \(S_u = \{c_1, c_2, \dots, c_k\}\), where each \(c_i \in \mathcal{C}\) denotes an item interest cluster from a predefined cluster set \(\mathcal{C}\) [16]. Each interest cluster groups items that are topically coherent, based on their metadata and content features. Given \(S_u\), the LLM predicts the user’s next novel interest cluster \(c_n \in \mathcal{C}\). Because serving the LLM online for a billion-user system is prohibitively costly, we precompute and store the predicted next-cluster transitions for all possible \(k\)-length sequences of interest clusters. Let \(\mathcal{S} = \{(c_1, \dots, c_k) \mid c_i \in \mathcal{C}\}\) denote the set of all possible \(k\)-length cluster sequences. For each \(S \in \mathcal{S}\), we store a corresponding predicted novel cluster \(c_n\) offline. During online serving, a user’s current history \(S_u\) is matched to a sequence \(S \in \mathcal{S}\), and the corresponding predicted next cluster is retrieved via table lookup.
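This precompute-and-lookup pattern can be sketched as follows; `predict_next_cluster` stands in for the offline, batched LLM call and is an assumption, not the production interface:

```python
import itertools

def build_transition_table(clusters, k, predict_next_cluster):
    """Offline bulk inference: precompute a predicted novel cluster for
    every k-length sequence over the cluster set (|C|^k entries, which
    is tractable for small k such as k = 2)."""
    return {seq: predict_next_cluster(seq)
            for seq in itertools.product(clusters, repeat=k)}

def serve(user_history, table, k=2):
    """Online serving: a cheap table lookup on the user's most recent
    k interest clusters, avoiding any LLM call at request time."""
    return table.get(tuple(user_history[-k:]))
```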

2.2 Fine-tuning

Following the preliminary example [7], with \(k=2\), each fine-tuning data sample is denoted as \([(c_1, c_2), c_{next}]\). The prompt is illustrated as the black lines in Figure 1. Periodically, we curate thousands of these pairs for fine-tuning. Fine-tuning offers the benefits of adapting model behavior and style and improving performance on the specific task. However, the drawbacks are also significant, including high cost and complexity and the risk of overfitting. Due to this high cost, updates happen on a monthly basis.
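As an illustration, one way such samples might be serialized into prompt/label pairs for fine-tuning; the prompt wording below is a placeholder, not the exact template of Figure 1:

```python
import json

def to_finetune_example(c1, c2, c_next, cluster_keywords):
    """Serialize one [(c1, c2), c_next] sample as a prompt/label pair.
    cluster_keywords maps a cluster ID to its descriptive keywords."""
    prompt = (
        f"A user watched videos about {', '.join(cluster_keywords[c1])}, "
        f"then about {', '.join(cluster_keywords[c2])}. "
        "Predict a novel interest cluster this user may enjoy next."
    )
    label = ", ".join(cluster_keywords[c_next])
    return json.dumps({"prompt": prompt, "label": label})
```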

We also propose two key metrics to assess fine-tuning quality: the exact match rate (the percentage of predictions precisely matching the partition description) and the test set recall (the percentage of predictions aligning with users’ watch history). Leveraging these metrics, our auto-refreshed fine-tuning pipeline implements two automated quality checks:

• If the exact match rate during partition-mapping generation is below 90%, the pipeline execution is halted.

• If the test set recall is less than 1.5%, the pipeline fails.

Failing either check triggers manual review by an engineer to identify the root cause and decide whether to proceed to production or re-run the process.
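A minimal sketch of these gates, using the thresholds above; modeling the halt and fail outcomes as a single exception type is a simplifying assumption:

```python
class QualityGateError(Exception):
    """Raised when an automated fine-tuning quality check fails."""

def run_quality_checks(exact_match_rate, test_set_recall):
    """Gate a freshly fine-tuned model before it reaches production."""
    if exact_match_rate < 0.90:  # partition-mapping generation check
        raise QualityGateError(
            f"exact match rate {exact_match_rate:.1%} < 90%; halting pipeline")
    if test_set_recall < 0.015:  # alignment with users' watch history
        raise QualityGateError(
            f"test set recall {test_set_recall:.1%} < 1.5%; pipeline failed")
```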

2.3 RAG

Instead of retraining the LLM on new viewing data at high cost, we can inject new data into the prompt and perform bulk inference periodically to generate a dynamic transition mapping at low cost. Adhering to the prompt design of the LLM-powered interest exploration system [7], we represent a user’s consumption history as a sequence of their most recently interacted unique clusters. Each cluster is defined by a set of keywords. To better capture both dynamic system-wide trends and individual users’ evolving preferences, these prompts incorporate the top popular interest clusters along with the user’s recent watch history, as detailed in Figure 1. Fine-tuning is done on a monthly schedule, while the RAG prompt can be refreshed more frequently, even on a daily basis; the overall system is illustrated in Figure 2.

Figure 1: Prompt for Novel Interest Prediction. Black lines form the fine-tuning prompt; blue lines are added in the RAG version with injected recent watch history. The label is used only for fine-tuning, not for RAG.
Figure 2: Refresh with Retrieval-Augmented Generation. Fine-tuning is refreshed monthly, while RAG is refreshed multiple times within a week.
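To make Figure 1 concrete, a hedged sketch of how the RAG prompt might extend the fine-tuning prompt; the exact wording and argument names are illustrative assumptions:

```python
def build_prompt(history_keywords, recent_successor=None, popular_clusters=None):
    """Assemble the novel-interest prompt. The base line mirrors the
    fine-tuning prompt (black); the optional arguments add the
    RAG-only context (blue)."""
    lines = [f"User's recent interest clusters: {'; '.join(history_keywords)}."]
    if recent_successor:  # RAG: most frequent recent successor for this pair
        lines.append(f"Recently, users with this history often moved on to: {recent_successor}.")
    if popular_clusters:  # RAG: top popular clusters platform-wide
        lines.append(f"Currently popular interest clusters: {', '.join(popular_clusters)}.")
    lines.append("Predict one novel interest cluster for this user.")
    return "\n".join(lines)
```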

2.3.1 Granularity

During the bulk inference phase, RAG prompts can be generated at different levels of granularity.

• Instance level. Prompts are tailored for each individual cluster pair. Specifically, we can identify the top-1 most frequent next cluster based on recent data. Consider a distribution \(\{(c_1, c_2, c_3): 10, (c_1, c_2, c_4): 8, (c_1, c_2, c_5): 2, \dots\}\). Since \(c_3\) appears most frequently following \(c_1\) and \(c_2\), \(c_3\) can be included in the prompt for inference.

• Global level. This approach uses a single, universal prompt for all data pairs. This prompt captures overall user behavior and might include illustrative examples. For example, we can construct the prompt using the top-100 most frequent pairs found across the entire new dataset, regardless of the specific input pair. These globally representative clusters are then included to guide inference.

Given that the global-level design might introduce noise for the target cluster pair during output cluster generation, we adopt the instance-level design.
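A sketch contrasting the two granularities, assuming recent triples have been aggregated into per-pair successor counts (`pair_counts[(c1, c2)][c_next]`); names are illustrative:

```python
from collections import Counter

def instance_level_context(pair, pair_counts):
    """Instance level: the top-1 most frequent successor of this pair.
    With counts {(c1, c2): {c3: 10, c4: 8, c5: 2}}, the pair (c1, c2)
    yields c3, which is injected into that pair's prompt."""
    successors = pair_counts.get(pair)
    return successors.most_common(1)[0][0] if successors else None

def global_level_context(pair_counts, n=100):
    """Global level: one shared context built from the top-n most
    frequent pairs across all new data, regardless of the input pair."""
    totals = Counter({p: sum(ctr.values()) for p, ctr in pair_counts.items()})
    return [p for p, _ in totals.most_common(n)]
```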

2.3.2 Retrieval Similarity

This section outlines methods for retrieving relevant recent data during bulk inference.

• Frequency-based retrieval: We identify data points within the same cluster pair as the query and select those with the highest frequency. This provides the LLM with prompts reflecting recent, prevalent user behaviors for the specific cluster pair.

• Trend-based retrieval: Focusing on the query’s cluster pair, we select data points exhibiting the largest frequency difference, highlighting emerging or declining user interests.

Our analysis and evaluation indicate that frequency-based retrieval yields the best results.

The number of retrieved clusters included in the context can vary (e.g., Cluster 3 in Figure 1). While a larger number provides richer information, it also increases computational cost. Our live experiments suggest that including only the top-1 most frequent cluster is sufficient to deliver satisfactory results.
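The two retrieval strategies can be sketched as alternative scoring rules over successor counts from the current and previous windows; the windowing itself is assumed to happen upstream, and both arguments are `Counter` objects:

```python
from collections import Counter

def frequency_based(successors_now, n=1):
    """Select the n most frequent successors in the recent window."""
    return [c for c, _ in successors_now.most_common(n)]

def trend_based(successors_now, successors_prev, n=1):
    """Select successors with the largest frequency change between two
    windows, surfacing emerging or declining interests."""
    deltas = Counter({c: abs(successors_now[c] - successors_prev[c])
                      for c in set(successors_now) | set(successors_prev)})
    return [c for c, _ in deltas.most_common(n)]
```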

2.4 Data Retrieval

We use users’ interaction histories, represented as sequences of watches on a large-scale video platform, as the source dataset. Our data extraction targets videos demonstrating positive viewer engagement. To refine the dataset, we deduplicate video cluster IDs within each sequence and remove sequences with fewer than two videos. For the remaining sequences, we construct tuples of three consecutive video cluster IDs, \((c_1, c_2, c_{next})\). The final step counts the occurrences of each next cluster for every cluster pair.
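A compact sketch of this extraction, assuming each input sequence is an ordered list of cluster IDs from one user's positively engaged watches:

```python
from collections import Counter

def extract_transition_counts(sequences):
    """Build (c1, c2) -> Counter(c_next) transition counts from watch
    sequences of video cluster IDs (positive engagements only)."""
    counts = {}
    for seq in sequences:
        deduped = list(dict.fromkeys(seq))  # dedup cluster IDs, keep order
        if len(deduped) < 2:  # drop sequences with fewer than two videos
            continue
        for c1, c2, c_next in zip(deduped, deduped[1:], deduped[2:]):
            counts.setdefault((c1, c2), Counter())[c_next] += 1
    return counts
```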

3 Results and Evaluation

In our hybrid update strategy, the LLM undergoes monthly fine-tuning, while the RAG refresh occurs sub-weekly. Starting from the fine-tuned model, we then measure the incremental gains from more frequent, up-to-date RAG.

3.1 Offline Evaluation

We evaluated how RAG-generated cluster mappings evolve over time and how well they align with user behavior. Specifically, we assessed the hit rate, i.e., the proportion of times the predicted next cluster appears in the user's real subsequent sequence. We compared three versions of transition mappings: (1) a fixed mapping generated without RAG; (2) a RAG-generated mapping updated every two days; and (3) a RAG-generated mapping computed only on \(day_1\) and held fixed thereafter. As illustrated in Figure 3 (a), both RAG-based mappings outperform the fixed baseline, with the version updated every two days achieving slightly higher hit rates.
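A sketch of the hit-rate computation, assuming each evaluation record pairs a cluster-pair history with the clusters the user actually visited afterwards:

```python
def hit_rate(mapping, eval_records):
    """Proportion of records whose predicted next cluster appears in the
    user's real subsequent cluster sequence.

    mapping: dict from (c1, c2) to the predicted novel cluster.
    eval_records: iterable of ((c1, c2), future_clusters) pairs.
    """
    hits = total = 0
    for pair, future_clusters in eval_records:
        predicted = mapping.get(pair)
        if predicted is None:
            continue  # pair absent from this mapping version
        total += 1
        hits += predicted in future_clusters
    return hits / total if total else 0.0
```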

To better understand the influence of RAG on the LLM’s generation behavior, we analyze the similarity between outputs generated with and without RAG. Only 7.8% of the RAG-generated outputs were identical to those produced without RAG, compared to a 37.5% overlap when using repeated prompts without RAG. The results indicate that RAG significantly alters the generated content, often leading to novel predictions that differ from both the retrieved context and the non-RAG outputs.

Finally, we studied how the top-\(k\) most frequent clusters for each cluster pair changed over time. Our findings reveal a significant shift in top clusters across retrieval dates, with substantial drops in overlap as time progresses. This trend, illustrated in Figure 3 (b), re-emphasizes the dynamic nature of user interests and underscores the need for regularly refreshed retrieval to reflect current behavioral patterns.

Figure 3: Offline evaluation. (a) Trajectory of hit rate. (b) Exact match rates for top-\(k\) clusters over time.

3.2 Live Experiment

We conducted A/B experiments within a short-form video recommendation system serving billions of users to measure the effectiveness of RAG in enhancing the performance of our LLM-powered interest exploration system. Gemini 1.5 [17] was adopted as the base LLM, while the process and pipeline are designed to be adaptable to other models. At a high level, the system recommends novel interest clusters, currently based on a user’s historical interest cluster sequence of length \(k=2\).

We report the user metrics of the live experiments in Figure 4. The x-axis represents the date, and the y-axis shows the relative percentage difference between treatment and control. We also report the mean and 95% confidence intervals for each metric. The top-tier metric, Satisfied User Outcomes, increased by \(0.11\%\) with a 95% confidence interval of \([0.00\%, 0.21\%]\), which is highly significant at the scale of our system. Satisfaction Rate increased by \(0.25\%\) with interval \([0.01\%, 0.48\%]\). Dissatisfaction Rate was reduced by \(0.05\%\) with interval \([-0.08\%, -0.01\%]\), and Negative Interaction was reduced by \(0.04\%\) with interval \([-0.08\%, -0.01\%]\).

We employed RAG to update the cluster transition table on \(day_1\) and \(day_4\). Following these updates, we observed notable increases in user engagement, including significant improvements in Satisfied User Outcomes and Satisfaction Rate.

Figure 4: Live experiment results for user metrics. The x-axis represents the date; the y-axis represents the relative difference (in percentage) between the treatment and control groups. (a) Satisfied User Outcomes, (b) Satisfaction Rate, (c) Dissatisfaction Rate, (d) Negative Interaction.

4 Conclusion

This paper investigated the critical challenge of keeping LLM-powered recommendation systems updated. We conducted a comparative analysis of fine-tuning and RAG, proposing and validating a hybrid strategy. Our core finding is that combining monthly fine-tuning with sub-weekly RAG updates provides a robust, cost-effective solution for adapting to dynamic user interests, leading to significant improvements in online user satisfaction metrics in a large-scale production environment. Future work will explore more adaptive update cadences, where the frequency of RAG or fine-tuning is determined automatically based on the detected rate of interest drift.

Speaker Bio

Changping Meng is a software engineer at Google (YouTube). He received his PhD in Computer Science from Purdue University. His work primarily focuses on short-form video recommendations.

References

[1]
Sein Kim, Hongseok Kang, Seungyoon Choi, Donghyun Kim, Minchul Yang, and Chanyoung Park. 2024. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1395–1406.
[2]
Arkadeep Acharya, Brijraj Singh, and Naoyuki Onoe. 2023. In Proceedings of the 17th ACM Conference on Recommender Systems. 1204–1207.
[3]
Shuyuan Xu, Wenyue Hua, and Yongfeng Zhang. 2024. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 386–394.
[4]
Jianling Wang, Haokai Lu, Yifan Liu, He Ma, Yueqi Wang, Yang Gu, Shuzhou Zhang, Shuchao Bi, Lexi Baugher, Ed Chi, et al. 2024. arXiv e-prints (2024), arXiv–2405.
[5]
Jianling Wang, Kaize Ding, Liangjie Hong, Huan Liu, and James Caverlee. 2020. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1101–1110.
[6]
Jianling Wang, Raphael Louca, Diane Hu, Caitlin Cellier, James Caverlee, and Liangjie Hong. 2020. In Proceedings of the 13th International Conference on Web Search and Data Mining. 645–653.
[7]
Jianling Wang, Haokai Lu, Yifan Liu, He Ma, Yueqi Wang, Yang Gu, Shuzhou Zhang, Ningren Han, Shuchao Bi, Lexi Baugher, Ed H. Chi, and Minmin Chen. 2024. In Proceedings of the 18th ACM Conference on Recommender Systems (RecSys '24). Association for Computing Machinery, New York, NY, USA, 872–877. https://doi.org/10.1145/3640457.3688161
[8]
Jiaju Chen, Chongming Gao, Shuai Yuan, Shuchang Liu, Qingpeng Cai, and Peng Jiang. 2025. In Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining. 857–865.
[9]
Huimin Zeng, Zhenrui Yue, Qian Jiang, and Dong Wang. 2024. In 2024 IEEE International Conference on Big Data (BigData). IEEE, 8078–8087.
[10]
Run-Ze Fan, Yixing Fan, Jiangui Chen, Jiafeng Guo, Ruqing Zhang, and Xueqi Cheng. 2024. In European Conference on Information Retrieval. Springer, 39–55.
[11]
Minmin Chen, Yuyan Wang, Can Xu, Ya Le, Mohit Sharma, Lee Richardson, Su-Lin Wu, and Ed Chi. 2021. In Proceedings of the 15th ACM Conference on Recommender Systems. 85–95.
[12]
Minmin Chen. 2021. In Proceedings of the 15th ACM Conference on Recommender Systems. 551–553.
[13]
Yu Song, Shuai Sun, Jianxun Lian, Hong Huang, Yu Li, Hai Jin, and Xing Xie. 2022. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 947–956.
[14]
Khushhall Chandra Mahajan, Amey Porobo Dharwadker, Romil Shah, Simeng Qu, Gaurav Bang, and Brad Schumitsch. 2023. In Companion Proceedings of the ACM Web Conference 2023. 508–512.
[15]
Yi Su, Xiangyu Wang, Elaine Ya Le, Liang Liu, Yuening Li, Haokai Lu, Benjamin Lipshitz, Sriraj Badam, Lukasz Heldt, Shuchao Bi, et al. 2024. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining. 636–644.
[16]
Bo Chang, Changping Meng, He Ma, Shuo Chang, Yang Gu, Yajun Peng, Jingchen Feng, Yaping Zhang, Shuchao Bi, Ed H. Chi, and Minmin Chen. 2024. In Companion Proceedings of the ACM Web Conference 2024.
[17]
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. 2024. arXiv preprint arXiv:2403.05530 (2024).