October 23, 2025
Large Language Models (LLMs) empower recommendation systems through their advanced reasoning and planning capabilities. However, the dynamic nature of user interests and content poses a significant challenge: while initial fine-tuning aligns LLMs with domain knowledge and user preferences, it fails to capture such real-time changes, necessitating robust update mechanisms. This paper investigates strategies for updating LLM-powered recommenders, focusing on the trade-offs between ongoing fine-tuning and Retrieval-Augmented Generation (RAG). Using an LLM-powered user interest exploration system as a case study, we perform a comparative analysis of these methods across dimensions such as cost, agility, and knowledge incorporation. We propose a hybrid update strategy that combines the long-term knowledge adaptation of periodic fine-tuning with the agility of low-cost RAG. We demonstrate through live A/B experiments on a billion-user platform that this hybrid approach yields statistically significant improvements in user satisfaction, offering a practical and cost-effective framework for maintaining high-quality LLM-powered recommender systems.
The emergence of Large Language Models (LLMs) is transforming the landscape of recommendation systems with their extensive world knowledge and reasoning capabilities. LLM-powered recommenders [1]–[3] utilize the deep semantic understanding and generative strengths of these models to deliver more personalized, explainable, and context-aware suggestions.
A common approach involves initially fine-tuning an LLM on domain knowledge and historical user-item interactions to tailor it for specific recommendation tasks [4]. However, the environments where these systems operate are inherently dynamic [5], [6]. User interests evolve, new items emerge constantly, and underlying data patterns shift – for instance, analysis of user transitions between interest clusters often reveals significant temporal variability. An LLM fine-tuned only on past data captures a static snapshot and cannot inherently reflect these real-time dynamics.
To address this challenge, two prominent techniques for adapting and updating LLMs are fine-tuning [7], [8] and Retrieval-Augmented Generation (RAG) [9], [10]. Fine-tuning involves further training a pre-trained LLM on a specific dataset to adjust its internal parameters, tailoring its knowledge or behavior. RAG, conversely, connects the LLM to external knowledge sources at inference time, retrieving relevant information to augment the prompt and ground the model’s generation in specific, often up-to-date, data without altering the model’s parameters.
This paper conducts a comparative analysis of fine-tuning and RAG as methodologies for adapting LLM-powered recommendation systems to dynamic updates. Our investigation is grounded in a deployed LLM-powered user interest recommendation system [4], [7]. While interest exploration systems [11]–[15] aim to diversify recommendations, effectively introducing novel interests poses a significant challenge. In our case study, the fine-tuned LLM generates potential novel interest clusters from user history; the core update challenge we address is enabling this model to accurately reflect the changing popularity and relationships between these clusters over time.
This challenge leads to our central hypothesis: in a highly dynamic domain like short-form video recommendation, a static, fine-tuned LLM is insufficient to maintain recommendation quality over time. We hypothesize that a hybrid strategy, combining periodic fine-tuning with frequent RAG-based updates, will more effectively adapt to shifting user interest patterns and result in superior online performance. This paper tests this hypothesis through an LLM-powered user interest exploration system [7]. We therefore compare fine-tuning and RAG specifically for this task, discussing their respective system designs, processes, strengths, limitations, effectiveness, and cost, using both offline and live experimental results.
This section first provides necessary preliminary information and outlines the motivation for our work, followed by a detailed description of the interest exploration system. Subsequently, we detail the designs for fine-tuning and RAG.
Motivation. To effectively model the dynamic nature of user interests, we represent them using clusters, following the methodology in [16]. To assess the evolution of user interest transitions, we first define a ‘successor interest’. From user interaction logs, we construct sequences of three consecutive, distinct item clusters a user engages with, denoted as \((c_1, c_2, c_{next})\). Here, \(c_{next}\) is the ‘successor interest’ to the preceding pair \((c_1, c_2)\). We then measured the month-over-month stability of the top-5 most frequent successor interests using the Jaccard similarity between these top-5 sets. Our analysis revealed a low mean Jaccard score of 0.17 (variance 0.07), demonstrating substantial monthly variability in prevalent user transition patterns. This observed dynamism highlights the critical need to efficiently incorporate refreshed user feedback.
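To make the stability measurement concrete, the following minimal Python sketch computes the Jaccard similarity between two months’ top-5 successor sets for a single cluster pair; the counts and cluster names are toy values, not production data.

```python
from collections import Counter

# Toy monthly successor counts for one (c1, c2) pair; in production these
# are aggregated from user interaction logs (cluster names are illustrative).
month1 = Counter({"c3": 10, "c4": 8, "c5": 2, "c6": 2, "c7": 1})
month2 = Counter({"c9": 12, "c4": 5, "c10": 4, "c3": 2, "c11": 2})

def top_k_set(counts, k=5):
    """Set of the k most frequent successor interests."""
    return {c for c, _ in counts.most_common(k)}

def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| between two successor sets."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

# A low score (the paper reports a mean of 0.17) signals month-over-month drift.
print(jaccard(top_k_set(month1), top_k_set(month2)))  # 0.25 for this toy data
```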
Interest Exploration System. In the LLM-powered system [7], each user’s recent interaction history is represented as a sequence of \(k\) interest clusters \(S_u = \{c_1, c_2, \dots, c_k\}\), where each \(c_i \in \mathcal{C}\) denotes an item interest cluster from a predefined cluster set \(\mathcal{C}\) [16]. Each interest cluster groups items that are topically coherent, based on their metadata and content features. Given \(S_u\), the LLM predicts the user’s next novel interest cluster \(c_n \in \mathcal{C}\). Because serving the LLM online for a billion-user system is prohibitively costly, we precompute and store the predicted next-cluster transitions for all possible \(k\)-length sequences of interest clusters. Let \(\mathcal{S} = \{(c_1, \dots, c_k) \mid c_i \in \mathcal{C}\}\) denote the set of all possible \(k\)-length cluster sequences. For each \(S \in \mathcal{S}\), we store a corresponding predicted novel cluster \(c_n\) offline. During online serving, a user’s current history \(S_u\) is matched to a sequence \(S \in \mathcal{S}\), and the corresponding predicted next cluster is retrieved via table lookup.
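The offline-precompute/online-lookup split can be sketched as follows; the cluster vocabulary and the `predict_next_cluster` stub are hypothetical stand-ins for the real cluster set \(\mathcal{C}\) and the fine-tuned LLM.

```python
from itertools import product

# Hypothetical stand-ins: CLUSTERS for the real cluster set C and
# predict_next_cluster for an offline call to the fine-tuned LLM.
CLUSTERS = ["c1", "c2", "c3"]
K = 2  # history length k

def predict_next_cluster(sequence):
    """Placeholder for bulk LLM inference returning a novel cluster."""
    return "c3"

# Offline: enumerate all |C|^k possible histories and store one prediction each.
transition_table = {
    seq: predict_next_cluster(seq) for seq in product(CLUSTERS, repeat=K)
}

# Online: serving reduces to a cheap table lookup, with no per-request LLM call.
def recommend(user_history):
    return transition_table.get(tuple(user_history[-K:]))

print(recommend(["c1", "c2"]))  # 'c3'
```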
Following the preliminary example [7], with \(k=2\), each fine-tuning data sample is denoted as \([(c_1, c_2), c_{next}]\). The prompt is illustrated as black lines in Figure 1. Periodically, we curate thousands of such pairs for fine-tuning. Fine-tuning offers the benefits of adapting model behavior and style, as well as improving performance on specific tasks. However, the drawbacks are also significant, including high cost, pipeline complexity, and the risk of overfitting. Due to this high cost, fine-tuning updates happen only on a monthly basis.
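As a rough illustration, a curated \([(c_1, c_2), c_{next}]\) pair could be serialized into a supervised example as below; the prompt wording and keyword mapping are illustrative assumptions, not the production template from Figure 1.

```python
# Hypothetical serialization of one curated [(c1, c2), c_next] sample into a
# (prompt, target) record; the wording does not reproduce the Figure 1 template.
def to_finetuning_example(pair, c_next, cluster_keywords):
    c1, c2 = pair
    prompt = (
        f"A user watched videos about '{cluster_keywords[c1]}' and then "
        f"'{cluster_keywords[c2]}'. Predict a novel interest cluster they "
        f"are likely to enjoy next."
    )
    return {"input": prompt, "target": cluster_keywords[c_next]}

# Toy keyword descriptions for three clusters.
keywords = {"c1": "home workouts", "c2": "meal prep", "c3": "hiking gear"}
print(to_finetuning_example(("c1", "c2"), "c3", keywords))
```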
We also propose two key metrics to evaluate fine-tuning quality: the exact match rate (the percentage of predictions precisely matching the partition description) and the test set recall (the percentage of predictions aligning with users’ watch history). Leveraging these metrics, our auto-refreshed fine-tuning pipeline implements two automated quality checks:
- If the exact match rate during partition mapping generation is below 90%, the pipeline execution is halted.
- If the test set recall is less than 1.5%, the pipeline fails.
Either condition necessitates manual review by an engineer to identify the root cause and decide whether to proceed to production or re-run the process; a minimal sketch of these gates follows.
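The two thresholds below come from the checks above; the pipeline interface itself is a hypothetical sketch, since the production orchestration is not public.

```python
# Thresholds are the ones stated above; the pipeline API is hypothetical.
EXACT_MATCH_THRESHOLD = 0.90
TEST_RECALL_THRESHOLD = 0.015

class PipelineHalted(Exception):
    """Stops the auto-refresh pipeline pending manual engineer review."""

def quality_gate(exact_match_rate, test_set_recall):
    if exact_match_rate < EXACT_MATCH_THRESHOLD:
        raise PipelineHalted(
            f"exact match rate {exact_match_rate:.1%} is below 90%"
        )
    if test_set_recall < TEST_RECALL_THRESHOLD:
        raise PipelineHalted(
            f"test set recall {test_set_recall:.2%} is below 1.5%"
        )

quality_gate(exact_match_rate=0.93, test_set_recall=0.021)  # passes silently
```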
Instead of retraining the LLM on new viewing data at high cost, we can supply new data to the LLM via the prompt and perform bulk inference periodically to generate a dynamic transition mapping at low cost. Adhering to the prompt design of the LLM-powered interest exploration system [7], we represent a user’s consumption history as a sequence of their most recently interacted unique clusters, where each cluster is defined by a set of keywords. To better capture both dynamic system-wide trends and an individual user’s evolving preferences, these prompts incorporate top popular interest clusters along with the user’s recent watch history, as detailed in Figure 1. Fine-tuning is done on a monthly schedule, while the RAG prompt can be refreshed more frequently, even on a daily basis, with the overall system illustrated in Figure 2.
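A minimal sketch of assembling such a prompt is shown below; the template text and the keyword mapping are illustrative assumptions, not the exact production prompt from Figure 1.

```python
# Hypothetical prompt assembly; cluster_keywords maps cluster IDs to their
# defining keywords, and the template text is illustrative only.
def build_rag_prompt(history_clusters, retrieved_clusters, cluster_keywords):
    history = "; ".join(cluster_keywords[c] for c in history_clusters)
    popular = "; ".join(cluster_keywords[c] for c in retrieved_clusters)
    return (
        f"The user recently watched clusters: {history}.\n"
        f"Currently popular next clusters for similar users: {popular}.\n"
        f"Predict a novel interest cluster this user would enjoy."
    )

keywords = {"c1": "home workouts", "c2": "meal prep", "c3": "hiking gear"}
print(build_rag_prompt(["c1", "c2"], ["c3"], keywords))
```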
During the bulk inference phase, RAG prompts can be generated at different levels of granularity:

- Instance level. Prompts are tailored to each individual cluster pair. Specifically, we can identify the top-1 most frequent next cluster based on recent data. Consider a distribution \(\{(c_1, c_2, c_3): 10, (c_1, c_2, c_4): 8, (c_1, c_2, c_5): 2, \dots\}\). Since \(c_3\) appears most frequently following \(c_1\) and \(c_2\), \(c_3\) can be included in the prompt for inference.
- Global level. This approach uses a single, universal prompt for all data pairs. The prompt captures overall user behavior and might include illustrative examples; e.g., we can construct it from the top-100 most frequent pairs found across all the new data, regardless of the specific input pair. These globally representative clusters are then included to guide inference.
Given that the global-level design might introduce noise irrelevant to the target cluster pair during output cluster generation, we adopt the instance-level design; both granularities are sketched below.
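The contrast between the two granularities can be sketched as follows, using toy recent counts (the values mirror the example above and the data structures are assumptions).

```python
from collections import Counter

# Toy recent successor counts keyed by cluster pair (illustrative values).
recent = {
    ("c1", "c2"): Counter({"c3": 10, "c4": 8, "c5": 2}),
    ("c2", "c6"): Counter({"c7": 9, "c3": 4}),
}

def instance_context(pair, top_n=1):
    """Instance level: top successors observed recently for this exact pair."""
    return [c for c, _ in recent.get(pair, Counter()).most_common(top_n)]

def global_context(top_n=100):
    """Global level: most frequent (c1, c2, c_next) patterns overall."""
    overall = Counter()
    for (c1, c2), successors in recent.items():
        for c_next, n in successors.items():
            overall[(c1, c2, c_next)] += n
    return [t for t, _ in overall.most_common(top_n)]

print(instance_context(("c1", "c2")))  # ['c3'], matching the example above
print(global_context(top_n=2))         # top patterns across all pairs
```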
This section outlines methods for retrieving relevant recent data during bulk inference:

- Frequency-based retrieval: we identify data points within the same cluster pair as the query and select those with the highest frequency. This provides the LLM with prompts reflecting recent, prevalent user behaviors for the specific cluster pair.
- Trend-based retrieval: focusing on the query’s cluster pair, we select data points exhibiting the largest frequency difference between consecutive time windows, highlighting emerging or declining user interests.
Our analysis and evaluation indicate that frequency-based retrieval yields the best results.
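Both variants can be sketched as below; the two-window comparison for the trend variant is an assumption about how the frequency difference is computed, and the counts are toy values.

```python
from collections import Counter

# Toy successor counts for the query's cluster pair in two adjacent time
# windows (the windowing scheme is an assumption, not from the paper).
current = Counter({"c3": 10, "c4": 8, "c5": 2})
previous = Counter({"c3": 9, "c4": 2, "c5": 6})

def frequency_based(counts, n=1):
    """Most frequent successors in the recent window."""
    return [c for c, _ in counts.most_common(n)]

def trend_based(curr, prev, n=1):
    """Successors with the largest frequency change between windows."""
    deltas = {c: abs(curr[c] - prev[c]) for c in set(curr) | set(prev)}
    return sorted(deltas, key=deltas.get, reverse=True)[:n]

print(frequency_based(current))        # ['c3']: prevalent behavior
print(trend_based(current, previous))  # ['c4']: fastest-moving interest
```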
The number of retrieved clusters included in the context can vary (e.g., Cluster 3 in Figure 1). While a larger number provides richer information, it also increases computational cost. Our live experiments suggest that including only the top-1 most frequent cluster is sufficient to produce satisfying results.
We use users’ interaction histories on a large-scale video platform, represented as sequences of watches, as the source dataset. Our data extraction targets videos demonstrating positive viewer engagement. To refine the dataset, we deduplicate video cluster IDs within each sequence and remove sequences with fewer than two videos. For the remaining sequences, we construct tuples of three consecutive video cluster IDs, \((c_1, c_2, c_{next})\). The final step counts the occurrences of each next cluster for every cluster pair.
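The preparation steps above can be sketched end to end as follows; the input sequences are assumed to have already passed the positive-engagement filter, and the toy data is illustrative.

```python
from collections import Counter, defaultdict

def build_transition_counts(sequences):
    """Sketch of the dataset construction above; `sequences` holds per-user
    cluster-ID watch sequences that already passed the positive-engagement
    filter (that upstream step is assumed here)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        deduped = list(dict.fromkeys(seq))  # dedupe cluster IDs, keep order
        if len(deduped) < 2:                # drop sequences that are too short
            continue
        # Slide over consecutive triples (c1, c2, c_next) and count successors.
        for c1, c2, c_next in zip(deduped, deduped[1:], deduped[2:]):
            counts[(c1, c2)][c_next] += 1
    return counts

counts = build_transition_counts([["c1", "c2", "c2", "c3"], ["c1", "c2", "c4"]])
print(counts[("c1", "c2")])  # Counter({'c3': 1, 'c4': 1})
```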
In our hybrid update strategy, the LLM undergoes monthly fine-tuning, while the RAG-generated mapping is refreshed sub-weekly. Starting from the fine-tuned model, we then measure the incremental gains of more frequent, up-to-date RAG.
We evaluated how RAG-generated cluster mappings evolve over time and how well they align with user behavior. Specifically, we assessed the hit rate, the proportion of times the predicted next cluster appears in the real user sequence. We compared three versions of transition mappings: (1) a fixed mapping generated without RAG; (2) a RAG-generated mapping updated every two days; and (3) a RAG-generated mapping computed only on \(day_1\) and held fixed thereafter. As illustrated in Figure 3 (a), both RAG-based mappings outperform the fixed baseline, with the version updated every two days achieving slightly higher hit rates.
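A minimal sketch of the hit-rate computation is given below; the exact matching window used in production is not specified in the paper, so checking the remainder of the sequence is an assumption.

```python
def hit_rate(mapping, eval_sequences):
    """Proportion of (c1, c2) contexts whose predicted next cluster appears
    later in the user's actual sequence (a sketch of the metric above)."""
    hits, total = 0, 0
    for seq in eval_sequences:
        for i in range(len(seq) - 2):
            predicted = mapping.get((seq[i], seq[i + 1]))
            if predicted is None:
                continue
            total += 1
            if predicted in seq[i + 2:]:
                hits += 1
    return hits / total if total else 0.0

mapping = {("c1", "c2"): "c3"}
print(hit_rate(mapping, [["c1", "c2", "c5", "c3"], ["c1", "c2", "c4"]]))  # 0.5
```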
To better understand the influence of RAG on the LLM’s generation behavior, we analyze the similarity between outputs generated with and without RAG. Only 7.8% of the RAG-generated outputs were identical to those produced without RAG, compared to a 37.5% overlap when using repeated prompts without RAG. The results indicate that RAG significantly alters the generated content, often leading to novel predictions that differ from both the retrieved context and the non-RAG outputs.
Finally, we studied how the top-\(k\) most frequent clusters for each cluster pair changed over time. Our findings reveal a significant shift in top clusters across retrieval dates, with substantial drops in overlap as time progresses. This trend, illustrated in Figure 3 (b), re-emphasizes the dynamic nature of user interests and underscores the need for regularly refreshed retrieval to reflect current behavioral patterns.
Figure 3: Offline evaluation. (a) Trajectory of hit rate. (b) Exact match rates for top-\(k\) clusters over time.
We conducted A/B experiments within a short-form video recommendation system serving billions of users to measure the effectiveness of RAG in enhancing the performance of our LLM-powered interest exploration system. Gemini 1.5 [17] was adopted as the base LLM, while the process and pipeline are designed to be adaptable to other models. At a high level, the system recommends novel interest clusters based on a user’s historical interest cluster sequence of length \(k=2\).
We report the user metrics of the live experiments in Figure 4. The x-axis represents the date, and the y-axis shows the relative percentage difference between treatment and control. We also report the mean and 95% confidence intervals for each metric. The top-tier metric, Satisfied User Outcomes, increased by \(0.11\%\) with a 95% confidence interval of \([0.00\%, 0.21\%]\), a meaningful gain at the scale of our system. Satisfaction Rate increased by \(0.25\%\) with interval \([0.01\%, 0.48\%]\). Dissatisfaction Rate decreased by \(0.05\%\) with interval \([-0.08\%, -0.01\%]\), and Negative Interaction decreased by \(0.04\%\) with interval \([-0.08\%, -0.01\%]\).
We employed RAG to update the cluster transition table on \(day_1\) and \(day_4\). Following these updates, we observed notable increases in user engagement, including significant improvements in Satisfied User Outcomes and Satisfaction Rate, indicating enhanced user satisfaction.
Figure 4: Live experiment results for user metrics. The x-axis represents the date; the y-axis represents the relative difference (in percentage) between the treatment and control groups. (a) Satisfied User Outcomes. (b) Satisfaction Rate. (c) Dissatisfaction Rate. (d) Negative Interaction.
This paper investigated the critical challenge of keeping LLM-powered recommendation systems updated. We conducted a comparative analysis of fine-tuning and RAG, proposing and validating a hybrid strategy. Our core finding is that combining monthly fine-tuning with sub-weekly RAG updates provides a robust, cost-effective solution for adapting to dynamic user interests, leading to significant improvements in online user satisfaction metrics in a large-scale production environment. Future work will explore more adaptive update cadences, where the frequency of RAG or fine-tuning is determined automatically based on the detected rate of interest drift.
Changping Meng is a software engineer at Google (YouTube). He received his PhD in Computer Science from Purdue University. His work primarily focuses on short-form video recommendations.