What Matters in LLM-Based Feature Extractor for Recommender?
A Systematic Analysis of Prompts, Models, and Adaptation
September 18, 2025
Using Large Language Models (LLMs) to generate semantic features has proven to be a powerful paradigm for enhancing Sequential Recommender Systems (SRS). This typically involves three stages: processing item text, extracting features with LLMs, and adapting them for downstream models. However, existing methods vary widely in prompting, architecture, and adaptation strategies, making it difficult to fairly compare design choices and identify what truly drives performance. In this work, we propose RecXplore, a modular analytical framework that decomposes the LLM-as-feature-extractor pipeline into four modules: data processing, semantic feature extraction, feature adaptation, and sequential modeling. Instead of proposing new techniques, RecXplore revisits and organizes established methods, enabling systematic exploration of each module in isolation. Experiments on four public datasets show that simply combining the best designs from existing techniques—without exhaustive search—yields up to 18.7% relative improvement in NDCG@5 and 12.7% in HR@5 over strong baselines. These results underscore the utility of modular benchmarking for identifying effective design patterns and promoting standardized research in LLM-enhanced recommendation.
Recently, large language models (LLMs), known for their strong semantic understanding capabilities, have been increasingly integrated into recommender systems, demonstrating substantial potential in user modeling [1], item representation [2], and reasoning tasks [3]. Among the various LLM-based approaches, two paradigms have received the most attention: generation-based recommenders (Fig. 1 (a)) that place LLMs at the core, and representation-based methods that leverage LLMs as feature extractors to enhance traditional models (Fig. 1 (b)). The latter offers key advantages: semantic embeddings can be precomputed offline to eliminate real-time inference latency and can be seamlessly integrated into existing architectures, making it more practical for real-world deployment [4]. Yet, how best to design and deploy this paradigm remains an open question, motivating a closer look at its internal mechanisms.
The LLM-as-feature-extractor paradigm typically follows three stages: first, transforming structured or unstructured item attributes (e.g., title, category, brand) into natural language prompts; second, feeding the prompt into an LLM to extract semantic embeddings through a designated aggregation strategy; and third, adapting the high-dimensional LLM outputs (e.g., 4096-dim vectors from LLaMA [5]) to a compact representation compatible with downstream recommender models. Finally, the adapted representations are fed into sequential recommenders such as SASRec [6] for training. Existing studies have explored a variety of techniques for each stage. For prompt construction, [7], [8] adopt simple attribute concatenation, while others incorporate keyword extraction [9], summarization [10], or knowledge-enhanced prompting [11], [12]. In the text encoding stage, aggregation strategies range from mean pooling to last-token or special-token representations [13], [14]. For parameter tuning, earlier works [7], [15] mostly froze the LLM, whereas recent studies [16] show that lightweight fine-tuning methods, such as supervised contrastive learning, can further boost performance. Feature adaptation designs include linear projection [17], multilayer perceptrons (MLPs) [18], and mixture-of-expert (MoE) [7] networks.
Despite these efforts, existing studies often investigate isolated components or propose single-model designs, resulting in limited comparability across methods. The absence of a unified framework makes it difficult to disentangle the effects of individual design choices, thereby hindering a holistic understanding and principled advancement of this paradigm. This raises a natural question: How do design choices within the LLM-as-feature-extractor pipeline impact recommendation performance? Moreover, can we attain stronger results by simply combining well-established techniques, without resorting to overly complex architectures?
To address this question, we propose RecXplore—a modular and reproducible analytical framework for fair and systematic evaluation of the LLM-as-feature-extractor pipeline. We decompose the pipeline into four core modules: Data Processing, Feature Extraction, Feature Adaptation, and Sequential Modeling. Through controlled experiments on four widely-used public datasets, we derive several key insights. First, simple attribute concatenation serves as a robust prompting strategy, while excessive prompt engineering often introduces noise and degrades performance. Second, a two-stage fine-tuning pipeline—continued pretraining (CPT) followed by supervised fine-tuning (SFT)—yields superior semantic representations, with mean pooling outperforming other aggregation methods. Third, a hybrid adapter that combines principal component analysis (PCA) with a mixture-of-experts (MoE) architecture proves most effective for feature adaptation. Finally, when LLM-derived semantic embeddings are sufficiently rich, traditional ID embeddings provide marginal benefit, and direct replacement emerges as the most efficient integration strategy.
Building on these insights, we instantiate the most effective design choices into RecXplore, which consistently outperforms strong baselines across all datasets and evaluation metrics, achieving up to a 12.7% and 18.7% relative improvement in HR@5 and NDCG@5, respectively. These results underscore the effectiveness and practical value of systematic modular analysis. Our main contributions are as follows:
We introduce RecXplore, the first modular framework for systematic analysis of the "LLM-as-feature-extractor" paradigm, explicitly decoupling key components within the recommendation pipeline.
We conduct comprehensive evaluations of representative design choices for each module across multiple datasets, establishing effective and reusable practices to guide future development.
We demonstrate that, without introducing complex architectures, a simple combination of the best-performing designs from each module in RecXplore is sufficient to achieve consistently strong performance across diverse datasets and evaluation metrics.
Sequential recommendation aims to predict the next item a user will interact with based on their behavioral history. Early methods used Markov chains [19] to model short-term dependencies. With deep learning, RNN-based models such as GRU4Rec [20] and CNN-based models like NextItNet [21] and Caser [22] were proposed to better capture sequential patterns. Transformer-based models, including SASRec [6] and BERT4Rec [23], further improved performance by modeling long-range dependencies. Recent advances introduce Mamba-based architectures [24]–[26] that offer linear inference complexity for long sequences, and diffusion-based models [27]–[29] that enhance recommendation by modeling uncertainty through generative denoising processes. However, despite their success, these models typically ignore the rich semantic information contained in item attributes such as titles, descriptions, and categories, which can provide valuable complementary signals beyond interaction IDs.
Recent research has explored two primary paradigms for integrating LLMs into recommendation pipelines. The first paradigm is LLM-centric, in which user behaviors and item attributes are converted into natural language descriptions and processed by a pre-trained LLM to directly generate recommendation outputs [30]–[32]. This approach treats recommendation as a language modeling task and leverages the generative strengths of LLMs. The second paradigm explores the use of LLMs as feature extractors to enhance recommender systems, focusing on different components such as prompt design [7], [10], [12], aggregation strategies [13], [14], or adaptation modules [8], [16], [18]. While these studies demonstrate the potential of LLM-enhanced representations, they typically investigate specific techniques in isolation or under inconsistent setups, making it difficult to draw general conclusions about what design choices matter and why.
In contrast, our work introduces RecXplore, a unified and modular framework that systematically decomposes the LLM-as-feature-extractor pipeline into four core modules. This design enables controlled comparisons across design dimensions and datasets, allowing us to derive actionable principles for building effective LLM-enhanced recommender systems.
To systematically analyze the LLM-as-feature-extractor paradigm, we adopt the following analytical framework (Fig. 2):
Data Processing Module: Converts raw item attributes into a textual input format suitable for the LLM.
Feature Extraction Module: Encodes the textual input into semantically rich feature embeddings using an LLM.
Feature Adaptation Module: Performs transformation, dimensionality reduction, and adaptation on the high-dimensional semantic embeddings.
Sequential Modeling Module: Models user behavior sequences with the adapted features within the downstream recommendation model.
The framework consists of four core modules, with data flowing sequentially from data processing to the final recommendation. This modular structure facilitates the testing and analysis of how different module designs and implementations affect overall performance. Next, we delve into the detailed design and exploration setup for each core module.
The core objective of this module is to construct high-quality input representations for items by converting raw item attributes (e.g., price, category, brand) into a format suitable for the LLM. Our investigation unfolds along two primary dimensions: Template-based Methods and LLM-based Semantic Enhancement.
This method aims to consolidate an item’s structured and unstructured information into a coherent natural language input using a predefined template. We test the Simple Attribute Concatenation approach, which combines all available attributes (e.g., brand, category, description) into a comprehensive descriptive sentence to serve as the LLM’s input.
This approach leverages the capabilities of an additional LLM to refine or enrich the input text. We evaluate three techniques: (1) Keyword Extraction, which uses the LLM to extract keyword information from item attributes; (2) Summarization, which uses the LLM to summarize item attributes into one or two sentences, reducing redundancy; (3) Knowledge Expansion, which utilizes the LLM’s world knowledge to augment the original item attributes.
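To make the two routes concrete, the sketch below contrasts template-based concatenation with LLM-based enhancement. This is a minimal illustration: the attribute names, the summarization prompt, and the `llm_generate` callable are assumptions standing in for our actual templates and generation interface.

```python
# Minimal sketch of the two data-processing routes. Attribute names and
# the summarization prompt are illustrative, not the exact templates.

def simple_attribute_concatenation(item: dict) -> str:
    """Template-based: join all available attributes into one sentence."""
    parts = [f"{key}: {value}" for key, value in item.items() if value]
    return "Here is an item. " + "; ".join(parts) + "."

def summarize_with_llm(item: dict, llm_generate) -> str:
    """LLM-based enhancement: compress the attributes into 1-2 sentences.
    `llm_generate` is any prompt-in, text-out callable (assumed interface)."""
    prompt = ("Summarize the following product in at most two sentences:\n"
              + simple_attribute_concatenation(item))
    return llm_generate(prompt)

item = {"title": "Hydrating Face Cream", "brand": "Acme",
        "category": "Beauty", "description": "A lightweight daily moisturizer."}
print(simple_attribute_concatenation(item))
```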
This module aims to utilize an LLM to transform the processed textual input into semantic embeddings. We focus on two core strategies in this module: the LLM fine-tuning strategy and the feature aggregation strategy.
This strategy investigates whether the LLM’s parameters should be optimized for the recommendation task. We compare two primary approaches: (1) Frozen LLM, which keeps the LLM’s parameters fixed during training, directly leveraging its zero-shot capabilities; and (2) Fine-tuned LLM, which applies Parameter-Efficient Fine-Tuning (PEFT) techniques, such as LoRA, to optimize the LLM on recommendation data. Within the fine-tuning approach, we further evaluate several methods, including Continued Pre-training (CPT), Supervised Fine-tuning (SFT), Supervised Contrastive Fine-Tuning (SCFT) [16], and a combined CPT+SFT strategy, to identify the optimal implementation for using LLM as a feature extractor in recommendation tasks.
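To illustrate the fine-tuning setup, the sketch below attaches LoRA adapters to a LLaMA-style model via the Hugging Face `peft` library. The rank and scaling (r = 8, α = 32) follow our experimental setup; the checkpoint name, target modules, and dropout are illustrative assumptions.

```python
# Hedged sketch of PEFT with LoRA; only the low-rank adapter weights
# are trained, whichever objective (CPT, SFT, SCFT, or CPT+SFT) is used.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # assumed checkpoint
config = LoraConfig(
    r=8,                                  # low-rank dimension
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,                    # assumption
    target_modules=["q_proj", "v_proj"],  # attention projections (assumption)
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # only LoRA weights are trainable
```

In the combined CPT+SFT strategy, the same adapter setup is trained first with next-token prediction on domain item texts (CPT) and then on the task-specific recommendation objective (SFT).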
This strategy defines how to aggregate the sequence of token-level vectors from the LLM’s output to generate a final, item-level embedding. We systematically compare four mainstream methods: (1) Mean Pooling, which averages the hidden states of all tokens in the final layer; (2) Max Pooling, which takes the maximum value across the hidden states; (3) Last Token, which directly uses the hidden state of the sequence’s final token as the representation; and (4) Explicit One-word Limitation (EOL) [33], which prompts the model to distill item information into a single word and uses this word’s embedding as the final representation.
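The three pooling-based strategies reduce to simple operations over the final-layer hidden states, as in the PyTorch sketch below (tensor shapes and right-padding are assumptions); EOL is instead realized through prompting and is therefore not shown.

```python
# Sketch of pooling-based aggregation over LLM hidden states.
import torch

def aggregate(hidden: torch.Tensor, mask: torch.Tensor, strategy: str) -> torch.Tensor:
    """hidden: [B, T, H] final-layer states; mask: [B, T], 1 for real tokens.
    Assumes right-padded sequences."""
    m = mask.unsqueeze(-1).float()                           # [B, T, 1]
    if strategy == "mean":                                   # average over real tokens
        return (hidden * m).sum(1) / m.sum(1).clamp(min=1.0)
    if strategy == "max":                                    # elementwise max over real tokens
        return hidden.masked_fill(m == 0, float("-inf")).max(dim=1).values
    if strategy == "last":                                   # last non-padding token
        idx = mask.long().sum(dim=1) - 1                     # [B]
        return hidden[torch.arange(hidden.size(0)), idx]
    raise ValueError(f"unknown strategy: {strategy}")
```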
This module adapts raw semantic embeddings for the downstream recommender backbone models, addressing potential issues like high dimensionality and feature space mismatch. To this end, we explore two primary approaches: the design of different Adaptation Architectures and various strategies for Fusion with item ID Embeddings.
This approach aims to identify optimal adaptation methods for semantic features, preserving or even enhancing key semantic information while reducing computational complexity. We systematically evaluate five distinct architectures: (1) Principal Component Analysis (PCA); (2) Linear Projection; (3) Multilayer Perceptron (MLP) Adapter; (4) Product Quantization (PQ) [34]; and (5) Mixture-of-Experts (MoE) [35], which uses a network of specialized experts for dynamic, input-dependent adaptation.
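As a concrete reference for the MoE design, the sketch below shows a softmax-gated mixture of small expert MLPs applied after PCA reduction. The 8-expert count follows our setup; the hidden sizes, gating design, and PCA dimension are illustrative assumptions.

```python
# Hedged sketch of a PCA + MoE adapter for LLM item embeddings.
import torch
import torch.nn as nn

class MoEAdapter(nn.Module):
    """Gated mixture of expert MLPs mapping (PCA-reduced) semantic
    embeddings to the recommender's embedding dimension."""
    def __init__(self, in_dim: int = 256, out_dim: int = 64, n_experts: int = 8):
        super().__init__()
        self.gate = nn.Linear(in_dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                          nn.Linear(out_dim, out_dim))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: [B, in_dim]
        weights = torch.softmax(self.gate(x), dim=-1)        # [B, E] gating scores
        outs = torch.stack([e(x) for e in self.experts], 1)  # [B, E, out_dim]
        return (weights.unsqueeze(-1) * outs).sum(dim=1)     # input-dependent mix

# PCA runs offline before the adapter, e.g. with scikit-learn:
# reduced = PCA(n_components=256).fit_transform(llm_embs)    # [N, 4096] -> [N, 256]
```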
To investigate whether item ID information can serve as an effective supplement to semantic embeddings, we compare three mainstream methods for fusing semantic and ID embeddings: (1) Replacement, in which the pre-trained ID embeddings are directly replaced by the semantic embeddings, so that the system relies entirely on semantic information; (2) Concatenation, which directly concatenates the semantic embeddings with the pre-trained ID embeddings; and (3) Alignment, which uses an additional alignment loss [16] to constrain the semantic embedding space to align with the pre-trained ID embedding space.
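These strategies amount to simple operations on a semantic embedding `s` and a pre-trained ID embedding `e`, as in the sketch below; the MSE alignment term is a simplified stand-in for the alignment loss of [16].

```python
# Hedged sketch of the three semantic/ID fusion strategies.
import torch
import torch.nn.functional as F

def fuse(s: torch.Tensor, e: torch.Tensor, strategy: str):
    """s: semantic embeddings [B, d]; e: pre-trained ID embeddings [B, d]."""
    if strategy == "replace":
        return s                                 # ID embeddings are discarded
    if strategy == "concat":
        return torch.cat([s, e], dim=-1)         # downstream layers see 2*d dims
    if strategy == "align":
        align_loss = F.mse_loss(s, e.detach())   # pull s toward the frozen ID space
        return s, align_loss                     # loss term added to the objective
    raise ValueError(strategy)
```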
This module serves as the backbone of the recommendation system, responsible for modeling user behavior sequences. In our study, we adopt the powerful and widely-used sequential recommendation model, SASRec [6]. SASRec takes the item embeddings from the feature adaptation module as input to capture users’ dynamic preferences and generate the final recommendation list.
We investigate two training modes, distinguished by whether the LLM’s parameters are updated:
Frozen LLM (Single-Stage Training): In this mode, the LLM’s parameters remain unchanged throughout training. The training process focuses solely on optimizing the parameters of the Adaptation Module and the downstream SASRec, driven entirely by the recommendation task’s loss function (e.g., cross-entropy loss).
Fine-tuned LLM (Two-Stage Training): This mode consists of two stages. In the first stage, the LLM is fine-tuned on recommendation data using Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA to instill domain knowledge. In the second stage, the fine-tuned LLM is frozen, and its improved output features are used to train the Adaptation Module and SASRec.
To ensure the low-latency responses required for industrial applications, all item semantic embeddings are pre-computed and cached offline. During online inference, the system only needs to perform a forward pass through the lightweight Adaptation Module and the Sequential Module. This completely avoids real-time calls to the LLM, thus guaranteeing efficient recommendation services.
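A schematic of this offline/online split is given below; the file name, shapes, and helper names are illustrative assumptions.

```python
# Hedged sketch of the deployment split: the LLM runs only offline.
import numpy as np

# Offline (once per catalog update): one LLM pass per item, then cache.
item_embs = np.random.randn(1000, 4096).astype(np.float32)  # stand-in for LLM outputs
np.save("item_embs.npy", item_embs)

# Online: no LLM on the request path; the cache is memory-mapped and only
# the lightweight adapter and SASRec run per request.
cached = np.load("item_embs.npy", mmap_mode="r")
user_history = [3, 17, 256]                                  # illustrative item IDs
batch = np.asarray(cached[user_history])                     # [3, 4096]
# scores = sasrec(adapter(torch.from_numpy(batch)))          # downstream (placeholder)
```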
We conduct systematic experiments based on the proposed RecXplore framework to analyze how design choices across different modules affect recommendation performance. Our study is guided by the following research questions:
RQ1 What are the specific impacts of different data processing strategies on overall performance?
RQ2 What is the impact of different feature aggregation strategies on the final recommendation performance?
RQ3 What is the impact of different LLM fine-tuning strategies on recommendation quality?
RQ4 What is the impact of different feature adaptation strategies on the final recommendation performance?
RQ5 Can pre-trained item ID embeddings bring significant additional performance gains on top of the semantic embeddings?
RQ6 Can the optimal combination of components, distilled from our systematic decoupled analysis, outperform state-of-the-art methods, and by what margin?
We conduct experiments on four real-world datasets, i.e., the Steam dataset [36] and three datasets from Amazon product reviews [37]: Beauty, Fashion, and Games. These datasets span different product domains and user behavior patterns, serving as common benchmarks for evaluating sequential recommendation models. To ensure consistency, we adopt the data pre-processing method proposed in LLMEmb [16] for all datasets.
This base configuration integrates the most straightforward and representative design choices from each module, serving as the reference benchmark for all subsequent decoupled analyses. The base configuration is composed as follows:
Data Processing: Uses the ‘Simple Attribute Concatenation’ strategy, which combines the values of item attributes to form the input text. This strategy serves as the default benchmark for our experiments.
Feature Extraction: Utilizes a frozen-parameter LLaMA model combined with Mean Pooling to efficiently convert the input text into a single item embedding.
Feature Adaptation: Reduces feature dimensionality using a single-layer Linear Projection adapter, without processing via Principal Component Analysis (PCA).
Sequential Modeling: Adopts SASRec as the downstream backbone to model user behavior sequences.
We conduct our experiments on a server with an Intel Xeon Platinum 8457C CPU and 8 NVIDIA L20 GPUs. We choose LLaMA-7B [38] as the base model for its superior performance over alternatives (e.g., BERT, RoBERTa) in preliminary tests. We employ LoRA (r = 8, α = 32) and an 8-expert MoE adapter for parameter-efficient fine-tuning. Full details on the model comparison and hyperparameters are available in the Appendix.
Following previous work [6], we use Hit Rate (HR@K) and Normalized Discounted Cumulative Gain (NDCG@K) as our evaluation metrics, with K set to 5 and 10. For all metrics, higher values indicate better performance. During evaluation, each ground-truth item in the test set is ranked against 100 randomly sampled items the user has not interacted with. To reduce randomness, all experiments are run three times with different random seeds (42, 43, 44), and we report the averaged metrics. As the standard deviation across all runs was consistently ≤ 0.002, we omit it from the tables for brevity.
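For reference, the sketch below computes the two metrics under this sampled protocol; the helper name is ours, and `rank` denotes the position of the ground-truth item among the 101 scored candidates.

```python
# Hedged sketch of HR@K / NDCG@K under the 100-negative sampling protocol.
import numpy as np

def hr_ndcg_at_k(rank: int, k: int):
    """rank: 0-based position of the ground-truth item among 101 candidates
    (1 positive + 100 sampled negatives), after sorting by model score."""
    hit = 1.0 if rank < k else 0.0
    ndcg = 1.0 / np.log2(rank + 2) if rank < k else 0.0  # DCG of a single relevant item
    return hit, ndcg

print(hr_ndcg_at_k(rank=2, k=5))  # ground truth ranked 3rd -> (1.0, 0.5)
```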
Table 1: Impact of data processing strategies on recommendation performance (H = HR, N = NDCG).

| Dataset | Method | H@5 | H@10 | N@5 | N@10 |
|---|---|---|---|---|---|
| Beauty | SAC | 0.4282 | 0.5525 | 0.3110 | 0.3511 |
|  | Keyword | 0.4210 | 0.5443 | 0.3052 | 0.3450 |
|  | Summary | 0.4179 | 0.5418 | 0.3017 | 0.3418 |
|  | Expansion | 0.4269 | 0.5515 | 0.3093 | 0.3495 |
| Games | SAC | 0.5442 | 0.7019 | 0.3850 | 0.4362 |
|  | Keyword | 0.5508 | 0.7038 | 0.3928 | 0.4425 |
|  | Summary | 0.5275 | 0.6827 | 0.3741 | 0.4244 |
|  | Expansion | 0.5430 | 0.6990 | 0.3846 | 0.4352 |
| Fashion | SAC | 0.5121 | 0.5832 | 0.4574 | 0.4803 |
|  | Keyword | 0.5017 | 0.5659 | 0.4443 | 0.4650 |
|  | Summary | 0.5184 | 0.5860 | 0.4599 | 0.4816 |
|  | Expansion | 0.4971 | 0.5628 | 0.4394 | 0.4606 |
| Steam | SAC | 0.5592 | 0.7106 | 0.4043 | 0.4469 |
|  | Keyword | 0.5372 | 0.6989 | 0.3968 | 0.4391 |
|  | Summary | 0.5495 | 0.7041 | 0.3993 | 0.4412 |
|  | Expansion | 0.5568 | 0.7086 | 0.4029 | 0.4479 |
Table 2: Impact of feature aggregation strategies on recommendation performance.

| Dataset | Strategy | H@5 | H@10 | N@5 | N@10 |
|---|---|---|---|---|---|
| Beauty | EOL | 0.4035 | 0.5224 | 0.2926 | 0.3310 |
|  | Last Token | 0.3920 | 0.5166 | 0.2806 | 0.3209 |
|  | Max Pooling | 0.1801 | 0.2695 | 0.1264 | 0.1550 |
|  | Mean Pooling | 0.4282 | 0.5525 | 0.3110 | 0.3511 |
| Games | EOL | 0.5339 | 0.6914 | 0.3745 | 0.4256 |
|  | Last Token | 0.5339 | 0.6907 | 0.3752 | 0.4260 |
|  | Max Pooling | 0.4296 | 0.5962 | 0.2936 | 0.3474 |
|  | Mean Pooling | 0.5442 | 0.7019 | 0.3850 | 0.4362 |
| Fashion | EOL | 0.4977 | 0.5692 | 0.4327 | 0.4557 |
|  | Last Token | 0.5031 | 0.5746 | 0.4361 | 0.4591 |
|  | Max Pooling | 0.4085 | 0.4709 | 0.3362 | 0.3564 |
|  | Mean Pooling | 0.5121 | 0.5832 | 0.4574 | 0.4803 |
| Steam | EOL | 0.5502 | 0.7028 | 0.3949 | 0.4444 |
|  | Last Token | 0.5393 | 0.6950 | 0.3860 | 0.4365 |
|  | Max Pooling | 0.4511 | 0.6153 | 0.3020 | 0.3552 |
|  | Mean Pooling | 0.5592 | 0.7106 | 0.4043 | 0.4469 |
Table 3: Impact of LLM fine-tuning strategies on recommendation performance.

| Dataset | Strategy | H@5 | H@10 | N@5 | N@10 |
|---|---|---|---|---|---|
| Beauty | Frozen | 0.4282 | 0.5525 | 0.3110 | 0.3511 |
|  | SCFT | 0.4641 | 0.5705 | 0.3558 | 0.3901 |
|  | CPT | 0.4803 | 0.5903 | 0.3679 | 0.4034 |
|  | SFT | 0.4769 | 0.5886 | 0.3623 | 0.3984 |
|  | CPT+SFT | 0.4812 | 0.5952 | 0.3633 | 0.4057 |
| Games | Frozen | 0.5442 | 0.7019 | 0.3850 | 0.4362 |
|  | SCFT | 0.6093 | 0.7443 | 0.4475 | 0.4905 |
|  | CPT | 0.5800 | 0.7261 | 0.4180 | 0.4654 |
|  | SFT | 0.6108 | 0.7477 | 0.4518 | 0.4967 |
|  | CPT+SFT | 0.6140 | 0.7511 | 0.4569 | 0.4988 |
| Fashion | Frozen | 0.5121 | 0.5832 | 0.4574 | 0.4803 |
|  | SCFT | 0.5241 | 0.5740 | 0.4819 | 0.4980 |
|  | CPT | 0.5197 | 0.5823 | 0.4628 | 0.4830 |
|  | SFT | 0.5212 | 0.5837 | 0.4664 | 0.4865 |
|  | CPT+SFT | 0.5270 | 0.5915 | 0.4680 | 0.4887 |
| Steam | Frozen | 0.5592 | 0.7106 | 0.4043 | 0.4469 |
|  | SCFT | 0.5964 | 0.7432 | 0.4383 | 0.4859 |
|  | CPT | 0.5800 | 0.7380 | 0.4173 | 0.4686 |
|  | SFT | 0.5929 | 0.7416 | 0.4328 | 0.4811 |
|  | CPT+SFT | 0.6056 | 0.7496 | 0.4427 | 0.4895 |
Table 4: Impact of adaptation architectures, with and without PCA pre-processing, on recommendation performance.

| Dataset | Metric | Linear w/o PCA | Linear w/ PCA | MLP w/o PCA | MLP w/ PCA | PQ w/o PCA | PQ w/ PCA | MoE w/o PCA | MoE w/ PCA |
|---|---|---|---|---|---|---|---|---|---|
| Beauty | H@5 | 0.4812 | 0.4750 | 0.4641 | 0.5053 | 0.3905 | 0.3753 | 0.5104 | 0.5066 |
|  | H@10 | 0.5952 | 0.5831 | 0.5814 | 0.6135 | 0.5008 | 0.4734 | 0.6155 | 0.6053 |
|  | N@5 | 0.3633 | 0.3681 | 0.3479 | 0.3908 | 0.2869 | 0.2868 | 0.3978 | 0.3957 |
|  | N@10 | 0.4057 | 0.4029 | 0.3859 | 0.4288 | 0.3225 | 0.3185 | 0.4308 | 0.4268 |
| Games | H@5 | 0.6140 | 0.6038 | 0.6174 | 0.6423 | 0.5282 | 0.4878 | 0.6355 | 0.6464 |
|  | H@10 | 0.7511 | 0.7369 | 0.7563 | 0.7641 | 0.6738 | 0.6219 | 0.7624 | 0.7675 |
|  | N@5 | 0.4569 | 0.4459 | 0.4558 | 0.4836 | 0.3825 | 0.3524 | 0.4803 | 0.4896 |
|  | N@10 | 0.4988 | 0.4891 | 0.5009 | 0.5268 | 0.4297 | 0.3958 | 0.5216 | 0.5289 |
| Fashion | H@5 | 0.5270 | 0.5489 | 0.5285 | 0.5423 | 0.4745 | 0.5054 | 0.5302 | 0.5544 |
|  | H@10 | 0.5915 | 0.6088 | 0.5993 | 0.5959 | 0.5309 | 0.5547 | 0.5953 | 0.6112 |
|  | N@5 | 0.4680 | 0.5105 | 0.4678 | 0.4965 | 0.4108 | 0.4656 | 0.4675 | 0.5088 |
|  | N@10 | 0.4887 | 0.5255 | 0.4906 | 0.5138 | 0.4322 | 0.4815 | 0.4886 | 0.5282 |
| Steam | H@5 | 0.6056 | 0.5907 | 0.5942 | 0.5991 | 0.5185 | 0.4870 | 0.6027 | 0.6227 |
|  | H@10 | 0.7496 | 0.7338 | 0.7466 | 0.7495 | 0.6748 | 0.6262 | 0.7511 | 0.7683 |
|  | N@5 | 0.4427 | 0.4310 | 0.4307 | 0.4371 | 0.3695 | 0.3520 | 0.4318 | 0.4612 |
|  | N@10 | 0.4895 | 0.4775 | 0.4802 | 0.4860 | 0.4201 | 0.3971 | 0.4813 | 0.5062 |
To answer RQ1, we investigate the impact of different data processing strategies on recommendation performance. The experiment is benchmarked against the Base Configuration, which uses the straightforward Simple Attribute Concatenation (SAC) strategy. Building upon this base, we fix all other modules (e.g., Feature Extraction, Adaptation) and vary only the data processing method, evaluating the three LLM-based Semantic Enhancement methods: (1) Keyword Extraction, (2) Summarization, and (3) Knowledge Expansion. Detailed results are presented in Table 1.
Our experiments on data processing highlight the following key takeaways:
Complex Enhancement is Suboptimal. Additional attribute enhancements (e.g., keyword extraction, summarization) yield minimal gains and in some cases degrade performance. This can be attributed to the downstream encoder’s intrinsic ability to understand semantics, which makes further processing of already-structured data redundant and potentially lossy.
Simplicity is Optimal. With a powerful downstream encoder, simple attribute concatenation proves to be the most effective and efficient strategy, eliminating the need for complex attribute enhancements.
Building on the optimal finding from RQ1 (SAC), we evaluate different feature aggregation strategies by comparing the default Mean Pooling against three other methods. The results are shown in Table 2.
Our experiments yield the following takeaways:
Mean Pooling is Optimal. It consistently outperforms all other strategies, as integrating all token information provides a more comprehensive semantic representation.
Max Pooling Performs Poorly. Max Pooling is clearly the worst-performing strategy on all datasets, likely because it over-focuses on isolated salient features, which harms representation quality. The EOL and Last Token strategies fall in between, outperforming Max Pooling but trailing Mean Pooling.
Building upon the optimal findings from RQ1 and RQ2, this experiment fixes the data processing to Simple Attribute Concatenation and the feature aggregation to Mean Pooling, and evaluates the impact of different LLM fine-tuning paradigms. The results are shown in Table 3.
Our experiments highlight the following key takeaways:
Fine-tuning helps. Any form of fine-tuning significantly outperforms keeping the LLM frozen, confirming that domain adaptation is essential for improving recommendation quality.
CPT+SFT is optimal. The two-stage CPT+SFT combination demonstrates the highest overall performance. This result strongly supports the effectiveness of a combined domain-first (CPT) and task-specific (SFT) training path.
Table 5: Overall performance comparison between RecXplore and baseline methods. "Improvement" denotes the relative gain of our method over the strongest baseline.

| Dataset | Metric | GRU4Rec | BERT4Rec | SASRec | SAID | LLMESR | LLMEmb | Ours | Improvement |
|---|---|---|---|---|---|---|---|---|---|
| Beauty | H@5 | 0.2636 | 0.2867 | 0.3290 | 0.4218 | 0.4511 | 0.4362 | 0.5066* | +12.3% |
|  | H@10 | 0.3639 | 0.3991 | 0.4193 | 0.4998 | 0.5692 | 0.5410 | 0.6053* | +6.3% |
|  | N@5 | 0.1836 | 0.1998 | 0.2521 | 0.3067 | 0.3459 | 0.3340 | 0.3957* | +14.4% |
|  | N@10 | 0.2160 | 0.2361 | 0.2812 | 0.3244 | 0.3741 | 0.3678 | 0.4268* | +14.1% |
| Games | H@5 | 0.3604 | 0.4165 | 0.5489 | 0.5603 | 0.5734 | 0.5654 | 0.6464* | +12.7% |
|  | H@10 | 0.4906 | 0.5529 | 0.6813 | 0.6984 | 0.7080 | 0.7069 | 0.7675* | +8.4% |
|  | N@5 | 0.2529 | 0.2955 | 0.3989 | 0.4096 | 0.4074 | 0.4126 | 0.4896* | +18.7% |
|  | N@10 | 0.2949 | 0.3396 | 0.4682 | 0.4727 | 0.4791 | 0.4892 | 0.5289* | +8.1% |
| Fashion | H@5 | 0.4196 | 0.4102 | 0.4683 | 0.4982 | 0.5167 | 0.5096 | 0.5544* | +7.3% |
|  | H@10 | 0.4822 | 0.4679 | 0.5030 | 0.5327 | 0.5678 | 0.5554 | 0.6112* | +7.6% |
|  | N@5 | 0.3571 | 0.3683 | 0.4364 | 0.4389 | 0.4635 | 0.4710 | 0.5088* | +8.0% |
|  | N@10 | 0.3773 | 0.3868 | 0.4619 | 0.4679 | 0.4866 | 0.4856 | 0.5282* | +8.5% |
| Steam | H@5 | 0.5216 | 0.5165 | 0.5541 | 0.5607 | 0.5811 | 0.5739 | 0.6227* | +7.2% |
|  | H@10 | 0.6861 | 0.6779 | 0.6954 | 0.7028 | 0.7353 | 0.7327 | 0.7683* | +4.5% |
|  | N@5 | 0.3673 | 0.3674 | 0.4016 | 0.4089 | 0.4311 | 0.4302 | 0.4612* | +7.0% |
|  | N@10 | 0.4207 | 0.4196 | 0.4461 | 0.4573 | 0.4807 | 0.4789 | 0.5062* | +5.3% |
Table 6: Comparison of strategies for fusing semantic embeddings with pre-trained ID embeddings, under the Linear and MoE adapters.

| Dataset | Adapter | Strategy | H@5 | H@10 | N@5 | N@10 |
|---|---|---|---|---|---|---|
| Beauty | Linear | Replace | 0.4750 | 0.5831 | 0.3681 | 0.4029 |
|  |  | Concat | 0.4649 | 0.5721 | 0.3567 | 0.3913 |
|  |  | Align | 0.4741 | 0.5817 | 0.3644 | 0.3992 |
|  | MoE | Replace | 0.5066 | 0.6053 | 0.3957 | 0.4268 |
|  |  | Concat | 0.4485 | 0.5744 | 0.3401 | 0.3753 |
|  |  | Align | 0.5001 | 0.6035 | 0.3885 | 0.4220 |
| Games | Linear | Replace | 0.6038 | 0.7369 | 0.4459 | 0.4891 |
|  |  | Concat | 0.6390 | 0.7555 | 0.4940 | 0.5318 |
|  |  | Align | 0.5995 | 0.7367 | 0.4390 | 0.4835 |
|  | MoE | Replace | 0.6464 | 0.7675 | 0.4896 | 0.5289 |
|  |  | Concat | 0.6338 | 0.7496 | 0.4893 | 0.5269 |
|  |  | Align | 0.6346 | 0.7564 | 0.4792 | 0.5187 |
| Fashion | Linear | Replace | 0.5489 | 0.6088 | 0.5105 | 0.5255 |
|  |  | Concat | 0.5297 | 0.5822 | 0.4819 | 0.4989 |
|  |  | Align | 0.5554 | 0.6095 | 0.5075 | 0.5249 |
|  | MoE | Replace | 0.5544 | 0.6112 | 0.5088 | 0.5282 |
|  |  | Concat | 0.5388 | 0.5895 | 0.4968 | 0.5131 |
|  |  | Align | 0.5477 | 0.6028 | 0.4985 | 0.5163 |
| Steam | Linear | Replace | 0.5907 | 0.7338 | 0.4310 | 0.4775 |
|  |  | Concat | 0.5975 | 0.7423 | 0.4404 | 0.4873 |
|  |  | Align | 0.5914 | 0.7317 | 0.4348 | 0.4804 |
|  | MoE | Replace | 0.6227 | 0.7683 | 0.4612 | 0.5062 |
|  |  | Concat | 0.5975 | 0.7453 | 0.4401 | 0.4880 |
|  |  | Align | 0.6219 | 0.7630 | 0.4579 | 0.5038 |
Building upon the optimal strategies from RQ1-RQ3 (SAC, Mean Pooling, and CPT+SFT), this experiment evaluates the impact of different adaptation architectures. We compare four architectures—Linear, MLP, PQ, and MoE—each with and without PCA processing, with results presented in Table 4.
Our experimental results reveal several key takeaways regarding adapter design:
MoE architecture is superior. Among all evaluated architectures, MoE consistently demonstrates the best performance. This highlights the superiority of its dynamic, input-dependent adaptation capabilities in capturing complex semantics.
PCA helps MoE perform better. PCA is not universally helpful and can hurt performance with other adapters. When combined with MoE, however, it achieves the best results on three of our four datasets.
To answer RQ5, we investigate whether fusing traditional ID embeddings is always beneficial, and whether the effect depends on the downstream adapter’s architecture. The experiments are based on two adapters from RQ4 with different expressive capacities: the top-performing MoE adapter and the efficient Linear adapter. On top of these two architectures, we compare three integration strategies: (1) Direct Replacement (our baseline), (2) Concatenation, and (3) Alignment. The results are presented in Table 6.
Our experimental results reveal several key takeaways:
Powerful Adapters Render ID Embeddings Redundant. For the powerful MoE adapter, the rich semantic information it generates renders ID embeddings redundant, thus the Direct Replacement strategy performs best.
ID Embeddings Effectively Supplement Simpler Adapters. Conversely, for the simpler Linear adapter, fusing ID information effectively supplements its limited semantic capacity, with Concatenation or Alignment yielding superior performance in most cases.
Through the systematic, decoupled analysis in prior sections, we have identified a set of optimal practices for the LLM-as-feature-extractor paradigm. Unlike the base configuration, this optimized setup integrates: (1) Simple Attribute Concatenation for processing item attributes; (2) Mean Pooling for feature aggregation; (3) a two-stage CPT+SFT strategy for LLM fine-tuning; (4) an MoE Adapter with PCA for feature adaptation; and (5) a Direct Replacement strategy, omitting item ID information, for the backbone recommender.
We evaluate RecXplore against two categories of baselines: traditional sequential models (GRU4Rec [20], BERT4Rec [23], SASRec [6]) and recent LLM-enhanced methods (SAID [39], LLMESR [40], LLMEmb [16]).
As shown in Table 5, our optimal configuration, RecXplore, comprehensively outperforms all baselines, achieving improvements across the metrics on all datasets. This result demonstrates that our approach of systematically decoupling and optimizing individual components can yield a superior configuration.
We proposed RecXplore, a modular framework for optimizing the LLM-as-feature-extractor paradigm in sequential recommendation. By decoupling the pipeline into four components—data processing, feature extraction, feature adaptation, and sequential modeling—and analyzing each in isolation, we identified effective design choices that, when combined into RecXplore, consistently outperformed both traditional and LLM-based baselines across four datasets. While effective, our approach uses a greedy optimization strategy that may miss globally optimal configurations and focuses solely on the SASRec backbone; future work can explore joint optimization and extend the analysis to other architectures.