What Matters in LLM-Based Feature Extractor for Recommender?
A Systematic Analysis of Prompts, Models, and Adaptation
September 18, 2025
Using Large Language Models (LLMs) to generate semantic features has proven to be a powerful paradigm for enhancing Sequential Recommender Systems (SRS). This typically involves three stages: processing item text, extracting features with LLMs, and adapting them for downstream models. However, existing methods vary widely in prompting, architecture, and adaptation strategies, making it difficult to fairly compare design choices and identify what truly drives performance. In this work, we propose RecXplore, a modular analytical framework that decomposes the LLM-as-feature-extractor pipeline into four modules: data processing, semantic feature extraction, feature adaptation, and sequential modeling. Instead of proposing new techniques, RecXplore revisits and organizes established methods, enabling systematic exploration of each module in isolation. Experiments on four public datasets show that simply combining the best designs from existing techniques—without exhaustive search—yields up to 18.7% relative improvement in NDCG@5 and 12.7% in HR@5 over strong baselines. These results underscore the utility of modular benchmarking for identifying effective design patterns and promoting standardized research in LLM-enhanced recommendation.
Recently, large language models (LLMs), known for their strong semantic understanding capabilities, have been increasingly integrated into recommender systems, demonstrating substantial potential in user modeling [1], item representation [2], and reasoning tasks [3]. Among the various LLM-based approaches, two paradigms have received the most attention: generation-based recommenders (Fig. 1 (a)) that place LLMs at the core, and representation-based methods that leverage LLMs as feature extractors to enhance traditional models (Fig. 1 (b)). The latter offers key advantages: semantic embeddings can be precomputed offline to eliminate real-time inference latency and can be seamlessly integrated into existing architectures, making it more practical for real-world deployment [4]. Yet, how best to design and deploy this paradigm remains an open question, motivating a closer look at its internal mechanisms.
The LLM-as-feature-extractor paradigm typically follows three stages: first, transforming structured or unstructured item attributes (e.g., title, category, brand) into natural language prompts; second, feeding the prompt into an LLM to extract semantic embeddings through a designated aggregation strategy; and third, adapting the high-dimensional LLM outputs (e.g., 4096-dim vectors from LLaMA [5]) to a compact representation compatible with downstream recommender models. Finally, the adapted representations are fed into sequential recommenders such as SASRec [6] for training. Existing studies have explored a variety of techniques for each stage. For prompt construction, [7], [8] adopt simple attribute concatenation, while others incorporate keyword extraction [9], summarization [10], or knowledge-enhanced prompting [11], [12]. In the text encoding stage, aggregation strategies range from mean pooling to last-token or special-token representations [13], [14]. For parameter tuning, earlier works [7], [15] mostly froze the LLM, whereas recent studies [16] show that lightweight fine-tuning methods, such as supervised contrastive learning, can further boost performance. Feature adaptation designs include linear projection [17], multilayer perceptrons (MLPs) [18], and mixture-of-expert (MoE) [7] networks.
Despite these efforts, existing studies often investigate isolated components or propose single-model designs, resulting in limited comparability across methods. The absence of a unified framework makes it difficult to disentangle the effects of individual design choices, thereby hindering a holistic understanding and principled advancement of this paradigm. This raises a natural question: How do design choices within the LLM-as-feature-extractor pipeline impact recommendation performance? Moreover, can we attain stronger results by simply combining well-established techniques, without resorting to overly complex architectures?
To address this question, we propose RecXplore—a modular and reproducible analytical framework for fair and systematic evaluation of the LLM-as-feature-extractor pipeline. We decompose the pipeline into four core modules: Data Processing, Feature Extraction, Feature Adaptation, and Sequential Modeling. Through controlled experiments on four widely-used public datasets, we derive several key insights. First, simple attribute concatenation serves as a robust prompting strategy, while excessive prompt engineering often introduces noise and degrades performance. Second, a two-stage fine-tuning pipeline—continued pretraining (CPT) followed by supervised fine-tuning (SFT)—yields superior semantic representations, with mean pooling outperforming other aggregation methods. Third, a hybrid adapter that combines principal component analysis (PCA) with a mixture-of-experts (MoE) architecture proves most effective for feature adaptation. Finally, when LLM-derived semantic embeddings are sufficiently rich, traditional ID embeddings provide marginal benefit, and direct replacement emerges as the most efficient integration strategy.
Building on these insights, we instantiate the most effective design choices into RecXplore, which consistently outperforms strong baselines across all datasets and evaluation metrics, achieving up to a 12.7% and 18.7% relative improvement in HR@5 and NDCG@5, respectively. These results underscore the effectiveness and practical value of systematic modular analysis. Our main contributions are as follows:
We introduce RecXplore, the first modular framework for systematic analysis of the "LLM-as-feature-extractor" paradigm, explicitly decoupling key components within the recommendation pipeline.
We conduct comprehensive evaluations of representative design choices for each module across multiple datasets, establishing effective and reusable practices to guide future development.
We demonstrate that, without introducing complex architectures, a simple combination of the best-performing designs from each module in RecXplore is sufficient to achieve consistently strong performance across diverse datasets and evaluation metrics.
Sequential recommendation aims to predict the next item a user will interact with based on their behavioral history. Early methods used Markov chains [19] to model short-term dependencies. With deep learning, RNN-based models such as GRU4Rec [20] and CNN-based models like NextItNet [21] and Caser [22] were proposed to better capture sequential patterns. Transformer-based models, including SASRec [6] and BERT4Rec [23], further improved performance by modeling long-range dependencies. Recent advances introduce Mamba-based architectures [24]–[26] that offer linear inference complexity for long sequences, and diffusion-based models [27]–[29] that enhance recommendation by modeling uncertainty through generative denoising processes. However, despite their success, these models typically ignore the rich semantic information contained in item attributes such as titles, descriptions, and categories, which can provide valuable complementary signals beyond interaction IDs.
Recent research has explored two primary paradigms for integrating LLMs into recommendation pipelines. The first paradigm is LLM-centric, in which user behaviors and item attributes are converted into natural language descriptions and processed by a pre-trained LLM to directly generate recommendation outputs [30]–[32]. This approach treats recommendation as a language modeling task and leverages the generative strengths of LLMs. The second paradigm explores the use of LLMs as feature extractors to enhance recommender systems, focusing on different components such as prompt design [7], [10], [12], aggregation strategies [13], [14], or adaptation modules [8], [16], [18]. While these studies demonstrate the potential of LLM-enhanced representations, they typically investigate specific techniques in isolation or under inconsistent setups, making it difficult to draw general conclusions about what design choices matter and why.
In contrast, our work introduces RecXplore, a unified and modular framework that systematically decomposes the LLM-as-feature-extractor pipeline into four core modules. This design enables controlled comparisons across design dimensions and datasets, allowing us to derive actionable principles for building effective LLM-enhanced recommender systems.
To systematically analyze the LLM-as-feature-extractor paradigm, we adopt the following analytical framework (Fig. 2):
Data Processing Module: Converts raw item attributes into a textual input format suitable for the LLM.
Feature Extraction Module: Encodes the textual input into semantically rich feature embeddings using an LLM.
Feature Adaptation Module: Performs transformation, dimensionality reduction, and adaptation on the high-dimensional semantic embeddings.
Sequential Modeling Module: Models user behavior sequences with the adapted features within the downstream recommendation model.
The framework consists of four core modules, with data flowing sequentially from data processing to the final recommendation. This modular structure facilitates the testing and analysis of how different module designs and implementations affect overall performance. Next, we delve into the detailed design and exploration setup for each core module.
The core objective of this module is to construct high-quality input representations for items by converting raw item attributes (e.g., price, category, brand) into a format suitable for the LLM. Our investigation unfolds along two primary dimensions: Template-based Methods and LLM-based Semantic Enhancement.
This method aims to consolidate an item’s structured and unstructured information into a coherent natural language input using a predefined template. We test the Simple Attribute Concatenation approach, which combines all available attributes (e.g., brand, category, description) into a comprehensive descriptive sentence to serve as the LLM’s input.
This approach leverages the capabilities of an additional LLM to refine or enrich the input text. We evaluate three techniques: (1) Keyword Extraction, which uses the LLM to extract keyword information from item attributes; (2) Summarization, which uses the LLM to summarize item attributes into one or two sentences, reducing redundancy; (3) Knowledge Expansion, which utilizes the LLM’s world knowledge to augment the original item attributes.
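To make the two routes concrete, the sketch below contrasts template-based concatenation with LLM-based enhancement. This is a minimal illustration: the attribute names, the summarization prompt, and the `llm_generate` callable are assumptions standing in for our actual templates and generation interface.

```python
# Minimal sketch of the two data-processing routes. Attribute names and
# the summarization prompt are illustrative, not the exact templates.

def simple_attribute_concatenation(item: dict) -> str:
    """Template-based: join all available attributes into one sentence."""
    parts = [f"{key}: {value}" for key, value in item.items() if value]
    return "Here is an item. " + "; ".join(parts) + "."

def summarize_with_llm(item: dict, llm_generate) -> str:
    """LLM-based enhancement: compress the attributes into 1-2 sentences.
    `llm_generate` is any prompt-in, text-out callable (assumed interface)."""
    prompt = ("Summarize the following product in at most two sentences:\n"
              + simple_attribute_concatenation(item))
    return llm_generate(prompt)

item = {"title": "Hydrating Face Cream", "brand": "Acme",
        "category": "Beauty", "description": "A lightweight daily moisturizer."}
print(simple_attribute_concatenation(item))
```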
This module aims to utilize an LLM to transform the processed textual input into semantic embeddings. We focus on two core strategies in this module: the LLM fine-tuning strategy and the feature aggregation strategy.
This strategy investigates whether the LLM’s parameters should be optimized for the recommendation task. We compare two primary approaches: (1) Frozen LLM, which keeps the LLM’s parameters fixed during training, directly leveraging its zero-shot capabilities; and (2) Fine-tuned LLM, which applies Parameter-Efficient Fine-Tuning (PEFT) techniques, such as LoRA, to optimize the LLM on recommendation data. Within the fine-tuning approach, we further evaluate several methods, including Continued Pre-training (CPT), Supervised Fine-tuning (SFT), Supervised Contrastive Fine-Tuning (SCFT) [16], and a combined CPT+SFT strategy, to identify the optimal implementation for using LLM as a feature extractor in recommendation tasks.
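To illustrate the fine-tuning setup, the sketch below attaches LoRA adapters to a LLaMA-style model via the Hugging Face `peft` library. The rank and scaling (r = 8, α = 32) follow our experimental setup; the checkpoint name, target modules, and dropout are illustrative assumptions.

```python
# Hedged sketch of PEFT with LoRA; only the low-rank adapter weights
# are trained, whichever objective (CPT, SFT, SCFT, or CPT+SFT) is used.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # assumed checkpoint
config = LoraConfig(
    r=8,                                  # low-rank dimension
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,                    # assumption
    target_modules=["q_proj", "v_proj"],  # attention projections (assumption)
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # only LoRA weights are trainable
```

In the combined CPT+SFT strategy, the same adapter setup is trained first with next-token prediction on domain item texts (CPT) and then on the task-specific recommendation objective (SFT).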
This strategy defines how to aggregate the sequence of token-level vectors from the LLM’s output to generate a final, item-level embedding. We systematically compare four mainstream methods: (1) Mean Pooling, which averages the hidden states of all tokens in the final layer; (2) Max Pooling, which takes the maximum value across the hidden states; (3) Last Token, which directly uses the hidden state of the sequence’s final token as the representation; and (4) Explicit One-word Limitation (EOL) [33], which prompts the model to distill item information into a single word and uses this word’s embedding as the final representation.
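The three pooling-based strategies reduce to simple operations over the final-layer hidden states, as in the PyTorch sketch below (tensor shapes and right-padding are assumptions); EOL is instead realized through prompting and is therefore not shown.

```python
# Sketch of pooling-based aggregation over LLM hidden states.
import torch

def aggregate(hidden: torch.Tensor, mask: torch.Tensor, strategy: str) -> torch.Tensor:
    """hidden: [B, T, H] final-layer states; mask: [B, T], 1 for real tokens.
    Assumes right-padded sequences."""
    m = mask.unsqueeze(-1).float()                           # [B, T, 1]
    if strategy == "mean":                                   # average over real tokens
        return (hidden * m).sum(1) / m.sum(1).clamp(min=1.0)
    if strategy == "max":                                    # elementwise max over real tokens
        return hidden.masked_fill(m == 0, float("-inf")).max(dim=1).values
    if strategy == "last":                                   # last non-padding token
        idx = mask.long().sum(dim=1) - 1                     # [B]
        return hidden[torch.arange(hidden.size(0)), idx]
    raise ValueError(f"unknown strategy: {strategy}")
```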
This module adapts raw semantic embeddings for the downstream recommender backbone models, addressing potential issues like high dimensionality and feature space mismatch. To this end, we explore two primary approaches: the design of different Adaptation Architectures and various strategies for Fusion with item ID Embeddings.
This approach aims to identify optimal adaptation methods for semantic features, preserving or even enhancing key semantic information while reducing computational complexity. We systematically evaluate five distinct architectures: (1) Principal Component Analysis (PCA); (2) Linear Projection; (3) Multilayer Perceptron (MLP) Adapter; (4) Product Quantization (PQ) [34]; and (5) Mixture-of-Experts (MoE) [35], which uses a network of specialized experts for dynamic, input-dependent adaptation.
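As a concrete reference for the MoE design, the sketch below shows a softmax-gated mixture of small expert MLPs applied after PCA reduction. The 8-expert count follows our setup; the hidden sizes, gating design, and PCA dimension are illustrative assumptions.

```python
# Hedged sketch of a PCA + MoE adapter for LLM item embeddings.
import torch
import torch.nn as nn

class MoEAdapter(nn.Module):
    """Gated mixture of expert MLPs mapping (PCA-reduced) semantic
    embeddings to the recommender's embedding dimension."""
    def __init__(self, in_dim: int = 256, out_dim: int = 64, n_experts: int = 8):
        super().__init__()
        self.gate = nn.Linear(in_dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                          nn.Linear(out_dim, out_dim))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: [B, in_dim]
        weights = torch.softmax(self.gate(x), dim=-1)        # [B, E] gating scores
        outs = torch.stack([e(x) for e in self.experts], 1)  # [B, E, out_dim]
        return (weights.unsqueeze(-1) * outs).sum(dim=1)     # input-dependent mix

# PCA runs offline before the adapter, e.g. with scikit-learn:
# reduced = PCA(n_components=256).fit_transform(llm_embs)    # [N, 4096] -> [N, 256]
```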
To investigate whether item ID information can serve as an effective supplement to semantic embeddings, we compare three mainstream methods for fusing semantic and ID embeddings: (1) Replacement, in which the pre-trained ID embeddings are directly replaced by the semantic embeddings, so that the system relies entirely on semantic information; (2) Concatenation, which directly concatenates the semantic embeddings with the pre-trained ID embeddings; and (3) Alignment, which uses an additional alignment loss [16] to constrain the semantic embedding space to align with the pre-trained ID embedding space.
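These strategies amount to simple operations on a semantic embedding `s` and a pre-trained ID embedding `e`, as in the sketch below; the MSE alignment term is a simplified stand-in for the alignment loss of [16].

```python
# Hedged sketch of the three semantic/ID fusion strategies.
import torch
import torch.nn.functional as F

def fuse(s: torch.Tensor, e: torch.Tensor, strategy: str):
    """s: semantic embeddings [B, d]; e: pre-trained ID embeddings [B, d]."""
    if strategy == "replace":
        return s                                 # ID embeddings are discarded
    if strategy == "concat":
        return torch.cat([s, e], dim=-1)         # downstream layers see 2*d dims
    if strategy == "align":
        align_loss = F.mse_loss(s, e.detach())   # pull s toward the frozen ID space
        return s, align_loss                     # loss term added to the objective
    raise ValueError(strategy)
```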
This module serves as the backbone of the recommendation system, responsible for modeling user behavior sequences. In our study, we adopt the powerful and widely-used sequential recommendation model, SASRec [6]. SASRec takes the item embeddings from the feature adaptation module as input to capture users’ dynamic preferences and generate the final recommendation list.
We investigate two training modes, distinguished by whether the LLM’s parameters are updated:
Frozen LLM (Single-Stage Training): In this mode, the LLM’s parameters remain unchanged throughout training. The training process focuses solely on optimizing the parameters of the Adaptation Module and the downstream SASRec, driven entirely by the recommendation task’s loss function (e.g., cross-entropy loss).
Fine-tuned LLM (Two-Stage Training): This mode consists of two stages. In the first stage, the LLM is fine-tuned on recommendation data using Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA to instill domain knowledge. In the second stage, the fine-tuned LLM is frozen, and its improved output features are used to train the Adaptation Module and SASRec.
To ensure the low-latency responses required for industrial applications, all item semantic embeddings are pre-computed and cached offline. During online inference, the system only needs to perform a forward pass through the lightweight Adaptation Module and the Sequential Module. This completely avoids real-time calls to the LLM, thus guaranteeing efficient recommendation services.
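A schematic of this offline/online split is given below; the file name, shapes, and helper names are illustrative assumptions.

```python
# Hedged sketch of the deployment split: the LLM runs only offline.
import numpy as np

# Offline (once per catalog update): one LLM pass per item, then cache.
item_embs = np.random.randn(1000, 4096).astype(np.float32)  # stand-in for LLM outputs
np.save("item_embs.npy", item_embs)

# Online: no LLM on the request path; the cache is memory-mapped and only
# the lightweight adapter and SASRec run per request.
cached = np.load("item_embs.npy", mmap_mode="r")
user_history = [3, 17, 256]                                  # illustrative item IDs
batch = np.asarray(cached[user_history])                     # [3, 4096]
# scores = sasrec(adapter(torch.from_numpy(batch)))          # downstream (placeholder)
```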
We conduct systematic experiments based on the proposed RecXplore framework to analyze how design choices across different modules affect recommendation performance. Our study is guided by the following research questions:
RQ1 What are the specific impacts of different data processing strategies on overall performance?
RQ2 What is the impact of different feature aggregation strategies on the final recommendation performance?
RQ3 What is the impact of different LLM fine-tuning strategies on recommendation quality?
RQ4 What is the impact of different feature adaptation strategies on the final recommendation performance?
RQ5 Can pre-trained item ID embeddings bring significant additional performance gains on top of the semantic embeddings?
RQ6 Can the optimal combination of components, distilled from our systematic decoupled analysis, outperform state-of-the-art methods, and by what margin?
We conduct experiments on four real-world datasets, i.e., the Steam dataset [36] and three datasets from Amazon product reviews [37]: Beauty, Fashion, and Games. These datasets span different product domains and user behavior patterns, serving as common benchmarks for evaluating sequential recommendation models. To ensure consistency, we adopt the data pre-processing method proposed in LLMEmb [16] for all datasets.
This base configuration integrates the most straightforward and representative design choices from each module, serving as the reference benchmark for all subsequent decoupled analyses. The base configuration is composed as follows:
Data Processing: Uses the ‘Simple Attribute Concatenation’ strategy, which combines the values of item attributes to form the input text. This strategy serves as the default benchmark for our experiments.
Feature Extraction: Utilizes a frozen-parameter LLaMA model combined with Mean Pooling to efficiently convert the input text into a single item embedding.
Feature Adaptation: Reduces feature dimensionality using a single-layer Linear Projection adapter, without processing via Principal Component Analysis (PCA).
Sequential Modeling: Adopts SASRec as the downstream backbone to model user behavior sequences.
We conduct our experiments on a server with an Intel Xeon Platinum 8457C CPU and 8 NVIDIA L20 GPUs. We choose LLaMA-7B [38] as the base model for its superior performance over alternatives (e.g., BERT, RoBERTa) in preliminary tests. We employ LoRA (r = 8, α = 32) and an 8-expert MoE adapter for parameter-efficient fine-tuning. Full details on the model comparison and hyperparameters are available in the Appendix.
Following previous work [6], we use Hit Rate (HR@K) and Normalized Discounted Cumulative Gain (NDCG@K) as our evaluation metrics, with K set to 5 and 10. For all metrics, higher values indicate better performance. During evaluation, each ground-truth item in the test set is ranked against 100 randomly sampled items the user has not interacted with. To reduce randomness, all experiments are run three times with different random seeds (42, 43, 44), and we report the averaged metrics. As the standard deviation across all runs was consistently ≤ 0.002, we omit it from the tables for brevity.
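For reference, the sketch below computes the two metrics under this sampled protocol; the helper name is ours, and `rank` denotes the position of the ground-truth item among the 101 scored candidates.

```python
# Hedged sketch of HR@K / NDCG@K under the 100-negative sampling protocol.
import numpy as np

def hr_ndcg_at_k(rank: int, k: int):
    """rank: 0-based position of the ground-truth item among 101 candidates
    (1 positive + 100 sampled negatives), after sorting by model score."""
    hit = 1.0 if rank < k else 0.0
    ndcg = 1.0 / np.log2(rank + 2) if rank < k else 0.0  # DCG of a single relevant item
    return hit, ndcg

print(hr_ndcg_at_k(rank=2, k=5))  # ground truth ranked 3rd -> (1.0, 0.5)
```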
Table 1: Impact of data processing strategies on recommendation performance (H = HR, N = NDCG).

| Dataset | Method | H@5 | H@10 | N@5 | N@10 |
|---|---|---|---|---|---|
| Beauty | SAC | 0.4282 | 0.5525 | 0.3110 | 0.3511 |
|  | Keyword | 0.4210 | 0.5443 | 0.3052 | 0.3450 |
|  | Summary | 0.4179 | 0.5418 | 0.3017 | 0.3418 |
|  | Expansion | 0.4269 | 0.5515 | 0.3093 | 0.3495 |
| Games | SAC | 0.5442 | 0.7019 | 0.3850 | 0.4362 |
|  | Keyword | 0.5508 | 0.7038 | 0.3928 | 0.4425 |
|  | Summary | 0.5275 | 0.6827 | 0.3741 | 0.4244 |
|  | Expansion | 0.5430 | 0.6990 | 0.3846 | 0.4352 |
| Fashion | SAC | 0.5121 | 0.5832 | 0.4574 | 0.4803 |
|  | Keyword | 0.5017 | 0.5659 | 0.4443 | 0.4650 |
|  | Summary | 0.5184 | 0.5860 | 0.4599 | 0.4816 |
|  | Expansion | 0.4971 | 0.5628 | 0.4394 | 0.4606 |
| Steam | SAC | 0.5592 | 0.7106 | 0.4043 | 0.4469 |
|  | Keyword | 0.5372 | 0.6989 | 0.3968 | 0.4391 |
|  | Summary | 0.5495 | 0.7041 | 0.3993 | 0.4412 |
|  | Expansion | 0.5568 | 0.7086 | 0.4029 | 0.4479 |
Table 2: Impact of feature aggregation strategies on recommendation performance.

| Dataset | Strategy | H@5 | H@10 | N@5 | N@10 |
|---|---|---|---|---|---|
| Beauty | EOL | 0.4035 | 0.5224 | 0.2926 | 0.3310 |
|  | Last Token | 0.3920 | 0.5166 | 0.2806 | 0.3209 |
|  | Max Pooling | 0.1801 | 0.2695 | 0.1264 | 0.1550 |
|  | Mean Pooling | 0.4282 | 0.5525 | 0.3110 | 0.3511 |
| Games | EOL | 0.5339 | 0.6914 | 0.3745 | 0.4256 |
|  | Last Token | 0.5339 | 0.6907 | 0.3752 | 0.4260 |
|  | Max Pooling | 0.4296 | 0.5962 | 0.2936 | 0.3474 |
|  | Mean Pooling | 0.5442 | 0.7019 | 0.3850 | 0.4362 |
| Fashion | EOL | 0.4977 | 0.5692 | 0.4327 | 0.4557 |
|  | Last Token | 0.5031 | 0.5746 | 0.4361 | 0.4591 |
|  | Max Pooling | 0.4085 | 0.4709 | 0.3362 | 0.3564 |
|  | Mean Pooling | 0.5121 | 0.5832 | 0.4574 | 0.4803 |
| Steam | EOL | 0.5502 | 0.7028 | 0.3949 | 0.4444 |
|  | Last Token | 0.5393 | 0.6950 | 0.3860 | 0.4365 |
|  | Max Pooling | 0.4511 | 0.6153 | 0.3020 | 0.3552 |
|  | Mean Pooling | 0.5592 | 0.7106 | 0.4043 | 0.4469 |
Table 3: Impact of LLM fine-tuning strategies on recommendation performance.

| Dataset | Strategy | H@5 | H@10 | N@5 | N@10 |
|---|---|---|---|---|---|
| Beauty | Frozen | 0.4282 | 0.5525 | 0.3110 | 0.3511 |
|  | SCFT | 0.4641 | 0.5705 | 0.3558 | 0.3901 |
|  | CPT | 0.4803 | 0.5903 | 0.3679 | 0.4034 |
|  | SFT | 0.4769 | 0.5886 | 0.3623 | 0.3984 |
|  | CPT+SFT | 0.4812 | 0.5952 | 0.3633 | 0.4057 |
| Games | Frozen | 0.5442 | 0.7019 | 0.3850 | 0.4362 |
|  | SCFT | 0.6093 | 0.7443 | 0.4475 | 0.4905 |
|  | CPT | 0.5800 | 0.7261 | 0.4180 | 0.4654 |
|  | SFT | 0.6108 | 0.7477 | 0.4518 | 0.4967 |
|  | CPT+SFT | 0.6140 | 0.7511 | 0.4569 | 0.4988 |
| Fashion | Frozen | 0.5121 | 0.5832 | 0.4574 | 0.4803 |
|  | SCFT | 0.5241 | 0.5740 | 0.4819 | 0.4980 |
|  | CPT | 0.5197 | 0.5823 | 0.4628 | 0.4830 |
|  | SFT | 0.5212 | 0.5837 | 0.4664 | 0.4865 |
|  | CPT+SFT | 0.5270 | 0.5915 | 0.4680 | 0.4887 |
| Steam | Frozen | 0.5592 | 0.7106 | 0.4043 | 0.4469 |
|  | SCFT | 0.5964 | 0.7432 | 0.4383 | 0.4859 |
|  | CPT | 0.5800 | 0.7380 | 0.4173 | 0.4686 |
|  | SFT | 0.5929 | 0.7416 | 0.4328 | 0.4811 |
|  | CPT+SFT | 0.6056 | 0.7496 | 0.4427 | 0.4895 |
Table 4: Impact of adaptation architectures, with and without PCA pre-processing, on recommendation performance.

| Dataset | Metric | Linear w/o PCA | Linear w/ PCA | MLP w/o PCA | MLP w/ PCA | PQ w/o PCA | PQ w/ PCA | MoE w/o PCA | MoE w/ PCA |
|---|---|---|---|---|---|---|---|---|---|
| Beauty | H@5 | 0.4812 | 0.4750 | 0.4641 | 0.5053 | 0.3905 | 0.3753 | 0.5104 | 0.5066 |
|  | H@10 | 0.5952 | 0.5831 | 0.5814 | 0.6135 | 0.5008 | 0.4734 | 0.6155 | 0.6053 |
|  | N@5 | 0.3633 | 0.3681 | 0.3479 | 0.3908 | 0.2869 | 0.2868 | 0.3978 | 0.3957 |
|  | N@10 | 0.4057 | 0.4029 | 0.3859 | 0.4288 | 0.3225 | 0.3185 | 0.4308 | 0.4268 |
| Games | H@5 | 0.6140 | 0.6038 | 0.6174 | 0.6423 | 0.5282 | 0.4878 | 0.6355 | 0.6464 |
|  | H@10 | 0.7511 | 0.7369 | 0.7563 | 0.7641 | 0.6738 | 0.6219 | 0.7624 | 0.7675 |
|  | N@5 | 0.4569 | 0.4459 | 0.4558 | 0.4836 | 0.3825 | 0.3524 | 0.4803 | 0.4896 |
|  | N@10 | 0.4988 | 0.4891 | 0.5009 | 0.5268 | 0.4297 | 0.3958 | 0.5216 | 0.5289 |
| Fashion | H@5 | 0.5270 | 0.5489 | 0.5285 | 0.5423 | 0.4745 | 0.5054 | 0.5302 | 0.5544 |
|  | H@10 | 0.5915 | 0.6088 | 0.5993 | 0.5959 | 0.5309 | 0.5547 | 0.5953 | 0.6112 |
|  | N@5 | 0.4680 | 0.5105 | 0.4678 | 0.4965 | 0.4108 | 0.4656 | 0.4675 | 0.5088 |
|  | N@10 | 0.4887 | 0.5255 | 0.4906 | 0.5138 | 0.4322 | 0.4815 | 0.4886 | 0.5282 |
| Steam | H@5 | 0.6056 | 0.5907 | 0.5942 | 0.5991 | 0.5185 | 0.4870 | 0.6027 | 0.6227 |
|  | H@10 | 0.7496 | 0.7338 | 0.7466 | 0.7495 | 0.6748 | 0.6262 | 0.7511 | 0.7683 |
|  | N@5 | 0.4427 | 0.4310 | 0.4307 | 0.4371 | 0.3695 | 0.3520 | 0.4318 | 0.4612 |
|  | N@10 | 0.4895 | 0.4775 | 0.4802 | 0.4860 | 0.4201 | 0.3971 | 0.4813 | 0.5062 |
To answer RQ1, we investigate the impact of different data processing strategies on recommendation performance. The experiment is benchmarked against the Base Configuration, which uses the straightforward Simple Attribute Concatenation (SAC) strategy. Building upon this base, we fix all other modules (e.g., Feature Extraction, Adaptation) and vary only the data processing method, evaluating the three LLM-based Semantic Enhancement methods: (1) Keyword Extraction, (2) Summarization, and (3) Knowledge Expansion. Detailed results are presented in Table 1.
Our experiments on data processing highlight the following key takeaways:
Complex Enhancement is Suboptimal. Additional attribute enhancements (e.g., keyword extraction, summarization) yield minimal gains and in some cases degrade performance. This can be attributed to the downstream encoder’s intrinsic ability to understand semantics, which makes further processing of already-structured data redundant and potentially lossy.
Simplicity is Optimal. With a powerful downstream encoder, simple attribute concatenation proves to be the most effective and efficient strategy, eliminating the need for complex attribute enhancements.
Building on the optimal finding from RQ1 (SAC), we evaluate different feature aggregation strategies by comparing the default Mean Pooling against three other methods. The results are shown in Table 2.
Our experiments yield the following takeaways:
Mean Pooling is Optimal. It consistently outperforms all other strategies, as integrating all token information provides a more comprehensive semantic representation.
Max Pooling Performs Poorly. Max Pooling is clearly the worst-performing strategy on all datasets, likely because it over-focuses on isolated salient features, which harms representation quality. The EOL and Last Token strategies fall in between, outperforming Max Pooling but trailing Mean Pooling.
Building upon the optimal findings from RQ1 and RQ2, this experiment fixes the data processing to Simple Attribute Concatenation and the feature aggregation to Mean Pooling, and evaluates the impact of different LLM fine-tuning paradigms. The results are shown in Table 3.
Our experiments highlight the following key takeaways:
Fine-tuning helps. Any form of fine-tuning significantly outperforms keeping the LLM frozen, confirming that domain adaptation is essential for improving recommendation quality.
CPT+SFT is optimal. The two-stage CPT+SFT combination demonstrates the highest overall performance. This result strongly supports the effectiveness of a combined domain-first (CPT) and task-specific (SFT) training path.
Table 5: Overall performance comparison between RecXplore and baseline methods. "Improvement" denotes the relative gain of our method over the strongest baseline.

| Dataset | Metric | GRU4Rec | BERT4Rec | SASRec | SAID | LLMESR | LLMEmb | Ours | Improvement |
|---|---|---|---|---|---|---|---|---|---|
| Beauty | H@5 | 0.2636 | 0.2867 | 0.3290 | 0.4218 | 0.4511 | 0.4362 | 0.5066* | +12.3% |
|  | H@10 | 0.3639 | 0.3991 | 0.4193 | 0.4998 | 0.5692 | 0.5410 | 0.6053* | +6.3% |
|  | N@5 | 0.1836 | 0.1998 | 0.2521 | 0.3067 | 0.3459 | 0.3340 | 0.3957* | +14.4% |
|  | N@10 | 0.2160 | 0.2361 | 0.2812 | 0.3244 | 0.3741 | 0.3678 | 0.4268* | +14.1% |
| Games | H@5 | 0.3604 | 0.4165 | 0.5489 | 0.5603 | 0.5734 | 0.5654 | 0.6464* | +12.7% |
|  | H@10 | 0.4906 | 0.5529 | 0.6813 | 0.6984 | 0.7080 | 0.7069 | 0.7675* | +8.4% |
|  | N@5 | 0.2529 | 0.2955 | 0.3989 | 0.4096 | 0.4074 | 0.4126 | 0.4896* | +18.7% |
|  | N@10 | 0.2949 | 0.3396 | 0.4682 | 0.4727 | 0.4791 | 0.4892 | 0.5289* | +8.1% |
| Fashion | H@5 | 0.4196 | 0.4102 | 0.4683 | 0.4982 | 0.5167 | 0.5096 | 0.5544* | +7.3% |
|  | H@10 | 0.4822 | 0.4679 | 0.5030 | 0.5327 | 0.5678 | 0.5554 | 0.6112* | +7.6% |
|  | N@5 | 0.3571 | 0.3683 | 0.4364 | 0.4389 | 0.4635 | 0.4710 | 0.5088* | +8.0% |
|  | N@10 | 0.3773 | 0.3868 | 0.4619 | 0.4679 | 0.4866 | 0.4856 | 0.5282* | +8.5% |
| Steam | H@5 | 0.5216 | 0.5165 | 0.5541 | 0.5607 | 0.5811 | 0.5739 | 0.6227* | +7.2% |
|  | H@10 | 0.6861 | 0.6779 | 0.6954 | 0.7028 | 0.7353 | 0.7327 | 0.7683* | +4.5% |
|  | N@5 | 0.3673 | 0.3674 | 0.4016 | 0.4089 | 0.4311 | 0.4302 | 0.4612* | +7.0% |
|  | N@10 | 0.4207 | 0.4196 | 0.4461 | 0.4573 | 0.4807 | 0.4789 | 0.5062* | +5.3% |
Table 6: Comparison of strategies for fusing semantic embeddings with pre-trained ID embeddings, under the Linear and MoE adapters.

| Dataset | Adapter | Strategy | H@5 | H@10 | N@5 | N@10 |
|---|---|---|---|---|---|---|
| Beauty | Linear | Replace | 0.4750 | 0.5831 | 0.3681 | 0.4029 |
|  |  | Concat | 0.4649 | 0.5721 | 0.3567 | 0.3913 |
|  |  | Align | 0.4741 | 0.5817 | 0.3644 | 0.3992 |
|  | MoE | Replace | 0.5066 | 0.6053 | 0.3957 | 0.4268 |
|  |  | Concat | 0.4485 | 0.5744 | 0.3401 | 0.3753 |
|  |  | Align | 0.5001 | 0.6035 | 0.3885 | 0.4220 |
| Games | Linear | Replace | 0.6038 | 0.7369 | 0.4459 | 0.4891 |
|  |  | Concat | 0.6390 | 0.7555 | 0.4940 | 0.5318 |
|  |  | Align | 0.5995 | 0.7367 | 0.4390 | 0.4835 |
|  | MoE | Replace | 0.6464 | 0.7675 | 0.4896 | 0.5289 |
|  |  | Concat | 0.6338 | 0.7496 | 0.4893 | 0.5269 |
|  |  | Align | 0.6346 | 0.7564 | 0.4792 | 0.5187 |
| Fashion | Linear | Replace | 0.5489 | 0.6088 | 0.5105 | 0.5255 |
|  |  | Concat | 0.5297 | 0.5822 | 0.4819 | 0.4989 |
|  |  | Align | 0.5554 | 0.6095 | 0.5075 | 0.5249 |
|  | MoE | Replace | 0.5544 | 0.6112 | 0.5088 | 0.5282 |
|  |  | Concat | 0.5388 | 0.5895 | 0.4968 | 0.5131 |
|  |  | Align | 0.5477 | 0.6028 | 0.4985 | 0.5163 |
| Steam | Linear | Replace | 0.5907 | 0.7338 | 0.4310 | 0.4775 |
|  |  | Concat | 0.5975 | 0.7423 | 0.4404 | 0.4873 |
|  |  | Align | 0.5914 | 0.7317 | 0.4348 | 0.4804 |
|  | MoE | Replace | 0.6227 | 0.7683 | 0.4612 | 0.5062 |
|  |  | Concat | 0.5975 | 0.7453 | 0.4401 | 0.4880 |
|  |  | Align | 0.6219 | 0.7630 | 0.4579 | 0.5038 |
Building upon the optimal strategies from RQ1-RQ3 (SAC, Mean Pooling, and CPT+SFT), this experiment evaluates the impact of different adaptation architectures. We compare four architectures—Linear, MLP, PQ, and MoE—each with and without PCA processing, with results presented in Table 4.
Our experimental results reveal several key takeaways regarding adapter design:
MoE architecture is superior. Among all evaluated architectures, MoE consistently demonstrates the best performance. This highlights the superiority of its dynamic, input-dependent adaptation capabilities in capturing complex semantics.
PCA helps MoE perform better. PCA is not universally helpful and can hurt performance with other adapters. When combined with MoE, however, it achieves the best results on three of our four datasets.
To answer RQ5, we investigate whether fusing traditional ID embeddings is always beneficial, and whether the effect depends on the downstream adapter’s architecture. The experiments are based on two adapters from RQ4 with different expressive capacities: the top-performing MoE adapter and the efficient Linear adapter. On top of these two architectures, we compare three integration strategies: (1) Direct Replacement (our baseline), (2) Concatenation, and (3) Alignment. The results are presented in Table 6.
Our experimental results reveal several key takeaways:
Powerful Adapters Render ID Embeddings Redundant. For the powerful MoE adapter, the rich semantic information it generates renders ID embeddings redundant, thus the Direct Replacement strategy performs best.
ID Embeddings Effectively Supplement Simpler Adapters. Conversely, for the simpler Linear adapter, fusing ID information effectively supplements its limited semantic capacity, with Concatenation or Alignment yielding superior performance in most cases.
Through the systematic, decoupled analysis in prior sections, we have identified a set of optimal practices for the LLM-as-feature-extractor paradigm. Unlike the base configuration, this optimized setup integrates: (1) Simple Attribute Concatenation for processing item attributes; (2) Mean Pooling for feature aggregation; (3) a two-stage CPT+SFT strategy for LLM fine-tuning; (4) an MoE Adapter with PCA for feature adaptation; and (5) a Direct Replacement strategy, omitting item ID information, for the backbone recommender.
We evaluate RecXplore against two categories of baselines: traditional sequential models (GRU4Rec [20], BERT4Rec [23], SASRec [6]) and recent LLM-enhanced methods (SAID [39], LLMESR [40], LLMEmb [16]).
As shown in Table 5, our optimal configuration, RecXplore, comprehensively outperforms all baselines, achieving improvements across the metrics on all datasets. This result demonstrates that our approach of systematically decoupling and optimizing individual components can yield a superior configuration.
We proposed RecXplore, a modular framework for optimizing the LLM-as-feature-extractor paradigm in sequential recommendation. By decoupling the pipeline into four components—data processing, feature extraction, feature adaptation, and sequential modeling—and analyzing each in isolation, we identified effective design choices that, when combined into RecXplore, consistently outperformed both traditional and LLM-based baselines across four datasets. While effective, our approach uses a greedy optimization strategy that may miss globally optimal configurations and focuses solely on the SASRec backbone; future work can explore joint optimization and extend the analysis to other architectures.