April 02, 2024
The rise of code pre-trained models has significantly enhanced various coding tasks, such as code completion, and powered tools like GitHub Copilot. However, the substantial size of these models poses a significant challenge when it comes to fine-tuning them for specific downstream tasks. As an alternative, retrieval-based methods have emerged as a promising solution, augmenting model predictions without the need for fine-tuning. Despite their potential, these methods are often designed around heuristics, leaving critical questions unanswered about what information should be stored or retrieved and how that information should be interpolated to augment predictions.
To tackle this challenge, we first perform a theoretical analysis of the fine-tuning process, highlighting the importance of \(\Delta logits\) as a catalyst for improving model predictions. Building on this insight, we develop a novel retrieval-based method, FT2Ra, which aims to mimic genuine fine-tuning. Although FT2Ra relies on a retrieval-based mechanism, it uniquely incorporates a learning rate and multi-epoch retrievals, a paradigm that mirrors fine-tuning.
We conducted a comprehensive evaluation of FT2Ra on both token-level and line-level code completion. Our findings demonstrate the remarkable effectiveness of FT2Ra when compared to state-of-the-art methods and its potential to approach genuine fine-tuning. In token-level completion, a relatively easier task, FT2Ra achieves a 4.29% improvement in accuracy over the best baseline method on UniXcoder. In the more challenging line-level completion task, we observe a substantial increase of roughly \(2\times\) in Exact Match (EM) performance, highlighting the benefits of our theoretical analysis. Notably, even when operating without actual fine-tuning, FT2Ra exhibits competitive performance compared to models with real fine-tuning.
In the realm of software engineering, code pre-trained models (CPMs) specialized for code generation and completion are becoming increasingly prevalent. Recently, a series of code completion plugins such as GitHub Copilot [@githubcopilot], and Visual Studio IntelliCode [@intellicode] have significantly alleviated the burden on software developers and enhanced their development efficiency.
Code-centric pre-trained models are generally trained on vast amounts of source code harvested from open repositories. In the inference phase, CPMs typically map the code prefix to a fixed-sized representation and use it to predict the next code token. However, despite the extensive training data, CPMs still struggle to capture rare or specialized patterns. On one hand, the rarity of certain patterns in the training data makes it difficult for the model to learn them adequately. On the other hand, the complex inter-dependencies between different data samples can include conflicting coding styles or logic that the model fails to reconcile. Furthermore, these CPMs may not excel in specific domains where task-oriented or project-specific knowledge is essential, such as project-specific API invocations. For example, a recent study [@zhang2023repocoder] showed that general-purpose pre-trained models perform poorly on repository-level code completion because they miss the interrelated dependencies among the files within a repository. Therefore, the post-training enhancement of these models becomes a crucial task.
To tackle the outlined challenges, a straightforward approach is to fine-tune the pre-trained models using specialized data, such as missing patterns or project-specific information. However, fine-tuning comes with its own set of limitations, particularly concerning the computational resources required and the quality of data necessary for effective adjustment. Fine-tuning the entire model necessitates storing and updating a colossal parameter set, an operation that becomes increasingly costly and often infeasible as model size escalates into billions of parameters. Furthermore, the success of this strategy hinges on the availability of high-quality, task-specific data. While parameter-efficient fine-tuning techniques have been proposed [@hu2021lora; @liu2023gpt; @lester2021power], they still demand considerable computational resources for fine-tuning.
Recent research [@zhang2023repocoder; @Khandelwal2020Generalization; @tang2023domain] proposes an alternative route through the use of retrieval-augmented language models (RaLMs). These models supplement the capabilities of pre-trained models by incorporating retrieval mechanisms that source information (e.g., rare patterns) from an external database, thereby bypassing the need for additional fine-tuning. This method enables the model to explicitly store and retrieve rare patterns, as opposed to implicitly integrating them into the model’s parameters [@Khandelwal2020Generalization]. This paradigm aligns well with human learning behavior, where sparse examples are leveraged to generalize effectively to new situations. Empirical results demonstrate that RaLMs can significantly enhance the performance of CPMs, particularly in the prediction of rare patterns.
For retrieval-augmented language models, two main challenges exist: the identification of similar samples from an external database and the effective utilization of the retrieved information for making predictions. Typically, the former is addressed by retrieving nearest neighbors based on distance metrics in a pre-trained embedding space. For the latter, existing methods adopt different strategies, such as frequency analysis [@WikipediaFrequency] and empirical probabilities [@WikipediaEmpiricalProb], to integrate the retrieved information. For example, kNN-LM [@Khandelwal2020Generalization] retrieves the neighbors and computes a distribution over them based on a softmax of their negative distances, which is used to augment the original predictions. The most recent work, kNM-LM [@tang2023domain], retrieves the code tokens that the language model fails to predict, normalizes them into a distribution, and merges it with the predictions of the language model. While these heuristic approaches have yielded promising results, they largely depend on empirical settings, leaving theoretical gaps in terms of what information should be retrieved and how this information can be better exploited.
In this paper, to better understand the optimal use of retrieved knowledge, we first conduct a theoretical analysis of the fine-tuning process in CPMs. Our theoretical analysis and derivation reveal insights for designing a strategy that more closely approximates the effects of fine-tuning. While our theoretical derivation does incorporate certain approximations, the evaluation results still demonstrate the effectiveness of the retrieval mechanism. Specifically, our analysis indicates that the logits discrepancy between the predicted and actual values associated with neighbors (i.e., \(\Delta logits\))¹ serves as crucial information for augmenting CPM predictions. Based on this analysis, we develop a novel retrieval-augmentation technique, denoted as FT2Ra, for code completion tasks. A CPM can recalibrate and improve its predictions by adding the \(\Delta logits\) to its logit output. Furthermore, akin to the iterative nature of the fine-tuning process, FT2Ra is designed to operate through an iterative retrieval cycle, progressively updating the external database to refine the quality of retrieved information and thereby continuously improving prediction accuracy.
To showcase the effectiveness of FT2Ra, we selected four state-of-the-art retrieval-based methods as baselines: kNN-LM [@Khandelwal2020Generalization], kNM-LM [@tang2023domain], ReACC [@lu2022reacc], and BM25 [@robertson2009probabilistic]. Our evaluation encompassed both token-level and line-level code completion. The experimental results demonstrate that, guided by our theoretical analysis, FT2Ra significantly outperforms the baseline methods and achieves performance competitive with actually fine-tuned models. For instance, in token-level completion, FT2Ra obtains an average accuracy of 74.22% (4.29%+) on UniXcoder, whereas UniXcoder and the top-performing baseline, kNM-LM, achieve accuracies of 54.07% and 69.93%, respectively. In the more challenging line-level completion task, FT2Ra achieves an average Exact Match (EM) score of 26.32 (roughly \(2\times\) higher) on UniXcoder, while UniXcoder and kNM-LM only manage scores of 1.63 and 13.93, respectively. We also observed that, in line-level completion with UniXcoder, FT2Ra, even without any fine-tuning, outperforms the UniXcoder model fine-tuned for 10 epochs. These results not only demonstrate the effectiveness of FT2Ra but also highlight its significant potential to achieve results comparable to those of fine-tuned models. Furthermore, our additional evaluations reveal that the iterative retrieval mechanism designed within FT2Ra significantly contributes to its performance.
In summary, this paper makes the following contributions:
Theoretical Analysis: We perform a theoretical analysis of the model fine-tuning process. This analysis provides valuable insights into how to effectively exploit retrieval information in retrieval augmentation mechanisms.
Methodology: Building upon the insights derived from our theoretical analysis, we introduce a novel method called FT2Ra. This innovative approach emulates real fine-tuning through an iterative retrieval process, enhancing its effectiveness.
Comprehensive Evaluation: We conduct an extensive evaluation to assess the effectiveness of FT2Ra in both token-level and line-level code completion tasks. The results highlight the substantial improvements achieved by FT2Ra.
Open-Source Resources: We have made the pertinent data, detailed experimental findings, and the tools publicly available [@ft2rawebsite].
Recently, a series of retrieval-augmented language models [@tang2023domain; @Khandelwal2020Generalization; @guu2020retrieval] have been proposed to augment language models with external knowledge [@yang2023leandojo; @chen2022decoupling; @hofstatter2023fid]. Retrieval-augmented techniques can generally be divided into two types. The first type is at the input layer [@ram2023context; @izacard2022few; @guu2020retrieval], where the retrieved information is text chunks. The second type is at the output layer [@Khandelwal2020Generalization; @tang2023domain; @alon2022neuro], where the retrieved information is tokens. By combining the retrieved tokens with the tokens generated by the original model, the accuracy of the retrieval-augmented model’s generation for each token can be improved. The first type of method can provide the model with more external knowledge, making it adept at handling tasks in the NLP field such as knowledge-based question answering [@lewis2020retrieval; @wang2023learning; @shi2023replug]. The second type of method can refer to the retrieved information to correct the generated tokens, making it more suited for handling strictly structured generative tasks, such as code completion [@drozdov2022you; @alon2022neuro; @de2021mention]. In this work, we mainly focus on the second category.
To better understand the mechanism, we take kNN-LM [@Khandelwal2020Generalization] as an example for a detailed explanation. Given a context sequence \(c_t = (w_1, \dots, w_{t-1})\), the language models (LMs) estimate \(p_{LM}(w_t|c_t)\), i.e., the probability distribution over the next token \(w_t\). kNN-LM is designed to augment a pre-trained language model with a set of nearest neighbours retrieved from an external text collection, which can be the training set \(D\). Different from fine-tuning, retrieval augmentation does not need any retraining. In particular, RaLM includes two tasks, i.e., building a datastore and retrieval-augmented inference.
Datastore: The datastore is a retrieval set, built with a single forward pass of the LM over the prepared text collection to store context-target pairs as the subject of a query. We denote by \(f(\cdot)\) a function that maps a context \(c\) to a fixed-length vector representation computed by a pre-trained LM. Given an example \((c_i, w_i) \in D\), we can pass \(c_i\) to the LM to get its vector representation, i.e., \(k_i=f(c_i)\). The dataset \(D\) can be the training data or other domain-specific data. In this way, we obtain the key-value pair \((k_i, v_i)\), where \(k_i\) is the context representation computed by the LM and \(v_i\) is the target word \(w_i\). Hence, the datastore \((K, V)\) is the set of all context-target pairs built from \(D\), which can be expressed as: \[(K, V) = \{(f(c_i), w_i) \mid (c_i, w_i) \in {D}\}\] Inference: The inference phase includes neighbour retrieval and the use of the neighbours' prediction information. Given a new input \(x\), the model first computes its context representation, i.e., \(f(x)\), and uses \(f(x)\) as a query to retrieve the \(k\)-nearest neighbours \(\mathcal{N}\) from the datastore \((K, V)\) based on a defined distance function \(dis(\cdot)\) such as the Euclidean distance. It then computes a distribution over these \(k\) neighbors using a softmax function. The probability for each vocabulary item is aggregated across all its occurrences in the retrieved targets; items in the vocabulary that do not appear in the retrieved targets have a probability of zero. \[\label{eq-retrieval} p_{kNN}(y|x) \propto \sum_{(k_i,v_i) \in \mathcal{N}} \mathbf{1}_{\{y = v_i\}} \exp(-dis(k_i, f(x)))\tag{1}\] The final distribution interpolates the original LM distribution \(p_{LM}(y|x)\) with \(p_{kNN}(y|x)\) to obtain the joint distribution: \[\label{eq-knn} p(y|x) = (1-\lambda)p_{LM}(y|x) + \lambda p_{kNN}(y|x)\tag{2}\] where \(\lambda\) is a tuned hyper-parameter that controls the weight of generation versus retrieval.
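To make the interpolation in Equations 1 and 2 concrete, the following minimal sketch (ours, not the original kNN-LM implementation) computes \(p_{kNN}\) from a handful of retrieved neighbours and blends it with the LM distribution; the toy vocabulary, distances, and \(\lambda\) value are illustrative assumptions.

```python
import numpy as np

def knn_lm_interpolate(p_lm, neighbors, vocab_size, lam=0.25):
    """Blend the LM distribution with a kNN distribution (Eq. 1 and 2).

    p_lm      : (vocab_size,) softmax output of the language model.
    neighbors : list of (distance, target_token_id) pairs retrieved with f(x).
    lam       : interpolation weight (a tuned hyper-parameter).
    """
    # Softmax over negative distances, aggregated per vocabulary item (Eq. 1).
    weights = np.exp(-np.array([d for d, _ in neighbors]))
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for w, (_, v) in zip(weights, neighbors):
        p_knn[v] += w
    # Linear interpolation of the two distributions (Eq. 2).
    return (1.0 - lam) * p_lm + lam * p_knn

# Toy usage: a 5-token vocabulary and three retrieved neighbours.
p_lm = np.array([0.1, 0.6, 0.1, 0.1, 0.1])
neighbors = [(0.2, 2), (0.5, 2), (1.0, 1)]
print(knn_lm_interpolate(p_lm, neighbors, vocab_size=5))
```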
From Equation 2, we can observe that the final distribution of kNN-LM is the weighted sum of the original LM distribution \(p_{LM}(y|x)\) and the retrieved nearest-neighbor distribution \(p_{kNN}(y|x)\). The key problem is how to interpolate the retrieved knowledge into the prediction, i.e., the design of \(p_{kNN}(y|x)\). In kNN-LM, \(p_{kNN}(y|x)\) is computed from Equation 1 based on negative distances, with the probability for each vocabulary item aggregated across all its occurrences in the retrieved targets. While this design is intuitive, it is still based on heuristics and lacks theoretical analysis and explanation. A key question is what kinds of information should be retrieved and how best to leverage that information.
In this section, we delve into a theoretical analysis aimed at identifying useful retrieval information, drawing inspiration from the fine-tuning process commonly employed for enhancing the performance of CPMs. Subsequently, we introduce our method, FT2Ra, which focuses on the effective interpolation of this retrieved information to improve the predictive accuracy of CPMs.
Fine-tuning serves as a practical technique for boosting the performance of pre-trained models, particularly when applied to domain-specific tasks or datasets that the original model may not adequately cover. Our goal is to distil insights from the mechanics of fine-tuning to inform the design of a retrieval-augmented method that approximates the performance improvements seen with fine-tuning, yet obviates the need for the fine-tuning process itself.
Let \(\mathcal{M}\) represent a given language model capable of predicting the next token \(x_t\) based on its preceding context sequence \(x = (x_1, x_2, \ldots, x_{t-1})\). We proceed with the following definitions:
\(\theta\) denotes the trained model parameters of \(\mathcal{M}\).
\(y \in \mathbb{R}^{v}\) is the ground truth for \(x_t\) as a one-hot encoding, where the index corresponding to \(x_t\) is 1 and all other indices are 0; \(v\) is the size of the vocabulary set.
\(y'\) is the model prediction result for the next token, i.e., \(y'=\mathcal{M}(x|\theta)\) and \(y' \in \mathbb{R}^{v}\), which denotes the predicted probability of each token in the vocabulary set with the context. Typically, \(y'\) is the output for the probability layer of the model \(\mathcal{M}\).
\(logits \in \mathbb{R}^{v}\) encapsulates the values in the logits layer, preceding the probability layer.
\(seqout \in \mathbb{R}^{dmodel}\) is the output of the decoder sequence output layer, preceding the logits layer, and \(dmodel\) is the dimension of this layer.
Suppose the LM \(\mathcal{M}\) undergoes fine-tuning through multiple epochs, following best practices. Without loss of generality, we assume that the loss for a given input \(x\) diminishes after each iteration of the fine-tuning (i.e., the gradient descent algorithm). Let \(\theta\) and \(\theta'\) denote the model’s parameters before and after an epoch of fine-tuning, respectively, such that \(\theta' = \theta + \Delta\theta\). The corresponding loss values before and after the fine-tuning are: \[l=\mathcal{L}(\mathcal{M}(x|\theta), y), l'=\mathcal{L}(\mathcal{M}(x|\theta'), y)\]
where \(\mathcal{L}\) is the loss function, and \(y\) is the ground truth for the given context sequence \(x\). We define the change in the loss as \(\Delta l = l'-l\). Given that the language model is differentiable, the change in loss \(\Delta l\) can be expressed as: \[\label{eq:deta-loss1} \begin{align} \Delta l &= \mathcal{L}(\mathcal{M}(x|\theta + \Delta\theta), y) - \mathcal{L}(\mathcal{M}(x|\theta), y) \end{align}\tag{3}\]
In gradient descent, the learning rate \(\eta_{\theta}\) controls the magnitude of parameter updates: \[\label{eq:deta-parameter} \Delta \theta = - \eta_{\theta} \times \frac{\partial \mathcal{L}}{\partial \theta}\tag{4}\]
On the other hand, the loss can also be formulated in terms of logits, \(l=\mathcal{L}(softmax(logits), y)\), where \(logits\) is the model output on \(x\) and \(y\) is the ground truth. After one iteration of the gradient descent, the loss value \(l'\) can be described as \(l'=\mathcal{L}(softmax(logits'), y)\), with \(logits'\) denoting the output of the updated model. Let \(\Delta logits = logits' - logits\), and we can derive:
\[\label{eq:deta-loss4} \begin{align} \Delta l &= \mathcal{L}(softmax(logits + \Delta logits), y) - \mathcal{L}(softmax(logits), y) \\ &\approx (\Delta logits)^T \cdot \frac{\partial \mathcal{L}}{\partial logits} \end{align}\tag{5}\]
Intuitively, if we can develop a method for approximating \(\Delta logits\) without actually engaging in fine-tuning, then these approximated \(\Delta logits\) could be directly interpolated into the predictions of the model. This mimics the effects of fine-tuning and may achieve comparable performance, depending on the accuracy of the \(\Delta logits\) approximation.
We examine the final LM-head layer of the generative model, where \(logits = lm\_head(seqout)\). Ignoring the activation layer in the LM-head, we can approximately treat the LM-head as a linear layer, from which we derive: \[\label{eq:logits} logits \approx W \cdot seqout + b\tag{6}\] where \(W\) is a weight matrix of dimension \(v \times dmodel\).
From equation 6 , using the chain rule for differentiation [@stanfordcs229], we get: \[\label{eq:chain-rule} \frac{\partial l}{\partial W} = \frac{\partial l}{\partial logits} \cdot seqout^T\tag{7}\]
During the gradient descent process, since \(W\) is a part of \(\theta\), according to equation 4 , it also follows the gradient descent rule: \[\label{eq:gradient-w} \Delta W = -\eta_{\theta} \cdot \frac{\partial l}{\partial W}\tag{8}\]
When we fix the parameters preceding \(seqout\) and fine-tune only the subsequent parameters of the model, then according to Equations 6, 7, and 8, we can make the following approximation:
\[\label{eq:approx} \begin{align} \Delta \text{logits} &\approx \Delta W \cdot seqout \\ &= -\eta_{\theta} \cdot \frac{\partial l}{\partial W} \cdot seqout \\ &= -\eta_{\theta} \cdot \frac{\partial l}{\partial logits} \cdot seqout^T \cdot seqout \\ &= -\eta_{\theta} \cdot ||seqout||^2_2 \cdot \frac{\partial l}{\partial logits} \end{align}\tag{9}\] We use \(||seqout||_2\) to denote the L2 norm of \(seqout\). Furthermore, we define \(-\eta_{logits}\) as \(-\eta_{\theta} \cdot ||seqout||^2_2\), and we can get: \[\label{eq:logitsresult} \Delta logits \approx - \eta_{logits} \times \frac{\partial \mathcal{L}}{\partial logits}\tag{10}\]
Equation 10 offers a feasible methodology for calculating changes in logits, which can be employed to bolster the current model's performance on \(x\) by reducing its loss. To obtain the value of \(\frac{\partial \mathcal{L}}{\partial logits}\), we propose the retrieval-based method detailed in Section 3.2.1.
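As a sanity check on this derivation, the sketch below (our own, using PyTorch) performs one SGD step on a linear LM head for a single sample and compares the resulting change in logits with the closed form of Equation 10. The bias term is omitted, and the dimensions and learning rate are illustrative assumptions; in this restricted setting, where only the head weights are updated, the closed form matches the actual update.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
v, d, eta = 8, 16, 0.1                     # vocab size, hidden size, learning rate (illustrative)

seqout = torch.randn(d)                    # decoder output for the current position
W = torch.randn(v, d, requires_grad=True)  # LM-head weight (bias omitted for clarity)
target = torch.tensor([3])                 # ground-truth next token

logits = W @ seqout
loss = F.cross_entropy(logits.unsqueeze(0), target)
loss.backward()

# One SGD step on the LM head only (Eq. 8), then recompute the logits.
with torch.no_grad():
    W_new = W - eta * W.grad
    delta_logits_true = W_new @ seqout - logits

# Closed form from Eq. 9/10: -eta * ||seqout||^2 * dL/dlogits,
# with dL/dlogits = softmax(logits) - onehot(y) for cross-entropy loss.
dl_dlogits = F.softmax(logits, dim=-1).detach().clone()
dl_dlogits[target.item()] -= 1.0
delta_logits_pred = -eta * seqout.pow(2).sum() * dl_dlogits

print(torch.allclose(delta_logits_true, delta_logits_pred, atol=1e-5))  # True
```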
Our derivation yields new insights about what kind of information should be stored and retrieved (i.e., the \(\Delta logits\)) and how to leverage that information (i.e., add \(\Delta logits\) to the predicted logits). In summary, it brings the following benefits: 1) the retrieval mechanism is theoretically grounded, unlike existing purely heuristic approaches; 2) the retrieval mechanism mimics the fine-tuning process, giving it high potential to achieve strong performance; and 3) the retrieved knowledge regarding \(\Delta logits\) is more fine-grained than in existing methods, and its integration into the prediction process is both straightforward and direct.
Building on the theoretical insights from fine-tuning, we introduce a novel retrieval-augmented method.
To approximate the value of \(\frac{\partial \mathcal{L}}{\partial logits}\) shown in Equation 10 , we employ the nearest \(k\) neighbors of the sample \(x\) for the estimation. The approximation is formulated as \[\label{oshqmjak} \frac{\partial \mathcal{L}}{\partial logits} \approx \sum_i \lambda_i \times \frac{\partial \mathcal{L}_i}{\partial logits_i}\tag{11}\] where \(1 \le i \le k\) represents the \(i^{th}\) neighbor and \(\lambda_i\) serves as a hyper-parameter to adjust the contribution of each neighbor to the approximation. Since \(\frac{\partial \mathcal{L}_i}{\partial logits_i}\) is the partial derivative with respect to the logits layer, we have \(\frac{\partial \mathcal{L}_i}{\partial logits_i} = y'_i-y_i\) for each neighbor. Incorporating this into Equation 11 yields: \[\label{eq:approx2} \frac{\partial \mathcal{L}}{\partial logits} \approx \sum_i \lambda_i \times ( y'_i - y_i)\tag{12}\]
Finally, we integrate Equation 12 into Equation 10 to derive: \[{\textcolor{black}{\Delta logits \approx - \eta_{logits} \times \sum_i \lambda_i \times ( y'_i - y_i) }}\] \[\label{eq:finally} logits' \approx logits - \eta_{logits} \times \sum_i \lambda_i \times ( y'_i - y_i)\tag{13}\]
Given an input \(x\), Equation 13 offers a mechanism to calculate new logits by leveraging both the original prediction and the contributions from the nearest neighbours.
As with prior work in this area [@tang2023domain; @Khandelwal2020Generalization], a retrieval set, referred to as datastore \(D\), is essential for storing context-target pairs, often represented as key-value pairs. The nature of the knowledge encapsulated in this datastore depends on the specific retrieval mechanism employed, particularly the type of information used for the calculation.
In the datastore, each key is generated to facilitate distance calculation between the given input and the elements in the retrieval set. For a given training example \((c_i, w_i) \in D\), we map the context \(c_i\) to a fixed-length vector representation using a function \(f(\cdot)\). In line with previous research [@Khandelwal2020Generalization], we utilize the last hidden states (i.e., the output of the final layer of the CPM) as the mapping function \(f(\cdot)\). Hence, the key for each entry is \(k_i = f(c_i)\).
Considering Equation 13 , the value associated with each key should include both the ground truth \(y_i\) (which corresponds to \(w_i\) in a one-hot encoded format) and the predicted probability distribution \(y_i'\). Instead of storing the probability vector, we opt to store the corresponding logits vector \(logits_i\). This is because: 1) The predicted probability \(y_i'\) can be easily recalculated from \(logits_i\) whenever needed and 2) Storing \(logits_i\) allows for their use in multiple retrieval iterations, as will be detailed in Section 3.2.3. Given these considerations, the datastore is formally defined as: \[(K, V) = \{(f(c_i), (y_i, logits_i)) \mid (c_i, w_i) \in {D}\}\]
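A minimal sketch of how such a datastore could be assembled and queried is shown below; `encode_context` and `lm_head` are hypothetical helpers standing in for the CPM's forward pass, and the actual storage layout and index structure in FT2Ra may differ.

```python
import numpy as np

def build_datastore(examples, encode_context, lm_head):
    """Build the FT2Ra datastore (K, V) = {(f(c_i), (y_i, logits_i))}.

    examples       : iterable of (context, target_token_id) pairs from D.
    encode_context : hypothetical helper returning the last hidden state f(c_i).
    lm_head        : hypothetical helper mapping f(c_i) to the logits vector.
    """
    keys, targets, logits_store = [], [], []
    for context, target in examples:
        h = encode_context(context)          # f(c_i), the retrieval key
        keys.append(h)
        targets.append(target)               # y_i kept as a token id (one-hot on demand)
        logits_store.append(lm_head(h))      # logits_i, reused across retrieval iterations
    return np.stack(keys), np.array(targets), np.stack(logits_store)

def retrieve(query, keys, k=20):
    """Return indices and Euclidean distances of the k nearest keys to f(x)."""
    dists = np.linalg.norm(keys - query, axis=1)
    idx = np.argsort(dists)[:k]
    return idx, dists[idx]
```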
Algorithm [alg:algorithm] outlines the steps involved in the execution of FT2Ra, our retrieval-augmented language model. The inputs to FT2Ra include: an input context \(x\), the original pre-trained model \(\mathcal{M}\), the learning rate \(\eta_{logits}\), the datastore \((K, V)\), the number of neighbors to retrieve \(N\), and the number of iterative retrieval cycles \(E\). The output generated by FT2Ra is the updated prediction \(y_x'\).
Initially, FT2Ra computes the representation vector \(r\) of the input \(x\), using it to fetch the top-\(N\) nearest neighbors from the datastore (lines 1–[algo:getpair]). An iterative retrieval process then follows, which is a unique feature compared to existing methods. The iterative retrieval process is similar to the process of model fine-tuning conducted over a specified number of epochs, denoted as \(E\) (lines [algo:startepoch]–[algo:endepoch]). At each iteration, the original model’s prediction (retrieved from \(logits_x\) in line [algo:getlogits]) is adjusted based on the logits alteration computed from the retrieved neighbors (lines [alg:nblogitstart]–[alg:nblogitend]).
It’s worth noting that neighbors may vary in their relevance to the input context. Accordingly, we introduce weights \(\lambda_i\) for each neighbor (line [algo:diffweights]). These weights are calculated based on the inverse of the distance between the neighbors and the input: \[\label{eq:lambda} \lambda_i = \frac{1/(d_i+1)}{\sum_{d\in D}(1/(d+1))}\tag{14}\]
Intuitively, a smaller distance between a neighbor and the input results in a higher weight, meaning that closer neighbors contribute more substantially to the updated prediction.
Finally, FT2Ra updates the logits using the calculated change in logits, which has been interpolated from retrieved samples (line [algo:finallogit]). To facilitate further iterations of the retrieval process, the datastore is also updated (line [algo:updatestore]). While an ideal update method would involve recalculating the entire datastore using Equation 13 , we opt for a more computationally efficient strategy. Specifically, we maintain the same set of neighbors across all iterations and apply a constant \(\Delta{logits}\) to the logits of these neighbors, balancing efficacy with computational efficiency.
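Putting these pieces together, the following sketch approximates the iterative retrieval loop described above (Equations 13 and 14), reusing the `retrieve` helper from the datastore sketch; the hyper-parameter values are illustrative, and this is our reading of the algorithm rather than the authors' implementation.

```python
import numpy as np  # `retrieve`, `keys`, `targets`, `logits_store` come from the datastore sketch above

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ft2ra_predict(query, logits_x, keys, targets, logits_store,
                  eta=5.0, k=20, epochs=7):
    """Iterative retrieval-augmented prediction (Eq. 13 and 14)."""
    vocab_size = logits_x.shape[0]
    idx, dists = retrieve(query, keys, k)      # neighbor set is fixed across iterations
    lam = 1.0 / (dists + 1.0)
    lam /= lam.sum()                           # neighbor weights (Eq. 14)
    neighbor_logits = logits_store[idx].copy()
    logits = logits_x.copy()
    for _ in range(epochs):
        # Approximate dL/dlogits by sum_i lam_i * (y'_i - y_i), Eq. 12.
        grad = np.zeros(vocab_size)
        for w, t, nl in zip(lam, targets[idx], neighbor_logits):
            y_pred = softmax(nl)               # y'_i recomputed from the stored logits
            y_true = np.zeros(vocab_size)
            y_true[t] = 1.0
            grad += w * (y_pred - y_true)
        delta = -eta * grad                    # Delta logits, Eq. 13
        logits += delta
        neighbor_logits += delta               # cheap datastore refresh: same delta for all neighbors
    return softmax(logits)
```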
Discussions. Differing from existing methods, FT2Ra provides two main advantages. First, it employs detailed retrieval information, \(\Delta {logits}\), for a more precise evaluation of each retrieved neighbor’s influence on the final prediction. Second, its iterative retrieval cycles can further improve performance. It’s important to note that these multiple iterations are not actual fine-tuning, but rather a series of retrieval processes. These iterations are also optional and can be adjusted according to specific needs, like accuracy and inference efficiency. In our evaluation, we found that FT2Ra outperformed baseline models even with just one iteration (the conventional setting). With multiple iterations, however, FT2Ra’s performance can be further enhanced (see results in RQ4).
The experimental design considers two completion scenarios: token-level and line-level completions, on models with or without fine-tuning. Specifically, we aim to answer the research questions:
RQ1: How effective is FT2Ra in the two completion tasks?
RQ2: To what extent can FT2Ra approximate the effect of actual fine-tuning?
RQ3: How do different parameter settings, including the weighting strategies and the number of neighbours selected, affect FT2Ra’s performance?
RQ4: How useful is the multi-round iteration strategy in FT2Ra?
Based on the scale of completion, we consider two completion scenarios, i.e., token-level and line-level completion. For token-level completion, the model predicts the next token based on the given (correct) context, and the evaluation metric is accuracy, i.e., whether each completed token is correct. For line-level completion, the model repeatedly performs token-level completion until a line is completed, and retrieval occurs at every step of token prediction. In contrast to token-level completion, the prediction of each token depends on the previously predicted tokens, which might be incorrect. In line with CodeXGLUE [@lu2021codexglue], the chosen evaluation metrics are exact match (EM) and edit similarity (ES).
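For reference, a minimal sketch of the two line-level metrics is given below. CodeXGLUE computes edit similarity with a fuzzy string-matching ratio, so the Levenshtein-based ratio here is an approximation we assume for illustration.

```python
def exact_match(pred: str, ref: str) -> int:
    """1 if the completed line matches the ground truth exactly, else 0."""
    return int(pred.strip() == ref.strip())

def edit_similarity(pred: str, ref: str) -> float:
    """Levenshtein-ratio similarity in [0, 100] (an approximation of CodeXGLUE's ES)."""
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))                    # distances for the previous row
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,             # deletion
                        dp[j - 1] + 1,         # insertion
                        prev + (pred[i - 1] != ref[j - 1]))  # substitution
            prev = cur
    return 100.0 * (1.0 - dp[n] / max(m, n, 1))

print(exact_match("return x + 1", "return x + 1"))               # 1
print(round(edit_similarity("return x + 1", "return x+1"), 1))   # 83.3
```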
We have chosen two widely used benchmarks for our study: the dataset from kNM-LM [@tang2023domain] and the code completion benchmarks from CodeXGLUE [@lu2021codexglue]. Specifically, the kNM-LM benchmark comprises 20 Java projects: 10 large-scale and 10 small-scale. In our experiments, we selected the ten larger projects. The CodeXGLUE benchmarks contain code samples written in both Java and Python, i.e., JavaCorpus [@allamanis2013mining] and PY150 [@raychev2016probabilistic].
We follow the settings in [@tang2023domain; @lu2021codexglue] for preparing and splitting the training and testing data. The training dataset can be used to fine-tune the pre-trained models. Note that the kNM-LM benchmarks do not provide a pre-defined test set tailored for line-level completion. To circumvent this, we follow the instructions in [@allamanis2013mining]. Specifically, we randomly extract 300 lines of code from the test data of each project to serve as targets for model completion. For evaluations regarding token-level predictions, we let the models predict each individual token in the test code samples.
Following the state-of-the-art work [@tang2023domain], we selected two widely used code pre-trained models for our experiments: 1) CodeGPT [@lu2021codexglue]: a GPT-style code pre-trained model for code completion. CodeGPT has the same model architecture and training objectives as GPT-2 [@radford2019language], consisting of 12 layers of Transformer decoders. CodeGPT is pre-trained on Python and Java corpora from CodeSearchNet [@husain2019codesearchnet], which include 1.1M Python functions and 1.6M Java methods. The CodeGPT-adapted variant continues pre-training from a GPT-2 checkpoint; we use CodeGPT-small-py-adaptedGPT2 and CodeGPT-small-java-adaptedGPT2 to evaluate code completion on the Python and Java datasets, respectively. 2) UniXcoder [@guo2022unixcoder]: a cross-modal pre-trained model that uses mask attention matrices with prefix adapters to control the model behaviour. Furthermore, it leverages cross-modal content such as ASTs and code comments to enhance the code representations. Specifically, it consists of 12 Transformer layers with 768-dimensional hidden states. UniXcoder is pre-trained on the CodeSearchNet [@husain2019codesearchnet] dataset for six programming languages, including Java and Python.
Although our method is general, in this paper we did not select very large models like CodeGen, InCoder, and CodeLlama, as fine-tuning them demands substantial computing resources. This requirement arises particularly because 1) the dataset includes unique symbols (e.g., \(<STR\_LIT>\), \(<NUM\_LIT>\), \(<CHAR\_LIT>\)) that necessitate specialized fine-tuning, and 2) our experimental setting in RQ2 requires fine-tuning. Consequently, we chose the two widely used CPMs above, as suggested in the recent study [@tang2023domain].
Four state-of-the-art retrieval-based baselines are selected for comparison: kNN-LM, kNM-LM, BM25, and ReACC, where BM25 and ReACC are applied to line-level completion.
kNN-LM [@Khandelwal2020Generalization]: It augments the prediction of a pre-trained language model by linearly interpolating its next word distribution with a k-nearest neighbours model. The nearest neighbours are computed based on the distance in the vector space with a single forward pass of a pre-trained model over the retrieved dataset. The final distribution is the weighted sum of the original model distribution and the nearest neighbour distribution.
kNM-LM [@tang2023domain]: It utilizes in-domain code to construct a retrieval datastore decoupled from the LM and then combines it with the LM via Bayesian inference for code completion. Compared with kNN-LM, it is able to calculate the posterior probability and use it to merge the distributions of the nearest neighbours and the original model, which avoids manual weight tuning between the model distribution and the neighbour distribution.
BM25 [@robertson2009probabilistic]: It is a term-based retrieval approach that treats each code fragment as a sequence of code tokens and employs bag-of-words representations to compute a matching score based on the lexical similarity between the query and the document. Hence, it is more suitable for line-level completion. As BM25 is based on term frequency, it is a sparse retriever.
ReACC [@lu2022reacc]: It is a hybrid retriever framework that combines the scores of sparse and dense retrievers. For the sparse retriever, it uses BM25 [@robertson2009probabilistic]. For the dense retriever, it maps each code fragment to a dense vector based on the DPR model [@karpukhin2020dense], which consists of two bidirectional Transformer encoders that encode the query code and the retrieved code, respectively.
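To illustrate how such a sparse, lexical retriever scores candidate code fragments, the sketch below uses the rank_bm25 package on a toy corpus; it is an illustrative stand-in rather than the exact retrieval pipeline used by ReACC or in our experiments.

```python
from rank_bm25 import BM25Okapi   # pip install rank-bm25

# A toy "code database": each fragment is treated as a bag of code tokens.
corpus = [
    "public int add ( int a , int b ) { return a + b ; }",
    "public int sub ( int a , int b ) { return a - b ; }",
    "private String name ;",
]
tokenized_corpus = [fragment.split() for fragment in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "return a + b".split()
scores = bm25.get_scores(query)               # lexical matching score per fragment
best = bm25.get_top_n(query, corpus, n=1)     # most similar fragment
print(scores, best)
```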
Table 1: Token-level completion accuracy (%) on CodeGPT and UniXcoder.

| Type | Dataset | CodeGPT Original | CodeGPT kNN-LM | CodeGPT kNM-LM | CodeGPT FT2Ra | UniXcoder Original | UniXcoder kNN-LM | UniXcoder kNM-LM | UniXcoder FT2Ra |
|---|---|---|---|---|---|---|---|---|---|
| kNM-LM | Rest. | 46.99 | 54.99 | 70.71 | 77.68 | 42.59 | 50.86 | 71.26 | 77.58 |
| kNM-LM | Amaze. | 55.22 | 58.33 | 66.34 | 71.00 | 54.65 | 56.73 | 68.14 | 71.79 |
| kNM-LM | Dropwizard | 50.11 | 55.00 | 65.50 | 71.14 | 47.15 | 51.30 | 66.56 | 70.12 |
| kNM-LM | Eureka | 52.76 | 55.73 | 64.56 | 70.15 | 51.00 | 54.36 | 66.01 | 69.35 |
| kNM-LM | Feign | 48.71 | 54.47 | 70.63 | 77.48 | 45.01 | 50.36 | 70.87 | 77.12 |
| kNM-LM | Galaxy | 53.40 | 55.57 | 64.90 | 69.24 | 49.57 | 52.47 | 65.16 | 69.25 |
| kNM-LM | Interview | 64.70 | 66.46 | 69.29 | 73.14 | 62.91 | 64.80 | 71.54 | 75.27 |
| kNM-LM | Logging. | 60.06 | 65.10 | 79.10 | 87.38 | 56.95 | 61.85 | 79.69 | 86.30 |
| kNM-LM | Requery | 56.69 | 59.44 | 68.39 | 75.75 | 54.11 | 56.07 | 68.66 | 74.91 |
| kNM-LM | Froyo. | 59.53 | 62.56 | 67.64 | 71.04 | 58.79 | 61.31 | 69.38 | 72.79 |
| kNM-LM | Avg. | 54.82 | 58.77 | 68.71 | 74.40 | 52.27 | 56.01 | 69.73 | 74.45 |
| CodeXGLUE | JavaCorpus | 64.95 | 67.83 | 70.74 | 72.07 | 65.61 | 67.92 | 72.13 | 74.73 |
| CodeXGLUE | PY150 | 52.41 | 55.35 | 60.94 | 62.19 | 60.44 | 64.62 | 69.81 | 71.43 |
| | Total Avg. | 55.46 | 59.24 | 68.23 | 73.19 | 54.07 | 57.72 | 69.93 | 74.22 |
Considering the hyper-parameters in Algorithm [alg:algorithm], in our experiments we configured the number of neighbors (\(N\)) and the number of iterations (\(E\)) to 20 and 7, respectively. Additionally, we set the learning rate (\(\eta_{logits}\)) to specific values: 3 for JavaCorpus, 5 for PY150, and 5 for the kNM-LM dataset. We thoroughly evaluated and discussed various settings of these hyper-parameters in RQ3 and RQ4. For the other baseline methods, we used the default configurations from their papers. Notably, since kNN-LM was not originally applied to code learning tasks, we followed the same configurations as described in [@tang2023domain].
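For clarity, these settings can be summarized as a small configuration sketch; the structure and names below are illustrative rather than part of our released code.

```python
# Hyper-parameters used for FT2Ra in our experiments (illustrative structure).
FT2RA_CONFIG = {
    "num_neighbors": 20,      # N in Algorithm [alg:algorithm]
    "num_iterations": 7,      # E, the number of retrieval cycles
    "eta_logits": {           # pseudo learning rate, chosen per dataset
        "JavaCorpus": 3,
        "PY150": 5,
        "kNM-LM": 5,
    },
}
```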
The main goal of the retrieval augmentation is to bolster the model’s performance without the need for fine-tuning. Therefore, this experiment aims to evaluate the effectiveness of FT2Ra on pre-trained models without fine-tuning.
Token-level Completion. The results for token-level completion are shown in Table 1, including the results on the ten Java projects from kNM-LM and the two CodeXGLUE benchmarks. The Original column shows the results with the pre-trained models.
The overall results show that, compared with the original pre-trained models, all retrieval-augmented techniques have a higher accuracy, demonstrating the usefulness of retrieval-based augmentation. Furthermore, we can see that FT2Ra significantly outperforms the baselines across all datasets and models. For example, while the average accuracies of pre-trained models on CodeGPT and UniXcoder stand at 55.46% and 54.07%, respectively, FT2Ra increases the performance to 73.19% and 74.22%, outperforming all baseline models. Moreover, in comparison with the best baseline kNM-LM, FT2Ra boasts an average increase of 4.96% for CodeGPT and 4.29% for UniXcoder. The results demonstrate the effectiveness of FT2Ra in token-level completion.
Table 2: Exact Match (EM) and Edit Similarity (ES) of line-level code completion; each cell shows EM / ES.

CodeGPT:

| Type | Dataset | Original | kNN-LM | kNM-LM | BM25 | ReACC | FT2Ra |
|---|---|---|---|---|---|---|---|
| kNM-LM | Rest. | 1.00 / 49.36 | 1.00 / 53.68 | 9.63 / 47.05 | 3.99 / 53.48 | 3.99 / 53.05 | 17.94 / 70.62 |
| kNM-LM | Amaze. | 1.99 / 56.86 | 3.99 / 57.94 | 9.30 / 47.75 | 2.99 / 59.25 | 2.99 / 58.89 | 22.92 / 66.54 |
| kNM-LM | Dropwizard | 1.00 / 52.13 | 2.33 / 57.83 | 4.65 / 49.56 | 1.33 / 52.52 | 1.33 / 51.84 | 22.59 / 69.34 |
| kNM-LM | Eureka | 3.99 / 55.76 | 5.32 / 58.20 | 8.31 / 50.41 | 3.99 / 57.59 | 3.99 / 57.78 | 20.93 / 68.00 |
| kNM-LM | Feign | 1.33 / 47.72 | 3.32 / 52.77 | 10.63 / 47.19 | 3.99 / 53.50 | 3.99 / 53.21 | 25.91 / 72.34 |
| kNM-LM | Galaxy | 1.33 / 50.97 | 2.66 / 54.61 | 11.96 / 48.53 | 2.66 / 51.72 | 2.33 / 52.07 | 22.26 / 64.14 |
| kNM-LM | Interview | 8.97 / 61.63 | 13.95 / 62.80 | 19.60 / 57.28 | 9.30 / 61.41 | 9.97 / 62.91 | 27.91 / 71.08 |
| kNM-LM | Logging. | 2.66 / 59.04 | 5.98 / 63.13 | 15.28 / 58.59 | 6.31 / 63.42 | 6.64 / 63.64 | 33.89 / 80.63 |
| kNM-LM | Requery | 5.98 / 61.76 | 8.97 / 62.81 | 9.63 / 46.51 | 6.64 / 63.21 | 7.31 / 63.12 | 28.24 / 73.11 |
| kNM-LM | Froyo. | 8.31 / 63.96 | 11.30 / 65.85 | 16.28 / 55.58 | 7.97 / 63.90 | 8.31 / 63.93 | 28.24 / 73.91 |
| kNM-LM | Avg. | 3.65 / 55.92 | 5.88 / 58.96 | 11.53 / 50.85 | 4.92 / 58.00 | 5.08 / 58.04 | 25.08 / 70.97 |
| CodeXGLUE | JavaCorpus | 13.69 / 48.74 | 16.28 / 49.90 | 16.18 / 43.89 | 14.19 / 49.73 | 14.19 / 49.85 | 22.88 / 54.54 |
| CodeXGLUE | PY150 | 0.00 / 12.54 | 10.39 / 41.40 | 8.69 / 11.56 | 0.20 / 13.35 | 0.20 / 13.36 | 18.48 / 50.52 |
| | Total Avg. | 4.19 / 51.71 | 7.12 / 56.74 | 11.68 / 46.99 | 5.30 / 53.59 | 5.43 / 53.64 | 24.35 / 67.90 |

UniXcoder:

| Type | Dataset | Original | kNN-LM | kNM-LM | BM25 | ReACC | FT2Ra |
|---|---|---|---|---|---|---|---|
| kNM-LM | Rest. | 0.66 / 50.12 | 1.66 / 52.07 | 11.30 / 54.05 | 1.99 / 63.46 | 1.99 / 64.66 | 18.94 / 72.17 |
| kNM-LM | Amaze. | 1.33 / 56.00 | 1.66 / 58.40 | 13.29 / 55.28 | 1.00 / 58.87 | 1.00 / 59.04 | 23.26 / 66.92 |
| kNM-LM | Dropwizard | 0.00 / 49.44 | 1.33 / 54.89 | 16.61 / 56.41 | 0.66 / 59.82 | 0.66 / 58.33 | 20.27 / 69.94 |
| kNM-LM | Eureka | 0.33 / 55.44 | 1.66 / 59.39 | 15.61 / 58.91 | 0.00 / 63.56 | 0.00 / 62.23 | 22.26 / 69.98 |
| kNM-LM | Feign | 0.00 / 50.06 | 1.00 / 52.46 | 8.97 / 52.99 | 4.65 / 67.68 | 4.65 / 67.39 | 27.24 / 75.24 |
| kNM-LM | Galaxy | 0.33 / 50.06 | 0.33 / 53.06 | 13.29 / 47.61 | 1.00 / 54.89 | 1.00 / 54.86 | 21.26 / 64.56 |
| kNM-LM | Interview | 4.65 / 59.89 | 7.31 / 61.88 | 20.93 / 62.78 | 2.66 / 58.28 | 3.32 / 57.65 | 30.56 / 73.34 |
| kNM-LM | Logging. | 0.00 / 55.11 | 4.98 / 60.78 | 17.28 / 59.45 | 3.32 / 72.28 | 3.99 / 72.45 | 35.22 / 81.23 |
| kNM-LM | Requery | 2.33 / 60.56 | 3.99 / 61.92 | 14.95 / 56.73 | 2.33 / 65.34 | 2.33 / 65.14 | 30.56 / 73.79 |
| kNM-LM | Froyo. | 1.33 / 63.60 | 1.99 / 65.14 | 21.93 / 62.75 | 2.33 / 63.94 | 2.66 / 64.09 | 32.56 / 76.46 |
| kNM-LM | Avg. | 1.10 / 55.03 | 2.59 / 58.00 | 15.42 / 56.70 | 1.99 / 62.81 | 2.16 / 62.58 | 26.21 / 72.36 |
| CodeXGLUE | JavaCorpus | 8.49 / 47.72 | 9.99 / 49.69 | 12.89 / 48.06 | 7.79 / 46.97 | 7.79 / 47.09 | 24.58 / 58.67 |
| CodeXGLUE | PY150 | 0.10 / 8.43 | 0.00 / 8.60 | 0.10 / 12.74 | 0.00 / 7.46 | 0.00 / 7.43 | 29.17 / 59.00 |
| | Total Avg. | 1.63 / 50.54 | 2.99 / 53.19 | 13.93 / 52.31 | 2.31 / 56.88 | 2.45 / 56.70 | 26.32 / 70.11 |
Line-level Completion. The results for line-level completion, evaluated on CodeGPT and UniXcoder, are presented in Table 2; exact match and edit similarity are reported as EM and ES, respectively. Similarly, all of the retrieval-based methods still enhance performance, but the improvements from the baselines are generally limited. For example, on average, kNN-LM, kNM-LM, BM25 and ReACC achieve scores of (7.12, 56.74), (11.68, 46.99), (5.30, 53.59) and (5.43, 53.64) on CodeGPT, respectively, while the pre-trained model achieves (4.19, 51.71). While the recent state-of-the-art kNM-LM achieves higher EM scores than the other baselines, its ES scores are lower. The low performance of the baselines can be attributed to the difficulty of line-level completion: any incorrect token prediction (inaccurate context) can affect the prediction of the following tokens. FT2Ra clearly outperforms the baselines on both the EM and ES metrics across all datasets and models. On CodeGPT, FT2Ra raises the scores to (24.35, 67.90), while on UniXcoder it achieves an even larger improvement, reaching (26.32, 70.11) compared with the pre-trained model's (1.63, 50.54) and the baselines. The results demonstrate the effectiveness of our proposed retrieval mechanism on line-level completion.
We have observed that the performance of various methods varies across different datasets and models. Considering the results on the dataset Froyo., we find that all methods consistently achieve higher EM scores on CodeGPT compared to UniXcoder. Interestingly, all baseline models, including pre-trained ones, exhibit poor performance on the PY150 dataset but demonstrate better results on the JavaCorpus dataset. Upon our in-depth analysis of CodeGPT, we discovered that the pre-trained model CodeGPT-small-py-adaptedGPT2 tends to underestimate the probability of end-of-line tokens (<EOL>). We randomly selected 30 test data instances from PY150, specifically targeting cases where FT2Ra succeeded while the original model failed. We discovered that 9 of these instances reached the maximum token prediction count (set at 100) when predicted by CodeGPT. In contrast, we randomly checked 100 instances in JavaCorpus predicted by CodeGPT-small-java-adaptedGPT2, and none of the predictions reached the token count limit. This discrepancy could be attributed to the natural line termination indicators present in Java code such as semicolons and braces, which allow the model to easily discern when to stop the prediction. However, in Python, the model must accurately predict the <EOL> symbol to recognize the end of a statement. Compared with others, FT2Ra exhibits significant enhancements on the PY150 dataset, with improvements of (18.48, 50.52) and (29.17, 59.00) when evaluated on CodeGPT and UniXcoder, respectively.
In Figure 2, we present an illustrative example that shows the advantage of FT2Ra when applied to PY150. Upon examining the results obtained by different methods, we observe that, except for kNM-LM, all models accurately predict the initial seven tokens. However, when reaching the eighth token, the baselines, including the original model, cannot predict the correct termination token <EOL>. Instead, they predict the token requires, which leads to an uninterrupted sequence of predictions until reaching the maximum token count. By checking CodeGPT’s prediction on the eighth token, we discover that the token requires has the highest prediction probability (0.13), while the token <EOL> receives a prediction probability of 0, making it challenging for the baselines to correct the prediction. kNM-LM exhibits too much augmentation, resulting in incorrect predictions for even the first token. The table on the right provides detailed insights into how FT2Ra corrects the prediction. Despite the stubborn prediction of requires, FT2Ra leverages the calculation of \(\Delta_{logits}\) across multiple iterations to steadily increase the logits of the token <EOL> while decreasing the logits of requires. Ultimately, at the sixth iteration, FT2Ra successfully fixes the prediction.
In Figure 3, we present a Venn diagram depicting the completion lines that achieve an exact match with the ground truth. For the sake of clarity, we have excluded the results of BM25 and kNN-LM from the diagram since their outcomes closely resemble those of ReACC and kNM-LM. The findings clearly illustrate that FT2Ra outperforms other methods by generating a significantly larger number of unique code lines.
Table 3: Average time (in seconds) to predict one token. BM25 and ReACC are input-retrieval methods; kNN-LM, kNM-LM, and FT2Ra are output-retrieval methods.

| Model | Original | BM25 | ReACC | kNN-LM | kNM-LM | FT2Ra |
|---|---|---|---|---|---|---|
| CodeGPT | 0.0163 | 0.0161 | 0.0164 | 0.0208 | 0.0193 | 0.0271 |
| UniXcoder | 0.0134 | 0.0143 | 0.0135 | 0.0163 | 0.0155 | 0.0214 |
Performance. To evaluate FT2Ra’s performance, we measured the average time required to predict a token. We did not compare the line prediction time since the predictions of different methods can have different lengths. Specifically, we selected 1,000 line-completion tasks at random, using different methods to predict the line with a set length of 100 tokens. We then recorded the average token prediction time for comparison. All experiments were conducted on a single A5000 GPU card for consistency.
Table 3 presents the results. Note that the time reported for FT2Ra includes its seven retrieval iterations. On the CodeGPT model, the average prediction times for the original model, BM25, ReACC, kNN-LM, kNM-LM, and FT2Ra are 0.0163s, 0.0161s, 0.0164s, 0.0208s, 0.0193s, and 0.0271s, respectively, and a similar trend is observed with UniXcoder. The results indicate that while input-retrieval methods only slightly impact prediction speed, output-retrieval methods, which require more computation, slow it down more noticeably. Compared to the other output-retrieval baselines, FT2Ra, which retrieves more detailed information and allows for multiple retrieval rounds, takes slightly longer. This represents a trade-off between effectiveness and efficiency, with FT2Ra sacrificing some speed for significant improvements in effectiveness.
The key insight of FT2Ra lies in its approach of emulating the fine-tuning process with certain approximations (refer to Section 3.1). Hence, we compared the results of FT2Ra with genuine fine-tuning on the kNM-LM dataset. We utilized the training data from all ten projects to fine-tune the pre-trained models and subsequently evaluated the methods on all test data of these projects. The pre-trained models were fine-tuned over a range of epochs. We then compared the performance of the various methods for both line-level and token-level completion against the models fine-tuned for these different numbers of epochs.
Token-level Completion. For token-level completion, we capped the maximum number of epochs at 20, since most methods tend to converge within this period. Figure 4 presents the token-level completion performance of different methods on models fine-tuned for various numbers of epochs. Notably, the comparative results at each point are derived by retrieving from the respective fine-tuned model at that epoch (i.e., epochs 1, 2, …, 20), and the datastores are also rebuilt for each fine-tuned model. The results at epoch 0 are derived from the pre-trained model without fine-tuning.
Overall, we observe a progressive improvement in the performance of the original model with an increasing number of fine-tuning epochs (see the blue lines). The results of the retrieval-based methods also exhibit an upward trend, showing their generalization capability across different fine-tuned models. However, as the model goes through more fine-tuning epochs, the improvements diminish as the model approaches its best performance. Comparing FT2Ra to the baselines, it is clear that FT2Ra consistently outperforms them on the fine-tuned models.
To assess how closely FT2Ra’s effect (simulating fine-tuning) aligns with real fine-tuning, we compare FT2Ra’s performance on the pre-trained model without any fine-tuning to that of the fine-tuned models. As indicated by the dotted line, FT2Ra, without fine-tuning the model, achieves similar performance to fine-tuned CodeGPT and UniXcoder models after approximately 4 and 7 epochs, respectively. In contrast, the best baseline, kNM-LM, only reaches a similar performance level with a model fine-tuned for about one epoch. These results underscore the value of our theoretical analysis from the fine-tuning process.
Line-level Completion. Figure 5 illustrates the results in terms of EM for line-level completion. Due to space constraints, the results for Edit Similarity can be found on our website [@ft2rawebsite]. We capped the maximum number of epochs at 10 due to the large computational overhead of line-level completion. Compared with the token-level completion results in Figure 4, it becomes evident that the impact of the other baseline methods is notably diminished in line-level completion, primarily because this task is more difficult. We observe that BM25 and ReACC yield similar results, likely due to their similar retrieval methods. On the other hand, the performance of kNN-LM and kNM-LM closely tracks that of the fine-tuned models, indicating that they provide limited additional improvement.
Conversely, FT2Ra continues to demonstrate clear advantages over the other methods, owing to its precise token prediction. Notably, when comparing the performance of FT2Ra at epoch 0 with that of the fine-tuned models, it becomes apparent that even without fine-tuning, FT2Ra can outperform models fine-tuned for 10 epochs.
Table 4: Token-level completion accuracy (%) of FT2Ra with different weighting strategies (left) and numbers of neighbours \(N\) (right).

| Dataset | Rec. | Uni. | Smax | Smax-T | \(N\)=5 | \(N\)=10 | \(N\)=20 | \(N\)=50 |
|---|---|---|---|---|---|---|---|---|
| Rest. | 78.15 | 74.62 | 78.43 | 78.46 | 78.69 | 78.65 | 78.15 | 76.79 |
| Eureka | 70.65 | 68.80 | 67.12 | 66.84 | 69.38 | 70.14 | 70.65 | 69.79 |
| JavaCorpus | 72.07 | 71.97 | 67.98 | 68.15 | 70.58 | 71.40 | 72.07 | 72.46 |
| PY150 | 62.19 | 61.17 | 56.85 | 56.94 | 60.93 | 61.69 | 62.19 | 61.93 |
To evaluate the effectiveness of our weighting strategy and to understand the impact of the number of neighbours, we collected four datasets: two Java projects randomly chosen from the kNM-LM dataset and the two datasets from CodeXGLUE. The evaluation is performed on the token-level completion task, which serves as the foundation for line-level completion.
Various baseline methods employ different weighting strategies to determine the significance of the retrieved samples. For instance, kNN-LM utilizes softmax (referred to as Smax), whereas kNM-LM employs softmax with temperature (denoted as Smax-T). Our method calculates weights based on distance, referred to as Rec. (see Equation 14 ). To provide a comparative evaluation, we incorporated these strategies into FT2Ra. Additionally, we introduced a baseline uniform strategy (Uni.), which allocates equal weights to all samples. Detailed results can be found on the left of Table 4. The weighting strategy Rec. generally outperforms the other strategies when they are adopted in FT2Ra. An exception is the result on Rest., where Smax and Smax-T marginally exceed Rec. Interestingly, the uniform strategy Uni. outperforms Smax and Smax-T on JavaCorpus and PY150, emphasizing the importance of designing a suitable weighting strategy.
We evaluated the performance of FT2Ra by setting the number of neighbors to 5, 10, 20, and 50. The findings, as presented in the right part of Table 4, suggest that FT2Ra exhibits relatively limited sensitivity to the number of neighbours chosen. There is not a single optimal parameter that is universally effective across all datasets. In general, selecting too few neighbours may not provide enough information to augment predictions. Conversely, selecting an excessive number might introduce negative effects, such as the interference of irrelevant neighbours.
To evaluate the effect of the multiple-iteration strategy incorporated into FT2Ra, we configured FT2Ra with retrieval iterations (i.e., \(E\) in Algo. [alg:algorithm]) ranging from 1 to 10. We also consider the impact of the parameter \(\eta_{logits}\), which is configured with four values: 2.5, 5, 10, and 20. Evaluations were carried out using a configuration similar to RQ3, i.e., token-level completion on pre-trained models.
The results are presented in Figure 6. It is obvious that FT2Ra’s performance benefits from multiple retrieval rounds, which is a unique feature compared to existing retrieval-based baselines. By increasing the number of retrieval rounds, the performance of FT2Ra gradually gets better. From the results, we found that the performance of FT2Ra tends to stabilize after approximately 4 retrieval iteration cycles.
With respect to different learning rates (i.e., \(\eta_{logits}\)), FT2Ra’s performance shows high sensitivity to this parameter. In general, larger values of \(\eta_{logits}\) enable FT2Ra to converge faster, whereas smaller ones necessitate multiple iterations. For instance, with \(\eta_{logits}\) set to 0.5, convergence tends to be achieved after 10 iterations. In contrast, a setting of 4 for \(\eta_{logits}\) reaches optimal performance after just one iteration. Yet, we also observed that excessively high learning rates could hamper FT2Ra’s performance. For instance, settings of \(\eta_{logits}\) at 2 and 4 typically yield results inferior to those achieved with 0.5 and 1. The configuration using a value of 4 for \(\eta_{logits}\) achieves the poorest performance. The learning rate in FT2Ra shows a similar effect to the learning rate of real training.
Even with just one iteration, FT2Ra surpasses the best-performing baseline, kNM-LM, as shown by the straight line in Figure 6. A higher learning rate (e.g., \(\eta_{logits}=10\)) is typically needed for faster convergence. Under this setting, FT2Ra outperforms kNM-LM on Rest., Eureka, and JavaCorpus. On PY150, FT2Ra exceeds kNM-LM's performance after only two iterations. These results further highlight FT2Ra's effectiveness, even with no or few additional iterations.
While automatically selecting optimal parameters for learning rate and number of iterations can be challenging, there are some general guidelines that can aid in this process. Conducting initial trials on a small test set allows for the assessment of the model’s performance. If the model exhibits rapid oscillation and a decline in performance, it suggests that the learning rate is too high. Conversely, if the model fails to converge after many iterations, it implies that the learning rate is too low. To adjust the number of iterations, early stopping techniques can be employed, ensuring that the tuning process is both computationally efficient and completed within a reasonable timeframe.
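A rough sketch of this tuning procedure is given below; `eval_accuracy`, the candidate grid, and the patience value are hypothetical placeholders rather than the exact settings used in our experiments.

```python
def tune_ft2ra(eval_accuracy, etas=(1, 3, 5, 10), max_iters=10, patience=2):
    """Pick (eta_logits, num_iterations) on a small validation set.

    eval_accuracy : hypothetical callable (eta, iterations) -> validation accuracy.
    etas          : illustrative candidate grid for the pseudo learning rate.
    """
    best = (None, None, -1.0)
    for eta in etas:
        best_acc, stale = -1.0, 0
        for e in range(1, max_iters + 1):
            acc = eval_accuracy(eta, e)
            if acc > best_acc:
                best_acc, stale = acc, 0
                if acc > best[2]:
                    best = (eta, e, acc)
            else:
                stale += 1
                if stale >= patience:          # early stopping: no further gains
                    break
    return best                                # (eta_logits, num_iterations, accuracy)
```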
Potential biases from our choices of models and datasets represent a possible threat to our study. To mitigate it, we followed recent work [@tang2023domain] for guidance and selected two prominent datasets, i.e., the kNM-LM datasets and the CodeXGLUE benchmarks, and two widely used pre-trained models, CodeGPT and UniXcoder. We also plan to evaluate FT2Ra on large models such as CodeLlama and CodeGen in future work. Furthermore, we were unable to establish a concrete theoretical framework to determine the weighting strategy. Instead, we empirically evaluated four common strategies in RQ3 and selected the most effective one. We acknowledge the significance of the weighting strategy and intend to investigate it further in future research. Another potential threat is that the approximations inherent in retrieval-augmented methods may affect the precision of the results. This is particularly relevant when applying these methods to new models or datasets, where the impact of the approximations might be more pronounced. In line with [@tang2023domain], there is a threat in our use of ReACC on Java programs: the original authors only released their retrieval models for Python, leaving the Java version undisclosed. To circumvent this obstacle in our Java experiments, we utilized their Python version. In parallel, we incorporated the BM25 model, which has performance similar to ReACC. For transparency, we have made our entire codebase, datasets, and experimental results public, thereby enabling independent verification.
Code Completion. Code completion is regarded as a vital aspect of enhancing software development efficiency in contemporary Integrated Development Environments (IDEs). Hindle et al. [@hindle2016naturalness] were pioneers in employing N-gram techniques to implement code completion with language models. Subsequently, deep neural networks [@liu2016neural] and pre-training techniques [@feng2020codebert; @wang2021codet5; @liu2023contrabert; @liu2022commitbart] have made great progress. While some of these efforts encode code-specific structured information like the Abstract Syntax Tree (AST) into the inputs [@li2017code; @kim2021code], the prevailing trend in current research treats source code as sequences of code tokens, as exemplified by models like CodeGPT [@lu2021codexglue] and UniXcoder [@guo2022unixcoder]. The advent of large language models like ChatGPT [@ChatGPTblog], CodeGen [@nijkamp2022codegen], and StarCoder [@li2023starcoder] has introduced new opportunities and challenges for code completion. Large models entail a vast number of parameters, which significantly elevates the cost of fine-tuning. Therefore, research on retrieval-based enhancement is essential in this context.
Retrieval-augmented Language Models. Retrieval-augmented techniques [@shi2023replug; @jiang2023active; @he2021efficient; @liu2020retrieval; @liu2020atom] are primarily categorized into two types. The first is retrieval enhancement applied to the inputs, also referred to as pre-task retrieval. This category encompasses techniques such as ReACC [@lu2022reacc], REDCODER [@parvez2021retrieval], and DPR [@karpukhin2020dense] as discussed in previous works. These techniques first segment the data to be retrieved into fixed-length chunks, with each chunk typically containing several hundred tokens, and concatenate the most relevant retrieved information to the inputs for enhancement. Some works go beyond simply retrieving from the original training set; they refine new information from it. ASAP [@ahmed2024automatic], for code summarization, uses not just the conventional source code and summary as input but also incorporates static-analysis products such as the repository name, the fully qualified name of the target function, its signature, and its data flow graph. RLPG [@shrivastava2023repository] is proposed for single-line code auto-completion in an IDE; it not only retrieves similar content as supplemental input but also utilizes repository-level code context such as Post Lines, Identifiers, and Type Identifiers as additional input prompts. Joshi et al. [@joshi2023repair] propose a multi-lingual program repair method named RING, which retrieves relevant bug-fix examples from an example bank and uses the completed bug repairs and repair methods as supplemental input prompts. On the other hand, some works focus on retrieval enhancement for the outputs, such as kNN-LM [@Khandelwal2020Generalization; @xu2023nearest], kNM-LM [@tang2023domain], and RETRO [@borgeaud2022improving]. This retrieval paradigm first creates a retrieval database and uses information from it to modify the generated output. For example, RETRO [@borgeaud2022improving] integrates retrieval within the Transformer, while kNN-LM [@Khandelwal2020Generalization; @xu2023nearest] employs probability interpolation at the final probability layer. Compared with these works, we develop a novel retrieval-based approach, derived from a theoretical analysis, that mimics genuine fine-tuning for code completion.
In this paper, we introduce a novel retrieval augmentation method for code completion tasks. Guided by a theoretical analysis, we identified \(\Delta logits\) as a pivotal retrieval signal. Building on this insight, we designed FT2Ra, a method that closely simulates the fine-tuning process. Like fine-tuning, FT2Ra incorporates a learning rate and a multi-round iteration strategy, aiming to mirror the results of genuine fine-tuning. The experimental results demonstrate FT2Ra's superiority over state-of-the-art methods and its competitive performance relative to fine-tuning.
Our source code and experimental data are available at [@ft2rawebsite].
This work was partially supported by the National Key R&D Project (2021YFF1201102), the National Natural Science Foundation of China (Key Program, Grant No. 62332005), the National Key R&D Program of China (2021ZD0112903), the National Natural Science Foundation of China (Grant No. 61872262 and No. 62102283), the National Research Foundation, Singapore, and the Cyber Security Agency under its National Cybersecurity R&D Programme (NCRP25-P04-TAICeN). Lei Bu is supported in part by the Leading-edge Technology Program of Jiangsu Natural Science Foundation (No. BK20202001), the National Natural Science Foundation of China (No. 62232008, 62172200). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore and Cyber Security Agency of Singapore.
¹ \(\Delta logits\) represents the difference in logits before and after gradient descent, which can be expressed as \(\Delta logits = logits' - logits\).↩︎