Peer-aided Repairer: Empowering Large Language Models to Repair Advanced Student Assignments


Automated generation of feedback on programming assignments holds significant benefits for programming education, especially for advanced assignments. Automated Program Repair (APR) techniques, especially Large Language Model (LLM) based approaches, have gained notable recognition for their potential to fix introductory assignments. However, the programs used for evaluation are relatively simple, and it remains unclear how existing approaches perform in repairing programs from higher-level programming courses. To address these limitations, we curate a new advanced student assignment dataset named Defects4DS from a higher-level programming course. We then identify the challenges related to fixing bugs in advanced assignments. Based on this analysis, we develop an LLM-powered framework called PaR. PaR works in three phases: Peer Solution Selection, Multi-Source Prompt Generation, and Program Repair. Peer Solution Selection identifies closely related peer programs based on lexical, semantic, and syntactic criteria. Multi-Source Prompt Generation then combines multiple sources of information to create a comprehensive and informative prompt for the final Program Repair stage. The evaluation on Defects4DS and the well-investigated ITSP dataset reveals that PaR achieves a new state-of-the-art performance, with improvements of 19.94% and 15.2% in repair rate compared to prior state-of-the-art LLM-based and symbolic approaches, respectively.

1 Introduction↩︎

Programming education plays an important role in developing talent in the software engineering field and has received wide attention over the past decades [1], [2]. Providing feedback on students' programming assignments is an essential yet tedious task that requires substantial effort from teachers and teaching assistants. With the development of computer science and the digital society, the demand for programming education has increased sharply, spawning Massive Open Online Courses (MOOCs) that teach programming [1], [3]. Under such circumstances, providing customized and timely feedback becomes ever more pressing, and doing so manually becomes practically impossible given the massive number of students.

There has been increased interest in applying Automated Program Repair (APR) techniques to provide feedback on student programming assignments. APR tools can automatically generate patches to correct code errors by reasoning about code semantics based on a given specification [4], [5]. A variety of approaches have been proposed over the years, including search-based methods [6], [7], semantic-based tools [8]–[10], and Deep Learning (DL)-based techniques [11], [12]. Building on these APR techniques, various approaches have been introduced for correcting students' mistakes in their programming assignments [1], [2], [13], [14], aiming to facilitate the learning process [15]. The search-based and semantic-based methods [1], [2], [13], [16] demand extensive engineering effort, including program analysis or repair expertise and custom repair strategies tailored to the language domain in which students complete their assignments. Moreover, these methods have limited generality: they are specifically tailored to simple syntax components and exclude more intricate elements such as pointers, structures, and multi-dimensional arrays. DL-based approaches alleviate some engineering challenges by capturing various bug-fixing patterns from large code corpora. However, they generally rely on large quantities of training data [11], [17], [18], and are still limited to generating small patches due to the complexity of semantic reasoning and search space explosion.

Recent advancements in Large Language Models (LLMs) provide an alternative solution for bug repair that does not require experts with program analysis/repair experience. Most recently, researchers have started to directly leverage advanced LLMs for APR [19]–[22], as well as for repairing student programming assignments [19], [23]. These models have shown promising results in fixing bugs in student assignments, and even outperform previous state-of-the-art APR techniques. However, these approaches suffer from two limitations. The first is the use of relatively simple programs for evaluation. The datasets they use typically comprise assignments from introductory programming courses, which tend to be short and simple, averaging around 20 lines of code and lacking complex grammatical components. Consequently, it remains uncertain how these approaches would perform when applied to more advanced programming courses, whose assignments contain more intricate grammatical components and logic. Second, many of the approaches utilize a peer solution to assist the repair process [19], [23], but the selection primarily relies on test suite execution or compiler messages, overlooking implementation details and the underlying code structure. As a result, there may be discrepancies between the chosen peer solution and the student's faulty code, making it less effective for fixing the bugs present in that code. These limitations in both evaluation and example selection undermine the effectiveness and robustness of the approaches, hindering their ability to provide comprehensive and reliable solutions for complex programming issues.

To address the above issues, we first curate a new student assignment dataset named Defects4DS based on a higher-level programming course, including programs with increased complexity, longer lengths, and a variety of structures. Our goal is to obtain valuable insights into the performance and limitations of the evaluated APR approaches when applied to programs from advanced programming courses. These programs feature complex grammatical components and logic, making the bugs within them more challenging to locate and fix. Then, we outline the challenges involved in repairing bugs in advanced student assignments, drawing upon our analysis of the characteristics of Defects4DS and comparing them to an introductory programming assignment dataset. To address the challenges, we propose Peer-aided Repairer (PaR), which is a novel framework for advanced student assignment repair powered by a large language model. PaR works in three phases: Peer Solution Selection, Multi-Source Prompt Generation, and Program Repair. During the Peer Solution Selection stage, PaR identifies the closely related peer programs based on a mix of lexical, semantic, and syntactic criteria. Then during the Multi-Source Prompt Generation stage, PaR combines multiple sources of information to create a comprehensive and informative prompt \(\mathcal{P}\), including the peer solution obtained in the previous step, program description, IO-related information, buggy code, etc. Finally, during the Program Repair stage, the prompt generated by the previous stage is fed to the LLM, which then produces fixed code that is subsequently evaluated for correctness.

We evaluate PaR on the student assignment bug-fixing task using both our proposed Defects4DS dataset and ITSP [24], a widely used introductory student assignment benchmark. The evaluation results demonstrate that PaR achieves superior performance in repairing both advanced and introductory student assignments, obtaining 301 correct fixes on Defects4DS and 251 correct fixes on ITSP, a remarkable improvement of 19.94% and 15.2% in repair rate over the LLM-based model [25] and the powerful symbolic approach Verifix [2], respectively. Further analysis also confirms the effectiveness of each component of PaR. The contributions of this paper can be summarized as follows:

  • We propose Defects4DS, a new dataset that contains 682 submissions from 4 programming assignments of a higher-level programming course. Additionally, we meticulously labeled every buggy code in the Defects4DS dataset, offering comprehensive bug information across four dimensions. Moreover, we also labeled the buggy code in the introductory programming assignment dataset (ITSP), enabling an in-depth comparison between Defects4DS and ITSP.

  • We design a novel repair framework PaR, which empowers large language models to repair student assignments with an innovative peer solution selection strategy and a multi-source prompt generation method.

  • We conduct a comprehensive evaluation of PaR in comparison to both large language models and a symbolic APR tool across Defects4DS and ITSP datasets. The evaluation results demonstrate that PaR consistently outperforms all the baseline approaches by a large margin.

  • We open-source Defects4DS along with the corresponding bug information, and the replication package of PaR at

2 Related Work & Background↩︎

2.1 Large Language Models↩︎

Large language models (LLMs) are large-scale models pre-trained on massive textual corpora with self-supervised objectives [26], [27]. Most LLMs are built on the Transformer architecture [28] and can be categorized into three groups: encoder-only models [26], decoder-only models [29], and encoder-decoder models [30]. Later, researchers further incorporated reinforcement learning to align LLMs with human intent. One notable LLM, OpenAI's ChatGPT [31], applies reinforcement learning from human feedback (RLHF) to optimize the model, and has garnered significant attention for its remarkable ability to tackle a wide range of tasks. Meanwhile, various open-source LLMs have emerged in the AI community; notable contributions include GPT-NeoX [32], GPT-J [33], InCoder [34], CodeGen [27], LLaMA [35], StarCoder [25], Code Llama [36], etc. These models have also shown remarkable progress on various tasks. It is worth noting that LLMs are highly effective in few-shot learning scenarios: they can successfully perform tasks for which they were not explicitly trained when provided with only a few (or no) examples at inference time. This approach is commonly known as prompt-based learning [37]. A prompt generally refers to a predefined textual template given as an input instruction to the LLM; it typically consists of a query and may include a few examples (also known as shots) of the task, from which the LLM generates results for the corresponding task. Recent studies have explored the potential of ChatGPT and other open-source LLMs in software engineering tasks such as program repair [20], [38], code generation [39], and unit test generation [40].

2.2 Automated Program Repair↩︎

2.2.1 General Purpose Program Repair↩︎

Automated Program Repair (APR) techniques identify and fix software bugs or defects without human intervention, aiming to reduce the effort required to improve software quality. A variety of APR approaches have been proposed over the years, including search-based methods, semantic-based approaches, and Deep Learning-based techniques. Search-based approaches first produce patches with predefined code transformation operators and then search for correct patches within the predefined space [6], [7]; they can handle large programs but may struggle when confronted with expansive search spaces. Semantic-based tools usually correct code errors via symbolic execution [8]–[10]: they build a constraint that a program must satisfy to pass a given test suite, and then solve the constraint to generate patches. Deep Learning-based APR approaches predominantly rely on Neural Machine Translation (NMT) techniques [11], [12], [41], which treat program repair as a translation task whose goal is to transform defective code into correct code. However, the effectiveness of NMT-based APR tools heavily depends on the quality and abundance of training data. More recently, researchers have begun to directly harness the capabilities of LLMs for APR [20], [22], [38], surpassing the existing state-of-the-art methods across various program repair tasks. For instance, ChatRepair [38], an APR tool powered by ChatGPT, fixes 114 bugs on Defects4J 1.2, 15 more than the previous state-of-the-art APR tools. These tools first create an input prompt containing the original buggy code and the task description, and then query the LLM either to fill in the correct code at the bug location or to generate a new code snippet as the patch.

2.2.2 Programming Assignment Program Repair↩︎

Researchers have extended the application of APR to student programs in education, providing feedback to help students correct their programs [1], [13], [16]. AutoGrader [13] is an early feedback generation system that takes an erroneous student program, a reference solution, and a set of potential corrections provided by the instructor as inputs, and employs program synthesis techniques to find minimal corrections. CLARA [1] and sk_p [17] employed clustering and machine learning techniques to generate repairs for assignment data; however, these repairs frequently exhibit imprecision and lack minimality, diminishing the overall quality of the feedback. Verifix [2] aims to generate verifiably correct program repairs by aligning student assignments with a reference solution in terms of control flow, using failed verification attempts to obtain potential repairs through a MaxSMT solver. More recently, researchers have begun to employ LLMs to fix bugs in student assignments. [23] built an APR system based on Codex [42] for introductory Python programming assignments; empirical results on 286 Python programs demonstrated that their system outperforms other tools. [38] proposed a conversation-driven APR approach based on ChatGPT [31], which utilizes instant feedback to perform patch generation in a conversational style; evaluated on the Defects4J [43] and QuixBugs [44] datasets, it obtains superior results on both. However, there is still limited understanding of LLMs' effectiveness in fixing bugs in advanced student programming assignments.

2.2.3 Benchmarks for Programming Assignment Program Repair↩︎

Existing programming assignment-oriented approaches to APR focus on fixing bugs in C and Python. The widely used datasets for C are IntroClass [45] and ITSP [24]. IntroClass is collected from an introductory C programming class with an enrollment of about 200 students. It consists of 998 submitted programs with 572 unique defects out of 1143 defects in total (students sometimes submitted identical code, so identical defects may exist). Programs are relatively short, averaging around 20 lines of code. ITSP consists of 661 buggy–repaired program pairs submitted by students in an introductory C programming course across 74 assignments. For Python, the dataset used in [46] is collected from an introductory Python programming course taken by 361 students, and contains 2442 correct and 1783 buggy program attempts with an average of 14 lines of code. Another dataset, used in [1] and consisting of student attempts from a MITx MOOC, is not publicly available.

3 Dataset↩︎

To facilitate the evaluation of higher-level programming assignment repair models, we collect a new dataset called Defects4DS from the Data Structures course in a college. Subsequently, we gather comprehensive bug information from 4 dimensions to facilitate further analysis of the buggy code. Furthermore, we conduct a meticulous examination of the characteristics of the buggy code within our dataset and the ITSP [24] dataset, and compare their respective features. This analysis helps us outline the specific challenges involved in repairing the bugs in advanced student programming assignments. The subsequent content will present each of these processes in detail.

3.1 Data Collection↩︎

We collected the dataset from a Data Structures course in a college, which is a foundational professional course for computer science students. Compared to the introductory programming courses, the Data Structures course emphasizes problem-solving, algorithmic thinking, and optimization techniques, preparing students to work with large and complex code files. To finish the assignments, students will explore various data structures (such as linked lists, stacks, queues, trees, etc.). In this course, students are tasked with writing C programs that meet the provided problem specifications, including detailed descriptions, input and output formats, test examples, and occasionally illustrations. For each assignment, students have the flexibility to make an unlimited number of submissions until the specified deadline or until their submission successfully passes all the provided test cases.

We gathered students’ submissions from 4 assignments of different complexity levels. Each submission includes two code variations: a correct code that successfully passes all five test cases, and a buggy code that represents the student’s last incorrect attempt. Considering that most syntax errors can be detected by compilers and the inherent difficulty that students face in addressing semantic errors, we have exclusively retained buggy codes that contain semantic errors. Then we further narrow down our selection to entries where the discrepancy between the correct code and the buggy code is limited to a maximum of five lines. This ensures that the code can be repaired with a few edits, rather than necessitating a near-complete rewrite. Finally, the selection process has resulted in 682 submissions.
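The five-line edit budget used in this filtering step can be approximated with a line-level diff. The following is a minimal sketch; the function names and the use of `difflib` are our illustration, as the paper does not specify its diff tooling:

```python
import difflib

def changed_line_count(buggy: str, correct: str) -> int:
    """Count lines that differ between the buggy and correct versions,
    using a line-level diff (insertions, deletions, replacements)."""
    buggy_lines = buggy.splitlines()
    correct_lines = correct.splitlines()
    sm = difflib.SequenceMatcher(a=buggy_lines, b=correct_lines)
    changed = 0
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            # Count the larger side of the edit as the number of touched lines.
            changed += max(i2 - i1, j2 - j1)
    return changed

def within_edit_budget(buggy: str, correct: str, budget: int = 5) -> bool:
    """Keep the submission only if the fix touches at most `budget` lines."""
    return changed_line_count(buggy, correct) <= budget
```

A submission whose correct version rewrites most of the file would fail this check and be excluded from the dataset.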

3.2 Data Statistics & Comparison with ITSP↩︎

As shown in Table 1, the average and median number of lines of code in Defects4DS are 55 and 78, respectively. These values are much higher than those of the datasets used for introductory programming courses [24], which exhibit an average of 22 lines and a median of 20. This disparity implies a higher level of complexity and difficulty in repairing the bugs present in Defects4DS.

Besides, we further analyze the distribution of grammatical components, including complex grammatical components and custom functions. For complex grammatical components, we consider structs, pointers, and multi-dimensional arrays. These components are commonly utilized by students when completing programming assignments, as they play a crucial role in managing and organizing complex data structures. According to the analysis results shown in Table 1, these components occur in 263/682 (38.6%) of Defects4DS programs (the total is less than the sum over the three components because one program may contain multiple complex grammatical components), whereas they never occur in the ITSP programs used by Verifix. Additionally, custom functions occur in 291/682 (42.7%) of Defects4DS programs, versus only 20.5% (70/341) in ITSP, implying a gap in fix difficulty.

3.3 Data Labeling↩︎

To streamline further analysis, we gather comprehensive bug information along four dimensions for the buggy code in both the Defects4DS and ITSP datasets: Bug Line, Bug Type, Repair Type, and Bug Correlation. During the labeling process, we designate the buggy code as the annotation object, using the correct code as a reference. It is important to note that certain invalid changes are disregarded, such as deleting a defined but unused variable, renaming variables, and making equivalent modifications (e.g., i+=1 vs. i=i+1). The labeling and analysis were carried out by four of the authors, all of whom have over three years of experience in C programming. The details of each dimension are outlined below.


Table 1: Statistics of Defects4DS and ITSP. For ITSP, labs 3-6 are collected following Verifix [2], with statistics calculated from their GitHub repositories. CGC stands for complex grammatical components, CF stands for custom functions, and M-Array stands for multi-dimensional array.

| Dataset | ID | Task Description | #Programs | LoC Avg. | LoC Median | LoC Max | CGC: Struct | CGC: Pointer | CGC: M-Array | #CF |
|---|---|---|---|---|---|---|---|---|---|---|
| Defects4DS | Prob.1 | generation of full permutations | 123 | 40 | 35 | 85 | - | 21 | 1 | 119 |
| | Prob.2 | expression calculation | 108 | 68 | 63 | 245 | 1 | 13 | - | 34 |
| | Prob.3 | conversion between decimal form and scientific notation | 197 | 54 | 52 | 123 | - | 12 | - | 16 |
| | Prob.4 | continuous line segment | 254 | 58 | 54 | 150 | 193 | 92 | 29 | 122 |
| | Overall | - | 682 | 55 | 78 | 245 | 194 | 138 | 30 | 291 |
| ITSP | Lab-3 | simple expressions, printf, scanf | 63 | 13 | 13 | 22 | - | - | - | - |
| | Lab-4 | conditionals | 117 | 21 | 19 | 83 | - | - | - | - |
| | Lab-5 | loops, nested loops | 82 | 23 | 24 | 57 | - | - | - | 45 |
| | Lab-6 | integer arrays | 79 | 28 | 28 | 57 | - | - | - | 25 |
| | Overall | - | 341 | 22 | 20 | 83 | - | - | - | 70 |

Bug Line. This dimension captures the line number(s) where the bug is present. In cases where the bug occurs within a code block, we prioritize labeling by the entire block range. For bugs involving the repair type of statement addition, we denote the bug line as a range within which new statements can be included.

Bug Type. This dimension categorizes the type of bug encountered. Our dataset divides bugs into 7 main categories and 30 subcategories, as shown in Figure 1. The numbers in parentheses indicate the frequency of occurrence of each bug type in our dataset. Since a single bug sometimes corresponds to multiple bug types, the total number of bug type occurrences is slightly higher than the number of bugs. To establish the classification criteria, we conducted 4 rounds of labeling. In the first round, we randomly selected a sample of 250 data points for analysis and labeling, achieving a 95% confidence level with a 5% margin of error. The initial bug types were obtained by summarizing the labels from this sample, merging bug types with similar meanings, and refining categories. In the second round, the sample data was relabeled, and any additional bug types not covered by the initial categorization were added to form an updated version of the bug types. The third round involved relabeling the sample data and labeling the remaining data using this second version. Finally, a refinement round produced the final version of the bug types, and all the data was labeled accordingly. The detailed code book can be found in our replication package.
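As a sanity check (our derivation, not part of the paper's methodology), the 250-point sample is consistent with the standard sample-size formula at a 95% confidence level and a 5% margin of error, assuming maximum variance \(p = 0.5\) and applying the finite-population correction for \(N = 682\) programs:

```latex
n_0 = \frac{z^2\, p(1-p)}{e^2} = \frac{1.96^2 \times 0.25}{0.05^2} \approx 384.2,
\qquad
n = \frac{n_0}{1 + \frac{n_0 - 1}{N}} = \frac{384.2}{1 + \frac{383.2}{682}} \approx 246.
```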

Figure 1: Bug type distribution of Defects4DS. A single bug sometimes corresponds to multiple bug types.

Repair Type. This dimension categorizes the potential methods to fix a bug, which include:

(1) Statement Addition: The fix involves adding new statements to the code.

(2) Statement Deletion: The fix involves removing existing statements from the code.

(3) Statement Modification: The fix involves making changes to the content of existing statements.

(4) Position Modification: The fix involves changing the position or order of existing statements in the code.

These methods can also serve as indicators of the level of difficulty that APR models may encounter when attempting to fix a bug.

Bug Correlation. Bug correlation indicates whether a bug is related to other bugs in the same file. We consider the related bugs that involve using the same variable or having impacts on the value of the same variable or expression. By recognizing bug correlation, APR tools might optimize their bug-fixing process since the fix for one bug may provide clues on fixing other related bugs.

3.4 Data Analysis & Challenges↩︎

After the dataset is prepared, we analyze the characteristics of Defects4DS. Next, we compare these characteristics with those of ITSP and conclude by outlining the challenges involved in fixing bugs in advanced student assignments.

Distribution of the number of bugs. We calculate the number of bugs in each piece of buggy code, and the results are shown in Table 2. Overall, 58.4% of the codes in Defects4DS contain only one bug, about a quarter contain two bugs, and no more than a fifth contain three or more bugs. Defects4DS contains 1181 bugs in total, with an average of 1.73 bugs per program. Specifically, Prob. 2 contains an average of 2.1 bugs per code, which is the most among all the problems.


Table 2: Proportion of different numbers of bugs per code, and of related bugs, for Defects4DS and ITSP.

| Dataset | ID | 1 | 2 | 3 | 4 | 5 | >5 | Related Bugs |
|---|---|---|---|---|---|---|---|---|
| Defects4DS | Prob.1 | 72.4% | 17.1% | 5.7% | 2.4% | 1.6% | 0.9% | 53.0% |
| | Prob.2 | 42.6% | 32.4% | 5.6% | 14.8% | 4.6% | - | 66.7% |
| | Prob.3 | 66.5% | 23.9% | 5.1% | 2.5% | 1.0% | 1.0% | 56.1% |
| | Prob.4 | 52.0% | 26.0% | 11.4% | 5.5% | 2.8% | 2.4% | 57.4% |
| | Overall | 58.4% | 24.8% | 7.6% | 5.6% | 2.3% | 1.3% | 58.6% |
| ITSP | Lab-3 | 63.5% | 14.3% | 19.0% | - | 3.2% | - | 14.1% |
| | Lab-4 | 66.4% | 15.0% | 15.9% | 0.9% | 1.8% | - | 8.8% |
| | Lab-5 | 68.1% | 18.1% | 11.1% | 2.8% | - | - | 15.5% |
| | Lab-6 | 63.0% | 20.1% | 9.6% | 2.7% | 2.7% | 1.4% | 14.7% |
| | Overall | 65.4% | 16.8% | 14.0% | 1.6% | 1.9% | 0.3% | 12.7% |

Proportion of related bugs. We compute the proportion of programs containing related bugs as the number of programs containing related bugs divided by the number of programs containing two or more bugs, since bug correlation is undefined for programs with only one bug. The results indicate that all the problems in Defects4DS have a large proportion of related bugs, with an overall rate of 58.6%; Prob. 2 has the highest value at 66.7%.



Figure 2: Distribution of bug types and repair types for Prob. 1-4 (one panel per problem).

Distribution of the bug types and repair types. Since the problems differ in character, the corresponding distributions of bug types and repair types vary. We analyze these distributions further and present them in Figure 2. Prob. 1 requires generating the full permutations of a given number. 109 (58%) of the bugs concern outputs, with the majority (100 out of 109) being format errors. We can deduce that most students knew how to solve the problem but overlooked the specific output formatting requirements in the problem description. Prob. 2 requires calculating the result of an input expression. Bugs in this problem primarily involve variables and branches, with additional issues in loop usage. The majority of branch-related bugs consist of if-statement condition errors (31 out of 73) and missing branches (27 out of 73), and the variable-related bugs mostly involve incorrect assignments (65 out of 78). These patterns suggest that the main logic in most of the buggy code is incorrect. Notably, this problem has the highest proportion of related bugs, indicating that extensive modifications are necessary to fix the code. Prob. 3 involves converting between decimal form and scientific notation. Multiple conversion and output cases need to be handled due to the range of possible scenarios. Similar to Prob. 2, the majority of bugs are related to branches, with if-statement condition errors and missing branches being the top issues. Bugs related to outputs and loops are also significant, but unlike Prob. 1, most output bugs involve incorrect content rather than formatting errors. In terms of bug repairs, the proportion of statement additions is the highest among all the problems, suggesting that significant logic is missing from the buggy code and that students may have overlooked specific situations.
In contrast to the previous three problems, which mainly deal with one-dimensional data such as numbers or characters, Prob. 4 involves a plane coordinate system and requires computing the maximum number of line segments forming a continuous line. A majority of students used a struct type to store the horizontal and vertical coordinates of the input line segments to preserve their correspondence. Using structs involves multiple variable-related operations and requires attention to rules and details that students can easily overlook due to lack of proficiency. Consequently, the proportion of variable-related bugs is the highest in this problem. The complex problem description and implementation logic may pose challenges for repair.


Figure 3: Comparison of the distributions of (a) repair types and (b) bug types in the annotated results between Defects4DS and ITSP. To provide a more visually intuitive representation of the bug types, we use the same color scheme for different subcategories within the same main category.

Comparison with ITSP. For the distribution of the number of bugs, Table 2 shows that 65.4% of the programs in ITSP contain only one bug, and programs with more than three bugs constitute only 3.8%. Regarding related bugs, the overall proportion in ITSP is 12.7%, approximately one-fifth of that in Defects4DS. Figure 3 compares the distributions of repair types and bug types between Defects4DS and ITSP. From Figure 3 (a), the overall distribution of repair types in both datasets follows a similar pattern; however, Defects4DS exhibits a higher proportion of statement additions and deletions, suggesting that a greater extent of code change is required to repair the buggy code than in ITSP. Figure 3 (b) shows the distribution of bug types. The proportion of variable-related bugs in Defects4DS far exceeds that in ITSP, almost double the rate. Variables are among the most fundamental elements of code, and their usage permeates the entire program; a small alteration in one place can lead to significant changes in the final outcome. Besides, output-related bugs constitute the largest proportion in ITSP, indicating that the majority of the computational logic preceding them is correct. Furthermore, Defects4DS includes several bug types rarely or never found in ITSP, such as function return value errors, loop control flow errors, missing loops, and if-else match errors.

Based on the analysis above, coupled with the prevalence of complex grammatical components shown in Table 1, we believe Defects4DS contributes to the diversity and balance of bug types and to a comprehensive evaluation of APR methods.

Challenges. Based on the above analysis, we outline the challenges faced by repairing the advanced student assignments as follows:

(1) Hard to locate errors: The assignments often have larger code sizes and more complex grammatical components, making them difficult to comprehend. Additionally, the presence of one or more semantic errors further complicates bug localization.

(2) Hard to repair: To fix the bugs, it is necessary to add or modify statements based on syntactic and semantic contextual information. Additionally, when multiple bugs are present, it is important to consider the correlation and mutual influence between them during the repair.

(3) Multi-source information exists: Multiple sources of information are available during the repair, such as assignment descriptions, input/output format, example IOs, buggy code, passing conditions, peer solutions, etc. While having a wealth of information provides more opportunities for repair, it can also create confusion. Extracting the most effective information from these sources is still a challenge.

Finding: It is difficult to identify and repair bugs in advanced student assignments due to the complex programs and the multiple related semantic bugs. The process is further complicated by diverse information sources, necessitating efficient information extraction methods. Thus, the proposal of Defects4DS contributes to the diversity and balance of bug types and comprehensive evaluation of APR methods.

4 Approach↩︎

4.1 Overview↩︎

To address the above challenges, we propose a novel repair framework PaR for higher-level student assignment bug fixing. Figure 4 shows the architecture of our approach. To empower the LLM to fix the bugs in higher-level student programming assignments, we divide the procedure into three stages: Peer Solution Selection, Multi-Source Prompt Generation, and Program Repair. During the peer solution selection stage, our model takes the buggy code \(C_{buggy}\) as input and performs a comprehensive search to identify the most closely related solution \(C_p\) from the peer submissions \(P = \{(p_1^{buggy},p_1^{fix}), (p_2^{buggy},p_2^{fix}), ..., (p_n^{buggy},p_n^{fix})\}\), where \(p_i^{fix}\) denotes the \(i\)-th peer solution, and \(p_i^{buggy}\) denotes its corresponding buggy code. Following that, in the prompt generation stage, our model combines multiple sources of information to craft a prompt \(\mathcal{P}\) for the LLM, including the reference peer solution obtained in the previous step, program description, IO-related information, buggy code, etc. Finally, in the program repair stage, the prompt generated by the previous stage is fed to the LLM, which then produces code \(C_{fix}\) that is subsequently evaluated for correctness.
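The Multi-Source Prompt Generation stage can be pictured as template assembly over the sources listed above. The field order and wording below are illustrative assumptions on our part, not the actual PaR template:

```python
def build_prompt(description: str, io_format: str, example_io: str,
                 peer_solution: str, buggy_code: str) -> str:
    """Assemble a multi-source repair prompt (illustrative wording and ordering)."""
    return "\n\n".join([
        "Problem description:\n" + description,
        "Input/Output format:\n" + io_format,
        "Example I/O:\n" + example_io,
        "A correct peer solution for reference:\n" + peer_solution,
        "Buggy code to fix:\n" + buggy_code,
        "Fix the buggy code with minimal edits and return the complete corrected program.",
    ])
```

The resulting string is what gets sent to the LLM in the Program Repair stage, with the peer solution presented before the buggy code so the model can use it as a reference.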

Figure 4: The architecture of PaR.

4.2 Peer Solution Selection Strategy↩︎

A major feature of student programming assignment datasets that differs from other datasets is the availability of a substantial number of peer solutions that serve as valuable references. An effective reference code may improve the probability of correctly repairing the code, while a poor reference code may impede the model’s comprehension and generation, leading to counterproductive outcomes. We argue that a valid reference code should closely resemble the buggy code in terms of implementation details (lexical and syntactic). Besides, its own corresponding buggy code should exhibit an error pattern (semantic) similar to that of the buggy code to be fixed. Therefore, we evaluate the similarity between a peer solution and the buggy code based on the following four aspects:

\(\mathrm{\small match}_{tc}\): Test cases pass match score (buggy-buggy). If two codes pass/fail the test cases in the same way, they probably have similar errors, and thus their repair methods may also be similar. This match score also reflects their semantic similarity from the perspective of execution status. The score is calculated by dividing twice the number of test cases that both codes pass by the sum of the numbers of test cases that each code passes individually.

\(\mathrm{\small match}_{df}\): Data-flow match score (buggy-correct). This score is further used to measure the semantic match of the buggy code and the correct peer code. The data flow score is calculated following [47], which measures the proportion of the matched data flows.

\(\mathrm{\small match}_{ast}\): AST match score (buggy-correct). The abstract syntax tree (AST) is widely used to represent the syntactic structure of code. We obtain this match score by comparing the sub-trees of both codes to measure their similarity in terms of the syntactic structure.

\(\mathrm{\small BM25}_{anon}\): Anonymous BM25 score (buggy-correct). BM25 [48] is a metric that assesses the lexical relevance of documents. We first calculate the BM25 score after variable anonymization by replacing variable names with generic names (v1, v2), alleviating the impact of different naming conventions, and then normalize the scores of each \(C_{buggy}\) to be between 0 and 1.
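Three of the four scores above can be computed directly from the programs and their execution results. The following is a minimal sketch (the helper names are ours; a regex tokenizer with an abbreviated C keyword list and a plain BM25 stand in for the exact implementations of [47], [48], and Python's `ast` module stands in for a C parser purely for illustration):

```python
import ast
import math
import re
from collections import Counter

def match_tc(pass_a, pass_b):
    """Test-case pass match score: Dice overlap of the sets of passed tests."""
    total = len(pass_a) + len(pass_b)
    return 2 * len(pass_a & pass_b) / total if total else 0.0

def _subtree_sigs(tree):
    """Multiset of one-level subtree signatures (node type + child types)."""
    sigs = Counter()
    for node in ast.walk(tree):
        sigs[(type(node).__name__,
              tuple(type(c).__name__ for c in ast.iter_child_nodes(node)))] += 1
    return sigs

def match_ast(code_a, code_b):
    """AST match score: Dice overlap of subtree signatures of both codes."""
    a, b = _subtree_sigs(ast.parse(code_a)), _subtree_sigs(ast.parse(code_b))
    total = sum(a.values()) + sum(b.values())
    return 2 * sum((a & b).values()) / total if total else 0.0

# Abbreviated keyword list, for illustration only.
C_KEYWORDS = {"int", "char", "float", "double", "if", "else", "for", "while",
              "return", "void", "struct", "break", "continue", "switch", "case"}

def anonymize(code):
    """Replace identifiers (non-keyword word tokens) with generic names v1, v2, ..."""
    names, out = {}, []
    for tok in re.findall(r"[A-Za-z_]\w*|\S", code):
        if re.fullmatch(r"[A-Za-z_]\w*", tok) and tok not in C_KEYWORDS:
            tok = names.setdefault(tok, f"v{len(names) + 1}")
        out.append(tok)
    return out

def bm25_anon(query_toks, docs_toks, k1=1.2, b=0.75):
    """Plain BM25 over anonymized token lists, normalized to [0, 1] per query."""
    N = len(docs_toks)
    avgdl = sum(len(d) for d in docs_toks) / N
    df = Counter(t for d in docs_toks for t in set(d))
    scores = []
    for d in docs_toks:
        s = 0.0
        for t in set(query_toks):
            if t in df:
                idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
                tf = d.count(t)
                s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    top = max(scores) or 1.0
    return [s / top for s in scores]
```

For example, `anonymize("int count = 0;")` maps `count` to `v1` while leaving the keyword `int` intact, so two solutions that differ only in naming conventions receive a high lexical score.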

Drawing upon the aforementioned aspects, we introduce a novel metric named the Peer Solution Match (\(\mathbf{PSM}\)) score. This score is computed as a weighted sum of the above individual scores, enabling us to identify the most related peer solution; higher scores indicate stronger relevance.

\[\begin{align} \mathbf{PSM} &= \alpha \cdot \mathrm{\small match}_{tc} + \beta \cdot \mathrm{\small match}_{df} + \gamma \cdot \mathrm{\small match}_{ast} + \delta \cdot \mathrm{\small BM25}_{anon} \\ C_p &= \mathop{\mathrm{argmax}}_{p_i^{fix}} \mathbf{PSM}(C_{buggy}, (p_i^{buggy}, p_i^{fix})), \quad i=1,2,...,n \label{con:score} \end{align}\tag{1}\]

where \(\alpha, \beta, \gamma, \delta\) are the weights for each score term, \(n\) is the number of the peers, and \(C_p\) is the final selected peer solution, which has the highest \(\mathbf{PSM}\) score with the buggy code \(C_{buggy}\).
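Given precomputed score tuples for each peer, selection reduces to a weighted sum and an argmax. A sketch (the dictionary layout is our assumption):

```python
def psm(scores, weights=(0.25, 0.25, 0.25, 0.25)):
    """PSM score (Eq. 1): weighted sum over the score tuple
    (match_tc, match_df, match_ast, BM25_anon); default alpha..delta = 1/4."""
    return sum(w * s for w, s in zip(weights, scores))

def select_peer(candidates):
    """Argmax over peers: `candidates` maps each peer's fixed code p_fix to its
    precomputed score tuple against C_buggy; returns the selected C_p."""
    return max(candidates, key=lambda fix: psm(candidates[fix]))
```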

4.3 Multi-Source Prompt Generation↩︎

As shown in Figure 4, the prompt consists of three parts. First is the general introduction, which describes the task and the detailed information introduced in the following parts. Then, more detailed content is provided, including the buggy code to be fixed and supplementary bug details to aid in the repair process. Finally, we repeat the task description concisely to prevent the model from forgetting after reading a long paragraph of text.1 Specifically, the following information is included:

Task Description \((\mathit{TD})\): This is the natural language used in the introduction that describes the task and the detailed information introduced in the following parts.

Problem Description \((\mathit{DESC})\): This is a detailed explanation, describing the task and the expected results.

Input/Output Format \((\mathit{IO})\): This provides a standardized description of the program input and output, aiming to prevent students from making errors due to format, such as spaces or newlines.

Example Input/Output \((\mathit{EXP})\): This is to help students understand the meaning of the problem through one or more specific examples, sometimes accompanied by natural language explanations, which will not appear in the test cases.

Reference Code \((\mathit{C}_p)\): This comes from the most closely related peer solution retrieved using our peer solution selection strategy. It is added here as a reference given that students may ask other classmates for help when encountering difficulties.

Buggy Code \((\mathit{C_{buggy}})\): This is the buggy code to be fixed.

Final Task Description \((\mathit{TD_{final}})\): This is the repeated task description used at the end of the prompt.

We represent the prompt used in PaR as \(\mathcal{P}_{\mathrm{\small PaR}{}} = \{\mathit{TD}, \mathit{DESC}, \mathit{IO}, \mathit{EXP}, \mathit{C}_p, \mathit{C}_{buggy}, \mathit{TD}_{final}\}\).
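Assembling \(\mathcal{P}_{\mathrm{\small PaR}{}}\) is a straightforward concatenation of the components in order. A minimal sketch (the section labels and function name are our assumptions, not the paper's exact prompt wording):

```python
def build_par_prompt(td, desc, io, exp, peer_code, buggy_code, td_final):
    """Assemble P_PaR = {TD, DESC, IO, EXP, C_p, C_buggy, TD_final} in order."""
    return "\n\n".join([
        td,                                                # general task introduction
        "Problem description:\n" + desc,                   # DESC
        "Input/Output format:\n" + io,                     # IO
        "Example Input/Output:\n" + exp,                   # EXP
        "Reference code (peer solution):\n" + peer_code,   # C_p
        "Buggy code:\n" + buggy_code,                      # C_buggy
        td_final,                                          # concise task restatement
    ])
```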

4.4 Program Repair↩︎

In this stage, we use the generated prompt \(\mathcal{P}_{\mathrm{\small PaR}{}}\) as the input of the LLM. Our framework is model-agnostic, allowing for seamless integration with a wide range of LLMs. To assess the generalizability of our framework, we applied PaR to ChatGPT [31], StarCoder (15.5B) [25], and Code Llama (34B) [36]. Since the output of the LLM \(O_{LLM}\) is sometimes a mixture of natural language and code, we use a filter \(\mathcal{F}\) to retain the code \(C_{fix}\) only, and then send the code to an evaluation system \(\mathcal{E}\) to verify whether it passes all the test cases. The process can be summarized as follows: \[\begin{align} C_{fix} &= \mathcal{F}(O_{LLM}(\mathcal{P}_{\mathrm{\small PaR}{}})) \\ \mathcal{E}(C_{fix}) &= \begin{cases} \text{True}, & \text{if }C_{fix}\text{ passes all five test cases} \\ \text{False}, & \text{otherwise} \end{cases} \end{align}\]
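One common realization of the filter \(\mathcal{F}\) is to extract fenced code blocks from the reply. The sketch below is our assumption of such an implementation, not necessarily the paper's exact filter:

```python
import re

def filter_code(llm_output):
    """Filter F: keep only the code part of a mixed natural-language/code reply.

    If the reply contains fenced blocks, return their concatenated contents;
    otherwise assume the whole reply is code and return it unchanged.
    """
    blocks = re.findall(r"```(?:\w+)?\n(.*?)```", llm_output, flags=re.DOTALL)
    return "\n".join(b.strip() for b in blocks) if blocks else llm_output.strip()
```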

5 Evaluation↩︎

We address the following research questions in our evaluation:

RQ1: Performance Comparison. How does the performance of PaR in repairing programming assignments compare to that of state-of-the-art techniques?

RQ2: Impact of Prompt Design Strategy. What are the contributions of various prompt components?

RQ3: Performance in Repairing Various Types of Bugs. How effective is PaR at resolving different types of bugs under different prompt settings and backbone models?

We first demonstrate the performance of our model by comparing it against the state-of-the-art APR approaches on both Defects4DS and an introductory programming assignment dataset (ITSP). Subsequently, we investigate the influence of various prompt components. Finally, we conduct a detailed analysis of the performance of both our model and baseline models across different buggy scenarios, exploring possible reasons for the observed results.

5.1 Compared Techniques↩︎

Our baselines can be mainly categorized into two types: LLM-based approaches and symbolic approaches. According to prior work [20], [23], it is common to use a prompt that contains the problem description, the buggy code, and the example IO (if available) as the input of the LLM to repair the bugs, and the corresponding evaluation results have shown great performance on introductory programming assignment repair datasets. Therefore, we also employ such an approach to evaluate the effectiveness of advanced LLMs on our dataset. Specifically, we aim to select LLMs recognized for their capability in program repair and applicability to Defects4DS, considering their maximum input length limits and model capacity. Following small-scale preliminary testing, we finally chose InCoder [34], CodeGen [27], StarCoder [25], Codex [42], ChatGPT [31], and Code Llama [36] as our baselines. InCoder [34] is a large generative code model that can infill arbitrary regions of code. CodeGen [27] is a decoder-only model for multi-turn program synthesis. StarCoder [25] is trained on an English corpus and more than 80 programming languages using multi-query attention, and offers two usage modes: generation (StarCoderPlus\(_{Generate}\)) and fill-in-the-middle (StarCoderPlus\(_{Infill}\)). Codex [42] is designed to understand and generate code; it is built upon the GPT-3 architecture and has been fine-tuned specifically for programming tasks. ChatGPT [31] is a widely used language model developed by OpenAI, capable of both understanding and generation in conversational interactions. Furthermore, we also include the recently released Code Llama model [36]. Code Llama, proposed by Meta AI, is a family of large language models for code based on Llama 2 [49]. It exhibits zero-shot instruction-following capability for various programming tasks and outperforms other open-source models. We compare against these LLM-based baselines on the Defects4DS dataset.

For symbolic approaches, we compare against Verifix [2], which provides verified repairs for students undertaking introductory programming assignments. Despite demonstrating impressive repair capabilities on the widely utilized introductory assignment dataset (ITSP [24]), Verifix falls short on our Defects4DS dataset, where its pass rate is zero. To identify the cause of the failure, we conduct a thorough analysis and present the reasons as follows. Verifix uses an off-the-shelf SMT solver, specifically the Z3 solver [50], to determine the minimal repair. However, the theory solvers in Z3 are limited to linear arithmetic, fixed-sized bit-vectors, arrays, and tuples. Consequently, they are unable to handle more complex grammatical components such as structs, pointers, and multidimensional arrays. As discussed in Section 3.2, these components are present in 263 out of 682 (38.6%) Defects4DS programs, while they are absent from the ITSP dataset used by Verifix. Furthermore, there is a higher occurrence of custom functions in Defects4DS compared to ITSP. This discrepancy suggests a significant gap in the difficulty of repairs. To ensure a valid comparison, we compare the performance of PaR and Verifix on the ITSP dataset.

5.2 Implementation↩︎

Regarding the specific configuration and implementation of our PaR framework, we set all the coefficients in Formula (1) to 1/4 (\(\alpha, \beta, \gamma, \delta = 1/4\)). \(\mathrm{\small match}_{tc}\) is calculated based on the execution results of our data, while \(\mathrm{\small match}_{df}\), \(\mathrm{\small match}_{ast}\), and \(\mathrm{\small BM25}_{anon}\) are calculated following the work of [47], [48], [51]. We use the prompt \(\mathcal{P}_{\mathrm{\small PaR}{}}\) described in Section 4.3 and implement the repair pipeline of PaR in Python. As mentioned in Section 4.4, we choose three LLMs as the backbone models for PaR, i.e., ChatGPT [31], StarCoder [25], and Code Llama [36]. For ChatGPT, we utilize the ChatGPT API gpt-3.5-turbo-0613 model2 endpoint. For StarCoder, we utilize the StarCoderPlus\(_{infill}\) version, which has 15.5B parameters. Lastly, for Code Llama, we employ the instruct version with 34B parameters. The average length of \(\mathcal{P}_{\mathrm{\small PaR}{}}\) is 1270 tokens and the maximum length is 3027 tokens, within the maximum token limits of all three models. To obtain a varied collection of potential patches, we use a sampling temperature of 0.8. All the code is compiled and run under the C99 standard.
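The sampling setup, five independent attempts per program at temperature 0.8, each in a fresh context, can be sketched as request construction (the payload shape follows the OpenAI chat-completion convention; the helper name is ours):

```python
def repair_requests(prompt, n_samples=5, temperature=0.8,
                    model="gpt-3.5-turbo-0613"):
    """Build one chat-completion request per attempt; each request carries only
    the prompt, so every attempt starts in a new conversation context."""
    return [
        {"model": model,
         "temperature": temperature,
         "messages": [{"role": "user", "content": prompt}]}
        for _ in range(n_samples)
    ]
```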

As for LLM baselines, following previous work [19], [20], [38], we generate a basic prompt \(\mathcal{P}_{\text{basic}} = \{\mathit{TD}, \mathit{DESC}, \mathit{IO}, \mathit{EXP}, \mathit{C}_{buggy}, \mathit{TD}_{final}\}\), which includes the information provided in the programming assignments. For Codex, Code Llama, and ChatGPT, which are good at few-shot learning or conversational interactions, we obtain the fixed code by directly feeding the prompt into the model, without providing the bug location. However, for both the infilling and generative baseline models, the bug location is required to perform repairs. Specifically, for the infilling models (InCoder and StarCoderPlus\(_{Infill}\)), we replace the buggy line with a special token and let the model complete it. For the generative models (CodeGen and StarCoderPlus\(_{Generate}\)), we retain the code before the appearance of the first bug and the rest is to be generated by the model. The input has an average length of around 900 tokens and the max length is around 2400 tokens. Prompts exceeding the max length limit of baseline models will be truncated.

5.3 Metrics↩︎

We use the following two metrics for evaluation:

Number of Fixed Programs (Pass Rate): As the behavior of LLMs is inherently stochastic, for each buggy program we issue the bug repair request five times, each time within a new conversation context. We then classify the result based on a majority vote following [52]: if at least three out of the five replies correctly repair the code, we label it as “pass”. We calculate the pass rate by dividing the number of passed submissions by the total number of submissions, both for each specific problem and for the entire dataset.

Number of Partially Repaired Programs (Partial Repair Rate): Following [24], a repair candidate \(P'\) is considered a partial repair if (1) all previously passing tests still pass with \(P'\), and (2) at least one of the previously failing tests passes with \(P'\). If at least three out of the five replies meet the above requirements, we consider the buggy code to have been partially repaired. It is worth mentioning that the fully fixed programs are counted among the partially repaired programs.
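Both metrics can be sketched as follows (the set-based representation of execution results is our assumption):

```python
def majority_vote(attempt_results, threshold=3):
    """Majority vote over five independent repair attempts [52]: the program
    counts as repaired when at least 3 of the 5 replies meet the criterion."""
    return sum(attempt_results) >= threshold

def pass_rate(per_program_attempts):
    """Pass rate: programs labelled "pass" divided by total programs."""
    passed = sum(majority_vote(r) for r in per_program_attempts)
    return passed / len(per_program_attempts)

def is_partial_repair(before_pass, after_pass):
    """Partial repair per [24]: no previously passing test regresses, and at
    least one previously failing test now passes. Inputs are the sets of
    passed test IDs before/after the patch (a full repair also qualifies)."""
    return before_pass <= after_pass and len(after_pass - before_pass) > 0
```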

6 Results↩︎

6.1 RQ1: Performance Comparison↩︎

6.1.1 Results on Defects4DS↩︎

We first compare the bug-fixing ability of PaR and the LLM baselines on our advanced student assignment dataset Defects4DS. By default, PaR utilizes the prompt setting of \(\mathcal{P}_{\mathrm{\small PaR}{}}\). The results can be found in the rows corresponding to PaR in Table 3. Furthermore, considering that most of the baselines require and include the bug location information, we conduct further experiments that introduce the Bug Location (BL) information into our prompt. Specifically, we add the line number(s) where the bug occurs into the prompt, as presented in the Bug Line part in the upper-left of Figure 5. The corresponding results are presented in the rows belonging to PaR w/BL in Table 3. As seen from the results, all the variations of PaR and PaR w/BL achieve excellent performance, surpassing the other baseline models by a significant margin. When utilizing our default prompt setting, PaR-ChatGPT achieves the best performance with a pass rate of 37.97%. PaR-CodeLlama also demonstrates superior performance compared to the other baselines. After introducing the Bug Location information, we observe different results. With the help of BL information, the performance of PaR-CodeLlama is significantly improved, with the pass rate increasing from 32.99% to 44.13%, making it the highest among all the models. Conversely, the performance of PaR-ChatGPT becomes worse, with the pass rate decreasing from 37.97% to 35.92%. PaR-StarCoderPlus also achieves considerable performance compared to the baseline results, particularly when compared to the powerful ChatGPT and its own baseline counterpart. Moreover, it is worth mentioning that these variations excel in different problem areas: PaR-CodeLlama performs best on Prob. 2 and 3, while PaR-ChatGPT excels on Prob. 1 and 4. The output format in Problem 1 and the variable usage in Problem 4 involve relatively detailed issues. Code Llama may perform less effectively in handling such details, but it demonstrates greater competitiveness in correcting the overall logic of the code. This will be further discussed in subsequent sections.

Among all the baselines, the best performance is exhibited by ChatGPT, achieving a pass rate of 24.19%, with StarCoderPlus\(_{infill}\) and Code Llama falling behind. The remaining models all exhibit relatively poor performance, with pass rates below 5%. When analyzing the results in terms of generate mode, it becomes apparent that the generative approaches exhibit the poorest performance. This could be attributed to their inability to obtain complete information from the original flawed code or to make modifications before the first bug. Consequently, this mode is significantly limited in the information it can access, unlike the other two modes. Furthermore, the infill models are also subject to limitations in some scenarios. They can only generate code at marked positions, which prevents them from addressing issues that require moving statements or adding code elsewhere. On the contrary, models utilizing the prompt mode do not suffer from these limitations, as they can access and comprehend all the information in the buggy code and repair it by generating complete code snippets. Regarding comparisons within each generate mode, models with a higher number of parameters consistently outperform the others. This is closely tied to the extensive knowledge they acquire during the pre-training phase.

Table 3: Comparison of different LLMs on Defects4DS in terms of the number of fixed programs. The numbers inside the parentheses represent the pass rate.

| | Generate Mode | Model | # Params | Prob.1 | Prob.2 | Prob.3 | Prob.4 | Overall |
|---|---|---|---|---|---|---|---|---|
| Baseline | Infill | InCoder | 1B | 0 | 4 | 7 | 21 | 32 (4.69%) |
| Baseline | Infill | StarCoderPlus\(_{Infill}\) | 15.5B | 0 | 17 | 25 | 64 | 106 (15.54%) |
| Baseline | Generative | CodeGen-multi | 2B | 7 | 4 | 2 | 0 | 13 (1.91%) |
| Baseline | Generative | StarCoderPlus\(_{Generate}\) | 15.5B | 0 | 6 | 3 | 20 | 29 (4.25%) |
| Baseline | Prompt | Codex (code-davinci-edit-001) | 12B | 0 | 1 | 3 | 24 | 28 (4.11%) |
| Baseline | Prompt | Code Llama | 34B | 19 | 8 | 6 | 30 | 63 (9.24%) |
| Baseline | Prompt | ChatGPT | unknown | 76 | 11 | 5 | 73 | 165 (24.19%) |
| PaR | Prompt | PaR-CodeLlama | 34B | 74 | 33 | 53 | 65 | 225 (32.99%) |
| PaR | Prompt | PaR-ChatGPT | unknown | 91 | 25 | 29 | 114 | 259 (37.97%) |
| PaR w/BL | Infill | PaR-StarCoderPlus\(_{Infill}\) | 15.5B | 68 | 16 | 27 | 70 | 181 (26.54%) |
| PaR w/BL | Prompt | PaR-CodeLlama | 34B | 92 | 41 | 82 | 86 | 301 (44.13%) |
| PaR w/BL | Prompt | PaR-ChatGPT | unknown | 90 | 17 | 36 | 102 | 245 (35.92%) |


Table 4: Pass rate of Verifix and PaR on the ITSP dataset (labs 3-6), which contains 28 assignments and 341 incorrect programs. The results of Verifix are sourced from [2]. The numbers inside the parentheses represent the number of fixed programs.

| Model | Lab-3 | Lab-4 | Lab-5 | Lab-6 | Overall |
|---|---|---|---|---|---|
| Verifix | 92.1% | 82.9% | 45.1% | 21.5% | 58.4% (199) |
| PaR-ChatGPT | 52.4% | 76.1% | 80.5% | 79.7% | 73.6% (251) |
| PaR-CodeLlama | 54.0% | 59.0% | 70.7% | 57.0% | 60.4% (206) |

6.1.2 Results on ITSP↩︎

Then we further compare PaR with Verifix [2] on the ITSP [24] dataset. The settings of PaR are the same as in Section 5.2, meaning that all the coefficients in Formula (1) are set to 1/4 and the prompt \(\mathcal{P}_{\mathrm{\small PaR}{}}\) is used. The results are shown in Table 4. Overall, PaR-ChatGPT achieves a pass rate of 73.6%, surpassing Verifix by 15.2%. Although the performance of PaR-CodeLlama is slightly inferior, it is still better than Verifix. Verifix aligns the student assignment with a reference solution in terms of control flow, and automatically summarizes differences in data variables using predicates to establish relationships between variable names. As the lab number increases, the difficulty of the tasks gradually escalates: code length grows, control flow becomes more intricate, and the number of variables increases. Consequently, the pass rate of Verifix gradually decreases. In contrast, the pass rate of PaR-ChatGPT remains stable at around 80%, except for Lab-3, where it drops to 52.4%. This divergence can be attributed to the conciseness of the Lab-3 task description, as PaR primarily relies on the information provided in the prompt for repair. If the information is too brief or lacks sufficient detail, its repair ability is likely to be hindered.

Answer to RQ1: PaR consistently outperforms both LLM and symbolic approaches on both Defects4DS and ITSP datasets, demonstrating its ability to empower large language models to effectively solve both advanced and introductory programming assignments. Additionally, PaR demonstrates strong generalization capabilities and performs well across different LLMs.

6.2 RQ2: Impact of Prompt Design Strategies↩︎

The design of prompts significantly influences the performance of the model, as a well-crafted prompt can unlock its full potential. Hence, in this section, our objective is to delve into the impact of various prompt components. Initially, we concentrate on assessing the effect of IO-related information. Additionally, considering that many of our compared baselines require knowledge of the bug-related information, like bug location, before performing a repair, it is important to include and discuss the impacts of this information. Given that we have pinpointed the bug-related information in the programs of Defects4DS, we explore whether incorporating explicit buggy information in the prompt aids the model in the repair process. This evaluation sheds light on the significance and necessity of identifying bug information ahead of time, providing valuable insights and guidance.

For the IO-related information, we mainly explore the impact of example input/output (\(\mathit{EXP}\)) and test cases. The results can be seen in the upper part of Table 5. Since the formats of \(\mathit{EXP}\) and test cases are similar, both comprising an input and its corresponding output, there is no obvious difference in the number of fixed programs. This suggests that the prompts have similar effects, and that the LLM can capture enough knowledge from \(\mathit{EXP}\) alone. Providing prompts with test cases tends to yield a slightly higher pass rate and partial repair rate, but it comes with some risks: revealing the test cases that assess program correctness to the LLM might lead it to generate code specifically tailored to those cases. Therefore, we choose to use \(\mathit{EXP}\) to prompt the LLM in PaR.

Table 5: Results of using different prompt design strategies with PaR-ChatGPT and PaR-CodeLlama. The Prob.1-Overall columns give the number of fixed programs.

Example IO and Test Cases (Based on: \(\mathit{TD}\) + \(\mathit{DESC}\) + \(\mathit{IO}\) + \(\mathit{C}_{buggy}\) + \(\mathit{TD}_{final}\))

| Model | Prompts | Prob.1 | Prob.2 | Prob.3 | Prob.4 | Overall | # Partially Repaired Programs |
|---|---|---|---|---|---|---|---|
| PaR-ChatGPT | \(\mathit{EXP}\) | 76 | 11 | 5 | 73 | 165 | 225 |
| PaR-ChatGPT | Test Cases | 80 | 14 | 18 | 65 | 177 | 231 |
| PaR-ChatGPT | \(\mathit{EXP}\) + Test Cases | 78 | 15 | 17 | 62 | 172 | 241 |
| PaR-CodeLlama | \(\mathit{EXP}\) | 19 | 8 | 6 | 30 | 63 | 90 |
| PaR-CodeLlama | Test Cases | 27 | 6 | 8 | 29 | 70 | 99 |
| PaR-CodeLlama | \(\mathit{EXP}\) + Test Cases | 30 | 4 | 8 | 30 | 72 | 102 |

Bug Information (Based on: \(\mathcal{P}_{\text{basic}}\) = \(\mathit{TD}\) + \(\mathit{DESC}\) + \(\mathit{IO}\) + \(\mathit{EXP}\) + \(\mathit{C}_{buggy}\) + \(\mathit{TD}_{final}\))

| Model | Prompts | Prob.1 | Prob.2 | Prob.3 | Prob.4 | Overall | # Partially Repaired Programs |
|---|---|---|---|---|---|---|---|
| PaR-ChatGPT | Bug Line | 84 | 12 | 17 | 75 | 188 | 243 |
| PaR-ChatGPT | (+) Bug Type | 82 | 17 | 17 | 73 | 189 | 248 |
| PaR-ChatGPT | (+) Repair Type | 80 | 14 | 16 | 75 | 185 | 256 |
| PaR-ChatGPT | (+) Bug Correlation | 82 | 14 | 19 | 73 | 188 | 257 |
| PaR-CodeLlama | Bug Line | 25 | 6 | 7 | 43 | 81 | 106 |
| PaR-CodeLlama | (+) Bug Type | 67 | 10 | 9 | 32 | 118 | 151 |
| PaR-CodeLlama | (+) Repair Type | 58 | 12 | 7 | 37 | 114 | 145 |
| PaR-CodeLlama | (+) Bug Correlation | 68 | 10 | 6 | 33 | 117 | 148 |

Figure 5: Example of the additional part of the prompt when considering the bug-related information. We use \(\mathit{l}_{i}, \mathit{bt}_{i}\) and \(\mathit{rt}_{i} \mathit{(i=1,2,3)}\) respectively to represent bug lines, bug types, and repair types. The text in Bug Line is utilized in PaR w/BL in RQ1, while the remaining information is only used in RQ2.

For the bug-related information, we mainly consider the following, as introduced in Section 3.3: Bug Line, Bug Type, Repair Type, and Bug Correlation. We add each of the above contents gradually into our basic prompt \(\mathcal{P}_{\text{basic}}\); the details can be seen in Figure 5. The results are shown in the lower half of Table 5. Given the conclusion we just reached, we add \(\mathit{EXP}\) to the common portion of the prompts. For PaR-ChatGPT, we observe that the number of fixed programs is steady across these four prompts and slightly improved compared to those without bug information. The reason might be that ChatGPT inherently possesses strong comprehension and generation abilities; providing it with a small amount of information related to bug localization, such as bug lines, allows it to uncover potential additional information on its own. The trend presented by PaR-CodeLlama, however, is different. After adding the bug line information, the pass rate gains a slight improvement, and there is a significant increase after adding bug type information. The pass rate and partial repair rate remain relatively stable with the addition of repair type and bug correlation information. By analyzing the performance on each problem, we notice that the increase is mainly attributed to Problem 1. We deduce that Code Llama itself lacks certain bug localization and judgment abilities; when provided with external hints, and when the bug types are easy to understand and correct, it can achieve a certain level of improvement. Compared to Table 3, the absolute number of fixed and partially repaired programs is still quite low, which means that the increase brought by bug-related information is limited. Considering the challenges involved in obtaining bug-related information in real-world scenarios and the marginal performance improvement observed, we are inclined to believe that the inclusion of bug-related information is not essential for PaR. Consequently, while it is preferable to incorporate bug information, PaR can still demonstrate effective performance even without it.

Answer to RQ2: When designing the prompt, providing example input/outputs can sufficiently help LLMs understand the expected program behavior in terms of IO-related information. While bug-related information may offer only marginal benefits, PaR can still demonstrate effective performance without it.

6.3 RQ3: Performance in Repairing Various Types of Bugs↩︎

To comprehensively evaluate the performance across various buggy scenarios, we select the four bug types that exhibit the highest occurrence frequencies in Defects4DS based on our preliminary analysis (i.e., “Variable”, “Branch”, “Loop”, and “Output”) and investigate the effectiveness of repair under different prompt settings and backbone LLMs. We construct Venn diagrams to visually represent the results, as depicted in Figures 6, 7, and 8. The numbers in the diagrams represent the quantity of successfully repaired bugs, and the number inside the parentheses of each caption indicates the total number of bugs of that type.

6.3.1 Different Prompt Settings↩︎


Figure 6: Results of employing ChatGPT as the backbone for repairing various types of bugs across various prompt settings. (Venn-diagram panels: Variable (344), Branch (298), Loop (226), Output (242).)


Figure 7: Results of employing Code Llama as the backbone for repairing various types of bugs across various prompt settings. (Venn-diagram panels: Variable (344), Branch (298), Loop (226), Output (242).)

For prompts, we mainly analyze the following settings: \(\mathcal{P}_{\text{basic}}\), \(\mathcal{P}_{\mathrm{\small PaR}{}}\), and \(\mathcal{P}_{\mathrm{\small PaR}{}} \cup \mathit{BL}\). \(\mathcal{P}_{\text{basic}}\) is used in the baselines as mentioned in Section 5.2 and includes the information provided in the programming assignments. \(\mathcal{P}_{\mathrm{\small PaR}{}}\) introduces the peer solution selected by our proposed strategy. \(\mathcal{P}_{\mathrm{\small PaR}{}} \cup \mathit{BL}\) further incorporates the bug location information as in RQ1. The results obtained using ChatGPT and Code Llama as the backbone are illustrated in Figures 6 and 7, respectively.

For ChatGPT-based models, as seen from Figure 6 (d), the majority of the fixed output bugs can be accomplished using any of the three prompt configurations. The rationale behind this is that output bugs are relatively easier to repair among these four bug types, because approximately half only involve formatting issues, such as missing spaces or incorrect line breaks, which do not touch deep-level code logic. However, the other three bug types, i.e., variable, branch, and loop bugs, involve more complex code logic and are thus more challenging to fix. Figures 6 (a), 6 (b), and 6 (c) show that the bugs that can be repaired under all three prompts are in the minority, while PaR-ChatGPT can repair more unique bugs under prompt \(\mathcal{P}_{\mathrm{\small PaR}{}}\). Regarding the results of PaR-ChatGPT w/BL, where the bug location information (\(\mathit{BL}\)) is introduced, PaR can repair some new bugs but may fail to address bugs that have previously been fixed. To sum up, the results suggest that for simple bugs, \(\mathcal{P}_{\text{basic}}\) is also effective. \(\mathcal{P}_{\mathrm{\small PaR}{}}\) is well-suited for resolving more complex issues, as it can fix a higher number of unique bugs compared to the other prompts in this scenario. However, the introduction of \(\mathit{BL}\) information does not always lead to beneficial outcomes.

For CodeLlama-based models, as seen from Figure 7, there is a notable disparity compared to the results of ChatGPT. We can observe that employing only \(\mathcal{P}_{\text{basic}}\), i.e., the results of CodeLlama, performs poorly across all the bug types, demonstrating that Code Llama is not good at locating and repairing bugs solely from the information provided in the programming assignments. Utilizing \(\mathcal{P}_{\mathrm{\small PaR}{}}\) as input leads to a significant performance improvement across all bug types. This indicates that \(\mathcal{P}_{\mathrm{\small PaR}{}}\) provides valuable insights to guide Code Llama in effectively addressing the bugs. Moreover, it is important to highlight that the utilization of \(\mathit{BL}\) information results in an additional performance improvement, contrasting with the behavior observed with ChatGPT. The results further confirm that Code Llama continues to face challenges in locating bugs; however, when the bug location is explicitly provided, it can carry out bug fixes more effectively.

6.3.2 Different Backbone Models↩︎

Subsequently, we compare the performance of using two backbone models in repairing various types of bugs. As depicted in Figure 8, the overlap between the PaR-ChatGPT and PaR-CodeLlama for resolving output bugs is greater than for other types of bugs, demonstrating most of the output bugs can be addressed by both of these two models. However, regarding other types of bugs, the overlap is much smaller, showcasing that there are distinct strengths and specializations in each model for handling different types of bugs. In particular, PaR-ChatGPT excels in resolving Loop and Variable bugs, whereas PaR-CodeLlama demonstrates superior performance in fixing Branch bugs. These findings suggest that leveraging different backbone models for diverse bug types in practice may be beneficial, which will be explored in our future research.


Figure 8: Bug-fix Venn diagrams on the Defects4DS dataset under different models. (Panels: Variable (344), Branch (298), Loop (226), Output (242).)

Answer to RQ3: Compared to the basic prompt, \(\mathrm{\small PaR}{}\) can repair more unique bugs and maintain its effectiveness across different types of bugs, especially when dealing with challenging bugs. Regarding different backbone models, each model has specific strengths and specializations for addressing various types of bugs. Additionally, providing bug location information is of greater importance for the Code Llama backbone.

7 Discussion↩︎

7.1 Impact of different peer solution selection strategies↩︎

The previous results have shown that an effective reference code \(\mathit{C}_p\) can provide valuable guidance, improving the repair performance substantially. Therefore, in this section we evaluate the effectiveness of different selection strategies, which can be divided into three types: lexical search, semantic search, and our proposed selection strategy with different parameter settings. We use the BM25 [48] algorithm for lexical search: we calculate the BM25 scores between \(\mathit{C}_{buggy}\) and the other buggy codes, and retrieve the code with the highest score as \(\mathit{C}_p\). For semantic search, we use two code semantic learning models. The first is UniXcoder [53], which, given a source code as the query, can retrieve codes with the same semantics from a collection of candidates in a zero-shot setting. NCC [54] is an LLVM IR-based code representation model, which learns code semantics through an embedding space and a graph representation of LLVM IR statements and their context.

For our selection strategy, in addition to the four aspects mentioned in Section 4.2, we also explore the impact of considering bug types when searching for a similar peer solution. This aspect is defined as follows:

\(\mathrm{\small match}_{bt}\): Bug type match score (buggy-buggy). The match between the bug types of the two buggy codes (one being the code to be fixed and the other being the buggy counterpart of the peer solution) provides insights into their semantic similarity. We calculate this score by dividing twice the number of overlapping bug types (where both the bug type and the repair type are the same) by the total number of bugs in the two buggy codes.
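The definition above can be sketched directly; representing each bug as a (bug type, repair type) pair is an assumption of this illustration, as is the function name:

```python
from collections import Counter

def match_bt(bugs_a, bugs_b):
    """Bug type match score between two buggy programs.

    bugs_a, bugs_b: lists of (bug_type, repair_type) pairs, one per bug.
    Score = 2 * |overlapping pairs| / (|bugs_a| + |bugs_b|),
    where a pair overlaps only if both fields match.
    """
    ca, cb = Counter(bugs_a), Counter(bugs_b)
    overlap = sum(min(ca[t], cb[t]) for t in ca)
    return 2 * overlap / (len(bugs_a) + len(bugs_b))
```

The score lies in [0, 1], reaching 1 only when the two programs exhibit identical multisets of bug/repair types.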

After incorporating \(\mathrm{\small match}_{bt}\), the Peer Solution Match (\(\mathbf{PSM}\)) score is computed as follows: \[\begin{align} \mathbf{PSM} = \alpha \cdot \mathrm{\small match}_{tc} + \beta \cdot \mathrm{\small match}_{df} + \gamma \cdot \mathrm{\small match}_{ast} + \delta \cdot \mathrm{\small BM25}_{anon} + \eta \cdot \mathrm{\small match}_{bt} \label{con:score95new} \end{align}\tag{2}\] where \(\eta\) is the weight for the Bug-type match score. In addition to the default parameter setting (\(\alpha, \beta, \gamma, \delta = 1/4\)), we experiment with a new parameter setting (\(\alpha, \beta, \gamma, \delta, \eta = 1/5\)). We also try another weight set, i.e., \(\alpha=1/9\), \(\beta=1/9\), \(\gamma=1/9\), \(\delta=1/3\), \(\eta=1/3\), in which we divide the five aspects into three categories and give each category equal weight: (1) lexical: \(\mathrm{\small BM25}_{anon}\); (2) syntactic: \(\mathrm{\small match}_{ast}\); (3) semantic: \(\mathrm{\small match}_{bt}\), \(\mathrm{\small match}_{tc}\), and \(\mathrm{\small match}_{df}\).
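Equation (2) is a straightforward weighted sum. A minimal sketch, assuming all component scores are normalized to [0, 1] and using the category-balanced weights as defaults (the function name is illustrative):

```python
def psm_score(match_tc, match_df, match_ast, bm25_anon, match_bt,
              alpha=1/9, beta=1/9, gamma=1/9, delta=1/3, eta=1/3):
    """Peer Solution Match score (Eq. 2): a weighted sum of the five
    component similarity scores. Defaults shown use the category-balanced
    weighting (lexical 1/3, syntactic 1/9 x 1, semantic 1/9 + 1/9 + 1/3)."""
    return (alpha * match_tc + beta * match_df + gamma * match_ast
            + delta * bm25_anon + eta * match_bt)
```

Because the weights sum to one, the resulting score also stays within [0, 1] under the stated normalization assumption.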

Results in Table 6 demonstrate that our selection strategy surpasses all other search methods. Specifically, our default setting achieves the best result in terms of Partial Repair Rate. When bug types are introduced during peer solution selection, one more program is correctly fixed. Besides, adjusting the weights of the individual aspects changes the results only slightly. This further suggests that our strategy, which embraces multiple evaluation metrics, enables a comprehensive assessment of code similarity from different angles. It is noteworthy that incorporating bug information does not substantially enhance the performance, indicating that this information may not be crucial. Among the baseline strategies, UniXcoder outperforms the rest, demonstrating the effectiveness and significance of incorporating semantic information in the peer solution selection process.

Table 6: Results of using different peer solution selection strategies when using ChatGPT as the backbone model. Other settings are the same as in PaR. "# Fixed Programs" is reported per problem and overall.

Peer Solution Selection Strategy                       Prob.1  Prob.2  Prob.3  Prob.4  Overall  # Partial Repaired Programs
Lexical Search
  \(\mathrm{\small BM25}\)                               87      22      30      85      224      300
Semantic Search
  UniXcoder code-to-code search [53]                     89      16      33      110     248      303
  NCC [54]                                               79      19      28      93      219      260
Our Selection Strategy
  \(\alpha=\beta=\gamma=\delta=1/4\)                     91      25      29      114     259      321
  \(\alpha=\beta=\gamma=\delta=\eta=1/5\)                89      14      44      113     260      307
  \(\alpha=\beta=\gamma=1/9\), \(\delta=\eta=1/3\)       95      14      37      111     257      297

7.2 Case Study/Qualitative Analysis↩︎

In this section, we further assess how well the generated feedback supported students in understanding and resolving issues on their own, and whether the selected similar peer solutions can provide useful information for resolving similar issues.

7.2.1 Effectiveness in assisting students in solving problems.↩︎

Figure 9 illustrates an example of a successful repair by PaR. The critical error is the failure to reassign the variable at the beginning of each loop iteration. The correct file submitted by the student contains three changes, but two of them do not affect the output (distinct from the equivalent modifications mentioned in Section 3.3). One is the change to the loop exit condition: in this nested loop, the maximum value of variable \(j\) is \(n-1\), which implies that the maximum value of variable \(i\) can only reach \(n-2\), so there is no difference between \(i<n\) and \(i<n-1\) here. The other replaces \(>=\) with \(>\) in the if-statement condition, which does not affect the result of the swap operation at this point. The modifications generated by PaR show that it clearly discovered this fact and made only the relevant change. This demonstrates that PaR is able to provide effective assistance to students.



Figure 9: Bug fix example. correct.c is the correct code submitted by the student, with the changed code highlighted. The modifications based on buggy.c constitute the fixed code generated by PaR.



Figure 10: Bug fix example with the guidance of our peer solution strategy.

7.2.2 Effectiveness of the selected similar peer solutions.↩︎

We provide a specific example to demonstrate the effectiveness of our selection strategy using the \(\mathbf{PSM}\) score. As shown in Figure 10, the original buggy code fails to increment the index of the assigned array element, while the selected peer solution correctly resolves this issue (the highlighted yellow lines); this peer solution has the highest \(\mathrm{\small match}_{tc}\) score (test case pass match) and a \(\mathrm{\small match}_{df}\) score (data-flow match) higher than most other programs. The +1 operation offers a hint that enables the successful repair of this buggy code. In contrast, peer solutions selected by other strategies, such as BM25, do not include this similar change, and the model fails to fix this bug. This example affirms the superiority of our strategy in selecting semantically and syntactically related peer solutions.

8 Threats to Validity↩︎

Internal. The internal threat to validity in this study is associated with the potential data leakage issue. This occurs when certain code snippets from students’ correct submissions are included in the training data of ChatGPT or Code Llama. However, as students are expected to complete the assignments independently and the submissions in Defects4DS are not uploaded to public repositories, we believe the risk of data leakage is minimal.

External. The external threat comes from our evaluation dataset Defects4DS. The results obtained on this dataset may not generalize to other programming assignment repair datasets. To alleviate this issue, we also evaluate PaR on ITSP, another widely used student assignment dataset, to confirm its generalizability. Even so, our current findings are limited to assignments that can be solved by stand-alone programs and may not be applicable to assignments that involve the development of software systems or collaboration among multiple programmers.

Construct. The construct validity relates to the suitability of our evaluation metrics. We assess the quality of the generated programs by analyzing the proportion of test cases that successfully pass. Utilizing the tests as a proxy for correctness is common in the educational domain [2], [23]. Our further analysis has confirmed that PaR repairs the buggy code rather than generating an entirely new solution, supporting the feasibility of using pass rate-based metrics. However, it is important to note that validating program correctness through tests may not be as robust as formal verification methods.

9 Conclusion & Future Work↩︎

In this paper, we propose PaR, an APR tool that leverages the LLM to repair semantic mistakes in advanced assignments through a novel peer solution selection strategy and a multi-source prompt generation method. We also build an advanced student assignment dataset named Defects4DS, aiming to facilitate the evaluation of higher-level program assignment repair models. The evaluation results on both the Defects4DS and ITSP datasets show that PaR achieves new state-of-the-art performance. In the future, we will expand our dataset by incorporating assignments that involve software system development, such as compiler systems, which demand a higher level of problem-solving and technical skills. Additionally, we will adapt our approach to provide effective feedback for tasks of varying difficulty levels and types. Ultimately, our goal is to enhance the quality of education in software development.


Sumit Gulwani, Ivan Radiček, and Florian Zuleger.2018. . ACM SIGPLAN Notices53, 4(2018), 465–480.
Umair Z. Ahmed, Zhiyu Fan, Jooyong Yi, Omar I. Al-Bataineh, and Abhik Roychoudhury.2021. . ACM Transactions on Software Engineering and Methodology (TOSEM)31(2021), 1 – 31.
Ken Masters.2011. .
Luca Gazzola, Daniela Micucci, and Leonardo Mariani.2018. . 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE)(2018), 1219–1219.
Claire Le Goues, Michael Pradel, and Abhik Roychoudhury.2019. . Commun. ACM62(2019), 56 – 65.
Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer.2011. . Ieee transactions on software engineering38, 1(2011), 54–72.
Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F Bissyandé.2019. . In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. 31–42.
Hoang Duong Thien Nguyen, Dawei Qi, Abhik Roychoudhury, and Satish Chandra.2013. . In 2013 35th International Conference on Software Engineering (ICSE). IEEE, 772–781.
Sergey Mechtaev, Jooyong Yi, and Abhik Roychoudhury.2016. . In Proceedings of the 38th international conference on software engineering. 691–701.
Xuan-Bach D Le, Duc-Hiep Chu, David Lo, Claire Le Goues, and Willem Visser.2017. . In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis. 376–379.
Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade.2017. . In Proceedings of the aaai conference on artificial intelligence, Vol. 31.
Nan Jiang, Thibaud Lutellier, and Lin Tan.2021. . In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 1161–1173.
Rishabh Singh, Sumit Gulwani, and Armando Solar-Lezama.2013. . In Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation. 15–26.
Reudismam Rolim, Gustavo Soares, Loris D’Antoni, Oleksandr Polozov, Sumit Gulwani, Rohit Gheyi, Ryo Suzuki, and Björn Hartmann.2017. . In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). IEEE, 404–415.
Yang Hu, Umair Z Ahmed, Sergey Mechtaev, Ben Leong, and Abhik Roychoudhury.2019. . In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 388–398.
Ke Wang, Rishabh Singh, and Zhendong Su.2018. . In Proceedings of the 39th ACM SIGPLAN conference on programming language design and implementation. 481–495.
Yewen Pu, Karthik Narasimhan, Armando Solar-Lezama, and Regina Barzilay.2016. . In Companion Proceedings of the 2016 ACM SIGPLAN International Conference on Systems, Programming, Languages and Applications: Software for Humanity. 39–40.
Michihiro Yasunaga and Percy Liang.2021. . In International Conference on Machine Learning. PMLR, 11941–11952.
Harshit Joshi, José Cambronero Sanchez, Sumit Gulwani, Vu Le, Gust Verbruggen, and Ivan Radiček.2023. . In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 5131–5140.
Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan.2023. . In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1469–1481.
Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan.2023. . arXiv preprint arXiv:2302.05020(2023).
Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang.2022. . arXiv preprint arXiv:2210.14179(2022).
Jialu Zhang, José Cambronero, Sumit Gulwani, Vu Le, Ruzica Piskac, Gustavo Soares, and Gust Verbruggen.2022. . arXiv preprint arXiv:2209.14876(2022).
Jooyong Yi, Umair Z Ahmed, Amey Karkare, Shin Hwei Tan, and Abhik Roychoudhury.2017. . In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. 740–751.
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al.2023. . arXiv preprint arXiv:2305.06161(2023).
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.2018. . arXiv preprint arXiv:1810.04805(2018).
Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong.2022. . arXiv preprint arXiv:2203.13474(2022).
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin.2017. . Advances in neural information processing systems30(2017).
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei.2020. . In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.).
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.2020. . J. Mach. Learn. Res.21(2020), 140:1–140:67.
OpenAI.2022. ChatGPT: Optimizing Language Models for Dialogue.
Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, et al.2022. . arXiv preprint arXiv:2204.06745(2022).
Ben Wang and Aran Komatsuzaki.2021. GPT-J-6B: A 6 billion parameter autoregressive language model.
Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis.2022. . arXiv preprint arXiv:2204.05999(2022).
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al.2023. . arXiv preprint arXiv:2302.13971(2023).
Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al.2023. . arXiv preprint arXiv:2308.12950(2023).
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig.2023. . Comput. Surveys55, 9(2023), 1–35.
Chunqiu Steven Xia and Lingming Zhang.2023. . arXiv preprint arXiv:2304.00385(2023).
Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li.2023. . arXiv preprint arXiv:2304.07590(2023).
Zhiqiang Yuan, Yiling Lou, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, and Xin Peng.2023. . arXiv preprint arXiv:2305.04207(2023).
Qihao Zhu, Zeyu Sun, Yuan-an Xiao, Wenjie Zhang, Kang Yuan, Yingfei Xiong, and Lu Zhang.2021. . In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 341–353.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al.2021. . arXiv preprint arXiv:2107.03374(2021).
René Just, Darioush Jalali, and Michael D Ernst.2014. . In Proceedings of the 2014 international symposium on software testing and analysis. 437–440.
Derrick Lin, James Koppel, Angela Chen, and Armando Solar-Lezama.2017. . In Proceedings Companion of the 2017 ACM SIGPLAN international conference on systems, programming, languages, and applications: software for humanity. 55–56.
Claire Le Goues, Neal Holtschulte, Edward K Smith, Yuriy Brun, Premkumar Devanbu, Stephanie Forrest, and Westley Weimer.2015. . IEEE Transactions on Software Engineering41, 12(2015), 1236–1256.
Yang Hu, Umair Z. Ahmed, Sergey Mechtaev, Ben Leong, and Abhik Roychoudhury.2019. . In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE/ACM, 388–398.
Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma.2020. . arXiv preprint arXiv:2009.10297(2020).
Andrew Trotman, Antti Puurula, and Blake Burgess.2014. . In Proceedings of the 19th Australasian Document Computing Symposium. 58–65.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al.2023. . arXiv preprint arXiv:2307.09288(2023).
Leonardo De Moura and Nikolaj Bjørner.2008. . In International conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 337–340.
Ming Zhu, Aneesh Jain, Karthik Suresh, Roshan Ravindran, Sindhu Tipirneni, and Chandan K Reddy.2022. . arXiv preprint arXiv:2206.08474(2022).
Jialun Cao, Meiziniu Li, Ming Wen, and Shing-chi Cheung.2023. . arXiv preprint arXiv:2304.08191(2023).
Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin.2022. . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 7212–7225.
Tal Ben-Nun, Alice Shoshana Jakobovits, and Torsten Hoefler.2018. . Advances in Neural Information Processing Systems31(2018).

  1. The impact of different prompt design strategies is discussed in Section 6.2↩︎