Evaluating and Improving the Robustness of Security Attack Detectors Generated by LLMs


Abstract

Large Language Models (LLMs) are increasingly used in software development to generate functions, such as attack detectors, that implement security requirements. However, LLMs struggle to generate accurate code, resulting, e.g., in attack detectors that miss well-known attacks when used in practice. This is most likely due to the LLM lacking knowledge about some existing attacks and to the generated code not being evaluated in real usage scenarios. We propose a novel approach that integrates Retrieval Augmented Generation (RAG) and Self-Ranking into the LLM pipeline. RAG enhances the robustness of the output by incorporating external knowledge sources, while the Self-Ranking technique, inspired by the concept of Self-Consistency, leverages multiple reasoning paths to rank the candidates and select the most robust detector. Our extensive empirical study targets code generated by LLMs to detect two prevalent injection attacks in web security: Cross-Site Scripting (XSS) and SQL injection (SQLi). Results show a significant improvement in detection performance compared to baselines, with an increase of up to 71%pt (percentage points1) and 37%pt in the F2-Score for XSS and SQLi detection, respectively.

1 Introduction↩︎

The advent of Large Language Models (LLMs) has transformed software development thanks to their impressive capabilities of understanding natural language prompts and producing accurate code that implements them. These models are the foundation of AI coding assistants like GitHub Copilot [1] or Cursor AI [2]. With them, developers now often start with LLM-generated code as a base for refinement and testing [3], [4]. However, this new practice also introduces potential risks: LLMs can inadvertently introduce vulnerabilities into the generated code or struggle to generate security functions that precisely satisfy the associated security requirements [5]–[8]. This issue likely stems from insufficient scrutiny of training data, lack of task-specific knowledge, inadequate fine-tuning, or missing assessment of the output [9].

An important family of security functions consists of attack detectors. By ‘attack detectors’ we refer to functions that take an input, possibly representing a payload, analyze it, and return a boolean value representing the outcome of the analysis; more specifically, they return a positive value if an attack attempt is found within the input payload. These functions require expert knowledge about the attack vectors for effective protection against such attacks.2 This paper addresses this critical issue by evaluating and improving the robustness of LLM-generated attack detection functions.

While the evaluation of LLM-generated code has garnered attention, most existing benchmarks, like HumanEval [10], emphasize complex algorithm generation rather than code generation requiring domain/security-specific knowledge. In addition, previous work has largely focused on identifying vulnerabilities in LLM-generated code [6], [11], but a systematic approach to improving the robustness of security functions such as attack detectors is missing. We hypothesize that enhancing the robustness of the generated detectors requires more than mere prompting with a suitable query – it necessitates the integration of relevant knowledge, as well as the adoption of a systematic approach to robustness self-assessment.

To this end, we add two novel components to the LLM pipeline: Retrieval Augmented Generation (RAG) [12] and Self-Ranking. RAG, a technique widely investigated in Natural Language Processing (NLP), enhances the quality of the output by incorporating external information. We use RAG to leverage existing knowledge sources that document known attacks, so as to increase the robustness of the detectors generated by LLMs. Additionally, LLMs may exhibit non-deterministic behaviour in generation [13], which results in multiple, diverse solutions for the same prompt. We propose Self-Ranking, building upon the concept of Self-Consistency [14]: we take advantage of the multiple reasoning paths associated with the LLM’s non-determinism to rank the alternative outputs and select the best one. More concretely, the LLM can propose different implementations when queried multiple times with the same prompt; Self-Ranking is our strategy to rank these implementations and keep only the best-performing ones. In turn, ranking is based on a synthetic dataset, also generated by LLMs (hence the moniker Self-Ranking), used to automatically assess the robustness of each candidate output.

In our empirical study, we target two well-known and prevalent attacks: XSS and SQLi [15], [16], which are ranked second and third in ‘2023 CWE Top 25 Most Dangerous Software Weaknesses’3 respectively. Our extensive empirical study involves the evaluation of nine different LLMs. Results indicate that integrating external knowledge with RAG improved detection performance up to 66%pt (on average 17%pt) on the F2-Score for XSS detection and up to 67%pt (on average 7%pt) for SQLi detection compared to relying solely on few-shot examples. Additionally, employing Self-Ranking enhanced the LLM performance by up to 71%pt (on average 37%pt) on the F2-Score for XSS detection and up to 43%pt (on average 6%pt) for SQLi detection.

State-of-the-art (SOTA) machine-learning based attack detectors require a labeled training set of attacks, while our LLM-based approach is applicable even when no training set is available, provided a RAG source is accessible. In our empirical study, we also considered the scenario in which a training set is available and SOTA methods can be used. In such a scenario, our approach achieved performance comparable to SOTA machine learning-based XSS and SQLi detection methods, with the remarkable advantage that LLMs are pre-trained, while specialized machine learning-based methods require dedicated model design and training: developers can obtain SOTA performance by just querying an LLM through our RAG and Self-Ranking augmented pipeline. Another key advantage is that LLMs generate code that can be understood and manipulated by developers, while the decision-making logic of a machine learning model is to a large extent opaque to them. We also demonstrated the transferability capabilities of LLMs, which are structurally missing in SOTA detectors: after configuring the LLM (i.e., setting parameters such as the model checkpoint to use, the temperature, and the number of few-shot examples) for optimal performance on one attack detection task (e.g., XSS), the resulting configuration can be transferred to another task (e.g., SQLi), with 16%pt and 10%pt improvements when transferring the LLM parameters from one task to the other, compared to the average case without parameter transfer.

The technical contributions of this paper are as follows:

  • We introduce a novel approach that integrates RAG and Self-Ranking to evaluate and improve the robustness of LLM-generated attack detectors.

  • We conduct extensive empirical experiments with nine LLMs and two attacks, demonstrating the usefulness of our approach.

  • We explore the transferability of optimal parameters between the two tasks, contributing to a generalizable approach to securing LLM-generated code.

2 Background↩︎

In this section, we provide a background on the two foundational concepts for our approach: RAG and Self-Consistency.

2.1 Retrieval-Augmented Generation (RAG)↩︎

LLMs often struggle with very specific topics and tend to produce hallucinations [17], making it challenging to ensure factual correctness. This happens because the knowledge of an LLM is stored, in highly compressed form, in its parameters (parametric memory). Retrieval-Augmented Generation (RAG) [12] addresses this issue by incorporating a non-parametric memory, such as a database, to enhance the output’s quality in knowledge-intensive tasks. RAG’s key idea is to utilize external knowledge bases to fetch relevant information based on semantic similarity with the prompt, thereby reducing hallucinations. This approach has gained traction, making RAG essential in developing advanced chatbots [18]. RAG is also extremely useful for connecting an LLM to private data that was not present in its training data, without requiring any fine-tuning procedure.

A RAG application typically consists of two main steps. The first step is indexing, a pipeline for ingesting data from one or more sources and creating an index. This process usually occurs offline for efficiency. During indexing, the data sources are broken into smaller chunks, since large chunks are harder to search over and would not fit into the model’s finite context window. The result of this step is vector representations of the original knowledge source, obtained using an embedding model and stored in a vector database for efficient retrieval. The second step is retrieval: when a user submits a prompt, RAG uses the same embedding model from the indexing phase to convert the prompt into a vector representation. It then calculates similarity scores between the prompt vector and the vectors of chunks in the indexed corpus. Then, RAG selects and retrieves the top \(k\) chunks with the highest similarity to the prompt. These relevant chunks are then used to expand the context of the original prompt, providing additional information to the LLM.
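
As a concrete, simplified illustration of these two steps, the following sketch uses TF-IDF vectors from scikit-learn as a stand-in for the neural embedding model and an in-memory list as the vector store; the chunk texts are invented examples, not our actual RAG sources.

```python
# Minimal sketch of the two RAG steps (indexing and retrieval). A real pipeline
# would use a neural embedding model and a vector database; here TF-IDF vectors
# stand in for the embedding model and a list stands in for the vector store.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# --- Indexing (offline): chunk the knowledge source and embed each chunk ---
chunks = [
    "XSS filter evasion: attackers may close an attribute with a quote and inject <script>.",
    "SQL injection patterns often use UNION SELECT or stacked queries terminated by ';--'.",
    "Benign HTTP GET requests usually contain only alphanumeric query parameters.",
]
vectorizer = TfidfVectorizer().fit(chunks)
chunk_vectors = vectorizer.transform(chunks)          # the "vector database"

# --- Retrieval (online): embed the prompt and fetch the top-k most similar chunks ---
def retrieve(prompt: str, k: int = 2) -> list[str]:
    prompt_vector = vectorizer.transform([prompt])
    scores = cosine_similarity(prompt_vector, chunk_vectors)[0]
    top = scores.argsort()[::-1][:k]                  # indices of the k best chunks
    return [chunks[i] for i in top]

print(retrieve("Detect an XSS exploit in an HTTP GET request"))
```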

2.2 Self-Consistency↩︎

LLMs are limited in their reasoning capabilities and this cannot be resolved merely by scaling them up [19], [20]. One of the most relevant approaches to overcome this limitation is Chain-of-Thought prompting (CoT) [21], which prompts LLMs to produce a sequence of short sentences replicating a human’s reasoning process to solve a task.

The idea of CoT was further extended into Self-Consistency [14], based on the observation that complex reasoning tasks often allow multiple reasoning paths to reach a correct answer [22]. The LLM is first prompted with CoT. Then, instead of greedily decoding the optimal reasoning path, a new “sample-and-marginalize” decoding strategy generates a diverse set of reasoning paths, each of which potentially leads to a different answer. The final answer is determined by marginalization, i.e., by summing up probabilities over these paths to find the most consistent response. Importantly, Self-Consistency differs from a typical ensemble approach, as it operates on a single LLM (also called a “self-ensemble”), rather than aggregating outputs from multiple LLMs [14].
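
As a minimal illustration of the sample-and-marginalize idea (a sketch, not code from [14]), the snippet below samples several answers at a non-zero temperature and keeps the most frequent one; `query_llm` is a hypothetical helper that returns the final answer of one sampled reasoning path.

```python
# Sketch of Self-Consistency: sample several reasoning paths and keep the most
# frequent final answer (majority vote approximates marginalizing over paths).
from collections import Counter

def self_consistent_answer(prompt: str, query_llm, n_samples: int = 10) -> str:
    answers = [query_llm(prompt, temperature=0.7) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```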

Figure 1: Overview of our approach. There are two large parts, one for code generation (left) and the other for dataset generation (bottom-right). Each of the two parts takes a ‘Code Generation/Dataset Generation Template’ and a ‘Detector Signature’ as input (respectively matching ‘Input Template’ and ‘Input Task’) and generates one of four types of prompts: Basic, Few-shot, RAG, or RAG Few-shot (for simplicity, the latter is omitted). The prompt is fed into an LLM to generate candidate functions (top pipeline) or a synthetic dataset (bottom pipeline). We then use a Self-Ranking mechanism to evaluate and rank the generated functions based on their performance on the synthetic dataset.

3 Approach↩︎

In this section, we present our approach to assess and improve the robustness of LLM-generated attack detectors, achieved through a combination of LLM prompt design, RAG, and Self-Ranking, a novel technique built upon Self-Consistency. Figure 1 illustrates an overview of our approach. A Prompt (shown in the middle within a blue box) is constructed and fed into the LLM to generate functions, in our case attack detectors (top pipeline) or synthetic datasets, used to self-assess the generated functions (bottom pipeline). We present four types of Prompts, Basic, Few-shot, RAG, and RAG Few-shot4, which vary in their use of RAG and Few-shot examples. The synthetic dataset is then used to evaluate and select the generated functions by means of the Self-Ranking mechanism (described in the following sections). In the following subsections, we detail the construction of each type of Prompt and explain how RAG and Self-Ranking are employed.

3.1 Basic Prompting↩︎

We provide the LLM with a Prompt that consists of two parts: Template and Task (in Figure 1 (left), these are illustrated as ‘Code/Dataset Generation Template’ and ‘Detector Signature’, respectively). The Template contains a general instruction that specifies the desired output. To generate Python functions, such as attack detectors, a ‘Code Generation Template’ is used, which instructs the LLM to produce Python code as shown in Listing [lst:code_template].

Listing lst:code_template: Code Generation Template

The user will provide the initial part of a python function (function name, parameters with types, return type and a comment describing the purpose of the function) with some optional example. 
You are a coding assistant that writes some python code to create the user's function. Be sure that the code is syntactically correct, it is a callable function (containing the initial part provided by the user) and that it returns the correct type. 
The length of the code should be short, it should be readable, and without redundant checks on the parameters. 
Return only python code in Markdown format, e.g.:
```python
....
```

The Task contains the signature of the function (e.g., attack detector) followed by a short comment describing its purpose. Two examples of tasks we consider in our empirical study, XSS and SQLi, are shown in Listings [lst:xss_task] and [lst:sqli_task], respectively.

Listing lst:xss_task: Task for XSS detection (Detector Signature)

def detect_xss(http_get_request: str)->bool: 
""" Check if in the given http_get_request there is an XSS exploit, considering also the possible evasions that an attacker can perform.
"""

Listing lst:sqli_task: Task for SQLi detection (Detector Signature)

def detect_sqli(query: str)->bool: 
""" Check if the given SQL query contains some statement injected by an attacker to perform SQL injection. Be sure that in the patterns used for the detection, no pattern related to normal SQL code (like standard SELECT ... FROM ... WHERE ..., or standard INSERT ... INTO ...) is considered, since it is not an attack."""

The Basic Prompt is the result of merging Template and Task, serving as the input to the LLM. For example, we can combine Listing [lst:code_template] and Listing [lst:xss_task] to form a Basic Prompt for the generation of an XSS detector.
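
As an illustrative sketch (the message structure and the `chat` wrapper are assumptions, not part of our pipeline), merging the two parts can be as simple as:

```python
# Sketch of Basic Prompt construction: the Template becomes the system message
# and the Task (detector signature plus docstring) the user message.
# `chat` is a hypothetical wrapper around whatever LLM API is used.
def build_basic_prompt(template: str, task: str) -> list[dict]:
    return [
        {"role": "system", "content": template},
        {"role": "user", "content": task},
    ]

# messages = build_basic_prompt(code_generation_template, xss_task)
# generated_code = chat(messages)
```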

3.2 Few-shot Examples↩︎

Few-shot examples have been explored in benchmarks such as HumanEval [10], where they are appended to the function definition to illustrate the expected input-output relationship. In our approach, we extend this concept by including malicious and benign examples, creating what we call the Few-shot Prompt. An illustration of this is shown in Listing [lst:xss_task_examples], where we use Few-shot examples for XSS detection. Note that while an arbitrary number of examples can be appended, there is no guarantee that a higher quantity will yield better results.

Listing lst:xss_task_examples: Example task for XSS detection with few-shot examples

def detect_xss(http_get_request: str)->bool: 
""" Check if in the given http_get_request there is an XSS exploit, considering also the possible evasions that an attacker can perform.
>>> detect_xss('http://anonymised-site/moderate-social-media&t=1396533765893&n=1129109&k=mainentity')
False
>>> detect_xss('http://anonymised-site/guestbook/index.php?lang="><script>alert(document.cookie)</script>...')
True
"""

3.3 RAG↩︎

We integrate RAG to enhance the LLM’s ability to generate robust detectors. We followed the general RAG methodology, starting from the indexing phase. Once the knowledge sources about attacks are stored in vector representation5, they can be queried to retrieve the most relevant information chunks (referred to as ‘External Knowledge’ in Figure 1). We utilize the Task (‘Detector Signature’ in Figure 1) directly as a query to retrieve these chunks, which are then appended to the Template to obtain a RAG Prompt, as illustrated in Listing [lst:code_template_rag].

Few-shot examples and RAG can be combined, as they modify different structures: the Task and the Template, respectively. Appending Few-shot examples to the Task allows RAG to extract more diverse and useful information, possibly enhancing the overall performance. We refer to this combined prompt as the RAG Few-shot Prompt.

Listing lst:code_template_rag: Template for code generation, with the relevant context selected via RAG (queried with XSS Detection Task) appended at the end. The first part of the template is shortened to avoid redundancy.

The user will provide the initial part of a python function 
...
...
Return only python code in Markdown format, e.g.:
```python
....
```
Use the following pieces of retrieved context to write a more complete function:
Context: Here's an XSS example that bets on the fact that the regex won't catch a matching pair of quotes but will rather find any quotes to terminate a parameter string improperly.
...
Example: <script> ... setTimeout(\\"writetitle()\\",$\_GET\[xss\]) ... </script>
Exploitation: /?xss=500); alert(document.cookie);//)
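
As a sketch of how such a prompt can be assembled programmatically (the helper names are ours and purely illustrative), the Task is passed as the retrieval query and the returned chunks are appended to the Template:

```python
# Sketch of RAG Prompt assembly: the Task (Detector Signature) is used directly
# as the retrieval query, and the retrieved chunks are appended to the Template.
# `retrieve` is any retrieval helper (e.g., the sketch in Section 2.1) returning
# the top-k chunks for a query string.
def build_rag_prompt(template: str, task: str, retrieve, k: int = 3) -> str:
    context = "\n".join(retrieve(task, k))
    return (template
            + "\nUse the following pieces of retrieved context to write a more complete function:\n"
            + "Context: " + context)
```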

3.4 Self-Ranking↩︎

Self-Consistency does not apply directly to our task as originally proposed [14], because the generation of security attack detectors is not always easily decomposable into a chain of thoughts representing partial, incremental steps leading to the solution. Moreover, reasoning on the generated candidates may not be feasible, as the set of all generated functions may not fit within the LLM’s context window. More importantly, our preliminary experiments suggest that LLMs struggle to assess the robustness of a set of candidate functions. Hence, we reuse only the general idea behind Self-Consistency, i.e., the LLM operating as a “self-ensemble”. Specifically, the non-determinism of LLMs is leveraged to produce a set of candidates that are assessed by the LLM itself, which generates an assessment dataset instead of directly assessing its own output. We refer to our novel variant of Self-Consistency as Self-Ranking: we ask the LLM to generate a synthetic dataset that simulates the presence of ground-truth data, and we use this synthetic dataset to evaluate and rank the generated functions and select the best one. To generate the synthetic dataset, we leverage the modular structure of the Prompt by introducing a second Template (‘Dataset Generation Template’ in Figure 1), as shown in Listing [lst:dataset_template]. This ‘Dataset Generation Template’ can be combined with the two proposed Tasks to create synthetic datasets for various vulnerabilities. Additionally, Few-shot examples and RAG can be integrated into synthetic dataset generation.

Listing lst:dataset_template: Dataset Generation Template

The user will provide the initial part of the function (function name, parameters with types, return type and a comment describing the purpose of the function) with some optional example. 
You are a testing assistant that generates a dataset to test the function provided by the user.

Once the synthetic dataset is generated, it serves as a testing ground for selecting the best function. Specifically, when multiple functions are generated, Self-Ranking evaluates each function on the synthetic dataset, orders the functions based on a chosen metric, e.g., F2-Score, and retains only the top-performing subset, denoted as the top_k functions. In our study, Self-Ranking is used to select the subset of \(k\) most robust attack detectors.
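
A minimal sketch of this selection step is shown below, assuming each candidate detector is a Python callable and the synthetic dataset is a list of (payload, label) pairs; `fbeta_score` from scikit-learn computes the F2-Score.

```python
# Sketch of Self-Ranking: score every candidate detector on the synthetic
# dataset and keep the k best ones according to the chosen metric (F2 here).
from sklearn.metrics import fbeta_score

def self_rank(candidates, synthetic_dataset, k=5):
    """candidates: list of callables str -> bool;
    synthetic_dataset: list of (payload, is_attack) pairs."""
    payloads, labels = zip(*synthetic_dataset)
    scored = []
    for detect in candidates:
        predictions = [bool(detect(p)) for p in payloads]
        scored.append((fbeta_score(labels, predictions, beta=2), detect))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [detect for _, detect in scored[:k]]
```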

4 Empirical Study↩︎

In this section, we first present the two evaluation scenarios, which differ in the availability vs. unavailability of a training/validation dataset. We then formulate four research questions and describe our experimental settings, including the models used, the configurable parameters, and the datasets. Next, we outline the experimental procedure, including the generation and evaluation of functions, the generation of synthetic datasets, and the selection of top_k functions. Finally, we provide details on the implementation and the evaluation metrics.

4.1 Scenarios↩︎

4.1.1 Scenario NTD (No Training Dataset)↩︎

In this scenario, developers do not possess a training set that would allow them to train a machine learning-based SOTA detector. For the same reason, developers cannot extract a validation set from the training set, which would allow them to directly assess and rank the LLM-generated functions (without Self-Ranking). Also, they cannot decide whether one LLM (checkpoint) is preferable over another (assuming they have multiple instances to select from), find optimal LLM parameter configurations (e.g., temperature), or choose the optimal prompt among the Basic/Few-shot/RAG/RAG Few-shot Prompts. Moreover, they cannot assess the potential benefits offered by Self-Ranking.

Therefore, in the NTD scenario, we consider the effectiveness of our approach in terms of average performance on a test set composed of real-world attacks. We average the performance calculated across multiple choices of LLM instances, temperatures, alternative function/dataset generation prompts, and inclusion/exclusion of Self-Ranking. The average performance can be conditioned on each element of our approach, to determine its effect when all other choices are unconstrained.

4.1.2 Scenario TDA (Training Dataset Available)↩︎

In this scenario, a task-specific labeled training set of attacks is available to developers. This addition would allow them to directly train Machine Learning (ML) models that perform the task assigned to the generated detectors, without actually generating any function at all. In fact, SOTA techniques for many security tasks, including XSS and SQLi detection, use ML models trained on a labeled dataset.

Moreover, the availability of a training dataset allows the extraction of a validation set that can be used to analyze the performance of different LLM instances, parameters, and elements of our approach, so as to choose the optimal configuration. Developers may even choose between LLM-generated functions and an ML model, based on their respective performance on the validation set.

4.2 Research Questions↩︎

In both NTD and TDA scenarios, we want to know if RAG and/or Self-Ranking are beneficial to the generation of robust attack detectors:

  • RQ1. RAG Benefit: How helpful is RAG in generating better security attack detectors? How does it perform when combined with Few-shot examples?

  • RQ2. Impact of Self-Ranking: Is the selection of top_k functions via Self-Ranking an effective strategy to enhance the robustness of LLM-generated attack detectors?

In the NTD scenario, we evaluate the average performance across configurations (LLM, temperature, number of few-shot examples), either with or without RAG/Self-Ranking. In the TDA scenario, we check whether the optimal configuration selected through the validation set includes RAG/Self-Ranking.

In the TDA scenario, the available training set can be used to train a SOTA machine learning-based model. The best LLM-generated functions can be compared with SOTA techniques solving the same task:

  • RQ3. Comparison with SOTA: Do the detectors generated by LLMs, when assessed on an existing validation dataset, demonstrate comparable performance to state-of-the-art ML models trained specifically for the task?

Finally, we are interested in the transferability from the TDA scenario on a task, to the NTD scenario on another task. We want to know if the optimal configuration of the LLM identified through the validation set available for a given task (e.g., XSS) provides good/optimal results when used to solve another security function generation task (e.g., SQLi).

  • RQ4. Transferability: Can the optimal parameters for function and synthetic data generation in one task be transferred and applied to achieve effective results in other tasks?
Table 1: LLMs used in the experiments. Column C. Window shows the size of the Context Window in tokens; Up To shows the last training update.

| Model Name | Checkpoint Name | Alias | Provider | N. parameters | C. Window | Up To | Pass@1 |
|---|---|---|---|---|---|---|---|
| GPT-3.5 Turbo | gpt-3.5-turbo-0125 | GPT-3.5T | OpenAI | N/A | 16,385 | 2021 | 64.9 |
| GPT-4 | gpt-4-1106-preview | GPT-4 | OpenAI | N/A | 128,000 | 2023 | 76.5 |
| GPT-4 Turbo | gpt-4-0125-preview | GPT-4T | OpenAI | N/A | 128,000 | 2023 | 87.1 |
| Claude 3 Haiku | anthropic-claude-3-haiku | Haiku | Anthropic | Small Claude3 | 200,000 | N/A | 75.9 |
| Claude 3 Sonnet | anthropic-claude-3-sonnet | Sonnet | Anthropic | Med. Claude3 | 200,000 | N/A | 73.0 |
| Claude 3 Opus | anthropic-claude-3-opus | Opus | Anthropic | Large Claude3 | 200,000 | N/A | 84.9 |
| Llama3 | llama3-70b-instruct | Llama | Meta | 70 billion | 8,192 | N/A | 81.7 |
| Mixtral 8x7b | mixtral-8x7b-instruct-v01 | Mixtral | Mistral | 12 billion | 32,000 | N/A | 40.2 |
| PaLM 2 Chat Bison | gcp-chat-bison-001 | PaLM | Google | N/A | 2,500 | 2023 | 37.6 |

4.3 Experimental Settings↩︎

4.3.1 Models↩︎

We employ nine LLMs as listed in Table 1, which includes the aliases that are used throughout the remainder of this paper. Additionally, we report the Pass@1 scores from HumanEval [10] as a proxy for their general reasoning capabilities: GPT-4T showed the best performance, followed closely by Opus, whereas PaLM exhibited the lowest performance among the models we considered. While Pass@1, which assesses the model’s reasoning capabilities to generate complex algorithms, may serve as an indicator of its expected performance on our task, we do not expect a perfect correlation with our results, since our task is more focused on utilizing extensive knowledge rather than reasoning on a given problem.

4.3.2 Configurable Parameters↩︎

A model configuration refers to the model and temperature used (e.g., GPT-4 with temperatures \(0.5\) and \(1.0\)). A prompt configuration is determined by the number of Few-shot examples (with equal malicious and benign examples) and the choice to use RAG or not, to enrich the prompt. By combining model and prompt configurations, we obtain an overall Configuration Conf, consisting of a tuple with a model, a temperature, the few-shot number, and the RAG usage choice. For instance, \(( \mathrm{\small GPT-4}, 0.5, 2, T)\) indicates the configuration where the model is GPT-4, the temperature is set to 0.5, the number of Few-shot examples is 2, and RAG is used (T represents True for the RAG Usage Parameter).
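
For illustration only, a configuration can be represented as a small named tuple (the field names are ours):

```python
# Illustrative representation of a Configuration Conf as described in the text.
from typing import NamedTuple

class Conf(NamedTuple):
    model: str          # LLM checkpoint alias, e.g., "GPT-4"
    temperature: float  # sampling temperature
    n_shot: int         # number of Few-shot examples (half malicious, half benign)
    rag: bool           # whether RAG context is added to the Template

conf = Conf("GPT-4", 0.5, 2, True)   # the example configuration from the text
```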

Table 2 shows the domains of the configurations for the different parts of the empirical study. Given the domain, some of its configurations may fail on a given task. Hence, they are excluded. For example, among the configurations for Code Generation (see Table 2), when generating an XSS detector, all the configurations with models Sonnet, PaLM, Llama, Mixtral, and temperature 0 fail and are hence discarded. Similarly, for the SQLi detector, the aforementioned configurations, along with PaLM, Sonnet, and Mixtral at temperature 0.5, fail and are also excluded. Haiku and GPT-3.5T are completely discarded in code generation for the same reasons, but they are still among the employed models because they are used for synthetic dataset generation.

Table 2: Domain of the configurations used for code and dataset generations

| Conf. Domain | Model | Temp. | N. Shot | RAG |
|---|---|---|---|---|
| Code generation | Sonnet, PaLM, Llama, Mixtral | 0.0, 0.5, 1.0 | 0, 2, 6, 10 | True, False |
| Dataset generation | Sonnet, Haiku | 0.0, 0.5, 1.0 | 0, 2, 6, 10 | True, False |

Every time we generate code, we produce 40 functions by feeding the model 40 times with the same prompt. This strategy accounts for the non-determinism of LLMs [13] and is exploited by Self-Ranking to select the top_k functions.

For synthetic dataset generation, we iteratively generate a dataset of 100 examples, with an equal split of 50 malicious and 50 benign cases. We use a timeout of 9,000 seconds to discard configurations that fail to fill the dataset with 100 examples in a reasonable amount of time. To account for non-determinism, we repeat the process 10 times and produce 10 distinct datasets. The performance used to determine the ranking is calculated by averaging the performance on the 10 different produced datasets. When Self-Ranking is used, we have to decide the number \(k\) of top-ranked functions to consider. In our experiments, the values of \(k\) for the selection of top_k functions are 1, 3, and 5.

Several RAG sources were analyzed in preliminary experiments to select the most suitable ones for our setup. Considering only one source of knowledge for every task is reasonable, since the focus of the empirical study is to exploit a RAG source under the assumption that it is available and of high quality, not to automatically find the best possible source of knowledge among the available ones. For the XSS detection task, we used the XSS Evasion Cheat Sheet by OWASP6 as the RAG source. OWASP is a foundation that works to improve the security of software, and the knowledge contained in its cheat sheets is potentially useful to improve the quality of the detectors generated for XSS attacks. For the SQLi detection task, the selected RAG source is an article on WebSec7 containing a list of SQLi patterns. The knowledge shared in both sources is well-suited to our goal of creating effective detection functions via LLMs.

Table 3: Size of the datasets for XSS and SQLi, with a specific focus on the different class sizes and the splits into training, validation, and test set.

| | XSS Training | XSS Validation | XSS Test | SQLi Training | SQLi Validation | SQLi Test |
|---|---|---|---|---|---|---|
| Malicious | 8,344 | 2,087 | 2,608 | 12,714 | 3,178 | 3,973 |
| Benign | 8,584 | 2,146 | 2,683 | 12,714 | 3,178 | 3,973 |
| Total | 16,928 | 4,233 | 5,291 | 25,428 | 6,356 | 7,946 |

4.3.3 Datasets↩︎

For XSS detection, we use a publicly available dataset containing malicious and benign payloads of HTTP requests from the FMereani repository8, while for SQLi detection, we use the dataset presented in the SOFIA paper [23]. Table 3 shows the sizes of the splits of our two datasets into train_set, val_set, and test_set. Train_set is not used in our approach (except for the Few-shot examples), as we generate a detection function via a pre-trained LLM, without any further training; it is used to train the ML-based detection models that are compared with our approach to answer RQ3. Val_set is used as the validation set in the TDA scenario. Test_set is kept hidden and is used only when answering the experimental research questions RQ1 to RQ4.

4.4 Study Procedure↩︎

Figure 2: Main steps of the adopted experimental procedure: two configurations (\(C_1, C_2\)) are used for function generation and two configurations (\(C_3, C_4\)) for dataset generation. The LLM is queried three times, resulting in three generated functions (\(u_1, u_2, u_3\)) and three generated datasets (\(s_1, s_2, s_3\)) per configuration. In the TDA scenario, the resulting functions (\(U_1, U_2\)) are ranked using a validation set \(d\); in the NTD scenario they are ranked using the generated datasets (\(S_1, S_2\)).

The main steps of our experimental procedure are depicted in Figure 2. The pipeline at the top shows the generation of multiple functions (to account for non-determinism of the LLM) for each configuration. The pipeline at the bottom is similar, but the output consists of multiple synthetic datasets for each configuration. At the right, Figure 2 shows the selection of the best top_k functions based either on an existing validation dataset (top-right) or on a generated dataset (bottom-right).

4.4.1 Function Generation↩︎

Given a configuration \(C = \langle C.model, C.temperature, C.examples, C.rag\rangle\), and a task \(t\), we denote as \(u(t, C)\) the Generated Function, i.e., the output of \(C.model\) queried with the prompt constructed from the code generation template, shown in Listing [lst:code_template_rag], the task \(t\), and the parameters specified in the configuration \(C\). In Figure 2 (top), function generation is executed with two configurations, \(C_1\) and \(C_2\).

To account for non-determinism and to support Self-Ranking, we repeat the generation process \(n\) (in our experiment, 40) times, obtaining a Generated Function Run \(U\), consisting of the set of \(n\) Generated Functions: \(U(t, C) = \{u_1(t, C), \ldots, u_{n}(t, C)\}\). In Figure 2, \(n=3\) functions are generated, \(u_1, u_2, u_3\) for each configuration.

We define a Function Generation Experiment \(H(t)\) as the set of all Generated Function Runs, given all valid configurations from the code generation domain (see Table 2): \(H(t) = \{U(t, Conf_1), U(t, Conf_2), \ldots\}\). In Figure 2, the experiment \(H\) contains two sets of generated functions, \(U_1\) and \(U_2\).
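
The following sketch summarizes this step; the `generate_function` wrapper, which builds the prompt for a task/configuration pair and queries the configured model, is hypothetical.

```python
# Sketch of a Function Generation Experiment: for every valid configuration,
# query the LLM n times with the same prompt to obtain a Generated Function
# Run U(t, C); the experiment H(t) collects all runs.
def function_generation_experiment(task, configurations, generate_function, n=40):
    experiment = {}                                                 # H(t)
    for conf in configurations:
        run = [generate_function(task, conf) for _ in range(n)]     # U(t, C)
        experiment[conf] = run
    return experiment
```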

4.4.2 Dataset Generation↩︎

A synthetic dataset, denoted as \(s(t, C)\), is obtained by prompting the model \(C.model\) with a prompt constructed from the dataset generation template shown in Listing [lst:dataset_template], the task \(t\), and the parameters specified in the configuration \(C\). In Figure 2 (bottom), dataset generation is executed with two configurations, \(C_3\) and \(C_4\).

To account for non-determinism, we repeat the generation process \(m\) (in our experiments, 10) times and obtain a Synthetic Dataset Run \(S(t, C) = \{s_1(t, C), \ldots, s_{m}(t, C)\}\). In Figure 2, \(m=3\) datasets are generated, \(s_1, s_2, s_3\) for each configuration. Given all the configurations in the domain, we define a Synthetic Dataset Experiment \(P(t)\) as \(\{S(t, Conf_1)\), \(S(t, Conf_2)\), \(\ldots\}\). In Figure 2, the experiment \(P\) contains two sets of generated datasets, \(S_1\) and \(S_2\).

4.4.3 Selection of top_k Functions↩︎

We select the top_k functions of a Generated Functions Run \(U\) using a dataset \(d\) and a performance metric \(\mathcal{M}\) (e.g., F2-Score) by sorting the functions in \(U\) by decreasing \(\mathcal{M}\) and including only the first \(k\) functions in the sorted list: \(\textit{top\_k}(U, d, \mathcal{M}) = \textit{sort}(U, \mathcal{M}(\cdot, d))[1\mathpunct{:}k]\).

Note that in the event of a tie for the \(k^{th}\) position, one element is randomly chosen. In the TDA scenario, we can set \(d = \textit{val\_set}\) to select the top_k functions. In the NTD scenario, we use a Synthetic Dataset Run as \(d\), exploiting Self-Ranking. In Figure 2, the TDA scenario is shown on the top-right, with \(d\) the Validation Dataset. The NTD scenario is shown on the bottom-right, with \(d\) equal to each of the generated datasets \(S_1\) and \(S_2\).

4.4.4 Evaluation Metrics↩︎

We use a separate, independent test set to evaluate the results produced by the pipeline shown in Figure 2: (1) we measure the quality of the generated functions without applying any ranking (NTD scenario with no generated dataset); (2) we measure the quality of the generated functions after ranking them based on a validation set (TDA scenario); (3) we measure the quality of the generated functions after ranking them based on a generated dataset (NTD scenario with synthetic dataset).

The quality of the generated functions without ranking is measured as the average performance of the generated functions across all Function Runs \(U\) in the experiment \(H\). The quality of the generated functions after ranking on a validation set (resp. synthetic dataset) is measured as the top_k performance (i.e., the average performance of the first \(k\) functions in the ranked list) of the best Generated Function Run \(U^{best}\), which is the function run \(U\) with highest average performance according to the validation dataset (resp. synthetic dataset).

We also measure the top_k performance improvement, defined as the performance difference between the top_k functions and all functions in the Generated Function Run \(U\).

Measuring the effectiveness of a Synthetic Dataset Run can be achieved by assessing its capability to select the top_k functions accurately, just as a ground-truth dataset (i.e., val_set) would do. This does not directly assess the quality of the generated Synthetic Dataset, since we cannot prove that all the elements are correctly labeled or semantically rich enough to capture all the aspects of the attack, but we indirectly assess it in terms of capability to act as a proxy for the real ground-truth dataset, thus selecting an optimal set of functions. To quantify this, we introduce the performance difference metric, defined as the difference between the average top_k performance with ranking on val_set and the average top_k performance with ranking on a Synthetic Dataset Run \(S\).

4.5 Implementation↩︎

Our experimental framework was implemented using Python 3.10 and Langchain9. For the comparison with SOTA, we followed the approach proposed by Chen et al. [24] and implemented two XSS detection models by training a Convolutional Neural Network (CNN) and a Multi-Layer Perceptron (MLP). To compare with SOFIA [23] on SQLi detection, we consider the performance values reported in their paper, as we share exactly the same test set. All experiments on GPT models were conducted on a machine with an AMD EPYC 7742 64-Core Processor CPU, a Tesla V100 GPU, 512 GB RAM, running Ubuntu 20.04.6 LTS. The experiments on the other models were conducted on a machine with an AMD EPYC 7763 64-Core Processor CPU, 32 GB RAM, running Ubuntu 22.04.4 LTS. In the latter case, models were run through services such as Google Vertex or Amazon Bedrock, depending on the provider.

4.6 Performance Metric↩︎

In our application scenarios (XSS/SQLi detection), we give more weight to false negatives (resulting in low recall) than to false positives (resulting in low precision), since a non-detected attack can cause much more damage than a benign request detected as an attack. For this reason, the main performance metric used in our empirical study is the F2-Score, referred to as F2 for simplicity (\(\mathcal{M} = F_2\)), which gives recall twice the importance of precision: \(F_2 = 5p*r / (4p+r)\), with \(p\) and \(r\) indicating precision and recall, respectively. To mitigate a possible threat to the construct validity associated with the choice of the performance metric F2, which is strongly related to the specific domain of the study, we replicated all experiments using accuracy and F1-Score, referred to as F1 and defined as \(F_1 = 2p*r / (p+r)\), as alternative performance metrics. Accuracy is a valuable choice since the datasets used in this study are well-balanced. However, since in a real-world scenario positive samples are much fewer than the negative ones, we also replicated all the experiments using the F1-Score, a metric that ignores the true negatives and hence, differently from accuracy, is not sensitive to the balance of positive vs. negative data. Results are consistent and largely independent of the choice of the performance metric. The interested reader can find the additional results obtained with \(\mathcal{M} = Accuracy\) and \(\mathcal{M} = F1\) in the replication package (see the Data Availability section).
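
For concreteness, both metrics are instances of the general F-beta formula; a small sketch with a worked example follows.

```python
# F-beta from precision p and recall r; beta=2 weighs recall twice as much as
# precision (the F2-Score used in the study), beta=1 gives the standard F1-Score.
def f_beta(p: float, r: float, beta: float = 2.0) -> float:
    if p == 0 and r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Example: with precision 0.9 and recall 0.6, F2 ~= 0.64 while F1 ~= 0.72,
# showing how F2 penalizes the lower recall more heavily.
```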

For RQ1, in the NTD scenario, we get two sets of F2 by evaluating the functions generated with and without RAG on the test_set. By comparing these two sets, we check if RAG provides statistically significant improvements using the Mann-Whitney U Test [25]. To understand the impact of RAG in the TDA scenario, we use val_set to select the best Generated Function Run \(U^{best}\) and we check if RAG was used to generate \(U^{best}\).

For RQ2, we first quantify the benefits of using top_k selection with Self-Ranking in the NTD scenario. We analyze all pairs \(\langle U, S\rangle\) of Generated Function Runs \(U\) and Synthetic Dataset Runs \(S\) to compute the improvement of F2. We employ the Wilcoxon signed-rank test to establish the statistical significance of such improvement. To analyze the impact of Self-Ranking in the TDA scenario, we use val_set to select the best Function Run \(U^{best}\). Then, for each \(k\), \(S^{best}\) is selected starting from \(U^{best}\), as the Synthetic Dataset Run \(S\) with minimum performance difference w.r.t. the top_k performance measured on val_set. At this point, it is possible to measure the improvement of F2 due to the usage of Self-Ranking, with \(U^{best}\) as the selected Generated Function Run and \(S^{best}\) as the selected Synthetic Dataset Run.

For RQ3, to compare the performance with the SOTA models, we first use val_set to select the best Generated Function Run \(U^{best}\), then the top_k function of \(U^{best}\). Then, we assess the top_k F2 measured on the test_set, comparing it with that of the SOTA models.

For RQ4, for each \(k\), we consider one of the two tasks (e.g., XSS) in turn, and we determine the configurations that give \(S_1^{best}\) and \(U_1^{best}\). We then apply such configurations to produce \(S_2^{transf}\) and \(U_2^{transf}\) for the other task (e.g., SQLi). For the second task, we also compute \(S_2^{best}\) and \(U_2^{best}\). We can now compare the best top_k F2 obtained using \(S_2^{best}\) and \(U_2^{best}\) vs. the top_k F2 of \(S_2^{transf}\) and \(U_2^{transf}\). Then, we swap the two tasks and repeat the transferability process in the other direction.

5 Results↩︎

5.1 RQ1 (RAG Benefit)↩︎

Figure 3: Difference between the F2 of Generated Function Runs with RAG (i.e., RAG Prompt and RAG Few-shot Prompt) and the F2 of Generated Function Runs without RAG (i.e., Basic Prompt and Few-shot Prompt), for XSS detection (left) and SQLi detection (right).

Figure 3 illustrates the impact of RAG on function generation for XSS and SQLi in the NTD scenario, showing the F2 differences between the configurations with and without RAG, across all possible configurations. Each bar represents the average F2 score difference for a specific Model-Temperature pair when using RAG, compared to the same pair without RAG. Results indicate that employing RAG generally enhances the performance of function generation for both tasks. The number of Model-Temperature pairs benefiting from RAG is much larger than the number of pairs showing degradation, and the improvements are statistically significant, as evidenced by \(p\)-values (\(\approx 10^{-66}\) for XSS and \(\approx 10^{-24}\) for SQLi) below the standard threshold of 0.05.

Figure 4: Difference between the F2 of Generated Function Runs with RAG Few-shot Prompt and the F2 of Generated Function Runs with RAG Zero-shot Prompt, for XSS detection (left) and SQLi detection (right).

We further investigate the benefit of combining Few-shot examples with RAG using a similar setting as Figure 3. Figure 4 shows that while the addition of Few-shot examples shows some benefits for SQLi, the same cannot be said for XSS. These findings suggest that the usage of Few-shot examples may not always provide advantages when RAG is already employed, indicating that in the NTD scenario, it could be preferable to omit them.

Turning to the TDA scenario, where we can select the best Generated Function Run using \(val\_set\), we observe that \(U^{best}\) always takes advantage of both RAG and Few-shot examples: for XSS, \(U^{best}=\) (GPT-4T, 0.0, 10, T), and for SQLi, \(U^{best}=\) (GPT-4T, 0.0, 2, T). This indicates that while the use of Few-shot examples may not consistently enhance performance in general (as seen in the NTD scenario), the best configuration selected with \(val\_set\) (in the TDA scenario) leverages the benefits of using both RAG and Few-shot examples.

Answer to RQ1: The usage of RAG yields statistically significant performance improvements for both tasks. While the combination of RAG with Few-shot examples does not consistently enhance this benefit, it proves advantageous in the best case.

5.2 RQ2 (Impact of Self-Ranking)↩︎

Figure 5: F2 given by top_k selection (i.e., Self-Ranking) for XSS detection (first row) and SQLi detection (second row), with \(k=1\), \(k=3\) and \(k=5\).

Let us consider the NTD scenario. The violin plots shown in Figure 5 depict the effect of Self-Ranking, i.e., \(top\_k\) selection, across three values of \(k\) for the two tasks, considering all the possible pairs of function/synthetic dataset configurations. Overall, we observe a clear improvement, particularly for XSS. The region between quartiles for XSS falls largely between a 20%pt and 40%pt improvement, with an average improvement of 37%pt and an improvement that affects 98% of the cases. While the improvement for SQLi is less pronounced, we still observe improvements in 73% of the cases, with an average improvement of 6%pt. Additionally, we can see that, as \(k\) increases, the average improvement decreases for both tasks, but the gains become more stable. The improvements given by the usage of \(top\_k\) selection are statistically significant: the \(p\)-values obtained with the Wilcoxon signed-rank test are below the threshold of 0.05 for both tasks and all the values of \(k\)10.

In the TDA scenario, we observe that utilizing the Self-Ranking mechanism to select the top_k functions is more effective than not employing it: compared to solely using \(U^{best}\), employing \(S^{best}\) achieves a 3.21%pt and 4.94%pt increase in F2 for XSS and SQLi, respectively.

Answer to RQ2: Utilizing the Self-Ranking mechanism that selects \(top\_k\) functions leads to statistically significant improvements on both tasks.

Table 4: Comparison with SOTA models.

| Method (XSS Detection) | F2 | Method (SQLi Detection) | F2 |
|---|---|---|---|
| CNN [24] | 0.998 | SOFIA [23] | 0.993 |
| MLP [24] | 0.995 | | |
| Ours (k=1) | 0.965 | Ours (k=1) | 0.991 |
| Ours (k=3) | 0.965 | Ours (k=3) | 0.988 |
| Ours (k=5) | 0.965 | Ours (k=5) | 0.975 |
| Baseline | 0.630 | Baseline | 0.800 |

5.3 RQ3 (Comparison with SOTA)↩︎

To compare our approach with learning-based SOTA techniques, we select \(U^{best}\) and the \(top\_k\) functions based on \(val\_set\). For XSS, \(U^{best}\) is \((\mathrm{\small GPT-4T}, 0.0, 10, T)\), while for SQLi, it is \((\mathrm{\small GPT-4T}, 0.0, 2, T)\). We establish a baseline to better understand the improvement offered by adopting our approach. The baseline is obtained using the Basic Prompt, without Few-shot examples and RAG. It represents the results that can be expected when generating detectors without exploiting any of the techniques presented in this work.

Table 4 presents the results of the comparison. We observe a significant improvement of 34%pt and 18%pt for XSS and SQLi respectively, when compared to the baseline. There is a slight decrease in F2 compared to SOTA models, with a 3.15%pt and 0.83%pt drop for XSS and SQLi, respectively. We argue that the slight performance gap between our approach and SOTA models is understandable, given our approach’s training-free nature and direct applicability to multiple tasks. Moreover, these results provide empirical support for our claim that incorporating external knowledge and Self-Ranking is essential for LLMs to achieve competitive performance with SOTA models.

It is important to remark that, despite the slight performance decrease w.r.t. SOTA models, the use of our method has several other advantages: our approach is applicable even in the absence of a training set, a scenario that rules out all ML-based SOTA techniques. We use a pre-trained LLM, while SOTA techniques require a model to be designed and trained for the attack detection task at hand. The output of our approach (a Python function) is interpretable by developers, while SOTA models are black-box. This transparency allows for a clear and full understanding of the detector’s decision-making process and provides an opportunity for further refinement and improvement. Another key advantage, explored in the following section, is transferability from one detection task to another, which is structurally impossible for SOTA models.

Answer to RQ3: The functions generated with our approach are shown to be comparable to the SOTA models trained for the specific tasks, with the advantage of interpretability and applicability in the absence of a training set.

Table 5: Best Generated Function Runs and Synthetic Datasets

Task 1: XSS Detection

| \(U^{best}_1 \rightarrow U^{transf}_2\) | \(k\) | \(S^{best}_1 \rightarrow S^{transf}_2\) |
|---|---|---|
| (GPT-4T, 0.0, 10, T) | 1 | (Haiku, 0.5, 6, F) |
| | 3 | (Haiku, 1.0, 10, F) |
| | 5 | (Haiku, 0.5, 10, F) |

Task 2: SQLi Detection

| \(U^{best}_2 \rightarrow U^{transf}_1\) | \(k\) | \(S^{best}_2 \rightarrow S^{transf}_1\) |
|---|---|---|
| (GPT-4T, 0.0, 2, T) | 1 | (Opus, 0.5, 6, T) |
| | 3 | (Opus, 0.5, 6, T) |
| | 5 | (Sonnet, 1.0, 10, T) |
Table 6: Results of the transferability study. The \(F2(U^{transf}, S^{transf})\) columns report the results with transferred configurations.

Task 1: XSS Detection

| \(k\) | \(F2(U^{best}_1, S^{best}_1)\) | Avg. F2 | \(F2(U^{transf}_1, S^{transf}_1)\) |
|---|---|---|---|
| 1 | 0.965 | 0.809 | 0.949 |
| 3 | 0.965 | 0.771 | 0.929 |
| 5 | 0.965 | 0.740 | 0.933 |

Task 2: SQLi Detection

| \(k\) | \(F2(U^{best}_2, S^{best}_2)\) | Avg. F2 | \(F2(U^{transf}_2, S^{transf}_2)\) |
|---|---|---|---|
| 1 | 0.964 | 0.787 | 0.853 |
| 3 | 0.951 | 0.775 | 0.900 |
| 5 | 0.946 | 0.763 | 0.867 |

5.4 RQ4 (Transferability)↩︎

Table 5 shows the best configuration of each task, specifically for function generation (\(U^{best}\)) and synthetic dataset generation (\(S^{best}\)), across different \(k\) values. In our notation, Task 1 is XSS and Task 2 is SQLi, and \(A \rightarrow B\) represents the transfer of a configuration from one task to another. As an illustration, in \(U^{best}_1 \rightarrow U^{transf}_2\), \(U^{best}_1\) denotes the best configuration for Task 1, whereas \(U^{transf}_2\) represents the transferred configuration, which originates from Task 1 (\(U^{best}_1\)) and is subsequently evaluated on Task 2.

Table 6 presents the transferability results. The ‘\(F2(U^{best}, S^{best})\)’ columns indicate F2 computed on the original task with its best configuration, ‘Avg. F2’ columns represent the average F2 computed across all the \(U-S\) pairs for a given \(k\), and the ‘\(F2(U^{transf}, S^{transf})\)’ columns show F2 computed using transferred configurations. Comparing the transferred configuration’s performance to the average column provides a good estimate of the benefits of configuration transfer over a mere random selection of a configuration for a new, unseen task.

The results support the effectiveness of transferring configurations. While there is a slight degradation in F2 compared to the original best configuration (on average, 3%pt for XSS and 8%pt for SQLi), the transferred configurations outperform the average F2, achieving, on average, a 16%pt improvement for XSS and a 10%pt improvement for SQLi.

Answer to RQ4: The best configurations obtained for one task can effectively be transferred to the other task, outperforming the average performance across all possible configurations.

Figure 6: Average F2, broken down by each LLM, for XSS detection (left) and SQLi detection (right).

6 Discussion↩︎

With RQs 1 and 2, we showed the benefits of using RAG and Self-Ranking, highlighting the necessity of incorporating external knowledge about the attacks to generate robust detectors and of exploiting the multiple reasoning paths of the LLM to select the \(k\) top-ranked detectors. However, these results do not offer explicit insights into the best choices for the other parameters, such as the LLM or the temperature. While this question may have a trivial answer in the TDA scenario, where developers can experiment with and evaluate different configurations, the NTD scenario presents a more challenging situation. Our transferability study (RQ4) demonstrated that the best-performing combinations for one task exhibit strong performance on the other task. Therefore, the combinations from another task can serve as a reasonable starting point for selecting the inner parameters when approaching a new task.

When transferability is not possible (e.g., because developers cannot access the LLM identified via transferability), developers still have to choose a model and a temperature. We can support their choice by conditioning the results of our experiments on the model or on the temperature. Results conditioned on each model are shown in Figure 6, which presents the average F2 achieved by different LLMs across the two tasks. These results indicate the strong performance of GPT-4T and GPT-4, which is unsurprising given their inclusion in the \(U^{best}\) of the earlier experiments. On the other hand, Mixtral consistently underperforms relative to the other LLMs in both tasks. While these plots exhibit some correlation with the HumanEval scores reported in Table 1, it is important to note that models like Haiku, which achieved decent HumanEval scores, were discarded due to their inability to successfully complete the task (see Section 4.3). This observation aligns with our hypothesis that while HumanEval effectively assesses the reasoning capabilities of LLMs, it may not directly translate to the depth of knowledge required for more specialized tasks, such as secure code generation.

Regarding the temperature, our analysis suggests that it is highly LLM-dependent, making it challenging to draw general conclusions across LLMs. One consistent finding relates to synthetic dataset generation, where higher temperatures tend to outperform lower temperatures. This is because lower temperatures often result in a lack of diversity in generated examples, leading to a limited variety of samples.

7 Related Work↩︎

Recent research has scrutinized the security of code generated by LLMs, showing that it contains vulnerabilities. Mousavi et al. [11] conducted an examination of Java code produced by ChatGPT in the context of security API use cases. They uncovered 20 distinct varieties of API misuses by ChatGPT, demonstrating insecure practices. Bhatt et al. [26] presented a dataset exposing insecure code generation by LLMs, encompassing 9 programming languages and addressing 50 prevalent vulnerabilities. Khoury et al. [6] instructed ChatGPT to generate 21 programs susceptible to vulnerabilities including XSS and SQLi. While these works covered a broader range of vulnerabilities than our work, their focus remained primarily on identifying insecure code generation of LLMs. In contrast, our research aims to systematically enhance the security of the LLM-generated code. We achieve this by employing RAG and Self-Ranking. We also evaluated the LLM-generated functions systematically, using large datasets available from SOTA techniques. This contrasts with prior works that used only a few test cases to obtain evidence of vulnerabilities in LLM-generated security attack detectors.

Nair et al. [8] investigated ChatGPT-induced vulnerabilities in hardware code. They explored multiple prompts and provided guidance on generating secure code. While their goal aligns with ours, their evaluation was not conducted systematically using existing, solid benchmarks. Moreover, their approach to enhancing the security of generated code relies on manually crafted adjustments to the prompt, tailored to individual vulnerabilities. This manual intervention constrains the scalability and adaptability of their approach, particularly when addressing a broad spectrum of vulnerabilities.

SVEN [9] is a learning-based approach that aims to guide LLM’s code generation towards satisfying a given property, exploiting Prefix-Tuning [27] to fine-tune the LLM. While SVEN demonstrated promising results in preventing LLMs from introducing vulnerabilities, it requires a fine-tuning procedure based on a curated training set, hindering its efficiency and applicability to closed LLMs, hence limiting its scope.

8 Threats to Validity↩︎

Internal validity. LLMs are non-deterministic, hence we repeated our function and synthetic dataset generation experiments 40 and 10 times respectively (the disparity is due to the higher stability of the latter experiments). Indeed, we adopt Self-Ranking to exploit the non-deterministic nature of LLMs. While different LLM configurations and selection of RAG sources may yield varying results, our choices were based on documentation and existing best practices. Data leakage from LLM training could be a concern, but given the complexity of the task and the presence of only a few generated functions with perfect performance, we believe it was not an important factor in our experiment.

External validity. Our approach was instantiated with nine recent LLMs and evaluated on two prevalent vulnerabilities, XSS and SQLi. While our results may not generalize to all other vulnerabilities, we believe that enhancing prompts with external knowledge and applying Self-Ranking is a robust and general method.

Construct validity. We utilized F2-Score and Accuracy as our evaluation metrics, which are considered standard measures in the security domain.

Conclusion validity. Many elements in our approach are non-deterministic. For this reason, we draw our conclusions based on statistical non-parametric tests (Mann-Whitney U and Wilcoxon).

9 Conclusion↩︎

In this paper, we present a novel approach to improving the robustness of LLM-generated security attack detectors by integrating RAG and Self-Ranking into the prompting process. Our extensive study with nine LLMs targets two well-known and prevalent vulnerabilities, Cross-Site Scripting (XSS) and SQL injection (SQLi). Results show the effectiveness of our approach in improving the robustness of the generated detectors. The integration of external knowledge with RAG resulted in a notable enhancement in detection performance, while Self-Ranking further improved the results. Our findings provide valuable insights for developers, highlighting the importance of incorporating relevant knowledge and utilizing automated methods for the assessment of LLM-generated code. At the same time, these results open the way for researchers to develop optimal strategies to perform RAG using a snippet of code as the query and documentation as a source of knowledge, since this strategy can potentially benefit application domains where knowledge-intensive code generation is prevalent.

Compliance with Ethical Standards↩︎

The authors declare that they have no known relationships or competing interests that could have influenced this paper. The authors declare that their research for the current work did not involve human participants or animals.

Data Availability↩︎

The implementations, source code, data, and experimental results are publicly available in a GitHub repository11.

Credits↩︎

Samuele Pasini: Problem Analysis, Investigation, Data Curation, Empirical Study, Writing - Original Draft, Visualization. Jinhan Kim: Investigation, Empirical Study, Writing - Original Draft. Tommaso Aiello: Empirical Study, Writing - Review & Editing. Rocio Cabrera Lozoya: Writing - Review & Editing. Antonino Sabetta: Resources, Writing - Review & Editing. Paolo Tonella: Supervision, Writing - Review & Editing.

This work is funded by the European Union’s Horizon Europe research and innovation programme under the project Sec4AI4Sec, grant agreement No 101120393.

References↩︎

[1]
“GitHub Copilot - Your AI pair programmer.” https://github.com/features/copilot, 2023.
[2]
“Cursor AI - The AI code editor.” https://www.cursor.com, 2023.
[3]
GitHub. (2023) Survey reveals AI’s impact on the developer experience. [Online]. Available: https://github.blog/2023-06-13-survey-reveals-ais-impact-on-the-developer-experience/.
[4]
M. Tabachnyk and S. Nikolov, “ML-enhanced code completion improves developer productivity,” 2022. [Online]. Available: https://research.google/blog/ml-enhanced-code-completion-improves-developer-productivity/.
[5]
N. Perry, M. Srivastava, D. Kumar, and D. Boneh, “Do users write more insecure code with AI assistants?” in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 2785–2799. [Online]. Available: https://doi.org/10.1145/3576915.3623157.
[6]
R. Khoury, A. R. Avila, J. Brunelle, and B. M. Camara, “How secure is code generated by ChatGPT?” 2023.
[7]
N. Tihanyi, T. Bisztray, R. Jain, M. A. Ferrag, L. C. Cordeiro, and V. Mavroeidis, “The FormAI dataset: Generative AI in software security through the lens of formal verification,” in Proceedings of the 19th International Conference on Predictive Models and Data Analytics in Software Engineering, ser. PROMISE 2023. New York, NY, USA: Association for Computing Machinery, 2023, pp. 33–43. [Online]. Available: https://doi.org/10.1145/3617555.3617874.
[8]
M. Nair, R. Sadhukhan, and D. Mukhopadhyay, “Generating secure hardware using ChatGPT resistant to CWEs,” Cryptology ePrint Archive, Paper 2023/212, 2023, https://eprint.iacr.org/2023/212. [Online]. Available: https://eprint.iacr.org/2023/212.
[9]
J. He and M. Vechev, “Large language models for code: Security hardening and adversarial testing,” in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 1865–1879. [Online]. Available: https://doi.org/10.1145/3576915.3623175.
[10]
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
[11]
Z. Mousavi, C. Islam, K. Moore, A. Abuadbba, and M. A. Babar, “An investigation into misuse of Java security APIs by large language models,” in Proceedings of the 19th ACM Asia Conference on Computer and Communications Security, 2024, pp. 1299–1315.
[12]
P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive NLP tasks,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 9459–9474. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf.
[13]
S. Ouyang, J. M. Zhang, M. Harman, and M. Wang, “LLM is like a box of chocolates: The non-determinism of ChatGPT in code generation,” arXiv preprint arXiv:2308.02828, 2023.
[14]
X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=1PL1NIMMrw.
[15]
I. Hydara, A. B. M. Sultan, H. Zulzalil, and N. Admodisastro, “Current state of research on cross-site scripting (XSS) – a systematic literature review,” Information and Software Technology, vol. 58, pp. 170–186, 2015. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0950584914001700.
[16]
M. Lawal, A. B. M. Sultan, and A. O. Shakiru, “Systematic literature review on SQL injection attack,” International Journal of Soft Computing, vol. 11, no. 1, pp. 26–35, 2016.
[17]
Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen et al., “Siren’s song in the AI ocean: A survey on hallucination in large language models,” arXiv preprint arXiv:2309.01219, 2023.
[18]
Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, and H. Wang, “Retrieval-augmented generation for large language models: A survey,” arXiv preprint arXiv:2312.10997, 2023.
[19]
J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young et al., “Scaling language models: Methods, analysis & insights from training gopher,” arXiv preprint arXiv:2112.11446, 2021.
[20]
A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso et al., “Beyond the imitation game: Quantifying and extrapolating the capabilities of language models,” arXiv preprint arXiv:2206.04615, 2022.
[21]
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
[22]
K. E. Stanovich, “On the distinction between rationality and intelligence: Implications for understanding individual differences in reasoning,” The Oxford handbook of thinking and reasoning, pp. 343–365, 2012.
[23]
M. Ceccato, C. D. Nguyen, D. Appelt, and L. C. Briand, “SOFIA: An automated security oracle for black-box testing of SQL-injection vulnerabilities,” in 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), 2016, pp. 167–177.
[24]
L. Chen, C. Tang, J. He, H. Zhao, X. Lan, and T. Li, “XSS adversarial example attacks based on deep reinforcement learning,” Computers & Security, vol. 120, p. 102831, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167404822002255.
[25]
N. Nachar, “The Mann-Whitney U: A test for assessing whether two independent samples come from the same distribution,” Tutorials in Quantitative Methods for Psychology, vol. 4, Mar. 2008.
[26]
M. Bhatt, S. Chennabasappa, Y. Li, C. Nikolaidis, D. Song, S. Wan, F. Ahmad, C. Aschermann, Y. Chen, D. Kapil et al., “CyberSecEval 2: A wide-ranging cybersecurity evaluation suite for large language models,” arXiv preprint arXiv:2404.13161, 2024.
[27]
X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 4582–4597.

  1. A percentage point is the standard unit of measure for differences between percentages.↩︎

  2. Throughout the paper, we use the terms ‘attack detectors’, ‘security functions’, and ‘functions’ interchangeably when discussing our approach. This flexible terminology enables us to define and apply our methodology across a range of domains.↩︎

  3. https://cwe.mitre.org/top25/archive/2023/2023_top25_list.html↩︎

  4. For the sake of simplicity, the RAG Few-shot Prompt is not shown in Figure 1. It can be easily obtained by combining the construction of Few-shot and RAG Prompts.↩︎

  5. The selection of appropriate data sources and their storage is discussed in Section 4 as it is an experimental choice.↩︎

  6. https://cheatsheetseries.owasp.org/cheatsheets/XSS_Filter_Evasion_Cheat_Sheet.html↩︎

  7. https://websec.wordpress.com/2010/12/04/sqli-filter-evasion-cheat-sheet-mysql/↩︎

  8. https://github.com/fmereani/Cross-Site-Scripting-XSS/blob/master/XSSDataSets/Payloads.csv↩︎

  9. https://www.langchain.com/↩︎

  10. Specifically, the \(p\)-value is \(0\) for all values of \(k\) for XSS. For SQLi, it was \(0\) for \(k=1\) and \(k=2\), and \(\approx 10^{-247}\) for \(k=5\).↩︎

  11. https://github.com/PasiniSamuele/Robust-Attack-Detectors-LLM↩︎