Octopus: On-device language model for function calling of software APIs

Wei Chen\(^{\dagger}\)
Stanford University
Zhiyuan Li\(^\dagger\)
Stanford University
Mingyuan Ma\(^\dagger\)
Harvard University


In the rapidly evolving domain of artificial intelligence, Large Language Models (LLMs) play a crucial role due to their advanced text processing and generation abilities. This study introduces a new strategy for harnessing on-device LLMs to invoke software APIs. We meticulously compile a dataset derived from software API documentation and fine-tune LLMs with 2B, 3B, and 7B parameters, specifically to enhance their proficiency in software API interactions. Our approach concentrates on refining the models’ grasp of API structures and syntax, significantly enhancing the accuracy of API function calls. Additionally, we propose a conditional masking technique to ensure outputs in the desired formats and to reduce error rates while maintaining inference speed. We also propose a novel benchmark designed to evaluate the effectiveness of LLMs in API interactions, establishing a foundation for subsequent research. Octopus, the fine-tuned model, is shown to outperform GPT-4 in software API calling. This research aims to advance automated software development and API integration, representing substantial progress in aligning LLM capabilities with the demands of practical software engineering applications.

1 Introduction↩︎

The advent of Large Language Models (LLMs) has revolutionized the field of artificial intelligence, bringing forth a wide array of capabilities in natural language processing, alongside applications in specialized domains such as mathematics ([1], [2]), healthcare ([1], [3], [4]), and legal analysis ([5][7]). Despite these advancements, LLMs face challenges in assimilating real-time updates and executing specific tasks like image/video editing ([8]) or intricate tax filings. The integration of LLMs with external APIs emerges as a pivotal improvement. This synthesis not only augments the LLMs’ capabilities by facilitating access to up-to-date information and specialized functionalities but also sparks the creation of novel applications such as code interpreters ([9][11]). Research such as ToolAlpaca ([12]) and NexusRaven ([13]) also demonstrates the function-calling capability of open-source language models. Consequently, this integration signifies a crucial step toward overcoming the inherent limitations of LLMs, thereby extending their utility and potential for innovation in the field.

Enhancing the integration of Large Language Models (LLMs) with external APIs necessitates addressing the challenge of balancing large-scale model dependency against efficiency and cost. Focusing on specific tasks that utilize only a fraction of available APIs reveals the inefficiency of relying solely on large models like GPT-4 ([14][18]), which demand substantial computational resources. This scenario advocates for the development of smaller, task-oriented LLMs that preserve essential functionality while minimizing operational costs ([19], [20]). However, this shift towards smaller models introduces new challenges, including an increased risk of errors or "hallucinations" ([21][23]), which cause issues in precise output formatting ([24]); correct output formatting is critical for robust software applications.

In response to the limitations of oversized Large Language Models (LLMs), which entail unnecessary inference costs and exhibit a lack of focus in training data, we propose a new framework for LLM training and inference. Grounded in an expansive dataset of over 30,000 widely-utilized APIs from Rapid API Hub ([25]), this framework spans a diverse array of functionalities from Google searches to Amazon product lookups. By leveraging curriculum learning ([26]) strategies, we significantly refine the LLMs’ proficiency in selecting the appropriate API functions from a pool of similar options. This strategic dataset engineering, combined with our choice of base models, including CodeLlama 7B ([27], [28]), Google’s Gemma 7B & 2B ([29]), and Stable Code 3B ([30]), underscores the effectiveness of our approach, outperforming GPT-4’s benchmarks. Moreover, this ensures the practicality of our solution across various platforms, including mobile devices, since these models can already be deployed on mobile ([31]).

To ensure the consistency of our model’s output formatting, we introduce a conditional masking technique during inference. This approach guarantees that our LLMs generate outputs in the desired formats, markedly improving accuracy and reducing validation loss without sacrificing inference speed. We also prove mathematically that conditional masking can only improve accuracy.

This advancement, validated across our selected base models, not only showcases the potential of compact LLMs in external API integration but also sets a new efficiency benchmark for scalable AI applications. Through a detailed exposition of our model selection and training process, we present a holistic solution that effectively addresses the prevailing challenges in LLM API utilization. The dataset used for LLM training and the fine-tuned models will be open-sourced soon.

2 Related Work↩︎

Enhancing LLMs with Tools. The integration of external computational tools within Large Language Models (LLMs) like GPT-4, Alpaca, and Llama signifies a substantial advancement in augmenting their capabilities. Initially, integration efforts were centered around model-specific fine-tuning methods ([32], [33]), which, despite their effectiveness, encountered challenges in widespread and flexible application. A notable shift occurred with the adoption of prompts containing exemplary demonstrations, broadening the scope of tool accessibility. This range includes specialized code interpreters and extensive retrieval frameworks, significantly enhancing the models’ ability to interpret and execute complex instructions ([34]). Developments in simulated environments for tool interaction ([35][37]) and frameworks for API engagement ([38]) have been observed as well. Furthermore, the incorporation of advanced reasoning strategies ([39][41]) has significantly improved the models’ efficiency in interpreting and solving complex tasks.

Dataset format. The optimization of datasets ([42], [43]) for model fine-tuning is critical for enhancing LLM performance. This process involves a multi-stage refinement utilizing models such as GPT-4 and Alpaca. By iteratively enhancing the dataset, this methodology not only refines prompts but also improves response quality and develops advanced chain-of-thought ([44][48]) processes. Such advancements lead to a significant increase in the accuracy of function calling within LLMs, setting new benchmarks in dataset optimization and model training. This iterative refinement represents a strategic shift towards enhancing LLM output precision and quality.

Robustness in LLM Generation. Contrary to article generation, which may accommodate flexible output formats, software applications necessitate strict adherence to specific output structures, such as JSON formatting in [49], and many format consistency problems have been observed in LLM generation ([50], [51]). Some research has been done to enforce these rigid output formats to maintain consistency and reliability in LLM-generated content. For example, the LangChain framework [52] provides many output parsers to enforce formats such as YAML, JSON, and CSV. However, there are still many cases that cannot be resolved by an output parser, especially for function call responses.

3 Methodology↩︎

In this section, we detail our approach to dataset collection and preparation, introducing the workflow we designed to format the dataset for effective training. We then describe the development of our model, Octopus, highlighting the training techniques and inference strategies we employed. One of the key innovations in our model is the use of a conditional mask during inference, which represents a novel approach to improving model performance. This methodology combines comprehensive data preparation with advanced modeling techniques to address the challenges of training and inference for function call model development.

3.1 Dataset collection and refinement↩︎

Our initial dataset comprises API documentation sourced from RapidAPI Hub, one of the world’s largest API repositories. This selection was based on the website’s claim of engagement by millions of developers. To facilitate the large language model’s comprehension of API usage patterns, we compiled a comprehensive collection of API documentation, focusing on approximately 30,000 of the most frequently utilized APIs. This dataset acquisition was structured in two primary stages: the initial collection and processing of individual API documentation pieces, followed by a meticulous refinement process to optimize the dataset for training purposes.

Single API data preprocessing. From a detailed exploration of the documentation, we gather a comprehensive understanding of how RapidAPI Hub’s API usage examples are structured and utilized. The approach involves meticulously extracting API usage examples, which detail the API’s name, description, argument names, and their respective descriptions, and formatting this information in JSON. This data is then reorganized using OpenAI GPT-3.5 and CodeLlama 70B models to align with desired organizational standards. We then refine the function names based on their descriptions to ensure they are concise and informative. Subsequently, argument names and descriptions are captured. To counteract potential inaccuracies ("hallucinations") inherent in smaller LLMs, the Python coding format is employed. This decision is strategic, leveraging the models’ inherent code reasoning capabilities from their training on extensive code datasets, as in the CodeLlama 7B and Stable Code 3B models. This process not only streamlines the API information for enhanced usability but also leverages advanced AI models to ensure the information is presented in a structured, accessible manner. By prioritizing the function description as a guide for renaming and carefully detailing argument names and descriptions, the approach ensures that the essential elements of API usage are conveyed effectively, supporting developers in integrating these APIs into their projects seamlessly. One example of a converted function can be found below.

def get_flight_details(flight_id):
  Get detailed information on specific flights, including real-time tracking,  departure/arrival times, flight path, and status insights.
    flight_id (string): The flight_id represents the ID of a flight.

In our methodology, we deliberately excluded function bodies from the final dataset compilation. Through a meticulous selection process, we aggregated approximately 20,000 APIs, employing OpenAI GPT-4 for a comprehensive examination and removal of APIs with deficiencies, such as missing arguments or inconsistencies between function descriptions and their parameters. This stringent selection criterion was pivotal in assuring the dataset’s quality. Each API underwent this rigorous scrutiny, culminating in the compilation of dataset A, which forms the basis for the subsequent data processing.
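For illustration, the conversion into the Python-style stub can be sketched as a small renderer over a simplified record schema; the field names `name`, `description`, and `arguments` here are assumptions for this sketch, not our exact internal format:

```python
import json

def render_api_stub(record: dict) -> str:
    """Render a JSON API record into the Python-style stub used for training.

    The record schema (name/description/arguments) is a hypothetical
    simplification of the processed RapidAPI documentation.
    """
    args = record["arguments"]
    header = f"def {record['name']}({', '.join(a['name'] for a in args)}):"
    lines = [header, f"  {record['description']}"]
    for a in args:
        lines.append(f"    {a['name']} ({a['type']}): {a['description']}")
    return "\n".join(lines)

record = json.loads("""{
  "name": "get_flight_details",
  "description": "Get detailed information on specific flights.",
  "arguments": [
    {"name": "flight_id", "type": "string",
     "description": "The flight_id represents the ID of a flight."}
  ]
}""")
print(render_api_stub(record))
```

The function body is deliberately absent from the stub, mirroring the exclusion described above.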

Dataset refinement. To enhance decision-making in Large Language Models (LLMs) for real-world API usage, we present a sophisticated dataset construction approach, crucial to our study. We begin by integrating various functions, intentionally incorporating some irrelevant functions to create a complex environment for the LLM. Inspired by curriculum learning, we design our dataset to gradually include hard negative samples. This involves introducing similar functions to incrementally raise the challenge of selecting the most relevant function. Our approach is depicted in Figure (1), illustrating the detailed process of compiling the dataset. Below, we describe the techniques employed.

  1. Negative samples. To enhance the model’s reasoning capabilities and practical applicability, our methodology involves sampling both positive and negative examples. The ratio of these datasets is represented by the variable \(\frac{M}{N}\) in Figure (1), serving as an important parameter in our experimental setup. Specifically, in our framework, we select \(M\) and \(N\) to be equal, setting both values at 1.

  2. Similar function clustering. In our practical implementation, the model selects functions from a diverse pool in response to user queries. To intensify the training challenge, we deliberately complicate the selection process. Specifically, we construct training data by associating a given data point with three semantically similar ones. This process involves calculating vector embeddings from function descriptions, with Milvus facilitating the search. The sampling of three similar functions is determined by their similarity scores, focusing on ranks 5 to 10, to deliberately exclude overly similar functions and avoid redundancy in individual queries. This approach guarantees a challenging training setting, cultivating a model capable of differentiating between closely related functions in practical use cases.

  3. GPT-4 generated queries. The creation of a high-quality dataset depends crucially on the formulation of qualified queries. In this context, we opt to generate positive queries solvable by a single API. Moreover, for such positive instances, we also generate and incorporate a Chain of Thought (CoT), which is utilized during model training. Recent studies have demonstrated that the addition of CoT not only enhances model performance but also significantly improves its reasoning abilities ([13]). Notably, the creation of qualified queries and auxiliary information is crucial in developing effective training datasets.

  4. GPT-4 verification. During our dataset’s development, we observed that GPT-4 generated responses can contain errors, despite the model’s advanced capabilities. Thus, we designed a workflow that lets GPT-4 conduct self-verification, effectively identifying and rectifying inaccuracies in its outputs. After obtaining dataset B, we employed GPT-4 to meticulously verify and eliminate any data points that failed to meet our stringent quality criteria. This rigorous validation process led to the exclusion of approximately 1,000 data points, significantly contributing to our model’s enhanced performance.
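As a minimal sketch of the similar-function sampling in step 2, the following replaces Milvus with a brute-force NumPy cosine search over toy embeddings; the rank-5-to-10 window and the choice of three samples follow the description above, while the embeddings themselves are random placeholders:

```python
import numpy as np

def sample_hard_negatives(query_vec, all_vecs, lo=5, hi=10, k=3, seed=0):
    """Pick k similar-but-not-identical functions for one data point.

    Cosine similarities are ranked and only ranks lo..hi-1 are sampled,
    mirroring the rank-5-to-10 window described above. Milvus is replaced
    here by a brute-force NumPy search purely for illustration.
    """
    sims = all_vecs @ query_vec / (
        np.linalg.norm(all_vecs, axis=1) * np.linalg.norm(query_vec))
    order = np.argsort(-sims)          # most similar first
    window = order[lo:hi]              # skip self and near-duplicates at ranks 0..4
    rng = np.random.default_rng(seed)
    return rng.choice(window, size=k, replace=False)

# toy embeddings: 20 function descriptions in an 8-dim space
rng = np.random.default_rng(42)
vecs = rng.normal(size=(20, 8))
picks = sample_hard_negatives(vecs[0], vecs)
print(picks)
```

Because the query function itself is always rank 0, the window starting at rank 5 guarantees it is never sampled as its own negative.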

Figure 1: Refining dataset A into dataset B through a strict workflow. This process involves three critical steps: sampling positive queries solvable by specific APIs and generating corresponding responses and CoTs; identifying unsolvable queries and augmenting them with irrelevant function bodies; and employing semantic analysis to incorporate similar functions into data points. Following GPT-4’s rigorous verification, Dataset B emerges as the optimized training dataset, poised to significantly elevate model efficacy.

Adhering to the outlined methodology, we compiled a training dataset of approximately 150,000 data points. Each individual API is associated with 5 positive queries that it can resolve. To provide a comprehensive understanding, a sample of the complete dataset has been included in the Appendix (7.1), showcasing the detailed structure and composition of our training data.

3.2 Octopus↩︎

To validate the efficacy of our framework, we used it to fine-tune four renowned open-source models: CodeLlama 7B, Google Gemma 2B & 7B, and Stable Code LM 3B. A standardized training template was employed across all models, detailed in the Appendix (7.1). We utilized LoRA and 8-bit quantization techniques, allocating A100 80GB GPU hours as follows: 90h for CodeLlama 7B and Google Gemma 7B, 30h for Google Gemma 2B, and 60h for Stable Code LM 3B. The learning rate was set at 5\(\times\)10\(^{-5}\), with a linear scheduler optimizing outcomes. In the inference stage, user queries trigger function retrieval and execution, mapping the generated function and its arguments to corresponding APIs for final responses, thus ensuring accurate results upon correct function and argument name generation.

We experimented with different LoRA setups and found the best configuration to be LoRA rank 16, applied to the modules "q_proj", "v_proj", "o_proj", "up_proj", and "down_proj". We also show the training and validation loss for selected models in Figure (2). During training, we progressively trained on data points with more similar examples to implement the curriculum learning.
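The sketch below is not our training code (which used standard LoRA tooling with 8-bit quantization) but illustrates numerically what a rank-16 LoRA update to a single projection matrix such as q_proj amounts to; the matrix width and the alpha value here are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, r, alpha = 512, 16, 32            # toy width; rank 16 as in our setup, alpha illustrative

W = rng.normal(size=(d_model, d_model))    # frozen pretrained weight, e.g. q_proj
A = rng.normal(size=(r, d_model)) * 0.01   # trainable low-rank factor (down-projection)
B = rng.normal(size=(d_model, r)) * 0.01   # trainable low-rank factor (up-projection)

delta = (alpha / r) * (B @ A)              # LoRA update: rank is at most r = 16
W_adapted = W + delta                      # adapted weight used at inference

print(delta.shape, np.linalg.matrix_rank(delta))
```

Only \(2 \times d_{\text{model}} \times r\) parameters per adapted matrix are trained, which is what makes fine-tuning 2B-7B models on modest GPU budgets feasible.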

Figure 2: The training and validation loss for selected pretrained models

3.3 Inference using conditional mask↩︎

The utilization of smaller-parameter Large Language Models (LLMs) presents a pivotal challenge: a noticeable decrement in robustness when generating outputs. This challenge is also observed in our model, which necessitates enforcing responses with precise function names along with their corresponding arguments. The expected output format demands that arguments be encapsulated within parentheses, function names align with a pre-defined repository, and argument values conform to their designated types. Discrepancies, such as typographical errors in function names or misalignment in argument types, critically undermine the integrity of the output, rendering it susceptible to errors. For instance, both in GPT-4 and our model, deviations in the function name, whether through misspelling or elongated expressions, can lead to unintended corrections that fail to map back to the original function names, thereby distorting the intended output. The original LLM generation process to sample the next token is \[P(x_{t+1}|x_{1:t}) = P(x_{t+1}|x_{1:t}; \operatorname{LLM}),\quad x_{t+1}=\operatorname{argmax} P(x_{t+1}|x_{1:t}; \operatorname{LLM}),\] where \(x_{1:t}\) denotes all current tokens, with sequence length \(t\), and \(x_{t+1}\) is the next token to be sampled. What we do here is introduce a mask dependent on \(x_{1:t}\) so that \[x_{t+1}=\operatorname{argmax} \left[P(x_{t+1}|x_{1:t}; \operatorname{LLM})\odot \operatorname{mask}(x_{1:t}) \right].\]

In constructing the mask, we designate all tokens that do not align with the correct format to be masked, assigning a value of 0 to their respective positions and a value of 1 to all other positions. For example, if we already know the next token represents an integer, we unmask only the tokens used for integers. Therefore, the formulation of an accurate mask is paramount for achieving the desired outcome. In this context, we delineate several methodologies that were investigated for the derivation of the mask.

  • Enum data type. Function names are usually known in advance and do not change during inference, so we can treat them as enumerable data variables. To efficiently manage these names, a Trie tree can be constructed, facilitating retrieval of the mask with a time complexity of \(O(D)\), where \(D\) denotes the Trie tree’s depth, equivalent to the maximum length of a function name, which in our case is approximately 20. Since \(D\) is bounded, this is effectively constant time. As an alternative approach, storing all prefixes of potential function names within a dictionary could further reduce the complexity to \(O(1)\). The implementation of the Trie class is provided in the Appendix (7.2).

  • String, float, dict, and int types. Regular expressions can be employed to analyze subsequent tokens and generate the conditional mask.

Therefore, we can guarantee that the output is free from formatting errors. Our experimental findings indicate that the application of the conditional mask significantly enhances the robustness of the Large Language Model (LLM) in the context of function calls.
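As a minimal sketch of the masked decoding step, disallowed token positions can be set to \(-\infty\) before the argmax, which is equivalent to the element-wise mask described above; the toy vocabulary and logits here are purely illustrative:

```python
import numpy as np

# toy vocabulary: token ids for a few strings
vocab = ["get_", "flight", "send_", "email", "(", ")", "42", "abc"]
logits = np.array([1.2, 0.5, 2.0, 0.3, -1.0, -0.5, 0.1, 1.9])

def masked_argmax(logits, allowed_ids):
    """Select the next token only among format-valid candidates.

    Equivalent to argmax[P(x) ⊙ mask]: disallowed positions are pushed to
    -inf before the argmax, so the chosen token always satisfies the format.
    """
    mask = np.full_like(logits, -np.inf)
    mask[list(allowed_ids)] = 0.0
    return int(np.argmax(logits + mask))

# suppose the format requires an integer argument next: only "42" is valid
allowed = {vocab.index("42")}
print(vocab[masked_argmax(logits, allowed)])   # → 42
```

Without the mask, the argmax would pick the higher-scoring but format-invalid token "send_", illustrating the formatting failures the mask prevents.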

4 LLM Evaluation for Function Calling↩︎

We conducted a comprehensive series of tests on our dataset, evaluating the Octopus model against other leading models. This evaluation focused on Octopus’s capability to understand API calls, specifically those on RapidAPI. Additionally, we explored the impact of incorporating various retrieval techniques during the training phase on the model’s ultimate effectiveness.

In terms of baselines, our primary comparison was with cutting-edge language models in a zero-shot framework. The models we analyzed include GPT-4 by OpenAI, utilizing the gpt-4-0314 checkpoint, and GPT-3.5-turbo, employing the gpt-3.5-turbo-0301 checkpoint. Both models have been refined through RLHF (Reinforcement Learning from Human Feedback) for enhanced conversational abilities.

4.1 Evaluation dataset and benchmark↩︎

To benchmark function calls within commonly used software APIs, we specifically constructed a dataset. This dataset was generated by randomly selecting four different function APIs and sampling queries that could be addressed by these APIs. The sampling utilized the same prompt template employed during training, details of which are provided in Appendix 7.1. Additionally, we included queries that these APIs could not resolve, maintaining a balanced ratio of solvable to unsolvable queries at 1:1. Ground truth for the dataset was established through human annotation. We applied rigorous standards for benchmarking, focusing on real-world application requirements, including the precise matching of function names and arguments. For models not trained on this dataset, issues related to format correctness were overlooked to provide a fairer comparison. Consequently, in our analysis, GPT-3.5 was not marked incorrect for format discrepancies.
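The exact-match criterion can be sketched as follows; parsing generated calls as Python expressions is an assumption for illustration, since our stubs use Python syntax:

```python
import ast

def exact_match(prediction: str, truth: str) -> bool:
    """Strict benchmark criterion: function name and every argument must match.

    Both strings are parsed as Python call expressions; any parse failure,
    name mismatch, or argument mismatch counts as incorrect.
    """
    try:
        p = ast.parse(prediction, mode="eval").body
        t = ast.parse(truth, mode="eval").body
    except SyntaxError:
        return False
    if not (isinstance(p, ast.Call) and isinstance(t, ast.Call)):
        return False
    return ast.dump(p) == ast.dump(t)

print(exact_match("get_flight_details(flight_id='AA100')",
                  "get_flight_details(flight_id='AA100')"))   # True
print(exact_match("get_flight_detail(flight_id='AA100')",     # misspelled name
                  "get_flight_details(flight_id='AA100')"))   # False
```

Under this criterion, an autocorrected or "improved" function name scores zero, which is exactly the strictness the benchmark requires.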

Figure 3: Comparison of accuracy between the GPT-3.5 and GPT-4 models, alongside our pretrained models within the "Octopus" series. The prefix "Octopus" denotes the series, while the suffix indicates the specific pretrained model’s name.

4.2 Performance without the conditional mask↩︎

In the task of inferring function calls, we initially employed both GPT-3.5 and GPT-4 models to generate responses. For these pretrained models, the greedy search technique was utilized to select responses. This decision was based on the higher precision required for accurately identifying both function names and their corresponding parameters, where the model’s ability to choose the correct function name and parameters is crucial. Therefore, alternative methods such as sampling and beam search were not adopted for this task. The resulting accuracy metrics from this approach are presented in Figure (3).

The results highlight that GPT-4 consistently achieves the highest accuracy in producing correct outcomes. A notable issue leading to inaccuracies with GPT-4 involves "hallucinations," such as its ability to autocorrect misspelled function names, exemplified by transforming send_emil into send_email. It is critical that the function name provided in the initial prompt remains unaltered, regardless of spelling errors. This correction issue extends to parameters as well; for instance, GPT-4 might generate Australian as a parameter when the query explicitly requires a country name. The primary source of incorrect outputs is attributed to the generation of inaccurate parameters. However, all pretrained models demonstrate near-perfect performance in identifying the correct function name.

Figure 4: Comparison of accuracy between the GPT-3.5 and GPT-4 models, alongside our four pretrained models within the "Octopus" series, following the introduction of a conditional mask. "Octopus" serves as the series name, with the suffix indicating the specific name of each pretrained model. This comparison highlights the impact of conditional masking on model performance.

4.3 Performance with the conditional mask↩︎

In contrast to the inference approach described in the preceding subsection, we implemented a conditional mask during inference for this scenario. This modification has been effective in enhancing outcomes, particularly in the generation of parameters. Utilizing a conditional mask, especially when an input is of an enum type like a country name, helps prevent the model from generating unexpected parameters. The improvements facilitated by this approach are illustrated in Figure (4). However, since the APIs for GPT-3.5 and GPT-4 do not provide logits, the conditional mask technique could not be applied to these models, and thus, no improvement in their metrics was observed. Nevertheless, it’s noteworthy that two 7B models were able to outperform GPT-4, highlighting the efficacy of the conditional masking technique in certain contexts.

5 Conclusion↩︎

In this study, we present a novel framework designed to train large language models on practical software APIs, with a subsequent evaluation of their performance in making API calls, specifically in comparison to the GPT-4 model. Our approach includes a methodology for refining the dataset and the associated prompt template, which incorporates negative sampling and curriculum learning strategies. Additionally, we introduce an innovative technique known as the conditional mask, aimed at addressing the challenge of mismatched output formats.


We acknowledge the significant contributions of the teams at Google, CodeLlama, and Stability AI in advancing the open model ecosystem through their provision of powerful pretrained models. These contributions have been instrumental in the development of Octopus.

6 Mathematical Derivation↩︎

6.1 Impact of conditional masking on inference performance↩︎

In this appendix, we examine the effect of applying a conditional mask during inference on a causal language model’s accuracy and validation loss. Consider the validation loss without masking defined as:

\[L_{\text{val}}^{\text{non-mask}} = \sum_{i \in V} -y_i \log(\hat{y}_i),\] where \(V\) denotes the vocabulary set, and \(y_i\) is a binary indicator (0 or 1) if class label \(i\) is the correct classification for the current observation.

Introducing a conditional mask partitions the vocabulary \(V\) into two subsets: \(V_1\), containing the indices that are not masked, and \(V_2\), containing the indices that are masked. The true label always lies in \(V_1\) during inference, and for all \(i\),

\[-y_i\log(\hat{y}_i) \geq 0,\]

so the validation loss with masking satisfies:

\[L_{\text{val}}^{\text{mask}} = \sum_{i \in V_1} -y_i \log(\hat{y}_i) \leq L_{\text{val}}^{\text{non-mask}},\]

indicating that the validation loss does not increase when a conditional mask is applied during inference. Moreover, when the masked probabilities are renormalized over \(V_1\), the probability \(\hat{y}_i\) assigned to the true label can only increase, so the masked validation loss is strictly reduced whenever \(V_2\) carries nonzero probability mass.

Accuracy, particularly precision in this context, for the non-masked scenario is determined by the alignment between the ground truth label’s index and the index of the maximum value in the predicted distribution:

\[\text{Precision}^{\text{non-mask}} = \mathbb{1}[\text{argmax}_i (y_i) = \text{argmax}_i (\hat{y}_i)],\] where \(\mathbb{1}[\cdot]\) is the indicator function, returning 1 if the condition is true, and 0 otherwise.

With conditional masking, the prediction \(\hat{y}_i\) is constrained to \(V_1\), effectively reducing the search space for \(\text{argmax}_i (\hat{y}_i)\) and increasing the likelihood of matching \(\text{argmax}_i (y_i)\), given that the index of the true label lies in \(V_1\). Hence,

\[\text{Precision}^{\text{mask}} \geq \text{Precision}^{\text{non-mask}},\]

demonstrating that conditional masking during inference not only reduces validation loss but also enhances the model’s precision by focusing on a more relevant subset of the vocabulary.
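This derivation can be checked numerically; the toy next-token distribution below and the choice of \(V_1\) are illustrative assumptions:

```python
import math

# toy next-token distribution over a 6-token vocabulary
logits = [2.0, 1.0, 0.5, 0.2, -1.0, 3.0]
true_id = 0                      # correct token, assumed to lie in V1
V1 = {0, 1, 2}                   # format-valid tokens; {3, 4, 5} are masked

def softmax(ls):
    m = max(ls)
    e = [math.exp(x - m) for x in ls]
    s = sum(e)
    return [x / s for x in e]

p = softmax(logits)
loss_no_mask = -math.log(p[true_id])

# mask then renormalize over V1: the probability mass of V2 is redistributed
masked = [logits[i] if i in V1 else float("-inf") for i in range(len(logits))]
p_mask = softmax(masked)
loss_mask = -math.log(p_mask[true_id])

print(loss_no_mask, loss_mask)
assert loss_mask < loss_no_mask   # masking can only reduce the validation loss
```

Here the unmasked distribution even prefers the invalid token at index 5, yet masking both forces a valid prediction and lowers the cross-entropy on the true label.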

7 Dataset and code illustration↩︎

7.1 Dataset template↩︎

You are an assistant, and you need to find and call appropriate functions according to the query of the users. Firstly, find the relevant functions, then get the function arguments by understanding the user's query. The following functions are available for you to fetch further data to answer user questions:


def no_relevant_function(user_query):
  Call this when no other provided function can be called to answer the user query.
    user_query (str): The user_query that cannot be answered by any other function calls.

def youtube_downloader(videourl):
  Get direct video URL for youtube to download and save for offline viewing or sharing.
    videourl (string): The URL of the video being accessed as a string.

def facebook_dl_link(url):
  Get downloadable link for facebook, allowing convenient offline viewing and sharing.
    url (string): The URL string for the function argument.

def pinterest_video_dl_api(url):
  Get download feature for videos from Pinterest enabling users to save videos for offline viewing.
    url (string): The URL string represents the web address of the resource being accessed.

def insta_download_url(url):
  Get download access to Instagram content by inputting the URL, enabling users to save and view content offline.
    url (string): The URL string.

Obtain download access for viewing a recent Instagram post offline using the URL https://www.instagram.com/p/CODEinstantiate123/


Thought: To acquire download access for Instagram content for offline viewing, 'insta_download_url' is called with the post's URL as the argument, ensuring the content specified by the URL is fetched for download.

7.2 Trie class to process the enum variable↩︎

from typing import Dict, List

class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, TrieNode] = {}
        self.isEndOfWord: bool = False

class Trie:
    def __init__(self) -> None:
        self.root: TrieNode = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.isEndOfWord = True

    def is_prefix(self, prefix: str) -> bool:
        node = self.root
        for char in prefix:
            if char not in node.children:
                return False
            node = node.children[char]
        return True

    def get_all_prefixes(self) -> List[str]:
        prefixes: List[str] = []
        self._dfs(self.root, "", prefixes)
        return prefixes

    def _dfs(self, node: TrieNode, prefix: str, prefixes: List[str]) -> None:
        # Record every non-empty prefix reachable in the tree.
        if node != self.root:
            prefixes.append(prefix)
        for char, next_node in node.children.items():
            self._dfs(next_node, prefix + char, prefixes)

    def search(self, prefix: str, include_prefix: bool = True) -> List[str]:
        node = self.root
        for char in prefix:
            if char not in node.children:
                return []
            node = node.children[char]
        initial_string: str = prefix if include_prefix else ""
        return self._find_words_from_node(node, initial_string)

    def _find_words_from_node(self, node: TrieNode, current_string: str) -> List[str]:
        words: List[str] = []
        # A complete word ends here; include it before descending further.
        if node.isEndOfWord:
            words.append(current_string)
        for char, next_node in node.children.items():
            words.extend(self._find_words_from_node(next_node, current_string + char))
        return words


Shima Imani, Liang Du, and Harsh Shrivastava. Mathprompter: Mathematical reasoning using large language models. arXiv preprint arXiv:2303.05398, 2023.
Joy He-Yueya, Gabriel Poesia, Rose E Wang, and Noah D Goodman. Solving math word problems by combining language models with symbolic solvers. arXiv preprint arXiv:2304.09102, 2023.
Eunkyung Jo, Daniel A Epstein, Hyunhoon Jung, and Young-Ho Kim. Understanding the benefits and challenges of deploying conversational ai leveraging large language models for public health intervention. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1–16, 2023.
Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature medicine, 29 (8): 1930–1940, 2023.
Jiaxi Cui, Zongjian Li, Yang Yan, Bohua Chen, and Li Yuan. Chatlaw: Open-source legal large language model with integrated external knowledge bases. arXiv preprint arXiv:2306.16092, 2023.
Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Songyang Zhang, Kai Chen, Zongwen Shen, and Jidong Ge. Lawbench: Benchmarking legal knowledge of large language models. arXiv preprint arXiv:2309.16289, 2023.
Andrea I Luppi, Pedro AM Mediano, Fernando E Rosas, Negin Holland, Tim D Fryer, John T O’Brien, James B Rowe, David K Menon, Daniel Bor, and Emmanuel A Stamatakis. A synergistic core for human brain evolution and cognition. Nature Neuroscience, 25 (6): 771–782, 2022.
Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, and Zhe Gan. Guiding instruction-based image editing via multimodal large language models. arXiv preprint arXiv:2309.17102, 2023.
Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Arun Iyer, Suresh Parthasarathy, Sriram Rajamani, B Ashok, Shashank Shet, et al. Codeplan: Repository-level coding using llms and planning. arXiv preprint arXiv:2309.12499, 2023.
Priyan Vaithilingam, Tianyi Zhang, and Elena L Glassman. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In Chi conference on human factors in computing systems extended abstracts, pp. 1–7, 2022.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, and Le Sun. Toolalpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv preprint arXiv:2306.05301, 2023.
Venkat Krishna Srinivasan, Zhen Dong, Banghua Zhu, Brian Yu, Damon Mosk-Aoyama, Kurt Keutzer, Jiantao Jiao, and Jian Zhang. Nexusraven: a commercially-permissive language model for function calling. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 (8): 9, 2019.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901, 2020.
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Yiran Wu, Feiran Jia, Shaokun Zhang, Hangyu Li, Erkang Zhu, Yue Wang, Yin Tat Lee, Richard Peng, Qingyun Wu, and Chi Wang. An empirical study on challenging math problem solving with gpt-4, 2023.
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36, 2024.
Vishal Pallagani, Kaushik Roy, Bharath Muppasani, Francesco Fabiano, Andrea Loreggia, Keerthiram Murugesan, Biplav Srivastava, Francesca Rossi, Lior Horesh, and Amit Sheth. On the prospects of incorporating large language models (llms) in automated planning and scheduling (aps). arXiv preprint arXiv:2401.02500, 2024.
Jia-Yu Yao, Kun-Peng Ning, Zhen-Hui Liu, Mu-Nan Ning, and Li Yuan. Llm lies: Hallucinations are not bugs, but features as adversarial examples. arXiv preprint arXiv:2310.01469, 2023.
Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren’s song in the ai ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023.
Ziwei Ji, YU Tiezheng, Yan Xu, Nayeon Lee, Etsuko Ishii, and Pascale Fung. Towards mitigating llm hallucination via self reflection. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
Zhihan Jiang, Jinyang Liu, Zhuangbin Chen, Yichen Li, Junjie Huang, Yintong Huo, Pinjia He, Jiazhen Gu, and Michael R Lyu. Llmparser: A llm-based log parsing framework. arXiv preprint arXiv:2310.01796, 2023.
Rapidapi hub, 2024. URL https://rapidapi.com/hub. Accessed on February 29, 2024.
Yinpeng Liu, Jiawei Liu, Xiang Shi, Qikai Cheng, and Wei Lu. Let’s learn step by step: Enhancing in-context learning ability with curriculum learning. arXiv preprint arXiv:2402.10738, 2024.
Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Gemma Team, Google DeepMind. Gemma: Open models based on gemini research and technology, 2024. URL https://goo.gle/GemmaReport.
Nikhil Pinnaparaju, Reshinth Adithyan, Duy Phung, Jonathan Tow, James Baicoianu, and Nathan Cooper. Stable code 3b, 2023. URL https://huggingface.co/stabilityai/stable-code-3b.
MLC team. MLC-LLM, 2023. URL https://github.com/mlc-ai/mlc-llm.
Xinyu Lin, Wenjie Wang, Yongqi Li, Shuo Yang, Fuli Feng, Yinwei Wei, and Tat-Seng Chua. Data-efficient fine-tuning for llm-based recommendation. arXiv preprint arXiv:2401.17197, 2024.
Zhiqiang Hu, Yihuai Lan, Lei Wang, Wanyu Xu, Ee-Peng Lim, Roy Ka-Wei Lee, Lidong Bing, and Soujanya Poria. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. arXiv preprint arXiv:2304.01933, 2023.
Xuanhe Zhou, Guoliang Li, and Zhiyuan Liu. Llm as dba. arXiv preprint arXiv:2308.05481, 2023.
Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, and Fei Huang. Small llms are weak tool learners: A multi-llm agent. arXiv preprint arXiv:2401.07324, 2024.
Yu Du, Fangyun Wei, and Hongyang Zhang. Anytool: Self-reflective, hierarchical agents for large-scale api calls. arXiv preprint arXiv:2402.04253, 2024.
Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864, 2023.
Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. Api-bank: A comprehensive benchmark for tool-augmented llms. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Large language models still can’t plan (a benchmark for llms on planning and reasoning about change). arXiv preprint arXiv:2206.10498, 2022.
Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992, 2023.
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022.
Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. Toolqa: A dataset for llm question answering with external tools. Advances in Neural Information Processing Systems, 36, 2024.
Yilun Kong, Jingqing Ruan, Yihong Chen, Bin Zhang, Tianpeng Bao, Shiwei Shi, Guoqing Du, Xiaoru Hu, Hangyu Mao, Ziyue Li, et al. Tptu-v2: Boosting task planning and tool usage of large language model-based agents in real-world systems. arXiv preprint arXiv:2311.11315, 2023.
Hongru Wang, Rui Wang, Fei Mi, Zezhong Wang, Ruifeng Xu, and Kam-Fai Wong. Chain-of-thought prompting for responding to in-depth dialogue questions with llm. arXiv preprint arXiv:2305.11792, 2023.
Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022.
Kumar Shridhar, Koustuv Sinha, Andrew Cohen, Tianlu Wang, Ping Yu, Ram Pasunuru, Mrinmaya Sachan, Jason Weston, and Asli Celikyilmaz. The art of llm refinement: Ask, refine, and trust. arXiv preprint arXiv:2311.07961, 2023.
Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. Progressive-hint prompting improves reasoning in large language models, 2023.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35: 24824–24837, 2022.
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Efficiently programming large language models using sglang. arXiv preprint arXiv:2312.07104, 2023.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Joshua Ackerman and George Cybenko. Large language models for fuzzing parsers (registered report). In Proceedings of the 2nd International Fuzzing Workshop, pp. 31–38, 2023.
Harrison Chase. LangChain, 2022. URL https://www.langchain.com/. Accessed on February 29, 2024.

  1. Corresponding author; \(^\dagger\) equal contribution.