Towards Event Extraction with Massive Types: LLM-based Collaborative Annotation and Partitioning Extraction

Wenxuan Liu1,2, Zixuan Li1,2\(^{*}\), Long Bai1,2, Yuxin Zuo1,2,
Daozhu Xu3, Xiaolong Jin1,2\(^{*}\), Jiafeng Guo1,2, Xueqi Cheng1,2
1School of Computer Science and Technology, University of Chinese Academy of Sciences
2Key Laboratory of Network Data Science and Technology,
Institute of Computing Technology, Chinese Academy of Sciences
3State Key Laboratory of Geo-Information Engineering, Xi’an, Shaanxi, China
{liuwenxuan2024z, lizixuan, jinxiaolong}@ict.ac.cn


Abstract

Developing a general-purpose extraction system that can extract events with massive types is a long-standing target in Event Extraction (EE). In doing so, the challenge comes from two aspects: 1) the absence of an efficient and effective annotation method; 2) the absence of a powerful extraction method that can handle massive types. For the first challenge, we propose a collaborative annotation method based on Large Language Models (LLMs). Through collaboration among multiple LLMs, it first refines annotations of trigger words from distant supervision and then carries out argument annotation. Next, a voting phase consolidates the annotation preferences across different LLMs. Finally, we create the EEMT dataset, the largest EE dataset to date, featuring over 200,000 samples, 3,465 event types, and 6,297 role types. For the second challenge, we propose an LLM-based Partitioning EE method called LLM-PEE. To overcome the limited context length of LLMs, LLM-PEE first recalls candidate event types and then splits them into multiple partitions for LLMs to extract events. The results in the supervised setting show that LLM-PEE outperforms the state-of-the-art methods by 5.4% in event detection and 6.1% in argument extraction. In the zero-shot setting, LLM-PEE achieves up to 12.9% improvement compared to mainstream LLMs, demonstrating its strong generalization capabilities.

1 Introduction↩︎

Figure 1: Statistics on the existing EE datasets.

Event Extraction (EE) aims to identify structured event information from text and contains two subtasks [1], i.e., Event Detection (ED) and Event Argument Extraction (EAE). The former identifies the trigger words of events (event triggers) and their corresponding types, while the latter extracts arguments and their associated roles based on a trigger and its event type. EE has demonstrated its value across a variety of domains, including finance [2], biomedical research [3], and cybersecurity [4]. Each of these domains has its specific event types, which jointly form a large event schema containing massive types. This leads researchers [5] in this area to pursue a general-purpose system capable of extracting events with massive types in different domains.

In doing so, the basic challenge is the lack of an effective and efficient annotation method to construct datasets. As illustrated in Figure 1, existing datasets can be divided into two types based on their annotation methods: human-annotated and distant supervision-based ones. Human annotation is generally effective but inefficient, requiring annotators to understand long guidelines and undergo specialized training. Consequently, the obtained datasets are often limited in terms of both event types and scale. Considering that existing semantic frame knowledge bases, e.g., FrameNet [6] and Propbank [7], contain massive types of predicates, distant supervision-based methods [1], [8], [9] automatically annotate triggers and arguments if they have been annotated in the knowledge bases, offering a more efficient alternative. For example, GLEN [1], the largest ED dataset to date, uses Propbank and Wikidata [10] to annotate event triggers across more than 3,000 types. However, these datasets often suffer from noise in three aspects: 1) Unreasonable Trigger Annotation: Some predicates in the semantic frames are not considered events from the perspective of EE, leading to the incorrect annotation of irrelevant words as triggers. For example, the adverb “voluntarily” is annotated as a trigger in GLEN. 2) Coarse-Grained Type Annotation: Due to the hierarchical structure of event types, a predicate often has types with multiple granularities. Distant supervision-based methods struggle to assign precise fine-grained types to these predicates. For example, GLEN annotates several potential types (e.g., crime and property crime) for the trigger “crime”. 3) Missing Argument Annotation: Unlike triggers, whose candidates are relatively limited, the number of potential arguments is much larger and cannot be fully enumerated by existing knowledge bases. Thus, distant supervision-based methods often miss argument annotations when the arguments are not included in the knowledge bases.

Large Language Models (LLMs) have recently achieved significant performance improvements across many Natural Language Processing (NLP) tasks and have emerged as a promising approach for annotating EE datasets [11]. One key challenge, however, is to mitigate the annotation bias inherent in any specific LLM. Motivated by this, we propose an LLM-based collaborative annotation method. Based on the results from distant supervision-based methods, it first performs event trigger filtering by removing irrelevant triggers. This is followed by event type refinement, which assigns more fine-grained event types to triggers based on context. Finally, it identifies the roles of arguments associated with each event and refines the original annotation with human annotation rules. After each step, multiple LLMs collaborate to generate annotations, and a voting phase is used to unify the annotation preferences across LLMs. As a result, we obtain a new dataset, called EEMT, with over 200,000 annotated samples, covering 3,465 event types and 6,297 argument role types, which, to the best of our knowledge, is the largest EE dataset in terms of event types and scale.

To adapt LLMs for EE with massive types, we propose a Partitioning EE method for LLMs called LLM-PEE, which addresses the prompt length limitation when handling massive types. LLM-PEE begins by recalling the top-k most similar event types, then divides these types into several partitions, which are assembled into prompts for the LLMs. Based on these partitioning prompts, the LLMs extract event triggers and their corresponding arguments. Experimental results on the EEMT dataset demonstrate that LLM-PEE outperforms the state-of-the-art models by 5.4% in event detection and 6.1% in argument extraction. Moreover, LLM-PEE achieves up to 12.9% F1 improvement compared to mainstream LLMs in the zero-shot setting, demonstrating its strong generalization capabilities.

Our contributions can be summarized as follows:

  • We propose an LLM-based collaborative annotation method for EE with massive types, where multiple LLMs automatically annotate events collaboratively.

  • Based on the above method, we construct the EEMT dataset, which is the largest EE dataset to date in terms of both its coverage and scale.

  • We propose an LLM-based Partitioning Extraction method, which significantly improves EE under supervised and zero-shot settings.

Figure 2: Overview of the proposed LLM-based collaborative annotation method for EE with massive types.

2 Related Work↩︎

2.0.0.1 Event Extraction Dataset.

Among human-annotated datasets, ACE 2005 [12] is the most commonly used one, including 33 event types and 22 roles. Following it, Rich ERE [13] was proposed to further enlarge the scale of EE data. MAVEN-Arg [14], constructed based on MAVEN [15], is currently the largest EAE dataset annotated by human experts, with 162 event types and 612 roles. However, due to the high cost of annotating events, human annotation cannot further extend the schema and data scale. Among distant supervision-based datasets, LSEE [8] is constructed from FrameNet [6] and Wikipedia [16] to automatically annotate events. GLEN [1] is the largest ED dataset, including 3,465 types and 200,000 samples. However, datasets from distant supervision suffer from low annotation quality.

2.0.0.2 Event Extraction Method.

EE methods can be divided into two kinds: classification-based and generation-based ones. Classification-based methods [1], [17] tend to formulate EE as a token classification or sequence labeling task. Generation-based methods [18]–[20] aim to generate text containing a structured event, but these methods require manually designed schema-specific templates, which are difficult to adapt to massive types. Recently, owing to their strong generation abilities, LLMs have been widely used in EE, e.g., InstructUIE [21], KnowCoder [22], and AlignXIE [23]. Besides, some works focus on few-shot or zero-shot settings, such as Code4Struct [24] and CodeIE [25], which employ code formats to extract information.

3 LLM-based Collaborative Annotation↩︎

3.1 Annotation Method↩︎

Given that LLMs demonstrate strong capabilities in understanding natural language, we incorporate them into the EE annotation process. To improve efficiency, we use LLMs to perform annotation based on trigger annotations from distant supervision, rather than starting from scratch. By collaborating across multiple LLMs using offset alignment and voting, we achieve consistent and precise annotation results, making the process more effective.

Specifically, as shown in Figure 2, the proposed annotation method consists of four steps: (1) Event trigger pre-annotation annotates the triggers and their potential types based on distant supervision; (2) Event trigger filtering removes unreasonable event triggers via LLMs; (3) Event type refinement maps each trigger to a more fine-grained event type via LLMs; (4) Event argument annotation further annotates the arguments via LLMs. The detailed prompts and voting strategies are listed in Appendix 9.2 and Appendix 9.3. For the first step, this paper takes the distant supervision-based method built upon Propbank [7] and Wikidata [10], as applied in GLEN [1], as an example. The proposed annotation method can be easily extended to other distant supervision-based methods and knowledge bases.

3.1.0.1 Event Trigger Pre-annotation.

Due to the massive types in the schema, directly using LLMs to annotate events is inefficient, as it requires including extensive guidelines in the prompt of LLMs. To improve the efficiency of the annotation process, we perform event trigger pre-annotation using distant supervision. For example, in the distant supervision-based method used in GLEN, a sentence is first processed by annotating words as triggers if they are included in the Propbank predicate annotations. Then, the event type (referred to as roleset in Propbank) of each trigger is mapped to Wikidata QNode types via DWD Overlay [5]. After this step, we obtain the initial annotations of event triggers and their corresponding candidate event types.
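For illustration, this pre-annotation step can be viewed as a dictionary lookup over the sentence tokens. The sketch below assumes a pre-loaded PropBank predicate lexicon and a roleset-to-QNode mapping (both variable names are hypothetical) and skips proper lemmatization:

```python
# A minimal sketch of distant-supervision pre-annotation, assuming a pre-loaded
# PropBank predicate lexicon and a roleset-to-Wikidata-QNode mapping (DWD Overlay).
from typing import Dict, List


def pre_annotate_triggers(
    tokens: List[str],
    predicate_lexicon: Dict[str, List[str]],  # lemma -> PropBank rolesets, e.g. {"attack": ["attack.01"]}
    dwd_overlay: Dict[str, List[str]],        # roleset -> candidate Wikidata QNode event types
) -> List[dict]:
    """Annotate every token that matches a PropBank predicate as a candidate trigger."""
    candidates = []
    for i, token in enumerate(tokens):
        lemma = token.lower()  # a real pipeline would use proper lemmatization
        for roleset in predicate_lexicon.get(lemma, []):
            candidates.append({
                "trigger": token,
                "offset": i,
                "roleset": roleset,
                "candidate_types": dwd_overlay.get(roleset, []),
            })
    return candidates
```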

3.1.0.2 Event Trigger Filtering.

The event annotation guidelines in semantic frame knowledge bases, such as Propbank, differ from the event definitions in mainstream EE datasets [12]. For example, some adverbs or adjectives are treated as event triggers in Propbank. This gap leads to the issue of unreasonable trigger annotation, as discussed earlier. Thus, this step leverages LLMs to filter out invalid triggers. By observing these differences in event definition and describing them as guidelines, we instruct LLMs to evaluate the validity of event annotations. To help the LLMs better comprehend the task, carefully selected examples are also included in the prompt.

Due to potential differences and biases in the filtering process provided by different LLMs, a voting strategy is applied to obtain the final results. Specifically, we aggregate the valid events identified by each LLM. An event is considered valid if the majority of LLMs support it. In cases where a tie occurs during the voting process, we will instruct each LLM to re-annotate the case repeatedly until the majority of LLMs support the result.
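A minimal sketch of this voting phase is given below. It assumes each annotation LLM is wrapped as a callable `judge(sentence, trigger) -> bool`; the wrapper name and the round cap are illustrative safeguards rather than part of the released pipeline:

```python
# Majority voting over LLM judges for trigger filtering, with re-annotation on ties.
from typing import Callable, List


def vote_trigger_valid(
    sentence: str,
    trigger: str,
    judges: List[Callable[[str, str], bool]],
    max_rounds: int = 5,
) -> bool:
    """Keep a trigger only if a strict majority of LLM judges accepts it."""
    for _ in range(max_rounds):
        votes = [judge(sentence, trigger) for judge in judges]
        yes, no = votes.count(True), votes.count(False)
        if yes != no:        # a strict majority exists
            return yes > no
        # tie: ask all judges to re-annotate in the next round
    return False             # conservatively drop the trigger if no majority emerges
```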

3.1.0.3 Event Type Refinement.

In the original GLEN annotation, 60% of triggers have more than one event type. Distantly supervised methods cannot assign precise, fine-grained types to these triggers, as this requires more accurate event definitions and a deeper semantic understanding of the sentence. Therefore, this step uses LLMs to refine the event types to a more fine-grained level automatically.

We formalize the event type refinement task as a multiple-choice problem for the LLMs. Specifically, this step takes the candidate event types from Step 1 and their corresponding descriptions as input. It then instructs the LLMs to select the event type that most accurately aligns with the event trigger and the sentence. Additionally, we include a “None of them” option for cases where none of the candidate event types align with the event trigger.

Each LLM performs this task independently, and the final fine-grained event type is determined by selecting the candidate with the highest number of votes from different LLMs. In the event of a tie during the voting process, each LLM is instructed to re-annotate the case. However, since the most probable fine-grained types typically converge within one or two candidates, such ties are generally resolved within two rounds of iterative voting.
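The refinement vote can be sketched in the same spirit, now over a multiple-choice answer set; the `choose(...)` wrapper around each LLM and the round cap are again illustrative assumptions:

```python
# Multiple-choice voting for event type refinement, with bounded tie re-voting.
from collections import Counter
from typing import Callable, List, Optional


def vote_fine_grained_type(
    sentence: str,
    trigger: str,
    options: List[str],
    choosers: List[Callable[[str, str, List[str]], str]],
    max_rounds: int = 5,
) -> Optional[str]:
    options = options + ["None of them"]
    for _ in range(max_rounds):
        counts = Counter(choose(sentence, trigger, options) for choose in choosers)
        (best, best_n), *rest = counts.most_common()
        if not rest or best_n > rest[0][1]:  # unique winner across LLMs
            return None if best == "None of them" else best
        # tie between the top candidates: re-annotate in the next round
    return None
```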

Table 1: Statistics of the EEMT dataset compared to those of other datasets.
Data Source Event Types Argument Types Cases Event Mentions Argument Mentions Domain
ACE 2005 33 22 593 4,090 9,683 General
CASIE 5 26 1,594 3,027 6,135 Cybersecurity
Commodity EE 18 19 3,949 3,949 8,123 Commodity
GLEN 3,465 - 208,454 185,047 - General
MAVEN-Arg 162 612 4,480 98,591 290,613 General
EEMT 3,465 6,297 208,454 170,908 481,855 General

3.1.0.4 Event Argument Annotation.

After the fine-grained event type is selected, we need to annotate the event arguments based on the event type and its corresponding schema. As mentioned above, distant supervision-based methods often fail to conduct argument annotation when the relevant arguments are absent from the knowledge bases. Thus, we adopt LLMs to annotate arguments and their roles in this step. However, annotating event arguments introduces two additional challenges for LLMs: 1) Role Understanding: Event argument annotation requires LLMs to comprehensively understand the complex event schema, including event types and role definitions. Additionally, LLMs must be capable of analyzing syntactic structures within sentence spans and mapping each span to a specific role. 2) Bias in Span Offset: LLMs exhibit inherent bias in span offsets when dealing with different styles of text, and different LLMs are likely to produce different offsets in argument annotation.

Considering the additional challenges, we develop a set of guidelines for analyzing logical relationships in sentences and assigning roles accordingly. Then, LLMs are adopted to identify logical relationships within sentences, and map text spans to their corresponding roles. Additionally, we include examples of manual annotations in the prompts to help the LLM better understand the roles of events. To eliminate bias in span offset across different LLMs, we instruct the LLMs to refine and update the original annotation results, especially in span offset, using the rules derived from human observations. We refer to this process as Offset Alignment, which ensures greater consistency in the annotations generated by different LLMs. Following this alignment, we employ a voting strategy to determine the final argument annotations. For each argument in the annotated event, if a specific argument-role pair appears in more than half of LLMs’ annotation results, it is deemed valid. If the LLMs generate completely different annotations for certain roles, we employ GPT-4o to annotate the case given the original annotation from each LLM.
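A minimal sketch of the argument-level vote is shown below. It assumes each LLM's aligned output is a set of (argument span, role) pairs and uses a hypothetical `arbiter` callable to stand in for the GPT-4o fallback described above:

```python
# Argument-role voting across LLM annotations, with an arbiter for disputed roles.
from collections import Counter
from typing import Callable, List, Set, Tuple

Pair = Tuple[str, str]  # (argument span, role)


def vote_arguments(
    annotations: List[Set[Pair]],
    arbiter: Callable[[List[Set[Pair]]], Set[Pair]],
) -> Set[Pair]:
    counts = Counter(pair for ann in annotations for pair in ann)
    threshold = len(annotations) / 2
    kept = {pair for pair, n in counts.items() if n >= threshold}   # pair seen by at least half of the LLMs
    disputed_roles = {role for _, role in counts} - {role for _, role in kept}
    if disputed_roles:
        # the LLMs disagreed completely on these roles: let the arbiter decide,
        # given all original annotations as reference
        kept |= {(span, role) for span, role in arbiter(annotations) if role in disputed_roles}
    return kept
```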

3.2 The Constructed EEMT Dataset↩︎

3.2.0.1 Dataset Construction.

As some datasets already pre-annotate event triggers, we reuse the GLEN dataset, currently the largest ED dataset with over 3,000 event types, to construct a more comprehensive and high-quality dataset. Specifically, we employ three mainstream, state-of-the-art LLMs for collaborative annotation: Deepseek-V3 [26], Qwen-Plus [27], and GPT-4o-mini [28]. We follow the data splits of GLEN for the training, development, and test sets. Additionally, we select 1,500 samples from the original test set and create a human-annotated test set with the help of three graduate students specializing in NLP. More details are provided in Appendix 9.

The statistics of the EEMT dataset and the existing datasets are shown in Table 1. Compared to commonly used datasets, EEMT surpasses them in terms of the number of event types and role types, as well as the overall data scale. In particular, our dataset contains 10 times more event and role types than the largest human-annotated EE dataset, MAVEN-Arg. Compared to GLEN, we filter out nearly 7.64% of unreasonable event triggers, refine the coarse-grained type annotations (which account for 61.3% of the original GLEN dataset) into fine-grained ones, and annotate the arguments along with their roles. More detailed statistics are in Appendix 10.

Table 2: Results of different single LLMs and our collaborative annotation method (denoted as CA). ETF, ETR, and EAA indicate Event Trigger Filtering, Event Type Refinement, and Event Argument Annotation, respectively. \(^{\dagger}\) indicates that we apply the offset alignment in EAA. We calculate the F1 score for each step.
Step Deepseek-V3\(^{\dagger}\) Qwen-Plus\(^{\dagger}\) GPT-4o-mini\(^{\dagger}\) CA
ETF 92.1 91.4 91.6 93.2
ETR 95.8 94.9 94.2 96.2
EAA 84.7 83.5 84.2 85.3

3.2.0.2 Quality Assessment.

To evaluate the effectiveness of the proposed annotation methods, we assess the annotation quality of each of the three LLM-based steps on the previously mentioned human-annotated test set. The results are in the last column of Table 2. The F1 scores for all steps surpass 85%, with the event type refinement achieving an impressive F1 score of 96.2%. These results strongly validate the effectiveness of the annotation method. Additionally, we evaluate the results annotated by a single LLM. By leveraging collaborative annotation, the F1 score improves at each step, further enhancing the quality of the annotations.

4 LLM-based Partitioning Extraction↩︎

To adapt LLMs to EE with massive types, we propose an LLM-based partitioning extraction method, called LLM-PEE. As shown in Figure 3, LLM-PEE consists of three key components: similarity-based type recall, type-partitioning prompting, and LLM-based event extraction. In the similarity-based type recall step, we narrow the candidate event types that may appear in a sentence down to a small subset using a similarity-based model. Next, the type-partitioning prompting step divides the subset into several partitions using three different strategies, thus further reducing the prompt length of each partition. Finally, based on the partitioning prompts, LLMs extract event triggers and their corresponding arguments.
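The following sketch summarizes this loop under stated assumptions: the recall model, the partitioning function (e.g., one of the strategies sketched later in this section), and the extraction LLM are passed in as black-box callables, and `build_prompt` is a trivial placeholder for the prompts in Appendix 11:

```python
# A high-level sketch of the LLM-PEE detection loop; all callables are illustrative.
from typing import Callable, List, Tuple

Scored = Tuple[str, float]  # (event type name, recall confidence)


def build_prompt(schema_types: List[Scored], sentence: str) -> str:
    # a trivial placeholder; the real schema/instruction prompts are shown in Appendix 11
    schema = "\n".join(f"- {name}" for name, _ in schema_types)
    return f"Event types:\n{schema}\n\nSentence: {sentence}\nExtract triggers and their event types."


def llm_pee_detect(
    sentence: str,
    recall_types: Callable[[str, int], List[Scored]],              # similarity-based type recall
    partition: Callable[[List[Scored], int], List[List[Scored]]],  # a partitioning strategy
    llm_extract: Callable[[str], List[dict]],                      # LLM call plus output parsing
    k: int = 15,
    num_partitions: int = 2,
) -> List[dict]:
    candidates = recall_types(sentence, k)                 # step 1: recall top-k candidate types
    events: List[dict] = []
    for part in partition(candidates, num_partitions):     # step 2: split into partitions
        events.extend(llm_extract(build_prompt(part, sentence)))  # step 3: extract per partition
    return events
```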

Figure 3: Overview of our LLM-based Partitioning Extraction Framework, including Event Detection and Event Argument Extraction.

4.0.0.1 Similarity-based Event Type Recalling.

Given a sentence \(s=\{s_1, ..., s_i, ..., s_n\}\) and a set of candidate event types \(\{e_1, ..., e_i, ..., e_m\}\), the similarity-based event type recalling step identifies the potential event types described by the sentence based on the similarity between the sentence and the event types. Following CDEAR [1], we employ ColBERT [29] as the encoder for sentences and event types. Based on the embeddings from the encoder, we calculate the similarities and obtain \(k\) candidate event types for each sentence. More details are in Appendix 7.1.

4.0.0.2 Event Type Partitioning Prompting.

Existing LLM-based EE methods [30], [31] typically adopt prompt learning [32] for EE. Their input prompts generally contain three parts, i.e., event schema information, the task description, and the input sentence. Although the number of event types is reduced to \(k\) after the event type recall step, the prompt describing the event schema information remains long, particularly for some open-source LLMs with relatively short context lengths. Moreover, some existing studies [33] have shown that as the prompt length increases, the difficulty of understanding the prompt also increases. Motivated by this, the event type partitioning prompting step splits the event types into smaller partitions to mitigate this issue. To determine which event types within a partition enhance LLM performance, we design three partitioning strategies based on the confidence scores obtained in the similarity-based event type recall step: 1) Random: The k event types are randomly divided into N equal parts; 2) Average: The k event types are evenly divided into N parts, ensuring that the sum of the confidences for the event types in each part is as equal as possible. This strategy aims to ensure that the extraction difficulty is balanced across different partitions. 3) Level: The k event types are sorted by their confidences and then evenly divided into N parts. This strategy ensures that different partitions have varying difficulty levels. We analyze different partitioning strategies in Appendix 8.1.
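The three strategies can be sketched as follows, assuming each recalled event type is paired with its confidence score from the recall step; the function names are illustrative, and the Average strategy is approximated by round-robin assignment over the confidence-sorted list:

```python
# Minimal sketches of the Random, Average, and Level partitioning strategies.
import random
from typing import List, Tuple

Scored = Tuple[str, float]  # (event type name, recall confidence)


def partition_random(types: List[Scored], n: int, seed: int = 0) -> List[List[Scored]]:
    """Shuffle and split into n equally sized parts."""
    shuffled = list(types)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n] for i in range(n)]


def partition_average(types: List[Scored], n: int) -> List[List[Scored]]:
    """Deal the confidence-sorted types round-robin so that part sizes and
    confidence sums stay roughly balanced (a simple approximation)."""
    ranked = sorted(types, key=lambda x: -x[1])
    return [ranked[i::n] for i in range(n)]


def partition_level(types: List[Scored], n: int) -> List[List[Scored]]:
    """Sort by confidence and cut into n contiguous chunks, so that the
    partitions differ in difficulty level."""
    ranked = sorted(types, key=lambda x: -x[1])
    size = -(-len(ranked) // n)  # ceiling division
    return [ranked[i * size:(i + 1) * size] for i in range(n)]
```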

Table 3: Performance (in percentage) for the ED task on EEMT. We apply the similarity-based event type recalling (denoted as e.t.r) for all generation-based models. t.p.p indicates the event type partitioning prompting.
Method TI (LLM Annotation) TC (LLM Annotation) TI (Human Annotation) TC (Human Annotation)
P R F1 P R F1 P R F1 P R F1
CA 100.0 100.0 100.0 100.0 100.0 100.0 92.1 94.3 93.2 89.8 90.5 90.1
C.B. Method
DMBERT 50.7 74.2 60.2 32.2 47.3 38.3 49.8 74.0 59.5 31.7 47.0 37.9
Token-Level 55.5 73.3 63.1 35.5 49.2 41.2 56.0 73.1 63.4 35.9 48.6 41.3
Span-Level 53.2 74.7 62.1 33.5 50.2 40.5 53.6 73.5 62.0 62.0 49.2 39.5
CDEAR 60.0 78.2 67.9 45.1 55.2 49.7 61.2 77.7 68.5 46.3 54.5 50.1
G.B. Method
Qwen-Plus 52.1 76.1 61.8 42.4 69.7 52.7 51.5 75.5 61.2 42.0 68.6 52.1
GPT-4o-mini 50.3 78.7 61.4 43.6 68.2 53.2 49.8 78.7 61.0 42.6 67.2 52.1
Deepseek-V3 49.7 81.2 61.7 43.9 72.1 54.6 49.6 80.5 61.4 43.0 71.7 53.8
InstructUIE 73.1 75.3 74.2 53.5 55.5 54.5 73.5 74.4 73.9 53.9 54.9 54.4
IEPILE 75.7 71.9 73.8 56.2 54.2 55.1 74.8 72.3 73.5 55.2 53.5 54.3
KnowCoder 74.9 74.3 74.6 57.0 57.2 57.1 74.1 73.7 73.9 56.0 56.4 56.2
LLM-PEE 79.7 72.4 75.9 63.6 57.5 60.2 79.0 71.8 75.2 62.5 56.4 59.3
LLM-PEE w/o. e.t.r 74.4 47.6 58.1 56.3 43.5 49.1 73.7 47.1 57.4 55.4 42.9 48.4
LLM-PEE w/o. t.p.p 75.1 74.2 74.7 56.8 57.2 57.0 73.9 74.0 73.9 56.5 56.1 56.3
Table 4: Performance (in percentage) for the EAE task on EEMT.
Method AI (LLM Annotation) AC (LLM Annotation) AI (Human Annotation) AC (Human Annotation)
P R F1 P R F1 P R F1 P R F1
CA 100.0 100.0 100.0 100.0 100.0 100.0 90.1 89.5 89.8 86.4 84.2 85.3
C.B. Method
CRF-Tagging 25.8 24.9 25.3 24.4 23.6 24.0 24.8 24.7 24.7 23.8 23.2 23.5
Tag-Prime 29.1 25.6 27.3 27.8 23.4 25.4 28.8 25.1 26.8 27.4 22.8 24.9
G.B. Method
Qwen-Plus 69.4 68.1 68.8 65.8 64.5 65.1 67.4 66.9 67.1 63.0 61.7 62.3
GPT-4o-mini 70.4 67.9 69.1 66.0 65.0 65.5 67.9 66.9 67.4 63.7 61.2 62.4
Deepseek-V3 71.9 68.5 70.2 66.4 65.2 65.8 68.0 68.7 68.3 64.1 62.2 63.1
Bart-Gen 38.0 37.2 37.6 35.0 35.4 35.2 36.8 36.5 36.6 33.6 34.9 34.2
InstructUIE 65.8 64.7 65.2 60.1 59.1 59.6 64.0 63.6 63.8 59.1 57.6 58.3
IEPILE 66.8 66.5 66.7 62.1 59.7 60.9 66.8 64.2 65.5 62.1 59.7 60.9
KnowCoder 72.1 68.2 70.1 65.1 62.4 63.7 68.7 67.6 68.2 63.8 60.2 61.9
LLM-PEE 75.5 70.9 73.1 69.9 65.6 67.7 73.2 68.7 70.9 67.9 63.8 65.8

4.0.0.3 LLM-based EE.

With the prompt as input, we use LLMs to conduct extraction. Specifically, we adopt the two-stage extraction process following KnowCoder [22]. For the ED task, with the partitioning prompts as input, we generate the triggers and their corresponding event types. For the EAE task, with the gold event types and triggers in the sentence as input, LLMs predict potential arguments and their corresponding roles. The details of the extraction prompt, including the schema, instruction, and completion, are in Appendix 11.

5 Experiment↩︎

5.1 Experiment Setting↩︎

5.1.0.1 Evaluation Metrics.

Following GLEN [1] and KnowCoder [22], we use Trigger Identification (TI) F1 and Trigger Classification (TC) F1 to evaluate ED. For EAE, we use Argument Identification (AI) F1 and Argument Classification (AC) F1.
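For concreteness, a minimal sketch of the trigger metrics is given below (the argument metrics are analogous, with role-labeled argument spans in place of triggers); in practice the scores are computed over the whole test set rather than per sentence:

```python
# TI matches trigger spans only; TC additionally requires the event type to match.
from typing import List, Set, Tuple

Event = Tuple[Tuple[int, int], str]  # ((start offset, end offset), event type)


def micro_f1(gold: Set, pred: Set) -> float:
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0


def ti_tc_f1(gold: List[Event], pred: List[Event]) -> Tuple[float, float]:
    ti = micro_f1({span for span, _ in gold}, {span for span, _ in pred})  # span only
    tc = micro_f1(set(gold), set(pred))                                    # span + type
    return ti, tc
```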

5.1.0.2 Baselines.

For the ED task, following GLEN, we employ four classification-based baselines (denoted as C.B. Method), including DMBERT [34], token-level classification, span-level classification, and CDEAR [1]. For the EAE task, we employ two classification-based baselines, CRF-Tagging and Tag-Prime [35], and a generation-based method (denoted as G.B. Method), Bart-Gen [36]. We also compare with three LLM-based baselines, i.e., InstructUIE, IEPILE, and KnowCoder, all of which are fine-tuned on the proposed EEMT dataset. In addition, we evaluate the mainstream LLMs, i.e., Qwen-Plus, GPT-4o-mini, and DeepSeek-V3. For fairness, we evaluate these three models without the offset alignment in EAE. Further evaluations on other mainstream LLMs are provided in Appendix 8.2.

5.1.0.3 Benchmark.

For supervised evaluation, we evaluate the methods on both the LLM-annotated test set and the human-annotated test set. For zero-shot evaluation, we evaluate the methods on ACE 2005 [12].

5.1.0.4 Implementation Details.

We utilize LLaMA2-7B-Base [37] as the backbone and LLaMA-Factory [38] as the training framework. Specifically, LoRA [39] is used for parameter-efficient fine-tuning. We set the LoRA rank to 8 and the learning rate to 0.0003. The maximum sequence length is set to 2048, and the batch size is set to 256. Training is conducted for four epochs. We use vLLM [40] to accelerate inference, employing greedy search with a maximum output length of 500. We recall the top 15 similar event types in the similarity-based event type recalling stage and divide them into two partitions with the Level strategy.
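For reference, the reported hyper-parameters are collected into a single illustrative mapping below; the actual LLaMA-Factory configuration file may use different key names:

```python
# Illustrative summary of the reported fine-tuning and inference settings.
TRAINING_CONFIG = {
    "base_model": "LLaMA2-7B-Base",
    "tuning_method": "LoRA",
    "lora_rank": 8,
    "learning_rate": 3e-4,
    "max_sequence_length": 2048,
    "batch_size": 256,
    "epochs": 4,
    # inference with vLLM
    "decoding": "greedy",
    "max_output_length": 500,
    # LLM-PEE specific settings
    "recall_top_k": 15,
    "num_partitions": 2,
    "partition_strategy": "level",
}
```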

5.2 Experiment Results↩︎

5.2.1 Supervised Evaluation↩︎

The results of the supervised evaluation are listed in Tables 3 and 4. We conduct a comprehensive analysis from the following four perspectives.

5.2.1.1 Analyses on the proposed Annotation Method.

The results of the collaborative annotation method (denoted as CA) are presented in the first row of Tables 3 and 4. First, the collaborative annotation method achieves an F1 score above 85% even on the human-annotated test set, demonstrating the high effectiveness and superiority of our proposed annotation method. Furthermore, our collaborative annotation method outperforms the direct extraction performance of any single annotation model, thereby validating the rationality and efficacy of the design underlying our method. Specifically, in the ED task, there is a significant performance gap (90.1 vs. 56.2) between the collaborative annotation method and individual annotation models. This disparity arises because distant supervision is incorporated as auxiliary information during the annotation process, whereas directly performing event detection using individual LLMs remains highly challenging. In the EAE task, the performance of a single LLM declines significantly due to the absence of the collaborative mechanism and offset alignment. This observation aligns with the results reported in Table 2, further reinforcing the advantage of our method in mitigating bias and enhancing the consistency of dataset styles.

5.2.1.2 Analyses on ED.

Table 3 presents the results of ED on the proposed EEMT dataset. Compared with KnowCoder (an LLM-based generation method), LLM-PEE improves performance by 1.7% on TI and 5.4% on TC, respectively. These results demonstrate the effectiveness of our partitioned extraction framework in addressing event detection with massive types. In comparison with CDEAR (designed for ED with massive types), LLM-PEE integrates prompt learning into the framework, effectively unleashing the potential of LLMs in event detection with massive types. Besides, without fine-tuning on the dataset, mainstream LLMs tend to over-generate events, which leads to high recall but low precision. LLM-PEE demonstrates superior event judgment accuracy, leading to a higher F1 score compared to the mainstream LLMs. Additionally, the relatively low performance on TC highlights the inherent difficulty of accurately identifying event types from a large-scale event schema, which remains a significant challenge for our dataset.

5.2.1.3 Analyses on EAE.

Table 4 presents the results of EAE on the proposed EEMT dataset. Compared to KnowCoder, LLM-PEE achieves improvements of 3.9% in AI and 6.2% in AC, respectively. In LLM-PEE, the schema representations of ED and EAE are consistent. We conjecture that this consistency enables the ED task to facilitate the EAE task, thereby deepening the model’s understanding of event types and roles and contributing to higher performance on both tasks. In comparison with the mainstream LLMs, after fine-tuning on the dataset, the argument extraction performance of LLM-PEE (trained on LLaMA2-7B) surpasses that of the original annotation LLMs. This improvement can be attributed to the high quality of the dataset after offset alignment and collaborative annotation. Additionally, classification-based methods often face challenges in accurately identifying the boundaries of arguments, particularly for continuous spans (e.g., multi-token arguments) in roles such as “purpose”, which results in lower performance in both AI and AC. This suggests that generation-based methods might be more suitable for the complex EAE task.

5.2.1.4 Ablation Analysis.

We conduct ablation experiments on two key modules: similarity-based event type recalling (denoted as LLM-PEE w/o e.t.r) and event type partitioning prompting (denoted as LLM-PEE w/o t.p.p). The results are presented in the bottom two rows of Table 3. Since the EAE task paradigm requires event types and triggers to be specified in advance, we focus our ablation analysis solely on the ED task. LLM-PEE w/o e.t.r exhibits a significant drop of 18.3% in TC. This result suggests that, without recalling the most similar event types, the model struggles to accurately distinguish the correct event type from the event set, particularly for unseen event types. Furthermore, LLM-PEE w/o t.p.p shows a performance decline of 5.1% in TC, further validating the effectiveness of the partitioning strategies in our framework. More detailed results on the partitioning strategies are provided in Appendix 8.1. Additionally, to verify the effectiveness of LLM-PEE when applied to other EE datasets, we conduct an additional supervised experiment on ACE 2005. The results and analyses are presented in Appendix 8.4.

5.2.2 Zero-Shot Evaluation↩︎

Table 5: Performance (in percentage) for the ED and EAE tasks on ACE 2005.
Method TI TC AI AC
Qwen-Plus 20.65 14.69 35.11 25.37
GPT-4o-mini 19.97 13.67 35.11 25.37
Deepseek-V3 20.49 15.57 34.92 26.96
LLM-PEE 22.32 13.19 38.79 30.44

To assess the generalization capabilities of LLM-PEE on unseen datasets, we conduct zero-shot experiments on ACE 2005, the most commonly used EE dataset. The results are presented in Table 5. LLM-PEE outperforms the other LLMs in TI, AI, and AC. Notably, in TI and AI, LLM-PEE demonstrates superior consistency in identifying spans. This improvement can be attributed to the high quality of the EEMT dataset, which significantly mitigates biases in span offsets and enhances the model’s ability to generalize across diverse event types. Although LLM-PEE exhibits slightly lower performance in TC compared to other methods, this is because our model predicts more fine-grained event types, which are not defined in ACE 2005. These findings demonstrate the effectiveness and advantages of LLM-PEE, underscoring its generalization capabilities in handling complex event extraction tasks.

6 Conclusion↩︎

In this paper, we proposed a new LLM-based collaborative annotation method. It refines trigger annotations from distant supervision and then performs argument annotation through collaboration among multiple LLMs. Based on this annotation method, we constructed the new EEMT dataset, which is the largest EE dataset in terms of event types and data scale. To further adapt LLMs for EE with massive types, we introduced a Partitioning EE method for LLMs called LLM-PEE. The experimental results in both supervised and zero-shot settings demonstrate that LLM-PEE outperforms other baselines on the ED and EAE tasks and surpasses mainstream LLMs in terms of generalization capabilities in the zero-shot setting.

Limitations↩︎

We summarize the limitations of this work and regard them as directions for future improvement.

  • Hierarchical level of event extraction. We believe that an important factor restricting our model is that the event hierarchy is not sufficiently distinguished. We will explore how to improve the understanding of the hierarchical levels of events through better event definitions or positive and negative sampling strategies.

  • End-to-end event extraction. The proposed LLM-PEE method still divides event extraction into two sub-tasks, ED and EAE. We plan to explore end-to-end event extraction directly based on LLMs.

  • Document-level event extraction. The proposed EEMT dataset is only annotated on sentence-level text and lacks document-level annotation. We hope to annotate large-scale events in document-level texts to further verify the model’s capability.

7 LLM-based Partitioning Extraction↩︎

7.1 Similarity-based Event Type Recalling↩︎

ColBERT consists of a BERT [41] layer, a convolution layer and an L2 normalization layer. In this paper, we use \(\operatorname{ColBERT}(\cdot)\) to denote the encoder. Specifically, the embedding list of all tokens in a sentence \(s\), \(h_s = [\mathbf{h}_1^s, \mathbf{h}_2^s, ...]\), is calculated as follows: \[\begin{align} \small h_s = \operatorname{ColBERT}(\text{\small{[CLS]}} \text{\small{[SENT]}} s_1, s_2, ... \text{\small{[SEP]}}), \end{align}\] where [SENT] is a special token indicating the object being encoded is a sentence.

For an event type \(e\), the corresponding embedding list of all tokens in the event type name, \(h_e = [\mathbf{h}_1^{e}, \mathbf{h}_2^{e}, ...]\), is calculated as follows: \[\begin{align} \small h_e = \operatorname{ColBERT}(\text{\small{[CLS]}} \text{\small{[EVENT]}} \tau_i \text{\small{[SEP]}}), \end{align}\] where [EVENT] is a special token indicating the object being encoded is an event type.

Then, the similarity score between sentence \(s\) and event type \(e\) is computed as the sum, over the sentence tokens, of the maximum similarity with the event type tokens: \(\rho_{(s, e)} = \sum_{i}\max_{j}\,(\mathbf{h}_{i}^{s} \cdot \mathbf{h}_{j}^{e})\).

A margin loss similar to that of CDEAR is adopted for training, which ensures that the best candidate is scored higher than all negative samples:

\[\small \mathcal{L} = \frac{1}{N}\sum_s \sum_{e^-} \max \{ 0, (\tau - \max_{e \in C_y} \rho_{(e, s)}+ \rho_{(e^-, s)}) \}.\]

Based on the similarity scores, we can get the top \(k\) candidate event types for each sentence.
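A minimal sketch of the MaxSim scoring and the top-\(k\) recall is given below, assuming the token embeddings are already L2-normalized so that the dot product acts as a cosine similarity; the function names are illustrative:

```python
# MaxSim scoring between a sentence and an event type, and top-k type recall.
import numpy as np
from typing import Dict, List, Tuple


def maxsim_score(h_s: np.ndarray, h_e: np.ndarray) -> float:
    """rho(s, e): for every sentence token, keep its best-matching event-type token."""
    sim = h_s @ h_e.T                    # (|s| x |e|) token-level similarity matrix
    return float(sim.max(axis=1).sum())  # max over event-type tokens, sum over sentence tokens


def recall_top_k(
    h_s: np.ndarray,
    type_embeddings: Dict[str, np.ndarray],
    k: int = 15,
) -> List[Tuple[str, float]]:
    scored = [(name, maxsim_score(h_s, h_e)) for name, h_e in type_embeddings.items()]
    return sorted(scored, key=lambda x: -x[1])[:k]
```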

8 Experiment↩︎

8.1 Influence of partitioning strategy↩︎

We conduct experiments on ED to investigate the influence of the partitioning strategy, as shown in Table 6. Our strategy outperforms IEPILE1 because we incorporate the information of the sentence when recalling similar event types and ensure consistency between the training and testing phases. Besides, the results demonstrate that the Level strategy significantly enhances the precision of event extraction. We hypothesize that sorting samples in descending order of confidence enables the grouping of the most challenging-to-distinguish event types into the same partition. This approach facilitates the model’s ability to learn fine-grained distinctions between different event types, including their corresponding triggers and classifications. Notably, the inclusion of naturally occurring challenging negative samples within these partitions allows the model to better grasp the nuanced boundaries between similar or ambiguous event categories.

Table 6: Results with different partitioning strategies.
Partitioning strategy TI TC
IEPILE 74.9 59.0
Random 75.0 58.9
Average 75.1 59.1
Level 75.7 60.0
Table 7: Results evaluated on different mainstream LLMs
Model TI TC AI AC
LLama-3.1-405B 59.6 51.9 65.1 60.2
Gemini-2.0-flash 60.5 53.1 67.4 62.9
Qwen-Turbo 58.3 49.7 64.9 59.3
Qwen-Plus 61.3 52.8 67.0 62.3
Qwen-Max 61.3 53.4 67.5 62.4
GPT-4o-mini 60.1 52.1 67.4 62.5
GPT-4o 60.6 52.7 67.7 62.7
Deepseek-V3 61.4 53.8 68.3 63.1

8.2 Further Evaluation on Mainstream LLMs↩︎

To further evaluate how the mainstream LLMs perform on our dataset, we conduct an extra evaluation on our human-annotated benchmark. We use LLaMA-3.1-405B-Instruct [42], Gemini-2.0-flash [43], GPT-4o, and Qwen-Max for this evaluation. The results are listed in Table 7.

The results suggest that, without training, the mainstream LLMs exhibit nearly the same annotation and extraction capabilities, and the task remains challenging for them. Considering effectiveness and efficiency, we employ Qwen-Plus, GPT-4o-mini, and DeepSeek-V3 as the final annotation LLMs.

Table 8: Comparison of the prediction results across different datasets.
Context Predictions
In the most recent Third Assessment Report ( 2001 ), IPCC wrote there is new and stronger evidence that most of the warming observed over the last 50 years is attributable to human activities. GLEN: Assessment
EEMT: Risk Assessment
The occupying armies existing in german territory will end soon. GLEN: Occupation
EEMT: Military Occupation

8.3 Case Study↩︎

Since we use the GLEN dataset as the source of the original Propbank annotations, we further analyze whether more fine-grained event types can be predicted based on EEMT. We randomly selected a subset of examples from the test set for verification. As shown in Table 8, compared to the original GLEN dataset, when trained on EEMT, the model predictions shift from “Assessment” to “Risk Assessment” and from “Occupation” to “Military Occupation”, which better align with the sentences. Upon analyzing the training data, we find that the key is the increase in the number of corresponding events in the training set: the number of “Risk Assessment” instances increases from 0 to 5, and the number of “Military Occupation” instances increases from 1 to 4, which results in more accurate fine-grained event predictions.

8.4 Supervised Evaluation on other dataset↩︎

Table 9: Results under the supervised evaluation in ACE 2005
Model TI TC AI AC
InstructUIE 73.1 72.3 70.9 68.6
IEPILE 72.9 72.4 69.9 68.2
KnowCoder 73.1 72.3 70.9 68.6
LLM-PEE 73.3 72.6 71.3 69.2

To examine the generality of LLM-PEE on other datasets, we apply LLM-PEE under the supervised setting on ACE 2005; the results are shown in Table 9.

Compared to other fine-tuned methods, LLM-PEE achieves state-of-the-art performance. However, the margin of advantage is narrower than when applied to the EEMT dataset, which contains a significantly larger number of event types. This reduction in relative performance can be attributed to the core design philosophy of our model, which primarily focuses on addressing the challenges posed by long prompts resulting from massive event types. When applied to datasets with fewer event types, such as ACE 2005, the inherent advantages of LLM-PEE are somewhat diminished, as the problem it was specifically designed to tackle becomes less pronounced.

9 Annotation Details↩︎

9.1 Annotation Setting↩︎

We use DeepSeek-V3 [26], Qwen-Plus [27], and GPT-4o-mini as the annotation LLMs. Considering the accuracy of the annotation and the divergence of the answers, we set the temperature to 0.5. Besides, we set up an answer detection mechanism: if the obtained result cannot be correctly parsed, the model is required to annotate the events again.

9.2 Prompts Examples↩︎

We introduce the prompts used in the collaborative annotation method, including Event Trigger Filtering, Event Type Refinement, and Event Argument Annotation (two-stage), in Tables 11, 12, 13, and 14.

9.3 Voting Strategies Details↩︎

After different stages, we employ different voting strategies. We will introduce the voting details here.

For voting after Event Trigger Filtering, if a majority of the LLMs (i.e., at least two of the three) consider the event to be reasonable, the event is kept as valid.

For voting after Event Type Refinement, we tally and vote on the most potential event type identified by each LLM, selecting the event with the highest number of votes as the corresponding fine-level event. In cases where a tie occurs, we apply a repeated voting procedure, capping the maximum number of voting rounds at five to prevent excessive iterations. However, in practice, ambiguous potential events were typically limited to two or three, with most reaching a definitive outcome within two voting rounds.

For voting after Event Argument Annotation, if an argument with its role appears in at least half of the annotation results from different LLMs, the argument is considered correct. Specifically, after voting, for roles on which the LLMs provide three completely different annotation results, we instruct GPT-4o, given the three annotation results as references, to generate the final annotation result. The offset alignment prompt with multiple inputs is listed in Table 15.

Table 10: Statistics of the EEMT dataset splits.
Data Source Event Mentions Argument Mentions
Train 157,051 443,459
Dev 7,465 20,379
Test 6,392 18,017
Human Annotation 973 2,546

10 Dataset Statistics↩︎

10.1 The split of dataset↩︎

We introduce the split of our dataset in this section. For the train/dev/test sets, we follow GLEN’s 90/5/5 setting. Besides, to further assess the quality of our dataset, we manually annotate 1,500 cases as the human annotation set.

10.2 Arguments Distribution↩︎

Figure 4: The argument distribution in our dataset.

Furthermore, we analyze the distribution of arguments in Figure 4 and categorize the role types into five broad classes: “agent”, “entity”, “location”, “purpose”, and “others”. Agent, location, and entity are the three most frequently occurring argument roles, aligning with real-world distributions. This alignment facilitates knowledge sharing of similar roles across different event schemas, thereby reinforcing the validity of our dataset construction.

11 Prompt for Training↩︎

We present examples of training prompts for both the ED and EAE tasks, including the instruction and completion, in Figures 5 and 6.

More specifically, we employ the same Python-class-style prompt as KnowCoder [22], which includes the schema and the instruction. Besides, we use the same output representation, i.e., an object of the class corresponding to a certain event type. Following KnowCoder, we adopt class comments to provide clear definitions of concepts, including the event definition and several samples from the training set. However, while KnowCoder uses different schema representations for the ED and EAE tasks, we use the same representation, which better facilitates the model’s understanding of each event type.
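An illustrative (not verbatim) sketch of this Python-class-style schema representation is shown below; the event type, role names, and comments are placeholders rather than the actual EEMT schema, which is given in Figures 5 and 6:

```python
# Illustrative Python-class-style schema prompt: the class comment carries the event
# definition (and, in the real prompts, sampled training examples), and the roles
# appear as constructor arguments.
from typing import Optional


class Event:
    """Base class shared by all event types in the schema prompt."""


class MilitaryOccupation(Event):
    """Military occupation: effective provisional control of a territory by a ruling power."""

    def __init__(self, trigger: str, agent: Optional[str] = None, location: Optional[str] = None):
        self.trigger = trigger      # the trigger span in the sentence
        self.agent = agent          # the occupying force
        self.location = location    # the occupied territory


# The expected completion instantiates the class, e.g.:
# MilitaryOccupation(trigger="occupying", agent="armies", location="german territory")
```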

[TABLE]

Prompt for Event Trigger Filtering.

[TABLE]

Prompt for Event Type Refinement.

[TABLE]

Prompt for Event Argument Annotation.

[TABLE]

Prompt for Event Offset Alignment.

[TABLE]

Prompt for Event Offset Alignment with multiple input.

Figure 5: An example of training data in ED task.

Figure 6: An example of training data in EAE task.

References↩︎

[1]
Qiusi Zhan, Sha Li, Kathryn Conger, Martha Palmer, Heng Ji, and Jiawei Han. 2023. Glen: General-purpose event detection for thousands of types.
[2]
Meisin Lee, Lay-Ki Soon, and Eu-Gene Siew. 2021. Effective use of graph convolution network and contextual sub-tree forcommodity news event extraction. arXiv preprint arXiv:2109.12781.
[3]
Zhaoyue Sun, Jiazheng Li, Gabriele Pergola, Byron C. Wallace, Bino John, Nigel Greene, Joseph Kim, and Yulan He. 2022. https://arxiv.org/abs/2210.12560. Preprint, arXiv:2210.12560.
[4]
Taneeya Satyapanich, Francis Ferraro, and Tim Finin. 2020. Casie: Extracting cybersecurity event information from text. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 8749–8757.
[5]
Elizabeth Spaulding, Kathryn Conger, Anatole Gershman, Rosario Uceda-Sosa, Susan Windisch Brown, James Pustejovsky, Peter Anick, and Martha Palmer. 2023. https://aclanthology.org/2023.isa-1.1. In Proceedings of the 19th Joint ACL-ISO Workshop on Interoperable Semantics (ISA-19), pages 1–10, Nancy, France. Association for Computational Linguistics.
[6]
Charles J Fillmore and Collin Baker. 2009. A frames approach to semantic analysis.
[7]
Paul R Kingsbury and Martha Palmer. 2002. From treebank to propbank. In LREC, pages 1989–1993.
[8]
Yubo Chen, Shulin Liu, Xiang Zhang, Kang Liu, and Jun Zhao. 2017. https://doi.org/10.18653/v1/P17-1038. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 409–419, Vancouver, Canada. Association for Computational Linguistics.
[9]
Ying Zeng, Yansong Feng, Rong Ma, Zheng Wang, Rui Yan, Chongde Shi, and Dongyan Zhao. 2017. https://arxiv.org/abs/1712.03665. CoRR, abs/1712.03665.
[10]
Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78–85.
[11]
Ruirui Chen, Chengwei Qin, Weifeng Jiang, and Dongkyu Choi. 2024. https://doi.org/10.1609/aaai.v38i16.29730. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17772–17780.
[12]
Christopher Walker et al. 2005. https://books.google.com/books?id=SbjjuQEACAAJ. LDC corpora. Linguistic Data Consortium.
[13]
Zhiyi Song, Ann Bies, Stephanie Strassel, Tom Riese, Justin Mott, Joe Ellis, Jonathan Wright, Seth Kulick, Neville Ryant, and Xiaoyi Ma. 2015. https://doi.org/10.3115/v1/W15-0812. In Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation, pages 89–98, Denver, Colorado. Association for Computational Linguistics.
[14]
Xiaozhi Wang, Hao Peng, Yong Guan, Kaisheng Zeng, Jianhui Chen, Lei Hou, Xu Han, Yankai Lin, Zhiyuan Liu, Ruobing Xie, Jie Zhou, and Juanzi Li. 2023. https://arxiv.org/abs/2311.09105. Preprint, arXiv:2311.09105.
[15]
Xiaozhi Wang, Ziqi Wang, Xu Han, Wangyi Jiang, Rong Han, Zhiyuan Liu, Juanzi Li, Peng Li, Yankai Lin, and Jie Zhou. 2020. https://arxiv.org/abs/2004.13590. Preprint, arXiv:2004.13590.
[16]
David Milne et al. 2008. Learning to link with Wikipedia. In Proceedings of the 17th ACM conference on Information and knowledge management, pages 509–518.
[17]
Zixuan Zhang and Heng Ji. 2021. https://doi.org/10.18653/v1/2021.naacl-main.4. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
[18]
Xinya Du and Claire Cardie. 2020. Event extraction by answering (almost) natural questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
[19]
Jian Liu, Yubo Chen, Kang Liu, Wei Bi, and Xiaojiang Liu. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.128. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1641–1651, Online. Association for Computational Linguistics.
[20]
I-Hung Hsu, Kuan-Hao Huang, Elizabeth Boschee, Scott Miller, Prem Natarajan, Kai-Wei Chang, and Nanyun Peng. 2022. https://arxiv.org/abs/2108.12724. Preprint, arXiv:2108.12724.
[21]
Xiao Wang, Weikang Zhou, Can Zu, Han Xia, Tianze Chen, Yuansen Zhang, Rui Zheng, Junjie Ye, Qi Zhang, Tao Gui, et al. 2023. Instructuie: Multi-task instruction tuning for unified information extraction. arXiv preprint arXiv:2304.08085.
[22]
Zixuan Li, Yutao Zeng, Yuxin Zuo, Weicheng Ren, Wenxuan Liu, Miao Su, Yucan Guo, Yantao Liu, Xiang Li, Zhilei Hu, et al. 2024. Knowcoder: Coding structured knowledge into llms for universal information extraction. arXiv preprint arXiv:2403.07969.
[23]
Yuxin Zuo, Wenxuan Jiang, Wenxuan Liu, Zixuan Li, Long Bai, Hanbin Wang, Yutao Zeng, Xiaolong Jin, Jiafeng Guo, and Xueqi Cheng. 2024. Alignxie: Improving multilingual information extraction by cross-lingual alignment. arXiv preprint arXiv:2411.04794.
[24]
Xingyao Wang, Sha Li, and Heng Ji. 2022. Code4struct: Code generation for few-shot event structure prediction.
[25]
Peng Li, Tianxiang Sun, Qiong Tang, Hang Yan, Yuanbin Wu, Xuanjing Huang, and Xipeng Qiu. 2023. https://doi.org/10.18653/v1/2023.acl-long.855. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 15339–15353. Association for Computational Linguistics.
[26]
DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, and Bochao Wu et. al. 2024. https://arxiv.org/abs/2412.19437. Preprint, arXiv:2412.19437.
[27]
Qwen Team. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
[28]
OpenAI: Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, et al. 2024. https://arxiv.org/abs/2410.21276. Preprint, arXiv:2410.21276.
[29]
Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 39–48.
[30]
Yucan Guo, Zixuan Li, Xiaolong Jin, Yantao Liu, Yutao Zeng, Wenxuan Liu, Xiang Li, Pan Yang, Long Bai, Jiafeng Guo, et al. 2023. Retrieval-augmented code generation for universal information extraction. arXiv preprint arXiv:2311.02962.
[31]
Mengna Zhu, Kaisheng Zeng, JibingWu JibingWu, Lihua Liu, Hongbin Huang, Lei Hou, and Juanzi Li. 2024. https://doi.org/10.18653/v1/2024.findings-acl.715. In Findings of the Association for Computational Linguistics: ACL 2024, pages 12028–12038, Bangkok, Thailand. Association for Computational Linguistics.
[32]
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35.
[33]
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. https://arxiv.org/abs/2307.03172. Preprint, arXiv:2307.03172.
[34]
Xiaozhi Wang, Xu Han, Zhiyuan Liu, Maosong Sun, and Peng Li. 2019. https://doi.org/10.18653/v1/n19-1105. In Proceedings of the 2019 Conference of the North.
[35]
I-Hung Hsu, Kuan-Hao Huang, Shuning Zhang, Wenxin Cheng, Premkumar Natarajan, Kai-Wei Chang, and Nanyun Peng. 2023. https://arxiv.org/abs/2205.12585. Preprint, arXiv:2205.12585.
[36]
Sha Li, Heng Ji, and Jiawei Han. 2021. https://arxiv.org/abs/2104.05919. Preprint, arXiv:2104.05919.
[37]
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. https://arxiv.org/abs/2307.09288. Preprint, arXiv:2307.09288.
[38]
Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024. http://arxiv.org/abs/2403.13372. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand. Association for Computational Linguistics.
[39]
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
[40]
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626.
[41]
Jacob Devlin. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[42]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, and Abhinav Pandey et.al. 2024. https://arxiv.org/abs/2407.21783. Preprint, arXiv:2407.21783.
[43]
Google. 2024. Gemini 2.0 flash. https://gemini.google.com. Accessed: 12/2024.

  1. The IEPILE method builds a hard negative dictionary; however, this leads to train-test inconsistency on our dataset. We built the hard negative dictionary and employed hard negative sampling during the training phase, and during testing, the ranked samples were randomly allocated to simulate the effect of the IEPILE method as closely as possible.↩︎