April 02, 2024
Entity disambiguation (ED), which links the mentions of ambiguous entities to their referent entities in a knowledge base, serves as a core component in entity linking (EL). Existing generative approaches demonstrate improved accuracy compared to classification approaches under the standardized ZELDA benchmark. Nevertheless, generative approaches suffer from the need for large-scale pre-training and inefficient generation. Most importantly, entity descriptions, which could contain crucial information to distinguish similar entities from each other, are often overlooked. We propose an encoder-decoder model to disambiguate entities with more detailed entity descriptions. Given text and candidate entities, the encoder learns interactions between the text and each candidate entity, producing representations for each entity candidate. The decoder then fuses the representations of entity candidates together and selects the correct entity. Our experiments, conducted on various entity disambiguation benchmarks, demonstrate the strong and robust performance of this model, particularly +1.5% in the ZELDA benchmark compared with GENRE. Furthermore, we integrate this approach into the retrieval/reader framework and observe +1.5% improvements in end-to-end entity linking in the GERBIL benchmark compared with EntQA.
Entity linking (EL) extracts references (a.k.a. mentions) to entities within a document and associates these mentions with their corresponding entries in a knowledge base (KB). EL is a fundamental component in automatic text comprehension, with various practical applications such as question answering, text analysis, recommender systems, semantic search, and information retrieval.
As the most critical component of EL workflows, entity disambiguation (ED) aims to select the correct entity from a set of candidate entities, given textual references. For instance, the entity mention ‘Bert’ may stand for ‘the famous language model’ [1] or ‘the golden yellow Muppet character’ depending on the given context. Therefore, models need to understand context to disambiguate entities correctly.
Owing to its practical significance in the industry and the latest developments in utilizing pre-trained language models [1]–[4], various approaches for entity disambiguation have been introduced in recent years. Primarily, existing methods can be categorized into two styles: classification approaches [5]–[7] or generative approaches [8]. Classification approaches such as [6] predict the masked entity titles while generative approaches such as [8] directly decode entity titles.
The recently proposed ZELDA benchmark [9] standardizes the experimental setup (consistent training data, entity vocabulary, and candidate lists) and shows that generative approaches such as [8] have significantly stronger performance under this experimental setup.
However, [10] argues that generative approaches require large scale pre-training. In particular, [8] critically relies on a prefix tree (also known as a trie) derived from Wikipedia to constrain the beam search in order to produce a valid entity title in a given knowledge base (KB), which might be inefficient memory-wise. In addition, since it directly generates a valid entity without reading their descriptions, crucial information in the descriptions might be ignored. Therefore, disentangling significantly similar entities proves challenging with this method [9].
To better disentangle similar entities, in this paper we propose an encoder-decoder model that decodes entities by utilizing their descriptions. Our approach is mainly inspired by a recent work on question answering [11]. We summarize our contributions as follows:
We propose a new ED approach, using an encoder-decoder model. Given text and entity candidates, the encoder learns the interactions between the text and each entity candidate, generating representations for each candidate. Subsequently, the decoder fuses these candidate entity representations and generates correct entities. At inference, instead of relying on a constrained beam search, it only needs simple greedy decoding.
We follow the standard evaluation practice (ensuring consistent knowledge base, training corpus and entity candidate lists) and rigorously evaluate this approach in several ED benchmarks [9] and show its strong and robust performance.
We integrate our approach into an end-to-end entity linking pipeline and show large improvements compared with the current state of the art on the GERBIL benchmark [12]. To the best of our knowledge, our approach is the first retrieval-augmented generation approach in EL.
We propose retrieval-augmented entity linking using Large Language Models (LLMs), e.g., GPT-4, and evaluate it on the GERBIL benchmark [12]. Our results show that with augmented entity retrieval, GPT-4 outperforms the current SoTA on some datasets, but in general it underperforms fine-tuning-based approaches.
Our approach outperforms the strongest ED baselines [6]–[8] on the ZELDA benchmark and the EL baselines [8], [10], [13] on the GERBIL benchmark [12].
Existing ED approaches typically fall into two main categories: classification approaches and generative approaches.
For classification approaches, LUKE [6] and FEVRY [7] are two of the most well-known approaches due to their strong performance. LUKE is based on masked entity prediction. During pre-training, LUKE combines the input text and ground-truth entities as input tokens. It then randomly masks some of those ground-truth entities and predicts the masked entities by leveraging both the input text and the unmasked entities. The model is trained on a large entity-annotated corpus obtained from Wikipedia and achieves the current SoTA on several ED benchmark datasets.
For generative approaches, GENRE [8] uses BART weights from [2] and is trained on a Wikipedia corpus, learning to generate entity names in an autoregressive manner, conditioned on the provided context. At inference, GENRE employs a constrained beam search strategy that forces each generated name to be in a predefined entity set.
Conventionally, ED methods are evaluated on six datasets, MSNBC, AQUAINT, ACE2004, WNED-CWEB (CWEB) and WNED-WIKI (WIKI) [14], [15]. Nevertheless, as shown in [9], different ED methods use significantly different amounts of training data (ranging from 2 to 20 million annotated texts) obtained with diverse sampling methodologies and enhanced weak labels [16], [17], completely different knowledge bases (ranging from a few thousand to over 6 million entities) drawn from different sources such as YAGO [18] or KILT [19], and different candidate lists [20], [21]. Comparing these approaches directly is therefore highly challenging, and it is impossible to conclude which approach performs best [9].
The ZELDA benchmark [9] was proposed to unify the training data set, entity vocabulary, and candidate lists to facilitate direct comparability of ED approaches. For this reason, we compare our approach with SoTA approaches on the ZELDA benchmark. Our experiments are rigorously conducted using the same training data, entity vocabulary, and candidate lists, without additional information from Wikipedia and without weak labels.
Different from ED, the key challenge of EL is its very large search space. A system can potentially generate any subset of conceivable spans in the document, each of which could correspond to an entity in a large KB, typically containing millions of entities. To manage this overwhelming scale, existing approaches break EL down into two tasks: mention detection (MD) and entity disambiguation (ED). These tasks are often tackled with varying degrees of independence.
In most of these approaches, the sequence of subproblems is consistent: first, the system identifies possible entity mentions, and then it links these mentions to specific entries in the given knowledge base. This MD\(\rightarrow\)ED classic pipeline is utilized in most methods. They either assume that mentions are provided in advance, following the example of [22] or take a different route by employing readily available entity recognition systems to first identify mentions and then disambiguate them through the ED process, as evidenced in the works of [20], [23]. Furthermore, some research [8] trains an end-to-end autoregressive model that jointly performs MD\(\rightarrow\)ED by beam search.
Recently, [10] has shown that the classic MD \(\rightarrow\) ED approach suffers from identifying mentions without prior knowledge of their corresponding entities, which is unnatural and challenging. To fix this problem, the authors flip the order of MD and ED, and propose an ED \(\rightarrow\) MD pipeline. Their key observation is that finding relevant candidate entities is easy without the knowledge of their specific mentions. Their ED \(\rightarrow\) MD approach achieves SoTA results on the in-domain AIDA-CoNLL dataset [20] and GERBIL benchmark [12]. Although their retriever (select top-\(k\) candidate entities) performs remarkably well, the majority of errors are attributed to their reader (which predicts the final entities and mention spans).
A recent work [13] proposes a structured prediction approach and achieves 88.6% on AIDA-CoNLL test-b by using the PPRforNED [21] candidate list. However, [9], [24] question this candidate list since it is unclear how candidates were pruned. The entity candidates generated by PPRforNED [21] were found to be well-tailored to the AIDA-CoNLL test-b evaluation dataset, with high recall and low ambiguity. Models [6], [7] improve significantly when using these lists instead of the more generic lists by [20] and [25], respectively. Without the handcrafted PPRforNED [21] candidate list, the result of [13] on AIDA-CoNLL test-b is the same as that of [10], 85.8%.
As discussed in ZELDA [9], using additional signals makes comparison unfair and indirect. Moreover, in real-world entity linking applications, additional signals such as pruned candidate lists may not be available. Therefore, as in our ED comparison, we do not use any additional signals and aim to conduct a direct end-to-end entity linking comparison by using the same training data and the same knowledge base, KILT [19], as EntQA [10] and GENRE [8].
Figure 1: Pipeline of fusion entity decoding for entity disambiguation. The input text is ‘DUBLIN 1996-12-07 Jack Charlton’s relationship with the people of Ireland was cemented on Saturday when the Englishman was officially declared one of their own. (a few sentences are omitted here) That is why this is so emotional a night for me , <s1> Charlton <e1> said’. Following [8], we add the special tokens <s1> and <e1> to mark the mention to disambiguate. Given the candidate entities ‘Charlton Athletic F.C.’, ‘Jack Charlton’, ‘Bobby Charlton’, and ‘Suzanne Charlton’ from the KB, we concatenate the text with each entity candidate, including its entity title and its description. The encoder learns interactions between the text and each entity candidate and produces a representation for each entity candidate; the decoder concatenates those representations and selects the correct entity.
We formalize the ED task as follows. Given a set of candidate entities denoted as \(\mathcal{E}\) in a Knowledge Base (KB), and an input text \(D\) with a single mention marked by a special start token and a special end token, the goal is to find the entity \(e\in \mathcal{E}\) that corresponds to the mention in \(D\).
In Figure 1, we show an example of entity disambiguation. Given a text with an annotated mention to disambiguate, we add the special tokens <s1> and <e1> before and after the mention to mark it. We concatenate the input text with the information from each entity candidate, including the entity title and the entity description, and feed it into the encoder to form an entity representation. The decoder then takes the fused entity representations of all candidates and generates the correct entity name.
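The input construction and fusion step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, `<sep>` is a placeholder for the unspecified token between title and description, the candidate descriptions are illustrative, and random arrays stand in for real encoder outputs.

```python
import numpy as np

def mark_mention(text: str, start: int, end: int) -> str:
    """Wrap the mention text[start:end] in the <s1> ... <e1> markers."""
    return f"{text[:start]}<s1> {text[start:end]} <e1>{text[end:]}"

def build_ed_inputs(marked_text, candidates, sep="<sep>"):
    """One encoder input per candidate: text, title, separator, description.
    `sep` is a hypothetical placeholder token (assumption)."""
    return [f"{marked_text} {title} {sep} {desc}" for title, desc in candidates]

def fuse(encoded):
    """Concatenate per-candidate encoder outputs along the sequence axis,
    so that a decoder could attend over all candidates at once."""
    return np.concatenate(encoded, axis=0)

text = "That is why this is so emotional a night for me , Charlton said"
start = text.index("Charlton")
marked = mark_mention(text, start, start + len("Charlton"))

# Illustrative candidate descriptions, not taken from the KB used in the paper.
candidates = [
    ("Jack Charlton", "English footballer and manager"),
    ("Bobby Charlton", "English footballer"),
]
inputs = build_ed_inputs(marked, candidates)

# Stand-in for encoder outputs: one (sequence_length, hidden_size) array per candidate.
rng = np.random.default_rng(0)
encoded = [rng.normal(size=(len(s.split()), 8)) for s in inputs]
fused = fuse(encoded)

print(inputs[0])
print(fused.shape)
```

Each candidate is encoded independently, so encoder cost grows linearly with the number of candidates, while the decoder sees a single long sequence covering all of them.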
We formalize the EL task as follows. Given a set of entities denoted as \(\mathcal{E}\) in a Knowledge Base (KB), and an input document \(D\), the objective is to identify every entity \(e \in \mathcal{E}\) along with a mention \(m\) such that \(m \in D\) and \(m\) links to \(e\). Typically, the length of \(D\) varies from a few words (e.g., short queries) to a few thousand words (e.g., news). To handle long documents, previous research [10] typically segments each document \(D\) into sentence chunks. For each sentence chunk \(p\), most approaches [20], [23] break the EL task into two components: they first extract mentions from the chunk (mention detection, MD) and then link those mentions to entities (entity disambiguation, ED).
[10] introduces a different two-stage process: instead of first identifying mentions and then linking them to entities, it first retrieves the top-\(k\) candidate entities, and a reader then selects the correct entities and predicts their associated mention spans. Figure 2 illustrates an instance of end-to-end EL using this retrieval-plus-reader approach. Our approach follows this pipeline.
Following [26], we represent an entity \(e\) as a combination of its title and description using the format: [CLS] title(\(e\)) [ENT] description(\(e\)) [SEP]. [ENT] is a special token that separates the entity title from the entity description. For Wikipedia entities, we consider up to 128 tokens for their descriptions. We use an encoder enc\(_E\) to produce an embedding for an entity \(e\).
For each passage \(p\) with its document topics \(t\), we likewise concatenate this information using the following format: [CLS] \(p\) [SEP] \(t\) [SEP]. We use another encoder enc\(_P\) to produce an embedding for a passage \(p\).
The score of an entity \(e\) and a passage \(p\) is given as \(s(e, p) = \boldsymbol{enc}_E(e)^{\top} \boldsymbol{enc}_P({p})\). Same as [10], we train the retriever using a multi-label variant of noise contrastive estimation (NCE) [27].
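As a minimal sketch of the scoring rule above, the snippet below computes \(s(e, p)\) as a dot product and returns the top-\(k\) entities. The toy embeddings and the function names are assumptions for illustration; an exhaustive NumPy search stands in here for a vector index.

```python
import numpy as np

def score(entity_emb: np.ndarray, passage_emb: np.ndarray) -> np.ndarray:
    """s(e, p) = enc_E(e)^T enc_P(p), computed for every entity row at once."""
    return entity_emb @ passage_emb

def top_k(entity_emb: np.ndarray, passage_emb: np.ndarray, k: int):
    """Return the indices of the k highest-scoring entities (best first) and all scores."""
    scores = score(entity_emb, passage_emb)
    order = np.argsort(-scores, kind="stable")
    return order[:k], scores

# Toy 2-dimensional embeddings (illustrative values, not model outputs).
entity_emb = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]])
passage_emb = np.array([0.8, 0.6])

idx, scores = top_k(entity_emb, passage_emb, k=2)
print(idx, scores[idx])
```

Because entity embeddings do not depend on the passage, they can be computed once and reused across all passages; only the passage embedding is computed at query time.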
Figure 2: Example of document level entity linking from AIDA test. Given a document, FusionED splits it into smaller passage chunks. Given the current passage ‘That is why this is so emotional a night for me, Charlton said.’, the bi-encoder entity retrieval picks up top 100 entity candidates, e.g., ‘Charlton Athletic F.C.’, ‘Bobby Charlton’, ‘Jack Charlton’. FusionED then decodes linked entities and mentions using entity candidate lists.
We use an architecture similar to the one used for ED (Figure 1), except that the model generates both entity names and mentions rather than entity names only.
Given a passage chunk \(p\) along with its truncated original document \(D\), the retriever retrieves the top-\(k\) candidate entities \(e_1, \cdots, e_k\). Then, for each retrieved candidate entity \(e_i\), we concatenate the document \(D\), the current passage chunk \(p\), the entity title of \(e_i\), and the entity description of \(e_i\). We add the special tokens <extra_id_0>, <extra_id_1>, <extra_id_2>, and <extra_id_3> before the document, the current passage chunk, the entity name, and the entity description, respectively. The input format becomes <extra_id_0> \(D\) <extra_id_1> \(p\) <extra_id_2> title(\(e_i\)) <extra_id_3> description(\(e_i\)).
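The input format above can be assembled with plain string formatting, as in the following sketch. The function names, the single-space joining, and the example descriptions are assumptions made for illustration; only the order of the four special tokens comes from the text.

```python
def build_reader_input(document: str, passage: str, title: str, description: str) -> str:
    """Build one reader input for a single candidate entity, following
    <extra_id_0> D <extra_id_1> p <extra_id_2> title <extra_id_3> description."""
    return (
        f"<extra_id_0> {document} <extra_id_1> {passage} "
        f"<extra_id_2> {title} <extra_id_3> {description}"
    )

def build_reader_inputs(document, passage, candidates):
    """One input string per retrieved candidate; the encoder processes each separately."""
    return [build_reader_input(document, passage, t, d) for t, d in candidates]

document = "DUBLIN 1996-12-07 Jack Charlton's relationship with the people of Ireland"
passage = "That is why this is so emotional a night for me, Charlton said."
# Illustrative candidate descriptions, not taken from the KB used in the paper.
candidates = [
    ("Jack Charlton", "English footballer and manager"),
    ("Charlton Athletic F.C.", "Football club in London"),
]

reader_inputs = build_reader_inputs(document, passage, candidates)
print(reader_inputs[0])
```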
The encoder independently processes the input for each entity candidate \(e_i\), and the resulting representations from all candidates are then merged. The decoder then attends over the merged representations of all retrieved entities. If no candidate entity is linked, the decoder outputs an empty string. Otherwise, for each linked entity \(e_i\), it outputs \(e_i\) <extra_id_4> \(m_{i1}, \cdots, m_{in}\), where \(m_{i1}, \cdots, m_{in}\) are all mentions in \(p\) that link to \(e_i\). Finally, we use the special token <extra_id_5> to separate the outputs for different entities. The final output string is therefore \(e_1\) <extra_id_4> \(m_{11}, \cdots, m_{1n}\) <extra_id_5> \(e_2\) <extra_id_4> \(m_{21}, \cdots, m_{2n}\) <extra_id_5> \(\cdots\) \(e_i\) <extra_id_4> \(m_{i1}, \cdots, m_{in}\).
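The target string format can be illustrated with a small serializer and parser. This is a sketch under assumptions: the text does not state how the mentions of one entity are delimited, so the `", "` delimiter below is hypothetical, as are the function names.

```python
ENTITY_SEP = "<extra_id_5>"    # separates the outputs of different entities
MENTION_SEP = "<extra_id_4>"   # separates an entity name from its mentions
MENTION_DELIM = ", "           # assumption: delimiter between mentions of one entity

def serialize(links):
    """links: list of (entity_title, [mention, ...]) pairs; an empty list gives ''."""
    return ENTITY_SEP.join(
        f"{entity}{MENTION_SEP}{MENTION_DELIM.join(mentions)}"
        for entity, mentions in links
    )

def parse(output: str):
    """Invert serialize(); chunks without a mention separator are skipped."""
    if not output.strip():
        return []
    links = []
    for chunk in output.split(ENTITY_SEP):
        entity, sep, mentions = chunk.partition(MENTION_SEP)
        if not sep:
            continue  # malformed chunk: no <extra_id_4> marker
        links.append(
            (entity.strip(), [m.strip() for m in mentions.split(MENTION_DELIM) if m.strip()])
        )
    return links

links = [
    ("Jack Charlton", ["Charlton", "Jack Charlton"]),
    ("Republic of Ireland", ["Ireland"]),
]
encoded = serialize(links)
print(encoded)
print(parse(encoded))
```

A delimiter that can also occur inside a mention would make parsing ambiguous, so a real system would need a delimiter that cannot appear in mention text.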
We conduct extensive experiments to demonstrate the performance of our proposed approach (FusionED) over 20 datasets, addressing both single-entity disambiguation and end-to-end entity linking. The goal of our experiments is to facilitate a direct comparison, illustrating that under identical conditions (without incorporating extra training data or taking additional signals into account), our approach outperforms the current SoTA.
Method | AIDA-B | TWEEKI | REDDIT-POSTS | REDDIT-COMM | WNED-CWEB | WNED-WIKI | SLINKS-TAIL | SLINKS-SHADOW | SLINKS-TOP | AVG |
---|---|---|---|---|---|---|---|---|---|---|
Baselines | ||||||||||
CL-RECALL | 91.1 | 94.0 | 98.4 | 98.3 | 92.4 | 98.8 | 98.8 | 56.7 | 73.1 | 89.1 |
Classification | ||||||||||
FEVRYALL | 79.2 | 71.8 | 88.5 | 84.1 | 68.0 | 84.3 | 63.8 | 43.4 | 53.1 | 70.7 |
FEVRYCL | 79.5 | 76.9 | 89.0 | 86.5 | 70.3 | 84.5 | 87.6 | 31.9 | 47.7 | 72.7 |
LUKEPPRE | 79.3 | 73.8 | 76.1 | 69.9 | 66.8 | 68.4 | 97.7 | 20.4 | 50.8 | 67.0 |
LUKEPFT | 81.2 | 77.9 | 81.5 | 78.5 | 70.3 | 76.5 | 98.0 | 22.5 | 51.8 | 71.0 |
Generative | ||||||||||
GENREALL | 72.4 | 75.9 | 88.8 | 83.9 | 66.5 | 85.2 | 95.3 | 38.7 | 43.5 | 72.2 |
GENRECL | 78.6 | 80.1 | 92.8 | 91.5 | 73.6 | 88.4 | 99.6 | 37.3 | 52.8 | 77.2 |
FusionED | 80.1 | 81.4 | 93.9 | 92.3 | 73.6 | 89.0 | 98.3 | 41.5 | 57.9 | 78.7 |
We follow the experimental setup of the ZELDA benchmark [9], using its training data, entity vocabulary, and the more generic candidate list. We initialize the weights of our model using FLAN-T5-base [28] (220M parameters) to match the number of parameters of SoTA models (274M for LUKE [6] and FEVRY [7], 178M for GENRE [8]). We train the model for 60k steps with a learning rate of 0.0001 using the Adam optimizer [29], with a batch size of 12 on 12 NVIDIA Tesla V100 32GB GPUs.
Given a context with a mention, we consider approximately 250 tokens (see footnote 3) surrounding the annotated mention. For each entity candidate, we concatenate the entity name, a special token, and the entity description, truncating to a maximum of 140 tokens. Then, for each context, we use the candidate list from the benchmark [9], considering only the top 200 entity candidates from this list. We evaluate checkpoints every 2000 steps for the last 8000 steps on AIDA-B and select the best checkpoint.
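The context budget can be made concrete with a short sketch. It follows the footnote's reasoning (500 characters on each side of the mention, about 4 characters per token); the function names are hypothetical, and whitespace splitting is only a stand-in for the model's subword tokenizer.

```python
CHARS_PER_SIDE = 500   # characters kept on each side of the mention (ZELDA setup)
CHARS_PER_TOKEN = 4    # rough average token length assumed for English text

def approx_context_tokens(chars_per_side=CHARS_PER_SIDE, chars_per_token=CHARS_PER_TOKEN):
    """Approximate token budget implied by the character window."""
    return 2 * chars_per_side // chars_per_token

def context_window(text: str, start: int, end: int, chars_per_side=CHARS_PER_SIDE) -> str:
    """Keep the mention text[start:end] plus up to chars_per_side characters on each side."""
    left = max(0, start - chars_per_side)
    right = min(len(text), end + chars_per_side)
    return text[left:right]

def truncate_tokens(text: str, max_tokens: int = 140) -> str:
    """Truncate to max_tokens whitespace tokens (stand-in for subword tokens)."""
    return " ".join(text.split()[:max_tokens])

text = "a" * 2000 + "MENTION" + "b" * 2000
window = context_window(text, 2000, 2007)
print(approx_context_tokens(), len(window))
```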
At inference, we evaluate the model using greedy decoding on 9 datasets: AIDA-B [20], TWEEKI [30], REDDIT-POSTS and REDDIT-COMMENTS [30], WNED-WIKI and WNED-CWEB [15], and SLINKS-TOP, SLINKS-SHADOW, and SLINKS-TAIL [31]. These datasets are collected from diverse sources: news (AIDA-B), annotated tweets (TWEEKI), top-scoring Reddit posts and comments (REDDIT-POSTS and REDDIT-COMMENTS), and Wikipedia articles (WNED-WIKI and WNED-CWEB). In particular, [31] categorizes entities into three cases based on their appearance frequency in Wikipedia: SLINKS-TOP, where the ground truth entity is the most frequent; SLINKS-SHADOW, where a more popular entity overshadows the correct disambiguation; and SLINKS-TAIL, for rare long-tail entities.
We examine two methods presented in [7] using a candidate list (FEVRYCL) and without any restriction on the search space (FEVRYALL). Additionally, for one of the ED SoTA approaches LUKE [6], we present results of two models LUKEPPRE and LUKEPFT on ZELDA [9] benchmark.
GENRE [8] employs a prefix tree derived from all entity titles in the KB to restrict the generation process. While GENRE does not utilize candidate lists during training, in inference the prefix tree can be generated using the candidate lists GENRECL or without candidate lists GENREALL.
We also list CL-RECALL, which is the recall of the candidate list in ZELDA. It reflects the best possible accuracy if we always select the correct entity from the candidate list.
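To make the role of CL-RECALL as an upper bound concrete, the following sketch computes candidate-list recall and accuracy on a toy example. The function names and the toy data are assumptions for illustration.

```python
def candidate_recall(gold, candidate_lists):
    """Percentage of mentions whose gold entity appears in its candidate list."""
    hits = sum(g in cands for g, cands in zip(gold, candidate_lists))
    return 100.0 * hits / len(gold)

def accuracy(gold, predicted):
    """Percentage of mentions whose predicted entity equals the gold entity."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)

# Toy data: four mentions with their gold entities and candidate lists.
gold = ["Jack Charlton", "Dublin", "Republic of Ireland", "Bobby Charlton"]
candidate_lists = [
    ["Jack Charlton", "Bobby Charlton"],
    ["Dublin", "Dublin, Ohio"],
    ["Ireland", "Northern Ireland"],          # gold entity is missing here
    ["Jack Charlton", "Bobby Charlton"],
]
# A model that selects from the candidate list can never exceed the recall.
predicted = ["Jack Charlton", "Dublin", "Ireland", "Jack Charlton"]

print(candidate_recall(gold, candidate_lists), accuracy(gold, predicted))
```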
Table 1 reports the accuracy of FusionED compared with SoTA models. Clearly, FusionED achieves the highest performance across six datasets and secures the second position in three datasets. According to Table 1 and as it was previously pointed out by [9], GENRE shows significantly better performance over classification-based baselines. However, it struggles to disambiguate entities in SLINKS-TOP and SLINKS-SHADOW. One possible interpretation is that it never uses any entity description to disambiguate entities with a similar title. Thus, it favors decoding into the most prominent case where the generated entity title will be most similar to the mention text.
It is worth mentioning that FusionED improves accuracy by more than 4 points over GENRECL on the SLINKS-TOP and SLINKS-SHADOW datasets. These datasets involve ambiguous entities with similar titles. Incorporating information from entity descriptions is a likely reason for FusionED’s stronger performance.
Method | 1 | [1 - 0.9] | [0.9 - 0.8] | [0.8 - 0.7] | [0.7 - 0.6] | [0.6 - 0.5] | [0.5 - 0.4] | [0.4 - 0.3] |
---|---|---|---|---|---|---|---|---|
CL-RECALL | 99.7 | 97.2 | 99.2 | 98.3 | 98.3 | 99.1 | 98.8 | 99.6 |
FEVRYCL | 94.8 | 92.2 | 88.8 | 87.2 | 84.1 | 80.0 | 76.0 | 72.2 |
LUKEPFT | 91.5 | 90.4 | 86.3 | 80.3 | 77.8 | 73.8 | 62.2 | 56.2 |
GENRECL | 97.1 | 94.2 | 91.2 | 85.6 | 87.8 | 86.9 | 86.9 | 79.7 |
FusionED | 96.4 | 92.4 | 90.8 | 87.5 | 86.1 | 88.1 | 87.1 | 85.0 |
Table 2 shows the accuracy of different approaches across difficulty brackets in the WNED-WIKI dataset, introduced in [15]. The authors of [15] propose a baseline method, PRIOR, which selects the entity with the highest prior probability, denoted as \(prior(m, e)\), for a given mention \(m\). This prior probability is precomputed using all annotated mention-entity pairs from Web-scale and Wikipedia corpora. PRIOR serves as a proxy to assess the difficulty of a mention. They further normalize the probability of the ground truth entity given the mention and, based on this normalized value, categorize difficulty into eight brackets. Specifically, if the probability of the ground truth entity of a mention is low, indicating increased ambiguity across the entire Web and Wikipedia corpora, the mention is considered more difficult. [0.4 - 0.3] represents the most difficult test cases, while 1 represents the easiest ones. Our model has the highest accuracy in four of the eight brackets, including the three most difficult ones (+5% in [0.4 - 0.3]), suggesting that using entity descriptions helps disambiguate similar entities in the most challenging test cases.
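The PRIOR baseline and the bracket assignment can be sketched as follows. This is an illustration under assumptions: the prior is estimated here as a per-mention relative frequency over toy pairs, the bracket boundaries are taken from the column labels of Table 2, and treating each lower boundary as inclusive is a guess.

```python
from collections import Counter, defaultdict

def build_prior(pairs):
    """Estimate prior(m, e) as the relative frequency of entity e among
    all annotated occurrences of mention m."""
    counts = defaultdict(Counter)
    for mention, entity in pairs:
        counts[mention][entity] += 1
    return {
        m: {e: c / sum(ec.values()) for e, c in ec.items()}
        for m, ec in counts.items()
    }

def prior_predict(prior, mention):
    """PRIOR baseline: pick the entity with the highest prior for this mention."""
    return max(prior[mention], key=prior[mention].get)

def bracket(p):
    """Map the gold entity's prior to a Table 2 bracket label (None below 0.3)."""
    if p >= 1.0:
        return "1"
    upper = 1.0
    for lower in (0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3):
        if p >= lower:  # assumption: lower boundary is inclusive
            return f"[{upper:g} - {lower:g}]"
        upper = lower
    return None

# Toy annotations: 'Bert' refers to the language model 3 times, the Muppet once.
pairs = [("Bert", "BERT (language model)")] * 3 + [("Bert", "Bert (Sesame Street)")]
prior = build_prior(pairs)

print(prior_predict(prior, "Bert"))
print(bracket(prior["Bert"]["BERT (language model)"]))
```

A mention whose gold entity has a low prior is one where PRIOR is likely to fail, which is why the lower brackets are the harder ones.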
In-domain: AIDA-B. Out-of-domain: MSNBC, Der, K50, R128, R500, OKE15, OKE16.
Method | AIDA-B | MSNBC | Der | K50 | R128 | R500 | OKE15 | OKE16 | AVG |
---|---|---|---|---|---|---|---|---|---|
[20] | 72.8 | 65.1 | 32.6 | 55.4 | 46.4 | 42.4 | 63.1 | 0 | 47.2 |
[32] | 42.3 | 30.9 | 26.5 | 46.8 | 18.1 | 20.5 | 46.2 | 46.4 | 34.7 |
[33] | 48.5 | 39.7 | 29.8 | 55.9 | 23.0 | 29.1 | 41.9 | 37.7 | 38.2 |
[34] | 82.4 | 72.4 | 34.1 | 35.2 | 50.3 | 38.2 | 61.9 | 52.7 | 53.4 |
[35] | 79.3 | - | - | - | - | - | - | - | |
[36] | 81.9 | - | - | - | - | - | - | - | |
[37] | 80.5 | 72.4 | 41.1 | 50.7 | 49.9 | 35.0 | 63.1 | 58.3 | 56.4 |
[8] | 83.7 | 73.7 | 54.1 | 60.7 | 46.7 | 40.3 | 56.1 | 50.0 | 58.2 |
[38] | 85.5 | - | - | - | - | - | - | - | |
[10] | 85.8 | 72.1 | 52.9 | 64.5 | 54.1 | 41.9 | 61.1 | 51.3 | 60.5 |
[13] | 85.8 | 63.1 | 59.1 | 53.7 | 47.1 | 44.4 | 59.5 | 56.6 | 58.7 |
GPT-4 (zero-shot) [13] | 54.1 | - | - | - | - | - | - | - | |
GPT-4 + retrieval (zero-shot) | 58.4 | 42.4 | 40.1 | 69.0 | 35.1 | 29.4 | 58.3 | 53.1 | 48.3 |
GPT-4 + retrieval (zero-shot)* | 59.1 | 42.5 | 41.0 | 67.6 | 36.4 | 30.1 | 58.4 | 53.0 | 48.5 |
FusionED | 86.5 | 73.6 | 56.8 | 65.1 | 53.1 | 41.6 | 62.3 | 56.6 | 62.0 |
For EL, we adhere to the established convention [8], [10] by presenting the InKB Micro F1 score for both the in-domain and out-of-domain datasets. Specifically, for the in-domain scenario, we train FusionED using the AIDA-CoNLL dataset [20]. For the out-of-domain tests, following the same practice, we evaluate it on seven test sets: MSNBC [39], Derczynski (Der) [40], KORE 50 (K50) [41], N3-Reuters-128 (R128), N3-RSS-500 (R500) [42], and OKE challenge 2015 and 2016 (OKE15 and OKE16) [43]. For the KB, we use the 2019 Wikipedia dump supplied with the KILT benchmark [19], which contains 5.9 million entities.
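For reference, micro-averaged precision, recall, and F1 over linked mentions can be computed as in the sketch below. It assumes that a prediction counts as correct only when both the span and the entity match exactly, and that annotations pointing outside the KB have already been removed; the actual GERBIL matching rules may differ, and the function name and toy data are hypothetical.

```python
def micro_prf(gold_docs, pred_docs):
    """gold_docs / pred_docs: one set of (start, end, entity) tuples per document.
    Counts are pooled over all documents before computing P, R, and F1."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold_docs = [
    {(0, 8, "Jack Charlton"), (40, 47, "Republic of Ireland")},
    {(5, 11, "Dublin")},
]
pred_docs = [
    {(0, 8, "Jack Charlton"), (40, 47, "Ireland")},   # second link has the wrong entity
    {(5, 11, "Dublin")},
]

p, r, f1 = micro_prf(gold_docs, pred_docs)
print(round(p, 3), round(r, 3), round(f1, 3))
```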
Following [10], we initialize the weights of both the passage encoder (\(\boldsymbol{enc}_P\)) and the entity encoder (\(\boldsymbol{enc}_E\)) using BLINK [26] retrievers that have been pretrained on Wikipedia hyperlinks. We also finetune the retrievers using the NCE objective with hard negative mining and follow the same sampling strategy as [10] (90% from random samples and 10% from hard negatives). We reproduce their retriever, matching the top-100 recall numbers reported in their paper. We use FAISS [44] to speed up vector similarity search.
We create the reader dataset by selecting the top 100 candidates from the retrieval process. For each ground truth entity, we create an entity title and mention pair. We then concatenate the truncated document and those entity pairs as discussed in Section 3.2.2 (see footnote 4).
The model is initialized with the FLAN-T5-large model [28]. We finetune the model for 20k steps with a learning rate of 0.0001 using the Adam optimizer [29], with a batch size of 8, employing 8 NVIDIA Tesla A100 40GB GPUs. Following the approach in [10], we evaluate the models every 1000 steps in AIDA and select the best checkpoint. We use a linear decay learning rate scheduler that starts at 0, warms up to the peak learning rate, and then decays back to 0. The warm-up rate is set to 1%.
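The learning rate schedule described above can be written as a small function. This sketch assumes that the 1% warm-up rate refers to the fraction of total training steps spent warming up (200 of 20k steps); the function name is hypothetical.

```python
def linear_warmup_decay(step, total_steps=20_000, peak_lr=1e-4, warmup_ratio=0.01):
    """Linear warm-up from 0 to peak_lr, then linear decay back to 0."""
    warmup_steps = round(total_steps * warmup_ratio)  # assumption: ratio of total steps
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)

for step in (0, 100, 200, 10_100, 20_000):
    print(step, linear_warmup_decay(step))
```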
During inference, we employ a sliding window approach to split the document into passages, with a window size of 20 tokens and a stride of 10 tokens to avoid cutting off any mentions. For each passage, we first retrieve the top 100 entity candidates using the bi-encoder and then apply the FusionED reader to decode the correct entities along with their mentions. The sliding window approach might cause the reader to identify overlapping mentions or to disambiguate a single mention into two different entities. For overlapping mentions, we retain the longest one. If the same mention is disambiguated into two different entities, we retain both entities.
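The windowing and the merge rule can be sketched as follows. The greedy longest-first strategy and the function names are assumptions; the text only states that the longest of several overlapping mentions is kept and that two entities predicted for the same mention are both retained.

```python
def sliding_windows(num_tokens, window=20, stride=10):
    """Return (start, end) token spans covering a document, end exclusive."""
    if num_tokens <= window:
        return [(0, num_tokens)]
    spans, start = [], 0
    while True:
        end = min(start + window, num_tokens)
        spans.append((start, end))
        if end == num_tokens:
            return spans
        start += stride

def resolve_overlaps(predictions):
    """predictions: iterable of (start, end, entity) token spans, end exclusive.
    Keep the longest span among overlapping ones; identical spans linked to
    different entities are all kept."""
    ordered = sorted(set(predictions), key=lambda p: (-(p[1] - p[0]), p[0], p[2]))
    kept = []
    for start, end, entity in ordered:
        clash = any(
            start < k_end and k_start < end and (start, end) != (k_start, k_end)
            for k_start, k_end, _ in kept
        )
        if not clash:
            kept.append((start, end, entity))
    return sorted(kept)

predictions = [
    (3, 5, "Jack Charlton"),          # found in the first window
    (3, 5, "Jack Charlton"),          # found again in the overlapping window
    (4, 5, "Bobby Charlton"),         # shorter overlapping span, dropped
    (9, 10, "Republic of Ireland"),   # same span, two entities: both kept
    (9, 10, "Ireland"),
]
print(sliding_windows(45))
print(resolve_overlaps(predictions))
```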
Table 3 shows the InKB Micro F1 of FusionED compared with different entity linking systems. FusionED achieves the best in-domain result (+0.7% F1 on AIDA-B [20]) without using any handcrafted candidate list [21].
Overall, FusionED achieves the best average F1 score across all evaluation datasets: +1.5% over EntQA [10] and +3.3% over the latest work [13] in EL. The reason for the lower performance on OKE15 and OKE16 [43] is consistent with the observation made by [8]: these datasets include coreference annotations (such as pronouns and common nouns linked to entities), for which our model lacks training. In contrast, many other systems incorporate a component in their pipelines specifically designed to use these annotations.
Compared to the previous retrieval-plus-reader approach, EntQA [10], FusionED improves by +1.5% on MSNBC, +3.9% on Der, +0.6% on K50, and +5.3% on OKE16.
Dataset | GPT-4 + retrieval P | GPT-4 + retrieval R | GPT-4 + retrieval F1 | FusionED P | FusionED R | FusionED F1 |
---|---|---|---|---|---|---|
AIDA-B | 52.0 | 66.6 | 58.4 | 84.4 | 88.7 | 86.5 |
MSNBC | 32.6 | 60.7 | 42.4 | 75.6 | 71.7 | 73.6 |
Der | 29.2 | 63.9 | 40.1 | 55.2 | 58.5 | 56.8 |
K50 | 70.3 | 67.8 | 69.0 | 72.0 | 59.4 | 65.1 |
R128 | 25.6 | 55.6 | 35.1 | 56.3 | 50.2 | 53.1 |
R500 | 19.2 | 62.8 | 29.4 | 31.6 | 60.7 | 41.6 |
OKE15 | 64.1 | 53.5 | 58.3 | 80.1 | 51.0 | 62.3 |
OKE16 | 60.7 | 47.2 | 53.1 | 76.8 | 44.8 | 56.6 |
[13] benchmarked LLMs for EL using the approach introduced in [8], in which the model produces markup around each mention followed by the linked entity name. However, the results are much worse than those of our approach (54.1 vs. 86.5 F1 on AIDA-B). Although LLMs possess comprehensive knowledge about entities, they are limited in their ability to reason directly about specific Wikipedia URLs and Wikipedia names.
We conduct a preliminary study to assess the performance of retrieval-augmented prompting for linking entities using LLMs. This approach involves utilizing the same retrieval models that we described before, which are initialized using BLINK [26] weights and fine-tuned based on AIDA [20]. For the reader, we replace the FusionED with GPT-4. More precisely, we provide GPT-4 with truncated documents (up to 50 tokens), input passages, and entity candidates, including entity title and entity description (up to 50 tokens). We prompt it to link entities from the candidate entity sets and identify their corresponding mentions. To the best of our knowledge, we are the first to propose retrieval-augmented LLMs for EL.
Table 4 presents a detailed comparison between FusionED and GPT-4 + retrieval. GPT-4 + retrieval shows better recall (R) on all datasets except AIDA-B and MSNBC, but it has lower precision (P) on all datasets. The lower precision of GPT-4 might stem from 1) ambiguity in what counts as an entity: it treats instances such as ‘Spoon’, ‘Pasta’, and ‘Scientist’ as entities, which diverges from the ground truth labels in MSNBC [39]; 2) linking ambiguous partial names to famous entities (e.g., in a dataset based on tweets [40], given the query ‘I’m going home to Wisconsin’, it links the ambiguous mention ‘Wisconsin’ to the state of Wisconsin, although it may refer to ‘University of Wisconsin–Madison’). Our preliminary results suggest that future research should focus on enhancing the precision of LLMs by using varied prompts to match SoTA fine-tuned models.
We propose an encoder-decoder model architecture to enhance the disambiguation of entities by providing more detailed descriptions. Given text and candidate entities, the encoder learns the interactions between the text and each entity candidate, generating a representation for each candidate. The decoder then combines these representations to produce the correct entity. Our experiments, conducted on various entity disambiguation benchmarks, demonstrate the model’s strong and robust performance. Furthermore, we integrate this approach into the retrieval/reader EL framework and observe improvements on the GERBIL benchmark compared with the previous SoTA. We also propose entity retrieval-augmented large language models (LLMs) for EL. Results show that LLMs generally underperform FusionED, although they demonstrate strong improvements over the SoTA on some datasets.
The scope of our ED and EL models is limited to traditional Wikipedia and News datasets. We have not investigated their effectiveness in diverse domains such as biomedical research, e-commerce, and product catalogs. Furthermore, this paper focuses exclusively on English corpora, and exploring the potential of our model in a multilingual setting would be an interesting expansion for future research. This includes investigating the advantages of projecting entity linking concepts from one language to another and employing multilingual representation learning to enhance our base model. While our retrieval-augmented LLMs exhibit notable performance improvements on certain EL datasets, they underperform the other approaches overall. Investigating how to further enhance the performance of LLMs using different prompts is an interesting direction for exploration.
Our models are trained using datasets comprised of existing textual collections sourced from Wikipedia and News. Recent studies have brought attention to potential societal biases ingrained in established corpora. We acknowledge the potential risk that our EL models may inherit such biases.
In-domain: AIDA-B. Out-of-domain: MSNBC, AQUAINT, ACE2004, CWEB, WIKI.
Method | AIDA-B | MSNBC | AQUAINT | ACE2004 | CWEB | WIKI | AVG |
---|---|---|---|---|---|---|---|
[8] | 88.6 | 88.1 | 77.1 | 82.3 | 71.9 | 71.7 | 80.0 |
FusionED | 91.7 | 92.4 | 82.0 | 87.1 | 75.8 | 78.6 | 84.6 |
We also run a small ablation experiment on traditional named entity disambiguation datasets, using FLAN-T5-large as the base model for comparison with the corresponding large model. Unlike on a standard benchmark, models tested on those datasets are typically trained on different corpora and linked to different KBs, which may be subsets of YAGO [18] and KILT [19]. Reproducing those results can be challenging due to the incomplete release of their entity vocabularies (footnotes 5 and 6). The comparison is also indirect, since the training datasets differ and may overlap with some test datasets used in out-of-domain evaluation (footnote 7).
We avoid training our model on Wikipedia datasets to prevent test data leakage. Instead, we conduct ablation experiments, training on AIDA and evaluating on both the in-domain AIDA-B and out-of-domain datasets such as MSNBC, AQUAINT, ACE2004, WNED-CWEB (CWEB), and WNED-WIKI (WIKI) [14], [15] to provide a direct comparison.
At inference, we rely on the same candidate lists provided in [8] (footnote 8). Instead of decoding entity names, we decode the index of the corresponding entity in the given ordered candidate list.
Table 5 presents a comparison of InKB Micro F1 results between GENRE and FusionED when trained only on the AIDA dataset and evaluated in both in-domain and out-of-domain scenarios. FusionED performs much better than GENRE, supporting our claim that our model does not require significant pre-training. It is worth noting that our numbers are not directly comparable with SoTA models, as those models are trained on different corpora.
Our prompt template is as follows:
Given a input passage and a candidate entity list (each element in this list is a pair with entity title and entity description), your task is to select entities from this list and link them to mentions which appear in given passage. For each linkage, please output the entity title and mention, separated by @#@ on each line. You can use the truncated document as context information. passage: ... , entities: ... , document: ...
For each passage, we first retrieve the top-100 entity candidates, then feed this passage, the entity candidates, and the corresponding truncated document into this template to produce a prompt. Subsequently, we call the GPT-4-16k API to get results. We then parse the results and evaluate them on the GERBIL benchmark.
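A minimal sketch of filling the template and parsing the model's reply is shown below. The instruction text is copied from the template above; how the candidate list is rendered inside the prompt is not specified, so the `(title, description)` rendering is an assumption, as are the function names and the example reply. No API call is made here.

```python
INSTRUCTIONS = (
    "Given a input passage and a candidate entity list (each element in this list is "
    "a pair with entity title and entity description), your task is to select entities "
    "from this list and link them to mentions which appear in given passage. For each "
    "linkage, please output the entity title and mention, separated by @#@ on each line. "
    "You can use the truncated document as context information."
)

def build_prompt(passage, candidates, document):
    """Fill the template; the rendering of the candidate list is an assumption."""
    entities = "; ".join(f"({title}, {description})" for title, description in candidates)
    return f"{INSTRUCTIONS} passage: {passage} , entities: {entities} , document: {document}"

def parse_links(response):
    """Extract (entity title, mention) pairs from lines of the form 'title@#@mention'."""
    links = []
    for line in response.splitlines():
        if "@#@" not in line:
            continue  # skip lines that do not follow the requested format
        title, mention = line.split("@#@", 1)
        if title.strip() and mention.strip():
            links.append((title.strip(), mention.strip()))
    return links

prompt = build_prompt(
    "That is why this is so emotional a night for me, Charlton said.",
    [("Jack Charlton", "English footballer and manager")],  # illustrative description
    "DUBLIN 1996-12-07",
)
# Hypothetical model reply, used only to exercise the parser.
reply = "Jack Charlton@#@Charlton\nnot a link\nRepublic of Ireland @#@ Ireland\n"
print(parse_links(reply))
```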
Dataset | GPT-4 + retrieval P | GPT-4 + retrieval R | GPT-4 + retrieval F1 | GPT-4 + retrieval* P | GPT-4 + retrieval* R | GPT-4 + retrieval* F1 |
---|---|---|---|---|---|---|
AIDA-B | 52.0 | 66.6 | 58.4 | 53.2 | 66.5 | 59.1 |
MSNBC | 32.6 | 60.7 | 42.4 | 32.8 | 60.5 | 42.5 |
Der | 29.2 | 63.9 | 40.1 | 30.2 | 63.9 | 41.0 |
K50 | 70.3 | 67.8 | 69.0 | 72.0 | 59.4 | 65.1 |
R128 | 25.6 | 55.6 | 35.1 | 27.2 | 55.2 | 36.4 |
R500 | 19.2 | 62.8 | 29.4 | 20.1 | 61.0 | 30.1 |
OKE15 | 64.1 | 53.5 | 58.3 | 64.6 | 53.3 | 58.4 |
OKE16 | 60.7 | 47.2 | 53.1 | 61.5 | 46.5 | 53.0 |
Table 6 presents the results of GPT-4 on the GERBIL benchmark [12]. For GPT-4 + retrieval (zero-shot)*, we additionally filter the entities generated by the model, keeping only those among the candidate entities obtained from entity retrieval. This improves precision and slightly improves F1 on all datasets except K50 and OKE16.
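The filtering step amounts to discarding any predicted link whose entity is not in the retrieved candidate set, as in the sketch below. Exact title matching and the function name are assumptions.

```python
def filter_to_candidates(links, candidate_titles):
    """Keep only (entity title, mention) pairs whose title was actually retrieved.
    Assumption: titles are compared by exact string match."""
    allowed = set(candidate_titles)
    return [(title, mention) for title, mention in links if title in allowed]

candidate_titles = ["Jack Charlton", "Bobby Charlton", "Charlton Athletic F.C."]
links = [
    ("Jack Charlton", "Charlton"),
    ("Ireland national football team", "Ireland"),  # not among the retrieved candidates
]
print(filter_to_candidates(links, candidate_titles))
```

Removing links to entities outside the candidate set can only remove predictions, so it may raise precision but cannot raise recall.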
1. Work done as an intern at Apple.
2. Work done while at Apple.
3. The ZELDA benchmark [9] considers 500 characters to the left and 500 characters to the right of each mention. Assuming that a token is approximately 4 English characters long on average, this corresponds to about 250 tokens around the mention.
4. EntQA [10] shows that injecting document-level information can improve model performance substantially (+1.4 F1). We therefore use a truncated document of up to 20 tokens, which roughly corresponds to the first-sentence approach in EntQA [10].
5. https://github.com/facebookresearch/GENRE/tree/main/examples_genre