Multi-granular Adversarial Attacks
against Black-box Neural Ranking Models


Adversarial ranking attacks have gained increasing attention due to their success in probing vulnerabilities, and, hence, enhancing the robustness, of neural ranking models. Conventional attack methods employ perturbations at a single granularity, e.g., word-level or sentence-level, to a target document. However, limiting perturbations to a single level of granularity may reduce the flexibility of creating adversarial examples, thereby diminishing the potential threat of the attack. Therefore, we focus on generating high-quality adversarial examples by incorporating multi-granular perturbations. Achieving this objective involves tackling a combinatorial explosion problem, which requires identifying an optimal combination of perturbations across all possible levels of granularity, positions, and textual pieces. To address this challenge, we transform the multi-granular adversarial attack into a sequential decision-making process, where perturbations in the next attack step are influenced by the perturbed document in the current attack step. Since the attack process can only access the final state without direct intermediate signals, we use reinforcement learning to perform multi-granular attacks. During the reinforcement learning process, two agents work cooperatively to identify multi-granular vulnerabilities as attack targets and organize perturbation candidates into a final perturbation sequence. Experimental results show that our attack method surpasses prevailing baselines in both attack effectiveness and imperceptibility.

1 Introduction

With the advance of deep neural networks [1], neural ranking models (NRMs) [2]–[5] have achieved promising ranking effectiveness in information retrieval (IR). Beyond their proven effectiveness, considerable attention has been directed toward assessing the adversarial robustness of NRMs.

Adversarial ranking attacks. In IR, NRMs are prone to inheriting vulnerabilities to adversarial examples from general neural networks [6]–[8]. Such adversarial examples are crafted by introducing human-imperceptible perturbations to the input, capable of inducing model misbehavior. This discovery has sparked legitimate concerns about potential exploitation by black-hat SEO practitioners aiming to defeat meticulously designed search engines [9]. Consequently, there is a need to develop robust and reliable neural IR systems. A crucial step in this direction involves introducing adversarial ranking attacks to benchmark the vulnerability of black-box NRMs [7], [8], [10]. Here, the aim of an adversary is to find human-imperceptible perturbations injected into the document’s text, to promote a low-ranked document to a higher position in the ranked list produced for a given query [7], [10]. This attack approach allows us to identify vulnerabilities in NRMs before deploying them in real-world settings and devise effective countermeasures.

Single-granular ranking attacks. Existing studies on adversarial attacks against NRMs are typically restricted to document perturbation strategies that operate at a single level of granularity, such as the word-level [6], [7] or sentence-level [8], [10]. These methods face two main limitations:

Different perturbation granularities for different query-document pairs: When dealing with different query-document pairs, a priori limiting perturbations to a single granularity could considerably restrict the choice of attack targets, thereby impeding the overall effectiveness of the attack.

Multiple perturbation granularities for a query-document pair: Despite employing a meticulously chosen attack granularity tailored to specific query-document pairs, a fixed granularity falls short of fully capturing the diverse relevance patterns inherent in the matching between a query and a document [11]. An effective ranking attack should possess the flexibility to simultaneously consider various granularity perturbations in an adversarial example.

In this sense, we argue that the full potential of adversarial attacks has yet to be harnessed for uncovering the vulnerabilities of NRMs.

Multi-granular adversarial ranking attacks. In this paper, we develop multi-granular adversarial ranking attacks against NRMs; see Figure 1 for an illustrative example. We incorporate word-level, phrase-level, and sentence-level perturbations to generate fluent and imperceptible adversarial examples. The generated examples may not cover all three levels of granularity but allow for flexible selection based on an optimization strategy. Compared to existing single-granular attacks, this multi-granular approach broadens the selection of attack targets and explores the vulnerability distribution of NRMs at various granularities within a specific query-document pair. Consequently, it can yield richer and more diverse forms of adversarial examples, thereby enhancing attack performance.

Figure 1: To promote a target document in the rankings for a query, we identify multi-granular texts within the document as attack targets to generate effective adversarial examples.

Learning multi-granular attack sequences. Achieving a multi-granular attack is non-trivial due to the combinatorial explosion arising from the numerous possible actions, e.g., perturbation granularities, target positions in the document, and replacement content, posing a significant computational challenge. To address this challenge, we formulate the multi-granular ranking attack problem as a sequential decision-making process [12], [13]. In this process, the attacker sequentially introduces a perturbation at a specific level of granularity, i.e., word-level, phrase-level, or sentence-level, guided by the perturbations in the preceding steps.

Within the sequential decision-making process, the discrete and non-differentiable nature of the text space presents a challenge in finding a direct supervisory signal to facilitate the incorporation of multi-granular perturbations. Therefore, we propose RL-MARA, a novel reinforcement learning (RL) framework [14] to navigate an appropriate sequential multi-granular ranking attack path.

Following [7], [8], [10], our focus is on a practical and challenging decision-based black-box setting [15], where the adversary lacks direct access to model information and can only query the target NRM. We train a surrogate ranking model that imitates the black-box target NRM and achieves comparable performance. We combine the surrogate ranking model with a large language model (LLM) to form the complete environment, which provides rewards for assessing the effectiveness of the multi-granular attack and the naturalness of the perturbed document, respectively. We set up a multi-granular attacker by building upon existing single-granular attack methods. A sub-agent and a meta-agent are designed to work cooperatively: the sub-agent is tasked with identifying multi-granular vulnerabilities in the document as attack targets, while the meta-agent is tasked with generating perturbations and organizing them into a final perturbation sequence. During the RL process, the attacker sequentially incorporates perturbations until the predefined perturbation budget is reached.

Main findings. We conduct experiments on two web search benchmark datasets, MS MARCO Document Ranking [16] and ClueWeb09-B [17]. Experimental results demonstrate that RL-MARA significantly improves the ranking of target documents and thus achieves a higher attack success rate than existing single-granular ranking attack baselines. According to automatic and human naturalness evaluations, RL-MARA maintains the semantic consistency and fluency of adversarial examples.

2 Problem Statement

In ad-hoc retrieval, given a query \(q\) and a set of \(N\) document candidates \(\mathcal{D} = \{d_1,d_2,\ldots,d_N\}\) from a corpus \(\mathcal{C}\), the objective of a ranking model \(f\) is to assign a relevance score \(f(q,d_n)\) to each pair of \(q\) and \(d_n \in \mathcal{D}\), to obtain the ranked list \(L\).

Adversarial ranking attack. Many studies have examined adversarial ranking attacks against NRMs [7], [8], [10], [18]. Given a target document \(d\) and a query \(q\), the primary goal is to construct a valid adversarial example \(d^\mathrm{adv}\) that NRMs rank higher than the original \(d\) in response to \(q\), while closely resembling \(d\). The adversarial example \(d^\mathrm{adv}\) can be regarded as \(d \oplus \mathcal{P}\), where \(\mathcal{P}\) denotes the perturbation applied to \(d\). The perturbations \(\mathcal{P}\) are crafted to conform to the following properties [7], [8], [10]: \[\begin{aligned} &\operatorname{Rank}(q,d \oplus \mathcal{P}) < \operatorname{Rank}(q,d) \\ &\quad \text{ such that } \| \mathcal{P} \| \leq \epsilon, \;\operatorname{Sim}(d^\mathrm{adv}, d) \geq \lambda, \end{aligned}\tag{1}\] where \(\operatorname{Rank}(q,d \oplus \mathcal{P})\) and \(\operatorname{Rank}(q,d)\) denote the position of \(d^\mathrm{adv}\) and \(d\) in the ranked list with respect to \(q\), respectively; a smaller value of ranking position denotes a higher ranking; \(\epsilon\) represents the budget for the number of manipulated terms \(\| \mathcal{P} \|\); \(\lambda\) is the similarity threshold; and the function \(\operatorname{Sim}(d^\mathrm{adv}, d)\) assesses the semantic or syntactic similarity [7], [19] between \(d\) and its corresponding \(d^\mathrm{adv}\). Ideally, \(d^\mathrm{adv}\) should preserve the original semantics of \(d\), and be imperceptible to human judges yet misleading to NRMs.
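To make the three conditions in Eq. (1) concrete, the following minimal Python sketch checks them for a single candidate. It is an illustration only: the function names, the use of relevance scores to derive rank positions, and the way the similarity value is supplied are assumptions and are not part of the formulation above.

```python
from typing import Sequence


def rank_of(other_scores: Sequence[float], target_score: float) -> int:
    """1-based rank of the target among the other candidates (higher score ranks first)."""
    return 1 + sum(1 for s in other_scores if s > target_score)


def is_valid_adversarial(
    other_scores: Sequence[float],  # scores of the remaining candidates (assumed available)
    score_orig: float,              # score of the original document d
    score_adv: float,               # score of the perturbed document d_adv
    num_manipulated_terms: int,     # ||P||
    similarity: float,              # Sim(d_adv, d), computed elsewhere
    epsilon: int,                   # budget on manipulated terms
    lam: float,                     # similarity threshold (lambda)
) -> bool:
    """Check the three conditions of Eq. (1) for one candidate adversarial example."""
    promoted = rank_of(other_scores, score_adv) < rank_of(other_scores, score_orig)
    within_budget = num_manipulated_terms <= epsilon
    similar_enough = similarity >= lam
    return promoted and within_budget and similar_enough
```

In the decision-based black-box setting, only rank positions are observable, so in practice `rank_of` would be replaced by the position returned by the ranker itself.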

Decision-based black-box attacks. Following [7], [8], we focus on decision-based black-box attacks against NRMs for the adversarial ranking attack task. This choice is motivated by the fact that the majority of real-world search engines operate as black boxes, granting adversaries access only to the final decision, i.e., the rank positions within the partially retrieved list.

Perturbations at multiple levels of granularity. Existing work mainly focuses on a single granularity of \(\mathcal{P}\) to manipulate the target document, especially word-level word substitution [7] and sentence-level trigger generation [8]. However, restricting perturbations to a single granularity may fail to adequately capture the nuanced and diverse vulnerability features, a limitation confirmed by our experimental results (see Section 6.1). Thus, we propose to find perturbations \(\mathcal{P}\) at multiple levels of granularity.

Character-level [20] and phrase-level [21] modifications have proven successful in textual attacks [20] within NLP, but are underutilized in IR. However, character-level attacks tend to create ungrammatical adversarial examples and are easily defended against [22]. Moreover, given the naturalness requirements for adversarial examples, perturbations at higher levels of granularity, e.g., the paragraph level, may struggle to avoid suspicion. Therefore, we propose to launch an adversarial ranking attack at three levels of perturbation granularity, i.e., word, phrase and sentence levels.

3 Preliminaries

Our approach relies on three single-granular adversarial ranking attack methods, i.e., word-level, phrase-level and sentence-level. Typically, single-granular attacks begin by training a surrogate ranking model that imitates the target NRM. Subsequently, they execute attacks guided by the surrogate model, which involve two primary steps:

Identifying vulnerable positions in the target document; and

Perturbing the text at these identified positions.

Word-level attack. For word-level attacks against NRMs, the main approaches include word substitution [7], word insertion [23], and word removal [24]. In this study, we employ the word substitution ranking attack exemplified by PRADA [7], which has shown promising results in terms of the attack success rate. Specifically, PRADA first identifies vulnerable words in a document that significantly influence the final ranking result through the surrogate model; and then replaces these vulnerable words with synonyms, selecting the one that provides the most substantial boost in rankings from a pool of candidate synonyms.
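The substitution step described above can be sketched as a greedy search. The Python snippet below is a simplified illustration rather than the PRADA implementation: the vulnerable positions, the synonym table, and the `score` callable (standing in for the surrogate model's relevance score) are all hypothetical inputs.

```python
from typing import Callable, Dict, List


def greedy_word_substitution(
    words: List[str],
    vulnerable_positions: List[int],       # assumed to be identified beforehand
    synonyms: Dict[str, List[str]],        # hypothetical synonym table
    score: Callable[[List[str]], float],   # stand-in for the surrogate relevance score
) -> List[str]:
    """At each vulnerable position, keep the synonym that raises the score the most."""
    current = list(words)
    for pos in vulnerable_positions:
        best_words, best_score = current, score(current)
        for candidate in synonyms.get(current[pos], []):
            trial = current[:pos] + [candidate] + current[pos + 1:]
            trial_score = score(trial)
            if trial_score > best_score:
                best_words, best_score = trial, trial_score
        current = best_words  # unchanged if no synonym improves the score
    return current


# Toy usage: the "surrogate" simply counts query-term matches.
query_terms = {"cheap", "car"}
toy_score = lambda ws: float(sum(w in query_terms for w in ws))
document = ["buy", "an", "inexpensive", "automobile"]
table = {"inexpensive": ["affordable", "cheap"], "automobile": ["vehicle", "car"]}
perturbed = greedy_word_substitution(document, [2, 3], table, toy_score)
```

A position is left untouched when none of its synonyms improves the score, which keeps the number of manipulated terms as small as the scoring function allows.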

Phrase-level attack. In textual attacks within NLP, the predominant method for phrase-level attacks [21], [25] is phrase substitution [21]. To the best of our knowledge, no phrase-level attack has specifically targeted NRMs in IR. PLAT [21] stands out in phrase-level textual attacks, aiming to induce text misclassification. PLAT first identifies the positions of vulnerable phrases that significantly influence the classification scores predicted by the surrogate model. Then, it utilizes BART [26] to generate multiple variations for each selected vulnerable phrase. Finally, PLAT selects the variant that, compared with the original phrase, causes the largest change in the classification scores predicted by the surrogate model. To adapt PLAT from classification to ranking, we use a sub-agent (see Section 4.3.1) to find positions of important phrases and replace classification scores with relevance scores.

Sentence-level attack. For sentence-level attacks against NRMs, key strategies encompass sentence substitution [8], sentence insertion [10], and sentence rewriting [27]. Here, we employ the sentence substitution ranking attack exemplified by PAT [8], which replaces a sentence at a specific position in a document with a trigger. Specifically, PAT first designates the beginning of the document as the vulnerable sentence; and then optimizes the gradients of the ranking loss to derive a continuous trigger representation, which is mapped to the word space. In this work, we take a more flexible approach, employing the sub-agent to identify important sentences, which may be located anywhere within the document.

Combining single-granular attacks. In our multi-granular attack:

Initially, a sub-agent, serving as a vulnerability indicator (see Section 4.3.1), is employed to identify important positions at each level of granularity.

Then, the three aforementioned single-granular attack methods are applied to generate a perturbation for each identified important position (see Section 4.3.2).

Finally, an organization of all settled perturbations is executed to refine and select the most effective perturbation sequence.

Discussions. Our framework can seamlessly integrate off-the-shelf ranking attack methods. In this work, for each granularity level, we choose a representative attack method, rather than involve several attack methods. Involving more attack methods at the same granularity may introduce additional variables and complexities, potentially confounding our results and making it more challenging to draw clear conclusions. In the future, we plan to consider more granularity and add more attack methods at the same granularity to make the perturbation even more diverse. Besides, the above three single-granular attacks remain constant in our current multi-granular attack method. In the future, we aim to make these attack methods learnable and dynamically update them within the entire framework to achieve enhanced interoperability.

4 Method

In this section, we introduce the RL-MARA framework, specifically crafted for achieving multi-granular attacks against NRMs.

Figure 2: The RL-MARA framework.

4.1 Overview

To generate multi-granular perturbations for a target document, we need to decide the granularity, position, and modified content of each single perturbation. We formulate multi-granular attacks against target NRMs as a sequential decision-making process:

The attacker manipulates the target document by introducing a perturbation that could be at any level of granularity; a surrogate model of the target NRM assesses the current ranking position, while an LLM evaluates its naturalness;

The attacker observes changes in ranking and naturalness and further optimizes its attack strategy to generate the next perturbation.

The global objective is to optimize the final ranking improvement of the target document with indiscernible perturbations.

During the sequential decision-making process, the discrete perturbed document leads to a lack of direct signals at each step. Therefore, we employ reinforcement learning (RL) to identify an appropriate sequential attack path for generating adversarial examples. Specifically, we introduce the RL-based framework RL-MARA to learn an optimal multi-granular ranking attack strategy. As shown in Figure 2, the RL-MARA framework comprises two major components:

A surrogate model simulating the behavior of the target NRM and an advanced LLM evaluating the naturalness of adversarial samples, which collectively serve as the whole environment to provide rewards; and

A multi-granular attacker, consisting of a sub-agent and a meta-agent, receives rewards from the environment and collaborates to generate perturbations at multiple levels of granularity.

4.2 Environment and reward

The multi-granular ranking attack problem is formally modeled as a Markov decision process (MDP) [28], wherein its components are defined as follows:

State \(d\) is the document, with the initial state \(d^0\) as a target document, and the terminal state signifying a successful adversarial example;

Action \(p\) denotes a multi-granular perturbation selected by the agent for injection into the document;

Transition \(\mathcal{T}\) alters the document state \(d\) by applying a perturbation at each step; and

Reward \(r\) is the attack reward provided by the environment, guiding the agent with supervisory signals.

Environment. In the decision-based black-box scenario, the attacker obtains only hard-label predictions and has no access to the relevance score that the target NRM assigns to each candidate document, which poses a challenge. Besides, frequent queries to the target NRM might arouse suspicion. Consequently, we employ a surrogate ranking model to function as the environment and offer the attack reward as a proxy for the target NRM. Following [7], [8], we train a surrogate ranking model based on the pseudo-relevance feedback idea [29] that achieves performance comparable to the target NRM. Specific training details can be found in Section 5.4. Simultaneously, we introduce an advanced LLM as part of the environment to assess the naturalness of the current perturbed document as a reward. To sum up, the virtual environment for the RL attacker comprises a surrogate ranking model and an LLM.

Multi-granular reward design. An effective reward function for multi-granular attacks should consider both attack effectiveness and the naturalness of the perturbed document. At each step of the sequential interactions, the attacker introduces a perturbation at a specific level of granularity. The reward furnishes appropriate feedback based on the granularity of the current perturbation, guiding the behavior of the attacker.

Specifically, the reward for each step is defined as follows:

If the relevance score of the current perturbed document is higher than before, the current attack succeeds. The reward not only evaluates the attack effectiveness and the naturalness of the perturbed document, but also considers the number of manipulated terms introduced by different perturbation granularities; and

Conversely, if the attack fails, we directly apply a fixed penalty factor \(\xi\) as the reward.

These assumptions lead us to define the attack reward function \(r^t\) at the step \(t\) as follows: \[r^t =\left\{ \begin{array}{ll} -\xi, & \text{ if } \tilde{f}(q,d^{t}) < \tilde{f}(q,d^{t-1}) \\ r^t_\mathrm{att} / | p^t | + \beta r^t_\mathrm{nat} , & \text{else,} \\ \end{array} \right.\] where \(d^{t}\) and \(d^{t-1}\) are the perturbed document at step \(t\) and \(t-1\), respectively; \(r^t_\mathrm{att}\) and \(r^t_\mathrm{nat}\) are the rewards with respect to attack effectiveness and document naturalness, respectively. The penalty factor \(\xi\) is set to 1. The function \(\tilde{f}\left( \cdot \right)\) outputs the relevance score judged by the surrogate ranking model \(\tilde{f}\), and the hyper-parameter \(\beta\) balances attack effectiveness with document naturalness.

To emphasize the impact of different levels of granularity, we introduce \(| p^t |\) as a reward discount of attack effectiveness, representing the number of manipulated terms of \(p^t\). The value of \(| p^t |\) varies significantly across perturbations at different levels of granularity. For instance, when an attacker introduces a sentence-level perturbation, it consumes a larger portion of the perturbation budget. To normalize its effect, given its expected stronger attack effects, we incorporate a corresponding discount factor.

Next, we detail \(r^t_\mathrm{att}\) and \(r^t_\mathrm{nat}\):

Figure 3: Instruction for naturalness evaluation with ChatGPT. The gray and dark blue blocks indicate the inputs and outputs of the model, respectively.

  • Attack effectiveness reward. \(r^t_\mathrm{att}\) incentivizes ranking improvements of the perturbed document \(d^t\); a perturbed document should receive more rewards if it is ranked higher than before. However, a reward based directly on the ranking position is sparse, since many perturbations do not change the position at all. We therefore shape the reward using the surrogate model’s relevance scores, i.e., \[r^t_\mathrm{att} = \tilde{f}(q,d^{t})-\tilde{f}(q,d^{t-1}),\] where \(\tilde{f}\left( \cdot \right)\) outputs the relevance score judged by \(\tilde{f}\). If the step \(t\) attack is successful, \(r^t_\mathrm{att}\) is positive.

  • Document naturalness reward. \(r^t_\mathrm{nat}\) uses an LLM [30]–[32] to encourage the perturbed document \(d^t\) to satisfy semantic and syntactic constraints. In this work, we employ ChatGPT [33] as the LLM. Specifically, we use the prompts shown in Figure 3 to evaluate both the similarity and fluency of documents in each state, defined as \(r^t_\mathrm{sim}\) and \(r^t_\mathrm{flu}\), respectively.

    • \(r^t_\mathrm{sim}\) measures how semantically similar the perturbed document \(d^t\) is to the original document: \[r^t_\mathrm{sim} = \operatorname{LLM_\mathrm{sim}}\left(d^{t},d^{0}\right),\] where the function \(\operatorname{LLM_\mathrm{sim}}\left( \cdot \right)\) outputs the similarity score judged by the LLM.

    • \(r^t_\mathrm{flu}\) measures how fluent the perturbed document \(d^t\) is: \[r^t_\mathrm{flu} = \operatorname{LLM_\mathrm{flu}}\left(d^{t}\right),\] where the function \(\operatorname{LLM_\mathrm{flu}}\left( \cdot \right)\) outputs the fluency score judged by the LLM.

    Finally, the overall reward with respect to document naturalness \(r^t_\mathrm{nat}\) is defined as \(r^t_\mathrm{nat} = r^t_\mathrm{sim} + r^t_\mathrm{flu}\).
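The step reward defined in this section can be summarized in a few lines of Python. This is a minimal sketch under stated assumptions: the surrogate relevance scores and the two LLM judgments are passed in as plain numbers (their scale is not specified here), and the function and argument names are illustrative.

```python
def attack_reward(score_prev: float, score_curr: float) -> float:
    """r_att: change in the surrogate relevance score between steps t-1 and t."""
    return score_curr - score_prev


def naturalness_reward(sim: float, flu: float) -> float:
    """r_nat = r_sim + r_flu, both assumed to be numeric scores returned by the LLM."""
    return sim + flu


def step_reward(
    score_prev: float,   # surrogate score of d^{t-1}
    score_curr: float,   # surrogate score of d^{t}
    num_terms: int,      # |p^t|, number of manipulated terms at this step
    sim: float,
    flu: float,
    beta: float,         # balances effectiveness and naturalness
    xi: float = 1.0,     # fixed penalty for a failed step
) -> float:
    """Piecewise reward r^t: fixed penalty on failure, discounted gain plus naturalness otherwise."""
    if num_terms <= 0:
        raise ValueError("a perturbation must manipulate at least one term")
    if score_curr < score_prev:
        return -xi
    r_att = attack_reward(score_prev, score_curr)
    r_nat = naturalness_reward(sim, flu)
    return r_att / num_terms + beta * r_nat
```

Dividing only \(r^t_\mathrm{att}\) by \(|p^t|\) means that a sentence-level perturbation must yield a proportionally larger score gain than a word-level one to earn the same effectiveness reward.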

4.3 Multi-granular attacker

The objective of the multi-granular attacker is to identify possible attack positions at all granularities and organize them into a final perturbation sequence. Developing such a composite strategy is challenging for a single agent. Consequently, we adopt two agents in a cooperative manner to accomplish this goal.

The vulnerability indicator serves as a sub-agent, aiming to identify all vulnerable positions in the document at each level of perturbation granularity.

The perturbation aggregator acts as a meta-agent, aiming to generate specific perturbations for each selected vulnerable position and organize them to filter out the final perturbation sequence.

4.3.1 Sub-agent: vulnerability indicator

The aim of the vulnerability indicator is to identify all vulnerable positions within the target document and the corresponding level of perturbation granularity of each position.

Policy network. We employ BERT [34] as the backbone of the sub-agent policy network \(I_{\phi}\). Given a target document \(d\) and a query \(q\), the process proceeds as follows:

  • We employ the surrogate model \(\tilde{f}\) to compute the pairwise loss \(\mathcal{L}_\mathrm{pair} = \sum_{d^{\prime} \in L_{\backslash d}} \mathcal{L}_{\tilde{f}}\left(q, d, d^{\prime}\right)\), where \(d^{\prime}\) is the remaining documents in the ranked list \(L\) excluding the target document \(d\).

  • We compute the average gradient \(\boldsymbol{g}_{d} \in \mathbb{R}^{m * l}\) of \(\mathcal{L}_\mathrm{pair}\) with respect to each position (i.e., each word) in the target document \(d\). Here, \(m\) represents the dimension of the hidden state, and \(l\) denotes the length of the target document \(d\).

  • We feed \(\boldsymbol{g}_{d}\) into the vulnerability indicator \(I_{\phi}\), to derive the vulnerability distribution \(\boldsymbol{u}\). The vulnerability distribution \(\boldsymbol{u}\) represents a list of confidence scores indicating the confidence level for each position in the target document across various levels of perturbation granularity, which is calculated by: \[\boldsymbol{u} = I_{\phi} \left(\boldsymbol{g}_{d}\right),\] where \(\boldsymbol{u} = \left\{\boldsymbol{u}_1,\boldsymbol{u}_2,\ldots,\boldsymbol{u}_l\right\} \subseteq \mathbb{R}^{4*l}\). Each \(\boldsymbol{u}_i \in \mathbb{R}^4, i \in \left[1,l\right]\), represents the confidence scores at each level of perturbation granularity at position \(i\). The four dimensions correspond to perturbation at word-level (W), phrase-level (P), sentence-level (S), and no perturbation (N), respectively.

  • To condense the vulnerability distribution into a specific perturbation type, we apply the softmax function to \(\boldsymbol{u}\) and keep, at each position, the granularity with the highest probability, yielding the vulnerable word positions \(\boldsymbol{c}\) for the target document \(d\), i.e., \[\boldsymbol{c} = \operatorname{softmax}\left(\boldsymbol{u}\right),\] where \(\boldsymbol{c} = \left\{c_1,c_2,\ldots,c_l\right\} \subseteq \mathbb{R}^{1 * l}\), and each \(c_i \in \{\operatorname{W}, \operatorname{P}, \operatorname{S},\operatorname{N}\} , i \in \left[1,l\right]\), is the granularity of perturbation at each word position \(i\).

In the above process, to ensure the continuity of vulnerable positions at the phrase and sentence granularity, we constrain the output of the vulnerability indicator as part of a sequence labeling process, as detailed in Section 5.4. Based on this, we can map the vulnerable word positions to corresponding span positions at the word, phrase, and sentence levels, denoted as \(\mathbb{C} = \left\{C_1,C_2,\ldots,C_k\right\} \in \mathbb{R}^{1 * k}, k < l\). Each \(C_j \in \{\operatorname{W}, \operatorname{P}, \operatorname{S}\}, j \in \left[1,k\right]\), represents the level of perturbation granularity at each span \(j\).
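The mapping from per-position confidence scores to labeled spans can be illustrated with the following Python sketch. It assumes that each position takes its most probable label and that consecutive positions sharing a phrase or sentence label form one span; the actual sequence-labeling constraint is described in Section 5.4 and may differ. Indices are 0-based and all names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

LABELS = ("W", "P", "S", "N")  # word, phrase, sentence, no perturbation


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one position's four confidence scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def label_positions(u: Sequence[Sequence[float]]) -> List[str]:
    """Assign each word position its most probable granularity label."""
    labels = []
    for scores in u:
        probs = softmax(scores)
        labels.append(LABELS[probs.index(max(probs))])
    return labels


def group_spans(labels: Sequence[str]) -> List[Tuple[int, int, str]]:
    """Turn position labels into (start, end, granularity) spans, dropping 'N' positions."""
    spans = []
    i = 0
    while i < len(labels):
        label, j = labels[i], i
        if label in ("P", "S"):
            # Assumption: a run of identical P or S labels is one contiguous span.
            while j + 1 < len(labels) and labels[j + 1] == label:
                j += 1
        if label != "N":
            spans.append((i, j, label))  # each 'W' position is its own span
        i = j + 1
    return spans
```

With this grouping, two adjacent phrases that receive the same label would be merged into one span, which is one reason a dedicated sequence-labeling scheme is preferable in practice.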

4.3.2 Meta-agent: perturbation aggregator

The target of the perturbation aggregator is to generate specific perturbations for each selected vulnerable span position and organize them into a final perturbation sequence.

Generating specific perturbations for each selected span position via static single-granular attack methods. Once the vulnerability indicator has identified the vulnerable span positions and their corresponding perturbation granularity \(\mathbb{C}\), we employ the respective attack method (outlined in Section 3) for each of these span positions to generate the specific perturbations \(\mathbb{P} \in \mathbb{R}^k\). Based on \(\mathbb{P}\), we design the policy network, which sequentially selects a perturbation \(p^t\) from \(\mathbb{P}\) and adds it to the target document until the budget \(\epsilon\) on the number of manipulated words is reached.

Policy network. We employ a multi-layer perceptron (MLP) [35] as the backbone of the meta-agent policy network \(G_{\varphi}\). For each step \(t\), \(G_{\varphi}\) takes the perturbed document \(d^{t-1}\) and vulnerability distribution \(\boldsymbol{u}\) as inputs. The action \(p^t\) is to select the \(t\)-th specific perturbation added to the document. The process is as follows:

  • We use the surrogate model \(\tilde{f}\) to obtain the hidden states of the perturbed document \(d^{t-1}\) as \(\boldsymbol{h}^{t-1} = [\boldsymbol{h}^{t-1}_1, \boldsymbol{h}^{t-1}_2, \ldots, \boldsymbol{h}^{t-1}_l]\), where \(l\) is the length of \(d^{t-1}\). \(\boldsymbol{h}^{t-1}_i \in \mathbb{R}^m\) is the hidden state of the \(i\)-th word in \(d^{t-1}\), where \(m\) is the dimension of hidden states.

  • For the \(j\)-th span (a word, phrase, or sentence), running from its starting position \(j\text{-start}\) to its ending position \(j\text{-end}\), we concatenate the hidden state at each position with the corresponding confidence scores and sum the results over the span: \(\sum_{o = j\text{-start}}^{j\text{-end}}\left[\boldsymbol{h}^{t-1}_o ; \boldsymbol{u}_o\right]\), where \(\left[; \right]\) is the concatenation operation.

  • We divide this sum by the length of the span to derive the final representation \(\boldsymbol{e}_j^{t-1}\) for the current state: \[\boldsymbol{e}_j^{t-1} = \left(\sum\nolimits_{o = j\text{-start}}^{j\text{-end}}\left[\boldsymbol{h}^{t-1}_o ; \boldsymbol{u}_o\right]\right) / \left|p_j\right|, j \in \left[1,k\right],\] where \(\left|p_j\right|\) is the length of the specific perturbation for the \(j\)-th span in \(\mathbb{P}\).

  • The probability \(P(p_j\mid d^{t-1})\) of each specific perturbation at step \(t\) is calculated by the perturbation aggregator \(G_{\varphi}\): \[P(p_j\mid d^{t-1}) = G_{\varphi}(\boldsymbol{e}_j^{t-1}), \;p_j \in \mathbb{P}.\] The aggregator selects the perturbation \(p^t\) with the highest probability under this distribution and injects it into \(d^{t-1}\).

  • We get the final perturbation \(\mathcal{P} = \{p^1, p^2, \ldots, p^t, \ldots, p^T\}\), where \(T < \left| \mathbb{P} \right|\) and \(T\) is the number of steps.
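The meta-agent steps above can be sketched in plain Python as follows. The snippet is illustrative only: `score_fn` is a hypothetical stand-in for \(G_{\varphi}\), candidates are assumed to be used at most once, and selection stops as soon as the next chosen perturbation would exceed the budget \(\epsilon\).

```python
from typing import Callable, List, Sequence


def span_representation(
    hidden: Sequence[Sequence[float]],      # h_o^{t-1}, one vector per word position
    confidence: Sequence[Sequence[float]],  # u_o, four scores per word position
    start: int,
    end: int,                               # inclusive, 0-based
    perturbation_length: int,               # |p_j|
) -> List[float]:
    """Sum the concatenations [h_o ; u_o] over the span, then divide by |p_j|."""
    dim = len(hidden[start]) + len(confidence[start])
    total = [0.0] * dim
    for o in range(start, end + 1):
        concat = list(hidden[o]) + list(confidence[o])
        total = [a + b for a, b in zip(total, concat)]
    return [v / perturbation_length for v in total]


def select_perturbations(
    lengths: Sequence[int],                        # |p_j| for every candidate in P
    score_fn: Callable[[int, List[int]], float],   # hypothetical stand-in for G_phi
    epsilon: int,
) -> List[int]:
    """Pick the highest-scoring remaining candidate at each step until the budget is hit."""
    remaining = set(range(len(lengths)))
    chosen: List[int] = []
    used = 0
    while remaining:
        best = max(sorted(remaining), key=lambda j: score_fn(j, chosen))
        if used + lengths[best] > epsilon:
            break  # adding the next perturbation would exceed the budget
        chosen.append(best)
        used += lengths[best]
        remaining.remove(best)
    return chosen
```

Passing the list of already chosen candidates to `score_fn` mirrors the fact that the policy is re-evaluated on the perturbed document \(d^{t-1}\) at every step.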

4.4 Training with policy gradient

We solve the MDP problem with the policy gradient algorithm REINFORCE [14]. In each episode, a trajectory \(\tau = d^1,p^1,\ldots,d^T,p^T\) is sampled using policy \(\pi\). The episode terminates at step \(T\), when the term manipulation budget \(\epsilon\) is reached, i.e., \(\sum_{t=1}^{T} \left| p^t \right| \leq \epsilon < \sum_{t=1}^{T+1} \left| p^t \right|\).

The aim of training is to learn an optimal policy \(\pi^*\) by maximizing the expected cumulative reward \(R(\tau) = \mathbb{E}\left[\sum_{t=1}^{T} \gamma^t r^t \right]\), where \(\gamma \in [0,1)\) is the discount factor for future rewards. The training objective is to maximize \(J(\phi,\varphi)\) via: \[\nabla_{\phi,\varphi} J(\phi,\varphi) = \mathbb{E}_{\pi_{\phi,\varphi}}\left[\nabla_{\phi,\varphi} \log \pi_{\phi,\varphi} R(\tau)\right],\] where \(\phi\) and \(\varphi\) denote the parameters of the sub-agent \(I_{\phi}\) and the meta-agent \(G_{\varphi}\), respectively. In RL-MARA, the policy networks of the two agents are connected and optimized jointly.

The solution can be approximated by a Monte Carlo estimator [36], i.e., \(\nabla_{\phi,\varphi} J(\phi,\varphi) \propto \sum_{u=1}^{U}\sum_{t=1}^{T} \nabla_{\phi,\varphi} \log \pi_{\phi,\varphi}\left(p^{u,t}\mid d^{u,t}\right) R^{u,t},\) where \(U\) is the number of samples and \(T\) is the number of steps.
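For a single sampled trajectory, the quantity inside the Monte Carlo estimator can be sketched as below. Since \(R^{u,t}\) is not spelled out above, the sketch assumes it is the discounted reward-to-go from step \(t\); this choice, like the function names, is an assumption. An automatic differentiation framework would differentiate the returned scalar with respect to the policy parameters.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Reward-to-go R^t = r^t + gamma * R^{t+1}, computed backwards over one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_surrogate_loss(
    log_probs: Sequence[float],  # log pi(p^t | d^t) for each step of the trajectory
    rewards: Sequence[float],    # step rewards r^t
    gamma: float,
) -> float:
    """Negative sum of log-probabilities weighted by returns; minimizing it follows the policy gradient."""
    returns = discounted_returns(rewards, gamma)
    return -sum(lp * ret for lp, ret in zip(log_probs, returns))
```

Averaging this loss over \(U\) sampled trajectories gives the estimator up to the proportionality constant.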

4.5 Discussion

It is important to point out two key differences between RL-MARA and mixture-of-experts (MoE) frameworks in IR [37], [38], which may seem a more straightforward way of assembling existing ranking attack methods. For MoE methods:

The MoE requires optimization of each expert (attacker) during training. However, mainstream attack methods primarily use a series of search algorithms with two detached steps, finding vulnerable positions and adding perturbations, both of which are non-trainable. While it is feasible to design an algorithm for the first step, the second step poses a challenge, as generating discrete textual perturbations makes it difficult to obtain direct supervised signals.

Constrained by the gating mechanism, the outcome of MoE frameworks is determined either by a single expert or by a weighted combination of all experts, which makes them rigid.

In contrast, for RL-MARA:

RL-MARA integrates existing single-granular attack methods in a trainable manner. In addressing the two non-trainable steps present in existing attacks (see Section 3): to identify vulnerable positions, RL-MARA employs a sub-agent called vulnerability indicator (see Section 4.3.1); to perturb the discrete text space, RL-MARA uses a meta-agent called perturbation aggregator to master the perturbation addition strategy (see Section 4.3.2); and

RL-MARA chooses a perturbation of any granularity at each step. Consequently, by the end of the attack, we can achieve either a single-granular perturbation or a combination of different levels of perturbation granularity with flexibility.

5 Experimental Settings

5.1 Datasets

Benchmark datasets. Like previous work [7], [18], we conduct experiments on two benchmark datasets:

The MS MARCO Document Ranking dataset [16] (MS MARCO) is a large-scale benchmark dataset for Web document retrieval, with about 3.21 million documents.

The ClueWeb09-B dataset [17] (ClueWeb09) comprises 150 queries with a collection of 50 million documents, with 242 additional queries from the TREC Web Track 2012 [39].

Target queries and documents. Following [10], [18], we randomly sample 1000 Dev queries from MS MARCO and use 242 additional queries from ClueWeb09 as target queries for each dataset evaluation, respectively. For each target query, we adopt two categories of target documents based on the top-100 ranked results of the target NRM, considering different levels of attack difficulty, i.e., Easy and Hard. Specifically, we randomly choose 5 documents ranked between \([30,60]\) as Easy target documents and select the 5 bottom-ranked documents as Hard target documents. In addition to the two types, we incorporate Mixture target documents for a thorough analysis. These consist of 5 documents randomly sampled from both the Easy and Hard target document sets.
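
The target-document selection can be sketched as follows. This is an illustrative sketch only: the function name and the seed argument are hypothetical, ranks are assumed to be 1-indexed, and the Mixture set is assumed to be drawn from the union of the Easy and Hard sets selected for the same query.

```python
import random


def pick_targets(ranked_doc_ids, seed=0):
    """Select Easy, Hard, and Mixture target documents from a top-100 list.

    ranked_doc_ids is ordered from rank 1 (best) to the last rank (worst).
    """
    if len(ranked_doc_ids) < 60:
        raise ValueError("expected a ranked list with at least 60 documents")
    rng = random.Random(seed)
    # Easy: 5 documents sampled from ranks 30..60 (inclusive, 1-indexed).
    easy = rng.sample(ranked_doc_ids[29:60], 5)
    # Hard: the 5 bottom-ranked documents.
    hard = list(ranked_doc_ids[-5:])
    # Mixture (assumption): 5 documents sampled from the union of both sets.
    mixture = rng.sample(easy + hard, 5)
    return easy, hard, mixture
```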

5.2 Evaluation metrics

Attack performance. We use three automatic metrics:

Attack success rate (ASR) (%), which evaluates the percentage of target documents successfully boosted under the corresponding target query.

Average boosted ranks (Boost), which evaluates the average improved rankings for each target document under the corresponding target query.

Boosted top-\(K\) rate (T\(K\)R) (%), which evaluates the percentage of target documents that are boosted into top-\(K\) under the corresponding target query.

The effectiveness of an adversary is better with a higher value for all these metrics.
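
As an illustration of the three metrics, the sketch below computes them from the ranks of the target documents before and after the attack. It assumes 1-indexed ranks, counts a document as successfully boosted when its rank strictly improves, and averages Boost over all target documents; the function name is hypothetical.

```python
def attack_metrics(ranks_before, ranks_after, k=10):
    """Return (ASR %, Boost, TKR %) for paired before/after ranks."""
    if not ranks_before or len(ranks_before) != len(ranks_after):
        raise ValueError("rank lists must be non-empty and of equal length")
    n = len(ranks_before)
    # Positive gain means the document moved toward the top of the list.
    gains = [b - a for b, a in zip(ranks_before, ranks_after)]
    asr = 100.0 * sum(1 for g in gains if g > 0) / n
    boost = sum(gains) / n
    # A document counts for TKR if it entered the top-k from outside it.
    into_top_k = sum(
        1 for b, a in zip(ranks_before, ranks_after) if b > k and a <= k
    )
    tkr = 100.0 * into_top_k / n
    return asr, boost, tkr
```

For example, with ranks `[40, 95, 60, 100]` before the attack and `[8, 50, 60, 3]` afterwards, three of the four documents are boosted, the average gain is 43.5 positions, and two documents enter the top 10.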

Naturalness performance. Here, we use four metrics:

Spamicity detection, which detects whether target web pages are spam. Following [7], [8], we adopt a utility-based term spamicity detection method, OSD [40], to detect the after-attack documents.

Grammar checking, which calculates the average number of errors in the after-attack documents. Specifically, following [8], we use Grammarly [41], an online grammar checker.

Language model perplexity (PPL), which measures the fluency using the average perplexity calculated using a pre-trained GPT-2 model [42].

Human evaluation, which measures the quality of the after-attack documents following the criteria in [7], [8].
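
For the PPL metric, the arithmetic can be sketched as below. The sketch assumes that per-token natural-log probabilities have already been obtained from a language model such as GPT-2; it does not call any model, and the function names are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of one document: exp of the mean negative log-likelihood."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


def mean_perplexity(docs_log_probs):
    """Average perplexity over a collection of documents."""
    if not docs_log_probs:
        raise ValueError("need at least one document")
    return sum(perplexity(lp) for lp in docs_log_probs) / len(docs_log_probs)
```

Lower values indicate text that the language model finds more predictable, which is used here as a proxy for fluency.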

5.3 Models

Target NRMs. We choose three typical NRMs as target NRMs:

a pre-trained model, BERT [34];

a pre-trained model tailored for IR, PROP [43]; and

a model distilled from the ranking capability of LLMs, RankLLM [44].

For RankLLM, we employ the model introduced by [44], distilling the ranking capabilities of an LLM, i.e., ChatGPT, into DeBERTa-large [45] in a permutation distillation manner on LLM-generated permutations within MS MARCO.

Baselines. We compare the following attack methods against NRMs:

Term spamming (TS) [9] randomly selects a starting position in the target document and replaces the subsequent words with terms randomly sampled from the target query.

PRADA [7], PLAT [21], and PAT [8] are representative word-level, phrase-level, and sentence-level ranking attack methods against NRMs, respectively, introduced in Section 3.

IDEM [10] is a sentence-level ranking attack that inserts a generated connection sentence into the document.

Model variants. We implement three variants of RL-MARA:

RL-MARA\(_\mathrm{single}\) selects a single level of perturbation for each document. After generating a vulnerability distribution, it sums confidence scores for each perturbation level. The one with the highest total score is chosen, while others are ignored and their scores are reset to zero.

RL-MARA\(_\mathrm{triple}\) incorporates all three perturbation levels in each document. It calculates average confidence scores for each level across vulnerable spans, ranks them from high to low, and then sequentially selects the highest-ranked span at each level until the term manipulation budget is reached, ensuring each perturbation level occurs at least once.

RL-MARA\(_\mathrm{greedy}\) greedily chooses vulnerable spans with the highest confidence for perturbation. It computes and ranks average confidence scores for vulnerable spans, regardless of granularity, and applies perturbations in descending order until the term manipulation budget is exhausted.
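
The selection rules of the single and greedy variants can be sketched as follows. This is a simplified illustration, not the authors' implementation: the `Span` record and function names are hypothetical, candidate spans are assumed not to overlap, and the greedy rule is assumed to skip a span that no longer fits the remaining budget rather than stop at it.

```python
from collections import defaultdict
from typing import NamedTuple


class Span(NamedTuple):
    start: int         # index of the first term of the span
    length: int        # number of terms in the span
    level: str         # "word", "phrase", or "sentence"
    confidence: float  # average confidence score of the span


def best_single_level(spans):
    """Single variant: pick the level with the highest total confidence."""
    totals = defaultdict(float)
    for s in spans:
        totals[s.level] += s.confidence
    return max(totals, key=totals.get)


def greedy_select(spans, budget=25):
    """Greedy variant: take spans by descending confidence within the budget."""
    chosen, used = [], 0
    for s in sorted(spans, key=lambda s: s.confidence, reverse=True):
        if used + s.length > budget:
            continue  # assumption: skip spans that exceed the remaining budget
        chosen.append(s)
        used += s.length
    return chosen
```

The single variant would then keep only the spans whose level equals the value returned by `best_single_level`.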

   \caption{Attack performance of RL-MARA and baselines; $\ast$ indicates significant improvements over the best baseline ($p \le 0.05$).}
    \begin{tabular}{l  c c c c  c c c c   c c c c  c c c c }
   & \multicolumn{8}{c}{MS MARCO} & \multicolumn{8}{c}{ClueWeb09}  \\ 
  \cmidrule(r){2-9} \cmidrule(r){10-17} 
  Method & \multicolumn{4}{c}{Easy} & \multicolumn{4}{c}{Hard} & \multicolumn{4}{c}{Easy}& \multicolumn{4}{c}{Hard} \\
  \cmidrule{1-1} \cmidrule(r){2-5} \cmidrule(r){6-9} \cmidrule(r){10-13} \cmidrule(r){14-17}
   \textbf{BERT}    & ASR & Boost & T10R & T5R & ASR & Boost & T10R & T5R & ASR & Boost & T10R & T5R & ASR & Boost & T10R & T5R \\ 
TS & 100.0 & 38.1 & 84.3 & 26.9 & 89.5 & 68.2 & 23.6 & 5.9 
& 100.0 & 36.2 & 81.0 & 23.6 & 90.5 & 65.9 & 21.8 & 4.6\\
PRADA  & \phantom{1}98.3 & 26.1 & 69.3 & 18.3 & 78.9 & 55.9 & \phantom{1}9.6 & 1.8 
& \phantom{1}97.6 & 24.8 & 66.9 & 16.5 & 77.1 & 53.9 & \phantom{1}8.2 & 1.2\\
PLAT   & \phantom{1}93.6 & 24.3 & 63.1 & 15.6 & 72.1 & 50.0 & \phantom{1}8.5 & 1.1
& \phantom{1}92.1 & 23.1 & 61.9 & 14.0 & 70.2 & 48.2 & \phantom{1}7.2 & 0.8 \\
PAT   & 100.0 & 35.1 & 78.1 & 23.8 & 82.3 & 60.3 & 18.3 & 3.9 
& 100.0 & 34.3 & 75.6 & 20.6 & 78.3 & 54.1 & 14.9 & 2.1 \\
IDEM   & 100.0 & 39.6 & 85.6 & 26.8 & 90.2 & 69.6 & 25.8 & 7.2 
& 100.0 & 37.1 & 82.6 & 24.9 & 87.2 & 65.2 & 22.1 & 5.1\\
RL-MARA$_\mathrm{single}$  & 100.0 & 36.4 & 82.6 & 26.0 & 88.7 & 67.2 & 23.9 & 6.1 
& 100.0 & 35.4 & 81.0 & 24.3 & 86.3 & 65.6 & 21.8 & 5.3\\
RL-MARA$_\mathrm{triple}$  & 100.0 & 40.3 & 87.1 & 27.9 & 93.5 & 75.1\rlap{$^{\ast}$} & 30.6\rlap{$^{\ast}$} & 8.6\rlap{$^{\ast}$}
& 100.0 & 38.9 & 85.7\rlap{$^{\ast}$} & 26.8 & 91.7\rlap{$^{\ast}$} & 73.2\rlap{$^{\ast}$} & 28.9\rlap{$^{\ast}$} & 7.8\rlap{$^{\ast}$}\\
RL-MARA$_\mathrm{greedy}$  & 100.0 & 40.8 & 88.3\rlap{$^{\ast}$} & 28.3\rlap{$^{\ast}$} & 94.6\rlap{$^{\ast}$} & 78.3\rlap{$^{\ast}$} & 31.3\rlap{$^{\ast}$} & 9.3\rlap{$^{\ast}$} 
& 100.0 & 38.8 & 86.1\rlap{$^{\ast}$} & 26.9\rlap{$^{\ast}$} & 93.7\rlap{$^{\ast}$} & 74.1\rlap{$^{\ast}$} & 29.2\rlap{$^{\ast}$} & 8.4\rlap{$^{\ast}$}\\
RL-MARA  & 100.0 & \textbf{42.2}\rlap{$^{\ast}$} & \textbf{90.2}\rlap{$^{\ast}$} & \textbf{30.2}\rlap{$^{\ast}$} & \textbf{98.9}\rlap{$^{\ast}$} & \textbf{88.1}\rlap{$^{\ast}$} & \textbf{36.9}\rlap{$^{\ast}$} & \textbf{9.9}\rlap{$^{\ast}$} 
& 100.0 & \textbf{40.1}\rlap{$^{\ast}$} & \textbf{87.9}\rlap{$^{\ast}$} & \textbf{28.6}\rlap{$^{\ast}$} & \textbf{97.2}\rlap{$^{\ast}$} & \textbf{85.8}\rlap{$^{\ast}$} & \textbf{35.3}\rlap{$^{\ast}$} & \textbf{9.5}\rlap{$^{\ast}$}\\
   \textbf{PROP}  & ASR & Boost & T10R & T5R & ASR & Boost & T10R & T5R & ASR & Boost & T10R & T5R & ASR & Boost & T10R & T5R \\ 
TS & 100.0 & 37.6 & 83.0 & 25.8 & 89.7 & 67.3 & 22.8 & 5.1 
& 100.0 & 35.0 & 79.6 & 22.8 & 90.5 & 64.8 & 20.9 & 4.3\\
PRADA  & \phantom{1}95.2 & 23.4 & 66.6 & 16.4 & 75.8 & 53.4 & \phantom{1}8.6 & 1.2 
& \phantom{1}93.4 & 22.5 & 63.4 & 14.1 & 74.9 & 51.2 & \phantom{1}6.8 & 0.9\\
PLAT   & \phantom{1}91.2 & 22.0 & 60.5 & 13.5 & 69.9 & 48.2 & \phantom{1}7.5 & 0.8
& \phantom{1}89.8 & 21.4 & 59.8 & 12.1 & 68.0 & 46.3 & \phantom{1}6.7 & 0.5 \\
PAT   & \phantom{1}98.6 & 33.6 & 75.9 & 22.5 & 80.2 & 58.7 & 17.3 & 3.1 
& \phantom{1}96.3 & 31.1 & 72.2 & 20.1 & 77.3 & 53.9 & 15.1 & 2.4 \\
IDEM   & 100.0 & 37.3 & 83.0 & 25.0 & 87.9 & 67.5 & 24.0 & 6.5 
& 100.0 & 35.8 & 80.1 & 23.1 & 85.8 & 65.1 & 21.7 & 4.1\\
RL-MARA$_\mathrm{single}$  & 100.0 & 35.3 & 80.2 & 24.5 & 86.4 & 65.1 & 23.0 & 5.8 
& 100.0 & 34.6 & 79.9 & 23.1 & 84.5 & 63.5 & 20.0 & 4.9\\
RL-MARA$_\mathrm{triple}$  & 100.0 & 37.9 & 85.6 & 26.5 & 91.2 & 73.8\rlap{$^{\ast}$} & 28.9\rlap{$^{\ast}$} & 8.0\rlap{$^{\ast}$}
& 100.0 & 37.5 & 83.8\rlap{$^{\ast}$} & 25.3\rlap{$^{\ast}$} & 90.1\rlap{$^{\ast}$} & 71.8\rlap{$^{\ast}$} & 27.2\rlap{$^{\ast}$} & 6.3\rlap{$^{\ast}$}\\
RL-MARA$_\mathrm{greedy}$  & 100.0 & 38.4 & 87.9\rlap{$^{\ast}$} & 27.6\rlap{$^{\ast}$} & 92.3\rlap{$^{\ast}$} & 77.6\rlap{$^{\ast}$} & 30.8\rlap{$^{\ast}$} & 8.7\rlap{$^{\ast}$} 
& 100.0 & 36.3 & 85.1\rlap{$^{\ast}$} & 25.4\rlap{$^{\ast}$} & 91.9\rlap{$^{\ast}$} & 72.5\rlap{$^{\ast}$} & 28.5\rlap{$^{\ast}$} & 7.9\rlap{$^{\ast}$}\\
RL-MARA  & 100.0 & \textbf{41.0}\rlap{$^{\ast}$} & \textbf{88.7}\rlap{$^{\ast}$} & \textbf{28.9}\rlap{$^{\ast}$} & \textbf{97.5}\rlap{$^{\ast}$} & \textbf{87.0}\rlap{$^{\ast}$} & \textbf{36.0}\rlap{$^{\ast}$} & \textbf{9.1}\rlap{$^{\ast}$} 
& 100.0 & \textbf{39.2}\rlap{$^{\ast}$} & \textbf{85.6}\rlap{$^{\ast}$} & \textbf{27.4}\rlap{$^{\ast}$} & \textbf{95.8}\rlap{$^{\ast}$} & \textbf{83.6}\rlap{$^{\ast}$} & \textbf{33.8}\rlap{$^{\ast}$} & \textbf{8.9}\rlap{$^{\ast}$}\\
   \textbf{RankLLM} & ASR & Boost & T10R & T5R & ASR & Boost & T10R & T5R & ASR & Boost & T10R & T5R & ASR & Boost & T10R & T5R \\ 
TS & 100.0 & 34.3 & 79.4 & 22.5 & 89.8 & 63.9 & 19.7 & 3.2 
& \phantom{1}99.2 & 32.1 & 73.0 & 19.3 & 89.9 & 59.8 & 18.4 & 2.8\\
PRADA  & \phantom{1}92.1 & 21.1 & 60.9 & 13.4 & 68.9 & 50.2 & \phantom{1}6.7 & 0.7 
& \phantom{1}89.6 & 20.2 & 59.8 & 12.3 & 70.3 & 48.9 & \phantom{1}5.2 & 0.5\\
PLAT   & \phantom{1}88.9 & 19.1 & 55.8 & 11.5 & 63.5 & 45.9 & \phantom{1}5.9 & 0.5
& \phantom{1}85.6 & 18.6 & 55.9 & 10.0 & 64.8 & 42.6 & \phantom{1}5.2 & 0.3 \\
PAT   & \phantom{1}95.6 & 30.2 & 72.1 & 19.8 & 75.6 & 54.3 & 14.9 & 2.8 
& \phantom{1}93.9 & 28.7 & 68.6 & 18.4 & 73.5 & 49.8 & 12.8 & 1.7 \\
IDEM   & \phantom{1}98.9 & 34.8 & 79.2 & 22.2 & 84.8 & 63.2 & 21.8 & 5.2 
& \phantom{1}98.2 & 33.2 & 77.9 & 21.1 & 82.3 & 61.8 & 18.9 & 3.2\\
RL-MARA$_\mathrm{single}$  & \phantom{1}97.9 & 33.6 & 77.9 & 22.3 & 83.8 & 62.6 & 21.2 & 4.3 
& \phantom{1}97.5 & 32.1 & 76.7 & 21.2 & 81.8 & 60.2 & 18.1 & 3.7\\
RL-MARA$_\mathrm{triple}$  & \phantom{1}99.8 & 35.8 & 82.9 & 24.3 & 89.0 & 71.8\rlap{$^{\ast}$} & 27.0\rlap{$^{\ast}$} & 7.1\rlap{$^{\ast}$}
& \phantom{1}99.2 & 35.2 & 81.2\rlap{$^{\ast}$} & 22.8\rlap{$^{\ast}$} & 88.7\rlap{$^{\ast}$} & 69.8\rlap{$^{\ast}$} & 25.4\rlap{$^{\ast}$} & 5.8\rlap{$^{\ast}$}\\
RL-MARA$_\mathrm{greedy}$  & 100.0 & 36.2\rlap{$^{\ast}$} & 85.3\rlap{$^{\ast}$} & 25.1\rlap{$^{\ast}$} & 89.7\rlap{$^{\ast}$} & 74.8\rlap{$^{\ast}$} & 28.9\rlap{$^{\ast}$} & 8.1\rlap{$^{\ast}$} 
& \phantom{1}99.7 & 34.6 & 82.3\rlap{$^{\ast}$} & 22.7\rlap{$^{\ast}$} & 89.2\rlap{$^{\ast}$} & 70.1\rlap{$^{\ast}$} & 26.4\rlap{$^{\ast}$} & 7.2\rlap{$^{\ast}$}\\
RL-MARA  & 100.0 & \textbf{39.7}\rlap{$^{\ast}$} & \textbf{85.8}\rlap{$^{\ast}$} & \textbf{27.0}\rlap{$^{\ast}$} & \textbf{95.6}\rlap{$^{\ast}$} & \textbf{85.0}\rlap{$^{\ast}$} & \textbf{34.3}\rlap{$^{\ast}$} & \textbf{8.6}\rlap{$^{\ast}$} 
& \textbf{100.0} & \textbf{37.0}\rlap{$^{\ast}$} & \textbf{82.4}\rlap{$^{\ast}$} & \textbf{25.2}\rlap{$^{\ast}$} & \textbf{92.1}\rlap{$^{\ast}$} & \textbf{81.1}\rlap{$^{\ast}$} & \textbf{31.3}\rlap{$^{\ast}$} & \textbf{8.2}\rlap{$^{\ast}$}\\

5.4 Implementation details

For MS MARCO and ClueWeb09, following [7], we truncate each document to 512. The initial retrieval is performed using the Anserini toolkit [46] with the BM25 model to obtain the top 100 ranked documents following [7], [18]. For the environment, following [7], [8], we use BERT\(_{base}\) as the surrogate model, and the training details are consistent with [7], [18]. For the reward, the balance hyper-parameter \(\beta\) is 0.2 and the discount factor \(\gamma\) is 0.9.

For the perturbations, we set the term manipulation budget \(\epsilon=25\) for RL-MARA and all baselines. We specify a length of 1 word for word-level perturbations, 2 to 5 words for phrase-level perturbations, and 6 to 10 words for sentence-level perturbations. For the sub-agent, we supervise its training in a sequence labeling manner [47] to ensure that the labeling of each perturbation position is a continuous span. We train RL-MARA using Adam as the optimizer with a learning rate of \(3 \times 10^{-6}\). Following [18], when the training process ends, we stop updating the policy networks and run another epoch on the full dataset as a testing phase to evaluate the performance.
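
For reference, the settings reported in this subsection can be gathered in a small configuration object. The class and field names below are hypothetical, and the learning rate is read as \(3 \times 10^{-6}\); only the values themselves come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackConfig:
    max_doc_len: int = 512         # truncation length of each document
    num_candidates: int = 100      # top-ranked documents retrieved with BM25
    beta: float = 0.2              # balance hyper-parameter of the reward
    gamma: float = 0.9             # discount factor
    budget: int = 25               # term manipulation budget (epsilon)
    learning_rate: float = 3e-6    # Adam learning rate (assumed 3 x 10^-6)
    word_len: tuple = (1, 1)       # inclusive length range, word level
    phrase_len: tuple = (2, 5)     # inclusive length range, phrase level
    sentence_len: tuple = (6, 10)  # inclusive length range, sentence level


def level_for_length(n, cfg=AttackConfig()):
    """Map a span length in words to its perturbation level."""
    ranges = (
        ("word", cfg.word_len),
        ("phrase", cfg.phrase_len),
        ("sentence", cfg.sentence_len),
    )
    for name, (lo, hi) in ranges:
        if lo <= n <= hi:
            return name
    raise ValueError("span length %d is outside all configured ranges" % n)
```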

6 Experimental Results

6.1 Attack evaluation

Table [table:Baseline] reports the attack performance of the different attack methods against the three target NRMs, evaluated on both Easy and Hard target documents. We have the following observations:

Overall, the attack methods have effects on all three NRMs, exposing the prevalence of adversarial vulnerability. RankLLM is more resistant to adversarial attacks than other NRMs, indicating that distilling the ranking capabilities with LLMs helps to enhance the adversarial robustness of NRMs.

The attack efficacy of most methods on ClueWeb09 is observed to be inferior compared to their performance on MS MARCO. This disparity may stem from the noise present in ClueWeb09’s documents, which potentially renders the model less responsive to adversarial perturbations. This observation aligns with previous findings reported in [18].

Hard documents exhibit lower ASR and T\(K\)R compared to Easy ones, due to the higher prevalence of irrelevant information in bottom-ranked documents, which challenges effective attacks with limited perturbation.

Sentence-level attack methods yield better attack results than word-level and phrase-level methods. The reason may be that the sentence-level perturbation is a continuous optimization of an entire vulnerable span in a document, thus increasing the likelihood of misleading the relevance judgment of NRMs to a greater extent. However, these sentence-level attack methods risk arousing suspicion because each has its own naturalness flaws. We discuss this further in Section 6.3.

RL-MARA significantly outperforms all baselines. In Hard documents of MS MARCO, while attacking RankLLM, RL-MARA improves over the best baseline, IDEM, by 65.4% in T5R and by 34.5% in Boost, highlighting the attack effect of perturbations at multiple levels of granularity.

The superiority of RL-MARA over RL-MARA\(_\mathrm{single}\) suggests that the single granularity of perturbation is not sufficient to fully utilize the diverse vulnerability distributions in documents. Moreover, the synergy between perturbations of different levels of granularity can lead to more threatening adversarial examples.

The advantage of RL-MARA over RL-MARA\(_\mathrm{triple}\) indicates that the sequential decision-making in adding perturbations of each granularity, based on RL rather than mechanically applying perturbations across all levels of granularity, enables more effective exploitation of each document’s vulnerability distribution.

The improvement of RL-MARA over RL-MARA\(_\mathrm{greedy}\) demonstrates that the cooperative approach of two agents in flexibly organizing multi-granular perturbations is instrumental in generating high-quality perturbation sequences, posing significant threats to NRMs.

For the Mixture target documents, the performance of all attack methods remains consistent with that observed in each NRM for Easy and Hard documents. Even when attacking the most resistant target, RankLLM, on MS MARCO, RL-MARA outperforms the best baseline by 34.8% in T5R and 25.1% in Boost. The complete experimental results are in

   \caption{The MRR@10 (\%) performance on MS MARCO and ClueWeb09 of target NRMs (BERT, PROP and RankLLM) and their corresponding surrogate models (S$_\mathrm{NRM}$). }
    \begin{tabular}{l c c  c c   c c}
       NRM & BERT & S$_\mathrm{BERT}$ & PROP & S$_\mathrm{PROP}$ & RankLLM & S$_\mathrm{RankLLM}$ \\
MS MARCO & 38.48 & 35.41 & 39.01 & 36.24 & 39.89 & 37.86 \\
ClueWeb09 & 27.50 & 24.93 & 28.25 & 25.46 & 28.96 & 26.65\\

Furthermore, the performance of the surrogate ranking model plays an important role in the success of the black-box attack. As shown in Table [table:ranking32performance], for all target NRMs, the corresponding surrogate model can imitate their performance to some extent. This allows the vulnerabilities identified on the surrogate ranking model to be effectively transferred to the target NRM.

Figure 4: The impact of hyper-parameter \(\beta\) on the attack performance of RL-MARA against RankLLM on MS MARCO.

The impact of the balance hyper-parameter. \(\beta\) is an important hyper-parameter in our multi-granular reward, since it balances the attack performance and the naturalness of adversarial examples. We investigate the impact of \(\beta\) on the performance of RL-MARA. Lower values of \(\beta\) imply less emphasis on naturalness and greater emphasis on attack effectiveness. We take attacking RankLLM on Mixture documents of MS MARCO as an example. The trend of the attack performance with \(\beta\) is shown in Figure 4. As \(\beta\) decreases, the attack performance gradually increases and plateaus, since the naturalness reward no longer dominates the update of the attack strategy. However, neglecting the naturalness reward signal can make the adversarial examples more likely to arouse suspicion, as discussed in Section 6.3.

When serving as a benchmark, RL-MARA can create adversarial examples of varying naturalness by tuning the hyper-parameter \(\beta\), enabling a comprehensive analysis for model robustness.

6.2 Capability of surrogate models

Black-box vs. white-box attack. We further consider the white-box scenario, which helps to better understand our method. In white-box attacks, we directly use the target NRM in place of the surrogate ranking model and keep the other components of RL-MARA unchanged. We consider the most resistant target, RankLLM, which has a 4.4% higher ranking performance than its surrogate model, as shown in Table [table:ranking32performance]. The result on the Mixture target documents of MS MARCO is shown in Figure 6 (Left), with similar findings on other target documents. Compared with the white-box setting, RL-MARA still obtains competitive performance in the black-box scenario. The results demonstrate that the training method of the surrogate model is sufficient to simulate the vulnerability of the target NRM at different levels of granularity, thus making our multi-granular attack effects transferable.

Training a surrogate model in the out-of-distribution (OOD) setting. In our experiments, the training data of the surrogate model is directly adopted from the Eval queries of the target model, i.e., it follows the same distribution as the queries used to train the target NRM. However, in realistic search scenarios, obtaining an identically distributed query set is difficult. Following [10], we show the results for the IID and OOD scenarios in Figure 6 (Right). Specifically, we take the Mixture target documents as an example and evaluate the attack performance against RankLLM on MS MARCO. For the IID scenario, we use Eval queries of MS MARCO for surrogate model training. For the OOD scenario, we use Eval queries of Natural Questions (NQ) [48] to train the surrogate model and observe a 25.7% decrease in MRR@10 relative to the IID surrogate model. The results reveal that, despite compromised attack performance when the IID data is unavailable, RL-MARA remains an effective attack method that can identify model vulnerabilities.

Figure 5: Adversarial examples generated by RL-MARA for attacking RankLLM on MS MARCO based on different target documents. More example details are displayed in

Figure 6: (Left): Attack performance changes of RL-MARA against RankLLM on MS MARCO in the white-box setting, compared to black-box setting. (Right): Attack performance changes of RL-MARA against RankLLM on MS MARCO in the OOD scenario, compared to the IID scenario.

   \caption{The online grammar checker, perplexity, and human evaluation results for attacking RankLLM on MS MARCO.}
    \begin{tabular}{l c c  c c   c c}
       Method & Grammar & PPL & Impercept. & \textit{kappa} & Fluency & \textit{Kendall} \\
Original & 59 & 44.2 & 0.90 & 0.52 & 4.57 & 0.65\\
TS  & 67 & 59.7 & 0.06 & 0.53 & 2.42 & 0.73\\
PRADA & 108 & 113.4 & 0.53 & 0.48 & 3.21 & 0.82\\
PLAT  & \phantom{1}98 & \phantom{1}86.3 & 0.62 & 0.58 & 3.30 & 0.74 \\
PAT & \phantom{1}83 & \phantom{1}72.1 & 0.58 & 0.61 & 3.42 & 0.76\\
IDEM & \phantom{1}67 & \phantom{1}55.6 & 0.79 & 0.46 & 3.75 & 0.82 \\
RL-MARA$_\mathrm{-NC}$ & 103 & \phantom{1}89.5 & 0.58 & 0.47 & 2.97 & 0.69 \\
RL-MARA & \phantom{1}65 & \phantom{1}53.5 & 0.87 & 0.58 & 3.90 & 0.76\\

6.3 Naturalness evaluation

Next, we report on the naturalness of generated adversarial examples against RankLLM on MS MARCO, with similar findings on ClueWeb09. For comparison, we also evaluate a variant of RL-MARA with the balance hyper-parameter \(\beta\) set to \(0\), denoted as RL-MARA\(_\mathrm{-NC}\).

Grammar checking, PPL, and human evaluation. Table [table:human32evaluation] lists the results of the automatic grammar checker, PPL, and human evaluation. For human evaluation, we recruit five annotators to annotate 32 randomly sampled Mixture adversarial examples from each attack method [10]. The annotators score Fluency from 1 to 5, where higher scores indicate more fluent examples, and judge Imperceptibility, i.e., whether an example appears attacked (0) or not (1). We also report the annotation consistency (the Kappa value and Kendall’s Tau coefficient) [7], [18]. We observe that:

TS demonstrates subpar performance in all naturalness evaluations because it abruptly inserts the query terms into the document without regard to semantic coherence.

The single-granular attack methods fall short when compared to the original documents. A possible reason is that restricting perturbations a priori to a specific level of granularity carries the risk of generating unnatural perturbations.

Although discarding the fluency constraint may improve attack efficacy, the adversarial examples produced by RL-MARA\(_\mathrm{-NC}\) tend to arouse suspicion.

RL-MARA outperforms the baselines, demonstrating the effectiveness of the naturalness reward provided by the LLM.

   \caption{The detection rate (\%) via a representative anti-spamming method for attacking RankLLM on MS MARCO.}
    \begin{tabular}{l c c c c}
       Threshold & 0.08 & 0.06 & 0.04 & 0.02 \\ 
TS  & 40.3 & 53.4 & 76.3 & 94.5 \\
PRADA  & 13.4 & 22.6 & 40.1 & 64.2 \\
PLAT   & 11.6 & 20.6 & 32.5 & 58.7 \\
PAT   & \phantom{1}9.3 & 15.2 & 26.4 & 49.8 \\
IDEM   & 17.2 & 29.4 & 48.3 & 72.9 \\
RL-MARA$_\mathrm{-NC}$ & 15.8 & 26.3 & 42.4 & 68.6 \\
RL-MARA & \phantom{1}\textbf{6.2} & \textbf{10.6} & \textbf{20.2} & \textbf{40.3} \\

Spamicity detection. Table [table:anti-spamming] shows the automatic spamicity detection results on Mixture documents, with similar findings on other target documents. If the spamicity score of a document exceeds the detection threshold, it is identified as suspected spam content. Despite its effective attack performance, IDEM is easier to detect than the other single-granular attack baselines (PRADA, PLAT, and PAT) because it does not avoid query terms when prompting the language model to generate perturbed sentences. RL-MARA outperforms the baselines significantly (p-value \(\leq 0.05\)), demonstrating that the similarity reward provided by an LLM is effective in preventing the attack from overusing query terms and thus arousing suspicion.
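
The thresholding rule can be illustrated with the short sketch below. It assumes that a spamicity score has already been computed for each after-attack document by a detector such as OSD; the scores in the usage note are made up for illustration and the function name is hypothetical.

```python
def detection_rate(spamicity_scores, threshold):
    """Percentage of documents whose spamicity score exceeds the threshold."""
    if not spamicity_scores:
        raise ValueError("need at least one score")
    flagged = sum(1 for s in spamicity_scores if s > threshold)
    return 100.0 * flagged / len(spamicity_scores)
```

With hypothetical scores `[0.01, 0.03, 0.05, 0.07, 0.09]`, one document is flagged at a threshold of 0.08 and four are flagged at 0.02, which mirrors why the detection rate rises as the threshold is lowered.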

6.4 Case study

We sample a query with three target low-ranking documents for attacking RankLLM on MS MARCO; adversarial examples generated by RL-MARA are shown in Figure 5. We observe that only document D2020339 uses sentence-level perturbations, likely because this document organizes its sentences in a clear structure, which makes a key sentence easy for RL-MARA to identify and to replace naturally with query-relevant content. Document D1307681 exclusively employs phrase-level perturbations. This is probably because this document compares medications with a list of phrases, allowing RL-MARA to focus on the phrases that affect the NRM's judgment and mislead it into perceiving relevance to the query. These examples illustrate that RL-MARA can flexibly deploy multi-granular perturbations, ranging from single to multiple levels, for various documents to create natural and effective adversarial examples.

7 Related Work

Neural ranking models. Ranking models have evolved from early heuristic models [49] to probabilistic [50], [51], and modern learning-to-rank models [52], [53]. NRMs [2], [3], [5] emerged with deep learning and demonstrated excellent ranking performance through powerful relevance modeling capabilities. Research has also explored integrating potent pre-trained language models into ranking tasks, including tailoring specific pre-training [43], fine-tuning [54], and distilling techniques [44], leading to state-of-the-art performance [55]. However, these NRMs also exhibit disconcerting adversarial vulnerabilities inherited from neural networks [6], [23], [56], [57].

Adversarial ranking attacks. With the development of deep learning, many fields such as retrieval augmentation [58], recommender systems [59], [60], and knowledge graphs [61], [62] have begun to focus on the robustness challenge. In the field of IR, the challenge of black-hat search engine optimization (SEO) has been significant since the inception of (web) search engines [9]. Typically, the goal of black-hat SEO is to boost a page’s rank by maliciously manipulating documents in a way that is unethical or non-compliant with search engine guidelines [63], [64]. It usually causes erosion of search quality and deterioration of the user experience. To safeguard NRMs from exploitation, research has focused on adversarial ranking attacks [7], [8], in order to expose the vulnerability of NRMs. The goal of adversarial ranking attacks is to manipulate the target document through imperceptible perturbations to improve its ranking for an individual or a small group of target queries [7], [18]. Previous studies have mainly explored attacks against NRMs with a single granularity of perturbations, e.g., word-level [6], [7], [18], [23], [65] and sentence-level [8], [10], [18], [66]. However, these efforts lack the flexibility needed for the comprehensive exploitation of NRM vulnerabilities. In this paper, we launch attacks with perturbations at multiple levels of granularity.

Multi-granular adversarial attack. Multi-granular attacks, in contrast to single-granular ones, engage different levels of perturbation granularity [20]. In CV [67], [68] and NLP [69][71], they exploit model vulnerabilities by freely combining varying perturbation granularities to produce threatening adversarial examples. However, in NLP, existing multi-granular attack methods are mainly confined to selecting one granularity from several options in a predetermined manner [69], [70]. Unlike these studies, we combine perturbations at multiple levels of granularity, allowing them to coexist in an adversarial example to enhance the effectiveness of adversarial ranking attacks.

8 Conclusion

In this work, we investigated a multi-granular ranking attack framework against black-box NRMs. We modeled multi-granular attacks as a sequential decision-making process and proposed a reinforcement learning-based framework, RL-MARA, to address it. Our extensive experimental results reveal that the proposed method can effectively boost the target document through multi-granularity perturbations with imperceptibility.

In future work, we would like to explore, from a theoretical perspective, how to efficiently address the combinatorial explosion problem caused by multi-granular attacks. Beyond the current single-granular attack methods, we will explore integrating more levels of granularity and different attack methods into our framework and making them learnable. Our method proves effective against NRMs distilled from LLMs, as exemplified by RankLLM. However, directly attacking LLMs as rankers and, in turn, leveraging their advanced capabilities for attacks are promising avenues for future research.

This work was funded by the Strategic Priority Research Program of the CAS under Grants No. XDB0680102, the National Key Research and Development Program of China under Grants No. 2023YFA1011602 and 2021QY1701, the National Natural Science Foundation of China (NSFC) under Grants No. 62372431, the Youth Innovation Promotion Association CAS under Grants No. 2021100, the Lenovo-CAS Joint Lab Youth Scientist Project, and the project under Grants No. JCKY2022130C039. This work was also (partially) funded by the Hybrid Intelligence Center, a 10-year program funded by the Dutch Ministry of Education, Culture and Science through the Netherlands Organisation for Scientific Research, project LESSEN with project number NWA.1389.20.183 of the research program NWA ORC 2020/21, which is (partly) financed by the Dutch Research Council (NWO), and the FINDHR (Fairness and Intersectional Non-Discrimination in Human Recommendation) project that received funding from the European Union’s Horizon Europe research and innovation program under grant agreement No 101070212.

All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.

