In a multi-follower Bayesian Stackelberg game, a leader plays a mixed strategy over \(L\) actions to which \(n\ge 1\) followers, each having one of \(K\) possible private types, best respond. The leader’s optimal strategy depends on the distribution of the followers’ private types. We study an online learning version of this problem: a leader interacts over \(T\) rounds with \(n\) followers whose types are sampled from an unknown distribution in every round. The leader’s goal is to minimize regret, defined as the difference between the cumulative utility of the optimal strategy and that of the actually chosen strategies. We design learning algorithms for the leader under different feedback settings. Under type feedback, where the leader observes the followers’ types after each round, we design algorithms that achieve \(\mathcal{O}\big(\sqrt{\min\{L\log(nKA T), ~ nK \} \cdot T} \big)\) regret for independent type distributions and \(\mathcal{O}\big(\sqrt{\min\{L\log(nKA T), ~ K^n \} \cdot T} \big)\) regret for general type distributions. Interestingly, these bounds do not grow polynomially with \(n\). Under action feedback, where the leader only observes the followers’ actions, we design algorithms with \(\mathcal{O}( \min\{\sqrt{ n^L K^L A^{2L} L T \log T}, ~ K^n\sqrt{ T } \log T \} )\) regret. We also provide a lower bound of \(\Omega(\sqrt{\min\{L, ~ nK\}T})\), almost matching the type-feedback upper bounds.
Stackelberg games are a fundamental model of strategic interaction in multi-agent systems. Unlike normal-form games where all agents simultaneously play their strategies, Stackelberg games model a leader committing to their strategy; the remaining follower(s) take their actions after observing the leader’s commitment [1], [2]. Such asymmetric interactions are ubiquitous in a wide range of settings, from a firm entering a market dominated by an established competitor [2], to an online platform releasing features that influence consumers on that platform [3], [4], to security games [5], [6], to strategic machine learning [7], [8]. They also form the foundation of seminal models in computational economics like Bayesian Persuasion [9] or contract design [10], which capture more structured settings with asymmetries relating to information or payouts, respectively.
In these settings and beyond, there is one key question: what is the optimal strategy for the leader to commit to? Answering this question requires knowing how the follower(s) will react to the leader’s strategy, which typically boils down to knowing the followers’ utilities. The Bayesian approach attempts to relax this complete information assumption. Pioneering works like [1] assume that followers’ utilities are parametrized by hidden types from a known distribution. Here, the leader aims to compute the Bayesian Stackelberg Equilibrium: the strategy maximizing the leader’s expected utility with the followers’ types drawn from the known distribution.
In many of the settings mentioned, even the Bayesian perspective may be too strong and unrealistic: the leader (e.g., online platform, dominant firm) may only know the structure of the followers’ utilities but not the distribution of their types [11]. While not much can be achieved in a single round of such a game, the leader can often interact with the followers over multiple rounds and learn about them over time. The leader must, however, balance learning with playing the optimal strategy given current information – the well-known exploration-exploitation trade-off in the online learning literature.
This paper comprehensively studies the learning and computational problem for an online Bayesian Stackelberg game (BSG). Specifically, we consider the interaction over \(T\) rounds between a leader and \(n\) followers, each realizing one of \(K\) possible private types at each round. To our knowledge, this is the first work on online learning in BSGs with multiple followers. We study two feedback models: observing the realized types of the followers, or observing their best-responding actions, after each round. Our core objective is to explore how these feedback models affect the learnability of the optimal strategy, which is challenging for several reasons. First, with multiple followers, the unknown joint type space is exponentially large. Further, because followers play best-responding actions, the leader’s utility function is discontinuous and non-convex in the leader’s strategy. Lastly, even the offline single-follower version of this problem has known computational challenges [1]. A key technical tool used to unravel this is a geometric characterization of the leader’s strategy space in terms of best-response regions (presented in Section 3). Section 4 uses this and an observation about learning type distributions vis-a-vis learning utility to provide algorithms for both general type distributions and independent ones. An almost matching lower bound is also provided. Section 5 then studies algorithms for the action feedback case, where we leverage our geometric insights along with techniques from linear bandits. Table 1 summarizes our results. Throughout, we comment on the computational complexity of our algorithms and uncover interesting trade-offs that situate our work within the broader literature on Stackelberg games.
Table 1: Summary of our regret upper and lower bounds.

| | Type Feedback (Independent types) | Type Feedback (General types) | Action Feedback |
| Upp. Bound | \(\widetilde{\mathcal{O}}(\sqrt{\min\{L,\, nK \}T})\) | \(\widetilde{\mathcal{O}}(\sqrt{\min\{L,\, K^n\}T})\) | \(\widetilde{\mathcal{O}}( \min\{\sqrt{n^L K^L A^{2L} L},\, K^n\} \sqrt{T} )\) |
| Low. Bound | \(\Omega(\sqrt{\min\{L,\, nK\}T})\) | \(\Omega(\sqrt{\min\{L,\, nK\}T})\) | \(\Omega(\sqrt{\min\{L,\, nK\} T })\) |
Our work contributes to the growing literature on the computational and learning aspects of Stackelberg games [1], [12]–[14]. In particular, [15]–[17] study learning in single-follower non-Bayesian Stackelberg games. Like our work, they assume that the follower myopically best responds in each round. However, they assume the follower’s utility function to be fixed but unknown, whereas we consider a Bayesian setting in which followers have unknown stochastic types that parameterize a known utility function.
Closer to our work, [5], [18] design online learning algorithms with \(\textrm{poly}(K)\sqrt{T}\) regret for Bayesian Stackelberg games with a single follower whose type distribution is unknown, while [19] obtain \(\widetilde{\mathcal{O}}(K^{3n/2} \sqrt{T})\) regret for the multi-receiver Bayesian persuasion problem (which is similar to a multi-follower Bayesian Stackelberg game) via a reduction to the adversarial linear bandit problem. Adopting these techniques would lead to a \(\textrm{poly}(K^n)\sqrt{T}\) regret bound in our multi-follower setting, which is exponential in the number of followers \(n\) (see details in Section 5) and undesirable when there are many followers. Using a different approach, we design an algorithm with \(\widetilde{\mathcal{O}}(\sqrt{n^L K^L A^{2L}LT})\) regret, which is better when the number of leader actions \(L\) is small compared to \(n\). The exponential dependence on \(L\) is unavoidable from a computational perspective, as [1] show that BSGs are NP-hard to solve with respect to \(L\). Our algorithm combines the Upper Confidence Bound (UCB) principle with a partition of the leader’s strategy space into best-response regions, which is, to our knowledge, a novel approach.
The online Bayesian Stackelberg game can be seen as a piecewise-linear stochastic bandit problem. While linear stochastic bandit problems have been well studied [20]–[22], piecewise linearity brings additional challenges. [23] study a single-dimensional piecewise-linear stochastic bandit problem with unknown pieces; in contrast, we have known pieces but a multi-dimensional space, so the techniques and results of that work and ours are not directly comparable.
We consider the interactions between a single leader and \(n\ge 1\) followers. The leader has \(L \geq 2\) actions, denoted by \(\mathcal{L}= [L] = \{1, \ldots, L\}\), and chooses a mixed strategy \(x \in \Delta(\mathcal{L})\) over them, where \(\Delta(\mathcal{L})\) is the space of probability distributions over the action set. We use \(x(\ell)\) to denote the probability of the leader playing action \(\ell \in \mathcal{L}\). Each follower has a finite action set \(\mathcal{A} = [A]\). We represent the joint action of the \(n\) followers as \(\boldsymbol{a} = (a_1, ..., a_n)\). Each follower \(i\) also has a private type \(\theta_i \in \Theta = [K]\), with the vector of all follower types denoted by \(\boldsymbol{\theta}= (\theta_1, ..., \theta_n) \in \Theta^{n}\). We consider a Bayesian setting where this type vector is drawn from a distribution \(\boldsymbol{\mathcal{D}}\) (i.e. \(\boldsymbol{\theta}\sim \boldsymbol{\mathcal{D}}\)), with \(\mathcal{D}_i\) denoting the marginal distribution of \(\theta_i\). The properties of this joint distribution play a key role in our results. We will consider two scenarios:
Independent type distributions: The followers’ types are independent: \(\boldsymbol{\mathcal{D}} = \mathcal{D}_1 \times \cdots \times \mathcal{D}_n\).
General type distributions: The followers’ types can be arbitrarily correlated.
If the leader selects action \(\ell\) and the followers select joint action \(\boldsymbol{a}\), the leader receives utility \(u(\ell, \boldsymbol{a}) \in [0, 1]\) and each follower \(i\) receives utility \(v_i(\ell, a_i, \theta_i) \in [0, 1]\). Observe that each follower’s utility depends only on their own action and type, alongside the leader’s action; it does not depend on the actions of other followers.2 For a mixed strategy \(x \in \Delta(\mathcal{L})\) and followers’ actions \(\boldsymbol{a}\), the leader’s expected utility is given by \(u(x, \boldsymbol{a}) = \mathbb{E}_{ \ell \sim x } [u(\ell, \boldsymbol{a})] = \sum_{\ell \in \mathcal{L}} x(\ell) u(\ell, \boldsymbol{a})\). Likewise, the \(i\)th follower’s expected utility under \(x\) is \(v_i(x, a_i, \theta_i) = \mathbb{E}_{ \ell \sim x } [ v_i(\ell, a_i, \theta_i) ]\). We assume that the leader knows each follower’s utility function but not their private types.
An instance of a multi-follower Bayesian Stackelberg game is defined by the tuple \(I = (n, L, A, K, u, v, \boldsymbol{\mathcal{D}} )\). In this game, the leader first commits to a mixed strategy \(x\) without knowledge of the followers’ types. The follower types are then jointly realized from \(\boldsymbol{\mathcal{D}}\), and each follower selects a best-responding action based on the leader’s strategy. It is without loss of generality to consider followers choosing pure actions, since follower utilities are independent of one another.
Definition 1 (Followers’ Best Response). For a leader’s mixed strategy \(x\), the best response of a follower \(i\) with realized type \(\theta_i\) is given by \(\mathrm{br}_{i}(\theta_i, x) \in \arg \max_{a \in \mathcal{A}} v_i(x, a, \theta_i)\).3 The vector of best responses is denoted by \({\boldsymbol{\mathrm{br}}} (\boldsymbol{\theta}, x) = (\mathrm{br}_1(\theta_1, x), ..., \mathrm{br}_n(\theta_n, x))\).
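To make Definition 1 concrete, the following minimal Python sketch (not from the paper) computes a single follower's best response, assuming the follower's utility is stored as a NumPy array `v[l, a, k]` \(= v_i(\ell, a, k)\) and the leader's strategy `x` is a length-\(L\) probability vector; ties are broken toward the lowest-index action.

```python
import numpy as np

def best_response(v: np.ndarray, x: np.ndarray, theta: int) -> int:
    """Return argmax_a of sum_l x[l] * v[l, a, theta] (ties broken toward the lowest index)."""
    expected = x @ v[:, :, theta]  # shape (A,): expected utility of each follower action
    return int(np.argmax(expected))

# Example with L = 2 leader actions, A = 3 follower actions, K = 2 types.
rng = np.random.default_rng(0)
v = rng.random((2, 3, 2))          # v[l, a, k]: follower utility
x = np.array([0.6, 0.4])           # a leader mixed strategy over L = 2 actions
print(best_response(v, x, theta=0))
```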
Let \(U_{ \boldsymbol{\mathcal{D}} }(x) = \mathbb{E}_{\boldsymbol{\theta}\sim \boldsymbol{\mathcal{D}} } [ u(x, {\boldsymbol{\mathrm{br}}} (\boldsymbol{\theta}, x))]\) denote the leader’s expected utility when the leader commits to mixed strategy \(x\), the followers have their types drawn from \(\boldsymbol{\mathcal{D}}\) and best respond.
Definition 2 (Leader’s Optimal Strategy). For a joint follower type distribution \(\boldsymbol{\mathcal{D}}\), the leader’s optimal strategy, also known as the Stackelberg Equilibrium, is given as follows: \[\begin{align} x^* ~ \in ~ \mathop{\mathrm{arg\,max}}_{x \in \Delta(\mathcal{L})} U_{ \boldsymbol{\mathcal{D}} }(x) ~ = ~ \mathop{\mathrm{arg\,max}}_{x \in \Delta(\mathcal{L})}{\mathbb{E}_{\boldsymbol{\theta}\sim \boldsymbol{\mathcal{D}} } [ u(x, {\boldsymbol{\mathrm{br}}} (\boldsymbol{\theta}, x))]}. \end{align}\]
When the leader knows the distribution \(\boldsymbol{\mathcal{D}}\), they can compute the optimal strategy by solving the optimization problem specified in Definition 2. Indeed, this is the premise of [1]. Our work examines an online learning model where the leader does not know the type distribution \(\boldsymbol{\mathcal{D}}\) a priori; instead, the leader must learn the optimal strategy through feedback from repeated interactions with followers over \(T\) rounds.
We examine two feedback models. In the type feedback setting, the leader observes the types \(\boldsymbol{\theta}^{t}\) of the followers after each round \(t\), whereas in the action feedback setting, the leader only observes the actions \(\boldsymbol{a}^{t}\) of the followers. Note that type feedback is strictly more informative than action feedback since the followers’ actions can be inferred from their types by computing their best responses (Definition 1). We summarize the interactions at a given round \(t\) as follows:
The leader chooses a strategy \(x^t \in \Delta(\mathcal{L})\).
Follower types for this round are realized: \(\boldsymbol{\theta}^t \sim \boldsymbol{\mathcal{D}}\).
Followers take their best-responding actions \(\boldsymbol{a}^t = {\boldsymbol{\mathrm{br}}} (\boldsymbol{\theta}^t, x^t)\) and the leader gets utility \(u(x^t, \boldsymbol{a}^t)\).
Under type feedback, the leader observes the type profile \(\boldsymbol{\theta}^{t}\). Under action feedback, the leader observes only the followers’ actions \(\boldsymbol{a}^{t}\).
The leader deploys a learning algorithm (based on past feedback) to select strategy \(x^t\) for every round \(t\). We study learning algorithms that minimize the cumulative regret with respect to the optimal equilibrium strategy (Definition 2). Formally defined below, minimizing this objective requires carefully balancing exploration of the strategy space against not playing sub-optimal strategies too often.
Definition 3. The regret of a learning algorithm that selects strategy \(x^t\) at round \(t \in [T]\) is: \[\mathrm{Reg}(T) = \sum_{t=1}^{T}\mathbb{E}_{\boldsymbol{\theta}^t \sim \boldsymbol{\mathcal{D}} }{\Big[ u(x^*, {\boldsymbol{\mathrm{br}}} (\boldsymbol{\theta}^t, x^*)) - u(x^t, {\boldsymbol{\mathrm{br}}} (\boldsymbol{\theta}^t, x^t)) \Big]} = \sum_{t=1}^{T}\Big(U_{ \boldsymbol{\mathcal{D}} }(x^*) - U_{ \boldsymbol{\mathcal{D}} }(x^t) \Big).\]
Note that \(\mathrm{Reg}(T)\) is a random variable, because the selection of \(x^t\) depends on the past type realizations \(\boldsymbol{\theta}^{1}\), ..., \(\boldsymbol{\theta}^{t - 1}\). We aim to minimize the expected regret \(\mathbb{E}[ \mathrm{Reg}(T) ]\).
Lastly, our model assumes followers behave myopically, selecting their best actions based only on the leader’s current strategy, without considering future rounds. This is consistent with the related literature [15]–[17] and well-motivated in settings like online platforms or security games where followers maximize their immediate utility.
Since the followers’ best-responding actions are sensitive to the leader’s strategy \(x\), the leader’s expected utility function \(U_{ \boldsymbol{\mathcal{D}} }(x)\) is discontinuous in \(x\). This presents a key challenge to both learning and optimizing over the leader’s strategy space. To overcome this challenge, we first show that the leader’s strategy space \(\Delta(\mathcal{L})\) can be partitioned into a polynomial number of non-empty best-response regions (followers have the same best-response actions within each region). While the notion of best-response regions has been proposed by prior works [5], [16], [17], [26], those works consider single-follower cases. With multiple followers, we will argue that the number of such regions does not increase exponentially in the number of followers \(n\) (Lemma 2) – a key property to be used in later sections.
To build intuition, we first consider the leader playing against a single follower \((n=1)\). The follower has a utility function \(v(\ell, a, \theta)\) and a type \(\theta \in \Theta=[K]\) drawn from distribution \(\mathcal{D}\). Next, let \(w: \Theta \to \mathcal{A}\) be a mapping from follower type to action – i.e. \(w(\theta)\) specifies an action for type \(\theta\). For such a mapping \(w\), let \(R(w) \subseteq \Delta(\mathcal{L})\) be the set of leader strategies under which the follower’s best response action \(\mathrm{br}(\theta, x)\) is equal to \(w(\theta)\) for every type \(\theta \in \Theta\). Formally: \[\begin{align} R(w) & ~ = ~ \Big\{ x \in \Delta(\mathcal{L}) ~ \big| ~ \mathrm{br}(\theta, x) = w(\theta), \,\, \forall \theta \in \Theta \Big\} \nonumber \\ & ~ = ~ \Big\{ x \in \Delta(\mathcal{L}) ~ \big| ~ v(x, w(\theta), \theta) \ge v(x, a', \theta), \,\, \forall \theta \in \Theta, \forall a' \in \mathcal{A}\Big\} \nonumber \end{align}\]
where we recall that for any action \(a\), \(v(x, a, \theta) = \sum_{\ell \in \mathcal{L}}{x(\ell)v(\ell, a, \theta)}\). The set \(R(w)\) is defined as the best-response region for mapping \(w\). This region can also equivalently be defined as the intersection of several halfspaces (see Figure 4 in Appendix 7 for a visual). In particular, let \[\begin{align} d_{ \theta, a, a'} = \big[ v(1, a, \theta) - v(1, a', \theta) \,\, , \,\, \dots \,\, , \,\, v(L, a, \theta) - v(L, a', \theta) \big]^T \in \mathbb{R}^L \end{align}\] denote the “advantage” of follower type \(\theta\) taking action \(a\) over \(a'\) at each of the \(L\) possible leader actions. Then the halfspace \(H(d_{\theta, a, a'}) = \big\{ x \in \Delta(\mathcal{L}) \,|\, \langle x, d_{\theta, a, a'} \rangle \ge 0 \big\}\) contains all the leader strategies under which the follower with type \(\theta\) prefers action \(a\) over \(a'\). Thus, the best-response region is \(R(w) ~ = ~ \bigcap_{\theta \in \Theta, a \in \mathcal{A}} H(d_{\theta, w(\theta), a})\), the intersection of \(|\Theta| \cdot |\mathcal{A}| = KA\) halfspaces.
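The halfspace description above translates directly into a membership test. Below is a minimal sketch (illustrative, not the paper's code) that builds the advantage vectors \(d_{\theta, a, a'}\) and checks whether a strategy \(x\) lies in \(R(w)\), assuming the single follower's utility is stored as a NumPy array `v[l, a, k]` and the mapping \(w\) as an integer array `w[k]`.

```python
import numpy as np

def advantage(v: np.ndarray, theta: int, a: int, a_prime: int) -> np.ndarray:
    """d_{theta, a, a'}[l] = v(l, a, theta) - v(l, a', theta)."""
    return v[:, a, theta] - v[:, a_prime, theta]

def in_region(v: np.ndarray, w: np.ndarray, x: np.ndarray) -> bool:
    """Check x in R(w): <x, d_{theta, w(theta), a'}> >= 0 for every type theta and action a'."""
    L, A, K = v.shape
    return all(
        float(x @ advantage(v, theta, int(w[theta]), a_prime)) >= 0.0
        for theta in range(K) for a_prime in range(A)
    )
```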
We generalize the intuitions from the single-follower case to the multi-follower case. Let \(W = (w_1, \ldots, w_n)\) be a tuple of \(n\) mappings, where each \(w_i : \Theta \to \mathcal{A}\) is the best-response mapping for follower \(i\). Alternatively, one can think of \(W\) as a matrix, \(W = [w_{ik}] \in \mathcal{A}^{n \times K}\), where each entry \(w_{ik}\) records the best-response action for follower \(i\) if their type is \(\theta_i = k\). Given a joint type \(\boldsymbol{\theta}= (\theta_1, \ldots, \theta_n)\) of all followers, we use \(W(\boldsymbol{\theta}) = (w_1(\theta_1), \ldots, w_n(\theta_n)) \in \mathcal{A}^n\) to denote the joint action of all followers under this type profile. We generalize the notion of best-response region \(R(w)\) from the single-follower case to the multi-follower case:
Definition 4 (Best-Response Region). For a matrix \(W \in \mathcal{A}^{n \times K}\), the best-response region for \(W\) is the set of leader strategies under which the followers’ best responses are given by \(W\): \[\begin{align} R(W) & ~ = ~ \Big\{ x \in \Delta(\mathcal{L}) ~ \big| ~ {\boldsymbol{\mathrm{br}}} (\boldsymbol{\theta}, x) = W(\boldsymbol{\theta}), \quad \forall \boldsymbol{\theta}\in \Theta^n \Big\}. \end{align}\]
As in the single-follower case, \(R(W)\) can be expressed as the intersection of multiple halfspaces: \(R(W) = \bigcap_{i \in [n], \theta_i \in \Theta, a_i \in \mathcal{A}} H(d_{\theta_i, w_i(\theta_i), a_i})\).
We make an important observation: the leader’s expected utility function \(U_{ \boldsymbol{\mathcal{D}} }(x)\) is linear in \(x\) within each non-empty best-response region. By definition, for all \(\boldsymbol{\theta}\in \Theta^n\) and \(x \in R(W)\), we have \({\boldsymbol{\mathrm{br}}} (\boldsymbol{\theta}, x) = W(\boldsymbol{\theta})\). So, \[\begin{align} U_{ \boldsymbol{\mathcal{D}} }(x) = \sum_{\boldsymbol{\theta}\in \Theta^n} \boldsymbol{\mathcal{D}}(\boldsymbol{\theta}) u(x, \boldsymbol{\mathrm{br}}(\boldsymbol{\theta}, x)) = \sum_{\boldsymbol{\theta}\in \Theta^n} \boldsymbol{\mathcal{D}} (\boldsymbol{\theta}) \sum_{\ell \in \mathcal{L}} x(\ell) u(\ell, W(\boldsymbol{\theta})) = \sum_{\boldsymbol{\theta}\in \Theta^n} \boldsymbol{\mathcal{D}} (\boldsymbol{\theta}) \langle x, z_{W, \boldsymbol{\theta}} \rangle. \end{align}\] where \(z_{W, \boldsymbol{\theta}}\) is the \(L\)-dimensional vector \(z_{W, \boldsymbol{\theta}} = (u(1, W(\boldsymbol{\theta})), ..., u(L, W(\boldsymbol{\theta})))\). So, we conclude that the leader’s expected utility is linear within each region \(R(W)\):
Lemma 1. For each \(W\), the leader’s expected utility function \(U_{ \boldsymbol{\mathcal{D}} }(x)\) is linear in \(x \in R(W)\).
Although \(U_{ \boldsymbol{\mathcal{D}} }(x)\) is linear within each best-response region, it could be non-linear and even discontinuous across different best-response regions.
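As a concrete illustration of Lemma 1, the sketch below evaluates \(U_{\boldsymbol{\mathcal{D}}}(x) = \langle x, \sum_{\boldsymbol{\theta}} \boldsymbol{\mathcal{D}}(\boldsymbol{\theta}) z_{W,\boldsymbol{\theta}} \rangle\) for a strategy inside a given region. The interface is hypothetical: `u_fn(l, joint_action)` stands for the leader's utility, `W[i][k]` for the mapping, and `D` for the joint distribution as a dictionary over type profiles.

```python
import numpy as np

def expected_utility_in_region(u_fn, W, D, x, L, n):
    """U_D(x) = <x, sum_theta D(theta) * z_{W,theta}> for any x in R(W)."""
    z_bar = np.zeros(L)
    for theta, p in D.items():                    # theta is a length-n type profile
        joint_action = tuple(int(W[i][theta[i]]) for i in range(n))
        z_bar += p * np.array([u_fn(l, joint_action) for l in range(L)])
    return float(x @ z_bar)
```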
Let \(\mathcal{ W } = \{ W \in \mathcal{A}^{n\times K} ~ | ~ R(W)\ne \emptyset \}\) denote the set of mappings \(W\) for which the corresponding best-response region \(R(W)\) is non-empty. Although the total number of mappings \(W \in \mathcal{A}^{n\times K}\) is \(A^{nK}\), the number of non-empty best-response regions is significantly smaller, especially when \(L\) (the number of leader actions) is treated as a constant. The exact characterization is given below. The proof (in Appendix 7) uses a result in computational geometry regarding the number of non-empty regions obtained by dividing \(\mathbb{R}^L\) using \(\mathcal{O}(nKA^2)\) hyperplanes.
Lemma 2. The number of non-empty best-response regions, \(| \mathcal{ W } |\), is \(\mathcal{O}(n^L K^L A^{2L})\).
For any algorithm to leverage these best response regions, it is imperative that these regions can be enumerated efficiently. The following lemma shows this is always possible. Intuitively, we construct a graph where the nodes represent non-empty best-response regions and an edge exists between \(W, W' \in \mathcal{ W }\) if and only if \(W\) and \(W'\) differ in exactly one entry. Traversing an edge, therefore, corresponds to moving to an adjacent best-response region by crossing a single hyperplane boundary. We show that this graph is always connected and can thus be efficiently traversed using breadth-first search. The exact algorithm and proof of Lemma 3 are in Appendix 7.
Lemma 3. The set of non-empty best-response regions \(\{R(W): W \in \mathcal{ W } \}\) can be enumerated in \(\mathrm{poly}(n^L,K^L,A^L, L)\) time.
We now show that the optimal strategy within each region can be efficiently computed. Recall from Definition 2 that, when given the followers’ type distribution \(\boldsymbol{\mathcal{D}}\), computing the leader’s optimal strategy requires solving \(\max_{x \in \Delta(\mathcal{L})}{\mathbb{E}_{\boldsymbol{\theta}\sim \boldsymbol{\mathcal{D}} } [ u(x, {\boldsymbol{\mathrm{br}}} (\boldsymbol{\theta}, x))]}\). Since the leader’s utility is linear within a region \(R(W)\), the optimal solution within \(R(W)\) can be computed by the following linear program: \[\begin{align} \label{eq:optimal-empirical-x-regions} & \max_{x \in R(W)} \sum_{\boldsymbol{\theta}\in \Theta^n}{ \boldsymbol{\mathcal{D}} (\boldsymbol{\theta})u(x, {\boldsymbol{\mathrm{br}}} (\boldsymbol{\theta}, x))} = \max_{x \in R(W)} \sum_{\boldsymbol{\theta}\in \Theta^n}{ \boldsymbol{\mathcal{D}} (\boldsymbol{\theta})u(x, W(\boldsymbol{\theta}))} \end{align}\tag{1}\] where \(x\in R(W)\) is given by the following set of linear constraints: \[\begin{align} \begin{cases} \sum_{\ell \in \mathcal{L}} x(\ell) \big[ v_i(\ell, w_i(\theta_i), \theta_i) - v_i(\ell, a'_i, \theta_i) \big] \ge 0, ~~ \forall i\in[n], ~ \forall \theta_i \in \Theta, ~ \forall a'_i \in \mathcal{A}, \\ x(\ell) \ge 0, ~ \forall \ell \in \mathcal{L}, ~ \text{ and }~ \sum_{\ell \in \mathcal{L}} x(\ell) = 1. \end{cases} \label{eq:opt95strat95cons} \end{align}\tag{2}\] There are \(\mathcal{O}(nKA)\) constraints, each involving a sum over \(L\) elements, while the objective sums over \(K^n\) possible type profiles. Although this is exponential in \(n\), any complete-information instance must already specify the joint type distribution \(\boldsymbol{\mathcal{D}} \in [0,1]^{K^n}\) as part of its input. Thus, the time to compute the optimal solution within each region is polynomial in the input size.
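A minimal sketch of this linear program, using `scipy.optimize.linprog`; it is only illustrative and assumes small \(n\) and \(K\) (so that the \(K^n\) type profiles can be enumerated), with `u_fn(l, joint_action)` denoting the leader's utility, `v[i][l, a, k]` follower \(i\)'s utility, and `D` the joint distribution as a dictionary over type profiles.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def optimal_in_region(u_fn, v, W, D, L, A, K, n):
    """Solve the LP (1)-(2): max over x in R(W) of sum_theta D(theta) * u(x, W(theta))."""
    # Linear objective c[l] = sum_theta D(theta) * u(l, W(theta)); linprog minimizes, so negate.
    c = np.zeros(L)
    for theta in itertools.product(range(K), repeat=n):
        p = D.get(theta, 0.0)
        if p == 0.0:
            continue
        joint_action = tuple(int(W[i][theta[i]]) for i in range(n))
        c += p * np.array([u_fn(l, joint_action) for l in range(L)])
    # Constraints (2): <x, d_{theta_i, w_i(theta_i), a'}> >= 0, rewritten as A_ub @ x <= 0.
    rows = [-(v[i][:, int(W[i][k]), k] - v[i][:, a, k])
            for i in range(n) for k in range(K) for a in range(A)]
    res = linprog(-c, A_ub=np.array(rows), b_ub=np.zeros(len(rows)),
                  A_eq=np.ones((1, L)), b_eq=[1.0], bounds=[(0, None)] * L)
    return (res.x, -res.fun) if res.success else (None, float("-inf"))  # infeasible => empty region
```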
The above results imply that, given the distribution \(\boldsymbol{\mathcal{D}}\), the optimal leader strategy in BSGs can be computed efficiently when the number \(L\) of leader actions is small. This is because the optimal strategy within each best-response region \(R(W)\) can be computed efficiently by the linear program (1), the overall optimal strategy is the maximum over all non-empty best-response regions, and there are at most \(\mathcal{O}(n^L K^L A^{2L})\) such regions by Lemma 2. In comparison, [1] prove that the optimal strategy is NP-hard to compute in BSGs when \(L\) grows; we have shown above that BSGs are polynomial-time solvable for constant \(L\).
We now address the core problem of learning the optimal leader strategy from online feedback. This section considers the type-feedback setting, where the leader observes each follower’s realized type \(\boldsymbol{\theta}^t = (\theta^t_1, \dots, \theta^t_n)\) at the end of round \(t\). We start with general distributions – that is, the followers’ types can be arbitrarily correlated. Observing types after each round allows us to directly estimate the unknown distribution \(\boldsymbol{\mathcal{D}}\) and compute an optimal strategy accordingly. This is formalized in Algorithm 1:
At first glance, one might expect this algorithm to suffer large regret because the distribution \(\boldsymbol{\mathcal{D}}\), which has support size \(|\Theta^n| = K^n\), is difficult to estimate. Indeed, the estimation error for such a distribution using \(t\) samples is at least \(\Omega\big(\sqrt{\tfrac{K^n}{t}}\big)\) even if \(\boldsymbol{\mathcal{D}}\) is a product distribution (namely, the types are independent) [27]. This suggests that the empirically optimal strategy \(x^t\) might be worse than the true optimal strategy \(x^*\) by \(\Omega\big(\sqrt{\tfrac{K^n}{t}}\big)\), which would cause \(\Omega(\sqrt{K^n T})\) regret over \(T\) rounds in total. As we show in Theorem 1, one analysis of Algorithm 1 yields exactly this rate as a regret upper bound. The proof (in Appendix 8.2) upper bounds the single-round regret by the total variation (TV) distance between the empirical distribution \(\hat{ \boldsymbol{\mathcal{D}} }^t\) and the true distribution \(\boldsymbol{\mathcal{D}}\).
While this suggests that \(\mathcal{O}(\sqrt{K^nT})\) regret might be tight, this is interestingly not true when \(n\) is large! That is, the intuitive lower bound that arises from the estimation error for distribution \(\boldsymbol{\mathcal{D}}\) is not correct. Although the empirical type distribution can differ significantly from the true type distribution, the empirical utility of any strategy \(x \in \Delta(\mathcal{L})\) is actually concentrated around the true expected utility of \(x\) with high probability. We formalize this below:
Lemma 4. Given \(t\) samples \(\boldsymbol{\theta}^1, \ldots, \boldsymbol{\theta}^t\) from distribution \(\boldsymbol{\mathcal{D}}\), let \(\hat{U}^t(x) = \frac{1}{t} \sum_{s=1}^{t} u(x, {\boldsymbol{\mathrm{br}}} (\boldsymbol{\theta}^s, x))\) be the empirical expected utility of a strategy \(x \in \Delta(\mathcal{L})\) computed on the \(t\) samples. Recall that \(U_{ \boldsymbol{\mathcal{D}} }(x) = \mathbb{E}_{\boldsymbol{\theta}\sim \boldsymbol{\mathcal{D}} }[ u(x, {\boldsymbol{\mathrm{br}}} (\boldsymbol{\theta}, x)) ]\) denotes the true expected utility of \(x\). With probability at least \(1 - \delta\), we have: for all \(x \in \Delta(\mathcal{L})\), \(\big| U_{ \boldsymbol{\mathcal{D}} }(x) - \hat{U}^t(x) \big| \le \mathcal{O}\Big( \sqrt{\tfrac{L \log t}{t}} + \sqrt{\tfrac{L \log(nKA) + \log(1/\delta)}{t}} \Big).\)
Note that the above concentration result holds for all strategies \(x \in \Delta(\mathcal{L})\) simultaneously, instead of for a single fixed strategy (for which it would easily follow from Hoeffding’s inequality). The proof of Lemma 4 (in Appendix 8.1) relies on an analysis of the pseudo-dimension of the leader’s utility functions and leverages insights about the best-response regions (discussed in Section 3). This result means that the simple Algorithm 1 achieves regret of order \(\sqrt{LT}\) with only logarithmic dependence on \(n\), which is better when \(n\) is large and \(L\) is small. This new regret bound, along with the earlier \(\mathcal{O}(\sqrt{K^nT})\) bound, is formalized in Theorem 1 below, with the proof given in Appendix 8.2.
Theorem 1. The type-feedback Algorithm 1 for general type distributions achieves expected regret \(\mathcal{O}\big(\min\big\{\sqrt{LT \cdot \log(nKAT)}, \sqrt{K^nT} \big\} \big)\) and can be implemented in \(\mathrm{poly}((nKA)^LLT)\) time.
Theorem 1 also comments on the runtime of Algorithm 1, which hinges on the computability of \(x^t \in \mathop{\mathrm{arg\,max}}_{x \in \Delta(\mathcal{L})} \sum_{s=1}^{t-1} u(x, {\boldsymbol{\mathrm{br}}} (\boldsymbol{\theta}^s, x))\). Using the techniques developed in Section 3, this maximization can be solved by taking the maximum over the optimal strategies from each non-empty best-response region \(W\in \mathcal{ W }\), computed using the empirical type distribution. Using Lemmas 2 and 3 and the fact that the optimal strategy within a non-empty \(R(W)\) can be solved by the following linear program, we obtain a runtime that is polynomial when \(L\) is constant:4 \[\label{equation:empirical95opt95lp} \max_{W \in \mathcal{ W } } \left\{ \max_{x \in R(W)}{\sum_{s=1}^{t}{u(x, W(\boldsymbol{\theta}^s))}} ~~ \text{subject to the constraints in (\ref{eq:opt95strat95cons})} \right\}.\tag{3}\]
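A minimal sketch of the maximization in (3), assuming a pre-computed enumeration `regions` of the non-empty mappings \(W\) (e.g., via Lemma 3) and a per-region solver `solve_region(W, D_hat)` such as the LP sketch from Section 3; both names are illustrative.

```python
from collections import Counter

def empirically_optimal_strategy(regions, solve_region, samples):
    """Evaluate (3): the best per-region optimum under the empirical type distribution.

    `samples` is a list of observed type profiles; `solve_region(W, D_hat)` returns
    (x, value) for the LP restricted to R(W), or (None, -inf) if R(W) is empty.
    """
    counts = Counter(tuple(theta) for theta in samples)
    D_hat = {theta: cnt / len(samples) for theta, cnt in counts.items()}
    best_x, best_val = None, float("-inf")
    for W in regions:
        x, val = solve_region(W, D_hat)
        if x is not None and val > best_val:
            best_x, best_val = x, val
    return best_x
```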
Algorithm 1 and the corresponding regret bound in Theorem 1 hold without any assumptions on the joint type distribution \(\boldsymbol{\mathcal{D}}\). In many settings, however, the followers’ types may be independent of one another. Intuitively, one expects learning to be easier in such settings since it suffices to learn the marginals as opposed to the richer joint distribution. This is indeed correct: in Algorithm 2, we build the empirical distribution \(\hat{\mathcal{D}}_i^t\) for each marginal from samples \(\theta_i^1, \ldots, \theta_i^t\) for follower \(i\) and then take the product \(\hat{ \boldsymbol{\mathcal{D}} }^t = \prod_{i=1}^{n} \hat{\mathcal{D}}_i^t\) to estimate \(\boldsymbol{\mathcal{D}} = \prod_{i=1}^n \mathcal{D}_i\).
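A small sketch of this estimation step, assuming the observed types are collected in a \((t \times n)\) integer array; the product distribution is only materialized explicitly for small \(n\) and \(K\).

```python
import itertools
import numpy as np

def empirical_product_distribution(types: np.ndarray, K: int):
    """types is a (t, n) integer array of observed type profiles with entries in {0, ..., K-1}."""
    t, n = types.shape
    marginals = [np.bincount(types[:, i], minlength=K) / t for i in range(n)]
    # Product distribution over Theta^n (support size K^n; only build it for small n, K).
    joint = {theta: float(np.prod([marginals[i][theta[i]] for i in range(n)]))
             for theta in itertools.product(range(K), repeat=n)}
    return marginals, joint
```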
This algorithm achieves a much improved regret, \(\mathcal{O}(\sqrt{nKT})\), formalized in Theorem 2 and empirically verified in Appendix 10. The proof (in Appendix 8.3) is similar to the \(\mathcal{O}(\sqrt{K^nT})\) regret analysis of Theorem 1, which upper bounds the single-round regret by the TV distance between \(\hat{\boldsymbol{\mathcal{D}}} ^t\) and \(\boldsymbol{\mathcal{D}}\). But for independent distributions, we can relate the TV distance with the sum of Hellinger distances between the marginals \(\hat{\mathcal{D}}^t_i\) and \(\mathcal{D}_i\), which is bounded by \(\mathcal{O}(\sqrt{\frac{nK}{t}})\) instead of \(\mathcal{O}(\sqrt{\frac{K^n}{t}})\), so the total regret is bounded by \(\mathcal{O}(\sqrt{nKT})\). The computational complexity, though, increases to \(\mathrm{poly}((nKA)^LLTK^n)\) as the empirical product distribution \(\hat{\boldsymbol{\mathcal{D}}} ^t = \prod_{i}^n \hat{\mathcal{D}}_i^t\) has support size \(K^n\).
Theorem 2. The type-feedback Algorithm 2 for independent type distributions achieves expected regret \(\mathcal{O}\big(\sqrt{nKT}\big)\) and can be implemented in \(\mathrm{poly}((nKA)^LLTK^n)\) time.
Corollary 1. Taking the minimum of Theorems 1 and 2, we obtain a type-feedback algorithm with expected regret \(\mathcal{O}\big(\min\big\{\sqrt{LT \cdot \log(nKAT)}, \sqrt{nKT} \big\} \big)\) for independent type distributions.
We then provide a lower bound: no algorithm for the online Bayesian Stackelberg game achieves regret better than \(\Omega(\sqrt{\min\{L, nK\} T})\). When the number of followers \(n\) is large, this lower bound matches the preceding \(\widetilde{\mathcal{O}}(\sqrt{LT})\) upper bounds up to logarithmic factors. To our knowledge, this work is the first to provide a lower bound for the multi-follower problem and to give an almost tight characterization of the factor in front of the classical \(\sqrt{T}\) term. Interestingly, this \(\widetilde{\mathcal{O}}(\sqrt{L})\) factor does not grow with \(n\) beyond logarithmic factors.
Theorem 3. The expected regret of any type-feedback algorithm is at least \(\Omega(\sqrt{\min\{L, nK\} T })\). This holds even if the followers’ types are independent and the leader’s utility does not depend on \(\ell\).
The proof (given in Appendix 8.5) involves two non-trivial reductions. First, we reduce the distribution learning problem to a single-follower Bayesian Stackelberg game, obtaining an \(\Omega(\sqrt{\min\{L, K\} T})\) lower bound. Then, we reduce the single-follower game with \(nK\) types to a game with \(n\) followers, each with \(K\) types. One might wish to reduce a single-follower game with \(K^n\) types to an \(n\)-follower game to prove a lower bound of \(\Omega(\sqrt{\min\{L, K^n\} T})\) for general type distributions, but this cannot be done easily because there are no externalities between the followers.
We now discuss the setting where the leader observes the followers’ actions after each round. This setting is more practical, yet more challenging, than the type-feedback setting. We present two learning algorithms. The first algorithm achieves \(\mathcal{O}( K^n \sqrt{T} \log T )\) regret, using an existing technique from [19]. The second algorithm involves a novel combination of the Upper Confidence Bound principle and the concentration analysis of best-response regions from Lemma 4, achieving \(\mathcal{O}(\sqrt{n^L K^L A^{2L} L T \log T })\) regret. The latter is better when the number of followers \(n\) is large and the number of leader actions \(L\) is small. We empirically simulate both approaches in Appendix 10.
[19] developed a technique to reduce the online learning problem for a linear program with an unknown objective parameter to a linear bandit problem. Although the optimization problem for the Bayesian Stackelberg game (Definition 2) is not a linear program, we can still apply [19]’s technique. We first show that, under a different formulation, the Bayesian Stackelberg game can actually be solved by a single linear program. Then, we reduce this linear program formulation of the online Bayesian Stackelberg game to a linear bandit problem. A difference between our work and [19] is that, while they consider an adversarial online learning environment, ours is stochastic. Directly applying their technique would lead to a sub-optimal \(\widetilde{\mathcal{O}}(K^{\frac{3n}{2}}\sqrt{T})\) regret bound. Instead, we apply the OFUL algorithm for stochastic linear bandits [22] to obtain a better regret bound of \(\widetilde{\mathcal{O}}(K^{n}\sqrt{T})\). See details in Appendix 9.1.
Theorem 4. There exists an action-feedback algorithm for online Bayesian Stackelberg game with \(\mathcal{O}( K^n \sqrt{T} \log T )\) regret.
We design a better algorithm for large \(n\) and small \(L\). Recall from Section 3 that the leader’s strategy space can be partitioned into best-response regions: \(\Delta(\mathcal{L}) = \bigcup_{W \in \mathcal{ W } } R(W)\). When the leader plays strategy \(x\) in a region \(R(W)\), the followers’ best-response function satisfies \({\boldsymbol{\mathrm{br}}} (\boldsymbol{\theta}, x) = W(\boldsymbol{\theta})\), so the leader’s expected utility is \[\begin{align} U(x, R(W)) = \sum_{\boldsymbol{\theta}\in \Theta^n} \boldsymbol{\mathcal{D}}(\boldsymbol{\theta}) u(x, W(\boldsymbol{\theta})) = \sum_{\boldsymbol{a} \in \mathcal{A}^n} u(x, \boldsymbol{a}) \sum_{\boldsymbol{\theta}| W(\boldsymbol{\theta}) = \boldsymbol{a}} \boldsymbol{\mathcal{D}}(\boldsymbol{\theta}) = \sum_{\boldsymbol{a} \in \mathcal{A}^n} u(x, \boldsymbol{a}) \mathcal{P}(\boldsymbol{a} \,|\, R(W)) \end{align}\] where \(\mathcal{P}(\boldsymbol{a} \,|\, R(W))=\sum_{\boldsymbol{\theta}\in \Theta^n: W(\boldsymbol{\theta}) = \boldsymbol{a}} \boldsymbol{\mathcal{D}}(\boldsymbol{\theta})\) denotes the probability that the followers jointly take action \(\boldsymbol{a}\in\mathcal{A}^n\) when the leader plays \(x \in R(W)\). Since the distribution \(\mathcal{P}(\cdot \,|\, R(W)) \in \Delta(\mathcal{A}^n)\) does not depend on \(x\) as long as \(x\in R(W)\), playing \(N\) strategies \(x^1, ..., x^N\) within \(R(W)\) yields \(N\) observations \(\boldsymbol{a}^1, ..., \boldsymbol{a}^N \sim \mathcal{P}(\cdot \,|\, R(W))\). Using these samples, we can estimate the utility of any other strategy \(x \in R(W)\) within the same region. We define the empirical utility estimate on \(N\) samples of joint actions as \(\hat{U}_N(x, R(W)) ~ = ~ \frac{1}{N} \sum_{s=1}^{N} u(x, \boldsymbol{a}^s)\).
Lemma 5. Suppose \(T \ge | \mathcal{ W } |\). With probability at least \(1 - \frac{1}{T^2}\), we have: \(\forall W \in \mathcal{ W }\), \(\forall N\in\{1, \ldots, T\}\), \(\forall x \in R(W)\), \(| U(x, R(W)) - \hat{U}_N(x, R(W)) | \le \sqrt{\tfrac{4(L+1) \log(3T)}{N}}\).
The proof of this lemma is similar to the proof of Lemma 4 and given in Appendix 9.2.
For each region \(W\in \mathcal{ W }\), let \(N^t(W) = \sum_{s=1}^{t-1} \mathbb{1}\left[x^s \in R(W)\right]\) be the number of times when strategies in region \(R(W)\) were played in the first \(t-1\) rounds. Given the result in Lemma 5, we define an Upper Confidence Bound (UCB) on the expected utility of the optimal strategy in region \(R(W)\): \[\begin{align} \mathrm{UCB}^t(W) ~ = ~ \max_{x \in R(W)} \Big\{ \hat{U}_{N^t(W)}(x, R(W)) \Big\} ~ + ~ \sqrt{\tfrac{4(L+1) \log(3T)}{N^t(W)}}. \end{align}\] We design the following algorithm: at each round \(t\), select the region \(W\in \mathcal{ W }\) with the highest \(\mathrm{UCB}^{t}(W)\), play the empirically optimal strategy in that region, and increment \(N^t(W)\) by 1. Full description of the algorithm is given in Algorithm 3.
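The following sketch illustrates the UCB index and region-selection step just described; it is not the paper's Algorithm 3 verbatim. It assumes each region is represented as a pair `(W, A_ub)`, where `A_ub` stacks the rows \(-d_{\theta_i, w_i(\theta_i), a'_i}\) so that \(R(W) = \{x \in \Delta(\mathcal{L}) : A_{\mathrm{ub}}\, x \le 0\}\), `u_fn(l, joint_action)` returns the leader's utility, and `history[j]` collects the joint actions observed while playing in region \(j\); all names are illustrative.

```python
import math
import numpy as np
from scipy.optimize import linprog

def empirical_opt(u_fn, A_ub, samples, L):
    """max over x in R(W) of (1/N) * sum_s u(x, a^s), solved as an LP; returns (x, value)."""
    c = np.zeros(L)
    for a in samples:
        c += np.array([u_fn(l, a) for l in range(L)]) / len(samples)
    res = linprog(-c, A_ub=A_ub, b_ub=np.zeros(A_ub.shape[0]),
                  A_eq=np.ones((1, L)), b_eq=[1.0], bounds=[(0, None)] * L)
    return res.x, -res.fun

def select_region(u_fn, regions, history, L, T):
    """Pick the region with the largest UCB index and its empirically optimal strategy."""
    best = (None, -math.inf, None)  # (region index, UCB value, empirically optimal x)
    for j, (_, A_ub) in enumerate(regions):
        N = len(history[j])
        if N == 0:
            return j, None  # unexplored region: play any strategy inside it first
        x, val = empirical_opt(u_fn, A_ub, history[j], L)
        ucb = val + math.sqrt(4 * (L + 1) * math.log(3 * T) / N)
        if ucb > best[1]:
            best = (j, ucb, x)
    return best[0], best[2]
```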
Theorem 5. Algorithm 3 has expected regret \(\mathcal{O}\big(\sqrt{ n^L K^L A^{2L} L \cdot T \log T}\big)\).
While the full proof is in Appendix 9.3, we sketch the intuition. In the classical multi-armed bandit problem, the UCB algorithm has expected regret \(\mathcal{O}( \sqrt{ m T \log T } )\) where \(m\) is the number of arms. In our setting, each best-response region corresponds to an arm, and the confidence bound for each region is \(\mathcal{O}( \sqrt{ \frac{L \log T}{ N^t(W) } } )\). The number of arms/regions is \(m = | \mathcal{ W } | = \mathcal{O}(n^L K^L A^{2L})\) by Lemma 2. So, the regret of Algorithm 3 is at most \(\mathcal{O}( \sqrt{ | \mathcal{ W } | \cdot L \cdot T \log T } ) = \mathcal{O}( \sqrt{ n^L K^L A^{2L} \cdot L \cdot T \log T } )\).
Corollary 2. By taking the better algorithm in Theorems 4 and 5, we obtain an action-feedback algorithm with \(\widetilde{\mathcal{O}}\big( \min\big\{K^n, \sqrt{n^L K^L A^{2L} L} \big\} \sqrt{T} \big)\) regret.
Since action feedback is more limited than type feedback, the lower bound in Theorem 3 immediately carries over and shows that the \(\widetilde{\mathcal{O}}(\sqrt{T})\) regret bounds here are tight in \(T\). There are several subtleties in achieving tighter bounds on the remaining parameters. [1] show that, even with known distributions, BSGs are NP-hard to solve with respect to \(L\); so an exponential computational dependence on \(L\) is unavoidable, even if the regret could be made independent of \(L\), as shown by our \(\mathcal{O}( K^n \sqrt{T} \log T )\) result. Whether an online learning algorithm with \(\textrm{poly}(n, K, L)\sqrt{T}\) regret exists is an open question, but such an algorithm would suffer an exponential runtime in \(L\) unless P = NP.
This work designed online learning algorithms for Bayesian Stackelberg games with multiple followers whose type distributions are unknown. Although the joint type space of \(n\) followers has an exponentially large size \(K^n\), we achieved significantly smaller regrets: \(\widetilde{\mathcal{O}}(\sqrt{\min\{nK, L\} T})\) when the followers’ types are independent and observable, \(\widetilde{\mathcal{O}}(\sqrt{\min\{K^n, L\} T})\) when followers’ types are correlated and observable, and \(\widetilde{\mathcal{O}}(\min\{K^n, \sqrt{n^L K^L A^{2L} L}\} \sqrt{T})\) when only the followers’ actions are observed. These results exploit various geometric properties of the leader’s strategy space. The type-feedback bounds are tight in all parameters and the action-feedback bounds are tight in \(T\). The exponential dependency on \(L\) is unavoidable computationally [1]. Further closing the gaps between upper and lower regret bounds is an open question and will likely involve trade-offs between different parameters, as well as between computation and regret.
Proof. We have \(n\) followers, each with \(K\) types and \(A\) actions. Each follower has \(K\binom{A}{2} \le K A^2\) advantage vectors, where each advantage vector \(d_{\theta, a, a'}\) corresponds to a hyperplane in \(\mathbb{R}^L\) that separates the leader’s mixed strategy space \(\Delta(\mathcal{L}) \subseteq \mathbb{R}^L\) into two halfspaces. In total, the \(n\) followers have at most \(n K A^2\) hyperplanes. These hyperplanes divide \(\mathbb{R}^L\) into at most \(\mathcal{O}((nKA^2)^L) = \mathcal{O}(n^L K^L A^{2L})\) regions (see, e.g., [28]). Each non-empty best-response region is one such region, so the total number is \(\mathcal{O}(n^L K^L A^{2L})\). ◻
Proof. We construct a graph \(G = (V, E)\) whose vertex set \(V\) consists of the mappings \(W \in \mathcal{ W }\), each representing a non-empty best-response region. An edge exists between two vertices \(W, W' \in \mathcal{ W }\) if and only if \(W\) and \(W'\) differ in exactly one entry. Traversing an edge corresponds to moving between adjacent best-response regions by crossing a hyperplane boundary.
We claim that this graph is connected. Each vertex \(W\) corresponds to a best-response region defined by the inequalities in (2), and since the leader’s strategy space is the \(L\)-dimensional probability simplex, the union of the non-empty best-response regions is the entire strategy space. Because these regions are convex polytopes sharing boundaries, the adjacency structure defined by differing in one entry corresponds to crossing a shared facet. Starting from any non-empty region, we can therefore reach any other by crossing shared facets through adjacent regions, so the graph is connected.
Thus, to enumerate all non-empty best-response regions, we can perform a graph search (e.g., breadth-first search or depth-first search) starting from any initial vertex \(W\) to traverse all vertices in \(\mathcal{O}(| \mathcal{ W } |)\) steps, which is \(\mathcal{O}(n^L K^L A^{2L})\) by Lemma 2. Specifically, at each vertex \(W\), we examine all of its at most \(nKA\) adjacent vertices. For each adjacent vertex \(W'\), we determine whether \(R(W')\) is non-empty by solving a feasibility linear program defined by the constraints in (2), which runs in \(\mathrm{poly} (n, K, A, L)\) time. The total running time is therefore \(\mathrm{poly} (n^L, K^L, A^L, L)\). We present the algorithm formally in Algorithm 5. ◻
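A minimal sketch of this breadth-first enumeration, assuming follower utilities are given as NumPy arrays `v[i][l, a, k]`; the starting mapping is the best-response profile at the uniform strategy (non-empty by construction), and each candidate neighbor is tested with a feasibility LP.

```python
from collections import deque
import numpy as np
from scipy.optimize import linprog

def feasible(v, W, L, A, K, n):
    """Check whether R(W) is non-empty by solving a zero-objective feasibility LP over (2)."""
    rows = [-(v[i][:, W[i][k], k] - v[i][:, a, k])
            for i in range(n) for k in range(K) for a in range(A)]
    res = linprog(np.zeros(L), A_ub=np.array(rows), b_ub=np.zeros(len(rows)),
                  A_eq=np.ones((1, L)), b_eq=[1.0], bounds=[(0, None)] * L)
    return res.success

def enumerate_regions(v, L, A, K, n):
    """BFS over mappings W that differ in exactly one entry, keeping only non-empty regions."""
    x0 = np.full(L, 1.0 / L)  # the uniform leader strategy
    start = tuple(tuple(int(np.argmax(x0 @ v[i][:, :, k])) for k in range(K))
                  for i in range(n))
    seen, queue = {start}, deque([start])
    while queue:
        W = queue.popleft()
        for i in range(n):
            for k in range(K):
                for a in range(A):  # flip exactly one entry of W
                    if a == W[i][k]:
                        continue
                    W2 = tuple(tuple(a if (i2, k2) == (i, k) else W[i2][k2]
                                     for k2 in range(K)) for i2 in range(n))
                    if W2 not in seen and feasible(v, W2, L, A, K, n):
                        seen.add(W2)
                        queue.append(W2)
    return list(seen)
```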
The following definitions will be used in the proofs for this section:
Definition 5 (Total variation distance). For two discrete distributions \(\mathcal{D}\) and \(\hat{ \mathcal{D}}\) over support \(\Theta\), the total variation distance is half the \(L_1\) distance between the two distributions: \[\delta( \mathcal{D}, \hat{ \mathcal{D}} ) = \frac{1}{2} ||\mathcal{D}- \hat{\mathcal{D}} ||_1 = \frac{1}{2} \sum_{ \theta \in \Theta } |\mathcal{D}(\theta) - \hat{ \mathcal{D}}(\theta)|.\]
Definition 6 (Hellinger distance). For two discrete distributions \(\mathcal{D}\) and \(\hat{ \mathcal{D}}\) over support \(\Theta\), the Hellinger distance is defined as \[H(\mathcal{D}, \hat{ \mathcal{D}})=\frac{1}{\sqrt{2}}\|\sqrt{\mathcal{D}}-\sqrt{ \hat{ \mathcal{D}} }\|_2 = \frac{1}{ \sqrt{2} } \sqrt{ \sum_{\theta \in \Theta } \big( \sqrt{\mathcal{D}(\theta)} - \sqrt{\hat{ \mathcal{D}}(\theta)} \big)^2 }.\]
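For reference, both distances are one-liners when the distributions are given as probability vectors (a small illustrative sketch):

```python
import numpy as np

def total_variation(p: np.ndarray, q: np.ndarray) -> float:
    """Definition 5: half of the L1 distance between two probability vectors."""
    return 0.5 * float(np.abs(p - q).sum())

def hellinger(p: np.ndarray, q: np.ndarray) -> float:
    """Definition 6: (1/sqrt(2)) times the L2 distance between elementwise square roots."""
    return float(np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2))
```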
We rely on several key insights about the pseudo-dimension of a family of functions, defined below:
Definition 7 (Definition 10.2 in [29]). Let \(\mathcal{G}\) be a family of functions from input space \(\mathcal{Z}\) to real numbers \(\mathbb{R}\).
A set of inputs \(\{z_1, \ldots, z_m\} \subseteq \mathcal{Z}\) is shattered by \(\mathcal{G}\) if there exist thresholds \(t_1, \ldots, t_m \in \mathbb{R}\) such that for any sign vector \(\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_m ) \in \{-1, +1\}^m\), there exists a function \(g \in \mathcal{G}\) satisfying \(\text{sign}(g(z_i) - t_i) = \sigma_i \text{ for all } i = 1, \ldots, m.\)
The size of the largest set of inputs that can be shattered by \(\mathcal{G}\) is called the pseudo-dimension of \(\mathcal{G}\), denoted by \(\mathrm{Pdim}(\mathcal{G})\).
Given a family of functions with a finite pseudo-dimension, and samples \(z^1, \ldots, z^N\) drawn from a distribution on the input space \(\mathcal{Z}\), the empirical mean of any function in the family will, with high probability, be close to the true mean. Formally:
Theorem 6 (e.g., Theorem 10.6 in [29]). Let \(\mathcal{G}\) be a family of functions from \(\mathcal{Z}\) to \([0, 1]\) with pseudo-dimension \(\mathrm{Pdim}(\mathcal{G}) = d\). For any distribution \(F\) over \(\mathcal{Z}\), with probability at least \(1-\delta\) over the random draw of \(N\) samples \(z^1, \ldots, z^N\) from \(F\), the following holds for all \(g \in \mathcal{G}\), \[\begin{align} \Big|\, \mathbb{E}_{z\sim F}[g(z)] - \frac{1}{N}\sum_{i=1}^N g(z^i) \, \Big| ~ \le ~ \sqrt{\frac{2d \log 3N}{N}} + \sqrt{\frac{\log\frac{1}{\delta}}{2N}}. \end{align}\]
Consider the family of linear functions over \(\mathbb{R}^{L}\): \(\mathcal{G} = \{g_x : z \to \langle x, z \rangle \mid x \in \mathbb{R}^L \}.\) It is known that the pseudo-dimension of this family is \(L\):
Lemma 6 (e.g., Theorem 10.4 in [29]). The family of linear functions \(\{g_x: z \to \langle x, z \rangle \mid x \in \mathbb{R}^L\}\) in \(\mathbb{R}^{L}\) has pseudo-dimension \(L\).
Proof of Lemma 4. We now have the tools to prove Lemma 4. For any non-empty best-response region defined by \(W \in \mathcal{ W }\), let \(\boldsymbol{\theta}^{1}\), ..., \(\boldsymbol{\theta}^{t} \sim \boldsymbol{\mathcal{D}}\) be \(t\) i.i.d. samples. For each sample \(\boldsymbol{\theta}^i\), we can directly compute \[z_{W, \boldsymbol{\theta}^i } = \Big( u(1, W(\boldsymbol{\theta}^i)), ..., u(L, W(\boldsymbol{\theta}^i)) \Big).\] Fix any leader strategy \(x \in R(W)\). By Lemma 1, the leader’s expected utility from using strategy \(x\) is \(U_{ \boldsymbol{\mathcal{D}} }(x) = \mathbb{E}_{\boldsymbol{\theta}\sim \boldsymbol{\mathcal{D}} }[\langle x, z_{W, \boldsymbol{\theta}}\rangle]\), which is the expectation of the linear function \(g_x(z_{W, \boldsymbol{\theta}}) = \langle x , z_{W, \boldsymbol{\theta}} \rangle\). Therefore, by Theorem 6 and Lemma 6, we have \[\begin{align} \Pr\left[ \forall x \in R(W), ~ \Big|\mathbb{E}_{ \boldsymbol{\theta}\sim \boldsymbol{\mathcal{D}} }[ g_{x}( z_{W, \boldsymbol{\theta}} ) ] - \frac{1}{t} \sum_{i = 1}^{t} g_x( z_{W, \boldsymbol{\theta}^{i} } ) \Big| \leq \sqrt{ \frac{2 L \log 3 t }{ t } } + \sqrt{ \frac{\log \frac{1}{\delta'} }{2t} } ~ \right] \geq 1 - \delta'. \end{align}\] By definition, \(U_{ \boldsymbol{\mathcal{D}} }(x) = \mathbb{E}_{ \boldsymbol{\theta}\sim \boldsymbol{\mathcal{D}} }[ g_{x}( z_{W, \boldsymbol{\theta}} ) ]\) and \(\hat{U}^{t}(x) = \frac{1}{t} \sum_{i = 1}^{t} g_x(z_{W, \boldsymbol{\theta}^{i} } )\). So, \[\begin{align} \Pr\left[ \forall x \in R(W), ~ \Big| U_{ \boldsymbol{\mathcal{D}} }(x) - \hat{U}^{t}(x) \Big| \leq \sqrt{\frac{2L \log 3t}{t}} + \sqrt{\frac{\log\frac{1}{\delta'}}{2t}} ~ \right] \geq 1 - \delta'. \end{align}\] Let \(\delta' = \frac{\delta}{| \mathcal{ W } |}\). By the union bound, with probability at least \(1 - \delta\), the following bound holds simultaneously for all \(W \in \mathcal{ W }\) and \(x \in R(W)\): \[\begin{align} \Big| U_{ \boldsymbol{\mathcal{D}} }(x) - \hat{U}^t(x) \Big| \, \le \, \sqrt{\frac{2L \log(3t)}{t}} + \sqrt{\frac{\log \frac{| \mathcal{ W } |}{\delta}}{2t}} \, = \, \mathcal{O}\bigg( \sqrt{\frac{L \log t}{t}} + \sqrt{\frac{L \log(nKA) + \log(\frac{1}{\delta})}{t}} \bigg), \end{align}\] where we used the fact \(| \mathcal{ W } | = \mathcal{O}(n^{L}K^{L}A^{2L})\) from Lemma 2. ◻
Consider a Bayesian Stackelberg game with \(n\) followers, each with \(K\) types, and joint type distribution \(\boldsymbol{\mathcal{D}}\). Let \(U(x, \boldsymbol{\mathcal{D}} ) = \mathbb{E}_{\boldsymbol{\theta}\sim \boldsymbol{\mathcal{D}} }[u(x, {\boldsymbol{\mathrm{br}}} (\boldsymbol{\theta}, x) )]\) be the expected utility of the leader playing mixed strategy \(x\) when the type distribution is \(\boldsymbol{\mathcal{D}}\). Let \(x^* = \arg \max_{x \in \Delta(\mathcal{L}) } U(x, \boldsymbol{\mathcal{D}} )\) be the optimal strategy for \(\boldsymbol{\mathcal{D}}\). At each round \(t\), Algorithm 1 chooses the optimal strategy \(x^t = \mathop{\mathrm{arg\,max}}_{x \in \Delta(\mathcal{L})} U(x, \hat{ \boldsymbol{\mathcal{D}} }^{t-1})\) for the empirical distribution \(\hat{ \boldsymbol{\mathcal{D}} }^{t-1}\) over \(t-1\) samples. The total expected regret is equal to \[\mathbb{E}[\mathrm{Reg}(T)] = \sum_{t = 1}^{T} \mathbb{E}\Big[ \underbrace{U(x^*, \boldsymbol{\mathcal{D}} ) - U(x^t, \boldsymbol{\mathcal{D}} )}_{\text{single-round regret } r(t)} \Big].\] We upper bound the single-round regret \(r(t)\) by the total variation distance (Definition 5) between \(\boldsymbol{\mathcal{D}}\) and \(\hat{\boldsymbol{\mathcal{D}}} ^{t-1}\):
Claim 1. \(r(t) = U(x^*, \boldsymbol{\mathcal{D}} ) - U(x^t, \boldsymbol{\mathcal{D}} ) \leq 4 \delta( \boldsymbol{\mathcal{D}} , \hat{ \boldsymbol{\mathcal{D}} }^{t-1})\).
Proof. \[\begin{align} r(t) & = U(x^*, \boldsymbol{\mathcal{D}} ) - U(x^t, \boldsymbol{\mathcal{D}} ) \nonumber \\ &= U(x^*, \boldsymbol{\mathcal{D}} ) - U(x^*, \hat{ \boldsymbol{\mathcal{D}} }^{t - 1}) + U(x^*, \hat{ \boldsymbol{\mathcal{D}} }^{t - 1}) - U(x^t, \hat{ \boldsymbol{\mathcal{D}} }^{t - 1}) + U(x^t, \hat{ \boldsymbol{\mathcal{D}} }^{t - 1}) - U(x^t, \boldsymbol{\mathcal{D}} ) \nonumber \\ &\leq U(x^*, \boldsymbol{\mathcal{D}} ) - U(x^*, \hat{ \boldsymbol{\mathcal{D}} }^{t - 1}) + 0 + U(x^t, \hat{ \boldsymbol{\mathcal{D}} }^{t - 1}) - U(x^t, \boldsymbol{\mathcal{D}} ) \label{eq:middle-term-less-zero} \end{align}\tag{4}\] where (4) follows from \(U(x^*, \hat{ \boldsymbol{\mathcal{D}} }^{t - 1}) \;-\; U(x^t, \hat{ \boldsymbol{\mathcal{D}} }^{t - 1}) \leq 0\) because \(x^t\) maximizes \(U(x, \hat{ \boldsymbol{\mathcal{D}} }^{t - 1} )\). We bound the first term in Equation (4) as follows: \[\begin{align} U(x^*, \boldsymbol{\mathcal{D}} ) \,-\, U(x^*, \hat{ \boldsymbol{\mathcal{D}} }^{t - 1}) &= \sum_{\boldsymbol{\theta}\in \Theta^n} \boldsymbol{\mathcal{D}} (\boldsymbol{\theta})\,u\bigl(x^*, {\boldsymbol{\mathrm{br}}} (\boldsymbol{\theta},x^*)\bigr) \,-\, \sum_{\boldsymbol{\theta}\in \Theta^n} \hat{ \boldsymbol{\mathcal{D}} }^{t - 1}(\boldsymbol{\theta})\,u\bigl(x^*, {\boldsymbol{\mathrm{br}}} (\boldsymbol{\theta},x^*)\bigr) \nonumber \\ &= \sum_{\boldsymbol{\theta}\in \Theta^n} \Bigl( \boldsymbol{\mathcal{D}} (\boldsymbol{\theta}) - \hat{ \boldsymbol{\mathcal{D}} }^{t - 1}(\boldsymbol{\theta})\Bigr)\, u\bigl(x^*, {\boldsymbol{\mathrm{br}}} (\boldsymbol{\theta},x^*)\bigr) \\ &\leq \sum_{\boldsymbol{\theta}\in \Theta^n} \Bigl| \boldsymbol{\mathcal{D}} (\boldsymbol{\theta}) - \hat{ \boldsymbol{\mathcal{D}} }^{t - 1}(\boldsymbol{\theta})\Bigr|\, u(x^*, {\boldsymbol{\mathrm{br}}} (\boldsymbol{\theta}, x^*)) \\ &\leq \sum_{\boldsymbol{\theta}\in \Theta^n} \Bigl| \boldsymbol{\mathcal{D}} (\boldsymbol{\theta}) - \hat{ \boldsymbol{\mathcal{D}} }^{t - 1}(\boldsymbol{\theta})\Bigr| \cdot 1 \nonumber ~ = ~ 2\delta( \boldsymbol{\mathcal{D}} , \hat{ \boldsymbol{\mathcal{D}} }^{t - 1} ). \end{align}\] By a symmetrical argument, the second term in Equation (4) is also bounded by \(2 \delta( \boldsymbol{\mathcal{D}} , \hat{ \boldsymbol{\mathcal{D}} }^{t-1} )\). ◻
Using Claim 1 and taking expectation, we have \[\begin{align} \mathbb{E}[r(t)] ~ \leq~ 4 \mathbb{E}[ \delta( \boldsymbol{\mathcal{D}} , \hat{ \boldsymbol{\mathcal{D}} }^{t - 1} ) ]. \end{align}\] According to [30], for distributions with support size \(K^n\), \(\mathbb{E}[ \delta( \boldsymbol{\mathcal{D}} , \hat{ \boldsymbol{\mathcal{D}} }^{t - 1} ) ] \le \mathcal{O}(\sqrt{\frac{K^n}{t-1}})\). Thus, we have \(\mathbb{E}[r(t)] \le \mathcal{O}\big( \sqrt{ \frac{K^n}{t - 1} } \big)\). Using the inequality \(\sum_{t = 1}^{T} \frac{1}{ \sqrt{t} } \leq 2 \sqrt{T}\), we obtain \[\begin{align} \mathbb{E}[\mathrm{Reg}(T)] ~ = ~ \sum_{t = 1}^{T} \mathbb{E}[r(t)] ~ \leq ~ \mathcal{O}\left(\sum_{t = 1}^{T} \sqrt{ \frac{K^n}{t} } \right) ~ \leq ~ \mathcal{O}( 2 \sqrt{K^nT} ) ~ = ~ \mathcal{O}( \sqrt{K^nT}). \end{align}\]
Consider round \(t \geq 2\). By Lemma 4, applied to the \(t-1\) type profiles observed before round \(t\), we have that with probability at least \(1 - \delta\): \[\begin{align} \Big|\, U_{ \boldsymbol{\mathcal{D}} }(x)- \hat{U}^{t-1}(x) \, \Big| ~ \le ~ \mathcal{O}\bigg( \sqrt{\frac{L \log (t-1)}{t-1}} + \sqrt{\frac{L \log(nKA) + \log(\frac{1}{\delta})}{t-1}} \bigg), \quad \forall x \in \Delta(\mathcal{L}). \end{align}\] Suppose this event happens. Then, the regret of the algorithm at round \(t\) is bounded as follows: \[\begin{align} r(t) & ~ = ~ \mathbb{E}_{\boldsymbol{\theta}\sim \boldsymbol{\mathcal{D}} } \big[ u(x^*, {\boldsymbol{\mathrm{br}}} (\boldsymbol{\theta}, x^*)) - u(x^t, {\boldsymbol{\mathrm{br}}} (\boldsymbol{\theta}, x^t)) \big] \\ & ~ = ~ U_{ \boldsymbol{\mathcal{D}} }(x^*) - U_{ \boldsymbol{\mathcal{D}} }(x^t) \\ & ~ = ~ U_{ \boldsymbol{\mathcal{D}} }(x^*) - \hat{U}^{t-1}(x^*) + \hat{U}^{t-1}(x^*) - \hat{U}^{t-1}(x^t) + \hat{U}^{t-1}(x^t) - U_{ \boldsymbol{\mathcal{D}} }(x^t) \\ & ~ \le ~ \hat{U}^{t-1}(x^*) - \hat{U}^{t-1}(x^t) + 2 \cdot \mathcal{O}\bigg( \sqrt{\frac{L \log (t-1)}{t-1}} + \sqrt{\frac{L \log(nKA) + \log(\frac{1}{\delta})}{t-1}} \bigg) \\ & ~ \le ~ 0 + 2 \cdot \mathcal{O}\bigg( \sqrt{\frac{L \log (t-1)}{t-1}} + \sqrt{\frac{L \log(nKA) + \log(\frac{1}{\delta})}{t-1}} \bigg), \end{align}\] where the last inequality follows from \(\hat{U}^{t-1}(x^*) - \hat{U}^{t-1}(x^t) \leq 0\) because the algorithm selects the strategy \(x^t\) that maximizes the empirical utility \(\hat{U}^{t - 1}(x)\). Then: \[\begin{align} \mathbb{E}[\text{Reg}(T)] &= \mathbb{E} \bigg[ \sum_{t=1}^T r(t) \bigg] \nonumber \\ &\le \sum_{t=1}^T (1 - \delta) \cdot \mathcal{O}\bigg( \sqrt{\frac{L \log t}{t}} + \sqrt{\frac{L \log(nKA) + \log(\frac{1}{\delta})}{t}} \bigg) + \delta T \tag{5} \\ &\le \mathcal{O}\bigg( \sqrt{T L \log T} + \sqrt{T \big(L \log(nKA) + \log(\frac{1}{\delta}) \big)} \bigg) + \delta T \tag{6} \\ &\le \mathcal{O}\big( \sqrt{T L (\log T + \log(nKA))} \big) \tag*{\text{using } \delta = \tfrac{1}{T} \text{ and } \sqrt{a} + \sqrt{b} \leq \sqrt{2(a + b)}} \\ &= \mathcal{O}\big( \sqrt{T L \log(nKAT)} \big). \nonumber \end{align}\]
Equation (5) follows from the law of total expectation and the fact that the single-round regret is bounded by 1. Equation (6) follows from the known inequality \(\sum_{t = 1}^{T} \frac{1}{\sqrt{t}} \leq 2 \sqrt{T}\). We set \(\delta = \frac{1}{T}\).
As for the computational complexity of this algorithm, note that Lemma 2 states that the number of non-empty best-response regions is \(\mathcal{O}(n^L K^L A^{2L})\). As shown by Equation (3), we can compute the optimal strategy within each best-response region using a linear program with \(L\) variables and at most \(\mathrm{poly}(n^L, K^L, A^L, L, T)\) constraints. Further, evaluating each constraint and the objective function can also be accomplished in this time. Since each linear program can be solved in \(\mathrm{poly}(n^L, K^L, A^L, L, T)\) time, and we run \(\mathcal{O}(n^L K^L A^{2L})\) linear programs at each of at most \(T\) rounds, Algorithm 1 runs in \(\mathrm{poly}((nKA)^L LT)\) time.
Proof. Let \(\boldsymbol{\mathcal{D}} = \prod_{i = 1}^{n} \mathcal{D}_i\) denote the distribution over independent types. According to Claim 1, the single-round regret \(r(t) = U(x^*, \boldsymbol{\mathcal{D}} ) - U(x^t, \boldsymbol{\mathcal{D}} )\) satisfies \[\begin{align} r(t) ~ \leq~ \mathcal{O}( \delta( \boldsymbol{\mathcal{D}} , \hat{ \boldsymbol{\mathcal{D}} }^{t-1} ) ), \end{align}\] where \(\hat{\boldsymbol{\mathcal{D}}} ^{t-1} = \prod_{i=1}^n \hat{\mathcal{D}}^{t-1}_i\) is the product of the empirically computed marginal type distributions. We will use the following properties of Hellinger Distance (Definition 6):
[31] For any two distributions \(\boldsymbol{\mathcal{D}}\) and \(\hat{ \boldsymbol{\mathcal{D}} }\), \[\label{eq:hellinger95b1} H^2( \boldsymbol{\mathcal{D}} , \hat{ \boldsymbol{\mathcal{D}} }) \leq \delta( \boldsymbol{\mathcal{D}} , \hat{ \boldsymbol{\mathcal{D}} }) \leq \sqrt{2} H( \boldsymbol{\mathcal{D}} , \hat{ \boldsymbol{\mathcal{D}} }).\tag{7}\]
[31] If both \(\boldsymbol{\mathcal{D}}\) and \(\hat{ \boldsymbol{\mathcal{D}} }\) are product distributions, i.e., \(\boldsymbol{\mathcal{D}} = \prod_{i = 1}^{n} \mathcal{D}_i\) and \(\hat{ \boldsymbol{\mathcal{D}} } = \prod_{i = 1}^{n} \hat{ \mathcal{D}}_i\), then: \[\label{eq:hellinger_b2} H^2( \boldsymbol{\mathcal{D}} , \hat{ \boldsymbol{\mathcal{D}} }) \leq \sum_{i=1}^n H^2\left(\mathcal{D}_i, \hat{ \mathcal{D}}_i \right).\tag{8}\]
[30] For a distribution \(\mathcal{D}\) with support size \(K\), the empirical distribution \(\hat{\mathcal{D}}^t\) over \(t\) samples from \(\mathcal{D}\) satisfies: \[\label{eq:hellinger_b3} \mathbb{E}[ H^2( \mathcal{D}, \hat{\mathcal{D}}^t ) ] \leq \frac{ K }{ 2t }.\tag{9}\]
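As a quick numeric sanity check of (9) (not part of the proof), the following sketch compares a Monte Carlo estimate of \(\mathbb{E}[ H^2( \mathcal{D}, \hat{\mathcal{D}}^t ) ]\) against the bound \(\frac{K}{2t}\) for an arbitrary distribution with a small support:

import numpy as np

rng = np.random.default_rng(0)
K, t, trials = 6, 50, 20000
D = rng.dirichlet(np.ones(K))                 # an arbitrary distribution with support size K

def hellinger_sq(p, q):
    return 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

estimates = [hellinger_sq(D, rng.multinomial(t, D) / t) for _ in range(trials)]
print(np.mean(estimates), "<=", K / (2 * t))  # Monte Carlo mean vs. the K/(2t) bound in (9)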
We now upper bound the single-round regret \(r(t+1)\) in expectation: \[\begin{align} \mathbb{E}[r(t+1)] & \le \mathcal{O}\bigl( \mathbb{E}\big[ \delta( \boldsymbol{\mathcal{D}} , \hat{ \boldsymbol{\mathcal{D}} }^t) \big] \bigr) \\ & \le \mathcal{O}\big( \mathbb{E}\big[ H( \boldsymbol{\mathcal{D}} , \hat{ \boldsymbol{\mathcal{D}} }^t) \big] \big) \tag*{\text{by (\ref{eq:hellinger_b1})}}\\ & \le \mathcal{O}\big( \sqrt{ \mathbb{E}\big[ H^2( \boldsymbol{\mathcal{D}} , \hat{ \boldsymbol{\mathcal{D}} }^t) \big] } \big) \tag*{\text{because } \mathbb{E}[X^2] \ge (\mathbb{E}[X])^2} \\ & \le \mathcal{O}\left( \sqrt{ \mathbb{E}\Big[ \sum_{i=1}^n H^2(\mathcal{D}_i, \hat{\mathcal{D}}_i^t) \Big] } \right) \tag*{\text{by (\ref{eq:hellinger_b2})}} \\ & \le \mathcal{O}\left( \sqrt{ \frac{nK}{2t} } \right) \tag*{\text{by (\ref{eq:hellinger_b3})}}. \end{align}\] Using the inequality \(\sum_{t = 1}^{T} \frac{1}{ \sqrt{t} } \leq 2 \sqrt{T}\), we obtain \[\begin{align} \mathbb{E}[\mathrm{Reg}(T)] ~ = ~ \sum_{t = 1}^{T} \mathbb{E}[r(t)] ~ \leq ~ \sum_{t = 1}^{T} \mathcal{O}\left( \sqrt{ \frac{nK}{t} } \right) ~ \le ~ \mathcal{O}( \sqrt{nK T}). \end{align}\] ◻
In this section, we prove a lower bound of \(\Omega(\sqrt{\min\{L, K\}T})\) on the expected regret of any algorithm in the single-follower case \((n=1)\), formalized in Theorem 7.
Theorem 7. For single-follower Bayesian Stackelberg games where the follower has \(K\) types and the leader has \(L\) actions, the expected regret of any type-feedback online learning algorithm is at least \(\Omega(\sqrt{\min\{L, K\} T})\).
At a high level, the proof of Theorem 7 is a reduction from the distribution learning problem. Without loss of generality, assume that \(\min\{K, L\} = 2c\) is an even number, and further assume that \(K = L = 2c\).5 The single follower has \(K = 2c\) types, with type space \(\Theta = \{ \pm1, \pm 2, ..., \pm c \}\). Consider a class \(\mathcal{C}\) of distributions over \(\Theta\) defined as follows:
Definition 8 (Class of Distributions \(\mathcal{C}\)). A distribution \(\mathcal{D}= \mathcal{D}_{\boldsymbol{\sigma}} \in \mathcal{C}\) is specified by a vector \(\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_c) \in \{ \pm 1 \}^c\). For each \(j = 1, \ldots, c\), \[\mathcal{D}_{\boldsymbol{\sigma}}(+j) = \frac{1}{2c}(1 + \sigma_j {\epsilon}), \quad \mathcal{D}_{\boldsymbol{\sigma}}(-j) = \frac{1}{2c}(1 - \sigma_j {\epsilon}),\] for some \({\epsilon}\in (0, 1)\). Note that \(\mathcal{D}_{\boldsymbol{\sigma}}(+j) > \mathcal{D}_{\boldsymbol{\sigma}}(-j)\) if and only if \(\sigma_j = +1\). The class \(\mathcal{C}\) consists of \(2^c\) distributions.
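For concreteness, a small sketch that enumerates the family \(\mathcal{C}\); the indexing convention (type \(+j\) at position \(j-1\), type \(-j\) at position \(c+j-1\)) is our own choice:

import itertools
import numpy as np

def D_sigma(sigma, eps):
    c = len(sigma)
    probs = np.empty(2 * c)
    for j, s in enumerate(sigma):                   # s in {+1, -1}
        probs[j] = (1 + s * eps) / (2 * c)          # probability of type +(j+1)
        probs[c + j] = (1 - s * eps) / (2 * c)      # probability of type -(j+1)
    return probs

c, eps = 3, 0.1
family = [D_sigma(sigma, eps) for sigma in itertools.product([+1, -1], repeat=c)]
assert len(family) == 2 ** c and all(abs(p.sum() - 1) < 1e-12 for p in family)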
In the distribution learning problem, given \(t\) samples \(\theta^1, ..., \theta^t\) from an unknown distribution \(\mathcal{D}\in \mathcal{C}\), the goal is to construct an estimator \(\hat{\mathcal{D}}\) specified by a vector \(\hat{\sigma} \in \{ \pm 1 \}^{c}\) such that the expected total variation distance (Definition 5) satisfies \(\mathbb{E}[\delta( \mathcal{D}, \hat{\mathcal{D}} )] \le \mathcal{O}({\epsilon})\). It is known that solving this problem requires at least \(\Omega( \frac{2c}{{\epsilon}^2} )\) samples.
Theorem 8 (e.g., [32], [33]). When \(\mathcal{D}\) is uniformly sampled from the class \(\mathcal{C}\), any algorithm that constructs estimator \(\hat{\mathcal{D}}\) using \(t\) samples from \(\mathcal{D}\) has expected error at least \(\mathbb{E}[ \delta(\hat{\mathcal{D}}, \mathcal{D}) ] \geq \Omega({\epsilon})\) if \(t \le \mathcal{O}(\frac{2c}{{\epsilon}^2})\).
Distribution learning instance: An unknown distribution \(\mathcal{D}\in \mathcal{C}\).
Bayesian Stackelberg game instance: A single follower with type space \(\Theta = \{\pm 1, \pm 2,\ldots, \pm c\}\) and an unknown type distribution \(\mathcal{D}\). The follower has binary action set \(\mathcal{A}= \{ \text{Good}, \text{Bad} \}\). The leader has action set \(\mathcal{L}= \Theta = \{\pm 1, \pm 2,\ldots, \pm c\}\). The utility functions of the two players are:
Follower’s utility function: \[\label{eq:single-follower-utility} v(\ell, a, \theta) = \begin{cases} 1 & \text{if } \theta = +j, \ell = +j, a = \text{Good} \\ 1 & \text{if } \theta = +j, \ell = -j, a = \text{Bad} \\ 1 & \text{if } \theta = -j, \ell = -j, a = \text{Good} \\ 1 & \text{if } \theta = -j, \ell = +j, a = \text{Bad} \\ 0 & \text{otherwise}. \end{cases}\tag{10}\]
Leader’s utility function: For any action \(\ell \in \mathcal{L}\), \[\label{eq:leader-utility-single-follower} u(\ell, \mathrm{Good}) = 1, \quad u(\ell, \mathrm{Bad}) = 0.\tag{11}\] Note that for any mixed strategy \(x\), \(u(x, \mathrm{Good}) = 1\) and \(u(x, \mathrm{Bad}) = 0\).
Reduction:
Given an online learning algorithm \(\mathrm{Alg}\) for Bayesian Stackelberg game with type feedback, we use it to construct an online learning algorithm for the distribution learning problem as follows: At each round \(t = 1, \ldots, T\),
Receive the leader’s mixed strategy \(x^t\) from \(\mathrm{Alg}\).
Construct an estimated distribution \(\hat{\mathcal{D}}_{x^t} = \mathcal{D}_{\boldsymbol{\sigma}(x^t)}\in\mathcal{C}\) based on vector \(\boldsymbol{\sigma}(x^t)\) defined as follows: \[\label{eq:D-hat-construction} \sigma_j(x^t) = \begin{cases} +1, & \text{if } x^t(+j) \ge x^t(-j) \\ -1, & \text{if } x^t(+j) < x^t(-j). \end{cases}\tag{12}\]
Observe sample \(\theta^t \sim \mathcal{D}\) and feed \(\theta^t\) to \(\mathrm{Alg}\).
Lemma 7 (Follower’s Best Response). Given the leader’s mixed strategy \(x \in \Delta(\Theta)\), for each \(j \in \{1, \ldots, c\}\), the best-response function of a follower with type \(+j\) or \(-j\) is: \[\begin{gather} \mathrm{br}(+j, x) \;=\; \begin{cases} \mathrm{Good}, & \text{if } x(+j) \ge x(-j),\\ \mathrm{Bad}, & \text{if } x(+j) < x(-j), \end{cases} \\[6pt] \mathrm{br}(-j, x) \;=\; \begin{cases} \mathrm{Good}, & \text{if } x(+j) < x(-j),\\ \mathrm{Bad}, & \text{if } x(+j) \ge x(-j). \end{cases} \end{gather}\]
Proof. For a follower with type \(+j\), their utility for choosing action Good is given by \[v(x, \text{Good}, +j) = \mathbb{E}_{ \ell \sim x } [v(\ell, \text{Good}, +j)] = \sum_{ \ell \in \mathcal{L}} x(\ell) v(\ell, \text{Good}, +j) = x(+j). \label{eq:follower-utility-reduction}\tag{13}\] Similarly, their utility for choosing action Bad is: \[v(x, \text{Bad}, +j) = x(-j).\]
Thus, by definition, the follower with type \(+j\) best responds with \(\text{Good}\) if \(x(+j) \geq x(-j)\).
Likewise, for a follower with type \(-j\), \[\begin{align} v(x, \text{Good}, -j) &= x(-j), \\ v(x, \text{Bad}, -j) &= x(+j). \end{align}\] Thus, a follower with type \(-j\) best responds with \(\text{Bad}\) if \(x(+j) \geq x(-j)\), and with \(\text{Good}\) otherwise. ◻
We define \(U(x, \mathcal{D})\) as the expected utility of the leader when using mixed strategy \(x\) under the type distribution \(\mathcal{D}\). By Lemma 7, we have \[\begin{align} U(x, \mathcal{D}) &= \sum_{\theta \in \Theta} \mathcal{D}(\theta)\, u\bigl(x, \mathrm{br}(\theta, x)\bigr) \nonumber \\ &= \sum_{j=1}^c \Bigl[\mathcal{D}(+j)\,u\bigl(x, \mathrm{br}(+j, x)\bigr) + \mathcal{D}(-j)\,u\bigl(x, \mathrm{br}(-j, x)\bigr)\Bigr] \nonumber \\ &= \sum_{j=1}^c \Big( \frac{1 + \sigma_j {\epsilon}}{2c}\, \mathbb{1}\left[x(+j) \ge x(-j)\right] + \frac{1-\sigma_j {\epsilon}}{2c} \mathbb{1}\left[x(+j) < x(-j)\right] \Big). \label{eq:leader-utility-1} \end{align}\tag{14}\]
Definition 9 (Disagreement Function). The disagreement function \(\mathrm{Disagree}(x, \mathcal{D})\) is the number of \(j\in\{1, \ldots, c\}\) where the indicators \(\mathbb{1}\left[x(+j) \ge x(-j)\right]\) and \(\mathbb{1}\left[\mathcal{D}(+j) \ge \mathcal{D}(-j)\right]\) differ: \[\begin{align} \mathrm{Disagree}\bigl(x, \mathcal{D}\bigr) &= \sum_{j=1}^c \mathbb{1}\Big[ \mathbb{1}\left[x(+j) \ge x(-j)\right] \;\neq\; \mathbb{1}\left[\mathcal{D}(+j) \ge \mathcal{D}(-j)\right] \Big] \\ &= \sum_{j=1}^c \mathbb{1}\Big[ \mathbb{1}\left[x(+j) \ge x(-j)\right] \;\neq\; \mathbb{1}\left[\sigma_j = +1\right] \Big]. \end{align}\]
Lemma 8. \(U(\mathcal{D}, \mathcal{D}) - U(x, \mathcal{D}) = \frac{{\epsilon}}{c} \cdot \mathrm{Disagree}(x, \mathcal{D})\). In particular, the optimal strategy for the leader is \(x^* = \mathcal{D}\).
Proof. \[\begin{align} U(\mathcal{D}, \mathcal{D}) - U(x, \mathcal{D}) = \sum_{j = 1}^c \Bigg(& \frac{1+\sigma_j{\epsilon}}{2c} \Big( \mathbb{1}\big[\mathcal{D}(+j) \geq \mathcal{D}(-j)\big] - \mathbb{1}\big[x(+j) \geq x(-j)\big] \Big) \\ & + \frac{1-\sigma_j {\epsilon}}{2c} \Big( \mathbb{1}\big[\mathcal{D}(+j) < \mathcal{D}(-j)\big] - \mathbb{1}\big[x(+j) < x(-j)\big] \Big) \Bigg). \end{align}\]
For each index \(j\) where the indicators \(\mathbb{1}\big[\mathcal{D}(+j) \ge \mathcal{D}(-j)\big]\) and \(\mathbb{1}\big[x(+j) \ge x(-j)\big]\) agree, the corresponding term in the summation evaluates to \(0\). When they disagree, there are two possible cases:
\(\mathcal{D}(+j) \geq \mathcal{D}(-j)\) but \(x(+j) < x(-j)\). Since \(\mathcal{D}(+j) \ge \mathcal{D}(-j)\) implies \(\sigma_j = +1\), the term simplifies to \[\frac{1 + {\epsilon}}{2c} - \frac{1 - {\epsilon}}{2c} = \frac{{\epsilon}}{c}.\]
\(\mathcal{D}(+j) < \mathcal{D}(-j)\) but \(x(+j) \geq x(-j)\). Here, \(\sigma_j = -1\), so the term simplifies to \[- \frac{1 - {\epsilon}}{2c} + \frac{1 + {\epsilon}}{2c} = \frac{ {\epsilon}}{c}.\]
Thus, we conclude that \[U(\mathcal{D}, \mathcal{D}) - U(x, \mathcal{D}) = \frac{{\epsilon}}{c} \cdot \mathrm{Disagree}(x, \mathcal{D}).\] ◻
Lemma 9. Let \(\hat{\mathcal{D}}_{x}\) be the estimated distribution constructed from \(x\) according to Equation (12). The total variation distance between \(\hat{\mathcal{D}}_x\) and \(\mathcal{D}\) is given by \[\delta( \hat{\mathcal{D}}_x , \mathcal{D}) = \frac{ {\epsilon}}{ c } \cdot \mathrm{Disagree}(x, \mathcal{D}).\]
Proof. By definition, \[\begin{align} \delta\bigl(\hat{\mathcal{D}}_x, \mathcal{D}\bigr) &\;=\; \frac{1}{2} \sum_{j = 1}^{c} \bigg( \Bigl|\hat{\mathcal{D}}_x(+j) - \mathcal{D}(+j)\Bigr| + \Bigl|\hat{\mathcal{D}}_x(-j) - \mathcal{D}(-j)\Bigr| \bigg). \end{align}\] If \(x\) and \(\mathcal{D}\) agree at \((+j)\) and \((-j)\), the corresponding term in the total variation sum is \(0\). Consequently, we only need to consider the case when a disagreement occurs.
\(x(+j) < x(-j)\) while \(\mathcal{D}(+j) \ge \mathcal{D}(-j)\). In this case, \[\hat{\mathcal{D}}_x(+j) = \frac{1 - {\epsilon}}{2c}, \quad \hat{\mathcal{D}}_x(-j) = \frac{1 + {\epsilon}}{2c}, \quad \mathcal{D}(+j) = \frac{1 + {\epsilon}}{2c}, \quad \mathcal{D}(-j) = \frac{1 - {\epsilon}}{2c}.\] Hence, \[\begin{align} \Bigl|\hat{\mathcal{D}}_x(+j) - \mathcal{D}(+j)\Bigr| + \Bigl|\hat{\mathcal{D}}_x(-j) - \mathcal{D}(-j)\Bigr| \;=\; \frac{{\epsilon}}{c} + \frac{{\epsilon}}{c} \;=\; \frac{2\,{\epsilon}}{c}. \end{align}\]
\(x(+j) \geq x(-j)\) while \(\mathcal{D}(+j) < \mathcal{D}(-j)\). Similarly, we have \[\hat{\mathcal{D}}_x(+j) = \frac{1 + {\epsilon}}{2c}, \quad \hat{\mathcal{D}}_x(-j) = \frac{1 - {\epsilon}}{2c}, \quad \mathcal{D}(+j) = \frac{1 - {\epsilon}}{2c}, \quad \mathcal{D}(-j) = \frac{1 + {\epsilon}}{2c}.\]
Again, we have \[\begin{align} \Bigl|\hat{\mathcal{D}}_x(+j) - \mathcal{D}(+j)\Bigr| + \Bigl|\hat{\mathcal{D}}_x(-j) - \mathcal{D}(-j)\Bigr| \;=\; \frac{{\epsilon}}{c} + \frac{{\epsilon}}{c} \;=\; \frac{2\,{\epsilon}}{c}. \end{align}\]
Thus, it follows that \[\delta( \hat{\mathcal{D}}_x, \mathcal{D}) = \frac{ {\epsilon}}{ c }\cdot \mathrm{Disagree}(x, \mathcal{D}).\] ◻
From Lemma 8 and Lemma 9, \[\label{eq:regret-equal-tv} U(\mathcal{D}, \mathcal{D}) - U(x, \mathcal{D}) ~ = ~ \frac{{\epsilon}}{c} \cdot \mathrm{Disagree}(x, \mathcal{D}) ~ = ~ \delta(\hat{\mathcal{D}}_x, \mathcal{D}).\tag{15}\]
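A short numeric check of Equation (15) (illustrative only), using the same indexing convention as the earlier sketch and arbitrary random leader strategies:

import numpy as np

rng = np.random.default_rng(1)
c, eps = 4, 0.2

def build(sigma):                                    # D_sigma with type +j at index j-1, -j at c+j-1
    D = np.empty(2 * c)
    for j, s in enumerate(sigma):
        D[j], D[c + j] = (1 + s * eps) / (2 * c), (1 - s * eps) / (2 * c)
    return D

D = build(rng.choice([-1, 1], size=c))               # the unknown distribution

def U(x):                                            # leader utility of strategy x, Equation (14)
    return sum(D[j] if x[j] >= x[c + j] else D[c + j] for j in range(c))

for _ in range(5):
    x = rng.dirichlet(np.ones(2 * c))                # an arbitrary leader strategy
    D_hat = build([+1 if x[j] >= x[c + j] else -1 for j in range(c)])   # estimator from (12)
    gap = U(D) - U(x)                                # regret of x, using x* = D (Lemma 8)
    tv = 0.5 * np.abs(D_hat - D).sum()               # total variation distance
    assert abs(gap - tv) < 1e-12                     # Equation (15)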
Consider the regret of the online learning algorithm \(\mathrm{Alg}\) for the Bayesian Stackelberg game, where the algorithm outputs \(x^t\) at round \(t\). By Equation (15) and Theorem 8, the expected regret at round \(t \leq \mathcal{O}(\frac{2c}{{\epsilon}^2})\) is at least \[\begin{align} \mathbb{E}[ U(\mathcal{D}, \mathcal{D}) - U(x^t, \mathcal{D}) ] ~ = ~ \mathbb{E}[ \delta(\hat{\mathcal{D}}_{x^t}, \mathcal{D}) ] ~ \ge ~ \Omega({\epsilon}). \end{align}\] Thus, the expected regret over \(T\) rounds is at least: \[\begin{align} \mathbb{E}[\mathrm{Reg}(T)] ~ = ~ \sum_{t=1}^T \mathbb{E}[ U(\mathcal{D}, \mathcal{D}) - U(x^t, \mathcal{D}) ] & ~ \ge ~ \min\Big\{T, \; \mathcal{O}\big(\frac{2c}{{\epsilon}^2}\big) \Big\} \cdot \Omega({\epsilon}) \\ & ~ \ge ~ \Omega( \sqrt{2c T}) ~ = ~ \Omega(\sqrt{\min\{K, L\} T}), \end{align}\] where we choose \({\epsilon}= \sqrt{ \frac{2c}{T} }\).
We now prove a lower bound of \(\Omega(\sqrt{\min\{L, nK\} T})\) on the expected regret of any online learning algorithm for Bayesian Stackelberg games with multiple followers. Without loss of generality, assume that \(nK\) is an even integer, and assume that the number of leader actions satisfies \(L \ge nK\). We do a reduction from the single-follower problem to the multi-follower problem.
Consider the single-follower Bayesian Stackelberg game instance defined in Appendix 8.4, but instead of a single follower with \(K\) types, we change the instance so that the single follower has \(nK\) types, indexed by \(\Theta = \{ (i, j) : i \in [n], j \in [K] \}\). Suppose the single follower’s type distribution \(\mathcal{D}\) belongs to the class \(\mathcal{C}\) in Definition 8 with support size \(2c = nK\) (instead of \(2c = K\)). Note that for such a \(\mathcal{D}\in \mathcal{C}\), \[\begin{align} \sum_{i=1}^n \sum_{j=1}^K \mathcal{D}(i, j) = 1 ~~ \text{ and }~~ \forall i\in[n],~~ \sum_{j=1}^K \mathcal{D}(i, j) = \frac{1}{n}. \end{align}\] The follower’s utility function \(v\) is given by (10), except that we now use \(\theta=(i, j)\) to represent a type and \(\ell=(i, j)\) to represent a leader’s action. The leader’s action set is \(\mathcal{L}= \Theta\), with utility function \(u\) given by (11).
We reduce the single-follower game to an \(n\)-follower game defined below. Consider a Bayesian Stackelberg game with \(n\) followers, each with \(K+1\) types. The type distribution, the players’ actions, and their utilities are defined below (to distinguish from the single-follower game, we use tilde notation \(\tilde{\cdot}\)):
Type distribution: The followers’ types are independently distributed according to distribution \(\tilde{\boldsymbol{\mathcal{D}}} = \prod_{i=1}^n \tilde{\mathcal{D}}_i\) where the probability that follower \(i \in [n]\) has type \(j\) is: \[\tilde{\mathcal{D}}_i(j) = \begin{cases} 1 - \frac{1}{100n} & \text{if } j = 0, \\ \frac{1}{100} \mathcal{D}(i, j) & \text{if } j = 1, \ldots, K. \end{cases}\]
Followers’ actions and utilities: Each follower has 3 actions \(\tilde{\mathcal{A}} = \{ \text{Good}, \text{Bad}, a_0 \}\). The utility of a follower \(i\) with type \(j \ne 0\) is equal to the utility of the single follower with type \((i, j)\). Utilities for type \(j=0\) and action \(a_0\) are specially defined: \[\tilde{v}_i(\ell, a, \theta_i = j) = \begin{cases} v(\ell, a, (i, j)) & \text{ if } \theta_i\ne 0 \text{ and } a\ne a_0 \\ -1 & \text{ if } \theta_i \ne 0 \text{ and } a = a_0, \\ 1 & \text{ if } \theta_i = 0 \text{ and } a = a_0, \\ -1 & \text{ if } \theta_i = 0 \text{ and } a \neq a_0. \end{cases}\] Note that the best-response action of a follower with type \(0\) is always \(a_0\), regardless of the leader’s strategy.
Leader’s actions and utilities: The leader has the same action set as the single-follower game: \(\mathcal{L}= \Theta = \{(i, j) : i\in [n], j\in[K]\}\). For any leader action \(\ell \in \mathcal{L}\), \[\tilde{u}(\ell, \boldsymbol{a}) = \begin{cases} 1 & \text{if } n-1 \text{ followers choose } a_0 \text{ and one plays Good}, \\ 0 & \text{otherwise}. \end{cases}\]
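A small sketch of this lifted type distribution; the matrix D below is an arbitrary placeholder that merely satisfies the row-sum property \(\sum_{j} \mathcal{D}(i, j) = \frac{1}{n}\) noted earlier:

import numpy as np

rng = np.random.default_rng(4)
n, K = 3, 2
D = rng.dirichlet(np.ones(n * K)).reshape(n, K)      # placeholder D over the nK types (i, j)
D = D / (n * D.sum(axis=1, keepdims=True))           # enforce sum_j D(i, j) = 1/n for every i

tilde_D = np.zeros((n, K + 1))                       # follower i's types are {0, 1, ..., K}
tilde_D[:, 0] = 1 - 1 / (100 * n)                    # dummy type 0
tilde_D[:, 1:] = D / 100                             # type j with probability D(i, j)/100
assert np.allclose(tilde_D.sum(axis=1), 1.0)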
Given an online learning algorithm \(\mathrm{Alg}\) for the \(n\)-follower problem, we construct an online learning algorithm for the single-follower problem as follows:
At each round \(t = 1, \ldots, T\):
Obtain a strategy \(x^t \in \Delta(\mathcal{L})\) from algorithm \(\mathrm{Alg}\). Output \(x^t\).
Receive a sample of the single follower’s type \(\theta^t = (i^t, j^t) \sim \mathcal{D}\).
For every follower \(i\in [n]\), we construct their type \(\theta_i^t\) in the following way: Independently flip a coin that comes up heads with probability \(1 - \frac{1}{100n}\). If it comes up heads, set the follower type \(\theta_i^t\) to \(0\). If it comes up tails, select the most recent unused sample of the form \((i^s = i, j^s)\) from the history \(\{(i^s, j^s)\}_{s=1}^t\), and set the follower’s type \(\theta_i^t\) to \(j^s\); each sample can be used only once. If there are insufficient samples, we halt the algorithm.
Provide the constructed types \((\theta_1^t, \ldots, \theta_n^t)\) to algorithm \(\mathrm{Alg}\).
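A minimal sketch of the type-construction step above (illustrative only); the per-follower buffers are our own choice of bookkeeping for the "most recent unused sample" rule:

import numpy as np

rng = np.random.default_rng(2)
n = 4
buffers = [[] for _ in range(n)]        # unused single-follower samples (i, j), bucketed by i

def construct_types(new_sample):
    i_t, j_t = new_sample               # the round-t sample from D
    buffers[i_t].append(j_t)            # newest sample sits at the back of its bucket
    types = []
    for i in range(n):
        if rng.random() < 1 - 1 / (100 * n):
            types.append(0)             # dummy type 0
        else:
            if not buffers[i]:
                raise RuntimeError("insufficient samples; halt the reduction")
            types.append(buffers[i].pop())   # consume the most recent unused sample
    return types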
In the above reduction process, if we always have sufficient samples in the third step at each round, then the distribution of the constructed type profiles \((\theta_1^t, \ldots, \theta_n^t)\) provided to algorithm \(\mathrm{Alg}\) is equal to the type distribution \(\tilde{\boldsymbol{\mathcal{D}}} = \prod_{i=1}^n \tilde{\mathcal{D}}_i\) of the \(n\)-follower game. Thus, from algorithm \(\mathrm{Alg}\)’s perspective, it is solving the \(n\)-follower game with unknown type distribution \(\tilde{\boldsymbol{\mathcal{D}}}\). We then argue that we have sufficient samples with high probability. Let \(H_i^t\) be the number of available samples in the history that we can use to set follower \(i\)’s type at round \(t\), and let \(N_i^t\) be the number of samples that we actually need. Define \[\begin{align} 1 - \delta(t) = \Pr\big(\forall i\in[n], H_i^t \geq N_i^t \big), \end{align}\] which is the probability that we have sufficient samples at round \(t\).
Claim 2. \(\delta(t) \le 2n \exp \Bigl( -\frac{t^2 \bigl(\tfrac{1}{100n} - \tfrac{1}{n}\bigr)^2}{2t} \Bigr)\).
Proof. Note that \(H_i^t\) and \(N_i^t\) are Binomial random variables: \(H_i^t \sim \mathrm{Bin}(t, \frac{1}{n})\), \(N_i^t \sim \mathrm{Bin}(t, \frac{1}{100n})\). So, by union bound and Hoeffding’s inequality: \[\begin{align} \Pr\bigl(\exists i \in [n],\, H_i^t < N_i^t \bigr) &\leq n \Pr\bigl(H_i^t < N_i^t \bigr) \leq 2n \exp \Bigl( -\frac{t^2 \bigl(\tfrac{1}{100n} - \tfrac{1}{n}\bigr)^2}{2t} \Bigr). \end{align}\] ◻
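As an illustrative numeric check of this Hoeffding bound for a single follower (not part of the proof):

import numpy as np

rng = np.random.default_rng(5)
n, t, trials = 2, 400, 20000
H = rng.binomial(t, 1 / n, size=trials)              # available samples for one follower
N = rng.binomial(t, 1 / (100 * n), size=trials)      # samples needed by that follower
print(np.mean(H < N), "<=", 2 * np.exp(-t * (1 / n - 1 / (100 * n)) ** 2 / 2))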
Let \(\tilde{U}(x)\) be the leader’s expected utility in the \(n\)-follower game (on type distribution \(\tilde{\boldsymbol{\mathcal{D}}}\)) and \(U(x)\) be the leader’s utility in the single-follower game (on type distribution \(\mathcal{D}\)). We note that, given any strategy \(x \in \Delta(\mathcal{L})\) of the leader, the best-response action of follower \(i\) with type \(\theta_i = j \ne 0\) (in the \(n\)-follower game) is equal to the best-response action of the single follower with type \((i, j)\), namely, \(\mathrm{br}_i(j, x) = \mathrm{br}((i, j), x)\). Thus, \[\begin{align} \tilde{U}(x) & ~ = ~ \Pr[\text{exactly one follower has a non-0 type}] \\ & \qquad \cdot \mathbb{E}[\text{leader's utility} \mid \text{exactly one follower has a non-0 type}] ~ + ~ 0 \\ & ~ = ~ \Bigl(1 - \frac{1}{100n}\Bigr)^{n-1} \sum_{i=1}^n \sum_{j=1}^K \frac{1}{100}\mathcal{D}(i, j)\, \mathbb{1}\left[\mathrm{br}_i\bigl(j, x \bigr) = \text{Good}\right] \\ & ~ = ~ \Bigl(1 - \frac{1}{100n}\Bigr)^{n-1} \sum_{i=1}^n \sum_{j=1}^K \frac{1}{100}\mathcal{D}(i, j)\, \mathbb{1}\left[\mathrm{br}\bigl((i, j), x \bigr) = \text{Good}\right] \\ & ~ = ~ \frac{1}{100} \Bigl(1 - \frac{1}{100n}\Bigr)^{n-1} \, U(x) \\ &~ \approx ~ \frac{1}{100} e^{-\frac{1}{100}} \, U(x). \end{align}\] Define \(C = \frac{1}{100}( 1 - \frac{1}{100n} )^{n - 1}\). Let \(\tilde{r}(t) = \tilde{U}(x^*) - \tilde{U}(x^t)\) denote the per-round regret of the online learning algorithm \(\mathrm{Alg}\) for the \(n\)-follower game. Let \(r(t) = U(x^*) - U(x^t)\) denote the per-round regret of the single-follower algorithm constructed by the above reduction. Consider the expected total regret in the \(n\)-follower game: \[\begin{align} \mathbb{E}[\tilde{\mathrm{Reg}}(T)] & = \sum_{t=1}^T \mathbb{E}[ \tilde{r}(t) ] \nonumber \\ & \ge \sum_{t=1}^T \Big( (1-\delta(t)) \cdot \mathbb{E}[ \tilde{r}(t) ] \; - \; \delta(t)\cdot 1 \Big) \nonumber \\ & = \sum_{t = 1}^{T} \bigl(1 - \delta(t)\bigr) \cdot C \cdot \mathbb{E}[r(t)] \;-\; \sum_{t = 1}^{T} \delta(t) \nonumber \\ & \ge C \cdot \sum_{t = 1}^{T} \mathbb{E}[r(t)] \;-\; \sum_{t = 1}^{T} \delta(t)\cdot C \cdot 1 \;-\; \sum_{t = 1}^{T} \delta(t) \nonumber \\ & = C \cdot \mathbb{E}[\mathrm{Reg}(T)] \;-\; (C + 1) \sum_{t = 1}^{T} \delta(t). \nonumber \end{align}\]
Now, we bound \(\sum_{t = 1}^{T} \delta(t)\). Consider a threshold \(\tau\) such that for all \(t \geq \tau\), we have \(\delta(t) \leq \frac{1}{T^2}\). To find \(\tau\), we solve \[2n \exp\Big( - \frac{(1 / 100n - 1 / n )^2}{2} \tau \Big) \leq \frac{1}{T^2}.\] Rearranging, it suffices to choose \[\begin{align} \tau \geq \frac{ 2\ln(2nT^2) }{C_{\tau} } \end{align}\] where \(C_{\tau} = ( \frac{1}{100n} - \frac{1}{n} )^2\). If \(t \leq \tau\), we bound \(\delta(t) \leq 1\). For \(t > \tau\), we use the bound \(\delta(t) \leq \frac{1}{T^2}\). Now, summing over all \(t\), \[\begin{align} \sum_{t = 1}^{T} \delta(t) ~\le~ \tau + (T - \tau)\,\frac{1}{T^2} ~ =~ \frac{2\ln\bigl(2nT^2\bigr)}{C_{\tau}} ~+~ \frac{T - \tau}{T^2} ~\le~ \mathcal{O}\Bigl(\frac{\ln T}{C_{\tau}} + \frac{1}{T}\Bigr) ~=~ \mathcal{O}(\log T), \end{align}\] treating \(n\) as a constant relative to \(T\). Then, \[\begin{align} \mathbb{E}[\tilde{\mathrm{Reg}}(T)] & ~ \ge ~ C\cdot \mathbb{E}[\mathrm{Reg}(T)] - \mathcal{O}(\log T). \end{align}\] The regret \(\mathbb{E}[\mathrm{Reg}(T)]\) for a single-follower game where the follower has \(nK\) types and the leader has \(L=nK\) actions is at least \(\Omega(\sqrt{nKT})\) by Theorem 7. Thus, we obtain \[\begin{align} \mathbb{E}[\tilde{\mathrm{Reg}}(T)]& ~ \ge ~ C \cdot \Omega(\sqrt{nKT}) - \mathcal{O}(\log T) ~ = ~ \Omega(\sqrt{nKT}), \end{align}\] which is also \(\Omega(\sqrt{\min\{L, nK\} T})\) because \(L = nK\).
We show that the online learning problem for a Bayesian Stackelberg game with action feedback can be solved with \(\mathcal{O}(K^n \sqrt{T} \log T)\) regret, by using a technique developed by [19].
[19] showed that the online learning problem for a linear program with an unknown objective parameter can be reduced to a linear bandit problem. We first show that the Bayesian Stackelberg game (which is not a linear program as defined in Definition 2) can be reformulated as a linear program. Then, we use [19]’s reduction to reduce the linear program formulation of the online Bayesian Stackelberg game to a linear bandit problem. A difference between our work and [19] is that they consider an adversarial online learning setting, whereas we consider a stochastic one. Directly applying [19]’s result would lead to an \(\widetilde{\mathcal{O}}(K^{\frac{3n}{2}}\sqrt{T})\) regret bound. Instead, we apply the OFUL algorithm for stochastic linear bandits [22] to obtain a better regret bound of \(\widetilde{\mathcal{O}}(K^{n}\sqrt{T})\).
First, we reformulate the Bayesian Stackelberg game optimization problem \(\max_{x \in \Delta(\mathcal{L})} U_{ \boldsymbol{\mathcal{D}} }(x)\) (Definition 2), which is a nonlinear program by definition, into a linear program. Let variable \(x\) represent a joint distribution over best-response functions \(W \in \mathcal{A}^{nK}\) and leader actions \(\ell \in \mathcal{L}\). Specifically, \[\begin{align} & x = ( x(W, \ell) )_{W \in \mathcal{A}^{nK}, \ell \in \mathcal{L}} \in \mathbb{R}^{A^{nK}\times L}, & \text{where } \sum_{W\in\mathcal{A}^{nK}, \ell \in \mathcal{L}} x(W, \ell) = 1, ~ \text{ and } ~ x(W, \ell) \ge 0. \end{align}\] Alternatively, \(x\) can be viewed as an \(A^{nK} \times L\)-dimensional matrix, where \(W\) indexes the rows and \(\ell\) indexes the columns. We maximize the following objective (which is linear in \(x\)): \[\begin{align} \label{eq:linear-program-objective} \max_{x} ~ U(x) = \sum_{ W \in \mathcal{A}^{nK} } \sum_{ \ell \in \mathcal{L}} \sum_{ \boldsymbol{\theta}\in \Theta^n } \boldsymbol{\mathcal{D}} ( \boldsymbol{\theta}) x(W, \ell) u(\ell, W(\boldsymbol{\theta})), \end{align}\tag{16}\] subject to the Incentive Compatibility (IC) constraint, meaning that the followers’ best-response actions are consistent with \(W\): \(\forall W \in \mathcal{A}^{nK}, \forall i \in [n], \forall \theta_i \in \Theta, \forall a_i \in \mathcal{A}\), \[\begin{align} \label{eq:linear-program-IC} \sum_{ \ell \in \mathcal{L}} x(W, \ell) \Big( v_i(\ell, w_i(\theta_i), \theta_i ) - v_i(\ell, a_i ,\theta_i ) \Big)\ge 0. \end{align}\tag{17}\]
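As a toy illustration (not the paper's algorithm), the following sketch instantiates the linear program (16)–(17) for a tiny single-follower instance with arbitrary utilities and solves it with scipy.optimize.linprog; all concrete numbers are placeholders:

import itertools
import numpy as np
from scipy.optimize import linprog

L_act, K, A = 2, 2, 2                   # leader actions, types, follower actions (n = 1)
rng = np.random.default_rng(3)
u = rng.random((L_act, A))              # leader utility u(ell, a)          (placeholder)
v = rng.random((L_act, A, K))           # follower utility v(ell, a, theta) (placeholder)
D = np.array([0.6, 0.4])                # type distribution over the K types

Ws = list(itertools.product(range(A), repeat=K))     # all A^K best-response functions W
idx = {(W, ell): r * L_act + ell for r, W in enumerate(Ws) for ell in range(L_act)}

# Objective (16): maximize sum_{W, ell, theta} D(theta) x(W, ell) u(ell, W(theta)).
c = np.zeros(len(idx))
for W in Ws:
    for ell in range(L_act):
        c[idx[(W, ell)]] = -sum(D[th] * u[ell, W[th]] for th in range(K))   # negated: linprog minimizes

# IC constraints (17): sum_ell x(W, ell) (v(ell, W(theta), theta) - v(ell, a, theta)) >= 0.
A_ub, b_ub = [], []
for W in Ws:
    for th in range(K):
        for a in range(A):
            row = np.zeros(len(idx))
            for ell in range(L_act):
                row[idx[(W, ell)]] = -(v[ell, W[th], th] - v[ell, a, th])
            A_ub.append(row)
            b_ub.append(0.0)

res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
              A_eq=np.ones((1, len(idx))), b_eq=[1.0],
              bounds=[(0, 1)] * len(idx), method="highs")
print("optimal leader utility:", -res.fun)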
Lemma 10. With known distribution \(\boldsymbol{\mathcal{D}}\), the Bayesian Stackelberg game can be solved by the linear program (16)–(17) in the following sense: there exists a solution \(x\) to (16)–(17) with only one non-zero row \(x(W^*, \cdot)\), and this row \(x(W^*, \cdot)\in \mathbb{R}^L\) is a solution to \(\max_{x\in\Delta(\mathcal{L})}U_{ \boldsymbol{\mathcal{D}} }(x)\).
Proof. First, we prove that the linear program (16)–(17) has an optimal solution with only one non-zero row. Suppose an optimal solution \(x\) has two non-zero rows \(W_1\), \(W_2\): \[\sum_{\ell \in \mathcal{L}} x(W_1, \ell) = p_1 > 0, \quad \sum_{\ell \in \mathcal{L}} x(W_2, \ell) = p_2 > 0.\] Consider the conditional expected utility of these two rows. Because, when conditioned on row \(W_i\), the conditional probability of playing action \(\ell\) is \(\frac{x(W_i, \ell)}{p_i}\), we have: \[\begin{align} u_1 = \frac{1}{p_1} \sum_{\ell \in \mathcal{L}} \sum_{\boldsymbol{\theta}\in \Theta^n} x(W_1, \ell) \boldsymbol{\mathcal{D}} (\boldsymbol{\theta}) u(\ell, W_1(\boldsymbol{\theta})), \\ u_2 = \frac{1}{p_2} \sum_{\ell \in \mathcal{L}} \sum_{\boldsymbol{\theta}\in \Theta^n} x(W_2, \ell) \boldsymbol{\mathcal{D}} (\boldsymbol{\theta}) u(\ell, W_2(\boldsymbol{\theta})). \end{align}\]
Without loss of generality, assume \(u_1 \geq u_2\). We construct a new solution \(x'\) by transferring the probability mass of row \(W_2\) to row \(W_1\). Specifically, \(x'\) is defined as follows: \[\begin{align} & x'(W_1, \ell) = \frac{p_1 + p_2}{p_1} x(W_1, \ell), && \forall \ell \in \mathcal{L}, \\ & x'(W_2, \ell) = 0, && \forall \ell \in \mathcal{L}, \\ & x'(W_j, \ell) = x(W_j, \ell), && \forall W_j \notin \{W_1, W_2\}, ~ \forall \ell \in \mathcal{L}. \end{align}\] It is straightforward to verify that \(x'\) satisfies the IC constraint. Now, we show that the utility of \(x'\) is weakly greater than the utility of \(x\): \[\begin{align} U(x') & = \sum_{\ell \in \mathcal{L}} \sum_{\boldsymbol{\theta}\in \Theta^n} \frac{p_1+p_2}{p_1} x(W_1, \ell) \boldsymbol{\mathcal{D}} (\boldsymbol{\theta}) u(\ell, W_1(\boldsymbol{\theta})) \\ & \quad + \text{utility from rows other than } \{W_1, W_2\} \\ & = (p_1+p_2) u_1 + \text{utility from rows other than } \{W_1, W_2\} \\ & \ge p_1 u_1 + p_2 u_2 + \text{utility from rows other than } \{W_1, W_2\} \\ & = U(x). \end{align}\] Note that the \(W_2\) row of \(x'\) has become \(0\). We can apply this construction iteratively until only one row remains non-zero, without decreasing utility, thus obtaining an optimal solution with only one non-zero row.
Let \(x^*\) be an optimal solution to the linear program (16)–(17) with only one non-zero row \(W^*\). Let \(x^*_{BS} \in \arg\max_{x\in\Delta(\mathcal{L})}U_{ \boldsymbol{\mathcal{D}} }(x)\) be an optimal solution for the Bayesian Stackelberg game. We prove that \(U(x^*) = U_{ \boldsymbol{\mathcal{D}} }(x^*_{BS})\).
First, we prove \(U_{ \boldsymbol{\mathcal{D}} }(x^*_{BS}) \le U(x^*)\). Let \(W^*_{BS}\) be the best-response function corresponding to \(x^*_{BS}\), i.e., \(x_{BS}^* \in R(W^*_{BS})\). We construct a feasible solution \(x\) to the linear program (16)–(17) by setting the row indexed by \(W^*_{BS}\) to \(x^*_{BS}\) and assigning zero values to all other rows. By definition, \(x\) satisfies the IC constraint, so it is a feasible solution. Moreover, \(x^*_{BS}\) and \(x\) achieve the same objective value \(U_{ \boldsymbol{\mathcal{D}} }(x^*_{BS}) = U(x)\). By definition, \(U(x)\) is weakly less than the optimal objective value \(U(x^*)\) of the linear program, so \(U_{ \boldsymbol{\mathcal{D}} }(x^*_{BS}) \le U(x^*)\).
Then, we prove \(U(x^*) \leq U_{ \boldsymbol{\mathcal{D}} }(x^*_{BS})\). Suppose the leader uses the strategy defined by the non-zero row of \(x^*\), which is \(x^*(W^*, \cdot) \in \Delta(\mathcal{L})\). By the IC constraint of the linear program, the best-response function of the followers is equal to \(W^*\), so the expected utility of the leader is exactly equal to \(U(x^*)\), which is \(\le U_{ \boldsymbol{\mathcal{D}} }(x^*_{BS})\) because \(x^*_{BS}\) is an optimal solution for the Bayesian Stackelberg game. ◻
Based on the linear program formulation (16)–(17), we then reduce the online Bayesian Stackelberg game problem to a linear bandit problem, using the technique in [19]. Let \(\mathcal{X} \subseteq \Delta( \mathcal{A}^{nK} \times \mathcal{L})\) be the set of feasible solutions to the linear program (16)–(17). We define the loss of a strategy \(x \in \mathcal{X}\) when the follower types are \(\boldsymbol{\theta}\in [K]^n\) as: \[\begin{align} L_{ \boldsymbol{\theta}}(x) = - \sum_{ W \in \mathcal{A}^{nK} } \sum_{ \ell \in \mathcal{L}} x(W, \ell) u( \ell, W(\boldsymbol{\theta}) ). \end{align}\]
We define a linear map \(\phi : \mathcal{X} \to \mathbb{R}^{ K^n }\) that maps a strategy \(x \in \mathcal{X}\) to a vector in \(\mathbb{R}^{K^n}\), representing the loss of the strategy for each type profile: \[\begin{align} \phi(x) &= \begin{pmatrix} L_{ \boldsymbol{\theta}_1 }( x ) \\ \vdots \\ L_{ \boldsymbol{\theta}_{K^n} }(x) \end{pmatrix} \in \mathbb{R}^{K^n}. \end{align}\] A pseudo-inverse map \(\phi^{\dagger} : \phi(\mathcal{X}) \to \mathcal{X}\) maps a loss vector back to a strategy that attains it. Let \(\mathrm{co} \, \phi( \mathcal{X} )\) denote the convex hull of the image set \(\phi( \mathcal{X} )\).
Let \(\mathfrak{R}\) be a stochastic linear bandit algorithm with decision space \(\mathrm{co} \, \phi( \mathcal{X} ) \subseteq \mathbb{R}^{K^n}\). In particular, we let \(\mathfrak{R}\) be the OFUL algorithm [22]. At each round, \(\mathfrak{R}\) outputs a point \(z^t \in \mathrm{co} \, \phi( \mathcal{X} )\), and we invoke a Carathéodory oracle to decompose \(z^t\) into a convex combination of at most \(K^n + 1\) elements of \(\phi( \mathcal{X} )\).6 We then sample one of these elements \(z_j^t\) according to the convex-combination weights, and apply the map \(\phi^\dagger\) to obtain a strategy \(x^t \in \mathcal{X}\) for the leader. After playing strategy \(x^t\), we observe the utility \(u^t = u(\ell^t, \boldsymbol{a}^t)\) and feed this utility feedback to \(\mathfrak{R}\).
Theorem 9. The expected regret of Algorithm 6 is \(\mathcal{O}(K^n\sqrt{T} \log T)\).
Proof. Because \(x^t \in \mathcal{X} \subseteq \Delta(\mathcal{A}^{nK}\times \mathcal{L})\) is a feasible solution to the linear program (16)–(17), it satisfies the IC constraint. So, when the leader plays \(x^t(W, \cdot) / p(W) \in \Delta(\mathcal{L})\), where \(p(W) = \sum_{\ell\in\mathcal{L}} x^t(W, \ell)\), the followers (with types \(\boldsymbol{\theta}\)) will best respond according to the function \(W(\boldsymbol{\theta})\). Thus, the leader’s expected utility at round \(t\) is \[\begin{align} \sum_{W\in A^{nK}} p(W) \sum_{\ell \in \mathcal{L}} \frac{x^t(W, \ell)}{p(W)} \sum_{\boldsymbol{\theta}\in [K]^n} \boldsymbol{\mathcal{D}} (\boldsymbol{\theta}) u(\ell, W(\boldsymbol{\theta})) = U(x^t) = - \mathbb{E}_{\boldsymbol{\theta}\sim \boldsymbol{\mathcal{D}} }[L_{\boldsymbol{\theta}}(x^t)]. \end{align}\] Then, the regret of Algorithm 6 in \(T\) rounds can be expressed as \[\begin{align} \mathrm{Reg}(T) ~ = ~ \sum_{t=1}^T \Big( \mathbb{E}_{\boldsymbol{\theta}\sim \boldsymbol{\mathcal{D}} }[ L_{\boldsymbol{\theta}}(x^t)] - \mathbb{E}_{\boldsymbol{\theta}\sim \boldsymbol{\mathcal{D}} } [L_{\boldsymbol{\theta}}(x^*)]\Big), \end{align}\] where \(x^*\) is the optimal strategy in \(\mathcal{X}\), which minimizes the expected loss (equivalently, maximizes the expected utility). Let \(\mathrm{Reg}_{ \mathfrak{R} , \mathrm{co} \,\phi( \mathcal{X} )}(T)\) be the expected regret of the linear bandit algorithm \(\mathfrak{R}\) on decision space \(\mathrm{co} \, \phi( \mathcal{X} )\) in \(T\) rounds. According to Theorem 3.1 of [19], \[\mathrm{Reg}(T) \leq \mathrm{Reg}_{ \mathfrak{R} , \mathrm{co} \, \phi( \mathcal{X} )}(T).\] We let \(\mathfrak{R}\) be the OFUL algorithm [22]. For any \(z \in \mathrm{co} \, \phi( \mathcal{X} ) \subseteq \mathbb{R}^{K^n}\), the stochastic loss of \(z\) can be expressed as \(L^t = \langle z, \boldsymbol{\mathcal{D}} \rangle + \eta^t\), with \(|L^t| \le 1\), \(\| \boldsymbol{\mathcal{D}} \|_2 \le \| \boldsymbol{\mathcal{D}} \|_1 = 1\), and \(\eta^t\) being a bounded zero-mean noise. Then, from [22]’s Theorem 3, we have with probability at least \(1 - \delta\), \[\begin{align} \mathrm{Reg}_{ \mathfrak{R} , \mathrm{co} \, \phi( \mathcal{X} )}(T) &\leq 4 \sqrt{ T K^n \log\left( \lambda + \frac{T L}{K^n} \right) } \cdot \left( \lambda^{1/2} + \sqrt{ 2 \log\left( \frac{1}{\delta} \right) + K^n \log\left( 1 + \frac{T L}{ \lambda K^n } \right) } \right) \end{align}\] where \(\lambda\) is a tunable parameter in the OFUL algorithm. By setting \(\lambda = 1\) and \(\delta = \frac{1}{T}\), we obtain \[\mathbb{E}[\mathrm{Reg}(T)] ~ \leq ~ (1 - \delta)\cdot \mathcal{O}( K^n \sqrt{T} \log T ) + \delta\cdot T ~ = ~ \mathcal{O}( K^n \sqrt{T} \log T ).\] ◻
We can express the leader’s utility function as \[u(x, \boldsymbol{a}) = \sum_{\ell \in \mathcal{L}} x(\ell) u(\ell, \boldsymbol{a}) = \langle x, u_{ \boldsymbol{a} } \rangle\] where vector \(u_{ \boldsymbol{a} } = (u(\ell, \boldsymbol{a}))_{\ell \in \mathcal{L}} \in \mathbb{R}^{L}\). Note that \(u(x, \boldsymbol{a})\) is a linear function of \(u_{\boldsymbol{a}}\). Consequently, the expected utility of a strategy \(x \in R(W)\) on the true distribution \(\boldsymbol{\mathcal{D}}\) is given by \[U(x, R(W)) = \mathbb{E}_{ \boldsymbol{a} \sim \mathcal{P}( \cdot | R(W) ) } [ \langle x, u_{\boldsymbol{a}} \rangle ].\] Given samples \(\boldsymbol{a}^{1}, ..., \boldsymbol{a}^N\), we can compute \(u_{ \boldsymbol{a}^{1} }, ..., u_{ \boldsymbol{a}^N }\) because we know the utility function. By Lemma 6, the pseudo-dimension of the family of linear functions \(\{ \langle x, \cdot \rangle : \mathbb{R}^L \to \mathbb{R} ~ | ~ x \in R(W) \}\) is \(L\). Applying Theorem 6, with \(N\) samples, we have \[\begin{align} \Pr\left[ \exists x \in R(W), ~ \big| U(x, R(W)) - \hat{U}_N(x, R(W)) \big| > \sqrt{\frac{2L \log 3N}{N}} + \sqrt{\frac{\log\frac{1}{\delta}}{2N}} ~ \right] \le \delta. \end{align}\] Let \(\delta = \frac{1}{T^4}\). Taking a union bound over all \(N \in \{ 1, ..., T \}\) and all \(W \in \mathcal{ W }\), we obtain \[\begin{align} & \Pr\bigg[ \exists W \in \mathcal{W}, \exists N \in [T], \exists x \in R(W), ~ \big| U(x, R(W)) - \hat{U}_N(x, R(W)) \big| > \sqrt{\frac{2L \log(3N)}{N}} + \sqrt{\frac{\log(T^4)}{2N}} ~ \bigg] \\ & \quad \le |\mathcal{W}| T \delta = \frac{|\mathcal{W}| T}{T^4} \le \frac{1}{T^2} \end{align}\] (assuming \(T \ge | \mathcal{ W } |\)). Thus, with probability at least \(1 - \frac{1}{T^2}\), for every \(W \in \mathcal{ W }\), \(N \in [T]\), and \(x \in R(W)\), we have \[\big|\, U(x, R(W)) - \hat{U}_N(x, R(W)) \, \big| ~ \le ~ \sqrt{ \frac{ 4(L + 1)\log(3T) }{N} }\] using the inequality \(\sqrt{a} + \sqrt{b} \leq \sqrt{2(a + b)}\).
By Lemma 5, the event \[C = \left[ \forall W \in \mathcal{W}, \forall N \in [T], \forall x \in R(W), \;\left| U(x, R(W)) - \hat{U}_N(x, R(W)) \right| \le \sqrt{\frac{4(L+1) \log(3T)}{N}} ~ \right]\] happens with probability at least \(1 - \frac{1}{T^2}\). Suppose \(C\) happens. The regret at round \(t\) is given by \[r(t) = U(x^*, R(W^*) ) - U(x^t, R(W^t)).\] For any strategy \(x \in R(W)\), we define the upper confidence bound of its utility as \[\mathrm{UCB}^t(x) = \hat{U}_{N^t(W)}(x, R(W)) + \sqrt{\frac{4(L+1) \log(3T)}{N^t(W)}}.\] Since \(C\) holds, it follows that \[U(x^*, R(W^*)) \leq \mathrm{UCB}^t(x^*).\] Because the UCB algorithm chooses the strategy with the highest upper confidence bound at round \(t\), we have \(\mathrm{UCB}^t(x^*) \leq \mathrm{UCB}^t(x^t)\). Thus, \[\begin{align} r(t) &\leq \mathrm{UCB}^{t}(x^*) - U(x^t, R(W^t)) \\ &\leq \mathrm{UCB}^t(x^t) - U(x^t, R(W^t)) \\ &= \hat{U}_{N^t(W^t)}(x^t, R(W^t)) - U(x^t, R(W^t)) + \sqrt{\frac{4(L+1) \log(3T)}{N^t(W^t)}} \\ &\leq 2 \sqrt{\frac{4(L+1) \log(3T)}{N^t(W^t)}}. \end{align}\] The total regret is at most \[\begin{align} \mathrm{Reg}(T) & ~ = ~ \sum_{t=1}^T r(t) \\ & ~ \le ~ 2 \sum_{t=1}^T \sqrt{\frac{4(L+1) \log(3T)}{N^t(W^t)}} \\ & ~ = ~ 2 \sum_{W\in \mathcal{ W } } \sum_{m=1}^{N^T(W)} \sqrt{\frac{4(L+1) \log(3T)}{m}} \\ & ~ \le ~ 8 \sum_{W\in \mathcal{ W } } \sqrt{N^T(W) \cdot (L+1) \cdot \log(3T)} \end{align}\] where we applied the inequality \(\sum_{m=1}^N \sqrt{\frac{1}{m}} \le 2 \sqrt{N}\). By Jensen’s inequality, \[\begin{align} \frac{1}{| \mathcal{ W } |} \sum_{W\in \mathcal{ W } } \sqrt{ N^T(W) } \;\leq\; \sqrt{ \frac{1}{| \mathcal{ W } |} \sum_{W\in \mathcal{ W } } N^T(W) } \; = \; \sqrt{ \frac{T}{| \mathcal{ W } |} }, \end{align}\] so \(\sum_{W\in \mathcal{ W } } \sqrt{N^T(W)} \le \sqrt{| \mathcal{ W } | T}\). Thus, \[\begin{align} \mathrm{Reg}(T) & ~ \le ~ 8 \sqrt{| \mathcal{ W } | T} \cdot \sqrt{(L+1) \log(3T)} \\ & ~ = ~ \mathcal{O}\Big( \sqrt{ | \mathcal{ W } | L \cdot T \log T}\Big) \\ & ~ = ~ \mathcal{O}\Big( \sqrt{ n^L K^L A^{2L} L \cdot T \log T}\Big) \end{align}\] where we used \(| \mathcal{ W } | = O(n^L K^L A^{2L})\) from Lemma 2.
Finally, considering the case where \(C\) does not happen (which has probability at most \(\frac{1}{T^2}\)), \[\begin{align} \mathbb{E}[\mathrm{Reg}(T)] ~ \le ~ \big(1 - \frac{1}{T^2}\big)\mathcal{O}\Big( \sqrt{ n^L K^L A^{2L} L \cdot T \log T}\Big) + \frac{1}{T^2} \cdot T ~ \leq ~ \mathcal{O}\Big( \sqrt{ n^L K^L A^{2L} L \cdot T \log T}\Big). \end{align}\]
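For concreteness, a minimal sketch of the UCB selection rule analyzed above; U_hat[W] is assumed to hold the optimal empirical utility within region \(R(W)\) (e.g., obtained from the per-region estimation step), and best_strategy_in_region is a hypothetical helper returning its maximizer:

import math

def ucb_select(regions, N, U_hat, L, T, best_strategy_in_region):
    """Pick the best-response region (and a strategy in it) with the largest upper confidence bound."""
    best_W, best_ucb = None, -math.inf
    for W in regions:
        if N[W] == 0:
            return W, best_strategy_in_region(W)     # play every region once before comparing bounds
        ucb = U_hat[W] + math.sqrt(4 * (L + 1) * math.log(3 * T) / N[W])
        if ucb > best_ucb:
            best_W, best_ucb = W, ucb
    return best_W, best_strategy_in_region(best_W)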
We empirically simulate and validate the studied algorithms in both the type-feedback and the action-feedback settings. For the former, we consider the independent-type setting to understand how much better Algorithm 2 (customized for independent types) performs in practice compared to the general-purpose Algorithm 1 (which works for general type distributions). We consider an \((L=2, K=6, A=2, n=2)\) instance and report the results in Figure 7. As expected, the specialized algorithm does indeed outperform the general one.
For the action-feedback case, we empirically compare our UCB-based Algorithm 3 with the linear bandit approach inspired by [19], Algorithm 6. We focus on the small-\(n\), small-\(L\) regime, where our theory does not provide concrete guidance on which algorithm is preferable. As shown in Figure [fig:action_feedback], we consider an \((L=2, K=6, A=2, n=2)\) instance and observe the advantage of the UCB-based algorithm over the linear bandit one.
Correspondence to: {tlin, shossain}@g.harvard.edu
This no-externality assumption is common in modeling a large population of agents [13], [24], [25].
In case of ties, we assume that followers break ties in favor of the leader.
Also note that Algorithm 1 does not need as input the entire utility function of the leader \(u(\cdot, \cdot)\), which has exponential size \(L\cdot A^n\). The algorithm only needs the utility function evaluated at the sampled types.
If \(K > 2c\), we can give the additional types probability \(0\). If \(L>2c\), we can give the additional leader actions very low utility.
The Carathéodory oracle is based on the well-known Carathéodory theorem.