Exact or Approximate Wasserstein Distributionally Robust Estimation According to Whether the Wasserstein Radii Are Small or Large


Abstract: This paper primarily considers the robust estimation problem under Wasserstein distance constraints on the parameter and noise distributions in the linear measurement model with additive noise, which can be formulated as an infinite-dimensional nonconvex minimax problem. We prove that the existence of a saddle point for this problem is equivalent to that for a finite-dimensional minimax problem, and give a counterexample demonstrating that the saddle point may not exist. Motivated by this observation, we present a verifiable necessary and sufficient condition whose parameters can be derived from a convex problem and its dual. Additionally, we introduce a simplified sufficient condition, which intuitively indicates that when the Wasserstein radii are small enough, the saddle point always exists. In the absence of the saddle point, we solve a finite-dimensional nonconvex minimax problem, obtained by restricting the estimator to be linear. Its optimal value establishes an upper bound on the robust estimation problem, while its optimal solution yields a robust linear estimator. Numerical experiments are also provided to validate our theoretical results.

Keywords: Robust estimation, Wasserstein distance, Saddle point

1 Introduction

Robust estimation is a classical methodology widely employed in statistics [1]–[3] and engineering [4]–[6] to deal with parameter uncertainty arising from limited observable data, noise, outliers, and measurement errors. Within this framework, decision makers are typically assumed to have only partial information about uncertain parameters. Based on this partial information, an appropriate uncertainty set can be constructed: for deterministic parameters, the set contains all possible values [7]; for random parameters, it can be defined by the family of admissible distributions or by the structured uncertainties about some statistics [8]. By minimizing the worst-case risk over the uncertainty set, one can derive estimators that exhibit relative insensitivity to deviations of the actual model from the assumed model [9], [10]. However, this also implies that the performance of a robust estimator depends heavily on the characterization of the uncertainties.

An important research direction in robust estimation for random parameters, known as distributionally robust estimation, is designed to minimize the worst-case risk within a specified distributional uncertainty set. A common assumption in this framework is that the unknown true distribution is proximate to a predefined nominal distribution in some sense, where the nominal distribution captures known characteristics and can often be chosen as the empirical distribution derived from observed samples. Numerous metrics are employed to quantify this proximity, including Kullback-Leibler (KL) divergence [11], Wasserstein distance [12], \(\phi\)-divergence [13], moments-based similarity [14], and \(\epsilon\)-contamination [15].

Under this assumption, the distributionally robust estimation problem can be formulated as an infinite-dimensional minimax problem. Existing work has explored related topics under various models. In [11], the authors derive the optimal least-squares estimator for the least-favorable distribution within a KL-divergence ball centered on a given nominal joint distribution of parameter and observation. This result was later applied to state-space models to obtain robust Kalman filters, yielding significant practical implications [16], [17]. Following this, [18] extends this result to the multi-sensor setting by imposing the same KL divergence constraint on each marginal joint distribution of the parameter and the single-sensor observation. Additionally, [19] considers the linear observation model with additive noise and obtains the minimum mean square error estimator corresponding to the least favorable distribution in a KL-divergence ball on the parameter distribution. Typically, these studies identify an optimal solution to a finite-dimensional auxiliary problem and establish its role as a saddle point solution to the infinite-dimensional minimax problem. On the other hand, for Wasserstein ambiguity sets, [20] considers the distributionally robust estimation problem under a Wasserstein distance ball on the joint distribution, reformulating the minimax problem as a semidefinite programming (SDP) problem via its Nash equilibrium. Subsequently, [13]–[15] consider similar distributionally robust estimation problems defined by other types of metrics in the state-space model.

However, as the model complexity increases, for instance in the linear measurement model with additive noise where separate uncertainties are assumed in both the parameter and the noise, a common scenario in practical dynamic systems, the distributionally robust estimation problem becomes nonconvex and the existence of saddle points is not always guaranteed. In this paper, we mainly focus on a Wasserstein distributionally robust estimation (WDRE) problem in the linear measurement model with additive noise, where the parameter and the noise are mutually independent, their true distributions are unknown, and each is constrained to lie in a Wasserstein-distance ball. This problem has been considered in [12], where the authors prove that it is equivalent to a finite-dimensional SDP problem when a saddle point exists. However, we provide an example in this paper to illustrate that the saddle point solution to this problem does not necessarily exist, motivating the pursuit of conditions that allow for its existence. On the other hand, if the saddle point is absent, the optimal value derived from the SDP problem can merely provide a lower bound for the optimal value of the WDRE problem. Consequently, we must either tackle the original infinite-dimensional nonconvex robust estimation problem directly or establish an upper bound by relaxing the minimax problem.

Our main contributions are summarized as follows:

  • We first prove that the saddle point solution of the WDRE problem exists if and only if that of a finite-dimensional minimax problem exists. This finite-dimensional problem is formulated by restricting the distribution to be Gaussian and the estimator to be linear. Then we establish a connection among the optimal values of this finite-dimensional problem, the WDRE problem, and their corresponding problems obtained by interchanging the minimization and maximization. Furthermore, we provide an example to illustrate that the saddle point solution of the WDRE problem may not exist.

  • We present a necessary and sufficient condition for the existence of the saddle point solution for the WDRE problem, which can be precisely characterized by determining its parameters through the resolution of a convex problem and its dual. Moreover, to simplify the assessment of the existence of the saddle point solution, we also provide a straightforward sufficient condition, which indicates that when the Wasserstein radii of the uncertainty sets are small enough, the saddle point always exists.

  • In the absence of the saddle point, we focus on a robust estimation problem with a linear estimator, which can be formulated as a finite-dimensional nonconvex problem. The optimal value of this problem, which serves as an upper bound for the WDRE problem, can be demonstrated to be equivalent to the optimal value of an SDP problem. Furthermore, leveraging the primal and dual optimal solutions of this SDP problem, we construct the optimal solution to the finite-dimensional nonconvex problem, which yields a robust linear estimator.

The rest of this paper is organized as follows. Section 2 formally presents the WDRE problem. Section 3 provides a theoretical analysis for the existence of the saddle point solution to the WDRE problem. In cases where the saddle point is absent, Section 4 addresses an upper bound problem that restricts the estimator to a linear function, which provides a robust linear estimator. Subsequently, Section 5 presents numerical experiments designed to validate the theoretical results. Finally, Section 6 concludes this paper.

Notation: Let \(\mathbb{R}^n\) be the \(n\) dimensional real vector space and \(\mathbb{R}^{n \times m}\) be the \(n \times m\) dimensional real matrix space. The notation \(I_n\) stands for the identity matrix in \(\mathbb{R}^{n \times n}\) and \(\boldsymbol{0}\) denotes the zero matrix of proper dimension. For any \(A \in \mathbb{R}^{n \times n}\), we use \(A^-\) and \(A^\dagger\) to denote the generalized inverse and the Moore-Penrose pseudo-inverse of the matrix \(A\). The notation \(Tr(A)\) denotes the trace of the matrix \(A\) and the null space of \(A\) is denoted by \(Null(A)\). For any \(A, B \in \mathbb{R}^{n \times n}\), the inner product of \(A\) and \(B\) is denoted by \(\left\langle A,B \right\rangle =Tr(AB)\) and \(A \perp B\) means \(\left\langle A,B \right\rangle =0\). Let \(\mathbb{S}^n\) be the space of symmetric matrices in \(\mathbb{R}^{n \times n}\) and \(\mathbb{S}^n_+\) be the cone of positive semidefinite matrices in \(\mathbb{S}^n\). For any \(A \in \mathbb{S}^n\), we use \(\lambda_{\min}(A)\) and \(\lambda_{\max}(A)\) to denote the minimum and maximum eigenvalues of \(A\), respectively. For any \(A \in \mathbb{S}^n_+\), the notation \(A^{\frac{1}{2}}\) denotes the unique positive semidefinite square root of \(A\). For any \(A,B \in \mathbb{S}^n\), the relation \(A \succ B\) \((A \succeq B)\) means that \(A-B\) is positive definite (semidefinite). For a measurable function of two variables \(f(x,y)\), we use \(\bigtriangledown_x f (x,y)\) to denote the partial derivative of \(f(x,y)\) with respect to \(x\). The notation \(\| \cdot \|\) denotes the Euclidean norm. A normal distribution with mean vector \(\mu\) and covariance matrix \(\Sigma\) is denoted by \(\mathcal{N}(\mu,\Sigma)\).

2 Problem Formulation

Consider the linear measurement model with additive noise \[y=H x +w,\] where \(x \in \mathbb{R}^n\) is the unknown parameter, \(y \in \mathbb{R}^m\) is the observation, \(H \in \mathbb{R}^{m \times n}\) is the known observation matrix and the noise \(w \in \mathbb{R}^m\) is independent of \(x\). Let a Lebesgue measurable function \(f:\mathbb{R}^m \to \mathbb{R}^n\) be an estimator that estimates parameter \(x\) from the noisy observation \(y\) and \[\mathcal{F} \triangleq \left\{ f \left| f:\mathbb{R}^m \to \mathbb{R}^n \text{ is a Lebesgue measurable function} \right\} \right. \label{fc}\tag{1}\] be the set of all estimators. Moreover, let \(\mathcal{P}_d\) denote the set of probability distributions of a random variable on \(\mathbb{R}^d\) with finite second order moments. Then for a given joint distribution of \(x\) and \(w\) denoted by \(P \in \mathcal{P}_{n+m}\), the mean square error (MSE) obtained by the estimator \(f\) is denoted by \[mse(f,P) \triangleq \int_{\mathbb{R}^n \times \mathbb{R}^m} \left\| f(Hx+w)-x\right\| ^2 \mathrm{d} P(x,w). \label{mse}\tag{2}\] Furthermore, the minimum mean square error (MMSE) of the distribution \(P\), as determined by the optimal estimator corresponding to \(P\), is provided by \[mmse(P) \triangleq \inf_{f \in \mathcal{F}} \int_{\mathbb{R}^n \times \mathbb{R}^m} \left\| f(Hx+w)-x\right\| ^2 \mathrm{d} P(x,w).\]
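For concreteness, the MSE in (2 ) can be approximated by Monte Carlo sampling. The following Python sketch (illustrative only, not part of the formal development; the sampler `sample_P` is an assumed user-supplied routine) evaluates the empirical counterpart of (2 ):

```python
import numpy as np

# Monte Carlo approximation of mse(f, P) in (2): draw (x, w) from P, form y = Hx + w,
# and average the squared estimation error ||f(y) - x||^2 over the samples.
def mse_monte_carlo(f, sample_P, H, num_samples=100_000, seed=0):
    rng = np.random.default_rng(seed)
    x, w = sample_P(rng, num_samples)        # arrays of shape (N, n) and (N, m)
    y = x @ H.T + w                          # y = Hx + w, applied row-wise
    err = f(y) - x                           # estimation error f(Hx + w) - x
    return np.mean(np.sum(err**2, axis=1))   # empirical mean squared error
```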

Similar to [12], assume that the marginal distributions \(P_x\) of the parameter \(x\) and \(P_w\) of the noise \(w\) are unknown. However, given the nominal distributions \(\hat{P}_x\) and \(\hat{P}_w\), the uncertainty about \(P_x\) and \(P_w\) can be quantified by bounding their Wasserstein distances from \(\hat{P}_x\) and \(\hat{P}_w\), respectively. In this paper, we focus exclusively on the type-2 Wasserstein distance, which is defined for two distributions \(P\) and \(\hat{P}\) on \(\mathbb{R}^d\) as follows: \[W_2\left( P,\hat{P} \right) \triangleq \inf_{\pi \in \Pi \left( P, \hat{P} \right) } \left( \int_{\mathbb{R}^d \times \mathbb{R}^d} \left\| \xi_1-\xi_2\right\| ^2 \pi(\mathrm{d}\xi_1,\mathrm{d}\xi_2) \right) ^\frac{1}{2},\] where \(\Pi \left( P, \hat{P} \right)\) denotes the set of all joint distributions of the random variables \(\xi_1\) and \(\xi_2\) that have marginal distributions \(P\) and \(\hat{P}\), respectively. Specifically, for the given nominal Gaussian marginal distributions \(\hat{P}_x\) and \(\hat{P}_w\), due to the independence of \(x\) and \(w\), the nominal joint distribution of \(x\) and \(w\) is \(\hat{P}=\hat{P}_x \times \hat{P}_w\), where the product of two distributions means that for any \((x,w)\in \mathbb{R}^n \times \mathbb{R}^m\), \(\hat{P}(x,w)=\hat{P}_x(x) \hat{P}_w(w)\). Taking the Wasserstein radii \(\rho_x \geq 0\) and \(\rho_w \geq 0\), we assume that the true joint distribution \(P\) belongs to the following set \[\mathbb{B}(\hat{P})\triangleq \left\{ P_x \times P_w \in \mathcal{P}_n \times \mathcal{P}_m \left| \begin{align} &W_2\left( P_x,\hat{P}_x \right) \leq \rho_x, \; \hat{P}_x=\mathcal{N}\left(\hat{\mu}_x,\hat{\Sigma}_x \right) \\ &W_2\left( P_w,\hat{P}_w \right) \leq \rho_w,\; \hat{P}_w=\mathcal{N}\left( \hat{\mu}_w,\hat{\Sigma}_w \right) \end{align} \right\} \right., \label{set}\tag{3}\] where \(\mathcal{N}(\mu,\Sigma)\) denotes a normal distribution with mean vector \(\mu\) and covariance matrix \(\Sigma\). We intend to find the optimal estimator for the least favorable distribution in the set (3 ), i.e., we consider the following robust estimation problem \[\inf_{f \in \mathcal{F}} \sup_{P \in \mathbb{B}(\hat{P})} mse \left( f,P \right), \label{minmax} \tag{4}\] where the objective function is given by (2 ), the constraint sets are defined by (1 ) and (3 ), respectively, and the following assumption will be made in this paper.

Assumption 1.

i) The Wasserstein radii \(\rho_x>0\) and \(\rho_w>0\).

ii) The nominal covariance matrices \(\hat{\Sigma}_x\) and \(\hat{\Sigma}_w\) are positive semidefinite.
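For Gaussian distributions such as the nominal marginals above, the type-2 Wasserstein distance admits a closed form (the Gelbrich distance appearing later in (13 )). A minimal Python sketch (assuming NumPy and SciPy are available; the helper name is illustrative) is:

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(mu1, Sigma1, mu2, Sigma2):
    """Type-2 Wasserstein distance between N(mu1, Sigma1) and N(mu2, Sigma2)."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    Sigma1, Sigma2 = np.asarray(Sigma1, float), np.asarray(Sigma2, float)
    root2 = np.real(sqrtm(Sigma2))                     # Sigma2^{1/2}
    cross = np.real(sqrtm(root2 @ Sigma1 @ root2))     # (Sigma2^{1/2} Sigma1 Sigma2^{1/2})^{1/2}
    gelbrich_sq = np.trace(Sigma1 + Sigma2 - 2.0 * cross)
    return np.sqrt(np.sum((mu1 - mu2)**2) + max(gelbrich_sq, 0.0))

# Scalar sanity check: W2(N(1, 4), N(0, 1)) = sqrt(1 + (4 + 1 - 2*2)) = sqrt(2)
print(gaussian_w2([1.0], [[4.0]], [0.0], [[1.0]]))
```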

3 The Existence of the Saddle Point Solution for (4 )

In this section, we present a theoretical analysis of the existence of the saddle point solution for the robust estimation problem (4 ). To begin with, we demonstrate that the saddle point solution for (4 ) exists if and only if a finite-dimensional minimax problem has a saddle point solution. Subsequently, we show that the saddle point solution does not always exist for (4 ) by means of a counterexample. Following this, we present a necessary and sufficient condition for the existence of the saddle point solution. Furthermore, we propose a straightforward sufficient condition, which does not require complex calculations and allows us to ascertain the existence of the saddle point solution when it is satisfied.

We first view problem (4 ) as an optimization problem whose infinite-dimensional variables are the distribution \(P\) and the estimator \(f\). Notice that the objective function of (4 ) is linear in \(P \in \mathbb{B}(\hat{P})\) for a given \(f\), and convex in \(f \in \mathcal{F}\) for a given \(P\). However, according to the definition in (3 ), the constraint set \(\mathbb{B}(\hat{P})\) consists of products of distributions taken from two convex Wasserstein balls, and such a set of product distributions is generally not convex. Consequently, it is hard to apply existing theoretical results (such as Sion’s minimax theorem) to directly deduce the existence of the saddle point solution for (4 ).

In order to discuss the existence of the saddle point solution for (4 ), we first formulate a finite-dimensional minimax problem. Specifically, we restrict the distribution to the set consisting only of Gaussian distributions \[\mathbb{B}_\mathcal{N}\left( \hat{P}\right) \triangleq \left\lbrace P_x \times P_w \in \mathcal{P}_n \times \mathcal{P}_m \left| \begin{align} &W_2\left( P_x,\hat{P}_x \right) \leq \rho_x, \; \hat{P}_x=\mathcal{N}\left(\hat{\mu}_x,\hat{\Sigma}_x \right), P_x=\mathcal{N}\left(\mu_x,\Sigma_x \right) \\ &W_2\left( P_w,\hat{P}_w \right) \leq \rho_w,\; \hat{P}_w=\mathcal{N}\left( \hat{\mu}_w,\hat{\Sigma}_w \right), P_w=\mathcal{N}\left(\mu_w,\Sigma_w \right) \end{align} \right\rbrace. \right. \label{Guass32class}\tag{5}\] Correspondingly, the estimator is limited to the set of linear functions \[\mathcal{F_L} \triangleq \left\lbrace f \in \mathcal{F} \left| \;\exists A \in \mathbb{R}^{n \times m}, b \in \mathbb{R}^n \text{ with } f(y)=Ay+b, \;\forall y \in \mathbb{R}^m \right\rbrace. \right. \label{linear32class}\tag{6}\] Then we introduce an auxiliary problem \[\inf_{f \in \mathcal{F_L}} \sup_{P \in \mathbb{B}_\mathcal{N}\left( \hat{P}\right) } mse \left( f,P \right), \label{minmaxG} \tag{7}\] where the objective function is given by (2 ), and the constraint sets are defined by (6 ) and (5 ), respectively. Essentially, (7 ) is a finite-dimensional robust estimation problem, because the linear estimator \(f\) can be parameterized by the matrix \(A\) and the vector \(b\), and the Gaussian distributions can also be fully described by their mean vectors and covariance matrices. Next, we shall explore the equivalence between the existence of the saddle point solution in (4 ) and that in (7 ).

First, we review a property of the Wasserstein distance.

Lemma 1. ([21]) Assume that the nominal distribution is Gaussian, i.e., \(P_0=\mathcal{N}(\mu_0,\Sigma_0)\). For an arbitrary distribution \(P\) with mean vector \(\mu_P\) and covariance matrix \(\Sigma_P\), we have \[W_2 \left( P , P_0 \right) \geq W_2 \left( \mathcal{N} \left( \mu_P,\Sigma_P\right) , P_0 \right).\]

Lemma 1 shows that for any distribution in \(\mathbb{B}(\hat{P})\), the Gaussian distribution defined by its mean vector and covariance matrix is still in \(\mathbb{B}(\hat{P})\). Then we have the following lemma.

Lemma 2. Suppose that Assumption 1 holds. The optimal values of (4 ), (7 ) and their corresponding problems obtained by exchanging the minimization and maximization are ordered as follows: \[\sup_{P \in \mathbb{B}_\mathcal{N}(\hat{P})} \inf_{f \in \mathcal{F_L}} mse(f,P) = \sup_{P \in \mathbb{B}(\hat{P})} \inf_{f \in \mathcal{F}} mse(f,P) \leq \inf_{f \in \mathcal{F}} \sup_{P \in \mathbb{B}(\hat{P})} mse(f,P) \leq \inf_{f \in \mathcal{F_L}} \sup_{P \in \mathbb{B}_\mathcal{N}(\hat{P})} mse(f,P). \label{lemma322462}\qquad{(1)}\]

Proof. We divide the proof into three parts, corresponding to the equality or inequalities in (?? ), respectively.

Step 1: We begin by proving the equality in (?? ).

For any \(P \in \mathbb{B}(\hat{P})\), let \(P_\mathcal{N}\) denote a Gaussian distribution \(\mathcal{N}(\mu_{P},\Sigma_{P})\) which has the same mean vector \(\mu_{P}\) and covariance matrix \(\Sigma_{P}\) as \(P\). Then \(P_\mathcal{N}\) belongs to \(\mathbb{B}(\hat{P})\) according to Lemma 1. Define \[\mathbb{B}_P\left( \hat{P}\right)\triangleq\left\lbrace P_\mathcal{N}=\mathcal{N}\left( \mu_{P},\Sigma_{P}\right) \left| \begin{align} &\exists P \in \mathbb{B}\left( \hat{P}\right) \text{ with mean vector } \mu_{P} \\ &\text{and covariance matrix } \Sigma_{P} \end{align} \right\rbrace \right..\] Then we have \(\mathbb{B}_P (\hat{P}) \subseteq \mathbb{B}_\mathcal{N}(\hat{P})\). Combined with \(\mathbb{B}_\mathcal{N}(\hat{P}) \subseteq \mathbb{B}_P( \hat{P})\) obtained from the definition of \(\mathbb{B}_\mathcal{N}( \hat{P})\), we derive \(\mathbb{B}_\mathcal{N}(\hat{P})= \mathbb{B}_P(\hat{P})\).

On the other hand, let the linear function \(f_\mathcal{N}\) denote the optimal estimator for \(P_\mathcal{N}\). Then we have \[mmse(P_\mathcal{N})= mse(f_\mathcal{N},P_\mathcal{N})= mse(f_\mathcal{N},P) \geq \inf_{f \in \mathcal{F}} mse (f,P) = mmse(P), \label{ieqc}\tag{8}\] where the second equality holds because for a linear estimator \(f(y)=Ay+b\), the MSE given by \[mse(f,P)=\int_{\mathbb{R}^n \times \mathbb{R}^m} \left\| x-(AHx+Aw+b)\right\| ^2 \mathrm{d}P\left( x,w\right)\] depends only on the first and second moments of the distribution \(P\). Thus, we obtain \[\sup_{P \in \mathbb{B}(\hat{P})} \inf_{f \in \mathcal{F}} mse(f,P)= \sup_{P \in \mathbb{B}(\hat{P})} mmse(P) \leq \sup_{P_\mathcal{N} \in \mathbb{B}_P(\hat{P})} mmse(P_\mathcal{N}) = \sup_{P \in \mathbb{B}_\mathcal{N}(\hat{P})} mmse(P)= \sup_{P \in \mathbb{B}_\mathcal{N}(\hat{P})} \inf_{f \in \mathcal{F}} mse(f,P),\] where the inequality follows from (8 ). Meanwhile, since \(\mathbb{B}_\mathcal{N}(\hat{P}) \subset \mathbb{B}(\hat{P})\), we derive \[\sup_{P \in \mathbb{B}_\mathcal{N}(\hat{P})} \inf_{f \in \mathcal{F}} mse(f,P) \leq \sup_{P \in \mathbb{B}(\hat{P})} \inf_{f \in \mathcal{F}} mse(f,P).\] Consequently, we have \[\sup_{P \in \mathbb{B}_\mathcal{N}(\hat{P})} \inf_{f \in \mathcal{F}} mse(f,P) = \sup_{P \in \mathbb{B}(\hat{P})} \inf_{f \in \mathcal{F}} mse(f,P)\] and then the equality in (?? ) arises from the optimality of the linear estimator under the Gaussian distribution.

Step 2: The first inequality in (?? ) is evident due to weak duality.

Step 3: Finally, we shall prove the last inequality in (?? ).

Since \(\mathcal{F_L} \subset \mathcal{F}\), we have \[\inf_{f \in \mathcal{F}} \sup_{P \in \mathbb{B}(\hat{P})} mse(f,P) \leq \inf_{f \in \mathcal{F_L}} \sup_{P \in \mathbb{B}(\hat{P})} mse(f,P). \label{2462463}\tag{9}\] For any \(P \in \mathbb{B}(\hat{P})\) with the mean vector \(\mu_{P}\) and covariance matrix \(\Sigma_{P}\), the Gaussian distribution \(P_\mathcal{N} = \mathcal{N}(\mu_{P}, \Sigma_{P})\) belongs to \(\mathbb{B}(\hat{P})\) according to Lemma 1. Then since the MSE is only related to the first and second moments of the distribution \(P\) for a linear estimator \(f \in \mathcal{F_L}\), it holds that \(mse(f,P) = mse(f,P_\mathcal{N})\), which further implies that \[\inf_{f \in \mathcal{F_L}} \sup_{P \in \mathbb{B}(\hat{P})} mse(f,P) = \inf_{f \in \mathcal{F_L}} \sup_{P \in \mathbb{B}_\mathcal{N}(\hat{P})} mse(f,P). \label{rlr}\tag{10}\] Thus, the last inequality in (?? ) holds by combining (9 ) and (10 ). ◻

Remark 1. Formula (4.1) in [12] presents a conclusion similar to the one above. However, in Lemma 2, we focus on the relationship between the optimal values of (4 ), (7 ) and their corresponding problems obtained by exchanging the minimization and maximization. We further prove the equality of the optimal values of the two maximin problems, which is key to proving the following theorem.

With the help of Lemma 2, we can now establish the equivalence between the existence of the saddle point solution in (4 ) and that in (7 ).

Theorem 1. Suppose that Assumption 1 holds. The saddle point solution of (4 ) exists if and only if the saddle point solution of (7 ) exists.

Proof. If the saddle point solution of (7 ) exists, it is easy to show that the saddle point solution of (4 ) exists from Lemma 2.

Conversely, if \(\sup_{P \in \mathbb{B}(\hat{P})} \inf_{f \in \mathcal{F}} mse(f,P)\) admits an optimal solution \((f^*,P^*)\), it follows from (8 ) that \(mmse(P^*) \leq mmse(P_\mathcal{N}^*)\), where the Gaussian distribution \(P_\mathcal{N}^*=\mathcal{N}(\mu_{P^*},\Sigma_{P^*})\) is given by the mean vector \(\mu_{P^*}\) and covariance matrix \(\Sigma_{P^*}\) of \(P^*\) and belongs to \(\mathbb{B}(\hat{P})\) according to Lemma 1. Furthermore, since \(P^*\) is the least favorable distribution in the sense of MMSE, we have \(mmse(P^*) \geq mmse(P_\mathcal{N}^*)\). Then it follows that \(mmse(P_\mathcal{N}^*)= mmse(P^*)\), which means that \(P_\mathcal{N}^*\) is also the least favorable distribution in the sense of MMSE.

Since (4 ) admits a saddle point solution, according to the optimality of the linear estimator under the Gaussian distribution and the Cartesian product property of the saddle point [22], we can deduce that there exists \(f_\mathcal{N}^* \in \mathcal{F_L}\) such that \((f_\mathcal{N}^*,P_\mathcal{N}^*)\) is a saddle point solution of (4 ). Consequently, \((f_\mathcal{N}^*,P_\mathcal{N}^*)\) is also the optimal solution to \(\sup_{P \in \mathbb{B}_\mathcal{N}(\hat{P})} \inf_{f \in \mathcal{F_L}} mse(f,P)\) according to the equality in (?? ), and for given \(f_\mathcal{N}^*\), \(P_\mathcal{N}^*\) is the least favorable distribution on the constraint set \(\mathbb{B}(\hat{P})\). Then since \(P^*_\mathcal{N} \in \mathbb{B}_\mathcal{N}(\hat{P}) \subseteq \mathbb{B}(\hat{P})\), \(P_\mathcal{N}^*\) is also the least favorable distribution on the smaller constraint set \(\mathbb{B}_\mathcal{N}(\hat{P})\). Therefore, \((f_\mathcal{N}^*,P_\mathcal{N}^*)\) constitutes a saddle point solution of (7 ). ◻

The above theorem shows that if a saddle point solution exists for (4 ), there must be a saddle point solution consisting of a linear estimator and a Gaussian distribution, which is also a saddle point solution for (7 ). Unfortunately, we have constructed a counterexample in which the saddle point solution of (7 ) does not exist.

Counterexample: Consider the scalar case, i.e., \(m=n=1\). Let the nominal means and variances be \(\hat{\mu}_x=\hat{\mu}_w=0\) and \(\hat{\Sigma}_x=\hat{\Sigma}_w=1\), respectively. Take the Wasserstein radii \(\rho_x=\rho_w=2\) and the observation matrix \(H=1\). Consider the problem \[\sup_{P \in \mathbb{B}_\mathcal{N}\left( \hat{P}\right) } \inf_{f \in \mathcal{F_L}} mse\left( f,P\right) .\] It can be parameterized as follows [12] \[\begin{align} \sup_{\mu_x,\mu_w,\Sigma_x,\Sigma_w} &\Sigma_x-\Sigma_x \left(\Sigma_x+\Sigma_w \right) ^{-1} \Sigma_x \\ s.t. \;&\mu_x^2 + \Sigma_x+1-2 \sqrt{\Sigma_x} \leq 4, \;\Sigma_x \geq 0 ,\\ & \mu_w^2+ \Sigma_w+1-2 \sqrt{\Sigma_w} \leq 4, \;\Sigma_w \geq 0 . \label{example} \end{align}\tag{11}\] Then the objective function of (11 ) can be expressed as \[\begin{align} \Sigma_x-\Sigma_x \left( \Sigma_x+\Sigma_w \right) ^{-1} \Sigma_x &=\Sigma_x \left( \Sigma_x+\Sigma_w \right) ^{-1} \left( \Sigma_x+\Sigma_w \right) -\Sigma_x \left( \Sigma_x+\Sigma_w \right) ^{-1} \Sigma_x \\ &=\Sigma_x \left( \Sigma_x+\Sigma_w \right) ^{-1} \Sigma_w \\ &=\left(\Sigma_w^{-1}+\Sigma_x^{-1} \right) ^{-1} , \end{align}\] which is monotonically increasing with \(\Sigma_x\) and \(\Sigma_w\). Therefore, (11 ) has a unique optimal solution given by \(\mu_x^*=\mu_w^*=0\), \(\Sigma_x^*=\Sigma_w^*=9\), i.e., the least favorable distribution \(P^*=\mathcal{N}(0,9) \times \mathcal{N}(0,9)\). Then the corresponding optimal estimator is given by \[f^*(y)= \frac{H \Sigma_x^*}{H^2 \Sigma_x^*+\Sigma_w^*}y=\frac{1}{2}y,\] and the mean square error is \[mse(f^*,P^*)=\Sigma_x^*-\Sigma_x^* \left(\Sigma_x^*+\Sigma_w^* \right) ^{-1} \Sigma_x^*=\frac{9}{2}.\] But in fact, for \(\tilde{P} = \mathcal{N}\left( \sqrt{3},4\right) \times \mathcal{N}\left( -\sqrt{3},4\right)\), the type-2 Wasserstein distances between the marginals of \(\tilde{P}\) and the corresponding nominal marginals are given by \[W_2 \left(\mathcal{N}(\sqrt{3},4), \mathcal{N}(0,1)\right) = \sqrt{\|\sqrt{3} - 0\|^2 + \left[ 4+1 - 2\times 4^\frac{1}{2}\right] } = 2,\] and \[W_2 \left(\mathcal{N}(-\sqrt{3},4), \mathcal{N}(0,1)\right) = \sqrt{\|-\sqrt{3}- 0\|^2 + \left[ 4 + 1 - 2 \times 4^\frac{1}{2}\right] } = 2.\] This implies that \(\tilde{P} \in \mathbb{B}_\mathcal{N}( \hat{P}) \subseteq \mathbb{B}(\hat{P})\). Moreover, we have \[\begin{align} mse( f^*,\tilde{P}) &=\int_{\mathbb{R} \times \mathbb{R}} \left\| \frac{1}{2} (x+w) - x \right\| ^2 \mathrm{d} \tilde{P} \\ &=\frac{1}{4} \Sigma_x +\frac{1}{4} \Sigma_w +\left( \frac{1}{2} \mu_x - \frac{1}{2}\mu_w \right) ^2 \\ &=\frac{1}{4} \times 4 +\frac{1}{4} \times 4 +\left( \frac{\sqrt{3}}{2} + \frac{\sqrt{3}}{2} \right) ^2 \\ &=5> mse\left( f^*,P^*\right) , \end{align}\] where \(\mu_x=\sqrt{3}\), \(\mu_w=-\sqrt{3}\) and \(\Sigma_x=\Sigma_w=4\) denote the means and variances of the marginals of \(\tilde{P}\). Hence, for the given \(f^*\), the corresponding least favorable distribution is not \(P^*\), which indicates that \(\left( f^*,P^*\right)\), the unique solution to the maximin problem, does not form a saddle point solution of (7 ).
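The numbers in this counterexample can be verified numerically. A minimal sketch (values taken from the text above; the helper name is illustrative) is:

```python
import numpy as np

def mse_linear_scalar(a, b, mu_x, sig_x, mu_w, sig_w, H=1.0):
    """MSE of the linear estimator f(y) = a*y + b in the scalar model y = H*x + w."""
    k = a * H - 1.0                                   # coefficient of x in f(y) - x
    bias = k * mu_x + a * mu_w + b
    return k**2 * sig_x + a**2 * sig_w + bias**2

a_star, b_star = 0.5, 0.0                             # estimator from the maximin solution
print(mse_linear_scalar(a_star, b_star, 0.0, 9.0, 0.0, 9.0))                      # 4.5 = mse(f*, P*)
print(mse_linear_scalar(a_star, b_star, np.sqrt(3.0), 4.0, -np.sqrt(3.0), 4.0))   # 5.0 > 4.5
```

With the `gaussian_w2` sketch given in Section 2, one can also confirm that both marginals of \(\tilde{P}\) lie exactly on the boundary of their Wasserstein balls of radius 2.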

Remark 2. The example above shows that, in general, the saddle point solution of (7 ) is not guaranteed to exist without additional assumptions, which implies that the saddle point solution of (4 ) may not exist either. In addition, a multi-dimensional example is given by simulation 1 in Section 5.

3.2 A Necessary and Sufficient Condition for the Existence of the Saddle Point Solution for (4 )

When a saddle point solution exists, it is easy to deduce from Theorem 1 that solving the infinite-dimensional robust estimation problem (4 ) is equivalent to solving the finite-dimensional minimax problem (7 ). In such a scenario, all inequalities in Lemma 2 become equalities. Consequently, as detailed in [12], the optimal solution to (4 ) can be effectively obtained by solving the tractable problem \[\sup_{P \in \mathbb{B}_\mathcal{N}\left( \hat{P}\right) } \inf_{f \in \mathcal{F_L}} mse\left( f,P\right) . \label{maxmina}\tag{12}\] Conversely, in the absence of the saddle point, the optimal value of (12 ) serves merely as a strict lower bound for the optimal value of (4 ). This complicates the resolution of (4 ) and motivates us to establish a necessary and sufficient condition for the existence of the saddle point.

To identify this condition, we commence our analysis with (12 ). Since the distribution and the estimator are restricted to be a Gaussian distribution in (5 ) and a linear estimator in (6 ), respectively, (12 ) can be reformulated as a finite-dimensional optimization problem, where the MSE is naturally parameterized by its definition as the following objective function, and the Wasserstein distance reduces to the Gelbrich distance [23]: \[\begin{align} \sup_{\mu_x,\mu_w,\Sigma_x,\Sigma_w} \inf_{A,b} \;\; &Tr\left[ \left( AH-I_n\right) \Sigma_x \left( AH-I_n\right) ^\top +A\Sigma_w A^\top \right] +\left[ \left( AH-I_n \right) \mu_x+A\mu_w+b\right] ^\top \left[ \left( AH-I_n \right) \mu_x+A\mu_w+b\right] \nonumber \\ s.t. \; & \Sigma_x \succeq{\boldsymbol{0}}, \quad \Sigma_w \succeq {\boldsymbol{0}}, \nonumber \\ & \left\| \mu_x-\hat{\mu}_x \right\| ^2+ Tr\left[\Sigma_x + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}}\Sigma_x\hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \leq \rho_x^2, \nonumber \\[0.5ex] & \left\| \mu_w-\hat{\mu}_w \right\| ^2+ Tr\left[\Sigma_w + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}}\Sigma_w\hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \leq \rho_w^2. \label{maxminap} \end{align}\tag{13}\] For fixed \(\mu_x\), \(\mu_w\), \(\Sigma_x\), \(\Sigma_w\) and \(A\), the optimal solution to the inner minimization problem is given by \(b^* = (I_n - A H)\mu_x - A \mu_w\), which minimizes the objective function by eliminating the quadratic term with respect to \(b\). By substituting \(b^*\) into the above equation, problem (13 ) becomes \[\begin{align} \sup \limits_{\mu_x,\mu_w,\Sigma_{x},\Sigma_w} \inf \limits_A \;\; &Tr\left[ \left( AH-I_n \right) \Sigma_x \left( AH-I_n \right) ^\top +A\Sigma_w A^\top \right]\\ s.t. \; &\Sigma_x \succeq{\boldsymbol{0}}, \quad \Sigma_w \succeq {\boldsymbol{0}}, \\&\left\| \mu_x-\hat{\mu}_x\right\| ^2+Tr\left[\Sigma_x + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}}\Sigma_x\hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \leq \rho_x^2, \\&\left\| \mu_w-\hat{\mu}_w\right\| ^2+Tr\left[\Sigma_w + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}}\Sigma_w\hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \leq \rho_w^2. \\\end{align}\] It is straightforward to observe that the objective function is independent of \(\mu_x\) and \(\mu_w\). Consequently, setting \(\mu_x^*=\hat{\mu}_x\) and \(\mu_w^*=\hat{\mu}_w\) results in the largest feasible set of \(\Sigma_x\) and \(\Sigma_w\), which leads to the largest objective function value. It follows that an optimal solution can be taken to satisfy \(\mu_x^*=\hat{\mu}_x\) and \(\mu_w^*=\hat{\mu}_w\). Accordingly, the above problem can be further simplified to \[\begin{align} \sup \limits_{\Sigma_{x},\Sigma_w} \inf \limits_A \;\; &Tr\left[ (AH-I_n)\Sigma_x(AH-I_n)^\top +A\Sigma_w A^\top \right]\\ s.t. \; & \Sigma_x \succeq{\boldsymbol{0}}, \quad \Sigma_w \succeq {\boldsymbol{0}}, \\& Tr\left[\Sigma_x + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}}\Sigma_x\hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \leq \rho_x^2, \\& Tr\left[\Sigma_w + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}}\Sigma_w\hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \leq \rho_w^2. \\\end{align} \label{maxmin1}\tag{14}\]
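As a small illustration of the parameterization in (13 ), the following Python sketch (illustrative names, not part of the formal development) evaluates the objective of (13 ) for a linear estimator and the offset \(b^*\) that removes its quadratic bias term:

```python
import numpy as np

def mse_linear_gaussian(A, b, mu_x, Sigma_x, mu_w, Sigma_w, H):
    """Objective of (13): MSE of f(y) = A y + b for given first and second moments."""
    n = A.shape[0]
    K = A @ H - np.eye(n)                    # AH - I_n
    bias = K @ mu_x + A @ mu_w + b           # (AH - I_n) mu_x + A mu_w + b
    return np.trace(K @ Sigma_x @ K.T + A @ Sigma_w @ A.T) + bias @ bias

def optimal_offset(A, mu_x, mu_w, H):
    """b* = (I_n - A H) mu_x - A mu_w, which eliminates the bias term above."""
    return (np.eye(A.shape[0]) - A @ H) @ mu_x - A @ mu_w
```

With \(b\) chosen by `optimal_offset`, the bias term vanishes and the objective reduces to the trace expression appearing in (14 ).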

For notational simplicity, we denote the constraint set in (14 ) by \[\mathbb{B}_\Sigma=\left\lbrace \left( \Sigma_x,\Sigma_w \right) \left| \begin{align} &\Sigma_x \succeq{\boldsymbol{0}}, \; Tr\left[\Sigma_x + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}}\Sigma_x\hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \leq \rho_x^2 \\ &\Sigma_w \succeq{\boldsymbol{0}}, \; Tr\left[\Sigma_w + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}}\Sigma_w\hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \leq \rho_w^2 \end{align} \right. \right\rbrace .\] Since the objective function of (14 ) is convex in \(A\) for each \((\Sigma_x,\Sigma_w)\), concave in \((\Sigma_x,\Sigma_w)\) for each \(A\), and \(\mathbb{B}_\Sigma\) is convex and compact [21], we can apply Sion’s minimax theorem [24] to conclude that \[\begin{align} &\sup_{(\Sigma_x,\Sigma_w) \in \mathbb{B}_\Sigma} \inf_A \quad \left\langle \left( I_n-AH\right) ^\top \left( I_n-AH\right) ,\Sigma_{x} \right\rangle +\left\langle A^\top A,\Sigma_w \right\rangle \tag{15} \\ &=\inf_A \sup_{(\Sigma_x,\Sigma_w) \in \mathbb{B}_\Sigma} \left\langle \left( I_n-AH \right) ^\top \left( I_n-AH \right) ,\Sigma_{x} \right\rangle +\left\langle A^\top A,\Sigma_w \right\rangle, \tag{16} \end{align}\] where (15 ) is a reformulation of (14 ) with simplified notations. Furthermore, since \(\mathbb{B}_\Sigma\) is compact and there exists a pair of positive definite matrices \((\Sigma_x,\Sigma_w) \in \mathbb{B}_\Sigma\) for which the objective function is coercive in \(A\), the above problems admit a saddle point solution [25].

Then problem (15 ) can be transformed into \[\begin{align} &\sup_{(\Sigma_x,\Sigma_w) \in \mathbb{B}_\Sigma} \inf_A \quad \left\langle \left( I_n-AH\right) ^\top \left( I_n-AH\right) ,\Sigma_{x} \right\rangle +\left\langle A^\top A,\Sigma_w \right\rangle \notag \\ &\begin{aligned} =\sup_{\substack{\Sigma_x\succeq{\boldsymbol{0}}\\ \Sigma_w\succeq{\boldsymbol{0}}}} \inf_{\substack {\gamma_x \geq 0 \\ \gamma_w \geq 0} } \; &\left[ \inf_A \quad \left\langle \left( I_n-AH\right) ^\top \left( I_n-AH\right) ,\Sigma_{x} \right\rangle +\left\langle A^\top A,\Sigma_w \right\rangle \right]+ \notag \\ &\gamma_x \left\{ \rho_x^2- Tr\left[\Sigma_x + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}}\Sigma_x\hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\}+ \gamma_w \left\{ \rho_w^2- Tr\left[\Sigma_w + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}}\Sigma_w\hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\} \end{aligned} \\ &\begin{align} =\sup_{\substack{\Sigma_x\succeq{\boldsymbol{0}}\\ \Sigma_w\succeq{\boldsymbol{0}}}} \inf_A \inf_{\substack {\gamma_x \geq 0 \\ \gamma_w \geq 0} } \quad &\left\langle \left(I_n-AH\right) ^\top \left(I_n-AH\right) ,\Sigma_{x} \right\rangle +\gamma_x \left\{ \rho_x^2- Tr\left[\Sigma_x + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}}\Sigma_x\hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\} \\ &+\left\langle A^\top A,\Sigma_w \right\rangle +\gamma_w \left\{ \rho_w^2- Tr\left[\Sigma_w + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}}\Sigma_w\hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\}, \label{maxminL} \end{align} \end{align}\tag{17}\] where the first equality arises from reformulating the constraints as penalty terms in the objective function. Specifically, when \((\Sigma_x,\Sigma_w) \in \mathbb{B}_\Sigma\), the corresponding optimal multiplier \((\gamma_x,\gamma_w)\) satisfies \[\gamma_x \left\lbrace \rho_x^2-Tr\left[\Sigma_x + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}} \Sigma_x \hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\rbrace =0\] and \[\gamma_w \left\lbrace \rho_w^2-Tr\left[\Sigma_w + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}} \Sigma_w \hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\rbrace =0,\] thereby keeping the objective function value unchanged; otherwise, the multiplier \(\gamma_x\) or \(\gamma_w\) tends to \(+\infty\), driving the objective function to \(-\infty\).

On the other hand, for any fixed \(A\), consider the inner maximization problem in (16 ) \[\sup_{(\Sigma_x,\Sigma_w) \in \mathbb{B}_\Sigma} \left\langle \left( I_n-AH\right) ^\top \left( I_n-AH \right) ,\Sigma_{x} \right\rangle +\left\langle A^\top A,\Sigma_w \right\rangle. \label{maxminis32in}\tag{18}\] Problem (18 ) maximizes a continuous concave function on a convex and compact set, and thus it has a finite optimal value. Moreover, since the Cartesian product of the two cones of positive semidefinite matrices is convex, the objective function is linear and the Gelbrich distance constraints are convex in \((\Sigma_x,\Sigma_w)\), strong duality holds under Slater condition, i.e., (18 ) is equivalent to its Lagrangian dual problem \[\begin{align} \inf_{\substack {\gamma_x \geq 0 \\ \gamma_w \geq 0} } \sup_{\substack{\Sigma_x\succeq{\boldsymbol{0}}\\ \Sigma_w\succeq{\boldsymbol{0}}}} \quad &\left\langle \left( I_n-AH\right) ^\top \left( I_n-AH\right) ,\Sigma_{x} \right\rangle +\gamma_x \left\{ \rho_x^2- Tr\left[\Sigma_x + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}}\Sigma_x\hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\} \\ &+\left\langle A^\top A,\Sigma_w \right\rangle +\gamma_w \left\{ \rho_w^2- Tr\left[\Sigma_w + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}}\Sigma_w\hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\}, \end{align} \label{maxminis32in32dual}\tag{19}\] and the set of optimal dual solutions is nonempty [26]. Thus, problem (16 ) is equivalent to \[\begin{align} \inf_A \inf_{\substack {\gamma_x \geq 0 \\ \gamma_w \geq 0} } \sup_{\substack{\Sigma_x\succeq{\boldsymbol{0}}\\ \Sigma_w\succeq{\boldsymbol{0}}}} \quad &\left\langle \left( I_n-AH\right) ^\top \left( I_n-AH\right) ,\Sigma_{x} \right\rangle +\gamma_x \left\{ \rho_x^2- Tr\left[\Sigma_x + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}}\Sigma_x\hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\} \\ &+\left\langle A^\top A,\Sigma_w \right\rangle +\gamma_w \left\{ \rho_w^2- Tr\left[\Sigma_w + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}}\Sigma_w\hat{\Sigma}_w^{\frac{1}{2}} \right) ^{\frac{1}{2}} \right] \right\}. \label{maxminLe} \end{align}\tag{20}\] We denote the objective function of (20 ), which is also the objective function of (17 ), as \[\begin{align} G(\Sigma_x,\Sigma_w,A,\gamma_x,\gamma_w) \triangleq &\left\langle \left( I_n-AH \right) ^\top \left( I_n-AH \right) ,\Sigma_{x} \right\rangle +\gamma_x \left\{ \rho_x^2- Tr\left[\Sigma_x + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}}\Sigma_x\hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\} \notag \\ &+\left\langle A^\top A,\Sigma_w \right\rangle +\gamma_w \left\{ \rho_w^2- Tr\left[\Sigma_w + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}}\Sigma_w\hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}} \right] \right\}. \label{G} \end{align}\tag{21}\] Then since the optimal values of problems (15 ) and (16 ) are equal, it follows that the optimal values of (17 ) and (20 ) are also equal, i.e., \[\sup_{\substack{\Sigma_x\succeq{\boldsymbol{0}}\\ \Sigma_w\succeq{\boldsymbol{0}}}} \inf_A \inf_{\substack {\gamma_x \geq 0 \\ \gamma_w \geq 0} } G(\Sigma_x,\Sigma_w,A,\gamma_x,\gamma_w)= \inf_A \inf_{\substack {\gamma_x \geq 0 \\ \gamma_w \geq 0} } \sup_{\substack{\Sigma_x\succeq{\boldsymbol{0}}\\ \Sigma_w\succeq{\boldsymbol{0}}}} G(\Sigma_x,\Sigma_w,A,\gamma_x,\gamma_w).\]

Furthermore, we shall give the relationship between saddle point solutions to problems (15 ) and (17 ).

Lemma 3. ([22]) Under Assumption 1, the following statements hold:

(i) If problem (17 ) has a saddle point solution \((\Sigma_x^*,\Sigma_w^*,A^*,\gamma_x^*,\gamma_w^*)\), then \((\Sigma_x^*,\Sigma_w^*,A^*)\) is a saddle point solution to (15 ).

(ii) If problem (15 ) has a saddle point solution \((\Sigma_x^*,\Sigma_w^*,A^*)\) and Slater condition holds, then there are \(\gamma_x^* \geq 0\) and \(\gamma_w^* \geq 0\) such that \((\Sigma_x^*,\Sigma_w^*,A^*,\gamma_x^*,\gamma_w^*)\) is a saddle point solution to (17 ).

Subsequently, the saddle point solution of problem (17 ) enables us to derive a necessary and sufficient condition for the existence of a saddle point solution for the robust estimation problem (4 ).

Lemma 4. Suppose that Assumption 1 holds. The saddle point solution of (4 ) exists if and only if (17 ) has a saddle point solution \((\Sigma_x^*,\Sigma_w^*,A^*,\gamma_x^*,\gamma_w^*)\) such that \[\begin{bmatrix} (I_n - A^* H)^\top (I_n - A^* H) - \gamma_x^*I_n & (I_n - A^* H)^\top A^* \\ (A^*)^\top (I_n - A^* H) & (A^*)^\top A^* - \gamma_w^*I_m \end{bmatrix} \preceq {\boldsymbol{0}}.\]

Proof. For ease of notation, define \(K^* \triangleq I_n - A^* H\). Notice that Theorem 1 shows that the saddle point solution of (4 ) exists if and only if that of (7 ) exists. Then since (12 ) is obtained by exchanging the supremum and infimum of (7 ) and it can be parameterized as (13 ), it suffices to prove that the saddle point solution of (13 ) exists if and only if (17 ) has a saddle point solution \((\Sigma_x^*,\Sigma_w^*,A^*,\gamma_x^*,\gamma_w^*)\) such that \(\begin{bmatrix} (K^*)^\top (K^*) - \gamma_x^*I_n & (K^*)^\top A^* \\ (A^*)^\top (K^*) & (A^*)^\top A^* - \gamma_w^*I_m \end{bmatrix} \preceq {\boldsymbol{0}}\).

“\(\Longleftarrow\)” Sufficiency.

We first demonstrate that, if problem (17 ) has a saddle point solution \((\Sigma_x^*,\Sigma_w^*,A^*,\gamma_x^*,\gamma_w^*)\) such that \(\begin{bmatrix} (K^*)^\top (K^*) - \gamma_x^*I_n & (K^*)^\top A^* \\ (A^*)^\top (K^*) & (A^*)^\top A^* - \gamma_w^*I_m \end{bmatrix} \preceq {\boldsymbol{0}}\), then \((A^*,b^*,\hat{\mu}_x,\hat{\mu}_w,\Sigma_x^*,\Sigma_w^*)\) constitutes a saddle point solution of (13 ), where \(\hat{\mu}_x\) and \(\hat{\mu}_w\) are the nominal mean vectors of parameter and noise, respectively, and \(b^* \triangleq (I_n - A^* H)\hat{\mu}_x - A^* \hat{\mu}_w\).

Since \((\Sigma_x^*,\Sigma_w^*,A^*,\gamma_x^*,\gamma_w^*)\) is a saddle point solution to (17 ), it follows from Lemma 3 that \((\Sigma_x^*, \Sigma_w^*,A^*)\) is an optimal solution to (15 ), which implies that \((A^*,b^*,\hat{\mu}_x,\hat{\mu}_w,\Sigma_x^*,\Sigma_w^*)\) is an optimal solution to (13 ). To further establish that \((A^*,b^*,\hat{\mu}_x,\hat{\mu}_w,\Sigma_x^*,\Sigma_w^*)\) is a saddle point solution of (13 ), it suffices to show that, for given \(A^*\) and \(b^*\), the least favorable distribution is determined by the mean vectors \(\hat{\mu}_x\), \(\hat{\mu}_w\) and covariance matrices \(\Sigma_x^*\), \(\Sigma_w^*\). Specifically, we intend to prove that \((\hat{\mu}_x,\hat{\mu}_w,\Sigma_x^*,\Sigma_w^*)\) is an optimal solution to the following problem \[\begin{align} \sup_{\mu_x,\mu_w,\Sigma_x,\Sigma_w} &\left\langle \left( K^*\right) ^\top K^*,\Sigma_{x} \right\rangle + \left\langle \left( A^*\right) ^\top A^*,\Sigma_w \right\rangle +\left\| K^*\left( \mu_x-\hat{\mu}_x \right) -A^*\left( \mu_w-\hat{\mu}_w\right) \right\| ^2\\ s.t. \; & \Sigma_x \succeq{\boldsymbol{0}}, \quad \Sigma_w \succeq {\boldsymbol{0}}, \\ & \|\mu_x-\hat{\mu}_x\|^2+ Tr\left[\Sigma_x + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}}\Sigma_x\hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \leq \rho_x^2, \\[0.5ex] & \|\mu_w-\hat{\mu}_w\|^2+ Tr\left[\Sigma_w + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}}\Sigma_w\hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \leq \rho_w^2. \end{align} \label{saddle32judge}\tag{22}\] Then we denote the Lagrangian function of (22 ) as \[\begin{align} L(\mu_x,\mu_w,\Sigma_x,\Sigma_w;\gamma_x,\gamma_w) \triangleq &\left\langle \left( K^*\right) ^\top K^*,\Sigma_{x} \right\rangle + \left\langle \left( A^*\right) ^\top A^*,\Sigma_w \right\rangle +\left\| K^*\left( \mu_x-\hat{\mu}_x\right) -A^*\left( \mu_w-\hat{\mu}_w\right) \right\| ^2 \notag \\ &+\gamma_x \left\{ \rho_x^2- \|\mu_x-\hat{\mu}_x\|^2- Tr\left[\Sigma_x + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}}\Sigma_x\hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\} \notag \\ &+\gamma_w \left\{ \rho_w^2- \|\mu_w-\hat{\mu}_w\|^2- Tr\left[\Sigma_w + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}}\Sigma_w\hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\} \notag \\ =& \left\langle \left( K^*\right) ^\top K^*,\Sigma_{x} \right\rangle +\gamma_x \left\{ \rho_x^2- Tr\left[\Sigma_x + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}}\Sigma_x\hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\} \notag \\ &+\left\langle \left( A^*\right) ^\top A^*,\Sigma_w \right\rangle +\gamma_w \left\{ \rho_w^2- Tr\left[\Sigma_w + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}}\Sigma_w\hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\} \notag \\ &+\begin{bmatrix} \mu_x-\hat{\mu}_x \\ \mu_w-\hat{\mu}_w \end{bmatrix}^\top \begin{bmatrix} (K^*)^\top K^*-\gamma_xI_n & (K^*)^\top A^* \\ (A^*)^\top K^* & (A^*)^\top A^* -\gamma_wI_m \end{bmatrix} \begin{bmatrix} \mu_x-\hat{\mu}_x \\ \mu_w-\hat{\mu}_w \end{bmatrix}, \label{F} \end{align}\tag{23}\] where the equality comes from rearranging items and combining like items. Next we discuss the relationship between \(L\) and the objective function of (17 ), which is defined as \(G\) in (21 ). 
Note that taking \(\mu_x=\hat{\mu}_x\) and \(\mu_w=\hat{\mu}_w\) eliminates the quadratic term with respect to \((\mu_x, \mu_w)\) in \(L\), and setting \(A=A^*\) in \(G\) ensures that its inner product terms with respect to \(\Sigma_x\) and \(\Sigma_w\) are identical to those in \(L\), i.e., for any \(\Sigma_x \succeq{\boldsymbol{0}}\), \(\Sigma_w \succeq {\boldsymbol{0}}\), \(\gamma_x \geq 0\) and \(\gamma_w \geq 0\), \[L(\hat{\mu}_x,\hat{\mu}_w,\Sigma_x,\Sigma_w;\gamma_x,\gamma_w)= G(\Sigma_x,\Sigma_w,A^*,\gamma_x,\gamma_w). \label{eqL}\tag{24}\] Since \((\Sigma_x^*, \Sigma_w^*, A^*, \gamma_x^*, \gamma_w^*)\) is a saddle point solution of (17 ), for any \(\Sigma_x \succeq{\boldsymbol{0}}\), \(\Sigma_w \succeq {\boldsymbol{0}}\), \(A\), \(\gamma_x \geq 0\) and \(\gamma_w \geq 0\), we have \[G(\Sigma_x,\Sigma_w,A^*,\gamma_x^*,\gamma_w^*) \leq G(\Sigma_x^*,\Sigma_w^*,A^*,\gamma_x^*,\gamma_w^*) \leq G(\Sigma_x^*,\Sigma_w^*,A,\gamma_x,\gamma_w). \label{sd32per}\tag{25}\] Then for any \(\mu_x\), \(\mu_w\), \(\Sigma_x \succeq{\boldsymbol{0}}\), \(\Sigma_w \succeq {\boldsymbol{0}}\), \(\gamma_x \geq 0\) and \(\gamma_w \geq 0\), we obtain \[\begin{align} &L(\mu_x,\mu_w,\Sigma_x,\Sigma_w;\gamma_x^*,\gamma_w^*) \leq L(\hat{\mu}_x,\hat{\mu}_w,\Sigma_x,\Sigma_w;\gamma_x^*,\gamma_w^*) =G(\Sigma_x,\Sigma_w,A^*,\gamma_x^*,\gamma_w^*)\\ &\leq G(\Sigma_x^*,\Sigma_w^*,A^*,\gamma_x^*,\gamma_w^*)= L(\hat{\mu}_x,\hat{\mu}_w,\Sigma_x^*,\Sigma_w^*;\gamma_x^*,\gamma_w^*) \\ &\leq G(\Sigma_x^*,\Sigma_w^*,A^*,\gamma_x,\gamma_w) = L(\hat{\mu}_x,\hat{\mu}_w,\Sigma_x^*,\Sigma_w^*;\gamma_x,\gamma_w), \end{align}\] where the first inequality is due to \(\begin{bmatrix} (K^*)^\top K^*-\gamma_x^*I_n & (K^*)^\top A^* \\ (A^*)^\top K^* & (A^*)^\top A^* -\gamma_w^*I_m \end{bmatrix} \preceq {\boldsymbol{0}}\); the second inequality arises from the first inequality in (25 ); the last inequality follows from the second inequality in (25 ) with \(A=A^*\) and all equalities are due to (24 ). Therefore, the Lagrangian function \(L\) admits a saddle point \((\hat{\mu}_x,\hat{\mu}_w,\Sigma_x^*,\Sigma_w^*,\gamma_x^*,\gamma_w^*)\). Combined with the saddle point theorem [26], this further implies that \((\hat{\mu}_x,\hat{\mu}_w,\Sigma_x^*,\Sigma_w^*)\) is an optimal solution to (22 ).

“\(\Longrightarrow\)” Necessity.

Assuming that (13 ) has a saddle point solution \((A^*,b^*,\hat{\mu}_x,\hat{\mu}_w,\Sigma_x^*,\Sigma_w^*)\), it follows that: (i) (15 ) has an optimal solution \((\Sigma_x^*,\Sigma_w^*,A^*)\), and (ii) for the given \(A^*\) and \(b^*\), the least favorable distribution is determined by the mean vectors \(\hat{\mu}_x\), \(\hat{\mu}_w\) and covariance matrices \(\Sigma_x^*\), \(\Sigma_w^*\), that is, (22 ) has an optimal solution \((\hat{\mu}_x,\hat{\mu}_w,\Sigma_x^*,\Sigma_w^*)\).

We first consider the following equivalent problem of (22 ) \[\begin{align} \sup_{\substack{\Sigma_x\succeq{\boldsymbol{0}}\\ \Sigma_w\succeq{\boldsymbol{0}}}} \sup_{\mu_x,\mu_w} \inf_{\substack {\gamma_x \geq 0 \\ \gamma_w \geq 0} } \; &\left\langle \left( K^*\right) ^\top K^*,\Sigma_{x} \right\rangle + \left\langle \left( A^*\right) ^\top A^*,\Sigma_w \right\rangle +\|K^*(\mu_x-\hat{\mu}_x)-A^*(\mu_w-\hat{\mu}_w)\|^2\\ &+\gamma_x \left\{ \rho_x^2- \|\mu_x-\hat{\mu}_x\|^2- Tr\left[\Sigma_x + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}}\Sigma_x\hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\} \\ &+\gamma_w \left\{ \rho_w^2- \|\mu_w-\hat{\mu}_w\|^2- Tr\left[\Sigma_w + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}}\Sigma_w\hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\}. \\ \end{align} \label{saddle32judge1}\tag{26}\] For fixed \(\Sigma_x\) and \(\Sigma_w\), the maximization problem over \((\mu_x,\mu_w)\) in (22 ) is a quadratically constrained quadratic program (QCQP) in which the two constraint functions and the objective function are all homogeneous quadratic. Under the assumption that \(\rho_x>0\) and \(\rho_w>0\), Slater condition is readily verified. Then according to Theorem 2.5 in [27], strong duality holds for the maximization problem over \((\mu_x,\mu_w)\) in (22 ) and its dual. Specifically, the maximization over \((\mu_x,\mu_w)\) and the minimization over \((\gamma_x,\gamma_w)\) in (26 ) can be interchanged. Then (26 ) is equivalent to \[\begin{align} \sup_{\substack{\Sigma_x\succeq{\boldsymbol{0}}\\ \Sigma_w\succeq{\boldsymbol{0}}}} \inf_{\substack {\gamma_x \geq 0 \\ \gamma_w \geq 0} } \sup_{\mu_x,\mu_w} \; &\left\langle \left( K^*\right) ^\top K^*,\Sigma_{x} \right\rangle +\gamma_x \left\{ \rho_x^2- Tr\left[\Sigma_x + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}}\Sigma_x\hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\} \\ &+\left\langle \left( A^*\right) ^\top A^*,\Sigma_w \right\rangle +\gamma_w \left\{ \rho_w^2- Tr\left[\Sigma_w + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}}\Sigma_w\hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\} \\ &+\begin{bmatrix} \mu_x-\hat{\mu}_x \\ \mu_w-\hat{\mu}_w \end{bmatrix}^\top \begin{bmatrix} (K^*)^\top K^*-\gamma_xI_n & (K^*)^\top A^* \\ (A^*)^\top K^* & (A^*)^\top A^* -\gamma_wI_m \end{bmatrix} \begin{bmatrix} \mu_x-\hat{\mu}_x \\ \mu_w-\hat{\mu}_w \end{bmatrix}. \end{align} \label{saddle32judge2}\tag{27}\] For any pair of positive semidefinite matrices \((\Sigma_x,\Sigma_w)\), consider the inner minimax problem in (27 ). It is easy to see that its optimal solution satisfies \[(\gamma^*_x,\gamma^*_w) \in \mathcal{A} \triangleq \left\lbrace (\gamma_x,\gamma_w) \left| \begin{bmatrix} (K^*)^\top K^*-\gamma_xI_n & (K^*)^\top A^* \\ (A^*)^\top K^* & (A^*)^\top A^* -\gamma_wI_m \end{bmatrix} \preceq \boldsymbol{0} \right. \right\rbrace \label{calA}\tag{28}\] and the corresponding maximizer with respect to \((\mu_x,\mu_w)\) makes the quadratic term equal to zero, i.e., \[\begin{bmatrix} \mu_x^*-\hat{\mu}_x \\ \mu_w^*-\hat{\mu}_w \end{bmatrix} \in Null \left( \begin{bmatrix} (K^*)^\top K^*-\gamma^*_xI_n & (K^*)^\top A^* \\ (A^*)^\top K^* & (A^*)^\top A^* -\gamma^*_wI_m \end{bmatrix} \right). \label{nulla}\tag{29}\] Otherwise, if (28 ) does not hold, the matrix \(\begin{bmatrix} (K^*)^\top K^*-\gamma_xI_n & (K^*)^\top A^* \\ (A^*)^\top K^* & (A^*)^\top A^* -\gamma_wI_m \end{bmatrix}\) has at least one positive eigenvalue.
In this case, let \(\begin{bmatrix} \mu_x^*-\hat{\mu}_x \\ \mu_w^*-\hat{\mu}_w \end{bmatrix}\) take the direction of the corresponding eigenvector, and as the norm tends to \(\infty\), the objective function value tends to \(+\infty\). On the other hand, based on (28 ), it is obvious that (29 ) holds. Thus, (27 ) is further equivalent to \[\begin{align} \sup_{\substack{\Sigma_x\succeq{\boldsymbol{0}}\\ \Sigma_w\succeq{\boldsymbol{0}}}} \inf_{(\gamma_x,\gamma_w)\in \mathcal{A}} \; &\left\langle \left( K^*\right) ^\top K^*,\Sigma_{x} \right\rangle +\gamma_x \left\{ \rho_x^2- Tr\left[\Sigma_x + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}}\Sigma_x\hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\} \\ &+\left\langle \left( A^*\right) ^\top A^*,\Sigma_w \right\rangle +\gamma_w \left\{ \rho_w^2- Tr\left[\Sigma_w + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}}\Sigma_w\hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\}. \end{align} \label{saddle32judge3}\tag{30}\] Note that the objective function of (30 ) is linear in \((\gamma_x,\gamma_w)\) for each \((\Sigma_x,\Sigma_w)\), concave in \((\Sigma_x,\Sigma_w)\) for each \((\gamma_x,\gamma_w)\), and the constraint sets \(\mathbb{S}_+^n \times \mathbb{S}_+^m\) and \(\mathcal{A}\) are convex, closed and nonempty. Moreover, the objective function of (30 ) with \(\Sigma_x=\hat{\Sigma}_x\succeq{\boldsymbol{0}}\) and \(\Sigma_w=\hat{\Sigma}_w\succeq{\boldsymbol{0}}\) tends to \(+\infty\) as \(\gamma_x\) or \(\gamma_w\) tends to \(+\infty\). On the other hand, there exists \((\bar{\gamma}_x,\bar{\gamma}_w) \in \mathcal{A}\) such that \(\left( K^*\right) ^\top K^*-\bar{\gamma}_x I_n \prec \boldsymbol{0}\) and \(\left( A^*\right) ^\top A^*-\bar{\gamma}_w I_m \prec \boldsymbol{0}\). Without loss of generality, we only consider the spectral norm below, while other norms can be handled analogously via the equivalence of norms in finite-dimensional spaces. 
Then the objective function of (30 ) with \((\bar{\gamma}_x,\bar{\gamma}_w)\) can be transformed to \[\begin{align} & -\left\langle \bar{\gamma}_x I_n-\left( K^*\right) ^\top K^*,\Sigma_{x} \right\rangle +2 \bar{\gamma}_x Tr \left[ \left( \hat{\Sigma}_x^{\frac{1}{2}}\Sigma_x\hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] +\bar{\gamma}_x \left[ \rho_x^2- Tr\left( \hat{\Sigma}_x \right) \right] \notag \\ &-\left\langle \bar{\gamma}_w I_m- \left( A^*\right) ^\top A^*,\Sigma_w \right\rangle +2 \bar{\gamma}_w Tr\left[ \left( \hat{\Sigma}_w^{\frac{1}{2}}\Sigma_w\hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] +\bar{\gamma}_w \left[ \rho_w^2- Tr\left( \hat{\Sigma}_w \right) \right] \notag \\ \leq &-\lambda_{\min}\left[ \bar{\gamma}_x I_n-\left( K^*\right) ^\top K^* \right] Tr\left( \Sigma_x\right) + 2 \bar{\gamma}_x \sum_{i=1}^n \lambda_i^\frac{1}{2} \left( \hat{\Sigma}_x \Sigma_x \right) +\bar{\gamma}_x \left[ \rho_x^2- Tr\left( \hat{\Sigma}_x \right) \right] \notag \\ &-\lambda_{\min} \left[ \bar{\gamma}_w I_m- \left( A^*\right) ^\top A^* \right] Tr\left( \Sigma_w\right) + 2 \bar{\gamma}_w \sum_{i=1}^m \lambda_i^\frac{1}{2} \left( \hat{\Sigma}_w \Sigma_w \right) +\bar{\gamma}_w \left[ \rho_w^2- Tr\left( \hat{\Sigma}_w \right) \right] \notag \\ \leq &-\lambda_{\min}\left[ \bar{\gamma}_x I_n-\left( K^*\right) ^\top K^* \right] \left\| \Sigma_x \right\| + 2 \bar{\gamma}_x n \lambda_{\max}^\frac{1}{2}\left( \hat{\Sigma}_x \right) \left\| \Sigma_x \right\|^\frac{1}{2} +\bar{\gamma}_x \left[ \rho_x^2- Tr\left( \hat{\Sigma}_x \right) \right] \notag \\ &-\lambda_{\min}\left[ \bar{\gamma}_w I_m-\left( A^*\right) ^\top A^* \right] \left\| \Sigma_w \right\| + 2 \bar{\gamma}_w m \lambda_{\max}^\frac{1}{2}\left( \hat{\Sigma}_w \right) \left\| \Sigma_w \right\|^\frac{1}{2} +\bar{\gamma}_w \left[ \rho_w^2- Tr\left( \hat{\Sigma}_w \right) \right], \notag \end{align}\] where the first inequality is due to \(Tr \left[ \left( B^\frac{1}{2} D B^\frac{1}{2}\right)^\frac{1}{2} \right]=\sum_{i=1}^n \lambda_i^\frac{1}{2}\left( BD\right)\) for \(B,D \in \mathbb{S}^n_+\) [28], and the second inequality follows from \(\lambda_i(BD) \leq \lambda_{\max}(BD) \leq \lambda_{\max}(B) \lambda_{\max}(D)\) for \(B,D \in \mathbb{S}^n_+\) [28]. Hence, the objective function of (30 ) with \((\bar{\gamma}_x,\bar{\gamma}_w)\) tends to \(-\infty\) as \(\left\| \Sigma_x \right\|\) or \(\left\| \Sigma_w \right\|\) tends to \(+\infty\), owing to the positive definiteness of \(\bar{\gamma}_x I_n-\left( K^*\right) ^\top K^*\) and \(\bar{\gamma}_w I_m-\left( A^*\right) ^\top A^*\).

Therefore, according to Theorem 10.2 in [25], problem (30 ) is equivalent to \[\begin{align} \inf_{(\gamma_x,\gamma_w) \in \mathcal{A}} \; \sup_{\substack{\Sigma_x\succeq{\boldsymbol{0}}\\ \Sigma_w\succeq{\boldsymbol{0}}}} \; &\left\langle \left( K^*\right) ^\top K^*,\Sigma_{x} \right\rangle +\gamma_x \left\{ \rho_x^2- Tr\left[\Sigma_x + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}}\Sigma_x\hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\} \\ &+\left\langle \left( A^*\right) ^\top A^*,\Sigma_w \right\rangle +\gamma_w \left\{ \rho_w^2- Tr\left[\Sigma_w + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}}\Sigma_w\hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\}. \end{align} \label{mid}\tag{31}\] Furthermore, notice that problem (31 ) is equivalent to \[\begin{align} \inf_{\substack {\gamma_x \geq 0 \\ \gamma_w \geq 0} } \sup_{\substack{\Sigma_x\succeq{\boldsymbol{0}}\\ \Sigma_w\succeq{\boldsymbol{0}}}} \sup_{\mu_x,\mu_w} \; &\left\langle \left( K^*\right) ^\top K^*,\Sigma_{x} \right\rangle +\gamma_x \left\{ \rho_x^2- Tr\left[\Sigma_x + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}}\Sigma_x\hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\} \\ &+\left\langle \left( A^*\right) ^\top A^*,\Sigma_w \right\rangle +\gamma_w \left\{ \rho_w^2- Tr\left[\Sigma_w + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}}\Sigma_w\hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\} \\ &+\begin{bmatrix} \mu_x-\hat{\mu}_x \\ \mu_w-\hat{\mu}_w \end{bmatrix}^\top \begin{bmatrix} (K^*)^\top K^*-\gamma_xI_n & (K^*)^\top A^* \\ (A^*)^\top K^* & (A^*)^\top A^* -\gamma_wI_m \end{bmatrix} \begin{bmatrix} \mu_x-\hat{\mu}_x \\ \mu_w-\hat{\mu}_w \end{bmatrix}, \end{align} \label{saddle32judge22}\tag{32}\] since when \((\gamma_x,\gamma_w) \in \mathcal{A}\), as defined in (28 ), the supremum of the quadratic term with respect to \((\mu_x,\mu_w)\) in (32 ) is zero and thus the objective function values of (31 ) and (32 ) are identical; otherwise, the supremum is \(+\infty\) as \(\mu_x\) or \(\mu_w\) tends to \(\infty\). Consequently, based on the above argument that (26 ) is equivalent to (32 ), we further have \[\sup_{\substack{\Sigma_x \succeq {\boldsymbol{0}} \\ \Sigma_w\succeq {\boldsymbol{0}}}} \sup_{\mu_x,\mu_w} \inf_{\substack {\gamma_x \geq 0 \\ \gamma_w \geq 0}} L(\mu_x,\mu_w,\Sigma_x,\Sigma_w;\gamma_x,\gamma_w)= \inf_{\substack {\gamma_x \geq 0 \\ \gamma_w \geq 0} } \sup_{\substack{\Sigma_x \succeq {\boldsymbol{0}} \\ \Sigma_w\succeq {\boldsymbol{0}}}} \sup_{\mu_x,\mu_w} L(\mu_x,\mu_w,\Sigma_x,\Sigma_w;\gamma_x,\gamma_w),\] where the left side of the above equality is (26 ) and the right side is (32 ) by the definition of \(L\) in (23 ). Then for the optimal solution \((\hat{\mu}_x,\hat{\mu}_w,\Sigma_x^*,\Sigma_w^*)\) to (22 ), there exists an optimal dual solution \((\gamma^*_x\), \(\gamma^*_w) \in \mathcal{A}\) such that for any \(\mu_x\), \(\mu_w\), \(\Sigma_x \succeq {\boldsymbol{0}}\), \(\Sigma_w \succeq {\boldsymbol{0}}\), \(\gamma_x \geq 0\) and \(\gamma_w \geq 0\), the Lagrangian function of (22 ), defined as \(L\) in (23 ), satisfies \[L(\mu_x,\mu_w,\Sigma_x,\Sigma_w;\gamma^*_x,\gamma^*_w) \leq L(\hat{\mu}_x,\hat{\mu}_w,\Sigma_x^*,\Sigma_w^*; \gamma^*_x,\gamma^*_w) \leq L(\hat{\mu}_x,\hat{\mu}_w,\Sigma_x^*,\Sigma_w^*;\gamma_x,\gamma_w). 
\label{sd32perd}\tag{33}\] Then for any \(\Sigma_x \succeq {\boldsymbol{0}}\), \(\Sigma_w \succeq {\boldsymbol{0}}\), \(A\), \(\gamma_x \geq 0\) and \(\gamma_w \geq 0\), the objective function of (17 ), defined as \(G\) in (21 ), satisfies \[\begin{align} &G(\Sigma_x,\Sigma_w,A^*,\gamma^*_x,\gamma^*_w)= L(\hat{\mu}_x,\hat{\mu}_w,\Sigma_x,\Sigma_w;\gamma^*_x,\gamma^*_w)\\ &\leq L(\hat{\mu}_x,\hat{\mu}_w,\Sigma_x^*,\Sigma_w^*; \gamma^*_x,\gamma^*_w)= G(\Sigma_x^*,\Sigma_w^*,A^*,\gamma^*_x,\gamma^*_w)\\ &\leq L(\hat{\mu}_x,\hat{\mu}_w,\Sigma_x^*,\Sigma_w^*;\gamma_x,\gamma_w) =G(\Sigma_x^*,\Sigma_w^*,A^*,\gamma_x,\gamma_w) \leq G(\Sigma_x^*,\Sigma_w^*,A,\gamma_x,\gamma_w), \end{align}\] where the first inequality follows from the first inequality in (33 ) for \(\mu_x=\hat{\mu}_x\) and \(\mu_w=\hat{\mu}_w\); the second inequality arises from the second inequality in (33 ); the last inequality is due to the fact that \((\Sigma_x^*,\Sigma_w^*,A^*)\) is an optimal solution of (15 ) and all equalities follow from (24 ). Therefore, \((\Sigma_x^*,\Sigma_w^*,A^*,\gamma_x^*,\gamma_w^*)\) is a saddle point solution of (17 ) such that \(\begin{bmatrix} (K^*)^\top K^*-\gamma^*_xI_n & (K^*)^\top A^* \\ (A^*)^\top K^* & (A^*)^\top A^* -\gamma^*_wI_m \end{bmatrix} \preceq {\boldsymbol{0}}\). ◻

Nevertheless, the condition presented in Lemma 4 is not easily checked in certain cases. Specifically, if the matrix \(H\Sigma_x^* H^\top + \Sigma_w^*\) is not positive definite, the uniqueness of the optimal solution \(A^*\) for (14 ) is no longer guaranteed, and it becomes challenging to identify which solution can form a saddle point solution for (17 ). However, Section 4 shows that in this case the existence of a saddle point can still be determined numerically by solving two SDPs. Therefore, we shall only focus on a simplified case under a mild assumption below.

Assumption 2.

i) The Wasserstein radii \(\rho_x>0\) and \(\rho_w>0\).

ii) The nominal covariance matrices \(\hat{\Sigma}_x\) and \(\hat{\Sigma}_w\) are positive definite.

Refocusing on problem (14 ), it is well known that for fixed \(\Sigma_x\) and \(\Sigma_w\), the optimal solution to the inner minimization problem is given by \(A^* = \Sigma_x H^\top (H \Sigma_x H^\top + \Sigma_w)^{-}\). By substituting \(A^*\) into (14 ), we can derive its equivalent problem as follows: \[\begin{array}{cl} \sup \limits_{\Sigma_{x},\Sigma_w} & Tr\left[\Sigma_x - \Sigma_x H^\top \left( H \Sigma_x H^\top + \Sigma_w \right)^{-} H \Sigma_x\right] \\ s.t. & \Sigma_x \succeq{\boldsymbol{0}}, \quad \Sigma_w \succeq {\boldsymbol{0}}, \\& Tr\left[\Sigma_x + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}}\Sigma_x\hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \leq \rho_x^2, \\[0.5ex] & Tr\left[\Sigma_w + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}}\Sigma_w\hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \leq \rho_w^2. \\[0.5ex] \end{array} \label{maxmin}\tag{34}\]

The objective function of (34 ), i.e., the inner minimization problem of (14 ) \[\inf_A \quad \left\langle \left(I_n-AH\right)^\top \left(I_n-AH\right),\Sigma_{x} \right\rangle +\left\langle A^\top A,\Sigma_w \right\rangle\] is the pointwise infimum of a family of functions that are linear in \((\Sigma_x,\Sigma_w)\), which implies that it is concave in \((\Sigma_x,\Sigma_w)\) [29]. Consequently, (34 ) is a convex optimization problem in which strong duality holds under the Slater condition. As detailed in Corollary 3.1 of [12], problem (34 ) can be transformed into the following SDP problem \[\begin{align} \sup \limits_{\Sigma_x,\Sigma_w,V_x,V_w,U} &Tr(\Sigma_x)-Tr(U) \\ s.t.\;\;\;\; &\Sigma_x \succeq {\boldsymbol{0}}, \Sigma_w \succeq {\boldsymbol{0}}, \;V_x \succeq{\boldsymbol{0}}, \;V_w \succeq{\boldsymbol{0}},\\ &\left[\begin{array} {cc} \hat{\Sigma}_x^{\frac{1}{2}} \Sigma_x \hat{\Sigma}_x^{\frac{1}{2}} & V_x \\ V_x &I_n \end{array} \right]\succeq {\boldsymbol{0}}, \quad \left[\begin{array} {cc} \hat{\Sigma}_w^{\frac{1}{2}} \Sigma_w \hat{\Sigma}_w^{\frac{1}{2}} & V_w \\ V_w &I_m \end{array} \right]\succeq {\boldsymbol{0}},\\ &Tr\left( \Sigma_x+\hat{\Sigma}_x-2V_x \right) \leq \rho_x^2, \quad Tr\left( \Sigma_w+\hat{\Sigma}_w-2V_w \right) \leq \rho_w^2,\\ &\begin{bmatrix} U & \Sigma_x H^\top \\ H \Sigma_x & H \Sigma_x H^\top+\Sigma_w \end{bmatrix} \succeq {\boldsymbol{0}}. \end{align} \label{maxmin32SDP}\tag{35}\] In the following theorem, we will demonstrate that under a mild condition, the condition in Lemma 4 can be simplified, with all parameters determined through the convex problem (34 ) and its dual.
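
For concreteness, the following is a minimal sketch of how the SDP (35 ) could be set up in CVX (MATLAB), the toolchain used in Section 5. The names H, hSx, hSw, rho_x, rho_w, n and m (observation matrix, nominal covariances, Wasserstein radii and dimensions) are our own placeholders rather than notation fixed by the paper.

```matlab
% Sketch: solving the SDP (35) with CVX (MATLAB).
% Assumed data: H (m x n), hSx (n x n), hSw (m x m), rho_x, rho_w.
Sx_half = sqrtm(hSx);                 % \hat{\Sigma}_x^{1/2}
Sw_half = sqrtm(hSw);                 % \hat{\Sigma}_w^{1/2}
cvx_begin sdp quiet
    variable Sx(n,n) symmetric
    variable Sw(m,m) symmetric
    variable Vx(n,n) symmetric
    variable Vw(m,m) symmetric
    variable U(n,n) symmetric
    maximize( trace(Sx) - trace(U) )
    subject to
        Sx >= 0; Sw >= 0; Vx >= 0; Vw >= 0;
        [Sx_half*Sx*Sx_half, Vx; Vx, eye(n)] >= 0;      % auxiliary LMI for V_x
        [Sw_half*Sw*Sw_half, Vw; Vw, eye(m)] >= 0;      % auxiliary LMI for V_w
        trace(Sx + hSx - 2*Vx) <= rho_x^2;              % Wasserstein budget on x
        trace(Sw + hSw - 2*Vw) <= rho_w^2;              % Wasserstein budget on w
        [U, Sx*H'; H*Sx, H*Sx*H' + Sw] >= 0;            % Schur-complement LMI
cvx_end
```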

Theorem 2. Suppose that Assumption 2 holds and problem (34 ) has a primal and dual optimal solution pair \((\Sigma_x^*, \Sigma_w^*,\gamma_x^*, \gamma_w^*)\). Define \(A^* \triangleq \Sigma_x^* H^\top \left( H \Sigma_x^* H^\top+\Sigma_w^*\right) ^{-1}\) and \(K^* \triangleq I_n - A^* H\). Then the saddle point solution of (4 ) exists if and only if \(\begin{bmatrix} (K^*)^\top K^* - \gamma_x^*I_n & (K^*)^\top A^* \\ (A^*)^\top K^* & (A^*)^\top A^* - \gamma_w^*I_m \end{bmatrix} \preceq {\boldsymbol{0}}\).

Proof. Under the assumption that the nominal covariance matrices \(\hat{\Sigma}_x\) and \(\hat{\Sigma}_w\) are positive definite, it follows from Theorem 3.1 in [12] that (34 ) admits an optimal solution \((\Sigma_x^*,\Sigma_w^*)\) with \[\Sigma_x^* \succeq \lambda_{\min}\left( \hat{\Sigma}_x\right) I_n \succ {\boldsymbol{0}} \label{pdsx}\tag{36}\] and \[\Sigma_w^* \succeq \lambda_{\min}\left( \hat{\Sigma}_w\right) I_m \succ {\boldsymbol{0}}, \label{pdsw}\tag{37}\] which implies that \(H \Sigma_x^* H^\top+\Sigma_w^* \succ {\boldsymbol{0}}\). Then \(A^*=\Sigma_x^* H^\top \left( H \Sigma_x^* H^\top+\Sigma_w^*\right) ^{-1}\) is well-defined.

Subsequently, we shall demonstrate that \((\Sigma_x^*, \Sigma_w^*, A^*,\gamma_x^*, \gamma_w^*)\) is a saddle point solution of (17 ). We denote the Lagrangian function of (34 ) as \[\begin{align} g(\Sigma_x,\Sigma_w;\gamma_x,\gamma_w) \triangleq & Tr\left[\Sigma_x - \Sigma_x H^\top \left( H \Sigma_x H^\top + \Sigma_w \right)^{-1} H \Sigma_x\right] \notag \\ &+\gamma_x \left\{ \rho_x^2- Tr\left[\Sigma_x + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}}\Sigma_x\hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\} +\gamma_w \left\{ \rho_w^2- Tr\left[\Sigma_w + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}}\Sigma_w\hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\} \notag \\ =& \inf_A \;\left\langle \left(I_n-AH \right) ^\top \left(I_n-AH \right) ,\Sigma_{x} \right\rangle +\left\langle A^\top A,\Sigma_w \right\rangle \notag \\ &+\gamma_x \left\{ \rho_x^2- Tr\left[\Sigma_x + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}}\Sigma_x\hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\} +\gamma_w \left\{ \rho_w^2- Tr\left[\Sigma_w + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}}\Sigma_w\hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\} \notag \\ =& \inf_A \;G(\Sigma_x,\Sigma_w, A,\gamma_x,\gamma_w), \label{regG} \end{align}\tag{38}\] where \(G\) is the objective function of (17 ) defined in (21 ). According to the assumption that (34 ) has a primal and dual optimal solution pair \((\Sigma_x^*, \Sigma_w^*,\gamma_x^*, \gamma_w^*)\), we have [26] \[g(\Sigma_x^*,\Sigma_w^*;\gamma_x^*,\gamma_w^*)=\max_{\Sigma_x \succeq {\boldsymbol{0}},\Sigma_w \succeq {\boldsymbol{0}}} g(\Sigma_x,\Sigma_w;\gamma_x^*,\gamma_w^*) \label{gmmin}\tag{39}\] and the following complementary slackness conditions \[\begin{align} &\gamma_x^* \left\lbrace \rho_x^2-Tr\left[\Sigma_x^* + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}} \Sigma_x^* \hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\rbrace =0, \\ &\gamma_w^* \left\lbrace \rho_w^2-Tr\left[\Sigma_w^* + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}} \Sigma_w^* \hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right] \right\rbrace =0. \end{align} \label{cs2}\tag{40}\] Moreover, for the optimal solution \((\Sigma_x^*, \Sigma_w^*)\) to (34 ), \(A^*=\Sigma_x^* H^\top \left( H \Sigma_x^* H^\top+\Sigma_w^*\right) ^{-1}\) is the unique minimizer of the inner minimization problem in (14 ), which implies that \((\Sigma_x^*,\Sigma_w^*,A^*)\) is a saddle point solution of (14 ). Combined with the equality of the optimal values of (14 ) and (17 ) and the complementary slackness conditions (40 ), it follows that \((\Sigma_x^*, \Sigma_w^*, A^*,\gamma_x^*, \gamma_w^*)\) is an optimal solution of (17 ).

On the other hand, for given \((A^*,\gamma_x^*, \gamma_w^*)\), we obtain \[\bigtriangledown_{\Sigma_x} G(\Sigma_x^*,\Sigma_w^*,A^*,\gamma_x^*,\gamma_w^*)= \bigtriangledown_{\Sigma_x} g(\Sigma_x^*,\Sigma_w^*;\gamma_x^*,\gamma_w^*)=\boldsymbol{0},\] where the first equality is due to the relationship between \(g\) and \(G\) given by (38 ), Danskin’s theorem [26] and the uniqueness of the minimizer \(A^*\), and the second equality follows from (39 ) and the positive definiteness of \(\Sigma_x^*\) given by (36 ). Similarly, we have \[\bigtriangledown_{\Sigma_w} G(\Sigma_x^*,\Sigma_w^*,A^*,\gamma_x^*,\gamma_w^*)= \bigtriangledown_{\Sigma_w} g(\Sigma_x^*,\Sigma_w^*;\gamma_x^*,\gamma_w^*)=\boldsymbol{0}.\] Therefore, since \(G(\Sigma_x,\Sigma_w,A^*,\gamma_x^*,\gamma_w^*)\) is concave in \((\Sigma_x,\Sigma_w)\), the stationary point \((\Sigma_x^*,\Sigma_w^*)\) is its maximizer, which implies that \((\Sigma_x^*, \Sigma_w^*, A^*,\gamma_x^*, \gamma_w^*)\) is a saddle point solution of (17 ).

Finally, we prove by contradiction that (17 ) has a unique saddle point solution. If \((\bar{\Sigma}_x^*, \bar{\Sigma}_w^*, \bar{A}^*, \bar{\gamma}_x^*, \bar{\gamma}_w^*)\) is another saddle point solution of (17 ), then \((\Sigma_x^*, \Sigma_w^*, \bar{A}^*, \bar{\gamma}_x^*, \bar{\gamma}_w^*)\) is also a saddle point solution of (17 ) [22]. Then the uniqueness of the minimizer \(A^*\) for \(G(\Sigma_x^*,\Sigma_w^*, A,\gamma_x,\gamma_w)\) gives rise to \(\bar{A}^*=A^*\). Furthermore, for given \(A^*\), it follows from Proposition A.2 in [12] that there exist a unique minimizer \((\gamma_x^*,\gamma_w^*)\) and a unique optimal solution \((\Sigma_x^*, \Sigma_w^*)\) to the inner maximization problem in (20 ). This contradicts the assumption that \((\bar{\Sigma}_x^*, \bar{\Sigma}_w^*, \bar{\gamma}_x^*, \bar{\gamma}_w^*)\) is also an optimal solution to the inner minimax problem in (20 ) for given \(\bar{A}^*=A^*\).

Consequently, the primal and dual optimal solution pair \((\Sigma_x^*, \Sigma_w^*,\gamma_x^*, \gamma_w^*)\) of (34 ) and the corresponding \(A^*\) yield the unique saddle point solution of (17 ). Hence, it follows directly from Lemma 4 that the saddle point solution of (4 ) exists if and only if \(\begin{bmatrix} (K^*)^\top K^* - \gamma_x^*I_n & (K^*)^\top A^* \\ (A^*)^\top K^* & (A^*)^\top A^* - \gamma_w^*I_m \end{bmatrix} \preceq {\boldsymbol{0}}\). ◻
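
As a practical illustration (not part of the original derivation), the condition of Theorem 2 can be checked numerically once the primal optimum of (34 ) and the corresponding dual multipliers are available, e.g., from (35 ) and its dual. The MATLAB sketch below assumes these quantities are stored in Sx_opt, Sw_opt, gx and gw; these names, and the tolerance, are our own choices.

```matlab
% Sketch: checking the condition of Theorem 2, assuming the primal optimum
% (Sx_opt, Sw_opt) of (34) and the dual multipliers (gx, gw) are given.
A_star = (Sx_opt*H') / (H*Sx_opt*H' + Sw_opt);   % A^* = Sx* H' (H Sx* H' + Sw*)^{-1}
K_star = eye(n) - A_star*H;                      % K^* = I_n - A^* H
M = [K_star'*K_star - gx*eye(n), K_star'*A_star; ...
     A_star'*K_star,             A_star'*A_star - gw*eye(m)];
tol = 1e-8;                                      % numerical tolerance
if max(eig((M + M')/2)) <= tol
    disp('Condition of Theorem 2 holds: the saddle point of (4) exists.');
else
    disp('Condition of Theorem 2 fails: the saddle point of (4) does not exist.');
end
```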

3.3 A Sufficient Condition for the Existence of the Saddle Point Solution for (4 )

Section 3.2 establishes a necessary and sufficient condition for the existence of the saddle point solution in (4 ). However, verifying this condition requires computing both the primal and dual optimal solutions of (34 ), which may be computationally expensive for large-scale problems. In this section, we present a more direct and easily verifiable sufficient condition for the existence of a saddle point solution to (4 ), which intuitively indicates that when the Wasserstein radii \(\rho_x\) and \(\rho_w\) are small enough, the saddle point always exists.

Theorem 3. Suppose that Assumption 2 holds. If the Wasserstein radii \(\rho_x\) and \(\rho_w\) satisfy the inequality \(\rho_x \rho_w \leq \lambda_{\min}^{\frac{1}{2}}(\hat{\Sigma}_x) \lambda_{\min}^{\frac{1}{2}}(\hat{\Sigma}_w)\), problem (4 ) has a saddle point solution.

Proof. Under the assumption that the nominal covariance matrices \(\hat{\Sigma}_x\) and \(\hat{\Sigma}_w\) are both positive definite, we obtain from the proof of Theorem 2 that (17 ) admits a unique saddle point solution, denoted by \((\Sigma_x^*, \Sigma_w^*, A^*,\gamma_x^*, \gamma_w^*)\). This solution is also optimal for (20 ). Then it follows from Proposition A.2 in [12] that for given \(A^*\) and \(K^*=I_n-A^*H\), the unique minimizer \((\gamma_x^*,\gamma_w^*)\) satisfies \(\gamma_x^* > \lambda_{\max}\left( \left( K^*\right) ^\top K^* \right)\) and \(\gamma_w^* > \lambda_{\max}\left( \left(A^*\right) ^\top A^* \right)\), and the unique optimal solution \((\Sigma_x^*, \Sigma_w^*)\) to the inner maximization problem in (20 ) is given by \[\Sigma_x^*=\left( \gamma_x^*\right) ^2 \left[ \gamma_x^*I_n-\left( K^*\right) ^\top K^*\right] ^{-1} \hat{\Sigma}_x \left[ \gamma_x^* I_n-\left( K^*\right) ^\top K^*\right] ^{-1} \label{opsx}\tag{41}\] and \[\Sigma_w^*=\left( \gamma_w^*\right) ^2 \left[ \gamma_w^*I_m-\left( A^*\right) ^\top A^*\right] ^{-1} \hat{\Sigma}_w \left[ \gamma_w^* I_m-\left( A^*\right) ^\top A^*\right] ^{-1} .\label{opsw}\tag{42}\]

Due to the strict positivity of \(\gamma_x^*\) and \(\gamma_w^*\), the Wasserstein distance constraints with respect to \((\Sigma_x,\Sigma_w)\) are active. Then we have \[\rho_x^2-Tr\left[\Sigma_x^* + \hat{\Sigma}_x - 2 \left( \hat{\Sigma}_x^{\frac{1}{2}} \Sigma_x^* \hat{\Sigma}_x^{\frac{1}{2}} \right)^{\frac{1}{2}}\right]=0 \label{bound1}\tag{43}\] and \[\rho_w^2-Tr\left[\Sigma_w^* + \hat{\Sigma}_w - 2 \left( \hat{\Sigma}_w^{\frac{1}{2}} \Sigma_w^* \hat{\Sigma}_w^{\frac{1}{2}} \right)^{\frac{1}{2}}\right]=0. \label{bound2}\tag{44}\] Substituting (41 ) into (43 ), we obtain \[\rho_x^2-Tr\left( \hat{\Sigma}_x\right) +2\gamma_x^* Tr \left\lbrace \left[ \gamma_x^*I_n-\left( K^*\right) ^\top K^*\right] ^{-1} \hat{\Sigma}_x\right\rbrace -(\gamma_x^*)^2 Tr \left\lbrace \left[ \gamma_x^*I_n-\left( K^*\right) ^\top K^*\right] ^{-1} \hat{\Sigma}_x \left[ \gamma_x^*I_n-\left( K^*\right) ^\top K^*\right] ^{-1} \right\rbrace =0.\] Rearranging all the terms, we derive \[\begin{align} &Tr \left\lbrace \left[ \gamma_x^*I_n-\left( K^*\right) ^\top K^*\right] ^{-1}\left[ \left( \gamma_x^*\right) ^2 \hat{\Sigma}_x -\gamma_x^* \hat{\Sigma}_x \left( \gamma_x^*I_n-\left( K^*\right) ^\top K^*\right) -\gamma_x^* \left( \gamma_x^*I_n-\left( K^*\right) ^\top K^*\right) \hat{\Sigma}_x \right]\left[ \gamma_x^*I_n-\left( K^*\right) ^\top K^*\right] ^{-1} \right\rbrace \\&=\rho_x^2-Tr\left( \hat{\Sigma}_x\right). \end{align}\] By expanding the brackets in the middle terms of the product under the trace and combining like terms, we get \[Tr \left\lbrace \left[ \gamma_x^*I_n-\left( K^*\right) ^\top K^*\right] ^{-1} \left[ -\left( \gamma_x^*\right) ^2 \hat{\Sigma}_x +\gamma_x^* \hat{\Sigma}_x \left( K^* \right) ^\top K^* +\gamma_x^* \left( K^*\right) ^\top K^* \hat{\Sigma}_x \right] \left[ \gamma_x^*I_n-\left( K^*\right) ^\top K^*\right] ^{-1} \right\rbrace =\rho_x^2-Tr\left( \hat{\Sigma}_x\right) .\] Furthermore, completing the square for the middle terms in the product under the trace gives \[\begin{align} &Tr \left\lbrace \left[ \gamma_x^*I_n-\left( K^*\right)^\top K^*\right] ^{-1}\left\lbrace-\left[ \gamma_x^*I_n - \left( K^*\right) ^\top K^*\right]\hat{\Sigma}_x\left[ \gamma_x^*I_n - \left( K^*\right) ^\top K^*\right] + \left( K^*\right) ^\top K^* \hat{\Sigma}_x (K^*)^\top K^* \right\rbrace\left[ \gamma_x^*I_n - \left( K^*\right) ^\top K^*\right] ^{-1} \right\rbrace \notag \\ &= \rho_x^2-Tr\left( \hat{\Sigma}_x\right) . \end{align}\] Eliminating \(-Tr(\hat{\Sigma}_x)\) which appears on both sides of the equation, we obtain \[Tr \left\lbrace \left[ \gamma_x^*I_n-\left( K^*\right) ^\top K^*\right] ^{-1} \left[ \left( K^*\right) ^\top K^* \hat{\Sigma}_x \left( K^*\right) ^\top K^* \right] \left[ \gamma_x^*I_n-\left( K^*\right) ^\top K^*\right] ^{-1} \right\rbrace =\rho_x^2.\] Upon dividing both sides by \(\rho_x^2\), we have \[Tr \left\lbrace \left[ \gamma_x^*I_n-\left( K^*\right) ^\top K^*\right] ^{-1} \left[ \left( K^*\right) ^\top K^* \frac{\hat{\Sigma}_x}{\rho_x^2} \left( K^*\right) ^\top K^* \right] \left[ \gamma_x^* I_n-\left( K^*\right) ^\top K^*\right] ^{-1} \right\rbrace =1.\] Then the matrix under the trace operation is positive semidefinite and the sum of its eigenvalues equals one. Consequently, all its eigenvalues are not greater than one, which leads to the conclusion that \[\left[ \gamma_x^*I_n-\left( K^*\right) ^\top K^*\right] ^{-1} \left[ \left( K^*\right) ^\top K^* \frac{\hat{\Sigma}_x}{\rho_x^2} \left( K^*\right) ^\top K^* \right] \left[ \gamma_x^* I_n-\left( K^*\right) ^\top K^*\right] ^{-1} \preceq I_n.
\label{loose1}\tag{45}\] Pre- and post-multiplying both sides by \(\gamma_x^*I_n-(K^*)^\top K^*\) gives rise to \[\left[ \gamma_x^*I_n-\left( K^*\right) ^\top K^*\right] ^{2} \succeq \left( K^*\right) ^\top K^* \frac{\hat{\Sigma}_x}{\rho_x^2} \left( K^*\right) ^\top K^*.\] Due to \(\gamma_x^*>\lambda_{\max}\left[ \left( K^*\right) ^\top K^* \right]\), the matrix \(\gamma_x^*I_n-\left( K^*\right) ^\top K^*\) is positive definite. According to the Löwner-Heinz inequality [30], it follows that \[\gamma_x^*I_n-\left( K^*\right) ^\top K^* \succeq \left[ \left( K^*\right) ^\top K^* \frac{\hat{\Sigma}_x}{\rho_x^2} \left( K^*\right) ^\top K^* \right] ^\frac{1}{2}.\] Combining with \(\hat{\Sigma}_x \succeq \lambda_{\min}\left( \hat{\Sigma}_x\right) I_n\), we further have \[\gamma_x^*I_n-\left( K^*\right) ^\top K^* \succeq \frac{\lambda_{\min}^{\frac{1}{2}}\left( \hat{\Sigma}_x\right) }{\rho_x} \left( K^*\right) ^\top K^*. \label{loose2}\tag{46}\]

Similarly, based on (42 ) and (44 ), we derive \[\gamma_w^*I_m-\left( A^*\right) ^\top A^* \succeq \frac{\lambda_{\min}^{\frac{1}{2}}\left( \hat{\Sigma}_w\right) }{\rho_w} \left( A^*\right) ^\top A^*. \label{loosew}\tag{47}\]

Subsequently, by utilizing equations (46 ) and (47 ), we deduce that \[\begin{align} \begin{bmatrix} \left( K^*\right)^\top K^*-\gamma_x^*I_n & \left( K^*\right) ^\top A^* \\ \left( A^*\right) ^\top K^* & \left( A^*\right) ^\top A^* -\gamma_w^*I_m \end{bmatrix} & \preceq \begin{bmatrix} -\frac{\lambda_{\min}^{\frac{1}{2}}\left( \hat{\Sigma}_x\right) }{\rho_x} \left( K^*\right) ^\top K^* & \left( K^*\right)^\top A^* \\ \left(A^*\right)^\top K^* & - \frac{\lambda_{\min}^{\frac{1}{2}}\left( \hat{\Sigma}_w\right) }{\rho_w} \left( A^*\right) ^\top A^* \end{bmatrix}\\ &=\begin{bmatrix} K^* & {\boldsymbol{0}} \\ {\boldsymbol{0}} & A^* \end{bmatrix} ^\top \begin{bmatrix} -\frac{\lambda_{\min}^{\frac{1}{2}}\left( \hat{\Sigma}_x\right) }{\rho_x}I_n &I_n \\I_n & - \frac{\lambda_{\min}^{\frac{1}{2}}\left( \hat{\Sigma}_w\right) }{\rho_w}I_n \end{bmatrix} \begin{bmatrix} K^* & {\boldsymbol{0}} \\ {\boldsymbol{0}} & A^* \end{bmatrix}. \end{align}\] Here, the matrix \(\begin{bmatrix} -\frac{\lambda_{\min}^{\frac{1}{2}}\left( \hat{\Sigma}_x\right) }{\rho_x}I_n &I_n \\I_n & - \frac{\lambda_{\min}^{\frac{1}{2}}\left( \hat{\Sigma}_w\right) }{\rho_w}I_n \end{bmatrix}\) is negative semidefinite if and only if its Schur complement \(-\frac{\lambda_{\min}^{\frac{1}{2}}\left( \hat{\Sigma}_x\right) }{\rho_x}I_n+\frac{\rho_w}{\lambda_{\min}^{\frac{1}{2}}\left( \hat{\Sigma}_w\right) }I_n\) is negative semidefinite [31]. Consequently, if \[\rho_x \rho_w \leq \lambda_{\min}^{\frac{1}{2}}\left( \hat{\Sigma}_x\right) \lambda_{\min}^{\frac{1}{2}}\left( \hat{\Sigma}_w\right) ,\] the matrix \(\begin{bmatrix} \left(K^*\right) ^\top K^*-\gamma_x^*I_n & \left(K^*\right) ^\top A^* \\ \left(A^*\right) ^\top K^* & \left(A^*\right)^\top A^* -\gamma_w^*I_m \end{bmatrix}\) is negative semidefinite, which implies that the saddle point solution of (4 ) exists by Lemma 4. ◻

Remark 3. Based on the aforementioned sufficient condition, we can easily obtain the following conclusions:

  • In the scalar case, i.e., when both the parameter \(x\) and the observation \(y\) are scalars, the relaxations (45 ) and (46 ) are tight. This indicates that Theorem 3 provides a necessary and sufficient condition for the existence of the saddle point solution for (4 ).

  • The sufficient condition given in Theorem 3 constrains the product of the two Wasserstein radii, which implies that when one is sufficiently small, the other can be larger. Furthermore, if one of the Wasserstein radii is zero, i.e., \(\rho_x = 0\) or \(\rho_w = 0\), a saddle point solution to (4 ) always exists, regardless of the size of the other radius.

4 The Robust Linear Estimator

In Section 3.1, Theorem 1 shows that the absence of the saddle point solution in the finite-dimensional optimization problem (7 ) implies the absence of the saddle point solution in the infinite-dimensional optimization problem (4 ). In this case, although the exact optimal solution of (4 ) cannot be obtained by solving (7 ), the optimal value of (7 ) still provides an upper bound on that of (4 ), as detailed in Lemma 2. Furthermore, due to (10 ), the optimal solution to (7 ) is also an optimal solution to \[\label{infLsup} \inf_{f \in \mathcal{F_L}} \sup_{P \in \mathbb{B}(\hat{P})} mse(f,P),\tag{48}\] which indicates that this solution yields an optimal robust estimator within the class of linear estimators.

Consequently, the focus of this section is on problem (7 ). Specifically, we demonstrate that (7 ) admits a tight convex relaxation and is thus equivalent to an SDP problem. Furthermore, based on the primal and dual optimal solutions of this SDP problem, we construct an optimal solution to (7 ), which is also an optimal solution to (48 ) and thus provides a robust linear estimator.

4.1 A Tight Convex Relaxation of (7 )

In light of the above discussion, we now focus on (7 ). Analogous to parameterizing (12 ) into (13 ) in Section 3.2, problem (7 ) can be parameterized into a finite-dimensional form as follows: \[\begin{align} \inf_{A,b} \sup_{\mu_x,\mu_w,\Sigma_x,\Sigma_w} &Tr\left[ \left( AH-I_n\right) \Sigma_x\left( AH-I_n\right) ^\top+A\Sigma_w A^\top \right]+\left[ \left( AH-I_n\right) \mu_x+A\mu_w+b\right] ^\top \left[ \left( AH-I_n\right) \mu_x+A\mu_w+b \right] \notag \\ s.t.\; & \Sigma_x \succeq{\boldsymbol{0}}, \quad \Sigma_w \succeq {\boldsymbol{0}}, \notag \\ &\left\| \mu_x-\hat{\mu}_x \right\| ^2+Tr \left[ \Sigma_x+\hat{\Sigma}_x- 2 \left( \hat{\Sigma}_x^{\frac{1}{2}} \Sigma_x \hat{\Sigma}_x^{\frac{1}{2}}\right) ^{\frac{1}{2}} \right] \leq \rho_x^2, \label{minmaxGG} \\ &\left\| \mu_w-\hat{\mu}_w\right\| ^2+Tr \left[ \Sigma_w+\hat{\Sigma}_w- 2 \left( \hat{\Sigma}_w^{\frac{1}{2}} \Sigma_w \hat{\Sigma}_w^{\frac{1}{2}}\right) ^{\frac{1}{2}} \right] \leq \rho_w^2. \notag \end{align}\tag{49}\] For given \(A\) and \(b\), the objective function of (49 ) is convex in \((\mu_x,\mu_w)\). Consequently, the inner maximization problem over \((\mu_x,\mu_w)\) is non-convex, which makes it challenging to identify the optimal solution. For ease of notation, we denote \(\tilde{\mu}_x \triangleq \mu_x-\hat{\mu}_x\), \(\tilde{\mu}_w \triangleq \mu_w-\hat{\mu}_w\) and \(\tilde{b} \triangleq b+(AH-I_n)\hat{\mu}_x+A\hat{\mu}_w\). Then problem (49 ) can be reformulated as follows \[\begin{align} \inf_{A,\tilde{b}} \; \sup_{\tilde{\mu}_x,\tilde{\mu}_w,\Sigma_x,\Sigma_w} &Tr\left[ \left( AH-I_n\right) \Sigma_x\left( AH-I_n\right) ^\top+A\Sigma_w A^\top \right] +\left[\left( AH-I_n\right) \tilde{\mu}_x+A\tilde{\mu}_w+\tilde{b}\right] ^\top \left[\left( AH-I_n \right) \tilde{\mu}_x+A\tilde{\mu}_w+\tilde{b}\right] \notag \\ s.t.\; & \Sigma_x \succeq{\boldsymbol{0}}, \quad \Sigma_w \succeq {\boldsymbol{0}}, \notag \\ &\left\| \tilde{\mu}_x\right\| ^2+Tr \left[ \Sigma_x+\hat{\Sigma}_x- 2 \left( \hat{\Sigma}_x^{\frac{1}{2}} \Sigma_x \hat{\Sigma}_x^{\frac{1}{2}}\right) ^{\frac{1}{2}} \right] \leq \rho_x^2, \label{3minimax} \\ &\left\| \tilde{\mu}_w\right\| ^2+Tr \left[ \Sigma_w+\hat{\Sigma}_w- 2 \left( \hat{\Sigma}_w^{\frac{1}{2}} \Sigma_w \hat{\Sigma}_w^{\frac{1}{2}}\right) ^{\frac{1}{2}} \right] \leq \rho_w^2. \notag \end{align}\tag{50}\] Notice that for given \(A\), \(\tilde{b}\), \(\Sigma_x\) and \(\Sigma_w\), the inner maximization problem over \((\tilde{\mu}_x,\tilde{\mu}_w)\) in (50 ) is a nonhomogeneous QCQP problem with two homogeneous constraints. If we directly apply the SDP relaxation method, we cannot assert that the relaxed problem is equivalent to problem (50 ) [27], [32]. To address this difficulty, we first establish a structural property of the optimal solution to (50 ) in the following theorem.

Theorem 4. If \((A^*,\tilde{b}^*,\tilde{\mu}^*_x,\tilde{\mu}^*_w,\Sigma^*_x,\Sigma^*_w)\) is an optimal solution to (50 ), then \(\tilde{b}^*={\boldsymbol{0}}\).

Proof. For notational simplicity, we denote the objective function of (50 ) as \[\begin{align} \psi(A,\tilde{b},\tilde{\mu}_x,\tilde{\mu}_w,\Sigma_x,\Sigma_w) =&Tr\left[ \left( AH-I_n\right) \Sigma_x\left( AH-I_n\right) ^\top+A\Sigma_w A^\top \right]+ \\ &\left[\left( AH-I_n\right) \tilde{\mu}_x+A\tilde{\mu}_w\right] ^\top \left[\left( AH-I_n\right) \tilde{\mu}_x+A\tilde{\mu}_w\right]+ 2\left[\left( AH-I_n\right) \tilde{\mu}_x+A\tilde{\mu}_w\right] ^\top \tilde{b} +\tilde{b}^\top \tilde{b}. \end{align}\] In addition, for given \(A\) and \(\tilde{b}\), the optimal solution to the inner maximization problem in (50 ) is denoted as \((\tilde{\mu}_x^*(A,\tilde{b}),\tilde{\mu}_w^*(A,\tilde{b}),\Sigma_x^*(A,\tilde{b}),\Sigma_w^*(A,\tilde{b}))\). Then for any \(A\) and \(\tilde{b} \neq {\boldsymbol{0}}\), we have \[\begin{align} &\psi(A,{\boldsymbol{0}},\tilde{\mu}_x^*(A,{\boldsymbol{0}}),\tilde{\mu}_w^*(A,{\boldsymbol{0}}),\Sigma_x^*(A,{\boldsymbol{0}}),\Sigma_w^*(A,{\boldsymbol{0}})) \\ &< \max \left\lbrace \psi(A,\tilde{b},\tilde{\mu}_x^*(A,{\boldsymbol{0}}),\tilde{\mu}_w^*(A,{\boldsymbol{0}}),\Sigma_x^*(A,{\boldsymbol{0}}),\Sigma_w^*(A,{\boldsymbol{0}})), \psi(A,\tilde{b},-\tilde{\mu}_x^*(A,{\boldsymbol{0}}),-\tilde{\mu}_w^*(A,{\boldsymbol{0}}),\Sigma_x^*(A,{\boldsymbol{0}}),\Sigma_w^*(A,{\boldsymbol{0}})) \right\rbrace \\ &\leq \psi\left( A,\tilde{b},\tilde{\mu}_x^*(A,\tilde{b}),\tilde{\mu}_w^*(A,\tilde{b}),\Sigma_x^*(A,\tilde{b}),\Sigma_w^*(A,\tilde{b})\right) , \end{align}\] where the first inequality holds because: i) the first two terms of the objective function \(\psi\) remain unchanged; ii) \((\tilde{\mu}_x^*(A,{\boldsymbol{0}}),\tilde{\mu}_w^*(A,{\boldsymbol{0}}),\Sigma_x^*(A,{\boldsymbol{0}}),\Sigma_w^*(A,{\boldsymbol{0}}))\) is a feasible solution and so is \((-\tilde{\mu}_x^*(A,{\boldsymbol{0}}),-\tilde{\mu}_w^*(A,{\boldsymbol{0}}),\Sigma_x^*(A,{\boldsymbol{0}}),\Sigma_w^*(A,{\boldsymbol{0}}))\); iii) at least one of these two feasible solutions satisfies \(\left[(AH-I_n)\tilde{\mu}_x+A\tilde{\mu}_w\right] ^\top \tilde{b} \geq 0\); iv) \(\tilde{b}^\top \tilde{b}>0\); and the second inequality follows from the optimality of \((\tilde{\mu}_x^*(A,\tilde{b}),\tilde{\mu}_w^*(A,\tilde{b}),\Sigma_x^*(A,\tilde{b}),\Sigma_w^*(A,\tilde{b}))\) for \(A\) and \(\tilde{b}\). Therefore, the optimal solution to (50 ) must satisfy \(\tilde{b}^*=\boldsymbol{0}\). ◻

With the help of Theorem 4, (50 ) can be equivalently transformed into the following problem: \[\begin{align} \inf_{A} \sup_{\tilde{\mu}_x,\tilde{\mu}_w,\Sigma_x,\Sigma_w} &Tr\left[ \left( AH-I_n\right) \Sigma_x\left( AH-I_n\right) ^\top+A\Sigma_w A^\top \right] +\left[\left( AH-I_n\right) \tilde{\mu}_x+A\tilde{\mu}_w \right] ^\top \left[\left( AH-I_n\right) \tilde{\mu}_x+A\tilde{\mu}_w \right] \notag \\ s.t.\; & \Sigma_x \succeq{\boldsymbol{0}}, \quad \Sigma_w \succeq {\boldsymbol{0}}, \notag \\ &\left\| \tilde{\mu}_x\right\| ^2+Tr \left[ \Sigma_x+\hat{\Sigma}_x- 2 \left( \hat{\Sigma}_x^{\frac{1}{2}} \Sigma_x \hat{\Sigma}_x^{\frac{1}{2}}\right) ^{\frac{1}{2}} \right] \leq \rho_x^2, \label{2minimax} \\ &\left\| \tilde{\mu}_w\right\| ^2+Tr \left[ \Sigma_w+\hat{\Sigma}_w- 2 \left( \hat{\Sigma}_w^{\frac{1}{2}} \Sigma_w \hat{\Sigma}_w^{\frac{1}{2}}\right) ^{\frac{1}{2}} \right] \leq \rho_w^2. \notag \end{align}\tag{51}\] Notice that for given \(A\), \(\Sigma_x\) and \(\Sigma_w\), the inner maximization problem over \((\tilde{\mu}_x,\tilde{\mu}_w)\) in (51 ) is a homogeneous QCQP problem with two homogeneous constraints, which can be equivalently reformulated as a convex problem through the SDP relaxation method as described in the following theorem.

Theorem 5. The optimal value of (50 ) is equal to the optimal value of the following SDP problem \[\begin{align} \begin{array}{cl} \sup \limits_{Q,S,\Sigma_x,\Sigma_w,V_x,V_w} &Tr(Q) \\ s.t.& S \succeq{\boldsymbol{0}}, \;\Sigma_x \succeq{\boldsymbol{0}}, \;\Sigma_w \succeq{\boldsymbol{0}}, \;V_x \succeq{\boldsymbol{0}}, \;V_w \succeq{\boldsymbol{0}},\\ &\begin{bmatrix} I_n & {\boldsymbol{0}} \\H & I_m \end{bmatrix} \left( \begin{bmatrix} \Sigma_x & {\boldsymbol{0}} \\ {\boldsymbol{0}} & \Sigma_w \end{bmatrix}+S \right) \begin{bmatrix} I_n & H^\top \\ {\boldsymbol{0}} & I_m \end{bmatrix}-\begin{bmatrix} Q & {\boldsymbol{0}} \\ {\boldsymbol{0}} & {\boldsymbol{0}} \end{bmatrix} \succeq {\boldsymbol{0}}\\ &\left[\begin{array} {cc} \hat{\Sigma}_x^{\frac{1}{2}} \Sigma_x \hat{\Sigma}_x^{\frac{1}{2}} & V_x \\ V_x &I_n \end{array} \right]\succeq {\boldsymbol{0}}, \quad \left[\begin{array} {cc} \hat{\Sigma}_w^{\frac{1}{2}} \Sigma_w \hat{\Sigma}_w^{\frac{1}{2}} & V_w \\ V_w &I_m \end{array} \right]\succeq {\boldsymbol{0}},\\ & Tr\left(\begin{bmatrix} I_n & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix} S \right) +Tr\left( \Sigma_x + \hat{\Sigma}_x - 2 V_x \right) \leq \rho_x^2 ,\\ &Tr \left( \begin{bmatrix} \boldsymbol{0} & \boldsymbol{0} \\ \boldsymbol{0} &I_m \end{bmatrix} S \right) +Tr\left(\Sigma_w + \hat{\Sigma}_w - 2 V_w\right) \leq \rho_w^2. \end{array} \label{minmax32SDP} \end{align}\qquad{(2)}\]

Proof. Since \[\begin{align} \left\lbrace \left. \begin{bmatrix} \tilde{\mu}_x \\ \tilde{\mu}_w \end{bmatrix} \begin{bmatrix} \tilde{\mu}_x^\top & \tilde{\mu}_w^\top \end{bmatrix} \right| \tilde{\mu}_x \in \mathbb{R}^n, \tilde{\mu}_w \in \mathbb{R}^m \right\rbrace =\left\lbrace S \left| S \in \mathbb{S}^{m+n}_+, rank(S)=1 \right. \right\rbrace \subseteq \left\lbrace S \left| S \in \mathbb{S}^{m+n}_+ \right. \right\rbrace, \end{align}\] by ignoring the constraint \(rank(S)=1\) and thus relaxing the inner maximization problem of (51 ), we obtain the following relaxation of (51 ): \[\begin{align} \label{least32SDPs} \begin{array}{cl} \inf \limits_{A} \sup \limits_{S,\Sigma_x,\Sigma_w} &Tr\left[\left(AH-I_n\right)\Sigma_x\left(AH-I_n\right)^\top +A \Sigma_w A^\top\right]+ Tr \left(\begin{bmatrix} (AH-I_n)^\top \\ A^\top \end{bmatrix} \begin{bmatrix} AH-I_n & A \end{bmatrix} S \right) \\ s.t.& S \succeq{\boldsymbol{0}}, \;\Sigma_x \succeq{\boldsymbol{0}}, \;\Sigma_w \succeq{\boldsymbol{0}},\\ & Tr \left(\begin{bmatrix} I_n & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix} S \right) + Tr \left[ \Sigma_x+\hat{\Sigma}_x- 2 \left( \hat{\Sigma}_x^{\frac{1}{2}} \Sigma_x \hat{\Sigma}_x^{\frac{1}{2}}\right) ^{\frac{1}{2}} \right] \leq \rho_x^2,\\ &Tr \left(\begin{bmatrix} \boldsymbol{0} & \boldsymbol{0} \\ \boldsymbol{0} &I_m \end{bmatrix} S \right) + Tr \left[ \Sigma_w+\hat{\Sigma}_w- 2 \left( \hat{\Sigma}_w^{\frac{1}{2}} \Sigma_w \hat{\Sigma}_w^{\frac{1}{2}}\right) ^{\frac{1}{2}}\right] \leq \rho_w^2. \end{array} \end{align}\tag{52}\] Notice that for any \(A\), \(\Sigma_x\) and \(\Sigma_w\), the inner maximization problem on \(S\) in (52 ) always admits a rank-one optimal solution [27]. This implies that the relaxation is tight, i.e., problems (51 ) and (52 ) are equivalent. Note that the objective function of (52 ) is convex in \(A\) for each \((S,\Sigma_x,\Sigma_w)\), concave in \((S,\Sigma_x,\Sigma_w)\) for each \(A\), and the constraint set under the supremum is convex and compact. Then according to Sion’s minimax theorem [24], (52 ) is equivalent to the following problem obtained by exchanging minimization and maximization: \[\begin{align} \label{least32SDPss} \begin{array}{cl} \sup \limits_{S,\Sigma_x,\Sigma_w} \inf \limits_{A} &Tr\left[\left(AH-I_n\right)\Sigma_x\left(AH-I_n\right)^\top +A\Sigma_w A^\top\right]+ Tr \left(\begin{bmatrix} \left( AH-I_n\right) ^\top \\ A^\top \end{bmatrix} \begin{bmatrix} AH-I_n & A \end{bmatrix} S \right) \\ s.t.& S \succeq{\boldsymbol{0}}, \;\Sigma_x \succeq{\boldsymbol{0}}, \;\Sigma_w \succeq{\boldsymbol{0}},\\ & Tr \left(\begin{bmatrix} I_n & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix} S \right) + Tr \left[ \Sigma_x+\hat{\Sigma}_x- 2 \left( \hat{\Sigma}_x^{\frac{1}{2}} \Sigma_x \hat{\Sigma}_x^{\frac{1}{2}}\right) ^{\frac{1}{2}}\right] \leq \rho_x^2,\\ &Tr \left(\begin{bmatrix} \boldsymbol{0} & \boldsymbol{0} \\ \boldsymbol{0} &I_m \end{bmatrix} S \right) + Tr \left[ \Sigma_w+\hat{\Sigma}_w- 2 \left( \hat{\Sigma}_w^{\frac{1}{2}} \Sigma_w \hat{\Sigma}_w^{\frac{1}{2}}\right) ^{\frac{1}{2}} \right] \leq \rho_w^2. 
\end{array} \end{align}\tag{53}\] The objective function of (53 ) can be reformulated as \[\begin{align} &Tr \left( \begin{bmatrix} AH-I_n & A \end{bmatrix} \left( \begin{bmatrix} \Sigma_x & {\boldsymbol{0}} \\ {\boldsymbol{0}} & \Sigma_w \end{bmatrix}+S \right) \begin{bmatrix} (AH-I_n)^\top \\ A^\top \end{bmatrix} \right) \\ &=Tr \left( \begin{bmatrix} I_n & -A \end{bmatrix} \begin{bmatrix} I_n & {\boldsymbol{0}} \\H & I_m \end{bmatrix} \left( \begin{bmatrix} \Sigma_x & {\boldsymbol{0}} \\ {\boldsymbol{0}} & \Sigma_w \end{bmatrix}+S \right) \begin{bmatrix} I_n & H^\top \\ {\boldsymbol{0}} & I_m \end{bmatrix} \begin{bmatrix} I_n \\ -A^\top \end{bmatrix} \right). \end{align}\] Then since the constraints of (53 ) are independent of \(A\), we first consider the inner unconstrained minimization problem \[\inf \limits_{A}Tr \left( \begin{bmatrix} I_n & -A \end{bmatrix} \begin{bmatrix} I_n & {\boldsymbol{0}} \\H & I_m \end{bmatrix} \left( \begin{bmatrix} \Sigma_x & {\boldsymbol{0}} \\ {\boldsymbol{0}} & \Sigma_w \end{bmatrix}+S \right) \begin{bmatrix} I_n & H^\top \\ {\boldsymbol{0}} & I_m \end{bmatrix} \begin{bmatrix} I_n \\ -A^\top \end{bmatrix} \right). \label{least32SDP32min}\tag{54}\] Let \[\begin{align} R \triangleq \begin{bmatrix} R_{11} & R_{12} \\ R_{12}^\top & R_{22} \end{bmatrix} \triangleq \begin{bmatrix} I_n & {\boldsymbol{0}} \\H & I_m \end{bmatrix} \left( \begin{bmatrix} \Sigma_x & {\boldsymbol{0}} \\ {\boldsymbol{0}} & \Sigma_w \end{bmatrix}+S \right) \begin{bmatrix} I_n & H^\top \\ {\boldsymbol{0}} & I_m \end{bmatrix}. \end{align}\] It is well-known that the optimal value of (54 ) is \(Tr\left( R_{11}-R_{12} R_{22}^\dagger R_{12}^\top\right)\) [33]. Consequently, (53 ) is equivalent to \[\begin{align} \begin{array}{cl} \sup \limits_{S,\Sigma_x,\Sigma_w} &Tr\left( R_{11}-R_{12} R_{22}^\dagger R_{12}^\top\right) \\ s.t.& S \succeq{\boldsymbol{0}}, \;\Sigma_x \succeq{\boldsymbol{0}}, \;\Sigma_w \succeq{\boldsymbol{0}},\\ &R=\begin{bmatrix} I_n & {\boldsymbol{0}} \\H & I_m \end{bmatrix} \left( \begin{bmatrix} \Sigma_x & {\boldsymbol{0}} \\ {\boldsymbol{0}} & \Sigma_w \end{bmatrix}+S \right) \begin{bmatrix} I_n & H^\top \\ {\boldsymbol{0}} & I_m \end{bmatrix},\\ & Tr \left(\begin{bmatrix} I_n & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix} S \right) + Tr \left[ \Sigma_x+\hat{\Sigma}_x- 2\left( \hat{\Sigma}_x^{\frac{1}{2}} \Sigma_x \hat{\Sigma}_x^{\frac{1}{2}}\right) ^{\frac{1}{2}}\right] \leq \rho_x^2,\\ &Tr \left(\begin{bmatrix} \boldsymbol{0} & \boldsymbol{0} \\ \boldsymbol{0} &I_m \end{bmatrix} S \right) + Tr \left[ \Sigma_w+\hat{\Sigma}_w- 2 \left( \hat{\Sigma}_w^{\frac{1}{2}} \Sigma_w \hat{\Sigma}_w^{\frac{1}{2}}\right) ^{\frac{1}{2}} \right] \leq \rho_w^2. 
\end{array} \label{least32SDPsss} \end{align}\tag{55}\] By introducing a slack variable \(Q \in \mathbb{S}^n\), applying the Schur complement theorem, and eliminating \(R\), problem (55 ) is equivalent to \[\begin{align} \begin{array}{cl} \sup \limits_{Q,S,\Sigma_x,\Sigma_w} &Tr(Q) \\ s.t.& S \succeq{\boldsymbol{0}}, \;\Sigma_x \succeq{\boldsymbol{0}}, \;\Sigma_w \succeq{\boldsymbol{0}}, \\ &\begin{bmatrix} I_n & {\boldsymbol{0}} \\H & I_m \end{bmatrix} \left( \begin{bmatrix} \Sigma_x & {\boldsymbol{0}} \\ {\boldsymbol{0}} & \Sigma_w \end{bmatrix}+S \right) \begin{bmatrix} I_n & H^\top \\ {\boldsymbol{0}} & I_m \end{bmatrix}-\begin{bmatrix} Q & {\boldsymbol{0}} \\ {\boldsymbol{0}} & {\boldsymbol{0}} \end{bmatrix} \succeq {\boldsymbol{0}},\\ & Tr \left(\begin{bmatrix} I_n & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix} S \right) + Tr \left[ \Sigma_x+\hat{\Sigma}_x- 2 \left( \hat{\Sigma}_x^{\frac{1}{2}} \Sigma_x \hat{\Sigma}_x^{\frac{1}{2}}\right) ^{\frac{1}{2}}\right] \leq \rho_x^2,\\ &Tr \left(\begin{bmatrix} \boldsymbol{0} & \boldsymbol{0} \\ \boldsymbol{0} &I_m \end{bmatrix} S \right) + Tr \left[ \Sigma_w+\hat{\Sigma}_w- 2 \left( \hat{\Sigma}_w^{\frac{1}{2}} \Sigma_w \hat{\Sigma}_w^{\frac{1}{2}}\right) ^{\frac{1}{2}} \right] \leq \rho_w^2. \end{array} \label{minmax32SDPs} \end{align}\tag{56}\] Finally, by Proposition 2 in [34], introducing auxiliary variables \(V_x \in \mathbb{S}_+^n\) and \(V_w \in \mathbb{S}_+^m\), (56 ) can be transformed into the SDP problem (?? ), which completes the proof. ◻

Remark 4. The optimal value of (?? ) is equal to that of (7 ), while the optimal value of (35 ) is equal to that of problem (12 ) obtained by exchanging the supremum and infimum in (7 ). Therefore, according to Theorem 1, if the optimal values of the two SDP problems (?? ) and (35 ) are equal, the robust estimation problem (4 ) has a saddle point solution.
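
Remark 4 therefore suggests a simple numerical test for the existence of a saddle point: solve both SDPs and compare their optimal values up to solver tolerance. The CVX sketch below sets up the SDP of Theorem 5 under the same assumed data names as before; the value of (35 ) is assumed to be stored in val_35 from an earlier solve, and the tolerance is our own choice.

```matlab
% Sketch: the SDP of Theorem 5 in CVX, compared against the value of (35).
Sx_half = sqrtm(hSx);  Sw_half = sqrtm(hSw);
B = [eye(n), zeros(n,m); H, eye(m)];             % [I 0; H I]
cvx_begin sdp quiet
    variable Q(n,n) symmetric
    variable S(n+m,n+m) symmetric
    variable Sx(n,n) symmetric
    variable Sw(m,m) symmetric
    variable Vx(n,n) symmetric
    variable Vw(m,m) symmetric
    maximize( trace(Q) )
    subject to
        S >= 0; Sx >= 0; Sw >= 0; Vx >= 0; Vw >= 0;
        B*([Sx, zeros(n,m); zeros(m,n), Sw] + S)*B' ...
            - [Q, zeros(n,m); zeros(m,n), zeros(m)] >= 0;
        [Sx_half*Sx*Sx_half, Vx; Vx, eye(n)] >= 0;
        [Sw_half*Sw*Sw_half, Vw; Vw, eye(m)] >= 0;
        trace(S(1:n,1:n)) + trace(Sx + hSx - 2*Vx) <= rho_x^2;
        trace(S(n+1:end,n+1:end)) + trace(Sw + hSw - 2*Vw) <= rho_w^2;
cvx_end
val_7 = cvx_optval;                               % optimal value of (7)
has_saddle = abs(val_7 - val_35) <= 1e-6*max(1, abs(val_35));
```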

4.2 An Optimal Solution to (7 ) and (48 )

Section 4.1 shows that the optimal value of (7 ) is equal to that of the SDP problem (?? ). However, in practical applications, we often need to obtain the optimal solution of (7 ), specifically the robust linear estimator and the corresponding least favorable distribution. When the saddle point solution does not exist, the robust linear estimator is not the optimal estimator corresponding to its least favorable distribution, and thus it cannot be directly calculated.

Notice that problem (7 ) is equivalent to (52 ), while the SDP problem (?? ) is equivalent to (53 ), which is obtained by exchanging minimization and maximization in problem (52 ). Therefore, if a saddle point solution of (53 ) can be constructed from the primal and dual optimal solutions of (?? ), then this saddle point solution is also an optimal solution to (52 ), from which the optimal solution of (7 ) can be further obtained.

Therefore, we first consider the Lagrangian function of (?? ) denoted by \[\begin{align} &\mathfrak{L}(Q,S,\Sigma_x,\Sigma_w,V_x,V_w,G_S,G_x,G_w,G_{vx},G_{vw},W,T_x,T_w,\alpha_x,\alpha_w)\\ &\begin{aligned} =&-Tr(Q)-Tr(G_S^\top S)-Tr(G_x^\top \Sigma_{x}) -Tr(G_w^\top \Sigma_w)-Tr(G_{vx}^\top V_x)-Tr(G_{vw}^\top V_w)\\ &-Tr\left[ W^\top \left( \begin{bmatrix} I_n & {\boldsymbol{0}} \\H & I_m \end{bmatrix} \left( \begin{bmatrix} \Sigma_x & {\boldsymbol{0}} \\ {\boldsymbol{0}} & \Sigma_w \end{bmatrix}+S \right) \begin{bmatrix} I_n & H^\top \\ {\boldsymbol{0}} & I_m \end{bmatrix}-\begin{bmatrix} Q & {\boldsymbol{0}} \\ {\boldsymbol{0}} & {\boldsymbol{0}} \end{bmatrix} \right) \right] \\ &-Tr\left(T_x^\top \begin{bmatrix} \hat{\Sigma}^\frac{1}{2}_x \Sigma_x \hat{\Sigma}^\frac{1}{2}_x & V_x \\ V_x & I_n \end{bmatrix} \right) -Tr\left(T_w^\top \begin{bmatrix} \hat{\Sigma}^\frac{1}{2}_w \Sigma_w \hat{\Sigma}^\frac{1}{2}_w & V_w \\ V_w & I_m \end{bmatrix} \right)\\ &+\alpha_x \left[Tr \left(\begin{bmatrix} I_n & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix} S \right) +Tr\left( \Sigma_x + \hat{\Sigma}_x - 2 V_x \right) -\rho_x^2 \right]\\ &+\alpha_w \left[ Tr \left(\begin{bmatrix} \boldsymbol{0} & \boldsymbol{0} \\ \boldsymbol{0} &I_m \end{bmatrix} S \right) +Tr\left(\Sigma_w + \hat{\Sigma}_w - 2 V_w\right) -\rho_w^2 \right], \end{aligned} \end{align}\] where the dual variables \(G_x,G_{vx} \in \mathbb{S}_+^n\); \(G_w,G_{vw} \in \mathbb{S}_+^m\); \(G_S,W \in \mathbb{S}_+^{m+n}\); \(T_x \in \mathbb{S}_+^{2n}\); \(T_w \in \mathbb{S}_+^{2m}\) and \(\alpha_x,\alpha_w \in \mathbb{R}_+\). Since (?? ) is a convex optimization problem, its optimal solution must satisfy the KKT system, i.e., \[\begin{align} &\bigtriangledown \mathfrak{L}_Q=-I_n+W_{11}={\boldsymbol{0}},\tag{57} \\ &\bigtriangledown \mathfrak{L}_S=-G_S-\begin{bmatrix} I_n & H^\top \\ {\boldsymbol{0}} & I_m \end{bmatrix} W \begin{bmatrix} I_n & {\boldsymbol{0}} \\ H & I_m \end{bmatrix}+\alpha_x \begin{bmatrix} I_n & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix}+\alpha_w \begin{bmatrix} \boldsymbol{0} & \boldsymbol{0} \\ \boldsymbol{0} &I_m \end{bmatrix}={\boldsymbol{0}}, \tag{58}\\ &\bigtriangledown \mathfrak{L}_{\Sigma_x}=-G_x-\left( W_{11}+W_{12} H+H^\top W_{12}^\top+H^\top W_{22} H\right) - \hat{\Sigma}_x^\frac{1}{2} T^{11}_x \hat{\Sigma}_x^\frac{1}{2}+ \alpha_x I_n={\boldsymbol{0}}, \tag{59}\\ &\bigtriangledown \mathfrak{L}_{\Sigma_w}= -G_w-W_{22}-\hat{\Sigma}_w^\frac{1}{2} T^{11}_w \hat{\Sigma}_w^\frac{1}{2}+\alpha_w I_m={\boldsymbol{0}},\tag{60}\\ &\bigtriangledown \mathfrak{L}_{V_x}= -G_{vx}-T^{12}_x-\left( T^{12}_x\right) ^\top-2\alpha_x I_n={\boldsymbol{0}},\tag{61}\\ &\bigtriangledown \mathfrak{L}_{V_w}= -G_{vw}-T^{12}_w-\left( T^{12}_w\right) ^\top-2\alpha_w I_m={\boldsymbol{0}},\tag{62} \\ &{\boldsymbol{0}} \preceq G_S \perp S \succeq {\boldsymbol{0}}, \tag{63} \\ &{\boldsymbol{0}} \preceq G_x \perp \Sigma_x \succeq {\boldsymbol{0}}, \tag{64} \\ &{\boldsymbol{0}} \preceq G_w \perp \Sigma_w \succeq {\boldsymbol{0}}, \tag{65} \\ &{\boldsymbol{0}} \preceq G_{vx} \perp V_x \succeq {\boldsymbol{0}},\tag{66} \\ &{\boldsymbol{0}} \preceq G_{vw} \perp V_w \succeq {\boldsymbol{0}},\tag{67} \\ &{\boldsymbol{0}} \preceq W \perp \left( \begin{bmatrix} I_n & {\boldsymbol{0}} \\H & I_m \end{bmatrix} \left( \begin{bmatrix} \Sigma_x & {\boldsymbol{0}} \\ {\boldsymbol{0}} & \Sigma_w \end{bmatrix}+S \right) \begin{bmatrix} I_n & H^\top \\ {\boldsymbol{0}} & I_m \end{bmatrix}-\begin{bmatrix} Q & {\boldsymbol{0}} \\
{\boldsymbol{0}} & {\boldsymbol{0}} \end{bmatrix} \right) \succeq {\boldsymbol{0}}, \tag{68}\\ &{\boldsymbol{0}} \preceq T_x \perp \begin{bmatrix} \hat{\Sigma}^\frac{1}{2}_x \Sigma_x \hat{\Sigma}^\frac{1}{2}_x & V_x \\ V_x & I_n \end{bmatrix} \succeq {\boldsymbol{0}}, \tag{69} \\ &{\boldsymbol{0}} \preceq T_w \perp \begin{bmatrix} \hat{\Sigma}^\frac{1}{2}_w \Sigma_w \hat{\Sigma}^\frac{1}{2}_w & V_w \\ V_w & I_m \end{bmatrix} \succeq {\boldsymbol{0}}, \tag{70} \\ &{\boldsymbol{0}} \preceq \alpha_x \perp \left[\rho_x^2-Tr \left(\begin{bmatrix} I_n & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix} S \right) -Tr\left( \Sigma_x + \hat{\Sigma}_x - 2 V_x \right) \right] \succeq {\boldsymbol{0}}, \tag{71} \\ &{\boldsymbol{0}} \preceq \alpha_w \perp \left[\rho_w^2-Tr \left(\begin{bmatrix} \boldsymbol{0} & \boldsymbol{0} \\ \boldsymbol{0} &I_m \end{bmatrix} S \right) -Tr\left(\Sigma_w + \hat{\Sigma}_w - 2 V_w\right) \right] \succeq {\boldsymbol{0}}, \tag{72} \end{align} \tag{73}\] where some dual variables are appropriately partitioned as \[W=\begin{bmatrix} W_{11} & W_{12} \\ W_{12}^\top & W_{22} \end{bmatrix}, T_x=\begin{bmatrix} T^{11}_x & T^{12}_x \\ (T^{12}_x)^\top & T^{22}_x \end{bmatrix}, T_w=\begin{bmatrix} T^{11}_w & T^{12}_w \\ (T^{12}_w)^\top & T^{22}_w \end{bmatrix},\] with the first diagonal submatrices \(W_{11}, T^{11}_x \in \mathbb{S}_+^n\), \(T^{11}_w \in \mathbb{S}_+^m\). Then we can formulate the saddle point solution of (53 ) in the following theorem.

Theorem 6. Suppose that \((Q^*,S^*,\Sigma_x^*,\Sigma_w^*,V_x^*,V_w^*,G_S^*,G_x^*,G_w^*,G_{vx}^*,G_{vw}^*,W^*,T_x^*,T_w^*,\alpha_x^*,\alpha_w^*)\) is a solution of the KKT system (73 ). Then \(A^*=-W_{12}^*\) and \((S^*,\Sigma_{x}^*,\Sigma_w^*)\) constitute a saddle point solution of (53 ).

Proof. First, we prove that \(A^*=-W_{12}^*\) and \((S^*,\Sigma_{x}^*,\Sigma_w^*)\) constitute an optimal solution to (53 ), i.e., \[Tr(Q^*)=Tr \left( \begin{bmatrix} I_n & W_{12}^* \end{bmatrix} \begin{bmatrix} I_n & {\boldsymbol{0}} \\H & I_m \end{bmatrix} \left( \begin{bmatrix} \Sigma_x^* & {\boldsymbol{0}} \\ {\boldsymbol{0}} & \Sigma_w^* \end{bmatrix}+S^* \right) \begin{bmatrix} I_n & H^\top \\ {\boldsymbol{0}} & I_m \end{bmatrix} \begin{bmatrix} I_n \\ (W_{12}^*)^\top \end{bmatrix} \right). \label{T33461}\tag{74}\] According to (57 ) and (68 ), we have \[\begin{bmatrix} I_n & W_{12}^* \\ (W_{12}^*)^\top & W_{22}^* \end{bmatrix} \left( \begin{bmatrix} I_n & {\boldsymbol{0}} \\H & I_m \end{bmatrix} \left( \begin{bmatrix} \Sigma_x^* & {\boldsymbol{0}} \\ {\boldsymbol{0}} & \Sigma_w^* \end{bmatrix}+S^* \right) \begin{bmatrix} I_n & H^\top \\ {\boldsymbol{0}} & I_m \end{bmatrix}-\begin{bmatrix} Q^* & {\boldsymbol{0}} \\ {\boldsymbol{0}} & {\boldsymbol{0}} \end{bmatrix} \right) ={\boldsymbol{0}}, \label{T334611}\tag{75}\] which implies that \[\begin{bmatrix} I_n & W_{12}^* \end{bmatrix} \left( \begin{bmatrix} I_n & {\boldsymbol{0}} \\H & I_m \end{bmatrix} \left( \begin{bmatrix} \Sigma_x^* & {\boldsymbol{0}} \\ {\boldsymbol{0}} & \Sigma_w^* \end{bmatrix}+S^* \right) \begin{bmatrix} I_n & H^\top \\ {\boldsymbol{0}} & I_m \end{bmatrix}-\begin{bmatrix} Q^* & {\boldsymbol{0}} \\ {\boldsymbol{0}} & {\boldsymbol{0}} \end{bmatrix} \right) ={\boldsymbol{0}}.\] Then we derive \[\begin{align} \begin{bmatrix} I_n \\ (W_{12}^*)^\top \end{bmatrix} \begin{bmatrix} I_n & W_{12}^* \end{bmatrix} \begin{bmatrix} I_n & {\boldsymbol{0}} \\H & I_m \end{bmatrix} \left( \begin{bmatrix} \Sigma_x^* & {\boldsymbol{0}} \\ {\boldsymbol{0}} & \Sigma_w^* \end{bmatrix}+S^* \right) \begin{bmatrix} I_n & H^\top \\ {\boldsymbol{0}} & I_m \end{bmatrix} =\begin{bmatrix} I_n \\ (W_{12}^*)^\top \end{bmatrix} \begin{bmatrix} I_n & W_{12}^* \end{bmatrix} \begin{bmatrix} Q^* & {\boldsymbol{0}} \\ {\boldsymbol{0}} & {\boldsymbol{0}} \end{bmatrix} =\begin{bmatrix} Q^* & {\boldsymbol{0}} \\ (W_{12}^*)^\top Q^* & {\boldsymbol{0}} \end{bmatrix}. \label{T33462} \end{align}\tag{76}\] Taking the trace on both sides of (76 ) and using the cyclic property of the trace operator, we obtain equation (74 ).

Subsequently, it is sufficient to prove that for given \(A^*=-W_{12}^*\), \((S^*,\Sigma_{x}^*,\Sigma_w^*)\) is the maximizer of the inner maximization problem in (52 ), which is equivalent to proving that \((S^*,\Sigma_{x}^*,\Sigma_w^*,V_x^*,V_w^*)\) is an optimal solution to the following problem \[\begin{align} \begin{array}{cl} \sup \limits_{S,\Sigma_x,\Sigma_w,V_x,V_w} &Tr\left[\left(W_{12}^*H+I_n\right)\Sigma_x\left(W_{12}^*H+I_n\right)^\top +W_{12}^* \Sigma_w (W_{12}^*)^\top\right]+ Tr \left(\begin{bmatrix} (W_{12}^*H+I_n)^\top \\ (W_{12}^*)^\top \end{bmatrix} \begin{bmatrix} W_{12}^*H+I_n & W_{12}^* \end{bmatrix} S \right) \\ s.t.& S \succeq{\boldsymbol{0}}, \;\Sigma_x \succeq{\boldsymbol{0}}, \;\Sigma_w \succeq{\boldsymbol{0}}, \;V_x \succeq{\boldsymbol{0}},\;V_w \succeq{\boldsymbol{0}}\\ &\left[\begin{array} {cc} \hat{\Sigma}_x^{\frac{1}{2}} \Sigma_x \hat{\Sigma}_x^{\frac{1}{2}} & V_x \\ V_x &I_n \end{array} \right]\succeq {\boldsymbol{0}}, \quad \left[\begin{array} {cc} \hat{\Sigma}_w^{\frac{1}{2}} \Sigma_w \hat{\Sigma}_w^{\frac{1}{2}} & V_w \\ V_w &I_m \end{array} \right]\succeq {\boldsymbol{0}},\\ & Tr \left(\begin{bmatrix} I_n & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix} S \right) +Tr\left( \Sigma_x + \hat{\Sigma}_x - 2 V_x \right) \leq \rho_x^2 ,\\ &Tr \left(\begin{bmatrix} \boldsymbol{0} & \boldsymbol{0} \\ \boldsymbol{0} &I_m \end{bmatrix} S \right) +Tr\left(\Sigma_w + \hat{\Sigma}_w - 2 V_w\right) \leq \rho_w^2. \end{array} \label{T33463} \end{align}\tag{77}\] Let the Lagrangian function of (77 ) be \[\begin{align} &\bar{\mathfrak{L}}(S,\Sigma_x,\Sigma_w,V_x,V_w,\hat{G}_S,\hat{G}_x,\hat{G}_w,\hat{G}_{vx},\hat{G}_{vw},\hat{T}_x,\hat{T}_w,\hat{\alpha}_x,\hat{\alpha}_w)\\ &\begin{aligned} =&-Tr\left[\left(W_{12}^*H+I_n\right)\Sigma_x\left(W_{12}^*H +I_n\right)^\top+W_{12}^* \Sigma_w \left( W_{12}^*\right) ^\top\right]- Tr \left(\begin{bmatrix} (W_{12}^*H+I_n)^\top \\ (W_{12}^*)^\top \end{bmatrix} \begin{bmatrix} W_{12}^*H+I_n & W_{12}^* \end{bmatrix} S \right)\\ &-Tr\left( \hat{G}_S^\top S\right) -Tr\left( \hat{G}_x^\top \Sigma_{x}\right) -Tr\left( \hat{G}_w^\top \Sigma_w\right) -Tr\left( \hat{G}_{vx}^\top V_x\right) -Tr\left( \hat{G}_{vw}^\top V_w\right) \\ &-Tr\left(\hat{T}_x^\top \begin{bmatrix} \hat{\Sigma}^\frac{1}{2}_x \Sigma_x \hat{\Sigma}^\frac{1}{2}_x & V_x \\ V_x & I_n \end{bmatrix} \right) -Tr\left(\hat{T}_w^\top \begin{bmatrix} \hat{\Sigma}^\frac{1}{2}_w \Sigma_w \hat{\Sigma}^\frac{1}{2}_w & V_w \\ V_w & I_m \end{bmatrix} \right)\\ &+\hat{\alpha}_x \left[Tr \left(\begin{bmatrix} I_n & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix} S \right) +Tr\left( \Sigma_x + \hat{\Sigma}_x - 2 V_x \right) -\rho_x^2 \right]\\ &+\hat{\alpha}_w \left[Tr \left(\begin{bmatrix} \boldsymbol{0} & \boldsymbol{0} \\ \boldsymbol{0} &I_m \end{bmatrix} S \right) +Tr\left(\Sigma_w + \hat{\Sigma}_w - 2 V_w\right) -\rho_w^2 \right], \end{aligned} \end{align}\] where the dual variables \(\hat{G}_S \in \mathbb{S}_+^{m+n}\); \(\hat{G}_x,\hat{G}_{vx} \in \mathbb{S}_+^n\); \(\hat{G}_w,\hat{G}_{vw} \in \mathbb{S}_+^m\); \(\hat{T}_x \in \mathbb{S}_+^{2n}\); \(\hat{T}_w \in \mathbb{S}_+^{2m}\) and \(\hat{\alpha}_x,\hat{\alpha}_w \in \mathbb{R}_+\). 
Since (77 ) is a convex optimization problem, its optimal solution satisfies the KKT conditions, i.e., \[\begin{align} &\bigtriangledown \bar{\mathfrak{L}}_{S}=-\hat{G}_S- \begin{bmatrix} I_n & H^\top \\ {\boldsymbol{0}} & I_m \end{bmatrix} \begin{bmatrix} I_n \\ (W_{12}^*)^\top \end{bmatrix} \begin{bmatrix} I_n & W_{12}^* \end{bmatrix} \begin{bmatrix} I_n & {\boldsymbol{0}} \\ H & I_m \end{bmatrix}+\hat{\alpha}_x \begin{bmatrix} I_n & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix}+\hat{\alpha}_w \begin{bmatrix} \boldsymbol{0} & \boldsymbol{0} \\ \boldsymbol{0} &I_m \end{bmatrix}={\boldsymbol{0}}, \tag{78}\\ &\bigtriangledown \bar{\mathfrak{L}}_{\Sigma_x}=-\hat{G}_x-\left[ I_n+W_{12}^* H+H^\top \left( W_{12}^*\right) ^\top+H^\top \left( W_{12}^*\right) ^\top W_{12}^* H\right] - \hat{\Sigma}_x^\frac{1}{2} \hat{T}^{11}_x \hat{\Sigma}_x^\frac{1}{2}+ \hat{\alpha}_x I_n={\boldsymbol{0}}, \tag{79}\\ &\bigtriangledown \bar{\mathfrak{L}}_{\Sigma_w}= -\hat{G}_w-\left( W_{12}^*\right) ^\top W_{12}^*-\hat{\Sigma}_w^\frac{1}{2} \hat{T}^{11}_w \hat{\Sigma}_w^\frac{1}{2}+\hat{\alpha}_w I_m={\boldsymbol{0}},\tag{80}\\ &\bigtriangledown \bar{\mathfrak{L}}_{V_x}= -\hat{G}_{vx}-\hat{T}^{12}_x-\left( \hat{T}^{12}_x\right) ^\top-2\hat{\alpha}_x I_n={\boldsymbol{0}},\tag{81}\\ &\bigtriangledown \bar{\mathfrak{L}}_{V_w}= -\hat{G}_{vw}-\hat{T}^{12}_w-\left( \hat{T}^{12}_w\right) ^\top-2\hat{\alpha}_w I_m={\boldsymbol{0}},\tag{82} \\ &{\boldsymbol{0}} \preceq \hat{G}_S \perp S \succeq {\boldsymbol{0}}, \tag{83} \\ &{\boldsymbol{0}} \preceq \hat{G}_x \perp \Sigma_x \succeq {\boldsymbol{0}}, \tag{84} \\ &{\boldsymbol{0}} \preceq \hat{G}_w \perp \Sigma_w \succeq {\boldsymbol{0}}, \tag{85} \\ &{\boldsymbol{0}} \preceq \hat{G}_{vx} \perp V_x \succeq {\boldsymbol{0}},\tag{86} \\ &{\boldsymbol{0}} \preceq \hat{G}_{vw} \perp V_w \succeq {\boldsymbol{0}},\tag{87} \\ &{\boldsymbol{0}} \preceq \hat{T}_x \perp \begin{bmatrix} \hat{\Sigma}^\frac{1}{2}_x \Sigma_x \hat{\Sigma}^\frac{1}{2}_x & V_x \\ V_x & I_n \end{bmatrix} \succeq {\boldsymbol{0}}, \tag{88} \\ &{\boldsymbol{0}} \preceq \hat{T}_w \perp \begin{bmatrix} \hat{\Sigma}^\frac{1}{2}_w \Sigma_w \hat{\Sigma}^\frac{1}{2}_w & V_w \\ V_w & I_m \end{bmatrix} \succeq {\boldsymbol{0}}, \tag{89} \\ &{\boldsymbol{0}} \preceq \hat{\alpha}_x \perp \left[\rho_x^2-Tr \left(\begin{bmatrix} I_n & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix} S \right) -Tr\left( \Sigma_x + \hat{\Sigma}_x - 2 V_x \right) \right] \succeq {\boldsymbol{0}}, \tag{90} \\ &{\boldsymbol{0}} \preceq \hat{\alpha}_w \perp \left[\rho_w^2-Tr \left(\begin{bmatrix} \boldsymbol{0} & \boldsymbol{0} \\ \boldsymbol{0} &I_m \end{bmatrix} S \right) -Tr\left(\Sigma_w + \hat{\Sigma}_w - 2 V_w\right) \right] \succeq {\boldsymbol{0}}, \tag{91} \end{align} \tag{92}\] where some dual variables are appropriately partitioned as \[\hat{T}_x=\begin{bmatrix} \hat{T}^{11}_x & \hat{T}^{12}_x \\ \left( \hat{T}^{12}_x\right) ^\top & \hat{T}^{22}_x \end{bmatrix}, \hat{T}_w=\begin{bmatrix} \hat{T}^{11}_w & \hat{T}^{12}_w \\ \left( \hat{T}^{12}_w\right) ^\top & \hat{T}^{22}_w \end{bmatrix},\] with the first diagonal submatrices \(\hat{T}^{11}_x \in \mathbb{S}_+^n\) and \(\hat{T}^{11}_w \in \mathbb{S}_+^m\).

Then we intend to prove that \((S^*,\Sigma_x^*,\Sigma_w^*,V^*_x,V^*_w,\hat{G}^*_S,\hat{G}^*_x,\hat{G}^*_w,\hat{G}^*_{vx},\hat{G}^*_{vw},\hat{T}^*_x,\hat{T}^*_w,\hat{\alpha}^*_x,\hat{\alpha}^*_w)\) satisfies the KKT system (92 ), where \[\begin{align*} &\hat{G}^*_S=G_S^*+\begin{bmatrix} I_n & H^\top \\ {\boldsymbol{0}} & I_m \end{bmatrix} \begin{bmatrix} {\boldsymbol{0}} & {\boldsymbol{0}} \\ {\boldsymbol{0}} & W_{22}^*-(W_{12}^*)^\top W_{12}^* \end{bmatrix} \begin{bmatrix} I_n & {\boldsymbol{0}} \\ H & I_m \end{bmatrix},\\ &\hat{G}^*_x=G_x^*+H^\top \left[ W_{22}^*-\left( W_{12}^*\right) ^\top W_{12}^* \right] H,\\ &\hat{G}^*_w=G_w^*+W_{22}^*-(W_{12}^*)^\top W_{12}^*,\\ &\hat{G}^*_{vx}=G^*_{vx}, \;\hat{G}^*_{vw}=G^*_{vw}, \; \hat{T}^*_x=T_x^*, \;\hat{T}^*_w=T_w^*, \; \hat{\alpha}^*_x=\alpha_x^*,\;\hat{\alpha}^*_w=\alpha_w^*. \end{align*}\] It is obvious that (78 )–(82 ) and (86 )–(91 ) hold. Since \(W^* \succeq {\boldsymbol{0}}\) and \(W_{11}^*=I_n\) by (57 ), it holds that \(W_{22}^*-(W_{12}^*)^\top W_{12}^* \succeq {\boldsymbol{0}}\) by the Schur complement theorem. Then due to the positive semidefiniteness of \(G_S^*\), \(G_x^*\) and \(G_w^*\), we obtain that \(\hat{G}^*_S\), \(\hat{G}^*_x\) and \(\hat{G}^*_w\) are also positive semidefinite. Therefore, it only remains for us to demonstrate that \(\hat{G}^*_S S^*={\boldsymbol{0}}\), \(\hat{G}^*_x \Sigma_x^*={\boldsymbol{0}}\) and \(\hat{G}^*_w \Sigma_w^*={\boldsymbol{0}}\). According to (75 ) and (76 ), we have \[\begin{bmatrix} {\boldsymbol{0}} & {\boldsymbol{0}} \\ {\boldsymbol{0}} & W_{22}^*-(W_{12}^*)^\top W_{12}^* \end{bmatrix} \begin{bmatrix} I_n & {\boldsymbol{0}} \\H & I_m \end{bmatrix} \left( \begin{bmatrix} \Sigma_x^* & {\boldsymbol{0}} \\ {\boldsymbol{0}} & \Sigma_w^* \end{bmatrix}+S^* \right) \begin{bmatrix} I_n & H^\top \\ {\boldsymbol{0}} & I_m \end{bmatrix} ={\boldsymbol{0}},\] which implies that \[\begin{bmatrix} I_n & H^\top \\ {\boldsymbol{0}} & I_m \end{bmatrix} \begin{bmatrix} {\boldsymbol{0}} & {\boldsymbol{0}} \\ {\boldsymbol{0}} & W_{22}^*-(W_{12}^*)^\top W_{12}^* \end{bmatrix} \begin{bmatrix} I_n & {\boldsymbol{0}} \\H & I_m \end{bmatrix} \left( \begin{bmatrix} \Sigma_x^* & {\boldsymbol{0}} \\ {\boldsymbol{0}} & \Sigma_w^* \end{bmatrix}+S^* \right) ={\boldsymbol{0}}.\] Since the matrices \(\begin{bmatrix} I_n & H^\top \\ {\boldsymbol{0}} & I_m \end{bmatrix} \begin{bmatrix} {\boldsymbol{0}} & {\boldsymbol{0}} \\ {\boldsymbol{0}} & W_{22}^*-(W_{12}^*)^\top W_{12}^* \end{bmatrix} \begin{bmatrix} I_n & {\boldsymbol{0}} \\H & I_m \end{bmatrix}\), \(\begin{bmatrix} \Sigma_x^* & {\boldsymbol{0}} \\ {\boldsymbol{0}} & \Sigma_w^* \end{bmatrix}\) and \(S^*\) are positive semidefinite, we have \[\begin{bmatrix} I_n & H^\top \\ {\boldsymbol{0}} & I_m \end{bmatrix} \begin{bmatrix} {\boldsymbol{0}} & {\boldsymbol{0}} \\ {\boldsymbol{0}} & W_{22}^*-(W_{12}^*)^\top W_{12}^* \end{bmatrix} \begin{bmatrix} I_n & {\boldsymbol{0}} \\H & I_m \end{bmatrix} \begin{bmatrix} \Sigma_x^* & {\boldsymbol{0}} \\ {\boldsymbol{0}} & \Sigma_w^* \end{bmatrix}={\boldsymbol{0}}, \label{T33465}\tag{93}\] and \[\begin{bmatrix} I_n & H^\top \\ {\boldsymbol{0}} & I_m \end{bmatrix} \begin{bmatrix} {\boldsymbol{0}} & {\boldsymbol{0}} \\ {\boldsymbol{0}} & W_{22}^*-(W_{12}^*)^\top W_{12}^* \end{bmatrix} \begin{bmatrix} I_n & {\boldsymbol{0}} \\H & I_m \end{bmatrix} S^*={\boldsymbol{0}}, \label{T33466}\tag{94}\] which follows from the fact that for any matrices \(A,B,C \succeq {\boldsymbol{0}}\), the equation \(A(B+C)={\boldsymbol{0}}\) implies \(AB={\boldsymbol{0}}\) and \(AC={\boldsymbol{0}}\); this fact is a direct consequence of Fact 10.14.5 in [28]. In accordance with equation (93 ), we have \[H^\top \left[ W_{22}^*-\left( W_{12}^*\right) ^\top W_{12}^* \right] H \Sigma_x^*={\boldsymbol{0}} \label{T33467}\tag{95}\] and \[\left[ W_{22}^*-\left( W_{12}^*\right) ^\top W_{12}^*\right] \Sigma_w^*={\boldsymbol{0}}. \label{T33468}\tag{96}\] Thus, combining (94 )–(96 ) with (63 )–(65 ), it follows that (83 )–(85 ) hold.

Therefore, \((S^*,\Sigma_x^*,\Sigma_w^*,V^*_x,V^*_w,\hat{G}^*_S,\hat{G}^*_x,\hat{G}^*_w,\hat{G}^*_{vx},\hat{G}^*_{vw},\hat{T}^*_x,\hat{T}^*_w,\hat{\alpha}^*_x,\hat{\alpha}^*_w)\) satisfies the KKT system (92 ). Then for given \(A^*=-W_{12}^*\), \((S^*,\Sigma_{x}^*,\Sigma_w^*)\) is the maximizer of the inner maximization problem in (52 ), which implies that \(A^*=-W_{12}^*\) and \((S^*,\Sigma_{x}^*,\Sigma_w^*)\) constitute a saddle point solution of (53 ). ◻

For given \(A^*=-W_{12}^*\), \(\Sigma_{x}^*\) and \(\Sigma_{w}^*\), we can obtain a rank-one optimal solution \(S^*=\begin{bmatrix} \tilde{\mu}_x^* \\ \tilde{\mu}_w^* \end{bmatrix} \begin{bmatrix} (\tilde{\mu}_x^*)^\top & (\tilde{\mu}_w^*)^\top \end{bmatrix}\) to the inner maximization problem over \(S\) in (52 ) as outlined in the proof of Theorem 2.5 in [27]. Then \((A^*,\tilde{\mu}^*_x,\tilde{\mu}^*_w,\Sigma^*_x,\Sigma^*_w)\) is an optimal solution to (51 ), which further implies that \((A^*,-(A^*H-I_n)\hat{\mu}_x-A^* \hat{\mu}_w,\tilde{\mu}^*_x+\hat{\mu}_x,\tilde{\mu}^*_w+\hat{\mu}_w,\Sigma^*_x,\Sigma^*_w)\) is an optimal solution to (49 ). That is, \(f^*(y)=A^*y-(A^*H-I_n)\hat{\mu}_x-A^* \hat{\mu}_w\) and \(P^*=\mathcal{N}(\tilde{\mu}^*_x+\hat{\mu}_x,\Sigma^*_x) \times \mathcal{N}(\tilde{\mu}^*_w+\hat{\mu}_w,\Sigma^*_w)\) constitute an optimal solution to (7 ) and (48 ).
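
As a sketch of this construction (one possible implementation, not the only one), suppose the dual variable \(W^*\) associated with the first LMI constraint of (?? ) has been retrieved from the solver, for instance via a dual variable declaration in CVX; the names W, mu_x_hat and mu_w_hat below are our own placeholders for that dual matrix and the nominal means.

```matlab
% Sketch: assembling the robust linear estimator of Section 4.2, assuming the
% dual matrix W of the first LMI in the SDP of Theorem 5 is already available.
W12   = W(1:n, n+1:n+m);                               % upper-right block W_{12}^*
A_lin = -W12;                                          % A^* = -W_{12}^*
b_lin = -(A_lin*H - eye(n))*mu_x_hat - A_lin*mu_w_hat; % b^*, with b-tilde^* = 0 by Theorem 4
robust_estimate = @(y) A_lin*y + b_lin;                % f^*(y) = A^* y + b^*
```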

5 Simulation↩︎

In this section, we verify the effectiveness of our theory through numerical experiments. All experiments are implemented in MATLAB R2024a on a PC with an AMD Ryzen 7 9800X3D processor (4.7 GHz) and 64 GB of RAM. In all experiments, the SDP problems are solved numerically by the SDPT3 solver through the CVX interface [35], with all solver parameters set to their default values.

5.1 The Nonexistence of the Saddle Point↩︎

In the first experiment, we aim to illustrate that the saddle point may not exist in the high-dimensional case, where the parameter and noise dimensions are fixed to \(n=m=d\). We take the nominal mean vectors to be \(\hat{\mu}_x=\hat{\mu}_w={\boldsymbol{0}}\) and draw the elements of the observation matrix \(H\) independently from the standard Gaussian distribution. The nominal covariance matrices \(\hat{\Sigma}_x\) and \(\hat{\Sigma}_w\) are constructed as follows: first, we sample the elements of the matrices \(Q_x\) and \(Q_w\) independently from the standard Gaussian distribution and denote by \(R_x\) and \(R_w\) the orthogonal matrices whose columns are the orthonormal eigenvectors of \(Q_x+Q_x^\top\) and \(Q_w+Q_w^\top\), respectively. Then we define \(\hat{\Sigma}_x=R_x \Lambda_x R_x^\top\) and \(\hat{\Sigma}_w=R_w \Lambda_w R_w^\top\), where \(\Lambda_x\) and \(\Lambda_w\) are diagonal with entries sampled uniformly from [1,5] and [1,2], respectively. Finally, we set the Wasserstein radius of the parameter distribution to \(\rho_x=3\) and vary the Wasserstein radius of the noise distribution \(\rho_w\) across the interval [1,10] with a stepsize of 0.5.
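A minimal MATLAB sketch of this data-generation step is given below; the variable names are illustrative and not taken from our implementation.

```matlab
% Sketch of the experimental setup: zero nominal means, Gaussian H, and nominal
% covariances with random orthonormal eigenvectors and eigenvalues on [1,5], [1,2].
d = 20; n = d; m = d;
H = randn(m, n);
mu_x_hat = zeros(n, 1);  mu_w_hat = zeros(m, 1);
Qx = randn(n);  Qw = randn(m);
[Rx, ~] = eig(Qx + Qx');                 % orthonormal eigenvectors of a symmetric matrix
[Rw, ~] = eig(Qw + Qw');
Lx = diag(1 + 4*rand(n, 1));             % eigenvalues drawn uniformly from [1,5]
Lw = diag(1 + rand(m, 1));               % eigenvalues drawn uniformly from [1,2]
Sigma_x_hat = Rx*Lx*Rx';
Sigma_w_hat = Rw*Lw*Rw';
rho_x = 3;
rho_w_grid = 1:0.5:10;                   % noise radii scanned in the experiment
```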

For a given \(\rho_w\), we first calculate the optimal solution to (12 ), which is equal to that of the SDP problem (35 ). From the proof of Theorem 2, the optimal solution to (12 ) denoted by \((f^*,P^*)\) is unique due to the positive definiteness of the nominal covariance matrices \(\hat{\Sigma}_x\) and \(\hat{\Sigma}_w\). Subsequently, the least favourable distribution corresponding to \(f^*\) will be calculated by solving the problem \[\sup_P mse(f^*,P), \label{least}\tag{97}\] where \(f^*=A^* y+b^*\) and \(b^*= (I_n-A^*H)\hat{\mu}_x - A^* \hat{\mu}_w={\boldsymbol{0}}\). Then (97 ) can be parameterized as \[\label{least322} \begin{align} \sup_{\mu_x,\mu_w,\Sigma_x,\Sigma_w} &Tr\left[\left( A^*H-I_n\right) \Sigma_x\left( A^*H-I_n\right) ^\top+A^*\Sigma_w \left( A^*\right) ^\top \right]+\left[ \left( A^*H-I_n\right) \mu_x+A^*\mu_w\right] ^\top \left[ \left( A^*H-I_n\right) \mu_x+A^*\mu_w \right] \\ s.t.\; & \Sigma_x \succeq{\boldsymbol{0}}, \quad \Sigma_w \succeq {\boldsymbol{0}}, \\ &\left\| \mu_x \right\| ^2+Tr \left[ \Sigma_x+\hat{\Sigma}_x- 2 \left( \hat{\Sigma}_x^{\frac{1}{2}} \Sigma_x \hat{\Sigma}_x^{\frac{1}{2}}\right) ^{\frac{1}{2}}\right] \leq \rho_x^2, \\ &\left\| \mu_w\right\| ^2+Tr \left[ \Sigma_w+\hat{\Sigma}_w- 2 \left( \hat{\Sigma}_w^{\frac{1}{2}} \Sigma_w \hat{\Sigma}_w^{\frac{1}{2}}\right) ^{\frac{1}{2}}\right] \leq \rho_w^2. \end{align}\tag{98}\] For given \(\Sigma_x\) and \(\Sigma_w\), the maximization problem over \((\mu_x,\mu_w)\) is a QCQP in which the two constraint functions and the objective function are all homogeneous quadratic. Then the SDP relaxation of problem (98 ) is \[\begin{align} \begin{array}{cl} \sup \limits_{S,\Sigma_x,\Sigma_w} &Tr\left[\left(A^*H-I_n\right)\Sigma_x\left(A^*H-I_n\right)^\top +A^*\Sigma_w \left( A^*\right) ^\top\right]+ Tr \left(\begin{bmatrix} (A^*H-I_n)^\top \\ (A^*)^\top \end{bmatrix} \begin{bmatrix} A^*H-I_n & A^* \end{bmatrix} S \right) \\ s.t. &\Sigma_x \succeq{\boldsymbol{0}}, \quad \Sigma_w \succeq {\boldsymbol{0}}, \\ &Tr \left(\begin{bmatrix} I_n & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix} S \right) +Tr \left[ \Sigma_x+\hat{\Sigma}_x- 2 \left( \hat{\Sigma}_x^{\frac{1}{2}} \Sigma_x \hat{\Sigma}_x^{\frac{1}{2}}\right) ^{\frac{1}{2}}\right] \leq \rho_x^2, \\ &Tr \left(\begin{bmatrix} \boldsymbol{0} & \boldsymbol{0} \\ \boldsymbol{0} &I_m \end{bmatrix} S \right) +Tr \left[ \Sigma_w+\hat{\Sigma}_w- 2 \left( \hat{\Sigma}_w^{\frac{1}{2}} \Sigma_w \hat{\Sigma}_w^{\frac{1}{2}}\right) ^{\frac{1}{2}}\right] \leq \rho_w^2. \end{array} \label{least32SDP1} \end{align}\tag{99}\] According to Theorem 2.5 in [27], problem (99 ) has a rank-one optimal solution, which implies that the SDP relaxation is tight. 
Thus, combined with Proposition 2 in [34], introducing auxiliary variables \(V_x \in \mathbb{S}_+^n\) and \(V_w \in \mathbb{S}_+^m\), problem (98 ) is equivalent to the following SDP problem \[\begin{align} \label{least32SDP} \begin{array}{cl} \sup \limits_{S,\Sigma_x,\Sigma_w,V_x,V_w} &Tr\left[\left(A^*H-I_n\right)\Sigma_x\left(A^*H-I_n\right)^\top +A^*\Sigma_w \left( A^*\right) ^\top\right]+ Tr \left(\begin{bmatrix} (A^*H-I_n)^\top \\ (A^*)^\top \end{bmatrix} \begin{bmatrix} A^*H-I_n & A^* \end{bmatrix} S \right) \\ s.t.& S \succeq{\boldsymbol{0}}, \;\Sigma_x \succeq{\boldsymbol{0}}, \;\Sigma_w \succeq{\boldsymbol{0}}, \;V_x \succeq{\boldsymbol{0}}, \;V_w \succeq{\boldsymbol{0}},\\ &\left[\begin{array} {cc} \hat{\Sigma}_x^{\frac{1}{2}} \Sigma_x \hat{\Sigma}_x^{\frac{1}{2}} & V_x \\ V_x &I_n \end{array} \right]\succeq {\boldsymbol{0}}, \quad \left[\begin{array} {cc} \hat{\Sigma}_w^{\frac{1}{2}} \Sigma_w \hat{\Sigma}_w^{\frac{1}{2}} & V_w \\ V_w &I_m \end{array} \right]\succeq {\boldsymbol{0}},\\ & Tr \left(\begin{bmatrix} I_n & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{0} \end{bmatrix} S \right) +Tr\left( \Sigma_x + \hat{\Sigma}_x - 2 V_x \right) \leq \rho_x^2 ,\\ &Tr \left(\begin{bmatrix} \boldsymbol{0} & \boldsymbol{0} \\ \boldsymbol{0} &I_m \end{bmatrix} S \right) +Tr\left(\Sigma_w + \hat{\Sigma}_w - 2 V_w\right) \leq \rho_w^2. \end{array} \end{align}\tag{100}\] We denote the least favorable distribution corresponding to \(f^*\) obtained by (100 ) as \(\tilde{P}\), and then compare \(mse(f^*,P^*)\) and \(mse(f^*,\tilde{P})\) which are the optimal values of (35 ) and (100 ), respectively. If \(mse(f^*,P^*) < mse(f^*,\tilde{P})\), it can be inferred that \(P^*\) is not the least favorable distribution corresponding to \(f^*\), which indicates that the saddle point solution of (4 ) does not exist.
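For reference, problem (100 ) can be set up in CVX along the following lines. This is a minimal sketch, not the exact code used in our experiments; names such as `A_star`, `Sigma_x_hat`, `Sigma_w_hat`, `rho_x` and `rho_w` are assumed to be available in the workspace.

```matlab
% Sketch of the SDP (100): worst-case MSE of the fixed affine estimator f* = A*y + b*.
n = size(H, 2);  m = size(H, 1);
Sx_h = sqrtm(Sigma_x_hat);  Sx_h = (Sx_h + Sx_h')/2;   % nominal square roots, symmetrized
Sw_h = sqrtm(Sigma_w_hat);  Sw_h = (Sw_h + Sw_h')/2;
B = [A_star*H - eye(n), A_star];                       % stacked error map [A*H - I_n, A*]
cvx_begin sdp quiet
    variable S(n+m, n+m) symmetric
    variable Sigma_x(n, n) symmetric
    variable Sigma_w(m, m) symmetric
    variable Vx(n, n) symmetric
    variable Vw(m, m) symmetric
    maximize( trace((A_star*H - eye(n))*Sigma_x*(A_star*H - eye(n))') ...
              + trace(A_star*Sigma_w*A_star') + trace((B'*B)*S) )
    subject to
        S >= 0; Sigma_x >= 0; Sigma_w >= 0; Vx >= 0; Vw >= 0;
        [Sx_h*Sigma_x*Sx_h, Vx; Vx, eye(n)] >= 0;
        [Sw_h*Sigma_w*Sw_h, Vw; Vw, eye(m)] >= 0;
        trace(S(1:n, 1:n)) + trace(Sigma_x + Sigma_x_hat - 2*Vx) <= rho_x^2;
        trace(S(n+1:end, n+1:end)) + trace(Sigma_w + Sigma_w_hat - 2*Vw) <= rho_w^2;
cvx_end
mse_tilde = cvx_optval;    % worst-case MSE of f*, i.e. mse(f*, P_tilde)
```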

Figure 1: MSE vs \(\rho_w\) when \(\rho_x=3\) for \(d=20\).

In Figure 1, \(mse(f^*,P^*)\), \(mse(f^*,\tilde{P})\) and the optimal value of (?? ), which is also the optimal value of the minimax problem (48 ), are plotted versus the radius \(\rho_w\) for \(d=20\). It is evident that the saddle point exists when \(\rho_w\) is sufficiently small. However, as \(\rho_w\) gradually increases, \(mse(f^*,P^*)\) becomes smaller than \(mse(f^*,\tilde{P})\), which indicates that \(P^*\) is no longer the least favorable distribution corresponding to \(f^*\). Consequently, the saddle point solution of (4 ) may not exist. Moreover, irrespective of the existence of the saddle point, the optimal value of (48 ) is always greater than \(mse(f^*,P^*)\) but less than \(mse(f^*,\tilde{P})\), which is consistent with the theoretical result. Furthermore, the optimal value of (48 ) and \(mse(f^*,P^*)\) provide upper and lower bounds, respectively, on the optimal value of the original problem (4 ).

5.2 The Validity of Sufficient Condition↩︎

In the second experiment, we aim to verify the validity of the sufficient condition by comparing the lower bound for the existence of the saddle point determined by the sufficient condition in Theorem 3 with the actual bound determined by the necessary and sufficient condition in Theorem 2. The nominal mean vectors \(\hat{\mu}_x\) and \(\hat{\mu}_w\), the nominal covariance matrices \(\hat{\Sigma}_x\) and \(\hat{\Sigma}_w\), and the observation matrix \(H\) are generated in the same way as in the first experiment. We vary the Wasserstein radius of the parameter distribution \(\rho_x\) across the interval [1,10] with a stepsize of 0.1 and calculate the lower bound determined by the sufficient condition in Theorem 3 through \[\rho_w^L=\frac{\lambda_{\min}^{\frac{1}{2}}\left( \hat{\Sigma}_x\right) \lambda_{\min}^{\frac{1}{2}}\left(\hat{\Sigma}_w\right) }{\rho_x}.\] In addition, for a given \(\rho_x\), to determine the actual bound for the existence of the saddle point, we vary the Wasserstein radius of the noise distribution \(\rho_w\) across the interval [1,10] with a stepsize of 0.02. For given \(\rho_x\) and \(\rho_w\), we solve (35 ) and verify whether the matrix in Theorem 2 is negative semidefinite. If the largest eigenvalue of the matrix is positive, we assert by Theorem 2 that the saddle point solution of (4 ) does not exist, and we record the smallest such \(\rho_w^*\) for each \(\rho_x\). The actual bound for the existence of the saddle point is then given by \(\rho_w^A=\rho_w^*-0.02\).
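The bound comparison can be sketched as follows; the helper `theorem2_matrix`, which would solve (35 ) and assemble the matrix appearing in Theorem 2 from the nominal data, is hypothetical and only indicates where that step occurs.

```matlab
% Sketch of the comparison between the closed-form lower bound rho_w^L (Theorem 3)
% and the actual bound rho_w^A found by scanning rho_w (Theorem 2).
rho_x_grid = 1:0.1:10;
rho_w_L = sqrt(min(eig(Sigma_x_hat)) * min(eig(Sigma_w_hat))) ./ rho_x_grid;

rho_w_A = nan(size(rho_x_grid));
for i = 1:numel(rho_x_grid)
    for rho_w = 1:0.02:10
        M = theorem2_matrix(rho_x_grid(i), rho_w);   % hypothetical helper: solve (35) and form the Theorem 2 matrix
        if max(eig(M)) > 0                           % condition violated: no saddle point
            rho_w_A(i) = rho_w - 0.02;               % last radius for which the condition held
            break
        end
    end
end
```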

Figure 2: The actual bound and the lower bound for the existence of the saddle point.

In Figure 2, the lower bound \(\rho_w^L\) calculated from the sufficient condition and the actual bound \(\rho_w^A\) obtained by the traversal algorithm are plotted versus the radius \(\rho_x\) for \(d=20\). It is evident that the lower bound lies consistently below the actual bound. As \(\rho_x\) increases, the discrepancy between the two bounds decreases.

5.3 The Robustness of the Robust Linear Estimator↩︎

In the third experiment, we aim to verify the robustness of the linear estimator obtained from the upper-bound problem (48 ) in Section 4. In this experiment, we assume that the parameter and noise dimensions are equal, denoted by \(n=m=d\), and take the observation matrix to be \(H=I_d\). Without loss of generality, the true mean vectors and the nominal mean vectors are all set to zero, i.e., \(\mu_x=\mu_w=\hat{\mu}_x=\hat{\mu}_w={\boldsymbol{0}}\). The experiment comprises 3,000 simulation runs. In each run, the true covariance matrices \(\Sigma_x\) and \(\Sigma_w\) are randomly generated in the same way as in the first experiment. Then the nominal covariance matrices \(\hat{\Sigma}_x\) and \(\hat{\Sigma}_w\) are defined as the sample covariance matrices corresponding to 100 independent samples from \(\mathcal{N}({\boldsymbol{0}},\Sigma_x)\) and \(\mathcal{N}({\boldsymbol{0}},\Sigma_w)\). Finally, we set the Wasserstein radii \(\rho_x\) and \(\rho_w\) of the uncertainty sets as follows. The sampling is repeated 1,000 times, with 100 samples drawn from the true distributions on each occasion, and the Wasserstein distances between the resulting sample covariance matrices and the true covariance matrices are calculated and recorded. The Wasserstein distances obtained from the 1,000 repetitions are sorted in ascending order, and the 0.95-quantile is taken as the Wasserstein radius (\(\rho_x\) and \(\rho_w\), respectively), to ensure that the sample covariance matrices are included in the uncertainty sets in most cases.
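A minimal sketch of this radius-calibration step (shown for \(\rho_x\); the radius \(\rho_w\) is obtained analogously) is given below, using the closed-form type-2 Wasserstein (Gelbrich) distance between zero-mean Gaussians; variable names are illustrative, and `d` and `Sigma_x` are assumed to be in the workspace.

```matlab
% Sketch of the radius calibration: 0.95-quantile of the Gaussian Wasserstein
% distance between the true covariance and repeated sample covariances.
gelbrich = @(S1, S2) sqrt(max(real(trace(S1 + S2 ...
    - 2*sqrtm(sqrtm(S1)*S2*sqrtm(S1)))), 0));        % clamp tiny negative roundoff

n_rep = 1000;  n_samp = 100;
dists = zeros(n_rep, 1);
for r = 1:n_rep
    X = sqrtm(Sigma_x)*randn(d, n_samp);             % n_samp draws from N(0, Sigma_x)
    S_hat = (X*X')/n_samp;                           % sample covariance (zero-mean case)
    dists(r) = gelbrich(Sigma_x, S_hat);
end
dists = sort(dists);
rho_x = dists(ceil(0.95*n_rep));                     % 0.95-quantile used as the radius
```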

In this framework, two distributions are considered: the true distribution \(P=\mathcal{N}({\boldsymbol{0}},\Sigma_x) \times \mathcal{N}({\boldsymbol{0}},\Sigma_w)\) and the nominal distribution \(\hat{P}=\mathcal{N}({\boldsymbol{0}},\hat{\Sigma}_x) \times \mathcal{N}({\boldsymbol{0}},\hat{\Sigma}_w)\). This allows us to obtain three estimators: the optimal estimator for the nominal distribution \(f^*(\hat{P})\), the optimal estimator for the true distribution \(f^*(P)\), and the robust linear estimator \(\tilde{f}^*\) proposed in Section 4. To verify the robustness of the estimator \(\tilde{f}^*\), we compare the relative mean square errors (RMSE) \(mse(\tilde{f}^*,P)-mse(f^*(P),P)\) and \(mse(f^*(\hat{P}),P)-mse(f^*(P),P)\). Figure 3 presents the frequency histograms of the relative mean square errors of the robust linear estimator and the nominal optimal estimator over the 3,000 repeated experiments for \(d=10\) and \(d=20\), respectively. It is evident that the relative mean square error of the robust linear estimator is smaller than that of the optimal estimator for the nominal distribution, and this advantage becomes increasingly apparent as the dimensions of the parameter and noise increase.
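For completeness, the mean square error of a linear estimator \(f(y)=Ay+b\) under a Gaussian product distribution, which underlies the RMSE comparison above, admits the closed form used in the following sketch (an illustrative helper, not our exact implementation). With such a helper, the relative errors \(mse(\tilde{f}^*,P)-mse(f^*(P),P)\) and \(mse(f^*(\hat{P}),P)-mse(f^*(P),P)\) follow directly.

```matlab
% Sketch: closed-form MSE of a linear estimator f(y) = A*y + b when
% x ~ N(mu_x, Sigma_x) and w ~ N(mu_w, Sigma_w) are independent:
% mse(f,P) = Tr[(AH-I)Sigma_x(AH-I)'] + Tr[A Sigma_w A'] + ||(AH-I)mu_x + A mu_w + b||^2.
function v = mse_linear(A, b, H, mu_x, Sigma_x, mu_w, Sigma_w)
E = A*H - eye(size(H, 2));                 % error map acting on the parameter x
v = trace(E*Sigma_x*E') + trace(A*Sigma_w*A') + norm(E*mu_x + A*mu_w + b)^2;
end
```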

Figure 3: The frequency histograms of the RMSE.

6 Conclusion↩︎

In this paper, we consider a robust estimation problem in the linear measurement model with additive noise, where the parameter and noise distributions are constrained to lie in Wasserstein-distance balls. This robust estimation problem can be formulated as an infinite-dimensional nonconvex minimax problem whose saddle point may not exist. By relating the existence of its saddle point to that of a finite-dimensional minimax problem, we provide a verifiable necessary and sufficient condition as well as a simplified sufficient condition. When a saddle point exists, the original infinite-dimensional minimax problem reduces to an SDP problem. Conversely, when the saddle point is absent, the problem becomes intractable, which motivates us to consider an upper-bound problem in which the estimator is restricted to be linear. By demonstrating the tightness of the SDP relaxation of the upper-bound problem, we prove that its optimal value coincides with that of an SDP problem. Furthermore, the optimal solution of this upper-bound problem is constructed and yields a robust linear estimator.

7 Acknowledgements↩︎

The authors would like to thank Prof. Zhi-Quan (Tom) Luo from The Chinese University of Hong Kong, Shenzhen, for the helpful discussions on this work.

References↩︎

[1]
Yonina C Eldar, Aharon Ben-Tal, and Arkadi Nemirovski. Robust mean-squared error estimation in the presence of model uncertainties. IEEE Transactions on Signal Processing, 53(1):168–181, 2004.
[2]
Giuseppe Calafiore and Laurent El Ghaoui. Robust maximum likelihood estimation in the linear model. Automatica, 37(4):573–580, 2001.
[3]
Laurent El Ghaoui and Hervé Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18(4):1035–1064, 1997.
[4]
Dimitris Bertsimas and David B. Brown. Constrained stochastic LQC: A tractable approach. IEEE Transactions on Automatic Control, 52(10):1826–1841, 2007.
[5]
Kan-Lin Hsiung, Seung-Jean Kim, and S. Boyd. Power control in lognormal fading wireless channels with uptime probability specifications via robust geometric programming. In Proceedings of the 2005 American Control Conference, volume 6, pages 3955–3959, 2005.
[6]
Sergiy A Vorobyov, Alex B Gershman, and Zhi-Quan Luo. Robust adaptive beamforming using worst-case performance optimization: A solution to the signal mismatch problem. IEEE Transactions on Signal Processing, 51(2):313–324, 2003.
[7]
Y.C. Eldar, A. Ben-Tal, and A. Nemirovski. Linear minimax regret estimation of deterministic parameters with bounded data uncertainties. IEEE Transactions on Signal Processing, 52(8):2177–2188, 2004.
[8]
Y. C. Eldar. Robust competitive estimation with signal and noise covariance uncertainties. IEEE Transactions on Information Theory, 52(10):4532–4547, 2006.
[9]
Aharon Ben-Tal and Arkadi Nemirovski. Robust convex optimization. Mathematics of Operations Research, 23(4):769–805, 1998.
[10]
Aharon Ben-Tal and Arkadi Nemirovski. Robust solutions of linear programming problems contaminated with uncertain data. Mathematical Programming, 88:411–424, 2000.
[11]
B.C. Levy and R. Nikoukhah. Robust least-squares estimation with a relative entropy constraint. IEEE Transactions on Information Theory, 50(1):89–104, 2004.
[12]
Viet Anh Nguyen, Soroosh Shafieezadeh-Abadeh, Daniel Kuhn, and Peyman Mohajerin Esfahani. Bridging Bayesian and minimax mean square error estimation via Wasserstein distributionally robust optimization. Mathematics of Operations Research, 48(1):1–37, 2023.
[13]
Shixiong Wang. Distributionally robust state estimation for nonlinear systems. IEEE Transactions on Signal Processing, 70:4408–4423, 2022.
[14]
Shixiong Wang, Zhongming Wu, and Andrew Lim. Robust state estimation for linear systems under distributional uncertainty. IEEE Transactions on Signal Processing, 69:5963–5978, 2021.
[15]
Shixiong Wang and Zhi-Sheng Ye. Distributionally robust state estimation for linear systems subject to uncertainty and outlier. IEEE Transactions on Signal Processing, 70:452–467, 2022.
[16]
Bernard C. Levy and Ramine Nikoukhah. Robust state space filtering under incremental model perturbations subject to a relative entropy tolerance. IEEE Transactions on Automatic Control, 58(3):682–695, 2013.
[17]
Mattia Zorzi. Robust Kalman filtering under model perturbations. IEEE Transactions on Automatic Control, 62(6):2902–2907, 2017.
[18]
Dunbiao Niu, Enbin Song, Zhi Li, Linxia Zhang, Ting Ma, Juping Gu, and Qingjiang Shi. A marginal distributionally robust MMSE estimation for a multisensor system with Kullback–Leibler divergence constraints. IEEE Transactions on Signal Processing, 71:3772–3787, 2023.
[19]
Alex Dytso, Michael Fauß, Abdelhak M. Zoubir, and H. Vincent Poor. MMSE bounds for additive noise channels under Kullback–Leibler divergence constraints on the input distribution. IEEE Transactions on Signal Processing, 67(24):6352–6367, 2019.
[20]
Soroosh Shafieezadeh Abadeh, Viet Anh Nguyen, Daniel Kuhn, and Peyman M Mohajerin Esfahani. Wasserstein distributionally robust Kalman filtering. Advances in Neural Information Processing Systems, 31, 2018.
[21]
Daniel Kuhn, Peyman Mohajerin Esfahani, Viet Anh Nguyen, and Soroosh Shafieezadeh-Abadeh. Wasserstein distributionally robust optimization: Theory and applications in machine learning. Operations Research, pages 130–166, 2019.
[22]
Josef Stoer and Christoph Witzgall. Convexity and optimization in finite dimensions I. Springer-Verlag, Berlin, Germany, 1970.
[23]
Matthias Gelbrich. On a formula for the L\(^2\) Wasserstein metric between measures on Euclidean and Hilbert spaces. Mathematische Nachrichten, 147(1):185–203, 1990.
[24]
Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics, 8:171–176, 1958.
[25]
Fabio Botelho. Functional Analysis and Applied Optimization in Banach Spaces: Applications to Non-Convex Variational Models. Springer, Cham, Switzerland, 2014.
[26]
Dimitri P Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, Massachusetts, USA, third edition, 2016.
[27]
Yinyu Ye and Shuzhong Zhang. New results on quadratic minimization. SIAM Journal on Optimization, 14(1):245–267, 2003.
[28]
Dennis Bernstein. Scalar, Vector, and Matrix Mathematics: Theory, Facts, and Formulas-Revised and Expanded Edition. Princeton University Press, Princeton, NJ, USA, 2018.
[29]
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2009.
[30]
Xingzhi Zhan. Matrix Inequalities. Springer, Berlin, Germany, 2002.
[31]
Fuzhen Zhang. The Schur Complement and its Applications. Springer, New York, NY, USA, 2005.
[32]
Wenbao Ai, Wei Liang, and Jianhua Yuan. On the tightness of an SDP relaxation for homogeneous QCQP with three real or four complex homogeneous constraints. Mathematical Programming, 211:5–48, 2025.
[33]
Arthur Albert. Regression and the Moore-Penrose Pseudoinverse. Academic Press, New York, NY, USA, 1972.
[34]
Luigi Malagò, Luigi Montrucchio, and Giovanni Pistone. Wasserstein Riemannian geometry of Gaussian densities. Information Geometry, 1:137–179, 2018.
[35]
Michael Grant, Stephen Boyd, and Yinyu Ye. CVX users’ guide, 2019. Online: http://www.stanford.edu/~boyd/software.html.