Differentially Private Wasserstein Barycenters

Anming Gu (UT Austin), Sasidhar Kunapuli (Independent Researcher), Mark Bun (Boston University), Edward Chien (Boston University), Kristjan Greenewald (MIT-IBM Watson AI Lab; IBM Research)

Abstract

The Wasserstein barycenter is defined as the mean of a set of probability measures under the optimal transport metric, and has numerous applications spanning machine learning, statistics, and computer graphics. In practice these input measures are empirical distributions built from sensitive datasets, motivating a differentially private (DP) treatment. We present, to our knowledge, the first algorithms for computing Wasserstein barycenters under differential privacy. Empirically, on synthetic data, MNIST, and large-scale U.S. population datasets, our methods produce high-quality private barycenters with strong accuracy-privacy tradeoffs.

1 Introduction

In the era of big data and machine learning, users are increasingly concerned about their privacy. Differential privacy (DP) [1] has seen widespread adoption to provide guarantees for user privacy. For example, government bureaus use DP when releasing census data [2], [3], and companies such as Apple [4], Microsoft [5], and LinkedIn [6] extensively employ DP when releasing data — aiming to protect user data from security threats.

Clustering, summarizing, and reducing the size of datasets are fundamental tasks in unsupervised machine learning. Many of these unsupervised learning problems are NP-hard [7], [8], leading to the development of polynomial-time approximation algorithms [9]–[14]. A long line of works [15]–[20] has further studied clustering under DP, providing polynomial-time algorithms with tight approximation bounds.

Defined as the mean of a set of probability measures under the optimal transport metric, the Wasserstein barycenter is a useful notion that contains many of these unsupervised tasks as special cases, with applications to a much more general suite of problems. Specific instances of the Wasserstein barycenter problem include centroids of probability measures [21] and \(k\)-means clustering [22]. Consequently, Wasserstein barycenters have seen extensive applications in domain adaptation [23], computer graphics [24], [25], and biology [26], [27].

Similar to clustering and other unsupervised learning problems mentioned above, privatization of Wasserstein barycenters becomes crucial when working with sensitive data. As one of many possible examples, suppose a company wishes to train and deploy machine learning models for analysis of sensitive data for each of many countries. Prior to training these models, to minimize the risk of privacy breaches, a machine learning engineer must design the model architecture and tune hyperparameters on a DP synthetic dataset (see [28]–[30] for more details on this emerging practice), which by the post-processing property of DP can be used and re-used ad infinitum without incurring any additional privacy loss. A private Wasserstein barycenter of a (sub)set of country-level datasets would be a natural candidate for this private synthetic dataset, as (1) it averages across many countries and hence should incur much less privacy cost than maintaining separate private synthetic datasets for each country, and (2) it approximately minimizes the Wasserstein distances to the true distributions, which should maximize the chance that the designed model architecture will work well when applied to each country’s data at deployment time.

More classically, recall that the Wasserstein barycenter minimizes the (weighted) average transport cost from the barycenter to each marginal. This can be an interesting optimization problem in its own right, e.g. choosing locations for distribution centers for multiple products each with its own geographic demand distribution, where each center has the same mix of products.

Motivated by these considerations, our work answers the following question in the affirmative:

Do there exist efficient algorithms for computing Wasserstein barycenters under DP?

Contributions

To the best of our knowledge, we provide the first algorithms for computing Wasserstein barycenters under the constraints of the central model [1] of DP. We work under the setting where each individual contributes one datapoint to one distribution.

Our main contributions are as follows.

  • We provide an efficient \(\varepsilon\)-DP algorithm using a black-box reduction from private Wasserstein distance coresets (Definition 8); see Theorem 4. Here, we form a private version of each distribution, and use these to solve the barycenter problem.

  • We provide a lightweight \((\varepsilon,\delta)\)-DP algorithm that works well when the data is clustered (Definition 9) and the number of support points in the barycenter is not too large; see Theorem 8. This method treats the output barycenter as a vector and uses the Gaussian mechanism for privacy.

  • We show the efficacy of our algorithms when applied to large-scale real-world sensitive data; see Figures 5 and 6.

2 Preliminaries

Notation

In big-O notation, we use \(\tilde{O}\) to hide logarithmic factors and subscripts to hide dependence on those variables. We use \(a\lesssim_p b\) to denote \(a=O_p(b)\), i.e., up to a constant that depends on \(p\). We use \([t]\) to denote the set \(\{1,\dots, t\}\). For a function \(T:\mathcal{X}\to\mathcal{Y}\) and a measure \(\mu\), we use \(T_\sharp\mu\) to denote the pushforward measure, i.e., \(T_\sharp\mu(B) = \mu(T^{-1}(B))\) for a measurable set \(B\subseteq \mathcal{Y}\). For a datapoint \(x\), we use \(\delta_x\) to denote a Dirac delta at \(x\). We reserve the Greek letter \(\xi\) to denote a failure probability. We use \(B_x(R) := \{y\mid \|x-y\|_2\le R\}\) to denote the closed Euclidean ball of radius \(R\) centered at point \(x\). (OT) We use the Greek letters \(\mu\) for the input distributions and \(\nu\) for the barycenter, both of which are probability measures. (DP) We use the script letters \(\mathcal{A}\) and \(\mathcal{D}\) to denote algorithms and datasets, respectively. We use \(\varepsilon,\delta\) for the privacy parameters of DP. (JL) We use the letter \(d\) to represent the dimension of the ambient space and \(d'\) to represent the dimension of the projected space from the JL transform. We use the Greek letter \(\gamma\) for the multiplicative factor in the JL lemma and \(\Pi\) for the projection matrix.

Differential privacy (DP) [1] is a mathematical framework for establishing guarantees on privacy loss of an algorithm, with nice properties such as degradation of privacy loss under composition and robustness to post-processing. We provide a brief introduction and refer to [31] for a thorough treatment.

Definition 1 (\((\varepsilon,\delta)\)-DP). Algorithm \(\mathcal{A}\) is said to satisfy \((\varepsilon,\delta)\)-differential privacy if for all adjacent datasets \(\mathcal{D}, \mathcal{D}'\) (datasets differing in at most one element) and all \(\mathcal{S} \subseteq \operatorname{range}\mathcal{A}\), it holds \[\Pr[\mathcal{A}(\mathcal{D})\in \mathcal{S}]\le e^\varepsilon\Pr[\mathcal{A}(\mathcal{D}')\in \mathcal{S}]+\delta.\] If \(\delta = 0\), we drop the dependence on \(\delta\) and say \(\mathcal{A}\) satisfies \(\varepsilon\)-differential privacy.

Let \((\mathcal{X}, \rho)\) be a metric space and let \(\mathcal{P}(\mathcal{X})\) be the set of Borel probability measures on \(\mathcal{X}\).

Definition 2 (Wasserstein distance). For \(p\in [1, \infty)\), the \(p\)-Wasserstein distance between probability measures \(\mu,\nu\in\mathcal{P}(\mathcal{X})\) is defined to be \[W_p(\mu,\nu):= \left(\inf_{\pi\in\Pi(\mu,\nu)}\int_{\mathcal{X}\times\mathcal{X}}\rho(x,y)^pd\pi(x,y)\right)^{1/p},\] where \(\Pi(\mu,\nu) := \{\pi \in \mathcal{P}(\mathcal{X}\times\mathcal{X})\mid (P_x)_\sharp \pi = \mu, (P_y)_\sharp \pi = \nu\}\) is the set of transport plans, and \(P_x(x,y) := x\) and \(P_y(x,y) := y\) are the projections onto the first and second coordinates, respectively.

We will be using the \(L_p\) metric for the cost function, i.e., \(\rho(x, y) := \|x - y\|_p\).
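
For intuition, the following is a minimal numpy sketch of this distance in the special case of two empirical measures on \(\mathbb{R}\), each with \(n\) equally weighted atoms, where the optimal plan simply matches the sorted supports; the function name and the toy check are illustrative.

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """p-Wasserstein distance between two empirical measures on the real line,
    each given by n equally weighted atoms. In one dimension the optimal plan
    matches sorted samples, so the distance is an order-statistics average."""
    x, y = np.sort(np.asarray(x, dtype=float)), np.sort(np.asarray(y, dtype=float))
    assert x.shape == y.shape, "equal numbers of equally weighted atoms assumed"
    return np.mean(np.abs(x - y) ** p) ** (1.0 / p)

# toy check: translating a measure by t gives W_p exactly |t|
rng = np.random.default_rng(0)
x = rng.normal(size=100)
print(wasserstein_1d(x, x + 0.3))  # ~0.3
```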

In Appendix 10, we recall some additional facts on differential privacy, optimal transport, and the Johnson-Lindenstrauss transform.

3 Problem statement

[32] introduced the notion of barycenters on Wasserstein space:

Definition 3 (Wasserstein barycenter). Given probability distributions \(\mu_1,\dots, \mu_k\in\mathcal{P}(\mathcal{X})\) and weights \(\beta_1,\dots, \beta_k > 0\), the \(p\)-Wasserstein barycenter is any distribution \(\nu^\ast\) satisfying \[\nu^\ast \in \underset{\nu\in\mathcal{P}(\mathcal{X})}{\arg\min}\sum_{i=1}^k \beta_i W_p^p(\mu_i, \nu).\label{eq:wb_objective}\qquad{(1)}\]

We will be working with discrete distributions, where each distribution can be thought of as a subpopulation, and one individual contributes sensitive data to one of the distributions. For instance each of these distributions could represent the data from one country.

Formally, we have \(k\) empirical distributions \(\mu_i\) for \(i \in [k]\), each with \(n\) point masses, where \[\label{eq:mu_i} \mu_i = \frac{1}{n}\sum_{j=1}^n \delta_{x_{i,j}}.\tag{1}\]

Our goal is to compute a distribution \(\nu\) consisting of exactly \(m\le n\) point masses with uniform weights that minimizes the barycenter objective in Definition 3 under the constraints of DP. For any application of DP, a definition of neighboring datasets is required. We use the following.

Definition 4 (Neighboring datasets). Let \(\{\mu_i\}_{i\in[k]}\) be a collection of empirical distributions. Let the dataset be \(\mathcal{D} := \{(x_{i,j},i)\mid x_{i,j}\in \mu_i, i\in [k]\}\). We say two datasets \(\mathcal{D},\mathcal{D}'\) are neighboring if they differ by exactly one element in the first coordinate, i.e., they differ by one point in one distribution.

This definition of neighboring datasets is motivated by viewing each \(\mu_i\) as a non-overlapping subpopulation, i.e., we are essentially assuming that no individual appears in more than one of the \(\mu_i\). Without loss of generality, we can assume all of the points are distinct. We will also abuse notation and identify \(\mathcal{D}\) with \(\{\mu_1,\dots, \mu_k\}\).

In order to limit the influence of any individual, we require an assumption on the support. For simplicity, and without loss of generality, we assume the support is contained in a ball of radius 1/2.

Assumption 1. It holds that \(\cup_{i\in[k]}\operatorname{supp} \mu_i\subseteq B_0(1/2)\).

This is a standard assumption in private clustering, e.g. see [20], and more generally in private convex optimization. This assumption is used to simplify the statement of the results, as it is well known that the additive error of any DP algorithm scales proportionally to the radius of the support of the dataset, e.g. see [33].

[34] utilizes the Johnson-Lindenstrauss transform to speed up algorithms for Wasserstein barycenters. We start with the following definition, adapted from Definition 2.1 of [34].

Definition 5 (Solution). Fix a candidate barycenter \(\nu\) supported on points \(\nu^{(1)}, \dots, \nu^{(m)}\). Define the solution \((\mathbf{S}, \mathbf{w}) := (S_1, \dots, S_m, w_1, \dots, w_m)\) as follows. Let \(w_j(x)\) denote the total weight transported from \(x\in \cup_{i=1}^k\operatorname{supp}\mu_i\) to point \(\nu^{(j)}\) based on the optimal transport plan.

Define the set \[S_j := \left\{x\in \mathcal{D}\,\bigg|\, w_j(x)>0\right\}.\]

We call \((\mathbf{S}, \mathbf{w})\) a solution because the following holds. For each \(j\in [m]\), \(\nu^{(j)}\) minimizes the objective \[\begin{align} \sum_{x\in S_j} w_j(x) \|x-\nu^{(j)}\|^p. \end{align}\]

Figure 1: Example of a solution. The input distributions are \(\mu_a := \frac{1}{2}\delta_{a_1} + \frac{1}{2}\delta_{a_2}\), \(\mu_b := \frac{1}{2}\delta_{b_1} + \frac{1}{2}\delta_{b_2}\), \(\mu_c := \frac{1}{2}\delta_{c_1} + \frac{1}{2}\delta_{c_2}\), and the candidate barycenter is \(\nu := \frac{1}{2}\delta_{\nu^{(1)}} + \frac{1}{2}\delta_{\nu^{(2)}}\). Observe that \(S_1 = \{a_1, b_1, c_1\}\), \(S_2 = \{a_2, b_2, c_2\}\), and \(w_1(a_1) = w_1(b_1) = w_1(c_1)= w_2(a_2) = w_2(b_2) = w_2(c_2) = 1\).

See Figure 1 for intuition on the definition of this solution. Notice that if we are given the weights \(w_j\), we can easily reconstruct the points \(\nu^{(j)}\) using convex optimization. We obtain these weights by solving a corresponding Wasserstein barycenter problem in the reduced space using any approximation algorithm. Note that, by conservation of mass, it holds that \(\sum_{j=1}^m w_j(x) = 1\).
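
As a concrete illustration of how the weights \(w_j(x)\) of Definition 5 can be extracted from discrete transport plans (in the spirit of Algorithm 7, though not the paper's exact pseudocode), the sketch below uses the POT library's exact solver `ot.emd`; the normalization so that each datapoint's weights sum to one follows the convention above, and all names are illustrative.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def solution_weights(mus, nu_support, p=2):
    """For each empirical measure mu_i (an n x d array of points) and a candidate
    barycenter support (m x d), compute w_j(x): the mass x sends to barycenter
    atom j under the optimal plan, normalized so that each row sums to 1."""
    m = nu_support.shape[0]
    b = np.full(m, 1.0 / m)                          # uniform barycenter weights
    weights = []
    for X in mus:
        a = np.full(len(X), 1.0 / len(X))            # uniform data weights
        M = ot.dist(X, nu_support, metric="euclidean") ** p
        plan = ot.emd(a, b, M)                       # n x m optimal transport plan
        weights.append(plan / plan.sum(axis=1, keepdims=True))
    return weights                                   # list of n x m arrays
```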

We constrain the optimization objective of Definition 3 as follows.

Assumption 2. Assume that the objective in Definition 3 has the added constraint that the solution has \(m\) equally weighted atoms, where \(m\) is specified. Specifically, the solution satisfies \(\nu = \frac{1}{m}\sum_{j=1}^m \delta_{\nu^{(j)}},\) where \(m \le n\), and \(\beta_i = \frac{1}{k}\).

We make two remarks on the uniform weight assumption:

  • From an interpretability perspective, uniform weights are a reasonable assumption, so that each atom of the barycenter can be considered as data representing one synthetic person.

  • From a computational perspective, many papers on Wasserstein barycenters a priori solve the problem under the uniform weight assumption, as optimizing weights for barycenters is much more challenging than optimizing for supports, e.g. see the discussion in [35].

We will use the following cost function for Wasserstein barycenters.

Definition 6 (Cost). For a solution \((\mathbf{S}, \mathbf{w})\), define its cost to be the value of the objective in Definition 3 when \(\nu\) is reconstructed from \((\mathbf{S}, \mathbf{w})\): \[\operatorname{cost}(\mathbf{S}) := \min_\nu \frac{1}{nk}\sum_{j=1}^m \sum_{x\in S_j}w_j(x) \|x-\nu^{(j)}\|^p. \label{eq:cost_of_soln}\qquad{(2)}\]

Similarly, for a projection \(\Pi\), define \(\operatorname{cost}(\Pi_\sharp \mathbf{S})\) to be the value of the same objective when we first project each distribution to \(\mathbb{R}^{d'}\) using \(\Pi\), then compute \(\tilde{\nu}\) using the original weights \(w_j\): \[\operatorname{cost}(\Pi_\sharp \mathbf{S}) := \min_{\tilde{\nu}} \frac{1}{nk}\sum_{j=1}^m \sum_{x\in S_j}w_j(x) \|\Pi x-\tilde{\nu}^{(j)}\|^p.\label{eq:cost_of_jl_soln}\qquad{(3)}\]

Above, note that we suppress the dependence on \(p\) for the cost. We use the following definition for approximate Wasserstein barycenters.

Definition 7 (Approximate Wasserstein barycenter). Let \(\mathsf{OPT}\) be the minimum of the objective in Definition 3 under Assumption 2, i.e., the cost of an optimal barycenter supported on \(m\) atoms with uniform weights. A \((z,t)\)-approximation for the \(p\)-Wasserstein barycenter is a probability measure \(\nu\) such that \[\operatorname{cost}(\nu)\le z \cdot \mathsf{OPT}_{(\mu_1,\dots,\mu_k)} + t.\] When it is clear from context, we suppress the dependence on the input distributions.

4 A private coreset approach

In this section, we use a black-box private coreset approach. In particular, we construct private distributions that are close in Wasserstein distance to the sensitive distributions. We use these coresets as input to the approximate Wasserstein barycenter algorithm and obtain the private barycenter via post-processing.

We start by introducing the notion of coresets for Wasserstein distance.

Definition 8 (Coreset for Wasserstein distance). A measure \(\mu'\) is a \((p, z, t)\)-coreset of \(\mu\) for the \(p\)-Wasserstein distance if for every \(\pi\in\mathcal{P}(\mathbb{R}^d)\), we have \(W_p(\mu',\pi)\le z\cdot W_p(\mu,\pi) + t\). When \(p\) is unambiguous, we drop the \(p\).

The following proposition is a direct consequence of the triangle inequality.

Proposition 1. If \(W_p(\mu,\mu')\le t\), then \(\mu'\) is a \((p, 1, t)\)-coreset of \(\mu\) for the \(p\)-Wasserstein distance problem.

Now our goal is to find a coreset for the \(p\)-Wasserstein distance problem. We use the algorithm from [36]. Informally, the algorithm works as follows. First, obtain a hierarchical binary partition of the space with \(\log (\varepsilon n)\) levels. Use the (discrete) Laplace mechanism to compute the number of points in each cell, with noise calibrated to the level. Then, independently of the data, choose points in each cell whose number matches the noisy count. The set of all of these points becomes the private data. For a full description of the algorithm, see Algorithm 4 of [36].
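
To convey the flavor of this construction (and not the full hierarchical algorithm of [36]), the sketch below privatizes a dataset lying in \([0,1]^d\) using a single-level grid: it perturbs the grid counts with the Laplace mechanism and re-samples points inside each cell independently of the data. The grid resolution and function names are illustrative, and, as noted in the comments, a faithful implementation would also treat empty cells.

```python
import numpy as np

def private_grid_coreset(X, eps, cells_per_dim=16, rng=None):
    """Single-level sketch of a private coreset on [0,1]^d: add Laplace noise to
    the counts of a fixed grid (each person contributes one point, so the counts
    have bounded sensitivity), then place the noisy number of synthetic points
    uniformly in each cell, independently of the data."""
    rng = rng or np.random.default_rng()
    n, d = X.shape
    idx = np.clip((X * cells_per_dim).astype(int), 0, cells_per_dim - 1)
    cells, counts = np.unique(idx, axis=0, return_counts=True)
    # NOTE: a faithful eps-DP mechanism must also noise empty cells (or use the
    # hierarchical decomposition of [36]); this sketch only noises occupied cells.
    noisy = np.maximum(np.rint(counts + rng.laplace(scale=2.0 / eps, size=counts.shape)), 0)
    synthetic = [
        (cell + rng.random((int(c), d))) / cells_per_dim
        for cell, c in zip(cells, noisy) if c > 0
    ]
    return np.concatenate(synthetic) if synthetic else np.empty((0, d))
```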

We remark that there exists a corresponding with-high-probability (w.h.p.) variant of the algorithm with the following guarantee.

Theorem 2. For every \(\varepsilon> 0\) and \(\xi\in(0,1)\), there exists an \(\varepsilon\)-DP algorithm running in time \(\tilde{O}(\varepsilon dn)\) that with probability \(1-\xi\), outputs a \[\left(p, 1,O_p\left(\left(\frac{1}{(\varepsilon n)^{1/d}}\cdot\operatorname{poly}\log\left(\frac{1}{ \xi}\right)\right)^{1/p}\right)\right)\]-coreset of size \(O(n\log(\varepsilon n))\) for the Wasserstein distance.

Proof. [36] provides an algorithm with guarantees in expectation for the \(W_1\) distance. Due to Lemma 6, this implies a similar guarantee for \(W_p\) distance. To obtain the w.h.p. algorithm, we run \(O\left(\log\frac{1}{\xi}\right)\) trials of the algorithm and use the exponential mechanism [37] to choose the best one, e.g. see Appendix D of [38] for an analogous argument. ◻

Our key technical lemma is the following result, which bounds the error incurred by using Wasserstein distance coresets in place of the true distributions:

Lemma 1. Let \(\mu_1,\dots, \mu_k\) be discrete probability measures and suppose \(\mu_1',\dots, \mu_k'\) are \((p, 1,t)\)-coresets for each \(\mu_i\), respectively. Then, \[\mathsf{OPT}_{(\mu_1',\dots, \mu_k')}\le \mathsf{OPT}_{(\mu_1,\dots,\mu_k)}+O_p(t^p).\]

Proof (sketch). This follows from Definition 8 and the triangle inequality (Lemma 5). See Appendix 12.1 for the full proof. ◻

[34] generalized the breakthrough work of [39] to show that reducing to \(O(\log n)\) dimensions suffices to preserve the \(p\)-Wasserstein cost of all solutions supported on at most \(n\) data points. Their main result is the following:

Theorem 3. Let \(\mu_1, \dots, \mu_k\) be discrete probability distributions on \(\mathbb{R}^d\) such that \(|{\operatorname{supp}\mu_i}|\le \operatorname{poly}(n)\) for all \(i \in [k]\). Let \(d\ge 1\), \(\gamma,\xi\in (0, 1)\). Let \(\Pi:\mathbb{R}^d\to \mathbb{R}^{d'}\) be an i.i.d. Gaussian JL map with \(d' = O\left(\frac{p^4}{\gamma^2}\log \frac{n}{\gamma\xi}\right)\). Then, with probability \(1 - \xi\), it holds that \[\operatorname{cost}(\mathbf{S}) \approx_{1+\gamma}\operatorname{cost}(\Pi_\sharp \mathbf{S})\] for all solutions \((\mathbf{S},\mathbf{w})\).

Above, for \(\gamma\ge 0\), we use \(a\approx_{1+\gamma}b\) to denote \(\frac{1}{1+\gamma}\le \frac{a}{b} \le 1 + \gamma\). We briefly remark that it is also possible to use the fast JL transform using \(d' = O\left(\frac{p^6}{\gamma^2}\log \frac{n}{\gamma\xi}\right)\); for details please refer to Appendix B of [34].
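
A Gaussian JL map of the kind used in Theorem 3 takes only a few lines; the \(1/\sqrt{d'}\) scaling makes it an isometry in expectation, and the concrete dimensions below are illustrative.

```python
import numpy as np

def gaussian_jl(d, d_prime, rng=None):
    """i.i.d. Gaussian JL map Pi: R^d -> R^{d'} with E||Pi x||^2 = ||x||^2."""
    rng = rng or np.random.default_rng()
    return rng.normal(size=(d_prime, d)) / np.sqrt(d_prime)

# Project every input distribution with the same map, as in Theorem 3:
# X_proj = X @ Pi.T for each n x d array X of support points.
Pi = gaussian_jl(d=784, d_prime=25)
```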

Our main result in this section is the following, whose proof we provide in Appendix 12.2.

Figure 2: \(\mathsf{CoresetBarycenter}^{\varepsilon}\)

Theorem 4. For any \(p\ge 1\), suppose that there exists a (not necessarily private) \((z, t)\)-approximation algorithm that runs in time \(2^{O(d)}\cdot \operatorname{poly}(n,k)\) for the \(p\)-Wasserstein barycenter problem. Then, for every \(\varepsilon>0\) and \(\gamma,\xi\in(0,1)\), there exists a polynomial-time \(\varepsilon\)-DP algorithm that outputs an \[\left(z(1+\gamma), \tilde{O}_{p,\gamma,z,\xi}\left(\frac{1}{(\varepsilon n)^{1/d}} + t\right)\right)\] -approximate \(p\)-Wasserstein barycenter, with probability \(1-\xi\).

Here, we suppress poly-log dependence on \(1/\xi\). Via dimensionality reduction, we can afford an algorithm with exponential dependence on the dimension, since \(d' = O(\log n)\). Unfortunately, many state-of-the-art additive approximation algorithms still do not yield polynomial runtime when combined with dimensionality reduction. For instance, the algorithm of [35] runs in time \((nk)^{O(d)}\).

The solution \((\mathbf{S}, \mathbf{w})\) from Definition 5 is computed via optimal transport plans between \(\hat{\mu}_i\) and \(\hat{\nu}\), i.e., the distributions in the low-dimensional space. Due to post-processing, the \(\hat{\mu}_i\) are private, so this computation incurs no additional privacy loss. We provide pseudocode in Algorithm 7. To recover the support points, we use empirical risk minimization (Algorithm 8).
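
For \(p = 2\) the empirical risk minimization step has a closed form: each \(\nu^{(j)}\) is the weighted mean of the points assigned to it. The sketch below is a stand-in for Algorithm 8 under that squared-Euclidean assumption, taking the solution weights computed in the projected space and the original high-dimensional points.

```python
import numpy as np

def recover_support(mus, weights, m):
    """Given the original points and the solution weights w_j(x) (computed in the
    low-dimensional space), recover each barycenter atom as the weighted mean of
    the points assigned to it (the ERM minimizer for p = 2)."""
    d = mus[0].shape[1]
    num = np.zeros((m, d))
    den = np.zeros(m)
    for X, W in zip(mus, weights):   # W is n x m with rows summing to 1
        num += W.T @ X               # accumulate weighted points per atom
        den += W.sum(axis=0)         # total weight assigned to each atom
    return num / den[:, None]        # m x d support of the barycenter
```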

5 An output perturbation approach

One issue with the method in the previous section is that, due to the curse of dimensionality, if \(d = \Omega(\log n)\) then the additive error becomes a constant. To address this, we can alternatively consider output perturbation.

However, a naive application of output perturbation only yields good utility if \(md \ll (\varepsilon k)^2\). The obstacle to an upper bound that benefits from increasing \(n\) is that, in a neighboring dataset, the couplings of all \(n\) points in the modified distribution could potentially change, so we only obtain an averaging effect over the \(k\) distributions, as opposed to over the \(\frac{nk}{m}\) points mapped to each point in the support of the barycenter; see Proposition 12.

Figure 3: \(\mathsf{OutputPerturbationBarycenter}^{\varepsilon,\delta}\)

We provide the pseudocode for this basic output perturbation method in Algorithm 3. Here, \(\mathsf{Approx}^\mathsf{JL}\) denotes the algorithm with dimensionality reduction, as in Section 4. Its guarantees are given in Theorem 5 below, and the proof is provided in Appendix 12.3.
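
A minimal sketch of the perturbation step itself, assuming the \(\ell_2\)-sensitivity bound \(\sqrt{m}/k\) established in the proof (Appendix 12.3) and Assumption 1; the non-private barycenter solver is treated as a black box and the names are illustrative.

```python
import numpy as np

def perturb_barycenter(nu_support, eps, delta, k, rng=None):
    """Gaussian mechanism on the m x d matrix of barycenter support points.
    Changing one datapoint moves each atom by at most 1/k in l2 norm, so the
    flattened output has l2 sensitivity sqrt(m)/k (see Appendix 12.3)."""
    rng = rng or np.random.default_rng()
    m, _ = nu_support.shape
    sensitivity = np.sqrt(m) / k
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return nu_support + rng.normal(scale=sigma, size=nu_support.shape)
```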

Theorem 5. For any \(p\ge 1\), suppose that there exists a (not necessarily private) \((z, t)\)-approximation algorithm that runs in time \(2^{O(d)}\cdot \operatorname{poly}(n,k)\) for the \(p\)-Wasserstein barycenter problem. Then, for every \(\varepsilon>0\) and \(\delta,\gamma,\xi \in (0, 1)\), there exists a polynomial-time \((\varepsilon,\delta)\)-DP algorithm that outputs an \[\left(z(1+\gamma), \tilde{O}_{p,\gamma,z,\xi}\left(\left(\frac{md\log (1/\delta)}{(\varepsilon k)^2}\right)^{p/2}\right) + t\right)\] -approximation for the \(p\)-Wasserstein barycenter problem, with probability \(1 - \xi\).

Fortunately, by splitting each distribution \(\mu_i\) into \(k'\) disjoint distributions \(\{\mu_{i,j}\}_{j\in[k']}\), we can obtain benefits from increasing \(n\); a sketch of this splitting step follows.
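
A sketch of this splitting step, under the simplifying assumption that each distribution is split uniformly at random (names illustrative):

```python
import numpy as np

def split_distribution(X, k_prime, rng=None):
    """Randomly partition the n points of one empirical distribution into k'
    disjoint sub-distributions; treating these as kk' inputs reduces the
    per-atom sensitivity from 1/k to 1/(kk') in the output perturbation step."""
    rng = rng or np.random.default_rng()
    perm = rng.permutation(len(X))
    return [X[idx] for idx in np.array_split(perm, k_prime)]
```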

Proposition 6. By splitting each distribution into \(k'\) disjoint distributions, in the same setting as Theorem 5, we have an \[\begin{align} &\bigg(z(1+\gamma), \tilde{O}_{p,\gamma,z,\xi}\bigg(\left(\frac{md\log (1/\delta)}{(\varepsilon kk')^2}\right)^{p/2} \\ &\qquad\qquad\qquad\qquad + \left(1 - \frac{1}{k'}\right) + t\bigg)\bigg) \end{align}\] -approximation for the \(p\)-Wasserstein barycenter problem, with probability \(1 - \xi\).

The proof and pseudocode are provided in Appendices 12.4 and 9, respectively. Note that this additive error is vacuous in the worst case, because the trivial solution \(\delta_0\) already has cost \(O(1)\). Nonetheless, the \(1-\frac{1}{k'}\) term reflects worst-case data; in real-world settings, data are clustered, e.g. see Figures 5 and 6. To make precise the definition of clustered distributions, we consider a slight modification of the definition of [40]–[42].

Definition 9 (\((m,\Delta, c)\)-approximately clusterable distribution). A distribution \(\mu\) is \((m,\Delta,c)\)-approximately clusterable if at least a \(1-c\) fraction of its mass lies in the union of \(m\) balls of radius at most \(\Delta\).

We prove an empirical rate of convergence bound for such distributions.

Proposition 7 (Informal, see Proposition 18). If \(\mu\) is \((m,\Delta, c\le \frac{1}{2})\)-approximately clusterable, then for all \(n \le m(2\Delta)^{-2p}\), letting \(\hat{\mu}_n = \frac{1}{n}\sum_{i\in[n]}\delta_{X_i}\) for \(X_i\sim \mu\) i.i.d., it holds \[\begin{align} &W_p^p(\mu, \hat{\mu}_n)\lesssim_{p} (1-c)\sqrt{\frac{m}{n}}+c + \sqrt{\frac{\log(1/\xi)}{n}}. \end{align}\] with probability \(1-\xi\).

Essentially, the first term on the right-hand side comes from the clustered portion [40], the second from the unclustered portion, and the last bounds the contribution from mismatched items. We provide the proof in Appendix 12.5.


Figure 4: Synthetic experiments testing sample size \(n\), privacy parameter \(\varepsilon\), and projection dimension \(d'\), averaged over 30 runs for the private coreset approach of Section 4. a — \(n = 50, 100, \dots, 1600\); b — \(\varepsilon= 2^{-4}, 2^{-2}, \dots, 2^3\); c — \(d' = 1, 2, \dots, 9\).

Using this proposition, we can prove the following theorem.

Theorem 8. Suppose every input distribution \(\{\mu_i\}_{i\in[k]}\) is \(\left(O(m), \Delta, c\le \frac{1}{2}\right)\)-approximately clusterable and \(\frac{n}{k'}\le O(m)\cdot (2\Delta)^{-2p}\). For any \(p\ge 1\), suppose that there exists a (not necessarily private) \((z, t)\)-approximation algorithm that runs in time \(2^{O(d)}\cdot \operatorname{poly}(n,k)\) for the \(p\)-Wasserstein barycenter problem. Then, for every \(\varepsilon>0\) and \(\delta,\gamma,\xi \in (0, 1)\), there exists an \((\varepsilon,\delta)\)-DP algorithm that outputs a \[\begin{align} & \bigg(z(1+\gamma), \tilde{O}_{p,\gamma,z,\xi}\bigg(\left(\frac{md\log (1/\delta)}{(\varepsilon kk')^2}\right)^{p/2} \\ &\qquad\qquad\qquad\qquad\quad+{}\sqrt{\frac{mk'}{n}}+c + t\bigg)\bigg) \end{align}\]

-approximation for the \(p\)-Wasserstein barycenter problem, with probability \(1-\xi\).

Here, the algorithm is the same as that of Proposition 6, Algorithm 9; the proof is provided in Appendix 12.6. Setting the first two additive terms equal (and suppressing dependence on \(\delta\)), we should set \(k'\) to be \[k'_* = \Theta\left(n^{\frac{1}{2p+1}}m^{\frac{p-1}{2p+1}}d^{\frac{p}{2p+1}}(\varepsilon k)^{-\frac{2p}{2p+1}}\right).\] Using this \(k'_*\), and suppressing poly-log dependence on \(\delta\) and \(\xi\), this yields an excess asymptotic additive error of \[m^\frac{3p}{4p+2}d^{\frac{p}{4p+2}}(\varepsilon kn)^{-\frac{p}{2p+1}} + c.\] Thus, we will have good utility if \(m^3d\ll (\varepsilon kn)^2\) and \(c\) is sufficiently small.
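
As a worked example of this tradeoff, ignoring constants and logarithmic factors as in the \(\Theta(\cdot)\) above, one can evaluate \(k'_*\) and the resulting additive error numerically; the parameter values below are illustrative.

```python
def k_prime_star(n, m, d, eps, k, p=2):
    """Balance the first two additive error terms of Theorem 8 (up to constants)."""
    return (n ** (1 / (2 * p + 1)) * m ** ((p - 1) / (2 * p + 1))
            * d ** (p / (2 * p + 1)) * (eps * k) ** (-2 * p / (2 * p + 1)))

def excess_error(n, m, d, eps, k, p=2):
    """Excess additive error m^{3p/(4p+2)} d^{p/(4p+2)} (eps k n)^{-p/(2p+1)}."""
    return (m ** (3 * p / (4 * p + 2)) * d ** (p / (4 * p + 2))
            * (eps * k * n) ** (-p / (2 * p + 1)))

# e.g. m = 48 atoms, d = 2, k = 1 distribution, eps = 1, n = 1e5 points
print(k_prime_star(1e5, 48, 2, 1.0, 1), excess_error(1e5, 48, 2, 1.0, 1))
```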

Under the assumption that the data is approximately clustered and \(c\) is small, e.g. \(O(n^{-\frac{1}{2}})\), we should always pick this approach over the private coreset approach for small to moderate regimes of \(m\). Further, observe that this method does not suffer from the curse of dimensionality as the scaling is with respect to \(\approx (nk)^{-\frac{1}{2}}\) as opposed to \(n^{-\frac{1}{d}}\) in the coreset-based approach.

6 Experiments


Figure 5: Barycenters on continental US populations. a — \(n = 200000\) and \(\varepsilon= 1\) (and \(\delta = \frac{1}{n}\)) for \(m = 48\) and \(k = 1\). Denoting \(\nu,\nu_\mathsf{core},\nu_\mathsf{pert}\) as the non-private, private coreset-based, and output-perturbation barycenters, respectively, we have \(\mathrm{cost}(\nu) = 15.92\), \(\mathrm{cost}(\nu_\mathsf{core}) = 21.62\), \(\mathrm{cost}(\nu_\mathsf{pert}) = 16.031\) (squared degrees longitude/latitude), and \(W_2(\nu,\nu_\mathsf{core}) = 5.633\), \(W_2(\nu,\nu_\mathsf{pert}) = 2.665\) (degrees). b — \(n = 100000\) and \(\varepsilon= 1\) (and \(\delta=\frac{1}{n}\)) for \(m = 48\) and \(k = 4\) (self-reported White, Asian, Black, Hispanic). Denoting \(\nu,\nu_\mathsf{core},\nu_\mathsf{pert}\) as the non-private, private coreset-based, and output-perturbation barycenters, respectively, we have \(\mathrm{cost}(\nu) = 5018.618\), \(\mathrm{cost}(\nu_\mathsf{core}) = 6770.353\), \(\mathrm{cost}(\nu_\mathsf{pert}) = 4924.467\) (squared degrees longitude/latitude), and \(W_2(\nu,\nu_\mathsf{core}) = 11.978\), \(W_2(\nu,\nu_\mathsf{pert}) = 1.622\) (degrees).

Figure 6: \(n = 2000, 4000, \dots, 128000\) and \(\varepsilon= 1\) (and \(\delta = \frac{1}{n}\)) in the same experimental setup as Figure 5 (a), averaged over 10 trials. On the left, we have cost in squared degrees. On the right, we plot the 2-Wasserstein distance between the private and non-private barycenters (in degrees).

We test our method from Section 4 on simple synthetic data and MNIST, and both methods on the US population data. We provide additional experiments and discussion of the setup in Appendix 13. All of our experiments use the Sinkhorn free-support barycenter solver [43] with \(50\) outer iterations and \(100\) inner (Sinkhorn) iterations. We subsample the private coresets, which increases the cost by a negligible amount but significantly improves the runtime (see Figure 12 (a) on MNIST in Appendix 13).
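
For completeness, the following is a minimal free-support fixed-point sketch in the spirit of the solver we use (our experiments rely on the POT implementation of [43]); it alternates between entropic transport plans from the current support to each input measure and a barycentric update of the support, assuming squared Euclidean cost and uniform weights.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def free_support_barycenter(mus, m, reg=0.04, iters=50, sinkhorn_iters=100, rng=None):
    """Fixed-point iteration for a uniform-weight free-support barycenter (p = 2):
    compute a Sinkhorn plan from the current support to each mu_i, then move
    each atom to the average of its barycentric projections."""
    rng = rng or np.random.default_rng()
    d = mus[0].shape[1]
    Y = rng.normal(scale=0.1, size=(m, d))       # initial support
    b = np.full(m, 1.0 / m)
    for _ in range(iters):
        update = np.zeros_like(Y)
        for X in mus:
            a = np.full(len(X), 1.0 / len(X))
            M = ot.dist(Y, X)                    # squared Euclidean costs
            plan = ot.sinkhorn(b, a, M, reg, numItermax=sinkhorn_iters)
            update += (plan @ X) / plan.sum(axis=1, keepdims=True)
        Y = update / len(mus)
    return Y
```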

In Figure 4, we consider equally weighted mixtures of 4 Gaussians with means at \((\pm0.25, \pm0.25, 0, \cdots, 0)\in\mathbb{R}^{10}\). We use \(m = 8\) and entropic regularization \(0.04\). We fix \(n = 1000\), \(\varepsilon= 1\), and \(d' = 5\) (when they are not varied). We observe that using JL provides better utility (for small \(n\) or \(\varepsilon\)) under DP, as otherwise the algorithm tends to get stuck in local minima.

In Figure 5, we consider the experimental setup from [44] (\(m = 48\), \(k = 1\)) and use US population data from the 2015 American Community Survey. In Figure 5 (a), we take \(k = 1\), where the dataset is the whole US population. We take the (sensitive) data to be multisets of the centers of census tracts (chosen with replacement) of size \(n = 200000\). In Figure 5 (b), we take the sensitive data to be \(n = 100000\) points chosen uniformly at random for each of the self-reported racial groups White, Asian, Black, and Hispanic, where for privacy we assume the groups are disjoint.

In our private coreset construction, we only sample points that are inside the US border (which is private by data-independent post-processing). Our algorithms use \(\varepsilon= 1\) on the full population (or subgroup), utilizing privacy amplification by subsampling [45]. Each barycenter computation only takes a few minutes to run on CPU; however, the sampling of points inside the US takes a few hours for the largest experiments. In this experiment, we use entropic regularization of \(0.001\) and do not use dimensionality reduction.

For our output perturbation algorithms, we always take \(k'=1000\), and we use \((\varepsilon=1, \delta = \frac{1}{n})\)-DP on the full population, again using privacy amplification by subsampling.

In Figure 6, we report results for the \(k = 1\) setting of the US population experiments, as the non-private cost in the \(k \ge 2\) setting is already very large. In Appendix 13, we provide additional experiments with \(\varepsilon=5\). We empirically observe that, due to the clustered nature of the US population, subsampling becomes competitive with the coreset construction when each subdistribution has just 32 datapoints! For larger \(n\), we hypothesize that the smaller cost for the subsampling \(k'\) compared to the full non-private barycenter is a result of optimization issues. As the (full) distribution consists of multisets of points, it is more likely for optimization issues to occur (e.g., note that the average error for the non-private barycenter is increasing), and subsampling into multiple distributions makes the optimization more stable.

7 Conclusion

We extended the study of private facility location problems from clustering to Wasserstein barycenters. One limitation of Algorithm 2 is the curse of dimensionality; future work could study the setting where the data lies near a low-dimensional subspace [40] and alleviate the curse of dimensionality via privatized versions of entropic OT [46], [47] or Gaussian-smoothed OT [48]–[51]. In this work, we studied Wasserstein barycenters under the central model of differential privacy; for future work, it would also be interesting to obtain results under the local [52] and shuffle [53]–[55] models. Furthermore, we focused on the setting where one individual contributes a single datapoint to one of the \(k\) distributions. An interesting direction would be to consider the setting where one person contributes a whole probability measure, as this would allow practitioners to consider continuous distributions as input data.

8 Related work

In the theoretical computer science community, the Wasserstein barycenter falls under the category of facility location problems. This class of problems is concerned with placing points, or “facilities,” to minimize some objective given a set of input data. Note that clustering also falls under this category. Clustering [56] has seen many non-private approximation algorithms. Over the past few decades, a line of works [9]–[14] has pushed the multiplicative approximation factors to \(2.406\) and \(5.912\) for Euclidean \(k\)-medians and \(k\)-means, respectively [57].

[15] initiated the study of facility location algorithms under DP, and provided an inefficient algorithm based on the exponential mechanism [37] that gave a constant-factor multiplicative approximation. Then a series of works [16]–[20] culminated in polynomial-time algorithms for private clustering with the optimal multiplicative approximation ratio and small additive errors.

On the other hand, the Wasserstein barycenter is a much more nascent problem. Initial works provide approximations using methods such as entropic regularization [44], [58], iterative Bregman projections [59], or stochastic optimization [60]; however, these lack worst-case guarantees on the approximations, for instance with respect to the non-entropic objective. Even theoretical guarantees for fast approximations of Wasserstein distances are recent [61], [62]. More recently, some works have provided theoretical guarantees for Wasserstein barycenters in the \(p = 2\) setting: [35], [63] showed that additive and multiplicative (respectively) approximations for Wasserstein barycenters can be computed in polynomial time for constant dimension. Recently, [36], [64] provided constructions for private measures that are close to input empirical measures over \([0,1]^d\) in 1-Wasserstein distance, and [65] provided instance-optimal constructions for finite metric spaces.

The Johnson-Lindenstrauss (JL) lemma [66] is a dimensionality reduction method that provides worst-case guarantees on preserving pairwise distances between a collection of points. It has been applied to numerous problems in many areas of computer science, including streaming algorithms [67], [68] and DP [69], [70]. The (lower) bound on the dimension required to approximately preserve solutions varies from problem to problem, e.g. see [71], [72] for a discussion. For facility location problems, [39] showed that dimension \(d' = O(\log k)\) suffices for preserving the cost of solutions to \(k\)-means clustering, and [34] showed that dimension \(d' = O(\log n)\) suffices for Wasserstein barycenters supported on \(\le n\) points.

9 Deferred algorithms

Figure 7: \(\mathsf{SolutionWeights}\)
Figure 8: \(\mathsf{SupportPoints}\)
Figure 9: \(\mathsf{OutputPerturbationBarycenterSubsampled}^{\varepsilon,\delta}\)

10 Additional preliminaries and lemmata

10.1 Differential privacy

Lemma 2 (Parallel composition). Let \(\mathcal{A}_1,\dots, \mathcal{A}_k\) be \(\varepsilon\)-DP algorithms. Suppose \(\mathcal{D} = S_1\cup \cdots \cup S_k\), where \(S_i \cap S_j = \emptyset\) for every \(i \neq j\). Then \((\mathcal{A}_1(S_1),\dots, \mathcal{A}_k(S_k))\) is \(\varepsilon\)-DP.

A nice property of differential privacy is the post-processing property, which informally says that transforming private output does not incur additional privacy loss. Formally, we have the following:

Lemma 3 (Post-processing). Let \(\mathcal{A}\) be an \(\varepsilon\)-DP algorithm. Then for any (possibly randomized) algorithm \(g\), \(g\circ \mathcal{A}(\mathcal{D})\) is \(\varepsilon\)-DP.

Definition 10 (\(\ell_p\)-sensitivity). We define the \(\ell_p\)-sensitivity of a function \(f\) to be \[\Delta_p f:= \max_{\mathcal{D}, \mathcal{D}'}\|f(\mathcal{D}) - f(\mathcal{D}')\|_p,\] where \(\mathcal{D}, \mathcal{D}'\) are adjacent datasets.

Lemma 4 (Gaussian mechanism). Let \(f\) be a function, \(\varepsilon,\delta \in (0, 1)\), and \(\sigma^2 \ge \left(\Delta_2f\right)^2\cdot\frac{2\log(1.25/\delta)}{\varepsilon^2}\). The Gaussian mechanism \(f(\mathcal{D})+\mathcal{N}(0, \sigma^2)\) is \((\varepsilon,\delta)\)-DP.

10.2 Optimal transport

It can easily be checked that indeed the Wasserstein distance is a metric: in particular, the triangle inequality holds.

Lemma 5 ([73], Lemma 5.4). For any \(p \ge 1, \mu,\nu,\pi\in \mathcal{P}_p(\mathcal{X})\), we have \(W_p(\mu,\pi) \le W_p(\mu,\nu)+ W_p(\nu,\pi)\).

For bounded spaces, we can bound the \(p\)-Wasserstein distance by the \(1\)-Wasserstein distance:

Lemma 6 ([73]). Let \(\mathcal{X}\) be bounded. Then for any \(p \ge 1, \mu,\nu\in \mathcal{P}_p(\mathcal{X})\), we have \(W_p(\mu,\nu) \le \operatorname{diam}(\mathcal{X})^{(p-1)/p}W_1(\mu,\nu)^{1/p}\).

10.3 JL transform

A JL transform is any (linear) map that satisfies the JL lemma:

Theorem 9 (Johnson-Lindenstrauss lemma, [66]). Given an accuracy parameter \(0 < \gamma < 1\), a set of \(n\) points \(X\) in \(\mathbb{R}^d\), and the projection dimension \(d' = O(\log n /\gamma^2)\), there exists a linear map \(\Pi:\mathbb{R}^d\to \mathbb{R}^{d'}\) such that all pairwise distances are preserved within factor \((1\pm \gamma)\), i.e., it holds \[\frac{1}{1+\gamma}\|x-y\|\le \|\Pi x-\Pi y\|\le (1+\gamma)\|x-y\|\] for every \(x,y\in X\), with probability \(1 - 1/{\operatorname{poly}(n)}\).

In other words, the JL transform reduces the dimensionality of the data from \(d\) to \(d'\) while approximately preserving all pairwise distances (w.h.p.). Note that this worst-case guarantee is the main strength of the JL approach; other dimensionality reduction techniques, such as principal component analysis, typically do not have this guarantee.

10.4 Useful lemmata

Lemma 7. Let \(\mu\in\mathcal{P}_p(\mathbb{R}^d)\) and \(\mu_\sigma := \mu * \mathcal{N}(0, \sigma^2I_d)\). Then \[W_p(\mu, \mu_\sigma)\lesssim \sigma\left(\sqrt{d} + \sqrt{2p}\right).\]

Proof. By definition of Wasserstein distance, we have \[W_p(\mu,\mu_\sigma) \le \left(\mathbb{E}\|X - (X+\sigma Z)\|^p\right)^{1/p} = \sigma \left(\mathbb{E}\|Z\|^p\right)^{1/p},\] where \(X\sim \mu\) and \(Z\sim\mathcal{N}(0, I_d)\). We have \[\begin{align} \mathbb{E}\|Z\|^p &= \mathbb{E}\left(\|Z\|^2\right)^{p/2} = \mathbb{E}[Y^{p/2}], \end{align}\] where \(Y := \|Z\|^2\sim \chi_d^2\) (chi-squared with \(d\) degrees of freedom). Using [74], it holds \[\begin{align} \Pr[Y - d \ge 2 \sqrt{dt}+2t]&\le \exp(-t)\\ \Pr[d - Y \ge 2\sqrt{dt}]&\le \exp(-t). \end{align}\] Applying Theorem 2.3 of [75] yields the desired result. ◻

Lemma 8. For every \(p\ge 1\), if \(0\le a, b \le 1\), then it holds \[(a + b)^p \le a^p + p(a + b)^{p-1}b \le a^p + p2^{p-1}b.\]

Proof. Let \(g(t) = (a+t)^p\) for \(t\in[0,b]\). Observe that \(g\) is convex as \(g''(t) = p(p-1)(a+t)^{p-2}\ge0\). Thus, by convexity, we have \[(a+b)^p - a^p = g(b) - g(0)\le g'(b)b = p(a+b)^{p-1}b.\] Rearranging yields the first inequality. The second inequality is immediate by \(0\le a,b\le 1\). This concludes the proof. ◻

Lemma 9 (Hoeffding’s inequality). Let \(X_1,\dots, X_n\) be independent random variables such that \(X_i \in [a_i,b_i]\) a.s. Let \(S_n := \sum_{i\in[n]}X_i\). Then \[\Pr[|S_n-\mathbb{E}[S_n]|\ge t]\le 2\exp\left(-\frac{2t^2}{\sum_{i\in[n]}(b_i-a_i)^2}\right).\]

11 Comparing clustering and Wasserstein barycenters

11.1 Clustering

For completeness, we provide a brief discussion of the construction for private clusterings. We start by formalizing the problem statement for clustering:

Definition 11 (\((k,p)\)-clustering). Given \(k\in\mathbb{N}\) and a dataset \(\mathbf{X}= \{x_1,\dots, x_n\}\) in \(B_0(1/2)\), we want to find \(k\) centers \(c_1,\dots, c_k\in\mathbb{R}^d\) that minimize \[\mathrm{cost}_{\mathbf{X}}(c_1,\dots, c_k) := \sum_{i\in[n]}\left(\min_{j\in[k]}\|x_i - c_j\|\right)^p. \label{eq:clustering_cost}\qquad{(4)}\]

The optimal cost is denoted as \(\mathsf{OPT}\), where we suppress the dependence on \(k,p,\mathbf{X}\).

Definition 12 (Approximation for clustering). A \((w,t)\)-approximation algorithm for \((k,p)\)-clustering outputs \(c_1,\dots, c_k\) such that \(\mathrm{cost}(c_1,\dots, c_k) \le w\cdot \mathsf{OPT}+ t\).

[20] showed the following:

Theorem 10. For any \(p\ge 1\), suppose that there exists a polynomial time (not necessarily private) \((w, 0)\)-approximation algorithm for the \((k,p)\)-clustering problem. Then, for every \(\varepsilon>0\) and \(\delta,\gamma,\xi\in(0,1)\), there exists an \((\varepsilon,\delta)\)-DP algorithm that runs in \(\left(\frac{k}{\xi}\right)^{O_{p,\gamma}(1)}\cdot \operatorname{poly}(nd)\) time and outputs an \[\left(w(1+\gamma), O_{p,\gamma,w}\left(\frac{k\sqrt{d}}{\varepsilon}\cdot \operatorname{poly}\log\frac{k}{\delta\gamma} + \frac{(k/\gamma)^{O_{p,\gamma}(1)}}{\varepsilon}\cdot \operatorname{poly}\log\frac{n}{\gamma}\right)\right)\] -approximation for \((k,p)\)-clustering, with probability \(1-\xi\).

The first term in the additive error comes from computing the centers in high dimension. The second term comes from bounding the error in low dimension.

The construction of private \((k,p)\)-clustering in low dimension works as follows. Under DP, we start by finding a centroid set of size \(O(k\log n)\) with small multiplicative error. Then, we create a private coreset by “snapping” all the input data to their nearest point in the centroid set and use the (discrete) Laplace mechanism to privatize the counts.
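
A sketch of the snapping step, assuming a candidate centroid set is already available (obtaining the centroid set privately is the harder part of the construction and is omitted here); the sensitivity accounting in the comments matches the swap notion of adjacency used in this paper, and the names are illustrative.

```python
import numpy as np

def snap_coreset(X, centroids, eps, rng=None):
    """Snap each datapoint to its nearest candidate centroid and release the
    Laplace-noised counts, yielding a weighted private coreset for clustering.
    Replacing one datapoint changes at most two counts by 1 each, so the count
    histogram has l1 sensitivity 2 and Laplace scale 2/eps suffices for eps-DP."""
    rng = rng or np.random.default_rng()
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)
    counts = np.bincount(nearest, minlength=len(centroids)).astype(float)
    counts += rng.laplace(scale=2.0 / eps, size=counts.shape)
    counts = np.maximum(counts, 0.0)             # post-processing to valid weights
    return centroids, counts                     # weighted private coreset
```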

11.2 Stability

The \((k,p)\)-clustering objective is stable in the following sense: suppose we fix a dataset \(\mathbf{X}\) and \(k\) centers. If we move one datapoint and update the \(k\) centers, a large fraction of the points will remain clustered together. This fact is key to the accuracy of private clustering algorithms.

Due to the stability of the clustering objective, the snapping procedure above incurs small additive error and is easy to reason about. On the other hand, optimal transport plans are highly unstable:

Proposition 11. Let \(n\in\mathbb{N}\). Fix a distribution \(\nu\) supported on \(n\) atoms with uniform weights. There exists a distribution over \(\mathbb{R}\) such that moving one datapoint by \(O\left(\frac{1}{n}\right)\) changes the mapping for every datapoint.

Proof. We construct \(\mu_n\) with support over \(\left\{\frac{k}{n}\right\}_{k\in[n]}\). Recall that the optimal transport plan in one dimension can be computed from the cumulative distribution function, e.g. see [73]. In the setting of the proposition, this will just be based on the order of the datapoints. Thus moving the particle at \(\frac{1}{n}\) to \(\frac{2 + \varepsilon}{n}\) for any \(\varepsilon> 0\) yields the desired result. ◻

Remark 1. The distribution in Proposition 11 also shows that using the private coreset construction from \(k\)-means clustering for Wasserstein distances requires taking \(k = \tilde{\Omega}(n)\) to achieve small snapping error, whereas we should usually think of \(k = \tilde{O}(1)\).

Proposition 12. There exists a distribution over \(\mathbb{R}^2\) such that changing one point by \(O(1)\) causes every point in the support of the output barycenter to move by \(\Omega(1)\).

Proof. See Figure 10. ◻


Figure 10: Unperturbed data is uniform over \(\mathbb{S}^1\). Here, the averages of any two disjoint half-arcs yield an optimal barycenter. However, with a bad initialization, each point in the support of the output distribution can move \(\Omega(1)\) as \(\Omega(n)\) of the couplings change. a — Non-optimal barycenter. b — Optimal barycenter.

12 Deferred proofs

12.1 Proof of Lemma 1

Lemma 10 (Lemma 1, restated). Let \(\mu_1,\dots, \mu_k\) be discrete probability measures and suppose \(\mu_1',\dots, \mu_k'\) are \((p, 1,t)\)-coresets for each \(\mu_i\), respectively. Then, \[\mathsf{OPT}_{(\mu_1',\dots, \mu_k')}\le \mathsf{OPT}_{(\mu_1,\dots,\mu_k)}+O_p(t^p).\]

Proof. Consider a candidate barycenter \(\nu\). Using the definition of the cost (Definition 6), we have \[\begin{align} \mathrm{cost}_{\mu_1',\dots, \mu_k'}(\nu) &= \frac{1}{k}\sum_{i=1}^k W_p^p(\mu_i',\nu)\notag \\ &\le \frac{1}{k}\sum_{i=1}^k\left(W_p(\mu_i',\mu_i) + W_p(\mu_i,\nu)\right)^p\tag{2}\\ &\le \frac{1}{k}\sum_{i=1}^k W_p^p(\mu_i,\nu) + \frac{p2^{p-1}}{k}\sum_{i=1}^k W_p^p(\mu_i',\mu_i)\tag{3}\\ &= \mathrm{cost}_{\mu_1,\dots, \mu_k}(\nu) + \frac{p2^{p-1}}{k}\sum_{i=1}^k W_p^p(\mu_i',\mu_i)\tag{4}\\ &\le \mathrm{cost}_{\mu_1,\dots, \mu_k}(\nu) + \frac{p2^{p-1}}{k}\sum_{i=1}^k t^p,\tag{5}\\ &= \mathrm{cost}_{\mu_1,\dots, \mu_k}(\nu) + O_p(t^p)\notag \end{align}\] where 2 follows from the triangle inequality, 3 follows from Lemma 8, 4 follows from the definition of the cost (Definition 6), and 5 follows from Definition 8. Note that applying Lemma 8 uses Assumption 1. This concludes the proof. ◻

12.2 Proof of Theorem 4

Theorem 13 (Theorem 4, restated). For any \(p\ge 1\), suppose that there exists a (not necessarily private) \((z, t)\)-approximation algorithm that runs in time \(2^{O(d)}\cdot \operatorname{poly}(n,k)\) for the \(p\)-Wasserstein barycenter problem. Then, for every \(\varepsilon>0\) and \(\gamma,\xi\in(0,1)\), there exists a polynomial-time \(\varepsilon\)-DP algorithm that outputs an \[\left(z(1+\gamma), O_{p,\gamma,z}\left(\frac{1}{(\varepsilon n)^{1/d}}\cdot\operatorname{poly}\log\left(\frac{k}{ \xi}\right) + t\right)\right)\] -approximate \(p\)-Wasserstein barycenter, with probability \(1-\xi\).

Proof. Let \(d'\) be as in Theorem 3.

(Runtime) It suffices to bound the runtime of computing the barycenter in low dimensions as it is clear that the pre- and post-processing steps run in polynomial time. With the given \(d'\), we have that \(2^{O(d')}\cdot \operatorname{poly}(n,k) = \operatorname{poly}(n)\cdot \operatorname{poly}(n,k) = \operatorname{poly}(n,k)\), as desired.

(Privacy) The privacy follows from Lemmas 2 and 3 as the input distributions are disjoint.

(Utility) For each \(\mu_i\), we invoke Theorem 2 with failure probability \(\frac{\xi}{2k}\). Then by a union bound, with probability \(1 - \frac{\xi}{2}\), \(\mu_i'\) is a \(\left(1,O_p\left(\left(\frac{1}{(\varepsilon n)^{1/d}}\cdot\operatorname{poly}\log\left(\frac{k}{ \xi}\right)\right)^{1/p}\right)\right)\)-coreset for \(\mu_i\), for each \(i \in [k]\). Assuming this event holds, Lemma 1 implies \[\mathsf{OPT}_{(\mu_1',\dots, \mu_k')}\le \mathsf{OPT}_{(\mu_1,\dots, \mu_k)} + O_p\left(\frac{1}{(\varepsilon n)^{1/d}}\cdot\operatorname{poly}\log\left(\frac{k}{ \xi}\right)\right).\label{eq:c2i}\tag{6}\] Let \(\nu\) be the output of the algorithm. Now we apply Theorem 3, along with the guarantee of the not necessarily private approximation algorithm, which implies with probability \(1 - \frac{\xi}{2}\), \[\begin{align} \mathrm{cost}_{(\mu_1',\dots,\mu_k')}(\nu) &\le z(1+\gamma)\mathsf{OPT}_{(\mu_1',\dots,\mu_k')} + O_{p, \gamma,z}(t)\label{eq:c2ii} \end{align}\tag{7}\] By a union bound, 6 and 7 both occur with probability \(1 - \xi\). When this is the case, we deduce \[\begin{align} \operatorname{cost}_{(\mu_1',\dots, \mu_k')}(\nu) \le z(1+\gamma)\mathsf{OPT}_{(\mu_1,\dots, \mu_k)} + O_{p,\gamma,z}\left(\frac{1}{(\varepsilon n)^{1/d}}\cdot\operatorname{poly}\log\left(\frac{k}{ \xi}\right) + t\right), \end{align}\] which concludes the proof. ◻

12.3 Proof of Theorem 5

Theorem 14 (Theorem 5, restated). For any \(p\ge 1\), suppose that there exists a (not necessarily private) \((z, t)\)-approximation algorithm that runs in time \(2^{O(d)}\cdot \operatorname{poly}(n,k)\) for the \(p\)-Wasserstein barycenter problem. Then, for every \(\varepsilon>0\) and \(\delta,\gamma,\xi \in (0, 1)\), there exists a polynomial-time \((\varepsilon,\delta)\)-DP algorithm that outputs an \[\left(z(1+\gamma), \tilde{O}_{p,\gamma,z,\xi}\left(\left(\frac{md\log (1/\delta)}{(\varepsilon k)^2}\right)^{p/2}\right) + t\right)\] -approximation for the \(p\)-Wasserstein barycenter problem, with probability \(1 - \xi\).

Proof. We prove the privacy and utility separately.

(Privacy) Consider the \(\ell_2\) sensitivity of the algorithm, which is a function \(\mathcal{X}^{nk}\to \mathbb{R}^{m\times d}\). If we change one datapoint, the couplings of up to \(n\) elements could potentially change, namely all of those in the subpopulation. By normalization and Assumption 1, this implies \(\|\nu^{(j)} - \nu'^{(j)}\|_2\le \frac{1}{k}\) for each \(j\in[m]\). Thus, the \(\ell_2\) sensitivity of the output is \[\|\nu_1\circ\cdots\circ \nu_m - \nu_1'\circ\cdots\circ\nu_m'\|_2 = \left(\sum_{j\in[m]}\|\nu_j-\nu_j'\|^2\right)^{1/2} \le \left(m\left(\frac{1}{k}\right)^2\right)^{1/2} = \frac{\sqrt{m}}{k},\] where \(\nu_1\circ\cdots \circ \nu_m\in\mathbb{R}^{md}\) is vector-concatenation. Privacy follows by the guarantees of the Gaussian mechanism (Lemma 4).

(Utility) We have \[\begin{align} \mathrm{cost}_{\mu_1,\dots, \mu_k}(\tilde{\nu}) &= \frac{1}{k}\sum_{i = 1}^k W_p^p(\mu_i, \tilde{\nu})\notag\\ &\le \frac{1}{k}\sum_{i=1}^k(W_p(\mu_i,\nu) + W_p(\nu, \tilde{\nu}))^p\tag{8}\\ &\le \frac{1}{k}\sum_{i=1}^kW_p^p(\mu_i,\nu) + p2^{p-1}W_p^p(\nu, \tilde{\nu})\tag{9}\\ &= \operatorname{cost}_{\mu_1,\dots,\mu_k}(\nu) + p2^{p-1}W_p^p\left(\nu, \nu*\mathcal{N}\left(0, \frac{2m\log(1.25/\delta)}{\left(\varepsilon k\right)^2}I_d\right)\right)\notag\\ &\le \operatorname{cost}_{\mu_1,\dots,\mu_k}(\nu) +p2^{p-1}\left(\frac{2m\log(1.25/\delta)}{(\varepsilon k)^2}\right)^{p/2}\left(\sqrt{d} + \sqrt{2p}\right)^p\tag{10}\\ &\le \operatorname{cost}_{\mu_1,\dots,\mu_k}(\nu) + O_{p}\left(\left(\frac{md\log (1/\delta)}{(\varepsilon k)^2}\right)^{p/2}\right),\notag \end{align}\] where 8 follows from Lemma 5, 9 follows from Lemma 8, and 10 follows from Lemma 7. Then, the claim follows from an analogous application of dimensionality reduction as in Appendix 12.2. ◻

12.4 Proof of Proposition 6

Before we provide the proof, we introduce some preliminaries.

Definition 13. The total variation distance between two discrete measures \(\mu,\nu\) is defined to be\[\|\mu-\nu\|_\mathsf{TV} = \frac{1}{2}\sum_x|\mu(x) - \nu(x)|.\]

Proposition 15. Let \((\mathcal{X}, c)\) be a metric space with diameter \(D\). For any \(p\ge 1\) and probability measures \(\mu,\nu\in \mathcal{P}(\mathcal{X})\), it holds \[W_p(\mu,\nu)\le D\|\mu - \nu\|_\mathsf{TV}^{1/p}.\]

Proof. Let \(\omega = \mu \land \nu\), and rewrite \(\mu = \omega +\alpha\) and \(\nu = \omega + \beta\) for \(\alpha,\beta\ge 0\) and \(\alpha(\mathcal{X})=\beta(\mathcal{X})=t = \|\mu - \nu\|_\mathsf{TV}\). Now we construct a coupling \(\pi\in\Pi(\mu,\nu)\) as follows. We couple \(\mu\) and \(\nu\) by coupling \(\omega\) to itself and \(\alpha\) to \(\beta\) with any plan \(\gamma\in \Pi(\alpha,\beta)\). This yields \[\int c(x,y)^pd\pi(x,y) \le \int c(x,y)^pd\gamma(x,y)\le D^p\|\mu - \nu\|_\mathsf{TV},\] where we use \(c(x,y)\le D\). Taking the infimum over couplings and taking the \(p\)th root yields the desired result. ◻

Proposition 16. Let \(X :=\{X_i\}_{i\in [n]}\subset \mathcal{X}\), which forms measure \(\mu_n = \frac{1}{n}\sum_{i\in[n]}\delta_{X_i}\), and let \(S\subset[n]\) be a subset of \([n]\) of size \(m\), chosen uniformly at random, without replacement. Form \(\mu_m := \frac{1}{m}\sum_{j\in S}\delta_{X_j}\). It holds \[\|\mu_n - \mu_m\|_\mathsf{TV} \le 1 - \frac{m}{n}.\]

Proof. We have \[\begin{align} \|\mu_n - \mu_m\|_\mathsf{TV} &= \frac{1}{2}\sum_k \left|\frac{m_k}{m} - \frac{n_k}{n}\right|\\ &\le \frac{1}{2}\sum_k\sum_{i:X_i=x^{(k)}}\left|\frac{\mathbf{1}_{\{i\in S\}}}{m}-\frac{1}{n}\right|\\ &= \frac{1}{2}\left(\frac{1}{m}-\frac{1}{n}\right)\sum_k m_k + \frac{1}{2n}\sum_k(n_k-m_k)\\ &= 1 - \frac{m}{n}, \end{align}\] where we group the samples by distinct points \(x^{(k)}\) with multiplicities \(n_k\) (and similarly \(m_k\) for the subsample). This concludes the proof. ◻

Corollary 1. Let \(\mu_n\) and \(\mu_m\) be in Proposition 16, and suppose \(\mathsf{diam}(\mathcal{X})\le D\). Then \[W_p(\mu_n,\mu_m)\le D\left(1 - \frac{m}{n}\right)^{1/p}.\]

Proof. This is a straightforward consequence of Propositions 15 and 16. ◻

Now we provide the proof of Proposition 6, which we restate for convenience.

Theorem 17 (Proposition 6, restated). By splitting each distribution into \(k'\) disjoint distributions, in the same setting as Theorem 5, we have an \[\left(z(1+\gamma), \tilde{O}_{p,\gamma,z,\xi}\left(\left(\frac{md\log (1/\delta)}{(\varepsilon kk')^2}\right)^{p/2} + \left(1 - \frac{1}{k'}\right) + t\right)\right)\] -approximation for the \(p\)-Wasserstein barycenter problem, with probability \(1- \xi\).

Proof. Privacy is immediate by parallel composition, Lemma 2, and the privacy guarantees of Theorem 5. For utility, we mirror the analysis of Theorem 5 to obtain \[\begin{align} \mathrm{cost}_{\mu_{1, 1}, \dots, \mu_{k,k'}}(\tilde{\nu}) &= \frac{1}{k k'}\sum_{i = 1}^{k}\sum_{j=1}^{k'} W_p^p(\mu_{i,j}, \tilde{\nu})\\ &\le \frac{1}{kk'}\sum_{i=1}^k\sum_{j=1}^{k'}(W_p(\mu_{i,j},\nu) + W_p(\nu, \tilde{\nu}))^p\\ &\le \frac{1}{kk'}\sum_{i=1}^k\sum_{j=1}^{k'}W_p^p(\mu_{i,j},\nu) + p2^{p-1}W_p^p(\nu, \tilde{\nu})\\ &\le \frac{1}{k}\sum_{i=1}^k\frac{1}{k'}\sum_{j=1}^{k'}\left(W_p(\mu_{i,j}, \mu_i) +W_p(\mu_{i},\nu)\right)^p + p2^{p-1}W_p^p(\nu, \tilde{\nu})\\ &\le \frac{1}{k}\sum_{i=1}^kW_p^p(\mu_i,\nu) + \frac{p2^{p-1}}{kk'}\sum_{i=1}^k\sum_{j=1}^{k'}W_p^p(\mu_{i,j}, \mu_i) + p2^{p-1}W_p^p(\nu, \tilde{\nu})\\ &\le \frac{1}{k}\sum_{i=1}^kW_p^p(\mu_i,\nu) + p2^{p-1}\left(1 - \frac{1}{k'}\right) + p2^{p-1}W_p^p(\nu, \tilde{\nu})\\ &=\operatorname{cost}_{\mu_1,\dots,\mu_k}(\nu)+p2^{p-1}\left(1 - \frac{1}{k'}\right) + p2^{p-1}W_p^p\left(\nu, \nu*\mathcal{N}\left(0, \frac{2m\log(1.25/\delta)}{\left(\varepsilon kk'\right)^2}I_d\right)\right)\\ &\le \operatorname{cost}_{\mu_1,\dots,\mu_k}(\nu)+p2^{p-1}\left(1 - \frac{1}{k'}\right) +p2^{p-1}\left(\frac{2m\log(1.25/\delta)}{(\varepsilon kk')^2}\right)^{p/2}\left(\sqrt{d} + \sqrt{2p}\right)^p\\ &\le \operatorname{cost}_{\mu_1,\dots,\mu_k}(\nu) + O_{p}\left(\left(\frac{md\log (1/\delta)}{(\varepsilon kk')^2}\right)^{p/2} + \left(1 - \frac{1}{k'}\right)\right), \end{align}\] where we use Lemma 8 and Corollary 1. Then, the claim follows from an analogous application of dimensionality reduction as in Appendix 12.2. ◻

12.5 Proof of Proposition 7

Here, we state the full result.

Proposition 18. If \(\mu\) is \((m,\Delta, c)\)-approximately clusterable, then for all \(n \le m(2\Delta)^{-2p}\), letting \(\hat{\mu}_n = \frac{1}{n}\sum_{i\in[n]}\delta_{X_i}\) for \(X_i\sim \mu\) i.i.d., it holds \[\begin{align} W_p^p(\mu, \hat{\mu}_n)\le (1-c)\left((9^p+3)\sqrt{\frac{m}{(1-c-r)n}} +\sqrt{\frac{\log(4/\xi)}{2(1-c-r)n}}\right)+c + r \end{align}\] with probability \(1-\xi\), where \(r = \sqrt{\frac{\log(4/\xi)}{2n}}\).

Proof. Let \(\hat{c}_n = \frac{1}{n}\sum_{i\in[n]}\mathbf{1}_{\{X_i \text{ is not clustered}\}}\), and let \(\hat{\mu}_{n,\mathsf{clustered}}\) be the empirical measure of the \(n_\mathsf{clustered}:=n(1-\hat{c}_n)\) clustered points. Using a coupling that transports all the clustered empirical points to the clustered portion of \(\mu\) and matches the mismatched and non-clustered points with cost \(\le 1\) (since the diameter is \(1\)), we have \[W_p^p(\mu,\hat{\mu}_n)\le (1-c)W_p^p(\mu_\mathsf{clustered},\hat{\mu}_{n,\mathsf{clustered}}) + c + |\hat{c}_n-c|.\] For the first term on the right-hand side, using Propositions 13 and 20 of [40], for any \(n_\mathsf{clustered}\le m(2\Delta)^{-2p}\), we have \[\mathbb{E}[W_p^p(\mu_\mathsf{clustered},\hat{\mu}_{n,\mathsf{clustered}})]\le (9^p+3)\sqrt{\frac{m}{n}}\] and \[\Pr[W_p^p(\mu_\mathsf{clustered},\hat{\mu}_{n,\mathsf{clustered}})\ge \mathbb{E}[W_p^p(\mu_\mathsf{clustered},\hat{\mu}_{n,\mathsf{clustered}})] + t]\le \exp(-2n_\mathsf{clustered}t^2).\label{eq:approx_clusterable_i}\tag{11}\] Let \(r := \sqrt{\frac{\log(4/\xi)}{2n}}\). Using Hoeffding’s inequality, Lemma 9, with probability at least \(1 - \frac{\xi}{2}\), \(|\hat{c}_n-c|\le r\), which also implies \(n_\mathsf{clustered} = n(1-\hat{c}_n)\ge (1-c - r)n\). Using \(r=t\) for 11 , we have with probability at least \(1-\frac{\xi}{2}\), \[W_p^p(\mu_\mathsf{clustered},\hat{\mu}_{n,\mathsf{clustered}})\le (9^p + 3)\sqrt{\frac{m}{(1-c-r)n}}+\sqrt{\frac{\log(4/\xi)}{2(1-c-r)n}}.\] The claim follows by a union bound. ◻

12.6 Proof of Theorem 8

Theorem 19 (Theorem 8, restated). Suppose every input distribution \(\{\mu_i\}_{i\in[k]}\) is \(\left(O(m), \Delta, c\le \frac{1}{2}\right)\)-approximately clusterable and \(\frac{n}{k'}\le O(m)\cdot (2\Delta)^{-2p}\). For any \(p\ge 1\), suppose that there exists a (not necessarily private) \((z, t)\)-approximation algorithm that runs in time \(2^{O(d)}\cdot \operatorname{poly}(n,k)\) for the \(p\)-Wasserstein barycenter problem. Then, for every \(\varepsilon>0\) and \(\delta,\gamma,\xi \in (0, 1)\), there exists an \((\varepsilon,\delta)\)-DP algorithm that outputs a \[\left(z(1+\gamma), \tilde{O}_{p,\gamma,z,\xi}\left(\left(\frac{md\log (1/\delta)}{(\varepsilon kk')^2}\right)^{p/2}+\sqrt{\frac{mk'}{n}}+c + t\right)\right)\]

-approximation for the \(p\)-Wasserstein barycenter problem, with probability \(1-\xi\).

Proof. Privacy is immediate as in Theorem 6. For the utility analysis, using the subsample size \(n' = \frac{n}{k'}\), independence of the data, \(c \le \frac{1}{2}\), and union bounding over the \(kk'\) distributions, we have \[W_p^p(\mu_{i,j},\mu_i)\lesssim \sqrt{\frac{m}{n'}} + c + \sqrt{\frac{\log(kk'/\xi)}{n'}},\] for every \((i,j)\in[k]\times[k']\). Then, following the proof of Theorem 6 and using Proposition 7, we deduce \[\begin{align} \mathrm{cost}_{\mu_{1, 1}, \dots, \mu_{k,k'}}(\tilde{\nu}) &\le \frac{1}{k}\sum_{i=1}^kW_p^p(\mu_i,\nu) + \frac{p2^{p-1}}{kk'}\sum_{i=1}^k\sum_{j=1}^{k'}W_p^p(\mu_{i,j}, \mu_i) + p2^{p-1}W_p^p(\nu, \tilde{\nu})\\ &\le \frac{1}{k}\sum_{i=1}^kW_p^p(\mu_i,\nu) + \frac{p2^{p-1}}{kk'}\sum_{i=1}^k\sum_{j=1}^{k'}\left( \sqrt{\frac{mk'}{n}} + \sqrt{\frac{k'\log(kk'/\xi)}{n}}+c\right) + p2^{p-1}W_p^p(\nu, \tilde{\nu})\\ &\le \operatorname{cost}_{\mu_1,\dots,\mu_k}(\nu) + O_{p}\left(\left(\frac{md\log (1/\delta)}{(\varepsilon kk')^2}\right)^{p/2} + \sqrt{\frac{mk'}{n}}+\sqrt{\frac{k'\log(kk'/\xi)}{n}}+c\right). \end{align}\] Then, the claim follows from an analogous application of dimensionality reduction as in Appendix 12.2, now union bounding over \(2kk'\) events. ◻

13 Experiments↩︎

In all the experiments, we scale the noise down by \((240d)^{1/d}\); see, e.g., Figure 12 (b). We do not use zero noise, as it is usually unstable.
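For concreteness, the magnitude of this down-scaling factor at two example dimensions (our own arithmetic; the dimensions are illustrative, not tied to a specific experiment):

```python
# Quick arithmetic for the noise down-scaling factor (240 d)^(1/d).
for d in (2, 25):
    print(d, (240 * d) ** (1 / d))
# d = 2  -> ~21.9
# d = 25 -> ~1.42
```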


Figure 11: Example of an MNIST digit 2 with privacy parameter \(\varepsilon= 1\). On the left, we have the sensitive data, upsampled to \(n = 10000\). In the middle, we have the full private coreset (with \(O(n\log \varepsilon n)\) points). On the right, we subsample the private coreset down to \(n\) points. a — Points chosen uniformly at random from the full cell. b — Points chosen uniformly at random from a scaled-down version of the cell.


Figure 12: MNIST experiments testing sample size and noise scaling. a — MNIST experiments with \(n = 50, 100, \dots, 1600\) and \(\varepsilon= 1\), averaged over 10 runs. (left) Cost of solutions. (right) Runtime in seconds. b — Uniform corresponds to the setting of Figure 11 (a), and scaled uniform corresponds to the setting of Figure 11 (b).

13.1 Additional experiments↩︎

We provide an example of the private coreset under uniform noise and scaled-down uniform noise on the MNIST dataset (as points over \([-0.25, 0.25]^2\)) in Figure 11. In particular, for visualization, we treat one image as one distribution.

In the next experiment with MNIST, we follow the setup of [34] and treat each image as a point in \(\mathbb{R}^d\). We consider \(k = 10\) (each digit is one distribution), \(d' = 25\), and \(m = 40\). One difference is that we pre-process the data to lie in \(B_{0.5}(0)\). We take the cost to be the Wasserstein distance between the output and the Wasserstein barycenter computed over 5000 images from each class. Our experiment parameters are \(0.01\) for entropic regularization, \(50\) iterations, and \(100\) inner (Sinkhorn) iterations. In Figure 12 (a), we show that subsampling the private coresets to size \(n\) incurs a negligible increase in cost for sufficiently large \(n\). In Figure 12 (b), we observe that the choice of how points are sampled in each cell has a small effect on the cost for sufficiently large \(n\).
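For reference, a minimal sketch of the free-support barycenter pipeline with POT [43] is below; it is illustrative only and uses the exact (unregularized) solver on random placeholder data, whereas the experiment above uses an entropically regularized solver with the parameters just listed.

```python
# Illustrative sketch of a free-support Wasserstein barycenter with POT [43]
# (not the paper's exact code; the data below are random placeholders).
import numpy as np
import ot

rng = np.random.default_rng(0)
k, m, d_prime = 10, 40, 25            # number of classes, barycenter support size, projected dimension

# Placeholder "projected" point clouds, one per class, rescaled to lie in B_{0.5}(0).
clouds = []
for i in range(k):
    X = rng.standard_normal((500, d_prime)) + 0.3 * rng.standard_normal(d_prime)
    clouds.append(0.5 * X / np.max(np.linalg.norm(X, axis=1)))
weights = [np.full(len(X), 1.0 / len(X)) for X in clouds]

X_init = 0.1 * rng.standard_normal((m, d_prime))      # initial barycenter support
bary = ot.lp.free_support_barycenter(clouds, weights, X_init, numItermax=50)
print(bary.shape)                     # (m, d_prime): support of the (unregularized) barycenter
```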

In the US population experiment, we use GPT to implement the code for testing whether a point lies inside the US.
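The containment test itself is a standard point-in-polygon (ray-casting) check; a self-contained sketch is below, with a placeholder triangle standing in for the actual US boundary polygon, which we do not reproduce here.

```python
# Standard ray-casting point-in-polygon test (illustrative; the actual US boundary
# polygon used in the experiments is not reproduced here).
from typing import List, Tuple

def point_in_polygon(pt: Tuple[float, float], poly: List[Tuple[float, float]]) -> bool:
    """Return True if pt lies inside the simple polygon given by its vertices."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does the horizontal ray from pt cross the edge (x1, y1)-(x2, y2)?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Placeholder boundary (a triangle in longitude/latitude), not real geographic data.
boundary = [(-100.0, 30.0), (-90.0, 45.0), (-80.0, 30.0)]
print(point_in_polygon((-90.0, 35.0), boundary))   # True
print(point_in_polygon((-70.0, 35.0), boundary))   # False
```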

Figure 13: \(n = 2000, 4000, \dots, 128000\) and \(\varepsilon= 5\), similar to Figure 6, averaged over 10 trials.

Figure 14: Same experimental setup as Figure 5, with \(\varepsilon= 5\). a — \(n = 200000\) and \(\varepsilon= 5\) (and \(\delta = \frac{1}{n}\)) for \(m = 48\) and \(k = 1\). Denoting \(\nu,\nu_\mathsf{core},\nu_\mathsf{pert}\) as the non-private, private coreset-based, and output-perturbation barycenters, respectively, we have \(\mathrm{cost}(\nu) = 15.92\), \(\mathrm{cost}(\nu_\mathsf{core}) = 24.99\), \(\mathrm{cost}(\nu_\mathsf{pert}) = 16.957\) (squared degrees longitude/latitude), and \(W_2(\nu,\nu_\mathsf{core}) = 5.964\), \(W_2(\nu,\nu_\mathsf{pert}) = 2.906\) (degrees). b — \(n = 100000\) and \(\varepsilon= 5\) (and \(\delta = \frac{1}{n}\)) for \(m = 48\) and \(k = 4\) (self-reported White, Asian, Black, Hispanic). Denoting \(\nu,\nu_\mathsf{core},\nu_\mathsf{pert}\) as the non-private, private coreset-based, and output-perturbation barycenters, respectively, we have \(\mathrm{cost}(\nu) = 5018.618\), \(\mathrm{cost}(\nu_\mathsf{core}) = 7046.907\), \(\mathrm{cost}(\nu_\mathsf{pert}) = 4937.966\) (squared degrees longitude/latitude), and \(W_2(\nu,\nu_\mathsf{core}) = 12.967\), \(W_2(\nu,\nu_\mathsf{pert}) = 1.449\) (degrees).

References↩︎

[1]
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pp. 265–284. Springer, 2006.
[2]
John M Abowd. The US Census Bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 2867–2867, 2018.
[3]
Shlomi Hod and Ran Canetti. Differentially private release of Israel’s national registry of live births. arXiv preprint arXiv:2405.00267, 2024.
[4]
Apple. Learning with privacy at scale. 2017. URL https://machinelearning.apple.com/research/learning-with-privacy-at-scale.
[5]
Bolin Ding, Janardhan Kulkarni, and Sergey Yekhanin. Collecting telemetry data privately. Advances in Neural Information Processing Systems, 30, 2017.
[6]
Ryan Rogers, Subbu Subramaniam, Sean Peng, David Durfee, Seunghyun Lee, Santosh Kumar Kancha, Shraddha Sahay, and Parvez Ahammad. Linkedin’s audience engagements API: A privacy preserving data analytics system at scale. arXiv preprint arXiv:2002.05839, 2020.
[7]
Nimrod Megiddo and Kenneth J Supowit. On the complexity of some common geometric location problems. SIAM Journal on Computing, 13 (1): 182–196, 1984.
[8]
Jason M. Altschuler and Enric Boix-Adserà. Wasserstein barycenters are NP-hard to compute. SIAM Journal on Mathematics of Data Science, 4: 179–203, 2022. URL https://doi.org/10.1137/21M1390062.
[9]
Moses Charikar, Sudipto Guha, Éva Tardos, and David B Shmoys. A constant-factor approximation algorithm for the \(k\)-median problem. In Proceedings of the thirty-first annual ACM symposium on Theory of computing, pp. 1–10, 1999.
[10]
Moses Charikar and Sudipto Guha. Improved combinatorial algorithms for the facility location and \(k\)-median problems. In 40th Annual Symposium on Foundations of Computer Science (Cat. No. 99CB37039), pp. 378–388. IEEE, 1999.
[11]
Kamal Jain and Vijay V Vazirani. Approximation algorithms for metric facility location and \(k\)-median problems using the primal-dual schema and Lagrangian relaxation. Journal of the ACM (JACM), 48 (2): 274–296, 2001.
[12]
Kamal Jain, Mohammad Mahdian, Evangelos Markakis, Amin Saberi, and Vijay V Vazirani. Greedy facility location algorithms analyzed using dual fitting with factor-revealing LP. Journal of the ACM (JACM), 50 (6): 795–824, 2003.
[13]
Moses Charikar and Shi Li. A dependent LP-rounding approach for the \(k\)-median problem. In International Colloquium on Automata, Languages, and Programming, pp. 194–205. Springer, 2012.
[14]
Vincent Cohen-Addad, Anupam Gupta, Lunjia Hu, Hoon Oh, and David Saulpic. An improved local search algorithm for \(k\)-median. In Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1556–1612. SIAM, 2022.
[15]
Anupam Gupta, Katrina Ligett, Frank McSherry, Aaron Roth, and Kunal Talwar. Differentially private combinatorial optimization. In Proceedings of the 2010 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1106–1125, 2010. URL https://epubs.siam.org/doi/abs/10.1137/1.9781611973075.90.
[16]
Maria-Florina Balcan, Travis Dick, Yingyu Liang, Wenlong Mou, and Hongyang Zhang. Differentially private clustering in high-dimensional Euclidean spaces. In International Conference on Machine Learning, pp. 322–331. PMLR, 2017.
[17]
Haim Kaplan and Uri Stemmer. Differentially private \(k\)-means with constant multiplicative error. arXiv preprint arXiv:1804.08001, 2018.
[18]
Matthew Jones, Huy L Nguyen, and Thy D Nguyen. Differentially private clustering via maximum coverage. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 11555–11563, 2021.
[19]
Anamay Chaturvedi, Huy Nguyen, and Eric Xu. Differentially private \(k\)-means clustering via exponential mechanism and max cover. arXiv preprint arXiv:2009.01220, 2020.
[20]
Badih Ghazi, Ravi Kumar, and Pasin Manurangsi. Differentially private clustering: Tight approximation ratios. Advances in Neural Information Processing Systems, 33: 4040–4054, 2020.
[21]
Gloria Zen and Elisa Ricci. Earth mover’s prototypes: A convex learning approach for discovering activity patterns in dynamic scenes. In CVPR 2011, pp. 3225–3232. IEEE, 2011.
[22]
Guillermo Canas and Lorenzo Rosasco. Learning probability measures with respect to optimal transport metrics. Advances in Neural Information Processing Systems, 25, 2012.
[23]
Eduardo Fernandes Montesuma and Fred Maurice Ngole Mboula. Wasserstein barycenter for multi-source domain adaptation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16785–16793, 2021.
[24]
Ofir Pele and Michael Werman. Fast and robust earth mover’s distances. In 2009 IEEE 12th international conference on computer vision, pp. 460–467. IEEE, 2009.
[25]
Justin Solomon, Fernando De Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (ToG), 34 (4): 1–11, 2015.
[26]
Saad Nadeem, Travis Hollmann, and Allen Tannenbaum. Multimarginal Wasserstein barycenter for stain normalization and augmentation. In Anne L. Martel, Purang Abolmaesumi, Danail Stoyanov, Diana Mateus, Maria A. Zuluaga, S. Kevin Zhou, Daniel Racoceanu, and Leo Joskowicz (eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2020, pp. 362–371, Cham, 2020. Springer International Publishing. ISBN 978-3-030-59722-1.
[27]
Florian Heinemann, Axel Munk, and Yoav Zemel. Randomized Wasserstein barycenter computation: Resampling with statistical guarantees. SIAM Journal on Mathematics of Data Science, 4: 229–259, 2022. URL https://doi.org/10.1137/20M1385263.
[28]
Diane Ridgeway, Diane Ridgeway, Mary F Theofanos, Terese W Manley, and Christine Task. Challenge design and lessons learned from the 2018 differential privacy challenges. US Department of Commerce, National Institute of Standards and Technology, 2021.
[29]
Ryan McKenna, Gerome Miklau, and Daniel Sheldon. Winning the NIST contest: A scalable and general approach to differentially private synthetic data. arXiv preprint arXiv:2108.04978, 2021.
[30]
Liyang Xie, Kaixiang Lin, Shu Wang, Fei Wang, and Jiayu Zhou. Differentially private generative adversarial network. arXiv preprint arXiv:1802.06739, 2018.
[31]
Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9 (3–4): 211–407, 2014. ISSN 1551-305X. URL http://dx.doi.org/10.1561/0400000042.
[32]
Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43 (2): 904–924, 2011.
[33]
Jason M. Altschuler, Jinho Bok, and Kunal Talwar. On the privacy of noisy stochastic gradient descent for convex optimization. SIAM Journal on Computing, 53 (4): 969–1001, 2024. URL https://doi.org/10.1137/23M1556538.
[34]
Zachary Izzo, Sandeep Silwal, and Samson Zhou. Dimensionality reduction for Wasserstein barycenter, 2021. URL https://arxiv.org/abs/2110.08991.
[35]
Jason M Altschuler and Enric Boix-Adsera. Wasserstein barycenters can be computed in polynomial time in fixed dimension. Journal of Machine Learning Research, 22 (44): 1–19, 2021. URL http://jmlr.org/papers/v22/20-588.html.
[36]
Yiyun He, Roman Vershynin, and Yizhe Zhu. Algorithmically effective differentially private synthetic data. In The Thirty Sixth Annual Conference on Learning Theory, pp. 3941–3968. PMLR, 2023.
[37]
Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pp. 94–103, 2007.
[38]
Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th annual symposium on foundations of computer science, pp. 464–473. IEEE, 2014.
[39]
Konstantin Makarychev, Yury Makarychev, and Ilya Razenshteyn. Performance of Johnson-Lindenstrauss transform for \(k\)-means and \(k\)-medians clustering. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pp. 1027–1038, 2019.
[40]
Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli, 25 (4A): 2620–2648, 2019. URL https://doi.org/10.3150/18-BEJ1065.
[41]
Justin Solomon, Kristjan Greenewald, and Haikady Nagaraja. k-variance: A clustered notion of variance. SIAM Journal on Mathematics of Data Science, 4 (3): 957–978, 2022.
[42]
Kristjan Greenewald, Anming Gu, Mikhail Yurochkin, Justin Solomon, and Edward Chien. k-mixup regularization for deep learning via optimal transport. arXiv preprint arXiv:2106.02933, 2021.
[43]
Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong, and Titouan Vayer. POT: Python Optimal Transport. Journal of Machine Learning Research, 22 (78): 1–8, 2021. URL http://jmlr.org/papers/v22/20-451.html.
[44]
Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In Eric P. Xing and Tony Jebara (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 685–693, Bejing, China, 22–24 Jun 2014. PMLR. URL https://proceedings.mlr.press/v32/cuturi14.html.
[45]
Borja Balle, Gilles Barthe, and Marco Gaboardi. Privacy amplification by subsampling: Tight analyses via couplings and divergences. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31, 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/3b5020bb891119b9f5130f1fea9bd773-Paper.pdf.
[46]
Gonzalo Mena and Jonathan Niles-Weed. Statistical bounds for entropic optimal transport: sample complexity and the central limit theorem. Advances in Neural Information Processing Systems, 32, 2019.
[47]
Aude Genevay, Lénaic Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity of sinkhorn divergences. In The 22nd international conference on artificial intelligence and statistics, pp. 1574–1583. PMLR, 2019.
[48]
Ziv Goldfeld, Kristjan Greenewald, Jonathan Niles-Weed, and Yury Polyanskiy. Convergence of smoothed empirical measures with applications to entropy estimation. IEEE Transactions on Information Theory, 66 (7): 4368–4391, 2020.
[49]
Ziv Goldfeld and Kristjan Greenewald. Gaussian-smoothed optimal transport: Metric structure and statistical efficiency. In International Conference on Artificial Intelligence and Statistics, pp. 3327–3337. PMLR, 2020.
[50]
Sloan Nietert, Ziv Goldfeld, and Kengo Kato. Smooth \(p\)-Wasserstein distance: structure, empirical approximation, and statistical applications. In International Conference on Machine Learning, pp. 8172–8183. PMLR, 2021.
[51]
Yixing Zhang, Xiuyuan Cheng, and Galen Reeves. Convergence of Gaussian-smoothed optimal transport distance with sub-gamma distributions and dependent samples. In International Conference on Artificial Intelligence and Statistics, pp. 2422–2430. PMLR, 2021.
[52]
Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM Journal on Computing, 40 (3): 793–826, 2011. . URL https://doi.org/10.1137/090756090.
[53]
Andrea Bittau, Úlfar Erlingsson, Petros Maniatis, Ilya Mironov, Ananth Raghunathan, David Lie, Mitch Rudominer, Ushasree Kode, Julien Tinnes, and Bernhard Seefeld. Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th symposium on operating systems principles, pp. 441–459, 2017.
[54]
Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Abhradeep Thakurta. Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 2468–2479. SIAM, 2019.
[55]
Albert Cheu, Adam Smith, Jonathan Ullman, David Zeber, and Maxim Zhilyaev. Distributed differential privacy via shuffling. In Advances in Cryptology–EUROCRYPT 2019: 38th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Darmstadt, Germany, May 19–23, 2019, Proceedings, Part I 38, pp. 375–403. Springer, 2019.
[56]
Stuart Lloyd. Least squares quantization in PCM. IEEE transactions on information theory, 28 (2): 129–137, 1982.
[57]
Vincent Cohen-Addad, Hossein Esfandiari, Vahab Mirrokni, and Shyam Narayanan. Improved approximations for Euclidean \(k\)-means and \(k\)-median, via nested quasi-independent sets. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1621–1628, 2022.
[58]
Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/2013/file/af21d0c97db2e27e13572cbf59eb343d-Paper.pdf.
[59]
Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37 (2): A1111–A1138, 2015. URL https://doi.org/10.1137/141000439.
[60]
Sebastian Claici, Edward Chien, and Justin Solomon. Stochastic Wasserstein barycenters, 2018. URL https://arxiv.org/abs/1802.05757.
[61]
Jason Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration, 2018. URL https://arxiv.org/abs/1705.09634.
[62]
Pankaj K. Agarwal, Sharath Raghvendra, Pouyan Shirzadian, and Keegan Yao. Fast and accurate approximations of the optimal transport in semi-discrete and discrete settings. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 4514–4529, 2024. URL https://epubs.siam.org/doi/abs/10.1137/1.9781611977912.159.
[63]
Pankaj K. Agarwal, Sharath Raghvendra, Pouyan Shirzadian, and Keegan Yao. Efficient approximation algorithm for computing Wasserstein barycenter under Euclidean metric. Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2025. URL https://epubs.siam.org/doi/abs/10.1137/1.9781611978322.163.
[64]
March Boedihardjo, Thomas Strohmer, and Roman Vershynin. Private measures, random walks, and synthetic data. Probability theory and related fields, pp. 1–43, 2024.
[65]
Vitaly Feldman, Audra McMillan, Satchit Sivakumar, and Kunal Talwar. Instance-optimal private density estimation in the wasserstein distance. Advances in Neural Information Processing Systems, 37: 90061–90131, 2024.
[66]
William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into Hilbert space. Contemporary mathematics, 26: 189–206, 1984.
[67]
Noga Alon, Phillip B Gibbons, Yossi Matias, and Mario Szegedy. Tracking join and self-join sizes in limited storage. In Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 10–20, 1999.
[68]
Shanmugavelayutham Muthukrishnan et al. Data streams: Algorithms and applications. Foundations and Trends® in Theoretical Computer Science, 1 (2): 117–236, 2005.
[69]
Jeremiah Blocki, Avrim Blum, Anupam Datta, and Or Sheffet. The Johnson-Lindenstrauss transform itself preserves differential privacy, 2012. URL https://arxiv.org/abs/1204.2136.
[70]
Aleksandar Nikolov. Private query release via the Johnson-Lindenstrauss transform, 2022. URL https://arxiv.org/abs/2208.07410.
[71]
Shyam Narayanan, Sandeep Silwal, Piotr Indyk, and Or Zamir. Randomized dimensionality reduction for facility location and single-linkage clustering. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 7948–7957. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/narayanan21b.html.
[72]
Moses Charikar and Erik Waingarten. The Johnson-Lindenstrauss lemma for clustering and subspace approximation: From coresets to dimension reduction. In Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 3172–3209, 2025. URL https://epubs.siam.org/doi/abs/10.1137/1.9781611978322.102.
[73]
Filippo Santambrogio. Optimal transport for applied mathematicians. Birkäuser, NY, 55 (58-63): 94, 2015.
[74]
Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of statistics, pp. 1302–1338, 2000.
[75]
Stéphane Boucheron, Gábor Lugosi, and Olivier Bousquet. Concentration inequalities. In Summer school on machine learning, pp. 208–240. Springer, 2003.

  1. Correspondence to anminggu@utexas.edu. Part of this work was done while A.G. was at Boston University.↩︎

  2. Specifically, it is the distribution that minimizes the average Wasserstein distance between itself and each distribution in the set, and generalizes the classical concept of the (Euclidean) mean from datapoints to entire distributions.↩︎

  3. Same (asymptotic) runtime as a non-private algorithm.↩︎

  4. [20] considers the ball of radius \(1\), while we consider the ball with diameter \(1\).↩︎

  5. Their definition is slightly different. Our definition is simplified to work better with Assumption 2.↩︎

  6. We remark that our first algorithm (Theorem 4) does not require any assumption on \(\beta_i\).↩︎

  7. https://www.census.gov/acs/www/data/data-tables-and-tools/data-profiles/2015/↩︎

  8. A set that contains a good approximate solution.↩︎

  9. whereas the expected rate should be \(O\left(\frac{m}{n}\right)\)↩︎