May 07, 2025
This paper introduces the FFT-Enhanced Kalman Filter (FFTKF), a differentially private optimization method that addresses the challenge of preserving performance in DP-SGD, where added noise typically degrades model utility. FFTKF integrates frequency-domain noise shaping with Kalman filtering to enhance gradient quality while preserving \((\varepsilon, \delta)\)-DP guarantees. It employs a high-frequency shaping mask in the Fourier domain to concentrate differential privacy noise in less informative spectral components, preserving low-frequency gradient signals. A scalar-gain Kalman filter with finite-difference Hessian approximation further refines the denoised gradients. With a per-iteration complexity of \(\mathcal{O}(d \log d)\), FFTKF demonstrates improved test accuracy over DP-SGD and DiSK across MNIST, CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets using CNNs, Wide ResNets, and Vision Transformers. Theoretical analysis confirms that FFTKF maintains equivalent privacy guarantees while achieving a tighter privacy-utility trade-off through reduced noise and controlled bias.
Differential Privacy (DP) has become a foundational concept for safeguarding individual-level information in machine learning and data analysis, providing rigorous guarantees against information leakage from model outputs [1]–[4]. Standard DP mechanisms, such as the Laplace and Gaussian mechanisms, achieve privacy by injecting calibrated noise into data or gradients. However, this noise injection often results in substantial degradation in model utility, particularly in high-dimensional or deep learning scenarios.
One of the central challenges in differentially private learning is improving the performance of DP-SGD, a widely used algorithm that combines stochastic gradient descent with Gaussian noise perturbation [1]. Due to the stochastic and high-magnitude nature of DP noise, especially under tight privacy budgets, the privatized gradients tend to exhibit poor signal-to-noise ratios. Consequently, effectively denoising DP-SGD updates to enhance performance while strictly preserving DP guarantees remains a formidable and open problem. Recent studies highlight that the noise introduced in DP-SGD can significantly impair convergence and model accuracy, particularly in complex tasks, making denoising a critical yet challenging objective [5], [6]. These works underscore the difficulty of balancing noise reduction with the preservation of privacy constraints, as excessive denoising risks compromising the \((\epsilon, \delta)\)-DP guarantees.
To address this, recent methods have introduced signal processing and state estimation tools to enhance gradient quality under DP. For example, the DiSK framework [7] incorporates Kalman filtering into DP optimization to recursively estimate cleaner gradients, resulting in improved convergence and test accuracy. Kalman filters are well-established tools for estimating latent states in dynamic systems based on noisy observations [8], [9]. Their application in DP optimization allows for smoother updates by exploiting temporal correlations across gradient steps, which is particularly effective in large-scale and deep models.
In parallel, frequency-domain techniques such as low-pass filtering have shown promise in separating useful signal components from high-frequency noise [10]–[12]. Leveraging the Fast Fourier Transform (FFT), recent signal processing approaches selectively attenuate noise by filtering out high-frequency components in a computationally efficient manner [13]. When adapted to machine learning, these spectral methods provide a structured way to shape and redistribute noise, preserving signal fidelity in lower-frequency domains [6], [14].
Building on these developments, we propose the FFT-Enhanced Kalman Filter (FFTKF), a novel optimization framework that combines spectral noise shaping via FFT with Kalman filtering for improved gradient estimation under differential privacy. Our method reshapes the injected DP noise to concentrate its energy in high-frequency spectral components, thereby preserving low-frequency gradient information that is critical for effective model training. The resulting privatized gradients are then denoised using a scalar-gain Kalman filter that leverages temporal structure to produce stable update directions.
Our contributions are as follows. First, we introduce a frequency-domain noise shaping strategy that integrates into the DP optimization pipeline while preserving the required \((\epsilon, \delta)\)-DP guarantees. Second, we combine this with a simplified Kalman filter that enables efficient gradient tracking with a per-step complexity of \(O(d \log d)\). Third, we empirically validate FFTKF on standard benchmarks—MNIST, CIFAR-10, CIFAR-100, and Tiny-ImageNet—across multiple model architectures including CNNs, Wide ResNets, and Vision Transformers. Our results show that FFTKF consistently improves test accuracy over DP-SGD and DiSK, particularly under tight privacy budgets. These findings demonstrate the effectiveness of combining spectral filtering and state estimation for denoising gradients in private optimization.
Gradient-Based Optimization. Stochastic Gradient Descent (SGD) and its variants, such as Adam, form the backbone of optimization in machine learning by iteratively updating model parameters using gradient information [15], [16]. SGD relies on noisy gradient estimates computed from random mini-batches, offering computational efficiency but suffering from variance in updates. Adam enhances SGD by incorporating adaptive momentum and second-moment estimates, which accelerate convergence, particularly in non-convex settings. However, these methods lack inherent privacy protections, as gradients can leak sensitive information about training data. This limitation has spurred the development of privacy-preserving optimizers like DP-SGD, which integrate noise to safeguard data privacy [1].
Differential Privacy Optimization. Differential Privacy (DP) is a rigorous framework for protecting individual data privacy in machine learning by injecting calibrated noise into data or gradients, often at the cost of reduced model utility [1]–[4]. DP-SGD, a widely adopted algorithm, achieves privacy by adding Gaussian noise to gradients during stochastic gradient descent, but the high-magnitude noise often degrades signal-to-noise ratios, impairing convergence [1]. Recent efforts address this trade-off through innovative techniques. For instance, Fan and Xiong proposed an adaptive noise adjustment method for real-time aggregate monitoring, dynamically balancing privacy and utility based on data characteristics [17]. Similarly, Zhang et al.'s DiSK framework integrates Kalman filtering to denoise privatized gradients, enhancing model performance while adhering to privacy constraints [7]. Adaptive clipping techniques further improve DP-SGD by dynamically tuning noise levels to preserve gradient information [18].
Kalman Filter Optimization. Kalman filters are powerful tools for state estimation in noisy dynamic systems, leveraging temporal correlations to refine estimates from noisy observations [8], [9]. In the context of differential privacy, the DiSK framework employs a simplified Kalman filter to smooth noisy gradient estimates, improving convergence rates and reducing computational overhead in DP optimization [7]. This approach is particularly effective for large-scale models, where noise can significantly disrupt training. Recent research has also explored Kalman filters in federated learning, where they help balance privacy and accuracy in distributed environments by refining gradient updates across clients. These advancements highlight the versatility of Kalman-based methods in privacy-preserving machine learning.
Low-Pass Filtering Techniques. Low-pass filters are essential in signal processing for attenuating high-frequency noise while preserving critical signal components, making them valuable for data analysis [13]. Fourier transform-based low-pass filters, known for their computational efficiency, are widely used to separate noise from meaningful signals [19]. In differential privacy, adaptive low-pass filtering techniques optimize cutoff frequencies to maximize utility under strict privacy budgets, effectively denoising gradients in high-dimensional tasks [6], [14], [20]. These methods are particularly relevant for deep learning, where preserving low-frequency gradient information is crucial for maintaining model performance while ensuring robust privacy guarantees.
Figure 1: Visualization of the proposed frequency-domain gradient denoising process. (A) The original flattened gradient before privatization. (B) The FFT transforms the gradient into the frequency domain, where shaped Gaussian noise is added and attenuated by a high-frequency mask. Kalman filtering is then applied in the time domain to exploit temporal correlations. (C) The final denoised gradient (dashed line) shows reduced high-frequency perturbations while preserving the underlying signal structure.
Preliminaries. Let \(\mathbb{R}^d\) denote the \(d\)-dimensional Euclidean space. A vector \(z \in \mathbb{R}^d\) is written as \(z = (z_0, \dots, z_{d-1})^\top\), with \(\ell_2\)-norm \(\|z\|_2 = \sqrt{\sum_{i=0}^{d-1} z_i^2}\). The identity matrix in \(\mathbb{R}^{d \times d}\) is \(I_d\), and a diagonal matrix with entries \(\varphi_0, \dots, \varphi_{d-1}\) is \(\mathrm{diag}(\varphi_0, \dots, \varphi_{d-1})\). For a function \(f: \mathbb{R}^d \to \mathbb{R}\), its gradient is \(\nabla f\), and Hessian is \(\nabla^2 f\). The Hadamard product is \(\odot\), and \(\lfloor x \rfloor\) denotes the floor function. The Gaussian distribution with mean zero and covariance \(\sigma^2 I_d\) is \(\mathcal{N}(0, \sigma^2 I_d)\). We aim to minimize \(F(x) = \mathbb{E}_{\xi \sim \mathcal{D}} [f(x; \xi)]\), the expected loss over a data distribution \(\mathcal{D}\). Parameters at iteration \(t\) are \(x_t \in \mathbb{R}^d\), updated as \(x_{t+1} = x_t - \eta \tilde{g}_t\), where \(\eta > 0\) is the learning rate and \(\tilde{g}_t\) is the estimated gradient. The true gradient \(\nabla F(x_t)\) is approximated via a mini-batch \(\mathcal{B}_t \subseteq \mathcal{D}\) of size \(B\), giving \(g_t = \frac{1}{B} \sum_{\xi \in \mathcal{B}_t} \nabla f(x_t; \xi)\). The parameter difference is \(d_t = x_{t+1} - x_t\). In differential privacy, a mechanism satisfies \((\varepsilon, \delta)\)-DP if, for neighboring datasets \(D, D'\) differing in one entry, and any output set \(S\), \(\Pr[\mathcal{M}(D) \in S] \leq e^\varepsilon \Pr[\mathcal{M}(D') \in S] + \delta\). Gradients are clipped as \(\text{clip}(v, C) = v \cdot \min(1, C / \|v\|_2)\) to bound their \(\ell_2\)-norm by \(C\), and noise \(w_t \sim \mathcal{N}(0, \sigma_w^2 I_d)\) is added.
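For concreteness, the clipping and Gaussian perturbation described above can be sketched in a few lines of NumPy; the helper names `clip` and `privatize_gradient` are illustrative and not taken from the paper's released code.

```python
import numpy as np

def clip(v, C):
    """Rescale v so its l2-norm is at most C: clip(v, C) = v * min(1, C / ||v||_2)."""
    return v * min(1.0, C / (np.linalg.norm(v) + 1e-12))

def privatize_gradient(per_sample_grads, C, sigma_w, rng):
    """Clip each per-sample gradient, average over the batch, and add Gaussian noise."""
    clipped = np.stack([clip(g, C) for g in per_sample_grads])
    g_bar = clipped.mean(axis=0)
    w = rng.normal(0.0, sigma_w, size=g_bar.shape)   # w_t ~ N(0, sigma_w^2 I_d)
    return g_bar + w

# Toy usage: a batch of 8 per-sample gradients in R^10.
rng = np.random.default_rng(0)
grads = rng.normal(size=(8, 10))
g_priv = privatize_gradient(grads, C=1.0, sigma_w=0.5, rng=rng)
```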
This section briefly reviews the discrete Fast Fourier transform (FFT) and the algorithmic considerations that motivate its use for gradient denoising.
Recall that the discrete Fourier transform (DFT) of a real-valued vector \(z = (z_0, \dots, z_{d-1})^\top \in \mathbb{R}^d\) is the complex vector \(\hat{z} = \mathcal{F}(z) \in \mathbb{C}^d\), with components \[\hat{z}_k = \sum_{n=0}^{d-1} z_n \, e^{-2\pi i kn/d}, \quad \text{for } k = 0, \dots, d-1,\] and its inverse is given by \[z_n = \frac{1}{d} \sum_{k=0}^{d-1} \hat{z}_k \, e^{2\pi i kn/d}, \quad \text{for } n = 0, \dots, d-1.\]
With this normalization, \(\mathcal{F}\) is invertible and \(\mathcal{F}/\sqrt{d}\) is unitary, i.e., \[\mathcal{F}^{-1}(\mathcal{F}(z)) = z,\] and Parseval's identity holds: \[\|z\|_2^2 = \frac{1}{d} \|\hat{z}\|_2^2,\] where \(\hat{z} = \mathcal{F}(z)\).
Consequently, adding Gaussian noise in the Fourier domain preserves the \(\ell_2\)-sensitivity required for \((\varepsilon,\delta)\)-DP due to the invariance of the \(\ell_2\)-norm under unitary transformations.
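As a quick numerical check of this normalization and of Parseval's identity, the following standalone NumPy snippet can be used; `np.fft.fft` follows exactly the unnormalized convention written above.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=257)                      # arbitrary real vector, d = 257
z_hat = np.fft.fft(z)                         # hat{z}_k = sum_n z_n exp(-2*pi*i*k*n/d)

# The inverse transform recovers z, and Parseval holds with the 1/d factor.
assert np.allclose(np.fft.ifft(z_hat).real, z)
assert np.isclose(np.sum(z**2), np.sum(np.abs(z_hat)**2) / len(z))
```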
Low/high–frequency split. Fix a pivot index \(k_0=\lfloor \lambda d\rfloor\) for some \(\lambda\in(0,1)\). Frequencies \(k<k_0\) are called low-frequency components, and \(k\ge k_0\) high-frequency components. This separation reflects the empirical observation that most signal information, especially in gradient vectors of smooth loss landscapes, is concentrated in the lower spectral range, while the high-frequency components often contain stochastic noise.
Spectral filtering. A diagonal mask \(\Phi = \mathrm{diag}(\varphi_0, \dots, \varphi_{d-1})\) defines a linear filter \[\mathcal{G}_\Phi(z) = \mathcal{F}^{-1} \left( \Phi\, \hat{z} \right), \qquad \bigl(\mathcal{G}_\Phi(z)\bigr)_n = \frac{1}{d} \sum_{k=0}^{d-1} \varphi_k \hat{z}_k \, e^{2\pi i kn/d},\] where \(\hat{z} = \mathcal{F}(z)\).
By the convolution theorem, this operation in the frequency domain is equivalent to convolution in the time domain and can be evaluated in \(O(d \log d)\) time via the FFT algorithm, which significantly improves efficiency compared to the naive \(O(d^2)\) convolution [21]–[23].
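A small self-contained check of this equivalence, comparing the FFT-based filter with an explicit \(O(d^2)\) circular convolution (illustrative NumPy, not part of the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256
z = rng.normal(size=d)
phi = rng.uniform(size=d)

# Frequency-domain filtering, O(d log d) ...
filtered = np.fft.ifft(phi * np.fft.fft(z))

# ... equals circular convolution with the kernel h = F^{-1}(phi), O(d^2) here.
h = np.fft.ifft(phi)
conv = np.array([sum(h[(n - m) % d] * z[m] for m in range(d)) for n in range(d)])
assert np.allclose(filtered, conv)
```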
High-frequency shaping mask. To enhance denoising while maintaining DP, we use a step mask \[\varphi_k = \begin{cases} 1, & k<k_0,\\[3pt] 1 - \rho, & k \geq k_0, \end{cases}\] where \(\rho \in (0, 1)\) controls the magnitude of suppression. This *step-wise attenuation* suppresses frequency components beyond the cutoff index \(k_0\), thereby reducing the influence of DP noise concentrated in those frequencies. Unlike a sharp cutoff that zeroes out high frequencies entirely, this mask only dampens them, preserving the low-frequency structure of gradients and offering a balance between denoising and signal fidelity [2], [24].
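A minimal sketch of the step mask and the filter \(\mathcal{G}_{\Phi_\rho}\) it induces; the names `step_mask` and `spectral_filter` and the parameter defaults are illustrative assumptions.

```python
import numpy as np

def step_mask(d, lam=0.5, rho=0.5):
    """phi_k = 1 for k < k0 and 1 - rho for k >= k0, with k0 = floor(lam * d)."""
    k0 = int(np.floor(lam * d))
    phi = np.ones(d)
    phi[k0:] = 1.0 - rho
    return phi

def spectral_filter(z, phi):
    """G_Phi(z) = F^{-1}(Phi * F(z)), computed in O(d log d) via the FFT.

    The mask is not conjugate-symmetric, so the inverse transform can carry a
    small imaginary part; only the real component is kept, since the filtered
    gradient is real-valued.
    """
    return np.fft.ifft(phi * np.fft.fft(z)).real

# Example: attenuate the upper half of the spectrum of a noisy vector.
g = np.random.default_rng(0).normal(size=1024)
g_filtered = spectral_filter(g, step_mask(g.size))
```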
Remarks. Let \(\Phi_\rho = \mathrm{diag}(\varphi_0, \dots, \varphi_{d-1})\); then the filtered version of a privatized gradient \(g = \nabla f + w\) is \[\hat{g} = \mathcal{G}_{\Phi_\rho}(g) = \mathcal{F}^{-1}(\Phi_\rho \mathcal{F}(g)).\] When \(w \sim \mathcal{N}(0, \sigma^2 I)\), the transformed noise \(\hat{w} := \mathcal{F}^{-1}(\Phi_\rho \mathcal{F}(w))\) is still zero-mean but has reduced total energy, with the attenuation concentrated in the components \(k \geq k_0\): \[\mathbb{E}[\|\hat{w}\|_2^2] = \bigl(k_0 + (1-\rho)^2 (d-k_0)\bigr)\sigma^2 < \mathbb{E}[\|w\|_2^2] = d\sigma^2,\] facilitating more accurate recovery of the gradient signal after filtering.
This FFT recap underpins our FFT-Enhanced Kalman Filter in Sec. 3.4, where we combine spectral noise shaping with a scalar-gain Kalman predictor to denoise privatized gradients efficiently, achieving both computational and privacy-preserving benefits.
To explain our proposed idea of using the FFT-Enhanced Kalman Filter for denoising gradients, we first establish a dynamic system for the gradients. This system consists of a system update equation and an observation equation. The system update of the gradient dynamics is derived via Taylor expansion of \(\nabla F\) around \(x_{t-1}\), allowing for a second-order approximation of the gradient evolution at step \(t\): \[\nabla F(x_t)=\nabla F(x_{t-1})+\boldsymbol{H}_t\!\cdot\!(x_t-x_{t-1})+v_t, \label{eq:sysupdate}\tag{1}\] where \(\boldsymbol{H}_t := \nabla^2F(x_{t-1}) \in \mathbb{R}^{d \times d}\) is the local Hessian, and \(v_t\) represents the Taylor expansion remainder: \[v_t = \frac{1}{2}\int_0^1 \nabla^3 F\bigl((1-z)x_{t-1} + z x_t\bigr)[x_t - x_{t-1}]^{\otimes 2} \, dz.\] Here, \((\cdot)^{\otimes 2}\) denotes the second-order tensor product, capturing the nonlinearity of the third derivative in the remainder [7], [25].
The observed gradient \(g_t\) is a noisy, privatized estimate of the true gradient: \[g_t = \frac{1}{B}\sum_{\xi \in \mathcal{B}_t}\text{clip}(\nabla f(x_t, \xi), C) + w_t = C_t \nabla F(x_t) + w'_t,\] where \(w'_t\) contains both DP noise and subsampling noise. The matrix \(C_t\) is the effective observation operator determined by the clipping factor and the sampling mechanism. In the ideal full-batch, unclipped setting, \(C_t = I_d\), but more generally it is a contraction mapping satisfying \(\|C_t\|_2 \leq 1\).
Combining the update and observation equations, we obtain the system: \[\begin{align} \nabla F(x_t) &= \nabla F(x_{t-1})+\boldsymbol{H}_t(x_t - x_{t-1}) + v_t,\\ g_t &= C_t \nabla F(x_t) + w'_t. \end{align}\]
To enforce differential privacy while retaining useful structure, we inject noise in the frequency domain: \[g_t \;=\; C_t\nabla F(x_t) \;+\; \tilde{w}_t, \qquad \tilde{w}_t :=\mathcal{F}^{-1}\!\bigl( \Phi_\rho \odot \mathcal{F}(w_t) \bigr), \label{eq:obs}\tag{2}\] where \(w_t \sim \mathcal{N}(0,\sigma_w^2I_d)\) is isotropic Gaussian noise. The mask \(\Phi_\rho \in \mathbb{R}^d\) has entries \[(\Phi_\rho)_k = \begin{cases} 1, & 0 \le k < k_0,\\[3pt] 1 - \rho e^{-\alpha(k - k_0)}, & k_0 \le k < d, \end{cases}\] for some pivot frequency \(k_0 = \lfloor \lambda d \rfloor\), with shaping strength \(\rho \in (0,1)\) and decay rate \(\alpha > 0\). The shaping attenuates the injected noise \(\tilde{w}_t\) at frequencies \(k \ge k_0\), which contribute least to gradient descent, reducing its total energy and improving recoverability of the informative low-frequency gradient content.
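The noise-shaping step of Eq. 2 can be sketched as follows, assuming the exponential-decay mask defined above (NumPy; the function names, the toy gradient, and the default \(\alpha\) are illustrative assumptions):

```python
import numpy as np

def shaping_mask(d, lam=0.5, rho=0.5, alpha=0.01):
    """(Phi_rho)_k = 1 for k < k0 and 1 - rho * exp(-alpha * (k - k0)) for k >= k0."""
    k0 = int(np.floor(lam * d))
    k = np.arange(d)
    phi = np.ones(d)
    phi[k0:] = 1.0 - rho * np.exp(-alpha * (k[k0:] - k0))
    return phi

def shaped_dp_noise(d, sigma_w, phi, rng):
    """tilde{w}_t = F^{-1}(Phi_rho * F(w_t)) with w_t ~ N(0, sigma_w^2 I_d)."""
    w = rng.normal(0.0, sigma_w, size=d)
    return np.fft.ifft(phi * np.fft.fft(w)).real

rng = np.random.default_rng(0)
d = 1024
grad = np.sin(np.linspace(0.0, 4.0 * np.pi, d))       # toy low-frequency "gradient"
g_t = grad + shaped_dp_noise(d, sigma_w=0.5, phi=shaping_mask(d), rng=rng)  # Eq. (2), C_t = I_d
```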
To recover the low-frequency content of the privatized gradient, we apply the spectral filter \[\mathcal{G}_{\rho}(z) := \mathcal{F}^{-1}\!\bigl( \Phi_\rho\odot\mathcal{F}(z) \bigr),\qquad z\in\mathbb{R}^d.\] This filtering step yields the estimate \(\widehat g_t = \mathcal{G}_{\rho}(g_t)\). Since \(\mathcal{G}_{\rho}\) is a linear operator with spectral mask \(\Phi_\rho\), this operation has complexity \(O(d\log d)\) and does not distort the signal beyond a known attenuation factor. That is, the covariance of \(\widehat{g}_t\) is a spectrally reweighted version of \(\mathrm{Cov}(g_t)\), which we exploit in the Kalman update below.
We adopt the scalar-gain Kalman filtering approximation introduced in [7], which simplifies the covariance matrices to scalar multiples of the identity. Specifically, we let \(P_t = p_t I_d\), \(K_t = \kappa I_d\), and estimate the Hessian action using a finite-difference formula with hyperparameter \(\gamma > 0\).
Prediction Step. Given \(\tilde{g}_{t-1}\), we predict the next gradient using a finite-difference approximation of the Hessian-vector product: \[\tilde{g}_{t|t-1} = \tilde{g}_{t-1} +\frac{1}{B}\sum_{\xi\in\mathcal{B}_t} \frac{\nabla f(x_t+\gamma d_{t-1};\xi)-\nabla f(x_t;\xi)}{\gamma},\label{eq:prediction}\tag{3}\] where \(d_{t-1} := x_t - x_{t-1} = -\eta \tilde{g}_{t-1}\). This approximates the action of the local Hessian without explicitly computing second-order derivatives.
Correction Step. The predicted gradient is then corrected using the filtered observation \(\widehat g_t\): \[\tilde{g}_t = (1-\kappa)\,\tilde{g}_{t|t-1} +\kappa\,\widehat g_t,\label{eq:correction}\tag{4}\] where \(\kappa \in (0,1)\) is the Kalman gain that balances the reliance on the prediction versus the new (denoised) observation. This form ensures that the update direction incorporates temporal consistency across iterations while attenuating the influence of high-frequency noise.
Together, Eqs. 3 and 4 constitute a computationally lightweight Kalman filtering mechanism enhanced by frequency-domain denoising. It achieves \(O(d\log d)\) per-step complexity and enhances DP optimization without sacrificing privacy guarantees.
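Putting Eqs. 2–4 together, the following is a minimal single-iteration sketch of the FFTKF update on a toy quadratic objective; `grad_fn` stands in for the mini-batch gradient oracle, clipping is omitted, and all hyperparameter values are illustrative rather than the authors' settings.

```python
import numpy as np

def fftkf_step(x_t, g_tilde_prev, d_prev, grad_fn, phi, rng,
               eta=0.1, kappa=0.5, gamma=0.01, sigma_w=0.5):
    """One FFTKF iteration: shaped DP noise, spectral filtering, Kalman predict/correct."""
    d = x_t.size
    # Privatized observation with spectrally shaped noise (Eq. 2; clipping omitted).
    w = rng.normal(0.0, sigma_w, size=d)
    g_obs = grad_fn(x_t) + np.fft.ifft(phi * np.fft.fft(w)).real
    # Spectral filtering of the observation: g_hat = G_rho(g_obs).
    g_hat = np.fft.ifft(phi * np.fft.fft(g_obs)).real
    # Prediction via a finite-difference Hessian-vector product (Eq. 3).
    hvp = (grad_fn(x_t + gamma * d_prev) - grad_fn(x_t)) / gamma
    g_pred = g_tilde_prev + hvp
    # Scalar-gain correction (Eq. 4) and parameter update.
    g_tilde = (1.0 - kappa) * g_pred + kappa * g_hat
    x_next = x_t - eta * g_tilde
    return x_next, g_tilde, x_next - x_t

# Toy quadratic F(x) = 0.5 * ||x||^2, so grad F(x) = x.
grad_fn = lambda x: x
x, g_tilde, d_prev = np.ones(64), np.zeros(64), np.zeros(64)
phi = np.where(np.arange(64) < 32, 1.0, 0.5)          # step mask with lambda = rho = 0.5
rng = np.random.default_rng(0)
for _ in range(10):
    x, g_tilde, d_prev = fftkf_step(x, g_tilde, d_prev, grad_fn, phi, rng)
```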
Figure 2: FFT-Enhanced Kalman Filter Optimizer (FFTKF)
The high-frequency shaping in Eq. 2 intentionally pushes privacy noise into spectral regions that matter least for optimization. Because the Kalman filter relies on low-frequency temporal correlations captured by Eqs. 3–4, the FFT step removes most of the injected disturbance before the gain \(\kappa\) is applied, resulting in a provably lower steady-state covariance.
Let \(\Sigma_w = \sigma_w^2 I_d\) be the covariance of the original DP noise \(w_t\); then the shaped noise \(\tilde{w}_t = \mathcal{F}^{-1}(\Phi_\rho \odot \mathcal{F}(w_t))\) has covariance \[\Sigma_{\tilde{w}} = \sigma_w^2\, \mathcal{F}^{-1}\, \Phi_\rho^2\, \mathcal{F},\] whose high-frequency components are suppressed relative to \(\Sigma_w\), and the receiver-side filter \(\mathcal{G}_\rho\) attenuates them once more. Hence, the Kalman filter receives observations with diminished overall noise variance, resulting in lower mean-square estimation error.
Crucially, FFTKF inherits the \(O(d)\) memory and \(O(d)\) algebraic complexities of the simplified DiSK variant while adding only two in-place FFTs per iteration.
Our FFT-Enhanced Kalman Filter (FFTKF) inherits the scalar–gain reduction of DiSK [7], wherein both the state covariance \(P_t\) and the Kalman gain \(K_t\) are isotropic: \[P_t = p_t I_d, \quad K_t = \kappa I_d.\] This diagonal simplification ensures that all matrix-vector operations reduce to scalar multiples of vector additions, preserving an \(\mathcal{O}(d)\) runtime and storage profile. The Hessian-vector product \(\boldsymbol{H}_t d_{t-1}\) is approximated with a single finite-difference query: \[\boldsymbol{H}_t d_{t-1} \approx \frac{\nabla F(x_t + \gamma d_{t-1}) - \nabla F(x_t)}{\gamma},\] eliminating the need for Hessian storage or inversion.
While DiSK performs time-domain exponential smoothing, FFTKF additionally reshapes the injected DP noise in the frequency domain: \[\tilde{w}_t := \mathcal{F}^{-1}\bigl(\Phi_\rho \odot \mathcal{F}(w_t)\bigr), \quad (\Phi_\rho)_k= \begin{cases} 1, & k < k_0,\\[2pt] 1-\rho e^{-\alpha(k-k_0)}, & k\ge k_0, \end{cases}\] with pivot index \(k_0 = \lfloor \lambda d \rfloor\). The mask \(\Phi_\rho\) acts as a soft low-pass filter on the noise, attenuating its energy at frequencies \(k \ge k_0\) while leaving undistorted the low-frequency directions in which the Kalman filter's predictive prior is most accurate. This filtering can be viewed as a dual to the temporal smoothing in DiSK, but operating in the spectral domain.
Compared with DP-SGD, FFTKF requires one additional forward pass per iteration to compute the finite-difference directional gradient, and two in-place 1-D FFTs: a forward transform \(\mathcal{F}\) and its inverse \(\mathcal{F}^{-1}\).
Both operations scale as \(O(d \log d)\), while the state vector \(\tilde{g}_t\) and difference direction \(d_t\) are stored as \(O(d)\) vectors. Thus, FFTKF matches the memory profile of DiSK [7] but enables more precise noise attenuation with marginal overhead.
Since the spectral mask satisfies \(\|\Phi_\rho\|_2 \le 1\) and, by Parseval's identity, the FFT preserves the \(\ell_2\)-norm up to the fixed factor \(\sqrt{d}\), the shaped noise satisfies \[\|\tilde{w}_t\|_2 = \tfrac{1}{\sqrt{d}}\,\|\Phi_\rho \odot \mathcal{F}(w_t)\|_2 \le \tfrac{1}{\sqrt{d}}\,\|\mathcal{F}(w_t)\|_2 = \|w_t\|_2.\] Thus, FFT-based reshaping does not increase the sensitivity of the privatized quantity. The overall privacy budget \((\varepsilon,\delta)\) remains exactly that of DP-SGD and DiSK, as guaranteed by the post-processing argument below.
Let the FFT operator be \(\mathcal{F}:\mathbb{R}^d\!\to\!\mathbb{C}^d\) with inverse \(\mathcal{F}^{-1}\), as introduced in Section 3.1. Fix a pivot index \(k_0=\lfloor\lambda d\rfloor\) (\(\lambda\!\in\!(0,1)\)) and a high-frequency attenuation ratio \(\rho\in(0,1)\). Define the diagonal spectral mask \[\Phi_\rho = \operatorname{diag}\!\bigl( \underbrace{1,\dots,1}_{k_0}, \underbrace{1-\rho,\dots,1-\rho}_{d-k_0} \bigr),\] and the deterministic post-processing map \(P(g)=\mathcal{F}^{-1}\!\bigl(\Phi_\rho\,\mathcal{F} g\bigr)\). Given a privatized gradient \(g_t\), the filtered release is \(\hat{g}_t := P(g_t)\).
Proposition 1 (Post-processing invariance). Because \(P\) is data-independent, \(\hat{g}_t\) is \((\varepsilon,\delta)\)-DP whenever the DiSK gradient \(g_t\) is \((\varepsilon,\delta)\)-DP.
Sensitivity note. The mask satisfies \(\|\Phi_\rho\|_2=1\); hence \(P\) does not increase the \(\ell_2\)-sensitivity of its input. The Gaussian noise scale \(\sigma_w\) chosen for DiSK therefore continues to satisfy the target \((\varepsilon,\delta)\) budget. Consequently, Algorithm 2 inherits exactly the same overall \((\varepsilon,\delta)\) guarantee as standard DP-SGD/DiSK, computed with the moments accountant over \(T\) iterations.
Lemma 1 (Effect of the low‑pass mask). Write \(g_t=\nabla F(x_t)+\eta_t\) with \(\eta_t\sim\mathcal{N}(0,\sigma_w^{2}I_d)\). Let \(\rho^\star=\bigl(k_0+(1-\rho)^{2}(d-k_0)\bigr)/d\). Then \[\mathbb{E}[\hat{g}_t]=A\,\nabla F(x_t),\qquad \mathrm{Cov}[\hat{g}_t]=\sigma_w^{2}A^{2},\] where \(A:=\mathcal{F}^{-1}\Phi_\rho\,\mathcal{F}\) satisfies \(\|A-I_d\|_2 = \rho\) and \(\operatorname{tr}\!\bigl(\mathrm{Cov}[\hat{g}_t]\bigr)= \rho^\star\,d\sigma_w^{2}\).
Proof. Unitary invariance of \(\mathcal{F}\) yields the stated mean and covariance. Eigenvalues of \(A\) are \(1\) (multiplicity \(k_0\)) and \(1-\rho\) (multiplicity \(d-k_0\)). ◻
Remark. "Bias" in Lemma 1 refers to \(\mathbb{E}[\hat{g}_t]-\nabla F(x_t)\); filtering does not introduce systematic noise bias but scales the signal by \(A\).
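The spectral facts used in Lemma 1 and its proof are easy to verify numerically; the following standalone NumPy check builds \(A\) explicitly (only feasible for small \(d\), and purely illustrative):

```python
import numpy as np

d, lam, rho, sigma_w = 64, 0.5, 0.5, 1.0
k0 = int(np.floor(lam * d))
phi = np.where(np.arange(d) < k0, 1.0, 1.0 - rho)

# A = F^{-1} diag(phi) F, built explicitly for verification only (O(d^2) memory).
F = np.fft.fft(np.eye(d))                      # DFT matrix: F @ z == np.fft.fft(z)
A = np.linalg.inv(F) @ np.diag(phi) @ F        # Hermitian, eigenvalues {1, 1 - rho}

# Eigenvalues are 1 (multiplicity k0) and 1 - rho (multiplicity d - k0).
eigs = np.sort(np.linalg.eigvalsh(A))
assert np.allclose(eigs[: d - k0], 1.0 - rho) and np.allclose(eigs[d - k0:], 1.0)

# ||A - I||_2 = rho and tr(sigma_w^2 A A^H) = rho_star * d * sigma_w^2.
rho_star = (k0 + (1.0 - rho) ** 2 * (d - k0)) / d
assert np.isclose(np.linalg.norm(A - np.eye(d), 2), rho)
assert np.isclose(np.trace(sigma_w**2 * A @ A.conj().T).real, rho_star * d * sigma_w**2)
```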
Lemma 1 replaces the isotropic noise term \(d\sigma_w^{2}\) in the DiSK analysis with \(\rho^\star d\sigma_w^{2}\) and introduces a multiplicative bias factor \(1-\rho\). Repeating the steps of Zhang et al. [7] yields:
Theorem 2 (Privacy–utility with FFT filtering). Under Assumptions A1–A3 and the same \((\eta,\kappa,\gamma)\) schedule as in Zhang et al., Algorithm 2 satisfies \[\begin{align} \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(x_t)\|^{2} &\le \frac{2\bigl(F(x_0)-F^\star + \beta\|\nabla F(x_0)\|^{2}\bigr)}{C_1\eta T}\\[4pt] &+\frac{2(\beta+\eta^{2}L)\kappa^{2}}{C_1\eta} \!\Bigl[(2+|1+\gamma|)\rho^\star\,d\sigma_w^{2} +\tfrac{\sigma_{\!SGD}^{2}}{B}\Bigr]\\[4pt] &+\rho^{2}\,G_T, \end{align}\] where \(G_T=\tfrac1T\sum_{t}\mathbb{E}\|\nabla F(x_t)\|^{2}\) and \[C_1 = (1+\kappa-2\eta L) -4(\beta+\eta^{2}L)(1-\kappa)^{2}L^{2}\eta(2+|1+\gamma|).\]
In all experiments we fix \(\lambda=\tfrac12\) and \(\rho=0.5\) a priori (i.e., independently of any individual training sample); this gives \(\rho^\star=0.625\) and \(\rho^{2}=0.25\). Thus the DP-noise contribution is reduced by \(37.5\%\) while the extra bias inflates the optimization term by at most \(25\%\), yielding a provably tighter trade-off than plain DiSK.
In this section, we explore how the FFT-Enhanced Kalman Filter (FFTKF) improves the performance of differential privacy (DP) optimizers across models, datasets, and privacy budgets. FFTKF reshapes the DP noise in the frequency domain to preserve the essential low-frequency gradient signal while directing the privacy noise into less informative spectral regions. For code and implementation details, please refer to https://github.com/OpenNN-Theory/KalmanDP-BEDB.
Figure 3: Test accuracy curves at \(\epsilon = 4\) across four datasets (CIFAR10 [27], CIFAR100 [27], MNIST [28], Tiny-ImageNet [29]) and five model architectures. Each plot compares DPAdam [30] (green), DISK [7] (blue), and the proposed FFTKF-DPAdam (red). FFTKF consistently improves final test accuracy, particularly on CIFAR and Tiny-ImageNet benchmarks.
The experiments are conducted on four standard image classification benchmarks: MNIST [28], CIFAR-10, CIFAR-100 [27], and Tiny-ImageNet [29]. We evaluate three model families: a 5-layer CNN [31], Wide ResNet [32], and ViT [33]. We conduct a comparative analysis to assess the impact of FFTKF on various base algorithms, including the DP versions of Adam and SGD; the updates of these algorithms are delineated in Algorithm 1. In our experiments, the prefix FFTKF- denotes the privatized version of the FFT-enhanced Kalman filter algorithms. We apply a high-frequency shaping mask with parameter \(\rho \in (0, 1)\) to push DP noise into high-frequency components while preserving the essential low-frequency gradient signal. The pivot index \(k_0\) is determined by the parameter \(\lambda \in (0, 1)\), which defines the transition point between low and high frequencies. In addition, we experimentally tune the batch size \(B\), the total number of epochs \(E = \frac{N T}{B}\), and the learning rate \(\eta\) to achieve optimal performance within a given privacy budget \(\varepsilon\). The privacy parameter \(\delta\) is held constant throughout all experiments to ensure a reasonable privacy guarantee.
When operating within identical privacy budgets, FFTKF consistently outperforms baseline DP optimizers, including DPAdam and DiSK, across a wide range of datasets and models. For example, on CIFAR-10 with Wide ResNet-40, FFTKF improves test accuracy by up to 1.6% over the best-performing state-of-the-art algorithm. On Tiny-ImageNet with ViT-small, FFTKF exhibits superior convergence stability and accuracy, a benefit that can be attributed to its effective spectral noise shaping.
As illustrated in Figure 3 and Table [tab:fftkf_results_compact], FFTKF achieves better final accuracy within fixed privacy budgets. These gains are particularly evident under tight privacy constraints, where conventional optimizers frequently suffer significant noise corruption. The findings indicate the effectiveness of frequency-domain filtering and Kalman-based prediction in mitigating the adverse effects of DP noise, particularly in high-dimensional vision tasks.
Figure 4: Ablation study of FFTKF. (a) Varying \(\epsilon\) at epoch 80. (b) Varying \(\epsilon\) at epoch 40. (c) Test accuracy across \((\rho, \epsilon)\) at epoch 40.
Ablation study. To better understand the influence of FFTKF parameters, we conduct ablation studies on the high-frequency shaping parameter \(\rho\) and the privacy budget \(\epsilon\). We observe that moderate values of \(\rho \in [0.6, 0.7]\) provide a good trade-off between stability and adaptability. Furthermore, Figure 4 shows that higher values of \(\epsilon\), which imply weaker privacy and hence less noise, result in more accurate gradient estimation. The parameter \(\rho\) controls the redistribution of spectral noise, and setting \(\rho = 0.6\) consistently yields strong performance across a wide range of datasets.
This paper introduced the FFT-Enhanced Kalman Filter (FFTKF), a differentially private optimization method that integrates frequency-domain noise shaping with Kalman filtering to enhance gradient quality while preserving \((\varepsilon, \delta)\)-DP guarantees. By using FFT to concentrate privacy noise in high-frequency spectral components, FFTKF retains critical low-frequency gradient signals, complemented by a scalar-gain Kalman filter for further denoising. With a per-iteration complexity of \(\mathcal{O}(d \log d)\), FFTKF demonstrates superior test accuracy over DP-SGD and DiSK across standard benchmarks, particularly under tight privacy constraints. Theoretically, FFTKF maintains equivalent privacy guarantees while achieving a tighter privacy-utility trade-off through reduced noise and controlled bias. FFTKF represents a significant advancement in efficient and effective private optimization.
This research was supported by Brian Impact Foundation, a non-profit organization dedicated to the advancement of science and technology for all.