Bayesian shrinkage priors subject to linear constraints

Zhi Ling
Saw Swee Hock School of Public Health
National University of Singapore
lingzhi@nus.edu.sg
Shozen Dan
Department of Mathematics
Imperial College London
shozen.dan21@imperial.ac.uk


Abstract

In Bayesian regression models with categorical predictors, constraints are needed to ensure identifiability when using all \(K\) levels of a factor. The sum-to-zero constraint is particularly useful as it allows coefficients to represent deviations from the population average. However, implementing such constraints in Bayesian settings is challenging, especially when assigning priors that both respect the constraints and preserve the intended shrinkage behavior. Here we develop a multivariate normal prior family that satisfies arbitrary linear constraints while preserving the local adaptivity properties of shrinkage priors, with an efficient implementation algorithm for probabilistic programming languages. Our approach applies broadly to various shrinkage frameworks including Bayesian ridge, horseshoe priors and their variants, demonstrating excellent performance in simulation studies. The covariance structure we derive generalizes beyond regression models to any Bayesian analysis requiring linear constraints on parameters, providing practitioners with a principled approach to parameter identification while maintaining proper uncertainty quantification and interpretability.

1 Introduction

In regression models with categorical predictors, parameter identifiability issues arise when all \(K\) levels of a factor are included. Consider a simple example where gender is included in a regression model: \[y_i = \beta_0 + \beta_{M} \text{Male}_i + \beta_{F}\text{Female}_i + ... + \epsilon_i.\] Without additional constraints, this model is overparameterized since knowing one gender category perfectly predicts the other, leading to multicollinearity. Traditional approaches like treatment coding (setting one level as the reference) work but have interpretability disadvantages: the intercept represents the mean for the reference group rather than the population average, and coefficients represent deviations from the reference group rather than from the overall mean. This complicates interpretation, especially with complex models involving multiple categorical variables or when no natural reference category exists.

A more interpretable approach retains all K coefficients while imposing the constraint that \(\sum_{k=1}^K \beta_k = 0\) (sum-to-zero constraint), allowing \(\beta_0\) to represent the population average and each categorical coefficient to represent the deviation from this average. However, implementing such constraints in Bayesian settings is challenging. One method redefines the last coefficient as \(\beta_K = -\sum_{k=1}^{K-1} \beta_k\), which creates undesirable asymmetry among coefficients, complicates prior specification, and leads to difficult posterior geometry. Another approach implements "soft constraints" through highly informative priors (e.g., \(\sum_k \beta_k \sim \mathcal{N}(0, \varepsilon)\) with very small \(\varepsilon\)), but this approximation can lead to numerical instabilities. We propose that the optimal solution is to directly incorporate linear constraints into prior distributions, preserving both identifiability and interpretability while enabling proper uncertainty quantification. Our approach allows for multiple arbitrary linear constraints on parameter vectors and includes efficient sampling algorithms for implementation in probabilistic programming languages. For common zero-sum constrained shrinkage priors in statistical practice, we provide four specializations as special cases of our proposed method: Bayesian ridge, hierarchical Bayesian ridge, horseshoe prior, and horseshoe prior variants. These specializations provide practitioners with ready-to-use solutions.

2 Linear constraints via priors

In this section, we develop a principled approach for incorporating linear constraints into Bayesian regression models with multivariate normal priors.

Consider a coefficient vector \(\boldsymbol{\beta} = (\beta_1,\beta_2,...,\beta_K)^\top \in \mathbb{R}^K\) such that \[\beta_k \mid \lambda_k \sim \mathcal{N}\left( 0, \lambda_k^2 \right), ~ k=1,2,...,K,\] where \(\lambda_k\) can be either a fixed hyperparameter or a random variable, in which case the marginal prior is a Gaussian scale mixture.

Consider a linear constraint \(A \boldsymbol{\beta} = b\), with constraint matrix \(A\in \mathbb{R}^{J \times K}\) (\(1\leq J \leq K-1\)) with full row rank and constant vector \(b \in \mathbb{R}^{J}\). We propose the following prior: \[\begin{align}\label{eq:multinormal_cond} \boldsymbol{\beta} \sim \mathcal{N}\left(D A^\top \left( A D A^\top \right)^{-1}b, \; D - D A^\top \left( A D A^\top \right)^{-1} A D \right) \end{align}\tag{1}\] where \(D = \text{diag}\left( \lambda_1^2, \lambda_2^2,...,\lambda_K^2 \right)\).

Theorem 1. The above prior family satisfies the given linear constraint \(A\boldsymbol{\beta}=b\) almost surely.

Proof. Consider \(\boldsymbol{\beta} \sim \mathcal{N}(\boldsymbol{\mu}, D)\). Let \(\boldsymbol{z}=A \boldsymbol{\beta}\) represent an auxiliary observation derived from \(\boldsymbol{\beta}\). The covariance structure is given by: \[\operatorname{Cov}(\boldsymbol{\beta},A\boldsymbol{\beta})=D A^\top,\quad \operatorname{Cov}(A\boldsymbol{\beta}, A\boldsymbol{\beta})=A D A^\top.\]

Therefore, the joint distribution can be written as: \[\begin{bmatrix} \boldsymbol{\beta} \\ \boldsymbol{z} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu} \\ A\boldsymbol{\mu} \end{bmatrix}, \begin{bmatrix} D & D A^\top \\ A D & A D A^\top \end{bmatrix} \right).\]

By standard results for conditional multivariate normal distributions, we have: \[\boldsymbol{\beta}\mid A\boldsymbol{\beta}=b \sim \mathcal{N}\left(m^*, \Sigma^*\right),\] where the conditional mean is: \[m^* = \boldsymbol{\mu} + D A^\top (A D A^\top)^{-1} (b - A\boldsymbol{\mu})\] and the conditional covariance matrix is: \[\label{eq:cov_matrix} \Sigma^* = D - D A^\top (A D A^\top)^{-1} A D.\tag{2}\] The invertibility of \(A D A^\top\) is given by Lemma 1.

In the absence of specific prior information, it is often desirable to employ a neutral prior centered at zero for regularization purposes. Setting \(\boldsymbol{\mu}=\mathbf{0}\) yields the form presented in Eq. 1: \[\boldsymbol{\beta} \sim \mathcal{N}\left( D A^\top \left( A D A^\top \right)^{-1}b, \; D - D A^\top \left( A D A^\top \right)^{-1} A D \right).\] ◻

By adjusting the covariance structure in this manner, we effectively reduce the parameter space to values that almost surely satisfy the constraint. This approach implements hard parameter constraints solely through the prior distribution. Although this conditional method does modify the original prior distribution, we can largely preserve the desired marginal distributions by appropriately scaling the diagonal elements of the covariance matrix, minimizing distortion to the original model specification. The specific scaling factors and their effects on the resulting posterior distributions will be explored in detail in Section 4, along with practical guidelines for implementation in various modeling contexts.
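As a quick numerical check of Theorem 1 (an illustrative sketch with arbitrary values, not code from the paper), one can verify that the mean in Eq. 1 satisfies the constraint exactly and that the covariance in Eq. 2 satisfies \(A\Sigma^* = 0\), so draws have no variability in the constrained directions:

```python
import numpy as np

rng = np.random.default_rng(0)
K, J = 6, 2
A = rng.standard_normal((J, K))             # full row rank with probability 1
b = rng.standard_normal(J)
D = np.diag(rng.uniform(0.5, 2.0, size=K))  # D = diag(lambda_1^2, ..., lambda_K^2)

ADA_inv = np.linalg.inv(A @ D @ A.T)
m_star = D @ A.T @ ADA_inv @ b              # mean of Eq. 1
Sigma_star = D - D @ A.T @ ADA_inv @ A @ D  # covariance of Eq. 2

print(np.allclose(A @ m_star, b))           # True: the mean satisfies A beta = b
print(np.allclose(A @ Sigma_star, 0.0))     # True: no variance in the constrained directions
```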

3 Sampling algorithm

The covariance matrix (Eq. 2) we derived is singular (see Lemmas 2 and 3). Therefore, the multivariate normal implementations in most mathematical libraries cannot directly sample from it. This section gives a general sampling algorithm and a Stan implementation.

We want to sample from \[\boldsymbol{\beta} \sim \mathcal{N}\Bigl(m^*,\,\Sigma^*\Bigr)\] with \[m^* = D A^\top (A D A^\top)^{-1} b,\quad \Sigma^* = D - D A^\top (A D A^\top)^{-1} A D,\] knowing that \(\Sigma^*\) is singular with rank \(K-J\), where \(J=\operatorname{rank}(A)\) is the row rank of \(A\) and \(1\leq J \leq K-1\) (see Lemma 3).

The key idea is to sample in the full‐rank subspace where \(\Sigma^*\) is nondegenerate and then map the sample back to \(\mathbb{R}^K\). One common approach is to obtain an orthonormal basis for the null space of \(A\). We formulate the procedure below.

Figure 1: Sample \(\boldsymbol{\beta} \sim \mathcal{N}(m^*,\,\Sigma^*)\)
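The algorithm box itself is not reproduced here; the following NumPy/SciPy sketch implements the procedure described in the text (the function name `sample_constrained_normal` and its defaults are ours, not the paper's):

```python
import numpy as np
from scipy.linalg import null_space

def sample_constrained_normal(lam2, A, b, rng, n_draws=1):
    """Draw beta ~ N(m*, Sigma*), i.e. beta ~ N(0, D) conditioned on A beta = b (Eq. 1).

    lam2 : (K,) positive prior variances lambda_k^2 (the diagonal of D)
    A    : (J, K) full-row-rank constraint matrix
    b    : (J,) constraint vector
    """
    D = np.diag(lam2)
    ADA = A @ D @ A.T                                        # J x J, invertible (Lemma 1)
    m_star = D @ A.T @ np.linalg.solve(ADA, b)               # constrained mean
    Sigma_star = D - D @ A.T @ np.linalg.solve(ADA, A @ D)   # singular covariance (Eq. 2)

    M = null_space(A)                                        # K x (K-J) orthonormal basis of N(A)
    Omega = M.T @ Sigma_star @ M                             # (K-J) x (K-J), positive definite (Lemma 4)
    L = np.linalg.cholesky(Omega)

    z = rng.standard_normal((n_draws, M.shape[1]))           # z ~ N(0, I_{K-J})
    return m_star + z @ L.T @ M.T                            # beta = m* + M L z

rng = np.random.default_rng(1)
A, b = np.ones((1, 5)), np.zeros(1)                          # sum-to-zero constraint
beta = sample_constrained_normal(np.ones(5), A, b, rng, n_draws=3)
print(np.allclose(beta.sum(axis=1), 0.0))                    # every draw satisfies the constraint
```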

Theorem 2. The above procedure returns samples from Eq. 1.

Proof. The procedure returns \(\boldsymbol{\beta} = m^* + M L z\) with \(z \sim \mathcal{N}(0, I_{K-J})\), so the mean is \[\mathbb{E}[\boldsymbol{\beta}] = m^* + M L\, \mathbb{E}[z] = m^*,\] and the covariance is \[\operatorname{Var}(\boldsymbol{\beta}) = M L\,\operatorname{Var}(z)\,L^\top M^\top = M L L^\top M^\top = M\,\Omega\,M^\top = M M^\top \Sigma^* M M^\top = \Sigma^*,\] where the final equality holds because \(A\Sigma^* = \Sigma^* A^\top = 0\), so the columns of \(\Sigma^*\) lie in the null space of \(A\) and are unchanged by the orthogonal projector \(M M^\top\). ◻

Therefore, the above algorithm first reduces the sampling problem to a \((K-J)\)-dimensional subspace where the covariance is non-degenerate, and then lifts the samples back to \(\mathbb{R}^K\). This constructs samples from the degenerate conditional Gaussian distribution while ensuring that the constraint \(A\boldsymbol{\beta} = b\) is satisfied.

4 Sum-to-zero specialization

In this section, we demonstrate how the covariance structure can be tailored to different modeling needs, especially sum-to-zero constraints.

Note that in our construction, the constraint matrix \(A\) is only required to have full row rank. It is therefore possible to set \(A = (1,1,...,1) \in\mathbb{R}^{1\times K}\) and \(b=0 \in \mathbb{R}\), so that the constraint becomes \[\sum_{k=1}^K \beta_k = 0,\] i.e., the sum-to-zero constraint.
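Substituting \(A = \mathbf{1}^\top\) and \(b = 0\) into Eq. 1, with \(D=\operatorname{diag}(\lambda_1^2,\ldots,\lambda_K^2)\), gives the generic sum-to-zero prior underlying the specializations below: \[\boldsymbol{\beta} \sim \mathcal{N}\!\left(\mathbf{0},\; D - \frac{D\mathbf{1}\mathbf{1}^\top D}{\mathbf{1}^\top D \mathbf{1}}\right), \qquad \Sigma_{kj} = \begin{cases} \lambda_k^2 - \dfrac{\lambda_k^4}{\sum_{l=1}^K \lambda_l^2}, & j = k,\\[6pt] -\dfrac{\lambda_k^2 \lambda_j^2}{\sum_{l=1}^K \lambda_l^2}, & j \neq k. \end{cases}\]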

We present four specializations of our multivariate normal structure for the commonly encountered sum-to-zero constraint in statistical practice: Bayesian ridge, hierarchical Bayesian ridge, horseshoe prior, and variants of the horseshoe prior. For each prior type, we derive the explicit form of the covariance matrix, discuss the mathematical implications, and provide efficient computational approaches that can be directly implemented in probabilistic programming languages. These specializations provide practitioners with ready-to-use solutions for imposing sum-to-zero constraints while maintaining the desirable shrinkage properties of each prior family.

4.1 Bayesian ridge

Bayesian ridge regression uses a normal prior which imposes a global \(\ell_2\) regularization: \[\boldsymbol{\beta} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\] where \(\boldsymbol{I}\) denotes the identity matrix. By choosing \(D=\sigma^2\boldsymbol{I}\) in Eq. 1, the sum-to-zero constraint can be achieved by assuming the following prior \[\boldsymbol{\beta} \sim \mathcal{N}\left( \boldsymbol{0},~ \sigma^2\left( \boldsymbol{I}-\frac{1}{K}\boldsymbol{1}\boldsymbol{1}^\intercal \right) \right).\] Here \(\sigma^2 = \frac{K}{K-1}\) is chosen so that the diagonal elements of the covariance matrix are \(\Sigma_{k,k} = \sigma^2 \left( 1 - \frac{1}{K} \right) = 1\). The marginal distributions are therefore preserved: \[\beta_k \sim \mathcal{N}(0, 1), \quad k=1,2,...,K.\]

Figure 2: Sample from sum-to-zero Bayesian ridge
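The contents of Figure 2 are not reproduced here, but as an illustration (a sketch, not necessarily the paper's exact algorithm), the target covariance can be obtained by centering an i.i.d. standard normal vector and rescaling by \(\sqrt{K/(K-1)}\), since for \(z \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\) the centered vector \(\left(\boldsymbol{I}-\frac{1}{K}\boldsymbol{1}\boldsymbol{1}^\intercal\right)z\) has covariance exactly \(\boldsymbol{I}-\frac{1}{K}\boldsymbol{1}\boldsymbol{1}^\intercal\):

```python
import numpy as np

def sample_sum_to_zero_ridge(K, rng, n_draws=1):
    """Draw beta ~ N(0, (K/(K-1)) * (I - 11^T/K)): unit marginal variances, exact zero sum."""
    z = rng.standard_normal((n_draws, K))
    return np.sqrt(K / (K - 1)) * (z - z.mean(axis=1, keepdims=True))

rng = np.random.default_rng(2)
beta = sample_sum_to_zero_ridge(K=6, rng=rng, n_draws=100_000)
print(np.allclose(beta.sum(axis=1), 0.0))   # constraint holds for every draw
print(beta.var(axis=0).round(2))            # each marginal variance is approximately 1
```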

4.2 Hierarchical Bayesian ridge

It is possible to introduce an additional scale factor \(\lambda\) controlling the global shrinkage level, \[\boldsymbol{\beta} \sim \mathcal{N}\left( \boldsymbol{0}, \lambda^2 \boldsymbol{I} \right).\] The smaller \(\lambda\) is, the stronger the \(\ell_2\) penalty. It is also a common choice to introduce partial pooling for the variance parameter, allowing the shrinkage level to adapt to the variability in the data, \[\lambda \sim \text{Cauchy}^+(0,1),\] or any other distribution with support on \(\mathbb{R}^+\).

By choosing \(D=\sigma^2\lambda^2\boldsymbol{I}\) in Eq. 1, the sum-to-zero constraint can be achieved by assuming the following prior \[\boldsymbol{\beta} \sim \mathcal{N}\left( \boldsymbol{0},~ \lambda^{2}\sigma^2\left( \boldsymbol{I}-\frac{1}{K}\boldsymbol{1}\boldsymbol{1}^\intercal \right) \right).\] \(\sigma^2 = \frac{K}{K-1}\) is chosen so that the marginal distributions are preserved: \[\beta_k \sim \mathcal{N}(0, \lambda^{2}), \quad k=1,2,...,K.\]

When sampling, one only needs to apply the non-centered parameterization \[\boldsymbol{\beta} = \lambda \boldsymbol{\beta}^*\] and then use Algorithm 2 to sample \(\boldsymbol{\beta}^*\).
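A minimal sketch of this non-centered construction, assuming the half-Cauchy prior on \(\lambda\) from above (the helper re-implements the sum-to-zero ridge draw so that the snippet is self-contained):

```python
import numpy as np

def sample_hier_sum_to_zero_ridge(K, rng, n_draws=1):
    """beta = lambda * beta*, with lambda ~ Cauchy+(0,1) and beta* a sum-to-zero ridge draw."""
    lam = np.abs(rng.standard_cauchy(size=(n_draws, 1)))                     # global scale
    z = rng.standard_normal((n_draws, K))
    beta_star = np.sqrt(K / (K - 1)) * (z - z.mean(axis=1, keepdims=True))   # zero sum, unit marginals
    return lam * beta_star                                                   # given lambda: N(0, lambda^2) marginals, zero sum
```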

4.3 Horseshoe prior

The horseshoe prior [1] adopts a global-local hierarchical structure for adaptive shrinkage in regression analysis. For each regression coefficient, \[\begin{align} \beta_k \mid \lambda_k, \tau &\sim \mathcal{N}(0, \tau^2\lambda_k^2), \quad k=1,2,...,K \\ \lambda_k &\sim \text{Cauchy}^+(0, 1) \\ \tau &\sim \text{Cauchy}^+(0, 1) \end{align}\] The global scale parameter \(\tau\) controls the overall shrinkage level, while the local scale parameters \(\lambda_k\) adjust the individual shrinkage of each coefficient.

By choosing \(D=\tau^2 \text{diag}\left( \lambda_1^2, \lambda_2^2,...,\lambda_K^2 \right)\) in Eq. 1, the sum-to-zero constraint can be achieved by assuming the following prior \[\begin{align} \boldsymbol{\beta} \mid D \sim \mathcal{N}\left(\boldsymbol{0}, \Sigma\right), ~ \Sigma=D - D \mathbf{1} \left( \mathbf{1}^\top D \mathbf{1} \right)^{-1} \mathbf{1}^\top D. \end{align}\]

The sum-to-zero constraint introduces negative correlations among the \(\beta_k\), which are reflected in the off-diagonal elements of the adjusted covariance matrix \(\Sigma\). The covariance between \(\beta_k\) and \(\beta_j\) is given by: \[\operatorname{Cov}(\beta_k, \beta_j) = -\tau^2 \frac{\lambda_k^2 \lambda_j^2}{\sum_{l=1}^K \lambda_l^2}, \quad j \neq k.\]

The marginal variance of each \(\beta_k\) is: \[\operatorname{Var}(\beta_k) = \tau^2 \left( \lambda_k^2 - \frac{\lambda_k^4}{\sum_{l=1}^K \lambda_l^2} \right)\]

Compared to the desired \(\tau^2 \lambda_k^2\), this variance is slightly reduced due to the constraint. The marginal variances can therefore be approximately compensated by multiplying \(D\) by the factor \(K / (K - 1)\); this correction is exact only when all \(\lambda_k\) are equal.

Figure 3: Sample from the sum-to-zero horseshoe prior
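The algorithm in Figure 3 is not reproduced here; as an illustration (a sketch under half-Cauchy scales, not necessarily the paper's implementation), the zero-sum conditioning can also be carried out by drawing \(\boldsymbol{\beta}_0 \sim \mathcal{N}(\boldsymbol{0}, D)\) and projecting, \(\boldsymbol{\beta} = \boldsymbol{\beta}_0 - D\mathbf{1}\left(\mathbf{1}^\top D\mathbf{1}\right)^{-1}\mathbf{1}^\top\boldsymbol{\beta}_0\), which reproduces the covariance \(\Sigma\) above exactly:

```python
import numpy as np

def sample_sum_to_zero_horseshoe(K, rng, n_draws=1):
    """Sum-to-zero horseshoe draw via a conditioning/projection step."""
    tau = np.abs(rng.standard_cauchy(size=(n_draws, 1)))        # global scale
    lam = np.abs(rng.standard_cauchy(size=(n_draws, K)))        # local scales
    d = (tau * lam) ** 2                                        # diagonal of D = tau^2 diag(lambda_k^2)
    # Optionally multiply d by K/(K-1) to compensate the marginal variances (see text).
    beta0 = rng.standard_normal((n_draws, K)) * np.sqrt(d)      # beta0 ~ N(0, D)
    return beta0 - d * (beta0.sum(axis=1, keepdims=True) / d.sum(axis=1, keepdims=True))

rng = np.random.default_rng(3)
beta = sample_sum_to_zero_horseshoe(K=8, rng=rng, n_draws=5)
print(np.allclose(beta.sum(axis=1), 0.0))                       # each draw sums to zero
```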

One can also separate out the global shrinkage parameter \(\tau\) via a non-centered parameterization before running the algorithm.

4.4 Horseshoe-like prior

The proposed method is also applicable to variants of the horseshoe prior. The regularized horseshoe prior (RHS) [2] is a class of hierarchical priors designed to handle sparsity in high-dimensional coefficient vectors. In contrast to the traditional horseshoe prior, the RHS improves model robustness by imposing moderate regularization on large coefficients through an adjustment of the slab width. The RHS is formulated as \[\begin{align} \beta_k \mid \zeta_k, c, \tau &\sim \mathcal{N}\left( 0, \tau^2 \tilde{\zeta}_k^2 \right), \quad k=1,2,...,K \\ \zeta_k &\sim \text{Student-}t^+_{\nu_1} (0,1) \\ c^2 &\sim \text{Inv-Gamma}(\nu_2/2, \nu_2s^2/2) \\ \tau &\sim \text{Student-}t^+_{\nu_3} (0, \tau_0) \end{align}\] where \(\tilde{\zeta}_k^2=\frac{c^2 \zeta_k^2}{c^2 + \tau^2 \zeta_k^2}\) and \(\tau_0 = \frac{p_0}{K-p_0} \frac{\tilde{\sigma}}{\sqrt{n}}\). Here \(p_0\in \left\{ 1,...,K-1 \right\}\) is a hyperparameter describing the prior belief about the effective number of non-zero coefficients, and \(\tilde{\sigma}^2\) is the pseudo-variance defined by the likelihood and link function.

By choosing \(D=\tau^2 \text{diag}\left( \tilde{\zeta}_1^2, \tilde{\zeta}_2^2,...,\tilde{\zeta}_K^2 \right)\) in Eq. 1, the sum-to-zero constraint can be achieved by assuming the following prior \[\begin{align} \boldsymbol{\beta} \mid D \sim \mathcal{N}\left(\boldsymbol{0}, \Sigma\right), ~ \Sigma=D - D \mathbf{1} \left( \mathbf{1}^\top D \mathbf{1} \right)^{-1} \mathbf{1}^\top D. \end{align}\]

When sampling, one only needs to apply the non-centered parameterization \[\boldsymbol{\beta} = \tau \boldsymbol{\beta}^*\] and then use Algorithm 3 to sample \(\boldsymbol{\beta}^*\).
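A brief sketch of how the RHS scales enter \(D\), combined with the same zero-sum conditioning step used for the horseshoe above (the hyperparameter defaults below are arbitrary placeholders, not values recommended in the text):

```python
import numpy as np

def sample_sum_to_zero_rhs(K, tau0, s2, nu1=3.0, nu2=4.0, nu3=3.0, rng=None, n_draws=1):
    """Sum-to-zero regularized horseshoe draw (prior scales as defined in the text)."""
    rng = rng if rng is not None else np.random.default_rng()
    zeta = np.abs(rng.standard_t(nu1, size=(n_draws, K)))                # zeta_k ~ Student-t+_{nu1}(0,1)
    c2 = 1.0 / rng.gamma(nu2 / 2, 2.0 / (nu2 * s2), size=(n_draws, 1))   # c^2 ~ Inv-Gamma(nu2/2, nu2 s^2/2)
    tau = tau0 * np.abs(rng.standard_t(nu3, size=(n_draws, 1)))          # tau ~ Student-t+_{nu3}(0, tau0)
    zeta_tilde2 = c2 * zeta**2 / (c2 + tau**2 * zeta**2)                 # slab-regularized local scales
    d = tau**2 * zeta_tilde2                                             # diagonal of D
    beta0 = rng.standard_normal((n_draws, K)) * np.sqrt(d)               # beta0 ~ N(0, D)
    return beta0 - d * (beta0.sum(axis=1, keepdims=True) / d.sum(axis=1, keepdims=True))
```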

5 Supplementary materials

5.1 Supplementary proofs

Lemma 1. The matrix \(A D A^\top\) is invertible.

Proof. Since \(\lambda_k>0, k=1,2,...,K\) are standard deviations of normal distributions, \(D\) is invertible.

Let \(\tilde{A} := A D^{1/2} \in \mathbb{R}^{J \times K}\). Right-multiplying \(A\) by the invertible matrix \(D^{1/2}\) does not change its rank: \[\text{rank}(\tilde{A}) = \text{rank}(A D^{1/2}) = \text{rank}(A) = J.\] Note that \[A D A^\top = (A D^{1/2})(A D^{1/2})^\top = \tilde{A} \tilde{A}^\top,\] so \[\text{rank}(A D A^\top) = \text{rank}(\tilde{A} \tilde{A}^\top) = \text{rank}(\tilde{A}) = \text{rank}(A) = J.\]

Therefore, \(A D A^\top \in \mathbb{R}^{J \times J}\) is of full rank and hence invertible. ◻

Lemma 2. The covariance matrix \[\Sigma^* = D - D A^\top (A D A^\top)^{-1} A D.\] is positive semi-definite.

Proof. Observe that \[\Sigma^* = D - D A^\top \left( A D A^\top \right)^{-1} A D = D^{1/2}\Bigl[I - D^{1/2} A^\top \left( A D A^\top \right)^{-1} A D^{1/2}\Bigr]D^{1/2}.\]

Let \[P = D^{1/2} A^\top \left( A D A^\top \right)^{-1} A D^{1/2}.\] Since \(A\) has full row rank and \(D\) is a diagonal matrix with positive diagonal elements, \(ADA^\top\) is invertible (Lemma 1), so \(P\) is well-defined.

We show that \(P\) is an orthogonal projection matrix:

Symmetry: \(P^\top = P\).

Idempotence: \[P^2 = D^{1/2} A^\top \left( A D A^\top \right)^{-1}\left(A D^{1/2} D^{1/2} A^\top\right) \left( A D A^\top \right)^{-1}A D^{1/2} = D^{1/2} A^\top \left( A D A^\top \right)^{-1}A D^{1/2} = P.\]

Since \(P\) is a symmetric and idempotent projection matrix, all eigenvalues of \(P\) and \(I-P\) can only be 1 or 0. Hence \(I-P\) is positive semidefinite.

Finally, \[\Sigma^* = D^{1/2} \Bigl[I-P\Bigr] D^{1/2}.\] Since \(D^{1/2}=\operatorname{diag}\left(\lambda_1,\lambda_2,\ldots,\lambda_K\right)\) is positive definite, \(\Sigma^*\) is congruent to \(I-P\): for any \(x\in\mathbb{R}^K\), \(x^\top \Sigma^* x = (D^{1/2}x)^\top (I-P)(D^{1/2}x) \geq 0\). Therefore, \(\Sigma^*\) is positive semi-definite. ◻

Lemma 3. If the row rank of \(A\) is \(J\), the covariance matrix \(\Sigma^*\) is of rank \(K-J\). Consequently, for any single linear constraint (including the sum-to-zero constraint), the covariance matrix \(\Sigma^*\) is of rank \(K-1\).

Proof. Since \(P\) is a symmetric and idempotent projection matrix, its rank equals the dimension of the subspace onto which it projects (i.e., the number of eigenvalues equal to 1), which is the column space of \(D^{1/2}A^\top\). For \(A\in\mathbb{R}^{J\times K}\), we have \[\operatorname{rank}(P) = \operatorname{rank}(D^{1/2}A^\top) = \operatorname{rank}(A^{\top}) = J.\]

The matrix \(I-P\) is also a projection matrix with eigenvalues 0 or 1. The multiplicity of its eigenvalue 0 (respectively 1) equals the multiplicity of the eigenvalue 1 (respectively 0) of \(P\). Since the rank of \(P\) is \(J\), the rank of \(I - P\) is \(K - J\).

Finally, since \(D^{1/2}\) is a full-rank matrix, \(I-P\) and \(D^{1/2}[I - P]D^{1/2}\) are congruent, so they have the same rank: \[\operatorname{rank}(\Sigma^*) = \operatorname{rank}\left(D^{1/2}[I - P]D^{1/2}\right) = \operatorname{rank}(I - P) = K - J.\]

In particular, the sum-to-zero constraint corresponds to \(\operatorname{rank}(A)=1\), so \(\operatorname{rank}(\Sigma^*)=K-1\). ◻
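A quick numerical sanity check of Lemmas 1–3 for a randomly generated full-row-rank \(A\) (an illustrative sketch, not part of the original material):

```python
import numpy as np

rng = np.random.default_rng(4)
K, J = 7, 2
A = rng.standard_normal((J, K))                       # full row rank with probability 1
D = np.diag(rng.uniform(0.5, 2.0, size=K))            # positive prior variances

ADA = A @ D @ A.T
Sigma = D - D @ A.T @ np.linalg.solve(ADA, A @ D)

print(np.linalg.matrix_rank(ADA) == J)                # Lemma 1: A D A^T is invertible
print(np.all(np.linalg.eigvalsh(Sigma) > -1e-10))     # Lemma 2: Sigma* is positive semi-definite
print(np.linalg.matrix_rank(Sigma) == K - J)          # Lemma 3: rank(Sigma*) = K - J
```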

Lemma 4. The transformed low-dimensional covariance matrix \[\Omega = M^\top \Sigma^* M \in \mathbb{R}^{(K-J)\times (K-J)},\] where the columns of \(M\in\mathbb{R}^{K\times(K-J)}\) form an orthonormal basis of the null space of \(A\), is positive definite, so that we can perform a Cholesky decomposition.

Proof. Let \(z \in \mathbb{R}^{K-J}\) with \(z \neq 0\), and let \(v = M z \in \mathbb{R}^K\). Since the columns of \(M\) are linearly independent and \(z \neq 0\), we have \(v \neq 0\) and \(A v = 0\).

\[\begin{align} z^\top \Omega z &= z^\top M^\top \Sigma^* M z \\ &= v^\top \Sigma^* v \\ &= v^\top D v - v^\top D A^\top (A D A^\top)^{-1} A D v. \end{align}\]

Denote \(w = D^{1/2} v\), where \(D\) is a symmetric positive definite matrix. Then

\[\begin{align} v^\top D v &= w^\top w, \\ v^\top D A^\top (A D A^\top)^{-1} A D v &= w^\top P w, \end{align}\] where \(P = D^{1/2} A^\top (A D A^\top)^{-1} A D^{1/2}\).

We note that \(P\) is an orthogonal projection matrix, \[P^2 = P, \quad P^\top = P,\] which projects \(\mathbb{R}^K\) onto the column space of \(N = D^{1/2}A^\top\).

Hence \[v^\top \Sigma^* v = w^\top (I - P) w.\] Since \(I - P\) is the orthogonal projection onto \(\text{Im}(P)^\perp\), it is positive semi-definite: \[w^\top (I - P) w \geq 0.\]

Assume there exists a non-zero \(z\) such that \(z^\top \Omega z = 0\), that is:

\[w^\top (I - P) w = 0.\]

Since \(I - P\) is a projection matrix, the equation holds if and only if \(w \in \text{Im}(P)\). In this case, there exists \(u \in \mathbb{R}^{J}\) such that:

\[w = D^{1/2} A^\top u.\]

However, \(w = D^{1/2} v\), and \(v \in N(A)\), so:

\[0 = A v = A D^{-1/2} w = A D^{-1/2} (D^{1/2} A^\top u) = A A^\top u.\]

Since \(A\) is full row rank, \(A A^\top\) is invertible, hence \(u = 0\), resulting in \(w = 0\), and thus \(v = D^{-1/2} w = 0\), which contradicts \(v = M z \neq 0\).

Therefore, for any non-zero \(z \in \mathbb{R}^{K-J}\), \(z^\top \Omega z > 0\), hence \(\Omega\) is a positive definite matrix. ◻

5.2 Supplementary algorithms

Figure 4: Compute orthonormal basis for \(N(A)\)
Figure 5: Compute orthonormal basis for \(N(A)\), \(A=(1,1,...,1)\in \mathbb{R}^{1\times K}\)
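For concreteness, two possible ways of computing such a basis in NumPy/SciPy, presumably along the lines of Figures 4–5 (a sketch; the figures' actual contents are not shown here):

```python
import numpy as np
from scipy.linalg import null_space

# Generic case: orthonormal basis of N(A) via the SVD.
A = np.array([[1., 0., -1., 0.],
              [0., 1., 0., -1.]])
M = null_space(A)                                        # columns orthonormal, A @ M = 0

# Sum-to-zero case: a full QR decomposition of the all-ones vector gives an
# orthogonal Q whose last K-1 columns span the zero-sum subspace N(1^T).
K = 5
Q, _ = np.linalg.qr(np.ones((K, 1)), mode='complete')
M_ones = Q[:, 1:]                                        # K x (K-1) orthonormal basis
print(np.allclose(np.ones((1, K)) @ M_ones, 0.0))
```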

References

[1]
C. M. Carvalho, N. G. Polson, and J. G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480, 2010. ISSN 0006-3444, 1464-3510.
[2]
Juho Piironen and Aki Vehtari. Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2), 2017. ISSN 1935-7524.