Theory of Scaling Laws for In-Context Regression: Depth, Width, Context and Time

Blake Bordelon\(^\spadesuit\), Mary Letey\(^{\diamondsuit \heartsuit}\), Cengiz Pehlevan\(^{\diamondsuit \heartsuit}\)
Center for Mathematical Sciences and Applications\(^\spadesuit\)
John A. Paulson School of Engineering and Applied Sciences\(^\diamondsuit\)
Kempner Institute for the Study of Natural and Artificial Intelligence \(^{\heartsuit}\),
Harvard University


Abstract

We study in-context learning (ICL) of linear regression in a deep linear self-attention model, characterizing how performance depends on various computational and statistical resources (width, depth, number of training steps, batch size, and data per context). In a joint limit where data dimension, context length, and residual stream width scale proportionally, we analyze the limiting asymptotics for three ICL settings: (1) isotropic covariates and tasks (ISO), (2) fixed and structured covariance (FS), and (3) randomly rotated and structured covariances (RRS). For the ISO and FS settings, we find that depth only aids ICL performance if context length is limited. In contrast, in the RRS setting where covariances change across contexts, increasing the depth leads to significant improvements in ICL, even at infinite context length. This provides a new solvable toy model of neural scaling laws that depends on both the width and depth of a transformer and predicts an optimal transformer shape as a function of compute. This toy model enables computation of exact asymptotics for the risk as well as derivation of powerlaws under source/capacity conditions for the ICL tasks.

1 Introduction↩︎

In recent years, transformer models have become the backbone architecture for modern machine learning systems [1], [2]. The transformer architecture consists of alternating self-attention and multi-layer perceptron blocks in a deep residual network. For a variety of tasks, increasing the size of transformers by increasing the width and depth of the model leads to empirically predictable improvements in model performance [3][6]. Various protocols for scaling up the width and depth of transformers have been developed that provide stable training limits [7][10]. One common strategy is to scale width \(N\) and depth \(L\) linearly with fixed aspect ratio \(L/N\) [11], [12]; however, existing theory cannot justify such a practice as compute optimal, nor can it distinguish the relative performance gains from width and depth beyond the total parameter count. This leads us to our first open question:

How should width and depth scale under a compute budget in a transformer? Do transformer scaling laws only depend on total parameters or have different width and depth dependence?

In this work, we explore this question for in-context learning (ICL) problems, specifically in-context linear regression problems. In-context learning refers to the ability of models to condition their outputs on the past sequence of inputs provided by the user without updating model parameters [13]. This contrasts with in-weight learning (IWL) where information about the dataset is encoded in the model parameters during pre-training. Whereas many prior theoretical works on neural scaling laws essentially analyze the role of width, pretraining data, or pretraining time [14][20], our work explores how depth, width, context length, and pretraining time influence the quality of ICL in self-attention models. In our investigations, we find that the architectural requirements to perform ICL regression depend significantly on the statistics of the pretraining ICL tasks. To address the first question, we are forced to investigate:

How does the statistical structure of ICL tasks provided during pretraining influence the nature of the learned solution? Do these influence the optimal width/depth ratios?

To address these questions, we develop a solvable model of deep linear attention for three distinct ICL data covariance structures. Our concrete novel contributions and findings are as follows:

1.1 Contributions↩︎

  1. We identify a new asymptotic scaling for in-context regression with linear attention where context length \(P\), number of masked points per context \(K\), and contexts per step \(B\) all scale linearly with dimension \(D\). This reduces the amount of compute and total data (by a factor of \(D\)) required to converge compared to prior work [21] and speeds up simulations.

  2. We introduce three distinct ICL data models of increasing generality. We study isotropic input data and task vectors (ISO). We then proceed to analyze fixed and structured (FS) covariances for data and task vectors. Lastly, we examine an ICL distribution where data covariances are randomly rotated and structured (RRS) across contexts.

  3. We show with our asymptotic theory that depth \(L\) benefits the first two data settings (ISO + FS) only if \(\alpha\equiv \text{context length}/\text{data dimension}\) is finite. In the limit of \(\alpha \to \infty\), there is no benefit from depth for these ICL tasks. Further, the solution learned in the FS setting is brittle with respect to changes in the data covariance at test time. In the RRS setting, by contrast, increasing depth is beneficial even as \(\alpha \to \infty\), giving rise to a non-trivial joint width and depth scaling.

  4. We introduce a model width \(N\) bottleneck to our ICL model and study the pretraining scaling law for a model trained for \(t\) steps on a width \(N\) and depth \(L\) linear attention model in the RRS setting. For powerlaw data, the scaling law takes the form of a Chinchilla scaling law but with width and depth contributing separate terms \[\begin{align} \mathcal{L}(t,N,L,P) = c_t \; t^{-\beta_t} + c_N \; N^{-\beta_N} + c_L \; L^{-\beta_L} +c_P \; P^{-\beta_P} . \end{align}\]

  5. From this scaling law, we consider compute optimal joint scaling of width and depth. Depending on the structure of the ICL covariates, we obtain different scalings of \(L \sim N^{ \nu }\) where \(\nu\) depends on properties of the data.

1.2 Related Works↩︎

1.2.0.1 Infinite Neural Network Width and Depth Limits and Commonly Used Joint Scalings.

The empirical fact that larger networks tend to perform better on natural data [3][5] has led to the development of scaling procedures that stably increase the size of a model during optimization. The mean-field or \(\mu\)P scaling theory [22][26] allows one to scale up the width of a model in a way that admits a feature-learning infinite-width limit. Further, this scaling protocol provides consistent optimal hyperparameters while delivering monotonic improvements in performance [7]. The same program has been carried out for deep residual models, such as transformers [8], [9], [12], [27]. However, while these infinite width and infinite depth scaling limits have been established to exist and perform better than finite models, no theory currently captures the relative gains in performance from scaling up width or depth at fixed compute. Understanding compute optimal shapes could help guide architectural choices when training large transformer models.

1.2.0.2 Theories of Compute Optimal Neural Scaling Laws.

Following the empirical scaling law results of [4] and [5], many theoretical works have examined the generalization theory for fully trained kernel methods under power law features [14], [15], [28][33]. More recently, several efforts have begun to incorporate SGD dynamics into these models to gain a notion of compute (and compute optimal tradeoffs between parameters and training time) [17][20], [34], [35]. In these works, the model is essentially one or two layers and the notion of finite model size is introduced with a random projection of the features to an \(N\) dimensional space. In this way, existing theories more closely resemble scaling laws for width rather than a comparison where depth and width serve different functions. Recent empirical works have pointed out the utility of increasing depth (or virtual depth through looping) for tasks requiring reasoning such as solving Sudoku puzzles [36] and solving math problems [37], [38]. [39] study regular language recognition where a clear computational advantage to scaling depth instead of width is established and experimentally verified.

1.2.0.3 Empirical Studies of ICL.

[13] demonstrated that large pretrained language models such as GPT-3 exhibit remarkable in-context learning capabilities for natural language tasks. Following this finding, [40] empirically investigated the ability of transformers to learn simple function classes such as linear regression, sparse regression, and two-layer neural networks. Various works have studied emergence of ICL beyond language tasks, focusing on what data structures (from dynamical systems to classification problems) can be learned in-context [41][46]. Several works have offered experimental evidence that ICL implements Bayesian inference [47][51]. Further investigations questioned whether attention is even strictly necessary for ICL by studying the performance of MLPs on these tasks [52], [53].

1.2.0.4 Theoretical Studies of ICL.

Inspired by empirical studies of ICL, many works have theoretically investigated how ICL algorithms such as in-context gradient descent can be implemented in transformers [54][61]. Some works have demonstrated the flexibility of the transformer to adapt to changing statistics [62] or to perform model selection [63]. Studies have pointed out the need for sufficient pretraining task diversity for ICL generalization [64], [65], which was theoretically analyzed in an asymptotic scaling limit for a shallow linear attention model by [21]. Extensions to structured covariances and distribution shifts revealed the importance of train-test task alignment [66]. [67] investigate the need for a sufficient number of residual stream steps (obtained either by increasing true depth or by looping attention layers) to solve ICL distributions spanning multiple condition numbers, with larger condition numbers requiring more depth. While many of these works study a final construction of weights, recent theory has described the training dynamics of one-layer linear multi-head attention [68].

2 Data, Architecture and Reduced Model↩︎

2.1 Deep Linear Attention Architecture↩︎

The most general model we study is a depth \(L\), residual linear attention model \(f\) that maps input contexts to output predictions. Each context consists of \(P\) input-output pairs \(\{ (\boldsymbol{x}_\mu, y_\mu) \}_{\mu=1}^P\) and \(K\) evaluation points \(\{ (\boldsymbol{x}^\star_\mu, *) \}_{\mu=P+1}^{P+K}\) (which do not carry target outputs), arranged into a data matrix \(\boldsymbol{D}\) \[\begin{align} \boldsymbol{D} = \begin{bmatrix} \boldsymbol{x}_1 & ... & \boldsymbol{x}_P & \boldsymbol{x}_{P+1} & ... & \boldsymbol{x}_{P+K} \\ y_1 & ... & y_P & * & ... & * \end{bmatrix} \end{align}\] where \(*\) indicates masked target values on the \(K\) evaluation points, which are provided as \(0\) entries. The evaluation tokens \(\mu \in \{ P+1,... ,P+K\}\) are prevented from contributing to the attention updates by a positional masking matrix \(M_{\mu\nu}\) (Appendix 8). The model \(f_\mu\) is computed from \[\begin{align} &\boldsymbol{h}^1_\mu = \boldsymbol{W}_x \boldsymbol{x}_\mu + \boldsymbol{w}_y y_\mu, \quad \quad \boldsymbol{h}^{\ell+1}_{\mu} = \boldsymbol{h}^{\ell}_\mu + \frac{1}{L P} \sum_{\nu=1}^P \left( \boldsymbol{k}^\ell_\nu \cdot \boldsymbol{q}^\ell_\mu \right) \boldsymbol{v}^\ell_\nu \;, \;\ell \in [L] \;, \; \mu \in [P+K] \nonumber \\ &\boldsymbol{q}^\ell_\mu = \boldsymbol{W}_q^\ell \boldsymbol{h}^{\ell}_\mu \;, \;\boldsymbol{k}^\ell_\mu = \boldsymbol{W}_k^\ell \boldsymbol{h}^{\ell}_\mu \;, \;\boldsymbol{v}^\ell_\mu = \boldsymbol{W}_v^\ell \boldsymbol{h}^\ell_\mu, \quad \quad f_\mu = \boldsymbol{w}_{o} \cdot \boldsymbol{h}^L_\mu \end{align}\] The loss function for context \(\boldsymbol{D}\) is \(\mathcal{L}({\boldsymbol{D}}) = \frac{1}{K} \sum_{\mu=P+1}^{P+K} ( f_{\mu} - y_\mu )^2\) and the full population loss is \(\mathcal{L} = \mathbb{E}_{\boldsymbol{D}} \mathcal{L}(\boldsymbol{D})\), where the expectation is over the distribution of context matrices \(\boldsymbol{D}\). We stress that the operation \(\boldsymbol{k}^\ell_\nu \cdot \boldsymbol{q}^\ell_\mu\) corresponds to linear attention rather than the more commonly used softmax attention. For the regression tasks we consider, this model is sufficient to solve the ICL task [55], [56] and aids theoretical tractability [21], [68].
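For concreteness, the forward pass above can be written in a few lines of numpy. The following is a minimal sketch (ours) with a single head and weights tied across layers; instead of implementing the signed mask of Appendix 8 we simply restrict the attention sum to the \(P\) labeled tokens, and all function and variable names are assumptions for illustration.

```python
import numpy as np

def deep_linear_attention(X, y, X_star, L, Wx, wy, Wk, Wq, Wv, wo):
    """Forward pass of the depth-L residual linear attention model.
    X: (D, P) labeled inputs, y: (P,) labels, X_star: (D, K) masked queries.
    Wx: (N, D) read-in; wy, wo: (N,) label read-in / readout; Wk, Wq, Wv: (N, N).
    Returns the K predictions f_mu on the evaluation tokens."""
    D, P = X.shape
    # residual stream h^1 for all P + K tokens; masked targets enter as zeros
    H = Wx @ np.concatenate([X, X_star], axis=1)        # (N, P + K)
    H[:, :P] += np.outer(wy, y)
    for _ in range(L):
        Km, Q, V = Wk @ H, Wq @ H, Wv @ H
        A = Q.T @ Km[:, :P]                             # A[mu, nu] = q_mu . k_nu
        H = H + (A @ V[:, :P].T).T / (L * P)            # attend to labeled tokens only
    return wo @ H[:, P:]
```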

2.2 Recurrent Reduced-\(\Gamma\) Model↩︎

Following prior works on ICL in linear regression [21], [54], [65], [68], we examine the minimal reparameterization of the above model that can solve this task, in which the residual stream encodes the \(\boldsymbol{x}\) information in a subspace orthogonal to \(\boldsymbol{w}_y = \boldsymbol{w}_o\), so that \(\boldsymbol{W}_x^\top \boldsymbol{w}_y = 0\) and \(\boldsymbol{W}_v \propto \boldsymbol{w}_y \boldsymbol{w}_y^\top\). Further, instead of optimizing separate hidden weight matrices, we consider looped / universal transformers where the attention blocks are identical across layers, \(\boldsymbol{W}^\ell_i = \boldsymbol{W}^{\ell'}_{i}\) for \(i \in \{k, q, v\}\) [67], [69], though we relax this assumption in Section 5.1 and Appendix 13.1. The reduced model then defines the predictor \(f(\boldsymbol{x}_\star)\) for a test point \(\boldsymbol{x}_\star\) in terms of a single matrix \(\boldsymbol{\Gamma} \in \mathbb{R}^{D\times D}\) \[\begin{align} \boldsymbol{\Gamma} &\equiv \left( \boldsymbol{w}_o^\top \boldsymbol{W}_v \boldsymbol{w}_y \right) \boldsymbol{W}_x^\top \boldsymbol{W}_k^\top \boldsymbol{W}_q \boldsymbol{W}_x \;, \;f(\boldsymbol{x}_\star) = \frac{1}{L P} \boldsymbol{x}_\star^\top \boldsymbol{\Gamma} \sum_{\ell=0}^{L-1} \left( \boldsymbol{I} - L^{-1} \hat{\boldsymbol{\Sigma}} \boldsymbol{\Gamma} \right)^\ell \boldsymbol{X} \boldsymbol{y} \end{align}\] where \(\boldsymbol{X} = [\boldsymbol{x}_1,...,\boldsymbol{x}_P] \in \mathbb{R}^{D \times P}\) collects the labeled inputs and \(\hat{\boldsymbol{\Sigma}} = \frac{1}{P}\boldsymbol{X} \boldsymbol{X}^\top\) is the empirical covariance. Due to the simplicity of this model and its loss landscape structure, we first focus on the gradient dynamics obtained by performing gradient descent directly on the matrix \(\boldsymbol{\Gamma}\). Then, in Section 5.2, we consider learning dynamics on the decoupled collection of parameters \(\{ \boldsymbol{W}_{i} \}_{i\in\{ k,q,v \}}\).
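The reduced predictor can be evaluated directly. Below is a minimal numpy sketch (our own function names) that accumulates the matrix-power sum iteratively rather than forming each power explicitly:

```python
import numpy as np

def reduced_gamma_predict(Gamma, X, y, x_star, L):
    """f(x*) = (1/LP) x*^T Gamma sum_{l<L} (I - Sigma_hat Gamma / L)^l X y."""
    D, P = X.shape
    Sigma_hat = X @ X.T / P                 # empirical covariance (D, D)
    M = np.eye(D) - Sigma_hat @ Gamma / L
    acc, cur = np.zeros(D), X @ y
    for _ in range(L):
        acc, cur = acc + cur, M @ cur       # accumulate sum_l M^l (X y)
    return x_star @ Gamma @ acc / (L * P)
```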

3 Learning Curves in Reduced Linear Attention Model↩︎

3.1 Isotropic Covariates and Tasks (ISO)↩︎

To start our investigation, we consider \(D\)-dimensional isotropic data and an isotropic task distribution. For each context \(\boldsymbol{D}_c\) and each data point \(\mu \in [P]\), the distributions of \(\boldsymbol{x}\) and \(y\) are \[\begin{align} \boldsymbol{x}_{\mu, c} \sim \mathcal{N}\left(0, {\boldsymbol{I}}\right) \quad \quad \boldsymbol{\beta}_c \sim \mathcal{N}(0,{\boldsymbol{I}}) \;, \;y_{\mu c} = \frac{1}{\sqrt D} \boldsymbol{\beta}_c \cdot \boldsymbol{x}^\mu_c + \sigma \epsilon^\mu_c \;, \;\epsilon^\mu_c \sim \mathcal{N}(0,1) . \end{align}\] In Appendix 9.1, we analyze SGD with a batch of \(B\) contexts sampled at each step in the proportional asymptotics \(P,K,B,D \to \infty \;, \;P/D = \alpha \;, \;K/D = \kappa \;, \;B/D = \tau\), establishing that successful pretraining requires a total of \(B t = \Theta(D)\) contexts, each of size \(P = \Theta(D)\).
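As a concrete illustration of this proportional regime, a single ISO context can be sampled as follows (a small sketch; the particular values of \(\alpha\), \(\kappa\), and \(\sigma\) are placeholders of ours):

```python
import numpy as np

def sample_iso_context(D, alpha=2.0, kappa=1.0, sigma=0.1, seed=0):
    """One ISO context: P = alpha*D labeled pairs and K = kappa*D masked queries."""
    rng = np.random.default_rng(seed)
    P, K = int(alpha * D), int(kappa * D)
    beta = rng.standard_normal(D)                      # task vector for this context
    X = rng.standard_normal((D, P))                    # isotropic labeled inputs
    X_star = rng.standard_normal((D, K))               # isotropic evaluation inputs
    y = X.T @ beta / np.sqrt(D) + sigma * rng.standard_normal(P)
    y_star = X_star.T @ beta / np.sqrt(D)              # clean targets, used only for the loss
    return X, y, X_star, y_star
```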

Figure 1: Deep linear self attention models trained with SGD on the ICL task with isotropic covariates with \(D = 32\). (a) Training dynamics for varying \(\alpha\). (b) Increasing depth \(L\) can improve ICL predictions, especially for \(\alpha \approx 1\). (c) The final loss is well predicted by the theory of \(L\) steps of gradient descent with optimal learning rate for each \((\alpha,L)\) pair.

In the gradient flow limit, when training from initial condition \(\boldsymbol{\Gamma} =0\), ICL pretraining on isotropic data at any depth \(L \geq 1\) reduces to tracking a scaled identity matrix \[\begin{align} \boldsymbol{\Gamma}(t) = \gamma(t) \boldsymbol{I} . \end{align}\] Consequently, the gradient flow can be expressed as an evolution of the scalar \(\gamma(t)\) \[\begin{align} \frac{d}{dt} \gamma(t) &= \text{tr} \left< \boldsymbol{\hat{\Sigma}} \left( \boldsymbol{I} - L^{-1} \gamma(t) \boldsymbol{\hat{\Sigma}} \right)^{2L-1} \right> = \int d\lambda \rho(\lambda) \lambda \left[ 1 - L^{-1} \gamma(t) \lambda \right]^{2L-1} \end{align}\] where the average is over the empirical covariance \(\boldsymbol{\hat{\Sigma}} = \frac{1}{P}\boldsymbol{X} \boldsymbol{X}^\top\) with Marchenko-Pastur eigenvalue distribution \(\rho(\lambda)\). The final loss can be interpreted as the MSE loss for \(L\) steps of GD with optimal step size \[\begin{align} \mathcal{L}_\star(\alpha) &= \min_{\gamma} \;\text{tr} \left< \left( \boldsymbol{I} - L^{-1}\gamma \hat{\boldsymbol{\Sigma}} \right)^{2L} \right> = \min_{\gamma} \int d\lambda \, \rho(\lambda) \left( 1 - L^{-1} \gamma \lambda \right)^{2L} . \end{align}\]
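The final loss \(\mathcal{L}_\star(\alpha)\) can be estimated numerically by replacing the Marchenko-Pastur integral with the eigenvalues of a sampled Wishart matrix. The following is a Monte Carlo sketch of ours (the exact theory uses the limiting density):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def iso_final_loss(alpha, L, D=500, seed=0):
    """min_gamma E[(1 - gamma*lambda/L)^(2L)] over approximate Marchenko-Pastur eigenvalues."""
    rng = np.random.default_rng(seed)
    P = int(alpha * D)
    X = rng.standard_normal((D, P))
    lam = np.linalg.eigvalsh(X @ X.T / P)            # eigenvalues ~ Marchenko-Pastur(alpha)
    loss = lambda g: np.mean((1 - g * lam / L) ** (2 * L))
    return minimize_scalar(loss, bounds=(0.0, 2.0 * L), method="bounded").fun

# depth helps at finite context length: iso_final_loss(1.0, 1) > iso_final_loss(1.0, 8)
```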

For \(L=1\), the loss saturates to \(\mathcal{L}_\star = (1 + \alpha)^{-2}\), while for \(L\to\infty\) it approaches \(\mathcal{L}_\star = [1 - \alpha]_+\), which illustrates the gap in performance between a shallow model and a large-depth model. We illustrate loss curves for varying depth \(L\) compared to pretrained linear transformers (dots) in Figure 1 (c). We extend to the case with label noise \(\sigma^2>0\) in Appendix 9.2. In this case, the optimal \(\gamma\) is smaller since early stopping acts as an effective regularizer [70], [71].

Figure 2: The loss landscape for the reduced \(\Gamma\) model with \(\boldsymbol{\Gamma}=\gamma \boldsymbol{I}\) corresponding to the gradient flow limit. This limit is equivalent to optimal step size selection for in-context GD. (a)-(b) The effect of depth \(L\) and context length \(\alpha\) on the loss. (c) Larger noise \(\sigma\) decreases the optimal \(\gamma\).

If \(\alpha = P/D \to \infty\) then \(L=1\) achieves the minimal ICL loss (among any depth \(L\) models). Any larger depth \(L \geq 2\) achieves the same (zero if \(\sigma^2 = 0\)) loss in the \(\alpha \to \infty\) limit.

3.2 Fixed Structured Covariance (FS) \(\to\) Preconditioned Gradient Descent↩︎

Next, we let the population covariance be arbitrary \(\left< \boldsymbol{x} \boldsymbol{x}^\top \right> = \boldsymbol{\Sigma}\) across all contexts and task correlations be given by the matrix \(\left< \boldsymbol{\beta} \boldsymbol{\beta}^\top \right> = \boldsymbol{\Omega}\). The ICL population loss takes the form \[\begin{align} \mathcal{L} = \text{tr} \;\boldsymbol{\Omega} \left< \left[\left( \boldsymbol{I} - L^{-1} \boldsymbol{\Gamma} \hat{\boldsymbol{\Sigma}} \right)^L \right]^\top \boldsymbol{\Sigma} \left( \boldsymbol{I} - L^{-1} \boldsymbol{\Gamma} \hat{\boldsymbol{\Sigma}} \right)^L \right> \end{align}\]

When the ICL distribution involves fixed covariance across contexts, there is no benefit to increasing depth \(L\) beyond \(L=1\) in the large context \(\alpha \to \infty\) limit. For any \(L \geq 1\), zero ICL loss can be achieved in the \(\alpha \to \infty\) limit by setting \(\boldsymbol{\Gamma} = L \boldsymbol{\Sigma}^{-1}\).

We support this finding in Figure 3 (b) where we show that small depth models are not outperformed by deeper models even after very long training horizons. When the ICL pretraining distribution for \(\boldsymbol{\Sigma}\) is fixed, the model will memorize statistical information about the covariance of the inputs from the pretraining distribution. By preconditioning with the inverse of the data covariance \(\boldsymbol{\Sigma}^{-1}\), the model is capable of achieving zero loss after even a single step of in-context GD. The gradient flow dynamics of the \(\boldsymbol{\Gamma}\) matrix can be decomposed further in the case where \(\boldsymbol{\Omega}\) and \(\boldsymbol{\Sigma}\) commute.

Figure 3: Pretraining on FS ICL covariates leads to a solution that does not require depth but is brittle to distribution shift. (a) Evolution of the eigenvalues \(\gamma_k(t)\) of the \(\boldsymbol{\Gamma}(t)\) matrix for depth \(L = 4\) as a function of pretraining time \(t\) compared with infinite depth \(L \to \infty\) theory (dashed black). (b) For powerlaw covariates, all depth models converge as a power law in \(t\). There is no asymptotic benefit to increasing depth beyond \(L=1\). (c) The ICL solution obtained when training from fixed covariance is brittle to changes in the covariance \(\boldsymbol{\Sigma} \to \exp( - \theta \boldsymbol{S}) \boldsymbol{\Sigma} \exp( \theta \boldsymbol{S})\).

Suppose that \(\boldsymbol{\Omega}\) and \(\boldsymbol{\Sigma}\) are codiagonalizable with respective eigenvalues \(\{ \omega_k \}\) and \(\{ \lambda_k \}\). Then, when training from zero initialization, \(\boldsymbol{\Gamma}\) is diagonal in the same basis with eigenvalues \(\gamma_k(t)\) that obey the following dynamics (see Appendix 10) \[\begin{align} \frac{d}{dt} \gamma_k(t) = \omega_k \lambda_k^2 \left( 1- L^{-1} \lambda_k \gamma_k(t) \right)^{2L-1} . \end{align}\] In the large depth limit \(L \to \infty\) these dynamics have solution \(\gamma_k(t) = \frac{1}{2 \lambda_k} \ln(1+ 4 \omega_k \lambda_k^3 t )\) generating the loss dynamics \(\lim_{L \to\infty} \mathcal{L}(t, L) = \sum_k \frac{\omega_k \lambda_k}{ 1 + 4 \omega_k \lambda_k^3 \;t }\). Under powerlaw source/capacity conditions where \(\lambda_k \sim k^{-\nu}\) and \(\sum_{\ell > k} \lambda_\ell \omega_\ell \sim k^{- \nu \beta}\), the ICL loss scales as a powerlaw in pretraining time \(\mathcal{L}(t) \sim t^{- \frac{\beta}{\nu + \nu \beta + 1}}\).
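These per-mode dynamics are easy to integrate directly. The sketch below uses a simple Euler discretization and an illustrative powerlaw spectrum (the step size, spectra, and constants are our choices, not the paper's experimental settings):

```python
import numpy as np

D, L, nu, beta = 512, 4, 1.5, 1.0
k = np.arange(1, D + 1)
lam = k ** -nu                                  # lambda_k ~ k^{-nu}
omega = k ** (nu - nu * beta - 1)               # so that sum_{l>k} lam_l*omega_l ~ k^{-nu*beta}

gamma, dt, losses = np.zeros(D), 0.1, []
for _ in range(20000):
    r = 1 - gamma * lam / L
    gamma += dt * omega * lam**2 * r ** (2 * L - 1)    # per-mode gradient flow
    losses.append(np.sum(omega * lam * r ** (2 * L)))  # FS loss summed over modes
# the aggregate loss decays approximately as the quoted powerlaw in t
```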

While pretraining the linear transformer in this setting can achieve zero loss for long contexts \(P \to \infty\) even in shallow networks, the solution the model finds from gradient descent is brittle to properties of the pretraining data. Rather than learning a generic algorithm that solves ICL regression for any covariance \(\boldsymbol{\Sigma}\), its solution is specialized to the pretraining covariance.

A depth \(L\) model is pretrained on ICL tasks with fixed covariances \(\boldsymbol{\Sigma}\) and \(\boldsymbol{\Omega}\) for inputs and task vectors, but evaluated on new covariances \(\boldsymbol{\Sigma}', \boldsymbol{\Omega}'\). The out-of-distribution loss is \[\begin{align} \mathcal{L}_{\text{OOD}} = \text{tr} \;\boldsymbol{\Omega}' \left[\left( \boldsymbol{I} - \boldsymbol{\Sigma}^{-1} \boldsymbol{\Sigma}' \right)^{L}\right]^\top \boldsymbol{\Sigma}' \left( \boldsymbol{I} - \boldsymbol{\Sigma}^{-1} \boldsymbol{\Sigma}' \right)^{L} \end{align}\]

We illustrate this brittleness in Figure 3 (c) where we define a family of new covariance matrices \(\boldsymbol{\Sigma}' = \exp( \theta \boldsymbol{S} ) \boldsymbol{\Sigma} \exp( - \theta \boldsymbol{S})\) where \(\boldsymbol{S}\) is a random skew-symmetric matrix. As \(\theta\) increases and \(\boldsymbol{\Sigma}'\) becomes more dissimilar to \(\boldsymbol{\Sigma}\), the OOD loss increases for all depths \(L\).
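The OOD loss above is cheap to evaluate numerically. The following sketch reproduces the qualitative trend; the choice of spectrum, the skew-symmetric generator, and rotating \(\boldsymbol{\Omega}\) along with \(\boldsymbol{\Sigma}\) are illustrative assumptions of ours.

```python
import numpy as np
from scipy.linalg import expm

def ood_loss(Sigma, Omega, theta, S, L):
    """L_OOD of the alpha -> infinity FS solution Gamma = L*Sigma^{-1} on rotated covariances."""
    R = expm(theta * S)                                  # Sigma' = R Sigma R^T, Omega' = R Omega R^T
    Sigma_p, Omega_p = R @ Sigma @ R.T, R @ Omega @ R.T
    M = np.linalg.matrix_power(np.eye(len(Sigma)) - np.linalg.solve(Sigma, Sigma_p), L)
    return np.trace(Omega_p @ M.T @ Sigma_p @ M)

D, rng = 64, np.random.default_rng(0)
G = rng.standard_normal((D, D))
S = (G - G.T) / np.sqrt(2 * D)                           # random skew-symmetric generator
Sigma, Omega = np.diag(np.arange(1, D + 1.0) ** -1.5), np.eye(D)
print([round(ood_loss(Sigma, Omega, th, S, L=4), 4) for th in (0.0, 0.25, 0.5, 1.0)])
```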

3.3 Random Rotated and Structured Covariances \(\implies\) In-Context GD↩︎

Next, we attempt to enhance the covariance diversity across contexts. To do so, we now allow each context \(c\) to have data and task covariances which are randomly rotated \[\begin{align} \boldsymbol{x}^\mu_c \sim \mathcal{N}\left( 0 , \boldsymbol{\Sigma}_c \right) \;, \; \boldsymbol{\Sigma}_c = \boldsymbol{O}_c \boldsymbol{\Lambda} \boldsymbol{O}_c^\top \;, \;\boldsymbol{\Omega}_c = \boldsymbol{O}_c \boldsymbol{\Omega} \boldsymbol{O}_c^\top , \end{align}\] where \(\boldsymbol{O}_c\) is a random \(D\times D\) orthogonal matrix sampled from the Haar measure. The idea of pretraining with a diverse set of covariances \(\boldsymbol{\Sigma}_c\) across contexts \(c\) is to encourage the model to learn a generic in-context learning algorithm that is not specifically tailored to a particular data covariance \(\boldsymbol{\Sigma}\). By introducing the random rotation across contexts, the model cannot encode a whitening transform of the data in the matrix \(\boldsymbol{\Gamma}\), which prevents a zero loss solution in a shallow model with depth \(L=1\). Therefore, even the \(P/D \to \infty\) limit has the potential to exhibit a nontrivial scaling law in \(L\).

Gradient flow on the \(\boldsymbol{\Gamma}\)-reduced model maintains the isotropy condition \(\boldsymbol{\Gamma}(t) = \gamma(t) \boldsymbol{I}\) with \[\begin{align} \frac{d}{dt} \gamma(t) = \text{tr} \boldsymbol{\Lambda}^2 \boldsymbol{\Omega} \left( \boldsymbol{I} - L^{-1} \gamma(t) \boldsymbol{\Lambda} \right)^{2L-1} . \end{align}\]

This indicates that, provided the covariance is randomly rotated across contexts, the learned solution behaves as plain, non-preconditioned in-context GD (see Appendix 11). In the next section, we explore the consequences of this finding for compute optimal shapes.
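The scalar flow above can be integrated directly. The sketch below (Euler steps, an illustrative powerlaw spectrum, and unnormalized traces, which only rescale time) shows the depth-limited plateau that motivates the next section:

```python
import numpy as np

def rrs_scalar_flow(lam, omega, L, dt=0.05, steps=40000):
    """d gamma/dt = tr[Lam^2 Omega (I - gamma Lam / L)^(2L-1)], tracking the loss
    tr[Lam Omega (I - gamma Lam / L)^(2L)] along the trajectory (Gamma = gamma I)."""
    gamma, losses = 0.0, []
    for _ in range(steps):
        r = 1 - gamma * lam / L
        gamma += dt * np.sum(lam**2 * omega * r ** (2 * L - 1))
        losses.append(np.sum(lam * omega * r ** (2 * L)))
    return np.array(losses)

D, nu, beta = 512, 1.5, 1.0
k = np.arange(1, D + 1)
lam, omega = k ** -nu, k ** (nu - nu * beta - 1)
for L in (1, 2, 8):
    print(L, rrs_scalar_flow(lam, omega, L)[-1])   # plateau shrinks with depth, roughly as L^{-beta}
```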

4 Model of Compute Optimal Neural Scaling Laws↩︎

We consider the third setting with powerlaw features and also introduce a notion of width through a projection matrix \(\boldsymbol{A} \in \mathbb{R}^{N \times D}\). Rather than training on \(D\)-dimensional inputs \(\boldsymbol{x}\), the model has access to \(N\)-dimensional features \(\tilde{\boldsymbol{x}} = \boldsymbol{A} \boldsymbol{x}\), which leads to the following form for \(\boldsymbol{\Gamma}(t)\) \[\begin{align} \tilde{\boldsymbol{x}} = \boldsymbol{A} \boldsymbol{x}\quad {\implies} \quad \boldsymbol{\Gamma}(t) = \gamma(t) \left( \boldsymbol{A} \boldsymbol{A}^\top \right) \in \mathbb{R}^{N \times N} \end{align}\] As before, the loss can again be viewed as a function of the scale parameter \(\gamma(t)\). We provide a recipe to compute it below by averaging over the random orthogonal matrix, resulting in a two-point deterministic equivalent for free products, in the same spirit as the results of [17], [72], using a saddle point method (Appendix 7).

The loss function \(\mathcal{L} = \left< |\boldsymbol{\Lambda}^{1/2}[\boldsymbol{I} - \gamma(t) L^{-1} \boldsymbol{O} \left( \boldsymbol{A}^\top \boldsymbol{A} \right)^2 \boldsymbol{O}^\top \hat{\boldsymbol{\Sigma}} ]^L \bar{\boldsymbol{\beta}}|^2 \right>\) can be explicitly averaged over the random orthogonal matrix \(\boldsymbol{O}\) and expressed as a deterministic function \[\begin{align} \mathcal{L}(t,N,L,P) &= \int \frac{d\omega d\omega'}{(2\pi)^2} \left(1 + L^{-1} \gamma(t) i\omega \right)^L \left(1 + L^{-1} \gamma(t) i\omega' \right)^L \mathcal{C}(\omega,\omega') \nonumber \\ \mathcal{C}(\omega,\omega') &= \text{tr} \;\boldsymbol{\Lambda} \left[ i \omega + \Psi_{v\chi}(\omega) \boldsymbol{\Lambda} \right]^{-1} \left[ \boldsymbol{\Omega} - \Psi_{\chi\chi}(\omega,\omega') \right] \left[ i \omega' + \Psi_{v\chi}(\omega') \boldsymbol{\Lambda} \right]^{-1} \end{align}\] where \(\Psi_{v\chi}(\omega)\) and \(\Psi_{\chi\chi}(\omega,\omega')\) are deterministic functions that depend on the spectra of \(\hat{\boldsymbol{\Sigma}}\) and \(\boldsymbol{A}^\top \boldsymbol{A}\) (Appendix 7). For example, to obtain \(\Psi_{v\chi}(\omega)\) we first solve \[\begin{align} i\omega_{(\boldsymbol{A}^\top \boldsymbol{A})^2}(\tau) i \omega_{\hat{\boldsymbol{\Sigma}}}(\tau) = \frac{\tau-1}{\tau} i \omega \implies \Psi_{v\chi}(\omega) = \frac{i\omega}{i\omega_{(\boldsymbol{A}^\top \boldsymbol{A})^2}(\tau)} \end{align}\] where, for a matrix \(\boldsymbol{M}\), \(\tau\) and \(i\omega_M\) are related through \(\tau = \text{tr} \boldsymbol{M} \left(\boldsymbol{M} + i\omega_M \right)^{-1}\), i.e. \(i\omega_M(\tau)\) is the inverse of the \(\tau\)-transform defined in Appendix 7.
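As an illustration of this recipe, the scalar equation relating the two \(\tau\)-transforms can be solved by nested root finding. The sketch below is ours: it follows the Appendix 7 form of the subordination relation, suppresses the \(i\omega\) bookkeeping by treating the frequency as a positive real variable, and uses helper names and spectra that are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import brentq

def tau_transform(evals, z):
    """tau_M(z) = tr M (z + M)^{-1} (normalized trace) for a matrix with spectrum `evals`."""
    return np.mean(evals / (z + evals))

def inverse_tau(evals, tau):
    """omega_M(tau): invert the tau-transform by root finding in z."""
    return brentq(lambda z: tau_transform(evals, z) - tau, 1e-12, 1e12)

def solve_subordination(evals_A, evals_B, z):
    """Solve omega_A(tau) * omega_B(tau) = ((1 - tau)/tau) * z for tau."""
    t_max = min(np.mean(evals_A > 0), np.mean(evals_B > 0)) - 1e-6
    f = lambda t: inverse_tau(evals_A, t) * inverse_tau(evals_B, t) - (1 - t) / t * z
    return brentq(f, 1e-6, t_max)

# example: (A^T A)^2 taken as a rank-N projection, and a powerlaw stand-in for Sigma_hat
D, N, z = 512, 256, 0.1
evals_proj = np.concatenate([np.ones(N), np.zeros(D - N)])
evals_sig = np.arange(1, D + 1, dtype=float) ** -1.5
tau = solve_subordination(evals_proj, evals_sig, z)
Psi_vchi = z / inverse_tau(evals_proj, tau)    # Psi_{v chi} = i omega / i omega_{(A^T A)^2}(tau)
```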

Figure 4: Loss dynamics for powerlaw data. (a) Varying the source exponent \(\beta\), we see that the scaling with pretraining time has exponent \(\frac{\beta}{2+\beta}\). (b) The loss landscape across depths \(L\) for the scalar \(\gamma\) parameter exhibits minima at \(\gamma \approx L\). (c) The training dynamics of the reduced-\(\Gamma\) model exhibit \(t^{-\beta/(2+\beta)}\) decay before hitting an asymptote which scales as \(L^{-\beta}\).

In Appendix 12.4 we provide formulas for \(\Psi_{v\chi}(\omega)\) and \(\Psi_{vv}(\omega,\omega')\) in the case that \(\boldsymbol{A}^\top \boldsymbol{A}\) is a rank \(N\) projection (has \(N\) eigenvalues equal to \(1\)) and \(\hat{\boldsymbol{\Sigma}} = \frac{1}{P} \boldsymbol{X} \boldsymbol{X}^\top\) is a structured Wishart matrix. This formula provides the full asymptotics. We can further extract Chinchilla scaling laws for powerlaw data. We demonstrate the need for both width and depth in Figures 4 and 5.

Assume source and capacity conditions for the eigenvalues and target coefficients \(\omega_k \equiv \Omega_{kk}\) \[\begin{align} \sum_{\ell > k} \lambda_\ell \;\omega_{\ell} \sim k^{-\nu \beta} \;, \;\lambda_k \sim k^{- \nu}, \end{align}\] where source and capacity \((\beta, \nu)\) control the complexity of each in-context linear regression problem. Let the matrix \(\boldsymbol{A}\) be rank \(N\) and frozen. Then the loss follows a Chinchilla neural scaling law in the resources of time \(t\), width \(N\), depth \(L\), and context length \(P\) \[\begin{align} \mathcal{L}(t,N,L,P) \approx c_t \;t^{-\frac{\beta}{2+\beta}} + c_N \;N^{-\nu\beta} + c_L \;L^{-\beta} + c_P \;P^{-\nu \beta} , \end{align}\] in the sense that taking any three of the resources to infinity while leaving the last fixed results in a powerlaw in the remaining resource, e.g. \(\lim_{N,L,P\to\infty}\mathcal{L}(t,N,L,P) = \Theta \left(t^{- \frac{\beta}{2+\beta} } \right)\). As a consequence, at fixed compute \(C = t P^2 N^2 L\) the optimal width and depth scale as \(L \propto N^{\nu}\).
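The exponent in the compute optimal shape can be read off by the usual Chinchilla-style argument (heuristically, and dropping constants): at the optimum each term of the scaling law contributes a comparable fraction of the loss, so the width and depth terms balance, \[\begin{align} c_N \, N^{-\nu \beta} \asymp c_L \, L^{-\beta} \quad \Longrightarrow \quad L^{\beta} \propto N^{\nu \beta} \quad \Longrightarrow \quad L \propto N^{\nu} . \end{align}\]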

Figure 5: Increasing width and depth alone is insufficient to obtain monotonic improvements on powerlaw data with random covariance across contexts. (a) Scaling only width leads to a depth bottleneck (dashed red line). (b) Scaling only depth leads to a width bottleneck (dashed red line). (c) Increasing \(N\) and \(L\) simultaneously achieves monotonic improvement with compute.

5 More Realistic Self Attention Models↩︎

In this section, we discuss more realistic models which exhibit similar training dynamics and depth dependence. First, we describe the dynamics when the \(\boldsymbol{\Gamma}^\ell\) matrices are not tied. Second, we provide a theory of training when each of the weight matrices is optimized separately. Lastly, we provide experiments with softmax attention models trained with Adam.

5.1 Gradient Flow with Untied \(\Gamma\) Matrices↩︎

In this section, we consider the effect of untying the \(\boldsymbol{\Gamma}^\ell\) matrices across layers \(\ell \in [L]\). When each of the weights is optimized separately with a learning rate that is upscaled by depth, \(\eta = \eta_0 L\), the dynamics are equivalent to the recurrent model presented previously in the noise-free RRS setting. In this setting, the matrices remain isotropic, \(\boldsymbol{\Gamma}^\ell = \gamma^\ell \boldsymbol{I}\), under gradient flow, leading to the permutation-symmetric loss function \[\begin{align} \mathcal{L}(\{ \gamma^\ell \}) = \left< \left|\boldsymbol{\Lambda}^{1/2} \prod_{\ell=1}^L \left[ \boldsymbol{I} - L^{-1} \gamma^\ell \boldsymbol{O} \left( \boldsymbol{A}^\top \boldsymbol{A} \right)^2 \boldsymbol{O}^\top \hat{\boldsymbol{\Sigma}} \right] \bar{\boldsymbol{\beta}} \right|^2 \right> \end{align}\] where the average is over \(\boldsymbol{O}\) and \(\bar{\boldsymbol{\beta}}\). The evolution of \(\gamma^\ell(t)\) under gradient flow maintains the balance \(\gamma^\ell(t) = \gamma^k(t)\) if the parameters start from a balanced initial condition such as \(\gamma^\ell(0)=0\) for all \(\ell\). As a consequence, the loss dynamics are identical to those of the recurrent model analyzed in previous sections. We provide additional details in Appendix 13.1 and Figure 8.
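Explicitly, the balance follows from permutation symmetry of the loss: at any point where all \(\gamma^\ell\) are equal, the gradient components coincide, \[\begin{align} \gamma^1(t) = \dots = \gamma^L(t) \quad \Longrightarrow \quad \frac{d}{dt}\gamma^\ell(t) = - \frac{\partial \mathcal{L}}{\partial \gamma^\ell}\Big|_{\gamma^1 = \dots = \gamma^L} \;\;\text{is independent of } \ell , \end{align}\] so a balanced initialization such as \(\gamma^\ell(0) = 0\) remains balanced at all times.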

5.2 Gradient Flow for Full Linear Attention↩︎

In this section, we consider gradient flow on all of the attention weights \(\{\boldsymbol{W}_k,\boldsymbol{W}_q, \boldsymbol{W}_v \}\) (along with the read-in and readout weights) separately, rather than gradient flow on the \(\boldsymbol{\Gamma}\) matrix. This corresponds to the dynamical system \[\begin{align} \frac{d}{dt} \boldsymbol{W}_{i} = - \frac{\partial}{\partial \boldsymbol{W}_{i}} \mathcal{L} \;, \;i \in \{ x, y, k, q, v, o \} \end{align}\] The dynamics of this model from small initialization are theoretically tractable as a set of low-dimensional ODEs (see Appendix 13.2), but suffer some defects due to transient blowup and recovery in the scale of \(\boldsymbol{w}_y\) and \(\boldsymbol{w}_o\). However, if we fix these weights to unit norm, the dynamics of the above model reduce to a one-dimensional ODE much like that of the reduced-\(\boldsymbol{\Gamma}\) model.

Consider pretraining a linear transformer on the randomly rotated covariance (RRS) ICL data distribution. Fix the read-in weights \(\boldsymbol{w}_y\) and readout weights \(\boldsymbol{w}_o\) with \(\boldsymbol{w}_y = \boldsymbol{w}_o\) and \(|\boldsymbol{w}_y| = 1\). Initialize the other weights to be small, \(\frac{1}{2} |\boldsymbol{W}_x|^2 = |\boldsymbol{W}_k|^2 = |\boldsymbol{W}_q|^2 = |\boldsymbol{W}_v|^2 = \sigma^2\) where \(\sigma \ll 1\). The gradient dynamics will maintain a balanced condition where \(|\boldsymbol{W}_i(t)| = |\boldsymbol{W}_{j}(t)| = w(t)\) for \(i,j\in\{ x, k, q, v \}\), where the scalar \(w(t)\) evolves as \[\begin{align} \frac{d}{dt} w(t) = w(t)^4 \;\text{tr} \boldsymbol{\Lambda}^2 \boldsymbol{\Omega} \left( \boldsymbol{I} - w(t)^5 \boldsymbol{\Lambda} \right)^{2L-1} \end{align}\] This can be interpreted as gradient flow on the loss function of the reduced \(\boldsymbol{\Gamma}\) model under the reparameterization \(\gamma(t) \to w(t)^5\). For powerlaw covariates with source exponent \(\beta\), this gives a powerlaw scaling with pretraining time \(t\) and depth \(L\) \[\begin{align} \mathcal{L}(t,L) \sim c_t \;t^{ - \frac{5\beta}{5\beta + 2} } + c_L \;L^{-\beta} . \end{align}\]
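The reparameterized loss can be visualized directly as a function of the common weight scale \(w\). The sketch below uses our own illustrative powerlaw spectrum, with the loss written by analogy with the reduced-\(\boldsymbol{\Gamma}\) model (an assumption, not the paper's exact expression); the minimum over \(w\) deepens with \(L\), consistent with the \(c_L \, L^{-\beta}\) term.

```python
import numpy as np

D, nu, beta = 512, 1.5, 1.0
k = np.arange(1, D + 1)
lam, omega = k ** -nu, k ** (nu - nu * beta - 1)     # powerlaw data and task spectra
w_grid = np.linspace(0.0, 1.2, 400)
for L in (1, 2, 8):
    # loss of the balanced model at common weight scale w (gamma = w^5)
    loss = np.array([np.sum(lam * omega * (1 - w**5 * lam) ** (2 * L)) for w in w_grid])
    print(L, round(w_grid[loss.argmin()], 3), loss.min())
```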

We show the learning curves for this full attention module for isotropic and powerlaw covariates in Figure 6. The theoretical dynamics and predicted powerlaw exponent for this dynamical system closely match the simulations.

5.3 Nonlinear (Softmax) Attention↩︎

We also provide experiments showing that similar scaling behavior holds in softmax attention models trained with Adam. In this setting, the attention block has the form \(\boldsymbol{h}^{\ell+1}_\mu = \boldsymbol{h}^{\ell}_\mu + L^{-1} \sum_{\nu=1}^P A^\ell_{\mu\nu} \boldsymbol{W}_v \boldsymbol{h}^\ell_{\nu}\) where \(A_{\mu\nu}^\ell = \frac{1}{Z_\mu} \exp( (\boldsymbol{h}^\ell_\nu)^\top \boldsymbol{W}_k^\top \boldsymbol{W}_q \boldsymbol{h}^\ell_\mu)\) with \(Z_\mu = \sum_{\nu} \exp( (\boldsymbol{h}^\ell_\nu)^\top \boldsymbol{W}_k^\top \boldsymbol{W}_q \boldsymbol{h}^\ell_\mu)\) as the normalization factor. We train with Adam and show depth is still beneficial to ICL performance in Figures 6 (c) and 9, consistent with the phenomenology of our linear attention model.
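A single residual softmax attention update of this form can be sketched as follows (a minimal sketch of ours: one head, attention restricted to the \(P\) labeled tokens, and a standard max-subtraction added for numerical stability):

```python
import numpy as np

def softmax_attention_block(H, Wk, Wq, Wv, P, L):
    """h^{l+1}_mu = h^l_mu + (1/L) sum_{nu<=P} A_{mu nu} Wv h^l_nu with row-softmax A."""
    scores = (Wq @ H).T @ (Wk @ H[:, :P])            # scores[mu, nu] = (Wq h_mu) . (Wk h_nu)
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                # normalization Z_mu over labeled tokens
    return H + (A @ (Wv @ H[:, :P]).T).T / L
```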

Figure 6: (a)-(b) Dynamics of the full linear attention module with separate \(\{\boldsymbol{W}_x, \boldsymbol{W}_q, \boldsymbol{W}_k, \boldsymbol{W}_v \}\) as a function of depth \(L\) agree with a theory generated through reparameterization \(\gamma \to w^5\) as described in Result 9. (c) Softmax attention dynamics.

6 Discussion↩︎

6.0.0.1 Conclusion

In this work we analyzed a solvable model of in-context learning with deep linear attention. We showed that the pretraining statistics strongly determine the type of solution the model converges to and under what conditions scaling depth is necessary at large context lengths. When training on fixed covariance structures, large depth is not necessary (at infinite context length) as the model learns a preconditioner that whitens the data. However, this learned solution is brittle to changes in the data covariance. When training on tasks where the data covariance is randomly rotated across contexts, the model learns a general purpose in-context gradient descent algorithm and exhibits a separable Chinchilla neural scaling law where limited width and depth can both bottleneck performance. Lastly, we show these results are robust to reparameterization of the attention blocks.

6.0.0.2 Limitations and Future Directions

While our work presents an advance in the solvable models of neural scaling laws and the structure of ICL in linear attention, there are many current limitations. The primary limitation is that we focus on linear regression tasks solved with linear attention. Characterization of more general ICL problems such as nonlinear function approximation and nonlinear attention could provide more insights into realistic in-context function approximation. Further, our analysis is focused on online learning in the present work. Future work could investigate overfitting effects caused by repeating tasks or context matrices during training, perhaps with dynamical mean field theory techniques [17], [73], [74]. Future work could also examine the role of large learning rate effects during pretraining dynamics. Another direction that could be interesting to explore in future works is scaling up the number of loops dynamically as the network trains to significantly reduce the total compute required.

Acknowledgements↩︎

The authors would like to thank Jacob Zavatone-Veth, Alex Atanasov, Alexandru Meterez, Clarissa Lauditi, William Tong, Jamie Simon, Boris Hanin, Zhouran Yang, and Jascha Sohl-Dickstein for insightful discussions. B.B. acknowledges support from the Center of Mathematical Sciences and Applications (CMSA) of Harvard University. C.P. is supported by an NSF CAREER Award (IIS-2239780), DARPA grants DIAL-FP-038 and AIQ-HR00112520041, the Simons Collaboration on the Physics of Learning and Neural Computation, and the William F. Milton Fund from Harvard University. This work has been made possible in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence.

Appendix↩︎

7 Two Point Deterministic Equivalent for General Free Product From Dynamical Mean Field Theory↩︎

In this section, we describe a method to handle two-point correlation properties of free products of random matrices. Similar results were provided in a simple random feature model of [17], [72]. This computation provides the necessary technical results to establish the behavior of ICL in the randomly rotated context setting. Note throughout that we will use \(\text{tr} \boldsymbol{M}\) for normalized trace of the matrix \(\boldsymbol{M}\) so that for \(N \times N\) matrix \(\text{tr} \boldsymbol{M} = \frac{1}{N} \sum_{k=1}^N M_{kk}\).

7.0.0.1 Tracking Linear Dynamics Generated by Free Product

Taking a dynamical approach, we consider the following dynamical system which depends on a random matrix \(\boldsymbol{M} \in \mathbb{R}^{N \times N}\) \[\begin{align} \partial_{t} \boldsymbol{v}(t ) = - \boldsymbol{M} \boldsymbol{v}(t ) + \delta(t ) \boldsymbol{v}_0 \;, \;\boldsymbol{M} = \boldsymbol{O} \boldsymbol{B} \boldsymbol{O}^\top \boldsymbol{A} \end{align}\] where \(\boldsymbol{O}\) is a random \(N \times N\) orthogonal matrix drawn from the Haar measure for the orthogonal group. Our starting point is to express the above dynamical system with an integral representation of Dirac-Delta functions \(1 = \int dz \delta(z) = \int \frac{dz d\chi}{2\pi i } e^{- \chi z}\) where the integration variable \(\chi \in (-i\infty, i\infty)\) runs along the imaginary axis. Performing this for each time \(t\) of the dynamics yields \[\begin{align} Z = \int \mathcal{D}\boldsymbol{v} \mathcal{D} \boldsymbol{\chi} \left< \exp\left( - \int dt \boldsymbol{\chi}(t) \left[ \partial_t \boldsymbol{v}(t ) + \boldsymbol{M} \boldsymbol{v}(t ) - \boldsymbol{v}_0 \delta(t ) \right] \right) \right> = 1 \end{align}\] where \(\mathcal{D}\boldsymbol{v}\mathcal{D}\boldsymbol{\chi} = \lim_{\delta t \to 0} \prod_{n=-\infty}^\infty \frac{d\boldsymbol{v}( n \cdot \delta t) d\boldsymbol{\chi}(n\cdot \delta t)}{2 \pi i }\) is the flat measure over \(\boldsymbol{v}(t),\boldsymbol{\chi}(t)\) in the continuous time limit \(\delta t \to 0\). The average \(\left< \cdot \right>\) is computed over the random orthogonal matrix \(\boldsymbol{O}\).

To simplify the calculation, we transform our dynamics into Fourier space \[\begin{align} \boldsymbol{v}(t) \equiv \int \frac{d\omega}{2\pi} \exp\left( i \omega t \right) \boldsymbol{v}(\omega) \;, \;\boldsymbol{\chi}(t) \equiv \int \frac{d\omega}{2\pi}\exp\left( i \omega t \right) \boldsymbol{\chi}(\omega) , \end{align}\] which transforms the original integral over \(t\) into an integral over \(\omega\) as \(\int d\omega \boldsymbol{\chi}(\omega) \cdot [ (i\omega) \boldsymbol{v}(\omega) + \boldsymbol{M} \boldsymbol{v}(\omega) - \boldsymbol{v}_0 ]\).

7.0.0.2 Disorder Average

Averaging the resulting expression over the orthogonal matrix \(\boldsymbol{O}\), we obtain the following representation of \(Z\) as an integral over order parameters \(\{ \boldsymbol{\Sigma}, \boldsymbol{\Psi} \}\) \[\begin{align} Z = \int \mathcal{D}\boldsymbol{\Sigma} \mathcal{D} \boldsymbol{\Psi} \exp\left( - N \mathcal{S}[\boldsymbol{\Sigma},\boldsymbol{\Psi}] \right). \end{align}\] In the above expression, the action \(\mathcal{S}\) has the form \[\begin{align} \mathcal{S}[\boldsymbol{\Sigma},\boldsymbol{\Psi}] = - \text{Tr} \boldsymbol{\Sigma} \boldsymbol{\Psi} - \frac{1}{N} \ln \mathcal{Z}_A(\boldsymbol{\Psi}) - \text{Tr} \boldsymbol{\Sigma} \hat{\boldsymbol{\Sigma}}_\star - \frac{1}{N}\ln\det\left( \hat{\boldsymbol{\Sigma}}_\star \otimes \boldsymbol{I} + \boldsymbol{V} \otimes \boldsymbol{B} \right) + \ln\det \boldsymbol{\Sigma} \end{align}\] where \(\hat{\boldsymbol{\Sigma}}_\star\) solves the implicit equation \(\frac{\partial \mathcal{S}}{\partial \hat{\boldsymbol{\Sigma}}_\star} = -\boldsymbol{\Sigma} + \text{tr}\left( \boldsymbol{\hat{\Sigma}}_\star \otimes \boldsymbol{I} + \boldsymbol{V} \otimes \boldsymbol{B} \right)^{-1} = 0\) and the single site measure \(\mathcal{Z}_A\) has the form \[\begin{align} \mathcal{Z}_A(\boldsymbol{\Psi})& = \int \mathcal{D}\boldsymbol{v}\mathcal{D}\boldsymbol{\chi} \exp\left( - \int d\omega \boldsymbol{\chi}(\omega) \cdot \left[ (i\omega) \boldsymbol{v}(z) - \boldsymbol{v}_0 - \int d\omega' \Psi_{v \chi}(\omega,\omega') \boldsymbol{A} \boldsymbol{v}(\omega') \right] \right) \nonumber \\ &\exp\left( - \frac{1}{2} \int d \omega d\omega' \left[\Psi_{vv}( \omega,\omega' ) \boldsymbol{v}(\omega)^\top \boldsymbol{A}^2 \boldsymbol{v}(\omega') + \Psi_{\chi\chi}(\omega,\omega') \boldsymbol{\chi}(\omega) \cdot \boldsymbol{\chi}(\omega') \right] \right) \end{align}\]

7.0.0.3 Taking the Large System Size Limit

To study the limit of \(N \to \infty\), the saddle point equations over \(\boldsymbol{\Sigma}\) and \(\boldsymbol{\Psi}\) give \[\begin{align} &\frac{\partial \mathcal{S}}{\partial \boldsymbol{\Psi}(\omega,\omega') } = - \begin{bmatrix} \Sigma_{vv}(\omega, \omega') & \Sigma_{v\chi}(\omega,\omega') \\ \Sigma_{\chi v}(\omega,\omega') & \Sigma_{\chi\chi}(\omega,\omega') \end{bmatrix} + \begin{bmatrix} \frac{1}{N} \left< \boldsymbol{v}(\omega)^\top \boldsymbol{A}^2 \boldsymbol{v}(\omega') \right> & \frac{1}{N} \left< \boldsymbol{v}(\omega)^\top \boldsymbol{A} \boldsymbol{\chi}(\omega') \right> \\ \frac{1}{N} \left< \boldsymbol{\chi}(\omega)^\top \boldsymbol{A} \boldsymbol{v}(\omega') \right> & \frac{1}{N} \left< \boldsymbol{\chi}(\omega) \cdot \boldsymbol{\chi}(\omega') \right> \end{bmatrix} \nonumber \\ &\frac{\partial \mathcal{S}}{\partial \boldsymbol{\Sigma}(\omega,\omega')} = - \boldsymbol{\Psi}(\omega,\omega') - \boldsymbol{\hat{\Sigma}}_\star(\omega,\omega') + [\boldsymbol{\Sigma}^{-1}](\omega,\omega') = 0 \end{align}\] The average over \(\left< \cdot \right>\) represents an average over the Gaussian measure defined by \(\mathcal{Z}_A\). The solution to these equations has the following block structure \[\begin{align} &\boldsymbol{\Sigma}(\omega,\omega') = \begin{bmatrix} \Sigma_{vv}(\omega,\omega') & \Sigma_{v\chi}(\omega,\omega') \\ \Sigma_{v\chi}(\omega,\omega') & 0 \end{bmatrix} \\ &\hat{\boldsymbol{\Sigma}}(\omega,\omega') = \begin{bmatrix} 0 & \hat{\Sigma}_{v\chi}(\omega,\omega') \\ \hat{\Sigma}_{v\chi}(\omega,\omega') & \hat{\Sigma}_{\chi\chi}(\omega,\omega') \end{bmatrix} \\ &\boldsymbol{\Psi}(\omega,\omega') = \begin{bmatrix} 0 & \Psi_{v\chi}(\omega,\omega') \\ \Psi_{v\chi}(\omega,\omega') & \Psi_{\chi\chi}(\omega,\omega') \end{bmatrix} \end{align}\]

7.0.0.4 Off Diagonal Blocks

The off diagonal blocks decouple over different frequencies \(\Sigma_{v\chi}(\omega,\omega') = \delta(\omega-\omega') \Sigma_{v\chi}(\omega)\) and satisfy the following equations \[\begin{align} &\Sigma_{v\chi}(\omega) = \text{tr} \boldsymbol{A} \left( i \omega + \Psi_{v\chi}(\omega) \boldsymbol{A} \right)^{-1} = \text{tr}\left( \hat{\Sigma}_{v\chi}(\omega) + \boldsymbol{B} \right)^{-1} , \\ &\Psi_{v\chi}(\omega) \Sigma_{v\chi}(\omega) = 1 - \hat{\Sigma}_{v\chi}(\omega) \Sigma_{v\chi}(\omega) \end{align}\] Introduce the \(\tau\)-transform \(\tau_{M}\) of a matrix \(\boldsymbol{M}\) as \[\begin{align} \tau_M(i\omega) \equiv \text{tr} \boldsymbol{M}( i\omega + \boldsymbol{M})^{-1} \end{align}\] as well as its inverse function \(i\omega_M(\tau)\). Then our saddle point equations give us \[\begin{align} \tau = \Sigma_{v\chi}(\omega) \Psi_{v\chi}(\omega) = \tau_A( i\omega_A) = \tau_B(i \omega_B) \end{align}\] where \(i\omega_A = \frac{i\omega}{\Psi_{v\chi}(\omega)}\) and \(i \omega_B = ( \tau^{-1} - 1 )\Psi_{v\chi}(\omega)\). Putting these equations together, we find \[\begin{align} (i\omega_A(\tau)) (i\omega_B(\tau)) = \frac{1-\tau}{\tau} ( i \omega ) \end{align}\] This equation is to be solved for \(\tau(\omega)\). Once this function is determined we can use it to compute the diagonal blocks of \(\boldsymbol{\Sigma},\boldsymbol{\Psi}\) as we describe in the next section.

7.0.0.5 Diagonal Blocks

Now, we can determine the diagonal blocks, which determine the covariance structure of the \(\boldsymbol{v}(\omega)\) variables

\[\begin{align} \Sigma_{vv}(\omega,\omega') &= \frac{1}{N} \boldsymbol{v}_0^\top \left( i \omega + \Psi_{v\chi}(\omega) \boldsymbol{A} \right)^{-1} \boldsymbol{A}^2 \left( i \omega' + \Psi_{v\chi}(\omega') \boldsymbol{A} \right)^{-1} \boldsymbol{v}_0 \nonumber \\ &- \Psi_{\chi\chi}(\omega,\omega') \; \underbrace{\text{tr} \boldsymbol{A}^2 \left( i \omega + \Psi_{v\chi}(\omega) \boldsymbol{A} \right)^{-1} \left( i \omega' + \Psi_{v\chi}(\omega') \boldsymbol{A} \right)^{-1}}_{\mathcal{A}(\omega,\omega')} \nonumber \\ \Sigma_{vv}(\omega,\omega') &= - \hat{\Sigma}_{\chi\chi}(\omega,\omega') \; \underbrace{\text{tr} \left( \hat{\Sigma}_{v\chi}(\omega) + \boldsymbol{B} \right)^{-1}\left( \hat{\Sigma}_{v\chi}(\omega') + \boldsymbol{B} \right)^{-1}}_{\mathcal{B}(\omega,\omega')}\nonumber \\ \Psi_{\chi\chi}(\omega,\omega') &= - \hat{\Sigma}_{\chi\chi}(\omega,\omega') - \Sigma_{vv}(\omega,\omega') \Sigma_{v\chi}(\omega)^{-1}\Sigma_{v\chi}(\omega' )^{-1} \end{align}\] where we introduced functions \(\mathcal{A}\) and \(\mathcal{B}\) which can be determined from the (already obtained) off-diagonal blocks. Combining these equations \[\begin{align} \Sigma_{vv}(\omega,\omega') & = \frac{\frac{1}{N} \boldsymbol{v}_0^\top \left( i \omega + \Psi_{v\chi}(\omega) \boldsymbol{A} \right)^{-1} \boldsymbol{A}^2 \left( i \omega' + \Psi_{v\chi}(\omega') \boldsymbol{A} \right)^{-1} \boldsymbol{v}_0}{1 - \left[ \Sigma_{v\chi}(\omega)^{-1} \Sigma_{v\chi}(\omega')^{-1} - \mathcal{B}(\omega,\omega')^{-1} \right] \mathcal{A}(\omega,\omega')} \end{align}\] From this expression, both \(\hat{\Sigma}_{\chi\chi}(\omega,\omega') = - \mathcal{B}(\omega,\omega') \Sigma_{vv}(\omega,\omega')\) and \(\Psi_{\chi\chi}(\omega,\omega') = - \hat{\Sigma}_{\chi\chi}(\omega,\omega') - \Sigma_{vv}(\omega,\omega') \Sigma_{v\chi}(\omega)^{-1}\Sigma_{v\chi}(\omega' )^{-1}\) are determined.

7.0.0.6 Deterministic Equivalent

From the functions \(\Psi_{v\chi}(\omega)\) and \(\Psi_{\chi\chi}(\omega,\omega')\) we obtain the following deterministic equivalent for the outer product of \(\boldsymbol{v}\) variables \[\begin{align} \boldsymbol{v}(\omega) \boldsymbol{v}(\omega')^\top \simeq &\left( i\omega + \Psi_{v\chi}(\omega)\boldsymbol{A} \right)^{-1} \left[ \boldsymbol{v}_0 \boldsymbol{v}_0^\top - \Psi_{\chi\chi}(\omega,\omega') \boldsymbol{I} \right] \left( i\omega' + \Psi_{v\chi}(\omega')\boldsymbol{A} \right)^{-1} \end{align}\] where the \(\simeq\) expression holds under trace against a test matrix (i.e. \(\boldsymbol{M}_1 \simeq \boldsymbol{M}_2 \implies \text{tr}\boldsymbol{C} \boldsymbol{M}_1 = \text{tr} \boldsymbol{C} \boldsymbol{M}_2\)) [75]. This deterministic equivalent will be used in subsequent expressions to give an exact result for the loss landscape of our randomly rotated ICL loss function. Approximations of this will give rise to our scaling law results as we discuss in Appendix 12.4.

8 Model Definition, Masking Mechanics, Reduced-\(\Gamma\) Model↩︎

8.1 Linear Attention and Positional Masking For our Task↩︎

In this section, we provide more detail on the structure of the masking used in our task. First, we note that the target values \(y_\mu\) are only provided for the \(P\) labeled examples within each context matrix. Second, we note that the residual attention operations are masked differently for the \(P\) labeled examples and the \(K\) evaluation points. To define our masking operation precisely, we introduce the notation \(\boldsymbol{1}_k \in \mathbb{R}^k\) as a vector consisting of all \(k\) entries equal to one. We introduce the masking matrix \(\boldsymbol{M} \in \mathbb{R}^{(P+K)\times (P+K)}\) for the residual stream which has the following block structure \[\begin{align} \boldsymbol{M} = \begin{bmatrix} - \boldsymbol{1}_P \;\boldsymbol{1}_P^\top & \boldsymbol{0} \\ \boldsymbol{1}_K \; \boldsymbol{1}_P^\top & \boldsymbol{0} \end{bmatrix} . \end{align}\] Now that this positional masking matrix is introduced, we can conveniently express our update rule for the residual stream \[\begin{align} \boldsymbol{h}^{\ell+1}_\mu &= \boldsymbol{h}^\ell_\mu + \frac{1}{P L} \sum_{\nu=1}^{P+K} M_{\mu\nu} \left( \boldsymbol{k}^\ell_\nu \cdot \boldsymbol{q}^\ell_\mu \right) \boldsymbol{v}_{\nu}^\ell \nonumber \\ &= \boldsymbol{h}^\ell_\mu + \frac{1}{P L} \sum_{\nu=1}^{P} M_{\mu\nu} \left( \boldsymbol{k}^\ell_\nu \cdot \boldsymbol{q}^\ell_\mu \right) \boldsymbol{v}_{\nu}^\ell \;, \;\mu \in [P+K] \end{align}\] where \(\boldsymbol{k}^\ell_\mu = \boldsymbol{W}_k \cdot \boldsymbol{h}^\ell_\mu, \boldsymbol{q}^\ell_\mu = \boldsymbol{W}_q \boldsymbol{h}^\ell_\mu, \boldsymbol{v}^\ell_\mu = \boldsymbol{W}^\ell_v \boldsymbol{h}^\ell_\mu\) and we dropped the sum over evaluation points \(\{ P+1,...,P+K \}\) due to the structure of the positional mask \(\boldsymbol{M}\). We thus see that the masked update rule has two properties

  1. It prevents the test points \(\boldsymbol{x}_\mu\) for \(\mu \in \{P+1,...,P+K\}\) from being used in the residual stream updates. Only the first \(P\) training points are utilized.

  2. It provides an opposite sign for the updates on training points and on testing points. We will see that this will enable the model to implement an in-context gradient descent rule. In such a rule, a subspace of the residual variables will encode residual errors \(y_\mu - f_\mu^\ell\) for the training predictions and \(+ f_\mu^\ell\) for the test points \(\mu \in \{P+1,..,P+K\}\). Instead, one could use the same signs for training points and test points in \(\boldsymbol{M}\) and simply negate the output of the model at the end \(f \to -f\).
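For completeness, the positional mask above can be constructed directly (a small sketch; the block signs follow the definition of \(\boldsymbol{M}\) given above):

```python
import numpy as np

def build_mask(P, K):
    """M in R^{(P+K) x (P+K)}: -1 among the P labeled tokens (block -1_P 1_P^T),
    +1 from labeled tokens into the K evaluation tokens (block 1_K 1_P^T), 0 elsewhere."""
    M = np.zeros((P + K, P + K))
    M[:P, :P] = -1.0
    M[P:, :P] = 1.0
    return M
```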

8.2 Derivation of Reduced Gamma Model↩︎

In this section, we describe the conditions under which a linear attention transformer model can be reparameterized as the recurrent reduced-\(\Gamma\) model we discuss in the main text.

8.2.0.1 Alignment Assumptions

Following [21], [56], [68], we study configurations of weights that encode information about inputs \(\boldsymbol{x}\) and targets \(y\) in orthogonal subspaces of the residual stream. Concretely, the following assumptions are made on the input weights, which implement in-context preconditioned gradient descent \[\begin{align} \boldsymbol{W}_x^\top \boldsymbol{w}_y = 0 \;, \;\boldsymbol{W}_x^\top \boldsymbol{w}_o = 0 \;, \; \boldsymbol{w}_y \cdot \boldsymbol{w}_o = |\boldsymbol{w}_y| |\boldsymbol{w}_o| . \end{align}\] Collectively, these assumptions imply that the read-in weights for the targets \(\boldsymbol{w}_y\) and the readout weights \(\boldsymbol{w}_o\) perfectly align and that the read-in weights \(\boldsymbol{W}_x\) project the input data to an orthogonal subspace. Next, we study the following set of alignment conditions for the key, query and value matrices \[\begin{align} &\boldsymbol{W}_x^\top \left(\boldsymbol{W}_k^\ell \right)^\top \boldsymbol{W}_q^\ell \boldsymbol{W}_x \propto \boldsymbol{\Gamma}^\ell \in \mathbb{R}^{D \times D} \\ &\boldsymbol{W}_v \boldsymbol{W}_x = 0 \;, \;\boldsymbol{W}_v \boldsymbol{w}_y \propto \boldsymbol{w}_y \end{align}\] Under these assumptions, we can define a collection of \(D \times D\) matrices \(\boldsymbol{\Gamma}^\ell\) \[\begin{align} \boldsymbol{\Gamma}^\ell \equiv \left( \boldsymbol{w}_o^\top \boldsymbol{W}_v^\ell \boldsymbol{w}_y \right) \boldsymbol{W}_x^\top ( \boldsymbol{W}_k^\ell )^\top ( \boldsymbol{W}_q^\ell ) \boldsymbol{W}_x \in \mathbb{R}^{D \times D } , \end{align}\] which gives rise to the following residual stream dynamics for \(\Delta_\mu^\ell \equiv \boldsymbol{w}_o \cdot \boldsymbol{h}^\ell_\mu\) \[\begin{align} \Delta^{\ell+1}_\mu = \Delta^\ell_\mu + \frac{1}{LP}\sum_{\nu=1}^P M_{\mu\nu} \boldsymbol{x}_\nu^\top \boldsymbol{\Gamma}^\ell \boldsymbol{x}_\mu \Delta_\nu^\ell . \end{align}\] We note that this can be separated into two distinct dynamical systems, one for the first \(P\) training points \[\begin{align} \Delta^{\ell+1}_\mu = \Delta^\ell_\mu -\frac{1}{LP}\sum_{\nu=1}^P \boldsymbol{x}_\nu^\top \boldsymbol{\Gamma}^\ell \boldsymbol{x}_\mu \Delta_\nu^\ell \;, \;\mu \in \{1,...,P\} \end{align}\] which form a closed dynamical system on the \(P\) labeled training points. From these \(P\) variables \(\{ \Delta_\mu^\ell \}_{\mu \in [P]}\), we can describe how the remaining \(K\) points evolve (using \(\Delta^1_\mu = 0\) for the masked evaluation points) \[\begin{align} \Delta^{\ell+1}_\mu = \Delta^\ell_\mu + \frac{1}{LP}\sum_{\nu=1}^P \boldsymbol{x}_\nu^\top \boldsymbol{\Gamma}^\ell \boldsymbol{x}_\mu \Delta_\nu^\ell \implies \Delta^{L+1}_\mu = \frac{1}{LP} \sum_{\ell=1}^L \sum_{\nu=1}^P \boldsymbol{x}_\nu^\top \boldsymbol{\Gamma}^\ell \boldsymbol{x}_\mu \Delta_\nu^\ell \;, \;\mu \in \{P + 1,...,P+K\} . \end{align}\]

8.2.0.2 Recurrence

Instead of defining separate \(\boldsymbol{\Gamma}^\ell\) matrices for each layer \(\ell\), we can instead examine a recurrent model where the weights are tied, \(\boldsymbol{\Gamma}^\ell = \boldsymbol{\Gamma}\) for all \(\ell \in [L]\). We relax this assumption in Appendix 13.1; in many of the settings we analyze, recurrence has no impact compared to decoupling the layers. Under this constraint, the residual stream of the model has the following dynamics \[\begin{align} \Delta_\mu^{\ell+1} &= \Delta^\ell_{\mu} - \frac{1}{L P} \sum_{\nu=1}^P \boldsymbol{x}_\mu^\top \boldsymbol{\Gamma} \boldsymbol{x}_\nu \;\Delta^\ell_\nu \;, \;\mu \in \{1,...,P\} \nonumber \\ \Delta_\mu^{\ell+1} &= \Delta_\mu^\ell + \frac{1}{L P} \sum_{\nu =1}^P \boldsymbol{x}_\mu^\top \boldsymbol{\Gamma} \boldsymbol{x}_\nu \;\Delta^\ell_\nu \;, \;\mu \in \{P+1,...,P+K\} \end{align}\] Let \(\boldsymbol{\Delta}^\ell \in \mathbb{R}^P\) represent the residual stream variables restricted to the \(P\) unmasked training points in the context. This vector has the residual stream dynamics \[\begin{align} \boldsymbol{\Delta}^{\ell} = \left( \boldsymbol{I} - \frac{1}{L P} \boldsymbol{X}^\top \boldsymbol{\Gamma} \boldsymbol{X} \right)^\ell \boldsymbol{y} . \end{align}\] However, the recursion is different for the test set since these points only receive attention signals from the \(P\) unmasked training tokens. For one of the test points \(\boldsymbol{x}_\star\), the prediction of the model \(f_\star\) satisfies \[\begin{align} f_\star = \frac{1}{L P} \boldsymbol{x}_\star^\top \boldsymbol{\Gamma} \boldsymbol{X} \sum_{\ell=0}^{L-1} \boldsymbol{\Delta}^\ell = \frac{1}{L P} \boldsymbol{x}_\star^\top \boldsymbol{\Gamma} \boldsymbol{X} \sum_{\ell=0}^{L-1} \left( \boldsymbol{I} - \frac{1}{L P} \boldsymbol{X}^\top \boldsymbol{\Gamma} \boldsymbol{X} \right)^\ell \boldsymbol{y} \end{align}\]

8.2.0.3 Equivalence to Preconditioned In-Context Gradient Descent

This update rule is equivalent to implementing preconditioned in-context gradient descent steps for a linear regression model \(f = \frac{1}{\sqrt D} \boldsymbol{\beta} \cdot \boldsymbol{x}\). To see this, define an in-context training loss \(\hat{\mathcal{L}}(\boldsymbol{\beta}, \boldsymbol{D})\) on the \(P\) labeled training points for context \(\boldsymbol{D}\) \[\begin{align} \hat{\mathcal{L}}(\boldsymbol{\beta}, \boldsymbol{D}) = \frac{1}{2 P} || D^{-1/2} \boldsymbol{X}^\top \boldsymbol{\beta} - \boldsymbol{y} ||^2 \end{align}\] Preconditioned gradient descent with learning rate \(D/L\) and a preconditioner matrix \(\boldsymbol{\Gamma}\) generates the following dynamics on the learned ICL weights \(\boldsymbol{\beta}\) \[\begin{align} \boldsymbol{\beta}^{\ell+1} = \boldsymbol{\beta}^\ell - \frac{D}{L} \boldsymbol{\Gamma} \;\nabla \hat{\mathcal{L}}(\boldsymbol{\beta}^\ell,\boldsymbol{D}) = \boldsymbol{\beta}^\ell - \frac{\sqrt D}{L P} \boldsymbol{\Gamma} \boldsymbol{X} \left( D^{-1/2} \boldsymbol{X}^\top \boldsymbol{\beta}^\ell - \boldsymbol{y} \right) \end{align}\] Defining a variable \(\boldsymbol{h}^\ell \equiv \boldsymbol{y} - D^{-1/2} \boldsymbol{X}^\top \boldsymbol{\beta}^\ell\), we note that this variable satisfies the same recursion as the residual stream variables \(\boldsymbol{\Delta}^\ell\) on the training points described above, which gives the identical solution \(\boldsymbol{h}^\ell = \left( \boldsymbol{I} - \frac{1}{LP} \boldsymbol{X}^\top \boldsymbol{\Gamma} \boldsymbol{X} \right)^\ell \boldsymbol{y}\). The function \(f_\star\) on a test point \(\boldsymbol{x}_\star\) takes the form \[\begin{align} f_\star = \frac{1}{\sqrt D} \boldsymbol{\beta}^L \cdot \boldsymbol{x}_\star = \frac{1}{L P } \boldsymbol{x}_\star^\top \boldsymbol{\Gamma} \boldsymbol{X} \sum_{\ell=0}^{L-1} \boldsymbol{h}^\ell = \frac{1}{L P} \boldsymbol{x}_\star^\top \boldsymbol{\Gamma} \boldsymbol{X} \sum_{\ell=0}^{L-1} \left( \boldsymbol{I} - \frac{1}{L P} \boldsymbol{X}^\top \boldsymbol{\Gamma} \boldsymbol{X} \right)^\ell \boldsymbol{y} \end{align}\] We next compute the test loss.
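As a quick numeric sanity check (ours, not from the paper), the prediction of \(L\) steps of preconditioned in-context gradient descent matches the closed-form expression derived above:

```python
import numpy as np

rng = np.random.default_rng(0)
D, P, L = 16, 48, 5
X, y, x_star = rng.standard_normal((D, P)), rng.standard_normal(P), rng.standard_normal(D)
G = rng.standard_normal((D, D)); Gamma = G @ G.T / D          # an arbitrary preconditioner

# L steps of preconditioned GD with learning rate D/L on the in-context loss
beta = np.zeros(D)
for _ in range(L):
    grad = X @ (X.T @ beta / np.sqrt(D) - y) / (P * np.sqrt(D))
    beta -= (D / L) * Gamma @ grad
f_gd = beta @ x_star / np.sqrt(D)

# closed form f* = (1/LP) x*^T Gamma X sum_{l<L} (I - X^T Gamma X / (LP))^l y
Mp = np.eye(P) - X.T @ Gamma @ X / (L * P)
acc, cur = np.zeros(P), y.copy()
for _ in range(L):
    acc, cur = acc + cur, Mp @ cur
f_closed = x_star @ Gamma @ X @ acc / (L * P)
assert np.isclose(f_gd, f_closed)
```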

8.2.0.4 Test Error Formula for Linear Models

We now compute the expected test loss (averaged over the dataset \(\boldsymbol{X}\) and noise \(\boldsymbol{\epsilon}\)) for a (noisy) linear target function \(y(\boldsymbol{x})\) \[\begin{align} y(\boldsymbol{x}) = \frac{1}{\sqrt D} \boldsymbol{\beta}_\star \cdot \boldsymbol{x} + \sigma \epsilon \;, \; \left< \boldsymbol{x} \boldsymbol{x}^\top \right> = \boldsymbol{\Sigma} \;, \; \left< \epsilon^2 \right> = 1 \end{align}\] where \(\boldsymbol{\Sigma}\) is the (population) covariance of the inputs. The expected loss under this assumption is \[\begin{align} \mathcal{L}= \frac{1}{D} \left< \left ( \boldsymbol{\beta}^L - \boldsymbol{\beta}_\star \right)^\top \boldsymbol{\Sigma} \left( \boldsymbol{\beta}^L - \boldsymbol{\beta}_\star \right) \right>_{\boldsymbol{X}, \boldsymbol{\epsilon}} + \sigma^2 \end{align}\] Letting \(\boldsymbol{\epsilon} \in \mathbb{R}^P\) represent the noise on the \(P\) labeled training points, the learned weights have the form \[\begin{align} \boldsymbol{\beta}^L &= \frac{\sqrt D}{L P} \boldsymbol{\Gamma} \boldsymbol{X} \sum_{\ell=0}^{L-1} \left( \boldsymbol{I} - \frac{1}{LP} \boldsymbol{X}^\top \boldsymbol{\Gamma} \boldsymbol{X} \right)^\ell \left[ \frac{1}{\sqrt D} \boldsymbol{X}^\top \boldsymbol{\beta}_\star + \sigma \boldsymbol{\epsilon} \right] \nonumber \\ &= L^{-1} \boldsymbol{\Gamma} \hat{\boldsymbol{\Sigma}} \sum_{\ell=0}^{L-1} \left( \boldsymbol{I} - L^{-1} \boldsymbol{\Gamma} \hat{\boldsymbol{\Sigma}} \right)^\ell \boldsymbol{\beta}_\star + \frac{\sigma \sqrt{D}}{L P} \;\boldsymbol{\Gamma} \sum_{\ell=0}^{L-1} \left( \boldsymbol{I} - L^{-1} \hat{\boldsymbol{\Sigma}} \boldsymbol{\Gamma} \right)^{\ell} \boldsymbol{X} \boldsymbol{\epsilon} \nonumber \\ &= \boldsymbol{\beta}_\star - \left( \boldsymbol{I} - L^{-1} \boldsymbol{\Gamma} \hat{\boldsymbol{\Sigma}} \right)^L \boldsymbol{\beta}_\star + \frac{\sigma \sqrt D}{L P} \boldsymbol{\Gamma} \sum_{\ell=0}^{L-1} \left( \boldsymbol{I} - L^{-1} \hat{\boldsymbol{\Sigma}} \boldsymbol{\Gamma} \right)^{\ell} \boldsymbol{X} \boldsymbol{\epsilon} \end{align}\] where we defined \(\hat{\boldsymbol{\Sigma}} = \frac{1}{P} \boldsymbol{X} \boldsymbol{X}^\top \in \mathbb{R}^{D \times D}\). Then we note that, for \(\alpha \equiv P / D\), the reducible loss \(\mathcal{L} - \sigma^2\) has the form \[\begin{align} \mathcal{L} - \sigma^2 &= \frac{1}{D} \left< \left( \boldsymbol{\beta}_\star - \boldsymbol{\beta}^L \right)^\top \boldsymbol{\Sigma} \left( \boldsymbol{\beta}_\star - \boldsymbol{\beta}^L \right) \right> \nonumber \\ &= \frac{1}{D} \boldsymbol{\beta}_\star^\top \left< \left( \boldsymbol{I} - L^{-1} \boldsymbol{\Gamma} \hat{\boldsymbol{\Sigma}} \right)^{L \top} \boldsymbol{\Sigma} \left( \boldsymbol{I} - L^{-1} \boldsymbol{\Gamma} \hat{\boldsymbol{\Sigma}} \right)^L \right> \boldsymbol{\beta}_\star \nonumber \\ &+ \frac{\sigma^2}{\alpha L^2 } \text{Tr} \;\boldsymbol{\Gamma}^\top \boldsymbol{\Gamma} \sum_{\ell,\ell' =0}^{L-1} \left< \left( \boldsymbol{I} - L^{-1} \boldsymbol{\Gamma} \hat{\boldsymbol{\Sigma}} \right)^{\ell} \hat{\boldsymbol{\Sigma}} \left[\left( \boldsymbol{I} - L^{-1} \boldsymbol{\Gamma} \hat{\boldsymbol{\Sigma}} \right)^{\ell'} \right]^\top \right> \end{align}\] In the next sections, we will compute this quantity for various distributions of \(\boldsymbol{\hat{\Sigma}}\) and \(\boldsymbol{\beta}_\star\).
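As a sanity check on the algebra, the following sketch (illustrative sizes; not the paper's code) iterates the preconditioned gradient descent recursion with noisy labels and compares it against the closed-form bias-plus-noise decomposition of \(\boldsymbol{\beta}^L\) above:

```python
# Sketch: closed-form decomposition of beta^L vs direct iteration of preconditioned GD.
import numpy as np

rng = np.random.default_rng(1)
D, P, L, sigma = 15, 25, 6, 0.3
Gamma = 0.6 * np.eye(D) + 0.1 * rng.standard_normal((D, D))   # arbitrary preconditioner
X = rng.standard_normal((D, P))
beta_star = rng.standard_normal(D)
eps = rng.standard_normal(P)
y = X.T @ beta_star / np.sqrt(D) + sigma * eps

# Direct iteration of beta^{l+1} = beta^l - (sqrt(D)/LP) Gamma X (X^T beta^l / sqrt(D) - y)
beta = np.zeros(D)
for _ in range(L):
    beta = beta - (np.sqrt(D) / (L * P)) * Gamma @ X @ (X.T @ beta / np.sqrt(D) - y)

# Closed form: beta^L = [I - (I - Gamma Sig/L)^L] beta_star
#                       + (sigma sqrt(D)/(LP)) Gamma sum_l (I - Sig Gamma/L)^l X eps
Sig = X @ X.T / P
bias = (np.eye(D) - np.linalg.matrix_power(np.eye(D) - Gamma @ Sig / L, L)) @ beta_star
noise = sum(np.linalg.matrix_power(np.eye(D) - Sig @ Gamma / L, l) @ (X @ eps) for l in range(L))
beta_closed = bias + sigma * np.sqrt(D) / (L * P) * Gamma @ noise

print(np.max(np.abs(beta - beta_closed)))   # agreement up to floating-point error
```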

9 Isotropic (ISO) Setting Theory↩︎

In this section, we consider isotropic covariates and tasks so that the data covariance \(\boldsymbol{\Sigma}\) and the task vector covariance \(\boldsymbol{\Omega}\) are both identity \[\begin{align} \boldsymbol{\Sigma}= \left< \boldsymbol{x} \boldsymbol{x}^\top \right> = \boldsymbol{I} \;, \;\boldsymbol{\Omega} = \left< \boldsymbol{\beta}_\star \boldsymbol{\beta}_\star^\top \right> = \boldsymbol{I} . \end{align}\] where the average for target vectors \(\boldsymbol{\beta}_\star\) is over different contexts. We further operate in the high dimensional asymptotic limit where both context length \(P\) and input dimension \(D\) diverge with fixed ratio \[\begin{align} P,D \to \infty \;, \;P / D = \alpha . \end{align}\] In this setting, the spectrum of the empirical covariance matrix \(\hat{\boldsymbol{\Sigma}} = \frac{1}{P} \boldsymbol{X} \boldsymbol{X}^\top\) follows the well-known Marchenko-Pastur law, where the eigenvalue density \(\rho(\lambda)\) depends explicitly on the aspect ratio \(\alpha\) [70], [75]. We will first describe an asymptotic limit of SGD for the shallow \(L=1\) case where the dynamics on the matrix \(\boldsymbol{\Gamma}\) are linear. We will then pursue a description of gradient flow dynamics in an arbitrary depth \(L\) model.

9.1 Shallow SGD Theory↩︎

In this section, we consider the SGD dynamics of the shallow architecture \(L=1\). This exercise establishes the rate at which the dynamics converge to gradient flow and how finite batch size \(B\), context length \(P\), and number of masked evaluation points \(K\) impact the SGD dynamics. In this case, the updates have the form

\[\begin{align} &\boldsymbol{\Gamma}(t+1) = \boldsymbol{\Gamma}(t) + \eta \;\boldsymbol{G}(t) \nonumber \\ &\boldsymbol{G}(t) = \frac{1}{B} \sum_{n} \hat{\boldsymbol{\Sigma}}_{\star , n} \left[ (\boldsymbol{I} - \boldsymbol{\Gamma} \hat{\boldsymbol{\Sigma}}_n ) \boldsymbol{\beta}_n - \sigma P^{-1} \sqrt{D}\boldsymbol{\Gamma} \boldsymbol{X}_n \boldsymbol{\epsilon}_n \right] \left[ \hat{\boldsymbol{\Sigma}}_n \boldsymbol{\beta}_n + \sigma P^{-1} \sqrt{D} \boldsymbol{X}_n \boldsymbol{\epsilon}_n \right]^\top \nonumber \\ &\hat{\boldsymbol{\Sigma}}_n = \frac{1}{P} \boldsymbol{X}_n \boldsymbol{X}_n^\top \;, \;\hat{\boldsymbol{\Sigma}}_{\star,n} = \frac{1}{K} \boldsymbol{X}_{\star,n} \boldsymbol{X}_{\star,n}^\top \end{align}\] where \(\boldsymbol{X} \in \mathbb{R}^{D \times P}\) collects the training points of a context (as columns), \(\boldsymbol{X}_{\star} \in \mathbb{R}^{D \times K}\) collects the \(K\) evaluation points, and \(\boldsymbol{\epsilon} \in \mathbb{R}^P\) represents the label noise on the provided targets \(\boldsymbol{y}\). We remind the reader that we will operate in the joint scaling limit \[\begin{align} P,B,K, D \to \infty \;, \;P / D = \alpha \;, \;K/D = \kappa \;, \;B / D = \tau \end{align}\] The population ICL loss has the form \[\begin{align} \mathcal{L}(t) = \frac{1}{D} \left< |\boldsymbol{\Gamma}(t) - \boldsymbol{I}|^2 \right> , \end{align}\] where the average is performed over all possible draws of data and tasks. This function admits the recursion \[\begin{align} \mathcal{L}(t+1) = \mathcal{L}(t) - 2 \eta \text{tr}[\boldsymbol{\Gamma}(t) - \boldsymbol{I}]^\top \left< \boldsymbol{G}(t) \right> + \eta^2 \text{tr} \left< \boldsymbol{G}(t)^\top \boldsymbol{G}(t) \right> \end{align}\] We have the following mean for the gradient \[\begin{align} \left< \boldsymbol{G}(t) \right> = (\boldsymbol{I} - (1+\alpha^{-1} + \sigma^2 \alpha^{-1}) \boldsymbol{\Gamma} ) \end{align}\] Thus if we only updated using mean gradients the model would relax to a fixed point \(\boldsymbol{\Gamma}_\star = \left( 1 + \alpha^{-1} + \sigma^2 \alpha^{-1} \right)^{-1} \boldsymbol{I}\). However, we will see momentarily that SGD noise in the gradients impacts the value of \(\boldsymbol{\Gamma}\) to which SGD converges \[\begin{align} \text{tr} \;\left< \boldsymbol{G}(t)^\top \boldsymbol{G}(t) \right > = \frac{1}{B^2} \sum_{nm} \text{tr} \; \left< \boldsymbol{u}_n \boldsymbol{v}_n^\top \boldsymbol{v}_m \boldsymbol{u}_m^\top \right> \end{align}\] where \(\boldsymbol{u}_n = \boldsymbol{\hat{\Sigma}}_{\star,n} \left[ (\boldsymbol{I} - \boldsymbol{\Gamma} \hat{\boldsymbol{\Sigma}}_n ) \boldsymbol{\beta}_n - \sigma P^{-1} \sqrt{D}\boldsymbol{\Gamma} \boldsymbol{X}_n \boldsymbol{\epsilon}_n \right]\) and \(\boldsymbol{v}_n = \hat{\boldsymbol{\Sigma}}_n \boldsymbol{\beta}_n + \sigma P^{-1} \sqrt{D} \boldsymbol{X}_n \boldsymbol{\epsilon}_n\). 
Carrying out the averages, \[\begin{align} &\frac{1}{D} \left< \boldsymbol{u}_n \cdot \boldsymbol{u}_m \right> = \delta_{mn} \text{tr} \left[ \boldsymbol{I} - 2 \boldsymbol{\Gamma} + (1+\alpha^{-1}) \boldsymbol{\Gamma}^2 + \sigma^2 \alpha^{-1} \boldsymbol{\Gamma}^2 \right] (1 + \kappa^{-1}) \\ &= \delta_{mn} (1 + \alpha^{-1}(1+\sigma^2)) (1+\kappa^{-1}) \text{tr} \left[ (\boldsymbol{\Gamma} - \boldsymbol{\Gamma}_\star)^2 + \frac{\alpha^{-1}(1+\sigma^2)}{(1+\alpha^{-1}(1+\sigma^2))^2} \boldsymbol{I} \right] \\ &\frac{1}{D} \left< \boldsymbol{v}_n \cdot \boldsymbol{v}_m \right> = \delta_{mn} (1+\alpha^{-1} (1+\sigma^2) ) \end{align}\] From the above results, we get the following second moment structure \[\begin{align} \text{tr} \left< \boldsymbol{G}(t)^\top \boldsymbol{G}(t) \right> &= \text{tr} \left<\boldsymbol{G}(t) \right>^\top \left<\boldsymbol{G}(t) \right> \nonumber \\ &+ \frac{1}{\tau} (1+\kappa^{-1}) (1+\alpha^{-1}(1+\sigma^2))^2 \;\text{tr}\left[ (\boldsymbol{\Gamma} - \boldsymbol{\Gamma}_\star)^2 + \frac{\alpha^{-1}(1+\sigma^2)}{(1+\alpha^{-1}(1+\sigma^2))^2} \boldsymbol{I} \right] . \end{align}\]

Thus we get a recursion \[\begin{align} \boldsymbol{\Gamma}(t+1) = \boldsymbol{\Gamma}(t) + \eta (1+\alpha^{-1}(1+\sigma^2)) \left[ \boldsymbol{\Gamma}_\star - \boldsymbol{\Gamma}(t) \right] + \eta \; \boldsymbol{\Xi}(t) \end{align}\] where \(\boldsymbol{\Xi} \in \mathbb{R}^{D \times D}\) is the noise process with variance given above. We define a quantity \(C(t)\) which measures the relaxation of \(\boldsymbol{\Gamma}\) to \(\boldsymbol{\Gamma}_\star\) \[\begin{align} C(t) = \frac{1}{D} \left|\boldsymbol{\Gamma}(t) -\boldsymbol{\Gamma}_\star \right|^2 \;, \;\boldsymbol{\Gamma}_\star = \left[ 1 + \alpha^{-1} (1+\sigma^2) \right]^{-1} \boldsymbol{I} \end{align}\] This function exhibits the following linear dynamics \[\begin{align} C(t+1) &= \left[1 - \eta \left(1 +\alpha^{-1}(1+\sigma^2) \right) \right]^2 C(t) \nonumber \\ &+ \frac{\eta^2}{\tau} (1+\kappa^{-1}) (1+\alpha^{-1}(1+\sigma^2))^2 \left[ C(t) + \frac{\alpha^{-1}(1+\sigma^2)}{(1+\alpha^{-1}(1+\sigma^2))^2} \right] \end{align}\]

This recursion again takes the form \[\begin{align} &C(t+1) = a(\eta,\alpha, \kappa,\tau) C(t) + b(\eta,\alpha, \kappa,\tau) \nonumber \\ &a(\eta,\alpha, \kappa,\tau) = \left[1 - \eta \left(1 +\alpha^{-1}(1+\sigma^2) \right) \right]^2 + \frac{\eta^2}{\tau} (1+\kappa^{-1}) (1+\alpha^{-1}(1+\sigma^2))^2 \nonumber \\ &b(\eta,\alpha,\kappa,\tau) = \frac{\eta^2}{\tau} (1+\kappa^{-1}) \alpha^{-1}(1+\sigma^2) \end{align}\]

Now we need to compute the ICL loss \[\begin{align} \mathcal{L}(t) &= \text{tr}[\boldsymbol{\Gamma}(t) - \boldsymbol{\Gamma}_\star + \boldsymbol{\Gamma}_\star - \boldsymbol{I}]^2 = C(t) + 2 \text{tr}(\boldsymbol{\Gamma}_\star-\boldsymbol{\Gamma}(t))(\boldsymbol{I} - \boldsymbol{\Gamma}_\star) + \text{tr} (\boldsymbol{\Gamma}_\star-\boldsymbol{I})^2 \nonumber \\ &= C(t) + \frac{2 \alpha^{-1}(1+\sigma^2)}{(1+\alpha^{-1}(1+\sigma^2))^2} \left[1-\eta(1+\alpha^{-1}(1+\sigma^2)) \right]^t + \frac{(1+\sigma^2)^2}{(\alpha + 1+\sigma^2)^2} \end{align}\]

Figure 7: Pretraining SGD loss dynamics in the shallow \(L=1\) reduced \(\boldsymbol{\Gamma}\) model for (a) \(\alpha,\kappa = 1\) and varying \(\tau\), (b) varying \(\alpha\) with \(\tau=\kappa=1\), and (c) varying \(\kappa\) with \(\alpha,\tau = 1\). The loss monotonically improves with all three quantities \(\tau,\alpha,\kappa\) and is well predicted by the asymptotic theory.

Putting this all together, we summarize our findings below.

Consider the reduced \(\Gamma\) model with a single layer \(L=1\) and proportional asymptotics \[\begin{align} P,K,B,D \to \infty \;, \;P/D = \alpha \;, \;K/D = \kappa \;, \;B/D = \tau \end{align}\] Then the ICL test loss \(\mathcal{L}(t) = \frac{1}{D} ||\boldsymbol{\Gamma}(t) - \boldsymbol{I}||^2\) after \(t\) SGD iterations is \[\begin{align} &\mathcal{L}(t) = C(t) + \frac{2 \alpha^{-1}(1+\sigma^2)}{(1+\alpha^{-1}(1+\sigma^2))^2} \left[1-\eta(1+\alpha^{-1}(1+\sigma^2)) \right]^t + \frac{(1+\sigma^2)^2}{(\alpha + 1+\sigma^2)^2} \\ &C(t) = a(\eta, \alpha, \tau, \kappa)^t \, C(0) + \frac{1-a(\eta, \alpha, \tau, \kappa)^t}{1-a(\eta, \alpha, \tau, \kappa)} b(\eta, \alpha, \tau, \kappa) \end{align}\] where \(C(0) = \frac{1}{D} ||\boldsymbol{\Gamma}(0) - \boldsymbol{\Gamma}_\star||^2\) is set by the initialization (for the zero initialization \(\boldsymbol{\Gamma}(0) = 0\), \(C(0) = (1+\alpha^{-1}(1+\sigma^2))^{-2}\)), while \(a(\eta,\alpha, \kappa,\tau) = \left[1 - \eta \left(1 +\alpha^{-1}(1+\sigma^2) \right) \right]^2 + \frac{\eta^2}{\tau} (1+\kappa^{-1}) (1+\alpha^{-1}(1+\sigma^2))^2\) and \(b(\eta,\alpha,\kappa,\tau) = \frac{\eta^2}{\tau} (1+\kappa^{-1}) \alpha^{-1}(1+\sigma^2)\) capture the dependence on batch size through \(\tau\).

We verify the validity of these learning curves for shallow pretraining dynamics in Figure 7.
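The asymptotic learning curve itself is straightforward to evaluate; a minimal sketch (with illustrative parameter values and assuming the zero initialization \(\boldsymbol{\Gamma}(0)=0\)) is:

```python
# Sketch: evaluate the predicted shallow (L=1) SGD learning curve L(t).
import numpy as np

def shallow_icl_loss(t, eta, alpha, kappa, tau, sigma):
    c = 1.0 + (1.0 + sigma**2) / alpha                   # c = 1 + alpha^{-1}(1 + sigma^2)
    a = (1.0 - eta * c) ** 2 + (eta**2 / tau) * (1.0 + 1.0 / kappa) * c**2
    b = (eta**2 / tau) * (1.0 + 1.0 / kappa) * (1.0 + sigma**2) / alpha
    C0 = 1.0 / c**2                                      # C(0) = (1/D)||Gamma(0) - Gamma_star||^2 at Gamma(0) = 0
    C = a**t * C0 + (1.0 - a**t) / (1.0 - a) * b         # solution of the affine recursion for C(t)
    cross = 2.0 * (1.0 + sigma**2) / alpha / c**2 * (1.0 - eta * c) ** t
    fp_bias = (1.0 + sigma**2) ** 2 / (alpha + 1.0 + sigma**2) ** 2   # loss of the fixed point Gamma_star
    return C + cross + fp_bias

t = np.arange(500)
curve = shallow_icl_loss(t, eta=0.1, alpha=1.0, kappa=1.0, tau=1.0, sigma=0.5)
print(curve[0], curve[-1])          # initial loss (= 1 for zero init) and the late-time plateau
```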

We thus see that deviations from gradient flow in the \(L=1\) model come in at a scale of \(\eta / \tau\). The purpose of this result is to stress that only \(\mathcal{O}(D^2)\) total tokens are required (unlike the prior work of [21], which required \(\mathcal{O}(D^3)\) tokens) for an ICL learner in this regime 3.

We also note the following facts about the effects of SGD noise:

  • The number of masked evaluation points per batch \(\kappa \equiv K/D\) only alters the SGD noise terms (the terms involving \(1/\tau\)) and its effect vanishes in the gradient flow limit \(\eta / \tau \to 0\).

  • Unlike offline training, in this online SGD setting the model regularizes the final value of \(\boldsymbol{\Gamma}(t)\) (through the fixed point \(\boldsymbol{\Gamma}_\star = \left[1 + \alpha^{-1}(1+\sigma^2)\right]^{-1} \boldsymbol{I}\)) based on the label noise \(\sigma^2\), which reduces overfitting.

  • SGD fluctuations can impact the final loss through \(\lim_{t\to\infty} C(t) = \frac{b(\eta,\alpha,\tau,\kappa)}{1-a(\eta,\alpha,\tau,\kappa)}\).

One nice consequence of this result is that the gradient flow limit can be accessed simply by controlling the scale of \(\eta/\tau\) in a limit where batch size is linear in the dimension \(B = \tau D\).

9.2 Gradient Flow in Deep Models↩︎

From the previous section, we saw that in the absence of SGD noise, the gradient dynamics generate an isotropic \(\boldsymbol{\Gamma}\) matrix. This fact remains true at any depth \(L\geq 1\): the gradient flow dynamics generate \(\boldsymbol{\Gamma}(t) = \gamma(t) \boldsymbol{I}\). To see this, we note that the gradient has the form \[\begin{align} \frac{\partial}{\partial \boldsymbol{\Gamma}} \mathcal{L} &= - \left< \frac{2}{L} \sum_{\ell=0}^{L-1} \left[\left( \boldsymbol{I} - L^{-1} \boldsymbol{\Gamma} \hat{\boldsymbol{\Sigma}} \right)^{\ell}\right]^\top \hat{\boldsymbol{\Sigma}} \left[ \left(\boldsymbol{I} - L^{-1} \boldsymbol{\Gamma} \hat{\boldsymbol{\Sigma}} \right)^{L-1-\ell} \right]^\top \left( \boldsymbol{I} - L^{-1} \boldsymbol{\Gamma} \hat{\boldsymbol{\Sigma}} \right)^{L} \right>\nonumber \\ &+ \frac{2 \sigma^2}{\alpha L^2 } \boldsymbol{\Gamma} \sum_{\ell,\ell'} \left< \left( \boldsymbol{I} - L^{-1} \boldsymbol{\Gamma} \hat{\boldsymbol{\Sigma}} \right)^{\ell} \hat{\boldsymbol{\Sigma}} \left[\left( \boldsymbol{I} - L^{-1} \boldsymbol{\Gamma} \hat{\boldsymbol{\Sigma}} \right)^{\ell'} \right]^\top \right> \nonumber \\ &+ \frac{2 \sigma^2}{\alpha L^3 } \boldsymbol{\Gamma}^\top \boldsymbol{\Gamma} \sum_{\ell,\ell',k} \left< \left( \boldsymbol{I} - L^{-1} \boldsymbol{\Gamma} \hat{\boldsymbol{\Sigma}} \right)^{k} \hat{\boldsymbol{\Sigma}} \left( \boldsymbol{I} - L^{-1} \boldsymbol{\Gamma} \hat{\boldsymbol{\Sigma}} \right)^{\ell-1-k} \hat{\boldsymbol{\Sigma}} \left[\left( \boldsymbol{I} - L^{-1} \boldsymbol{\Gamma} \hat{\boldsymbol{\Sigma}} \right)^{\ell'} \right]^\top \right> \end{align}\] Now, evaluating this at \(\boldsymbol{\Gamma} = \gamma \boldsymbol{I}\) we find \[\begin{align} \frac{\partial}{\partial \boldsymbol{\Gamma}} \mathcal{L}|_{\boldsymbol{\Gamma}=\gamma \boldsymbol{I}} &= -2 \left< \hat{\boldsymbol{\Sigma}} \left( \boldsymbol{I} - L^{-1} \gamma \hat{\boldsymbol{\Sigma}} \right)^{2L-1} \right> + \frac{2\sigma^2 \gamma}{\alpha L^2} \left< \hat{\boldsymbol{\Sigma}} \sum_{\ell\ell'} \left(\boldsymbol{I} - L^{-1} \gamma \hat{\boldsymbol{\Sigma}}\right)^{\ell+\ell'} \right> \nonumber \\ &+ \frac{2\sigma^2 \gamma^2 }{\alpha L^2} \sum_{\ell,\ell'} \left< \hat{\boldsymbol{\Sigma}} \left( \boldsymbol{I} - L^{-1} \gamma \hat{\boldsymbol{\Sigma}} \right)^{\ell+\ell'-1} \right> \propto \boldsymbol{I} \end{align}\] where we recognize, through rotational invariance, that the above averages are proportional to the identity. Thus if \(\boldsymbol{\Gamma}\) is currently in an isotropic configuration, it will remain in one throughout gradient flow. Further, we can compute the scalar \(\gamma(t) = \text{tr} \boldsymbol{\Gamma}(t)\) under gradient flow \(\partial_t \boldsymbol{\Gamma}(t) = - \frac{1}{2} \partial_{\boldsymbol{\Gamma}} \mathcal{L}(\boldsymbol{\Gamma})\). 
\[\begin{align} \frac{d}{dt} \gamma(t) = - \frac{\partial}{\partial \gamma} \text{tr} \left[ \left( \boldsymbol{I} - L^{-1} \gamma \hat{\boldsymbol{\Sigma}} \right)^{2L} + \frac{\sigma^2}{\alpha} \hat{\boldsymbol{\Sigma}}^{-1} \left[ \boldsymbol{I} - \left( \boldsymbol{I} - L^{-1} \gamma \hat{\boldsymbol{\Sigma}} \right)^L \right]^2 \right] \end{align}\] Now, we note that this expression can be rewritten in terms of the Marchenko-Pastur eigenvalue density \(\rho(\lambda) = \frac{1}{D} \sum_{k=1}^D \left< \delta(\lambda - \lambda_k) \right>\) as \[\begin{align} \frac{d}{dt} \gamma(t) = -\frac{\partial}{\partial \gamma } \int d\lambda \rho(\lambda) \left[ \left( 1 - L^{-1} \gamma \lambda \right)^{2L} + \frac{\sigma^2}{\alpha \lambda} \left[ 1 - \left( 1 - L^{-1} \gamma \lambda \right)^L \right]^2 \right] \end{align}\] which shows that we can view the dynamics as a gradient flow on a one dimensional loss landscape.
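As a concrete illustration, the following sketch (not the paper's code; the depth, noise level, and aspect ratio are arbitrary choices) integrates this one-dimensional gradient flow, using the eigenvalues of a sampled Wishart matrix as a finite-size stand-in for the Marchenko-Pastur density:

```python
# Sketch: Euler integration of d gamma/dt = -dL/d gamma over a sampled MP spectrum.
import numpy as np

rng = np.random.default_rng(2)
D, alpha, L_depth, sigma = 400, 2.0, 4, 0.2
P = int(alpha * D)
X = rng.standard_normal((D, P))
lam = np.linalg.eigvalsh(X @ X.T / P)            # empirical covariance eigenvalues (~ MP law)

def loss_and_grad(gamma):
    m = 1.0 - gamma * lam / L_depth
    loss = np.mean(m ** (2 * L_depth) + (sigma**2 / alpha) * (1.0 - m**L_depth) ** 2 / lam)
    dm_dgamma = -lam / L_depth
    grad = np.mean(2 * L_depth * m ** (2 * L_depth - 1) * dm_dgamma
                   - (sigma**2 / alpha) * 2 * (1.0 - m**L_depth) * L_depth * m ** (L_depth - 1) * dm_dgamma / lam)
    return loss, grad

gamma, dt = 0.0, 1e-2
for _ in range(20000):
    loss, grad = loss_and_grad(gamma)
    gamma -= dt * grad                            # gradient flow: d gamma/dt = -dL/d gamma
print("gamma_final =", gamma, " loss_final =", loss)
```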

10 Fixed and Structured Covariance (FS) Setting↩︎

In this section we discuss the case where the data covariance is fixed across contexts but potentially structured (anisotropic). We consider the noise-free setting and take \(P/D \to \infty\) for simplicity, in which case the loss has the form \[\begin{align} \mathcal{L} = \text{tr} \;\boldsymbol{\Omega} \left< \left[\left( \boldsymbol{I} - L^{-1} \boldsymbol{\Gamma} {\boldsymbol{\Sigma}} \right)^L \right]^\top \boldsymbol{\Sigma} \left( \boldsymbol{I} - L^{-1} \boldsymbol{\Gamma} {\boldsymbol{\Sigma}} \right)^L \right> \end{align}\] In this case, gradient flow will generate dynamics that cause \(\boldsymbol{\Gamma}\) to pick up anisotropy from the structure of \(\boldsymbol{\Omega}\) and \(\boldsymbol{\Sigma}\). We analyze the case where \(\boldsymbol{\Omega}\) and \(\boldsymbol{\Sigma}\) are simultaneously diagonalizable with eigenvalues \(\omega_k\) and \(\lambda_k\). In this case, \(\boldsymbol{\Gamma}\) will share the same eigenbasis. Letting its eigenvalues be \(\gamma_k\), the gradient flow reads \[\begin{align} \frac{\partial}{\partial t} \gamma_k(t) = 2 \omega_k \lambda_k^2 \left( 1 - L^{-1} \lambda_k \gamma_k(t) \right)^{2L-1} . \end{align}\] The fixed point of the above dynamics is \(\gamma_k = L \lambda_k^{-1}\). In the large depth limit \(L \to \infty\), we can solve the dynamics exactly \[\begin{align} \frac{d}{dt} \gamma_k(t) &= 2 \omega_k \lambda_k^2 e^{-2 \lambda_k \gamma_k(t) } \implies e^{2 \lambda_k \gamma_k} d\gamma_k = 2 \omega_k \lambda_k^2 dt \nonumber \\ \implies 2 \lambda_k \gamma_k(t) &= \ln( 1 + 4 \omega_k \lambda_k^3 t ) \end{align}\] Plugging this solution into the loss function we find \[\begin{align} \mathcal{L}(t) = \frac{1}{D} \sum_k \omega_k \lambda_k e^{-2 \gamma_k \lambda_k} = \frac{1}{D} \sum_k \omega_k \lambda_k [ 1 + 4 \omega_k \lambda_k^3 t ]^{-1} \end{align}\]

Under powerlaw (source/capacity) assumptions on the structure of the data \[\begin{align} \lambda_k \sim k^{-\nu} \;, \;\omega_k \lambda_k \sim k^{-\nu \beta - 1}, \end{align}\] the loss scales in a powerlaw as \[\begin{align} \mathcal{L}(t) \sim \sum_{k} k^{-\nu \beta- 1} [ 1 + 4 k^{-(2\nu + \nu \beta + 1)} t ]^{-1} \sim t^{- \frac{\nu\beta}{2\nu + \nu \beta + 1}} \end{align}\] This powerlaw provides a decent approximation to large but finite depth models \(L\), as we show in Figure 3.
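A quick numerical sketch of this prediction (the power-law spectra and parameter values below are illustrative assumptions) evaluates the large-depth FS loss and estimates the decay exponent from a log-log fit:

```python
# Sketch: large-depth FS loss under source/capacity power laws and its decay exponent.
import numpy as np

D, nu, beta = 100000, 1.2, 0.8
k = np.arange(1, D + 1)
lam = k ** (-nu)                         # lambda_k ~ k^{-nu}
omega_lam = k ** (-nu * beta - 1.0)      # omega_k lambda_k ~ k^{-nu beta - 1}

ts = np.logspace(4, 10, 25)
losses = np.array([np.mean(omega_lam / (1.0 + 4.0 * omega_lam * lam**2 * t)) for t in ts])
slope = np.polyfit(np.log(ts[10:]), np.log(losses[10:]), 1)[0]
print("estimated decay exponent:", -slope)
```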

11 Randomly Rotated and Structured Covariance (RRS) Setting↩︎

11.1 Exploiting Symmetry in the Gradient Updates↩︎

Utilizing the definition of the RRS setting, we can massage the placement of the orthogonal matrices so that the loss function can be expressed as \[\begin{align} \mathcal{L} = \left< |\boldsymbol{\Lambda}^{1/2} \left( \boldsymbol{I} - L^{-1} \boldsymbol{O}^\top \boldsymbol{\Gamma} \boldsymbol{O} \hat{\boldsymbol{\Sigma}} \right)^L \boldsymbol{\beta}_\star |^2 \right> . \end{align}\] From this expression, it is immediately clear that this function is rotationally invariant with respect to \(\boldsymbol{\Gamma}\) since the transformation \(\boldsymbol{\Gamma} \to \boldsymbol{V} \boldsymbol{\Gamma} \boldsymbol{V}^\top\) for orthogonal \(\boldsymbol{V}\) leaves the Haar average unchanged. The form of the loss gradients also reveals a symmetry \[\begin{align} \frac{\partial}{\partial \boldsymbol{\Gamma}} \mathcal{L} = - 2 L^{-1} \sum_{\ell} \left< \boldsymbol{O} \hat{\boldsymbol{\Sigma}} \boldsymbol{M}^{\ell} \boldsymbol{\beta}_\star \boldsymbol{\beta}_\star^\top \left(\boldsymbol{M}^L \right)^\top \boldsymbol{\Lambda} \left( \boldsymbol{M}^{L-1-\ell} \right)^\top \boldsymbol{O}^\top \right> \;, \;\boldsymbol{M} = \boldsymbol{I} - L^{-1} \boldsymbol{O}^\top \boldsymbol{\Gamma} \boldsymbol{O} \hat{\boldsymbol{\Sigma}} \end{align}\] Note that starting from zero initialization \(\boldsymbol{\Gamma} = 0\), we have \(\boldsymbol{M} = \boldsymbol{I}\), which is independent of \(\boldsymbol{O}\); this implies isotropic gradients \(\frac{\partial}{\partial \boldsymbol{\Gamma}} \mathcal{L}|_{\boldsymbol{\Gamma}=0} \propto \boldsymbol{I}\). To see this note that \(\left< \boldsymbol{O} \boldsymbol{C} \boldsymbol{O}^\top \right> \propto \boldsymbol{I}\) for any matrix \(\boldsymbol{C}\) that is independent of \(\boldsymbol{O}\). Further, suppose that we evaluated the loss gradient at any isotropic \(\boldsymbol{\Gamma} = \gamma \boldsymbol{I}\); then \(\boldsymbol{M} = \boldsymbol{I} - \gamma L^{-1} \hat{\boldsymbol{\Sigma}}\) and the gradients remain isotropic \[\begin{align} \frac{\partial \mathcal{L}}{\partial \boldsymbol{\Gamma}} &= - 2 L^{-1} \sum_{\ell} \left< \boldsymbol{O} \hat{\boldsymbol{\Sigma}} \boldsymbol{M}^{\ell} \boldsymbol{\beta}_\star \boldsymbol{\beta}_\star^\top \left(\boldsymbol{M}^L \right)^\top \boldsymbol{\Lambda} \left( \boldsymbol{M}^{L-1-\ell} \right)^\top \boldsymbol{O}^\top \right> \\ &= - 2 L^{-1} \sum_{\ell} \text{tr} \left< \hat{\boldsymbol{\Sigma}} \boldsymbol{M}^{\ell} \boldsymbol{\beta}_\star \boldsymbol{\beta}_\star^\top \left(\boldsymbol{M}^L \right)^\top \boldsymbol{\Lambda} \left( \boldsymbol{M}^{L-1-\ell} \right)^\top \right> \times \boldsymbol{I} \end{align}\] where we explicitly performed the average over the only remaining factors of \(\boldsymbol{O}\) since \(\boldsymbol{M} = \boldsymbol{I} - \gamma L^{-1} \hat{\boldsymbol{\Sigma}}\) is independent of \(\boldsymbol{O}\). Further, we can express the dynamics of \(\gamma(t)\) as a gradient flow on the reduced one-dimensional loss landscape \[\begin{align} \mathcal{L}(\gamma) = \text{tr} \;\boldsymbol{\Omega} \left< \left[ \left( \boldsymbol{I} - \gamma L^{-1} \hat{\boldsymbol{\Sigma}} \right)^L \right]^\top \boldsymbol{\Lambda} \left( \boldsymbol{I} - \gamma L^{-1} \hat{\boldsymbol{\Sigma}} \right)^L \right> \end{align}\]
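The isotropy mechanism is just the Haar identity \(\left<\boldsymbol{O}\boldsymbol{C}\boldsymbol{O}^\top\right> = \text{tr}(\boldsymbol{C})\,\boldsymbol{I}\) (with the normalized trace convention used here); the sketch below (small dimension and a finite number of Haar samples, chosen purely for illustration) checks it by Monte Carlo:

```python
# Sketch: Monte Carlo check that the Haar average <O C O^T> is proportional to the identity.
import numpy as np

rng = np.random.default_rng(3)
D, n_samples = 30, 5000
C = rng.standard_normal((D, D)) + 2.0 * np.eye(D)     # any fixed matrix independent of O

avg = np.zeros((D, D))
for _ in range(n_samples):
    Q, R = np.linalg.qr(rng.standard_normal((D, D)))
    Q = Q * np.sign(np.diag(R))                        # Haar-distributed orthogonal matrix
    avg += Q @ C @ Q.T
avg /= n_samples

print("normalized trace of C:     ", np.trace(C) / D)
print("mean diagonal of <O C O^T>:", np.mean(np.diag(avg)))
print("largest off-diagonal entry:", np.max(np.abs(avg - np.diag(np.diag(avg)))))  # shrinks as 1/sqrt(n_samples)
```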

12 Scaling Law Theory↩︎

In this section, we consider the scaling law theory with a finite projection matrix \(\boldsymbol{A}\), following [17]. In this case, our RRS setting requires utilizing the free product result in Appendix 7.

12.1 Dynamics at Infinite \(N,L,P\)↩︎

The gradient flow in the limit of \(N , L , P \to \infty\) has the form \[\begin{align} \frac{d}{dt} \gamma(t) = - \frac{\partial}{\partial \gamma} \sum_k \lambda_k \omega_k e^{- 2 \gamma(t) \lambda_k } \approx - \frac{\partial}{\partial \gamma} \gamma(t)^{- \beta } = \beta \gamma(t)^{- \beta - 1 } \end{align}\] This is a separable ODE with solution (under the assumption that \(\gamma(0) = 0\)) up to constants \[\begin{align} \gamma(t) \sim t^{ \frac{1}{\beta + 2} } . \end{align}\] This implies a powerlaw loss scaling with exponent set by the rate of growth of \(\gamma(t)\) \[\begin{align} \mathcal{L}(t) \sim \gamma(t)^{-\beta } \sim t^{- \frac{\beta}{2+\beta }} . \end{align}\] We expect these dynamics to hold in the limit of \(N ,L , P \to \infty\). At finite \(N,L,P\) the model will saturate at some finite maximal value of \(\gamma\).
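The sketch below (power-law spectra and the step-size schedule are illustrative assumptions) integrates this gradient flow directly and checks the two exponents:

```python
# Sketch: gradient flow d gamma/dt = -d/d gamma sum_k lambda_k omega_k exp(-2 gamma lambda_k).
import numpy as np

D, nu, beta = 50000, 1.0, 1.0
k = np.arange(1, D + 1)
lam = k ** (-nu)
lam_omega = k ** (-nu * beta - 1.0)               # lambda_k omega_k ~ k^{-nu beta - 1}

def loss(gamma):
    return np.sum(lam_omega * np.exp(-2.0 * gamma * lam))

def grad(gamma):                                  # dL/d gamma
    return np.sum(-2.0 * lam * lam_omega * np.exp(-2.0 * gamma * lam))

gamma, t, dt, traj = 0.0, 0.0, 1e-3, []
while t < 1e4:
    gamma -= dt * grad(gamma)                     # Euler step of gradient flow
    t += dt
    dt = min(1.05 * dt, 1e-2 * t + 1e-3)          # slowly grow the step as the dynamics slow down
    traj.append((t, gamma, loss(gamma)))

ts, gs, ls = np.array(traj).T
print("gamma exponent ~", np.polyfit(np.log(ts[-200:]), np.log(gs[-200:]), 1)[0], "(predicted", 1 / (beta + 2), ")")
print("loss exponent  ~", np.polyfit(np.log(ts[-200:]), np.log(ls[-200:]), 1)[0], "(predicted", -beta / (beta + 2), ")")
```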

12.2 Depth Scaling Law at Infinite Time, Width, and Context Length↩︎

We now explore the scaling law in depth \(L\). In this case, we can consider taking all other resources to infinity \(t,N,P \to \infty\). The value of \(\gamma\) will stabilize, resulting in a final loss \[\begin{align} \mathcal{L}(\gamma) = \sum_k \lambda_k \omega_k \left( 1 - \gamma L^{-1} \lambda_k \right)^{2L} \end{align}\] While \(\gamma(t)\) diverges as \(t^{\frac{1}{2+\beta}}\) in the infinite depth limit, we see that at large but finite depth, there is a maximal \(\gamma\) which maintains stability along the top eigendirection. Specifically we need \(\gamma < \frac{2L}{\lambda_1}\), and we approximate the optimal value as \[\begin{align} \gamma_\star \approx L / \lambda_1 , \end{align}\] where \(\lambda_1\) is the top (maximal) eigenvalue. At \(\gamma_\star\), the error takes the form \[\begin{align} \mathcal{L} \approx \sum_{k} \lambda_k (\beta^\star_k)^2 \left( 1 - \lambda_k / \lambda_1 \right)^{2L} \approx \int dk\, k^{-\nu\beta-1} \exp(-2 L k^{-\nu}) \approx L^{-\beta} , \end{align}\] which matches the scaling of \(L\) steps of gradient descent on a problem with source exponent \(\beta\) [76].
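A short sketch (power-law spectra assumed; the depths are chosen arbitrarily) evaluates this depth-limited loss and checks the \(L^{-\beta}\) trend:

```python
# Sketch: depth-limited loss at gamma_star = L / lambda_1 under power-law spectra.
import numpy as np

D, nu, beta = 200000, 1.0, 1.5
k = np.arange(1, D + 1)
lam = k ** (-nu)
lam_omega = k ** (-nu * beta - 1.0)        # lambda_k (beta_k^star)^2 ~ k^{-nu beta - 1}

depths = np.array([8, 16, 32, 64, 128, 256, 512])
losses = [np.sum(lam_omega * (1.0 - lam / lam[0]) ** (2 * L)) for L in depths]
slope = np.polyfit(np.log(depths), np.log(losses), 1)[0]
print("loss-vs-depth exponent ~", slope, "(predicted", -beta, ")")
```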

12.3 Mapping the Loss Function to a Two Point Deterministic Equivalent↩︎

We utilize the dynamical mean field theory (DMFT) result of Appendix 7 to compute the effective loss as a function of \(N,L\) and \(\gamma\). We start with a representation of the relevant polynomial in \(\boldsymbol{M}\) with the Cauchy integral formula \[\begin{align} \left( \boldsymbol{I} - \gamma L^{-1} \boldsymbol{M} \right)^{L} = \int_{\mathcal{C}} \frac{d\omega }{2\pi} \left[ i\omega + \boldsymbol{M} \right]^{-1} \left( 1 + \gamma L^{-1} i \omega \right)^{L} , \end{align}\] where the contour \(\mathcal{C}\) encloses the positive imaginary axis in the complex plane [77]. Next, we define \(\boldsymbol{v}(\gamma) = \left( \boldsymbol{I} - \gamma L^{-1} \boldsymbol{M} \right)^{L} \boldsymbol{\beta}_\star\) \[\begin{align} \mathcal{L} = \frac{1}{D} \left< \boldsymbol{v}(\gamma)^\top \boldsymbol{\Lambda} \boldsymbol{v}(\gamma) \right> = \text{tr} \; \boldsymbol{\Omega} \left[ \left( \boldsymbol{I} - \gamma L^{-1} \boldsymbol{M} \right)^{L} \right]^\top \boldsymbol{\Lambda} \left( \boldsymbol{I} - \gamma L^{-1} \boldsymbol{M} \right)^{L} \end{align}\] The loss can thus be expressed as a double integral involving the resolvents \[\begin{align} \mathcal{L} = \int_{\mathcal{C} \times\mathcal{C}} \frac{d \omega d\omega' }{(2\pi )^2} ( 1 + \gamma L^{-1} i\omega)^L (1 + \gamma L^{-1} i\omega')^L \;\;\text{tr} \;\boldsymbol{\Omega} \left[ \left[ i\omega + \boldsymbol{M} \right]^{-1}\right]^\top \boldsymbol{\Lambda} \left[i\omega' + \boldsymbol{M} \right]^{-1} \end{align}\] The main result needed is the following deterministic equivalent from Appendix 7. \[\begin{align} &\text{tr} \;\boldsymbol{\Omega} \; \left[ \left[ i\omega + \boldsymbol{M} \right]^{-1}\right]^\top \boldsymbol{\Lambda} \left[i\omega' + \boldsymbol{M} \right]^{-1} \nonumber \\ &\simeq \text{tr} \left( i \omega + \Psi_{v\chi}(\omega) \boldsymbol{A} \right)^{-1} \left[ \boldsymbol{\Omega} - \boldsymbol{I} \Psi_{\chi\chi}(\omega,\omega') \right] \left( i \omega' + \Psi_{v\chi}(\omega') \boldsymbol{A} \right)^{-1} \boldsymbol{\Lambda} \end{align}\] where \(\Psi_{v\chi}(\omega) = i\omega / i\omega_{(\boldsymbol{A}^\top \boldsymbol{A})^2}\) can be determined from the equation \[\begin{align} i\omega_{(\boldsymbol{A}^\top \boldsymbol{A})^2}(\tau) i\omega_{\hat{\boldsymbol{\Sigma}}}(\tau) = \frac{1-\tau}{\tau} ( i \omega ) \end{align}\] Once \(\tau\) is identified as a function of \(\omega\), we can compute \(\Psi_{v\chi}\) and \(\Psi_{\chi\chi}\) from the formulae in Appendix 7.

12.4 Free Product of an Orthogonal Projection and a Structured Wishart↩︎

In this section, we describe what these relations imply for the product matrix that arises in our RRS setting, which is a free product of the (squared) width projection matrix \((\boldsymbol{A}^\top \boldsymbol{A})^2\), where \(\boldsymbol{A} \in \mathbb{R}^{N \times D}\) has rank at most \(N\), and the empirical covariance of the data in the context \(\hat{\boldsymbol{\Sigma}} = \frac{1}{P} \boldsymbol{X} \boldsymbol{X}^\top \in \mathbb{R}^{D\times D}\), \[\begin{align} \boldsymbol{M} = \boldsymbol{O} \; \left( \boldsymbol{A}^\top \boldsymbol{A} \right)^2 \; \boldsymbol{O}^\top \;\hat{\boldsymbol{\Sigma}} , \end{align}\] where \(\boldsymbol{O}\) is a random orthogonal matrix sampled from the Haar measure. The data matrix \(\boldsymbol{X} \in \mathbb{R}^{D \times P}\) is comprised of \(P\) i.i.d. random vectors (as columns) with covariance \(\boldsymbol{\Lambda}\), which we take to be diagonal without loss of generality. Further, we let \(\boldsymbol{A}^\top \boldsymbol{A}\) be a rank \(N\) projection matrix \[\begin{align} \boldsymbol{A}^\top \boldsymbol{A} = \sum_{k=1}^N \boldsymbol{e}_k \boldsymbol{e}_k^\top \in \mathbb{R}^{D \times D} \end{align}\] where \(\boldsymbol{e}_k \in \mathbb{R}^{D}\) are Cartesian (one-hot) unit vectors. To utilize the result in Appendix 7, we first need to compute the necessary resolvents for these two matrices. Let \(\alpha_N \equiv N/D\) be the ratio between width \(N\) and input dimension \(D\) and \(\alpha_P = \frac{P}{D}\) represent the ratio of context length to input dimension; then \[\begin{align} \tau = \text{tr} (\boldsymbol{A}^\top \boldsymbol{A})^2 \left( i\omega_{(\boldsymbol{A}^\top \boldsymbol{A})^2} + (\boldsymbol{A}^\top \boldsymbol{A})^2 \right)^{-1} = \frac{\alpha_N}{i\omega_{(\boldsymbol{A}^\top \boldsymbol{A})^2} + 1} \implies i\omega_{(\boldsymbol{A}^\top \boldsymbol{A})^2} = - 1 + \frac{\alpha_N}{\tau} \end{align}\] The other matrix can be easily worked out for a structured Wishart with \(P/D = \alpha_P\) following the techniques of [17], which reveals that \[\begin{align} \tau = \text{tr} \; \hat{\boldsymbol{\Sigma}} \left( i \omega_{\hat{\boldsymbol{\Sigma}}} + \hat{\boldsymbol{\Sigma}} \right)^{-1} = \text{tr} \boldsymbol{\Lambda} \left( i \omega_{\hat{\boldsymbol{\Sigma}}} (1-\alpha_P^{-1} \tau)^{-1} + \boldsymbol{\Lambda} \right)^{-1} . \end{align}\]

Using our result for products we find \[\begin{align} &\left( - 1 + \frac{\alpha_{N}}{\tau} \right) \left( i \omega_{\hat{\boldsymbol{\Sigma}}}(\tau) \right) = \left( - 1 + \frac{1}{\tau} \right) i \omega \nonumber \\ &\tau = \text{tr} \boldsymbol{\Lambda} \left( i \omega_{\hat{\boldsymbol{\Sigma}}}(1-\alpha_P^{-1} \tau)^{-1} + \boldsymbol{\Lambda} \right)^{-1} \nonumber \\ &= \text{tr} \boldsymbol{\Lambda} \left( i\omega \; (-1+\tau^{-1}) (-1 + \alpha_N/\tau)^{-1} (1- \tau / \alpha_P )^{-1} + \boldsymbol{\Lambda} \right)^{-1} . \end{align}\]
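In practice, these coupled equations can be solved numerically; the sketch below (a power-law \(\boldsymbol{\Lambda}\), specific ratios \(\alpha_N, \alpha_P\), and a damped fixed-point iteration are all illustrative choices) solves for \(\tau\) at a given spectral argument \(i\omega\):

```python
# Sketch: damped fixed-point solve of the self-consistency equation for tau(i omega).
import numpy as np

D, nu = 2000, 1.0
lam = np.arange(1, D + 1) ** (-nu)                  # eigenvalues of Lambda
alpha_N, alpha_P, i_omega = 0.3, 2.0, 1e-3          # width ratio, context ratio, spectral argument

def rhs(tau):
    # effective ridge i omega (-1 + 1/tau)(-1 + alpha_N/tau)^{-1}(1 - tau/alpha_P)^{-1}
    ridge = i_omega * (-1.0 + 1.0 / tau) / (-1.0 + alpha_N / tau) / (1.0 - tau / alpha_P)
    return np.mean(lam / (ridge + lam))             # normalized trace of Lambda (ridge + Lambda)^{-1}

tau = 0.5 * alpha_N                                 # initial guess on the physical branch 0 < tau < alpha_N
for _ in range(500):
    tau = 0.5 * tau + 0.5 * np.clip(rhs(tau), 1e-9, alpha_N - 1e-9)   # damping + branch safeguard
print("tau =", tau, " residual =", tau - rhs(tau))
```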

12.4.0.1 Width Bottleneck

Now we obtain the scaling of the loss with width \(N\) in the regime where \(t,P,L,D \to \infty\). To do so, we examine the structure of the equations at \(\alpha_P \to \infty\) and \(i \omega \to 0\), which corresponds to taking the gradient flow time to infinity so that \(\gamma \to \infty\) (the correct solution for noise-free problems at infinite depth). In this limit \(\tau = \alpha_N\) so we have that \[\begin{align} \tau = \frac{N}{D} = \text{tr} \boldsymbol{\Lambda} \left( i\omega_{\hat{\boldsymbol{\Sigma}}} + \boldsymbol{\Lambda} \right)^{-1} = \frac{1}{D} \sum_{k=1}^D \frac{\lambda_k}{i\omega_{\hat{\boldsymbol{\Sigma}}} + \lambda_k}. \end{align}\] For powerlaw features \(\lambda_k \sim k^{-\nu}\) we find an approximate solution for \(i\omega_{\hat{\boldsymbol{\Sigma}}}\) \[\begin{align} \lambda_k \sim k^{-\nu} \implies i\omega_{\hat{\boldsymbol{\Sigma}}} \approx N^{- \nu} . \end{align}\] Thus the loss at width \(N\) and context length \(P\) can be approximated as \[\begin{align} \lim_{t, L,P \to \infty} \mathcal{L}(t, N , P , L ) = \sum_k \frac{(i\omega_{\hat{\boldsymbol{\Sigma}}} )^2 \lambda_k (\beta^\star_k)^2}{ ( i\omega_{\hat{\boldsymbol{\Sigma}}} + \lambda_k )^2} \sim \sum_k \frac{k^{-\nu\beta-1}}{ ( 1 + k^{-\nu} N^\nu )^2 } \approx N^{-\nu\beta} \end{align}\]
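The sketch below (assumed power-law spectrum, illustrative widths) solves the width constraint for \(i\omega_{\hat{\boldsymbol{\Sigma}}}\) by bisection and evaluates the limiting loss; the analogous check with \(P\) in place of \(N\) applies in the context-bottleneck regime discussed next:

```python
# Sketch: width-bottleneck loss scaling in the t, L, P -> infinity limit.
import numpy as np

D, nu, beta = 200000, 1.5, 1.0
k = np.arange(1, D + 1)
lam = k ** (-nu)
lam_omega = k ** (-nu * beta - 1.0)                 # lambda_k (beta_k^star)^2

def solve_iomega(N, lo=1e-12, hi=1e2, iters=100):
    # geometric bisection on the monotone constraint (1/D) sum_k lam/(w + lam) = N/D
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if np.mean(lam / (mid + lam)) > N / D:
            lo = mid
        else:
            hi = mid
    return np.sqrt(lo * hi)

Ns = np.array([50, 100, 200, 400, 800, 1600])
losses = []
for N in Ns:
    w = solve_iomega(N)
    losses.append(np.sum(w**2 * lam_omega / (w + lam) ** 2))
slope = np.polyfit(np.log(Ns), np.log(losses), 1)[0]
print("loss-vs-width exponent ~", slope, "(predicted", -nu * beta, ")")
```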

12.4.0.2 Context Bottleneck

Following the same logic, we can investigate the regime where performance is limited by the context length \(P\). To access an approximation for this regime, we take \(\alpha_N \to 1\) and \(i\omega \to 0\). In this case, we have \(i \omega = i\omega_{\hat{\boldsymbol{\Sigma}}}\). Since \(1-\tau/\alpha_P \to 0\) as \(i\omega\to 0\), we introduce the leading order behavior for small \(i\omega\) \[\begin{align} 1- \frac{\tau}{\alpha_P} \sim r^{-1} i\omega \;, \; i\omega \to 0 \end{align}\] where \(r\) is a (currently) unknown value. Plugging this into the self consistency equation, we identify the value of \(r\) \[\begin{align} P = \sum_k \frac{\lambda_k}{r + \lambda_k } \approx r^{ - \frac{1}{\nu }} \end{align}\] which gives the scaling \(r \approx P^{-\nu}\). We can now evaluate the loss using the limiting resolvent \(\left( r + \boldsymbol{\Lambda} \right)^{-1}\) \[\begin{align} \mathcal{L} = \sum_k \frac{r^2 \lambda_k (\beta^\star_k)^2}{(r+\lambda_k)^2} \approx \sum_k \frac{k^{-\nu \beta - 1}}{(1 + k^{-\nu } P^\nu )^2} \sim P^{- \nu \beta} . \end{align}\]

12.4.0.3 Rank Deficiency/Null Space Interpretation

These last two scaling laws express the simple fact that rank \(N\) or rank \(P\) matrices only allow the top \(N\) or \(P\) eigendirections to be learned [17].

13 Enhancing Realism of the Model↩︎

13.1 Different Parameters for Each Layer↩︎

In this section, we analyze the case where the model is allowed \(L\) distinct attention layers along the residual stream \(\{ \boldsymbol{\Gamma}^\ell \}_{\ell=1}^L\) instead of a single attention layer matrix \(\boldsymbol{\Gamma}\) which is applied \(L\) times recurrently. We focus our attention on the RRS setting at large width and large context length where the model still exhibits nontrivial depth and pretraining time scaling laws. We first consider the noise free setting \(\sigma^2 = 0\) before considering the role of target noise.

13.1.1 Noise Free Loss Function at Large Width and Context Length↩︎

First, we note that by rotational invariance, each of the matrices is isotropic, \(\boldsymbol{\Gamma}^\ell = \gamma^\ell \boldsymbol{I}\), so that \[\begin{align} \mathcal{L}(\{ \gamma^\ell \}) = \left< \left| \boldsymbol{\Lambda}^{1/2} \prod_{\ell=1}^L\left[ \boldsymbol{I} - L^{-1} \gamma^\ell \boldsymbol{M} \right] \bar{\boldsymbol{\beta}} \right|^2 \right> \;, \;\boldsymbol{M} = \boldsymbol{O} \left( \boldsymbol{A}^\top \boldsymbol{A} \right)^2 \boldsymbol{O}^\top \hat{\boldsymbol{\Sigma}} \end{align}\] Since the factors in the matrix product \(\prod_{\ell=1}^L\left[ \boldsymbol{I} - L^{-1} \gamma^\ell \boldsymbol{M} \right]\) commute, this loss function is permutation symmetric in the \(\{\gamma^\ell\}\) variables. We therefore expect permutation symmetric dynamics and a solution where \(\gamma^\ell = \gamma\). Indeed, analyzing the gradient flow from \(\gamma^\ell = 0\) for all \(\ell\) leads to a balanced solution where all of these are equal since \[\begin{align} \frac{\partial}{\partial \gamma^\ell}\mathcal{L}(\{ \gamma^\ell \})|_{\gamma^k = \gamma} = \frac{\partial}{\partial \gamma^{\ell'}}\mathcal{L}(\{ \gamma^\ell \})|_{\gamma^k = \gamma} \;, \;\forall \ell,\ell' \in \{1,...,L\} . \end{align}\] Thus gradient flow will maintain a balance in these parameters. Further, this flow will exactly match the dynamics of the recurrent model if the learning rate is upscaled by \(L\). We plot these dynamics and show balancing of the \(\gamma^\ell(t)\) variables in Figure 8.
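The sketch below (an idealized large-width, large-context limit in which \(\boldsymbol{M}\) is replaced by \(\boldsymbol{\Lambda}\), with assumed power-law spectra) runs gradient flow on the decoupled \(\{\gamma^\ell\}\) with the learning rate upscaled by \(L\) and checks that they stay balanced and track the tied (recurrent) model:

```python
# Sketch: balancing of decoupled per-layer scale factors under gradient flow.
import numpy as np

D, nu, beta, L = 2000, 1.0, 1.0, 6
kk = np.arange(1, D + 1)
lam = kk ** (-nu)
lam_omega = kk ** (-nu * beta - 1.0)           # omega_k lambda_k

def grad_decoupled(gammas):
    # d/d gamma^l of sum_k omega_k lambda_k prod_l (1 - gamma^l lambda_k / L)^2
    factors = 1.0 - np.outer(gammas, lam) / L                 # shape (L, D)
    prod = np.prod(factors, axis=0)
    g = np.zeros(L)
    for l in range(L):
        prod_wo_l = np.prod(np.delete(factors, l, axis=0), axis=0)
        g[l] = np.sum(lam_omega * 2.0 * prod * prod_wo_l * (-lam / L))
    return g

gammas = np.zeros(L)          # decoupled layers, zero initialization
gamma_tied = 0.0              # tied (recurrent) model
dt = 2e-2
for _ in range(2000):
    gammas = gammas - dt * L * grad_decoupled(gammas)                    # decoupled lr upscaled by L
    gamma_tied = gamma_tied - dt * grad_decoupled(np.full(L, gamma_tied)).sum()  # tied gradient sums over layers
print("spread of decoupled gammas:", gammas.max() - gammas.min())        # stays ~0 (balanced)
print("decoupled gamma vs tied gamma:", gammas[0], gamma_tied)           # the two flows coincide (up to float error)
```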

Figure 8: Dynamics for the layer-decoupled reduced \(\Gamma\) model. (a) The loss dynamics under powerlaw RRS covariates exhibit the same powerlaws as the recurrent model. (b) The different scale factors remain balanced throughout gradient flow, \(\gamma^k(t) =\gamma(t)\) for all \(k \in \{1,...,L \}\), due to permutation symmetry.

13.1.2 Noisy Target Function↩︎

The predictor in the case of a noisy target function with decoupled layers can be expressed in terms of the evolution of the weight discrepancy \(\boldsymbol{v}^\ell \equiv \bar{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}^\ell\) which has the recursion \[\begin{align} \boldsymbol{v}^{\ell} = \boldsymbol{v}^{\ell-1} - \gamma^\ell L^{-1} \boldsymbol{M} \boldsymbol{v}^{\ell-1} - \frac{\sigma \sqrt{D}}{P} \gamma^\ell L^{-1} \boldsymbol{X} \boldsymbol{\epsilon} \; , \;\ell \in \{1,...,L\} \end{align}\] Defining the matrices \(\boldsymbol{A}^\ell = \left[ \boldsymbol{I} - \gamma^\ell L^{-1} \boldsymbol{M} \right]\) and \(\boldsymbol{b} = \frac{\sigma\sqrt D}{L P} \boldsymbol{X} \boldsymbol{\epsilon}\) so that \(\boldsymbol{v}^\ell = \boldsymbol{A}^\ell \boldsymbol{v}^{\ell-1} - \gamma^\ell \boldsymbol{b}\), we have \[\begin{align} \boldsymbol{v}^L = \prod_{\ell=1}^L \boldsymbol{A}^\ell \bar{\boldsymbol{\beta}} - \sum_{\ell=1}^L \gamma^\ell \left( \prod_{k=\ell+1}^L \boldsymbol{A}^k \right) \boldsymbol{b} \end{align}\] Since the loss function takes the form \(\mathcal{L} = \frac{1}{D} \left< \boldsymbol{v}^L \cdot \boldsymbol{\Lambda} \boldsymbol{v}^L \right>\) we have \[\begin{align} \mathcal{L} = &\frac{1}{D} \left< \bar{\boldsymbol{\beta}}^\top \left( \prod_{\ell} \boldsymbol{A}^\ell \right)^\top \boldsymbol{\Lambda} \left( \prod_{\ell} \boldsymbol{A}^\ell \right) \bar{\boldsymbol{\beta}} \right> \nonumber \\ &+ \frac{1}{D} \sum_{\ell \ell'} \gamma^\ell \gamma^{\ell'} \left< \boldsymbol{b}^\top \left( \prod_{k=\ell+1}^L \boldsymbol{A}^k \right)^\top \boldsymbol{\Lambda} \left( \prod_{k=\ell'+1}^L \boldsymbol{A}^k \right) \boldsymbol{b} \right> \end{align}\] Using the fact that \(\left< \boldsymbol{b} \boldsymbol{b}^\top \right> = \sigma^2 L^{-2} \alpha^{-1} \boldsymbol{\Lambda}\) we have the following loss function \[\begin{align} \mathcal{L} = &\frac{1}{D} \left< \left| \boldsymbol{\Lambda}^{1/2} \prod_{\ell=1}^L \boldsymbol{A}^\ell \bar{\boldsymbol{\beta}} \right|^2 \right> + \frac{\sigma^2}{L^2 \alpha} \sum_{\ell \ell'} \gamma^\ell \gamma^{\ell'} \text{tr} \left< \boldsymbol{\Lambda} \left( \prod_{k=\ell+1}^L \boldsymbol{A}^k \right)^\top \boldsymbol{\Lambda} \left( \prod_{k=\ell'+1}^L \boldsymbol{A}^k \right) \right> . \end{align}\]

13.2 Decoupling Weights in Attention Layers↩︎

We note that for the ISO and RRS settings, we can exploit similar symmetry arguments as in Sections 9.2 and 11 to argue that the gradient updates for \(\boldsymbol{W}_x, \boldsymbol{W}_k, \boldsymbol{W}_q\) are isotropic and that \(\boldsymbol{W}_v\) gets an update in the \(\boldsymbol{w}_o \boldsymbol{w}_y^\top\) direction. The gradient flow can be expressed in terms of the original gradients on \(\boldsymbol{\Gamma}\) \[\begin{align} \partial_t \boldsymbol{W}_i = - \frac{\partial \mathcal{L}}{ \partial \boldsymbol{\Gamma}} \cdot \frac{\partial \boldsymbol{\Gamma}}{\partial \boldsymbol{W}_i} \;, \;i \in \{ x, k, q, v , o, y\} \end{align}\] As before, under the assumption of small initial conditions, we can reduce the loss to a collection of ODEs on scalars representing the scale of each weight matrix \[\begin{align} \mathcal{L}(w_o, w_v, w_q, w_k, w_x, w_y) &= \text{tr} \;\boldsymbol{\Omega} \boldsymbol{\Lambda} \left[ \boldsymbol{I} - w_o w_y \left( \boldsymbol{I} - \left[1 - L^{-1} w_v w_k w_q (w_x)^2 \boldsymbol{\Lambda} \right]^L \right) \right]^{2} \end{align}\] Gradient flow dynamics on this loss function can reproduce the dynamics of the decoupled self-attention model. Under the further assumption that \(w_o= w_y=1\), we can simplify the loss further to \[\begin{align} \mathcal{L}(w_v, w_q, w_k, w_x) &= \text{tr} \; \boldsymbol{\Omega} \boldsymbol{\Lambda} \left[ \boldsymbol{I} - L^{-1} w_v w_k w_q (w_x)^2 \boldsymbol{\Lambda} \right]^{2L} \end{align}\] We note that \(w_x\) will be updated twice as quickly as the other weights under gradient flow. To achieve balance, we can initialize \(w_x(0) = \sqrt{2}\, w_k(0)\) and \(w_q(0) = w_k(0)=w_v(0) = \sigma\). In this case, the loss can be further reduced to a function of a single variable \(w(t)\) \[\begin{align} \mathcal{L}(w) &= \text{tr} \;\boldsymbol{\Omega} \boldsymbol{\Lambda} \left[ \boldsymbol{I} - L^{-1} w(t)^5 \boldsymbol{\Lambda} \right]^{2L} \end{align}\] Under source and capacity assumptions, this will generate the following dynamics at large depth \(L\) \[\begin{align} w(t) \sim t^{\frac{1}{5\beta + 2}} \implies \mathcal{L} \sim t^{- \frac{5\beta}{5\beta + 2}} . \end{align}\]
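The sketch below (power-law spectra, the large-depth exponential approximation, and an order-one starting value of \(w\) to skip the slow transient near zero are all illustrative assumptions) integrates the reduced single-variable flow and checks the predicted loss exponent:

```python
# Sketch: gradient flow on the reduced loss L(w) ~ sum_k omega_k lambda_k exp(-2 w^5 lambda_k).
import numpy as np

D, nu, beta = 200000, 1.0, 1.0
k = np.arange(1, D + 1)
lam = k ** (-nu)
lam_omega = k ** (-nu * beta - 1.0)

def loss(w):
    return np.sum(lam_omega * np.exp(-2.0 * w**5 * lam))

def grad(w):                                       # dL/dw
    return np.sum(lam_omega * np.exp(-2.0 * w**5 * lam) * (-10.0 * lam * w**4))

w, t, dt, traj = 1.0, 0.0, 1e-3, []
while t < 1e4:
    w -= dt * grad(w)                              # Euler step of dw/dt = -dL/dw
    t += dt
    dt = min(1.05 * dt, 1e-2 * t + 1e-3)
    traj.append((t, loss(w)))
ts, ls = np.array(traj).T
print("loss exponent ~", np.polyfit(np.log(ts[-200:]), np.log(ls[-200:]), 1)[0],
      "(predicted", -5 * beta / (5 * beta + 2), ")")
```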

Figure 9: Softmax attention models with MLP blocks on the residual stream also benefit from increasing the depth.

13.3 Softmax Attention↩︎

We also provide numerical experiments with non-recurrent softmax attention, decoupled layers, and Adam. In this context we use CompleteP scaling for the learning rate \(\Theta_L(1)\) [12]. The results are provided in Figure 6 (c).

Figure 9 supplements this with an analogous experiment for a non-recurrent architecture that alternates softmax attention and MLP blocks with GELU activations.

References↩︎

[1]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[2]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[3]
Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
[4]
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
[5]
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
[6]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[7]
Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tuning large neural networks via zero-shot hyperparameter transfer. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=Bx6qKuBM2AD.
[8]
Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin, and Cengiz Pehlevan. Depthwise hyperparameter transfer in residual networks: Dynamics and scaling limit. In The Twelfth International Conference on Learning Representations.
[9]
Blake Bordelon, Hamza Chaudhry, and Cengiz Pehlevan. Infinite limits of multi-head transformer dynamics. Advances in Neural Information Processing Systems, 37: 35824–35878, 2024.
[10]
Lénaı̈c Chizat and Praneeth Netrapalli. The feature speed formula: a flexible approach to scale hyper-parameters of deep neural networks. Advances in Neural Information Processing Systems, 37: 62362–62383, 2024.
[11]
Lorenzo Noci, Chuning Li, Mufan Li, Bobby He, Thomas Hofmann, Chris J Maddison, and Dan Roy. The shaped transformer: Attention models in the infinite depth-and-width limit. Advances in Neural Information Processing Systems, 36: 54250–54281, 2023.
[12]
Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, and Joel Hestness. Don’t be lazy: Completep enables compute-efficient deep transformers. arXiv preprint arXiv:2505.01618, 2025.
[13]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901, 2020.
[14]
Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. Explaining neural scaling laws. arXiv preprint arXiv:2102.06701, 2021.
[15]
Alexander Maloney, Daniel A Roberts, and James Sully. A solvable model of neural scaling laws. arXiv preprint arXiv:2210.16859, 2022.
[16]
James B Simon, Dhruva Karkada, Nikhil Ghosh, and Mikhail Belkin. More is better in modern machine learning: when infinite overparameterization is optimal and overfitting is obligatory. arXiv preprint arXiv:2311.14646, 2023.
[17]
Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. A dynamical model of neural scaling laws. In Proceedings of the 41st International Conference on Machine Learning, pp. 4345–4382, 2024.
[18]
Elliot Paquette, Courtney Paquette, Lechao Xiao, and Jeffrey Pennington. 4+ 3 phases of compute-optimal neural scaling laws. arXiv preprint arXiv:2405.15074, 2024.
[19]
Licong Lin, Jingfeng Wu, Sham M Kakade, Peter L Bartlett, and Jason D Lee. Scaling laws in linear regression: Compute, parameters, and data. arXiv preprint arXiv:2406.08466, 2024.
[20]
Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. How feature learning can improve neural scaling laws. In The Thirteenth International Conference on Learning Representations.
[21]
Yue M. Lu, Mary Letey, Jacob A. Zavatone-Veth, Anindita Maiti, and Cengiz Pehlevan. Asymptotic theory of in-context learning by linear attention. Proceedings of the National Academy of Sciences, 122 (28): e2502599122, 2025. URL https://www.pnas.org/doi/abs/10.1073/pnas.2502599122.
[22]
Mario Geiger, Stefano Spigler, Arthur Jacot, and Matthieu Wyart. Disentangling feature and lazy training in deep neural networks. Journal of Statistical Mechanics: Theory and Experiment, 2020 (11): 113301, 2020.
[23]
Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115 (33): E7665–E7671, 2018.
[24]
Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in neural information processing systems, 31, 2018.
[25]
Greg Yang and Edward J Hu. Tensor programs iv: Feature learning in infinite-width neural networks. In International Conference on Machine Learning, pp. 11727–11737. PMLR, 2021.
[26]
Blake Bordelon and Cengiz Pehlevan. Self-consistent dynamical field theory of kernel evolution in wide neural networks. Advances in Neural Information Processing Systems, 35: 32240–32256, 2022.
[27]
Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou. Tensor programs vi: Feature learning in infinite-depth neural networks. arXiv preprint arXiv:2310.02244, 2023.
[28]
Blake Bordelon, Abdulkadir Canatar, and Cengiz Pehlevan. Spectrum dependent learning curves in kernel regression and wide neural networks. In International Conference on Machine Learning, pp. 1024–1034. PMLR, 2020.
[29]
Stefano Spigler, Mario Geiger, and Matthieu Wyart. Asymptotic learning curves of kernel methods: empirical data versus teacher–student paradigm. Journal of Statistical Mechanics: Theory and Experiment, 2020 (12): 124001, 2020.
[30]
Hugo Cui, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová. Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. Advances in Neural Information Processing Systems, 34: 10131–10143, 2021.
[31]
Alexander Atanasov, Blake Bordelon, Sabarish Sainathan, and Cengiz Pehlevan. The onset of variance-limited behavior for networks in the lazy and rich regimes. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=JLINxPOVTh7.
[32]
Alexander Atanasov, Blake Bordelon, and Cengiz Pehlevan. Neural networks as kernel learners: The silent alignment effect. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=1NvflqAdoom.
[33]
Leonardo Defilippis, Bruno Loureiro, and Theodor Misiakiewicz. Dimension-free deterministic equivalents and scaling laws for random feature regression. Advances in Neural Information Processing Systems, 37: 104630–104693, 2024.
[34]
Blake Bordelon and Cengiz Pehlevan. Learning curves for SGD on structured features. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=WPI2vbkAl3Q.
[35]
Frederik Kunstner and Francis Bach. Scaling laws for gradient descent and sign descent for linear bigram models under zipf’s law. arXiv preprint arXiv:2505.19227, 2025.
[36]
Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and Yasin Abbasi Yadkori. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025.
[37]
Ruike Zhu, Hanwen Zhang, Tianyu Shi, Chi Wang, Tianyi Zhou, and Zengyi Qin. The 4th dimension for scaling model size. arXiv preprint arXiv:2506.18233, 2025.
[38]
Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025.
[39]
William Merrill and Ashish Sabharwal. A little depth goes a long way: The expressive power of log-depth transformers. arXiv preprint arXiv:2503.03961, 2025.
[40]
Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes. Advances in neural information processing systems, 35: 30583–30598, 2022.
[41]
Toni JB Liu, Nicolas Boullé, Raphaël Sarfati, and Christopher J Earls. Llms learn governing principles of dynamical systems, revealing an in-context neural scaling law. arXiv preprint arXiv:2402.00795, 2024.
[42]
Stephanie C. Y. Chan, Adam Santoro, Andrew K. Lampinen, Jane X. Wang, Aaditya Singh, Pierre H. Richemond, Jay McClelland, and Felix Hill. Data distributional properties drive emergent in-context learning in transformers, 2022. URL https://arxiv.org/abs/2205.05055.
[43]
Tianyu He, Darshil Doshi, Aritra Das, and Andrey Gromov. Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Advances in Neural Information Processing Systems, volume 37, pp. 13244–13273. Curran Associates, Inc., 2024. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/17d60fef592086d1a5cb136f1946df59-Paper-Conference.pdf.
[44]
Gavin McCracken, Gabriela Moisescu-Pareja, Vincent Letourneau, Doina Precup, and Jonathan Love. Uncovering a universal abstract algorithm for modular addition in neural networks, 2025. URL https://arxiv.org/abs/2505.18266.
[45]
Core Francisco Park, Andrew Lee, Ekdeep Singh Lubana, Yongyi Yang, Maya Okawa, Kento Nishi, Martin Wattenberg, and Hidenori Tanaka. Iclr: In-context learning of representations. arXiv preprint arXiv:2501.00070, 2024.
[46]
Siyu Chen, Heejune Sheen, Tianhao Wang, and Zhuoran Yang. Training dynamics of multi-head softmax attention for in-context learning: Emergence, convergence, and optimality. arXiv preprint arXiv:2402.19442, 2024.
[47]
Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit bayesian inference. arXiv preprint arXiv:2111.02080, 2021.
[48]
Daniel Wurgaft, Ekdeep Singh Lubana, Core Francisco Park, Hidenori Tanaka, Gautam Reddy, and Noah D Goodman. In-context learning strategies emerge rationally. arXiv preprint arXiv:2506.17859, 2025.
[49]
Shang Liu, Zhongze Cai, Guanting Chen, and Xiaocheng Li. Towards better understanding of in-context learning ability from in-context uncertainty quantification, 2024. URL https://arxiv.org/abs/2405.15115.
[50]
Madhur Panwar, Kabir Ahuja, and Navin Goyal. In-context learning through the bayesian prism, 2024. URL https://arxiv.org/abs/2306.04891.
[51]
Yufeng Zhang, Fengzhuo Zhang, Zhuoran Yang, and Zhaoran Wang. What and how does in-context learning learn? bayesian model averaging, parameterization, and generalization. arXiv preprint arXiv:2305.19420, 2023.
[52]
William L. Tong and Cengiz Pehlevan. Mlps learn in-context on regression and classification tasks, 2025. URL https://arxiv.org/abs/2405.15618.
[53]
Anastasis Kratsios and Takashi Furuya. Is in-context universality enough? mlps are also universal in-context, 2025. URL https://arxiv.org/abs/2502.03327.
[54]
Ruiqi Zhang, Spencer Frei, and Peter L. Bartlett. Trained transformers learn linear models in-context, 2023. URL https://arxiv.org/abs/2306.09927.
[55]
Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn to implement preconditioned gradient descent for in-context learning, 2023. URL https://arxiv.org/abs/2306.00297.
[56]
Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pp. 35151–35174. PMLR, 2023.
[57]
Yingcong Li, M. Emrullah Ildiz, Dimitris Papailiopoulos, and Samet Oymak. Transformers as algorithms: Generalization and stability in in-context learning, 2023. URL https://arxiv.org/abs/2301.07067.
[58]
Juno Kim, Tai Nakamaki, and Taiji Suzuki. Transformers are minimax optimal nonparametric in-context learners, 2024. URL https://arxiv.org/abs/2408.12186.
[59]
Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models. arXiv preprint arXiv:2211.15661, 2022.
[60]
Reese Pathak, Rajat Sen, Weihao Kong, and Abhimanyu Das. Transformers can optimally learn regression mixture models, 2023. URL https://arxiv.org/abs/2311.08362.
[61]
Arvind Mahankali, Tatsunori B. Hashimoto, and Tengyu Ma. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention, 2023. URL https://arxiv.org/abs/2307.03576.
[62]
Max Vladymyrov, Johannes von Oswald, Mark Sandler, and Rong Ge. Linear transformers are versatile in-context learners, 2024. URL https://arxiv.org/abs/2402.14180.
[63]
Yu Bai, Fan Chen, Huan Wang, Caiming Xiong, and Song Mei. Transformers as statisticians: Provable in-context learning with in-context algorithm selection. Advances in neural information processing systems, 36: 57125–57211, 2023.
[64]
Allan Raventós, Mansheej Paul, Feng Chen, and Surya Ganguli. Pretraining task diversity and the emergence of non-bayesian in-context learning for regression, 2023. URL https://arxiv.org/abs/2306.15063.
[65]
Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, and Peter L. Bartlett. How many pretraining tasks are needed for in-context learning of linear regression?, 2024. URL https://arxiv.org/abs/2310.08391.
[66]
Mary I. Letey, Jacob A. Zavatone-Veth, Yue M. Lu, and Cengiz Pehlevan. Pretrain-test task alignment governs generalization in in-context learning, 2025. URL https://arxiv.org/abs/2509.26551.
[67]
Khashayar Gatmiry, Nikunj Saunshi, Sashank J. Reddi, Stefanie Jegelka, and Sanjiv Kumar. On the role of depth and looping for in-context learning with task diversity, 2024. URL https://arxiv.org/abs/2410.21698.
[68]
Yedi Zhang, Aaditya K. Singh, Peter E. Latham, and Andrew Saxe. Training dynamics of in-context learning in linear attention, 2025. URL https://arxiv.org/abs/2501.16265.
[69]
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.
[70]
Madhu S Advani, Andrew M Saxe, and Haim Sompolinsky. High-dimensional dynamics of generalization error in neural networks. Neural Networks, 132: 428–446, 2020.
[71]
Rishi Sonthalia, Jackie Lok, and Elizaveta Rebrova. On regularization via early stopping for least squares regression. arXiv preprint arXiv:2406.04425, 2024.
[72]
Alexander Atanasov, Blake Bordelon, Jacob A Zavatone-Veth, Courtney Paquette, and Cengiz Pehlevan. Two-point deterministic equivalence for stochastic gradient dynamics in linear models. arXiv preprint arXiv:2502.05074, 2025.
[73]
Francesca Mignacco, Florent Krzakala, Pierfrancesco Urbani, and Lenka Zdeborová. Dynamical mean-field theory for stochastic gradient descent in gaussian mixture classification. Advances in Neural Information Processing Systems, 33: 9540–9550, 2020.
[74]
Andrea Montanari and Pierfrancesco Urbani. Dynamical decoupling of generalization and overfitting in large two-layer networks. arXiv preprint arXiv:2502.21269, 2025.
[75]
Marc Potters and Jean-Philippe Bouchaud. A first course in random matrix theory: for physicists, engineers and data scientists. Cambridge University Press, 2020.
[76]
Blake Bordelon and Cengiz Pehlevan. Learning curves for sgd on structured features. arXiv preprint arXiv:2106.02713, 2021.
[77]
Carl M Bender and Steven A Orszag. Advanced mathematical methods for scientists and engineers I: Asymptotic methods and perturbation theory. Springer Science & Business Media, 2013.

  1. Two-point refers to the correlation function of two resolvents evaluated at different arguments, rather than the one point function which is a single resolvent and determines only the spectrum of the random matrix.↩︎

  2. Up to a change in signs and \(i\omega \to -z\) this is equivalent to the \(t\)-transform of a random matrix [75].↩︎

  3. The gap is caused by suboptimal scaling of the number of evaluation points \(K\) per context (usually \(K=1\) in prior works). Note that realistic LLMs receive multiple error signals per context (one per token).↩︎