Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

Matthias Gerstgrasser^{1} ^{2} , Rylan Schaeffer\(^*\), Apratim Dey\(^*\), Rafael Rafailov\(^*\), Dhruv Pai

Stanford University

`{mgerst,rschaef,apd1995,rafailov,dhruvpai}@stanford.edu`

Henry Sleight^{3} , John Hughes\(^{\ddag}\), Tomasz Korbak\(^{\ddag}\), Rajashree Agrawal\(^{\ddag}\)

Constellation

Andrey Gromov\(^{\S}\)

University of Maryland, College Park

Daniel A. Roberts\(^{\S}\)

MIT & Sequoia Capital

Diyi Yang^{4} , David L. Donoho\(^{\S}\) , & Sanmi Koyejo\(^{\S}\)

Stanford University

`{diyiy,donoho,sanmi}@stanford.edu`

April 01, 2024

The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops discovered
that such loops can lead to *model collapse*, a phenomenon where performance progressively degrades with each model-fitting iteration until the latest model becomes useless. However, several recent papers studying model collapse assumed that new
data *replace* old data over time rather than assuming data *accumulate* over time. In this paper, we compare these two settings and show that accumulating data prevents model collapse. We begin by studying an analytically tractable setup in
which a sequence of linear models are fit to the previous models’ predictions. Previous work showed if data are replaced, the test error increases linearly with the number of model-fitting iterations; we extend this result by proving that if data instead
accumulate, the test error has a finite upper bound independent of the number of iterations. We next empirically test whether accumulating data similarly prevents model collapse by pretraining sequences of language models on text corpora. We confirm that
replacing data does indeed cause model collapse, then demonstrate that accumulating data prevents model collapse; these results hold across a range of model sizes, architectures and hyperparameters. We further show that similar results hold for other deep
generative models on real data: diffusion models for molecule generation and variational autoencoders for image generation. Our work provides consistent theoretical and empirical evidence that data accumulation mitigates model collapse.

The advent of large-scale generative models such as GPT-4 [1], DALL-E [2] and Stable Diffusion [3] has revolutionized the field of artificial intelligence. These models, trained on vast web-scale datasets, exhibit remarkable capabilities in generating text, images, and other media [4], [5]. However, as these models become more widely used, an increasing amount of generated data populates the web. This raises a critical question: what are the consequences of training generative models on datasets containing their own outputs?

Recent studies have investigated this question, revealing that training generative models on their own outputs can cause the performance of such models to progressively degrade with each model-fitting iteration, eventually rendering newer models
useless. This phenomenon was consequently labeled *model collapse* [6]–[13] (see App. Sec 5 for summarization and discussion of prior work). Model collapse warns that democratizing access to generative models runs the risk
of polluting the very data necessary to train future iterations of generative models.

To theoretically understand model collapse, [13] recently studied an analytically tractable setup involving a sequence of linear
models, each trained on the predictions of the previous linear model, and demonstrated that the test error increases linearly with the number of iterations \(n\). Critically, this analysis assumed that each linear model
undergoes training exclusively on data from the single preceding iteration. This assumption that data does not accumulate was similarly made by other recent mathematical and empirical investigations of model collapse [8], [9]. However, this assumption might not match the real world, where both
human-generated and machine-generated data are continually added to existing data, and hence, training *solely* on the previous model’s generated data seems much less likely than training on combined and accumulating real and synthetic data.

In this work, we study what effect accumulating data has instead of replacing data. We begin with the analytically-tractable setup of [13] and prove that if data accumulates, the test error no longer grows linearly with the number of model-fitting iterations, but instead has a finite and relatively small upper bound independent of the number of model-fitting iterations. This finding suggests that data accumulation could serve as a robust solution for mitigating model collapse. We then test this conjecture on sequences of deep generative models trained on real data: causal transformers on text (Sec. 3.1), diffusion models on molecules (Sec. 3.2) and variational autoencoders on images (Sec. 3.3). After confirming that replacing data at every iteration indeed causes error to increase with the number of iterations, we test our conjecture that accumulating data prevents such model collapse. We empirically find that for all models and for all data modalities, accumulating synthetic data with real data prevents model collapse. Altogether, our work emphasizes the importance of considering data accumulation in the analysis of model collapse and suggests that accumulating data can be a robust approach for avoiding model collapse when training future generations of generative models on web-scale data.

We consider a linear regression setup, which, despite its simplicity, already gives useful insights into the cause and mitigation of model collapse. Our process is nearly identical to the setting introduced by [13], with the simple but critical modification that we assume data accumulate across model-fitting iterations. After gleaning insights from linear regression trained on toy data, we then carry our insights to deep generative models trained on real data (Sec. 3).

Define the distribution \(P_{\Sigma,w,\sigma^2}\) on \(\mathbb{R}^d \times \mathbb{R}\) by \((x, y) \sim P_{\Sigma,w,\sigma^2}\) iff: \[\begin{align} \text{(Input)} \quad & x \sim \mathcal{N}(0, \Sigma), \\ \text{(Noise)} \quad & \epsilon \sim \mathcal{N}(0, \sigma^2), \text{ independent of } x, \\ \text{(Label)} \quad & y = x \cdot w + \epsilon. \end{align}\] The positive integer \(d\) is the input-dimension, the matrix \(\Sigma \in \mathbb{R}^{d \times d}\) is the true covariance structure of the input \(x\), the vector \(w\) is the linear relationship used to generate the labels (with \(w = w^*\), the true relationship, for the original data) and the scalar \(\sigma\) is the level of label noise. We start at iteration \(n=1\) with \(T\) initial independent data points \((x_i, y_i)\) each following \(P_{\Sigma, w^*, \sigma^2}\), that is, \(y_i = x_i\cdot w^* + \epsilon_i\) for each \(i=1,2,\cdots, T\). We form the design matrix \(X\in\mathbb{R}^{T\times d}\) with \(x_1^\top,\cdots, x_T^\top\) as rows. We also form the vectors \(Y\) and \(E\) with \(i\)-th coordinate \(y_i\) and \(\epsilon_i\) respectively. In what follows, we will assume that \(X\) has full column rank, i.e., \(T\geq d\), \(X^\top X\) is invertible and the model is underparameterized.

We generate synthetic data from the following sequence of distributions

\[\begin{align} P_{\Sigma,w^*,\sigma^2} \to P_{\Sigma,\hat{w}_1,\sigma^2} \to \ldots \to P_{\Sigma,\hat{w}_n,\sigma^2}, \end{align}\] where \(n \in \mathbb{N}\) is the number of iterations. The scheme is outlined as follows.

For \(n=1\):

Accumulating Covariates/Features: \(\tilde{X}_1 \mathop{\mathrm{\stackrel{\text{def}}{\;=\;}}}X\)

Accumulating Targets: \(\tilde{Y}_1 \mathop{\mathrm{\stackrel{\text{def}}{\;=\;}}}\hat{Y}_1 \mathop{\mathrm{\stackrel{\text{def}}{\;=\;}}}Xw^* + E_1\), where \(E_1 \mathop{\mathrm{\stackrel{\text{def}}{\;=\;}}}E \sim \mathcal{N}(0, \sigma^2 I_T)\)

Fit linear model: \(\hat{w}_1 = \tilde{X}_1^{\dagger} \tilde{Y}_1\)

Make predictions for the next iteration: \(\hat{Y}_2 \mathop{\mathrm{\stackrel{\text{def}}{\;=\;}}}X\hat{w}_1 + E_2\), where \(E_2 \sim \mathcal{N}(0, \sigma^2 I_T)\)

For \(n \geq 2\):

Accumulating Covariates/Features: \(\tilde{X}_n \mathop{\mathrm{\stackrel{\text{def}}{\;=\;}}}[\tilde{X}_{n-1}; X] \in \mathbb{R}^{nT \times d}\) (rows stacked vertically)

Accumulating Targets: \(\tilde{Y}_n \mathop{\mathrm{\stackrel{\text{def}}{\;=\;}}}[\tilde{Y}_{n-1}; \hat{Y}_n] \in \mathbb{R}^{nT \times 1}\) (rows stacked vertically)

Fit linear model: \(\hat{w}_n \mathop{\mathrm{\stackrel{\text{def}}{\;=\;}}}\tilde{X}_n^{\dagger} \tilde{Y}_n\)

Make predictions for the next iteration: \(\hat{Y}_{n+1} \mathop{\mathrm{\stackrel{\text{def}}{\;=\;}}}X \hat{w}_n + E_{n+1}\), where \(E_{n+1} \sim \mathcal{N}(0, \sigma^2 I_T)\)

Here, for a matrix \(A\) with full column rank, \(A^\dagger=(A^\top A)^{-1}A^\top\) is the Moore-Penrose pseudo-inverse of \(A\). The noise terms \(E_1, E_2, \ldots, E_n\) are independent of each other and of the covariates/features. Since \(X\) has full column rank, so does \(\tilde{X}_n\) for every \(n\geq 1\).
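The scheme above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `accumulate_fit`, the seed and the toy sizes (\(T=40\), \(d=3\), \(\sigma=0.5\)) are hypothetical, and `np.linalg.lstsq` stands in for the pseudo-inverse solution.

```python
import numpy as np


def accumulate_fit(X, w_star, sigma, n_iters, rng):
    """Run the accumulating-data scheme; return fitted weights and noise draws."""
    T, _ = X.shape
    noises = [rng.normal(0.0, sigma, size=T)]  # E_1
    X_acc = X.copy()                           # X~_1 = X
    Y_acc = X @ w_star + noises[0]             # Y~_1 = X w* + E_1
    w_hats = []
    for _ in range(n_iters):
        # Least-squares fit on everything accumulated so far.
        w_hat, *_ = np.linalg.lstsq(X_acc, Y_acc, rcond=None)
        w_hats.append(w_hat)
        # Synthetic targets for the next iteration: X w_hat + fresh noise.
        E_next = rng.normal(0.0, sigma, size=T)
        noises.append(E_next)
        X_acc = np.vstack([X_acc, X])
        Y_acc = np.concatenate([Y_acc, X @ w_hat + E_next])
    # Only E_1..E_n influence w_hat_1..w_hat_n, so drop the last draw.
    return w_hats, noises[:n_iters]


# Hypothetical toy problem, chosen only for illustration.
rng = np.random.default_rng(0)
T, d = 40, 3
X = rng.normal(size=(T, d))
w_star = rng.normal(size=d)
w_hats, noises = accumulate_fit(X, w_star, sigma=0.5, n_iters=4, rng=rng)
```

Returning the noise draws alongside the fitted weights makes it easy to compare each \(\hat{w}_n\) against any closed-form expression in terms of \(E_1, \ldots, E_n\).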

We are interested in the dynamics of the test error \(E_{\text{test}}(\hat{w}_n)\) of this sequence of linear models \(\hat{w}_1, \hat{w}_2, \ldots\). Note that evaluation of the model is done on the true distribution \(P_{\Sigma,w^*,\sigma^2}\), even though the model is trained on the accumulated synthetic data. For any linear estimator \(\hat{w}\) computed from the training data, we measure test error in the standard way:

\[E_{test}(w) \mathop{\mathrm{\stackrel{\text{def}}{\;=\;}}}\mathbb{E}\left[(x_{test}^T w - y_{test})^2 \right] - \sigma^2 = \mathbb{E}[\|w - w^*\|_{\Sigma}^2]\] where the expectation is taken over the training data and \((x_{test}, y_{test})\sim P_{\Sigma,w^*,\sigma^2}\) independent of the training data.
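For completeness, the second equality in this definition can be seen by expanding the squared residual for a fixed \(w\). The sketch below assumes the weighted norm \(\|v\|_{\Sigma}^2 = v^\top \Sigma v\), which the text uses without stating explicitly, and writes \(y_{test} = x_{test}^\top w^* + \epsilon_{test}\) with \(\epsilon_{test}\) independent of \(x_{test}\): \[\begin{align} \mathbb{E}\left[(x_{test}^\top w - y_{test})^2\right] &= \mathbb{E}\left[(x_{test}^\top (w - w^*) - \epsilon_{test})^2\right] \\ &= (w - w^*)^\top \mathbb{E}[x_{test} x_{test}^\top] (w - w^*) - 2\,\mathbb{E}[\epsilon_{test}\, x_{test}^\top](w - w^*) + \mathbb{E}[\epsilon_{test}^2] \\ &= \|w - w^*\|_{\Sigma}^2 + \sigma^2. \end{align}\] The cross term vanishes because \(\epsilon_{test}\) has mean zero and is independent of \(x_{test}\). Subtracting \(\sigma^2\) and averaging over the training data gives the stated expression.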

Although we present our results in the context of ordinary linear regression in \(\mathbb{R}^d\), our analysis can be readily extended to ridge regression and the kernel setting [14]–[17]. We leave such contributions to future work.

Our goal is to establish an analytic formula for the test error of the \(n\)th model in the data accumulation setting. We begin by characterizing the relationship between the fitted linear parameters \(\hat{w}_n\) and the true parameters \(w^*\). We remind the reader that we assume that \(X\) has full column rank, i.e., \(X^\top X\) is invertible. Proofs are deferred to App. 7.

**Theorem 1**. *In the data accumulation setting, \(\forall n \geq 1\), the fitted linear parameters \(\hat{w}_n\) can be expressed as: \[\hat{w}_n = w^* + (X^\top X)^{-1} X^\top \left(\sum_{i=1}^n \frac{E_i}{i}\right)\] where, recall, \(w^*\) is the true parameter, \(X\) is the original design
matrix, and \(E_i\) is the extra noise added at the \(i\)’th iteration. *
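As a sanity check, the case \(n=2\) can be worked out by hand from the scheme. The following sketch uses only the definitions above, with \(\tilde{X}_2\) denoting \(X\) stacked on top of itself, so that \(\tilde{X}_2^\top \tilde{X}_2 = 2 X^\top X\) and \(\tilde{X}_2^\top \tilde{Y}_2 = X^\top(\hat{Y}_1 + \hat{Y}_2)\): \[\begin{align} \hat{Y}_1 + \hat{Y}_2 &= (Xw^* + E_1) + (X\hat{w}_1 + E_2) = 2Xw^* + E_1 + X(X^\top X)^{-1}X^\top E_1 + E_2, \\ X^\top(\hat{Y}_1 + \hat{Y}_2) &= 2X^\top X w^* + 2X^\top E_1 + X^\top E_2, \\ \hat{w}_2 &= \tfrac{1}{2}(X^\top X)^{-1} X^\top(\hat{Y}_1 + \hat{Y}_2) = w^* + (X^\top X)^{-1}X^\top\left(E_1 + \frac{E_2}{2}\right). \end{align}\] The first-iteration noise \(E_1\) enters twice (once directly and once through \(\hat{w}_1\)), and dividing by the doubled Gram matrix leaves it with weight \(1\), while the new noise \(E_2\) is halved.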

**Theorem 2**. *For an \(n\)-fold synthetic data generation process with \(T \geq d + 2\) samples per iteration and isotropic features (\(\Sigma
\mathop{\mathrm{\stackrel{\text{def}}{\;=\;}}}I_d\)), the test error for the ridgeless linear predictor \(\hat{w}_n\) learned on the accumulated data up to iteration \(n\) is given
by: \[E_{\text{test}}^{\text{Accum}}(\hat{w}_n) = \frac{\sigma^2 d}{T-d-1} \left(\sum_{i=1}^n \frac{1}{i^2} \right) \leq \frac{\sigma^2 d}{T-d-1} \times \frac{\pi^2}{6}\] where, recall, \(\sigma^2\) is the noise variance of the fake data generation process, \(d\) is the input dimension, and \(T\) is the number of samples (i.e., data points) added
per iteration. *

How does test error with accumulating data compare against test error with replacing data? Under otherwise identical assumptions, [13]
showed in the data-replacing setting that the test error is given by^{5}: \[E_{\text{test}}^{\text{Replace}}(\hat{w}_n) = \frac{\sigma^2 d}{T - d -1} \times \color{red}n \color{black}
\label{dohmatob2024thm4p1}\tag{1}\]

When data are replaced, the test error grows linearly with the number of iterations \(n\) (Fig. 2 top), with the rate of growth determined by a
noise-to-signal ratio: the amount of noise per dimension \(\sigma^2\) times the number of dimensions \(d\), adjusted by the (per-iteration) sample size \(T\). In contrast, when data accumulate, Theorem 2 shows the test error is upper bounded *regardless
of the number of iterations \(n\)*: \[E_{\text{test}}^{\text{Accum}}(\hat{w}_n) \leq \frac{\sigma^2 d}{T-d-1} \times \color{red}\frac{\pi^2}{6}\color{black}\]
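To make the contrast concrete, the two closed-form expressions can be evaluated side by side. The sketch below simply codes up the formulas stated above. The function names and the example values (\(\sigma^2 = 1\), \(d = 10\), \(T = 100\)) are hypothetical choices for illustration.

```python
import math


def prefactor(sigma2, d, T):
    """Shared prefactor sigma^2 d / (T - d - 1); the formulas assume T >= d + 2."""
    if T < d + 2:
        raise ValueError("formulas assume T >= d + 2")
    return sigma2 * d / (T - d - 1)


def replace_error(sigma2, d, T, n):
    """Test error after n iterations when data are replaced (linear in n)."""
    return prefactor(sigma2, d, T) * n


def accumulate_error(sigma2, d, T, n):
    """Test error after n iterations when data accumulate (partial sum of 1/i^2)."""
    return prefactor(sigma2, d, T) * sum(1.0 / i**2 for i in range(1, n + 1))


def accumulate_bound(sigma2, d, T):
    """Iteration-independent upper bound for the accumulating setting."""
    return prefactor(sigma2, d, T) * math.pi**2 / 6


# Illustrative values only; not taken from the paper's experiments.
for n in (1, 10, 100):
    print(n, replace_error(1.0, 10, 100, n), accumulate_error(1.0, 10, 100, n))
```

Both expressions coincide at \(n = 1\); afterwards the replace error keeps growing while the accumulate error stays below `accumulate_bound`.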

This striking difference can be intuitively explained by the way data is handled across iterations. In the data replacing setting, because previous data were discarded, the model fits to the noise that each iteration of generated data introduces, with no way to recognize and ignore the noise. But in the accumulating setting, because iteration \(i\) contributes \(1/i\) to the total amount of data, the errors from the \(i\)th iteration are suppressed proportional to \(1/i^2\) (due to squared error), preventing the test error from growing indefinitely. This suggests that accumulating generated data with real data can be a robust solution to model collapse.

To confirm the analytical results, we numerically simulate the setup. The numerics almost perfectly matched the analytics (Fig. 2): when data are replaced, the test error grows linearly with the number of iterations \(n\), with the prefactor set by the noise-to-signal ratio \(\sigma^2 d / (T - d -1)\) (Fig. 2 Top), but when data accumulate, the test error rapidly flatlines with the prefactor similarly set (Fig. 2 Bottom). For log test error and higher model-fitting iterations, see App. Fig. 9.
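A simulation of this kind might look like the following Monte Carlo sketch. It is not the authors' simulation code: the sizes (\(d=5\), \(T=50\), \(\sigma=1\), 1000 trials), the seed, and the function name `simulate` are hypothetical, features are taken to be isotropic (\(\Sigma = I_d\)), and the replace branch assumes the same design matrix \(X\) is reused at every iteration, mirroring the accumulating scheme.

```python
import numpy as np


def simulate(d, T, sigma, n_iters, n_trials, accumulate, seed=0):
    """Monte Carlo estimate of E||w_hat_n - w*||^2 for n = 1..n_iters (Sigma = I_d)."""
    rng = np.random.default_rng(seed)
    errors = np.zeros(n_iters)
    for _ in range(n_trials):
        X = rng.normal(size=(T, d))
        w_star = rng.normal(size=d)
        X_train = X
        Y_train = X @ w_star + rng.normal(0.0, sigma, size=T)
        for n in range(n_iters):
            w_hat, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)
            errors[n] += np.sum((w_hat - w_star) ** 2)
            # Labels for the next iteration come from the model just fitted.
            Y_new = X @ w_hat + rng.normal(0.0, sigma, size=T)
            if accumulate:
                X_train = np.vstack([X_train, X])
                Y_train = np.concatenate([Y_train, Y_new])
            else:
                # Assumption: the replace setting reuses the same design X.
                X_train, Y_train = X, Y_new
    return errors / n_trials


d, T, sigma, n_iters = 5, 50, 1.0, 5
pref = sigma**2 * d / (T - d - 1)
acc = simulate(d, T, sigma, n_iters, n_trials=1000, accumulate=True)
rep = simulate(d, T, sigma, n_iters, n_trials=1000, accumulate=False)
# Closed-form predictions for comparison.
acc_theory = pref * np.cumsum(1.0 / np.arange(1, n_iters + 1) ** 2)
rep_theory = pref * np.arange(1, n_iters + 1)
```

Plotting `acc` and `rep` against `acc_theory` and `rep_theory` gives a small-scale analogue of the comparison described above.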

The linear regression setup is useful for studying replacing data versus accumulating data because it displays the phenomenon of interest and is analytically tractable, but its applicability to deep generative models trained on real data is unclear. We thus test whether the same results qualitatively hold for causal transformers, diffusion models, and variational autoencoders, respectively, trained on real text, molecular, and image data. We find that for all models and all datasets, replacing data causes model collapse, whereas accumulating data prevents model collapse.

We first turn to causal transformers [19] trained on text data. We pretrain 9M parameter GPT-2 [20] and 12M, 42M and 125M parameter Llama2 [21] language models for a single epoch on TinyStories [18], a 470M token GPT-3.5/4-generated dataset of short stories at a kindergarten reading level. For each model-fitting iteration \(n \geq 2\), we sample a new dataset of the same size as TinyStories from the previous iteration’s language model and then either replace or concatenate the previous dataset with the newly generated dataset. In each model-fitting iteration, we then pretrain a newly initialized model on the replaced or concatenated dataset from the previous iteration. We experiment with sampling the new datasets using temperatures \(0.3\) or \(1.0\). We chose this combination of architectures, scales, dataset, and sampling because the setup necessitates pretraining multiple iterations of language models (a computationally costly endeavor), but we also wish to study realistic conditions where generative models are high-performing and generate diverse outputs. Because small language models (below 10M parameters) pretrained on TinyStories were shown to be able to generate coherent-albeit-simple English sentences, this choice of architectures, scales, dataset and temperature hopefully strikes a good balance between being representative, being diverse and being computationally feasible.
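The replace-versus-concatenate loop described above can be sketched generically. This is an illustrative outline rather than the actual training pipeline: `run_iterations`, `train_fn` and `sample_fn` are hypothetical names, and the toy stand-ins at the bottom exist only so the sketch runs without a language model.

```python
from typing import Callable, List, Sequence


def run_iterations(
    real_data: Sequence[str],
    train_fn: Callable[[Sequence[str]], object],
    sample_fn: Callable[[object, int], List[str]],
    n_iters: int,
    accumulate: bool,
) -> List[int]:
    """Return the training-set size used at each model-fitting iteration."""
    dataset = list(real_data)
    sizes = []
    for _ in range(n_iters):
        model = train_fn(dataset)  # a newly initialised model every iteration
        sizes.append(len(dataset))
        # Sample a synthetic dataset of the same size as the original one.
        synthetic = sample_fn(model, len(real_data))
        # Either concatenate with everything so far, or keep only the new samples.
        dataset = dataset + synthetic if accumulate else synthetic
    return sizes


# Toy stand-ins (hypothetical): "training" just records the dataset size.
toy_train = lambda data: len(data)
toy_sample = lambda model, k: [f"sample-{i}" for i in range(k)]
real = ["story"] * 4
replace_sizes = run_iterations(real, toy_train, toy_sample, 3, accumulate=False)
accumulate_sizes = run_iterations(real, toy_train, toy_sample, 3, accumulate=True)
```

Passing the training and sampling steps in as callables keeps the data-handling logic, which is the only thing that differs between the two regimes, separate from model-specific code.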

Model | Iteration | Sample Generation |
---|---|---|
Llama2 (125M) | 3 (A) | In the end, the crab found a smooth shell. He took it to a safe place under a tree. The crab put the shell where he found it. Tim and his mom were tired, but they were happy. They had a fun day at the beach. And they lived happily ever after. The end. |
Llama2 (125M) | 3 (R) | Henry asked his Mom why the golf sounded so special. His Mom explained that the line of lumber had something special that would help. She said that if you’re not sure, the lumber is special. |
Llama2 (125M) | 8 (R) | Friend Stan and Millie laughed together and prepared to spend the morning together. Mamaing Grandma’s possibilitant, twice would measure how much she lovedk. Everyone started to get ready when they started arguing until their mum upset. |
GPT2 (9M) | 5 (A) | Jack was so happy that he took care of the honey. He thought, "I care about the beautiful garden, because it is nice and clean." He started to feed the flower every day. The flower grew bigger and taller, and Jack became very happy. |
GPT2 (9M) | 5 (R) | After playing, Lily got tired and quickly ran back to playing with her dolls. She opened her eyes and played with her dolls all day long. Her grandma was so happy that she screamed as she watched her look back at her original clothes and laughed. |
GPT2 (9M) | 10 (R) | When she finished eating it, she tasted it all up. She said goodbye to her mom and said goodbye. Mommy smiled, feeling very proud of her. It was other. She knew that sharing is always easy to share her meal with her mom. |

We found that for all architectures, parameter counts, and sampling temperatures, as the number of model-fitting iterations increased, replacing data led to an increase in test cross entropy (Fig. 3 top). We also found that for all architectures, parameter counts and sampling temperatures, as the number of model-fitting iterations increased, accumulating data led to equal-or-lower test cross entropy (Fig. 3 bottom). Lower temperature (0.3) led to a faster increase in test error than higher temperature (1.0) (App. Fig. 10), but the trend was consistent for both temperatures. Table 1 shows samples of generated texts for GPT2 (9M) and Llama2 (125M) models at model-fitting iterations 3-5 when both accumulating and replacing data, as well as iterations 8-10 (replacing only).

We ablate for several additional potential confounds, in addition to generation temperature. First, when accumulating data, subsequent model iterations are trained on larger datasets than when replacing data. To control for this, we also perform experiments in which data is replaced, but the size of the (fully synthetic) dataset is grown to match the training set size in the accumulation regime. We find that model performance still degrades (albeit at a lower rate). This is shown in App. 8, Table 2, right-most column. Second, a possible concern could be that degrading performance when replacing data could be due to low model performance in iteration 1 (and thus the quality of the first synthetic dataset). We control for this by varying the amount of training performed in iteration 1 only, and find that this has no significant impact. Lastly, we find that our results are also consistent across varying dataset sizes and training epochs. These ablations are discussed in App. 6.

**Experiments** We next turn to diffusion models trained on molecular data. We used GeoDiff [22], a geometric diffusion model for
molecular conformation generation, trained on the GEOM-Drugs [23] dataset. We down-sample the training split of GEOM-Drugs to \(40,000\) molecular conformations, which we use as our initial training set, and perform \(50\) diffusion steps for each prediction.

**Results** Over \(8\) model-fitting iterations, we find that test loss increases when replacing data, matching our language model experiments, and that performance stays relatively constant when
accumulating data (Fig. 5). Unlike with language models, we found that while performance when replacing data worsens significantly in the first model-fitting iteration trained on synthetic data, it does not degrade
substantially further in subsequent iterations.

We lastly evaluate the training dynamics of variational autoencoders (VAEs) [24], [25] trained on image data. We choose CelebA [26], a widely-used dataset of 200k human faces split between train and test sets, as a balance between being a realistic dataset with many samples, color images and resolution, and computational feasibility of training multiple iterations of models on accumulating data. We discuss more details of our experiments in Appendix 9.

We find that replacing data at each iteration indeed causes model collapse: the test error rises swiftly with each additional iteration, and each iteration yields lower quality and less diverse generated faces until all model generations represent a single mode, as shown in the left panel of Figure 7. In contrast, accumulating data at each iteration significantly slows model collapse: the test error increases significantly more slowly with each additional iteration. While the diversity of generations does go down, as seen by comparing the middle and right panels of Fig. 7, the generations still represent major axes of variation in the dataset, such as gender, but no longer seem to include details along more minor axes of the data manifold, such as glasses and accessories. We discuss further analysis of VAE reconstructions in Appendix 9.

Interestingly, unlike language modeling, the test error of accumulating data does increase with the number of iterations (albeit much more slowly than with replacing data). We also note that [7] found slightly contradictory evidence, specifically that a different architecture on a much smaller dataset exhibits fast deterioration of performance even with accumulating data. Understanding under what conditions and why these discrepancies exist is an interesting direction we leave for future research.

This work explored the phenomenon of model collapse, an important concern as AI-generated content permeates the internet and finds its way into future training datasets. Prior work has shown that training on model outputs can lead to degraded
performance [6], [8]–[10], [13], implying that future model training faces a difficult challenge of ensuring strict
training dataset hygiene. Our findings extend these prior works to show that if data *accumulates* and models train on a mixture of “real” and synthetic data, model collapse no longer occurs.

We show this both theoretically for linear regression, as well as in experiments on causal transformers for language modeling, diffusion models for molecule generation, and variational auto-encoders on image data.

Taken together, these results strongly suggest that the “curse of recursion” we face is not as dire as had been feared. Nevertheless, many questions worth investigating remain. For instance, in future work we would like to explore different data generation and accumulation regimes, such as additional “real” data being introduced in each model-fitting iteration and different schedules of how much synthetic data is generated at each iteration.

A growing body of recent work has investigated the phenomenon of iteratively training models on data generated by previous models, e.g., [6]–[13] and (in a slightly different context) [27]. [6] and [10] conducted experiments replacing real training data with generated data at each iteration, assuming that the dataset size remains fixed over time. They found that this iterative retraining procedure can lead to model degradation if the proportion of synthetic data becomes too high. Similarly, [8] ran experiments with Gaussian mixture models, VAEs, and language models in which the total number of samples per iteration was held constant, and the samples always originated with the previous model rather than aggregating over time. Building on this work, [9] considered three treatments of data: fully replacing real data with synthetic data, augmenting a fixed real dataset with additional synthetic data, and mixing new real data with synthetic data at each iteration. In almost all of their experiments, they drew a fixed size dataset from the most recent model at each iteration, without accumulating data. [11] also assumed that dataset size and mixing proportions are constant over time in their theoretical stability analysis and empirical validation. [13] also theoretically analyzed a simplified setting where a sequence of linear regression models are fit to previous predictions, characterizing why collapse occurs if the data are replaced at each iteration.

The two papers we found that considered accumulating data are [7] and [9]. [9] did so in one-half of one experiment: StyleGAN2 trained on FlickrFaces 128\(\times\)128 (App. Fig. 8). The authors concluded that accumulating data does not prevent model collapse, but merely slows it down. However, we believe that a closer examination of their results (App. Fig. 8) reveals that accumulating data causes the test error to plateau at a relatively low value as the number of model-fitting iterations increases. This result would support our conclusion that accumulating data prevents model collapse and does not merely delay it. The results from [7] are harder to evaluate; the paper provides scarce methods, no quantitative losses or metrics, and model collapse only seems to occur when the amount of synthetic data added per model-fitting iteration is 2\(\times\) the total amount of accumulated data. We think understanding under what conditions and why these discrepancies exist is an interesting question.

We point out a lemma useful to prove Theorem 2.

**Lemma 1**. *Let \(T\) and \(d\) be positive integers with \(T \geq d + 2\), and let \(X \in \mathbb{R}^{T
\times d}\) be a random matrix with i.i.d. rows from \(\mathcal{N}(0, \Sigma)\) with \(\Sigma\) positive definite. Then, \(X\) has full rank a.s.
Moreover, it holds that: \[\mathbb{E}_X[(X^\top X)^{-1}] = \frac{1}{T - d - 1} \Sigma^{-1}.\] *

*Proof.* See [13]. ◻
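Although the proof is deferred to [13], the identity in Lemma 1 is easy to check numerically. The following Monte Carlo sketch does so for the isotropic case \(\Sigma = I_d\). The function name `mean_inverse_gram`, the sizes \(T = 20\), \(d = 3\), the seed and the trial count are hypothetical choices.

```python
import numpy as np


def mean_inverse_gram(T, d, n_trials, seed=0):
    """Average (X^T X)^{-1} over random X with i.i.d. N(0, I_d) rows."""
    rng = np.random.default_rng(seed)
    total = np.zeros((d, d))
    for _ in range(n_trials):
        X = rng.normal(size=(T, d))
        total += np.linalg.inv(X.T @ X)
    return total / n_trials


T, d = 20, 3
estimate = mean_inverse_gram(T, d, n_trials=4000)
# Lemma 1 with Sigma = I_d predicts I_d / (T - d - 1).
expected = np.eye(d) / (T - d - 1)
max_abs_gap = np.max(np.abs(estimate - expected))
```

With enough trials, `max_abs_gap` should be small relative to the diagonal value \(1/(T-d-1)\).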

Assuming Lemma 1 and Theorem 1, we present the proof of Theorem 2.

*Proof of Theorem 2.* From Theorem 1, we have: \[\hat{w}_n = w^* + (X^\top X)^{-1} X^\top \left(\sum_{i=1}^n \frac{E_i}{i}\right)\] where \(w^*\) is the true
parameter, \(X\) is the original data matrix, and \(E_i\) are the noise terms at each iteration, with \(E_i \sim \mathcal{N}(0, \sigma^2 I_T)\). The test
error is given by: \[E_{\text{test}}(\hat{w}_n) = \mathbb{E}[||\hat{w}_n - w^*||_{\Sigma}^2]\]where the expectation is taken over all random quantities involved.

Substituting \(\hat{w}_n\) into the test error expression and using the fact that \(\Sigma \mathop{\mathrm{\stackrel{\text{def}}{\;=\;}}}I_d\), we get: \[\begin{align} E_{\text{test}}(\hat{w}_n) &= \mathbb{E}\left[\left(\sum_{i=1}^n \frac{E_i}{i}\right)^\top X(X^\top X)^{-2} X^\top \left(\sum_{i=1}^n \frac{E_i}{i}\right)\right] \\ &= \mathbb{E}\left[\sum_{i=1}^n \frac{\sigma^2}{i^2} \text{tr}(X(X^\top X)^{-2} X^\top)\right] \\ &= \sum_{i=1}^n \frac{\sigma^2}{i^2} \mathbb{E}\left[\text{tr}((X^\top X)^{-1})\right] \end{align}\]
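The second and third equalities compress two standard steps, spelled out here as a sketch. Write \(M \mathop{\mathrm{\stackrel{\text{def}}{\;=\;}}}X(X^\top X)^{-2} X^\top\) (a shorthand introduced only for this remark). Since the \(E_i\) are independent of each other and of \(X\), with \(E_i \sim \mathcal{N}(0, \sigma^2 I_T)\), conditioning on \(X\) gives \[\begin{align} \mathbb{E}\left[E_i^\top M E_j \mid X\right] = \text{tr}\left(M\, \mathbb{E}[E_j E_i^\top]\right) = \begin{cases} \sigma^2\, \text{tr}(M), & i = j \\ 0, & i \neq j \end{cases} \end{align}\] so all cross terms vanish and only the diagonal terms, weighted by \(1/i^2\), survive. The cyclic property of the trace then gives \[\text{tr}\left(X(X^\top X)^{-2} X^\top\right) = \text{tr}\left((X^\top X)^{-2} X^\top X\right) = \text{tr}\left((X^\top X)^{-1}\right).\]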

Using Lemma 1, we have: \[\mathbb{E}_{X}\left[\text{tr}((X^\top X)^{-1})\right] = \frac{d}{T-d-1}\]

Therefore, the test error for ridgeless regression with isotropic features in the data accumulation setting is: \[\begin{align} E_{\text{test}}(\hat{w}_n) &= \sum_{i=1}^n \frac{\sigma^2}{i^2} \cdot \frac{d}{T-d-1} < \frac{\sigma^2 d}{T-d-1} \left(\frac{\pi^2}{6}\right) \end{align}\] as \(\sum_{i=1}^n i^{-2} < \lim_{n \to \infty} \sum_{i=1}^n i^{-2} = \pi^2/6\). ◻

Finally, we prove Theorem 1.

*Proof of Theorem 1.* We prove this theorem by induction.

**Base case:** For \(n=1\), we have: \[\begin{align}
\hat{w}_1 &= \tilde{X}_1^{\dagger} \tilde{Y}_1 = (X^\top X)^{-1} X^\top (Xw^* + E_1) = w^* + (X^\top X)^{-1} X^\top E_1
\end{align}\] which matches the statement of the theorem for \(n=1\).

**Inductive step:** Assume that for some \(n \geq 1\), we have:

\[\hat{w}_n = w^* + (X^\top X)^{-1} X^\top \left(\sum_{i=1}^n \frac{E_i}{i}\right)\]

Now, consider \(\hat{w}_{n+1}\): \[\begin{align} \hat{w}_{n+1} &= \tilde{X}_{n+1}^{\dagger} \tilde{Y}_{n+1} \\ &= (\tilde{X}_{n+1}^\top \tilde{X}_{n+1})^{-1} \tilde{X}_{n+1}^\top \tilde{Y}_{n+1} \\ &= \frac{1}{n+1}(X^\top X)^{-1} \sum_{i=1}^{n+1} X^\top \hat{Y}_i \end{align}\]

Recalling that \(\hat{Y}_i\): \[\begin{align} \hat{Y}_i = \begin{cases} X w^* + E_1, & i = 1 \\ X \hat{w}_{i-1} + E_i, & 2 \leq i \leq n+1 \end{cases} \end{align}\]

Substituting this back into the expression for \(\hat{w}_{n+1}\): \[\begin{align} \hat{w}_{n+1} &= \frac{1}{n+1}(X^\top X)^{-1} \left(X^\top (X w^* + E_1) + \sum_{i=2}^{n+1} X^\top (X\hat{w}_{i-1} + E_i)\right) \\ &= \frac{1}{n+1}(X^\top X)^{-1} \left(X^\top Xw^* + X^\top E_1 + \sum_{i=2}^{n+1} (X^\top X\hat{w}_{i-1} + X^\top E_i)\right) \\ &= \frac{1}{n+1}(X^\top X)^{-1} \left(X^\top Xw^* + X^\top E_1 + \sum_{i=1}^{n} (X^\top X\hat{w}_i + X^\top E_{i+1})\right) \\ &= \frac{1}{n+1}(X^\top X)^{-1} \left(X^\top X w^* + \sum_{i=1}^{n} X^\top X\hat{w}_i + \sum_{i=1}^{n+1} X^\top E_i\right) \end{align}\]

Now, using the induction hypothesis: \[\begin{align} \hat{w}_{n+1} &= \frac{1}{n+1}(X^\top X)^{-1} \left(X^\top Xw^* + \sum_{i=1}^{n} X^\top X\left(w^* + (X^\top X)^{-1} X^\top \sum_{j=1}^i \frac{E_j}{j}\right) + \sum_{i=1}^{n+1} X^\top E_i\right) \\ &= \frac{1}{n+1}(X^\top X)^{-1} \left((n+1)X^\top Xw^* + \sum_{i=1}^{n} X^\top X(X^\top X)^{-1} X^\top \sum_{j=1}^i \frac{E_j}{j} + \sum_{i=1}^{n+1} X^\top E_i\right) \\ &= w^* + \frac{1}{n+1}(X^\top X)^{-1} \left(\sum_{i=1}^{n} X^\top \sum_{j=1}^i \frac{E_j}{j} + \sum_{i=1}^{n+1} X^\top E_i\right) \\ &= w^* + \frac{1}{n+1}(X^\top X)^{-1} X^\top \left(\sum_{i=1}^{n} \sum_{j=1}^i \frac{E_j}{j} + \sum_{i=1}^{n+1} E_i\right) \end{align}\]

Now, we need to simplify the term \(\sum_{i=1}^{n} \sum_{j=1}^i \frac{E_j}{j} + \sum_{i=1}^{n+1} E_i\). We can do this by collecting the coefficient of each \(E_i\). \(E_1\) appears \(n\) times in the double sum, each time with weight \(1\), and once in the single sum, so its coefficient is \(n + 1 = \frac{n+1}{1}\). \(E_2\) appears \(n-1\) times in the double sum, each time with weight \(\frac{1}{2}\), and once in the single sum, so its coefficient is \(\frac{n-1}{2} + 1 = \frac{n+1}{2}\). This continues until we reach \(E_n\), which appears once in the double sum with weight \(\frac{1}{n}\) and once in the single sum, so its coefficient is \(\frac{1}{n} + 1 = \frac{n+1}{n}\). \(E_{n+1}\) appears only once, in the single sum, so its coefficient is \(1 = \frac{n+1}{n+1}\). Therefore, \[\begin{align} \sum_{i=1}^{n} \sum_{j=1}^i \frac{E_j}{j} + \sum_{i=1}^{n+1} E_i &= \sum_{i=1}^{n+1} \left(\frac{n+1-i}{i} + 1\right) E_i = (n+1) \sum_{i=1}^{n+1} \frac{E_i}{i} \end{align}\]
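The coefficient of each noise term in this sum can also be checked mechanically with exact rational arithmetic. The sketch below (with the hypothetical helper name `noise_coefficients`) tallies how often each \(E_j\) appears and with what weight, and confirms that the total weight on \(E_j\) is \(\frac{n+1}{j}\) for small \(n\).

```python
from fractions import Fraction


def noise_coefficients(n):
    """Coefficient of E_j (j = 1..n+1) in sum_{i<=n} sum_{j<=i} E_j/j + sum_{i<=n+1} E_i."""
    coeffs = {j: Fraction(0) for j in range(1, n + 2)}
    for i in range(1, n + 1):      # double sum: E_j / j for every j <= i
        for j in range(1, i + 1):
            coeffs[j] += Fraction(1, j)
    for i in range(1, n + 2):      # single sum: each E_i once with weight 1
        coeffs[i] += 1
    return coeffs


# Every coefficient should equal (n + 1) / j.
for n in range(1, 8):
    coeffs = noise_coefficients(n)
    assert all(coeffs[j] == Fraction(n + 1, j) for j in coeffs)
```

Using `Fraction` avoids floating-point round-off, so the equality check is exact.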

Substituting this back into the expression for \(\hat{w}_{n+1}\): \[\begin{align} \hat{w}_{n+1} &= w^* + \frac{1}{n+1}(X^\top X)^{-1} X^\top \left((n+1) \sum_{i=1}^{n+1} \frac{E_i}{i}\right) \\ &= w^* + (X^\top X)^{-1} X^\top \sum_{i=1}^{n+1} \frac{E_i}{i} \end{align}\]

Therefore, by mathematical induction, the theorem holds for all \(n \geq 1\). ◻

Model training was implemented using Hugging Face Transformers [28]. Dataset generation was implemented using vLLM [29].

In addition to the experiments shown in the main paper, we conducted several ablation studies.

One possible concern is that when accumulating data, the train dataset size will grow at each model-fitting iteration, meaning subsequent models will be trained on more aggregate data than their counterparts in the replacement regime. To control for this, we run experiments in a “replace-multiple” regime: we create a fully synthetic dataset at the end of each model-fitting iteration, but grow the size of this dataset to match that of the accumulated data in the accumulation regime. Table 2 rightmost column shows that in this regime, evaluation loss still increases over model-fitting iterations.
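The difference between the three regimes comes down to how large the training set is at each iteration and how much of it is real. The sketch below tabulates this under the simplifying assumption that every iteration adds exactly one base dataset's worth of samples. The function name `training_set_sizes` and the regime labels are hypothetical.

```python
def training_set_sizes(regime, base_size, n_iters):
    """Return (total size, real-data size) per iteration for a given regime.

    regime: "replace", "accumulate", or "replace-multiple".
    """
    sizes = []
    for n in range(1, n_iters + 1):
        if regime == "replace":
            # Fixed size; only the first iteration sees real data.
            total = base_size
            real = base_size if n == 1 else 0
        elif regime == "accumulate":
            # Grows by one base dataset per iteration; real data is kept.
            total = n * base_size
            real = base_size
        elif regime == "replace-multiple":
            # Same size as the accumulated data, but fully synthetic after n = 1.
            total = n * base_size
            real = base_size if n == 1 else 0
        else:
            raise ValueError(f"unknown regime: {regime}")
        sizes.append((total, real))
    return sizes


for regime in ("replace", "accumulate", "replace-multiple"):
    print(regime, training_set_sizes(regime, base_size=100, n_iters=4))
```

Comparing "accumulate" with "replace-multiple" isolates the effect of retaining earlier data from the effect of simply training on more samples.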

Most of our language model experiments were run with sampling temperature \(1.0\) during generation of new datasets. To ensure that this choice is not critical, we also run one experiment with temperature \(0.3\), and see that this shows similar results (with even larger increases in validation loss in the replacement regime than temperature \(1.0\)), as shown in Table 2, row 2, and Figure 10.

We similarly vary the size of the initial (and subsequent) training datasets and number of training epochs, and see that this has no qualitative effect on the results (Table 2, rows 3 & 4 show training on 1/5th of the TinyStories dataset for 1 & 3 epochs, respectively).

Finally, we control specifically for model (and thus synthetic dataset) quality after the first iteration, to rule out an undue influence of a “bad” first synthetic dataset on subsequent training. Figure 11 shows performance in subsequent iterations for different amounts of training in the first iteration, showing no qualitative differences.

Model | t=1 | t=4 (acc) | t=4 (repl) | t=10 (repl) | t=4 (replace-multiple) |
---|---|---|---|---|---|
GPT-2 (9M) | 1.82 | 1.74 (-0.07) | 2.39 (+0.58) | 2.91 (+1.09) | 2.18 (+0.36) |
GPT-2 (9M) (temp=0.3) | 1.82 | 1.75 (-0.06) | 5.82 (+4.00) | 9.85 (+8.04) | n/a |
GPT-2 (9M) (small dataset) | 2.56 | 2.28 (-0.28) | 3.21 (+0.65) | 3.72 (+1.16) | 2.91 (+0.35) |
GPT-2 (9M) (small dataset, 3 epochs) | 1.99 | 1.87 (-0.12) | 2.62 (+0.63) | n/a | n/a |
Llama-2 (12M) | 2.06 | 1.94 (-0.12) | 2.72 (+0.66) | n/a | n/a |
Llama-2 (42M) | 1.90 | 1.76 (-0.14) | 2.52 (+0.62) | n/a | n/a |
Llama-2 (126M) | 1.71 | 1.59 (-0.12) | 2.23 (+0.53) | n/a | n/a |

As pre-processing, we crop and down-sample the images to 64x64 pixels. We use a standard convolutional architecture for the VAE model, consisting of 5 convolutional layers with 32, 64, 128, 256, and 512 channels, respectively, and a similar convolutional decoder structure. The latent space is a 128-dimensional isotropic Gaussian, represented by 2 MLP layers. Each data iteration consists of 100 training epochs, after which we generate 163K new training images by sampling latents from the Gaussian prior and then passing them through the generator model.

Figure 12 shows reconstructions after replacing (left) and accumulating (center) data, compared to baseline (right). Analyzing the reconstruction of test set images also reveals interesting findings: the model trained only on data from the prior iteration has indeed collapsed and cannot represent any classes other than the single mode it generates. Interestingly, the model trained on aggregated data still maintains its capabilities and generates accurate reconstructions, including smaller details such as glasses and hats. We hypothesize that this model maintains its generative capabilities, but that these details become a more minor sub-manifold in the latent space, which is realigned with the newly generated data; this would explain why they appear less often in the generated images, which use samples from the prior.

[1]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical
report. *arXiv preprint arXiv:2303.08774*, 2023.

[2]

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. *arXiv preprint arXiv:2204.06125*, 1 (2):
3, 2022.

[3]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In *Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition*, pp. 10684–10695, 2022.

[4]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. *Advances in neural information processing systems*, 33: 1877–1901, 2020.

[5]

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic
text-to-image diffusion models with deep language understanding. *Advances in neural information processing systems*, 35: 36479–36494, 2022.

[6]

Ryuichiro Hataya, Han Bao, and Hiromi Arai. Will large-scale generative models corrupt future datasets? In *Proceedings of the IEEE/CVF International Conference on Computer
Vision*, pp. 20555–20565, 2023.

[7]

Gonzalo Martínez, Lauren Watson, Pedro Reviriego, José Alberto Hernández, Marc Juarez, and Rik Sarkar. Combining generative artificial intelligence
(ai) and the internet: Heading towards evolution or degradation? *arXiv preprint arXiv:2303.01255*, 2023.

[8]

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. The curse of recursion: Training on generated data makes models forget. *arXiv preprint
arXiv:2305.17493*, 2023.

[9]

Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, and Richard G Baraniuk. Self-consuming generative models go
mad. *arXiv preprint arXiv:2307.01850*, 2023.

[10]

Gonzalo Martínez, Lauren Watson, Pedro Reviriego, José Alberto Hernández, Marc Juarez, and Rik Sarkar. Towards understanding the interplay of
generative artificial intelligence and the internet. *arXiv preprint arXiv:2306.06130*, 2023.

[11]

Quentin Bertrand, Avishek Joey Bose, Alexandre Duplessis, Marco Jiralerspong, and Gauthier Gidel. On the stability of iterative retraining of generative models on their own data.
*arXiv preprint arXiv:2310.00429*, 2023.

[12]

Martin Briesch, Dominik Sobania, and Franz Rothlauf. Large language models suffer from their own output: An analysis of the self-consuming training loop. *arXiv preprint
arXiv:2311.16822*, 2023.

[13]

Elvis Dohmatob, Yunzhen Feng, and Julia Kempe. Model collapse demystified: The case of regression. *arXiv preprint arXiv:2402.07712*, 2024.

[14]

Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. *Foundations of Computational Mathematics*, 7: 331–368, 2007.

[15]

James B Simon, Madeline Dickens, and Michael Deweese. Neural tangent kernel eigenvalues accurately predict generalization. 2021.

[16]

Hugo Cui, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová. Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime.
*Advances in Neural Information Processing Systems*, 34: 10131–10143, 2021.

[17]

Alexander Wei, Wei Hu, and Jacob Steinhardt. More than a toy: Random matrix models predict how real-world neural representations generalize. In *International Conference on Machine
Learning*, pp. 23549–23588. PMLR, 2022.

[18]

Ronen Eldan and Yuanzhi Li. Tinystories: How small can language models be and still speak coherent english? *arXiv preprint arXiv:2305.07759*, 2023.

[19]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. *Advances in
neural information processing systems*, 30, 2017.

[20]

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. *OpenAI blog*, 1 (8): 9, 2019.

[21]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open
foundation and fine-tuned chat models. *arXiv preprint arXiv:2307.09288*, 2023.

[22]

Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, and Jian Tang. Geodiff: A geometric diffusion model for molecular conformation generation. *arXiv preprint
arXiv:2203.02923*, 2022.

[23]

Simon Axelrod and Rafael Gomez-Bombarelli. Geom, energy-annotated molecular conformations for property prediction and molecular generation. *Scientific Data*, 9 (1): 185,
2022.

[24]

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. *arXiv preprint arXiv:1312.6114*, 2013.

[25]

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In *International conference on machine
learning*, pp. 1278–1286. PMLR, 2014.

[26]

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In *Proceedings of International Conference on Computer Vision (ICCV)*, December
2015.

[27]

Rohan Taori and Tatsunori Hashimoto. Data feedback loops: Model-driven amplification of dataset biases. In *International Conference on Machine Learning*, pp. 33883–33920. PMLR,
2023.

[28]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s
transformers: State-of-the-art natural language processing. *arXiv preprint arXiv:1910.03771*, 2019.

[29]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving
with pagedattention. In *Proceedings of the 29th Symposium on Operating Systems Principles*, pp. 611–626, 2023.

Denotes equal authorship.

Harvard School of Engineering and Applied Sciences & Stanford CS

Denotes equal contribution.

Denotes equal contribution.

For notational simplicity, we assume that [13]’s \(T_0 \stackrel{\text{def}}{=} T\) and \(\sigma_0 \stackrel{\text{def}}{=} \sigma\).