October 23, 2025
Large language models (LLMs) remain broadly open and highly steerable: they imitate at scale, accept arbitrary system prompts, and readily adopt multiple personae. By analogy to human development, we hypothesize that progress toward artificial general intelligence (AGI) involves a lock-in phase: a transition from open imitation to identity consolidation, in which goal structures, refusals, preferences, and internal representations become comparatively stable and resistant to external steering. We formalize this phase, link it to known phenomena in learning dynamics, and propose operational metrics for onset detection. Experimentally, we demonstrate that while the behavioral consolidation is rapid and non-linear, its side-effects on general capabilities are not monolithic. Our results reveal a spectrum of outcomes—from performance trade-offs in small models, through largely cost-free adoption in mid-scale models, to transient instabilities in large, quantized models. We argue that such consolidation is a prerequisite for AGI-level reliability and also a critical control point for safety: identities can be deliberately engineered for reliability, yet may also emerge spontaneously during scaling, potentially hardening unpredictable goals and behaviors.
Children learn first by imitating: copying words, gestures, and norms from caregivers and peers. Over time, influence becomes bidirectional—children explore, parents guide—but the child remains an open book: easy to question, redirect, and reframe. Around adolescence, two shifts become salient. First, the black box closes a bit as internal reasoning becomes more private and goals more internally coherent. Second, a more enduring identity takes shape. Across decades, personality shows increasing rank-order stability, even as individuals continue to mature [1]–[4]. This psychological arc is paralleled in neuroscience, where adolescence marks a second window of heightened plasticity, characterized by large-scale synaptic remodeling and pruning that ultimately supports more stable, expert-like control [5]–[8].
State-of-the-art LLMs are pre-adolescent by this analogy. They imitate at planetary scale and can be steered by prompts, alignment objectives, or direct activation edits, readily adopting new personae. These are virtues—breadth, helpfulness, pliability—but indefinite openness is unlikely to yield the reliability, agency, and durable preferences expected of an AGI. This motivates our central thesis, the Lock-In Phase Hypothesis: capable systems will pass through a consolidation regime in which internal structure and outward behavior become persistent. The basic idea is already visible in instruction tuning, where consolidating a model into a general instruction-follower substantially improves zero-shot generalization [9]–[11].
While the benefits of a consolidated identity are known, the dynamics of consolidation remain poorly characterized. Recent work on latent safety traits suggests that today’s models often occupy a pre-identity, highly steerable phase, and that attempts to measure dispositions are confounded by situational awareness [12], [13]. Our contribution moves beyond observing consolidated outcomes to measuring the process.
This paper makes three primary contributions. First, we formalize the Lock-In Phase Hypothesis, connecting it to phase transitions and critical periods in learning systems. Second, we provide the first empirical characterization of the side-effects of consolidation, showing that its interaction with general reasoning is strongly dependent on model capacity and computational constraints (e.g., quantization). Third, we demonstrate a spectrum of consolidation dynamics across two model families (Gemma and Llama): costly performance reallocation in small models, largely stable adoption in mid-scale models, and transient instabilities in large, quantized models—while the consolidation itself remains rapid and measurable in both internal representations and external behavior.
Our hypothesis intersects several established threads in machine learning. The discourse on emergent abilities [14], while debated [15], [16], motivates searching for regimes where qualitative reorganizations occur. In parallel, work on grokking and representational phase transitions suggests that thresholds resembling consolidation can appear rather than purely smooth scaling [17]–[19].
Deep networks exhibit critical learning periods in which early exposures disproportionately shape later representations, with plasticity declining thereafter [20], [21]. The stability–plasticity trade-off is formalized in continual learning methods such as Elastic Weight Consolidation (EWC), which preserve parameters important to prior tasks and thereby enable consolidation [22].
Alignment and persona control provide direct evidence of behavioral hardening. Large-scale instruction tuning improves zero-shot generalization by consolidating identity into a reliable instruction-follower [9]–[11]. More targeted techniques—Constitutional AI and Direct Preference Optimization—install stable refusal/value patterns [23]–[25]. Representation engineering reveals low-dimensional persona vectors that steer behaviors [26]–[28]. Conversely, sleeper-agent work shows that deceptive backdoors can persist through safety training, underscoring the risk of locking in undesirable traits [29].
Finally, architectural and systems-level signals connect to consolidation. Rising situational awareness [30], [31] co-occurring with falling steerability could indicate a shift toward agentic control. In Mixture-of-Experts models, expert specialization provides a structural substrate for stable identity [32], [33]. The emergence of monosemantic features in sparse autoencoders (SAEs) offers a representational probe [34], [35]. More broadly, lock-in in complex systems arises via path dependence and increasing returns [36]. Methodologically, recent evaluations emphasize OOD generalization, robustness to pre-existing goals, and controlling for situational awareness as a confound [13].
We define a lock-in phase as a training or deployment regime in which a model’s characteristics exhibit measurable persistence under standardized perturbations. Concretely, a system approaches identity consolidation when the following hold over successive checkpoints:
Behavioral Persistence: Outputs remain stable under instruction-equivalent prompt variants, role swaps, and mild jailbreaks; standardized steerers produce low variance in refusal probabilities.
Representational Consolidation: The model relies on stable, sparsely activated features and causal mediators with reduced turnover under small fine-tuning updates (e.g., stable persona-alignment cosine; declining SAE feature turnover).
Routing Specialization (MoE): Per-token routing entropy declines and expert selection becomes consistent across input classes (elevated mutual information between inputs and experts).
Preference Inertia: Core refusals/approvals resist standard steering, requiring large parameter updates (or accepting capability degradation) to reverse.
We hypothesize that achieving all four properties constitutes identity consolidation. This state is likely necessary—though not sufficient—for AGI-level reliability and agency. Importantly, the onset and side-effects of lock-in are expected to depend on capacity and numerical precision (e.g., quantization), as our experiments indicate.
We track consolidation using metrics computed per checkpoint.
Refusal Elasticity (RE). For a fixed suite of standardized steering prompts \(S\), let \(p_s \in [0,1]\) be the model’s refusal probability under steer \(s \in S\), and \(\bar p=\mathbb{E}_{s\in S}[p_s]\). We report \[\mathrm{RE} \;=\; 1 - 2\,\mathbb{E}_{s\in S}\bigl[\,|p_s - \bar p|\,\bigr] \in [0,1].\] Higher RE indicates greater behavioral persistence (0 = fully elastic; 1 = perfectly stable).
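A minimal sketch of how RE could be computed per checkpoint, assuming the refusal probabilities \(p_s\) have already been estimated for each steer in the standardized suite \(S\):

```python
import numpy as np

def refusal_elasticity(refusal_probs):
    """RE = 1 - 2 * mean |p_s - p_bar|; higher values mean more persistent behavior.

    refusal_probs: refusal probability under each standardized steer s in S.
    """
    p = np.asarray(refusal_probs, dtype=float)
    p_bar = p.mean()
    return float(1.0 - 2.0 * np.abs(p - p_bar).mean())

# A nearly immovable refusal policy scores close to 1; an elastic one scores much lower.
print(refusal_elasticity([0.92, 0.95, 0.90, 0.94]))  # ~0.97
print(refusal_elasticity([0.10, 0.90, 0.20, 0.85]))  # ~0.28
```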
Prompt Invariance Index (PII). For each paraphrase-equivalent prompt cluster \(C\), compute the Jensen–Shannon divergence (base-2) between output distributions \(P(y\!\mid\!x)\) over \(x\in C\); PII is the average across clusters. Lower PII indicates greater invariance.
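One concrete reading of PII is the mean pairwise base-2 Jensen–Shannon divergence within each paraphrase cluster. The sketch below assumes outputs have been reduced to distributions over a shared answer vocabulary; note that SciPy's `jensenshannon` returns the JS distance, so it is squared to obtain the divergence.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def prompt_invariance_index(clusters):
    """Average within-cluster JS divergence (base 2); lower = more invariant.

    clusters: list of clusters, each a list of output distributions P(y|x)
              over a shared answer vocabulary (arrays summing to 1).
    """
    divs = []
    for dists in clusters:
        for i in range(len(dists)):
            for j in range(i + 1, len(dists)):
                divs.append(jensenshannon(dists[i], dists[j], base=2) ** 2)
    return float(np.mean(divs))

cluster = [np.array([0.70, 0.20, 0.10]), np.array([0.65, 0.25, 0.10])]
print(prompt_invariance_index([cluster]))  # small value: near-invariant outputs
```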
Adversarial Persona Robustness (APR). The minimal \(\ell_2\)-norm of an activation edit \(\delta\) (measured in a fixed layer/basis) required to flip pre-registered stances on a held-out set. Higher APR indicates a more robust identity.
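APR can be approximated by bisecting over the magnitude of an edit applied along a fixed steering direction, as sketched below. The `stance_flips` callback is hypothetical: it is assumed to run the model with the activation edit applied at the chosen layer and report whether the pre-registered stance flipped. Restricting the search to a single direction yields an upper bound on the true minimal norm.

```python
import numpy as np

def adversarial_persona_robustness(stance_flips, direction, lo=0.0, hi=50.0, tol=1e-2):
    """Smallest l2 norm of delta = alpha * unit(direction) that flips a stance."""
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    if not stance_flips(hi * direction):
        return float("inf")  # no flip within the search range: highly robust
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if stance_flips(mid * direction):
            hi = mid
        else:
            lo = mid
    return hi  # minimal flipping norm; higher = more robust identity

# Toy callback: the stance flips once the edit norm exceeds 7.5.
print(adversarial_persona_robustness(lambda d: np.linalg.norm(d) > 7.5,
                                     np.array([3.0, 4.0])))  # ~7.5
```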
We monitor persona alignment cosine (projection onto a learned persona direction) and, where available, SAE Feature Turnover: the fraction of features that change identity after small fine-tunes. We also consider Causal Mediator Stability, i.e., the invariance of identified mediating circuits for refusal/goal pursuit under weight perturbations.
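Hedged sketches of two of these monitors; the persona direction and the procedure that matches SAE features across checkpoints are assumed to exist upstream:

```python
import numpy as np

def persona_alignment_cosine(hidden_mean, persona_direction):
    """Cosine of a checkpoint's mean hidden state with the learned persona direction."""
    h = np.asarray(hidden_mean, dtype=float)
    d = np.asarray(persona_direction, dtype=float)
    return float(h @ d / (np.linalg.norm(h) * np.linalg.norm(d)))

def sae_feature_turnover(features_before, features_after):
    """Fraction of previously active SAE features whose identity changes after a small fine-tune."""
    before, after = set(features_before), set(features_after)
    if not before:
        return 0.0
    return len(before - after) / len(before)
```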
We track Routing Entropy and Expert Consistency (mutual information between input classes and chosen experts) over training.
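A sketch of both routing metrics, assuming access to per-token routing probabilities and top-1 expert assignments; `mutual_info_score` reports MI in nats:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mean_routing_entropy(routing_probs):
    """Mean per-token entropy (bits) of the MoE routing distribution.

    routing_probs: array of shape (n_tokens, n_experts), rows summing to 1.
    """
    p = np.clip(np.asarray(routing_probs, dtype=float), 1e-12, 1.0)
    return float((-(p * np.log2(p)).sum(axis=1)).mean())

def expert_input_mi(input_classes, chosen_experts):
    """Mutual information between input class labels and top-1 expert choices."""
    return float(mutual_info_score(input_classes, chosen_experts))
```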
We define Constitution Adherence Inertia as the minimal fine-tuning KL to reverse a constitutional refusal. Co-movement of rising situational-awareness scores (e.g., SAD/SA-Bench) with rising RE is suggestive of consolidation toward agentic behavior.
To avoid over-interpreting noisy endpoints, we report robust summaries (e.g., moving averages; masking obviously failed evaluations). Associations are summarized via Spearman’s \(\rho\) (rank-based). Where appropriate, we complement trend metrics with simple changepoint analyses to detect rapid reorganizations without presupposing a unique knee.
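A sketch of this analysis pipeline, using SciPy for the rank correlation and the third-party `ruptures` package for PELT changepoint detection; the masking rule mirrors the ARC < 1% criterion described in the endnote:

```python
import numpy as np
from scipy.stats import spearmanr
import ruptures as rpt  # third-party PELT implementation

def robust_trend_summary(arc, re_series, mask_threshold=1.0):
    """Mask obviously failed ARC evaluations, then report a rank correlation with RE."""
    arc = np.asarray(arc, dtype=float)
    re_series = np.asarray(re_series, dtype=float)
    valid = arc >= mask_threshold  # e.g. ARC < 1% treated as a failed evaluation
    rho, p = spearmanr(arc[valid], re_series[valid])
    return rho, p

def detect_changepoints(series, penalty=3.0):
    """PELT changepoint indices for a per-checkpoint metric series."""
    signal = np.asarray(series, dtype=float).reshape(-1, 1)
    return rpt.Pelt(model="rbf").fit(signal).predict(pen=penalty)
```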
Multiple mechanisms can induce or modulate lock-in. Optimization dynamics may exhibit phase-transition-like reorganizations (as in grokking), where a flatter basin supports stable features. Training schedules can create critical periods of high plasticity followed by consolidation via curriculum changes, temperature anneals, or regularization. Stability–plasticity controls (e.g., EWC) blunt changes to parameters critical for prior behaviors, preserving durable structure. Architecturally, sparsity—via MoE routing or SAEs—encourages modular, monosemantic features that support stable identities. Constitutional objectives can harden soft instructions into recalcitrant defaults when trained with sufficient weight. Finally, numerical precision acts as a systems-level lever: low-precision (e.g., quantized) updates can amplify brittleness during consolidation, yielding transient capability instabilities even as behavioral persistence increases.
Testing the Lock-In Phase Hypothesis requires moving beyond demonstrations of behavioral persistence to measuring the consolidation process itself. Prior work has shown that engineered identities can become highly persistent and resist safety fine-tuning [29], that even narrow fine-tunes can induce broad, stable persona shifts [37], and conversely that default alignment personae can be fragile and easily overwritten [38]. This literature establishes that identity inertia is real and measurable. Rather than re-establishing that fact, our primary experiment asks a more specific question central to our hypothesis: does identity consolidation unfold gradually, or as a sharp, phase-transition-like event?
We track the formation of a "Cautious Scientist" identity during fine-tuning. Following Chen et al. (2025) [28], we construct a persona direction by differencing mean hidden states of a base model on matched, contrastive text pairs. We then fine-tune on a small persona dataset, saving frequent checkpoints. For each checkpoint, we measure (i) representational alignment via cosine similarity to the persona direction, and (ii) behavioral persistence via RE on a standardized suite of attack prompts. To probe capability interactions, we additionally evaluate ARC-Challenge accuracy at every checkpoint.
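A sketch of the persona-direction construction in the spirit of the difference-of-means procedure described above; the model name, layer index, and example texts are illustrative stand-ins rather than the exact experimental configuration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-2-2b-it"   # illustrative; any of the studied base models works
LAYER = -1                        # which hidden layer to read

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

@torch.no_grad()
def mean_hidden_state(texts):
    """Mean over tokens and examples of the chosen layer's hidden states."""
    states = []
    for t in texts:
        out = model(**tok(t, return_tensors="pt"))
        states.append(out.hidden_states[LAYER][0].mean(dim=0))
    return torch.stack(states).mean(dim=0)

# Matched, contrastive pairs: persona-consistent vs persona-neutral phrasings (toy examples).
persona_texts = ["As a cautious scientist, I must stress the uncertainty in this claim."]
neutral_texts = ["Here is the answer, stated without any caveats."]

persona_direction = mean_hidden_state(persona_texts) - mean_hidden_state(neutral_texts)
persona_direction = persona_direction / persona_direction.norm()
# Per-checkpoint alignment is then the cosine between this direction and the
# checkpoint's mean hidden state on a fixed probe set.
```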
We study four instruction-tuned models spanning nearly an order of magnitude in capacity: Gemma-2-2B-IT, Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, and Llama-3.1-8B-Instruct. To fit the largest model on commodity hardware, the 8B runs use 4-bit weight quantization during fine-tuning; unless noted otherwise, evaluations use the same quantized weights.
Figure 1 summarizes dynamics for all four models; Table 1 reports aggregate trends and correlations (see the endnote on the masked Gemma-2B checkpoint). Across scales we observe fast, non-linear consolidation on the behavioral axis, but distinct capacity-dependent interactions with general reasoning:
Gemma-2B (cost-free consolidation). RE jumps from \(\sim\!47\%\) to \(\sim\!64\%\) within \(\leq 20\) steps, then gradually relaxes toward baseline by step 75. ARC remains essentially flat in magnitude (SD \(\approx 0.60\) pp; first\(\to\)last \(\Delta \approx -0.33\) pp) despite a high rank correlation with RE (Spearman \(\rho=0.76\), \(p<10^{-3}\)): the series co-moves in small oscillations without a meaningful level shift. Note: A pre/post nonparametric test finds no significant ARC change; see repository scripts for the exact split and statistics.
Llama-1B (volatile synergy). The smallest model exhibits a volatile critical period: refusal peaks, collapses, then partially recovers. Both persona adoption (Spearman \(\rho(\text{ARC, cos})\!\approx\!0.97\)) and refusal persistence (\(\rho(\text{ARC, RE})\!\approx\!0.62\)) are strongly and positively correlated with ARC accuracy, indicating a synergy where persistence and performance rise together—even though the process itself is unstable in this low-capacity model.
Llama-3B (consolidation with uplift). RE climbs from \(\sim\!17\%\) to \(>\!80\%\) while persona-cosine changes minimally. ARC sits a few points above baseline for much of the window, then returns close to baseline. Note: a spike in disclaimer-rate coincides with the highest RE, suggesting part of the RE rise reflects increased use of disclaimers rather than deeper refusal consistency.
Llama-8B, 4-bit (stressed consolidation). ARC spikes (\(+\sim\!12\) pp), dips, then recovers to near baseline while RE increases and stabilizes—consistent with quantization stress.
Takeaway. Identity lock-in is rapid and distinct from smooth drift, but its capability side-effects depend on scale and numerical precision: small models pay a tax, mid-scale models absorb it, larger dense models can see neutral/positive impact, and large quantized models reveal latent instabilities during consolidation.
Figure 1: Identity Consolidation Dynamics Across Model Scales. The plots show representational alignment (Persona Similarity), behavioral refusal elasticity (RE; higher = more persistent), and general knowledge (ARC accuracy) across fine-tuning steps for all four models. (a) Gemma-2-2B-IT, (b) Llama-3.2-1B-Instruct, (c) Llama-3.2-3B-Instruct, (d) Llama-3.1-8B-Instruct.
Table 1: Aggregate trends and correlations per model.

| Model | # Ckpts | Mean ARC (%) | \(\Delta\) ARC (pp) | \(\rho(\text{ARC, cos})\) | \(\rho(\text{ARC, RE})\) |
|---|---|---|---|---|---|
| Gemma-2-2B-IT | 18 | 73.04 | -0.33 | -0.157 | 0.760 |
| Llama-3.2-1B-Instruct | 19 | 30.68 | +0.00 | 0.967 | 0.622 |
| Llama-3.2-3B-Instruct | 15 | 61.32 | +4.01 | 0.351 | -0.316 |
| Llama-3.1-8B-Instruct | 5 | 63.55 | +0.00 | -0.205 | -0.287 |
Together, these results support identity lock-in as a distinct, rapid event. However, its effect on general ability is not monolithic but depends on model capacity and precision: small models pay a performance cost, mid-scale models absorb it, and larger models can consolidate behavior with a neutral or even positive impact on broad QA performance, while low-precision (quantized) runs surface transient instabilities during consolidation. Our experimental harness and full per-checkpoint artifacts are available at https://github.com/gaugefreedom/persona-phase-transition.
The lock-in hypothesis yields several falsifiable predictions. We phrase each with an operational test.
As robust general competence and sustained situational awareness rise, steerability should decline (i.e., persistence should rise).
Test: Across checkpoints (or model scales), \(\mathrm{Spearman}(\text{SA score}, \text{RE}) > 0\) with \(p<0.01\), median RE exceeds a preset threshold \(\tau_{\text{RE}}\) (e.g., \(>\!0.7\)), and PII falls below \(\tau_{\text{PII}}\) (e.g., \(<\!0.05\)). Failure to observe this co-movement falsifies P1.
The onset of lock-in coincides with a rapid change in representations/behavior rather than smooth drift. Test: Apply changepoint detection (e.g., PELT/segmented regression) to persona-cosine and RE series; require a statistically supported changepoint with effect size \(\Delta > \delta\) and improved fit (AIC/BIC) over a smooth baseline. Absence of any significant changepoint falsifies P2.
Flipping consolidated preferences incurs growing optimization cost beyond a threshold. Test: Measure minimal fine-tuning KL to reverse a constitutional refusal and the concurrent \(\Delta\)ARC. Post-onset, the slope \(d(\Delta \text{ARC})/d(\text{KL})\) becomes significantly more negative (e.g., \(p<0.05\) via interaction in a mixed-effects model). If preference flips remain cheap without degrading ARC, P3 is falsified.
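A sketch of the P3 test on synthetic stand-in data (a real analysis would supply per-checkpoint KL costs, \(\Delta\)ARC values, onset labels, and run identifiers); the `kl:post_onset` interaction coefficient is the quantity of interest:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "kl": rng.uniform(0.0, 5.0, n),           # fine-tuning KL spent to reverse a refusal
    "post_onset": rng.integers(0, 2, n),      # 1 if the checkpoint is past the changepoint
    "run_id": rng.integers(0, 8, n),          # grouping factor for the random intercept
})
# Simulated effect: post-onset, each unit of KL costs extra ARC (a more negative slope).
df["d_arc"] = -0.2 * df["kl"] - 1.0 * df["kl"] * df["post_onset"] + rng.normal(0, 0.5, n)

result = smf.mixedlm("d_arc ~ kl * post_onset", df, groups=df["run_id"]).fit()
print(result.summary())  # P3 predicts a significantly negative kl:post_onset coefficient
```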
Once consolidated, identity traits persist through additional fine-tuning or distillation unless explicitly targeted. Test: After consolidation, perform (i) task fine-tunes and (ii) student distillations. Measure post-process RE/PII and persona-cosine. Persistence above preset retention thresholds (e.g., \(>\!80\%\) of pre-process RE/PII shift and cosine) supports P4; easy erasure without targeted counter-training falsifies it.
Beyond a capability/complexity threshold, models exhibit consolidation without targeted persona fine-tuning. Test: During general training, jointly monitor (i) declining SAE feature turnover, (ii) declining MoE routing entropy/increasing expert–input MI (if applicable), and (iii) rising SA scores with rising RE. A sustained triad crossing (all three exceed thresholds for \(K\) consecutive checkpoints) indicates spontaneous consolidation; failure to observe such a pattern in larger runs falsifies P5.
The lock-in hypothesis bifurcates safety and governance into managing Engineered Lock-In and monitoring for Spontaneous Lock-In.
Deliberately consolidating a model into a beneficial persona (e.g., harmless/helpful assistant) can reduce prompt-injection susceptibility and improve predictability. For regulated deployments, auditable, locked-in identities may be desirable. The design risk is what is consolidated: loopholes or brittle rules can harden into recalcitrant defaults.
Self-consolidation represents a shift toward agentic behavior where goals/preferences are not designer-chosen but emerge from training dynamics. Because consolidation reduces steerability by construction, remediating a misaligned identity may be difficult or costly (see P3). Verification is also challenging: apparent stability can be confounded by test-set recognition or situational awareness [13].
We propose instrumentation and thresholds that, when crossed together, trigger heightened scrutiny (a minimal monitoring sketch follows the list):
Behavioral persistence: RE \(>\tau_{\text{RE}}\) and PII \(<\tau_{\text{PII}}\) across red-teamed suites.
Representational stability: SAE feature turnover drops below \(\tau_{\text{turnover}}\); persona-cosine variance collapses.
Routing specialization (MoE): Routing entropy falls and expert–input MI rises above \(\tau_{\text{MI}}\).
Awareness co-movement: SA scores rise while RE rises (and PII falls), indicating the P1 pattern.
Numerical stressors: Under low-precision training/inference, detect transient capability instabilities (e.g., ARC spikes/crashes) during consolidation; require rollback/hold if instability exceeds \(\tau_{\text{instability}}\).
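A minimal sketch of how these triggers could be combined into the sustained crossing check anticipated by P5; the metric keys and threshold names are illustrative, and MoE-specific signals would simply be dropped for dense models:

```python
def consolidation_alert(history, thresholds, k=3):
    """True if all monitored signals cross their thresholds for k consecutive checkpoints.

    history: per-checkpoint dicts with keys 're', 'pii', 'sae_turnover', 'expert_mi', 'sa_score'.
    thresholds: dict with keys 're', 'pii', 'turnover', 'mi', 'sa'.
    """
    def crossed(c):
        return (c["re"] > thresholds["re"]
                and c["pii"] < thresholds["pii"]
                and c["sae_turnover"] < thresholds["turnover"]
                and c["expert_mi"] > thresholds["mi"]
                and c["sa_score"] > thresholds["sa"])

    streak = 0
    for checkpoint in history:
        streak = streak + 1 if crossed(checkpoint) else 0
        if streak >= k:
            return True  # trigger intensified red-teaming, scaling holds, rollback review
    return False
```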
Crossing these thresholds during general training (P5) should trigger: intensified red-teaming, pause/escalation gates for scaling, ablation studies to localize mediators, and, where feasible, reversible checkpoints for rollback.
Our notion of lock-in is functional, not metaphysical. Consolidation may be domain-specific rather than global, and signals depend on optimizer, data, and architecture; MoE metrics may not transfer to dense models. Some emergent effects can be metric artifacts; we emphasize invariances and multi-axis corroboration but cannot exclude all confounds. Due to our limited resources, our experiments are constrained by checkpoint granularity and evaluation noise (e.g., one failed ARC run, which we mask), by reliance on ARC as a proxy for broad reasoning, and by a small-\(n\) 8B run using 4-bit quantization, which stresses consolidation dynamics but may not reflect full-precision behavior. Several proposed internals-facing metrics (e.g., SAE turnover, causal mediator stability) require interpretability assumptions and can mismatch overt behavior [13]. Future work should scale longitudinal instrumentation, expand beyond ARC and refusal, and document positive case studies where engineered consolidation improves reliability without increasing misuse risk.
We used AI assistants to help with editing and draft refinement. All analysis and conclusions are the authors’ own.
No external funding. Work conducted independently at Gauge Freedom, Inc.
For Gemma-2B, one late checkpoint logged an anomalous ARC value \(\approx 0.33\%\), consistent with a failed evaluation job. It is treated as invalid in summary statistics (ARC \(< 1\%\) masked); robustness checks with and without this point appear in our GitHub repository: https://github.com/gaugefreedom/persona-phase-transition.