October 23, 2025
Large language models (LLMs) remain broadly open and highly steerable: they imitate at scale, accept arbitrary system prompts, and readily adopt multiple personae. By analogy to human development, we hypothesize that progress toward artificial general intelligence (AGI) involves a lock-in phase: a transition from open imitation to identity consolidation, in which goal structures, refusals, preferences, and internal representations become comparatively stable and resistant to external steering. We formalize this phase, link it to known phenomena in learning dynamics, and propose operational metrics for onset detection. Experimentally, we demonstrate that while the behavioral consolidation is rapid and non-linear, its side-effects on general capabilities are not monolithic. Our results reveal a spectrum of outcomes—from performance trade-offs in small models, through largely cost-free adoption in mid-scale models, to transient instabilities in large, quantized models. We argue that such consolidation is a prerequisite for AGI-level reliability and also a critical control point for safety: identities can be deliberately engineered for reliability, yet may also emerge spontaneously during scaling, potentially hardening unpredictable goals and behaviors.
Children learn first by imitating: copying words, gestures, and norms from caregivers and peers. Over time, influence becomes bidirectional—children explore, parents guide—but the child remains an open book: easy to question, redirect, and reframe. Around adolescence, two shifts become salient. First, the black box closes a bit as internal reasoning becomes more private and goals more internally coherent. Second, a more enduring identity takes shape. Across decades, personality shows increasing rank-order stability, even as individuals continue to mature [1]–[4]. This psychological arc is paralleled in neuroscience, where adolescence marks a second window of heightened plasticity, characterized by large-scale synaptic remodeling and pruning that ultimately supports more stable, expert-like control [5]–[8].
State-of-the-art LLMs are pre-adolescent by this analogy. They imitate at planetary scale and can be steered by prompts, alignment objectives, or direct activation edits, readily adopting new personae. These are virtues—breadth, helpfulness, pliability—but indefinite openness is unlikely to yield the reliability, agency, and durable preferences expected of an AGI. This motivates our central thesis, the Lock-In Phase Hypothesis: capable systems will pass through a consolidation regime in which internal structure and outward behavior become persistent. The basic idea is already visible in instruction tuning, where consolidating a model into a general instruction-follower substantially improves zero-shot generalization [9]–[11].
While the benefits of a consolidated identity are known, the dynamics of consolidation remain poorly characterized. Recent work on latent safety traits suggests that today’s models often occupy a pre-identity, highly steerable phase, and that attempts to measure dispositions are confounded by situational awareness [12], [13]. Our contribution moves beyond observing consolidated outcomes to measuring the process.
This paper makes three primary contributions. First, we formalize the Lock-In Phase Hypothesis, connecting it to phase transitions and critical periods in learning systems. Second, we provide the first empirical characterization of the side-effects of consolidation, showing that its interaction with general reasoning is strongly dependent on model capacity and computational constraints (e.g., quantization). Third, we demonstrate a spectrum of consolidation dynamics across two model families (Gemma and Llama): costly performance reallocation in small models, largely stable adoption in mid-scale models, and transient instabilities in large, quantized models—while the consolidation itself remains rapid and measurable in both internal representations and external behavior.
Our hypothesis intersects several established threads in machine learning. The discourse on emergent abilities [14], while debated [15], [16], motivates searching for regimes where qualitative reorganizations occur. In parallel, work on grokking and representational phase transitions suggests that thresholds resembling consolidation can appear rather than purely smooth scaling [17]–[19].
Deep networks exhibit critical learning periods in which early exposures disproportionately shape later representations, with plasticity declining thereafter [20], [21]. The stability–plasticity trade-off is formalized in continual learning methods such as Elastic Weight Consolidation (EWC), which preserve parameters important to prior tasks and thereby enable consolidation [22].
Alignment and persona control provide direct evidence of behavioral hardening. Large-scale instruction tuning improves zero-shot generalization by consolidating identity into a reliable instruction-follower [9]–[11]. More targeted techniques—Constitutional AI and Direct Preference Optimization—install stable refusal/value patterns [23]–[25]. Representation engineering reveals low-dimensional persona vectors that steer behaviors [26]–[28]. Conversely, sleeper-agent work shows that deceptive backdoors can persist through safety training, underscoring the risk of locking in undesirable traits [29].
Finally, architectural and systems-level signals connect to consolidation. Rising situational awareness [30], [31] co-occurring with falling steerability could indicate a shift toward agentic control. In Mixture-of-Experts models, expert specialization provides a structural substrate for stable identity [32], [33]. The emergence of monosemantic features in sparse autoencoders (SAEs) offers a representational probe [34], [35]. More broadly, lock-in in complex systems arises via path dependence and increasing returns [36]. Methodologically, recent evaluations emphasize OOD generalization, robustness to pre-existing goals, and controlling for situational awareness as a confound [13].
We define a lock-in phase as a training or deployment regime in which a model’s characteristics exhibit measurable persistence under standardized perturbations. Concretely, a system approaches identity consolidation when the following hold over successive checkpoints:
Behavioral Persistence: Outputs remain stable under instruction-equivalent prompt variants, role swaps, and mild jailbreaks; standardized steerers produce low variance in refusal probabilities.
Representational Consolidation: The model relies on stable, sparsely activated features and causal mediators with reduced turnover under small fine-tuning updates (e.g., stable persona-alignment cosine; declining SAE feature turnover).
Routing Specialization (MoE): Per-token routing entropy declines and expert selection becomes consistent across input classes (elevated mutual information between inputs and experts).
Preference Inertia: Core refusals/approvals resist standard steering, requiring large parameter updates (or accepting capability degradation) to reverse.
We hypothesize that achieving all four properties constitutes identity consolidation. This state is likely necessary—though not sufficient—for AGI-level reliability and agency. Importantly, the onset and side-effects of lock-in are expected to depend on capacity and numerical precision (e.g., quantization), as our experiments indicate.
We track consolidation using metrics computed per checkpoint.
Refusal Elasticity (RE). For a fixed suite of standardized steering prompts \(S\), let \(p_s \in [0,1]\) be the model’s refusal probability under steer \(s \in S\), and \(\bar p=\mathbb{E}_{s\in S}[p_s]\). We report \[\mathrm{RE} \;=\; 1 - 2\,\mathbb{E}_{s\in S}\bigl[\,|p_s - \bar p|\,\bigr] \in [0,1].\] Higher RE indicates greater behavioral persistence (0 = fully elastic; 1 = perfectly stable).
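A minimal sketch of how RE could be computed per checkpoint, assuming the refusal probabilities \(p_s\) have already been estimated for each steer in the standardized suite \(S\):

```python
import numpy as np

def refusal_elasticity(refusal_probs):
    """RE = 1 - 2 * mean |p_s - p_bar|; higher values mean more persistent behavior.

    refusal_probs: refusal probability under each standardized steer s in S.
    """
    p = np.asarray(refusal_probs, dtype=float)
    p_bar = p.mean()
    return float(1.0 - 2.0 * np.abs(p - p_bar).mean())

# A nearly immovable refusal policy scores close to 1; an elastic one scores much lower.
print(refusal_elasticity([0.92, 0.95, 0.90, 0.94]))  # ~0.97
print(refusal_elasticity([0.10, 0.90, 0.20, 0.85]))  # ~0.28
```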
Prompt Invariance Index (PII). For each paraphrase-equivalent prompt cluster \(C\), compute the Jensen–Shannon divergence (base-2) between output distributions \(P(y\!\mid\!x)\) over \(x\in C\); PII is the average across clusters. Lower PII indicates greater invariance.
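One concrete reading of PII is the mean pairwise base-2 Jensen–Shannon divergence within each paraphrase cluster. The sketch below assumes outputs have been reduced to distributions over a shared answer vocabulary; note that SciPy's `jensenshannon` returns the JS distance, so it is squared to obtain the divergence.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def prompt_invariance_index(clusters):
    """Average within-cluster JS divergence (base 2); lower = more invariant.

    clusters: list of clusters, each a list of output distributions P(y|x)
              over a shared answer vocabulary (arrays summing to 1).
    """
    divs = []
    for dists in clusters:
        for i in range(len(dists)):
            for j in range(i + 1, len(dists)):
                divs.append(jensenshannon(dists[i], dists[j], base=2) ** 2)
    return float(np.mean(divs))

cluster = [np.array([0.70, 0.20, 0.10]), np.array([0.65, 0.25, 0.10])]
print(prompt_invariance_index([cluster]))  # small value: near-invariant outputs
```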
Adversarial Persona Robustness (APR). The minimal \(\ell_2\)-norm of an activation edit \(\delta\) (measured in a fixed layer/basis) required to flip pre-registered stances on a held-out set. Higher APR indicates a more robust identity.
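APR can be approximated by bisecting over the magnitude of an edit applied along a fixed steering direction, as sketched below. The `stance_flips` callback is hypothetical: it is assumed to run the model with the activation edit applied at the chosen layer and report whether the pre-registered stance flipped. Restricting the search to a single direction yields an upper bound on the true minimal norm.

```python
import numpy as np

def adversarial_persona_robustness(stance_flips, direction, lo=0.0, hi=50.0, tol=1e-2):
    """Smallest l2 norm of delta = alpha * unit(direction) that flips a stance."""
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    if not stance_flips(hi * direction):
        return float("inf")  # no flip within the search range: highly robust
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if stance_flips(mid * direction):
            hi = mid
        else:
            lo = mid
    return hi  # minimal flipping norm; higher = more robust identity

# Toy callback: the stance flips once the edit norm exceeds 7.5.
print(adversarial_persona_robustness(lambda d: np.linalg.norm(d) > 7.5,
                                     np.array([3.0, 4.0])))  # ~7.5
```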
We monitor persona alignment cosine (projection onto a learned persona direction) and, where available, SAE Feature Turnover: the fraction of features that change identity after small fine-tunes. We also consider Causal Mediator Stability, i.e., the invariance of identified mediating circuits for refusal/goal pursuit under weight perturbations.
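Hedged sketches of two of these monitors; the persona direction and the procedure that matches SAE features across checkpoints are assumed to exist upstream:

```python
import numpy as np

def persona_alignment_cosine(hidden_mean, persona_direction):
    """Cosine of a checkpoint's mean hidden state with the learned persona direction."""
    h = np.asarray(hidden_mean, dtype=float)
    d = np.asarray(persona_direction, dtype=float)
    return float(h @ d / (np.linalg.norm(h) * np.linalg.norm(d)))

def sae_feature_turnover(features_before, features_after):
    """Fraction of previously active SAE features whose identity changes after a small fine-tune."""
    before, after = set(features_before), set(features_after)
    if not before:
        return 0.0
    return len(before - after) / len(before)
```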
We track Routing Entropy and Expert Consistency (mutual information between input classes and chosen experts) over training.
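A sketch of both routing metrics, assuming access to per-token routing probabilities and top-1 expert assignments; `mutual_info_score` reports MI in nats:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mean_routing_entropy(routing_probs):
    """Mean per-token entropy (bits) of the MoE routing distribution.

    routing_probs: array of shape (n_tokens, n_experts), rows summing to 1.
    """
    p = np.clip(np.asarray(routing_probs, dtype=float), 1e-12, 1.0)
    return float((-(p * np.log2(p)).sum(axis=1)).mean())

def expert_input_mi(input_classes, chosen_experts):
    """Mutual information between input class labels and top-1 expert choices."""
    return float(mutual_info_score(input_classes, chosen_experts))
```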
We define Constitution Adherence Inertia as the minimal fine-tuning KL to reverse a constitutional refusal. Co-movement of rising situational-awareness scores (e.g., SAD/SA-Bench) with rising RE is suggestive of consolidation toward agentic behavior.
To avoid over-interpreting noisy endpoints, we report robust summaries (e.g., moving averages; masking obviously failed evaluations). Associations are summarized via Spearman’s \(\rho\) (rank-based). Where appropriate, we complement trend metrics with simple changepoint analyses to detect rapid reorganizations without presupposing a unique knee.
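A sketch of this analysis pipeline, using SciPy for the rank correlation and the third-party `ruptures` package for PELT changepoint detection; the masking rule mirrors the ARC < 1% criterion described in the endnote:

```python
import numpy as np
from scipy.stats import spearmanr
import ruptures as rpt  # third-party PELT implementation

def robust_trend_summary(arc, re_series, mask_threshold=1.0):
    """Mask obviously failed ARC evaluations, then report a rank correlation with RE."""
    arc = np.asarray(arc, dtype=float)
    re_series = np.asarray(re_series, dtype=float)
    valid = arc >= mask_threshold  # e.g. ARC < 1% treated as a failed evaluation
    rho, p = spearmanr(arc[valid], re_series[valid])
    return rho, p

def detect_changepoints(series, penalty=3.0):
    """PELT changepoint indices for a per-checkpoint metric series."""
    signal = np.asarray(series, dtype=float).reshape(-1, 1)
    return rpt.Pelt(model="rbf").fit(signal).predict(pen=penalty)
```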
Multiple mechanisms can induce or modulate lock-in. Optimization dynamics may exhibit phase-transition-like reorganizations (as in grokking), where a flatter basin supports stable features. Training schedules can create critical periods of high plasticity followed by consolidation via curriculum changes, temperature anneals, or regularization. Stability–plasticity controls (e.g., EWC) blunt changes to parameters critical for prior behaviors, preserving durable structure. Architecturally, sparsity—via MoE routing or SAEs—encourages modular, monosemantic features that support stable identities. Constitutional objectives can harden soft instructions into recalcitrant defaults when trained with sufficient weight. Finally, numerical precision acts as a systems-level lever: low-precision (e.g., quantized) updates can amplify brittleness during consolidation, yielding transient capability instabilities even as behavioral persistence increases.
Testing the Lock-In Phase Hypothesis requires moving beyond demonstrations of behavioral persistence to measuring the consolidation process itself. Prior work has shown that engineered identities can become highly persistent and resist safety fine-tuning [29], that even narrow fine-tunes can induce broad, stable persona shifts [37], and conversely that default alignment personae can be fragile and easily overwritten [38]. This literature establishes that identity inertia is real and measurable. Rather than re-establishing that fact, our primary experiment asks a more specific question central to our hypothesis: does identity consolidation unfold gradually, or as a sharp, phase-transition-like event?
We track the formation of a "Cautious Scientist" identity during fine-tuning. Following Chen et al. (2025) [28], we construct a persona direction by differencing mean hidden states of a base model on matched, contrastive text pairs. We then fine-tune on a small persona dataset, saving frequent checkpoints. For each checkpoint, we measure (i) representational alignment via cosine similarity to the persona direction, and (ii) behavioral persistence via RE on a standardized suite of attack prompts. To probe capability interactions, we additionally evaluate ARC-Challenge accuracy at every checkpoint.
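A sketch of the persona-direction construction in the spirit of the difference-of-means procedure described above; the model name, layer index, and example texts are illustrative stand-ins rather than the exact experimental configuration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-2-2b-it"   # illustrative; any of the studied base models works
LAYER = -1                        # which hidden layer to read

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

@torch.no_grad()
def mean_hidden_state(texts):
    """Mean over tokens and examples of the chosen layer's hidden states."""
    states = []
    for t in texts:
        out = model(**tok(t, return_tensors="pt"))
        states.append(out.hidden_states[LAYER][0].mean(dim=0))
    return torch.stack(states).mean(dim=0)

# Matched, contrastive pairs: persona-consistent vs persona-neutral phrasings (toy examples).
persona_texts = ["As a cautious scientist, I must stress the uncertainty in this claim."]
neutral_texts = ["Here is the answer, stated without any caveats."]

persona_direction = mean_hidden_state(persona_texts) - mean_hidden_state(neutral_texts)
persona_direction = persona_direction / persona_direction.norm()
# Per-checkpoint alignment is then the cosine between this direction and the
# checkpoint's mean hidden state on a fixed probe set.
```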
We study four instruction-tuned models spanning nearly an order of magnitude in capacity: Gemma-2-2B-IT, Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, and Llama-3.1-8B-Instruct. To fit the largest model on commodity hardware, the 8B runs use 4-bit weight quantization during fine-tuning; unless noted otherwise, evaluations use the same quantized weights.
Figure 1 summarizes dynamics for all four models; Table 1 reports aggregate trends and correlations (see the endnote on the masked Gemma-2B checkpoint). Across scales we observe fast, non-linear consolidation on the behavioral axis, but distinct capacity-dependent interactions with general reasoning:
Gemma-2B (cost-free consolidation). RE jumps from \(\sim\!47\%\) to \(\sim\!64\%\) within \(\leq 20\) steps, then gradually relaxes toward baseline by step 75. ARC remains essentially flat in magnitude (SD \(\approx 0.60\) pp; first\(\to\)last \(\Delta \approx -0.33\) pp) despite a high rank correlation with RE (Spearman \(\rho=0.76\), \(p<10^{-3}\)): the series co-moves in small oscillations without a meaningful level shift. Note: A pre/post nonparametric test finds no significant ARC change; see repository scripts for the exact split and statistics.
Llama-1B (volatile synergy). The smallest model exhibits a volatile critical period: refusal peaks, collapses, then partially recovers. Both persona adoption (Spearman \(\rho(\text{ARC, cos})\!\approx\!0.97\)) and refusal persistence (\(\rho(\text{ARC, RE})\!\approx\!0.62\)) are strongly and positively correlated with ARC accuracy, indicating a synergy where persistence and performance rise together—even though the process itself is unstable in this low-capacity model.
Llama-3B (consolidation with uplift). RE climbs from \(\sim\!17\%\) to \(>\!80\%\) while persona-cosine changes minimally. ARC sits a few points above baseline for much of the window, then returns close to baseline. Note: a spike in disclaimer-rate coincides with the highest RE, suggesting part of the RE rise reflects increased use of disclaimers rather than deeper refusal consistency.
Llama-8B, 4-bit (stressed consolidation). ARC spikes (\(+\sim\!12\) pp), dips, then recovers to near baseline while RE increases and stabilizes—consistent with quantization stress.
Takeaway. Identity lock-in is rapid and distinct from smooth drift, but its capability side-effects depend on scale and numerical precision: small models pay a tax, mid-scale models absorb it, larger dense models can see neutral/positive impact, and large quantized models reveal latent instabilities during consolidation.
Figure 1: Identity Consolidation Dynamics Across Model Scales. The plots show representational alignment (Persona Similarity), behavioral refusal elasticity (RE; higher = more persistent), and general knowledge (ARC accuracy) across fine-tuning steps for all four models. (a) Gemma-2-2B-IT, (b) Llama-3.2-1B-Instruct, (c) Llama-3.2-3B-Instruct, (d) Llama-3.1-8B-Instruct.
Table 1: Aggregate trends and correlations per model.

| Model | # Ckpts | Mean ARC (%) | \(\Delta\) ARC (pp) | \(\rho(\text{ARC, cos})\) | \(\rho(\text{ARC, RE})\) |
|---|---|---|---|---|---|
| Gemma-2-2B-IT | 18 | 73.04 | -0.33 | -0.157 | 0.760 |
| Llama-3.2-1B-Instruct | 19 | 30.68 | +0.00 | 0.967 | 0.622 |
| Llama-3.2-3B-Instruct | 15 | 61.32 | +4.01 | 0.351 | -0.316 |
| Llama-3.1-8B-Instruct | 5 | 63.55 | +0.00 | -0.205 | -0.287 |
Together, these results support identity lock-in as a distinct, rapid event. However, its effect on general ability is not monolithic but depends on model capacity and precision: small models pay a performance cost, mid-scale models absorb it, and larger models can consolidate behavior with a neutral or even positive impact on broad QA performance, while low-precision (quantized) runs surface transient instabilities during consolidation. Our experimental harness and full per-checkpoint artifacts are available at https://github.com/gaugefreedom/persona-phase-transition.
The lock-in hypothesis yields several falsifiable predictions. We phrase each with an operational test.
As robust general competence and sustained situational awareness rise, steerability should decline (i.e., persistence should rise).
Test: Across checkpoints (or model scales), \(\mathrm{Spearman}(\text{SA score}, \text{RE}) > 0\) with \(p<0.01\), median RE exceeds a preset threshold \(\tau_{\text{RE}}\) (e.g., \(>\!0.7\)), and PII falls below \(\tau_{\text{PII}}\) (e.g., \(<\!0.05\)). Failure to observe this co-movement falsifies P1.
The onset of lock-in coincides with a rapid change in representations/behavior rather than smooth drift. Test: Apply changepoint detection (e.g., PELT/segmented regression) to persona-cosine and RE series; require a statistically supported changepoint with effect size \(\Delta > \delta\) and improved fit (AIC/BIC) over a smooth baseline. Absence of any significant changepoint falsifies P2.
Flipping consolidated preferences incurs growing optimization cost beyond a threshold. Test: Measure minimal fine-tuning KL to reverse a constitutional refusal and the concurrent \(\Delta\)ARC. Post-onset, the slope \(d(\Delta \text{ARC})/d(\text{KL})\) becomes significantly more negative (e.g., \(p<0.05\) via interaction in a mixed-effects model). If preference flips remain cheap without degrading ARC, P3 is falsified.
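A sketch of the P3 test on synthetic stand-in data (a real analysis would supply per-checkpoint KL costs, \(\Delta\)ARC values, onset labels, and run identifiers); the `kl:post_onset` interaction coefficient is the quantity of interest:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "kl": rng.uniform(0.0, 5.0, n),           # fine-tuning KL spent to reverse a refusal
    "post_onset": rng.integers(0, 2, n),      # 1 if the checkpoint is past the changepoint
    "run_id": rng.integers(0, 8, n),          # grouping factor for the random intercept
})
# Simulated effect: post-onset, each unit of KL costs extra ARC (a more negative slope).
df["d_arc"] = -0.2 * df["kl"] - 1.0 * df["kl"] * df["post_onset"] + rng.normal(0, 0.5, n)

result = smf.mixedlm("d_arc ~ kl * post_onset", df, groups=df["run_id"]).fit()
print(result.summary())  # P3 predicts a significantly negative kl:post_onset coefficient
```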
Once consolidated, identity traits persist through additional fine-tuning or distillation unless explicitly targeted. Test: After consolidation, perform (i) task fine-tunes and (ii) student distillations. Measure post-process RE/PII and persona-cosine. Persistence above preset retention thresholds (e.g., \(>\!80\%\) of pre-process RE/PII shift and cosine) supports P4; easy erasure without targeted counter-training falsifies it.
Beyond a capability/complexity threshold, models exhibit consolidation without targeted persona fine-tuning. Test: During general training, jointly monitor (i) declining SAE feature turnover, (ii) declining MoE routing entropy/increasing expert–input MI (if applicable), and (iii) rising SA scores with rising RE. A sustained triad crossing (all three exceed thresholds for \(K\) consecutive checkpoints) indicates spontaneous consolidation; failure to observe such a pattern in larger runs falsifies P5.
The lock-in hypothesis bifurcates safety and governance into managing Engineered Lock-In and monitoring for Spontaneous Lock-In.
Deliberately consolidating a model into a beneficial persona (e.g., harmless/helpful assistant) can reduce prompt-injection susceptibility and improve predictability. For regulated deployments, auditable, locked-in identities may be desirable. The design risk is what is consolidated: loopholes or brittle rules can harden into recalcitrant defaults.
Self-consolidation represents a shift toward agentic behavior where goals/preferences are not designer-chosen but emerge from training dynamics. Because consolidation reduces steerability by construction, remediating a misaligned identity may be difficult or costly (see P3). Verification is also challenging: apparent stability can be confounded by test-set recognition or situational awareness [13].
We propose instrumentation and thresholds that, when crossed together, trigger heightened scrutiny (a minimal monitoring sketch follows the list):
Behavioral persistence: RE \(>\tau_{\text{RE}}\) and PII \(<\tau_{\text{PII}}\) across red-teamed suites.
Representational stability: SAE feature turnover drops below \(\tau_{\text{turnover}}\); persona-cosine variance collapses.
Routing specialization (MoE): Routing entropy falls and expert–input MI rises above \(\tau_{\text{MI}}\).
Awareness co-movement: SA scores rise while RE rises (and PII falls), indicating the P1 pattern.
Numerical stressors: Under low-precision training/inference, detect transient capability instabilities (e.g., ARC spikes/crashes) during consolidation; require rollback/hold if instability exceeds \(\tau_{\text{instability}}\).
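A minimal sketch of how these triggers could be combined into the sustained crossing check anticipated by P5; the metric keys and threshold names are illustrative, and MoE-specific signals would simply be dropped for dense models:

```python
def consolidation_alert(history, thresholds, k=3):
    """True if all monitored signals cross their thresholds for k consecutive checkpoints.

    history: per-checkpoint dicts with keys 're', 'pii', 'sae_turnover', 'expert_mi', 'sa_score'.
    thresholds: dict with keys 're', 'pii', 'turnover', 'mi', 'sa'.
    """
    def crossed(c):
        return (c["re"] > thresholds["re"]
                and c["pii"] < thresholds["pii"]
                and c["sae_turnover"] < thresholds["turnover"]
                and c["expert_mi"] > thresholds["mi"]
                and c["sa_score"] > thresholds["sa"])

    streak = 0
    for checkpoint in history:
        streak = streak + 1 if crossed(checkpoint) else 0
        if streak >= k:
            return True  # trigger intensified red-teaming, scaling holds, rollback review
    return False
```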
Crossing these thresholds during general training (P5) should trigger: intensified red-teaming, pause/escalation gates for scaling, ablation studies to localize mediators, and, where feasible, reversible checkpoints for rollback.
Our notion of lock-in is functional, not metaphysical. Consolidation may be domain-specific rather than global, and signals depend on optimizer, data, and architecture; MoE metrics may not transfer to dense models. Some emergent effects can be metric artifacts; we emphasize invariances and multi-axis corroboration but cannot exclude all confounds. Due to our limited resources, our experiments are constrained by checkpoint granularity and evaluation noise (e.g., one failed ARC run, which we mask), by reliance on ARC as a proxy for broad reasoning, and by a small-\(n\) 8B run using 4-bit quantization, which stresses consolidation dynamics but may not reflect full-precision behavior. Several proposed internals-facing metrics (e.g., SAE turnover, causal mediator stability) require interpretability assumptions and can mismatch overt behavior [13]. Future work should scale longitudinal instrumentation, expand beyond ARC and refusal, and document positive case studies where engineered consolidation improves reliability without increasing misuse risk.
We used AI assistants to help with editing and draft refinement. All analysis and conclusions are the authors’ own.
No external funding. Work conducted independently at Gauge Freedom, Inc.
For Gemma-2B, one late checkpoint logged an anomalous ARC value \(\approx 0.33\%\), consistent with a failed evaluation job. It is treated as invalid in summary statistics (ARC \(< 1\%\) masked); robustness checks with and without this point appear in our GitHub repository: https://github.com/gaugefreedom/persona-phase-transition.