New articles on Quantitative Biology


[1] 2602.03886

Prenatal Stress Detection from Electrocardiography Using Self-Supervised Deep Learning: Development and External Validation

Prenatal psychological stress affects 15-25% of pregnancies and increases risks of preterm birth, low birth weight, and adverse neurodevelopmental outcomes. Current screening relies on subjective questionnaires (PSS-10), limiting continuous monitoring. We developed deep learning models for stress detection from electrocardiography (ECG) using the FELICITy 1 cohort (151 pregnant women, 32-38 weeks gestation). A ResNet-34 encoder was pretrained via SimCLR contrastive learning on 40,692 ECG segments per subject. Multi-layer feature extraction enabled binary classification and continuous PSS prediction across maternal (mECG), fetal (fECG), and abdominal ECG (aECG). External validation used the FELICITy 2 RCT (28 subjects, different ECG device, yoga intervention vs. control). On FELICITy 1 (5-fold CV): mECG 98.6% accuracy (R2=0.88, MAE=1.90), fECG 99.8% (R2=0.95, MAE=1.19), aECG 95.5% (R2=0.75, MAE=2.80). External validation on FELICITy 2: mECG 77.3% accuracy (R2=0.62, MAE=3.54, AUC=0.826), aECG 63.6% (R2=0.29, AUC=0.705). Signal quality-based channel selection outperformed all-channel averaging (+12% R2 improvement). Mixed-effects models detected a significant intervention response (p=0.041). Self-supervised deep learning on pregnancy ECG enables accurate, objective stress assessment, with multi-layer feature extraction substantially outperforming single embedding approaches.
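
The pretraining step described here follows the general SimCLR recipe (two augmented views of each segment pulled together under an NT-Xent loss). A minimal sketch is given below; the encoder, augmentations, temperature, and batch handling are illustrative assumptions, not the authors' exact configuration.

    # Minimal SimCLR-style contrastive pretraining sketch for 1-D ECG segments.
    # Encoder, augmentations, and hyperparameters are illustrative assumptions.
    import torch
    import torch.nn.functional as F

    def augment(x):
        # Simple stochastic augmentations for 1-D signals: amplitude scaling + jitter.
        scale = torch.empty(x.size(0), 1, 1).uniform_(0.8, 1.2)
        return x * scale + 0.01 * torch.randn_like(x)

    def nt_xent(z1, z2, tau=0.1):
        # z1, z2: (B, D) projections of two augmented views of the same segments.
        z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2B, D)
        sim = z @ z.t() / tau                                     # scaled cosine similarities
        sim.fill_diagonal_(float("-inf"))                         # exclude self-pairs
        B = z1.size(0)
        targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
        return F.cross_entropy(sim, targets)

    # Usage with any encoder f and projection head g mapping (B, 1, T) -> (B, D):
    # loss = nt_xent(g(f(augment(batch))), g(f(augment(batch))))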


[2] 2602.03902

All-Atom GPCR-Ligand Simulation via Residual Isometric Latent Flow

G-protein-coupled receptors (GPCRs), primary targets for over one-third of approved therapeutics, rely on intricate conformational transitions to transduce signals. While Molecular Dynamics (MD) is essential for elucidating this transduction process, particularly within ligand-bound complexes, conventional all-atom MD simulation is computationally prohibitive. In this paper, we introduce GPCRLMD, a deep generative framework for efficient all-atom GPCR-ligand simulation. GPCRLMD employs a Harmonic-Prior Variational Autoencoder (HP-VAE) to first map the complex into a regularized isometric latent space, preserving geometric topology via physics-informed constraints. Within this latent space, a Residual Latent Flow samples evolution trajectories, which are subsequently decoded back to atomic coordinates. By capturing temporal dynamics via relative displacements anchored to the initial structure, this residual mechanism effectively decouples static topology from dynamic fluctuations. Experimental results demonstrate that GPCRLMD achieves state-of-the-art performance in GPCR-ligand dynamics simulation, faithfully reproducing thermodynamic observables and critical ligand-receptor interactions.


[3] 2602.04008

Mathematical simulations of pediatric hemodynamics in isolated ventricular septal defect

Computer modeling of the cardiovascular system has the potential to revolutionize personalized medical care. This is especially promising for congenital heart defects, such as ventricular septal defect (VSD), a hole between the two ventricles of the heart. However, relatively few studies have built computer models of VSD, and fewer still have considered how natural adaptation of the cardiovascular system with age might interact with the presence of a small, medium, or large VSD. Here, we combine a lumped parameter model of the cardiovascular system with two key modeling components: a size-dependent resistance dictating shunt flow between the two ventricles and age-dependent scaling relationships for the systemic and pulmonary circulations. Our results provide insight into changes in hemodynamic conditions with various VSD sizes. We investigate the combined effects of VSD size, vascular parameters, and age, showing distinct differences across these three factors. This study lays the necessary foundation for studying VSD and for building digital shadows and digital twins for managing VSD in pediatrics.
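
As a rough illustration of the shunt component in such a lumped-parameter setting, the snippet below treats the VSD as pressure-driven flow through a size-dependent resistance; the orifice-law scaling, constants, and units are assumptions for illustration, not the paper's calibrated model.

    # Illustrative lumped-parameter shunt: the VSD treated as a size-dependent
    # resistance between the two ventricles, with an assumed orifice-law scaling
    # (resistance inversely proportional to orifice area squared). Values are
    # illustrative only, not the paper's calibrated model.
    import numpy as np

    def vsd_resistance(diameter_mm, k=300.0):
        area = np.pi * (diameter_mm / 2.0) ** 2   # orifice area, mm^2
        return k / area**2                        # assumed orifice-law scaling

    def shunt_flow(p_lv, p_rv, diameter_mm):
        # Positive values = left-to-right shunt (arbitrary flow units).
        return (p_lv - p_rv) / vsd_resistance(diameter_mm)

    # Systolic pressures of 100 mmHg (LV) and 25 mmHg (RV), across defect sizes.
    for d in (2.0, 5.0, 10.0):
        print(f"{d:4.1f} mm VSD -> shunt flow {shunt_flow(100, 25, d):10.1f} (illustrative units)")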


[4] 2602.04058

RareCollab -- An Agentic System Diagnosing Mendelian Disorders with Integrated Phenotypic and Molecular Evidence

Millions of children worldwide are affected by severe rare Mendelian disorders, yet exome and genome sequencing still fail to provide a definitive molecular diagnosis for a large fraction of patients, prolonging the diagnostic odyssey. Bridging this gap increasingly requires transitioning from DNA-only interpretation to multi-modal diagnostic reasoning that combines genomic data, transcriptomic sequencing (RNA-seq), and phenotype information; however, computational frameworks that coherently integrate these signals remain limited. Here we present RareCollab, an agentic diagnostic framework that pairs a stable quantitative Diagnostic Engine with Large Language Model (LLM)-based specialist modules that produce high-resolution, interpretable assessments from transcriptomic signals, phenotypes, variant databases, and the literature to prioritize potential diagnostic variants. In a rigorously curated benchmark of Undiagnosed Diseases Network (UDN) patients with paired genomic and transcriptomic data, RareCollab achieved 77% top-5 diagnostic accuracy and improved top-1 to top-5 accuracy by ~20% over widely used variant-prioritization approaches. RareCollab illustrates how modular artificial intelligence (AI) can operationalize multi-modal evidence for accurate, scalable rare disease diagnosis, offering a promising path toward reducing the diagnostic odyssey for affected families.


[5] 2602.04095

A computational account of dreaming: learning and memory consolidation

A number of studies have concluded that dreaming is driven mostly by randomly arriving internal signals, on the grounds that "dream contents are random impulses", and have argued that dream sleep is unlikely to play an important part in our intellectual capacity. On the contrary, numerous functional studies have revealed that dream sleep does play an important role in learning and other intellectual functions. In particular, recent studies have suggested the importance of dream sleep in memory consolidation, following the findings of neural replay of recent waking patterns in the hippocampus. This randomness has been the hurdle that divides dream theories into functional and functionless camps. This study presents a cognitive and computational model of the dream process. The model is simulated to perform learning and memory consolidation, the two most widely proposed dream functions. The simulations demonstrate that random signals can result in learning and memory consolidation. Dreaming is thus proposed as a continuation of the brain's waking activities, processing signals activated spontaneously and randomly from the hippocampus. The characteristics of the model are discussed and found to agree with many characteristics reported in empirical studies.


[6] 2602.04150

A brief review of evolutionary game dynamics in the reinforcement learning paradigm

Cooperation, fairness, trust, and resource coordination are cornerstones of modern civilization, yet their emergence remains inadequately explained, as reflected in persistent discrepancies between theoretical predictions and behavioral experiments. Part of this gap may arise from the imitation learning paradigm commonly used in prior theoretical models, which assumes individuals merely copy successful neighbors according to predetermined, fixed rules. This review examines recent advances in evolutionary game dynamics that employ reinforcement learning (RL) as an alternative paradigm. In RL, individuals learn through trial and error and introspectively refine their strategies based on environmental feedback. We begin by introducing key concepts in evolutionary game theory and the two learning paradigms, then synthesize progress in applying RL to elucidate cooperation, trust, fairness, optimal resource coordination, and ecological dynamics. Collectively, these studies indicate that RL offers a promising unified framework for understanding the diverse social and ecological phenomena observed in human and natural systems.
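
For readers unfamiliar with how the RL paradigm replaces imitation updates, a minimal sketch of two tabular Q-learning agents playing a repeated prisoner's dilemma is shown below; the payoff values, learning rate, and epsilon-greedy exploration are generic textbook choices, not taken from any particular study covered by the review.

    # Two independent tabular Q-learners in a repeated prisoner's dilemma
    # (action 0 = cooperate, 1 = defect); the state is the opponent's last action.
    # Payoffs and hyperparameters are generic textbook choices, for illustration only.
    import numpy as np

    R, S, T, P = 3.0, 0.0, 5.0, 1.0               # reward, sucker, temptation, punishment
    payoff = np.array([[R, S], [T, P]])           # payoff[my_action, opponent_action]
    rng = np.random.default_rng(0)
    Q = [np.zeros((2, 2)), np.zeros((2, 2))]      # Q[agent][state, action]
    alpha, gamma, eps = 0.1, 0.9, 0.05
    state = [0, 0]                                # both start remembering "opponent cooperated"

    for _ in range(50_000):
        acts = [int(rng.integers(2)) if rng.random() < eps
                else int(Q[i][state[i]].argmax()) for i in range(2)]
        rewards = [payoff[acts[0], acts[1]], payoff[acts[1], acts[0]]]
        next_state = [acts[1], acts[0]]           # each agent observes the opponent's move
        for i in range(2):
            td = rewards[i] + gamma * Q[i][next_state[i]].max() - Q[i][state[i], acts[i]]
            Q[i][state[i], acts[i]] += alpha * td
        state = next_state

    print("Agent 0 greedy action by opponent's last move:", Q[0].argmax(axis=1))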


[7] 2602.04492

Discovering Mechanistic Models of Neural Activity: System Identification in an in Silico Zebrafish

Constructing mechanistic models of neural circuits is a fundamental goal of neuroscience, yet verifying such models is limited by the lack of ground truth. To rigorously test model discovery, we establish an in silico testbed using neuromechanical simulations of a larval zebrafish as a transparent ground truth. We find that LLM-based tree search autonomously discovers predictive models that significantly outperform established forecasting baselines. Conditioning on sensory drive is necessary but not sufficient for faithful system identification, as models exploit statistical shortcuts. Structural priors prove essential for enabling robust out-of-distribution generalization and recovery of interpretable mechanistic models. Our insights provide guidance for modeling real-world neural recordings and offer a broader template for AI-driven scientific discovery.


[8] 2602.04512

BrainVista: Modeling Naturalistic Brain Dynamics as Multimodal Next-Token Prediction

Naturalistic fMRI characterizes the brain as a dynamic predictive engine driven by continuous sensory streams. However, modeling the causal forward evolution in realistic neural simulation is impeded by the timescale mismatch between multimodal inputs and the complex topology of cortical networks. To address these challenges, we introduce BrainVista, a multimodal autoregressive framework designed to model the causal evolution of brain states. BrainVista incorporates Network-wise Tokenizers to disentangle system-specific dynamics and a Spatial Mixer Head that captures inter-network information flow without compromising functional boundaries. Furthermore, we propose a novel Stimulus-to-Brain (S2B) masking mechanism to synchronize high-frequency sensory stimuli with hemodynamically filtered signals, enabling strict, history-only causal conditioning. We validate our framework on Algonauts 2025, CineBrain, and HAD, achieving state-of-the-art fMRI encoding performance. In long-horizon rollout settings, our model yields substantial improvements over baselines, increasing pattern correlation by 36.0\% and 33.3\% relative to the strongest baseline on Algonauts 2025 and CineBrain, respectively.


[9] 2602.04762

Uncertainty in Island-based Ecosystem Services and Climate Change

Small and medium-sized islands are acutely exposed to climate change and ecosystem degradation, yet the extent to which uncertainty is systematically addressed in scientific assessments of their ecosystem services remains poorly understood. This study revisits 226 peer-reviewed articles drawn from two global systematic reviews on island ecosystem services and climate change, applying a structured post hoc analysis to evaluate how uncertainty is treated across methods, service categories, ecosystem realms, and decision contexts. Studies were classified according to whether uncertainty was explicitly analysed, just mentioned, or ignored. Only 30 percent of studies incorporated uncertainty explicitly, while more than half did not address it at all. Scenario-based approaches dominated uncertainty assessment, whereas probabilistic and ensemble-based frameworks remained limited. Cultural ecosystem services and extreme climate impacts exhibited the lowest levels of uncertainty integration, and few studies connected uncertainty treatment to policy relevant decision frameworks. Weak or absent treatment of uncertainty emerges as a structural challenge in island systems, where narrow ecological thresholds, strong land-sea coupling, limited spatial buffers, and reduced institutional redundancy amplify the consequences of decision-making under incomplete knowledge. Systematic mapping of how uncertainty is framed, operationalised, or neglected reveals persistent methodological and conceptual gaps and informs concrete directions for strengthening uncertainty integration in future island-focused ecosystem service and climate assessments. Embedding uncertainty more robustly into modelling practices, participatory processes, and policy tools is essential for enhancing scientific credibility, governance relevance, and adaptive capacity in insular socio-ecological systems.


[10] 2602.03875

Reversible Deep Learning for 13C NMR in Chemoinformatics: On Structures and Spectra

We introduce a reversible deep learning model for 13C NMR that uses a single conditional invertible neural network for both directions between molecular structures and spectra. The network is built from i-RevNet style bijective blocks, so the forward map and its inverse are available by construction. We train the model to predict a 128-bit binned spectrum code from a graph-based structure encoding, while the remaining latent dimensions capture residual variability. At inference time, we invert the same trained network to generate structure candidates from a spectrum code, which explicitly represents the one-to-many nature of spectrum-to-structure inference. On a filtered subset, the model is numerically invertible on trained examples, achieves spectrum-code prediction above chance, and produces coarse but meaningful structural signals when inverted on validation spectra. These results demonstrate that invertible architectures can unify spectrum prediction and uncertainty-aware candidate generation within one end-to-end model.
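
The bijective blocks mentioned are in the spirit of i-RevNet / additive coupling layers, which are invertible by construction; a minimal sketch of such a block is shown below, with the residual network and layer sizes as illustrative assumptions rather than the paper's architecture.

    # Additive coupling block in the i-RevNet style: invertible by construction.
    # The residual network F and the split sizes are illustrative assumptions.
    import torch
    import torch.nn as nn

    class CouplingBlock(nn.Module):
        def __init__(self, dim):
            super().__init__()
            half = dim // 2
            self.F = nn.Sequential(nn.Linear(half, half), nn.ReLU(), nn.Linear(half, half))

        def forward(self, x):
            x1, x2 = x.chunk(2, dim=-1)
            y1 = x2
            y2 = x1 + self.F(x2)                  # residual update of one half, keyed on the other
            return torch.cat([y1, y2], dim=-1)

        def inverse(self, y):
            y1, y2 = y.chunk(2, dim=-1)
            x2 = y1
            x1 = y2 - self.F(y1)                  # exact inverse: subtract the same residual
            return torch.cat([x1, x2], dim=-1)

    block = CouplingBlock(128)
    x = torch.randn(4, 128)
    assert torch.allclose(block.inverse(block(x)), x, atol=1e-6)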


[11] 2602.03896

A Hitchhiker's Guide to Poisson Gradient Estimation

Poisson-distributed latent variable models are widely used in computational neuroscience, but differentiating through discrete stochastic samples remains challenging. Two approaches address this: Exponential Arrival Time (EAT) simulation and Gumbel-SoftMax (GSM) relaxation. We provide the first systematic comparison of these methods, along with practical guidance for practitioners. Our main technical contribution is a modification to the EAT method that theoretically guarantees an unbiased first moment (exactly matching the firing rate), and reduces second-moment bias. We evaluate these methods on their distributional fidelity, gradient quality, and performance on two tasks: (1) variational autoencoders with Poisson latents, and (2) partially observable generalized linear models, where latent neural connectivity must be inferred from observed spike trains. Across all metrics, our modified EAT method exhibits better overall performance (often comparable to exact gradients), and substantially higher robustness to hyperparameter choices. Together, our results clarify the trade-offs between these methods and offer concrete recommendations for practitioners working with Poisson latent variable models.
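
As background on the EAT construction compared here, the sketch below draws a Poisson count by accumulating exponential inter-arrival times within a unit window; this is the plain non-differentiable version, and the paper's contribution concerns how such simulation is relaxed and debiased for gradient estimation.

    # Exponential-arrival-time (EAT) view of a Poisson sample: count how many
    # Exp(rate) inter-arrival times fit inside a unit window. This is the plain
    # non-differentiable construction; differentiable relaxations smooth the
    # counting step (illustrative sketch only).
    import numpy as np

    def poisson_via_arrival_times(rate, rng):
        t, count = 0.0, 0
        while True:
            t += rng.exponential(1.0 / rate)   # next inter-arrival time
            if t > 1.0:
                return count
            count += 1

    rng = np.random.default_rng(0)
    samples = [poisson_via_arrival_times(5.0, rng) for _ in range(100_000)]
    print(np.mean(samples), np.var(samples))   # both should be close to the rate (5.0)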


[12] 2602.03998

AtlasPatch: An Efficient and Scalable Tool for Whole Slide Image Preprocessing in Computational Pathology

Whole-slide image (WSI) preprocessing, typically comprising tissue detection followed by patch extraction, is foundational to AI-driven computational pathology workflows. This remains a major computational bottleneck as existing tools either rely on inaccurate heuristic thresholding for tissue detection, or adopt AI-based approaches trained on limited-diversity data that operate at the patch level, incurring substantial computational complexity. We present AtlasPatch, an efficient and scalable slide preprocessing framework for accurate tissue detection and high-throughput patch extraction with minimal computational overhead. AtlasPatch's tissue detection module is trained on a heterogeneous and semi-manually annotated dataset of ~30,000 WSI thumbnails, using efficient fine-tuning of the Segment-Anything model. The tool extrapolates tissue masks from thumbnails to full-resolution slides to extract patch coordinates at user-specified magnifications, with options to stream patches directly into common image encoders for embedding or store patch images, all efficiently parallelized across CPUs and GPUs. We assess AtlasPatch across segmentation precision, computational complexity, and downstream multiple-instance learning, matching state-of-the-art performance while operating at a fraction of their computational cost. AtlasPatch is open-source and available at this https URL.
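
The thumbnail-to-full-resolution extrapolation described above is, at its core, a bookkeeping step from a low-resolution tissue mask to patch coordinates; a generic sketch is given below (patch size, tissue-fraction threshold, and array layout are illustrative assumptions, not AtlasPatch's API).

    # Generic sketch: turn a low-resolution tissue mask into full-resolution patch
    # coordinates by keeping patches whose tissue fraction exceeds a threshold.
    # Thresholds, patch size, and layout are illustrative; this is not AtlasPatch's API.
    import numpy as np

    def patch_coordinates(mask, slide_shape, patch_size=256, min_tissue=0.5):
        # mask: (h, w) boolean thumbnail mask; slide_shape: (H, W) full-resolution size.
        H, W = slide_shape
        h, w = mask.shape
        sy, sx = H / h, W / w                        # thumbnail -> slide scale factors
        coords = []
        for y in range(0, H - patch_size + 1, patch_size):
            for x in range(0, W - patch_size + 1, patch_size):
                # Corresponding thumbnail window for this full-resolution patch.
                my, mx = int(y / sy), int(x / sx)
                mh, mw = max(1, int(patch_size / sy)), max(1, int(patch_size / sx))
                if mask[my:my + mh, mx:mx + mw].mean() >= min_tissue:
                    coords.append((x, y))
        return coords

    mask = np.zeros((100, 100), dtype=bool)
    mask[20:60, 30:80] = True                        # a toy tissue region
    print(len(patch_coordinates(mask, (40_000, 40_000))))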


[13] 2602.04021

Group Contrastive Learning for Weakly Paired Multimodal Data

We present GROOVE, a semi-supervised multi-modal representation learning approach for high-content perturbation data where samples across modalities are weakly paired through shared perturbation labels but lack direct correspondence. Our primary contribution is GroupCLIP, a novel group-level contrastive loss that bridges the gap between CLIP for paired cross-modal data and SupCon for uni-modal supervised contrastive learning, addressing a fundamental gap in contrastive learning for weakly-paired settings. We integrate GroupCLIP with an on-the-fly backtranslating autoencoder framework to encourage cross-modally entangled representations while maintaining group-level coherence within a shared latent space. Critically, we introduce a comprehensive combinatorial evaluation framework that systematically assesses representation learners across multiple optimal transport aligners, addressing key limitations in existing evaluation strategies. This framework includes novel simulations that systematically vary shared versus modality-specific perturbation effects enabling principled assessment of method robustness. Our combinatorial benchmarking reveals that there is not yet an aligner that uniformly dominates across settings or modality pairs. Across simulations and two real single-cell genetic perturbation datasets, GROOVE performs on par with or outperforms existing approaches for downstream cross-modal matching and imputation tasks. Our ablation studies demonstrate that GroupCLIP is the key component driving performance gains. These results highlight the importance of leveraging group-level constraints for effective multi-modal representation learning in scenarios where only weak pairing is available.
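
One plausible way to write a group-level cross-modal contrastive objective, where positives are defined by a shared perturbation label rather than sample-level pairing, is sketched below; this is an illustrative formulation in the spirit of the described GroupCLIP loss, not the authors' exact definition.

    # Group-level cross-modal contrastive loss sketch: samples from two modalities
    # are positives iff they share a perturbation label. Illustrative only; not the
    # exact GroupCLIP definition from the paper.
    import torch
    import torch.nn.functional as F

    def group_contrastive(z_a, z_b, labels_a, labels_b, tau=0.1):
        # z_a: (Na, D), z_b: (Nb, D); labels_*: integer perturbation labels per sample.
        z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
        logits = z_a @ z_b.t() / tau                               # (Na, Nb) cross-modal similarities
        pos = (labels_a[:, None] == labels_b[None, :]).float()     # group-level positives
        log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
        n_pos = pos.sum(dim=1)
        per_anchor = (pos * log_prob).sum(dim=1) / n_pos.clamp(min=1)
        return -(per_anchor[n_pos > 0]).mean()                     # SupCon-style average over positives

    z_a, z_b = torch.randn(64, 32), torch.randn(48, 32)
    y_a, y_b = torch.randint(0, 8, (64,)), torch.randint(0, 8, (48,))
    print(group_contrastive(z_a, z_b, y_a, y_b))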


[14] 2602.04119

Synthesizable Molecular Generation via Soft-constrained GFlowNets with Rich Chemical Priors

The application of generative models for experimental drug discovery campaigns is severely limited by the difficulty of designing molecules de novo that can be synthesized in practice. Previous works have leveraged Generative Flow Networks (GFlowNets) to impose hard synthesizability constraints through the design of state and action spaces based on predefined reaction templates and building blocks. Despite the promising prospects of this approach, it currently lacks flexibility and scalability. As an alternative, we propose S3-GFN, which generates synthesizable SMILES molecules via simple soft regularization of a sequence-based GFlowNet. Our approach leverages rich molecular priors learned from large-scale SMILES corpora to steer molecular generation towards high-reward, synthesizable chemical spaces. The model induces constraints through off-policy replay training with a contrastive learning signal based on separate buffers of synthesizable and unsynthesizable samples. Our experiments show that S3-GFN learns to generate synthesizable molecules ($\geq 95\%$) with higher rewards in diverse tasks.


[15] 2602.04270

Multi-Integration of Labels across Categories for Component Identification (MILCCI)

Many fields collect large-scale temporal data through repeated measurements (trials), where each trial is labeled with a set of metadata variables spanning several categories. For example, a trial in a neuroscience study may be linked to a value from category (a): task difficulty, and category (b): animal choice. A critical challenge in time-series analysis is to understand how these labels are encoded within the multi-trial observations, and disentangle the distinct effect of each label entry across categories. Here, we present MILCCI, a novel data-driven method that i) identifies the interpretable components underlying the data, ii) captures cross-trial variability, and iii) integrates label information to understand each category's representation within the data. MILCCI extends a sparse per-trial decomposition that leverages label similarities within each category to enable subtle, label-driven cross-trial adjustments in component compositions and to distinguish the contribution of each category. MILCCI also learns each component's corresponding temporal trace, which evolves over time within each trial and varies flexibly across trials. We demonstrate MILCCI's performance through both synthetic and real-world examples, including voting patterns, online page view trends, and neuronal recordings.


[16] 2602.04437

How seed banks evolve in plants: a stochastic dynamical system subject to a strong drift

We study how changes in population size and fluctuating environmental conditions influence the establishment of seed banks in plants. Our model is a modification of the Wright-Fisher model with seed bank, introduced by Kaj, Krone and Lascoux. We distinguish between wild type individuals, producing only nondormant seeds, and mutants, producing seeds with dormancy. To understand how changing population size shapes the establishment of seed banks, we analyse the process under a diffusive scaling. The results support the biological insight that seed banks are favoured in a declining population, and disfavoured if population size is constant or increasing. The surprise is that this is true even when population sizes are changing very slowly -- over evolutionary timescales. We also investigate the influence of short-term fluctuations, such as annual variations in rainfall or temperature. Mathematically, our analysis reduces to a stochastic dynamical system forced onto a manifold by a large drift, which converges under scaling to a diffusion on the manifold. Inspired by the Lyapunov--Schmidt reduction, we derive an explicit formula for the limiting diffusion coefficients by projecting the system onto its linear counterpart. This provides a general framework for deriving diffusion approximations in models with strong drift and nonlinear constraints.
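
For orientation, a minimal neutral version of the Kaj, Krone and Lascoux Wright-Fisher model with seed bank is sketched below: each new plant descends from one of the previous m generations according to germination weights. The dormancy-producing mutants and changing population size studied in the paper are not included, and all parameters are illustrative.

    # Neutral Wright-Fisher model with seed bank (Kaj-Krone-Lascoux style):
    # each new individual picks its parent from one of the previous m generations
    # according to germination weights q. Illustrative sketch only; the paper's
    # model adds dormancy-producing mutants and changing population size.
    import numpy as np

    rng = np.random.default_rng(0)
    N, m, T = 500, 5, 5_000                 # population size, max seed age, generations
    q = np.ones(m) / m                      # germination weights for seed ages 1..m
    freq = [0.5] * m                        # allele frequency in the last m generations

    trajectory = []
    for t in range(T):
        p = float(np.dot(q, freq))          # probability a new plant carries the allele
        x = rng.binomial(N, p) / N          # Wright-Fisher resampling
        freq = [x] + freq[:-1]              # shift the seed bank window
        trajectory.append(x)

    print("final frequency:", trajectory[-1])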


[17] 2602.04481

The impact of heterogeneity on the co-evolution of cooperation and epidemic spreading in complex networks

The dynamics of herd immunity depend crucially on the interaction between collective social behavior and disease transmission, but the role of heterogeneity in this context frequently remains unclear. Here, we dissect this co-evolutionary feedback by coupling a public goods game with an epidemic model on complex networks, including multiplex and real-world networks. Our results reveal a dichotomy in how heterogeneity shapes outcomes. We demonstrate that structural heterogeneity in social networks acts as a powerful catalyst for cooperation and disease suppression. This emergent effect is driven by highly connected hubs who, facing amplified personal risk, adopt protective strategies out of self-interest. In contrast, heterogeneity in individual infection costs proves detrimental, undermining cooperation and amplifying the epidemic. This creates a ``weakest link'' problem, where individuals with low perceived risk act as persistent free-riders and disease reservoirs, degrading the collective response. Our findings establish that heterogeneity is a double-edged sword: its impact is determined by whether it creates an asymmetry of influence (leverage points) or an asymmetry of motivation (weakest links). This recommends disease intervention policies that facilitate the cooperative transition in hubs (strengthening the leverage points) and homogenize incentives among the weakest links.
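
A generic sketch of this kind of coupled model, a protection game followed by an epidemic season on a scale-free network with Fermi-rule imitation between seasons, is given below; the payoffs, transmission parameters, and network are illustrative assumptions rather than the paper's exact specification.

    # Generic sketch of a coupled behavior-epidemic model: a protection game followed
    # by an outbreak season on a scale-free network, with Fermi-rule imitation between
    # seasons. Payoffs, parameters, and the network are illustrative assumptions only.
    import numpy as np
    import networkx as nx

    rng = np.random.default_rng(0)
    G = nx.barabasi_albert_graph(1_000, 3, seed=0)
    N = G.number_of_nodes()
    coop = rng.random(N) < 0.5                    # True = pays protection cost, immune this season
    c, L, beta = 0.4, 1.0, 0.12                   # protection cost, infection cost, per-contact risk

    for season in range(30):
        infected = np.zeros(N, dtype=bool)
        seeds = rng.choice(N, size=5, replace=False)
        infected[seeds] = ~coop[seeds]            # only unprotected seeds start infected
        frontier = [s for s in seeds if infected[s]]
        while frontier:                           # percolation-style outbreak over the network
            nxt = []
            for u in frontier:
                for v in G.neighbors(u):
                    if not coop[v] and not infected[v] and rng.random() < beta:
                        infected[v] = True
                        nxt.append(v)
            frontier = nxt
        payoff = np.where(coop, -c, np.where(infected, -L, 0.0))
        new_coop = coop.copy()
        for u in range(N):                        # Fermi-rule imitation of a random neighbor
            v = rng.choice(list(G.neighbors(u)))
            if rng.random() < 1.0 / (1.0 + np.exp(-(payoff[v] - payoff[u]) / 0.1)):
                new_coop[u] = coop[v]
        coop = new_coop
        print(season, round(coop.mean(), 3), round(infected.mean(), 3))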


[18] 2602.04883

Protein Autoregressive Modeling via Multiscale Structure Generation

We present protein autoregressive modeling (PAR), the first multi-scale autoregressive framework for protein backbone generation via coarse-to-fine next-scale prediction. Exploiting the hierarchical nature of proteins, PAR generates structures in a way that mimics sculpting a statue, forming a coarse topology and refining structural details over successive scales. To achieve this, PAR consists of three key components: (i) multi-scale downsampling operations that represent protein structures across multiple scales during training; (ii) an autoregressive transformer that encodes multi-scale information and produces conditional embeddings to guide structure generation; and (iii) a flow-based backbone decoder that generates backbone atoms conditioned on these embeddings. Moreover, autoregressive models suffer from exposure bias, caused by the mismatch between training and generation procedures, which substantially degrades structure generation quality. We effectively alleviate this issue by adopting noisy context learning and scheduled sampling, enabling robust backbone generation. Notably, PAR exhibits strong zero-shot generalization, supporting flexible human-prompted conditional generation and motif scaffolding without requiring fine-tuning. On the unconditional generation benchmark, PAR effectively learns protein distributions, produces backbones of high design quality, and exhibits favorable scaling behavior. Together, these properties establish PAR as a promising framework for protein structure generation.


[19] 2210.09470

Biomass transfer on autocatalytic reaction network: a delay differential equation formulation

For a biological system to grow and expand, mass must be transferred from the environment to the system and assimilated into its reaction network. Here, I characterize the biomass transfer process for growing autocatalytic systems. By tracking biomass along reaction pathways, an n-dimensional ordinary differential equation (ODE) description of the reaction network can be reformulated into a one-dimensional delay differential equation (DDE) for its long-term dynamics. The kernel function of the DDE summarizes the overall amplification and transfer delay of the system and serves as a signature of its autocatalytic dynamics. The DDE formulation allows reaction networks of various topologies and complexities to be compared and provides a rigorous scheme for estimating the growth rate upon dimensional reduction of the reaction network.
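
Schematically, the reduction described replaces the n-dimensional ODE with a scalar delay equation whose kernel $K$ encodes the amplification and transfer delay along reaction pathways; a generic form (illustrative notation, not copied from the paper) is

    \[
      \dot{x}(t) \;=\; \int_{0}^{\infty} K(\tau)\, x(t-\tau)\, \mathrm{d}\tau ,
    \]

and substituting an exponential ansatz $x(t) = e^{\lambda t}$ gives the characteristic equation $\lambda = \int_{0}^{\infty} K(\tau)\, e^{-\lambda \tau}\, \mathrm{d}\tau$ for the long-term growth rate.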


[20] 2412.12112

Generative Modeling of Neural Dynamics via Latent Stochastic Differential Equations

We propose a probabilistic framework for developing computational models of biological neural systems. In this framework, physiological recordings are viewed as discrete-time partial observations of an underlying continuous-time stochastic dynamical system which implements computations through its state evolution. To model this dynamical system, we employ a system of coupled stochastic differential equations with differentiable drift and diffusion functions and use variational inference to infer its states and parameters. This formulation enables seamless integration of existing mathematical models in the literature, neural networks, or a hybrid of both to learn and compare different models. We demonstrate this in our framework by developing a generative model that combines coupled oscillators with neural networks to capture latent population dynamics from single-cell recordings. Evaluation across three neuroscience datasets spanning different species, brain regions, and behavioral tasks shows that these hybrid models achieve competitive performance in predicting stimulus-evoked neural and behavioral responses compared to sophisticated black-box approaches, while requiring an order of magnitude fewer parameters, providing uncertainty estimates, and offering a natural language for interpretation.
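
To make the modeling setup concrete, the snippet below simulates a small latent SDE (a noisy oscillator) with Euler-Maruyama and reads it out at discrete observation times; the drift, diffusion, and readout are illustrative stand-ins, not the models fit in the paper.

    # Euler-Maruyama simulation of a toy latent SDE (noisy oscillator) with
    # discrete-time partial observations; drift, diffusion, and readout are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    dt, T = 1e-3, 2.0
    omega, sigma = 2 * np.pi * 1.5, 0.3             # angular frequency (1.5 Hz) and noise scale

    def drift(z):
        x, y = z
        return np.array([omega * y, -omega * x])    # harmonic rotation in latent space

    z = np.array([1.0, 0.0])
    latents = []
    for _ in range(int(T / dt)):
        z = z + drift(z) * dt + sigma * np.sqrt(dt) * rng.standard_normal(2)
        latents.append(z.copy())
    latents = np.array(latents)

    # Partial observations every 10 ms through a fixed linear readout plus noise.
    C = rng.standard_normal((5, 2))                 # 5 observed channels
    obs = latents[::10] @ C.T + 0.1 * rng.standard_normal((len(latents[::10]), 5))
    print(obs.shape)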


[21] 2505.17914

Flexible MOF Generation with Torsion-Aware Flow Matching

Designing metal-organic frameworks (MOFs) with novel chemistries is a longstanding challenge due to their large combinatorial space and complex 3D arrangements of the building blocks. While recent deep generative models have enabled scalable MOF generation, they assume (1) a fixed set of building blocks and (2) known local 3D coordinates of building blocks. However, this limits their ability to (1) design novel MOFs and (2) generate the structure using novel building blocks. We propose a two-stage MOF generation framework that overcomes these limitations by modeling both chemical and geometric degrees of freedom. First, we train an SMILES-based autoregressive model to generate metal and organic building blocks, paired with a cheminformatics toolkit for 3D structure initialization. Second, we introduce a flow matching model that predicts translations, rotations, and torsional angles to assemble the blocks into valid 3D frameworks. Our experiments demonstrate improved reconstruction accuracy, the generation of valid, novel, and unique MOFs, and the ability to create novel building blocks. Our code is available at this https URL.


[22] 2506.00597

Processing-in-memory for genomics workloads

Low-cost, high-throughput DNA and RNA sequencing (HTS) data is the backbone of the life sciences. Genome sequencing is now becoming a part of Predictive, Preventive, Personalized, and Participatory (termed 'P4') medicine. All genomic data are currently processed in energy-hungry computer clusters and centers, necessitating data transfer, consuming substantial energy, and wasting valuable time. Therefore, there is a need for fast, energy-efficient, and cost-efficient technologies that enable genomics research without requiring data centers and cloud platforms. We recently launched the BioPIM Project to leverage emerging processing-in-memory (PIM) technologies to enable energy- and cost-efficient analysis of bioinformatics workloads. The BioPIM Project focuses on co-designing algorithms and data structures commonly used in genomics with several PIM architectures to achieve the highest cost, energy, and time savings.


[23] 2512.03312

Unlocking hidden biomolecular conformational landscapes in diffusion models at inference time

The function of biomolecules such as proteins depends on their ability to interconvert between a wide range of structures or "conformations." Researchers have endeavored for decades to develop computational methods to predict the distribution of conformations, which is far harder to determine experimentally than a static folded structure. We present ConforMix, an inference-time algorithm that enhances sampling of conformational distributions using a combination of classifier guidance, filtering, and free energy estimation. Our approach upgrades diffusion models -- whether trained for static structure prediction or conformational generation -- to enable more efficient discovery of conformational variability without requiring prior knowledge of major degrees of freedom. ConforMix is orthogonal to improvements in model pretraining and would benefit even a hypothetical model that perfectly reproduced the Boltzmann distribution. Remarkably, when applied to a diffusion model trained for static structure prediction, ConforMix captures structural changes including domain motion, cryptic pocket flexibility, and transporter cycling, while avoiding unphysical states. Case studies of biologically critical proteins demonstrate the scalability, accuracy, and utility of this method.


[24] 2601.15313

Attention Is Not Retention: The Orthogonality Constraint in Infinite-Context Architectures

Biological memory solves a problem that eludes current AI: storing specific episodic facts without corrupting general semantic knowledge. Complementary Learning Systems theory explains this through two subsystems - a fast hippocampal system using sparse, pattern-separated representations for episodes, and a slow neocortical system using distributed representations for statistical regularities. Current AI systems lack this separation, attempting to serve both functions through neural weights alone. We identify the Orthogonality Constraint: reliable memory requires orthogonal keys, but semantic embeddings cannot be orthogonal because training clusters similar concepts together. The result is Semantic Interference (connecting to what cognitive psychologists have long observed in human memory), where neural systems writing facts into shared continuous parameters collapse to near-random accuracy within tens of semantically related facts. Through semantic density (rho), the mean pairwise cosine similarity, we show collapse occurs at N=5 facts (rho > 0.6) or N ~ 20-75 (moderate rho). We validate across modalities: 16,309 Wikipedia facts, scientific measurements (rho = 0.96, 0.02% accuracy at N=10,000), and image embeddings (rho = 0.82, 0.05% at N=2,000). This failure is geometric - no increase in model capacity can overcome interference when keys share semantic overlap. We propose Knowledge Objects (KOs): structured facts with hash-based identity, controlled vocabularies, and explicit version chains. On Wikipedia facts, KO retrieval achieves 45.7% where Modern Hopfield Networks collapse to near-zero; hash-based retrieval maintains 100%. Production systems (Claude Memory, ChatGPT Memory) store unstructured text, causing schema drift (40-70% consistency) and version ambiguity. Knowledge Objects provide the discrete hippocampal component that enables reliable bicameral memory.
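
The semantic-density statistic rho used throughout is the mean pairwise cosine similarity of the stored keys; a minimal computation is shown below, contrasting near-orthogonal random keys with tightly clustered ones (the embeddings are synthetic and purely illustrative).

    # Semantic density rho = mean pairwise cosine similarity of stored keys.
    # Synthetic embeddings for illustration: random keys are near-orthogonal,
    # clustered keys (semantically similar concepts) have high rho.
    import numpy as np

    def semantic_density(E):
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
        S = E @ E.T
        n = len(E)
        return (S.sum() - n) / (n * (n - 1))      # mean of the off-diagonal cosines

    rng = np.random.default_rng(0)
    random_keys = rng.standard_normal((200, 384))
    clustered_keys = rng.standard_normal((1, 384)) + 0.15 * rng.standard_normal((200, 384))

    print("rho (random)   :", round(semantic_density(random_keys), 3))
    print("rho (clustered):", round(semantic_density(clustered_keys), 3))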


[25] 2601.19002

Intrinsic Limits of Read Trimming in Single-Stranded Bisulfite Sequencing

Single-stranded whole-genome bisulfite sequencing (ssWGBS) enables DNA methylation profiling in low-input and highly fragmented samples, including cell-free DNA, but introduces stochastic enzymatic artifacts that complicate preprocessing and downstream interpretation. In post-bisulfite library construction, Adaptase-mediated tailing blurs the boundary between biological sequence and synthetic additions, rendering read trimming a persistent source of variability across analytical pipelines. We show that this variability reflects an intrinsic limit of per-read boundary inference rather than an algorithmic shortcoming: boundary localization is fundamentally asymmetric between paired-end reads, with Read 2 exhibiting kinetically structured artifacts that support constrained read-level inference, while apparent contamination in Read 1 arises conditionally from geometry-driven read-through events and is not well-defined at the single-read level. Even within Read 2, bisulfite-induced compositional degeneracy creates an indistinguishable regime in which genomic and synthetic origins share support under the same observable sequence evidence, implying a strictly positive Bayes error under any deterministic per-read decision rule and placing a fundamental limit on per-read boundary fidelity. By explicitly characterizing these limits, we reframe read trimming in ssWGBS as a constrained inference problem and introduce a conservative framework that operates only where supported by observable evidence (including short-range nucleotide texture), exposes interpretable trade-offs between genomic retention and residual artifact risk, and avoids forced resolution where boundaries are intrinsically unresolvable. Together, these results clarify why fixed trimming heuristics persist in practice and provide a principled foundation for uncertainty-aware preprocessing in ssWGBS.


[26] 2602.00157

ProDCARL: Reinforcement Learning-Aligned Diffusion Models for De Novo Antimicrobial Peptide Design

Antimicrobial resistance threatens healthcare sustainability and motivates low-cost computational discovery of antimicrobial peptides (AMPs). De novo peptide generation must optimize antimicrobial activity and safety through low predicted toxicity, but likelihood-trained generators do not enforce these goals explicitly. We introduce ProDCARL, a reinforcement-learning alignment framework that couples a diffusion-based protein generator (EvoDiff OA-DM 38M) with sequence property predictors for AMP activity and peptide toxicity. We fine-tune the diffusion prior on AMP sequences to obtain a domain-aware generator. Top-k policy-gradient updates use classifier-derived rewards plus entropy regularization and early stopping to preserve diversity and reduce reward hacking. In silico experiments show ProDCARL increases the mean predicted AMP score from 0.081 after fine-tuning to 0.178. The joint high-quality hit rate reaches 6.3\% with pAMP $>$0.7 and pTox $<$0.3. ProDCARL maintains high diversity, with $1-$mean pairwise identity equal to 0.929. Qualitative analyses with AlphaFold3 and ProtBERT embeddings suggest candidates show plausible AMP-like structural and semantic characteristics. ProDCARL serves as a candidate generator that narrows experimental search space, and experimental validation remains future work.


[27] 2602.02916

Mathematical Modeling of Lesion Pattern Formation in Dendritic Keratitis

Dendritic keratitis is a form of eye infection caused by herpes simplex virus (HSV). The virus spreads via direct cell-to-cell infection among corneal epithelial cells. This leads to the formation of dendritic lesions characterized by terminal bulbs at their tips. Under immunosuppression, the condition may progress to geographic keratitis, which is a map-shaped lesion with dendritic tails. The mechanism of this pattern formation remains to be elucidated. In this study, we propose a mathematical model to elucidate the mechanisms of lesion pattern formation in dendritic keratitis. Our model shows that increased production of infection-suppressive cytokines induces dendritic patterns with terminal bulbs, whereas reduced cytokine levels lead to geographic patterns. Furthermore, altering the spatial distribution of cytokine production can reproduce dendritic tails. By including external cytokine secretion, we could reproduce tapered lesions observed in non-HSV keratitis. By clarifying the mechanisms behind terminal bulb formation and reproducing atypical lesion morphologies, our findings enhance the understanding of herpetic keratitis and highlight the utility of mathematical modeling in ophthalmology.


[28] 2508.21749

When Many Trees Go to War: On Sets of Phylogenetic Trees With Almost No Common Structure

It is known that any two trees on the same $n$ leaves can be displayed by a network with $n-2$ reticulations, and there are two trees that cannot be displayed by a network with fewer reticulations. But how many reticulations are needed to display multiple trees? For any set of $t$ trees on $n$ leaves, there is a trivial network with $(t - 1)n$ reticulations that displays them. To do better, we have to exploit common structure of the trees to embed non-trivial subtrees of different trees into the same part of the network. In this paper, we show that for $t \in o(\sqrt{\lg n})$, there is a set of $t$ trees with virtually no common structure that could be exploited. More precisely, we show for any $t\in o(\sqrt{\lg n})$, there are $t$ trees such that any network displaying them has $(t-1)n - o(n)$ reticulations. For $t \in o(\lg n)$, we obtain a slightly weaker bound. We also prove that already for $t = c\lg n$, for any constant $c > 0$, there is a set of $t$ trees that cannot be displayed by a network with $o(n \lg n)$ reticulations, matching up to constant factors the known upper bound of $O(n \lg n)$ reticulations sufficient to display \emph{all} trees with $n$ leaves. These results are based on simple counting arguments and extend to unrooted networks and trees.