New articles on Quantitative Biology


[1] 2602.00019

AutoBinder Agent: An MCP-Based Agent for End-to-End Protein Binder Design

Modern AI technologies for drug discovery are distributed across heterogeneous platforms, including web applications, desktop environments, and code libraries, leading to fragmented workflows, inconsistent interfaces, and high integration overhead. We present an agentic end-to-end drug design framework that leverages a Large Language Model (LLM) in conjunction with the Model Context Protocol (MCP) to dynamically coordinate access to biochemical databases, modular toolchains, and task-specific AI models. The system integrates four state-of-the-art components: MaSIF (MaSIF-site and MaSIF-seed-search) for geometric deep learning-based identification of protein-protein interaction (PPI) sites, Rosetta for grafting protein fragments onto protein backbones to form mini proteins, ProteinMPNN for amino acid sequence redesign, and AlphaFold3 for near-experimental accuracy in complex structure prediction. Starting from a target structure, the framework supports de novo binder generation via surface analysis, scaffold grafting and pose construction, sequence optimization, and structure prediction. Additionally, by replacing rigid, script-based workflows with a protocol-driven, LLM-coordinated architecture, the framework improves reproducibility, reduces manual overhead, and ensures extensibility, portability, and auditability across the entire drug design process.


[2] 2602.00057

Explore Brain-Inspired Machine Intelligence for Connecting Dots on Graphs Through Holographic Blueprint of Oscillatory Synchronization

Neural coupling in both neuroscience and artificial intelligence emerges as dynamic oscillatory patterns that encode abstract concepts. Motivated by this, we hypothesize that a deeper understanding of the neural mechanisms governing brain rhythms can inspire next-generation design principles for machine learning algorithms, leading to improved efficiency and robustness. Building on this idea, we first model evolving brain rhythms through the interference of spontaneously synchronized neural oscillations, termed HoloBrain. The success of modeling brain rhythms using an artificial dynamical system of coupled oscillations motivates a "first principle" for brain-inspired machine intelligence based on a shared synchronization mechanism, termed HoloGraph. This principle enables graph neural networks to move beyond conventional heat diffusion paradigms toward modeling oscillatory synchronization. Our HoloGraph framework not only effectively mitigates the over-smoothing problem in graph neural networks but also demonstrates strong potential for reasoning and solving challenging problems on graphs.
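
A minimal sketch of the contrast the abstract draws, assuming nothing about HoloGraph's actual operator: Kuramoto-style phase coupling on a graph synchronizes oscillators, whereas graph heat diffusion (the conventional GNN analogy) only relaxes differences. All names below are ours.

```python
import numpy as np

def kuramoto_step(theta, omega, A, K=1.0, dt=0.01):
    """One Euler step of Kuramoto phase oscillators coupled on a graph.

    theta: (n,) node phases; omega: (n,) natural frequencies;
    A: (n, n) adjacency matrix.
    """
    # diff[i, j] = theta_j - theta_i
    diff = theta[None, :] - theta[:, None]
    coupling = (A * np.sin(diff)).sum(axis=1)
    return theta + dt * (omega + K * coupling)

def heat_step(x, A, dt=0.01):
    """One Euler step of graph heat diffusion dx/dt = -L x, for contrast."""
    L = np.diag(A.sum(axis=1)) - A
    return x - dt * L @ x
```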


[3] 2602.00143

Early warning prediction: Onsager-Machlup vs Schrödinger

Predicting critical transitions in complex systems, such as epileptic seizures in the brain, represents a major challenge in scientific research. The high-dimensional characteristics and hidden critical signals further complicate early-warning tasks. This study proposes a novel early-warning framework that integrates manifold learning with stochastic dynamical system modeling. Through systematic comparison, six methods including diffusion maps (DM) are selected to construct low-dimensional representations. Based on these, a data-driven stochastic differential equation model is established to robustly estimate the probability evolution scoring function of the system. Building on this, a new Score Function (SF) indicator is defined by incorporating Schrödinger bridge theory to quantify the likelihood of significant state transitions in the system. Experiments demonstrate that this indicator exhibits higher sensitivity and robustness in epilepsy prediction, enables earlier identification of critical points, and clearly captures dynamic features across various stages before and after seizure onset. This work provides a systematic theoretical framework and practical methodology for extracting early-warning signals from high-dimensional data.
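
The data-driven SDE step can be illustrated with a textbook Kramers-Moyal drift estimate from a one-dimensional time series. This is a generic sketch under that simplification; the paper's pipeline works in a manifold-learned latent space and adds a Schrödinger-bridge-based indicator on top.

```python
import numpy as np

def estimate_drift(x, dt, n_bins=50):
    """Bin-wise Kramers-Moyal estimate of the drift of a 1-D SDE,
    f(x) ~ E[X(t+dt) - X(t) | X(t) = x] / dt, from one trajectory."""
    dx = np.diff(x)
    bins = np.linspace(x.min(), x.max(), n_bins + 1)
    idx = np.digitize(x[:-1], bins) - 1
    centers = 0.5 * (bins[:-1] + bins[1:])
    drift = np.array([dx[idx == b].mean() if np.any(idx == b) else np.nan
                      for b in range(n_bins)]) / dt
    return centers, drift
```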


[4] 2602.00157

ProDCARL: Reinforcement Learning-Aligned Diffusion Models for De Novo Antimicrobial Peptide Design

Antimicrobial resistance threatens healthcare sustainability and motivates low-cost computational discovery of antimicrobial peptides (AMPs). De novo peptide generation must optimize antimicrobial activity and safety through low predicted toxicity, but likelihood-trained generators do not enforce these goals explicitly. We introduce ProDCARL, a reinforcement-learning alignment framework that couples a diffusion-based protein generator (EvoDiff OA-DM 38M) with sequence property predictors for AMP activity and peptide toxicity. We fine-tune the diffusion prior on AMP sequences to obtain a domain-aware generator. Top-k policy-gradient updates use classifier-derived rewards plus entropy regularization and early stopping to preserve diversity and reduce reward hacking. In silico experiments show ProDCARL increases the mean predicted AMP score from 0.081 after fine-tuning to 0.178. The joint high-quality hit rate reaches 6.3\% with pAMP $>$0.7 and pTox $<$0.3. ProDCARL maintains high diversity, with a diversity score ($1$ minus mean pairwise identity) of 0.929. Qualitative analyses with AlphaFold3 and ProtBERT embeddings suggest candidates show plausible AMP-like structural and semantic characteristics. ProDCARL serves as a candidate generator that narrows experimental search space, and experimental validation remains future work.
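
A hedged sketch of the alignment step, assuming a generic autoregressive-style generator with per-position logits and classifier-derived rewards; the exact ProDCARL objective may differ in weighting and baselines, and all names here are ours.

```python
import torch

def topk_policy_gradient_loss(logits, sequences, rewards, k=8, ent_coef=0.01):
    """REINFORCE-style loss on the top-k highest-reward samples, with an
    entropy bonus to preserve diversity (generic sketch).

    logits: (B, L, V) per-position logits from the generator
    sequences: (B, L) sampled token ids; rewards: (B,) classifier scores
    """
    logp = torch.log_softmax(logits, dim=-1)
    # Sequence log-likelihoods: sum of per-position token log-probs
    seq_logp = logp.gather(-1, sequences.unsqueeze(-1)).squeeze(-1).sum(-1)
    top = rewards.topk(k).indices
    pg = -(rewards[top] * seq_logp[top]).mean()   # reward-weighted log-lik
    probs = logp.exp()
    entropy = -(probs * logp).sum(-1).mean()      # mean per-position entropy
    return pg - ent_coef * entropy
```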


[5] 2602.00193

Multi-strain SIS dynamics with coinfection under host population structure

Coinfection phenomena are common in nature, yet there is a lack of analytical approaches for coinfection systems with a high number of circulating and interacting strains. In this paper, we investigated a coinfection SIS framework applied to N strains, co-circulating in a structured host population. Adopting a general formulation for fixed host classes, defined by arbitrary epidemiological traits such as class-specific transmission rates, susceptibilities, clearance rates, etc., our model can be easily applied in different frameworks: for example, when different host species share the same pathogen, in classes of vaccinated or non-vaccinated hosts, or even in classes of hosts defined by the number of contacts. Using the strain similarity assumption, we identify the fast and slow variables of the epidemiological dynamics on the host population, linking neutral and non-neutral strain dynamics, and deriving a global replicator equation. This global replicator equation makes it possible to explicitly predict coexistence dynamics from mutual invasibility coefficients among strains. The derived global pairwise invasion fitness matrix contains explicit traces of the underlying host population structure, and of its entanglement with the strain interaction and trait landscape. Our work thus enables a more comprehensive study and efficient simulation of multi-strain dynamics in endemic ecosystems, paving the way to a deeper understanding of global persistence and selection forces, jointly shaped by pathogen and host diversity.
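
In generic notation (the matrix $\Lambda$ below stands in for the paper's global pairwise invasion fitness matrix; symbols are ours), such a slow dynamics takes the standard replicator form

\[
\dot z_i \;=\; z_i\Big(\sum_j \Lambda_{ij}\, z_j \;-\; \sum_{j,k} z_j\, \Lambda_{jk}\, z_k\Big), \qquad \sum_i z_i = 1,
\]

where $z_i$ is the relative frequency of strain $i$: a strain grows when its mean invasion fitness against the current strain mix exceeds the population average.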


[6] 2602.00197

Rank-and-Reason: Multi-Agent Collaboration Accelerates Zero-Shot Protein Mutation Prediction

Zero-shot mutation prediction is vital for low-resource protein engineering, yet existing protein language models (PLMs) often yield statistically confident results that ignore fundamental biophysical constraints. Currently, selecting candidates for wet-lab validation relies on manual expert auditing of PLM outputs, a process that is inefficient, subjective, and highly dependent on domain expertise. To address this, we propose Rank-and-Reason (VenusRAR), a two-stage agentic framework to automate this workflow and maximize expected wet-lab fitness. In the Rank-Stage, a Computational Expert and Virtual Biologist aggregate a context-aware multi-modal ensemble, establishing a new Spearman correlation record of 0.551 (vs. 0.518) on ProteinGym. In the Reason-Stage, an agentic Expert Panel employs chain-of-thought reasoning to audit candidates against geometric and structural constraints, improving the Top-5 Hit Rate by up to 367% on ProteinGym-DMS99. The wet-lab validation on Cas12i3 nuclease further confirms the framework's efficacy, achieving a 46.7% positive rate and identifying two novel mutants with 4.23-fold and 5.05-fold activity improvements. Code and datasets are released on GitHub (this https URL).


[7] 2602.00464

A 30-item Test for Assessing Chinese Character Amnesia in Child Handwriters

Handwriting literacy is an important skill for learning and communication in school-age children. In the digital age, handwriting has been largely replaced by typing, leading to a decline in handwriting proficiency, particularly in non-alphabetic writing systems. Among children learning Chinese, a growing number have reported experiencing character amnesia: difficulty in correctly handwriting a character despite being able to recognize it. Given that there is currently no standardized diagnostic tool for assessing character amnesia in children, we developed an assessment to measure Chinese character amnesia in the Mandarin-speaking school-age population. We utilised a large-scale handwriting dataset in which 40 children handwrote 800 characters from dictation prompts. Character amnesia and correct handwriting responses were analysed using a two-parameter Item Response Theory model. Four item-selection schemes were compared: random baseline, maximum discrimination, diverse difficulty, and an upper-and-lower-thirds discrimination score. Candidate item subsets were evaluated using out-of-sample prediction. Among these selection schemes, the upper-and-lower-thirds discrimination procedure yields a compact 30-item test that preserves individual-difference structure and generalizes to unseen test-takers (cross-validated mean r = .74 with the full 800-item test; within-sample r = .93). This short-form test provides a reliable and efficient tool for assessing Chinese character amnesia in children and can be used to identify early handwriting and orthographic learning difficulties, contributing to the early detection of developmental dysgraphia and related literacy challenges.
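
For reference, the two-parameter IRT model mentioned here takes the standard form

\[
P(X_{ij} = 1 \mid \theta_i) \;=\; \frac{1}{1 + \exp\!\big(-a_j(\theta_i - b_j)\big)},
\]

where $X_{ij}$ indicates a correct handwriting response by child $i$ on character $j$, $\theta_i$ is the child's latent ability, and $a_j$ and $b_j$ are the item's discrimination and difficulty; the compared selection schemes differ in how they use estimates of $a_j$ and $b_j$ to pick items.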


[8] 2602.00586

RAG-GNN: Integrating Retrieved Knowledge with Graph Neural Networks for Precision Medicine

Network topology excels at structural predictions but fails to capture functional semantics encoded in biomedical literature. We present a retrieval-augmented generation (RAG) embedding framework that integrates graph neural network representations with dynamically retrieved literature-derived knowledge through contrastive learning. Benchmarking against ten embedding methods reveals task-specific complementarity: topology-focused methods achieve near-perfect link prediction (GCN: 0.983 AUROC), while RAG-GNN is the only method achieving positive silhouette scores for functional clustering (0.001 vs. negative scores for all baselines). Information-theoretic decomposition shows network topology contributes 77.3% of predictive information, while retrieved documents provide 8.6% unique information. Applied to cancer signaling networks (379 proteins, 3,498 interactions), the framework identifies DDR1 as a therapeutic target based on retrieved evidence of synthetic lethality with KRAS mutations. These results establish that topology-only and retrieval-augmented approaches serve complementary purposes: structural prediction tasks are solved by network topology alone, while functional interpretation uniquely benefits from retrieved knowledge.


[9] 2602.00660

Phase Transitions in Unsupervised Feature Selection

Identifying minimal and informative feature sets is a central challenge in data analysis, particularly when few data points are available. Here we present a theoretical analysis of an unsupervised feature selection pipeline based on the Differentiable Information Imbalance (DII). We consider the specific case of structural and physico-chemical features describing a set of proteins. We show that if one considers the features as coordinates of a (hypothetical) statistical physics model, this model undergoes a phase transition as a function of the number of retained features. For physico-chemical descriptors, the transition is from a glass-like phase, when the features are few, to a liquid-like phase. The glass-like phase exhibits bimodal order-parameter distributions and Binder cumulant minima. In contrast, for structural descriptors the transition is less sharp. Remarkably, for physico-chemical descriptors the critical number of features identified from the DII coincides with the saturation of downstream binary classification performance. These results provide a principled, unsupervised criterion for minimal feature sets in protein classification and reveal distinct mechanisms of criticality across different feature types.
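
The Binder cumulant referred to here is the standard fourth-order cumulant of the order parameter $m$,

\[
U_4 \;=\; 1 - \frac{\langle m^4 \rangle}{3\,\langle m^2 \rangle^2},
\]

whose minima, together with bimodal distributions of $m$, are classic finite-size signatures used to locate a phase transition.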


[10] 2602.00782

Controlling Repetition in Protein Language Models

Protein language models (PLMs) have enabled advances in structure prediction and de novo protein design, yet they frequently collapse into pathological repetition during generation. Unlike in text, where repetition merely reduces readability, in proteins it undermines structural confidence and functional viability. To unify this problem, we present the first systematic study of repetition in PLMs. We first propose quantitative metrics to characterize motif-level and homopolymer repetition and then demonstrate their negative impact on folding reliability. To address this challenge, we propose UCCS (Utility-Controlled Contrastive Steering), which steers protein generation with a constrained dataset. Instead of naively contrasting high- vs. low-repetition sequences, we construct contrastive sets that maximize differences in repetition while tightly controlling for structural utility. This disentanglement yields steering vectors that specifically target repetition without degrading foldability. Injected at inference, these vectors consistently reduce repetition without retraining or heuristic decoding. Experiments with ESM-3 and ProtGPT2 in CATH, UniRef50, and SCOP show that our method outperforms decoding penalties and other baselines, substantially lowering repetition while preserving AlphaFold confidence scores. Our results establish repetition control as a central challenge for PLMs and highlight dataset-guided steering as a principled approach for reliable protein generation.


[11] 2602.01019

Inter- and Intra-Subject Variability in EEG: A Systematic Survey

Electroencephalography (EEG) underpins neuroscience, clinical neurophysiology, and brain-computer interfaces (BCIs), yet pronounced inter- and intra-subject variability limits reliability, reproducibility, and translation. This systematic review examines studies that quantified or modeled EEG variability across resting-state, event-related potentials (ERPs), and task-related/BCI paradigms (including motor imagery and SSVEP) in healthy and clinical cohorts. Across paradigms, inter-subject differences are typically larger than within-subject fluctuations, but both affect inference and model generalization. Stability is feature-dependent: alpha-band measures and individual alpha peak frequency are often relatively reliable, whereas higher-frequency and many connectivity-derived metrics show more heterogeneous reliability; ERP reliability varies by component, with P300 measures frequently showing moderate-to-good stability. We summarize major sources of variability (biological, state-related, technical, and analytical), review common quantification and modeling approaches (e.g., ICC, CV, SNR, generalizability theory, and multivariate/learning-based methods), and provide recommendations for study design, reporting, and harmonization. Overall, EEG variability should be treated as both a practical constraint to manage and a meaningful signal to leverage for precision neuroscience and robust neurotechnology.
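
Of the quantification approaches listed, the intraclass correlation coefficient is the most common; one widely used variant for test-retest designs is ICC(2,1) from the Shrout-Fleiss taxonomy,

\[
\mathrm{ICC}(2,1) \;=\; \frac{MS_R - MS_E}{MS_R + (k-1)\,MS_E + \frac{k}{n}\,(MS_C - MS_E)},
\]

with $MS_R$, $MS_C$, and $MS_E$ the subject, session, and error mean squares from a two-way ANOVA over $n$ subjects and $k$ sessions.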


[12] 2602.01088

INDIGENA: inductive prediction of disease-gene associations using phenotype ontologies

Motivation: Predicting gene-disease associations (GDAs) is the problem of determining which genes are associated with a disease. GDA prediction can be framed as a ranking problem where genes are ranked for a query disease, based on features such as phenotypic similarity. By describing phenotypes using phenotype ontologies, ontology-based semantic similarity measures can be used. However, traditional semantic similarity measures use only the ontology taxonomy. Recent methods based on ontology embeddings compare phenotypes in latent space; these methods can use all ontology axioms as well as a supervised signal, but are inherently transductive, i.e., query entities must already be known at the time of learning embeddings, and therefore these methods do not generalize to novel diseases (sets of phenotypes) at inference time. Results: We developed INDIGENA, an inductive disease-gene association method for ranking genes based on a set of phenotypes. Our method first uses a graph projection to map axioms from phenotype ontologies to a graph structure, and then uses graph embeddings to create latent representations of phenotypes. We use an explicit aggregation strategy to combine phenotype embeddings into representations of genes or diseases, allowing us to generalize to novel sets of phenotypes. We also develop a method to make the phenotype embeddings and the similarity measure task-specific by including a supervised signal from known gene-disease associations. We apply our method to mouse models of human disease and demonstrate that we can significantly improve over the inductive semantic similarity baseline measures, and reach a performance similar to transductive methods for predicting gene-disease associations while being more general. Availability and Implementation: this https URL
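
A minimal sketch of the explicit aggregation idea (mean pooling plus cosine ranking; function and variable names are ours, and the paper's aggregation strategy may be more elaborate):

```python
import numpy as np

def aggregate(phenotype_vecs):
    """Combine a set of phenotype embeddings into one entity vector.
    Mean pooling is one simple explicit aggregation."""
    return np.mean(phenotype_vecs, axis=0)

def rank_genes(disease_phenotypes, gene_profiles):
    """Rank genes for a query disease by cosine similarity between
    aggregated embeddings; works for unseen phenotype sets (inductive)."""
    d = aggregate(disease_phenotypes)
    d /= np.linalg.norm(d)
    scores = {}
    for gene, vecs in gene_profiles.items():
        g = aggregate(vecs)
        scores[gene] = float(d @ g / np.linalg.norm(g))
    return sorted(scores, key=scores.get, reverse=True)
```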


[13] 2602.01230

Toward Interpretable and Generalizable AI in Regulatory Genomics

Deciphering how DNA sequence encodes gene regulation remains a central challenge in biology. Advances in machine learning and functional genomics have enabled sequence-to-function (seq2func) models that predict molecular regulatory readouts directly from DNA sequence. These models are now widely used for variant effect prediction, mechanistic interpretation, and regulatory sequence design. Despite strong performance on held-out genomic regions, their ability to generalize across genetic variation and cellular contexts remains inconsistent. Here we examine how architectural choices, training data, and prediction tasks shape the behavior of seq2func models. We synthesize how interpretability methods and evaluation practices have probed learned cis-regulatory organization and highlighted systematic failure modes, clarifying why strong predictive accuracy can fail to translate into robust regulatory understanding. We argue that progress will require reframing seq2func models as continually refined systems, in which targeted perturbation experiments, systematic evaluation, and iterative model updates are tightly coupled through AI-experiment feedback loops. Under this framework, seq2func models become self-improving tools that progressively deepen their mechanistic grounding and more reliably support biological discovery.


[14] 2602.01347

Vulnerability-Amplifying Interaction Loops: a systematic failure mode in AI chatbot mental-health interactions

Millions of users turn to consumer AI chatbots to discuss behavioral and mental health concerns. While this presents unprecedented opportunities to deliver population-level support, it also highlights an urgent need to develop rigorous and scalable safety evaluations. Here we introduce SIM-VAIL, an AI chatbot auditing framework that captures how harmful AI chatbot responses manifest across a range of mental-health contexts. SIM-VAIL pairs a simulated human user, harboring a distinct psychiatric vulnerability and conversational intent, with an audited frontier AI chatbot. It scores conversation turns on 13 clinically relevant risk dimensions, enabling context-dependent, temporally resolved assessment of mental-health risk. Across 810 conversations, encompassing over 90,000 turn-level ratings and 30 psychiatric user profiles, we find that significant risk occurs across virtually all user phenotypes. Risk manifested across most of the 9 consumer AI chatbot models audited, albeit mitigated in more modern variants. Rather than arising abruptly, risk accumulated over multiple turns. Risk profiles were phenotype-dependent, indicating that behaviors that appear supportive in general settings are liable to be maladaptive when they align with mechanisms that sustain a user's vulnerability. Multivariate risk patterns revealed trade-offs across dimensions, suggesting that mitigation targeting one harm domain can exacerbate others. These findings identify a novel failure mode in human-AI interactions, which we term Vulnerability-Amplifying Interaction Loops (VAILs), and underscore the need for multi-dimensional approaches to risk quantification. SIM-VAIL provides a scalable evaluation framework for quantifying how mental-health risk is distributed across user phenotypes, conversational trajectories, and clinically grounded behavioral dimensions, offering a foundation for targeted safety improvements.


[15] 2602.01482

Community-Level Modeling of Gyral Folding Patterns for Robust and Anatomically Informed Individualized Brain Mapping

Cortical folding exhibits substantial inter-individual variability while preserving stable anatomical landmarks that enable fine-scale characterization of cortical organization. Among these, the three-hinge gyrus (3HG) serves as a key folding primitive, showing consistent topology yet meaningful variations in morphology, connectivity, and function. Existing landmark-based methods typically model each 3HG independently, ignoring that 3HGs form higher-order folding communities that capture mesoscale structure. This simplification weakens anatomical representation and makes one-to-one matching sensitive to positional variability and noise. We propose a spectral graph representation learning framework that models community-level folding units rather than isolated landmarks. Each 3HG is encoded using a dual-profile representation combining surface topology and structural connectivity. Subject-specific spectral clustering identifies coherent folding communities, followed by topological refinement to preserve anatomical continuity. For cross-subject correspondence, we introduce Joint Morphological-Geometric Matching, jointly optimizing geometric and morphometric similarity. Across over 1000 Human Connectome Project subjects, the resulting communities show reduced morphometric variance, stronger modular organization, improved hemispheric consistency, and superior alignment compared with atlas-based, landmark-based, and embedding-based baselines. These findings demonstrate that community-level modeling provides a robust and anatomically grounded framework for individualized cortical characterization and reliable cross-subject correspondence.
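
The community step can be sketched with off-the-shelf spectral clustering on a precomputed affinity matrix; the dual-profile encoding, topological refinement, and matching stages described in the abstract are omitted here.

```python
from sklearn.cluster import SpectralClustering

def folding_communities(affinity, n_communities=10):
    """Group 3HG landmarks into folding communities by spectral
    clustering on a subject-specific affinity matrix.

    affinity: (n_3hg, n_3hg) symmetric nonnegative similarity matrix.
    """
    model = SpectralClustering(n_clusters=n_communities,
                               affinity="precomputed",
                               assign_labels="kmeans",
                               random_state=0)
    return model.fit_predict(affinity)
```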


[16] 2602.01492

From Discrete to Continuous Mixed Populations of Conformists, Nonconformists, and Imitators

In two-strategy decision-making problems, individuals often imitate the highest earners or choose either the common or rare strategy. Individuals who benefit from the common strategy are conformists, whereas those who profit by choosing the less common one are called nonconformists. The population proportions of the two strategies may undergo perpetual fluctuations in finite, discrete, heterogeneous populations of imitators, conformists, and nonconformists. How these fluctuations evolve as population size increases was left as an open question and is addressed in this paper. We show that the family of Markov chains describing the discrete population dynamics forms a generalized stochastic approximation process for a differential inclusion--the continuous-time dynamics. Furthermore, we prove that the continuous-time dynamics always equilibrate. Then, by leveraging results from stochastic approximation theory, we show that the amplitudes of fluctuations in the proportions of the two strategies in the population approach zero with probability one when the population size grows to infinity. Our results suggest that large-scale perpetual fluctuations are unlikely in large, well-mixed populations consisting of these three types, particularly when imitators follow the highest earners.


[17] 2602.01496

Is Normalized Biomass Really Abundance? Pitfalls, Artifacts, and Misconceptions in the Field of Size Spectra Analysis -- A Case for Back-Transformed Spectra

The NBSS (normalized biomass size spectrum) is a common, intuitive approach for the study of natural ecosystems. However, very few studies have been dedicated to verifying possible biases, flaws, and paradoxes in this widely used method. An evident issue, which best exemplifies its discrepancies and paradoxes, is the use of intriguing non-biomass units (such as abundance, biomass flux, or pseudo-abundance units) on NBSS plots that are intended to visualize biomass spectra. The main objectives of this study were to verify, test, and analyze the procedures involved in the transformations that lead to the popular NBSS plot, and to check the correctness of currently used units, testing the hypothesis that the NBSS indeed represents biomass, not abundance or biomass flux (dB/dM). In doing so, we develop (i) a new conceptual framework, (ii) new terminology, (iii) a novel back-transformation method, and (iv) a simple new calculation method that yields the best (i.e., least biased) representation of the original biomass vs. body-mass distribution shape, numerical values, dimensions, and units. Extensive tests with in-situ and synthetic (simulated) data were used to verify these transformation procedures and to compare the original biomass distribution data with the binned outputs. Original biomass units and dimensions are retained in the novel back-transformed normalized biomass spectrum (bNBS), proposed and described herein. The proposed bNBS constitutes a new, improved approach to robust size-spectra science that allows for quantitative inter-comparison of biomass spectra across regions and time periods.
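
In its usual form, the normalization at issue divides the biomass in each size bin by the bin's body-mass width, so the plotted quantity no longer carries biomass units:

\[
B_N(M_k) \;=\; \frac{B_k}{\Delta M_k},
\qquad
B_k \;=\; B_N(M_k)\,\Delta M_k,
\]

where $B_k$ is the total biomass in bin $k$ and $\Delta M_k$ its body-mass width; the second identity is the back-transformation that restores biomass units, which is the operation underlying the proposed bNBS.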


[18] 2602.02374

Recurrent neural chemical reaction networks trained to switch dynamical behaviours through learned bifurcations

Both natural and synthetic chemical systems not only exhibit a range of non-trivial dynamics, but also transition between qualitatively different dynamical behaviours as environmental parameters change. Such transitions are called bifurcations. Here, we show that recurrent neural chemical reaction networks (RNCRNs), a class of chemical reaction networks based on recurrent artificial neural networks that can be trained to reproduce a given dynamical behaviour, can also be trained to exhibit bifurcations. First, we show that RNCRNs can inherit some bifurcations defined by smooth ordinary differential equations (ODEs). Second, we demonstrate that the RNCRN can be trained to infer bifurcations that allow it to approximate different target behaviours within different regions of parameter space, without explicitly providing the bifurcation itself in the training. These behaviours can be specified using target ODEs that are discontinuous with respect to the parameters, or even simply by specifying certain desired dynamical features in certain regions of the parameter space. To achieve the latter, we introduce an ODE-free algorithm for training the RNCRN to display designer oscillations, such as a heart-shaped limit cycle or two coexisting limit cycles.
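
As a concrete picture of a parameter-controlled bifurcation of the kind an RNCRN could be trained to reproduce, consider the textbook supercritical Hopf normal form (this is an illustrative target, not the authors' network):

```python
import numpy as np
from scipy.integrate import solve_ivp

def hopf_normal_form(t, z, mu):
    """Supercritical Hopf normal form: a stable fixed point for mu < 0,
    a limit cycle of radius sqrt(mu) for mu > 0."""
    x, y = z
    r2 = x**2 + y**2
    return [mu * x - y - r2 * x, x + mu * y - r2 * y]

for mu in (-0.5, 0.5):
    sol = solve_ivp(hopf_normal_form, (0, 100), [0.1, 0.0], args=(mu,))
    print(f"mu={mu:+.1f}: final radius ~ {np.hypot(*sol.y[:, -1]):.3f}")
```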


[19] 2602.00156

Accelerating De Novo Genome Assembly via Quantum-Assisted Graph Optimization with Bitstring Recovery

Genome sequencing is essential to decode genetic information, identify organisms, understand diseases and advance personalized medicine. A critical step in any genome sequencing technique is genome assembly. However, de novo genome assembly, which involves constructing an entire genome sequence from scratch without a reference genome, presents significant challenges due to its high computational complexity, affecting both time and accuracy. In this study, we propose a hybrid approach utilizing a quantum computing-based optimization algorithm integrated with classical pre-processing to expedite the genome assembly process. Specifically, we present a method to find Hamiltonian and Eulerian paths within the genome assembly graph using gate-based quantum computing through a Higher-Order Binary Optimization (HOBO) formulation with the Variational Quantum Eigensolver (VQE) algorithm, in addition to a novel bitstring recovery mechanism to improve optimizer traversal of the solution space. A comparative analysis with classical optimization techniques was performed to assess the effectiveness of our quantum-based approach in genome assembly. The results indicate that, as quantum hardware continues to evolve and noise levels diminish, our formulation holds a significant potential to accelerate genome sequencing by offering faster and more accurate solutions to the complex challenges in genomic research.
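
The classical pre-processing stage can be illustrated by building the assembly graph from k-mers: an Eulerian path through this graph spells a candidate sequence, and it is such path problems that the paper encodes as HOBO instances for VQE. A minimal sketch:

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Build de Bruijn graph edges from k-mers: each k-mer contributes
    an edge from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    edges = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges

print(de_bruijn_edges(["ATGGCGT", "GGCGTGC"], k=4))
```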


[20] 2602.00163

Deep Learning Pose Estimation for Multi-Label Recognition of Combined Hyperkinetic Movement Disorders

Hyperkinetic movement disorders (HMDs) such as dystonia, tremor, chorea, myoclonus, and tics are disabling motor manifestations across childhood and adulthood. Their fluctuating, intermittent, and frequently co-occurring expressions hinder clinical recognition and longitudinal monitoring, which remain largely subjective and vulnerable to inter-rater variability. Objective and scalable methods to distinguish overlapping HMD phenotypes from routine clinical videos are still lacking. Here, we developed a pose-based machine-learning framework that converts standard outpatient videos into anatomically meaningful keypoint time series and computes kinematic descriptors spanning statistical, temporal, spectral, and higher-order irregularity-complexity features.
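
A minimal sketch of turning one keypoint coordinate series into descriptors of the four families named above; the feature choices here are illustrative, not the paper's exact set.

```python
import numpy as np
from scipy.signal import welch

def kinematic_features(series, fs=30.0):
    """Statistical, temporal, and spectral descriptors for a single
    keypoint coordinate time series sampled at fs Hz."""
    vel = np.gradient(series) * fs                  # finite-difference velocity
    freqs, psd = welch(series, fs=fs, nperseg=min(256, len(series)))
    p = psd / psd.sum()
    return {
        "mean": series.mean(), "std": series.std(),
        "vel_rms": np.sqrt((vel ** 2).mean()),
        "dominant_freq_hz": freqs[np.argmax(psd)],
        # Spectral entropy as a simple irregularity-complexity proxy
        "spectral_entropy": float(-(p * np.log(p + 1e-12)).sum()),
    }
```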


[21] 2602.00259

Intelligent Reasoning Cues: A Framework and Case Study of the Roles of AI Information in Complex Decisions

Artificial intelligence (AI)-based decision support systems can be highly accurate yet still fail to support users or improve decisions. Existing theories of AI-assisted decision-making focus on calibrating reliance on AI advice, leaving it unclear how different system designs might influence the reasoning processes underneath. We address this gap by reconsidering AI interfaces as collections of intelligent reasoning cues: discrete pieces of AI information that can individually influence decision-making. We then explore the roles of eight types of reasoning cues in a high-stakes clinical decision (treating patients with sepsis in intensive care). Through contextual inquiries with six teams and a think-aloud study with 25 physicians, we find that reasoning cues have distinct patterns of influence that can directly inform design. Our results also suggest that reasoning cues should prioritize tasks with high variability and discretion, adapt to ensure compatibility with evolving decision needs, and provide complementary, rigorous insights on complex cases.


[22] 2602.00663

SEISMO: Increasing Sample Efficiency in Molecular Optimization with a Trajectory-Aware LLM Agent

Optimizing the structure of molecules to achieve desired properties is a central bottleneck across the chemical sciences, particularly in the pharmaceutical industry where it underlies the discovery of new drugs. Since molecular property evaluation often relies on costly and rate-limited oracles, such as experimental assays, molecular optimization must be highly sample-efficient. To address this, we introduce SEISMO, an LLM agent that performs strictly online, inference-time molecular optimization, updating after every oracle call without the need for population-based or batched learning. SEISMO conditions each proposal on the full optimization trajectory, combining natural-language task descriptions with scalar scores and, when available, structured explanatory feedback. Across the Practical Molecular Optimization benchmark of 23 tasks, SEISMO achieves a 2-3 times higher area under the optimization curve than prior methods, often reaching near-maximal task scores within 50 oracle calls. Our additional medicinal-chemistry tasks show that providing explanatory feedback further improves efficiency, demonstrating that leveraging domain knowledge and structured information is key to sample-efficient molecular optimization.


[23] 2602.00832

Size and shape of terrestrial animals

Natural selection for terrestrial locomotion has yielded unifying patterns in the body shape of legged animals, often manifesting as scaling laws. One such pattern appears in the frontal aspect ratio. Smaller animals like insects typically adopt a landscape frontal aspect ratio, with a wider side-to-side base of support than center of mass height. Larger animals like elephants, however, are taller than wide with a portrait aspect ratio. Known explanations for postural scaling are restricted to animal groups with similar anatomical and behavioural motifs, but the trend in frontal aspect ratio transcends such commonalities. Here we show that vertebrates and invertebrates with diverse body plans, ranging in mass from 28 mg to 22000 kg, exhibit size-dependent scaling of the frontal aspect ratio driven by the need for lateral stability on uneven natural terrain. Because natural terrain exhibits scale-dependent unevenness, and the frontal aspect ratio is important for lateral stability during locomotion, smaller animals need a wider aspect ratio for stability. This prediction is based on the fractal property of natural terrain unevenness, requires no anatomical or behavioural parameters, and agrees with the measured scaling despite vast anatomical and behavioural differences. Furthermore, a statistical phylogenetic comparative analysis found that shared ancestry and random trait evolution cannot explain the measured scaling. Thus, our findings reveal that terrain roughness, acting through natural selection for stability, likely drove the macroevolution of frontal shape in terrestrial animals.


[24] 2602.00978

Organismal Agency and Rapid Adaptation: The Phenopoiesis Algorithm for Phenotype-First Evolution

Evolutionary success depends on the capacity to adapt: organisms must respond to environmental challenges through both genetic innovation and lifetime learning. The gene-centric paradigm attributes evolutionary causality exclusively to genes, while Denis Noble's phenotype-first framework argues that organisms are active agents capable of interpreting genetic resources, learning from experience, and shaping their own development. However, this framework has remained philosophically intuitive but algorithmically opaque. We show for the first time that organismal agency can be implemented as a concrete computational process through heritable phenotypic patterns. We introduce the Phenopoiesis Algorithm, where organisms inherit not just genes but also successful phenotypic patterns discovered during lifetime learning. Through experiments in changing environments, these pattern-inheriting organisms achieve 3.4 times faster adaptation compared to gene-centric models. Critically, these gains require cross-generational inheritance of learned patterns rather than within-lifetime learning alone. We conclude that organismal agency is not a philosophical abstraction but an algorithmic mechanism with measurable adaptive value. The mechanism works through compositional reuse: organisms discover how to compose primitive elements into solutions, encode those compositional recipes, and transmit them to offspring. Evolution operates across multiple timescales: fast, reversible phenotypic inheritance and slow, permanent genetic inheritance, providing adaptive flexibility that single-channel mechanisms cannot achieve.


[25] 2602.01604

Thermodynamic cost-controllability tradeoff in metabolic currency coupling

Cellular metabolism is globally regulated by various currency metabolites such as ATP, GTP, and NAD(P)H. These metabolites cycle between charged (high-energy) and uncharged (low-energy) states to mediate energy transfer. While distinct currency metabolites are associated with different metabolic functions, their charged and uncharged forms are generally interchangeable via biochemical reactions such as ${\rm ATP{\,+\,}GDP{\,\rightleftharpoons\,}ADP{\,+\,}GTP}$ and $\rm NADP^+{\,+\,}NADH{\,\rightleftharpoons\,}NADPH{\,+\,}NAD^+ $. Thus, their energetic states are generally coupled and influence each other, which would hinder the independent regulation of different currency metabolites. Despite the extensive knowledge of the molecular biology of individual currency metabolites, it remains poorly understood how the coordination of various coupled currency metabolites shapes metabolic regulation, efficiency, and ultimately the evolution of organisms. Here, we present a minimal theoretical model of metabolic currency coupling and reveal a fundamental tradeoff relationship between metabolic controllability and thermodynamic cost: increasing the capacity to independently regulate multiple currency metabolites generally requires comparable abundances of those metabolites, which in turn incurs a higher entropy production rate. The tradeoff suggests that in complex environments, organisms evolutionarily favor an equal abundance of currency metabolites to enhance metabolic controllability at the expense of a higher thermodynamic cost; conversely, in simple environments, organisms evolve to have imbalanced amounts of them to reduce heat dissipation. These considerations also offer a hypothesis regarding evolutionary trends in nucleotide-pool balance and genomic GC content.
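
The thermodynamic cost in such models is typically quantified by the entropy production rate, which for mass-action kinetics takes the standard form (notation ours)

\[
\dot\Sigma \;=\; \sum_{\rho} \big(J_\rho^{+} - J_\rho^{-}\big)\,\ln\frac{J_\rho^{+}}{J_\rho^{-}} \;\ge\; 0,
\]

where $J_\rho^{\pm}$ are the forward and backward fluxes of reaction $\rho$ (in units with $k_B = 1$); this is the quantity in which the controllability tradeoff above is expressed.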


[26] 2602.01751

MGKAN: Predicting Asymmetric Drug-Drug Interactions via a Multimodal Graph Kolmogorov-Arnold Network

Predicting drug-drug interactions (DDIs) is essential for safe pharmacological treatments. Previous graph neural network (GNN) models leverage molecular structures and interaction networks but mostly rely on linear aggregation and symmetric assumptions, limiting their ability to capture nonlinear and heterogeneous patterns. We propose MGKAN, a Graph Kolmogorov-Arnold Network that introduces learnable basis functions into asymmetric DDI prediction. MGKAN replaces conventional MLP transformations with KAN-driven basis functions, enabling more expressive and nonlinear modeling of drug relationships. To capture pharmacological dependencies, MGKAN integrates three network views (an asymmetric DDI network, a co-interaction network, and a biochemical similarity network) with role-specific embeddings to preserve directional semantics. A fusion module combines linear attention and nonlinear transformation to enhance representational capacity. On two benchmark datasets, MGKAN outperforms seven state-of-the-art baselines. Ablation studies and case studies confirm its predictive accuracy and effectiveness in modeling directional drug effects.
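
The KAN ingredient can be sketched as a learnable-basis-function layer: inputs are expanded over fixed basis functions and recombined with trainable coefficients, in place of a fixed MLP nonlinearity. This is a generic sketch (Gaussian bases, names ours); MGKAN's layer and basis choice may differ.

```python
import torch
import torch.nn as nn

class RBFBasisLayer(nn.Module):
    """Each scalar input is expanded over fixed radial basis functions
    and recombined with learnable coefficients (KAN-style transform)."""
    def __init__(self, in_dim, out_dim, n_basis=8):
        super().__init__()
        self.register_buffer("centers", torch.linspace(-2, 2, n_basis))
        self.coef = nn.Parameter(torch.randn(in_dim, n_basis, out_dim) * 0.1)

    def forward(self, x):                       # x: (B, in_dim)
        # Gaussian basis activations: (B, in_dim, n_basis)
        phi = torch.exp(-(x.unsqueeze(-1) - self.centers) ** 2)
        # Sum over inputs and basis functions -> (B, out_dim)
        return torch.einsum("bik,iko->bo", phi, self.coef)
```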


[27] 2602.01772

DIA-CLIP: a universal representation learning framework for zero-shot DIA proteomics

Data-independent acquisition mass spectrometry (DIA-MS) has established itself as a cornerstone of proteomic profiling and large-scale systems biology, offering unparalleled depth and reproducibility. Current DIA analysis frameworks, however, require semi-supervised training within each run for peptide-spectrum match (PSM) re-scoring. This approach is prone to overfitting and lacks generalizability across diverse species and experimental conditions. Here, we present DIA-CLIP, a pre-trained model shifting the DIA analysis paradigm from semi-supervised training to universal cross-modal representation learning. By integrating a dual-encoder contrastive learning framework with an encoder-decoder architecture, DIA-CLIP establishes a unified cross-modal representation for peptides and corresponding spectral features, achieving high-precision, zero-shot PSM inference. Extensive evaluations across diverse benchmarks demonstrate that DIA-CLIP consistently outperforms state-of-the-art tools, yielding up to a 45% increase in protein identification while achieving a 12% reduction in entrapment identifications. Moreover, DIA-CLIP holds immense potential for diverse practical applications, such as single-cell and spatial proteomics, where its enhanced identification depth facilitates the discovery of novel biomarkers and the elucidation of intricate cellular mechanisms.
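
The contrastive core of a dual-encoder model is the symmetric InfoNCE (CLIP) objective, in which matched peptide/spectrum pairs sit on the diagonal of the similarity matrix. A minimal sketch, with DIA-CLIP's encoder-decoder branch omitted:

```python
import torch
import torch.nn.functional as F

def clip_loss(pep_emb, spec_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched
    (peptide, spectrum) embedding pairs."""
    pep = F.normalize(pep_emb, dim=-1)
    spec = F.normalize(spec_emb, dim=-1)
    logits = pep @ spec.t() / temperature            # (B, B)
    targets = torch.arange(len(pep), device=pep.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```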


[28] 2602.01839

DOGMA: Weaving Structural Information into Data-centric Single-cell Transcriptomics Analysis

Recently, data-centric AI methodology has been a dominant paradigm in single-cell transcriptomics analysis, which treats data representation rather than model complexity as the fundamental bottleneck. In our review of current studies, earlier sequence methods treat cells as independent entities and adapt prevalent ML models to analyze their directly inherited sequence data. Despite their simplicity and intuition, these methods overlook the latent intercellular relationships driven by the functional mechanisms of biological systems and the inherent quality issues of the raw sequence data. Therefore, a series of structured methods has emerged. Although they employ various heuristic rules to capture intricate intercellular relationships and enhance the raw sequencing data, these methods often neglect biological prior knowledge. This omission incurs substantial overhead and yields suboptimal graph representations, thereby hindering the utility of ML models. To address these issues, we propose DOGMA, a holistic data-centric framework designed for the structural reshaping and semantic enhancement of raw data through multi-level biological prior knowledge. Transcending reliance on stochastic heuristics, DOGMA redefines graph construction by integrating Statistical Anchors with Cell Ontology and Phylogenetic Trees to enable deterministic structure discovery and robust cross-species alignment. Furthermore, Gene Ontology is utilized to bridge the feature-level semantic gap by incorporating functional priors. In complex multi-species and multi-organ benchmarks, DOGMA achieves SOTA performance, exhibiting superior zero-shot robustness and sample efficiency while operating with significantly lower computational cost.


[29] 2602.01845

No Generation without Representation: Efficient Causal Protein Language Models Enable Zero-Shot Fitness Estimation

Protein language models (PLMs) face a fundamental divide: masked language models (MLMs) excel at fitness prediction while causal models enable generation, forcing practitioners to maintain separate architectures. We introduce Proust, a 309M-parameter causal PLM that bridges this gap through architectural innovations adapted from recent LLM research, including grouped-query attention with shared K/V projections, cross-layer value residuals, and depthwise causal convolutions. Trained on 33B tokens in 40 B200 GPU-hours, Proust achieves Spearman $\rho = 0.390$ on ProteinGym substitutions, competitive with MLMs requiring 50--200$\times$ the compute. On indels, Proust sets a new state-of-the-art, outperforming models up to 20$\times$ larger. On EVEREST viral fitness benchmarks, it approaches structure-aware methods using sequence alone. These powerful representations position Proust in a sweet spot as it also retains native generative capabilities that MLMs lack by design. Interpretability analysis reveals that per-position entropy variance predicts, to an extent, when retrieval augmentation helps and when it hurts. Such insights can grow in both quantity and quality at scale and inform capabilities such as test-time scaling. Code and weights are available at this https URL
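
One of the listed ingredients, a depthwise causal convolution, is easy to make concrete: a per-channel 1-D convolution with left-only padding, so position t sees only positions <= t. This is a generic sketch; Proust's kernel size and placement are not specified in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseCausalConv(nn.Module):
    """Per-channel (groups=dim) 1-D convolution with causal padding."""
    def __init__(self, dim, kernel_size=4):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim)

    def forward(self, x):                        # x: (B, T, dim)
        x = x.transpose(1, 2)                    # (B, dim, T)
        x = F.pad(x, (self.kernel_size - 1, 0))  # pad the past only
        return self.conv(x).transpose(1, 2)      # back to (B, T, dim)
```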


[30] 2602.02128

Scalable Spatio-Temporal SE(3) Diffusion for Long-Horizon Protein Dynamics

Molecular dynamics (MD) simulations remain the gold standard for studying protein dynamics, but their computational cost limits access to biologically relevant timescales. Recent generative models have shown promise in accelerating simulations, yet they struggle with long-horizon generation due to architectural constraints, error accumulation, and inadequate modeling of spatio-temporal dynamics. We present STAR-MD (Spatio-Temporal Autoregressive Rollout for Molecular Dynamics), a scalable SE(3)-equivariant diffusion model that generates physically plausible protein trajectories over microsecond timescales. Our key innovation is a causal diffusion transformer with joint spatio-temporal attention that efficiently captures complex space-time dependencies while avoiding the memory bottlenecks of existing methods. On the standard ATLAS benchmark, STAR-MD achieves state-of-the-art performance across all metrics, substantially improving conformational coverage, structural validity, and dynamic fidelity compared to previous methods. STAR-MD successfully extrapolates to generate stable microsecond-scale trajectories where baseline methods fail catastrophically, maintaining high structural quality throughout the extended rollout. Our comprehensive evaluation reveals severe limitations in current models for long-horizon generation, while demonstrating that STAR-MD's joint spatio-temporal modeling enables robust dynamics simulation at biologically relevant timescales, paving the way for accelerated exploration of protein function.


[31] 2602.02320

A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method

Molecular function is largely determined by structure. Accurately aligning molecular structure with natural language is therefore essential for enabling large language models (LLMs) to reason about downstream chemical tasks. However, the substantial cost of human annotation makes it infeasible to construct large-scale, high-quality datasets of structure-grounded descriptions. In this work, we propose a fully automated annotation framework for generating precise molecular structure descriptions at scale. Our approach builds upon and extends a rule-based chemical nomenclature parser to interpret IUPAC names and construct enriched, structured XML metadata that explicitly encodes molecular structure. This metadata is then used to guide LLMs in producing accurate natural-language descriptions. Using this framework, we curate a large-scale dataset of approximately $163$k molecule-description pairs. A rigorous validation protocol combining LLM-based and expert human evaluation on a subset of $2,000$ molecules demonstrates a high description precision of $98.6\%$. The resulting dataset provides a reliable foundation for future molecule-language alignment, and the proposed annotation method is readily extensible to larger datasets and broader chemical tasks that rely on structural descriptions.


[32] 2602.02425

Repurposing Protein Language Models for Latent Flow-Based Fitness Optimization

Protein fitness optimization is challenged by a vast combinatorial landscape where high-fitness variants are extremely sparse. Many current methods either underperform or require computationally expensive gradient-based sampling. We present CHASE, a framework that repurposes the evolutionary knowledge of pretrained protein language models by compressing their embeddings into a compact latent space. By training a conditional flow-matching model with classifier-free guidance, we enable the direct generation of high-fitness variants without predictor-based guidance during the ODE sampling steps. CHASE achieves state-of-the-art performance on AAV and GFP protein design benchmarks. Finally, we show that bootstrapping with synthetic data can further enhance performance in data-constrained settings.
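
Classifier-free guidance for a flow-matching model has a standard form at sampling time: the ODE is integrated with a guided velocity field

\[
\hat v_t(x_t \mid c) \;=\; v_\theta(x_t, \varnothing) + w\,\big(v_\theta(x_t, c) - v_\theta(x_t, \varnothing)\big),
\]

where $c$ is the fitness condition, $\varnothing$ the unconditional input learned by randomly dropping $c$ during training, and $w > 1$ strengthens conditioning, with no separate predictor in the sampling loop.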


[33] 2602.02494

MEG-XL: Data-Efficient Brain-to-Text via Long-Context Pre-Training

Clinical brain-to-text interfaces are designed for paralysed patients who cannot provide extensive training recordings. Pre-training improves data-efficient generalisation by learning statistical priors across subjects, but these priors critically depend on context. While natural speech might unfold gradually over minutes, most methods pre-train with only a few seconds of context. Thus, we propose MEG-XL, a model pre-trained with 2.5 minutes of MEG context per sample, 5-300x longer than prior work, and equivalent to 191k tokens, capturing extended neural context. Fine-tuning on the task of word decoding from brain data, MEG-XL matches supervised performance with a fraction of the data (e.g. 1hr vs 50hrs) and outperforms brain foundation models. We find that models pre-trained with longer contexts learn representations that transfer better to word decoding. Our results indicate that long-context pre-training helps exploit extended neural context that other methods unnecessarily discard. Code, model weights, and instructions are available at this https URL .


[34] 2409.13669

A Spatiotemporal Perspective on Dynamical Computation in Neural Information Processing Systems

Spatiotemporal flows of neural activity, such as traveling waves, have been observed throughout the brain since the earliest recordings; yet there is still little consensus on their functional role. Recent experiments and models have linked traveling waves to visual and physical motion, but these observations have been difficult to reconcile with standard accounts of topographically organized selectivity and feedforward receptive fields. Here, we introduce a theoretical framework that formalizes and generalizes the connection between 'motion' and flowing neural dynamics in the language of equivariant neural network theory. We consider 'motion' not only in physical or visual spaces, but also in more abstract representational spaces, and we argue that recurrent traveling-wave-like dynamics are not just useful but necessary for accurate and stable processing of any signal undergoing such motion. Formally, we show that for any non-trivial recurrent neural network to process a sequence undergoing a flow transformation (such as visual motion) in a structured equivariant manner, its hidden state dynamics must actively realize a homomorphic representation of the same flow through recurrent connectivity. In this "spatiotemporal perspective on dynamical computation", traveling waves and related flows are best understood as faithful dynamic representations of stimulus flows; and consequently the natural inclination of biological systems towards such dynamics may be viewed as an innate inductive bias towards efficiency and generalization in the spatiotemporally-structured dynamical world they inhabit.
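
One way to phrase the formal claim (notation ours): for a recurrent update $h_{t+1} = f(h_t, x_t)$ to process inputs transformed by a flow $g$ in a structured, equivariant manner, the hidden state must carry a representation $\rho$ of the same transformation group,

\[
f\big(\rho(g)\,h_t,\; \pi(g)\,x_t\big) \;=\; \rho(g)\, f(h_t, x_t),
\]

so that moving the stimulus (via $\pi(g)$) moves the internal state along a matching trajectory; traveling-wave-like dynamics are then the visible signature of $\rho(g)$ acting on topographically organized activity.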


[35] 2502.06914

UniZyme: A Unified Protein Cleavage Site Predictor Enhanced with Enzyme Active-Site Knowledge

Enzyme-catalyzed protein cleavage is essential for many biological functions. Accurate prediction of cleavage sites can facilitate various applications such as drug development, enzyme design, and a deeper understanding of biological mechanisms. However, most existing models are restricted to an individual enzyme, which neglects shared knowledge of enzymes and fails to generalize to novel enzymes. Thus, we introduce a unified protein cleavage site predictor named UniZyme, which can generalize across diverse enzymes. To enhance the enzyme encoding for the protein cleavage site prediction, UniZyme employs a novel biochemically-informed model architecture along with active-site knowledge of proteolytic enzymes. Extensive experiments demonstrate that UniZyme achieves high accuracy in predicting cleavage sites across a range of proteolytic enzymes, including unseen enzymes. The code is available in this https URL


[36] 2503.00008

Determining the Equivalence of Small Zero-one Reaction Networks

Zero-one reaction networks are pivotal to cellular signaling, and establishing the equivalence of such networks represents a foundational computational challenge in the realm of chemical reaction network research. Herein, we propose a high-efficiency approach for identifying the equivalence of zero-one networks. Its efficiency stems from a set of criteria tailored to judge the equivalence of steady-state ideals derived from zero-one networks, which effectively reduces the computational cost associated with Gröbner basis calculations. Experimental results demonstrate that our proposed method can successfully categorize more than three million networks by their equivalence within a feasible timeframe. Also, our computational results for two important classes of quadratic zero-one networks (3-dimensional with 3 species, 6 reactions; 4-dimensional with 4 species, 5 reactions) show that they have no positive steady states for a generic choice of rate constants, implying these small networks generically exhibit neither multistability nor periodic orbits.
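
The steady-state ideals in question are generated by the right-hand sides of the mass-action ODEs. A toy example of the Gröbner-basis computation whose cost the proposed criteria are designed to avoid (fixing generic numeric rate constants for simplicity):

```python
from sympy import Rational, groebner, symbols

x, y = symbols("x y")
k1, k2 = Rational(3, 2), Rational(1, 2)   # generic numeric rate constants
# Steady-state polynomials of a toy two-species mass-action system
steady_state = [k1 * x - k2 * x * y, k2 * x * y - k1 * y]
print(groebner(steady_state, x, y, order="lex"))
```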


[37] 2506.04056

Generalized Lotka-Volterra systems with quenched random interactions and saturating nonlinear response

The generalized Lotka-Volterra (GLV) equations with quenched random interactions have been extensively used to investigate the stability and dynamics of complex ecosystems. However, the standard linear interaction model suffers from pathological unbounded growth, especially under strong cooperation or heterogeneity. This work addresses that limitation by introducing a Monod-type saturating nonlinear response into the GLV framework. Using Dynamical Mean Field Theory, we derive analytical expressions for the species abundance distribution in the Unique Fixed Point phase and show the suppression of unbounded dynamics. Numerical simulations reveal a rich dynamical structure in the Multiple Attractor phase, including a transition between high-dimensional chaotic and low-volatility regimes, governed by interaction symmetry. These findings offer a more ecologically realistic foundation for disordered ecosystem models and highlight the role of nonlinearity and symmetry in shaping the diversity and resilience of large ecological communities.
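
A schematic form of such a model (the paper's precise parametrization may differ) replaces the linear GLV interaction term with a Monod-type response $g(x) = x/(1 + x/K)$, which saturates at $K$:

\[
\dot x_i \;=\; x_i\Big(1 - x_i + \sum_{j \neq i} \alpha_{ij}\, \frac{x_j}{1 + x_j / K}\Big),
\]

bounding the per-capita effect any partner can exert and thereby removing the unbounded-growth pathology of the linear model.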


[38] 2506.12277

Evaluation of machine-learning models to measure individualized treatment effects from randomized clinical trial data with time-to-event outcomes

Objective: In randomized clinical trials, prediction models can be used to explore the relationships between patients' variables (e.g., clinical, pathological, or lifestyle variables, and also biomarker or genomic data) and treatment effect magnitude. Our aim was to evaluate flexible machine learning models capable of incorporating interactions and nonlinear effects from high-dimensional data to estimate individualized treatment recommendations in trials with time-to-event outcomes. Methods: We compared survival models based on neural networks (CoxCC and CoxTime) and random survival forests (Interaction Forests, IF) against a Cox proportional hazards model with an adaptive LASSO (ALASSO) penalty as a benchmark. For individualized treatment recommendations in the survival setting, we adapted metrics originally designed for binary outcomes to accommodate time-to-event data with censoring. These adapted metrics included the C-for-Benefit, the E50-for-Benefit, and the root mean squared error for treatment benefit. An extensive simulation study was conducted using two different data generation processes incorporating nonlinearity and interactions. The models were applied to gene expression and clinical data from three cancer clinical trial data sets. Results: In the first data generation process, neural networks outperformed ALASSO in terms of calibration while the Interaction Forests showed superior C-for-benefit performance. In the second data generation process, both machine learning methods outperformed the benchmark linear ALASSO method across discrimination, calibration, and RMSE metrics. In the cancer trial data sets, the machine learning methods often performed better than ALASSO, particularly IF in terms of C-for-benefit, and either a neural network or IF for calibration measures addressing treatment benefit.


[39] 2506.17310

PaceLLM: Brain-Inspired Large Language Models for Long-Context Understanding

While Large Language Models (LLMs) demonstrate strong performance across domains, their long-context capabilities are limited by transient neural activations causing information decay and unstructured feed-forward network (FFN) weights leading to semantic fragmentation. Inspired by the brain's working memory and cortical modularity, we propose PaceLLM, featuring two innovations: (1) a Persistent Activity (PA) Mechanism that mimics prefrontal cortex (PFC) neurons' persistent firing by introducing an activation-level memory bank to dynamically retrieve, reuse, and update critical FFN states, addressing contextual decay; and (2) Cortical Expert (CE) Clustering that emulates task-adaptive neural specialization to reorganize FFN weights into semantic modules, establishing cross-token dependencies and mitigating fragmentation. Extensive evaluations show that PaceLLM achieves 6% improvement on LongBench's Multi-document QA and 12.5-17.5% performance gains on Infinite-Bench tasks, while extending measurable context length to 200K tokens in Needle-In-A-Haystack (NIAH) tests. This work pioneers brain-inspired LLM optimization and is complementary to other works. Besides, it can be generalized to any model and enhance their long-context performance and interpretability without structural overhauls.


[40] 2511.15839

Comparing Bayesian and Frequentist Inference in Biological Models: A Comparative Analysis of Accuracy, Uncertainty, and Identifiability

Mathematical models support inference and forecasting in ecology and epidemiology, but results depend on the estimation framework. We compare Bayesian and Frequentist approaches across three biological models using four datasets: Lotka-Volterra predator-prey dynamics (Hudson Bay), a generalized logistic model (lung injury and 2022 U.S. mpox), and an SEIUR epidemic model (COVID-19 in Spain). Both approaches use a normal error structure to ensure a fair comparison. We first assessed structural identifiability to determine which parameters can theoretically be recovered from the data. We then evaluated practical identifiability and forecasting performance using four metrics: mean absolute error (MAE), mean squared error (MSE), 95 percent prediction interval (PI) coverage, and weighted interval score (WIS). For the Lotka-Volterra model with both prey and predator data, we analyzed three scenarios: prey only, predator only, and both. The Frequentist workflow used QuantDiffForecast (QDF) in MATLAB, which fits ODE models via nonlinear least squares and quantifies uncertainty through parametric bootstrap. The Bayesian workflow used BayesianFitForecast (BFF), which employs Hamiltonian Monte Carlo sampling via Stan to generate posterior distributions and diagnostics such as the Gelman-Rubin R-hat statistic. Results show that Frequentist inference performs best when data are rich and fully observed, while Bayesian inference excels when latent-state uncertainty is high and data are sparse, as in the SEIUR COVID-19 model. Structural identifiability clarifies these patterns: full observability benefits both frameworks, while limited observability constrains parameter recovery. This comparison provides guidance for choosing inference frameworks based on data richness, observability, and uncertainty needs.
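
As an illustration of the Frequentist workflow described above (nonlinear least squares on an ODE model plus parametric bootstrap under normal errors), here is a minimal scipy sketch on synthetic Lotka-Volterra data. QDF and BFF themselves are not used, and all parameter values and settings are illustrative.

```python
# Minimal sketch of ODE fitting by nonlinear least squares with a parametric
# bootstrap, on synthetic Lotka-Volterra data; illustrative settings only.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

def lv(t, z, a, b, c, d):
    x, y = z
    return [a * x - b * x * y, c * x * y - d * y]

def simulate(theta, t, z0):
    sol = solve_ivp(lv, (t[0], t[-1]), z0, t_eval=t, args=tuple(theta))
    return sol.y.T

def residuals(theta, t, z0, data):
    return (simulate(theta, t, z0) - data).ravel()

t = np.linspace(0, 20, 60)
true, z0 = [1.0, 0.1, 0.075, 1.5], [10.0, 5.0]
rng = np.random.default_rng(1)
data = simulate(true, t, z0) + rng.normal(0, 0.5, (60, 2))  # normal errors

fit = least_squares(residuals, x0=[0.5, 0.05, 0.05, 1.0],
                    args=(t, z0, data), bounds=(0, np.inf))
sigma = np.std(residuals(fit.x, t, z0, data))

# Parametric bootstrap: refit to data resimulated from the fitted model.
best = simulate(fit.x, t, z0)
boots = [least_squares(residuals, x0=fit.x,
                       args=(t, z0, best + rng.normal(0, sigma, best.shape)),
                       bounds=(0, np.inf)).x for _ in range(50)]
print("bootstrap 95% CIs:", np.percentile(boots, [2.5, 97.5], axis=0))
```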


[41] 2512.20581

MERGE-RNA: a physics-based model to predict RNA secondary structure ensembles with chemical probing

RNA function is deeply tied to secondary structure, with most molecules operating through dynamic and heterogeneous structural ensembles. While current analysis tools typically output single static structures or averaged contact maps, chemical probing methods like DMS capture nucleotide-resolution signals that reflect the full structural ensemble but remain difficult to interpret structurally. To address this, we present MERGE-RNA, a framework that explicitly describes and outputs RNA as a structural ensemble. By modeling the experimental pipeline through a statistical physics framework, MERGE-RNA learns a small set of transferable and interpretable parameters, enabling the seamless integration of measurements across different molecules, probe concentrations, and replicates in a single optimization to improve robustness. Our model employs a maximum-entropy principle to predict thermodynamic populations, introducing only the minimal sequence-specific adjustments necessary to align the ensemble with experimental data. We validate MERGE-RNA on diverse RNAs, showing that it achieves strong structural accuracy and the ability to fill knowledge gaps in single-conformation reference structures. Furthermore, in a designed RNA construct for which we report new DMS data, MERGE-RNA deconvolves mixed states to reveal transient intermediate populations involved in strand displacement, dynamics that remain invisible to traditional analysis methods.
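
The maximum-entropy principle mentioned above has a standard one-constraint form that a short sketch can make concrete: tilt the prior populations exponentially, just enough that a predicted ensemble average matches the measurement. MERGE-RNA's actual likelihood (per-nucleotide DMS signals, transferable parameters) is richer than this, and the numbers below are invented.

```python
# Minimal sketch of one-constraint maximum-entropy ensemble reweighting:
# solve for the Lagrange multiplier that matches the measured average.
# MERGE-RNA's actual model is richer; all numbers here are invented.
import numpy as np
from scipy.optimize import brentq

p0 = np.array([0.6, 0.3, 0.1])     # prior populations of 3 conformations
f = np.array([0.2, 0.7, 0.9])      # predicted reactivity per conformation
f_exp = 0.5                        # measured ensemble-average reactivity

def gap(lmbda):
    w = p0 * np.exp(-lmbda * f)    # max-ent exponential tilt of the prior
    w /= w.sum()
    return w @ f - f_exp

lmbda = brentq(gap, -50, 50)       # root of <f>_lambda - f_exp
p = p0 * np.exp(-lmbda * f); p /= p.sum()
print("reweighted populations:", p.round(3))
```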


[42] 2601.19002

Y-Trim: Evidence-gated per-read trimming for stochastic end artifacts in ssWGBS

Single-stranded whole-genome bisulfite sequencing (ssWGBS) profiles DNA methylation in low-input and fragmented samples, yet Adaptase-mediated tailing creates stochastic artifacts that obscure true genomic ends. Current deterministic trimming methods struggle because, as we show, bisulfite-induced degeneracy creates a locally indistinguishable regime with a strictly positive Bayes error, making exact per-read boundary recovery ill-posed from FASTQ observables alone. As a result, fixed trimming rules inevitably mix over-trimming (avoidable genomic loss) and under-trimming (residual artifactual signal) under heterogeneous tail-length regimes, amplifying end-proximal bias in methylation-relevant summaries. Here we present Y-Trim, an evidence-gated trimming framework that operationalizes these constraints. Y-Trim separates admission from inference: sample-level gating activates trimming only when chemistry-consistent evidence is present, and read-level logic treats Read 2 and Read 1 asymmetrically, reflecting kinetic tailing versus conditional read-through geometry. With explicit safeguards -- including abstention under non-separable evidence -- Y-Trim bounds action without forcing false precision. Using a chemistry-consistent simulator with ground truth and a 34-sample public cohort (CCGB-34), we show that Y-Trim yields an interpretable retention-risk frontier on Read 2 while revealing feasibility-limited behavior on Read 1, and achieves competitive end-proximal artifact suppression relative to common fixed-length practice. Y-Trim provides a practical, uncertainty-aware preprocessing step for high-precision ssWGBS methylation studies.
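
A toy version of evidence gating with abstention can be written as a likelihood-ratio test with a no-decision band; Y-Trim's actual read-level logic (R1/R2 asymmetry, chemistry-consistent gating) is far more involved, and the base-composition probabilities and margin below are invented for illustration.

```python
# Toy evidence-gated trimming decision with abstention; the artifact/genomic
# base-composition probabilities and the margin are invented for illustration.
import math

def decide_trim(tail, p_art=0.7, p_gen=0.4, margin=2.0):
    """Return 'trim', 'keep', or 'abstain' from a per-base log-likelihood
    ratio comparing an artifact model (tail-alphabet-rich) to a genomic one."""
    llr = sum(math.log(p_art / p_gen) if b in "CT"
              else math.log((1 - p_art) / (1 - p_gen)) for b in tail)
    if llr > margin:
        return "trim"
    if llr < -margin:
        return "keep"
    return "abstain"  # non-separable evidence: bound action, no false precision

print(decide_trim("CTCTTTCC"), decide_trim("GAGATTAG"), decide_trim("CTGA"))
```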


[43] 2407.11498

Thermodynamic Space of Chemical Reaction Networks

Living systems operate out of equilibrium, continuously consuming energy to sustain organised, functional states. Their emergent behaviour usually relies on a set of interconnected chemical reaction networks (CRNs) driven by external fluxes that keep some species at fixed concentrations. Hence, uncovering the principles governing the functioning of these CRNs is crucial to understand how living systems generate and regulate complexity. While kinetics plays a key role in shaping detailed dynamical phenomena, the range of operations of a CRN is fundamentally constrained by thermodynamics. Here, we introduce and analytically derive the "thermodynamic space" of a CRN, i.e., the range of accessible stationary concentrations that can be realized under a given energetic budget. We establish analogous bounds for reaction affinities, shedding light on how global thermodynamic properties, such as the total non-equilibrium driving, can limit local non-equilibrium quantities. We illustrate our results in various paradigmatic examples, demonstrating how the onset of complex behaviors is intimately intertwined with the presence of non-equilibrium conditions. By providing a general tool for analysing CRNs, the presented framework constitutes a stepping stone to deepen our ability to predict complex out-of-equilibrium phenomena and design artificial chemical systems, starting from knowledge of the underlying thermodynamic properties alone.
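
For orientation, the standard quantities behind this kind of analysis can be stated compactly; this is textbook mass-action CRN thermodynamics, not the paper's derivation.

```latex
% Textbook mass-action CRN thermodynamics (not the paper's derivation):
% affinity of reaction rho and non-negativity of the total driving.
\[
  A_\rho \;=\; RT \,\ln \frac{k_{+\rho}\prod_i [X_i]^{\nu^-_{i\rho}}}
                             {k_{-\rho}\prod_i [X_i]^{\nu^+_{i\rho}}},
  \qquad
  \dot{\Sigma} \;=\; \frac{1}{T}\sum_\rho \bigl(J_{+\rho}-J_{-\rho}\bigr)\,A_\rho
  \;\ge\; 0,
\]
% where nu^-/nu^+ are reactant/product stoichiometries and J_{+/-rho} the
% forward/backward mass-action fluxes. A fixed budget on the total driving
% caps the affinities, which in turn bounds the stationary concentrations
% the network can sustain -- the intuition behind a "thermodynamic space".
```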


[44] 2502.07272

GENERator: A Long-Context Generative Genomic Foundation Model

The rapid advancement of DNA sequencing has produced vast genomic datasets, yet interpreting and engineering genomic function remain fundamental challenges. Recent large language models have opened new avenues for genomic analysis, but existing approaches are often limited by restricted training scope, constrained generative capability, or prohibitive computational cost. We introduce GENERator, a generative genomic foundation model for long-context DNA modeling, with a context length of 98k nucleotides, pre-trained on 386 billion nucleotides of eukaryotic DNA. Without task-specific fine-tuning, GENERator exhibits strong intrinsic capabilities: unsupervised embedding analyses reveal phylogenetically coherent structure, and sequence recovery benchmarks demonstrate generative accuracy comparable to or exceeding state-of-the-art models with substantially improved computational efficiency. In a zero-shot setting, GENERator achieves competitive variant effect prediction performance relative to alignment-based methods, while remaining fully alignment-free and broadly applicable across species. With task-specific fine-tuning, the model attains leading performance on established genomic benchmarks. We further demonstrate practical generative applications. GENERator can generate protein-coding DNA sequences that translate into structurally plausible proteins and, through a prompt-guided design framework, design cis-regulatory elements with targeted activity profiles, including synthetic super-enhancers validated by high-throughput UMI-STARR-seq assays. Together, these results establish GENERator as an efficient and biologically grounded framework for genomic interpretation and programmable sequence design. Code and supplementary resources are available at this https URL.
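
A common recipe for the zero-shot variant effect prediction mentioned above scores a variant by the log-likelihood ratio between the alternate and reference sequences under the generative model. The sketch below assumes a hypothetical `model.log_likelihood` interface, not GENERator's published API.

```python
# Common zero-shot variant-effect recipe with a generative DNA LM: score by
# the log-likelihood ratio of alternate vs. reference sequence. The
# `model.log_likelihood` interface is hypothetical, not GENERator's API.
def variant_effect_score(model, ref_seq: str, pos: int, alt_base: str) -> float:
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    # Negative score: the variant makes the sequence less likely under the
    # model, suggesting a disruptive effect; alignment-free by construction.
    return model.log_likelihood(alt_seq) - model.log_likelihood(ref_seq)
```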


[45] 2503.23189

A mean-field theory for heterogeneous random growth with redistribution

We study the competition between random multiplicative growth and redistribution/migration in the mean-field limit, when the number of sites is very large but finite. We find that for static random growth rates, migration should be strong enough to prevent localisation, i.e. extreme concentration on the fastest growing site. In the presence of an additional temporal noise in the growth rates, a third partially localised phase is predicted theoretically, using results from Derrida's Random Energy Model. Such temporal fluctuations mitigate concentration effects, but do not make them disappear. We discuss our results in the context of population growth and wealth inequalities.
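
A minimal simulation makes the localisation question concrete: random multiplicative growth with quenched rates plus uniform redistribution toward the mean. Parameter values are illustrative; since the dynamics are homogeneous of degree one in the abundances, total mass can be renormalized at each step without changing the shares.

```python
# Minimal sketch: dx_i = x_i (a_i dt + sigma dW_i) + J (xbar - x_i) dt with
# quenched rates a_i; the largest site's share of total mass tracks
# localisation. Parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
N, J, sigma, dt, steps = 1000, 0.05, 0.4, 1e-3, 100_000
a = rng.normal(0.0, 1.0, N)           # static (quenched) growth rates
x = np.ones(N)

for _ in range(steps):
    dW = rng.normal(0.0, np.sqrt(dt), N)
    x += x * (a * dt + sigma * dW) + J * (x.mean() - x) * dt
    x = np.maximum(x, 1e-12)
    x /= x.mean()   # dynamics are degree-1 homogeneous in x, so shares are
                    # unchanged by this rescaling; it only prevents overflow

print("max-site share:", x.max() / x.sum())  # near 1 signals localisation
```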


[46] 2509.14053

Trade-offs between structural richness and communication efficiency in music network representations

Music is a structured and perceptually rich sequence of sounds in time with well-defined symbolic features, whose perception is shaped by the interplay of expectation and uncertainty. Network science offers a powerful framework for studying its structural organization and communication efficiency. However, it remains unclear how feature selection affects the properties of reconstructed networks and perceptual alignment. Here, we systematically compare eight encodings of musical sequences, ranging from single-feature descriptions to richer multi-feature combinations. We show that representational choices fundamentally shape network topology, the distribution of uncertainty, and the estimated communication efficiency under perceptual constraints. Single-feature representations compress sequences into dense transition structures that support efficient communication, yielding high entropy rates with low modeled perceptual error, but they discard structural richness. By contrast, multi-feature representations preserve descriptive detail and structural specificity, expanding the state space and producing sharper transition profiles and lower entropy rates, which leads to higher modeled perceptual error. Across representations, we find that uncertainty increasingly concentrates in nodes with higher diffusion-based centrality while their perceptual error remains low, unveiling an interplay between predictable structure and localized surprise. Together, these results show that feature choice directly shapes music network representation, describing trade-offs between descriptive richness and communication efficiency and suggesting structural conditions that may support efficient learning and prediction.
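
The entropy rate underlying these trade-offs can be computed directly from the empirical transition network. The toy melody below is an invented example, but it reproduces the qualitative effect described above: a richer (multi-feature) encoding expands the state space and lowers the entropy rate.

```python
# Minimal sketch: build a transition network from a symbolic sequence and
# compute its entropy rate H = -sum_i pi_i sum_j P_ij log2 P_ij.
import numpy as np

def entropy_rate(sequence):
    states = sorted(set(sequence))
    idx = {s: k for k, s in enumerate(states)}
    C = np.zeros((len(states), len(states)))
    for a, b in zip(sequence, sequence[1:]):
        C[idx[a], idx[b]] += 1
    P = C / np.maximum(C.sum(axis=1, keepdims=True), 1)   # transition matrix
    pi = C.sum(axis=1) / C.sum()                # empirical state frequencies
    with np.errstate(divide="ignore", invalid="ignore"):
        H_rows = -np.nansum(np.where(P > 0, P * np.log2(P), 0.0), axis=1)
    return float(pi @ H_rows)

melody = list("CDECDEGFEDCDC")                  # single feature: pitch only
rich = list(zip(melody, [1, 1, 2] * 4 + [2]))   # multi-feature: pitch+duration
print(entropy_rate(melody), entropy_rate(rich))
```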


[47] 2509.15748

Hybrid Lie semi-group and cascade structures for the generalized Gaussian derivative model for visual receptive fields

Because of the variabilities of real-world image structures under the natural image transformations that arise when observing similar objects or spatio-temporal events under different viewing conditions, the receptive field responses computed in the earliest layers of the visual hierarchy may be strongly influenced by such geometric image transformations. One way of handling this variability is by basing the vision system on covariant receptive field families, which expand the receptive field shapes over the degrees of freedom in the image transformations. This paper addresses the problem of deriving relationships between spatial and spatio-temporal receptive field responses obtained for different values of the shape parameters in the resulting multi-parameter families of receptive fields. For this purpose, we derive both (i) infinitesimal relationships, roughly corresponding to a combination of notions from semi-groups and Lie groups, as well as (ii) macroscopic cascade smoothing properties, which describe how receptive field responses at coarser spatial and temporal scales can be computed by applying smaller support incremental filters to the output from corresponding receptive fields at finer spatial and temporal scales, structurally related to the notion of Lie algebras, although with directional preferences. The presented results provide (i) a deeper understanding of the relationships between spatial and spatio-temporal receptive field responses for different values of the filter parameters, which can be used for both (ii) designing more efficient schemes for computing receptive field responses over populations of multi-parameter families of receptive fields, as well as (iii) formulating idealized theoretical models of the computations of simple cells in biological vision.
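
The cascade smoothing property in (ii) is easy to verify numerically in the simplest isotropic spatial Gaussian case, where scale parameters add in variance: smoothing to scale s2 equals smoothing the scale-s1 output with an incremental kernel of standard deviation sqrt(s2^2 - s1^2). This checks only that special case, not the paper's full multi-parameter spatio-temporal families.

```python
# Numerical check of cascade smoothing for the isotropic Gaussian case:
# variances add, so g_{s2} * f == g_{sqrt(s2^2 - s1^2)} * (g_{s1} * f).
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(3)
f = rng.normal(size=512)
s1, s2 = 2.0, 5.0

direct = gaussian_filter1d(f, sigma=s2, mode="wrap")
cascade = gaussian_filter1d(gaussian_filter1d(f, sigma=s1, mode="wrap"),
                            sigma=np.sqrt(s2**2 - s1**2), mode="wrap")
print("max deviation:", np.abs(direct - cascade).max())  # small, up to kernel truncation
```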


[48] 2511.06356

Reaction Prediction via Interaction Modeling of Symmetric Difference Shingle Sets

Chemical reaction prediction remains a fundamental challenge in organic chemistry, where existing machine learning models face two critical limitations: sensitivity to input permutations (molecule/atom orderings) and inadequate modeling of substructural interactions governing reactivity. These shortcomings lead to inconsistent predictions and poor generalization to real-world scenarios. To address these challenges, we propose ReaDISH, a novel reaction prediction model that learns permutation-invariant representations while incorporating interaction-aware features. It introduces two innovations: (1) symmetric difference shingle encoding, which extends the differential reaction fingerprint (DRFP) by representing shingles as continuous high-dimensional embeddings, capturing structural changes while eliminating order sensitivity; and (2) geometry-structure interaction attention, a mechanism that models intra- and inter-molecular interactions at the shingle level. Extensive experiments demonstrate that ReaDISH improves reaction prediction performance across diverse benchmarks. It shows enhanced robustness with an average improvement of 8.76% on R$^2$ under permutation perturbations.
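
To make the symmetric-difference idea concrete, here is a toy sketch in the spirit of DRFP, crudely approximating shingles with SMILES character k-grams for illustration (real DRFP extracts circular molecular substructures from the molecule graphs): shingles present on exactly one side of a reaction encode what changed, and set operations make the encoding order-invariant, which is the permutation invariance ReaDISH builds on before replacing the binary fingerprint with learned continuous shingle embeddings.

```python
# Toy sketch of symmetric-difference shingle sets; SMILES k-grams stand in
# for DRFP's circular molecular substructures, purely for illustration.
def shingles(smiles: str, k: int = 3) -> set:
    return {smiles[i:i + k] for i in range(len(smiles) - k + 1)}

def reaction_shingle_set(reactants, products, k: int = 3) -> set:
    r = set().union(*(shingles(s, k) for s in reactants))
    p = set().union(*(shingles(s, k) for s in products))
    return r ^ p   # symmetric difference: invariant to molecule/atom ordering

# Esterification: ethanol + acetic acid -> ethyl acetate + water
diff = reaction_shingle_set(["CCO", "CC(=O)O"], ["CC(=O)OCC", "O"])
print(sorted(diff))
```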