Genomic sequences span billions of base pairs (bp), posing a fundamental challenge for genome-scale foundation models. Existing approaches largely sidestep this barrier by either scaling relatively small models to long contexts or relying on heavy multi-GPU parallelism. Here we introduce GeneZip, a DNA compression model that leverages a key biological prior: genomic information is highly imbalanced. Coding regions comprise only a small fraction of the genome (about 2%) yet are information-dense, whereas most non-coding sequence is comparatively information-sparse. GeneZip couples HNet-style dynamic routing with a region-aware compression-ratio objective, enabling adaptive allocation of representation budget across genomic regions. As a result, GeneZip learns region-aware compression and achieves 137.6x compression with only a 0.31 increase in perplexity. On downstream long-context benchmarks, GeneZip achieves comparable or better performance on contact map prediction, expression quantitative trait loci (eQTL) prediction, and enhancer-target gene prediction. By reducing effective sequence length, GeneZip unlocks simultaneous scaling of context and capacity: compared to the prior state-of-the-art model JanusDNA, it enables training models 82.6x larger at 1M-bp context, supporting a 636M-parameter GeneZip model at that context length. All experiments in this paper can be trained on a single A100 80GB GPU.
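As a rough illustration of what a region-aware compression-ratio objective could look like, the sketch below penalizes each region's expected compression ratio for deviating from a region-specific target; the function name, the routing probabilities, and the target ratios are hypothetical stand-ins, not GeneZip's actual objective.

```python
# Hypothetical sketch of a region-aware compression-ratio penalty;
# names and target ratios are illustrative, not from the paper.
import numpy as np

def region_ratio_penalty(keep_prob, coding_mask,
                         target_coding=0.5, target_noncoding=0.005):
    """keep_prob: per-base probability that the router keeps a token boundary.
    coding_mask: 1 where the base lies in an annotated coding region.
    Penalize each region's expected compression ratio (kept / total)
    for deviating from its region-specific target."""
    coding_ratio = keep_prob[coding_mask == 1].mean()
    noncoding_ratio = keep_prob[coding_mask == 0].mean()
    return (coding_ratio - target_coding) ** 2 + \
           (noncoding_ratio - target_noncoding) ** 2

rng = np.random.default_rng(0)
keep_prob = rng.uniform(size=10_000)
coding_mask = (rng.uniform(size=10_000) < 0.02).astype(int)  # ~2% coding
print(region_ratio_penalty(keep_prob, coding_mask))
```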
Predicting the functional consequences of genetic variants in crop genes remains a critical bottleneck for precision breeding programs. We present AgriVariant, an end-to-end pipeline for variant-effect prediction in rice (Oryza sativa) that addresses the lack of crop-specific variant-interpretation tools and can be extended to any crop species with available reference genomes and gene annotations. Our approach integrates deep learning-based variant calling (DeepVariant) with custom plant genomics annotation using RAP-DB gene models and database-independent deleteriousness scoring that combines the Grantham distance and the BLOSUM62 substitution matrix. We validate the pipeline through targeted mutations in stress-response genes (OsDREB2a, OsDREB1F, SKC1), demonstrating correct classification of stop-gained, missense, and synonymous variants with appropriate HIGH / MODERATE / LOW impact assignments. An exhaustive mutagenesis study of OsMT-3a analyzed all 1,509 possible single-nucleotide variants in 10 days, identifying 353 high-impact, 447 moderate-impact, and 709 low-impact variants, an analysis that would have required 2-4 years using traditional wet-lab approaches. This computational framework enables breeders to prioritize variants for experimental validation across diverse crop species, reducing screening costs and accelerating the development of climate-resilient crop varieties.
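The deleteriousness scoring can be made concrete. The following minimal sketch combines the Grantham distance with BLOSUM62 (loaded via Biopython); the excerpted Grantham values come from the published 1974 table, but the equal-weight combination rule and the impact cut-offs are illustrative assumptions, not the pipeline's actual parameters.

```python
# Minimal sketch of database-independent missense scoring combining
# the Grantham distance with BLOSUM62; the combination rule and the
# impact thresholds are assumptions for illustration.
from Bio.Align import substitution_matrices

BLOSUM62 = substitution_matrices.load("BLOSUM62")

# Small excerpt of Grantham (1974) distances; the full table covers all pairs.
GRANTHAM = {("L", "I"): 5, ("F", "Y"): 22, ("R", "K"): 26,
            ("D", "E"): 45, ("S", "T"): 58, ("A", "V"): 64}

def deleteriousness(ref_aa, alt_aa):
    # default of 100 for pairs missing from this excerpt (illustrative)
    g = GRANTHAM.get((ref_aa, alt_aa)) or GRANTHAM.get((alt_aa, ref_aa), 100)
    b = BLOSUM62[ref_aa, alt_aa]
    # Normalize Grantham to [0, 1] (maximum distance is 215) and penalize
    # negative BLOSUM62 scores; equal weighting is an assumption.
    return 0.5 * (g / 215.0) + 0.5 * (1.0 if b < 0 else 0.0)

def impact(score):  # hypothetical cut-offs
    return "HIGH" if score > 0.6 else "MODERATE" if score > 0.3 else "LOW"

s = deleteriousness("D", "E")  # conservative Asp -> Glu substitution
print(s, impact(s))            # low score, LOW impact
```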
Scaling laws in biological neural networks have long been investigated. From 1/f noise to neuronal avalanches, evidence of scaling in brain activity has been increasingly linked to tuning at or near criticality. The concept of scaling is intimately related to the renormalization group (RG), which in essence provides coarse-grained, simplified descriptions that generalize to classes of diverse physical systems. Following the RG idea, a coarse-graining scheme has recently been proposed for populations of real neurons, and scaling behaviors in collective quantities have been reported in the hippocampus and in different areas of the rat cortex. To bridge the gap between neuronal population scales and species, here we consider large-scale electrophysiological recordings of human brain activity in the awake resting state. We demonstrate robust scaling behaviors of collective dynamics across coarse-graining scales, with exponents close to those measured in populations of spiking neurons. Further, we show that the dynamics of neuronal avalanches, scale-free cascades of neural activity, are invariant under the proposed coarse-graining approach. Simulations of a non-equilibrium adaptive Ising model inferred from data and capable of reproducing a large repertoire of resting-state brain dynamics indicate that the scaling behaviors of resting human brain activity emerge close to criticality and depend on the excitation/inhibition (E/I) balance of the network. While extending the range of validity of previous observations at small spatial scales and pointing to common scaling laws in mammals, these results open the way to a robust, currently missing, non-invasive approach to estimating the E/I balance, a key quantity in neuroscience research.
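The coarse-graining scheme referenced above, greedily pairing maximally correlated units and summing their activity, can be sketched as follows; the Poisson surrogate data and the cluster sizes are illustrative only.

```python
# Compact sketch of correlation-based coarse-graining of neural activity
# in the spirit of the RG scheme described above; surrogate data only.
import numpy as np

def coarse_grain(x):
    """Greedily pair the most-correlated rows of x (units x time)
    and sum each pair, halving the number of units."""
    n = x.shape[0] - x.shape[0] % 2
    c = np.corrcoef(x[:n])
    np.fill_diagonal(c, -np.inf)
    paired, out = set(), []
    # visit candidate pairs from most to least correlated
    for i, j in zip(*np.unravel_index(np.argsort(-c, axis=None), c.shape)):
        if i < j and i not in paired and j not in paired:
            paired |= {i, j}
            out.append(x[i] + x[j])
    return np.array(out)

rng = np.random.default_rng(1)
x = rng.poisson(1.0, size=(128, 5000)).astype(float)
for k in range(4):
    x = coarse_grain(x)
    # track how variance of the coarse-grained variables scales with
    # cluster size K = 2**(k+1)
    print(2 ** (k + 1), x.var(axis=1).mean())
```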
One of the longest-standing open problems in science is how life arises from non-living matter. If it is possible to measure this transition in the lab, then it might be possible to understand the physical mechanisms by which the emergence of life occurs, which have so far evaded scientific understanding. A significant hurdle is the lack of standards or a framework for cross-comparison across different experimental contexts and planetary environments. In this essay, I review current challenges in experimental approaches to origin-of-life chemistry, focusing on those associated with quantifying experimental selectivity versus de novo generation of molecular complexity, and I highlight new methods using molecular assembly theory to measure molecular complexity. This metrology-centered approach can enable rigorous testing of hypotheses about the cascade of major transitions in molecular order marking the emergence of life, while potentially bridging traditional divides between metabolism-first and genetics-first scenarios. Grounding the study of life's origins in measurable complexity has significant implications for the search for life beyond Earth, suggesting paths toward theory-driven detection of biological complexity in diverse planetary contexts. As the field moves forward, standardized measurements of molecular complexity may help unify currently disparate approaches to understanding how matter transforms into life. Much remains to be done in this exciting frontier.
Contraction of the left ventricle of the heart increases aortic root blood pressure (P), diameter (D) and blood velocity (U). When contraction diminishes, all three properties decrease. These perturbations propagate down the systemic arteries as the S wave and D wave, respectively. Peak carotid artery S-wave intensity is diminished and delayed in heart failure with reduced ejection fraction (HFrEF). A clinical trial demonstrated that these changes can be used to detect HFrEF with high sensitivity and specificity. Assessment of wave intensity and timing conventionally requires high-frequency, temporally and spatially coincident measurement of changes in P and U or D and U over the cardiac cycle. The practical difficulty of making such measurements accurately and noninvasively limits clinical utility. Here we test simpler methods by using numerical models of wave propagation and data from the clinical trial. We show that methods based on measuring only one of P, D or U can provide good surrogates for the full P-U and D-U methods. The best results were obtained when using measurement of D to assess wave timing. These gave receiver operating characteristic (ROC) curves indistinguishable from those based on the full D-U method, with areas under the curve of up to 0.905 when timing was anchored to the ECG rather than to other waves. Measuring vessel diameter over the cardiac cycle is technically simple and would be a cost-effective way of screening for HFrEF in primary care. Other metrics, similarly measured, might also allow screening for heart failure with preserved ejection fraction (HFpEF).
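Net wave intensity is conventionally computed as dI = (dP/dt)(dU/dt) from coincident pressure and velocity measurements. A minimal sketch with synthetic waveforms follows; the waveform shapes, sampling rate, and units are illustrative, not clinical data.

```python
# Sketch: net wave intensity dI = (dP/dt)(dU/dt) on toy waveforms.
import numpy as np

fs = 1000.0                                 # Hz; high-frequency sampling
t = np.arange(0, 1.0, 1 / fs)               # one cardiac cycle
P = 80 + 40 * np.sin(2 * np.pi * t) ** 2    # mmHg, toy pressure wave
U = 0.2 + 0.8 * np.sin(2 * np.pi * t) ** 2  # m/s, toy velocity wave

dP = np.gradient(P, 1 / fs)
dU = np.gradient(U, 1 / fs)
dI = dP * dU                                # net wave intensity (arbitrary units here)

s_peak = np.argmax(dI)                      # forward-compression (S) wave peak
print(f"S-wave peak at t = {t[s_peak] * 1000:.0f} ms, dI = {dI[s_peak]:.1f}")
```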
First-passage times are often the most relevant aspect of a complex Markovian network, because they signify when information processing has resulted in a definite decision. Previous studies have shown that for kinetic proofreading networks, in the limit of large network size, the first-passage time distribution converges either to a delta or to an exponential distribution. Remarkably, these two forms correspond to the two extreme distributions of minimal and maximal entropy for a fixed mean, respectively. Here we build on the connection between first-passage times and graph theory to show that these two limits are not model-specific, but arise generically in Markovian networks from the distribution of the eigenvalues of the generator matrix. A deterministic peak emerges when infinitely many eigenvalues contribute, while the exponential limit arises from a single dominant eigenvalue. We also show that the exponential limit emerges robustly for reversible networks when a backward bias exists. In contrast, the deterministic limit is not obtained from a simple reversal of this condition, but under structurally tighter conditions, revealing a fundamental asymmetry between the two regimes. Our theoretical analysis is illustrated and validated by computer simulations of one-step master equations and random networks.
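For intuition, the first-passage time density of a finite Markov network can be computed directly from the transient block of the generator matrix. The sketch below uses a one-step chain with a backward bias, one of the settings discussed above; the rates and network size are illustrative.

```python
# Sketch: first-passage time density to the final state of a linear
# Markov chain, computed from the transient generator.
import numpy as np
from scipy.linalg import expm

n = 30                         # transient states; absorption beyond state n-1
kf, kb = 1.0, 0.2              # forward / backward rates (backward bias)
Q = np.zeros((n, n))
for i in range(n):
    if i + 1 < n: Q[i, i + 1] = kf
    if i - 1 >= 0: Q[i, i - 1] = kb
    Q[i, i] = -(kf + (kb if i > 0 else 0.0))
exit_rate = -Q.sum(axis=1)     # flux into the absorbing state
p0 = np.zeros(n); p0[0] = 1.0

ts = np.linspace(0.01, 200, 400)
f = np.array([p0 @ expm(Q * t) @ exit_rate for t in ts])  # FPT density
print("mass captured:", f.sum() * (ts[1] - ts[0]))
# the slowest (dominant) eigenvalue controls the exponential limit
print("slowest relaxation rate:", -np.linalg.eigvals(Q).real.max())
```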
Biological neural networks (like the hippocampus) can internally generate "replay" resembling stimulus-driven activity. Recent computational models of replay use noisy recurrent neural networks (RNNs) trained to path-integrate. Replay in these networks has been described as Langevin sampling, but recently proposed modifications of noisy RNN replay go beyond this description. We re-examine noisy RNN replay as sampling to understand or improve it in three ways: (1) Under simple assumptions, we prove that the gradients that replay activity should follow are time-varying and difficult to estimate, but that they readily motivate the use of hidden-state leakage in RNNs for replay. (2) We confirm that hidden-state adaptation (negative feedback) encourages exploration in replay, but show that it incurs non-Markov sampling that also slows replay. (3) We propose the first model of temporally compressed replay in noisy path-integrating RNNs through hidden-state momentum, connect it to underdamped Langevin sampling, and show that, together with adaptation, it counters slowness while maintaining exploration. We verify our findings via path-integration of 2D triangular and T-maze paths and of high-dimensional paths of synthetic rat place cell activity.
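A schematic of these dynamics, combining hidden-state leak, adaptation, and momentum in an underdamped-Langevin-style update, is sketched below; the energy landscape grad_E and all coefficients are illustrative stand-ins, not the trained RNN.

```python
# Schematic replay update with leak, adaptation, and momentum; toy landscape.
import numpy as np

rng = np.random.default_rng(0)

def grad_E(h):                       # toy "memory" energy landscape (assumed)
    return h - np.tanh(2 * h)

h = rng.normal(size=16)              # hidden state
v = np.zeros_like(h)                 # momentum (velocity) variable
a = np.zeros_like(h)                 # adaptation (negative feedback)
dt, leak, adapt_gain, gamma, sigma = 0.01, 1.0, 0.5, 2.0, 0.3

for step in range(10_000):
    a += dt * (h - a) / 5.0                       # slow adaptation variable
    drift = -leak * grad_E(h) - adapt_gain * a    # leak follows the gradient
    v += dt * (drift - gamma * v) \
         + np.sqrt(2 * gamma * dt) * sigma * rng.normal(size=16)
    h += dt * v                                   # momentum compresses replay
print(h[:4])
```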
Hemodynamic quantities are valuable biomedical risk factors for cardiovascular pathologies such as atherosclerosis. Non-invasive, in-vivo measurement of these quantities can only be performed using a select number of modalities that are not widely available, such as 4D flow magnetic resonance imaging (MRI). In this work, we create a surrogate model for hemodynamic flow field estimation, powered by machine learning. We train graph neural networks that include priors about the underlying symmetries and physics, limiting the amount of data required for training. This allows us to train the model using moderately sized, in-vivo 4D flow MRI datasets, instead of large in-silico datasets obtained by computational fluid dynamics (CFD), as is the current standard. We create an efficient, equivariant neural network by combining the popular PointNet++ architecture with group-steerable layers. To incorporate the physics-informed priors, we derive an efficient discretisation scheme for the involved differential operators. We perform extensive experiments and show that our model can accurately estimate low-noise hemodynamic flow fields in carotid arteries. Moreover, we show how the learned relation between geometry and hemodynamic quantities transfers to 3D vascular models obtained using a different imaging modality than the training data. This shows that physics-informed graph neural networks can be trained using 4D flow MRI data to estimate blood flow in unseen carotid artery geometries.
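One standard way to discretise a differential operator on a point cloud, weighted least squares over k nearest neighbours, can be sketched as follows; this illustrates the flavour of such a scheme rather than the exact discretisation derived in the paper.

```python
# Sketch: least-squares gradient estimation on a 3D point cloud.
import numpy as np
from scipy.spatial import cKDTree

def pointcloud_gradient(points, values, k=8):
    tree = cKDTree(points)
    grads = np.zeros_like(points)
    for i, p in enumerate(points):
        _, idx = tree.query(p, k=k + 1)
        d = points[idx[1:]] - p          # neighbour offsets (skip self)
        b = values[idx[1:]] - values[i]  # finite differences
        grads[i], *_ = np.linalg.lstsq(d, b, rcond=None)
    return grads

rng = np.random.default_rng(0)
pts = rng.normal(size=(500, 3))
vals = pts @ np.array([1.0, -2.0, 0.5])   # linear field with known gradient
print(pointcloud_gradient(pts, vals)[0])  # ~ [1, -2, 0.5]
```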
Large genomic and imaging datasets can be used to train models that learn meaningful representations of cellular systems. Across domains, model performance improves predictably with dataset size and compute budget, providing a basis for allocating data and computation. Scientific data, however, is also limited by noise arising from factors such as molecular undersampling, sequencing errors, and image resolution. By fitting 1,670 representation learning models across three data modalities (gene expression, sequence, and image data), we show that noise defines a distinct axis along which performance improves. Noise scaling follows a logarithmic law. We derive the law from a model of noise propagation and use it to define noise sensitivity and model capacity as benchmarking metrics. We show that protein sequence representations are noise-robust while single-cell transcriptomics models are not, with a Transformer-based model showing greater noise robustness but lower saturating performance than a variational autoencoder model. Noise scaling metrics may support future model evaluation and experimental design.
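A minimal sketch of fitting a logarithmic noise-scaling law and reading off two benchmarking quantities follows; the synthetic data and the exact parameterization, performance = a - b*log(noise), are assumptions for illustration, not the paper's derived form.

```python
# Sketch: fitting a logarithmic noise-scaling law to synthetic points.
import numpy as np
from scipy.optimize import curve_fit

def noise_law(noise, a, b):
    # one plausible parameterization (assumed): a = saturating performance,
    # b = noise sensitivity
    return a - b * np.log(noise)

noise = np.array([0.01, 0.05, 0.1, 0.2, 0.5, 1.0])  # e.g. fraction corrupted
rng = np.random.default_rng(0)
perf = 0.9 - 0.05 * np.log(noise) + 0.01 * rng.normal(size=noise.size)

(a, b), _ = curve_fit(noise_law, noise, perf)
print(f"saturating performance a = {a:.3f}, noise sensitivity b = {b:.3f}")
```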
This work introduces a minimal, information-theoretic dynamical framework for modeling longitudinal cohort data using an entropy-initiated system of coupled-trait ordinary differential equations (ECTO). For each survey wave, item-level Likert responses are compressed into a normalized Shannon entropy index that summarizes cross-sectional dispersion; this index is used to initialize the low-dimensional state variables of the autonomous ODE system. ECTO then tracks the interactions among a primary trait-like state, a secondary coupled state, and a latent environmental-stress component through phenomenological terms representing generic self-limitation, trade-offs, and feedback. Using data from the Swedish Adoption/Twin Study on Aging (SATSA), the framework reproduces broad cohort-level trajectories and is evaluated with leave-one-wave-out forecasting and comparisons against simple statistical baselines. A second longitudinal dataset of U.S. dental students provides an external validation test, demonstrating that low-dimensional dynamics initialized from entropy measures can generalize across cohorts with different measurement instruments, demographic compositions, and timescales. Across both datasets, ECTO achieves stable out-of-sample performance, indicating that major cohort-level trends can be captured without assuming complex latent-variable models or time-varying causal inputs. Entropy here functions as a compact summary of population heterogeneity rather than a dynamical driver, and the coupled ODEs supply an interpretable alternative to high-dimensional or black-box machine-learning approaches. This framework establishes a concise, transparent method for linking information-theoretic preprocessing with cohort-level dynamical modeling and provides a foundation for future multivariate or multi-cohort extensions.
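The two ingredients, a normalized entropy index over Likert items and a small coupled ODE system, can be sketched as below; the specific phenomenological terms are generic stand-ins, not the fitted ECTO equations.

```python
# Sketch: entropy-initiated coupled-trait ODEs; functional forms assumed.
import numpy as np
from scipy.integrate import solve_ivp

def entropy_index(likert, levels=5):
    """Normalized Shannon entropy of item-level response frequencies."""
    p = np.bincount(likert, minlength=levels + 1)[1:] / len(likert)
    p = p[p > 0]
    return -(p * np.log(p)).sum() / np.log(levels)

def ecto(t, y, r=0.3, c=0.2, f=0.1):
    x1, x2, e = y                              # primary, coupled, stress
    return [r * x1 * (1 - x1) - c * x1 * x2,   # self-limitation + trade-off
            c * x1 * x2 - f * e * x2,          # coupling + stress feedback
            f * (x1 - e)]                      # latent stress tracks x1

likert = np.random.default_rng(0).integers(1, 6, size=200)
h0 = entropy_index(likert)                     # entropy initializes the state
sol = solve_ivp(ecto, (0, 30), [h0, h0 / 2, 0.1])
print(sol.y[:, -1])
```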
The rapid advances in artificial intelligence (AI) have largely been driven by scaling deep neural networks (DNNs) - increasing model size, data, and computational resources. Yet performance is ultimately governed by network dynamics. The lack of a principled understanding of DNN dynamics beyond heuristic design has contributed to challenges in robustness, suboptimal performance, high energy consumption, and pathologies in continual learning and in learning from AI-generated content. In contrast, the human brain appears largely resilient to these problems, and converging evidence suggests this advantage arises from dynamics poised at a critical phase transition. Inspired by this principle, we propose that criticality provides a unifying framework linking structure, dynamics, and function in DNNs. First, analyzing more than 80 state-of-the-art models, we show that a decade of AI progress has implicitly driven successful networks toward criticality - explaining why some architectures succeeded while others failed. Second, we demonstrate that explicitly incorporating criticality into training improves robustness and accuracy while mitigating key limitations of current models. Third, we show that major AI pathologies - including performance degradation in continual learning and model collapse during training on AI-generated data - reflect a loss of critical dynamics. By maintaining networks near criticality, we provide a principled solution to these failures, demonstrating that criticality-based optimization prevents degradation and collapse. Our results establish criticality as a substrate-independent principle of intelligence, connecting AI progress with fundamental principles of brain function, and offering both theoretical insight and practical strategies to ensure long-term DNN performance and resilience as models scale.
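One common operationalization of proximity to criticality is the spectral norm of a network's input-output Jacobian, which sits near 1 at the edge of chaos. A toy sketch with random tanh layers follows; the gain sweep and architecture are illustrative, not one of the 80+ models analyzed above.

```python
# Sketch: edge-of-chaos diagnostic via the input-output Jacobian norm.
import numpy as np

def jacobian_spectral_norm(weights, x):
    """Largest singular value of the Jacobian of a tanh MLP at input x."""
    J = np.eye(len(x))
    for W in weights:
        pre = W @ x
        J = np.diag(1 - np.tanh(pre) ** 2) @ W @ J  # chain rule per layer
        x = np.tanh(pre)
    return np.linalg.svd(J, compute_uv=False)[0]

rng = np.random.default_rng(0)
n, depth = 128, 10
x = rng.normal(size=n)
for gain in (0.5, 1.0, 2.0):   # sub-critical, ~critical, super-critical
    Ws = [gain * rng.normal(size=(n, n)) / np.sqrt(n) for _ in range(depth)]
    print(gain, jacobian_spectral_norm(Ws, x))
```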
Functional magnetic resonance imaging (fMRI) is a non-invasive technique for investigating brain activity, offering high-resolution insights into neural processes. Understanding and decoding cognitive brain states from fMRI depends on how functional interactions are represented. We propose an ensemble-based graph representation in which each edge weight encodes state evidence as the difference between posterior probabilities of two states, estimated by an ensemble of edge-wise probabilistic classifiers from simple pairwise time-series features. We evaluate the method on seven task-fMRI paradigms from the Human Connectome Project, performing binary classification within each paradigm. Using compact node summaries (mean incident edge weights) and logistic regression, we obtain average accuracies of 97.07-99.74%. We further compare ensemble graphs with conventional correlation graphs using the same graph neural network classifier; ensemble graphs consistently yield higher accuracy (88.00-99.42% vs. 61.86-97.94% across tasks). Because edge weights have a probabilistic, state-oriented interpretation, the representation supports connection- and region-level interpretability and can be extended to multiclass decoding, regression, other neuroimaging modalities, and clinical classification.
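The edge-weight construction can be sketched for a single edge: fit a probabilistic classifier on simple pairwise features and take the difference of posterior probabilities; the coupling-based toy data and the particular features below are illustrative assumptions.

```python
# Sketch: one edge-wise classifier producing a signed state-evidence weight.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def simulate_pair(state, T=200):
    """Two toy regional time series whose coupling depends on the state."""
    shared = rng.normal(size=T)
    mix = 0.7 if state else 0.2
    return mix * shared + np.sqrt(1 - mix ** 2) * rng.normal(size=(2, T))

def edge_features(ts):
    # simple pairwise features (assumed): correlation and the two variances
    return [np.corrcoef(ts)[0, 1], ts[0].var(), ts[1].var()]

labels = rng.integers(0, 2, size=200)
X = [edge_features(simulate_pair(s)) for s in labels]
clf = LogisticRegression().fit(X, labels)

p = clf.predict_proba([edge_features(simulate_pair(1))])[0]
print(p[1] - p[0])   # signed edge weight: evidence for state 1 over state 0
```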
Infectious disease outbreaks have precipitated a profusion of mathematical models. Epidemic curves predicted by these models are typically qualitatively similar, despite distinct model assumptions, but there is no theoretical explanation for this similarity in terms of any recognised common structure. In addition, fits of epidemic models to time series conflate pathogen transmissibility with pre-existing population immunity, so only a single composite parameter can be inferred. Here, we introduce a unifying concept of "epidemic momentum" -- prevalence weighted by potential to infect -- which is more informative than prevalence, yet analytically tractable. Epidemic momentum reveals a common underlying geometry in which outbreak trajectories always follow contours of a conserved quantity. This previously unrecognised conservation law constrains how epidemics can unfold, enabling us to disentangle transmissibility from prior immunity and to infer each separately from the same time series. We illustrate the significance of these insights with a novel reappraisal of the transmissibility of influenza during the 1918 pandemic. Beyond resolving an apparent identifiability problem, epidemic momentum also exposes the true final size of an outbreak and a universal phase-plane description that links generic renewal models to the classical SIR system. A broader concept of "population momentum" has the potential to illuminate seemingly intractable nonlinear dynamical processes in many other areas of science.
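In the classical SIR special case linked above, the phase-plane conservation law can be checked directly: along trajectories, I + S - (1/R0) ln S is constant. A numerical verification follows; the definition of epidemic momentum itself is more general than this SIR sketch.

```python
# Sketch: verifying the SIR phase-plane invariant I + S - ln(S)/R0.
import numpy as np
from scipy.integrate import solve_ivp

beta, gamma = 0.5, 0.25        # R0 = 2
R0 = beta / gamma

def sir(t, y):
    S, I = y
    return [-beta * S * I, beta * S * I - gamma * I]

sol = solve_ivp(sir, (0, 200), [0.99, 0.01], rtol=1e-9, atol=1e-12)
S, I = sol.y
invariant = I + S - np.log(S) / R0
print(invariant.min(), invariant.max())   # constant along the trajectory
```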
Off-lattice agent-based models (or cell-based models) of multicellular systems are increasingly used to create in-silico models of in-vitro and in-vivo experimental setups of cells and tissues, such as cancer spheroids, neural crest cell migration, and liver lobules. These applications, which simulate thousands to millions of cells, require robust and efficient numerical methods. At their core, these models necessitate the solution of a large friction-dominated equation of motion, resulting in a sparse, symmetric, positive-definite matrix equation. The conjugate gradient method is employed to solve this problem, but it requires a good preconditioner for optimal performance. In this study, we develop a graph-based preconditioning strategy that can be easily implemented in such agent-based models. Our approach centers on extending support graph preconditioners to block-structured matrices. We prove asymptotic bounds on the condition number of these preconditioned friction matrices. We then benchmark the conjugate gradient method with our support graph preconditioners and compare its performance to other common preconditioning strategies.
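The overall solve can be sketched with SciPy's preconditioned conjugate gradients on a Laplacian-like SPD matrix; a simple Jacobi preconditioner stands in here for the block support-graph preconditioner developed in the paper, and the random contact graph is illustrative.

```python
# Sketch: preconditioned CG on a friction-matrix analogue (Jacobi stand-in).
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, LinearOperator

rng = np.random.default_rng(0)
n = 2000
# random contact graph -> weighted Laplacian + mass term (SPD)
rows = rng.integers(0, n, size=6 * n)
cols = rng.integers(0, n, size=6 * n)
w = rng.uniform(0.5, 2.0, size=6 * n)
A = sp.coo_matrix((w, (rows, cols)), shape=(n, n))
A = A + A.T
L = sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A
K = (L + sp.eye(n)).tocsr()                   # friction matrix analogue

b = rng.normal(size=n)
d = K.diagonal()
Minv = LinearOperator((n, n), matvec=lambda x: x / d)  # Jacobi preconditioner

x, info = cg(K, b, M=Minv)
print(info, np.linalg.norm(K @ x - b))        # 0 = converged; residual norm
```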
Understanding protein-metal interactions is central to structural biology, with metal ions being vital for catalysis, stability, and signal transduction. Predicting metal-binding residues and metal types remains challenging due to the structural and evolutionary complexity of proteins. Conventional sequence- and structure-based methods often fail to capture co-evolutionary constraints that reflect how residues evolve together to maintain metal-binding functionality. Recent co-evolution-based methods capture part of this information, but still underutilize the complete co-evolved residue network. To address this limitation, we introduce the Metal-Binding Graph Neural Network (MBGNN), which leverages the complete co-evolved residue network to better capture complex dependencies within protein structures. Experimental results show that MBGNN substantially outperforms the state-of-the-art co-evolution-based method MetalNet2, achieving F1 score improvements of 2.5% for binding residue identification and 3.3% for metal type classification on the MetalNet2 dataset. Its superiority is further demonstrated on both the MetalNet2 and MIonSite datasets, where it outperforms two co-evolution-based and two sequence-based methods, achieving the highest mean F1 scores across both prediction tasks. These findings highlight how integrating co-evolutionary residue networks with graph-based learning advances our ability to decode protein-metal interactions, thereby facilitating functional annotation and rational metalloprotein design. The code and data are released at this https URL.
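The core graph operation such a model stacks, message passing over the co-evolved residue network, reduces to normalized neighbourhood aggregation; a minimal numpy sketch with a random co-evolution graph and random residue features follows (sizes and graph are illustrative, not MBGNN's architecture).

```python
# Sketch: one symmetric-normalized message-passing (GCN-style) layer
# over a co-evolved residue network; random toy inputs.
import numpy as np

rng = np.random.default_rng(0)
n_res, d = 120, 32
A = (rng.uniform(size=(n_res, n_res)) > 0.97).astype(float)
A = np.maximum(A, A.T)                        # co-evolution edges (symmetric)
A_hat = A + np.eye(n_res)                     # add self-loops
Dinv = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
H = rng.normal(size=(n_res, d))               # residue features (e.g. embeddings)
W = rng.normal(size=(d, d)) / np.sqrt(d)

H_next = np.maximum(Dinv @ A_hat @ Dinv @ H @ W, 0.0)   # aggregation + ReLU
print(H_next.shape)   # per-residue features feeding binding/type heads
```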
Many real-world problems require reasoning across multiple scales, demanding models which operate not on single data points, but on entire distributions. We introduce generative distribution embeddings (GDE), a framework that lifts autoencoders to the space of distributions. In GDEs, an encoder acts on sets of samples, and the decoder is replaced by a generator which aims to match the input distribution. This framework enables learning representations of distributions by coupling conditional generative models with encoder networks which satisfy a criterion we call distributional invariance. We show that GDEs learn predictive sufficient statistics embedded in the Wasserstein space, such that latent GDE distances approximately recover the $W_2$ distance, and latent interpolation approximately recovers optimal transport trajectories for Gaussian and Gaussian mixture distributions. We systematically benchmark GDEs against existing approaches on synthetic datasets, demonstrating consistently stronger performance. We then apply GDEs to six key problems in computational biology: learning donor-level representations from single-nuclei RNA sequencing data (6M cells), capturing clonal dynamics in lineage-traced RNA sequencing data (150K cells), predicting perturbation effects on transcriptomes (1M cells), predicting perturbation effects on cellular phenotypes (20M single-cell images), designing synthetic yeast promoters (34M sequences), and spatiotemporal modeling of viral protein sequences (1M sequences).
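The $W_2$ distance that latent GDE distances approximately recover has a closed form between Gaussians, which makes the claim concrete; the means and covariances below are arbitrary examples.

```python
# Sketch: closed-form 2-Wasserstein distance between Gaussian distributions.
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(m1, C1, m2, C2):
    # W2^2 = |m1 - m2|^2 + tr(C1 + C2 - 2 (C2^1/2 C1 C2^1/2)^1/2)
    s2 = sqrtm(C2)
    cross = sqrtm(s2 @ C1 @ s2)
    return np.sqrt(np.sum((m1 - m2) ** 2)
                   + np.trace(C1 + C2 - 2 * np.real(cross)))

m1, m2 = np.zeros(3), np.ones(3)
C1, C2 = np.eye(3), 2 * np.eye(3)
print(w2_gaussian(m1, C1, m2, C2))
```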
The evolutionary mechanisms of cooperative behavior represent a fundamental topic in complex systems and evolutionary dynamics. Real-world collective interactions, particularly in multi-agent systems, are often characterized by behavior-dependent mechanism switching, in which the environmental state is endogenously shaped by group strategies. However, existing models typically treat such environmental variations as static stochasticity and neglect the closed-loop feedback between environmental states and cooperative behaviors. Here, we introduce a dynamic environmental feedback mechanism into a nonlinear public goods game framework to establish a coevolutionary model that couples environmental states and individual cooperative strategies. Our results demonstrate that the interplay among environmental feedback, nonlinear effects, and environmental randomness can drive the system toward a wide variety of steady-state structures, including full defection, full cooperation, stable coexistence, and periodic limit cycles. Further analysis reveals that asymmetric nonlinear parameters and environmental feedback rates exert significant regulatory effects on cooperation levels and system dynamics. This study not only enriches the theoretical framework of evolutionary game theory but also provides a foundation for modeling environmental feedback loops in scenarios ranging from ecological management to the design of cooperative mechanisms in autonomous systems.
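The closed feedback loop can be sketched as replicator dynamics for the cooperator fraction coupled to an environmental state; the specific payoff and feedback functions below are generic stand-ins for the paper's nonlinear public goods terms.

```python
# Sketch: strategy-environment coevolution; functional forms assumed.
import numpy as np
from scipy.integrate import solve_ivp

r, eps, theta = 3.0, 0.5, 1.0

def coevolution(t, y):
    x, n = y
    # payoff advantage of cooperation grows with environmental quality n
    gain = n * r * x ** 1.5 - 1.0              # nonlinear public good - cost
    dx = x * (1 - x) * gain                    # replicator equation
    dn = eps * n * (1 - n) * (x - theta / r)   # cooperators restore the state
    return [dx, dn]

sol = solve_ivp(coevolution, (0, 200), [0.3, 0.6])
print(sol.y[:, -1])   # steady state: coexistence, full C/D, or cycling
```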
Humans effortlessly recognize social interactions from visual input, yet the underlying computations remain unknown, and social interaction recognition challenges even the most advanced deep neural networks (DNNs). Here, we hypothesized that humans rely on 3D visuospatial pose information to make social judgments, and that this information is largely absent from most vision DNNs. To test these hypotheses, we used a novel pose and depth estimation pipeline to automatically extract 3D body joint positions from short video clips. We compared how well these body joints predicted human social judgments in the videos against embeddings from over 350 vision DNNs. We found that body joints predicted social judgments better than most DNNs. We then reduced the 3D body joints to an even more compact feature set describing only the 3D position and direction of people in the videos. We found that this minimal 3D feature set, but not its 2D counterpart, was necessary and sufficient to explain the prediction performance of the full set of body joints. These minimal 3D features also predicted the extent to which DNNs aligned with human social judgments and significantly improved their performance on these tasks. Together, these findings demonstrate that human social perception depends on simple, explicit 3D pose information.
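The reduction to a minimal feature set can be made concrete: per person, a 3D position (hip midpoint) and a facing direction (normal to the torso plane) computed from body joints; the joint naming and skeleton layout below are hypothetical, not the paper's pipeline.

```python
# Sketch: minimal 3D position + direction features from body joints.
import numpy as np

def minimal_3d_features(joints):
    """joints: dict of joint name -> (x, y, z) for one person."""
    l_hip, r_hip = np.array(joints["l_hip"]), np.array(joints["r_hip"])
    l_sho, r_sho = np.array(joints["l_shoulder"]), np.array(joints["r_shoulder"])
    position = (l_hip + r_hip) / 2
    across = r_hip - l_hip                 # left-right body axis
    up = (l_sho + r_sho) / 2 - position    # hips-to-shoulders axis
    facing = np.cross(across, up)          # normal to the torso plane
    return position, facing / np.linalg.norm(facing)

person = {"l_hip": (0, 0, 1), "r_hip": (0.3, 0, 1),
          "l_shoulder": (0, 0.5, 1.05), "r_shoulder": (0.3, 0.5, 1.05)}
print(minimal_3d_features(person))
```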