Deep-learning structure predictors are sensitive to their multiple sequence alignment (MSA) input, making MSA subsampling a practical route to recovering alternative conformations. Existing approaches such as AF-Cluster operate in sequence space, providing limited control over which conformational basin is sampled. We introduce SF-Cluster, which subsamples MSAs using patterns of predicted local energetic frustration, a representation largely independent of sequence similarity. Across a benchmark of 48 cases spanning fold-switching, allosteric, oligomerization-coupled, and intrinsically disordered systems, and using an AF-Cluster-style dual-reference RMSD criterion, SF-Cluster improves target-state recovery of the alternative conformation over AF-Cluster across the two-state classes, with the largest improvement observed for allosteric systems (+15.5 percentage points). The selected MSAs transfer to an architecturally distinct predictor, indicating that the conformational signal resides in MSA composition. Mechanistically, matched-depth controls show that this recovery advantage is largely explained by the effective depth of the selected subsets, which frustration-pattern selection reliably reaches. At the same time, highly frustrated residues are enriched at sites supported by deep mutational scanning and NMR two-state exchange, and frustration covariation is enriched at state-switching contacts while remaining distinct from coevolutionary coupling. Together, these results identify frustration patterns as a transferable representation for conformational prediction and position MSA subsampling as a representation-guided reweighting problem.
Multi-level selection and senescence do not at first sight have much in common. Here, we demonstrate that the emergent mortality patterns generated by demographic senescence can be understood as the product of multi-level selection. We formulate a two-level Moran-type process and use its scaling limits to illustrate that a simple mathematical framework that models multi-level selection in group-structured populations also models damage accumulation patterns and resultant mortality curves in ageing organisms. To make the connection verbally, observe that defectors spread within a group consisting of cooperators and defectors; when groups compete against each other, defector-rich groups suffer, and between-group selection causes such groups to be systematically under-represented. Exactly analogously, senescing individuals accumulate damage to physiological sub-systems, and "damage begets damage"; individuals who are more damaged are more likely to die, hence damage-rich individuals are systematically under-represented in later age classes. Thus, emergent senescence patterns in complex, integrated organisms are formally equivalent to the patterns generated by a within-generation multi-level selection process in which intra-organismal sub-systems play the role of particles, organisms play the role of collectives, and selective disappearance plays the role of group selection.
Understanding how complex cognitive functions are organized within artificial systems is central to interpreting large language models (LLMs) and relating them to biological cognition. Yet although LLMs exhibit broad cognitive-like behaviours, it remains unclear whether their internal representations form reproducible functional systems that explain behaviour, failure and links to human cognition. Here we present NeuroCogMap, a cognitive neuroscience-inspired framework that organizes internal features of LLMs into functional parcels and links them to interpretable functions, cognitive capabilities and a cognitive hierarchy. These parcels form a stable and semantically coherent organization that is partly conserved across models and functionally linked to model outputs. Within this organization, major LLM failures, including hallucination, bias, refusal failure and sycophancy, correspond to distinct disruptions in representational and behavioural-control systems, yielding internal signatures for mechanism-guided detection and targeted intervention. Beyond model behaviour, NeuroCogMap improves prediction of human cortical responses during naturalistic language comprehension, with the strongest correspondence in higher-order association cortex. At the cognitive level, its internal signatures expose latent strategies that guide refinements of classical models of human decision-making. Together, these findings establish NeuroCogMap as a system-level framework for mapping functional organization in artificial systems and for relating this organization to human cortical function and cognitive behaviour.
This study presents a comprehensive analysis of bird diversity across Sri Lanka by integrating spatial, temporal, and environmental data. Bird observation records were combined with environmental variables, including weather conditions, air pollution, the Normalized Difference Vegetation Index (NDVI), land cover, elevation, and Artificial Light At Night (ALAN), and rigorously preprocessed to ensure data quality. Spatial analyses were conducted on multiple grid scales (2 km, 5 km, 10 km) to evaluate patterns in species richness while minimizing sampling bias through spatial thinning. Temporal trends were assessed using effort-corrected metrics including rarefied richness and occupancy rates to account for variations in observation effort over time. Environmental drivers of bird diversity were examined using multivariate statistical models, including Poisson Generalized Linear Models (GLMs) and correlation analyses, to identify key associations between ecological factors and species richness. Additionally, community structure, dominance patterns, and beta diversity were analyzed to understand variations in species composition across regions and time. The study found that land-cover type is a stronger predictor of bird diversity than individual continuous variables such as NDVI or temperature alone. Urbanization, measured by ALAN, exhibits nuanced scale-dependent effects, supporting high abundances of a few generalist species while reducing overall richness. The findings provide actionable insights into the patterns and drivers of avian diversity in Sri Lanka, offering a scalable and reproducible framework for biodiversity research and conservation planning.
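The effort-corrected "rarefied richness" mentioned in the abstract above can be illustrated with the standard hypergeometric rarefaction formula. This is a minimal sketch, assuming individual-based rarefaction; the abstract does not state which estimator the study uses, and the function name `rarefied_richness` and the example counts are hypothetical.

```python
from math import comb


def rarefied_richness(counts, n):
    """Expected number of species in a random subsample of n individuals.

    counts: per-species individual counts for one grid cell or time window
            (hypothetical input layout, not taken from the study).
    n:      common subsample size used to put samples on an equal-effort footing.
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and the total number of individuals")
    denom = comb(total, n)
    # For each species, 1 - P(species is absent from the subsample).
    return sum(1 - comb(total - c, n) / denom for c in counts if c > 0)


# Two samples with different effort, compared at the same subsample size.
well_sampled = [50, 30, 15, 4, 1]
poorly_sampled = [6, 3, 1]
print(rarefied_richness(well_sampled, 10), rarefied_richness(poorly_sampled, 10))
```

Comparing both samples at the same `n` removes the richness difference that arises from observation effort alone, which is the purpose the abstract attributes to effort-corrected metrics.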
This work addresses an optimal control problem for an SI epidemic model incorporating heterogeneities in resistance and viral load at the population level. Building upon the heterogeneous SI framework developed in [1], a minimization problem constrained to the macroscopic counterpart of the SI dynamics derived therein is proposed. Unlike traditional optimal control problems in homogeneous epidemic models, the present approach accounts for population heterogeneity, offering insights from a microscale perspective. The contribution aims to minimize the final size of the infection within a finite time horizon by developing a pharmaceutical strategy, under a supply constraint that translates into an integral equality constraint on the control function. By applying the Pontryagin Minimum Principle, a characterization of an optimal control is provided.
In multispecies birth-death processes, how population regulation -- through suppressed replication, elevated mortality, or both -- affects macroscopic stochastic dynamics has escaped detailed analysis. Here, we show that the distribution of regulation mechanisms can be invisible in deterministic or mean-field dynamics but play a significant role in the diffusive evolution of population frequencies. By introducing a tunable regulation partitioning parameter $\alpha_i$ and projecting a $d$-species birth-death process onto a $(d{-}1)$-dimensional Moran process, we find a regulation-mechanism-dependent diffusion tensor. For the simple two-species case, we derive exact fixation times and probabilities to show how different regulation mechanisms stochastically favor a more birth-regulated species, even under complete deterministic neutrality. Our model also allows us to define an $\alpha$-dependent effective population size $N_{\rm e}(\alpha)$ among neutral species, generalizing its classical interpretation. For near-neutral populations or populations that are heterogeneous in their regulation mechanism, we use perturbation theory to calculate the spectral gap, identifying it with a diversity loss timescale which can also be interpreted as setting an effective population size. Our results are particularly applicable to interacting subpopulations of T cells ("clones") which are near-neutral, are regulated through proliferation and apoptosis, and lose diversity with time.
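For readers unfamiliar with exact fixation probabilities in two-species Moran-type chains, the following minimal sketch evaluates the textbook birth-death formula for arbitrary one-step transition probabilities. It does not reproduce the $\alpha$-dependent rates of the abstract above, which are not given there; the function name and the two example rate choices (neutral and constant relative fitness `r`) are illustrative assumptions.

```python
def fixation_probability(n_total, i_start, t_plus, t_minus):
    """Probability that a type starting with i_start copies reaches n_total.

    t_plus(j) / t_minus(j): probability that the focal type goes from j to
    j+1 / j-1 copies in one step, for 0 < j < n_total (user-supplied).
    Uses the standard result for birth-death chains with absorbing ends:
    rho_i = sum_{k<i} prod_{j<=k} g_j / sum_{k<N} prod_{j<=k} g_j,
    with g_j = t_minus(j) / t_plus(j).
    """
    weights = [1.0]
    prod = 1.0
    for j in range(1, n_total):
        prod *= t_minus(j) / t_plus(j)
        weights.append(prod)
    return sum(weights[:i_start]) / sum(weights)


N = 20
# Neutral Moran process: up and down moves are equally likely.
neutral = lambda j: j * (N - j) / N**2
print(fixation_probability(N, 5, neutral, neutral))

# Classical Moran process with relative fitness r for the focal type.
r = 1.5
up = lambda j: r * j * (N - j) / ((r * j + N - j) * N)
down = lambda j: j * (N - j) / ((r * j + N - j) * N)
print(fixation_probability(N, 1, up, down))
```

Because the function only needs the ratio of down to up moves at each state, regulation-dependent rates could be substituted for `up` and `down` without changing the rest of the calculation.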
Brain activity spans single-neuron, population, and network levels, and core questions in neural coding require moving between them. Yet current tools target a single paradigm and incompatible data formats, leaving cross-level questions hard to address. We present DRIADA, an open-source Python framework that unifies neural signals and time-aligned behavior in a shared data model, so selectivity testing, dimensionality reduction, and network analysis operate within a unified workflow. We evaluate it on synthetic data with known ground truth, hippocampal calcium imaging from 13~mice in an open field, and a simulated toroidal attractor network. In the hippocampal data, selectivity-based filtering restored a two-dimensional spatial embedding from a collapsed all-neuron embedding, while reverse analysis showed that ${\sim}57\%$ of neurons informative about leading manifold dimensions were not selective to any of the 11 measured behavioral features. On the toroidal benchmark, four independent modules recovered the expected topology. DRIADA makes cross-scale analysis routine across calcium imaging, spike trains, and simulated networks.
Protein flexibility, commonly quantified by B-factors, is closely related to protein structure and function. However, accurate B-factor prediction remains challenging due to the multiscale nature of protein structures and the complexity of atomic interactions. In this work, we propose a commutative algebra-based learning framework, termed CAL, for protein B-factor prediction. Unlike many biomolecular prediction tasks that rely primarily on global structural representations, B-factor prediction requires an accurate characterization of the local geometric environments surrounding individual atoms. To address this challenge, CAL employs commutative algebra theory to construct localized algebraic descriptors at multiple spatial scales. On a benchmark dataset of 364 proteins, CAL improves prediction accuracy by 34.5\% over the classical Gaussian network model (GNM). Extensive experiments demonstrate that CAL achieves robust and consistent performance across diverse datasets and is competitive with existing state-of-the-art methods. Furthermore, by integrating CAL with machine learning, we develop a blind prediction model capable of cross-protein B-factor prediction. Overall, CAL provides an effective, efficient, and mathematically principled framework for protein flexibility prediction and offers a powerful approach for analyzing and predicting localized structural properties in complex biomolecular systems.
Population immunity carried over from past epidemics of an antigenically variable pathogen influences epidemics of new variants according to their antigenic similarity to previous ones. We develop a recurrent SIR model in which a population faces sequential, antigenically related variants. The model yields a recurrence map for the population susceptibility to successive variants under the assumption of status-based population immunity. The model reveals that stable, equal-sized recurrent epidemics occur across broad parameter ranges, but can be destabilized when transmission is strong and antigenic escape is limited, leading to cycles of period 2 or higher, or to even more complex epidemic dynamics. Epidemic size is maximized at an intermediate basic reproduction number: higher transmissibility boosts immediate infection but also enhances cross-immunity, reducing future susceptibility of the population. Our results clarify how immune history shapes recurrent epidemics and why success in one wave does not ensure larger future epidemics.
Estimating peak prevalence is a central problem in epidemic modeling because it determines the period of greatest infectious burden and is closely linked to health-care demand. In multistage SIR models, however, peak prevalence is generally less tractable than in the classical model with exponentially distributed infectious periods. Motivated by the use of weighted infectious-stage aggregates as surrogates for prevalence, we investigate the relationship between the prevalence peak and the maximum of a weighted stage functional in deterministic SI$(k)$R epidemic models. We show that this relationship depends critically on how the stage-progression rate is scaled as the number of infectious stages increases. Under naive scaling, in which the progression rate remains fixed, the weighted peak is asymptotically equivalent to the prevalence peak and the commonly used factor-two approximation fails. Under Erlang scaling, which preserves the mean infectious period, the multistage model converges to a delay formulation in which prevalence and the weighted stage functional become unweighted and triangularly weighted moving averages of incidence. This limiting representation provides a theoretical basis for the factor-two approximation and identifies the regimes in which it is accurate. It also explains why this approximation deteriorates as epidemic waves become more sharply peaked. We derive analytical error bounds and develop curvature-based and parameter-based corrections that substantially improve accuracy. Numerical studies confirm these improvements across a broad range of epidemiological parameters. Overall, the results show when weighted-stage peaks can be used reliably as proxies for peak prevalence and how the resulting estimates can be refined when the standard approximation loses accuracy.
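To make the Erlang scaling in the abstract above concrete, the sketch below integrates a deterministic SI$(k)$R model in which each of the $k$ infectious stages is left at rate $k\gamma$, so the mean infectious period stays $1/\gamma$. It returns total prevalence at its peak only; the weighted stage functional and the factor-two approximation are not implemented because the abstract does not specify the weights. Parameter values, the function name, and the fixed-step RK4 integrator are assumptions made for illustration.

```python
import math


def sikr_peak_prevalence(k, beta=2.0, gamma=1.0, i0=1e-3, dt=0.01, t_max=60.0):
    """Peak of total prevalence I_1 + ... + I_k in an SI(k)R model.

    Erlang scaling: stage-progression rate k * gamma (mean period 1 / gamma).
    Population is normalised to 1, with i0 initially in the first stage.
    """
    rate = k * gamma

    def deriv(y):
        s, stages = y[0], y[1:]
        infection = beta * s * sum(stages)
        d = [-infection, infection - rate * stages[0]]
        d += [rate * (stages[j - 1] - stages[j]) for j in range(1, k)]
        return d

    y = [1.0 - i0, i0] + [0.0] * (k - 1)
    peak = i0
    for _ in range(int(t_max / dt)):
        # Classical fourth-order Runge-Kutta step.
        k1 = deriv(y)
        k2 = deriv([a + 0.5 * dt * b for a, b in zip(y, k1)])
        k3 = deriv([a + 0.5 * dt * b for a, b in zip(y, k2)])
        k4 = deriv([a + dt * b for a, b in zip(y, k3)])
        y = [a + dt * (b + 2 * c + 2 * d + e) / 6
             for a, b, c, d, e in zip(y, k1, k2, k3, k4)]
        peak = max(peak, sum(y[1:]))
    return peak


# k = 1 is the classical SIR model, whose peak is known in closed form.
r0, s0 = 2.0, 1.0 - 1e-3
closed_form = 1.0 - (1.0 + math.log(r0 * s0)) / r0
print(sikr_peak_prevalence(1), closed_form)
print(sikr_peak_prevalence(5))
```

The $k=1$ case serves as a sanity check against the classical closed-form peak; larger $k$ has no such closed form, which is the tractability gap the abstract refers to.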
Colored sectors in a microbial range expansion encode more than lineage survival counts. We formulate a computer-vision inverse problem: from one endpoint image of an accretive multi-type expansion, recover the radius-indexed pairwise boundary-flow field and test whether the visual pattern is compatible with a transitive scalar fitness hierarchy. The observable is a geometric signal extracted from sector-boundary curves in log-polar coordinates. We prove endpoint observability and stability for frozen fronts, weighted transitive/cyclic decomposition, contact-complete circular design, physical-clock and mechanism non-identifiability, exact Gaussian cyclicity testing, and Bonferroni-valid interval scanning. The benchmark is deterministic: analytic endpoint images, blurred/noisy pixel round trips, scalar-null stress tests, public-image tracing, multi-resolution mechanistic endpoints, and a non-learning frozen-front simulator. The implementation recovers pairwise edge-flow histories from endpoint images, detects cyclic residuals in a mechanistic four-type expansion, and uses those residuals as forcing signals for a dimensionless active design-control layer covering reaction-diffusion control, phenotype-frontier optimization, protocol synthesis, Monte Carlo robustness, and a downstream population-state bridge.
Scientific reasoning is an increasingly important capability of large language models, yet improving the robustness and efficiency of training such reasoning remains a key open challenge. We study this problem in instruction-based molecular optimization, where answer-only supervised fine-tuning (SFT) collapses multi-step reasoning and reinforcement learning with verifiable rewards (RLVR) suffers from sparse feedback. Reference-guided Policy Optimization mitigates both by anchoring policy updates to dataset-provided references, but its effectiveness is tightly coupled to reference quality: weak or misaligned references impose a performance ceiling. To overcome this ceiling, we propose active reasoning, a paradigm in which the policy actively decides, on a per-instance basis, when to imitate a reference and when to reinforce its own discoveries, while continuously upgrading what it imitates. We instantiate this paradigm as Active Group Relative Policy Optimization (Active-GRPO), realized through two coupled mechanisms: active imitate-reinforce and active referencing. The former performs imitation learning when the reference still outperforms the policy's own candidates, and shifts to self-improvement via reinforcement learning once the policy has generated molecules that surpass the reference. The latter continuously upgrades the reference itself by replacing it with the best policy-generated candidate discovered so far, progressively raising the imitation target and ensuring that reference guidance remains informative, rather than restrictive, throughout training. Across TOMG-Bench MOLOPT, Active-GRPO improves average SRxSim from 0.0959 for GRPO and 0.1665 for RePO to 0.1773 under matched three-seed evaluation, with statistically significant gains on LogP, MR, and QED.
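The two coupled mechanisms described in the abstract above can be sketched as a per-instance decision rule. This is a toy illustration only: the function name `active_step`, the string labels, and the idea of folding both mechanisms into one call are assumptions, and no loss computation or policy update is shown.

```python
def active_step(reference, candidates, score):
    """Decide how to treat one training instance.

    reference:  current imitation target for this instance.
    candidates: molecules sampled from the policy for this instance.
    score:      callable returning the verifiable reward of a molecule.
    Returns (mode, reference_for_next_round).
    """
    best = max(candidates, key=score)
    if score(best) > score(reference):
        # The policy has surpassed the reference: reinforce its own
        # discoveries, and upgrade the reference to the best candidate.
        return "reinforce", best
    # The reference is still the strongest example: imitate it.
    return "imitate", reference


# Hypothetical reward table for three placeholder molecules.
rewards = {"ref": 0.5, "cand_a": 0.3, "cand_b": 0.9}
print(active_step("ref", ["cand_a"], rewards.get))
print(active_step("ref", ["cand_a", "cand_b"], rewards.get))
```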
Proteins move and deform to ensure their biological functions. Despite significant progress in protein structure prediction, approximating conformational ensembles at physiological conditions remains a fundamental open problem. This paper presents a novel perspective on the problem by directly targeting continuous compact representations of protein motions inferred from sparse experimental observations. We develop a task-specific loss function enforcing data symmetries, including scaling and permutation operations. Our method PETIMOT (Protein sEquence and sTructure-based Inference of MOTions) leverages transfer learning from pre-trained protein language models through an SE(3)-equivariant graph neural network. When trained and evaluated on the Protein Data Bank, PETIMOT shows superior performance in time and accuracy, capturing protein dynamics, particularly large/slow conformational changes, compared to state-of-the-art diffusion and flow-matching approaches, as well as traditional physics-based models. Our code and protocols are available at this https URL.
Humans can uniquely assign value to novel, abstract outcomes to support reinforcement learning. However, this flexibility is cognitively costly and reduces learning efficiency. We propose that goal-dependent learning initially relies on capacity-limited working memory. With consistent experience, learners create a "compressed" reward function (a simplified goal rule) that transfers to long-term memory for a more automatic evaluation upon receiving feedback. This automaticity frees working memory resources, thereby boosting learning efficiency. Across six experiments, we demonstrate that learning is impaired by the size of the goal space but improves when this space allows for compression. Additionally, faster reward processing correlates with better learning. Although the algorithmic details remain to be established, our behavioral results and computational models suggest that efficient goal-directed learning relies on compressing complex goal information into a stable reward function. These findings illuminate the cognitive mechanisms of intrinsic motivation and can inform behavioral interventions supporting human goal achievement.
Trans-membrane ion-gradients and fluxes are central to conventional electrical activity in aerobic cells/organelles. The Murburn concept offers novel physico-chemical models for various metabolic, bioenergetic and electrophysiological phenomena. Here, we develop a foundational framework for neuronal electrical activity and axonal signal propagation using the electron-holding potential (EHP), a dimensionless field related logarithmically to electron chemical potential. By combining local redox relaxation dynamics with spatial transport driven by thermodynamic gradients, we derive a unified reaction-transport-relaxation equation that accounts for resting potential, excitability, waveform generation, and signal propagation within a single formalism. Nonlinear local redox kinetics yield a stable resting state and graded responses from a single scalar field; extending it to the two-variable excitable (FitzHugh-Nagumo) form, a bistable reaction with a slow recovery variable, further yields a genuine threshold, all-or-none spikes, a refractory period and a propagating action potential. The framework accommodates known physiological variability of neurons and provides a direct bridge between metabolic/redox state and electrophysiology. This framework offers testable predictions for neuronal dynamics (such as velocity, waveform morphology, and environmental conditions) across biological systems. We derive and solve the equations to obtain the transmembrane potential as a function of time, and the neuronal conduction velocity as a function of parameters like ionic strength, temperature, axon diameter, myelination, and driving potential. In the second part of this work, we present comparative analyses, simulations, and experimental strategies for validation and falsification.
Vector-borne diseases often involve multiple host species that differ in their ability to sustain transmission. At the same time, vector feeding preferences can change in response to host availability and disease-control interventions, potentially altering disease dynamics in unexpected ways. We develop a two-host vector-borne disease model that links host diversity, adaptive vector feeding preferences, and disease transmission. We demonstrate that the effect of host diversity on disease transmission is mediated by vector feeding behavior and cannot be inferred from host abundance alone. In particular, we identify a critical threshold, $R_{0c}$, that determines whether shifts in vector preference amplify or suppress disease burden in a focal host. This threshold marks a qualitative transition in system behavior and provides a basis for predicting epidemiological responses to changes in host composition and vector behavior. Using adaptive dynamics, we further show that vector populations may evolve toward either specialist or opportunistic feeding strategies depending on host encounter rates and trade-off strength. Finally, we demonstrate that host-targeted interventions can induce adaptive changes in vector feeding behavior that reduce prevalence in the protected host while potentially increasing overall infection burden. Our results highlight how evolutionary responses of vector populations can generate unexpected epidemiological outcomes and should be considered when designing disease-control strategies.
The analyses presented herein demonstrate that neuronal electrical activity can be consistently interpreted as a manifestation of murburn redox-mediated electronic dynamics rather than as a process fundamentally driven by transmembrane ionic flux. By integrating comparison with established models, quantitative predictions, and diverse experimental observations, the murburn framework emerges as a unified and chemically grounded description of excitability. A key strength of the model lies in its predictive structure. Unlike phenomenological frameworks that rely on parameter fitting, the murburn formulation links measurable electrophysiological outputs, such as conduction velocity, waveform morphology, and threshold behavior, to physically interpretable variables including redox kinetics, transport efficiency, and environmental conditions. This enables direct experimental validation through perturbations in oxygen availability, redox balance, solvent properties, ionic strength, and external fields. Importantly, the framework extends beyond neurons to a broader class of excitable systems, including cardiac tissue, photoreceptors, and artificial redox-active materials, suggesting that excitability is a general physicochemical phenomenon rooted in reaction-transport dynamics. While the present work establishes the mid-scale dynamics of neuronal electrical activity, further developments are required to connect quantum-level electron transfer processes with macroscopic electrophysiological signals such as EEG and EMG. These extensions, along with targeted experimental tests, will determine the ultimate scope and applicability of the murburn paradigm.
Training deep learning models on single-cell datasets with hundreds of millions of cells requires loading data from disk, as these datasets exceed available memory. While random sampling provides the data diversity needed for effective training, it is prohibitively slow due to the random access pattern overhead, whereas sequential streaming achieves high throughput but introduces biases that degrade model performance. We present scDataset, a PyTorch data loader that enables efficient training from on-disk data with seamless integration across diverse storage formats. Our approach combines block sampling and batched fetching to achieve quasi-random sampling that balances I/O efficiency with minibatch diversity. On Tahoe-100M, a dataset of 100 million cells, scDataset achieves more than two orders of magnitude speedup compared to true random sampling while working directly with AnnData files. We provide theoretical bounds on minibatch diversity and empirically show that scDataset matches the performance of true random sampling across multiple classification tasks and model architectures.
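The combination of block sampling and batched fetching described in the abstract above can be sketched as an index generator. This is an illustrative sketch under stated assumptions, not the scDataset API: the function name, the parameter names `block_size` and `fetch_factor`, and the exact order of shuffling are hypothetical, and no disk I/O is performed.

```python
import random


def quasi_random_batches(n_items, block_size, batch_size, fetch_factor, seed=0):
    """Yield minibatches of row indices covering each row exactly once.

    Block sampling: rows are grouped into contiguous blocks, and the blocks
    (not individual rows) are shuffled, so each read touches contiguous rows.
    Batched fetching: several blocks are gathered per fetch and shuffled in
    memory before being cut into minibatches, which mixes rows across blocks.
    """
    rng = random.Random(seed)
    blocks = [list(range(start, min(start + block_size, n_items)))
              for start in range(0, n_items, block_size)]
    rng.shuffle(blocks)
    # Number of blocks needed to fill roughly fetch_factor minibatches.
    blocks_per_fetch = max(1, (batch_size * fetch_factor) // block_size)
    for i in range(0, len(blocks), blocks_per_fetch):
        fetched = [idx for block in blocks[i:i + blocks_per_fetch] for idx in block]
        # A real loader would read these rows from disk in one call here.
        rng.shuffle(fetched)
        for j in range(0, len(fetched), batch_size):
            yield fetched[j:j + batch_size]


batches = list(quasi_random_batches(n_items=100, block_size=4,
                                    batch_size=8, fetch_factor=2))
print(len(batches), batches[0])
```

With `block_size=1` the scheme reduces to ordinary random sampling, and with a single block spanning the dataset it reduces to sequential streaming, so the two parameters trade I/O contiguity against minibatch diversity.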
Within the biological, physical, and social sciences, there are two broad quantitative traditions: statistical and mathematical modeling. Both traditions have the common pursuit of advancing our scientific knowledge, but these traditions have developed largely with distinct languages and inferential frameworks. This paper uses the notion of identification from causal inference, a field originating from the statistical modeling tradition, to develop a shared language. I first review foundational identification results for statistical models and then extend these ideas to mathematical models. Central to this framework is the use of bounds, ranges of plausible numerical values, to analyze both statistical and mathematical models. I discuss the implications of this perspective for the interpretation, comparison, and integration of different modeling approaches, and illustrate the framework with a simple pharmacodynamic model for hypertension. To conclude, I describe areas where the approach taken here should be extended in the future. By formalizing connections between statistical and mathematical modeling, this work contributes to a shared framework for quantitative science. My hope is that this work will advance interactions between these two traditions.