The automatic recognition of virus particles in transmission electron microscopy (TEM) images remains a demanding task, primarily owing to strong inter-class similarity, scale variability, and pronounced class imbalance. In this study, several convolutional neural networks and transformer-based architectures were comparatively evaluated for the classification of 22 virus categories using the TEM virus dataset. All models were trained under identical preprocessing and optimization conditions, and imbalance effects were mitigated through a weighted cross-entropy formulation. Performance was quantified using overall accuracy together with macro-averaged precision, recall, and F1 score. Among standalone models, the Swin Transformer achieved the highest accuracy (0.8831) and macro-F1 score (0.8444), followed by DeiT (accuracy 0.8669). Convolutional architectures exhibited comparatively lower balanced performance, with ResNet50 demonstrating substantial degradation (accuracy 0.5887) under imbalanced conditions. To exploit complementary representational properties, decision-level hybrid strategies were implemented. The performance-weighted hybrid attained an accuracy of 0.8831 and the highest macro-F1 score (0.8528), slightly surpassing the equal-weight hybrid configuration. These observations indicate that architectural heterogeneity contributes to improved inter-class balance without sacrificing overall predictive accuracy. Future work may explore scale-aware representations, feature-level fusion mechanisms, and expanded TEM datasets to further enhance robustness and generalization in virus identification tasks.
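The decision-level hybrid described above can be illustrated with a minimal sketch of weighted soft voting. The abstract does not state how the performance weights are derived, so the function name `fuse_predictions`, the toy probabilities, and the weights below are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def fuse_predictions(probs, weights=None):
    """Fuse per-model class probabilities by a weighted average.

    probs   : list of arrays, each of shape (n_samples, n_classes)
    weights : one non-negative weight per model (hypothetical; for example
              a validation score per model). None means equal weights.
    """
    stacked = np.stack(probs)                # (n_models, n_samples, n_classes)
    n_models = stacked.shape[0]
    if weights is None:
        w = np.full(n_models, 1.0 / n_models)
    else:
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()                      # normalise so the weights sum to 1
    fused = np.tensordot(w, stacked, axes=1) # (n_samples, n_classes)
    return fused.argmax(axis=1), fused

# Toy two-class example with two models (values are illustrative only).
model_a = np.array([[0.6, 0.4], [0.3, 0.7]])
model_b = np.array([[0.2, 0.8], [0.4, 0.6]])
equal_labels, _ = fuse_predictions([model_a, model_b])
weighted_labels, _ = fuse_predictions([model_a, model_b], weights=[0.9, 0.1])
```

With equal weights the first sample follows the more confident model; weighting the first model heavily flips that decision, which is how the choice of weights can change per-class balance.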
Motivation: Gene regulatory network inference from single-cell RNA sequencing (scRNA-seq) data is important for uncovering cell-state-specific transcriptional programs. However, scRNA-seq measurements are sparse and noisy, and experimentally validated TF-target interactions remain limited, making reliable inference challenging. Although graph neural networks have advanced GRN prediction, existing methods often rely on biologically unconstrained graph augmentation, such as random edge perturbation, and insufficiently control information transfer between genes and cells. These limitations may distort regulatory structures and weaken robustness under noisy and weakly supervised settings. Results: To address these issues, we propose an innovative framework named Biological Evidence Refinement and Heterogeneous Dynamic Gating for Gene Regulatory Networks (BRIDGE). BRIDGE extracts gene and cell representations from the expression matrix and its matrix dual, and performs contrastive learning in the gene space and cell space between self and neighbors across the co-expression-refined regulatory view and the original graph. It then applies heterogeneous gated encoding to adaptively regulate information transfer between genes and cells, enabling robust transcription factor-to-target gene prediction. Experiments on benchmark datasets spanning three network types and seven cell types show that BRIDGE achieves state-of-the-art AUROC and AUPRC in most settings. In particular, on Specific networks, BRIDGE improves average AUPRC by 5% over the second-best baseline, GCLink. In cross-cell-type few-shot transfer, BRIDGE consistently outperforms GCLink and GENELink across all six target cell types. A case study on hESC further supports the biological relevance of the predictions, with 9 of the top 10 and 46 of the top 100 novel TF-target interactions validated by ChIPBase.
Molecular dynamics (MD) simulations generate trajectories in a high-dimensional configuration space whose analysis critically depends on molecular descriptors, typically handcrafted observables or learned kinetic embeddings. Designing descriptors that are both expressive and broadly applicable, however, remains challenging. We study persistent homology (PH) as a general-purpose representation for MD and introduce the masked Flood complex, a protein-tailored modification of a recently introduced simplicial complex construction that emphasizes inter-residue structure at low computational cost. Vectorized persistence diagrams then provide information-rich, geometry-aware summaries of protein conformations, which we evaluate on protein class prediction, frame-level observable regression, and Markov state model (MSM) estimation from learned low-dimensional coordinates in a single shared representation space. Results on the mdCATH dataset show that PH-based descriptors are competitive across tasks, with masked Flood PH yielding the most consistent overall performance. Further, when using topologically informed MSMs as a drop-in replacement within the recent MarS-FM framework for generative modeling of protein conformations, we obtain consistently better ensemble statistics than MSMs based on physical observables. Finally, we explore the transferability of the generative model to qualitatively different, fast-folding proteins.
Do LLMs have emotions? A recent paper from Anthropic reports finding internal representations of emotion concepts in Claude Sonnet 4.5, concluding that the LLM has 'functional emotions.' We evaluate this claim against what is known about how emotions actually function in biological systems. We argue that emotions serve two core functions: the context-sensitive interpretation of situations, and the reorganization of processing across multiple systems in response to those interpretations. The Anthropic findings offer partial support for the first function, though the consistent, discrete emotional representations identified in Claude sit uneasily with affective neuroscience findings that human emotion is characterized by variable rather than uniform neural signatures. On the second function, the evidence is mixed: Claude's representations modulate output without producing the dynamic reorganization of attention, decision speed, and motivational state that defines emotion in biological systems. We close by proposing what it would take for an LLM to have emotions.
Genetic evidence is enriched among approved drug targets: in an observational analysis of 26,278 target-disease pairs from Open Targets and ChEMBL, targets with any genetic association had a 3.25-fold higher approval rate than those without (OR = 3.25, 95% CI 2.79-3.79, p = 1.91e-42). A target-level analysis accounting for non-independence of pairs sharing the same gene gave OR = 2.79 (bootstrap 95% CI 2.22-3.53); the oncology pair-level OR of 6.72 attenuates to 2.71 at the target level, illustrating how non-independence inflates area-specific estimates. The enrichment replicated in post-2015 approvals (OR = 3.51, p = 1.72e-8). Feature ablation across six evidence types revealed that literature mining alone accounts for most classifier performance (AUPRC = 0.099 versus 0.109 for all features), consistent with temporal leakage from post-approval publications. Excluding literature, remaining evidence types retain above-baseline signal (AUPRC = 0.084, 1.63x baseline). Sensitivity analyses bracket the pair-level OR between 3.25 and 4.93. Genetic evidence alone yields only a 1.0-percentage-point absolute AUPRC gain and the best model has poor calibration; the classifier has limited practical predictive value. We catalogue 1,433 genetically supported Phase 1/2 pairs as a hypothesis-generating resource. All findings are observational.
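For readers unfamiliar with the pair-level statistic, the following is a minimal sketch of an odds ratio with a Wald 95% confidence interval computed from a 2x2 table. The counts used in the example are hypothetical and are not taken from the study; the target-level bootstrap interval reported above is a different procedure and is not reproduced here.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: genetic support, approved        b: genetic support, not approved
    c: no genetic support, approved     d: no genetic support, not approved
    All four counts must be positive.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Hypothetical counts, for illustration only.
or_value, ci_low, ci_high = odds_ratio_ci(30, 70, 10, 90)
```

Because the Wald interval treats every pair as independent, it will tend to be too narrow when many pairs share a gene, which is the non-independence issue the target-level analysis addresses.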
Modeling growth in bacterial cells is a major issue in systems and synthetic biology. Despite several growth rate functions proposed in the literature, most focus on nutrient composition without explicitly accounting for the possible perturbation provided by the expression of recombinant genes, an effect known as cell load or burden. On the other hand, mathematical models that attempt to provide mechanistic details on the phenomena, leveraging ribosome partitioning and nutrient availability, are generally too detailed and complex to be easily applied to the rational design of synthetic genetic circuits. A bottom-up approach is adopted herein to identify and analyze the minimal model structure, thereby unveiling the fundamental role of negative feedback in ribosomal synthesis in predicting the effects of cell load on both gene expression and growth rate. Indeed, to ensure cellular efficiency, ribosome synthesis must be finely regulated. While an increased number of ribosomes generally enhances protein production and cellular performance, their synthesis incurs a high energetic cost. For this reason, cells have evolved mechanisms to tightly control ribosome synthesis, avoiding unnecessary accumulation. One of the key regulatory strategies, usually neglected in previous cell models, involves a negative feedback loop that modulates the production of ribosomal components. This feedback ensures that ribosomes are produced only in the amount strictly needed, balancing functionality and energy expenditure. This work evaluates the individual contribution of this feedback under heterologous expression conditions using minimal gene-circuit models, explicitly linking ribosome allocation, hidden couplings between protein synthesis levels, and growth rate.
Boolean models are widely used to characterize the dynamics of gene regulatory networks. However, their coarse state discretization limits their ability to capture complex continuous dynamics and continuous parameter dependencies. In this paper, we present a rigorous mathematical framework that embeds monotone Boolean models into a broader class of multilevel combinatorial models, which in turn embed into the Dynamic Signatures Generated by Regulatory Networks (DSGRN) methodology. We define the DSGRN parameter graph, which encodes the notion of parameter adjacency and is used to map Boolean functions to specific nodes within the DSGRN parameter space. We prove that these multilevel discrete update functions act as a multilevel refinement of monotone Boolean models. We demonstrate that purely Boolean models systematically underestimate network dynamics by missing crucial intermediate behaviors such as higher-order multistability and stable periodic orbits. We show that the DSGRN framework efficiently captures a strictly richer set of dynamics consistent with ordinary differential equations (ODEs), providing a mathematically rigorous and computationally viable bridge between discrete and continuous network modeling.
Temporal Interference (TI) stimulation enables deep brain targeting, yet precise optimization tools for mouse models remain limited. We developed a computational optimization tool integrating mouse head modeling with an optimization algorithm to optimize stimulation strategies for predefined target regions. By balancing target intensity and spatial focality, the optimized strategy significantly outperformed empirical baselines. For the CA3-CA1 target, it achieved a 7-fold intensity increase (10.29 vs. 2.89 V/m) under iso-focality conditions. Conversely, for the Dentate Gyrus, it improved spatial confinement ($r_{0.5}$ reduced from 3.99 to 3.54 mm) while maintaining comparable intensity. Cross-model validation on a standardized Sim4Life phantom further confirmed the framework's robustness. This approach offers a powerful tool for enhancing the precision and reproducibility of preclinical TI stimulation studies.
Warm-blooded vertebrates accumulate approximately conserved numbers of physiological cycles over a natural lifetime: of order $10^9$ heartbeats and $10^8$--$3\times10^8$ breaths. These regularities are not exact constants, but their persistence across orders-of-magnitude variation in body mass, metabolic power, physiological frequency, and lifespan suggests that biological time is not measured by chronological duration alone. We develop the Principle of Biological Time Equivalence (PBTE), a thermodynamic framework in which lifetime cycle count is determined by the ratio between total lifetime entropy production and the entropy cost of one physiological cycle. Starting from the open-system entropy balance $\dot S=\dot e_p-\dot h_d$, we define the entropy cost per cycle as $\sigma_0=d\Sigma/dN$, where $d\Sigma$ is the entropy produced as the physiological clock advances by $dN$ cycles. For an adult homeostatic regime, this gives the cycle-count relation $N_\star=\Sigma/\langle\sigma_0\rangle$, with $\Sigma=\int_0^L \dot e_p(t)\,dt$, where $N_\star$ is the lifetime cycle count, $\Sigma$ is total lifetime entropy production, and $\langle\sigma_0\rangle$ is the lifetime-averaged entropy cost per cycle. In the homeostatic limit, $\dot e_p\simeq P/T$, so direct measurement of metabolic power $P$, body temperature $T$, and physiological frequency $f$ gives $\sigma_0\simeq P/(Tf)$. PBTE converts the empirical lifetime-cycle invariants into entropy-cost invariants. Under Kleiber metabolic scaling and quarter-power physiological-frequency scaling, the mass-specific entropy cost satisfies $\bar\sigma_0=P/(TfM)\propto M^{3/4+1/4-1}=M^0$, providing a thermodynamic interpretation of allometric mass cancellation.
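The mass cancellation in the last relation can be checked numerically with a short sketch. It uses only $\sigma_0\simeq P/(Tf)$ and the two scaling laws named above; the prefactors `p0` and `f0` and the body temperature are hypothetical placeholder values, not measurements from the paper.

```python
import math

def entropy_cost_per_cycle(power_w, temp_k, freq_hz):
    """sigma_0 ~ P / (T f): entropy produced per physiological cycle (J/K)."""
    return power_w / (temp_k * freq_hz)

def mass_specific_cost(mass_kg, p0=3.4, f0=4.0, temp_k=310.0):
    """Mass-specific entropy cost per cycle under assumed allometric laws.

    p0 and f0 are hypothetical prefactors chosen only for illustration.
    """
    power = p0 * mass_kg ** 0.75    # Kleiber scaling, P ~ M^(3/4)
    freq = f0 * mass_kg ** -0.25    # quarter-power frequency scaling, f ~ M^(-1/4)
    return entropy_cost_per_cycle(power, temp_k, freq) / mass_kg

# The result should not depend on mass: compare a 30 g and a 3000 kg animal.
small = mass_specific_cost(0.03)
large = mass_specific_cost(3000.0)
same = math.isclose(small, large, rel_tol=1e-9)
```

Under these assumptions the value reduces to `p0 / (temp_k * f0)` for every mass, which is the $M^0$ scaling stated in the text.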
A common objection to artificial or simulated consciousness is that a simulated brain is no more conscious than simulated water is wet. We address this from the perspective of Intrinsic Computational Functionalism (ICF): if consciousness is computationally constituted, it depends not on externally imposed descriptions but on the computational structures a system physically realizes in virtue of its own causal-dynamical organization. In previous work we developed Canonical Functionalism as a mathematically precise special case of this anti-interpretivist program, identifying functional states by their complete future input-output roles under a fixed interface. Here we argue that this input-output construction, though important, is incomplete: as a behavioral boundary case of ICF, it makes lookup tables and unfolded systems that preserve the same boundary behavior canonically equivalent. A consciousness-relevant canonical representation must instead include internal mechanisms, interventions, and joint readouts belonging to the relevant intrinsic organization. We therefore define a mechanism-enriched canonical structure and use it to formulate Intrinsic Causal-Computational Realization (ICCR), a realization relation preserving physical implementation, intrinsic state individuation, transition structure, intervention profiles, and the relevant agent-body-world boundary. The central result is conditional: if conscious properties are invariants of intrinsic causal-computational organization, then any system satisfying ICCR realizes the same consciousness-relevant properties, whether biological, artificial, or simulated. We discuss objections including biological naturalism and integrated information theory. We conclude that to deny consciousness to a simulation, one must identify a consciousness-relevant intrinsic causal-computational structure that the simulation fails to realize.
Motivation: BWA-MEM remains a popular short-read mapper especially for the purpose of variant calling. Several groups have accelerated this algorithm as it has been the performance bottleneck of many current workflows. However, constrained by the original design, these drop-in replacements could only achieve limited speedup. Breaking changes to BWA-MEM are required for further improvement. Results: We developed minibwa for aligning short and accurate long reads against a reference genome. It combines BWA-MEM variable-length seeding with minimap2 chaining and base alignment. It speeds up BWA-MEM2 further with additional prefetch for seeding, new heuristics to skip unnecessary mate rescue and reduced effort in highly repetitive regions where reads would anyway be wrongly mapped due to structural changes. Minibwa is about four times as fast as BWA-MEM and over twice as fast as BWA-MEM2 at comparable accuracy. It also natively supports directional bisulfite sequencing data to high mapping accuracy. Availability and implementation: this https URL
The effective reproduction number ($R_t$) is widely used to track epidemic dynamics in real time. The standard estimation framework uses "instantaneous $R_t$," defined via the renewal equation, which relates new infections to past infections through a generation interval distribution. Compartmental models like SEIR yield a seemingly distinct quantity, "mechanistic $R_t$," based on the effective contact rate and duration of infectiousness. We prove these two definitions are equivalent under homogeneous mixing, the standard assumption in compartmental modeling. We also derive the generation interval distribution implied by SEIR dynamics. A practical consequence is that generation intervals, often treated as assumption-light inputs to renewal equation estimators, in fact encode specific compartmental structure.
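As a concrete reference point, the renewal-equation definition can be sketched in discrete time as $R_t = I_t / \sum_{s\ge 1} I_{t-s} w_s$, where $w_s$ is the generation interval distribution. The function below is a simplified illustration under that standard discretisation, with no smoothing or uncertainty quantification; the name `instantaneous_rt` and the example inputs are assumptions, not the estimator analysed in the paper.

```python
import numpy as np

def instantaneous_rt(incidence, gen_interval):
    """Discrete renewal-equation estimate of R_t.

    incidence    : new infections per time step, I_0, I_1, ...
    gen_interval : weights w_1, w_2, ... for lags 1, 2, ... (normalised here)
    Returns an array with NaN where the full lag window is not yet available.
    """
    incidence = np.asarray(incidence, dtype=float)
    w = np.asarray(gen_interval, dtype=float)
    w = w / w.sum()
    rt = np.full(incidence.shape, np.nan)
    for t in range(len(w), len(incidence)):
        past = incidence[t - len(w):t][::-1]   # I_{t-1}, I_{t-2}, ..., I_{t-len(w)}
        pressure = np.dot(past, w)             # total infectiousness at time t
        if pressure > 0:
            rt[t] = incidence[t] / pressure
    return rt

# Constant incidence implies R_t = 1 for any generation interval.
flat = instantaneous_rt([10] * 8, [0.2, 0.5, 0.3])
# Doubling each step with a one-step generation interval implies R_t = 2.
doubling = instantaneous_rt([1, 2, 4, 8, 16], [1.0])
```

The choice of `gen_interval` is exactly the input that, per the abstract, carries compartmental assumptions: a different latent or infectious period changes the weights and therefore the estimate.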
Adaptation often requires the assembly of favorable combinations of mutations that are individually deleterious. As a result, populations may remain trapped in low-fitness genetic states even when higher-fitness genotypes exist. Recombination plays a dual role in this process because it can both generate and disrupt advantageous multilocus combinations. Previous work showed that the balance between selection and recombination determines whether populations cross fitness valleys or persist in low-fitness states associated with demographic decline. We study this problem in a three-locus model consisting of two selected loci and a recombination modifier locus. The modifier has no direct effect on fitness but alters the recombination rate between the selected loci, allowing recombination itself to evolve. We characterize the fixation states of the system and derive explicit conditions for the local stability of the low-fitness fixation set. Stability depends on selection strength, recombination among selected loci, recombination between the modifier and selected loci, and modifier composition. In the classical two-locus model, stability depends on a single recombination parameter. By contrast, the modifier model generates a continuum of fixation states whose stability varies with modifier frequency. Populations with identical selected-haplotype frequencies can therefore differ in stability solely because they differ in modifier composition. We further show that modifier polymorphism can either stabilize or destabilize the low-fitness state, depending on the relative magnitudes of modifier-dependent recombination rates. These results demonstrate that genetic variation affecting recombination alters evolutionary outcomes not only by changing the formation of favorable multilocus combinations but also by changing the stability of alternative evolutionary states.
Evolving populations both respond to and reshape their environments, making fitness landscapes dynamic rather than static. We present a minimal eco-evolutionary model that couples replicator dynamics for a population density with a regenerating resource-driven landscape through a single environmental sensitivity parameter. This allows evolving populations to generate and ride self-induced selection gradients, enabling directed motion in trait space even on initially flat landscapes. Our analysis reveals sustained oscillations, chaotic dynamics, and evolutionary branching. To explain these, we derive reduced dynamical equations that extend Fisher's fundamental theorem to deformable landscapes by incorporating curvature-driven variance dynamics and environmental feedback. Together, these results show how populations actively reshape and self-propel themselves on regenerating landscapes.
Since Schr\"odinger's \emph{What Is Life?}, the physical basis of biological organization has been understood in terms of the interplay between matter, energy, and information. Subsequent developments in molecular biology, information theory, nonequilibrium thermodynamics, and evolutionary theory have clarified how hereditary information is stored, maintained, and modified through natural selection. Here, we extend this program by asking what minimal physical principles are required for adaptable life. We propose six postulates governing adaptive living systems: the existence of an entropy source, longevity of information, fast response to environmental change, repeatable operation, energetic efficiency, and networks of multiple interacting switches. These principles are introduced as a minimal foundation for biological information processing and adaptation. We examine their implications and compare them with observations across multiple levels of biological organization, including genetic inheritance, epigenetic regulation, cellular signaling, neural computation, metabolic networks, and ecological systems. The resulting framework suggests that adaptability emerges from the interplay of energy flow, information storage, information processing, and natural selection in systems maintained far from thermodynamic equilibrium. Although the proposed principles are qualitative and not yet predictive, they provide a unified perspective on the physical constraints governing adaptive behavior and offer a starting point for the development of a quantitative theory of adaptable life.
Aligning neural activity across subjects offers the promise of discovering shared computational principles and generalizable decoders. However, traditional alignment methods require shared stimuli across subjects, a constraint that limits applicability to naturalistic paradigms with limited or non-overlapping data. We introduce a Multi-Encoder-Decoder Variational Autoencoder (MED-VAE) that achieves cross-subject alignment without shared stimuli by anchoring representations to a common scaffold provided by a pretrained ANN. Using the Natural Scenes Dataset, we show that MED-VAE creates common latent spaces with superior semantic organisation, achieving higher cross-subject alignment than common methods while maintaining robust generalisation to held-out stimuli where traditional methods degrade. Reconstructing from these common spaces back to each subject's original neural space, MED-VAE preserves equal stimulus-driven signal in its cross-subject latent space. Finally, we show that this superior alignment directly enables cross-subject neural prediction, as demonstrated via cross-subject image decoding. In summary, we introduce a framework to identify generalisable common subspaces for cross-subject predictions and downstream tasks, demonstrated here for visual cortex responses to static images.
Public neurophysiological datasets are increasingly accessible but remain hard to reuse: turning one into a trained model still takes thousands of lines of code for download, loading, format repair, windowing, and evaluation, and a dataset that meets metadata standards can still fail to load. EEG-Dash is a software resource that catalogues 791 publicly archived recordings (39,778 participants, over 86,051 hours) spanning electroencephalography (EEG), magnetoencephalography (MEG), intracranial EEG (iEEG), electromyography (EMG), and functional near-infrared spectroscopy (fNIRS) from the OpenNeuro and NEMAR archives. It exposes each dataset as an importable, queryable class that preserves signal attributes and loads into machine-learning workflows without custom code, delegating signal handling to MNE-Python, windowing to Braindecode, and format compliance to the official Brain Imaging Data Structure (BIDS) validator. A metadata-first registry adds semantic search, a format-repair layer, automatic dataset-level tags drawn from each source publication, and a feature-extraction framework. The catalogue, with per-record loadability and compliance metadata, supports benchmarking, model development, and cross-dataset analysis.
Biomolecular sequence models are increasingly reused outside the studies in which they were introduced, but public checkpoints rarely preserve the execution context needed to inspect source-defined behavior, adapt models to new assays, compare models under shared task definitions or deploy biological predictions. MultiMolecule is an open-source Python ecosystem that turns heterogeneous RNA, DNA and protein sequence-model releases into complete, source-checked model-family implementations with shared loading, workflow and prediction interfaces. The Resource state reported here includes 53 complete model-family implementations with 112 standardized model checkpoints, together with 16 curated dataset resources released through 39 public dataset repositories and 10 user-facing prediction pipelines. Standardized components are linked to source provenance, conversion or preparation code, source-reference checks, Extended Data summaries and public documentation, allowing users to inspect what was standardized, what behavior was checked and how each component enters training, evaluation, inference or deployment. By shifting reuse from repository-specific checkpoints to executable implementations connected to standardized checkpoints, curated datasets, Runner workflows and biological prediction pipelines, MultiMolecule provides common infrastructure for preserving source-defined model behavior, adapting models to new assays, enabling controlled evaluation and deploying biomolecular predictions.
Biophysical neuron models link measurements of neural activity to underlying cellular mechanisms. Yet, a central challenge is that the kinetics of many ion channels are poorly characterized, and practical simplifications -- omitting channels or reducing morphological detail -- introduce systematic gaps between model and biology. Bridging these gaps requires approaches that can flexibly discover unmodeled dynamics while preserving mechanistic interpretability. Here, we introduce a hybrid modeling framework that embeds neural ordinary differential equations into conductance-based biophysical models to capture unknown currents or mis-specified channel kinetics. By parameterizing the neural ODE in terms of voltage-dependent steady-state and time-constant functions, we recover interpretable gating dynamics directly from voltage recordings without assuming a functional form. We show that the hybrid model fits the gating kinetics of 2400 ion channel models and recovers unknown gating dynamics from single current-clamp recordings, generalizing to out-of-distribution stimulus regimes under realistic inputs and parameter misspecification. We also use our method to reduce a multicompartment model of a cortical neuron into a single-compartment hybrid model with a learned axial current, yielding up to an order of magnitude lower computational cost. Together, our results establish a plug-and-play framework for selectively replacing unknown components of conductance-based models with neural ODEs while preserving their mechanistic structure.
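For orientation, a gating variable parameterized by a voltage-dependent steady state and time constant conventionally follows first-order relaxation kinetics of the Hodgkin-Huxley type. The form below is that textbook convention, written here as an assumption about what the parameterization refers to; the abstract itself does not give the equation, and the symbols $x$, $x_\infty$, and $\tau_x$ are introduced only for illustration.

```latex
\[
  \frac{dx}{dt} = \frac{x_\infty(V) - x}{\tau_x(V)},
  \qquad 0 \le x_\infty(V) \le 1, \quad \tau_x(V) > 0 .
\]
```

In this reading, the learned components would be the two functions $x_\infty(V)$ and $\tau_x(V)$, which remain directly comparable to classical channel kinetics.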
Determining an appropriate sample size for a study is a crucial step in planning scientific research. Appropriate sample size planning avoids both inadequate and inflated sample sizes. Inflated sample sizes waste resources, the time and effort of human subjects, and the lives of experimental animals. Inadequate sample sizes, a much more common problem, waste even more resources through the inability to detect biologically meaningful differences and encourage questionable research practices like $p$-hacking. Microbiome studies are particularly challenged by small sample sizes, especially in studies of human subjects or expensive animal models. In practice, the statistical power of taxa within a differential abundance study is influenced by the effect size (typically quantified as fold change), mean abundance of individual taxa, and the number of samples. We present a novel approach for sample size calculation for differential abundance studies as a function of effect size, mean abundance and statistical power of taxa. Our method is implemented in the this http URL R package, available at this https URL. We applied our model for sample size calculation using estimates of mean abundance and fold change of taxa obtained from thirty real-world microbiome datasets. Our results showed that differential abundance microbiome studies require larger sample sizes than are currently prevalent in the literature to achieve adequate statistical power. Our framework will help researchers make informed decisions about appropriate sample sizes.
Gene regulatory networks govern cellular fate decisions through multistable dynamics. The genetic toggle switch is a canonical model of such behaviour; yet, the impact of cell division on its dynamics remains poorly understood. We derive analytical separatrices for a simplified Boolean toggle switch with and without division. We show that division can redirect trajectories with identical initial conditions to opposing stable states, and we define a region of disagreement where fate decisions are predicted incorrectly if division is neglected. Our results imply that division can fundamentally reshape fate boundaries in multistable regulatory networks.
We construct a three-layer reaction-diffusion model of an autocatalytic chemical system in which raw molecules ($a_i$), catalytic proteins ($p_l$) and large RNA/protein ``genes'' ($W_p^{(k)}$) interact through a mass-action stoichiometry tensor $\mathrm{Coef}_{ijk}$ whose magnitude is modulated by the fold-stable activity of the largest this http URL-action is broken by an $\epsilon$-noise term so that the system is nonequilibrium. We compute the total entropy production $\sigma(t)$, the genetic Shannon entropy $S_\mathrm{gene}$ and the thermodynamic uncertainty relation (TUR) and thermodynamic speed limit (TSL) bounds on growth and evolution rates. The hierarchical model exhibits the expected co-occurrence of $\sigma_\mathrm{env}\!\uparrow$ and $S_\mathrm{gene}\!\downarrow$ predicted by Schr\"odinger's negentropy argument and reformulated as maximum-entropy-production-principle (MEPP)-driven adaptation. In contrast to a single kinetic-proofreading-like cycle, whose TUR products of $\sim 5$, matching the experimentally reported regime of the this http URL hierarchical model's TUR product sits $10^4$-$10^5$ above the universal bound of 2, and the TSL ratio sits $10^6$-$10^8$ above its bound of 1. Scaling the number of molecules leaves the looseness intact for the hierarchical model but tightens it monotonically with particle number for the minimal model. We close by drawing an explicit correspondence between the autocatalytic system and diffusion-model training: $a_\mathrm{ext} \to a$ flux $\Leftrightarrow$ data-information flow, $\tanh(\beta Wp - \theta)$ $\Leftrightarrow$ score network, replication noise $\Leftrightarrow$ forward-diffusion noise, $S_\mathrm{gene} \searrow$ $\Leftrightarrow$ $H[q(\theta|\mathcal{D})] \searrow$. All code and figures are available at this https URL
How the wiring and functional organization of cortex shape recurrent computation remains a central question in both neuroscience and machine learning. Here, we leverage data released through the Machine Intelligence from Cortical Networks (MICrONS) program--a functional connectomics resource spanning multiple areas of mouse visual cortex, in which dense calcium imaging is co-registered with high-resolution electron microscopy reconstruction from the same animal--to build biologically grounded recurrent neural networks. Using neuronal spatial coordinates, anatomical connectivity, and function-derived relationships from nearly 12,000 coregistered excitatory neurons, we initialize recurrent weights and impose communication-aware spatial constraints during learning. Across three cognitive decision-making tasks, networks constrained by cortical structure and function consistently outperform baseline and partially constrained models. Functional weight initialization provides the largest gain, while real spatial embedding yields robust additional improvements across conditions. These biologically grounded networks also develop low-entropy, modular, and small-world organization, and retain strong performance even when recurrence is restricted to positive weights. Together, our results show that the machinery of cortex--its geometry, wiring, and functional structure--can be harnessed as a powerful inductive basis for building recurrent networks that learn more effectively while converging toward key organizational principles of biological computation.
A system of coupled oscillators provides a fundamental framework for modeling a wide range of physical and biological phenomena. In neuroscience, the central nervous system exhibits synchronized oscillatory activity with adjacent brain regions, giving rise to traveling wave dynamics for instance during sleep. Similarly, in the gastrointestinal system, neuromuscular cells coordinate their oscillations to generate propagating waves of slow wave activity. To estimate probability distributions of multivariate phase relationships, existing approaches typically rely on equilibrium thermodynamics, expressing the system in a Boltzmann form through a pairwise exponential family distribution. However, these assumptions are often violated in real-world systems, which are inherently dynamic and frequently transition between equilibrium and non-equilibrium regimes. To address this, we propose an efficient method for estimating the probability distribution of coupled oscillators that does not assume thermodynamic equilibrium. Using a Langevin dynamics-based construction, the approach enables accurate modeling even in non-equilibrium regimes. The maximum likelihood estimation method is shown to have a closed form algebraic solution in the high sampling rate regime, a condition commonly satisfied by modern data acquisition systems, which makes it readily applicable in practice. We demonstrate its robustness on simulated data, where it outperforms existing approaches in non-equilibrium settings, and further illustrate its utility for characterizing dynamic brain traveling waves in response to brain stimulation and in hypothesis testing within the context of electrophysiologic recordings of the human stomach.
Therapeutic peptides occupy a valuable design space between small molecules and biologics, but their development requires satisfying several competing constraints at once: solubility, hemolytic activity, and nonspecific surface fouling are governed by overlapping sequence features, so improving one property often degrades another. Computational design addresses this by pairing generative models with sequence-based property predictors, iteratively proposing and refining candidates. However, these components are typically wired together as monolithic scripts that are difficult to inspect, extend, or reuse, and they often refine sequences by natural-language reasoning rather than by tracking the evolving multi-property state of each candidate. We present Pepti-Agent, a closed-loop, peptide-specific framework that exposes generation, property prediction, and single-residue mutation as independently inspectable Model Context Protocol (MCP) tools. A large language model controller invokes these tools and consults live predictor output between calls, so refinement is guided by each sequence's current property profile rather than by language reasoning alone. Task-specific PeptideGPT models generate candidates, ProtBERT-based classifiers score solubility, hemolysis, and non-fouling, and two interchangeable mutation operators propose sequence edits. By recording a per-step trace of controller decisions, predictor outputs, and accepted mutations, Pepti-Agent offers a reproducible substrate for benchmarking multi-objective design strategies and for prioritizing candidates for experimental validation.
Assembly theory is an experimental and theoretical framework that introduces a metrological approach to detecting life, with potential applications across diverse substrates. Its two central observables are assembly index and copy number. The assembly index is the minimum number of joining operations required to construct an object from its elementary parts; for molecules, it can be measured using mass spectrometry, infrared spectroscopy, and NMR. Copy number is the abundance of a given distinguishable object within a sample. A key empirical result of the theory is that high assembly index combined with high copy number constitutes a signature that cannot arise abiotically, and this has been validated experimentally in application to molecular biosignatures. The foundational theoretical concept underlying these results is the assembly space, which encodes the causal possibilities determinable from observed objects, with the assembly index the shortest path to them given the physical constraints of a given substrate. Here, we provide a generalized formalism to describe assembly spaces and tools for assembly index approximations. We begin by reviewing the applications of assembly theory across molecules, minerals and atmospheres, and then introduce a general, substrate-independent formal definition of assembly spaces and assembly indices. We develop a unified path hierarchy framework to clarify relationships among the various representations of assembly spaces and assembly paths that appear in the literature on molecular assembly. Finally, we show how formal grammar algorithms can be adapted to efficiently bound assembly index calculations and provide clarification on the utility of such approximations, with the goal to increase the accessibility of tools to explore this emerging area for a broader group of researchers across chemistry, biology, and complexity science.
Protein language models (pLMs) can generate novel protein sequences with properties beyond those observed in nature, yet the mechanisms underlying protein generation remain poorly understood. Existing mechanistic interpretability methods based on sparse autoencoders and transcoders primarily focus on protein representation learning models and do not capture the computation required for autoregressive generation. Here, we introduce ProGenMech, a mechanistic interpretability framework for generative protein language models that extends cross-layer transcoders (CLTs) to ProGen3, a sparse Mixture-of-Experts model trained for both causal generation and span infilling. Unlike per-layer approaches, CLTs reconstruct each layer using sparse latent variables from all preceding layers, enabling faithful recovery of inter-layer generative computation. We further develop a zero-shot circuit discovery framework to identify sparse latent circuits responsible for protein generation and fitness prediction. In causal generation and zero-shot fitness estimation tasks, ProGenMech outperforms local transcoder baselines in recovering ProGen3's probability distribution and functional scoring behavior, while matching the original model's generative distribution in span infilling tasks. Moreover, the recovered circuits reveal biologically meaningful motifs and functional regions associated with conserved sequence patterns and protein fitness landscapes, establishing a foundation for interpretable and steerable protein generation.
This study examines how deficiencies in one brain connectome modality propagate to the other, using the Krakencoder as a simulation framework. Structural and functional connectomes from 702 healthy participants in the Human Connectome Project were analyzed, with the impact of each of the Yeo-7 functional networks assessed separately. Seven scenarios were considered, each involving the removal of a single network while the remaining networks were preserved. The resulting perturbations in cross-modal predictions were quantified using three complementary metrics: KL divergence on eigenvalue spectra, Frobenius norm, and Wasserstein distance. In addition, the persistence of sex-specific information within the predicted connectomes was evaluated. Across all metrics and both prediction directions, the Default Mode Network produced the largest perturbations, whereas the Somatomotor network yielded the smallest. Sex differences in network-level perturbation signatures were subtle, with the best result being an accuracy of 66.09% from connectomes predicted under network-removal conditions. In contrast, connectomes predicted from intact inputs achieved substantially higher sex classification accuracy, reaching up to 84.76%. These findings confirm that full predicted connectomes retain considerably more sex-discriminative information than perturbation-derived signatures alone.
This paper addresses real-time state estimation for a nine-compartment nonlinear COVID-19 epidemic model with two co-circulating strains, a super-spreader subpopulation, vaccination with waning immunity, hospitalization, and mortality. Time-varying transmission and vaccination rates are known inputs from a companion calibration, leaving the reconstruction of all nine states from three routinely reported observables: hospitalizations H, fatalities F, and vaccinated stock V. The contributions are theoretical rather than in the filter recursion. First, a Lie-derivative observability analysis yields, via a six-step derivation, the closed-form determinant |det(O9)| = delta_w * gamma_a^2 * kappa * rho2 * w1^2 * (delta_i - delta_p)^2 * |r1 - r2|, showing the level-2 codistribution is rank-deficient at the calibrated symmetric parameters (delta_i = delta_p, r1 = r2); the third Lie derivative restores full rank 9, with r2 the symmetry-breaking parameter. Second, an EKF is designed on the Euler-discretized dynamics with a closed-form 9x9 Jacobian and Joseph covariance update. Third, local exponential mean-square boundedness of the error is proved as a full theorem via the Reif-Gunther-Yaz-Unbehauen hypotheses, exploiting the bilinear drift and linear output to obtain a global-radius quadratic remainder bound that extends to bilinear-drift, linear-output systems. Fourth, the noise covariances are designed from calibration residuals and assessed by NEES and innovation-whiteness tests. All experiments use synthetic measurements from the calibrated model, so reported RMSE values (0.07%-2.72%) are methodology benchmarks, not predictive accuracy. A parameter-mismatch study shows measured and directly-coupled channels stay accurate under model error up to +/-30% while indirectly observed states degrade gracefully. The framework provides the state-feedback basis for future Model Predictive Control.
Scientific reasoning models for biology combine language models with foundation models trained on multimodal biological data, including DNA, RNA, and proteins. These models are built through post-training, yet how each stage shapes reasoning and generalization remains poorly understood. We study when post-training improves performance and when it induces over-specialization. Across genomics, transcriptomics, and proteins, we train and evaluate more than 100 biological reasoning models under controlled variation in backbone, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL), measuring both in-domain (ID) and out-of-domain (OOD) performance. We find that each post-training stage reshapes generalization in a distinct way rather than contributing uniform gains. CPT improves downstream performance by aligning models with biological language. SFT consistently increases ID performance but causes OOD performance to peak early and decline as models fit the training distribution. RL, when applied to strong SFT checkpoints with aligned rewards, improves OOD performance and partially recovers generalization. These results show that biological reasoning does not improve monotonically with additional supervision or compute. Instead, performance depends on how training stages are composed. Under fixed post-training budgets, the strongest ID-OOD trade-off comes from brief SFT, larger RL allocations, and asymmetric adaptation capacity across stages.
Exploration in deep reinforcement learning (RL) is commonly implemented as temporally uncorrelated white noise. However, recent works show that temporally correlated colored noise can improve exploration efficiency by producing smooth trajectories with better coverage of the state space. We inquire whether action noise inspired by infant spontaneous movements can also improve exploration in deep RL. We find that the power spectral densities of babies' end-effector velocities follow a colored noise process where the spectral exponent increases with age. Inspired by this developmental pattern, we introduce a mechanism that progressively increases the temporal auto-correlation of exploration noise during RL training, matching the infant statistics. Experiments across several RL environments show that infant-inspired noise produces structured exploratory behavior and can improve learning efficiency compared to conventional exploration strategies. These findings suggest that human motor and cognitive development can provide useful guidance for designing learning mechanisms in artificial agents. Our code is available at this https URL.
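The idea of progressively increasing the temporal auto-correlation of exploration noise can be illustrated with a minimal sketch. It assumes a first-order autoregressive (AR(1)) process as a simple stand-in for colored noise and a linear schedule over training progress; neither is confirmed as the authors' mechanism, and the names `ar1_noise`, `rho_schedule`, and `lag1_autocorr` as well as the values `rho_start` and `rho_end` are hypothetical.

```python
import math
import random


def ar1_noise(n_steps, rho, sigma=1.0, rng=None):
    """Sample AR(1) noise with lag-1 autocorrelation rho and stationary std sigma.

    Assumption: AR(1) is used here as a simple correlated-noise stand-in,
    not as the colored-noise process fitted to infant data in the abstract.
    """
    rng = rng or random.Random()
    # Scale the innovations so the stationary variance stays sigma**2.
    innovation_std = sigma * math.sqrt(1.0 - rho * rho)
    x = rng.gauss(0.0, sigma)
    samples = []
    for _ in range(n_steps):
        samples.append(x)
        x = rho * x + rng.gauss(0.0, innovation_std)
    return samples


def rho_schedule(progress, rho_start=0.0, rho_end=0.9):
    """Hypothetical linear schedule: correlation grows with training progress in [0, 1]."""
    p = min(max(progress, 0.0), 1.0)
    return rho_start + p * (rho_end - rho_start)


def lag1_autocorr(xs):
    """Empirical lag-1 autocorrelation of a sequence."""
    mean = sum(xs) / len(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((a - mean) ** 2 for a in xs)
    return num / den
```

Early in training (`progress` near 0) the samples behave like white noise, while late in training consecutive samples are strongly correlated and produce smoother action perturbations. In an agent, one such sequence would be generated per action dimension and added to the policy output.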
Living cells utilize condensates to spatially concentrate molecules in response to dynamic signals. For instance, nuclear condensates respond to oscillations in transcription factor levels in the nucleoplasm, including those involved in repairing multiple DNA breaks. To understand how oscillating signals affect condensates, we analyze a theoretical model using numerical simulations and analytical theory. While passive dynamics would drive all molecules into a single condensate, we find that sufficiently fast oscillations stabilize multiple droplets, allowing control of their sizes. We thus reveal a new behavior of chemically active droplets, which could be exploited in synthetic applications.
Transformers are widely used as a general-purpose substrate for learning complex correlations between a large collection of coupled variables, but their internal mechanisms have remained mysterious. We introduce a theory of a deep transformer as a mean-field interacting system that implements distributed inference, subject to constraints on communication, locality and depth. We show that such a system can exploit internal state representations ('function vectors') to infer a latent context variable at increasingly finer scales over its layers. In an in-context regression task, the theory predicts a non-trivial relationship between non-Gaussian, hierarchical structure in the latent context variable, and transformer depth. Predictions are tested using constrained linear attention transformers and demonstrate adaptive inference in deep architectures. Feedforward blocks and depth enable transformers to implement a much richer class of in-context learning algorithms than previously described.
Nanopores are versatile single-molecule sensors, but their utility is fundamentally constrained by stochastic translocation dynamics that warp any encoded information. We resolve this by shifting from time-domain analysis to a learned latent-space mapping via a contrastive encoder trained exclusively on simulated signals from a physics-informed model. This encoder maps solid-state nanopore signals of engineered DNA barcodes into an interpretable molecular coordinate system. The learned representation is responsive to structural barcode parameters while remaining invariant to acquisition conditions and translocation conformation, allowing data pooling across devices. Molecule identification requires a single pass through the encoder, reducing computational cost by three orders of magnitude relative to alignment-based methods. We validate the approach experimentally through mixture quantification, rare-variant detection, consensus barcode reconstruction, and real-time signal acquisition. This shift from temporal analysis to mapping structural coordinates into a latent space changes the paradigm behind analyzing stochastic sensor signals by linking classification to interpretable encoded molecular information.
This study aims to identify typical collective phenomena that emerge in excitatory and inhibitory (E-I) spiking neural networks as reported in recent computational studies. The methodology follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) procedure, comprising three primary stages: an initial literature search in the SCOPUS database, a screening process based on specific inclusion and exclusion criteria, and a review of the selected literature. Out of 491 documents from 2014 to 2024, six research papers were selected in the review stage. Four generic collective regimes were identified: synchrony, irregular behavior, stationary state, and oscillatory patterns. Our review findings suggest that the collective dynamics of E-I spiking neurons stem from the interplay of intrinsic neuronal characteristics, network topology, and external stimuli. Additionally, the prevalent use of the Quadratic Integrate-and-Fire (QIF) neuron model in the literature highlights its significance as a robust candidate for exploring collective behaviors in large-scale neuronal networks. The findings outlined in this paper may be useful for readers who lack prior familiarity with computational modelling of spiking neurons but have an interest in the field.
Identifying preictal states -- periods during which seizures are more likely to occur -- remains a central challenge in clinical computational neuroscience. In this study, we introduce a novel framework that embeds functional brain connectivity networks, derived from intracranial EEG (iEEG) recordings, into a low-dimensional Euclidean space. This compact representation captures essential topological features of brain dynamics and facilitates the detection of subtle connectivity changes preceding seizures. Using standard machine learning techniques, we define a dimensionless biomarker, $\mathcal{B}$, that discriminates between interictal (seizure-free) and preictal (within 24 hours of seizure) network states. Our method focuses on connectivity patterns among a subset of informative iEEG electrodes, identified through permutation-based testing, enabling robust classification of brain states across time. We validate our approach using a leave-one-out cross-validation scheme and a pseudo-prospective forecasting strategy, assessing performance with metrics such as F1-score and balanced accuracy. Results show that low-dimensional Euclidean embeddings of iEEG connectivity yield interpretable and predictive markers of preictal activity, offering promising implications for real-time seizure forecasting and individualized therapeutic interventions.
Designing molecules that are both property-optimal and readily synthesizable is a central challenge in drug discovery. Existing works that do consider synthesizability can jointly output predicted synthesis routes for generated molecules. However, little attention has been paid to controlling the ease of synthesis or to flexibly incorporating desired reaction constraints. On the other hand, virtual screening searches for commercially available compounds, but faces challenges when scaling to ultra-large (billion-size and beyond) chemical spaces. Here, we propose a generative design framework that unifies synthesis-constrained molecular design and ultra-large-scale virtual screening through steerable and granular synthesizability control. Generated molecules satisfy arbitrary multi-parameter optimization objectives with predicted synthesis routes satisfying mix-and-match constraints: including or avoiding certain reactions, incorporating specific building blocks, and minimizing synthesis route length. In an end-to-end in-house campaign targeting BRD4, we designed molecules synthesizable with specific selected reactions and building blocks, synthesized all six selected compounds, and identified two micromolar binders. We further demonstrate that reaction control enables efficient navigation of ultra-large make-on-demand chemical spaces to identify property-optimal candidates. By applying our framework to Chemspace's Freedom 4.0 make-on-demand space (142 billion molecules), we generated ~320k molecules (0.00023% of the library) on a single consumer-grade GPU (with only 8 GB GPU memory) and identified a micromolar Wee1 binder amongst 60 synthesized candidates. The single unified framework thus enables generating novel synthesizable molecules and retrieving catalogue-ready candidates, offering a flexible solution to mitigating the synthesizability bottleneck.
Recent advances in biomolecular modeling have been catalyzed by models such as AlphaFold3 (AF3), which introduce science-informed changes to the transformer architecture. Unlike transformers, a defining characteristic of AF3-style models is their 3D attention over 2D pairwise representations which produces tensors whose computation and memory costs scale cubically with sequence length. As a result, despite moderate parameter counts, AF3-style models are far more expensive to train than size-equivalent transformers, and are severely constrained by GPU memory capacity. Our characterization shows 3D attention fundamentally changes the training workload, causing massive 3D attention maps, complex inter-operator dependencies, kernel fragmentation, and heavy host-side data pipelines which differ substantially from LLM training, leading to poor utilization on modern GPU systems. Moreover, existing GPU optimizations do not adequately address these challenges due to complex cross-layer inter-operator dependencies introduced by 3D attention. Motivated by these challenges, we introduce MegaFold, a novel cross-platform system for efficient training of next-generation 3D-attention protein models. MegaFold combines a memory-efficient 3D-attention kernel, a communication-efficient sharding strategy for quadratic representations, fused operator implementations for critical execution paths, and a determinism-aware host-device pipeline that eliminates preprocessing stalls. Evaluation on both NVIDIA H200 and AMD MI250 GPUs shows that MegaFold enables training with up to 3.36$\times$ longer sequence lengths on 32 GPUs while reducing end-to-end execution time by up to 1.73$\times$ (NVIDIA) and 1.62$\times$ (AMD).
Brain dynamics dominate every level of neural organization -- from single-neuron spiking to the macroscopic waves captured by functional magnetic resonance imaging (fMRI), magnetoencephalography (MEG), and electroencephalography (EEG) -- yet the mathematical tools used to interrogate those dynamics remain scattered across a patchwork of traditions. Neural mass models (NMMs) (aggregate neural models) provide one of the most popular gateways into this landscape, but their sheer variety -- spanning lumped parameter models, firing-rate equations, and multi-layer generators -- demands a unifying framework that situates diverse architectures along a continuum of abstraction and biological detail. Here, we start from the idea that oscillations originate from a simple push-pull interaction between two or more neural populations. We build from the undamped harmonic oscillator and, guided by a simple push-pull motif between excitatory and inhibitory populations, climb a systematic ladder of detail. Each rung is presented first in isolation, next under forcing, and then within a coupled network, reflecting the progression from single-node to whole-brain modeling. By transforming a repertoire of disparate formalisms into a navigable ladder, we hope to turn NMM choice from a subjective act into a principled design decision, helping both theorists and experimentalists translate between scales, modalities, and interventions. In doing so, we offer a Rosetta Stone for brain oscillation models -- one that lets the field speak a common dynamical language while preserving the dialectical richness that fuels discovery.
Discrete dynamical models underpin systems biology, but we still lack substrate-agnostic diagnostics for identifying finite-horizon dynamical signatures that may be relevant to open-ended evolution (OEE), such as the recurrent production of novel phenotypic states rather than rapid settling or unstructured noise. We introduce a simple, model-independent metric, $\Omega$, that summarizes the residence-time-weighted contribution of attractor cycle lengths across the sequence of recurrent episodes realized within a finite observation window. $\Omega$ is zero for single-attractor dynamics and also vanishes for pure novelty without recurrence, while increasing when trajectories repeatedly enter multiple persistent cyclic phenotypes. Using Random Boolean Networks (RBNs) as a controlled testbed, we compare classical Boolean dynamics with biologically motivated non-classical mechanisms (probabilistic context switching, annealed rule mutation, paraconsistent logic, modal necessary/possible gating, and quantum-inspired superposition/paired-state coupling) under homogeneous and heterogeneous updating schemes. Our results support the view that undecidability-adjacent, state-dependent mechanisms -- implemented as probabilistic context switching, modal necessity/possibility gating, paraconsistent logic, or quantum-inspired correlated branching -- are enabling conditions for sustained novelty. At the end of our manuscript we outline a practical extension of $\Omega$ to continuous/hybrid state spaces, positioning $\Omega$ as a portable proxy for OEE in biological modeling and a guide for engineering evolvable synthetic circuits.
Animal behavior reflects interactions between the nervous system, body, and environment. Therefore, biomechanics and environmental context must be considered to understand algorithms for behavioral control. Computational models that embed artificial neural controllers within body models in simulated environments are a powerful tool for this purpose. Here, we review advances in biorealistic neuromechanical models while also highlighting emerging opportunities. We first show how these models enable inference of biophysical variables that are difficult to measure experimentally. Systematic perturbation of these models can also generate new, experimentally testable hypotheses. We then examine how neuromechanical models facilitate the exchange between neuroscience, robotics, and machine learning, and showcase their applications in healthcare. We envision that coupling experimental studies with active probing of their neuromechanical surrogates will significantly accelerate progress in neuroscience.
Summary: A bioinformatics tool is presented for estimating recent effective population size by using a neural network to automatically compute linkage disequilibrium-related features as a function of genomic distance between polymorphisms. The new method outperforms existing deep learning and summary statistic-based approaches using relatively few sequenced individuals and variant sites, making it particularly valuable for molecular ecology applications with sparse, unphased data. Availability and implementation: The program is available as an easily installable Python package with documentation here: this https URL. The open source code is available from: this https URL.
Absolute concentration robustness (ACR) means that the concentration of certain species stays the same in all steady states. In this work, we study how conservation laws might affect non-vacuous ACR in reaction networks. The goal is to show whether non-vacuous ACR can be preserved or precluded by adding species that depend on the existing species. We have the following two main results. (i) For networks with conservation laws, we prove a criterion: for a nondegenerate network, augmenting it with one new species that depends on the original species leads to the resulting network having no non-vacuous ACR in the new species for any generic choice of rate constants. (ii) We characterize all non-redundant zero-one networks with dimension of at most two that exhibit non-vacuous ACR according to the number of distinct rows in the stoichiometric matrices. An important finding is that if there are at least four distinct rows in the stoichiometric matrix, then the corresponding network has no non-vacuous ACR for any generic choice of rate constants, which implies that many conservation laws prevent non-vacuous ACR in non-redundant zero-one reaction networks.
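As a concrete illustration of ACR in the presence of a conservation law, consider a standard two-species textbook network that is not taken from this work: $A + B \to 2B$ with rate constant $k_1$ and $B \to A$ with rate constant $k_2$. The symbols $a$, $b$, $T$, $k_1$, $k_2$ are introduced here for illustration only, and mass-action kinetics is assumed.

```latex
% Illustrative network (an assumption, not from the paper):
%   A + B -> 2B  (rate k_1),   B -> A  (rate k_2)
\begin{align*}
\dot a &= -k_1 a b + k_2 b = b\,(k_2 - k_1 a), \\
\dot b &= k_1 a b - k_2 b = -\dot a, \\
a + b &= T \quad \text{(conservation law)}, \\
a^* &= \frac{k_2}{k_1}, \qquad b^* = T - \frac{k_2}{k_1}
      \quad \text{for } T > \frac{k_2}{k_1}.
\end{align*}
```

Every positive steady state has $a^* = k_2/k_1$ independently of the conserved total $T$, so species $A$ exhibits ACR. Positive steady states exist whenever $T > k_2/k_1$, so under the usual reading of "non-vacuous" (at least one positive steady state exists) the ACR here is non-vacuous. This network has a single conservation law; the results above concern how further dependencies among species can rule such behavior out.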
Scaling data and artificial neural networks has transformed AI, driving breakthroughs in language and vision. Whether similar principles apply to modeling brain activity remains unclear. Here we leveraged a dataset of 3.1 million neurons from the visual cortex of 73 mice across 323 sessions, totaling more than 150 billion neural tokens recorded during natural movies, images and parametric stimuli, and behavior. We train multi-modal, multi-task models that support three regimes flexibly at test time: neural prediction, behavioral decoding, neural forecasting, or any combination of the three. OmniMouse achieves state-of-the-art performance, outperforming specialized baselines across nearly all evaluation regimes. We find that performance scales reliably with more data, but gains from increasing model size saturate. This inverts the standard AI scaling story: in language and computer vision, massive datasets make parameter scaling the primary driver of progress, whereas in brain modeling -- even in the mouse visual cortex, a relatively simple system -- models remain data-limited despite vast recordings. The observation of systematic scaling raises the possibility of phase transitions in neural modeling, where larger and richer datasets might unlock qualitatively new capabilities, paralleling the emergent properties seen in large language models. Code available at this https URL.
Generating novel, biologically plausible three-dimensional morphological structures is a fundamental challenge in computational evolutionary biology, hampered by extreme data scarcity and the requirement that generated shapes respect phylogenetic relationships among species. In this work, we present PhyloSDF, a phylogenetically-conditioned neural generative model for 3D biological morphology that integrates two innovations: (1) a DeepSDF auto-decoder regularized by a novel Phylogenetic Consistency Loss that structures the latent space to correlate with evolutionary distances (Pearson r=0.993); (2) a Residual Conditional Flow Matching (Residual CFM) architecture that factorizes generation into analytic species-centroid lookup and learned residual prediction, enabling generation from as few as ~4 specimens per species. We evaluate PhyloSDF on 100 micro-CT-scanned skulls of Darwin's Finches and their relatives across 24 species. The model generates novel meshes achieving 88-129% of real intra-species variation at the code level, with all 180 generated meshes verified as non-memorized. Residual CFM surpasses denoising diffusion (which fails entirely at this scale), standard flow matching (which mode-collapses to 3-6% variation), and a Gaussian mixture baseline in both fidelity (Chamfer Distance 0.00181 vs. 0.00190) and morphometric Fréchet distance (10,641 vs. 13,322). Leave-one-species-out experiments across 18 species demonstrate phylogenetic extrapolation capability, and smooth latent interpolations produce biologically plausible ancestral skull reconstructions.
DNA-synthesis providers screen incoming orders by searching the requested sequence against curated hazard lists. We show that this baseline collapses to a 100% false-flag rate when the hazardous sequence comes from a taxonomic family absent from the reference set: under Conformal Risk Control's certified miss-rate constraint, a low-discrimination signal forces the threshold below the entire test-benign mass. We compose three signals derived from a synthesis order's public annotation: $k$-mer Jaccard similarity to known toxins, the trimmed-mean score of a five-LLM judge panel, and cosine similarity to clustered embedding centroids. Fused under a monotone logistic aggregator and calibrated by Conformal Risk Control, the resulting screener certifies $\mathbb{E}[\mathrm{FNR}] \le \alpha + \mathrm{TV}$, where the additive term is the calibration-to-test distribution shift under family holdout (a certified ceiling of 24-49% across folds). Across ten leave-one-taxonomic-family-out folds at $\alpha=0.05$ on UniProt KW-0800 reviewed toxins, the calibrated screener achieves 0% empirical test miss rate on every fold and 0% test false-flag rate on nine of ten folds. The bound's finite-sample slack $1/(n_{\mathrm{cal}}+1)$ caps the certifiable miss rate at 1.77% on our 200-hazard subsample; reaching procurement-grade $\alpha=10^{-3}$ requires an $18\times$ larger calibration set, which the full reviewed UniProt KW-0800 corpus is large enough to deliver. The binding constraint on certifiable DNA-synthesis screening is calibration data, not algorithms. Code: this https URL
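Two quantities in this abstract are easy to make concrete: the $k$-mer Jaccard similarity used as one screening signal, and the finite-sample slack $1/(n_{\mathrm{cal}}+1)$. The following is a minimal sketch under stated assumptions: the function names `kmers`, `kmer_jaccard`, and `crc_slack`, the default `k=3`, and the upper-casing of sequences are all hypothetical choices, not the authors' implementation.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of a sequence (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a, b, k=3):
    """Jaccard similarity |A & B| / |A | B| between the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        # Both sequences are shorter than k: define the similarity as 0.
        return 0.0
    return len(ka & kb) / len(ka | kb)


def crc_slack(n_cal):
    """Finite-sample slack 1 / (n_cal + 1) quoted in the abstract."""
    return 1.0 / (n_cal + 1)
```

For example, `kmer_jaccard("ABCD", "BCDE", k=2)` compares the sets {AB, BC, CD} and {BC, CD, DE}, which share two of four distinct 2-mers. The slack function makes the data requirement explicit: the slack falls to $10^{-3}$ only once the calibration set holds about a thousand examples.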
Phylogenetic networks provide a general framework for modeling reticulate evolutionary processes such as hybridization, recombination, and horizontal gene transfer. In this paper, we study the asymptotic counting of binary phylogenetic networks with $k$ reticulations on $n$ taxa, where $k$ is allowed to grow with $n$. Using edge insertion, we analyze the local structures that affect the number of possible constructions of such networks. By bounding the contribution of networks with exceptional local configurations and combining these bounds with known asymptotic formulas for tree-child networks, we show that, when $k=o(\sqrt n)$, the number of binary phylogenetic networks with $k$ reticulations on $n$ taxa is asymptotic to \[ \binom{n}{k}2^{n+k-1/2}n^{n+k-1}e^{-n}. \]
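For a numerical feel of the expression above, a small sketch can evaluate its logarithm. It only computes the stated formula and says nothing about the range $k=o(\sqrt n)$ in which the asymptotic is claimed to hold. As a sanity check, at $k=0$ the expression reduces to the Stirling approximation of $(2n-3)!!$, the number of rooted binary phylogenetic trees on $n$ taxa. Both function names are hypothetical.

```python
import math


def log_asymptotic_count(n, k):
    """Natural log of C(n, k) * 2^(n+k-1/2) * n^(n+k-1) * e^(-n)."""
    log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_binom + (n + k - 0.5) * math.log(2) + (n + k - 1) * math.log(n) - n


def log_rooted_binary_trees(n):
    """Natural log of (2n-3)!! = (2n-2)! / (2^(n-1) * (n-1)!), for n >= 2."""
    return math.lgamma(2 * n - 1) - (n - 1) * math.log(2) - math.lgamma(n)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        gap = log_asymptotic_count(n, 0) - log_rooted_binary_trees(n)
        print(n, gap)  # the gap shrinks toward 0 as n grows
```

Working in log space avoids overflow, since the counts themselves exceed floating-point range for moderate $n$.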
Biological neural systems achieve remarkable robustness and adaptability in uncertain environments through sparse, event-driven spike-based information processing and adaptive regulation. Inspired by this paradigm, this paper develops a neuromorphic disturbance observer (NDO) and control framework that replaces conventional continuous-time signal representations with spike-timing encoding. Both disturbance estimates and control inputs are constructed via integrate-and-fire (IF) neuron dynamics from discrete spike events, yielding intrinsically event-driven updates. An adaptive-threshold triggering mechanism, inspired by spike-frequency adaptation (SFA), enables history-dependent regulation of spike generation. Simulation results demonstrate that the proposed framework achieves neurally inspired robustness and adaptability, while the adaptive-threshold spiking scheme reduces spike events to 42.6% of the fixed-threshold case under noisy conditions.
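The idea of an adaptive firing threshold can be illustrated with a generic integrate-and-fire encoder. This is a minimal sketch under assumed dynamics, not the paper's NDO: the input is integrated, a signed spike is emitted when the integral reaches the threshold, the integrator resets to zero, and each spike raises the threshold by `adapt` before it decays back to `theta0` with time constant `tau`. All names and parameter values are hypothetical.

```python
import math


def if_encode(signal, dt, theta0, adapt=0.0, tau=1.0):
    """Integrate-and-fire encoder with optional threshold adaptation.

    Returns a list of (step, polarity) spike events.
    adapt = 0.0 gives the fixed-threshold case.
    """
    v = 0.0                 # integrator state
    theta = theta0          # current firing threshold
    decay = math.exp(-dt / tau)
    spikes = []
    for step, x in enumerate(signal):
        v += x * dt
        # The threshold relaxes exponentially toward its baseline.
        theta = theta0 + (theta - theta0) * decay
        if abs(v) >= theta:
            spikes.append((step, 1 if v > 0 else -1))
            v = 0.0         # reset after the spike
            theta += adapt  # recent spiking raises the threshold
    return spikes


if __name__ == "__main__":
    signal = [1.0] * 1000
    fixed = if_encode(signal, dt=0.01, theta0=0.1)
    adaptive = if_encode(signal, dt=0.01, theta0=0.1, adapt=0.05, tau=1.0)
    print(len(fixed), len(adaptive))
```

With a constant input the adaptive variant fires less often than the fixed-threshold one, because every spike temporarily raises the level the integrator must reach.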
N6-methyladenosine (m6A) regulates mRNA fate through site-specific methylation, reader recognition and downstream effects on RNA stability and decay. However, current computational approaches focus mainly on site prediction, leaving unresolved the broader challenge of inferring m6A regulatory context and function from epitranscriptomic profiles. Here we present m6A-FORM, an m6A-focused foundation model for regulatory discovery. Pretrained on 24.9 million RNA sequence windows from 22.5 million MeRIP-seq regions across 143 human studies, m6A-FORM learns reusable representations of m6A-associated transcript contexts. We adapt this encoder to single-nucleotide m6A discovery, regulator-binding prediction, YTHDF2-associated decay prediction and tissue-scale epitranscriptomic mapping. m6A-FORM predicts binding of 19 m6A readers, writers and erasers and identifies sequence and RBP-context features associated with YTHDF2-mediated RNA degradation. Applied to 67 datasets from 24 human tissues, it identifies tissue-conserved m6A sites linked to stronger methylation, reader binding, RBP occupancy and decay propensity.
Predicting signed interactions in biological networks is crucial for understanding drug mechanisms and facilitating drug repurposing. While deep graph models have demonstrated success in modeling complex biological systems, existing approaches often fail to distinguish between positive and negative interactions, limiting their utility for precise pharmacological predictions. In this study, we propose a novel deep graph model, PAMR (polarity-aware multi-relational model), designed to predict both polar (e.g., activation, inhibition) and non-polar (e.g., binding, affect) chemical-gene interactions. Our model integrates graph convolutional networks with tensor decomposition to enhance feature representation and incorporates a conflict-aware sampling strategy to resolve polarity ambiguities. We introduce new evaluation metrics, polarity discrimination score (PDS) and CP@100, to assess the model's ability to differentiate interaction types. Experimental results demonstrate that PAMR outperforms baseline models, achieving superior classification accuracy and improved discrimination of polar edges. Specifically, PAMR-CL attains a Macro AUROC of 0.9072 and CP@100 of 0.974, surpassing RGCN, GraphSAGE, TransE, and BioNet baselines. A case study on nicotine further identifies two novel chemical-gene suppression links, involving S100A6 and SPP1, that are corroborated by independent experimental literature. Furthermore, we analyze the impact of subgraph components on predictive performance, revealing that additional network structures do not always enhance accuracy. These findings highlight the importance of polarity-aware modeling in drug discovery and network pharmacology and provide a scalable computational framework for chemical-gene interaction prediction.
Representations from large language models are highly effective at predicting BOLD fMRI responses to language stimuli. However, these representations are largely opaque: it is unclear what features of the language stimulus drive the response in each brain area. We present generative causal testing (GCT), a framework for generating concise explanations of language selectivity in the brain from predictive models and then testing those explanations in follow-up experiments using LLM-generated stimuli. This approach is successful at explaining selectivity both in individual voxels and cortical regions of interest (ROIs), including newly identified microROIs in prefrontal cortex. We show that explanatory accuracy is closely related to the predictive power and stability of the underlying predictive models. Finally, we show that GCT can dissect fine-grained differences between brain areas with similar functional selectivity. These results demonstrate that LLMs can be used to bridge the widening gap between data-driven models and formal scientific theories.
We develop a generating-function calculus for equilibrium combinatorial self-assembly. Starting from a bond-level specification of allowable interactions, we define a symmetry-weighted species generating function whose evaluation yields equilibrium concentrations, and an ensemble generating function (an exponential transform) that packages equilibrium probabilities of mixtures. We ground these objects in both deterministic (coagulation-fragmentation) and stochastic (master equation) dynamics, showing how detailed balance leads to the equilibrium expressions and how the exponential generating function arises as a partition function. We develop a formal power series calculus -- derivatives, integrals, exponentials, composition -- where each operation acquires a precise combinatorial interpretation. The paper is organized around two regimes: cycle-free assembly, where binding equations for the species generating function are nonlinear and the ensemble equation couples to the species generating function; and assembly with cycles, where a cycle-opening term enters the species equation and the exponential transform linearizes the ensemble equation into a closed linear PDE with operator-exponential solutions. Each regime is developed with a linear polymer worked example in which we compute equilibrium concentrations, extract canonical partition functions, and derive canonical factorial moments. A cross-linked polymer example -- combining heterotypic chain bonds with homotypic cross-links -- illustrates both regimes together, yielding a factorized canonical partition function and an explicit gelation surface.
Background: Antimicrobial resistance (AMR) is a global health threat. While the WHO Global Antimicrobial Resistance and Use Surveillance System (GLASS) provides standardized data, population-level machine learning forecasting of resistance trends remains limited. Translating computational forecasts into policy requires transparent interpretation mechanisms. Methods: Surveillance data (2021-2023) comprising 5,909 observations across 44 countries and five WHO regions were processed. A rigorous temporal split prevented data leakage. Six models (Naive, Linear, Ridge, XGBoost, LightGBM, LSTM) were benchmarked to forecast one-year-ahead resistance rates using features including prior-year resistance and antibiotic consumption. Evaluation metrics (MAE, RMSE, sMAPE) were computed, with 95% bootstrap confidence intervals for MAE. A local Retrieval-Augmented Generation (RAG) system utilizing Gemma 4 was implemented to translate forecast findings into policy guidance grounded in retrieved WHO documents. Results: XGBoost achieved the best performance (test MAE = 6.13% [95% CI: 5.83-6.44]), an 85.3% error reduction versus the naive baseline (MAE = 41.79%). SHAP analysis identified prior-year resistance as the dominant predictor (50.5% gain), confirming strong autoregressive behavior. Regional forecast error tracked closely with surveillance coverage, ranging from 3.65% in the European Region to 8.61% in South-East Asia. The RAG pipeline generated accurate, source-attributed policy responses without fabricated citations. Conclusion: Short-term resistance rates exhibit strong temporal autocorrelation and can be forecast accurately with gradient boosting. Coupling these forecasts with a hallucination-resistant RAG system provides a scalable, evidence-based decision-support framework for AMR governance.
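The evaluation metrics named above can be sketched as follows. The abstract does not specify the sMAPE variant or the bootstrap scheme, so this sketch assumes the common symmetric form $200\,|y-\hat y|/(|y|+|\hat y|)$ and a percentile bootstrap over observations. All function names are hypothetical, and the only figures reused from the abstract are the two reported MAE values.

```python
import numpy as np


def _as_float(y, yhat):
    return np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)


def mae(y, yhat):
    y, yhat = _as_float(y, yhat)
    return float(np.mean(np.abs(y - yhat)))


def rmse(y, yhat):
    y, yhat = _as_float(y, yhat)
    return float(np.sqrt(np.mean((y - yhat) ** 2)))


def smape(y, yhat):
    """Symmetric MAPE in percent; pairs with y = yhat = 0 contribute 0."""
    y, yhat = _as_float(y, yhat)
    denom = np.abs(y) + np.abs(yhat)
    ratio = np.divide(2.0 * np.abs(y - yhat), denom,
                      out=np.zeros_like(denom), where=denom > 0)
    return float(100.0 * np.mean(ratio))


def bootstrap_mae_ci(y, yhat, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for MAE (assumed scheme)."""
    y, yhat = _as_float(y, yhat)
    err = np.abs(y - yhat)
    rng = np.random.default_rng(seed)
    # Resample observations with replacement, n_boot times.
    idx = rng.integers(0, err.size, size=(n_boot, err.size))
    means = err[idx].mean(axis=1)
    tail = 100.0 * (1.0 - level) / 2.0
    lo, hi = np.percentile(means, [tail, 100.0 - tail])
    return float(lo), float(hi)


def error_reduction(model_mae, baseline_mae):
    """Fractional MAE reduction relative to a baseline."""
    return 1.0 - model_mae / baseline_mae


if __name__ == "__main__":
    # Reported MAEs: 6.13 (XGBoost) versus 41.79 (naive baseline).
    print(round(100 * error_reduction(6.13, 41.79), 1))  # 85.3
```

The final line reproduces the reported 85.3% reduction from the two MAE values. Resampling individual observations treats them as independent, which may understate uncertainty if errors cluster by country.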
High-grade gliomas integrate into neural circuits through functional synapses with neurons, raising the question of which noncoding elements shape synaptogenic gene expression in tumor cells. The regulatory program written across the dark genome, what we call the $\textit{dark regulome}$, is the natural substrate to probe, and sequence foundation models offer a zero-shot route through in-silico mutagenesis (ISM); yet likelihood-based scoring is tautologically coupled to local sequence predictability, leaving the regulatory interpretation underdetermined. Across three architecturally distinct foundation models (Caduceus-Ph, HyenaDNA, Enformer) and 30,448 dark genome elements at 92 glioma-relevant loci, we introduce a residualization-and-permutation diagnostic that separates predictability-driven from regulation-driven RIS variance. A sharp 10kb proximal-regulatory horizon survives every control we apply, but the LM-derived element-class hierarchy does not: a six-feature linear baseline matches Caduceus top-decile membership at AUC $= 0.985$. Cross-architecture decomposition cleanly separates a sequence-predictability layer (the two language models co-rank long well-predicted transposable elements) from a regulatory-output layer (Enformer alone retains residual cCRE-discriminative signal), with literally zero overlap between the two top-100 lists. Conservation, brain cis-eQTL, and STRING-PPI cross-checks then anchor what biology survives: top-100 elements across all three models are $3.3\times$ enriched per model for matching brain eQTLs ($p_\mathrm{emp} < 5\times 10^{-3}$), while a tempting transposable-element regulatory layer and a striking NRXN1+NLGN1 protein-pair convergence both fail proper permutation tests once those tests are constructed. We deliver the diagnostic as a general methodological tool for any ISM-based regulatory study.