This study establishes a benchmark for Caenorhabditis elegans neuron classification, comparing four graph methods (GCN, GraphSAGE, GAT, GraphTransformer) against four non-graph methods (Logistic Regression, MLP, LOLCAT, NeuPRINT). Using the functional connectome, we classified Sensory, Interneuron, and Motor neurons based on Spatial, Connection, and Neuronal Activity features. Results show that attention-based GNNs significantly outperform baselines on the Spatial and Connection features. The Neuronal Activity features yielded poor performance, likely due to the low temporal resolution of the underlying neuronal activity data. Our benchmark validates the use of GNNs and highlights that Spatial and Connection features are key predictors for Caenorhabditis elegans neuron classes. Code is available at: this https URL.
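As a concrete illustration of the simplest graph method benchmarked above, here is a minimal two-layer GCN classifier sketch using PyTorch Geometric; the feature dimension and hidden width are hypothetical placeholders, not the paper's configuration.

```python
# Minimal sketch: two-layer GCN for 3-class neuron classification on a
# connectome graph. Dimensions (in_dim, hidden) are illustrative only.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class NeuronGCN(torch.nn.Module):
    def __init__(self, in_dim=32, hidden=64, n_classes=3):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, n_classes)

    def forward(self, x, edge_index):
        # x: [n_neurons, in_dim] Spatial/Connection/Activity features
        # edge_index: [2, n_edges] functional-connectome edges
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)  # logits: Sensory/Inter/Motor
```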
Precision oncology is currently limited by the small-N, large-P paradox, where high-dimensional genomic data is abundant, but high-quality drug response samples are often sparse. While deep learning models achieve high predictive accuracy, they remain black boxes that fail to provide the causal mechanisms required for clinical decision-making. We present a Neuro-Symbolic Agentic Framework that bridges this gap by integrating a quantitative machine learning World Model with an LLM-based agentic reasoning layer. Our system utilises a forensic data pipeline built on the Sanger GDSC dataset (N=83), achieving a robust predictive correlation (r=0.504) and a significant performance gain through the explicit modelling of clinical context, specifically Microsatellite Instability (MSI) status. We introduce the concept of Inverse Reasoning, where the agentic layer performs in silico CRISPR perturbations to predict how specific genomic edits, such as APC or TP53 repair, alter drug sensitivity. By distinguishing between therapeutic opportunity and contextual resistance, and validating these findings against human clinical data (p=0.023), our framework provides a transparent, biologically grounded path towards explainable AI in cancer research.
RNA design aims to identify RNA sequences that fold into a target secondary structure. This task is challenging in terms of computational efficiency. Most existing methods focus on either minimum free energy (MFE)-based or ensemble-based metrics, leaving a gap for a unified approach that performs well across both. We introduce a fast and versatile RNA design algorithm inspired by our previous work on the undesignability of RNA structures and motifs (i.e., sets of contiguous structural loops). Our approach decomposes a target structure into a tree of sub-targets, where each leaf node corresponds to a motif and each internal node corresponds to a substructure. We first design partial sequences for each motif; these partial sequences are then selectively and recursively combined via the cube pruning strategy borrowed from computational linguistics, enabling effective optimization of ensemble-based metrics. Finally, a novel whole-structure rival search further refines sequences to suppress misfolded alternatives and enhance MFE-based performance. Our method is highly efficient and also achieves state-of-the-art results on native RNAsolo structures and the Eterna100 benchmark, excelling in both ensemble- and MFE-based metrics. Additionally, it substantially improves design on a long-structure benchmark derived from 16S rRNA, increasing average folding probability from 0.18 to 0.39 with an order-of-magnitude speedup, demonstrating its effectiveness and scalability. Availability: Source code and data are available at: this https URL.
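The cube pruning step borrowed from computational linguistics can be sketched generically: given score-sorted candidate lists for two sibling sub-targets, a priority queue explores only the most promising combinations rather than enumerating all pairs. The additive toy objective below is a placeholder for the paper's actual ensemble-based metrics.

```python
# Generic cube-pruning sketch: keep the top-k combinations of two
# best-first candidate lists without full enumeration.
import heapq

def cube_prune(left, right, k, score):
    heap = [(-score(left[0], right[0]), 0, 0)]
    seen, out = {(0, 0)}, []
    while heap and len(out) < k:
        neg, i, j = heapq.heappop(heap)
        out.append((left[i][0] + right[j][0], -neg))
        for di, dj in ((1, 0), (0, 1)):  # expand frontier of the "cube"
            ni, nj = i + di, j + dj
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-score(left[ni], right[nj]), ni, nj))
    return out

# toy usage with an additive placeholder objective (higher is better)
L = [("GC", -1.0), ("AU", -2.5)]
R = [("CG", -1.2), ("UA", -2.0)]
print(cube_prune(L, R, 3, lambda a, b: a[1] + b[1]))
```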
In order to develop association studies and to improve the phenotypic description of abiotic and biotic stress-related traits, nested core collections of 48, 96, 144 and 384 sunflower lines were built from a set of 752 diverse, public or private accessions. These 752 lines were genotyped with 51 SSR markers covering the genetic map (3 markers/linkage group). We then used MSTRAT software (Gouesnard et al., 2000) to construct 4 nested core collections as follows: we built a first core of 48 public lines with a kernel of 12 selected entries, accounting for 47% of the total diversity. This short core collection was then used as a kernel to define a second core of 96 public entries, accounting for 59% of the total diversity. Finally, the private entries were added to build core collections of 144 and 384 entries, accounting for 78% and 100% of the total diversity, respectively.
Single-cell sequencing technologies reveal cellular heterogeneity at high resolution, advancing our understanding of biological complexity. As datasets start to scale to tens of millions of cells, computational workflows face substantial bottlenecks, with CPU-based analytical pipelines requiring hours or days for routine processing steps like filtering, normalization, and clustering. These scalability limitations fundamentally restrict interactive data exploration and iterative hypothesis testing. Here we introduce rapids-singlecell, a GPU-accelerated framework that integrates natively with the scverse ecosystem and operates directly on the AnnData data structure, delivering orders-of-magnitude speedups for single-cell workflows. Built on CuPy arrays and the NVIDIA CUDA-X Data Science (RAPIDS) ecosystem, rapids-singlecell provides near drop-in GPU replacements for core scanpy-based analysis steps. Across standard single-cell workflows such as preprocessing, dimensionality reduction, neighborhood graph construction, clustering, and batch correction, rapids-singlecell achieves speedups of up to several hundred-fold compared to optimized CPU baselines. This reduces analysis time from hours to minutes on standard hardware, while maintaining consistent biological interpretations. These performance improvements make it possible to analyze large datasets in close to real time, without the need for data splitting. Together with real-time parameter tuning and iterative workflows, rapids-singlecell makes interactive large-scale single-cell analysis possible.
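A minimal usage sketch, assuming the package mirrors the scanpy API as the "near drop-in replacements" above suggest; function names and the input file are illustrative and should be checked against the package documentation.

```python
# Illustrative workflow sketch only; API assumed to mirror scanpy.
import scanpy as sc
import rapids_singlecell as rsc

adata = sc.read_h5ad("cells.h5ad")        # hypothetical input file
rsc.get.anndata_to_GPU(adata)             # move matrices onto the GPU

rsc.pp.normalize_total(adata, target_sum=1e4)
rsc.pp.log1p(adata)
rsc.pp.highly_variable_genes(adata, n_top_genes=2000)
rsc.pp.pca(adata, n_comps=50)
rsc.pp.neighbors(adata, n_neighbors=15)   # neighborhood graph on GPU
rsc.tl.leiden(adata)                      # GPU-accelerated clustering
rsc.tl.umap(adata)
```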
Species growing in environments that change in time and space will vary in their abundance across locations, even in the absence of persistent location preferences. Here we quantify this non-equilibrium effect by studying a minimal model of a spatially compartmentalised community with time-averaged-neutral competition but location-dependent environmental fluctuations. We analytically derive distributions of two-point inequality, defined as the log-ratio of a species' abundance across a pair of locations. We characterise how the balance of relaxation via migration and fluctuation strength determine the bulk and extreme value statistics of these distributions in the two-patch and infinite-patch cases. We demonstrate the existence of a noise-induced transition to bimodal inequality, which depends on the correlation timescale of the environmental fluctuations. Finally, we discuss the evolutionary benefit of finite migration rates in environments with temporal correlations.
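A minimal two-patch simulation conveys the quantity being studied: one species' log-abundance in each patch is driven by patch-specific Ornstein-Uhlenbeck environmental noise and relaxed by migration, and the two-point inequality is the log-ratio across patches. Competition is omitted and all parameters are illustrative, not the paper's.

```python
# Toy two-patch sketch of the two-point inequality log(N1/N2).
import numpy as np

rng = np.random.default_rng(0)
dt, T, m, sigma, tau = 0.01, 200.0, 0.5, 1.0, 1.0
n = int(T / dt)
x = np.zeros((n, 2))    # log-abundances of one species in patches 1 and 2
eta = np.zeros(2)       # patch-specific Ornstein-Uhlenbeck environment

for t in range(1, n):
    eta += -eta / tau * dt + sigma * np.sqrt(2 * dt / tau) * rng.standard_normal(2)
    N = np.exp(x[t - 1])
    dN = eta * N * dt + m * (N[::-1] - N) * dt   # growth + migration exchange
    x[t] = np.log(np.maximum(N + dN, 1e-12))

ineq = x[n // 2:, 0] - x[n // 2:, 1]   # two-point inequality: log(N1/N2)
print(ineq.std())
```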
Understanding how decision making changes across the lifespan is a central challenge for neuroscience, yet research on cognitive aging has remained largely disconnected from the theoretical and computational advances that now shape modern systems neuroscience. Over the past two decades, theoretical frameworks have transformed how we study cognition in young, healthy brains, providing principled tools to model latent decision states, neural dynamics, population codes, and interareal communication. In contrast, aging research has often relied on single-metric behavioral readouts, cross-sectional comparisons, and descriptive neural analyses, limiting our ability to explain fundamental differences in individual aging trajectories. This gap represents a missed opportunity because aging offers a powerful platform for testing theories of neural computation, stability, and flexibility under changing biological constraints. Here, we argue that closer integration between aging research and contemporary theoretical neuroscience can move the field beyond descriptive accounts toward more mechanistic explanations of decision making across the lifespan. To this end, we outline how recent advances in behavioral quantification, latent state modeling, dynamical systems, encoding models, representational geometry, and recurrent neural networks offer a rich theoretical toolkit for neuroscientists studying decision making across the lifespan.
The Reduction Principle states that, near a stable equilibrium under fixed viability selection, a selectively neutral modifier allele that reduces recombination rate among selected loci is favored, whereas one that increases recombination rate is eliminated. This result assumes constant transmission parameters across generations, so that invasion is determined by the dominant eigenvalue of a single transmission-selection matrix. Here we analyze a minimal departure from this framework. In a diploid model, two loci experience symmetric multiplicative viability selection and a third, neutral locus modifies their recombination rate. All parameters are fixed except that recombination in modifier heterozygotes varies randomly across generations according to a stochastic process. When the recombination rate in modifier heterozygotes is constant, the Reduction Principle holds exactly: invasion occurs if the rare modifier allele reduces recombination relative to the resident rate. When recombination varies randomly across generations, invasion is governed by the top Lyapunov exponent of a product of random matrices. We show that temporal variation in recombination rate alone, in the absence of fluctuating viability selection, can reverse the direction of selection on the modifier locus predicted by the deterministic model. The mean recombination rate is insufficient to determine invasion of $M_2$; instead, outcomes depend on the full distribution of recombination rates and their ordered accumulation across generations. Parameters that affect only the magnitude of selection under constant transmission - including resident recombination, selection strength, and background linkage - can alter its sign under stochastic transmission. These results demonstrate that temporal variability in transmission constitutes an independent and qualitatively distinct force in the evolution of recombination rates.
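The invasion criterion under stochastic transmission is the top Lyapunov exponent of a random matrix product, which can be estimated numerically by the standard normalize-and-accumulate scheme. The 2x2 matrices below are generic nonnegative placeholders, not the model's actual transmission-selection matrices.

```python
# Numerical sketch: top Lyapunov exponent of a random matrix product.
import numpy as np

rng = np.random.default_rng(1)

def random_matrix():
    r = rng.choice([0.1, 0.4])          # e.g. a randomly drawn recombination rate
    return np.array([[1 - r, 0.3], [r, 0.8]])

v = np.ones(2) / np.sqrt(2)
log_growth, steps = 0.0, 100_000
for _ in range(steps):
    v = random_matrix() @ v
    norm = np.linalg.norm(v)
    log_growth += np.log(norm)          # accumulate log growth, then renormalize
    v /= norm

print("top Lyapunov exponent ~", log_growth / steps)  # invasion if > 0
```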
Growth and decay are system-level properties of chemical reaction networks (CRNs) relevant from prebiotic chemistry to cellular metabolism. Their properties are typically analyzed through the kinetics of particular models, which requires specification of the full set of kinetic laws and parameters. In this work, we derive stoichiometry-based constraints on the growth (or shrinkage) rate in the balanced-growth regime of scalable CRNs. The resulting bounds are controlled by a topological quantity, the maximum amplification factor, defined via a von Neumann max-min problem over feasible fluxes, as illustrated by numerical tests on random-network ensembles of CRNs. We argue for the relevance of our results both in the context of origin-of-life studies and for the design of synthetic chemical reaction networks.
Stochastic modeling of movement behavior provides a valuable way to understand how complex motion can be generated from relatively simple building blocks. Ants demonstrate sophisticated social behavior ranging from foraging to nest relocation; while emphasis is often placed on the communication methods used to synchronize individuals, the movement paradigms of those individuals are of paramount importance. Here, we apply a stochastic modeling approach to better understand the movement of isolated long-legged ant (A. gracilipes) specimens, informed by extensive laboratory tracking experiments. We find that a combination of active Brownian and run-and-tumble models reproduces the trajectory statistics observed in experiments, both qualitatively and quantitatively. We identify reproducible probability distributions for the turn angles, run times, and waiting times across specimens, and find good agreement between analytical predictions and quantities empirically measured from the trajectories. Having such a model allows for better understanding and prediction of movement ecology from both simulations and analytics, and can even give insight into the underlying generative mechanisms of motion and the ants' sensory systems.
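A minimal hybrid of the two model classes can be simulated directly: continuous rotational diffusion (active Brownian) plus Poissonian large reorientations (run-and-tumble). The parameters and the uniform turn-angle distribution are illustrative, not the values fitted to the tracking data.

```python
# Toy 2D walker combining active Brownian and run-and-tumble motion.
import numpy as np

rng = np.random.default_rng(2)
dt, v, D_theta, tumble_rate = 0.01, 1.0, 0.1, 0.5
pos, theta = np.zeros(2), 0.0
traj = [pos.copy()]

for _ in range(10_000):
    # active Brownian component: rotational diffusion of the heading
    theta += np.sqrt(2 * D_theta * dt) * rng.standard_normal()
    # run-and-tumble component: occasional large reorientation
    if rng.random() < tumble_rate * dt:
        theta += rng.uniform(-np.pi, np.pi)
    pos = pos + v * dt * np.array([np.cos(theta), np.sin(theta)])
    traj.append(pos.copy())

traj = np.array(traj)
print(np.linalg.norm(traj[-1] - traj[0]))   # net displacement over the run
```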
The interplay between tumor cells and macrophages plays a central regulatory role in cancer progression. In this study, we developed a mathematical model that incorporates tumor cells, M1-type macrophages, M2-type macrophages and an M3-type macrophage population characterized by dual phenotypic features. First, we analyzed the fundamental mathematical properties of the model and derived the conditions under which the system attains a tumor-free stable state or a coexistence state of tumor and immune cells. Second, global sensitivity analysis revealed that key parameters governing macrophage polarization and intercellular communication vary dynamically during tumor development. Bifurcation analysis further identified the polarization rate $\kappa$ of M1-type macrophages and the baseline level $M_0$ of resting macrophages as critical determinants of the system's dynamical behavior. Notably, using approximate Bayesian computation for parameter inference and dynamic simulations, the model successfully recapitulated the evolutionary trajectories of eight tumor samples. The results demonstrate that lower tumor burden is significantly associated with higher M1-type macrophage infiltration and delayed peak time of M3-type macrophage activation. Moreover, survival analysis indicated that both enhanced M1-type macrophage infiltration and delayed peak time of M3-type macrophage activation are correlated with longer survival time. In summary, this study not only provides a theoretical framework for understanding the dynamic mechanisms underlying tumor-macrophage interactions but also proposes two potential clinical prognostic markers: the level of M1-type macrophage infiltration and the peak time of M3-type macrophage activation.
Background: Single-cell foundation models such as Geneformer and scGPT encode rich biological information, but whether this includes causal regulatory logic rather than statistical co-expression remains unclear. Sparse autoencoders (SAEs) can resolve superposition in neural networks by decomposing dense activations into interpretable features, yet they have not been systematically applied to biological foundation models. Results: We trained TopK SAEs on residual stream activations from all layers of Geneformer V2-316M (18 layers, d=1152) and scGPT whole-human (12 layers, d=512), producing atlases of 82525 and 24527 features, respectively. Both atlases confirm massive superposition, with 99.8 percent of features invisible to SVD. Systematic characterization reveals rich biological organization: 29 to 59 percent of features annotate to Gene Ontology, KEGG, Reactome, STRING, or TRRUST, with U-shaped layer profiles reflecting hierarchical abstraction. Features organize into co-activation modules (141 in Geneformer, 76 in scGPT), exhibit causal specificity (median 2.36x), and form cross-layer information highways (63 to 99.8 percent). When tested against genome-scale CRISPRi perturbation data, only 3 of 48 transcription factors (6.2 percent) show regulatory-target-specific feature responses. A multi-tissue control yields marginal improvement (10.4 percent, 5 of 48 TFs), establishing model representations as the bottleneck. Conclusions: These models have internalized organized biological knowledge, including pathway membership, protein interactions, functional modules, and hierarchical abstraction, yet they encode minimal causal regulatory logic. We release both feature atlases as interactive web platforms enabling exploration of more than 107000 features across 30 layers of two leading single-cell foundation models.
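The core TopK SAE architecture described above is compact enough to sketch: dense residual-stream activations are decoded through a wider dictionary in which only the k largest latents stay active. The dictionary width and k below are hypothetical; d_model=1152 matches the Geneformer residual width quoted in the abstract.

```python
# Minimal TopK sparse autoencoder sketch for residual-stream activations.
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model=1152, d_dict=16384, k=64):
        super().__init__()
        self.k = k
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x):
        z = self.enc(x)
        # keep only the top-k latents per example; zero the rest
        vals, idx = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, idx, vals)
        return self.dec(z_sparse), z_sparse

sae = TopKSAE()
x = torch.randn(8, 1152)                # stand-in residual activations
x_hat, z = sae(x)
loss = ((x - x_hat) ** 2).mean()        # reconstruction objective
```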
Symbolic sequences such as written language and genomic DNA display characteristic frequency distributions and long-range correlations extending over many symbols. In language, this takes the form of Zipf's law for word frequencies together with persistent correlations spanning hundreds or thousands of tokens, while in DNA it is reflected in nucleotide composition and long-memory walks under purine-pyrimidine mappings. Existing surrogate models usually preserve either the frequency distribution or the correlation properties, but not both simultaneously. We introduce a surrogate model that retains both constraints: it preserves the empirical symbol frequencies of the original sequence and reproduces its long-range correlation structure, quantified by the detrended fluctuation analysis (DFA) exponent. Our method generates surrogates of symbolic sequences by mapping fractional Gaussian noise (FGN) onto the empirical histogram through a frequency-preserving assignment. The resulting surrogates match the original in first-order statistics and long-range scaling while randomising short-range dependencies. We validate the model on representative texts in English and Latin, and illustrate its broader applicability with genomic DNA, showing that base composition and DFA scaling are reproduced. This approach provides a principled tool for disentangling structural features of symbolic systems and for testing hypotheses on the origin of scaling laws and memory effects across language, DNA, and other symbolic domains.
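The frequency-preserving assignment at the heart of the method can be sketched as a rank mapping: the long-range-correlated Gaussian series is converted to ranks, and rank blocks are assigned to symbols in proportion to their empirical counts. FGN generation itself (e.g. via Davies-Harte) is assumed to come from an external generator; plain white noise stands in below.

```python
# Frequency-preserving assignment: map a correlated series onto symbols
# so empirical symbol counts are preserved exactly.
import numpy as np
from collections import Counter

def surrogate(symbols, fgn):
    counts = Counter(symbols)
    order = np.argsort(np.argsort(fgn))        # rank of each FGN value
    out = np.empty(len(symbols), dtype=object)
    start = 0
    for sym, c in counts.most_common():        # carve rank axis into blocks
        out[(order >= start) & (order < start + c)] = sym
        start += c
    return out.tolist()

# toy check: the surrogate's symbol counts match the original exactly
seq = list("abracadabra")
noise = np.random.default_rng(3).standard_normal(len(seq))  # stand-in for FGN
print(Counter(surrogate(seq, noise)) == Counter(seq))        # True
```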
As artificial agents become increasingly capable, what internal structure is *necessary* for an agent to act competently under uncertainty? Classical results show that optimal control can be *implemented* using belief states or world models, but not that such representations are required. We prove quantitative "selection theorems" showing that low *average-case regret* on structured families of action-conditioned prediction tasks forces an agent to implement a predictive, structured internal state. Our results cover stochastic policies, partial observability, and evaluation under task distributions, without assuming optimality, determinism, or access to an explicit model. Technically, we reduce predictive modeling to binary "betting" decisions and show that regret bounds limit probability mass on suboptimal bets, enforcing the predictive distinctions needed to separate high-margin outcomes. In fully observed settings, this yields approximate recovery of the interventional transition kernel; under partial observability, it implies necessity of belief-like memory and predictive state, addressing an open question in prior world-model recovery work.
Huntington's disease (HD) is caused by CAG-repeat expansion in HTT, which lengthens the polyglutamine (polyQ) tract in huntingtin (HTT) and promotes misfolding and aggregation. While polyQ-length-dependent aggregation is well established, the atomistic conformational dynamics preceding aggregation remain less defined. Here we perform all-atom molecular dynamics simulations of HTT exon-1 constructs containing the N17 domain, polyQ tracts of clinically relevant lengths (Q21, wildtype; Q40, adult onset threshold; Q70, juvenile onset), and the polyproline (polyP) region. Multi-copy simulations (four chains) were run for 100 ns in explicit SPC/E water using the OPLS-AA force field. We quantified radius of gyration (Rg), solvent-accessible surface area (SASA), root-mean-square deviation (RMSD), and intra-protein hydrogen bonds as proxies for conformational expansion and aggregation propensity. PolyQ expansion drove progressive increases in Rg and SASA, consistent with more extended, solvent-exposed ensembles. We further tested organic co-solvents (methanol, hexane, trichloroethylene; 0.5 to 1.0 M), which modulated these landscapes in a solvent-dependent manner. Trichloroethylene induced marked expansion in Q21 and Q40, whereas methanol produced mild compaction in Q21. To our knowledge, this is the first MD study to systematically examine co-solvent effects on HTT exon-1 conformational dynamics. Although limited sampling precludes definitive mechanistic conclusions, the observed trends suggest that hydrophobic co-solvents can bias HTT exon-1 toward more expanded ensembles, motivating computational studies of gene-environment modulation in HD.
The local thermal conductivity ($\kappa$) is a pivotal biophysical parameter, governing intracellular heat flux and underlying functional processes like metabolic regulation and stress response. However, label-free mapping with sub-micron resolution in living cells remains a challenge. Here, we present frequency-domain fluorescence thermometry (FD-FTM), an all-optical method based on a hybrid nanodiamond-on-gold-membrane platform, which enables quantitative mapping of $\kappa$ in biological systems. Fluorescent nanodiamonds (FNDs) are deposited on substrates coated with a 50 nm gold membrane, where the FNDs function as nanoscale thermometers and the gold membrane serves as a photothermal heat source. We validate FD-FTM across reference materials and biological media, with fitting uncertainties of ~10%. By varying the modulation frequency, we tune the thermal penetration depth, enabling controlled heat propagation from the substrate to the cell nucleus. The method delivers sensitivity sufficient to resolve changes in biofluid thermal conductivity on the order of 16% relative to water. Using these capabilities, we demonstrate non-invasive thermal profiling across scales: at the cellular level, nuclear chromatin packing yields $\kappa$ higher by ~10% relative to the cytoplasm; at the organelle level, we resolve $\kappa$ variations associated with protein aggregates formed during liquid-liquid phase separation in an amyotrophic lateral sclerosis disease model. Temporal measurements in living cells over 30 minutes further reveal spatially resolved intracellular responses to osmotic stress, linking nanoscale thermal dynamics to biomolecular condensates. These results establish FD-FTM as a label-free, robust, and quantitative platform for thermally decoding intracellular processes, opening avenues for studying metabolic heterogeneity, disease mechanisms, and therapeutic responses.
Designing novel proteins with desired characteristics remains a significant challenge due to the large sequence space and the complexity of sequence-function relationships. Efficient exploration of this space to identify sequences that meet specific design criteria is crucial for advancing therapeutics and biotechnology. Here, we present BoGA (Bayesian Optimization Genetic Algorithm), a framework that combines evolutionary search with Bayesian optimization to efficiently navigate the sequence space. By integrating a genetic algorithm as a stochastic proposal generator within a surrogate modeling loop, BoGA prioritizes candidates based on prior evaluations and surrogate model predictions, enabling data-efficient optimization. We demonstrate the utility of BoGA through benchmarking on sequence and structure design tasks, followed by its application in designing peptide binders against pneumolysin, a key virulence factor of Streptococcus pneumoniae. BoGA accelerates the discovery of high-confidence binders, demonstrating the potential for efficient protein design across diverse objectives. The algorithm is implemented within the BoPep suite and is available on GitHub under an MIT license at this https URL.
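The loop described above can be sketched schematically: a genetic algorithm proposes mutants, a surrogate model ranks them, and only the top candidate per round receives an expensive evaluation. The random-forest surrogate, integer encoding, and toy objective are illustrative stand-ins, not BoGA's actual components.

```python
# Conceptual GA-inside-BO loop: surrogate-ranked proposals, sparse evaluations.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

AA = list("ACDEFGHIKLMNPQRSTVWY")
rng = np.random.default_rng(4)

def encode(seq):                       # toy integer encoding
    return [AA.index(a) for a in seq]

def mutate(seq):                       # GA proposal: single point mutation
    i = rng.integers(len(seq))
    return seq[:i] + rng.choice(AA) + seq[i + 1:]

def objective(seq):                    # stand-in for a costly evaluation
    return -sum(encode(seq))

pop = ["".join(rng.choice(AA, 10)) for _ in range(8)]
X, y = [encode(s) for s in pop], [objective(s) for s in pop]

for _ in range(20):                    # outer optimization loop
    surrogate = RandomForestRegressor(n_estimators=50).fit(X, y)
    proposals = [mutate(rng.choice(pop)) for _ in range(50)]
    best = max(proposals, key=lambda s: surrogate.predict([encode(s)])[0])
    X.append(encode(best)); y.append(objective(best)); pop.append(best)

print(max(y))
```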
Autocatalytic cores are minimal units in reaction networks (RNs) responsible for the emergence of autocatalysis. In the absence of explicit catalysis, i.e., when an entity appears both as reactant and product in the same reaction, they are known to be encoded by square submatrices of the stoichiometric matrix whose columns can be reordered as an irreducible child-selection (CS) matrix with negative diagonal and nonnegative off-diagonal (Metzler matrix). In the bipartite Koenig graph representing the RN, these CS matrices can be identified by fluffles, i.e., strong blocks with an identical number of entity and reaction vertices that have out- and in-degree 1, respectively. Here, we adapt the concepts derived for autocatalytic cores to RNs with explicitly catalyzed reactions, which emerge as digons, i.e., elementary circuits in the Koenig graph of length 2. In this setting, we confirm that an inspection of the stoichiometric matrix alone is inconclusive concerning the presence and number of autocatalytic cores, requiring a more delicate algebraic analysis. Nevertheless, this generalization preserves both the graph and the matrix representation as fluffles and irreducible Metzler CS matrices, respectively, although the diagonal is no longer necessarily strictly negative. We introduce the notion of hard autocatalytic cores, i.e., those that do not yield other autocatalytic cores upon inclusion of all reverse reactions. Finally, we consider the case of unit stoichiometries and show that each autocatalytic core can be constructed as the superposition of at most 2 elementary circuits. In particular, autocatalytic cores involving explicitly catalyzed reactions always contain a spanning subgraph consisting of a single elementary circuit together with a simple entity-to-reaction chord. Moreover, we identify the essentially unique example for which at least two circuits are required.
We use topological data analysis to study neural population activity in the Sensorium 2023 dataset, which records responses from thousands of mouse visual cortex neurons to diverse video stimuli. For each video, we build frame-by-frame cubical complexes from neuronal activity and apply zigzag persistent homology to capture how topological structure evolves over time. These dynamics are summarized with persistence landscapes, providing a compact vectorized representation of temporal features. We focus on one-dimensional topological features (loops in the data) that reflect coordinated, cyclical patterns of neural co-activation. To test their informativeness, we compare repeated trials of different videos by clustering their resulting topological neural representations. Our results show that these topological descriptors reliably distinguish neural responses to distinct stimuli. This work highlights a connection between evolving neuronal activity and interpretable topological signatures, advancing the use of topological data analysis for uncovering neural coding in complex dynamical systems.
During music listening, cortical activity encodes both acoustic and expectation-related information. Prior work has shown that artificial neural network (ANN) representations resemble cortical representations and can serve as supervisory signals for EEG recognition. Here we show that distinguishing acoustic and expectation-related ANN representations as teacher targets improves EEG-based music identification. Models pretrained to predict either representation outperform non-pretrained baselines, and combining them yields complementary gains that exceed strong seed ensembles formed by varying random initializations. These findings show that teacher representation type shapes downstream performance and that representation learning can be guided by neural encoding. This work points toward advances in predictive music cognition and neural decoding. Our expectation representation, computed directly from raw signals without manual labels, reflects predictive structure beyond onset or pitch, enabling investigation of multilayer predictive encoding across diverse stimuli. Its scalability to large, diverse datasets further suggests potential for developing general-purpose EEG models grounded in cortical encoding principles.
Reasoning is the ability to integrate internal states and external inputs in a meaningful and semantically consistent flow. Contemporary machine learning (ML) systems increasingly rely on such sequential reasoning, from language understanding to multi-modal generation, often operating over dictionaries of prototypical patterns reminiscent of associative memory models. Understanding retrieval and sequentiality in associative memory models provides a powerful bridge to gain insight into ML reasoning. While the static retrieval properties of associative memory models are well understood, the theoretical foundations of sequential retrieval and multi-memory integration remain limited, with existing studies largely relying on numerical evidence. This work develops a dynamical theory of sequential reasoning in Hopfield networks. We consider the recently proposed input-driven plasticity (IDP) Hopfield network and analyze a two-timescale architecture coupling fast associative retrieval with slow reasoning dynamics. We derive explicit conditions for self-sustained memory transitions, including gain thresholds, escape times, and collapse regimes. Together, these results provide a principled mathematical account of sequentiality in associative memory models, bridging classical Hopfield dynamics and modern reasoning architectures.
Spatial relationships in multi-species data can indicate and affect system outcomes and behaviors, ranging from disease progression in cancer to coral reef resilience in ecology; therefore, quantifying these relationships is an important problem across scientific disciplines. Persistent homology (PH), a key mathematical and computational tool in topological data analysis (TDA), provides a multiscale description of the shape of data. While it effectively describes spatial organization of species, such as cellular patterns in pathology, it cannot detect the shape relations between different types of species. Traditionally, PH analyzes single-species data, which limits the spatial analysis of interactions between different species. Leveraging recent developments in TDA and computational geometry, we introduce a scalable approach to quantify higher-order interactions in multi-species data. The framework can distinguish the presence of shape features or patterns in the data that are (i) common to multiple species of points, (ii) present in some species but disappear in the presence of other species, (iii) only visible when multiple species are considered together, and (iv) formed by some species and remain visible in the presence of others. We demonstrate our approach on two example applications. We identify (1) different behavioral regimes in a synthetic tumor micro-environment model, and (2) interspecies spatial interactions that are most significantly altered in colorectal cancer tissue samples during disease progression.
RNA structure determination is essential for understanding its biological functions. However, the reconstruction process often faces challenges, such as atomic clashes, which can lead to inaccurate models. To address these challenges, we introduce the principal submanifold (PSM) approach for analyzing RNA data on a torus. This method provides an accurate, low-dimensional feature representation, overcoming the limitations of previous torus-based methods. By combining PSM with DBSCAN, we propose a novel clustering technique, the principal submanifold-based DBSCAN (PSM-DBSCAN). Our approach achieves superior clustering accuracy and increased robustness to noise. Additionally, we apply this new method to multiscale corrections, effectively resolving RNA backbone clashes at both microscopic and mesoscopic scales. Extensive simulations and comparative studies highlight the enhanced precision and scalability of our method, demonstrating significant improvements over existing approaches. The proposed methodology offers a robust foundation for correcting complex RNA structures and has broad implications for applications in structural biology and bioinformatics.
The automated extraction of chemical structures and their corresponding bioactivity data is essential for accelerating drug discovery and enabling data-driven research. Current optical chemical structure recognition tools lack the capability to autonomously link molecular structures with their bioactivity profiles, posing a significant bottleneck in structure-activity relationship analysis. To address this, we present BioChemInsight, an open-source pipeline that integrates DECIMER Segmentation with MolNexTR for chemical structure recognition, GLM-4.5V for compound identifier association, and PaddleOCR combined with GLM-4.6 for bioactivity extraction and unit normalization. We evaluated BioChemInsight on 181 patents covering 15 therapeutic targets. The system achieved an average extraction accuracy of above 90% across three key tasks: chemical structure recognition, bioactivity data extraction, and compound identifier association. Our analysis indicates that the chemical space covered by patents is largely complementary to that contained in the established public database ChEMBL. Consequently, by enabling systematic patent mining, BioChemInsight provides access to chemical information underrepresented in ChEMBL. This capability expands the landscape of explorable compound-target interactions, enriches the data foundation for quantitative structure-activity relationship modeling and targeted screening, and reduces data preprocessing time from weeks to hours. BioChemInsight is available at this https URL.
As neuroscientific theories of consciousness continue to proliferate, the need to assess their similarities and differences - as well as their predictive and explanatory power - becomes ever more pressing. Recently, a number of structured adversarial collaborations have been devised to test the competing predictions of several candidate theories of consciousness. In this review, we compare and contrast three theories being investigated in one such adversarial collaboration: Integrated Information Theory, Neurorepresentationalism, and Active Inference. We begin by presenting the core claims of each theory, before comparing them in terms of the phenomena they seek to explain, the sorts of explanations they avail, and the methodological strategies they endorse. We then consider some of the inherent challenges of theory-testing, and how adversarial collaboration addresses some of these difficulties. The stage is then set for the empirical work to come: first, we outline the key hypotheses to be tested across a series of multi-site experiments; second, we discuss the kinds of observations that would support or challenge each theory; third, we consider how these theories might assimilate or accommodate such observations. Finally, we show how data harvested across disparate experiments (and their replicates) may be formally integrated to provide a quantitative measure of the evidential support accrued under each theory. Besides orienting the reader to the theoretical foundations of our collaboration, this review aims to provide valuable meta-scientific insights into the mechanics of adversarial collaboration and theory-testing in general - including the way theories may be evaluated in terms of the scientific progress they deliver.
Biomolecular interaction modeling has been substantially advanced by foundation models, yet they often produce all-atom structures that violate basic steric feasibility. We address this limitation by enforcing physical validity as a strict constraint during both training and inference with a unified module. At its core is a differentiable projection that maps the provisional atom coordinates from the diffusion model to the nearest physically valid configuration. This projection is achieved using a Gauss-Seidel scheme, which exploits the locality and sparsity of the constraints to ensure stable and fast convergence at scale. By implicit differentiation to obtain gradients, our module integrates seamlessly into existing frameworks for end-to-end fine-tuning. With our Gauss-Seidel projection module in place, two denoising steps are sufficient to produce biomolecular complexes that are both physically valid and structurally accurate. Across six benchmarks, our 2-step model achieves the same structural accuracy as state-of-the-art 200-step diffusion baselines, delivering approximately 10 times faster wall-clock speed while guaranteeing physical validity. The code is available at this https URL.
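The projection idea can be illustrated on the simplest steric constraint, a minimum pairwise distance: positions are swept sequentially and each clashing pair is pushed apart until the constraints approximately hold. A real system would exploit sparsity via neighbor lists and differentiate implicitly; this dense toy version only shows the Gauss-Seidel sweep.

```python
# Toy Gauss-Seidel projection onto minimum-distance (steric) constraints.
import numpy as np

def gauss_seidel_project(x, r_min=1.0, sweeps=50):
    x = x.copy()
    n = len(x)
    for _ in range(sweeps):
        for i in range(n):             # sequential (Gauss-Seidel) sweep
            for j in range(i + 1, n):
                d = x[j] - x[i]
                dist = np.linalg.norm(d)
                if dist < r_min:       # clash: split the correction
                    shift = 0.5 * (r_min - dist) * d / max(dist, 1e-9)
                    x[i] -= shift
                    x[j] += shift
    return x

pts = np.random.default_rng(5).normal(size=(20, 3))
proj = gauss_seidel_project(pts)
dists = np.linalg.norm(proj[:, None] - proj[None, :], axis=-1)
print(dists[np.triu_indices(20, 1)].min())  # approaches r_min after sweeps
```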
In temperate regions, respiratory virus epidemics recur on a yearly basis, primarily during the winter season. This is believed to be induced by seasonal forcing, where the rate at which the virus can be transmitted varies cyclically across the course of each year. Seasonal epidemics can place substantial burden upon the healthcare system, with large numbers of infections and hospitalisations occurring across a short time period. However, the interactions between seasonal forcing and the factors necessary for epidemic resurgence - such as waning immunity, antigenic variation or demography - remain poorly understood. In this manuscript, we examine how the dynamics of antibody waning and antigenic variation can shape the seasonal recurrence of epidemics. We develop a novel susceptible-infectious-susceptible (SIS) immuno-epidemiological model of respiratory virus spread, where the susceptible population is stratified by their antibody level against the currently circulating strain of the virus, with this decaying as both antibody waning and antigenic drift occur. In the absence of seasonal forcing, we demonstrate the existence of two Hopf bifurcations over the effective antibody decay rate, with associated periodic model solutions. When seasonal forcing is introduced, we identify complex interactions between the strength of forcing and the effective antibody decay rate, yielding myriad dynamics including multi-year periodicity, quasiperiodicity and chaos. The timing and magnitude of seasonal epidemics is highly sensitive to this interaction, with the distribution of infection timing (by time of year) varying substantially across the parameter space. Finally, we show that seasonal forcing can produce resonant damping resulting in a cumulative infection incidence that is less than would otherwise be observed.
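A stripped-down version of such a model can be written down directly: susceptibles stratified into a few antibody levels, sinusoidal seasonal forcing of transmission, and a single effective decay rate combining waning and antigenic drift. All parameters are illustrative placeholders, not the paper's.

```python
# Minimal antibody-stratified SIS sketch with seasonal forcing.
import numpy as np
from scipy.integrate import solve_ivp

K, beta0, eps, gamma, delta = 4, 0.5, 0.3, 0.2, 0.02
sigma = np.linspace(1.0, 0.1, K)   # susceptibility falls with antibody level

def rhs(t, y):
    S, I = y[:K], y[K]
    beta = beta0 * (1 + eps * np.cos(2 * np.pi * t / 365))  # seasonal forcing
    infections = beta * sigma * S * I
    dS = -infections
    dS[:-1] += delta * S[1:]       # effective antibody decay (waning + drift)
    dS[1:] -= delta * S[1:]
    dS[-1] += gamma * I            # recovery re-enters at the top antibody level
    return np.concatenate([dS, [infections.sum() - gamma * I]])

y0 = np.concatenate([[0.99], np.zeros(K - 1), [0.01]])
sol = solve_ivp(rhs, (0, 10 * 365), y0, max_step=1.0)
print(sol.y[K].max())              # peak prevalence over a decade
```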
Abstraction is the process of extracting the essential features from raw data while ignoring irrelevant details. It is well known that abstraction emerges with depth in neural networks, where deep layers capture abstract characteristics of data by combining lower-level features encoded in shallow layers (e.g., edges). Yet we argue that depth alone is not enough to develop truly abstract representations. We advocate that the level of abstraction crucially depends on how broad the training set is. We address the issue within a renormalisation group approach where a representation is expanded to encompass a broader set of data. We take the unique fixed point of this transformation -- the Hierarchical Feature Model -- as a candidate for a representation which is absolutely abstract. This theoretical picture is tested in numerical experiments based on Deep Belief Networks and auto-encoders trained on data of different breadth. These show that representations in neural networks approach the Hierarchical Feature Model as the data get broader and as depth increases, in agreement with theoretical predictions.
Neural population activity in cortical and hippocampal circuits can be flexibly reorganized by context, suggesting that cognition relies on dynamic manifolds rather than static representations. However, how such dynamic organization can be realized mechanistically within a unified dynamical system remains unclear. Continuous Hopfield networks provide a classical attractor framework in which neural dynamics follow gradient descent on a fixed energy landscape, constraining retrieval within a static attractor manifold geometry. Extending this approach, we introduce Dynamic Manifold Hopfield Networks (DMHN), continuous dynamical models in which contextual modulation dynamically reshapes attractor geometry, transforming a static attractor manifold into a context-dependent family of neural manifolds. In DMHN, network interactions are learned in a data-driven manner, to intrinsically deform the geometry of its attractor manifold across cues without explicit context-specific parameterization. As a result, in associative retrieval, DMHN achieve substantially higher capacity and robustness than classical and modern Hopfield networks: when storing $2N$ patterns in a network of $N$ neurons, DMHN attain reliable retrieval with an average accuracy of 64%, compared with 1% and 13% for classical and modern variants, respectively. Together, these results establish dynamic reorganization of attractor manifold geometry as a principled mechanism for context-dependent remapping in neural associative memory.
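The underlying idea, context reshaping a fixed energy landscape, can be illustrated with a toy gain-modulated continuous Hopfield network; this is a caricature for intuition, not the paper's learned DMHN parameterization.

```python
# Toy context-modulated continuous Hopfield retrieval: a scalar context
# gain rescales the Hebbian interactions, deforming the attractor landscape.
import numpy as np

rng = np.random.default_rng(6)
N, P = 100, 5
patterns = rng.choice([-1.0, 1.0], size=(P, N))
W = patterns.T @ patterns / N                 # Hebbian base weights

def retrieve(x, context_gain, steps=50):
    Wc = context_gain * W                     # context reshapes the landscape
    for _ in range(steps):
        x = np.tanh(Wc @ x)                   # continuous Hopfield update
    return x

probe = patterns[0] + 0.5 * rng.standard_normal(N)
out = retrieve(probe, context_gain=2.0)
print(np.sign(out) @ patterns[0] / N)         # overlap with the stored pattern
```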
Rare but critical events in complex systems, such as protein folding, chemical reactions, disease progression, and extreme weather or climate phenomena, are governed by complex, high-dimensional, stochastic dynamics. Identifying an optimal reaction coordinate (RC) that accurately captures the progress of these dynamics is crucial for understanding and simulating such processes. However, determining an optimal RC for realistic systems is notoriously difficult, due to methodological challenges that limit the success of standard machine learning techniques. These challenges include the absence of ground truth, the lack of a loss function for general nonequilibrium dynamics, the difficulty of selecting expressive neural network architectures that avoid overfitting, the irregular and incomplete nature of many real world trajectories, limited sampling and the extreme data imbalance inherent in rare event problems. Here, we introduce a nonparametric RC optimization framework that incorporates trajectory histories and circumvents these challenges, enabling robust analysis of irregular or incomplete data without requiring extensive sampling. The power of the method is demonstrated through increasingly challenging analyses of protein folding dynamics, where it yields accurate committor estimates that pass stringent validation tests and produce high resolution free energy profiles. Its generality is further illustrated through applications to phase space dynamics, a conceptual ocean circulation model, and a longitudinal clinical dataset. These results demonstrate that rare event dynamics can be accurately characterized without extensive sampling of the configuration space, establishing a general, flexible, and robust framework for analyzing complex dynamical systems and longitudinal datasets.
Cytotoxic T lymphocytes eliminate infected or malignant cells, safeguarding surrounding tissues. Although experimental and systems-immunology studies have cataloged many molecular and cellular actors involved in an immune response, the design principles governing how the speed and magnitude of T-cell responses emerge from cellular decision-making remain elusive. Here, we recast the T-cell response as a feedback-controlled program, wherein the rates of activation, proliferation, differentiation and death are regulated through antigenic, pro- and anti-inflammatory cues. By exploring a broad class of feedback-controller designs as potential immune programs, we demonstrate how the speed and magnitude of T-cell responses emerge from optimizing signal-feedback to protect against diverse infection settings. We recover an inherent trade-off: infection clearance at the cost of immunopathology. We show how this trade-off is encoded into the logic of T-cell responses by hierarchical sensitivity to different immune signals. Notably, we find that designs that balance harm from acute infections and autoimmunity produce immune responses consistent with experimentally observed patterns of T-cell effector expansion in mice. Extending our model to immune-based T-cell therapies for cancer, we identify a trade-off between the affinity for tumor antigens ("quality") and the abundance ("quantity") of infused T-cells necessary for effective treatment. Finally, we show how therapeutic efficacy can be improved by targeted genetic perturbations to T-cells. Our findings offer a unified control-logic for cytotoxic T-cell responses and point to specific regulatory programs that can be engineered for more robust T-cell therapies.
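The feedback-controlled-program view can be caricatured as a two-state control loop: antigen load drives T-cell expansion, damped by an anti-inflammatory term. The functional forms and parameters below are illustrative placeholders, not the paper's optimized controller designs.

```python
# Schematic feedback-controller sketch of a T-cell response.
import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, y, k_act=1.0, k_kill=2.0, k_death=0.1, k_inhib=0.5, r=0.5):
    A, T = y                              # antigen load, T-cell pool
    feedback = A / (1 + k_inhib * T)      # anti-inflammatory damping of activation
    return [r * A - k_kill * A * T,       # pathogen growth vs. killing
            k_act * feedback * T - k_death * T]

sol = solve_ivp(rhs, (0, 40), [0.1, 0.01], max_step=0.05)
print(sol.y[0][-1], sol.y[1].max())       # residual antigen vs. peak expansion
```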