New articles on Quantitative Biology


[1] 2604.11818

Scale-dependent Temporal Signatures of Arboviral Transmission in Urban Environments

Understanding epidemic dynamics in urban environments requires models that capture interactions across space and time while incorporating biological constraints. In this work, we propose a probabilistic spatiotemporal framework based on pairwise interaction kernels to analyze arboviral transmission using large-scale georeferenced data from Recife, Brazil. The model describes interactions as a function of spatial distance and temporally delayed influence, with parameters estimated via maximum likelihood. Our results reveal a marked asymmetry between spatial and temporal components. The spatial parameter systematically collapses, indicating that spatial proximity does not provide discriminatory information between diseases at the urban scale. In contrast, temporal dynamics exhibit scale-dependent behavior: statistical differentiation between dengue, Zika, and chikungunya emerges only beyond a critical temporal window. We show that unconstrained models primarily capture short-term co-occurrence, leading to apparent but non-robust differences, while biologically constrained models reveal a common underlying transmission structure. Additionally, reconstructed transmission networks exhibit localized and structured interaction patterns consistent with plausible epidemic propagation. These findings demonstrate that epidemic differentiation is not intrinsic, but an emergent phenomenon dependent on temporal scale, highlighting the importance of biologically grounded and scale-aware modeling in spatiotemporal epidemic analysis.
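The abstract does not specify the kernel family or estimator; as a minimal sketch, assuming a separable exponential interaction kernel in spatial distance and temporal delay, the maximum-likelihood fit of the two scale parameters has a simple closed form (the sample means). All numbers and the kernel form below are illustrative assumptions, not the paper's model.

```python
import math
import random

random.seed(42)

# Synthetic case pairs: spatial distance d and temporal delay t, drawn from an
# assumed separable exponential kernel with true scales sigma=1.5, tau=12.
true_sigma, true_tau = 1.5, 12.0
pairs = [(random.expovariate(1 / true_sigma), random.expovariate(1 / true_tau))
         for _ in range(5000)]

def neg_log_likelihood(sigma, tau, pairs):
    """Negative log-likelihood of f(d,t) = (1/sigma)e^{-d/sigma} (1/tau)e^{-t/tau}."""
    return sum(math.log(sigma) + d / sigma + math.log(tau) + t / tau
               for d, t in pairs)

# For this separable exponential kernel the MLE is the pair of sample means.
sigma_hat = sum(d for d, _ in pairs) / len(pairs)
tau_hat = sum(t for _, t in pairs) / len(pairs)
print(f"sigma_hat={sigma_hat:.2f}, tau_hat={tau_hat:.2f}")
```

A richer, non-separable kernel would require numerical optimization of the same log-likelihood rather than a closed form.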


[2] 2604.11824

Patterns in Individual Blood Count Trajectories in the UK Biobank Characterise Disease-Specific Signatures and Anticipate Pan-Cancer Risk

We investigate the longitudinal behaviour of markers from common haematological tests, both as indicators of disease and as functions of disease progression, across a variety of conditions including cancer, cardiovascular disease, and infections. We account for confounding and non-confounding factors to enable earlier detection of disease based on longitudinal signatures in biomarkers routinely measured in scalable clinical blood tests, in particular the Complete Blood Count (CBC or FBC). Applying normalised temporal profiles and machine learning techniques, we demonstrate that analyte-group patterns in blood tests are disease-sensitive and disease-specific, even before symptoms appear. We show that CBC markers contribute the majority of the predictive signal, while biochemistry and other blood panels provide only a modest additional gain, mostly associated with the specific disease for which each test was designed (e.g. CRP, liver enzymes, blood sugar). Our results demonstrate how regular monitoring, computational intelligence, and machine learning applied to longitudinal CBC data can converge to uncover disease patterns, advancing precision healthcare and predictive medicine at mass scale by leveraging an existing and pervasive blood test.


[3] 2604.11852

Evaluating the Limitations of Protein Sequence Representations for Parkinson's Disease Classification

The identification of reliable molecular biomarkers for Parkinson's disease remains challenging due to its multifactorial nature. Although protein sequences constitute a fundamental and widely available source of biological information, their standalone discriminative capacity for complex disease classification remains unclear. In this work, we present a controlled and leakage-free evaluation of multiple representations derived exclusively from protein primary sequences, including amino acid composition, k-mers, physicochemical descriptors, hybrid representations, and embeddings from protein language models, all assessed under a nested stratified cross-validation framework to ensure unbiased performance estimation. The best-performing configuration (ProtBERT + MLP) achieves an F1-score of 0.704 +/- 0.028 and ROC-AUC of 0.748 +/- 0.047, indicating only moderate discriminative performance. Classical representations such as k-mers reach comparable F1 values (up to approximately 0.667), but exhibit highly imbalanced behavior, with recall close to 0.98 and precision around 0.50, reflecting a strong bias toward positive predictions. Across representations, performance differences remain within a narrow range (F1 between 0.60 and 0.70), while unsupervised analyses reveal no intrinsic structure aligned with class labels, and statistical testing (Friedman test, p = 0.1749) does not indicate significant differences across models. These results demonstrate substantial overlap between classes and indicate that primary sequence information alone provides limited discriminative power for Parkinson's disease classification. This work establishes a reproducible baseline and provides empirical evidence that more informative biological features, such as structural, functional, or interaction-based descriptors, are required for robust disease modeling.
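The imbalance the abstract reports for the k-mer representation (recall near 0.98, precision around 0.50) can be checked directly against its quoted F1: a quick sketch of the F1 arithmetic, using only the figures given above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Figures quoted for the k-mer representation: recall ~0.98, precision ~0.50.
kmer_f1 = f1_score(precision=0.50, recall=0.98)
print(f"k-mer F1 ~ {kmer_f1:.3f}")  # close to the reported ~0.667

# A balanced classifier with precision = recall = 0.70 reaches a similar F1,
# showing why F1 alone can mask strongly biased prediction behaviour.
balanced_f1 = f1_score(precision=0.70, recall=0.70)
print(f"balanced F1 = {balanced_f1:.3f}")
```

This is why the abstract stresses reporting precision and recall alongside F1: two very different classifiers can share one summary number.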


[4] 2604.12004

Fixation probabilities for multi-allele Moran dynamics with weak selection

Fixation probabilities are essential for characterizing stochastic evolutionary dynamics, but analytical results remain limited mainly to systems with two competing types. We develop a perturbative framework to compute fixation probabilities in multi-allele Moran processes under weak selection. Exploiting the general structure of the backward Fokker-Planck operator in this regime, we show that fixation probabilities admit a systematic expansion around their neutral solution. We first introduce the framework in a general case with $M$ competing alleles and arbitrary fitness functions, and then apply it to three biologically motivated examples: a simple model of three competing alleles with a constant fitness function, a coordination game in which allele fitness increases with its frequency in the population, and a model of clonal interference between mutualistic alleles. These results extend the analytical understanding of fixation probabilities beyond pairwise interactions, establishing a framework for investigating multi-strategy stochastic evolutionary dynamics.
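The paper's perturbative expansion is analytical; as a complementary sketch, a Monte Carlo simulation of the multi-allele Moran process (birth proportional to fitness, uniform death) can be used to check fixation probabilities numerically. In the neutral case the fixation probability of an allele equals its initial frequency, which gives a simple correctness check. Population size, counts, and trial numbers below are illustrative.

```python
import random

random.seed(0)

def moran_fixation(counts, fitness, trials=3000):
    """Monte Carlo estimate of the fixation probability of allele 0 in a
    multi-allele Moran process: each step, a parent is chosen with
    probability proportional to fitness and replaces a uniformly chosen
    individual, until one allele fixes."""
    fixed0 = 0
    for _ in range(trials):
        pop = []
        for allele, c in enumerate(counts):
            pop.extend([allele] * c)
        while len(set(pop)) > 1:
            parent = random.choices(pop, weights=[fitness[a] for a in pop])[0]
            pop[random.randrange(len(pop))] = parent
        fixed0 += pop[0] == 0
    return fixed0 / trials

# Neutral three-allele check: fixation probability equals initial frequency.
p0 = moran_fixation(counts=[2, 4, 4], fitness=[1.0, 1.0, 1.0])
print(f"P(fix allele 0) ~ {p0:.3f} (neutral prediction 0.2)")
```

Replacing the constant fitness vector with frequency-dependent payoffs would give a simulation counterpart to the coordination-game example mentioned above.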


[5] 2604.12164

Phylogenetic Inference under the Balanced Minimum Evolution Criterion via Semidefinite Programming

In this study, we investigate the application of Semidefinite Programming (SDP) to phylogenetics. SDP is a powerful optimization framework that seeks to optimize a linear objective function over the cone of positive semidefinite matrices. As a convex optimization problem, SDP generalizes linear programming and provides tight relaxations for many combinatorial optimization problems. However, despite its many applications, SDP remains largely unused in computational biology. We argue that SDP relaxations are particularly well suited for phylogenetic inference. As a proof of concept, we focus on the Balanced Minimum Evolution (BME) problem, a widely used model in distance-based phylogenetics. We propose an algorithm combining an SDP relaxation with a rounding scheme that iteratively converts relaxed solutions into valid tree topologies. Experiments on simulated and empirical datasets show that the method enables accurate phylogenetic reconstruction. The approach is sufficiently general to be extendable to other phylogenetic problems.


[6] 2604.12294

The IQ-Motion Confound in Multi-Site Autism fMRI May Be Inflated by Site-Correlated Measurement Uncertainty

Multi-site autism neuroimaging studies routinely control for the confound between full-scale IQ and head motion by regressing framewise displacement against IQ scores and removing shared variance. This procedure assumes that ordinary least squares (OLS) provides an unbiased estimate of the confound magnitude. We tested this assumption on the ABIDE-I phenotypic dataset (n=935 subjects across 19 international scanning sites) using Probability Cloud Regression, an errors-in-variables (EIV) estimator that models per-observation measurement uncertainty in both variables. IQ measurement error was derived from published Wechsler test-retest reliability coefficients; response-side uncertainty was represented by a site-level proxy equal to the within-site standard deviation of mean framewise displacement. Three findings emerged. First, OLS overestimates the IQ-motion slope by a factor of 4.67 relative to the EIV-corrected estimate when the bias factor is computed from the full-precision fitted coefficients (OLS -0.00125, EIV -0.00027 mm per IQ point after rounding for display). Second, under leave-site-out cross-validation a single pooled predictor of raw FD produces negative out-of-sample R^2 at all 19 sites (overall R^2 = -0.074), indicating that the pooled predictor does not transport cleanly across sites once site information is removed. Third, the direction of the EIV-corrected slope is robust across all 64 configurations of an 8x8 sensitivity grid spanning 12-fold ranges of each noise parameter. These results suggest that pooled OLS may overstate the IQ-motion association in ABIDE-I, but direct downstream consequences for motion-correction pipelines remain to be quantified using raw motion traces and connectivity-level re-analysis. Formal EIV methods appear to remain uncommon in multi-site neuroimaging confound estimation.
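Probability Cloud Regression itself is not specified in the abstract; as a minimal sketch of the general errors-in-variables idea, the classical method-of-moments correction divides the OLS slope by the reliability ratio to undo attenuation from predictor measurement noise. Note the direction differs from the paper's finding: classical predictor noise attenuates the OLS slope toward zero, whereas the abstract reports inflation attributed to site-correlated uncertainty. The noise magnitudes below are hypothetical; only the slope value is taken from the abstract.

```python
import random

random.seed(1)

n, beta = 4000, -0.00125        # slope scale from the abstract (mm FD per IQ point)
sigma_x, sigma_u = 15.0, 9.0    # hypothetical true IQ spread and measurement-error SD

x_true = [random.gauss(100, sigma_x) for _ in range(n)]
x_obs = [x + random.gauss(0, sigma_u) for x in x_true]     # noisy IQ scores
y = [beta * x + random.gauss(0, 0.01) for x in x_true]     # framewise displacement

def ols_slope(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

naive = ols_slope(x_obs, y)
reliability = sigma_x**2 / (sigma_x**2 + sigma_u**2)  # e.g. from test-retest data
corrected = naive / reliability
print(f"naive OLS {naive:.5f}, EIV-corrected {corrected:.5f}, truth {beta}")
```

The point of the sketch is only that naive OLS on a noisily measured predictor is biased, and that an external reliability estimate (here, Wechsler test-retest coefficients in the paper) is what makes correction possible.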


[7] 2604.12387

oxo-call: Documentation-grounded Skill Augmentation for Accurate Bioinformatics Command-line Generation with Large Language Models

Command-line bioinformatics tools remain essential for genomic analysis, yet their diversity in syntax and parameterization presents a persistent barrier to productive research. We present oxo-call, a Rust-based command-line assistant that translates natural-language task descriptions into accurate tool invocations through two complementary strategies: documentation-first grounding, which provides the large language model (LLM) with the complete, version-specific help text of each target tool, and curated skill augmentation, which primes the model with domain-expert concepts, common pitfalls, and worked examples. oxo-call (v0.10) ships >150 built-in skills covering 44 analytical categories, from variant calling and genome assembly to single-cell transcriptomics, compiled into a single, statically linked binary. Every generated command is logged with provenance metadata to support reproducible research. oxo-call also provides a DAG-based workflow engine, extensibility through user-defined and community skills via the Model Context Protocol, and support for local LLM inference to address data-privacy requirements. oxo-call is freely available for academic use at this https URL.


[8] 2604.12546

Predicting success of cooperators across arbitrary heterogeneous environmental landscapes

Cooperation is central to the organization of complex biological and social systems. Most theoretical models assume homogeneous environments; in reality, populations inhabit spatially varying landscapes in which the payoffs of cooperation differ across space. Here, we introduce a general framework for the evolution of cooperation in complex, heterogeneous environments where the benefit of cooperation depends on local environmental quality. Cooperators in environmentally rich sites confer greater benefits than those on poor sites. We show that whether heterogeneity promotes or suppresses cooperation is determined primarily by the spatial organization of environmental states. Across arbitrary environmental landscapes, a single quantity, the spatial correlation index (SCI), predicts the fixation probability of cooperators. Under weak selection, segregated environments enhance cooperation, whereas highly intermixed, checkerboard-like landscapes suppress it. Beyond fixation probabilities, environmental organization also controls evolutionary timescales: segregated landscapes generate long-lived metastable coexistence, whereas intermixed landscapes lead to faster but less successful fixation of cooperators. Together, these results provide a unifying description of how spatial environmental heterogeneity shapes the evolution of cooperation and suggest measurable predictors of cooperative success in biological and social settings.
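The paper's exact definition of the spatial correlation index is not given in the abstract; a common join-count style statistic (the excess fraction of neighbouring site pairs sharing the same environmental state, relative to random mixing) serves as an assumed stand-in here. It reproduces the qualitative contrast described above: segregated landscapes score high, checkerboard landscapes score low.

```python
def spatial_correlation_index(grid):
    """Assumed join-count style index: fraction of neighbouring site pairs in
    the same environmental state, minus the like-pair rate under random
    mixing.  NOT necessarily the paper's SCI definition."""
    rows, cols = len(grid), len(grid[0])
    same = total = 0
    for i in range(rows):
        for j in range(cols):
            for di, dj in ((0, 1), (1, 0)):        # right and down neighbours
                ni, nj = i + di, j + dj
                if ni < rows and nj < cols:
                    total += 1
                    same += grid[i][j] == grid[ni][nj]
    flat = [s for row in grid for s in row]
    p_rich = sum(flat) / len(flat)
    expected = p_rich**2 + (1 - p_rich)**2         # like-pair rate if well mixed
    return same / total - expected

n = 8
segregated = [[1 if j < n // 2 else 0 for j in range(n)] for _ in range(n)]
checkerboard = [[(i + j) % 2 for j in range(n)] for i in range(n)]
sci_seg = spatial_correlation_index(segregated)
sci_chk = spatial_correlation_index(checkerboard)
print(f"segregated SCI {sci_seg:.3f}, checkerboard SCI {sci_chk:.3f}")
```

Under the abstract's result, the high-SCI (segregated) landscape is the one predicted to favour cooperator fixation under weak selection.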


[9] 2604.12671

Differentiating Physical and Psychological Stress Using Wearable Physiological Signals and Salivary Cortisol

Objective: This study aimed to assess how wearable physiological signals, alone and combined with salivary cortisol, distinguish physical and psychological stress and their recovery states. Methods: Six healthy adults completed three laboratory sessions on separate days: rest, physical stress (high-intensity cycling), or psychological stress (modified Trier Social Stress Test). Heart rate, heart rate variability, electrodermal activity, and wrist accelerometry were recorded continuously, and salivary cortisol was sampled at five time points. Features were extracted in non-overlapping 10-minute windows and labelled as rest, physical stress, physical recovery, psychological stress, or psychological recovery. A gradient boosting classifier was trained using wearable features alone and with five additional cortisol features per window. Performance was evaluated using leave-one-participant-out cross-validation. Results: Wearable-only classification achieved 77.8% overall accuracy, with high accuracy for physical stress and recovery but frequent misclassification of psychological stress and recovery (recall 50.0% and 54.2%). Including cortisol improved overall accuracy (94.4%), particularly for psychological states, increasing recall to 83.3% and 87.5%. Cortisol also reduced misclassification between psychological stress and rest. Conclusion: Wearable signals alone were insufficient to reliably distinguish psychological stress from rest and recovery. Integrating salivary cortisol improved classification of psychological stress and recovery and reduced confusion with rest, highlighting the value of endocrine context alongside wearable physiology. Significance: These findings support multimodal stress monitoring and motivate larger, ecologically valid studies and scalable alternatives to repeated cortisol sampling.
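The feature extraction and classifier are not reproducible from the abstract alone, but the evaluation scheme is standard; a minimal sketch of the leave-one-participant-out split logic (participant IDs and window contents below are toy values) shows how within-subject leakage is prevented.

```python
def leave_one_participant_out(windows):
    """Yield (held_out, train_idx, test_idx) splits where each test fold
    contains all windows from exactly one participant."""
    participants = sorted({w["participant"] for w in windows})
    for held_out in participants:
        train = [i for i, w in enumerate(windows) if w["participant"] != held_out]
        test = [i for i, w in enumerate(windows) if w["participant"] == held_out]
        yield held_out, train, test

# Toy 10-minute windows labelled per participant and condition.
windows = [{"participant": p, "label": lab}
           for p in ("P1", "P2", "P3")
           for lab in ("rest", "physical_stress", "psych_stress")]

splits = list(leave_one_participant_out(windows))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

With only six participants, as in the study, each fold's test set is a single subject, which is why per-state recall (rather than a single pooled accuracy) is the informative metric.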


[10] 2604.12825

The illusory simplicity of the feedforward pass: evidence for the dynamical nature of stimulus encoding along the primate ventral stream

In studying primate vision, a large body of work focuses on the first feedforward sweep. During this initial time window, information is thought to pass through ventral stream regions in a stage-like fashion in an effort to extract high-level information from the retinal input. Consequently, electrophysiological analyses commonly focus on spatial response patterns, either by averaging data in time, or by applying decoders in a temporally local fashion. By analysing data recorded simultaneously across multiple arrays placed along the macaque ventral stream, we here show that this prior approach may be missing key aspects of information encoding. First, time-resolved, multivariate analyses of information transfer between V4 and IT reveal temporally and semantically varied information content as being exchanged within the first 100ms of processing. Second, by employing recurrent neural network (RNN) decoding techniques that extend across the temporal domain, we demonstrate that the neural pattern dynamics themselves carry categorical information far beyond the spatially encoded information available at any given time point. These findings challenge the prevailing view of a single, stage-like feedforward process and suggest that even the earliest parts of visual processing are better characterised as a spatiotemporally evolving process that encodes information in its dynamics rather than purely spatial response patterns.


[11] 2604.11915

Can AI Detect Life? Lessons from Artificial Life

Modern machine learning methods have been proposed to detect life in extraterrestrial samples, drawing on their ability to distinguish biotic from abiotic samples when trained on natural and synthetic organic molecular mixtures. Here we show, using Artificial Life, that such methods are easily fooled into detecting life with near 100% confidence even if the analyzed sample is not capable of life. This reflects modern machine learning methods' propensity to be misled by out-of-distribution samples. Because extraterrestrial samples are very likely outside the distribution spanned by terrestrial biotic and abiotic samples, using AI methods for life detection is bound to yield significant false positives.


[12] 2604.11944

A unified data format for managing diabetes time-series data: DIAbetes eXchange (DIAX)

Diabetes devices, including Continuous Glucose Monitoring (CGM), Smart Insulin Pens, and Automated Insulin Delivery systems, generate rich time-series data widely used in research and machine learning. However, inconsistent data formats across sources hinder sharing, integration, and analysis. We present DIAX (DIAbetes eXchange), a standardized JSON-based format for unifying diabetes time-series data, including CGM, insulin, and meal signals. DIAX promotes interoperability, reproducibility, and extensibility, particularly for machine learning applications. An open-source repository provides tools for dataset conversion, cross-format compatibility, visualization, and community contributions. DIAX is a translational resource, not a data host, ensuring flexibility without imposing data-sharing constraints. Currently, DIAX is compatible with other standardization efforts and supports major datasets (DCLP3, DCLP5, IOBP2, PEDAP, T1Dexi, Loop), totaling over 10 million patient-hours of data. this https URL
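DIAX is described as JSON-based, but its schema is not given in the abstract; the record below is a hypothetical illustration of what a unified time-series layout for CGM, insulin, and meal events could look like. Every field name here is an assumption, not the actual DIAX specification.

```python
import json

# Hypothetical unified record; the field names illustrate the idea of one
# JSON layout for heterogeneous diabetes signals, NOT the real DIAX schema.
record = {
    "patient_id": "anon-001",
    "events": [
        {"type": "cgm", "time": "2024-01-01T08:00:00Z", "value": 6.2, "unit": "mmol/L"},
        {"type": "insulin", "time": "2024-01-01T08:05:00Z", "value": 4.0,
         "unit": "U", "kind": "bolus"},
        {"type": "meal", "time": "2024-01-01T08:10:00Z", "value": 45.0, "unit": "g_carbs"},
    ],
}

text = json.dumps(record, indent=2)
parsed = json.loads(text)
cgm = [e for e in parsed["events"] if e["type"] == "cgm"]
print(len(parsed["events"]), cgm[0]["value"])
```

The practical benefit of such a layout is that downstream code filters one event stream by `type` instead of maintaining a parser per device vendor.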


[13] 2604.12026

TriFit: Trimodal Fusion with Protein Dynamics for Mutation Fitness Prediction

Predicting the functional impact of single amino acid substitutions (SAVs) is central to understanding genetic disease and engineering therapeutic proteins. While protein language models and structure-based methods have achieved strong performance on this task, they systematically neglect protein dynamics; residue flexibility, correlated motions, and allosteric coupling are well-established determinants of mutational tolerance in structural biology, yet have not been incorporated into supervised variant effect predictors. We present TriFit, a multimodal framework that integrates sequence, structure, and protein dynamics through a four-expert Mixture-of-Experts (MoE) fusion module with trimodal cross-modal contrastive learning. Sequence embeddings are extracted via masked marginal scoring with ESM-2 (650M); structural embeddings from AlphaFold2-predicted C-alpha geometries; and dynamics embeddings from Gaussian Network Model (GNM) B-factors, mode shapes, and residue-residue cross-correlations. The MoE router adaptively weights modality combinations conditioned on the input, enabling protein-specific fusion without fixed modality assumptions. On the ProteinGym substitution benchmark (217 DMS assays, 696k SAVs), TriFit achieves AUROC 0.897 +/- 0.0002, outperforming all supervised baselines including Kermut (0.864) and ProteinNPT (0.844), and the best zero-shot model ESM3 (0.769). Ablation studies confirm that dynamics provides the largest marginal contribution over pairwise modality combinations, and TriFit achieves well-calibrated probabilistic outputs (ECE = 0.044) without post-hoc correction.
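Of the three modalities, the dynamics branch rests on the Gaussian Network Model; a minimal sketch of the GNM Kirchhoff (connectivity) matrix built from C-alpha coordinates with a distance cutoff is shown below. The toy coordinates and 7 A cutoff are illustrative (7-7.5 A is a conventional GNM choice); the B-factors and mode shapes used by TriFit would follow from this matrix's eigendecomposition, which is omitted here.

```python
import math

def kirchhoff_matrix(coords, cutoff=7.0):
    """GNM Kirchhoff matrix from C-alpha coordinates: -1 for residue pairs
    within the cutoff, node degree on the diagonal.  B-factors follow from
    its non-zero eigenmodes (B_i proportional to sum_k psi_ki^2 / lambda_k)."""
    n = len(coords)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                K[i][j] = K[j][i] = -1.0
                K[i][i] += 1.0
                K[j][j] += 1.0
    return K

# Toy 4-residue chain, ~3.8 A between consecutive C-alpha atoms.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
K = kirchhoff_matrix(coords)
for row in K:
    print(row)
```

As a graph Laplacian, the matrix is symmetric with zero row sums, so its smallest eigenvalue is zero and is discarded before computing fluctuation profiles.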


[14] 2604.12060

Interpretable DNA Sequence Classification via Dynamic Feature Generation in Decision Trees

The analysis of DNA sequences has become critical in numerous fields, from evolutionary biology to understanding gene regulation and disease mechanisms. While deep neural networks can achieve remarkable predictive performance, they typically operate as black boxes. Contrasting these black boxes, axis-aligned decision trees offer a promising direction for interpretable DNA sequence analysis, yet they suffer from a fundamental limitation: considering individual raw features in isolation at each split limits their expressivity, which results in prohibitive tree depths that hinder both interpretability and generalization performance. We address this challenge by introducing DEFT, a novel framework that adaptively generates high-level sequence features during tree construction. DEFT leverages large language models to propose biologically-informed features tailored to the local sequence distributions at each node and to iteratively refine them with a reflection mechanism. Empirically, we demonstrate that DEFT discovers human-interpretable and highly predictive sequence features across a diverse range of genomic tasks.


[15] 2604.12075

OpenTME: An Open Dataset of AI-powered H&E Tumor Microenvironment Profiles from TCGA

The tumor microenvironment (TME) plays a central role in cancer progression, treatment response, and patient outcomes, yet large-scale, consistent, and quantitative TME characterization from routine hematoxylin and eosin (H&E)-stained histopathology remains scarce. We introduce OpenTME, an open-access dataset of pre-computed TME profiles derived from 3,634 H&E-stained whole-slide images across five cancer types (bladder, breast, colorectal, liver, and lung cancer) from The Cancer Genome Atlas (TCGA). All outputs were generated using Atlas H&E-TME, an AI-powered application built on the Atlas family of pathology foundation models, which performs tissue quality control, tissue segmentation, cell detection and classification, and spatial neighborhood analysis, yielding over 4,500 quantitative readouts per slide at cell-level resolution. OpenTME is available for non-commercial academic research on Hugging Face. We will continue to expand OpenTME over time and anticipate it will serve as a resource for biomarker discovery, spatial biology research, and the development of computational methods for TME analysis.


[16] 2604.12683

Brain-DiT: A Universal Multi-state fMRI Foundation Model with Metadata-Conditioned Pretraining

Current fMRI foundation models primarily rely on a limited range of brain states and mismatched pretraining tasks, restricting their ability to learn generalized representations across diverse brain states. We present Brain-DiT, a universal multi-state fMRI foundation model pretrained on 349,898 sessions from 24 datasets spanning resting, task, naturalistic, disease, and sleep states. Unlike prior fMRI foundation models that rely on masked reconstruction in the raw-signal space or a latent space, Brain-DiT adopts metadata-conditioned diffusion pretraining with a Diffusion Transformer (DiT), enabling the model to learn multi-scale representations that capture both fine-grained functional structure and global semantics. Across extensive evaluations and ablations on 7 downstream tasks, we find consistent evidence that diffusion-based generative pretraining is a stronger proxy than reconstruction or alignment, with metadata-conditioned pretraining further improving downstream performance by disentangling intrinsic neural dynamics from population-level variability. We also observe that downstream tasks exhibit distinct preferences for representational scale: ADNI classification benefits more from global semantic representations, whereas age/sex prediction comparatively relies more on fine-grained local structure. Code and parameters of Brain-DiT are available at this https URL.


[17] 2604.12884

An abstract model of nonrandom, non-Lamarckian mutation in evolution using a multivariate estimation-of-distribution algorithm

At the fundamental conceptual level, two alternatives have traditionally been considered for how mutations arise and how evolution happens: 1) random mutation and natural selection, and 2) Lamarckism. Recently, the theory of Interaction-based Evolution (IBE) has been proposed, according to which mutations are neither random nor Lamarckian, but are influenced by information accumulating internally in the genome over generations. Based on the estimation-of-distribution algorithms framework, we present a simulation model that demonstrates nonrandom, non-Lamarckian mutation concretely while capturing indirectly several aspects of IBE: selection, recombination, and nonrandom, non-Lamarckian mutation interact in a complementary fashion; evolution is driven by the interaction of parsimony and fit; and random bits do not directly encode improvement but enable generalization by the manner in which they connect with the rest of the evolutionary process. Connections are drawn to Darwin's observations that changed conditions increase the rate of production of heritable variation; to the causes of bell-shaped distributions of traits and how these distributions respond to selection; and to computational learning theory, where analogizing evolution to learning in accord with IBE casts individuals as examples and places the learned hypothesis at the population level. The model highlights the importance of incorporating internal integration of information through heritable change in both evolutionary theory and evolutionary computation.
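The paper's multivariate model is not reproduced in the abstract; a minimal univariate estimation-of-distribution algorithm (a PBIL-style sketch on the OneMax bitstring problem, all parameters illustrative) shows the core mechanism the framework builds on: offspring are sampled from a distribution learned over generations, so variation is nonrandom without any individual writing its acquired state back into its own genome.

```python
import random

random.seed(7)

def pbil_onemax(n_bits=20, pop_size=50, n_select=10, lr=0.2, generations=60):
    """Univariate EDA (PBIL-style): 'mutation' is a draw from a per-bit
    distribution shaped by accumulated population-level information, i.e.
    nonrandom yet non-Lamarckian."""
    p = [0.5] * n_bits                               # initial bit distribution
    best = 0
    for _ in range(generations):
        pop = [[int(random.random() < pi) for pi in p] for _ in range(pop_size)]
        pop.sort(key=sum, reverse=True)              # OneMax fitness = sum of bits
        best = max(best, sum(pop[0]))
        elite = pop[:n_select]
        for i in range(n_bits):                      # shift distribution toward elite
            mean_i = sum(ind[i] for ind in elite) / n_select
            p[i] = (1 - lr) * p[i] + lr * mean_i
    return best

best_fitness = pbil_onemax()
print(f"best OneMax fitness found: {best_fitness}/20")
```

A multivariate EDA, as in the paper, would replace the independent per-bit probabilities with a model that also captures correlations between loci.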


[18] 2604.12930

Building and maintaining a System of Intracellular Compartments

Organelle patterning and its heritability remain central mysteries in cell biology, highlighting the fundamental tension between genetic inheritance and self-assembly. Here, we explore the nonequilibrium assembly and size control of the Golgi complex and endosomes, amid a continuous flux of membrane traffic, within a stochastic framework of mechanochemical fusion-fission cycles that violate detailed balance. Using a dynamical systems approach, we identify distinct, robust regimes, ranging from fixed points to limit cycles with definite phase relations. We identify these dynamical regimes with diverse phenotypes, from stable cisternae to periodic, cell-cycle-dependent dissolution/reassembly to cisternal progression. We analyse its dynamic response to systematic perturbations or driving protocols and make definite predictions that may be tested experimentally. Our analysis reveals that the two competing models of Golgi organization, vesicular transport and cisternal progression, are in fact two phases of the same underlying nonequilibrium process. Finally, our framework offers a strategy for controlling cisternal chemical identity and number by modulating the interplay between glycosylation enzymes and membrane fission-fusion dynamics.


[19] 2406.12581

Aperiodic clustered and periodic hexagonal vegetation spot arrays explained by inhomogeneous environments and climate trends in arid ecosystems

Due to climate change, overgrazing, and deforestation, arid ecosystems are vulnerable to desertification and land degradation. As aridity increases, vegetation cover loses spatial homogeneity and self-organizes into heterogeneous vegetation patterns, a step before a catastrophic shift to bare soil. Several studies suggest that environmental inhomogeneities in time or space are crucial to understand these phenomena. Using a unified mathematical model and incorporating environmental inhomogeneities in space, we show how two branches of vegetation patterns create a hysteresis loop as the mortality level changes. In an increasing mortality scenario, one observes an equilibrium branch of high vegetation biomass that forms self-organized hexagonal-like patterns. However, when the mortality trend is reversed, one observes a branch with low biomass and no periodicity, where vegetation spots form disordered clusters instead of a hexagonal lattice. This behavior is supported by remote sensing and field observations and can be linked to climate change in arid ecosystems.


[20] 2410.00532

smICA: Open-Source Software for Quantitative, Lifetime-Resolved Mapping of Absolute Fluorophore Concentrations in Living Cells

Advanced microscopy techniques are essential in biomedical research for visualising and tracking biomolecules within living cells and their compartments. Conventional fluorescence microscopy methods, however, often struggle to accurately measure the absolute concentrations of fluorescent probes in living cells. To overcome these limitations, we introduce an open-source analysis tool, smICA (Single-Molecule Image to Concentration Analyser). The smICA method offers quantitative mapping of absolute fluorophore concentrations, lifetime-resolved signal filtering, and intensity-based cell segmentation, and requires only a few photons per pixel. Our approach also reduces the time required to determine the mean concentration per cell, compared to standard FCS measurements performed at multiple spots. To highlight the robustness of the method, we validated it against standard fluorescence correlation spectroscopy (FCS) measurements by performing in vitro (aqueous solutions of polymers) and in vivo (polymers and EGFP in living cells) experiments. The presented methodology, along with the software, is a promising tool for quantitative single-cell studies, including, but not limited to, protein expression, degradation of biomolecules (such as proteins and mRNA), and monitoring of enzymatic reactions.


[21] 2505.10517

A Tutorial on Structural Identifiability of Epidemic Models Using StructuralIdentifiability.jl

Structural identifiability is the theoretical ability to uniquely recover model parameters from ideal, noise-free data and is a prerequisite for reliable parameter estimation in epidemic modeling. Despite its importance for calibration and inference, structural identifiability analysis remains underused and inconsistently applied in infectious disease modeling. This paper presents a user-oriented methodological tutorial demonstrating how global structural identifiability analysis can be systematically integrated into epidemic modeling workflows. We provide a reproducible framework for conducting structural identifiability analysis of ordinary differential equation models using the Julia package this http URL. The workflow is illustrated across commonly used epidemic models, including SEIR variants with asymptomatic and presymptomatic transmission, vector-borne disease models, and systems incorporating hospitalization and disease-induced mortality. We also introduce a visual communication strategy that embeds identifiability results directly into compartmental diagrams, facilitating interpretation and interdisciplinary communication. Our results show that identifiability depends critically on model structure, the choice of observed variables, and assumptions about initial conditions, and that identifiable parameter combinations may exist even when individual parameters are not globally identifiable. Emphasizing transparent implementation, interpretation, and communication, this work provides practical guidance and comparative insights across model classes. The tutorial is designed as both a reference and a teaching resource for researchers and educators seeking to incorporate structural identifiability analysis into epidemic model development. All code and annotated diagrams are publicly available to ensure reproducibility and reuse.


[22] 2508.19420

Using PyBioNetFit to Leverage Qualitative and Quantitative Data in Biological Model Parameterization and Uncertainty Quantification

Data generated in studies of cellular regulatory systems are often qualitative. For example, measurements of signaling readouts in the presence and absence of mutations may reveal a rank ordering of responses across conditions but not the precise extents of mutation-induced differences. Qualitative data are often ignored by mathematical modelers or are considered in an ad hoc manner, as in the study of Kocieniewski and Lipniacki (2013) [Phys Biol 10: 035006], which was focused on the roles of MEK isoforms in ERK activation. In this earlier study, model parameter values were tuned manually to obtain consistency with a combination of qualitative and quantitative data. This approach is not reproducible, nor does it provide insights into parametric or prediction uncertainties. Here, starting from the same data and the same ordinary differential equation (ODE) model structure, we generate formalized statements of qualitative observations, making these observations more reusable, and we improve the model parameterization procedure by applying a systematic and automated approach enabled by the software package PyBioNetFit. We also demonstrate uncertainty quantification (UQ), which was absent in the original study. Our results show that PyBioNetFit enables qualitative data to be leveraged, together with quantitative data, in parameterization of systems biology models and facilitates UQ. These capabilities are important for reliable estimation of model parameters and model analyses in studies of cellular regulatory systems and reproducibility.


[23] 2511.13790

GeoPl@ntNet: A Platform for Exploring Essential Biodiversity Variables

This paper describes GeoPl@ntNet, an interactive web application designed to make Essential Biodiversity Variables accessible and understandable to everyone through dynamic maps and fact sheets. Its core purpose is to allow users to explore high-resolution AI-generated maps of species distributions, habitat types, and biodiversity indicators across Europe. These maps, developed through a cascading pipeline involving convolutional neural networks and large language models, provide an intuitive yet information-rich interface to better understand biodiversity, with resolutions as precise as 50x50 meters. The website also enables exploration of specific regions, allowing users to select areas of interest on the map (e.g., urban green spaces, protected areas, or riverbanks) to view local species and their coverage. Additionally, GeoPl@ntNet generates comprehensive reports for selected regions, including insights into the number of protected species, invasive species, and endemic species.


[24] 2512.24427

Epigenetic feedback reshapes dynamical landscapes in gene regulatory networks

Understanding how gene regulatory networks (GRNs) give rise to stable and dynamic cellular states remains a central challenge in theoretical biology, particularly when slow epigenetic feedback reshapes the underlying regulatory landscape. While experimental approaches such as single-cell transcriptomics reveal rich dynamical behaviour, a tractable theoretical framework linking gene expression, epigenetic control, and collective dynamics has remained elusive. Here, we develop an extended Dynamical Mean Field Theory (DMFT) framework for GRNs that incorporates epigenetic modifications as slow, feedback-driven variables. Building on the analogy between Hopfield networks and spin glass systems, we derive effective stochastic equations that reduce high-dimensional dynamics to a tractable form across multiple timescales. This formulation enables quantitative characterization of both stable and oscillatory regimes and reveals how epigenetic feedback reshapes the effective potential landscape governing cell fate decisions. In biological terms, the model shows how epigenetic feedback dynamically reshapes the Waddington landscape. Our results and methodology provide a unified theoretical framework for understanding developmental dynamics and epigenetic reprogramming in complex biological systems.
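The landscape-reshaping picture can be illustrated with a deliberately simple toy, unrelated to the paper's actual DMFT equations: a symmetric double-well potential over a single gene-expression coordinate, where slow epigenetic feedback enters as a tilt parameter. Shifting the tilt deepens one well, biasing which "cell fate" the global minimum selects.

```python
import numpy as np

def effective_potential(x, h):
    """Double-well potential U(x) = x^4/4 - x^2/2 - h*x, a toy stand-in for an
    effective landscape over a gene-expression coordinate x. The slow
    epigenetic feedback is caricatured as the tilt h (illustrative only)."""
    return x**4 / 4.0 - x**2 / 2.0 - h * x

x = np.linspace(-2.0, 2.0, 4001)

# With no epigenetic tilt the two wells (cell fates) are symmetric at x = ±1;
# a feedback-driven tilt h > 0 deepens the positive well, biasing the decision
minimum_untilted = x[np.argmin(effective_potential(x, h=0.0))]
minimum_tilted = x[np.argmin(effective_potential(x, h=0.3))]
```

Because the epigenetic variable evolves slowly, the fast expression dynamics effectively relax within a landscape that is itself drifting, which is the qualitative mechanism behind the dynamically reshaped Waddington picture described above.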


[25] 2602.14863

Quasilocalization under coupled mutation-selection dynamics

When mutations are rampant, quasispecies theory (Eigen's model) predicts that the fittest type in a population may not dominate. Beyond a critical mutation rate, the population may even delocalize completely from the peak of the fitness landscape, and the fittest type is, ironically, lost. Extensive efforts have been made to understand this exceptional scenario, but in general there is no simple prescription that predicts the eventual degree of localization for arbitrary fitness landscapes and mutation rates. Here, we derive a simple and general relation linking the quasispecies' Hill numbers, which are diversity metrics from ecology, to the ratio of an effective fitness variance to the squared mean mutation rate. This ratio, which we call the localization factor, emerges from mean approximations of decomposed surprisal, or stochastic entropy change, rates. On the application side, the relation derived here defines a combination of Hill numbers that may complement other complexity or diversity measures for real viral quasispecies, with the advantage of carrying an underlying biological interpretation under Eigen's model.
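Hill numbers themselves are straightforward to compute from type frequencies; the sketch below shows the standard definition D_q = (Σ p_i^q)^(1/(1-q)), with the q → 1 limit given by the exponential of the Shannon entropy. (The paper's specific relation between Hill numbers and the localization factor is not reproduced here.) A localized quasispecies concentrated on one type has effective diversity near 1; a fully delocalized one over N equally frequent types has diversity N.

```python
import numpy as np

def hill_number(p, q):
    """Hill number (effective number of types) of order q for frequencies p:
    D_q = (sum_i p_i^q)^(1/(1-q)); the q -> 1 limit is exp(Shannon entropy)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # zero-frequency types contribute nothing
    if np.isclose(q, 1.0):
        return float(np.exp(-np.sum(p * np.log(p))))
    return float(np.sum(p ** q) ** (1.0 / (1.0 - q)))

# A quasispecies localized on one dominant type vs. a fully delocalized one
localized = hill_number([0.97, 0.01, 0.01, 0.01], q=1.0)
delocalized = hill_number([0.25, 0.25, 0.25, 0.25], q=1.0)
```

Varying q interpolates between richness (q = 0), Shannon-type diversity (q = 1), and Simpson-type diversity (q = 2), which is why combinations of Hill numbers can serve as graded localization diagnostics.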


[26] 2603.24626

A Large-Scale Comparative Analysis of Imputation Methods for Single-Cell RNA Sequencing Data

Background: Single-cell RNA sequencing (scRNA-seq) enables gene expression profiling at cellular resolution but is inherently affected by sparsity caused by dropout events, where expressed genes are recorded as zeros due to technical limitations. These artifacts distort gene expression distributions and compromise downstream analyses. Numerous imputation methods have been proposed to recover latent transcriptional signals. These methods range from traditional statistical models to deep learning (DL)-based methods. However, their comparative performance remains unclear, as existing benchmarks evaluate only a limited subset of methods, datasets, and downstream analyses. Results: We present a comprehensive benchmark of 15 scRNA-seq imputation methods spanning 7 methodological categories, including traditional and DL-based methods. Methods are evaluated across 30 datasets from 10 experimental protocols on 6 downstream analyses. Results show that traditional methods, such as model-based, smoothing-based, and low-rank matrix-based methods, generally outperform DL-based methods, including diffusion-based, GAN-based, GNN-based, and autoencoder-based methods. In addition, strong performance in numerical gene expression recovery does not necessarily translate into improved biological interpretability in downstream analyses, including cell clustering, differential expression analysis, marker gene analysis, trajectory analysis, and cell type annotation. Furthermore, method performance varies substantially across datasets, protocols, and downstream analyses, with no single method consistently outperforming others. Conclusions: Our findings provide practical guidance for selecting imputation methods tailored to specific analytical objectives and underscore the importance of task-specific evaluation when assessing imputation performance in scRNA-seq data analysis.
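To make the "smoothing-based" category concrete, here is a minimal k-nearest-neighbour imputation sketch: each dropout zero in a cell is replaced by the mean expression of that gene over the cell's nearest neighbours. This is only an illustration of the smoothing idea; the benchmarked methods are far more sophisticated (model-based denoising, low-rank completion, deep generative models, and so on).

```python
import numpy as np

def knn_smooth_impute(counts, k=2):
    """Replace each zero entry with the mean of that gene over the k
    nearest-neighbour cells (Euclidean distance on the raw matrix).

    counts: cells x genes array in which zeros are treated as dropouts.
    """
    counts = np.asarray(counts, dtype=float)
    imputed = counts.copy()
    # Pairwise cell-cell distances
    dists = np.linalg.norm(counts[:, None, :] - counts[None, :, :], axis=2)
    for i in range(counts.shape[0]):
        neighbours = np.argsort(dists[i])[1 : k + 1]  # skip the cell itself
        zeros = counts[i] == 0
        imputed[i, zeros] = counts[neighbours][:, zeros].mean(axis=0)
    return imputed

# Toy 3-cell x 3-gene matrix with two dropout zeros
toy = np.array([[5.0, 0.0, 1.0],
                [4.0, 3.0, 1.0],
                [6.0, 2.0, 0.0]])
filled = knn_smooth_impute(toy, k=2)
```

Even this toy exposes the benchmark's central caveat: smoothing borrows signal across cells, so it can recover plausible values while also blurring genuine biological zeros, which is one reason numerical recovery does not automatically improve downstream analyses.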


[27] 2406.01253

animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics

Bioacoustic research, vital for understanding animal behavior, conservation, and ecology, faces a monumental challenge: analyzing vast datasets where animal vocalizations are rare. While deep learning techniques are becoming standard, adapting them to bioacoustics remains difficult. We address this with animal2vec, an interpretable large transformer model, and a self-supervised training scheme tailored for sparse and unbalanced bioacoustic data. It learns from unlabeled audio and then refines its understanding with labeled data. Furthermore, we introduce and publicly release MeerKAT: Meerkat Kalahari Audio Transcripts, a dataset of meerkat (Suricata suricatta) vocalizations with millisecond-resolution annotations, the largest labeled dataset on non-human terrestrial mammals currently available. Our model outperforms existing methods on MeerKAT and the publicly available NIPS4Bplus birdsong dataset. Moreover, animal2vec performs well even with limited labeled data (few-shot learning). animal2vec and MeerKAT provide a new reference point for bioacoustic research, enabling scientists to analyze large amounts of data even with scarce ground truth information.


[28] 2412.07238

Speaker effects in language comprehension: An integrative model of language and speaker processing

The identity of a speaker influences language comprehension through modulating perception and expectation. This review explores speaker effects and proposes an integrative model of language and speaker processing that integrates distinct mechanistic perspectives. We argue that speaker effects arise from the interplay between bottom-up perception-based processes, driven by acoustic-episodic memory, and top-down expectation-based processes, driven by a speaker model. We show that language and speaker processing are functionally integrated through multi-level probabilistic processing: prior beliefs about a speaker modulate language processing at the phonetic, lexical, and semantic levels, while the unfolding speech and message continuously update the speaker model, refining broad demographic priors into precise individualized representations. Within this framework, we distinguish between speaker-idiosyncrasy effects arising from familiarity with an individual and speaker-demographics effects arising from social group expectations. We discuss how speaker effects serve as indices for assessing language development and social cognition, and we encourage future research to extend these findings to the emerging domain of artificial intelligence (AI) speakers, as AI agents represent a new class of social interlocutors that are transforming the way we engage in communication.


[29] 2503.14333

Characterizing higher-order representations through generative diffusion models explains human decoded neurofeedback performance

Brains construct not only "first-order" representations of the environment but also "higher-order" representations about those representations -- including higher-order uncertainty estimates that guide learning and adaptive behavior. Higher-order expectations about representational uncertainty, learned through experience, may play a key role in guiding behavior and learning, but characterizing them remains empirically and theoretically challenging. Here, we introduce the Noise Estimation through Reinforcement-based Diffusion (NERD) model, a novel computational framework that trains denoising diffusion models via reinforcement learning to infer distributions of noise in functional MRI data from a decoded neurofeedback task, where healthy human participants learn to achieve target neural states. We hypothesize that participants accomplish this task by learning about and then minimizing their own representational uncertainty. We test this hypothesis with NERD, which mirrors brain-like unsupervised learning. Our results show that NERD outperforms backpropagation-trained control models in capturing human performance, with explanatory power enhanced by clustering learned noise distributions. Importantly, our results also reveal individual differences in expected-uncertainty representations that predict task success, demonstrating NERD's utility as a powerful tool for probing higher-order neural representations.


[30] 2505.15653

Quantifying structural uncertainty in chemical reaction network inference

Dynamical systems in biology are complex, and one often does not have comprehensive knowledge about the interactions involved. Chemical reaction network (CRN) inference aims to identify, from observing species concentrations over time, the unknown reactions between the species. Existing approaches such as sparse regularisation largely focus on identifying a single, most likely CRN, without addressing uncertainty about the network structure. However, it is important to quantify structural uncertainty to have confidence in our inference and predictions. In this work, we explore how effective sparse regularisation methods are for quantifying structural uncertainty. Locally optimal solutions to sparse regularisation are mapped to CRN structures; however, it is unclear whether this approach encompasses all plausible CRNs. We find that inducing sparsity with nonconvex penalty functions results in better coverage of the plausible CRNs compared to the popular lasso regularisation. To validate our approach, we apply our methods to real-world data examples, and are able to simultaneously recover reactions proposed across multiple literature sources for a reaction system. Our emphasis on network-level probabilities enables a novel, hierarchical representation of structural ambiguities in the space of CRNs. This representation translates into alternative reaction pathways suggested by the available data, thus guiding the efforts of future experimental design.
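The contrast between lasso and nonconvex penalties mentioned above comes down to how each shrinks candidate reaction-rate coefficients. A minimal sketch of the two proximal (thresholding) operators, using invented toy rate values: the L1 operator shrinks every coefficient, biasing large rates downward, whereas a hard-thresholding (L0-style) operator leaves large rates untouched, closer in spirit to the nonconvex penalties the paper finds give better coverage of plausible CRNs.

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of the lasso (L1) penalty: shrinks every coefficient
    toward zero, so even clearly present reactions get biased estimates."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def hard_threshold(w, lam):
    """Proximal operator of an L0-style penalty: zeroes small coefficients but
    leaves large ones unchanged (no shrinkage bias on retained reactions)."""
    return np.where(np.abs(w) > lam, w, 0.0)

rates = np.array([0.02, 0.0, 1.5])  # toy candidate reaction-rate estimates
sparse_lasso = soft_threshold(rates, lam=0.1)
sparse_hard = hard_threshold(rates, lam=0.1)
```

Because locally optimal solutions of the penalised problem are mapped to CRN structures, the shape of the threshold directly affects which alternative networks appear among the local optima, and hence the coverage of structural uncertainty.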


[31] 2508.12260

Mantis: A Foundation Model for Mechanistic Disease Forecasting

Infectious disease forecasting in novel outbreaks or low-resource settings is hampered by the need for large disease and covariate data sets, bespoke training, and expert tuning, all of which can hinder rapid generation of forecasts for new settings. To help address these challenges, we developed Mantis, a foundation model trained entirely on mechanistic simulations, which enables out-of-the-box forecasting across diseases, regions, and outcomes, even in settings with limited historical data. We evaluated Mantis against 78 forecasting models across sixteen diseases with diverse modes of transmission, assessing both point forecast accuracy (mean absolute error) and probabilistic performance (weighted interval score and coverage). Despite using no real-world data during training, Mantis achieved lower mean absolute error than all models in the CDC's COVID-19 Forecast Hub when backtested on early pandemic forecasts which it had not previously seen. Across all other diseases tested, Mantis consistently ranked in the top two models across evaluation metrics. Mantis further generalized to diseases with transmission mechanisms not represented in its training data, demonstrating that it can capture fundamental contagion dynamics rather than memorizing disease-specific patterns. These capabilities illustrate that purely simulation-based foundation models such as Mantis can provide a practical foundation for disease forecasting: general-purpose, accurate, and deployable where traditional models struggle.
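The "trained entirely on mechanistic simulations" idea can be sketched schematically: generate a library of incidence curves from a compartmental model over sampled transmission parameters, and use that purely synthetic library as training data for a forecaster. The discrete-time SIR generator below is a stand-in for Mantis's (much richer) simulation corpus, with parameter ranges chosen arbitrarily for illustration.

```python
import numpy as np

def simulate_sir(beta, gamma, i0=1e-3, steps=120, dt=0.5):
    """Discrete-time (Euler) SIR incidence curve for one parameter draw.
    beta: transmission rate, gamma: recovery rate, i0: initial infected
    fraction. Returns new infections per step."""
    s, i = 1.0 - i0, i0
    incidence = []
    for _ in range(steps):
        new_inf = beta * s * i * dt
        s -= new_inf
        i += new_inf - gamma * i * dt
        incidence.append(new_inf)
    return np.array(incidence)

rng = np.random.default_rng(0)
# A purely synthetic training set: no real-world surveillance data involved
training_set = [simulate_sir(beta=rng.uniform(0.2, 0.6),
                             gamma=rng.uniform(0.05, 0.2))
                for _ in range(100)]
```

Training on a sufficiently diverse family of such mechanistic trajectories is what allows a single model to generalize across diseases and regions without disease-specific retraining, since the forecaster learns contagion dynamics rather than one outbreak's history.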


[32] 2602.05971

Characterizing Human Semantic Navigation in Concept Production as Trajectories in Embedding Space

Semantic representations can be framed as a structured, dynamic knowledge space through which humans navigate to retrieve and manipulate meaning. To investigate how humans traverse this geometry, we introduce a framework that represents concept production as navigation through embedding space. Using different transformer text embedding models, we construct participant-specific semantic trajectories based on cumulative embeddings and extract geometric and dynamical metrics, including distance to next, distance to centroid, entropy, velocity, and acceleration. These measures capture both scalar and directional aspects of semantic navigation, providing a computationally grounded view of semantic representation search as movement in a geometric space. We evaluate the framework on four datasets across different languages, spanning different property generation tasks: Neurodegenerative, Swear verbal fluency, Property listing task in Italian, and in German. Across these contexts, our approach distinguishes between clinical groups and concept types, offering a mathematical framework that requires minimal human intervention compared to typical labor-intensive linguistic pre-processing methods. Comparison with a non-cumulative approach reveals that cumulative embeddings work best for longer trajectories, whereas shorter ones may provide too little context, favoring the non-cumulative alternative. Critically, different embedding models yielded similar results, highlighting similarities between different learned representations despite different training pipelines. By framing semantic navigation as a structured trajectory through embedding space, bridging cognitive modeling with learned representation, thereby establishing a pipeline for quantifying semantic representation dynamics with applications in clinical research, cross-linguistic analysis, and the assessment of artificial cognition.