New articles on Quantitative Biology


[1] 2503.16560

Early Prediction of Alzheimer's and Related Dementias: A Machine Learning Approach Utilizing Social Determinants of Health Data

Alzheimer's disease and related dementias (AD/ADRD) represent a growing healthcare crisis affecting over 6 million Americans. While genetic factors play a crucial role, emerging research reveals that social determinants of health (SDOH) significantly influence both the risk and progression of cognitive functioning, such as cognitive scores and cognitive decline. This report examines how these social, environmental, and structural factors impact cognitive health trajectories, with a particular focus on Hispanic populations, who face disproportionate risk for AD/ADRD. Using data from the Mexican Health and Aging Study (MHAS) and its cognitive assessment substudy (Mex-Cog), we employed ensembles of regression trees to predict 4-year and 9-year cognitive scores and cognitive decline based on SDOH. This approach identified key predictive SDOH factors to inform potential multilevel interventions to address cognitive health disparities in this population.


[2] 2503.16615

Dynamics and control of maize infection by Busseola fusca: multi-seasonal modeling and biocontrol strategies

Maize production in sub-Saharan Africa faces significant challenges due to the maize stalk borer (Busseola fusca), a major pest that causes substantial yield losses. Chemical control methods have raised concerns about environmental impact and pest resistance, making biological control a promising alternative. In this study, we develop a multi-seasonal mathematical model using an impulsive system of differential equations to describe stalk borer population dynamics and evaluate pest control strategies. We analyze the stability of the pest-free solution using Floquet theory and study the effects of periodic predator releases on pest suppression. Numerical simulations illustrate the impact of cultural practice and predator release frequency. Moreover, our simulations show that, under good cultural practices, releasing predators once or three times a year is an effective biocontrol strategy. However, in cases of poor cultural practices, biocontrol has only a limited effect, and the best outcome is achieved when predators are released once a year at the beginning of the cropping season.
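
As a rough illustration of how an impulsive system can represent periodic predator releases, the following Python sketch integrates a toy pest-predator model with forward Euler and adds a fixed number of predators at regular intervals. The equations, the parameter values, and the function name simulate_impulsive_release are illustrative assumptions, not the model analyzed in the paper.

```python
def simulate_impulsive_release(days=365.0, dt=0.01,
                               release_period=120.0, release_amount=5.0):
    """Toy pest-predator dynamics with impulsive predator releases.

    All equations and parameter values are hypothetical and chosen
    only to illustrate the impulsive-release mechanism.
    """
    r, K = 0.1, 100.0          # pest growth rate (1/day) and carrying capacity
    a, e, m = 0.02, 0.1, 0.05  # predation rate, conversion efficiency, predator mortality
    pest, predator = 10.0, 0.0
    steps = int(round(days / dt))
    steps_per_release = int(round(release_period / dt))
    history = []
    for k in range(steps):
        # Impulse: predators are added instantaneously at t = 0, T, 2T, ...
        if k % steps_per_release == 0:
            predator += release_amount
        # Continuous dynamics between impulses (forward Euler step).
        d_pest = r * pest * (1.0 - pest / K) - a * pest * predator
        d_pred = e * a * pest * predator - m * predator
        pest += dt * d_pest
        predator += dt * d_pred
        history.append((k * dt, pest, predator))
    return history


if __name__ == "__main__":
    with_release = simulate_impulsive_release(release_amount=5.0)
    no_release = simulate_impulsive_release(release_amount=0.0)
    print("final pest density with releases:   ", with_release[-1][1])
    print("final pest density without releases:", no_release[-1][1])
```

With the default release_period of 120 days the sketch performs three releases per simulated year; setting release_period to 365 gives a single release at the start of the season.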


[3] 2503.16676

Simulating foot-and-mouth dynamics and control in Bolivia

Examining the dissemination dynamics of foot-and-mouth disease (FMD) is critical for revising national response plans. We developed a stochastic SEIR metapopulation model to simulate FMD outbreaks in Bolivia and explore how the national response plan impacts the dissemination among all susceptible species. We explored variations in the control strategies, mapped high-risk areas, and estimated the number of vaccinated animals during the reactive ring vaccination. Initial outbreaks ranged from 1 to 357 infected farms, with control measures implemented for up to 100 days, including control zones, a 30-day movement ban, depopulation, and ring vaccination. Combining vaccination (50-90 farms/day) and depopulation (1-2 farms/day) controlled 60.3% of outbreaks, while similar vaccination but higher depopulation rates (3-5 farms/day) controlled 62.9% and eliminated outbreaks nine days faster. Utilizing depopulation alone controlled 56.76% of outbreaks, but had a significantly longer median duration of 63 days. Combining vaccination (25-45 farms/day) and depopulation (6-7 farms/day) was the most effective, eliminating all outbreaks within a median of three days (maximum 79 days). Vaccination alone controlled only 0.6% of outbreaks and had a median duration of 98 days. Ultimately, results showed that the most effective strategy involved ring vaccination combined with depopulation, requiring a median of 925,338 animals to be vaccinated. Outbreaks were most frequent in high-density farming areas such as Potosi, Cochabamba, and La Paz. Our results suggest that emergency ring vaccination alone cannot eliminate FMD if reintroduced in Bolivia, and combining depopulation with vaccination significantly shortens outbreak duration. These findings provide valuable insights to inform Bolivia's national FMD response plan, including vaccine requirements and the role of depopulation in controlling outbreaks.
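
To make the term "stochastic SEIR" concrete, here is a minimal Python sketch of a discrete-time chain-binomial SEIR process over farms. It is a single well-mixed population with no control measures, not the metapopulation model from the paper, and the function names and the rates beta, sigma, and gamma are assumed values for illustration only.

```python
import math
import random


def binomial(rng, n, p):
    """Draw from Binomial(n, p) as a sum of Bernoulli trials (fine for small n)."""
    return sum(1 for _ in range(n) if rng.random() < p)


def simulate_seir(n_farms=200, initial_infected=1, beta=0.4, sigma=0.2,
                  gamma=0.1, days=100, seed=0):
    """Chain-binomial SEIR with farms as units.

    beta, sigma, gamma are hypothetical daily transmission, latency-exit,
    and removal rates; they are not estimates from the paper.
    """
    rng = random.Random(seed)
    s, e, i, r = n_farms - initial_infected, 0, initial_infected, 0
    trajectory = [(s, e, i, r)]
    for _ in range(days):
        # Convert rates to per-day transition probabilities.
        p_infect = 1.0 - math.exp(-beta * i / n_farms)
        p_progress = 1.0 - math.exp(-sigma)
        p_remove = 1.0 - math.exp(-gamma)
        new_exposed = binomial(rng, s, p_infect)
        new_infectious = binomial(rng, e, p_progress)
        new_removed = binomial(rng, i, p_remove)
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_removed
        r += new_removed
        trajectory.append((s, e, i, r))
    return trajectory


if __name__ == "__main__":
    final_s, final_e, final_i, final_r = simulate_seir(seed=1)[-1]
    print("farms ever infected after 100 days:", final_e + final_i + final_r)
```

Running the function with many different seeds gives a distribution of outbreak sizes, which is the kind of output a stochastic model provides and a deterministic one does not.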


[4] 2503.16962

ATP requirements for growth reveal the bioenergetic impact of mitochondrial symbiosis

Studies by microbiologists from the 1970s provided robust estimates for the energy supply and demand of a prokaryotic cell. The amount of ATP needed to support growth was calculated from the chemical composition of the cell and known enzymatic pathways that synthesize its constituents from known substrates in culture. Starting in 2015, geneticists and evolutionary biologists began investigating the bioenergetic role of mitochondria at eukaryote origin and energy in metazoan evolution using their own, widely trusted but hitherto unvetted model for the costs of growth in terms of ATP per cell. The more recent model contains, however, a severe and previously unrecognized error that systematically overestimates the ATP cost of amino acid synthesis by up to 200-fold. The error applies to all organisms studied by such models and leads to conspicuously false inferences, for example that the synthesis of an average amino acid in humans requires 30 ATP, which no biochemistry textbook will confirm. Their ATP cost calculations would require that Escherichia coli obtains roughly 100 ATP per glucose and that mammals obtain roughly 240 ATP per glucose, propositions that invalidate evolutionary inferences so based. By contrast, established methods for estimating the ATP cost of microbial growth show that the first mitochondrial endosymbionts could have easily doubled the host's available ATP pool, provided that genes for growth on environmental amino acids were transferred from the mitochondrial symbiont to the archaeal host and that the host for mitochondrial origin was an autotroph using the acetyl-CoA pathway.


[5] 2503.16996

An Energy-Adaptive Elastic Equivariant Transformer Framework for Protein Structure Representation

Structure-informed protein representation learning is essential for effective protein function annotation and de novo design. However, the presence of inherent noise in both crystal and AlphaFold-predicted structures poses significant challenges for existing methods in learning robust protein representations. To address these issues, we propose a novel equivariant Transformer-State Space Model (SSM) hybrid framework, termed $E^3$former, designed for efficient protein representation. Our approach uses energy function-based receptive fields to construct proximity graphs and incorporates an equivariant high-tensor-elastic selective SSM within the transformer architecture. These components enable the model to adapt to complex atom interactions and extract geometric features with higher signal-to-noise ratios. Empirical results demonstrate that our model outperforms existing methods in structure-intensive tasks, such as inverse folding and binding site prediction, particularly when using predicted structures, owing to its enhanced tolerance to data deviation and noise. Our approach offers a novel perspective for conducting biological function research and drug discovery using noisy protein structure data.


[6] 2503.17007

RiboFlow: Conditional De Novo RNA Sequence-Structure Co-Design via Synergistic Flow Matching

Ribonucleic acid (RNA) binds to molecules to achieve specific biological functions. While generative models are advancing biomolecule design, existing methods for designing RNA that target specific ligands face limitations in capturing RNA's conformational flexibility, ensuring structural validity, and overcoming data scarcity. To address these challenges, we introduce RiboFlow, a synergistic flow matching model to co-design RNA structures and sequences based on target molecules. By integrating RNA backbone frames, torsion angles, and sequence features in a unified architecture, RiboFlow explicitly models RNA's dynamic conformations while enforcing sequence-structure consistency to improve validity. Additionally, we curate RiboBind, a large-scale dataset of RNA-molecule interactions, to resolve the scarcity of high-quality structural data. Extensive experiments reveal that RiboFlow not only outperforms state-of-the-art RNA design methods by a large margin but also showcases controllable capabilities for achieving high binding affinity to target ligands. Our work bridges critical gaps in controllable RNA design, offering a framework for structure-aware, data-efficient generation.


[7] 2503.17047

Ex vivo experiment on vertebral body with defect representing bone metastasis

Osteolytic metastases located in the vertebrae reduce strength and increase the risk of vertebral fractures. This risk can be predicted by means of validated finite element models, but their reproducibility needs to be assessed. For that purpose, experimental data are required. The aim of this study was to conduct open-access experiments on vertebrae with an artificial defect representing lytic metastasis, using well-defined boundary conditions. Twelve lumbar vertebral bodies (L1) were prepared by removing the cortical endplates and creating defects that represent lytic metastases by drilling the cancellous bone. Vertebral bodies were scanned using clinical high-resolution peripheral quantitative computed tomography before and after defect creation for 3D reconstruction. The specimens were then tested under compression loading until failure. Surface digital image correlation was used to assess strain fields on the anterior wall of the vertebral body. These data (biomechanics data and the tomographic images needed to build subject-specific models) are shared with the scientific community in order to assess different vertebral models on the same dataset.


[8] 2503.17135

Structural and Practical Identifiability of Phenomenological Growth Models for Epidemic Forecasting

Phenomenological models are highly effective tools for forecasting disease dynamics using real-world data, particularly in scenarios where detailed knowledge of disease mechanisms is limited. However, their reliability depends on the model parameters' structural and practical identifiability. In this study, we systematically analyze the identifiability of six commonly used growth models in epidemiology: the generalized growth model (GGM), the generalized logistic model (GLM), the Richards model, the generalized Richards model (GRM), the Gompertz model, and a modified SEIR model with inhomogeneous mixing. To address challenges posed by non-integer power exponents in these models, we reformulate them by introducing additional state variables. This enables rigorous structural identifiability analysis using the StructuralIdentifiability.jl package in Julia. We validate the structural identifiability results by performing parameter estimation and forecasting using the GrowthPredict MATLAB toolbox. This toolbox is designed to fit and forecast time series trajectories based on phenomenological growth models. We applied it to three epidemiological datasets: weekly incidence data for monkeypox, COVID-19, and Ebola. Additionally, we assess practical identifiability through Monte Carlo simulations to evaluate parameter estimation robustness under varying levels of observational noise. Our results demonstrate the structural and practical identifiability of the models, emphasizing how noise affects parameter estimation accuracy. These findings provide valuable insights into the utility and limitations of phenomenological models for epidemic data analysis, highlighting their adaptability to real-world challenges and their role in guiding public health decision-making.
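
The idea of removing a non-integer exponent by adding a state variable can be sketched for the generalized growth model, dC/dt = r C^p. One possible reformulation (an assumption here, not necessarily the one used in the paper) sets x = C^p, which gives the rational system dC/dt = r x, dx/dt = p r x^2 / C. The Python sketch below integrates both forms with a hand-written RK4 routine and compares them with the closed-form solution for 0 <= p < 1; all function names and parameter values are illustrative.

```python
def rk4(f, y, t_end, dt):
    """Integrate y' = f(y) from t = 0 to t_end with classical RK4 (y is a list)."""
    for _ in range(int(round(t_end / dt))):
        k1 = f(y)
        k2 = f([yi + 0.5 * dt * ki for yi, ki in zip(y, k1)])
        k3 = f([yi + 0.5 * dt * ki for yi, ki in zip(y, k2)])
        k4 = f([yi + dt * ki for yi, ki in zip(y, k3)])
        y = [yi + dt * (a + 2 * b + 2 * c + d) / 6.0
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
    return y


def ggm_original(c0, r, p, t_end, dt=0.01):
    """GGM in its usual form: dC/dt = r * C**p."""
    return rk4(lambda y: [r * y[0] ** p], [c0], t_end, dt)[0]


def ggm_augmented(c0, r, p, t_end, dt=0.01):
    """Assumed reformulation with extra state x = C**p (rational right-hand side)."""
    def rhs(y):
        c, x = y
        return [r * x, p * r * x * x / c]
    return rk4(rhs, [c0, c0 ** p], t_end, dt)[0]


def ggm_closed_form(c0, r, p, t):
    """Exact solution of dC/dt = r * C**p for 0 <= p < 1."""
    return (c0 ** (1.0 - p) + (1.0 - p) * r * t) ** (1.0 / (1.0 - p))


if __name__ == "__main__":
    c0, r, p, t_end = 5.0, 0.5, 0.7, 10.0
    print("original form: ", ggm_original(c0, r, p, t_end))
    print("augmented form:", ggm_augmented(c0, r, p, t_end))
    print("closed form:   ", ggm_closed_form(c0, r, p, t_end))
```

Both numerical solutions should agree with the closed form to within integration error, which shows that the augmented system describes the same trajectory while containing only integer powers.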


[9] 2503.17293

Leveraging statistical models to improve pre-season forecasting and in-season management of a recreational fishery

Effective management of recreational fisheries requires accurate forecasting of future harvests and real-time monitoring of ongoing harvests. Traditional methods that rely on historical catch data to predict short-term harvests can be unreliable, particularly if changes in management regulations alter angler behavior. In contrast, statistical modeling approaches can provide faster, more flexible, and potentially more accurate predictions, enhancing management outcomes. In this study, we developed and tested models to improve predictions of Gulf of Mexico gag harvests for both pre-season planning and in-season monitoring. Our best-fitting model outperformed traditional methods (i.e., estimates derived from historical average harvest) for both cumulative pre-season projections and in-season monitoring. Notably, our modeling framework appeared to be more accurate in more recent, shorter seasons due to its ability to account for effort compression. A key advantage of our framework is its ability to explicitly quantify the probability of exceeding harvest quotas for any given season duration. This feature enables managers to evaluate trade-offs between season duration and conservation goals. This is especially critical for vulnerable, highly targeted stocks. Our findings also underscore the value of statistical models to complement and advance traditional fisheries management approaches.
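
The probability of exceeding a quota for a given season length can be estimated by simulation once a model of daily harvest is available. The Python sketch below is a minimal Monte Carlo version of that calculation. It assumes independent gamma-distributed daily harvests, which is a placeholder and not the statistical model from the paper; the function name and all numbers are hypothetical.

```python
import random


def exceedance_probability(quota, durations, mean_daily_harvest=100.0,
                           shape=2.0, n_sims=2000, seed=0):
    """Estimate P(cumulative harvest > quota) for each season duration (days).

    Daily harvests are drawn i.i.d. from a gamma distribution; this is an
    illustrative assumption, not a fitted harvest model.
    """
    rng = random.Random(seed)
    max_days = max(durations)
    scale = mean_daily_harvest / shape
    exceed_counts = {d: 0 for d in durations}
    for _sim in range(n_sims):
        # Simulate one season of maximal length and record the running total.
        cumulative = 0.0
        totals = []
        for _day in range(max_days):
            cumulative += rng.gammavariate(shape, scale)
            totals.append(cumulative)
        # The same simulated season is evaluated at every candidate duration.
        for d in durations:
            if totals[d - 1] > quota:
                exceed_counts[d] += 1
    return {d: exceed_counts[d] / n_sims for d in durations}


if __name__ == "__main__":
    for days, prob in exceedance_probability(3000.0, [10, 30, 60]).items():
        print(f"{days:3d}-day season: P(exceed quota) ~ {prob:.3f}")
```

Because every duration is evaluated on the same simulated seasons, the estimated probability never decreases as the season gets longer, which makes the trade-off between season length and quota risk easy to read off.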


[10] 2503.16437

Haunted House: A text-based game for comparing the flexibility of mental models in humans and LLMs

This study introduces "Haunted House," a novel text-based game designed to compare the performance of humans and large language models (LLMs) in model-based reasoning. Players must escape from a house containing nine rooms in a 3x3 grid layout while avoiding the ghost. They are guided by verbal clues that they get each time they move. In Study 1, the results from 98 human participants revealed a success rate of 31.6%, significantly outperforming seven state-of-the-art LLMs tested. Out of 140 attempts across seven LLMs, only one attempt resulted in a pass by Claude 3 Opus. Preliminary results suggested that GPT o3-mini-high performance might be higher, but not at the human level. Further analysis of 29 human participants' moves in Study 2 indicated that LLMs frequently struggled with random and illogical moves, while humans exhibited such errors less frequently. Our findings suggest that current LLMs encounter difficulties in tasks that demand active model-based reasoning, offering inspiration for future benchmarks.


[11] 2503.16571

Change of some cropping systems in a long-term trial comparing different systems: rationale and implications for statistical analysis

The project Agriculture 4.0 without chemical synthetical plant protection (NOcsPS) tests a number of cropping systems that avoid the use of chemical synthetical pesticides while at the same time using mineral fertilizers. The experiment started in 2020 (sowing fall 2019). In 2024 (sowing fall 2023), some of the cropping systems were modified. Analysis of this experiment may be done using linear mixed models. In order to include the data from 2020-2023 in joint analyses with the data collected for the modified systems from 2024 onwards, the mixed modelling approach needs to be reconsidered. In this paper, we develop models for this purpose. A key feature is the use of network meta-analytic concepts that allow a combination of direct and indirect comparisons among systems from the different years. The approach is first illustrated using a toy example. This is followed by detailed analyses of data from the two trial sites Dahnsdorf and Hohenheim.


[12] 2503.16582

Machine Learning-Based Genomic Linguistic Analysis (Gene Sequence Feature Learning): A Case Study on Predicting Heavy Metal Response Genes in Rice

This study explores the application of machine learning-based genetic linguistics for identifying heavy metal response genes in rice (Oryza sativa). By integrating convolutional neural networks and random forest algorithms, we developed a hybrid model capable of extracting and learning meaningful features from gene sequences, such as k-mer frequencies and physicochemical properties. The model was trained and tested on datasets of genes, achieving high predictive performance (precision: 0.89, F1-score: 0.82). RNA-seq and qRT-PCR experiments conducted on rice leaves exposed to Hg0 revealed differential expression of genes associated with heavy metal responses, which validated the model's predictions. Co-expression network analysis identified 103 related genes, and a literature review indicated that these genes are highly likely to be involved in heavy metal-related biological processes. By integrating and comparing the analysis results with those of differentially expressed genes (DEGs), the validity of the new machine learning method was further demonstrated. This study highlights the efficacy of combining machine learning with genetic linguistics for large-scale gene prediction. It demonstrates a cost-effective and efficient approach for uncovering molecular mechanisms underlying heavy metal responses, with potential applications in developing stress-tolerant crop varieties.
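
For readers unfamiliar with k-mer frequencies as sequence features, the short Python sketch below counts overlapping substrings of length k and normalizes the counts. It shows only the generic feature, with a hypothetical function name; how the paper computes or encodes its features is not specified in the abstract.

```python
from collections import Counter


def kmer_frequencies(sequence, k):
    """Return the relative frequency of each overlapping k-mer in a sequence."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    if total == 0:
        # Sequence shorter than k: there are no k-mers to count.
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


if __name__ == "__main__":
    # "ACGTAC" contains the 2-mers AC, CG, GT, TA, AC.
    for kmer, freq in sorted(kmer_frequencies("ACGTAC", 2).items()):
        print(kmer, freq)
```

A fixed-length feature vector for a classifier can then be built by listing the frequencies in a fixed order over all 4^k possible k-mers, with zeros for k-mers that do not occur.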


[13] 2503.16841

Preferential Multi-Objective Bayesian Optimization for Drug Discovery

Despite decades of advancements in automated ligand screening, large-scale drug discovery remains resource-intensive and requires post-processing hit selection, a step where chemists manually select a few promising molecules based on their chemical intuition. This creates a major bottleneck in the virtual screening process for drug discovery, requiring experts to repeatedly balance complex trade-offs among drug properties across a vast pool of candidates. To improve the efficiency and reliability of this process, we propose a novel human-centered framework named CheapVS that allows chemists to guide the ligand selection process by providing preferences regarding the trade-offs between drug properties via pairwise comparison. Our framework combines preferential multi-objective Bayesian optimization with a docking model for measuring binding affinity to capture human chemical intuition for improving hit identification. Specifically, on a library of 100K chemical candidates targeting EGFR and DRD2, CheapVS outperforms state-of-the-art screening methods in identifying drugs within a limited computational budget. Notably, our method can recover up to 16/37 EGFR and 37/58 DRD2 known drugs while screening only 6% of the library, showcasing its potential to significantly advance drug discovery.


[14] 2503.16999

Route to Chaos and Unified Dynamical Framework of Multi-Species Ecosystems

We investigate species-rich mathematical models of ecosystems. Much of the existing literature focuses on the properties of equilibrium fixed points, in particular their stability and feasibility. Here we emphasize the emergence of limit cycles following Hopf bifurcations tuned by the variability of interspecies interaction. As the variability increases, and owing to the large dimensionality of the system, limit cycles typically acquire a growing spectrum of frequencies. This often leads to the appearance of strange attractors, with chaotic dynamics of species abundances characterized by a positive Lyapunov exponent. We find that limit cycles and strange attractors preserve biodiversity as they maintain dynamical stability without species extinction. We give numerical evidence that this route to chaos dominates in ecosystems with strong enough interactions and where predator-prey behavior dominates over competition and mutualism. Based on arguments from random matrix theory, we further conjecture that this scenario is generic in ecosystems with a large number of species, and identify the key parameters driving it. Overall, our work proposes a unifying framework, where a wide range of population dynamics emerge from a single model.
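
A common starting point for species-rich models of this kind is a generalized Lotka-Volterra system with random interaction coefficients, where the spread of the coefficients plays the role of the interaction variability. The Python sketch below simulates such a system; the specific equations, the Gaussian interactions, the scaling by the square root of the species number, and the function name simulate_glv are all assumptions made for illustration and may differ from the model studied in the paper.

```python
import math
import random


def simulate_glv(n_species=10, sigma=0.2, t_end=50.0, dt=0.01, seed=0):
    """Simulate dN_i/dt = N_i * (1 - N_i + sum_j a_ij * N_j) with random a_ij.

    sigma sets the variability of the interactions (hypothetical setup).
    A multiplicative update keeps all abundances strictly positive.
    """
    rng = random.Random(seed)
    scale = sigma / math.sqrt(n_species)
    # Random interaction matrix with zero diagonal (self-regulation is the -N_i term).
    a = [[0.0 if i == j else rng.gauss(0.0, scale) for j in range(n_species)]
         for i in range(n_species)]
    n = [rng.uniform(0.5, 1.5) for _ in range(n_species)]
    for _ in range(int(round(t_end / dt))):
        growth = [1.0 - n[i] + sum(a[i][j] * n[j] for j in range(n_species))
                  for i in range(n_species)]
        n = [n[i] * math.exp(dt * growth[i]) for i in range(n_species)]
    return n


if __name__ == "__main__":
    print("no interactions:  ", [round(x, 3) for x in simulate_glv(sigma=0.0)])
    print("weak interactions:", [round(x, 3) for x in simulate_glv(sigma=0.2)])
```

With sigma set to zero every species relaxes to its carrying capacity of 1. Exploring oscillatory or chaotic regimes would require larger sigma, more species, and recording the full time series rather than only the final state.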


[15] 2503.17361

Gumbel-Softmax Flow Matching with Straight-Through Guidance for Controllable Biological Sequence Generation

Flow matching in the continuous simplex has emerged as a promising strategy for DNA sequence design, but struggles to scale to higher simplex dimensions required for peptide and protein generation. We introduce Gumbel-Softmax Flow and Score Matching, a generative framework on the simplex based on a novel Gumbel-Softmax interpolant with a time-dependent temperature. Using this interpolant, we introduce Gumbel-Softmax Flow Matching by deriving a parameterized velocity field that transports from smooth categorical distributions to distributions concentrated at a single vertex of the simplex. We alternatively present Gumbel-Softmax Score Matching which learns to regress the gradient of the probability density. Our framework enables high-quality, diverse generation and scales efficiently to higher-dimensional simplices. To enable training-free guidance, we propose Straight-Through Guided Flows (STGFlow), a classifier-based guidance method that leverages straight-through estimators to steer the unconditional velocity field toward optimal vertices of the simplex. STGFlow enables efficient inference-time guidance using classifiers pre-trained on clean sequences, and can be used with any discrete flow method. Together, these components form a robust framework for controllable de novo sequence generation. We demonstrate state-of-the-art performance in conditional DNA promoter design, sequence-only protein generation, and target-binding peptide design for rare disease treatment.
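
The basic Gumbel-Softmax relaxation that the framework builds on can be written in a few lines. The Python sketch below draws one relaxed sample on the simplex from a categorical distribution and shows how lowering the temperature pushes the sample toward a vertex. The exponential temperature_schedule is a hypothetical placeholder for a time-dependent temperature; the interpolant, velocity field, and guidance method of the paper are not reproduced here.

```python
import math
import random


def gumbel_softmax_sample(probs, temperature, rng):
    """Draw one relaxed categorical sample: softmax((log p + Gumbel noise) / tau)."""
    logits = []
    for p in probs:
        u = max(rng.random(), 1e-12)           # guard against log(0)
        gumbel = -math.log(-math.log(u))       # standard Gumbel(0, 1) noise
        logits.append((math.log(p) + gumbel) / temperature)
    top = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def temperature_schedule(t, tau_max=1.0, decay=5.0):
    """Hypothetical schedule: temperature decays from tau_max as t goes from 0 to 1."""
    return tau_max * math.exp(-decay * t)


if __name__ == "__main__":
    probs = [0.1, 0.2, 0.3, 0.4]
    for t in (0.0, 0.5, 1.0):
        tau = temperature_schedule(t)
        # Re-using the seed fixes the noise so only the temperature changes.
        sample = gumbel_softmax_sample(probs, tau, random.Random(3))
        print(f"t={t:.1f} tau={tau:.4f}", [round(x, 3) for x in sample])
```

For fixed noise, the largest coordinate of the sample grows as the temperature falls, so the same draw moves from a smooth point in the interior of the simplex toward a single vertex.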