New articles on Quantitative Biology


[1] 2602.20176

Cross-Chirality Generalization by Axial Vectors for Hetero-Chiral Protein-Peptide Interaction Design

D-peptide binders targeting L-proteins have promising therapeutic potential. Despite rapid advances in machine learning-based target-conditioned peptide design, generating D-peptide binders remains largely unexplored. In this work, we show that by injecting axial features into $E(3)$-equivariant (polar) vector features, it is feasible to achieve cross-chirality generalization from homo-chiral (L--L) training data to hetero-chiral (D--L) design tasks. By implementing this method within a latent diffusion model, we achieved D-peptide binder design that not only outperforms existing tools in in silico benchmarks, but also demonstrates efficacy in wet-lab validation. To our knowledge, our approach represents the first wet-lab validated generative AI for the de novo design of D-peptide binders, offering new perspectives on handling chirality in protein design.
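
The distinction between polar and axial vectors that the abstract relies on can be checked numerically. The sketch below is not the authors' model; it only illustrates the general fact that a cross product of two polar vectors picks up an extra factor of det(R) under a reflection R, which is why axial features can distinguish mirror images. The vectors `a`, `b` and the mirror plane are arbitrary choices made for illustration.

```python
import numpy as np

def transform_polar(R, v):
    """Polar vectors transform as v -> R v under an orthogonal map R."""
    return R @ v

def transform_axial(R, w):
    """Axial (pseudo) vectors pick up an extra det(R): w -> det(R) R w."""
    return np.linalg.det(R) * (R @ w)

# Arbitrary example vectors (assumptions for illustration only).
a = np.array([1.0, 2.0, 3.0])
b = np.array([-0.5, 0.4, 2.0])
mirror = np.diag([-1.0, 1.0, 1.0])  # reflection through the yz-plane, det = -1

axial = np.cross(a, b)

# Mirror the polar inputs first, then take the cross product.
mirrored_then_crossed = np.cross(transform_polar(mirror, a),
                                 transform_polar(mirror, b))

# The result matches the axial rule and differs from the polar rule.
follows_axial_rule = np.allclose(mirrored_then_crossed,
                                 transform_axial(mirror, axial))
follows_polar_rule = np.allclose(mirrored_then_crossed,
                                 transform_polar(mirror, axial))
```

For a proper rotation (det(R) = +1) the two rules coincide, so the difference only becomes visible under mirror operations such as the L-to-D inversion discussed in the abstract.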


[2] 2602.20198

KEMP-PIP: A Feature-Fusion Based Approach for Pro-inflammatory Peptide Prediction

Pro-inflammatory peptides (PIPs) play critical roles in immune signaling and inflammation but are difficult to identify experimentally due to costly and time-consuming assays. To address this challenge, we present KEMP-PIP, a hybrid machine learning framework that integrates deep protein embeddings with handcrafted descriptors for robust PIP prediction. Our approach combines contextual embeddings from pretrained ESM protein language models with multi-scale k-mer frequencies, physicochemical descriptors, and modlAMP sequence features. Feature pruning and class-weighted logistic regression manage high dimensionality and class imbalance, while ensemble averaging with an optimized decision threshold enhances the sensitivity--specificity balance. Through systematic ablation studies, we demonstrate that integrating complementary feature sets consistently improves predictive performance. On the standard benchmark dataset, KEMP-PIP achieves an MCC of 0.505, accuracy of 0.752, and AUC of 0.762, outperforming ProIn-fuse, MultiFeatVotPIP, and StackPIP. Relative to StackPIP, these results represent improvements of 9.5% in MCC and 4.8% in both accuracy and AUC. The KEMP-PIP web server is freely available at this https URL and the full implementation at this https URL.
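
For readers unfamiliar with the headline metric, the Matthews correlation coefficient (MCC) can be computed directly from confusion-matrix counts. This is a minimal sketch of the standard definition, not code from the KEMP-PIP implementation; the function name and the convention of returning 0.0 when the denominator vanishes are assumptions made here.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """Standard MCC from confusion-matrix counts.

    Returns 0.0 when any marginal is empty and the ratio is undefined
    (a common convention, assumed here).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical counts, for illustration only (not from the paper).
perfect = matthews_corrcoef(tp=10, tn=10, fp=0, fn=0)
inverted = matthews_corrcoef(tp=0, tn=0, fp=10, fn=10)
chance_like = matthews_corrcoef(tp=5, tn=5, fp=5, fn=5)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it a common choice for imbalanced benchmarks such as the one described.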


[3] 2602.20209

Regressor-guided Diffusion Model for De Novo Peptide Sequencing with Explicit Mass Control

The discovery of novel proteins relies on sensitive protein identification, for which de novo peptide sequencing (DNPS) from mass spectra is a crucial approach. While deep learning has advanced DNPS, existing models inadequately enforce the fundamental mass consistency constraint: a predicted peptide's mass must match the experimentally measured precursor mass. Previous DNPS methods often treat this critical information as a simple input feature or use it in post-processing, leading to numerous implausible predictions that do not adhere to this fundamental physical property. To address this limitation, we introduce DiffuNovo, a novel regressor-guided diffusion model for de novo peptide sequencing that provides explicit peptide-level mass control. Our approach integrates the mass constraint at two critical stages: during training, a novel peptide-level mass loss guides model optimization, while at inference, regressor-based guidance from gradient-based updates in the latent space steers generation so that the predicted peptide adheres to the mass constraint. Comprehensive evaluations on established benchmarks demonstrate that DiffuNovo surpasses state-of-the-art methods in DNPS accuracy. Additionally, as the first DNPS model to employ a diffusion model as its core backbone, DiffuNovo leverages the powerful controllability of diffusion architecture and achieves a significant reduction in mass error, thereby producing much more physically plausible peptides. These innovations represent a substantial advancement toward robust and broadly applicable DNPS. The source code is available in the supplementary material.
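
The mass consistency constraint itself is simple to state in code. The sketch below is not DiffuNovo; it only shows the check that the abstract says existing models fail to enforce, using standard monoisotopic residue masses for unmodified amino acids. The function names and the 10 ppm default tolerance are assumptions for illustration.

```python
# Monoisotopic residue masses (Da) for the 20 unmodified amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # one water per peptide (free N- and C-termini)
PROTON = 1.007276

def peptide_mass(sequence):
    """Theoretical neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER

def precursor_neutral_mass(mz, charge):
    """Neutral mass implied by an observed precursor m/z and charge."""
    return (mz - PROTON) * charge

def mass_error_ppm(sequence, mz, charge):
    """Signed error of the observed precursor relative to the prediction."""
    theoretical = peptide_mass(sequence)
    observed = precursor_neutral_mass(mz, charge)
    return (observed - theoretical) / theoretical * 1e6

def is_mass_consistent(sequence, mz, charge, tol_ppm=10.0):
    """True if the predicted peptide matches the precursor within tol_ppm."""
    return abs(mass_error_ppm(sequence, mz, charge)) <= tol_ppm
```

A prediction that swaps one residue for another of different mass (for example G for A, a shift of about 14 Da) fails this check by thousands of ppm, which is the kind of implausible output the abstract refers to.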


[4] 2602.20495

Unveiling Scaling Laws of Parameter Identifiability and Uncertainty Quantification in Data-Driven Biological Modeling

Integrating high-dimensional biological data into data-driven mechanistic modeling requires rigorous practical identifiability to ensure interpretability and generalizability. However, coordinate identifiability analysis often suffers from numerical instabilities near singular local minimizers. We present a computational framework that uncovers fundamental scaling laws governing practical identifiability through asymptotic analysis. By synthesizing Fisher information with perturbed Hessian matrices, we establish a hierarchical approach to quantify coordinate identifiability and inform uncertainty quantification within non-identifiable subspaces across different orders. Supported by rigorous mathematical analysis and validated on synthetic and real-world data, our framework was applied to HIV-host dynamics and spatiotemporal amyloid-beta propagation. These applications demonstrate the framework's efficiency in elucidating critical mechanisms underlying HIV diagnostics and Alzheimer's disease progression. In the era of large-scale mechanistic digital twins, our framework provides the scaling laws for data-driven modeling in terms of both parameter identifiability and uncertainty, ensuring that data-driven inferences are grounded in verifiable biological reality.


[5] 2602.20702

Tipping points in complex ecological systems

Tipping points are one of the hot topics in modern physics of complex systems. But what is a tipping point? A generic definition describes it as ``a state of the system where a small change in its parameters can lead to a significant change in its properties''. Additional ingredients that often enter the definition of a tipping process are the abruptness of the resulting change and its irreversibility, i.e. it is impossible to recover the initial state if one reverses the protocol of change of the parameters. However, there exist a number of different mathematical structures that can show this behavior; the one that was originally suggested as a tipping point (nowadays usually referred to as bifurcation-induced tipping) is just one of many. Different preconditions and/or different levels of detail included in the model, reflecting also different environmental forcing, can lead to a variety of tipping mechanisms. Furthermore, in a spatially extended system and/or a system with multiple scales, different parts can react to a change in environmental conditions differently or at a different time, interacting with each other to create a tipping cascade. In this paper, using ecosystems as a paradigm of complex nonlinear open systems, we provide a critical overview of the progress made in tipping point science over the last 15 years. We highlight the main findings, identify gaps in our knowledge, and outline a roadmap for further progress.


[6] 2602.20883

Adaptation by Cumulative Selection

Biological systems like long-lived clonal organisms, holobionts and clades challenge traditional evolutionary thinking since they adapt without populations or reproduction. This paper aims to provide an overarching theoretical framework which encompasses standard Darwinian evolution as well as other processes of adaptation. This framework is cumulative selection and I provide a general `recipe' for it to occur. Lewontin's recipe for evolution by natural selection is shown to be a particular example of cumulative selection, but not the only one. Similarly, reproduction, inheritance and populations are just one way to perform cumulative selection. I discuss several other examples of cumulative selection including clonal organisms, dioecious populations, Gaia and neural networks.


[7] 2602.21150

Age Structured Epidemic Model under Vaccination with Vector Transmission

Dengue remains a major global public health concern due to its high mortality and economic burden. Mathematical modeling is essential for understanding its transmission mechanisms and for evaluating intervention strategies. In this paper, we formulate a vector-host model in which the human population is structured by age, and vaccinated individuals are further described by time since vaccination. The mosquito population is coupled to the host dynamics and reduced under a quasi-steady-state assumption. By integrating over vaccination age, we obtain a nonlinear steady-state formulation and express the endemic equilibrium as a fixed-point problem for the infected mosquito population. Using Lipschitz estimates and a contraction argument, we establish existence and uniqueness of the equilibrium under a weak transmission condition. The analysis highlights the influence of age-dependent vaccination on long-term dengue dynamics.


[8] 2602.18889

Topological shape transform for thymus structures

The Euler characteristic transform (ECT) is an emerging and powerful framework within topological data analysis for quantifying the geometry of shape. The applicability of ECT has been limited due to its sensitivity to noisy data. Here, we introduce SampEuler, a novel ECT-based shape descriptor designed to achieve enhanced robustness to perturbations. We provide a theoretical analysis establishing the stability of SampEuler and validate these properties empirically through pairwise similarity analyses on a benchmark dataset and showcase it on a thymus dataset. The thymus is a primary lymphoid organ that is essential for the maturation and selection of self-tolerant T cells, and within the thymus, thymic epithelial cells are organized in complex three-dimensional architectures, yet the principles governing their formation, functional organization, and remodeling during age-related involution remain poorly understood. Addressing these questions requires robust and informative shape descriptors capable of capturing subtle architectural changes across developmental stages. We develop and apply SampEuler to a newly generated two-dimensional imaging dataset of mouse thymi spanning multiple age groups, where SampEuler outperforms both persistent homology--based methods and deep learning models in detecting subtle, localized morphological differences associated with aging. To facilitate interpretation, we develop a vectorization and visualization framework for SampEuler, which preserves rich morphological information and enables identification of structural features that distinguish thymi across age groups. Collectively, our results demonstrate that SampEuler provides a robust and interpretable approach for quantifying thymic architecture and reveals age-dependent structural changes that offer new insights into thymic organization and involution.
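
To make the underlying object concrete, here is a minimal sketch of the plain Euler characteristic transform for a shape given as a planar graph (vertices and edges). It is not SampEuler, whose sampling and stabilisation steps are not described in the abstract; the function names and the evenly spaced directions are assumptions for illustration.

```python
import math

def euler_curve(vertices, edges, angle, thresholds):
    """Euler characteristic of sublevel sets along one direction.

    vertices: list of (x, y); edges: list of (i, j) index pairs.
    A vertex enters at its height along the direction; an edge enters
    once both endpoints are in. chi = #vertices - #edges for a graph.
    """
    dx, dy = math.cos(angle), math.sin(angle)
    height = [x * dx + y * dy for x, y in vertices]
    curve = []
    for t in thresholds:
        n_vertices = sum(1 for h in height if h <= t)
        n_edges = sum(1 for i, j in edges if max(height[i], height[j]) <= t)
        curve.append(n_vertices - n_edges)
    return curve

def euler_characteristic_transform(vertices, edges, n_directions, thresholds):
    """Stack Euler curves over evenly spaced directions on the circle."""
    return [
        euler_curve(vertices, edges, 2 * math.pi * k / n_directions, thresholds)
        for k in range(n_directions)
    ]

# Outline of a unit square, swept along the x-axis (angle 0).
square_vertices = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
square_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
square_curve = euler_curve(square_vertices, square_edges, 0.0,
                           [-0.5, 0.0, 0.5, 1.0])
```

In the square example the curve rises to 1 when the left edge appears (one connected piece) and drops to 0 once the loop closes. A single displaced vertex shifts where these jumps occur, which gives some intuition for the noise sensitivity the abstract mentions.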


[9] 2602.20218

Targeted T2-FLAIR Dropout Training Improves Robustness of nnU-Net Glioblastoma Segmentation to Missing T2-FLAIR

Purpose: To determine whether targeted T2 fluid-attenuated inversion recovery (T2-FLAIR) dropout training improves glioblastoma MRI tumor segmentation robustness to missing T2-FLAIR without degrading performance when T2-FLAIR is available. Materials and Methods: This retrospective multi-dataset study developed nnU-Net models on BraTS 2021 (n=848) and externally tested them on UPenn-GBM glioblastoma MRI (n=403; 2006-2018; age 18-89 years; 60% male). Models were trained with no dropout or targeted T2-FLAIR dropout (probability rate r=0.35 or 0.50) by replacing only the T2-FLAIR channel with zeros. Inference used T2-FLAIR-present and T2-FLAIR-absent scenarios (T2-FLAIR set to zero). The primary endpoint was Dice similarity coefficient (DSC); secondary endpoints were 95th percentile Hausdorff distance and Bland-Altman whole-tumor volume bias. Equivalence was assessed with two one-sided tests using +/-1.5 DSC percentage points, and noninferiority versus HD-GLIO used a -1.5-point margin. Results: With T2-FLAIR present, median overall DSC was 94.8% (interquartile range, 90.0%-97.1%) with dropout and 95.0% (interquartile range, 90.3%-97.1%) without dropout (equivalence supported, p<0.001). With T2-FLAIR absent, median overall DSC improved from 81.0% (interquartile range, 75.1%-86.4%) without dropout to 93.4% (interquartile range, 89.1%-96.2%) with dropout (r=0.35); edema DSC improved from 14.0% to 87.0%, edema 95th percentile Hausdorff distance improved from 22.44 mm to 2.45 mm, and whole-tumor volume bias improved from -45.6 mL to 0.83 mL. Dropout was noninferior to HD-GLIO under T2-FLAIR-present (all p<0.001). Conclusion: Targeted T2-FLAIR dropout preserved segmentation performance when T2-FLAIR was available and reduced segmentation error and whole-tumor volume bias when T2-FLAIR was absent.
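
The two central ingredients of the study, zeroing a single input channel with probability r during training and scoring with the Dice similarity coefficient, can be sketched as follows. This is not the nnU-Net implementation: the channel index `FLAIR_CHANNEL`, the channels-first array layout, and the function names are assumptions for illustration.

```python
import numpy as np

FLAIR_CHANNEL = 3  # hypothetical position of T2-FLAIR in a 4-channel input

def targeted_channel_dropout(image, channel, rate, rng):
    """With probability `rate`, replace one modality channel with zeros.

    image: array of shape (channels, ...). Other channels are untouched
    and the input array is not modified in place.
    """
    out = image.copy()
    if rng.random() < rate:
        out[channel] = 0.0
    return out

def dice(pred, truth):
    """Dice similarity coefficient between two binary masks."""
    pred = np.asarray(pred).astype(bool)
    truth = np.asarray(truth).astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed)
    return 2.0 * np.logical_and(pred, truth).sum() / denom

rng = np.random.default_rng(0)
sample = np.ones((4, 2, 2))
augmented = targeted_channel_dropout(sample, FLAIR_CHANNEL, 0.35, rng)
```

At inference, the T2-FLAIR-absent scenario in the abstract corresponds to always zeroing that channel, i.e. calling the same function with `rate=1.0`.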


[10] 2602.20266

Multiple Poisson-Dirichlet diffusions on generalized Kingman simplices

We construct a new class of infinite-dimensional diffusions taking values in a generalized Kingman simplex. Our model describes the temporal evolution of the relative frequencies of infinitely-many types which are "labeled" by an arbitrary finite number of marks or colors, but "unlabeled" within each mark. We start with a finite-dimensional construction which extends to Wright-Fisher diffusions a self-similarity property known for Dirichlet distributions, and corresponds to a multiple skew-product representation of the Wright-Fisher diffusion relative to the marks in the population. After ranking decreasingly the frequencies within each mark, we identify the limit in distribution of the resulting diffusion when the number of types for each mark goes to infinity, and describe its infinitesimal operator. The limiting process reduces to a diffusion in the Thoma simplex in the special case of only two marks, whereas the infinitely-many-neutral-alleles model is recovered when all frequencies have the same mark. The stationary measure of the limit diffusion is shown to be the recently introduced multiple Poisson-Dirichlet distribution, which extends Kingman's Poisson-Dirichlet distribution and is the de Finetti representing measure for a family of random partitions whose elements are marked.


[11] 2602.20289

The Sim-to-Real Gap in MRS Quantification: A Systematic Deep Learning Validation for GABA

Magnetic resonance spectroscopy (MRS) is used to quantify metabolites in vivo and estimate biomarkers for conditions ranging from neurological disorders to cancers. Quantifying low-concentration metabolites such as GABA ($\gamma$-aminobutyric acid) is challenging due to low signal-to-noise ratio (SNR) and spectral overlap. We investigate and validate deep learning for quantifying complex, low-SNR, overlapping signals from MEGA-PRESS spectra, devise a convolutional neural network (CNN) and a Y-shaped autoencoder (YAE), and select the best models via Bayesian optimisation on 10,000 simulated spectra from slice-profile-aware MEGA-PRESS simulations. The selected models are trained on 100,000 simulated spectra. We validate their performance on 144 spectra from 112 experimental phantoms containing five metabolites of interest (GABA, Glu, Gln, NAA, Cr) with known ground truth concentrations across solution and gel series acquired at 3 T under varied bandwidths and implementations. These models are further assessed against the widely used LCModel quantification tool. On simulations, both models achieve near-perfect agreement (small MAEs; regression slopes $\approx 1.00$, $R^2 \approx 1.00$). On experimental phantom data, errors initially increased substantially. However, modelling variable linewidths in the training data significantly reduced this gap. The best augmented deep learning models achieved a mean MAE for GABA over all phantom spectra of 0.151 (YAE) and 0.160 (FCNN) in max-normalised relative concentrations, outperforming the conventional baseline LCModel (0.220). A sim-to-real gap remains, but physics-informed data augmentation substantially reduced it. Phantom ground truth is needed to judge whether a method will perform reliably on real data.


[12] 2602.20344

Hierarchical Molecular Representation Learning via Fragment-Based Self-Supervised Embedding Prediction

Graph self-supervised learning (GSSL) has demonstrated strong potential for generating expressive graph embeddings without the need for human annotations, making it particularly valuable in domains with high labeling costs such as molecular graph analysis. However, existing GSSL methods mostly focus on node- or edge-level information, often ignoring chemically relevant substructures which strongly influence molecular properties. In this work, we propose Graph Semantic Predictive Network (GraSPNet), a hierarchical self-supervised framework that explicitly models both atomic-level and fragment-level semantics. GraSPNet decomposes molecular graphs into chemically meaningful fragments without predefined vocabularies and learns node- and fragment-level representations through multi-level message passing with masked semantic prediction at both levels. This hierarchical semantic supervision enables GraSPNet to learn multi-resolution structural information that is both expressive and transferable. Extensive experiments on multiple molecular property prediction benchmarks demonstrate that GraSPNet learns chemically meaningful representations and consistently outperforms state-of-the-art GSSL methods in transfer learning settings.


[13] 2602.20449

Protein Language Models Diverge from Natural Language: Comparative Analysis and Improved Inference

Modern Protein Language Models (PLMs) apply transformer-based model architectures from natural language processing to biological sequences, predicting a variety of protein functions and properties. However, protein language has key differences from natural language, such as a rich functional space despite a vocabulary of only 20 amino acids. These differences motivate research into how transformer-based architectures operate differently in the protein domain and how we can better leverage PLMs to solve protein-related tasks. In this work, we begin by directly comparing how the distribution of information stored across layers of attention heads differs between the protein and natural language domains. Furthermore, we adapt a simple early-exit technique (originally used in the natural language domain to improve efficiency at the cost of performance) to achieve both increased accuracy and substantial efficiency gains in protein non-structural property prediction by allowing the model to automatically select protein representations from the intermediate layers of the PLMs for the specific task and protein at hand. We achieve performance gains ranging from 0.4 to 7.01 percentage points while simultaneously improving efficiency by over 10 percent across models and non-structural prediction tasks. Our work opens up an area of research directly comparing how language models change behavior when moved into the protein domain and advances language modeling in biological domains.


[14] 2503.01834

Intercellular contact is sufficient to drive Fibroblast to Myofibroblast transitions

Fibroblast cells play a key role in maintaining the extracellular matrix. During wound healing, fibroblasts differentiate into highly contractile myofibroblasts, which secrete extracellular matrix proteins like collagen to facilitate tissue repair. Under normal conditions, myofibroblasts undergo programmed cell death after healing to prevent excessive scar formation. However, in diseases like fibrosis, the myofibroblasts remain active even after the wound is closed, resulting in excessive collagen buildup and a stiff, fibrotic matrix. The reasons for the persistence of myofibroblasts in fibrosis are not well understood. Here, we show the existence of a mechanism where direct physical contact between a fibroblast and a myofibroblast is sufficient for fibroblasts to transition into myofibroblasts. We demonstrate that the fibroblast-myofibroblast transition can occur even in the absence of known biochemical cues, such as growth factor activation or mechanical cues from a stiff, fibrotic matrix. Furthermore, we demonstrate that contact-based fibroblast-myofibroblast activation can be inhibited by the G$\alpha$q/11/14 inhibitor FR900359, which prevents the formation of myofibroblasts. These findings provide new insights into the persistence of fibrosis despite therapeutic interventions, suggesting a potential strategy for targeting the fibroblast-to-myofibroblast transition in fibrotic conditions.


[15] 2504.15166

Simulating stochastic population dynamics: The Linear Noise Approximation can capture non-linear phenomena

Population dynamics in fields such as molecular biology, epidemiology, and ecology exhibit highly stochastic and non-linear behavior. In gene regulatory systems in particular, oscillations and multistability are especially common. Despite this, none of the currently available stochastic models for population dynamics are both accurate and computationally efficient for long-term predictions. A prominent model in this field, the Linear Noise Approximation (LNA), is computationally efficient for tasks such as simulation, sensitivity analysis, and parameter estimation; however, it is only accurate for linear systems and short-time predictions. Other models may achieve greater accuracy across a broader range of systems, but they sacrifice computational efficiency and analytical tractability. This paper demonstrates that, with specific modifications, the LNA can accurately capture non-linear dynamics in population processes. We introduce a new framework based on centre manifold theory, a classical concept from non-linear dynamical systems. This approach enables the identification of simple, system-specific modifications to the LNA, tailored to classes of qualitatively similar non-linear dynamical systems. With these modifications, the LNA can achieve accurate long-term simulations without compromising computational efficiency. We apply our methodology to classes of oscillatory and bi-stable systems, and present multiple examples from molecular population dynamics that demonstrate accurate long-term simulations alongside significant improvements in computational efficiency.
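
For reference, the standard (unmodified) Linear Noise Approximation that the abstract takes as its starting point can be written as below. This is the textbook form, not the centre-manifold modification the paper introduces. The notation is assumed here: $S$ is the stoichiometry matrix, $f$ the vector of macroscopic reaction rates, $\Omega$ the system size, and $W_t$ a vector of independent Wiener processes.

```latex
\begin{aligned}
X(t) &\approx \Omega\,\phi(t) + \Omega^{1/2}\,\xi(t), \\
\frac{d\phi}{dt} &= S\,f(\phi), \\
d\xi &= J(\phi)\,\xi\,dt + S\,\operatorname{diag}\!\big(f(\phi)\big)^{1/2}\,dW_t,
\qquad J(\phi) = S\,\frac{\partial f}{\partial \phi}, \\
\frac{d\Sigma}{dt} &= J\,\Sigma + \Sigma\,J^{\top} + S\,\operatorname{diag}\!\big(f(\phi)\big)\,S^{\top},
\qquad \Sigma = \operatorname{Cov}(\xi).
\end{aligned}
```

Because the fluctuation equation is linearised around the deterministic trajectory $\phi(t)$, the fluctuations $\xi$ are Gaussian. This is the source of both the computational efficiency and the limited accuracy for non-linear systems over long horizons that the abstract describes.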


[16] 2508.15077

Modelling the transmission and impact of Omicron variants of Covid-19 in different ethnicity groups in Aotearoa New Zealand

Previous pandemics, including influenza pandemics and Covid-19, have disproportionately impacted Māori and Pacific populations in Aotearoa New Zealand. The reasons for this are multi-faceted, including differences in socioeconomic deprivation, housing conditions and household size, vaccination rates, access to healthcare, and prevalence of pre-existing health conditions. Many mathematical models that were used to inform the response to the Covid-19 pandemic did not explicitly include ethnicity or other socioeconomic variables. This limited their ability to predict, understand and mitigate inequitable impacts of the pandemic. Here, we extend a model that was developed during the Covid-19 pandemic to support the public health response by stratifying the population into four ethnicity groups: Māori, Pacific, Asian and European/other. We include three ethnicity-specific components in the model: vaccination rates, clinical severity parameters, and contact patterns. We compare model results to ethnicity-specific data on Covid-19 cases, hospital admissions and deaths between 1 January 2022 and 30 June 2023, under different model scenarios in which these ethnicity-specific components are present or absent. We find that differences in vaccination rates explain only part of the observed disparities in outcomes. While no model scenario is able to fully capture the heterogeneous temporal dynamics, our results suggest that differences between ethnicities in the per-infection risk of clinically severe disease are an important factor. Our work is an important step towards models that are better able to predict inequitable impacts of future pandemic and emerging disease threats, and investigate the ability of interventions to mitigate these.


[17] 2509.02060

Morphology-Aware Peptide Discovery via Masked Conditional Generative Modeling

Peptide self-assembly prediction offers a powerful bottom-up strategy for designing biocompatible, low-toxicity materials for large-scale synthesis in a broad range of biomedical and energy applications. However, screening the vast sequence space for categorization of aggregate morphology remains intractable. We introduce PepMorph, an end-to-end peptide discovery pipeline that generates novel sequences that are not only prone to aggregate but whose self-assembly is steered toward fibrillar or spherical morphologies by conditioning on isolated peptide descriptors that serve as morphology proxies. To this end, we compiled a new dataset by leveraging existing aggregation propensity datasets and extracting geometric and physicochemical descriptors. This dataset is then used to train a Transformer-based Conditional Variational Autoencoder with a masking mechanism, which generates novel peptides under arbitrary conditioning. After filtering to ensure design specifications and validating the generated sequences through coarse-grained molecular dynamics (CG-MD) simulations, PepMorph yielded an 83% success rate under our CG-MD validation protocol and morphology criterion for the targeted class, showcasing its promise as a framework for application-driven peptide discovery.


[18] 2510.05193

Harmonic fields and the mechanical response of a cellular monolayer to ablation

Multicellular tissues, such as the epithelium coating a developing embryo, often combine complex tissue shapes with heterogeneity in the spatial arrangement of individual cells. Discrete approximations, such as the cell vertex model, can accommodate these geometric features, but techniques for analysis of such models are underdeveloped. Here, we express differential operators defined on a network representing a monolayer of confluent cells in a framework inspired by discrete exterior calculus, considering scalar fields defined over cell vertices and centres and vector fields defined over cell edges. We achieve this by defining Hodge stars, wedge products and musical isomorphisms that are appropriate for a disordered monolayer for which cell edges and links between cell centres are not orthogonal, as is generic for epithelia. We use this framework to evaluate the harmonic vector field arising in an ablated planar monolayer, demonstrating an approximate $1/r$ scaling of the upper bound of the field's amplitude, where $r$ is the distance from the ablation. Using a vertex model that incorporates osmotic effects, we then calculate the mechanical response of a monolayer in a jammed state to ablation. Perturbation displacements exhibit long-range coherence, monopolar and quadrupolar features, and an approximate $1/r$ near-hole upper-bound scaling, implicating the harmonic field. The upper bounds on perturbation stress amplitudes scale approximately like $1/r^2$, a feature relevant to long-range mechanical signalling.


[19] 2510.06578

A geometric feature tracking approach for noninvasive patient specific estimation of leaflet strain from 3D images of heart valves

Valvular heart disease is prevalent and a major contributor to heart failure. Valve leaflet strain is a promising metric for evaluating the mechanics underlying the initiation and progression of valvular pathology. However, generalizable methods for noninvasively quantifying valvular strain from clinically acquired patient images remain limited. To address this limitation, we developed a geometric feature-tracking framework to quantify in vivo leaflet strain from 3DE images. The method integrates a cohort-derived geometric reference atlas to establish geometric correspondence and introduces a novel distance-weighted coherent point drift algorithm for non-rigid registration. We evaluated performance against a finite element benchmark model and compared the approach with conventional point-based tracking methods. The framework was applied to pediatric and adult patient datasets (N = 31) across variable valve morphologies. The proposed method demonstrated greater accuracy in quantifying anatomical alignment and leaflet strain than conventional point-based approaches. Validation against the finite element benchmark confirmed improved strain estimation. The framework achieved reliable inter-phase tracking of valve deformation across diverse morphologies in pediatric and adult patients. Analysis identified a consistent distribution pattern of the 1st principal strain associated with leaflet billow (prolapse). This feature-tracking framework provides a generalizable method for noninvasive quantification of atrioventricular valve leaflet strain from clinical 3DE images. Characterization of biomechanical strain patterns may improve prognostic assessment and support longitudinal evaluation of valvular heart disease. Further investigation of the biomechanical signatures of heart valve disease has the potential to enhance prognostic assessment and longitudinal evaluation of valvular heart disease.


[20] 2512.24192

SeedProteo: Accurate De Novo All-Atom Design of Protein Binders

We present SeedProteo, a diffusion-based model for de novo all-atom protein design. We demonstrate how to repurpose a cutting-edge folding architecture into a powerful generative design framework by effectively integrating self-conditioning features. Extensive benchmarks highlight the model's capabilities across two distinct tasks: in unconditional generation, SeedProteo exhibits superior length generalization and structural diversity, maintaining robustness for long sequences and complex topologies; in binder design, it achieves state-of-the-art performance among open-source methods, attaining the highest in-silico design success rates, structural diversity and novelty. Finally, we validate SeedProteo through wet-lab assays on two therapeutic targets, achieving hit rates of 70%-80% and picomolar-level binding affinities, establishing leading results. To facilitate community adoption, we provide public access to SeedProteo via a webserver (this https URL).


[21] 2602.02620

CryoLVM: Self-supervised Learning from Cryo-EM Density Maps with Large Vision Models

Cryo-electron microscopy (cryo-EM) has revolutionized structural biology by enabling near-atomic-level visualization of biomolecular assemblies. However, the exponential growth in cryo-EM data throughput and complexity, coupled with diverse downstream analytical tasks, necessitates unified computational frameworks that transcend current task-specific deep learning approaches with limited scalability and generalizability. We present CryoLVM, a foundation model that learns rich structural representations from experimental density maps with resolved structures by leveraging the Joint-Embedding Predictive Architecture (JEPA) integrated with a SCUNet-based backbone, and that can be rapidly adapted to various downstream tasks. We further introduce a novel histogram-based distribution alignment loss that accelerates convergence and enhances fine-tuning performance. We demonstrate CryoLVM's effectiveness across three critical cryo-EM tasks: density map sharpening, density map super-resolution, and missing wedge restoration. Our method consistently outperforms state-of-the-art baselines across multiple density map quality metrics, confirming its potential as a versatile model for a wide spectrum of cryo-EM applications.


[22] 2602.17557

Probability-Invariant Random Walk Learning on Gyral Folding-Based Cortical Similarity Networks for Alzheimer's and Lewy Body Dementia Diagnosis

Alzheimer's disease (AD) and Lewy body dementia (LBD) present overlapping clinical features yet require distinct diagnostic strategies. While neuroimaging-based brain network analysis is promising, atlas-based representations may obscure individualized anatomy. Gyral folding-based networks using three-hinge gyri provide a biologically grounded alternative, but inter-individual variability in cortical folding results in inconsistent landmark correspondence and highly irregular network sizes, violating the fixed-topology and node-alignment assumptions of most existing graph learning methods, particularly in clinical datasets where pathological changes further amplify anatomical heterogeneity. We therefore propose a probability-invariant random-walk-based framework that classifies individualized gyral folding networks without explicit node alignment. Cortical similarity networks are built from local morphometric features and represented by distributions of anonymized random walks, with an anatomy-aware encoding that preserves permutation invariance. Experiments on a large clinical cohort of AD and LBD subjects show consistent improvements over existing gyral folding and atlas-based models, demonstrating robustness and potential for dementia diagnosis.


[23] 2507.00407

Augmenting Molecular Graphs with Geometries via Machine Learning Interatomic Potentials

Accurate molecular property predictions require 3D geometries, which are typically obtained using expensive methods such as density functional theory (DFT). Here, we attempt to obtain molecular geometries by relying solely on machine learning interatomic potential (MLIP) models. To this end, we first curate a large-scale molecular relaxation dataset comprising 3.5 million molecules and 300 million snapshots. We then pre-train MLIP models with supervised learning to predict energy and forces given 3D molecular structures. Once trained, we show that the pre-trained models can be used in different ways to obtain geometries either explicitly or implicitly. First, they can be used to obtain approximate low-energy 3D geometries via geometry optimization. While these geometries do not consistently reach DFT-level chemical accuracy or convergence, they can still improve downstream performance compared to non-relaxed structures. To mitigate potential biases and enhance downstream predictions, we introduce geometry fine-tuning based on the relaxed 3D geometries. Second, the pre-trained models can be directly fine-tuned for property prediction when ground truth 3D geometries are available. Our results demonstrate that MLIP pre-trained models trained on relaxation data can learn transferable molecular representations to improve downstream molecular property prediction and can provide practically valuable but approximate molecular geometries that benefit property predictions. Our code is publicly available at: this https URL
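
To make the geometry-optimization step concrete, here is a minimal sketch of force-driven relaxation by steepest descent: atoms are moved along the predicted forces until the largest force falls below a tolerance. The abstract does not say which optimizer is used, so steepest descent is an assumption, and `harmonic_forces` is a toy one-dimensional stand-in for a trained MLIP, not the paper's model.

```python
def harmonic_forces(positions, r0=1.0, k=1.0):
    """Toy stand-in for an MLIP: two atoms on a line joined by a spring.

    Energy is 0.5 * k * (d - r0)**2 with d = x1 - x0 (assumes x1 > x0).
    Returns the force on each atom (negative energy gradient).
    """
    x0, x1 = positions
    stretch = (x1 - x0) - r0
    return [k * stretch, -k * stretch]


def relax(positions, force_fn, step=0.1, tol=1e-6, max_steps=1000):
    """Steepest-descent relaxation: move each coordinate along its force."""
    pos = list(positions)
    for _ in range(max_steps):
        forces = force_fn(pos)
        if max(abs(f) for f in forces) < tol:
            break  # converged: residual forces are below tolerance
        pos = [x + step * f for x, f in zip(pos, forces)]
    return pos


# Start with the bond stretched to twice its rest length.
relaxed = relax([0.0, 2.0], harmonic_forces)
bond_length = relaxed[1] - relaxed[0]
```

In practice `force_fn` would wrap a learned model acting on 3D coordinates, and the quality of the relaxed geometry is bounded by the accuracy of its predicted forces, which is consistent with the abstract's caveat about not reaching DFT-level convergence.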


[24] 2509.15796

Monte Carlo Tree Diffusion with Multiple Experts for Protein Design

The goal of protein design is to generate amino acid sequences that fold into functional structures with desired properties. Prior methods combining autoregressive language models with Monte Carlo Tree Search (MCTS) struggle with long-range dependencies and suffer from an impractically large search space. We propose MCTD-ME, Monte Carlo Tree Diffusion with Multiple Experts, which integrates masked diffusion models with tree search to enable multi-token planning and efficient exploration under the guidance of multiple experts. Unlike autoregressive planners, MCTD-ME uses biophysical-fidelity-enhanced diffusion denoising as the rollout engine, jointly revising multiple positions and scaling to large sequence spaces. It further leverages experts of varying capacities to enrich exploration, guided by a pLDDT-based masking schedule that targets low-confidence regions while preserving reliable residues. We propose a novel multi-expert selection rule (PH-UCT-ME) that extends Shannon-entropy-based UCT to expert ensembles with mutual information. MCTD-ME achieves superior performance on the CAMEO and PDB benchmarks, excelling in protein design tasks such as inverse folding, folding, and conditional design challenges like motif scaffolding on lead optimization tasks. Our framework is model-agnostic, plug-and-play, and extensible to de novo protein engineering and beyond.
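
For readers unfamiliar with UCT, the sketch below shows the standard selection rule that tree-search methods of this kind build on: each child is scored by its mean value plus an exploration bonus that shrinks as the child is visited more often. This is plain UCT only; the entropy and mutual-information terms of PH-UCT-ME are not described in the abstract and are not reproduced here. The dictionary layout and the exploration constant are assumptions made for this example.

```python
import math


def uct_score(total_value, visits, parent_visits, c=math.sqrt(2)):
    """Standard UCT: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    exploitation = total_value / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration


def select_child(children, parent_visits, c=math.sqrt(2)):
    """Pick the child with the highest UCT score."""
    return max(
        children,
        key=lambda ch: uct_score(ch["value"], ch["visits"], parent_visits, c),
    )


children = [
    {"name": "well_explored", "value": 9.0, "visits": 10},
    {"name": "rarely_visited", "value": 1.6, "visits": 2},
    {"name": "unvisited", "value": 0.0, "visits": 0},
]
first_pick = select_child(children, parent_visits=12)
# Without the unvisited child, the exploration bonus favors the rarely
# visited node even though its mean value (0.8) is below 0.9.
second_pick = select_child(children[:2], parent_visits=12)
```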


[25] 2510.22293

Predicting Metabolic Dysfunction-Associated Steatotic Liver Disease using Machine Learning Methods: A Retrospective Cohort Study

Background and Aims: Metabolic Dysfunction-Associated Steatotic Liver Disease (MASLD) affects 30-40% of U.S. adults and is the most common chronic liver disease. Although often asymptomatic, progression can lead to cirrhosis. We developed a prediction model to assist with early detection of MASLD. Approach and Results: We evaluated LASSO logistic regression, random forest, XGBoost, and a neural network model for MASLD prediction using clinical feature subsets from a large electronic health record (EHR) database, including the top 10 ranked features. To reduce disparities in true positive rates across racial and ethnic subgroups, we applied an equal opportunity postprocessing method in a prediction model called MASLD EHR Static Risk Prediction (MASER). This retrospective cohort study included 59,492 participants in the training data, 24,198 in the validation data, and 25,188 in the testing data. The LASSO logistic regression model with the top 10 features was selected for its interpretability and comparable performance. Before fairness adjustment, the model achieved an AUROC of 0.84, accuracy of 78%, sensitivity of 72%, specificity of 79%, and F1-score of 0.617. After equal opportunity postprocessing, accuracy increased modestly to 81% and specificity to 94%, while sensitivity decreased to 41% and F1-score to 0.515, reflecting the fairness trade-off. Conclusions: MASER achieved competitive performance for MASLD prediction, comparable to previously reported ensemble and tree-based models, while using a limited and routinely collected feature set and a diverse study population. The development of MASER lends itself to ease of clinical implementation for early detection and for further integration into primary care workflows.
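
The reported trade-off can be unpacked with a short calculation. Since F1 is the harmonic mean of precision and sensitivity (recall), F1 = 2PR / (P + R), the precision implied by a given F1 and sensitivity is P = F1 * R / (2R - F1). The sketch below applies this to the figures quoted in the abstract. The resulting precision values are derived here from rounded numbers and are not reported by the authors, so they should be read as approximate.

```python
def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for precision P."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("F1 and recall are inconsistent (need F1 < 2 * recall)")
    return f1 * recall / denominator


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract (sensitivity is recall).
precision_before = implied_precision(f1=0.617, recall=0.72)
precision_after = implied_precision(f1=0.515, recall=0.41)

print(f"implied precision before adjustment: {precision_before:.3f}")
print(f"implied precision after adjustment:  {precision_after:.3f}")
```

Under these assumptions the implied precision rises from roughly 0.54 to roughly 0.69 after postprocessing, while sensitivity falls from 72% to 41%. That pattern suggests the adjusted model flags fewer patients overall, which would account for accuracy and specificity increasing even as F1 declines.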