New articles on Quantitative Biology


[1] 2504.00007

Clustering Analysis of Long-term Cardiovascular Complications in COVID-19 Patients

This study investigates long-term cardiovascular complications in COVID-19 patients using advanced clustering techniques. The objective was to analyse ECG parameters, demographic data, comorbidities, and hospitalization details to identify patterns in cardiovascular health outcomes. We applied K-means clustering and identified three distinct clusters: Cluster 0 with moderate heart rate variability and ICU admissions, Cluster 1 with lower heart rate variability and ICU admissions, and Cluster 2 with higher heart rate variability and ICU admissions, indicating higher risk profiles.
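
The K-means step mentioned above can be illustrated with a minimal sketch of Lloyd's algorithm. Everything here is an assumption made for illustration: the two toy features, the deterministic initialisation from the first k points, and the function name `kmeans`. The abstract does not state the study's features, preprocessing, or initialisation.

```python
def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm; centroids start at the first k points (illustrative choice)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical two-feature toy data; not taken from the study.
toy_points = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0), (0.2, 0.1), (5.1, 4.9), (9.8, 0.2)]
toy_labels, toy_centroids = kmeans(toy_points, k=3)
```

In practice the number of clusters and the initialisation would need to be chosen and validated on the real data; the sketch only shows the assign-then-update loop.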


[2] 2504.00009

Deep Learning-Based Hypoglycemia Classification Across Multiple Prediction Horizons

Type 1 diabetes (T1D) management can be significantly enhanced through the use of predictive machine learning (ML) algorithms, which can mitigate the risk of adverse events like hypoglycemia. Hypoglycemia, characterized by blood glucose levels below 70 mg/dL, is a life-threatening condition typically caused by excessive insulin administration, missed meals, or physical activity. Its asymptomatic nature impedes timely intervention, making ML models crucial for early detection. This study integrates short- (up to 2h) and long-term (up to 24h) prediction horizons (PHs) within a single classification model to enhance decision support. The predicted times are 5-15 min, 15-30 min, 30 min-1h, 1-2h, 2-4h, 4-8h, 8-12h, and 12-24h before hypoglycemia. In addition, a simplified model classifying up to 4h before hypoglycemia is compared. We trained ResNet and LSTM models on glucose levels, insulin doses, and acceleration data. The results demonstrate the superiority of the LSTM models when classifying nine classes. In particular, subject-specific models yielded better performance but achieved high recall only for classes 0, 1, and 2 with 98%, 72%, and 50%, respectively. A population-based six-class model improved the results with at least 60% of events detected. In contrast, longer PHs remain challenging with the current approach and may be considered with different models.


[3] 2504.00020

Celler: A Genomic Language Model for Long-Tailed Single-Cell Annotation

Recent breakthroughs in single-cell technology have opened unparalleled opportunities to decode the molecular intricacy of complex biological systems, especially those linked to diseases unique to humans. However, these advances have also introduced new obstacles, specifically the efficient annotation of extensive, long-tailed single-cell data pertaining to disease conditions. To address this challenge, we introduce Celler, a state-of-the-art generative pre-training model crafted specifically for the annotation of single-cell data. Celler incorporates two groundbreaking elements. First, we introduce the Gaussian Inflation (GInf) Loss function. By dynamically adjusting sample weights, GInf Loss significantly enhances the model's ability to learn from rare categories while reducing the risk of overfitting for common categories. Second, we introduce an innovative Hard Data Mining (HDM) strategy into the training process, specifically targeting the challenging-to-learn minority data samples, which significantly improves the model's predictive accuracy. Additionally, to further advance research in this field, we have constructed a large-scale single-cell dataset, Celler-75, which encompasses 40 million cells distributed across 80 human tissues and 75 specific diseases. This dataset provides critical support for comprehensively exploring the potential of single-cell technology in disease research. Our code is available at https://github.com/AI4science-ym/HiCeller.


[4] 2504.00028

Chaos and noise in evolutionary game dynamics

Evolutionary game theory has traditionally employed deterministic models to describe population dynamics. These models, due to their inherent nonlinearities, can exhibit deterministic chaos, where population fluctuations follow complex, aperiodic patterns. Recently, the focus has shifted towards stochastic models, quantifying fixation probabilities and analysing systems with constants of motion. Yet, the role of stochastic effects in systems with chaotic dynamics remains largely unexplored within evolutionary game theory. This study addresses how demographic noise -- arising from probabilistic birth and death events -- impacts chaotic dynamics in finite populations. We show that despite stochasticity, large populations retain a signature of chaotic dynamics, as evidenced by comparing a chaotic deterministic system with its stochastic counterpart. More concretely, the strange attractor observed in the deterministic model is qualitatively recovered in the stochastic model, where the term deterministic chaos loses its meaning. We employ tools from nonlinear dynamics to quantify how the population size influences the dynamics. We observe that for small populations, stochasticity dominates, overshadowing deterministic selection effects. However, as population size increases, the dynamics increasingly reflect the underlying chaotic structure. This resilience to demographic noise can be essential for maintaining diversity in populations, even in non-equilibrium dynamics. Overall, our results broaden our understanding of population dynamics, and revisit the boundaries between chaos and noise, showing how they maintain structure when considering finite populations in systems that are chaotic in the deterministic limit.


[5] 2504.00036

Improving Diseases Predictions Utilizing External Bio-Banks

Machine learning has been successfully used in critical domains, such as medicine. However, extracting meaningful insights from biomedical data is often constrained by the lack of available disease labels. In this research, we demonstrate how machine learning can be leveraged to enhance explainability and uncover biologically meaningful associations, even when predictive improvements in disease modeling are limited. We train LightGBM models from scratch on our dataset (10K) to impute metabolomics features and apply them to the UK Biobank (UKBB) for downstream analysis. The imputed metabolomics features are then used in survival analysis to assess their impact on disease-related risk factors. As a result, our approach successfully identified biologically relevant connections that were not previously known to the predictive models. Additionally, we applied a genome-wide association study (GWAS) on key metabolomics features, revealing a link between vascular dementia and smoking. Although this is a well-established epidemiological relationship, the link was not embedded in the model's training data, which validates the method's ability to extract meaningful signals. Furthermore, by integrating survival models as inputs in the 10K data, we uncovered associations between metabolic substances and obesity, demonstrating the ability to infer disease risk for future patients without requiring direct outcome labels. These findings highlight the potential of leveraging external bio-banks to extract valuable biomedical insights, even in data-limited scenarios. Our results demonstrate that machine learning models trained on smaller datasets can still be used to uncover real biological associations when carefully integrated with survival analysis and genetic studies.


[6] 2504.00047

EAP4EMSIG -- Enhancing Event-Driven Microscopy for Microfluidic Single-Cell Analysis

Microfluidic Live-Cell Imaging yields data on microbial cell factories. However, continuous acquisition is challenging as high-throughput experiments often lack real-time insights, delaying responses to stochastic events. We introduce three components in the Experiment Automation Pipeline for Event-Driven Microscopy to Smart Microfluidic Single-Cell Analysis: a fast, accurate Deep Learning autofocusing method predicting the focus offset, an evaluation of real-time segmentation methods, and a real-time data analysis dashboard. Our autofocusing achieves a Mean Absolute Error of 0.0226 µm with inference times below 50 ms. Among eleven Deep Learning segmentation methods, Cellpose 3 reached a Panoptic Quality of 93.58%, while a distance-based method is fastest (121 ms, Panoptic Quality 93.02%). All six Deep Learning Foundation Models were unsuitable for real-time segmentation.


[7] 2504.00052

Assessing Validity of ICD-10 Administrative Data in Coding Comorbidities

Objectives: Administrative data is commonly used to inform chronic disease prevalence and support health informatics research. This study assessed the validity of coding comorbidity in the International Classification of Diseases, 10th Revision (ICD-10) administrative data. Methods: We analyzed three chart review cohorts (4,008 patients in 2003, 3,045 in 2015, and 9,024 in 2022) in Alberta, Canada. Nurse reviewers assessed the presence of 17 clinical conditions using a consistent protocol. The reviews were linked with administrative data using unique identifiers. We compared the accuracy in coding comorbidity by ICD-10, using chart review data as the reference standard. Results: Our findings showed that the mean difference in prevalence between chart reviews and ICD-10 for these 17 conditions was 2.1% in 2003, 7.6% in 2015, and 6.3% in 2022. Some conditions were relatively stable, such as diabetes (1.9%, 2.1%, and 1.1%) and metastatic cancer (0.3%, 1.1%, and 0.4%). For these 17 conditions, the sensitivity ranged from 39.6-85.1% in 2003, 1.3-85.2% in 2015, and 3.0-89.7% in 2022. The C-statistics for predicting in-hospital mortality using comorbidities by ICD-10 were 0.84 in 2003, 0.81 in 2015, and 0.78 in 2022. Discussion: The under-coding could be primarily due to the increase of hospital patient volumes and the limited time allocated to coders. There is a potential to develop artificial intelligence methods based on electronic health records to support coding practices and improve coding quality. Conclusion: Comorbidities were increasingly under-coded over 20 years. The validity of ICD-10 decreased but remained relatively stable for certain conditions mandated for coding. The under-coding exerted minimal impact on in-hospital mortality prediction.
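
The validity measures reported above can be made concrete with a minimal sketch. It assumes each patient has one binary flag per condition from chart review (the reference standard) and one from ICD-10 data. The function names and the toy flags are hypothetical and are not taken from the study.

```python
def sensitivity(chart_review, icd10):
    """Share of chart-review positives that the ICD-10 data also flags."""
    true_pos = sum(1 for ref, coded in zip(chart_review, icd10) if ref and coded)
    ref_pos = sum(1 for ref in chart_review if ref)
    if ref_pos == 0:
        raise ValueError("no reference-standard positives for this condition")
    return true_pos / ref_pos


def prevalence_difference(chart_review, icd10):
    """Chart-review prevalence minus ICD-10 prevalence; positive values mean under-coding."""
    n = len(chart_review)
    return (sum(chart_review) - sum(icd10)) / n


# Hypothetical flags for one condition in ten patients (1 = condition recorded).
toy_chart = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_icd10 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
toy_sensitivity = sensitivity(toy_chart, toy_icd10)
toy_prev_diff = prevalence_difference(toy_chart, toy_icd10)
```

In this toy case the administrative data captures two of the four chart-review positives, so sensitivity is one half and prevalence is understated by 20 percentage points.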


[8] 2504.00055

Topological Properties of the Effective Reproduction Number in an Heterogeneous SIS Model

The present results lay the foundations for the study of the optimal allocation of vaccine in the simple epidemiological SIS model where one considers a very general heterogeneous population. In the present setting each individual has a type x belonging to a general space, and a vaccination strategy is a function $\eta$ where $\eta(x) \in [0, 1]$ represents the proportion of non-vaccinated among individuals of type x. We shall consider two loss functions associated with a vaccination strategy $\eta$: either the effective reproduction number, a classical quantity appearing in many models in epidemiology, which is given here by the spectral radius of a compact operator that depends on $\eta$; or the overall proportion of infected individuals after vaccination in the maximal endemic state. By considering the weak-* topology on the set $\Delta$ of vaccination strategies, so that it is a compact set, we can prove that those two loss functions are continuous using the notion of collective compactness for a family of operators. We also prove their stability with respect to the parameters of the SIS model. Finally, we consider their monotonicity and related properties, in particular when the model is "almost" irreducible.
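
A finite-type analogue may help fix ideas. The sketch below assumes a population with finitely many types, a nonnegative matrix `kernel` whose entry (i, j) is the expected number of infections of type i caused by one infected individual of type j, and an operator obtained by scaling column j by $\eta_j$. This discretisation, the matrix name, and the power-iteration routine are illustrative assumptions; the abstract works in a far more general setting and does not specify the operator in this form.

```python
def spectral_radius(matrix, iters=500):
    """Power iteration; adequate for the small nonnegative primitive matrices used here."""
    n = len(matrix)
    v = [1.0] * n
    value = 0.0
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        value = max(abs(x) for x in w)
        if value == 0.0:
            return 0.0
        v = [x / value for x in w]  # renormalise so the largest entry is 1
    return value


def effective_reproduction_number(kernel, eta):
    """Spectral radius of the kernel with column j scaled by the non-vaccinated share eta[j]."""
    n = len(kernel)
    scaled = [[kernel[i][j] * eta[j] for j in range(n)] for i in range(n)]
    return spectral_radius(scaled)


# Hypothetical two-type kernel.
toy_kernel = [[2.0, 1.0], [1.0, 2.0]]
re_no_vaccine = effective_reproduction_number(toy_kernel, [1.0, 1.0])
re_half_each = effective_reproduction_number(toy_kernel, [0.5, 0.5])
re_type2_fully = effective_reproduction_number(toy_kernel, [1.0, 0.0])
```

In this toy case, vaccinating half of every type halves the effective reproduction number, while fully vaccinating the second type only leaves the first type's own transmission.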


[9] 2504.00300

From Chaos to Coherence: Effects of High-Order Synaptic Correlations on Neural Dynamics

Recurrent Neural Network models have elucidated the interplay between structure and dynamics in biological neural networks, particularly the emergence of irregular and rhythmic activities in cortex. However, most studies have focused on networks with random or simple connectivity structures. Experimental observations find that high-order cortical connectivity patterns affect the temporal patterns of network activity, but a theory that relates such complex structure to network dynamics has yet to be developed. Here, we show that third- and higher-order cyclic correlations in synaptic connectivities greatly impact neuronal dynamics. Specifically, strong cyclic correlations in a network suppress chaotic dynamics, and promote oscillatory or fixed activity. The change in dynamics is related to the form of the unstable eigenvalues of the random connectivity matrix. A phase transition from chaotic to fixed or oscillatory activity coincides with the development of a cusp at the leading edge of the eigenvalue support. We also relate the dimensions of activity to the network structure.


[10] 2504.00334

Pharmacokinetic characteristics of Jinhong tablets in normal, chronic superficial gastritis and intestinal microbial disorder rats

Jinhong tablet (JHT), a traditional Chinese medicine made from four herbs, effectively treats chronic superficial gastritis (CSG) by soothing the liver, relieving depression, regulating qi, and promoting blood circulation. However, its pharmacokinetics are underexplored. This study investigates JHT's pharmacokinetics in normal rats and its differences in normal, CSG, and intestinal microbial disorder rats. A quantitative method for seven active ingredients in rat plasma was established using UPLC-TQ-MS/MS. After administering various JHT doses, plasma concentrations were measured to assess pharmacokinetics in normal rats. The pharmacokinetics of four main ingredients were compared in normal, CSG, and fecal microbiota transplantation (FMT) rats. Intestinal microbial changes were evaluated by high-throughput sequencing. Spearman correlation analysis linked ingredient exposure to gut microbiota disturbances. The method showed good linearity, precision, accuracy, extraction recovery, and stability. In normal rats, all seven ingredients were rapidly absorbed. Tetrahydropalmatine, corydaline, costunolide, and rhamnosylvitexin had good exposure, while dehydrocorydaline, allocryptopine, and palmatine hydrochloride had low exposure. Tetrahydropalmatine, corydaline, and costunolide followed linear pharmacokinetics (AUC0-t, Cmax) at doses of 0.7-5.6 g/kg, while rhamnosylvitexin and dehydrocorydaline showed linearity at 0.7-2.8 g/kg. In CSG and FMT rats, pharmacokinetic differences were observed. CSG enhanced costunolide exposure and Cmax, and increased rhamnosylvitexin exposure. FMT raised corydaline exposure and rhamnosylvitexin Cmax, linked to 20 bacterial genera.


[11] 2504.00488

Dynamical model-based experiment design for drug repositioning

Computational methods in drug repositioning can help to conserve resources. In particular, methods based on biological networks are showing promise. Considering only the network topology and knowledge on drug target genes is not sufficient for quantitative predictions or predictions involving drug combinations. We propose an iterative procedure alternating between system identification and drug response experiments. Data from experiments are used to improve the model and drug effect knowledge, which is then used to select drugs for the next experiments. Using simulated data, we show that the procedure can identify nearly optimal drug combinations.


[12] 2504.00572

Group centrality in optimal and suboptimal vaccination for epidemic models in contact networks

The pursuit of strategies that minimize the number of individuals needing vaccination to control an outbreak is a well-established area of study in mathematical epidemiology. However, when vaccines are in short supply, public policy tends to prioritize immunizing vulnerable individuals over epidemic control. As a result, optimal vaccination strategies may not always be effective in supporting real-world public policies. In this work, we focus on a disease that results in long-term immunity and spreads through a heterogeneous population, represented by a contact network. We study four well-known group centrality measures and show that the GED-Walk offers a reliable means of estimating the impact of vaccinating specific groups of individuals, even in suboptimal cases. Additionally, we depart from the search for target individuals to be vaccinated and provide proxies for identifying optimal groups for vaccination. While the GED-Walk is the most useful centrality measure for suboptimal cases, the betweenness (a related, but different centrality measure) stands out when looking for optimal groups. This indicates that optimal vaccination is not concerned with breaking the largest number of transmission routes, but interrupting geodesic ones.


[13] 2504.00575

The Effect of Assortativity on Mpox Spreading with Two Core Groups

The spread of infectious diseases often concentrates within specific subgroups of a broader population. For instance, during recent mpox outbreaks in non-endemic countries, transmission primarily affected men who have sex with men (MSM). However, the internal structure of these subpopulations plays a crucial role in disease dynamics and should be accurately represented in mathematical models. In this study, we highlight the importance of modeling interactions between distinct subgroups and their impact on transmission patterns. We consider a stochastic SEIR-based model with two core groups embedded into the general population, and investigate the outcome of the outbreak with different levels of symmetry between these groups and assortativity in their contacts. Our results indicate that the efficiency of commonly used non-pharmaceutical interventions is greatly influenced by these factors, hence they should be considered in the design of intervention strategies.


[14] 2504.00764

A DNA-Centric Mechanism for Protein Targeting in 6mA Methylation

How DNA-binding proteins locate specific genomic targets remains a central challenge in molecular biology. Traditional protein-centric approaches, which rely on wet-lab experiments and visualization techniques, often lack genome-wide resolution and fail to capture physiological dynamics in living cells. Here, we introduce a DNA-centric strategy that leverages in vivo N6-methyladenine (6mA) data to decode the logic of protein-DNA recognition. By integrating linguistically inspired modeling with machine learning, we reveal two distinct search modes: a protein-driven diffusion mechanism and a DNA sequence-driven mechanism, wherein specific motifs function as protein traps. We further reconstruct high-resolution interaction landscapes at the level of individual sequences and trace the evolutionary trajectories of recognition motifs across species. This framework addresses fundamental limitations of protein-centered approaches and positions DNA itself as an intrinsic reporter of protein-binding behavior.


[15] 2504.00872

Bioelectrical Interfaces Beyond Cellular Excitability: Cancer, Aging, and Gene Expression Reprogramming

Bioelectrical interfaces represent a significant evolution in the intersection of nanotechnology and biophysics, offering new strategies for probing and influencing cellular processes. These systems capitalize on the subtle but powerful electric fields within living matter, potentially enabling applications beyond cellular excitability, ranging from targeted cancer therapies to interventions in genetic mechanisms and aging. This perspective article envisions the translation, development and application of next-generation solid-state bioelectrical interfaces and their transformative impact across several critical areas of medical research.


[16] 2504.00146

Why risk matters for protein binder design

Bayesian optimization (BO) has recently become more prevalent in protein engineering applications and hence has become a fruitful target of benchmarks. However, current BO comparisons often overlook real-world considerations like risk and cost constraints. In this work, we compare 72 model combinations of encodings, surrogate models, and acquisition functions on 11 protein binder fitness landscapes, specifically from this perspective. Drawing from the portfolio optimization literature, we adopt metrics to quantify the cold-start performance relative to a random baseline, to assess the risk of an optimization campaign, and to calculate the overall budget required to reach a fitness threshold. Our results suggest the existence of Pareto-optimal models on the risk-performance axis, the shift of this preference depending on the landscape explored, and the robust correlation between landscape properties such as epistasis with the average and worst-case model performance. They also highlight that rigorous model selection requires substantial computational and statistical efforts.


[17] 2504.00232

Opportunistic Screening for Pancreatic Cancer using Computed Tomography Imaging and Radiology Reports

Pancreatic ductal adenocarcinoma (PDAC) is a highly aggressive cancer, with most cases diagnosed at stage IV and a five-year overall survival rate below 5%. Early detection and prognosis modeling are crucial for improving patient outcomes and guiding early intervention strategies. In this study, we developed and evaluated a deep learning fusion model that integrates radiology reports and CT imaging to predict PDAC risk. The model achieved a concordance index (C-index) of 0.6750 (95% CI: 0.6429, 0.7121) and 0.6435 (95% CI: 0.6055, 0.6789) on the internal and external dataset, respectively, for 5-year survival risk estimation. Kaplan-Meier analysis demonstrated significant separation (p<0.0001) between the low and high risk groups predicted by the fusion model. These findings highlight the potential of deep learning-based survival models in leveraging clinical and imaging data for pancreatic cancer.
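
For readers unfamiliar with the concordance index, the following is a minimal sketch of Harrell's C-index for right-censored data. The abstract does not say which estimator was used, so this choice is an assumption, and the sketch ignores tied event times. The function name and toy values are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs in which the earlier event has the higher risk score."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical follow-up times, event indicators (1 = event, 0 = censored), and risk scores.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.4, 0.5, 0.1]
toy_c_index = concordance_index(toy_times, toy_events, toy_risks)
```

Here four of the five comparable pairs are ordered correctly, giving 0.8. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.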


[18] 2504.00306

LOCO-EPI: Leave-one-chromosome-out (LOCO) as a benchmarking paradigm for deep learning based prediction of enhancer-promoter interactions

In mammalian and vertebrate genomes, the promoter regions of the gene and their distal enhancers may be located millions of base-pairs from each other, while a promoter may not interact with the closest enhancer. Since base-pair proximity is not a good indicator of these interactions, there is considerable work toward developing methods for predicting Enhancer-Promoter Interactions (EPI). Several machine learning methods have reported increasingly higher accuracies for predicting EPI. Typically, these approaches randomly split the dataset of Enhancer-Promoter (EP) pairs into training and testing subsets followed by model training. However, this random splitting causes information leakage by assigning EP pairs from the same genomic region to both testing and training sets, leading to performance overestimation. In this paper we propose to use a more thorough training and testing paradigm, i.e., Leave-one-chromosome-out (LOCO) cross-validation, for EPI prediction. We demonstrate that a deep learning algorithm, which gives higher accuracies when trained and tested in the random-splitting setting, drops drastically in performance under the LOCO setting, confirming overestimation of performance. We further propose a novel hybrid deep neural network for EPI prediction that fuses k-mer features of the nucleotide sequence. We show that the hybrid architecture performs significantly better in the LOCO setting, demonstrating it can learn more generalizable aspects of EP interactions. With this paper we are also releasing the LOCO splitting-based EPI dataset. Research data is available in this public repository: https://github.com/malikmtahir/EPI
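
The splitting scheme described above can be sketched as follows, assuming each EP pair is a record carrying the chromosome it lies on. The `chrom` field, the `loco_splits` function, and the toy records are hypothetical and do not reflect the layout of the released dataset.

```python
from collections import defaultdict


def loco_splits(pairs):
    """Yield (held_out_chromosome, train_indices, test_indices), one fold per chromosome."""
    by_chrom = defaultdict(list)
    for idx, pair in enumerate(pairs):
        by_chrom[pair["chrom"]].append(idx)
    for chrom in sorted(by_chrom):
        test_idx = by_chrom[chrom]
        # Train on every pair that does not lie on the held-out chromosome.
        train_idx = [i for c, idxs in by_chrom.items() if c != chrom for i in idxs]
        yield chrom, train_idx, test_idx


# Hypothetical EP pairs; labels mark interacting (1) or non-interacting (0) pairs.
toy_pairs = [
    {"chrom": "chr1", "label": 1},
    {"chrom": "chr1", "label": 0},
    {"chrom": "chr2", "label": 1},
    {"chrom": "chrX", "label": 0},
]
toy_folds = list(loco_splits(toy_pairs))
```

Because no chromosome contributes pairs to both sides of a fold, pairs from the same genomic region cannot leak between training and testing, which is the point of the paradigm.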


[19] 2504.00459

Graphical Models and Efficient Inference Methods for Multivariate Phase Probability Distributions

Multivariate phase relationships are important to characterize and understand numerous physical, biological, and chemical systems, from electromagnetic waves to neural oscillations. These systems exhibit complex spatiotemporal dynamics and intricate interdependencies among their constituent elements. While classical models of multivariate phase relationships, such as the wave equation and Kuramoto model, give theoretical models to describe phenomena, the development of statistical tools for hypothesis testing and inference for multivariate phase relationships in complex systems remains limited. This paper introduces a novel probabilistic modeling framework to characterize multivariate phase relationships, with wave-like phenomena serving as a key example. This approach describes spatial patterns and interactions between oscillators through a pairwise exponential family distribution. Building upon the literature of graphical model inference, including methods like Ising models, graphical lasso, and interaction screening, this work bridges the gap between classical wave dynamics and modern statistical approaches. Efficient inference methods are introduced, leveraging the Chow-Liu algorithm for directed tree approximations and interaction screening for general graphical models. Simulated experiments demonstrate the utility of these methods for uncovering wave properties and sparse interaction structures, highlighting their applicability to diverse scientific domains. This framework establishes a new paradigm for statistical modeling of multivariate phase relationships, providing a powerful toolset for exploring the complexity of these systems.


[20] 2504.00635

Coconvex characters on collections of phylogenetic trees

In phylogenetics, a key problem is to construct evolutionary trees from collections of characters where, for a set X of species, a character is simply a function from X onto a set of states. In this context, a key concept is convexity, where a character is convex on a tree with leaf set X if the collection of subtrees spanned by the leaves of the tree that have the same state are pairwise disjoint. Although collections of convex characters on a single tree have been extensively studied over the past few decades, very little is known about coconvex characters, that is, characters that are simultaneously convex on a collection of trees. As a starting point to better understand coconvexity, in this paper we prove a number of extremal results for the following question: What is the minimal number of coconvex characters on a collection of n-leaved trees taken over all collections of size t >= 2, also if we restrict to coconvex characters which map to k states? As an application of coconvexity, we introduce a new one-parameter family of tree metrics, which range between the coarse Robinson-Foulds distance and the much finer quartet distance. We show that bounds on the quantities in the above question translate into bounds for the diameter of the tree space for the new distances. Our results open up several new interesting directions and questions which have potential applications to, for example, tree spaces and phylogenomics.


[21] 2504.00670

Oscillation in the SIRS model

We study the SIRS epidemic model, both analytically and on a square lattice. The analytic model has two stable solutions, post outbreak/epidemic (no infected, $I=0$) and the endemic state (constant number of infected: $I>0$). When the model is implemented with noise, or on a lattice, a third state is possible, featuring regular oscillations. This is understood as a cycle of boom and bust, where an epidemic sweeps through, and dies out leaving a small number of isolated infecteds. As immunity wanes, herd immunity is lost throughout the population and the epidemic repeats. The key result is that the oscillation is an intrinsic feature of the system itself, not driven by external factors such as seasonality or behavioural changes. The model shows that non-seasonal oscillations, such as those observed for the omicron COVID variant, need no additional explanation such as the appearance of more infectious variants at regular intervals or coupling to behaviour. We infer that the loss of immunity to the SARS-CoV-2 virus occurs on a timescale of about ten weeks.
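
The deterministic picture can be illustrated with a minimal sketch that assumes the standard mean-field SIRS equations for population fractions: dS/dt = -beta*S*I + omega*R, dI/dt = beta*S*I - gamma*I, dR/dt = gamma*I - omega*R. The abstract does not state the equations or parameter values, so the rates below and the forward-Euler integrator are illustrative assumptions.

```python
def simulate_sirs(beta, gamma, omega, s0, i0, dt=0.1, steps=20000):
    """Forward-Euler integration of the mean-field SIRS equations for population fractions."""
    s, i, r = s0, i0, 1.0 - s0 - i0
    for _ in range(steps):
        infections = beta * s * i   # S -> I
        recoveries = gamma * i      # I -> R
        waning = omega * r          # R -> S as immunity is lost
        s += dt * (waning - infections)
        i += dt * (infections - recoveries)
        r += dt * (recoveries - waning)
    return s, i, r


# Hypothetical rates chosen so that beta > gamma and the endemic state exists.
beta, gamma, omega = 0.5, 0.2, 0.05
s_end, i_end, r_end = simulate_sirs(beta, gamma, omega, s0=0.99, i0=0.01)

# Endemic fixed point of the mean-field equations.
s_star = gamma / beta
i_star = omega * (1.0 - s_star) / (gamma + omega)
```

With these rates the deterministic trajectory spirals into the endemic state through damped oscillations. The sustained oscillations described in the abstract arise only once noise or lattice structure is added, which this sketch does not include.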