New articles on Quantitative Biology


[1] 2406.06726

The Price of Cognition and Replicator Equations in Parallel Neural Networks

In this paper, we propose a novel mathematical model of the dynamics of synaptic damage in terms of concentrations of toxic neuropeptides/neurotransmitters during neurotransmission processes. Our primary objective is to apply Wardrop's first and second principles within a neural network of the brain. To incorporate both principles comprehensively into the neural network of the brain, we introduce two novel concepts: \textit{neuropeptide's (neurotransmitter's) equilibrium} and \textit{synapses optimum}. The \textit{neuropeptide/neurotransmitter equilibrium} refers to \textit{a distribution of toxic neuropeptides/neurotransmitters that leads to uniform damage across all synaptic links}. Meanwhile, the \textit{synapses optimum} is \textit{the most desirable distribution of toxic neuropeptides/neurotransmitters, namely the one that minimizes the cumulative damage experienced by all synapses}. In the context of a neural network within the brain, the analogue of the price of anarchy is \textit{the price of cognition}, which is \textit{the worst-case ratio between the overall impairment caused by the toxic neuropeptide's (neurotransmitter's) equilibrium and that of the optimal state of the synapses (the synapses optimum)}. Put differently, \textit{the price of cognition} measures \textit{the loss of cognitive ability resulting from increased concentrations of toxic neuropeptides/neurotransmitters}. Additionally, a replicator equation is proposed within this framework that leads to the establishment of the synapses optimum during the neurotransmission process.
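The price-of-anarchy analogy can be made concrete on a two-link toy network. A minimal sketch, using the classic Pigou-style damage functions as a stand-in (the functions and the grid-search "optimum" are purely illustrative, not the paper's model):

```python
# Toy "price of cognition" as a price of anarchy on two parallel synaptic
# links, in the spirit of Wardrop's principles. The per-unit damage
# functions are the textbook Pigou example, not taken from the paper.

def total_damage(x, d1, d2):
    """Cumulative damage when a fraction x of the toxic load uses link 1."""
    return x * d1(x) + (1 - x) * d2(1 - x)

d1 = lambda share: 1.0        # link 1: constant per-unit damage
d2 = lambda share: share      # link 2: per-unit damage grows with its load

# Wardrop equilibrium: the whole load uses link 2 (its per-unit damage never
# exceeds link 1's), so no unit can lower its damage by switching links.
equilibrium_damage = total_damage(0.0, d1, d2)

# Synapses optimum: the split minimizing cumulative damage (grid search).
optimum_damage = min(total_damage(i / 1000, d1, d2) for i in range(1001))

price_of_cognition = equilibrium_damage / optimum_damage
print(price_of_cognition)     # 4/3 for this pair of damage functions
```

The equilibrium equalizes damage across used links (cost 1.0), while the optimum splits the load evenly (cost 0.75), giving the familiar 4/3 gap.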


[2] 2406.06731

The Imaging Database for Epilepsy And Surgery (IDEAS)

Magnetic resonance imaging (MRI) is a crucial tool to identify brain abnormalities in a wide range of neurological disorders. In focal epilepsy, MRI is used to identify structural cerebral abnormalities. For covert lesions, machine learning and artificial intelligence algorithms may improve lesion detection when abnormalities are not evident on visual inspection. The success of this approach depends on the volume and quality of training data. Herein, we release an open-source dataset of preprocessed MRI scans, together with detailed demographic information, from 442 individuals with drug-refractory focal epilepsy who had neurosurgical resections. The MRI scan data include the preoperative 3D T1 and, where available, 3D FLAIR images, as well as a manually inspected complete surface reconstruction and volumetric parcellations. Demographic information includes age, sex, age of onset of epilepsy, location of surgery, histopathology of the resected specimen, occurrence and frequency of focal seizures with and without impairment of awareness and of focal to bilateral tonic-clonic seizures, number of anti-seizure medications (ASMs) at the time of surgery, and a total of 1764 patient-years of post-surgical follow-up. Crucially, we also include resection masks delineated from post-surgical imaging. To demonstrate the veracity of our data, we successfully replicated previous studies showing long-term seizure-freedom outcomes in the range of around 50%. Our imaging data replicate findings of group-level atrophy in patients compared to controls. Resection locations in the cohort were predominantly in the temporal and frontal lobes. We envisage that our dataset, shared openly with the community, will catalyse the development and application of computational methods in clinical neurology.


[3] 2406.06765

Classical Myelo-Proliferative Neoplasms emergence and development based on real life incidence and mathematical modeling

Mathematical modeling offers the opportunity to test hypotheses concerning Myeloproliferative Neoplasm (MPN) emergence and development. We tested different mathematical models based on a training cohort (n=264 patients; Registre de la c\^ote d'Or) to determine the emergence and evolution times before JAK2V617F classical Myeloproliferative disorders (Polycythemia Vera (PV) and Essential Thrombocythemia (ET), respectively) are diagnosed. We dissected the time before diagnosis into two main periods: the time, starting from embryonic development, for the JAK2V617F mutation to occur, persist, and enter proliferation, and a second period corresponding to the expansion of the clonal population until diagnosis. Using progressively more complex models, we demonstrate that the rate of active mutation occurrence is not constant and does not merely reflect individual variability, but rather increases with age, taking a median time of 63.1+/-13 years. By contrast, the expansion time can be considered constant: 8.8 years once the mutation has emerged. Results were validated in an external cohort (national FIMBANK cohort, n=1248 patients). Comparing JAK2V617F Essential Thrombocythemia with Polycythemia Vera, we noticed that the first period (the rate of active homozygous mutation occurrence) takes approximately 1.5 years longer for PV than for ET, while the expansion times are nearly identical. In conclusion, our multi-step approach and the resulting time-dependent model of MPN emergence and development demonstrate that the emergence of a JAK2V617F mutation is linked to an aging mechanism, and indicate an 8-9 year period to develop a full MPN.
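The two-period structure (age-increasing emergence rate, then a constant expansion) can be sketched as a small Monte Carlo. The Weibull hazard and its parameters below are assumptions chosen only to exhibit an increasing rate with age; they are not the paper's fitted model:

```python
import random

random.seed(0)

# Monte Carlo sketch of a two-period picture of MPN development: an
# age-increasing rate of active JAK2V617F mutation occurrence (modeled here,
# as an assumption, by a Weibull hazard with shape K > 1), followed by a
# constant ~8.8-year clonal expansion until diagnosis.
K, SCALE = 3.0, 71.0            # illustrative Weibull shape and scale (years)
EXPANSION_YEARS = 8.8           # constant expansion time reported in the paper

def diagnosis_age():
    emergence = random.weibullvariate(SCALE, K)   # age at mutation emergence
    return emergence + EXPANSION_YEARS

ages = sorted(diagnosis_age() for _ in range(10_000))
median_age = ages[len(ages) // 2]
print(round(median_age, 1))     # median age at diagnosis under these assumptions
```

Under these illustrative parameters the median diagnosis age is the Weibull median plus the fixed 8.8-year expansion, i.e. roughly in the low 70s.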


[4] 2406.06801

Overcoming Limitations in Artificial Intelligence-based Prostate Cancer Detection through Better Datasets and a Bayesian Approach to Aggregate Panel Predictions

Despite considerable progress in developing artificial intelligence (AI) algorithms for prostate cancer detection from whole-slide images, the clinical applicability of these models remains limited due to variability in pathological annotations and the limitations of existing datasets. This article proposes a novel approach to overcome these challenges by leveraging a Bayesian framework to seamlessly integrate new data and to present results as a panel of annotations. The framework is demonstrated by integrating a Bayesian prior with one trained AI model to generate a distribution of Gleason patterns for each pixel of an image. We show that using this distribution of Gleason patterns, rather than a single ground-truth label, can improve model applicability, mitigate errors, and highlight areas of interest for pathologists. Additionally, we present a high-quality, hand-curated dataset of prostate histopathological images annotated at the gland level by trained pre-medical students and verified by an expert pathologist. We highlight the potential of this adaptive and uncertainty-aware framework for developing clinically deployable AI tools that can support pathologists in accurate prostate cancer grading, improve diagnostic accuracy, and improve patient outcomes.


[5] 2406.06826

Experimental Measurement of Assembly Indices are Required to Determine The Threshold for Life

Assembly Theory (AT) was developed to help distinguish living from non-living systems. The theory is simple: it posits that the amount of selection, or Assembly, is a function of the number of complex objects, whose complexity can be objectively determined using assembly indices. The assembly index of a given object relates to the number of recursive joining operations required to build that object; it can not only be rigorously defined mathematically but also be experimentally measured. In previous work we outlined not only the theoretical basis but also extensive experimental measurements that demonstrated the predictive power of AT. These measurements showed that there is a threshold in assembly indices for organic molecules, whereby abiotic chemical systems could not randomly produce molecules with an assembly index greater than or equal to 15. In a recent paper by Hazen et al. [1], the authors not only confused the concept of AT with the algorithms used to calculate assembly indices, but also attempted to falsify AT by calculating theoretical assembly indices for objects made from inorganic building blocks. A fundamental misunderstanding made by the authors is that the threshold is a requirement of the theory, rather than an experimental observation. This means that the exploration of inorganic assembly indices similarly requires experimental observation, correlated with the theoretical calculations. Then and only then can the exploration of complex inorganic molecules be done using AT, and the threshold for living systems, as expressed with such building blocks, be determined. Since Hazen et al. [1] present no experimental measurements of assembly indices, their analysis is not falsifiable.


[6] 2406.06897

Bifurcations and multistability in empirical mutualistic networks

Individual species may experience diverse outcomes, from prosperity to extinction, in an ecological community subject to external and internal variations. Despite the wealth of theoretical results derived from random matrix ensembles, a theoretical framework for understanding species-level dynamical heterogeneity within a given community remains to be developed, hampering the theoretical assessment and management of real-world ecosystems. Here, we consider empirical plant-pollinator mutualistic networks, additionally including all-to-all intragroup competition, where species abundance evolves under a Lotka-Volterra-type equation. Setting the strengths of competition and mutualism to be uniform, we investigate how individual species persist or go extinct as the interaction strengths vary. By employing bifurcation theory in tandem with numerical continuation, we elucidate the transcritical bifurcations underlying species extinction and demonstrate that the Hopf bifurcation of unfeasible equilibria and degenerate transcritical bifurcations give rise to multistability, i.e., the coexistence of multiple attracting feasible equilibria. These bifurcations allow us to partition the parameter space into different regimes, each with a distinct set of extinct species, offering insights into how interspecific interactions generate one or multiple extinction scenarios within an ecological network.
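The dynamical setup can be sketched numerically. A minimal, illustrative version with a toy plant-pollinator incidence matrix, uniform mutualism and intragroup competition, and all intrinsic growth rates set to 1 (none of it the paper's data):

```python
import numpy as np

# Lotka-Volterra-type dynamics on a toy plant-pollinator network with
# uniform mutualism (gamma) between groups and uniform all-to-all
# competition (c) within groups. Network and parameters are illustrative.
A = np.array([[1, 1, 0],        # plants x pollinators incidence (toy)
              [1, 0, 1]])

def simulate(gamma, c, steps=20000, dt=1e-3):
    n_p, n_a = A.shape
    x = np.full(n_p + n_a, 0.5)             # abundances: plants then animals
    M = np.zeros((n_p + n_a, n_p + n_a))
    M[:n_p, n_p:] = gamma * A               # mutualism between groups
    M[n_p:, :n_p] = gamma * A.T
    M[:n_p, :n_p] = -c                      # intragroup competition
    M[n_p:, n_p:] = -c
    np.fill_diagonal(M, -1.0)               # self-regulation
    for _ in range(steps):
        x += dt * x * (1.0 + M @ x)         # dx_i/dt = x_i (1 + sum_j M_ij x_j)
        x = np.clip(x, 0.0, 10.0)
    return x

weak = simulate(gamma=0.1, c=0.2)
print(weak)         # weak interactions: all species settle at a feasible equilibrium
```

Varying `gamma` and `c` in such a simulation and tracking which components of the equilibrium hit zero is the brute-force counterpart of the bifurcation-theoretic partition of parameter space described in the abstract.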


[7] 2406.06969

Data mining method of single-cell omics data to evaluate a pure tissue environmental effect on gene expression level

While single-cell RNA-seq enables the investigation of cell-type effects on the transcriptome, the effect of the pure tissue environment has not been well investigated. The biased combination of tissue and cell type in the body makes it difficult to evaluate the effect of the pure tissue environment by omics data mining. It is important to prevent statistical confounding among discrete variables such as cell type, tissue, and other categorical variables when evaluating the effects of these variables. We propose a novel method to enumerate suitable analysis units of variables for estimating the effects of the tissue environment, by extending the maximal biclique enumeration problem for bipartite graphs to $k$-partite hypergraphs. We applied the proposed method to Tabula Muris Senis, a large mouse single-cell transcriptome dataset, to evaluate pure tissue environmental effects on gene expression. Data mining using the proposed method revealed pure tissue-environment effects on gene expression and their age-related changes among adipose sub-tissues. The method proposed in this study facilitates evaluating the effects of discrete variables in exploratory data mining of large-scale genomics datasets.
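The bipartite base case of the combinatorial problem can be sketched directly: enumerate maximal bicliques of a cell-type-by-tissue incidence graph, i.e. maximal sets of cell types observed in exactly the same set of tissues. Brute force is fine at toy scale (the paper's contribution is the extension to $k$-partite hypergraphs); the edge set below is invented for illustration:

```python
from itertools import combinations

# Maximal bicliques of a toy cell-type x tissue incidence graph by brute
# force. Each maximal biclique is a confounding-free analysis unit: a set of
# cell types all observed in the same set of tissues.
edges = {("T cell", "lung"), ("T cell", "spleen"),
         ("B cell", "lung"), ("B cell", "spleen"),
         ("adipocyte", "fat")}
left = sorted({u for u, _ in edges})
right = sorted({v for _, v in edges})

def common_neighbors(cells):
    """Tissues in which every cell type of `cells` is observed."""
    return {t for t in right if all((c, t) in edges for c in cells)}

bicliques = set()
for r in range(1, len(left) + 1):
    for cells in combinations(left, r):
        tissues = common_neighbors(cells)
        if not tissues:
            continue
        # Grow the left side as far as possible to make the biclique maximal.
        full = tuple(sorted(c for c in left
                            if all((c, t) in edges for t in tissues)))
        bicliques.add((full, tuple(sorted(tissues))))

print(bicliques)
```

On this toy graph the enumeration yields two maximal bicliques: {B cell, T cell} x {lung, spleen} and {adipocyte} x {fat}; within the first, tissue effects can be estimated free of cell-type confounding.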


[8] 2406.06985

pVACview: an interactive visualization tool for efficient neoantigen prioritization and selection

Neoantigen targeting therapies including personalized vaccines have shown promise in the treatment of cancers. Accurate identification/prioritization of neoantigens is highly relevant to designing clinical trials, predicting treatment response, and understanding mechanisms of resistance. With the advent of massively parallel sequencing technologies, it is now possible to predict neoantigens based on patient-specific variant information. However, numerous factors must be considered when prioritizing neoantigens for use in personalized therapies. Complexities such as alternative transcript annotations, various binding, presentation and immunogenicity prediction algorithms, and variable peptide lengths/registers all potentially impact the neoantigen selection process. While computational tools generate numerous algorithmic predictions for neoantigen characterization, results from these pipelines are difficult to navigate and require extensive knowledge of the underlying tools for accurate interpretation. Due to the intricate nature and number of salient neoantigen features, presenting all relevant information to facilitate candidate selection for downstream applications is a difficult challenge that current tools fail to address. We have created pVACview, the first interactive tool designed to aid in the prioritization and selection of neoantigen candidates for personalized neoantigen therapies. pVACview has a user-friendly and intuitive interface where users can upload, explore, select and export their neoantigen candidates. The tool allows users to visualize candidates using variant, transcript and peptide information. pVACview will allow researchers to analyze and prioritize neoantigen candidates with greater efficiency and accuracy in basic and translational settings. The application is available as part of the pVACtools pipeline at pvactools.org and as an online server at pvacview.org.


[9] 2406.07148

Astrocytic NMDA Receptors Modulate the Dynamics of Continuous Attractors

Neuronal networking supports complex brain functions, with neurotransmitters facilitating communication through chemical synapses. The release probability of neurotransmitters varies and is influenced by pre-synaptic neuronal activity. Recent findings suggest that blocking astrocytic N-Methyl-D-Aspartate (NMDA) receptors reduces this variation. However, the theoretical implications of this reduction on neuronal dynamics have not been thoroughly investigated. Utilizing continuous attractor neural network (CANN) models with short-term synaptic depression (STD), we explore the effects of reduced release probability variation. Our results show that blocking astrocytic NMDA receptors stabilizes attractor states and diminishes their mobility. These insights enhance our understanding of NMDA receptors' role in astrocytes and their broader impact on neural computation and memory, with potential implications for neurological conditions involving NMDA receptor antagonists.


[10] 2406.07249

Are Protein Language Models Compute Optimal?

While protein language models (pLMs) have transformed biological research, the scaling laws governing their improvement remain underexplored. By adapting methodologies from NLP scaling laws, we investigated the optimal ratio between model parameters and training tokens within a fixed compute budget. Our study reveals that pLM sizes scale sublinearly with compute budget, showing diminishing returns in performance as model size increases, and we identify a performance plateau in training loss comparable to the one found in relevant works in the field. Our findings suggest that widely-used pLMs might not be compute-optimal, indicating that larger models could achieve convergence more efficiently. Training a 35M model on a reduced token set, we attained perplexity results comparable to larger models like ESM-2 (15B) and xTrimoPGLM (100B) with a single dataset pass. This work paves the way towards more compute-efficient pLMs, democratizing their training and practical application in computational biology.
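The core quantity in such a scaling-law analysis is the fitted exponent relating compute budget to loss-minimizing model size. A sketch of the fit, with synthetic placeholder points (not the paper's measurements); "scales sublinearly" means the exponent comes out below 1:

```python
import numpy as np

# Fit a power law N_opt ~ C^a between compute budget C and the
# loss-minimizing model size N via linear regression in log-log space.
C = np.array([1e18, 1e19, 1e20, 1e21])     # compute budgets in FLOPs (synthetic)
N = np.array([8e6, 35e6, 150e6, 650e6])    # optimal parameter counts (synthetic)

a, log_b = np.polyfit(np.log(C), np.log(N), 1)   # slope = exponent a
print(f"N_opt ~ C^{a:.2f}")                      # a < 1: sublinear scaling
```

With an exponent below 1, a growing share of each larger compute budget goes to training tokens rather than parameters, which is the regime in which a small model trained long enough can match a much larger one.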


[11] 2406.07269

The geometry of efficient codes: how rate-distortion trade-offs distort the latent representations of generative models

Living organisms rely on internal models of the world to act adaptively. These models cannot encode every detail and hence need to compress information. From a cognitive standpoint, information compression can manifest as a distortion of latent representations, resulting in the emergence of representations that may not accurately reflect the external world or its geometry. Rate-distortion theory formalizes the optimal way to compress information, by considering factors such as capacity limitations and the frequency and utility of stimuli. However, while this theory explains why the above factors distort latent representations, it does not specify which specific distortions they produce. To address this question, here we systematically explore the geometry of the latent representations that emerge in generative models that operate under the principles of rate-distortion theory ($\beta$-VAEs). Our results highlight that three main classes of distortions of internal representations -- prototypization, specialization, orthogonalization -- emerge as signatures of information compression, under constraints on capacity, data distributions and tasks. These distortions can coexist, giving rise to a rich landscape of latent spaces, whose geometry can differ significantly across generative models subject to different constraints. Our findings help explain how the normative constraints of rate-distortion theory distort the geometry of latent representations in generative models of artificial systems and living organisms.
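The rate-distortion trade-off in a $\beta$-VAE is controlled by a single objective: distortion (reconstruction error) plus $\beta$ times rate (the KL divergence of the approximate posterior from a standard-normal prior). A self-contained numpy sketch with a diagonal-Gaussian posterior; the inputs are dummy values:

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def beta_vae_loss(x, x_recon, mu, log_var, beta):
    distortion = np.sum((x - x_recon) ** 2)      # reconstruction error
    rate = kl_diag_gaussian(mu, log_var)         # information rate
    return distortion + beta * rate

x = np.array([1.0, -0.5])
x_recon = np.array([0.9, -0.4])
mu, log_var = np.array([0.2, -0.1]), np.array([-0.5, 0.1])

# A larger beta weights the rate term more heavily, forcing stronger
# compression -- and, per the abstract, more distorted latent geometry.
loss_b1 = beta_vae_loss(x, x_recon, mu, log_var, beta=1.0)
loss_b4 = beta_vae_loss(x, x_recon, mu, log_var, beta=4.0)
print(loss_b1, loss_b4)
```

Sweeping `beta` while training is precisely how capacity constraints are varied in the experiments the abstract describes.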


[12] 2406.07445

Metastability in networks of nonlinear stochastic integrate-and-fire neurons

Neurons in the brain continuously process the barrage of sensory inputs they receive from the environment. A wide array of experimental work has shown that the collective activity of neural populations encodes and processes this constant bombardment of information. How these collective patterns of activity depend on single-neuron properties is often unclear. Single-neuron recordings have shown that individual neural responses to inputs are nonlinear, which prevents a straightforward extrapolation from single-neuron features to emergent collective states. In this work, we use a field-theoretic formulation of a stochastic leaky integrate-and-fire model to study the impact of nonlinear intensity functions on macroscopic network activity. We show that the interplay between nonlinear spike emission and membrane potential resets can (i) give rise to metastable transitions between active firing-rate states, and (ii) enhance or suppress mean firing rates and membrane potentials in opposite directions.
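The single-neuron building block of such models is easy to simulate: a leaky membrane driven by input, a nonlinear intensity (escape-rate) function governing stochastic spike emission, and a reset after each spike. A sketch with illustrative parameter values (the softplus intensity is a common choice, not necessarily the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stochastic leaky integrate-and-fire neuron with a nonlinear intensity:
# a spike is emitted with probability phi(V) * dt, then V is reset.
TAU, V_RESET, DT = 20.0, 0.0, 0.1     # membrane time const (ms), reset (mV), step (ms)
I_EXT = 1.2                           # constant external drive

def phi(v):
    """Nonlinear spike-emission intensity (softplus; a soft threshold at v = 1)."""
    return 0.5 * np.log1p(np.exp(2.0 * (v - 1.0)))

v, spikes, n_steps = 0.0, 0, 50_000
for _ in range(n_steps):
    v += DT * (-v / TAU + I_EXT)                  # leaky integration
    if rng.random() < phi(v) * DT:                # stochastic spike emission
        spikes += 1
        v = V_RESET                               # membrane potential reset
rate = spikes / (n_steps * DT)                    # mean rate in spikes per ms
print(rate)
```

The interplay the abstract highlights lives in exactly these two lines: the nonlinear `phi(v)` shapes when spikes occur, and the reset repeatedly pulls the membrane potential away from the high-intensity region.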


[13] 2406.06630

A simple way to well-posedness in $H^{1}$ of a delay differential equation from cell biology

We present an application of recent well-posedness results from the theory of delay differential equations (arXiv:2308.04730) to a generalized population model for stem cell maturation. The weak, Sobolev-space approach we take allows for a larger class of initial prehistories and makes checking the requirements for well-posedness of such a model considerably easier than in previous approaches. In fact, the present approach provides a means to guarantee that the solution manifold is non-empty, which is a necessary requirement for a $C^{1}$-approach to work.
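The class of equations in question can be sketched numerically with a forward-Euler scheme carrying an explicit prehistory. The nonlinearity and parameters below are illustrative of a delayed-production population model, not taken from the paper; the paper's point is the $H^{1}$ well-posedness theory, which in particular admits much rougher prehistories than the constant one used here:

```python
# Forward-Euler sketch of a delay differential equation of the broad type
# arising in stem-cell maturation models: x'(t) = -mu*x(t) + f(x(t - tau)),
# where the delayed term stands in for the maturation time.
MU, TAU, DT = 0.5, 1.0, 0.01
N_DELAY = int(TAU / DT)

def f(x):
    """Saturating production driven by the delayed population (illustrative)."""
    return 2.0 / (1.0 + x ** 2)

# A constant prehistory on [-tau, 0]; the solution needs these tau's worth
# of past values at every step.
history = [0.1] * (N_DELAY + 1)
for _ in range(5000):
    x_now, x_delayed = history[-1], history[-1 - N_DELAY]
    history.append(x_now + DT * (-MU * x_now + f(x_delayed)))

x_final = history[-1]
print(x_final)    # settles near the positive equilibrium mu*x = f(x)
```

Note how the state at each time is the whole history segment, not a single number; that is why well-posedness must be formulated on a function space (here $H^{1}$) of prehistories.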


[14] 2406.06767

ULV: A robust statistical method for clustered data, with applications to multisubject, single-cell omics data

Molecular and genomic technological advancements have greatly enhanced our understanding of biological processes by allowing us to quantify key biological variables such as gene expression, protein levels, and microbiome compositions. These breakthroughs have enabled increasingly higher levels of resolution in our measurements, exemplified by our ability to comprehensively profile biological information at the single-cell level. However, the analysis of such data faces several critical challenges: a limited number of individuals, non-normality, potential dropouts, outliers, and repeated measurements from the same individual. In this article, we propose a novel method, which we call U-statistic-based latent variable (ULV). Our proposed method takes advantage of the robustness of rank-based statistics and exploits the statistical efficiency of parametric methods for small sample sizes. It is a computationally feasible framework that addresses all of the issues mentioned above simultaneously. An additional advantage of ULV is its flexibility in modeling various types of single-cell data, including both RNA and protein abundance. The usefulness of our method is demonstrated in two studies: a single-cell proteomics study of acute myelogenous leukemia (AML) and a single-cell RNA study of COVID-19 symptoms. In the AML study, ULV successfully identified differentially expressed proteins that would have been missed by the pseudobulk version of the Wilcoxon rank-sum test. In the COVID-19 study, ULV identified genes associated with covariates such as age and gender, and genes that would be missed without adjusting for covariates. The differentially expressed genes identified by our method are less biased toward genes with high expression levels. Furthermore, ULV identified additional gene pathways likely contributing to the mechanisms of COVID-19 severity.


[15] 2406.06768

Data-Driven Switchback Experiments: Theoretical Tradeoffs and Empirical Bayes Designs

We study the design and analysis of switchback experiments conducted on a single aggregate unit. The design problem is to partition the continuous time space into intervals and switch treatments between intervals, in order to minimize the estimation error of the treatment effect. We show that the estimation error depends on four factors: carryover effects, periodicity, serially correlated outcomes, and impacts from simultaneous experiments. We derive a rigorous bias-variance decomposition and show the tradeoffs of the estimation error from these factors. The decomposition provides three new insights in choosing a design: First, balancing the periodicity between treated and control intervals reduces the variance; second, switching less frequently reduces the bias from carryover effects while increasing the variance from correlated outcomes, and vice versa; third, randomizing interval start and end points reduces both bias and variance from simultaneous experiments. Combining these insights, we propose a new empirical Bayes design approach. This approach uses prior data and experiments for designing future experiments. We illustrate this approach using real data from a ride-sharing platform, yielding a design that reduces MSE by 33% compared to the status quo design used on the platform.
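The design space being optimized can be sketched concretely: partition the time horizon into intervals, alternate treatment and control between consecutive intervals, and randomize the switch points (which, per the decomposition, reduces bias and variance from simultaneous experiments). Horizon, target interval length, and jitter below are illustrative:

```python
import random

random.seed(1)

# Generate one switchback assignment: jittered interval boundaries over a
# fixed horizon, with treatment (1) and control (0) alternating between
# consecutive intervals and a randomized starting arm.
HORIZON, TARGET_LEN = 120, 10          # e.g. minutes

def switchback_assignment(horizon, target_len):
    """Return a list of (start, end, arm) intervals with randomized boundaries."""
    boundaries, t = [0], 0
    while t < horizon:
        t += target_len + random.randint(-2, 2)   # jittered switch point
        boundaries.append(min(t, horizon))
    first_arm = random.randint(0, 1)              # randomize the starting arm
    return [(boundaries[i], boundaries[i + 1], (first_arm + i) % 2)
            for i in range(len(boundaries) - 1)]

design = switchback_assignment(HORIZON, TARGET_LEN)
print(design)
```

Making `target_len` larger trades carryover bias against variance from serially correlated outcomes, which is the central tradeoff the decomposition quantifies.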


[16] 2406.06841

Compass: A Comprehensive Tool for Accurate and Efficient Molecular Docking in Inference and Fine-Tuning

While there has been discussion about noise levels in molecular docking datasets such as PDBBind, a thorough analysis of their physical/chemical and bioactivity noise characteristics is still lacking. PoseCheck addresses this issue by examining molecular strain energy, molecular-protein clashes, and interactions, but it was primarily created for \textit{de novo} drug design. Another important metric in molecular docking, binding affinity energy, is better assessed by the new empirical score function AA-Score, which has demonstrated improved performance over existing methods. To tackle these challenges, we propose the COMPASS method, which integrates the PoseCheck and AA-Score modules. This approach evaluates dataset noise levels and the physical/chemical and bioactivity feasibility of docked molecules. Our analysis of the PDBBind dataset using COMPASS reveals significant noise in the ground-truth data. Additionally, we incorporate COMPASS with the state-of-the-art molecular docking method DiffDock, in inference mode, to achieve efficient and accurate assessments of docked ligands. Finally, we propose a new paradigm to enhance model performance for molecular docking through fine-tuning and discuss the potential benefits of this approach. The source code is available publicly at https://github.com/BIMSBbioinfo/Compass.


[17] 2406.06957

Charting a finite element, mechanical atlas of dermatologic wound closure

Wound geometry and the mechanical properties of human skin govern the failure modes of partially healed or scarred tissue. Though dermatologists and surgeons develop an intuitive understanding of the mechanical characteristics of skin through clinical practice, finite element models of wounds can aid in formalizing intuition. In this work, we explore the effect of wound geometry and primary intention closure on the propagation of mechanical stresses through skin. We use a two-layer, orthotropic, hyperelastic model of the epidermis, dermis, and subcutis to accurately capture the mechanical and geometric effects at work. We highlight the key assumptions which must be made when modeling closure of wounds by primary intention, clearly delineating promising areas for model improvement. Models are implemented in DOLFINx, an open-source finite element framework, and reference code is provided for reproducible and extensible science.


[18] 2406.07025

Entropy-Reinforced Planning with Large Language Models for Drug Discovery

The objective of drug discovery is to identify chemical compounds that possess specific pharmaceutical properties toward a binding target. Existing large language models (LLMs) can achieve high token-matching scores in terms of likelihood for molecule generation. However, relying solely on LLM decoding often results in the generation of molecules that are either invalid, due to a single misused token, or suboptimal, due to unbalanced exploration and exploitation as a consequence of the LLM's prior experience. Here we propose ERP, Entropy-Reinforced Planning for Transformer Decoding, which employs an entropy-reinforced planning algorithm to enhance the Transformer decoding process and strike a balance between exploitation and exploration. ERP aims to achieve improvements in multiple properties compared to direct sampling from the Transformer. We evaluated ERP on the SARS-CoV-2 virus (3CLPro) and human cancer cell target protein (RTCB) benchmarks and demonstrated that, in both benchmarks, ERP consistently outperforms the current state-of-the-art algorithm by 1-5 percent and baselines by 5-10 percent. Moreover, this improvement is robust across Transformer models trained with different objectives. Finally, to further illustrate the capabilities of ERP, we tested our algorithm on three code generation benchmarks and outperformed the current state-of-the-art approach as well. Our code is publicly available at: https://github.com/xuefeng-cs/ERP.
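The exploration-exploitation balance during decoding can be illustrated with a toy scoring rule: rank a candidate token by its log-probability plus an entropy bonus for the distribution it leads to. The distributions and the weight `lam` below are invented for illustration and are not ERP's exact formulation:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def score(log_prob, next_step_probs, lam):
    """Exploitation (log-probability) plus an entropy exploration bonus."""
    return log_prob + lam * entropy(next_step_probs)

# Two candidate tokens: "A" is slightly likelier now, but "B" leads to a
# higher-entropy next-step distribution (it keeps more options open).
cands = {
    "A": (math.log(0.40), [0.97, 0.01, 0.01, 0.01]),
    "B": (math.log(0.35), [0.40, 0.30, 0.20, 0.10]),
}

choices = {}
for lam in (0.0, 0.5):
    choices[lam] = max(cands, key=lambda k: score(*cands[k], lam))
print(choices)   # pure likelihood picks "A"; the entropy bonus flips it to "B"
```

This is the mechanism that avoids greedy commitment to a single likely token, the failure mode the abstract attributes to plain LLM decoding.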


[19] 2406.07263

Active learning for affinity prediction of antibodies

The primary objective of most lead optimization campaigns is to enhance the binding affinity of ligands. For large molecules such as antibodies, identifying mutations that enhance antibody affinity is particularly challenging due to the combinatorial explosion of potential mutations. When the structure of the antibody-antigen complex is available, relative binding free energy (RBFE) methods can offer valuable insights into how different mutations will impact the potency and selectivity of a drug candidate, thereby reducing the reliance on costly and time-consuming wet-lab experiments. However, accurately simulating the physics of large molecules is computationally intensive. We present an active learning framework that iteratively proposes promising sequences for simulators to evaluate, thereby accelerating the search for improved binders. We explore different modeling approaches to identify the most effective surrogate model for this task, and evaluate our framework both using pre-computed pools of data and in a realistic full-loop setting.


[20] 2406.07418

Enhanced Gene Selection in Single-Cell Genomics: Pre-Filtering Synergy and Reinforced Optimization

Recent advancements in single-cell genomics necessitate precision in gene panel selection to interpret complex biological data effectively. These methods aim to streamline the analysis of scRNA-seq data by focusing on the most informative genes, those that contribute significantly to the specific analysis task. Traditional selection methods, which often rely on expert domain knowledge, embedded machine learning models, or heuristic-based iterative optimization, are prone to biases and inefficiencies that may obscure critical genomic signals. Recognizing these limitations, in this study we introduce an iterative gene panel selection strategy applicable to clustering tasks in single-cell genomics. Our method uniquely integrates results from other gene selection algorithms, providing valuable preliminary boundaries or prior knowledge as initial guides in the search space to enhance the efficiency of our framework. Furthermore, we incorporate the stochastic nature of the exploration process in reinforcement learning (RL) and its capability for continuous optimization through reward-based feedback. This combination mitigates the biases inherent in the initial boundaries and harnesses RL's adaptability to refine and target gene panel selection dynamically. To illustrate the effectiveness of our method, we conducted detailed comparative experiments, case studies, and visualization analyses.