New articles on Quantitative Biology


[1] 2407.14556

Mechanical Self-replication

This study presents a theoretical model for a self-replicating mechanical system inspired by biological processes within living cells and supported by computer simulations. The model decomposes self-replication into core components, each of which is executed by a single machine constructed from a set of basic block types. Key functionalities such as sorting, copying, and building are demonstrated. The model provides valuable insights into the constraints of self-replicating systems. The discussion also addresses the spatial and timing behavior of the system, as well as its efficiency and complexity. This work provides a foundational framework for future studies on self-replicating mechanisms and their information-processing applications.


[2] 2407.14668

Towards a "universal translator" for neural dynamics at single-cell, single-spike resolution

Neuroscience research has made immense progress over the last decade, but our understanding of the brain remains fragmented and piecemeal: the dream of probing an arbitrary brain region and automatically reading out the information encoded in its neural activity remains out of reach. In this work, we build towards a first foundation model for neural spiking data that can solve a diverse set of tasks across multiple brain areas. We introduce a novel self-supervised modeling approach for population activity in which the model alternates between masking out and reconstructing neural activity across different time steps, neurons, and brain regions. To evaluate our approach, we design unsupervised and supervised prediction tasks using the International Brain Laboratory repeated site dataset, which is comprised of Neuropixels recordings targeting the same brain locations across 48 animals and experimental sessions. The prediction tasks include single-neuron and region-level activity prediction, forward prediction, and behavior decoding. We demonstrate that our multi-task-masking (MtM) approach significantly improves the performance of current state-of-the-art population models and enables multi-task learning. We also show that by training on multiple animals, we can improve the generalization ability of the model to unseen animals, paving the way for a foundation model of the brain at single-cell, single-spike resolution.
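
The masking idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 30% masking fraction, the toy spike counts, and the use of zero as the fill value are all assumptions made for illustration. Region-level masking would additionally need a region label per neuron and is omitted.

```python
import random

def make_mask(n_time, n_neurons, scheme, rng, frac=0.3):
    """Return a boolean mask (True = hidden), indexed as mask[time][neuron]."""
    mask = [[False] * n_neurons for _ in range(n_time)]
    if scheme == "temporal":
        # Hide randomly chosen time bins across all neurons.
        k = max(1, round(frac * n_time))
        for t in rng.sample(range(n_time), k):
            mask[t] = [True] * n_neurons
    elif scheme == "neuron":
        # Hide randomly chosen neurons across all time bins.
        k = max(1, round(frac * n_neurons))
        hidden = set(rng.sample(range(n_neurons), k))
        for row in mask:
            for j in hidden:
                row[j] = True
    elif scheme == "forward":
        # Hide the final time bins, so reconstruction becomes forward prediction.
        k = max(1, round(frac * n_time))
        for t in range(n_time - k, n_time):
            mask[t] = [True] * n_neurons
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return mask

def apply_mask(spikes, mask):
    """Replace hidden entries with 0 (an assumed fill value)."""
    return [[0 if m else s for s, m in zip(srow, mrow)]
            for srow, mrow in zip(spikes, mask)]

rng = random.Random(0)
n_time, n_neurons = 100, 20
# Toy spike counts standing in for binned population activity.
spikes = [[rng.randint(0, 3) for _ in range(n_neurons)] for _ in range(n_time)]
schemes = ["temporal", "neuron", "forward"]
masks = {s: make_mask(n_time, n_neurons, s, rng) for s in schemes}
masked = {s: apply_mask(spikes, masks[s]) for s in schemes}
```

In a training loop, one scheme would be drawn per batch and the model asked to reconstruct the hidden entries, which is one way to read "alternates between masking out and reconstructing".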


[3] 2407.14708

Modeling flexible behaviour by the interactions between hippocampus and cortex

Animals flexibly change their behaviour depending on context. In theory, such context-dependent behaviour can be explained by model-based reinforcement learning models. However, existing models lack structure underlying context-dependent model selection and thus, a correspondence to neural activity and brain regions. Here, we employ interacting sequential and context-inference modules to drive model-based learning as a means to better understand experimental neuronal activity data, lesion studies, and clinical research. We propose a neural circuit implementation of the sequential module by the hippocampus and the context-inference module by the cortex that together enable flexible behaviour. Our model explains a variety of experimental findings, including impairments in model-based reasoning reported in lesion studies. Furthermore, our model predicts the relationship between deficits in model-based learning and sensory processing, which often co-occur in psychoses such as schizophrenia (SZ) or autism spectrum disorder (ASD).


[4] 2407.14794

mdCATH: A Large-Scale MD Dataset for Data-Driven Computational Biophysics

Recent advancements in protein structure determination are revolutionizing our understanding of proteins. Still, a significant gap remains in the availability of comprehensive datasets that focus on the dynamics of proteins, which are crucial for understanding protein function, folding, and interactions. To address this critical gap, we introduce mdCATH, a dataset generated through an extensive set of all-atom molecular dynamics simulations of a diverse and representative collection of protein domains. This dataset comprises all-atom systems for 5,398 domains, modeled with a state-of-the-art classical force field, and simulated in five replicates each at five temperatures from 320 K to 413 K. The mdCATH dataset records coordinates and forces every 1 ns, for over 62 ms of accumulated simulation time, effectively capturing the dynamics of the various classes of domains and providing a unique resource for proteome-wide statistical analyses of protein unfolding thermodynamics and kinetics. We outline the dataset structure and showcase its potential through four easily reproducible case studies, highlighting its capabilities in advancing protein science.


[5] 2407.14798

An Information-Geometric Formulation of Pattern Separation and Evaluation of Existing Indices

Pattern separation is a computational process by which dissimilar neural patterns are generated from similar input patterns. We present an information-geometric formulation of pattern separation, where a pattern separator is modelled as a family of statistical distributions on a manifold. Such a manifold maps an input (i.e. coordinates) to a probability distribution that generates firing patterns. Pattern separation occurs when small coordinate changes result in large distances between samples from the corresponding distributions. Under this formulation, we implement a two-neuron system whose probability law forms a 3-dimensional manifold with mutually orthogonal coordinates representing the neurons' marginal and correlational firing rates. We use this highly controlled system to examine the behaviour of spike train similarity indices commonly used in pattern separation research. We found that all indices (except scaling factor) were sensitive to relative differences in marginal firing rates, but no index adequately captured differences in spike trains that resulted from altering the correlation in activity between the two neurons. That is, existing pattern separation metrics appear (A) sensitive to patterns that are encoded by different neurons, but (B) insensitive to patterns that differ only in relative spike timing (e.g. synchrony between neurons in the ensemble).


[6] 2407.14924

The Origin of Quantum Mechanical Statistics: Some Insights from the Research on Human Language

Identical systems, or entities, are indistinguishable in quantum mechanics (QM), and the symmetrization postulate rules the possible statistical distributions of a large number of identical quantum entities. However, a thorough analysis of the historical development of QM attributes the origin of quantum statistics, in particular, Bose-Einstein statistics, to a lack of statistical independence of the micro-states of identical quantum entities. We have recently identified Bose-Einstein statistics in the combination of words in large texts, as a consequence of the entanglement created by the meaning carried by words when they combine in human language. Relying on this investigation, we put forward the hypothesis that entanglement, hence the lack of statistical independence, is due to a mechanism of contextual updating, which provides deeper reasons for the appearance of Bose-Einstein statistics in human language. However, this investigation also contributes to a better understanding of the origin of quantum mechanical statistics in physics. Finally, we provide new insights into the intrinsically random behaviour of microscopic entities that is generally assumed within classical statistical mechanics.


[7] 2407.14949

CoCoG-2: Controllable generation of visual stimuli for understanding human concept representation

Humans interpret complex visual stimuli using abstract concepts that facilitate decision-making tasks such as food selection and risk avoidance. Similarity judgment tasks are effective for exploring these concepts. However, methods for controllable image generation in concept space are underdeveloped. In this study, we present a novel framework called CoCoG-2, which integrates generated visual stimuli into similarity judgment tasks. CoCoG-2 utilizes a training-free guidance algorithm to enhance generation flexibility. The CoCoG-2 framework is versatile for creating experimental stimuli based on human concepts, supporting various strategies for guiding visual stimuli generation, and demonstrating how these stimuli can validate various experimental hypotheses. CoCoG-2 will advance our understanding of the causal relationship between concept representations and behaviors by generating visual stimuli. The code is available at https://github.com/ncclab-sustech/CoCoG-2.


[8] 2407.15028

Statistical Models for Outbreak Detection of Measles in North Cotabato, Philippines

A measles outbreak occurs when the number of cases of measles in the population exceeds the typical level. Outbreaks that are not detected and managed early can increase mortality and morbidity and incur costs from activities responding to these events. The number of measles cases in the Province of North Cotabato, Philippines, was used in this study. Weekly reported cases of measles from January 2016 to December 2021 were provided by the Epidemiology and Surveillance Unit of the North Cotabato Provincial Health Office. Several integer-valued autoregressive (INAR) time series models were used to explore the possibility of detecting and identifying measles outbreaks in the province along with the classical ARIMA model. These models were evaluated based on goodness of fit, measles outbreak detection accuracy, and timeliness. The results of this study confirmed that INAR models have a conceptual advantage over ARIMA since the latter produces non-integer forecasts, which are not realistic for count data such as measles cases. Among the INAR models, the ZINGINAR(1) model was recommended for having a good model fit and timely and accurate detection of outbreaks. Furthermore, policymakers and decision-makers from relevant government agencies can use the ZINGINAR(1) model to improve disease surveillance and implement preventive measures against contagious diseases beforehand.
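
To show why INAR models yield integer-valued series, here is a minimal sketch of the textbook Poisson INAR(1) process, X_t = alpha ∘ X_{t-1} + e_t, where ∘ denotes binomial thinning. This is the basic member of the INAR family, not the ZINGINAR(1) specification recommended in the abstract, whose details the abstract does not give. The parameter values and helper names are illustrative assumptions.

```python
import math
import random

def poisson(lam, rng):
    """Draw a Poisson(lam) count with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def thin(x, alpha, rng):
    """Binomial thinning: each of the x counts is kept with probability alpha."""
    return sum(1 for _ in range(x) if rng.random() < alpha)

def simulate_inar1(n, alpha, lam, rng, x0=0):
    """Simulate n steps of X_t = thin(X_{t-1}) + Poisson(lam) innovations."""
    series = [x0]
    for _ in range(n - 1):
        series.append(thin(series[-1], alpha, rng) + poisson(lam, rng))
    return series

rng = random.Random(42)
alpha, lam = 0.5, 2.0  # assumed values, not estimates from the measles data
weekly_cases = simulate_inar1(5000, alpha, lam, rng)
stationary_mean = lam / (1 - alpha)  # theoretical mean of a Poisson INAR(1)
sample_mean = sum(weekly_cases) / len(weekly_cases)
```

Every simulated value is a non-negative integer by construction, whereas an ARIMA forecast is a real number that must be rounded before it can be read as a case count.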


[9] 2407.15132

Deep multimodal saliency parcellation of cerebellar pathways: linking microstructure and individual function through explainable multitask learning

Parcellation of human cerebellar pathways is essential for advancing our understanding of the human brain. Existing diffusion MRI tractography parcellation methods have been successful in defining major cerebellar fibre tracts, while relying solely on fibre tract structure. However, each fibre tract may relay information related to multiple cognitive and motor functions of the cerebellum. Hence, it may be beneficial for parcellation to consider the potential importance of the fibre tracts for individual motor and cognitive functional performance measures. In this work, we propose a multimodal data-driven method for cerebellar pathway parcellation, which incorporates both measures of microstructure and connectivity, and measures of individual functional performance. Our method involves first training a multitask deep network to predict various cognitive and motor measures from a set of fibre tract structural features. The importance of each structural feature for predicting each functional measure is then computed, resulting in a set of structure-function saliency values that are clustered to parcellate cerebellar pathways. We refer to our method as Deep Multimodal Saliency Parcellation (DeepMSP), as it computes the saliency of structural measures for predicting cognitive and motor functional performance, with these saliencies being applied to the task of parcellation. Applying DeepMSP we found that it was feasible to identify multiple cerebellar pathway parcels with unique structure-function saliency patterns that were stable across training folds.


[10] 2407.15202

Exploiting Pre-trained Models for Drug Target Affinity Prediction with Nearest Neighbors

Drug-Target binding Affinity (DTA) prediction is essential for drug discovery. Despite the application of deep learning methods to DTA prediction, the achieved accuracy remains suboptimal. In this work, inspired by the recent success of retrieval methods, we propose $k$NN-DTA, a non-parametric embedding-based retrieval method adopted on a pre-trained DTA prediction model, which can extend the power of the DTA model with no or negligible cost. Different from existing methods, we introduce two ways of aggregating neighbors, from both embedding space and label space, that are integrated into a unified framework. Specifically, we propose a label aggregation with pair-wise retrieval and a representation aggregation with point-wise retrieval of the nearest neighbors. This method executes in the inference phase and can efficiently boost the DTA prediction performance with no training cost. In addition, we propose an extension, Ada-$k$NN-DTA, an instance-wise and adaptive aggregation with lightweight learning. Results on four benchmark datasets show that $k$NN-DTA brings significant improvements, outperforming previous state-of-the-art (SOTA) results, e.g., on BindingDB IC$_{50}$ and $K_i$ testbeds, $k$NN-DTA obtains new records of RMSE 0.684 and 0.750. The extended Ada-$k$NN-DTA further improves the performance to 0.675 and 0.735 RMSE. These results strongly support the effectiveness of our method. Results in other settings and comprehensive studies/analyses also show the great potential of our $k$NN-DTA approach.
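
The label-aggregation idea can be sketched with the generic retrieval-and-interpolation pattern below. This is an assumption-laden illustration rather than the paper's formula: the Euclidean distance, the softmax weighting with a temperature, the interpolation weight `lam`, and the toy datastore are all hypothetical choices, and representation aggregation is not shown.

```python
import math

def knn_label_aggregation(query, datastore, model_pred,
                          k=3, temperature=1.0, lam=0.5):
    """Blend a model's prediction with labels of the k nearest stored pairs.

    datastore: list of (embedding, affinity_label) for known drug-target pairs.
    """
    # Retrieve the k stored pairs closest to the query embedding.
    nearest = sorted(
        ((math.dist(query, emb), label) for emb, label in datastore),
        key=lambda pair: pair[0],
    )[:k]
    # Softmax over negative distances: closer neighbours get larger weights.
    logits = [-d / temperature for d, _ in nearest]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    knn_pred = sum(w / total * label for w, (_, label) in zip(exps, nearest))
    # Interpolate between the retrieved estimate and the model's own output.
    return lam * knn_pred + (1 - lam) * model_pred

# Toy datastore: 2-D embeddings with affinity labels (made-up numbers).
datastore = [((0.0, 0.0), 5.0), ((1.0, 0.0), 6.0),
             ((0.0, 1.0), 7.0), ((10.0, 10.0), 1.0)]
blended = knn_label_aggregation((0.1, 0.0), datastore, model_pred=9.0)
```

Because retrieval and weighting happen only at inference time, nothing in this sketch requires retraining the underlying model, which matches the "no training cost" claim in spirit.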


[11] 2407.15220

Privacy-Preserving Multi-Center Differential Protein Abundance Analysis with FedProt

Quantitative mass spectrometry has revolutionized proteomics by enabling simultaneous quantification of thousands of proteins. Pooling patient-derived data from multiple institutions enhances statistical power but raises significant privacy concerns. Here we introduce FedProt, the first privacy-preserving tool for collaborative differential protein abundance analysis of distributed data, which utilizes federated learning and additive secret sharing. In the absence of a multicenter patient-derived dataset for evaluation, we created two, one at five centers from LFQ E. coli experiments and one at three centers from TMT human serum. Evaluations using these datasets confirm that FedProt achieves accuracy equivalent to DEqMS applied to pooled data, with negligible absolute differences no greater than $4 \times 10^{-12}$. In contrast, -log10(p-values) computed by the most accurate meta-analysis methods diverged from the centralized analysis results by up to 25-27. FedProt is available as a web tool with detailed documentation as a FeatureCloud App.
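
Additive secret sharing, mentioned above, can be illustrated with a short sketch. It shows only the general technique on integer values; it does not reproduce FedProt's protocol, and the modulus, the three-center setup, and the private counts are assumptions chosen for the example.

```python
import secrets

PRIME = 2**61 - 1  # a Mersenne prime used as the modulus (assumed choice)

def share(value, n_parties):
    """Split value into n_parties random shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    """Recover the shared value by summing all shares mod PRIME."""
    return sum(shares) % PRIME

# Three centers each hold a private count (made-up numbers).
private_counts = [120, 75, 203]
n = len(private_counts)

# Each center splits its value and sends one share to every center.
distributed = [share(v, n) for v in private_counts]

# Center j adds up the shares it received; a single partial sum reveals nothing
# about any individual input.
partial_sums = [sum(distributed[i][j] for i in range(n)) % PRIME for j in range(n)]

# Combining the partial sums yields the total without exposing any input.
total = reconstruct(partial_sums)
```

Only the aggregate is ever reconstructed, which is why the result can match a pooled analysis while the per-center values stay private.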


[12] 2407.15298

Understanding cell populations sharing information through the environment, as reinforcement learning

Collective migration is a phenomenon observed in various biological systems, where the cooperation of multiple cells leads to complex functions beyond individual capabilities, such as in immunity and development. A distinctive example is cell populations that not only ascend an attractant gradient originating from targets, such as damaged tissue, but also actively modify the gradient through their own production and degradation. While the optimality of single-cell information processing has been extensively studied, the optimality of the collective information processing that includes gradient sensing and gradient generation remains underexplored. In this study, we formulated a cell population that produces and degrades an attractant while exploring the environment as an agent population performing distributed reinforcement learning. We demonstrated the existence of optimal couplings between gradient sensing and gradient generation, showing that the optimal gradient generation qualitatively differs depending on whether the gradient sensing is logarithmic or linear. The derived dynamics have a structure homogeneous to the Keller-Segel model, suggesting that cell populations might be learning. Additionally, we showed that the distributed information processing structure of the agent population enables a proportion of the population to robustly accumulate at the target. Our results provide a quantitative foundation for understanding the collective information processing mediated by attractants in extracellular environments.


[13] 2407.15322

Molecular design for cardiac cell differentiation using a small dataset and decorated shape features

The discovery of small organic compounds for inducing stem cell differentiation is a time- and resource-intensive process. While data science could, in principle, facilitate the discovery of these compounds, novel approaches are required due to the difficulty of acquiring training data from large numbers of example compounds. In this paper, we demonstrate the design of a new compound for inducing cardiomyocyte differentiation using simple regression models trained with a data set containing only 80 examples. We introduce decorated shape descriptors, an information-rich molecular feature representation that integrates both molecular shape and hydrophilicity information. These models demonstrate improved performance compared to ones using standard molecular descriptors based on shape alone. Model overtraining is diagnosed using a new type of sensitivity analysis. Our new compound is designed using a conservative molecular design strategy, and its effectiveness is confirmed through expression profiles of cardiomyocyte-related marker genes using real-time polymerase chain reaction experiments on human iPS cell lines. This work demonstrates a viable data-driven strategy for designing new compounds for stem cell differentiation protocols and will be useful in situations where training data is limited.


[14] 2407.15697

Voltage mapping in subcellular nanodomains using electro-diffusion modeling

Voltage distribution in sub-cellular micro-domains such as neuronal synapses, small protrusions or dendritic spines regulates the opening and closing of ionic channels, energy production and thus cellular homeostasis and excitability. Yet determining how voltage changes at such a small scale in vivo remains challenging due to the experimental diffraction limit, large signal fluctuations and the still limited resolution of fast voltage indicators. Here, we study the voltage distribution in nano-compartments using a computational approach based on the Poisson-Nernst-Planck equations for the electro-diffusion motion of ions, where inward and outward fluxes are generated between channels. We report a current-voltage (I-V) logarithmic relationship generalizing the Nernst law that reveals how the local membrane curvature modulates the voltage. We further find that an influx current penetrating a cellular electrolyte can lead to perturbations from tens to hundreds of nanometers deep depending on the local channel organization. Finally, we show that the neck resistance of dendritic spines can be completely shunted by the transporters located on the head boundary, facilitating ionic flow. To conclude, we propose that voltage is regulated at a subcellular level by channel organization, membrane curvature and narrow passages.


[15] 2407.15735

A Real-Time Suite of Biological Cell Image Analysis Software for Computers, Smartphones, and Smart Glasses, Suitable for Resource-Constrained Computing

Methods for personalizing medical treatment are the focal point of contemporary clinical research. In cancer care, for instance, we can analyze the effects of therapies at the level of individual cells. Complete characterization of treatment efficacy and evaluation of why some individuals respond to specific regimens, whereas others do not, requires approaches beyond genetic sequencing at single time points. Methods for the continuous analysis of changes in phenotype, such as morphology and motion tracking of cellular proteins and organelles over time frames spanning the minute-hour scales, can provide important insight into patient treatment options. The integration of measurements of intracellular dynamics and the contribution of multiple genetic pathways in degenerative diseases is vital for the development of biomarkers for the early detection of pathogenesis and therapy efficacy. We have developed a software suite (DataSet Tracker) for real-time analysis designed to run on computers, smartphones, and smart glasses hardware and suitable for resource-constrained, on-the-fly computing in microscopes without internet connectivity; a demo is available for viewing at datasetanalysis.com. Our objective is to present the community with an integrated, easy-to-use tool for resolving the complex dynamics of the cytoskeletal meshworks, intracytoplasmic membranous networks, and vesicle trafficking. It is our goal to have this integrated tool approved for use in clinical practice.


[16] 2407.14968

Technical report: Improving the properties of molecules generated by LIMO

This technical report investigates variants of the Latent Inceptionism on Molecules (LIMO) framework to improve the properties of generated molecules. We conduct ablation studies of molecular representation, decoder model, and surrogate model training scheme. The experiments suggest that an autoregressive Transformer decoder with GroupSELFIES achieves the best average properties for the random generation task.


[17] 2407.14976

Multiple merger coalescent inference of effective population size

Variation in a sample of molecular sequence data informs about the past evolutionary history of the sample's population. Traditionally, Bayesian modeling coupled with the standard coalescent is used to infer the sample's bifurcating genealogy and demographic and evolutionary parameters such as effective population size and mutation rates. However, there are many situations where binary coalescent models do not accurately reflect the true underlying ancestral processes. Here, we propose a Bayesian nonparametric method for inferring effective population size trajectories from a multifurcating genealogy under the $\Lambda$-coalescent. In particular, we jointly estimate the effective population size and model parameters for the Beta-coalescent model, a special type of $\Lambda$-coalescent. Finally, we test our methods on simulations and apply them to study various viral dynamics as well as Japanese sardine population size changes over time. The code and vignettes can be found in the phylodyn package.


[18] 2407.15101

Holographic nature of critical quantum states of proteins

The Anderson metal-insulator transition is a fundamental phenomenon in condensed matter physics, describing the transition from a conducting (metallic) to a non-conducting (insulating) state driven by disorder in a material. At the critical point of the Anderson transition, wave functions exhibit multifractal behavior, and energy levels display a universal distribution, indicating non-trivial correlations in the eigenstates. Recent studies have shown that proteins, traditionally considered as insulators, exhibit much higher conductivity than previously assumed. In this paper, we investigate several proteins known for their efficient electron transport properties. We compare their energy level statistics, eigenfunction correlation, and electron return probability to those expected in metallic, insulating, or critical states. Remarkably, these proteins exhibit properties of critically disordered metals in their natural state without any parameter adjustment. Their composition and geometry are self-organized into the critical state of the Anderson transition, and their fractal properties are universal and unique among critical systems. Our findings suggest that proteins' wave functions fulfill "holographic" area laws, and the correlation fractal dimension is precisely $d_2=2$.


[19] 2407.15254

Does EDPVR Represent Myocardial Tissue Stiffness? Toward a Better Definition

Accurate assessment of myocardial tissue stiffness is pivotal for the diagnosis and prognosis of heart diseases. Left ventricular diastolic stiffness ($\beta$) obtained from the end-diastolic pressure-volume relationship (EDPVR) has conventionally been utilized as a representative metric of myocardial stiffness. The EDPVR can be employed to estimate the intrinsic stiffness of myocardial tissues through image-based in-silico inverse optimization. However, whether $\beta$, as an organ-level metric, accurately represents tissue-level myocardial stiffness in healthy and diseased myocardium remains elusive. We developed a modeling-based approach utilizing a two-parameter material model for the myocardium (denoted by $a_f$ and $b_f$) in image-based in-silico biventricular heart models to generate EDPVRs for different material parameters. Our results indicated a variable relationship between $\beta$ and the material parameters depending on the range of the parameters. Interestingly, $\beta$ showed a very low sensitivity to $a_f$, once averaged across several LV geometries, and even a negative correlation with $a_f$ for small values of $a_f$. These findings call for a critical assessment of the reliability and confoundedness of EDPVR-derived metrics to represent tissue-level myocardial stiffness. Our results also underscore the necessity to explore image-based in-silico frameworks, promising to provide a high-fidelity and potentially non-invasive assessment of myocardial stiffness.


[20] 2407.15301

U-learning for Prediction Inference via Combinatory Multi-Subsampling: With Applications to LASSO and Neural Networks

Epigenetic aging clocks play a pivotal role in estimating an individual's biological age through the examination of DNA methylation patterns at numerous CpG (Cytosine-phosphate-Guanine) sites within their genome. However, making valid inferences on predicted epigenetic ages, or more broadly, on predictions derived from high-dimensional inputs, presents challenges. We introduce a novel U-learning approach via combinatory multi-subsampling for making ensemble predictions and constructing confidence intervals for predictions of continuous outcomes when traditional asymptotic methods are not applicable. More specifically, our approach conceptualizes the ensemble estimators within the framework of generalized U-statistics and invokes the Hájek projection for deriving the variances of predictions and constructing confidence intervals with valid conditional coverage probabilities. We apply our approach to two commonly used predictive algorithms, Lasso and deep neural networks (DNNs), and illustrate the validity of inferences with extensive numerical studies. We have applied these methods to predict the DNA methylation age (DNAmAge) of patients with various health conditions, aiming to accurately characterize the aging process and potentially guide anti-aging interventions.


[21] 2407.15713

Inverse problems for coupled nonlocal nonlinear systems arising in mathematical biology

In this paper, we propose and study several inverse problems of determining unknown parameters in nonlocal nonlinear coupled PDE systems, including the potentials, nonlinear interaction functions and time-fractional orders. In these coupled systems, we enforce non-negativity of the solutions, aligning with realistic scenarios in biology and ecology. There are several salient features of our inverse problem study: the drastic reduction in measurement/observation data due to averaging effects, the nonlinear coupling between multiple equations, and the nonlocality arising from fractional-type derivatives. These factors present significant challenges to our inverse problem, and such inverse problems have never been explored in previous literature. To address these challenges, we develop new and effective schemes. Our approach involves properly controlling the injection of different source terms to obtain multiple sets of mean flux data. This allows us to achieve unique identifiability results and accurately determine the unknown parameters. Finally, we establish a connection between our study and practical applications in biology, further highlighting the relevance of our work in real-world contexts.


[22] 2407.15801

Selection pressure/Noise driven cooperative behaviour in the thermodynamic limit of repeated games

Consider the scenario where an infinite number of players (i.e., the thermodynamic limit) find themselves in a Prisoner's Dilemma type situation, in a repeated setting. Is it reasonable to anticipate that, in these circumstances, cooperation will emerge? This paper addresses this question by examining the emergence of cooperative behaviour, in the presence of noise (or, under selection pressure), in repeated Prisoner's Dilemma games, involving strategies such as Tit-for-Tat, Always Defect, GRIM, Win-Stay, Lose-Shift, and others. To analyze these games, we employ a numerical Agent-Based Model (ABM) and compare it with the analytical Nash Equilibrium Mapping (NEM) technique, both based on the 1D-Ising chain. We use game magnetization as an indicator of cooperative behaviour. A significant finding is that for some repeated games, a discontinuity in the game magnetization indicates a first-order selection pressure/noise-driven phase transition. The phase transition is particular to strategies where players do not severely punish a single defection. We also observe that in these particular cases, the phase transition critically depends on the number of rounds the game is played in the thermodynamic limit. For all five games, we find that both ABM and NEM, in conjunction with game magnetization, provide crucial inputs on how cooperative behaviour can emerge in an infinite-player repeated Prisoner's Dilemma game.
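
For orientation, the quantity that "game magnetization" borrows from physics is the magnetization per site of the 1D Ising chain in the thermodynamic limit, m = sinh(beta h) / sqrt(sinh^2(beta h) + exp(-4 beta J)). The sketch below evaluates only this textbook expression. How game payoffs map onto the coupling J and field h is part of the NEM technique and is not given in the abstract, so it is not reproduced here; reading beta as an inverse noise level is an assumption.

```python
import math

def ising_magnetization(J, h, beta):
    """Magnetization per site of the infinite 1D Ising chain.

    J: nearest-neighbour coupling, h: external field, beta: inverse temperature.
    """
    s = math.sinh(beta * h)
    return s / math.sqrt(s * s + math.exp(-4.0 * beta * J))

# Sweep beta for fixed, illustrative J and h (values are assumptions).
sweep = [(beta, ising_magnetization(J=1.0, h=0.2, beta=beta))
         for beta in (0.1, 0.5, 1.0, 2.0)]
```

With J = 0 the expression reduces to tanh(beta h), the magnetization of independent spins, which gives a quick sanity check on the formula.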