New articles on Quantitative Biology

[1] 2312.00796

Multiple Protein Profiler 1.0 (MPP): A webserver for predicting and visualizing physicochemical properties of proteins at the proteome level

Determining the physicochemical properties of a protein can reveal important insights into its structure, biological functions, stability, and interactions with other molecules. Although tools for computing properties of proteins already exist, we could not find a comprehensive tool that enables the calculation of multiple properties for multiple input proteins at the proteome level at once. Facing this limitation, we developed Multiple Protein Profiler (MPP) 1.0, an integrated tool that profiles 12 individual properties of multiple proteins simultaneously. MPP provides tabular and graphic visualization of the properties of multiple proteins. The tool is freely accessible at

[2] 2312.00811

Seizure detection from Electroencephalogram signals via Wavelets and Graph Theory metrics

Epilepsy is one of the most prevalent neurological conditions; an epileptic seizure is a transient occurrence caused by abnormal, excessive and synchronous activity in the brain. Electroencephalogram signals emanating from the brain can be captured and analysed, and can then play a significant role in the detection and prediction of epileptic seizures. In this work we improve upon a previous approach that relied on the differing properties of the wavelet transform. Here we apply the Maximum Overlap Discrete Wavelet Transform both to reduce signal noise and to use the signal variance exhibited at differing inherent frequency levels to develop various metrics of connection between the electrodes placed upon the scalp. The properties of both the noise-reduced signal and the interconnected electrodes differ significantly between brain states. Using short-duration epochs, to approximate close to real-time monitoring, together with simple statistical parameters derived from the reconstructed noise-reduced signals, we initiate seizure detection. To further improve performance we utilise graph-theoretic indicators derived from electrode connectivity. From there we build the attribute space. We utilise open-source software and publicly available data to highlight the superior Recall/Sensitivity performance of our approach compared to existing published methods.
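
The pipeline outlined in this abstract (wavelet-domain variance features plus an electrode connectivity graph) can be illustrated with a toy stand-in. The one-level undecimated Haar detail filter and the synthetic two-channel data below are illustrative assumptions, not the authors' MODWT implementation:

```python
# Toy sketch: one-level undecimated (Haar) wavelet detail coefficients,
# per-channel variance, and a correlation-based electrode connectivity edge.
# Simplified stand-in for a MODWT pipeline, for illustration only.
import math
import random

def haar_detail(signal):
    """Undecimated Haar wavelet detail coefficients (circular boundary)."""
    n = len(signal)
    return [(signal[i] - signal[(i + 1) % n]) / math.sqrt(2) for i in range(n)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0

random.seed(0)
# Two synthetic "electrode" channels sharing a common component.
common = [random.gauss(0, 1) for _ in range(256)]
ch1 = [c + 0.5 * random.gauss(0, 1) for c in common]
ch2 = [c + 0.5 * random.gauss(0, 1) for c in common]

d1, d2 = haar_detail(ch1), haar_detail(ch2)
# Simple statistical features from the wavelet-domain signals.
features = {"var_ch1": variance(d1), "var_ch2": variance(d2)}
# Edge weight between the two electrodes in the connectivity graph.
edge_weight = pearson(d1, d2)
```

Real pipelines would use a multi-level MODWT (e.g. via PyWavelets) and many electrodes, with graph-theoretic indicators computed on the full connectivity matrix.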

[3] 2312.00842

ESM-NBR: fast and accurate nucleic acid-binding residue prediction via protein language model feature representation and multi-task learning

Protein-nucleic acid interactions play a very important role in a variety of biological activities. Accurate identification of nucleic acid-binding residues is a critical step in understanding these interaction mechanisms. Although many computational methods have been developed to predict nucleic acid-binding residues, challenges remain. In this study, we propose a fast and accurate sequence-based method called ESM-NBR. In ESM-NBR, we first use the large protein language model ESM2 to extract feature representations of discriminative biological properties from protein primary sequences; then, a multi-task deep learning model composed of stacked bidirectional long short-term memory (BiLSTM) and multi-layer perceptron (MLP) networks is employed to explore the common and private information of DNA- and RNA-binding residues with the ESM2 features as input. Experimental results on benchmark data sets demonstrate that the prediction performance of the ESM2 feature representation comprehensively outperforms evolutionary-information-based hidden Markov model (HMM) features. Meanwhile, ESM-NBR obtains MCC values for DNA-binding residue prediction of 0.427 and 0.391 on two independent test sets, which are 18.61% and 10.45% higher than those of the second-best methods, respectively. Moreover, by completely discarding the time-consuming multiple sequence alignment process, the prediction speed of ESM-NBR far exceeds that of existing methods (5.52 s for a protein sequence of length 500, about 16 times faster than the second-fastest method). A user-friendly standalone package and the data of ESM-NBR are freely available for academic use at:

[4] 2312.00847

Handling nonlinearities and uncertainties of fed-batch cultivations with difference of convex functions tube MPC

Bioprocesses are often characterized by nonlinear and uncertain dynamics. This poses particular challenges in the context of model predictive control (MPC). Several approaches have been proposed to solve this problem, such as robust or stochastic MPC, but they can be computationally expensive when the system is nonlinear. Recent advances in optimal control theory have shown that concepts from convex optimization, tube-based MPC, and difference of convex functions (DC) enable stable and robust online process control. The approach is based on systematic DC decompositions of the dynamics and successive linearizations around feasible trajectories. By convexity, the linearization errors can be bounded tightly and treated as bounded disturbances in a robust tube-based MPC framework. However, finding the DC decomposition can be a difficult task. To overcome this problem, we used a neural network with a special convex structure to learn the dynamics in DC form and expressed the uncertainty sets using simplices to maximize the product formation rate of a cultivation with uncertain substrate concentration in the feed. The results show that this is a promising approach for computationally tractable data-driven robust MPC of bioprocesses.

[5] 2312.00910

Effectiveness of probabilistic contact tracing in epidemic containment: the role of super-spreaders and transmission paths reconstruction

The recent COVID-19 pandemic underscores the significance of early-stage non-pharmacological intervention strategies. The widespread use of masks and the systematic implementation of contact tracing strategies provide a potentially equally effective and socially less impactful alternative to more conventional approaches, such as large-scale mobility restrictions. However, manual contact tracing faces strong limitations in accessing the network of contacts, and the scalability of currently implemented protocols for smartphone-based digital contact tracing becomes impractical during the rapid expansion phases of the outbreaks, due to the surge in exposure notifications and associated tests. A substantial improvement in digital contact tracing can be obtained through the integration of probabilistic techniques for risk assessment that can more effectively guide the allocation of new diagnostic tests. In this study, we first quantitatively analyze the diagnostic and social costs associated with these containment measures based on contact tracing, employing three state-of-the-art models of SARS-CoV-2 spreading. Our results suggest that probabilistic techniques allow for more effective mitigation at a lower cost. Secondly, our findings reveal a remarkable efficacy of probabilistic contact-tracing techniques in capturing backward propagations and super-spreading events, relevant features of the diffusion of many pathogens, including SARS-CoV-2.

[6] 2312.00932

Boltzmann's casino and the unbridgeable chasm in emergence of life research

Notwithstanding its long history and compelling motivation, research seeking to explicate the emergence of life (EoL) has throughout been a cacophony of unresolved speculation and dispute, absent still any clear convergence or other inarguable evidence of progress. This notwithstanding that it has also produced a rich and varied supply of putatively promising technical advances. Not surprising, then, the effort being advanced by some to establish a shared basis in fundamental assumptions upon which a more productive community research effort might arise. In this essay, however, I press a case in opposition. First, that a chasm divides the rich fauna of contesting EoL models into two conceptually incommensurate classes, here named "chemistry models" (to which class belongs nearly all thinking and work in the field, past and present) and "engine models" (advanced in various more-or-less partial forms by a marginal minority of voices dating from Boltzmann forward). Second, that contemporary non-equilibrium thermodynamics dictates that 'engine-less' (i.e. 'chemistry') models cannot in principle generate non-equilibrium, organized states of matter and are in consequence inherently incapable of prising life out of inanimate matter.

[7] 2312.01186

Linker-Tuning: Optimizing Continuous Prompts for Heterodimeric Protein Prediction

Predicting the structure of interacting chains is crucial for understanding biological systems and developing new drugs. Large-scale pre-trained Protein Language Models (PLMs), such as ESM2, have shown impressive abilities in extracting biologically meaningful representations for protein structure prediction. In this paper, we show that ESMFold, which has been successful in computing accurate atomic structures for single-chain proteins, can be adapted to predict heterodimer structures in a lightweight manner. We propose Linker-tuning, which learns a continuous prompt connecting the two chains of a dimer so that the dimer can be run through ESMFold as a single sequence. Experimental results show that our method successfully predicts 56.98% of interfaces on the i.i.d. heterodimer test set, with an absolute improvement of +12.79% over the ESMFold-Linker baseline. Furthermore, our model generalizes well to the out-of-distribution (OOD) test set HeteroTest2 and two antibody test sets, Fab and Fv, while being $9\times$ faster than AF-Multimer.

[8] 2312.01272

Multiscale Topology in Interactomic Network: From Transcriptome to Antiaddiction Drug Repurposing

The escalating drug addiction crisis in the United States underscores the urgent need for innovative therapeutic strategies. This study embarked on an innovative and rigorous strategy to unearth potential drug repurposing candidates for opioid and cocaine addiction treatment, bridging the gap between transcriptomic data analysis and drug discovery. We initiated our approach by conducting differential gene expression analysis on addiction-related transcriptomic data to identify differentially expressed genes (DEGs). We then propose a novel topological differentiation to identify key genes from a protein-protein interaction (PPI) network derived from the DEGs. This method utilizes persistent Laplacians to accurately single out pivotal nodes within the network, conducting the analysis in a multiscale manner to ensure high reliability. Through rigorous literature validation, pathway analysis, and data-availability scrutiny, we identified three pivotal molecular targets, mTOR, mGluR5, and NMDAR, for drug repurposing from DrugBank. We crafted machine learning models employing two natural language processing (NLP)-based embeddings and a traditional 2D fingerprint, which demonstrated robust predictive ability in gauging binding affinities of DrugBank compounds to selected targets. Furthermore, we elucidated the interactions of promising drugs with the targets and evaluated their drug-likeness. This study delineates a multi-faceted and comprehensive analytical framework, amalgamating bioinformatics, topological data analysis and machine learning, for drug repurposing in addiction treatment, setting the stage for subsequent experimental validation. The versatility of the methods we developed allows for applications across a range of diseases and transcriptomic datasets.

[9] 2312.01275

A Review of Link Prediction Applications in Network Biology

In the domain of network biology, the interactions among heterogeneous genomic and molecular entities are represented through networks. Link prediction (LP) methodologies are instrumental in inferring missing or prospective associations within these biological networks. In this review, we systematically dissect the attributes of local, centrality, and embedding-based LP approaches, applied to static and dynamic biological networks. We undertake an examination of the current applications of LP metrics for predicting links between diseases, genes, proteins, RNA, microbiomes, drugs, and neurons. We carry out comprehensive performance evaluations on established biological network datasets to show the practical applications of standard LP models. Moreover, we compare the similarity in prediction trends among the models and the specific network attributes that contribute to effective link prediction, before underscoring the role of LP in addressing the formidable challenges prevalent in biological systems, ranging from noise, bias, and data sparseness to interpretability. We conclude the review with an exploration of the essential characteristics expected from future LP models, poised to advance our comprehension of the intricate interactions governing biological systems.
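
As a concrete illustration of the local link-prediction scores discussed in this review, the sketch below computes common-neighbour and Jaccard scores for a candidate link on a toy undirected graph; the graph and scores are illustrative, not drawn from the review's benchmarks:

```python
# Minimal sketch of two local link-prediction (LP) scores on a toy
# undirected graph, illustrative only.
def neighbours(edges, node):
    """Neighbour set of a node in an undirected edge list."""
    return {v for u, v in edges if u == node} | {u for u, v in edges if v == node}

def common_neighbours(edges, a, b):
    return len(neighbours(edges, a) & neighbours(edges, b))

def jaccard(edges, a, b):
    na, nb = neighbours(edges, a), neighbours(edges, b)
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0

# Toy protein-interaction-style graph.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")]
# Score the missing candidate link A-D.
print(common_neighbours(edges, "A", "D"))  # 2 (via B and C)
print(jaccard(edges, "A", "D"))            # 1.0 (neighbour sets coincide)
```

For real biological networks, library implementations (e.g. NetworkX's `jaccard_coefficient` and `adamic_adar_index`) scale far better than this didactic version.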

[10] 2312.01527

NovoMol: Recurrent Neural Network for Orally Bioavailable Drug Design and Validation on PDGFRα Receptor

Longer timelines and lower success rates of drug candidates limit the productivity of clinical trials in the pharmaceutical industry. Promising de novo drug design techniques help solve this by exploring a broader chemical space, efficiently generating new molecules, and providing improved therapies. However, optimizing for molecular characteristics found in approved oral drugs remains a challenge, limiting de novo usage. In this work, we propose NovoMol, a novel de novo method using recurrent neural networks to mass-generate drug molecules with high oral bioavailability, increasing clinical trial time efficiency. Molecules were optimized for desirable traits and ranked using the quantitative estimate of drug-likeness (QED). Generated molecules meeting QED's oral bioavailability threshold were used to retrain the neural network, and, after five training cycles, 76% of generated molecules passed this strict threshold and 96% passed the traditionally used Lipinski's Rule of Five. The trained model was then used to generate specific drug candidates for the cancer-related PDGFRα receptor; 44% of the generated candidates had better binding affinity than Imatinib, the current state-of-the-art drug (receptor binding affinity of -9.4 kcal/mol), with the best generated candidate at -12.9 kcal/mol. NovoMol provides a time- and cost-efficient AI-based de novo method offering promising drug candidates for clinical trials.
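
Lipinski's Rule of Five, used above as a bioavailability filter, can be sketched as a simple check over four descriptors. The descriptor values below are approximate, illustrative numbers; real pipelines compute them with a cheminformatics toolkit (e.g. RDKit) rather than hard-coding them:

```python
# Sketch of the traditional Lipinski Rule of Five filter: a molecule is
# commonly accepted if it violates at most one of the four criteria.
def passes_lipinski(mw, logp, h_donors, h_acceptors):
    """Return True if the molecule has at most one Rule-of-Five violation."""
    violations = sum([
        mw > 500,          # molecular weight <= 500 Da
        logp > 5,          # octanol-water partition coefficient <= 5
        h_donors > 5,      # <= 5 hydrogen-bond donors
        h_acceptors > 10,  # <= 10 hydrogen-bond acceptors
    ])
    return violations <= 1

# Imatinib-like descriptor values (approximate, for illustration only).
print(passes_lipinski(mw=493.6, logp=3.5, h_donors=2, h_acceptors=7))  # True
```

QED, by contrast, is a continuous desirability score aggregating several such descriptors, so it ranks molecules rather than giving a hard pass/fail.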

[11] 2312.01646

Enhancing data-limited assessments: Optimal utilization of fishery-dependent data through random effects -- A case study on Korea chub mackerel ($\textit{Scomber japonicus}$)

In a state-space framework, temporal variations in fishery-dependent processes, such as selectivity and catchability, can be modeled as random effects. This makes state-space models (SSMs) powerful tools for data-limited assessments, especially when conventional CPUE standardization is inapplicable. However, the flexibility of this modeling approach can lead to challenges such as overfitting and parameter non-identifiability. To demonstrate and address these challenges, we developed a state-space length-based age-structured model, which we applied to the Korea chub mackerel ($\textit{Scomber japonicus}$) stock as a case study. The model underwent rigorous scrutiny using various model checking methods to detect potential model mis-specification and non-identifiability under diverse scenarios. Our results demonstrated that incorporating temporal variations in fishery-dependent processes through random effects resolved model mis-specification, but excessive inclusion of random effects rendered the model sensitive to a small number of observations, even when the model was identifiable. For the non-identifiability issue, we employed a non-degenerate estimator, using a gamma distribution as a penalty for the standard deviation (SD) parameters of observation errors. This approach made the SD parameters identifiable and facilitated the simultaneous estimation of both process and observation error variances with minimal bias, known to be a challenging task in SSMs. These findings underscore the importance of model checking in SSMs and emphasize the need for careful consideration of overfitting and non-identifiability when developing such models for data-limited assessments. Additionally, novel assessment results for the mackerel stock were presented, and implications for future stock assessment and management were discussed.
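
The non-degenerate-estimator idea, penalizing the observation-error SD parameters with a gamma prior so they cannot collapse to zero, can be sketched as a penalized objective. The shape and rate values below are illustrative assumptions, not the paper's choices:

```python
# Sketch: add the negative log-density of a gamma prior on an
# observation-error SD parameter to the data negative log-likelihood,
# so the SD estimate is pushed away from a degenerate zero solution.
import math

def gamma_neg_log_pdf(sigma, shape=2.0, rate=4.0):
    """-log Gamma(sigma | shape, rate); diverges as sigma -> 0 when shape > 1."""
    return -(shape * math.log(rate) - math.lgamma(shape)
             + (shape - 1.0) * math.log(sigma) - rate * sigma)

def penalised_objective(nll, sigma):
    """Penalised negative log-likelihood: data NLL plus SD penalty."""
    return nll + gamma_neg_log_pdf(sigma)

# A near-zero SD is heavily penalised relative to a moderate SD,
# even when the raw data NLL is identical.
print(penalised_objective(nll=10.0, sigma=1e-6) > penalised_objective(nll=10.0, sigma=0.5))
```

In an actual SSM fit (e.g. in TMB or Stan), this penalty term would simply be added to the joint negative log-likelihood before optimization.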

[12] 2312.01788

Functional Magnetic Resonance Imaging Changes and Increased Muscle Pressure in Fibromyalgia: Insights from Prominent Theories of Pain and Muscle Imaging

Fibromyalgia is a complicated and multifaceted disorder marked by widespread chronic pain, fatigue, and muscle tenderness. Current explanations for the pathophysiology of this condition include the Central Sensitization Theory, Cytokine Inflammation Theory, Muscle Hypoxia, Muscle Tender Point Theory, and Small Fiber Neuropathy Theory. The objective of this review article is to examine and explain each of these current theories and to provide a background on our current understanding of fibromyalgia. The medical literature on this disorder, as well as on the roles of functional magnetic resonance imaging (fMRI) and elastography as diagnostic tools, was reviewed from the 1970s to early 2023, primarily using the PubMed database. Five prominent theories of fibromyalgia etiology were examined: 1) Central Sensitization Theory; 2) Cytokine Inflammation Theory; 3) Muscle Hypoxia; 4) Muscle Tender Point Theory; and 5) Small Fiber Neuropathy Theory. Previous fMRI studies of fibromyalgia syndrome (FMS) have revealed two key findings. First, patients with FMS show altered activation patterns in brain regions involved in pain processing. Second, the connectivity between brain structures differs between individuals diagnosed with FMS and healthy controls. Both of these findings are expanded upon in this paper. The article also explores the potential for future research in fibromyalgia due to advancements in fMRI and elastography techniques, such as shear wave ultrasound. Increased understanding of the underlying mechanisms contributing to fibromyalgia symptoms is necessary for improved diagnosis and treatment, and advanced imaging techniques can aid in this process.

[13] 2312.01833

Minimum-phase property of the hemodynamic response function, and implications for Granger Causality in fMRI

Granger Causality (GC) is widely used in neuroimaging to estimate directed statistical dependence among brain regions using time series of brain activity. An important issue is that functional MRI (fMRI) measures brain activity indirectly via the blood-oxygen-level-dependent (BOLD) signal, which affects the temporal structure of the signals and distorts GC estimates. However, some notable applications of GC are not concerned with the GC magnitude but its statistical significance. This is the case for network inference, which aims to build a statistical model of the system based on directed relationships among its elements. The critical question for the viability of network inference in fMRI is whether the hemodynamic response function (HRF) and its variability across brain regions introduce spurious relationships, i.e., statistically significant GC values between BOLD signals, even if the GC between the neuronal signals is zero. It has been mathematically proven that such spurious statistical relationships are not induced if the HRF is minimum-phase, i.e., if both the HRF and its inverse are stable (producing finite responses to finite inputs). However, whether the HRF is minimum-phase has remained contentious. Here, we address this issue using multiple realistic biophysical models from the literature and studying their transfer functions. We find that these models are minimum-phase for a wide range of physiologically plausible parameter values. Therefore, statistical testing of GC is plausible even if the HRF varies across brain regions, with the following limitations. First, the minimum-phase condition is violated for parameter combinations that generate an initial dip in the HRF, confirming a previous mathematical proof. Second, the slow sampling of the BOLD signal (in seconds) compared to the timescales of neural signal propagation (milliseconds) may still introduce spurious GC.
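
The minimum-phase condition can be illustrated on a toy discrete FIR filter, where it reduces to all zeros of the transfer function lying strictly inside the unit circle. The coefficients below are illustrative and far simpler than a biophysical HRF model:

```python
# Toy illustration: a discrete FIR filter h = [1, a, b] has transfer
# function z^2 + a*z + b (up to a factor of z^-2); it is minimum-phase
# iff both zeros lie strictly inside the unit circle, so the filter
# and its inverse are both stable.
import cmath

def is_minimum_phase(a, b):
    """Check |z| < 1 for both roots of z^2 + a*z + b = 0."""
    disc = cmath.sqrt(a * a - 4 * b)
    roots = [(-a + disc) / 2, (-a - disc) / 2]
    return all(abs(z) < 1 for z in roots)

# A smoothly decaying kernel is minimum-phase...
print(is_minimum_phase(a=-0.5, b=0.06))  # roots 0.2 and 0.3 -> True
# ...while shifting a zero outside the unit circle breaks the condition,
# loosely analogous to the "initial dip" violation described above.
print(is_minimum_phase(a=-2.5, b=1.0))   # roots 0.5 and 2.0 -> False
```

The paper's analysis works with continuous-time transfer functions of biophysical HRF models, but the stable-inverse intuition is the same.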

[14] 2312.01923

Clonal dynamics of surface-driven growing tissues

The self-organization of cells into complex tissues relies on a tight coordination of cell behavior. Identifying the cellular processes driving tissue growth is key for understanding the emergence of tissue forms and for devising targeted therapies for aberrant growth, such as in cancer. Inferring the mode of tissue growth, whether it is driven by cells on the surface or cells in the bulk, is possible in cell culture experiments but difficult in most tissues in living organisms (in vivo). Genetic tracing experiments, where a subset of cells is labelled with inheritable markers, have become important experimental tools to study cell fate in vivo. Here, we show that the mode of tissue growth is reflected in the size distribution of the progeny of marked cells. To this end, we derive the clone-size distributions using analytical calculations and an agent-based stochastic sampling technique, in the limit of negligible cell migration and cell death. We show that for surface-driven growth the clone-size distribution takes a characteristic power-law form, with an exponent determined by fluctuations of the tissue surface. Our results allow for the inference of the mode of tissue growth from genetic tracing experiments.
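
The labelling idea behind genetic tracing can be sketched with a minimal one-dimensional agent-based model; the 1D setting and the parameters below are illustrative simplifications, not the paper's analysis:

```python
# Minimal agent-based sketch: grow a 1D "tissue", label each founder cell
# with its own clone ID, and compare clone sizes when only surface
# (boundary) cells divide vs. when any (bulk) cell divides.
import random

def grow(n_steps, surface_only, seed=1):
    random.seed(seed)
    tissue = [0, 1, 2]  # three founder cells, labelled by clone ID
    for _ in range(n_steps):
        if surface_only:
            i = random.choice([0, len(tissue) - 1])  # only boundary cells divide
        else:
            i = random.randrange(len(tissue))        # any cell may divide
        tissue.insert(i, tissue[i])                  # daughter inherits the label
    return {c: tissue.count(c) for c in set(tissue)}

surface_sizes = grow(200, surface_only=True)
bulk_sizes = grow(200, surface_only=False)
# Under surface growth the interior founder clone never reaches the
# boundary, so it stays at size 1 while boundary clones expand,
# broadening the clone-size distribution relative to bulk growth.
print(surface_sizes[1])  # 1
```

The paper's analytical and sampling results concern higher-dimensional tissues, where surface fluctuations set the power-law exponent of the clone-size distribution.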

[15] 2312.01966

Distance function for spike prediction

Approaches to predicting neuronal spike responses commonly use a Poisson learning objective. This objective quantizes responses into spike counts within a fixed summation interval, typically on the order of 10 to 100 milliseconds in duration; however, neuronal responses are often time accurate down to a few milliseconds, and at these timescales, Poisson models typically perform poorly. To overcome this limitation, we propose the concept of a spike distance function that maps points in time to the temporal distance to the nearest spike. We show that neural networks can be trained to approximate spike distance functions, and we present an efficient algorithm for inferring spike trains from the outputs of these models. Using recordings of chicken and frog retinal ganglion cells responding to visual stimuli, we compare the performance of our approach to Poisson models trained with various summation intervals. We show that our approach outperforms the standard Poisson approach at spike train inference.
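
The spike distance function itself is simple to state in code. The sketch below, with hypothetical spike times, maps time bins to the distance to the nearest spike and recovers spikes as thresholded local minima, a simplified version of the inference algorithm described in the abstract:

```python
# Sketch of a spike distance function: each time bin is mapped to the
# temporal distance to the nearest spike; spike times are then recovered
# as thresholded local minima of the (predicted) distance function.
def spike_distance(spike_times, n_bins, dt=1.0):
    """Distance from each bin time to the nearest spike."""
    return [min(abs(i * dt - t) for t in spike_times) for i in range(n_bins)]

def infer_spikes(dist, dt=1.0, threshold=0.5):
    """Recover spike times at local minima of the distance function."""
    spikes = []
    for i in range(len(dist)):
        left = dist[i - 1] if i > 0 else float("inf")
        right = dist[i + 1] if i < len(dist) - 1 else float("inf")
        if dist[i] <= left and dist[i] <= right and dist[i] < threshold:
            spikes.append(i * dt)
    return spikes

# Hypothetical spikes at t = 3 and t = 7 over ten 1-unit bins.
d = spike_distance([3.0, 7.0], n_bins=10)
print(infer_spikes(d))  # [3.0, 7.0]
```

In the paper's setting, a neural network approximates the distance function from stimuli, and inference runs on the network's (noisy) output rather than on exact distances.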

[16] 2312.00791

Replicator-mutator dynamics of Rock-Paper-Scissors game: Learning through mistakes

We generalize Bush-Mosteller learning, Roth-Erev learning, and social learning to include mistakes, such that the nonlinear replicator-mutator equation with either additive or multiplicative mutation is generated in an asymptotic limit. Subsequently, we exhaustively investigate the ubiquitous Rock-Paper-Scissors game for some analytically tractable motifs of the mutation pattern. We consider both symmetric and asymmetric game interactions, and reveal that mistakes can sometimes help the players learn. While the replicator-mutator flow exhibits rich dynamics that include limit cycles and chaotic orbits, it can also control chaos to lead to the rational Nash equilibrium outcome. Moreover, we report an instance of a hitherto unknown Hamiltonian structure of the replicator-mutator equation.
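
A minimal numerical sketch of replicator-mutator dynamics for Rock-Paper-Scissors, with additive mutation dx_i/dt = x_i(f_i - phi) + mu * sum_{j != i}(x_j - x_i); the payoff parametrization and mutation rate are illustrative choices, not the paper's motifs:

```python
# Euler integration of the replicator-mutator equation for Rock-Paper-Scissors
# with uniform additive mutation at rate mu. Payoffs: win +1, lose -eps.
def rps_fitness(x, eps=0.1):
    A = [[0, -eps, 1], [1, 0, -eps], [-eps, 1, 0]]  # cyclic RPS payoff matrix
    return [sum(A[i][j] * x[j] for j in range(3)) for i in range(3)]

def step(x, mu=0.01, dt=0.01):
    f = rps_fitness(x)
    phi = sum(xi * fi for xi, fi in zip(x, f))  # mean fitness
    new = []
    for i in range(3):
        selection = x[i] * (f[i] - phi)
        # additive mutation: mu * sum_{j != i} (x_j - x_i)
        mutation = sum(mu * (x[j] - x[i]) for j in range(3) if j != i)
        new.append(x[i] + dt * (selection + mutation))
    return new

x = [0.5, 0.3, 0.2]
for _ in range(1000):
    x = step(x)
# Both selection and mutation terms sum to zero, so the state stays
# on the probability simplex (components positive, summing to one).
print(round(sum(x), 6))  # 1.0
```

With win gain exceeding loss magnitude, this trajectory spirals toward the interior equilibrium; other payoff/mutation motifs yield the limit cycles and chaotic orbits discussed above.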

[17] 2312.01147

On the knowledge production function

Knowledge amount is an integral indicator of the humanitarian and technological development of society. The rate of knowledge production depends on population size and knowledge amount. Formalizing this relationship, we derive an equation for knowledge production that connects population and information dynamics. This equation uses a representation of per capita working efficiency in knowledge production as a function of knowledge amount. We explore this function in detail and verify it on empirical material that includes global demographic and information data. Knowledge can be represented in different forms, such as patents, scientific and technical journal articles, and books of any genre. Knowledge is stored in various types of devices, which together form a global informational storage. Storage capacity is increasing rapidly as digital technology advances. The model is also applied to the description of this process and is in good agreement with the literature data. The study performed allows us to deepen our understanding of the dynamics of civilization.

[18] 2312.01267

Distributed Reinforcement Learning for Molecular Design: Antioxidant case

Deep reinforcement learning has successfully been applied to molecular discovery, as shown by the Molecule Deep Q-network (MolDQN) algorithm. This algorithm faces challenges when applied to optimizing new molecules: training such a model is limited in its scalability to larger datasets, and the trained model cannot be generalized to different molecules in the same dataset. In this paper, a distributed reinforcement learning algorithm for antioxidants, called DA-MolDQN, is proposed to address these problems. State-of-the-art predictors of bond dissociation energy (BDE) and ionization potential (IP), which are critical chemical properties for optimizing antioxidants, are integrated into DA-MolDQN. Training time is reduced by algorithmic improvements for molecular modifications. The algorithm is distributed, scales to up to 512 molecules, and generalizes the model to a diverse set of molecules. The proposed models are trained with a proprietary antioxidant dataset. The results have been reproduced with both proprietary and public datasets. The proposed molecules have been validated with DFT simulations, and a subset of them has been confirmed in public "unseen" datasets. In summary, DA-MolDQN is up to 100x faster than previous algorithms and can discover new optimized molecules from proprietary and public antioxidants.

[19] 2312.01517

Age-based restrictions: An alternative to horizontal lockdowns?

During an epidemic, such as the COVID-19 pandemic, policy-makers must weigh effective but socioeconomically costly interventions, such as horizontal lockdowns, including school and workplace closures, physical distancing, etc. Investigating the role of individuals' age in the evolution of epidemiological phenomena, we propose a scheme for comparing certain epidemiological strategies. We then put the proposed scheme to the test by employing an age-based epidemiological compartment model introduced in a previous work of the authors, coupled with data from the literature, in order to compare the effectiveness of age-based and horizontal interventions. In general, our results suggest that the two are comparable mainly at low or medium levels of austerity.
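
The contrast between horizontal and age-based interventions can be sketched with a toy two-group SIR model in which restrictions rescale entries of a contact matrix; all rates and matrices below are illustrative assumptions, not the authors' calibrated compartment model:

```python
# Toy two-group (young/old) SIR step: an age-structured contact matrix C
# lets one scale contacts per group (age-based restriction) instead of
# uniformly (horizontal lockdown). All parameters are illustrative.
def sir_step(S, I, R, C, beta=0.3, gamma=0.1, dt=0.1):
    n = len(S)
    new_S, new_I, new_R = [], [], []
    for i in range(n):
        force = beta * sum(C[i][j] * I[j] for j in range(n))  # force of infection
        inf = force * S[i] * dt
        rec = gamma * I[i] * dt
        new_S.append(S[i] - inf)
        new_I.append(I[i] + inf - rec)
        new_R.append(R[i] + rec)
    return new_S, new_I, new_R

# Baseline contacts vs. an age-based restriction halving the older
# group's contacts only (group 0 = young, group 1 = old).
C_base = [[2.0, 1.0], [1.0, 1.5]]
C_age = [[2.0, 0.5], [0.5, 0.75]]

S, I, R = [0.99, 0.99], [0.01, 0.01], [0.0, 0.0]
Sa, Ia, Ra = S[:], I[:], R[:]
for _ in range(300):
    S, I, R = sir_step(S, I, R, C_base)
    Sa, Ia, Ra = sir_step(Sa, Ia, Ra, C_age)
# Restricting only the older group's contacts lowers that group's
# cumulative infections (recovered fraction) relative to baseline.
print(Ra[1] < R[1])  # True
```

A full comparison, as in the paper, would also weigh the socioeconomic cost of each contact-matrix rescaling, not just the epidemiological outcome.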

[20] 2312.01817

A simulation method for the wetting dynamics of liquid droplets on deformable membranes

Biological cells utilize membranes and liquid-like droplets, known as biomolecular condensates, to structure their interior. The interaction of droplets and membranes, despite being involved in several key biological processes, is so far little understood. Here, we present a first numerical method to simulate the continuum dynamics of droplets interacting with deformable membranes via wetting. The method combines the advantages of the phase field method for multi-phase flow simulation and the arbitrary Lagrangian-Eulerian (ALE) method for an explicit description of the elastic surface. The model is thermodynamically consistent, coupling bulk hydrodynamics with capillary forces, as well as bending, tension, and stretching of a thin membrane. The method is validated by comparing simulations for single droplets to theoretical results of shape equations, and its capabilities are illustrated in 2D and 3D axisymmetric scenarios.

[21] 2312.01994

A Generative Self-Supervised Framework using Functional Connectivity in fMRI Data

Deep neural networks trained on Functional Connectivity (FC) networks extracted from functional Magnetic Resonance Imaging (fMRI) data have gained popularity due to the increasing availability of data and advances in model architectures, including Graph Neural Networks (GNNs). Recent research on the application of GNNs to FC suggests that exploiting the time-varying properties of the FC could significantly improve the accuracy and interpretability of the model prediction. However, the high cost of acquiring high-quality fMRI data and corresponding phenotypic labels poses a hurdle to their application in real-world settings, such that a model naïvely trained in a supervised fashion can suffer from insufficient performance or a lack of generalization on a small amount of data. In addition, most Self-Supervised Learning (SSL) approaches for GNNs to date adopt a contrastive strategy, which tends to lose appropriate semantic information when the graph structure is perturbed, or does not leverage both spatial and temporal information simultaneously. In light of these challenges, we propose a generative SSL approach tailored to effectively harness the spatio-temporal information within dynamic FC. Our empirical results, from experiments with large-scale (>50,000) fMRI datasets, demonstrate that our approach learns valuable representations and enables the construction of accurate and robust models when fine-tuned for downstream tasks.

[22] 2312.02111

TriDeNT: Triple Deep Network Training for Privileged Knowledge Distillation in Histopathology

Computational pathology models rarely utilise data that will not be available for inference. This means most models cannot learn from highly informative data such as additional immunohistochemical (IHC) stains and spatial transcriptomics. We present TriDeNT, a novel self-supervised method for utilising privileged data that is not available during inference to improve performance. We demonstrate the efficacy of this method for a range of different paired data including immunohistochemistry, spatial transcriptomics and expert nuclei annotations. In all settings, TriDeNT outperforms other state-of-the-art methods in downstream tasks, with observed improvements of up to 101%. Furthermore, we provide qualitative and quantitative measurements of the features learned by these models and how they differ from baselines. TriDeNT offers a novel method to distil knowledge from scarce or costly data during training, to create significantly better models for routine inputs.