New articles on Quantitative Biology


[1] 2306.04658

Mathematics-assisted directed evolution and protein engineering

Directed evolution is a molecular biology technique that is transforming protein engineering by creating proteins with desirable properties and functions. However, it is experimentally impossible to perform deep mutational scanning of the entire protein library due to the enormous mutational space, which scales as $20^N$, where $N$ is the number of amino acids. This has led to the rapid growth of AI-assisted directed evolution (AIDE) or AI-assisted protein engineering (AIPE) as an emerging research field. Aided by advanced natural language processing (NLP) techniques, including long short-term memory, autoencoder, and transformer, sequence-based embeddings have been dominant approaches in AIDE and AIPE. Persistent Laplacians, an emerging technique in topological data analysis (TDA), have made structure-based embeddings a superb option in AIDE and AIPE. We argue that a class of persistent topological Laplacians (PTLs), including persistent Laplacians, persistent path Laplacians, persistent sheaf Laplacians, persistent hypergraph Laplacians, persistent hyperdigraph Laplacians, and evolutionary de Rham-Hodge theory, can effectively overcome the limitations of the current TDA and offer a new generation of more powerful TDA approaches. In the general framework of topological deep learning, mathematics-assisted directed evolution (MADE) has great potential for future protein engineering.
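To make the $20^N$ scaling concrete, the short sketch below counts the sequences in the full mutational space for a protein of length N and compares that with the number of single-site variants. The function names are illustrative and are not taken from the paper.

```python
AMINO_ACIDS = 20  # number of standard amino acids


def sequence_space_size(n_residues: int) -> int:
    """Number of distinct sequences of length n_residues (20^N)."""
    return AMINO_ACIDS ** n_residues


def single_site_variants(n_residues: int) -> int:
    """Number of variants that differ from the wild type at exactly one position."""
    return (AMINO_ACIDS - 1) * n_residues


if __name__ == "__main__":
    for n in (10, 50, 100):
        full = sequence_space_size(n)
        print(f"N={n}: 20^N has {len(str(full))} digits; "
              f"single-site variants: {single_site_variants(n)}")
```

For N = 100 the full space is a 131-digit number, whereas a single-site scan covers only 1,900 variants, which is why exhaustive experimental screening is out of reach.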


[2] 2306.04665

Statistical thermodynamics of self-organization in the adaptive immune system

The steady flow of energy can arrange matter and information in particular ways in a process we perceive as self-organization. Adaptive immunity is a phenomenon implemented by a complex adaptive biological system, whose self-organization can be understood as the maintenance of a steady state and can be modeled mathematically and physically. Using this approach, statistical distributions of thermodynamics can be shown to be applicable for the description of the organization and are in accordance with experimental observations of the immune system. Here I summarize arguments for such a statistical thermodynamic interpretation of immune function and highlight the interpretations of a key variable that characterizes self-organization in the context of chemical thermodynamics, networks and biochemical measurements.


[3] 2306.04667

Neural Embeddings for Protein Graphs

Proteins perform much of the work in living organisms, and consequently the development of efficient computational methods for protein representation is essential for advancing large-scale biological research. Most current approaches struggle to efficiently integrate the wealth of information contained in the protein sequence and structure. In this paper, we propose a novel framework for embedding protein graphs in geometric vector spaces, by learning an encoder function that preserves the structural distance between protein graphs. Utilizing Graph Neural Networks (GNNs) and Large Language Models (LLMs), the proposed framework generates structure- and sequence-aware protein representations. We demonstrate that our embeddings are successful in the task of comparing protein structures, while providing a significant speed-up compared to traditional approaches based on structural alignment. Our framework achieves remarkable results in the task of protein structure classification; in particular, when compared to other work, the proposed method shows an average F1-Score improvement of 26% on out-of-distribution (OOD) samples and of 32% when tested on samples coming from the same distribution as the training data. Our approach finds applications in areas such as drug prioritization, drug re-purposing, disease sub-type analysis and elsewhere.


[4] 2306.04776

A quantitative theory of viral-immune coevolution is within reach

Pathogens drive changes in host immune systems that in turn exert pressure for pathogens to evolve. Quantifying and understanding this constant coevolutionary process has clear practical global health implications. Its relative accessibility compared to macroevolution also makes it a fascinating system for learning about the basic laws of evolution. Focusing on immune-viral evolution, we present an overview of theoretical and experimental approaches that have recently started coming together to build the foundations for a quantitative and predictive co-evolutionary theory.


[5] 2306.04886

Multi-task Bioassay Pre-training for Protein-ligand Binding Affinity Prediction

Protein-ligand binding affinity (PLBA) prediction is a fundamental task in drug discovery. Recently, various deep learning-based models have predicted binding affinity by taking the three-dimensional structure of protein-ligand complexes as input, achieving astounding progress. However, due to the scarcity of high-quality training data, the generalization ability of current models is still limited. In addition, different bioassays use varying affinity measurement labels (e.g., IC50, Ki, Kd), and different experimental conditions inevitably introduce systematic noise, which poses a significant challenge to constructing high-precision affinity prediction models. To address these issues, we (1) propose Multi-task Bioassay Pre-training (MBP), a pre-training framework for structure-based PLBA prediction; (2) construct a pre-training dataset called ChEMBL-Dock with more than 300k experimentally measured affinity labels and about 2.8M docked three-dimensional structures. By introducing multi-task pre-training to treat the prediction of different affinity labels as different tasks and classifying relative rankings between samples from the same bioassay, MBP learns robust and transferable structural knowledge from our new ChEMBL-Dock dataset with varied and noisy labels. Experiments substantiate the capability of MBP as a general framework that can improve and be tailored to mainstream structure-based PLBA prediction tasks. To the best of our knowledge, MBP is the first affinity pre-training model and shows great potential for future development.
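The two ideas named in the abstract, one task per affinity label and relative ranking within a bioassay, can be illustrated with a small data-preparation sketch. This is not the authors' implementation: the `Sample` fields, the function names, and the convention that a higher value means stronger binding are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from typing import NamedTuple


class Sample(NamedTuple):
    assay_id: str    # hypothetical bioassay identifier
    label_type: str  # e.g. "IC50", "Ki", "Kd"
    affinity: float  # assumed convention: higher value = stronger binding


def within_assay_pairs(samples):
    """Return (i, j, target) triples for samples that share a bioassay.

    target is 1 if sample i has the higher affinity and 0 if sample j does.
    Ties are skipped because they carry no ranking information.
    """
    by_assay = defaultdict(list)
    for idx, s in enumerate(samples):
        by_assay[s.assay_id].append(idx)

    pairs = []
    for indices in by_assay.values():
        for i, j in combinations(indices, 2):
            a, b = samples[i].affinity, samples[j].affinity
            if a != b:
                pairs.append((i, j, 1 if a > b else 0))
    return pairs


def group_by_label_type(samples):
    """Group sample indices by affinity label, one group per prediction task."""
    tasks = defaultdict(list)
    for idx, s in enumerate(samples):
        tasks[s.label_type].append(idx)
    return dict(tasks)
```

Comparing only samples from the same bioassay avoids ranking measurements taken under different experimental conditions, which is the source of systematic noise the abstract describes.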


[6] 2306.04899

Multi-level Protein Representation Learning for Blind Mutational Effect Prediction

Directed evolution plays an indispensable role in protein engineering that revises existing protein sequences to attain new or enhanced functions. Accurately predicting the effects of protein variants necessitates an in-depth understanding of protein structure and function. Although large self-supervised language models have demonstrated remarkable performance in zero-shot inference using only protein sequences, these models inherently do not interpret the spatial characteristics of protein structures, which are crucial for comprehending protein folding stability and internal molecular interactions. This paper introduces a novel pre-training framework that cascades sequential and geometric analyzers for protein primary and tertiary structures. It guides mutational directions toward desired traits by simulating natural selection on wild-type proteins and evaluates the effects of variants based on their fitness to perform the function. We assess the proposed approach using a public database and two new databases for a variety of variant effect prediction tasks, which encompass a diverse set of proteins and assays from different taxa. The prediction results achieve state-of-the-art performance over other zero-shot learning methods for both single-site mutations and deep mutations.


[7] 2306.05203

A cognitive process approach to modeling gap acceptance in overtaking

Driving automation holds significant potential for enhancing traffic safety. However, effectively handling interactions with human drivers in mixed traffic remains a challenging task. Several models exist that attempt to capture human behavior in traffic interactions, often focusing on gap acceptance. However, it is not clear how models of an individual driver's gap acceptance can be translated to dynamic human-AV interactions in the context of high-speed scenarios like overtaking. In this study, we address this issue by employing a cognitive process approach to describe the dynamic interactions by the oncoming vehicle during overtaking maneuvers. Our findings reveal that by incorporating an initial decision-making bias dependent on the initial velocity into existing drift-diffusion models, we can accurately describe the qualitative patterns of overtaking gap acceptance observed previously. Our results demonstrate the potential of the cognitive process approach in modeling human overtaking behavior when the oncoming vehicle is an AV. To this end, this study contributes to the development of effective strategies for ensuring safe and efficient overtaking interactions between human drivers and AVs.
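As a rough illustration of a drift-diffusion model with a velocity-dependent starting point, the sketch below simulates a single accept/reject decision. It is a minimal sketch under stated assumptions, not the model fitted in the study: the linear form of `initial_bias`, its coefficients, and all parameter names are hypothetical.

```python
import math
import random


def initial_bias(initial_velocity, b0=0.0, b1=0.01):
    """Hypothetical linear map from initial velocity (m/s) to starting evidence."""
    return b0 + b1 * initial_velocity


def simulate_gap_decision(drift, boundary, start, noise=1.0,
                          dt=0.001, max_time=5.0, rng=None):
    """Simulate one drift-diffusion trial with Euler-Maruyama steps.

    Evidence starts at `start` and evolves as
    dx = drift * dt + noise * sqrt(dt) * N(0, 1).
    Returns ("accept", t) when the upper boundary is reached, ("reject", t)
    at the lower boundary, or (None, max_time) if neither is reached in time.
    """
    rng = rng or random.Random()
    x = start
    n_steps = int(round(max_time / dt))
    for step in range(n_steps + 1):
        if x >= boundary:
            return "accept", step * dt
        if x <= -boundary:
            return "reject", step * dt
        x += drift * dt + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return None, max_time


if __name__ == "__main__":
    rng = random.Random(0)
    start = initial_bias(initial_velocity=20.0)
    print(simulate_gap_decision(drift=0.5, boundary=1.0, start=start, rng=rng))
```

Shifting the starting point toward one boundary makes that decision both more likely and faster, which is how an initial bias can change gap-acceptance patterns without altering the drift rate.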


[8] 2306.05286

JGAT: a joint spatio-temporal graph attention model for brain decoding

The decoding of brain neural networks has been an intriguing topic in neuroscience for a well-rounded understanding of different types of brain disorders and cognitive stimuli. Integrating different types of connectivity, e.g., Functional Connectivity (FC) and Structural Connectivity (SC), from multi-modal imaging techniques can take their complementary information into account and therefore has the potential to improve decoding capability. However, traditional approaches for integrating FC and SC overlook the dynamical variations, which risks over-generalizing the brain neural network. In this paper, we propose a Joint kernel Graph Attention Network (JGAT), which is a new multi-modal temporal graph attention network framework. It integrates the data from functional Magnetic Resonance Images (fMRI) and Diffusion Weighted Imaging (DWI) while preserving the dynamic information at the same time. We conduct brain-decoding tasks with our JGAT on four independent datasets: three 7T fMRI datasets from the Human Connectome Project (HCP) and one from animal neural recordings. Furthermore, with Attention Scores (AS) and Frame Scores (FS) computed and learned from the model, we can locate several informative temporal segments and build meaningful dynamical pathways along the temporal domain for the HCP datasets. The code for the JGAT model is available at https://github.com/BRAINML-GT/JGAT.


[9] 2306.04719

Don't trust your eyes: on the (un)reliability of feature visualizations

How do neural networks extract patterns from pixels? Feature visualizations attempt to answer this important question by visualizing highly activating patterns through optimization. Today, visualization methods form the foundation of our knowledge about the internal workings of neural networks, as a type of mechanistic interpretability. Here we ask: How reliable are feature visualizations? We start our investigation by developing network circuits that trick feature visualizations into showing arbitrary patterns that are completely disconnected from normal network behavior on natural input. We then provide evidence for a similar phenomenon occurring in standard, unmanipulated networks: feature visualizations are processed very differently from standard input, casting doubt on their ability to "explain" how neural networks process natural images. We underpin this empirical finding by theory proving that the set of functions that can be reliably understood by feature visualization is extremely small and does not include general black-box neural networks. Therefore, a promising way forward could be the development of networks that enforce certain structures in order to ensure more reliable feature visualizations.


[10] 2306.04810

Correlative Information Maximization: A Biologically Plausible Approach to Supervised Deep Neural Networks without Weight Symmetry

The backpropagation algorithm has experienced remarkable success in training large-scale artificial neural networks; however, its biological plausibility is disputed, and it remains an open question whether the brain employs supervised learning mechanisms akin to it. Here, we propose correlative information maximization between layer activations as an alternative normative approach to describe the signal propagation in biological neural networks in both forward and backward directions. This new framework addresses many concerns about the biological plausibility of conventional artificial neural networks and the backpropagation algorithm. The coordinate descent-based optimization of the corresponding objective, combined with the mean square error loss function for fitting labeled supervision data, gives rise to a neural network structure that emulates a more biologically realistic network of multi-compartment pyramidal neurons with dendritic processing and lateral inhibitory neurons. Furthermore, our approach provides a natural resolution to the weight symmetry problem between forward and backward signal propagation paths, a significant critique against the plausibility of the conventional backpropagation algorithm. This is achieved by leveraging two alternative, yet equivalent forms of the correlative mutual information objective. These alternatives intrinsically lead to forward and backward prediction networks without weight symmetry issues, providing a compelling solution to this long-standing challenge.


[11] 2306.05143

Genomic Interpreter: A Hierarchical Genomic Deep Neural Network with 1D Shifted Window Transformer

Given the increasing volume and quality of genomics data, extracting new insights requires interpretable machine-learning models. This work presents Genomic Interpreter: a novel architecture for genomic assay prediction. This model outperforms the state-of-the-art models for genomic assay prediction tasks. Our model can identify hierarchical dependencies in genomic sites. This is achieved through the integration of 1D-Swin, a novel Transformer-based block designed by us for modelling long-range hierarchical data. Evaluated on a dataset containing 38,171 DNA segments of 17K base pairs, Genomic Interpreter demonstrates superior performance in chromatin accessibility and gene expression prediction and unmasks the underlying 'syntax' of gene regulation.


[12] 2306.05255

Toward more accurate and generalizable brain deformation estimators for traumatic brain injury detection with unsupervised domain adaptation

Machine learning head models (MLHMs) are developed to estimate brain deformation for early detection of traumatic brain injury (TBI). However, overfitting to simulated impacts and the lack of generalizability caused by distributional shift between different head impact datasets hinder the broad clinical application of current MLHMs. We propose brain deformation estimators that integrate unsupervised domain adaptation with a deep neural network to predict whole-brain maximum principal strain (MPS) and MPS rate (MPSR). With 12,780 simulated head impacts, we performed unsupervised domain adaptation on on-field head impacts from 302 college football (CF) impacts and 457 mixed martial arts (MMA) impacts using domain regularized component analysis (DRCA) and cycle-GAN-based methods. The new model improved the MPS/MPSR estimation accuracy, with the DRCA method significantly outperforming other domain adaptation methods in prediction accuracy (p<0.001): MPS RMSE: 0.027 (CF) and 0.037 (MMA); MPSR RMSE: 7.159 (CF) and 13.022 (MMA). On another two hold-out test sets with 195 college football impacts and 260 boxing impacts, the DRCA model significantly outperformed the baseline model without domain adaptation in MPS and MPSR estimation accuracy (p<0.001). The DRCA domain adaptation reduces the MPS/MPSR estimation error to well below TBI thresholds, enabling accurate brain deformation estimation to detect TBI in future clinical applications.


[13] 2306.05257

Comprehensive evaluation of deep and graph learning on drug-drug interactions prediction

Recent advances and achievements of artificial intelligence (AI) as well as deep and graph learning models have established their usefulness in biomedical applications, especially in drug-drug interactions (DDIs). DDIs refer to a change in the effect of one drug due to the presence of another drug in the human body, and they play an essential role in drug discovery and clinical research. Predicting DDIs through traditional clinical trials and experiments is an expensive and time-consuming process. To correctly apply advanced AI and deep learning, developers and users face various challenges such as the availability and encoding of data resources, and the design of computational methods. This review summarizes chemical structure based, network based, NLP based and hybrid methods, providing an updated and accessible guide for the broad research and development community with different domain knowledge. We introduce widely used molecular representations and describe the theoretical frameworks of graph neural network models for representing molecular structures. We present the advantages and disadvantages of deep and graph learning methods by performing comparative experiments. We discuss the potential technical challenges and highlight future directions of deep and graph learning models for accelerating DDI prediction.


[14] 2306.05302

Emergent circulation patterns from anonymized mobility data: Clustering Italy in the time of Covid

Using anonymized mobility data from Facebook users and publicly available information on the Italian population, we model the circulation of people in Italy before and during the early phase of the SARS-CoV-2 pandemic (COVID-19). We perform a spatial and temporal clustering of the movement network at the level of fluxes across provinces on a daily basis. The resulting partition in time successfully identifies the first two lockdowns without any prior information. Similarly, the spatial clustering returns 11 to 23 clusters depending on the period ("standard" mobility vs. lockdown) using the greedy modularity communities clustering method, and 16 to 30 clusters using the critical variable selection method. Fascinatingly, the spatial clusters obtained with both methods are strongly reminiscent of the 11 regions into which emperor Augustus had divided Italy according to Pliny the Elder. This work introduces and validates a data analysis pipeline that enables us: i) to assess the reliability of data obtained from a partial and potentially biased sample of the population in performing estimates of population mobility nationwide; ii) to identify areas of a country with well-defined mobility patterns; and iii) to distinguish different patterns from one another, resolve them in time and find their optimal spatial extent. The proposed method is generic and can be applied to other countries, with different geographical scales, and also to similar networks (e.g. biological networks). The results can thus represent a relevant step forward in the development of methods and strategies for the containment of future epidemic phenomena.
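Greedy modularity clustering searches for the partition of a network that maximizes the modularity score Q. The sketch below computes Q for a given partition of a weighted, undirected graph in plain Python. Treating the province-to-province fluxes as undirected edge weights is a simplifying assumption made here, and the function name and input format are illustrative rather than taken from the paper.

```python
from collections import defaultdict


def modularity(edges, communities):
    """Newman modularity Q of a partition of a weighted, undirected graph.

    edges: iterable of (u, v, weight) with u != v, each edge listed once.
    communities: iterable of node sets that together cover every node.
    Q = sum over communities of [ w_in / m - (d_c / (2 m))^2 ], where m is the
    total edge weight, w_in the edge weight inside the community, and d_c the
    summed weighted degree of its nodes.
    """
    edges = list(edges)
    membership = {node: k for k, comm in enumerate(communities) for node in comm}
    total_weight = sum(w for _, _, w in edges)
    if total_weight == 0:
        return 0.0

    internal = defaultdict(float)  # edge weight inside each community
    degree = defaultdict(float)    # summed weighted degree per community
    for u, v, w in edges:
        degree[membership[u]] += w
        degree[membership[v]] += w
        if membership[u] == membership[v]:
            internal[membership[u]] += w

    return sum(
        internal[k] / total_weight - (degree[k] / (2 * total_weight)) ** 2
        for k in degree
    )
```

A partition that keeps heavy flows inside clusters scores higher than one that lumps all provinces together, for which Q is zero. For the optimization itself, the NetworkX function `greedy_modularity_communities` implements a greedy maximization of this score.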


[15] 2306.05396

Asymmetric periodic boundary conditions for molecular dynamics and coarse-grained simulations of nucleic acids

Periodic boundary conditions are commonly applied in molecular dynamics simulations in the microcanonical (NVE), canonical (NVT) and isothermal-isobaric (NpT) ensembles. In their simplest application, a biological system of interest is placed in the middle of a solvation box, which is chosen 'sufficiently large' to minimize any numerical artefacts associated with the periodic boundary conditions. This practical approach brings limitations to the size of biological systems that can be simulated. Here, we study simulations of effectively infinitely-long nucleic acids, which are solvated in the directions perpendicular to the polymer chain, while periodic boundary conditions are also applied along the polymer chain. We study the effects of these asymmetric periodic boundary conditions (APBC) on the simulated results, including the mechanical properties of biopolymers and the properties of the surrounding solvent. To get some further insights into the advantages of using the APBC, a coarse-grained worm-like chain model is first studied, illustrating how the persistence length can be extracted from local properties of the polymer chain, which are less affected by the APBC than some global averages. This is followed by all-atom molecular dynamics simulations of DNA in ionic solutions, where we use the APBC to investigate sequence-dependent properties of DNA molecules and properties of the surrounding solvent.
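One local property from which a persistence length can be extracted is the correlation between tangent vectors along the chain. The sketch below assumes the standard three-dimensional worm-like chain relation <t(0)·t(s)> = exp(-s / L_p) and a discretized chain with a fixed bond length. It illustrates that general idea and is not necessarily the estimator used in the paper; the function names are hypothetical.

```python
import math


def tangent_correlation(points, lag):
    """Average dot product of unit bond vectors separated by `lag` bonds."""
    tangents = []
    for p, q in zip(points, points[1:]):
        d = [b - a for a, b in zip(p, q)]
        norm = math.sqrt(sum(c * c for c in d))
        tangents.append([c / norm for c in d])
    dots = [
        sum(a * b for a, b in zip(tangents[i], tangents[i + lag]))
        for i in range(len(tangents) - lag)
    ]
    return sum(dots) / len(dots)


def persistence_length(points, bond_length, lag=1):
    """Estimate L_p from <t(0).t(s)> = exp(-s / L_p) with s = lag * bond_length."""
    c = tangent_correlation(points, lag)
    if c >= 1.0:
        return math.inf  # perfectly straight chain
    if c <= 0.0:
        raise ValueError("correlation must be positive to take its logarithm")
    return -lag * bond_length / math.log(c)
```

Because the estimate uses only correlations between nearby segments, it does not rely on global quantities such as the end-to-end distance, which are ill-defined for a chain that is periodic along its own axis.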