The Genebass dataset, released by Karczewski et al. (2022), provides a comprehensive resource elucidating associations between genes and 4,529 phenotypes based on nearly 400,000 exomes from the UK Biobank. This extensive dataset enables the evaluation of gene set enrichment across a wide range of phenotypes, facilitating the inference of associations between specified gene sets and phenotypic traits. Despite its potential, no established method for applying gene set enrichment analysis (GSEA) to Genebass data exists. To address this gap, we propose utilizing fast pre-ranked gene set enrichment analysis (FGSEA) as a novel approach to determine whether a specified set of genes is significantly enriched in phenotypes within the UK Biobank. We developed an R package, ukbFGSEA, to implement this analysis, complete with a hands-on tutorial. Our approach has been validated by analyzing gene sets associated with autism spectrum disorder, developmental disorder, and neurodevelopmental disorders, demonstrating its capability to reveal both established and novel associations.
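To illustrate the kind of pre-ranked GSEA step the package automates, here is a minimal sketch using the Python package gseapy purely as a stand-in (ukbFGSEA itself is an R package; the file name, column names, and gene set below are hypothetical, not taken from Genebass or ukbFGSEA):

```python
# Illustrative sketch only: pre-ranked GSEA on one phenotype's gene-level
# results. The actual ukbFGSEA package is implemented in R on top of fgsea.
import pandas as pd
import gseapy as gp

# Hypothetical Genebass-style export: one row per gene for a single phenotype,
# with a signed burden-test statistic used to rank genes.
results = pd.read_csv("genebass_phenotype_burden.csv")     # hypothetical file
ranking = (
    results[["gene_symbol", "beta_burden"]]                # hypothetical columns
    .dropna()
    .sort_values("beta_burden", ascending=False)
)

# A user-specified gene set, e.g. autism-associated genes (illustrative subset).
gene_sets = {"ASD_genes": ["CHD8", "SCN2A", "SYNGAP1", "ADNP", "SHANK3"]}

pre_res = gp.prerank(
    rnk=ranking,            # two-column DataFrame: gene, ranking metric
    gene_sets=gene_sets,
    min_size=3,             # allow small sets for this toy example
    permutation_num=1000,
    seed=42,
    outdir=None,            # keep results in memory
)
print(pre_res.res2d)        # normalized enrichment scores (NES) and FDR q-values
```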
Chikungunya virus (CHIKV) is one of the most relevant arboviruses affecting public health today. It belongs to the family Togaviridae, genus Alphavirus, and causes an arthritogenic disease known as Chikungunya fever (CHIKF). This multifaceted disease is distinguished from other arbovirus infections by intense arthralgia, which can persist for months or even years in some individuals. The virus has re-emerged as a global health threat in recent decades, originating in Africa and spreading across Asia and the Americas, leading to widespread outbreaks affecting millions. Despite more than 50 years of research on CHIKV pathogenesis, no drugs or vaccines are available. Current management focuses on supportive care to alleviate symptoms and improve patients' quality of life. The ongoing threat posed by CHIKV highlights the need to better understand its pathogenesis. This review provides a comprehensive overview of CHIKV, focusing on host factors, vector-related factors, and complex viral genetic interactions. By exploring these intricate connections, we aim to offer insights that may lead to more effective strategies for preventing and managing this re-emerging global health threat.
We assessed the validity of one of the most frequently used methods to estimate cancer incidence on the basis of cancer mortality data and the incidence-to-mortality ratio (IMR), known as the IMR method. Using the previous 15-year cancer mortality time series, we derived the expected yearly number of cancer cases in the period 2004 to 2013 for six cancer sites in each sex. Generalized linear mixed models, including a polynomial function for the year of death and smoothing splines for age, were fitted under a Bayesian framework based on Markov chain Monte Carlo methods. The IMR method was applied to five scenarios reflecting different assumptions regarding the behavior of the IMR. We compared incident cases estimated with the IMR method to observed cases diagnosed in 2004 to 2013 in Granada. A goodness-of-fit (GOF) indicator was formulated to determine the best estimation scenario. The relative differences between the observed and predicted numbers of cancer cases were less than 10 percent for most cancer sites. The constant assumption for the IMR trend provided the best GOF for colon, rectal, lung, bladder, and stomach cancers in men and colon, rectal, breast, and corpus uteri cancers in women.
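As a hedged illustration of the arithmetic at the core of the IMR method (not the authors' Bayesian mixed-model implementation): under a given scenario, expected incidence is the projected mortality multiplied by the assumed IMR, and the estimates are then compared with registry counts. All numbers below are invented for illustration.

```python
# Minimal illustration of the IMR method's core identity:
# expected incidence = projected mortality x assumed IMR.
# All values are made up for illustration only.
projected_deaths = {2011: 210, 2012: 215, 2013: 221}   # hypothetical projections
assumed_imr = 1.8                                       # constant-IMR scenario

expected_cases = {year: deaths * assumed_imr
                  for year, deaths in projected_deaths.items()}

# Compare with (hypothetical) observed registry counts to assess fit.
observed_cases = {2011: 392, 2012: 401, 2013: 385}
for year in expected_cases:
    rel_diff = (expected_cases[year] - observed_cases[year]) / observed_cases[year]
    print(year, f"expected={expected_cases[year]:.0f}", f"rel_diff={rel_diff:+.1%}")
```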
Natural products are substances produced by organisms in nature and often possess biological activity and structural diversity. Drug development based on natural products has been common practice for many years. However, the intricate structures of these compounds present challenges for structure determination and synthesis, particularly compared with the efficiency of high-throughput screening of synthetic compounds. In recent years, deep learning-based methods have been applied to molecule generation. In this study, we trained chemical language models on a natural product dataset and generated natural product-like compounds. The results showed that the distribution of the generated compounds was similar to that of natural products. We also evaluated the effectiveness of the generated compounds as drug candidates. Our method can be used to explore the vast chemical space and reduce the time and cost of natural product drug discovery.
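A minimal sketch of one common evaluation step for compounds sampled from a chemical language model, assuming SMILES output and using RDKit for validity checking and canonicalization; the example strings are placeholders, not model output.

```python
# Illustrative post-processing of SMILES strings sampled from a chemical
# language model: keep only syntactically valid molecules, canonicalized
# for deduplication. The strings below are placeholders, not model output.
from rdkit import Chem

generated_smiles = [
    "CC(=O)OC1=CC=CC=C1C(=O)O",   # valid (aspirin, used only as a placeholder)
    "C1=CC=CN=C1",                # valid (pyridine)
    "C1CC1(",                     # invalid: unbalanced parenthesis
]

valid = []
for smi in generated_smiles:
    mol = Chem.MolFromSmiles(smi)
    if mol is not None:
        valid.append(Chem.MolToSmiles(mol))   # canonical form

print(f"validity: {len(valid)}/{len(generated_smiles)}")
print(sorted(set(valid)))
```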
Proteins evolve through complex sequence spaces, with fitness landscapes serving as a conceptual framework that links sequence to function. Fitness landscapes can be smooth, where multiple similarly accessible evolutionary paths are available, or rugged, where the presence of multiple local fitness optima complicates evolution and prediction. Indeed, many proteins, especially those with complex functions or under multiple selection pressures, exist on rugged fitness landscapes. Here we discuss the theoretical framework that underpins our understanding of fitness landscapes, alongside recent work that has advanced that understanding, particularly the biophysical basis for smoothness versus ruggedness. Finally, we address the rapid advances that have been made in computational and experimental exploration and exploitation of fitness landscapes, and how these can identify efficient routes to protein optimization.
We introduce a stochastic model of a population with overlapping generations and arbitrary levels of self-fertilization versus outcrossing. We study how the global graph of reproductive relationships, or pedigree, influences the genealogical relationships of a sample of two gene copies at a genetic locus. Specifically, we consider a diploid Moran model with constant population size $N$ over time, in which a proportion of offspring are produced by selfing. We show that the conditional distribution of the pairwise coalescence time at a single locus given the random pedigree converges to a limit law as $N$ tends to infinity. This limit law generally differs from the corresponding traditional result obtained by averaging over all possible pedigrees. We describe three different behaviors in the limit depending on the relative strengths, from large to small, of selfing versus outcrossing: partial selfing, limited outcrossing, and negligible outcrossing. In the case of partial selfing, coalescence times are related to the Kingman coalescent, similar to what is found in traditional analyses. In the case of limited outcrossing, the retained pedigree information forms a random graph coming from a fragmentation-coagulation process, with the limiting coalescence times given by the meeting times of random walks on this graph. In the case of negligible outcrossing, which represents complete or nearly complete selfing, coalescence times are determined entirely by the fixed times to common ancestry of diploid individuals in the pedigree.
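For context (a classical, pedigree-averaged result rather than one from this study): under partial selfing with selfing probability $s$, the equilibrium inbreeding coefficient is $F = s/(2-s)$ and the coalescent effective size is $N_e = N/(1+F)$, so pairwise coalescence times follow a Kingman coalescent rescaled by $1/(1+F)$ relative to a purely outcrossing population. The pedigree-conditional limit laws described above refine this averaged picture.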
Herd immunity is a critical concept in epidemiology, describing a threshold at which a sufficient proportion of a population is immune, either through infection or vaccination, thereby preventing sustained transmission of a pathogen. In the classic Susceptible-Infectious-Recovered (SIR) model, which has been widely used to study infectious disease dynamics, the achievement of herd immunity depends on key parameters, including the transmission rate ($\beta$) and the recovery rate ($\gamma$), where $\gamma$ represents the inverse of the mean infectious period. While the transmission rate has received substantial attention, recent studies have underscored the significant role of $\gamma$ in determining the timing and sustainability of herd immunity. Additionally, it is becoming increasingly evident that treating $\gamma$ as a constant parameter might oversimplify the dynamics, as variations in recovery times can reflect diverse biological, social, and healthcare-related factors. In this paper, we investigate how heterogeneity in the recovery rate affects herd immunity. We show empirically that the mean of the recovery rate is not a reliable metric for determining the achievement of herd immunity. Furthermore, we provide a theoretical result demonstrating that it is instead the mean recovery time, i.e., the mean of the inverse $1/\gamma$ of the recovery rate, that is critical in deciding whether herd immunity is achievable within the SIR framework. A similar result is proved for the SEIR model. These insights have significant implications for public health interventions and theoretical modeling of epidemic dynamics.
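To make the distinction concrete, a small numerical sketch (the parameter values are arbitrary illustrations, not taken from the paper): when $\gamma$ is heterogeneous, computing $R_0$ from the mean recovery rate differs from computing it from the mean infectious period $\mathbb{E}[1/\gamma]$, and by Jensen's inequality the latter is always at least as large.

```python
# Illustrative sketch: with heterogeneous recovery rates, R0 computed from
# mean(gamma) differs from R0 computed from the mean infectious period
# mean(1/gamma). All parameter values are arbitrary and purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
beta = 0.3                                                  # transmission rate
gamma = rng.lognormal(mean=-1.5, sigma=1.0, size=100_000)   # heterogeneous recovery rates

r0_from_mean_rate = beta / gamma.mean()             # uses E[gamma]
r0_from_mean_period = beta * (1.0 / gamma).mean()   # uses E[1/gamma]

def herd_immunity_threshold(r0):
    return max(0.0, 1.0 - 1.0 / r0)

print("R0 (mean rate):  ", round(r0_from_mean_rate, 2),
      "HIT:", round(herd_immunity_threshold(r0_from_mean_rate), 2))
print("R0 (mean period):", round(r0_from_mean_period, 2),
      "HIT:", round(herd_immunity_threshold(r0_from_mean_period), 2))
# By Jensen's inequality E[1/gamma] >= 1/E[gamma], so the second R0 is larger,
# and the two can even disagree on whether a herd-immunity threshold exists.
```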
Liquid Chromatography Mass Spectrometry (LC-MS) is an indispensable analytical technique in proteomics, metabolomics, and other life sciences. While OpenMS provides advanced open-source software for MS data analysis, its complexity can be challenging for non-experts. To address this, we have developed OpenMS WebApps, a framework for creating user-friendly MS web applications based on the Streamlit Python package. OpenMS WebApps simplifies MS data analysis through an intuitive graphical user interface, interactive result visualizations, and support for both local and online execution. Key features include workspace management, automatic generation of input widgets, and parallel execution of tools, resulting in high-performance, ready-to-use solutions for online and local deployment. This framework benefits both researchers and developers: scientists can focus on their research without the burden of complex software setups, and developers can rapidly create and distribute custom WebApps with novel algorithms. Several applications built on the OpenMS WebApps template demonstrate its utility across diverse MS-related fields, enhancing the OpenMS ecosystem for developers and a wider range of users. Furthermore, it integrates seamlessly with third-party software, extending benefits to developers beyond the OpenMS community.
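For a sense of what such a web application looks like in code, here is a hypothetical, minimal Streamlit page in the same spirit; it is not the OpenMS WebApps template and omits workspaces, widget auto-generation, and parallel execution.

```python
# Hypothetical minimal Streamlit page in the spirit of an MS WebApp;
# this is NOT the OpenMS WebApps template, just an illustrative sketch.
import tempfile
from collections import Counter

import streamlit as st
import pyopenms as oms

st.title("Simple mzML inspector (illustrative sketch)")

uploaded = st.file_uploader("Upload an mzML file", type=["mzML"])
if uploaded is not None:
    # pyopenms reads from a path, so persist the upload to a temp file first.
    with tempfile.NamedTemporaryFile(suffix=".mzML", delete=False) as tmp:
        tmp.write(uploaded.getbuffer())
        path = tmp.name

    exp = oms.MSExperiment()
    oms.MzMLFile().load(path, exp)

    st.write(f"Number of spectra: {exp.size()}")
    level_counts = Counter(spec.getMSLevel() for spec in exp)
    st.write("Spectra per MS level:", dict(level_counts))
```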
Understanding the relationship between a population's probability of extinction and its carrying capacity is key in assessing conservation status, and critical to efforts to understand and mitigate the ongoing biodiversity crisis. Despite this, there has been limited research into the form of this relationship. We conducted around five billion population viability assessments, which reveal that the relationship is a modified Gompertz curve. This finding is consistent across around 1700 individual model populations, which between them span different breeding systems and widely varying rates of population growth, levels of environmental stochasticity, adult survival rates, ages at first breeding, and starting population sizes. Applying analytical methods to equations describing population dynamics showed that minimal assumptions were required to prove that this is a general relationship which holds for any extant population subject to density-dependent growth. Finally, we discuss the implications of these findings and consider the practical use of our results by conservationists.
Electron cryomicroscopy (cryo-EM) is a technique in structural biology used to reconstruct accurate volumetric maps of molecules. One step of the cryo-EM pipeline involves solving an inverse-problem. This inverse-problem, referred to as \textit{ab-initio} single-particle reconstruction, takes as input a collection of 2d-images -- each a projection of a molecule from an unknown viewing-angle -- and attempts to reconstruct the 3d-volume representing the underlying molecular density. Most methods for solving this inverse-problem search for a solution which optimizes a posterior likelihood of generating the observed image-data, given the reconstructed volume. Within this framework, it is natural to study the Hessian of the log-likelihood: the eigenvectors and eigenvalues of the Hessian determine how the likelihood changes with respect to perturbations in the solution, and can give insight into the sensitivity of the solution to aspects of the input. In this paper we describe a simple strategy for estimating the smallest eigenvalues and eigenvectors (i.e., the `softest modes') of the Hessian of the log-likelihood for the \textit{ab-initio} single-particle reconstruction problem. This strategy involves rewriting the log-likelihood as a 3d-integral. This interpretation holds in the low-noise limit, as well as in many practical scenarios which allow for noise-marginalization. Once we have estimated the softest modes, we can use them to perform many kinds of sensitivity analysis. For example, we can determine which parts of the reconstructed volume are trustworthy, and which are unreliable, and how this unreliability might depend on the data-set and the imaging parameters. We believe that this kind of analysis can be used alongside more traditional strategies for sensitivity analysis, as well as in other applications, such as free-energy estimation.
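As a generic numerical illustration of estimating softest modes (not the strategy described above, which rewrites the log-likelihood as a 3d-integral), the smallest eigenpairs of a Hessian can be obtained from Hessian-vector products alone; the quadratic toy problem below stands in for the cryo-EM likelihood.

```python
# Generic sketch: estimate the smallest eigenpairs ("softest modes") of a
# curvature matrix using only matrix-vector products. The toy symmetric
# positive-definite matrix H below stands in for the (negated) Hessian of
# the log-likelihood at the reconstructed volume.
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(1)
n = 200
A = rng.standard_normal((n, n))
H = A @ A.T + 1e-3 * np.eye(n)        # toy curvature matrix (symmetric PSD)

def hessian_vector_product(v):
    # In a real setting this product would come from differentiating the
    # log-likelihood twice (e.g. via automatic differentiation), without
    # ever forming H explicitly.
    return H @ v

H_op = LinearOperator((n, n), matvec=hessian_vector_product)

# Smallest algebraic eigenvalues ('SA') correspond to the softest modes.
eigvals, eigvecs = eigsh(H_op, k=5, which='SA')
print("softest eigenvalues:", np.round(eigvals, 4))
```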
Structure-based molecule optimization (SBMO) aims to optimize molecules with both continuous coordinates and discrete types against protein targets. A promising direction is to exert gradient guidance on generative models, given its remarkable success in images, but it is challenging to guide discrete data and risks inconsistencies between modalities. To this end, we leverage a continuous and differentiable space derived through Bayesian inference, presenting Molecule Joint Optimization (MolJO), the first gradient-based SBMO framework that facilitates joint guidance signals across different modalities while preserving SE(3)-equivariance. We introduce a novel backward correction strategy that optimizes within a sliding window of past histories, allowing for a seamless trade-off between exploration and exploitation during optimization. Our proposed MolJO achieves state-of-the-art performance on the CrossDocked2020 benchmark (Success Rate 51.3%, Vina Dock -9.05, and SA 0.78), a more than 4x improvement in Success Rate over the gradient-based counterpart and a "Me-Better" Ratio twice that of 3D baselines. Furthermore, we extend MolJO to a wide range of optimization settings, including multi-objective optimization and challenging tasks in drug design such as R-group optimization and scaffold hopping, further underscoring its versatility and potential.
Phylogenetic networks are directed acyclic graphs that depict the genomic evolution of related taxa. Reticulation nodes in such networks (nodes with more than one parent) represent reticulate evolutionary events, such as recombination, reassortment, hybridization, or horizontal gene transfer. Typically, the complexity of a phylogenetic network is expressed in terms of its level, i.e., the maximum number of edges that are required to be removed from each biconnected component of the phylogenetic network to turn it into a tree. Here, we study the relationship between the level of a phylogenetic network and another popular graph complexity parameter - treewidth. We show a $\frac{k+3}{2}$ upper bound on the treewidth of level-$k$ phylogenetic networks and an improved $(1/3 + \delta) k$ upper bound for large $k$. These bounds imply that many computational problems on phylogenetic networks, such as the small parsimony problem or some variants of phylogenetic diversity maximization, are polynomial-time solvable on level-$k$ networks with constant $k$. Our first bound is applicable to any $k$, and it allows us to construct an explicit tree decomposition of width $\frac{k+3}{2}$ that can be used to analyze phylogenetic networks generated by tools like SNAQ that guarantee bounded network level. Finally, we show a $k/13$ lower bound on the maximum treewidth among level-$k$ phylogenetic networks for large enough $k$ based on expander graphs.
Since the approval of the first antibody drug in 1986, a total of 162 antibodies have been approved for a wide range of therapeutic areas, including cancer and autoimmune, infectious, and cardiovascular diseases. Despite advances in biotechnology that have accelerated the development of antibody drugs, the drug discovery process for this modality remains lengthy and costly, requiring multiple rounds of optimization before a drug candidate can progress to preclinical and clinical trials. This multi-objective optimization problem involves increasing the affinity of the antibody to the target antigen while refining additional biophysical properties that are essential to drug development, such as solubility, thermostability, or aggregation propensity. Additionally, antibodies that resemble natural human antibodies are particularly desirable, as they are likely to offer improved profiles in terms of safety, efficacy, and reduced immunogenicity, further supporting their therapeutic potential. In this article, we explore the use of energy-based generative models to optimize a candidate monoclonal antibody. We identify tradeoffs when optimizing for multiple properties, concentrating on solubility, humanness, and affinity, and use the generative model we develop to generate candidate antibodies that lie on an optimal Pareto front satisfying these constraints.
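As a small illustration of the Pareto-front idea invoked here, the sketch below extracts non-dominated candidates given per-candidate property scores; the scores are random placeholders rather than outputs of any solubility, humanness, or affinity model.

```python
# Illustrative sketch: extract the Pareto-optimal set of candidates given
# per-candidate scores for several properties (higher is better for all).
# The scores are random placeholders, not real model outputs.
import numpy as np

rng = np.random.default_rng(7)
# columns: predicted solubility, humanness, affinity (all to be maximized)
scores = rng.random((50, 3))

def pareto_front(points):
    """Return indices of points not dominated by any other point."""
    keep = np.ones(len(points), dtype=bool)
    for i in range(len(points)):
        # j dominates i if j >= i in every objective and > i in at least one
        dominated = (np.all(points >= points[i], axis=1)
                     & np.any(points > points[i], axis=1))
        if dominated.any():
            keep[i] = False
    return np.where(keep)[0]

front = pareto_front(scores)
print(f"{len(front)} of {len(scores)} candidates lie on the Pareto front")
```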
Protein structures represent the key to deciphering biological functions. Finer-grained similarities among proteins are sometimes overlooked by conventional structural comparison methods. In contrast, more advanced methods, such as the Triangular Spatial Relationship (TSR), have been demonstrated to make finer differentiations. Still, the classical implementation of TSR does not allow the integration of secondary structure information, which is important for a more detailed understanding of a protein's folding pattern. To overcome these limitations, we developed the SSE-TSR approach. The proposed method integrates secondary structure elements (SSEs) into TSR-based protein representations. This allows an enriched representation of protein structures by considering 18 different combinations of helix, strand, and coil arrangements. Our results show that using SSEs improves the accuracy and reliability of protein classification to varying degrees. We worked with two large protein datasets of 9.2K and 7.8K samples, respectively. We applied the SSE-TSR approach and used a neural network model for classification. Introducing SSEs improved performance on Dataset 1, with accuracy moving from 96.0% to 98.3%. For Dataset 2, where performance was already good, further small improvements were found with the introduction of SSEs, giving an accuracy of 99.5% compared with 99.4%. These results show that SSE integration can improve TSR-based key discrimination, with substantial benefits in datasets with lower initial accuracy and only incremental gains in those with high baseline performance. Thus, SSE-TSR is a powerful bioinformatics tool that improves protein classification and enhances understanding of protein function and interaction.
DNA microarray gene-expression data have been widely used to identify cancerous gene signatures. Microarrays can increase the accuracy of cancer diagnosis and prognosis. However, analyzing the large amount of gene expression data from microarray chips poses a challenge for current machine learning research. One of the challenges in classifying healthy and cancerous tissues is the high dimensionality of gene expression data, which decreases classification accuracy. This research applies a hybrid Genetic Algorithm and Neural Network model to overcome this problem during the selection of informative gene subsets: a Genetic Algorithm (GA) reduces dimensionality during feature selection, and a Multi-Layer Perceptron (MLP) neural network is then applied to classify the selected genes. Performance is evaluated in terms of classification accuracy and the number of selected genes. Experimental results show that the proposed method achieves high accuracy with a small number of selected genes compared with other machine learning algorithms.
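A compact sketch of the general GA-plus-MLP wrapper pipeline described here, using scikit-learn and synthetic data; the population size, genetic operators, and hyperparameters are illustrative choices, not those of the paper.

```python
# Illustrative sketch of GA-based gene (feature) selection wrapped around an
# MLP classifier; synthetic data and GA settings are placeholders.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 500))          # 100 samples x 500 "genes"
y = rng.integers(0, 2, size=100)             # healthy vs cancerous labels

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=200, random_state=0)
    acc = cross_val_score(clf, X[:, mask], y, cv=3).mean()
    return acc - 0.001 * mask.sum()          # reward accuracy, penalize large subsets

pop = rng.random((20, X.shape[1])) < 0.05    # initial population of gene masks
for generation in range(5):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]  # keep the fittest half
    children = []
    for _ in range(10):
        a, b = parents[rng.integers(10)], parents[rng.integers(10)]
        cut = rng.integers(X.shape[1])
        child = np.concatenate([a[:cut], b[cut:]])     # one-point crossover
        flip = rng.random(X.shape[1]) < 0.01           # bit-flip mutation
        children.append(np.where(flip, ~child, child))
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected genes:", int(best.sum()), "fitness:", round(fitness(best), 3))
```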