In this work, we explore heuristics for the Adjacency Graph Packing problem, which can be applied to the Double Cut and Join (DCJ) Distance Problem. The DCJ is a rearrangement operation, and the distance problem based on it is a well-established method for genome comparison. Our heuristics use the adjacency graph structure, adapted to include information about intergenic regions, multiple copies of genes, and multiple circular or linear chromosomes. The only property required of the genomes is that it must be possible to turn one into the other with DCJ operations. We propose a greedy heuristic and a heuristic based on Genetic Algorithms. Experiments on artificial genomes show that both heuristics find good solutions, superior to those of a simpler random strategy.
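To illustrate the style of the second heuristic, the following is a minimal, generic genetic-algorithm skeleton (truncation selection, one-point crossover, bit-flip mutation) maximizing a toy bit-count fitness. It is purely illustrative of the heuristic family; the paper's actual encoding operates on adjacency-graph packings, which is not reproduced here, and all parameter values are assumed.

```python
import random

def genetic_algorithm(fitness, length, pop_size=40, gens=60, pmut=0.05, rng=None):
    """Generic GA skeleton over bit-string individuals (illustrative only)."""
    rng = rng or random.Random(0)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]           # truncation selection keeps the top half
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            cut = rng.randrange(1, length)
            child = p1[:cut] + p2[cut:]        # one-point crossover
            child = [1 - g if rng.random() < pmut else g for g in child]  # mutation
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)

best = genetic_algorithm(sum, length=30)       # toy OneMax fitness: count of 1-bits
print(sum(best))
```

Because the elite is carried over unchanged, the best fitness is monotonically non-decreasing across generations, which is the property that makes such heuristics easy to compare against a random baseline.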
Climate change impacts ecosystems worldwide, affecting animal behaviour and survival both directly and indirectly, for example through changes in food availability. For animals that rely on leaves as a primary food source, understanding how climate change influences the leaf production of trees is crucial, yet this remains understudied, especially in moist evergreen tropical forests. We analyzed a 23-year dataset of young leaf phenology from a moist tropical forest in Kibale National Park, Uganda, to examine seasonal and long-term patterns of 12 key tree species consumed by folivorous primates. We described phenological patterns and explored relationships between the young leaf production of different tree species and climate variables. We also assessed the suitability of the Enhanced Vegetation Index (EVI) as a proxy for young leaf production in moist evergreen tropical forests. Our results showed that tree species exhibited distinct phenological patterns, with most species producing young leaves during two seasonal peaks aligned with the rainy seasons. Rainfall, cloud cover, and maximum temperature were the most informative predictors of seasonal variation in young leaf production. However, solar radiation and atmospheric CO$_2$ were most informative regarding long-term trends. EVI was strongly correlated with young leaf production within years but less effective at capturing inter-annual trends. These findings highlight the complex relationship between climate and young leaf phenology in moist evergreen tropical forests and help us understand changes in food availability for tropical folivores.
We tackle the quantification of synchrony in a large ensemble of interacting neurons from the observation of spiking events. In a simulation study, we efficiently infer the synchrony level in a neuronal population from a point process reflecting the spiking of a small number of units, and even from a single neuron. We introduce a synchrony measure (order parameter) based on the Bartlett covariance density; this quantity can be easily computed from the recorded point process. This measure is robust to missed spikes and, if computed from observations of several neurons, does not require spike sorting. We illustrate the approach by modeling populations of spiking or bursting neurons, including the case of sparse synchrony.
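The intuition behind covariance-based synchrony measures can be sketched in a few lines: spikes pooled across neurons that share a common rate modulation show excess count covariance relative to independent Poisson firing. The sketch below uses a pooled-count Fano factor as a simple covariance-based proxy; it is an assumed simplification, not the paper's Bartlett-covariance-density estimator, and all simulation parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def pooled_spikes(rate_common, n_neurons=20, dt=1e-3, sync=0.5):
    """Simulate neurons mixing a shared rate modulation (weight `sync`)
    with an independent constant baseline; return binned pooled counts."""
    base = rate_common.mean()
    rate = sync * rate_common + (1 - sync) * base       # Hz, per neuron
    spikes = rng.random((n_neurons, rate.size)) < rate * dt
    return spikes.sum(axis=0)

def sync_index(pooled):
    """Variance-to-mean ratio of the pooled count: ~1 for independent
    Poisson-like firing, >1 when neurons covary (synchrony)."""
    return pooled.var() / pooled.mean()

T, dt = 200.0, 1e-3
t = np.arange(0, T, dt)
common = 20.0 * (1 + 0.8 * np.sin(2 * np.pi * 2.0 * t))  # shared 2 Hz modulation

pooled_sync = pooled_spikes(common, sync=1.0)   # fully shared modulation
pooled_async = pooled_spikes(common, sync=0.0)  # independent firing
print(sync_index(pooled_sync), sync_index(pooled_async))
```

Note that the index is computed from the pooled train alone, which is consistent with the abstract's point that no spike sorting is needed.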
An annealed version of the quenched mean-field model for epidemic spread is introduced and investigated analytically, assisted by numerical calculations. The interaction between individuals follows a prescription used to generate a scale-free network, and we adjusted the number of connections to produce a sparse network. Specifically, we examine the model's behavior near the infection threshold, as well as the behavior of the stationary prevalence and of the probability that a connection between individuals encounters an infected one. We found that these functions display a monotonically increasing dependence on the infection rate. Subsequently, a modification that mimics mitigation of the probability of encountering an infected individual is introduced, following an old idea rooted in the Malthus-Verhulst model. We found that this modification drastically changes the probability that a connection meets an infected individual; however, it does not alter the monotonically increasing behavior of the stationary prevalence.
Understanding how molecular changes caused by genetic variation drive disease risk is crucial for deciphering disease mechanisms. However, interpreting genome sequences is challenging because of the vast size of the human genome and because its consequences manifest across a wide range of cells, tissues, and scales -- spanning from the molecular to the whole-organism level. Here, we present Phenformer, a multi-scale genetic language model that learns to generate mechanistic hypotheses as to how differences in genome sequence lead to disease-relevant changes in expression across cell types and tissues, directly from DNA sequences of up to 88 million base pairs. Using whole genome sequencing data from more than 150 000 individuals, we show that Phenformer generates mechanistic hypotheses about disease-relevant cell and tissue types that match the literature better than existing state-of-the-art methods, while using only sequence data. Furthermore, disease risk predictors enriched by Phenformer show improved prediction performance and generalisation to diverse populations. Accurate multi-megabase scale interpretation of whole genomes without additional experimental data enables both a deeper understanding of the molecular mechanisms involved in disease and improved disease risk prediction at the level of individuals.
Using the example of Darwin's finches as a narrative theme, a microscopic agent-based model is introduced to study sympatric speciation as a result of competition for resources within the same ecological niche. By varying the competition among individuals and the resource distribution, the model exhibits some of the main features of evolutionary branching processes. The model can be extended to include spatial effects, multiple genetic loci, sexual mating and recombination, etc.
In 1929, Jan Lukasiewicz used, apparently for the first time, his Polish notation to represent the operations of formal logic. This is a parenthesis-free notation, in which logical functions are operators preceding the variables on which they act. In the 1980s, within the framework of research into mathematical models of the parallel processing of neural systems, a group of operators emerged -- neurally inspired and based on matrix algebra -- which computed logical operations automatically. These matrix operators reproduce the order of operators and variables of Polish notation. These logical matrices can also generate a three-valued logic with broad similarities to Lukasiewicz's three-valued logic. In this paper, a parallel is drawn between relevant formulas represented in Polish notation and their counterparts in terms of neurally based matrix operators. Lukasiewicz's three-valued logic, shown in Polish notation, has several points of contact with what the matrices produce when they process uncertain truth vectors. This formal parallelism opens up scientific and philosophical perspectives that deserve to be further explored.
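Polish notation is directly executable: because every operator precedes its arguments and has a fixed arity, a formula can be evaluated by a single left-to-right recursive pass with no parentheses. The sketch below evaluates formulas under Lukasiewicz's letter convention (N negation, K conjunction, A disjunction, C implication) with the standard three-valued tables over {0, 1/2, 1}; it is a textbook illustration, not the matrix-operator formalism discussed in the paper.

```python
from fractions import Fraction

# Lukasiewicz three-valued connectives: Np = 1-p, Kpq = min, Apq = max,
# Cpq = min(1, 1 - p + q); values 0, 1/2, 1.
OPS = {
    'N': (1, lambda p: 1 - p),
    'K': (2, lambda p, q: min(p, q)),
    'A': (2, lambda p, q: max(p, q)),
    'C': (2, lambda p, q: min(1, 1 - p + q)),
}

def eval_polish(formula, valuation):
    """Evaluate a parenthesis-free (prefix) formula, e.g. 'CpCqp'."""
    tokens = iter(formula)
    def parse():
        tok = next(tokens)
        if tok in OPS:
            arity, fn = OPS[tok]
            args = [parse() for _ in range(arity)]   # read arguments left to right
            return fn(*args)
        return valuation[tok]                        # a propositional variable
    value = parse()
    if next(tokens, None) is not None:
        raise ValueError("trailing tokens in formula")
    return value

half = Fraction(1, 2)
print(eval_polish('CpCqp', {'p': half, 'q': half}))  # a Lukasiewicz theorem: 1
print(eval_polish('ApNp', {'p': half}))              # excluded middle fails: 1/2
```

The second example shows the hallmark of the three-valued system: the law of excluded middle, ApNp, takes the intermediate value 1/2 under an uncertain valuation.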
We study the mixing time of the noisy SIS (Susceptible-Infected-Susceptible) model on graphs. The noisy SIS model is a variant of the standard SIS model, which allows individuals to become infected not just due to contacts with infected individuals but also due to external noise. We show that, under strong external noise, the mixing time is of order $O(n \log n)$. Additionally, we demonstrate that the mixing time on random graphs, namely Erd\H{o}s--R\'enyi graphs, regular multigraphs, and Galton--Watson trees, is also of order $O(n \log n)$ with high probability.
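For concreteness, the dynamics can be sketched as follows: at each step a susceptible node is infected by each infected neighbour and, independently, by external noise, while infected nodes recover. This is an illustrative discrete-time simulation on an Erdos-Renyi graph with assumed parameter values; the paper's results concern the mixing time of the chain, which a simulation does not establish.

```python
import random

def noisy_sis_step(adj, infected, beta, gamma, eps, rng):
    """One synchronous update: a susceptible node is infected w.p.
    1 - (1-beta)^k * (1-eps), where k counts infected neighbours and eps
    is the external noise; an infected node recovers w.p. gamma."""
    new = set()
    for v in range(len(adj)):
        if v in infected:
            if rng.random() >= gamma:          # stays infected
                new.add(v)
        else:
            k = sum(1 for u in adj[v] if u in infected)
            if rng.random() < 1 - (1 - beta) ** k * (1 - eps):
                new.add(v)
    return new

rng = random.Random(1)
n, p = 200, 0.05                               # Erdos-Renyi G(n, p)
adj = [[] for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        if rng.random() < p:
            adj[i].append(j); adj[j].append(i)

infected = set()                               # start from all-susceptible
for _ in range(500):
    infected = noisy_sis_step(adj, infected, beta=0.02, gamma=0.5, eps=0.05, rng=rng)
print(len(infected) / n)                       # prevalence after many steps
```

Because the noise term eps keeps the chain irreducible (the all-susceptible state is not absorbing), the process has a nontrivial stationary distribution, which is exactly the regime in which the mixing-time question is posed.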
Various approaches utilizing Transformer architectures have achieved state-of-the-art results in Natural Language Processing (NLP). Based on this success, numerous architectures have been proposed for other types of data, such as in biology, particularly for protein sequences. Notable among these are the ESM2 architectures, pre-trained on billions of proteins, which form the basis of various state-of-the-art approaches in the field. However, the ESM2 architectures have a limitation on input size, restricting sequences to 1,022 amino acids, which necessitates preprocessing techniques to handle longer sequences. In this paper, we present long and quantized versions of the ESM2 architectures, doubling the input size limit to 2,048 amino acids.
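A common preprocessing workaround for the 1,022-residue limit is to split long sequences into overlapping windows and embed each window separately. The sketch below shows one such chunking scheme; the overlap size and windowing strategy are assumed choices for illustration, not the paper's method (which instead extends the limit itself).

```python
def chunk_sequence(seq, max_len=1022, overlap=100):
    """Split a protein sequence into overlapping windows of at most
    max_len residues so each window fits the model's input limit.
    (Sketch; the overlap of 100 is an assumed, illustrative value.)"""
    if len(seq) <= max_len:
        return [seq]
    step = max_len - overlap
    # start a new window every `step` residues; the final window may be shorter
    return [seq[i:i + max_len] for i in range(0, len(seq) - overlap, step)]

chunks = chunk_sequence("A" * 2000)
print([len(c) for c in chunks])
```

Each window then gets its own embedding, and per-residue outputs in the overlapping regions are typically reconciled by averaging; doubling the native limit to 2,048, as the paper does, removes the need for this step for most proteins.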
Drug combination therapies have shown promising therapeutic efficacy in complex diseases and have demonstrated the potential to reduce drug resistance. However, the huge number of possible drug combinations makes it difficult to screen them all in traditional experiments. In this study, we propose MD-Syn, a computational framework based on a multidimensional feature fusion method and multi-head attention mechanisms. Given drug pair-cell line triplets, MD-Syn considers one-dimensional and two-dimensional feature spaces simultaneously. It consists of a one-dimensional feature embedding module (1D-FEM), a two-dimensional feature embedding module (2D-FEM), and a deep neural network-based classifier for synergistic drug combination prediction. MD-Syn achieved an AUROC of 0.919 in 5-fold cross-validation, outperforming state-of-the-art methods. Furthermore, MD-Syn showed comparable results on two independent datasets. In addition, the multi-head attention mechanisms not only learn embeddings from different feature aspects but also focus on essential interactive feature elements, improving the interpretability of MD-Syn. In summary, MD-Syn is an interpretable framework for prioritizing synergistic drug combination pairs using chemical and cancer cell line gene expression profiles. To facilitate broader community access to this model, we have developed a web portal (https://labyeh104-2.life.nthu.edu.tw/) that enables customized predictions of drug combination synergy effects based on user-specified compounds.
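The multi-head attention mechanism the framework builds on can be sketched generically: each head attends over the feature tokens in its own subspace, and per-token attention weights are what make such models inspectable. This is a minimal NumPy sketch of scaled dot-product multi-head attention with illustrative shapes and random weights, not MD-Syn's actual modules.

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, n_heads):
    """Scaled dot-product multi-head attention over a (tokens, features)
    matrix X; each head works on its own slice of the projected features."""
    n, d = X.shape
    dh = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    out = []
    for h in range(n_heads):
        s = slice(h * dh, (h + 1) * dh)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(dh)
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)      # softmax: rows sum to 1
        out.append(w @ V[:, s])                # weighted mix of value vectors
    return np.concatenate(out, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                    # 6 feature tokens, 8 dims (illustrative)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
Y = multi_head_attention(X, Wq, Wk, Wv, n_heads=2)
print(Y.shape)
```

In an interpretability setting, it is the row-normalized weight matrices `w` (one per head) that are inspected to see which feature elements a prediction attended to.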
Videomicroscopy combined with machine learning is a promising tool for studying the early development of in vitro fertilized bovine embryos and for assessing their transferability as early as possible. We aim to predict embryo transferability within at most four days, taking 2D time-lapse microscopy videos as input. We formulate this as a supervised binary classification problem with the classes transferable and not transferable. The challenges are three-fold: 1) poorly discriminating appearance and motion, 2) class ambiguity, and 3) a small amount of annotated data. We propose a 3D convolutional neural network involving three pathways, which makes it multi-scale in time and able to handle appearance and motion in different ways. For training, we adopt the focal loss. Our model, named SFR, compares favorably to other methods, and experiments demonstrate its effectiveness and accuracy on this challenging biological task.
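The focal loss mentioned above down-weights well-classified examples so training focuses on hard, ambiguous ones, which suits the class-ambiguity and small-data challenges. Below is a minimal sketch of the standard binary focal loss (Lin et al.); the gamma and alpha values are the common defaults, assumed here rather than taken from the paper.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a predicted positive-class probability p and
    label y in {0, 1}: the (1 - pt)^gamma factor shrinks the loss of
    confident, correct predictions; alpha balances the two classes."""
    pt = p if y == 1 else 1 - p                # probability of the true class
    at = alpha if y == 1 else 1 - alpha        # class-balancing weight
    return -at * (1 - pt) ** gamma * math.log(pt)

# An easy positive (p = 0.9) contributes far less than a hard one (p = 0.6),
# unlike plain cross-entropy, where the gap is much smaller.
print(focal_loss(0.9, 1), focal_loss(0.6, 1))
```

With gamma = 0 and alpha = 0.5 the expression reduces to (half of) the ordinary cross-entropy, so gamma directly controls how aggressively easy examples are suppressed.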
Large language models excel at interpreting complex natural language instructions, enabling them to perform a wide range of tasks. In the life sciences, single-cell RNA sequencing (scRNA-seq) data serves as the "language of cellular biology", capturing intricate gene expression patterns at the single-cell level. However, interacting with this "language" through conventional tools is often inefficient and unintuitive, posing challenges for researchers. To address these limitations, we present InstructCell, a multi-modal AI copilot that leverages natural language as a medium for more direct and flexible single-cell analysis. We construct a comprehensive multi-modal instruction dataset that pairs text-based instructions with scRNA-seq profiles from diverse tissues and species. Building on this, we develop a multi-modal cell language architecture capable of simultaneously interpreting and processing both modalities. InstructCell empowers researchers to accomplish critical tasks, such as cell type annotation, conditional pseudo-cell generation, and drug sensitivity prediction, using straightforward natural language commands. Extensive evaluations demonstrate that InstructCell consistently meets or exceeds the performance of existing single-cell foundation models, while adapting to diverse experimental conditions. More importantly, InstructCell provides an accessible and intuitive tool for exploring complex single-cell data, lowering technical barriers and enabling deeper biological insights.
Across the scientific realm, we find ourselves subtracting or dividing stochastic signals. For instance, consider a stochastic realization, $x$, generated from the addition or multiplication of two stochastic signals $a$ and $b$, namely $x=a+b$ or $x = ab$. For the $x=a+b$ example, $a$ can be fluorescence background and $b$ the signal of interest whose statistics are to be learned from the measured $x$. Similarly, when writing $x=ab$, $a$ can be thought of as the illumination intensity and $b$ the density of fluorescent molecules of interest. Yet dividing or subtracting stochastic signals amplifies noise, and we ask instead whether, using the statistics of $a$ and the measurement of $x$ as input, we can recover the statistics of $b$. Here, we show how normalizing flows can generate an approximation of the probability distribution over $b$, thereby avoiding subtraction or division altogether. This method is implemented in our software package, NFdeconvolve, available on GitHub with a tutorial linked in the main text.
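The noise amplification that motivates this approach is easy to demonstrate: subtracting an independent realization of the background from the measurement inflates the variance from Var(b) to Var(b) + 2 Var(a). The sketch below shows only this motivating effect with assumed Gaussian toy distributions; it does not reproduce the normalizing-flow deconvolution implemented in NFdeconvolve.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# x = a + b: 'a' is a background with known statistics, 'b' the signal.
a = rng.normal(10.0, 2.0, n)           # background (e.g. fluorescence)
b = rng.normal(3.0, 0.5, n)           # signal of interest, Var(b) = 0.25
x = a + b                              # what is actually measured

# Naive subtraction with an independent background draw a' amplifies noise:
# Var(x - a') = Var(b) + 2 Var(a) = 0.25 + 8 = 8.25, a 33-fold inflation.
a_prime = rng.normal(10.0, 2.0, n)
naive = x - a_prime
print(b.var(), naive.var())
```

The mean of the naive estimate is still correct; it is the variance blow-up that a distribution-level method such as the one proposed here is designed to avoid.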