This paper investigates a behavioral-feedback SIR model in which the infection rate adapts dynamically based on the fractions of susceptible and infected individuals. We introduce an invariant of motion and characterize the peak of infection. We further examine the system under a threshold constraint on the infection level. Based on this analysis, we formulate an optimal control problem that keeps the infection curve below a healthcare capacity threshold while minimizing the economic cost. For this problem, we study a feasible strategy that applies the minimal restrictions necessary to meet the capacity constraint and characterize the corresponding cost.
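The qualitative effect of behavioral feedback can be sketched numerically. In this minimal forward-Euler simulation, the feedback law beta(i) = beta0 / (1 + k*i) and all parameter values are illustrative assumptions, not the paper's specific model; the sketch only shows how a contact rate that declines with the infected fraction flattens the infection peak:

```python
def simulate_sir(beta0=0.3, gamma=0.1, k=50.0, i0=1e-3, dt=0.01, T=400.0):
    """Forward-Euler sketch of an SIR model with behavioral feedback.

    beta(i) = beta0 / (1 + k * i) is a hypothetical feedback law:
    contacts drop as the infected fraction i rises (k = 0 recovers
    the classical constant-rate SIR model).
    """
    s, i = 1.0 - i0, i0
    peak = i
    for _ in range(int(T / dt)):
        beta = beta0 / (1.0 + k * i)      # behavioral feedback on infection rate
        ds = -beta * s * i
        di = beta * s * i - gamma * i
        s += dt * ds
        i += dt * di
        peak = max(peak, i)
    return s, i, peak

# With feedback (k > 0) the infection peak is lower than without (k = 0),
# which is what makes a capacity constraint easier to satisfy.
_, _, peak_feedback = simulate_sir(k=50.0)
_, _, peak_classic = simulate_sir(k=0.0)
```

Comparing the two peaks mimics, in caricature, the role of the healthcare capacity threshold in the control problem.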
We introduce Genome-Factory, an integrated Python library for tuning, deploying, and interpreting genomic models. Our core contribution is to simplify and unify the workflow for genomic model development: data collection, model tuning, inference, benchmarking, and interpretability. For data collection, Genome-Factory offers an automated pipeline to download genomic sequences and preprocess them. It also includes quality control, such as GC content normalization. For model tuning, Genome-Factory supports three approaches: full-parameter, low-rank adaptation, and adapter-based fine-tuning. It is compatible with a wide range of genomic models. For inference, Genome-Factory enables both embedding extraction and DNA sequence generation. For benchmarking, we include two existing benchmarks and provide a flexible interface for users to incorporate additional benchmarks. For interpretability, Genome-Factory introduces the first open-source biological interpreter based on a sparse auto-encoder. This module disentangles embeddings into sparse, near-monosemantic latent units and links them to interpretable genomic features by regressing on external readouts. To improve accessibility, Genome-Factory features both a zero-code command-line interface and a user-friendly web interface. We validate the utility of Genome-Factory across three dimensions: (i) Compatibility with diverse models and fine-tuning methods; (ii) Benchmarking downstream performance using two open-source benchmarks; (iii) Biological interpretation of learned representations with DNABERT-2. These results highlight its end-to-end usability and practical value for real-world genomic analysis.
Antiterminators are essential components of bacterial transcriptional regulation, allowing the control of gene expression in response to fluctuating environmental conditions. Among them, RNA-binding antiterminator proteins play a major role in preventing transcription termination by binding to specific RNA sequences. These RNA-binding antiterminators have been extensively studied for their role in regulating various metabolic pathways. However, their function in modulating the physiology of pathogens requires further investigation. This review focuses on RNA-binding proteins displaying CAT (Co-AntiTerminator) or ANTAR (AmiR and NasR Transcription Antitermination Regulators) domains reported in model bacteria. In particular, their structures, mechanisms of action, and target genes will be described. The involvement of antitermination mechanisms in bacterial pathogenicity is also discussed. This knowledge is crucial for understanding the regulatory mechanisms that control bacterial virulence, and it opens up exciting prospects for future research, as well as potentially new alternative strategies to combat infectious diseases.
Summary: Microbiome HiFi Amplicon Sequence Simulator (MHASS) creates realistic synthetic PacBio HiFi amplicon sequencing datasets for microbiome studies by integrating genome-aware abundance modeling, realistic dual-barcoding strategies, and empirically derived pass-number distributions from actual sequencing runs. MHASS generates datasets tailored for rigorous benchmarking and validation of long-read microbiome analysis workflows, including ASV clustering and taxonomic assignment. Availability and Implementation: Implemented in Python with automated dependency management, the source code for MHASS is freely available at this https URL along with installation instructions. Contact: this http URL-stone@uconn.edu or this http URL@uconn.edu Supplementary information: Supplementary data are available online at this https URL.
Protein design has the potential to revolutionize biotechnology and medicine. While most efforts have focused on proteins with well-defined structures, increased recognition of the functional significance of intrinsically disordered regions, together with improvements in their modeling, has paved the way to their computational de novo design. This review summarizes recent advances in engineering intrinsically disordered regions with tailored conformational ensembles, molecular recognition, and phase behavior. We discuss challenges in combining models with predictive accuracy with scalable design workflows and outline emerging strategies that integrate knowledge-based, physics-based, and machine-learning approaches.
The Gillespie algorithm and its extensions are commonly used for the simulation of chemical reaction networks. A limitation of these algorithms is that they must process and update the system after every reaction, requiring significant computation. Another class of algorithms, based on the tau-leaping method, is able to simulate multiple reactions at a time at the cost of decreased accuracy. We present a new algorithm for the exact simulation of chemical reaction networks that is capable of sampling multiple reactions at a time via a first-order approximation, similar to tau-leaping methods. We prove that the algorithm has an improved runtime complexity compared to existing methods for the exact simulation of chemical reaction networks, and present an efficient and easy-to-use implementation that outperforms existing methods in practice.
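For context, the one-reaction-per-step bookkeeping that the new algorithm improves upon can be seen in a minimal Gillespie direct-method simulation of a birth-death network. The network and rate constants below are illustrative assumptions, not taken from the paper:

```python
import random

def gillespie_birth_death(lam=10.0, mu=1.0, t_max=200.0, seed=0):
    """Classic Gillespie direct method for the birth-death network
    0 -> X (rate lam), X -> 0 (rate mu * X).

    Every loop iteration samples and applies exactly one reaction;
    this per-reaction cost is what multi-reaction exact samplers
    and tau-leaping methods aim to amortize.
    """
    rng = random.Random(seed)
    t, x = 0.0, 0
    time_weighted_x = 0.0
    while t < t_max:
        a_birth, a_death = lam, mu * x
        a0 = a_birth + a_death            # total propensity
        tau = rng.expovariate(a0)         # exponential waiting time to next reaction
        time_weighted_x += x * min(tau, t_max - t)
        t += tau
        if t >= t_max:
            break
        if rng.random() * a0 < a_birth:   # pick which reaction fires
            x += 1
        else:
            x -= 1
    # time-averaged copy number; the stationary mean is lam / mu
    return time_weighted_x / t_max
```

The time-averaged copy number should hover near lam/mu, which provides a quick sanity check on any faster exact sampler meant to reproduce the same statistics.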
Cancer is a complex sequence of disease conditions progressing gradually with generalized loss of growth control. It continues to be one of the biggest global health problems, and its etiology has given rise to a huge array of treatments outside of conventional chemotherapy. Melanoma is one of the deadliest forms of skin cancer, originating from melanocytes. It is characterized by the overproduction of melanin driven by increased cell proliferation. Melanogenesis, the production of melanin, is stimulated by melanocyte-stimulating hormone (MSH), which raises cyclic AMP (cAMP) production to increase melanocyte activity. Through the use of methylxanthines such as theophylline, the proliferation rate can be decreased by increasing cell differentiation. One of the basic principles of cell biology is the selectivity of differentiation and proliferation: cells usually grow or differentiate, but not both. This study aimed to collect baseline data on untreated B16-F10 melanoma cells to determine their morphology and doubling time. These data were then compared with results from treatments at varied theophylline concentrations. It was hypothesized that increased levels of cyclic adenosine monophosphate (cAMP) could induce differentiation of melanocytes and terminate proliferation. To test this hypothesis, we collected a series of images of B16-F10 mouse melanoma cells, seeded cells into a 12-well plate, and calculated cell concentration over a five-day period and the doubling time of untreated cells. Theophylline levels were varied to stimulate the production of cAMP and determine its effect on melanocyte proliferation and differentiation. The observed results showed that intracellular stabilization of cAMP, via phosphodiesterase inhibition by methylxanthines such as theophylline, increases differentiation in melanoma cells and suppresses their growth.
Neurons communicate through spikes, and spike timing is a crucial part of neuronal processing. Spike times can be recorded experimentally both intracellularly and extracellularly, and are the main output of state-of-the-art neural probes. On the other hand, neuronal activity is controlled at the molecular level by the currents generated by many different transmembrane proteins called ion channels. Connecting spike timing to ion channel composition remains an arduous task to date. To address this challenge, we developed a method that combines deep learning with a theoretical tool called Dynamic Input Conductances (DICs), which reduce the complexity of ion channel interactions into three interpretable components describing how neurons spike. Our approach uses deep learning to infer DICs directly from spike times and then generates populations of "twin" neuron models that replicate the observed activity while capturing natural variability in membrane channel composition. The method is fast, accurate, and works using only spike recordings. We also provide open-source software with a graphical interface, making it accessible to researchers without programming expertise.
The simulation of whole-brain dynamics should reproduce realistic spontaneous and evoked neural activity across different scales, including emergent rhythms, spatio-temporal activation patterns, and macroscale complexity. Once a mathematical model is selected, its configuration must be determined by properly setting its parameters. A critical preliminary step in this process is defining an appropriate set of observables to guide the selection of model configurations (parameter tuning), laying the groundwork for quantitative calibration of accurate whole-brain models. Here, we address this challenge by presenting a framework that integrates two complementary tools: The Virtual Brain (TVB) platform for simulating whole-brain dynamics, and the Collaborative Brain Wave Analysis Pipeline (Cobrawap) for analyzing the simulations using a set of standardized metrics. We apply this framework to a 998-node human connectome, using two configurations of the Larter-Breakspear neural mass model: one with the TVB default parameters, the other tuned using Cobrawap. The results reveal that the tuned configuration exhibits several biologically relevant features, absent in the default model for both spontaneous and evoked dynamics. In response to external perturbations, the tuned model generates non-stereotyped, complex spatio-temporal activity, as measured by the perturbational complexity index. In spontaneous activity, it displays robust alpha-band oscillations, infra-slow rhythms, scale-free characteristics, greater spatio-temporal heterogeneity, and asymmetric functional connectivity. This work demonstrates the potential of combining TVB and Cobrawap to guide parameter tuning and lays the groundwork for data-driven calibration and validation of accurate whole-brain models.
The scarcity of experimental protein-ligand complexes poses a significant challenge for training robust deep learning models for molecular docking. Given the prohibitive cost and time constraints associated with experimental structure determination, scalable generation of realistic protein-ligand complexes is needed to expand available datasets for model development. In this study, we introduce a novel workflow for the procedural generation and validation of synthetic protein-ligand complexes, combining a diverse ensemble of generation techniques and rigorous quality control. We assessed the utility of these synthetic datasets by retraining established docking models, Smina and Gnina, and evaluating their performance on standard benchmarks including the PDBBind core set and the PoseBusters dataset. Our results demonstrate that models trained on synthetic data achieve performance comparable to models trained on experimental data, indicating that current synthetic complexes can effectively capture many salient features of protein-ligand interactions. However, we did not observe significant improvements in docking or scoring accuracy over conventional methods or experimental data augmentation. These findings highlight the promise as well as the current limitations of synthetic data for deep learning-based molecular docking and underscore the need for further refinement in generation methodologies and evaluation strategies to fully exploit the potential of synthetic datasets for this application.
Digital pathology has emerged as a transformative approach to tissue analysis, offering unprecedented opportunities for objective, quantitative assessment of histopathological features. However, the complexity of implementing artificial intelligence (AI) solutions in pathology workflows has limited widespread adoption. Here we present ORCA (Optimized Research and Clinical Analytics), a comprehensive no-code AI platform specifically designed for digital pathology applications. ORCA addresses critical barriers to AI adoption by providing an intuitive interface that enables pathologists and researchers to train, deploy, and validate custom AI models without programming expertise. The platform integrates advanced deep learning architectures with clinical workflow management, supporting applications from tissue classification and cell segmentation to spatial distribution scoring and novel biomarker discovery. We demonstrate ORCA's capabilities through validation studies across multiple cancer types, showing significant improvements in analytical speed, reproducibility, and clinical correlation compared to traditional manual assessment methods. Our results indicate that ORCA successfully democratizes access to state-of-the-art AI tools in pathology, potentially accelerating biomarker discovery and enhancing precision medicine initiatives.
Fragment-based drug design is a promising strategy that leverages the binding of small chemical moieties to efficiently guide drug discovery. The initial step of fragment identification remains challenging, as fragments often bind weakly and non-specifically. We developed a protein-fragment encoder that relies on a contrastive learning approach to map both molecular fragments and protein surfaces into a shared latent space. The encoder captures interaction-relevant features and enables both virtual screening and generative design with our new method, LatentFrag. In LatentFrag, fragment embeddings and positions are generated conditioned on the protein surface while being chemically realistic by construction. Our expressive fragment and protein representations allow locating protein-fragment interaction sites with high sensitivity, and we observe state-of-the-art fragment recovery rates when sampling from the learned distribution of latent fragment embeddings. Our generative method outperforms common approaches such as virtual screening at a fraction of their computational cost, providing a valuable starting point for fragment hit discovery. We further show the practical utility of LatentFrag and extend the workflow to full ligand design tasks. Together, these approaches contribute to advancing fragment identification and provide valuable tools for fragment-based drug discovery.
Summary: Uchimata is a toolkit for the visualization of 3D genome structures. It consists of two packages: a Javascript library facilitating the rendering of 3D models of genomes, and a Python widget for visualization in Jupyter Notebooks. Main features include an expressive way to specify visual encodings, and filtering of 3D genome structures based on genomic semantics and spatial aspects. Uchimata is designed to integrate easily with biological tooling available in Python. Availability and Implementation: Uchimata is released under the MIT License. The Javascript library is available on NPM, while the widget is available as a Python package hosted on PyPI. The source code for both is publicly available on GitHub (this https URL and this https URL). The documentation with examples is hosted at this https URL Contact: david_kouril@hms.this http URL or nils@hms.this http URL.
Unraveling the dynamical motions of biomolecules is essential for bridging their structure and function, yet it remains a major computational challenge. Molecular dynamics (MD) simulation provides a detailed depiction of biomolecular motion, but its high-resolution temporal evolution comes at significant computational cost, limiting its applicability to timescales of biological relevance. Deep learning approaches have emerged as promising solutions to overcome these computational limitations by learning to predict long-timescale dynamics. However, generalizable kinetics models for proteins remain largely unexplored, and the fundamental limits of achievable acceleration while preserving dynamical accuracy are poorly understood. In this work, we fill this gap with DeepJump, an Euclidean-Equivariant Flow Matching-based model for predicting protein conformational dynamics across multiple temporal scales. We train DeepJump on trajectories of the diverse proteins of mdCATH, systematically studying our model's performance in generalizing to long-term dynamics of fast-folding proteins and characterizing the trade-off between computational acceleration and prediction accuracy. We demonstrate the application of DeepJump to ab initio folding, showcasing prediction of folding pathways and native states. Our results demonstrate that DeepJump achieves significant $\approx$1000$\times$ computational acceleration while effectively recovering long-timescale dynamics, providing a stepping stone for enabling routine simulation of proteins.
Sequencing of PCR amplicons generated using degenerate primers (typically targeting a region of the 16S ribosomal gene) is widely used in metagenomics to profile the taxonomic composition of complex microbial samples. To reduce taxonomic biases in primer selection, it is important to conduct in silico PCR analyses of the primers against large collections of up to millions of bacterial genomes. However, existing in silico PCR tools have impractical running times for analyses of this scale. In this paper we introduce AmpliconHunter, a highly scalable in silico PCR package distributed as an open-source command-line tool and publicly available through a user-friendly web interface at this https URL. AmpliconHunter implements an accurate nearest-neighbor model for melting temperature calculations, allowing for primer-template hybridization with mismatches, along with three complementary methods for estimating off-target amplification. By taking advantage of the multi-core parallelism and SIMD operations available on modern CPUs, the AmpliconHunter web server can complete in silico PCR analyses of commonly used degenerate primer pairs against the 2.4M genomes in the latest AllTheBacteria collection in as little as 6-7 hours.
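The core matching step of in silico PCR with degenerate primers can be sketched as a naive scan for binding sites with mismatches. This is a toy illustration only: AmpliconHunter's actual implementation additionally scores hybridization with a nearest-neighbor melting-temperature model and exploits SIMD and multi-core parallelism, which this sketch does not attempt:

```python
# IUPAC degenerate base codes: each primer symbol matches a set of template bases.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
         "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}

def mismatches(primer, site):
    """Number of positions where the template base falls outside the
    primer symbol's degeneracy set."""
    return sum(base not in IUPAC[p] for p, base in zip(primer, site))

def find_binding_sites(primer, template, max_mm=2):
    """Naive O(len(template) * len(primer)) scan for primer binding
    sites with at most max_mm mismatches."""
    n = len(primer)
    return [i for i in range(len(template) - n + 1)
            if mismatches(primer, template[i:i + n]) <= max_mm]
```

For example, `find_binding_sites("ACGT", "TTACGTTT", max_mm=0)` returns `[2]`, the single exact-match position; scaling such a scan to millions of genomes is exactly what makes the engineering in tools like AmpliconHunter necessary.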
This work introduces a physics-informed neural networks (PINNs)-based model predictive control (MPC) framework for susceptible-infected-recovered ($SIR$) spreading models. Existing studies in MPC design for epidemic control often assume either 1) measurable states of the dynamics, where the parameters are learned, or 2) known parameters of the model, where the states are learned. In this work, we address the joint real-time estimation of states and parameters within the MPC framework using only noisy infected states, under the assumption that 1) only the recovery rate is known, or 2) only the basic reproduction number is known. Under the first assumption, we propose MPC-PINNs and two novel PINN algorithms, all of which are integrated into the MPC framework. First, we introduce MPC-PINNs, which are designed for $SIR$ models with control. We then propose log-scaled PINNs (MPC-LS-PINNs), which incorporate a log-scaled loss function to improve robustness against noise. Next, we present split-integral PINNs (MPC-SI-PINNs), which leverage integral operators and state coupling in the neural network training process to effectively reconstruct the complete epidemic state information. Building upon these methods, we further extend our framework to the second assumption. We establish the necessary conditions and extend our PINN algorithms, where MPC-SI-PINNs are simplified to split-PINNs (MPC-S-PINNs). By incorporating these algorithms into the MPC framework, we simultaneously estimate the epidemic states and parameters while generating optimal control strategies. Experimental results demonstrate the effectiveness of the proposed methods under different settings.
The ability of virus shells to encapsulate a wide range of functional cargoes, especially multiple cargoes - siRNAs, enzymes, and chromophores - has made them an essential tool in biotechnology for advancing drug delivery applications and developing innovative materials. Here we present a mechanistic study of the processes and pathways that lead to multiple cargo encapsulation in the co-assembly of virus shell proteins with ligand-coated nanoparticles. Based on the structural identification of different intermediates, enabled by the contrast in electron microscopy provided by the metal nanoparticles that play the cargo role, we find that multiple cargo encapsulation occurs by self-assembly via a specific ``assembly line'' pathway that differs from previously described \emph{in vitro} assembly mechanisms of virus-like particles (VLPs). The emerging model explains observations that are potentially important for delivery applications, for instance, the pronounced nanoparticle size selectivity.
Multimodal data provides heterogeneous information for a holistic understanding of the tumor microenvironment. However, existing AI models often struggle to harness the rich information within multimodal data and tend to extract poorly generalizable representations. Here we present MICE (Multimodal data Integration via Collaborative Experts), a multimodal foundation model that effectively integrates pathology images, clinical reports, and genomics data for precise pan-cancer prognosis prediction. Instead of conventional multi-expert modules, MICE employs multiple functionally diverse experts to comprehensively capture both cross-cancer and cancer-specific insights. Leveraging data from 11,799 patients across 30 cancer types, we enhanced MICE's generalizability by coupling contrastive and supervised learning. MICE outperformed both unimodal and state-of-the-art multi-expert-based multimodal models, demonstrating substantial improvements in C-index ranging from 3.8% to 11.2% on internal cohorts and from 5.8% to 8.8% on independent cohorts. Moreover, it exhibited remarkable data efficiency across diverse clinical scenarios. With its enhanced generalizability and data efficiency, MICE establishes an effective and scalable foundation for pan-cancer prognosis prediction, holding strong potential to personalize tailored therapies and improve treatment outcomes.
Despite significant medical advancements, cancer remains the second leading cause of death, with over 600,000 deaths per year in the US. One emerging field, pathway analysis, is promising but still relies on manually derived wet-lab data, which is time-consuming to acquire. This work proposes an efficient, effective end-to-end framework for Artificial Intelligence (AI)-based pathway analysis that predicts both cancer severity and mutation progression, thus recommending possible treatments. The proposed technique involves a novel combination of time-series machine learning models and pathway analysis. First, mutation sequences were isolated from The Cancer Genome Atlas (TCGA) database. Then, a novel preprocessing algorithm was used to filter key mutations by mutation frequency. This data was fed into a Recurrent Neural Network (RNN) that predicted cancer severity. The model then probabilistically used the RNN predictions, information from the preprocessing algorithm, and multiple drug-target databases to predict future mutations and recommend possible treatments. This framework achieved robust results, with Receiver Operating Characteristic (ROC) curves (a key statistical metric) showing accuracies greater than 60%, similar to existing cancer diagnostics. In addition, preprocessing played an instrumental role in isolating important mutations, demonstrating that each cancer stage studied may contain on the order of a few hundred key driver mutations, consistent with current research. Heatmaps based on predicted gene frequency were also generated, highlighting key mutations in each cancer. Overall, this work is the first to propose an efficient, cost-effective end-to-end framework for projecting cancer progression and providing possible treatments without relying on expensive, time-consuming wet-lab work.
We present a probabilistic framework for modeling structured spatiotemporal dynamics from sparse observations, focusing on cardiac motion. Our approach integrates neural ordinary differential equations (NODEs), graph neural networks (GNNs), and neural processes into a unified model that captures uncertainty, temporal continuity, and anatomical structure. We represent dynamic systems as spatiotemporal multiplex graphs and model their latent trajectories using a GNN-parameterized vector field. Given the sparse context observations at node and edge levels, the model infers a distribution over latent initial states and control variables, enabling both interpolation and extrapolation of trajectories. We validate the method on three synthetic dynamical systems (coupled pendulum, Lorenz attractor, and Kuramoto oscillators) and two real-world cardiac imaging datasets - ACDC (N=150) and UK Biobank (N=526) - demonstrating accurate reconstruction, extrapolation, and disease classification capabilities. The model accurately reconstructs trajectories and extrapolates future cardiac cycles from a single observed cycle. It achieves state-of-the-art results on the ACDC classification task (up to 99% accuracy), and detects atrial fibrillation in UK Biobank subjects with competitive performance (up to 67% accuracy). This work introduces a flexible approach for analyzing cardiac motion and offers a foundation for graph-based learning in structured biomedical spatiotemporal time-series data.
This SHREC 2025 track, dedicated to protein surface shape retrieval, involved 9 participating teams. We evaluated the retrieval performance of the 15 proposed methods on a large dataset of 11,555 protein surfaces with calculated electrostatic potential (a key molecular surface descriptor), using several metrics (accuracy, balanced accuracy, F1 score, precision, and recall). The best retrieval performance was achieved by the methods that used the electrostatic potential as a complement to molecular surface shape. This observation also held for classes with limited data, which highlights the importance of taking additional molecular surface descriptors into account.
Dinoflagellates are marine phytoplankton that emit flashes of light in response to flow-induced deformation; they are responsible for illuminating breaking waves, the wakes of ships, and other intensely turbulent spots of the upper ocean. Here, we ask how bioluminescence is affected by the fluctuating nature of turbulence -- a question motivated by the dependence of the emitted flashes on both the extent and the rate of deformation. Introducing a light-emitting dumbbell as a minimal model, we study the Lagrangian dynamics of flashing in a homogeneous isotropic turbulent flow, and contrast it with that in an extensional flow and in a Gaussian random flow. We show that turbulent fluctuations strongly enhance bioluminescence, while introducing a Poisson-like stochasticity in the flashing dynamics. Furthermore, the intermittent fluctuations of the velocity gradient subject the dinoflagellate to bursts of extreme straining and produce bright flashes -- more intense, though less frequent, than what would result from Gaussian fluctuations. Our results suggest that radiant displays of marine bioluminescence are strongly promoted by turbulence and its dissipation-scale intermittency.
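A numerical caricature of the light-emitting dumbbell can illustrate the setup. Everything below is an illustrative assumption rather than the paper's model: the full velocity gradient is replaced by a scalar Ornstein-Uhlenbeck strain rate (a Gaussian surrogate, lacking turbulent intermittency), and a "flash" is counted whenever the extension crosses a threshold from below:

```python
import math
import random

def flash_count(tau_corr=1.0, sigma=1.0, k_relax=0.5, thresh=1.5,
                dt=1e-3, t_max=200.0, seed=1):
    """Toy light-emitting dumbbell driven by a fluctuating strain rate.

    s(t): Ornstein-Uhlenbeck surrogate for the velocity gradient, with
    correlation time tau_corr and standard deviation sigma.
    x(t): dumbbell extension, stretched at rate s * x and relaxing
    elastically toward rest length 1. A flash is registered on each
    upward crossing of `thresh`.
    """
    rng = random.Random(seed)
    x, s = 1.0, 0.0
    flashes, above = 0, False
    for _ in range(int(t_max / dt)):
        # Euler-Maruyama step for the OU strain rate
        s += (-s / tau_corr) * dt \
             + sigma * math.sqrt(2.0 * dt / tau_corr) * rng.gauss(0.0, 1.0)
        # extension: stretching by the strain, relaxation toward x = 1
        x += (s * x - k_relax * (x - 1.0)) * dt
        x = max(x, 0.1)                  # keep the extension positive
        if x > thresh and not above:     # upward threshold crossing -> flash
            flashes += 1
            above = True
        elif x < thresh:
            above = False
    return flashes
```

Replacing the Gaussian strain statistics with a heavy-tailed (intermittent) surrogate and comparing flash counts and intensities would mimic, in miniature, the contrast between turbulent and Gaussian flows studied in the paper.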
Since the earliest proposals for artificial neural network (ANN) models of the mind and brain, critics have pointed out key weaknesses in these models compared to human cognitive abilities. Here we review recent work that uses metalearning to overcome several classic challenges, which we characterize as addressing the Problem of Incentive and Practice -- that is, providing machines with both incentives to improve specific skills and opportunities to practice those skills. This explicit optimization contrasts with more conventional approaches that hope the desired behaviour will emerge through optimizing related but different objectives. We review applications of this principle to addressing four classic challenges for ANNs: systematic generalization, catastrophic forgetting, few-shot learning and multi-step reasoning. We also discuss how large language models incorporate key aspects of this metalearning framework (namely, sequence prediction with feedback trained on diverse data), which helps to explain some of their successes on these classic challenges. Finally, we discuss the prospects for understanding aspects of human development through this framework, and whether natural environments provide the right incentives and practice for learning how to make challenging generalizations.
Groups of cells, including clusters of cancerous cells, multicellular organisms, and developing organs, may both grow and break apart. What physical factors control these fractures? In these processes, what sets the eventual size of clusters? We first develop a one-dimensional framework for understanding cell clusters that can fragment due to cell motility using an active particle model. We compute analytically how the break rate of cell-cell junctions depends on cell speed, cell persistence, and cell-cell junction properties. Next, we find the cluster size distributions, which differ depending on whether all cells can divide or only the cells on the edge of the cluster divide. Cluster size distributions depend solely on the ratio of the break rate to the growth rate - allowing us to predict how cluster size and variability depend on cell motility and cell-cell mechanics. Our results suggest that organisms can achieve better size control when cell division is restricted to the cluster boundaries or when fracture can be localized to the cluster center. Additionally, we derive a universal survival probability for an intact cluster $S(t)=\mathrm{e}^{-k_d t}$ at steady state if all cells can divide, which is independent of the rupture kinetics and depends solely on the cell division rate $k_d$. Finally, we further corroborate the one-dimensional analytics with two-dimensional simulations, finding quantitative agreement with some elements of the theory across a wide range of cell motility. Our results link the general physics problem of a collective active escape over a barrier to size control, providing a quantitative measure of how motility can regulate organ or organism size.
While monitoring biodiversity through camera traps has become an important endeavor for ecological research, identifying species in the captured image data remains a major bottleneck due to limited labeling resources. Active learning -- a machine learning paradigm that selects the most informative data to label and train a predictive model -- offers a promising solution, but typically focuses on uncertainty in the individual predictions without considering uncertainty across the entire dataset. We introduce a new active learning policy, Vendi information gain (VIG), that selects images based on their impact on dataset-wide prediction uncertainty, capturing both informativeness and diversity. We applied VIG to the Snapshot Serengeti dataset and compared it against common active learning methods. VIG needs only 3% of the available data to reach 75% accuracy, a level that baselines require more than 10% of the data to achieve. With 10% of the data, VIG attains 88% predictive accuracy, 12% higher than the best of the baselines. This improvement in performance is consistent across metrics and batch sizes, and we show that VIG also collects more diverse data in the feature space. VIG has broad applicability beyond ecology, and our results highlight its value for biodiversity monitoring in data-limited environments.