MLE-Toolbox is a comprehensive open-source MATLAB toolbox for end-to-end analysis of magnetoencephalography (MEG) and electroencephalography (EEG) data. Inspired by widely used neuroimaging platforms such as Brainstorm and FieldTrip, it integrates the full analysis pipeline within a unified and user-friendly graphical user interface (GUI), covering raw data import, preprocessing, source localization, functional connectivity, oscillatory analysis, and machine learning-based classification. The toolbox provides automated artifact rejection methods, including independent component analysis (ICA), signal-space projection (SSP), and signal-space separation (SSS); multiple source localization approaches, including minimum norm estimation (MNE), dynamic statistical parametric mapping (dSPM), standardized low-resolution brain electromagnetic tomography (sLORETA), and beamforming; multi-atlas parcellation with anatomical visualization; spectral power analysis with frequency-band brain mapping; phase-amplitude coupling (PAC); graph-theoretic brain network analysis; and integrated machine learning and deep learning classifiers. MLE-Toolbox also provides native interoperability with Brainstorm, FieldTrip, EEGLAB, and FreeSurfer, allowing researchers to build on established workflows while benefiting from additional automation, interactive visualization, and one-click academic report generation. Freely available for non-commercial use, MLE-Toolbox is designed to lower the barrier to rigorous, reproducible MEG/EEG research.
Genome engineering has achieved remarkable sequence-level precision, yet predicting the transcriptomic state that a cell will occupy after perturbation remains an open problem. Single-cell CRISPR screens measure how far cells move from their unperturbed state, but this effect magnitude ignores a fundamental question: do the cells move together? Two perturbations with identical magnitude can produce qualitatively different outcomes if one drives cells coherently along a shared trajectory while the other scatters them across expression space. We introduce a geometric stability metric, Shesha, that quantifies the directional coherence of single-cell perturbation responses as the mean cosine similarity between individual cell shift vectors and the mean perturbation direction. Across five CRISPR datasets (2,200+ perturbations spanning CRISPRa, CRISPRi, and pooled screens), stability correlates strongly with effect magnitude (Spearman $\rho = 0.75$--$0.97$), with a calibrated cross-dataset correlation of 0.97. Crucially, discordant cases where the two metrics decouple expose regulatory architecture: pleiotropic master regulators such as CEBPA and GATA1 pay a "geometric tax," producing large but incoherent shifts, while lineage-specific factors such as KLF1 produce tightly coordinated responses. After controlling for magnitude, geometric instability is independently associated with elevated chaperone activation (HSPA5/BiP; $\rho_{\mathrm{partial}} = -0.34$ and $-0.21$ across datasets), and the high-stability/high-stress quadrant is systematically depleted. The magnitude-stability relationship persists in scGPT foundation model embeddings, confirming it is a property of biological state space rather than linear projection. Perturbation stability provides a complementary axis for hit prioritization in screens, phenotypic quality control in cell manufacturing, and evaluation of in silico perturbation predictions.
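As a hedged illustration of the metric as defined above -- implementation details such as normalization and gene selection are assumptions, not taken from the paper -- the stability score reduces to a few lines of linear algebra:

```python
import numpy as np

def perturbation_stability(perturbed, control_centroid):
    """Directional coherence of a perturbation response.

    Each cell's shift vector is its expression profile minus the
    unperturbed centroid; stability is the mean cosine similarity
    between these shifts and the mean shift direction. Preprocessing
    (log-normalization, gene selection) is assumed done upstream.
    """
    shifts = perturbed - control_centroid          # (n_cells, n_genes)
    mean_dir = shifts.mean(axis=0)
    mean_dir /= np.linalg.norm(mean_dir) + 1e-12
    norms = np.linalg.norm(shifts, axis=1) + 1e-12
    cosines = (shifts @ mean_dir) / norms
    return cosines.mean()                          # in [-1, 1]

# toy usage: a coherent shift vs. a scattered one of similar magnitude
rng = np.random.default_rng(0)
ctrl = rng.normal(size=(1, 50))
coherent = ctrl + np.ones(50) + 0.1 * rng.normal(size=(200, 50))
scattered = ctrl + rng.normal(size=(200, 50)) * np.sqrt(50) / 7
print(perturbation_stability(coherent, ctrl[0]))   # near 1
print(perturbation_stability(scattered, ctrl[0]))  # near 0
```

The toy example mirrors the abstract's point: the two perturbations have comparable effect magnitude, but only the coherent one scores highly on stability.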
Designing proteins that satisfy natural language functional requirements is a central goal in protein engineering. A straightforward baseline is to fine-tune generic instruction-tuned LLMs as direct text-to-sequence generators, but this is data- and compute-hungry. With limited supervision, LLMs can produce coherent plans in text yet fail to reliably realize them as sequences. This plan-execute gap motivates ProtoCycle, an agentic framework for protein design that uses LLMs primarily to drive a multi-round, feedback-driven decision cycle. ProtoCycle couples an LLM planner with a lightweight tool environment designed to emulate the iterative workflow of human protein engineering and uses LLM-driven reflection on tool feedback to revise plans. Trained with supervised trajectories and online reinforcement learning, ProtoCycle achieves strong language alignment while maintaining competitive foldability, and ablations show that reflection substantially improves sequence quality.
Fitness landscapes are mappings between genotypes, phenotypes, and fitness that shape evolution. In recent years, empirical work and theoretical models have greatly advanced our understanding of how populations navigate rugged fitness landscapes. Here, we provide a timely review of this field. Its rapidly growing literature employs a wide range of terms, which are sometimes used ambiguously or inconsistently. We therefore begin by defining the major concepts and the field's vocabulary, highlighting our own terminology choices wherever needed. We then review key results on the relationships between epistasis, ruggedness, accessibility, and navigability for genotype-fitness maps, highlighting several complex and sometimes counterintuitive connections that have emerged. Further, we review how the conserved structural properties of the underlying genotype-phenotype map -- which lead to the formation of large connected neutral networks of genotypes -- influence dynamics on fitness landscapes. We then compare the two levels at which landscape navigation can be studied -- the level of genotype-phenotype maps and the level of genotype-fitness maps. Our review leads us to propose a new measure of navigability, based on evolutionary outcomes, that is broadly applicable and overcomes limitations of existing measures. Finally, we review the smaller body of work that relaxes the common assumption of fitness-monotonic paths on static landscapes, and discuss how this can fundamentally change the nature of fitness landscape navigation. Throughout the review, we identify directions for future work to fill existing gaps and to synthesize the disparate strands of research within the field.
Classical causal models, such as Granger causality and structural equation modeling, are largely restricted to acyclic interactions and struggle to represent cyclic and higher-order dynamics in complex networks. We introduce a causal framework grounded in a variational principle, interpreting causality as directional energy flow from high- to low-energy states along network connections. Using Hodge theory, network flows are decomposed into dissipative components and a persistent harmonic component that captures stable cyclic interactions. Applied to resting-state fMRI connectivity, our variational framework reveals robust cyclic causal patterns that are not detected by conventional causal models, highlighting the value of variational principles for causality.
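For readers unfamiliar with the decomposition invoked here, the sketch below computes a discrete Hodge decomposition of an edge flow on a toy simplicial complex; the complex and flow values are illustrative assumptions, and the paper's fMRI construction will differ. The harmonic remainder is the persistent cyclic component:

```python
import numpy as np

# Hodge decomposition of an edge flow f into gradient, curl, and
# harmonic parts: f = B1.T @ phi + B2 @ psi + h. B1 is the node-edge
# incidence matrix, B2 the edge-triangle incidence matrix. Toy complex:
# a square 0-1-2-3 with diagonal 0-2; only triangle 0-1-2 is filled.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
tris = [(0, 1, 2)]

n, m = 4, len(edges)
B1 = np.zeros((n, m))
for j, (u, v) in enumerate(edges):
    B1[u, j], B1[v, j] = -1.0, 1.0

B2 = np.zeros((m, len(tris)))
for k, (a, b, c) in enumerate(tris):
    for (u, v), sign in [((a, b), 1), ((b, c), 1), ((a, c), -1)]:
        B2[edges.index((u, v)), k] = sign

f = np.array([2.0, 1.0, -1.0, 0.5, 0.3])        # an arbitrary edge flow

phi, *_ = np.linalg.lstsq(B1.T, f, rcond=None)  # project onto gradients
grad = B1.T @ phi
psi, *_ = np.linalg.lstsq(B2, f - grad, rcond=None)
curl = B2 @ psi
harmonic = f - grad - curl                      # persistent cyclic part
print(grad, curl, harmonic)
print(np.allclose(f, grad + curl + harmonic))   # exact decomposition
```

Here the unfilled cycle 0-2-3-0 supports the harmonic component, which is the analogue of the stable cyclic interactions the framework extracts from connectivity data.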
Cortical folding reflects coordinated neurodevelopmental processes and provides a sensitive marker of neurological disease. In juvenile myoclonic epilepsy (JME), structural abnormalities are subtle and spatially distributed, limiting the sensitivity of conventional morphometric measures such as cortical thickness. We introduce a Poisson flow model derived from gradients of the mean curvature field on the cortical surface. The method yields a smooth scalar field obtained from a Poisson equation, whose surface gradient defines a flow representation of folding organization. This representation enables spatially coherent characterization of sulcal--gyral patterns and provides a principled geometric framework for studying distributed cortical alterations in JME.
Background: Three-dimensional dual-energy X-ray absorptiometry reconstructs three-dimensional maps of the proximal femur's density distribution from standard hip scans, enabling the estimation of trabecular and cortical bone parameters. The aim of this study was to assess the agreement of these three-dimensional cortical and trabecular femur parameters across different series and models of Hologic densitometers. Methodology: The study cohort was composed of 103 women and men recruited from four clinical centers in Spain and France. Subjects underwent duplicate hip scans on different Hologic scanners from the Horizon, Discovery, and QDR4500 series. Analyses were performed using 3D-Shaper software. Inter-scanner agreement was evaluated using Deming regression and Bland-Altman analysis. Results: The parameters demonstrated strong inter-device agreement across all clinical centers and scanner models, with coefficients of determination greater than 0.91. Absolute biases were less than 2.5 mg$/$cm$^3$ for integral volumetric bone mineral density, less than 2.9 mg$/$cm$^3$ for trabecular volumetric bone mineral density, and less than 1.7 mg$/$cm$^2$ for cortical surface bone mineral density. No statistically significant bias was found between parameters obtained from different scanners. Furthermore, the observed bias was lower than the expected least significant change, indicating that inter-scanner variability across these devices is not clinically significant. Conclusions: This study demonstrated excellent agreement for standard and three-dimensional derived bone parameters at the hip across Hologic densitometers. These findings support their suitability for clinical use.
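The Bland-Altman computation itself is compact; a minimal sketch with hypothetical paired values (not study data):

```python
import numpy as np

def bland_altman(a, b):
    """Bland-Altman agreement statistics for paired measurements.

    `a` and `b` are the same parameter (e.g., trabecular vBMD in
    mg/cm^3) measured on two scanners for the same subjects. Returns
    the mean bias and the 95% limits of agreement.
    """
    d = np.asarray(a) - np.asarray(b)
    bias = d.mean()
    loa = 1.96 * d.std(ddof=1)
    return bias, (bias - loa, bias + loa)

# hypothetical paired vBMD values from two scanners
scanner_a = np.array([310.2, 295.7, 330.1, 288.4, 305.9])
scanner_b = np.array([308.9, 297.0, 329.5, 290.1, 304.8])
print(bland_altman(scanner_a, scanner_b))
```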
The coordination of the immune system and its components is essential for the body to maintain a healthy status. Recent clinical studies show that breast cancer patients with high Dendritic cell clustering in tumour draining lymph nodes have improved survival outcomes, compared to those with a lower degree of clustering. These results suggest that a specific form of Dendritic cell clustering promotes T cell activation. However, the mechanistic effects of this spatial organisation are unclear. We develop a spatially dynamic model of T cells interacting with Dendritic cells within the lymph node. We present a novel probabilistic agent-based model (ABM) of T cells, and use it to derive the deterministic, phenotypically structured partial differential equation (PS-PDE) of T cell activation and motion. Using the PS-PDE, we derive analytic approximations of the expected T cell stimulation distribution, based on the topology and level of clustering of a given Dendritic cell population. Our analytic approximation enables us to identify T cell characteristics that benefit most from Dendritic cell clustering, to result in an enhanced stimulation distribution. We also perform a sensitivity analysis with our models to identify T cell characteristics that result in desirable T cell activation characteristics, such as rapid T cell activation, and robust heterogeneous T cell activation. Our key findings show that T cells with an intermediate level of stimulation uptake benefit most from higher levels of Dendritic cell clustering, activating with comparable or greater abundance and greater heterogeneity than otherwise similar T cells exposed to lower levels of Dendritic cell clustering.
The parameters of many classes of birth-death processes cannot be inferred uniquely from phylogenetic trees: infinitely many parameter combinations yield the same distribution of phylogenetic trees. Here, we show that parameter identifiability can be recovered even for the most general cases of time-dependent rates when additional information on hidden birth events along branches of the reconstructed tree is available. This holds both for models in which individuals are sampled at a single point in time and for models in which they are sampled through time at a time-dependent rate. Moreover, we prove that when mutations occur at birth -- assuming two different models for the accumulation of mutations at a birth event -- then information about hidden birth events is available in the sequences, and thus all parameters of time-dependent birth-death models become identifiable. Thus, phylodynamic inference is identifiable whenever evolutionary models with mutation accumulation at birth (such as at speciation, transmission, or cell division) are plausible.
Recent studies reveal striking representational alignment between artificial neural networks (ANNs) and biological brains, leading to proposals that all sufficiently capable systems converge on universal representations of reality. Here, we argue that this claim of Universality is premature. We introduce the Umwelt Representation Hypothesis (URH), proposing that alignment arises not from convergence toward a single global optimum, but from overlap in ecological constraints under which systems develop. We review empirical evidence showing that representational differences between species, individuals, and ANNs are systematic and adaptive, which is difficult to reconcile with Universality. Finally, we reframe ANN model comparison as a method for mapping clusters of alignment in ecological constraint space rather than searching for a single optimal world model.
The inverse Potts problem -- estimating the evolutionary single-site fields and pairwise couplings of homologous protein sequences from the single-site and pairwise amino acid frequencies observed in their multiple sequence alignment -- remains a useful method in studies of protein structure and evolution. Since the reproducibility of the fields and couplings is most important, the Boltzmann machine method is employed here, even though it is computationally intensive. To reduce the computational time required for the Boltzmann machine, a parallel, persistent Markov chain Monte Carlo method is employed to estimate the single-site and pairwise marginal distributions in each learning step. In addition, stochastic gradient descent methods are used to reduce the computational time of each learning step. Another problem is how to adjust the values of the hyperparameters; there are two regularization parameters, one for the evolutionary fields and one for the couplings. The precision of contact residue pair prediction is often used to adjust these hyperparameters, but it is not sensitive to the regularization parameters. Here, they are instead adjusted so that the fields and couplings satisfy a specific condition appropriate for protein conformations. The method has been applied to eight protein families.
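A minimal sketch of one Boltzmann-machine learning step, assuming fields of shape (L, q) and couplings of shape (L, L, q, q) for sequence length L and q = 21 states, with the model marginals estimated by the persistent MCMC sampler described above; the array layout, learning rate, and L2 penalty form are assumptions:

```python
import numpy as np

def bm_update(h, J, f1_emp, f2_emp, f1_model, f2_model,
              lr=0.05, lam_h=0.01, lam_J=0.01):
    """One gradient step for the inverse Potts problem.

    The log-likelihood gradient for the fields is the gap between
    empirical and model single-site marginals (f1_emp - f1_model);
    analogously for couplings with pairwise marginals. lam_h and
    lam_J are the two regularization hyperparameters discussed above.
    """
    h_new = h + lr * (f1_emp - f1_model - lam_h * h)
    J_new = J + lr * (f2_emp - f2_model - lam_J * J)
    return h_new, J_new
```

In a persistent scheme, f1_model and f2_model are re-estimated after each update by continuing the same Markov chains rather than re-equilibrating from scratch, which is where most of the speedup comes from.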
Fibrous networks are ubiquitous structural components in biology, spanning cellulose in plant cell walls, fibrin in blood clots, and collagen in the extracellular matrix of animal tissues. Theoretical models predict that network connectivity critically influences their mechanical behavior. However, accurately reconstructing network topology from 3D image data remains a major challenge as current segmentation methods are not designed to preserve network topology and often rely on intensity-based thresholding, which can fragment fibers and distort junction connectivity. Here, we introduce ToFiE, an open-source topology-aware fiber extraction workflow for reconstructing dense and heterogeneous fibrous networks from high-resolution microscopy images while preserving connectivity in three dimensions. We validate ToFiE using synthetic fluorescence microscopy images of fiber networks with varying topologies and signal-to-noise ratios. We further demonstrate its performance by reconstructing the fiber networks of a library of collagen gels with various microstructures, imaged using confocal fluorescence microscopy. Altogether, the results establish ToFiE as a practical semi-automated framework for extracting mechanically relevant network information from imaging data across a broad range of fibrous materials.
The most common cause of dementia is Alzheimer disease, a progressive neurodegenerative disorder affecting older adults that gradually impairs memory, cognition, and behavior. It is characterized by the accumulation of abnormal proteins in the brain, including amyloid-beta plaques and neurofibrillary tangles of tau protein, which disrupt neuronal communication and lead to neuronal death. Early manifestations typically include mild memory impairment and reduced ability to acquire new information. As the disease progresses, patients experience severe cognitive decline, loss of independence, and significant personality and behavioral changes. Although the exact etiology of Alzheimer disease remains unclear, factors such as age, genetic predisposition, lifestyle, and cardiovascular health contribute to its development. While no definitive cure exists, early diagnosis, pharmacological interventions, and supportive care can slow progression and improve quality of life. This study presents a predictive cheminformatics-based model for identifying natural medicinal compounds with potential therapeutic efficacy against Alzheimer disease. The model functions as a drug screening system utilizing molecular descriptors and machine learning to detect anti-Alzheimer activity. More than 7,000 compounds from ChEBI, SynSysNet, and INDOFINE were preprocessed using Open Babel and analyzed with Dragon descriptors. A Random Forest classifier trained on approved treatments achieved moderate performance, with precision of 0.5970 and recall of 0.6590, identifying 73 candidate compounds. Key descriptors included atomic polarizability, bond multiplicity, and non-hydrogen bond counts. These findings demonstrate the value of cheminformatics in early-stage drug discovery for Alzheimer disease.
Antibiotic resistance is a major threat to global health. It emerges in multispecies microbial communities under antibiotic exposure. This makes antibiotic spectrum -- a drug's distribution of effects across species -- a potential key parameter in resistance management. However, we currently lack evolutionary theory for resistance dynamics in a multispecies setting. Analysing established community ecology theory, we develop a simple mathematical measure for how one taxon (strain or species) affects another taxon through all direct and indirect interactions in a complex interaction network. Using this, we derive the expected effects of different antibiotic spectra on the abundance of resistant taxa in microbial communities. This furthers our understanding of microbial evolutionary ecology in multispecies communities, and provides a formal theoretical basis for empirical work on optimal antibiotic choice.
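The abstract does not spell out the measure, but a classical analogue from press-perturbation theory illustrates the idea of aggregating all direct and indirect effects through an interaction network; the 3-taxon matrix below is illustrative, not taken from the paper:

```python
import numpy as np

# Press-perturbation sketch of net effects in an interaction network.
# For a generalized Lotka-Volterra community dx_i/dt = x_i (r_i + (A x)_i),
# the interior equilibrium satisfies x* = -A^{-1} r, so the sensitivity
# of equilibrium abundances to a sustained change in growth rates
# (e.g., an antibiotic pressing on some taxa) is dx*/dr = -A^{-1}:
# entry (i, j) sums every direct and indirect pathway from j to i.
A = np.array([[-1.0,  0.3,  0.0],
              [-0.4, -1.0,  0.5],
              [ 0.0, -0.2, -1.0]])

net = -np.linalg.inv(A)
print(net)   # net[i, j]: net effect of taxon j on taxon i
# taxon 0 has no direct effect on taxon 2 (A[2, 0] == 0), yet
# net[2, 0] is nonzero because effects propagate via taxon 1.
```

A narrow-spectrum drug then corresponds to a press on few rows of $r$, a broad-spectrum drug to a press on many, and the net effect on resistant taxa follows from the same inverse.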
Models from the AlphaFold (AF) family reliably predict one dominant conformation for most well-ordered proteins but struggle to capture biologically relevant alternate states. Several efforts have focused on eliciting greater conformational variability through ad hoc inference-time perturbations of AF models or their inputs. Despite their progress, these approaches remain inefficient and fail to consistently recover major conformational modes. Here, we investigate both the optimal location and manner-of-operation for perturbing latent representations in the AF3 architecture. We distill our findings in ConforNets: channel-wise affine transforms of the pre-Pairformer pair latents. Unlike previous methods, ConforNets globally modulate AF3 representations, making them reusable across proteins. On unsupervised generation of alternate states, ConforNets achieve state-of-the-art success rates on all existing multi-state benchmarks. On the novel supervised task of conformational transfer, ConforNets trained on one source protein can induce a conserved conformational change across a protein family. Collectively, these results introduce a mechanism for conformational control in AF3-based models.
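A hedged sketch of the core ConforNet operation as described -- a learned per-channel scale and shift on the pair representation; the module name, identity initialization, and exact placement are assumptions based on the abstract:

```python
import torch
import torch.nn as nn

class ChannelAffine(nn.Module):
    """Channel-wise affine transform of pair latents (a sketch).

    Maps z[..., c] -> gamma[c] * z[..., c] + beta[c] for a pair
    representation of shape (L, L, C). Initialized to the identity
    so that gamma=1, beta=0 recovers unperturbed model behavior.
    """
    def __init__(self, n_channels: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(n_channels))
        self.beta = nn.Parameter(torch.zeros(n_channels))

    def forward(self, pair: torch.Tensor) -> torch.Tensor:
        return pair * self.gamma + self.beta

# usage: modulate an (L, L, C) pair latent before the Pairformer stack
pair = torch.randn(64, 64, 128)
out = ChannelAffine(128)(pair)
print(out.shape)  # torch.Size([64, 64, 128])
```

Because the transform acts only on channels, the same learned parameters apply to pair tensors of any sequence length, which is what makes reuse across proteins -- and the conformational-transfer experiments -- possible.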
When a system commits to a hypothesis, much of the evidential structure behind that commitment is lost to compression. Standard accounts assume that selected content and scalar confidence suffice for downstream control. This paper argues that they do not, and that determining what must survive compression is itself a consequence-sensitive problem. We develop a recurrent arbitration architecture in which active constraint fields jointly determine a hypothesis geometry over candidates. Rather than carrying that geometry forward in full, the system compresses it into a support-aware control state whose resolution is regulated by current consequence geometry, arbitration memory, and resource constraints. A bounded objective formalizes the tradeoff. Too little retained support collapses policy-relevant distinctions, producing controllers that select content adequately while misrouting verification, abstention, and recovery. Too much retained support fragments learning across overly fine contexts, degrading adaptation even as discrimination improves. These failure modes yield ordered controller predictions confirmed by a minimal repeated-interaction simulation. Adaptive controllers that regulate support resolution outperform all fixed-resolution controllers in cumulative utility. Agile adaptive control outperforms sluggish adaptive control. Fixed high-resolution control achieves the best commitment accuracy but still trails adaptive controllers because resource cost and learning fragmentation offset the gains from richer retention. Support sufficiency should be understood not as a static representational threshold, but as a dynamic compression criterion. Robust arbitration depends on preserving the smallest support structure adequate for policy under the current consequence landscape, and on regulating that structure as conditions change across repeated cycles of inference and action.
Intersectional biases in healthcare data can produce compound disparities in clinical machine learning models, yet most fairness evaluations assess demographic attributes independently. FairLogue, a toolkit for intersectional fairness auditing, was applied across multiple clinical prediction tasks to evaluate disparities across combined demographic groups. Using the All of Us dataset, two published models were selected for replication and evaluation: (A) prediction of selective serotonin reuptake inhibitor associated bleeding events and (B) two-year stroke risk in patients with atrial fibrillation. Observational fairness metrics were computed across race, gender, and intersectional subgroups, followed by counterfactual analysis to evaluate whether disparities were attributable to group membership. Intersectional evaluation revealed larger disparities than single-axis analyses; however, counterfactual diagnostics indicated that most observed disparities were comparable to those expected under randomized group membership. These results highlight the importance of intersectional fairness auditing and demonstrate how FairLogue provides deeper insight into bias in clinical machine learning systems.
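A minimal sketch of the observational subgroup audit described above; the column names and metric choices are assumptions for illustration, not FairLogue's actual API:

```python
import pandas as pd

def subgroup_rates(df, group_cols, label="y_true", pred="y_pred"):
    """Observational fairness metrics per (intersectional) subgroup.

    Computes TPR, FPR, and positive prediction rate for every
    combination of the grouping columns.
    """
    def rates(g):
        tp = ((g[label] == 1) & (g[pred] == 1)).sum()
        fn = ((g[label] == 1) & (g[pred] == 0)).sum()
        fp = ((g[label] == 0) & (g[pred] == 1)).sum()
        tn = ((g[label] == 0) & (g[pred] == 0)).sum()
        return pd.Series({
            "n": len(g),
            "tpr": tp / max(tp + fn, 1),
            "fpr": fp / max(fp + tn, 1),
            "ppr": (tp + fp) / len(g),
        })
    return df.groupby(group_cols).apply(rates)

# single-axis audits can mask compound disparities that the
# intersectional audit exposes:
# subgroup_rates(df, ["race"]); subgroup_rates(df, ["gender"])
# subgroup_rates(df, ["race", "gender"])
```

Comparing the spread of per-subgroup rates under single-axis versus intersectional grouping is exactly the contrast reported above; the counterfactual step then asks whether the intersectional gaps exceed those expected under randomized group membership.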
Molecular property prediction integrates quantum chemistry, cheminformatics, and deep learning to connect molecular structure with physicochemical and biological behavior. This survey traces four complementary paradigms -- quantum methods, descriptor-based machine learning, geometric deep learning, and foundation models -- and outlines a unified taxonomy linking molecular representations, model architectures, and interdisciplinary applications. Benchmark analyses integrate evidence from both widely used datasets and datasets reflecting industry perspectives, encompassing quantum, physicochemical, physiological, and biophysical domains. The survey examines current standards in data curation, splitting strategies, and evaluation protocols, highlighting challenges including inconsistent stereochemistry, heterogeneous assay sources, and reproducibility limitations under random or poorly defined splits. These observations motivate the modernization of benchmark design toward more transparent, time- and scaffold-aware methodologies. We further propose three forward-looking directions: (i) physics-aware learning embedding quantum consistency, (ii) uncertainty-calibrated foundation models for trustworthy inference, and (iii) realistic multimodal benchmark ecosystems integrating computational and experimental data. Repository: this https URL.
In this work, we present FRIGID, a framework with a novel diffusion language model that generates molecular structures conditioned on mass spectra via intermediate fingerprint representations and determined chemical formulae, trained at the scale of hundreds of millions of unlabeled structures. We then demonstrate how forward fragmentation models enable inference-time scaling by identifying spectrum-inconsistent fragments and refining them through targeted remasking and denoising. While FRIGID already achieves strong performance with its diffusion base, inference-time scaling significantly improves its accuracy, surpassing 18% Top-1 accuracy on the challenging MassSpecGym benchmark and tripling the Top-1 accuracy of the leading methods on NPLIB1. Further empirical analyses show that FRIGID exhibits log-linear performance scaling with increasing inference-time compute, opening a promising new direction for continued improvements in de novo structural elucidation. FRIGID code is publicly available at this https URL
Linear-threshold networks (LTNs) capture the mesoscale behavior of interacting populations of neurons and are of particular interest to control theorists due to their dynamical richness and relative ease of analysis. The aim of this paper is to advance the study of global asymptotic stability in LTNs with asymmetric neural interactions and heterogeneous dissipation under the structural Lyapunov diagonal stability (LDS) condition. To this end, we introduce a one-parameter family of LTNs that preserves the LDS condition and has a parameter-independent equilibrium set. In the fast limit, this family converges to a projected dynamical system (PDS), while in the slow limit, it converges to a discontinuous hard-selector system (HSS). Under LDS, we prove that the fast PDS limit is globally exponentially stable and that the HSS limit is globally asymptotically stable. This alignment suggests that the limiting systems capture essential mechanisms governing stability across the entire LTN family. Together with numerical evidence, these findings indicate that resolving stability at the fast and slow endpoints provides a promising and structurally grounded path toward establishing global stability for LTNs with biologically plausible recurrence and diagonal dissipation.
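For concreteness, the class in question is commonly written (a standard form in the LTN literature, sketched here rather than quoted from the paper) as $\tau \dot{x} = -Dx + [Wx + u]_+$, where $x \in \mathbb{R}^n_{\geq 0}$ collects firing rates, $W$ is the (possibly asymmetric) synaptic matrix, $D \succ 0$ is a diagonal matrix of heterogeneous dissipation rates, $u$ is an external input, and $[\cdot]_+ = \max(0, \cdot)$ acts entrywise. The LDS condition then asks for a diagonal $P \succ 0$ such that $(-D + W)^\top P + P(-D + W) \prec 0$, which is the structural hypothesis preserved along the one-parameter family interpolating between the PDS and HSS limits.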
This dissertation explores how deep generative models can advance the analysis of challenging biological problems by integrating domain knowledge with deep learning. It focuses on two areas: DNA reaction kinetics and cryogenic electron microscopy (cryo-EM). In the first part, we present ViDa, a biophysics-informed framework leveraging variational autoencoders (VAEs) and geometric scattering transforms to generate biophysically plausible embeddings of DNA reaction kinetics simulations. These embeddings are reduced to a two-dimensional space to visualize DNA hybridization and toehold-mediated strand displacement reactions. ViDa preserves structure and clusters trajectory ensembles into reaction pathways, making simulation results more interpretable and revealing new mechanistic insights. In the second part, we address key challenges in cryo-EM density map interpretation and protein structure modeling. We provide a comprehensive review and benchmarking of deep learning methods for atomic model building, with improved evaluation metrics and practical guidance. We then present Struc2mapGAN, a generative adversarial network that synthesizes high-fidelity experimental-like cryo-EM density maps from protein structures. Finally, we present CryoSAMU, a structure-aware multimodal U-Net that enhances intermediate-resolution cryo-EM maps by integrating density features with structural embeddings from protein language models via cross-attention. Overall, these contributions demonstrate the potential of deep generative models to interpret DNA reaction mechanisms and advance cryo-EM density map analysis and protein structure modeling.
A central question in computational neuroscience is whether the learning rule used to train a neural network determines how well its internal representations align with those of the human visual cortex. We present a systematic comparison of four learning rules -- backpropagation (BP), feedback alignment (FA), predictive coding (PC), and spike-timing-dependent plasticity (STDP) -- applied to identical convolutional architectures and evaluated against human fMRI data from the THINGS-fMRI dataset (720 stimuli, 3 subjects) using Representational Similarity Analysis (RSA). Crucially, we include an untrained random-weights baseline that reveals the dominant role of architecture. We find that early visual alignment (V1/V2) is primarily architecture-driven: an untrained CNN achieves $\rho = 0.071$, statistically indistinguishable from BP ($\rho = 0.072$, $p = 0.43$). Learning rules only differentiate at higher visual areas: BP dominates at LOC/IT, and PC with local Hebbian updates achieves IT alignment statistically indistinguishable from BP ($p = 0.18$). FA consistently impairs representations below the random baseline at V1. Partial RSA confirms all effects survive pixel-similarity control. These results demonstrate that the relationship between learning rules and cortical alignment is region-specific: architecture determines early alignment, while supervised objectives drive late alignment.
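RSA as used here is straightforward to reproduce in outline; the correlation-distance convention below is a common choice and an assumption, not necessarily the paper's exact setting:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_score(model_acts, brain_acts):
    """Representational Similarity Analysis (a minimal sketch).

    Builds a representational dissimilarity matrix (RDM) for each
    system from its stimulus-by-feature activity matrix, then
    correlates the RDMs' condensed upper triangles with Spearman's
    rho -- the statistic reported above.
    """
    rdm_model = pdist(model_acts, metric="correlation")
    rdm_brain = pdist(brain_acts, metric="correlation")
    rho, p = spearmanr(rdm_model, rdm_brain)
    return rho, p

# toy usage with 720 stimuli and arbitrary feature dimensions
rng = np.random.default_rng(1)
rho, p = rsa_score(rng.normal(size=(720, 512)),
                   rng.normal(size=(720, 100)))
print(rho, p)
```

Partial RSA, used above as a control, additionally regresses out a pixel-similarity RDM before correlating.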
We introduce RosettaSearch, an inference-time multi-objective optimization approach for protein sequence optimization. We use large language models (LLMs) as a generative optimizer within a search algorithm capable of controlled exploration and exploitation, using rewards computed from RosettaFold3, a structure prediction model. In a large-scale evaluation, we apply RosettaSearch to 400 suboptimal sequences generated by LigandMPNN (a state-of-the-art model trained for protein sequence design), recovering high-fidelity designs that LigandMPNN's single-pass decoding fails to produce. RosettaSearch's designs show improvements in structural fidelity metrics ranging from 18\% to 68\%, translating to a 2.5$\times$ improvement in design success rate. We observe that these gains in success rate are robust when RosettaSearch-designed sequences are evaluated with an independent structure prediction oracle (Chai-1) and generalize across two distinct LLM families (o4-mini and Gemini-3), with performance scaling consistently with reasoning capability. We further demonstrate that RosettaSearch improves sequence fidelity for ProteinMPNN-designed sequences on \textit{de novo} backbones from the Dayhoff atlas, showing that the approach generalizes beyond native protein structures to computationally generated backbones. We also demonstrate a multi-modal extension of RosettaSearch with vision-language models, where images of predicted protein structures are used as feedback to incorporate structural context to guide protein sequence generation. The sequence trajectories generated by our approach can be used as training data in sequence design models or in post-training and will be released along with the code and datasets upon publication.
How much data is enough to make a scientific discovery? As biomedical datasets scale to millions of samples and AI models grow in capacity, progress increasingly depends on predicting when additional data will substantially improve performance. In practice, model development often relies on empirical scaling curves measured across architectures, modalities, and dataset sizes, with limited theoretical guidance on when performance should improve, saturate, or exhibit cross-over behavior. We propose a scaling-law framework for cross-modal discoverability based on spectral structure of data covariance operators, task-aligned signal projections, and learned representations. Many performance metrics, including AUC, can be expressed in terms of cumulative signal-to-noise energy accumulated across identifiable spectral modes of an encoder and cross-modal operator. Under mild assumptions, this accumulation follows a zeta-like scaling law governed by power-law decay of covariance spectra and aligned signal energy, leading naturally to the appearance of the Riemann zeta function. Representation learning methods such as sparse models, low-rank embeddings, and multimodal contrastive objectives improve sample efficiency by concentrating useful signal into earlier stable modes, effectively steepening spectral decay and shifting scaling curves. The framework predicts cross-over regimes in which simpler models perform best at small sample sizes, while higher-capacity or multimodal encoders outperform them once sufficient data stabilizes additional degrees of freedom. Applications include multimodal disease classification, imaging genetics, functional MRI, and topological data analysis. The resulting zeta law provides a principled way to anticipate when scaling data, improving representations, or adding modalities is most likely to accelerate discovery.
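One way to make the claimed law concrete (a sketch under the stated assumptions, not the paper's exact derivation): if the task-aligned signal energy carried by spectral mode $k$ decays as $k^{-s}$ with $s > 1$, and $n$ samples suffice to stabilize the first $K(n)$ modes of the encoder and cross-modal operator, then the accumulated signal-to-noise energy behaves as a partial zeta sum, $E(n) \propto \sum_{k=1}^{K(n)} k^{-s} = \zeta(s) - \sum_{k > K(n)} k^{-s} \approx \zeta(s) - K(n)^{1-s}/(s-1)$. Performance therefore rises toward a $\zeta(s)$-determined ceiling as $K(n)$ grows, and representations that steepen the effective decay (larger $s$) or stabilize more modes at a given $n$ shift the curve upward -- exactly the cross-over behavior between simple and high-capacity encoders described above.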
Molecule generation requires satisfying multiple chemical and biological constraints while searching a large and structured chemical space. This makes it a non-binary problem: effective models must identify non-obvious solutions under constraints while maintaining enough exploration to escape local optima and improve success rates. From this perspective, creativity is a functional requirement in molecular generation rather than an aesthetic notion. Large language models (LLMs) can generate molecular representations directly from natural language prompts, but it remains unclear what type of creativity they exhibit in this setting and how it should be evaluated. In this work, we study the creative behavior of LLMs in molecular generation through a systematic empirical evaluation across physicochemical, ADMET, and biological activity tasks. We characterize creativity along two complementary dimensions, convergent creativity and divergent creativity, and analyze how different factors shape these behaviors. Our results indicate that LLMs exhibit distinct patterns of creative behavior in molecule generation, such as an increase in constraint satisfaction when additional constraints are imposed. Overall, our work is the first to reframe the abilities required for molecule generation as creativity, providing a systematic understanding of creativity in LLM-based molecular generation and clarifying the appropriate use of LLMs in molecular discovery pipelines.
Gene expression in cells is stochastic, yet differentiation is robust. We propose a mechanism in which frustrated genes with weakly stable intermediate expression undergo noise-driven switching between basins of attraction, followed by irreversible fate fixation through slow epigenetic feedback. Regulatory interactions amplify effective noise and promote differentiation. We derive an analytic expression for the logarithmic dependence of differentiation time on noise strength, show input-dependent cell-fate selection, and demonstrate homeorhesis, the dynamical robustness of the epigenetic landscape.
Treatment of cancer involves heterogeneous, complex care pathways. The relationship between these longitudinal trajectories, baseline mental health, and prognostic outcomes remains poorly understood. We introduce an interpretable temporal analysis framework leveraging these temporal dynamics, analyzing care patterns spanning up to 37 years for >8,000 patients. Using Dynamic Time Warping (DTW) and Hierarchical Clustering on sequence data of healthcare encounters, we identified nine distinct, robust trajectory phenotypes. We evaluated their prognostic utility by incorporating them into generalized linear models alongside conventional clinical, demographic, and socioeconomic covariates. The trajectory clusters significantly enhanced mortality prediction and maintained independent predictive significance. Compared to a low-utilization reference group (mortality 31.5%), all eight remaining clusters exhibited substantially higher mortality odds. We uncovered two primary high-risk trajectory patterns: long-term, complex care pathways reflecting chronic disease courses (up to 196 events; mortality OR up to 3.38, 95% CI 2.13-5.37), and shorter but intense trajectories indicating rapid progression (median 78 events; OR 2.32, 95% CI 1.82-2.97). Unexpectedly, the high-utilization complexity clusters were associated with significantly lower baseline anxiety scores, highlighting a divergent relationship between trajectory intensity, mortality risk, and initial psychological burden. These results demonstrate that incorporating temporal healthcare utilization data uncovers robust trajectory phenotypes capturing multidimensional prognostic information. This offers significant explanatory power beyond established static variables for refining risk stratification in precision oncology.
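A minimal sketch of the trajectory-clustering step, on toy sequences; DTW is implemented directly to keep the example self-contained, and the linkage choice is an assumption:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def dtw(a, b):
    """Classic dynamic time warping distance between two 1-D series;
    O(len(a)*len(b)), no windowing constraint, for illustration only."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[-1, -1]

# toy encounter-count series of unequal length (the real trajectories
# span up to 37 years across thousands of patients)
rng = np.random.default_rng(2)
trajs = [rng.poisson(5, size=rng.integers(20, 60)).astype(float)
         for _ in range(30)]

n = len(trajs)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = dtw(trajs[i], trajs[j])

Z = linkage(squareform(dist), method="average")   # hierarchical clustering
labels = fcluster(Z, t=9, criterion="maxclust")   # nine phenotypes
print(np.bincount(labels)[1:])
```

DTW is the natural distance here because patients' care sequences differ in length and pacing; warping aligns similar utilization shapes regardless of calendar duration.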
Alzheimer's disease is the most common neurodegenerative disorder. Its pathological development is connected with the misfolding and accumulation of two toxic proteins: amyloid-beta and tau. Mathematical models provide a valuable quantitative tool for monitoring disease progression. In this work, we propose and compare a novel framework in which the spatio-temporal dynamics of amyloid-beta and tau are modeled either on three-dimensional patient-specific geometries or through reduced network-based models defined on the brain connectome. More specifically, a high-fidelity biophysical model is proposed on three-dimensional brain geometries reconstructed from magnetic resonance imaging, whereas a network-based reduced formulation is defined on the brain connectome. For both approaches, a suitable numerical discretisation is proposed. A sensitivity analysis is presented to quantify the influence of model parameters on protein concentration patterns, as well as to compare the quality of the predictions. For both approaches, the results are validated against PET-SUVR clinical data using 18FAZD4694 for amyloid-beta and 18FMK6240 for tau protein. The results indicate that the three-dimensional model provides the most accurate and biologically consistent description of the disease progression, but remains computationally demanding. On the other hand, the reduced graph-based model is cheaper, but it is not always able to achieve reliable results.
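A hedged sketch of a typical network-based reduction -- Fisher-KPP dynamics on the connectome graph, a common choice for prion-like protein spreading; the paper's exact reduced formulation and parameter values may differ:

```python
import numpy as np

def simulate_network_fkpp(W, c0, rho=0.05, alpha=0.5, dt=0.01, steps=5000):
    """Fisher-KPP dynamics on a brain connectome (illustrative sketch).

    dc/dt = -rho * L c + alpha * c * (1 - c), where L is the graph
    Laplacian of the connectome weights W and c holds the nodal
    misfolded-protein concentrations. Explicit Euler stepping;
    parameters are placeholders, not fitted to PET-SUVR data.
    """
    L = np.diag(W.sum(axis=1)) - W
    c = c0.copy()
    for _ in range(steps):
        c = c + dt * (-rho * (L @ c) + alpha * c * (1.0 - c))
    return c

# toy 5-node connectome, seeded in node 0 (e.g., an entorhinal seed)
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 1, 1],
              [0, 1, 1, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
c0 = np.array([0.1, 0.0, 0.0, 0.0, 0.0])
print(simulate_network_fkpp(W, c0))
```

The three-dimensional model replaces the graph Laplacian with a continuum diffusion operator on the reconstructed brain geometry, which is where the accuracy-versus-cost trade-off reported above arises.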
Epidemic forecasting has become an integral part of real-time infectious disease outbreak response. While collaborative ensembles composed of statistical and machine learning models have become the norm for real-time forecasting, standardized benchmark datasets for evaluating such methods are lacking. Further, there is limited understanding of how these methods perform for novel outbreaks with limited historical data. In this paper, we propose IDOBE, a curated collection of epidemiological time series focused on outbreak forecasting. IDOBE compiles data from multiple repositories spanning over a century of surveillance across U.S. states and global locations. We perform derivative-based segmentation to generate over 10,000 outbreaks covering multiple outcomes such as cases and hospitalizations for 13 diseases. We consider a variety of information-theoretic and distributional measures to quantify the epidemiological diversity of the dataset. Finally, we perform multi-horizon short-term forecasting (1- to 4-week-ahead) through the progression of the outbreak using 11 baseline models and report on their performance. In addition to standard metrics such as NMSE and MAPE for point forecasts, we include probabilistic scoring rules such as the Normalized Weighted Interval Score (NWIS) to quantify performance. We find that MLP-based methods have the most robust performance, with statistical methods having a slight edge during the pre-peak phase. The IDOBE dataset, along with the baselines, is released publicly at this https URL to enable standardized, reproducible benchmarking of outbreak forecasting methods.
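For reference, a minimal implementation of the standard (unnormalized) weighted interval score, following the usual decomposition of Bracher et al. (2021); how IDOBE's NWIS normalizes this score is left open here:

```python
import numpy as np

def weighted_interval_score(y, median, lower, upper, alphas):
    """Weighted interval score (WIS) for a quantile forecast.

    `lower`/`upper` hold the bounds of the central (1 - alpha_k)
    prediction intervals for each level in `alphas`. Each interval
    contributes its width plus penalties for observations falling
    outside it, weighted by alpha_k / 2.
    """
    K = len(alphas)
    score = 0.5 * abs(y - median)
    for k, a in enumerate(alphas):
        l, u = lower[k], upper[k]
        interval_score = (u - l) \
            + (2.0 / a) * max(l - y, 0.0) \
            + (2.0 / a) * max(y - u, 0.0)
        score += (a / 2.0) * interval_score
    return score / (K + 0.5)

# e.g., 50% and 90% central intervals around a median forecast of 120
print(weighted_interval_score(
    y=150, median=120, lower=[100, 80], upper=[140, 170],
    alphas=[0.5, 0.1]))
```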
Physics-informed neural networks (PINNs) provide a powerful framework for learning governing equations of dynamical systems from data. Biologically-informed neural networks (BINNs) are a variant of PINNs that preserve the known differential operator structure (e.g., reaction-diffusion) while learning constitutive terms via trainable neural subnetworks, enforced through soft residual penalties. Existing BINN studies are limited to $1\mathrm{D}{+}t$ reaction-diffusion systems and focus on forward prediction, using the governing partial differential equation as a regulariser rather than an explicit identification target. Here, we extend BINNs to $2\mathrm{D}{+}t$ systems within a PINN framework that combines data preprocessing, BINN-based equation learning, and symbolic regression post-processing for closed-form equation discovery. We demonstrate the framework's real-world applicability by learning the governing equations of lung cancer cell population dynamics from time-lapse microscopy data, recovering $2\mathrm{D}{+}t$ reaction-diffusion models from experimental observations. The proposed framework is readily applicable to other spatio-temporal systems, providing a practical and interpretable tool for fast analytic equation discovery from data.
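A compact sketch of the BINN residual in 2D+t, with the reaction-diffusion operator structure fixed and the constitutive terms as trainable subnetworks; the network sizes and the softplus positivity device are assumptions:

```python
import torch
import torch.nn as nn

def mlp(n_in, n_out):
    return nn.Sequential(nn.Linear(n_in, 64), nn.Tanh(),
                         nn.Linear(64, 64), nn.Tanh(),
                         nn.Linear(64, n_out))

u_net = mlp(3, 1)   # surrogate solution u(x, y, t)
D_net = mlp(1, 1)   # trainable diffusivity D(u)
f_net = mlp(1, 1)   # trainable reaction term f(u)

def binn_residual(xyt):
    """Soft residual of u_t = div(D(u) grad u) + f(u) in 2D+t.

    The operator structure is fixed; constitutive terms are neural
    subnetworks; all derivatives come from autograd. A softplus
    keeps the learned diffusivity positive.
    """
    xyt = xyt.clone().requires_grad_(True)
    u = u_net(xyt)
    du = torch.autograd.grad(u.sum(), xyt, create_graph=True)[0]
    u_x, u_y, u_t = du[:, 0:1], du[:, 1:2], du[:, 2:3]
    D = torch.nn.functional.softplus(D_net(u))
    div_x = torch.autograd.grad((D * u_x).sum(), xyt,
                                create_graph=True)[0][:, 0:1]
    div_y = torch.autograd.grad((D * u_y).sum(), xyt,
                                create_graph=True)[0][:, 1:2]
    return u_t - (div_x + div_y) - f_net(u)

# the PDE penalty is added to the usual data-mismatch loss
loss_pde = binn_residual(torch.rand(256, 3)).pow(2).mean()
print(loss_pde)
```

After training, the learned D(u) and f(u) are tabulated and fitted with symbolic regression, yielding the closed-form equations referred to above.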
Molecular Dynamics (MD) simulations are essential for understanding the atomic-level behavior of molecular systems, giving insights into their transitions and interactions. However, classical MD techniques are limited by the trade-off between accuracy and efficiency, while recent deep learning-based improvements have mostly focused on single-domain molecules, lacking transferability to unfamiliar molecular systems. Therefore, we propose \textbf{Uni}fied \textbf{Sim}ulator (UniSim), which leverages cross-domain knowledge to enhance the understanding of atomic interactions. First, we employ a multi-head pretraining approach to learn a unified atomic representation model from a large and diverse set of molecular data. Then, based on the stochastic interpolant framework, we learn the state transition patterns over long timesteps from MD trajectories, and introduce a force guidance module for rapidly adapting to different chemical environments. Our experiments demonstrate that UniSim achieves highly competitive performance across small molecules, peptides, and proteins.
Tree-grass coexistence is a defining feature of savanna ecosystems, which play an important role in supporting biodiversity and human populations worldwide. While recent advances have clarified many of the underlying processes, how these mechanisms interact to shape ecosystem dynamics under environmental stress is not yet understood. Here, we present and analyze a minimalistic spatially extended model of tree-grass dynamics in dry savannas. We incorporate tree facilitation of grasses through shading and grass competing with trees for water, both varying with tree life stage. Our model shows that these mechanisms lead to grass-tree coexistence and bistability between savanna and grassland states. Moreover, the model predicts vegetation patterns consisting of trees and grasses, particularly under harsh environmental conditions, which can persist in situations where a non-spatial version of the model predicts ecosystem collapse from savanna to grassland instead (a phenomenon called ``Turing-evades-tipping''). Additionally, we identify a novel ``Turing-triggers-tipping'' mechanism, where unstable pattern formation drives tipping events that are overlooked when spatial dynamics are not included. These transient patterns act as early warning signals for ecosystem transitions, offering a critical window for intervention. Further theoretical and empirical research is needed to determine when spatial patterns prevent tipping or drive collapse.
Many methods have been developed to predict static protein structures; however, understanding the dynamics of protein structure is essential for elucidating biological function. While molecular dynamics (MD) simulations remain the in silico gold standard, their high computational cost limits scalability. We present DynaProt, a lightweight, SE(3)-invariant framework that predicts rich descriptors of protein dynamics directly from static structures. By casting the problem through the lens of multivariate Gaussians, DynaProt estimates dynamics at two complementary scales: (1) per-residue marginal anisotropy as $3 \times 3$ covariance matrices capturing local flexibility, and (2) joint scalar covariances encoding pairwise dynamic coupling across residues. From these dynamics outputs, DynaProt achieves high accuracy in predicting residue-level flexibility (RMSF) and, remarkably, enables reasonable reconstruction of the full covariance matrix for fast ensemble generation. Notably, it does so using orders of magnitude fewer parameters than prior methods. Our results highlight the potential of direct protein dynamics prediction as a scalable alternative to existing methods.
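The mapping from predicted marginal anisotropy to RMSF is standard; a minimal sketch (array shapes are assumptions):

```python
import numpy as np

def rmsf_from_covariances(sigmas):
    """Per-residue RMSF from predicted marginal anisotropy.

    If residue i's positional fluctuations have 3x3 covariance
    Sigma_i, then RMSF_i = sqrt(E ||x - mu||^2) = sqrt(trace(Sigma_i)).
    `sigmas` has shape (n_residues, 3, 3).
    """
    return np.sqrt(np.trace(sigmas, axis1=1, axis2=2))

sigmas = np.stack([np.diag([0.2, 0.1, 0.1]),    # rigid residue
                   np.diag([1.0, 0.8, 0.7])])   # flexible residue
print(rmsf_from_covariances(sigmas))
```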
Recovering unbiased kinetic and thermodynamic observables from enhanced sampling simulations is a central challenge in rare-event sampling. Classical Girsanov Reweighting (GR) offers a principled solution by yielding exact pathwise probability ratios between biased and unbiased processes. However, the variance of GR weights grows rapidly with time, rendering it impractical for long-horizon reweighting. We introduce Marginal Girsanov Reweighting (MGR), which mitigates variance explosion by marginalizing over intermediate paths, producing stable and scalable weights for long-timescale dynamics. Experiments on various molecular dynamics systems demonstrate that MGR accurately recovers unbiased kinetic properties from trajectories generated under both umbrella sampling and metadynamics biases.
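For context on the variance problem, the classical Girsanov weight takes the standard form (sketched here for diffusions sharing a noise amplitude $\sigma$ but differing in drift, with $b$ the unbiased and $\tilde{b}$ the biased drift): $M_T = \exp\big( \int_0^T \sigma^{-1}(b - \tilde{b})(X_t)\,\mathrm{d}B_t - \tfrac{1}{2}\int_0^T \|\sigma^{-1}(b - \tilde{b})(X_t)\|^2\,\mathrm{d}t \big)$. The log-weight accumulates its stochastic-integral and quadratic terms roughly linearly in $T$, so the variance of $M_T$ grows rapidly with trajectory length -- precisely the long-horizon failure mode that MGR mitigates by marginalizing over intermediate paths.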
Why do some physical systems possess consciousness, while others do not? Is this a question of physics? Or is it a question of the theory of causation? Physics and the theory of causation serve different descriptive purposes, and in this study we refer to them respectively as the Physical Stance and the Causal Stance. We propose that the generation of consciousness is determined by its internal causal mechanisms in the Causal Stance. To describe a causal model, we introduce into the formulation an asymmetric relation between cause and effect -- a relation that is necessary for describing causality, though not for describing physical laws. We argue that the causal conditions for the generation of consciousness are constituted by internal causal mechanisms of the system, rather than by external interventions. To explain such intrinsic causes, this study focuses on inter-level causality. Traditionally, inter-level causality has been considered an emergent phenomenon rather than a mechanism. We devise a method to implement these mechanisms explicitly in a causal model by examining how causes originating at higher levels are transmitted to lower levels within the system. We then propose a Dual-Laws Model (DLM), which features distinct dynamical laws at higher and lower levels. Finally, we discuss the generation of functional consciousness and its causality based on the DLM. Note that this study does not address the causal efficacy of the phenomenological aspect.
Codon usage bias has a crucial impact on the translation efficiency and co-translational folding of proteins, necessitating the algorithmic development of codon optimization/harmonization methods, particularly for heterologous recombinant protein expression. Codon harmonization is especially valuable for proteins sensitive to translation rates, because it can potentially replicate native translation speeds, preserving proper folding and maintaining protein activity. This work proposes a Monte Carlo-based codon harmonization algorithm, MOSAIC (Monte Carlo-based Simulated Annealing for Linked Codons), for the harmonization of sets of linked codons; it differs from conventional codon harmonization by focusing on codon sets rather than individual codons. MOSAIC demonstrates robust computational performance on ribosomal proteins (S18, S15, S10, and L11) as model systems. Among them, the harmonized gene of RP S18 was expressed and compared with the expression of the wild-type gene. The harmonized gene clearly yielded a larger quantity of the protein, a substantial fraction of which was soluble. These results underscore the potential of the linked codon harmonization approach to enhance the expression and functionality of sensitive proteins, setting the stage for more efficient production of recombinant proteins in various biotechnological and pharmaceutical applications.
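A minimal sketch of the Metropolis/simulated-annealing core; the toy cost below is a placeholder, and MOSAIC's actual linked-codon objective is richer than this per-position profile match:

```python
import math
import random

def harmonize(aa_seq, codons_by_aa, cost, t0=1.0, t_end=1e-3, steps=20000):
    """Simulated-annealing codon harmonization (a minimal sketch).

    `codons_by_aa` maps each amino acid to its synonymous codons, and
    `cost` scores a full codon sequence, e.g., mismatch between host
    and native translation-speed profiles. Metropolis acceptance with
    a geometric cooling schedule.
    """
    seq = [random.choice(codons_by_aa[aa]) for aa in aa_seq]
    c = cost(seq)
    best, best_c = list(seq), c
    for step in range(steps):
        t = t0 * (t_end / t0) ** (step / steps)      # geometric cooling
        i = random.randrange(len(seq))
        old = seq[i]
        seq[i] = random.choice(codons_by_aa[aa_seq[i]])
        c_new = cost(seq)
        if c_new <= c or random.random() < math.exp((c - c_new) / t):
            c = c_new
            if c < best_c:
                best, best_c = list(seq), c
        else:
            seq[i] = old                              # reject the move
    return best, best_c

# toy usage: match a hypothetical per-position translation-speed target
table = {"K": ["AAA", "AAG"], "F": ["TTT", "TTC"]}
speed = {"AAA": 0.8, "AAG": 0.4, "TTT": 0.3, "TTC": 0.7}
target = [0.8, 0.3, 0.4]
cost = lambda s: sum((speed[c] - t) ** 2 for c, t in zip(s, target))
print(harmonize("KFK", table, cost, steps=2000))
```

Scoring the whole sequence in `cost` is what distinguishes a linked-codon formulation from position-by-position harmonization: moves are accepted or rejected based on their effect on the full profile.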
High-dimensional omics datasets are routinely visualized as heatmaps, where color intensities reveal co-expression patterns and correlations. However, modern omics technologies increasingly generate matrices so large that existing visual exploration tools require down-sampling or filtering, causing loss of biologically important patterns. Additional barriers arise from tools that require command-line expertise, or fragmented workflows for downstream biological interpretation. We present ClusterChirp, a web-based platform for real-time exploration of large-scale data matrices. The platform combines GPU-accelerated rendering and parallelized hierarchical clustering using multiple CPU cores. Built on this http URL and multi-threaded clustering algorithms, ClusterChirp supports on-the-fly clustering, multi-metric sorting, feature search and interactive visualization controls within a single interface. Uniquely, a natural language interface powered by a Large Language Model allows users to perform complex operations and build reproducible workflows through conversational commands. ClusterChirp further enables within-cluster correlation network analysis in 2D or 3D, and integrates functional enrichment through biological knowledge bases. Developed with iterative user feedback and adhering to FAIR4S principles, ClusterChirp enables users to extract insights from high-dimensional omics data with unprecedented ease and speed. It is freely available at this http URL without login and is also distributed as a Dockerized application at this http URL.
We continue recent attempts to bring together concepts and results from Chemical Reaction Network theory (CRNT) and Mathematical Epidemiology (ME) for solving stability problems of positive ODEs. We first provide an elegant CRN-flavored generalization of the most cited result in ME, the Next Generation Matrix (NGM) theorem. We next review the "symbolic-numeric" approach of Vassena and Stadler, which tackles bifurcation problems by viewing the characteristic polynomial of the Jacobian at fixed points as a formal polynomial in the "symbolic reactivities", and identifies its coefficients as "Child Selection minors" of the stoichiometric matrix. We also review two applications of this approach using the Mathematica package Epid-CRN, which draws on tools from both CRNT and ME.
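For reference, the classical statement being generalized can be written compactly: linearizing the infected compartments at the disease-free equilibrium as $F - V$, with $F$ the new-infection matrix and $V$ the transition matrix, the NGM theorem gives the basic reproduction number as the spectral radius $\mathcal{R}_0 = \rho(F V^{-1})$, with the disease-free equilibrium locally asymptotically stable if $\mathcal{R}_0 < 1$ and unstable if $\mathcal{R}_0 > 1$ (van den Driessche & Watmough, 2002).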
We investigate the longitudinal behaviour of blood markers from common haematological tests as markers of disease and as a function of disease progression in a variety of conditions, including cancer, cardiovascular disease, and infections. We study confounding and non-confounding factors to enable earlier detection of diseases and conditions based on their longitudinal signatures in biomarkers commonly measured in scalable, routine blood tests, in particular the Complete Blood Count (CBC or FBC). Our analysis of normalised temporal profiles with machine learning techniques demonstrates that, even before any symptoms appear, analyte-group patterns found in blood testing are disease sensitive and disease specific. We demonstrate that CBC markers contribute the majority of the predictive signal, while biochemistry and other blood panels provide only a modest additional gain, mostly associated with the specific disease for which the test was designed (e.g. CRP, liver enzymes, blood sugar). Our results demonstrate how regular monitoring, computational intelligence, and machine learning applied to longitudinal CBC data can converge to uncover disease patterns, advancing the potential for precision healthcare and predictive medicine at mass scale by leveraging an existing and pervasive blood test.
Biomarker detection is an indispensable part of the diagnosis and treatment of low-grade glioma (LGG). However, current LGG biomarker detection methods rely on expensive and complex molecular genetic testing, for which professionals are required to analyze the results, and intra-rater variability is often reported. To overcome these challenges, we propose an interpretable deep learning pipeline, named Multi-Biomarker Histomorphology Discoverer (Multi-Beholder), to predict the status of five biomarkers in LGG using only hematoxylin and eosin-stained whole slide images. Specifically, Multi-Beholder incorporates one-class classification into the multiple instance learning framework to achieve accurate instance-level pseudo-labeling, thereby complementing slide-level labels and improving prediction performance. Multi-Beholder demonstrates high performance on two LGG cohorts with diverse races and scanning protocols, with area under the receiver operating characteristic curve up to 0.973 on the internal-validated TCGA-LGG dataset and 0.820 on the external-validated Xiangya cohort. Moreover, the interpretability of Multi-Beholder allows for discovering quantitative and qualitative correlations between biomarker status and histomorphology characteristics. Our pipeline not only provides a novel approach for biomarker prediction, enhancing the applicability of molecular treatments for LGG patients but also facilitates the discovery of new mechanisms in molecular functionality and LGG progression. Code can be accessed at this https URL.
When a new infectious disease (or a new strain of an existing one) emerges, as in the recent COVID-19 pandemic, different types of mobility restrictions are considered to slow down or mitigate the spread of the disease. The measures to be adopted require carefully weighing the social cost against their impact on disease control. In this work, we analyze, in a context of mobility restrictions, the role of frequent versus occasional contacts in epidemic spread. We develop an individual-based mathematical model where frequent contacts among individuals (at home, work, schools) and occasional contacts (at stores, transport, etc.) are considered. We define several contact structures by varying the relative weight of frequent and occasional contacts while keeping the same initial growth rate of the epidemic. We find the remarkable result that the more frequent contacts prevail over occasional ones, the higher the epidemic peak, the sooner it occurs, and the greater the final number of individuals affected by the epidemic. We conduct our study using an SIR model, considering both exponential and deterministic recovery from infection, and obtain that this effect is more pronounced under deterministic recovery. We find that the impact of relaxation measures depends on the relative importance of frequent and occasional contacts within the considered social structures. Finally, we assess in which of the considered scenarios the homogeneous mixing approximation provides a reasonable description of the epidemic dynamics.
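A minimal sketch of the individual-based setup described above, with deterministic recovery (the paper also considers exponential recovery); all parameter values are illustrative, not the paper's:

```python
import numpy as np

def sir_two_contact_types(n=2000, k_freq=4, k_occ=4, p_freq=1.0,
                          beta=0.05, t_inf=10, steps=200, seed=0):
    """Discrete-time stochastic SIR with frequent vs. occasional contacts.

    Each individual has a fixed set of 'frequent' neighbors
    (home/work/school) contacted with probability p_freq per step,
    plus k_occ fresh random 'occasional' contacts per step.
    Deterministic recovery after t_inf steps.
    """
    rng = np.random.default_rng(seed)
    freq = [rng.choice(n, size=k_freq, replace=False) for _ in range(n)]
    state = np.zeros(n, dtype=int)             # 0=S, 1=I, 2=R
    clock = np.zeros(n, dtype=int)
    state[rng.choice(n, size=5, replace=False)] = 1
    peak = 0
    for _ in range(steps):
        infectious = np.flatnonzero(state == 1)
        for i in infectious:
            contacts = list(freq[i][rng.random(k_freq) < p_freq])
            contacts += list(rng.choice(n, size=k_occ))   # occasional
            for j in contacts:
                if state[j] == 0 and rng.random() < beta:
                    state[j] = 1
        clock[state == 1] += 1
        state[(state == 1) & (clock >= t_inf)] = 2
        peak = max(peak, (state == 1).sum())
    return peak, (state == 2).sum()

# shift weight from occasional to frequent contacts at a fixed total
print(sir_two_contact_types(k_freq=2, k_occ=6))
print(sir_two_contact_types(k_freq=6, k_occ=2))
```

Varying the frequent/occasional split at a fixed total contact rate, as in the usage lines above, is the comparison the study performs across contact structures calibrated to the same initial growth rate.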
Where do objective functions come from? How do we select what goals to pursue? Human intelligence is adept at synthesizing new objective functions on the fly. How does this work, and can we endow artificial systems with the same ability? This paper proposes an approach to answering these questions, starting with the concept of a subjective function, a higher-order objective function that is endogenous to the agent (i.e., defined with respect to the agent's features, rather than an external task). Expected prediction error is studied as a concrete example of a subjective function. This proposal has many connections to ideas in psychology, neuroscience, and machine learning.
Representational similarity analysis and related methods have become standard tools for comparing the internal geometries of neural networks and biological systems. These methods measure what is represented, the alignment between two representational spaces, but not whether that structure is robust. We introduce geometric stability, a distinct dimension of representational quality that quantifies how reliably a representation's pairwise distance structure holds under perturbation. Our metric, Shesha, measures self-consistency through split-half correlation of representational dissimilarity matrices constructed from complementary feature subsets. A key formal property distinguishes stability from similarity: Shesha is not invariant to orthogonal transformations of the feature space, unlike CKA and Procrustes, enabling it to detect compression-induced damage to manifold structure that similarity metrics cannot see. Spectral analysis reveals the mechanism: similarity metrics collapse after removing the top principal component, while stability retains sensitivity across the eigenspectrum. Across 2463 encoder configurations in seven domains -- language, vision, audio, video, protein sequences, molecular profiles, and neural population recordings -- stability and similarity are empirically uncorrelated ($\rho=-0.01$). A regime analysis shows this independence arises from opposing effects: geometry-preserving transformations make the metrics redundant, while compression makes them anti-correlated, canceling in aggregate. Applied to 94 pretrained models across 6 datasets, stability exposes a "geometric tax": DINOv2, the top-performing model for transfer learning, ranks last in geometric stability on 5/6 datasets. Contrastive alignment and hierarchical architecture predict stability, providing actionable guidance for model selection in deployment contexts where representational reliability matters.
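A hedged sketch of split-half stability; the distance and correlation conventions below are common choices and may differ in detail from the paper's:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def split_half_stability(X, n_splits=20, seed=0):
    """Geometric stability via split-half RDM agreement (a sketch).

    X is a stimulus-by-feature representation. Features are split
    into complementary halves, an RDM is built from each half, and
    stability is the mean correlation between the two RDMs over
    random splits.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    scores = []
    for _ in range(n_splits):
        perm = rng.permutation(d)
        rdm_a = pdist(X[:, perm[:d // 2]])
        rdm_b = pdist(X[:, perm[d // 2:]])
        scores.append(spearmanr(rdm_a, rdm_b)[0])
    return float(np.mean(scores))

# a redundant (low-rank) representation is stable; heavy
# feature-specific noise degrades stability
rng = np.random.default_rng(3)
base = rng.normal(size=(100, 8)) @ rng.normal(size=(8, 256))
noisy = base + 20 * rng.normal(size=base.shape)
print(split_half_stability(base))    # high
print(split_half_stability(noisy))   # lower
```

Because the score depends on how distance structure is distributed across individual features, it is not invariant to orthogonal mixing of the feature space -- the formal property that separates it from CKA and Procrustes.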
Motif discovery is a core problem in computational biology, traditionally formulated as a likelihood optimization task that returns a single dominant motif from a DNA sequence dataset. However, regulatory sequence data admit multiple plausible motif explanations, reflecting underlying biological heterogeneity. In this work, we frame motif discovery as a quality-diversity problem and apply the MAP-Elites algorithm to evolve position weight matrix motifs under a likelihood-based fitness objective while explicitly preserving diversity across biologically meaningful dimensions. We evaluate MAP-Elites using three complementary behavioral characterizations that capture trade-offs between motif specificity, compositional structure, coverage, and robustness. Experiments on human CTCF liver ChIP-seq data aligned to the human reference genome compare MAP-Elites against a standard motif discovery tool, MEME, under matched evaluation criteria across stratified dataset subsets. Results show that MAP-Elites recovers multiple high-quality motif variants with fitness comparable to MEME's strongest solutions while revealing structured diversity obscured by single-solution approaches.
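A compact sketch of MAP-Elites with PWM genomes and a single behavioral descriptor; the paper uses three behavioral characterizations, so this one-descriptor archive is an illustrative simplification:

```python
import numpy as np

def pwm_loglik(pwm, seqs):
    """Fitness: summed best-window log-likelihood ratio of a position
    weight matrix (PWM) vs. a uniform background, over all sequences."""
    w = pwm.shape[0]
    total = 0.0
    for s in seqs:
        scores = [np.sum(np.log(pwm[np.arange(w), s[i:i + w]] / 0.25))
                  for i in range(len(s) - w + 1)]
        total += max(scores)
    return total

def map_elites(seqs, w=8, iters=500, bins=10, seed=0):
    rng = np.random.default_rng(seed)
    archive = {}  # behavior bin -> (pwm, fitness)

    def descriptor(pwm):
        # mean information content per column (bits), binned
        ic = 2.0 + np.sum(pwm * np.log2(pwm), axis=1)
        return min(int(ic.mean() / 2.0 * bins), bins - 1)

    def mutate(pwm):
        child = pwm.copy()
        i = rng.integers(w)
        child[i] = rng.dirichlet(10.0 * child[i] + 0.1)  # perturb a column
        return child

    for _ in range(iters):
        if not archive or rng.random() < 0.1:
            cand = rng.dirichlet(np.ones(4), size=w)     # random PWM
        else:
            cand = mutate(archive[rng.choice(list(archive))][0])
        fit, b = pwm_loglik(cand, seqs), descriptor(cand)
        if b not in archive or fit > archive[b][1]:
            archive[b] = (cand, fit)                     # new elite
    return archive

# toy usage: 20 random DNA sequences (0-3 encoding) of length 60
rng = np.random.default_rng(1)
seqs = [rng.integers(0, 4, size=60) for _ in range(20)]
elites = map_elites(seqs)
print({b: round(f, 1) for b, (_, f) in elites.items()})
```

The archive is the point of the method: instead of a single likelihood optimum, it returns one elite motif per behavior bin, exposing the structured diversity that a single-solution tool collapses.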