We presented an adjustment method for the calculation of medication exposure trajectories based on the number of dispensed packs and the type of dispensations (occasional or regular). A comparative study based on the EFEMERIS data was carried out using three different scenarios of trajectory calculation, depending on whether or not the number of packs and the periodicity of medication dispensations were taken into account. The impact of the scenario was assessed in three ways: using global indicators of the number of Defined Daily Doses (DDD) across all exposed women, by studying changes in individual trajectories from one scenario to another, and by comparing the results of a clustering into four groups. Although 65% of the trajectories remained unchanged, the remainder showed significant changes in the number of DDD and/or in the individual exposure profile. We observed that 4% of trajectories were attributed to a different cluster, and the clustering was of better quality with the adjustment method. Depending on the study context, an impact on cluster distribution could be observed for some maternal characteristics and neonatal outcomes. This was the case for neonatal pathology, which occurred more often in neonates whose mothers belonged to the cluster with high doses of psychotropics, thus reinforcing the conclusions of previous studies linking high exposure to psychotropic medications with pathology in the newborn.
Computational simulation provides a powerful toolkit for in silico experimentation. However, while the field has developed best practices for the design and implementation of such models, there remains ambiguity in discussions about how to understand and/or interpret their results due to their inherent ability to overwhelm traditional frequentist statistics by simply increasing the number of trials simulated. This fails the discipline in two ways: first, it leaves the community unsure of what constitutes a best practice for uniform understanding, and second, it potentially overburdens computational studies that burn clock cycles solely to ensure "enough runs to satisfy peers" without any theoretical underpinning for a definition of "enough". We propose a simple and straightforward standard for when to stop simulating additional trials, the {\Omega} test, designed to be analogous to the function of traditional frequentist P-tests. Community adoption of a reasonable and uniform standard will permit more efficient computational experimentation and clearer communication and interpretation of the findings discovered in this way.
Multifractality is an effective formalism for quantifying the nonlinear, scale-free properties of complex data. In this study, we propose a novel and efficient methodology, termed Multifractal Space-filling Curve Analysis (MFSCA), for quantifying the correlation structure of multidimensional data. Within this framework, the original multidimensional data - while preserving both local and long-range organisational properties - are projected onto a one-dimensional representation using a fractal space-filling curve. The resulting one-dimensional signal is then analysed using multifractal algorithms. We demonstrate the utility of the method using both artificially generated multifractal structures and real data. In particular, we apply MFSCA to analyse magnetic resonance imaging (MRI) data from Alzheimer patients at different stages of dementia. Based on the results, we estimate the multifractal profiles of the brain for healthy subjects of different ages as well as for dementia patients. The analysis reveals that the spatial organization of brain structures, as measured by the degree of multifractality, progressively weakens with age and the development of dementia. A transition from multifractality to monofractality is observed both in control groups, when comparing the Young Control and Elderly Control groups, and among dementia subjects of similar age but at different stages of the disease, namely early dementia and mild cognitive impairment. Thus, from the perspective of multiscaling properties, the heterogeneous characteristics of spatial brain organization deteriorate under worsening conditions, leading to a homogeneous and weakly correlated structure. These findings not only effectively capture key aspects of brain organisation, but also demonstrate that the multifractality of MRI data can serve as a marker of structural brain changes.
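The projection step described above can be illustrated with a minimal sketch. It assumes a Hilbert curve on a square grid whose side is a power of two; the abstract does not say which space-filling curve MFSCA uses, so the curve choice and the helper names `hilbert_d2xy` and `flatten_along_hilbert` are illustrative assumptions, not the authors' implementation.

```python
def hilbert_d2xy(n, d):
    """Map index d along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two. This is the standard iterative
    index-to-coordinate conversion for the Hilbert curve.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Reflect the sub-square before swapping axes.
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def flatten_along_hilbert(image):
    """Read a square 2D array (list of rows) into a 1D list along the curve.

    Consecutive samples in the output are always spatial neighbours,
    which is what lets a 1D analysis retain local structure.
    """
    n = len(image)
    coords = (hilbert_d2xy(n, d) for d in range(n * n))
    return [image[y][x] for x, y in coords]


signal = flatten_along_hilbert([[1, 2], [3, 4]])
```

The resulting `signal` would then be passed to a one-dimensional multifractal algorithm; that second stage is not sketched here.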
Neural population geometry shapes downstream computation. Recent empirical findings in neurobiology suggest that a hyperbolic structure underlies population activity in the hippocampus. Here we provide a theoretical framework for this phenomenon. First, we propose a plausible construction of hippocampal tuning curves that statistically induces hyperbolic geometry. Next, we establish a connection between neural decoding and associative memory by demonstrating that the Modern Hopfield Network update rule computes the minimum mean-squared-error (MMSE) estimator. Finally, we introduce a novel associative memory model defined in hyperbolic space that yields significantly larger capacity than leading models. Our results suggest that animals encode spatial information as a latent hyperbolic cognitive map, improving both memory capacity and decoding accuracy.
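The link between the Modern Hopfield Network update and MMSE estimation can be checked numerically in a simple setting. The sketch below assumes a uniform prior over stored patterns of equal norm and isotropic Gaussian observation noise with variance sigma^2; under those assumptions (which are ours, since the abstract does not list the paper's conditions) the posterior mean equals the softmax-weighted update with inverse temperature beta = 1 / sigma^2. It covers only the Euclidean case, not the hyperbolic model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def weighted_sum(weights, patterns):
    """Convex combination of the stored patterns."""
    dim = len(patterns[0])
    return [sum(w * p[k] for w, p in zip(weights, patterns)) for k in range(dim)]


def hopfield_update(patterns, query, beta):
    """One Modern Hopfield update: softmax over beta-scaled dot products."""
    scores = [beta * sum(a * b for a, b in zip(p, query)) for p in patterns]
    return weighted_sum(softmax(scores), patterns)


def mmse_estimate(patterns, observation, sigma):
    """Posterior mean of the pattern given a Gaussian-noise observation.

    Assumes a uniform prior over `patterns` (an illustrative assumption).
    """
    log_post = [
        -sum((y - x) ** 2 for y, x in zip(observation, p)) / (2 * sigma ** 2)
        for p in patterns
    ]
    return weighted_sum(softmax(log_post), patterns)


patterns = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # equal-norm patterns
y = [0.8, 0.3]
sigma = 0.5
a = hopfield_update(patterns, y, beta=1 / sigma ** 2)
b = mmse_estimate(patterns, y, sigma)
```

With equal-norm patterns the squared-distance term reduces to a dot product plus constants that cancel in the softmax, so `a` and `b` agree up to floating-point error.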
Trap crops reduce damage to a cash (main) crop by attracting pests away from it. Yet this protection is weakened when pests disperse back into the cash crop. In this paper, we focus on the importance of preventing this backflow, showing that effective trap cropping depends jointly on how strongly pests are attracted to trap plants and how rarely they leave them. Together with the proportion of the field devoted to trap plants, these processes determine both the efficacy and feasibility of trap cropping at commercial scales. We formalise this relationship using a simple yield-maximisation framework, in which growers weigh pest suppression benefits against the land sacrificed to trap plants. The model shows that when dispersal from trap plants equals that from the cash crop, optimal trap coverage can exceed 20 to 30 percent of the landscape, levels rarely acceptable to growers. However, reducing pest dispersal off trap plants to just one-quarter of cash crop dispersal lowers the optimal required trap area to approximately 5 percent, transforming trap cropping from impractical to feasible. Understanding these relationships can guide trap-cropping design, from plant choice to targeted interventions that reduce pest movement, to minimise damage, maximise yield, and make trap cropping a reliable component of sustainable pest management.
High-dimensional multicomponent systems, including immune and epigenetic repertoires, must selectively retain rare, beneficial components while purging a massive influx of suboptimal variants. We demonstrate that critical tuning of component control parameters through competition naturally implements proofreading in these systems. Competition for shared inputs pins the system to the marginal stability threshold of the most persistent species. This grants dominant species extended lifetimes, concentrating the population into dominant components while forcing less-stable variants into rapid drift-driven turnover. When aggregate drive exceeds a characteristic scale, this pinning fails, producing a non-selective state where component lifetimes scale as a universal power law with aggregate drive. Applying this framework to biological memory, we identify the hallmarks of this effect in plasma cell accumulation dynamics and propose that de-pinning transitions may represent failure points across biological domains, including cancer, immunodeficiencies, and the aberrant activation of harmful genomic elements during ageing.
Understanding the dynamics of real-world complex networks is crucial for assessing their predictability and resilience and for improving ecosystem management, especially in the context of climate change. The relationship between stability and complexity in ecological networks is still debated in the literature. In this modeling study, we investigate whether a complex marine trophic network, characterized by multiple trophic interactions and environmental constraints, exhibits predominantly stable, periodic, or chaotic dynamics. We incorporate the microbial loop into a trophic network model, which includes one to three primary producers, one or two consumers, and up to three trophic levels of predators. The microbial loop is a key process in which bacteria recycle detritus from higher trophic levels into nutrients available for the growth of primary producers, ensuring mass conservation within the system. We perform numerical simulations to investigate the dynamic behavior of the network, exploring several configurations by turning off predator-prey links between species and varying the high-dimensional parameter space. Our results show that (i) longer trophic chains and (ii) a higher number of consumers increase system chaoticity, whereas (iii) omnivorous interactions promote stability. Notably, many of the configurations exhibit high percentages of chaotic behavior. Feedback loop analysis suggests that the balance between negative and positive interactions plays a key role in the convergence of the system toward a steady state. This study shows that interactions and feedback, rather than complexity, are key drivers of stability, pointing to the absence of a clear stability-complexity relationship and instead highlighting a stability-interaction dependence. Chaotic dynamics may also play an important role, with potential implications for predictability and ecosystem management.
All data-driven modeling tasks (e.g., parameter estimation, uncertainty quantification, and data forecasting) require the selection of a mathematical model. An overlooked aspect of model selection is modality; for example, there are no guidelines on when to use a partial differential equation (PDE) model or an agent-based model (ABM) for spatial processes. To address this, we created a model selection pipeline that uses approximate Bayesian computation to perform parameter estimation, uncertainty quantification, and model selection (using both information criteria and out-of-sample forecasting). Applying the pipeline to artificial datasets (generated from ABMs) reveals that while both modalities yield comparable parameter estimation performance, the ABM estimates exhibit higher uncertainty, and the PDE models compute more than 1,000$\times$ faster. Surprisingly, the mean-field PDE is often selected over the true generative ABM using both information criteria and data forecasting. Applying the pipeline to public wound healing data indicates that a PDE model with cell pulling and a time delay is the most appropriate model for these data; however, this model has high levels of parametric uncertainty. This methodology establishes a preliminary framework for selecting the appropriate modeling modality for spatial biological data.
Modern biological microscopy routinely generates large and complex image datasets, including multidimensional, multimodal, and time-resolved acquisitions. While imaging technologies have rapidly evolved, data management infrastructures within microscopy facilities often remain fragmented, relying on heterogeneous local solutions that are difficult to maintain, scale, and integrate with High-Performance Computing (HPC) centers and public data repositories. To address these issues, France BioImaging (FBI), the French national infrastructure for biological imaging, has developed this http URL and the associated BioImage Cloud platform. This initiative aims to provide a coordinated national infrastructure connecting microscopy facilities, centralized storage resources, HPC environments, and public bioimaging archives through interoperable and scalable this http URL proposed architecture combines open-source technologies including OMERO for image management, iRODS for distributed data orchestration, Authentik for federated authentication, and emerging standards such as OME-Zarr and REMBI metadata recommendations. The infrastructure is designed to support the complete imaging data lifecycle, from acquisition and transfer to visualization, analysis, sharing, and long-term archiving. Beyond the technical implementation, this work presents the organizational and governance strategies required to deploy a shared national infrastructure across distributed imaging facilities. We discuss the challenges associated with interoperability, metadata standardization, sustainability, and user adoption, as well as the perspectives opened by tighter integration between imaging data and large-scale computing resources for future AI-driven bioimage analysis workflows.
Early detection of neurodegeneration remains a critical clinical challenge. This study investigates whether sleep EEG signal criticality, quantified via Multifractal Detrended Fluctuation Analysis (MFDFA), serves as a non-invasive biomarker for future cognitive decline. We analyzed longitudinal data from the National Sleep Research Resource (NSRR) Study of Osteoporotic Fractures (SOF) cohort, comparing baseline sleep EEG dynamics between women who remained cognitively normal and those who later progressed to dementia-related impairment ($3MS < 78$). Our results reveal significant group-level differences in Hurst exponent $H(q)$ distributions, particularly during non-REM stages N2 and N3. Cognitively healthy individuals exhibited signal dynamics significantly closer to an optimally critical state across all electrode locations ($p \leqslant 0.001$), supporting the Brain Criticality Hypothesis. Supervised UMAP projections confirmed clear spatial separation between groups throughout the overnight sleep this http URL dementia group demonstrated a shift in DFA exponents toward $1.0$, suggesting that a reconfiguration of scale-free neural dynamics during sleep precedes clinical symptoms. These findings highlight the potential for MFDFA-derived measures to be integrated into automated, sleep-based screening tools, enabling earlier preventative interventions during the prodromal window of dementia.
Movement requires the motor cortex to specify both \emph{what} action to produce and \emph{which goal} it serves, yet how individual neurons separate these factors is not understood. Here we show that in macaque motor cortex the \emph{burst fraction} of a neuron, the proportion of its spikes emitted in high-frequency bursts, encodes reach direction far more selectively than its overall firing rate. This dissociation is highly consistent: it holds in every one of 12 recording sessions spanning three animals and two laboratories (all $p<10^{-12}$) and survives controls that remove any contribution of firing rate, showing that goal information is concentrated specifically in bursts. We then show that this coding signature is the predicted consequence of dendritic coincidence detection in layer-5 pyramidal neurons: when a goal-related apical input coincides with a state-related basal drive the neuron bursts, so burst probability computes the product of goal and state, a bilinear gate $G(g)\,Y(s)$. A minimal two-compartment spiking model reproduces the effect, and the same multiplicative gate, embedded in a reinforcement-learning agent, supports zero-shot generalisation to new goals and rapid online adaptation, providing a computational rationale for segregating goal information into bursts. These results identify burst fraction as a goal-selective code in motor cortex, tie it to a concrete cellular mechanism, and show that the mechanism confers a learning advantage.
Force-induced dissociation of short double-stranded DNA (dsDNA) is central to single-molecule biophysics and DNA nanotechnology, yet a physically grounded kinetic description of shear-induced rupture for finite-length constructs remains lacking. Here we develop a master equation framework built on a force-dependent nucleation-zipper pathway with single-base transitions, enabling direct calculation of dissociation rates and transition state distances over a broad force range. Applied to a DNA-gold nanoparticle-DNA construct under constant shear force, the model accurately reproduces the experimental room-temperature data in the covered force regime and provides a unified interpretation of prior measurements on similarly sheared duplexes across all force regimes. A central result is that the three-dimensional helical geometry of dsDNA is essential for correctly defining the end-to-end distance under shear in the rod-like polymer model of short dsDNA. We further show that the extracted transition state distances are robust to variations in ssDNA polymer parameters within the experimentally relevant regime. Finally, we analyze the temperature dependence of the transition state distance and discuss how our framework captures globally heated rupture while identifying the additional complications introduced by localized plasmonic heating in gold nanoparticle-coupled constructs. These results provide a predictive kinetic foundation for interpreting force-rupture experiments and for designing force- and temperature-actuated DNA nanostructures.
Cancer treatment planning requires decisions across multiple clinical dimensions at once. Clinicians must determine whether a patient should receive targeted molecular therapy, whether they should receive radiation therapy, and whether they are likely to survive beyond six months. Existing pathway-informed deep learning models have been developed and tested in isolation, making fair comparison across architectures impossible. We present the first unified benchmark for pathway-guided therapy response modeling, evaluating three biologically informed architectures, BINN, GraphPath, and PATH, across five cancer cohorts drawn from The Cancer Genome Atlas, representing 2,622 patients encoded using Reactome pathway activity scores. Each model is trained jointly on all three clinical outcomes under identical data and evaluation conditions, making this the first study to treat pathway-structured deep learning as a combined therapy and survival prediction problem. Our results show that no single architecture wins across all tasks: PATH performs best for targeted molecular therapy prediction overall, BINN is most reliable for survival prediction, and no model produces useful predictions for radiation therapy, as the key drivers of that decision are clinical variables not captured in gene expression data. Most strikingly, GraphPath achieves an AUROC of 0.92 on prostate targeted molecular therapy prediction, the highest score in the entire benchmark, demonstrating that lateral co-regulation structure produces exceptional discriminative power when matched to a cohort with a narrow targetable driver programme, even under conditions of extreme class imbalance at only 11\% positive prevalence.
Biodiversity measures are often used descriptively: one computes a diversity index from an observed or estimated community composition and maps the resulting values across space. Conservation planning, however, also requires a site-specific benchmark against which the observed community can be compared. This chapter develops an information-geometric framework for such \emph{potential diversity} and the associated \emph{diversity gap}. The central object is a pair of probability vectors on the species simplex: an observed or realized composition \(p^{\mathrm{obs}}\), and a potential composition \(p^{\mathrm{pot}}\) obtained by a constrained variational principle. The gap is then defined by comparing a diversity functional at these two compositions. The framework is developed for both Hill-type diversity, which measures abundance and evenness, and Rao's quadratic entropy, which incorporates trait, phylogenetic, or ecological dissimilarities among species. A spatial point-process interpretation clarifies how local ecological capacities can be defined before passing to the simplex. Escort constraints, capacity constraints, and divergence projections then provide a unified way to define nontrivial benchmarks beyond the uniform distribution. The resulting formulation separates two distinct questions: how diverse a community is, and how far it is from a locally admissible potential benchmark. It also connects the ecological idea of dark diversity with a continuous, abundance-weighted comparison on the probability simplex. We also outline a dynamic extension in which capacities, species migration, and climate-driven shifts vary over time. Empirical implementation with large-scale citizen-science biodiversity data and trait databases is left for future work.
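As a concrete reading of the two diversity functionals named above, the following sketch computes Hill numbers and Rao's quadratic entropy and compares them at an observed and a potential composition. Taking the gap as a simple difference, and using the uniform distribution as the potential composition, are illustrative assumptions; the chapter defines the potential composition through a constrained variational principle that is not reproduced here.

```python
import math


def hill_number(p, q):
    """Hill diversity of order q for a probability vector p.

    q = 0 gives richness, q -> 1 gives exp(Shannon entropy),
    q = 2 gives the inverse Simpson index.
    """
    support = [x for x in p if x > 0]
    if abs(q - 1.0) < 1e-12:
        return math.exp(-sum(x * math.log(x) for x in support))
    return sum(x ** q for x in support) ** (1.0 / (1.0 - q))


def rao_quadratic_entropy(p, d):
    """Rao's Q: expected dissimilarity d[i][j] between two random individuals."""
    n = len(p)
    return sum(p[i] * p[j] * d[i][j] for i in range(n) for j in range(n))


def diversity_gap(diversity, p_obs, p_pot):
    """Gap as a plain difference of a diversity functional (an assumed form)."""
    return diversity(p_pot) - diversity(p_obs)


p_obs = [0.7, 0.2, 0.1]
p_pot = [1 / 3, 1 / 3, 1 / 3]  # placeholder benchmark, not the chapter's construction
gap_hill = diversity_gap(lambda p: hill_number(p, 1.0), p_obs, p_pot)

dissim = [[0.0, 1.0, 1.0], [1.0, 0.0, 0.5], [1.0, 0.5, 0.0]]  # made-up dissimilarities
gap_rao = diversity_gap(lambda p: rao_quadratic_entropy(p, dissim), p_obs, p_pot)
```

A positive gap here means the observed community is less diverse than the benchmark under the chosen functional; with a non-uniform potential composition the sign need not be positive.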
Generative models have shown remarkable progress in a variety of domains such as protein design, but such power enables the opaque generation of hazardous proteins. In this work, we introduce VFUSE (Virulent Feature Understanding with Sparse autoEncoders), a mechanistic interpretability approach that trains SAEs on diffusion-transformer activations to audit protein models for hazard-aware features. We apply VFUSE to RoseTTAFold3 and RFDiffusion3, popular open-weight models for protein folding and synthesis. We find that for certain blocks, linear probes detect hazardous designs significantly better when fit in the SAE latent space over the original model's representations: improving interpretability without sacrificing model performance. Furthermore, we identify monosemantic features from the SAE that fire only on hazardous designs at up to AUROC $0.84$ ($q < 10^{-13}$). To our knowledge this is the first SAE trained on an all-atom diffusion model and the first feature-level virulence audit of a protein design model, paving the way towards safe and interpretable protein design.
Reliable evaluation of instance segmentation models requires metrics that accurately and consistently reflect segmentation quality. However, the metrics most widely used in biological imaging carry fundamental mathematical weaknesses: hard Intersection-over-Union (IoU) thresholds that produce discontinuous, low-sensitivity scoring; per-object normalization that distorts scores under object size variation; and greedy or one-to-many matching procedures that yield non-optimal, order-dependent correspondences. Together, these properties produce unintuitive and unreliable model rankings under common failure modes such as split cells, merged cells, and cell boundary imprecision. We propose Maximum Matching Accuracy (MMA), a threshold-free continuous metric that finds a globally optimal one-to-one matching between predicted and ground truth objects and aggregates total overlap using per-pixel normalization. We evaluate MMA against AP@50, PQ, SEG, and AJI across three experiments: synthetic failure cases, progressive corruption tests, and a model ranking comparison. MMA produces scores that are more stable, more sensitive, and more interpretable than existing alternatives, providing a principled foundation for fair instance segmentation benchmarking in biological cell imaging.
This work introduces a biophysical formalism to describe the spatiotemporal evolution of the chemical profile in tissues, with the novelty of modeling tissue compartmentalization and the mechanism by which cells maintain the system far from thermodynamic equilibrium via production and/or degradation of substances. The models were derived from conservation laws, chemical kinetic theory, and geometric constraints, while considering fundamental properties of tissues to connect theoretical modeling with experimental observations. In a morphogenetic context, each morphogen is described by two coupled reaction-diffusion equations, representing intra- and extracellular dynamics, linked through membrane transport processes such as nonlinear, cross, and anomalous diffusion. We explore the models' morphogenetic potential through diffusion-driven instabilities and discuss how natural tissue heterogeneities influence Turing instabilities and self-organized phenomena. The mathematical structure reveals that two-morphogen systems can produce Turing patterns with multiple characteristic length scales, while the system's dimensionality enables chaotic behavior in well-mixed dynamics. Moreover, due to domain coupling, Turing instabilities are allowed for single-morphogen systems. We used Schnakenberg kinetics to demonstrate that Turing patterns arise even when the activator diffuses faster than the inhibitor ($d<1$), thereby expanding the parameter space for pattern formation. Our results suggest that tissue spatial structure has important consequences for Turing instability mechanisms, in some cases weakening the usual conditions for its emergence while widening the possible patterns it can produce. The proposed framework offers a minimal mathematical basis to explore emergent dynamics in biological and synthetic contexts, with potential applications in developmental biology and tissue engineering.
Passive acoustic monitoring enables large-scale observation of wildlife, but most bioacoustic classifiers only predict species presence in a time window without localizing vocalizations precisely in time or frequency, limiting downstream analyses. We formulate bird vocalization detection as an object detection task on spectrograms and train YOLO11 models to localize bird calls in dense tropical soundscapes from Singapore. We additionally introduce an open-source browser-based annotation tool and propose Intersection over Minimum (IoMin), an evaluation metric that better handles ambiguous acoustic boundaries than standard IoU and is better suited to the problem at hand. The best YOLO model nearly doubles baseline performance on in-distribution soundscapes from Singapore (81.8% vs. 42.1% IoMin@50 F1-score) while still outperforming the baseline on unseen out-of-distribution recordings from Hawaii (58.6% vs. 48.6%). These results suggest that object detection frameworks are a promising approach to time-frequency localization of animal vocalizations in complex soundscapes.
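To make the contrast between IoU and IoMin concrete, here is a minimal sketch for axis-aligned time-frequency boxes. It assumes IoMin is the intersection area divided by the area of the smaller box, which is the natural reading of the name; the paper's exact definition and matching procedure may differ, and the box layout `(t_start, f_low, t_end, f_high)` is a hypothetical convention.

```python
def box_area(box):
    """Area of a (t_start, f_low, t_end, f_high) box; zero if degenerate."""
    t0, f0, t1, f1 = box
    return max(0.0, t1 - t0) * max(0.0, f1 - f0)


def intersection_area(a, b):
    """Overlap area of two axis-aligned boxes."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return box_area(overlap)


def iou(a, b):
    """Standard Intersection over Union."""
    inter = intersection_area(a, b)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def iomin(a, b):
    """Intersection over the smaller of the two box areas (assumed definition)."""
    smaller = min(box_area(a), box_area(b))
    return intersection_area(a, b) / smaller if smaller > 0 else 0.0


annotation = (0.0, 0.0, 2.0, 2.0)  # loosely drawn ground-truth box
prediction = (0.0, 0.0, 1.0, 1.0)  # tighter predicted box fully inside it
scores = (iou(annotation, prediction), iomin(annotation, prediction))
```

In this example the prediction lies entirely inside a generously drawn annotation, so IoU penalises it heavily while IoMin does not, which illustrates why such a metric can be more forgiving of ambiguous call boundaries.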
Objective: Accurate classification of physiological signals in real-world deployments is challenged by sensor noise, motion artifacts, and distribution shifts between training and deployment data. Inference-time augmentation (ITA), which applies augmentations during inference rather than retraining, offers a simple, model-agnostic mechanism to improve robustness. However, ITA application to physiological signals has remained narrow in scope, relying on limited augmentation methods with fixed, unoptimized parameters. This work proposes a unified ITA framework to address that gap. Approach: The framework incorporates 13 augmentation methods spanning time-domain, amplitude-domain, frequency-domain, and artifact-injection transformations, with hyperparameters optimized via Bayesian optimization. We evaluate on atrial fibrillation (AF) detection from 30-second PPG signals using GPT-PPG and ResNet across five datasets comprising more than 400 patients and ${\sim}$9,800 hours of recording. Main results: Standard ITA consistently improved AUROC (up to 8.5% for GPT-PPG and 0.7% for ResNet) and AUPRC (up to 10.6% for GPT-PPG and 0.8% for ResNet). Selective ITA further reduced average FPR by up to 4.4% (GPT-PPG) and 1.3% (ResNet) on non-AF datasets. Significance: These findings establish ITA as a practical, model-agnostic approach for improving PPG-based AF classification reliability in deployment settings where retraining is not feasible, with broader applicability to physiological signal analysis.
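The core idea of inference-time augmentation can be sketched in a few lines: score several augmented views of the same input with a frozen model and aggregate the outputs. The sketch below assumes mean aggregation and includes the unmodified signal among the views; both choices, the two toy augmentations, and the stand-in `toy_model` are illustrative assumptions and do not reproduce the framework's 13 methods, its Bayesian hyperparameter search, or selective ITA.

```python
import random


def scale_amplitude(factor):
    """Return an augmentation that multiplies every sample by `factor`."""
    return lambda signal: [factor * x for x in signal]


def add_gaussian_noise(sigma, seed):
    """Return an augmentation that adds seeded Gaussian noise to each sample."""
    def augment(signal):
        rng = random.Random(seed)
        return [x + rng.gauss(0.0, sigma) for x in signal]
    return augment


def ita_predict(model, signal, augmentations):
    """Average the model's probability over the original and augmented views."""
    views = [signal] + [aug(signal) for aug in augmentations]
    probs = [model(view) for view in views]
    return sum(probs) / len(probs)


def toy_model(signal):
    """Hypothetical stand-in classifier: clipped mean of the signal."""
    mean = sum(signal) / len(signal)
    return min(1.0, max(0.0, mean))


segment = [0.2, 0.4]
p_plain = toy_model(segment)
p_ita = ita_predict(toy_model, segment, [scale_amplitude(0.5), scale_amplitude(1.5)])
p_noisy = ita_predict(toy_model, segment, [add_gaussian_noise(0.01, seed=0)])
```

Because the model is only called, never retrained, the same wrapper applies to any classifier that maps a signal to a probability, which is the sense in which the approach is model-agnostic.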
Designing functional biological sequences requires navigating vast discrete spaces under strict evolutionary and biophysical constraints. Discrete Flow Matching (DFM) offers a generative framework over such spaces, but existing approaches rely on biologically uninformative couplings and offer limited flexibility for variable-length sequence generation and fine-grained control. We propose a structured coupling that encodes domain-specific preferences among sequence elements, biasing the source distribution toward plausible regions without modifying the flow objective or training procedure. Building on this, we introduce a latent edit-based rate parameterization that models variable-length generation via edit operations conditioned on a shared global latent, akin to a latent variable model, while remaining tractable. We further introduce a latent classifier-free guidance mechanism that steers generation coherently in continuous latent space, along with Dirichlet-prior temperature scaling for test-time control over edit operations. Our method achieves state-of-the-art performance across diverse biological sequence tasks, including density estimation, unconditional and conditional DNA sequence generation, and peptide sequence generation.
We propose a minimal one-dimensional continuum model for the spontaneous initiation of protrusion-driven cell crawling on a rigid substrate. The cell cytoskeleton is represented as a viscous actin meshwork that turns over in the bulk and polymerizes at two moving cell edges. Symmetry breaking arises from the feedback between cell motion, an external chemical regulator of actin nucleation, and actin polymerization at the cell fronts. When the cell moves, the regulator becomes polarized around the moving boundaries, thereby imposing different actin nucleation densities at the two edges. This generates unequal protrusive rates, which in turn reinforce motion and sustain the chemical polarization. Above a critical protrusive activity, the static symmetric state loses stability and the system undergoes a bifurcation toward a motile polarized state. Depending on how the external cue controls actin nucleation, the transition can be either supercritical or subcritical, leading in the latter case to coexistence between static and motile states. Using parameter values appropriate for keratocyte cells, the model predicts realistic crawling speeds and actin-density profiles, including asymmetric edge-localized density peaks. These results identify a generic mechanism by which external biochemical regulation of actin nucleation can trigger spontaneous motility along a one-dimensional track without requiring molecular motors, specific adhesion dynamics, deformable substrates, or pre-existing polarity.
Despite its importance to applications in protein design, predicting protein properties like binding affinity and thermostability from sparse experimental data remains a significant challenge. Accordingly, we introduce a class of sequence kernels that exploit evolutionary substitution matrices as well as local linearity and demonstrate that the resulting Gaussian processes provide data-efficient models of protein property landscapes, frequently outperforming alternatives that rely on foundation model embeddings. Furthermore--by learning what are in effect structure-aware substitution matrices--we show that our kernels can readily incorporate structural information from foundation models. We demonstrate that these structure-conditioned kernels are well suited to multi-task learning across multiple protein property landscapes and can decisively outperform local supervised learning methods.
Neural population activity models can recover rich temporal structure from binned spikes, but their read-in and readout layers often remain tied to a fixed set of recorded neurons. This coupling limits reuse in long-term brain-computer interfaces, where recorded neuron identities, counts, and response statistics can change across days. We introduce GRAFT, a Transformer-based neural population activity model that separates reusable temporal dynamics from a recalibratable neuron interface. The neuron interface controls how recorded neurons enter and leave the shared backbone, and auxiliary gain and positional mechanisms support neural activity modeling inside the Transformer. On MC Maze under the standard NLB'21 protocol, GRAFT reaches 0.3866 co-bps as an ensemble, setting a new state of the art on the primary co-bps metric among public and reported NLB'21 results. In a cross-day protocol constructed from the NLB'21 MC Maze dataset series, GRAFT recalibrates from MC Maze to the scaled MC Maze datasets (Large/Medium/Small) by updating only 9.21% of parameters, reaching 0.3749, 0.3112, and 0.3152 co-bps with restricted target-day support sets. These results show that the same interface-backbone separation supports both strong Transformer-based neural population activity modeling and data-efficient cross-day recalibration.
Network control theory can be used to model intrinsic and extrinsic strategies to steer neural dynamics. Standard approaches are node-centric, structural, and focused on achieving desired instantaneous states. Here, we develop an edge-centric approach which incorporates both structure and function to achieve extended patterns of neural dynamics characterized by desired synchronization states. Our method, Quantifying Underutilized Influential Edges for Targeted Synchronization (QUIET), is an edge-centric framework that integrates structural controllability of individual white matter connections and mutual information between pairwise functional timeseries to identify energy-efficient synchronization pathways. QUIET identifies quiet highways, edges that are structurally influential but functionally underutilized, to optimize regional synchronization. We validated QUIET across 75 synthetic configurations, where QUIET-ranked edge sets significantly outperformed random selection in 93% of cases (p<0.01). The framework, tested on Human Connectome Project participants, revealed that the control energy required for synchronization of the salience network correlates with fluid intelligence. QUIET, applied to healthy adults undergoing dexmedetomidine-induced unresponsiveness, showed that the frontoparietal and default-mode networks exhibited the largest control energy required for synchronization in both awake and sedated states. QUIET is released as stand-alone software for studying theoretically defined synchronization pathways, which in turn could inform testable hypotheses in perturbative studies.
Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computational models on the corresponding longitudinal patient trajectories. We introduce OncoTraj, a public benchmark of 813 EGFR-mutant NSCLC patients receiving first-line osimertinib, harmonized from three real-world clinical-genomic sources: MSK-CHORD (672 patients), AACR Project GENIE BPC NSCLC (34 patients), and the FLAURA molecular-resistance supplement (107 patients). OncoTraj defines three locked tasks: (A) binary classification of progression by a fixed 12-month landmark, (B) regression of time-to-first-progression in days, and (C) six-class classification of the dominant resistance mechanism. We release the harmonized dataset, patient-level train/validation/test splits with an audited no-leakage guarantee, an open-source evaluation harness, and six reference baselines spanning a majority-class predictor, logistic regression, random forest, XGBoost, an LSTM, and a multi-task transformer. With v1's single-timepoint snapshot features, no task clears chance on clean within-source evaluation: the uniformity of this ceiling across every model class localizes the limit to the input modality (single-snapshot tissue NGS rather than serial ctDNA), not the algorithm. The benchmark does recover a reproducible literature-consistent association: TP53 co-mutation raises the 12-month progression rate from 29% to 59% cohort-wide. OncoTraj establishes a reproducible, leakage-audited baseline and converts the modality limit into concrete design requirements for a serial-ctDNA-enriched v2.
In recent years, there has been an effort to extend the classical notion of phylogenetic balance, originally defined in the context of trees, to networks. One of the most natural ways to do this is with the so-called $B_2$ index. In this paper, we study the $B_2$ index for a prominent class of phylogenetic networks: galled trees. We show that the $B_2$ index of a uniform leaf-labeled galled tree converges in distribution as the network becomes large. We characterize the corresponding limiting distribution, and provide a way to compute its moments. This is the first time that a balance index has been studied to this level of detail for a random phylogenetic network. A distinctive feature of this work is that we use two different and independent approaches, each with its advantages: analytic combinatorics, and local limits. The analytic combinatorics approach is more direct, as it relies on standard tools, but it involves slightly more complex calculations. Because it has not previously been used to study such questions, the local limit approach requires developing an extensive framework beforehand; however, this framework is interesting in itself and can be used to tackle other similar problems.
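For readers unfamiliar with the $B_2$ index, a minimal sketch may help. It assumes the usual definition from the literature, $B_2 = -\sum_{\ell} p_\ell \log_2 p_\ell$, where $p_\ell$ is the probability that a random walk started at the root, choosing uniformly among the children of each node, ends at leaf $\ell$. The adjacency-dictionary representation and the function name `b2_index` are choices made for this example only.

```python
import math
from collections import defaultdict


def b2_index(children, root):
    """B_2 index of a rooted DAG given as {node: [child, ...]}.

    Leaves are nodes with no children. A random walk from the root picks
    each child of the current node with equal probability.
    """
    # Reverse DFS postorder gives a topological order (parents before children).
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for child in children.get(node, []):
            visit(child)
        order.append(node)

    visit(root)
    order.reverse()

    # Propagate arrival probabilities down the network; a reticulation node
    # accumulates probability from each of its parents.
    prob = defaultdict(float)
    prob[root] = 1.0
    for node in order:
        kids = children.get(node, [])
        for child in kids:
            prob[child] += prob[node] / len(kids)

    leaves = [node for node in order if not children.get(node)]
    return -sum(prob[leaf] * math.log2(prob[leaf]) for leaf in leaves)


if __name__ == "__main__":
    # Balanced tree on four leaves.
    balanced = {"r": ["a", "b"], "a": ["1", "2"], "b": ["3", "4"]}
    # Small network with one gall: reticulation node h has parents a and b.
    galled = {"r": ["a", "b"], "a": ["h", "x"], "b": ["h", "y"], "h": ["z"]}
    print(b2_index(balanced, "r"), b2_index(galled, "r"))
```

On a tree this reduces to the classical $B_2$ index. On a network, the only difference is that a reticulation node receives probability mass from more than one parent.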
Type 1 Diabetes (T1D) management is a complex task because of the many sources of variability involved. Artificial Pancreas (AP) systems have alleviated patient burden by automating insulin delivery through advanced control algorithms. However, the effectiveness of these systems depends on accurate modeling of glucose-insulin dynamics, which traditional mathematical models often fail to capture due to their inability to adapt to patient-specific variations. This study introduces a Biological-Informed Recurrent Neural Network (BIRNN) framework to address these limitations. The BIRNN leverages a Gated Recurrent Unit (GRU) architecture augmented with physics-informed loss functions that embed physiological constraints, ensuring a balance between predictive accuracy and consistency with biological principles. The framework is validated using the commercial UVA/Padova simulator, outperforming traditional linear models in glucose prediction accuracy and reconstruction of unmeasured states, even under circadian variations in insulin sensitivity. The results demonstrate the potential of BIRNN for personalized glucose regulation and future adaptive control strategies in AP systems.
Missing data in single-cell sequencing datasets poses significant challenges for extracting meaningful biological insights. However, existing imputation approaches, which often assume uniformity and data completeness, struggle to address cases with large patches of missing data. In this paper, we present CROT (Cluster-Regularized Optimal Transport), an optimal transport-based imputation algorithm designed to handle patch-based missing data in tabular formats. Our approach effectively captures the underlying data structure in the presence of significant missingness. Notably, it achieves superior imputation accuracy while significantly reducing runtime, demonstrating its scalability and efficiency for large-scale datasets. This work introduces a robust solution for imputation in heterogeneous, high-dimensional datasets with structured data absence, addressing critical challenges in both biological and clinical data analysis. Our code is available on GitHub.
The application of generative models for experimental drug discovery campaigns is severely limited by the difficulty of designing molecules de novo that can be synthesized in practice. Previous work has leveraged Generative Flow Networks (GFlowNets) to impose hard synthesizability constraints through the design of state and action spaces based on predefined reaction templates and building blocks. Despite the promising prospects of this approach, it currently lacks flexibility and scalability. As an alternative, we propose S3-GFN, which generates synthesizable SMILES molecules via simple soft regularization of a sequence-based GFlowNet. Our approach leverages rich molecular priors learned from large-scale SMILES corpora to steer molecular generation towards high-reward, synthesizable chemical spaces. The model induces constraints through off-policy replay training with a contrastive learning signal based on separate buffers of synthesizable and unsynthesizable samples. Our experiments show that S3-GFN learns to generate synthesizable molecules ($\geq 95\%$) with higher rewards in diverse tasks.
Foundation models in genomics have shown mixed success compared to their counterparts in natural language processing. Yet, the reasons for their limited effectiveness remain poorly understood. In this work, we investigate the role of entropy as a fundamental factor limiting the capacity of such models to learn from their training data and develop foundational capabilities. We train ensembles of models on text and DNA sequences and analyze their predictions, static embeddings, and empirical Fisher information flow. We show that the high entropy of genomic sequences, from the point of view of unseen token prediction, leads to near-uniform output distributions, disagreement across models, and unstable static embeddings, even for models that are matched in architecture, training, and data. We then demonstrate that models trained on DNA concentrate Fisher information in embedding layers, seemingly failing to exploit inter-token relationships. Our results suggest that self-supervised training from sequences alone may not be applicable to genomic data, calling into question the assumptions underlying current methodologies for training genomic foundation models.
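To make the notion of a near-uniform output distribution concrete, the sketch below computes the Shannon entropy, in bits, of a predicted next-token distribution. The four-token vocabulary (one token per nucleotide) and the example probabilities are assumptions chosen for illustration; they are not taken from the study, whose models may use a different tokenization.

```python
import math


def entropy_bits(probs):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


if __name__ == "__main__":
    # Hypothetical next-token distributions over the tokens A, C, G, T.
    near_uniform = [0.25, 0.25, 0.25, 0.25]  # no token is favored
    peaked = [0.97, 0.01, 0.01, 0.01]        # one confident prediction
    print(entropy_bits(near_uniform))        # the maximum for four outcomes
    print(entropy_bits(peaked))
```

A uniform distribution over four tokens attains the maximum of 2 bits, so predictions close to that ceiling carry almost no information about the next token.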
The study of representations is widespread across fields, including neuroscience, psychology, and artificial intelligence. While representations are often studied and compared through similarities between stimuli, current methods provide only limited access to the dimensions that shape these representations and are often limited in interpretability. To overcome these challenges, here we introduce Similarity-Based Representation Factorization (SRF), a general computational method for recovering low-dimensional, non-negative, interpretable embeddings from similarity matrices derived from measured data. Across simulations and many neural, behavioral, and computational datasets, SRF recovers interpretable dimensions from diverse forms of representational data, even for very sparsely sampled, incomplete data. The dimensions derived from these datasets match those obtained by task-specific models, predict independent behavioral properties, improve exploratory analysis, and offer higher power for confirmatory hypothesis testing than comparing similarity matrices. Together, these results establish SRF as a general-purpose method with broad applications for uncovering, understanding, and using the dimensions underlying representations.