New articles on Quantitative Biology


[1] 2604.18603

Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

Bidirectional transformers are the foundation of many sequence modeling tasks across natural, biological, and chemical language domains, but they are permutation-invariant without explicit positional embeddings. In contrast, unidirectional attention inherently encodes positional information through its triangular mask, enabling models to operate without positional embeddings altogether. Here, we introduce Dual Triangle Attention, a novel bidirectional attention mechanism that separates the query-key subspace of each attention head into two complementary triangular masks: one that attends to past-and-self positions and one that attends to future-and-self positions. This design provides bidirectional context while maintaining the causal mask's implicit positional inductive bias in both directions. Using PyTorch's flex_attention, Dual Triangle Attention is implemented as a single compiled kernel call with no additional parameters beyond standard multi-head attention. We evaluated Dual Triangle Attention across three settings: (1) a synthetic argmax position probe, (2) masked language modeling (MLM) on natural language, and (3) MLM on protein sequences. In the argmax task, both Dual Triangle Attention and causal attention learn positional information without explicit positional embeddings, whereas standard bidirectional attention cannot. In the MLM experiments, Dual Triangle Attention with Rotary Positional Embeddings (RoPE) achieved the best context extension performance and strong performance across the board. These findings suggest that Dual Triangle Attention is a viable attention mechanism for bidirectional transformers, with or without positional embeddings.
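
As a rough illustration of the masking idea, the sketch below uses PyTorch's flex_attention (2.5+) to build the two triangles; as a simplification it assigns whole attention heads to the past-and-self and future-and-self masks rather than splitting the query-key subspace within each head, so it is not the authors' implementation, and all shapes are illustrative.

import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 2, 8, 128, 64  # batch, heads, sequence length, head dimension (illustrative)

def dual_triangle(b, h, q_idx, kv_idx):
    # even heads attend to past-and-self positions, odd heads to future-and-self positions
    past_and_self = kv_idx <= q_idx
    future_and_self = kv_idx >= q_idx
    return ((h % 2 == 0) & past_and_self) | ((h % 2 == 1) & future_and_self)

block_mask = create_block_mask(dual_triangle, B=None, H=H, Q_LEN=S, KV_LEN=S, device="cpu")

q, k, v = (torch.randn(B, H, S, D) for _ in range(3))
out = flex_attention(q, k, v, block_mask=block_mask)  # (B, H, S, D); wrap in torch.compile for a fused kernel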


[2] 2604.18621

Quantum AI for Cancer Diagnostic Biomarker Discovery

Quantum machine learning (QML) offers a promising new paradigm for computational biology by leveraging quantum mechanical principles to enhance cancer classification, biomarker discovery, and bioinformatics diagnostics. In this study, we apply QML to identify subtype-specific biomarkers for lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), the two predominant forms of non-small cell lung cancer. Our methodology involves a two-phase process: in Phase 1, differential expression analysis and methylation analysis between tumor and normal samples allow us to identify LUAD-specific and LUSC-specific genes, revealing potential prognostic biomarkers for cancer subtypes. Phase 2 focuses on developing a quantum classifier capable of distinguishing between LUAD and LUSC tumors, as well as between tumor and normal samples. This classifier not only enhances diagnostic precision but also demonstrates the quantum advantage in processing large-scale multi-omic datasets. Our results consistently demonstrated that Sample3, representing the combined gene set, achieved the highest overall predictive performance in all metrics. These results demonstrate that QML provides an effective and scalable approach for biomarker discovery and subtype-specific cancer classification. GO enrichment analysis highlighted the significant involvement of genes in synaptic signaling, ion channel regulation, and neuronal development. In the quantum phase, KEGG analysis further identified enrichment in cancer-associated pathways, including neurotrophin, MAPK, Ras, and PI3K-Akt signaling, with key genes such as NGFR, NTRK2, and NTF3 suggesting a central role in neurotrophin-mediated oncogenic processes. Our findings highlight the growing potential of quantum computing to advance precision oncology and next-generation biomedical analytics.


[3] 2604.18622

MDAgent: A Multi-Agent Framework for End-to-End Molecular Dynamics Research

Molecular dynamics (MD) simulation is a powerful tool for studying biomolecular structural changes, molecular recognition, transmembrane transport, and functional mechanisms. However, its practical bottleneck lies not only in software operation or parameter setup, but in translating experimental questions into executable, interpretable, and reviewable computational workflows. Here, we present MDAgent, a multi-agent system for end-to-end molecular dynamics research. The system integrates problem understanding, literature-guided strategy design, simulation execution, trajectory analysis, mechanistic interpretation, and quality supervision into a unified workflow, enabling agents not only to run simulations but also to generate research-oriented computational plans and analytical reports. We further introduce a case-based learning mechanism based on Skill and Memory, which stores reusable knowledge from prior tasks, including parameter choices, operational rules, analytical logic, and problem-solving pathways, thereby supporting cross-task transfer without retraining the underlying model. Across multiple representative molecular simulation tasks, MDAgent achieved stable end-to-end performance with improved strategic adaptability, interpretability, and generalization. In an independent complex task involving conformational transitions of TMEM16F and XKR8, the system successfully completed system design, simulation, and mechanistic analysis for large membrane proteins. These results show that combining multi-agent collaboration with case-based learning can transform MD agents from workflow automation tools into scientific question-oriented computational research systems, providing a scalable framework for AI-driven automated research.


[4] 2604.18634

Topological analysis of hemodynamic response to cardiac resynchronization therapy

Objective: The Mapper algorithm is a qualitative method in topological data analysis that constructs graphs from point clouds by combining dimensionality reduction and clustering techniques. The aim of this study is to apply Mapper, together with novel quantitative indices, to compare the effects of biventricular pacing from the left ventricular epicardium versus the endocardium in a swine model of pacing-induced non-ischemic cardiomyopathy. Methods: The distributions of four hemodynamic variables from a previous study on endocardial and epicardial cardiac resynchronization in an experimental swine model of non-ischemic cardiomyopathy were analyzed using the Mapper algorithm, enhanced with numerical indices quantifying self-connectivity, scattering, and homogeneity of the resulting colored graphs. Results: Statistically significant differences were observed between pacing from basal regions and pacing from mid or apical regions, with the following self-connectivity index values: basal $0.57$; mid $0.14$ ($p < 0.01$); apical $0.24$ ($p < 0.01$). Endocardial stimulation at lateral sites increased the contrast between the distributions of basal versus mid or apical data, when compared with epicardial stimulation. Conclusions: Topological analysis using the Mapper algorithm, enhanced with quantitative statistical measures, revealed new and biologically plausible significant differences in pacing effects across heart regions.
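
The Mapper construction itself is easy to prototype. The sketch below is a from-scratch miniature (synthetic data, a 1-D coordinate lens, overlapping intervals, DBSCAN within each preimage, edges between clusters that share points); the paper's lens, parameters, and its self-connectivity, scattering, and homogeneity indices are defined there and are not reproduced here.

import numpy as np
import networkx as nx
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))             # stand-in for the hemodynamic point cloud

lens = X[:, 0]                            # 1-D filter ("lens"); the paper's choice may differ
n_intervals, overlap = 8, 0.3
lo, hi = lens.min(), lens.max()
width = (hi - lo) / n_intervals

G = nx.Graph()
clusters = []                             # (node_id, set of member point indices)
for i in range(n_intervals):
    a = lo + i * width - overlap * width
    b = lo + (i + 1) * width + overlap * width
    idx = np.where((lens >= a) & (lens <= b))[0]
    if idx.size < 5:
        continue
    labels = DBSCAN(eps=0.4, min_samples=5).fit_predict(X[idx])
    for lab in set(labels) - {-1}:
        node = f"{i}_{lab}"
        G.add_node(node, size=int(np.sum(labels == lab)))
        clusters.append((node, set(idx[labels == lab])))

for n1, m1 in clusters:                   # Mapper edges: clusters that share points
    for n2, m2 in clusters:
        if n1 < n2 and m1 & m2:
            G.add_edge(n1, n2)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges,",
      nx.number_connected_components(G), "components")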


[5] 2604.18637

NeuroAI and Beyond: Bridging Between Advances in Neuroscience and Artificial Intelligence

Neuroscience and Artificial Intelligence (AI) have made impressive progress in recent years but remain only loosely interconnected. Based on a workshop convened by the National Science Foundation in August 2025, we identify three fundamental capability gaps in current AI: the inability to interact with the physical world, inadequate learning that produces brittle systems, and unsustainable energy and data inefficiency. We describe the neuroscience principles that address each: co-design of body and controller, prediction through interaction, multi-scale learning with neuromodulatory control, hierarchical distributed architectures, and sparse event-driven computation. We present a research roadmap organized around these principles at near-, mid-, and long-term horizons. We argue that realizing this program requires a new generation of researchers trained across the boundary between neuroscience and engineering, and describe the institutional conditions needed to support them: interdisciplinary training, hardware access, community standards, and ethics. We conclude that NeuroAI, neuroscience-informed artificial intelligence, has the potential to overcome limitations of current AI while deepening our understanding of biological neural computation.


[6] 2604.18643

Quantum-Like Models of Cognition and Decision Making: Open-Systems and Gorini--Kossakowski--Sudarshan--Lindblad Dynamics

This paper begins by surveying the evolution of quantum-like models of cognition and decision making, transitioning from static kinematic representations to a robust dynamical framework based on open quantum systems. We provide a comprehensive analysis of the Gorini-Kossakowski-Sudarshan-Lindblad (GKSL) master equation's application in cognitive psychology and decision making, illustrating how it models mental state evolution as a dissipative process influenced by an informational environment. We categorize dynamical regimes into Passive and Active Hamiltonians, demonstrating how non-commutation with projections onto the decision basis serves as a mathematical signature of cognitive agency and Quantum Escape from classical equilibria. The utility of this framework is further explored through its ability to stabilize non-Nash outcomes in strategic games, such as the Prisoner's Dilemma. Building upon this dynamical foundation, we identify ``cognitive beats'' as a signature of the internal struggle between competing ``flows of mind'' deliberated at approximately equal frequencies. Distinct from the damped oscillations of simple interference, these beats emerge from a structural tension between Liouvillian channels that generates a secondary, slow-scale modulation of conviction. This beat envelope dictates the timing of peak readiness and hesitation, providing a mathematical map of the transition between conflicting cognitive states. By resolving these nested time scales, we provide a new spectral diagnostic for the depth of cognitive agency and the complexity of the underlying deliberation process. This paper develops a theoretical framework linking GKSL dynamics with quantum-like cognition and decision-making (QCDM), highlighting how dissipative quantum models can capture features of human thought and decision processes.
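
For readers who want to experiment with GKSL dynamics numerically, a minimal open-system sketch in QuTiP is shown below: a two-state "decision" system with a Hamiltonian that does not commute with the decision-basis projectors (the Active regime described above) and a single dissipative channel standing in for the informational environment. The operators and rates are illustrative assumptions, not the paper's models.

import numpy as np
from qutip import basis, sigmax, mesolve

psi0 = basis(2, 0)                                            # initial mental state leaning toward option 0
H = 0.5 * sigmax()                                            # "Active": does not commute with the decision projectors
gamma = 0.2
collapse = np.sqrt(gamma) * basis(2, 1) * basis(2, 0).dag()   # environment-driven flow toward option 1

P1 = basis(2, 1) * basis(2, 1).dag()                          # projector onto decision outcome 1
tlist = np.linspace(0.0, 40.0, 400)
result = mesolve(H, psi0, tlist, c_ops=[collapse], e_ops=[P1])
print(result.expect[0][-1])                                   # long-time probability of choosing option 1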


[7] 2604.18827

OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens

Scaling data and artificial neural networks has transformed AI, driving breakthroughs in language and vision. Whether similar principles apply to modeling brain activity remains unclear. Here we leveraged a dataset of 3.1 million neurons from the visual cortex of 73 mice across 323 sessions, totaling more than 150 billion neural tokens recorded during natural movies, images and parametric stimuli, and behavior. We train multi-modal, multi-task models that support three regimes flexibly at test time: neural prediction, behavioral decoding, neural forecasting, or any combination of the three. OmniMouse achieves state-of-the-art performance, outperforming specialized baselines across nearly all evaluation regimes. We find that performance scales reliably with more data, but gains from increasing model size saturate. This inverts the standard AI scaling story: in language and computer vision, massive datasets make parameter scaling the primary driver of progress, whereas in brain modeling -- even in the mouse visual cortex, a relatively simple system -- models remain data-limited despite vast recordings. The observation of systematic scaling raises the possibility of phase transitions in neural modeling, where larger and richer datasets might unlock qualitatively new capabilities, paralleling the emergent properties seen in large language models. Code available at this https URL.


[8] 2604.18851

Intrinsic stochasticity in cell polarity and contact inhibition of locomotion

When cells collide, they often exhibit "contact inhibition of locomotion" (CIL), a behavior in which cells repolarize and migrate away from the site of contact. Experimental CIL outcomes are highly variable - why? Here, we develop a minimal stochastic model to quantify how intrinsic noise in cell polarity, arising from the finite number of signaling molecules, influences CIL decision-making. We simulate polarization dynamics by tracking individual Rho GTPase proteins that diffuse and switch stochastically between the cell membrane and cytosol. In the absence of cell-cell contact, the polarity axis diffuses rotationally - the cell's orientation wanders - with a diffusion coefficient that decreases as Rho GTPase copy number increases. Assuming that cell-cell contact inhibits Rho GTPase activation, we investigate how contact geometry, duration, and strength affect CIL sensitivity. At low protein copy number, weak, brief, or spatially narrow contacts are masked by molecular noise. In contrast, at high protein copy number, intrinsic polarity noise is negligible, and randomness in CIL response is more likely to reflect the variability from collision to collision in the cell-cell contact properties.
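
A toy calculation (our simplification, not the authors' membrane/cytosol Rho GTPase model) makes the copy-number effect concrete: if the polarity axis is read out as the circular mean of N membrane-bound molecules scattered around a true direction, its angular noise shrinks roughly as 1/sqrt(N).

import numpy as np

rng = np.random.default_rng(1)
true_axis = 0.0      # "true" polarity direction, radians
spread = 1.0         # angular spread of membrane-bound molecules around that direction

for n_molecules in (30, 300, 3000):
    estimates = []
    for _ in range(2000):
        theta = true_axis + rng.normal(0.0, spread, size=n_molecules)
        # polarity axis read out as the circular mean of molecular positions
        estimates.append(np.arctan2(np.sin(theta).mean(), np.cos(theta).mean()))
    print(f"N = {n_molecules:4d}  angular std of polarity axis = {np.degrees(np.std(estimates)):5.2f} deg")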


[9] 2604.18872

Meeting times on graphs in near-cubic time

The expected meeting time of two random walkers on an undirected graph of size $N$, where at each time step one walker moves and the process stops when they collide, satisfies a system of $\binom{N}{2}$ linear equations. Naïvely, solving this system takes $O\left(N^{6}\right)$ operations. However, this system of linear equations has nice structure in that it is almost a Sylvester equation, with the obstruction being a diagonal absorption constraint. We give a simple algorithm for solving this system that exploits this structure, leading to $O\left(N^{4}\right)$ operations and $\Theta\left(N^{2}\right)$ space for exact computation of all $\binom{N}{2}$ meeting times. While this practical method uses only standard dense linear algebra, it can be improved (in theory) to $O\left(N^{3}\log^{2}N\right)$ operations by exploiting the Cauchy structure of the diagonal correction. We generalize this result slightly to cover the Poisson equation for the absorbing "lazy" pair walk with an arbitrary source, which can be solved at the same cost, with $O\left(N^{3}\right)$ per additional source on the same graph. We conclude with applications to evolutionary dynamics, giving improved algorithms for calculating fixation probabilities and mean trait frequencies.
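
For orientation, the sketch below assembles and solves the naive meeting-time system on a small graph, assuming the moving walker is chosen uniformly at each step; this is the dense baseline whose $O(N^6)$ cost the paper's structured solvers improve upon, not the near-cubic algorithm itself.

import itertools
import numpy as np
import networkx as nx

G = nx.cycle_graph(8)                      # any connected undirected graph works here
N = G.number_of_nodes()
pairs = list(itertools.combinations(range(N), 2))
index = {frozenset(p): i for i, p in enumerate(pairs)}

A = np.zeros((len(pairs), len(pairs)))
b = np.ones(len(pairs))
for (u, v) in pairs:
    row = index[frozenset((u, v))]
    A[row, row] = 1.0
    for mover, other in ((u, v), (v, u)):  # with probability 1/2 each walker is the one that moves
        nbrs = list(G.neighbors(mover))
        for w in nbrs:
            if w == other:                 # collision: the pair is absorbed, contributing 0
                continue
            A[row, index[frozenset((w, other))]] -= 0.5 / len(nbrs)

T = np.linalg.solve(A, b)                  # expected meeting times for all unordered pairs
print(T[index[frozenset((0, 4))]])         # e.g. the antipodal pair on the 8-cycle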


[10] 2604.19563

Information-to-energy trade-offs and the optimal alphabet of polymer replication

We analyze information transmission in a recently proposed coarse-grained model of polymer replication by framing it as a communication channel between templates and copies. By calculating the mutual information in the steady-state limit of long chains, we recover the accurate-random phase diagram and establish that the information per monomer depends solely on template specificity within the accurate regime. Crucially, even in the accurate region, small error fractions lead to substantial information loss due to the nonlinear relationship between errors and mutual information. Examining the information-to-energy cost ratio reveals non-monotonic behavior as a function of monomer alphabet size, with an optimum determined primarily by the per-monomer assembly free energy. For DNA's four-base alphabet, we find that the observed effective assembly energy (at least $14\,k_B T$) places the system far from the information-transmission optimum, suggesting that biological replication may prioritize the suppression of spontaneous random assembly over information-to-energy efficiency. We also characterize achievable rate-fidelity trade-offs using Shannon bounds, providing a theoretical framework for evaluating future proofreading mechanisms in ensemble models.
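
The nonlinearity mentioned above can be seen already in a standard q-ary symmetric-channel approximation (a stand-in for the paper's coarse-grained model, not its actual channel): the per-monomer information is $\log_2 q - H(\epsilon) - \epsilon \log_2(q-1)$, so even a 5% error fraction costs a four-letter alphabet far more than 5% of its 2 bits.

import numpy as np

def per_monomer_information(q, eps):
    # I = log2(q) - H(eps) - eps * log2(q - 1), bits per monomer, for a q-ary symmetric channel
    if eps == 0:
        return np.log2(q)
    h = -eps * np.log2(eps) - (1 - eps) * np.log2(1 - eps)
    return np.log2(q) - h - eps * np.log2(q - 1)

for eps in (0.0, 0.01, 0.05, 0.10):
    print(f"q = 4, error fraction {eps:.2f}: {per_monomer_information(4, eps):.3f} bits per monomer")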


[11] 2604.19662

Modelling time-order effects in haptic perception with a Bayesian dynamical framework

Perceptual judgments of sequential stimuli are systematically biased by prior expectations and by the temporal structure of sensory input. In haptic discrimination tasks, these effects often manifest as time-order asymmetries, whereby the perceived difference between two stimuli depends on their presentation order. Here, we introduce a dynamical Bayesian model that accounts for these biases by combining noisy sensory measurements with an evolving internal representation of stimulus intensity. The model formalizes perception as an inference process in which prior expectations are updated by incoming stimuli and propagate in time between observations. We test the model on psychophysical data from vibrotactile discrimination experiments, in which participants compare pairs of sequential stimuli with varying intensities. With a small number of parameters, the model quantitatively reproduces both the direction and magnitude of time-order effects across subjects, as well as the observed inter-individual variability. The inferred parameters provide a compact description of perceptual biases in terms of prior expectations and noise characteristics. Beyond fitting the data, the model induces a transformation of stimulus space, leading to a subject-dependent geometry of perceived stimuli. In this transformed space, perceptual judgments exhibit approximate symmetries that are absent in the physical stimulus coordinates. These results suggest that temporal biases in perception can be understood as a consequence of dynamical inference, and that they impose non-trivial geometric constraints on perceptual representations.
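
A minimal Gaussian sketch of the kind of dynamical inference described above is given below; the prior, noise, and leak values are our illustrative assumptions, not the fitted model. The first stimulus is encoded, relaxes toward the prior during the inter-stimulus interval, and is then compared with the less-regressed estimate of the second stimulus, so the perceived difference depends on presentation order.

def fuse(mu_prior, var_prior, x, var_obs):
    # standard Gaussian posterior after one noisy observation
    k = var_prior / (var_prior + var_obs)
    return mu_prior + k * (x - mu_prior), (1 - k) * var_prior

mu0, var0 = 50.0, 100.0        # prior over stimulus intensity (illustrative units)
var_obs = 25.0                 # sensory noise variance
leak = 0.5                     # relaxation toward the prior during the inter-stimulus interval

def perceived_difference(s_first, s_second):
    m1, _ = fuse(mu0, var0, s_first, var_obs)
    m1 = mu0 + (1 - leak) * (m1 - mu0)          # first estimate drifts back toward the prior
    m2, _ = fuse(mu0, var0, s_second, var_obs)  # second estimate is compared right away
    return m1 - m2

print(perceived_difference(60.0, 55.0))   # physical difference +5
print(perceived_difference(55.0, 60.0))   # physical difference -5; magnitudes differ: a time-order effect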


[12] 2604.19718

Direct RNA sequence design under codon constraints using expressive tensor-based secondary structure models

Nucleic acid sequence design via codon optimization is a fundamental task with applications across synthetic biology, mRNA therapeutics, and vaccine design. Given a target protein, it is a major open challenge to navigate the combinatorially large design space of codon sequences mapping to its amino acid sequence. Computational approaches generally seek to optimize simple objectives based on the codon sequence, possibly together with more complicated contributions based on secondary structure analysis. In this work, we demonstrate a direct and efficient algorithm to sample sequences from a suitable Boltzmann distribution defined in terms of the codon sequence and a fully detailed secondary structure free energy model, as well as related algorithms for exact computation of statistical quantities such as free energies, base pairing probabilities, and base and codon marginals. These algorithms draw upon a recently developed tensor-based formulation of secondary structure thermodynamics and demonstrate, for the first time, that global sequence design can be accomplished with respect to a highly accurate free energy model. Moreover, the algorithms can leverage any available CPU and GPU resources in parallel for massive computational speedups.
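
Note that the paper's algorithms sample directly from the Boltzmann distribution rather than by Markov chain Monte Carlo; the sketch below is only a toy Metropolis sampler over synonymous codons with a made-up GC-content energy, to make the "Boltzmann distribution over codon sequences" idea concrete. The codon table excerpt, peptide, and temperature are illustrative.

import math
import random

CODONS = {  # a few amino acids only, for illustration
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "S": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
}
peptide = "MKLSK"

def gc_fraction(codons):
    seq = "".join(codons)
    return sum(c in "GC" for c in seq) / len(seq)

def energy(codons, target_gc=0.60):
    # toy stand-in for the detailed secondary-structure free energy model
    return (gc_fraction(codons) - target_gc) ** 2

random.seed(0)
beta = 500.0
state = [random.choice(CODONS[aa]) for aa in peptide]
for _ in range(5000):
    i = random.randrange(len(peptide))
    proposal = state.copy()
    proposal[i] = random.choice(CODONS[peptide[i]])
    dE = energy(proposal) - energy(state)
    if dE <= 0 or random.random() < math.exp(-beta * dE):
        state = proposal          # Metropolis acceptance targets the Boltzmann distribution

print("".join(state), f"GC = {gc_fraction(state):.2f}")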


[13] 2604.18599

Simulation Based Inference of a Simple Neural Network Structure

Neurophysiologists are now able to record from a large number of extracellular electrodes and to extract, from the raw data, the sequences of action potentials or spikes generated by many neurons. Unfortunately, these ``many neurons'' still represent only a tiny fraction of the neuronal population that constitutes the network. Using association statistics, such as estimates of the cross-correlation functions, researchers try to infer the structure of the network formed by the recorded neurons, but this inference is compromised by the tremendous under-sampling of the neuronal population. We propose to focus instead on simple spike train statistics, such as the empirical spike frequency or the interspike interval distribution. Their sampling distributions can be estimated by simulations, and, given a few observed spike train statistics, they provide enough information to infer the structure of the underlying network. We show that, on a ``toy model'', our method gives significantly better results than the sub-network reconstruction method with regard to the inference of the connection probability of the original network.
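
The sketch below illustrates the general simulation-based (rejection ABC) recipe with a deliberately crude stand-in simulator and summary statistic; the authors' network model, spike-train statistics, and inference procedure are not reproduced here.

import numpy as np

rng = np.random.default_rng(0)
N_NEURONS, T = 50, 100.0

def simulate_mean_rate(p_connect):
    # toy simulator: each neuron's rate grows with its number of excitatory inputs
    n_inputs = rng.binomial(N_NEURONS - 1, p_connect, size=N_NEURONS)
    rates = 1.0 + 0.2 * n_inputs                       # Hz, illustrative
    spikes = rng.poisson(rates * T)
    return spikes.mean() / T                           # empirical spike frequency

p_true = 0.15
observed = simulate_mean_rate(p_true)

# rejection ABC: keep candidate p values whose simulated statistic is close to the observed one
accepted = []
for _ in range(20000):
    p = rng.uniform(0.0, 0.5)
    if abs(simulate_mean_rate(p) - observed) < 0.05:
        accepted.append(p)

print(f"true p = {p_true}, posterior mean ~ {np.mean(accepted):.3f} from {len(accepted)} accepted draws")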


[14] 2604.18809

Analysis of persistence thresholds for a nonlocal PDE--ODE model of bacterial persister cells

Within many bacterial colonies, persister cells exist as a subpopulation that is tolerant to antibiotics and other stressors, yet not genetically distinct from the rest of the colony. A recent study has proposed epigenetic inheritance as a mechanism that leads to the presence of persister cells. We analyze a nonlocal PDE--ODE model introduced in that study to describe the epigenetic inheritance process and establish its mathematical well-posedness, including existence, uniqueness, and nonnegativity of solutions. We identify a sharp parameter threshold delineating extinction from persistence of the colony: below this threshold the washout equilibrium is globally asymptotically stable, while above it a unique positive equilibrium exists and the population is weakly persistent. Notably, this threshold is independent of the internal community structure.


[15] 2504.20565

DLCM: a versatile multi-level solver for heterogeneous multicellular systems

Computational modeling of multicellular systems may aid in untangling cellular dynamics and emergent properties of biological cell populations. A key challenge is to balance the level of model detail against computational efficiency, while using physically interpretable parameters to facilitate meaningful comparisons with biological data. For this purpose, we present the DLCM-solver (discrete Laplacian cell mechanics), a flexible and efficient computational solver for spatial and stochastic simulations of populations of cells, developed from first principles to support mechanistic investigations. The solver has been designed as a module in URDME, the unstructured reaction-diffusion master equation open software framework, to allow for the integration of intra-cellular models with extra-cellular features handled by the DLCM. The solver manages discrete cells on a fixed lattice and reaction-transport events in a continuous-time Markov chain. Space-continuous micro-environment quantities such as pressure and chemical substances are supported by the framework, permitting a variety of modeling choices concerning chemotaxis, mechanotaxis, nutrient-driven cell growth and death, among others. An essential and novel feature of the DLCM-solver is the coupling of cellular pressure to the curvature of the cell populations by elliptic projection onto the computational grid, with which we can include effects from surface tension between populations. We demonstrate the flexibility of the framework by implementing benchmark problems of cell sorting, cellular signaling, tumor growth, and chemotaxis models. We also formally analyze the computational complexity and show that it is theoretically optimal for systems based on pressure-driven cell migration. In summary, the solver balances efficiency and a relatively fine resolution, while supporting a high level of interpretability.


[16] 2508.21490

Testing quantum-like markers in neural dynamics

We propose two experiments for identifying quantum markers in neural data based on quantum variants of well-known equations for neural activity that describe electrical signal propagation on axonal arbors and dendrites. These include (i) testing whether power spectra from subthreshold oscillations in neuronal cultures follow the classical FitzHugh-Nagumo equations or a recently introduced quantum variant of them and (ii) testing whether propagation statistics of electrical activity in axons follow the classical diffusive cable equation or a quantum variant of it.
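
The classical arm of test (i) is straightforward to simulate. The sketch below integrates the classical FitzHugh-Nagumo equations with weak noise in the subthreshold regime and estimates the power spectrum; all parameters are illustrative, and the quantum variant's predicted spectrum is given in the paper, not here.

import numpy as np

rng = np.random.default_rng(0)

def fhn_subthreshold(T=2000.0, dt=0.01, a=0.7, b=0.8, eps=0.08, I=0.2, sigma=0.02):
    n = int(T / dt)
    v, w = -1.0, -0.4
    vs = np.empty(n)
    for i in range(n):
        dv = v - v**3 / 3 - w + I
        dw = eps * (v + a - b * w)
        v += dt * dv + sigma * np.sqrt(dt) * rng.normal()   # weak noise drives subthreshold fluctuations
        w += dt * dw
        vs[i] = v
    return vs

v = fhn_subthreshold()
segs = v.reshape(10, -1)                                    # crude Welch-style averaging over 10 segments
spec = np.mean(np.abs(np.fft.rfft(segs - segs.mean(axis=1, keepdims=True), axis=1)) ** 2, axis=0)
freqs = np.fft.rfftfreq(segs.shape[1], d=0.01)
print("spectral peak near", freqs[np.argmax(spec[1:]) + 1], "cycles per time unit")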


[17] 2510.12751

Non-linear associations of amyloid-$\beta$ with resting-state functional networks and their cognitive relevance in a large community-based cohort of cognitively normal older adults

Background: Non-linear alterations in brain network connectivity may represent early neural signatures of Alzheimer's disease (AD) pathology in cognitively normal older adults. Understanding these changes and their cognitive relevance may help clarify early network vulnerability associated with AD pathology. Most prior studies recruited participants from memory clinics, often with subjective memory concerns, limiting generalizability. Methods: We examined 14 large-scale functional brain networks in 968 cognitively normal older adults recruited from the community using resting-state functional MRI, cerebrospinal fluid (CSF) biomarkers (amyloid-$\beta$ 1-42 [A$\beta$], total tau, phosphorylated tau 181), and neuropsychological assessments. Functional networks were identified using group independent component analysis. Results: Inverted U-shaped associations between CSF A$\beta$ and functional connectivity were observed in the precuneus network and ventral default mode network (DMN), but not in the dorsal DMN, indicating network-specific vulnerability to early amyloid pathology. Higher connectivity in A$\beta$-related networks, including dorsal and ventral DMN, precuneus, and posterior salience networks, was associated with better visual memory, visuospatial, and executive performance. No significant relationships were observed between CSF tau and functional connectivity. Conclusions: Using a large, community-based cohort, we demonstrate that non-linear alterations in functional connectivity occur in specific networks even during the asymptomatic phase of AD. Moreover, A$\beta$-related network connectivity is cognitively relevant, highlighting early network vulnerability and its functional consequences in amyloid pathology.
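
A generic way to test an inverted-U association of this kind is an ordinary regression with linear and quadratic terms; the sketch below does this on synthetic data with statsmodels and is not the study's pipeline (which additionally handles covariates and multiple networks).

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 968
abeta = rng.normal(0.0, 1.0, size=n)                  # standardized CSF A-beta (illustrative)
# synthetic inverted-U relation plus noise, standing in for network connectivity
fc = 0.3 * abeta - 0.2 * abeta**2 + rng.normal(0.0, 1.0, size=n)

X = sm.add_constant(np.column_stack([abeta, abeta**2]))
fit = sm.OLS(fc, X).fit()
print(fit.params)                                      # [intercept, linear, quadratic]
print("quadratic-term p-value:", fit.pvalues[2])       # significantly negative -> inverted U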


[18] 2511.03769

Current validation practice undermines surgical AI development

Surgical data science (SDS) is rapidly advancing, yet clinical adoption of artificial intelligence (AI) in surgery remains limited, with inadequate validation emerging as an important contributing factor. In fact, existing validation practices often neglect the temporal and hierarchical structure of intraoperative videos, producing misleading, unstable, or clinically irrelevant results. In a pioneering, consensus-driven effort, we introduce a comprehensive catalog of validation pitfalls in AI-based surgical video analysis that was derived from a multi-stage Delphi process with 92 international experts. The collected pitfalls span three categories: (1) data (e.g., incomplete annotation, spurious correlations), (2) metric selection and configuration (e.g., neglect of temporal stability, mismatch with clinical needs), and (3) aggregation and reporting (e.g., clinically uninformative aggregation, failure to account for frame dependencies in hierarchical data structures). A systematic review of surgical AI papers reveals that these pitfalls are widespread in current practice, with the majority of studies failing to account for temporal dynamics or hierarchical data structure, or relying on clinically uninformative metrics. Experiments on real surgical video datasets provide empirical evidence that ignoring temporal and hierarchical data structures can substantially understate uncertainty, obscure critical failure modes, and even alter algorithm rankings. To address these shortcomings, we provide a catalog of best practices compiled in a multi-stage Delphi process. Together, this work provides an evidence-based framework to inform more rigorous validation of surgical video analysis algorithms and to guide future efforts in benchmarking, reporting, regulatory review, and clinical translation.
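
One of the aggregation pitfalls is easy to demonstrate numerically: treating correlated frames as independent shrinks confidence intervals. The sketch below (our own synthetic example, not from the paper) compares a naive frame-level bootstrap with a video-level cluster bootstrap of per-frame accuracy.

import numpy as np

rng = np.random.default_rng(0)
n_videos, frames_per_video = 20, 200
# per-video accuracy varies (some operations are harder); frames within a video are correlated
video_acc = rng.beta(8, 2, size=n_videos)
frames = [rng.random(frames_per_video) < p for p in video_acc]
flat = np.concatenate(frames)

def ci(samples):
    return np.percentile(samples, [2.5, 97.5])

naive = [np.mean(rng.choice(flat, size=flat.size, replace=True)) for _ in range(2000)]
cluster = [np.mean(np.concatenate([frames[i] for i in rng.integers(0, n_videos, n_videos)]))
           for _ in range(2000)]
print("frame-level bootstrap 95% CI:", ci(naive))
print("video-level bootstrap 95% CI:", ci(cluster))   # noticeably wider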


[19] 2511.10708

MOSAIC: Codon Harmonization of Monte Carlo-Based Simulated Annealing for Linked Codons in Heterologous Protein Expression

Codon usage bias has a crucial impact on the translation efficiency and co-translational folding of proteins, necessitating the algorithmic development of codon optimization/harmonization methods, particularly for heterologous recombinant protein expression. Codon harmonization is especially valuable for proteins sensitive to translation rates, because it can potentially replicate native translation speeds, preserving proper folding and maintaining protein activity. This work proposes a Monte Carlo-based codon harmonization algorithm, MOSAIC (Monte Carlo-based Simulated Annealing for Linked Codons), for the harmonization of a set of linked codons, which differs from conventional codon harmonization by focusing on codon sets rather than individual codons. MOSAIC demonstrates robust computational performance on ribosomal proteins (S18, S15, S10, and L11) as model systems. Among them, the harmonized gene of RP S18 was expressed and compared with expression of the wild-type gene. The harmonized gene clearly yielded a larger quantity of the protein, a substantial fraction of which was soluble. These results underscore the potential of the linked codon harmonization approach to enhance the expression and functionality of sensitive proteins, setting the stage for more efficient production of recombinant proteins in various biotechnological and pharmaceutical applications.


[20] 2602.16129

Oscillation Criteria in Large-Scale Gene Regulatory Networks with Intrinsic Fluctuations

Gene Regulatory Networks (GRNs) with feedback are essential components of many cellular processes and may exhibit oscillatory behavior. Analyzing such systems becomes increasingly complex as the number of components increases. Since gene regulation often involves a small number of molecules, fluctuations are inevitable. It is therefore important to understand how fluctuations affect the oscillatory dynamics of cellular processes: this allows us to identify the mechanisms that let cellular functions persist in the presence of fluctuations or, failing that, to determine the level of fluctuations that various cellular functions can tolerate. In this study, we investigated the conditions under which GRNs with feedback and intrinsic fluctuations exhibit oscillatory behavior. Our focus was on developing a procedure that remains manageable and practical even for extensive regulatory networks, that is, those comprising numerous nodes. Using the second-moment approach, we described the stochastic dynamics through a set of ordinary differential equations for the mean concentration and its second central moment. The system can attain either a stable equilibrium or oscillatory behavior, depending on its scale and, consequently, the intensity of fluctuations. To illustrate the procedure, we analyzed two relevant systems: a repressilator with three nodes and a system with five nodes, both incorporating intrinsic fluctuations. In both cases, we observed that for very small systems, which therefore exhibit significant fluctuations, oscillatory behavior is inhibited. The procedure presented here for analyzing the stability of oscillations under fluctuations enables the determination of the critical minimum size of a GRN at which intrinsic fluctuations do not eliminate its cyclical behavior.
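
The qualitative conclusion can be reproduced with a much cruder tool than the paper's second-moment procedure: a chemical-Langevin repressilator in which intrinsic noise scales as 1/sqrt(Omega). In the sketch below (all parameters illustrative, not taken from the paper), a small system size degrades the coherence of the oscillations that a large system sustains.

import numpy as np

rng = np.random.default_rng(0)

def repressilator_cle(omega, t_max=150.0, dt=0.01, alpha=100.0, K=10.0, hill=3.0, delta=1.0):
    # chemical-Langevin repressilator in concentrations; intrinsic noise scales as 1/sqrt(omega)
    n = int(t_max / dt)
    x = np.array([20.0, 10.0, 5.0])
    out = np.empty((n, 3))
    for i in range(n):
        prod = alpha / (1.0 + (x[[2, 0, 1]] / K) ** hill)   # gene i repressed by protein i-1
        deg = delta * x
        noise = np.sqrt(np.maximum(prod + deg, 0.0) / omega)
        x = np.maximum(x + (prod - deg) * dt + noise * np.sqrt(dt) * rng.normal(size=3), 0.0)
        out[i] = x
    return out

for omega in (2.0, 200.0):                 # small vs large system size
    x0 = repressilator_cle(omega)[5000:, 0]
    x0 = x0 - x0.mean()
    ac = np.correlate(x0, x0, mode="full")[x0.size - 1:] / (x0.var() * x0.size)
    print(f"Omega = {omega:5.0f}: oscillation coherence (autocorrelation side peak) = {ac[150:].max():.2f}")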


[21] 2604.04981

An Imbalanced Dataset with Multiple Feature Representations for Studying Quality Control of Next-Generation Sequencing

Next-generation sequencing (NGS) is a key technique for studying the DNA and RNA of organisms. However, identifying quality problems in NGS data across different experimental settings remains challenging. To develop automated quality-control tools, researchers require datasets with features that capture the characteristics of quality problems. Existing NGS repositories, however, offer only a limited number of quality-related features. To address this gap, we propose a dataset derived from 37,491 NGS samples with two types of quality-related feature representations. The first type consists of 34 features derived from quality control tools (QC-34 features). The second type has a variable number of features ranging from eight to 1,183. These features were derived from read counts in problematic genomic regions identified by the ENCODE blocklist (BL features). All features describe the same human and mouse samples from five genomic assays, allowing direct comparison of feature representations. The proposed dataset includes a binary quality label, derived from automated quality control and domain experts. Among all samples, $3.2\%$ are of low quality. Supervised machine learning algorithms accurately predicted quality labels from the features, confirming the relevance of the provided feature representations. The proposed feature representations enable researchers to study how different feature types (QC-34 vs. BL features) and granularities (varying number of BL features) affect the detection of quality problems.
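
With only about 3.2% positives, such a dataset calls for imbalance-aware training and ranking metrics rather than plain accuracy. The sketch below shows the pattern on synthetic stand-in data (it uses none of the actual features or labels): a class-weighted random forest evaluated with the area under the precision-recall curve.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 37491, 34                                  # roughly the dataset's scale; QC-34-style features
X = rng.normal(size=(n, d))
y = (rng.random(n) < 0.032).astype(int)           # ~3.2% low-quality labels; synthetic stand-in
X[y == 1] += 0.8                                  # give the minority class some separable signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]
# with 3.2% positives, accuracy is misleading; report area under the precision-recall curve
print("AUPRC:", average_precision_score(y_te, scores))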


[22] 2110.00601

Album: executable building blocks for scientific imaging routines, from sharing to LLM-assisted orchestration

Open-source scientific software is a major driver of scientific progress, yet its development and reuse remain difficult in collaborative settings. Researchers face four recurring challenges: discovering and reproducing existing routines, adapting them for new use cases, sharing and scaling them across collaborators, and stabilizing them with reproducible execution environments. We present Album, an open-source framework for packaging and sharing scientific routines as executable artifacts through two minimal primitives: (i) the solution, a Python-native executable entry point that combines machine-readable metadata, arguments, environment specifications, and lifecycle hooks; and (ii) the catalog, a decentralized, git-native distribution mechanism with indexed search and optional web rendering for discovery, provenance, and governance. Album uses a two-context execution model in which a host controller evaluates manifests and prepares per-solution environments, while lifecycle hooks execute inside isolated solution environments. This design supports reproducible execution, post-environment setup, and the composition of routines with incompatible dependencies. Album can be used in conjunction with LLM agents: solutions can be drafted and revised with LLM assistance, and an MCP interface exposes cataloged solutions as callable tools for tool-grounded discovery and orchestration. We evaluate Album through four real-world imaging deployments spanning interactive visualization of electron microscopy data, integration of multiple segmentation methods, the orchestration of cryo-electron tomography competition workflows, and mineral quantification pipelines. Overall, Album complements package managers, workflow systems, and container runtimes by making scientific routines executable, shareable artifacts. Documentation and examples are available at this https URL.


[23] 2208.11805

On the Diffusion Time Evolution of Folding Chains in the Heteropolymer Model

In this paper, we mathematically describe the time evolution of protein folding features via Iori et al.'s heteropolymer model. More specifically, we find that the folding amino acid chain evolves according to a power law $D \sim t^{\nu}$. The exponent $\nu$ decreases from $0.\overline{66}$ to $0.5$ as the randomness of the coupling constants in the Lennard-Jones potential increases.
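
Extracting the exponent nu from simulation output is a one-line log-log fit; the sketch below does this on synthetic D ~ t^nu data and is only meant to show the procedure, not to reproduce the paper's values.

import numpy as np

rng = np.random.default_rng(0)
t = np.logspace(0, 3, 40)                                     # simulation times (arbitrary units)
D = 2.0 * t ** 0.66 * np.exp(rng.normal(0, 0.05, t.size))     # synthetic D ~ t^nu data with noise

nu, log_prefactor = np.polyfit(np.log(t), np.log(D), 1)       # slope of log D vs log t
print(f"estimated exponent nu = {nu:.3f}")                    # ~0.66 here; ~0.5 for strong coupling randomness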


[24] 2601.20981

Diversifying Toxicity Search in Large Language Models Through Speciation

Evolutionary prompt search is a practical black-box approach for red teaming large language models; however, existing methods often collapse onto a small family of high-performing prompts, limiting coverage of distinct failure modes. We present a speciated quality-diversity extension of \textit{ToxSearch} that maintains multiple high-toxicity prompt niches in parallel rather than optimizing a single best prompt. \textit{ToxSearch-S} introduces unsupervised prompt speciation via a search methodology that maintains capacity-limited species with exemplar leaders, a reserve pool for emerging niches, and species-aware parent selection that trades off within-niche exploitation and cross-niche exploration. Preliminary results show \textit{ToxSearch-S} reaching higher peak toxicity ($\approx 0.73$ vs.\ $\approx 0.47$) with a heavier tail (top-10 median $0.66$ vs.\ $0.45$) than the baseline. Speciation also yields broader semantic coverage under a topics-as-species analysis (higher effective topic diversity and larger unique topic coverage). Finally, the species that form are well separated in embedding space (mean separation ratio $\approx 1.93$) and exhibit distinct toxicity distributions, indicating that speciation partitions the adversarial space into behaviorally differentiated niches rather than superficial lexical variants.


[25] 2604.03476

Fine-tuning DeepSeek-OCR-2 for Molecular Structure Recognition

Optical Chemical Structure Recognition (OCSR) is critical for converting 2D molecular diagrams from printed literature into machine-readable formats. While Vision-Language Models have shown promise in end-to-end OCR tasks, their direct application to OCSR remains challenging, and full-parameter supervised fine-tuning often fails. In this work, we adapt DeepSeek-OCR-2 for molecular optical recognition by formulating the task as image-conditioned SMILES generation. To overcome training instabilities, we propose a two-stage progressive supervised fine-tuning strategy: starting with parameter-efficient LoRA and transitioning to selective full-parameter fine-tuning with split learning rates. We train our model on a large-scale corpus combining synthetic renderings from PubChem and realistic patent images from USPTO-MOL to improve coverage and robustness. Our fine-tuned model, MolSeek-OCR, demonstrates competitive capabilities, achieving exact matching accuracies comparable to the best-performing image-to-sequence model. However, it remains inferior to state-of-the-art image-to-graph models. Furthermore, we explore reinforcement-style post-training and data-curation-based refinement, finding that they fail to improve the strict sequence-level fidelity required for exact SMILES matching.