Predicting molecular impact on cellular function is a core challenge in therapeutic design. Phenomic experiments, designed to capture cellular morphology, utilize microscopy-based techniques and offer a high-throughput means of uncovering molecular impact on the cell. In this work, we learn a joint latent space between molecular structures and microscopy phenomic experiments, aligning paired samples with contrastive learning. Specifically, we study the problem of Contrastive PhenoMolecular Retrieval, which consists of zero-shot molecular structure identification conditioned on phenomic experiments. We assess challenges in multi-modal learning of phenomics and molecular modalities, such as experimental batch effects, inactive molecule perturbations, and encoding perturbation concentration. We demonstrate improved multi-modal retrieval through (1) a uni-modal pre-trained phenomics model, (2) a novel inter-sample similarity-aware loss, and (3) models conditioned on a representation of molecular concentration. Following this recipe, we propose MolPhenix, a molecular phenomics model. MolPhenix leverages a pre-trained phenomics model to demonstrate significant performance gains across perturbation concentrations, molecular scaffolds, and activity thresholds. In particular, we demonstrate an 8.1x improvement in zero-shot molecular retrieval of active molecules over the previous state-of-the-art, reaching 77.33% top-1% accuracy. These results open the door for machine learning to be applied in virtual phenomics screening, which can significantly benefit drug discovery applications.
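As a rough illustration of the contrastive alignment step described above (not MolPhenix's actual implementation), the sketch below computes a symmetric InfoNCE-style loss between paired phenomics and molecular embeddings; the names `phenomics_emb`, `molecule_emb`, and `temperature` are placeholders.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(phenomics_emb, molecule_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over paired phenomics/molecule embeddings.

    A minimal CLIP-style sketch; the paper's loss additionally accounts for
    inter-sample similarity and molecular concentration.
    """
    # L2-normalize both modalities so the dot product is a cosine similarity.
    p = F.normalize(phenomics_emb, dim=-1)
    m = F.normalize(molecule_emb, dim=-1)
    logits = p @ m.t() / temperature                      # (batch, batch) similarities
    targets = torch.arange(p.size(0), device=p.device)    # diagonal = paired samples
    loss_p2m = F.cross_entropy(logits, targets)           # retrieve molecule given phenotype
    loss_m2p = F.cross_entropy(logits.t(), targets)       # retrieve phenotype given molecule
    return 0.5 * (loss_p2m + loss_m2p)

# Example with random embeddings standing in for encoder outputs.
loss = contrastive_alignment_loss(torch.randn(32, 256), torch.randn(32, 256))
```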
Motor changes are early signs of neurodegenerative diseases (NDs) such as Parkinson's disease (PD) and Alzheimer's disease (AD), but are often difficult to detect, especially in the early stages. In this work, we examine the behavior of a wide array of explainable metrics extracted from the handwriting signals of 113 subjects performing multiple tasks on a digital tablet. The aim is to measure their effectiveness in characterizing and assessing multiple NDs, including AD and PD. To this end, task-agnostic and task-specific metrics are extracted from 14 distinct tasks. Subsequently, through statistical analysis and a series of classification experiments, we investigate which metrics provide greater discriminative power between NDs and healthy controls and among different NDs. Preliminary results indicate that the various tasks at hand can all be effectively leveraged to distinguish between the considered set of NDs, specifically by measuring, from our handcrafted explainable metrics, the stability, the writing speed, the time spent not writing, and the pressure variations between groups, which show p-values lower than 0.0001 for multiple tasks. Using various classification algorithms on the computed metrics, we obtain up to 87% accuracy in discriminating AD from healthy controls (CTL), and up to 69% for PD vs. CTL.
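A minimal sketch of the kind of analysis pipeline described above: a nonparametric group-difference test on one handcrafted metric followed by cross-validated classification. The specific test, classifier, metric names, and data here are placeholders, not the study's actual choices.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical handcrafted metrics (rows = subjects, columns = metrics such as
# writing speed, in-air time, pressure variation) and binary labels (AD vs CTL).
X = rng.normal(size=(113, 4))
y = rng.integers(0, 2, size=113)

# Group-difference test for a single metric (e.g. time spent not writing).
stat, p_value = mannwhitneyu(X[y == 0, 0], X[y == 1, 0])
print(f"p-value for metric 0: {p_value:.4f}")

# Simple cross-validated classification on the full metric set.
acc = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                      X, y, cv=5, scoring="accuracy").mean()
print(f"mean CV accuracy: {acc:.2f}")
```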
Recent studies revealed structural and functional brain changes in heavy smokers. However, the specific changes in topological brain connections are not well understood. We used Gaussian Undirected Graphs with the graphical lasso algorithm on rs-fMRI data from smokers and non-smokers to identify significant changes in brain connections. Our results indicate high stability in the estimated graphs and identify several brain regions significantly affected by smoking, providing valuable insights for future clinical research.
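For readers unfamiliar with the graphical lasso, the sketch below estimates a sparse Gaussian undirected graph from one group's region-wise time series; the data shapes and threshold are illustrative assumptions, not the study's settings.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(0)

# Hypothetical rs-fMRI time series for one group: (time points, brain regions).
time_series = rng.normal(size=(200, 30))

# Estimate a sparse precision (inverse covariance) matrix; zero entries
# correspond to conditionally independent region pairs in the Gaussian graph.
model = GraphicalLassoCV().fit(time_series)
precision = model.precision_

# Edges of the estimated graph (non-zero off-diagonal partial correlations).
edges = [(i, j) for i in range(precision.shape[0])
         for j in range(i + 1, precision.shape[1])
         if abs(precision[i, j]) > 1e-6]
print(f"{len(edges)} edges retained")
```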
Summary: With the rapid development of long-read sequencing technologies, the era of individual complete genomes is approaching. We have developed wgatools, a cross-platform, ultrafast toolkit that supports a range of whole genome alignment (WGA) formats, offering practical tools for conversion, processing, statistical evaluation, and visualization of alignments, thereby facilitating population-level genome analysis and advancing functional and evolutionary genomics. Availability and Implementation: wgatools supports diverse formats and can process, filter, and statistically evaluate alignments, perform alignment-based variant calling, and visualize alignments both locally and genome-wide. Built with Rust for efficiency and safe memory usage, it ensures fast performance and can handle large datasets consisting of hundreds of genomes. wgatools is published as free software under the MIT open-source license, and its source code is freely available at https://github.com/wjwei-handsome/wgatools. Contact: weiwenjie@westlake.edu.cn (W.W.) or liuhaijun@yzwlab.cn (H.-J.L.).
Global estuaries and coastal regions, acting as critical interfaces for mitigating nitrogen flux to marine waters, concurrently contend with contamination from tire wear particles (TWPs). However, the effects of pristine and photoaged TWPs (P-TWP and A-TWP) and their leachates (P-TWPL and A-TWPL) on key nitrogen removal processes in estuarine sediments remain unclear. This study explored the responses of the denitrification rate, the anammox rate, and nitrous oxide (N2O) accumulation to P-TWP, A-TWP, P-TWPL, and A-TWPL exposures in estuarine sediments, and assessed the potential biotoxic substances in TWPL. Results indicate that P-TWP inhibited the denitrification rate and increased N2O accumulation without significantly impacting the anammox rate. A-TWP intensified the inhibition of the denitrification rate by further reducing narG gene abundance and NAR activity, and also decreased hzo gene abundance, HZO activity, and Candidatus Kuenenia abundance, thereby slowing the anammox rate. N2O accumulation was lower after A-TWP exposure than after P-TWP exposure, with the NIR/NOS and NOR/NOS activity ratios closely associated with N2O accumulation. Batch experiments indicated that photoaging promoted Zn release from TWPL, contributing significantly to the inhibited denitrification rate and increased N2O accumulation caused by TWPs. In addition, TWPs drive changes in microbial community structure through released additives, with the abundance of denitrifying bacteria (DNB) and anaerobic ammonium-oxidizing bacteria (AnAOB) closely linked to the Zn, Mn, and As concentrations in TWPL. This study offers insights into assessing the environmental risks of TWPs in estuarine ecosystems.
We have proved in both human-based and computer-based tests that natural concepts generally `entangle' when they combine to form complex sentences, violating the rules of classical compositional semantics. In this article, we present the results of an innovative video-based cognitive test on a specific conceptual combination, which significantly violates the Clauser--Horne--Shimony--Holt version of Bell's inequalities (`CHSH inequality'). We also show that collected data can be faithfully modelled within a quantum-theoretic framework elaborated by ourselves and a `strong form of entanglement' occurs between the component concepts. While the video-based test confirms previous empirical results on entanglement in human cognition, our ground-breaking empirical approach surpasses language barriers and eliminates the need for prior knowledge, enabling universal accessibility. Finally, this transformative methodology allows one to unravel the underlying connections that drive our perception of reality. As a matter of fact, we provide a novel explanation for the appearance of entanglement in both physics and cognitive realms.
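For reference, the CHSH quantity tested in such experiments combines the expectation values of four pairs of measurement settings; classical (compositional) models bound it by 2, whereas entangled states can reach $2\sqrt{2}$:
\[
S \;=\; E(A,B) - E(A,B') + E(A',B) + E(A',B'),
\qquad |S| \le 2 \ \text{(classical)}, \qquad |S| \le 2\sqrt{2} \ \text{(quantum)}.
\]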
Prostate cancer (PCa) was the most frequently diagnosed cancer among American men in 2023. The histological grading of biopsies is essential for diagnosis, and various deep learning-based solutions have been developed to assist with this task. Existing deep learning frameworks are typically applied to individual 2D cross-sections sliced from 3D biopsy tissue specimens. This process impedes the analysis of complex tissue structures such as glands, which can vary depending on the tissue slice examined. We propose a novel digital pathology data source called a "volumetric core," obtained via the extraction and co-alignment of serially sectioned tissue sections using a novel morphology-preserving alignment framework. We trained an attention-based multiple-instance learning (ABMIL) framework on deep features extracted from volumetric patches to automatically classify the Gleason Grade Group (GGG). To handle volumetric patches, we used a modified video transformer with a deep feature extractor pretrained using self-supervised learning. We ran our morphology-preserving alignment framework to construct 10,210 volumetric cores, reserving 30% for pretraining. The rest of the dataset was used to train ABMIL, which resulted in a 0.958 macro-average AUC, 0.671 F1 score, 0.661 precision, and 0.695 recall averaged across all five GGGs, significantly outperforming the 2D baselines.
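To make the ABMIL component concrete, the sketch below implements standard attention-based MIL pooling over a bag of patch features; the feature dimension, hidden size, and class count are illustrative assumptions, and the paper's pipeline feeds in features from a pretrained video transformer rather than raw patches.

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Attention-based MIL pooling over patch features (a minimal sketch)."""
    def __init__(self, feat_dim=512, hidden_dim=128, n_classes=5):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1)
        )
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, patch_features):            # (n_patches, feat_dim)
        scores = self.attention(patch_features)   # (n_patches, 1)
        weights = torch.softmax(scores, dim=0)    # attention over patches
        bag = (weights * patch_features).sum(0)   # weighted bag embedding
        return self.classifier(bag), weights

# One "core" represented as a bag of 100 volumetric-patch features.
logits, attn = AttentionMILPooling()(torch.randn(100, 512))
```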
In recent years, deep neural networks (DNNs) have demonstrated remarkable performance in pathology applications, potentially even outperforming expert pathologists due to their ability to learn subtle features from large datasets. One complication in preparing digital pathology datasets for DNN tasks is variation in tinctorial qualities. A common way to address this is to perform stain normalization on the images. In this study, we show that a well-trained DNN model trained on one batch of histological slides failed to generalize to another batch prepared at a different time from the same tissue blocks, even when stain normalization methods were applied. This study used sample data from a previously reported DNN that was able to identify, with high accuracy, patients with early-stage non-small cell lung cancer (NSCLC) whose tumors did and did not metastasize, based on training and then testing of digital images from H&E-stained primary tumor tissue sections processed at the same time. In this study, we obtained a new series of histologic slides from adjacent recuts of the same tissue blocks processed in the same lab but at a different time. We found that the DNN trained on either batch of slides/images was unable to generalize and failed to predict progression in the other batch of slides/images (AUC_cross-batch = 0.52 - 0.53 compared to AUC_same-batch = 0.74 - 0.81). The failure to generalize did not improve even when tinctorial differences were corrected through either traditional color tuning or stain normalization with the help of a cycle generative adversarial network (CycleGAN). This highlights the need to develop an entirely new way to process and collect consistent microscopy images from histologic slides that can be used both to train predictive DNN algorithms and to allow for their general application.
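For context, one widely used classical color normalization approach (Reinhard-style statistics matching in LAB space) is sketched below; this is illustrative of the kind of correction discussed above, not the study's exact procedure, and as the results show such corrections did not restore cross-batch generalization.

```python
import numpy as np
from skimage import color

def reinhard_normalize(image_rgb, target_rgb):
    """Match per-channel LAB mean/std of an image to a target image.

    A minimal sketch of Reinhard-style color normalization for H&E tiles;
    inputs are float RGB arrays in [0, 1].
    """
    src = color.rgb2lab(image_rgb)
    tgt = color.rgb2lab(target_rgb)
    out = np.empty_like(src)
    for c in range(3):
        mu_s, sd_s = src[..., c].mean(), src[..., c].std()
        mu_t, sd_t = tgt[..., c].mean(), tgt[..., c].std()
        out[..., c] = (src[..., c] - mu_s) / (sd_s + 1e-8) * sd_t + mu_t
    return np.clip(color.lab2rgb(out), 0, 1)

# Example with random images standing in for source and reference tiles.
normalized = reinhard_normalize(np.random.rand(64, 64, 3), np.random.rand(64, 64, 3))
```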
The diffusional dynamics and vibrational spectroscopy of molecular hydrogen (H$_2$) in myoglobin (Mb) are characterized. Hydrogen has been implicated in a number of physiologically relevant processes, including cellular aging and inflammation. Here, the internal diffusion through the protein matrix was characterized and the vibrational spectroscopy was investigated using conventional empirical energy functions and improved models able to describe higher-order electrostatic moments of the ligand. H$_2$ can occupy the same internal defects as already found for Xe or CO (Xe1 to Xe4 and the B-state). Furthermore, four additional sites were found, some of which had been discovered in earlier simulation studies. The vibrational spectra obtained with the most refined energy function indicate that the spectroscopy of H$_2$ differs depending on the docking site. The maxima of the absorption spectra cover $\sim 20$ cm$^{-1}$, which is indicative of a pronounced effect of the surrounding protein matrix on the vibrational spectroscopy of the ligand. Electronic structure calculations show that H$_2$ forms a stable complex with the heme-iron (stabilized by $\sim -12$ kcal/mol), but splitting of H$_2$ is unlikely due to a high activation energy ($\sim 50$ kcal/mol).
Establishing reasonable standards for edible chrysanthemum seedlings helps promote seedling development, thereby improving plant quality. However, current grading methods have several issues. Supporting only a few indicators causes information loss, and the indicators selected to evaluate seedling level have narrow applicability. Meanwhile, some methods misuse mathematical formulas. Therefore, we propose a simple, efficient, and generic framework, SQCSEF, for establishing seedling quality classification standards with flexible clustering modules, applicable to most plant species. In this study, we introduce the state-of-the-art deep clustering algorithm CVCL, using factor analysis to divide indicators into several perspectives as inputs for the CVCL method, resulting in more reasonable clusters and ultimately a grading standard $S_{cvcl}$ for edible chrysanthemum seedlings. Through extensive experiments, we validate the correctness and efficiency of the proposed SQCSEF framework.
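A simplified sketch of the grading pipeline's structure is shown below: factor analysis groups raw indicators into latent perspectives, which are then clustered into grades. KMeans stands in here for the deep multi-view clustering step (CVCL), and the indicator names, counts, and data are hypothetical.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical seedling indicators (rows = seedlings, columns = measurements
# such as height, stem diameter, leaf count, root length, ...).
indicators = rng.normal(size=(500, 8))

# Factor analysis condenses indicators into a few latent "perspectives".
factors = FactorAnalysis(n_components=3, random_state=0).fit_transform(indicators)

# Cluster in factor space to obtain quality grades (KMeans used as a simple
# stand-in for the deep clustering algorithm CVCL).
grades = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(factors)
print(np.bincount(grades))
```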
Organisms have to keep track of the information in the environment that is relevant for adaptive behaviour. Transmitting information in an economical and efficient way becomes crucial for resource-limited agents living in high-dimensional environments. The efficient coding hypothesis claims that organisms seek to maximize the information about the sensory input in an efficient manner. Under Bayesian inference, this means that the role of the brain is to efficiently allocate resources in order to make predictions about the hidden states that cause sensory data. However, neither of those frameworks accounts for how that information is exploited downstream, leaving aside the action-oriented role of the perceptual system. Rate-distortion theory, which defines optimal lossy compression under constraints, has gained attention as a formal framework to explore goal-oriented efficient coding. In this work, we explore action-centric representations in the context of rate-distortion theory. We also provide a mathematical definition of abstractions and we argue that, as summaries of the relevant details, they can be used to fix the content of action-centric representations. We model action-centric representations using variational autoencoders (VAEs) and we find that such representations i) are efficient lossy compressions of the data; ii) capture the task-dependent invariances necessary to achieve successful behaviour; and iii) are not in service of reconstructing the data. Thus, we conclude that full reconstruction of the data is rarely needed to achieve optimal behaviour, consistent with a teleological approach to perception.
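The sketch below illustrates the general idea of a VAE whose distortion term is task-oriented rather than reconstructive, in the rate-distortion spirit described above; the architecture, dimensions, and loss weighting are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionCentricVAE(nn.Module):
    """VAE whose decoder predicts task-relevant variables instead of the input."""
    def __init__(self, x_dim=784, z_dim=8, task_dim=10):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU())
        self.mu, self.logvar = nn.Linear(128, z_dim), nn.Linear(128, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, task_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def loss_fn(task_logits, task_labels, mu, logvar, beta=1.0):
    # Distortion: task prediction error; rate: KL to the standard normal prior.
    distortion = F.cross_entropy(task_logits, task_labels)
    rate = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    return distortion + beta * rate

model = ActionCentricVAE()
logits, mu, logvar = model(torch.randn(16, 784))
loss = loss_fn(logits, torch.randint(0, 10, (16,)), mu, logvar)
```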
Response to a poll can be manipulated by means of a series of leading questions. We show that such phenomena cannot be explained by use of classical probability theory, whereas quantum probability theory admits a possibility of offering an explanation. Admissible transformation rules in quantum probability, however, do impose some constraints on the modelling of cognitive behaviour, which are highlighted here. Focusing on a recent poll conducted by Ipsos on a set of questions posed by Sir Humphrey Appleby in an episode of the British political satire \textit{Yes, Prime Minister}, we show that the resulting data cannot be explained quite so simply using quantum rules, although it seems not impossible.
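As a hedged illustration of the quantum rules referred to above (not the authors' specific model), sequential answers to questions $A$ and then $B$ can be modelled with projectors acting on a state $\psi$, so that the probability of answering "yes" to both generally depends on the order in which the questions are posed:
\[
p(A=\text{yes}, \text{then } B=\text{yes}) \;=\; \| P_B P_A \,\psi \|^2
\;\neq\; \| P_A P_B \,\psi \|^2 \;=\; p(B=\text{yes}, \text{then } A=\text{yes}),
\]
unless $P_A$ and $P_B$ commute. It is constraints of this kind that restrict which patterns of poll responses a quantum model can accommodate.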