April 02, 2024
Identifying the chemical structure from a graphical representation, or image, of a molecule is a challenging pattern recognition task that would greatly benefit drug development. Yet, existing methods for chemical structure recognition typically do not generalize well, and show diminished effectiveness when confronted with domains where data is sparse or costly to generate, such as hand-drawn molecule images. To address this limitation, we propose a new chemical structure recognition tool that delivers state-of-the-art performance and can adapt to new domains with a limited number of data samples and limited supervision. Unlike previous approaches, our method provides atom-level localization, and can therefore segment the image into its different atoms and bonds. Ours is the first model to perform optical chemical structure recognition (OCSR) with atom-level entity detection using only SMILES supervision. Through rigorous and extensive benchmarking, we demonstrate the preeminence of our chemical structure recognition approach in terms of data efficiency, accuracy, and atom-level entity prediction.
Molecules and chemical reactions represent the tokens of the language of chemistry, which underlies applications such as drug or new materials discovery. Molecules can be represented by a molecular formula (e.g., C\(_8\)H\(_{10}\)N\(_4\)O\(_2\)), or preferably by a more detailed structural formula, a graphical representation showcasing the spatial arrangement of atoms in the molecule. Isomers, molecules sharing the same molecular formula but differing in spatial atom arrangement, typically exhibit distinct chemical and physical properties (as illustrated in Supplementary Material (SM) Section 8). Structural molecular formulas are thus ubiquitous in chemistry publications, lab notes, patents, and textbooks. This prevalence motivates the development of automatic pipelines to perform chemical structure recognition, i.e., parsing structural formulas from images. Such an ability promises more efficient scientific literature browsing, automatic lab note transcription, and chemical data mining, among others.
Recent advances in computer vision have allowed the development of several chemical structure recognition tools [1]–[3]. These tools can be classified into molecular graph prediction methods and atom-level entity prediction methods. Molecular graph prediction methods only use limited image annotation, such as SMILES, a serial notation of a molecule [1], [2], [4], and only predict the molecular graph. In contrast, atom-level entity prediction methods leverage richer image annotations for training the model, such as atom-level entity localization (i.e., individual atoms and bonds are annotated in the original image) [3], [5]. These methods predict the molecular graph as well as the localization of the different components of the molecule in the original image. Figure 1 illustrates the different types of predictions for these two categories of models.
Previous research has shown that atom-level entity prediction methods typically enjoy better training sample efficiency, requiring fewer images to achieve the same level of performance [6]. This class of methods is also more interpretable. Atom-level entity annotation can indeed help identify the atoms that will be part of a new chemical bond in a reaction, and can also facilitate human evaluation and correction when necessary, opening the way for synergistic human-in-the-loop training strategies [5]. Nevertheless, these advantages come at the cost of requiring rich image annotations in the training data. Unfortunately, such supervision is often unavailable in many data domains, such as hand-drawn images. Yet, hand-drawn images represent a prevalent format in chemical notations and sketches. The strict dependency of existing atom-level entity prediction methods on rich image annotation thus prevents their deployment to crucial data domains.
Our research addresses these limitations by introducing a state-of-the-art chemical structure recognition tool, which (1) predicts a molecular graph from images, (2) provides atom-level localization in the original image, and (3) adapts to new data domains with a limited number of data samples and limited supervision. Our architecture relies on an object detection backbone coupled with a graph construction strategy that is pretrained on synthetic data, where the localization of each atom-level entity is known. We then leverage the atom-level entity localization, coupled with an efficient self-relabeling strategy, to aptly transfer to new domains where no localization is available (typically only image-SMILES pairs are available). This results in a state-of-the-art and highly data-efficient architecture, as demonstrated by our rigorous benchmarking.
Key contributions: (1) We propose a novel framework for chemical structure recognition that predicts atom-level localizations trained on a target domain with only SMILES supervision. (2) We show that our method results in state-of-the-art performance on challenging hand-drawn molecule images, with a remarkable data efficiency. (3) We release a new curated dataset containing hand-drawn molecules with atom-level annotations.
Our implementation is available on GitHub: https://github.com/molden/atomlenz
Our work builds upon the chemical structure recognition literature and takes an object detection approach for solving this task. To enable fine-tuning of the model where no atom-level annotations are present, we leverage advances in weakly supervised object detection.
Optical chemical structure recognition consists in inferring the structural formula of a chemical compound based on an image representation of it. The large majority of existing methods performing this task take the image as input and predict the SMILES (simplified molecular-input line-entry system) representation of the molecule [4]. SMILES consists of strings of ASCII characters that are obtained by printing the chemical symbols encountered in a depth-first tree traversal of the molecular graph. This serial notation provides, at first sight, a convenient representation for training machine learning models, while encoding geometric information about the molecular graph. This justified the popularity of SMILES-based chemical structure recognition models [1], [2].
Nevertheless, SMILES do not provide a natural chemical representation and do not readily encode the geometric properties of the molecules. This hampers the trainability of the underlying machine learning model [6]. This limitation motivated the development of methods predicting the molecular graph and capable of leveraging richer image annotations, such as atom-level localizations [3], [5]. Our work belongs to this category and therefore inherits these strengths. However, we extend previous approaches by providing a mechanism to fine-tune the model to new data domains where only SMILES annotations are available.
Our architecture draws heavily on the literature on object detection in images [7]–[11], which underlies a wide array of high-level machine learning applications [12]–[15]. We refer to [16] for a recent review of the field. Training object detection models typically requires comprehensive image annotations, such as the precise coordinates of bounding boxes and the associated labels for every object contained within each image. However, and crucially for our application, these annotations are not consistently accessible within certain domains of interest. This scarcity of detailed annotations has spurred the development of weakly supervised object detection methods.
This category of methods enables the training of object detection models without the necessity for precise bounding-box annotations. As a result, these approaches can be directly applied to target domains where such annotations are unavailable. Diverse variations of Weakly Supervised Object Detection (WSOD) architectures have emerged, relying on a range of implementations, including Multiple Instance Learning (MIL) [17], [18] and Class Activation Map (CAM) [19], [20] approaches. Other advanced WSOD techniques incorporate knowledge transfer from a source domain [21]–[25]. Among these approaches, ProbKT [25] distinguishes itself as a versatile method, relying on probabilistic reasoning, which offers the capacity to train atom-level localization models using chemical background information obtained from SMILES and logical reasoning. Our architecture leverages this approach.
Our architecture is composed of four high-level modules: (1) an object detection backbone, which is trained on richly annotated images with atom-level entities, (2) a molecular graph constructor that assembles a molecular graph from the set of atom-level predictions, (3) a weakly supervised training scheme that enables fine-tuning the model on new domains without rich annotations, and (4) ChemExpert, a chemically informed combination of experts that can further boost prediction performance. The weakly supervised training scheme of the object detection backbone is visualized in Figure 2.
At the core of our architecture lies an object detection model that is responsible for detecting and labeling atom-level entities in the image. The objects in the image are therefore the atom-level entities such as atoms or bonds. While many object detection methods exist and can be used interchangeably in our architecture, we used the Faster RCNN model [9] in our experiments. It is fast to train, robust, and simultaneously localizes and classifies all objects in a single step. The object detection backbone is trained by minimizing a multi-task loss \(\mathcal{L}\) mixing a multi-class log loss \(\mathcal{L}_{cls}\) applied on the predicted class \(\hat{c}\) of the objects and a regression loss \(\mathcal{L}_{reg}\) applied to the predicted bounding box coordinates \(\hat{b}=(\hat{b}_x,\hat{b}_y,\hat{b}_w,\hat{b}_h)\) of the objects.
\[\mathcal{L} = \mathcal{L}_{cls}(c,\hat{c}) + \mathcal{L}_{reg}(b,\hat{b})\]
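To make this training objective concrete, below is a minimal sketch using torchvision's Faster R-CNN implementation; the backbone variant, number of classes, and optimizer settings are illustrative assumptions, not the exact configuration used in our experiments. In training mode, the torchvision model returns the classification and box-regression loss terms, which are summed and backpropagated.

import torch
import torchvision

# One detector is trained per entity type (atoms, bonds, charges, stereocenters).
# num_classes is illustrative: e.g. 21 atom classes plus the background class.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None, num_classes=22)
optimizer = torch.optim.SGD(model.parameters(), lr=5e-3, momentum=0.9)

def train_step(images, targets):
    # images: list of CHW float tensors; targets: list of dicts with
    # 'boxes' (N x 4, in xmin/ymin/xmax/ymax format) and 'labels' (N,).
    model.train()
    loss_dict = model(images, targets)  # classification and box-regression terms
    loss = sum(loss_dict.values())      # L = L_cls + L_reg (plus RPN terms)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)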
Bonds, atoms, or charges are graphically very different. To account for this heterogeneity, we train four distinct object detection models, each tailored to a specific atom-level entity: atoms (\(\mathbf{O}^a\)), bonds (\(\mathbf{O}^b\)), charge objects (\(\mathbf{O}^c\)), and stereocenters (\(\mathbf{O}^s\)). A stereocenter, also known as a stereogenic center, refers to an atom within a molecule that carries groups in a way that exchanging any two of these groups results in a stereoisomer [26]. More details about stereochemistry can be found in SM Section 8.
Each type of atom-level entity comprises multiple classes \(c\) that the object detection backbone aims at labeling (via \(\mathcal{L}_{cls}\)). For instance, the bond object encompasses categories such as single bond, double bond, triple bond, aromatic bond, dashed bond, and wedged bond. Illustrations of different atom-level entity types and classes can be found in SM Section 10.
The output of the object detection backbone is a list of detected atom-level entities in the image, along with their predicted label and position. The objective of the molecular graph constructor is to assemble a chemically sound molecular graph from this list of predictions. This graph can then be easily converted to a SMILES format.
The output graph \(G(V,E)\) is composed of a set of vertices \(V\), corresponding to atoms, and edges \(E\) corresponding to bonds. Each vertex and edge has a label (e.g., a node can be a carbon atom with a positive charge, an edge can be a double bond). Algorithm 3 outlines the series of steps involved in constructing the molecular graph from atom-level entity predictions \(\mathbf{O}^a\) (atoms), \(\mathbf{O}^b\) (bonds), \(\mathbf{O}^c\) (charges), and \(\mathbf{O}^s\) (stereocenters). It proceeds in four steps: (1) a filtering step, (2) a node creation step, (3) an edge creation step, and (4) a validation step.
The filtering step filters atoms from \(\mathbf{O}^a\) that are severely overlapping on the image. When multiple atom objects show an Intersection over Union (IoU) score exceeding a specified threshold, only the object with the highest score is retained.
In the node creation step, we first attach charges to atom objects. Overlapping atom and charge objects exceeding a specific IoU threshold are then merged. The function \(\mathrm{checkCharges}\) is responsible for determining which atom objects should carry a charge. A similar procedure is subsequently applied to identify atoms functioning as stereocenters, utilizing \(\mathrm{checkStereoChem}\) for this purpose. The list of all atom objects, along with their potentially assigned charges or stereocenters are then added to the list of graph vertices.
In the edge creation step, we iterate over all bond objects and evaluate which vertices (atoms) overlap with these bonds, with the function \(\mathrm{checkEdge}\). If only two candidate atoms are identified, the algorithm proceeds to add the edge to the graph. However, when more than two overlapping candidates emerge, the algorithm selects the two most probable ones, factoring in the orientation of the edge and the atoms involved.
Lastly, the validation step identifies potential chemistry-related issues through \(\mathrm{ChemistryProblems}\) and endeavors to resolve them via \(\mathrm{SolveChemistryProblems}\) to ensure the prediction of a chemically valid molecular graph. For instance, each chemical element is assigned a valence number, indicating the atom’s capability to establish bonds with other atoms. If our algorithm detects atoms in the output graph carrying more bonds than their valence numbers permit, the \(\mathrm{SolveChemistryProblems}\) function attempts to remove bonds iteratively until a graph without valence errors is obtained.
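For concreteness, the following is a simplified sketch of these four steps, assembling and validating the graph with RDKit. The helpers filter_atoms, charge_for, two_best_candidates, and solve_chemistry_problems are hypothetical stand-ins for the subroutines detailed in SM Section 9, and stereocenter assignment is omitted for brevity.

from rdkit import Chem

BOND_TYPES = {'single': Chem.BondType.SINGLE, 'double': Chem.BondType.DOUBLE,
              'triple': Chem.BondType.TRIPLE, 'aromatic': Chem.BondType.AROMATIC}

def build_molecular_graph(atom_objs, bond_objs, charge_objs):
    # (1) filtering: drop heavily overlapping atom boxes, keeping the highest score
    atoms = filter_atoms(atom_objs)
    # (2) node creation: one vertex per atom, with its charge attached
    mol = Chem.RWMol()
    atom_idx = []
    for a in atoms:
        rd_atom = Chem.Atom(a['symbol'])
        rd_atom.SetFormalCharge(charge_for(a, charge_objs))  # checkCharges
        atom_idx.append(mol.AddAtom(rd_atom))
    # (3) edge creation: connect the two atoms overlapping each bond box
    for b in bond_objs:
        i, j = two_best_candidates(b, atoms)                 # checkEdge + filterCands
        mol.AddBond(atom_idx[i], atom_idx[j], BOND_TYPES[b['label']])
    # (4) validation: detect and repair chemistry problems such as valence errors
    problems = Chem.DetectChemistryProblems(mol)
    if problems:
        mol = solve_chemistry_problems(mol, problems)
    return Chem.MolToSmiles(mol)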
To maintain conciseness in our experiments and results, we refer to the combination of the object detection backbone and the molecular graph constructor as AtomLenz (ATOM-Level ENtity localiZer). Further information regarding the subroutines used in the molecular graph constructor is available in SM Section 9.
Our architecture uses an object detection backbone to predict atom-level entities, which requires rich image annotations, such as bounding boxes for every object type, including atoms, bonds, charges, and stereocenters within the images. While such a level of supervision can be obtained synthetically with tools like RDKit [27], it is usually not available in real-world target domains, such as hand-drawn images. In such domains, only SMILES are typically available. To enable the fine-tuning of the object detection backbone with only SMILES information, we use a weakly supervised training mechanism that combines (1) a probabilistic logical reasoning module that allows differentiating through the object detection backbone with only weak supervision, and (2) a graph edit-correction mechanism that allows fine-tuning on less frequent atoms and bonds. A graphical outline of the weakly supervised training procedure is given in Figure 2.
To update the weights of the object detection backbone with only SMILES supervision, we use the ProbKT [25] framework. This weakly supervised domain adaptation technique uses probabilistic programming for fine-tuning object detection models with a wide range of supervision signals, and is thus particularly suited for our application. In our experiments, we used ProbKT\(^*\), a computationally efficient variant of ProbKT that relies on Hungarian matching.
ProbKT\(^*\) allows differentiating through the object detection backbone with only SMILES supervision. For better performance, it also includes a relabeling mechanism, where confident predictions are used as new atom-level annotation of the target domain images. This strategy effectively creates a richly annotated dataset that can be used to fine-tune the object detection backbone directly.
While ProbKT\(^*\) is generally effective at performing weakly supervised domain adaptation, it fails when dealing with rare atom or bond types. We therefore combine ProbKT\(^*\) with a new edit-correction mechanism [28] designed to detect and rectify minor errors in model predictions. Based on the SMILES, one can generate a reference true graph, although it is not aligned with the original image. The edit-correction mechanism solves an optimization problem that aims at finding the smallest edit on the predicted graph such that the true and corrected graphs are isomorphic. While this optimization would be intractable in general, focusing on small edits makes it computationally feasible. If such a correction is found, it is used to annotate the image, which can then be used to fine-tune the object detection backbone.
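As an illustration, a minimal sketch of such a single-edit search is given below, assuming the predicted and reference molecules are represented as networkx graphs whose nodes and edges carry a 'label' attribute; the actual edit-correction mechanism [28] explores a richer set of edits.

import networkx as nx
from networkx.algorithms.isomorphism import categorical_node_match, categorical_edge_match

def graphs_match(pred, ref):
    # labeled graph isomorphism: atom labels on nodes, bond labels on edges
    return nx.is_isomorphic(pred, ref,
                            node_match=categorical_node_match('label', None),
                            edge_match=categorical_edge_match('label', None))

def single_edit_correction(pred, ref, bond_labels=('single', 'double', 'triple', 'aromatic')):
    # search for one small edit on the predicted graph that makes it
    # isomorphic to the reference graph derived from the SMILES
    if graphs_match(pred, ref):
        return pred
    for u, v in list(pred.edges):
        original = pred.edges[u, v]['label']
        for lab in bond_labels:            # try relabeling one bond
            pred.edges[u, v]['label'] = lab
            if graphs_match(pred, ref):
                return pred.copy()
        pred.edges[u, v]['label'] = original
        attrs = dict(pred.edges[u, v])     # try removing one bond
        pred.remove_edge(u, v)
        if graphs_match(pred, ref):
            return pred.copy()
        pred.add_edge(u, v, **attrs)
    return None                            # no single-edit correction found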
When fine-tuning on a new target domain, we proceed by iteratively applying ProbKT\(^*\) and the edit-correction scheme. In practice, we start with a few iterations of ProbKT\(^*\) and then use multiple iterations of the edit-correction scheme until the validation performance stops improving. For the sake of conciseness in our experimental results, we abbreviate the combination of both approaches as EditKT*.
For the final prediction of our architecture, we propose to use a combination of experts, constrained by the chemical soundness of the model predictions. We call this module ChemExpert. It relies on a list of chemical structure recognition tools, ordered by the user’s preference, where the first tool serves as the most trusted model. At inference time, ChemExpert iteratively checks the validity of the prediction of each model in the list. If a chemical issue is identified, the module evaluates the next model in the list, and returns the prediction of the first model for which no chemical issue is detected. This strategy enables us to incorporate predictions from additional tools alongside those generated by our core model, thereby improving predictive performance. In practice, we use a combination of DECIMER [29] and our approach.
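A minimal sketch of the ChemExpert cascade is shown below, assuming each tool exposes a function mapping an image to a SMILES string; chemical soundness is approximated here by RDKit parseability, whereas our implementation may apply stricter checks.

from rdkit import Chem

def is_chemically_sound(smiles):
    # accept a prediction only if RDKit can parse and sanitize the SMILES
    return smiles is not None and Chem.MolFromSmiles(smiles) is not None

def chem_expert(image, tools):
    # tools: predictors ordered by user preference, each mapping an image to a SMILES,
    # e.g. [decimer_finetuned_predict, atomlenz_editkt_predict] (hypothetical names)
    predictions = [tool(image) for tool in tools]
    for smiles in predictions:
        if is_chemically_sound(smiles):
            return smiles
    return predictions[0]  # fall back to the most trusted tool if none passes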
For the pretraining of the object detection models, we generate images synthetically using RdKit [27] and Indigo [30] paired with bounding boxes delineating all objects within, including atoms, bonds, charges, and stereocenters, similarly to what is used in other chemical structure recognition tools [3], [5]. Specifically, we collect approximately 214,000 chemical compounds in SMILES format from the ChEMBL [31] database. To enhance the method’s resilience to stylistic variations, we introduce variability in elements such as fonts, font sizes, line widths, and the spacing between multiple bonds during image generation. More details on this dataset can be found in SM Section 7.
To facilitate the training, fine-tuning, and testing of our models on hand-drawn images, we meticulously curate multiple datasets. We begin with the dataset introduced by [32], which consists of hand-drawn chemical depictions matched with their corresponding SMILES representations. This dataset is partitioned into 4,070 samples for training and validation purposes, along with an additional 1,018 samples for testing. These sets are referred to as the hand-drawn training set and the hand-drawn test set.
In addition to this primary test set, we incorporate an extra test dataset of 614 hand-drawn chemical depictions sourced from [33], which we call the ChemPix test set, to further evaluate the performance of the models on hand-drawn images.
To assess our models’ capability for object localization, we employ a synthetically generated dataset using RdKit [27] provided by [25]. This dataset encompasses 1,000 images depicting chemical structures, each meticulously annotated with bounding boxes outlining the positions and corresponding classes of all the atoms within the molecules. Example images for each test dataset are shown in Figure 4 and SM Section 7.
Our experiments investigate the performance of our approach on hand-drawn images of chemical structures, as this domain suffers from limited data availability and has been shown to be a weak point of existing tools. We evaluate our architecture and state-of-the-art baselines on four distinct fronts: (1) molecular recognition on the new target domain (i.e., only predicting the SMILES), (2) atom-level entity localization, (3) training efficiency (when retrained from scratch), and (4) model evaluation per atom and bond type.
We compare our architecture with the following baselines. DECIMER [1] is an image-transformer approach trained on more than 400 million synthetically generated data samples. The authors of DECIMER have also introduced a version specifically tailored for hand-drawn images (DECIMER fine-tuned [29]). Although it is trained on synthetically generated images, the training dataset of this version mimics the style of hand-drawn images more closely. Img2Mol [2] integrates a deep convolutional neural network trained on molecule depictions (11 million synthetically generated images) with a pretrained decoder. MolScribe [5] and ChemGrapher [3] employ atom-level entity localization annotations in their training process on synthetically generated images. These are the only baselines that can predict atom-level annotations, alongside the SMILES predictions. ChemGrapher is trained on 114,000 generated images and MolScribe on 1 million generated images. Lastly, OSRA [34] is a non-trainable, rule-based approach.
Method | Acc.(\(T=1\)) (hand-drawn) | \(\overline{T}\) (hand-drawn) | Acc.(\(T=1\)) (ChemPix) | \(\overline{T}\) (ChemPix)
---|---|---|---|---|
DECIMER (v2.2.0)[1] | 0.295 | 0.451 | 0.05 | 0.1 |
DECIMER fine-tuned(v2.2.0)[29] | 0.622 | 0.727 | 0.508 | 0.643 |
Img2Mol[2] | 0.084 | 0.275 | 0.015 | 0.084 |
MolScribe[5] | 0.102 | 0.288 | 0.269 | 0.417 |
ChemGrapher[3] | 0.002 | 0.065 | 0.187 | 0.286 |
OSRA[34] | 0.006 | 0.065 | 0.047 | 0.071 |
AtomLenz | 0.009 | 0.087 | 0.054 | 0.064 |
AtomLenz+EditKT* | 0.338 | 0.484 | 0.484 | 0.605 |
ChemExpert([29],AtomLenz+EditKT*) | 0.635 | 0.749 | 0.518 | 0.655 |
We compare the performance of our approach with the baselines on the hand-drawn and ChemPix datasets. Results are given in Table 1. To assess the impact of our fine-tuning strategy, we evaluate three versions of our architecture. The first is a version trained on the synthetic dataset but not fine-tuned to the new hand-drawn dataset (AtomLenz). The second is fine-tuned to the hand-drawn dataset using EditKT\(^*\). The third is ChemExpert, combining DECIMER fine-tuned and AtomLenz. The performance of other combinations in ChemExpert is reported in SM Section [sec:allresults].
We assess the molecular structure prediction performance using accuracy and Tanimoto similarity. Tanimoto similarity (\(T\)) [35] is a widely used metric for quantifying molecular similarity, which we use to assess the resemblance between the model’s predictions and the actual molecular graphs. Tanimoto similarity values range from 0 to 1, with higher values indicating greater similarity. A Tanimoto similarity of 1 indicates that the structural descriptors are identical, i.e., that the ‘on-bits’ of the binary fingerprints match. The binary fingerprint employed to measure the Tanimoto similarity is the Extended-connectivity fingerprint [36] with radius 3 (ECFP6) and a fingerprint length of 2048. More details on the calculation of the ECFP6 fingerprint and other fingerprints can be found in SM Section [sec:allresults].
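As an illustration, the following sketch shows how these metrics can be computed with RDKit, using a Morgan fingerprint of radius 3 with 2048 bits as the ECFP6 descriptor; it is a simplified stand-in for our evaluation scripts.

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ecfp6(smiles, n_bits=2048):
    # ECFP6 corresponds to a Morgan fingerprint of radius 3
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=n_bits) if mol else None

def tanimoto(pred_smiles, true_smiles):
    fp_pred, fp_true = ecfp6(pred_smiles), ecfp6(true_smiles)
    if fp_pred is None or fp_true is None:
        return 0.0  # unparsable predictions are counted as zero similarity
    return DataStructs.TanimotoSimilarity(fp_pred, fp_true)

def accuracy_t1(pairs):
    # fraction of (prediction, ground truth) pairs with identical ECFP6 descriptors
    return sum(tanimoto(p, t) == 1.0 for p, t in pairs) / len(pairs)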
Our tables report both the accuracy, computed by counting the instances where the predicted structures have identical structural ECFP6 descriptors (denoted by a Tanimoto similarity of 1) and the average Tanimoto similarity. Additional measured metrics can be found in SM Section [sec:allresults].
For both the hand-drawn and ChemPix test sets, our ChemExpert performs best. This demonstrates that the combination of our approach with other baselines results in state-of-the-art performance. We further observe a significant increase in performance from EditKT\(^*\), compared to the non-fine-tuned version, highlighting the effectiveness of our fine-tuning approach.
In Table 2, we assess the performance of the different methods in terms of their atom-level localization abilities. We employ a test set from [25], which comprises images of chemical representations along with the corresponding atom objects. We use two evaluation metrics: the count accuracy (which can be evaluated without bounding box predictions), and the mean average precision (mAP) of the bounding box localization. The atom count accuracy measures the ability to predict the correct number of atom types in each image. The average precision is computed as the weighted mean of precisions at various Intersection over Union (IoU) thresholds, with the weight reflecting the increase in recall from the previous threshold. Mean average precision represents the average of AP values across classes. We use the rather low IoU thresholds of \([0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35]\) in our experiments, considering the relatively small size of the bounding boxes of interest (see Figure 4 (c)), where significant overlap with the true bounding boxes is not anticipated. Methods that do not provide any form of localization are marked with n/a in the table.
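The sketch below illustrates the IoU computation and the averaging over these low thresholds; for brevity it uses a simplified matching fraction in place of the full class-wise average-precision computation described above.

IOU_THRESHOLDS = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35]

def box_iou(a, b):
    # a, b: bounding boxes as (xmin, ymin, xmax, ymax)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def matched_fraction(pred_boxes, true_boxes, thr):
    # fraction of predicted boxes overlapping some ground-truth box at IoU >= thr
    # (a simplified proxy for the per-class average precision)
    if not pred_boxes:
        return 0.0
    hits = sum(any(box_iou(p, t) >= thr for t in true_boxes) for p in pred_boxes)
    return hits / len(pred_boxes)

def mean_over_thresholds(pred_boxes, true_boxes):
    scores = [matched_fraction(pred_boxes, true_boxes, t) for t in IOU_THRESHOLDS]
    return sum(scores) / len(scores)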
In Table 2, we observe the commendable localization performance of our pretrained backbone model (AtomLenz) and also note the high counting accuracy of DECIMER. We posit that both outcomes may be attributed to the nature of the test dataset, aligning with the characteristics of the images used for training both DECIMER and our core pretrained model (AtomLenz). Additionally, we note that DECIMER was trained on 2000 times more images than our architecture, and does not provide any localization.
Method | Count Acc. | mAP |
---|---|---|
DECIMER (v2.2.0)[1] | 0.973 | n/a |
DECIMER fine-tuned (v2.2.0)[29] | 0.97 | n/a
Img2Mol[2] | 0.929 | n/a |
MolScribe[5] | 0.829 | 0.008 |
ChemGrapher[3] | 0.248 | 0.002 |
OSRA[34] | 0.255 | n/a |
AtomLenz | 0.602 | 0.801 |
Baseline architectures were trained on a significantly larger number of images than our model: MolScribe uses 4 times more samples, while DECIMER uses a staggering 2000 times more. In Table 3, we evaluate the sample complexity of the different methods by retraining them from scratch on the same small training dataset, which mimics limited data availability scenarios. We use the hand-drawn training set (4,070 data samples), enriched with atom-level entity localization annotations generated using EditKT*, as the training dataset. We observe that the methods that leverage these atom-level entity annotations (ChemGrapher [3] and MolScribe [5]) tend to fare better than the ones using SMILES as the only supervision signal (DECIMER [1] and Img2Mol [2]). Our approach significantly outperforms all baselines at this task, highlighting the remarkable data efficiency of our architecture. These findings align with those of [37], where the author concluded that the use of atom-level entity annotations can enhance data efficiency during training. The hand-drawn images utilized in this experiment, along with the corresponding bounding box labels for 1,417 images, are released as a novel annotated dataset. More information is provided in SM Section 7.
Method | Acc.(\(T=1\)) | \(\overline{T}\) |
---|---|---|
DECIMER (v2.2.0)[1] | 0.001 | 0.039 |
Img2Mol[2] | 0.0 | 0.0867 |
MolScribe[5] | 0.013 | 0.0865 |
ChemGrapher[3] | 0.004 | 0.067 |
AtomLenz | 0.338 | 0.484 |
Additionally, we conduct a detailed performance analysis of the most effective models from Table 1, presented in Figure 5. This figure showcases the count accuracies per atom or bond type. For each specific type, we identify images featuring that particular atom or bond type, then examine the predictions made by the methods on these images. Subsequently, we calculate the count accuracies for the predicted objects of the specific type within these images. For instance, when analyzing the ‘triple bond’ type, we select test images where at least one triple bond is depicted in the molecule and evaluate whether the method accurately predicts the correct number of triple bonds in the resulting molecular graph.
The plot in Figure 5 exhibits distinct patterns between ‘AtomLenz+EditKT*’ and DECIMER fine-tuned [29]. For example, ‘AtomLenz+EditKT*’ performs better on images with Chlorine (Cl), Fluorine (F), and Phosphorus (P) compared to DECIMER fine-tuned, but worse on bonds. This variability may clarify why combining both predictions into ChemExpert leads to improved performance, as errors tend to occur on different samples and the two approaches complement each other. The same analysis is performed for the ChemPix dataset in SM Section [sec:allresults] in Figure 12.
This study has undertaken a comprehensive evaluation of various methods for chemical structure recognition, with a primary focus on the challenging domain of hand-drawn images. Our findings reveal insights into the strengths and limitations of existing tools and provide a compelling case for the efficacy of our approach. We showed that our method fares competitively despite a lower number of training samples, and resulted in state-of-the-art performance when combined with previous approaches. Our experiments highlighted our method’s proficiency in precisely localizing atom-level entities, a feature notably lacking in many existing tools. Importantly, we showed that our architecture is remarkably more data-efficient than previous models.
Despite these improvements in chemical structure recognition, reliably predicting the molecular structure from hand-drawn images remains a challenge, and higher prediction performance would be required for a wide adoption of these tools. We hope that releasing our curated dataset of hand-drawn molecule images, with detailed atom-level annotations, to the community will contribute to the development of more efficient and reliable tools.
AA, MO and YM are funded by (1) Research Council KU Leuven: Symbiosis 4 (C14/22/125), Symbiosis3 (C14/18/092); (2) Federated cloud-based Artificial Intelligence-driven platform for liquid biopsy analyses (C3/20/100); (3) CELSA - Active Learning (CELSA/21/019); (4) European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 956832; (5) Flemish Government (FWO: SBO (S003422N), Elixir Belgium (I002819N), SB and Postdoctoral grants: S003422N, 1SB2721N, 1S98819N, 12Y5623N) and (6) VLAIO PM: Augmenting Therapeutic Effectiveness through Novel Analytics (HBC.2019.2528); (7) YM, AA, and MO are affiliated to Leuven.AI and received funding from the Flemish Government (AI Research Program). Computational resources and services used in this work were partly provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation - Flanders (FWO) and the Flemish Government – department EWI.
The source code and basic instructions are available on https://github.com/molden/atomlenz
Several datasets were used in this work and all are available.
The dataset introduced by [32], which consists of hand-drawn chemical depictions matched with their corresponding SMILES representations, is partitioned into 4,070 samples for training and validation purposes, along with an additional 1,018 samples for testing. These sets are referred to as the hand-drawn training set and the hand-drawn test set and are available here: https://dx.doi.org/10.6084/m9.figshare.24599412
The hand-drawn training set was then relabeled using EditKT* to annotate corresponding bounding box labels for 1,417 images (see Experiments Section 5.3). The format of the bounding box labels is further explained in Section 7.2.2. This dataset is available here: https://dx.doi.org/10.6084/m9.figshare.24599172
To streamline the process, the hand-drawn training set is offered in different formats together with instructions to assist in training other baseline models (see Experiments Section 5.3). When applicable, localization annotations are also included. The different datasets are available here:
DECIMER format: https://dx.doi.org/10.6084/m9.figshare.24591252
Img2Mol format: https://dx.doi.org/10.6084/m9.figshare.24591381
MolScribe format: https://dx.doi.org/10.6084/m9.figshare.24591300
ChemGrapher format: https://dx.doi.org/10.6084/m9.figshare.24591495
For the pretraining of the object detection models of AtomLenz, we generate images synthetically using RdKit [27] and Indigo [30], paired with bounding boxes delineating all objects within, including atoms, bonds, charges, and stereocenters, similarly to what is used in other chemical structure recognition tools [3], [5]. Specifically, we collect approximately 214,000 chemical compounds in SMILES format from the ChEMBL [31] database. To enhance the method’s resilience to stylistic variations, we introduce variability in elements such as fonts, font sizes, line widths, and the spacing between multiple bonds during image generation. The dataset is available in two parts:
part 1 atom and bond entity annotated images: https://zenodo.org/records/10185264
part 2 charge and stereocenter entity annotated images: https://zenodo.org/records/10200185
Example label file:
label,xmin,ymin,xmax,ymax
0,267,522,286,541
2,317,489,336,508
0,313,429,332,448
0,363,396,382,415
2,360,337,379,356
0,306,310,325,329
2,256,343,275,362
0,370,516,389,535
0,374,576,393,595
0,428,603,447,622
2,478,570,497,589
0,474,510,493,529
2,524,477,543,496
0,578,504,597,523
0,628,471,647,490
0,682,498,701,517
0,732,465,751,484
0,728,405,747,424
0,675,378,694,397
6,431,663,450,682
3,581,564,600,583
6,671,318,690,337
0,260,403,279,422
0,421,483,440,502
0,625,411,644,430
Above, an example CSV label file of the bounding box labels for one image is illustrated. There are several fields in the CSV file:
The label field annotates, for every bounding box in the image, which class the entity (atom, bond, charge, or stereocenter) belongs to. For atom-type entities, these are the different possible labels:
{0: 'C', 1: 'H', 2: 'N', 3: 'O', 4: 'S', 5: 'F', 6: 'Cl',
7: 'Br', 8: 'I', 9: 'Se', 10: 'P', 11: 'B', 12: 'Si',
13: '*', 14:'Te', 15:'Sn', 16: 'As', 17:'Al', 18:'Ge',
19:'D', 20:'T'}
For bond-type entities the different possible labels are:
{1: 'single', 2: 'double', 3: 'triple',
4: 'aromatic', 5: 'wedged', 6: 'dashed'}
For charge-type entities the different possible labels are:
{0: 0, 1: +1, 2: -1, 3: +2, 4: -2, 5: +3, 6: +4, 7: +5, 8: +6}
Finally for stereocenters entities:
{0:'stereocenter'}
xmin,ymin: coordinates of top left corner of the bounding box.
xmax,ymax: coordinates of the bottom right corner of the bounding box.
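As an illustration, a short sketch of reading such a label file into atom symbols and bounding boxes is given below, using the atom label mapping above; the filename and helper name are hypothetical.

import csv

ATOM_LABELS = {0: 'C', 1: 'H', 2: 'N', 3: 'O', 4: 'S', 5: 'F', 6: 'Cl',
               7: 'Br', 8: 'I', 9: 'Se', 10: 'P', 11: 'B', 12: 'Si',
               13: '*', 14: 'Te', 15: 'Sn', 16: 'As', 17: 'Al', 18: 'Ge',
               19: 'D', 20: 'T'}

def read_atom_labels(path):
    # parse an atom-entity label file into (symbol, (xmin, ymin, xmax, ymax)) tuples
    entities = []
    with open(path) as f:
        for row in csv.DictReader(f):
            box = (int(row['xmin']), int(row['ymin']),
                   int(row['xmax']), int(row['ymax']))
            entities.append((ATOM_LABELS[int(row['label'])], box))
    return entities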
Examples of samples from the synthetically generated training set are illustrated in Figure 11 together with the drawn bounding box labels for the different object types. Also some extra examples of samples of all test sets described in Section 4 are illustrated in Figure 6.
At every imperfect representation level there are compounds that cannot be distinguished. These indistinguishable groups correspond to the concept of isomerism in chemistry. At the level of the molecular formula, where only the counts of the different atoms are given, these equivalent compounds are called constitutional isomers. Compound graphs with identical adjacency matrices but different spatial organization are called stereoisomers. In the following, we give examples to help clarify these concepts.
Constitutional isomerism is a quite simple concept. It is clear that if we specify the number of atoms of each type, multiple possible compound graphs can be built. There are valence constraints of course; for example, a \(C_nH_{2n+2}\) compound cannot contain double or triple bonds. However, there is still a large variety of graphs that can be realized.
The case of sucrose and lactose (see Figure 8) is easy to follow, as the galactose and fructose units only differ in the position of the ring closure. A more accidental case is progesterone and THC, depicted in Figure 7. This starkly illustrates the pitfalls of using the chemical formula as a representation: the effects of these two compounds are clearly unrelated.
If we only take into account the atom and bond adjacency relations, some relevant degrees of freedom remain undescribed. An often used example is our hands. While all bones have the same adjacency in both of our hands, we cannot rotate the two such that they are identical: they are mirror images.
Note two important details. Firstly, we do not care about the exact positions of atoms in 3D space when the molecule is flexible, just as we do not distinguish a hand with closed or opened fingers, but we do differentiate between the left and the right hand. Secondly, the spatial organization has nothing to do with the placement of the atoms on the 2D depiction plane; these positions are arbitrary.
To enhance our representation, new labels need to be introduced: wedge bonds and/or stereocenters. For an example, see Figure 9. The filled wedge bond indicates that the atom or group at the thick end points out of the plane of the drawing, while the dashed wedge bond indicates that the group lies below that plane. The depicted compounds are mirror images of each other; however, the difference in biological effect can be dramatic (in the case of thalidomide the picture is more complicated, as the two forms can interconvert in the body, but for didactic purposes let us assume this is not the case). Note that if the left ring were symmetric, for example by connecting the nitrogen to the neighboring carbon, the two compounds would be identical. A simple 180 degree rotation around the long axis of the compound would show this. Stereoisomerism necessitates the presence of an atom lacking symmetric surroundings. This unique atom, such as the carbon at the wedge bond in this scenario, is referred to as a stereocenter.
Stereoisomers are not always mirror images of each other. If there are \(n\) stereocenters in a molecule (see Figure 10) there are \(2^n\) stereoisomers, forming pairs of mirror images (called enantiomers). The non-mirror image pairs are called diastereomers.
This section aims to provide in-depth insights into the subroutines utilized within the molecular graph constructor as introduced in Algorithm 3.
The first subroutine used in the molecular graph constructor is \(\mathrm{filterAtoms}(\mathbf{O}^a)\). This subroutine is implemented inside the function iou_filter_bboxes in the file utils_graph.py. The function goes over all overlapping bounding boxes of atoms with IoU higher than 0.5. For every group of overlapping bounding boxes, the function keeps the bounding box with the highest score.
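A minimal sketch of such an IoU-based filter is shown below; it is an illustrative re-implementation and may differ in details from iou_filter_bboxes in the repository.

def box_iou(a, b):
    # IoU of two bounding boxes given as (xmin, ymin, xmax, ymax)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def filter_atoms(boxes, scores, iou_threshold=0.5):
    # keep the highest-scoring box within every group of boxes overlapping above the threshold
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return [boxes[i] for i in kept]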
\(\mathrm{checkCharges}(\mathbf{O}^c, \mathbf{o}^a)\) is responsible for determining which atom objects should carry a charge and is implemented in predict_smiles.py from line 95 until line 99:
95 charge_atoms = np.ones(len(filtered_bboxes))
96 for index,box_atom in enumerate(filtered_bboxes):
97 for box_charge,label_charge in zip(filtered_ch_boxes,filtered_ch_labels):
98 if bb_box_intersects(box_atom,box_charge) == 1:
99 charge_atoms[index]=label_charge
\(\mathrm{checkStereoChem}(\mathbf{O}^s, \mathbf{o}_{c}^a)\) is applied to identify atoms functioning as stereocenters and is implemented in predict_smiles.py from line 141 until line 151:
141 stereo_bonds = np.where(mol_graph>4, True, False)
142 if np.any(stereo_bonds):
143 stereo_boxes = stereo_preds[image_idx]['boxes'][0]
144 stereo_labels= stereo_preds[image_idx]['preds'][0]
145 for stereo_box in stereo_boxes:
146 result=[]
147 for atom_box in filtered_bboxes:
148 result.append(bb_box_intersects(atom_box,stereo_box))
149 indices = [i for i, x in enumerate(result) if x == 1]
150 if len(indices) == 1:
151 stereo_atoms[indices[0]]=1
\(\mathrm{checkEdge}(V, \mathbf{o}^b)\) evaluates which vertices (atoms) overlap with the bonds and is implemented in predict_smiles.py from line 109 until line 118:
109 result = []
110 limit = 0
111
112 while result.count(1) < 2 and limit < 80:
113 result=[]
114 bigger_bond_box = [bond_box[0]-limit,
bond_box[1]-limit,bond_box[2]+limit,bond_box[3]+limit]
115 for atom_box in filtered_bboxes:
116 result.append(bb_box_intersects(atom_box,bigger_bond_box))
117 limit+=5
118 indices = [i for i, x in enumerate(result) if x == 1]
\(\mathrm{filterCands}(\mathrm{candAtoms})\) selects the two most probable atoms to form a bond when more than two candidate atoms appear. This step is implemented in dist_filter_bboxes(cand_bboxes) in the file utils_graph.py.
Finally, the validation step is performed by running several iterations of the following code:
from rdkit import Chem  # import added so the excerpt is self-contained

mol = Chem.MolFromMolFile('molfile', sanitize=False)
problematic = 0
try:
    problems = Chem.DetectChemistryProblems(mol)
    if len(problems) > 0:
        mol = solve_mol_problems(mol, problems)
except Exception:  # assumed completion of the excerpt; exact handling may differ in the repository
    problematic = 1
Where solve_mol_problems is implemented in the file utils_graph.py.
Examples of samples from the synthetically generated training set are illustrated in Figure 11 together with the drawn bounding box labels for the different atom-level entity types: atoms, bonds, charges and stereocenters.
In our experiments, we assess the molecular structure prediction performance using accuracy and Tanimoto similarity, a widely used metric for quantifying molecular similarity, which measures the resemblance between the model’s predictions and the actual molecular graphs. Tanimoto similarity values range from 0 to 1, with higher values indicating greater similarity. A Tanimoto similarity of 1 indicates that the structural descriptors are identical, i.e., that the ‘on-bits’ of the binary fingerprints match. The binary fingerprint employed to measure the Tanimoto similarity is the Extended-connectivity fingerprint [36] with radius 3 (ECFP6) and a fingerprint length of 2048. Crafted to capture essential molecular features relevant to molecular activity, ECFPs (Extended-Connectivity Fingerprints) [36] are generated through a customized adaptation of the Morgan [38] algorithm. This involves systematically traversing each atom in the molecule to extract all possible paths within a specified radius. Following this, every unique path is hashed into a numerical value within a predetermined bit range. It is worth noting that the encoded fragment size expands proportionally with an increased radius.
Our Tables 4 and 5 report both the accuracy, computed by counting the instances where the predicted structures have identical structural ECFP6 descriptors (denoted by a Tanimoto similarity of 1) and the average Tanimoto similarity. As an additional metric, we include the accuracy when assessing whether the predicted resulting SMILES exactly match the true SMILES.
Lastly, we conduct supplementary experiments utilizing ChemExpert on both the ChemPix and hand-drawn test sets, while altering the sequence of chemical structure recognition tools. On both datasets, we note that the combined utilization of AtomLenz+EditKT* and DECIMER fine-tuned within ChemExpert yields the best performance. Nevertheless, the arrangement of tools within ChemExpert slightly alters the performance, depending on the test set and the specific performance metric, as demonstrated in Table 5.
Method | Acc. (exact match) | Acc.(\(T=1\)) | \(\overline{T}\) |
---|---|---|---|
DECIMER (v2.2.0) [1] | 0.281 | 0.295 | 0.451 |
DECIMER fine-tuned(v2.2.0) [29] | 0.567 | 0.622 | 0.727 |
Img2Mol [2] | 0.047 | 0.084 | 0.275 |
MolScribe [5] | 0.094 | 0.102 | 0.288 |
ChemGrapher [3] | 0.002 | 0.002 | 0.065 |
OSRA [34] | 0.006 | 0.006 | 0.065 |
AtomLenz | 0.008 | 0.009 | 0.087 |
AtomLenz+EditKT* | 0.279 | 0.338 | 0.484 |
ChemExpert(AtomLenz+EditKT*,[29]) | 0.416 | 0.417 | 0.572 |
ChemExpert([29],[1]) | 0.571 | 0.626 | 0.738
ChemExpert([29],AtomLenz+EditKT*) | 0.579 | 0.635 | 0.749 |
Method | Acc. (exact match) | Acc.(\(T=1\)) | \(\overline{T}\) |
---|---|---|---|
DECIMER (v2.2.0) [1] | 0.036 | 0.05 | 0.1 |
DECIMER fine-tuned (v2.2.0) [29] | 0.482 | 0.508 | 0.643 |
Img2Mol [2] | 0.015 | 0.015 | 0.084 |
MolScribe [5] | 0.228 | 0.269 | 0.417 |
ChemGrapher [3] | 0.151 | 0.187 | 0.286 |
OSRA[34] | 0.044 | 0.047 | 0.071 |
AtomLenz | 0.026 | 0.054 | 0.064 |
AtomLenz+EditKT* | 0.4 | 0.484 | 0.605 |
ChemExpert(AtomLenz+EditKT*,[5]) | 0.412 | 0.5 | 0.619 |
ChemExpert(AtomLenz+EditKT*,[29]) | 0.441 | 0.529 | 0.65 |
ChemExpert([29],AtomLenz+EditKT*) | 0.487 | 0.518 | 0.655 |
Method | Acc.(\(T=1\)) (test set) | \(\overline{T}\) (test set) | Acc.(\(T=1\)) (train set) | \(\overline{T}\) (train set) |
---|---|---|---|---|
DECIMER (v2.2.0) [1] | 0.001 | 0.039 | 0.099 | 0.142 |
Img2Mol [2] | 0.0 | 0.0867 | 0.237 | 0.388 |
MolScribe [5] | 0.013 | 0.0865 | 0.234 | 0.275 |
ChemGrapher [3] | 0.004 | 0.067 | 0.007 | 0.073 |
AtomLenz | 0.338 | 0.484 | 0.383 | 0.522 |