October 03, 2025
Text removal is a crucial task in computer vision with applications such as privacy preservation, image editing, and media reuse. While existing research has primarily focused on scene text removal in natural images, limitations in current datasets hinder out-of-domain generalization and accurate evaluation. In particular, widely used benchmarks such as SCUT-EnsText suffer from ground truth artifacts due to manual editing, overly simplistic text backgrounds, and evaluation metrics that do not capture the quality of generated results. To address these issues, we introduce an approach to synthesizing a text removal benchmark applicable to domains beyond scene text. Our dataset features text rendered on complex backgrounds using object-aware placement and vision-language model-generated content, ensuring clean ground truth and challenging text removal scenarios. The dataset is available at https://huggingface.co/datasets/cyberagent/OTR.
CCS Concepts: • Computing methodologies → Image manipulation; • Computing methodologies → Computer vision tasks.
Text removal aims to erase text from images and fill in the background with seamless pixels, preserving the visual quality of the images. Text removal has a wide range of applications, such as removing captions from videos [1], [2], obscuring private or sensitive information [3]–[6], and editing text in images [7], where removal is usually an important pre-processing step for the subsequent workflow.
The majority of the text removal literature has focused on removing text from natural scene images [3]–[6], [8], [9], commonly referred to as scene text. One of the early works on scene text removal introduced benchmark datasets [8] that have remained the primary standard for evaluation to this day, where scene examples often include traffic signs or billboards. While scene text is an important target domain, we argue that the current benchmark is not always applicable to evaluating text removal in other domains, such as creative domains involving printed posters or advertising banners, which present different requirements for background inpainting. For example, scene text typically does not cross object boundaries, but poster text can appear on top of a background containing multiple objects.
Another qualitative problem in the existing benchmark is that the dataset contains pixel-level artifacts originating from manipulating the ground truth images with an image editing tool, which sometimes leaves visible noise in the inpainted background. These artifacts introduce noise into evaluation metrics such as PSNR, but they are hard to fix because we cannot remove text from real-world scenes to collect perfectly clean ground truth.
In this paper, we address the limitations of the current scene text benchmarks with a synthetic approach. With the creative domain in mind, we synthetically render overlay text on images that initially contain no text, and use the composited images to train or evaluate text removal models. This approach ensures that the ground truth remains completely artifact-free, because the background images are always clean and never undergo pixel-level manipulation. Our synthetic approach can also control text placement using the location of objects and concepts in the image. This allows us to create more challenging text removal scenarios by positioning text over regions with complex structures and textures. In experiments, we show that existing benchmarks are limited in their ability to accurately compare methods in terms of qualitative results, and how our synthetic benchmark provides better measurement capability.
The main contributions of the newly proposed dataset, which we call OTR (Overlay Text Removal), are as follows:
We present a synthetic approach to building a dataset for evaluating text removal methods, which artificially overlays text on complex backgrounds. Our synthetic approach guarantees an artifact-free background in the ground truth.
We empirically study the evaluation capability of the proposed dataset and show that our approach can better capture the qualitative characteristics compared to the existing scene text benchmarks.
The first attempts at text removal targeted captions and subtitles in videos, employing spatial and temporal restoration techniques across consecutive frames [1], [2]. Owing to the advancements in image generation and editing achieved by deep learning methods, the recent focus of research has shifted toward scene text removal, that is, the task of eliminating text from natural scene images. Nakamura et al. [3] were the first to apply a convolutional neural network to remove scene text from images. Further research followed shortly after, with several methods [4], [8]–[10] leveraging the progress made by generative adversarial networks (GANs) in image generation and editing [11], [12]. Recent methods have adopted newer techniques and architectures such as attention mechanisms [13] and vision transformers [6].
Text removal methods can be classified into two categories: (1) one-stage methods, which directly transform input images containing text into text-free outputs in a single step [3], [4], [6], [8], [10], [14], [15], and (2) two-stage methods, which first explicitly detect or estimate text regions and then inpaint only those areas [5], [9], [13], [16]–[20].
SCUT [8] is a popular benchmark for studying text removal and includes two benchmark sets, each built with a different approach.
The SCUT-EnsText [8] benchmark consists of pairs of scene text images and their manually edited counterparts, where the text has been removed using Adobe Photoshop. The scene text images were collected from other scene text datasets, namely ICDAR 2015 [21] and MLT [22]. Lyu et al. recently introduced another dataset for scene text removal [23], built with a similar approach. While manually created ground truth examples look natural, they have an inherent drawback: the background often exhibits pixel-level artifacts.
The SCUT-SynText [8] benchmark contains synthetic scene text images generated with the SynthText [24] algorithm, along with corresponding background images serving as text-free ground truth. The synthetic approach completely avoids the background artifacts introduced by image manipulation, but has the limitation that the resulting images do not always look natural as scene images. Because SynthText is designed for studying scene text detection, text is placed on relatively uniform backgrounds, such as the sky, a water surface, or the ground. This is fine for training a scene text detection model, but it is an issue for text removal because text never appears on a complex background, yielding only easy-to-inpaint examples. This limitation becomes problematic when studying text removal in non-scene images.
In this section, we study problems in the existing benchmarks.
Manually created ground truth tends to contain a substantial number of pixels surrounding the text that differ from those in the originals. Ideally, only the pixels corresponding to text strokes would be altered in the ground truth. However, achieving such precision in an image editor is virtually impossible, and broader areas around the text strokes are often modified to produce visually plausible results. As a result, the ground truth contains artifacts and visible signs of image editing that make it differ from the original.
We evaluate this discrepancy by computing the PSNR between the original images and their corresponding ground truth in the SCUT-EnsText benchmark. Figure 1 shows examples of original and ground truth image pairs and their pixel-level differences. Since the ground truth does not contain any text, we exclude text stroke regions from the PSNR computation by detecting them with a text stroke segmentation model [25]. We adopt a common implementation of PSNR that prevents division by zero by adding a constant \(\epsilon = 10^{-10}\) to the mean squared error (MSE). The PSNR between the original image \(I\) consisting of \(n\) pixels and its approximation \(\hat{I}\) is then computed as follows: \[\text{PSNR} = 20 \cdot \log_{10}(m) - 10 \cdot \log_{10}(\text{MSE} + \epsilon),\] \[\text{MSE} = \frac{1}{n} \sum_{i=1}^n (I_i - \hat{I}_i)^2.\] For 8-bit images with a maximum pixel value \(m\) of 255, two perfectly identical images yield a PSNR of about 148 dB. In contrast, the average PSNR across the whole test set of SCUT-EnsText is 42.72 dB, which roughly corresponds to \(\frac{1}{10}\) of all pixels in the image differing by about 2.5% in grayscale intensity.
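As a reference, the masked PSNR computation can be sketched as follows; `stroke_mask` is assumed to be the binary text-stroke mask produced by the segmentation model, which is not part of this snippet:

```python
import numpy as np

def masked_psnr(original, approx, stroke_mask, max_value=255.0, eps=1e-10):
    """PSNR between two 8-bit images, ignoring text-stroke pixels.

    original, approx: uint8 arrays of identical shape (H, W) or (H, W, C).
    stroke_mask: boolean array (H, W), True where text strokes were detected.
    """
    keep = ~stroke_mask
    if original.ndim == 3:  # broadcast the mask over color channels
        keep = np.repeat(keep[..., None], original.shape[-1], axis=-1)
    diff = original.astype(np.float64) - approx.astype(np.float64)
    mse = np.mean(diff[keep] ** 2)
    # With eps = 1e-10, two identical 8-bit images score about 148 dB.
    return 20 * np.log10(max_value) - 10 * np.log10(mse + eps)
```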
To further quantify the discrepancies, we also compute the percentage of pixels — excluding text stroke regions — whose absolute difference in pixel value between the original image and ground truth exceeds a given threshold that we set to 3, a level sufficient to noticeably affect PSNR. The results show that about 8% of pixels exceed this threshold, indicating that a considerable number of pixels in the ground truth deviate from the original images.
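The corresponding statistic can be computed with a similar masked comparison; the threshold of 3 follows the text above, and counting a pixel as differing when any channel exceeds the threshold is an assumption:

```python
import numpy as np

def fraction_above_threshold(original, ground_truth, stroke_mask, threshold=3):
    """Fraction of non-stroke pixels whose absolute difference exceeds `threshold`."""
    diff = np.abs(original.astype(np.int16) - ground_truth.astype(np.int16))
    if diff.ndim == 3:  # a pixel counts as differing if any channel exceeds the threshold
        diff = diff.max(axis=-1)
    return float((diff[~stroke_mask] > threshold).mean())
```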
Figure 1: Examples from the SCUT-EnsText dataset. From left to right: original image (first), ground truth image (second), map of pixels whose value difference between the original and ground truth image exceeds the set threshold (third), and the absolute difference between the original image and ground truth (fourth). Ideally, there should be no difference between the two images outside text stroke regions, yet the regions surrounding the text contain altered pixels.
In addition, since the SCUT-EnsText dataset is stored and distributed in JPEG format, there are additional compression artifacts that differ between the original images with text and their ground truth counterparts. The visual artifacts — originating both from manual editing and JPEG compression — can compromise the fairness and reliability of evaluation results, particularly those that rely on pixel-level similarity between the original and ground truth.
In real-world scenarios, scene text is typically placed on simple, uncluttered backgrounds to enhance its readability. As a result, most text instances in the SCUT-EnsText dataset appear on plain, uniform backgrounds with minimal visual complexity, such as a billboard. Similarly, in the case of SCUT-SynText, the SynthText algorithm, which is designed to mimic the geometric properties of scene text, tends to place text in homogeneous regions with simple textures, such as sky, walls, or signboards. As a consequence, the majority of text removal scenarios in both datasets are relatively easy, as they require minimal understanding of structural context or semantic coherence.
In contrast, overlay text found in advertisements, magazines, and similar media can overlap with any part of the image, including complex objects. This makes the text removal process more challenging, as it requires the model to preserve both structural integrity and semantic coherence in the inpainted regions.
We assess the complexity of text backgrounds in images using information entropy. We first identify text stroke regions using a text stroke segmentation model [25], then expand the detected text stroke regions \(Z\) by applying dilation with a large kernel to obtain text vicinity regions \(\tilde{Z}\). The entropy of the background surrounding the text is then computed as: \[\label{eq:entropy-formula} H(X) = -\sum_{x \in X} p(x) \log p(x),\tag{1}\] \[X = I(\tilde{Z} - Z),\] where \(I(\tilde{Z} - Z)\) denotes the region of the image surrounding the text stroke regions, selected by subtracting the mask \(Z\) from \(\tilde{Z}\). As Table 8 in Section 5.4 shows, our new dataset exhibits higher entropy values, indicating that the backgrounds around text regions are more visually complex.
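A sketch of this measurement is given below; the grayscale conversion, the dilation kernel size, and the use of log base 2 are assumptions rather than details taken from the paper:

```python
import cv2
import numpy as np

def text_vicinity_entropy(gray_image, stroke_mask, kernel_size=51):
    """Shannon entropy of pixel intensities in the region surrounding text strokes.

    gray_image: uint8 grayscale image (H, W).
    stroke_mask: binary mask (H, W), nonzero on text strokes (Z).
    """
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    dilated = cv2.dilate(stroke_mask.astype(np.uint8), kernel)   # dilated strokes (Z~)
    vicinity = (dilated > 0) & (stroke_mask == 0)                # Z~ - Z: ring around the strokes
    values = gray_image[vicinity]
    if values.size == 0:
        return 0.0
    p = np.bincount(values, minlength=256).astype(np.float64)
    p /= p.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```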
Commonly used metrics for evaluating text removal, such as structural similarity (SSIM), peak signal-to-noise ratio (PSNR), and the average gray-level absolute difference (AGE), focus on how closely the output matches the ground truth. Other metrics serve different purposes: text recall measures the amount of text remaining in the image, and the Fréchet Inception Distance (FID) measures the distance between the distributions of generated results and ground truth images. However, text removal results can still be visually convincing even if they differ from the ground truth in the inpainted regions, especially since there is no single correct way to reconstruct the occluded content. Penalizing outputs that look natural but diverge from the ground truth overlooks this ambiguity. As illustrated in Figure 2, metrics commonly used for text removal evaluation often fail to account for this. Therefore, we advocate for the inclusion of evaluation methods that assess the visual quality of the results independently of their similarity to the ground truth.
Table 1: Comparison of text removal benchmarks.

| | artifact-free GT | complex text background |
|---|---|---|
| SCUT-SynText | ✔ | |
| SCUT-EnsText | | |
| OTR | ✔ | ✔ |
Table 1 highlights the differences between the individual datasets. SCUT-EnsText suffers from artifacts in the ground truth caused by manual editing during its creation. Additionally, since it features real-world scenes, most of the text appears on simple backgrounds that ensure good readability in real life. SCUT-SynText does not contain artifacts in the ground truth, but the design of the SynthText algorithm results in text being mostly placed on simple backgrounds. In contrast, OTR avoids the ground truth artifact issue by synthetically compositing text onto clean, text-free images, and it introduces more challenging text removal scenarios with text over complex backgrounds thanks to object-aware text placement.
This section describes the process used to construct our new dataset.
We use images and annotations from Open Images V7 [26] and MS-COCO [27] datasets to build our text removal dataset. Open Images V7 provides hierarchical class annotations with general and more fine-grained categories. From this dataset, we select images labeled with the following general classes: animal, food, furniture, home appliance, kitchen appliance, musical instrument, person, plant, sports equipment, tableware, toy, vehicle. For MS-COCO, we rely on panoptic segmentation annotations and select images containing any of the following classes: dirt, floor, grass, pavement, river, road, sand, sea, sky, snow. Both datasets are distributed under a CC BY 4.0 license, while individual images are licensed under CC BY 2.0.
We create paired data consisting of images with overlay text and their corresponding ground truth images without any text. To ensure that the ground truth is clean and free from pre-existing text that we do not want to consider in our evaluation, we apply a scene text detection model [28] to filter out any images that contain text prior to our processing.
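A minimal filtering sketch is shown below; `has_text` is a hypothetical wrapper around the scene text detector, not an actual API of that model:

```python
from pathlib import Path

def has_text(image_path) -> bool:
    """Hypothetical wrapper around the scene text detection model [28]:
    returns True if any text region is detected in the image."""
    raise NotImplementedError

def collect_text_free_images(image_dir):
    """Keep only images without pre-existing text as ground truth candidates."""
    return [path for path in Path(image_dir).glob("*.jpg") if not has_text(path)]
```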
In order to make text removal more challenging, we position the text in such a way that it overlaps with specific objects in the images. To achieve that, we use bounding box annotations of objects for images from Open Images V7 and place the text randomly within these regions. For MS-COCO images, we utilize segmentation masks of our selected classes and place text randomly in the masked regions. All of our selected MS-COCO classes represent background and terrain elements, such as sky, sea, and road, which usually consist of simple textures that are easy to inpaint.
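The placement logic can be sketched as follows; the exact sampling strategy is an assumption, and clamping to the image bounds is only illustrative:

```python
import random
import numpy as np

def sample_position_in_bbox(bbox, text_w, text_h):
    """Sample a top-left corner so the text block overlaps an Open Images bounding box.

    bbox: (x_min, y_min, x_max, y_max) in pixels.
    """
    x_min, y_min, x_max, y_max = bbox
    x = random.uniform(x_min, max(x_min, x_max - text_w))
    y = random.uniform(y_min, max(y_min, y_max - text_h))
    return x, y

def sample_position_in_mask(mask, text_w, text_h):
    """Sample a top-left corner inside an MS-COCO panoptic segmentation mask.

    mask: boolean array (H, W) for the selected background class (e.g. sky or sea).
    """
    ys, xs = np.nonzero(mask)
    i = random.randrange(len(ys))
    h, w = mask.shape
    # Anchor the text block at a random mask pixel, clamped to the image bounds.
    return min(xs[i], w - text_w), min(ys[i], h - text_h)
```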
To render text on images, we use the skia-python graphics library and approximately 200 font files from Google Fonts. Font sizes are randomly selected from a predefined range, and long text is split into multiple text lines. The text that we place into images is generated by a vision-language model (VLM) [29] instructed to imagine an article or advertisement relevant to the image and make a short headline or a catchphrase for it. The exact prompt is:
Think of a fictional article that is related to the image and think of a short phrase that could be a headline of the article, or think of a fictional advertisement that the image could be used for and think of a short phrase that could be used as a catchphrase for the advertisement. The phrase can be 1 to 20 words long.
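A sketch of generating such phrases with SmolVLM through the transformers library is shown below; the checkpoint name, generation settings, and output post-processing are assumptions, not details taken from the paper:

```python
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "HuggingFaceTB/SmolVLM-Instruct"  # assumed checkpoint
PROMPT = (
    "Think of a fictional article that is related to the image and think of a short "
    "phrase that could be a headline of the article, or think of a fictional "
    "advertisement that the image could be used for and think of a short phrase that "
    "could be used as a catchphrase for the advertisement. The phrase can be 1 to 20 "
    "words long."
)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID)

def generate_phrase(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    messages = [{"role": "user",
                 "content": [{"type": "image"}, {"type": "text", "text": PROMPT}]}]
    chat = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=chat, images=[image], return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    decoded = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
    # Keep only the assistant reply; the exact chat formatting may differ by version.
    return decoded.split("Assistant:")[-1].strip()
```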
We tried several VLMs, including PaliGemma 3B [30] and CogVLM [31], and empirically found that SmolVLM [29] produces the best results for our objective while also being efficient. In contrast, PaliGemma 3B generated very repetitive phrases, while CogVLM, despite having eight times as many parameters as SmolVLM, did not demonstrate a noticeable performance advantage.
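Returning to the rendering step, a minimal skia-python sketch is shown below; the font size range, text color, and single-line drawing are simplifications, and the actual pipeline additionally handles line breaking:

```python
import random
import skia

FONT_SIZE_RANGE = (24, 96)  # illustrative range, not the one used in the dataset

def render_text(background_path, out_path, text, font_path, x, y):
    """Draw `text` onto a background image and save the composite as PNG."""
    background = skia.Image.open(background_path)
    surface = skia.Surface(background.width(), background.height())
    canvas = surface.getCanvas()
    canvas.drawImage(background, 0, 0)

    typeface = skia.Typeface.MakeFromFile(font_path)
    font = skia.Font(typeface, random.uniform(*FONT_SIZE_RANGE))
    paint = skia.Paint(AntiAlias=True, Color=skia.ColorWHITE)
    canvas.drawString(text, x, y, font, paint)

    surface.makeImageSnapshot().save(out_path, skia.kPNG)
```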
Figure 3 provides an overview of our data generation process.
The test set of our dataset consists of two subsets: OTR-easy, containing data created from images sourced from the MS-COCO dataset, and OTR-hard, containing data created from images sourced from Open Images V7. Tables 2 and 3 show the number of images per class in each subset. In total, OTR-easy consists of 5,538 samples and OTR-hard consists of 9,055 samples.
Each sample from the dataset consists of an image with rendered text, its corresponding original image without any text, and word-level annotations specifying the bounding boxes of their locations along with their transcriptions. The images are stored in PNG format to avoid degradation by JPEG compression artifacts, and annotations are stored in JSON files.
We also present a training set of 74,716 samples that can be used for training from scratch or for finetuning pretrained models for overlay text removal.
Table 2: Number of images per class in the OTR-easy subset.

| Class | Images | Class | Images |
|---|---|---|---|
| dirt | 1000 | road | 608 |
| floor | 1000 | sand | 77 |
| grass | 282 | sea | 98 |
| pavement | 1000 | sky | 417 |
| river | 1000 | snow | 56 |

Table 3: Number of images per class in the OTR-hard subset.

| Class | Images | Class | Images |
|---|---|---|---|
| animal | 714 | person | 1000 |
| food | 1000 | plant | 1000 |
| furniture | 1000 | sports equipment | 1000 |
| home appliance | 188 | tableware | 1000 |
| kitchen appliance | 168 | toy | 596 |
| musical instrument | 389 | vehicle | 1000 |
Table 4: Baseline results on the OTR-hard subset.

| Method | PSNR \(\uparrow\) | SSIM \(\uparrow\) | AGE \(\downarrow\) | pEPs \(\downarrow\) | pCEPs \(\downarrow\) | QualiCLIP \(\uparrow\) | TOPIQ \(\uparrow\) | LIQE \(\uparrow\) | HyperIQA \(\uparrow\) |
|---|---|---|---|---|---|---|---|---|---|
| EraseNet [4] (TIP ’20) | 26.24 | 93.66 | 4.09 | 0.038 | 0.026 | 0.688 | 0.526 | 3.47 | 0.545 |
| MTRNet++ [16] (CVIU ’20) | 28.32 | 94.85 | 2.30 | 0.032 | 0.019 | 0.676 | 0.563 | 3.59 | 0.575 |
| PERT [20] (CVIU ’23) | 26.97 | 94.18 | 2.95 | 0.038 | 0.026 | 0.688 | 0.547 | 3.47 | 0.559 |
| SAEN [32] (WACV ’23) | 26.11 | 93.92 | 3.52 | 0.039 | 0.026 | 0.671 | 0.549 | 3.51 | 0.559 |
| ViT-Eraser [6] (AAAI ’24) | 29.56 | 95.51 | 2.29 | 0.027 | 0.017 | 0.696 | 0.542 | 3.45 | 0.554 |
| DBNet++ + SAM + LaMa | 31.76 | 95.74 | 1.82 | 0.025 | 0.013 | 0.725 | 0.551 | 3.71 | 0.561 |
| DBNet++ + SAM + FLUX.1 Fill | 31.18 | 95.42 | 1.99 | 0.027 | 0.014 | 0.726 | 0.557 | 3.78 | 0.569 |
| DBNet++ + SAM + SD 1.5 | 30.02 | 94.56 | 2.31 | 0.032 | 0.018 | 0.728 | 0.566 | 3.79 | 0.576 |
Table 5: Baseline results on the OTR-easy subset.

| Method | PSNR \(\uparrow\) | SSIM \(\uparrow\) | AGE \(\downarrow\) | pEPs \(\downarrow\) | pCEPs \(\downarrow\) | QualiCLIP \(\uparrow\) | TOPIQ \(\uparrow\) | LIQE \(\uparrow\) | HyperIQA \(\uparrow\) |
|---|---|---|---|---|---|---|---|---|---|
| EraseNet [4] (TIP ’20) | 28.85 | 94.83 | 3.88 | 0.039 | 0.026 | 0.733 | 0.520 | 3.71 | 0.549 |
| MTRNet++ [16] (CVIU ’20) | 31.92 | 95.39 | 2.33 | 0.032 | 0.019 | 0.728 | 0.535 | 3.78 | 0.558 |
| PERT [20] (CVIU ’23) | 33.00 | 94.95 | 2.79 | 0.035 | 0.024 | 0.732 | 0.527 | 3.69 | 0.551 |
| SAEN [32] (WACV ’23) | 29.43 | 94.76 | 3.45 | 0.036 | 0.024 | 0.738 | 0.528 | 3.71 | 0.555 |
| ViT-Eraser [6] (AAAI ’24) | 32.61 | 95.95 | 2.25 | 0.026 | 0.016 | 0.759 | 0.530 | 3.72 | 0.553 |
| DBNet++ + SAM + LaMa | 52.31 | 96.06 | 1.79 | 0.024 | 0.012 | 0.761 | 0.538 | 3.87 | 0.558 |
| DBNet++ + SAM + FLUX.1 Fill | 51.51 | 95.85 | 1.93 | 0.026 | 0.013 | 0.765 | 0.540 | 3.92 | 0.560 |
| DBNet++ + SAM + SD 1.5 | 51.18 | 95.21 | 2.18 | 0.029 | 0.015 | 0.763 | 0.547 | 3.92 | 0.566 |
We use two types of methods to obtain baseline results for our benchmark: (1) existing text removal methods [4], [6], [10], [20], [32], and (2) general image inpainting models [33]–[35] combined with a separate text detector [28] and Segment Anything model [36].
Existing methods for text removal are pretrained specifically on data for scene text removal. In contrast, general inpainting models have been trained on large-scale image datasets, enabling them to perform effectively across a wide range of image domains. General inpainting models are used together with a text detection model that detects all text regions in the image that have to be inpainted. To minimize the area that needs to be inpainted, the bounding boxes produced by the text detector are further refined using the Segment Anything [36] model to segment the text strokes within them.
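A schematic sketch of this two-stage pipeline is given below; `detect_text_boxes` and `inpaint` are hypothetical placeholders standing in for the actual detector and inpainting model (DBNet++ and LaMa, FLUX.1 Fill, or SD 1.5 in our experiments), while the SAM calls follow the segment-anything API:

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

def detect_text_boxes(image):
    """Hypothetical wrapper around the text detector [28].
    Returns a list of (x_min, y_min, x_max, y_max) boxes."""
    raise NotImplementedError

def inpaint(image, mask):
    """Hypothetical wrapper around a general inpainting model (e.g. LaMa)."""
    raise NotImplementedError

def remove_text(image, sam_checkpoint="sam_vit_h.pth"):
    """image: uint8 RGB array of shape (H, W, 3)."""
    sam = sam_model_registry["vit_h"](checkpoint=sam_checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image)

    stroke_mask = np.zeros(image.shape[:2], dtype=bool)
    for box in detect_text_boxes(image):
        # Refine each detected box to a stroke-level mask so that only the
        # text pixels, not the whole box, are inpainted.
        masks, _, _ = predictor.predict(box=np.array(box), multimask_output=False)
        stroke_mask |= masks[0].astype(bool)

    return inpaint(image, stroke_mask)
```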
As discussed in Section 3, commonly used evaluation metrics for text removal are not sufficient to thoroughly evaluate whether text removal results look natural. To supplement commonly used metrics that depend on the similarity between generated results and ground truth images, we employ additional metrics designed for no-reference image quality assessment (NR-IQA): QualiCLIP [37], a CLIP-based self-supervised method trained on increasingly degraded images; LIQE [38], a multitask learning method leveraging knowledge from other tasks; TOPIQ [39], a top-down method focusing on semantically important local distortions; and HyperIQA [40], a content-aware self-adaptive classification network. These methods are designed to assess the perceptual quality of images in accordance with human subjective perception.
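These NR-IQA scores can be obtained, for example, with the pyiqa toolbox; the metric identifiers below are assumptions and may differ between toolbox versions:

```python
import pyiqa
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed pyiqa metric names; check pyiqa.list_models() for the exact identifiers.
METRIC_NAMES = ["qualiclip", "liqe", "topiq_nr", "hyperiqa"]
metrics = {name: pyiqa.create_metric(name, device=device) for name in METRIC_NAMES}

def score_image(image_path):
    """No-reference quality scores for a single text removal result."""
    return {name: float(metric(image_path)) for name, metric in metrics.items()}
```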
Besides the newly adopted metrics, we also use metrics widely used in existing works on text removal methods, i.e., PSNR (peak signal-to-noise ratio), SSIM (structural similarity), AGE (average of gray-level absolute difference), pEPs (percentage of error pixels), and pCEPs (percentage of four-connected neighbors error pixels).
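For completeness, AGE, pEPs, and pCEPs can be sketched as follows; the definitions follow common usage in the scene text removal literature, and the error-pixel threshold of 20 is an assumption rather than a value specified here:

```python
import numpy as np

def age_peps_pceps(gt_gray, pred_gray, threshold=20):
    """AGE, pEPs, and pCEPs between two uint8 grayscale images."""
    diff = np.abs(gt_gray.astype(np.float64) - pred_gray.astype(np.float64))
    age = diff.mean()              # average gray-level absolute difference

    error = diff > threshold       # error pixels
    peps = error.mean()

    # A clustered error pixel is an error pixel whose four-connected
    # neighbors are all error pixels as well.
    padded = np.pad(error, 1, mode="constant")
    clustered = (error
                 & padded[:-2, 1:-1] & padded[2:, 1:-1]
                 & padded[1:-1, :-2] & padded[1:-1, 2:])
    pceps = clustered.mean()
    return age, peps, pceps
```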
Tables 4 and 5 show the evaluation results on our OTR-hard and OTR-easy subsets, respectively. As can be seen, the scores on OTR-hard are lower for all methods across most metrics, indicating that OTR-hard presents more challenging scenarios.
Figure 5: Correlation between the information entropy and several metric scores.
Figure 4 shows text removal results using three different methods: PERT [20], ViT-Eraser [6] and FLUX.1 Fill [34]. From a human perspective, FLUX.1 Fill produces better results, but as its outputs slightly differ from the ground truth, it underperforms in metrics relying on direct similarity to the ground truth. In contrast, NR-IQA metrics rank FLUX.1 Fill as the top-performing method. This showcases a discrepancy between metrics focusing on the similarity between the result and ground truth and metrics focusing on perceptual image quality. In practical use, visual quality often matters more than an exact match with the ground truth. This highlights the relevance of NR-IQA metrics for text removal evaluation.
We use information entropy to measure the complexity of backgrounds around text, as introduced in Equation 1 in Section 3.
Table 6 shows the information entropy for each class in OTR-hard, the subset designed to feature text backgrounds that are more difficult to inpaint. In contrast, samples in OTR-easy exhibit lower information entropy on average, as shown in Table 7. This indicates that text backgrounds in OTR-easy are indeed less complex than those in OTR-hard. In particular, classes such as sky and sea yield the lowest entropy values, which aligns with our expectations since sea and sky typically have very simple, uniform textures.
Table 8 presents the information entropy of backgrounds around text instances for each dataset. The higher entropy values indicate that the backgrounds in our dataset, in particular the OTR-hard subset, are more complex, making it more challenging for text removal models to produce natural-looking results.
Figure 5 illustrates the correlation between the information entropy and evaluation scores using the SSIM, QualiCLIP, and TOPIQ metrics. The information entropy values correspond to several selected classes from our dataset, namely sky, snow, grass, person, and plant. For metrics that have been widely used for text removal evaluation, such as SSIM, performance on classes with lower information entropy tends to be better than on those with higher entropy. This suggests a clear relationship between the quality of results and text background complexity, indicating that text on simple backgrounds is easier to remove. The QualiCLIP and TOPIQ results indicate that models that are better at understanding overall structure and semantics, such as the diffusion-based FLUX.1, tend to outperform other methods particularly on classes with higher entropy values, i.e., classes with more complex backgrounds. This suggests that datasets featuring text on complex backgrounds are essential for highlighting the strengths of models with more advanced inpainting capabilities. Additionally, the variation in metric scores across different methods is larger for classes with higher entropy, indicating that more challenging scenarios are more effective at distinguishing the performance of text removal methods.
Table 6: Information entropy \(H(X)\) of text backgrounds per class in OTR-hard.

| Class | \(H(X)\) | Class | \(H(X)\) |
|---|---|---|---|
| animal | 6.91 | person | 6.93 |
| food | 7.07 | plant | 7.07 |
| furniture | 6.98 | sports equipment | 6.79 |
| home appliance | 6.87 | tableware | 6.95 |
| kitchen appliance | 7.02 | toy | 7.05 |
| musical instrument | 6.89 | vehicle | 6.97 |
Table 7: Information entropy \(H(X)\) of text backgrounds per class in OTR-easy.

| Class | \(H(X)\) | Class | \(H(X)\) |
|---|---|---|---|
| dirt | 6.74 | road | 6.66 |
| floor | 6.64 | sand | 6.60 |
| grass | 6.66 | sea | 6.33 |
| pavement | 6.77 | sky | 6.02 |
| river | 6.70 | snow | 6.22 |
Table 8: Average information entropy \(H(X)\) of text backgrounds per dataset.

| Dataset | \(H(X)\) |
|---|---|
| SCUT-EnsText | 6.32 |
| SCUT-SynText | 6.44 |
| OTR-easy | 6.64 |
| OTR-hard | 6.96 |
We introduced a new dataset for text removal that addresses key limitations of existing benchmarks, such as artifacts in ground truth and low background complexity. By simulating overlay text in advertisements and printed media, our dataset provides a more challenging and diverse testing benchmark. We also highlighted the need for better evaluation metrics that go beyond pixel-level similarity.