October 03, 2025
Text removal is a crucial task in computer vision with applications such as privacy preservation, image editing, and media reuse. While existing research has primarily focused on scene text removal in natural images, limitations in current datasets hinder out-of-domain generalization and accurate evaluation. In particular, widely used benchmarks such as SCUT-EnsText suffer from ground truth artifacts due to manual editing, overly simplistic text backgrounds, and evaluation metrics that do not capture the quality of generated results. To address these issues, we introduce an approach to synthesizing a text removal benchmark applicable to domains beyond scene text. Our dataset features text rendered on complex backgrounds using object-aware placement and vision-language model-generated content, ensuring clean ground truth and challenging text removal scenarios. The dataset is available at https://huggingface.co/datasets/cyberagent/OTR.
CCS Concepts: • Computing methodologies → Image manipulation; • Computing methodologies → Computer vision tasks.
Text removal aims to erase text from images and fill in the background with seamless pixels, preserving the visual quality of the images. Text removal has a wide range of applications, such as removing captions from videos [1], [2], obscuring private or sensitive information [3]–[6], and editing text in images [7], where removal is usually an important pre-processing step for the subsequent workflow.
The majority of the text removal literature has focused on removing text from natural scene images [3]–[6], [8], [9], commonly referred to as scene text. One of the early works on scene text removal introduced benchmark datasets [8] that have remained the primary standard for evaluation to this day, where scene examples often include traffic signs or billboards. While scene text is an important target domain, we argue that the current benchmark is not always applicable to evaluating text removal in other domains, such as creative domains involving printed posters or advertising banners, which present different requirements for background inpainting. For example, scene text typically does not cross object boundaries, but poster text can appear on top of a background containing multiple objects.
Another qualitative problem in the existing benchmark is that the dataset contains pixel-level artifacts originating from manipulating the ground truth images with an image editing tool, which sometimes leaves visible noise in the inpainted background. These artifacts introduce noise into evaluation metrics such as PSNR, but they are hard to fix because we cannot remove text from real-world scenes to collect perfectly clean ground truth.
In this paper, we address the limitations of the current scene text benchmarks with a synthetic approach. With the creative domain in mind, we synthetically render overlay text on images that initially contain no text, and use the composited images to train or evaluate text removal models. This approach ensures that the ground truth remains completely artifact-free, because the background images are always clean and never undergo pixel-level manipulation. Our synthetic approach can also control text placement using the location of objects and concepts in the image. This allows us to create more challenging text removal scenarios by positioning text over regions with complex structures and textures. In experiments, we show that existing benchmarks are limited in their ability to accurately compare methods in terms of qualitative results, and how our synthetic benchmark provides better measurement capability.
The main contributions of the newly proposed dataset, which we call OTR (Overlay Text Removal), are as follows:
We present a synthetic approach to building a dataset for evaluating text removal methods, which artificially overlays text on complex backgrounds. Our synthetic approach guarantees an artifact-free background in the ground truth.
We empirically study the evaluation capability of the proposed dataset and show that our approach can better capture the qualitative characteristics compared to the existing scene text benchmarks.
The first attempts at text removal targeted captions and subtitles in videos, employing spatial and temporal restoration techniques across consecutive frames [1], [2]. Owing to the advancements in image generation and editing achieved by deep learning methods, the recent focus of research has shifted toward scene text removal, that is, the task of eliminating text from natural scene images. Nakamura et al. [3] were the first to apply a convolutional neural network to remove scene text from images. Further research followed shortly after, with several methods [4], [8]–[10] leveraging the progress made by generative adversarial networks (GANs) in image generation and editing [11], [12]. Recent methods have adopted newer techniques and architectures such as attention mechanisms [13] and vision transformers [6].
Text removal methods can be classified into two categories: (1) one-stage methods, which directly transform input images containing text into text-free outputs in a single step [3], [4], [6], [8], [10], [14], [15], and (2) two-stage methods, which first explicitly detect or estimate text regions and then inpaint only those areas [5], [9], [13], [16]–[20].
SCUT [8] is a popular benchmark for studying text removal and includes two benchmark sets, each built with a different approach.
The SCUT-EnsText [8] benchmark consists of pairs of scene text images and their manually edited counterparts, where the text has been removed using Adobe Photoshop. The scene text images were collected from other scene text datasets, namely ICDAR 2015 [21] and MLT [22]. Lyu et al. recently introduced another dataset for scene text removal [23], built with a similar approach. While manually created ground truth examples look natural, they have an inherent drawback: the background often exhibits pixel-level artifacts.
The SCUT-SynText [8] benchmark contains synthetic scene text images generated with the SynthText [24] algorithm, along with corresponding background images serving as text-free ground truth. The synthetic approach completely avoids the background artifacts introduced by image manipulation, but has the limitation that the resulting images do not always look natural as scene images. Because SynthText is designed for studying scene text detection, text is placed on relatively uniform backgrounds, such as the sky, a water surface, or the ground. This is fine for training a scene text detection model, but it is an issue for text removal because text never appears on a complex background, yielding only easy-to-inpaint examples. This limitation becomes problematic when studying text removal in non-scene images.
In this section, we study problems in the existing benchmarks.
Manually created ground truth tends to contain a substantial number of pixels surrounding the text that differ from those in the originals. Ideally, only the pixels corresponding to text strokes would be altered in the ground truth. However, achieving such precision in an image editor is virtually impossible, and broader areas around the text strokes are often modified to produce visually plausible results. As a result, the ground truth contains artifacts and visible signs of image editing that make it differ from the original.
We evaluate this discrepancy by computing the PSNR between the original images and their corresponding ground truth in the SCUT-EnsText benchmark. Figure 1 shows examples of original and ground truth image pairs and their pixel-level differences. Since the ground truth does not contain any text, we exclude text stroke regions from the PSNR computation by detecting them with a text stroke segmentation model [25]. We adopt a common implementation of PSNR that prevents division by zero by adding a constant \(\epsilon = 10^{-10}\) to the mean squared error (MSE). The PSNR between the original image \(I\) consisting of \(n\) pixels and its approximation \(\hat{I}\) is then computed as follows: \[\text{PSNR} = 20 \cdot \log_{10}(m) - 10 \cdot \log_{10}(\text{MSE} + \epsilon),\] \[\text{MSE} = \frac{1}{n} \sum_{i=1}^n (I_i - \hat{I}_i)^2.\] For 8-bit images with a maximum pixel value \(m\) of 255, two perfectly identical images yield a PSNR of about 148 dB. In contrast, the average PSNR across the whole test set of SCUT-EnsText is 42.72 dB, which roughly corresponds to \(\frac{1}{10}\) of all pixels in the image differing by about 2.5% in grayscale intensity.
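As a reference, the masked PSNR computation can be sketched as follows; `stroke_mask` is assumed to be the binary text-stroke mask produced by the segmentation model, which is not part of this snippet:

```python
import numpy as np

def masked_psnr(original, approx, stroke_mask, max_value=255.0, eps=1e-10):
    """PSNR between two 8-bit images, ignoring text-stroke pixels.

    original, approx: uint8 arrays of identical shape (H, W) or (H, W, C).
    stroke_mask: boolean array (H, W), True where text strokes were detected.
    """
    keep = ~stroke_mask
    if original.ndim == 3:  # broadcast the mask over color channels
        keep = np.repeat(keep[..., None], original.shape[-1], axis=-1)
    diff = original.astype(np.float64) - approx.astype(np.float64)
    mse = np.mean(diff[keep] ** 2)
    # With eps = 1e-10, two identical 8-bit images score about 148 dB.
    return 20 * np.log10(max_value) - 10 * np.log10(mse + eps)
```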
To further quantify the discrepancies, we also compute the percentage of pixels — excluding text stroke regions — whose absolute difference in pixel value between the original image and ground truth exceeds a given threshold that we set to 3, a level sufficient to noticeably affect PSNR. The results show that about 8% of pixels exceed this threshold, indicating that a considerable number of pixels in the ground truth deviate from the original images.
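The corresponding statistic can be computed with a similar masked comparison; the threshold of 3 follows the text above, and counting a pixel as differing when any channel exceeds the threshold is an assumption:

```python
import numpy as np

def fraction_above_threshold(original, ground_truth, stroke_mask, threshold=3):
    """Fraction of non-stroke pixels whose absolute difference exceeds `threshold`."""
    diff = np.abs(original.astype(np.int16) - ground_truth.astype(np.int16))
    if diff.ndim == 3:  # a pixel counts as differing if any channel exceeds the threshold
        diff = diff.max(axis=-1)
    return float((diff[~stroke_mask] > threshold).mean())
```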
Figure 1: Examples from the SCUT-EnsText dataset. From left to right: original image (first), ground truth image (second), map of pixels whose value difference between the original and ground truth image exceeds the set threshold (third), and the absolute difference between the original image and ground truth (fourth). Ideally, there should be no difference between the two images outside text stroke regions, yet the regions surrounding the text contain altered pixels.
In addition, since the SCUT-EnsText dataset is stored and distributed in JPEG format, there are additional compression artifacts that differ between the original images with text and their ground truth counterparts. The visual artifacts — originating both from manual editing and JPEG compression — can compromise the fairness and reliability of evaluation results, particularly those that rely on pixel-level similarity between the original and ground truth.
In real-world scenarios, scene text is typically placed on simple, uncluttered backgrounds to enhance its readability. As a result, most text instances in the SCUT-EnsText dataset appear on plain, uniform backgrounds with minimal visual complexity, such as a billboard. Similarly, in the case of SCUT-SynText, the SynthText algorithm, which is designed to mimic the geometric properties of scene text, tends to place text in homogeneous regions with simple textures, such as sky, walls, or signboards. As a consequence, the majority of text removal scenarios in both datasets are relatively easy, as they require minimal understanding of structural context or semantic coherence.
In contrast, overlay text found in advertisements, magazines, and similar media can overlap with any part of the image, including complex objects. This makes the text removal process more challenging, as it requires the model to preserve both structural integrity and semantic coherence in the inpainted regions.
We assess the complexity of text backgrounds in images using information entropy. We first identify text stroke regions using a text stroke segmentation model [25], then expand the detected text stroke regions \(Z\) by applying dilation with a large kernel to obtain text vicinity regions \(\tilde{Z}\). The entropy of the background surrounding the text is then computed as: \[\label{eq:entropy-formula} H(X) = -\sum_{x \in X} p(x) \log p(x),\tag{1}\] \[X = I(\tilde{Z} - Z),\] where \(I(\tilde{Z} - Z)\) denotes the region of the image surrounding the text stroke regions, selected by subtracting the mask \(Z\) from \(\tilde{Z}\). As Table 8 in Section 5.4 shows, our new dataset exhibits higher entropy values, indicating that the backgrounds around text regions are more visually complex.
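A sketch of this measurement is given below; the grayscale conversion, the dilation kernel size, and the use of log base 2 are assumptions rather than details taken from the paper:

```python
import cv2
import numpy as np

def text_vicinity_entropy(gray_image, stroke_mask, kernel_size=51):
    """Shannon entropy of pixel intensities in the region surrounding text strokes.

    gray_image: uint8 grayscale image (H, W).
    stroke_mask: binary mask (H, W), nonzero on text strokes (Z).
    """
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    dilated = cv2.dilate(stroke_mask.astype(np.uint8), kernel)   # dilated strokes (Z~)
    vicinity = (dilated > 0) & (stroke_mask == 0)                # Z~ - Z: ring around the strokes
    values = gray_image[vicinity]
    if values.size == 0:
        return 0.0
    p = np.bincount(values, minlength=256).astype(np.float64)
    p /= p.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```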
Commonly used metrics for evaluating text removal, such as structural similarity (SSIM), peak signal-to-noise ratio (PSNR), and the average gray-level absolute difference (AGE), focus on how closely the output matches the ground truth. Other metrics serve different purposes: text recall measures the amount of text remaining in the image, and the Fréchet Inception Distance (FID) measures the distance between the distributions of generated results and ground truth images. However, text removal results can still be visually convincing even if they differ from the ground truth in the inpainted regions, especially since there is no single correct way to reconstruct the occluded content. Penalizing outputs that look natural but diverge from the ground truth overlooks this ambiguity. As illustrated in Figure 2, metrics commonly used for text removal evaluation often fail to account for this. Therefore, we advocate for the inclusion of evaluation methods that assess the visual quality of the results independently of their similarity to the ground truth.
Table 1: Comparison of text removal benchmarks.

| | artifact-free GT | complex text background |
|---|---|---|
| SCUT-SynText | ✔ | |
| SCUT-EnsText | | |
| OTR | ✔ | ✔ |
Table 1 highlights the differences between the individual datasets. SCUT-EnsText suffers from artifacts in the ground truth caused by manual editing during its creation. Additionally, since it features real-world scenes, most of the text appears on simple backgrounds that ensure good readability in real life. SCUT-SynText does not contain artifacts in the ground truth, but the design of the SynthText algorithm results in text being mostly placed on simple backgrounds. In contrast, OTR avoids the ground truth artifact issue by synthetically compositing text onto clean, text-free images, and it introduces more challenging text removal scenarios with text over complex backgrounds thanks to object-aware text placement.
This section describes the process used to construct our new dataset.
We use images and annotations from Open Images V7 [26] and MS-COCO [27] datasets to build our text removal dataset. Open Images V7 provides hierarchical class annotations with general and more fine-grained categories. From this dataset, we select images labeled with the following general classes: animal, food, furniture, home appliance, kitchen appliance, musical instrument, person, plant, sports equipment, tableware, toy, vehicle. For MS-COCO, we rely on panoptic segmentation annotations and select images containing any of the following classes: dirt, floor, grass, pavement, river, road, sand, sea, sky, snow. Both datasets are distributed under a CC BY 4.0 license, while individual images are licensed under CC BY 2.0.
We create paired data consisting of images with overlay text and their corresponding ground truth images without any text. To ensure that the ground truth is clean and free from pre-existing text that we do not want to consider in our evaluation, we apply a scene text detection model [28] to filter out any images that contain text prior to our processing.
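A minimal filtering sketch is shown below; `has_text` is a hypothetical wrapper around the scene text detector, not an actual API of that model:

```python
from pathlib import Path

def has_text(image_path) -> bool:
    """Hypothetical wrapper around the scene text detection model [28]:
    returns True if any text region is detected in the image."""
    raise NotImplementedError

def collect_text_free_images(image_dir):
    """Keep only images without pre-existing text as ground truth candidates."""
    return [path for path in Path(image_dir).glob("*.jpg") if not has_text(path)]
```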
In order to make text removal more challenging, we position the text in such a way that it overlaps with specific objects in the images. To achieve that, we use bounding box annotations of objects for images from Open Images V7 and place the text randomly within these regions. For MS-COCO images, we utilize segmentation masks of our selected classes and place text randomly in the masked regions. All of our selected MS-COCO classes represent background and terrain elements, such as sky, sea, and road, which usually consist of simple textures that are easy to inpaint.
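The placement logic can be sketched as follows; the exact sampling strategy is an assumption, and clamping to the image bounds is only illustrative:

```python
import random
import numpy as np

def sample_position_in_bbox(bbox, text_w, text_h):
    """Sample a top-left corner so the text block overlaps an Open Images bounding box.

    bbox: (x_min, y_min, x_max, y_max) in pixels.
    """
    x_min, y_min, x_max, y_max = bbox
    x = random.uniform(x_min, max(x_min, x_max - text_w))
    y = random.uniform(y_min, max(y_min, y_max - text_h))
    return x, y

def sample_position_in_mask(mask, text_w, text_h):
    """Sample a top-left corner inside an MS-COCO panoptic segmentation mask.

    mask: boolean array (H, W) for the selected background class (e.g. sky or sea).
    """
    ys, xs = np.nonzero(mask)
    i = random.randrange(len(ys))
    h, w = mask.shape
    # Anchor the text block at a random mask pixel, clamped to the image bounds.
    return min(xs[i], w - text_w), min(ys[i], h - text_h)
```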
To render text on images, we use the skia-python graphics library and approximately 200 font files from Google Fonts. Font sizes are randomly selected from a predefined range, and long text is split into multiple text lines. The text that we place into images is generated by a vision-language model (VLM) [29] instructed to imagine an article or advertisement relevant to the image and make a short headline or a catchphrase for it. The exact prompt is:
Think of a fictional article that is related to the image and think of a short phrase that could be a headline of the article, or think of a fictional advertisement that the image could be used for and think of a short phrase that could be used as a catchphrase for the advertisement. The phrase can be 1 to 20 words long.
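A sketch of generating such phrases with SmolVLM through the transformers library is shown below; the checkpoint name, generation settings, and output post-processing are assumptions, not details taken from the paper:

```python
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "HuggingFaceTB/SmolVLM-Instruct"  # assumed checkpoint
PROMPT = (
    "Think of a fictional article that is related to the image and think of a short "
    "phrase that could be a headline of the article, or think of a fictional "
    "advertisement that the image could be used for and think of a short phrase that "
    "could be used as a catchphrase for the advertisement. The phrase can be 1 to 20 "
    "words long."
)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID)

def generate_phrase(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    messages = [{"role": "user",
                 "content": [{"type": "image"}, {"type": "text", "text": PROMPT}]}]
    chat = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=chat, images=[image], return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    decoded = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
    # Keep only the assistant reply; the exact chat formatting may differ by version.
    return decoded.split("Assistant:")[-1].strip()
```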
We tried several VLMs, including PaliGemma 3B [30] and CogVLM [31], and empirically found that SmolVLM [29] produces the best results for our objective while also being efficient. In contrast, PaliGemma 3B generated very repetitive phrases, while CogVLM, despite having eight times as many parameters as SmolVLM, did not demonstrate a noticeable performance advantage.
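Returning to the rendering step, a minimal skia-python sketch is shown below; the font size range, text color, and single-line drawing are simplifications, and the actual pipeline additionally handles line breaking:

```python
import random
import skia

FONT_SIZE_RANGE = (24, 96)  # illustrative range, not the one used in the dataset

def render_text(background_path, out_path, text, font_path, x, y):
    """Draw `text` onto a background image and save the composite as PNG."""
    background = skia.Image.open(background_path)
    surface = skia.Surface(background.width(), background.height())
    canvas = surface.getCanvas()
    canvas.drawImage(background, 0, 0)

    typeface = skia.Typeface.MakeFromFile(font_path)
    font = skia.Font(typeface, random.uniform(*FONT_SIZE_RANGE))
    paint = skia.Paint(AntiAlias=True, Color=skia.ColorWHITE)
    canvas.drawString(text, x, y, font, paint)

    surface.makeImageSnapshot().save(out_path, skia.kPNG)
```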
Figure 3 provides an overview of our data generation process.
The test set of our dataset consists of two subsets: OTR-easy, containing data created from images sourced from the MS-COCO dataset, and OTR-hard, containing data created from images sourced from Open Images V7. Tables 2 and 3 show the number of images per class in each subset. In total, OTR-easy consists of 5,538 samples and OTR-hard consists of 9,055 samples.
Each sample from the dataset consists of an image with rendered text, its corresponding original image without any text, and word-level annotations specifying the bounding boxes of their locations along with their transcriptions. The images are stored in PNG format to avoid degradation by JPEG compression artifacts, and annotations are stored in JSON files.
We also present a training set of 74,716 samples that can be used for training from scratch or for finetuning pretrained models for overlay text removal.
Table 2: Number of images per class in the OTR-easy subset.

| Class | Images | Class | Images |
|---|---|---|---|
| dirt | 1000 | road | 608 |
| floor | 1000 | sand | 77 |
| grass | 282 | sea | 98 |
| pavement | 1000 | sky | 417 |
| river | 1000 | snow | 56 |

Table 3: Number of images per class in the OTR-hard subset.

| Class | Images | Class | Images |
|---|---|---|---|
| animal | 714 | person | 1000 |
| food | 1000 | plant | 1000 |
| furniture | 1000 | sports equipment | 1000 |
| home appliance | 188 | tableware | 1000 |
| kitchen appliance | 168 | toy | 596 |
| musical instrument | 389 | vehicle | 1000 |
Table 4: Baseline results on the OTR-hard subset.

| Method | PSNR \(\uparrow\) | SSIM \(\uparrow\) | AGE \(\downarrow\) | pEPs \(\downarrow\) | pCEPs \(\downarrow\) | QualiCLIP \(\uparrow\) | TOPIQ \(\uparrow\) | LIQE \(\uparrow\) | HyperIQA \(\uparrow\) |
|---|---|---|---|---|---|---|---|---|---|
| EraseNet [4] (TIP ’20) | 26.24 | 93.66 | 4.09 | 0.038 | 0.026 | 0.688 | 0.526 | 3.47 | 0.545 |
| MTRNet++ [16] (CVIU ’20) | 28.32 | 94.85 | 2.30 | 0.032 | 0.019 | 0.676 | 0.563 | 3.59 | 0.575 |
| PERT [20] (CVIU ’23) | 26.97 | 94.18 | 2.95 | 0.038 | 0.026 | 0.688 | 0.547 | 3.47 | 0.559 |
| SAEN [32] (WACV ’23) | 26.11 | 93.92 | 3.52 | 0.039 | 0.026 | 0.671 | 0.549 | 3.51 | 0.559 |
| ViT-Eraser [6] (AAAI ’24) | 29.56 | 95.51 | 2.29 | 0.027 | 0.017 | 0.696 | 0.542 | 3.45 | 0.554 |
| DBNet++ + SAM + LaMa | 31.76 | 95.74 | 1.82 | 0.025 | 0.013 | 0.725 | 0.551 | 3.71 | 0.561 |
| DBNet++ + SAM + FLUX.1 Fill | 31.18 | 95.42 | 1.99 | 0.027 | 0.014 | 0.726 | 0.557 | 3.78 | 0.569 |
| DBNet++ + SAM + SD 1.5 | 30.02 | 94.56 | 2.31 | 0.032 | 0.018 | 0.728 | 0.566 | 3.79 | 0.576 |
Table 5: Baseline results on the OTR-easy subset.

| Method | PSNR \(\uparrow\) | SSIM \(\uparrow\) | AGE \(\downarrow\) | pEPs \(\downarrow\) | pCEPs \(\downarrow\) | QualiCLIP \(\uparrow\) | TOPIQ \(\uparrow\) | LIQE \(\uparrow\) | HyperIQA \(\uparrow\) |
|---|---|---|---|---|---|---|---|---|---|
| EraseNet [4] (TIP ’20) | 28.85 | 94.83 | 3.88 | 0.039 | 0.026 | 0.733 | 0.520 | 3.71 | 0.549 |
| MTRNet++ [16] (CVIU ’20) | 31.92 | 95.39 | 2.33 | 0.032 | 0.019 | 0.728 | 0.535 | 3.78 | 0.558 |
| PERT [20] (CVIU ’23) | 33.00 | 94.95 | 2.79 | 0.035 | 0.024 | 0.732 | 0.527 | 3.69 | 0.551 |
| SAEN [32] (WACV ’23) | 29.43 | 94.76 | 3.45 | 0.036 | 0.024 | 0.738 | 0.528 | 3.71 | 0.555 |
| ViT-Eraser [6] (AAAI ’24) | 32.61 | 95.95 | 2.25 | 0.026 | 0.016 | 0.759 | 0.530 | 3.72 | 0.553 |
| DBNet++ + SAM + LaMa | 52.31 | 96.06 | 1.79 | 0.024 | 0.012 | 0.761 | 0.538 | 3.87 | 0.558 |
| DBNet++ + SAM + FLUX.1 Fill | 51.51 | 95.85 | 1.93 | 0.026 | 0.013 | 0.765 | 0.540 | 3.92 | 0.560 |
| DBNet++ + SAM + SD 1.5 | 51.18 | 95.21 | 2.18 | 0.029 | 0.015 | 0.763 | 0.547 | 3.92 | 0.566 |
We use two types of methods to obtain baseline results for our benchmark: (1) existing text removal methods [4], [6], [10], [20], [32], and (2) general image inpainting models [33]–[35] combined with a separate text detector [28] and Segment Anything model [36].
Existing methods for text removal are pretrained specifically on data for scene text removal. In contrast, general inpainting models have been trained on large-scale image datasets, enabling them to perform effectively across a wide range of image domains. General inpainting models are used together with a text detection model that detects all text regions in the image that have to be inpainted. To minimize the area that needs to be inpainted, the bounding boxes produced by the text detector are further refined using the Segment Anything [36] model to segment the text strokes within them.
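A schematic sketch of this two-stage pipeline is given below; `detect_text_boxes` and `inpaint` are hypothetical placeholders standing in for the actual detector and inpainting model (DBNet++ and LaMa, FLUX.1 Fill, or SD 1.5 in our experiments), while the SAM calls follow the segment-anything API:

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

def detect_text_boxes(image):
    """Hypothetical wrapper around the text detector [28].
    Returns a list of (x_min, y_min, x_max, y_max) boxes."""
    raise NotImplementedError

def inpaint(image, mask):
    """Hypothetical wrapper around a general inpainting model (e.g. LaMa)."""
    raise NotImplementedError

def remove_text(image, sam_checkpoint="sam_vit_h.pth"):
    """image: uint8 RGB array of shape (H, W, 3)."""
    sam = sam_model_registry["vit_h"](checkpoint=sam_checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image)

    stroke_mask = np.zeros(image.shape[:2], dtype=bool)
    for box in detect_text_boxes(image):
        # Refine each detected box to a stroke-level mask so that only the
        # text pixels, not the whole box, are inpainted.
        masks, _, _ = predictor.predict(box=np.array(box), multimask_output=False)
        stroke_mask |= masks[0].astype(bool)

    return inpaint(image, stroke_mask)
```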
As discussed in Section 3, commonly used evaluation metrics for text removal are not sufficient to thoroughly evaluate whether text removal results look natural. To supplement commonly used metrics that depend on the similarity between generated results and ground truth images, we employ additional metrics designed for no-reference image quality assessment (NR-IQA): QualiCLIP [37], a CLIP-based self-supervised method trained on increasingly degraded images; LIQE [38], a multitask learning method leveraging knowledge from other tasks; TOPIQ [39], a top-down method focusing on semantically important local distortions; and HyperIQA [40], a content-aware self-adaptive classification network. These methods are designed to assess the perceptual quality of images in accordance with human subjective perception.
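These NR-IQA scores can be obtained, for example, with the pyiqa toolbox; the metric identifiers below are assumptions and may differ between toolbox versions:

```python
import pyiqa
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed pyiqa metric names; check pyiqa.list_models() for the exact identifiers.
METRIC_NAMES = ["qualiclip", "liqe", "topiq_nr", "hyperiqa"]
metrics = {name: pyiqa.create_metric(name, device=device) for name in METRIC_NAMES}

def score_image(image_path):
    """No-reference quality scores for a single text removal result."""
    return {name: float(metric(image_path)) for name, metric in metrics.items()}
```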
Besides the newly adopted metrics, we also use metrics widely used in existing works on text removal methods, i.e., PSNR (peak signal-to-noise ratio), SSIM (structural similarity), AGE (average of gray-level absolute difference), pEPs (percentage of error pixels), and pCEPs (percentage of four-connected neighbors error pixels).
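For completeness, AGE, pEPs, and pCEPs can be sketched as follows; the definitions follow common usage in the scene text removal literature, and the error-pixel threshold of 20 is an assumption rather than a value specified here:

```python
import numpy as np

def age_peps_pceps(gt_gray, pred_gray, threshold=20):
    """AGE, pEPs, and pCEPs between two uint8 grayscale images."""
    diff = np.abs(gt_gray.astype(np.float64) - pred_gray.astype(np.float64))
    age = diff.mean()              # average gray-level absolute difference

    error = diff > threshold       # error pixels
    peps = error.mean()

    # A clustered error pixel is an error pixel whose four-connected
    # neighbors are all error pixels as well.
    padded = np.pad(error, 1, mode="constant")
    clustered = (error
                 & padded[:-2, 1:-1] & padded[2:, 1:-1]
                 & padded[1:-1, :-2] & padded[1:-1, 2:])
    pceps = clustered.mean()
    return age, peps, pceps
```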
Tables 4 and 5 show the evaluation results on our OTR-hard and OTR-easy subsets, respectively. As can be seen, the scores on OTR-hard are lower for all methods across most metrics, indicating that OTR-hard presents more challenging scenarios.
Figure 5: Correlation between the information entropy and several metric scores.
Figure 4 shows text removal results using three different methods: PERT [20], ViT-Eraser [6] and FLUX.1 Fill [34]. From a human perspective, FLUX.1 Fill produces better results, but as its outputs slightly differ from the ground truth, it underperforms in metrics relying on direct similarity to the ground truth. In contrast, NR-IQA metrics rank FLUX.1 Fill as the top-performing method. This showcases a discrepancy between metrics focusing on the similarity between the result and ground truth and metrics focusing on perceptual image quality. In practical use, visual quality often matters more than an exact match with the ground truth. This highlights the relevance of NR-IQA metrics for text removal evaluation.
We use information entropy to measure the complexity of backgrounds around text, as introduced in Equation 1 in Section 3.
Table 6 shows the information entropy for each class in OTR-hard, the subset designed to feature text backgrounds that are more difficult to inpaint. In contrast, samples in OTR-easy exhibit lower information entropy on average, as shown in Table 7. This indicates that text backgrounds in OTR-easy are indeed less complex than those in OTR-hard. In particular, classes such as sky and sea yield the lowest entropy values, which aligns with our expectations since sea and sky typically have very simple, uniform textures.
Table 8 presents the information entropy of backgrounds around text instances for each dataset. The higher entropy values indicate that the backgrounds in our dataset, in particular the OTR-hard subset, are more complex, making it more challenging for text removal models to produce natural-looking results.
Figure 5 illustrates the correlation between the information entropy and evaluation scores using the SSIM, QualiCLIP, and TOPIQ metrics. The information entropy values correspond to several selected classes from our dataset, namely sky, snow, grass, person, and plant. For metrics that have been widely used for text removal evaluation, such as SSIM, performance on classes with lower information entropy tends to be better than on those with higher entropy. This suggests a clear relationship between the quality of results and text background complexity, indicating that text on simple backgrounds is easier to remove. The QualiCLIP and TOPIQ results indicate that models that are better at understanding overall structure and semantics, such as the diffusion-based FLUX.1, tend to outperform other methods particularly on classes with higher entropy values, i.e., classes with more complex backgrounds. This suggests that datasets featuring text on complex backgrounds are essential for highlighting the strengths of models with more advanced inpainting capabilities. Additionally, the variation in metric scores across different methods is larger for classes with higher entropy, indicating that more challenging scenarios are more effective at distinguishing the performance of text removal methods.
Table 6: Information entropy \(H(X)\) of text backgrounds per class in OTR-hard.

| Class | \(H(X)\) | Class | \(H(X)\) |
|---|---|---|---|
| animal | 6.91 | person | 6.93 |
| food | 7.07 | plant | 7.07 |
| furniture | 6.98 | sports equipment | 6.79 |
| home appliance | 6.87 | tableware | 6.95 |
| kitchen appliance | 7.02 | toy | 7.05 |
| musical instrument | 6.89 | vehicle | 6.97 |
Table 7: Information entropy \(H(X)\) of text backgrounds per class in OTR-easy.

| Class | \(H(X)\) | Class | \(H(X)\) |
|---|---|---|---|
| dirt | 6.74 | road | 6.66 |
| floor | 6.64 | sand | 6.60 |
| grass | 6.66 | sea | 6.33 |
| pavement | 6.77 | sky | 6.02 |
| river | 6.70 | snow | 6.22 |
Table 8: Average information entropy \(H(X)\) of text backgrounds per dataset.

| Dataset | \(H(X)\) |
|---|---|
| SCUT-EnsText | 6.32 |
| SCUT-SynText | 6.44 |
| OTR-easy | 6.64 |
| OTR-hard | 6.96 |
We introduced a new dataset for text removal that addresses key limitations of existing benchmarks, such as artifacts in ground truth and low background complexity. By simulating overlay text in advertisements and printed media, our dataset provides a more challenging and diverse testing benchmark. We also highlighted the need for better evaluation metrics that go beyond pixel-level similarity.