A Scalable Attention-Based Approach for Image-to-3D Texture Mapping

Arianna Rampini
Autodesk Research
Milan, IT
arianna.rampini@autodesk.com


Kanika Madan
Autodesk Research
Toronto, CA
arianna.rampini@autodesk.com


Bruno Roy
Autodesk Research
Montreal, CA
arianna.rampini@autodesk.com


AmirHossein Zamani
Mila, Concordia University, and Autodesk Research
Montreal, CA
amirhossein.zamani@mila.quebec


Derek Cheung
Autodesk Research
Toronto, CA
derek.cheung@autodesk.com


Abstract

High-quality textures are critical for realistic 3D content creation, yet existing generative methods are slow, rely on UV maps, and often fail to remain faithful to a reference image. To address these challenges, we propose a transformer-based framework that predicts a 3D texture field directly from a single image and a mesh, eliminating the need for UV mapping and differentiable rendering. Our method integrates a triplane representation with depth-based backprojection losses, enabling efficient training and fast inference. Once trained, it generates high-fidelity textures in a single forward pass, requiring only \(\sim\)0.2 s per shape. Extensive qualitative, quantitative, and user preference evaluations demonstrate that our method outperforms state-of-the-art baselines on single-image texture reconstruction in terms of both fidelity to the input image and perceptual quality, highlighting its practicality for scalable, high-quality, and controllable 3D content creation.

1 Introduction↩︎

3D content creation is central to applications in gaming, virtual and augmented reality, digital twins, and immersive media. With the rapid progress of generative models for 3D shapes, it is now possible to automatically generate diverse and detailed geometries, greatly accelerating the creative process for designers and developers. To be practically useful in many applications, 3D shapes also require high-quality textures that are faithful to a given input reference, enabling realistic appearance and stylistic control. Generating such textures remains a fundamental challenge in computer vision and computer graphics.

Figure 1: Given a single input image (left), our method predicts a texture field for the corresponding 3D mesh and generates high-fidelity textures in a single forward pass. The figure shows novel views of textured meshes produced by our approach for various objects.

Existing approaches for texture generation have achieved impressive visual results, particularly when relying on image diffusion models [1], [2] or multi-view rendering pipelines [3], [4]. These methods are capable of producing high-resolution textures, but they share several important limitations that restrict their use in practice. First, they are computationally expensive, often requiring several minutes per object due to iterative optimization or multi-view rendering. Second, they typically assume access to a clean mesh with a predefined UV mapping, which is rarely available for outputs from modern generative models or real-world 3D scans. Third, despite their high resolution, the generated textures are not always faithful to a reference input image, limiting their applicability in reconstruction or editing tasks where accuracy is crucial.

In this work, we introduce a transformer-based framework that directly predicts a 3D texture field from a single input image and a mesh, without the need for UV mapping, making it robust to noisy or incomplete geometry. Our method represents textures through a triplane field and computes supervision via depth-based backprojection, enabling efficient training. Although individual components—transformer backbones, triplane representations, and differentiable projections—are well established, their integration yields a model that achieves a strong balance between speed, flexibility, and quality. Once trained, our approach produces high-fidelity textures in a single forward pass, requiring only a fraction of a second (\(\sim 0.2\)s) per shape, compared to several minutes for existing methods.

We validate our method extensively on single-image texture reconstruction benchmarks, where it consistently outperforms state-of-the-art baselines both in fidelity to the input image and overall texture quality. Furthermore, we conduct a user study, which confirms that human evaluators strongly prefer the textures generated by our approach over those from competing methods. These results make our model particularly well-suited for modern 3D content creation pipelines, where scalability, speed, and accuracy are critical. Our contributions can be summarized as follows:

  1. We propose a simple yet effective transformer-based framework for predicting texture fields from a single image and a mesh, without requiring UV mapping or multi-view rendering.

  2. We integrate a triplane texture representation with depth-based backprojection losses, enabling efficient and scalable training.

  3. Our method generates high-fidelity textures in a single forward pass (\(\sim 0.2\)s per shape), substantially faster than existing baselines that require minutes per object.

  4. We show strong improvements over state-of-the-art methods on single-image texture reconstruction tasks, supported both by quantitative metrics and by a user study.

2 Related Work↩︎

Existing approaches to texture generation are broadly classified into the following categories.

2.1 Directly using 3D Data↩︎

One of the earliest methods to learn texture as a continuous 3D function was TextureFields [5], which learns an implicit texture representation that can predict the color for any 3D point on a shape. While flexible, continuous texture fields can struggle to reproduce very high-frequency surface detail compared to explicit UV maps. Other approaches learn to generate texture directly on mesh surfaces using convolutional or neural-field–based operators. Texturify [6] trains a GAN-style model to generate geometry-aware surface textures from collections of untextured shapes and images, and Mesh2Tex [7] learns a hybrid mesh–neural-field texture manifold that maps image queries to compact, high-resolution textures for a given mesh. These mesh-based generators often leverage adversarial training (GANs [8] or StyleGAN [9] variants) to improve realism but inherit known GAN failure modes such as mode collapse and training instability.

More recently, diffusion- and point-cloud–based approaches have emerged to better capture local detail and to operate directly in 3D or UV space. TUVF [10] learns generalizable UV radiance fields that disentangle texture from geometry by generating textures in a canonical UV sphere space. Point-UV [11] and related point-based diffusion pipelines produce coarse-to-fine textures by denoising colored point samples and then projecting them to UV maps. TexOct [12] proposes an octree-based 3D diffusion to generate textures directly in 3D space, alleviating occlusion and sparse-sampling issues present in some point-based pipelines. While these methods improve high-frequency detail and 3D consistency, they are often demonstrated on limited datasets or category-specific collections, which makes broad generalization and scaling to many categories challenging.

2.2 Multiview-Based Generation↩︎

A different line of work uses multi-view images to generate 3D textures that are consistent across views. Early iterative view-by-view inpainting approaches such as TEXTure [13], Text2Tex [3], and InTeX [14] generate colors for a mesh by repeatedly rendering the object from different viewpoints and inpainting or updating the visible texels in each view. However, such iterative, view-sequential procedures can produce inconsistencies across views because they lack a global, 3D-aware understanding of the surface, often misaligning textures with the underlying structure.

To alleviate these issues, follow-up methods introduced several strategies. [15] builds on this paradigm by introducing a geometry-aware fine-tuning stage that aligns textures with human preferences and task-specific objectives, improving coherence and control without the overhead of joint optimization of geometry and appearance. TexFusion [16] interleaves texture synthesis with multi-view denoising steps and performs view-consistent diffusion sampling to reduce per-view artifacts and stitching errors. TexPainter [17] enforces multi-view consistency by fusing latent views into a common color-space texture and uses a color-fusion optimization scheme (together with synchronized multi-view denoising) to reduce inconsistencies. Paint3D [4] addresses lighting and baked-shading artifacts by separating coarse multi-view fusion from a learned UV inpainting / UVHD refinement stage, producing lighting-less high-resolution UV maps suitable for relighting, though it still relies on expensive test-time refinement. TexGen [18] further improves view consistency and detail preservation by introducing an attention-guided multi-view sampling and noise-resampling framework that maintains a time-dependent texture map updated across denoising steps, reducing seams and preserving fine appearance details.

Figure 2: An overview of the training stage of our method. Given a single input image and a 3D mesh, we extract visual features from the image using a pre-trained DINO [19] encoder. Learned positional embeddings are processed by a transformer and fused with the visual features through cross-attention. The output is reshaped into a triplane texture representation. Query points sampled via depth-map backprojection are decoded into RGB values, yielding a 3D texture field. The model is trained end-to-end with supervision from ground-truth colors.

2.3 Optimization-Based Methods↩︎

These methods treat 3D shape parameters as learnable and leverage CLIP [20] and text-to-image or image-to-image diffusion models [1], [2], [21], in the form of Score Distillation Sampling (SDS) [22], as supervision. Early methods optimize mesh color or neural style fields directly using CLIP [20] losses to align renderings with text prompts; examples include [23]–[27], which stylize geometry and/or texture by differentiably rendering the asset and minimizing CLIP-based objectives. The introduction of SDS in [22] enabled a powerful new paradigm: pre-trained 2D text-to-image diffusion models can be used as priors to optimize 3D representations (e.g., NeRFs [28] or sparse hash grids [29]) by distilling their score into a differentiable 3D objective. DreamFusion [22] is the canonical example of this approach and inspired many follow-ups, including Magic3D [30] and ProlificDreamer [31]. Subsequent work extended SDS-based optimization to produce better geometry–appearance disentanglement and higher-fidelity materials. Fantasia3D [32] and other hybrid pipelines disentangle geometry and appearance and introduce spatially varying BRDF/material representations during optimization, while TextureDreamer [33] and DreamMat [34] incorporate geometry- and light-aware diffusion objectives to improve relightable texture and PBR material estimation.

Despite these advances, optimization-based methods still suffer from practical shortcomings: (i) they can be computationally expensive (per-scene optimization that takes minutes to hours), (ii) they may produce view-inconsistent artifacts or “Janus” faces without careful debiasing, and (iii) naive distillation from 2D models often leads to baked-in shading or incorrect material decomposition unless geometry- or light-aware priors are used. These limitations motivate hybrid, feed-forward, and geometry-aware texture objectives that explicitly inject geometric knowledge into texture synthesis.

2.4 Feed-Forward Methods↩︎

Recently, there has been a strong movement toward feed-forward 3D generation models that are trained on large-scale data and produce high-quality 3D assets in a single forward pass, avoiding expensive per-object optimization. These methods typically adopt high-capacity transformer backbones and compact 3D intermediate representations (e.g., triplanes, 3D Gaussians, or hybrid triplane–Gaussian forms) to enable fast inference while maintaining competitive rendering quality. The Large Reconstruction Model (LRM) [35] demonstrated that scaling model capacity and training on massive multi-view datasets enables generalizable single-image-to-3D reconstruction: it directly predicts a neural radiance field from an input image using a large transformer, producing robust reconstructions across many object categories. Instant3D [36] showed that a carefully designed feed-forward network can produce high-quality text-to-3D results in under one second by directly constructing a triplane representation from a text prompt, using mechanisms such as cross-attention and style injection to condition generation on the text. Other works push the trade-offs between speed, generalization, and quality by changing the intermediate 3D primitive. GRM (Gaussian Reconstruction Model) [37] and related LRM variants represent scenes or objects as collections of 3D Gaussians decoded from image-aligned tokens, which enables extremely fast reconstructions while remaining amenable to transformer scaling and multi-view conditioning.

Hybrid representations such as [38] combine the best of both worlds. [38] uses a point/triplane decoder to predict a hybrid triplane-gaussian intermediate, which is then rendered by fast splatting, resulting in better novel view rendering quality than naive explicit primitives while retaining the speed advantages of splatting-based renderers. These hybrid pipelines have proven effective for single-view reconstruction and fast feed-forward text/image to 3D tasks.

Despite these advances, feed-forward approaches still face limitations relevant to texture generation: they often require large, diverse training datasets to learn high-frequency, material-aware texture priors; handling complex illumination and spatially varying BRDFs remains challenging; and many models prioritize geometric fidelity and rendering speed over fine, geometry-aware texture detail, which motivates methods that explicitly incorporate geometric cues (e.g., curvature-aware losses or texture alignment objectives) into the learning process.

3 Method↩︎

We present a novel approach for reconstructing high-quality textures on 3D shapes using a transformer-based architecture that synthesizes triplane representations. Our method learns a mapping from visual conditioning inputs to continuous texture fields represented as triplane features. Overviews of our training and inference pipelines are shown in Figs. 2 and 3, respectively.

3.1 Preliminaries↩︎

A triplane representation is a 3D neural field encoding that decomposes a volumetric feature field into three orthogonal 2D feature planes corresponding to the \(XY\), \(XZ\), and \(YZ\) coordinate planes. This representation, originally introduced in EG3D [39], provides an efficient way to represent continuous 3D features with 2D convolutional networks.

The Large Reconstruction Model (LRM) framework [35] first introduced the combination of transformers with triplane representations for joint geometry and texture reconstruction using a NeRF field [28]. We build upon this architecture to tackle the more constrained problem of texture field reconstruction over known 3D meshes. Unlike LRM, which learns texture and geometry jointly through differentiable rendering and camera-view modulation, our approach focuses specifically on predicting texture fields over existing geometries. This introduces a unique correspondence problem: establishing the relationship between the (unknown) viewpoint of the conditioning image and the given 3D mesh.

Texture field prediction was pioneered in the TextureFields work [5]. However, that approach had limited scalability, being effective primarily on single-category ShapeNet data. Our contribution lies in combining the texture field rationale with the scalable triplane-transformer architecture, enabling texture synthesis across diverse object categories by training on large-scale datasets. Triplane representations provide both computational efficiency and representational power, allowing us to scale beyond single categories while maintaining the continuous texture field formulation.

Through our experiments, we found that explicit geometric encoding offered minimal benefit in our setting, leading us to adopt a streamlined architecture that relies only on visual conditioning.

3.2 Problem Formulation↩︎

Given a 3D shape with known geometry and a conditioning image \(I\), our goal is to learn a texture field \[T_\theta : \mathbb{R}^3 \to \mathbb{R}^3,\] that maps 3D coordinates to RGB colors. The texture field should be consistent with the visual appearance suggested by the conditioning image while respecting the underlying 3D geometry. We formalize this as: \[\text{RGB} = T_\theta(p, I),\] where \(p \in \mathbb{R}^3\) represents 3D coordinates and \(I\) is the input image.

3.3 Architecture↩︎

3.3.0.1 Visual Conditioning.

We employ a DINOv2 encoder [19] that processes RGB conditioning images at \(384 \times 384\) resolution, producing visual features \(z \in \mathbb{R}^{768}\) that capture semantic and appearance information.

3.3.0.2 Transformer-based Triplane Decoder.

In our implementation, the triplane consists of three feature maps of dimensions \([f_{dim}, t_{res}, t_{res}]\), where \(f_{dim}=48\) is the feature dimension, \(t_{res} \times t_{res}\) is the spatial resolution, and the three planes correspond to the orthogonal projections \(XY\), \(XZ\), and \(YZ\).

The transformer decoder processes learned positional embeddings \(f_{\text{init}}\) corresponding to triplane token positions. These embeddings are initialized with a sinusoidal encoding and correspond to the flattened sequence of triplane tokens (\(3072 = 3 \times 32^2\) positions). They have the same dimensionality as the transformer hidden size and are optimized end-to-end with the rest of the model. Visual conditioning is integrated through cross-attention mechanisms in each transformer layer: \[f_{\text{out}} = \texttt{TransformerDecoder}(f_{\text{init}}, z).\]

The transformer outputs are reshaped into spatial triplane format. Starting from \(32 \times 32\) resolution, a convolutional upsampling network generates triplane features \(P \in \mathbb{R}^{3 \times 48 \times 64 \times 64}\).
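
To make this concrete, the sketch below shows one way the described components could be wired together in PyTorch. It is not the authors' implementation: the token count (\(3 \times 32 \times 32\)), feature dimension (48), and the \(64 \times 64\) output resolution follow the text, while the projection of image features to the hidden size, the activation choices, and the random initialization of the positional embeddings are our own assumptions.

```python
# A minimal PyTorch sketch of the triplane decoder described above (not the authors' code).
import torch
import torch.nn as nn

class TriplaneDecoder(nn.Module):
    def __init__(self, hidden=576, n_layers=9, n_heads=9, f_dim=48, t_res=32, img_dim=768):
        super().__init__()
        self.t_res = t_res
        # Learned positional embeddings for the flattened triplane tokens (3 * t_res^2 = 3072).
        self.pos_emb = nn.Parameter(torch.randn(3 * t_res * t_res, hidden) * 0.02)
        self.img_proj = nn.Linear(img_dim, hidden)   # project image features to the hidden size (assumption)
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.to_feat = nn.Linear(hidden, f_dim)
        # Convolutional upsampler: 32x32 -> 64x64 per plane.
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(f_dim, f_dim, kernel_size=4, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(f_dim, f_dim, kernel_size=3, padding=1),
        )

    def forward(self, z):
        """z: (B, N_tokens, img_dim) visual features from the image encoder."""
        B = z.shape[0]
        memory = self.img_proj(z)
        tokens = self.pos_emb.unsqueeze(0).expand(B, -1, -1)
        out = self.decoder(tgt=tokens, memory=memory)      # cross-attention to image features
        feat = self.to_feat(out)                           # (B, 3 * t_res^2, f_dim)
        planes = feat.view(B * 3, self.t_res, self.t_res, -1).permute(0, 3, 1, 2)
        planes = self.upsample(planes)                     # (B * 3, f_dim, 64, 64)
        return planes.view(B, 3, -1, planes.shape[-2], planes.shape[-1])
```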

To sample features at an arbitrary 3D point \(p = (x,y,z)\), we:

  1. Project \(p\) onto each of the three planes;

  2. Use bilinear interpolation to sample features from each plane at the projected coordinates, yielding \(f_{xy}\), \(f_{xz}\), and \(f_{yz}\);

  3. Concatenate the sampled features, producing a 144-dimensional feature vector (\(3 \times 48\)).

Finally, the concatenated feature vector is passed through a 4-layer MLP (RGBDecoder) with ReLU activations to predict the final RGB color: \[\text{RGB}(p) = \texttt{RGBDecoder}([f_{xy}, f_{xz}, f_{yz}]).\]
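
The three steps above map naturally onto grid-based sampling. The following sketch is our own, under the assumption that query points are normalized to \([-1, 1]^3\); it uses F.grid_sample for the bilinear interpolation and a 4-layer MLP for the color head (the hidden width and final sigmoid are assumptions not stated in the paper).

```python
# A minimal sketch (not the official implementation) of querying the triplane texture field.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RGBDecoder(nn.Module):
    def __init__(self, f_dim=48, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * f_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),            # RGB in [0, 1] (assumed output range)
        )

    def forward(self, feats):
        return self.mlp(feats)

def query_texture_field(planes, points, rgb_decoder):
    """planes: (B, 3, f_dim, H, W) triplane features; points: (B, N, 3) in [-1, 1]."""
    x, y, z = points[..., 0], points[..., 1], points[..., 2]
    # Step 1: project each point onto the XY, XZ, and YZ planes.
    coords = torch.stack([
        torch.stack([x, y], dim=-1),
        torch.stack([x, z], dim=-1),
        torch.stack([y, z], dim=-1),
    ], dim=1)                                              # (B, 3, N, 2)
    B, _, N, _ = coords.shape
    planes_flat = planes.flatten(0, 1)                     # (B*3, f_dim, H, W)
    grid = coords.flatten(0, 1).unsqueeze(2)               # (B*3, N, 1, 2)
    # Step 2: bilinear interpolation on each plane.
    sampled = F.grid_sample(planes_flat, grid, mode="bilinear", align_corners=True)
    sampled = sampled.squeeze(-1).reshape(B, 3, -1, N)     # (B, 3, f_dim, N)
    # Step 3: concatenate the three sampled features and decode to RGB.
    feats = sampled.permute(0, 3, 1, 2).reshape(B, N, -1)  # (B, N, 3*f_dim)
    return rgb_decoder(feats)                              # (B, N, 3)
```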

At inference time, our approach requires only a single forward pass of the pipeline, as summarized in Algorithm 3.
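
As a usage example, a hypothetical single-pass inference routine built from the two sketches above might look as follows; the encoder handle and the assumption that surface points are already normalized are placeholders, not the authors' API.

```python
# Hypothetical end-to-end inference sketch: encode the image once, decode a triplane,
# then color the mesh by querying the field at surface points (e.g., vertices).
import torch

@torch.no_grad()
def texture_mesh(image, vertices, image_encoder, triplane_decoder, rgb_decoder):
    """image: (1, 3, H, W); vertices: (N, 3), assumed normalized to [-1, 1]."""
    z = image_encoder(image)                       # (1, N_tokens, C) visual features
    planes = triplane_decoder(z)                   # (1, 3, f_dim, 64, 64)
    colors = query_texture_field(planes, vertices.unsqueeze(0), rgb_decoder)
    return colors.squeeze(0)                       # (N, 3) per-vertex RGB
```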

3.4 Training Methodology↩︎

We train the model using precomputed multi-view depth maps with corresponding ground-truth images. Depth maps are converted into 3D point coordinates via backprojection with camera intrinsics as detailed below, and the predicted colors are supervised against ground truth. For each training sample, we process 4 random views from a set of 55 precomputed depth maps to ensure broad appearance coverage.

3.4.0.1 Depth-Map Backprojection.

To supervise the predicted texture field, we backproject depth maps into 3D point clouds, associating each 3D point with its ground-truth RGB value from the corresponding view.

Formally, given a pixel \((u,v)\) with depth \(d(u,v)\), its camera-space coordinate is obtained via inverse projection: \[\mathbf{p}_{c} = d(u,v) \, K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix},\] where \(K\) is the camera intrinsics matrix. The point is then transformed into world coordinates using the camera-to-world transformation \(T_{\text{cam}}\): \[\mathbf{p}_{w} = T_{\text{cam}} \, \begin{bmatrix} \mathbf{p}_{c} \\ 1 \end{bmatrix}.\]

Each depth map of resolution \(384 \times 384\) yields up to \(147{,}456\) points, one per pixel; background pixels are masked out and excluded from supervision. With \(4\) views per training sample, this amounts to roughly \(590\)K candidate points per shape. The reconstructed 3D points are used to query the predicted texture field, producing a predicted image that is compared against the ground-truth image in the loss.
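
A minimal sketch of this backprojection step is given below; the valid-pixel mask based on positive depth and the row-vector matrix layout are our own assumptions about conventions.

```python
# Lift pixels with valid depth to camera space with K^-1, then to world space with
# the camera-to-world matrix, keeping the aligned ground-truth colors for supervision.
import torch

def backproject_depth(depth, K, T_cam, rgb=None):
    """depth: (H, W); K: (3, 3) intrinsics; T_cam: (4, 4) camera-to-world;
    rgb: optional (H, W, 3) ground-truth colors aligned with the depth map."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()    # homogeneous pixel coords (H, W, 3)
    # Inverse projection to camera space: p_c = d * K^-1 [u, v, 1]^T
    p_cam = (pix @ torch.linalg.inv(K).T) * depth.unsqueeze(-1)
    # Transform to world space: p_w = T_cam [p_c; 1]
    p_hom = torch.cat([p_cam, torch.ones_like(depth).unsqueeze(-1)], dim=-1)
    p_world = (p_hom @ T_cam.T)[..., :3]
    mask = depth > 0                                                 # background pixels masked out (assumed convention)
    points = p_world[mask]
    colors = rgb[mask] if rgb is not None else None
    return points, colors
```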

3.4.0.2 Loss Function.

We optimize a combination of a pixel-wise reconstruction loss and a perceptual similarity loss: \[\mathcal{L}_{\text{total}} = \lambda_{\text{pixel}} \mathcal{L}_{\text{pixel}} + \lambda_{\text{perc}} \mathcal{L}_{\text{perc}}.\]

The pixel-level term enforces low-level fidelity by directly comparing RGB values between predicted and ground-truth renderings: \[\mathcal{L}_{\text{pixel}} = \frac{1}{V} \sum_{v=1}^V \| I_{\text{pred}}(v) - I_{\text{gt}}(v) \|_2^2,\] where \(V\) denotes the number of views per sample.

The perceptual loss \(\mathcal{L}_{\text{perc}}\) encourages high-level similarity by comparing features extracted from a pre-trained VGG network [40], following the LPIPS formulation [41]. Instead of focusing on raw pixel differences, this loss aligns image representations in a deep feature space, improving texture realism and visual coherence: \[\mathcal{L}_{\text{perc}} = \text{LPIPS}(I_{\text{pred}}, I_{\text{gt}}).\]

In practice, we set both weights equally (\(\lambda_{\text{pixel}} = \lambda_{\text{perc}} = 1\)), which provided a good balance between preserving fine details and maintaining perceptual quality in our experiments.
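
The combined objective is straightforward to express with the off-the-shelf lpips package; using that package (and ignoring the background masking details) is an assumption about tooling on our part, since the paper only specifies a VGG-based LPIPS term.

```python
# A minimal sketch of the training objective with equal weights, as in the paper.
import torch
import lpips

lpips_fn = lpips.LPIPS(net="vgg")          # VGG-based LPIPS, matching the formulation above

def texture_loss(pred_imgs, gt_imgs, lambda_pixel=1.0, lambda_perc=1.0):
    """pred_imgs, gt_imgs: (B, 3, H, W) images in [0, 1]."""
    pixel = torch.mean((pred_imgs - gt_imgs) ** 2)                   # MSE over views and pixels
    perc = lpips_fn(pred_imgs * 2 - 1, gt_imgs * 2 - 1).mean()       # LPIPS expects inputs in [-1, 1]
    return lambda_pixel * pixel + lambda_perc * perc
```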

Figure 3: Inference for texture reconstruction

4 Experimental Evaluation↩︎

4.1 Setup↩︎

4.1.0.1 Datasets.

We train on Objaverse [42], a large-scale dataset of \(\sim\)800k textured 3D assets spanning diverse object categories. We follow a 98%/2% train/validation split. To evaluate cross-dataset generalization, we additionally test on the Google Scanned Objects (GSO) benchmark [43], which consists of real-world scanned meshes. For both datasets, we precompute RGB images and corresponding depth maps at \(384{\times}384\) resolution from 55 viewpoints.

4.1.0.2 Implementation details.

We train all models using the AdamW optimizer with batch size \(128\), weight decay \(0.05\), and a cosine learning rate schedule (base LR \(2{\times}10^{-4}\) with \(10\)k warmup steps). Training is performed in mixed precision on a cluster of 32 NVIDIA A100 GPUs. At inference, our method runs in \(\sim\)0.2 s per mesh on a single NVIDIA A10 GPU.

4.1.0.3 Baselines.

We compare against three recent image-guided texturing methods: Paint3D [4], EASI-Tex [44], and TEXTure [13]. All baselines rely on iterative optimization guided by multi-view diffusion models and assume a UV-mapped mesh as input. We use the authors’ official implementations and recommended hyperparameters, running each baseline to convergence. All methods are provided with the same conditioning image and mesh and are evaluated on the same set of novel views. Note that for TEXTure [13], the image-to-texture generation stage requires fine-tuning a diffusion model on the conditioning image, which is considerably more time-consuming than the other baselines. To keep processing \(100\) objects from the GSO dataset feasible, we keep all hyperparameters as proposed by the authors, except that we reduce the max train steps parameter from \(10{,}000\) to \(1{,}000\) to achieve a practical computation time (i.e., \(\sim\)20 minutes).

4.1.0.4 Metrics.

We evaluate texture reconstruction quality using three metrics: CLIP-Score [45], which measures semantic consistency between the conditioning image and rendered views in CLIP embedding space; LPIPS [46], which assesses perceptual similarity between predicted and ground-truth novel views; and PSNR, a standard pixel-level reconstruction metric. Higher CLIP and PSNR and lower LPIPS indicate better performance.
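
For reference, the sketch below shows how the pixel-level metric can be restricted to the object silhouette, as done for the comparison in Table 1; the mask-based protocol is our reading of that setup rather than a documented evaluation script.

```python
# PSNR computed only over foreground (silhouette) pixels.
import torch

def masked_psnr(pred, gt, mask, max_val=1.0):
    """pred, gt: (3, H, W) images in [0, max_val]; mask: (H, W) boolean silhouette."""
    diff = (pred - gt)[:, mask]                   # keep only foreground pixels
    mse = torch.mean(diff ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```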

Figure 4: Results on Objaverse (validation set). Our feed-forward model generalizes across diverse categories and geometries, reconstructing high-fidelity textures from a single image (left).

4.2 Results↩︎

We showcase examples of our texture reconstruction capabilities in Figures 1 and 4.

4.2.0.1 Qualitative comparison.

Figure 5 presents close-up comparisons, while Figure 6 shows novel view generation. Our method produces cleaner textures with fewer artifacts and markedly better fidelity to the conditioning image than the baselines.

4.2.0.2 Quantitative comparison.

Table 1 reports results on GSO (100 random objects, 10 novel views each). Our approach outperforms all three baselines across all metrics by a large margin. Moreover, inference runs in only \(\sim\)0.2 s per shape, which is orders of magnitude faster than the optimization-based baselines (10–20 minutes). Unlike prior work, our method requires no UV maps or mesh preprocessing, making it particularly suitable for pipelines that must handle hundreds of assets, such as procedural generation or large-scale 3D content creation.

Table 1: Comparison on GSO. We evaluate 100 random objects and 10 novel views per object. Metrics are computed inside the silhouette. Our method achieves the best semantic alignment (CLIP), perceptual similarity (LPIPS), and pixel accuracy (PSNR) while running in \(\sim\)0.2 s/shape without UV maps.
Method CLIP-Score \(\uparrow\) LPIPS \(\downarrow\) PSNR \(\uparrow\)
TEXTure [13] 80.24 0.236 13.31
Paint3D [4] 82.67 0.205 13.61
EASI-Tex [44] 83.25 0.203 13.72
Ours 90.09 0.075 27.65
Figure 5: Comparison on GSO. Given the same conditioning image and mesh, our method (bottom row) produces textures with higher fidelity and fewer artifacts than diffusion-based baselines.
Figure 6: Comparison with TEXTure [13] on single-image texture reconstruction on GSO samples. TEXTure requires both a textual description and a conditioning image as input. Across varying levels of complexity—from simple uniform textures to multi-object scenes—our method generates coherent and faithful textures, whereas TEXTure often produces broken or inconsistent color patterns.

4.3 Qualitative User Study↩︎

We conducted a user study to evaluate and compare our method against two established baselines: Paint3D and EASI-Tex. The study involved 62 participants, all professionals working in the Media and Entertainment industry with a background in computer science. Participants were asked to assess the quality of the generated textures based on two key criteria: (1) realism and fidelity, and (2) consistency with respect to the conditioning image used to guide the generation. As summarized in Table 2, our approach consistently outperformed both baselines across both criteria.

Table 2: User study results, reported as the percentage of participants who preferred each method across two evaluation criteria. Our method outperforms both Paint3D [4] and EASI-Tex [44] in terms of texture fidelity and consistency relative to the conditioning image.
Evaluation Criteria Paint3D EASI-Tex Ours
Texture Realism & Fidelity 4.86 12.85 82.29
Conditional Consistency 0.34 3.46 96.21
Table 3: Ablation on model size. Validation is computed on Objaverse (val set); GSO is out-of-domain test. Best and second-best are in bold and italic. The base model attains near-optimal performance at a significantly lower cost than large.
Models Validation dataset GSO (Test dataset)
CLIP-Score \(\uparrow\) LPIPS \(\downarrow\) PSNR \(\uparrow\) MSE \(\downarrow\) CLIP-Score \(\uparrow\) LPIPS \(\downarrow\) PSNR \(\uparrow\) MSE \(\downarrow\)
small 89.2 0.066 23.52 0.110 84.8 0.091 23.29 0.088
base 90.8 0.051 25.38 0.073 88.3 0.071 25.93 0.047
large 90.8 0.050 25.47 0.073 88.6 0.071 25.78 0.050

4.4 Ablations↩︎

4.4.0.1 Model capacity.

We study the impact of model size by training three variants of our architecture: small (\(\sim\)9M parameters, 3 transformer layers, 6 attention heads, 384-dim features), base (\(\sim\)52M parameters, 9 layers, 9 heads, 576-dim features), and large (\(\sim\)115M parameters, 12 layers, 12 heads, 768-dim features). Results are reported in Table 3.
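
For clarity, the three variants correspond to the following hyperparameter settings; the numbers are taken from the text above, while the dataclass scaffolding is our own illustration.

```python
# Illustrative configuration of the three capacity variants (approximate parameter
# counts from the paper are noted in the comments).
from dataclasses import dataclass

@dataclass
class ModelConfig:
    n_layers: int
    n_heads: int
    hidden_dim: int

VARIANTS = {
    "small": ModelConfig(n_layers=3,  n_heads=6,  hidden_dim=384),   # ~9M parameters
    "base":  ModelConfig(n_layers=9,  n_heads=9,  hidden_dim=576),   # ~52M parameters
    "large": ModelConfig(n_layers=12, n_heads=12, hidden_dim=768),   # ~115M parameters
}
```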

Performance improves substantially from small to base, especially on perceptual metrics (LPIPS and CLIP). Increasing to large provides only marginal improvements (\(<0.2\) CLIP, \(<0.1\) PSNR) at the cost of more than doubling the number of parameters and training/inference memory usage. This suggests that texture reconstruction benefits from a moderately deep transformer with sufficient feature dimensionality, but quickly saturates as capacity grows. In practice, the base model offers the best trade-off between quality and efficiency, and is therefore used in all other experiments.

Figure 7: Effect of model capacity. Comparison of results from our small, base, and large variants. Larger models improve texture sharpness and color fidelity, especially for fine-grained structures, though the base model already provides a strong balance between quality and efficiency.

4.4.0.2 Geometric conditioning.

We experimented with adding geometric signals via cross-attention: (i) latent features from a pre-trained SDF VQ-VAE (Latent) [47], and (ii) point-cloud features from PointNet [48]. As reported in Table 4, neither variant improved performance; both slightly degraded LPIPS/PSNR, likely due to misalignment noise and reduced capacity for appearance modeling. We therefore omit geometric conditioning in our final model.

4.4.0.3 Perceptual loss.

Removing the LPIPS term slightly improves MSE and PSNR but harms perceptual quality and semantic alignment (worse LPIPS and CLIP-Score). Consistent with our qualitative observations, LPIPS guidance helps preserve fine appearance details and avoids over-smoothing, so we retain it.

Table 4: Ablation on conditioning and losses (Objaverse val). Geometric conditioning does not help in the single-image setting, while removing the LPIPS loss harms perceptual quality.
Model variant Validation dataset
CLIP-Score \(\uparrow\) LPIPS \(\downarrow\) PSNR \(\uparrow\) MSE \(\downarrow\)
base 90.8 0.050 25.47 0.073
w/o LPIPS loss 88.6 0.071 25.89 0.062
+ Latent cond. 90.5 0.053 25.01 0.076
+ Point cloud cond. 90.2 0.057 24.84 0.081

4.4.0.4 Failure cases.

While our method achieves strong results, it also has limitations. In particular, the output texture resolution is currently limited by the capacity of the model; as a result, our method can struggle to reproduce very fine details, such as text or high-frequency patterns (see Fig. 8).

Figure 8: Failure cases. While our method produces coherent textures in most cases, it struggles with high-frequency details. Typical failure modes include handling complex patterns (top row), reconstructing legible text (middle row), and recovering unseen regions such as the back of objects (bottom row).

5 Future work↩︎

Since our approach produces textures that are globally consistent and faithful to the input image, a promising direction is to incorporate a lightweight refinement stage that enhances high-frequency details. Another avenue is to integrate our feed-forward framework with generative pipelines (e.g., diffusion-based texturing), where our method could provide strong initialization and improve sample efficiency and fidelity. Finally, exploring generative extensions of our model would enable conditional sampling of multiple plausible texture fields for the same geometry, broadening its use in creative content generation.

6 Conclusion↩︎

We presented a transformer-based architecture for image-guided texture reconstruction that directly predicts continuous texture fields encoded with a triplane representation. Our method takes a single image and a mesh as input, does not rely on UV mapping or differentiable rendering, and generates high-quality textures in a single forward pass. Extensive experiments, ablations, and a user study demonstrate that our approach outperforms existing baselines in both fidelity and efficiency, making it a practical solution for large-scale 3D content creation.

References↩︎

[1]
P. Dhariwal and A. Nichol, “Diffusion models beat GANs on image synthesis,” Advances in Neural Information Processing Systems, vol. 34, pp. 8780–8794, 2021.
[2]
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695.
[3]
D. Z. Chen, Y. Siddiqui, H.-Y. Lee, S. Tulyakov, and M. Nießner, “Text2Tex: Text-driven texture synthesis via diffusion models,” arXiv preprint arXiv:2303.11396, 2023.
[4]
X. Zeng et al., “Paint3D: Paint anything 3D with lighting-less texture diffusion models,” arXiv preprint arXiv:2312.13913, 2023.
[5]
M. Oechsle, L. Mescheder, M. Niemeyer, T. Strauss, and A. Geiger, “Texture fields: Learning texture representations in function space,” arXiv preprint arXiv:1905.07259, 2019.
[6]
Y. Siddiqui, J. Thies, F. Ma, Q. Shan, M. Nießner, and A. Dai, “Texturify: Generating textures on 3D shape surfaces,” arXiv preprint arXiv:2204.02411, 2022.
[7]
A. Bokhovkin, S. Tulsiani, and A. Dai, “Mesh2Tex: Generating mesh textures from image queries,” arXiv preprint arXiv:2304.05868, 2023.
[8]
I. J. Goodfellow et al., “Generative adversarial networks,” arXiv preprint arXiv:1406.2661, 2014.
[9]
T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” arXiv preprint arXiv:1812.04948, 2019.
[10]
A.-C. Cheng, X. Li, S. Liu, and X. Wang, “TUVF: Learning generalizable texture UV radiance fields,” arXiv preprint arXiv:2305.03040, 2023.
[11]
X. Yu, P. Dai, W. Li, L. Ma, Z. Liu, and X. Qi, “Texture generation on 3D meshes with Point-UV diffusion,” arXiv preprint arXiv:2308.10490, 2023.
[12]
J. Liu et al., “TexOct: Generating textures of 3D models with octree-based diffusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 4284–4293.
[13]
E. Richardson, G. Metzer, Y. Alaluf, R. Giryes, and D. Cohen-Or, “TEXTure: Text-guided texturing of 3D shapes,” arXiv preprint arXiv:2302.01721, 2023.
[14]
J. Tang, R. Lu, X. Chen, X. Wen, G. Zeng, and Z. Liu, “Intex: Interactive text-to-texture synthesis via unified depth-aware inpainting,” arXiv preprint arXiv:2403.11878, 2024.
[15]
A. Zamani, T. Xie, A. G. Aghdam, T. Popa, and E. Belilovsky, “Geometry-Aware Preference Learning for 3D Texture Generation,” arXiv preprint arXiv:2506.18331, 2025.
[16]
T. Cao, K. Kreis, S. Fidler, N. Sharp, and K. Yin, “TexFusion: Synthesizing 3D textures with text-guided image diffusion models,” arXiv preprint arXiv:2310.13772, 2023.
[17]
H. Zhang, Z. Pan, C. Zhang, L. Zhu, and X. Gao, “TexPainter: Generative mesh texturing with multi-view consistency,” arXiv preprint arXiv:2406.18539, 2024.
[18]
D. Huo et al., “TexGen: Text-guided 3D texture generation with multi-view sampling and resampling,” Springer, 2024, pp. 352–368.
[19]
M. Oquab et al., “DINOv2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2024.
[20]
A. Radford et al., “Learning transferable visual models from natural language supervision,” arXiv preprint arXiv:2103.00020, 2021.
[21]
C. Saharia et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in neural information processing systems, vol. 35, pp. 36479–36494, 2022.
[22]
B. Poole, A. Jain, J. T. Barron, and B. Mildenhall, “DreamFusion: Text-to-3D using 2D diffusion,” arXiv preprint arXiv:2209.14988, 2022.
[23]
Y. Chen, R. Chen, J. Lei, Y. Zhang, and K. Jia, “TANGO: Text-driven photorealistic and robust 3D stylization via lighting decomposition,” arXiv preprint arXiv:2210.11277, 2022.
[24]
F. Hong, M. Zhang, L. Pan, Z. Cai, L. Yang, and Z. Liu, “AvatarCLIP: Zero-shot text-driven generation and animation of 3D avatars,” arXiv preprint arXiv:2205.08535, 2022.
[25]
Y. Ma et al., “X-Mesh: Towards fast and accurate text-driven 3D stylization via dynamic textual guidance,” arXiv preprint arXiv:2303.15764, 2023.
[26]
O. Michel, R. Bar-On, R. Liu, S. Benaim, and R. Hanocka, “Text2Mesh: Text-driven neural stylization for meshes,” arXiv preprint arXiv:2112.03221, 2021.
[27]
N. Mohammad Khalid, T. Xie, E. Belilovsky, and T. Popa, “CLIP-Mesh: Generating textured meshes from text using pretrained image-text models,” in SIGGRAPH Asia 2022 Conference Papers, Nov. 2022, pp. 1–8, doi: 10.1145/3550469.3555392.
[28]
B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “NeRF: Representing scenes as neural radiance fields for view synthesis,” arXiv preprint arXiv:2003.08934, 2020.
[29]
T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Transactions on Graphics, vol. 41, no. 4, pp. 1–15, Jul. 2022, doi: 10.1145/3528223.3530127.
[30]
C.-H. Lin et al., “Magic3D: High-resolution text-to-3D content creation,” arXiv preprint arXiv:2211.10440, 2023.
[31]
Z. Wang et al., “ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation,” arXiv preprint arXiv:2305.16213, 2023.
[32]
R. Chen, Y. Chen, N. Jiao, and K. Jia, “Fantasia3D: Disentangling geometry and appearance for high-quality text-to-3D content creation,” arXiv preprint arXiv:2303.13873, 2023.
[33]
Y.-Y. Yeh et al., “TextureDreamer: Image-guided texture synthesis through geometry-aware diffusion,” arXiv preprint arXiv:2401.09416, 2024.
[34]
Y. Zhang et al., “DreamMat: High-quality PBR material generation with geometry- and light-aware diffusion models,” arXiv preprint arXiv:2405.17176, 2024.
[35]
Y. Hong et al., “LRM: Large reconstruction model for single image to 3D,” arXiv preprint arXiv:2311.04400, 2024.
[36]
J. Li et al., “Instant3D: Fast text-to-3D with sparse-view generation and large reconstruction model,” arXiv preprint arXiv:2311.06214, 2023.
[37]
Y. Xu et al., “GRM: Large Gaussian reconstruction model for efficient 3D reconstruction and generation,” arXiv preprint arXiv:2403.14621, 2024.
[38]
Z.-X. Zou et al., “Triplane meets Gaussian splatting: Fast and generalizable single-view 3D reconstruction with transformers,” arXiv preprint arXiv:2312.09147, 2023.
[39]
E. R. Chan et al., “Efficient geometry-aware 3D generative adversarial networks,” arXiv preprint, 2021.
[40]
K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[41]
R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 586–595.
[42]
M. Deitke et al., “Objaverse: A universe of annotated 3D objects,” 2023, pp. 13142–13153.
[43]
L. Downs et al., “Google Scanned Objects: A high-quality dataset of 3D scanned household items,” IEEE, 2022, pp. 2553–2560.
[44]
S. R. K. Perla, Y. Wang, A. Mahdavi-Amiri, and H. Zhang, “EASI-tex: Edge-aware mesh texturing from single image,” ACM Transactions on Graphics (TOG), vol. 43, no. 4, pp. 1–11, 2024.
[45]
A. Radford et al., “Learning transferable visual models from natural language supervision,” PMLR, 2021, pp. 8748–8763.
[46]
R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” arXiv preprint arXiv:1801.03924, 2018.
[47]
A. Sanghi et al., “Wavelet latent diffusion (wala): Billion-parameter 3D generative model with compact wavelet encodings,” arXiv preprint arXiv:2411.08017, 2024.
[48]
C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652–660.