Parallelized Autoregressive Visual Generation

Yuqing Wang1 Shuhuai Ren3 Zhijie Lin2\(\dagger\) Yujin Han1
Haoyuan Guo2 Zhenheng Yang2 Difan Zou1 Jiashi Feng2 Xihui Liu1
1University of Hong Kong 2ByteDance Seed 3Peking University


Abstract

Autoregressive models have emerged as a powerful approach for visual generation but suffer from slow inference speed due to their sequential token-by-token prediction process. In this paper, we propose a simple yet effective approach for parallelized autoregressive visual generation that improves generation efficiency while preserving the advantages of autoregressive modeling. Our key insight is that parallel generation depends on visual token dependencies: tokens with weak dependencies can be generated in parallel, while strongly dependent adjacent tokens are difficult to generate together, as their independent sampling may lead to inconsistencies. Based on this observation, we develop a parallel generation strategy that generates distant tokens with weak dependencies in parallel while maintaining sequential generation for strongly dependent local tokens. Our approach can be seamlessly integrated into standard autoregressive models without modifying the architecture or tokenizer. Experiments on ImageNet and UCF-101 demonstrate that our method achieves a 3.6\(\times\) speedup with comparable quality and up to 9.5\(\times\) speedup with minimal quality degradation across both image and video generation tasks. We hope this work will inspire future research in efficient visual generation and unified autoregressive modeling. Project page: https://epiphqny.github.io/PAR-project.

1 Introduction↩︎

Autoregressive modeling has achieved remarkable success in language modeling [1][5], inspiring its application to visual generation [6][16]. These models show great potential for visual tasks due to their strong scalability and unified modeling capabilities [17][19]. Current autoregressive visual generation approaches typically rely on a sequential token-by-token generation paradigm: visual data is first encoded into token sequences using an autoencoder [12], [20], [21], then an autoregressive transformer [22] is trained to predict these tokens following a raster scan order [10]. However, this strictly sequential generation process leads to a slow generation speed, severely limiting its practical applications [19], [23]. In this work, we aim to develop an efficient autoregressive visual generation approach that improves generation speed while maintaining the generation quality.

Figure 1: Comparison of different parallel generation strategies. Both strategies generate initial tokens [1,2,3,4] sequentially, then generate multiple tokens in parallel per step, following the order [5a-5d] to [6a-6d] to [7a-7d], etc. (a) Our approach generates weakly dependent tokens across non-local regions in parallel, preserving coherent patterns and local details. (b) The naive method generates strongly dependent tokens within local regions simultaneously, where independent sampling of strongly correlated tokens causes inconsistent generation and disrupted patterns, such as distorted tiger faces and fragmented zebra stripes.

Figure 2: Visualization comparison of our parallel generation and traditional autoregressive generation (LlamaGen [15]). Our approach (PAR) achieves 3.6-9.5\(\times\) speedup over LlamaGen with comparable quality, reducing the generation time from 12.41s to 3.46s (PAR-4\(\times\)) and 1.31s (PAR-16\(\times\)) per image. Time measurements are conducted with a batch size of 1 on a single A100 GPU.

An intuitive way to improve generation efficiency is to predict multiple tokens in parallel at each step. In language modeling, methods like speculative decoding [24][26] and Jacobi decoding [27], [28] achieve parallel generation through auxiliary draft models or iterative refinement. In the visual domain, approaches like MaskGIT [29] employ non-autoregressive paradigms with masked modeling strategies, while VAR [14] achieves faster speed through next-scale prediction that requires specially designed multi-scale tokenizers and longer token sequences. However, the introduction of additional models and specialized architectures increases model complexity, and may limit the flexibility of autoregressive models as a unified solution across different modalities.

In this work, we ask: can we achieve parallel visual generation while maintaining the simplicity and flexibility of standard autoregressive models? We find that parallel generation is closely tied to token dependencies: tokens with strong dependencies need sequential generation, while weakly dependent tokens can be generated in parallel. In autoregressive models, each token is generated through sampling (e.g., top-k) to maintain diversity. Parallel generation requires independent sampling of multiple tokens simultaneously, but the joint distribution of highly dependent tokens cannot be factorized for independent sampling, leading to inconsistent predictions, as demonstrated by the distorted local patterns in Fig. 1 (b). For visual data, such dependencies are naturally correlated with spatial distances: while locally adjacent tokens exhibit strong dependencies, spatially distant tokens often have weak correlations. This motivates us to reconsider how to organize tokens for generation: by identifying spatially distant tokens with weak correlations, we can group them for simultaneous prediction. Such non-local grouping allows us to maintain sequential generation for strongly dependent local tokens while enabling parallel generation across different spatial regions. Moreover, we observe that initial tokens in each local region play a crucial role in establishing the global structure - generating them in parallel could lead to conflicting structures across regions, such as repeated parts in different regions without global coordination (see middle row of Fig. 5). Therefore, the initial tokens in each local region should be generated sequentially to establish the global visual structure.

Based on these insights, we propose a simple yet effective approach for parallel generation in autoregressive visual models. Our key idea is to identify and group weakly dependent visual tokens for simultaneous prediction while maintaining sequential generation for strongly dependent ones. To achieve this, we first divide the image into local regions and generate their initial tokens sequentially to establish global context, then perform parallel generation by identifying and grouping tokens at corresponding positions across spatially distant regions. The process is illustrated in Fig. 1 (a). Our approach can be seamlessly implemented within standard autoregressive transformers through a reordering mechanism, with a few learnable token embeddings to facilitate the transition between sequential and parallel generation modes. By ensuring each prediction step has access to all previously generated tokens across regions, we maintain the autoregressive property and preserve global context modeling capabilities. With non-local parallel generation, our approach significantly reduces the number of inference steps and thereby accelerates generation, while maintaining comparable visual quality through careful token dependency handling.

We verify the effectiveness of our approach on both image and video generation tasks using ImageNet [30] and UCF-101 [31] datasets. For image generation, our method achieves around 3.9\(\times\) fewer generation steps and 3.6\(\times\) actual inference-time speedup with comparable generation quality. With more aggressive parallelization, we achieve around 11.3\(\times\) reduction in steps and 9.5\(\times\) actual speedup with minimal quality drop (within 0.7 FID for image and 10 FVD for video). The qualitative comparison of generation results between our method and the baseline is shown in Fig. 2. The experiments demonstrate the effectiveness of our approach across different visual domains and its compatibility with various tokenizers like VQGAN [12] and MAGVIT-v2 [21].

In summary, we propose a simple yet effective parallelized autoregressive visual generation approach that carefully handles token dependencies. Our key idea is to identify and group weakly dependent tokens for simultaneous prediction while maintaining sequential generation for strongly dependent ones. Our approach can be seamlessly integrated into standard autoregressive models without architectural modifications. Through extensive experiments with different visual domains and tokenization methods, we demonstrate considerable speedup while preserving generation quality, making autoregressive visual generation more practically usable for real-world applications.

2 Related Work↩︎

Autoregressive Visual Generation. Autoregressive modeling has been explored in visual generation for years, from early pixel-based approaches [6][8] to current token-based methods. Modern approaches typically follow a two-stage paradigm: first compressing visual data into compact token sequences through discrete tokenizers [12], [20], [21], then training a transformer [22] to predict these tokens autoregressively in raster scan order [11], [13], [16]. This paradigm has been successfully extended to video generation [32][34], where tokens from different frames are predicted sequentially. However, the strictly sequential generation process leads to slow inference speed that scales with sequence length.

Parallel Prediction in Sequential Generation. Various approaches have been proposed to accelerate sequential generation. In language modeling, speculative decoding [24][26] employs a draft model to generate candidate tokens for main model verification, while Jacobi decoding [27], [28] enables parallel generation through iterative refinement. In visual generation, MaskGIT [29] adopts a non-autoregressive approach with BERT-like masked modeling strategies, taking a different modeling paradigm from traditional autoregressive generation. VAR [14] proposes next-scale prediction that progressively generates tokens at increasing resolutions, though requiring specialized multi-level tokenizers and longer token sequences. In contrast, our approach enables efficient parallel generation while preserving the autoregressive property and model simplicity, readily applicable to various visual tasks without specialized architectures or additional models.

3 Method↩︎

Figure 3: Illustration of our non-local parallel generation process. Stage 1: sequential generation of initial tokens (1-4) for each region (separated by dotted lines) to establish global structure. Stage 2: parallel generation at aligned positions across different regions (e.g., 5a-5d), then moving to the next aligned positions (6a-6d, 7a-7d, etc.) for parallel generation. Same numbers indicate tokens generated in the same step, and the letter suffix (a,b,c,d) denotes different regions.

In this section, we present our approach for parallelized visual autoregressive generation. We first discuss the relationship between token dependencies and parallel generation in Sec. 3.1. Based on these insights, we propose our parallel generation approach in Sec. 3.2. Finally, we present the model architecture and implementation details that realize this process within autoregressive transformers in Sec. 3.3.

3.1 Token Dependencies and Parallel Generation↩︎

Standard autoregressive models adopt token-by-token sequential generation, which significantly limits generation efficiency. To improve efficiency, we explore the possibility of generating multiple tokens in parallel. However, a critical question arises: which tokens can be generated in parallel without compromising generation quality? In this section, we analyze the relationship between token dependencies and parallel generation through pilot studies, providing guidance for designing parallelized autoregressive visual generation models.

Pilot Study. In language modeling, researchers have attempted to group adjacent tokens for multi-token prediction [24][28], [35], [36]. However, our pilot study reveals that directly predicting adjacent tokens leads to significant quality degradation in visual generation (see Fig. 1 (b) and Tab. 3 (d)). In autoregressive generation, each token is generated through sampling strategies (e.g., top-k) to maintain diversity. When generating multiple tokens in parallel, these tokens need to be sampled independently. However, for adjacent visual tokens with strong dependencies, their joint distribution cannot be factorized into independent distributions, as each token is heavily influenced by its neighbors. The impact of such independent sampling is clearly demonstrated in the figure, where generating adjacent tokens in parallel leads to inconsistent local structures like distorted tiger faces and fragmented zebra stripes, as tokens are sampled without considering their neighbors’ decisions.

Design Principles. These observations suggest that parallel generation should focus on weakly correlated tokens to minimize the impact of independent sampling. For visual tokens, dependencies naturally decrease with spatial distance - tokens from distant regions typically have weaker correlations than adjacent ones. This motivates us to perform parallel generation across distant regions rather than within local neighborhoods. However, we find that not all distant tokens can be generated in parallel. The initial tokens of each region are particularly crucial as they jointly determine the global image structure. Parallel generation of these initial tokens, despite their spatial distances, could lead to conflicting global decisions, resulting in issues like repeated patterns or incoherent patches across regions (see the middle row in Fig. 5).

Based on these insights, we propose three key design principles for parallelized autoregressive generation: 1) generate initial tokens for each region sequentially to establish proper global structure; 2) maintain sequential generation within local regions where dependencies are strong; and 3) enable parallel generation across regions where dependencies are weak through proper token organization.

Figure 4: Overview of our parallel autoregressive generation framework. (a) Model implementation. The model first generates initial tokens sequentially [1,2,3,4], then uses learnable tokens [M1,M2,M3] to help transition into parallel prediction mode. (b) Comparison of visible context between our parallel prediction approach (left) and traditional single-token prediction (right). The colored cells indicate available context during generation. In traditional AR, when predicting token \(6d\), the model can access all previous tokens including \(6a-6c\). Without full attention, our parallel approach would limit each token (e.g., \(6b\)) to only see tokens up to the same position in the previous group (e.g., up to \(5b\)). We enable group-wise full attention to allow access to the entire previous group.

3.2 Non-Local Parallel Generation↩︎

Based on the above principles, we propose our approach that enables parallel token prediction while maintaining autoregressive properties. The process is illustrated in Fig. 3.

Cross-region Token Grouping. Let \(\{v_i\}_{i=1}^{H \times W}\) denote a sequence of visual tokens arranged in an \(H \times W\) grid. We first partition the token grid into \(M \times M\) regions, each containing \(k := \Big(\frac{H}{M} \times \frac{W}{M}\Big)\) tokens. Let \(v^{(r)}_{j}\) denote the token at position \(j\) in region \(r\), where \(r \in \{1,...,M^2\}\) and \(j \in \{1,...,k\}\). We then organize these tokens into groups based on their corresponding positions across regions: \[\small \Big\{[v^{(1)}_{1},\cdots,v^{(M^2)}_{1}], [v^{(1)}_{2},\cdots,v^{(M^2)}_{2}], \cdots, [v^{(1)}_{k},\cdots,v^{(M^2)}_{k}]\Big\}.\] This organization groups together tokens at the same relative position across different regions, facilitating our parallel generation process.
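To make the reordering concrete, below is a minimal sketch of the cross-region grouping (assuming a PyTorch tensor layout; the helper name cross_region_order is ours and not from the released code). It maps a raster-scan token sequence on an \(H \times W\) grid into the group order defined above.

```python
import torch

def cross_region_order(H: int, W: int, M: int) -> torch.Tensor:
    """Indices that reorder a raster-scan token sequence of length H*W into the
    cross-region group order: group j holds the token at local position j of
    each of the M*M regions, and groups are laid out consecutively."""
    assert H % M == 0 and W % M == 0
    h, w = H // M, W // M                                    # region size in tokens
    idx = torch.arange(H * W).reshape(H, W)                  # raster-scan positions
    regions = idx.reshape(M, h, M, w).permute(0, 2, 1, 3)    # (M, M, h, w) regions
    regions = regions.reshape(M * M, h * w)                  # (num_regions, k)
    return regions.transpose(0, 1).reshape(-1)               # group-major order

# Example: 24x24 grid with M=2 -> 4 regions of 12x12 tokens, groups of size 4.
order = cross_region_order(24, 24, 2)
raster_tokens = torch.arange(24 * 24)       # stand-in for a raster-ordered sequence
grouped_tokens = raster_tokens[order]       # sequence in the cross-region group order
```

Applying the inverse permutation (order.argsort()) maps generated tokens back to raster order before detokenization.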

Stage 1: Sequential Generation of Initial Tokens of Each Region. We first generate one initial token for each region sequentially (marked as “1-4” in Fig. 3) to establish the global context. As shown in Fig. 3 (1), we start with the top-left region and generate the initial token for each region by sampling from the conditional probability distribution: \[v^{(i)}_1 \sim \mathbb{P}(v^{(i)}_1 | v^{(<i)}_1), \quad i \in \{1,...,M^2\},\] where \(v^{(i)}_1\) denotes the initial token of the \(i\)-th region. Since the number of regions (\(M^2\)) is small and fixed, this sequential generation introduces minimal overhead while providing crucial global context for subsequent parallel generation.

Stage 2: Parallel Generation of Cross-region Tokens. After initializing tokens for all regions, we proceed with parallel generation of the remaining tokens. As illustrated in Fig. 3 (2), at each step, we identify the next position \(j\) within each region following a raster scan order and simultaneously predict tokens at this position across all regions (e.g., tokens 5a-5d are generated in parallel). The parallel generation at each step can be formulated as: \[\{v^{(r)}_{j}\}_{r=1}^{M^2} \sim \mathbb{P}(\{v^{(r)}_{j}\}_{r=1}^{M^2} | v_{<j}),\] where \(\{v^{(r)}_{j}\}_{r=1}^{M^2}\) represents the set of tokens at position \(j\) across all regions to be generated in parallel, and \(v_{<j}\) includes both initial tokens and tokens from previous parallel steps. For example, with \(M=2\) on a \(24 \times 24\) token grid, after generating 4 initial tokens sequentially, we predict \(M^2=4\) tokens in parallel at each subsequent step, reducing the total number of generation steps from 576 to 147 (i.e., \(4 + \frac{576-4}{4}\)). While enabling parallel prediction, our approach maintains the autoregressive property as each prediction is still conditioned on all previous tokens. The key difference is that tokens at corresponding positions across regions, which exhibit weak dependencies, are now generated simultaneously instead of sequentially.
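The step-count arithmetic generalizes directly; the following small helper (plain Python; the function name is ours, for illustration only) reproduces the numbers quoted above.

```python
def num_generation_steps(H: int, W: int, M: int) -> int:
    """Decoding steps for an H*W token grid with M*M regions: the M*M initial
    tokens are generated one per step, then the remaining tokens in groups of
    M*M tokens per step."""
    n = M * M
    return n + (H * W - n) // n

print(num_generation_steps(24, 24, 2))   # 147 steps (4 tokens per parallel step)
print(num_generation_steps(24, 24, 4))   # 51 steps  (16 tokens per parallel step)
```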

3.3 Model Architecture Details↩︎

We illustrate our parallel generation framework using a standard autoregressive transformer for class-conditioned image generation.

Framework Implementation. As shown in Fig. 4 (a), our model architecture consists of an autoregressive transformer that processes the input sequence and generates visual tokens. The input sequence begins with a class token \((C)\) followed by visual tokens to be generated. To achieve \(n\)-token parallel prediction, we design a special sequence structure with three distinct parts: 1) initial sequential tokens [1,2,...,\(n\)] that are generated one at a time, 2) a transition part with \(n-1\) learnable tokens ([M1,M2,M3] for \(n=4\)) that helps the model enter parallel prediction mode, and 3) subsequent token groups that are predicted \(n\) tokens at a time (e.g., [\(5a,5b,5c,5d\)], [\(6a,6b,6c,6d\)]). For predicting each group, the model takes all previous tokens as input while maintaining a fixed offset of \(n\) tokens between input and target sequences. The learnable tokens share the same dimension as regular tokens for seamless integration. To maintain spatial relationships under our reordered sequence, we employ 2D Rotary Position Embedding (RoPE) [37], which preserves each token’s original spatial position information regardless of its sequence position. These designs enable parallel prediction while preserving the standard autoregressive transformer architecture.
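As an illustration of this sequence layout, here is a minimal sketch (PyTorch; class and variable names are ours, and details such as initialization may differ from the actual implementation) of how the training-time input sequence could be assembled with the \(n-1\) learnable transition embeddings and the fixed offset of \(n\) between inputs and targets.

```python
import torch
import torch.nn as nn

class ParallelARSequence(nn.Module):
    """Illustrative sketch (not the released implementation) of the input layout
    for n-token parallel prediction: class embedding, the n initial tokens,
    n-1 learnable transition embeddings, then the remaining tokens. Positions in
    the parallel part are supervised with the token n steps ahead, giving the
    fixed input/target offset of n described above."""

    def __init__(self, vocab_size: int, dim: int, n: int):
        super().__init__()
        self.n = n
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.transition = nn.Parameter(torch.randn(n - 1, dim) * 0.02)

    def forward(self, tokens: torch.Tensor, cls_emb: torch.Tensor) -> torch.Tensor:
        # tokens: (B, L) visual token ids already in cross-region order
        # cls_emb: (B, D) embedding of the class token C
        B, L = tokens.shape
        x = self.tok_emb(tokens)                                   # (B, L, D)
        trans = self.transition.unsqueeze(0).expand(B, -1, -1)     # (B, n-1, D)
        # Input layout: [C, t_1..t_n, M_1..M_{n-1}, t_{n+1}..t_{L-n}];
        # the last n tokens appear only as targets, so they are not fed as inputs.
        inp = torch.cat([cls_emb.unsqueeze(1), x[:, :self.n],
                         trans, x[:, self.n:L - self.n]], dim=1)   # (B, L, D)
        return inp
```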

Group-wise Bi-directional Attention with Global Autoregression. Our framework combines sequential generation of initial tokens with parallel generation of subsequent token groups. As illustrated in Fig. 4 (b), in traditional autoregressive models, when predicting token \(6d\), the model can access all previous tokens including \(6a-6c\). However, naive parallel generation with causal masking would restrict each token (e.g., \(6b\)) to only see tokens up to the same position in the previous group (e.g., up to \(5b\)), limiting the available context. To address this limitation while maintaining parallelism, we enable bi-directional attention within each prediction group while preserving causal attention between groups. This allows each token in the current group to access the entire previous group as context (e.g., all tokens [\(5a-5d\)] are visible when predicting any token in [\(6a-6d\)]). This design enriches the local context for parallel prediction while maintaining the global autoregressive property, ensuring compatibility with standard optimizations like KV-cache.
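A sketch of the corresponding attention mask is shown below (PyTorch, illustrative only): causal over the prefix and across groups, bi-directional within each group of \(n\) tokens. This is what lets every token in a group serve as full context for the next group while still working with a standard KV-cache.

```python
import torch

def groupwise_attention_mask(prefix_len: int, num_groups: int, n: int) -> torch.Tensor:
    """Boolean attention mask (True = attend) that stays causal over the prefix
    (class token, initial tokens, transition tokens) and across groups, but is
    bi-directional inside each group of n parallel-predicted tokens. A sketch of
    the masking scheme described above, not the official implementation."""
    L = prefix_len + num_groups * n
    mask = torch.ones(L, L).tril().bool()          # standard causal mask
    for g in range(num_groups):
        s = prefix_len + g * n
        mask[s:s + n, s:s + n] = True              # full attention within the group
    return mask

# e.g., class token + 4 initial + 3 transition tokens = prefix of 8, groups of 4
mask = groupwise_attention_mask(prefix_len=8, num_groups=2, n=4)
```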

Extension to Video Generation. Our parallel generation framework can be naturally extended to video generation. The tokenization process reduces both spatial and temporal dimensions, resulting in tokens arranged in a \(T \times H \times W\) grid, where each latent frame aggregates information from multiple input frames. We treat these temporally compressed tokens similarly to image tokens and apply our parallel generation strategy along the spatial dimensions, with the only modification being the use of 3D position embeddings. While we also explored parallel generation along the temporal dimension, we found it less effective than spatial parallelization. This is because temporal dependencies exhibit stronger sequential characteristics that are fundamental to video coherence, making them less suitable for parallel prediction compared to spatial relationships. The exploration of effective temporal parallel strategies remains as future work.

4 Experiments↩︎

4.1 Experimental Setup↩︎


Table 1: Model sizes and architecture configurations of PAR. The configurations are following previous works [2], [4], [15], [38].
Model Params Layers Hidden Heads
PAR-L 343M 24 1024 16
PAR-XL 775M 36 1280 20
PAR-XXL 1.4B 48 1536 24
PAR-3B 3.1B 24 3200 32

Figure 5: Qualitative comparison of parallel generation strategies. Top: Our method with sequential initial tokens followed by parallel distant token prediction produces high-quality and coherent images. Middle: Direct parallel prediction without sequential initial tokens leads to inconsistent global structures. Bottom: Parallel prediction of adjacent tokens results in distorted local patterns and broken details.

Table 2: Class-conditional image generation on the ImageNet 256\(\times\)256 benchmark. “\(\downarrow\)” or “\(\uparrow\)” indicates that lower or higher values are better. “-re” means using rejection sampling. PAR-4\(\times\) and PAR-16\(\times\) mean generating 4 and 16 tokens per step in parallel, respectively.
Type Model #Para. FID\(\downarrow\) IS\(\uparrow\) Precision\(\uparrow\) Recall\(\uparrow\) Steps Time(s)\(\downarrow\)
GAN BigGAN [39] 112M 6.95 224.5 0.89 0.38 1 \(-\)
GigaGAN [40] 569M 3.45 225.5 0.84 0.61 1 \(-\)
StyleGan-XL [41] 166M 2.30 265.1 0.78 0.53 1 0.08
Diffusion ADM [42] 554M 10.94 101.0 0.69 0.63 250 44.68
CDM [43] \(-\) 4.88 158.7 \(-\) \(-\) 8100 \(-\)
LDM-4 [44] 400M 3.60 247.7 \(-\) \(-\) 250 \(-\)
DiT-XL/2 [45] 675M 2.27 278.2 0.83 0.57 250 11.97
Mask MaskGIT [29] 227M 6.18 182.1 0.80 0.51 8 0.13
VAR VAR-d30 [14] 2B 1.97 334.7 0.81 0.61 10 0.27
MAR MAR [46] 943M 1.55 303.7 0.81 0.62 64 28.24
AR VQGAN [12] 227M 18.65 80.4 0.78 0.26 256 5.05
VQGAN [12] 1.4B 15.78 74.3 \(-\) \(-\) 256 5.05
VQGAN-re [12] 1.4B 5.20 280.3 \(-\) \(-\) 256 6.38
ViT-VQGAN [47] 1.7B 4.17 175.1 \(-\) \(-\) 1024 \(>\)6.38
ViT-VQGAN-re [47] 1.7B 3.04 227.4 \(-\) \(-\) 1024 \(>\)6.38
RQTran. [16] 3.8B 7.55 134.0 \(-\) \(-\) 256 5.58
RQTran.-re [16] 3.8B 3.80 323.7 \(-\) \(-\) 256 5.58
AR LlamaGen-L [15] 343M 3.07 256.1 0.83 0.52 576 12.58
LlamaGen-XL [15] 775M 2.62 244.1 0.80 0.57 576 18.66
LlamaGen-XXL [15] 1.4B 2.34 253.9 0.80 0.59 576 24.91
LlamaGen-3B [15] 3.1B 2.18 263.3 0.81 0.58 576 12.41
AR PAR-L-4\(\times\) 343M 3.76 218.9 0.84 0.50 147 3.38
PAR-XL-4\(\times\) 775M 2.61 259.2 0.82 0.56 147 4.94
PAR-XXL-4\(\times\) 1.4B 2.35 263.2 0.82 0.57 147 6.84
PAR-3B-4\(\times\) 3.1B 2.29 255.5 0.82 0.58 147 3.46
PAR-XXL-16\(\times\) 1.4B 3.02 270.6 0.81 0.56 51 2.28
PAR-3B-16\(\times\) 3.1B 2.88 262.5 0.82 0.56 51 1.31

Image Generation. For fair comparison with existing token-by-token autoregressive visual generation methods, we adopt similar settings as [15], using a VQGAN tokenizer [12] with a 16,384 codebook size and a 16\(\times\) downsampling ratio. Models are trained on ImageNet-1K [30] for 300 epochs, with 384\(\times\)384 images tokenized into 24\(\times\)24 token grids. We evaluate on the ImageNet validation set at 256\(\times\)256 resolution using FID [48] as the primary metric, complemented by IS and Precision/Recall [49]. We experiment with model sizes from 343M to 3.1B parameters (Tab. 1), reporting both generation steps and wall-clock latency.

Video Generation. We evaluate on the UCF-101 [31] dataset using our reproduced MAGVIT-v2 tokenizer [21]. Each 17-frame video (128\(\times\)​128 resolution) is compressed by 8\(\times\) spatially and 4\(\times\) temporally into a \(5\times16\times16\) token sequence (1280 tokens per video). For fair comparison, we implement both next-token prediction and our parallel generation approach using the same architecture. The position of video codes is encoded via 3D positional embeddings. Our reproduced MAGVIT-v2 tokenizer uses a 64K visual vocabulary instead of the original 262K to facilitate model training. We use Fréchet Video Distance (FVD) [50] to evaluate generation quality.
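For reference, the token-grid arithmetic works out as follows (assuming the causal temporal compression typical of MAGVIT-v2-style tokenizers, where the first frame is encoded separately):

```python
# 17 input frames at 128x128, 8x spatial and 4x temporal compression
frames, height, width = 17, 128, 128
t = (frames - 1) // 4 + 1            # causal 4x temporal compression -> 5 latent frames
h, w = height // 8, width // 8       # 8x spatial compression -> 16 x 16 tokens per frame
print(t, h, w, t * h * w)            # 5 16 16 1280 tokens per video
```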

Detailed training configurations for both video and image generation are provided in the supplementary material.

4.2 Main Results↩︎

4.2.1 Image Generation↩︎

Tab. 2 presents comprehensive comparisons of class-conditional image generation with various state-of-the-art methods, including GAN [39][41], Diffusion [42][45] (iterative denoising), Mask [29] (masked token prediction), VAR [14] (next-scale prediction), MAR [46] (continuous masked token prediction), and AR [12], [15], [47] (autoregressive generation). Our PAR achieves competitive performance while maintaining faster inference speed than most state-of-the-art models. When compared with representative models from different categories, our method shows clear advantages. Compared with the mask-based method MaskGIT [29], our method achieves substantially better generation quality (FID 2.29 vs. 6.18) despite requiring more steps. For VAR [14], while it achieves a slightly better FID (1.97 vs. 2.29), our method maintains a simpler framework with fewer tokens per image and preserves the pure autoregressive nature, making it more flexible for multi-modal integration.

Compared to our baseline model LlamaGen [15], PAR achieves a 3.9\(\times\) reduction in generation steps (147 vs. 576) and a 3.58\(\times\) speedup in wall-clock time (3.46s vs. 12.41s) while maintaining comparable quality (FID 2.29 vs. 2.18). With more aggressive parallelization, PAR-3B-16\(\times\) further accelerates generation to 1.31s (9.5\(\times\) speedup) with only 0.7 FID degradation compared to the baseline, demonstrating the effectiveness of our parallel generation strategy in balancing efficiency and quality.


4.2.2 Video Generation↩︎

We evaluate our approach on the UCF-101 [31] dataset for class-conditional video generation, comparing with various state-of-the-art methods across different categories. Among recent works, MAGVIT-v2 [21] achieves strong performance with an FVD of 58 using masked token prediction, while its autoregressive variant MAGVIT-v2-AR obtains an FVD of 109 with 1280 generation steps. Our next-token-prediction baseline (PAR-1\(\times\)) achieves a competitive FVD of 94.1, demonstrating the effectiveness of our implementation. More importantly, our parallel generation variants significantly reduce both generation steps and wall-clock time while maintaining comparable quality. Specifically, PAR-4\(\times\) reduces the generation steps from 1280 to 323 with a minimal FVD increase (99.5 vs. 94.1), achieving a 3.8\(\times\) speedup (11.27s vs. 43.30s). Further parallelization with PAR-16\(\times\) achieves a 12.6\(\times\) speedup (3.44s vs. 43.30s) with an FVD of 103.4, while reducing generation steps to 95. Due to space limits, we provide visualization results of video generation in the supplementary material.

4.3 Ablation Study↩︎

In this section, we conduct comprehensive ablation studies to investigate the effectiveness of our key design choices on the ImageNet 256\(\times\)256 validation set (Tab. 3). Unless otherwise specified, we use the PAR-XL model with a parallel group size of \(n=4\) as the default setting.

Table 3: Ablation studies on image generation model designs.

(a) Initial sequential token generation
init. seq. FID\(\downarrow\) IS\(\uparrow\) steps\(\downarrow\)
w/o 3.67 221.36 144
w/ 2.61 259.17 147

(b) Number of parallel predicted tokens
n FID\(\downarrow\) IS\(\uparrow\) steps\(\downarrow\)
1 2.34 253.90 576
4 2.35 263.24 147
16 3.02 270.57 51

(c) Attention pattern within parallel groups
attn FID\(\downarrow\) IS\(\uparrow\) steps\(\downarrow\)
causal 3.64 228.08 147
full 2.61 259.17 147

(d) Token ordering and prediction pattern
order pattern FID\(\downarrow\) IS\(\uparrow\) steps\(\downarrow\)
raster one 2.62 244.08 576
distant one 2.64 262.72 576
raster multi 5.64 265.46 147
distant multi 2.61 259.17 147

(e) Model scaling
Params FID\(\downarrow\) IS\(\uparrow\) steps
343M 3.76 218.92 147
775M 2.61 259.17 147
1.4B 2.35 263.24 147
3.1B 2.29 255.46 147

Initial sequential token generation. We first evaluate the importance of initial sequential token generation by comparing models with and without this phase in Tab. 3 (a). Results show that initial sequential generation reduces FID from 3.67 to 2.61, with only 3 additional steps (147 vs. 144). We also visualize the comparison in Fig. 5. Without initial sequential generation (middle row), the generated images exhibit inconsistent global structures, such as misaligned dogs with duplicated body parts, as initial tokens are generated without awareness of each other. In contrast, our approach with initial sequential generation (top row) produces more coherent and natural-looking images. The results illustrate the importance of initial sequential token generation for establishing proper global structure.

Number of parallel predicted tokens. The number of tokens predicted in parallel (\(n\)) controls the trade-off between efficiency and quality. As shown in Tab. 3 (b), with \(n=4\) (\(M=2\)), our approach reduces generation steps from 576 to 147 while maintaining comparable quality (FID 2.35 vs. 2.34). Further increasing to \(n=16\) (\(M=4\)) achieves more aggressive parallelization with only 51 steps, at the cost of slight quality degradation (FID increase of 0.67). This is consistent with our analysis that tokens from distant regions have weaker dependencies and can be generated in parallel. As shown in Fig. 2, both PAR-4\(\times\) and PAR-16\(\times\) preserve visual fidelity while achieving significant speedup (3.46s and 1.31s vs. 12.41s).

Impact of attention pattern. To enable effective parallel prediction while preserving rich context modeling, we study different attention patterns between parallel predicted tokens. As shown in Tab. 3 (c), with \(n=4\) parallel tokens, enabling full attention within groups reduces FID from 3.64 to 2.61 compared to causal attention, as it allows each token to access the complete context from previous groups. This supports our design of combining bi-directional attention within groups with autoregressive attention between groups for effective parallel generation.

Impact of token ordering and prediction pattern. We compare raster scan and our distant ordering under different prediction settings. As shown in Tab. 3 (d), while both achieve comparable quality in single-token prediction (FID 2.62 vs. 2.64), their performance differs significantly with multi-token prediction - raster scan degrades severely (FID 5.64) while our distant ordering maintains quality (FID 2.61). This indicates that the choice of parallel predicted tokens is critical. When using raster scan, adjacent tokens with strong dependencies are forced to be generated simultaneously, leading to distorted local patterns as shown in Fig. 5 (bottom row). In contrast, our region-based distant ordering groups weakly correlated tokens for parallel prediction, preserving both local details and global coherence (top row).

Model scaling analysis. We study how our parallel prediction approach scales with model size. As shown in Tab. 3 (e), increasing the model size from 343M to 3.1B parameters steadily improves generation quality (FID decreases from 3.76 to 2.29). Compared with the sequential generation baseline (LlamaGen) in Tab. 2, smaller models show a noticeable quality gap (343M: FID 3.76 vs. 3.07), while larger models achieve comparable performance (775M: 2.61 vs. 2.62; 1.4B: 2.35 vs. 2.34) while reducing generation steps from 576 to 147. This demonstrates that increased model capacity effectively mitigates the quality trade-off from parallel prediction, suggesting stronger capability in modeling the joint distribution of parallel tokens.

5 Conclusion↩︎

We propose Parallelized Autoregressive Visual Generation (PAR), a simple yet effective approach that enables efficient parallel generation while preserving the advantages of autoregressive modeling. Our key finding is that the feasibility of parallel generation depends on token dependencies - tokens with weak dependencies can be generated in parallel while strongly dependent tokens lead to inconsistent results. Based on this insight, our PAR organizes tokens based on their dependency strengths rather than spatial proximity. The effectiveness of our approach across different visual domains validates this token dependency-based strategy for efficient autoregressive visual generation. We hope our work can inspire future research on visual generation and other sequence prediction tasks.

Appendix↩︎

The supplementary material includes the following additional information:

  • Sec. 6 provides more implementation details for PAR.

  • Sec. 7 provides more visualization results.

  • Sec. 8 provides the analysis of visual token dependencies.

6 Implementation details for PAR↩︎

Image Generation. For image generation, we train our models on the ImageNet-1K [30] training set, consisting of 1,281,167 images across 1,000 object classes. Following the setting in LlamaGen [15], we pre-tokenize the entire training set using their VQGAN [12] tokenizer and enhance data diversity through ten-crop transformation. For inference, we adopt classifier-free guidance [51] to improve generation quality. The detailed training and sampling hyper-parameters are listed in Tab. 4.

Table 4: Detailed Hyper-parameters for Image Generation.
config value
training hyper-params
optimizer AdamW [52]
learning rate 1e-4(L,XL)/2e-4(XXL,3B)
weight decay 5e-2
optimizer momentum (0.9, 0.95)
batch size 256(L,XL)/ 512(XXL,3B)
learning rate schedule cosine decay
ending learning rate 0
total epochs 300
warmup epochs 15
precision bfloat16
max grad norm 1.0
dropout rate 0.1
attn dropout rate 0.1
class label dropout rate 0.1
sampling hyper-params
temperature 1.0
guidance scale 1.60 (L) / 1.50 (XL) / 1.435 (XXL) / 1.345 (3B)

Video Generation. For video generation, we train our models on the UCF-101 [31] training set, which contains 9.5K training videos spanning 101 action categories. Videos are sampled as random clips at 8 fps and tokenized by our reimplementation of MAGVIT-v2 [21] (as their code is not publicly available), which achieves a reconstruction FVD of 32 on UCF-101. For inference, we use classifier-free guidance [51] with top-k sampling to improve generation quality. The detailed training and sampling hyper-parameters are listed in Tab. 5.

Table 5: Detailed Hyper-parameters for Video Generation.
config value
training hyper-params
optimizer AdamW [52]
learning rate 1e-4
weight decay 5e-2
optimizer momentum (0.9, 0.95)
batch size 256
learning rate schedule cosine decay
ending learning rate 0
total epochs 3000
warmup epochs 150
precision bfloat16
max grad norm 1.0
dropout rate 0.1
attn dropout rate 0.1
class label dropout rate 0.1
sampling hyper-params
temperature 1.0
guidance scale 1.15
top-k 8000

7 More Visualization Results↩︎

In Fig. 6 and Fig. 7, we provide additional visualization results of PAR-4\(\times\) and PAR-16\(\times\) image generation on the ImageNet [30] dataset, respectively.

In Fig. 8, we provide visualization results of video generation using our model on the UCF-101 [31] dataset. The results are sampled as 17-frame videos at 128×128 resolution. As shown in the figure, even with 16\(\times\) parallelization (PAR-16\(\times\)), our method shows no obvious quality degradation compared to single-token prediction (PAR-1\(\times\)), producing smooth motion and stable backgrounds across frames.

Figure 6: Additional image generation results of PAR-4\(\times\) across different ImageNet [30] categories.

Figure 7: Additional image generation results of PAR-16\(\times\) across different ImageNet [30] categories.

Figure 8: Video generation results on UCF-101 [31]. Each row shows sampled frames from a 17-frame sequence at 128×128 resolution, generated by PAR-1\(\times\), PAR-4\(\times\), and PAR-16\(\times\) respectively across different action categories.

8 Analysis of Visual Token Dependencies↩︎

In Sec. 3.1, we demonstrated through pilot studies that parallel generation of adjacent tokens leads to quality degradation due to strong dependencies, while tokens from distant regions can be generated simultaneously. In this section, we provide a theoretical perspective based on conditional entropy to explain this observation and our design. We use conditional entropy to measure token dependencies quantitatively - lower conditional entropy between tokens indicates stronger dependency, while higher conditional entropy suggests weaker dependency and thus potential for parallel generation. This perspective also validates our PAR design: in AR-based generation, each step predicts a conditional distribution of the next tokens given all previous tokens, and higher conditional entropy indicates higher difficulty for the model to predict the next tokens. In the following, we first introduce the estimation of conditional entropy in Sec. 8.1, and then validate our proposed approach by analyzing the relationship between token dependencies and spatial distances in Sec. 8.2.

8.1 Conditional Entropy Estimation↩︎

Given a visual token sequence \(\{v_1, v_2, ..., v_n\}\), our goal is to estimate the conditional entropy \(H(v_{k}|\{v_j\}_{j < k})\) where the token feature \(v_i \in \mathbb{R}^d\) and \(\{v_j\}_{j < k}\) is the set of (all) visual tokens that precede \(v_k\) in the sequence. This conditional entropy measures the uncertainty of the current token \(v_k\) given the previously occurring visual tokens, thereby characterizing the dependency between \(v_k\) and the set \(\{v_j\}_{j < k}\). It is important to emphasize that we do not require the exact value of \(H(v_{k}|\{v_j\}_{j < k})\). Instead, we aim to reflect the trends in \(H(v_{k}|\{v_j\}_{j < k})\) under different scenarios, such as given different sets of \(\{v_j\}_{j < k}\) and considering different positions of \(v_k\) given the same set of \(\{v_j\}_{j < k}\).

In particular, we characterize the relationship between the token \(v_k\) and the previous ones as the following model \[\label{eq:model} v_{k} = f(\{v_j\}_{j < k}) + \boldsymbol{\epsilon}_k\tag{1}\] where \(v_{k}\) is the next token we focus on and \(\{v_j\}_{j < k}\) is the conditioning token(s), \(f(\cdot)\) is a deterministic function, and \(\boldsymbol{\epsilon}_k\) is the random additive error term. Then the conditional entropy \(H(v_{k}|\{v_j\}_{j < k})\) satisfies \[\begin{align} H(v_{k}|\{v_j\}_{j < k}) &= H(f(\{v_j\}_{j < k}) + \boldsymbol{\epsilon}_k|\{v_j\}_{j < k}) \notag \\ &=H(\boldsymbol{\epsilon}_k|\{v_j\}_{j < k}), \end{align}\] where the second equation holds since \(f(\cdot)\) is a deterministic function. However, exactly calculating \(H(\boldsymbol{\epsilon}_k|\{v_j\}_{j < k})\) is intractable as we cannot access the entire data distribution. To this end, inspired by prior research on bounding techniques for entropy and mutual information estimation [53][58], we seek their upper bound as a proxy for showing the trends of the conditional entropy for different tokens. In particular, we have \[\begin{align} \label{eq:closed-form} H(\boldsymbol{\epsilon}_k|\{v_j\}_{j < k})\le H(\boldsymbol{\epsilon}_k)\le\frac{1}{2}\log((2\pi e)^d|\boldsymbol{\Sigma}|), \end{align}\tag{2}\] where \(\boldsymbol{\Sigma}\) denotes the covariance matrix of \(\boldsymbol{\epsilon}_k\). Notably, the first inequality naturally holds and the second inequality follows from the maximum entropy theory [59], [60], which is achievable when \(\boldsymbol{\epsilon}_k\) follows a Gaussian distribution.

Based on Eq. 2 , we can estimate the trend of conditional entropy changes by calculating the determinant of the residual covariance matrix, i.e., \(|\boldsymbol{\Sigma}|\). In order to obtain the additive errors \(\boldsymbol{\epsilon}\), we consider training a parameterized model \(f_\theta(\cdot)\) to get the function \(f\) and characterize \(\boldsymbol{\epsilon}\) as the residual errors. The detailed algorithm is provided in Algorithm 10.
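A minimal NumPy sketch of this estimation is given below, using an ordinary least-squares fit as a stand-in for the parameterized model \(f_\theta\); the paper's Algorithm 10 may differ in the choice of predictor and preprocessing.

```python
import numpy as np

def entropy_upper_bound(cond_feats: np.ndarray, target_feats: np.ndarray) -> float:
    """Proxy for H(target | cond): fit a predictor, take the residuals as epsilon,
    and return the Gaussian upper bound 0.5 * log((2*pi*e)^d * |Sigma|).
    cond_feats: (N, p) conditioning features, target_feats: (N, d) target features."""
    N, d = target_feats.shape
    X = np.concatenate([cond_feats, np.ones((N, 1))], axis=1)   # add a bias column
    W, *_ = np.linalg.lstsq(X, target_feats, rcond=None)        # least-squares f_theta
    residuals = target_feats - X @ W                            # epsilon estimates
    Sigma = np.cov(residuals, rowvar=False) + 1e-6 * np.eye(d)  # regularized covariance
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)
```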


Figure 9: Visualization of token conditional entropy maps. Each map shows the conditional entropy of all tokens when conditioned on a reference token (blue square). Darker red indicates lower conditional entropy and thus stronger dependency with the reference token. The visualization shows that tokens exhibit strong dependencies with their spatial neighbors and weak dependencies with distant regions.

Figure 10: Conditional Entropy Estimation

Figure 11: Conditional entropy differences between parallel and sequential generation in different orders. (a)(d) show parallel (4 tokens) generation strategies and (b)(e) show sequential generation strategies for our proposed order and raster scan order respectively. Numbers indicate generation step in each order. (c)(f) visualize the conditional entropy increase when switching from sequential to parallel generation for each order, where darker red indicates larger entropy increase and thus higher prediction difficulty. Both orders generate the first four tokens sequentially (shown as white regions in entropy maps). Our proposed order that generates tokens from different spatial blocks in parallel shows smaller entropy increases compared to raster scan order that generates consecutive tokens simultaneously, indicating parallel generation across spatial blocks introduces less prediction difficulty than generating adjacent tokens simultaneously.

8.2 Entropy Analysis on ImageNet Data and PAR↩︎

Based on the conditional entropy estimation method introduced above, we conduct experiments on ImageNet to analyze token dependencies and validate our parallel generation strategy. We randomly sample 10,000 images from ImageNet [30] and extract their features using the VQGAN [12] encoder, followed by vector quantization, taking the corresponding continuous codebook embeddings as token features.

We first analyze the dependencies between tokens at different positions. For each position \(j\) in the feature map, we calculate the conditional entropy \(H(v_i|v_j)\) for all other positions \(i \neq j\), i.e., the uncertainty of each token \(v_i\) given the token \(v_j\) at the \(j\)-th position. It should be noted that Algorithm 10 is not limited to \(H(v_{k}|\{v_j\}_{j < k})\), where the given visual tokens \(\{v_j\}\) must satisfy \(j<k\): any pair of tokens \(v_j\) and \(v_i\) can be considered to satisfy Eq. 1, making the proposed method applicable to calculating \(H(v_i|v_j)\). Fig. 9 presents the experimental results. We observe that, for a given reference token, adjacent tokens typically exhibit lower conditional entropy (shown in redder colors). This indicates that the dependencies between adjacent tokens are stronger than those between tokens that are farther apart. This observation aligns with the spatial locality of visual data, where nearby regions have stronger correlations than distant ones.

Next, we analyze how different token ordering strategies affect the difficulty of parallel generation in Fig. 11. To simulate the prediction difficulty during generation, we compute each token's conditional entropy given all its previous tokens - higher conditional entropy indicates more uncertainty and thus higher prediction difficulty at that position. By comparing the conditional entropy difference between sequential generation (one token at a time) and parallel generation (predicting multiple tokens simultaneously), we can quantify the increased difficulty introduced by parallel generation at each position. We conduct experiments with 4-token parallel prediction under two ordering strategies: our proposed order, which first generates the initial four tokens sequentially to establish global structure and then generates tokens from different spatial blocks in parallel, and the raster scan order, which directly predicts consecutive tokens simultaneously after the initial four tokens.

For our proposed order, we aim to characterize the entropy increase caused by parallel generation compared to entirely sequential generation. In particular, letting \(v_k^{(r)}\) be the token at position \(k\) in region \(r\), we define \(\mathcal{V}_{k,r}^{\mathrm{seq}}\) and \(\mathcal{V}_{k,r}^{\mathrm{par}}\) as the sets of previous tokens of \(v_{k}^{(r)}\) under sequential and parallel generation, respectively (see Fig. 11 (a)(b)). The conditional entropies of sequential generation (single-token) and parallel generation (multi-token) are then \(H(v_k^{(r)}|\mathcal{V}_{k,r}^{\mathrm{seq}})\) and \(H(v_k^{(r)}|\mathcal{V}_{k,r}^{\mathrm{par}})\), and we characterize the entropy increase caused by parallel generation, i.e., \[\begin{align} \label{eq:diff95entropy95ourorder} H(v_k^{(r)}|\mathcal{V}_{k,r}^{\mathrm{par}})-H(v_k^{(r)}|\mathcal{V}_{k,r}^{\mathrm{seq}}). \end{align}\tag{3}\]

As a comparison, we also consider the raster scan order, where the tokens are arranged exactly by their positions, denoted as \(v_1,v_2,\ldots\). In this setting, given the current token \(v_k\), we define \(\mathcal{V}_{k}^{\mathrm{seq}}\) and \(\mathcal{V}_{k}^{\mathrm{par}}\) as the sets of previous tokens of \(v_{k}\) under sequential and parallel generation (see Fig. 11 (d)(e)). We then also characterize the entropy increase caused by parallel generation in the raster scan order, i.e., \[\begin{align} \label{eq:diff95entropy95raster} H(v_k|\mathcal{V}_{k}^{\mathrm{par}})-H(v_k|\mathcal{V}_{k}^{\mathrm{seq}}). \end{align}\tag{4}\]
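Given any entropy estimator of this kind (e.g., the entropy_upper_bound sketch from Sec. 8.1), the two differences in Eqs. 3 and 4 can be computed by simply swapping the conditioning set, as in the illustrative helper below (index sets and shapes are our assumptions, not the paper's exact procedure).

```python
import numpy as np

def entropy_increase(token_feats: np.ndarray, k: int,
                     seq_ctx: list, par_ctx: list) -> float:
    """Entropy increase at position k when the conditioning set shrinks from the
    fully sequential context (seq_ctx) to the parallel-generation context (par_ctx),
    using the entropy_upper_bound sketch from Sec. 8.1 as the estimator.
    token_feats: (N, num_positions, d) codebook features of N sampled images."""
    target = token_feats[:, k]                                        # (N, d)
    ctx_seq = token_feats[:, seq_ctx].reshape(target.shape[0], -1)    # flattened context
    ctx_par = token_feats[:, par_ctx].reshape(target.shape[0], -1)
    return entropy_upper_bound(ctx_par, target) - entropy_upper_bound(ctx_seq, target)
```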

The numerical results of Eqs. 3 and 4 are presented in Fig. 11 (c) and (f). Both orderings maintain identical conditional entropy for the first four tokens due to their sequential generation. For subsequent tokens, our proposed order leads to significantly smaller conditional entropy increases than the raster scan order. This indicates that, when switching from sequential to parallel generation, generating tokens from different spatial blocks introduces less prediction difficulty than generating consecutive tokens in raster scan order. These results quantitatively validate our design.

References↩︎

[1]
A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., “Improving language understanding by generative pre-training,” 2018.
[2]
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” 2019.
[3]
T. Brown et al., “Language models are few-shot learners,” in NIPS, 2020.
[4]
H. Touvron et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[5]
H. Touvron et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
[6]
A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al., “Conditional image generation with PixelCNN decoders,” in NIPS, vol. 29, 2016.
[7]
A. Van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in ICML, 2016.
[8]
T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, “Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications,” arXiv preprint arXiv:1701.05517, 2017.
[9]
N. Parmar et al., “Image transformer,” in ICML, 2018, pp. 4055–4064.
[10]
M. Chen et al., “Generative pretraining from pixels,” in ICML, 2020, pp. 1691–1703.
[11]
A. Ramesh et al., “Zero-shot text-to-image generation,” in ICML, 2021, pp. 8821–8831.
[12]
P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” in CVPR, 2021, pp. 12873–12883.
[13]
J. Yu et al., “Scaling autoregressive models for content-rich text-to-image generation,” arXiv preprint arXiv:2206.10789, 2022.
[14]
K. Tian, Y. Jiang, Z. Yuan, B. Peng, and L. Wang, “Visual autoregressive modeling: Scalable image generation via next-scale prediction,” arXiv preprint arXiv:2404.02905, 2024.
[15]
P. Sun et al., “Autoregressive model beats diffusion: Llama for scalable image generation,” arXiv preprint arXiv:2406.06525, 2024.
[16]
D. Lee, C. Kim, S. Kim, M. Cho, and W.-S. Han, “Autoregressive image generation using residual quantization,” in CVPR, 2022.
[17]
T. Henighan et al., “Scaling laws for autoregressive generative modeling,” arXiv preprint arXiv:2010.14701, 2020.
[18]
C. Team, “Chameleon: Mixed-modal early-fusion foundation models,” arXiv preprint arXiv:2405.09818, 2024.
[19]
X. Wang et al., “Emu3: Next-token prediction is all you need,” arXiv preprint arXiv:2409.18869, 2024.
[20]
A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in NIPS, 2017.
[21]
L. Yu et al., “Language model beats diffusion: Tokenizer is key to visual generation,” in ICLR, 2024.
[22]
A. Vaswani et al., “Attention is all you need,” in NIPS, 2017.
[23]
D. Liu et al., “Lumina-mgpt: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining,” arXiv preprint arXiv:2408.02657, 2024.
[24]
Y. Leviathan, M. Kalman, and Y. Matias, “Fast inference from transformers via speculative decoding,” in ICML, 2023.
[25]
C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper, “Accelerating large language model decoding with speculative sampling,” arXiv preprint arXiv:2302.01318, 2023.
[26]
Y. Li, F. Wei, C. Zhang, and H. Zhang, “Eagle: Speculative sampling requires rethinking feature uncertainty,” arXiv preprint arXiv:2401.15077, 2024.
[27]
Y. Song, C. Meng, R. Liao, and S. Ermon, “Accelerating feedforward computation via parallel nonlinear equation solving,” in ICML, 2021.
[28]
S. Kou, L. Hu, Z. He, Z. Deng, and H. Zhang, “Cllms: Consistency large language models,” arXiv preprint arXiv:2403.00835, 2024.
[29]
H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman, “MaskGIT: Masked generative image transformer,” in CVPR, 2022.
[30]
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in CVPR, 2009.
[31]
K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
[32]
D. Kondratyuk et al., “Videopoet: A large language model for zero-shot video generation,” arXiv preprint arXiv:2312.14125, 2023.
[33]
W. Yan, Y. Zhang, P. Abbeel, and A. Srinivas, “VideoGPT: Video generation using VQ-VAE and transformers,” arXiv preprint arXiv:2104.10157, 2021.
[34]
Y. Wang et al., “Loong: Generating minute-level long videos with autoregressive language models,” arXiv preprint arXiv:2410.02757, 2024.
[35]
C. Wang, J. Zhang, and H. Chen, “Semi-autoregressive neural machine translation,” arXiv preprint arXiv:1808.08583, 2018.
[36]
M. Stern, N. Shazeer, and J. Uszkoreit, “Blockwise parallel decoding for deep autoregressive models,” Advances in Neural Information Processing Systems, vol. 31, 2018.
[37]
J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024.
[38]
OpenLM-Research, “OpenLLaMA 3B,” 2023. https://huggingface.co/openlm-research/open_llama_3b
[39]
A. Brock, J. Donahue, and K. Simonyan, “Large scale GAN training for high fidelity natural image synthesis,” arXiv preprint arXiv:1809.11096, 2018.
[40]
M. Kang et al., “Scaling up GANs for text-to-image synthesis,” in CVPR, 2023, pp. 10124–10134.
[41]
A. Sauer, K. Schwarz, and A. Geiger, “StyleGAN-XL: Scaling StyleGAN to large diverse datasets,” in ACM SIGGRAPH 2022 Conference Proceedings, 2022, pp. 1–10.
[42]
P. Dhariwal and A. Nichol, “Diffusion models beat GANs on image synthesis,” in NIPS, vol. 34, 2021, pp. 8780–8794.
[43]
J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans, “Cascaded diffusion models for high fidelity image generation,” The Journal of Machine Learning Research, vol. 23, no. 1, pp. 2249–2281, 2022.
[44]
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in CVPR, 2022, pp. 10684–10695.
[45]
W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in ICCV, 2023, pp. 4195–4205.
[46]
T. Li, Y. Tian, H. Li, M. Deng, and K. He, “Autoregressive image generation without vector quantization,” arXiv preprint arXiv:2406.11838, 2024.
[47]
J. Yu et al., “Vector-quantized image modeling with improved vqgan,” arXiv preprint arXiv:2110.04627, 2021.
[48]
M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” NIPS, vol. 30, 2017.
[49]
T. Kynkäänniemi, T. Karras, S. Laine, J. Lehtinen, and T. Aila, “Improved precision and recall metric for assessing generative models,” NIPS, vol. 32, 2019.
[50]
T. Unterthiner, S. Van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly, “Towards accurate generative models of video: A new metric & challenges,” arXiv preprint arXiv:1812.01717, 2018.
[51]
J. Ho and T. Salimans, “Classifier-free diffusion guidance,” in NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[52]
I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in ICLR, 2019.
[53]
M. I. Belghazi et al., “Mine: Mutual information neural estimation,” arXiv preprint arXiv:1801.04062, 2018.
[54]
X. Nguyen, M. J. Wainwright, and M. I. Jordan, “Estimating divergence functionals and the likelihood ratio by convex risk minimization,” IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5847–5861, 2010.
[55]
A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
[56]
A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational information bottleneck,” arXiv preprint arXiv:1612.00410, 2016.
[57]
S. Verdú, “α-mutual information,” in 2015 Information Theory and Applications Workshop (ITA), IEEE, 2015, pp. 1–6.
[58]
N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in 2015 IEEE Information Theory Workshop (ITW), IEEE, 2015, pp. 1–5.
[59]
E. T. Jaynes, Probability theory: The logic of science. Cambridge university press, 2003.
[60]
T. M. Cover, Elements of information theory. John Wiley & Sons, 1999.