October 23, 2025
Multi-bit quantization networks enable flexible deployment of deep neural networks by supporting multiple precision levels within a single model. However, existing approaches suffer from significant training overhead, as full-dataset updates are repeated for each supported bit-width, resulting in a cost that scales linearly with the number of precisions. Additionally, extra fine-tuning stages are often required to support additional or intermediate precision options, further compounding the overall training burden. To address this issue, we propose two techniques that greatly reduce the training overhead without compromising model utility: (i) Weight bias correction enables shared batch normalization and eliminates the need for fine-tuning by neutralizing quantization-induced bias across bit-widths and aligning activation distributions; and (ii) Bit-wise coreset sampling allows each child model to train on a compact, informative subset selected via gradient-based importance scores by exploiting the implicit knowledge transfer phenomenon. Experiments on CIFAR-10/100, TinyImageNet, and ImageNet-1K with both ResNet and ViT architectures demonstrate that our method achieves competitive or superior accuracy while reducing training time by up to 7.88×. Our code is released at this link.
With the explosion of highly capable yet computationally demanding deep learning models, quantization has emerged as an effective strategy for balancing performance and efficiency [1]–[4]. Despite its advantages, most existing quantization methods are optimized for a single fixed precision configuration, which limits their ability to adapt dynamically to changing resource availability and to deployment across platforms with diverse memory, compute, and power specifications. This has led to a line of recent work focused on training a single model capable of supporting multiple precisions [5]–[8], thereby enabling instant adaptation to varying resource budgets at runtime without the need for further training. In such multi-bit quantization networks, henceforth multi-bit networks, a single full-precision parent model generates multiple reduced-precision child models, thereby neutralizing the overhead of maintaining separate models for inference. By supporting multiple quantization precisions, referred to as the model’s switchable bit range, these networks enable adaptive deployment across a wide range of compute-constrained devices [9]–[12].
Although multi-bit networks provide flexibility during inference/deployment, this advantage comes at the cost of substantial training overhead, limiting their practical adoption. The prevalent multi-bit training approach, known as Any-Precision [13], jointly optimizes the model across a small subset of selected bit-widths, termed the trained bit range. While this approach is more efficient than training individual models across the switchable range (Dedicated), it still introduces considerable overhead due to the additional calibration required to enable inference at bit-widths outside the trained range (the calibrated bit range), as shown in Fig. 1. Specifically, calibration demands extensive computation over large amounts of training data to preserve the accuracy of models at untrained bit-widths in the calibrated range by aligning their mismatched activation distributions. We identify that these activation mismatches across different bit-width models stem from biases in the weight distribution induced by quantization. Based on this observation, we propose a novel bias correction technique that directly controls the shift and scaling biases in the quantized weights to align distributions across the entire switchable range of bit-widths. This alignment enables multiple bit-width sub-networks to share a common set of batch normalization (BN) parameters, effectively eliminating the need for costly post-training calibration.
Another major source of the significant training overhead is the use of the entire dataset for updating models in the trained bit range. Although coreset selection methods have been introduced to reduce the training overhead by identifying a subset of important data samples [14]–[17], these approaches have primarily targeted single-precision model training with fixed coresets. Extending this idea to a multi-bit quantization setting, we observe that each bit-width child model can benefit from training on distinct and smaller data subsets due to implicit gradient alignment across bit-widths. Leveraging this insight, we propose a bit-wise coreset sampling method that dynamically selects informative samples individually for each child model, based on the gradients computed per bit-width. Furthermore, since sample importance changes throughout training, we periodically re-sample these coresets to reflect evolving model dynamics. The proposed sampling approach effectively reduces per-epoch computational costs while preserving strong performance through implicit cross-bit-width knowledge transfer, a phenomenon we discover for the first time.
To summarize, our contributions are as follows:
Weight bias correction for activation alignment: We correct quantization-induced biases in the weight space instead of the activation space, enabling multiple child models to share normalization parameters, and in turn eliminating the need for an extra training stage.
Bit-wise coreset sampling: We propose a novel per-bit-width coreset sampling strategy that computes bit-wise importance scores using gradient-based methods, thereby reducing training redundancy in multi-bit quantization networks.
Extensive empirical validation: We demonstrate that our method consistently improves or maintains accuracy while significantly reducing training cost across diverse datasets (e.g., CIFAR-10, CIFAR-100, TinyImageNet, and ImageNet-1K) and architectures (e.g., ResNet, DeiT, Swin). Our method achieves up to 7.88× GPU hour reduction without sacrificing model utility, validating the scalability and generality of our approach.
Multi-bit quantization networks.
Unlike traditional quantized networks that are optimized for a single reduced numerical precision, multi-bit quantization networks [5], [7], [8], [13], [18], [19] are capable of supporting multiple quantization precisions, enabling adaptive and versatile inference/deployment across a wide spectrum of compute-constrained devices [9]–[12]. The mainstream approach to training these networks involves optimizing the model for multiple precisions simultaneously, typically by minimizing the sum of loss functions corresponding to each bit-width. Formally, this is stated as shown below: \[\label{eq:objective} \min_{\theta} \sum_{(\mathbf{x},y) \in \mathcal{S}} \sum_{b \in \mathcal{B}} \;\; \mathcal{L}(\mathbf{x}, \, y, \, Q(\theta, b)),\tag{1}\] where \(\theta \in \mathbb{R}^d\) denotes the learnable model parameters, which are shared across all precisions; \(\mathcal{L}(\mathbf{x}, y, Q(\theta, b))\) is the loss on training sample \((\mathbf{x}, y)\) in training set \(\mathcal{S}\); \(Q(\theta, b) \in \mathbb{Z}^d\) is the quantized version of \(\theta\) at \(b\)-bit precision; and \(\mathcal{B}\), referred to as the trained range, represents the set of all trained bit-widths. To perform this optimization in practice, the batch-wise training scheme, which interleaves parameter updates across child models of different bit-widths within each batch (see Algorithm 2), is commonly adopted to promote generalization across the entire trained range.
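To make the batch-wise scheme concrete, below is a minimal PyTorch-style sketch of one interleaved update over the trained range. The helper `quantize(model, b)`, assumed here to return the \(b\)-bit child model sharing parameters \(\theta\) with the parent, is illustrative rather than part of the released code.

```python
import torch.nn.functional as F

def batchwise_multibit_step(model, quantize, optimizer, x, y, bit_widths):
    """One interleaved (batch-wise) update of Eq. (1): losses of all child
    models are backpropagated onto the shared parameters before a single
    optimizer step. `quantize(model, b)` is an assumed helper returning the
    b-bit child model that shares parameters with `model`."""
    optimizer.zero_grad()
    total_loss = 0.0
    for b in bit_widths:                      # e.g., [1, 2, 4, 8, 32]
        child = quantize(model, b)            # Q(theta, b), shares theta
        loss = F.cross_entropy(child(x), y)
        loss.backward()                       # gradients accumulate on theta
        total_loss += loss.item()
    optimizer.step()                          # one shared-parameter update
    return total_loss / len(bit_widths)
```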
Training overhead of multi-bit networks. While multi-bit training is generally more efficient than training multiple single-precision networks independently, it still incurs significant computational overhead—particularly as the trained range \(\mathcal{B}\) expands. To mitigate this cost, recent approaches minimize the number of bit-widths included in the trained range and instead introduce a calibrated bit range to expand precision support. Specifically, the model is first trained on a small set of bit-widths (the trained range), after which a large portion of its parameters are frozen. The remaining parameters are then calibrated or lightly fine-tuned to support additional bit-widths (the calibrated range). The union of the trained and calibrated ranges defines the model’s switchable range, \(\mathcal{R}\)—i.e., the full set of bit-widths supported by the multi-bit network.
Challenges in training multi-bit networks. One major challenge in multi-bit network training research is the accuracy degradation due to activation distribution mismatch between different bit-widths. To address this mismatch, Any-Precision [13], along with CoQuant [18] and MBQuant [8], leverages the ‘switchable batch normalization’ approach first proposed by [20]. While effective, assigning separate batch normalization layers to each bit-width incurs additional overhead: obtaining these parameters for unseen bit-widths typically requires an extra training phase as shown in Fig. 3. Some recent works [5], [7], [21] avoid this mechanism altogether, but often suffer from degraded performance at lower bit-widths or resort to computationally expensive strategies to reduce interference between the training objectives of different child models. In our proposed method, rather than relying on multiple batch normalization layers or other costly techniques, we correct the weight distribution directly before and after quantization. Our key insight is that quantization introduces bit-width-specific shifts and scaling in the weight distribution, and aligning these distributions helps reduce conflicts across child models during training.
Coreset selection. Coreset selection—also known as dataset pruning—aims to reduce model training cost by selecting a small yet representative subset of the training data, while preserving model utility. The core challenge lies in accurately identifying the most informative samples. Feature-space based methods select subsets that preserve the geometry of the data distribution; for example, Herding [22] and Moderate [15] select data points based on their distances in feature space. Uncertainty-based methods prioritize ambiguous or hard-to-classify samples; for example, Entropy [23] and Cal [24] select samples near decision boundaries. Gradient-based methods leverage training loss gradients. GraND/EL2N [14] rank samples by their gradient magnitude (or prediction error), while Craig [25] and GradMatch [26] select subsets that best match/mimic the full dataset’s gradient signals. Training-dynamics based methods consider samples’ behavior over many epochs. Forgetting [16] counts how often a sample is forgotten during training, and AUM [27] averages the confidence gap across all epochs. Finally, hybrid approaches fuse multiple criteria: TDDS [17] integrates gradient information with training dynamics by measuring each sample’s variability in its epoch-wise contribution to the overall training gradient.
Shortcomings of existing coreset selection research. Despite this breadth of approaches, most coreset selection methods are investigated under the assumption of fixed, full-precision floating-point models, and their applicability to quantized neural networks remains largely unexplored. In particular, the integration of coreset selection into quantization-aware training (QAT) has received little attention, let alone its extension to the more complex setting of multi-bit quantization, where cross-bit interactions can significantly affect saliency estimation and the underlying training dynamics.
Activation distribution mismatch in multi-bit networks. As discussed in Section [related-works-ofa], multi-bit networks often suffer from mismatched activation distributions across bit-widths. To isolate the source of this mismatch, we decompose the post-convolutional activation into two components: the quantized input activation and the quantized weight. To simplify the analysis, we fix the input activation to a specific precision (e.g., 4-bit). Under this setup, any observed variation in the output can be attributed solely to the quantized weights, thereby reducing the problem to a single source of quantization noise.
In Fig. 4 (a), we visualize the post-convolutional activations of ResNet-50 for bit-widths \(b \in \mathcal{B} = \{2, 3, 5\}\) using a batch of ImageNet examples. Although the input activation is fixed, the output distributions vary noticeably across bit-widths. This indicates that the differences can be attributed to quantization-induced bias in the weights. Fig. 4 (b) supports this explanation by showing that the quantized weights exhibit clear scale distortions compared to the original weights. This is consistent with the prior observations of [2], [28], which highlight the presence of systematic bias introduced during quantization.
Many multi-bit networks [8], [13], [18] address this activation mismatch problem by training separate BN parameters for each bit-width to independently correct activation distributions. As illustrated in Fig. 4 (a), this approach successfully aligns BN outputs across different precisions. While effective, aligning output activations typically requires access to the training data and additional forward/backward passes, which incurs additional training overhead.
Bias correction for quantized weights. Instead of rectifying the output activations, we address the bias at its source by aligning the quantized weights prior to convolution. By doing so, we can match activation outputs across bit-widths simply by correcting the weights during the initial training stage, without having to explicitly match the activations themselves. As a result, BN layers can be shared across all bit-widths, as shown in Fig. 3, avoiding the additional overhead of calibrating separate BN layers. It is important to note that this correction is performed under a fixed activation bit-width (e.g., 4 bits), meaning that aligning the weights directly translates to more consistent activation outputs across different bit-widths. Specifically, we adjust the quantized weight vector \(\mathbf{w}_q\) with respect to its full-precision counterpart \(\mathbf{w}\), and compute the corrected weights \({\mathbf{w}_q}'\) as follows: \[{\mathbf{w}_q}' = \sqrt{\frac{\mathbb{V}[\mathbf{w}]}{\mathbb{V}[\mathbf{w}_q]}} (\mathbf{w}_q + (\mathbb{E}[\mathbf{w}] - \mathbb{E}[\mathbf{w}_q])),\] where \(\mathbb{E}[\cdot]\) and \(\mathbb{V}[\cdot]\) denote the expectation and variance, respectively. This weight alignment enables multiple child models to share a single set of BN parameters with minimal interference. To compensate for residual discrepancies not fully addressed by bias correction, we additionally apply BN adaptation [29]. While adjusting running statistics has proven effective in fixed-precision quantized networks [30], its use in multi-bit networks [5] remains limited, and prior work is often unclear about when it should be applied (e.g., it is applied at every epoch in [5], which is unnecessary). Applying BN adaptation once at the final training stage, as shown in Fig. 3, is sufficient to correct the running mean and variance for each bit-width, achieving optimal performance without additional overhead.
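A minimal PyTorch sketch of the correction above is given below; the per-tensor reduction for \(\mathbb{E}[\cdot]\) and \(\mathbb{V}[\cdot]\) and the small epsilon are assumptions, since the granularity of the statistics is not specified here.

```python
import torch

def bias_correct(w_q: torch.Tensor, w: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Shift and rescale the quantized weights w_q so that their mean and
    variance match those of the full-precision weights w (per-tensor here;
    a per-channel variant would reduce over all dims except the output one)."""
    shift = w.mean() - w_q.mean()                                   # E[w] - E[w_q]
    scale = torch.sqrt(w.var(unbiased=False) / (w_q.var(unbiased=False) + eps))
    return scale * (w_q + shift)
```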
To translate the benefits of coreset selection into multi-bit quantization networks, we propose two techniques tailored to this setting: (i) a coreset sampling strategy that accounts for variations in sample importance across bit-widths and training epochs; and (ii) a bit-wise training scheme for accurate per-bit-width importance score evaluation. Together, these techniques enable more efficient and adaptive training across a range of quantization levels while maintaining strong model utility.
The central idea behind our coreset sampling method is to dynamically redraw training subsets along two axes: bit-width and training time. Rather than using a static, global coreset, we select bit-wise coresets that evolve throughout training via sampling as shown in Fig. 3. This design is motivated by two key observations: (i) gradient alignment across bit-widths, and (ii) temporal drift in sample importance.
Observation 1 – Gradient alignment across bit-widths. We find that gradients computed from different bit-widths using the same data sample are highly aligned. In Fig. 5, we visualize the angles between the gradients of the 8-bit and 2-bit child models across several layers of ResNet-20 at various training epochs. The angle between the two gradients stays consistently below 28\(^{\circ}\), with alignment improving in deeper layers. This implies that parameter updates based on 2-bit gradients positively influence the 8-bit child model (and those of other precisions in the trained range), and vice versa. We refer to this phenomenon as cross-bit-width implicit knowledge transfer, where shared parameters act as conduits for the transfer of learning signals between child models.
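The angle in Fig. 5 can be estimated as sketched below; the `quantize` helper and the example parameter name are hypothetical, and the measurement assumes both child models share the parent's parameters.

```python
import math
import torch
import torch.nn.functional as F

def grad_angle(model, quantize, x, y, b_hi=8, b_lo=2,
               param_name="layer3.0.conv1.weight"):
    """Angle (degrees) between the gradients of two bit-width child models
    w.r.t. the same shared parameter, computed on the same mini-batch."""
    grads = []
    params = dict(model.named_parameters())
    for b in (b_hi, b_lo):
        model.zero_grad()
        loss = F.cross_entropy(quantize(model, b)(x), y)
        loss.backward()
        grads.append(params[param_name].grad.detach().flatten().clone())
    cos = F.cosine_similarity(grads[0], grads[1], dim=0).clamp(-1.0, 1.0)
    return math.degrees(torch.acos(cos).item())
```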
This observation leads to a key insight: it is unnecessary to feed the same sample to all bit-widths during training. Accordingly, we construct bit-wise coresets—separate subsets tailored for each bit-width. These bit-wise coresets not only reflect the variation in sample importance across bit-widths but also exploit the implicit gradient transfer phenomena to benefit from a collective learning signal without accessing the full dataset. Therefore, this design significantly reduces per-epoch computational cost while preserving strong performance across the trained range.
Observation 2 – Temporal drift in sample importance. We also observe that a sample’s importance evolves as training progresses. More specifically, samples influential in early epochs often become less relevant in later stages, where the model nears convergence and the loss landscape flattens. In Figure [fig:rank_diff], we visualize this effect using the Spearman correlation of TDDS-based importance scores measured at different training stages of ResNet-18 trained on the CIFAR-100 dataset. Correlations between early- and late-stage scores may drop as low as 0.54, indicating substantial shifts in sample influence over time.
To account for the temporal drift in sample importance, we periodically re-sample each bit-wise coreset throughout training. Although high-score samples may be informative in the early stages, they often lose relevance as the model’s learning dynamics evolve. Continually refreshing the coresets helps prevent overfitting to outdated importance estimates, especially critical when the true sample importance landscape is dynamic and only partially observable.
Sampling method. To construct the bit-wise coresets via sampling, we first convert the sample importance scores into sampling probabilities by applying min-max normalization. We then further shape the sampling probabilities using temperature-based sampling [31]–[33], which simultaneously reduces overfitting to high-scoring samples and promotes diversity, effectively balancing exploitation and exploration. The sampling probability \(p_i^{(b)}\) for sample \(i\) at bit-width \(b\) is defined as: \[p_i^{(b)}(\tau) = \frac{\bigl(s_i^{(b)}\bigr)^{1/\tau}}{\sum_{\forall j} \bigl(s_j^{(b)}\bigr)^{1/\tau}},\] where \(s_i^{(b)}\) denotes the min-max normalized importance score for sample \(i\) at bit-width \(b\); and \(\tau > 0\) denotes the temperature parameter. Note that importance scores are computed once prior to coreset sampling and remain fixed throughout training.
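A sketch of this sampling step for a single bit-width is given below, assuming sampling without replacement and a small epsilon for numerical stability.

```python
import torch

def sample_bitwise_coreset(scores: torch.Tensor, coreset_size: int, tau: float = 0.5):
    """Draw one bit-width's coreset from temperature-shaped probabilities.
    `scores` holds the raw importance scores of all training samples."""
    s = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)   # min-max normalize
    p = (s + 1e-12) ** (1.0 / tau)                                        # temperature shaping
    p = p / p.sum()                                                       # p_i^(b)(tau)
    return torch.multinomial(p, coreset_size, replacement=False)          # selected sample indices
```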
Extracting accurate, bit-wise importance scores is particularly challenging in the context of training dynamics-based coreset selection methods. These mainstream approaches estimate sample importance over multiple training epochs to capture the intricate training dynamics and improve score reliability (see Section [related-works-ofa] for an overview). A representative example is TDDS [17], which accumulates intermediate gradient signals—referred to as context vectors—throughout training to capture the evolving contribution of each sample. While effective in single-precision settings, we find that applying such methods directly under the standard batch-wise training scheme (Algorithm 2) fails to produce meaningful bit-wise importance estimates. The core issue lies in the interleaved update pattern: gradients from all bit-widths are aggregated before a shared parameter update, resulting in a single unified context vector that masks the distinct training dynamics of each child model.
To address this issue, we introduce a bit-wise training scheme for score evaluation, as shown in Algorithm 6. In this setup, each child model corresponding to a trained bit-width is trained on the entire dataset before proceeding to the next bit-width. This scheduling isolates the gradient updates for each precision, enabling the computation of distinct context vectors and more accurate, bit-wise importance scores.
It is important to note that the proposed bit-wise training scheme is used exclusively for importance score extraction. Once the scores are computed, we revert to the standard batch-wise training scheme for actual multi-bit network training. This hybrid approach allows coreset construction to benefit from accurate, bit-wise decoupled importance evaluation while preserving the generalization advantages of batch-wise training.
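The score-extraction scheme can be summarized by the following sketch, where `quantize` and the TDDS-style `scorer` object (with hypothetical `accumulate`/`finalize` methods) stand in for components described elsewhere in the paper.

```python
import torch.nn.functional as F

def extract_bitwise_scores(model, quantize, loader, optimizer, bit_widths,
                           scorer, num_epochs):
    """Bit-wise scheme used only for score extraction: each child model sweeps
    the full dataset before the next bit-width is visited, so per-bit-width
    context vectors never mix."""
    scores = {}
    for b in bit_widths:
        for _ in range(num_epochs):
            for idx, x, y in loader:              # loader also yields sample indices
                optimizer.zero_grad()
                loss = F.cross_entropy(quantize(model, b)(x), y)
                loss.backward()
                scorer.accumulate(b, idx, model)  # per-sample, per-bit-width signals
                optimizer.step()
        scores[b] = scorer.finalize(b)            # one importance vector per bit-width
    return scores
```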
Evaluation metrics and baselines. We evaluate our method in terms of per-bit-width accuracy and total GPU hours. Comparisons are made against: (1) the dedicated framework, (2) the standard multi-bit framework (e.g., Any-Precision [13]), and (3) our method, which augments the standard framework with Bias Correction (Ours). Within Ours, we evaluate our proposed coreset sampling strategy, which uses bit-wise scores, against six baseline coreset selection methods: Random, Entropy [23], Forgetting [16], EL2N [14], Moderate [15], and TDDS [17]. Since most existing coreset selection techniques are designed for dedicated training, we adapt each baseline to our multi-bit framework to ensure a fair comparison.
Datasets and networks. We evaluate our method on four canonical datasets—CIFAR-10, CIFAR-100 [34], TinyImageNet [35], and ImageNet-1K [36]—with a diverse set of networks. These include three ResNet models: PreActResNet-20 [37], ResNet-18, and ResNet-50 [38], as well as three Vision Transformers (ViTs): DeiT-T, DeiT-S [39], and Swin-T [40].
Implementation details. All experiments are conducted on a single NVIDIA A100 GPU. Each experimental setting is as follows: (i) Dedicated trains the model with a single weight and activation bit-width. (ii) Any-Precision uses a training range of \(\mathcal{B} = \{1, 2, 4, 8, 32\}\) for ResNet models (and \(\mathcal{B} = \{2, 8\}\) for ViTs). After training, independent parameters for the remaining bit-widths are calibrated for approximately one-third of the training epochs to ensure convergence. (iii) Bias Correction adopts the same training range as (ii) but skips the calibration phase and instead performs BN adaptation. Specifically, we assign a separate BN layer for 1-bit and share BN layers for all other bit-widths. Although inference for the calibrated range is supported and achieves accuracy comparable to the trained range in both (ii) and (iii), those results are omitted here due to the page limit and are provided in the Appendix.
ResNet on CIFAR-10 and 100. Table [tab:rn_cf10_cf100] presents the results on PreActResNet-20 for CIFAR-10 and ResNet-18 for CIFAR-100, highlighting both performance and training time reduction achieved by our method. Compared to existing baselines, Ours achieves competitive performance with reduced training time by eliminating the calibration phase, while coreset sampling further improves efficiency by reducing data usage without compromising accuracy. Additional results for CIFAR-10 with PreActResNet-20, including comparisons between our coreset sampling method and six baselines (see Section [lab:base]), are presented in Table [tab:rn_cf10_cf100_method]. Random selection has been shown to excel at high pruning rates in prior studies [17], and we observe the same trend in our experimental setup. Our method shows consistent improvements in both accuracy and efficiency across bit-widths. Figure 7 shows the trade-off between training cost and accuracy by plotting average accuracy against pruning rate. Our method consistently outperforms all baselines across the entire pruning spectrum and maintains high accuracy even at a 90% pruning rate.
ResNet on ImageNet-1K. We further demonstrate the effectiveness of our method on a larger dataset, ImageNet-1K. Table [tab:rn_imnet] and Table [tab:rn_imnet_method] summarize our results for ResNet-50 on ImageNet-1K with respect to the baseline and previous methods. For these experiments, we fine-tune from a pretrained Any-Precision model; specific implementation details are provided in the Appendix. Compared to the dedicated training setting, our method substantially reduces training time by \(5.75\times\) with minimal impact on performance.
ViTs on ImageNet-1K. We also evaluate our method on larger transformer-based models to demonstrate its generality and scalability across architectures. The results for three different ViTs (i.e., DeiT-T, DeiT-S, and Swin-T) are summarized in Table [tab:vit_imnet]. To the best of our knowledge, there is not yet a standard multi-bit framework such as Any-Precision for vision transformers, so we implement our own framework with configurations similar to Any-Precision. We compare our method at two dataset pruning ratios, 50% and 60%. On this larger dataset, our method yields an even greater reduction in training time (up to 18.22 GPU hours for DeiT-T) while maintaining accuracy consistent with the baselines.
Effect of Bias Correction and BN Adaptation. As shown in Table [tab:biascorr_bnadapt], we conduct an ablation study to quantify the individual and combined effects of Bias Correction and BN adaptation in the final training stage. The results show that both components play distinct yet complementary roles in achieving stable alignment and high accuracy across bit-widths. Bias Correction primarily compensates for systematic deviations introduced during quantization, restoring the representational balance in the weight space. However, since it does not modify the batch normalization statistics used at inference, it alone cannot fully align the activation distributions. BN adaptation, applied at the final stage, addresses this limitation by recalibrating the running mean and variance through a small number of forward passes, thereby aligning post-quantization activations with their floating-point counterparts. Together, these two procedures act on different levels of the model—weight and activation—resulting in consistent improvements across all bit-widths. Quantitatively, the combination yields the highest accuracy, confirming that the final BN adaptation provides additional activation alignment beyond what bias correction alone can achieve. These findings clarify the respective contributions of both components and highlight the importance of performing BN adaptation at the last training stage for precise activation calibration in quantized models.
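A minimal sketch of this final-stage BN adaptation is shown below, assuming a standard PyTorch model whose BN layers update running statistics in training mode; the number of adaptation batches is an illustrative choice.

```python
import torch

@torch.no_grad()
def adapt_bn_stats(child_model, loader, num_batches=100):
    """Final-stage BN adaptation: refresh the running mean/variance of one
    bit-width child model with a few forward passes; no gradient updates."""
    child_model.train()              # BN layers update running stats in train mode
    for i, (x, _) in enumerate(loader):
        if i >= num_batches:
            break
        child_model(x)               # forward only; statistics are updated in-place
    child_model.eval()
    return child_model
```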
Effect of bit-wise schedule for score extraction. To quantify the benefit of isolating per-bit-width dynamics, we perform a fixed-coreset ablation under the multi-bit framework (i.e., Any-Precision [13]) with the Bias Correction setting. We select the dataset once, either by (i) the conventional batch-wise TDDS scores [17] (Algorithm 2) or by (ii) our bit-wise scores (Algorithm 6). We then train the full multi-bit schedule on these reduced sets. As reported in Table [tab:batch_major], bit-wise scoring yields higher accuracy at every pruning ratio on both datasets. This improvement stems from our bit-wise extraction design (Algorithm 6), which enables the collection of separate intermediate gradients per bit-width at each epoch, allowing us to compute distinct context vectors for every sub-model.
In this work, we introduce two techniques to reduce the training overhead of multi-bit quantization networks. First, we correct quantization-induced bias in the weight space, removing the need for an additional training stage. Second, we design a bit-wise coreset sampling strategy that leverages implicit knowledge transfer, allowing each child model to train on a compact, informative subset selected via gradient-based importance scores. Our approach preserves model utility while reducing training costs across various architectures such as ResNets and ViTs, offering a scalable solution for efficient multi-bit quantization training. By enabling more efficient multi-precision learning, our method contributes to the broader goal of sustainable and energy-efficient AI, helping make high-performance models more accessible, affordable, and ubiquitous to everyone.
Despite strong empirical results, our evaluations are limited to computer vision tasks due to the high computational cost of training multi-bit networks—a challenge shared by every prior work in this space. By significantly reducing this overhead, our method paves the way for applying multi-bit quantization to more demanding applications such as generative AI and large-scale language tasks. Extending our approach to these domains will be the focus of our future work, advancing the broader applicability and impact of multi-bit quantization networks across diverse tasks.
We thank the anonymous reviewers for their constructive comments. This work was partly supported by the National Research Foundation of Korea (NRF) grant (RS-2024-00345732, RS-2025-02216217); the Institute for Information & communications Technology Planning & Evaluation (IITP) grants (RS-2020-II201821, RS2019-II190421, RS-2021-II212052, RS-2021-II212068, RS2025-02217613, RS-2025-10692981, RS-2025-25442569); the Technology Innovation Program (RS-2023-00235718, 23040-15FC) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea) grant (1415187505).
Quantization function. For CNN experiments, we quantize the weights and activations following the baseline settings of Any-Precision [13], where the activation bit-width is fixed at 4 bits. Any-Precision adopts the DoReFa [41] quantizer in its framework. The quantization process is as follows: \[\begin{align} & s = \mathbb{E}(|\mathbf{W}|) \\ & \mathbf{W'}= \frac{\tanh{(\mathbf{W})}}{2\max(|\tanh{(\mathbf{W})}|)}+0.5 \\ & \mathbf{\bar{Q}}_n= \lceil M_n \mathbf{W'}\rfloor \\ & \mathbf{Q}_n = s (2\mathbf{\bar{Q}}_n/M_n-1) \end{align}\] where \(\mathbf{W'}\) is the normalized weight tensor; \(\mathbf{\bar{Q}}_n\) is the quantized bin; and \(\mathbf{Q}_n\) is the quantized floating-point weight. \(M_n = 2^n-1\) is the maximum bin value with respect to the target bit-width \(n\).
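A PyTorch sketch of these steps is given below (the straight-through estimator used during backpropagation is omitted for brevity):

```python
import torch

def dorefa_weight_quant(w: torch.Tensor, n_bits: int) -> torch.Tensor:
    """DoReFa-style weight quantization following the equations above."""
    s = w.abs().mean()                                               # s = E(|W|)
    w_norm = torch.tanh(w) / (2 * torch.tanh(w).abs().max()) + 0.5   # W' in [0, 1]
    m = 2 ** n_bits - 1                                              # M_n
    q_bins = torch.round(m * w_norm)                                 # quantized bins
    return s * (2 * q_bins / m - 1)                                  # Q_n in float range
```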
For ViT experiments, we adopt the StatsQ quantizer proposed in [42] for weight quantization. Activation quantization is performed using the LSQ quantizer [43], fixed at 8 bits. StatsQ applies uniform quantization, using the channel-wise (or head-wise) mean of the absolute weight values as the scaling factor. The quantization process is defined as follows: \[\begin{align} & s = 2 \mathbb{E}_{\mathcal{A}}(|\mathbf{W}|) \\ & \mathbf{W'} = \mathrm{clip}(\mathbf{W} / s, -1, 1) \\ & \mathbf{W''} = \frac{\mathbf{W'}}{2 \max(|\mathbf{W'}|)} + 0.5 \\ & \mathbf{\bar{Q}}_n = \lceil M_n \mathbf{W''} \rfloor \end{align}\] where the remaining steps for mapping the quantized values back to the floating-point range follow the same procedure as in DoReFa. Here, the expectation in the scaling factor \(s\) is taken over dimension(s) \(\mathcal{A}\), which depend on the shape of \(\mathbf{W}\): for 2D weights, \(\mathcal{A}\) corresponds to the column dimension; for 3D weights, it includes both the first and last dimensions.
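A corresponding sketch follows, where the reduction dimension(s) passed as `dim` stand in for \(\mathcal{A}\) and are an assumption:

```python
import torch

def statsq_weight_quant(w: torch.Tensor, n_bits: int, dim=0) -> torch.Tensor:
    """StatsQ-style weight quantization; `dim` plays the role of A."""
    s = 2 * w.abs().mean(dim=dim, keepdim=True)           # s = 2 E_A(|W|)
    w_c = torch.clamp(w / s, -1.0, 1.0)                   # W'
    w_norm = w_c / (2 * w_c.abs().max()) + 0.5            # W'' in [0, 1]
    m = 2 ** n_bits - 1                                    # M_n
    q_bins = torch.round(m * w_norm)                       # quantized bins
    return s * (2 * q_bins / m - 1)                        # map back as in DoReFa
```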
Implementation of Any-Precision ViTs. To the best of our knowledge, there is currently no standardized multi-bit framework like Any-Precision [13] available for vision transformers (ViTs). To address this, we develop our own framework, adopting configurations similar to those used in Any-Precision. Our implementation is based on the QAT framework from the OFQ [42] codebase, with minor modifications to the quantizer to match the quantization function described above and without the query-key reparameterization proposed in [42].
Configurations of baseline coreset selection methods. To evaluate the effectiveness of our method, we compare it against several existing coreset selection methods. While prior approaches typically focus on selecting a single subset optimized for a single model or configuration, our method targets bit-wise unique subsets tailored for multi-bit training. Accordingly, we apply each baseline’s scoring strategy independently to each sub-model, constructing distinct coresets, and train each sub-model on its respective subset. For CIFAR-10 and CIFAR-100, we train for 200 epochs; for ImageNet-1K, the number of epochs varies depending on the model. For ImageNet-1K experiments, we initialize from pretrained models: ResNet models use the Any-Precision weights [13], while ViTs are initialized from the 8-bit quantized checkpoints provided by PTQ4ViT [44]. The specific configurations for each coreset selection method are as follows:
Entropy [23], EL2N [14]: Scores are computed over 20 epochs for CIFAR-10/100 and 5 epochs for ImageNet-1K. Entropy identifies samples near decision boundaries, while EL2N ranks samples based on the magnitude of their gradients.
Forgetting [16]: Scores are computed over 200 epochs for CIFAR-10/100 and 10 epochs for ImageNet-1K. A forgetting score is defined by the number of times a sample is misclassified after being learned correctly, capturing how often a sample is “forgotten” during training.
Moderate [15]: Scores are computed using features extracted by a model pre-trained for 200 epochs on CIFAR-10/100 and 70 epochs on ImageNet-1K, serving as an importance proxy. Moderate quantifies importance by measuring distances between samples in the resulting feature space.
TDDS [17]: Scores are computed over 10 epochs for CIFAR-10, 20 epochs for CIFAR-100, and 5 epochs for ImageNet-1K. TDDS combines gradient information with training dynamics to estimate sample importance.
Bit-wise Unique Scores (Ours): Our method computes per-bit-width importance scores over 10 epochs for CIFAR-10, 20 epochs for CIFAR-100, and 5 epochs for ImageNet-1K, enabling dynamic coreset selection tailored to each sub-network.
Comparison of score uniqueness in bit-wise vs. batch-wise training scheme.
To analyze the effect of the proposed bit-wise training scheme, we examine the uniqueness of the importance scores it produces for each bit-width sub-model, compared to those from the standard batch-wise training scheme. Figure 8 shows a heatmap of Spearman correlation values, which quantify the similarity between importance scores produced by different bit-width sub-models. On both CIFAR-10 and CIFAR-100, the bit-wise scheme results in consistently lower correlation values, indicating that it yields more distinct and representative importance scores for each bit-width. This suggests that separating the training dynamics by bit-width is essential for accurately capturing per-sample importance in multi-bit settings, as it avoids the gradient aggregation seen in batch-wise training and allows each sub-model to maintain distinct context vectors. This isolation is critical for deriving meaningful, bit-wise importance estimates, which are otherwise masked under the standard batch-wise update pattern.
Breakdown of GPU hours. We report GPU hours for all ResNet experiments, broken down by training stage: score evaluation, coreset training, calibration (if applicable), and adaptation. Table [tab:gpu_breakdown] also helps clarify the computational cost structure of our framework. Specifically, coreset sampling significantly reduces training GPU hours for multi-bit quantization models, and this efficiency gain increases with the pruning rate. However, coreset sampling alone does not eliminate the cost of the calibration phase, which is typically needed to support additional bit-widths. To address this, we apply Bias Correction and BN adaptation, which allow us to remove the calibration step entirely without sacrificing accuracy.
Experiments on alignment achieved by Bias Correction. To demonstrate the effectiveness of our Bias Correction method, we analyze the output activations, comparing how well activations from different bit-widths align with one another. In all experiments, the activation precision is fixed to 4 bits, and we measure the mean absolute error (MAE) between the activations of quantized and full-precision models, with and without applying Bias Correction. Table [tab:bc_alignment] shows that applying the correction consistently improves alignment, reducing the average MAE from 0.755 to 0.667. This improvement holds across all bit-widths, demonstrating our method’s effectiveness.
Experiments on effectiveness of BN sharing for 1-bit quantization. In multi-bit settings, weight distributions tend to approximate a normal distribution, where our Bias Correction mechanism works effectively. However, in the 1-bit case, the weight distribution collapses into a near-uniform or binary form, causing a significant distribution shift that Bias Correction alone struggles to adequately address. For that reason, prior works on multi-bit networks often omit the 1-bit setting from their analysis [5], [8], [18]. In contrast, our method explicitly incorporates this case and demonstrates that assigning a separate BN layer for 1-bit quantization is both a simple and effective solution.
To further support this, we conduct ablation studies comparing two configurations: one where BN is shared across all bit-widths (including 1-bit), and another where the 1-bit case has its own BN while the remaining bit-widths share a single BN. Table [tab:1bit_bnshare] clearly shows that including 1-bit in BN sharing significantly degrades 1-bit quantization performance. On the other hand, using a separate BN for 1-bit achieves strong performance with negligible additional overhead: the extra parameters of the separate BN layer account for less than 0.01% of the total parameter count. The performance degradation stems from unstable BN statistics: 1-bit weights produce activation distributions that are markedly different from those of higher bit-widths, leading to significant fluctuations in shared BN statistics. These fluctuations negatively impact the normalization of other bit-widths, ultimately harming overall model performance.
Experiments on impact of coreset sampling alone. To examine the standalone contribution of coreset sampling to both accuracy and training speedup, we provide results on a CIFAR-10 baseline where only coreset sampling is applied, with all bit-widths except 1-bit sharing BN layers. As shown in Table [tab:coreset_impact], while coreset sampling contributes most to the speedup, it is not sufficient on its own to maintain strong accuracy across every bit-width. In contrast, when Bias Correction and BN adaptation are applied alongside coreset sampling, we observe accuracy improvements across all bit-widths, with practically no additional GPU hours. The results indicate that although coreset sampling is the main driver of compute efficiency, Bias Correction and BN adaptation are essential for the best accuracy.
Experiments with varying fixed temperature settings and scheduling methods. Table [tab:cf10_temp] presents the test accuracies of our method, evaluated in the Any-Precision setting, trained across bit-widths \(b\in\{1,2,4,8,32\}\) using fixed sampling temperatures \(\tau\in\{0.1,0.5,1.0\}\) and three temperature scheduling strategies (i.e., linear, exponential, and logarithmic). Among the fixed settings, a moderate temperature of \(\tau=0.5\) consistently achieves the highest accuracy, outperforming both the low (\(\tau=0.1\)) and high (\(\tau=1.0\)) extremes. Dynamically increasing the temperature from 0.1 to 1.0 over training, regardless of the scheduling scheme, yields performance that is comparable to or worse than using a fixed \(\tau=0.5\). These results suggest that a single, well-chosen temperature is sufficient to balance the sampling distribution, favoring informative samples while maintaining diversity. In contrast, dynamically adjusting the temperature throughout training introduces additional complexity without delivering clear performance benefits. Based on these observations, all coreset sampling experiments were conducted with the temperature fixed at \(\tau = 0.5\).
Evaluation of calibrated bit-widths. In this section, we present the performance of calibrated bit-widths, which were omitted from the experimental results in the main paper. Although we refer to these as calibrated bit-widths, it is important to clarify that, in our method, these bit-widths are not explicitly trained or fine-tuned. Instead, we obtain their accuracy using bias correction and batch normalization (BN) adaptation, without additional training or calibration stages. In contrast, Any-Precision [13] recovers calibrated bit-width performance by performing a separate post-training BN calibration procedure. As shown in Table [tab:cf10_cf100_calib] and Table [tab:cf10_cf100_method_calib], the calibrated bit-widths in our method achieve accuracy comparable to trained bit-widths, confirming that our proposed weight bias correction effectively aligns activation distributions without the need for costly calibration.
Evaluation against baseline coreset selection methods on CIFAR-100. Table [tab:cf100_method] presents additional experimental results on CIFAR-100 using the PreActResNet-18 architecture, comparing our bit-wise coreset sampling method against several baseline coreset selection strategies. Consistent with prior findings in TDDS [17] and our CIFAR-10 experiments reported in the main paper, we observe that random coreset selection performs surprisingly well at high pruning rates. This trend persists in the CIFAR-100 setting, where random sampling remains a strong baseline under severe data reduction. Nonetheless, our proposed method consistently outperforms the baselines across different bit-widths, demonstrating its effectiveness in selecting informative samples even under high pruning constraints.
Evaluation of DeiT-S on CIFAR-100 and TinyImageNet. We further evaluate our method on a transformer-based architecture, specifically DeiT-S, using smaller datasets such as CIFAR-100 and TinyImageNet. The results are presented in Table [tab:vit_cf100_tinyimnet]. To the best of our knowledge, there is currently no standardized multi-bit framework like Any-Precision for vision transformers. To address this, we implement our own framework following configurations similar to Any-Precision, with a slight modification to the StatsQ quantizer; details are provided in Section [sec:quant]. Our method demonstrates consistently strong performance compared to the basic Any-Precision setup, even when pruning the dataset by 60%, achieving up to an \(8.41\times\) reduction in GPU hours on TinyImageNet.
Evaluation of storage-constrained coreset sampling. Coreset-based approaches in multi-bit networks consistently reduce training time; however, due to varying data importance across sub-models, it remains challenging to impose a uniform constraint on the total number of training samples used by the entire model. To address this, we first discard samples that are consistently considered unimportant across all sub-models, and then apply our coreset sampling method with bias correction. To identify and remove consistently uninformative samples before applying coreset sampling, we first compute the importance of each training sample for every sub-model over a single epoch, following our bit-wise training scheme. These importance values are then summed across sub-models, and their variability is assessed—similar to training dynamics approaches [16], [17]—to obtain a final importance score. As shown in Table [tab:cf10_storage], our coreset sampling method performs comparably to existing approaches, even under a 50% dataset storage constraint. Moreover, Table [tab:cf10_cf100_time] shows that training performance can be further enhanced by tuning the pruning rate (i.e., training time), highlighting the adaptability of our method to varying resource budgets in multi-bit network training.
Experiments on influence of coreset sampling frequency. In practice, the overhead of bit-wise coreset resampling is extremely small compared to the overall training cost. For example, even on an ImageNet-scale dataset, performing 100 resamplings takes only about 3.36 minutes. Given this negligible cost, resampling at every epoch is a practical and effective choice.
To quantitatively demonstrate this, we conducted experiments with different resampling intervals and measured both validation accuracy and total sampling time. Table [tab:sampling_frequency] shows that resampling every epoch improves average accuracy by 1.33%p compared to resampling every 30 epochs, while adding just 53 seconds of overhead to a multi-hour training process. This demonstrates that frequent resampling offers meaningful accuracy gains at virtually no additional cost.
Experiments on dynamic score re-evaluation. We conducted additional experiments where importance scores are dynamically re-evaluated every 10, 30, 50, or 100 epochs, and coresets are resampled accordingly. We evaluated how this impacts accuracy and GPU hours using PreActResNet-20 on CIFAR-10 and PreActResNet-18 on CIFAR-100 under both 80% and 90% data pruning. Table [tab:cf10_dynamic_score] and Table [tab:cf100_dynamic_score] reveal a consistent pattern: while dynamic score re-evaluation leads to only marginal accuracy changes, it incurs a substantial increase in computational cost. In many settings, our one-time scoring strategy already matches or even outperforms more frequent re-evaluation in terms of final accuracy, while consuming significantly fewer GPU hours. This empirical finding validates our design choice, and shows that a single, well-computed importance estimate—when paired with stochastic sampling—offers an effective and efficient balance, capturing most of the benefits of dynamic importance tracking without incurring its heavy cost. Looking ahead, with the observation that dynamic re-evaluation yields modest gains on the more challenging CIFAR-100, we believe dynamic sampling techniques could be the key to boosting performance on complex, high-variability datasets where importance scores drift more drastically throughout training. The main hurdle is the high cost of score re-evaluation during training, which currently limits the practicality of dynamic methods. In future work, we will explore lightweight techniques to reduce score-evaluation overhead while maintaining the quality of importance estimates.
In this section, we use a simple linear classifier to examine how shared weights in multi-bit networks can implicitly transfer knowledge between sub-networks. We consider a setting where an 8-bit model and a 2-bit model share the same real-valued parameter vector \(w\), with weights quantized using the DoReFa quantizer [41]. Training alternates iteratively: the 8-bit model is trained on batch \(X_8\) with hard labels, followed by the 2-bit model trained on a separate batch \(X_2\), also with hard labels. These batches are drawn independently from the data matrix \(X \in \mathbb{R}^{d \times n}\) and do not overlap. The shared parameter \(w\) is updated in-place using the cross-entropy loss and is continuously modified by both models. The key question is: can we formally argue that, despite no explicit soft-label distillation and no shared data examples, the 2-bit model benefits from the 8-bit model’s training, and vice versa, through the shared parameter?
Gradient update within combined data subspaces. When the 8-bit model observes batch \(B_{8}=(X_{8},y_{8})\), it performs a gradient step using the cross-entropy loss. The model is blind to any component of the optimum that is orthogonal to the \(n_{8}\)-dimensional subspace spanned by \(X_{8}\) [45]. That is, under an asymptotic assumption, the direction of the update is fully constrained to the subspace spanned by the input vectors in the batch, as the gradient is a linear combination of the data points \(x_i\). Therefore, the gradient of the 8-bit model lies within the subspace spanned by its input batch, and the corresponding update step is constrained as follows: \(\Delta_{8} \in \text{span}(X_{8})\). Similarly, the 2-bit model performs its update with batch \(X_2\). By induction, the net update to \(w\) lies within a sum of subspaces: \(\Delta_{\text{net}} \in \sum_{b \in \mathcal{B}} \text{span}(X_b)\), where \(\mathcal{B} = \{2,8\}\) in our case. Thus, the shared weight vector evolves within the combined data subspace \(\text{span}(X_{8}) + \text{span}(X_{2})\). This shows that each sub-network updates its parameters based on a broader subspace that includes data from other sub-networks, thereby increasing its effective data exposure.
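For concreteness, consider a binary linear classifier \(f(\mathbf{x}) = \sigma(w^{\top}\mathbf{x})\) trained with cross-entropy, treating the quantizer with a straight-through estimator (a simplifying assumption). The gradient over batch \(B_{8}\) is then \[\nabla_{w}\,\mathcal{L}_{8}(w) = \sum_{i \in B_{8}} \bigl(\sigma(w^{\top}\mathbf{x}_i) - y_i\bigr)\,\mathbf{x}_i \;\in\; \operatorname{span}(X_{8}),\] so every update \(\Delta_{8} = -\eta \nabla_{w}\mathcal{L}_{8}(w)\) is a linear combination of the batch inputs, which is exactly the span property used above.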
Gradient alignment between quantized sub-networks. Given that updates occur within a shared subspace, we analyze whether the gradients from different quantized sub-networks are sufficiently aligned to enable mutual benefit. We assume that the optima of the two quantized models are similar [41]. In our setting, when the 8-bit model receives batch \(X_{8}\), it updates \(w\) toward minimizing \(\mathcal{L}_{8}\). Since \(\Delta w_{8} \in \text{span}(X_{8})\), this update steers \(w\) toward the optimum \(w^\star\) within \(\text{span}(X_8)\). The 8-bit model thus moves \(w\) to a point where the 2-bit loss cannot be worse, and is often better (i.e., \(\theta < 90^{\circ}\)). This is functionally equivalent to soft distillation: the 8-bit model’s higher-capacity updates are immediately used by the 2-bit model, enabling generalization benefits without soft targets. This provides a theoretical basis for implicit knowledge transfer, as the shared parameter acts as a channel of indirect supervision.
Motivated by these observations, we propose a bit-wise coreset sampling method that trains each sub-network with its own compact, informative subset of the full dataset. As the multi-bit network implicitly accesses the collective data seen by all sub-networks, each sub-network can prioritize important samples by training on them directly, while also benefiting from diverse data exposure through indirect supervision. This not only reduces the overall training cost but also preserves model performance by ensuring sufficient coverage of the dataset across bit-widths.