July 12, 2025
Text-to-motion generation has experienced remarkable progress in recent years. However, current approaches remain limited to synthesizing motion from short or general text prompts, primarily due to dataset constraints. This limitation undermines
fine-grained controllability and generalization to unseen prompts. In this paper, we introduce SnapMoGen, a new text-motion dataset featuring high-quality motion capture data paired with accurate, expressive textual annotations. The
dataset comprises 20K motion clips totaling 44 hours, accompanied by 122K detailed textual descriptions averaging 48 words per description (vs. 12 words in HumanML3D). Importantly, these motion clips preserve the temporal continuity of the original long sequences, facilitating research in long-term motion generation and blending. We also improve upon previous generative masked modeling approaches. Our model, MoMask++, transforms motion into
multi-scale token sequences that better exploit the token capacity, and learns to generate all tokens using a single generative masked transformer. MoMask++ achieves state-of-the-art performance on both HumanML3D and
SnapMoGen benchmarks. Additionally, we demonstrate the ability to process casual user prompts by employing an LLM to reformat inputs to align with the expressivity and narration style of SnapMoGen.
Generating human motions from text has garnered increasing attention in recent years and has experienced notable progress. These advances have been made possible by existing large-scale text-motion datasets [1]–[4], and a variety of deep generative models such
as VAEs [1], [5], diffusion models [6]–[10], GPTs [11]–[13], and generative masking [14], [15]. Nevertheless, current models encounter critical limitations when
processing complex prompts, falling short in achieving fine-grained control and capturing nuanced variations in human movements. A key contributing factor is the restricted expressivity of text descriptions in existing motion-text datasets. Textual
annotations in these datasets are typically brief and general (e.g., "a person jumps up and lands"), lacking specific execution details. For instance, in HumanML3D [1], motion sequences of approximately 7 seconds are described by texts averaging only 12 words, which is insufficient to capture motion complexity. The importance of expressive text annotations
has been well-established in other text-conditioned visual content synthesis fields [16]–[20]. Descriptive prompts notably enhance the accuracy and aesthetic quality of generated images [16], [17] and improve temporal coherence in video generation [21]. These models understand complex visual content compositions by learning from rich text semantics, enabling fine-grained visual content editing [22], [23], adaptation [24]–[26], and understanding [27]–[29]. More importantly, when dealing with casual user prompts, rich LLM
knowledge can be leveraged to enhance prompts with specific details and nuances that models have learned from fine-grained textual training data, effectively improving generalization capability. To foster this research direction in motion synthesis, we
introduce SnapMoGen, which features high-quality motions captioned with accurate and highly expressive descriptions. SnapMoGen is created by segmenting long motion sequences into meaningful 4-12 second clips, each accompanied by
six text descriptions—two manually annotated by human experts and four augmented by an LLM that introduces diversity while preserving semantics and temporal consistency. In total, SnapMoGen comprises 20K motion clips, amounting to 44 hours of
mocap data, accompanied by 122K detailed text descriptions. As shown in Figure 1, our text annotations contain extremely rich semantic cues of human movements, with an average length of 48 words—four times the average length in
HumanML3D. Furthermore, our continuous motion segments facilitate research in long-term motion synthesis and motion localization. Table 1 presents a statistical
comparison between SnapMoGen and related motion-text datasets.
To generate motions from expressive texts, we build an improved model upon the previous state-of-the-art approach—MoMask [14]. MoMask applies residual quantization to motion latent features, transforming them into multiple ordered sets of same-length discrete token sequences. Although this extensive set of tokens achieves pleasing VQ reconstruction, many of them are not utilized to their full capacity. For instance, tokens beyond the first quantization layer carry only marginal information. This inefficiency, combined with its layer-specific token vocabulary design, creates inflexibility in subsequent text-to-token generation—necessitating separate models for different token sequences: a primary model for the first sequence and a secondary model for the remaining tokens. To overcome these limitations, we adopt a multi-scale approach for motion tokenization and generate all motion tokens using a single generative masked transformer. In our residual VQ, tokens at each quantization layer focus on a particular temporal scale, following a coarse-to-fine progression. Additionally, we share one codebook across all layers to ensure a universal token vocabulary. As shown in Figure 3, our multi-scale RVQ continually learns meaningful semantics with more layers, outperforming conventional RVQ [14] with \(45\%\) fewer tokens. We then simply concatenate all tokens along the temporal dimension and train a generative transformer to produce tokens from text by predicting randomly masked tokens. Our new framework, dubbed MoMask++, outperforms MoMask on text-to-motion generation with only a quarter of its token count, as shown in Table 3.
In summary, our key contributions are threefold. First, we introduce SnapMoGen, a large-scale dataset comprising 20K temporally continuous motion capture clips described by 122K highly expressive text prompts. We also establish
comprehensive benchmarks and evaluation protocols for this new dataset. Second, we advance beyond the existing state-of-the-art approach by proposing MoMask++, which optimizes motion token capacity through multi-scale quantization and models
text-conditioned token generation using a single generative masked transformer. Third, we demonstrate effective handling of casual user prompts through LLM-based prompt rewriting, enabled by the descriptive captions in our SnapMoGen.
| Datasets | Year | # Clips | Duration | # Texts | # Words per text | Duration per clip | Mocap? | Continuous? |
| KIT-ML [4] | 2016 | 3,911 | 10.3h | 6,278 | 8 | 9.5s | ✔ | |
| BABEL\(^\dagger\) [3] | 2021 | 52,937 | 33.2h | 52,937 | 2 | 2.3s | ✔ | ✔ |
| HumanML3D [1] | 2022 | 14,616 | 28.6h | 44,970 | 12 | 7.1s | ✔ | |
| Motion-X [2] | 2023 | 81,084 | 144.2h | 81,084 | 9 | 6.4s | | |
| SnapMoGen | 2025 | 20,450 | 43.7h | 122,565\(^*\) | 48 | 7.8s | ✔ | ✔ |
KIT Motion-Language Dataset [4] pioneered this domain with 3.9K motions and 6.3K human-annotated descriptions but was limited in scale and
text diversity. BABEL [3] introduced temporally precise frame-level labels across 33 hours of motion capture data; however, its
annotations primarily consist of short phrases (e.g., ‘lift something’) for approximately 2-second atomic actions rather than descriptions of extended sequences. HumanML3D [1] expanded the field with 14.6K motions and 44.9K texts by aggregating data from AMASS [30] and HumanAct12 [31]. Despite its size, the text descriptions remain brief and general (e.g.,
"a person was pushed but did not fall"), failing to capture nuanced movement details. Motion-X [2] increased diversity by extracting
motions from monocular videos and generating descriptions using video captioning models [28]. However, these motions often contain estimation
artifacts such as jitter and foot-sliding, while their descriptions still lack expressivity. Recently, HuMMan-MoGen introduced fine-grained descriptions for specific body parts in motions. In contrast, our SnapMoGen introduces highly
expressive text descriptions for holistic 4-12 second motion segments.
Recent advances in human motion generation, particularly in text-conditioned synthesis, have significantly improved the realism and text controllability of generated motions. Early methods explored continuous motion representations using generative models such as VAEs [1], [5]. The introduction of diffusion models [6]–[10] has significantly advanced the field. By iteratively refining motion through denoising steps, these models generate realistic sequences that align closely with textual prompts. A parallel line of research models motion as sequences of discrete tokens using quantization techniques such as VQ-VAEs [32]. These approaches represent motion as compact, structured token sequences, typically generated autoregressively [11]–[13], [33], [34] or through generative masking schemes [14], [15]. To reduce quantization error, MoMask [14] applies multiple quantization layers to iteratively approximate the residuals. Nevertheless, as all quantization is applied at the same (and full) temporal scale, the information captured at each successive layer decreases drastically, leading to an overproduction of tokens with notably uneven information content. This inefficiency also makes text-to-token generation rather inflexible. These limitations directly inspire the multi-scale residual quantization process in our framework.
SnapMoGen encompasses 43.7 hours of high-quality motion data captured at 30 frames per second. The dataset comprises a total of 4.7M motion frames, featuring a diverse range of actions including daily activities, fitness routines, social
interactions, dances, and more. We deliberately incorporate various stylized performances (e.g., princess, elderly person, zombie) to enhance diversity. SnapMoGen captures performances from 10 participants, resulting in 20,450 motion clips
ranging from 4 to 12 seconds in length. Each motion clip is accompanied by 6 detailed textual descriptions (2 manually annotated, 4 LLM-augmented), totaling 122,565 textual descriptions with an average length of 48 words. A comparison between
SnapMoGen and existing motion-text datasets is presented in Table 1. We further augment the dataset by mirroring motion data [1] throughout our experiments.
We aim to cover a wide range of actions while ensuring high-quality 3D motion capture. All motions are recorded using Xsens and Rokoko motion capture suits. To determine motion content, we combine two resources: (i) LLM-generated action scenarios covering diverse topics, and (ii) a curated collection of videos and images from the internet featuring content of interest, such as stylized movements. These text instructions and video demonstrations are presented to performers prior to recording sessions as reference material. Performers are then encouraged to execute these or related actions in their own interpretive style. Following data collection, motions with notable artifacts (e.g., jittering, foot sliding) are filtered out to maintain data quality.
We deliberately capture long motion sequences containing multiple actions for broader applications. Subsequently, we develop an automated pipeline to segment these sequences into shorter clips of appropriate lengths. The key principle is to prioritize segmentation at motionless moments. Specifically, we first calculate the average positional velocities of the hip and end-effector joints at each frame, smoothed by a Gaussian filter. We then detect velocity troughs and normalize their values within each sequence, yielding \(\rho_{1:n} \in [0, 1]\), where \(n\) denotes the number of troughs. Each trough \(i\) is selected as a segmentation point with probability \(0.5\rho_i\), which typically results in clips averaging 8 seconds. Hard constraints on the minimum (4s) and maximum (12s) clip duration are enforced during segmentation. Segmentation examples are provided in the supplementary files.
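As an illustration, a minimal Python sketch of this trough-based segmentation is given below. The smoothing width, the direction of the trough normalization (deeper troughs mapped to larger \(\rho\)), and the simplified handling of the 12s upper bound are our assumptions rather than the exact pipeline.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import argrelextrema

def pick_cut_points(joint_speed, fps=30, sigma=5, min_len=4.0, max_len=12.0, rng=None):
    """Choose cut frames at low-velocity troughs of a long mocap take.

    joint_speed: (T,) per-frame mean positional speed of hip and end-effector joints.
    Returns frame indices at which the take is segmented.
    """
    rng = np.random.default_rng() if rng is None else rng
    v = gaussian_filter1d(joint_speed, sigma)              # smooth the speed curve
    troughs = argrelextrema(v, np.less)[0]                 # local minima = candidate cuts
    # Normalize troughs to [0, 1]; we assume deeper (stiller) troughs get larger rho.
    vt = v[troughs]
    rho = (vt.max() - vt) / (vt.max() - vt.min() + 1e-8)

    cuts, last = [], 0
    for t, r in zip(troughs, rho):
        seg_len = (t - last) / fps
        if seg_len < min_len:                              # never cut below the 4s minimum
            continue
        # Cut with probability 0.5 * rho_i, or force a cut once near the 12s maximum.
        if seg_len >= max_len or rng.random() < 0.5 * r:
            cuts.append(int(t))
            last = t
    return cuts
```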
Each motion clip is rendered as video using a 3D character for annotation. We collect descriptions from two distinct annotators for each motion clip. The entire annotation process involves 55 professional native English-speaking annotators who are
instructed to address the following aspects in their textual descriptions: action, context, style, moving direction, speed, trajectory shape, body parts,
spatial relation/location, posture (if applicable), and timing (if applicable). All annotations undergo a second-round review to ensure descriptive accuracy. Typographical errors in the collected textual descriptions
are corrected using an LLM. To enhance textual diversity, we further employ the LLM to re-describe each manual description twice, maintaining precise action semantics while varying expression. This results in a total of six distinct descriptions per motion
clip.
Our goal is to generate a 3D human pose sequence \(\mathbf{m}_{1:N}\) of length \(N\) guided by a textual description \(c\), where \(\mathbf{m}_i\in\mathbb{R}^D\) and \(D\) denotes the dimension of pose features.
In traditional motion VQ-VAEs [11], [12], [15], a motion encoder \(\mathcal{E}(\cdot)\) encodes the motion sequence \(\mathbf{m}\in \mathbb{R}^{N\times D}\) to a latent feature sequence \(f\in\mathbb{R}^{n\times d}\), which is further mapped to a discrete token sequence \(q\in [K]^n\) through vector quantization: \[f =\mathcal{E}(\mathbf{m}),\quad\quad\quad q=\mathcal{Q}(f),\] where \(\mathcal{Q}(\cdot)\) denotes a quantizer. The quantizer typically consists of a learnable codebook \(\mathcal{C}\in \mathbb{R}^{K\times d}\) of \(K\) codes. During quantization, each feature vector \(f_i\) is mapped to the code index \(q_i\) of its nearest code entry in the codebook: \[q_i = \left( \texttt{argmin}_{k \in [K]}\|\texttt{lookup}(\mathcal{C}, k)-f_i\|_2\right) \in [K]\] where \(\texttt{lookup}(\mathcal{C}, k)\) means taking the \(k\)-th vector in codebook \(\mathcal{C}\). The quantized feature vector sequence is finally fed into a decoder \(\mathcal{D}\) to reconstruct the input motion: \[\hat{f}=\texttt{lookup}(\mathcal{C}, q), \quad\quad\quad \hat{\mathbf{m}}=\mathcal{D}(\hat{f}).\]
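For readers less familiar with vector quantization, a minimal PyTorch sketch of this nearest-code lookup is shown below; it is illustrative only, not the authors' implementation.

```python
import torch

def quantize(f, codebook):
    """Nearest-neighbour vector quantization, i.e. the quantizer Q(.) above.

    f:        (n, d) latent feature sequence from the motion encoder E(m).
    codebook: (K, d) code entries C.
    Returns token indices q of shape (n,) and the quantized features lookup(C, q).
    """
    dists = torch.cdist(f, codebook)      # (n, K) pairwise L2 distances
    q = dists.argmin(dim=-1)              # index of the closest code for each feature
    return q, codebook[q]                 # q, lookup(C, q)
```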
To effectively reduce quantization errors, MoMask [14] introduces additional \(V\) quantization layers \(\mathcal{Q}^{1,\ldots,V}(\cdot)\). Specifically, starting from the initial residual \(r^0=f\), each \(\mathcal{Q}^v(\cdot)\) calculates token indices \(q^v\) and their corresponding codes \(\hat{f}^v\) as an approximation of the residual \(r^v\), and then computes the next residual \(r^{v+1}\) as: \[\label{eq:residual_quant} q^v = \mathcal{Q}^v(r^v), \quad\quad\quad \hat{f}^v = \texttt{lookup}(\mathcal{C}^v, q^v), \quad\quad\,\,\,r^{v+1} = r^v- \hat{f}^v\tag{1}\]
Each quantization layer \(\mathcal{Q}^v(\cdot)\) contains a separate codebook \(\mathcal{C}^v\in \mathbb{R}^{K \times d}\). This approach yields \(V+1\) discrete token sequences \([q^v]_{0}^V \in[K]^{(V+1) \times n}\) of length \(n\) for a motion sequence. The final approximation of the latent sequence \(f\) is the sum of all quantized features \(\hat{f}=\sum_{v=0}^V\hat{f}^v\).
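Reusing the `quantize` sketch above, Eq. 1 amounts to a simple loop over layer-specific codebooks; this is a hedged illustration rather than MoMask's released code.

```python
import torch

def residual_quantize(f, codebooks):
    """Conventional full-scale RVQ (Eq. 1): one codebook C^v per quantization layer."""
    residual = f
    tokens, f_hat = [], torch.zeros_like(f)
    for C_v in codebooks:                     # codebooks: list of (K, d) tensors
        q_v, r_hat = quantize(residual, C_v)  # approximate the current residual r^v
        tokens.append(q_v)                    # token sequence q^v (length n)
        f_hat = f_hat + r_hat                 # running approximation of f
        residual = residual - r_hat           # r^{v+1} = r^v - f_hat^v
    return tokens, f_hat                      # [q^0, ..., q^V] and sum of quantized features
```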
Overall, this quantized auto-encoder model is trained using a compound loss function that combines motion reconstruction and per-layer latent embedding losses: \[\label{eq:rvq} \mathcal{L}_{rvq} = \texttt{SmoothL1}(\mathbf{m} - \hat{\mathbf{m}}) + \beta \sum_{v=0}^V\|r^v-\texttt{sg}[\hat{f}^v]\|_2,\tag{2}\] where \(\texttt{sg}[\cdot]\) denotes the stop-gradient operation, and \(\beta\) a weighting factor for embedding alignment. The codebook entries are updated using exponential moving average [12].
Although high-fidelity VQ reconstruction can be achieved through an extensive set (\((V+1)\times n\)) of motion tokens, this approach introduces inflexibility in learning the text-to-token mapping, primarily due to two factors: i) information is disproportionately distributed across quantization layers—first-layer tokens typically contain the predominant features, while subsequent layers capture only incremental refinements; and ii) tokens at different layers are indexed by independent codebooks. To address this heterogeneity, MoMask [14] applies an expressive generative masked transformer for the principal first-layer tokens, while modeling all other-layer tokens with a secondary transformer conditioned on first-layer results. This hierarchical approach further diminishes the representational capacity of tokens in non-first layers.
In our approach, tokens at different quantization layers are designed to capture information at specific temporal resolutions, with a common codebook shared across all layers. This design allows us to model the generation of all tokens using a single generative masked transformer.
As depicted in Figure 2 (a), for a motion latent feature sequence \(f\in \mathbb{R}^{n\times d}\), our quantizer employs a series of residual quantization operations at progressively increasing temporal resolutions \(\{h^v\}_{v=0}^V\), where \(h^0<\cdots< h^V = n\). We denote all quantization operations uniformly as \(\mathcal{Q}(\cdot)\) since they share a common codebook \(\mathcal{C}\). Quantized features at the coarse level \(\hat{f}^v\in \mathbb{R}^{h^v\times d}\) are bilinearly interpolated to full resolution \(\hat{f}_{\uparrow}^v = \mathcal{I}(\hat{f}^v, h^V)\), where residuals are calculated and then downsampled to the next scale (\(h^{v+1}\)) to be quantized by the succeeding layer. Mathematically, Eq. 1 is reformulated as: \[q^v = \mathcal{Q}(\mathcal{I}(r^v, h^v)),\quad\quad\,\,\, r^{v+1} = r^v - \mathcal{I}(\hat{f}^v, h^V), \quad\quad\,\,\, r^0 = f,\] where \(\hat{f}^v = \texttt{lookup}(\mathcal{C}, q^v)\). Then, the final approximation of the latent sequence \(f\) is the sum of all up-interpolated quantized sequences, which is fed into the decoder \(\mathcal{D}\) for motion reconstruction: \[\hat{f} = \sum_{v=0}^V \mathcal{I}(\hat{f}^v, h^V),\quad\quad\,\,\, \hat{\mathbf{m}} = \mathcal{D}(\hat{f}).\]
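Our multi-scale variant can be sketched analogously with a single shared codebook. We stand in for the paper's bilinear interpolation with 1-D linear interpolation along time, and the `align_corners` setting is an assumption; the `quantize` helper is reused from the earlier sketch.

```python
import torch
import torch.nn.functional as F

def interp(x, length):
    """Resample a (h, d) sequence to (length, d) by linear interpolation along time."""
    return F.interpolate(x.T.unsqueeze(0), size=length, mode="linear",
                         align_corners=True).squeeze(0).T

def multiscale_residual_quantize(f, codebook, scales):
    """Multi-scale RVQ with one shared codebook.

    f:      (n, d) latent sequence with n == scales[-1].
    scales: increasing temporal resolutions [h^0, ..., h^V], h^V = n.
    """
    n = f.shape[0]
    residual, tokens, f_hat = f, [], torch.zeros_like(f)
    for h_v in scales:
        q_v, r_hat = quantize(interp(residual, h_v), codebook)  # quantize at scale h^v
        tokens.append(q_v)                                      # length-h^v token sequence
        up = interp(r_hat, n)                                   # back to full resolution
        f_hat = f_hat + up                                      # accumulate approximation of f
        residual = residual - up                                # residual for the next scale
    return tokens, f_hat                                        # f_hat is fed to the decoder D
```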
We further add self-attention layers after each res-block in existing motion VQVAE architectures [12], [14] for higher-fidelity motion reconstruction. During training, we introduce an additional emphasis on reconstructing essential rotational features \(\mathcal{L}_{ess}\) on top of \(\mathcal{L}_{rvq}\) in Eq. 2, weighted by \(\lambda_{ess}\). The final learning objective becomes: \[\mathcal{L}_{ms\_rvq} = \mathcal{L}_{rvq} + \lambda_{ess} \mathcal{L}_{ess}\]

Figure 3: Illustration of token capacity in a pretrained traditional 6-layer, 480-token full-scale RVQ [14] compared to a 10-layer, 266-token multi-scale RVQ. Starting from a zero-sequence, we incrementally add one quantized feature sequence for motion decoding and measure the reconstruction performance. The multi-scale VQ learns tokens more efficiently, with meaningful semantics at each quantization layer.
In the end, a motion sequence is represented as \(V+1\) ordered discrete token sequences over a hierarchy of temporal scales, \(q = (q^0, q^1,\ldots,q^V)\), where each \(q^v\) has a length of \(h^v\). Since a shared codebook is utilized across all scales, tokens from every \(q^v\) belong to the same vocabulary \([K]\).
As shown in Figure 3, compared to the previous full-scale residual VQ [14], our multi-scale VQ effectively exploits token capacity and continually learns meaningful semantic features at each quantization layer. It achieves superior reconstruction quality with significantly fewer tokens (266 vs. 480).
We employ a single bidirectional transformer for token generation from text descriptions. Our framework is illustrated in Figure 2 (b-c). We utilize T5-base [35] to extract word-level features from complex textual descriptions \(c\). Motion tokens from all scales are concatenated along the temporal axis, yielding
an extended token sequence \(q\), which is then embedded through an MLP. We investigate two primary architectures for text conditioning: (i) In-context learning, where embeddings of motion tokens and text
tokens are concatenated and processed uniformly as transformer input, and (ii) Cross-attention, which incorporates additional multi-head cross-attention layers that enable motion features to query relevant text features.
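The two conditioning variants can be contrasted with the simplified PyTorch blocks below; the exact layer composition, normalization, and residual wiring in MoMask++ may differ, so treat these as schematic.

```python
import torch
import torch.nn as nn

class InContextBlock(nn.Module):
    """(i) In-context: text and motion embeddings are concatenated and share self-attention."""
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=1024,
                                                batch_first=True)
    def forward(self, motion_emb, text_emb):                  # (B, N, d), (B, T, d)
        x = torch.cat([text_emb, motion_emb], dim=1)
        return self.layer(x)[:, text_emb.shape[1]:]           # keep only motion positions

class CrossAttnBlock(nn.Module):
    """(ii) Cross-attention: motion features query word-level text features."""
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.self_block = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=1024,
                                                     batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
    def forward(self, motion_emb, text_emb):
        x = self.self_block(motion_emb)
        attended, _ = self.cross(query=x, key=text_emb, value=text_emb)
        return x + attended
```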
| Methods | R-Precision Top 1\(\uparrow\) | R-Precision Top 2\(\uparrow\) | R-Precision Top 3\(\uparrow\) | FID\(\downarrow\) | MM Dist\(\downarrow\) | MModality\(\uparrow\) |
| Real motions | \({0.511}^{\pm{.003}}\) | \({0.703}^{\pm{.003}}\) | \({0.797}^{\pm{.002}}\) | \({0.002}^{\pm{.000}}\) | \({2.974}^{\pm{.008}}\) | - |
| TM2T [11] | \({0.424}^{\pm{.003}}\) | \({0.618}^{\pm{.003}}\) | \({0.729}^{\pm{.002}}\) | \({1.501}^{\pm{.017}}\) | \({3.467}^{\pm{.011}}\) | \(\underline{{2.424}}^{\pm{.093}}\) |
| T2M [1] | \({0.455}^{\pm{.003}}\) | \({0.636}^{\pm{.003}}\) | \({0.736}^{\pm{.002}}\) | \({1.087}^{\pm{.021}}\) | \({3.347}^{\pm{.008}}\) | \({2.219}^{\pm{.074}}\) |
| MDM [6] | - | - | \({0.611}^{\pm{.007}}\) | \({0.544}^{\pm{.044}}\) | \({5.566}^{\pm{.027}}\) | \(\mathbf{{2.799}}^{\pm{.072}}\) |
| MLD [8] | \({0.481}^{\pm{.003}}\) | \({0.673}^{\pm{.003}}\) | \({0.772}^{\pm{.002}}\) | \({0.473}^{\pm{.013}}\) | \({3.196}^{\pm{.010}}\) | \({2.413}^{\pm{.079}}\) |
| MotionDiffuse [7] | \({0.491}^{\pm{.001}}\) | \({0.681}^{\pm{.001}}\) | \({0.782}^{\pm{.001}}\) | \({0.630}^{\pm{.001}}\) | \({3.113}^{\pm{.001}}\) | \({1.553}^{\pm{.042}}\) |
| T2M-GPT [12] | \({0.492}^{\pm{.003}}\) | \({0.679}^{\pm{.002}}\) | \({0.775}^{\pm{.002}}\) | \({0.141}^{\pm{.005}}\) | \({3.121}^{\pm{.009}}\) | \({1.831}^{\pm{.048}}\) |
| MMM [15] | \({0.515}^{\pm{.002}}\) | \({0.708}^{\pm{.002}}\) | \({0.804}^{\pm{.002}}\) | \({0.089}^{\pm{.005}}\) | \(\underline{{2.926}}^{\pm{.007}}\) | \({1.226}^{\pm{.040}}\) |
| MoMask [14] | \(\underline{{0.521}}^{\pm{.002}}\) | \(\underline{{0.713}}^{\pm{.002}}\) | \(\underline{{0.807}}^{\pm{.002}}\) | \(\mathbf{{0.045}}^{\pm{.002}}\) | \({2.958}^{\pm{.008}}\) | \({1.241}^{\pm{.040}}\) |
| MoMask++\(^\text{in}\) | \(\mathbf{{0.528}}^{\pm{.003}}\) | \(\mathbf{{0.718}}^{\pm{.003}}\) | \(\mathbf{{0.811}}^{\pm{.002}}\) | \({0.072}^{\pm{.003}}\) | \(\mathbf{{2.912}}^{\pm{.008}}\) | \({1.227}^{\pm{.046}}\) |
| MoMask++\(^\text{cra}\) | \({0.517}^{\pm{.002}}\) | \({0.709}^{\pm{.002}}\) | \({0.803}^{\pm{.002}}\) | \(\underline{{0.069}}^{\pm{.003}}\) | \({2.948}^{\pm{.007}}\) | \({1.192}^{\pm{.053}}\) |
In training, a varying fraction \(\gamma(\tau) = \cos(\frac{\pi \tau}{2}) \in [0, 1]\), where \(\tau \sim \mathcal{U}(0, 1)\), of sequence elements is uniformly selected, masked out, and replaced with a special \(\texttt{[MASK]}\) token. The transformer is trained to predict these masked tokens given text input \(c\) and the partially masked token sequence \(\dot{q}\), by maximizing the likelihood: \[\mathcal{L}_{mask} = \sum_{\dot{q}_k=\texttt{[MASK]}} - \mathrm{log}\, p_\theta \left(q_k|\dot{q}, c\right).\] We adopt the replacing and remasking strategy [14], [36] to enhance contextual reasoning ability. Additionally, the model is trained without text condition \(c=\emptyset\) with a probability of \(10\%\) to enable classifier-free guidance (CFG).
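A hedged sketch of one training step is shown below; the `transformer(tokens, text_emb)` interface is hypothetical, per-element Bernoulli masking stands in for selecting exactly a \(\gamma(\tau)\) fraction, and the replace-and-remask strategy of [14], [36] is omitted.

```python
import math
import torch
import torch.nn.functional as F

def masked_training_step(transformer, tokens, text_emb, mask_id, p_uncond=0.1):
    """One training step of the generative masked transformer (schematic).

    tokens:   (B, N) concatenated multi-scale token indices q.
    text_emb: word-level T5 features, or None when the condition is dropped for CFG.
    mask_id:  index of the special [MASK] token.
    """
    B, N = tokens.shape
    tau = torch.rand(B, device=tokens.device)                # tau ~ U(0, 1)
    gamma = torch.cos(0.5 * math.pi * tau)                   # masking ratio per sample
    mask = torch.rand(B, N, device=tokens.device) < gamma[:, None]

    inputs = tokens.masked_fill(mask, mask_id)               # q_dot: partially masked sequence
    if torch.rand(()) < p_uncond:                            # drop text for classifier-free guidance
        text_emb = None
    logits = transformer(inputs, text_emb)                   # (B, N, K) categorical logits
    return F.cross_entropy(logits[mask], tokens[mask])       # -log p(q_k | q_dot, c) on masked slots
```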
During inference, a complete sequence \(q\) is generated in a constant number (\(L\)) of iterations. This process begins with a fully masked sequence \(\left[\texttt{[MASK]}\right]^N\), with \(N=\sum_{v=0}^Vh^v\) denoting the total number of tokens in \(q\). At each iteration \(l\), the model predicts categorical token distributions at the masked locations, samples tokens, and re-masks the \(\lceil\gamma(\frac{l}{L})\cdot N\rceil\) lowest-confidence tokens. This process repeats until \(l\) reaches \(L\). We also adopt classifier-free guidance as in [14] with guidance scale \(s\). Finally, all generated tokens are decoded and projected back to a motion sequence through the VQ-VAE decoder.
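The decoding loop can be sketched as follows, using the same hypothetical `transformer` interface; the exact classifier-free guidance formulation and confidence bookkeeping may differ from the released implementation.

```python
import math
import torch

@torch.no_grad()
def generate_tokens(transformer, text_emb, N, L, mask_id, cfg_scale):
    """Iterative confidence-based decoding of all N multi-scale tokens (schematic)."""
    seq = torch.full((1, N), mask_id)                           # start fully masked
    for l in range(1, L + 1):
        logits_c = transformer(seq, text_emb)                   # conditional logits
        logits_u = transformer(seq, None)                       # unconditional logits
        logits = logits_u + cfg_scale * (logits_c - logits_u)   # classifier-free guidance

        probs = logits.softmax(dim=-1)
        sampled = torch.distributions.Categorical(probs).sample()
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)

        still_masked = seq == mask_id
        seq = torch.where(still_masked, sampled, seq)           # only fill masked slots
        conf = conf.masked_fill(~still_masked, float("inf"))    # never re-mask fixed tokens

        n_remask = math.ceil(math.cos(0.5 * math.pi * l / L) * N)
        if l < L and n_remask > 0:
            idx = conf.topk(n_remask, largest=False).indices    # lowest-confidence tokens
            seq.scatter_(1, idx, mask_id)                       # re-mask them for the next step
    return seq                                                  # decoded by the RVQ-VAE decoder
```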
Besides SnapMoGen, we also conduct experiments on HumanML3D [1], a popular motion-text dataset comprising 14,616 motions
with 44,970 textual descriptions.
We process motions in SnapMoGen following procedures established in HumanML3D, including motion mirroring and standardization. To prevent data leakage, we deliberately hold out a test set (10%) and a validation set (5%) whose motion
scenarios (e.g., fashion) differ from the training motions. We primarily adopt the feature representation from HumanML3D, consisting of root angular velocity along Y-axis \(\dot{r}^a\in\mathbb{R}\), root linear velocity on
XZ-plane \(\dot{r}^{xz}\in\mathbb{R}^2\), root height \(\dot{r}^y\in\mathbb{R}\), 6D local joint rotations \(\mathbf{j}^r\in\mathbb{R}^{6j}\), local joint
positions \(\mathbf{j}^p\in\mathbb{R}^{3j}\), and local joint velocities \(\mathbf{j}^v\in\mathbb{R}^{3j}\), where \(j\) denotes the number of joints. We
empirically find that this comprehensive set of pose features leads to slightly better performance (Table 4). Our SnapMoGen follows a skeletal topology
comprising 24 joints, resulting in 296-dimensional pose features. Unlike HumanML3D, our pose features are directly convertible to standard motion capture file formats (e.g., BVH).
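For reference, the components listed above account for \(4 + 6j + 3j + 3j = 292\) dimensions with \(j=24\); a short, hedged sketch of the dimensionality bookkeeping follows, where the interpretation of the remaining four dimensions (e.g., binary foot-contact labels as in HumanML3D) is an assumption on our part. The 148-dimensional "essential" features used later by our evaluation model correspond to the root terms plus local rotations.

```python
J = 24                                   # joints in the SnapMoGen skeleton
root = 1 + 2 + 1                         # Y angular vel. + XZ linear vel. + root height
rot, pos, vel = 6 * J, 3 * J, 3 * J      # local 6D rotations, positions, velocities

essential = root + rot                   # 4 + 144 = 148 dims (root motion + rotations)
listed = root + rot + pos + vel          # 4 + 144 + 72 + 72 = 292 dims
full = listed + 4                        # 296 dims; the extra 4 dims are assumed to be
                                         # binary foot-contact labels as in HumanML3D
assert (essential, full) == (148, 296)
```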
We adopt established metrics including FID, R-Precision, MultiModal Distance, and Multimodality following previous works [6], [14], [37]. The evaluator from prior research [1] was exclusively trained to align motion and text embeddings. However, the resulting motion embeddings may be biased toward text alignment while overlooking
motion fidelity. Additionally, its redundant motion feature design lacks flexibility for broader evaluation scenarios [10]. Therefore,
we adopt the TMR [38] approach for our evaluation model, utilizing only essential 148-dimensional motion features. This method extracts
separate latent vectors from motion and text, requiring the motion vector to both align well with corresponding text features and accurately reconstruct the source motion (ensuring fidelity). We use the T5-base model to extract word-level text
features. For R-Precision calculations, we employ a candidate pool size of 100. We also use the CLIP score [10] to evaluate text-motion
alignment, which measures the cosine similarity between text and motion features.
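For completeness, a minimal sketch of the two alignment metrics with the pool size of 100 is given below; the use of Euclidean distance for ranking and the batching scheme follow common practice [1], [38] and are assumptions about implementation details.

```python
import torch
import torch.nn.functional as F

def clip_score(motion_emb, text_emb):
    """Average cosine similarity between paired motion and text embeddings."""
    return F.cosine_similarity(motion_emb, text_emb, dim=-1).mean()

def r_precision(motion_emb, text_emb, top_k=3, pool=100):
    """R-Precision with a candidate pool of 100: each motion is ranked against its
    ground-truth caption plus 99 distractors; count a hit if the ground truth is top-k."""
    hits = []
    for i in range(0, len(motion_emb) - pool + 1, pool):
        m, t = motion_emb[i:i + pool], text_emb[i:i + pool]
        ranks = torch.cdist(m, t).argsort(dim=-1)            # (pool, pool) ranked text ids
        gt = torch.arange(pool, device=ranks.device).unsqueeze(-1)
        hits.append((ranks[:, :top_k] == gt).any(dim=-1).float())
    return torch.cat(hits).mean()
```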
On SnapMoGen, we reproduce baseline methods across three mainstream generative paradigms: diffusion models (MDM [6],
StableMoFusion [37], and MARDM [10]), autoregressive models (T2M-GPT [12]), and generative masking approaches (MoMask [14]). We utilize their official codebases. Each experiment is repeated 20 times, with final results reported as means with 95% confidence intervals. For all baselines,
we replace the original text encoder with T5-Base. For MoMask, we implement a 6-layer RVQ. Please refer to supplementary materials for baseline implementations.
Our VQVAE encoder and decoder consist of three dilated res-blocks, with a down(up)-scale factor of 4 [12], [14]. The temporal quantization scales follow the progression \([n/2^V, ..., n/2^0]\), with \(n\) denoting the full-scale length. We employ 4 (i.e., \(V=3\)) quantization layers for HumanML3D and 2 for SnapMoGen, with codebook sizes of \(512\times512\) and \(2048\times512\), respectively. The
hyper-parameters \(\beta\) and \(\lambda_{ess}\) are set to 0.02 and 2.0. Our transformer architecture comprises 8 layers with a feedforward size of 1024, a latent dimension of 384, 6 attention heads, and a dropout ratio of 0.2, totaling 13.5M parameters for the in-context model and 18.3M parameters for the cross-attention model. During inference, we use classifier-free guidance scales of 5 and 4, and iteration counts (\(L\)) of 10 and 18 for SnapMoGen and HumanML3D, respectively. All models are trained on a single Tesla V100 GPU, with a batch size of 256 for VQVAEs and 64 for transformers.
| Methods | R-Precision Top 1\(\uparrow\) | R-Precision Top 2\(\uparrow\) | R-Precision Top 3\(\uparrow\) | FID\(\downarrow\) | CLIP Score\(\uparrow\) | MModality\(\uparrow\) |
| Real motions | \({0.940}^{\pm{.001}}\) | \({0.976}^{\pm{.001}}\) | \({0.985}^{\pm{.001}}\) | \({0.001}^{\pm{.000}}\) | \({0.837}^{\pm{.000}}\) | - |
| MDM [6] | \({0.503}^{\pm{.002}}\) | \({0.653}^{\pm{.002}}\) | \({0.727}^{\pm{.002}}\) | \({57.783}^{\pm{.092}}\) | \({0.481}^{\pm{.001}}\) | \(\mathbf{{13.412}}^{\pm{.231}}\) |
| T2M-GPT [12] | \({0.618}^{\pm{.002}}\) | \({0.773}^{\pm{.002}}\) | \({0.812}^{\pm{.002}}\) | \({32.629}^{\pm{.087}}\) | \({0.573}^{\pm{.001}}\) | \({9.172}^{\pm{.181}}\) |
| StableMoFusion [37] | \({0.679}^{\pm{.002}}\) | \({0.823}^{\pm{.002}}\) | \({0.888}^{\pm{.002}}\) | \({27.801}^{\pm{.063}}\) | \({0.605}^{\pm{.001}}\) | \({9.064}^{\pm{.138}}\) |
| MARDM [10] | \({0.659}^{\pm{.002}}\) | \({0.812}^{\pm{.002}}\) | \({0.860}^{\pm{.002}}\) | \({26.878}^{\pm{.131}}\) | \({0.602}^{\pm{.001}}\) | \(\underline{{9.812}}^{\pm{.287}}\) |
| MoMask [14] | \({0.777}^{\pm{.002}}\) | \({0.888}^{\pm{.002}}\) | \({0.927}^{\pm{.002}}\) | \({17.404}^{\pm{.051}}\) | \({0.664}^{\pm{.001}}\) | \({8.183}^{\pm{.184}}\) |
| MoMask++\(^\text{in}\) | \(\mathbf{{0.805}}^{\pm{.002}}\) | \(\underline{{0.904}}^{\pm{.002}}\) | \(\mathbf{{0.938}}^{\pm{.001}}\) | \(\underline{{15.56}}^{\pm{.071}}\) | \(\underline{{0.684}}^{\pm{.001}}\) | \({6.556}^{\pm{.178}}\) |
| MoMask++\(^\text{cra}\) | \(\underline{{0.802}}^{\pm{.001}}\) | \(\mathbf{{0.905}}^{\pm{.002}}\) | \(\underline{{0.938}}^{\pm{.001}}\) | \(\mathbf{{15.06}}^{\pm{.065}}\) | \(\mathbf{{0.685}}^{\pm{.001}}\) | \({7.259}^{\pm{.180}}\) |
The quantitative results on HumanML3D and SnapMoGen are reported in [tab:quantitative_eval_humanml3d] and [tab:quantitative_eval_snapmotion], respectively. Overall, MoMask++ attains state-of-the-art performance on both datasets, demonstrating
consistent improvements in motion-text alignment and motion quality. These advantages are particularly pronounced in our SnapMoGen dataset, partially due to the more expressive evaluation model. We observe that previous works struggle with the
complex, lengthy text inputs in SnapMoGen, and fall short in maintaining multimodal semantic coherence, as evidenced by the relatively low CLIP scores and R-precision values. Notably, our method outperforms MoMask with only two VQ layers (a
quarter of MoMask’s token count) with similar model size. Between the two variants of MoMask++, we find that the in-context model generally performs better on HumanML3D. It however tends to overfit on long text prompts in
SnapMoGen ([fig:loss_curve]) and underperforms compared to the cross-attention model. Nevertheless, a significant gap to real motions still
exists, suggesting substantial room for future improvements.
Figure 4 displays pose sequences generated by MoMask++, demonstrating its ability to produce precise motions following fine-grained text prompts. We further showcase the capability to handle out-of-domain user prompts by employing an LLM to rephrase the inputs. For additional generation results and comprehensive visual comparisons, please refer to the supplementary materials.
| Config | Pose Dim. | #Codes | #Quant. | F/M | w/ Att. | VQ FID\(\downarrow\) | VQ Joint Pos. Err.\(\downarrow\) | T2M FID\(\downarrow\) | T2M CLIP Score\(\uparrow\) |
| base | 296 | 2048 | 4 | M | ✔ | 2.80 | 8.13 | 15.94 | 0.673 |
| (A) Only essential pose feat. | 148 | 1024 | | | | 4.71 | 7.12 | 15.95 | 0.667 |
| (B) Smaller codebook | | 1024 | | | | 3.30 | 8.43 | 15.61 | 0.668 |
| | | 512 | | | | 3.77 | 8.95 | 16.96 | 0.665 |
| (C) Varying #quant | | | 5 | | | 2.31 | 7.65 | 16.38 | 0.663 |
| | | | 3 | | | 3.13 | 8.40 | 16.21 | 0.652 |
| | | | 2 | | | 4.57 | 8.89 | 15.56 | 0.684 |
| | | | 1 | | | 8.81 | 10.48 | 16.25 | 0.677 |
| (D) Full-scale vs. multi-scale | | | | F | | 2.64 | 6.53 | 18.02 | 0.667 |
| (E) W/o attention | | | | | | 3.39 | 8.57 | 16.18 | 0.662 |

(Empty cells inherit the base configuration.)
| Config | Text Aug. | Text Enc. | Conditioning | Architecture | FID\(\downarrow\) | CLIP Score\(\uparrow\) |
| base | ✔ | T5-base | In-context | (B) 384, 1024, 8 | 15.56 | 0.684 |
| (A) W/o text aug. | | | | | 17.98 | 0.656 |
| (B) CLIP text enc. | | CLIP | | | 19.96 | 0.478 |
| (C) Cross-att cond. | | | Cross-att. | | 15.06 | 0.685 |
| (D) Larger model | | | Cross-att. | (M) 512, 2048, 8 | 15.58 | 0.679 |
| | | | Cross-att. | (L) 512, 2048, 12 | 16.02 | 0.670 |

(Empty cells inherit the base configuration.)
We perform comprehensive ablation experiments to evaluate the effects of various hyper-parameters and technical designs, as shown in Tables 4 and 5. In Table 4 (A), we observe that the compact pose representation leads to a smaller VQ reconstruction error, while it slightly underperforms for text-to-motion synthesis.
In terms of VQ configuration, we observe from Table 4 (B) that while increasing the codebook size consistently enhances VQ reconstruction and text-motion alignment (CLIP score), motion quality does not necessarily follow this trend (best FID at \(|\mathcal{C}|=1024\)). Table 4 (C) shows that additional VQ layers effectively improve reconstruction, but more token hierarchies also introduce complexity for text-to-motion synthesis, with optimal results at 2 layers. In Table 4 (D), we apply the full temporal scale at all quantization layers; despite achieving better VQ performance, inefficient token utilization leads to suboptimal generation quality. Finally, incorporating self-attention layers in the encoder and decoder (Table 4 (E)) improves both VQ learning and motion synthesis performance.
We then examine the effects of text augmentation and text-to-motion transformer design in Table 5. In Table 5 (A), caption augmentation clearly improves model performance across all evaluation metrics. In Table 5 (B), we observe that the CLIP text encoder is inadequate for handling the long and complex textual descriptions in SnapMoGen. From Table 5 and [fig:loss_curve], we further find that cross-attention conditioning is less prone to overfitting and leads to higher
motion quality and better text–motion alignment. Meanwhile, Table 5 (D) shows that transformers with higher latent dimensions or more attention layers counterintuitively degrade motion generation quality. [fig:loss_curve] provides additional insight, indicating that larger transformer models (Base: 18.3M, Medium: 36.6M, Large:
53.4M) tend to overfit the dataset more severely.
To better understand how the model distributes its capacity across scales, we analyze which tokens are “favored” by MoMask++. During iterative inference, MoMask++ generates a complete motion sequence by selectively retaining and re-masking tokens at each step, allowing us to track which tokens the model prioritizes over time. We conduct an experiment using four token scales (from coarse to fine, with 10, 20, 40, and 80 tokens, totaling 150 tokens per sequence) and a 10-iteration inference process over 32 text prompts, recording the token completion ratio at each scale. The results, shown in Figure 5, reveal that the model naturally prioritizes coarse-scale tokens (Scale 1) in the early stages of generation and progressively shifts its focus toward finer scales. This behavior demonstrates a “global-to-local” generation strategy, indicating that the attention mechanism effectively captures and prioritizes information based on semantic importance (coarse-to-fine).
In this paper, we introduced SnapMoGen, a high-quality text-motion dataset featuring temporally continuous motion segments with expressive textual annotations. Comprising 20K motion clips and 122K detailed descriptions averaging 48 words each, SnapMoGen provides significantly richer semantic information than existing datasets. We also proposed MoMask++, a novel text-to-motion generation framework that employs multi-scale residual vector quantization and a single generative masked transformer for token prediction. Extensive experiments on both HumanML3D and SnapMoGen demonstrate the state-of-the-art performance of MoMask++.
Our text annotation interface is presented in Figure 6. We first visualize all motions using a 3D character to help annotators better understand the motion content. However, since the character may walk out of the camera view and inter-penetration artifacts sometimes occur, we also display the motions using stick-figure representations that remain centered in the camera view. These two visualizations are synchronized and presented simultaneously to annotators. Annotators can also flag low-quality motions during the annotation process.
All baseline models on SnapMoGen dataset leverage the T5-base model for extracting word-level features from text descriptions and are trained using a single NVIDIA RTX A6000 GPU.
For MDM [6], we use an 8-layer transformer decoder where the text encoding is injected via cross-attention layers. The model is trained for 600K steps with a batch size of 1024 using a diffusion process with \(T = 1000\) steps. For T2M-GPT [12], we first learn a codebook of size \(1024 \times 512\) with a downsampling rate of 4. We then model the sequence of codebook indices via an 18-layer transformer. During training, text embeddings and motions are concatenated and processed as input, and a random portion of the ground-truth code indices is replaced with random ones to improve robustness. The model is trained for 600K steps with a batch size of 128. For StableMoFusion [37], we use a Conv1D-based U-Net incorporating residual cross-attention to align motion features with word-level semantics, along with group normalization. The model is trained for 500K iterations with \(T = 1000\) denoising steps and a batch size of 1024. For MARDM [10], we first encode motion into a latent representation using a 3-layer ResNet-based auto-encoder. These motion latents are then modeled using a masked autoregressive transformer with a dimension of 1024 and 16 attention heads, where text encodings are injected via cross-attention layers. The model is trained for 600K steps with a batch size of 128.
Our evaluation model accounts for both motion fidelity and text-motion alignment. We adopt the TMR framework, as shown in Figure 7. This framework comprises three network components: a motion encoder that encodes motion
sequences into global vectors, a text encoder that encodes text sequences into global vectors, and a motion decoder that reconstructs motions from either motion or text vectors. All three networks are 6-layer transformers with a latent dimension of 256, 4
attention heads, and a feedforward hidden size of 1024. The T5-base model first extracts word-level features from texts. For motions, we use only the essential 148-dimensional root motion and local rotational features. All encoders output
Gaussian distribution parameters (mean and log-variance), from which vectors are sampled. We append two extra timesteps at the end of the input sequence for outputting these vectors.
This evaluation model is trained with a compound loss: \[\mathcal{L}_\text{tmr} = \mathcal{L}_{\text{rec}} +\lambda_{\text{KL}}\mathcal{L}_{{\text{KL}}}+\lambda_{\text{E}}\mathcal{L}_{\text{E}} + \lambda_{\text{NCE}}\mathcal{L}_\text{NCE},\] where \(\mathcal{L}_\text{rec}\) measures the motion reconstruction given text or motion input (via a smooth L1 loss). A KL-divergence loss \(\mathcal{L}_\text{KL}\) regularizes each embedding distribution to be close to a unitary Gaussian distribution \(\mathcal{N}(\mathbf{0}, \mathbf{I})\) and encourages the two distributions to be close to each other. \(\mathcal{L}_\text{E}\) enforces the two mean vectors to be similar to each other. Finally, an InfoNCE [39] loss is used for contrastive learning over motion-text batches with a batch size of 64. We set \(\lambda_\text{E}\), \(\lambda_\text{KL}\), and \(\lambda_\text{NCE}\) to \(10^{-5}\), \(10^{-5}\), and \(0.1\). For more model details, we refer readers to the original TMR work [38].
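A hedged sketch of this compound objective follows; the InfoNCE temperature, the precise KL terms (here only regularization toward \(\mathcal{N}(\mathbf{0},\mathbf{I})\)), and the use of a smooth-L1 embedding loss are assumptions rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def info_nce(z_m, z_t, temperature=0.1):
    """Symmetric InfoNCE over a batch of motion/text latent vectors (temperature assumed)."""
    z_m, z_t = F.normalize(z_m, dim=-1), F.normalize(z_t, dim=-1)
    logits = z_m @ z_t.T / temperature
    labels = torch.arange(len(z_m), device=z_m.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

def tmr_loss(m, m_rec_from_motion, m_rec_from_text, mu_m, logvar_m, mu_t, logvar_t,
             lam_kl=1e-5, lam_e=1e-5, lam_nce=0.1):
    """Compound TMR-style objective mirroring the equation above."""
    rec = F.smooth_l1_loss(m_rec_from_motion, m) + F.smooth_l1_loss(m_rec_from_text, m)
    kl = sum(-0.5 * torch.mean(1 + lv - mu.pow(2) - lv.exp())      # each dist -> N(0, I)
             for mu, lv in [(mu_m, logvar_m), (mu_t, logvar_t)])
    emb = F.smooth_l1_loss(mu_m, mu_t)                              # align the two mean vectors
    nce = info_nce(mu_m, mu_t)
    return rec + lam_kl * kl + lam_e * emb + lam_nce * nce
```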
During inference, we employ the evaluation metrics designed in [1]. We increase the pool size for R-precision to 100 and directly use the mean vectors of the latent distributions as embedding vectors.
During training, we enhance data diversity by employing an LLM to rewrite human-provided annotations, generating paraphrased versions with varied linguistic structures while preserving core semantics. This approach ensures each motion sequence is associated with multiple textual descriptions, improving model robustness. The prompt instructions for ChatGPT are provided in [tab:augmenter_prompt].
During inference, the LLM rewrites each input prompt into a richly detailed description, incorporating explicit motion cues such as body posture, timing, and stylistic elements. This expanded form more effectively guides motion generation models. The instructions for ChatGPT are provided in [tab:motion_prompts].
We present several representative failure motions in the static webpage. Here we discuss limitations from both data and model perspectives.
Despite extensive calibration and post-processing of the collected motions, quality issues rooted in the inertial mocap suits persist. For example, global positions may lack precision, and jitter can occur during fast or complex motions. Additionally, we are unable to capture highly skilled motions such as cartwheels, backflips, or outdoor activities (e.g., climbing).
Opportunities for improving text-to-motion models also remain. As MoMask++ relies on VQ, quantization errors inevitably degrade motion quality. We observe that MoMask++ struggles with rare motion patterns or uncommon text prompts. Furthermore, it does not yet maintain physical plausibility, such as proper foot contacts.