MotionChain: Conversational Motion Controllers via Multimodal Prompts

Biao Jiang1 Xin Chen2 Chi Zhang Fukun Yin Zhuoyuan Li
Gang Yu Jiayuan Fan3

Anonymous ECCV Submission


Recent advancements in language models have demonstrated their adeptness in conducting multi-turn dialogues and retaining conversational context. However, this proficiency remains largely unexplored in other multimodal generative models, particularly in human motion models. By integrating multi-turn conversations into the control of continuous virtual human movements, generative human motion models can achieve an intuitive, step-by-step process of human task execution for humanoid robotics, game agents, or other embodied systems. In this work, we present MotionChain, a conversational human motion controller that generates continuous and long-term human motion through multimodal prompts. Specifically, MotionChain consists of multi-modal tokenizers that transform various data types, such as text, image, and motion, into discrete tokens, coupled with a Vision-Motion-aware Language model. By leveraging large-scale language, vision-language, and vision-motion data to assist motion-related generation tasks, MotionChain comprehends each instruction in a multi-turn conversation and generates human motions that follow these prompts. Extensive experiments validate the efficacy of MotionChain, demonstrating state-of-the-art performance in conversational motion generation, as well as more intuitive ways of controlling and interacting with virtual humans.

1 Introduction

The success of large language models (LLMs) [1][7] has sparked significant interest in the development of multi-modal language models. These models aim to transfer instruction-following and zero-shot abilities to tasks in other modalities, as in image-language models [8][11], video-language models [11][14], and 3D-language models [15], [16]. However, a comprehensive model that can perceive visual input and generate continuous motion through multi-turn conversations has not yet been developed. Such a multi-modal model would have wide-ranging applications in fields such as humanoid robotics, virtual assistants, and game agents.

Previous research on human motion has explored various tasks, including motion generation [17][22], motion captioning [22][24], motion prediction [22], [25][27], and motion composition [28]. Recent works in text-to-motion [19], [20], [29], [30] have involved pre-trained language models [31], [32] for motion generation. For instance, TEMOS [30] employs BERT [31] text embeddings in an end-to-end transformer architecture, while MDM [19] and MLD [20] both utilize text embeddings from CLIP [32] during the conditional diffusion process. On the other hand, MotionCLIP [33] and TMR [34] focus on modeling the coupled relationship between motion and text description, and MotionGPT [22] introduces a motion-language model that represents human motion and language in one unified vocabulary. However, the above methods treat every task as one-turn conditional generation and lack contextual understanding and multi-turn continuous generation abilities. Therefore, we construct a Vision-Motion language model that integrates multi-turn conversations and continuous human motions.

Figure 1: MotionChain can interpret instructions from multi-turn conversations and generate human motions or textual answers based on text, motion, or image inputs. We provide the conversation results in image-conditioned motion generation (first column), motion reasoning (second column), motion editing (third column), and motion translation (fourth column), with each subsequent turn informed by all previous conversations. Left-to-right represents the temporal order.

Two crucial challenges need to be addressed in conversational motion generation. The first challenge is to contextually generate human motion in a continuous manner, resembling the way real humans move. The second challenge is the scarcity of text-motion paired datasets compared to datasets with pairs of image-language [35], [36], image-pose [37][40] and video-motion [41][45]. Fortunately, both human motion and language are sequential and can be continuously "written". Building upon this observation, we employ the general vision-language instruction-tuning approach to enable conversational motion generation and question-answering through multi-modal instructions. By integrating image, motion, and language data and encoding them into tokens, the relationship between these three modalities becomes more evident. Therefore, with the advent of large vision-motion and vision-language data, Vision-Motion-language pre-training can enhance the performance of motion-related tasks.

In this study, we introduce MotionChain, a comprehensive framework that integrates vision, motion, and language. MotionChain leverages large-scale vision-language data, video-motion data, and the strong language generation abilities of pre-trained language models to assist in motion-related generation tasks. To enable MotionChain to comprehend and generate human-like motions, we first train a motion-specific vector quantized variational autoencoder (VQ-VAE) model. This model constructs a "motion vocabulary" similar to the English word vocabulary and converts raw motion data into a sequence of motion tokens. To incorporate vision inputs into MotionChain, we then introduce a specialized vision tokenizer that connects a pre-trained vision encoder to the language model. This tokenizer converts image data into visual tokens within the language-motion "words" embedding space. These tokens are then processed by a pre-trained language model [4], [5], [46], [47], which learns the relationship between image, motion and language. To enable conversational generation, we construct a multi-modal motion conversation dataset based on the existing text-motion dataset [18] and video-motion dataset [45]. We then train the language model using our multi-modal conversation dataset to learn the correlation between the three modalities. Extensive experiments demonstrate that MotionChain achieves state-of-the-art performance in multiple motion-related tasks.

We summarize our contributions as follows: (1) We propose MotionChain, a unified vision-motion-language generative pre-trained model, which performs conversational generation tasks via multi-modal inputs with language models. (2) We introduce a motion composition technique that generates 3D human motions following the temporal order of instructions. (3) We propose a multi-modal motion conversation benchmark, on which MotionChain achieves competitive performance across diverse motion tasks.

2 Related Work

Human Motion Modeling. There have been numerous attempts to model the relationship between 3D human motion and multiple modalities including incomplete motion [19], [22], [25][27], action [17], [19], [20], [22], [28], [48], text [18], [19], [22], [24], [29], [30], [34], [49][53], image [54][58] and video [45], [58][61]. Text-to-motion is one of the most important motion generation tasks, due to the user-friendly and convenient language input. MDM [19], MotionDiffuse [29] and MLD [20] propose diffusion-based generative models [62][64] to generate motions conditioned on different inputs. TM2T [24] and T2M-GPT [21] investigate a generative framework based on VQ-VAE [65], [66] and a generative transformer for motion generation. The motion completion task generates motion conditioned on partial motions, such as classical motion prediction [25][27] or motion in-between [19], which generates the intermediate motion while the first and last parts are fixed. TEACH [28] proposes a past-conditioned transformer model that generates motion from a sequence of actions autoregressively. Apart from motion generation, there is also work investigating the generation of other modalities from motion. Two statistical models [67] and recurrent networks [68], [69] are learned to map motions to language. TM2T [24] proposes a motion representation that compresses motions into a short sequence of discrete variables, then uses a neural translation network to build mappings between the two modalities. In contrast to the above methods, which are limited to only several tasks, MotionGPT [22] treats human motion as a foreign language and leverages the language understanding and zero-shot transfer abilities of pre-trained language models.

Character Control and Animation. Character control involves generating interactive motion sequences based on user instruction signals. One kind of approach [70][72] constructs a graph representing transitions between motion clips and plans motion using graph search. To address the coarse discreteness of these graph-based approaches, alternative methods have been proposed, including frame blending and concatenation [73], low-dimensional latent space learning [74], and motion matching [75], which embeds the task in the feature space; [76] does something similar through a hierarchical setup. Although the control signals for motion control and character animation differ from the instructions in text-to-motion tasks, we still regard the textual commands of conventional human motion generation as a boost for intuitive character control.

Figure 2: Method overview: MotionChain consists of a motion tokenizer \(\mathcal{V_M}\) (Section 3.2), a vision tokenizer \(\mathcal{V_I}\) (Section 3.2), and a vision-motion-aware language model (Section 3.3). By leveraging motion tokens generated by \(\mathcal{V_M}\), alongside visual language token embeddings projected by the vision tokenizer \(\mathcal{V_I}\) and text tokens from the text tokenizer, MotionChain achieves a unified learning paradigm for both motion and linguistic data.

Multi-Modal Language Models.

In the field of computer vision, there has been a recent surge of interest in multi-modal models that can process text along with other modalities, including images, audio, and videos [12], [77][79]. CLIP [32] is an example of such a model, which learns a semantic latent representation that connects images with corresponding language descriptions. While language models have achieved success in various tasks, the development of multi-modal language models capable of handling human motion is still limited. Existing works in computer vision can be broadly categorized into two classes. The first consists of end-to-end trained models explored separately for specific research topics. For example, tasks like vision-language navigation [80], [81] and Habitat [82] require embodied AI agents to follow natural language instructions and take actions to accomplish goals in visual environments. InstructPix2Pix [81] in image translation enables agents to edit images based on human instructions. The second involves systems that coordinate various models using approaches like LangChain or LLMs [1]. Examples of such systems include Visual ChatGPT [83], X-GPT [84], and MMREACT [85]. While these methods focus on building instruction-following agents, we aim to develop an end-to-end trained multimodal model that can perform conversational motion generation tasks via multi-modal inputs with language models.

3 Methods

Figure 3: Data collection overview: Our initial step in collecting the motion reasoning data involves the utilization of human motion captions derived from an existing text-motion dataset. Subsequent to this, the text-motion retrieval model TMR [34] aids in the segmentation of motion pairs into categories based on the similarity between them. With the assistance of ChatGPT, we proceed to craft motion editing task data that correspond to these categorized similarity levels. Incorporating both motion reasoning and editing single-turn tasks, as well as the extensive 14 tasks delineated in [22], we construct a rich multi-modal multi-turn conversation dataset.

To leverage large language data, vision-language data, and vision-motion data for assisting motion-related tasks, we propose a motion-language-vision framework called MotionChain. The framework, as depicted in Figure 2, consists of a multi-modal tokenizer that converts various types of data (text, image, and motion) into discrete tokens (Section 3.2), and a vision-motion-aware language model that comprehends information from different modalities and generates corresponding answers based on input instructions (Section 3.3). Additionally, to simultaneously understand data from multiple modalities, we employ a multi-stage training strategy (Section 3.4) for the training of the multi-modal tokenizer and the motion-language-vision framework.

We first introduce the multi-modal tokenizer, which comprises three branches for processing textual, image, and motion inputs. For textual inputs \(w^{1:N} = \{w^i\}\) of length \(N\) that describe a motion-related question or demand, we employ the SentencePiece model [86] used in previous works [4], [5], [46], [47], which has a vocabulary size of \(K_t\) and is trained on a large number of language datasets. The motion branch consists of a motion encoder \(\mathcal{E_M}\) that encodes a motion sequence \(m^{1:M} = \{x^i\}\) of \(M\) frames into \(L\) motion tokens \(z^{1:L} = \{z^i\}\), where \(L = M/l\) and \(l\) is the temporal downsampling rate on motion frames. It also includes a motion decoder \(\mathcal{D_M}\) that decodes motion tokens back to human motion \(\hat{m}^{1:M}\). The vision branch processes the input image \(X\) with a pre-trained CLIP visual encoder followed by a learnable linear projection, which converts the image into language token embeddings \(H_q\). Given a textual sentence \(w^{1:N}\), a sequence of motion \(m^{1:M}\), and an image condition \(X\), all encoded as language tokens, our vision-motion-aware language model is designed to produce an answer comprising \(L\) tokens, denoted as \(\hat{x}^{1:L} = \{\hat{x}^i\}\). These output tokens can represent either motion sequences \(\hat{x}_m^{1:L}\) or textual descriptions \(\hat{x}_t^{1:L}\), which correspond to human motion \(\hat{m}^{1:M}\) and text \(\hat{w}^{1:L}\), respectively, within the given context.

3.1 Data Collection

With the emergence of text-conditioned motion generation tasks, datasets like KIT [87], BABEL [88], HumanML3D [18] and the more recent Motion-X [89] have been developed. However, these datasets predominantly offer text labels as simple action phrases or captions. Building upon these foundations, MotionGPT [22] introduces an instruction-based motion-language dataset that encapsulates 14 core tasks, including motion prediction, translation, and editing, through thousands of instruction templates in a unified format. Despite this advancement, MotionGPT’s data lack a deep engagement with the nuances of human motion analysis and are limited to single-turn generation tasks without incorporating contextual memory. Inspired by the recent success of GPT models across text-annotation tasks [90], image-annotation tasks [8], and 3D-annotation tasks [15], we propose a data collection methodology that integrates the capabilities of existing LLMs like ChatGPT [1] with the text-motion retrieval model TMR [34] to facilitate motion conversation data collection. In addition to the 14 motion-related tasks in MotionGPT [22], we introduce tasks centered around motion reasoning and motion editing, leveraging contextual insights for a deeper motion analysis.

Utilizing ChatGPT [1], we initiate the collection of motion reasoning data using human motion captions from the text-motion dataset [18], starting with manually designed example queries that explore the contextual scenarios surrounding motions, possible preceding or succeeding actions, the subjects’ roles, and the tools or equipment involved, etc. Following this, we employ TMR [34] for categorizing motions from the dataset into varying similarity levels. For medium-similarity motion pairs, we utilize ChatGPT [1] to generate motion editing directives that enable the transformation of one motion to another. For motions of high similarity, we manually devise tasks aimed at editing their lengths, further enriching the dataset’s versatility and analytical scope.
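
The routing of motion pairs by similarity level can be sketched as follows. This is only an illustration under stated assumptions: the embeddings stand in for TMR features, cosine similarity is assumed as the similarity measure, and the thresholds `high` and `medium` are hypothetical values, since the text does not report them. How low-similarity pairs are handled is not described in the text, so they are simply left unused here.

```python
import numpy as np

def similarity_level(a: np.ndarray, b: np.ndarray,
                     high: float = 0.9, medium: float = 0.6) -> str:
    """Bucket a motion pair by cosine similarity of its embeddings.
    The thresholds are hypothetical; the paper does not state them."""
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    if cos >= high:
        return "high"
    if cos >= medium:
        return "medium"
    return "low"

def editing_task_for(level: str) -> str:
    """Map a similarity level to the data source described in the text."""
    if level == "high":
        return "manual length-editing task"
    if level == "medium":
        return "LLM-generated editing directive"
    return "unused"  # assumption: not described in the text

# Toy 2-D embeddings standing in for retrieval-model features.
anchor = np.array([1.0, 0.0])
levels = [similarity_level(anchor, np.array(v))
          for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0])]
tasks = [editing_task_for(level) for level in levels]
```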

After the collection of single-turn generation tasks, we progress to develop multi-turn conversation data. This involves the deliberate association of initial motion generation tasks with a variety of follow-up tasks randomly chosen among motion translation, reasoning, editing, etc. Following [7], we construct our conversation data in a structured format, as depicted below:


USER: \(X_v\) \(X_s^1\) ASSISTANT: \(X_a^1\) \(<\)/s\(>\)

USER: \(X_v\) \(X_s^2\) ASSISTANT: \(X_a^2\) \(<\)/s\(>\)

USER: \(X_v\) \(X_s^3\) ASSISTANT: \(X_a^3\) \(<\)/s\(>\) ...

Here, \(X_v\) denotes the vision language token embeddings produced by the visual tokenizer, and \(X_s^i\) and \(X_a^i\) denote the source inputs and target answers for round \(i\), respectively. Both sets of tokens originate from the integrated motion-language vocabulary \(V\) and may contain motions, texts, or a blend thereof. Conversations in the dataset contain up to 10 generation turns; for clarity, only three turns are shown here. MotionChain is trained to predict answers and learns when to stop generation by outputting the end-of-sentence flag \(<\)/s\(>\), based on the current instruction and all preceding questions and answers. In the computation of the loss, as defined in Equation 3, only the green tokens are utilized.
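
A minimal sketch of how such a conversation could be flattened into one training sequence together with a loss mask is given below. The placeholder strings (`<X_v>`, `<motion_12>`) and the function name are hypothetical, and the mask assumes that only the answer pieces and the end-of-sentence flag are supervised, which is the usual convention in instruction tuning but is an assumption about this dataset.

```python
from typing import List, Tuple

EOS = "</s>"  # end-of-sentence flag from the template above

def build_conversation(turns: List[Tuple[str, str]],
                       visual: str = "<X_v>") -> Tuple[List[str], List[bool]]:
    """Flatten (source, answer) turns into pieces plus a loss mask.
    The mask is True only on answers and EOS (assumed supervised tokens)."""
    pieces: List[str] = []
    in_loss: List[bool] = []
    for source, answer in turns:
        for text, supervised in (
            ("USER:", False), (visual, False), (source, False),
            ("ASSISTANT:", False), (answer, True), (EOS, True),
        ):
            pieces.append(text)
            in_loss.append(supervised)
    return pieces, in_loss

turns = [
    ("Generate a person walking forward.", "<motion_12><motion_7>"),
    ("What might the person do next?", "They may stop and turn around."),
]
pieces, in_loss = build_conversation(turns)
supervised = [p for p, keep in zip(pieces, in_loss) if keep]
```

In this sketch each turn contributes six pieces, and `supervised` holds exactly the two answers and their stop flags.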

3.2 Multi-modal Tokenizer

The motion tokenizer, denoted as \(\mathcal{V_M}\), is based on the Vector Quantized Variational Autoencoder (VQ-VAE) architecture used in previous studies [21], [22], [24], [53], [65], [91][93]. Once pre-trained, it can represent motion using discrete tokens, facilitating the integration of motion and language. The motion tokenizer consists of a motion encoder \(\mathcal{E_M}\) and a motion decoder \(\mathcal{D_M}\). Initially, the motion encoder \(\mathcal{E_M}\) applies 1D convolutions to the motion features \(m^{1:M}\) along the temporal dimension to obtain latent vectors \(\hat{z}^{1:L} = \mathcal{E_M}(m^{1:M})\). Subsequently, the latent vectors \(\hat{z}\) are quantized and transformed into a collection of codebook entries \(z\). The learnable codebook \(Z = \{{z}^i\}_{i=1}^{K} \subset \mathbb{R}^{d}\) comprises \(K\) latent embedding vectors, each with a dimension of \(d\). The quantization process \(Q(\cdot)\) replaces each latent vector \(\hat{z}^i\) with its nearest codebook entry \(z_k\) in \(Z\), which can be expressed as:

\[z^i = Q(\hat{z}^i) := {\arg \min }_{z_k \in Z}\left\|\hat{z}^i - z_k\right\|_2.\]

We assign \({s}^i\) as the codebook index of motion token \({z}^i\), so the motion tokens \(z^{1:L}\) can be represented as a sequence of indices \({s}^{1:L}=\{{s}^i\}_{i=1}^{L}\). The motion decoder \(\mathcal{D_M}\) projects \({z}^{1:L}=\{{z}^i\}_{i=1}^{L}\) back to the raw motion space, resulting in the motion \(\hat{m}^{1:M}\) with \(M\) frames. Following [21], [22], [24], [53], [93], we adopt three distinct loss functions when training the motion tokenizer:

\[\mathcal{L}_\mathcal{V} = \mathcal{L}_{r} + \mathcal{L}_{e} + \mathcal{L}_{c}\tag{1}\]

where \(\mathcal{L}_{r}\) denotes the reconstruction loss, \(\mathcal{L}_{e}\) denotes the embedding loss, and \(\mathcal{L}_{c}\) denotes the commitment loss.
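
The nearest-entry lookup and the resulting index sequence can be illustrated with a short NumPy sketch. The codebook size and dimension below are toy values chosen for readability, and the function name `quantize` is hypothetical; the encoder, decoder, and the three loss terms are not modeled.

```python
import numpy as np

def quantize(z_hat: np.ndarray, codebook: np.ndarray):
    """Replace each latent row by its nearest codebook entry (L2 distance).
    z_hat: (L, d) latents; codebook: (K, d) entries.
    Returns the quantized latents z (L, d) and their indices s (L,)."""
    diff = z_hat[:, None, :] - codebook[None, :, :]  # (L, K, d)
    dist = np.linalg.norm(diff, axis=-1)             # (L, K)
    s = dist.argmin(axis=1)                          # index of nearest entry
    return codebook[s], s

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))                   # toy K=8, d=4
# Latents that sit very close to entries 3, 3, and 5.
z_hat = codebook[[3, 3, 5]] + 0.01 * rng.normal(size=(3, 4))
z, s = quantize(z_hat, codebook)
```

The index sequence `s` is what the language model later consumes as motion tokens, while `z` is what the decoder receives.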

During multi-turn motion generation, motion continuity between turns is achieved through our motion decoder, which links the motion of the current turn with that of the preceding ones. Taking the composition of two motions as an example, we concatenate the past motion tokens, denoted as \({z}_p^{1:L_p}\), with the tokens representing the current motion, \({z}_c^{1:L_c}\):

\[{z}_{\text{whole}}^{1:(L_p+L_c)} = [{z}_p^{1:L_p},{z}_c^{1:L_c}].\]

The concatenated token sequence is then decoded into a single set of continuous motion features, represented as \(m_{\text{whole}}^{1:M_{\text{whole}}}\). The same procedure extends to composing more than two motions. The comparison results in Table 3 demonstrate that our motion tokenizer can effectively perform motion composition tasks.
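
To see why decoding the joint token sequence can help continuity, consider the toy sketch below. The decoder here is a hypothetical stand-in (upsampling followed by a 3-frame temporal average), not the paper's \(\mathcal{D_M}\); it only mimics the property that a convolutional decoder has a temporal receptive field that can span the boundary between turns.

```python
import numpy as np

def toy_decode(codes: np.ndarray, l: int = 4) -> np.ndarray:
    """Hypothetical stand-in for a motion decoder: upsample each token
    by l frames, then smooth over time with a 3-frame average.
    codes: (L, d) -> frames: (L * l, d)."""
    frames = np.repeat(codes, l, axis=0)
    padded = np.pad(frames, ((1, 1), (0, 0)), mode="edge")
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

def boundary_jump(motion: np.ndarray, boundary: int) -> float:
    """Size of the frame-to-frame change at the turn boundary."""
    return float(np.linalg.norm(motion[boundary] - motion[boundary - 1]))

z_p = np.zeros((3, 2))   # past-turn tokens (toy embeddings)
z_c = np.ones((2, 2))    # current-turn tokens (toy embeddings)

# Independent decoding of each turn, then stitching the frames together.
independent = np.concatenate([toy_decode(z_p), toy_decode(z_c)])
# Joint decoding of the concatenated token sequence.
joint = toy_decode(np.concatenate([z_p, z_c]))

boundary = len(z_p) * 4  # first frame of the current turn
jump_independent = boundary_jump(independent, boundary)
jump_joint = boundary_jump(joint, boundary)
```

With this toy decoder, the joint variant produces a smaller jump at the boundary because the smoothing window sees tokens from both turns.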

The visual tokenizer accepts both image (\(X_I\)) and video (\(X_V\)) inputs. For image input, we employ the CLIP visual encoder, pre-trained on image-text pairs, to derive visual features \(Z_I\). These features are then projected into language token embeddings \(X_v\) via a linear layer, as in previous work [8]. For videos, each sampled frame is encoded through the CLIP visual encoder, resulting in a 3D spatiotemporal feature matrix with additional temporal embeddings. Inspired by [11], we introduce a perceiver module. This module uses a transformer equipped with fixed-length, learnable queries that attend to these visual features to produce a fixed-length output. Subsequently, a linear layer, applied as for image inputs, uses a trainable matrix \(W\) to map the visual features to visual token embeddings \(X_v\), matching the dimensionality of the language model’s word embedding space.
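
The two paths can be sketched in NumPy as follows. This is a simplified illustration rather than the actual architecture: the perceiver is reduced to a single head of cross-attention without learned key/value projections or temporal embeddings, random arrays stand in for CLIP features, and all dimensions are toy values.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def project_image(Z_I: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Linear projection of visual features into the LM embedding space.
    Z_I: (n_tokens, d_v), W: (d_v, d_lm) -> X_v: (n_tokens, d_lm)."""
    return Z_I @ W

def perceiver_pool(frame_feats: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """Single-head cross-attention: a fixed number of queries attends to
    all frame features, so the output length does not depend on the
    number of frames. frame_feats: (T, P, d_v), queries: (Q, d_v)."""
    T, P, d = frame_feats.shape
    keys = frame_feats.reshape(T * P, d)
    attn = softmax(queries @ keys.T / np.sqrt(d), axis=-1)  # (Q, T*P)
    return attn @ keys                                      # (Q, d_v)

rng = np.random.default_rng(0)
d_v, d_lm = 16, 8                       # toy feature / embedding sizes
W = rng.normal(size=(d_v, d_lm))        # trainable projection (toy values)
queries = rng.normal(size=(3, d_v))     # learnable queries (toy values)

X_v_image = project_image(rng.normal(size=(5, d_v)), W)
X_v_short = project_image(perceiver_pool(rng.normal(size=(4, 5, d_v)), queries), W)
X_v_long = project_image(perceiver_pool(rng.normal(size=(9, 5, d_v)), queries), W)
```

The short and long videos yield the same number of visual token embeddings, which is the point of using fixed-length queries.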

Figure 4: Motion Composition Variants: We illustrate the baselines for motion composition during multi-turn motion generation: (a) independent decoding of each turn, (b) separate decoding conditioned on the last few tokens from the prior turn, and (c) decoding with joint motion tokens. Green tokens stand for image condition, blue tokens stand for textual instruction, and orange tokens stand for human motions.

3.3 Motion-aware Language Model

Language models such as Llama [4], [5] and T5 [46], [47] employ the SentencePiece [86] model to encode textual inputs into WordPiece tokens, utilizing a \(K_t\) word piece vocabulary. Unlike prior text-to-motion [20], [21], [24], [52] and motion-to-text [24] methods that process text and motion separately, we merge the text vocabulary \(V_t=\{{v}_t^i\}_{i=1}^{K_t}\) with the motion vocabulary \(V_m=\{{v}_m^i\}_{i=1}^{K_m}\), maintaining the motion tokenizer’s codebook \(Z\) order and including special tokens for boundary demarcation. This creates a unified vocabulary \(V = \{V_t, V_m\}\), enabling the formulation of motion-centric tasks in a universal template, where inputs and outputs share the same vocabulary. For visual input, our visual tokenizer converts images or videos into visual token embeddings \(X_v\), aligning with the language model  [1], [4], [46], [86] token space for integrated representation.
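
One simple way to realize such a unified vocabulary is to append the motion codebook indices after the text ids, as in the sketch below. The sizes `K_T` and `K_M` and the boundary token ids are hypothetical placeholders, since the text does not specify how the ids are laid out.

```python
from typing import List

K_T = 32000              # text vocabulary size (hypothetical toy value)
K_M = 512                # motion codebook size (hypothetical toy value)
SOM_ID = K_T + K_M       # hypothetical start-of-motion boundary token
EOM_ID = K_T + K_M + 1   # hypothetical end-of-motion boundary token

def motion_to_ids(s: List[int]) -> List[int]:
    """Map codebook indices into the unified vocabulary, keeping the
    codebook order and wrapping the span in boundary tokens."""
    return [SOM_ID] + [K_T + i for i in s] + [EOM_ID]

def ids_to_motion(ids: List[int]) -> List[int]:
    """Recover codebook indices from unified-vocabulary ids,
    ignoring text and boundary tokens."""
    return [t - K_T for t in ids if K_T <= t < K_T + K_M]

codes = [3, 3, 5, 511]
unified = motion_to_ids(codes)
mixed = [17, 42] + unified + [99]   # text ids around a motion span
recovered = ids_to_motion(mixed)
```

Because text and motion share one id space, a single output sequence can interleave words and motion tokens.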

For single conditioned generation tasks, our input comprises a sequence of \(N\) tokens \({X_s}=\{{x_s}^i\}_{i=1}^{N}\), where \(x_s\in \{V_t, V_m\}\) represents text, motion, or a combination thereof, drawn from the unified vocabulary. In cases involving image inputs, visual tokens \(X_v\) are prepended to the source token sequence, forming \([X_v, X_s]\). Each interaction round then generates target answer tokens \(X_a\). To facilitate iterative result generation and content retention, our framework operates on multi-turn conversation data \((X_v, X_s^1, X_a^1, X_s^2, X_a^2,\cdots, X_s^T, X_a^T)\), with \(T\) indicating the total turn count. Notably, visual tokens are consistently placed at the forefront of the initial turn’s source tokens. The sequence is organized so that target answer tokens are predicted autoregressively, as shown in Figure 2. Source tokens are processed by the transformer to predict the next token’s probability distribution, formulated as:

\[p_\theta(X_a \mid X_v, X_s )=\prod_i p_\theta\left(x_a^i \mid X_v, X_{s,<i}, X_{a,<i}\right)\tag{2}\]

with \(\theta\) indicating trainable parameters, and \(X_{s,<i}, X_{a,<i}\) the sequences of source and preceding target tokens. The training objective is to maximize the log-likelihood of this distribution:

\[\mathcal{L}_{LM}=-\sum_{i=0}^{L_t-1} \log p_\theta\left(x_a^i \mid X_v, X_{s,<i}, X_{a,<i}\right).\tag{3}\]

By optimizing this objective, MotionChain captures the complex interrelations among images, motion, and text, facilitating accurate target "word" generation.
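
The objective in Equation 3 amounts to a negative log-likelihood summed over answer positions only. A minimal NumPy sketch is shown below; the array layout (one row of next-token scores per position) and the function name are assumptions made for illustration.

```python
import numpy as np

def answer_nll(logits: np.ndarray, targets: np.ndarray,
               is_answer: np.ndarray) -> float:
    """Negative log-likelihood summed over answer tokens only.
    logits: (T, V) next-token scores per position,
    targets: (T,) token ids, is_answer: (T,) boolean mask."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_ll = log_probs[np.arange(len(targets)), targets]
    return float(-(token_ll * is_answer).sum())

# Five positions over a vocabulary of 4; the last two are answer tokens.
logits = np.zeros((5, 4))                  # uniform predictions
targets = np.array([0, 1, 2, 3, 0])
is_answer = np.array([False, False, False, True, True])
loss = answer_nll(logits, targets, is_answer)

# Changing scores at source positions does not change the loss.
perturbed = logits.copy()
perturbed[0, 2] = 10.0
loss_perturbed = answer_nll(perturbed, targets, is_answer)
```

With uniform predictions over four tokens, each supervised position contributes \(\log 4\) to the loss.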

During the inference phase, target tokens are recursively sampled from the model’s predicted distribution \(p_\theta\left(\hat{x_a}^i \mid X_v, X_s, \hat{X}_{a,<i} \right)\), ceasing with the appearance of a special end token. This strategy facilitates a step-by-step target sequence generation, where each token’s probability is conditioned on all previous turns’ sources and targets and current source input.
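
The sampling loop can be sketched as follows. The `next_token_probs` callable is a hypothetical stand-in for the language model, and the toy model below is deterministic so the behavior is easy to follow; a real model would return a full distribution conditioned on the whole conversation history.

```python
from typing import Callable, List, Optional
import numpy as np

def generate(next_token_probs: Callable[[List[int]], np.ndarray],
             context: List[int], eos_id: int, max_new_tokens: int = 64,
             rng: Optional[np.random.Generator] = None) -> List[int]:
    """Sample answer tokens one at a time until the end token appears.
    `context` holds all previous turns plus the current source tokens."""
    if rng is None:
        rng = np.random.default_rng(0)
    answer: List[int] = []
    for _ in range(max_new_tokens):
        probs = next_token_probs(context + answer)
        token = int(rng.choice(len(probs), p=probs))
        if token == eos_id:
            break
        answer.append(token)
    return answer

def toy_model(tokens: List[int], vocab_size: int = 5) -> np.ndarray:
    """Toy stand-in: always predicts (last token + 1) mod vocab_size."""
    probs = np.zeros(vocab_size)
    probs[(tokens[-1] + 1) % vocab_size] = 1.0
    return probs

EOS_ID = 4
history = [0, 1]                               # first-turn source tokens
turn1 = generate(toy_model, history, EOS_ID)
# The next turn conditions on everything so far plus the new source.
history = history + turn1 + [EOS_ID] + [0]
turn2 = generate(toy_model, history, EOS_ID)
```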

3.4 Training Strategy

To facilitate the integration of image and motion comprehension within the language modeling context, we adopt a 3-stage training strategy. (1) The initial stage involves pre-training the motion tokenizer on a corpus of human motion data, in line with [22]. This process establishes the motion vocabulary \(V_m\), which serves as a foundation for encoding human motions as a series of discrete tokens. (2) Subsequently, the motion tokenizer remains frozen while we connect the visual tokenizer to the language model framework. This integration is supported by a suite of supervised objectives, including text-to-motion, motion-to-text, and image-based motion generation, aiming to learn the intricate relationships between images, motion, and language. (3) The final stage is instruction tuning, which refines the model’s capabilities through prompt-based instructions. These instructions are framed within multi-turn conversation sequences, as detailed in Section 3.3, to cover an expanded range of motion-related tasks.

Training of Motion Tokenizer. The initial step involves training the motion tokenizer, guided by the loss objective in Equation 1. This stage enables the tokenizer to represent human motion sequences \(\hat{x}^{1:L}\) as discrete motion tokens, a key step for merging motion data with textual information seamlessly. Once optimized, the motion tokenizer remains frozen.

Motion-language Pre-training Stage. Recent language models [4], [5], [7], [46], [47] are pre-trained on natural language datasets and then fine-tuned with instruction-based phrasing [1], [47]. To augment the model’s ability to discern relationships between images and human motions, we first pre-train our MotionChain using a mix of language, image, and motion datasets. Following the stage 1 training of the motion tokenizer, we have established a unified motion-language vocabulary \(V=\{V_t, V_m\}\), capable of representing motions in discrete token form. Moreover, we keep the visual encoder’s weights in the visual tokenizer fixed, while the linear projection weight \(W\) is jointly optimized with the language model. During this stage, the model undertakes three fundamental single-turn modality translation tasks: text-to-motion, motion-to-text, and image-conditioned motion generation, as outlined in Section 3.1. The primary objective is to maximize the likelihood of the model according to the loss function specified in Equation 3, thereby letting the model learn the relationship between language, vision conditions, and motions.

Instruction Tuning Stage. As described in Section 3.1, we construct a multi-modal, multi-task, and multi-turn motion conversation dataset by augmenting existing text-to-motion [18] and human mesh reconstruction [45] datasets with targeted instructional prompts, leveraging LLMs [1] and the text-motion retrieval model [34] for motion reasoning and editing tasks. Instruction tuning is well established across language models [1], [7], [8], [47] and improves model performance across a wide range of tasks. After instruction tuning, MotionChain can handle more motion-related tasks, including previously unseen ones.
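
The three stages can be summarized by which modules receive gradient updates, as in the sketch below. The module and stage names are hypothetical labels. Stages 1 and 2 follow the description above; the text does not state explicitly which modules are updated during instruction tuning, so stage 3 is assumed to mirror stage 2.

```python
from typing import Dict, List

MODULES = ("motion_tokenizer", "visual_encoder",
           "visual_projection", "language_model")

# True means the module is trained in that stage, False means frozen.
TRAINABLE: Dict[str, Dict[str, bool]] = {
    "stage1_motion_tokenizer": {
        "motion_tokenizer": True, "visual_encoder": False,
        "visual_projection": False, "language_model": False,
    },
    "stage2_pretraining": {
        "motion_tokenizer": False, "visual_encoder": False,
        "visual_projection": True, "language_model": True,
    },
    # Assumption: same trainable set as stage 2 (not stated in the text).
    "stage3_instruction_tuning": {
        "motion_tokenizer": False, "visual_encoder": False,
        "visual_projection": True, "language_model": True,
    },
}

def trainable_modules(stage: str) -> List[str]:
    """List the modules that receive gradient updates in a stage."""
    return [m for m in MODULES if TRAINABLE[stage][m]]

stage_summary = {stage: trainable_modules(stage) for stage in TRAINABLE}
```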

4 Experiments

Our evaluation of MotionChain encompasses comprehensive comparisons across both one-turn motion-related tasks and multi-turn motion generation tasks. We first provide the dataset settings, evaluation criteria, and implementation details in Section 4.1. We then present comparative analyses on the motion reasoning task (Section 4.2) and the temporal motion composition task (Section 4.3). In Section 4.4, we evaluate the choice of motion composition technique and different architectures of the vision tokenizer.

4.1 Experimental Setup

Datasets. For one-turn motion reasoning tasks, we employ our proposed multi-modal multi-turn conversation dataset, built upon HumanML3D [18], which provides 44,970 sequence-level textual descriptions for 14,616 motion sequences obtained from AMASS [94] and HumanAct12 [48]. The dataset is divided into training, testing, and validation sets with a ratio of \(0.8:0.15:0.05\). To evaluate the multi-turn motion generation task, we focus on BABEL [88], which provides textual descriptions for the motions in AMASS [94] with annotated segments that overlap within each sequence, allowing us to evaluate the generation of a sequence of motions or actions. We adopt the text labels processed by [28] and the motion representation of HumanML3D [18], which combines joint velocities, positions, and rotations. Following [28], we consider pairs of actions for simplicity, but MotionChain applies to action or motion sequences of arbitrary length. For the image-conditioned motion generation task, we mainly focus on BEDLAM [45], a large synthetic dataset of realistic moving 3D humans containing more than 200 subjects and 380K frames of paired video and motion.

Evaluation Metrics fall into four groups. (1) Motion quality: We adopt Frechet Inception Distance (FID) as the primary metric. Using feature extractors from prior studies [18], [34], [89], FID measures the distance between the feature distributions of generated and real motions. Following [20], [22], [45], [54], [95], we also adopt MPJPE and PA-MPJPE to measure global and local errors in millimeters, and ACCL to measure acceleration errors, when evaluating the quality of reconstructed motions. (2) Motion diversity: The Diversity (DIV) metric calculates variance across motion features to evaluate generation diversity. (3) Text matching: The precision of text-to-motion matches is quantified by the R Precision metric, based on the feature evaluator [18], [34], [89], and includes an analysis of Top 1/2/3 retrieval accuracy. The Multi-modal Distance (MM Dist) quantifies the semantic gap between motions and texts. (4) Linguistic quality: Following [24], we use linguistic metrics from natural language studies, including Bleu [96], Rouge [97], Cider [98], and BertScore [99], to evaluate the quality of generated motion captions. More detailed benchmark information is provided in the supplementary materials.
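
For reference, two of the motion quality metrics can be written compactly in NumPy. This is a generic sketch of the standard definitions, not the evaluation code used here: the features are random stand-ins for evaluator outputs, and PA-MPJPE (which additionally requires a Procrustes alignment) and ACCL are omitted.

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-joint position error.
    pred, gt: (frames, joints, 3); result is in the input units."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def fid(feat_gen: np.ndarray, feat_real: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets.
    feat_*: (n_samples, feature_dim)."""
    mu_g, mu_r = feat_gen.mean(axis=0), feat_real.mean(axis=0)
    cov_g = np.cov(feat_gen, rowvar=False)
    cov_r = np.cov(feat_real, rowvar=False)
    # Tr(sqrt(cov_g cov_r)) equals the sum of square roots of the
    # eigenvalues of the product, which are real and non-negative.
    eig = np.linalg.eigvals(cov_g @ cov_r)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return float(((mu_g - mu_r) ** 2).sum()
                 + np.trace(cov_g) + np.trace(cov_r) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 4))           # toy evaluator features
fid_same = fid(feats, feats)                # identical sets
fid_shift = fid(feats + 1.0, feats)         # mean shifted by 1 per dim

gt = rng.normal(size=(10, 22, 3))           # toy joints (22 per frame)
pred = gt + np.array([3.0, 4.0, 0.0])       # constant 5-unit offset
error = mpjpe(pred, gt)
```

Shifting every feature by one unit in each of the four dimensions changes only the mean term, so the distance equals the squared shift.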

Implementation Details. We set the codebook of the motion tokenizer to \(Z\in\mathbb{R}^{512\times1024}\) for most experiments. The motion encoder, denoted as \(\mathcal{E_M}\), uses a temporal downsampling rate of \(l=4\). Our vision tokenizer incorporates a frozen Vision Transformer (ViT-L/14) [32] as the visual encoder for most experiments. Additionally, for comprehensive ablation studies, we explore the use of both a frozen vision encoder and a Q-former from BLIP-2 [13] as a vision tokenizer. We mainly utilize Flan-T5-base [47] as the underlying architecture for our language model. All our models employ the AdamW [100] optimizer with \([\beta_1,\beta_2]=[0.9,0.99]\) for training. The motion tokenizers are trained with a learning rate of \(10^{-4}\), a cosine annealing scheduler, and a mini-batch size of 256. Our language models based on Flan-T5-base [47] use a learning rate of \(10^{-4}\) with a cosine annealing scheduler and a mini-batch size of 16 in both the pre-training stage and the instruction tuning stage. The motion tokenizer undergoes 10000 epochs of training, while the language model undergoes 500 epochs during the pre-training stage and another 50 epochs during the instruction tuning stage. Most models are trained on 8 Tesla V100 GPUs.
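
For readers unfamiliar with cosine annealing, the schedule can be sketched as below. The peak learning rate of \(10^{-4}\) comes from the text; the minimum learning rate of zero, the absence of warm-up or restarts, and the step count are assumptions, since the text does not specify them.

```python
import math

def cosine_annealed_lr(step: int, total_steps: int,
                       lr_max: float = 1e-4, lr_min: float = 0.0) -> float:
    """Cosine annealing from lr_max at step 0 to lr_min at total_steps.
    lr_min = 0 and no warm-up are assumptions, not stated in the text."""
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

total = 1000  # hypothetical number of optimizer steps
schedule = [cosine_annealed_lr(s, total) for s in (0, 500, 1000)]
```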

4.2 Comparisons on Motion Reasoning.↩︎

In Section 3.1, we introduce a multi-modal motion conversation dataset, enriched with motion reasoning data generated with the help of ChatGPT [1]. This task evaluates the model's reasoning capabilities, where a motion sequence or its corresponding textual description serves as input. We compare MotionChain, which integrates motion perception, against contemporary large language models (LLMs) that process text only. The compared LLMs are evaluated with their original pre-trained weights. The results in Table 1 show that MotionChain exhibits superior motion reasoning proficiency, benefiting from its integrated motion perception.

Table 1: Comparison of motion reasoning on the test set of our conversation dataset. Our proposed MotionChain is fine-tuned on motion reasoning tasks, while the results of the other methods are generated with their pre-trained weights. \(\text{Length}_{\text{avg}}\) represents the average number of words in the generated answers to all questions. We adopt metrics commonly used in natural language processing tasks for evaluation.
Methods Params \(\text{Length}_{\text{avg}}\) Bleu@1\(\uparrow\) Bleu@4\(\uparrow\) Rouge\(\uparrow\) Cider\(\uparrow\) BertScore\(\uparrow\)
Flan-t5-base [47] 250M \(8.34\) \(4.64\) \(1.78\) \(15.32\) \(15.93\) \(3.45\)
Flan-t5-large [47] 780M \(11.95\) \(12.18\) \(4.83\) \(22.81\) \(15.02\) \(14.19\)
Flan-t5-xl [47] 3B \(9.09\) \(8.54\) \(4.01\) \(24.89\) \(15.03\) \(18.34\)
Llama-2-7b [5] 7B \(130.84\) \(11.12\) \(3.67\) \(19.14\) \(1.04\) \(6.81\)
Vicuna-1.5-7b [7] 7B \(71.49\) \(19.27\) \(7.39\) \(25.75\) \(5.44\) \(19.05\)
Vicuna-1.5-13b [7] 13B \(84.74\) \(17.20\) \(6.53\) \(24.18\) \(7.77\) \(18.00\)
MotionChain 280M 22.17 \(\boldsymbol{37.92}\) \(\boldsymbol{19.19}\) \(\boldsymbol{38.05}\) \(\boldsymbol{24.53}\) \(\boldsymbol{32.24}\)
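To make the relation between answer length and the Bleu@1 column more concrete, here is a simplified single-reference sketch of BLEU-1 (clipped unigram precision times a brevity penalty). It is an illustration only: whitespace tokenization and the single reference are assumptions, and the paper's numbers are computed with the standard metric implementations cited in Section 4.1.

```python
import math
from collections import Counter


def bleu1(candidate, reference):
    """Simplified BLEU-1 against one reference, using whitespace tokens."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Each candidate word is credited at most as often as it occurs in the reference.
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = clipped / len(cand)
    # Brevity penalty: only candidates shorter than the reference are penalized.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1.0 - len(ref) / len(cand))
    return bp * precision


short = bleu1("the person walks forward", "the person walks forward slowly")
rambling = bleu1("the the the the", "the person")
```

A long answer that repeats or adds words not in the reference lowers the unigram precision, which is one reason verbose outputs can score poorly on this metric.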

4.3 Comparisons on Temporal Composition.↩︎

The temporal motion composition task involves generating a continuous motion sequence from two consecutive actions. We conduct our experiments following the settings of TEACH [28] and use the validation set of BABEL [88], a subset of AMASS [94]. Additionally, we convert the AMASS motions into the format proposed by HumanML3D [18] and train MotionChain on the action-to-motion task. To compare with TEACH, we first use the officially provided pre-trained model to sample motions on the validation set 20 times. We then post-process these motions, which are represented in SMPL [101], into the HumanML3D format. The performance of MotionChain is summarized in Table 2. As evaluating generative models quantitatively is challenging, we also provide qualitative comparisons in the supplementary materials.

Table 2: Comparison of temporal motion composition on BABEL [88]. We evaluate the state-of-the-art temporal motion composition method TEACH [28] and report the 95% confidence interval over 20 runs. (\(cf.\) Section 4.1 for notations.)
Methods Diversity MPJPE\(\downarrow\) PA-MPJPE\(\downarrow\) ACCL\(\downarrow\)
Real \(15.74^{\pm.149}\) - - -
TEACH [28] \({27.11}^{\pm.159}\) \(979.21^{\pm.215}\) \(933.32^{\pm.254}\) \(23.02^{\pm.018}\)
MotionChain \(43.25^{\pm.159}\) \(\boldsymbol{276.05}^{\pm6.72}\) \(\boldsymbol{53.72}^{\pm.580}\) \(\boldsymbol{7.11}^{\pm0.100}\)
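The \(\pm\) values in such tables are 95% confidence intervals over repeated runs. A minimal sketch of how a mean and its interval half-width can be obtained from per-run scores is given below. It assumes a normal approximation with the sample standard deviation; the exact estimator used for the reported numbers is not stated in the paper, and the scores are hypothetical.

```python
import math
import statistics


def mean_with_ci95(values):
    """Return (mean, half-width of the 95% confidence interval)."""
    mean = statistics.mean(values)
    # Normal approximation: 1.96 standard errors on each side of the mean.
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width


# Hypothetical metric values from four repeated evaluation runs.
run_scores = [1.0, 2.0, 3.0, 4.0]
mean, half_width = mean_with_ci95(run_scores)
```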

4.4 Ablation Studies↩︎

MotionChain enables multi-modal motion conversation using two main techniques. The first generates a smooth sequence of motions by concatenating motion tokens, which are then decoded back to motion by the motion decoder \(\mathcal{D_M}\). The second processes visual input through a vision tokenizer, which consists of a frozen vision encoder and a trainable linear projection. To evaluate the effectiveness of these two designs, we compare them with other variants. More detailed ablation studies can be found in the supplementary materials.

Table 3: Evaluation of motion composition methods on HumanML3D [18]. Here Independent, Past-condition, and Tokens-joint stand for different motion composition variants during multi-turn motion conversation, as illustrated in 4.
Method MPJPE\(\downarrow\) PA-MPJPE\(\downarrow\) ACCL\(\downarrow\) Diversity
Independent \(350.79\) \(102.97\) \(11.40\) \(6.47\)
Past-condition \(232.46\) \(46.15\) \(6.18\) \(6.01\)
Tokens-joint \(\boldsymbol{108.77}\) \(\boldsymbol{18.85}\) \(\boldsymbol{2.26}\) \({5.56}\)

Motion Composition Mechanism. Apart from the joint token concatenation mechanism, we also evaluate other temporal motion composition variants, e.g., Motion-cat, which concatenates motions at the final motion level rather than at the token level. Experimental results in Table 3 show that jointly concatenating motion tokens achieves markedly better performance than the other variants. For further information regarding the implementation of these variants, please refer to the supplementary materials.

Image Tokenizer Architecture. MotionChain connects the frozen vision encoder to the language model through a linear layer. However, previous vision-language works [11], [13] also demonstrate the effectiveness of other kinds of visual-aligning modules. Here we consider two other vision tokenizer variants: (a) inspired by [11], [102], [103], we introduce a perceiver module that incorporates a transformer receiving a predefined number of latent input queries. These queries cross-attend to the visual features, enabling effective information exchange. (b) We directly adopt the pre-trained Q-former from BLIP-2 [13] to align visual inputs with the language model. We evaluate the different architectures with a single human image given as the first-frame condition and as the last-frame condition, separately. Experimental results in Table 4 show that a lightweight linear projection is sufficient for comprehending the human pose from visual input. Additional details about the implementation of these vision tokenizers can be found in the supplementary materials.

Table 4: Evaluation of vision tokenizer architecture on Bedlam [89]. We implement three different architectures, including Q-former, Perceiver, and Linear. We evaluate these results with the metrics in motion reconstruction. Additional information regarding the implementation is in the supplementary materials. (\(cf.\) 2 for notations.)
Architecture First-frame MPJPE \(\downarrow\) First-frame PA-MPJPE \(\downarrow\) Last-frame MPJPE \(\downarrow\) Last-frame PA-MPJPE \(\downarrow\)
Q-former \(195.49\) \(86.56\) \(134.73\) \(57.17\)
Perceiver \(185.61\) \(99.21\) \(134.89\) \(57.58\)
Linear \(\boldsymbol{144.37}\) \(\boldsymbol{76.48}\) \(\boldsymbol{133.73}\) \(\boldsymbol{56.73}\)

5 Conclusion and Limitation↩︎

Limitation. As an attempt to explore conversational human motion generation with visual language models, the proposed MotionChain still has the following limitations. Like other language models, MotionChain is a non-deterministic generative model, whereas traditional or neural motion controllers [104], [105] are mostly deterministic and sensitive to control signals. Besides, our method can only generate motion for articulated human bodies, excluding other human parts such as faces [106][108] and hands [109], [110], [111]. Although we utilize vision, language, and motion as multimodal conditional inputs akin to human perception, MotionChain remains limited with respect to the collision signals needed for human-object and human-scene interactions [112][114].

Conclusion. We summarize the proposed MotionChain as a conversational human motion controller that generates continuous and long-term human motion through multimodal prompts. Compared to one-turn motion generation methods [19], [22], [29], MotionChain produces more contextually rich generations and can achieve a step-by-step process of human task execution for humanoid robotics and game agents. By leveraging large-scale language, vision-language, and vision-motion data to assist motion-related generation tasks, MotionChain comprehends each instruction in a multi-turn conversation and generates human motions following these prompts. Extensive experiments validate the efficacy of MotionChain, demonstrating state-of-the-art performance in conversational motion generation, as well as more intuitive ways of controlling and interacting with virtual humans.

6 Acknowledgment↩︎

This work is supported by National Natural Science Foundation of China (No. 62071127, and 62101137), National Key Research and Development Program of China (No. 2022ZD0160100), Shanghai Natural Science Foundation (No. 23ZR1402900), Shanghai Municipal Science and Technology Major Project (No.2021SHZDZX0103). The computations in this research were performed using the CFFF platform of Fudan University.


This appendix provides several additional experiments (7), more qualitative results (8), model implementation details (9), evaluations of inference time (10), the protocol for the motion conversation evaluation (11), details of motion representations (12), and metric definitions (13).

Video. We provide supplemental videos on our GitHub page. In these videos, we show 1) examples of motion conversation, 2) comparisons of text-based motion generation, and 3) comparisons of motion reasoning. We suggest that the reader watch the videos for dynamic motion results.

Code will be available on our GitHub page. We provide example code files, which include the training and evaluation process of our MotionChain models.

7 Additional Experiments↩︎

We conducted a comprehensive series of experiments to further evaluate the efficacy of the proposed MotionChain models. Specifically, we report comparisons on text-to-motion (7.1), motion-to-text (7.2), and motion prediction (7.3) on the HumanML3D [18] dataset. Additionally, we present ablation studies on the effectiveness of our motion tokenizer (7.4) and on the integration of motion tokens within the language model (7.5).

7.1 Comparisons on Text-to-Motion↩︎

The text-to-motion task showcases MotionChain's capability to generate human-like movements from textual inputs. We evaluate MotionChain against current state-of-the-art methods [18][22], [24] on the HumanML3D [18] dataset according to established metrics [18]. The evaluation results, reported with a 95% confidence interval from 20 runs, largely draw on numbers reported in the cited works. The comparative outcomes, summarized in Table 5, demonstrate MotionChain's competitive performance across numerous metrics.

Table 5: Comparison of text-to-motion on HumanML3D [18]. The empty MModality indicates Real motion is deterministic. Pre-trained and Fine-tuned indicate uniform motion-language pre-training and specific fine-tuning on this task. The arrows (\(\rightarrow\)) indicate that closer to Real is desirable. Bold and underline indicate the best and the second best result on text-to-motion task.
Methods RPrecision Top1\(\uparrow\) RPrecision Top2\(\uparrow\) RPrecision Top3\(\uparrow\) FID\(\downarrow\) MMDist\(\downarrow\) Diversity\(\rightarrow\) MModality\(\uparrow\)
Real \(0.511^{\pm.003}\) \(0.703^{\pm.003}\) \(0.797^{\pm.002}\) \(0.002^{\pm.000}\) \(2.974^{\pm.008}\) \(9.503^{\pm.065}\) -
TM2T [24] \(0.424^{\pm.003}\) \(0.618^{\pm.003}\) \(0.729^{\pm.002}\) \(1.501^{\pm.017}\) \(3.467^{\pm.011}\) \(8.589^{\pm.076}\) \(\underline{2.424}^{\pm.093}\)
T2M [18] \(0.457^{\pm.002}\) \(0.639^{\pm.003}\) \(0.740^{\pm.003}\) \(1.067^{\pm.002}\) \(3.340^{\pm.008}\) \(9.188^{\pm.002}\) \(2.090^{\pm.083}\)
MotionDiffuse [29] \(\underline{0.491}^{\pm.001}\) \(\underline{0.681}^{\pm.001}\) \(\underline{0.782}^{\pm.001}\) \(0.630^{\pm.001}\) \({3.113}^{\pm.001}\) \({9.410}^{\pm.049}\) \(1.553^{\pm.042}\)
MDM [19] \(0.320^{\pm.005}\) \(0.498^{\pm.004}\) \(0.611^{\pm.007}\) \({0.544}^{\pm.044}\) \(5.566^{\pm.027}\) \({9.559}^{\pm.086}\) \(\underline{2.799}^{\pm.072}\)
MLD [20] \({0.481}^{\pm.003}\) \({0.673}^{\pm.003}\) \({0.772}^{\pm.002}\) \({0.473}^{\pm.013}\) \({3.196}^{\pm.010}\) \(9.724^{\pm.082}\) \({2.413}^{\pm.079}\)
T2M-GPT [21] \(\underline{0.491}^{\pm.003}\) \({0.680}^{\pm.003}\) \({0.775}^{\pm.002}\) \(\boldsymbol{0.116}^{\pm.004}\) \({3.118}^{\pm.011}\) \(9.761^{\pm.081}\) \(1.856^{\pm.011}\)
MotionGPT [22] \({0.492}^{\pm.003}\) \(\underline{0.681}^{\pm.003}\) \({0.778}^{\pm.002}\) \(\underline{0.232}^{\pm.008}\) \({3.096}^{\pm.008}\) \(\boldsymbol{9.528}^{\pm.071}\) \({2.008}^{\pm.084}\)
MotionChain \(\boldsymbol{0.504}^{\pm.003}\) \(\boldsymbol{0.695}^{\pm.003}\) \(\boldsymbol{0.790}^{\pm.003}\) \({0.248}^{\pm.009}\) \(\boldsymbol{3.033}^{\pm.010}\) \(\underline{9.470}^{\pm.075}\) \({1.715}^{\pm.066}\)

7.2 Comparisons on Motion-to-Text↩︎

In the motion-to-text task, the goal is to generate descriptive text from sequences of human motion. We evaluate the proposed MotionChain against TM2T [24] and MotionGPT [22] on the HumanML3D dataset, adhering to the evaluation metrics used in [22], [24]. Following [22], we use the original ground truth texts for evaluation, ensuring a more comprehensive assessment. The results in the table below demonstrate that MotionChain outperforms the recent methods in generating text descriptions of human motions on most benchmarks.

Table: Comparison of motion captioning on HumanML3D [18]. The evaluation metrics follow [24], while we use the ground truth texts without pre-processing for linguistic metrics calculation. Bold indicates the best.
Methods \(\text{Length}_{\text{avg}}\uparrow\) Bleu@1\(\uparrow\) Bleu@4\(\uparrow\) Rouge\(\uparrow\) Cider\(\uparrow\) BertScore\(\uparrow\)
Real \(12.75\) - - - - -
TM2T [24] \(10.67\) \(\boldsymbol{48.9}\) \(7.00\) \({38.1}\) \(16.8\) \({32.2}\)
MotionGPT [22] \(\boldsymbol{13.04}\) \(48.2\) \({12.47}\) \(37.4\) \({29.2}\) \({32.4}\)
MotionChain \({12.37}\) \(48.1\) \(\boldsymbol{12.56}\) \(\boldsymbol{39.9}\) \(\boldsymbol{33.7}\) \(\boldsymbol{36.9}\)

7.3 Comparisons on Motion Completion.↩︎

In accordance with MotionGPT [22], we treat motion prediction as part of a collective task referred to as general motion completion. To assess the motion completion capability of MotionChain, we use a subset of the AMASS dataset [94], which consists solely of motion data. For the motion prediction task, we use only the initial 20% of the motion sequence as the condition. We evaluate MotionChain using the same settings as in [22]. The motion completion results in Table 6 indicate that MotionChain achieves lower ADE and FDE. That is, both the mean and the last-frame L2 distances between the predicted and ground-truth motions are smaller.

Table 6: Comparison of motion prediction and motion in-between on part of the AMASS [94] dataset using motion data only. FID indicates motion quality and Diversity (DIV) indicates motion diversity within each condition. ADE and FDE are joint distances between generation and ground truth.
Methods Motion Prediction
\(\text{FID}\downarrow\) Diversity\(\uparrow\) ADE\(\downarrow\) FDE\(\downarrow\)
Real \(0.002\) \(9.503\) - -
MDM[19] \(6.031\) \(7.813\) \(5.446\) \(8.561\)
T2M-GPT[21] \(2.056\) \(8.635\) \(6.161\) \(8.302\)
MotionGPT [22] \(\boldsymbol{0.905}\) \(\boldsymbol{8.972}\) \({4.745}\) \({6.040}\)
MotionChain \({1.053}\) \({8.802}\) \(\boldsymbol{4.388}\) \(\boldsymbol{5.401}\)

7.4 Ablation on Motion Tokenizer.↩︎

We conducted an ablation study on the motion tokenizer \(\mathcal{V}\) of the MotionChain model, focusing on the impact of varying the size \(K\) and dimension \(d\) of the motion codebooks, and the number of residual quantizer layers \(Q\). Additionally, we benchmarked our VQ-VAE implementation against previous work [17], [20], [95], as shown in Table 8. This comparison indicates the competitive motion reconstruction quality of our VQ-VAE, which attains the best FID in Table 8. Based on this ablation study, and taking the input-length limit of T5-series models into account, we chose \(Q=4, K=512, d=1024\) for the majority of our experiments.
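To illustrate the roles of \(Q\), \(K\), and \(d\), the following is a minimal sketch of residual vector quantization: each of the \(Q\) layers picks the nearest of its \(K\) codes (each of dimension \(d\)) for the residual left by the previous layers. The toy codebooks (\(Q=3\), \(K=3\), \(d=1\)) and function names are hypothetical, and this is not the tokenizer implementation used in the paper.

```python
def nearest(codebook, vec):
    """Index of the code with the smallest squared distance to vec."""
    def sqdist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


def residual_quantize(vec, codebooks):
    """Quantize vec with one codebook per layer; returns (indices, reconstruction)."""
    residual = list(vec)
    recon = [0.0] * len(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        code = codebook[idx]
        indices.append(idx)
        recon = [r + c for r, c in zip(recon, code)]
        residual = [r - c for r, c in zip(residual, code)]  # left for the next layer
    return indices, recon


# Toy codebooks: each layer refines the previous one at a finer scale.
toy_codebooks = [
    [[-1.0], [0.0], [1.0]],
    [[-0.25], [0.0], [0.25]],
    [[-0.0625], [0.0], [0.0625]],
]
indices, recon = residual_quantize([0.9], toy_codebooks)
```

In this toy case the first layer captures the coarse value and later layers reduce the remaining error, which mirrors why more residual layers tend to lower reconstruction error at the cost of more tokens per frame.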

Table 7: Comparison of text-to-motion on HumanML3D [18]. The empty MModality indicates Real motion is deterministic. Pre-trained and Fine-tuned indicate uniform motion-language pre-training and specific fine-tuning on this task. The arrows (\(\rightarrow\)) indicate that closer to Real is desirable. Bold and underline indicate the best and the second best result on text-to-motion task.
Methods Motion Token Numbers RPrecision Top1\(\uparrow\) RPrecision Top2\(\uparrow\) RPrecision Top3\(\uparrow\) FID\(\downarrow\) MMDist\(\downarrow\) Diversity\(\rightarrow\) MModality\(\uparrow\)
Real - \(0.511^{\pm.003}\) \(0.703^{\pm.003}\) \(0.797^{\pm.002}\) \(0.002^{\pm.000}\) \(2.974^{\pm.008}\) \(9.503^{\pm.065}\) -
Shared \(V_m\) \(0.496^{\pm.003}\) \(0.686^{\pm.003}\) \(0.784^{\pm.002}\) \(0.291^{\pm.012}\) \(3.067^{\pm.011}\) \(9.394^{\pm.075}\) \(\boldsymbol{2.072}^{\pm.080}\)
Independent \(V_m\times Q\) \(\boldsymbol{0.504}^{\pm.003}\) \(\boldsymbol{0.695}^{\pm.003}\) \(\boldsymbol{0.790}^{\pm.003}\) \(\boldsymbol{0.248}^{\pm.009}\) \(\boldsymbol{3.033}^{\pm.010}\) \(\boldsymbol{9.470}^{\pm.075}\) \({1.715}^{\pm.066}\)

Table 8: Evaluation of our motion tokenizer on the motion part of the HumanML3D [18] dataset. We follow MLD [20] to evaluate our VQ-VAE model \(\mathcal{V}\): MPJPE and PAMPJPE are measured in millimeters. ACCL indicates acceleration error. We evaluate FID and Diversity the same as Tab. 3. The baselines of VPoser-t [95] and ACTOR [17] are borrowed from MLD. \(K\) indicates the codebook size, \(d\) indicates the codebook dimension, and \(Q\) indicates the number of Residual-VQ layers.
Method Reconstruction
MPJPE\(\downarrow\) PAMPJPE\(\downarrow\) FID\(\downarrow\) DIV\(\rightarrow\)
Real - - \(0.002\) \(9.503\)
VPoser-t [95] \(75.6\) \(48.6\) \(1.430\) \(8.336\)
ACTOR [17] \(65.3\) \(41.0\) \(0.341\) \(\boldsymbol{9.569}\)
MLD-1 [20] \(\boldsymbol{54.4}\) \(41.6\) \(0.247\) \(9.630\)
MotionGPT [22] \(55.8\) \(\boldsymbol{40.1}\) \({0.067}\) \(9.675\)
MotionChain \({63.1}\) \({43.4}\) \(\boldsymbol{0.014}\) \({9.157}\)
\(Q=4, K=128, d=512\) \(71.8\) \(51.2\) \(0.037\) \(9.098\)
\(Q=4, K=256, d=512\) \(70.4\) \(48.5\) \(0.051\) \(9.004\)
\(Q=4, K=512, d=512\) \(69.5\) \(46.5\) \(\boldsymbol{0.025}\) \(9.015\)
\(Q=4, K=1024, d=512\) \(\boldsymbol{65.9}\) \(\boldsymbol{43.9}\) \(0.041\) \(\boldsymbol{9.310}\)
\(Q=2, K=512, d=512\) \(79.7\) \(56.9\) \(0.081\) \(9.162\)
\(Q=4, K=512, d=512\) \(69.5\) \(46.5\) \(0.025\) \(9.015\)
\(Q=8, K=512, d=512\) \(49.7\) \(38.6\) \(\boldsymbol{0.025}\) \(\boldsymbol{9.213}\)
\(Q=16, K=512, d=512\) \(\boldsymbol{48.4}\) \(\boldsymbol{38.4}\) \({0.026}\) \({9.075}\)
\(Q=4, K=512, d=128\) \({114.5}\) \({79.7}\) \({1.698}\) \(8.344\)
\(Q=4, K=512, d=256\) \(83.9\) \(59.7\) \(0.560\) \(8.782\)
\(Q=4, K=512, d=512\) \(69.5\) \(46.5\) \(0.052\) \(9.015\)
\(Q=4, K=512, d=1024\) \(\boldsymbol{63.1}\) \(\boldsymbol{43.4}\) \(\boldsymbol{0.014}\) \(\boldsymbol{9.157}\)

7.5 Ablation on Motion Tokens.↩︎

Following our analysis of motion codebooks, we turn to the strategy of sharing the motion vocabulary \(V_m\) within the language model backbone. Specifically, we compare sharing motion codes across the residual quantization layers against giving each layer its own tokens. This amounts to comparing \(V_m\) newly added tokens with \(V_m\times Q\) newly added tokens in the language model. Our experiment in Table 7, based on the text-to-motion task on the HumanML3D [18] dataset, shows that the best performance is achieved when motion codes are not shared across quantization layers.
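The two vocabulary strategies can be sketched as a mapping from a (layer, code index) pair to a language-model token id. The sketch assumes motion tokens are appended after the text vocabulary and that \(V_m\) equals the codebook size; the text vocabulary size used here is a placeholder, and all names are hypothetical.

```python
def num_new_tokens(codebook_size, num_layers, shared):
    """V_m tokens if codes are shared across layers, V_m x Q otherwise."""
    return codebook_size if shared else codebook_size * num_layers


def motion_token_id(code_index, layer, text_vocab_size, codebook_size, shared):
    """Token id for code `code_index` of residual layer `layer` (0-based)."""
    if not 0 <= code_index < codebook_size:
        raise ValueError("code index out of range")
    # Independent vocabularies give each layer its own block of ids.
    offset = 0 if shared else layer * codebook_size
    return text_vocab_size + offset + code_index


TEXT_VOCAB = 32_000  # placeholder value, not the actual Flan-T5 vocabulary size
shared_id = motion_token_id(5, 2, TEXT_VOCAB, 512, shared=True)
independent_id = motion_token_id(5, 2, TEXT_VOCAB, 512, shared=False)
```

With \(K=512\) and \(Q=4\), sharing adds 512 tokens while the independent variant adds 2048, so the same code index in different layers maps to distinct embeddings only in the independent case.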

8 Qualitative Results↩︎

We visualize our result gallery on motion conversations (\(cf.\) Figure 6) and some qualitative results on the comparison of text-to-motion (\(cf.\) Figure 7) and motion reasoning (\(cf.\) Figure 11).

Figure 5: No caption

Figure 6: The gallery showcases the results of our MotionChain model. The supervision of MotionChain is based on our conversational motion-language dataset (see Appendix 11), which builds upon previous motion datasets [18], [88]. For a more dynamic visualization, we recommend referring to our supplemental video.

Figure 7: Comparison of text-driven motion generation methods on the HumanML3D dataset [18]. In the visualizations, misaligned motions are highlighted with red words and boxes, while the characters are color-coded from light to dark to indicate the progression of time.

Figure 8: No caption

Figure 9: No caption

Figure 10: No caption

Figure 11: Comparison on motion reasoning question-answer. The MotionChain is trained on our conversation dataset based on HumanML3D [18]. The results demonstrate that our MotionChain shows promising text and motion understanding.

9 Implementation Details↩︎

We provide detailed explanations regarding the implementation of motion composition (9.1) and the image tokenizer (9.2).

9.1 Details of Temporal Motion Composition↩︎

To investigate the temporal motion composition abilities of the MotionChain model, we conduct a pairwise action composition experiment on the BABEL dataset [88], following the methodology of TEACH [28]. For simplicity, we consider pairs of actions, but MotionChain can handle sequences of actions/motions of any length. During training, when two segments overlap, we evenly distribute the overlapping frames between the two segments that form the pair. Notably, the majority of the pair data (approximately 70%) is generated from overlapping segments rather than transitions. In the event of a transition, we concatenate the transition with the second segment. Instead of training a MotionChain model from scratch on BABEL [88], we start from a MotionChain model pre-trained on HumanML3D [18]. We convert the motion data in BABEL [88] into the format used in HumanML3D [18], and then fine-tune the model on BABEL [88] using prompts that incorporate memory, as demonstrated below:


USER: Please assume the role of an Human Motion Language translator. I will use English, you should translate it, and respond in Human Motion Language. My first request is "\(<\)label1\(>\)"

ASSISTANT: \(<\)motion1\(>\)

USER: Please assume the role of a Human Motion Language translator. I will use English, you should translate it, and respond in Human Motion Language. In the last round I asked you to translate "\(<\)label1\(>\)", and your answer is \(<\)motion1\(>\). Now my second request is "\(<\)label2\(>\)"

ASSISTANT: \(<\)motion2\(>\)
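The two-round prompt above can be assembled programmatically. The sketch below is a minimal illustration for the pairwise setting: the helper name `build_user_turn` is hypothetical, the instruction text is copied from the second-round template, and the motion placeholder stands for the motion tokens of the previous answer.

```python
PREFIX = (
    "Please assume the role of a Human Motion Language translator. "
    "I will use English, you should translate it, and respond in Human Motion Language. "
)


def build_user_turn(label, history):
    """history: list of (label, motion) pairs from earlier rounds (empty for round one)."""
    if not history:
        return PREFIX + f'My first request is "{label}"'
    prev_label, prev_motion = history[-1]
    # The previous request and answer are repeated so the model can condition on them.
    return (
        PREFIX
        + f'In the last round I asked you to translate "{prev_label}", '
        + f'and your answer is {prev_motion}. Now my second request is "{label}"'
    )


first_turn = build_user_turn("walk", [])
second_turn = build_user_turn("sit down", [("walk", "<motion1>")])
```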

For comparison with TEACH [28], we employed the TEACH model that was pre-trained on the BABEL dataset [88] to generate motion samples 20 times on the validation set. Subsequently, we converted the generated motion, originally in SMPL-H format [101], into the HumanML3D format.

We also examine the influence of various motion composition mechanisms on the generated complete motion sequences, as presented in Table 3. The "Independent" mechanism refers to the direct concatenation of independently generated motion sequences without any additional processing. On the other hand, the "Tokens-joint" mechanism involves concatenating motion tokens and decoding them using the VQ decoder, which results in a more coherent and natural sequence of movements.
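The difference between the "Independent" and "Tokens-joint" mechanisms can be illustrated with a toy decoder. The sketch assumes scalar "motions" and a stand-in decoder that upsamples each token by the downsampling rate \(l=4\) and then smooths over neighboring frames, loosely mimicking a decoder with a temporal receptive field. It is not the VQ decoder \(\mathcal{D_M}\) itself.

```python
DOWNSAMPLE = 4  # temporal downsampling rate l = 4


def toy_decode(tokens):
    """Stand-in decoder: upsample each token, then smooth with a 3-frame window."""
    frames = [float(t) for t in tokens for _ in range(DOWNSAMPLE)]
    smoothed = []
    for i in range(len(frames)):
        window = frames[max(0, i - 1): i + 2]
        smoothed.append(sum(window) / len(window))
    return smoothed


def max_jump(frames):
    """Largest change between consecutive frames (a crude smoothness measure)."""
    return max(abs(b - a) for a, b in zip(frames, frames[1:]))


first, second = [0, 0], [8, 8]  # hypothetical token sequences of two turns
independent = toy_decode(first) + toy_decode(second)  # decode, then concatenate
tokens_joint = toy_decode(first + second)             # concatenate, then decode
```

Because the joint variant lets the decoder see tokens on both sides of the boundary, the transition is spread over several frames instead of appearing as a single discontinuity.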

9.2 Details of Image Tokenizer↩︎

We explore three different architectural designs for image tokenizers:

(a) Linear: In this design, we connect the frozen vision encoder CLIP ViT-L/14 [32] to the language model using a linear layer. The output of the vision encoder is projected to the same dimension as the word embeddings of the language model and is inserted before the text or motion token embeddings.

(b) Perceiver: This design incorporates a perceiver module with an architecture similar to Flamingo [11]. The perceiver module includes a transformer that receives a predefined number of latent input queries. These queries are then projected to the same dimension as the word embeddings of the language model and are inserted before the text or motion token embeddings. Details of the architecture are presented in Table 9.

(c) Q-former: In this design, we directly utilize the pre-trained Q-former from BLIP-2 [13] to align visual inputs with the language model. The Q-former is frozen throughout the entire training process.

Table 9: Architecture of our vision perceiver
(0): PerceiverResampler(
  (layers): ModuleList(
    (0-5): 6 x ModuleList(
      (0): PerceiverAttention(
        (norm_media): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (norm_latents): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (to_q): Linear(in_features=1024, out_features=512, bias=False)
        (to_kv): Linear(in_features=1024, out_features=1024, bias=False)
        (to_out): Linear(in_features=512, out_features=1024, bias=False)
      )
      (1): Sequential(
        (0): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (1): Linear(in_features=1024, out_features=4096, bias=False)
        (2): GELU(approximate='none')
        (3): Linear(in_features=4096, out_features=1024, bias=False)
      )
    )
  )
  (norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(1): Linear(in_features=1024, out_features=768, bias=True)
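As a rough size estimate, the parameters of the modules listed in Table 9 can be counted directly from the printed shapes. The sketch below counts only the layers shown in the listing; learned latent queries or other parameters that do not appear as submodules are not included, so the total should be read as a lower bound.

```python
def linear_params(in_features, out_features, bias):
    return in_features * out_features + (out_features if bias else 0)


def layernorm_params(dim):
    return 2 * dim  # weight and bias, since elementwise_affine=True


attention = (
    2 * layernorm_params(1024)            # norm_media, norm_latents
    + linear_params(1024, 512, False)     # to_q
    + linear_params(1024, 1024, False)    # to_kv (keys and values, 512 each)
    + linear_params(512, 1024, False)     # to_out
)
feed_forward = (
    layernorm_params(1024)
    + linear_params(1024, 4096, False)
    + linear_params(4096, 1024, False)
)
resampler = 6 * (attention + feed_forward) + layernorm_params(1024)
projection = linear_params(1024, 768, True)  # final projection to the LM width
total = resampler + projection
```

Under these assumptions the listed modules amount to roughly 63.7M parameters, almost all of them in the six resampler layers.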

10 Inference time↩︎

We conducted a study to evaluate the inference time of our MotionChain model, which utilizes an auto-regressive approach for motion generation. To assess the time costs, we measured the Frames Per Second (FPS) on a single Tesla V100 GPU with a batch size of one. It is important to note that the frame generation rate of MotionChain, even without specific engineering optimizations, surpasses the ground-truth frame rate in text-motion pair datasets [18], [88], [89], highlighting its capability to support real-time motion animation applications.

Table 10: The inference time costs of text-driven motion generation, measured in Frames Per Second (FPS), i.e., the average number of frames generated per second. We report the time costs for various model sizes and observe that, on the same single Tesla V100 GPU, smaller models achieve higher FPS.
Models Backbone Parameters FPS
MotionChain-small Flan-T5-small 110 M 136.7
MotionChain-base Flan-T5-Base 280 M 74.99
MotionChain-large Flan-T5-Large 810 M 39.18
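To relate these numbers to real-time playback, one can divide the generation rate by the frame rate at which the motion is played. The sketch below uses the FPS values from Table 10 and an assumed playback rate of 20 frames per second; this rate is an assumption for illustration and should be replaced by the actual frame rate of the target dataset or application.

```python
GENERATION_FPS = {
    "MotionChain-small": 136.7,
    "MotionChain-base": 74.99,
    "MotionChain-large": 39.18,
}
ASSUMED_PLAYBACK_FPS = 20.0  # assumption, not a value stated in this section


def realtime_factor(generation_fps, playback_fps):
    """Values above 1 mean frames are generated faster than they are played."""
    return generation_fps / playback_fps


factors = {
    name: realtime_factor(fps, ASSUMED_PLAYBACK_FPS)
    for name, fps in GENERATION_FPS.items()
}
```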

11 Evaluation Protocols on the Motion Conversation.↩︎

We propose a protocol to evaluate our multi-turn multi-modal model, MotionChain, on various motion-language generation tasks. While MotionGPT [22] used previous text-motion pair datasets [18], [87], [94] to create an instruction motion-language dataset comprising 14 core tasks with numerous instruction templates, these tasks lack analysis of human motion and are limited to single-turn generation without contextual memory. To overcome this limitation, we introduce motion reasoning and motion editing tasks that leverage contextual information. First, we manually provide ChatGPT [1], [2] with a few examples along with the corresponding textual descriptions of the motions in the datasets, and let it generate the motion analysis (refer to Figure 12). Additionally, using a pre-trained text-motion retrieval model, TMR [34], we retrieve motions from the dataset with high and middle similarities. We collect captions for motion pairs with middle similarity and employ ChatGPT [1], [2] to generate motion editing instructions that can transform one motion into another. Furthermore, we manually construct highly similar motion pairs for motion length editing tasks based on their respective lengths. By randomly combining these single-turn generation tasks, we create a dialog format. The resulting tasks, along with diverse prompt instructions, are presented in Table 11. We will release the pre-processed dataset.
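The step of combining single-turn tasks into a dialog can be sketched as follows. This is a simplified illustration: the pool entries are hypothetical, turns are drawn uniformly at random without replacement, and the dependencies between turns (for example, an editing instruction referring to the previous motion) are not modeled.

```python
import random


def build_dialog(single_turn_pool, num_turns, seed=0):
    """Sample `num_turns` (user, assistant) pairs and lay them out as a dialog."""
    rng = random.Random(seed)
    turns = rng.sample(single_turn_pool, k=num_turns)
    dialog = []
    for user_text, assistant_text in turns:
        dialog.append({"role": "USER", "content": user_text})
        dialog.append({"role": "ASSISTANT", "content": assistant_text})
    return dialog


# Hypothetical single-turn examples with placeholder motion tokens.
pool = [
    ("I need a human motion that represents a person waves.", "<motion_a>"),
    ("Extend the duration of the motion provided.", "<motion_b>"),
    ("Can you tell me what muscles are being used during this motion?", "Mainly the shoulder."),
]
dialog = build_dialog(pool, num_turns=2)
```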

Figure 12: The dedicated ChatGPT prompt for facilitating the collection of motion question-answer pairs. Our primary goal was to encompass a wide range of topics, including motion physics and motion analysis. By utilizing this prompt, our aim was to enable ChatGPT to generate high-quality questions, thereby making a valuable contribution to the development of a comprehensive motion question-answer dataset.

12 Motion Representations↩︎

We summarize two kinds of motion representations as follows.

HumanML3D Format [18] introduces a motion representation \(x^{1:L}\) that draws inspiration from motion features in character control [104], [105], [115]. This representation, which contains redundant information, is well-suited for neural models, particularly variational autoencoders. Specifically, the \(i\)-th pose \(x^i\) is defined by a tuple consisting of the root angular velocity \(\dot{r}^a \in \mathbb{R}\) along the Y-axis, root linear velocities \((\dot{r}^x, \dot{r}^z \in \mathbb{R})\) on the XZ-plane, root height \(r^y \in \mathbb{R}\), local joint positions \(\mathbf{j}^p\in\mathbb{R}^{3N_j}\), velocities \(\mathbf{j}^v\in\mathbb{R}^{3N_j}\), and rotations \(\mathbf{j}^r\in\mathbb{R}^{6N_j}\) in root space. Additionally, it includes binary foot-ground contact features \(\mathbf{c}^f \in \mathbb{R}^4\) obtained by thresholding the heel and toe joint velocities. Here, \(N_j\) represents the number of joints, yielding the following representation: \[\begin{align} x^i = \{\dot{r}^a, \dot{r}^x, \dot{r}^z, r^y, \mathbf{j}^p, \mathbf{j}^v, \mathbf{j}^r, \mathbf{c}^f\}. \end{align}\]
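Taking the tuple above literally, the per-frame feature dimension follows from simple counting. The sketch below does exactly that, with \(N_j\) as a parameter. Released feature layouts may handle the root joint separately in some per-joint terms, so the dimension of actual data files can differ from this literal count.

```python
def pose_dim(num_joints):
    """Per-frame dimension, counting every term of x^i exactly as listed."""
    root = 1 + 2 + 1                    # angular velocity, XZ velocities, height
    joints = (3 + 3 + 6) * num_joints   # positions, velocities, 6-D rotations
    contacts = 4                        # binary foot-ground contact features
    return root + joints + contacts


dims = {n: pose_dim(n) for n in (1, 22)}
```

In closed form this is \(12N_j + 8\) under the literal reading.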

SMPL-based Format [101] builds on the widely used parametric human model SMPL [101] and its variants [116], [117], which define motion parameters \(\theta\) and shape parameters \(\beta\). The rotation vectors \(\theta \in \mathbb{R}^{3\times23+3}\) represent the rotations of the joints and the root, while \(\beta\) represents the weights for linear blended shapes. This representation is commonly employed in markerless motion capture [59], [118], [119]. By including the global translation \(r\), the representation is formulated as:

\[\begin{align} x^i = \{r, \theta, \beta\}. \end{align}\]

Table 11: A few examples of prompt templates used in our standardized motion conversation evaluation protocol.
Task Input Output
Text-to-Motion Show me a sequence of movements that illustrates [caption]. [motion]
Demonstrate a motion that symbolizes the input: [caption].
I need a human motion that represents [caption].
Text-to-Motion w/ length Please generate a motion that is around [frames] frames long for the caption: [caption]. [motion]
Generate a motion that lasts for [seconds] seconds, and captures the essence of [caption].
Motion-Length-Editing Extend the duration of the motion provided. [motion]
Reduce the duration of the motion without losing its main characteristics and precision.
Length-to-Motion I want to see a motion that lasts for [frames] frames. [motion]
Show me a motion that has a duration of [seconds] seconds.
Random Motion Just show me a moving human. [motion]
Produce motions that are not planned or choreographed.
Motion-to-Text Provide a description of the motion shown in [motion] using natural language. [caption]
Provide a text-based explanation of what is happening in [motion].
Motion-to-Text w/ length Generate a text summary for the [motion] that takes [frames] seconds to complete. [caption]
Describe the movement exhibited in [motion] that is shown for a length of [seconds] seconds?
Motion-to-Length How long does [motion]’s poses last in seconds? There are [frames] frames in the motion.
Calculate the second duration for [motion]’s body movements in seconds? The motion lasts for [seconds] seconds.
Caption-to-Length Predict the anticipated frame duration for the motion that corresponds to [caption]? The duration is estimated to be around [frames] frames.
Guess the second count required for the motion represented by [caption]. The motion has a length of [seconds] seconds.
Length-to-Caption Given the [frames] frames of the motion, what are some possible actions that could be taken? [caption]
[seconds] is the number of motion seconds, generate the motion description:
Random Caption Depict a motion as like you have seen it. [caption]
Describe the motion of someone randomly.
Motion-Reasoning Can you tell me what muscles are being used during this motion? This motion primarily targets the quadriceps, hamstrings, glutes, and core muscles. It also engages the shoulders and upper back muscles while raising the arms.
2-3 What could be the reason for the person not swinging their arms while walking? There could be various reasons for this, such as the person carrying something heavy or trying to maintain a certain posture while walking.
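To illustrate how such templates can be turned into concrete input/output pairs, the following is a minimal sketch rather than our actual data pipeline. The `TEMPLATES` dictionary holds only a hypothetical subset of Table 11, the helper names `fill` and `sample_pair` are illustrative, and the frame rate of 20 fps used to convert `[frames]` into `[seconds]` is an assumption.

```python
import random

# Hypothetical subset of Table 11: task -> list of (input, output) templates.
TEMPLATES = {
    "Text-to-Motion w/ length": [
        ("Please generate a motion that is around [frames] frames long "
         "for the caption: [caption].", "[motion]"),
        ("Generate a motion that lasts for [seconds] seconds, and captures "
         "the essence of [caption].", "[motion]"),
    ],
    "Caption-to-Length": [
        ("Guess the second count required for the motion represented by "
         "[caption].", "The motion has a length of [seconds] seconds."),
    ],
}


def fill(text, caption, frames, fps=20.0, motion="<motion tokens>"):
    """Substitute the bracketed placeholders in one template string.

    fps=20.0 and the motion stand-in string are assumptions for illustration.
    """
    values = {
        "[frames]": str(frames),
        "[seconds]": f"{frames / fps:.1f}",
        "[motion]": motion,
        "[caption]": caption,  # substituted last so caption text is left untouched
    }
    for key, value in values.items():
        text = text.replace(key, value)
    return text


def sample_pair(task, caption, frames, rng=random):
    """Pick one template of the given task at random and fill both sides."""
    prompt, answer = rng.choice(TEMPLATES[task])
    return fill(prompt, caption, frames), fill(answer, caption, frames)


prompt, answer = sample_pair("Caption-to-Length", "a person walks forward", 40)
print(prompt)
print(answer)
```

Sampling one template per example, as sketched here, is one simple way to expose a model to varied phrasings of the same task.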

13 Metric Definitions↩︎

In the following section, we present additional details regarding the evaluation metrics.

Linguistic Quality. To evaluate motion question-answer tasks, we employ linguistic metrics that assess the degree of alignment between the generated results and the ground-truth labels. These metrics include BLEU [96], ROUGE (Lin, 2004), CIDEr [98], and BERTScore [99]. For detailed information, please refer to the respective papers associated with each metric.

Motion Quality. The Fréchet Inception Distance (FID) evaluates the distribution similarity between generated and real motions. It is calculated using a suitable feature extractor [17], [18], [48] specific to each dataset. Additionally, we employ popular metrics in motion capture [59], [119], [120], such as MPJPE and PA-MPJPE [121], to measure global and local errors in millimeters. To assess temporal quality, we utilize the Acceleration Error (ACCL). Furthermore, in line with previous motion prediction studies [25]–[27], we define the Average Displacement Error (ADE) as the average L2 distance between the ground truth and predicted motion over the entire sequence, and the Final Displacement Error (FDE) as the L2 distance between the ground truth and predicted motion in the last frame.
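The ADE and FDE definitions above can be sketched as follows. This is a minimal illustration, not our evaluation code: it assumes a motion is stored as a list of frames, each frame a list of 3D joint positions, and that the per-frame distance is averaged over joints. The function names are hypothetical.

```python
import math


def l2(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def frame_error(gt_frame, pred_frame):
    """Mean per-joint L2 distance within one frame (joint averaging is an assumption)."""
    return sum(l2(g, p) for g, p in zip(gt_frame, pred_frame)) / len(gt_frame)


def ade(gt, pred):
    """Average Displacement Error: frame error averaged over the whole sequence."""
    return sum(frame_error(g, p) for g, p in zip(gt, pred)) / len(gt)


def fde(gt, pred):
    """Final Displacement Error: frame error of the last frame only."""
    return frame_error(gt[-1], pred[-1])


# Two frames, one joint each: offsets of length 5 and 1 from the ground truth.
gt = [[(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)]]
pred = [[(3.0, 4.0, 0.0)], [(0.0, 0.0, 1.0)]]
print(ade(gt, pred), fde(gt, pred))  # 3.0 1.0
```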

Motion Diversity. Following previous studies [19], [24], [48], we employ two metrics, Diversity (DIV) and MultiModality (MM), to evaluate the variability of motion across the entire dataset and the diversity of generated motion within each text input, respectively. To assess Diversity, the generated motions are randomly divided into two equal-sized subsets, and the Diversity metric is computed as the average distance between the motions in these subsets. For MultiModality evaluation, a set of text descriptions is randomly sampled from the available descriptions. Each text description is then replicated \(m\) times for motion generation, and the MultiModality metric is defined as the average distance between the motions generated from the same text description.
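As a rough sketch of the two diversity metrics, assuming each motion has already been mapped to a feature vector by the dataset-specific extractor: the element-wise pairing of the two subsets and the function names below are assumptions that follow common practice, not a definitive implementation.

```python
import math
import random


def l2(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def diversity(features, rng):
    """Shuffle the features, split them into two equal-sized subsets,
    and average the distance between the i-th elements of each subset."""
    shuffled = list(features)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    first, second = shuffled[:half], shuffled[half:2 * half]
    return sum(l2(a, b) for a, b in zip(first, second)) / half


def multimodality(features_per_text, rng):
    """features_per_text[c] holds the m feature vectors generated from the
    same text description c; the per-text diversity is averaged over texts."""
    scores = [diversity(feats, rng) for feats in features_per_text]
    return sum(scores) / len(scores)


rng = random.Random(0)
print(diversity([[0.0, 0.0], [3.0, 4.0]], rng))  # 5.0
print(multimodality([[[0.0, 0.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 1.0]]], rng))  # 3.0
```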

Condition Matching. HumanML3D [18] and TMR [34] provide motion/text feature extractors that generate geometrically coherent features for aligned text-motion pairs and vice versa. Within this feature space, we assess the motion-retrieval precision (R Precision) by combining the generated motion with 31 mismatched motions and calculating the top-1/2/3 matching accuracy between the text and motion. Additionally, we measure the Multi-modal Distance (MM Dist), which quantifies the distance between the generated motions and the corresponding text.
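The retrieval protocol can be sketched as below, assuming text and motion features are already extracted and that Euclidean distance is used for ranking. In this simplified version every other motion in the batch acts as a mismatched candidate, so a batch of 32 pairs reproduces the pool of one matched and 31 mismatched motions; the function names are illustrative.

```python
import math


def l2(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def r_precision(text_feats, motion_feats, top_k=3):
    """text_feats[i] is paired with motion_feats[i]; all other motions in the
    batch serve as mismatched candidates. Returns [top-1, ..., top-k] accuracy."""
    hits = [0] * top_k
    for i, text in enumerate(text_feats):
        # Rank all candidate motions by distance to the text feature.
        order = sorted(range(len(motion_feats)),
                       key=lambda j: l2(text, motion_feats[j]))
        rank = order.index(i)
        for k in range(top_k):
            if rank <= k:
                hits[k] += 1
    return [h / len(text_feats) for h in hits]


def mm_dist(text_feats, motion_feats):
    """Mean distance between each text feature and its paired motion feature."""
    pairs = list(zip(text_feats, motion_feats))
    return sum(l2(t, m) for t, m in pairs) / len(pairs)


texts = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
motions = [[1.0, 0.0], [0.0, 9.0], [9.0, 0.0]]
print(r_precision(texts, motions))
print(mm_dist(texts, motions))
```

In this toy batch only the first text retrieves its own motion at rank one, while the other two pairs are deliberately swapped and are recovered only at rank three.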

Time Costs. To assess the computational efficiency of our models, particularly the inference efficiency, we measure the average Frames Per Second (FPS) during motion generation. Specifically, we calculate the FPS on the test set of HumanML3D [18], with a batch size of one, while excluding the time required for model and dataset loading.
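A timing loop of this kind might look like the following sketch. The `generate` callable is a hypothetical stand-in for the model, `dummy_generate` only simulates work, and reporting total frames over total generation time (rather than a mean of per-sample rates) is an assumption.

```python
import time


def measure_fps(generate, prompts):
    """Run generation with batch size one and return generated frames per second.

    Only the time spent inside generate() is counted, so model and dataset
    loading are excluded by construction.
    """
    total_frames, total_time = 0, 0.0
    for prompt in prompts:
        start = time.perf_counter()
        motion = generate(prompt)  # expected to return a sequence of frames
        total_time += time.perf_counter() - start
        total_frames += len(motion)
    return total_frames / total_time


def dummy_generate(prompt):
    """Hypothetical generator: sleeps briefly and returns 40 placeholder frames."""
    time.sleep(0.001)
    return [0] * 40


print(measure_fps(dummy_generate, ["a person walks", "a person jumps"]))
```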



  1. Work done while Biao Jiang was a Research Intern with Tencent.↩︎

  2. Project lead.↩︎

  3. Corresponding author.↩︎