Generating realistic hand motion sequences in interaction with objects has gained increasing attention with the growing interest in digital humans. Prior work has demonstrated the effectiveness of employing occupancy-based or distance-based virtual sensors to extract hand-object interaction features. Nonetheless, these methods show limited generalizability across object categories, shapes and sizes. We hypothesize that this is due to two reasons: 1) the limited expressiveness of the employed virtual sensors, and 2) the scarcity of available training data. To tackle this challenge, we introduce a novel joint-centered sensor designed to reason about local object geometry near potential interaction regions. The sensor queries object surface points in the neighbourhood of each hand joint. As an important step towards mitigating the learning complexity, we transform the points from the global frame to the hand template frame and use a shared module to process the sensor features of each individual joint. This is followed by a spatio-temporal transformer network aimed at capturing correlation among the joints in different dimensions. Moreover, we devise simple heuristic rules to augment the limited training sequences with the vast pool of static hand grasping samples. This leads to a broader spectrum of grasping types observed during training, in turn enhancing our model’s generalization capability. We evaluate on two public datasets, GRAB and InterCap, where our method shows superiority over baselines both quantitatively and perceptually.
We humans mostly rely on our hands to interact with objects in the surrounding environment. Learning the high-dimensional space of plausible hand-object interactions is an important and challenging task with many applications, including modeling digital humans in Augmented and Virtual Reality and reasoning about potential grasps in robotics.
Real-world objects can differ greatly in size, topology and geometry. Learning a model which can adapt to the object surface is a particularly demanding task, especially since existing dynamic hand-object interaction data is very scarce. One crucial factor determining generalization capability is how the object is encoded relative to the hand. Previous work [2], [3] proposed to use occupancy-based or distance-based virtual sensors to represent local surface geometry. However, these features have two limitations. First, they are inherently constrained in their expressiveness. An occupancy-based sensor attaches an occupancy grid to the hand. Occupancy grids with a low resolution can only detect coarse object geometry, while increasing the grid resolution results in an exponential increase in feature size. A distance-based sensor measures the distance from a fixed set of basis points rigidly attached to the hand to the closest points on the object surface. It gives more fine-grained features and is less computationally expensive. However, a discrete collection of hand-to-object distances cannot faithfully describe local object geometry properties such as normal directions and curvature. Second, features computed by both of the aforementioned sensors are global with respect to the hand, which makes it difficult to model the intricate correlation between the movements of individual fingers. As a result, these methods exhibit limited generalization to unseen objects of different sizes.
The ability of humans to perform dexterous object manipulation is attributed to the dense tactile sensory receptors in the skin. We thus hypothesize that the ability to reason about local geometry is key to generalization to arbitrary surfaces. Inspired by this, we propose a novel hand-object interaction sensor which is local to every hand joint. Specifically, we establish a canonical frame at each joint and use a shared module to process local object points within a small radius of the joint. This way, the module learns joint-agnostic local features which generalize well from limited training data. We further fuse the features of all joints with self-attention operations, enabling the model to learn the compositional relationship between different joints in forming the hand pose.
Due to the limited availability of dynamic hand-object interaction data, we present a simple yet effective method for generating dynamic hand sequences from static grasps. Static hand grasping data is easily accessible and covers a diverse range of object geometries and grasping types. With our data augmentation procedure, we turn these static grasps into artificial grasping sequences and show that adding them to our training set further improves the results.
Our contributions are as follows:
We propose a learning-based method to synthesize diverse hand motion sequences interacting with objects. Though trained only on small hand-held objects, we show that our model naturally generalizes to objects of larger sizes (see Figure [fig:teaser]).
We introduce a novel hand-object interaction sensor which detects local object surface geometry relative to the hand joints. This proves essential to our model’s generalization capability.
With a simple yet effective data augmentation trick, we are able to utilize the vast amount of existing static hand grasp data to train our model.
Our code and pre-trained model will be released to enhance further research in this direction.
Static Grasp Synthesis. Synthesizing stable hand grasps given target objects has been extensively studied in computer graphics [4] and robotics [5], [6]. Conventional analytical approaches assume a simplified contact model and solve a constrained optimization problem to satisfy the force-closure condition [7]–[10]. In contrast, data-driven approaches generate grasp hypotheses from human demonstrations or annotated 3D data, and rank them according to certain quality metrics [11], [12]. Modern robotic grasping simulators usually combine the merits of both [13], [14]. Recently, there has been an increasing interest in training neural network-based models for hand grasp generation [15]–[22]. For example, [23], [24] modeled the hand-object proximity as an implicit function.
Dynamic Grasp Synthesis. In comparison with static grasp synthesis, generating dynamic manipulation of objects is more challenging since it additionally requires dynamic hand-object interaction to be modeled. This task is usually approached by optimizing hand poses to satisfy a range of contact force constraints [25]–[28]. With the advent of deep reinforcement learning, a number of works have explored training hand grasping control policies in physics simulation [29]–[31]. Hand motions generated by these works are physically plausible but lack natural variations. Zheng et al. [32] modeled hand poses in a canonicalized object-centric space, achieving category-level generalization for both rigid and articulated objects. Most similar to our work are ManipNet [2] and GRIP [3], which utilized occupancy-based and distance-based sensors to extract local object features near the hand and then directly regressed hand poses from the features. We argue that these features are limited by resolution and are global with respect to the hand, hence hindering generalization capability. In contrast, we adopt a novel joint-centered point-based sensor which captures local object geometry in finer detail while enabling the model to capture the correlation among hand joints.
Full-body Human-object Interaction Synthesis. Generating realistic human motion sequences in 3D scenes has received considerable attention in recent years [33]–[39]. However, these works usually model coarse body motion only and ignore fine-grained finger articulation. Another line of work focuses on generating full-body motion for grasping [40]. A typical solution to this problem is to first generate the final static grasping pose and then use a motion infilling network to generate the intermediate poses [41], [42]. [43] and [44] leveraged existing body pose priors and hand-only grasping priors to circumvent the limited diversity of available full-body grasping data. Braun et al. [45] adopted a physics-based approach, training separate low-level policies for the body and fingers, and then integrating them with a high-level policy which operates in latent space.
Grasp Refinement. As consumer-level hand tracking devices such as RGB/depth cameras, data gloves and IMUs become more accessible, it is relatively easy to acquire hand poses that are approximately correct but may contain noise and artifacts. Refining hand poses in accordance with hand-object interaction thus emerges as a practical research problem [16]. [46] and [47] proposed to identify the potential contact area on the object surface and subsequently adjust the hand to align with the predicted contact points. Limited by their contact representations, they can only handle static hand grasps. Zhou et al. [48] improved upon them and extended the binary contact map representation to a spatio-temporal correspondence map, enabling the refinement of a hand motion sequence. We deviate from these works in our assumptions, as we only require hand and object trajectories as input.
Given the trajectories of a hand and an object in interaction, we aim to generate hand poses that align with the object motion. The object shape is assumed to be known. We tackle this problem in three steps. First, we estimate a coarse initial hand pose for each frame individually. Second, we place virtual sensors on the initialized hand joints, detect nearby object surface points, and extract hand-object interaction features based on these points. The features, local to each joint, are fed to a spatio-temporal attention network, which learns the correlation among hand joints and predicts displacements for the initialized joints. Lastly, we solve an optimization problem and fit a parametric hand model to the predicted joints. See Figure 1 for an overview of our method.
Specifically, our input consists of the hand trajectory \(\left\{ \boldsymbol{w}^{t},\boldsymbol{R}_{H}^{t} \right\} _{t=1}^{T}\), the object trajectory \(\left\{\boldsymbol{o}^{t},\boldsymbol{R}_{O}^{t} \right\} _{t=1}^{T}\) and the object template mesh \(\boldsymbol{M}_O = \left\{ \boldsymbol{V}_O, \boldsymbol{F}_O\right\}\), with \(\boldsymbol{w}^{t}, \boldsymbol{o}^t \in \mathbb{R}^3\) denoting hand and object translations and \(\boldsymbol{R}_{H}^{t}, \boldsymbol{R}_{O}^{t} \in \mathrm{SO}(3)\) denoting hand and object global orientations, respectively. We use the MANO model [1] as our hand representation, which is parameterized by shape \(\boldsymbol{\beta}\) and pose \(\boldsymbol{\theta}\). Hence the hand trajectory is composed of the wrist joint coordinates and the global orientation of the target MANO hand at each frame.
Given the object position and orientation at frame \(t\), we first obtain the posed object mesh as \(\boldsymbol{V}_O^t = \boldsymbol{R}_{O}^{t}\boldsymbol{V}_O + \boldsymbol{o}^{t}\). For predicting an initial hand pose, only the part of the object close to the wrist matters. Hence we crop the object with a cube-shaped virtual sensor \(\boldsymbol{S}^t\) rigidly attached to the wrist. The resulting partial object mesh is denoted by \({\boldsymbol{M}_{O}^t}^{\prime} = \left\{ {\boldsymbol{V}_{O}^t}^{\prime}, {\boldsymbol{F}_O^t}^{\prime}\right\}\), where \({\boldsymbol{V}_{O}^t}^{\prime} = \{\boldsymbol{v}_i \in \boldsymbol{V}_O^t : \boldsymbol{v}_i \in \boldsymbol{S}^t \}\) and \({\boldsymbol{F}_O^t}^{\prime}\) is the corresponding subset of faces.
Let \(\boldsymbol{P}^t \in \mathbb{R}^{N \times 3}\) denote the point cloud sampled on \({\boldsymbol{M}_{O}^t}^{\prime}\). We subsequently express the point cloud relative to the wrist:
\[\begin{align} \tilde{\boldsymbol{P}}^t = {\boldsymbol{R}_{H}^{t}}^T \left( \boldsymbol{P}^t - \boldsymbol{w}^{t}\right) . \label{eq:32cano} \end{align}\tag{1}\]
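To make the cropping and canonicalization concrete, below is a minimal NumPy sketch. The function name, the choice to operate directly on posed object vertices rather than on a resampled cropped mesh, and the 18 cm cube side length (taken from our implementation details) are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def canonicalize_object_points(V_obj_t, R_H_t, w_t, half_extent=0.09):
    """Crop object vertices with a wrist-attached cube sensor and express
    them in the wrist frame, mirroring Eq. (1).

    V_obj_t : (N, 3) posed object vertices at frame t (global frame)
    R_H_t   : (3, 3) hand global orientation at frame t
    w_t     : (3,)   wrist position at frame t
    half_extent : half the cube side length (0.09 m for an 18 cm cube)
    """
    # Express all vertices relative to the wrist: P_tilde = R_H^T (P - w)
    local = (V_obj_t - w_t) @ R_H_t           # row-wise equals R_H^T (v - w)
    # Keep only points inside the axis-aligned cube sensor (wrist frame)
    inside = np.all(np.abs(local) < half_extent, axis=1)
    return local[inside]
```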
We additionally sample the hand trajectory centered on the current frame. We sample \(k\) frames both in the past and in the future, and express them relative to the wrist in a similar fashion as in Eq. (1). The inputs to the hand pose initialization module are \(\left[ \tilde{\boldsymbol{w}}^{t-k:t+k},\tilde{\boldsymbol{R}}_{H}^{t-k:t+k}, \tilde{\boldsymbol{P}}^t\right]\), where \(\tilde{\boldsymbol{w}}^{t-k:t+k}\) and \(\tilde{\boldsymbol{R}}_{H}^{t-k:t+k}\) are the canonicalized sampled wrist positions and orientations, respectively. In particular, we first use a PointNet to extract a global feature vector from the partial point cloud \(\tilde{\boldsymbol{P}}^t\). This feature vector is then concatenated with the trajectory and fed to a three-layer fully-connected network. The output of the network, denoted by \(\boldsymbol{j}_\text{init}^t\), represents the initialized joint coordinates relative to the wrist. The training loss for this module is defined by \[\begin{align} L_\text{init} = \left\| \boldsymbol{j}_\text{init}-\boldsymbol{j}_\text{gt} \right\| _{2}^{2}, \end{align}\] where \(\boldsymbol{j}_\text{gt}\) denotes the groundtruth joint coordinates.
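A compact PyTorch sketch of the Joint Initialization Network is given below. Layer widths, the joint count of 21, and the class name are our own illustrative assumptions; only the overall structure (shared per-point MLP with max pooling, followed by a three-layer fully-connected decoder) follows the description above.

```python
import torch
import torch.nn as nn

class JointInitNet(nn.Module):
    """Sketch of the Joint Initialization Network: a PointNet-style encoder
    pools a global feature from the cropped object points, which is
    concatenated with the canonicalized trajectory and decoded to 3D
    coordinates of all hand joints, relative to the wrist."""

    def __init__(self, traj_dim, num_joints=21, feat_dim=256):
        super().__init__()
        self.point_mlp = nn.Sequential(          # shared per-point MLP
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, feat_dim))
        self.head = nn.Sequential(               # three-layer FC decoder
            nn.Linear(feat_dim + traj_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_joints * 3))

    def forward(self, points, traj):
        # points: (B, N, 3) canonicalized object points; traj: (B, traj_dim)
        per_point = self.point_mlp(points)          # (B, N, feat_dim)
        global_feat = per_point.max(dim=1).values   # symmetric max pooling
        x = torch.cat([global_feat, traj], dim=-1)
        return self.head(x).view(points.shape[0], -1, 3)  # (B, J, 3) joints
```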
Although coarse and inaccurate, the initialized joints offer an indication of where the hand could potentially interact with the object. To refine the initial joint positions, we need to sense local geometry properties of the object near the interaction regions.
To this end, we introduce a novel joint-centered point-based local geometry sensor. Specifically, given the predicted joints \(\{\boldsymbol{j}_i\}_{i=1}^J\), we use inverse kinematics to analytically derive the joint rotations which satisfy:
\[\begin{align} \boldsymbol{j}_k - \boldsymbol{j}_{\text{pa}(k)} = \boldsymbol{R}_{k, \text{pa}(k)}\left(\Bar{\boldsymbol{j}}_k - \Bar{\boldsymbol{j}}_{\text{pa}(k)}\right), \end{align}\] where \(\boldsymbol{R}_{k, \text{pa}(k)}\) is the relative rotation between the \(k\)-th joint and its parent, and \(\{\Bar{\boldsymbol{j}}_k\}\) are the joints in the rest pose. In this manner, we can define the template frame of the \(k\)-th joint by
\[\begin{align} \mathcal{T}_k = \prod_{i\in \text{A}\left( k \right)}{\left[\begin{array}{c|c} \boldsymbol{R}_{i, \text{pa}(i)} & \boldsymbol{j}_i \\ \hline \boldsymbol{0}^T & 1 \end{array}\right]}, \end{align}\] where \(\text{A}(k)\) denotes the list of ancestors of joint \(k\), ordered from the root, and \(\mathcal{T}_k\) is the transformation which brings joint \(k\) from the template frame to the global frame.
By sampling object surface points within a given radius \(r\) of the \(k\)-th joint along with their normal vectors, we get \(\boldsymbol{F}_k = \{ \boldsymbol{P}_k, \boldsymbol{N}_k\}\), where \(\boldsymbol{P}_k = \{ \boldsymbol{v}_i \in \boldsymbol{V}_O^t: \left\| \boldsymbol{v}_i - \boldsymbol{j}_k \right\| _{2} < r\}\). We then transform the sampled points and normals to the template frame, by \[\begin{align} \Bar{\boldsymbol{F}}_k &= \{ \Bar{\boldsymbol{P}}_k, \Bar{\boldsymbol{N}}_k\} \\ &= \{\mathcal{T}_k^{-1}(\boldsymbol{P}_k - \boldsymbol{j}_k), \mathcal{T}_k^{-1}\boldsymbol{N}_k\}. \end{align}\]
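The joint-centered sensing step can be sketched as follows. The function name and array layouts are assumptions; after centering at the joint, only the rotation part of \(\mathcal{T}_k\) is applied, mirroring the normal transform above. The 2.5 cm radius and the cap of 300 points are taken from our implementation details.

```python
import numpy as np

def joint_local_sensor(V_obj, N_obj, joints, T, radius=0.025, max_pts=300):
    """Sketch of the joint-centered sensor: for each joint, gather object
    surface points (and normals) within `radius` and express them in the
    joint's template frame via the IK-derived transform T_k.

    V_obj, N_obj : (N, 3) object surface points and normals (global frame)
    joints       : (J, 3) initialized joint positions
    T            : (J, 4, 4) per-joint template-to-global transforms
    """
    features = []
    for k in range(joints.shape[0]):
        d = np.linalg.norm(V_obj - joints[k], axis=1)
        idx = np.argsort(d)[:max_pts]              # at most max_pts nearest
        idx = idx[d[idx] < radius]                 # keep points inside the sphere
        R_k = T[k, :3, :3]                         # rotation part of T_k
        p_local = (V_obj[idx] - joints[k]) @ R_k   # row-wise R_k^T (p - j_k)
        n_local = N_obj[idx] @ R_k                 # rotate normals only
        features.append(np.concatenate([p_local, n_local], axis=1))
    return features                                # list of (<=max_pts, 6) arrays
```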
Since we now have the sampled object points in a joint-centered canonical frame, we apply a learnable module \(f_\text{feat}\) to process the transformed points. Note that this module is shared between joints, which greatly reduces the learning complexity. We hence arrive at a hand-object interaction feature \(\boldsymbol{f}_k = f_\text{feat}(\Bar{\boldsymbol{F}}_k)\) for each joint \(k\). We implement \(f_\text{feat}\) with a three-layer PointNet architecture.
With the local object features aggregated at each joint, we use a transformer architecture to predict displacement vectors for the initialized joints. Achieving a visually plausible and smooth hand sequence requires modeling spatio-temporal inter-joint dependencies. Hence we apply self-attention in both the spatial and the temporal dimension. Concretely, we first project the initialized joint coordinates to per-joint embedding vectors with a fully-connected network \(g_\text{embed}\): \[\begin{align} \boldsymbol{e}_k = g_\text{embed}(\mathcal{T}_k^{-1}\boldsymbol{j}_k), \end{align}\] where \(k\) is the joint index. We concatenate the joint-local sensor features with the joint embeddings to obtain \(\boldsymbol{X} = \text{concat}(\boldsymbol{f}, \boldsymbol{e})\), the input feature tensor of our transformer. Note that \(\boldsymbol{X}\) contains features for all joints in all time frames. We divide \(\boldsymbol{X}\) along its spatial and temporal dimensions and apply self-attention to each separately.
Spatial self-attention. The spatial self-attention module divides \(\boldsymbol{X}\) into batches of frames, where each batch contains the joint features of a single frame, denoted by \(\boldsymbol{X}_S\). This module treats the hands in different frames as independent static entities and focuses on learning the correlations between different fingers. Following conventional self-attention, we linearly project \(\boldsymbol{X}_S\) to queries \(\boldsymbol{Q}_S\), keys \(\boldsymbol{K}_S\) and values \(\boldsymbol{V}_S\). The output feature is obtained by \[\begin{align} \tilde{\boldsymbol{X}}_S &= \text{sa}(\boldsymbol{Q}_S,\boldsymbol{K}_S,\boldsymbol{V}_S) \\ &= \text{softmax}\left(\frac{\boldsymbol{Q}_S \boldsymbol{K}_{S}^{T}}{\sqrt{l}}\right)\boldsymbol{V}_S, \end{align}\] where \(l\) is the length of the key, query and value vectors. See Fig. 3 (Left) for an illustration.
Temporal self-attention. On the other hand, the temporal self-attention module divides \(\boldsymbol{X}\) into batches of joints, where each batch contains the features of a specific joint across the whole sequence, denoted by \(\boldsymbol{X}_T\). This module models the trajectory of each individual joint, ensuring that all joints move in a temporally smooth and consistent manner. We similarly project \(\boldsymbol{X}_T\) to queries \(\boldsymbol{Q}_T\), keys \(\boldsymbol{K}_T\) and values \(\boldsymbol{V}_T\). The module output is \[\begin{align} \tilde{\boldsymbol{X}}_T &= \text{sa}(\boldsymbol{Q}_T,\boldsymbol{K}_T,\boldsymbol{V}_T). \end{align}\] See Fig. 3 (Right) for an illustration.
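To make the batching of the two attention variants concrete, the following PyTorch sketch applies them in sequence over a feature tensor of shape (batch, frames, joints, channels). The use of nn.MultiheadAttention, the residual connections, and the head count are simplifying assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Sketch of one spatial + temporal self-attention block.
    Spatial attention attends over the J joints within each frame;
    temporal attention attends over the T frames of each joint."""

    def __init__(self, dim, heads=4):            # dim must be divisible by heads
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        B, T, J, C = x.shape
        # Spatial: fold frames into the batch, so tokens are joints of one frame
        xs = x.reshape(B * T, J, C)
        xs = xs + self.spatial(xs, xs, xs, need_weights=False)[0]
        x = xs.reshape(B, T, J, C)
        # Temporal: fold joints into the batch, so tokens are one joint over time
        xt = x.permute(0, 2, 1, 3).reshape(B * J, T, C)
        xt = xt + self.temporal(xt, xt, xt, need_weights=False)[0]
        return xt.reshape(B, J, T, C).permute(0, 2, 1, 3)
```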
The Joint Displacement Network consists of interleaved spatial and temporal self-attention modules. The output of the last module is fed to a linear layer to produce the joint displacement vectors \(\bar{\boldsymbol{d}}\) in the template frame. As the last step, we utilize the pose transformations derived from IK previously to transform \(\bar{\boldsymbol{d}}\) back to the global frame: \[\begin{align} \boldsymbol{d}_k = \left(\prod_{i\in A(k)}{\boldsymbol{R}_{i, \text{pa}(i)}}\right) \bar{\boldsymbol{d}}_k. \end{align}\] The training loss for this module is defined by \[\begin{align} L_\text{disp} = \left\| \boldsymbol{j}_\text{init} + \boldsymbol{d} -\boldsymbol{j}_\text{gt} \right\| _{2}^{2}. \end{align}\]
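A small sketch of this back-transformation, accumulating the relative rotations along each kinematic chain; the data layout and function name are hypothetical.

```python
import numpy as np

def displacements_to_global(d_local, R_rel, ancestors):
    """Rotate per-joint displacements from the template frame to the global
    frame by the product of relative rotations along the chain A(k); no
    translation is needed since displacements are direction vectors.

    d_local   : (J, 3) displacements predicted in the template frames
    R_rel     : (J, 3, 3) relative rotation of each joint w.r.t. its parent
    ancestors : per-joint list of indices along the chain, root first
    """
    d_global = np.empty_like(d_local)
    for k, chain in enumerate(ancestors):
        R = np.eye(3)
        for i in chain:                  # accumulate rotations along the chain
            R = R @ R_rel[i]
        d_global[k] = R @ d_local[k]
    return d_global
```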
With the predicted sequence of hand joints \(\boldsymbol{j}\), we need to recover the hand meshes. This is done by minimizing
\[\begin{align} \mathcal{L}(\boldsymbol{\beta}, \boldsymbol{\theta}) = \left\| \mathcal{J}\left( H\left( \boldsymbol{\beta} ,\boldsymbol{\theta} \right) \right) - \boldsymbol{j} \right\| _{2}^{2}+ \mathcal{L}_\text{reg}(\boldsymbol{\beta}, \boldsymbol{\theta}) \textrm{,} \label{eq:fitting} \end{align}\tag{2}\] where \(H\) maps MANO parameters to hand vertices and \(\mathcal{J}\) is the function which takes hand vertices as input and outputs joint coordinates. The second term of (2) regularizes the shape and pose parameters of MANO, \[\begin{gather} \mathcal{L}_\text{reg}(\boldsymbol{\beta}, \boldsymbol{\theta}) = w_1\left\| \boldsymbol{\beta} \right\| ^2+w_2\sum_{t=1}^T{\left\| \boldsymbol{\theta}^t \right\| ^2} \\ + w_3\sum_{t=1}^{T-1}{\left\| \boldsymbol{\theta}^{t+1} - \boldsymbol{\theta}^{t} \right\| ^2} + w_4\sum_{t=2}^{T-1}{\sum_{i=1}^J{\left\| \ddot{\boldsymbol{j}}_i^t\right\|} }, \end{gather}\] where temporal smoothness is enforced by penalizing frame-to-frame pose changes as well as the second time derivative of the joints.
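The fitting step can be sketched as a gradient-based optimization. The mano_layer and joint_regressor callables stand in for a differentiable MANO implementation, and the loss weights, learning rate, and iteration count are illustrative rather than the values used in our experiments.

```python
import torch

def fit_mano(pred_joints, mano_layer, joint_regressor,
             w=(1e-3, 1e-4, 1e-2, 1e-3), iters=200):
    """Sketch of the fitting objective in Eq. (2): optimize MANO shape/pose so
    that the model joints match the predicted joints, with regularization and
    temporal smoothness.  pred_joints: (T, J, 3) predicted joint trajectory."""
    T = pred_joints.shape[0]
    beta = torch.zeros(10, requires_grad=True)        # shared hand shape
    theta = torch.zeros(T, 45, requires_grad=True)    # per-frame articulation
    opt = torch.optim.Adam([beta, theta], lr=0.01)
    for _ in range(iters):
        verts = mano_layer(beta.expand(T, -1), theta)  # (T, V, 3) hand vertices
        joints = joint_regressor(verts)                # (T, J, 3) model joints
        data = ((joints - pred_joints) ** 2).sum()
        reg = (w[0] * (beta ** 2).sum()
               + w[1] * (theta ** 2).sum()
               + w[2] * ((theta[1:] - theta[:-1]) ** 2).sum()          # pose velocity
               + w[3] * (joints[2:] - 2 * joints[1:-1]
                         + joints[:-2]).norm(dim=-1).sum())            # joint acceleration
        loss = data + reg
        opt.zero_grad()
        loss.backward()
        opt.step()
    return beta.detach(), theta.detach()
```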
Accurately capturing hand motion sequences, especially in the presence of interacting objects, is a particularly challenging task. Sophisticated solutions usually involve expensive marker-based MoCap systems. As a result, only a few dynamic hand-object interaction datasets are available. Nevertheless, capturing the hand in a static pose while grasping an object is relatively straightforward. We can obtain a much larger training set if we are able to utilize the widely available static hand grasping datasets. In the following, we introduce a simple yet efficient way to synthesize hand sequences from static poses.
Given the mesh of a static hand grasping an object, we first fit the MANO model to the hand to obtain the target joint rotations \(\boldsymbol{P}^T\), global orientation \(\mathbf{R}^T\) and translation \(\boldsymbol{d}^T\), where \(T\) is the desired sequence length. We then generate a source hand as the first frame of the sequence, whose pose is obtained by adding small random Gaussian noise to the mean MANO pose. Similarly, we perturb \(\mathbf{R}^T\) with Gaussian noise to get the global orientation of the initial hand. Next, we compute the average distance moved by the hand per frame in GRAB. The initial translation is determined by moving from the target hand along the negative normal direction of its palm by this distance.
To obtain hand poses at intermediate time steps, we apply linear interpolation to the hand translation and spherical linear interpolation to the rotations: \[\begin{align} \boldsymbol{d}^t &= \left(1-\tfrac{t}{T}\right) \boldsymbol{d}^0 + \tfrac{t}{T} \boldsymbol{d}^T \\ \boldsymbol{P}^t &= \text{SLERP}\left(\boldsymbol{P}^0, \boldsymbol{P}^T, \tfrac{t}{T}\right) \\ \boldsymbol{R}^t &= \text{SLERP}\left(\boldsymbol{R}^0, \boldsymbol{R}^T, \tfrac{t}{T}\right). \end{align}\] Generating sequences in this way can result in hand-object intersections. Rather than relying on path planning algorithms to prevent collisions, we simply compute the maximum intersection volume over a sequence and discard sequences whose intersection volume surpasses a predefined threshold. See Figure 4 for a sample sequence.
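A minimal sketch of the interpolation step, using SciPy's Slerp and assuming rotations are stored as per-joint axis-angle vectors; the storage format and function signature are assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def synthesize_sequence(d0, dT, R0, RT, pose0, poseT, T=60):
    """Interpolate translation linearly and rotations spherically between the
    perturbed source frame (index 0) and the static target grasp (index T).

    d0, dT       : (3,)  source/target hand translations
    R0, RT       : (3,)  source/target global orientations (axis-angle)
    pose0, poseT : (J, 3) source/target joint rotations (axis-angle)
    """
    alphas = np.linspace(0.0, 1.0, T)
    trans = (1 - alphas)[:, None] * d0 + alphas[:, None] * dT      # (T, 3)

    def slerp_pair(r0, r1):
        key = Rotation.from_rotvec(np.stack([r0, r1]))
        return Slerp([0.0, 1.0], key)(alphas).as_rotvec()          # (T, 3)

    glob = slerp_pair(R0, RT)                                      # global orientation
    joints = np.stack([slerp_pair(pose0[j], poseT[j])
                       for j in range(pose0.shape[0])], axis=1)    # (T, J, 3)
    return trans, glob, joints
```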
For the Joint Initialization Network, the side length of the cube sensor is 18 cm. We sample 2000 points on the partial object mesh as input to the PointNet. We uniformly sample 10 frames both in the past and in the future within a 1-second time window to compute the trajectory feature. When querying for joint-local object points, we use a spherical sensor with a radius of 2.5 cm. A maximum of 300 points are sampled in the neighbourhood of each joint.
Table 1: Comparison with baselines on GRAB.

| Method | MPJPE (mm) \(\downarrow\) | PD (mm) \(\downarrow\) | IV (\(\text{cm}^3\)) \(\downarrow\) | C-IoU (%) \(\uparrow\) |
|---|---|---|---|---|
| TOCH | 8.18 | 5.37 | 2.72 | 20.1 |
| ManipNet | 9.32 | 5.66 | 3.21 | 18.3 |
| GRIP | 7.71 | 4.80 | 2.51 | 19.9 |
| GEARS (ours) | 7.24 | 4.36 | 2.24 | 22.7 |
Table 2: Comparison with baselines on InterCap.

| Method | PD (mm) \(\downarrow\) | IV (\(\text{cm}^3\)) \(\downarrow\) |
|---|---|---|
| ManipNet | 8.22 | 6.15 |
| GRIP | 7.92 | 5.68 |
| GEARS (ours) | 7.44 | 5.21 |
Table 3: Ablation study on GRAB.

| Variant | MPJPE (mm) \(\downarrow\) | C-IoU (%) \(\uparrow\) |
|---|---|---|
| \(r=0\) | 9.34 | 14.8 |
| \(r=0.02\) | 7.28 | 22.1 |
| \(r=0.03\) | 7.24 | 21.9 |
| w/o displacement | 9.63 | 13.2 |
| w/o attention | 7.85 | 18.4 |
| w/o synthetic | 7.31 | 20.6 |
| Ours (iterative) | 7.37 | 19.2 |
| Ours (full) | 7.24 | 22.7 |
GRAB. We train GEARS on GRAB [16], a large-scale MoCap dataset for whole-body grasping. GRAB contains interaction sequences with \(51\) objects. Following the official protocol, we select \(10\) objects for validation and testing and train on the rest. Due to the symmetry of the two hands, we flip left hands to increase the amount of training data. We further augment the training set by transferring grasps to objects of varying sizes, following [48].
InterCap. InterCap is a dataset of whole-body human-scene interaction captured by multiview RGB-D cameras. It features frame-wise pseudo-groundtruth annotations for body, hand and 6D object poses, which are reconstructed by jointly reasoning about human and object contact areas. As we solely focus on hand-object interaction, we consider the subset of sequences in which a hand interacts with an object.
ObMan. ObMan [15] is a static hand grasping dataset. It consists of object models taken from ShapeNet [49] and synthetic hand grasps generated by the robotic grasping software GraspIt [13]. Since it only contains static hand poses, we cannot directly train on it. Instead, we apply our data synthesis technique and generate 200 sequences for training and testing. Each sequence has a fixed length of 60 frames.
Mean Per-Joint Position Error (MPJPE). We report the average Euclidean distance between predicted and groundtruth 3D hand joints.
Penetration Depth (PD). Penetration depth is the minimum distance a mesh must be translated such that it no longer intersects another mesh. We approximate it by the maximum vertex-to-object-surface distance over all penetrating hand vertices.
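A possible implementation of this approximation using trimesh, assuming its convention that points inside a watertight mesh receive a positive signed distance:

```python
import trimesh

def penetration_depth(hand_verts, obj_mesh):
    """Approximate PD as the maximum distance from a penetrating hand vertex
    to the object surface.  hand_verts: (V, 3) array, obj_mesh: trimesh.Trimesh."""
    sd = trimesh.proximity.signed_distance(obj_mesh, hand_verts)
    inside = sd > 0                      # hand vertices inside the object
    return float(sd[inside].max()) if inside.any() else 0.0
```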
Intersection Volume (IV). We measure hand-object inter-penetration by voxelizing hand and object meshes and reporting the volume of voxels occupied by both. However, interpreting this metric in isolation could be misleading, since it does not account for non-effective grasping artifacts.
Contact IoU (C-IoU). This metric evaluates the Intersection-over-Union between the groundtruth binary hand-object contact map and the contact map of the predicted hands. The contact map is defined on the object surface: an object vertex is labeled 1 if a hand vertex lies within \(2\,\mathrm{mm}\) of it, and 0 otherwise.
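A sketch of the contact-map comparison using a KD-tree; the function name and array layouts are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def contact_iou(obj_verts, hand_verts_pred, hand_verts_gt, thresh=0.002):
    """Binary contact maps on the object surface (1 if any hand vertex lies
    within 2 mm of the object vertex), compared by intersection over union."""
    def contact_map(hand_verts):
        dist, _ = cKDTree(hand_verts).query(obj_verts)   # nearest hand vertex
        return dist < thresh
    pred, gt = contact_map(hand_verts_pred), contact_map(hand_verts_gt)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / max(union, 1)
```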
TOCH [48] is an object-centric model designed for refining noisy hand-object interaction sequences. We tailor it to our task by feeding it the groundtruth hand trajectory and replacing the noisy hands in its training set with flat hands.
ManipNet [2] relies on both occupancy-based and distance-based sensors to generate dexterous hand motions. Since the original work assumed a different hand model, we adapt it to MANO to ensure a fair comparison.
GRIP [3] takes the arm trajectory of the body as input to generate hand poses. It employs a standalone module to denoise the arm trajectory and obtain the hand trajectory. For a fair comparison, we directly provide GRIP with the input hand trajectories.
To verify that GEARS generates realistic interaction sequences, we first evaluate our method on GRAB and compare with the aforementioned baselines. The results are reported in Table 1. GEARS outperforms the baselines on all four metrics, which clearly demonstrates the advantage of our method. We observe that although both ManipNet and GRIP rely on distance-based sensors, GRIP achieves better performance both in terms of joint accuracy and inter-penetration. We hypothesize that this can be attributed to the two-stage approach followed by GRIP: similar to us, GRIP first generates a coarse hand and subsequently refines it. Moreover, TOCH incurs a higher MPJPE but also achieves a higher contact IoU than GRIP. This observation shows that a higher joint error does not necessarily indicate worse grasping quality. TOCH leverages an object-centric interaction representation, which naturally encourages hand-object contact. See Figure 5 (top row) for qualitative results on GRAB.
GRAB contains mostly small-to-medium sized household objects. To assess our model’s generalization capability to larger objects, we evaluate on the InterCap dataset. We exclude TOCH from this comparison because the object-centric contact map used by TOCH is highly sensitive to object size. Since the groundtruth hand pose annotations of InterCap are not accurate enough, we only report penetration depth and intersection volume, see Table 2. Compared to GRIP and ManipNet, GEARS incurs less penetration with the objects. Note that all three methods report higher numbers than on GRAB. This can be partially explained by the fact that the input hand trajectories provided by InterCap exhibit a certain degree of noise. See Figure 5 (bottom rows) for a qualitative comparison on InterCap.
We ablate different components of GEARS and report the change in performance on GRAB to further justify our design choices, see Table 3. We first evaluate how sensitive the model is to the sensor radius. A radius of zero means that sensor features are ignored by the network. We observe that, as long as the radius is set within a reasonable range, it does not have a significant impact on performance.
Moreover, we train three baseline models in which i) the Joint Displacement Network is removed; ii) the spatio-temporal attention is replaced by fully-connected layers; and iii) the additional synthetic training sequences are not used. The Joint Displacement Network clearly plays the most important role in our architecture. This agrees with our intuition that local object geometry features are essential for the fine-grained placement of joints.
Lastly, we design an iterative baseline where, at inference time, the output of the Joint Displacement Network is fed back to it as input. One might expect that an additional round of pose refinement would further improve the generation quality. Surprisingly, the iterative refinement does not bring any benefit. We hypothesize that the underlying reason is a distribution shift at test time, since the Joint Displacement Network has only seen the output of the Joint Initialization Network during training.
We present GEARS, a learning-based method for generating hand interaction sequences given hand and object trajectories. The main insight which makes GEARS effective is the novel joint-centered point-based sensor which captures local geometry properties of the target object. Furthermore, we design a spatio-temporal self-attention architecture to process joint-local features and learn the correlation among hand joints during interaction. GEARS is capable of generalizing across objects of varying sizes and categories. We show that GEARS outperforms previous methods in terms of generation quality and generalizability.
Acknowledgements This work is supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A. This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 409792180 (Emmy Noether Programme, project: Real Virtual Humans). Gerard Pons-Moll is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 – Project number 390727645. The project was made possible by funding from the Carl Zeiss Foundation.