June 20, 2025
We address key challenges in Dataset Aggregation (DAgger) for real-world contact-rich manipulation: how to collect informative human correction data and how to effectively update policies with this new data. We introduce Compliant Residual DAgger (CR-DAgger), which contains two novel components: 1) a Compliant Intervention Interface that leverages compliance control, allowing humans to provide gentle, accurate delta action corrections without interrupting the ongoing robot policy execution; and 2) a Compliant Residual Policy formulation that learns from human corrections while incorporating force feedback and force control. Our system significantly enhances performance on precise contact-rich manipulation tasks using minimal correction data, improving base policy success rates by over 50% on two challenging tasks (book flipping and belt assembly) while outperforming both retraining-from-scratch and finetuning approaches. Through extensive real-world experiments, we provide practical guidance for implementing effective DAgger in real-world robot learning tasks. Result videos are available at: https://compliant-residual-dagger.github.io/
Learning from human demonstrations has seen many recent successes in real-world robotic tasks [1]–[4]. However, to obtain a successful policy, human demonstrators often have to repeatedly deploy a policy, observe its failure cases, and collect more data to update the policy until it succeeds. This process is broadly referred to as Dataset Aggregation (DAgger) [5], [6]. Yet doing DAgger effectively for real-world robotic problems still faces the following challenges:
How to collect informative human correction data? DAgger is most effective when the correction data lies within the original policy’s induced state distribution [5]. In practice, the common approach is either (1) collecting offline demonstrations that cover the policy’s typical failure scenarios [7], or (2) having a human take over robot control during policy deployment [8], [9]. However, in both cases, it is difficult for human demonstrators to gain the intuition for how to execute corrective actions without deviating excessively from the original distribution. Taking over additionally introduces force discontinuities when the human does not instantly reproduce the exact robot force. This is partially due to the lack of effective correction interfaces that support precise and instantaneous intervention.
How to effectively update the policy with new data? Prior methods for improving a pretrained policy with additional data include (1) retraining the policy from scratch with the aggregated dataset [6], which can be computationally expensive; (2) finetuning the policy with only the additional data [3], [10], [11], which is sensitive to the quality of the new data [12]; and (3) training a residual policy separately on top of the pretrained policy, which is typically done with Reinforcement Learning [12], [13] or Imitation Learning [14], both of which require a large number of samples.
In this work, we address these questions by proposing an improved system, Compliant Residual DAgger (CR-DAgger), consisting of two critical components:
Compliant Intervention Interface. We propose an on-policy correction system based on kinesthetic teaching that collects delta actions without interrupting the running robot policy. Leveraging compliance control, the interface lets humans directly feel the magnitude of their instantaneous correction, so they can provide gentle adjustments. Unlike take-over corrections that may cause force discontinuities, our design allows smooth transitions between the correction and no-correction modes while maintaining distributional consistency with the original policy.
Compliant Residual Policy. Leveraging the force feedback from our Compliant Intervention Interface, we propose a residual policy formulation that takes in an extra force modality and predicts both residual motions and target forces, which can fully describe the human correction behavior. The Compliant Residual Policy is force-aware, even when the base policy is position-only. We show that our residual policy formulation learns effective correction strategies using the data collected from our Compliant Intervention Interface.
Together, our system significantly improves the success rate of precise contact-rich robot manipulation tasks using a small amount of additional data. We demonstrate the efficacy of our method on two challenging tasks with long horizons and sequences of contacts: book flipping and belt assembly. We improve the base policy success rate by over 50% using fewer than 50 intervention episodes, while also outperforming retraining from scratch and finetuning under the same data budgets. In summary, our contributions are:
A Compliant Intervention Interface, a system that allows humans to provide accurate, gentle, and smooth corrections in both position and force to a running robot policy without interrupting it.
A Compliant Residual Policy, a policy formulation that seamlessly integrates additional force modality inputs and predicts residual motions and forces.
Practical guidance for efficient DAgger, based on extensive real-world studies of critical but often overlooked design choices such as batch size and sampling strategy. Our hardware design, training data, and policy code will be made publicly available.
Human-in-the-Loop Corrections for Robot Policy Learning. The original DAgger work [5] requires the demonstrator to directly label actions generated by the policy. In robotics, a practical variation is to let the human take over the robot control and provide correct actions directly [6]. Such human correction motion can be recorded with a spacemouse [11], [15], joystick [8], smartphone [9], or arm-based teleoperation system [3], [16], [17]. We instead propose a novel kinesthetic teaching system with a compliance controller that allows the demonstrator to apply delta corrections while the robot policy is still running, and that additionally records force feedback. Our results show that both the delta correction data and the force data are crucial to the success of the learned policy.
Improving Pretrained Robot Policies with New Data. The most direct approach to improving a pretrained policy with new correction data is to retrain the policy on the aggregated dataset, combining prior demonstrations with new feedback [8], [9]. Alternatively, Reinforcement Learning (RL) offers a framework to incorporate both offline and online data, either by warm-starting replay buffers [18], [19] or by using offline data to guide online fine-tuning [20], [21]. When policies are trained on large-scale human demonstration datasets [22]–[28], retraining becomes impractical, especially when the original data is inaccessible. In such cases, fine-tuning with only the new data is a common solution, using either imitation learning [3], [10], [15] or RL [11], [29]. Another line of work introduces an additional residual model on top of the original policy. These residual policies can be trained with RL in simulation [12], [13], [30], but suffer from sim-to-real challenges. Training residual policies in the real world usually requires a large number of samples [14], [31], intermediate scene representations [32], or consistent visual observations between training and testing [33], [34], making the approach hard to adopt in practice. In this work, we introduce a practical data collection system and an efficient residual policy learning algorithm for long-horizon, contact-rich manipulation tasks. Our approach requires only a small amount of real-world correction data and supports integration of additional sensory modalities not present in the original model, leading to improved policy performance.
Our goal is to improve a pretrained robot policy with a small amount of human correction data. To achieve this, we propose a Compliant Intervention Interface (§ 3.1) that enables precise and intuitive on-policy human correction data collection, and a Compliant Residual Policy (§ 3.2) that efficiently learns the correction behaviors to be used on top of the pretrained policy. Throughout the paper, we use the term base policy to refer to the pretrained policy without online improvements.
Correction data is collected by human demonstrators to rectify policy failures. Unlike initial demonstrations that establish baseline behaviors, correction data specifically targets failure modes observed during policy deployment. Correction data is most effective when it corrects failures within the policy-induced state distribution [5]. The interface through which these corrections are collected significantly impacts their quality: it should be intuitive for demonstrators, capture critical corrective information at the precise moments of failure, and help the correction data maintain distributional consistency with the base policy’s outputs.
There are two types of correction collection methods: Off-policy correction is when humans observe failures of the base policy during deployment, and then recollect extra offline demonstrations to address the failure cases. This approach is most commonly used for improving Behavior Cloning policy performance due to its simplicity: it requires no additional infrastructure beyond the original data collection setup. However, the resulting demonstrations may fail to cover all the failure cases or may deviate from the policy’s original distribution. We focus on on-policy correction instead, where humans monitor policy execution and intervene on the spot when failures occur or are anticipated. This approach allows humans to provide corrections more targeted to the base policy’s failure cases. However, challenges still exist for an intervention system:
Non-smooth transitions. Intervention in robotics is typically implemented by take-over correction: letting the human take complete control and overwrite the robot policy. As the underlying control abruptly switches between the robot policy and the human’s intention, disturbances are introduced by policy inference and human response latency, especially when the robot is withstanding external forces. The recorded data may thus include undesired actions that do not reflect the human’s intention.
Distribution shift. Human intervention may still introduce significant distribution shift if the motion deviates too much from the original policy distribution. Additionally, the non-smooth transition above can introduce disturbances that add to the distribution shift.
Indirect correction introduces errors. Correction is commonly implemented via teleoperation interfaces such as spacemice or joysticks [10], [11]. With spatial mismatch and teleoperation latency, it is hard for the demonstrator to provide accurate corrections the instant an intervention starts without going through a short adjustment period.
Missing information. The recorded correction data need to fully describe the human’s intended action. Simply recording the robot’s position is not sufficient, since that position may be influenced by the human’s correction force and will therefore produce different results when the policy is tested without the human.
We propose a Compliant Intervention Interface with the following designs to solve those challenges:
Delta correction instead of take-over correction. Unlike take-over correction, where the demonstrator has no idea of the policy’s original intention once taking over, we propose a novel on-policy delta correction method: we let the robot policy execute continuously while the human applies forces to the robot through a handle mounted on the end effector, resulting in delta actions on top of the policy action. The human demonstrator can always sense the policy’s intention through haptic feedback and easily control the magnitude of intervention by the amount of force applied to the handle. As a result, delta correction ensures smooth intervention data and keeps the human from providing very large corrections that could easily lead to out-of-distribution states. The approach is also intuitive, as the human can directly move the robot toward the desired correction direction.
Correction interface with compliance control. To apply delta corrections over a running policy, we provide a compliant interface that allows humans to safely intervene and apply force to the robot to affect its behavior at any time, as shown in Fig. 2. We design a kinesthetic correction hardware setup with a detachable handle for the human to hold when correcting, which also allows easy tool swapping for different tasks. We run a compliance controller (specifically, admittance control) in the background to respond to both contact forces and human correction forces, allowing the human to influence but not completely override the policy execution. The admittance controller uses a constant stiffness of \(\sim\)1000 N/m to allow easy human intervention and ensure accurate tracking (an illustrative sketch of the admittance update is given after this list of designs).
Correction recording with buttons and force sensor. Our interface additionally includes an ATI 6-axis force/torque sensor to directly measure contact forces, and a single-key keyboard placed on the handle to record the exact timing of correction starts and ends. Both the policy’s original commands and the human’s delta corrections are recorded, along with force sensor readings during the interaction.
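To make the compliance control concrete, below is a minimal sketch of one admittance update step, assuming a unit virtual mass, translation only, and an illustrative damping value and control rate; only the constant \(\sim\)1000 N/m stiffness comes from the text, so all other gains and names are placeholders.

```python
import numpy as np

# Minimal translational admittance sketch. The measured force includes both
# contact forces and the human's correction force applied through the handle,
# so pushing on the handle displaces the commanded pose without stopping the
# running policy.

STIFFNESS = 1000.0                 # N/m, the constant stiffness described above
DAMPING = 2 * np.sqrt(STIFFNESS)   # critically damped choice (an assumption)
DT = 1.0 / 500.0                   # controller period (an assumption)

def admittance_step(x_cmd, x_adm, v_adm, f_meas, f_des=np.zeros(3)):
    """One update of the admitted pose x_adm around the policy command x_cmd.

    x_cmd : 3-vector position commanded by the (base + residual) policy
    f_meas: 3-vector measured external force (contact + human correction)
    f_des : 3-vector desired external force (zero for a position-only policy)
    """
    # Spring-damper admittance with unit virtual mass: deviation from the
    # command is driven by the difference between measured and desired force.
    accel = (f_meas - f_des) - STIFFNESS * (x_adm - x_cmd) - DAMPING * v_adm
    v_adm = v_adm + accel * DT
    x_adm = x_adm + v_adm * DT
    return x_adm, v_adm            # x_adm is sent to the low-level robot controller
```

Similarly, the record below is a hypothetical layout of what the interface logs at each control step; the field names are illustrative, and only the listed quantities (policy commands, human delta corrections, force readings, and button-marked correction intervals) come from the text.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CorrectionFrame:
    """One logged control step from the Compliant Intervention Interface
    (illustrative schema, not the exact data format)."""
    t: float                      # timestamp
    policy_pose_cmd: np.ndarray   # pose commanded by the running base policy
    robot_pose: np.ndarray        # actual robot pose under admittance control
    delta_correction: np.ndarray  # human-induced deviation from the policy command
    wrench: np.ndarray            # 6-axis force/torque sensor reading
    correcting: bool              # button state: inside a marked correction interval
```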
Given the correction data, there are multiple ways to update the policy. Common practices include retraining the base policy from scratch with both the initial data and the correction data, and finetuning the base policy with only the correction data. However, retraining is costly, as it requires updating the entire base policy network from scratch with all the available data. It also requires access to the base policy’s initial training data, which may not be available for many open-source pretrained models. Moreover, since the correction data is significantly smaller than the initial training data, simply mixing the two makes it hard for the policy to acquire effective corrective behaviors. While finetuning allows updating part of the policy network with only the new data, its training stability is easily affected by the distribution mismatch between the correction data and the initial training data. Furthermore, both retraining and finetuning can only update the policy within its fixed network architecture and cannot incorporate new inputs or outputs. We therefore propose a compliant residual policy trained only on the correction data to refine the base policy’s position actions and predict additional force actions.
Compliant residual policy formulation. Our policy directly learns corrective behavior from the human delta correction data, as shown in Fig. 3. It takes as input the same visual and proprioceptive feedback as the base policy, but with a shorter horizon. It also takes in an extra force modality, which is available through our Compliant Intervention Interface. The policy outputs five frames of actions at a time, corresponding to 0.1 s of execution time when running at 50 Hz. The action space is 15-dimensional: the first nine dimensions represent the SE(3) delta pose from the base policy action to the robot pose command [7], while the last six dimensions represent the expected wrench (force and torque) the robot should feel from external contacts. Both the robot pose command and the expected wrench are sent to a standard admittance controller for execution with compliance.
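To illustrate how one frame of the 15-dimensional residual action could be combined with the base policy command before being sent to the admittance controller, a sketch follows; the split of the 9-D delta pose into a 3-D translation plus a 6-D rotation representation, and applying the delta in the base command frame, are assumptions rather than stated implementation details.

```python
import numpy as np

def rot6d_to_matrix(r6):
    """Gram-Schmidt a 6-D rotation representation into a 3x3 rotation matrix."""
    a, b = r6[:3], r6[3:]
    x = a / np.linalg.norm(a)
    b = b - np.dot(x, b) * x
    y = b / np.linalg.norm(b)
    z = np.cross(x, y)
    return np.stack([x, y, z], axis=1)

def compose_action(base_T, residual_15d):
    """Combine the base policy pose command (4x4 matrix base_T) with one
    residual frame: a 9-D SE(3) delta pose followed by a 6-D expected wrench."""
    delta_t = residual_15d[:3]
    delta_R = rot6d_to_matrix(residual_15d[3:9])
    target_wrench = residual_15d[9:]

    delta_T = np.eye(4)
    delta_T[:3, :3] = delta_R
    delta_T[:3, 3] = delta_t
    pose_cmd = base_T @ delta_T    # residual applied on top of the base action
    # pose_cmd and target_wrench are both sent to the admittance controller,
    # which tracks pose_cmd while regulating the measured wrench toward
    # target_wrench.
    return pose_cmd, target_wrench
```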
The residual policy directly reuses the base policy’s frozen image encoder to extract an image embedding and a temporal convolution network [35] to encode the force vectors, followed by fully connected layers that decode the actions.
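A minimal PyTorch-style sketch of this architecture is shown below; the layer widths, force-history length, kernel sizes, and proprioception dimension are placeholders, and only the overall structure (frozen base-policy image encoder, temporal convolution over force readings, fully connected head predicting five frames of 15-D actions) follows the text.

```python
import torch
import torch.nn as nn

class CompliantResidualPolicy(nn.Module):
    """Sketch of the residual network; all sizes are illustrative assumptions."""
    def __init__(self, base_image_encoder, img_dim=512, proprio_dim=9,
                 force_dim=6, horizon=5, action_dim=15):
        super().__init__()
        self.image_encoder = base_image_encoder
        for p in self.image_encoder.parameters():
            p.requires_grad = False          # reuse the base policy's encoder as-is

        self.force_encoder = nn.Sequential(  # temporal convolution over force history
            nn.Conv1d(force_dim, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(           # fully connected action decoder
            nn.Linear(img_dim + proprio_dim + 64, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, horizon * action_dim),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, image, proprio, force_hist):
        z_img = self.image_encoder(image)            # (B, img_dim)
        z_frc = self.force_encoder(force_hist)       # force_hist: (B, force_dim, T)
        z = torch.cat([z_img, proprio, z_frc], dim=-1)
        return self.head(z).view(-1, self.horizon, self.action_dim)
```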
Advantages. This formulation provides the following advantages:
Sample-efficient learning. The residual policy’s network is light-weight (\(\sim\)2MB trainable weights) and only requires a small amount of correction data to train (\(\sim\)50 demonstrations).
Incorporating new sensor modalities. Compared to retraining and finetuning methods that are limited to the base policy’s network architecture, the residual policy can incorporate new sensor modalities. This allows taking any position-based pretrained policy and making it force-aware simply by collecting a small amount of correction data with the force modality.
High-frequency inference. The light-weight residual policy runs at a higher frequency than the base policy, incorporating high-frequency force feedback and enabling reactive corrective behaviors. This reactivity is particularly important for error correction during contact events.
Training strategy. In prior work, a residual policy is trained either in simulation with RL [12], [13] to give it sufficient coverage of the input distribution, or in the real world with pre-collected behavior cloning data [18]. In this work, we train the Compliant Residual Policy completely on the small amount of new real-world correction data with the following strategies:
Ensure sufficient coverage of in-distribution data. Human correction tends to be frequent around a few key moments of the task. A residual trained on correction data alone can extrapolate badly in states where no correction is provided. To help the residual policy understand when not to provide corrections, we: (1) include the no-correction data in training but label it with zero residual actions; and (2) collect a few trajectories where the demonstrator always holds the handle and marks the whole trajectory as correction, even when the correction is small or zero. Details are in § 7.3.
Prioritize correction data over no-correction (zero residual action) data. Similar to [15], we alter the sampling frequency of intervention data during training based on whether it contains human corrections. Specifically, since the start of a correction indicates where the current policy performs badly, followed by an immediate action to fix it, we sample data more frequently for a short period immediately after a correction starts. Our real-world ablations (§ 4.5) demonstrate that these training strategies improve the quality of the residual policy and the overall success rate; a sketch of the labeling and sampling is given below.
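The snippet below sketches one way the zero-residual labeling and the prioritized sampling after correction starts could shape the training set; the window length is a placeholder, the fourfold boost follows the ablation in § 4.5, and the fields refer to the hypothetical CorrectionFrame layout sketched earlier.

```python
import numpy as np

def frame_weights_and_labels(frames, window_steps=50, boost=4.0):
    """Per-frame sample weights and residual labels for one correction episode."""
    weights, labels = [], []
    steps_since_start = None
    for i, f in enumerate(frames):
        started = f.correcting and (i == 0 or not frames[i - 1].correcting)
        if started:
            steps_since_start = 0
        elif steps_since_start is not None:
            steps_since_start += 1

        if f.correcting:
            label = f.delta_correction                 # learn the human's delta action
        else:
            label = np.zeros_like(f.delta_correction)  # explicit "no correction" label

        in_window = steps_since_start is not None and steps_since_start < window_steps
        weights.append(boost if in_window else 1.0)
        labels.append(label)
    return np.array(weights), np.stack(labels)
```

The resulting weights could then drive, for example, a PyTorch WeightedRandomSampler during residual training.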
For each task, we train a diffusion policy [7] with 150 demonstrations as the base policy. We first deploy the base policy and observe its performance and failure modes. Next, from the same base policy, we collect 50 correction episodes with each data collection method. Then, we update the policy using each network updating method and training procedure. Finally, we deploy the updated policies and evaluate their performance under the same test cases. Each data & policy combination is evaluated on each task for 20 trials. Details of tasks and comparisons are described below.
Book Flipping: As shown in Fig. 4 (a), this task is to flip up books on a shelf. Starting with one or more books lying flat on the shelf, the robot first inserts its fingers below the books, then rotates them up and pushes them firmly against the shelf so that they stand on their own.
This task is challenging as it involves rich use of physical contacts and forceful strategies [36]. A position-only strategy always fails immediately by triggering large forces, so we execute all policies through the same admittance controller. Task success requires high precision in both motion and force: the fingers must accurately align with the gap upon insertion, and the robot must provide enough force to rotate heavy books and make them stand firmly.
Each evaluation includes 20 rollouts on 4 test cases (5 rollouts each), as shown in Fig. 4 (b): 1) flipping a single book (three seen and two unseen books), initially far from the shelf edge; 2) flipping a single book close to the shelf edge; 3) flipping two books together (combinations of three seen and three unseen books), initially far from the shelf edge; 4) flipping two books close to the shelf edge. We use the same initial configurations for all evaluations.
Belt Assembly: As shown in Fig. 5 (a), this task is to assemble a thin belt onto two pulleys, which is part of the NIST board assembly challenge [37]. Starting with the belt grasped by the gripper, the robot needs to first thread the belt over the small pulley, next move down while stretching the belt to thread its other side on the big pulley, then rotate 180° around the big pulley to tighten the belt, and finally pull up to release the belt from the gripper.
The task is challenging as it requires both position and force accuracy throughout the process. Specifically, the belt is thin and soft so the initial alignments onto the pulleys are visually ambiguous. Also, since the belt is not stretchable, there is more resistance force and less position tolerance as the belt approaches the second pulley, requiring a policy with good force-position coordination and adaptation. We ran 20 rollouts across 4 different initial board positions for all methods (Fig. 5 (b)). We use the same test cases and initial robot configuration for all evaluations.
We trained a diffusion policy [7] that takes in past images from a wrist-mounted camera and robot proprioception observations, and predicts a future position-based action trajectory. We collected 150 successful demonstrations on both tasks, with five different books for the book flipping task, and randomized initial robot and object configurations for both tasks.
The book flipping base policy achieves a 40% success rate with the following common failure cases (Fig. 4 (c)): (1) Missed insertion. The fingers initially go too high above the book or aim for the gap between the two books, failing to properly insert beneath the books. (2) Incomplete flipping. At the last stage, the policy retracts the blade before the book can stand stably, causing it to fall back.
The belt assembly base policy achieves a 20% success rate with the following common failure cases (Fig. 5 (c)): (1) Missed small pulley assembly: the robot moves with an improper height, causing the belt to get stuck at the base of the board next to the small pulley. (2) Missed big pulley assembly: the fingertip goes too high or too low, causing the belt to miss the slot on the big pulley.
We compare CR-DAgger with baselines across two dimensions: correction method and policy update method. We present the quantitative results in Fig. 6 and explain the key findings below.
Correction data collection methods. We compare our Compliant Intervention Interface with the two most commonly used correction data collection strategies:
Observe-then-Collect includes two steps: first, the policy is deployed and human demonstrators observe the initial settings that could cause failures; then, demonstrators provide completely new demonstrations starting from similar initial settings. As explained in § 3.1, this type of offline correction potentially misses critical timing information, and the resulting demonstrations may deviate from the policy’s original behavior distribution.
Take-over-Correction (HG-DAgger) [6] is another common correction strategy where human demonstrators monitor policy execution and take complete control when failures are anticipated. We implement Take-over-Correction on our Compliant Intervention Interface by clearing the command buffer to the compliance controller and switching the stiffness to zero when a correction starts, so the robot policy does not affect the robot during the correction. However, as explained in § 3.1, take-over correction introduces an abrupt transition when control authority switches, which may cause distributional discontinuities in the training data.
On-Policy Delta (Ours): the details are described in § 3.1.
Policy update methods. We compare with two common policy update methods:
Retrain Policy: retrain the base policy from scratch using both the original training data and the correction data. As explained in § 3.2, this approach is reliable but may require access to the original data and a large amount of new data to work well.
Finetune Policy: finetune the base policy using only the correction data (freezing the visual encoders). As explained in § 3.2, this approach can be sensitive to data quality and distribution shifts.
Residual Policy: an ablation of our method where force is removed from both the inputs and the outputs.
Compliant Residual Policy (Ours): residual policy update with additional force inputs and outputs; see details in § 3.2.
Finding 1: Compliant Residual Policy is able to improve base policy by a large margin. As shown in Fig. 6, [Compliant Residual] policy trained with [On-Policy Delta] data improves the base policy success rate by 60% and 50% on the two tasks respectively. It effectively learns from the limited demonstrations. For example, in the book flipping task, the policy learns to pitch the fingers down more before finger insertion to increase the insertion success; in the belt assembly task, the policy learns to correct the height of the belt when misaligning to the pulley slot. Results are best viewed in our supplementary video.
Finding 2: Residual policy allows additional useful modalities to be added during correction. [Compliant Residual] policy performs significantly better than other methods without force (45% higher success rate than the best position-only baseline on the book task and 20% higher on the belt task), as it can both sense the contact force that indicates critical task information and predict the force to apply. For example, the last stage of the book flipping task requires the robot to firmly push the book against the shelf wall to let it stand on its own. [Compliant Residual] policy predicts large pushing forces at this stage to make the books stand stably with a 100% success rate, while [Residual]’s success rate drops from 70% to 35% (§ 7.1). The second stage of the belt assembly task (threading the belt on the large pulley) requires delicate belt height adjustments under ambiguous visual information due to occlusions and the lack of depth. [Compliant Residual] policy learns to move along the pulley to find the slot when the finger touches the top of the pulley.
Finding 3: Smooth On-Policy Delta data enables stable residual policies. [Compliant Residual] policy trained with [On-Policy Delta] data has 45% and 30% higher success rates than [Compliant Residual] & [Take-over-Correction] on the two tasks respectively. Both residual policies trained with [Take-over-Correction] data sometimes exhibit large noisy motions that trigger task failures, such as retracting the fingers too early in the book flipping task and pulling the belt forcefully above the large pulley in the belt assembly task. In contrast, residual policies trained with [On-Policy Delta] data have much smoother action trajectories and better reflect the human’s correction intentions, providing corrections of suitable magnitude.
Finding 4: Retraining the base policy is stable but learns correction behaviors slowly. Retraining from scratch with the initial and correction data together leads to policies with stable motions. However, its behavior is less affected by the small amount of correction data than by the dominant portion of initial data, leading to insignificant improvements over the base policy (a 1.67% success rate drop on the book task and a 5% increase on the belt task, averaged across all data collection methods).
Finding 5: Finetuning the base policy is unstable. Policy finetuning, with correction data from any collection method, has the worst performance across all policy update methods and even underperforms the base policy (a 30% success rate drop on the book task and a 15% drop on the belt task, averaged across all data collection methods). The finetuned policy predicts unstable and noisy motions, quickly leading to out-of-distribution states, such as inserting too high in the book flipping task and drifting away from the board in the belt assembly task. This is likely due to the mismatch between the base policy’s training distribution and the correction data distribution, causing training instabilities.
We study two important design decisions with ablation studies on the book flipping task.
Training frequency and batch size. One important parameter in DAgger is the batch size between policy updates. With a smaller batch size, the policy is updated more frequently, so new correction data better reflects the most recent policy distribution. However, DAgger with small batch sizes is known to suffer from catastrophic forgetting [38], [39], since it finetunes neural networks on data with a non-stationary distribution. Common solutions include retraining the residual policy at the end of DAgger using all correction data collected from all the intermediate residual policies [3], or relying on the base policy training data as a normalizer [10]. In this work, we empirically found that a much simpler method works best: single-batch training, i.e., we train the residual only once, with all correction data collected while running only the base policy. We compare our single-batch (batch size = 50) DAgger with a small-batch version, where we warm up the residual with 20 episodes of initial correction data, then update the policy after every ten additional episodes, three times in total.
Finding: Single-batch DAgger is more suitable for training the Compliant Residual Policy. The small-batch training becomes unstable, and the demonstrator needs to provide corrections of increasing magnitude as the number of iterations increases. During evaluation, the final small-batch policy always fails by inserting too high, while our single-batch policy achieves a 100% success rate with the same amount of data and training epochs.
Sampling strategy during training. The start of a human intervention contains critical information about the timing and direction of the correction. Accurate delta action predictions right after a correction starts are important for reactive corrective behaviors and for staying in distribution. We investigate three strategies for sampling from the online correction data during training: (1) uniform sampling, where the whole episode is sampled uniformly; (2) denser sampling around the start of a human intervention; and (3) denser sampling only after the human intervention starts. For (2) and (3), we uniformly increase the sample frequency fourfold for a fixed period before and/or after the intervention starts.
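For reference, the three strategies amount to different per-frame sampling weights, roughly as in the sketch below; the window length here is an assumption, while the fourfold increase is stated above.

```python
def sample_weight(step, correction_starts, strategy, window=50, boost=4.0):
    """Relative sampling weight of one frame under the three ablated strategies."""
    if strategy == "uniform":
        return 1.0
    for s in correction_starts:
        if strategy == "around" and s - window <= step < s + window:
            return boost   # denser both shortly before and after a correction starts
        if strategy == "after" and s <= step < s + window:
            return boost   # denser only after a correction starts (best in Fig. 7)
    return 1.0
```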
Finding: Sampling denser right after intervention starts leads to more reactive and accurate corrections. As shown in Fig. 7 (right), the best performance comes from densely sampling after the beginning of interventions. Sampling denser around the start of a human intervention also adds more samples right before the intervention starts, which is where humans observe signs of failures. These are mostly negative data, and using them for training decreases the policy success rate.
In this work, we evaluate practical design choices for DAgger in real-world robot learning, and provide a system, CR-DAgger, to effectively collect human correction data with a Compliant Intervention Interface and improve the base policy with a Compliant Residual Policy. We demonstrate the effectiveness of our designs by comparing them with a variety of alternatives on two contact-rich manipulation tasks.
Limitations and Future Work. The base policy should have a reasonable success rate for the residual policy to learn effectively. From our experiments, we recommend starting to collect correction data for the residual policy when the base policy has at least \(10\%\sim20\%\) success rate. A future direction is to derive theoretical guidelines for the trade-off between the base and residual improvements.
Throughout this work, we use an MLP as the action head of our Compliant Residual Policy and directly regress the actions. Although it works well on our tasks, it may have difficulty on tasks that involve more distinct action multi-modality. More expressive policy formulations, such as Flow Matching [22], [40], might be useful for these tasks.
We would like to thank Eric Cousineau, Huy Ha, and Benjamin Burchfiel for thoughtful discussions on the proposed method, and Mandi Zhao, Maximillian Du, Mengda Xu, and all REALab members for their suggestions on the experiment setup and the manuscript. This work was supported in part by NSF Awards #2143601, #2037101, and #2132519, the Sloan Fellowship, and the Toyota Research Institute. We would like to thank Google and TRI for the UR5 robot hardware. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.
Both the book flipping task and the belt assembly task can be further divided into three key stages. Fig. 8 below is a more detailed version of Fig. 6, which reports all the task success rates by stages.
One design choice for the residual policy is whether to add the base policy action as an input. Both variants have been used in prior residual policy work, and some report improvements when the base policy action is not included [12]. We tested both variations on the book flipping and belt assembly tasks and obtained mixed results. We report the more successful version for each task (with the base action for book flipping, without the base action for belt assembly) in our results section.
| Success rate (%) | w/ base action | w/o base action |
|---|---|---|
| Book flipping | 100 | 75 |
| Belt assembly | 25 | 70 |
As mentioned in the “Training strategy” part of § 3.2, we use two strategies to ensure the residual policy behaves stably in regions with little correction data. The first strategy is to include the no-correction portion of the online data for training and label it with zero residual actions. The second strategy is to collect a few trajectories (15 out of the 50 total correction episodes) in which the demonstrator marks the whole trajectory as correction, even when the correction is small or zero. In practice, we find that the first strategy works better when the base policy is more stable and has a higher success rate, while the second strategy works better otherwise. In our experiments, we use the first strategy for the book flipping task and the second strategy for the belt assembly task.
We use a desktop with a NVIDIA GeForce RTX 4090 GPU for training and deployment.
Our kinesthetic correction hardware setup features a tool interface that allows task-specific tool swapping. For the book flipping task, we designed a customized fork-shaped tool that can easily insert under the books and flip them. For the belt assembly task, we used a standard WSG-50 gripper and fin-ray fingers [1]. An interesting future direction is to leverage generative models for automatic manipulator design [41]. Future work can also incorporate other types of force or tactile sensors, such as capacitive F/T sensors [42] and vision-based tactile sensors [43]–[45].