PathFormer: A Transformer with 3D Grid Constraints for Digital Twin Robot-Arm Trajectory Generation


Abstract

Robotic arms require precise, task-aware trajectory planning, yet sequence models that ignore motion structure often yield invalid or inefficient executions. We present a Path-based Transformer that encodes robot motion with a 3-grid (where/what/when) representation and constraint-masked decoding, enforcing lattice-adjacent moves and workspace bounds while reasoning over task graphs and action order. Trained on 53,755 trajectories (80% train / 20% validation), the model aligns closely with ground truth—89.44% stepwise accuracy, 93.32% precision, 89.44% recall, and 90.40% F1—with 99.99% of paths legal by construction. Compiled to motor primitives on an xArm Lite 6 with a depth-camera digital twin, it attains up to 97.5% reach and 92.5% pick success in controlled tests, and 86.7% end-to-end success across 60 language-specified tasks in cluttered scenes, absorbing slips and occlusions via local re-grounding without global re-planning. These results show that path-structured representations enable Transformers to generate accurate, reliable, and interpretable robot trajectories, bridging graph-based planning and sequence-based learning and providing a practical foundation for general-purpose manipulation and sim-to-real transfer.

1 Introduction

Learning reliable policies for robot arms remains challenging as tasks grow more diverse and long-horizon. Recent progress in language-conditioned manipulation [1]–[3] shows that Transformers can map high-level instructions to low-level actions, but most treat control as unconstrained sequence generation, ignoring geometric and kinematic structure. The result is brittle behavior: trajectories that work in-distribution can exhibit discontinuities, inefficiencies, or physically invalid steps when deployed.

These issues amplify in multi-step, multi-task settings, where a policy must not only produce the correct order of actions but also ensure that each intermediate transition is feasible—respecting lattice adjacency, workspace bounds, and subtask dependencies. Symbolic or finite-state scaffolds can enforce feasibility, but they rely on handcrafted abstractions that limit scalability. Pure sequence models excel at long-horizon reasoning, yet provide no guarantees that decoded trajectories remain valid—undermining their integration into digital twin frameworks for safe planning, simulation, and sim-to-real transfer.

Recent digital twin (DT) efforts underscore both the opportunity and the gap. RoboTwin [4] highlights DTs for scalable data generation and evaluation; control-oriented studies pair reinforcement learning or hybrid controllers with DTs for real-time adjustment (e.g., SAC-based DT control for additive manufacturing [4] and PPO–fuzzy PID for arm stabilization [5]); perception-driven DTs maintain whole-body models for collision avoidance [6]. Across these, trajectory validity is typically enforced post hoc via control corrections or environment updates, rather than guaranteed at generation time.

We introduce PathFormer, a Transformer that generates robot-arm trajectories through a structured 3-grid representation. This design explicitly encodes (i) spatial transitions on a lattice (where), (ii) task-graph dependencies among primitives (what), and (iii) the sequential order of execution (when). Decoding is graph-constrained: at each step the model selects only from lattice-adjacent neighbors, guaranteeing legality by construction while retaining the expressivity of Transformer attention. The 3-grid aligns naturally with digital twins, enabling consistent simulation, safety checks, and transfer to physical arms without domain gaps.

1.0.0.1 Contributions.

(1) 3-grid trajectory representation: a unified encoding of where/what/when that supports joint reasoning over geometry, semantics, and order. (2) Graph-constrained Transformer decoding: neighborhood masking restricts next-step predictions to valid lattice neighbors, providing trajectory-validity guarantees crucial for DT deployment. (3) Empirical validation across diverse tasks: trained on a corpus of 53,755 trajectories, PathFormer attains 89.44% stepwise accuracy, 93.32% precision, and 99.99% valid paths on held-out data, and transfers to real robot arms with a depth-camera digital twin, executing reach, pick, and place primitives.

By embedding structural constraints directly into Transformer-based generation, PathFormer bridges sequence modeling and graph-based planning inside digital twins. The resulting policies are accurate, interpretable, and physically feasible, supporting a wide spectrum of robot-arm tasks in unstructured environments.

2 Related Work

2.0.0.1 Benchmarks for Robotic Manipulation

Large-scale simulators and benchmarks such as RLBench, Ravens, ALFRED, CALVIN, VLMbench, VIMA-Bench, and COLOSSEUM have accelerated progress in visuomotor learning by testing generalization across novel scenes, objects, and perceptual variations [7]–[13]. Recent additions like GemBench [14] emphasize articulated objects and long-horizon task compositions. These platforms primarily measure robustness to perceptual or environmental variation, but they do not guarantee structural validity of low-level trajectories. PathFormer complements these efforts by shifting the focus from perceptual robustness to trajectory correctness, ensuring that every decoded path adheres to lattice and kinematic constraints.

2.0.0.2 Sequence Models for Manipulation

Transformer-based visuomotor policies and diffusion models achieve strong performance by directly generating long-horizon action sequences from demonstrations [15]–[19]. However, since they optimize only for imitation fidelity, their outputs may contain discontinuities or physically invalid steps. PathFormer closes this gap by embedding motion in a 3D lattice and applying constraint-masked decoding, thereby producing predictions that are both order-accurate and valid by construction.

2.0.0.3 LLM/VLM-Guided Planning

Language and vision–language models are increasingly used as high-level planners: SayCan [20] combines LLM reasoning with value functions, VoxPoser [21] encodes spatial affordances as value maps, and CaP [22] generates robot-executable code. Recent work such as 3D-LOTUS++ [14] integrates LLM reasoning with VLM-based grounding for open-vocabulary objects. These methods excel at symbolic task decomposition but still rely on low-level controllers that lack structural guarantees. PathFormer provides a complementary execution layer: its lattice-constrained predictions ensure that trajectories invoked by high-level planners are always feasible and safe.

2.0.0.4 Structured Trajectory Representations and Digital Twins

Structured policies such as finite-state machines, task graphs, and option libraries offer interpretability but depend heavily on manual design [12], [13]. Equivariant architectures [1] and multi-skill Transformers [2] improve generalization but still decode trajectories without explicit feasibility checks. Meanwhile, digital twin approaches are increasingly used to provide safe, real-time mirroring of robot behavior [4]–[6], supporting domains from dual-arm manipulation to manufacturing control. Similarly, knowledge-graph and GNN-based planners [23], [24] capture structural dependencies and improve reasoning over multi-robot tasks. PathFormer bridges these directions: by embedding trajectories in a 3D lattice and enforcing graph-constrained decoding, it combines the rigor of graph-based planning with the scalability of modern sequence models, and can be integrated naturally with digital twins or KG/GNN-based frameworks.

2.0.0.5 Summary

Benchmarks advance evaluation, Transformers advance multi-skill visuomotor learning, and LLM/VLM systems advance task-level reasoning. Yet ensuring that generated trajectories remain valid under geometric and kinematic constraints is still unresolved. PathFormer addresses this gap: it produces robot motions that are valid by construction, interpretable via a 3D grid, and transferable to both digital twins and physical robot arms, while remaining compatible with higher-level KG- or LLM-driven planners.

3 Methodology

We propose PathFormer, a causal Transformer that generates robot-arm trajectories on a discrete lattice using a structured 3D grid representation. Unlike prior sequence models that decode unconstrained actions [15], [16], [18], PathFormer couples a depth-camera digital twin with graph-constrained decoding, ensuring trajectories are both accurate and valid by construction. The sensing-to-grid setup, the end-to-end pipeline, and the perception-to-execution rollout are shown in Figs. 1–3. A detailed real-robot rollout is presented later (Fig. 5).

3.0.0.1 Relation to Transformer planners.

Transformer-based task planners often embed graph structure at the task level. For instance, the GNN-Transformer Task Planner (GTTP) integrates scene-graph encoders with Transformer layers to predict subgoals from language instructions [25]. PathFormer differs fundamentally: it operates at the trajectory level, autoregressively decoding lattice coordinates with a neighborhood mask that enforces Manhattan adjacency and workspace bounds. Thus, PathFormer complements GTTP-style planners—task-level planners can propose subgoals or skills, while PathFormer guarantees feasible low-level execution of those subgoals on real hardware.

Figure 1: Digital-twin sensing to grid. A RealSense depth camera provides (x,y,z) targets on the local host; detections are projected onto a 3D grid used by PathFormer. The Lite 6 arm executes plans generated on this grid, enabling consistent sim-to-real transfer.
Figure 2: End-to-end pipeline. Depth camera \(\rightarrow\) 3D grid (origin, target, end-effector) \(\rightarrow\) PathFormer graph-constrained decoding \(\rightarrow\) joint/gripper commands. The grid synchronizes the digital twin with the physical arm, while neighborhood masking enforces lattice adjacency.
Figure 3: Perception-to-execution rollout. Left: RGB and depth streams from the RealSense D435i with YOLO-based object detection. Middle: Lite 6 robot executing reach–pick–place. Right: decoded 3D-grid trajectories for corresponding task phases. Together, these demonstrate how PathFormer integrates perception, digital twin alignment, and real-robot execution.

3.1 Problem Formulation

A trajectory is a path \(\pi=\{p_1,\dots,p_T\}\) on a 3D lattice: \[p_t=(x_t,y_t,z_t),\quad p_t\in\mathbb{Z}^3,\quad \|p_{t+1}-p_t\|_1 = 1,\] where each move is lattice-adjacent. Given an initial state \(p_0\) from the robot’s end effector and a task context \(C\) (task-graph features, Sec. 3.2), PathFormer predicts \(\pi\) that (i) matches ground-truth order (Stepwise Accuracy), (ii) recovers the correct set of coordinates (Precision/Recall/F1), and (iii) respects adjacency and workspace bounds (Validity).

3.2 3-Grid Representation and Spatial Neighborhood

We encode trajectories with three complementary grids:

  • Spatial Grid \(\mathcal{G}_{\text{spatial}}=(\mathcal{V},\mathcal{E})\): nodes \(\mathcal{V}\) index lattice coordinates; edges connect neighbors \[(v_i,v_j)\in\mathcal{E}\iff \|p_i-p_j\|_1=1 \;\text{and}\;p_j\in\Omega,\] where \(\Omega\) denotes workspace bounds (derived from the Lite6 robot’s 440 mm reach and obstacle masks from the depth camera). This defines the legal action set at each step (a minimal sketch follows at the end of this subsection).

  • Task Grid: a DAG over sub-tasks (e.g., grasp, lift, place) with dependency edges (pre\(\rightarrow\)post), mapped to tokens/features \(C\).

  • Sequence Grid: encodes temporal indices and causal masks to align positions with action order.

Knowledge-Graph View. The combined structure can be seen as a knowledge graph \(\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{A})\) whose attributes \(\mathcal{A}\) annotate nodes/edges with action labels, step indices, and task tags. A trajectory is thus a path traversal on \(\mathcal{G}\).
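To make the legal action set concrete, the following is a minimal Python sketch of the neighbor query on \(\mathcal{G}_{\text{spatial}}\); the function and variable names are illustrative, and the bounds follow the Lite6 lattice of Sec. 3.9.

```python
# Minimal sketch of the legal-neighbor query on G_spatial (names illustrative).

OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def in_workspace(q, xy_max=22, z_max=34):
    # Lattice bounds Omega from Sec. 3.9 (20 mm cells over the Lite6 envelope).
    x, y, z = q
    return -xy_max <= x <= xy_max and -xy_max <= y <= xy_max and 0 <= z <= z_max

def neighbors(p, occupied=frozenset()):
    # Legal action set N(p): unit-L1 moves inside Omega, minus obstacle-masked cells.
    x, y, z = p
    for dx, dy, dz in OFFSETS:
        q = (x + dx, y + dy, z + dz)
        if in_workspace(q) and q not in occupied:
            yield q
```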

3.3 Digital Twin Pipeline

To synchronize real-world sensing with simulation (Fig. 2):

  1. A depth camera (Intel RealSense) localizes objects in \((x,y,z)\) using YOLOv11.

  2. Coordinates are projected into a voxelized 3D grid, where the origin \((0,0,0)\) aligns with the robot base, the target is inserted, and the end effector is tracked (a projection sketch follows this list).

  3. The Lite6 robot arm (6 DoF, 440 mm reach, \(\pm\)0.5 mm repeatability) executes candidate paths, while the digital twin validates feasibility in parallel.

  4. Feedback from both environments keeps the spatial grid consistent and provides training/evaluation data for PathFormer.
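A minimal sketch of step 2, assuming a calibrated hand–eye extrinsic \(T_{\text{base}\leftarrow\text{cam}}\) and the 20 mm voxel resolution of Sec. 3.9 (frame names and the detection interface are assumptions):

```python
import numpy as np

RES_MM = 20.0  # voxel resolution of the spatial grid (Sec. 3.9)

def camera_to_cell(p_cam_m, T_base_cam):
    """Project a detected (x, y, z) point from the camera frame into the lattice.
    p_cam_m: point in meters from the RealSense; T_base_cam: 4x4 hand-eye
    extrinsic mapping camera coordinates into the robot-base frame."""
    p_h = np.array([*p_cam_m, 1.0])            # homogeneous coordinates
    x, y, z = (T_base_cam @ p_h)[:3] * 1000.0  # base frame, millimeters
    return tuple(int(round(c / RES_MM)) for c in (x, y, z))
```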

3.4 Inputs and Embeddings

For each candidate cell \(u\in\mathcal{V}\), we build an embedding: \[e(u)=E_{\text{coord}}(x,y,z)\;+\;E_{\text{task}}(C[u])\;+\;E_{\text{seq}}(t),\] where \(E_{\text{coord}}\) encodes spatial coordinates, \(E_{\text{task}}\) encodes sub-task features (e.g., grasp vs. place), and \(E_{\text{seq}}\) encodes step index \(t\). This fuses geometry, semantics, and temporal order into a shared latent space.
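A minimal PyTorch sketch of this additive fusion, assuming lattice coordinates are flattened to a cell vocabulary and task features are discretized to ids (module and argument names are illustrative):

```python
import torch
import torch.nn as nn

class GridEmbedding(nn.Module):
    # e(u) = E_coord(x, y, z) + E_task(C[u]) + E_seq(t)   (Sec. 3.4)
    def __init__(self, n_cells, n_task_feats, max_len, d_model=256):
        super().__init__()
        self.coord = nn.Embedding(n_cells, d_model)      # flattened (x, y, z) index
        self.task = nn.Embedding(n_task_feats, d_model)  # sub-task feature id
        self.seq = nn.Embedding(max_len, d_model)        # step index t

    def forward(self, cell_ids, task_ids, step_ids):
        # All inputs are integer tensors of shape (batch, T).
        return self.coord(cell_ids) + self.task(task_ids) + self.seq(step_ids)
```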

3.5 Transformer with Constraint Masking

We use a causal Transformer decoder to predict \(p_{t+1}\) given \(\{p_1,\dots,p_t\}\) and context \(C\). A neighborhood mask \(M_t\) restricts logits to valid neighbors: \[M_t(u)=\mathbb{1}[u\in\mathcal{N}(p_t)], \quad \mathcal{N}(p_t)=\{u \mid (p_t,u)\in\mathcal{E}\}.\] Validated logits are: \[\tilde{\ell}(u)= \begin{cases} \ell(u), & M_t(u)=1\\ -\infty, & M_t(u)=0 \end{cases}, \quad p_{t+1}=\arg\max_u \text{softmax}(\tilde{\ell}(u)).\] This guarantees lattice-adjacent motion and bounded feasibility by construction.

3.6 Training Objective

We train on a 53k-scale corpus of trajectories with a composite loss: \[\mathcal{L}=\mathcal{L}_{\text{seq}} +\lambda_1 \mathcal{L}_{\text{coord}} +\lambda_2 \mathcal{L}_{\text{valid}} +\lambda_3 \mathcal{L}_{\text{cov}} +\lambda_4 \mathcal{L}_{\text{len}},\] where \(\mathcal{L}_{\text{seq}}\) supervises next-step indices, \(\mathcal{L}_{\text{coord}}\) optimizes coordinate set recall/precision, \(\mathcal{L}_{\text{valid}}\) penalizes invalid moves, \(\mathcal{L}_{\text{cov}}\) encourages full path coverage, and \(\mathcal{L}_{\text{len}}\) regularizes path length.
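A sketch of the composite objective is given below; \(\mathcal{L}_{\text{seq}}\), \(\mathcal{L}_{\text{valid}}\), and \(\mathcal{L}_{\text{len}}\) are given plausible surrogate forms, while the set-overlap and coverage terms are passed in precomputed, since their exact functional forms are not spelled out here.

```python
import torch
import torch.nn.functional as F

def composite_loss(logits, targets, legal_mask, pred_len, gold_len,
                   coord_term, cov_term, lam=(1.0, 1.0, 0.5, 0.1)):
    """Sketch of L = L_seq + l1*L_coord + l2*L_valid + l3*L_cov + l4*L_len.
    logits: (B, T, V); targets: (B, T); legal_mask: (B, T, V) in {0, 1}."""
    l_seq = F.cross_entropy(logits.transpose(1, 2), targets)       # next-step indices
    probs = logits.softmax(dim=-1)
    l_valid = (probs * (1.0 - legal_mask.float())).sum(-1).mean()  # mass on illegal moves
    l_len = (pred_len.float() - gold_len.float()).abs().mean()     # length regularizer
    return (l_seq + lam[0] * coord_term + lam[1] * l_valid
            + lam[2] * cov_term + lam[3] * l_len)
```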

3.7 Constrained Decoding (Pseudocode)

Figure 4: Constrained decoding on \(\mathcal{G}_{\text{spatial}}\).
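As a concrete instance of the procedure in Fig. 4, here is a minimal greedy sketch of the masked decoding loop; the model interface, the cell_id mapping, and the neighbors query are assumptions consistent with the earlier sketches, and the D1 experiments actually use beam search with \(B=5\) (Sec. 4.2) rather than pure greedy decoding.

```python
import torch

def decode_path(model, context, p0, neighbors, cell_id, max_steps=64, goal=None):
    """Greedy constraint-masked decoding on G_spatial (sketch).
    model(prefix_ids, context) -> logits over the cell vocabulary;
    neighbors(p) -> legal lattice-adjacent cells; cell_id: (x, y, z) -> index."""
    path, prefix = [p0], [cell_id(p0)]
    for _ in range(max_steps):
        logits = model(torch.tensor([prefix]), context)[0, -1]  # (V,)
        legal = list(neighbors(path[-1]))
        if not legal:                        # dead end: stop rather than jump
            break
        mask = torch.full_like(logits, float("-inf"))
        for q in legal:
            mask[cell_id(q)] = 0.0           # keep logits only for N(p_t)
        nxt_id = int((logits + mask).argmax())
        nxt = next(q for q in legal if cell_id(q) == nxt_id)
        path.append(nxt)
        prefix.append(nxt_id)
        if goal is not None and nxt == goal:
            break
    return path
```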
Figure 5: Real-robot rollout and grid trajectories. Top: snapshots of reach–pick–place, showing end-effector motion across phases. Middle: normalized wrist IMU/gyro timeline, with annotations marking transitions between subtasks. Bottom: decoded 3D-grid paths (origin, target, end-effector) aligned with each phase. These results confirm that the digital twin mirrors physical execution while lattice adjacency constraints prevent discontinuities.

3.8 Implementation Platform

3.8.0.1 Hardware

Experiments are conducted on UFactory collaborative manipulators (xArm5/6/7, Lite6, and UFACTORY850), supporting interchangeable end-effectors (parallel-jaw, vacuum, and BIO grippers). Dual-arm configurations run in a single rviz instance with independent DoFs and synchronized task graphs.

Figure 6: Robot arm hardware specifications and dimensions. (a) Robot arm dimensions (side view); (b) workspace radius (top view); (c) end-effector mount dimensions; (d) physical robot arm (6-DoF).

3.8.0.2 Operating Systems and Middleware

All experiments run on Ubuntu LTS with ROS 2 releases (Foxy, Galactic, Humble, Jazzy, and Rolling), ensuring vendor-branch API compatibility. The control stack builds on the BSD-licensed xarm_ros2 meta-repo, summarized below.

Table 1: ROS 2 middleware in the xarm_ros2 meta-repo.

| Package | Function |
|---|---|
| xarm_description | URDF/XACRO robot models |
| xarm_api, xarm_sdk | HW services and ROS topics |
| xarm_controller | Hardware interface layer |
| xarm_moveit_config, xarm_planner | MoveIt 2 motion planning |
| xarm_gazebo | Gazebo simulation |
| xarm_moveit_servo | Teleoperation / servo control |
| xarm_vision | RealSense drivers and calibration |

3.9 Robot Geometry and Workspace

3.9.0.1 Reach Envelope

The Lite6 workspace, derived from CAD views (Figs. 6(a) and 6(b)), is summarized in Table 2.

Table 2: Lite6 reach envelope dimensions.

| Parameter | Value |
|---|---|
| Planar radius | 440 mm (span \(\approx\) 880 mm) |
| Vertical envelope | 683.5 mm (base to highest reach) |
| Offsets | Base: 165 mm; TCP: 61.5 mm |

These dimensions define the lattice bounds for PathFormer: \[x,y \in [-440, 440]~\text{mm}, \qquad z \in [0, 683.5]~\text{mm}.\]

3.9.0.2 Mounting Interface

The flange geometry (Fig. 6(c)) is summarized in Table 3. These dimensions ensure precise registration of end-effectors and alignment of the lattice frame with the TCP.

Table 3: Lite6 end-effector flange specifications.

| Parameter | Value |
|---|---|
| Mounting face | mm |
| Bolt circles | \(2\times\) M6 @ PCD 63 mm; \(4\times\) M6 @ PCD 50 mm |
| Bore / Counterbore | mm / mm, with 45° chamfers |

3.9.0.3 Lattice Parameterization

The workspace is discretized at a resolution of 20 mm per axis, yielding: \[\mathcal{V}=\Bigl\{(x,y,z)\in \mathbb{Z}^3 \;\big|\; x,y \in [-22,\dots,22],\; z\in[0,\dots,34] \Bigr\}.\]

Constraint-masked decoding enforces lattice adjacency: \[\ell_1(p_t, p_{t+1}) = 1, \qquad p_{t+1} \in \mathcal{V},\] where \(\ell_1\) is the Manhattan distance. The TCP offset (61.5 mm) is applied as a fixed transform.
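A small sketch of the world-to-lattice conversion and the adjacency predicate under these parameters; the sign and axis of the TCP compensation shown here are illustrative.

```python
RES_MM = 20.0          # lattice resolution per axis (Sec. 3.9)
TCP_OFFSET_MM = 61.5   # fixed TCP transform, applied along z for illustration

def world_to_cell(x_mm, y_mm, z_mm):
    # Quantize base-frame coordinates (mm) to lattice indices.
    return (round(x_mm / RES_MM), round(y_mm / RES_MM), round(z_mm / RES_MM))

def cell_to_world(cx, cy, cz, apply_tcp=True):
    # Map a cell back to millimeters, optionally compensating the TCP offset.
    z = cz * RES_MM + (TCP_OFFSET_MM if apply_tcp else 0.0)
    return (cx * RES_MM, cy * RES_MM, z)

def is_adjacent(p, q):
    # Manhattan adjacency ell_1(p, q) = 1, as enforced by the decoder mask.
    return sum(abs(a - b) for a, b in zip(p, q)) == 1
```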

3.10 Perception, Calibration, and Safety

A RealSense D435i provides depth sensing, with automated hand–eye calibration (aruco/easy_handeye2). Extrinsics are persisted and reloaded at launch. MoveIt 2 enforces joint limits, self-collision, and planning-scene checks. YOLOv11-based detection supplies \((x,y,z)\) targets for digital twin updates.

3.11 PathFormer Integration with Digital Twins

PathFormer executes as a ROS 2 node, consuming task context and state (from Gazebo/MoveIt or joint_states) and producing lattice-valid waypoints:

  1. Validity-checked against workspace bounds,

  2. Time-parameterized by MoveIt 2 for smooth execution, or sent directly via xarm_api.

In digital twin mode (Fig. 1), the depth-camera stream synchronizes with the 3D grid in real time, supporting both single-arm and dual-arm experiments.
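A skeletal rclpy node illustrating this integration; the topic name, frame id, and message type are assumptions, and in practice waypoints are handed to MoveIt 2 or xarm_api as described above.

```python
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import PointStamped

class PathFormerNode(Node):
    # Skeleton: publish lattice-valid waypoints for downstream execution.
    def __init__(self):
        super().__init__("pathformer")
        self.pub = self.create_publisher(PointStamped, "pathformer/waypoints", 10)

    def publish_waypoint(self, x_mm, y_mm, z_mm):
        msg = PointStamped()
        msg.header.frame_id = "link_base"  # assumed robot-base frame name
        msg.point.x = x_mm / 1000.0        # lattice output (mm) to meters
        msg.point.y = y_mm / 1000.0
        msg.point.z = z_mm / 1000.0
        self.pub.publish(msg)

def main():
    rclpy.init()
    rclpy.spin(PathFormerNode())
    rclpy.shutdown()

if __name__ == "__main__":
    main()
```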

4 Experimental Results

We evaluate PathFormer with three experimental designs spanning simulation and hardware:

D1 — Lattice-trajectory decoding (offline). Evaluation on 10,751 labeled lattice trajectories to test order correctness and coordinate-set recovery (Table 4).

D2 — Robotic trajectory execution (controlled bench). xArm Lite 6 paired with a depth-camera digital twin; lattice paths mapped to grasp–move–release primitives to assess feasibility, stability, and mechanical failure modes under controlled layouts.

D3 — Real-world end-to-end manipulation (cluttered scenes). Natural-language pickup–place tasks on the same platform with perception, constraint checks, and a transformer policy; measures sim\(\to\)real transfer and robustness in dynamic, occluded settings (Tables 6, 7).

These studies address: (Q1) Are trajectories order-correct? (Q2) Do they recover the right set of coordinates? (Q3) Are predictions valid by construction and transferable to physical hardware?

4.1 Evaluation Metrics

We evaluate PathFormer using four complementary metrics:

4.1.0.1 Stepwise Accuracy.

For a predicted trajectory \(\hat{\pi} = \{\hat{p}_1, \ldots, \hat{p}_T\}\) and ground-truth \(\pi^\star = \{p^\star_1, \ldots, p^\star_T\}\): \[\mathrm{Acc}_{\text{step}} = \frac{1}{T}\sum_{t=1}^{T} \mathbb{1}\left[\hat{p}_t = p^\star_t\right].\] This order-sensitive metric penalizes truncations, swaps, or over-extensions. PathFormer achieves 89.44%, i.e., \(\sim\)9/10 steps are aligned with the gold trajectory.

4.1.0.2 Coordinate Precision, Recall, and F1.

Order-agnostic set overlap between predictions and ground truth is measured by: \[\mathrm{Precision} = \tfrac{|\hat{\Pi}\cap \Pi^\star|}{|\hat{\Pi}|},\quad \mathrm{Recall} = \tfrac{|\hat{\Pi}\cap \Pi^\star|}{|\Pi^\star|},\quad \mathrm{F1} = \tfrac{2PR}{P+R},\] where \(\hat{\Pi}\) and \(\Pi^\star\) denote the sets of coordinates visited by the predicted and ground-truth trajectories. PathFormer achieves 93.32% precision, 89.44% recall, and 90.40% F1, demonstrating a strong balance between correctness and coverage.

4.1.0.3 Valid Path Percent.

We compute the fraction of predictions that obey lattice adjacency and workspace bounds: \[\mathrm{Valid} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\big[\forall t,\;\|\hat{p}_{t+1}-\hat{p}_{t}\|_1=1 \wedge \hat{p}_t \in \mathcal{V}\big].\] PathFormer achieves 99.99%, indicating that virtually all decoded paths are legal by construction.
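A reference implementation of these offline metrics, assuming trajectories are lists of integer (x, y, z) tuples; length mismatches are counted as errors, matching the order-sensitive definition above.

```python
def stepwise_accuracy(pred, gold):
    # Order-sensitive agreement; truncations and over-extensions count as misses.
    hits = sum(p == g for p, g in zip(pred, gold))
    return hits / max(len(pred), len(gold), 1)

def set_prf(pred, gold):
    # Order-agnostic overlap of visited coordinates.
    P, G = set(pred), set(gold)
    inter = len(P & G)
    prec = inter / len(P) if P else 0.0
    rec = inter / len(G) if G else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def is_valid_path(pred, in_workspace):
    # Every cell inside the lattice bounds and every move unit-adjacent (L1).
    return all(in_workspace(p) for p in pred) and all(
        sum(abs(a - b) for a, b in zip(p, q)) == 1
        for p, q in zip(pred, pred[1:]))
```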

4.1.0.4 Real-Robot Success Rate.

Physical execution success is defined as completing a primitive without collision or boundary violation: \[\mathrm{Success} = \frac{\# \; \text{successful trials}}{\# \; \text{total trials}} \times 100\%.\] Results peak at 97.5% for reaching and 92.5% for picking in central workspace ranges, with mild degradation at the periphery.

4.2 Results by Experimental Design

4.2.0.1 Setup (summary).

All experiments use the same decoding stack and safety envelopes across simulation and hardware. The lattice decoder emits unit-adjacent moves with constraint masks; a digital twin maintains a time-synchronized scene graph (pose grids, occlusion map, reachability) and streams targets to an xArm Lite 6 controller with joint-velocity commands, geofences, depth/angle gates, and max-force abort. Depth images are registered to the robot base with a one-time extrinsic calibration; predicted trajectories are compared to measured end-effector traces after tool-center-point (TCP) offset compensation.

4.2.0.2 D1 — Lattice-trajectory decoding (offline).

We quantify order correctness, coordinate-set recovery, and legality on \(10{,}751\) labeled trajectories (no scene images; pure symbolic decoding). Decoding uses beam width \(B{=}5\) with legality masks; ties are broken by coverage penalties to discourage early truncation.
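As a companion to the greedy sketch in Sec. 3.7, a compact masked beam search is sketched below; the per-cell coverage bonus is an illustrative surrogate for the coverage penalty described above, and goal/termination handling is omitted.

```python
import torch

def beam_decode(model, context, p0, neighbors, cell_id, B=5, max_steps=64, cov_w=0.1):
    # Beams hold (path, token prefix, score); only legal neighbors are expanded.
    beams = [([p0], [cell_id(p0)], 0.0)]
    for _ in range(max_steps):
        cand = []
        for path, prefix, score in beams:
            logp = model(torch.tensor([prefix]), context)[0, -1].log_softmax(-1)
            for q in neighbors(path[-1]):                # legality mask by construction
                bonus = cov_w if q not in path else 0.0  # discourage early truncation
                cand.append((path + [q], prefix + [cell_id(q)],
                             score + float(logp[cell_id(q)]) + bonus))
        if not cand:
            break
        beams = sorted(cand, key=lambda b: b[2], reverse=True)[:B]
    return beams[0][0]  # best-scoring legal path
```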

Table 4: Lattice corpus results (trajectory fidelity and validity).

| Metric | Stepwise Acc. | Precision | Recall | F1 | Valid Path % |
|---|---|---|---|---|---|
| Value (%) | 89.44 | 93.32 | 89.44 | 90.40 | 99.99 |

Error profile. Most residual errors are (E1) tail truncation on long horizons and (E2) adjacent-step swaps; (E3) boundary nudges occur when alternatives with equal score choose different but still valid neighbors. Legality masks eliminate illegal jumps (L1) by construction.

4.2.0.3 D2 — Robotic trajectory execution (controlled bench).

Lattice paths are compiled to grasp–move–release primitives with phase-conditioned gains. The controller executes “Approach \(\to\) Engage \(\to\) Transport \(\to\) Release” with impedance/admittance tracking; contact spikes trigger slow-down or abort, and geofences prevent out-of-workspace motion. In this setting, graph-constrained decoding reliably grounds symbolic plans into feasible motor programs. Failures are rare and predominantly mechanical (e.g., transient gripper slip on glossy plastic). When minor object shifts are detected (e.g., a \(15^\circ\) cube rotation), the digital twin re-grounds the target and execution proceeds without global re-planning.

Figure 7: D2: Real-robot Reach task—success (top) and failure (bottom).
Figure 8: D2: Real-robot Place task—success (top) and failure (bottom).
Table 5: D2: Real-robot success rates (SR, %) for Reach and Pick across workspace ranges.

| Task | Area \((x, y, z)\) [mm] | SR |
|---|---|---|
| Reach | Range 1: (190–210, -225–235, 0–10) | 97.5 |
| | Range 2: (220–235, -170–185, 0–10) | 95.0 |
| | Range 3: (240–255, -200–215, 0–10) | 95.0 |
| Pick | Range 1: (190–210, -225–235, 0–10) | 92.5 |
| | Range 2: (220–235, -170–185, 0–10) | 87.5 |
| | Range 3: (240–255, -200–215, 0–10) | 90.0 |

Interpretation. Central workspace (Range 1) yields the highest success; slight degradation near Ranges 2/3 aligns with kinematic reach and camera perspective foreshortening. In successful runs, predicted paths overlay measured end-effector trajectories; residuals reflect boundary nudges (camera quantization, TCP offsets).

4.2.0.4 D3 — Real-world end-to-end manipulation (cluttered scenes).

We evaluate 60 natural-language pickup–place instructions (e.g., “pickup smartphone \(\rightarrow\) place in container”) in dynamic, partially occluded scenes using (i) LLM-based perception, (ii) feasibility/occlusion checks, and (iii) a transformer policy conditioned on scene state and constraints.

Unexpected events & local re-grounding.

  • Slipped object: a phone slid \(\sim\)4 cm after a near-miss; pose was re-grounded and the next grasp succeeded without discarding the plan.

  • Occlusion: a partially hidden cup caused an invalid first grasp; visibility/approach updated and the second attempt completed.

  • Dynamic obstacle: a spoon drifted into the reach path; a one-cell detour preserved lattice feasibility and rejoined the nominal route.

Table 6: D3: Real-world results on 60 instructions (cluttered tabletop).

| Metric | Value (%) | Notes |
|---|---|---|
| Successful Executions | 86.7 | 52/60 completed |
| Grasp Success Rate | 90.0 | Most failures from occlusion/tight packing |
| Placement Accuracy | 93.3 | Minor drop-location deviations |

Failure analysis (8/60; 13.3%). Four dominant modes were observed; counts are summarized in Table 7.

  • State-sync “no state” (heavy occlusion or ambiguous “an object” prompts). Mitigation: target disambiguation + retry perception sweep.

  • Occlusion/tight clustering (blocked approach inside the container). Mitigation: local clearing or alternate grasp.

  • Small nested item blocks grasp. Mitigation: nested-occupancy pre-check; reorder picks.

  • Perception mis-ID (category/instance error). Mitigation: frame-majority voting; confidence-gated execution.

Table 7: D3: Failure modes and counts (60 trials).

| Failure mode | Count | Rate (%) |
|---|---|---|
| No state available (state-sync) | 5 | 8.3 |
| Occlusion / tight clustering | 1 | 1.7 |
| Small object blocks grasp | 1 | 1.7 |
| Perception mis-prediction | 1 | 1.7 |
| Total | 8 | 13.3 |

4.2.0.5 Takeaways.

(1) Symbolic\(\to\)motor reliability: lattice plans ground into phase-conditioned motor primitives across layouts. (2) Robustness in clutter: end-to-end success \(\sim 87\%\) with local re-grounding without global re-plans. (3) Adaptivity: slips, occlusions, and incidental obstacles are handled via pose re-estimation and one-cell detours. (4) Roadmap: failures motivate multi-view sensing, target disambiguation, grasp reordering, and confidence-gated execution.

4.3 Discussion

Summary across D1–D3. The Path-based Transformer demonstrates that trajectory generation can be simultaneously data-driven and structurally guaranteed across three settings: D1 (offline lattice decoding), D2 (controlled robotic execution with a digital twin), and D3 (real-world, cluttered manipulation). On the \(10{,}751\)-trajectory corpus (D1; Table 4), our model aligns closely with ground truth (stepwise accuracy \(89.44\%\), F1 \(90.40\%\)) while maintaining near-perfect legality (\(99.99\%\)), underscoring the value of explicit graph constraints layered on Transformer decoding. In D2, lattice plans compile to phase-conditioned grasp–move–release primitives on an xArm Lite 6, achieving high success in central workspace ranges (Reach up to \(97.5\%\), Pick up to \(92.5\%\); Table 5); rare failures are predominantly mechanical and are mitigated by twin re-grounding. In D3, end-to-end execution over \(60\) language-specified tasks attains \(86.7\%\) overall success with robust handling of slips, occlusions, and incidental obstacles (Table 6); the failure distribution (Table 7) is dominated by state-sync and heavy occlusion.

Why structure helps. The 3-grid representation encodes where/what/when, and legality masks enforce unit-adjacent moves within workspace bounds, ensuring that decoded steps are physically feasible by construction. This structural prior carries through to hardware: in D2 and D3, legality prevents multi-axis jumps, while the digital twin converts symbolic nodes into calibrated targets and supports local pose re-grounding, obviating costly global re-plans.

Observed error modes and their loci. Across D1, residual errors concentrate on (E1) tail truncation for long horizons and (E2) adjacent-step swaps; (E3) boundary nudges arise from score ties between neighboring valid nodes. In D2, failures are mostly mechanical (e.g., transient gripper slip on glossy plastic) and are quickly recovered via target updates (e.g., after a \(15^\circ\) rotation). In D3, the principal failure modes are state-sync “no state” (8.3%), occlusion/tight clustering (1.7%), small nested items blocking grasps (1.7%), and perception mis-ID (1.7%)—see Table 7.

Implications. The combination of constraint-masked decoding and twin-backed execution yields reliable symbolic\(\to\)motor grounding (D2) and robustness in clutter via local re-grounding (D3), delivering \(\sim 87\%\) task-level success without global re-planning. These results suggest that graph-constrained Transformers can serve as dependable high-level planners whose outputs remain executable after sim\(\to\)real transfer.

Limitations and remedies (by design). For D1, horizon-aware decoding (coverage penalties, scheduled sampling, curriculum for long paths) can reduce truncation and swaps, and explicit boundary handling can further trim nudges. For D2, mechanical slips motivate improved grasp-force control and material-aware grasp selection. For D3, the observed failures point to (i) multi-view/active perception to reduce “no state” events, (ii) target disambiguation and confidence-gated execution to curb mis-ID, and (iii) grasp reordering with nested-occupancy checks to address occlusion and container clutter.

Outlook. A tighter loop between perception and decoding (frame-majority voting, uncertainty-aware masks), plus adaptive grasping and local clearing primitives, should raise success rates and cut retries in D3 while preserving the structural guarantees validated in D1 and the motor reliability demonstrated in D2. Overall, PathFormer illustrates that integrating graph structure with sequence models, and coupling them to a digital twin, provides a scalable route to resilient sim-to-real robot behavior.

5 Conclusion

We introduced a Path-based Transformer that unifies spatial motion, task structure, and action order through a 3-grid representation (where/what/when) and constraint-masked decoding. Embedding graph structure directly into sequence prediction yields trajectories that are accurate, interpretable, and physically feasible. On the \(10{,}751\)-trajectory corpus (D1), the model attains strong sequence fidelity (stepwise accuracy \(89.44\%\), F1 \(90.40\%\)) with near-perfect legality (\(99.99\%\)). When compiled to motor primitives on hardware (D2), central-workspace success peaks at \(97.5\%\) for Reach and \(92.5\%\) for Pick; in real, cluttered scenes with language goals (D3), end-to-end success reaches \(86.7\%\) while absorbing slips, occlusions, and incidental obstacles through local re-grounding rather than global re-plans. By bridging sequence models and graph-based planning, the approach provides symbolic\(\to\)motor reliability: legality masks prevent multi-axis jumps, and the digital twin converts lattice nodes into calibrated targets and supports rapid pose updates, enabling robust sim\(\to\)real behavior while preserving safety envelopes.

We will extend the method to longer horizons and multi-task settings using horizon-aware decoding—with coverage penalties, scheduled sampling, and explicit boundary handling—and tighten the perception–decoding loop via uncertainty-aware masks, multi-view or active perception, and confidence-gated execution to reduce state-sync failures. We also plan to incorporate grasp reordering with nested-occupancy checks to handle container clutter and to couple PathFormer with high-level symbolic or language-conditioned planners alongside MPC-style refinement, thereby marrying task-level reasoning with low-level safety. Structurally constrained Transformers, coupled to a digital twin, offer a scalable path to reliable, constraint-aware trajectory generation and execution for general-purpose robotic manipulation.

References

[1]
X. Zhu, Y. Qi, Y. Zhu, R. Walters, and R. Platt, “EquAct: An SE(3)-equivariant multi-task transformer for open-loop robotic manipulation,” arXiv preprint arXiv:2505.21351, 2025.
[2]
K. Gao, F. Wang, E. Aduh, D. Randle, and J. Shi, “MUST: Multi-head skill transformer for long-horizon dexterous manipulation with skill progress,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2025, to appear. [Online]. Available: https://arxiv.org/abs/2502.02753.
[3]
T. Mu, Y. Liu, and M. Armand, “Look before you leap: Using serialized state machine for language conditioned robotic manipulation,” arXiv preprint arXiv:2503.05114, 2025.
[4]
Y. Mu, T. Chen, Z. Chen, S. Peng, Z. Lan, Z. Gao, Z. Liang, Q. Yu, Y. Zou, M. Xu et al., “RoboTwin: Dual-arm robot benchmark with generative digital twins,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 27649–27660.
[5]
Y. Cen, J. Deng, Y. Chen, H. Liu, Z. Zhong, B. Fan, L. Chang, and L. Jiang, “Digital twin-empowered robotic arm control: An integrated PPO and fuzzy PID approach,” Mathematics, vol. 13, no. 2, p. 216, 2025.
[6]
M. Singh, J. Kapukotuwa, E. L. S. Gouveia, E. Fuenmayor, Y. Qiao, N. Murry, and D. Devine, “Unity and ROS as a digital and communication layer for digital twin application: Case study of robotic arm in a smart manufacturing cell,” Sensors, vol. 24, no. 17, p. 5680, 2024.
[7]
S. James, Z. Ma, D. R. Arrojo, and A. J. Davison, “RLBench: The robot learning benchmark & learning environment,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3019–3026, 2020.
[8]
M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox, “ALFRED: A benchmark for interpreting grounded instructions for everyday tasks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10740–10749.
[9]
A. Zeng, P. Florence, J. Tompson, S. Welker, J. Chien, M. Attarian, T. Armstrong, I. Krasin, D. Duong, V. Sindhwani, and J. Lee, “Transporter networks: Rearranging the visual world for robotic manipulation,” in Proceedings of the 2020 Conference on Robot Learning (CoRL), ser. Proceedings of Machine Learning Research, vol. 155. PMLR, 2021, pp. 726–747.
[10]
O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard, “CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 11205–11212, 2022.
[11]
K. Zheng, X. Chen, O. C. Jenkins, and X. Wang, “VLMbench: A compositional benchmark for vision-and-language manipulation,” in NeurIPS 2022 Track on Datasets and Benchmarks, 2022.
[12]
Y. Jiang, A. Gupta, Z. Zhang, G. Wang, Y. Dou, Y. Chen, L. Fei-Fei, A. Anandkumar, Y. Zhu, and L. Fan, “VIMA: General robot manipulation with multimodal prompts,” in NeurIPS 2022 Foundation Models for Decision Making Workshop, 2022.
[13]
W. Pumacay, I. Singh, J. Duan, R. Krishna, J. Thomason, and D. Fox, “The COLOSSEUM: A benchmark for evaluating generalization for robotic manipulation,” in RSS 2024 Workshop: Data Generation for Robotics, 2024.
[14]
R. Garcia, S. Chen, and C. Schmid, “Towards generalizable vision-language robotic manipulation: A benchmark and LLM-guided 3D policy,” in 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 8996–9002.
[15]
P.-L. Guhur, S. Chen, R. G. Pinel, M. Tapaswi, I. Laptev, and C. Schmid, “Instruction-driven history-aware policies for robotic manipulations,” in Conference on Robot Learning. PMLR, 2023, pp. 175–187.
[16]
M. Shridhar, L. Manuelli, and D. Fox, “Perceiver-Actor: A multi-task transformer for robotic manipulation,” in Conference on Robot Learning. PMLR, 2023, pp. 785–799.
[17]
A. Goyal, J. Xu, Y. Guo, V. Blukis, Y.-W. Chao, and D. Fox, “RVT: Robotic view transformer for 3D object manipulation,” in Conference on Robot Learning. PMLR, 2023, pp. 694–710.
[18]
C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” The International Journal of Robotics Research, p. 02783649241273668, 2023.
[19]
T.-W. Ke, N. Gkanatsios, and K. Fragkiadaki, “3D Diffuser Actor: Policy diffusion with 3D scene representations,” in Conference on Robot Learning. PMLR, 2025, pp. 1949–1974.
[20]
A. Brohan, Y. Chebotar, C. Finn, K. Hausman, A. Herzog, J. Hsu, R. Julian, R. Hoque, I. Mordatch, K. Pertsch, J. Peralta, E. Perez, K. Schmeckpeper, J. Vakil, Q. Vuong, F. Xia, T. Xiao, P. Xu, M. Yan, B. Ichter, and S. Levine, “Do as I can, not as I say: Grounding language in robotic affordances,” in Proceedings of the 7th Conference on Robot Learning (CoRL), ser. Proceedings of Machine Learning Research. PMLR, 2023.
[21]
W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei, “VoxPoser: Composable 3D value maps for robotic manipulation with language models,” in Proceedings of the 7th Conference on Robot Learning (CoRL), ser. Proceedings of Machine Learning Research, vol. 229. PMLR, 2023, pp. 540–562.
[22]
J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng, “Code as policies: Language model programs for embodied control,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 9493–9500.
[23]
W. Liu, K. Eltouny, S. Tian, X. Liang, and M. Zheng, “KG-Planner: Knowledge-informed graph neural planning for collaborative manipulators,” IEEE Transactions on Automation Science and Engineering, vol. 22, pp. 5061–5071, 2024.
[24]
Z. Ye, Y. J. Kumar, G. O. Sing, F. Song, and J. Wang, “A comprehensive survey of graph neural networks for knowledge graphs,” IEEE Access, vol. 10, pp. 75729–75741, 2022.
[25]
Y. Jeong, J. Hong, B. Kim, and H. J. Kim, “GNN-Transformer task planner with semantic-driven data augmentation and verification for robot manipulation,” in 2025 IEEE International Conference on Robotics and Automation (ICRA), 2025, pp. 5231–5238.

  1. \(^{1}\)Department of Computer Science, University of Missouri–Kansas City (UMKC), Kansas City, MO, USA. {aha85b, leeyu}@umkc.edu

  2. \(^{2}\)Department of Computer Science, California State University, Fullerton, CA, USA. duyho@fullerton.edu

  3. This work has been submitted to the IEEE International Conference on Robotics and Automation (ICRA) 2026 for review. Copyright may be transferred without notice, after which this version may no longer be accessible.