RPL: Learning Robust Humanoid Perceptive Locomotion on Challenging Terrains

Yuanhang Zhang*1,2,   Younggyo Seo1,   Juyue Chen1,   Yifu Yuan2,  
Koushil Sreenath1,4,   Pieter Abbeel†1,4,   Carmelo Sferrazza†1,   Karen Liu†1,3,   Rocky Duan†1,   Guanya Shi†1,2  
1Amazon FAR (Frontier AI & Robotics),  2Carnegie Mellon University,  3Stanford University,  4UC Berkeley 
*Work done during an internship at Amazon FAR. †Amazon FAR team co-lead.

Long-Horizon Locomotion over Challenging Terrains

🎬 ONE Policy for ALL Demos.

[Note: all videos are sped up 2×.]

Bidirectional Locomotion

Back and Forth on Different Stairs

Loco-Manipulation with 2kg Payload


Abstract

Humanoid perceptive locomotion has made significant progress and shows great promise, yet robust multi-directional locomotion on complex terrains remains an open challenge. To tackle it, we propose RPL, a two-stage training framework that enables multi-directional locomotion on challenging terrains and remains robust under payload transportation. RPL first trains terrain-specific expert policies with privileged height-map observations to master decoupled locomotion and manipulation skills across different terrains, and then distills them into a transformer policy that leverages multiple depth cameras to cover a wide range of views. During distillation, we introduce two techniques to robustify multi-directional locomotion, depth feature scaling based on velocity commands (DFSV) and random side masking (RSM), which are critical for handling asymmetric depth observations and unseen terrain widths. For scalable depth distillation, we develop an efficient multi-depth system that ray-casts against both dynamic robot meshes and static terrain meshes in massively parallel environments, achieving a 5× speedup over the depth-rendering pipelines in existing simulators while modeling realistic sensor latency, noise, and dropout. Extensive real-world experiments demonstrate robust multi-directional locomotion with 2 kg payloads across challenging terrains, including 20° slopes, staircases with different step lengths (22 cm, 25 cm, 30 cm), and 25 cm × 25 cm stepping stones separated by 60 cm gaps.
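The two-stage recipe above can be summarized as a teacher-student distillation loop. Below is a minimal PyTorch sketch, assuming a DAgger-style setup in which a privileged height-map teacher supervises the depth-based transformer student while the student collects its own rollouts; all module names, dimensions, and the `env` interface are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the two-stage teacher-student distillation described above.
# Module names, shapes, and the env API are illustrative assumptions.
import torch
import torch.nn as nn

class StudentPolicy(nn.Module):
    """Transformer student: fuses per-camera depth features + proprioception."""
    def __init__(self, n_cams=4, feat_dim=128, prop_dim=48, act_dim=29):
        super().__init__()
        self.depth_enc = nn.Sequential(  # shared CNN applied to each depth image
            nn.Conv2d(1, 16, 5, stride=2), nn.ELU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ELU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, feat_dim),
        )
        self.prop_proj = nn.Linear(prop_dim, feat_dim)
        layer = nn.TransformerEncoderLayer(feat_dim, nhead=4, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(feat_dim, act_dim)

    def forward(self, depths, prop):
        # depths: (B, n_cams, 1, H, W), prop: (B, prop_dim)
        B, N = depths.shape[:2]
        tokens = self.depth_enc(depths.flatten(0, 1)).view(B, N, -1)
        tokens = torch.cat([tokens, self.prop_proj(prop)[:, None]], dim=1)
        return self.head(self.fuse(tokens)[:, -1])  # read out from the prop token

def distill_step(student, teacher, env, opt):
    """DAgger-style step: the student drives data collection, the
    terrain-specific teacher (privileged height map) provides targets."""
    obs = env.observe()                                   # assumed env API
    with torch.no_grad():
        target = teacher(obs["height_map"], obs["prop"])  # privileged expert
    action = student(obs["depth"], obs["prop"])
    loss = nn.functional.mse_loss(action, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    env.step(action.detach())
    return loss.item()
```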

Ablation on Multi-Depth Distillation

In simulation, we vary the payload weight in each hand from 0 kg to 2.5 kg.

We ablate the two distillation components used in RPL: random side masking (RSM) and depth feature scaling based on velocity commands (DFSV), both sketched below the demo labels. Despite similar training losses, the full RPL policy generalizes best to out-of-distribution (OOD) settings: RSM is critical for unseen narrow terrain widths, while DFSV handles asymmetric multi-view inputs (e.g., front camera on stairs, rear camera on stepping stones). Together they enable robust bidirectional stair locomotion under unseen geometry and asymmetric perception.

RPL

RPL w/o DFSV

RPL w/o RSM
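Below is one plausible reading of the two components as operations on per-camera feature tokens: DFSV scales each camera token by its alignment with the commanded planar velocity, and RSM randomly zeros the side-camera tokens during training. The camera directions, the sigmoid scaling, and the masking probability are illustrative assumptions; the exact formulation in RPL may differ.

```python
# Hedged sketch of DFSV and RSM on per-camera depth tokens; the exact
# formulation in RPL may differ from this interpretation.
import torch

CAM_DIRS = torch.tensor([   # assumed planar unit vector each camera faces
    [ 1.0,  0.0],           # front
    [-1.0,  0.0],           # back
    [ 0.0,  1.0],           # left
    [ 0.0, -1.0],           # right
])

def dfsv(tokens, vel_cmd):
    """Scale each camera token by its alignment with the commanded planar
    velocity, so the view facing the walking direction dominates even when
    the multi-view observations are asymmetric."""
    # tokens: (B, n_cams, D), vel_cmd: (B, 2)
    dirs = CAM_DIRS.to(vel_cmd.device)
    align = torch.einsum("bc,nc->bn", vel_cmd, dirs)   # (B, n_cams)
    scale = torch.sigmoid(align).unsqueeze(-1)         # in (0, 1)
    return tokens * scale

def random_side_masking(tokens, side_idx=(2, 3), p=0.5):
    """During distillation, randomly zero the side-camera tokens so the
    student cannot over-fit to the terrain widths seen in training."""
    mask = torch.ones_like(tokens[..., :1])
    for i in side_idx:
        drop = torch.rand(tokens.shape[0], device=tokens.device) < p
        mask[drop, i] = 0.0
    return tokens * mask
```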



Scalable Multi-Depth Rendering Performance

We benchmark the capability and speed of the depth rendering pipelines in RPL and in existing simulators. We evaluate a locomotion-only task with Ncam=1, 2, 4 depth cameras over all terrain types (slopes, stairs, and stepping stones), focusing on VRAM usage and iteration time. The depth resolution is 240×135, evaluated on a single NVIDIA L40S GPU. The baselines are: (1) IsaacGym PhysX; (2) IsaacSim RTX (TiledCamera); (3) IsaacSim Warp (RayCasterCamera). The multi-depth rendering pipeline in RPL ray-casts against both dynamic and static meshes, achieving a 5× speedup over the most efficient baseline (IsaacSim Warp), which does not support dynamic-mesh ray-casting.

| Method | Dyn. Mesh | Num. Envs | Ncam=1: VRAM (GB)↓ / Iter. (s)↓ | Ncam=2: VRAM (GB)↓ / Iter. (s)↓ | Ncam=4: VRAM (GB)↓ / Iter. (s)↓ |
|---|---|---|---|---|---|
| IsaacGym PhysX | ✓ | 1024 | 16.8 / 35.6±4.1 | 22.5 / 70.1±8.3 | 33.9 / 146.5±16.1 |
| IsaacSim RTX | ✓ | 1024 | 17.5 / 5.3±0.1 | 23.2 / 7.6±0.1 | 34.4 / 12.6±0.1 |
| IsaacSim Warp | ✗ | 1024 | 12.8 / 3.5±0.1 | 15.1 / 5.9±0.1 | 20.7 / 9.1±0.1 |
| RPL (Ours) | ✓ | 1024 | 13.3 / 1.3±0.0 | 14.6 / 1.5±0.1 | 17.3 / 1.9±0.1 |

Depth rendering scalability with a fixed number of parallel environments across different numbers of depth cameras.
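For intuition, here is a minimal sketch of the core idea, assuming an NVIDIA Warp implementation: each depth pixel casts one ray and keeps the nearest hit across a set of mesh BVHs, some of which belong to the robot's own links. The kernel structure and array layout are illustrative; the actual RPL renderer additionally models sensor latency, noise, and dropout.

```python
# Sketch of multi-mesh depth ray-casting with NVIDIA Warp; illustrative only,
# not the actual RPL renderer (which also models latency, noise, and dropout).
import warp as wp

wp.init()

@wp.kernel
def depth_raycast(
    mesh_ids: wp.array(dtype=wp.uint64),   # static terrain + dynamic robot meshes
    ray_starts: wp.array(dtype=wp.vec3),   # one ray per depth pixel
    ray_dirs: wp.array(dtype=wp.vec3),
    max_dist: float,
    depth: wp.array(dtype=float),
):
    tid = wp.tid()
    best_t = max_dist
    # nearest hit over all meshes, so the robot's own body occludes terrain
    for m in range(mesh_ids.shape[0]):
        q = wp.mesh_query_ray(mesh_ids[m], ray_starts[tid], ray_dirs[tid], best_t)
        if q.result:
            best_t = q.t
    depth[tid] = best_t
```

The dynamic-mesh support comes from writing the robot link vertices, transformed by the current link poses, into each robot `wp.Mesh.points` array every step and calling `wp.Mesh.refit()` to update its BVH in place, which is far cheaper than re-tracing a full RTX scene.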


Multi-Camera Configuration for Multi-Directional Locomotion

To demonstrate the necessity of multiple cameras for multi-directional locomotion, we compare the terrain levels achieved with Ncam=1, 2, 4 under bidirectional and omnidirectional locomotion. Bidirectional locomotion supports forward and backward walking, whereas omnidirectional locomotion enables movement in all planar directions (forward, backward, left, and right). When Ncam=1, a downward-facing camera is used to maximize terrain coverage. As the number of depth cameras decreases, performance degrades, especially on stepping stones, where sparse footholds with gaps of up to 70 cm demand a wide field of view aligned with the walking direction. These results indicate that reliable bidirectional and omnidirectional locomotion benefits from at least two depth cameras covering each potential walking direction to ensure sufficient terrain visibility.

| Task | Ncam (Config.) | Slopes | Stairs Up | Stairs Down | Stepping Stones |
|---|---|---|---|---|---|
| Bidirectional | Expert Level | 6.0 | 6.0 | 6.0 | 6.0 |
| | 1 (Down) | 6.0±0.0 | 6.0±0.0 | 5.9±0.1 | 5.1±0.1 |
| | 2 (F + B) | 6.0±0.0 | 6.0±0.0 | 6.0±0.0 | 6.0±0.0 |
| | 4 (F + B + L + R) | 6.0±0.0 | 6.0±0.0 | 6.0±0.0 | 6.0±0.0 |
| Omnidirectional | Expert Level | 6.0 | 6.0 | 6.0 | 5.6 |
| | 1 (Down) | 6.0±0.0 | 6.0±0.0 | 5.9±0.1 | 3.0±0.2 |
| | 2 (F + B) | 6.0±0.0 | 6.0±0.0 | 5.9±0.1 | 4.5±0.1 |
| | 4 (F + B + L + R) | 6.0±0.0 | 6.0±0.1 | 6.0±0.0 | 4.6±0.0 |

Terrain levels↑ achieved under different numbers of depth cameras for bidirectional and omnidirectional locomotion.
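To make the table's naming concrete, a hypothetical camera layout for the three configurations could look as follows; all angles are illustrative placeholders, not the hardware mounting values.

```python
# Hypothetical camera layouts matching the configurations in the table above.
# Yaw/pitch values are illustrative placeholders, not the hardware values.
CAMERA_CONFIGS = {
    1: {"down":  {"yaw_deg":   0.0, "pitch_deg": -90.0}},  # max terrain coverage
    2: {"front": {"yaw_deg":   0.0, "pitch_deg": -45.0},
        "back":  {"yaw_deg": 180.0, "pitch_deg": -45.0}},
    4: {"front": {"yaw_deg":   0.0, "pitch_deg": -45.0},
        "back":  {"yaw_deg": 180.0, "pitch_deg": -45.0},
        "left":  {"yaw_deg":  90.0, "pitch_deg": -45.0},
        "right": {"yaw_deg": -90.0, "pitch_deg": -45.0}},
}
```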

BibTeX

@article{zhang2026rpl,
  title={RPL: Learning Robust Humanoid Perceptive Locomotion on Challenging Terrains},
  author={Zhang, Yuanhang and Seo, Younggyo and Chen, Juyue and Yuan, Yifu and Sreenath, Koushil and Abbeel, Pieter and Sferrazza, Carmelo and Liu, Karen and Duan, Rocky and Shi, Guanya},
  journal={arXiv preprint arXiv:2602.03002},
  year={2026}
}