PhysHSI: Towards a Real-World Generalizable and Natural Humanoid-Scene Interaction System

1Shanghai AI Laboratory, 2Hong Kong University of Science and Technology
*Equal Contribution †Corresponding Authors

Object Localization Visualization

During initialization, we interactively assign only the coarse position of the target object via the LiDAR visualization. Thereafter, all localization and interaction behaviors run fully automatically.

Carry Box

Sit Down

Lie Down

Stand Up

Stylized Locomotion

High-Knee Stepping

Dinosaur-Like Walking

Generalization


Thanks to our AMP-based policy learning and the integrated object localization module, the system demonstrates strong spatial and object-level generalization, adapting to a wide range of real-world environments without being constrained by the coverage of the prior motion data.

Baseline Comparisons

PhysHSI (In Distribution)

Tracking-Based (In Distribution)

RL-Rewards (In Distribution)

PhysHSI (Full Distribution)

Tracking-Based (Full Distribution)

RL-Rewards (Full Distribution)

Representative Failure Cases

Localization Drift

Inaccurate Placement

Unstable Carrying Posture


Abstract

Deploying humanoid robots to interact with real-world environments—such as carrying objects or sitting on chairs—requires generalizable, lifelike motions and robust scene perception. Although prior approaches have advanced each capability individually, combining them in a unified system is still an ongoing challenge. In this work, we present a physical-world humanoid-scene interaction system, PhysHSI, that enables humanoids to autonomously perform diverse interaction tasks while maintaining natural and lifelike behaviors. PhysHSI comprises a simulation training pipeline and a real-world deployment system. In simulation, we adopt adversarial motion prior-based policy learning to imitate natural humanoid-scene interaction data across diverse scenarios, achieving both generalization and lifelike behaviors. For real-world deployment, we introduce a coarse-to-fine object localization module that combines LiDAR and camera inputs to provide continuous and robust scene perception. We validate PhysHSI on four representative interactive tasks—box carrying, sitting, lying, and standing up—in both simulation and real-world settings, demonstrating consistently high success rates, strong generalization across diverse task goals, and natural motion patterns.

System Overview


(a) Data Preparation: Human motions from a MoCap dataset are retargeted to humanoid motions, and objects are post-annotated by identifying key contact frames.
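
To make the contact annotation concrete, below is a minimal Python sketch of how key contact frames could be identified from retargeted trajectories; the array layout, threshold value, and function name are illustrative assumptions, not the paper's exact procedure.

import numpy as np

CONTACT_DIST = 0.05  # assumed hand-object contact threshold (meters)

def find_contact_frames(hand_pos, obj_pos):
    """Return frame indices where an end effector is within CONTACT_DIST
    of the object center.

    hand_pos: (T, 3) end-effector positions per frame.
    obj_pos:  (T, 3) object center positions per frame.
    """
    dists = np.linalg.norm(hand_pos - obj_pos, axis=-1)  # (T,)
    return np.nonzero(dists < CONTACT_DIST)[0]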

(b) AMP Policy Training: The policy is trained using Adversarial Motion Priors (AMP), where a discriminator network learns to differentiate between policy-generated motions and reference motions from the dataset. This adversarial objective guides the policy to produce natural, physically plausible movements while simultaneously optimizing for successful task execution.
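
For intuition, here is a minimal PyTorch sketch of the AMP-style objective: a discriminator scores state transitions, and its output is mapped to a bounded imitation reward under the common least-squares GAN formulation (Peng et al., 2021). Network sizes and the observation layout are assumptions, not the paper's configuration.

import torch
import torch.nn as nn

class AMPDiscriminator(nn.Module):
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        # Scores a state transition (s, s') concatenated into one vector.
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1))

def style_reward(disc, s, s_next):
    # High when a policy transition is indistinguishable from reference
    # MoCap transitions; this term is combined with the task reward in RL.
    with torch.no_grad():
        d = disc(s, s_next)
        return torch.clamp(1.0 - 0.25 * (d - 1.0) ** 2, min=0.0).squeeze(-1)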

(c) Real-World Deployment: During deployment, the coarse position of the target object is manually initialized based on LiDAR visualization. The coarse estimate is then fused with odometry data to provide approximate localization when the object is outside the camera's field of view. Once the object enters view, AprilTag-based visual detection is combined with odometry to achieve fine-grained, fully automated localization for precise interaction.
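
The coarse-to-fine logic can be summarized in a short sketch like the one below; the class and method names are hypothetical, and a real deployment would use proper TF/odometry infrastructure rather than raw homogeneous matrices.

import numpy as np

class ObjectLocalizer:
    def __init__(self, coarse_obj_pos_world):
        # Coarse stage: position assigned once by hand from the LiDAR view.
        self.obj_pos_world = np.asarray(coarse_obj_pos_world, dtype=float)
        self.refined = False

    def update_from_tag(self, tag_pos_cam, T_world_cam):
        # Fine stage: an AprilTag detection in the camera frame, combined
        # with the camera pose from odometry, refines the world estimate.
        p = T_world_cam @ np.append(tag_pos_cam, 1.0)
        self.obj_pos_world, self.refined = p[:3], True

    def object_in_base_frame(self, T_world_base):
        # Odometry keeps the estimate usable even while the object is out
        # of the camera's field of view: express it in the robot base frame.
        p = np.linalg.inv(T_world_base) @ np.append(self.obj_pos_world, 1.0)
        return p[:3]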

Acknowledgements

We thank Liang Pan for advice on the implementation of RSI. We thank Shunlin Lu for help with processing the motion data. We thank Jianhui Liu, Tai Wang, Qingwei Ben and Junfeng Long for valuable discussions and advice on the object localization module. We thank Chenhui Li and the Intelligent Photonics and Electronics Center at Shanghai AI Lab for help with the MoCap system and SLAM devices. We thank Weixiang Zhong and Yinhuai Wang for assistance with the real-world experiments. We thank Unitree and the Hardware Team of the Embodied AI Center at Shanghai AI Lab for help with hardware issues and the Unitree G1 humanoid robot.

BibTeX

@article{wang2025physhsi,
  title   = {PhysHSI: Towards a Real-World Generalizable and Natural Humanoid-Scene Interaction System},
  author  = {Wang, Huayi and Zhang, Wentao and Yu, Runyi and Huang, Tao and Ren, Junli and Jia, Feiyu and Wang, Zirui and Niu, Xiaojie and Chen, Xiao and Chen, Jiahe and Chen, Qifeng and Wang, Jingbo and Pang, Jiangmiao},
  journal = {arXiv preprint arXiv:2510.11072},
  year    = {2025},
}