Thanks to our AMP-based policy learning and the integrated object localization module, PhysHSI demonstrates strong spatial and object-level generalization, adapting to a wide range of real-world environments rather than being limited to the scenarios covered by its reference motion data.
Deploying humanoid robots to interact with real-world environments—such as carrying objects or sitting on chairs—requires generalizable, lifelike motions and robust scene perception. Although prior approaches have advanced each capability individually, combining them within a unified system remains an open challenge. In this work, we present a physical-world humanoid-scene interaction system, PhysHSI, that enables humanoids to autonomously perform diverse interaction tasks while maintaining natural and lifelike behaviors. PhysHSI comprises a simulation training pipeline and a real-world deployment system. In simulation, we adopt adversarial motion prior-based policy learning to imitate natural humanoid-scene interaction data across diverse scenarios, achieving both generalization and lifelike behaviors. For real-world deployment, we introduce a coarse-to-fine object localization module that combines LiDAR and camera inputs to provide continuous and robust scene perception. We validate PhysHSI on four representative interactive tasks—box carrying, sitting, lying, and standing up—in both simulation and real-world settings, demonstrating consistently high success rates, strong generalization across diverse task goals, and natural motion patterns.
(a) Data Preparation: Human motions from a MoCap dataset are retargeted to humanoid motions, and objects are post-annotated by identifying key contact frames.
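The page does not detail how key contact frames are identified; a common heuristic is to threshold the end-effector-to-object distance over time and keep sufficiently long contiguous runs. Below is a minimal sketch under that assumption (the function name, threshold values, and data layout are illustrative, not taken from PhysHSI):

```python
import numpy as np

def find_contact_frames(hand_pos, object_pos, dist_thresh=0.05, min_len=5):
    """Return (start, end) frame ranges where the hand stays within
    dist_thresh meters of the object for at least min_len frames.

    hand_pos, object_pos: (T, 3) arrays of world-frame positions per frame.
    Note: this is a hypothetical annotation heuristic, not the paper's method.
    """
    dist = np.linalg.norm(hand_pos - object_pos, axis=-1)  # (T,)
    in_contact = dist < dist_thresh
    ranges, start = [], None
    for t, c in enumerate(in_contact):
        if c and start is None:
            start = t                      # contact run begins
        elif not c and start is not None:
            if t - start >= min_len:       # keep only sustained contacts
                ranges.append((start, t))
            start = None
    if start is not None and len(in_contact) - start >= min_len:
        ranges.append((start, len(in_contact)))
    return ranges
```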
(b) AMP Policy Training: The policy is trained using Adversarial Motion Priors (AMP), where a discriminator network learns to differentiate between policy-generated motions and reference motions from the dataset. This adversarial objective guides the policy to produce natural, physically plausible movements while simultaneously optimizing for successful task execution.
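For concreteness, the style reward in AMP-style training is commonly computed from a least-squares discriminator as r_style = max(0, 1 − 0.25·(D(s, s′) − 1)²) and blended with the task reward, following the original AMP formulation. The sketch below shows this pattern; network sizes and reward weights are assumptions, not PhysHSI's reported values:

```python
import torch
import torch.nn as nn

class AMPDiscriminator(nn.Module):
    """Scores state transitions (s, s'); trained with a least-squares GAN
    objective to output ~1 on reference motions and ~-1 on policy rollouts."""
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1)).squeeze(-1)

def amp_reward(disc, s, s_next, task_reward, w_task=0.5, w_style=0.5):
    """Blend the task reward with the AMP style reward
    r_style = max(0, 1 - 0.25 * (D(s, s') - 1)^2)."""
    with torch.no_grad():
        d = disc(s, s_next)
    r_style = torch.clamp(1.0 - 0.25 * (d - 1.0) ** 2, min=0.0)
    return w_task * task_reward + w_style * r_style
```

In this scheme the discriminator is what makes the motions look natural, while the task term keeps the policy goal-directed; the weighting between the two is a tuning choice.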
(c) Real-World Deployment: During deployment, the coarse position of the target object is manually initialized based on LiDAR visualization. The coarse estimate is then fused with odometry data to provide approximate localization when the object is outside the camera's field of view. Once the object enters view, AprilTag-based visual detection is combined with odometry to achieve fine-grained, fully automated localization for precise interaction.
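This coarse-to-fine logic can be pictured as a small state machine that keeps the object pose anchored in the odometry frame and re-anchors it once AprilTag detections arrive. A minimal sketch under assumed interfaces (poses as 4×4 homogeneous transforms; the class and method names are hypothetical, not PhysHSI's actual API):

```python
import numpy as np

class CoarseToFineLocalizer:
    """Tracks the target object in the odometry frame.

    Coarse phase: the pose is manually initialized (e.g., picked in a LiDAR
    view) and held fixed in the odom frame, so odometry alone yields an
    approximate robot-relative estimate while the object is out of view.
    Fine phase: each AprilTag detection re-anchors the object pose.
    """
    def __init__(self, T_odom_obj_init):
        self.T_odom_obj = T_odom_obj_init  # coarse, manually initialized
        self.refined = False

    def on_tag_detection(self, T_odom_cam, T_cam_obj):
        # Re-anchor the object in the odom frame from the visual detection.
        self.T_odom_obj = T_odom_cam @ T_cam_obj
        self.refined = True

    def object_in_base(self, T_odom_base):
        # Object pose relative to the robot base, valid in both phases.
        return np.linalg.inv(T_odom_base) @ self.T_odom_obj
```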
@article{wang2025physhsi,
title = {PhysHSI: Towards a Real-World Generalizable and Natural Humanoid-Scene Interaction System},
author = {Wang, Huayi and Zhang, Wentao and Yu, Runyi and Huang, Tao and Ren, Junli and Jia, Feiyu and Wang, Zirui and Niu, Xiaojie and Chen, Xiao and Chen, Jiahe and Chen, Qifeng and Wang, Jingbo and Pang, Jiangmiao},
journal = {arXiv preprint arXiv:2510.11072},
year = {2025},
}