
Recent advances in large language models and robotic datasets have paved the way for generalist robotic systems. Open Vision-Language-Action (VLA) models demonstrate strong task versatility, but their visual generalization remains limited. In our latest study, we evaluate three robotic foundation models and reveal a lack of robustness in out-of-domain (OOD) visual scenarios, often due to insufficient training diversity and catastrophic forgetting.
We analyze OpenVLA, which leverages two vision backbones, and identify a critical failure in depth regression caused by catastrophic forgetting in its DINO-v2 backbone. To address this, we introduce ReVLA, a model built on a novel gradual backbone reversal strategy via model merging. This restores visual generalization and yields significant improvements under visual OOD conditions: +77% in grasping and +66% in lifting.
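One way to read the gradual backbone reversal is as staged linear interpolation of the vision backbone's weights back toward the original pretrained DINO-v2 checkpoint, with continued fine-tuning between merge steps. The sketch below is a minimal, assumption-laden illustration of that idea, not the actual ReVLA implementation: the names `vla`, `vla.vision_backbone`, `pretrained_state`, and the alpha schedule are all hypothetical.

```python
import torch

@torch.no_grad()
def merge_toward_pretrained(backbone: torch.nn.Module,
                            finetuned_state: dict,
                            pretrained_state: dict,
                            alpha: float) -> None:
    """Set the backbone to the linear interpolation
    theta = (1 - alpha) * theta_finetuned + alpha * theta_pretrained.
    alpha = 0 keeps the fine-tuned weights; alpha = 1 fully restores
    the original pretrained (e.g. DINO-v2) weights.
    Assumes all state-dict entries are floating-point tensors."""
    merged = {
        name: (1.0 - alpha) * finetuned_state[name] + alpha * pretrained_state[name]
        for name in finetuned_state
    }
    backbone.load_state_dict(merged)

# Hypothetical schedule: step alpha toward 1.0 across training stages,
# resuming VLA fine-tuning after each merge so the rest of the model
# adapts to the progressively restored backbone.
finetuned_state = {k: v.clone() for k, v in vla.vision_backbone.state_dict().items()}
for alpha in (0.25, 0.5, 0.75, 1.0):
    merge_toward_pretrained(vla.vision_backbone, finetuned_state,
                            pretrained_state, alpha)
    # ... continue training the action head on robot data here ...
```

Interpolating always from the stored fine-tuned weights (rather than merging in place repeatedly) keeps each stage an exact convex combination of the two checkpoints, so the final stage lands precisely on the pretrained backbone.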
Explore the full results, rollouts, and model weights on the ReVLA Website.