VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory

1ByteDance Seed       2Peking University       3Zhongguancun Academy      
*Equal Contribution      ‡Project Lead      †Corresponding Authors
VLingNav teaser image
AdaCoT · VLingMem · Online Post-training · Nav-AdaCoT-2.9M · ObjectNav / ImageNav / Tracking

Summary Video

Abstract

Vision-Language-Action (VLA) models have shown promising potential in embodied navigation by unifying perception and planning while inheriting the strong generalization abilities of large Vision-Language Models (VLMs). However, most existing VLA models rely on reactive mappings directly from observations to actions, lacking the explicit reasoning capabilities and persistent memory required for complex, long-horizon navigation tasks. To address these challenges, we propose VLingNav, a VLA model for embodied navigation grounded in linguistic-driven cognition. First, inspired by the dual-process theory of human cognition, we introduce an adaptive chain-of-thought (AdaCoT) mechanism, which dynamically triggers explicit reasoning only when necessary, enabling the agent to fluidly switch between fast, intuitive execution and slow, deliberate planning. Second, to handle long-horizon spatial dependencies, we develop a visual-assisted linguistic memory module (VLingMem) that constructs a persistent, cross-modal semantic memory, enabling the agent to recall past observations to prevent repetitive exploration and infer movement trends for dynamic environments. For training, we construct Nav-AdaCoT-2.9M, the largest embodied navigation dataset with reasoning annotations to date, enriched with adaptive CoT annotations that induce a reasoning paradigm capable of adjusting both when to think and what to think about. Moreover, we incorporate an online expert-guided reinforcement learning stage, enabling the model to surpass pure imitation learning and to acquire more robust, self-explored navigation behaviors. Extensive experiments demonstrate that VLingNav achieves state-of-the-art performance across a wide range of embodied navigation benchmarks. Notably, VLingNav transfers to real-world robotic platforms in a zero-shot manner, successfully executing practical navigation tasks, including previously unseen and untrained tasks, and demonstrating strong cross-domain and cross-task generalization.

Methods

Pipeline

VLingNav takes video streams and multimodal instructions as input and produces robot actions for navigation through tailored linguistic designs. AdaCoT adaptively generates explicit linguistic reasoning according to the current observation, while VLingMem summarizes CoT cues together with key visual features for globally informed decision-making.
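The control flow of the adaptive reasoning can be pictured with the minimal Python sketch below. It assumes a hypothetical interface in which a lightweight gate (should_think) decides whether to spend tokens on explicit reasoning before the action head fires; the function names are illustrative placeholders, not the released VLingNav API.

# Minimal sketch of AdaCoT-style adaptive reasoning at a single navigation step.
# All callables are hypothetical placeholders standing in for parts of the VLA model.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StepOutput:
    action: str                    # e.g. "move_forward", "turn_left", "stop"
    thought: Optional[str] = None  # filled only when deliberate reasoning was triggered


def adaptive_step(
    observation: dict,
    instruction: str,
    should_think: Callable[[dict, str], bool],                  # fast gate: is this step ambiguous?
    generate_cot: Callable[[dict, str], str],                   # slow path: explicit reasoning
    predict_action: Callable[[dict, str, Optional[str]], str],  # action head
) -> StepOutput:
    thought = None
    if should_think(observation, instruction):
        # Slow path: produce explicit linguistic reasoning before acting.
        thought = generate_cot(observation, instruction)
    # Fast path (or post-reasoning): map observation (+ optional thought) to an action.
    action = predict_action(observation, instruction, thought)
    return StepOutput(action=action, thought=thought)

In this framing, the cost of reasoning is paid only at ambiguous decision points, which matches the fast/slow switching described above.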

Pipeline Image
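VLingMem can likewise be pictured as a simple cross-modal store. The sketch below assumes that each entry pairs a short linguistic summary (distilled CoT cues) with a key visual feature and that recall is similarity-based; it illustrates the data structure only, not the actual module, and shows how revisited regions could be recognized to avoid repetitive exploration.

# Minimal sketch of a visual-assisted linguistic memory (illustrative, not the real VLingMem).
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class MemoryEntry:
    step: int               # timestep when the entry was written
    summary: str            # linguistic summary of what was seen / concluded
    visual_key: np.ndarray  # key visual feature (e.g. pooled frame embedding)


class VLingMemSketch:
    def __init__(self) -> None:
        self.entries: List[MemoryEntry] = []

    def write(self, step: int, summary: str, visual_key: np.ndarray) -> None:
        # Store a unit-normalized visual key alongside the linguistic summary.
        key = visual_key / (np.linalg.norm(visual_key) + 1e-8)
        self.entries.append(MemoryEntry(step, summary, key))

    def recall(self, query_key: np.ndarray, top_k: int = 3) -> List[MemoryEntry]:
        """Return the top-k entries whose visual keys best match the current view,
        e.g. to detect that a region has already been explored."""
        if not self.entries:
            return []
        q = query_key / (np.linalg.norm(query_key) + 1e-8)
        sims = [float(e.visual_key @ q) for e in self.entries]
        order = np.argsort(sims)[::-1][:top_k]
        return [self.entries[i] for i in order]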

Data Collection

We propose Nav-AdaCoT-2.9M, a large-scale embodied navigation dataset encompassing 2.9 million step-by-step adaptive CoT trajectories. To construct this dataset, we leveraged the Habitat simulator to collect extensive simulated navigation data and further developed an automated CoT annotation pipeline.
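As a rough picture of how such an annotation pass might be structured, the sketch below attaches reasoning text only at steps flagged as decision points, so the resulting data teaches both when to think and what to think about. The is_decision_point and describe_reasoning helpers are hypothetical stand-ins for the actual annotation rules and annotator model.

# Hedged sketch of an automated adaptive-CoT annotation pass over simulator trajectories.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TrajStep:
    observation: dict
    expert_action: str


@dataclass
class AnnotatedStep:
    observation: dict
    action: str
    cot: Optional[str]  # None on "fast" steps, reasoning text on "slow" steps


def annotate_trajectory(
    steps: List[TrajStep],
    instruction: str,
    is_decision_point: Callable[[TrajStep, List[TrajStep]], bool],
    describe_reasoning: Callable[[TrajStep, str], str],
) -> List[AnnotatedStep]:
    annotated = []
    for i, step in enumerate(steps):
        cot = None
        if is_decision_point(step, steps[:i]):
            # Only decision points receive explicit reasoning annotations.
            cot = describe_reasoning(step, instruction)
        annotated.append(AnnotatedStep(step.observation, step.expert_action, cot))
    return annotated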

Pipeline Images

Online Expert-guided Post-training

To address the limitations of offline imitation learning and to better align the VLM's high-level representations with closed-loop continuous robot actions, we introduce an online post-training stage. The agent actively interacts with the simulation environment to collect fresh, on-policy trajectories. The policy is then updated with a hybrid objective that combines outcome-driven optimization with expert-guided supervision. This dual approach allows the model to explore more effective strategies while preventing catastrophic forgetting of the expert policy.
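A minimal PyTorch sketch of such a hybrid objective is given below, assuming the outcome-driven term is an advantage-weighted policy-gradient loss on on-policy rollouts and the expert-guided term is a cross-entropy loss on expert actions; the weighting coefficient and the exact RL estimator are assumptions, not the paper's formulation.

# Hedged sketch: hybrid objective = outcome-driven RL term + expert-guided imitation term.
import torch
import torch.nn.functional as F


def hybrid_loss(
    logits_onpolicy: torch.Tensor,       # (B, A) action logits on self-collected steps
    actions_taken: torch.Tensor,         # (B,)   actions sampled during the rollout
    advantages: torch.Tensor,            # (B,)   outcome-driven advantage estimates
    logits_expert_states: torch.Tensor,  # (N, A) logits on expert-visited states
    expert_actions: torch.Tensor,        # (N,)   expert action labels
    lambda_expert: float = 0.5,          # assumed trade-off coefficient
) -> torch.Tensor:
    # Outcome-driven term: REINFORCE-style, raises log-prob of actions with high advantage.
    log_probs = F.log_softmax(logits_onpolicy, dim=-1)
    chosen = log_probs.gather(1, actions_taken.unsqueeze(1)).squeeze(1)
    rl_term = -(advantages.detach() * chosen).mean()
    # Expert-guided term: standard imitation loss that anchors the policy to the expert.
    il_term = F.cross_entropy(logits_expert_states, expert_actions)
    return rl_term + lambda_expert * il_term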

Pipeline Image

Real-world Experiments

Case study

AdaCoT & VLingMem

Object Goal Navigation

ObjNav

Image Goal Navigation

ImgNav

Visual Tracking

Tracking

Simulation Visualization Results

HM3D-OVON Object Goal Navigation

ObjNav

HM3D Instance Image Goal Navigation

ImgNav

EVT-Bench

Tracking

Acknowledgements


We sincerely thank Yunke Cai, Haiquan Chen, Shuai Chu, Taifeng Gao, Bo Jiang, Yunfei Li, Yunfei Liu, Tao Wang, Xibin Wu, and Tingshuai Yan for their strong support and fruitful discussions.

Citation

@article{wang2026vlingnav,
    author  = {Wang, Shaoan and Luo, Yuanfei and Chen, Xingyu and Luo, Aocheng and Li, Dongyue and Liu, Chang and Chen, Sheng and Zhang, Yangang and Yu, Junzhi},
    title   = {VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory},
    journal = {arXiv preprint arXiv:2601.08665},
    year    = {2026},
    url     = {https://arxiv.org/abs/2601.08665}
}

The website design (source code) was adapted from Nerfies.