
VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning

* Equal Contribution, † Corresponding Authors

1 The University of Hong Kong, 2 Shanghai AI Laboratory

GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models

* Equal Contribution, † Corresponding Authors

1 The University of Hong Kong, 2 Shanghai AI Laboratory

🔥 News

[2025/07/06] We have released the training data, tokenizers, and data generation code for VLN-R1.

[2025/06/20] We release the VLN-R1 paper on arXiv.

[2025/03/10] We release the training and validation datasets and the training code.

[2025/01/21] We release the code, validation dataset, and model weights.

[2025/01/01] We release the GPT4Scene paper on arXiv. (The first paper of 2025! 🎇🎇🎇)

🌟 Note

This codebase currently focuses on GPT4Scene. We will merge the SFT part of VLN-R1 into GPT4Scene in the future. The table below shows the open-source status of VLN-R1:

| Task | Status |
| --- | --- |
| VLN-R1 training data | ✅ Completed |
| VLN-R1 dataset production process | ✅ Completed |
| VLN-R1 SFT training section | 🔄 In progress |
| VLN-R1 testing section (including engine) | 🔄 In progress |
| VLN-R1 RFT section | 🔄 In progress |

🔧 Installation

> [!IMPORTANT]
> Installation is mandatory.

```bash
conda create --name gpt4scene python=3.10
conda activate gpt4scene

git clone https://github.com/Qi-Zhangyang/GPT4Scene.git
cd GPT4Scene

pip install -e ".[torch,metrics]"
```

The PyTorch installed this way may occasionally fail. In that case, install PyTorch and the remaining dependencies manually:

```bash
conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=12.1 -c pytorch -c nvidia

pip install qwen_vl_utils flash-attn
```
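
To verify the environment before moving on, a quick check like the one below can help. This is a minimal sketch that only assumes the packages installed by the commands above:

```python
# Quick environment sanity check for the gpt4scene conda env.
import torch

print(f"PyTorch version: {torch.__version__}")          # expect 2.5.0 per the install command above
print(f"CUDA available:  {torch.cuda.is_available()}")  # should be True for training/inference

# Extras installed above; an ImportError here means the pip step did not finish.
import qwen_vl_utils  # noqa: F401
import flash_attn     # noqa: F401
print("qwen_vl_utils and flash-attn import cleanly.")
```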

🎡 Models and Weights

| Function | Model Name | Huggingface Link |
| --- | --- | --- |
| Pretrained Model | Qwen2-VL-7B-Instruct | Huggingface Link |
| Trained Weights | GPT4Scene-qwen2vl_full_sft_mark_32_3D_img512 | Huggingface Link |

To access the weights, log in to Huggingface first:

```bash
pip install --upgrade huggingface_hub
huggingface-cli login
```

🗂️ Dataset (ScanAlign)

| Function | Huggingface Dataset Link | Local Dir |
| --- | --- | --- |
| Train and Val Dataset and Train Annotations | alexzyqi/GPT4Scene-All | ./data/ |

You can download all trained model weights, datasets, and annotations with:

```bash
python download.py
```
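
If download.py fails (for example, behind a proxy), you can fetch the same repositories manually with the standard huggingface_hub API. This is a sketch, not the script's exact logic: the repo IDs come from the tables above, and the owner of the trained-weights repo is an assumption.

```python
# Manual alternative to download.py using the standard huggingface_hub API.
from huggingface_hub import snapshot_download

# Pretrained backbone (see the models table above).
snapshot_download(repo_id="Qwen/Qwen2-VL-7B-Instruct",
                  local_dir="ckpts/Qwen2-VL-7B-Instruct")

# Dataset and annotations; run `huggingface-cli login` first.
snapshot_download(repo_id="alexzyqi/GPT4Scene-All",
                  repo_type="dataset",
                  local_dir="data/")

# Trained weights: the owner "alexzyqi" is assumed from the dataset link above.
snapshot_download(repo_id="alexzyqi/GPT4Scene-qwen2vl_full_sft_mark_32_3D_img512",
                  local_dir="model_outputs/GPT4Scene-qwen2vl_full_sft_mark_32_3D_img512")
```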

The folder structure is as follows.

```
GPT4Scene
├── ckpts
│   ├── Qwen2-VL-7B-Instruct
├── data
│   ├── annotation
│   │   ├── images_2D
│   │   ├── images_3D
│   │   ├── sharegpt_data_chat_scene_32_images_3D_mark.json
├── evaluate
│   ├── annotation
│   │   ├── multi3dref_mask3d_val.json
│   │   ├── ...
│   │   └── sqa3d_val.json
│   ├── ...
│   └── utils
├── model_outputs
│   └── GPT4Scene-qwen2vl_full_sft_mark_32_3D_img512
├── ...
└── README.md
```
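
Once everything is in place, you can sanity-check the training annotations. The sketch below assumes the ShareGPT-style file is a flat JSON list of conversation records; the exact schema is not documented in this README.

```python
# Peek at the ShareGPT-format training annotations (list-of-records schema assumed).
import json

path = "data/annotation/sharegpt_data_chat_scene_32_images_3D_mark.json"
with open(path) as f:
    records = json.load(f)

print(f"{len(records)} training samples")
print(json.dumps(records[0], indent=2)[:500])  # first record, truncated for readability
```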

🚀 Inference

To run inference, execute the script:

```bash
bash evaluate/infer.sh
```

It automatically detects the number of GPUs in your environment and performs chunked testing. Alternatively, you can submit the evaluation task through Slurm:

```bash
srun -p XXX --gres=gpu:4 --time=4-00:00:00 sh evaluate/infer.sh
```
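
infer.sh is the supported entry point, but since the trained weights are a standard Qwen2-VL checkpoint, a single-sample run with transformers looks roughly like the sketch below. The image path and prompt are placeholders, not taken from this repository.

```python
# Minimal single-sample inference sketch using the standard Qwen2-VL interface.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

ckpt = "model_outputs/GPT4Scene-qwen2vl_full_sft_mark_32_3D_img512"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    ckpt, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(ckpt)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "data/annotation/images_3D/example.png"},  # placeholder path
        {"type": "text", "text": "Describe the objects in this 3D scene."},   # placeholder prompt
    ],
}]

# Standard Qwen2-VL preprocessing: chat template + vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so only the generated answer is decoded.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```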

🏗️ Training

Start by generating the tokenizer; this step only requires CPU resources.

```bash
bash gpt4scene_tokenizer_scripts/qwen2vl_7b_full_sft_mark_32_3D_img512.sh
```

Then run training on 8 GPUs:

```bash
bash gpt4scene_bash_scripts/qwen2vl_7b_full_sft_mark_32_3D_img512.sh
```

Alternatively, you can launch the same configuration directly with torchrun.

Please note that the first run requires the tokenizer to exist. We recommend running the tokenizer step without GPUs first and starting training only after it has completed.

⚖️ License

This repository is licensed under the Apache-2.0 License.

This repo benefits from LLaMA-Factory and Chat-Scene. Thanks for their wonderful work.

🔗 Citation

If this work is helpful, please kindly cite as:

```bibtex
@article{GPT4Scene,
  title={GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models},
  author={Zhangyang Qi and Zhixiong Zhang and Ye Fang and Jiaqi Wang and Hengshuang Zhao},
  journal={arXiv preprint arXiv:2501.01428},
  year={2025}
}

@article{VLNR1,
  title={VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning},
  author={Zhangyang Qi and Zhixiong Zhang and Yizhou Yu and Jiaqi Wang and Hengshuang Zhao},
  journal={arXiv preprint arXiv:2506.17221},
  year={2025}
}
```
