Zhangyang Qi1,2*, Zhixiong Zhang2, Yizhou Yu1, Jiaqi Wang2✉, Hengshuang Zhao1✉
* Equal Contribution, ✉ Corresponding Authors
1 The University of Hong Kong, 2 Shanghai AI Laboratory,
Zhangyang Qi1,2*, Zhixiong Zhang2*, Ye Fang2, Jiaqi Wang2✉, Hengshuang Zhao1✉
* Equal Contribution, ✉ Corresponding Authors
1 The University of Hong Kong, 2 Shanghai AI Laboratory,
[2025/07/06] We have released the training data, tokenizers and data generation code for VLN-R1.
[2025/06/20] We release the VLN-R1 paper on arXiv.
[2025/03/10] We release the training and validation dataset and training code.
[2025/01/21] We release the code, validation dataset, and model weights.
[2025/01/01] We release the GPT4Scene paper on arXiv. (The first paper of 2025! 🎇🎇🎇)
This codebase currently focuses on GPT4Scene; the SFT part of VLN-R1 will be merged into it in the future. The table below shows the open-source status of VLN-R1:
| Task | Status |
|---|---|
| VLN-R1 training data | ✅ Completed |
| VLN-R1 dataset production process | ✅ Completed |
| VLN-R1 SFT training section | 🔄 In progress |
| VLN-R1 testing section (including engine) | 🔄 In progress |
| VLN-R1 RFT section | 🔄 In progress |
Important: installation is mandatory before running the code below.
```bash
conda create --name gpt4scene python=3.10
conda activate gpt4scene
git clone https://github.com/Qi-Zhangyang/GPT4Scene.git
cd GPT4Scene
pip install -e ".[torch,metrics]"
```

Sometimes the PyTorch installed this way runs into errors. In that case, install PyTorch manually:
```bash
conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=12.1 -c pytorch -c nvidia
```
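Either way, a quick sanity check (a suggestion, not part of the original setup) confirms that the CUDA build of PyTorch is active before moving on:

```bash
# Print the installed PyTorch version and whether a CUDA device is visible.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```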
Then install the additional dependencies:

```bash
pip install qwen_vl_utils flash-attn
```

The models and weights used in this repo are listed below.

| Function | Model Name | Huggingface Link |
|---|---|---|
| Pretrain Models | Qwen2-VL-7B-Instruct | Huggingface Link |
| Trained Weights | GPT4Scene-qwen2vl_full_sft_mark_32_3D_img512 | Huggingface Link |
To download them, install the Hugging Face Hub client and log in:

```bash
pip install --upgrade huggingface_hub
huggingface-cli login
```

| Function | Huggingface Dataset Link | Local Dir |
|---|---|---|
| Train and Val Dataset and Train Annotations | alexzyqi/GPT4Scene-All | ./data/ |
You can download all trained model weights, datasets, and annotations with:

```bash
python download.py
```

The folder structure is as follows.
```
GPT4Scene
├── ckpts
│ ├── Qwen2-VL-7B-Instruct
├── data
│ ├── annotation
│ │ ├── images_2D
│ │ ├── images_3D
│ │ ├── sharegpt_data_chat_scene_32_images_3D_mark.json
├── evaluate
│ ├── annotation
│ │ ├── multi3dref_mask3d_val.json
│ │ ├── ...
│ │ └── sqa3d_val.json
│ ├── ...
│ └── utils
├── model_outputs
│ └── GPT4Scene-qwen2vl_full_sft_mark_32_3D_img512
├── ...
└── README.md
```
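If `download.py` fails or you only need part of the assets, you can fetch the individual repos with `huggingface-cli` instead. The commands below are only a sketch: the dataset repo ID comes from the table above, the Qwen repo ID is the official Hugging Face release, and the local directories mirror the folder tree; use the Huggingface link in the weights table for the trained checkpoint.

```bash
# Sketch of a manual download (download.py covers all of this end to end).
# Fetch the pretrained Qwen2-VL-7B-Instruct backbone into ./ckpts/.
huggingface-cli download Qwen/Qwen2-VL-7B-Instruct \
    --local-dir ckpts/Qwen2-VL-7B-Instruct
# Fetch the train/val data and annotations into ./data/.
huggingface-cli download alexzyqi/GPT4Scene-All \
    --repo-type dataset --local-dir data/
```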
To run inference, you can execute the script:
```bash
bash evaluate/infer.sh
```

It will automatically detect the number of GPUs in your current environment and perform chunked testing across them.
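If you want to restrict inference to a subset of GPUs, here is a minimal sketch, assuming `infer.sh` respects `CUDA_VISIBLE_DEVICES` as a standard PyTorch launcher does (check the script to confirm):

```bash
# Run chunked inference on GPUs 0 and 1 only (assumes the script honors
# CUDA_VISIBLE_DEVICES; verify this in evaluate/infer.sh).
CUDA_VISIBLE_DEVICES=0,1 bash evaluate/infer.sh
```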
You can also use a Slurm system to submit the evaluation task:

```bash
srun -p XXX --gres=gpu:4 --time=4-00:00:00 sh evaluate/infer.sh
```

To train the model, you can start by running the tokenizer step, which only requires CPU resources:
```bash
bash gpt4scene_tokenizer_scripts/qwen2vl_7b_full_sft_mark_32_3D_img512.sh
```

Then you can run the training on 8 GPUs:
```bash
bash gpt4scene_bash_scripts/qwen2vl_7b_full_sft_mark_32_3D_img512.sh
```

Alternatively, you can launch the training with torchrun; a sketch follows.
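The repository does not spell out the exact torchrun invocation here, so the following is only a sketch built on the LLaMA-Factory layout this repo inherits; the entry point and config path are assumptions, and the bash script above contains the authoritative arguments.

```bash
# Sketch only: src/train.py and the YAML path follow the LLaMA-Factory
# convention and are assumptions; copy the exact arguments from
# gpt4scene_bash_scripts/qwen2vl_7b_full_sft_mark_32_3D_img512.sh.
torchrun --nproc_per_node 8 --master_port 29500 src/train.py \
    path/to/qwen2vl_7b_full_sft_mark_32_3D_img512.yaml
```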
Please note that the first run requires the tokenizer step. It is recommended to disable the GPU for that step and start training only after the tokenizer has completed.
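One way to do that, assuming the scripts honor `CUDA_VISIBLE_DEVICES` (an assumption, not something the repo documents), is to hide the GPUs for the tokenizer step:

```bash
# Hide all GPUs so the tokenizer step runs on CPU only (assumes the script
# respects CUDA_VISIBLE_DEVICES).
CUDA_VISIBLE_DEVICES="" bash gpt4scene_tokenizer_scripts/qwen2vl_7b_full_sft_mark_32_3D_img512.sh
```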
This repository is licensed under the Apache-2.0 License.
This repo benefits from LLaMA-Factory and Chat-Scene. Thanks for their wonderful work.
If this work is helpful, please cite it as:
```bibtex
@article{GPT4Scene,
title={GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models},
author={Zhangyang Qi and Zhixiong Zhang and Ye Fang and Jiaqi Wang and Hengshuang Zhao},
journal={arXiv:2501.01428},
year={2025}
}
```

```bibtex
@article{VLNR1,
title={VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning},
author={Zhangyang Qi and Zhixiong Zhang and Yizhou Yu and Jiaqi Wang and Hengshuang Zhao},
journal={arXiv:2506.17221},
year={2025}
}
```