
VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning

* Equal Contribution, † Corresponding Authors

1 The University of Hong Kong, 2 Shanghai AI Laboratory

GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models

* Equal Contribution, † Corresponding Authors

1 The University of Hong Kong, 2 Shanghai AI Laboratory

🔥 News

[2025/07/06] We have released the training data, tokenizers, and data generation code for VLN-R1.

[2025/06/20] We release the VLN-R1 paper on arXiv.

[2025/03/10] We release the training and validation datasets and the training code.

[2025/01/21] We release the code, validation dataset, and model weights.

[2025/01/01] We release the GPT4Scene paper on arXiv. (The first paper of 2025! 🎇🎇🎇)

🌟 Note

This codebase currently focuses on GPT4Scene. We will merge the SFT part of VLN-R1 into GPT4Scene in the future. The table below shows the open-source status of VLN-R1:

| Task | Status |
| --- | --- |
| VLN-R1 training data | ✅ Completed |
| VLN-R1 dataset production process | ✅ Completed |
| VLN-R1 SFT training section | 🔄 In progress |
| VLN-R1 testing section (including engine) | 🔄 In progress |
| VLN-R1 RFT section | 🔄 In progress |

🔧 Installation

> [!IMPORTANT]
> Installation is mandatory.

```bash
conda create --name gpt4scene python=3.10
conda activate gpt4scene

git clone https://github.com/Qi-Zhangyang/GPT4Scene.git
cd GPT4Scene

pip install -e ".[torch,metrics]"
```

The PyTorch installed this way may occasionally fail. In that case, install PyTorch and the remaining dependencies manually:

```bash
conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=12.1 -c pytorch -c nvidia

pip install qwen_vl_utils flash-attn
```
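
To verify the environment before moving on, a quick check like the one below can help. This is a minimal sketch that only assumes the packages installed by the commands above:

```python
# Quick environment sanity check for the gpt4scene conda env.
import torch

print(f"PyTorch version: {torch.__version__}")          # expect 2.5.0 per the install command above
print(f"CUDA available:  {torch.cuda.is_available()}")  # should be True for training/inference

# Extras installed above; an ImportError here means the pip step did not finish.
import qwen_vl_utils  # noqa: F401
import flash_attn     # noqa: F401
print("qwen_vl_utils and flash-attn import cleanly.")
```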

🎡 Models and Weights

| Function | Model Name | Huggingface Link |
| --- | --- | --- |
| Pretrained Model | Qwen2-VL-7B-Instruct | Huggingface Link |
| Trained Weights | GPT4Scene-qwen2vl_full_sft_mark_32_3D_img512 | Huggingface Link |

To access the weights, log in to Huggingface first:

```bash
pip install --upgrade huggingface_hub
huggingface-cli login
```

🗂️ Dataset (ScanAlign)

| Function | Huggingface Dataset Link | Local Dir |
| --- | --- | --- |
| Train and Val Dataset and Train Annotations | alexzyqi/GPT4Scene-All | ./data/ |

You can download all trained model weights, datasets, and annotations with:

```bash
python download.py
```
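
If download.py fails (for example, behind a proxy), you can fetch the same repositories manually with the standard huggingface_hub API. This is a sketch, not the script's exact logic: the repo IDs come from the tables above, and the owner of the trained-weights repo is an assumption.

```python
# Manual alternative to download.py using the standard huggingface_hub API.
from huggingface_hub import snapshot_download

# Pretrained backbone (see the models table above).
snapshot_download(repo_id="Qwen/Qwen2-VL-7B-Instruct",
                  local_dir="ckpts/Qwen2-VL-7B-Instruct")

# Dataset and annotations; run `huggingface-cli login` first.
snapshot_download(repo_id="alexzyqi/GPT4Scene-All",
                  repo_type="dataset",
                  local_dir="data/")

# Trained weights: the owner "alexzyqi" is assumed from the dataset link above.
snapshot_download(repo_id="alexzyqi/GPT4Scene-qwen2vl_full_sft_mark_32_3D_img512",
                  local_dir="model_outputs/GPT4Scene-qwen2vl_full_sft_mark_32_3D_img512")
```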

The folder structure is as follows.

```
GPT4Scene
├── ckpts
│   ├── Qwen2-VL-7B-Instruct
├── data
│   ├── annotation
│   │   ├── images_2D
│   │   ├── images_3D
│   │   ├── sharegpt_data_chat_scene_32_images_3D_mark.json
├── evaluate
│   ├── annotation
│   │   ├── multi3dref_mask3d_val.json
│   │   ├── ...
│   │   └── sqa3d_val.json
│   ├── ...
│   └── utils
├── model_outputs
│   └── GPT4Scene-qwen2vl_full_sft_mark_32_3D_img512
├── ...
└── README.md
```
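
Once everything is in place, you can sanity-check the training annotations. The sketch below assumes the ShareGPT-style file is a flat JSON list of conversation records; the exact schema is not documented in this README.

```python
# Peek at the ShareGPT-format training annotations (list-of-records schema assumed).
import json

path = "data/annotation/sharegpt_data_chat_scene_32_images_3D_mark.json"
with open(path) as f:
    records = json.load(f)

print(f"{len(records)} training samples")
print(json.dumps(records[0], indent=2)[:500])  # first record, truncated for readability
```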

🚀 Inference

To run inference, execute the script:

```bash
bash evaluate/infer.sh
```

It automatically detects the number of GPUs in your environment and performs chunked testing. Alternatively, you can submit the evaluation task through Slurm:

```bash
srun -p XXX --gres=gpu:4 --time=4-00:00:00 sh evaluate/infer.sh
```
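
infer.sh is the supported entry point, but since the trained weights are a standard Qwen2-VL checkpoint, a single-sample run with transformers looks roughly like the sketch below. The image path and prompt are placeholders, not taken from this repository.

```python
# Minimal single-sample inference sketch using the standard Qwen2-VL interface.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

ckpt = "model_outputs/GPT4Scene-qwen2vl_full_sft_mark_32_3D_img512"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    ckpt, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(ckpt)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "data/annotation/images_3D/example.png"},  # placeholder path
        {"type": "text", "text": "Describe the objects in this 3D scene."},   # placeholder prompt
    ],
}]

# Standard Qwen2-VL preprocessing: chat template + vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so only the generated answer is decoded.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```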

🏗️ Training

Start by generating the tokenizer; this step only requires CPU resources.

```bash
bash gpt4scene_tokenizer_scripts/qwen2vl_7b_full_sft_mark_32_3D_img512.sh
```

Then run training on 8 GPUs:

```bash
bash gpt4scene_bash_scripts/qwen2vl_7b_full_sft_mark_32_3D_img512.sh
```

Alternatively, you can launch the same configuration directly with torchrun.

Please note that the first run requires the tokenizer to exist. We recommend running the tokenizer step without GPUs first and starting training only after it has completed.

⚖️ License

This repository is licensed under the Apache-2.0 License.

This repo benefits from LLaMA-Factory and Chat-Scene. Thanks for their wonderful work.

🔗 Citation

If this work is helpful, please kindly cite as:

```bibtex
@article{GPT4Scene,
  title={GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models},
  author={Zhangyang Qi and Zhixiong Zhang and Ye Fang and Jiaqi Wang and Hengshuang Zhao},
  journal={arXiv preprint arXiv:2501.01428},
  year={2025}
}

@article{VLNR1,
  title={VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning},
  author={Zhangyang Qi and Zhixiong Zhang and Yizhou Yu and Jiaqi Wang and Hengshuang Zhao},
  journal={arXiv preprint arXiv:2506.17221},
  year={2025}
}
```
