
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level

[CVPR 2025]
[Project Page] [Paper] [Supp] [Dataset]


🧠 Overview

We introduce a new task, Motion-Grounded Video Reasoning, in which models must answer motion-related questions using spatiotemporal segmentation masks as visual responses.

This task addresses key limitations in prior video understanding research by introducing:

  • ❓ Implicit question-based reasoning
  • 🕒 Motion-aware temporal localization
  • 🧍 Object-level visual grounding
  • 🎯 Pixel-level mask generation across time
  • 🧩 Four question types: Causal, Sequential, Counterfactual, and Descriptive

📌 Comparison to Prior Tasks

Figure 1: Comparison to other motion understanding tasks. GROUNDMORE fills the gap between referring segmentation, temporal grounding, and reasoning by combining implicit QA with visual spatiotemporal output.


📋 Task Definition

The Motion-Grounded Video Reasoning task requires models to:

  • Input:
    • A video clip V ∈ ℝᵗˣʰˣʷˣ³
    • A motion-related question Q
  • Output:
    • Spatiotemporal segmentation masks M ∈ ℝᵗ′ˣʰˣʷ highlighting the target object

This output represents the reasoning result visually by grounding the answer over space and time.
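
A minimal sketch of the interface implied above (the function below is hypothetical and not part of the released code), just to make the shapes concrete:

```python
import torch

# Hypothetical interface for illustration only; the released model API may differ.
def motion_grounded_reasoning(video: torch.Tensor, question: str) -> torch.Tensor:
    """
    video:    float tensor of shape (T, H, W, 3), the input clip V.
    question: a motion-related question Q about an object's motion in the clip.
    returns:  a boolean mask tensor of shape (T', H, W), the spatiotemporal
              masks M highlighting the target object on the T' <= T frames
              where the queried motion occurs.
    """
    T, H, W, _ = video.shape
    # Placeholder: an empty mask track over all T frames.
    return torch.zeros(T, H, W, dtype=torch.bool)
```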


🧪 Dataset Details

We collect a new benchmark dataset, GROUNDMORE, designed to evaluate fine-grained motion reasoning:

  • 1.7K high-resolution video clips
  • 7.6K question-answer pairs
  • 249K object-level spatiotemporal masks
  • Diverse video categories: family scene, animal, ball game, and outdoor activity

✔️ Task Coverage Comparison

Table 1: Comparison of motion understanding tasks. Motion-Grounded Video Reasoning supports all dimensions: spatial & temporal context, motion abstraction, pixel-level output, and implicit reasoning.


📊 Dataset Statistics

Table 2: Dataset statistics. GROUNDMORE contains denser QA and segmentation annotations than prior benchmarks, especially for motion-related reasoning.


🧠 MoRA: Motion-Grounded Reasoning Assistant

We propose MoRA, a baseline model built for this task (a rough sketch of its token pathway follows the list below). It integrates:

  • LLaVA for multimodal reasoning
  • SAM decoder for spatial mask decoding
  • [SEG] token for object semantic embedding
  • [LOC] token for temporal localization of motion events
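
A rough sketch of how the two special tokens could be wired together, assuming a LLaVA-style hidden size and a SAM-style prompt dimension (the module and argument names below are illustrative assumptions, not the released implementation):

```python
import torch
import torch.nn as nn

class MoRATokenHeads(nn.Module):
    """Illustrative sketch of the [SEG]/[LOC] token pathway."""

    def __init__(self, llm_dim: int = 4096, sam_prompt_dim: int = 256):
        super().__init__()
        # Project the LLM hidden state at the [SEG] position into the
        # prompt-embedding space of a SAM-style mask decoder.
        self.seg_proj = nn.Linear(llm_dim, sam_prompt_dim)
        # Map the LLM hidden state at the [LOC] position to a normalized
        # (start, end) estimate of the motion segment within the clip.
        self.loc_head = nn.Sequential(nn.Linear(llm_dim, 2), nn.Sigmoid())

    def forward(self, llm_hidden: torch.Tensor, seg_pos: int, loc_pos: int):
        # llm_hidden: (seq_len, llm_dim) hidden states from the multimodal LLM.
        seg_prompt = self.seg_proj(llm_hidden[seg_pos])   # (sam_prompt_dim,)
        start_end = self.loc_head(llm_hidden[loc_pos])    # (2,), values in [0, 1]
        return seg_prompt, start_end

# Usage sketch: seg_prompt would condition the SAM mask decoder per frame,
# and start_end would select the temporal window T' of the output masks.
heads = MoRATokenHeads()
dummy_hidden = torch.randn(32, 4096)
seg_prompt, start_end = heads(dummy_hidden, seg_pos=10, loc_pos=11)
print(seg_prompt.shape, start_end)
```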

🧱 Model Architecture

Figure 3: MoRA model architecture. MoRA outputs pixel-level segmentation masks in response to the input motion-related question.


📈 Results on GROUNDMORE

🥇 Zero-shot Evaluation

Table 3: Benchmark results. MoRA achieves state-of-the-art performance on all question types, outperforming previous baseline models.


🔍 Ablation Study

Table 5: Temporal localization ablation. Temporal localization via the [LOC] token significantly improves performance.


⚙️ Installation

git clone https://github.com/groundmore/GROUNDMORE.git
cd GROUNDMORE
conda create -n groundmore python=3.10
conda activate groundmore
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
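
An optional post-install sanity check (this assumes PyTorch is pulled in by requirements.txt; flash-attn may fail to import on unsupported hardware):

```python
import torch

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
try:
    import flash_attn  # optional fused attention kernels
    print("flash-attn imported successfully")
except ImportError:
    print("flash-attn is not importable on this machine")
```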

🚀 Usage

GroundMoRe Download

GroundMoRe is available at: https://huggingface.co/datasets/groundmore/GroundMoRe
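
A minimal sketch for fetching the dataset files with the Hugging Face Hub client, assuming the repo id from the link above (check the dataset card for the actual file layout):

```python
from huggingface_hub import snapshot_download

# Download the GroundMoRe dataset repository to a local cache directory.
local_dir = snapshot_download(repo_id="groundmore/GroundMoRe", repo_type="dataset")
print("GroundMoRe downloaded to:", local_dir)
```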

Training

Before training, you need to obtain the LISA and SAM pretrained weights for model initialization.

Put the SAM pretrained weights in ./pretrain_weights/.

Zero-Shot Training

We use the Refer-YouTube-VOS and MeViS datasets for zero-shot training.

bash run.sh

GroundMoRe Evaluation

python evaluate_groundmore.py
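
The reported metrics are computed by evaluate_groundmore.py; purely as an illustration of scoring a predicted spatiotemporal mask track (an assumed metric, not necessarily the exact one used in the paper), mean per-frame IoU could look like this:

```python
import numpy as np

def mask_track_miou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-frame IoU between boolean mask tracks of shape (T, H, W),
    skipping frames whose ground-truth mask is empty."""
    ious = []
    for p, g in zip(pred, gt):
        if not g.any():
            continue
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```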

✅ TODO

  • Release MoRA-FT-LISA7B
  • Release MoRA-ZS-LISA13B
  • Release MoRA-FT-LISA13B

📣 Citation

If this work is useful for your research, please cite:

@inproceedings{deng2025groundmore,
  title={Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level},
  author={Deng, Andong and Chen, Tongjia and Yu, Shoubin and Yang, Taojiannan and Spencer, Lincoln and Tian, Yapeng and Mian, Ajmal Saeed and Bansal, Mohit and Chen, Chen},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}

🙏 Acknowledgements

This work is built upon LISA and SAM.

We also appreciate the valuable help from Wenshuo Chen and Erhang Zhang during the GroundMoRe data collection.
