
Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level

[CVPR 2025]
[Project Page] [Paper] [Supp] [Dataset]


🧠 Overview

We introduce a new task, Motion-Grounded Video Reasoning, in which models must answer motion-related questions using spatiotemporal segmentation masks as visual responses.

This task addresses key limitations in prior video understanding research by introducing:

  • ❓ Implicit question-based reasoning
  • 🕒 Motion-aware temporal localization
  • 🧍 Object-level visual grounding
  • 🎯 Pixel-level mask generation across time
  • 🧩 Four question types: Causal, Sequential, Counterfactual, and Descriptive

📌 Comparison to Prior Tasks

Figure 1: Comparison to other motion understanding tasks. GROUNDMORE fills the gap between referring segmentation, temporal grounding, and reasoning by combining implicit QA with visual spatiotemporal output.


📋 Task Definition

The Motion-Grounded Video Reasoning task requires models to:

  • Input:
    • A video clip V ∈ ℝᵗˣʰˣʷˣ³
    • A motion-related question Q
  • Output:
    • Spatiotemporal segmentation masks M ∈ ℝᵗ′ˣʰˣʷ highlighting the target object

This output represents the reasoning result visually by grounding the answer over space and time.
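
A minimal sketch of the interface implied above (the function below is hypothetical and not part of the released code), just to make the shapes concrete:

```python
import torch

# Hypothetical interface for illustration only; the released model API may differ.
def motion_grounded_reasoning(video: torch.Tensor, question: str) -> torch.Tensor:
    """
    video:    float tensor of shape (T, H, W, 3), the input clip V.
    question: a motion-related question Q about an object's motion in the clip.
    returns:  a boolean mask tensor of shape (T', H, W), the spatiotemporal
              masks M highlighting the target object on the T' <= T frames
              where the queried motion occurs.
    """
    T, H, W, _ = video.shape
    # Placeholder: an empty mask track over all T frames.
    return torch.zeros(T, H, W, dtype=torch.bool)
```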


🧪 Dataset Details

We collect a new benchmark dataset, GROUNDMORE, designed to evaluate fine-grained motion reasoning:

  • 1.7K high-resolution video clips
  • 7.6K question-answer pairs
  • 249K object-level spatiotemporal masks
  • Diverse video categories: family scene, animal, ball game, and outdoor activity

✔️ Task Coverage Comparison

Table 1: Comparison of motion understanding tasks. Motion-Grounded Video Reasoning supports all dimensions: spatial & temporal context, motion abstraction, pixel-level output, and implicit reasoning.


📊 Dataset Statistics

Table 2: Dataset statistics. GROUNDMORE contains denser QA and segmentation annotations than prior benchmarks, especially for motion-related reasoning.


🧠 MoRA: Motion-Grounded Reasoning Assistant

We propose MoRA, a baseline model built for this task (a rough sketch of its token pathway follows the list below). It integrates:

  • LLaVA for multimodal reasoning
  • SAM decoder for spatial mask decoding
  • [SEG] token for object semantic embedding
  • [LOC] token for temporal localization of motion events
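
A rough sketch of how the two special tokens could be wired together, assuming a LLaVA-style hidden size and a SAM-style prompt dimension (the module and argument names below are illustrative assumptions, not the released implementation):

```python
import torch
import torch.nn as nn

class MoRATokenHeads(nn.Module):
    """Illustrative sketch of the [SEG]/[LOC] token pathway."""

    def __init__(self, llm_dim: int = 4096, sam_prompt_dim: int = 256):
        super().__init__()
        # Project the LLM hidden state at the [SEG] position into the
        # prompt-embedding space of a SAM-style mask decoder.
        self.seg_proj = nn.Linear(llm_dim, sam_prompt_dim)
        # Map the LLM hidden state at the [LOC] position to a normalized
        # (start, end) estimate of the motion segment within the clip.
        self.loc_head = nn.Sequential(nn.Linear(llm_dim, 2), nn.Sigmoid())

    def forward(self, llm_hidden: torch.Tensor, seg_pos: int, loc_pos: int):
        # llm_hidden: (seq_len, llm_dim) hidden states from the multimodal LLM.
        seg_prompt = self.seg_proj(llm_hidden[seg_pos])   # (sam_prompt_dim,)
        start_end = self.loc_head(llm_hidden[loc_pos])    # (2,), values in [0, 1]
        return seg_prompt, start_end

# Usage sketch: seg_prompt would condition the SAM mask decoder per frame,
# and start_end would select the temporal window T' of the output masks.
heads = MoRATokenHeads()
dummy_hidden = torch.randn(32, 4096)
seg_prompt, start_end = heads(dummy_hidden, seg_pos=10, loc_pos=11)
print(seg_prompt.shape, start_end)
```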

🧱 Model Architecture

Figure 3: MoRA model architecture. MoRA outputs pixel-level segmentation masks in response to the input motion-related question.


📈 Results on GROUNDMORE

🥇 Zero-shot Evaluation

Table 3: Benchmark results. MoRA achieves state-of-the-art performance on all question types, outperforming previous baseline models.


🔍 Ablation Study

Table 5: Temporal localization ablation. Temporal localization via the [LOC] token significantly improves performance.


⚙️ Installation

git clone https://github.com/groundmore/GROUNDMORE.git
cd GROUNDMORE
conda create -n groundmore python=3.10
conda activate groundmore
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
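
An optional post-install sanity check (this assumes PyTorch is pulled in by requirements.txt; flash-attn may fail to import on unsupported hardware):

```python
import torch

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
try:
    import flash_attn  # optional fused attention kernels
    print("flash-attn imported successfully")
except ImportError:
    print("flash-attn is not importable on this machine")
```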

🚀 Usage

GroundMoRe Download

GroundMoRe is available at: https://huggingface.co/datasets/groundmore/GroundMoRe
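
A minimal sketch for fetching the dataset files with the Hugging Face Hub client, assuming the repo id from the link above (check the dataset card for the actual file layout):

```python
from huggingface_hub import snapshot_download

# Download the GroundMoRe dataset repository to a local cache directory.
local_dir = snapshot_download(repo_id="groundmore/GroundMoRe", repo_type="dataset")
print("GroundMoRe downloaded to:", local_dir)
```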

Training

Before training, you need to obtain the LISA and SAM pretrained weights for model initialization.

Put the SAM pretrained weights in ./pretrain_weights/.

Zero-Shot Training

We use the Refer-YouTube-VOS and MeViS datasets for zero-shot training.

bash run.sh

GroundMoRe Evaluation

python evaluate_groundmore.py
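
The reported metrics are computed by evaluate_groundmore.py; purely as an illustration of scoring a predicted spatiotemporal mask track (an assumed metric, not necessarily the exact one used in the paper), mean per-frame IoU could look like this:

```python
import numpy as np

def mask_track_miou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-frame IoU between boolean mask tracks of shape (T, H, W),
    skipping frames whose ground-truth mask is empty."""
    ious = []
    for p, g in zip(pred, gt):
        if not g.any():
            continue
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```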

✅ TODO

  • Release MoRA-FT-LISA7B
  • Release MoRA-ZS-LISA13B
  • Release MoRA-FT-LISA13B

📣 Citation

If this work is useful for your research, please cite:

@inproceedings{deng2025groundmore,
  title={Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level},
  author={Deng, Andong and Chen, Tongjia and Yu, Shoubin and Yang, Taojiannan and Spencer, Lincoln and Tian, Yapeng and Mian, Ajmal Saeed and Bansal, Mohit and Chen, Chen},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}

🙏 Acknowledgements

This work is built upon LISA and SAM.

We also appreciate the valuable help from Wenshuo Chen and Erhang Zhang during the GroundMoRe data collection.
