
Conversation

@JustinTong0323
Collaborator

@JustinTong0323 JustinTong0323 commented Oct 20, 2025

Motivation

Enable fa3 & trtllm_mha for spec_v2 (for gpt-oss on Hopper & Blackwell).

Modifications

Accuracy Tests

Tested on gpt-oss and llama4; accuracy matches.

Detailed test results can be found here.

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @JustinTong0323, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request, marked as a work-in-progress, introduces initial support for a new DRAFT_EXTEND_V2 forward mode, which appears to be an evolution or alternative to existing speculative decoding mechanisms. The changes primarily involve integrating this new mode into various attention backend components and updating internal ForwardMode classifications to properly recognize and handle it. This foundational work sets the stage for future enhancements related to speculative decoding within the system.

Highlights

  • Introduction of DRAFT_EXTEND_V2 Forward Mode: A new forward mode, DRAFT_EXTEND_V2, has been introduced, likely to support an updated speculative decoding mechanism. This mode is now recognized across various attention backend components.
  • Integration into Attention Backends: The DRAFT_EXTEND_V2 mode has been integrated into the logic of aiter_backend, flashattention_backend, flashmla_backend, trtllm_mha_backend, and trtllm_mla_backend, affecting how metadata is initialized and how forward_extend operations are handled under this new mode.
  • ForwardMode Utility Updates: The is_extend() and is_extend_or_draft_extend_or_mixed() utility methods in forward_batch_info.py have been updated to include and correctly classify the new DRAFT_EXTEND_V2 mode.
  • Temporary Bypass in Base Attention Backend: The update_verify_buffers_to_fill_after_draft method in base_attn_backend.py has had its NotImplementedError temporarily commented out and replaced with pass, indicating ongoing development for this specific functionality.
  • NSA Backend Compatibility: The NSA backend explicitly asserts that it does not support DRAFT_EXTEND_V2 (similar to DRAFT_EXTEND and TARGET_VERIFY), indicating that this mode is not compatible with the NSA attention mechanism.
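The ForwardMode changes in the highlights above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the actual sglang code: the enum member names mirror those mentioned in the PR summary (from forward_batch_info.py), but the exact set of members and method bodies are assumptions.

```python
from enum import Enum, auto

class ForwardMode(Enum):
    # Member names follow the modes mentioned in this PR; the real enum
    # in sglang's forward_batch_info.py may contain additional members.
    EXTEND = auto()
    DECODE = auto()
    MIXED = auto()
    DRAFT_EXTEND = auto()
    DRAFT_EXTEND_V2 = auto()  # new mode for the v2 speculative path
    TARGET_VERIFY = auto()

    def is_extend(self) -> bool:
        # Per the summary, DRAFT_EXTEND_V2 is now classified as an
        # extend-style mode alongside the existing draft/verify modes.
        return self in (
            ForwardMode.EXTEND,
            ForwardMode.MIXED,
            ForwardMode.DRAFT_EXTEND,
            ForwardMode.DRAFT_EXTEND_V2,
            ForwardMode.TARGET_VERIFY,
        )

    def is_extend_or_draft_extend_or_mixed(self) -> bool:
        return self in (
            ForwardMode.EXTEND,
            ForwardMode.MIXED,
            ForwardMode.DRAFT_EXTEND,
            ForwardMode.DRAFT_EXTEND_V2,
        )
```

Attention backends can then branch on these predicates when initializing metadata, which is what lets the listed backends pick up the new mode without bespoke checks everywhere.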

@JustinTong0323 JustinTong0323 marked this pull request as draft October 20, 2025 18:57
gemini-code-assist[bot]

This comment was marked as outdated.

@JustinTong0323 JustinTong0323 changed the title wip: fa3 spec overlap wip: fa3 & trtllm_mha spec overlap Oct 27, 2025
@JustinTong0323 JustinTong0323 changed the title wip: fa3 & trtllm_mha spec overlap fa3 & trtllm_mha spec overlap Oct 27, 2025
@JustinTong0323 JustinTong0323 marked this pull request as ready for review October 27, 2025 07:47
@JustinTong0323
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new parameter include_draft_extend_v2 to several methods related to forward mode in the sglang library. This parameter enables the inclusion of DRAFT_EXTEND_V2 in certain conditional checks, allowing for more flexible handling of different forward modes. The changes primarily affect flashattention_backend.py, trtllm_mha_backend.py, and forward_batch_info.py. The code has been reviewed and comments have been provided to address potential issues.
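The include_draft_extend_v2 parameter described above can be sketched as an opt-in flag on a forward-mode predicate. This is a hypothetical illustration of the pattern, not the real sglang signature: the method name is_draft_extend and the default value are assumptions based on the review summary.

```python
from enum import Enum, auto

class ForwardMode(Enum):
    EXTEND = auto()
    DRAFT_EXTEND = auto()
    DRAFT_EXTEND_V2 = auto()

    def is_draft_extend(self, include_draft_extend_v2: bool = False) -> bool:
        # Backends that support the v2 draft-extend path (e.g. fa3,
        # trtllm_mha per this PR) opt in explicitly; callers that do not
        # pass the flag keep their previous behavior by default.
        if include_draft_extend_v2 and self == ForwardMode.DRAFT_EXTEND_V2:
            return True
        return self == ForwardMode.DRAFT_EXTEND
```

Defaulting the flag to False keeps the change backward compatible: only call sites that were deliberately updated for the v2 mode see a behavior difference.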

def draft_worker(self):
return self._draft_worker

def clear_cache_pool(self):
Collaborator


The API has already been defined here, and we actually do not need to clear the cache (cache is cleared in the scheduler).

Collaborator Author


Got it, thanks! This PR predates that method, so it wasn't available when I wrote this.

@hnyls2002
Copy link
Collaborator

@JustinTong0323 Already approved, thanks for the support!

@Fridge003 Fridge003 merged commit 0bfe1d1 into sgl-project:main Nov 1, 2025
81 of 102 checks passed
