
Conversation

@lifuhuang
Collaborator

Motivation

fix #12726

Currently, the LoRA feature is incompatible with spec decoding, mostly because we assume seqlen = 1 during decoding in many places. This leads to incorrect results (e.g., for the triton backend with CUDA graph) or startup failures (e.g., #12726).

Modifications

  1. Generalize the decoding seq_len computation logic by leveraging spec_info.num_tokens_per_bs (a hypothetical sketch follows this list).
  2. Add UTs in both test_lora and test_lora_qwen3. For simplicity, the tests use the speculative_algorithm=NGRAM case to avoid downloading a draft model, but the logic should generalize to other algorithms (e.g., EAGLE).
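
For readers who want the intuition behind item 1, here is a minimal, hypothetical sketch of the idea: during decode, the per-request token count fed to the LoRA batch preparation should come from the speculative-decoding metadata instead of being hardcoded to 1. The name spec_info.num_tokens_per_bs mirrors this PR's description, but the surrounding function and attribute access are illustrative, not the actual SGLang code.

```python
def compute_decode_tokens_per_request(forward_batch) -> int:
    """Illustrative only: tokens contributed by each request in a decode step."""
    spec_info = getattr(forward_batch, "spec_info", None)
    if spec_info is not None and getattr(spec_info, "num_tokens_per_bs", None):
        # Speculative decoding (e.g., NGRAM, EAGLE): each request carries
        # several candidate tokens per step, so LoRA segment lengths and
        # CUDA-graph buffers must be sized with this value.
        return spec_info.num_tokens_per_bs
    # Classic autoregressive decode: exactly one token per request.
    return 1
```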

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @lifuhuang, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces crucial changes to enable seamless integration of LoRA with speculative decoding. The primary goal is to eliminate issues arising from the system's previous assumption of a fixed sequence length during decoding, which caused failures when speculative decoding was active. By making the sequence length computation more dynamic and adaptable to speculative decoding's requirements, this PR ensures that both performance optimization techniques can function correctly together, enhancing the overall stability and capability of the system.

Highlights

  • LoRA and Speculative Decoding Compatibility: This PR resolves a critical incompatibility between LoRA (Low-Rank Adaptation) and speculative decoding, which previously led to incorrect results or startup failures due to an implicit assumption of seqlen = 1 during decoding.
  • Generalized Sequence Length Computation: The core fix involves generalizing the seq_len computation logic, leveraging spec_info.num_tokens_per_bs to correctly handle varying sequence lengths introduced by speculative decoding.
  • Expanded Unit Test Coverage: New unit tests have been added to both test_lora and test_lora_qwen3 to validate the integration of LoRA with speculative decoding, specifically for the NGRAM algorithm, ensuring robustness (an illustrative setup follows this list).
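
To make that coverage concrete, the following is a hypothetical, minimal reproduction of the scenario the new tests exercise: a server launched with both a LoRA adapter and NGRAM speculative decoding, then queried with the adapter selected per request. The flag names (--lora-paths, --speculative-algorithm) and the lora_path request field follow common SGLang usage, but the model and adapter paths are placeholders and any NGRAM tuning flags are omitted; this is not the test code from the PR.

```python
import subprocess
import requests

# Illustrative launch combining a LoRA adapter with NGRAM speculative decoding.
# Paths are placeholders; flag names are assumptions based on typical SGLang usage.
server = subprocess.Popen([
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Llama-3.1-8B-Instruct",  # placeholder base model
    "--lora-paths", "lora0=/path/to/adapter",            # placeholder adapter
    "--speculative-algorithm", "NGRAM",                   # no draft model required
])

# ... after waiting for the server to become ready, select the adapter per request.
resp = requests.post("http://127.0.0.1:30000/generate", json={
    "text": "List three uses of low-rank adapters.",
    "sampling_params": {"max_new_tokens": 64},
    "lora_path": "lora0",  # route this request through the LoRA adapter
})
print(resp.json())
```
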
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for speculative decoding when LoRA is enabled by generalizing the sequence length computation. The changes correctly propagate the num_tokens_per_bs parameter to handle variable sequence lengths in CUDA graph mode for LoRA, and the test cases are updated to cover these new scenarios. I've identified a potential logic issue in chunked_backend.py where a forward mode for speculative decoding might be handled incorrectly, and I've provided a suggestion to address it. Overall, the changes are well-aligned with the goal of the pull request.
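
Since chunked_backend.py itself is not shown in this thread, here is only a hypothetical illustration of the kind of forward-mode handling the comment is pointing at: with speculative decoding, decode-like passes carry more than one token per request, so a branch written around "decode means seq_len = 1" needs an extra case. The names below (ForwardMode, TARGET_VERIFY, num_tokens_per_bs) are simplified stand-ins modeled on the PR description, not quotes from the real backend.

```python
from enum import Enum, auto

class ForwardMode(Enum):       # simplified stand-in for the real enum
    EXTEND = auto()            # prefill / chunked prefill
    DECODE = auto()            # classic one-token-per-request decode
    TARGET_VERIFY = auto()     # spec decoding: target model verifies draft tokens

def tokens_per_request(mode: ForwardMode, num_tokens_per_bs: int) -> int:
    """Hypothetical: tokens each request contributes to this forward pass."""
    if mode in (ForwardMode.DECODE, ForwardMode.TARGET_VERIFY):
        # With speculative decoding enabled, both decode and verify passes batch
        # num_tokens_per_bs tokens per request; assuming 1 here would mis-size
        # LoRA segment lengths and CUDA-graph buffers.
        return num_tokens_per_bs
    raise ValueError("prefill/extend uses per-request extend lengths instead")
```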

@lifuhuang lifuhuang changed the title from "Support spec decoding when LoRA is enabled" to "Support spec decoding when LoRA is applied to target model" on Nov 9, 2025
@Fridge003 Fridge003 self-assigned this Nov 10, 2025
@Fridge003
Collaborator

cc @ConnorLi96

@hnyls2002 hnyls2002 self-assigned this Nov 12, 2025
@hnyls2002 hnyls2002 mentioned this pull request Nov 12, 2025
Contributor

@vedantjh2 vedantjh2 left a comment


LGTM! Thanks for the fix!
Did you have an opportunity to benchmark this change for any speed-up, etc.? Just wanted to ensure we still get the gains from ngram SD.

@Fridge003
Collaborator

Can we add an assertion somewhere that LoRA currently only supports the n-gram spec algorithm?
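
A minimal sketch of the kind of guard being requested, assuming it would live in server-argument validation; the attribute names (lora_paths, speculative_algorithm) and the exact set of allowed algorithms are assumptions, not code from this PR.

```python
def check_lora_spec_compat(server_args) -> None:
    """Hypothetical validation: reject LoRA + spec-decoding combos that are untested."""
    lora_enabled = bool(getattr(server_args, "lora_paths", None))
    spec_algo = getattr(server_args, "speculative_algorithm", None)
    if lora_enabled and spec_algo is not None and spec_algo != "NGRAM":
        raise ValueError(
            "LoRA with speculative decoding is currently only validated for "
            f"NGRAM; got speculative_algorithm={spec_algo!r}."
        )
```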

@Fridge003 Fridge003 merged commit 254f62d into main Nov 16, 2025
156 of 178 checks passed
@Fridge003 Fridge003 deleted the lifu/lora-spec branch November 16, 2025 21:20

Successfully merging this pull request may close these issues.

[Bug] Server fails to launch with multi-LoRA + NGRAM Speculative Decoding on 0.5.4.post3
