Support spec decoding when LoRA is applied to target model #12903
Conversation
Summary of Changes (from Gemini Code Assist): This pull request enables LoRA and speculative decoding to work together. The primary goal is to eliminate failures caused by the system's previous assumption of a fixed sequence length during decoding, which broke when speculative decoding was active. By making the sequence length computation dynamic and adaptable to speculative decoding's requirements, the PR allows both optimization techniques to function correctly together, improving the overall stability and capability of the system.
Code Review
This pull request adds support for speculative decoding when LoRA is enabled by generalizing the sequence length computation. The changes correctly propagate the num_tokens_per_bs parameter to handle variable sequence lengths in CUDA graph mode for LoRA, and the test cases are updated to cover these new scenarios. I've identified a potential logic issue in chunked_backend.py where a forward mode for speculative decoding might be handled incorrectly, and I've provided a suggestion to address it. Overall, the changes are well-aligned with the goal of the pull request.
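To make the CUDA graph concern in the review concrete, here is a rough sketch of the buffer-sizing issue it refers to. All names and signatures below are hypothetical illustrations, not the actual sglang code: with spec decoding, a captured decode graph processes `bs * num_tokens_per_bs` tokens rather than `bs` tokens, so LoRA intermediate buffers must be sized accordingly.

```python
# Hypothetical sketch: size LoRA decode buffers for spec decoding under CUDA graphs.
import torch

def alloc_lora_decode_buffers(max_bs: int, num_tokens_per_bs: int, rank: int, hidden: int):
    # Without spec decoding each request contributes 1 token per decode step;
    # with spec decoding it contributes num_tokens_per_bs tokens.
    max_num_tokens = max_bs * num_tokens_per_bs
    return {
        "lora_a_out": torch.empty(max_num_tokens, rank),
        "lora_out": torch.empty(max_num_tokens, hidden),
    }
```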
Force-pushed from a961d9d to e519e75
cc @ConnorLi96
vedantjh2 left a comment:
LGTM! Thanks for the fix!
Did you have an opportunity to benchmark this change for any speedup? Just wanted to ensure we still get the gains from ngram SD.
Can we add an assertion somewhere that LoRA currently only supports the n-gram spec algorithm?
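If such a guard were added, it might look roughly like the following. The function and argument names are hypothetical, not an existing sglang API:

```python
# Hypothetical sketch of the suggested guard: when LoRA is enabled, only
# allow the n-gram speculative decoding algorithm.
def check_lora_spec_compat(lora_paths, speculative_algorithm):
    if lora_paths and speculative_algorithm is not None:
        assert speculative_algorithm.upper() == "NGRAM", (
            "LoRA currently supports only the n-gram speculative decoding "
            f"algorithm, got {speculative_algorithm}"
        )
```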
Motivation
Fixes #12726
Currently the LoRA feature is incompatible with spec decoding, mostly because we assume seqlen = 1 during decoding in many places; this leads to incorrect results (e.g., for the Triton backend + CUDA graph) or startup failures (e.g., #12726).
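To make the problem concrete, here is a minimal, hypothetical sketch of the hardcoded assumption and how it generalizes; the helper name and signature are illustrative only, not the actual code path:

```python
# Hypothetical sketch: decode sequence lengths per request.
import torch

def decode_seq_lens(batch_size: int, spec_info=None) -> torch.Tensor:
    # Old assumption: every decode step handles exactly 1 token per request.
    # Generalized: with spec decoding, each request contributes
    # spec_info.num_tokens_per_bs tokens per step.
    num_tokens_per_bs = 1 if spec_info is None else spec_info.num_tokens_per_bs
    return torch.full((batch_size,), num_tokens_per_bs, dtype=torch.int32)
```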
Modifications
Generalize the decode sequence length computation to use spec_info.num_tokens_per_bs instead of assuming seqlen = 1, and propagate it through the LoRA CUDA graph path.
Accuracy Tests
Benchmarking and Profiling
Checklist