Support spec decoding when LoRA is applied to target model #12903
Conversation
Summary of Changes (from Gemini Code Assist): This pull request enables LoRA and speculative decoding to work together. The primary goal is to eliminate failures caused by the system's previous assumption of a fixed sequence length during decoding, which broke when speculative decoding was active. By making the sequence length computation dynamic and adaptable to speculative decoding's requirements, the PR allows both optimization techniques to function correctly together, improving the overall stability and capability of the system.
Code Review
This pull request adds support for speculative decoding when LoRA is enabled by generalizing the sequence length computation. The changes correctly propagate the num_tokens_per_bs parameter to handle variable sequence lengths in CUDA graph mode for LoRA, and the test cases are updated to cover these new scenarios. I've identified a potential logic issue in chunked_backend.py where a forward mode for speculative decoding might be handled incorrectly, and I've provided a suggestion to address it. Overall, the changes are well-aligned with the goal of the pull request.
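To make the CUDA graph concern in the review concrete, here is a rough sketch of the buffer-sizing issue it refers to. All names and signatures below are hypothetical illustrations, not the actual sglang code: with spec decoding, a captured decode graph processes `bs * num_tokens_per_bs` tokens rather than `bs` tokens, so LoRA intermediate buffers must be sized accordingly.

```python
# Hypothetical sketch: size LoRA decode buffers for spec decoding under CUDA graphs.
import torch

def alloc_lora_decode_buffers(max_bs: int, num_tokens_per_bs: int, rank: int, hidden: int):
    # Without spec decoding each request contributes 1 token per decode step;
    # with spec decoding it contributes num_tokens_per_bs tokens.
    max_num_tokens = max_bs * num_tokens_per_bs
    return {
        "lora_a_out": torch.empty(max_num_tokens, rank),
        "lora_out": torch.empty(max_num_tokens, hidden),
    }
```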
Force-pushed from a961d9d to e519e75
cc @ConnorLi96
vedantjh2 left a comment:
LGTM! Thanks for the fix!
Did you have an opportunity to benchmark this change for any speedup? Just wanted to ensure we still get the gains from ngram SD.
Can we add an assertion somewhere that LoRA currently only supports the n-gram spec algorithm?
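If such a guard were added, it might look roughly like the following. The function and argument names are hypothetical, not an existing sglang API:

```python
# Hypothetical sketch of the suggested guard: when LoRA is enabled, only
# allow the n-gram speculative decoding algorithm.
def check_lora_spec_compat(lora_paths, speculative_algorithm):
    if lora_paths and speculative_algorithm is not None:
        assert speculative_algorithm.upper() == "NGRAM", (
            "LoRA currently supports only the n-gram speculative decoding "
            f"algorithm, got {speculative_algorithm}"
        )
```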
Motivation
Fixes #12726
Currently the LoRA feature is incompatible with spec decoding, mostly because we assume seqlen = 1 during decoding in many places; this leads to incorrect results (e.g., for the Triton backend + CUDA graph) or startup failures (e.g., #12726).
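To make the problem concrete, here is a minimal, hypothetical sketch of the hardcoded assumption and how it generalizes; the helper name and signature are illustrative only, not the actual code path:

```python
# Hypothetical sketch: decode sequence lengths per request.
import torch

def decode_seq_lens(batch_size: int, spec_info=None) -> torch.Tensor:
    # Old assumption: every decode step handles exactly 1 token per request.
    # Generalized: with spec decoding, each request contributes
    # spec_info.num_tokens_per_bs tokens per step.
    num_tokens_per_bs = 1 if spec_info is None else spec_info.num_tokens_per_bs
    return torch.full((batch_size,), num_tokens_per_bs, dtype=torch.int32)
```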
Modifications
Generalize the decode sequence length computation to use spec_info.num_tokens_per_bs instead of assuming seqlen = 1, and propagate it through the LoRA CUDA graph path.
Accuracy Tests
Benchmarking and Profiling
Checklist