Add Mistral Large 3 support. #14213
Conversation
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Summary of Changes

Hello @dcampora, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly expands the system's model compatibility by integrating the Mistral Large 3 and PixtralForConditionalGeneration models. It includes crucial updates to the underlying infrastructure for handling Mixture-of-Experts (MoE) architectures, particularly focusing on advanced FP8 quantization techniques. The changes ensure that these new models are correctly configured, loaded, and can leverage optimized quantization paths, while also improving the system's ability to retrieve model-specific information.

Highlights
Code Review
This pull request adds support for Mistral Large 3 and related models like Pixtral and Eagle. This is a significant feature addition, touching many parts of the codebase, including model configuration, quantization kernels, and the core model implementations. The changes are well-structured, especially the use of inheritance and remapping to adapt the new models to the existing DeepseekV2 architecture. The addition of a mistral_utils.py to handle Mistral's non-standard params.json is a good solution to a tricky problem.
I have a few suggestions to improve maintainability and robustness. Specifically, I've pointed out a redundant check in the FP8 quantization logic, a potential issue with silent failures in weight remapping, and a resource leak in temporary file handling.
Overall, this is a great contribution that expands the model support of sglang.
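For readers unfamiliar with Mistral's checkpoint layout, here is a minimal sketch of the remapping idea the review refers to. It is not the PR's actual mistral_utils.py; the key names in the mapping are assumptions.

```python
import json

# Hypothetical sketch: translate the keys in Mistral's params.json into the
# Hugging Face-style names the rest of the stack expects. The key names below
# are assumptions, not the PR's actual mapping.
MISTRAL_TO_HF_KEYS = {
    "dim": "hidden_size",
    "n_layers": "num_hidden_layers",
    "n_heads": "num_attention_heads",
    "n_kv_heads": "num_key_value_heads",
    "hidden_dim": "intermediate_size",
    "norm_eps": "rms_norm_eps",
    "vocab_size": "vocab_size",
}


def load_mistral_params_as_hf_config(path: str) -> dict:
    """Read a params.json file and return a dict with HF-style field names."""
    with open(path) as f:
        params = json.load(f)
    return {
        hf_key: params[mistral_key]
        for mistral_key, hf_key in MISTRAL_TO_HF_KEYS.items()
        if mistral_key in params
    }
```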
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
router code lgtm
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
gsm8k benchmark results:
w/ FP8 attention:
w/o FP8 attention:
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
return output

class MistralLarge3ForCausalLMEagle(MistralLarge3ForCausalLM):
EAGLE seems not working.
if rope_scaling:
    rope_scaling["rope_type"] = "deepseek_yarn"

self.llama_4_scaling = (
Maybe we can add a comment here for the Mistral model.
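For illustration, the kind of inline comment being asked for might look like the sketch below; the rope_scaling values are placeholder assumptions, not the PR's code.

```python
# A possible inline comment for the quoted hunk:
# Mistral Large 3 reuses the DeepSeek-style MLA/YaRN rotary path, so its HF
# rope_scaling config is remapped to the "deepseek_yarn" implementation.
rope_scaling = {"factor": 8.0, "original_max_position_embeddings": 4096}  # placeholder values
if rope_scaling:
    rope_scaling["rope_type"] = "deepseek_yarn"
```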
if self.enable_flashinfer_cutlass_moe or self.enable_flashinfer_trtllm_moe:
    w13_input_scale = layer.w13_input_scale.max().to(torch.float32)
    w2_input_scale = layer.w2_input_scale.max().to(torch.float32)

w13_input_scale = (
Why do we need to update modelopt quant here?
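As background on what the quoted lines appear to do (a sketch under assumed shapes, not the layer's actual code): the FlashInfer CUTLASS/TRT-LLM MoE kernels take a single per-tensor input scale, so per-expert ModelOpt scales are collapsed with a max.

```python
import torch

# Assumed layout: one input scale per expert, shape (num_experts,).
w13_input_scale_per_expert = torch.tensor([0.9, 1.1, 1.0, 1.3])
w2_input_scale_per_expert = torch.tensor([0.7, 0.8, 1.2, 0.6])

# Collapse to a single conservative per-tensor scale; the max keeps every
# expert's activations representable in the shared FP8 range.
w13_input_scale = w13_input_scale_per_expert.max().to(torch.float32)
w2_input_scale = w2_input_scale_per_expert.max().to(torch.float32)
```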
self.process_weights_hip_scale_padding(layer)

# Align FP8 weights to FlashInfer per-tensor kernel layout if enabled
if get_moe_runner_backend().is_flashinfer_trtllm():
For flashinfer_trtllm MoE, do we need a separate PR to support it?
    bias: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    if self.weight_block_size is not None:
        return self.w8a8_block_fp8_linear(
TODO: add DeepGEMM support
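For readers skimming the hunk, here is a simplified sketch of the dispatch being discussed; the class, fallback path, and names are illustrative, not the PR's exact implementation.

```python
from typing import Optional

import torch
import torch.nn.functional as F


class Fp8LinearSketch:
    """Illustrative dispatch only: route to a block-scaled FP8 GEMM when block
    scales are present, otherwise fall back to a plain linear."""

    def __init__(self, weight_block_size: Optional[list] = None):
        # e.g. [128, 128] for 128x128 block-quantized weights; None otherwise.
        self.weight_block_size = weight_block_size

    def apply(
        self,
        x: torch.Tensor,
        weight: torch.Tensor,
        bias: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        if self.weight_block_size is not None:
            return self._w8a8_block_fp8_linear(x, weight, bias)
        return F.linear(x, weight, bias)

    def _w8a8_block_fp8_linear(self, x, weight, bias):
        # Stand-in for the block-scaled FP8 kernel call (Triton / DeepGEMM / CUTLASS).
        return F.linear(x, weight, bias)
```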
/tag-and-rerun-ci
Clean up eagle
Fix Llama4 scaling
@ispobock let's tackle Eagle3, DeepGEMM, flashinfer_trtllm MoE and FP4 support in follow-up PRs. We have removed the Eagle3 code from the PR.
/tag-and-rerun-ci
    hidden_states: torch.Tensor,
    forward_batch: ForwardBatch,
    zero_allocator: BumpAllocator,
    llama_4_scaling: Optional[torch.Tensor],
There are a lot of changes just to introduce the llama_4_scaling parameter here. Is there a better way to handle it?
@dcampora The ci failed: https://github.com/sgl-project/sglang/actions/runs/19900521834/job/57100360299?pr=14213#step:5:5669
File "/public_sglang_ci/runner-l3c-gpu-67/_work/sglang/sglang/python/sglang/srt/models/kimi_linear.py", line 437, in forward
hidden_states = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
TypeError: DeepseekV2AttentionMLA.forward() missing 1 required positional argument: 'llama_4_scaling'
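One low-churn way to avoid this kind of breakage, offered only as a sketch and not necessarily what the PR settles on, is to give the new argument a default value so call sites that never use the scaling (such as the kimi_linear path in the traceback) keep working unchanged.

```python
from typing import Optional

import torch


class AttentionSketch:
    def forward(
        self,
        positions: torch.Tensor,
        hidden_states: torch.Tensor,
        forward_batch,
        zero_allocator=None,
        # Defaulting to None keeps existing callers valid; only models that
        # need the extra q scaling pass a tensor here.
        llama_4_scaling: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        ...
```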
    k_rope: Optional[torch.Tensor] = None,
    cos_sin_cache: Optional[torch.Tensor] = None,
    is_neox: Optional[bool] = False,
    llama_4_scaling: Optional[torch.Tensor] = None,
If we add this logic here, we will have to do the same thing for a ton of attention backends and keep them in sync. One option may be to put the logic in the models folder (deepseek_v2.py, the Mistral model file, etc.), since it looks like it is just scaling the q tensor before entering the core attention logic.
In Mistral's implementation in vLLM, the scaling is applied between RoPE and attention, so it is implemented inside the MLA layer: https://github.com/vllm-project/vllm/pull/29757/files#diff-6ffcb4f51daf85df32c7d35433c3393f1602663960f677ae61f55af1ed3ab524
In SGLang, for the trtllm MLA backend, RoPE is fused into the attention backend. This is quite different from vLLM. The scaling needs to be passed into the backend so it can be applied correctly.
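For context on what the scaling computes, here is a sketch in the style of Llama 4's attention temperature tuning, applied to q between RoPE and the attention kernel; the constants and the exact form used by Mistral Large 3 are assumptions, not confirmed against its config.

```python
import torch


def apply_llama4_style_q_scaling(
    q: torch.Tensor,          # (num_tokens, num_heads, head_dim), already rotated by RoPE
    positions: torch.Tensor,  # (num_tokens,) absolute token positions
    attn_scale: float = 0.1,      # assumed value
    floor_scale: float = 8192.0,  # assumed value
) -> torch.Tensor:
    # Per-token temperature applied to q before the core attention computation.
    scale = (
        torch.log(torch.floor((positions.float() + 1.0) / floor_scale) + 1.0) * attn_scale
        + 1.0
    )
    return q * scale.view(-1, 1, 1).to(q.dtype)
```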
/tag-and-rerun-ci
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com> Co-authored-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
if not (per_tensor or per_channel):
    assert self.weight_quant.strategy == QuantizationStrategy.BLOCK
    self.weight_block_size = self.weight_quant.block_structure
    assert self.weight_quant.dynamic is not None
self.weight_quant.dynamic is False here, so can it ever be None?
This part is mostly from vLLM: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py#L667-L681
Given that compressed-tensors is also published by them, could you confirm with them?
> This part is mostly from vLLM: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py#L667-L681 Given that compressed-tensors is also published by them, could you confirm with them?
In recipes like this one, https://github.com/vllm-project/llm-compressor/blob/aa504491afd28a0d5f66d3e38088352dcb4e63ff/src/llmcompressor/modifiers/quantization/gptq/base.py#L57, there is no `dynamic` attribute.
By the way, can you help review PR #15386?
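For readers following the thread, here is a minimal standalone sketch of the check under discussion, using plain-Python stand-ins for the compressed-tensors fields; the real semantics of `dynamic` should be confirmed upstream.

```python
# Plain-Python stand-ins for the compressed-tensors fields referenced in the hunk.
class WeightQuantArgs:
    strategy = "block"            # QuantizationStrategy.BLOCK in the real code
    block_structure = [128, 128]  # block shape used for the weight scales
    dynamic = False               # the reviewer's point: False is falsy but not None


weight_quant = WeightQuantArgs()
per_tensor = weight_quant.strategy == "tensor"
per_channel = weight_quant.strategy == "channel"

if not (per_tensor or per_channel):
    assert weight_quant.strategy == "block"
    weight_block_size = weight_quant.block_structure
    # `assert weight_quant.dynamic is not None` passes whether dynamic is True or
    # False; if the intent is "weights must be statically quantized", then
    # `assert not weight_quant.dynamic` would state it more directly.
    assert weight_quant.dynamic is not None
```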
Motivation
This PR introduces support for model Mistral Large 3.
Modifications
To enable the model, several key modifications were made.
fp8.py.
hf_transformers_utils.py was added.

Accuracy Tests
Benchmarking and Profiling
Checklist