
Conversation

@dcampora (Contributor) commented Dec 1, 2025

Motivation

This PR introduces support for the Mistral Large 3 model.

Modifications

To enable the model, several key modifications were made.

Accuracy Tests

Benchmarking and Profiling

Checklist

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @dcampora, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the system's model compatibility by integrating the Mistral Large 3 and PixtralForConditionalGeneration models. It includes crucial updates to the underlying infrastructure for handling Mixture-of-Experts (MoE) architectures, particularly focusing on advanced FP8 quantization techniques. The changes ensure that these new models are correctly configured, loaded, and can leverage optimized quantization paths, while also improving the system's ability to retrieve model-specific information.

Highlights

  • Mistral Large 3 Support: Added comprehensive support for the Mistral Large 3 model, including its specific architecture and weight remapping, enabling its use within the system.
  • PixtralForConditionalGeneration Integration: Introduced support for the PixtralForConditionalGeneration model, which includes a vision encoder, patch merger, and vision-language adapter, expanding multimodal capabilities.
  • FP8 Per-Tensor Quantization for MoE: Implemented FP8 per-tensor quantization for Mixture-of-Experts (MoE) layers, including weight alignment for FlashInfer kernels and precomputation of per-expert output scaling factors, optimizing performance for quantized MoE models (see the sketch after this list).
  • Model Configuration and Routing Enhancements: Updated model configuration handling across various components to properly recognize and configure new models like Mistral Large 3 and Pixtral, and refactored the router manager to intelligently route model information requests.
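As a rough illustration of the FP8 per-tensor quantization highlight above, here is a minimal sketch assuming nothing beyond PyTorch's float8_e4m3fn dtype: one expert weight is reduced to an FP8 tensor plus a single float32 scale. The names are hypothetical; the PR's actual path goes through the MoE quantization methods and the FlashInfer kernel layout described above.

```python
# Minimal sketch of FP8 per-tensor quantization for a single expert weight.
# Hypothetical, self-contained example; not the kernel path used in this PR.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def quantize_per_tensor_fp8(w: torch.Tensor):
    """Return (w_fp8, scale) such that w ≈ w_fp8.float() * scale."""
    scale = w.abs().max().float() / FP8_MAX
    w_fp8 = (w.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return w_fp8, scale


# Per-expert output scaling factors can then be precomputed once, e.g. by
# combining each expert's weight scale with the layer's activation input scale.
w = torch.randn(1024, 4096, dtype=torch.bfloat16)
w_fp8, w_scale = quantize_per_tensor_fp8(w)
```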
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

|Feature|Command|Description|
|-------|-------|-----------|
|Code Review|/gemini review|Performs a code review for the current pull request in its current state.|
|Pull Request Summary|/gemini summary|Provides a summary of the current pull request in its current state.|
|Comment|@gemini-code-assist|Responds in comments when explicitly tagged, both in pull request comments and review comments.|
|Help|/gemini help|Displays a list of available commands.|

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a comment:

Code Review

This pull request adds support for Mistral Large 3 and related models like Pixtral and Eagle. This is a significant feature addition, touching many parts of the codebase, including model configuration, quantization kernels, and the core model implementations. The changes are well-structured, especially the use of inheritance and remapping to adapt the new models to the existing DeepseekV2 architecture. The addition of a mistral_utils.py to handle Mistral's non-standard params.json is a good solution to a tricky problem.

I have a few suggestions to improve maintainability and robustness. Specifically, I've pointed out a redundant check in the FP8 quantization logic, a potential issue with silent failures in weight remapping, and a resource leak in temporary file handling.

Overall, this is a great contribution that expands the model support of sglang.
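
To make the inheritance-and-remapping pattern mentioned above concrete, here is a minimal sketch. It assumes sglang's existing DeepseekV2ForCausalLM base class; the prefix map and class wiring are hypothetical and stand in for the PR's actual remapping and mistral_utils.py logic.

```python
# Sketch of the pattern only: rename Mistral-style checkpoint keys to the module
# names the inherited DeepseekV2 implementation expects, then delegate loading.
from typing import Iterable, Tuple

import torch

from sglang.srt.models.deepseek_v2 import DeepseekV2ForCausalLM  # existing base class

# Hypothetical prefix map, not the mapping used in the PR.
_KEY_REMAP = {
    "tok_embeddings.": "model.embed_tokens.",
    "layers.": "model.layers.",
    "norm.": "model.norm.",
}


def remap_key(name: str) -> str:
    """Rewrite one checkpoint key; unknown keys pass through unchanged."""
    for src, dst in _KEY_REMAP.items():
        if name.startswith(src):
            return dst + name[len(src):]
    return name


class MistralLarge3ForCausalLMSketch(DeepseekV2ForCausalLM):
    def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
        # Remap names lazily, then reuse the parent loader unchanged.
        remapped = ((remap_key(name), tensor) for name, tensor in weights)
        return super().load_weights(remapped)
```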

zhyncs and others added 2 commits December 1, 2025 02:24
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@slin1237 (Collaborator) commented Dec 1, 2025

router code lgtm
make sure other pieces are reviewed

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
@yuan-luo yuan-luo added Multi-modal multi-modal language model vlm labels Dec 3, 2025
@ispobock (Collaborator) commented Dec 3, 2025

gsm8k benchmark results:

SGLANG_ENABLE_JIT_DEEPGEMM=0 \
python3 -m sglang.launch_server \
--model mistralai/Mistral-Large-3-675B-Instruct-2512 \
--kv-cache-dtype fp8_e4m3 \
--tensor-parallel-size 8 \
--attention-backend trtllm_mla \
--model-loader-extra-config '{"enable_multithread_load": true}' \
--chat-template mistral

lm_eval \
--model local-chat-completions \
--model_args model=mistralai/Mistral-Large-3-675B-Instruct-2512,base_url=http://127.0.0.1:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=8192 \
--tasks gsm8k \
--batch_size 128 \
--apply_chat_template \
--num_fewshot 8

w/ FP8 attention:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9181|±  |0.0076|
|     |       |strict-match    |     8|exact_match|↑  |0.7225|±  |0.0123|

w/o FP8 attention:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9242|±  |0.0073|
|     |       |strict-match    |     8|exact_match|↑  |0.7528|±  |0.0119|

Linda-Stadter and others added 2 commits December 3, 2025 04:59
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
return output


class MistralLarge3ForCausalLMEagle(MistralLarge3ForCausalLM):

Collaborator:

EAGLE seems not to be working.

if rope_scaling:
    rope_scaling["rope_type"] = "deepseek_yarn"

self.llama_4_scaling = (

Collaborator:

Maybe we can add a comment here for the Mistral model.

if self.enable_flashinfer_cutlass_moe or self.enable_flashinfer_trtllm_moe:
    w13_input_scale = layer.w13_input_scale.max().to(torch.float32)
    w2_input_scale = layer.w2_input_scale.max().to(torch.float32)
    w13_input_scale = (

Collaborator:

Why do we need to update modelopt quant here?

self.process_weights_hip_scale_padding(layer)

# Align FP8 weights to FlashInfer per-tensor kernel layout if enabled
if get_moe_runner_backend().is_flashinfer_trtllm():

Collaborator:

For flashinfer_trtllm MoE, do we need a separate PR to support it?

    bias: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    if self.weight_block_size is not None:
        return self.w8a8_block_fp8_linear(

Collaborator:

TODO: add DeepGEMM support

@ispobock (Collaborator) commented Dec 3, 2025

/tag-and-rerun-ci

@github-actions github-actions bot added the blackwell SM100/SM120 label Dec 3, 2025
@dcampora (Contributor, Author) commented Dec 3, 2025

@ispobock let's tackle Eagle3, DeepGEMM, flashinfer_trtllm moe and FP4 support in follow-up MRs. We have removed the Eagle3 code from the PR.

@ispobock (Collaborator) commented Dec 4, 2025

/tag-and-rerun-ci

hidden_states: torch.Tensor,
forward_batch: ForwardBatch,
zero_allocator: BumpAllocator,
llama_4_scaling: Optional[torch.Tensor],

@ispobock (Collaborator), Dec 4, 2025:

There are too many changes for introducing the llama_4_scaling parameter here. Is there a better way to handle it?

Collaborator:

@dcampora The CI failed: https://github.com/sgl-project/sglang/actions/runs/19900521834/job/57100360299?pr=14213#step:5:5669

File "/public_sglang_ci/runner-l3c-gpu-67/_work/sglang/sglang/python/sglang/srt/models/kimi_linear.py", line 437, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: DeepseekV2AttentionMLA.forward() missing 1 required positional argument: 'llama_4_scaling'

k_rope: Optional[torch.Tensor] = None,
cos_sin_cache: Optional[torch.Tensor] = None,
is_neox: Optional[bool] = False,
llama_4_scaling: Optional[torch.Tensor] = None,

Collaborator:

If we add this logic here, we will have to do the same thing for a ton of attention backends and keep them in sync. One way might be to just put the logic in the models folder (deepseek_v2.py, the Mistral model file, etc.), since it looks like it is just scaling the q tensor before entering the core attention logic.

Contributor:

In Mistral's implementation in vLLM, the scaling is applied between RoPE and attention, so it is implemented inside the MLA layer: https://github.com/vllm-project/vllm/pull/29757/files#diff-6ffcb4f51daf85df32c7d35433c3393f1602663960f677ae61f55af1ed3ab524

In SGLang, for the trtllm MLA backend, RoPE is fused into the attention backend. This is quite different from vLLM. The scaling needs to be passed into the backend so it can be applied correctly.
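
To illustrate the point being discussed (a hedged sketch with hypothetical names, not the PR's backend code): the operation itself is just a per-token multiplier on the query tensor after RoPE, which is why it has to be threaded into any backend that fuses RoPE with attention.

```python
# Hypothetical helper showing what the llama_4_scaling tensor does conceptually:
# scale q per token after RoPE, before the core attention computation.
from typing import Optional

import torch


def apply_llama_4_scaling(
    q: torch.Tensor,  # [num_tokens, num_heads, head_dim], already rotated by RoPE
    llama_4_scaling: Optional[torch.Tensor],  # [num_tokens] or None
) -> torch.Tensor:
    if llama_4_scaling is None:
        return q
    return q * llama_4_scaling.view(-1, 1, 1).to(q.dtype)
```

When RoPE runs in model code, this multiplication can live in the model file as suggested above; when RoPE is fused into the attention kernel, the tensor must be passed down so the backend can apply it at the equivalent point.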

@ispobock (Collaborator) commented Dec 4, 2025

/tag-and-rerun-ci

@ispobock ispobock merged commit 8428078 into sgl-project:main Dec 4, 2025
282 of 297 checks passed
yingluosanqian pushed a commit to yingluosanqian/sglang that referenced this pull request Dec 4, 2025
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
yuchengz816-bot pushed a commit to yuchengz816-bot/sglang that referenced this pull request Dec 8, 2025
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
Kevin-XiongC pushed a commit to novitalabs/sglang that referenced this pull request Dec 9, 2025
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
if not (per_tensor or per_channel):
    assert self.weight_quant.strategy == QuantizationStrategy.BLOCK
    self.weight_block_size = self.weight_quant.block_structure
    assert self.weight_quant.dynamic is not None


self.weight_quant.dynamic is False, so can it be None?

Contributor:

This part is mostly from vLLM: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py#L667-L681
Given that compressed-tensors is also published by them, would you confirm with them?

@Wangzheee, Dec 18, 2025:

> This part is mostly from vLLM: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py#L667-L681 Given that compressed-tensors is also published by them, would you confirm with them?

Like in this recipe, https://github.com/vllm-project/llm-compressor/blob/aa504491afd28a0d5f66d3e38088352dcb4e63ff/src/llmcompressor/modifiers/quantization/gptq/base.py#L57, there is no dynamic attribute.
By the way, can you help review PR #15386?
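
As background for the dynamic discussion above (a generic Python illustration, not code from this PR or from compressed-tensors): an `is not None` assertion accepts False, so the two checks are not interchangeable.

```python
# `is not None` only rejects None; it passes for False, a present but falsy value.
dynamic = False
assert dynamic is not None  # passes: False is a value, just a falsy one

try:
    assert dynamic  # a truthiness check would reject False
except AssertionError:
    print("truthiness check fails for False")
```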

GuoYechang pushed a commit to GuoYechang/sglang that referenced this pull request Jan 13, 2026
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>


Development

Successfully merging this pull request may close these issues.

[Bug] bench_sglang fails due to get_model_info endpoint of SGLang PDRouter not being implemented