@vincentzed vincentzed commented Oct 19, 2025

Summary

Attempt to implement top_k > 1 in the EAGLE overlap speculative decoding path (v2), which is intended to replace v1 (non-overlap).

Accuracy Tests

SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server --dtype float16 --model-path unsloth/Meta-llama-3.1-8b-instruct --attention-backend triton --decode-log-interval 1 --disable-cuda-graph --speculative-algorithm EAGLE --speculative-draft-model-path lmsys/sglang-EAGLE-LLAMA3-instruct-8B --mem-fraction-static 0.8 --speculative-num-steps 3 --speculative-eagle-topk 2 --speculative-num-draft-tokens 4 --page-size 1 --disable-radix-cache --enable-beta-spec

Right now, with topk > 1, the server still produces gibberish.

  • With topk = 1 the output is fine. Use page size 1 only.
  • FlashInfer is not explicitly supported.
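
One way to pin down where the gibberish starts is to compare greedy output from the speculative server against a non-speculative baseline, since speculative decoding is expected to reproduce the target model's greedy output (up to numerical noise). The sketch below is not part of this PR. It assumes both servers expose SGLang's `/generate` endpoint and that the response JSON carries a `text` field; the helper names and any URLs passed in are hypothetical.

```python
import json
import urllib.request


def generate(base_url: str, prompt: str, max_new_tokens: int = 32) -> str:
    """Request a greedy (temperature 0) completion from a running server.

    Assumes an SGLang-style /generate endpoint returning {"text": ...}.
    """
    payload = {
        "text": prompt,
        "sampling_params": {"temperature": 0, "max_new_tokens": max_new_tokens},
    }
    req = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["text"]


def first_divergence(reference: str, candidate: str) -> int:
    """Return the index of the first differing character, or -1 if identical."""
    for i, (a, b) in enumerate(zip(reference, candidate)):
        if a != b:
            return i
    # One string is a strict prefix of the other: they diverge where it ends.
    if len(reference) != len(candidate):
        return min(len(reference), len(candidate))
    return -1
```

For example, with a baseline on a hypothetical port 30001 and the speculative server on 30000, `first_divergence(generate("http://localhost:30001", p), generate("http://localhost:30000", p))` returning a small index suggests the first verified draft tokens are already wrong, while a late divergence may only reflect float16 noise.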

Baseline result

| Implementation | Condition | Status |
|---|---|---|
| Triton | Page size > 1 | |
| Triton | Page size = 1 | |
| Triton | Topk = 1 | |
| Triton | Topk > 1 | |
| Triton | Both > 1 | |
| Flashinfer | Topk = 1 | |
| Flashinfer | Topk > 1 | |
| Flashinfer | Different page sizes untested | |
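
The backend, page size, and topk combinations listed above can be enumerated programmatically so that each configuration is launched with otherwise identical flags. This is a rough sketch, not the PR's test harness: the flags are copied from the launch command in this description, while the helper names and the page size of 16 used for the "page size > 1" case are assumptions. The `SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1` environment variable would still need to be set separately.

```python
import shlex
from itertools import product

# Flags shared by every configuration, taken from the launch command above.
COMMON_ARGS = [
    "python", "-m", "sglang.launch_server",
    "--dtype", "float16",
    "--model-path", "unsloth/Meta-llama-3.1-8b-instruct",
    "--speculative-algorithm", "EAGLE",
    "--speculative-draft-model-path", "lmsys/sglang-EAGLE-LLAMA3-instruct-8B",
    "--speculative-num-steps", "3",
    "--speculative-num-draft-tokens", "4",
    "--mem-fraction-static", "0.8",
    "--disable-radix-cache",
    "--disable-cuda-graph",
    "--enable-beta-spec",
]


def launch_command(backend: str, page_size: int, topk: int) -> str:
    """Build one shell-quoted launch command for a single configuration."""
    args = COMMON_ARGS + [
        "--attention-backend", backend,
        "--page-size", str(page_size),
        "--speculative-eagle-topk", str(topk),
    ]
    return shlex.join(args)


def build_matrix(backends=("triton", "flashinfer"),
                 page_sizes=(1, 16),  # 16 is an assumed "page size > 1" value
                 topks=(1, 2)):
    """Return launch commands for every backend/page-size/topk combination."""
    return [launch_command(b, p, k)
            for b, p, k in product(backends, page_sizes, topks)]
```

Printing `build_matrix()` yields eight commands with the default arguments; passing `backends=("triton",)` restricts the matrix to the backend this PR currently focuses on.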

Current Status

Output is still gibberish with topk > 1. The focus for now is on the Triton backend.

Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>