[Feat]: Add dflash in specdec-bench #1432
Conversation
Signed-off-by: Hao Guo <haoguo@poly0172.polyphe.clusters.nvidia.com>
📝 Walkthrough

This PR extends the speculative decoding benchmark to support the DFLASH speculative algorithm. CLI validation is updated to accept "DFLASH", SGLang model initialization is refactored to apply algorithm-specific parameters, and vLLM integration adds DFLASH configuration using draft model directories and speculative token counts.

Changes: DFLASH speculative algorithm support.
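For orientation, the run.py change amounts to registering DFLASH among the accepted --speculative_algorithm values, roughly like the sketch below. This is a minimal sketch, not the actual run.py code: the surrounding argument wiring and the exact set of pre-existing choices are assumptions drawn from the algorithm names mentioned later in this review.

```python
import argparse

# Minimal sketch of the CLI registration; not the actual run.py code.
parser = argparse.ArgumentParser(description="specdec_bench runner (sketch)")
parser.add_argument(
    "--speculative_algorithm",
    # Pre-existing algorithms per this review, with DFLASH newly added.
    choices=["EAGLE", "EAGLE3", "MTP", "DRAFT_TARGET", "NGRAM", "DFLASH"],
    default=None,
    help="Speculative decoding algorithm to benchmark",
)
args = parser.parse_args()
```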
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ✅ 5 passed | ❌ 1 failed (1 warning)
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main    #1432   +/-   ##
=======================================
  Coverage   76.95%   76.95%
=======================================
  Files         478      478
  Lines       51648    51648
=======================================
+ Hits        39744    39747       +3
+ Misses      11904    11901       -3
```
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.

Inline comments:

In `examples/specdec_bench/specdec_bench/models/sglang.py`:
- Around lines 47-62: the engine_kwargs construction uses dict(...), which triggers Ruff C408. Replace the dict(...) call with a dict literal for engine_kwargs (the block that sets model_path, skip_tokenizer_init, trust_remote_code, mem_fraction_static, disable_overlap_schedule, tp_size, ep_size, torch_compile_max_bs, max_running_requests, attention_backend, enable_torch_compile, cuda_graph_max_bs, disable_cuda_graph, disable_cuda_graph_padding) so the same key/value pairs are expressed with a { ... } literal instead of dict(...).
📒 Files selected for processing (3)

- examples/specdec_bench/run.py
- examples/specdec_bench/specdec_bench/models/sglang.py
- examples/specdec_bench/specdec_bench/models/vllm.py
In examples/specdec_bench/specdec_bench/models/sglang.py:

```python
engine_kwargs = dict(
    model_path=model_dir,
    skip_tokenizer_init=True,
    trust_remote_code=kwargs.get("trust_remote_code", False),
    mem_fraction_static=kwargs.get("mem_fraction_static", 0.8),
    disable_overlap_schedule=kwargs.get("disable_overlap_schedule", False),
    tp_size=kwargs.get("tensor_parallel_size", 1),
    ep_size=kwargs.get("moe_expert_parallel_size", 1),
    torch_compile_max_bs=max_concurrent_requests,
    max_running_requests=max_concurrent_requests,
    attention_backend=kwargs.get("attention_backend"),
    enable_torch_compile=kwargs.get("enable_torch_compile", False),
    cuda_graph_max_bs=max_concurrent_requests,
    disable_cuda_graph=False,
    disable_cuda_graph_padding=True,
)
```
Fix CI blocker: replace dict(...) with a literal.
Line 47 triggers Ruff C408 in the pipeline, so this currently fails code-quality checks.
Suggested patch
```diff
-engine_kwargs = dict(
-    model_path=model_dir,
-    skip_tokenizer_init=True,
-    trust_remote_code=kwargs.get("trust_remote_code", False),
-    mem_fraction_static=kwargs.get("mem_fraction_static", 0.8),
-    disable_overlap_schedule=kwargs.get("disable_overlap_schedule", False),
-    tp_size=kwargs.get("tensor_parallel_size", 1),
-    ep_size=kwargs.get("moe_expert_parallel_size", 1),
-    torch_compile_max_bs=max_concurrent_requests,
-    max_running_requests=max_concurrent_requests,
-    attention_backend=kwargs.get("attention_backend"),
-    enable_torch_compile=kwargs.get("enable_torch_compile", False),
-    cuda_graph_max_bs=max_concurrent_requests,
-    disable_cuda_graph=False,
-    disable_cuda_graph_padding=True,
-)
+engine_kwargs = {
+    "model_path": model_dir,
+    "skip_tokenizer_init": True,
+    "trust_remote_code": kwargs.get("trust_remote_code", False),
+    "mem_fraction_static": kwargs.get("mem_fraction_static", 0.8),
+    "disable_overlap_schedule": kwargs.get("disable_overlap_schedule", False),
+    "tp_size": kwargs.get("tensor_parallel_size", 1),
+    "ep_size": kwargs.get("moe_expert_parallel_size", 1),
+    "torch_compile_max_bs": max_concurrent_requests,
+    "max_running_requests": max_concurrent_requests,
+    "attention_backend": kwargs.get("attention_backend"),
+    "enable_torch_compile": kwargs.get("enable_torch_compile", False),
+    "cuda_graph_max_bs": max_concurrent_requests,
+    "disable_cuda_graph": False,
+    "disable_cuda_graph_padding": True,
+}
```

🧰 Tools
🪛 GitHub Actions: Code Quality / code-quality
[error] 47-62: ruff-check failed (hook id: ruff-check). C408 Unnecessary dict() call (rewrite as a literal) at examples/specdec_bench/specdec_bench/models/sglang.py:47:25.
/claude review
Claude Review Summary

Findings: CRITICAL: 0, IMPORTANT: 1, SUGGESTION: 1

Overall risk: Low. This is a purely additive, example-only change; the only meaningful concern is the universal `disable_cuda_graph_padding=True` setting flagged below.
```python
    cuda_graph_max_bs=max_concurrent_requests,
    disable_cuda_graph=False,
    disable_cuda_graph_padding=True,
```
[IMPORTANT Compatibility] disable_cuda_graph_padding=True is now applied to all SGLang paths — non-speculative, EAGLE3, EAGLE, MTP, DRAFT_TARGET, NGRAM — not just DFLASH.
Before this PR, this kwarg wasn't passed, so SGLang's default (False, padding enabled) applied. Disabling padding can force more CUDA-graph recompilations / runs at non-bucketed batch sizes, which may shift latency/throughput numbers for the other algorithms that this benchmark is meant to compare.
Since the PR description states this flag is needed specifically to avoid bucket-padding mismatches during DFLASH replay, suggest gating it on DFLASH (or making it kwargs-overridable) so existing EAGLE/MTP/etc. benchmark results remain comparable to runs from before this change. For example, drop disable_cuda_graph_padding from the shared engine_kwargs dict and only set it inside the if speculative_algorithm == "DFLASH": branch.
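A minimal sketch of that gating, assuming the variable names from the diff above; the helper name build_engine_kwargs is hypothetical, and only a few of the shared keys are repeated here:

```python
def build_engine_kwargs(model_dir, max_concurrent_requests,
                        speculative_algorithm, **kwargs):
    # Shared kwargs: leave SGLang's default padding behavior intact here.
    engine_kwargs = {
        "model_path": model_dir,
        "cuda_graph_max_bs": max_concurrent_requests,
        "disable_cuda_graph": False,
        # ... remaining shared keys as in the patch above ...
    }
    if speculative_algorithm == "DFLASH":
        # Only DFLASH disables padding (to avoid CUDA-graph bucket-padding
        # mismatches during replay); an explicit kwargs override still wins.
        engine_kwargs["disable_cuda_graph_padding"] = kwargs.get(
            "disable_cuda_graph_padding", True
        )
    return engine_kwargs
```

This keeps EAGLE/MTP/NGRAM runs byte-for-byte comparable with pre-change benchmark results while still giving DFLASH the behavior it needs.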
```python
print(
    f"[specdec_bench] DFLASH ignores --draft_length / speculative_num_steps / "
    f"speculative_eagle_topk; effective draft block = "
    f"speculative_num_draft_tokens={engine_kwargs['speculative_num_draft_tokens']}"
)
```
[SUGGESTION] Minor UX nit: this print message tells users that DFLASH ignores --draft_length and instructs them to set speculative_num_draft_tokens, but speculative_num_draft_tokens is not a CLI flag in run.py; it is only reachable via engine_args in the --runtime_params YAML. Consider mentioning that explicitly in the print, or adding a CLI flag; otherwise users are left guessing how to override the default of 8.
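For illustration, the override would then live in the runtime-params file, roughly like this. This is a sketch: the exact --runtime_params YAML schema and the engine_args nesting are assumed from this comment, not verified against run.py.

```yaml
# runtime_params.yaml (hypothetical layout)
engine_args:
  # DFLASH draft block size; the code defaults to 8 when this is omitted.
  speculative_num_draft_tokens: 16
```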
What does this PR do?
Type of change: new feature
Adds DFLASH speculative decoding support to the specdec_bench example for both the SGLang and vLLM backends.

- run.py: register DFLASH in --speculative_algorithm choices.
- models/sglang.py: unify the SGLang engine setup so engine_kwargs is built once and shared between the speculative and non-speculative paths; add a DFLASH branch (default speculative_num_draft_tokens=8, optional speculative_dflash_draft_window_size); set disable_cuda_graph_padding=True and cuda_graph_max_bs=max_concurrent_requests to avoid CUDA-graph bucket-padding mismatches during DFLASH replay; expose mamba_scheduler_strategy passthrough (needed for Qwen3.5).
- models/vllm.py: add a DFLASH branch wiring method="dflash" with speculative_num_draft_tokens (default 8); see the sketch after this list.
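As a rough illustration of that vLLM branch, the wiring might look like the sketch below. The helper name and the exact speculative-config key names are assumptions based on the bullet above, not verified against vLLM's API.

```python
def dflash_speculative_config(draft_model_dir: str, **engine_args) -> dict:
    # Hypothetical helper mirroring the models/vllm.py DFLASH branch
    # described above; key names follow the PR description.
    return {
        "method": "dflash",
        "model": draft_model_dir,  # DFLASH draft model directory
        "speculative_num_draft_tokens": engine_args.get(
            "speculative_num_draft_tokens", 8  # default draft block of 8 tokens
        ),
    }
```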
Usage

```sh
python run.py \
    --model_dir <target_model> \
    --tokenizer <target_model> \
    --draft_model_dir <dflash_draft_model> \
    --mtbench mtbench.jsonl \
    --engine SGLANG \
    --speculative_algorithm DFLASH \
    --concurrency 32 \
    --tp_size 4 --ep_size 4
```

Use --engine VLLM for the vLLM backend.

Testing
Manually validated end-to-end on Kimi-K2.5-NVFP4 + MTBench with both SGLang and vLLM backends.
specdec_bench has no existing test infrastructure, so no automated tests are added (consistent with how other algorithms in this example are handled).

Before your PR is "Ready for review"

- CONTRIBUTING.md: N/A

Additional Information
Requires SGLang and vLLM versions that include DFLASH support.