[Feat]: Add dflash in specdec-bench #1432

Open

h-guo18 wants to merge 6 commits into main from haoguo/dflash-spec-dec

Conversation

Contributor

@h-guo18 h-guo18 commented May 12, 2026

What does this PR do?

Type of change: new feature

Adds DFLASH speculative decoding support to the specdec_bench example for both the SGLang and vLLM backends.

  • run.py: register DFLASH in --speculative_algorithm choices.
  • models/sglang.py: unify the SGLang engine setup so engine_kwargs is built once and shared between the speculative and non-speculative paths; add a DFLASH branch (default speculative_num_draft_tokens=8, optional speculative_dflash_draft_window_size); set disable_cuda_graph_padding=True and cuda_graph_max_bs=max_concurrent_requests to avoid CUDA-graph bucket-padding mismatches during DFLASH replay; expose mamba_scheduler_strategy passthrough (needed for Qwen3.5).
  • models/vllm.py: add a DFLASH branch wiring method="dflash" with speculative_num_draft_tokens (default 8).
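
For orientation, here is a rough sketch of what the SGLang DFLASH branch amounts to; it is illustrative rather than the exact diff, the kwarg names speculative_num_draft_tokens and speculative_dflash_draft_window_size come from the bullets above, and the helper name and the speculative_draft_model_path wiring are assumptions:

def build_dflash_engine_kwargs(engine_kwargs, draft_model_dir, kwargs):
    # Hypothetical helper mirroring the DFLASH branch described above; not the exact diff.
    engine_kwargs["speculative_algorithm"] = "DFLASH"
    # The draft-model kwarg name below is an assumption, not spelled out above.
    engine_kwargs["speculative_draft_model_path"] = draft_model_dir
    # Default draft block of 8 tokens, per the description above.
    engine_kwargs["speculative_num_draft_tokens"] = kwargs.get("speculative_num_draft_tokens", 8)
    # Optional DFLASH draft window size, forwarded only when provided.
    window = kwargs.get("speculative_dflash_draft_window_size")
    if window is not None:
        engine_kwargs["speculative_dflash_draft_window_size"] = window
    return engine_kwargs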

Usage

python run.py \
    --model_dir <target_model> \
    --tokenizer <target_model> \
    --draft_model_dir <dflash_draft_model> \
    --mtbench mtbench.jsonl \
    --engine SGLANG \
    --speculative_algorithm DFLASH \
    --concurrency 32 \
    --tp_size 4 --ep_size 4

Use --engine VLLM for the vLLM backend.

Testing

Manually validated end-to-end on Kimi-K2.5-NVFP4 + MTBench with both SGLang and vLLM backends. specdec_bench has no existing test infrastructure, so no automated tests are added (consistent with how other algorithms in this example are handled).

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅ (purely additive — new algorithm choice; existing algorithms unchanged)
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ❌ (specdec_bench is example-only with no test harness; matches other algorithms)
  • Did you update Changelog?: N/A (example-only change)
  • Did you get Claude approval on this PR?: ❌

Additional Information

Requires SGLang and vLLM versions that include DFLASH support.

Summary by CodeRabbit

  • New Features
    • Added DFLASH as a supported speculative decoding algorithm option for benchmark testing.
    • DFLASH algorithm now available across inference engines with configurable draft token and window size parameters for optimized performance evaluation.

Review Change Stack

Signed-off-by: Hao Guo <haoguo@poly0172.polyphe.clusters.nvidia.com>

copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

coderabbitai Bot commented May 12, 2026

📝 Walkthrough

Walkthrough

This PR extends the speculative decoding benchmark to support the DFLASH speculative algorithm. CLI validation is updated to accept "DFLASH", SGLang model initialization is refactored to apply algorithm-specific parameters, and vLLM integration adds DFLASH configuration using draft model directories and speculative token counts.

Changes

DFLASH Speculative Algorithm Support

  • CLI argument validation for DFLASH (examples/specdec_bench/run.py): The --speculative_algorithm argument choices list is extended to include "DFLASH".
  • SGLang engine initialization with DFLASH support (examples/specdec_bench/specdec_bench/models/sglang.py): Engine kwargs are consolidated into a unified dictionary to avoid duplication. DFLASH-specific handling uses speculative_num_draft_tokens and optional speculative_dflash_draft_window_size, with algorithm name normalization and conditional parameter injection for other algorithms. mamba_scheduler_strategy is forwarded when provided.
  • vLLM speculative config for DFLASH (examples/specdec_bench/specdec_bench/models/vllm.py): A new DFLASH branch in speculative algorithm handling configures the specdec object with method="dflash", draft model directory, and num_speculative_tokens sourced from speculative_num_draft_tokens.
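
A rough sketch of that vLLM wiring (illustrative only; the helper below is an assumption, the speculative_config keys follow vLLM's convention, and the "dflash" method string and default of 8 come from this PR):

from vllm import LLM

def build_dflash_llm(model_dir, draft_model_dir, num_draft_tokens=8):
    # Hypothetical helper; not the code in this PR.
    speculative_config = {
        "method": "dflash",
        "model": draft_model_dir,
        "num_speculative_tokens": num_draft_tokens,
    }
    return LLM(model=model_dir, speculative_config=speculative_config)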

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title clearly and specifically describes the main feature addition: adding DFLASH support to the specdec-bench tool.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Security Anti-Patterns (✅ Passed): No security anti-patterns found; all SECURITY.md patterns absent (no torch.load/numpy.load misuse, hardcoded trust_remote_code, eval/exec, nosec comments, or unlicensed deps).



codecov Bot commented May 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.95%. Comparing base (d30ebbd) to head (61c7e79).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1432   +/-   ##
=======================================
  Coverage   76.95%   76.95%           
=======================================
  Files         478      478           
  Lines       51648    51648           
=======================================
+ Hits        39744    39747    +3     
+ Misses      11904    11901    -3     
Flag Coverage Δ
unit 52.72% <ø> (-0.02%) ⬇️


Hao Guo added 4 commits May 11, 2026 18:06
dev
Signed-off-by: Hao Guo <haoguo@poly0172.polyphe.clusters.nvidia.com>
dev
Signed-off-by: Hao Guo <haoguo@poly0172.polyphe.clusters.nvidia.com>
Signed-off-by: Hao Guo <haoguo@poly0172.polyphe.clusters.nvidia.com>
Signed-off-by: Hao Guo <haoguo@poly0172.polyphe.clusters.nvidia.com>
@h-guo18 h-guo18 self-assigned this May 12, 2026
Signed-off-by: Hao Guo <haoguo@poly0172.polyphe.clusters.nvidia.com>
@h-guo18 h-guo18 marked this pull request as ready for review May 12, 2026 07:31
@h-guo18 h-guo18 requested a review from a team as a code owner May 12, 2026 07:31
@h-guo18 h-guo18 requested review from ChenhanYu and benchislett May 12, 2026 07:31
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/specdec_bench/specdec_bench/models/sglang.py`:
- Around line 47-62: The engine_kwargs construction uses dict(...) which
triggers Ruff C408; replace the dict(...) call with a dict literal for the
variable engine_kwargs in the module (the block that sets model_path,
skip_tokenizer_init, trust_remote_code, mem_fraction_static,
disable_overlap_schedule, tp_size, ep_size, torch_compile_max_bs,
max_running_requests, attention_backend, enable_torch_compile,
cuda_graph_max_bs, disable_cuda_graph, disable_cuda_graph_padding) so the same
key/value pairs are expressed using a { ... } literal instead of dict(...).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c5a40daa-3fcc-4d95-b922-448bdfebbd28

📥 Commits

Reviewing files that changed from the base of the PR and between d30ebbd and 61c7e79.

📒 Files selected for processing (3)
  • examples/specdec_bench/run.py
  • examples/specdec_bench/specdec_bench/models/sglang.py
  • examples/specdec_bench/specdec_bench/models/vllm.py

Comment on lines +47 to +62
engine_kwargs = dict(
    model_path=model_dir,
    skip_tokenizer_init=True,
    trust_remote_code=kwargs.get("trust_remote_code", False),
    mem_fraction_static=kwargs.get("mem_fraction_static", 0.8),
    disable_overlap_schedule=kwargs.get("disable_overlap_schedule", False),
    tp_size=kwargs.get("tensor_parallel_size", 1),
    ep_size=kwargs.get("moe_expert_parallel_size", 1),
    torch_compile_max_bs=max_concurrent_requests,
    max_running_requests=max_concurrent_requests,
    attention_backend=kwargs.get("attention_backend"),
    enable_torch_compile=kwargs.get("enable_torch_compile", False),
    cuda_graph_max_bs=max_concurrent_requests,
    disable_cuda_graph=False,
    disable_cuda_graph_padding=True,
)
Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix CI blocker: replace dict(...) with a literal.

Line 47 triggers Ruff C408 in the pipeline, so this currently fails code-quality checks.

Suggested patch
-        engine_kwargs = dict(
-            model_path=model_dir,
-            skip_tokenizer_init=True,
-            trust_remote_code=kwargs.get("trust_remote_code", False),
-            mem_fraction_static=kwargs.get("mem_fraction_static", 0.8),
-            disable_overlap_schedule=kwargs.get("disable_overlap_schedule", False),
-            tp_size=kwargs.get("tensor_parallel_size", 1),
-            ep_size=kwargs.get("moe_expert_parallel_size", 1),
-            torch_compile_max_bs=max_concurrent_requests,
-            max_running_requests=max_concurrent_requests,
-            attention_backend=kwargs.get("attention_backend"),
-            enable_torch_compile=kwargs.get("enable_torch_compile", False),
-            cuda_graph_max_bs=max_concurrent_requests,
-            disable_cuda_graph=False,
-            disable_cuda_graph_padding=True,
-        )
+        engine_kwargs = {
+            "model_path": model_dir,
+            "skip_tokenizer_init": True,
+            "trust_remote_code": kwargs.get("trust_remote_code", False),
+            "mem_fraction_static": kwargs.get("mem_fraction_static", 0.8),
+            "disable_overlap_schedule": kwargs.get("disable_overlap_schedule", False),
+            "tp_size": kwargs.get("tensor_parallel_size", 1),
+            "ep_size": kwargs.get("moe_expert_parallel_size", 1),
+            "torch_compile_max_bs": max_concurrent_requests,
+            "max_running_requests": max_concurrent_requests,
+            "attention_backend": kwargs.get("attention_backend"),
+            "enable_torch_compile": kwargs.get("enable_torch_compile", False),
+            "cuda_graph_max_bs": max_concurrent_requests,
+            "disable_cuda_graph": False,
+            "disable_cuda_graph_padding": True,
+        }
🧰 Tools
🪛 GitHub Actions: Code Quality / code-quality

[error] 47-62: ruff-check failed (hook id: ruff-check). C408 Unnecessary dict() call (rewrite as a literal) at examples/specdec_bench/specdec_bench/models/sglang.py:47:25.


@ChenhanYu
Collaborator

/claude review


claude Bot commented May 12, 2026

Claude Review Summary

Findings: CRITICAL: 0, IMPORTANT: 1, SUGGESTION: 1

Highlights

  • [IMPORTANT Compatibility] disable_cuda_graph_padding=True was lifted into the shared engine_kwargs dict, so it now applies to every SGLang path (non-speculative, EAGLE3, EAGLE, MTP, DRAFT_TARGET, NGRAM), not just DFLASH. Previously the kwarg wasn't passed and SGLang's default (padding enabled) was used. Recommend gating it on DFLASH so prior benchmark numbers for the other algorithms remain comparable.
  • [SUGGESTION] The DFLASH info-print refers to speculative_num_draft_tokens, which isn't a CLI argument — only engine_args in the runtime YAML can set it. Worth clarifying.

Overall risk

Low. This is a purely additive example-only change; the only meaningful concern is the universal disable_cuda_graph_padding flip, which is a benchmarking-fidelity issue rather than a correctness one. DFLASH wiring on both backends looks consistent with the existing patterns for EAGLE3/MTP.

Comment on lines +59 to +61
cuda_graph_max_bs=max_concurrent_requests,
disable_cuda_graph=False,
disable_cuda_graph_padding=True,

[IMPORTANT Compatibility] disable_cuda_graph_padding=True is now applied to all SGLang paths — non-speculative, EAGLE3, EAGLE, MTP, DRAFT_TARGET, NGRAM — not just DFLASH.

Before this PR, this kwarg wasn't passed, so SGLang's default (False, padding enabled) applied. Disabling padding can force more CUDA-graph recompilations / runs at non-bucketed batch sizes, which may shift latency/throughput numbers for the other algorithms that this benchmark is meant to compare.

Since the PR description states this flag is needed specifically to avoid bucket-padding mismatches during DFLASH replay, suggest gating it on DFLASH (or making it kwargs-overridable) so existing EAGLE/MTP/etc. benchmark results remain comparable to runs from before this change. For example, drop disable_cuda_graph_padding from the shared engine_kwargs dict and only set it inside the if speculative_algorithm == "DFLASH": branch.
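
A minimal sketch of that gating (illustrative only, assuming the engine_kwargs dict and speculative_algorithm variable from the quoted diff):

def apply_dflash_cuda_graph_override(engine_kwargs, speculative_algorithm):
    # Illustrative: keep SGLang's default (padding enabled) for every other
    # algorithm so earlier benchmark numbers stay comparable.
    if speculative_algorithm == "DFLASH":
        # Only DFLASH needs padding disabled to avoid CUDA-graph
        # bucket-padding mismatches during replay.
        engine_kwargs["disable_cuda_graph_padding"] = True
    return engine_kwargs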

Comment on lines +73 to +77
print(
    f"[specdec_bench] DFLASH ignores --draft_length / speculative_num_steps / "
    f"speculative_eagle_topk; effective draft block = "
    f"speculative_num_draft_tokens={engine_kwargs['speculative_num_draft_tokens']}"
)

[SUGGESTION] Minor UX nit: this print message tells users that DFLASH ignores --draft_length and instructs them to set speculative_num_draft_tokens, but speculative_num_draft_tokens is not a CLI flag in run.py — it's only reachable via engine_args in the --runtime_params YAML. Consider mentioning that explicitly in the print, or adding a CLI flag, otherwise users will be left guessing how to override the default of 8.
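
One hypothetical way to surface the knob (not part of this PR) would be a dedicated flag in run.py, for example:

import argparse

parser = argparse.ArgumentParser()
# Hypothetical flag, not in run.py today; would be forwarded to the engine
# as speculative_num_draft_tokens.
parser.add_argument(
    "--speculative_num_draft_tokens",
    type=int,
    default=8,
    help="DFLASH draft block size (speculative_num_draft_tokens).",
)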
