[Feat]: Add dflash in specdec-bench #1432

Open

h-guo18 wants to merge 6 commits into main from haoguo/dflash-spec-dec

Conversation

Contributor

@h-guo18 h-guo18 commented May 12, 2026

What does this PR do?

Type of change: new feature

Adds DFLASH speculative decoding support to the specdec_bench example for both the SGLang and vLLM backends.

  • run.py: register DFLASH in --speculative_algorithm choices.
  • models/sglang.py: unify the SGLang engine setup so engine_kwargs is built once and shared between the speculative and non-speculative paths; add a DFLASH branch (default speculative_num_draft_tokens=8, optional speculative_dflash_draft_window_size); set disable_cuda_graph_padding=True and cuda_graph_max_bs=max_concurrent_requests to avoid CUDA-graph bucket-padding mismatches during DFLASH replay; expose mamba_scheduler_strategy passthrough (needed for Qwen3.5).
  • models/vllm.py: add a DFLASH branch wiring method="dflash" with speculative_num_draft_tokens (default 8).
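
For orientation, here is a rough sketch of what the SGLang DFLASH branch amounts to; it is illustrative rather than the exact diff, the kwarg names speculative_num_draft_tokens and speculative_dflash_draft_window_size come from the bullets above, and the helper name and the speculative_draft_model_path wiring are assumptions:

def build_dflash_engine_kwargs(engine_kwargs, draft_model_dir, kwargs):
    # Hypothetical helper mirroring the DFLASH branch described above; not the exact diff.
    engine_kwargs["speculative_algorithm"] = "DFLASH"
    # The draft-model kwarg name below is an assumption, not spelled out above.
    engine_kwargs["speculative_draft_model_path"] = draft_model_dir
    # Default draft block of 8 tokens, per the description above.
    engine_kwargs["speculative_num_draft_tokens"] = kwargs.get("speculative_num_draft_tokens", 8)
    # Optional DFLASH draft window size, forwarded only when provided.
    window = kwargs.get("speculative_dflash_draft_window_size")
    if window is not None:
        engine_kwargs["speculative_dflash_draft_window_size"] = window
    return engine_kwargs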

Usage

python run.py \
    --model_dir <target_model> \
    --tokenizer <target_model> \
    --draft_model_dir <dflash_draft_model> \
    --mtbench mtbench.jsonl \
    --engine SGLANG \
    --speculative_algorithm DFLASH \
    --concurrency 32 \
    --tp_size 4 --ep_size 4

Use --engine VLLM for the vLLM backend.

Testing

Manually validated end-to-end on Kimi-K2.5-NVFP4 + MTBench with both SGLang and vLLM backends. specdec_bench has no existing test infrastructure, so no automated tests are added (consistent with how other algorithms in this example are handled).

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅ (purely additive — new algorithm choice; existing algorithms unchanged)
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ❌ (specdec_bench is example-only with no test harness; matches other algorithms)
  • Did you update Changelog?: N/A (example-only change)
  • Did you get Claude approval on this PR?: ❌

Additional Information

Requires SGLang and vLLM versions that include DFLASH support.

Summary by CodeRabbit

  • New Features
    • Added DFLASH as a supported speculative decoding algorithm option for benchmark testing.
    • DFLASH algorithm now available across inference engines with configurable draft token and window size parameters for optimized performance evaluation.

Review Change Stack

Signed-off-by: Hao Guo <haoguo@poly0172.polyphe.clusters.nvidia.com>

copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

coderabbitai Bot commented May 12, 2026

📝 Walkthrough

Walkthrough

This PR extends the speculative decoding benchmark to support the DFLASH speculative algorithm. CLI validation is updated to accept "DFLASH", SGLang model initialization is refactored to apply algorithm-specific parameters, and vLLM integration adds DFLASH configuration using draft model directories and speculative token counts.

Changes

DFLASH Speculative Algorithm Support

  • CLI argument validation for DFLASH (examples/specdec_bench/run.py): The --speculative_algorithm argument choices list is extended to include "DFLASH".
  • SGLang engine initialization with DFLASH support (examples/specdec_bench/specdec_bench/models/sglang.py): Engine kwargs are consolidated into a unified dictionary to avoid duplication. DFLASH-specific handling uses speculative_num_draft_tokens and optional speculative_dflash_draft_window_size, with algorithm name normalization and conditional parameter injection for other algorithms. mamba_scheduler_strategy is forwarded when provided.
  • vLLM speculative config for DFLASH (examples/specdec_bench/specdec_bench/models/vllm.py): A new DFLASH branch in speculative algorithm handling configures the specdec object with method="dflash", draft model directory, and num_speculative_tokens sourced from speculative_num_draft_tokens.
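
A rough sketch of that vLLM wiring (illustrative only; the helper below is an assumption, the speculative_config keys follow vLLM's convention, and the "dflash" method string and default of 8 come from this PR):

from vllm import LLM

def build_dflash_llm(model_dir, draft_model_dir, num_draft_tokens=8):
    # Hypothetical helper; not the code in this PR.
    speculative_config = {
        "method": "dflash",
        "model": draft_model_dir,
        "num_speculative_tokens": num_draft_tokens,
    }
    return LLM(model=model_dir, speculative_config=speculative_config)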

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title clearly and specifically describes the main feature addition: adding DFLASH support to the specdec-bench tool.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Security Anti-Patterns (✅ Passed): No security anti-patterns found; all SECURITY.md patterns absent (no torch.load/numpy.load misuse, hardcoded trust_remote_code, eval/exec, nosec comments, or unlicensed deps).



codecov Bot commented May 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.95%. Comparing base (d30ebbd) to head (61c7e79).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1432   +/-   ##
=======================================
  Coverage   76.95%   76.95%           
=======================================
  Files         478      478           
  Lines       51648    51648           
=======================================
+ Hits        39744    39747    +3     
+ Misses      11904    11901    -3     
Flag Coverage Δ
unit 52.72% <ø> (-0.02%) ⬇️


Hao Guo added 4 commits May 11, 2026 18:06
dev
Signed-off-by: Hao Guo <haoguo@poly0172.polyphe.clusters.nvidia.com>
dev
Signed-off-by: Hao Guo <haoguo@poly0172.polyphe.clusters.nvidia.com>
Signed-off-by: Hao Guo <haoguo@poly0172.polyphe.clusters.nvidia.com>
Signed-off-by: Hao Guo <haoguo@poly0172.polyphe.clusters.nvidia.com>
@h-guo18 h-guo18 self-assigned this May 12, 2026
Signed-off-by: Hao Guo <haoguo@poly0172.polyphe.clusters.nvidia.com>
@h-guo18 h-guo18 marked this pull request as ready for review May 12, 2026 07:31
@h-guo18 h-guo18 requested a review from a team as a code owner May 12, 2026 07:31
@h-guo18 h-guo18 requested review from ChenhanYu and benchislett May 12, 2026 07:31
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/specdec_bench/specdec_bench/models/sglang.py`:
- Around line 47-62: The engine_kwargs construction uses dict(...) which
triggers Ruff C408; replace the dict(...) call with a dict literal for the
variable engine_kwargs in the module (the block that sets model_path,
skip_tokenizer_init, trust_remote_code, mem_fraction_static,
disable_overlap_schedule, tp_size, ep_size, torch_compile_max_bs,
max_running_requests, attention_backend, enable_torch_compile,
cuda_graph_max_bs, disable_cuda_graph, disable_cuda_graph_padding) so the same
key/value pairs are expressed using a { ... } literal instead of dict(...).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c5a40daa-3fcc-4d95-b922-448bdfebbd28

📥 Commits

Reviewing files that changed from the base of the PR and between d30ebbd and 61c7e79.

📒 Files selected for processing (3)
  • examples/specdec_bench/run.py
  • examples/specdec_bench/specdec_bench/models/sglang.py
  • examples/specdec_bench/specdec_bench/models/vllm.py

Comment on lines +47 to +62
engine_kwargs = dict(
    model_path=model_dir,
    skip_tokenizer_init=True,
    trust_remote_code=kwargs.get("trust_remote_code", False),
    mem_fraction_static=kwargs.get("mem_fraction_static", 0.8),
    disable_overlap_schedule=kwargs.get("disable_overlap_schedule", False),
    tp_size=kwargs.get("tensor_parallel_size", 1),
    ep_size=kwargs.get("moe_expert_parallel_size", 1),
    torch_compile_max_bs=max_concurrent_requests,
    max_running_requests=max_concurrent_requests,
    attention_backend=kwargs.get("attention_backend"),
    enable_torch_compile=kwargs.get("enable_torch_compile", False),
    cuda_graph_max_bs=max_concurrent_requests,
    disable_cuda_graph=False,
    disable_cuda_graph_padding=True,
)
Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix CI blocker: replace dict(...) with a literal.

Line 47 triggers Ruff C408 in the pipeline, so this currently fails code-quality checks.

Suggested patch
-        engine_kwargs = dict(
-            model_path=model_dir,
-            skip_tokenizer_init=True,
-            trust_remote_code=kwargs.get("trust_remote_code", False),
-            mem_fraction_static=kwargs.get("mem_fraction_static", 0.8),
-            disable_overlap_schedule=kwargs.get("disable_overlap_schedule", False),
-            tp_size=kwargs.get("tensor_parallel_size", 1),
-            ep_size=kwargs.get("moe_expert_parallel_size", 1),
-            torch_compile_max_bs=max_concurrent_requests,
-            max_running_requests=max_concurrent_requests,
-            attention_backend=kwargs.get("attention_backend"),
-            enable_torch_compile=kwargs.get("enable_torch_compile", False),
-            cuda_graph_max_bs=max_concurrent_requests,
-            disable_cuda_graph=False,
-            disable_cuda_graph_padding=True,
-        )
+        engine_kwargs = {
+            "model_path": model_dir,
+            "skip_tokenizer_init": True,
+            "trust_remote_code": kwargs.get("trust_remote_code", False),
+            "mem_fraction_static": kwargs.get("mem_fraction_static", 0.8),
+            "disable_overlap_schedule": kwargs.get("disable_overlap_schedule", False),
+            "tp_size": kwargs.get("tensor_parallel_size", 1),
+            "ep_size": kwargs.get("moe_expert_parallel_size", 1),
+            "torch_compile_max_bs": max_concurrent_requests,
+            "max_running_requests": max_concurrent_requests,
+            "attention_backend": kwargs.get("attention_backend"),
+            "enable_torch_compile": kwargs.get("enable_torch_compile", False),
+            "cuda_graph_max_bs": max_concurrent_requests,
+            "disable_cuda_graph": False,
+            "disable_cuda_graph_padding": True,
+        }
🧰 Tools
🪛 GitHub Actions: Code Quality / code-quality

[error] 47-62: ruff-check failed (hook id: ruff-check). C408 Unnecessary dict() call (rewrite as a literal) at examples/specdec_bench/specdec_bench/models/sglang.py:47:25.


@ChenhanYu
Collaborator

/claude review


claude Bot commented May 12, 2026

Claude Review Summary

Findings: CRITICAL: 0, IMPORTANT: 1, SUGGESTION: 1

Highlights

  • [IMPORTANT Compatibility] disable_cuda_graph_padding=True was lifted into the shared engine_kwargs dict, so it now applies to every SGLang path (non-speculative, EAGLE3, EAGLE, MTP, DRAFT_TARGET, NGRAM), not just DFLASH. Previously the kwarg wasn't passed and SGLang's default (padding enabled) was used. Recommend gating it on DFLASH so prior benchmark numbers for the other algorithms remain comparable.
  • [SUGGESTION] The DFLASH info-print refers to speculative_num_draft_tokens, which isn't a CLI argument — only engine_args in the runtime YAML can set it. Worth clarifying.

Overall risk

Low. This is a purely additive example-only change; the only meaningful concern is the universal disable_cuda_graph_padding flip, which is a benchmarking-fidelity issue rather than a correctness one. DFLASH wiring on both backends looks consistent with the existing patterns for EAGLE3/MTP.

Comment on lines +59 to +61
cuda_graph_max_bs=max_concurrent_requests,
disable_cuda_graph=False,
disable_cuda_graph_padding=True,

[IMPORTANT Compatibility] disable_cuda_graph_padding=True is now applied to all SGLang paths — non-speculative, EAGLE3, EAGLE, MTP, DRAFT_TARGET, NGRAM — not just DFLASH.

Before this PR, this kwarg wasn't passed, so SGLang's default (False, padding enabled) applied. Disabling padding can force more CUDA-graph recompilations / runs at non-bucketed batch sizes, which may shift latency/throughput numbers for the other algorithms that this benchmark is meant to compare.

Since the PR description states this flag is needed specifically to avoid bucket-padding mismatches during DFLASH replay, suggest gating it on DFLASH (or making it kwargs-overridable) so existing EAGLE/MTP/etc. benchmark results remain comparable to runs from before this change. For example, drop disable_cuda_graph_padding from the shared engine_kwargs dict and only set it inside the if speculative_algorithm == "DFLASH": branch.
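
A minimal sketch of that gating (illustrative only, assuming the engine_kwargs dict and speculative_algorithm variable from the quoted diff):

def apply_dflash_cuda_graph_override(engine_kwargs, speculative_algorithm):
    # Illustrative: keep SGLang's default (padding enabled) for every other
    # algorithm so earlier benchmark numbers stay comparable.
    if speculative_algorithm == "DFLASH":
        # Only DFLASH needs padding disabled to avoid CUDA-graph
        # bucket-padding mismatches during replay.
        engine_kwargs["disable_cuda_graph_padding"] = True
    return engine_kwargs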

Comment on lines +73 to +77
print(
    f"[specdec_bench] DFLASH ignores --draft_length / speculative_num_steps / "
    f"speculative_eagle_topk; effective draft block = "
    f"speculative_num_draft_tokens={engine_kwargs['speculative_num_draft_tokens']}"
)

[SUGGESTION] Minor UX nit: this print message tells users that DFLASH ignores --draft_length and instructs them to set speculative_num_draft_tokens, but speculative_num_draft_tokens is not a CLI flag in run.py — it's only reachable via engine_args in the --runtime_params YAML. Consider mentioning that explicitly in the print, or adding a CLI flag, otherwise users will be left guessing how to override the default of 8.
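
One hypothetical way to surface the knob (not part of this PR) would be a dedicated flag in run.py, for example:

import argparse

parser = argparse.ArgumentParser()
# Hypothetical flag, not in run.py today; would be forwarded to the engine
# as speculative_num_draft_tokens.
parser.add_argument(
    "--speculative_num_draft_tokens",
    type=int,
    default=8,
    help="DFLASH draft block size (speculative_num_draft_tokens).",
)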
