[Feat]: Add dflash in specdec-bench #1432
```diff
@@ -43,44 +43,48 @@ def __init__(
             speculative_algorithm = "LOOKAHEAD"
         elif speculative_algorithm == "NONE":
             speculative_algorithm = None
 
+        engine_kwargs = dict(
+            model_path=model_dir,
+            skip_tokenizer_init=True,
+            trust_remote_code=kwargs.get("trust_remote_code", False),
+            mem_fraction_static=kwargs.get("mem_fraction_static", 0.8),
+            disable_overlap_schedule=kwargs.get("disable_overlap_schedule", False),
+            tp_size=kwargs.get("tensor_parallel_size", 1),
+            ep_size=kwargs.get("moe_expert_parallel_size", 1),
+            torch_compile_max_bs=max_concurrent_requests,
+            max_running_requests=max_concurrent_requests,
+            attention_backend=kwargs.get("attention_backend"),
+            enable_torch_compile=kwargs.get("enable_torch_compile", False),
+            cuda_graph_max_bs=max_concurrent_requests,
+            disable_cuda_graph=False,
+            disable_cuda_graph_padding=True,
+        )
+
```
Comment on lines +47 to +62 (Contributor):

Fix CI blocker: replace the `dict(...)` call with a dict literal. Line 47 triggers Ruff C408 in the pipeline, so this currently fails code-quality checks.

Suggested patch:

```diff
-        engine_kwargs = dict(
-            model_path=model_dir,
-            skip_tokenizer_init=True,
-            trust_remote_code=kwargs.get("trust_remote_code", False),
-            mem_fraction_static=kwargs.get("mem_fraction_static", 0.8),
-            disable_overlap_schedule=kwargs.get("disable_overlap_schedule", False),
-            tp_size=kwargs.get("tensor_parallel_size", 1),
-            ep_size=kwargs.get("moe_expert_parallel_size", 1),
-            torch_compile_max_bs=max_concurrent_requests,
-            max_running_requests=max_concurrent_requests,
-            attention_backend=kwargs.get("attention_backend"),
-            enable_torch_compile=kwargs.get("enable_torch_compile", False),
-            cuda_graph_max_bs=max_concurrent_requests,
-            disable_cuda_graph=False,
-            disable_cuda_graph_padding=True,
-        )
+        engine_kwargs = {
+            "model_path": model_dir,
+            "skip_tokenizer_init": True,
+            "trust_remote_code": kwargs.get("trust_remote_code", False),
+            "mem_fraction_static": kwargs.get("mem_fraction_static", 0.8),
+            "disable_overlap_schedule": kwargs.get("disable_overlap_schedule", False),
+            "tp_size": kwargs.get("tensor_parallel_size", 1),
+            "ep_size": kwargs.get("moe_expert_parallel_size", 1),
+            "torch_compile_max_bs": max_concurrent_requests,
+            "max_running_requests": max_concurrent_requests,
+            "attention_backend": kwargs.get("attention_backend"),
+            "enable_torch_compile": kwargs.get("enable_torch_compile", False),
+            "cuda_graph_max_bs": max_concurrent_requests,
+            "disable_cuda_graph": False,
+            "disable_cuda_graph_padding": True,
+        }
```

GitHub Actions: Code Quality / code-quality
[error] 47-62: ruff-check failed (hook id: ruff-check). C408 Unnecessary `dict` call (rewrite as a literal).
```diff
         if speculative_algorithm is not None:
             # https://github.com/sgl-project/sglang/pull/3582
-            self.model = sgl.Engine(
-                model_path=model_dir,
-                skip_tokenizer_init=True,
-                trust_remote_code=kwargs.get("trust_remote_code", False),
-                mem_fraction_static=0.8,
-                disable_overlap_schedule=kwargs.get("disable_overlap_schedule", False),
-                tp_size=kwargs.get("tensor_parallel_size", 1),
-                ep_size=kwargs.get("moe_expert_parallel_size", 1),
-                speculative_algorithm=speculative_algorithm,
-                speculative_num_steps=kwargs.get("speculative_num_steps", 3),
-                speculative_eagle_topk=kwargs.get("speculative_eagle_topk", 1),
-                speculative_num_draft_tokens=kwargs.get("speculative_num_draft_tokens", 4),
-                speculative_draft_model_path=kwargs.get("draft_model_dir"),
-                torch_compile_max_bs=max_concurrent_requests,
-                max_running_requests=max_concurrent_requests,
-                attention_backend=kwargs.get("attention_backend"),
-                enable_torch_compile=kwargs.get("enable_torch_compile", False),
-                cuda_graph_max_bs=max_concurrent_requests,
-                disable_cuda_graph=False,
-            )
-        else:
-            self.model = sgl.Engine(
-                model_path=model_dir,
-                skip_tokenizer_init=True,
-                trust_remote_code=kwargs.get("trust_remote_code", False),
-                mem_fraction_static=0.8,
-                disable_overlap_schedule=kwargs.get("disable_overlap_schedule", False),
-                tp_size=kwargs.get("tensor_parallel_size", 1),
-                ep_size=kwargs.get("moe_expert_parallel_size", 1),
-                torch_compile_max_bs=max_concurrent_requests,
-                max_running_requests=max_concurrent_requests,
-                attention_backend=kwargs.get("attention_backend"),
-                enable_torch_compile=kwargs.get("enable_torch_compile", False),
-                cuda_graph_max_bs=max_concurrent_requests,
-                disable_cuda_graph=False,
-            )
+            engine_kwargs["speculative_algorithm"] = speculative_algorithm
+            engine_kwargs["speculative_draft_model_path"] = kwargs.get("draft_model_dir")
+            if speculative_algorithm == "DFLASH":
+                engine_kwargs["speculative_num_draft_tokens"] = kwargs.get("speculative_num_draft_tokens", 8)
+                if "speculative_dflash_draft_window_size" in kwargs:
+                    engine_kwargs["speculative_dflash_draft_window_size"] = kwargs[
+                        "speculative_dflash_draft_window_size"
+                    ]
+                print(
+                    f"[specdec_bench] DFLASH ignores --draft_length / speculative_num_steps / "
+                    f"speculative_eagle_topk; effective draft block = "
+                    f"speculative_num_draft_tokens={engine_kwargs['speculative_num_draft_tokens']}"
+                )
```
Comment on lines +73 to +77:

[SUGGESTION] Minor UX nit: this print message tells users that DFLASH ignores --draft_length / speculative_num_steps / speculative_eagle_topk.
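One way to make the message more targeted (a hypothetical sketch, not the reviewer's actual patch; the flag names come from the print above) is to warn only when the caller actually supplied one of the ignored options:

```python
# Hypothetical sketch: only warn when the user explicitly passed an option
# that DFLASH ignores, instead of printing unconditionally on every run.
def dflash_override_warning(kwargs: dict, num_draft_tokens: int) -> str | None:
    ignored = [
        k for k in ("draft_length", "speculative_num_steps", "speculative_eagle_topk")
        if k in kwargs
    ]
    if not ignored:
        return None  # nothing to warn about
    return (
        f"[specdec_bench] DFLASH ignores {', '.join(ignored)}; "
        f"effective draft block = speculative_num_draft_tokens={num_draft_tokens}"
    )

# Example: warns only because speculative_num_steps was explicitly set.
print(dflash_override_warning({"speculative_num_steps": 5}, 8))
```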
```diff
+            else:
+                engine_kwargs["speculative_num_draft_tokens"] = kwargs.get("speculative_num_draft_tokens", 4)
+                engine_kwargs["speculative_num_steps"] = kwargs.get("speculative_num_steps", 3)
+                engine_kwargs["speculative_eagle_topk"] = kwargs.get("speculative_eagle_topk", 1)
+
+        # extra engine arg needed for qwen3.5
+        if "mamba_scheduler_strategy" in kwargs:
+            engine_kwargs["mamba_scheduler_strategy"] = kwargs["mamba_scheduler_strategy"]
+
+        self.model = sgl.Engine(**engine_kwargs)
 
         self.sampling_config = sampling_kwargs
```
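For orientation, with no user overrides the two branches above reduce to the following speculative entries on top of the shared `engine_kwargs` (a worked illustration using the defaults from the diff; the `"EAGLE"` value is just a stand-in for any non-DFLASH algorithm):

```python
# Defaults per the diff above, assuming no overrides in `kwargs`.
# Both branches also set speculative_draft_model_path from draft_model_dir.
dflash_extra = {
    "speculative_algorithm": "DFLASH",
    "speculative_num_draft_tokens": 8,  # steps / eagle_topk are not used
}
non_dflash_extra = {
    "speculative_algorithm": "EAGLE",   # stand-in: EAGLE3, MTP, etc. follow the same path
    "speculative_num_draft_tokens": 4,
    "speculative_num_steps": 3,
    "speculative_eagle_topk": 1,
}
```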
Comment:

[IMPORTANT Compatibility] `disable_cuda_graph_padding=True` is now applied to all SGLang paths (non-speculative, EAGLE3, EAGLE, MTP, DRAFT_TARGET, NGRAM), not just DFLASH.

Before this PR, this kwarg wasn't passed, so SGLang's default (`False`, padding enabled) applied. Disabling padding can force more CUDA-graph recompilations / runs at non-bucketed batch sizes, which may shift latency/throughput numbers for the other algorithms that this benchmark is meant to compare.

Since the PR description states this flag is needed specifically to avoid bucket-padding mismatches during DFLASH replay, suggest gating it on DFLASH (or making it kwargs-overridable) so existing EAGLE/MTP/etc. benchmark results remain comparable to runs from before this change. For example, drop `disable_cuda_graph_padding` from the shared `engine_kwargs` dict and only set it inside the `if speculative_algorithm == "DFLASH":` branch.
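A minimal sketch of that gating, assuming the `__init__` context from the diff (`engine_kwargs`, `speculative_algorithm`, and `kwargs` as above, with `disable_cuda_graph_padding` removed from the shared dict):

```python
# Sketch of the reviewer's suggestion: keep SGLang's default padding behaviour
# for every non-DFLASH path, and make the flag overridable via kwargs.
def apply_dflash_padding_flag(
    engine_kwargs: dict, speculative_algorithm: str | None, kwargs: dict
) -> None:
    if speculative_algorithm == "DFLASH":
        # Padding off avoids bucket-padding mismatches during DFLASH replay;
        # still overridable via kwargs for experimentation.
        engine_kwargs["disable_cuda_graph_padding"] = kwargs.get(
            "disable_cuda_graph_padding", True
        )
```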