perf: async prefetch of next segment's params during compute #1626

Open: fszontagh wants to merge 1 commit into leejet:master from fszontagh:perf/async-prefetch
@fszontagh

Contributor

Summary

When --stream-layers runs a multi-segment plan, each merged segment currently copies its params host-to-device (H2D) before compute and then waits for the copy to finish. The GPU sits idle during the H2D.

This PR overlaps the H2D copy with compute: while segment N's kernel runs on the runtime backend's stream, segment N+1's params are copied to a new pending buffer (ggml_backend_tensor_copy on cudaStreamPerThread). At the next iteration, offload_partial_params recognizes the prefetched signature and adopts the pending buffer in place of a second H2D.
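The adopt-or-copy bookkeeping described above can be sketched in plain C++. This is a minimal illustration, not the PR's code: `ParamStreamer`, `prefetch`, `activate`, and the capacity-based "allocation failure" are hypothetical names and simplifications, and the copies here are synchronous stand-ins for the real async H2D, so the sketch shows only the signature matching and the fallback, not the overlap itself.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Params = std::vector<float>;  // stand-in for a segment's parameter tensors

// Hypothetical model of the streaming state; names are illustrative only.
class ParamStreamer {
public:
    explicit ParamStreamer(std::size_t pending_capacity)
        : pending_capacity_(pending_capacity) {}

    // Stage the next segment's params in the pending buffer.
    // Returns false when the simulated allocation fails; no buffer is kept.
    bool prefetch(const std::string& signature, const Params& host) {
        has_pending_ = false;
        if (host.size() > pending_capacity_) return false;  // simulated alloc failure
        pending_ = host;  // stand-in for the asynchronous H2D copy
        pending_signature_ = signature;
        has_pending_ = true;
        ++copies;
        return true;
    }

    // Make `signature` the active segment: adopt the pending buffer when its
    // signature matches, otherwise fall back to a synchronous copy.
    const Params& activate(const std::string& signature, const Params& host) {
        assert(!signature.empty());
        if (has_pending_ && pending_signature_ == signature) {
            active_ = std::move(pending_);  // adopt: no second copy
            ++adopted;
        } else {
            active_ = host;  // synchronous path
            ++copies;
            ++sync_copies;
        }
        has_pending_ = false;
        return active_;
    }

    int copies = 0;       // total copies issued (prefetched or synchronous)
    int adopted = 0;      // segments served from the pending buffer
    int sync_copies = 0;  // segments served by a synchronous copy

private:
    std::size_t pending_capacity_;
    bool has_pending_ = false;
    std::string pending_signature_;
    Params pending_;
    Params active_;
};
```

In a loop over segments, the caller would call `activate` for segment N, call `prefetch` for segment N+1, and then run segment N's compute. A failed or mismatched prefetch costs nothing beyond the synchronous copy that would have happened anyway.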

Per-segment wallclock drops toward max(H2D, compute) instead of H2D + compute. If the pending allocation fails, the segment falls back to the synchronous path.
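A small cost model makes the max(H2D, compute) claim concrete. It is a sketch under idealized assumptions: one pending buffer, perfect overlap between the copy and compute, and no contention or launch overhead. `SegmentCost` and both functions are hypothetical names, and any per-segment timings fed in are made up for illustration rather than measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-segment costs in milliseconds.
struct SegmentCost {
    double h2d_ms;      // time to copy the segment's params host-to-device
    double compute_ms;  // time to run the segment's kernels
};

// Synchronous path: every segment pays its H2D, then its compute.
double serial_wallclock_ms(const std::vector<SegmentCost>& segs) {
    double total = 0.0;
    for (const SegmentCost& s : segs) {
        assert(s.h2d_ms >= 0.0 && s.compute_ms >= 0.0);
        total += s.h2d_ms + s.compute_ms;
    }
    return total;
}

// Prefetch path: the first H2D is fully exposed. After that, segment i's
// compute runs alongside segment i+1's H2D, so each step costs the larger
// of the two. The last segment has nothing left to prefetch.
double prefetch_wallclock_ms(const std::vector<SegmentCost>& segs) {
    if (segs.empty()) return 0.0;
    double total = segs[0].h2d_ms;
    for (std::size_t i = 0; i < segs.size(); ++i) {
        const double next_h2d = (i + 1 < segs.size()) ? segs[i + 1].h2d_ms : 0.0;
        total += std::max(segs[i].compute_ms, next_h2d);
    }
    return total;
}
```

For a single-segment plan the two functions return the same value, since there is no next segment to prefetch. For many segments the first H2D and the last compute become a small share of the total, and each step approaches max(H2D, compute).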

Related

Continuation of #1576, #1598, #1601, #1611, #1612.

Numbers

RTX 3060 12 GB, --offload-to-cpu --stream-layers --max-vram -1:

| Workload | Before | After |
| --- | --- | --- |
| SDXL bf16 1152x896 batch=2 8 steps | 20.9 s | 18.7 s |
| Z-Image bf16 1024x688 batch=2 9 steps | 82 s | 54.7 s |

SDXL is a 1-segment plan, so prefetch is a no-op; the small win comes from compute_async + synchronize having less host overhead than synchronous compute. Z-Image hits the 9-segment streaming path and gets the full overlap.
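For reference, the table's timings work out to roughly a 1.12x speedup (about 10.5% less wallclock) for SDXL and roughly 1.50x (about 33.3% less) for Z-Image. The helpers below are hypothetical and only reproduce that arithmetic:

```cpp
#include <cassert>

// Ratio of old to new wallclock; values above 1.0 mean the new run is faster.
double speedup(double before_s, double after_s) {
    assert(before_s > 0.0 && after_s > 0.0);
    return before_s / after_s;
}

// Percentage of the original wallclock that was removed.
double reduction_percent(double before_s, double after_s) {
    assert(before_s > 0.0);
    return 100.0 * (before_s - after_s) / before_s;
}
```

Calling `speedup(82.0, 54.7)` and `reduction_percent(82.0, 54.7)` gives the Z-Image figures; `speedup(20.9, 18.7)` and `reduction_percent(20.9, 18.7)` give the SDXL ones.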

Checklist
