feat: automatically fall back to VAE tiling when an untiled decode exceeds the backend buffer limit #1621
RapidMark wants to merge 5 commits into
VAE decode can fail on integrated / low-VRAM GPUs because the untiled
compute buffer exceeds the backend's maximum single-buffer allocation
(e.g. Vulkan maxBufferSize), even when total memory is plentiful. sd.cpp
already supports tiling that keeps each compute buffer small, but it had
to be requested up front with --vae-tiling; users hit a hard failure
instead of the working path that was one flag away.
Make --vae-tiling a tristate:
  off  - never tile (fail if the untiled buffer doesn't fit)
  on   - always tile (previous --vae-tiling behavior)
  auto - (default) try untiled; if the compute buffer can't be allocated,
         free it and retry once with tiling
Implemented by appending a `bool auto_tile` to sd_tiling_params_t (kept
at the end of the struct so the C ABI stays backward-compatible) and a
single fallback branch in VAE::decode. Bare `--vae-tiling` with no value
remains backward-compatible (= on). auto_tile round-trips through the
JSON gen-params load/save.
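The tristate described above can be sketched as a small parser. This is only an illustration under stated assumptions: the names `VaeTiling`, `kDefaultVaeTiling`, and `parse_vae_tiling` are hypothetical and not taken from sd.cpp, and an empty string stands in for the bare `--vae-tiling` flag with no value.

```cpp
#include <optional>
#include <string_view>

// Hypothetical names; not the sd.cpp argument parser.
enum class VaeTiling { Off, On, Auto };

// Mode used when the flag is absent from the command line.
constexpr VaeTiling kDefaultVaeTiling = VaeTiling::Auto;

// Parse the value given to --vae-tiling. An empty value models the bare
// flag, which keeps its previous meaning (always tile).
constexpr std::optional<VaeTiling> parse_vae_tiling(std::string_view value) {
    if (value.empty() || value == "on") return VaeTiling::On;
    if (value == "off") return VaeTiling::Off;
    if (value == "auto") return VaeTiling::Auto;
    return std::nullopt;  // unknown value: the caller should report a usage error
}
```

Returning `std::optional` lets the caller distinguish an unrecognized value from a valid mode without reserving a sentinel enum value.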
Validated on an AMD Radeon 8060S iGPU (Flux Krea Q4, 1024x1024, Vulkan):
--vae-tiling off fails at decode (8.5 GB buffer exceeds the device limit),
--vae-tiling auto logs the retry and completes by tiling, --vae-tiling on
tiles from the start.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Wouldn't it be possible to check against the real value, calculated from the graph before the allocation?
…es review)

Reviewer (wbruna) asked why retry-on-failure rather than checking the real buffer size from the graph up front. Good point: ggml can plan the exact compute-buffer size with no allocation.

Add an opt-in probe to GGMLRunner: when set_probe_compute_buffer_fits(true), alloc_compute_buffer measures the planned size via ggml_gallocr_reserve_n_size (no_alloc planning, zero allocation) and, if it exceeds ggml_backend_buft_get_max_size(), returns false BEFORE the real reserve -- so the backend never emits its raw "allocation failed" error on the AUTO success path.

VAE::decode enables the probe only around the untiled _compute in AUTO mode; the reactive output.empty()->tile path stays as the backstop for a genuine runtime OOM (planned size fits the max, but the device is full). get_max_size() is SIZE_MAX on CPU, so this no-ops there.

Validated on an AMD Radeon 8060S iGPU (Krea Q4, 1024x1024): --vae-tiling auto now logs only the INFO "retrying with tiling" + completes (no allocation-failed spew); off still fails; on still tiles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
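The decision logic this commit describes (probe first, reactive retry as a backstop) can be sketched as a pure function. This is a simplified model, not the sd.cpp code: `DecodePath`, `compute_buffer_fits`, and `choose_decode_path` are hypothetical names, the planned size and backend maximum are passed in as plain numbers rather than queried from ggml, and `untiled_run_succeeds` stands in for whether the untiled `_compute` returns a non-empty output.

```cpp
#include <cstdint>

// Hypothetical names; a model of the control flow, not the sd.cpp code.
enum class DecodePath { Untiled, Tiled, Fail };

// Proactive probe: the planned compute buffer must not exceed the
// backend's maximum single-buffer size.
constexpr bool compute_buffer_fits(std::uint64_t planned_bytes,
                                   std::uint64_t max_buffer_bytes) {
    return planned_bytes <= max_buffer_bytes;
}

constexpr DecodePath choose_decode_path(bool force_tiling, bool auto_tile,
                                        std::uint64_t planned_bytes,
                                        std::uint64_t max_buffer_bytes,
                                        bool untiled_run_succeeds) {
    if (force_tiling) return DecodePath::Tiled;  // tiling requested explicitly
    if (!compute_buffer_fits(planned_bytes, max_buffer_bytes)) {
        // Too large for one buffer: tile without ever attempting the allocation.
        return auto_tile ? DecodePath::Tiled : DecodePath::Fail;
    }
    if (untiled_run_succeeds) return DecodePath::Untiled;
    // Backstop: the size fit the limit but the device was full at runtime.
    return auto_tile ? DecodePath::Tiled : DecodePath::Fail;
}
```

Passing the maximum value of the size type as `max_buffer_bytes` models the CPU case mentioned above, where the probe can never reject a decode.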
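Good call, done (pushed just now). Instead of retrying on failure, the AUTO path now measures the planned compute-buffer size up front with `ggml_gallocr_reserve_n_size` and compares it against `ggml_backend_buft_get_max_size()`; if it won't fit, the decode goes straight to tiling. I kept the original retry-on-empty as a backstop for a genuine runtime OOM (planned size fits the max, but the device is actually full). Net effect on the auto path: the backend no longer prints its raw "allocation failed" error, just an INFO line and the tiled decode. Validated on an AMD Radeon 8060S iGPU (Krea Q4, 1024²): `--vae-tiling auto` logs only the INFO "retrying with tiling" line and completes; `off` still fails; `on` still tiles.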
I think having a fallback to VAE tiling is a very welcome addition, but I'm having some small issues with the user experience there. Modifying the syntax of `--vae-tiling` … For example, we could add a … Alternatively, set "auto" tiling as default and add something like a …
…out with --no-vae-tiling-fallback

Addresses review (stduhpf): turning --vae-tiling into a tristate option that takes a value breaks previously-working command lines. Revert that: --vae-tiling stays the original boolean flag (force tiling on).

The auto fallback is now the default (auto_tile defaults true), and since it only tiles when an untiled decode would exceed the backend buffer limit, it is non-breaking and strictly safer for everyone. Add --no-vae-tiling-fallback to disable it (fail instead of tiling) for anyone who wants the old hard-fail behavior.

Validated on an AMD Radeon 8060S iGPU (Krea Q4, 1024^2): default auto-recovers (logs "retrying with tiling", exit 0); --no-vae-tiling-fallback fails (exit 1); --vae-tiling tiles from the start (exit 0).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
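Thanks, agreed: changing `--vae-tiling` into an option that takes a value breaks previously working command lines. It is now back to the original boolean flag, the auto fallback is the default, and `--no-vae-tiling-fallback` opts out.
Validated on an AMD Radeon 8060S iGPU (Krea Q4, 1024²): default auto-recovers (logs "retrying with tiling", exit 0); `--no-vae-tiling-fallback` fails (exit 1); `--vae-tiling` tiles from the start (exit 0).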
…hanges) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The proactive probe in alloc_compute_buffer() returns false on purpose when the untiled compute buffer exceeds the backend's max single-buffer size, so the VAE auto-tiling fallback can take over. Callers logged that deliberate deferral as an ERROR, so a successful tiled decode printed a misleading "alloc compute buffer failed" on every run. Gate the ERROR on a new compute_buffer_deferred_to_tiling flag so it only fires on a genuine allocation failure; the deferral path stays a DEBUG line. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
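The gating described in this commit amounts to a three-way choice of log level. The sketch below is illustrative only: `LogLevel` and `alloc_failure_log_level` are hypothetical names, and the `deferred_to_tiling` parameter mirrors the `compute_buffer_deferred_to_tiling` flag named above without claiming to match how the real callers read it.

```cpp
// Hypothetical names; a model of the gating, not the sd.cpp logging code.
enum class LogLevel { None, Debug, Error };

// Decide how a caller should report the result of alloc_compute_buffer().
constexpr LogLevel alloc_failure_log_level(bool alloc_ok, bool deferred_to_tiling) {
    if (alloc_ok) return LogLevel::None;             // nothing to report
    if (deferred_to_tiling) return LogLevel::Debug;  // deliberate deferral to tiling
    return LogLevel::Error;                          // genuine allocation failure
}
```

Keeping the decision in one function means every caller reports a deliberate deferral the same way, so a successful tiled decode cannot surface as an error from one call site and a debug line from another.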
On memory-constrained backends (integrated GPUs especially), a full-image VAE decode allocates a single compute buffer larger than the backend's maximum single-buffer allocation size, and sd.cpp hard-fails instead of falling back to the tiling it already supports. The user has to know to pass `--vae-tiling` up front; otherwise the run crashes at the very end, after sampling has already completed.

Repro
AMD Radeon 8060S (Strix Halo, RDNA3.5 iGPU, 128 GB unified memory), Vulkan backend, Flux Krea-dev Q4 at 1024×1024, with no tiling flag:
The ~8.5 GB single-shot VAE decode buffer exceeds the iGPU's Vulkan per-buffer limit. The card has ample total memory (it shares 128 GB system RAM) — the failure is the per-buffer ceiling, not capacity. The whole gen is lost after a successful sampling pass.
Change
Add an automatic fallback to tiling, on by default, and keep it non-breaking:
- `--vae-tiling` stays exactly as it was: a boolean flag that forces tiling on.
- By default, the planned untiled compute-buffer size is measured up front with `ggml_gallocr_reserve_n_size` (no-alloc planning, zero allocation) and compared against `ggml_backend_buft_get_max_size()`; if it won't fit, the decode goes straight to tiling. This is non-breaking (a decode that previously fit behaves identically, and one that previously OOM'd now recovers) and strictly safer. On CPU `get_max_size()` is `SIZE_MAX`, so it no-ops there.
- `--no-vae-tiling-fallback` disables the fallback for anyone who wants the old hard-fail behavior.
- As a backstop, if the untiled `_compute` still returns empty at runtime (e.g. the planned size fit the max but the device is genuinely full), it frees the buffer and retries once tiled, so a true OOM is also covered.

Implemented with a `bool auto_tile` appended to the end of `sd_tiling_params_t` (kept at the end so the C ABI stays backward-compatible; default `true`), the proactive probe in `GGMLRunner::alloc_compute_buffer`, and the fallback branch in `VAE::decode`.

Choosing the real graph-planned size (not a hardcoded bytes-per-pixel estimate) keeps it correct across every VAE architecture (SD/SDXL/Flux/Wan/LTX) and backend with no tuning.
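Why the new field goes at the end of the struct can be checked at compile time. The sketch below is an assumption-laden illustration: `tiling_params_v1` and `tiling_params_v2` are hypothetical stand-ins, and apart from `auto_tile` their fields are invented for the example rather than copied from `sd_tiling_params_t`.

```cpp
#include <cstddef>

// Hypothetical "before" layout; only auto_tile below comes from the PR text.
struct tiling_params_v1 {
    bool  enabled;
    int   tile_size_x;
    int   tile_size_y;
    float target_overlap;
};

// Hypothetical "after" layout: the new field is appended last.
struct tiling_params_v2 {
    bool  enabled;
    int   tile_size_x;
    int   tile_size_y;
    float target_overlap;
    bool  auto_tile;  // new field
};

// Every pre-existing field keeps its offset, so code that only reads the
// old fields sees the same layout.
static_assert(offsetof(tiling_params_v2, enabled) == offsetof(tiling_params_v1, enabled),
              "enabled moved");
static_assert(offsetof(tiling_params_v2, tile_size_x) == offsetof(tiling_params_v1, tile_size_x),
              "tile_size_x moved");
static_assert(offsetof(tiling_params_v2, tile_size_y) == offsetof(tiling_params_v1, tile_size_y),
              "tile_size_y moved");
static_assert(offsetof(tiling_params_v2, target_overlap) == offsetof(tiling_params_v1, target_overlap),
              "target_overlap moved");
```

Inserting the field anywhere else would shift the offsets of the fields after it. Appending still changes `sizeof`, so a caller that allocates the struct itself or passes it by value would need to be rebuilt against the new header.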
Validation (AMD Radeon 8060S iGPU, Krea Q4, 1024²)
- default (no flag) → logs `vae: untiled decode buffer exceeded the backend limit; retrying with tiling`, completes, exit 0
- `--no-vae-tiling-fallback` → fails at decode, exit 1 (the old behavior, opt-in)
- `--vae-tiling` → tiles from the start, exit 0

The tiled GPU decode (~6.9 s) is also far faster than the usual workaround of routing the VAE to CPU (~29.5 s) to dodge the OOM, and is visually equivalent at 0.5 tile overlap.
Helps any constrained device, not just iGPUs: an 8 GB discrete card at high resolution hits the same per-buffer wall. Scoped to `decode` (where the failure occurs); `encode` has the same shape and could get the identical treatment later.

Thanks to @wbruna for pushing toward the proactive graph-planned size, and @stduhpf for catching that the original tristate would have broken the `--vae-tiling` syntax (this revision keeps it a plain flag + auto-by-default + opt-out).