
feat: automatically fall back to VAE tiling when an untiled decode exceeds the backend buffer limit #1621

Open
RapidMark wants to merge 5 commits into leejet:master from CloudhandsAI:cloudhands/vae-auto-tiling

Conversation

@RapidMark

@RapidMark RapidMark commented Jun 8, 2026

Contributor

On memory-constrained backends — integrated GPUs especially — a full-image VAE decode allocates a single compute buffer larger than the backend's maximum single-buffer/allocation size, and sd.cpp hard-fails instead of falling back to the tiling it already supports. The user has to know to pass --vae-tiling up front; otherwise the run crashes at the very end, after sampling has already completed.

Repro

AMD Radeon 8060S (Strix Halo, RDNA3.5 iGPU, 128 GB unified memory), Vulkan backend, Flux Krea-dev Q4 at 1024×1024, with no tiling flag:

[INFO ] stable-diffusion.cpp - sampling completed, taking 10.24s
ggml_vulkan: Requested buffer size exceeds device buffer size limit: ErrorOutOfDeviceMemory
[ERROR] ggml_extend.hpp - vae: failed to allocate the compute buffer
[ERROR] vae.hpp - vae decode compute failed
[ERROR] main.cpp - generate failed

The ~8.5 GB single-shot VAE decode buffer exceeds the iGPU's Vulkan per-buffer limit. The card has ample total memory (it shares 128 GB of system RAM); the failure is the per-buffer ceiling, not capacity. The whole generation is lost after a successful sampling pass.

Change

Add an automatic fallback to tiling, on by default, and keep it non-breaking:

  • --vae-tiling stays exactly as it was — a boolean flag that forces tiling on.
  • The auto-fallback is the new default. Before allocating the untiled decode buffer, the planned size is measured from the graph with ggml_gallocr_reserve_n_size (no-alloc planning, zero allocation) and compared against ggml_backend_buft_get_max_size(); if it won't fit, the decode goes straight to tiling. This is non-breaking — a decode that previously fit behaves identically, and one that previously OOM'd now recovers — and strictly safer. On CPU get_max_size() is SIZE_MAX, so it no-ops there.
  • --no-vae-tiling-fallback disables the fallback for anyone who wants the old hard-fail behavior.
  • A reactive backstop remains: if the untiled _compute still returns empty at runtime (e.g. the planned size fit the max but the device is genuinely full), it frees the buffer and retries once tiled — so a true OOM is also covered.
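
The decision order described in the list above can be sketched as a small, self-contained model. This is an illustration under stated assumptions, not the PR's code: `DecodeInputs`, `DecodePath`, and `choose_decode_path` are hypothetical names, and the two size fields merely stand in for the values that `ggml_gallocr_reserve_n_size` and `ggml_backend_buft_get_max_size()` would report.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical model of the decode decision; all names here are illustrative.
enum class DecodePath {
    Untiled,         // untiled decode fits and succeeds
    TiledForced,     // --vae-tiling was passed
    TiledProactive,  // planned size exceeds the backend per-buffer limit
    TiledReactive,   // untiled attempt failed at runtime, retried once tiled
    Failed           // fallback disabled and the untiled decode cannot run
};

struct DecodeInputs {
    bool        force_tiling;      // --vae-tiling
    bool        auto_tile;         // true by default; false with --no-vae-tiling-fallback
    std::size_t planned_size;      // stand-in for the graph-planned compute buffer size
    std::size_t max_buffer_size;   // stand-in for the backend's max single-buffer size
    bool        runtime_alloc_ok;  // whether the real untiled allocation would succeed
};

inline DecodePath choose_decode_path(const DecodeInputs& in) {
    if (in.force_tiling) {
        return DecodePath::TiledForced;
    }
    // Proactive check: compare the planned size with the per-buffer limit.
    if (in.planned_size > in.max_buffer_size) {
        return in.auto_tile ? DecodePath::TiledProactive : DecodePath::Failed;
    }
    if (in.runtime_alloc_ok) {
        return DecodePath::Untiled;
    }
    // Reactive backstop: the size fit the limit, but the device was full.
    return in.auto_tile ? DecodePath::TiledReactive : DecodePath::Failed;
}
```

With `max_buffer_size` set to `SIZE_MAX`, as reported on CPU, the proactive check can never trigger, which matches the "no-ops there" remark.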

Implemented with a bool auto_tile appended to the end of sd_tiling_params_t (kept at the end so the C ABI stays backward-compatible; default true), the proactive probe in GGMLRunner::alloc_compute_buffer, and the fallback branch in VAE::decode.
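
To illustrate why appending the field last matters, here is a minimal sketch with two hypothetical structs. The field names and the 0.5 overlap default are placeholders chosen for illustration and are not copied from `sd_tiling_params_t`; the point is only that existing members keep their offsets when a new member is added at the end.

```cpp
#include <cstddef>

// Hypothetical "before" layout (illustrative field names).
struct tiling_params_before {
    bool  enabled;
    int   tile_size_x;
    int   tile_size_y;
    float target_overlap;
};

// Same layout with the new member appended at the end.
struct tiling_params_after {
    bool  enabled;
    int   tile_size_x;
    int   tile_size_y;
    float target_overlap;
    bool  auto_tile;  // appended last so earlier offsets are untouched
};

// Every pre-existing member sits at the same offset in both layouts.
static_assert(offsetof(tiling_params_before, enabled) == offsetof(tiling_params_after, enabled), "enabled moved");
static_assert(offsetof(tiling_params_before, tile_size_x) == offsetof(tiling_params_after, tile_size_x), "tile_size_x moved");
static_assert(offsetof(tiling_params_before, tile_size_y) == offsetof(tiling_params_after, tile_size_y), "tile_size_y moved");
static_assert(offsetof(tiling_params_before, target_overlap) == offsetof(tiling_params_after, target_overlap), "target_overlap moved");

// Hypothetical default initializer: the fallback is on unless disabled.
inline tiling_params_after default_tiling_params() {
    tiling_params_after p{};
    p.target_overlap = 0.5f;  // placeholder value for this sketch
    p.auto_tile      = true;
    return p;
}
```

Only member offsets are preserved here; `sizeof` can grow, so the sketch makes no claim about callers built against an older header.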

Using the real graph-planned size (rather than a hardcoded bytes-per-pixel estimate) keeps the check correct across every VAE architecture (SD/SDXL/Flux/Wan/LTX) and backend, with no tuning.

Validation (AMD Radeon 8060S iGPU, Krea Q4, 1024²)

  • default (no flag) → logs "vae: untiled decode buffer exceeded the backend limit; retrying with tiling", completes, exit 0
  • --no-vae-tiling-fallback → fails at decode, exit 1 (the old behavior, opt-in)
  • --vae-tiling → tiles from the start, exit 0

The tiled GPU decode (~6.9 s) is also far faster than the usual workaround of routing the VAE to CPU (~29.5 s) to dodge the OOM, and is visually equivalent at 0.5 tile overlap.

Helps any constrained device, not just iGPUs — an 8 GB discrete card at high resolution hits the same per-buffer wall. Scoped to decode (where the failure occurs); encode has the same shape and could get the identical treatment later.


Thanks to @wbruna for pushing toward the proactive graph-planned size, and @stduhpf for catching that the original tristate would have broken the --vae-tiling syntax (this revision keeps it a plain flag + auto-by-default + opt-out).

VAE decode can fail on integrated / low-VRAM GPUs because the untiled
compute buffer exceeds the backend's maximum single-buffer allocation
(e.g. Vulkan maxBufferSize), even when total memory is plentiful. sd.cpp
already supports tiling that keeps each compute buffer small, but it had
to be requested up front with --vae-tiling; users hit a hard failure
instead of the working path that was one flag away.

Make --vae-tiling a tristate:
  off  - never tile (fail if the untiled buffer doesn't fit)
  on   - always tile (previous --vae-tiling behavior)
  auto - (default) try untiled; if the compute buffer can't be allocated,
         free it and retry once with tiling

Implemented by appending a `bool auto_tile` to sd_tiling_params_t (kept
at the end of the struct so the C ABI stays backward-compatible) and a
single fallback branch in VAE::decode. Bare `--vae-tiling` with no value
remains backward-compatible (= on). auto_tile round-trips through the
JSON gen-params load/save.

Validated on an AMD Radeon 8060S iGPU (Flux Krea Q4, 1024x1024, Vulkan):
--vae-tiling off fails at decode (8.5 GB buffer exceeds the device limit),
--vae-tiling auto logs the retry and completes by tiling, --vae-tiling on
tiles from the start.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@wbruna

wbruna commented Jun 8, 2026

Contributor

> I chose reactive (retry on the real allocation failure) over proactive (estimate the buffer size and compare against ggml_backend_buft_get_max_size()) deliberately: a size estimate is VAE-architecture-specific (peak activation differs across SD/SDXL/Flux/Wan/LTX VAEs), so a hardcoded bytes-per-pixel constant would be brittle

Wouldn't it be possible to check against the real value, calculated from the graph before the allocation?

…es review)

Reviewer (wbruna) asked why retry-on-failure rather than checking the real
buffer size from the graph up front. Good point: ggml can plan the exact
compute-buffer size with no allocation.

Add an opt-in probe to GGMLRunner: when set_probe_compute_buffer_fits(true),
alloc_compute_buffer measures the planned size via ggml_gallocr_reserve_n_size
(no_alloc planning, zero allocation) and, if it exceeds
ggml_backend_buft_get_max_size(), returns false BEFORE the real reserve --
so the backend never emits its raw "allocation failed" error on the AUTO
success path. VAE::decode enables the probe only around the untiled _compute
in AUTO mode; the reactive output.empty()->tile path stays as the backstop
for a genuine runtime OOM (planned size fits the max, but the device is full).
get_max_size() is SIZE_MAX on CPU, so this no-ops there.
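
The probe described in this commit message can be modeled roughly as follows. This is a sketch only: `RunnerSketch` is a hypothetical stand-in for `GGMLRunner`, the planned and maximum sizes are passed in as plain numbers instead of being queried from ggml, and the "real reserve" is simulated by a counter.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the runner; no ggml calls are made here.
class RunnerSketch {
public:
    RunnerSketch(std::size_t planned_size, std::size_t max_size)
        : planned_size_(planned_size), max_size_(max_size) {}

    // Opt-in switch, named after the setter mentioned in the commit message.
    void set_probe_compute_buffer_fits(bool on) { probe_ = on; }

    bool alloc_compute_buffer() {
        deferred_to_tiling_ = false;
        if (probe_ && planned_size_ > max_size_) {
            // Return before the real reserve so the backend never logs an error.
            deferred_to_tiling_ = true;
            return false;
        }
        ++real_reserve_calls_;
        // Simulated reserve: fails when the buffer is above the limit.
        return planned_size_ <= max_size_;
    }

    bool deferred_to_tiling() const { return deferred_to_tiling_; }
    int  real_reserve_calls() const { return real_reserve_calls_; }

private:
    std::size_t planned_size_;
    std::size_t max_size_;
    bool        probe_              = false;
    bool        deferred_to_tiling_ = false;
    int         real_reserve_calls_ = 0;
};
```

With the probe enabled and an oversized plan, `alloc_compute_buffer()` returns false without touching the simulated reserve; with `max_size` equal to `SIZE_MAX` the probe never fires.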

Validated on an AMD Radeon 8060S iGPU (Krea Q4, 1024x1024): --vae-tiling auto
now logs only the INFO "retrying with tiling" + completes (no allocation-failed
spew); off still fails; on still tiles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@RapidMark

Contributor Author

Good call — done (pushed just now).

Instead of retrying on failure, the AUTO path now measures the planned compute-buffer size up front with ggml_gallocr_reserve_n_size (no-alloc planning, zero allocation) and compares it to ggml_backend_buft_get_max_size() before allocating; if it won't fit, it goes straight to tiling — no bytes-per-pixel estimate. On CPU get_max_size() is SIZE_MAX, so it no-ops there.

I kept the original retry-on-empty as a backstop for a genuine runtime OOM (planned size fits the max, but the device is actually full). Net effect on the auto path: the backend no longer prints its raw "allocation failed" error — just an INFO line and the tiled decode.

Validated on an AMD Radeon 8060S iGPU (Krea Q4, 1024²): --vae-tiling auto now logs only the INFO "retrying with tiling" and completes; off still fails; on still tiles.

@stduhpf

stduhpf commented Jun 9, 2026

Contributor

I think having a fallback to VAE tiling is a very welcome addition, but I'm having some small issues with the user experience there. Changing the syntax of the --vae-tiling arg from a flag to a tristate option will break previously working commands, and I think we could implement the same feature without breaking anything.

For example we could add a --vae-tiling-auto flag.

Alternatively, set "auto" tiling as default and add something like a --no-auto-vae-tiling to disable it.

…out with --no-vae-tiling-fallback

Addresses review (stduhpf): turning --vae-tiling into a tristate option that
takes a value breaks previously-working command lines. Revert that: --vae-tiling
stays the original boolean flag (force tiling on). The auto fallback is now the
default (auto_tile defaults true), and since it only tiles when an untiled decode
would exceed the backend buffer limit, it is non-breaking and strictly safer for
everyone. Add --no-vae-tiling-fallback to disable it (fail instead of tiling) for
anyone who wants the old hard-fail behavior.
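
The resulting command-line behavior can be sketched with a toy parser. The flag names come from this PR; everything else (`VaeTilingOptions`, `parse_vae_tiling_flags`, scanning a vector of strings) is a hypothetical simplification and not how sd.cpp actually parses arguments.

```cpp
#include <string>
#include <vector>

// Hypothetical option holder for this sketch.
struct VaeTilingOptions {
    bool force_tiling = false;  // set by --vae-tiling
    bool auto_tile    = true;   // cleared by --no-vae-tiling-fallback
};

// Toy parser: both options are plain boolean flags that take no value,
// so command lines that already used --vae-tiling keep working.
inline VaeTilingOptions parse_vae_tiling_flags(const std::vector<std::string>& args) {
    VaeTilingOptions opts;
    for (const std::string& arg : args) {
        if (arg == "--vae-tiling") {
            opts.force_tiling = true;
        } else if (arg == "--no-vae-tiling-fallback") {
            opts.auto_tile = false;
        }
        // Unrelated arguments are ignored in this sketch.
    }
    return opts;
}
```

An empty argument list leaves the fallback enabled, which corresponds to the new default.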

Validated on an AMD Radeon 8060S iGPU (Krea Q4, 1024^2): default auto-recovers
(logs "retrying with tiling", exit 0); --no-vae-tiling-fallback fails (exit 1);
--vae-tiling tiles from the start (exit 0).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@RapidMark

Contributor Author

Thanks — agreed, changing --vae-tiling's syntax wasn't worth the breakage. Pushed a revision (your alternative B):

  • --vae-tiling is back to the original boolean flag (force tiling on).
  • The auto-fallback is now the default — it only tiles when an untiled decode would exceed the backend's max buffer size, so it's non-breaking and strictly safer for existing commands (an untiled decode that previously fit behaves identically; one that previously OOM'd now recovers).
  • --no-vae-tiling-fallback disables it for the old hard-fail behavior.

Validated on an AMD Radeon 8060S iGPU (Krea Q4, 1024²): default auto-recovers (logs "vae: untiled decode buffer exceeded the backend limit; retrying with tiling", exit 0); --no-vae-tiling-fallback fails (exit 1); --vae-tiling tiles from the start (exit 0).

@RapidMark changed the title from "feat: tristate --vae-tiling (off|on|auto) with automatic OOM fallback" to "feat: automatically fall back to VAE tiling when an untiled decode exceeds the backend buffer limit" on Jun 9, 2026
Mark Caldwell and others added 2 commits June 9, 2026 11:44
…hanges)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The proactive probe in alloc_compute_buffer() returns false on purpose when the
untiled compute buffer exceeds the backend's max single-buffer size, so the VAE
auto-tiling fallback can take over. Callers logged that deliberate deferral as
an ERROR, so a successful tiled decode printed a misleading
"alloc compute buffer failed" on every run. Gate the ERROR on a new
compute_buffer_deferred_to_tiling flag so it only fires on a genuine allocation
failure; the deferral path stays a DEBUG line.
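
The log gating described here amounts to picking a severity from the deferral flag. A minimal sketch, assuming a hypothetical `LogLevel` enum and helper (only the flag name `compute_buffer_deferred_to_tiling` comes from the commit message):

```cpp
// Hypothetical severity levels for this sketch.
enum class LogLevel { Debug, Error };

// Called only when alloc_compute_buffer() returned false.
// A deliberate deferral to tiling is not an error, so it stays at DEBUG;
// any other failure is a genuine allocation failure and is reported as ERROR.
inline LogLevel level_for_failed_alloc(bool compute_buffer_deferred_to_tiling) {
    return compute_buffer_deferred_to_tiling ? LogLevel::Debug : LogLevel::Error;
}
```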

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

3 participants