[Bug] VAE not decoding correctly with Vulkan #1471

@kfl02

Description

Git commit

$ git rev-parse HEAD
3d6064b

Operating System & Version

Ubuntu 26.04

GGML backends

Vulkan

Command-line arguments used

sd-cli -m models/checkpoints/v1-5-pruned-emaonly.safetensors -i test.png --strength 0 -v

Steps to reproduce

I compiled sd-cli, ran it, and was surprised by the broken output.
This issue is probably the same as #1455, or at least related.

I created this input image, which might help visualize the error.

[Image: input test image]
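
For anyone who wants to reproduce this without the attachment, here is a minimal sketch that writes a comparable test image using only the Python standard library. It is a hypothetical stand-in, not the image attached above: the white-on-black colors, the centered placement, the helper names and the `circle_test.png` file name are all assumptions. Only the 512x512 canvas and the 128 px circle diameter come from this report.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_chunk(tag: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, tag, data, CRC over tag + data."""
    body = tag + data
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + body + struct.pack(">I", crc)

def circle_png(size: int = 512, diameter: int = 128) -> bytes:
    """Return an 8-bit RGB PNG: a white disc centered on a black square."""
    center = (size - 1) / 2.0
    radius = diameter / 2.0
    raw = bytearray()
    for y in range(size):
        raw.append(0)  # per-scanline filter type 0 (no filtering)
        for x in range(size):
            inside = (x - center) ** 2 + (y - center) ** 2 <= radius ** 2
            raw += b"\xff\xff\xff" if inside else b"\x00\x00\x00"
    # IHDR: width, height, bit depth 8, color type 2 (RGB), then
    # compression, filter and interlace methods all set to 0.
    ihdr = struct.pack(">IIBBBBB", size, size, 8, 2, 0, 0, 0)
    return (PNG_SIGNATURE
            + png_chunk(b"IHDR", ihdr)
            + png_chunk(b"IDAT", zlib.compress(bytes(raw)))
            + png_chunk(b"IEND", b""))

def write_test_image(path: str = "circle_test.png") -> None:
    """Write the hypothetical test image to disk."""
    with open(path, "wb") as f:
        f.write(circle_png())
```

Calling `write_test_image()` produces a file that can be passed to `-i` in place of `test.png` in the command above.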

What you expected to happen

The output image should match the input image, apart from VAE encoding/decoding losses.
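
To put a number on "the same, minus encoding/decoding errors", one option is to compare pixel values with PSNR. The sketch below assumes both images have already been loaded into flat sequences of 8-bit channel values of equal length. The `psnr` helper is hypothetical and not part of sd-cli, and no pass/fail threshold is implied.

```python
import math

def psnr(a, b, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all channel values.
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    # Identical inputs have zero error, which corresponds to infinite PSNR.
    return math.inf if mse == 0 else 10 * math.log10(peak * peak / mse)
```

A useful reference point is the output produced with `--vae-on-cpu`: scoring the Vulkan output and the CPU output against the same input should make the difference between a lossy round trip and a garbled decode obvious.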

What actually happened

The output image looks as if VAE decoding failed: the channels appear to be mixed across the whole picture, and there are repeating patterns. This might be a modulo or striding error.

Note that the silhouette of the circle (128 px in diameter) appears correctly once in the 512x512 px image, then 8 times as a 64x32 px ellipse, and then again 64 times as a 32x16 px ellipse.
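
As an illustration of why shrunken, repeated copies point toward a striding problem, the sketch below reads one image row with an element stride of `k` instead of 1 and wraps around at the row end. This is only a toy model of the symptom, not a diagnosis of the Vulkan backend. The function name and the wrap-around rule are assumptions made for the example.

```python
def gather_with_stride(row, k):
    """Read a row with stride k instead of 1, wrapping to the next offset."""
    w = len(row)
    if k <= 0 or w % k != 0:
        raise ValueError("row length must be a positive multiple of k")
    # Position x reads source index x*k; after passing the row end the read
    # wraps and starts again one element later, yielding k interleaved passes.
    return [row[(x * k) % w + (x * k) // w] for x in range(w)]

# A single 4-pixel-wide blob in an 8-pixel row...
row = [0, 0, 1, 1, 1, 1, 0, 0]
# ...becomes two 2-pixel-wide copies side by side when read with stride 2.
squeezed = gather_with_stride(row, 2)  # [0, 1, 1, 0, 0, 1, 1, 0]
```

Applying the same wrong stride along both axes, with different factors, would squeeze a circle into several smaller ellipses, which resembles the pattern described above.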

[Image: garbled output image]

Logs / error messages / stack trace

sd-cli -m models/checkpoints/v1-5-pruned-emaonly.safetensors -i test.png --strength 0 -v
[DEBUG] main.cpp:550  - version: stable-diffusion.cpp version master-593-3d6064b, commit 3d6064b
[DEBUG] main.cpp:551  - System Info:
    SSE3 = 1 |     AVX = 1 |     AVX2 = 1 |     AVX512 = 1 |     AVX512_VBMI = 1 |     AVX512_VNNI = 1 |     FMA = 1 |     NEON = 0 |     ARM_FMA = 0 |     F16C = 1 |     FP16_VA = 0 |     WASM_SIMD = 0 |     VSX = 0 |
[DEBUG] main.cpp:552  - SDCliParams {
  mode: img_gen,
  output_path: "output.png",
  image_path: "",
  metadata_format: "text",
  verbose: true,
  color: false,
  canny_preprocess: false,
  convert_name: false,
  preview_method: none,
  preview_interval: 1,
  preview_path: "preview.png",
  preview_fps: 16,
  taesd_preview: false,
  preview_noisy: false,
  metadata_raw: false,
  metadata_brief: false,
  metadata_all: false
}
[DEBUG] main.cpp:553  - SDContextParams {
  n_threads: 8,
  model_path: "models/checkpoints/v1-5-pruned-emaonly.safetensors",
  clip_l_path: "",
  clip_g_path: "",
  clip_vision_path: "",
  t5xxl_path: "",
  llm_path: "",
  llm_vision_path: "",
  diffusion_model_path: "",
  high_noise_diffusion_model_path: "",
  vae_path: "",
  taesd_path: "",
  esrgan_path: "",
  control_net_path: "",
  embedding_dir: "",
  embeddings: {
  }
  wtype: NONE,
  tensor_type_rules: "",
  lora_model_dir: ".",
  hires_upscalers_dir: "",
  photo_maker_path: "",
  rng_type: cuda,
  sampler_rng_type: NONE,
  offload_params_to_cpu: false,
  enable_mmap: false,
  control_net_cpu: false,
  clip_on_cpu: false,
  vae_on_cpu: false,
  flash_attn: false,
  diffusion_flash_attn: false,
  diffusion_conv_direct: false,
  vae_conv_direct: false,
  circular: false,
  circular_x: false,
  circular_y: false,
  chroma_use_dit_mask: true,
  qwen_image_zero_cond_t: false,
  chroma_use_t5_mask: false,
  chroma_t5_mask_pad: 1,
  prediction: NONE,
  lora_apply_mode: auto,
  force_sdxl_vae_conv_scale: false
}
[DEBUG] main.cpp:554  - SDGenerationParams {
  loras: "{
  }",
  high_noise_loras: "{
  }",
  prompt: "",
  negative_prompt: "",
  clip_skip: -1,
  width: -1,
  height: -1,
  batch_count: 1,
  init_image_path: "test.png",
  end_image_path: "",
  mask_image_path: "",
  control_image_path: "",
  ref_image_paths: [],
  control_video_path: "",
  auto_resize_ref_image: true,
  increase_ref_index: false,
  pm_id_images_dir: "",
  pm_id_embed_path: "",
  pm_style_strength: 20,
  skip_layers: [7, 8, 9],
  sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 0, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: inf, shifted_timestep: 0, flow_shift: inf),
  high_noise_skip_layers: [7, 8, 9],
  high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 0, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: inf, shifted_timestep: 0, flow_shift: inf),
  custom_sigmas: [],
  cache_mode: "",
  cache_option: "",
  cache: disabled (threshold=inf, start=0.15, end=0.95),
  moe_boundary: 0.875,
  video_frames: 1,
  fps: 16,
  vace_strength: 1,
  strength: 0,
  control_strength: 0.9,
  seed: 42,
  upscale_repeats: 1,
  upscale_tile_size: 128,
  hires: { enabled: false, upscaler: "Latent", model_path: "", scale: 2, target_width: 0, target_height: 0, steps: 0, denoising_strength: 0.7, upscale_tile_size: 128 },
  vae_tiling_params: { 0, 0, 0, 0.5, 0, 0 },
}
[INFO ] common.cpp:1801 - set width x height to 512 x 512
[DEBUG] ggml_extend.hpp:58   - ggml_vulkan: Found 1 Vulkan devices:
[DEBUG] ggml_extend.hpp:58   - ggml_vulkan: 0 = AMD Radeon RX 580 2048SP (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
[DEBUG] util.cpp:713  - Found 2 backend devices:
[DEBUG] util.cpp:716  - #0: Vulkan0
[DEBUG] util.cpp:716  - #1: CPU
[DEBUG] ggml_extend.hpp:108  - Initializing backend: Vulkan0
[INFO ] stable-diffusion.cpp:210  - loading model from 'models/checkpoints/v1-5-pruned-emaonly.safetensors'
[INFO ] model.cpp:219  - load models/checkpoints/v1-5-pruned-emaonly.safetensors using safetensors format
[DEBUG] model.cpp:294  - init from 'models/checkpoints/v1-5-pruned-emaonly.safetensors', prefix = ''
[INFO ] stable-diffusion.cpp:303  - Version: SD 1.x
[INFO ] stable-diffusion.cpp:331  - Weight type stat:                      f32: 1131
[INFO ] stable-diffusion.cpp:332  - Conditioner weight type stat:          f32: 196
[INFO ] stable-diffusion.cpp:333  - Diffusion model weight type stat:      f32: 686
[INFO ] stable-diffusion.cpp:334  - VAE weight type stat:                  f32: 248
[DEBUG] stable-diffusion.cpp:336  - ggml tensor size = 400 bytes
[DEBUG] clip_tokenizer.cpp:65   - vocab size: 49408
[DEBUG] ggml_extend.hpp:2067 - clip params backend buffer size =  469.44 MB(VRAM) (196 tensors)
[DEBUG] ggml_extend.hpp:2067 - unet params backend buffer size =  2155.33 MB(VRAM) (686 tensors)
[INFO ] stable-diffusion.cpp:629  - using VAE for encoding / decoding
[INFO ] auto_encoder_kl.hpp:517  - vae decoder: ch = 128
[DEBUG] ggml_extend.hpp:2067 - vae params backend buffer size =  159.68 MB(VRAM) (248 tensors)
[DEBUG] stable-diffusion.cpp:753  - loading weights
[DEBUG] model.cpp:742  - using 8 threads for model loading
[DEBUG] model.cpp:764  - loading tensors from models/checkpoints/v1-5-pruned-emaonly.safetensors
  |==================================================| 1131/1131 - 3.97GB/s
[INFO ] model.cpp:993  - loading tensors completed, taking 1.00s (process: 0.00s, read: 0.12s, memcpy: 0.00s, convert: 0.16s, copy_to_backend: 0.41s)
[DEBUG] stable-diffusion.cpp:793  - finished loaded file
[INFO ] stable-diffusion.cpp:845  - total params memory size = 2784.45MB (VRAM 2784.45MB, RAM 0.00MB): text_encoders 469.44MB(VRAM), diffusion_model 2155.33MB(VRAM), vae 159.68MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:918  - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:3342 - generate_image 512x512
[INFO ] denoiser.hpp:499  - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:2789 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:2907 - IMG2IMG
[INFO ] stable-diffusion.cpp:2914 - target t_enc is 0 steps
[DEBUG] ggml_extend.hpp:1880 - vae compute buffer size: 848.50 MB(VRAM)
[DEBUG] vae.hpp:154  - computing vae encode graph completed, taking 1.64s
[INFO ] stable-diffusion.cpp:3081 - encode_first_stage completed, taking 1.65s
[DEBUG] conditioner.hpp:407  - parse '' to [['', 1], ]
[DEBUG] bpe_tokenizer.cpp:183  - split prompt "" to tokens []
[DEBUG] ggml_extend.hpp:1880 - clip compute buffer size: 1.42 MB(VRAM)
[DEBUG] conditioner.hpp:533  - computing condition graph completed, taking 26 ms
[DEBUG] conditioner.hpp:407  - parse '' to [['', 1], ]
[DEBUG] bpe_tokenizer.cpp:183  - split prompt "" to tokens []
[DEBUG] ggml_extend.hpp:1880 - clip compute buffer size: 1.42 MB(VRAM)
[DEBUG] conditioner.hpp:533  - computing condition graph completed, taking 27 ms
[INFO ] stable-diffusion.cpp:3143 - get_learned_condition completed, taking 0.06s
[INFO ] stable-diffusion.cpp:3376 - generating image: 1/1 - seed 42
[DEBUG] ggml_extend.hpp:1880 - unet compute buffer size: 559.90 MB(VRAM)
  |==================================================| 1/1 - 1.48s/it
[INFO ] stable-diffusion.cpp:3407 - sampling completed, taking 1.49s
[INFO ] stable-diffusion.cpp:3425 - generating 1 latent images completed, taking 1.49s
[INFO ] stable-diffusion.cpp:3167 - decoding 1 latents
[DEBUG] ggml_extend.hpp:1880 - vae compute buffer size: 1984.06 MB(VRAM)
[DEBUG] vae.hpp:207  - computing vae decode graph completed, taking 10.65s
[INFO ] stable-diffusion.cpp:3183 - latent 1 decoded, taking 10.65s
[INFO ] stable-diffusion.cpp:3187 - decode_first_stage completed, taking 10.65s
[INFO ] stable-diffusion.cpp:3562 - generate_image completed in 13.85s
[INFO ] main.cpp:441  - save result image 0 to 'output.png' (success)
[INFO ] main.cpp:490  - 1/1 images saved

Additional context / environment details

CPU: 11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz
GPU: Polaris 20 XL [Radeon RX 580 2048SP]
RAM: 64GB

Adding --vae-on-cpu to the command line gives the expected result.
Using another VAE via --vae models/vae/vae-ft-mse-840000-ema-pruned.safetensors behaves the same way: with --vae-on-cpu it works, without it the output is garbled.
