
Conversation

@leejet (Owner) commented on Jan 24, 2026

| Model | before | after |
| --- | --- | --- |
| FLUX.2 Klein 4B (CFG 1, 4 steps, 512x512, bf16) | 6.7 it/s | 8.19 it/s |
| Z-Image Turbo (CFG 1, 9 steps, 512x512, bf16) | 5.78 it/s | 5.97 it/s |
| Qwen Image (CFG 6, 20 steps, 512x512, q8) | 2.22 it/s | 2.27 it/s |

- Device: RTX 4090
- Backend: cuda
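
For reference, the relative speedups these numbers imply (a quick sketch computed directly from the table above; nothing beyond the table is assumed):

```cpp
// Relative speedup implied by the it/s numbers in the table above.
#include <cstdio>

int main() {
    struct Row { const char* name; double before, after; };
    const Row rows[] = {
        {"FLUX.2 Klein 4B", 6.70, 8.19},
        {"Z-Image Turbo",   5.78, 5.97},
        {"Qwen Image",      2.22, 2.27},
    };
    for (const Row& r : rows) {
        // (after / before - 1) * 100 => percent change in iterations per second
        printf("%-16s %+5.1f%%\n", r.name, (r.after / r.before - 1.0) * 100.0);
    }
    return 0;
}
```

This works out to roughly +22% for FLUX.2 Klein and +2-3% for the other two models at 512x512.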

@daniandtheweb (Contributor)

The optimizations look amazing. I've reported the performance improvements here: #1215 (comment), #1215 (comment).

@leejet changed the title from "make flux faster" to "make dit faster" on Jan 24, 2026
@Green-Sky (Contributor) commented on Jan 24, 2026

| Model | before | after |
| --- | --- | --- |
| flux2 klein 4b (cfg1, 4 steps, 1024x1024, q8_0) | 2.73 s/it | 2.29 s/it |
| z-image turbo (cfg1, 8 steps, 1024x1024, Q3_K_M) | 4.81 s/it | 4.72 s/it |

RTX 2070 (8 GB), CUDA backend

(The z-image numbers are basically within run-to-run error.)

@bssrdf (Contributor) commented on Jan 24, 2026

@leejet, thanks for the PR. It fails for Flux1-dev with the following error:

```
>bin\Release\sd-cli.exe --diffusion-model  ..\models\flux1-dev-q8_0.gguf --vae ..\models\ae.safetensors --clip_l ..\models\clip_l.safetensors --t5xxl ..\models\t5xxl_fp16.safetensors -p "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" --cfg-scale 1.0 -H 1024 -W 1024 --diffusion-fa  -v -s -1 --steps 20 --sampling-method euler -o astrounaut02ff.png
[DEBUG] main.cpp:500  - version: stable-diffusion.cpp version master-484-fa61ea7-4-ge2600bd, commit e2600bd
[DEBUG] main.cpp:501  - System Info:
    SSE3 = 1 |     AVX = 1 |     AVX2 = 1 |     AVX512 = 1 |     AVX512_VBMI = 0 |     AVX512_VNNI = 0 |     FMA = 1 |     NEON = 0 |     ARM_FMA = 0 |     F16C = 1 |     FP16_VA = 0 |     WASM_SIMD = 0 |     VSX = 0 |
[DEBUG] main.cpp:502  - SDCliParams {
  mode: img_gen,
  output_path: "astrounaut02ff.png",
  verbose: true,
  color: false,
  canny_preprocess: false,
  convert_name: false,
  preview_method: none,
  preview_interval: 1,
  preview_path: "preview.png",
  preview_fps: 16,
  taesd_preview: false,
  preview_noisy: false
}
[DEBUG] main.cpp:503  - SDContextParams {
  n_threads: 16,
  model_path: "",
  clip_l_path: "..\models\clip_l.safetensors",
  clip_g_path: "",
  clip_vision_path: "",
  t5xxl_path: "..\models\t5xxl_fp16.safetensors",
  llm_path: "",
  llm_vision_path: "",
  diffusion_model_path: "..\models\flux1-dev-q8_0.gguf",
  high_noise_diffusion_model_path: "",
  vae_path: "..\models\ae.safetensors",
  taesd_path: "",
  esrgan_path: "",
  control_net_path: "",
  embedding_dir: "",
  embeddings: {
  }
  wtype: NONE,
  tensor_type_rules: "",
  lora_model_dir: ".",
  photo_maker_path: "",
  rng_type: cuda,
  sampler_rng_type: NONE,
  flow_shift: INF
  offload_params_to_cpu: false,
  enable_mmap: false,
  control_net_cpu: false,
  clip_on_cpu: false,
  vae_on_cpu: false,
  diffusion_flash_attn: true,
  diffusion_conv_direct: false,
  vae_conv_direct: true,
  circular: false,
  circular_x: false,
  circular_y: false,
  chroma_use_dit_mask: true,
  qwen_image_zero_cond_t: false,
  chroma_use_t5_mask: false,
  chroma_t5_mask_pad: 1,
  prediction: NONE,
  lora_apply_mode: auto,
  vae_tiling_params: { 0, 0, 0, 0.5, 0, 0 },
  force_sdxl_vae_conv_scale: false
}
[DEBUG] main.cpp:504  - SDGenerationParams {
  loras: "{
  }",
  high_noise_loras: "{
  }",
  prompt: "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
  negative_prompt: "",
  clip_skip: -1,
  width: 1024,
  height: 1024,
  batch_count: 1,
  init_image_path: "",
  end_image_path: "",
  mask_image_path: "",
  control_image_path: "",
  ref_image_paths: [],
  control_video_path: "",
  auto_resize_ref_image: true,
  increase_ref_index: false,
  pm_id_images_dir: "",
  pm_id_embed_path: "",
  pm_style_strength: 20,
  skip_layers: [7, 8, 9],
  sample_params: (txt_cfg: 1.00, img_cfg: 1.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: euler, sample_steps: 20, eta: 0.00, shifted_timestep: 0),
  high_noise_skip_layers: [7, 8, 9],
  high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: 0.00, shifted_timestep: 0),
  custom_sigmas: [],
  cache_mode: "",
  cache_option: "",
  cache: disabled (threshold=1, start=0.15, end=0.95),
  moe_boundary: 0.875,
  video_frames: 1,
  fps: 16,
  vace_strength: 1,
  strength: 0.75,
  control_strength: 0.9,
  seed: 23136,
  upscale_repeats: 1,
  upscale_tile_size: 128,
}
[DEBUG] stable-diffusion.cpp:164  - Using CUDA backend
[INFO ] ggml_extend.hpp:78   - ggml_cuda_init: found 1 CUDA devices:
[INFO ] ggml_extend.hpp:78   -   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:258  - loading diffusion model from '..\models\flux1-dev-q8_0.gguf'
[INFO ] model.cpp:370  - load ..\models\flux1-dev-q8_0.gguf using gguf format
[DEBUG] model.cpp:416  - init from '..\models\flux1-dev-q8_0.gguf'
[INFO ] stable-diffusion.cpp:274  - loading clip_l from '..\models\clip_l.safetensors'
[INFO ] model.cpp:373  - load ..\models\clip_l.safetensors using safetensors format
[DEBUG] model.cpp:507  - init from '..\models\clip_l.safetensors', prefix = 'text_encoders.clip_l.transformer.'
[INFO ] stable-diffusion.cpp:298  - loading t5xxl from '..\models\t5xxl_fp16.safetensors'
[INFO ] model.cpp:373  - load ..\models\t5xxl_fp16.safetensors using safetensors format
[DEBUG] model.cpp:507  - init from '..\models\t5xxl_fp16.safetensors', prefix = 'text_encoders.t5xxl.transformer.'
[INFO ] stable-diffusion.cpp:319  - loading vae from '..\models\ae.safetensors'
[INFO ] model.cpp:373  - load ..\models\ae.safetensors using safetensors format
[DEBUG] model.cpp:507  - init from '..\models\ae.safetensors', prefix = 'vae.'
[INFO ] stable-diffusion.cpp:335  - Version: Flux
[INFO ] stable-diffusion.cpp:363  - Weight type stat:                      f32: 720  |     f16: 415  |    q8_0: 304
[INFO ] stable-diffusion.cpp:364  - Conditioner weight type stat:          f16: 415
[INFO ] stable-diffusion.cpp:365  - Diffusion model weight type stat:      f32: 476  |    q8_0: 304
[INFO ] stable-diffusion.cpp:366  - VAE weight type stat:                  f32: 244
[DEBUG] stable-diffusion.cpp:368  - ggml tensor size = 400 bytes
[DEBUG] clip.hpp:160  - vocab size: 49408
[DEBUG] clip.hpp:171  - trigger word img already in vocab
[INFO ] flux.hpp:1353 - flux: depth = 19, depth_single_blocks = 38, guidance_embed = true, context_in_dim = 4096, hidden_size = 3072, num_heads = 24
[INFO ] stable-diffusion.cpp:573  - Using flash attention in the diffusion model
[DEBUG] ggml_extend.hpp:1914 - clip params backend buffer size =  235.06 MB(VRAM) (196 tensors)
[DEBUG] ggml_extend.hpp:1914 - t5 params backend buffer size =  9083.77 MB(VRAM) (219 tensors)
[DEBUG] ggml_extend.hpp:1914 - flux params backend buffer size =  12247.64 MB(VRAM) (780 tensors)
[INFO ] stable-diffusion.cpp:624  - Using Conv2d direct in the vae model
[DEBUG] ggml_extend.hpp:1914 - vae params backend buffer size =  94.57 MB(VRAM) (138 tensors)
[DEBUG] stable-diffusion.cpp:752  - loading weights
[DEBUG] model.cpp:1381 - using 16 threads for model loading
[DEBUG] model.cpp:1403 - loading tensors from ..\models\flux1-dev-q8_0.gguf
  |===========================>                      | 780/1439 - 139.76it/s
[DEBUG] model.cpp:1403 - loading tensors from ..\models\clip_l.safetensors
  |=================================>                | 976/1439 - 168.51it/s
[DEBUG] model.cpp:1403 - loading tensors from ..\models\t5xxl_fp16.safetensors
  |=========================================>        | 1195/1439 - 119.70it/s
[DEBUG] model.cpp:1403 - loading tensors from ..\models\ae.safetensors
  |==================================================| 1439/1439 - 141.22it/s
[INFO ] model.cpp:1629 - loading tensors completed, taking 10.19s (process: 0.00s, read: 8.76s, memcpy: 0.00s, convert: 0.01s, copy_to_backend: 0.55s)
[DEBUG] stable-diffusion.cpp:787  - finished loaded file
[INFO ] stable-diffusion.cpp:860  - total params memory size = 21661.05MB (VRAM 21661.05MB, RAM 0.00MB): text_encoders 9318.83MB(VRAM), diffusion_model 12247.64MB(VRAM), vae 94.57MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:934  - running in Flux FLOW mode
[DEBUG] stable-diffusion.cpp:3472 - generate_image 1024x1024
[INFO ] stable-diffusion.cpp:3506 - sampling using Euler method
[INFO ] denoiser.hpp:403  - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:3633 - TXT2IMG
[DEBUG] conditioner.hpp:1215 - parse 'Astronaut in a jungle, cold color palette, muted colors, detailed, 8k' to [['Astronaut in a jungle, cold color palette, muted colors, detailed, 8k', 1], ]
[DEBUG] clip.hpp:401  - token astronaut in a jungle, cold color palette, muted colors, detailed, 8k
[DEBUG] clip.hpp:304  - token length: 77
[DEBUG] t5.hpp:402  - token length: 256
[DEBUG] clip.hpp:764  - identity projection
[DEBUG] ggml_extend.hpp:1726 - clip compute buffer size: 1.40 MB(VRAM)
[DEBUG] clip.hpp:764  - identity projection
[DEBUG] ggml_extend.hpp:1726 - t5 compute buffer size: 68.25 MB(VRAM)
[DEBUG] conditioner.hpp:1345 - computing condition graph completed, taking 194 ms
[INFO ] stable-diffusion.cpp:3250 - get_learned_condition completed, taking 196 ms
[INFO ] stable-diffusion.cpp:3361 - generating image: 1/1 - seed 23136
[DEBUG] ggml_extend.hpp:1726 - flux compute buffer size: 856.75 MB(VRAM)
stable-diffusion.cpp\ggml\src\ggml-cuda\unary.cu:136: GGML_ASSERT(ggml_is_contiguous(src0)) failed
```
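
The assert fires because a CUDA unary kernel received a non-contiguous `src0`, typically a view produced by a permute/transpose earlier in the graph. Below is a minimal sketch of the usual workaround pattern, purely for illustration and not the actual code in this PR: the activation (`ggml_gelu` here is an assumed example) gets a `ggml_cont` in front of it so the view is materialized before the unary op runs.

```cpp
#include "ggml.h"

// Hypothetical illustration only: ggml's CUDA unary kernels assert
// ggml_is_contiguous(src0), so a view created by ggml_permute()/ggml_transpose()
// must be made contiguous before the activation is applied.
static struct ggml_tensor* gelu_contiguous(struct ggml_context* ctx, struct ggml_tensor* x) {
    if (!ggml_is_contiguous(x)) {
        x = ggml_cont(ctx, x);  // copy the view into a contiguous buffer
    }
    return ggml_gelu(ctx, x);
}
```

Whether the right fix is an extra `ggml_cont` in the graph or keeping the tensor contiguous in the new fast path is of course up to the PR author.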
