
Conversation

@leejet (Owner) commented on Jan 24, 2026

| Model | before | after |
| --- | --- | --- |
| FLUX.2 Klein 4B (CFG 1, 4 steps, 512x512, bf16) | 6.7 it/s | 8.19 it/s |
| Z-Image Turbo (CFG 1, 9 steps, 512x512, bf16) | 5.78 it/s | 5.97 it/s |
| Qwen Image (CFG 6, 20 steps, 512x512, q8) | 2.22 it/s | 2.27 it/s |

- Device: RTX 4090
- Backend: cuda
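
For reference, the relative speedups these numbers imply (a quick sketch computed directly from the table above; nothing beyond the table is assumed):

```cpp
// Relative speedup implied by the it/s numbers in the table above.
#include <cstdio>

int main() {
    struct Row { const char* name; double before, after; };
    const Row rows[] = {
        {"FLUX.2 Klein 4B", 6.70, 8.19},
        {"Z-Image Turbo",   5.78, 5.97},
        {"Qwen Image",      2.22, 2.27},
    };
    for (const Row& r : rows) {
        // (after / before - 1) * 100 => percent change in iterations per second
        printf("%-16s %+5.1f%%\n", r.name, (r.after / r.before - 1.0) * 100.0);
    }
    return 0;
}
```

This works out to roughly +22% for FLUX.2 Klein and +2-3% for the other two models at 512x512.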

@daniandtheweb (Contributor)

The optimizations look amazing. I've reported the performance improvements here: #1215 (comment), #1215 (comment).

@leejet changed the title from "make flux faster" to "make dit faster" on Jan 24, 2026
@Green-Sky (Contributor) commented on Jan 24, 2026

| Model | before | after |
| --- | --- | --- |
| flux2 klein 4b (cfg1, 4 steps, 1024x1024, q8_0) | 2.73 s/it | 2.29 s/it |
| z-image turbo (cfg1, 8 steps, 1024x1024, Q3_K_M) | 4.81 s/it | 4.72 s/it |

RTX 2070 (8 GB), CUDA backend

(The z-image numbers are basically within run-to-run error.)

@bssrdf (Contributor) commented on Jan 24, 2026

@leejet, thanks for the PR. It fails for Flux1-dev with the following error:

```
>bin\Release\sd-cli.exe --diffusion-model  ..\models\flux1-dev-q8_0.gguf --vae ..\models\ae.safetensors --clip_l ..\models\clip_l.safetensors --t5xxl ..\models\t5xxl_fp16.safetensors -p "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" --cfg-scale 1.0 -H 1024 -W 1024 --diffusion-fa  -v -s -1 --steps 20 --sampling-method euler -o astrounaut02ff.png
[DEBUG] main.cpp:500  - version: stable-diffusion.cpp version master-484-fa61ea7-4-ge2600bd, commit e2600bd
[DEBUG] main.cpp:501  - System Info:
    SSE3 = 1 |     AVX = 1 |     AVX2 = 1 |     AVX512 = 1 |     AVX512_VBMI = 0 |     AVX512_VNNI = 0 |     FMA = 1 |     NEON = 0 |     ARM_FMA = 0 |     F16C = 1 |     FP16_VA = 0 |     WASM_SIMD = 0 |     VSX = 0 |
[DEBUG] main.cpp:502  - SDCliParams {
  mode: img_gen,
  output_path: "astrounaut02ff.png",
  verbose: true,
  color: false,
  canny_preprocess: false,
  convert_name: false,
  preview_method: none,
  preview_interval: 1,
  preview_path: "preview.png",
  preview_fps: 16,
  taesd_preview: false,
  preview_noisy: false
}
[DEBUG] main.cpp:503  - SDContextParams {
  n_threads: 16,
  model_path: "",
  clip_l_path: "..\models\clip_l.safetensors",
  clip_g_path: "",
  clip_vision_path: "",
  t5xxl_path: "..\models\t5xxl_fp16.safetensors",
  llm_path: "",
  llm_vision_path: "",
  diffusion_model_path: "..\models\flux1-dev-q8_0.gguf",
  high_noise_diffusion_model_path: "",
  vae_path: "..\models\ae.safetensors",
  taesd_path: "",
  esrgan_path: "",
  control_net_path: "",
  embedding_dir: "",
  embeddings: {
  }
  wtype: NONE,
  tensor_type_rules: "",
  lora_model_dir: ".",
  photo_maker_path: "",
  rng_type: cuda,
  sampler_rng_type: NONE,
  flow_shift: INF
  offload_params_to_cpu: false,
  enable_mmap: false,
  control_net_cpu: false,
  clip_on_cpu: false,
  vae_on_cpu: false,
  diffusion_flash_attn: true,
  diffusion_conv_direct: false,
  vae_conv_direct: true,
  circular: false,
  circular_x: false,
  circular_y: false,
  chroma_use_dit_mask: true,
  qwen_image_zero_cond_t: false,
  chroma_use_t5_mask: false,
  chroma_t5_mask_pad: 1,
  prediction: NONE,
  lora_apply_mode: auto,
  vae_tiling_params: { 0, 0, 0, 0.5, 0, 0 },
  force_sdxl_vae_conv_scale: false
}
[DEBUG] main.cpp:504  - SDGenerationParams {
  loras: "{
  }",
  high_noise_loras: "{
  }",
  prompt: "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
  negative_prompt: "",
  clip_skip: -1,
  width: 1024,
  height: 1024,
  batch_count: 1,
  init_image_path: "",
  end_image_path: "",
  mask_image_path: "",
  control_image_path: "",
  ref_image_paths: [],
  control_video_path: "",
  auto_resize_ref_image: true,
  increase_ref_index: false,
  pm_id_images_dir: "",
  pm_id_embed_path: "",
  pm_style_strength: 20,
  skip_layers: [7, 8, 9],
  sample_params: (txt_cfg: 1.00, img_cfg: 1.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: euler, sample_steps: 20, eta: 0.00, shifted_timestep: 0),
  high_noise_skip_layers: [7, 8, 9],
  high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: 0.00, shifted_timestep: 0),
  custom_sigmas: [],
  cache_mode: "",
  cache_option: "",
  cache: disabled (threshold=1, start=0.15, end=0.95),
  moe_boundary: 0.875,
  video_frames: 1,
  fps: 16,
  vace_strength: 1,
  strength: 0.75,
  control_strength: 0.9,
  seed: 23136,
  upscale_repeats: 1,
  upscale_tile_size: 128,
}
[DEBUG] stable-diffusion.cpp:164  - Using CUDA backend
[INFO ] ggml_extend.hpp:78   - ggml_cuda_init: found 1 CUDA devices:
[INFO ] ggml_extend.hpp:78   -   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:258  - loading diffusion model from '..\models\flux1-dev-q8_0.gguf'
[INFO ] model.cpp:370  - load ..\models\flux1-dev-q8_0.gguf using gguf format
[DEBUG] model.cpp:416  - init from '..\models\flux1-dev-q8_0.gguf'
[INFO ] stable-diffusion.cpp:274  - loading clip_l from '..\models\clip_l.safetensors'
[INFO ] model.cpp:373  - load ..\models\clip_l.safetensors using safetensors format
[DEBUG] model.cpp:507  - init from '..\models\clip_l.safetensors', prefix = 'text_encoders.clip_l.transformer.'
[INFO ] stable-diffusion.cpp:298  - loading t5xxl from '..\models\t5xxl_fp16.safetensors'
[INFO ] model.cpp:373  - load ..\models\t5xxl_fp16.safetensors using safetensors format
[DEBUG] model.cpp:507  - init from '..\models\t5xxl_fp16.safetensors', prefix = 'text_encoders.t5xxl.transformer.'
[INFO ] stable-diffusion.cpp:319  - loading vae from '..\models\ae.safetensors'
[INFO ] model.cpp:373  - load ..\models\ae.safetensors using safetensors format
[DEBUG] model.cpp:507  - init from '..\models\ae.safetensors', prefix = 'vae.'
[INFO ] stable-diffusion.cpp:335  - Version: Flux
[INFO ] stable-diffusion.cpp:363  - Weight type stat:                      f32: 720  |     f16: 415  |    q8_0: 304
[INFO ] stable-diffusion.cpp:364  - Conditioner weight type stat:          f16: 415
[INFO ] stable-diffusion.cpp:365  - Diffusion model weight type stat:      f32: 476  |    q8_0: 304
[INFO ] stable-diffusion.cpp:366  - VAE weight type stat:                  f32: 244
[DEBUG] stable-diffusion.cpp:368  - ggml tensor size = 400 bytes
[DEBUG] clip.hpp:160  - vocab size: 49408
[DEBUG] clip.hpp:171  - trigger word img already in vocab
[INFO ] flux.hpp:1353 - flux: depth = 19, depth_single_blocks = 38, guidance_embed = true, context_in_dim = 4096, hidden_size = 3072, num_heads = 24
[INFO ] stable-diffusion.cpp:573  - Using flash attention in the diffusion model
[DEBUG] ggml_extend.hpp:1914 - clip params backend buffer size =  235.06 MB(VRAM) (196 tensors)
[DEBUG] ggml_extend.hpp:1914 - t5 params backend buffer size =  9083.77 MB(VRAM) (219 tensors)
[DEBUG] ggml_extend.hpp:1914 - flux params backend buffer size =  12247.64 MB(VRAM) (780 tensors)
[INFO ] stable-diffusion.cpp:624  - Using Conv2d direct in the vae model
[DEBUG] ggml_extend.hpp:1914 - vae params backend buffer size =  94.57 MB(VRAM) (138 tensors)
[DEBUG] stable-diffusion.cpp:752  - loading weights
[DEBUG] model.cpp:1381 - using 16 threads for model loading
[DEBUG] model.cpp:1403 - loading tensors from ..\models\flux1-dev-q8_0.gguf
  |===========================>                      | 780/1439 - 139.76it/s
[DEBUG] model.cpp:1403 - loading tensors from ..\models\clip_l.safetensors
  |=================================>                | 976/1439 - 168.51it/s
[DEBUG] model.cpp:1403 - loading tensors from ..\models\t5xxl_fp16.safetensors
  |=========================================>        | 1195/1439 - 119.70it/s
[DEBUG] model.cpp:1403 - loading tensors from ..\models\ae.safetensors
  |==================================================| 1439/1439 - 141.22it/s
[INFO ] model.cpp:1629 - loading tensors completed, taking 10.19s (process: 0.00s, read: 8.76s, memcpy: 0.00s, convert: 0.01s, copy_to_backend: 0.55s)
[DEBUG] stable-diffusion.cpp:787  - finished loaded file
[INFO ] stable-diffusion.cpp:860  - total params memory size = 21661.05MB (VRAM 21661.05MB, RAM 0.00MB): text_encoders 9318.83MB(VRAM), diffusion_model 12247.64MB(VRAM), vae 94.57MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:934  - running in Flux FLOW mode
[DEBUG] stable-diffusion.cpp:3472 - generate_image 1024x1024
[INFO ] stable-diffusion.cpp:3506 - sampling using Euler method
[INFO ] denoiser.hpp:403  - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:3633 - TXT2IMG
[DEBUG] conditioner.hpp:1215 - parse 'Astronaut in a jungle, cold color palette, muted colors, detailed, 8k' to [['Astronaut in a jungle, cold color palette, muted colors, detailed, 8k', 1], ]
[DEBUG] clip.hpp:401  - token astronaut in a jungle, cold color palette, muted colors, detailed, 8k
[DEBUG] clip.hpp:304  - token length: 77
[DEBUG] t5.hpp:402  - token length: 256
[DEBUG] clip.hpp:764  - identity projection
[DEBUG] ggml_extend.hpp:1726 - clip compute buffer size: 1.40 MB(VRAM)
[DEBUG] clip.hpp:764  - identity projection
[DEBUG] ggml_extend.hpp:1726 - t5 compute buffer size: 68.25 MB(VRAM)
[DEBUG] conditioner.hpp:1345 - computing condition graph completed, taking 194 ms
[INFO ] stable-diffusion.cpp:3250 - get_learned_condition completed, taking 196 ms
[INFO ] stable-diffusion.cpp:3361 - generating image: 1/1 - seed 23136
[DEBUG] ggml_extend.hpp:1726 - flux compute buffer size: 856.75 MB(VRAM)
stable-diffusion.cpp\ggml\src\ggml-cuda\unary.cu:136: GGML_ASSERT(ggml_is_contiguous(src0)) failed
```
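
The assert fires because a CUDA unary kernel received a non-contiguous `src0`, typically a view produced by a permute/transpose earlier in the graph. Below is a minimal sketch of the usual workaround pattern, purely for illustration and not the actual code in this PR: the activation (`ggml_gelu` here is an assumed example) gets a `ggml_cont` in front of it so the view is materialized before the unary op runs.

```cpp
#include "ggml.h"

// Hypothetical illustration only: ggml's CUDA unary kernels assert
// ggml_is_contiguous(src0), so a view created by ggml_permute()/ggml_transpose()
// must be made contiguous before the activation is applied.
static struct ggml_tensor* gelu_contiguous(struct ggml_context* ctx, struct ggml_tensor* x) {
    if (!ggml_is_contiguous(x)) {
        x = ggml_cont(ctx, x);  // copy the view into a contiguous buffer
    }
    return ggml_gelu(ctx, x);
}
```

Whether the right fix is an extra `ggml_cont` in the graph or keeping the tensor contiguous in the new fast path is of course up to the PR author.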
