We have all seen AI image generation models do amazing things, but let's be honest: they stumble when it comes to the finer details. Think about generating text inside an image: letters often look warped, words become unreadable, and for non-Latin scripts like Chinese, it's even messier. On top of that, most editing tools are either too basic (crop, filter, resize) or too complex for everyday users.
Qwen Image is built on a massive 20B-parameter MMDiT architecture and is licensed under Apache 2.0. Experiments show that it outperforms other models in text rendering, especially in Chinese, while also excelling at general image generation, editing, and even image understanding tasks like object detection, segmentation, and depth estimation. You can find more detailed findings in their research paper.
Reference: Official Qwen Image page
Now imagine an AI that not only generates stunning visuals in different artistic styles but also integrates text with near-perfect accuracy, whether it's English, Korean, Japanese, or Chinese. Imagine editing at a professional level, swapping objects, tweaking styles, and enhancing details, all with natural, intuitive inputs. This model improves on all of these fronts.
Let's see how to set it up using ComfyUI.
Installation
1. Install ComfyUI on your system. If you have already done so, just update it from the Manager by clicking the Update All option.
Native support
2. Download any of the Qwen Image models from the Hugging Face repository:
(a) Qwen Image FP8 (qwen_image_fp8_e4m3fn.safetensors)
(b) Qwen Image BF16 (qwen_image_bf16.safetensors)
(c) Qwen Image Distilled FP8 (qwen_image_distill_full_fp8_e4m3fn.safetensors)
(d) Qwen Image Distilled BF16 (qwen_image_distill_full_bf16.safetensors)
Save it into your ComfyUI/models/diffusion_models folder. Choose these if you have more than 16 GB of VRAM.
3. Download the FP8 text encoder (qwen_2.5_vl_7b_fp8_scaled.safetensors) or the FP16 text encoder (qwen_2.5_vl_7b.safetensors), whichever suits your GPU, and put it into your ComfyUI/models/text_encoders folder.
4. Optionally, get the Qwen Image Lightning LoRA (Qwen-Image-Lightning-8steps-V1.0.safetensors) and save it into your ComfyUI/models/loras folder.
5. Finally, download the VAE (qwen_image_vae.safetensors) and save it to your ComfyUI/models/vae folder.
6. Restart ComfyUI for the changes to take effect.
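Before restarting, it can help to confirm that every file landed in the right folder. The Python sketch below is one way to check; it assumes you picked the FP8 diffusion model and the FP8 text encoder from the steps above, and the missing_files helper and file lists are our own illustration, not part of ComfyUI. Swap in the file names of the variants you actually downloaded.

```python
from pathlib import Path

# Folder (relative to the ComfyUI root) -> expected file names, taken from the steps above.
# Assumption: FP8 diffusion model and FP8 text encoder; edit to match your downloads.
REQUIRED_FILES = {
    "models/diffusion_models": ["qwen_image_fp8_e4m3fn.safetensors"],
    "models/text_encoders": ["qwen_2.5_vl_7b_fp8_scaled.safetensors"],
    "models/vae": ["qwen_image_vae.safetensors"],
}

# The Lightning LoRA is optional, so it is checked separately.
OPTIONAL_FILES = {
    "models/loras": ["Qwen-Image-Lightning-8steps-V1.0.safetensors"],
}


def missing_files(comfy_root, required=REQUIRED_FILES):
    """Return the expected file paths that do not exist under comfy_root."""
    root = Path(comfy_root)
    missing = []
    for folder, names in required.items():
        for name in names:
            path = root / folder / name
            if not path.is_file():
                missing.append(path)
    return missing


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python check_qwen_files.py /path/to/ComfyUI
    root = sys.argv[1] if len(sys.argv) > 1 else "ComfyUI"
    gaps = missing_files(root)
    for path in gaps:
        print(f"missing: {path}")
    if not gaps:
        print("All required Qwen Image files found.")
```

Any path it prints still needs to be downloaded or moved. For the GGUF route, the model file lives under models/unet instead, so adjust the first entry accordingly.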
Qwen Image GGUF
1. You can also download optimized Qwen Image GGUF variants from different developers, which consume less VRAM. To use these models, you need to install the ComfyUI-GGUF custom node by City96 from the Manager. If you don't know what GGUF models are, learn about them in our quantized model tutorial.
To install it, open the Manager, click Custom Nodes Manager, search for ComfyUI-GGUF, and click the Install button; if it is already installed, update it from the same screen instead. Then download any of these models:
(a) Qwen Image GGUF by City96
(b) Qwen Image GGUF by QuantStack
Store it in your ComfyUI/models/unet folder.
The variants range from Q2 (faster generation, lower precision and quality) to Q8 (slower generation, higher precision and quality). The approximate minimum VRAM usage for each level is listed below:
2-bit (Q2_K): ~7 GB
3-bit (Q3_K): ~9 GB
4-bit (Q4): 12-13 GB
5-bit (Q5): 14-15 GB
6-bit (Q6): 16.8 GB
8-bit (Q8): 21.8 GB
16-bit (BF16): 40.9 GB
Next, you will need the same text encoder, VAE, and LoRA models described above. If you have already downloaded them, there is no need to download them again.
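To turn the VRAM list into a quick decision, the Python sketch below picks the highest-precision level whose listed VRAM fits your card. It uses the approximate figures from the list (taking the upper end of each range to stay conservative); the pick_quant name and the headroom_gb safety margin are our own illustration. Real usage also depends on resolution, the text encoder, and ComfyUI's offloading, so treat the result as a starting point.

```python
# Approximate minimum VRAM (GB) per Qwen Image GGUF level, from the list above.
# Ranges such as "12-13 GB" use their upper bound. Ordered lowest to highest precision.
QUANT_VRAM_GB = [
    ("Q2_K", 7.0),
    ("Q3_K", 9.0),
    ("Q4", 13.0),
    ("Q5", 15.0),
    ("Q6", 16.8),
    ("Q8", 21.8),
    ("BF16", 40.9),
]


def pick_quant(vram_gb, headroom_gb=0.0):
    """Return the highest-precision level that fits in vram_gb minus headroom_gb.

    Returns None when even the smallest level does not fit.
    """
    budget = vram_gb - headroom_gb
    best = None
    for name, needed in QUANT_VRAM_GB:
        if needed <= budget:
            best = name  # keep upgrading while the next level still fits
    return best
```

For example, pick_quant(16) returns "Q5", and pick_quant(24, headroom_gb=4) returns "Q6".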
2. Download the FP8 text encoder (qwen_2.5_vl_7b_fp8_scaled.safetensors) or the FP16 text encoder (qwen_2.5_vl_7b.safetensors), whichever suits your GPU, and put it into your ComfyUI/models/text_encoders folder.
3. Download the VAE (qwen_image_vae.safetensors) and save it to your ComfyUI/models/vae folder.
4. Finally, if you want it, get the Qwen Image Lightning LoRA (Qwen-Image-Lightning-8steps-V1.0.safetensors) and save it into your ComfyUI/models/loras folder.
Workflow
1. Get the workflow image (Qwen_Image_basic_workflow.png) from our Hugging Face repository.
You can also get the workflow from the ComfyUI templates. In the top-left corner, navigate to Workflow > Browse Templates > Image, then search for the Qwen Image workflow. If it does not appear, you are using an older version of ComfyUI; update it from the Manager tab.
2. Drag and drop the image into ComfyUI to load the workflow.
3. Load the Qwen Image model into the Load Diffusion Model node.
4. Enter your prompt into the prompt box.
KSampler settings:
Steps: 20 (official: 50; use 10 for faster generation; use 15 for the distilled variant)
CFG: 2.5 (use 1.0 for the distilled variant)
Sampler: euler or res_multistep
5. Click Run to start execution.
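If you script your runs, the KSampler values above can be captured in a small helper. This Python sketch encodes only the listed numbers (20 steps and CFG 2.5 by default, 10 steps for fast runs, 15 steps and CFG 1.0 for the distilled variant). Keeping CFG at 2.5 for the fast option is our assumption, since only the step count is given for it, and the ksampler_settings name is hypothetical. The keys match the steps, cfg, and sampler_name inputs of ComfyUI's KSampler node.

```python
# Samplers suggested in the settings above, using ComfyUI's lowercase identifiers.
SUPPORTED_SAMPLERS = ("euler", "res_multistep")


def ksampler_settings(variant="full", fast=False, sampler="euler"):
    """Return KSampler inputs for a Qwen Image variant, per the settings above.

    variant: "full" (FP8/BF16/GGUF models) or "distilled" (*_distill_full_* files).
    fast: use the 10-step option; ignored for the distilled variant.
    Assumption: the fast option keeps CFG at 2.5.
    """
    if sampler not in SUPPORTED_SAMPLERS:
        raise ValueError(f"unsupported sampler: {sampler}")
    if variant == "distilled":
        steps, cfg = 15, 1.0
    elif variant == "full":
        steps, cfg = (10 if fast else 20), 2.5
    else:
        raise ValueError(f"unknown variant: {variant}")
    return {"steps": steps, "cfg": cfg, "sampler_name": sampler}
```

For the official 50-step setting, override the steps value in the returned dict.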
We tested different prompts to generate different kinds of shots. The results are not cherry-picked; you are seeing what we got on our first attempt.
(a) Human Photography
Prompt: Professional studio portrait of a female fashion model wearing a sleek black evening gown, dramatic softbox lighting, high-fashion editorial look, ultra-sharp details
Steps: 20
CFG: 2.5
(b) Cinematic Movie Scene
Prompt:
A wide-angle cinematic shot of a futuristic city at dusk, neon lights reflecting on wet streets, flying cars leaving glowing trails in the sky, pedestrians in cyberpunk clothing walking through a crowded marketplace. The atmosphere is both gritty and vibrant, with heavy rain creating ripples in puddles. The camera is positioned low to the ground, giving a dramatic perspective looking upward at towering skyscrapers covered in holographic billboards.
Steps: 50
CFG: 4
(c) Fine Art Painting
Prompt:
A renaissance-style oil painting of an elderly woman sitting by a wooden table, illuminated by a single candle. The brushstrokes emphasize texture in her wrinkled skin, the folds of her wool clothing, and the weathered surface of the table. Behind her, a dark shadowy background contrasts with the warm glow of the candle, creating a chiaroscuro effect. The composition should feel timeless, as though painted by Caravaggio himself.
Steps: 30
CFG: 4
(d) Street Photography with Text Rendering
Prompt:
A realistic street photo of a busy New York avenue at night, with glowing neon signs above shops. One sign should clearly read: "Broadway Café" in large red neon letters, with each letter evenly spaced and glowing against the rainy reflections on the street. The camera is positioned at eye level, capturing the sign as the central focus while blurred pedestrians walk by.
Steps: 35
CFG: 4
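To reproduce these shots as a batch rather than one at a time, you can drive ComfyUI through its HTTP API. The Python sketch below assumes you exported the Qwen Image workflow in ComfyUI's API format, that ComfyUI is running at its default address http://127.0.0.1:8188, and that the workflow has a single KSampler whose positive input comes from a text-encode node with a text field. SHOT_SETTINGS holds the step and CFG values from the four tests above; its keys and the helper names are our own.

```python
import copy
import json
import urllib.request

# Steps/CFG used for the four test shots above (keys are hypothetical labels).
SHOT_SETTINGS = {
    "human_photography": {"steps": 20, "cfg": 2.5},
    "cinematic_scene": {"steps": 50, "cfg": 4.0},
    "fine_art_painting": {"steps": 30, "cfg": 4.0},
    "street_text_render": {"steps": 35, "cfg": 4.0},
}


def build_prompt_payload(workflow, prompt_text, steps, cfg, seed=None):
    """Return a /prompt request body with the prompt, steps, and CFG swapped in.

    workflow is an API-format dict: node_id -> {"class_type": ..., "inputs": {...}}.
    """
    wf = copy.deepcopy(workflow)  # leave the caller's template untouched
    sampler = next((n for n in wf.values() if n.get("class_type") == "KSampler"), None)
    if sampler is None:
        raise ValueError("no KSampler node found in workflow")
    sampler["inputs"]["steps"] = steps
    sampler["inputs"]["cfg"] = cfg
    if seed is not None:
        sampler["inputs"]["seed"] = seed
    # In API format a linked input is [source_node_id, output_index].
    positive_id = sampler["inputs"]["positive"][0]
    wf[positive_id]["inputs"]["text"] = prompt_text
    return {"prompt": wf}


def queue_prompt(payload, server="http://127.0.0.1:8188"):
    """POST the payload to a running ComfyUI server's /prompt endpoint."""
    data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        f"{server}/prompt", data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

Load the exported JSON with json.load, call build_prompt_payload for each prompt with its SHOT_SETTINGS entry, and pass the result to queue_prompt; the server queues the job and replies with its prompt ID.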