Z Image Base (BF16/FP8/GGUF/NVFP4) - Superb Realism

 

Z Image Base - gguf/fp8/bf16 model in comfyui

Z-Image (Base), the original base model behind Z Image Turbo, has been released by Alibaba Group and Tongyi-MAI. It's built to give you the full creative signal.

As the foundation model of the Z Image family, it aims to deliver high-quality visuals, deep stylistic flexibility, and strong prompt adherence, all while staying open and developer-friendly under the Apache 2.0 license.



Z Image Showcase (Ref-official page)

On top of that, the model has been engineered for:

- Broad aesthetic coverage, from hyper-real photography to anime and stylized illustration.

- Higher output diversity across seeds, helping multi-person scenes feel distinct and alive.

- Strong negative prompt responsiveness, so artifacts and unwanted traits can be reliably suppressed.

 

Z Image Family architecture

 

While Z Image Turbo focuses on speed, Z Image Base is the full-capacity foundation model, the one you build on. It supports negative prompts, LoRA training, structural conditioning like ControlNet, and semantic conditioning right out of the box. For a detailed look at the model, see their research paper.


 

Installation

1. First, install ComfyUI. Existing users need to update ComfyUI from the Manager itself.

2. Download the Z Image models from the Hugging Face repository. Multiple variants are listed below, each with a different VRAM requirement. Choose the one that suits your system resources. For a detailed overview of the model formats, see our quantized model tutorial.

(a) Z Image BF16 [z_image_bf16.safetensors] (by Comfy)

(b) Z Image FP8 [z-img_fp8-e4m3fn-scaled.safetensors / z-img_fp8-e4m3fn.safetensors / z-img_fp8-e5m2-scaled.safetensors / z-img_fp8-e5m2.safetensors] (by drbaph)

(c) Z Image NVFP4 (for RTX 5090/5080 users)

(d) Z Image GGUF (by babakarto)

(e) Z Image GGUF (by GGUF Org)

Save the diffusion models into your ComfyUI/models/diffusion_models folder. 

For GGUF, save the model into the ComfyUI/models/unet folder instead. Make sure you have already installed the ComfyUI-GGUF custom node by City96. If you haven't, install it from the Manager. If it is already installed, update it using the update option.

3. Download the text encoder from the Hugging Face repository. Z Image Base uses the same text encoder and VAE models as Z Image Turbo, so you don't need to download them again if you already have them. If you don't, there are three text encoders to choose from:

(a) qwen_3_4b.safetensors

(b) qwen_3_4b_fp4_mixed.safetensors

(c) qwen_3_4b_fp8_mixed.safetensors

Choose the one that fits comfortably in your VRAM. Save it into the ComfyUI/models/text_encoders directory.

4. Download VAE (ae.safetensors) and save it into ComfyUI/models/vae folder.

5. Restart and refresh ComfyUI for the changes to take effect.
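The only placement decision in the steps above that depends on the file itself is where the diffusion model goes: GGUF files belong in the unet folder, everything else in diffusion_models. Here is a minimal Python sketch of that rule. The function name, the COMFYUI_ROOT location and the GGUF filename are made up for illustration; only the folder names come from this guide.

```python
from pathlib import Path

# Assumed install location; change this to wherever ComfyUI lives on your machine.
COMFYUI_ROOT = Path("ComfyUI")


def diffusion_model_folder(filename, root=COMFYUI_ROOT):
    """Return the folder a downloaded Z Image diffusion model belongs in."""
    # GGUF files go to models/unet, where the ComfyUI-GGUF loader looks for them.
    # Other variants (.safetensors) go to models/diffusion_models.
    is_gguf = filename.lower().endswith(".gguf")
    subfolder = "unet" if is_gguf else "diffusion_models"
    return root / "models" / subfolder


print(diffusion_model_folder("z_image_bf16.safetensors"))
print(diffusion_model_folder("z_image_q4.gguf"))  # hypothetical GGUF filename
```

This only computes a path. It does not move or download anything, so it is safe to run before you decide where to put a file.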


Workflow

1. Download the workflow (Z_Image_Base_T2Img.json) from our Hugging Face repository.

2. Drag and drop it into ComfyUI.

Z image workflow


(a) Load the Z Image (BF16/GGUF/FP8) model into the Load Diffusion Model node. If you are using a GGUF variant, replace the Load Diffusion Model node with the Unet Loader (GGUF) node.

(b) Select the Qwen3 4B text encoder (qwen_3_4b) and load it into the Load CLIP node.

(c) Load the VAE (ae.safetensors) into the Load VAE node.

(d) Enter your positive and negative prompts in the prompt boxes. Make sure the prompt is detailed and long enough. This gives you more consistent, prompt-adherent and realistic results.

(e) Set the KSampler settings:

    Resolution: 512x512 to 2048x2048
    CFG: 3.0 to 5.0
    Steps: 28 to 50
    Sampler: res_multistep
    Shift: 3

3. Hit Run to start generation. The results below are not cherry-picked.
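If you script your generations or just want a quick sanity check, the recommended ranges from step (e) can be written down as a small lookup. This is only a sketch: the dictionary keys and the check_settings function are names made up for this example, not part of ComfyUI, and the ranges are the ones listed in this guide.

```python
# Recommended ranges from this guide as (minimum, maximum) pairs.
RECOMMENDED = {
    "width": (512, 2048),
    "height": (512, 2048),
    "cfg": (3.0, 5.0),
    "steps": (28, 50),
}


def check_settings(settings):
    """Return a list of warnings for values outside the recommended ranges."""
    problems = []
    for key, (low, high) in RECOMMENDED.items():
        value = settings[key]
        if not low <= value <= high:
            problems.append(f"{key}={value} is outside the recommended {low} to {high}")
    return problems


# Example values picked from inside the ranges; sampler and shift are fixed
# choices in this guide, so they are carried along but not range-checked.
example = {
    "width": 1024,
    "height": 1024,
    "cfg": 4.0,
    "steps": 30,
    "sampler": "res_multistep",
    "shift": 3,
}

print(check_settings(example))  # []
```

An empty list means every value sits inside the ranges above. Values outside them are not forbidden; they are just not what this workflow was tested with.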

Cinematic 

 

Spongebob caught in forest

 

Prompt-  A realistic trail-camera photograph capturing realistic SpongeBob holding beer bottle in a dense forest at night, taken by an automated wildlife monitoring camera. The scene is illuminated by a sudden infrared flash, creating harsh, high-contrast lighting with blown-out highlights and deep shadows. SpongeBob appears slightly blurred due to motion, as if caught mid-step, giving the image an unposed, candid, accidental feel. The photo has low resolution with visible digital noise, grain, and heavy compression artifacts typical of trail-cam footage. The framing is awkward and off-center, reinforcing the sense that the subject was not intentionally photographed. A faint timestamp overlay appears in the corner of the image, adding authenticity. The overall aesthetic resembles genuine wildlife surveillance imagery—raw, imperfect, eerie, and documentary-like—blending realism with the uncanny feeling of an unexpected creature caught on a remote forest camera at night.

This result is fantastic. We tried the same prompt with Z Image Turbo and the output did not look like this, but Z Image Base did a great job.

 

Human Realism

very pretty caucasian girl

Prompt- very pretty caucasian girl at age 19(with subtle alternative-style makeup and short, curly brown hair with soft layers and see-through side bangs), her hair is styled as a wolf cut, she is sitting in the selfie, which was taken at night in paris, with an average looking tenement visible in the background, its in the rural part of the city with a park in the back. The angle is messy, with slight motion blur and overexposure. The overall vibe is that of a casually taken, mediocre or even failed selfie as if snapped without much thought or effort

 

Text Rendering

 pikachu holding board

Prompt- Ultra-realistic 3D mugshot of Pikachu, cute yet highly detailed, rendered in photorealistic CGI style. Pikachu is standing upright in a police mugshot setting, facing camera, with height measurement markings on a gray concrete wall behind him. He is holding a standard police name plate with clean, legible text reading: “PIKACHU — CASE #8484TYU — DATE: JAN 28, 2026” Pikachu has a shocked, wide-eyed expression, mouth slightly open, cheeks faintly glowing as if charged with electricity. Subtle sparks of yellow electricity crackle around his ears and tail, hinting at illegal electricity generation.  Fur is highly detailed and realistic, soft yet volumetric, with individual strands visible. Skin shading is physically accurate with subsurface scattering. Lighting is neutral police station fluorescent lighting, slightly harsh, casting realistic shadows. Camera is straight-on, eye-level, 50mm lens look, shallow depth of field.  Color grading is cinematic but grounded, cool gray background contrasting with Pikachu’s vibrant yellow. Overall tone blends cute humor with realistic crime-photo seriousness. Hyper-detailed, 8K quality, sharp focus, professional CGI render, realistic materials, clean composition.