Modern image generation models have become incredibly powerful, but they still come with several limitations behind the scenes. Most systems rely on multiple separate components working together, such as external VAEs, independent text encoders, and task-specific pipelines. While this setup works, it often creates inefficiencies and inconsistencies during image generation.
HiDream-O1-Image aims to solve these challenges with a fully unified approach to image generation. Instead of combining disconnected systems, the model is built around a Pixel-level Unified Transformer (UiT) that directly processes raw pixels, text, and task-specific conditions within a single shared token space. This creates a more streamlined and native generation pipeline.
Figure: HiDream O1 showcase
Whether it’s text-to-image creation, image editing, subject-driven personalization, storyboard generation, or long text rendering, HiDream-O1-Image is designed to manage everything within one unified architecture.
Figure: Textual representation
The research behind HiDream-O1-Image focuses on simplifying and strengthening the foundation of image generation models. Traditional diffusion systems typically depend on external Variational Autoencoders (VAEs) to compress images into latent spaces before generation. They also rely on separate text encoders to process prompts independently. You can find in-depth information in the research paper.
Figure: ID preservation
While effective, this fragmented design can limit coherence between text understanding and image synthesis. HiDream-O1-Image removes these separations entirely. Its Pixel-level Unified Transformer directly encodes raw pixels alongside textual and conditional information into a shared representation space. This unified structure allows the model to better align visual understanding with language comprehension.
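To make the idea of a "shared token space" more concrete, here is a minimal, purely illustrative sketch. It assumes a ViT-style scheme in which raw pixels are cut into small patches and placed in the same sequence as text tokens. The source does not describe how HiDream-O1-Image actually tokenizes pixels, so the functions `patchify` and `build_sequence`, the patch size, and the text-first ordering are all hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # one RGB pixel
Image = List[List[Pixel]]      # rows of pixels (height x width)


def patchify(image: Image, patch: int) -> List[List[int]]:
    """Split an RGB image into non-overlapping patch x patch tiles.

    Each tile is flattened into one list of raw channel values, so a
    tile becomes a single "pixel token" of patch * patch * 3 numbers.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat: List[int] = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])
            tokens.append(flat)
    return tokens


def build_sequence(text_ids: List[int], image: Image, patch: int):
    """Return one sequence of (kind, payload) entries: text first, then pixels.

    Hypothetical layout: a real model would embed both kinds of token
    into vectors of the same width before the transformer sees them.
    """
    sequence = [("text", token_id) for token_id in text_ids]
    sequence += [("pixel", token) for token in patchify(image, patch)]
    return sequence


if __name__ == "__main__":
    # A 4x4 dummy image and three made-up text token ids.
    image = [[(x, y, 0) for x in range(4)] for y in range(4)]
    for kind, payload in build_sequence([7, 8, 9], image, patch=2):
        print(kind, payload)
```

The point of the sketch is only that text and pixels end up in one sequence handled by one model, with no VAE latent in between.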
Installation
1. Install ComfyUI if you have not already. Existing users should update it from the Manager.
2. Download the HiDream O1 models officially repacked by ComfyUI.
There are multiple variants to choose from. Use the one that suits your system resources:
(a) hidream_o1_image_bf16.safetensors
(b) hidream_o1_image_dev_bf16.safetensors
(c) hidream_o1_image_dev_fp8_scaled.safetensors
(d) hidream_o1_image_dev_mxfp8.safetensors
(e) hidream_o1_image_fp8_scaled.safetensors
(f) hidream_o1_image_mxfp8.safetensors
BF16 gives the highest quality but consumes more memory. FP8 (8-bit floating point) uses less VRAM at the cost of some quality degradation. MXFP8 is a microscaling FP8 format with hardware-level support on NVIDIA's latest Blackwell GPUs (such as the RTX 5090), for better quality and faster generation.
Here, hidream_o1_image is the base variant and hidream_o1_image_dev is the distilled variant, which supports text-to-image and image-to-image. Save the file into the ComfyUI/models/checkpoints folder.
3. Download the text encoder (gemma4_e4b_it_fp8_scaled.safetensors) and save it into the ComfyUI/models/text_encoders folder.
This is a unified transformer model, so there is no separate VAE to download.
4. Restart and refresh ComfyUI.
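If the models do not show up after restarting, a quick check of the folder layout can help. The following is a small sketch, not an official tool: it assumes the standard ComfyUI layout (models/checkpoints and models/text_encoders under the ComfyUI root) and uses the filenames listed in the steps above. The helper names `describe` and `check_install` are made up for this example.

```python
from pathlib import Path
from typing import Dict, List, Tuple

# Filenames as listed in the download steps.
CHECKPOINTS = [
    "hidream_o1_image_bf16.safetensors",
    "hidream_o1_image_dev_bf16.safetensors",
    "hidream_o1_image_dev_fp8_scaled.safetensors",
    "hidream_o1_image_dev_mxfp8.safetensors",
    "hidream_o1_image_fp8_scaled.safetensors",
    "hidream_o1_image_mxfp8.safetensors",
]
TEXT_ENCODER = "gemma4_e4b_it_fp8_scaled.safetensors"
PREFIX = "hidream_o1_image_"


def describe(filename: str) -> Tuple[str, str]:
    """Split a checkpoint filename into (variant, precision)."""
    stem = Path(filename).stem  # drops the ".safetensors" suffix
    if not stem.startswith(PREFIX):
        raise ValueError(f"unexpected checkpoint name: {filename}")
    rest = stem[len(PREFIX):]
    if rest.startswith("dev_"):
        return "dev", rest[len("dev_"):]
    return "base", rest


def check_install(comfy_root: Path) -> Dict[str, List[str]]:
    """Report which of the expected files exist under comfy_root."""
    ckpt_dir = comfy_root / "models" / "checkpoints"
    enc_dir = comfy_root / "models" / "text_encoders"
    found = [name for name in CHECKPOINTS if (ckpt_dir / name).is_file()]
    encoder = [TEXT_ENCODER] if (enc_dir / TEXT_ENCODER).is_file() else []
    return {"checkpoints": found, "text_encoder": encoder}


if __name__ == "__main__":
    # Assumption: the script is run from the folder that contains ComfyUI.
    report = check_install(Path("ComfyUI"))
    for name in report["checkpoints"]:
        variant, precision = describe(name)
        print(f"found {variant} checkpoint ({precision}): {name}")
    if not report["checkpoints"]:
        print("no HiDream O1 checkpoint found in models/checkpoints")
    if not report["text_encoder"]:
        print("text encoder missing from models/text_encoders")
```

Only one checkpoint variant needs to be present, so an incomplete list of checkpoints is not an error by itself.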
Workflow
1. Download the workflows from our Hugging Face repository:
(a) HiDream O1 base (Hidream_O1_base.json)
(b) HiDream O1 Dev (Hidream_O1_dev.json)
2. Drag and drop the workflow into ComfyUI.
3. Load the HiDream O1 model in the Load Checkpoint node and the text encoder in its corresponding node.
4. Add your text prompt in the prompt box.
5. Set the KSampler settings:
HiDream O1 Image (base): Steps 50, CFG 5.0
HiDream O1 Image Dev: Steps 28, CFG 5.0
6. Hit run to start generation.
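For anyone who prefers to script the sampler settings instead of typing them into the node, the sketch below applies the values from step 5 to a workflow dictionary. It rests on two assumptions: the workflow has been exported from ComfyUI in API format (a mapping of node id to an object with `class_type` and `inputs`), and the sampler node's `class_type` is `KSampler`. The downloaded JSON files may be in the regular UI format, in which case this does not apply directly. The helper name `apply_sampler_settings` is made up for this example.

```python
import copy
from typing import Any, Dict

# Sampler values taken from the settings listed in step 5.
SAMPLER_SETTINGS = {
    "base": {"steps": 50, "cfg": 5.0},
    "dev": {"steps": 28, "cfg": 5.0},
}


def apply_sampler_settings(workflow: Dict[str, Any], variant: str) -> Dict[str, Any]:
    """Return a copy of an API-format workflow with steps and CFG set.

    Every node whose class_type is "KSampler" is updated; all other
    nodes and all other sampler inputs (seed, scheduler, ...) are kept.
    """
    settings = SAMPLER_SETTINGS[variant]
    updated = copy.deepcopy(workflow)
    for node in updated.values():
        if isinstance(node, dict) and node.get("class_type") == "KSampler":
            node.setdefault("inputs", {}).update(settings)
    return updated


if __name__ == "__main__":
    # Tiny stand-in for an exported workflow; node ids and inputs are made up.
    workflow = {
        "3": {"class_type": "KSampler", "inputs": {"steps": 20, "cfg": 8.0, "seed": 1}},
        "4": {"class_type": "CheckpointLoaderSimple", "inputs": {"ckpt_name": "example"}},
    }
    print(apply_sampler_settings(workflow, "dev")["3"]["inputs"])
```

The function returns a modified copy, so the original workflow stays untouched and the same file can be reused for both the base and Dev settings.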
Some results using the HiDream O1 Image base and Dev models:
Test 1 (Short text)
Prompt- A realistic airport departure board inside a crowded international terminal with travelers walking around and luggage carts moving nearby. The digital board contains multiple rows of perfectly aligned text including destinations, times, and status messages. One highlighted row clearly reads: “Flight AI-302 — DELAYED”. The typography should remain sharp, aligned, and fully readable even with multiple lines of information.
Test 2 (Long text)
Prompt- A highly detailed close-up of a modern smartphone displaying a messaging app conversation in dark mode. One visible long message bubble contains the exact text: “Hey, I might be late for the meeting because traffic near downtown is completely blocked right now. Please start without me if necessary, and I’ll join as soon as I can. Also don’t forget to bring the presentation files.” The text should appear naturally inside the UI with realistic spacing, emojis, timestamps, and authentic smartphone typography.
Test 3 (Art with realism)
Prompt- Ultra-photorealistic close-up portrait of a natural 20-year-old woman standing beside a window during golden hour, soft sunlight illuminating realistic skin pores, peach fuzz, subtle freckles, detailed irises, slightly messy hair strands, natural lips without excessive makeup, shallow depth of field, cinematic photography, DSLR realism, authentic imperfections, realistic shadows, high dynamic range.
Test 4 (Closeup shot)
Prompt- Ultra-macro close-up photograph of a single glowing firefly resting on a dew-covered leaf at night, extremely detailed translucent wings, realistic glowing abdomen emitting soft bioluminescent light, visible micro textures on the insect body, shallow depth of field, cinematic nature photography, dark forest background with subtle bokeh.













