Install WAN 2.2 in ComfyUI (Native, GGUF & FP8)

Install Wan2.2 in ComfyUI

If you have already worked with Wan 2.1, you probably know it's one of the most solid open-source models out there for image and video generation. But now, Wan 2.2 is about to raise the bar even higher.

Wan 2.2, released under the Apache 2.0 license, is the next-gen upgrade to the Wan 2.1 model suite, built on a similar architecture and developed by Alibaba. It's a more creative model designed to handle text-to-video, image generation, LoRA training, and even cross-modal workflows like turning a still image into a stylized motion clip. This is not just a minor version bump: Wan 2.2 is shaping up to be a full-on creative toolkit for visual storytelling.


Features

1. Sharper, Smarter Image Generation - Expect better resolution, crisper lines, and more style consistency. The rendering engine behind Wan 2.2 has been optimized to handle high-detail prompts with fewer artifacts.

2. Smoother Video Generation - If you have tried video generation in Wan 2.1, you already know it's decent. Wan 2.2 gives you more stable motion, better object tracking across frames, less flicker, and more fluid output.

3. Faster & Smarter LoRA Training - You can now fine-tune your own styles with just 10 to 20 images. Wan 2.2 speeds up the LoRA pipeline and even supports multi-model merging for wild style experiments.

4. Special Effects - Auto-stylized lighting; realistic smoke, fire, and water; even presets for weather, time-of-day, and ambiance.

5. Image/Video Integration - You can now go back and forth between images and video in a single pipeline, for example taking a still image and exporting it as a stylized 5-second clip.

Three models have been released officially:

- Wan2.2-TI2V-5B - Hybrid model (Text-to-Video and Image-to-Video), 5B parameters, 720p. A hybrid model that seamlessly handles both text-to-video and image-to-video tasks using a single architecture. Hugging Face repository: Wan2.2-TI2V-5B

- Wan2.2-I2V-A14B - Image-to-Video (MoE*), 14B parameters, 480p & 720p. Transforms static images into fluid, coherent videos while preserving visual consistency and natural motion. Hugging Face repository: Wan2.2-I2V-A14B

- Wan2.2-T2V-A14B - Text-to-Video (MoE*), 14B parameters, 480p & 720p. Creates cinematic-quality videos from text inputs with high aesthetic fidelity and accurate semantic alignment. Hugging Face repository: Wan2.2-T2V-A14B

*MoE means Mixture of Experts, an architecture in which only a subset of the model's parameters (the active experts) is used at each step, so the memory and compute per step stay close to those of a single expert even though the total parameter count is larger. In Wan 2.2's 14B models, the high-noise and low-noise checkpoints act as the two experts, swapped in at different stages of denoising.
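As a rough mental model (a toy sketch only, not Wan 2.2's actual implementation), you can picture the two 14B checkpoints as two experts, with only one of them running at any given denoising step:

```python
# Toy illustration of the Mixture-of-Experts idea described above.
# This is NOT Wan 2.2's real code; it only shows why a single step never
# touches more than one expert's parameters.
import torch
import torch.nn as nn

class TwoExpertDenoiser(nn.Module):
    def __init__(self, dim=64, switch_point=0.5):
        super().__init__()
        self.high_noise_expert = nn.Linear(dim, dim)  # early, very noisy steps
        self.low_noise_expert = nn.Linear(dim, dim)   # late, detail-refining steps
        self.switch_point = switch_point

    def forward(self, x, t):
        # t in [0, 1], where 1 = pure noise and 0 = clean.
        # Only ONE expert is evaluated per step.
        expert = self.high_noise_expert if t > self.switch_point else self.low_noise_expert
        return expert(x)

model = TwoExpertDenoiser()
x = torch.randn(1, 64)
print(model(x, t=0.9).shape)  # routed to the high-noise expert
print(model(x, t=0.2).shape)  # routed to the low-noise expert
```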

Installation

Install ComfyUI if you have not already. If you have, just update it from the Manager section by clicking "Update All".

Different compressed model variants have been released by the community. Choose the one that suits your requirements and system configuration.

The Native setup eats a lot of VRAM (around 60 GB of memory). If you are on lower VRAM, go for Kijai's FP8 quantized or GGUF variants.


A. Native Support

(a) Wan2.2 TI2V 5B (Hybrid Version)

1. Download the Wan2.2 Image-to-Video/Text-to-Video model (wan2.2_ti2v_5B_fp16.safetensors) and save it into the ComfyUI/models/diffusion_models folder. It supports both Image to Video and Text to Video workflows.

2. Download VAE (wan2.2_vae.safetensors) and save it into ComfyUI/models/vae folder.

3. Then, download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into ComfyUI/models/text_encoders folder.
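If you prefer to script these downloads, here is a minimal sketch using huggingface_hub. The repository id (Comfy-Org/Wan_2.2_ComfyUI_Repackaged) and its split_files/... subfolder layout are assumptions, so verify the exact paths on the model pages linked above before running, and adjust COMFY to wherever your ComfyUI folder lives.

```python
# Minimal download sketch for the 5B hybrid setup.
# Repo id and subfolder paths are assumptions -- check the Hugging Face pages first.
from pathlib import Path
import shutil
from huggingface_hub import hf_hub_download

COMFY = Path("ComfyUI")                              # your ComfyUI install folder
REPO = "Comfy-Org/Wan_2.2_ComfyUI_Repackaged"        # assumed repackaged-files repo

files = {
    "split_files/diffusion_models/wan2.2_ti2v_5B_fp16.safetensors": "models/diffusion_models",
    "split_files/vae/wan2.2_vae.safetensors": "models/vae",
    "split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors": "models/text_encoders",
}

for remote, target in files.items():
    cached = hf_hub_download(repo_id=REPO, filename=remote)   # downloads into the HF cache
    dest = COMFY / target / Path(remote).name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(cached, dest)                                  # place it where ComfyUI expects it
    print("saved", dest)
```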


(b) Wan2.2 14B T2V (Text to Video)

1. Download the Wan2.2 14B Text-to-Video High Noise model (wan2.2_t2v_high_noise_14B_fp8_scaled.safetensors) and the Wan2.2 14B Text-to-Video Low Noise model (wan2.2_t2v_low_noise_14B_fp8_scaled.safetensors) and save them into the ComfyUI/models/diffusion_models folder. You need both models: the high-noise variant handles the first denoising stage and the low-noise variant adds the details.

2. Download VAE (wan_2.1_vae.safetensors) and save it into ComfyUI/models/vae folder.

3. Next, download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into ComfyUI/models/text_encoders folder.


(c) Wan2.2 14B I2V (Image-to-Video)

1. Download the Wan2.2 Image-to-Video High Noise model (wan2.2_i2v_high_noise_14B_fp16.safetensors) and the Wan2.2 Image-to-Video Low Noise model (wan2.2_i2v_low_noise_14B_fp16.safetensors) and save them into the ComfyUI/models/diffusion_models folder. You need both models: the high-noise variant handles the first denoising stage and the low-noise variant adds the details.

2. Download VAE (wan_2.1_vae.safetensors) and save it into ComfyUI/models/vae folder.

3. Download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and place it into ComfyUI/models/text_encoders folder.
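The paired 14B checkpoints from sections (b) and (c) can be fetched the same way as the 5B files above; only the filenames change. As before, the repo id and subfolder are assumptions to verify on the model pages.

```python
# Same download pattern for the paired high/low-noise 14B models
# (assumed repo id and subfolder -- verify before running).
from pathlib import Path
import shutil
from huggingface_hub import hf_hub_download

COMFY = Path("ComfyUI")
REPO = "Comfy-Org/Wan_2.2_ComfyUI_Repackaged"        # assumed repo

for name in [
    "wan2.2_t2v_high_noise_14B_fp8_scaled.safetensors",   # Text-to-Video pair
    "wan2.2_t2v_low_noise_14B_fp8_scaled.safetensors",
    # For Image-to-Video, use the i2v pair instead:
    # "wan2.2_i2v_high_noise_14B_fp16.safetensors",
    # "wan2.2_i2v_low_noise_14B_fp16.safetensors",
]:
    cached = hf_hub_download(repo_id=REPO, filename=f"split_files/diffusion_models/{name}")
    dest = COMFY / "models/diffusion_models" / name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(cached, dest)
```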


B. FP8 Quantized by Kijai

1. Set up the Wan2.2 native support (described above) or Kijai's WAN 2.1 wrapper. If you have already done this, it is not required again.

Kijai wan2.2 Quantized fp8

2. Download the WAN 2.2 FP8 quantized variant (Image to Video / Text to Video / Text-Image to Video) from Kijai's Hugging Face repository. If interested, you can read more about FP8 quantized models in our quantized model tutorial.

The rest of the models (text encoders, VAE) are the same as used in the native support setup or Kijai's Wan 2.1 wrapper.
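If you are unsure which FP8 file you need, you can list the repository contents first and pick from there. The repo id below is an assumption (use the repository actually linked above if it differs); the file then goes in the same place as in the native setup.

```python
# List candidate Wan 2.2 files in Kijai's repo before downloading.
# The repo id is an assumption -- substitute the repository linked above.
from huggingface_hub import list_repo_files, hf_hub_download

REPO = "Kijai/WanVideo_comfy"   # assumed repo id for Kijai's quantized releases

for f in list_repo_files(REPO):
    if "2.2" in f and f.endswith(".safetensors"):
        print(f)

# Then fetch the one you want, e.g.:
# path = hf_hub_download(repo_id=REPO, filename="<chosen file>")
# and copy it into ComfyUI/models/diffusion_models as with the native files.
```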


C. GGUF Variants

install Comfyui gguf custom nodes

To use the Wan2.2 GGUF variants, you have to install the GGUF custom nodes from the Manager section by selecting the Custom Nodes Manager button. Search for ComfyUI-GGUF (by author City96) and hit Install.

If you are already using it, update it from the Manager tab by selecting Custom Nodes Manager.

You can read more details about GGUF from our quantized model tutorial if interested.

Now you can download GGUF variants from different developers:

wan2.2 gguf

(a) WAN 2.2 GGUF Image to Video 14B by bullerwins. Here, both the High Noise and Low Noise models are required for generation.

Wan2.2 GGUF

(b) WAN 2.2 GGUF Image/Text to Video 5B, WAN 2.2 GGUF Text to Video 14B, and WAN 2.2 GGUF Image to Video 14B by QuantStack.

The quantization levels range from Q2 (fastest generation, lowest quality) to Q8 (highest quality, slowest speed). Choose whichever suits your use case and system.

Save the model inside ComfyUI/models/unet directory. 

The rest of the models (text encoders, VAE) are the same as used in the native support setup.
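A scripted download for a GGUF file looks similar; it just lands in models/unet instead of models/diffusion_models. Both the repo id and the filename below are placeholders for illustration, so copy the exact names of the quant you picked from the QuantStack or bullerwins pages.

```python
# Sketch: fetch one GGUF quant into ComfyUI/models/unet.
# Repo id and filename are placeholders -- copy the real names from the repo page.
from pathlib import Path
import shutil
from huggingface_hub import hf_hub_download

REPO = "QuantStack/Wan2.2-TI2V-5B-GGUF"     # placeholder repo id
FILE = "Wan2.2-TI2V-5B-Q4_K_M.gguf"         # placeholder filename (pick your quant level)

cached = hf_hub_download(repo_id=REPO, filename=FILE)
dest = Path("ComfyUI/models/unet") / FILE
dest.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(cached, dest)
print("saved", dest)
```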


Workflow

Download Wan2.2 workflow

1. Download the workflows from our Hugging Face repository.

- Wan2.2_14B_I2V.json (Image to Video workflow)

- Wan2.2_14B_T2V.json (Text to Video workflow)

- Wan2.2_5B_Ti2V.json (Text/Image to Video workflow)

wan2.2 default workflow

You can also get the workflows from ComfyUI by navigating to All templates >> Video and selecting one of them. If you do not see them, you are running an outdated ComfyUI version; just update it from the Manager section (click Update All).

2. Drag and drop the workflow into ComfyUI.

If you are using the GGUF models, just replace the Load Diffusion Model node with the UNet Loader (GGUF) node.


(a) Load your image if using Image to Video.


load Wan2.2 model


(b) Choose the appropriate Wan2.2 model in the Load Diffusion Model node.

load text encoders

load wan2.2 vae

(c) Load the text encoders and VAE. If using the Text/Image-to-Video 5B hybrid model, choose the Wan2.2 VAE; for the other two (14B) models, use the Wan2.1 VAE.

set KSampler settings

(d) Set the KSampler configuration:

- Steps: 20
- CFG: 5
- Sampler: uni_pc


Put the positive and negative prompts

(e) Put your prompts into the prompt boxes. Finally, click the Run button to start generation.
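If you would rather drive the generation from a script instead of clicking Run, a minimal sketch is below. It assumes you exported the workflow with "Save (API Format)" (the filename here is hypothetical), that ComfyUI is running locally on its default port 8188, and that the export contains a standard KSampler node; treat it as a template rather than a drop-in.

```python
# Sketch: patch the KSampler settings in an API-format workflow and queue it.
import json
import random
import urllib.request

# Hypothetical filename -- export your workflow via "Save (API Format)" first.
with open("Wan2.2_14B_T2V_api.json") as f:
    workflow = json.load(f)

# Apply the settings from step (d) to every KSampler node in the export.
for node in workflow.values():
    if node.get("class_type") == "KSampler":
        node["inputs"].update({
            "steps": 20,
            "cfg": 5,
            "sampler_name": "uni_pc",
            "seed": random.randint(0, 2**32 - 1),
        })

# Queue the prompt on the locally running ComfyUI instance (default port 8188).
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```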


Wan2.2 Image To Video 5B FP16 output


Wan2.2 Text To Video 5B FP16 output


After running a bunch of tests, we found something really cool: the Wan2.2 5B (hybrid) model is amazing at turning an input (image/text) into video. The other two models also work perfectly well and generate more refined, detailed results, but they consume a lot more VRAM.

Some Tips for Faster Generation

Here is the interesting part: each frame in the video gets its own clean-up step (called a denoising timestep), and the very first frame starts off already clean.

1. Because of this, you can move the clean-up process forward frame by frame. That means you can keep generating new frames continuously, making it possible to create much longer videos without running into problems.
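As a rough illustration of that idea (a toy sketch of the scheduling only, not the actual Wan sampler), you can think of each frame carrying its own noise level, with earlier frames finishing first so new frames can keep being appended at the noisy end:

```python
# Toy illustration of a per-frame denoising schedule -- not Wan 2.2's real sampler.
def frame_timesteps(num_frames, total_steps=20, shift=2):
    """Frame 0 starts clean (noise level 0); each later frame starts `shift` steps noisier."""
    return [min(total_steps, i * shift) for i in range(num_frames)]

print(frame_timesteps(8))  # [0, 2, 4, 6, 8, 10, 12, 14]
# Every sampler iteration lowers each frame's remaining noise by one step, so the
# earliest frames reach 0 (fully denoised) first and can be emitted while fresh,
# fully noisy frames are appended at the end of the window.
```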

2. This workflow includes both Text to Video and Image to Video branches for the hybrid 5B model. To run only one of them, enable or bypass the relevant nodes with Ctrl+B.

3. You can add the Sage Attention node for a further speed-up during generation. Connect the Patch Sage Attention KJ node between the Load Diffusion Model node and the Model Sampling SD3 node.