Wan2.2 VideoGen locally in ComfyUI (FP16/FP8/GGUF)

install wan2.2 models into comfyui

Creating high-quality, cinematic videos with AI has always been a challenge. Models often hit limits in performance or visual quality, or require heavy computing power that makes them impractical for most creators. Wan2.2 steps in as a major upgrade of Wan2.1, designed by Alibaba to deliver sharper visuals, smoother motion, and greater creative control without demanding extra resources.

Trained on +65.6% more images and +83.2% more videos than its predecessor Wan2.1, Wan2.2 leverages a Mixture of Experts (MoE) architecture; you can find more details in the research paper. This approach improves generalization across motion, semantics, and aesthetics, setting new benchmarks against both open-source and closed-source alternatives.


These are the Wan2.2 model variants released officially:

Model | Hugging Face repo | Description
T2V-A14B | 🤗 Hugging Face | Text-to-Video MoE model, supports 480P & 720P
I2V-A14B | 🤗 Hugging Face | Image-to-Video MoE model, supports 480P & 720P
TI2V-5B | 🤗 Hugging Face | High-compression VAE, T2V + I2V, supports 720P
S2V-14B | 🤗 Hugging Face | Speech-to-Video model, supports 480P & 720P


From finely curated cinematic aesthetics to a high-definition hybrid TI2V model that runs 720P at 24fps on a single 4090 GPU, Wan2.2 makes advanced video generation accessible. It offers both text-to-video and image-to-video support, all while maintaining efficiency and speed. 

Let's see how to install it in ComfyUI.


Installation

First, install ComfyUI if you have not already. If you have, just update it from the Manager section by clicking "Update All".

Update ComfyUI from Manager

We are listing all the optimized Wan2.2 model variants released by the community. Select one based on your system requirements and use case; all the details are provided below.

A. Native Support

The native support provided officially by ComfyUI takes almost 60GB of VRAM to load the full model. If you are struggling with VRAM, choose the quantized FP8/GGUF variants that are optimized for low-end GPUs.

(a) Wan2.2 TI2V 5B (Hybrid Version)

This is the hybrid version that supports both Text-to-Video and Image-to-Video. If you have low VRAM (less than 12GB), you can use this model variant.


-Download the hybrid model (wan2.2_ti2v_5B_fp16.safetensors) and save it into the ComfyUI/models/diffusion_models folder.

-Download the VAE (wan2.2_vae.safetensors) and save it into the ComfyUI/models/vae folder.

-Download the text encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into the ComfyUI/models/text_encoders folder.
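If you prefer scripting the downloads, here is a minimal Python sketch using the huggingface_hub library. The repo id (Comfy-Org/Wan_2.2_ComfyUI_Repackaged) and the in-repo file paths are assumptions based on ComfyUI's repackaged releases, so verify them on the model page before running:

```python
# Minimal download sketch. The repo id and in-repo paths are assumptions --
# verify them on the Hugging Face model page before running.
# Assumes you run this from the directory that contains ComfyUI/ and that
# the target folders already exist.
import shutil
from huggingface_hub import hf_hub_download

REPO = "Comfy-Org/Wan_2.2_ComfyUI_Repackaged"  # assumed repackaged repo

files = {
    # in-repo path (assumed)                                            -> ComfyUI target folder
    "split_files/diffusion_models/wan2.2_ti2v_5B_fp16.safetensors":     "ComfyUI/models/diffusion_models",
    "split_files/vae/wan2.2_vae.safetensors":                           "ComfyUI/models/vae",
    "split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors": "ComfyUI/models/text_encoders",
}

for repo_path, target_dir in files.items():
    cached = hf_hub_download(repo_id=REPO, filename=repo_path)  # downloads into the HF cache
    shutil.copy(cached, target_dir)                              # copy into the ComfyUI folder
```

Downloading the files manually in the browser and dropping them into the same folders works just as well; the script is only a convenience.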

(b) Wan2.2 14B T2V (Text to Video)

You need at least 12GB of VRAM to run this model, and the quality you get will be more refined than the hybrid one.


-Download the high-noise model (wan2.2_t2v_high_noise_14B_fp8_scaled.safetensors) and the low-noise model (wan2.2_t2v_low_noise_14B_fp8_scaled.safetensors), and save both into the ComfyUI/models/diffusion_models folder.

-Download the VAE (wan_2.1_vae.safetensors) and save it into the ComfyUI/models/vae folder.

-Download the text encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into the ComfyUI/models/text_encoders folder.
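The 14B workflows need both the high-noise and the low-noise checkpoint, so a short loop keeps the pair together. This reuses the same assumed repo id and in-repo paths as the earlier sketch; verify them before running:

```python
# Sketch: fetch the 14B T2V high/low-noise pair (assumed repo id and paths).
import shutil
from huggingface_hub import hf_hub_download

REPO = "Comfy-Org/Wan_2.2_ComfyUI_Repackaged"  # assumption -- verify on the model page
TARGET = "ComfyUI/models/diffusion_models"     # assumes the folder already exists

for name in (
    "wan2.2_t2v_high_noise_14B_fp8_scaled.safetensors",
    "wan2.2_t2v_low_noise_14B_fp8_scaled.safetensors",
):
    cached = hf_hub_download(repo_id=REPO, filename=f"split_files/diffusion_models/{name}")
    shutil.copy(cached, TARGET)
```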

(c) Wan2.2 14B I2V (Image-to-Video)

You need at least 12GB of VRAM to run this model, and the quality you get will be more refined than the hybrid one.

-Download the high-noise model (wan2.2_i2v_high_noise_14B_fp16.safetensors) and the low-noise model (wan2.2_i2v_low_noise_14B_fp16.safetensors), and save both into the ComfyUI/models/diffusion_models folder.

-Download the VAE (wan_2.1_vae.safetensors) and save it into the ComfyUI/models/vae folder.

-Download the text encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and put it into the ComfyUI/models/text_encoders folder.

 

B. Wan2.2 Quantized FP8 by Kijai

These model variants were quantized by developer Kijai for low-VRAM users. You may notice a slight drop in generation quality, but it is minimal and manageable.

To use these models, you need Kijai's ComfyUI-WanVideoWrapper custom node installed from the Manager. If it is already installed, just update it.

kijai's wan2.2 quantized models 

(a) Wan2.2 TI2V 5B (Hybrid Version)

This is the hybrid version that supports both Text-to-Video and Image-to-Video. If you have low VRAM (less than 12GB), you can use this model variant.

download wan2.2 txt-img to video hybrid 

-Download the hybrid TI2V model and save it into the ComfyUI/models/diffusion_models folder. You will find two versions: Wan2_2-TI2V-5B_fp8_e4m3fn_scaled_KJ.safetensors (older optimization) and Wan2_2-TI2V-5B_fp8_e5m2_scaled_KJ.safetensors (newer optimization). Select either one.

-Download the VAE (wan2.2_vae.safetensors) and save it into the ComfyUI/models/vae folder.

-Download the text encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into the ComfyUI/models/text_encoders folder.

(b) Wan2.2 14B T2V (Text to Video)

You need at least 12GB of VRAM to run this model, and the quality you get will be more refined than the hybrid one.


-Download the high-noise and low-noise T2V models and save them into the ComfyUI/models/diffusion_models folder. If you are confused about what to download, just pick the high-noise and low-noise files with matching names, as they are released as pairs.

-Download the VAE (wan_2.1_vae.safetensors) and save it into the ComfyUI/models/vae folder.

-Download the text encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into the ComfyUI/models/text_encoders folder.

(c) Wan2.2 14B I2V (Image-to-Video)

You need at least 12GB of VRAM to run this model, and the quality you get will be more refined than the hybrid one.

-Download the high-noise and low-noise I2V models and save them into the ComfyUI/models/diffusion_models folder. Just download the high-noise and low-noise files with matching names, as they are released as pairs.

-Download the VAE (wan_2.1_vae.safetensors) and save it into the ComfyUI/models/vae folder.

-Download the text encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and put it into the ComfyUI/models/text_encoders folder.


(d) S2V-14B (Speech to Video)

-Download the Speech-to-Video model and save it into the ComfyUI/models/diffusion_models folder.

 

C. Wan2.2 GGUF Variants

These are GGUF-quantized variants developed by the community for optimized output on low-VRAM systems. You may notice a slight drop in generation quality, but it is minimal and manageable.

install comfyui gguf custom nodes

1. Install the GGUF custom nodes by City96: open the Manager, go to the Custom Nodes Manager, search for ComfyUI-GGUF by author city96, and click Install.

2. If it is already installed, just update it from the same Custom Nodes Manager screen.


(a) Wan2.2 TI2V 5B (Hybrid Version)

This is the hybrid version that supports both Text-to-Video and Image-to-Video. If you have low VRAM (less than 12GB), you can use this model variant.

By developer QuantStack

-Download the hybrid GGUF model and save it into the ComfyUI/models/unet folder. The quantization levels range from Q2 (faster, lower precision, lower quality) to Q8 (slower inference, higher precision, higher-quality generation); see the download sketch after this list.

-Download the VAE (wan2.2_vae.safetensors) and save it into the ComfyUI/models/vae folder.

-Download the text encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into the ComfyUI/models/text_encoders folder.
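Because each GGUF repo hosts every quantization level, you normally grab just one file. Below is a hedged sketch; the repo id (QuantStack/Wan2.2-TI2V-5B-GGUF) and the exact .gguf filename are assumptions, so check the repo's file list and substitute the quant that fits your VRAM:

```python
# Sketch: pull a single GGUF quant into the unet folder.
# Repo id and filename are assumptions -- browse the repo and adjust.
import shutil
from huggingface_hub import hf_hub_download

cached = hf_hub_download(
    repo_id="QuantStack/Wan2.2-TI2V-5B-GGUF",  # assumed repo id
    filename="Wan2.2-TI2V-5B-Q4_K_M.gguf",     # assumed filename: pick your quant level
)
shutil.copy(cached, "ComfyUI/models/unet")      # assumes the folder already exists
```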

(b) Wan2.2 14B T2V (Text to Video)

You need at least 12GB of VRAM to run this model, and the quality you get will be more refined than the hybrid one.

By Developer Quantstack

-Download a high-noise model and a low-noise model, and save them into the ComfyUI/models/unet folder. Select and download the high-noise and low-noise files that share the same quantization level in their names.

The quantization levels range from Q2 (faster, lower precision, lower quality) to Q8 (slower inference, higher precision, higher-quality generation). For example, if you select the Q2 high-noise model, also select the Q2 low-noise model.

-Now, download the VAE (wan_2.1_vae.safetensors) and save it into the ComfyUI/models/vae folder.

-Also download the text encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into the ComfyUI/models/text_encoders folder.

(c) Wan2.2 14B I2V (Image-to-Video)

You need at least 12GB of VRAM to run this model, and the quality you get will be more refined than the hybrid one.

-Download a high-noise model and a low-noise model, and save them into the ComfyUI/models/unet folder. The quantization levels range from Q2 (faster, lower precision, lower quality) to Q8 (slower inference, higher precision, higher-quality generation). For example, if you select the Q2 high-noise model, also select the Q2 low-noise model.

-Download the VAE (wan_2.1_vae.safetensors) and save it into the ComfyUI/models/vae folder.

-Download the text encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and put it into the ComfyUI/models/text_encoders folder.

 

(d) S2V-14B (Speech to Video)

-Download the GGUF Speech-to-Video model and save it into the ComfyUI/models/unet folder.

 

Workflow


wan2.2 workflows

1. Download the Wan2.2 workflows from our Hugging Face repository. These workflows work with all model variants (native / Kijai's setup / GGUF). If you are using GGUF models, just replace the Load Diffusion Model node with the Unet Loader (GGUF) node.


 (a) Wan2.2_14B_I2V.json (Wan2.2_14B Image to Video workflow)

 (b) Wan2.2_14B_T2V.json (Wan2.2_14B Text to Video workflow)

 (c) Wan2.2_5B_Ti2V.json (Wan2.2_5B Text to video and Image to Video workflow)

You can also get the workflows from within ComfyUI by navigating to All templates >> Video and selecting one of them. If you do not see these templates, you are using an older ComfyUI version; just update it from the Manager section by selecting Update All.


2. Drag and drop the workflow into ComfyUI.

(a) Load an image if you are using the Image-to-Video workflow.

(b) Load the Wan2.2 model into the Load Diffusion Model node, or into the Unet Loader node for GGUF.

(c) Load the text encoder.

(d) Load the VAE.

(e) Set the KSampler settings and add your prompts into the prompt boxes.

(f) Click the Run button to execute the workflow.
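As an alternative to clicking Run, ComfyUI also exposes a small HTTP API you can queue prompts against. The sketch below assumes a local server at 127.0.0.1:8188 and a workflow saved in API format (ComfyUI's "Save (API Format)" / "Export (API)" option); a regular UI .json export will not work here unmodified, and the filename wan22_api.json is just a placeholder:

```python
# Sketch: queue an API-format workflow JSON against a locally running ComfyUI server.
import json
import urllib.request

with open("wan22_api.json", "r", encoding="utf-8") as f:  # placeholder filename
    workflow = json.load(f)

payload = json.dumps({"prompt": workflow}).encode("utf-8")
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",                        # default local ComfyUI address
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())  # returns a prompt_id you can look up under /history
```

This is handy if you want to batch several prompts without touching the UI.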



Important Tips: 

1. In the Wan2.2 high/low-noise workflow, the first KSampler (Advanced) node denoises with the Wan2.2 high-noise model from start step 0 to end step 10 (out of 20 total sampler steps). In other words, the first 50% of the process is handled by the high-noise model; because "return with leftover noise" is enabled, the partially denoised latent is then passed to the second KSampler (Advanced) node, which loads the Wan2.2 low-noise model and completes steps 10 through 20.
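To make the step split concrete, here is a small illustrative sketch of how the two KSampler (Advanced) nodes are typically configured in this workflow (keys follow the node's widget names; your exact values may differ):

```python
# Illustration only: how the 20 steps are split across the two KSampler (Advanced) nodes.
total_steps = 20

high_noise_sampler = {
    "model": "wan2.2 high-noise",
    "add_noise": "enable",
    "steps": total_steps,
    "start_at_step": 0,
    "end_at_step": 10,                       # first 50% of the denoising
    "return_with_leftover_noise": "enable",  # hand the still-noisy latent onward
}

low_noise_sampler = {
    "model": "wan2.2 low-noise",
    "add_noise": "disable",                  # the latent already carries the leftover noise
    "steps": total_steps,
    "start_at_step": 10,
    "end_at_step": 20,                       # finish the remaining 50%
    "return_with_leftover_noise": "disable",
}
```

The high-noise expert handles the large structural moves early in sampling, and the low-noise expert refines detail in the later steps.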


2. After trying multiple generations, we observed that the way Wan2.2 5B handles I2V and timesteps is impressive. Each latent frame has its own denoising timestep, and the first frame is simply set as completely denoised. This means you should be able to use a sliding denoising-timestep window and generate arbitrarily long videos.

3. The Text/Image-to-Video 5B hybrid workflow includes both the Text-to-Video and the Image-to-Video branches. To run only one of them, select the branch you want to toggle and enable/bypass it with Ctrl+B.

4. You can add a Sage Attention node to further speed up generation. Connect the Patch Sage Attention KJ node between the Load Diffusion Model node and the Model Sampling SD3 node.


Prompting Tips:

To get the best output from the Wan2.2 model, you need clear and detailed prompting.

1. Shot Order

-Describe the scene like a movie shot.
-Start with what the camera sees first.
-Then describe how the camera moves.
-Finish with what is revealed or shown at the end.

Example: A mountain at dawn -- camera tilts up slowly -- reveals a flock of birds flying overhead.



 2. Camera Language

 Use clear terms to tell the model how the camera should move:

-pan left/right – camera turns horizontally
-tilt up/down – camera moves up or down
-dolly in/out – camera moves forward or backward
-orbital arc – camera circles around a subject
-crane up – camera rises vertically

Wan 2.2 understands these better than the older version.



 3. Motion Modifiers

 Add words to describe how things move:

-Speed: slow-motion, fast pan, time-lapse
-Depth/motion cues: describe how things in the foreground/background move differently to show 3D depth

     e.g., "foreground leaves flutter, background hills stay still"



 4. Aesthetic Tags

 Add cinematic style:

-Lighting: harsh sunlight, soft dusk, neon glow, etc.
-Color Style: teal-orange, black-and-white, film-like tones (e.g., Kodak Portra)
-Lens or Film Style: 16mm film grain, blurry backgrounds (bokeh), CGI, etc.

These help define the look and feel of the scene.



 5. Timing & Resolution Settings

-Keep clips short: 5 seconds or less

-Use around 120 frames max (see the small helper after this list for how frame count, FPS, and duration relate)

-Use 16 or 24 FPS (frames per second); 16 is faster for testing

-Use a lower resolution (like 960×540) to test quickly, or a higher one (1280×720) for final output
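Frame count is simply FPS × duration, and Wan-family models commonly expect a length of the form 4n+1 (for example 81 or 121 frames) because of the VAE's temporal compression. The helper below is a small sketch under that assumption, not an official rule for every variant:

```python
# Sketch: choose a frame count for a target duration.
# Assumption: Wan-family models prefer lengths of the form 4n + 1.
def frame_count(fps: int, seconds: float) -> int:
    frames = round(fps * seconds)
    return 4 * round((frames - 1) / 4) + 1  # snap to the nearest 4n + 1

print(frame_count(16, 5))  # 81  -- quick test renders
print(frame_count(24, 5))  # 121 -- final-quality renders
```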



 6. Negative Prompt

This part tells the AI what you don't want in the video. The default negative prompt covers things like:
-bad quality, weird-looking hands/faces
-overexposure, overly bright colors, still images
-text, compression artifacts, clutter, too many background people

This helps avoid common AI issues.