Trained on 65.6% more images and 83.2% more videos than its predecessor Wan2.1, Wan2.2 leverages a Mixture of Experts (MoE) architecture; you can find more details in the research paper. This approach improves generalization across motion, semantics, and aesthetics, setting new benchmarks against both open-source and closed-source alternatives.
These are the Wan2.2 model variants released officially:
Models | Hugging Face repo | Description |
---|---|---|
T2V A14B | 🤗 Hugging Face | Text-to-Video MoE model that supports 480P & 720P |
I2V A14B | 🤗 Hugging Face | Image-to-Video MoE model that supports 480P & 720P |
TI2V 5B | 🤗 Hugging Face | Highly compressed VAE, T2V + I2V, supports 720P only |
S2V 14B | 🤗 Hugging Face | Speech-to-Video model that supports 480P & 720P |
Animate 14B | 🤗 Hugging Face | Character animation with replacement |
From finely curated cinematic aesthetics to a high-definition hybrid TI2V model that runs 720P at 24fps on a single 4090 GPU, Wan2.2 makes advanced video generation accessible. It offers both text-to-video and image-to-video support, all while maintaining efficiency and speed.
Let's see how to install it in ComfyUI.
Installation
First, install ComfyUI if you have not done so yet. If you already have it, just update it from the Manager by clicking "Update All".
Below, we list all the optimized Wan2.2 model variants released by the community. Select one based on your system requirements and use case; all the details are provided for each.
A. Native Support
The native support provided officially by ComfyUI takes almost 60GB of VRAM to load the full model. If you are struggling with VRAM, choose the quantized FP8, GGUF, or Lightning variants, which are optimized for low-end GPUs.
(a) Wan2.2 TI2V 5B (Hybrid Version)
This is the hybrid version that supports both Text to Video and Image to Video.
-Download Hybrid Model (wan2.2_ti2v_5B_fp16.safetensors) and save it into ComfyUI/models/diffusion_models folder.
-Download VAE (wan2.2_vae.safetensors) and save it into ComfyUI/models/vae folder.
-Download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into ComfyUI/models/text_encoders folder.
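If you prefer scripting these downloads instead of using the browser, here is a minimal Python sketch using huggingface_hub. The repo ID and in-repo paths are assumptions based on the Comfy-Org repackaged repository, so verify them on the Hugging Face pages linked above before running it.

```python
from pathlib import Path
import shutil
from huggingface_hub import hf_hub_download

REPO = "Comfy-Org/Wan_2.2_ComfyUI_Repackaged"  # assumed repo id -- check the actual page
COMFY = Path("ComfyUI")                        # adjust to your ComfyUI install path

# (file inside the repo, target ComfyUI model folder) -- in-repo paths are assumptions
FILES = [
    ("split_files/diffusion_models/wan2.2_ti2v_5B_fp16.safetensors", COMFY / "models/diffusion_models"),
    ("split_files/vae/wan2.2_vae.safetensors", COMFY / "models/vae"),
    ("split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors", COMFY / "models/text_encoders"),
]

for remote, target_dir in FILES:
    cached = hf_hub_download(repo_id=REPO, filename=remote)  # downloads into the HF cache
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(cached, target_dir / Path(remote).name)      # place it where ComfyUI looks
```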
(b) Wan2.2 14B T2V (Text to Video)
You need more VRAM to run this model, and the output quality is more refined than the hybrid model's.
-Download the High noise model (wan2.2_t2v_high_noise_14B_fp8_scaled.safetensors) and the Low noise model (wan2.2_t2v_low_noise_14B_fp8_scaled.safetensors), and save them into the ComfyUI/models/diffusion_models folder.
-Download VAE (wan_2.1_vae.safetensors) and save it into ComfyUI/models/vae folder.
-Download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into ComfyUI/models/text_encoders folder.
(c) Wan2.2 14B I2V (Image-to-Video)
You need more VRAM to run this model, and the output quality is more refined than the hybrid model's.
-Download High noise Model (wan2.2_i2v_high_noise_14B_fp16.safetensors) and Low noise model (wan2.2_i2v_low_noise_14B_fp16.safetensors) and save them into ComfyUI/models/diffusion_models folder.
-Download VAE (wan_2.1_vae.safetensors) and save it into ComfyUI/models/vae folder.
-Download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and put it into ComfyUI/models/text_encoders folder.
(d) S2V-14B (Speech to Video)
-Download the Speech to Video BF16 model (wan2.2_s2v_14B_bf16) for high-VRAM GPUs or the Speech to Video FP8 model (wan2.2_s2v_14B_fp8_scaled.safetensors) for low-VRAM GPUs, and save it into the ComfyUI/models/diffusion_models folder.
-Download the Wav2Vec2 FP16 audio encoder (wav2vec2_large_english_fp16) and save it into the ComfyUI/models/audio_encoders folder. Create this folder if it does not exist.
-Download VAE and save it into ComfyUI/models/vae folder.
-Download Text Encoder and put it into ComfyUI/models/text_encoders folder.
(e) Wan2.2 Animate 14B (Animating Character)
-Download Wan 2.2 Animate 14B model (wan2.2_animate_14B_bf16.safetensors) , then save it into ComfyUI/models/diffusion_models folder.
-Download VAE (wan_2.1_vae.safetensors) and save it into ComfyUI/models/vae folder.
-Download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and put it into ComfyUI/models/text_encoders folder.
-Download Wan Relight Lora (WanAnimate_relight_lora_fp16.safetensors or WanAnimate_relight_lora_fp16_resized_from_128_to_dynamic_22.safetensors) and then save it into ComfyUI/models/loras folder.
-Download the LightX2V I2V distill LoRA (lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors) and save it into the ComfyUI/models/loras folder as well.
B. Wan2.2 Quantized FP8 by Kijai
These model variants are quantized by the developer Kijai for low-VRAM users. You may notice some loss in generation quality, but it is minimal and manageable.
To use these models, you need Kijai's ComfyUI-WanVideoWrapper installed from the Manager. If it is already installed, just update it.
You can also follow our Wan2.1 tutorial on Kijai's WanVideoWrapper setup if you do not know how to do the setup.
(a) Wan2.2 TI2V 5B (Hybrid Version)
This is the hybrid version that supports both Text to Video and Image to Video. If you have less than 12GB of VRAM, you can use this model variant.
-Download the hybrid TI2V model and save it into the ComfyUI/models/diffusion_models folder. You will get two versions: Wan2_2-TI2V-5B_fp8_e4m3fn_scaled_KJ.safetensors (older optimization) and Wan2_2-TI2V-5B_fp8_e5m2_scaled_KJ.safetensors (newer optimization). Select either one.
-Download VAE (wan2.2_vae.safetensors) and save it into ComfyUI/models/vae folder.
-Download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into ComfyUI/models/text_encoders folder.
(b) Wan2.2 14B T2V (Text to Video)
You need at least 12GB of VRAM to run this model, and the output quality is more refined than the hybrid model's.
-Download the High noise and Low noise T2V models and save them into the ComfyUI/models/diffusion_models folder. If you are confused about what to download, just pick a High/Low noise pair that shares the same name.
-Download VAE (wan_2.1_vae.safetensors) and save it into ComfyUI/models/vae folder.
-Download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into ComfyUI/models/text_encoders folder.
(c) Wan2.2 14B I2V (Image-to-Video)
You need at least 12GB of VRAM to run this model, and the output quality is more refined than the hybrid model's.
-Download the High noise and Low noise I2V models and save them into the ComfyUI/models/diffusion_models folder. Just download a High/Low noise pair that shares the same name.
-Download VAE (wan_2.1_vae.safetensors) and save it into ComfyUI/models/vae folder.
-Download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and put it into ComfyUI/models/text_encoders folder.
(d) S2V-14B (Speech to Video)
-Download Speech to Video FP8 Model (Wan2_2-S2V-14B_fp8_e4m3fn_scaled_KJ.safetensors) and save it into ComfyUI/models/diffusion_models folder.
-Download the Wav2Vec2 FP16 audio encoder (wav2vec2_large_english_fp16) and save it into the ComfyUI/models/audio_encoders folder. Create this folder if it does not exist.
-Download VAE and save it into ComfyUI/models/vae folder.
-Download Text Encoder and put it into ComfyUI/models/text_encoders folder.
(e) Animate-14B (Animation Control)
-Download the Wan 2.2 Animate FP8 model (either Wan2_2-Animate-14B_fp8_e4m3fn_scaled_KJ.safetensors for RTX 40-series GPUs or Wan2_2-Animate-14B_fp8_e5m2_scaled_KJ.safetensors for RTX 30-series GPUs) and save it into the ComfyUI/models/diffusion_models folder.
-Download VAE and save it into ComfyUI/models/vae folder.
-Download Text Encoder and put it into ComfyUI/models/text_encoders folder.
C. Wan 2.2 GGUF variant
These GGUF model variants were created by the community for users with low VRAM. You may notice some loss in generation quality, but it is bearable and minimal.
1. Install the GGUF custom nodes by City96: open the Manager, select Custom Nodes Manager, and search for ComfyUI-GGUF by author City96. If you do not know what GGUF models are, learn about them in our quantized model tutorial.
2. If it is already installed, just update it from the Manager: open Custom Nodes Manager, search for ComfyUI-GGUF, and click the Update button.
(a) Wan2.2 TI2V 5B (Hybrid Version)
This is the hybrid version that supports both Text to Video and Image to Video. If you have less than 12GB of VRAM, you can use this model variant.
-Download the hybrid model (by developer QuantStack) and save it into the ComfyUI/models/unet folder. The quants range from Q2 (faster, lower precision, lower quality) to Q8 (slower inference, higher precision, higher quality). The file sizes are listed below; a small helper for choosing a quant follows this subsection.
2-bit Q2_K 1.85 GB
3-bit Q3_K_S 2.29 GB, Q3_K_M 2.55 GB
4-bit Q4_K_S 3.12 GB, Q4_0 3.03 GB, Q4_1 3.25 GB, Q4_K_M 3.43 GB
5-bit Q5_K_S 3.56 GB, Q5_0 3.64 GB, Q5_1 3.87 GB, Q5_K_M 3.81 GB
6-bit Q6_K 4.21 GB
8-bit Q8_0 5.4 GB
-Download VAE (wan2.2_vae.safetensors) and save it into ComfyUI/models/vae folder.
-Download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into ComfyUI/models/text_encoders folder.
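If you are unsure which quant to pick, the sketch below chooses the largest 5B GGUF file that fits a given VRAM budget, leaving headroom for the VAE, text encoder, and activations. The 4 GB headroom figure is an assumption for illustration, not a measured value.

```python
# File sizes (GB) from the list above for the TI2V 5B GGUF quants.
QUANTS_5B = {
    "Q2_K": 1.85, "Q3_K_S": 2.29, "Q3_K_M": 2.55, "Q4_0": 3.03, "Q4_K_S": 3.12,
    "Q4_1": 3.25, "Q4_K_M": 3.43, "Q5_K_S": 3.56, "Q5_0": 3.64, "Q5_K_M": 3.81,
    "Q5_1": 3.87, "Q6_K": 4.21, "Q8_0": 5.4,
}

def pick_quant(vram_gb: float, headroom_gb: float = 4.0) -> str:
    """Return the largest quant whose file fits in vram_gb minus a rough headroom."""
    budget = vram_gb - headroom_gb
    fitting = {name: size for name, size in QUANTS_5B.items() if size <= budget}
    return max(fitting, key=fitting.get) if fitting else "Q2_K"

print(pick_quant(8))   # -> "Q5_1" under these assumptions; an 8 GB card lands in Q5 territory
```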
(b) Wan2.2 14B T2V (Text to Video)
-Download the High noise and Low noise models (by developer QuantStack) or the High noise and Low noise models (by developer Bullerwins), and save them into the ComfyUI/models/unet folder. Select and download a High/Low noise pair with the same name.
The quants range from Q2 (faster, lower precision, lower quality) to Q8 (slower inference, higher precision, higher quality). For example, if you select the High noise Q2, also select the Low noise Q2 from the same developer's release. File sizes (identical for the high-noise and low-noise files):
2-bit Q2_K 5.3 GB
3-bit Q3_K_S 6.51 GB, Q3_K_M 7.17 GB
4-bit Q4_K_S 8.75 GB, Q4_0 8.56 GB, Q4_1 9.26 GB, Q4_K_M 9.65 GB
5-bit Q5_K_S 10.1 GB, Q5_0 10.3 GB, Q5_1 11 GB, Q5_K_M 10.8 GB
6-bit Q6_K 12 GB
8-bit Q8_0 15.4 GB
-Now, download VAE (wan_2.1_vae.safetensors) and save it into ComfyUI/models/vae folder.
-Also download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into ComfyUI/models/text_encoders folder.
(c) Wan2.2 14B I2V (Image-to-Video)
-Download the High noise and Low noise models (by developer QuantStack) or the High noise and Low noise models (by developer Bullerwins), and save them into the ComfyUI/models/unet folder. Select and download a High/Low noise pair with the same name.
The quants range from Q2 (faster, lower precision, lower quality) to Q8 (slower inference, higher precision, higher quality). For example, if you select the High noise Q2, also select the Low noise Q2 from the same developer's release. File sizes (identical for the high-noise and low-noise files):
2-bit Q2_K 5.3 GB
3-bit Q3_K_S 6.52 GB, Q3_K_M 7.18 GB
4-bit Q4_K_S 8.75 GB, Q4_0 8.56 GB, Q4_1 9.26 GB, Q4_K_M 9.65 GB
5-bit Q5_K_S 10.1 GB, Q5_0 10.3 GB, Q5_1 11 GB, Q5_K_M 10.8 GB
6-bit Q6_K 12 GB
8-bit Q8_0 15.4 GB
-Download VAE (wan_2.1_vae.safetensors) and save it into ComfyUI/models/vae folder.
-Download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and put it into ComfyUI/models/text_encoders folder.
(d) S2V-14B (Speech to Video)
-Download GGUF Speech to Video Model and save it into ComfyUI/models/unet folder.
Model file sizes for each quant:
2-bit Q2_K 9.51 GB
3-bit Q3_K_S 10.7 GB, Q3_K_M 11.4 GB
4-bit Q4_K_S 13 GB, Q4_0 12.8 GB, Q4_1 13.5 GB, Q4_K_M 13.9 GB
5-bit Q5_K_S 14.3 GB, Q5_0 14.5 GB, Q5_1 15.2 GB, Q5_K_M 15 GB
6-bit Q6_K 16.2 GB
8-bit Q8_0 19.6 GB
-Download the Wav2Vec2 FP16 audio encoder (wav2vec2_large_english_fp16) and save it into the ComfyUI/models/audio_encoders folder. Create this folder if it does not exist.
-Download VAE and save it into ComfyUI/models/vae folder.
-Download Text Encoder and put it into ComfyUI/models/text_encoders folder.
(e) Wan 2.2 Animate 14B
-Download the Wan 2.2 Animate 14B GGUF model and save it into the ComfyUI/models/unet folder.
Model file sizes for each quant:
2-bit Q2_K 6.46 GB
3-bit Q3_K_S 7.97 GB, Q3_K_M 8.63 GB
4-bit Q4_K_S 10.6 GB, Q4_0 10.4 GB, Q4_K_M 11.5 GB
5-bit Q5_K_S 12.3 GB, Q5_0 12.5 GB, Q5_K_M 13 GB
6-bit Q6_K 14.6 GB
8-bit Q8_0 18.7 GB
-Download VAE and save it into ComfyUI/models/vae folder.
-Download Text Encoder and put it into ComfyUI/models/text_encoders folder.
D. Wan 2.2 Lightning Lora
These are the light LoRA versions of Wan2.2 released by the developer LightX2V. They are distilled variants of Wan2.2 that require only 4 steps with minimal CFG, resulting in up to 20x faster output while still handling complex motion. Users struggling with low VRAM can use this setup to get fast video generation.
Note- These LoRA models have to be used with any of the Wan2.2 diffusion models (explained above).
(a) Wan2.2 14B I2V (Image-to-Video)
Make sure you are using the actual High noise and Low noise I2V diffusion models.
-Download the Wan2.2 14B I2V Lightning High noise (high_noise_model.safetensors) and Low noise (low_noise_model.safetensors) LoRAs from the Hugging Face repository.
Save them into the ComfyUI/models/loras folder, then use the same diffusion models, VAE, and text encoders as described above.
(b) Wan2.2 14B T2V (Text to Video)
The Text to Video LoRA has two variants, V1 and V1.1. The video quality from V1.1 is somewhat better than from V1; choose either one.
Make sure you are using the actual High noise and Low noise T2V diffusion models.
-Download the Wan2.2 14B T2V Lightning V1 High noise (high_noise_model.safetensors) and Low noise (low_noise_model.safetensors) LoRAs from the Hugging Face repository.
-Download the Wan2.2 14B T2V Lightning V1.1 High noise (high_noise_model.safetensors) and Low noise (low_noise_model.safetensors) LoRAs from the Hugging Face repository.
Save them into the ComfyUI/models/loras folder, then use the same diffusion models, VAE, and text encoders as described above. A minimal wiring sketch follows below.
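For reference, here is a hypothetical API-format fragment showing how a Lightning LoRA can sit between the UNETLoader and the rest of the graph via a LoraLoaderModelOnly node. The node id, LoRA filename, and strength are illustrative assumptions, not values taken from an official workflow.

```python
# Hypothetical API-format node: apply the Lightning high-noise LoRA on top of the
# high-noise diffusion model (mirror this for the low-noise LoRA/model pair).
lightning_lora_high = {
    "class_type": "LoraLoaderModelOnly",
    "inputs": {
        "lora_name": "wan2.2_t2v_lightning_v1.1_high_noise.safetensors",  # your renamed file
        "strength_model": 1.0,
        "model": ["37", 0],  # MODEL output of the high-noise UNETLoader node
    },
}
# With both Lightning LoRAs in place, drop the samplers to about 4 total steps and CFG 1.
```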
Workflow
1. Download the Wan2.2 workflows from our Hugging Face repository. These workflows support all model variants (Native / Kijai's setup / GGUF); a Python sketch for queueing them via the ComfyUI API follows the list below.
By the way, if you are working with Kijai's setup and want Kijai's original workflows, just navigate to your ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper/example_workflows folder.
If using the diffusion models, use the Load Diffusion Model node; for GGUF models, use the Unet Loader (GGUF) node.
(a) Wan2.2_14B_I2V.json (Wan2.2_14B Image to Video workflow)
(b) Wan2.2_14B_T2V.json (Wan2.2_14B Text to Video workflow)
(c) Wan2.2_5B_Ti2V.json (Wan2.2_5B Text to video and Image to Video workflow)
(d) Wan2.2_14B_S2V.json (Wan2.2_14B Speech to Video workflow)
(e) Wan2.2_14B_Animate.json (Wan2.2 Animate Character workflow). If you are interested in the detailed workflow, follow the Wan 2.2 Animate Video to Video Pose Transfer tutorial.
You can also get the workflows inside ComfyUI by navigating to All templates >> Video and selecting one of them. If you do not see them, you are using an older ComfyUI version; just update it from the Manager by selecting Update All.
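If you would rather queue these workflows from a script than from the browser tab, a minimal sketch using ComfyUI's HTTP API is shown below. It assumes a local server on the default port 8188 and a workflow re-exported in API format (via Export (API) in the ComfyUI menu); the filename is a placeholder.

```python
import json
import urllib.request

# The workflow .json files above are UI-format; re-export them with "Export (API)"
# before queueing them like this.
with open("Wan2.2_14B_T2V_api.json") as f:   # placeholder filename for your API export
    workflow = json.load(f)

payload = json.dumps({"prompt": workflow}).encode("utf-8")
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",          # default local ComfyUI server address
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())              # returns a prompt_id you can look up in /history
```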
Wan2.2_5B_Ti2V Workflow
[Image: Wan2.2 5B Text/Image To Video Workflow]
1. Overview
Purpose: take a start image, convert it into latents for an image-to-video run, sample new latents using the Wan2.2 model + CLIP text prompts, decode the latents to images, stitch the images into a video, and save it.
Execution order (logical): Load models >> Encode prompts >> Load start image >> Image-to-video latent node >> Sampler (KSampler) >> VAE Decode >> CreateVideo >> SaveVideo. (See the node links in the JSON for the exact wiring.)
2. Step 1 — Load models (group: "Step1 - Load models")
UNETLoader (node id 37)
Model loaded: wan2.2_ti2v_5B_fp16.safetensors
This provides the diffusion UNET used for sampling.
CLIPLoader (node id 38)
Text encoder: umt5_xxl_fp8_e4m3fn_scaled.safetensors
Supplies the CLIP/text encoder used by the CLIPTextEncode nodes.
VAELoader (node id 39)
VAE file: wan2.2_vae.safetensors
Used to decode latents back to images.
Note: the MarkdownNote in the graph shows exact model filenames and where to place them under ComfyUI/models (diffusion_models, text_encoders, vae).
3. Step 2 — Prompt encoding
CLIPTextEncode (Positive prompt) — node id 6
Positive prompt text (use as-is or edit):
Output: CONDITIONING (link id 46) goes to KSampler positive input.
CLIPTextEncode (Negative prompt) — node id 7 (titled "CLIP Text Encode (Negative Prompt)")
Negative prompt text (use as-is or edit):
Output: CONDITIONING (link id 52) goes to KSampler negative input.
4. Step 3 — Start image Image-to-video latent
LoadImage (node id 56)
Widget values: "example.png", "image"
Output IMAGE (link id 106) connected to Wan22ImageToVideoLatent start_image input.
Replace "example.png" with your actual start image filename (must be accessible to ComfyUI).
Wan22ImageToVideoLatent (node id 55)
Inputs: vae (from VAELoader id 39) and start_image (from LoadImage id 56).
Widget values shown: [1280, 704, 121, 1]
These are the image-to-video settings saved in this node. They correspond to the video/image dimensions and length:
1280 = width (px)
704 = height (px)
121 = number of frames (video length in frames)
1 = an extra parameter (often batch/steps/loop setting depending on node implementation)
If you want a different resolution or length, change the first three numbers (width, height, frames).
Output: LATENT (link id 104) goes to KSampler as the "latent_image" input.
Tip in the graph: "For i2v, use Ctrl + B to enable" — enable the image-to-video mode as instructed in the UI if needed.
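If you change the width/height/frames values, a quick sanity check of the resulting clip length looks like this (the 24 fps figure comes from the CreateVideo node later in this workflow):

```python
# Approximate clip duration for the [width, height, frames] settings above.
width, height, frames = 1280, 704, 121
fps = 24  # matches the CreateVideo node in this workflow
print(f"{width}x{height}, {frames} frames ≈ {frames / fps:.1f} s at {fps} fps")  # ≈ 5.0 s
```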
5. Step 4 — Model selection & sampling
ModelSamplingSD3 (node id 48)
Receives MODEL from the UNETLoader and passes it to the KSampler (link id 95 >> node 3).
Widget values: [8] (kept from this workflow — no need to change unless you understand the node internals).
KSampler (node id 3)
Inputs wired:
model (from ModelSamplingSD3)
positive conditioning (from CLIPTextEncode node id 6)
negative conditioning (from CLIPTextEncode node id 7)
latent_image (from Wan22ImageToVideoLatent node id 55)
Widget values present in the workflow: [898471028164125, "randomize", 20, 5, "uni_pc", "simple", 1]
Interpreting these values (typical mapping for KSampler):
898471028164125 = seed (a big integer). Because "randomize" is set the actual seed will be randomized each run unless you change seed mode to fixed.
"randomize" = seed mode (randomize vs fixed).
20 = sampler steps (how many diffusion steps; higher = slower but often higher quality).
5 = guidance scale (CFG) (strength of conditioning; higher emphasizes prompt more).
"uni_pc" = sampler algorithm (this workflow uses the uni_pc sampler).
"simple" = scheduling mode (sampling schedule).
1 = batch count (how many images/latents per run).
To reproduce the same result every run, set seed mode to a fixed seed (and record the seed number).
Output: LATENT (link id 35) goes to VAEDecode.
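To pin those sampler values programmatically (for example, a fixed seed for reproducible runs) you could patch an API-format export of this workflow. The node id "3" matches the KSampler above, but the filename and the assumption that your export keeps the same node ids should be double-checked.

```python
import json

with open("Wan2.2_5B_Ti2V_api.json") as f:   # hypothetical API-format export of this workflow
    wf = json.load(f)

# Standard KSampler inputs in API format; only the widget values are overridden here.
wf["3"]["inputs"].update({
    "seed": 898471028164125,   # fixed seed -> reproducible results
    "steps": 20,
    "cfg": 5.0,
    "sampler_name": "uni_pc",
    "scheduler": "simple",
    "denoise": 1.0,
})

with open("Wan2.2_5B_Ti2V_api_fixed_seed.json", "w") as f:
    json.dump(wf, f, indent=2)
```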
6. Step 5 — Decode latents to images
VAEDecode (node id 8)
Inputs: samples (LATENT from KSampler) and vae (VAE loader).
Decodes latents into IMAGE.
Output IMAGE (link id 107) goes to CreateVideo.
7. Step 6 — Create video
CreateVideo (node id 57)
Input: images (IMAGE from VAEDecode)
Widget values show: [24] — likely the frames per second (fps) used to make the video (24 fps is typical).
Optional audio input exists but is not connected in this workflow.
Output: VIDEO (link id 108) goes to SaveVideo.
8. Step 7 — Save video
SaveVideo (node id 58)
Input: video (from CreateVideo)
Widget values: ["video/ComfyUI", "auto", "auto"]
"video/ComfyUI" is the output folder/path (relative to ComfyUI working dir).
"auto" file naming / format options are selected here (ComfyUI will pick filename/format automatically unless you change these).
Final output: saved video file in the specified folder.
Wan2.2_14B_I2V Workflow
[Image: Wan2.2 14B Image To Video Workflow]
1. Overview
This workflow converts a start image into a video using the Wan2.2 14B Image-to-Video models.
Flow: Load models >> Encode prompts >> Load image >> Convert to video latents >> Two-stage sampling (high noise + low noise) >> Decode latents >> Assemble video >> Save video.
2. Step 1 - Load models
UNETLoader (High noise, id 37)
Loads wan2.2_i2v_high_noise_14B_fp8_scaled.safetensors.
This is the diffusion model for high-noise stage.
UNETLoader (Low noise, id 56)
Loads wan2.2_i2v_low_noise_14B_fp8_scaled.safetensors.
This is for the refinement (low-noise) stage.
CLIPLoader (id 38)
Loads umt5_xxl_fp8_e4m3fn_scaled.safetensors.
Provides text encoder for prompts.
VAELoader (id 39)
Loads wan_2.1_vae.safetensors.
Needed to decode latents into images.
3. Step 2 - Upload start image
LoadImage (id 62)
Default: "example.png". Replace this with your own start image.
Output connects into the WanImageToVideo node.
4. Step 3 - Video size & length
WanImageToVideo (id 63)
Inputs: positive prompt, negative prompt, VAE, and start image.
Settings: [1280, 720, 121, 1]
Width = 1280 px
Height = 720 px
Frames = 121 (≈ 5 seconds at 24 fps)
Last "1" is extra param (batch/loop).
Outputs: latent + conditioning for samplers.
5. Step 4 - Prompt encoding
Positive Prompt (id 6)
Negative Prompt (id 7)
6. Two-stage sampling
Stage 1 (High noise)
ModelSamplingSD3 (id 54) connects the high-noise UNET >> KSamplerAdvanced (id 57).
KSamplerAdvanced (id 57)
Widgets: enable, 1042664824122032, randomize, 20 steps, 3.5 CFG, euler, simple, 0, 10, enable.
This generates the first latent output.
Stage 2 (Low noise refinement)
ModelSamplingSD3 (id 55) connects the low-noise UNET >> KSamplerAdvanced (id 58).
KSamplerAdvanced (id 58)
Widgets: disable, 0, fixed, 20 steps, 3.5 CFG, euler, simple, 10, 10000, disable.
Takes latent from Stage 1 and refines it.
7. Decode & video assembly
VAEDecode (id 8)
Converts the final latents >> images.
CreateVideo (id 60)
Assembles frames into video. FPS = 24.
SaveVideo (id 61)
Saves to video/ComfyUI folder. Filename/format auto.
Wan2.2_14B_T2V Workflow
[Image: Wan2.2 14B Text To Video Workflow]
1. Overview
This workflow generates a video directly from text (text-to-video).
Flow: Load models >> Encode prompts >> Create empty latent video >> Two-stage sampling (high noise + low noise) >> Decode latents >> Assemble video >> Save video.
2. Step 1 — Load models
UNETLoader (High noise, id 37)
Loads wan2.2_t2v_high_noise_14B_fp8_scaled.safetensors.
UNETLoader (Low noise, id 56)
Loads wan2.2_t2v_low_noise_14B_fp8_scaled.safetensors.
CLIPLoader (id 38)
Loads umt5_xxl_fp8_e4m3fn_scaled.safetensors.
Used to process text prompts.
VAELoader (id 39)
Loads wan_2.1_vae.safetensors.
Decodes latents into images.
3. Step 2 — Video size (optional)
EmptyHunyuanLatentVideo (id 59)
Creates an empty latent video sequence to be filled in by the model.
Settings: [1280, 704, 121, 1]
Width = 1280 px
Height = 704 px
Frames = 121 (≈ 5 seconds at 24 fps)
Last "1" is an extra param (batch/loop).
4. Step 3 — Prompt encoding: add the Positive Prompt (id 6) and Negative Prompt (id 7).
5. Two-stage sampling
Stage 1 (High noise)
ModelSamplingSD3 (id 54) connects high-noise UNET >> KSamplerAdvanced (id 57).
KSamplerAdvanced (id 57)
Widgets: enable, 774388746670969, randomize, 20 steps, 3.5 CFG, euler, simple, 0, 10, enable.
Uses empty latent video as input and generates first-pass latents.
Stage 2 (Low noise refinement)
ModelSamplingSD3 (id 55) connects low-noise UNET >> KSamplerAdvanced (id 58).
KSamplerAdvanced (id 58)
Widgets: disable, 0, fixed, 20 steps, 3.5 CFG, euler, simple, 10, 10000, disable.
Refines the latents from Stage 1.
6. Decode & assemble video
VAEDecode (id 8)
Converts refined latents >> image frames.
CreateVideo (id 60)
Collects images into a video. FPS = 24.
SaveVideo (id 61)
Saves output to video/ComfyUI folder. Filename/format auto.
Wan2.2 14B Speech To Video Workflow
[Image: Wan2.2 14B Speech To Video Workflow]
Wan2.2 14B Animate Workflow
[Image: Wan2.2 14B Animate Workflow]
Wan2.2 I2V5B FP16 Test
[Image: Wan2.2 I2V5B FP16 output]
Wan2.2 T2V5B FP16 Test
[Image: Wan2.2 T2V5B FP16 output]
Wan2.2 S2V Test
Some Important Tips for video generation:
1. In the Wan2.2 High/Low noise workflows, the first KSamplerAdvanced node samples with the Wan2.2 high-noise model from start step 0 to end step 10 (out of 20 total sampler steps). This means 50% of the denoising is done by the high-noise model; the leftover noise is then handed to the second KSamplerAdvanced node (by enabling the "return with leftover noise" parameter), which uses the Wan2.2 low-noise model to finish the remaining steps. A sketch of the corresponding sampler settings follows these tips.
2. After trying multiple generations, we observed that the way Wan2.2 5B handles I2V and timesteps is excellent. Each latent frame gets its own denoising timestep, and the first frame is simply set as completely denoised. This means you should be able to use a sliding denoise-timestep window for infinitely long video generation.
3. The Text/Image To Video 5B hybrid workflow includes both Text to Video and Image to Video. To generate with either one, just enable or bypass the relevant group using Ctrl+B.
4. You can add the Sage Attention node to speed up generation further. Connect the Patch Sage Attention KJ node between the Load Diffusion Model node and the ModelSamplingSD3 node.
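To make tip 1 concrete, here are the two KSamplerAdvanced configurations from the 14B workflows written out as API-format input dictionaries. The input names follow the stock KSamplerAdvanced node, and the seed is simply the value saved in the example workflow; treat the fragment as a sketch rather than a drop-in file.

```python
# Stage 1: Wan2.2 high-noise model handles steps 0-10 of 20.
high_noise_sampler = {
    "add_noise": "enable",
    "noise_seed": 1042664824122032,
    "steps": 20, "cfg": 3.5,
    "sampler_name": "euler", "scheduler": "simple",
    "start_at_step": 0, "end_at_step": 10,
    "return_with_leftover_noise": "enable",   # hand the remaining noise to stage 2
}

# Stage 2: Wan2.2 low-noise model finishes from step 10 onward.
low_noise_sampler = {
    "add_noise": "disable",                   # noise was already added in stage 1
    "noise_seed": 0,
    "steps": 20, "cfg": 3.5,
    "sampler_name": "euler", "scheduler": "simple",
    "start_at_step": 10, "end_at_step": 10000,
    "return_with_leftover_noise": "disable",
}
```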
Prompting Tips:
To get the best output from the Wan2.2 model, you need precise and detailed prompting.
1. Shot Order
-Describe the scene like a movie shot.
-Start with what the camera sees first.
-Then describe how the camera moves.
-Finish with what is revealed or shown at the end.
Example: A mountain at dawn -- camera tilts up slowly -- reveals a flock of birds flying overhead.
2. Camera Language
Use clear terms to tell the model how the camera should move:
-pan left/right – camera turns horizontally
-tilt up/down – camera moves up or down
-dolly in/out – camera moves forward or backward
-orbital arc – camera circles around a subject
-crane up – camera rises vertically
Wan 2.2 understands these better than the older version.
3. Motion Modifiers
Add words to describe how things move:
-Speed: slow-motion, fast pan, time-lapse
-Depth/motion cues: describe how things in the foreground/background move differently to show 3D depth
e.g., "foreground leaves flutter, background hills stay still"
4. Aesthetic Tags
Add cinematic style:
-Lighting: harsh sunlight, soft dusk, neon glow, etc.
-Color Style: teal-orange, black-and-white, film-like tones (e.g., Kodak Portra)
-Lens or Film Style: 16mm film grain, blurry backgrounds (bokeh), CGI, etc.
These help define the look and feel of the scene.
5. Timing & Resolution Settings
Keep clips short: 5 seconds or less
-Use around 120 frames max
-Use 16 or 24 FPS (frames per second) – 16 is faster to test
-Use lower resolution (like 960×540) to test quickly, or higher (1280×720) for final output
6. Negative Prompt
This part tells the AI what you don’t want in the video. Defaults cover things like:
-bad quality, weird-looking hands/faces
-overexposure, bright colors, still images
-text, compression artifacts, clutter, too many background people
This helps avoid common AI issues.
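Putting these prompting tips together, here is a purely illustrative way to assemble a positive and negative prompt in code; the wording is an example, not a recommended default.

```python
# Build a prompt from the elements above: shot order, camera language,
# motion/depth cues, and aesthetic tags, plus a negative prompt.
positive = ", ".join([
    "A mountain at dawn",                                      # what the camera sees first
    "camera tilts up slowly",                                   # camera language
    "reveals a flock of birds flying overhead",                 # what is shown at the end
    "foreground pines sway, background peaks stay still",       # motion / depth cues
    "soft dawn lighting, teal-orange grade, 16mm film grain",   # aesthetic tags
])
negative = "bad quality, deformed hands and faces, overexposure, static image, text, compression artifacts"
print(positive)
```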