Wan2.2 VideoGen locally in ComfyUI (FP16/FP8/GGUF)

install wan2.2 model in comfyui

Creating high-quality, cinematic videos with AI has always been a challenge. Models often hit limits in performance or visuals, or require heavy computing power that makes them impractical for most creators. Wan2.2, a major upgrade to Wan2.1 released by Alibaba, steps in to deliver sharper visuals, smoother motion, and greater creative control without demanding extra resources.

Trained on 65.6% more images and 83.2% more videos than its predecessor Wan2.1, Wan2.2 leverages a Mixture-of-Experts (MoE) architecture; you can find more details in the research paper. This approach improves generalization across motion, semantics, and aesthetics, setting new benchmarks against both open-source and closed-source alternatives.


These Wan2.2 model variants have been released officially:

| Model | Hugging Face repo | Description |
|---|---|---|
| T2V A14B | 🤗 Hugging Face | Text-to-Video MoE model, supports 480P & 720P |
| I2V A14B | 🤗 Hugging Face | Image-to-Video MoE model, supports 480P & 720P |
| TI2V 5B | 🤗 Hugging Face | Highly compressed VAE, T2V + I2V, supports 720P only |
| S2V 14B | 🤗 Hugging Face | Speech-to-Video model, supports 480P & 720P |
| Animate 14B | 🤗 Hugging Face | Animation with replacement |


From finely curated cinematic aesthetics to a high-definition hybrid TI2V model that runs 720P at 24fps on a single 4090 GPU, Wan2.2 makes advanced video generation accessible. It offers both text-to-video and image-to-video support, all while maintaining efficiency and speed. 

Let's see how we can install it in ComfyUI.


Installation

First, install ComfyUI if you have not already. If you have, just update it from the Manager section by clicking on "Update All".

Update ComfyUI from Manager

We are listing all the optimized Wan2.2 model variants released by the community. Select one according to your system requirements and use case; all the details are provided below.

A. Native Support

The native support provided officially by ComfyUI takes almost 60GB of VRAM to load the model. If you are struggling with VRAM, choose the quantized FP8, GGUF, or Lightning variants that are optimized for low-end GPUs.

(a) Wan2.2 TI2V 5B (Hybrid Version)

This is the hybrid version that supports both Text to Video and Image to Video. 

-Download Hybrid Model (wan2.2_ti2v_5B_fp16.safetensors)  and save it into ComfyUI/models/diffusion_models folder. 

-Download VAE (wan2.2_vae.safetensors) and save it into ComfyUI/models/vae folder.

-Download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into ComfyUI/models/text_encoders folder.
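
If you prefer scripting the downloads instead of clicking through the links, here is a minimal Python sketch using huggingface_hub. The repo id (Comfy-Org/Wan_2.2_ComfyUI_Repackaged) and the in-repo file paths are assumptions about how the repackaged files are laid out, so verify them on the linked Hugging Face pages; the same pattern works for every variant below.

```python
# Minimal sketch: fetch the TI2V 5B files and copy them into the ComfyUI model
# folders. Repo id and in-repo paths are assumptions -- check the Hugging Face
# pages linked above for the exact values.
import shutil
from pathlib import Path
from huggingface_hub import hf_hub_download

COMFYUI = Path("ComfyUI")                      # path to your ComfyUI install
REPO = "Comfy-Org/Wan_2.2_ComfyUI_Repackaged"  # assumed repackaged repo

files = [
    # (in-repo path, destination subfolder) -- assumed layout
    ("split_files/diffusion_models/wan2.2_ti2v_5B_fp16.safetensors", "models/diffusion_models"),
    ("split_files/vae/wan2.2_vae.safetensors", "models/vae"),
    ("split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors", "models/text_encoders"),
]

for repo_path, subdir in files:
    cached = hf_hub_download(repo_id=REPO, filename=repo_path)  # downloads to the HF cache
    dest = COMFYUI / subdir / Path(repo_path).name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(cached, dest)                                   # place it where ComfyUI looks
    print("saved", dest)
```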

(b) Wan2.2 14B T2V (Text to Video)

You need more VRAM to run this model and the quality you get will be more refined than the Hybrid one. 


-Download the High noise model (wan2.2_t2v_high_noise_14B_fp8_scaled.safetensors) and the Low noise model (wan2.2_t2v_low_noise_14B_fp8_scaled.safetensors), and save them into ComfyUI/models/diffusion_models folder.

-Download VAE (wan_2.1_vae.safetensors) and save it into ComfyUI/models/vae folder.

-Download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into ComfyUI/models/text_encoders folder.

(c) Wan2.2 14B I2V (Image-to-Video)

You need more VRAM to run this model, and the quality you get will be more refined than the Hybrid one.

-Download High noise Model (wan2.2_i2v_high_noise_14B_fp16.safetensors)  and Low noise model (wan2.2_i2v_low_noise_14B_fp16.safetensors)   and save them into ComfyUI/models/diffusion_models folder.

-Download VAE (wan_2.1_vae.safetensors) and save it into ComfyUI/models/vae folder.

-Download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and put it into ComfyUI/models/text_encoders folder.


(d) S2V-14B (Speech to Video) 

-Download the Speech to Video BF16 model (wan2.2_s2v_14B_bf16) if you have high VRAM, or the Speech to Video FP8 model (wan2.2_s2v_14B_fp8_scaled.safetensors) for low VRAM, and save it into ComfyUI/models/diffusion_models folder.

-Download the Wav2Vec2 FP16 model (wav2vec2_large_english_fp16) and save it into ComfyUI/models/audio_encoders folder. Create this folder if it does not exist.

 -Download VAE  and save it into ComfyUI/models/vae folder.

-Download Text Encoder  and put it into ComfyUI/models/text_encoders folder.

 

(e) Wan2.2 Animate 14B (Animating Character)

-Download Wan 2.2 Animate 14B model (wan2.2_animate_14B_bf16.safetensors) , then save it into ComfyUI/models/diffusion_models folder.

-Download VAE (wan_2.1_vae.safetensors) and save it into ComfyUI/models/vae folder.

-Download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and put it into ComfyUI/models/text_encoders folder.

-Download Wan Relight Lora (WanAnimate_relight_lora_fp16.safetensors or WanAnimate_relight_lora_fp16_resized_from_128_to_dynamic_22.safetensors) and then save it into ComfyUI/models/loras folder.

-Download the LightX2V I2V model (lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors) and save it into ComfyUI/models/loras folder, the same place as the other LoRAs.



B. Wan2.2 Quantized FP8 by Kijai

These model variants are quantized by developer Kijai for low-VRAM users. You may notice some loss in generation quality, but it is minimal and manageable.

To use these models, you need Kijai's ComfyUI-WanVideoWrapper custom node installed from the Manager. If it is already installed, just update it.

You can also follow our Wan2.1 tutorial on Kijai's Wan Video wrapper setup if you do not know how to do the setup.
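
If you prefer installing the wrapper manually rather than through the Manager, here is a rough sketch of the usual custom-node install. It assumes git and pip are on your PATH, that the repo lives at github.com/kijai/ComfyUI-WanVideoWrapper, and that ComfyUI sits in ./ComfyUI; adjust the paths for your setup.

```python
# Manual install/update of Kijai's ComfyUI-WanVideoWrapper custom node (sketch).
import subprocess
from pathlib import Path

repo_url = "https://github.com/kijai/ComfyUI-WanVideoWrapper"
target = Path("ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper")

if target.exists():
    subprocess.run(["git", "-C", str(target), "pull"], check=True)       # update
else:
    subprocess.run(["git", "clone", repo_url, str(target)], check=True)  # fresh install

# Install the node's Python dependencies (run this with the same Python
# environment that runs ComfyUI).
req = target / "requirements.txt"
if req.exists():
    subprocess.run(["pip", "install", "-r", str(req)], check=True)
```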

kijai's wan2.2 quantized models 

(a) Wan2.2 TI2V 5B (Hybrid Version)

This is the hybrid version that supports both Text to Video and Image to Video. If you have low VRAM less than 12GB, you can use this model variant. 

download wan2.2 txt-img to video hybrid 

-Download the Hybrid TI2V model and save it into ComfyUI/models/diffusion_models folder. You will get two versions: Wan2_2-TI2V-5B_fp8_e4m3fn_scaled_KJ.safetensors (older optimized) and Wan2_2-TI2V-5B_fp8_e5m2_scaled_KJ.safetensors (newer optimized). Select either one of them.

-Download VAE (wan2.2_vae.safetensors) and save it into ComfyUI/models/vae folder.

-Download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into ComfyUI/models/text_encoders folder.

(b) Wan2.2 14B T2V (Text to Video)

You need at least 12GB of VRAM to run this model, and the quality you get will be more refined than the Hybrid one.


-Download the High noise and Low noise T2V models and save them into ComfyUI/models/diffusion_models folder. If you are unsure what to download, just pick the High and Low noise pair that share the same name.

-Download VAE (wan_2.1_vae.safetensors) and save it into ComfyUI/models/vae folder.

-Download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into ComfyUI/models/text_encoders folder.

(c) Wan2.2 14B I2V (Image-to-Video)

You need at least 12GB of VRAM to run this model, and the quality you get will be more refined than the Hybrid one.

-Download the High noise and Low noise I2V models and save them into ComfyUI/models/diffusion_models folder. Just download the High and Low noise pair that share the same name.

-Download VAE (wan_2.1_vae.safetensors) and save it into ComfyUI/models/vae folder.

-Download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and put it into ComfyUI/models/text_encoders folder.


(d) S2V-14B (Speech to Video)

 -Download Speech to Video FP8 Model (Wan2_2-S2V-14B_fp8_e4m3fn_scaled_KJ.safetensors) and save it into ComfyUI/models/diffusion_models folder.

-Download the Wav2Vec2 FP16 model (wav2vec2_large_english_fp16) and save it into ComfyUI/models/audio_encoders folder. Create this folder if it does not exist.

 -Download VAE  and save it into ComfyUI/models/vae folder.

-Download Text Encoder  and put it into ComfyUI/models/text_encoders folder.

 

(e) Animate-14B (Animation Control)

-Download the Wan2.2 Animate FP8 model (either Wan2_2-Animate-14B_fp8_e4m3fn_scaled_KJ.safetensors for the RTX 4000 GPU series or Wan2_2-Animate-14B_fp8_e5m2_scaled_KJ.safetensors for the RTX 3000 GPU series), and save it into ComfyUI/models/diffusion_models folder.

 -Download VAE  and save it into ComfyUI/models/vae folder.

-Download Text Encoder  and put it into ComfyUI/models/text_encoders folder.



C. Wan 2.2 GGUF variant 

These are GGUF model variants developed by the community for users with low VRAM. You will see some loss in generation quality, but it is bearable and minimal.

install comfyui gguf custom nodes

1. Install the GGUF custom nodes by City96. Open the Manager, select Custom Nodes Manager, and search for ComfyUI-GGUF by author City96. If you do not know what GGUF models are, learn about GGUF in our quantized model tutorial.

2. If it is already installed, just update it from the Manager: click Custom Nodes Manager, search for the ComfyUI-GGUF custom nodes, then click the Update button.
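
The same manual pattern as the wrapper sketch earlier works here too if you want to skip the Manager (the repo URL and paths are assumptions; verify them on the author's GitHub page):

```python
# Manual install of City96's ComfyUI-GGUF custom nodes (use `git pull` in the
# same folder to update later).
import subprocess

subprocess.run(["git", "clone", "https://github.com/city96/ComfyUI-GGUF",
                "ComfyUI/custom_nodes/ComfyUI-GGUF"], check=True)
subprocess.run(["pip", "install", "-r",
                "ComfyUI/custom_nodes/ComfyUI-GGUF/requirements.txt"], check=True)
```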


(a) Wan2.2 TI2V 5B (Hybrid Version)

This is the hybrid version that supports both Text to Video and Image to Video. If you have low VRAM less than 12GB, you can use this model variant. 

-Download the Hybrid model (by developer QuantStack) and save it into ComfyUI/models/unet folder. The quants range from Q2 (faster, with lower precision and lower quality) to Q8 (slower inference, with higher precision and higher-quality generation). The file sizes below are a rough guide to VRAM usage (see the helper sketch at the end of this subsection):

2-bit Q2_K 1.85 GB

3-bit Q3_K_S 2.29 GB, Q3_K_M 2.55 GB

4-bit Q4_K_S 3.12 GB, Q4_0 3.03 GB, Q4_1 3.25 GB, Q4_K_M 3.43 GB

5-bit Q5_K_S 3.56 GB, Q5_0 3.64 GB, Q5_1 3.87 GB, Q5_K_M 3.81 GB

6-bit Q6_K 4.21 GB

8-bit Q8_0 5.4 GB

-Download VAE (wan2.2_vae.safetensors) and save it into ComfyUI/models/vae folder.

-Download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into ComfyUI/models/text_encoders folder.
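
If you are unsure which quant to grab, a rough rule of thumb is: take your free VRAM, subtract some headroom for the VAE, text encoder, and activations, and pick the largest file that still fits. A tiny illustrative helper follows; the 4 GB headroom figure is an assumption, not a measured value.

```python
# Illustration only: pick the largest TI2V 5B GGUF quant that fits your VRAM.
# File sizes are the ones listed above; the headroom figure is an assumption.
QUANTS_5B_GB = {
    "Q2_K": 1.85, "Q3_K_S": 2.29, "Q3_K_M": 2.55, "Q4_0": 3.03, "Q4_K_S": 3.12,
    "Q4_1": 3.25, "Q4_K_M": 3.43, "Q5_K_S": 3.56, "Q5_0": 3.64, "Q5_K_M": 3.81,
    "Q5_1": 3.87, "Q6_K": 4.21, "Q8_0": 5.4,
}

def pick_quant(free_vram_gb: float, headroom_gb: float = 4.0) -> str:
    usable = free_vram_gb - headroom_gb
    fitting = {q: gb for q, gb in QUANTS_5B_GB.items() if gb <= usable}
    # Bigger file = higher precision = better quality, so take the largest that fits.
    return max(fitting, key=fitting.get) if fitting else "Q2_K"

print(pick_quant(8.0))   # an 8 GB card with 4 GB headroom -> "Q5_1"
```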

(b) Wan2.2 14B T2V (Text to Video)

-Download the High noise and Low noise models (by developer QuantStack) or the High noise and Low noise models (by developer Bullerwins), and save them into ComfyUI/models/unet folder. Select and download the High and Low noise pair with the same quant name.

The quants range from Q2 (faster, with lower precision and lower quality) to Q8 (slower inference, with higher precision and higher-quality generation). For example, if you select High noise Q2, also select Low noise Q2 from the same developer's release. The file sizes below are per file (the High noise and Low noise files are roughly the same size) and are a rough guide to VRAM usage:

2-bit Q2_K 5.3 GB

3-bit Q3_K_S 6.51 GB, Q3_K_M 7.17 GB

4-bit Q4_K_S 8.75 GB, Q4_0 8.56 GB, Q4_1 9.26 GB, Q4_K_M 9.65 GB

5-bit Q5_K_S 10.1 GB, Q5_0 10.3 GB, Q5_1 11 GB, Q5_K_M 10.8 GB

6-bit Q6_K 12 GB

8-bit Q8_0 15.4 GB

-Now, download VAE (wan_2.1_vae.safetensors) and save it into ComfyUI/models/vae folder.

-Also download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into ComfyUI/models/text_encoders folder.

(c) Wan2.2 14B I2V (Image-to-Video)

-Download the High noise and Low noise models (by developer QuantStack) or the High noise and Low noise models (by developer Bullerwins), and save them into ComfyUI/models/unet folder. Select and download the High and Low noise pair with the same quant name.

The quants range from Q2 (faster, with lower precision and lower quality) to Q8 (slower inference, with higher precision and higher-quality generation). For example, if you select High noise Q2, also select Low noise Q2 from the same developer's release. The file sizes below are per file (the High noise and Low noise files are roughly the same size) and are a rough guide to VRAM usage:

2-bit Q2_K 5.3 GB

3-bit Q3_K_S 6.52 GB, Q3_K_M 7.18 GB

4-bit Q4_K_S 8.75 GB, Q4_0 8.56 GB, Q4_1 9.26 GB, Q4_K_M 9.65 GB

5-bit Q5_K_S 10.1 GB, Q5_0 10.3 GB, Q5_1 11 GB, Q5_K_M 10.8 GB

6-bit Q6_K 12 GB

8-bit Q8_0 15.4 GB

-Download VAE (wan_2.1_vae.safetensors) and save it into ComfyUI/models/vae folder.

-Download Text Encoder (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and put it into ComfyUI/models/text_encoders folder.

 

(d) S2V-14B (Speech to Video)

 -Download GGUF Speech to Video Model   and save it into ComfyUI/models/unet folder.

All the model variants with file sizes (a rough guide to VRAM usage):

2-bit Q2_K 9.51 GB

3-bit Q3_K_S 10.7 GB, Q3_K_M 11.4 GB

4-bit Q4_K_S 13 GB, Q4_0 12.8 GB, Q4_1 13.5 GB, Q4_K_M 13.9 GB

5-bit Q5_K_S 14.3 GB, Q5_0 14.5 GB, Q5_1 15.2 GB, Q5_K_M 15 GB

6-bit Q6_K 16.2 GB

8-bit Q8_0 19.6 GB

-Download the Wav2Vec2 FP16 model (wav2vec2_large_english_fp16) and save it into ComfyUI/models/audio_encoders folder. Create this folder if it does not exist.

 -Download VAE  and save it into ComfyUI/models/vae folder.

-Download Text Encoder  and put it into ComfyUI/models/text_encoders folder.

 

 (e) Wan 2.2 Animate 14B 

-Download the Wan 2.2 Animate 14B GGUF model and save it into ComfyUI/models/unet folder.

All the model variants with file sizes (a rough guide to VRAM usage):

2-bit Q2_K 6.46 GB

3-bit Q3_K_S 7.97 GB, Q3_K_M 8.63 GB

4-bit Q4_K_S 10.6 GB, Q4_0 10.4 GB, Q4_K_M 11.5 GB

5-bit Q5_K_S 12.3 GB, Q5_0 12.5 GB, Q5_K_M 13 GB

6-bit Q6_K 14.6 GB

8-bit Q8_0 18.7 GB

 -Download VAE  and save it into ComfyUI/models/vae folder.

-Download Text Encoder  and put it into ComfyUI/models/text_encoders folder.

 

D. Wan 2.2 Lightning Lora

This is the lightweight LoRA version of Wan2.2 released by developer LightX2V. It is a distilled variant of Wan2.2 that requires only 4 sampling steps with minimal CFG, resulting in up to 20x faster output while still handling complex motion. If you are struggling with low VRAM, you can use this setup to get fast video generation (see the sampler-settings sketch at the end of this section).

Note- These LoRA models have to be used with any of the Wan2.2 diffusion models (explained above).


(a) Wan2.2 14B I2V (Image-to-Video)

Make sure you are using the actual High noise and Low noise I2V diffusion models.

-Download the Wan2.2 14B I2V Lightning High noise (high_noise_model.safetensors) and Low noise (low_noise_model.safetensors) LoRAs from the Hugging Face repository.

Save them into ComfyUI/models/loras folder, then use the same diffusion models, VAE, and text encoders as described above.

(b) Wan2.2 14B T2V (Text to Video)

The Text to Video LoRA has two variants, V1 and V1.1. The quality of video generation with V1.1 is somewhat better than with V1. Choose either of them.

Make sure you are using the actual High noise and Low noise T2V diffusion models.

-Download the Wan2.2 14B T2V Lightning V1 High noise (high_noise_model.safetensors) and Low noise (low_noise_model.safetensors) LoRAs from the Hugging Face repository.

-Download the Wan2.2 14B T2V Lightning V1.1 High noise (high_noise_model.safetensors) and Low noise (low_noise_model.safetensors) LoRAs from the Hugging Face repository.

Save them into ComfyUI/models/loras folder, then use the same diffusion models, VAE, and text encoders as described above.
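
As a rough guide to what changes in the sampler once the Lightning LoRAs are loaded (apply the high-noise LoRA to the high-noise model and the low-noise LoRA to the low-noise model), the settings drop from the defaults used later in this guide (20 steps, CFG 3.5) to the distilled regime the release describes: about 4 steps with minimal CFG. The 2/2 step split and CFG of 1.0 below are assumptions, so check the LoRA release notes for the recommended values.

```python
# Illustrative KSamplerAdvanced settings for the 4-step Lightning setup.
# The exact step split and CFG value are assumptions -- verify against the
# LightX2V release notes.
lightning_high_noise = dict(add_noise="enable", steps=4, cfg=1.0,
                            sampler_name="euler", scheduler="simple",
                            start_at_step=0, end_at_step=2,
                            return_with_leftover_noise="enable")

lightning_low_noise = dict(add_noise="disable", steps=4, cfg=1.0,
                           sampler_name="euler", scheduler="simple",
                           start_at_step=2, end_at_step=10000,
                           return_with_leftover_noise="disable")
```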

 

Workflow


wan2.2 workflows

1. Download the Wan2.2 workflows from our Hugging Face repository. These workflows support all model variants (Native / Kijai's setup / GGUF).

By the way, if you are working with Kijai's setup and want his original workflows, just navigate to your ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper/example_workflows folder.

If you are using the regular diffusion models, use the Load Diffusion Model node; for GGUF models, use the Unet Loader (GGUF) node.


 (a) Wan2.2_14B_I2V.json (Wan2.2_14B Image to Video workflow)

 (b) Wan2.2_14B_T2V.json (Wan2.2_14B Text to Video workflow)

 (c) Wan2.2_5B_Ti2V.json (Wan2.2_5B Text to video and Image to Video workflow)

 (d) Wan2.2_14B_S2V.json (Wan2.2_14B Speech to Video workflow) 

 (e) Wan2.2_14B_Animate.json (Wan2.2 Animate Character workflow). If you are interested, you can follow the detailed Wan 2.2 Animate Video to Video Pose transfer tutorial.

 

wan2.2 workflow from comfyui dashboard

You can also get the workflows from ComfyUI by navigating to All templates >> Video and selecting one of them. If you do not see this option, you are using an older ComfyUI version; just update it from the Manager section by selecting Update All.

 

 Wan2.2_5B_Ti2V Workflow

Wan2.2 5B Text/Image To Video Workflow


1. Overview 

    Purpose: take a start image, convert it into latents for an image-to-video run, sample new latents using the Wan2.2 model + CLIP text prompts, decode latents to images, stitch images into a video and save it.
    Execution order (logical): Load models >> Encode prompts >> Load start image >> Image-to-video latent node >> Sampler (KSampler) >> VAE Decode >> CreateVideo >> SaveVideo. (See node links in the JSON for exact wiring.)

2. Step 1 — Load models (group: "Step1 - Load models")

    UNETLoader (node id 37)

      Model loaded: wan2.2_ti2v_5B_fp16.safetensors
      This provides the diffusion UNET used for sampling.
    CLIPLoader (node id 38)

      Text encoder: umt5_xxl_fp8_e4m3fn_scaled.safetensors
      Supplies the CLIP/text encoder used by the CLIPTextEncode nodes.
    VAELoader (node id 39)

      VAE file: wan2.2_vae.safetensors
      Used to decode latents back to images.
    Note: the MarkdownNote in the graph shows exact model filenames and where to place them under ComfyUI/models (diffusion_models, text_encoders, vae).

3. Step 2 — Prompt encoding

    CLIPTextEncode (Positive prompt) — node id 6

      Positive prompt text (use as-is or edit):
       
      Output: CONDITIONING (link id 46)  goes to KSampler positive input.
    CLIPTextEncode (Negative prompt) — node id 7 (titled "CLIP Text Encode (Negative Prompt)")

      Negative prompt text (use as-is or edit):
      
      Output: CONDITIONING (link id 52)  goes to KSampler negative input.

4. Step 3 — Start image >> Image-to-video latent

    LoadImage (node id 56)

      Widget values: "example.png", "image"
      Output IMAGE (link id 106)  connected to Wan22ImageToVideoLatent start_image input.
      Replace "example.png" with your actual start image filename (must be accessible to ComfyUI).
    Wan22ImageToVideoLatent (node id 55)

      Inputs: vae (from VAELoader id 39) and start_image (from LoadImage id 56).
      Widget values shown: [1280, 704, 121, 1]

        These are the image-to-video settings saved in this node. They correspond to the video/image dimensions and length:

          1280 = width (px)
          704  = height (px)
          121  = number of frames (video length in frames)
          1    = batch size (number of latents generated per run)
        If you want a different resolution or length, change the first three numbers (width, height, frames).
      Output: LATENT (link id 104)  goes to KSampler as the "latent_image" input.
    Tip in graph: "For i2v, use Ctrl + B to enable" — enable the image-to-video path as instructed in the UI if needed.

5. Step 4 — Model selection & sampling

    ModelSamplingSD3 (node id 48)

      Receives MODEL from UNETLoader and passes it to KSampler (link id 95 >> node 3).
      Widget values: [8] (kept from this workflow — no need to change unless you understand the node internals).
    KSampler (node id 3)

      Inputs wired:

        model (from ModelSamplingSD3)
        positive conditioning (from CLIPTextEncode node id 6)
        negative conditioning (from CLIPTextEncode node id 7)
        latent_image (from Wan22ImageToVideoLatent node id 55)
      Widget values present in the workflow: [898471028164125, "randomize", 20, 5, "uni_pc", "simple", 1]

        Interpreting these values (typical mapping for KSampler):

          898471028164125 = seed (a big integer). Because "randomize" is set the actual seed will be randomized each run unless you change seed mode to fixed.
          "randomize" = seed mode (randomize vs fixed).
          20 = sampler steps (how many diffusion steps; higher = slower but often higher quality).
          5 = guidance scale (CFG) (strength of conditioning; higher emphasizes prompt more).
          "uni_pc" = sampler algorithm (this workflow uses the uni_pc sampler).
          "simple" = scheduling mode (sampling schedule).
          1 = denoise strength (1.0 means the latent is fully denoised).
        To reproduce the same result every run, set seed mode to a fixed seed (and record the seed number).
      Output: LATENT (link id 35)  goes to VAEDecode.

6. Step 5 — Decode latents to images

    VAEDecode (node id 8)

      Inputs: samples (LATENT from KSampler) and vae (VAE loader).
      Decodes latents into IMAGE.
      Output IMAGE (link id 107)  goes to CreateVideo.

7. Step 6 — Create video

    CreateVideo (node id 57)

      Input: images (IMAGE from VAEDecode)
      Widget values show: [24] — likely the frames per second (fps) used to make the video (24 fps is typical).
      Optional audio input exists but is not connected in this workflow.
      Output: VIDEO (link id 108)  goes to SaveVideo.

8. Step 7 — Save video

    SaveVideo (node id 58)

      Input: video (from CreateVideo)
      Widget values: ["video/ComfyUI", "auto", "auto"]

        "video/ComfyUI" is the output folder/path (relative to ComfyUI working dir).
        "auto" file naming / format options are selected here (ComfyUI will pick filename/format automatically unless you change these).
      Final output: saved video file in the specified folder.
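
If you want to batch-tweak these settings outside the UI, the exported workflow JSON can be edited directly. A small sketch follows, assuming the node ids and widget positions listed in the walkthrough above; adjust them if your copy of the workflow differs.

```python
# Edit the key widget values of Wan2.2_5B_Ti2V.json before loading it in ComfyUI.
# Node ids and widget orders are taken from the walkthrough above.
import json

with open("Wan2.2_5B_Ti2V.json") as f:
    wf = json.load(f)

nodes = {node["id"]: node for node in wf["nodes"]}

# Wan22ImageToVideoLatent (id 55): [width, height, frames, batch]
nodes[55]["widgets_values"][:3] = [960, 544, 81]   # smaller, faster test render

# KSampler (id 3): [seed, seed mode, steps, cfg, sampler, scheduler, denoise]
nodes[3]["widgets_values"][0] = 12345              # record the seed you used
nodes[3]["widgets_values"][1] = "fixed"            # reproducible runs

# CreateVideo (id 57): [fps]
nodes[57]["widgets_values"][0] = 16                # quicker previews

with open("Wan2.2_5B_Ti2V_test.json", "w") as f:
    json.dump(wf, f, indent=2)
```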

 

 

 Wan2.2_14B_I2V Workflow

 

Wan2.2 14B Image To Video Workflow

 

1. Overview

    This workflow converts a start image into a video using the Wan2.2 14B Image-to-Video models.
    Flow: Load models >> Encode prompts >> Load image >> Convert to video latents >> Two-stage sampling (high noise + low noise) >> Decode latents >> Assemble video >> Save video.


2. Step 1 - Load models

    UNETLoader (High noise, id 37)
     Loads wan2.2_i2v_high_noise_14B_fp8_scaled.safetensors.
     This is the diffusion model for high-noise stage.
    UNETLoader (Low noise, id 56)
     Loads wan2.2_i2v_low_noise_14B_fp8_scaled.safetensors.
     This is for the refinement (low-noise) stage.
    CLIPLoader (id 38)
     Loads umt5_xxl_fp8_e4m3fn_scaled.safetensors.
     Provides text encoder for prompts.
    VAELoader (id 39)
     Loads wan_2.1_vae.safetensors.
     Needed to decode latents into images.



3. Step 2 - Upload start image

    LoadImage (id 62)
     Default: "example.png". Replace this with your own start image.
     Output connects into the WanImageToVideo node.



4. Step 3 - Video size & length

    WanImageToVideo (id 63)
     Inputs: positive prompt, negative prompt, VAE, and start image.
     Settings: [1280, 720, 121, 1]

      Width = 1280 px
      Height = 720 px
      Frames = 121 (≈ 5 seconds at 24 fps)
      Last "1" is extra param (batch/loop).
       Outputs: latent + conditioning for samplers.



5. Step 4 - Prompt encoding

    Positive Prompt (id 6)
    Negative Prompt (id 7)

   
    
6. Two-stage sampling

    Stage 1 (High noise)

      ModelSamplingSD3 (id 54) connects the high-noise UNET >> KSamplerAdvanced (id 57).
      KSamplerAdvanced (id 57)
       Widgets: enable, 1042664824122032, randomize, 20 steps, 3.5 CFG, euler, simple, 0, 10, enable.
       This generates the first latent output.
    Stage 2 (Low noise refinement)

      ModelSamplingSD3 (id 55) connects the low-noise UNET >> KSamplerAdvanced (id 58).
      KSamplerAdvanced (id 58)
       Widgets: disable, 0, fixed, 20 steps, 3.5 CFG, euler, simple, 10, 10000, disable.
       Takes latent from Stage 1 and refines it.



7. Decode & video assembly

    VAEDecode (id 8)
     Converts final latents >> images.
    CreateVideo (id 60)
     Assembles frames into video. FPS = 24.
    SaveVideo (id 61)
     Saves to video/ComfyUI folder. Filename/format auto.

 

 

 Wan2.2_14B_T2V Workflow

 

Wan2.2 14B Text To Video Workflow


1. Overview

    This workflow generates a video directly from text (text-to-video).
    Flow: Load models >> Encode prompts >> Create empty latent video >> Two-stage sampling (high noise + low noise) >> Decode latents >> Assemble video >> Save video.


2. Step 1 — Load models

    UNETLoader (High noise, id 37)
     Loads wan2.2_t2v_high_noise_14B_fp8_scaled.safetensors.
    UNETLoader (Low noise, id 56)
     Loads wan2.2_t2v_low_noise_14B_fp8_scaled.safetensors.
    CLIPLoader (id 38)
     Loads umt5_xxl_fp8_e4m3fn_scaled.safetensors.
     Used to process text prompts.
    VAELoader (id 39)
     Loads wan_2.1_vae.safetensors.
     Decodes latents into images.



3. Step 2 — Video size (optional)

    EmptyHunyuanLatentVideo (id 59)
     Creates an empty latent video sequence to be filled in by the model.
     Settings: [1280, 704, 121, 1]

      Width = 1280 px
      Height = 704 px
      Frames = 121 (≈ 5 seconds at 24 fps)
      Last "1" is an extra param (batch/loop).



4. Step 3 — Prompt encoding: add the Positive Prompt (id 6) and Negative Prompt (id 7)

 

5. Two-stage sampling

    Stage 1 (High noise)

      ModelSamplingSD3 (id 54) connects high-noise UNET >> KSamplerAdvanced (id 57).
      KSamplerAdvanced (id 57)
       Widgets: enable, 774388746670969, randomize, 20 steps, 3.5 CFG, euler, simple, 0, 10, enable.
       Uses empty latent video as input and generates first-pass latents.
    Stage 2 (Low noise refinement)

      ModelSamplingSD3 (id 55) connects low-noise UNET >> KSamplerAdvanced (id 58).
      KSamplerAdvanced (id 58)
       Widgets: disable, 0, fixed, 20 steps, 3.5 CFG, euler, simple, 10, 10000, disable.
       Refines the latents from Stage 1.



6. Decode & assemble video

    VAEDecode (id 8)
     Converts refined latents >> image frames.
    CreateVideo (id 60)
     Collects images into a video. FPS = 24.
    SaveVideo (id 61)
     Saves output to video/ComfyUI folder. Filename/format auto.



Wan2.2 14B Speech To Video Workflow

Wan2.2 14B Speech To Video Workflow



Wan2.2 14B Animate Workflow

Wan2.2 14B Animate Workflow


 


 

Wan2.2 I2V5B FP16 Test 

Wan2.2 I2V5B FP16 output

 

 Wan2.2 T2V5B FP16 Test

Wan2.2 T2V5B FP16 output

 

 Wan2.2 S2V Test 




Some Important Tips for video generation: 

1. In the Wan2.2 High and Low Noise workflow, the first KSampler Advanced node takes the Wan2.2 High noise model and runs from start step 0 to end step 10 (out of 20 total sampler steps). This means 50% of the process is done with the high noise model; by enabling the return with leftover noise parameter, the partially denoised latent is then handed to the second KSampler Advanced node, which uses the Wan2.2 Low noise model to finish the remaining steps (see the sketch after these tips).


2. After trying multiple generations, what we observed is that the way Wan2.2 5B handles I2V and timesteps is impressive. Each latent frame has its own denoising timestep, and the first frame is simply set as completely denoised. This means you should be able to use a sliding denoise-timestep window and get effectively unlimited long-form video generation.

3. The Text/Image to Video 5B hybrid workflow includes both the Text to Video and Image to Video paths. If you want to generate with either of them, just enable the relevant nodes using Ctrl + B.

4. You can add the Sage Attention node to further speed up generation. Connect the Patch Sage Attention KJ node between the Load Diffusion Model node and the ModelSamplingSD3 node.
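
Here is a compact illustration of how the two KSamplerAdvanced nodes split the work in the 14B High/Low noise workflows, using the widget values listed in the walkthroughs above (the parameter names follow ComfyUI's KSamplerAdvanced node):

```python
# Two-stage sampling used by the Wan2.2 14B High/Low noise workflows.
# Values taken from the workflow walkthroughs above.
stage1_high_noise = dict(add_noise="enable", steps=20, cfg=3.5,
                         sampler_name="euler", scheduler="simple",
                         start_at_step=0, end_at_step=10,          # first half of the schedule
                         return_with_leftover_noise="enable")      # hand the noisy latent on

stage2_low_noise = dict(add_noise="disable", steps=20, cfg=3.5,
                        sampler_name="euler", scheduler="simple",
                        start_at_step=10, end_at_step=10000,       # run the rest to completion
                        return_with_leftover_noise="disable")
# Stage 1 denoises steps 0-10 with the high-noise model, then stage 2 picks up
# the leftover-noise latent and finishes steps 10-20 with the low-noise model.
```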


Prompting Tips:

To get the best output from the Wan2.2 model, you need precise and detailed prompting.

1. Shot Order

-Describe the scene like a movie shot.
-Start with what the camera sees first.
-Then describe how the camera moves.
-Finish with what is revealed or shown at the end.

Example: A mountain at dawn -- camera tilts up slowly -- reveals a flock of birds flying overhead.



 2. Camera Language

 Use clear terms to tell the model how the camera should move:

-pan left/right – camera turns horizontally
-tilt up/down – camera moves up or down
-dolly in/out – camera moves forward or backward
-orbital arc – camera circles around a subject
-crane up – camera rises vertically

Wan 2.2 understands these better than the older version.



 3. Motion Modifiers

 Add words to describe how things move:

-Speed: slow-motion, fast pan, time-lapse
-Depth/motion cues: describe how things in the foreground/background move differently to show 3D depth

     e.g., "foreground leaves flutter, background hills stay still"



 4. Aesthetic Tags

 Add cinematic style:

-Lighting: harsh sunlight, soft dusk, neon glow, etc.
-Color Style: teal-orange, black-and-white, film-like tones (e.g., Kodak Portra)
-Lens or Film Style: 16mm film grain, blurry backgrounds (bokeh), CGI, etc.

These help define the look and feel of the scene.



 5. Timing & Resolution Settings

 Keep clips short: 5 seconds or less

-Use around 120 frames max (see the quick check below)

-Use 16 or 24 FPS (frames per second) – 16 is faster to test

-Use lower resolution (like 960×540) to test quickly, or higher (1280×720) for final output
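
A quick check of how those numbers relate (frames = seconds x fps):

```python
# frames = seconds * fps
for fps in (16, 24):
    print(f"{fps} fps -> {5 * fps} frames for a 5-second clip")
# 16 fps -> 80 frames, 24 fps -> 120 frames (matching the ~120-frame guideline)
```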



 6. Negative Prompt

 This part tells the AI what you don’t want in the video. Defaults cover things like:
-bad quality, weird-looking hands/faces
-overexposure, bright colors, still images
-text, compression artifacts, clutter, too many background people

This helps avoid common AI issues.