Wan2.2 Fun Vace - Video Style Transfer (Video To Video)

 

Wan 2.2 Vace style transfer

Video generation models often face limitations when it comes to controlling specific aspects of the output. Creators struggle to get precise conditions like edges, depth, pose, or trajectories aligned with their vision. Limited language support is another barrier to wider global adoption.

The Wan2.2 control weights have been trained using the VACE architecture on top of the Wan2.2 T2V A14B model. They support control conditions including Pose, Depth, MLSD, Canny, and trajectory control.
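To make these control conditions concrete, below is a small, hypothetical sketch that converts an input video into a Canny edge control video with OpenCV. It is not the official VACE preprocessing pipeline, just an illustration of what one control signal looks like; the file names are placeholders.

```python
import cv2

# Placeholder paths -- replace with your own input and output files.
reader = cv2.VideoCapture("input.mp4")
fps = reader.get(cv2.CAP_PROP_FPS)
width = int(reader.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(reader.get(cv2.CAP_PROP_FRAME_HEIGHT))
writer = cv2.VideoWriter("canny_control.mp4",
                         cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))

while True:
    ok, frame = reader.read()
    if not ok:
        break
    # Canny edges on the grayscale frame; the thresholds are a starting point to tune.
    edges = cv2.Canny(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), 100, 200)
    # The writer expects 3 channels, so expand the single-channel edge map back.
    writer.write(cv2.cvtColor(edges, cv2.COLOR_GRAY2BGR))

reader.release()
writer.release()
```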

 

Wan 2.2 Fun Vace working illustration (Ref: official research paper)
 

Wan2.2 Fun Vace addresses this by giving creators fine control over video outputs without compromising quality. The approach improves precision while staying adaptable across multiple formats and languages, so the generation process feels natural, efficient, and aligned with professional needs.

These model weights can condition generation on input videos and reference subjects, and they support video prediction at resolutions of 512, 768, and 1024. The model is trained on 81-frame clips, which makes it suitable for creating fluid motion across different formats. Multi-language prompting is also supported, making it versatile for a global user base.

 

Installation

Update ComfyUI from the Manager by selecting Update All to avoid errors later.

Different model variants have been released by the official team and the community. Choose the variant that suits your system requirements and use case:

Official release by Alibaba PAI (requires ~60GB VRAM or more)

1. First, you need the Wan2.2 T2V High and Low noise models. Complete the basic Wan 2.2 setup if you haven't already.



Wan2.2-VACE-Fun-A14B High noise

2. Download the official Wan2.2-VACE-Fun-A14B High noise (diffusion_pytorch_model.safetensors). 

 Wan2.2-VACE-Fun-A14B Low noise

You also need the Wan2.2-VACE-Fun-A14B Low noise model (diffusion_pytorch_model.safetensors). Rename both files to something descriptive, since they share the same file name, then save them into the ComfyUI/models/diffusion_models folder (a scripted download sketch follows these steps).

3. Restart ComfyUI and refresh.
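If you prefer scripting the download, here is a minimal sketch using the huggingface_hub package. The repository id and in-repo file paths are assumptions about how the official release is organized; confirm them on the model page before running, and adjust the renamed file names to whatever you prefer.

```python
import os
import shutil

from huggingface_hub import hf_hub_download

# Assumed repository id and file layout -- verify on the official model page.
REPO_ID = "alibaba-pai/Wan2.2-VACE-Fun-A14B"
TARGET_DIR = "ComfyUI/models/diffusion_models"

files = [
    # (assumed path inside the repo, local name after renaming)
    ("high_noise_model/diffusion_pytorch_model.safetensors",
     "Wan2.2-VACE-Fun-A14B_high_noise.safetensors"),
    ("low_noise_model/diffusion_pytorch_model.safetensors",
     "Wan2.2-VACE-Fun-A14B_low_noise.safetensors"),
]

os.makedirs(TARGET_DIR, exist_ok=True)
for repo_path, renamed in files:
    # Both weights ship as diffusion_pytorch_model.safetensors, so rename them
    # to avoid one overwriting the other in the shared folder.
    cached = hf_hub_download(repo_id=REPO_ID, filename=repo_path, local_dir="downloads")
    shutil.copy(cached, os.path.join(TARGET_DIR, renamed))
```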



Kijai's optimized BF16 variant (for medium VRAM)

1. Set up Kijai's Wan Video Wrapper custom node if you haven't already.

 

 wan2.2-vace-bf16-fun-high-low-noise

2. Download the optimized Wan2.2 Vace BF16 modules (Wan2_2_Fun_VACE_module_A14B_HIGH_bf16.safetensors and Wan2_2_Fun_VACE_module_A14B_LOW_bf16.safetensors) and save them into the ComfyUI/models/diffusion_models folder.

3. You also need the Wan2.2 T2V High and Low noise models. Set them up by following our Wan2.2 installation tutorial.

4. Restart ComfyUI and refresh.

 


Kijai's optimized GGUF modules

1. Set up Kijai's Wan Video Wrapper custom node if you haven't already.

 kijai's wan2.2 vace module gguf

2. Download the optimized Wan2.2 Vace GGUF models (High and Low) and save them into the ComfyUI/models/diffusion_models folder. Keep the quantization levels paired: for example, if you use the Q4 High model, pair it with the Q4 Low model.

3. You also need the Wan2.2 T2V High and Low noise models. Set them up by following our Wan2.2 installation tutorial.

4. Restart ComfyUI and refresh.
 

 

Workflow

1. After installing Kijai's Wan Video Wrapper custom node, you will find the example workflows inside the ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper/example_workflows folder.

2. Drag and drop the workflow into ComfyUI.

 

Wan 2.2 Vace output

As you can see, the workflow needs a fair amount of VRAM and loads multiple models to get better output. A similar result can also be achieved with the Wan 2.2 Animate model; follow our detailed Wan 2.2 Animate workflow tutorial for more insights.

 

Some tips when working with the workflow:

1. Set width, height, and frame count as constants at the very start and reference them in every node that needs them. This eliminates repetitive manual edits across the workflow and makes global changes fast and error-free. Experiment widely: Vace modules often surprise with their flexibility, especially with mixed preprocessing (depth, line art, normals). A small preprocessing sketch covering these constants follows this list.

2. Resize and crop image and video inputs to the workflow's constant values early in the graph (see the preprocessing sketch after this list). This guarantees aspect-ratio consistency and output quality. Depth maps are extracted from the video input for spatial awareness, which powers the Vace control system for nuanced animation.

3. Florence 2 is a multimodal model (from Microsoft) whose node interprets your input image and generates descriptive prompts automatically. For creative control, you can prepend trigger phrases like 'anime illustration of' using a concatenate node, which is especially helpful for guiding style-specific models (see the captioning sketch after this list).

4. The workflow uses a split between 14B and 5B parameter models. The 14B variants require more VRAM but tend to generate superior results. If VRAM is tight, consider GGUF quantizations or distilled models to fit smaller cards.

5. For massive models that exceed video card VRAM, block swap nodes push part of the model to system RAM. Bigger swap values reduce VRAM usage but increase render time. The trade-off: if you are not hitting out-of-memory errors, keep swap low for speed; if you are, raise swap until the run succeeds, accepting longer generations. A conceptual sketch of block swapping follows this list.

6. Vace is essentially a control system akin to ControlNet but more flexible. Use its start/end percent and strength settings to decide how tightly it adheres to your control input over time (see the control-gating sketch after this list). Vace can combine multiple reference images and masks across frame intervals, unlocking complex temporal behaviors. This is ripe for experimentation: animated masks, OpenPose skeletons, and trajectory controls are all in play.

7. By condensing nodes into sub-graphs, users expose only the most relevant parameters like input, output, prompt prefix, and key node settings. The rest runs under the hood, tidy and efficient. For workflow sharing, this lets others focus on what matters without getting lost in technical clutter.

8. If you get errors loading models, double-check compatibility between node types (native vs wrappers) and FP16/FP8 variants.

9. For quality control, balance the use of distilled models and advanced samplers. More steps mean longer renders but potentially higher fidelity; fewer steps or faster samplers cut computation if a small quality loss is acceptable.
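For tips 1 and 2, the sketch below shows the kind of preprocessing those constants imply, written as plain Python with Pillow purely as an illustration (inside ComfyUI this is handled by resize/crop nodes wired to the shared constants). The 4n+1 frame-count rule matches how Wan-family models are typically fed (81 = 4×20 + 1); treat the exact numbers and helper names as assumptions to adapt.

```python
from PIL import Image

# Workflow-wide constants, defined once and referenced everywhere (tip 1).
WIDTH, HEIGHT = 768, 768
FPS = 16          # assumed output frame rate
SECONDS = 5

# Wan-family models generally expect frame counts of the form 4*n + 1,
# so snap the requested length to the nearest valid value (80 -> 81).
requested = FPS * SECONDS
FRAME_COUNT = (requested // 4) * 4 + 1

def resize_and_center_crop(img: Image.Image,
                           width: int = WIDTH,
                           height: int = HEIGHT) -> Image.Image:
    """Resize so the short side covers the target, then center-crop (tip 2)."""
    scale = max(width / img.width, height / img.height)
    resized = img.resize((round(img.width * scale), round(img.height * scale)),
                         Image.LANCZOS)
    left = (resized.width - width) // 2
    top = (resized.height - height) // 2
    return resized.crop((left, top, left + width, top + height))
```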
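For tip 3, the workflow does this with a Florence 2 node feeding a concatenate node; the standalone sketch below shows the same idea with the transformers library, assuming the microsoft/Florence-2-large checkpoint, a CUDA device, and a placeholder image path. Adjust it to your environment.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).to("cuda")

image = Image.open("reference.png").convert("RGB")   # placeholder input image
task = "<DETAILED_CAPTION>"
inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)

generated = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=256,
)
raw = processor.batch_decode(generated, skip_special_tokens=False)[0]
caption = processor.post_process_generation(raw, task=task, image_size=image.size)[task]

# Prepend the trigger phrase -- the same job the concatenate node does in the graph.
prompt = "anime illustration of " + caption
print(prompt)
```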
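For tip 5, block swapping is implemented inside Kijai's wrapper; the toy loop below only illustrates the idea and the trade-off. Here blocks_to_swap plays the role of the node's swap value: more swapped blocks means less VRAM used but more host-to-device copies, hence slower renders.

```python
import torch.nn as nn

def forward_with_block_swap(blocks: list[nn.Module], x, blocks_to_swap: int,
                            device: str = "cuda"):
    """Toy illustration: the last `blocks_to_swap` transformer blocks live in
    system RAM and are copied into VRAM only for the moment they execute."""
    for i, block in enumerate(blocks):
        swapped = i >= len(blocks) - blocks_to_swap
        if swapped:
            block.to(device)   # host-to-device copy: saves VRAM, costs time
        x = block(x)
        if swapped:
            block.to("cpu")    # evict to free VRAM for the next swapped block
    return x
```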
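For tip 6, the start/end percent and strength settings act like a gate over the sampling schedule, much like ControlNet's start/end controls: the control input only influences steps that fall between those fractions of the run, scaled by strength. The helper below is a conceptual sketch of that gating, not Vace's internal implementation.

```python
def control_weight(step: int, total_steps: int,
                   strength: float = 1.0,
                   start_percent: float = 0.0,
                   end_percent: float = 1.0) -> float:
    """How strongly the control input is applied at a given sampling step."""
    progress = step / max(total_steps - 1, 1)   # 0.0 at the first step, 1.0 at the last
    return strength if start_percent <= progress <= end_percent else 0.0

# Example: full-strength control for only the first 60% of a 30-step run.
weights = [control_weight(s, 30, strength=1.0, end_percent=0.6) for s in range(30)]
```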