Creating controlled character animation sounds simple: take the motion from one video and transfer it to another character. In practice, it has always been a difficult challenge.
Most existing character animation systems convert motion into pose skeletons or separate the environment using masked backgrounds. These methods make the problem easier for the model, but they also remove important information. A skeleton can describe body movement, but it cannot capture everything, such as subtle clothing motion, interaction with the environment, hand details, or the natural flow of a scene.
Figure: Wan 2.1 SCAIL-2 model showcase
SCAIL-2, built on Wan 2.1 and developed by the Zai-org team and Tsinghua University, aims to make character animation end-to-end by letting the model learn directly from the complete visual information in videos. Instead of converting motion into skeletons or separating backgrounds, SCAIL-2 works directly with the source video itself. The goal is simple: preserve more information to create more realistic animations.
If you follow the Wan models, you have probably heard of Wan 2.1 SCAIL, which was not very successful. Now, instead of extracting only motion signals, SCAIL-2 concatenates the driving video directly into the generation sequence. This gives the model access to everything in the input, including movement, appearance, environment, and fine visual details. You can find more detailed insights in their research paper.
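To make the idea of concatenating the driving video into the generation sequence more concrete, here is a minimal sketch. It is not SCAIL-2's actual code: the `TokenBlock` class, the `build_sequence` helper, the block names, and the token counts are all hypothetical, and the real token layout is described in the research paper. The sketch only shows that each input becomes a span of one shared sequence, so attention over that sequence can see every driving-video token rather than a reduced skeleton.

```python
from dataclasses import dataclass


@dataclass
class TokenBlock:
    """One input turned into tokens (hypothetical layout, for illustration)."""
    name: str              # which input the tokens come from
    frames: int            # number of latent frames
    tokens_per_frame: int  # tokens per latent frame

    @property
    def length(self) -> int:
        return self.frames * self.tokens_per_frame


def build_sequence(blocks):
    """Concatenate blocks along the token axis.

    Returns the total sequence length and the (start, end) span of each
    block inside the shared sequence.
    """
    spans = {}
    offset = 0
    for block in blocks:
        spans[block.name] = (offset, offset + block.length)
        offset += block.length
    return offset, spans
```

With a one-frame reference image and three-frame driving and target videos at four tokens per frame, `build_sequence` returns a 28-token sequence in which the driving video occupies tokens 4 to 16, sitting in the same context as the frames being generated.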
Figure: SCAIL-2 model architecture
The framework also introduces:
- In-context mask conditioning, which provides additional control beyond text prompts and raw visual inputs.
- Mode-specific RoPE, which guides different animation tasks more effectively.
- Bias-Aware DPO (Direct Preference Optimization), which reduces errors in detailed regions by creating better preference-based training examples.
These improvements allow the model to better understand complex motion and generate more accurate results.
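For readers unfamiliar with preference optimization, the sketch below shows the standard DPO objective for a single preferred/rejected pair. This is the generic formulation, not the Bias-Aware variant used in SCAIL-2, and diffusion models typically replace the log-probabilities with terms based on denoising error. The function name `dpo_loss` and the default `beta` are assumptions made for illustration.

```python
import math


def dpo_loss(policy_win, policy_lose, ref_win, ref_lose, beta=0.1):
    """Standard DPO loss for one preference pair (generic, not Bias-Aware).

    Each argument is a log-probability: the trained model (policy) and a
    frozen reference model, each scored on the preferred ("win") and the
    rejected ("lose") sample.
    """
    # How much more the policy favours the preferred sample than the
    # reference model does.
    margin = (policy_win - ref_win) - (policy_lose - ref_lose)
    x = beta * margin
    # -log(sigmoid(x)), computed in a numerically stable way.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

The loss falls as the policy assigns relatively more probability to the preferred sample than the reference model does, which is why the quality of the preference pairs matters so much for detailed regions such as faces and hands.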
Installation
1. Install ComfyUI first. If you already have it installed, update ComfyUI from the Manager.
2. Download one of the Wan 2.1 SCAIL-2 variants (fp16, fp8 scaled, fp8 mixed, or GGUF) listed below. Choose according to your system resources:
(a) Wan2.1_14B_SCAIL_2_fp8_scaled (wan2.1_14B_SCAIL_2_fp8_scaled.safetensors) - minimum 16 GB VRAM
(b) Wan2.1_14B_SCAIL_2_fp16 (wan2.1_14B_SCAIL_2_fp16.safetensors) - minimum 32 GB VRAM
(c) Wan2.1_14B_SCAIL_2_mxfp8 (wan2.1_14B_SCAIL_2_mxfp8.safetensors) - minimum 16 GB VRAM
Save the file into the ComfyUI/models/diffusion_models folder.
Alternatively, use a GGUF variant if you have limited VRAM (6-17 GB).
(d) Wan 2.1 SCAIL-2 GGUF. Variants range from Q2 (lowest VRAM use, lowest quality) to Q8 (highest VRAM use, best quality).
Save the file into the ComfyUI/models/unet folder. Make sure you have the ComfyUI-GGUF custom node by city96. If you do not, install it from the Manager.
3. Download clip_vision_h (clip_vision_h.safetensors) and save it into the ComfyUI/models/clip_vision folder.
4. Download Sam3.1_multiplex_fp16 (sam3.1_multiplex_fp16.safetensors) and put it into the ComfyUI/models/checkpoints folder.
5. Download Wan21_I2V_14B_lightx2v_cfg_step_distill_lora_rank64 (Wan21_I2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors) and wan2.1_SCAIL_2_DPO_lora_bf16 (wan2.1_SCAIL_2_DPO_lora_bf16.safetensors).
Save both into the ComfyUI/models/loras folder.
6. Download umt5_xxl_fp8_e4m3fn_scaled (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into the ComfyUI/models/text_encoders folder.
7. Download Wan_2.1_vae (wan_2.1_vae.safetensors) and put it into the ComfyUI/models/vae folder.
8. Restart and refresh ComfyUI.
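Before loading a workflow, it can help to confirm that every file landed in the right folder. The following is a small sketch that checks the file names and folders listed in the steps above. It assumes the fp8 scaled variant was chosen and that the ComfyUI root is passed in by the caller. The `EXPECTED` table and the `missing_files` helper are made up for this example and are not part of ComfyUI.

```python
from pathlib import Path

# Folder under ComfyUI/models -> expected file names, taken from the
# installation steps. Assumes the fp8 scaled variant; swap in the file
# name of the variant you downloaded (GGUF files go under "unet").
EXPECTED = {
    "diffusion_models": ["wan2.1_14B_SCAIL_2_fp8_scaled.safetensors"],
    "clip_vision": ["clip_vision_h.safetensors"],
    "checkpoints": ["sam3.1_multiplex_fp16.safetensors"],
    "loras": [
        "Wan21_I2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors",
        "wan2.1_SCAIL_2_DPO_lora_bf16.safetensors",
    ],
    "text_encoders": ["umt5_xxl_fp8_e4m3fn_scaled.safetensors"],
    "vae": ["wan_2.1_vae.safetensors"],
}


def missing_files(comfy_root, expected=EXPECTED):
    """Return the expected model files that are not present on disk."""
    models = Path(comfy_root) / "models"
    return [
        str(Path(folder) / name)
        for folder, names in expected.items()
        for name in names
        if not (models / folder / name).is_file()
    ]
```

Calling `missing_files("ComfyUI")` from the directory that contains your ComfyUI folder returns an empty list when everything is in place, or the relative paths of the files that still need downloading.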
Workflow
1. Download the SCAIL-2 workflows from our Hugging Face repository.
(a) Wan2.1_Scail2_basic.json - basic workflow without the DPO LoRA.
(b) Wan2.1_Scail2_DPO.json - workflow with the DPO LoRA, which improves faces and hands in the output.
(c) Wan2.1_Scail2_gguf.json - for low VRAM users.
2. Drag and drop the workflow into ComfyUI. If you see a missing-nodes error, install the missing nodes from the Manager.
3. Load all the models into their relevant nodes.
4. Add your reference image and your driving video.
5. Hit Run to start the generation.
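If a workflow reports missing nodes, it can be useful to see which node types it uses before opening the Manager. The sketch below assumes the workflow was saved in ComfyUI's regular UI format, where the JSON has a top-level "nodes" list and each entry carries a "type" field. Workflows exported in API format are laid out differently and would need another approach. The helper names are hypothetical.

```python
import json
from collections import Counter


def node_types(workflow):
    """Count node types in a UI-format workflow dict."""
    return Counter(node["type"] for node in workflow.get("nodes", []))


def node_types_from_file(path):
    """Load a workflow JSON file and count its node types."""
    with open(path, encoding="utf-8") as fh:
        return node_types(json.load(fh))
```

For example, `node_types_from_file("Wan2.1_Scail2_basic.json")` returns a counter of node type names. Any type that your ComfyUI install does not recognise points to a custom node pack that still needs installing.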
SCAIL-2 represents an important shift in character animation research. Many AI systems have solved complex visual problems by simplifying them first: turning motion into skeletons, removing backgrounds, or extracting specific features. That approach works, but every simplification risks losing something important.
SCAIL-2 takes the opposite direction: it gives the model more complete visual information and lets it learn the relationships directly.




