Scail 2 with Wan: Video to Video Motion Transfer

Controlled character animation sounds simple: take the motion from one video and transfer it to another character. In practice, it has always been a difficult challenge.

Most existing character animation systems convert motion into pose skeletons or separate the environment using masked backgrounds. These methods make the problem easier for the model, but they also remove important information. A skeleton can describe body movement, but it cannot capture everything, such as subtle clothing motion, interaction with the environment, hand details, or the natural flow of a scene.

Wan 2.1 Scail 2 model showcase

Scail 2, built upon Wan 2.1 and developed by the Zai-org team and Tsinghua University, aims to create end-to-end character animation by allowing the model to learn directly from the complete visual information inside videos. Instead of converting motion into skeletons or separating backgrounds, Scail 2 works directly with the source video itself. The goal is simple: preserve more information to create more realistic animations.

If you follow the Wan models, you have probably heard of Wan 2.1 Scail, which was not that successful. Instead of extracting only motion signals, Scail 2 concatenates the driving video directly into the generation sequence. This gives the model access to everything available in the input, including movement, appearance, environment, and fine visual details. You can find the detailed insights in their research paper.

Scail 2 model architecture

The framework also introduces:

- In-context mask conditioning to provide additional control beyond text prompts and raw visual inputs.

- Mode-specific RoPE to guide different animation tasks more effectively.

- Bias-Aware DPO (Direct Preference Optimization) to reduce errors in detailed regions by creating better preference-based training examples.

These improvements allow the model to better understand complex motion and generate more accurate results.
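For readers unfamiliar with DPO, here is a minimal sketch of the standard DPO loss for a single preference pair. This is the generic textbook formulation, not the Bias-Aware variant used by Scail 2, whose details are in the paper. The function name, the argument names, and the default `beta` are assumptions made for illustration.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    Inputs are log-likelihoods of the preferred ("win") and rejected
    ("lose") samples under the trained policy and a frozen reference
    model. Illustrative only: not the Bias-Aware DPO from Scail 2.
    """
    # How much more the policy favors each sample than the reference does.
    win_ratio = policy_logp_win - ref_logp_win
    lose_ratio = policy_logp_lose - ref_logp_lose
    margin = beta * (win_ratio - lose_ratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the policy assigns relatively more likelihood to the preferred sample than the reference model does, which is how preference pairs steer training toward better outputs.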

Installation

1. First, install ComfyUI. If you already have it installed, update ComfyUI from the Manager.

2. Download any of the Wan 2.1 Scail 2 variants (fp16/fp8 scaled/fp8 mixed/GGUF) listed below. Choose according to your system resources:

(a) Wan2.1_14B_SCAIL_2_fp8_scaled (wan2.1_14B_SCAIL_2_fp8_scaled.safetensors) - minimum 16 GB VRAM

(b) Wan2.1_14B_SCAIL_2_fp16 (wan2.1_14B_SCAIL_2_fp16.safetensors) - minimum 32 GB VRAM

Save the file into the ComfyUI/models/diffusion_models folder.

Alternatively, use a GGUF variant if you have low VRAM (6-17 GB).

(d) Wan 2.1 Scail2 GGUF. Variants range from Q2 (lowest VRAM use, lowest quality) to Q8 (highest VRAM use, best quality).

Save the file into the ComfyUI/models/unet folder. Make sure you have the ComfyUI-GGUF custom node by city96. If you do not, install it from the Manager.

3. Download clip_vision_h (clip_vision_h.safetensors) and save it into the ComfyUI/models/clip_vision folder.

4. Download Sam3.1_multiplex_fp16 (sam3.1_multiplex_fp16.safetensors) and save it into the ComfyUI/models/checkpoints folder.

5. Download Wan21_I2V_14B_lightx2v_cfg_step_distill_lora_rank64 (Wan21_I2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors) and wan2.1_SCAIL_2_DPO_lora_bf16 (wan2.1_SCAIL_2_DPO_lora_bf16.safetensors).

Save both into the ComfyUI/models/loras folder. Use the DPO LoRA to fix hands and faces.

6. Download umt5_xxl_fp8_e4m3fn_scaled (umt5_xxl_fp8_e4m3fn_scaled.safetensors) and save it into the ComfyUI/models/text_encoders folder.

7. Download Wan_2.1_vae (wan_2.1_vae.safetensors) and save it into the ComfyUI/models/vae folder.

8. Restart and refresh ComfyUI.
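Before restarting, it can help to confirm that every supporting file landed in the right folder. The sketch below checks the file names and folders from steps 3 to 7. The helper name `missing_files` and the ComfyUI root path you pass in are assumptions, and the main Scail 2 model is left out because its file name depends on the variant you chose.

```python
from pathlib import Path

# Relative paths taken from installation steps 3 to 7.
EXPECTED_FILES = [
    "models/clip_vision/clip_vision_h.safetensors",
    "models/checkpoints/sam3.1_multiplex_fp16.safetensors",
    "models/loras/Wan21_I2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors",
    "models/loras/wan2.1_SCAIL_2_DPO_lora_bf16.safetensors",
    "models/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors",
    "models/vae/wan_2.1_vae.safetensors",
]


def missing_files(comfy_root, expected=EXPECTED_FILES):
    """Return the expected files that are not present under comfy_root.

    comfy_root is the path of your ComfyUI folder (an assumption:
    adjust it to wherever ComfyUI is installed on your system).
    """
    root = Path(comfy_root)
    return [rel for rel in expected if not (root / rel).is_file()]
```

Calling `missing_files("ComfyUI")` from the folder that contains your ComfyUI install returns an empty list when everything is in place, or the relative paths that still need to be downloaded.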

Workflow

1. Download the Scail 2 workflows from our Hugging Face repository.

(a) Wan2.1_Scail2_basic.json - basic workflow

(b) Wan2.1_Scail2_Extend.json - extends your generated videos to longer durations

2. Drag and drop the workflow into ComfyUI. If you get a missing-node error, install the missing nodes from the Manager.

3. Load each model into its relevant node. Add relevant prompts into the prompt box.

4. Load your reference input image together with the driving video.

5. Hit run to start generation.
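If you prefer to queue runs from a script instead of the Run button, ComfyUI accepts workflows over HTTP at its `/prompt` endpoint. The sketch below is a minimal example under a few assumptions: ComfyUI is running locally on its default address, and the workflow has been re-exported from ComfyUI in API format (the regular workflow .json files are in the UI format, which this endpoint does not accept). The function names and the example file name are hypothetical.

```python
import json
import urllib.request


def build_queue_request(workflow, server="http://127.0.0.1:8188"):
    """Build the POST request that queues an API-format workflow."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{server}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow(path, server="http://127.0.0.1:8188"):
    """Load an API-format workflow file and send it to ComfyUI."""
    with open(path, encoding="utf-8") as f:
        workflow = json.load(f)
    request = build_queue_request(workflow, server)
    with urllib.request.urlopen(request) as response:
        # ComfyUI answers with JSON describing the queued job.
        return json.load(response)
```

For example, `queue_workflow("Wan2.1_Scail2_basic_api.json")` would queue a hypothetical API-format export of the basic workflow, with the image, video, and prompt values already set inside the file.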

A great way to use Scail2 is to start by extracting the very first frame from your source video. Once you have that frame, use a high-quality image editing model to change the subject while keeping the overall composition, lighting, and camera angle as close to the original as possible. The cleaner and more consistent this edited frame is, the better your final results are likely to be.
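Extracting the first frame can be done with any video tool. As one option, here is a small sketch that calls ffmpeg from Python. It assumes ffmpeg is installed and available on your PATH; the function names and file names are illustrative only.

```python
import subprocess


def first_frame_command(video_path, image_path):
    """Return the ffmpeg command that saves the first frame as an image."""
    return [
        "ffmpeg",
        "-y",                # overwrite the output file if it exists
        "-i", str(video_path),
        "-frames:v", "1",    # stop after writing one video frame
        str(image_path),
    ]


def extract_first_frame(video_path, image_path):
    """Run ffmpeg; raises CalledProcessError if ffmpeg reports a failure."""
    subprocess.run(first_frame_command(video_path, image_path), check=True)
```

For example, `extract_first_frame("source.mp4", "first_frame.png")` writes the frame you would then pass to your image editing model.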

After creating the edited first frame, use the original, unmodified video as the driving video in Scail 2. Instead of generating motion from scratch, Scail 2 follows the movement, camera motion, and timing from the original footage while using your edited frame as the new visual reference. This allows the new subject to inherit the natural motion and dynamics of the source video, resulting in a much more stable and coherent animation.

This workflow is particularly effective because it combines the strengths of both tools: the image editing model produces a high-quality replacement for the subject, while Scail 2 preserves the original video's motion, expressions, camera movement, and temporal consistency. The result is typically far smoother and more realistic than trying to generate an entirely new video from a text prompt alone.

Posted by Administrator
