Wan 2.1 SCAIL Pose: 3D-Consistent Pose Transfer

 

Most existing approaches rely on 2D pose signals or weak 3D cues, which are not strong enough to preserve structure across time. They also tend to process frames in isolation or with limited temporal awareness, which is why animations look unstable in long sequences. The SCAIL (Studio-Grade Character Animation via In-Context Learning) framework, released by Tsinghua University and the ZAI team, sets out to build a system that does not just look good in curated demos but holds up under production-level demands, the kind studios care about.



SCAIL model architecture (Ref: research paper)


 Instead of relying on fragile 2D poses, the team introduces a novel 3D-consistent pose representation. This captures motion in a way that stays stable across viewpoints, identities, and time. It acts as a much stronger motion signal for animation tasks. 

SCAIL model pipeline (Ref: research paper)

They also realized that injecting pose information sparsely or locally is not enough. So they designed a full-context pose injection mechanism inside a diffusion-transformer architecture, allowing the model to reason over entire motion sequences, not just individual frames. For more detail, see their research paper.

 

Installation

1. Install ComfyUI if you have not already. If it is already installed, update it from the Manager by selecting the Update All option.

2. Make sure Kijai's ComfyUI-WanVideoWrapper custom node is installed. If you already have it, update it from the Manager.

3. Install Kijai's ComfyUI-WanAnimatePreprocess (for DWPose) and ComfyUI-SCAIL-Pose (for the additional taichi and pyrender requirements) custom nodes from the Manager by selecting the Custom Nodes Manager option. If you already have them, update them from the Manager.

4. Download the SCAIL Pose model from Kijai's Hugging Face repository. Choose the one that suits your system resources:

(a) Wan 2.1 14B SCAIL FP8 (Wan21-14B-SCAIL-preview_fp8_e4m3fn_scaled_KJ.safetensors), for 12 to 16 GB of VRAM and faster inference.

(b) Wan 2.1 14B SCAIL BF16 (Wan21-14B-SCAIL-preview_comfy_bf16.safetensors), for 24 GB of VRAM or more and better output quality.

Save it inside your ComfyUI/models/diffusion_models folder.

5. Download the YOLOv10m model (yolov10m.onnx).

Then download the DW whole-body pose model (vitpose-l-wholebody.onnx). Save both files into your ComfyUI/models/detection folder. If the folder does not exist, create it.

6. Restart and refresh ComfyUI.
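Before restarting, it can be worth confirming that the downloaded files ended up in the folders named above. The following is a minimal sketch, not part of ComfyUI or any custom node: the function name `find_missing_files` and the `comfy_root` argument are hypothetical, and it only assumes the folder layout and file names listed in the installation steps.

```python
from pathlib import Path
from typing import List, Union

# File names taken from the installation steps; only one SCAIL model is required.
SCAIL_MODELS = [
    "Wan21-14B-SCAIL-preview_fp8_e4m3fn_scaled_KJ.safetensors",
    "Wan21-14B-SCAIL-preview_comfy_bf16.safetensors",
]
DETECTION_MODELS = ["yolov10m.onnx", "vitpose-l-wholebody.onnx"]


def find_missing_files(comfy_root: Union[str, Path]) -> List[str]:
    """Return human-readable problems; an empty list means everything was found.

    comfy_root is the path to your ComfyUI folder (an assumption: adjust it
    to wherever ComfyUI is installed on your machine).
    """
    root = Path(comfy_root)
    problems = []

    # At least one of the two SCAIL checkpoints must be present.
    diffusion_dir = root / "models" / "diffusion_models"
    if not any((diffusion_dir / name).is_file() for name in SCAIL_MODELS):
        problems.append(f"No SCAIL model found in {diffusion_dir}")

    # Both detection models are needed for pose extraction.
    detection_dir = root / "models" / "detection"
    for name in DETECTION_MODELS:
        if not (detection_dir / name).is_file():
            problems.append(f"Missing {name} in {detection_dir}")

    return problems
```

Calling `find_missing_files("ComfyUI")` from the folder that contains your ComfyUI install returns a list of anything still missing, so an empty list means the model files are in place.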





Workflow


1. The workflow (wanvideo_SCAIL_pose_control_example_01.json) is located in your ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper/example_workflows folder.

2. Drag and drop it into ComfyUI.

3. If any nodes appear red (missing), install them from the Manager by selecting the Install Missing Nodes option.

4. Run the workflow by setting up the nodes:

(a) Load your image into the Load Image Reference node, then load your reference video into the Load Video node. Use 480p on lower-VRAM systems; users with more VRAM can go up to 720p.

(b) Load the SCAIL Pose model (BF16 or FP8 version) into the WanVideo Model Loader node.

Then load both detection models (YOLOv10m and DW whole-body pose) into the ONNX Detection Model Loader node. These handle body pose extraction from the video.

(c) Load the Wan 2.1 model into the model loader node.

(d) Load the Wan 2.1 VAE and text encoders into their respective nodes.

(e) Add detailed positive and negative prompts to the WanVideo Text Encode node. Describe exactly what you want: in our tests, short or low-quality prompts often produced odd results. You can use any LLM (QwenVL, GPT, or Gemini based) as a prompt enhancer to expand short prompts into more detailed ones.

(f) Hit Run to execute the workflow.

Set these values in the WanVideo Scheduler node:

Scheduler: DPM++ SDE
Steps: 6
Shift: 7.0
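If the workflow shows missing nodes after you drop it in, it can help to list which node types the JSON file expects. The sketch below assumes the workflow file uses ComfyUI's usual saved-workflow layout, with a top-level "nodes" list whose entries carry a "type" field; the helper names `load_workflow` and `count_node_types` are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Union


def load_workflow(path: Union[str, Path]) -> dict:
    """Read a workflow JSON file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def count_node_types(workflow: dict) -> Counter:
    """Count how often each node type appears in the workflow.

    Assumes a top-level "nodes" list with a "type" key per node;
    nodes without that key are counted as "<unknown>".
    """
    return Counter(
        node.get("type", "<unknown>") for node in workflow.get("nodes", [])
    )


# Example usage (path taken from step 1 of the workflow section):
# workflow = load_workflow(
#     "ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper/example_workflows/"
#     "wanvideo_SCAIL_pose_control_example_01.json"
# )
# for node_type, count in sorted(count_node_types(workflow).items()):
#     print(f"{count:3d}  {node_type}")
```

Comparing the printed node types with the custom nodes you have installed gives a quick idea of which package is missing before you open the Manager.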

 


The SCAIL Pose model operates on downsampled pose input rather than full-resolution images. For correct operation, the pose resolution must be exactly half of the final generation resolution. This downsampling improves performance, stability, and consistency of pose geometry.

The pose resolution is automatically calculated based on the generation resolution. The pose is inferred at the required lower resolution and then rendered directly onto the target (final) image resolution. You do not need to manually resize or adjust the pose resolution. 
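To make the half-resolution rule concrete, here is a small sketch of the arithmetic. It is only an illustration of the relationship described above, not the node's actual code: the function name `pose_resolution` is hypothetical, and rejecting odd dimensions is an assumption made here so that "exactly half" stays well defined.

```python
from typing import Tuple


def pose_resolution(gen_width: int, gen_height: int) -> Tuple[int, int]:
    """Return the pose resolution for a given generation resolution.

    The pose input is exactly half of the final generation size in
    each dimension. Odd dimensions are rejected (an assumption of this
    sketch) because they cannot be halved exactly.
    """
    if gen_width <= 0 or gen_height <= 0:
        raise ValueError("Resolution must be positive")
    if gen_width % 2 or gen_height % 2:
        raise ValueError("Width and height must be even to halve exactly")
    return gen_width // 2, gen_height // 2
```

For example, a generation size of 832x480 (used here as an assumed 480p size) gives a pose resolution of 416x240, and 1280x720 gives 640x360.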

dw_poses is an optional input used for drawing detailed pose information (body, face, and hands) and for aligning the generated pose to the reference image. When dw_poses is connected, the system aligns the pose with the reference image, which ensures correct positioning, scale, and orientation. You can enable or disable it as needed.
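To give a rough idea of what aligning a pose to a reference means, the sketch below rescales and shifts a set of 2D keypoints so that their bounding box matches the height and center of a reference pose. This is a deliberately simplified illustration: it is not how the SCAIL nodes implement alignment, it handles only scale and position (not orientation), and the names `bounding_box` and `align_to_reference` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def bounding_box(points: List[Point]) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) for a list of 2D points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def align_to_reference(pose: List[Point], reference: List[Point]) -> List[Point]:
    """Scale and translate `pose` so its bounding box matches `reference`.

    The scale factor is the ratio of bounding-box heights, and the
    translation moves the pose's box center onto the reference's center.
    """
    px0, py0, px1, py1 = bounding_box(pose)
    rx0, ry0, rx1, ry1 = bounding_box(reference)

    pose_height = py1 - py0
    if pose_height == 0:
        raise ValueError("Pose has zero height and cannot be scaled")
    scale = (ry1 - ry0) / pose_height

    pose_cx, pose_cy = (px0 + px1) / 2, (py0 + py1) / 2
    ref_cx, ref_cy = (rx0 + rx1) / 2, (ry0 + ry1) / 2

    return [
        ((x - pose_cx) * scale + ref_cx, (y - pose_cy) * scale + ref_cy)
        for x, y in pose
    ]
```

The point of the sketch is only that a driving pose taken from a video rarely matches the reference character's size and position in the frame, which is the gap the dw_poses alignment is meant to close.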