If you are an AI video creator, animator, or just passionate about the latest in generative media, ByteDance’s new Phantom Subject2Video framework will make your work easier.
The model is built on top of the WAN 2.1 diffusion model, which was trained for both text-to-video and image-to-video generation, leveraging its strengths in video synthesis and layering ByteDance’s subject-consistency innovations on top.
Phantom is a video generation model focused on subject consistency. It allows you to use one or more reference images (portraits, character art, or even photos) and generate videos where those characters retain their identity, style, and outfits across different scenes and actions. You can get more in-depth information from their research paper.
This is a leap beyond previous models, which often struggled to keep characters looking the same from frame to frame or scene to scene.
Installation
1. Install ComfyUI if you are a new user. Existing users should update it from the Manager section by clicking the "Update ComfyUI" button.
2. The architecture uses WanVideo as the base model, so Phantom runs through the WanVideo wrapper in the background. To work with it, it is recommended to get the WanVideo Wrapper by Kijai, which we have explained in our WanVideo tutorial.
If you already have it, update this custom node from the Manager, or move into its folder and run "git pull" from a command prompt (see the command sketch after this list).
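For reference, here is a minimal command-line sketch of the install and update steps. It assumes a standard ComfyUI folder layout with git and pip available on your PATH; adjust the paths for your own setup.

    # Fresh install of ComfyUI (skip if you already have it installed)
    git clone https://github.com/comfyanonymous/ComfyUI
    cd ComfyUI
    pip install -r requirements.txt

    # Install the WanVideo Wrapper custom node by Kijai
    cd custom_nodes
    git clone https://github.com/kijai/ComfyUI-WanVideoWrapper
    pip install -r ComfyUI-WanVideoWrapper/requirements.txt

    # Already have the wrapper? Just pull the latest changes instead
    cd ComfyUI-WanVideoWrapper
    git pull

Updating from the Manager achieves the same result if you prefer to stay inside the interface.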
The Phantom 14-billion-parameter model has been announced but not yet released. We will update this guide whenever it becomes available.
Resolution & VRAM: Lower resolutions (480p) generate quickly and use less VRAM (about 4GB for one subject, 8GB for two). Higher resolutions (720p+) require more VRAM and time, so plan accordingly.
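If you are unsure how much headroom your GPU has, a quick way to check on an NVIDIA card is shown below; this assumes the driver's nvidia-smi tool is available.

    # Show the GPU name plus total and currently used VRAM
    nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv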
Generate up to 81 frames per video (about 5 seconds at 16 FPS), with consistent character appearance, facial features, and clothing across all frames.
Restart ComfyUI and refresh it.
Workflow
1. Get the workflow from the "ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper/example_workflows" folder, then drag and drop it into ComfyUI (see the sketch after this list for locating it from a terminal).
2. Reference Inputs: Import up to four subject images. Each is encoded and passed to the Phantom embed node.
3. Describe your scene and character details in the text prompt. The more specific you are (e.g., 'woman in a long black dress, high heels; man in a sky blue printed shirt, white pants'), the more accurately Phantom matches outfits and styles in the output video.
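If you prefer to locate the workflow file from a terminal first, the bundled examples can be listed as below. The path assumes a default ComfyUI install, and the exact JSON filename may differ between wrapper versions.

    # List the example workflows shipped with the WanVideo Wrapper
    ls ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper/example_workflows

Drag the Phantom workflow JSON from that folder into the ComfyUI window to load it.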
You can generate videos from either a single reference image or multiple reference images; just keep the number of images to four or fewer.
For the best results, describe the reference images clearly in your prompt. If the video output is not quite what you expected, the simplest fix is to try a different seed value and tweak your prompt description.