If you are an AI video creator, animator, or just passionate about the latest in generative media, ByteDance’s new Phantom Subject2Video framework will make your work easier.
The model is built on top of the WAN 2.1 diffusion model, which was trained for both text-to-video and image-to-video generation, leveraging its strengths in video synthesis and layering ByteDance’s subject-consistency innovations on top.
Phantom is a video generation model focused on subject consistency. It allows you to use one or more reference images (portraits, character art, or even photos) and generate videos where those characters retain their identity, style, and outfits across different scenes and actions. You can get more in-depth information from their research paper.
This is a leap beyond previous models, which often struggled to keep characters looking the same from frame to frame or scene to scene.
Installation
1. Install ComfyUI if you are a new user. Existing users should update it from the Manager section by clicking the "Update ComfyUI" button.
2. The architecture uses WanVideo as the base model, so Phantom runs through the WanVideo wrapper in the background. To work with it, it is recommended to get the WanVideo Wrapper by Kijai, which we have explained in our WanVideo tutorial.
If you already have it, update this custom node from the Manager, or move into its folder and run "git pull" from a command prompt (see the command sketch after this list).
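For reference, here is a minimal command-line sketch of the install and update steps. It assumes a standard ComfyUI folder layout with git and pip available on your PATH; adjust the paths for your own setup.

    # Fresh install of ComfyUI (skip if you already have it installed)
    git clone https://github.com/comfyanonymous/ComfyUI
    cd ComfyUI
    pip install -r requirements.txt

    # Install the WanVideo Wrapper custom node by Kijai
    cd custom_nodes
    git clone https://github.com/kijai/ComfyUI-WanVideoWrapper
    pip install -r ComfyUI-WanVideoWrapper/requirements.txt

    # Already have the wrapper? Just pull the latest changes instead
    cd ComfyUI-WanVideoWrapper
    git pull

Updating from the Manager achieves the same result if you prefer to stay inside the interface.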
The Phantom 14-billion-parameter model has been announced but not yet released. We will update this guide whenever it becomes available.
Resolution & VRAM: Lower resolutions (480p) generate quickly and use less VRAM (about 4GB for one subject, 8GB for two). Higher resolutions (720p+) require more VRAM and time, so plan accordingly.
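If you are unsure how much headroom your GPU has, a quick way to check on an NVIDIA card is shown below; this assumes the driver's nvidia-smi tool is available.

    # Show the GPU name plus total and currently used VRAM
    nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv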
Generate up to 81 frames per video (about 5 seconds at 16 FPS), with consistent character appearance, facial features, and clothing across all frames.
Restart ComfyUI and refresh it.
Workflow
1. Get the workflow from the "ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper/example_workflows" folder, then drag and drop it into ComfyUI (see the sketch after this list for locating it from a terminal).
2. Reference Inputs: Import up to four subject images. Each is encoded and passed to the Phantom embed node.
3. Describe your scene and character details in the text prompt. The more specific you are (e.g., 'woman in a long black dress, high heels; man in a sky blue printed shirt, white pants'), the more accurately Phantom matches outfits and styles in the output video.
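If you prefer to locate the workflow file from a terminal first, the bundled examples can be listed as below. The path assumes a default ComfyUI install, and the exact JSON filename may differ between wrapper versions.

    # List the example workflows shipped with the WanVideo Wrapper
    ls ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper/example_workflows

Drag the Phantom workflow JSON from that folder into the ComfyUI window to load it.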
You can generate videos from either a single reference image or multiple reference images; just keep the number of images to four or fewer.
For the best results, describe the reference images clearly in your prompt. If the video output is not quite what you expected, the simplest fix is to try a different seed value and tweak your prompt description.