HunyuanCustom: Subject-Driven Consistent Video Generation

Setting up the HunyuanCustom Model

Controlling your subject in video generation is now much easier than before. HunyuanCustom, the latest multi-modal video generation model from the Tencent team, takes controllable video generation to the next level.

It uses a subject or object image as the reference and generates subject-focused video. You can drive generation with a single subject or with multiple subjects. The model can also generate videos from a reference audio track combined with text conditioning, or perform AI-based object editing on existing videos.

hunyuan custom architecture showcase
Reference- HunyuanCustom Official Page

Whether you want to create your own AI influencer, produce e-commerce product showcases, or make promotional content, this is a compelling approach. It is especially effective for story generation where you need a high level of subject consistency. You can learn more in their research paper.

The model is built on the base HunyuanVideo architecture, so you will need to complete the HunyuanVideo setup if you haven't done it before. Let's see how to do this.


Installation

1. New users need to install ComfyUI. Existing users should update it to avoid any future errors. You can also check the beginner's guide to ComfyUI if you are not familiar with it.



update the custom nodes from manager

2. Install and set up the HunyuanVideoWrapper custom nodes by Kijai. If you already have this custom node, you only need to update it from the Manager by searching for it under the "Custom Nodes" option.
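If you prefer a manual install over the Manager, here is a minimal sketch. It assumes your ComfyUI folder sits in the current directory, that the wrapper lives at github.com/kijai/ComfyUI-HunyuanVideoWrapper, and that it ships a requirements.txt like most custom nodes do.

```python
# Minimal manual-install sketch for the HunyuanVideoWrapper custom nodes.
# Assumes git and pip are on your PATH and ComfyUI is at ./ComfyUI.
import subprocess
from pathlib import Path

custom_nodes = Path("ComfyUI/custom_nodes")
repo_url = "https://github.com/kijai/ComfyUI-HunyuanVideoWrapper"  # Kijai's wrapper repo
target = custom_nodes / "ComfyUI-HunyuanVideoWrapper"

if not target.exists():
    # Clone the wrapper into ComfyUI's custom_nodes folder
    subprocess.run(["git", "clone", repo_url, str(target)], check=True)
else:
    # Already installed: pull the latest changes (same effect as updating via Manager)
    subprocess.run(["git", "-C", str(target), "pull"], check=True)

# Install the wrapper's Python dependencies into your ComfyUI environment
subprocess.run(["pip", "install", "-r", str(target / "requirements.txt")], check=True)
```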

3. The official HunyuanCustom FP16 model requires at least 80GB of VRAM, and the FP8 variant requires at least 24GB of VRAM.


download hunyuan custom model

However, users with less VRAM can use the HunyuanCustom models quantized by Kijai. After downloading a model, save it into the "ComfyUI/models/diffusion_models" folder. The BF16 variant is for users with more than 12GB of VRAM; those with less should use the FP8 variant.
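If you would rather script the download than fetch the file in a browser, here is a small sketch using huggingface_hub. The repo id and filename below are assumptions for illustration only; copy the exact values from Kijai's Hugging Face page for the variant (BF16 or FP8) that matches your VRAM.

```python
# Hedged sketch: download a quantized HunyuanCustom checkpoint into ComfyUI's model folder.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="Kijai/HunyuanVideo_comfy",                            # assumed repo id, verify on Hugging Face
    filename="hunyuan_video_custom_720p_fp8_scaled.safetensors",   # placeholder filename, verify before use
    local_dir="ComfyUI/models/diffusion_models",                   # folder ComfyUI expects for diffusion models
)
print("Saved to:", model_path)
```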


4. Now download the VAE, text encoder, and CLIP vision models from the Hugging Face repository. If you already downloaded these for the HunyuanVideo model earlier, this is not required.

- VAE - save this into the "ComfyUI/models/vae" folder.

- Text Encoder - save this into the "ComfyUI/models/LLM/llava-llama-3-8b-text-encoder-tokenizer" folder.

- CLIP Vision - save this into the "ComfyUI/models/clip/clip-vit-large-patch14" folder.

If you do not have these folders, just create them and then save each model into its respective folder (a quick sketch for creating them follows).
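A small sketch to create all of the expected folders at once, run from the directory that contains your ComfyUI folder:

```python
# Create the expected model folders if they are missing.
from pathlib import Path

folders = [
    "ComfyUI/models/vae",
    "ComfyUI/models/LLM/llava-llama-3-8b-text-encoder-tokenizer",
    "ComfyUI/models/clip/clip-vit-large-patch14",
]
for folder in folders:
    # exist_ok=True makes this safe to run even if the folder already exists
    Path(folder).mkdir(parents=True, exist_ok=True)
    print("ready:", folder)
```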

5. Restart ComfyUI and refresh it.



Workflow 

1. Get the workflow from your "ComfyUI/custom_nodes/ComfyUI-HunyuanVideoWrapper/examples" folder.
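If you are unsure which file to pick, a quick sketch that lists the shipped example workflows is below. The folder path is the one mentioned above; the exact HunyuanCustom workflow filename may vary between wrapper versions.

```python
# List the example workflow JSON files shipped with the wrapper.
# Run from the directory that contains your ComfyUI folder.
from pathlib import Path

examples = Path("ComfyUI/custom_nodes/ComfyUI-HunyuanVideoWrapper/examples")
for workflow in sorted(examples.rglob("*.json")):
    print(workflow.relative_to(examples))
```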


2. Drag and drop it into ComfyUI.

load model

(a) Load HunyuanCustom Model. 

load vae model

(b) Load VAE model.

load text encoder and vae models

(c) Select and load the text encoder and CLIP models.

resize image node

3. Load your image. If your image is not at an ideal resolution, use one of these recommended values (provided on their official page):

(a) 720 by 1080 with 129 frames.

(b) 512 by 896 with 129 frames.

Note: Keep in mind that the higher your resolution, the higher your VRAM utilization will be. If you want to resize your image outside ComfyUI first, see the sketch below.
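Here is a minimal Pillow sketch for resizing a reference image to one of the recommended resolutions; the input filename is a placeholder, and the Resize Image node in the workflow can handle this step as well.

```python
# Hedged sketch: resize a reference image to a recommended resolution with Pillow.
from PIL import Image

target_size = (512, 896)  # (width, height) - the lower-VRAM option from the list above
img = Image.open("reference.png").convert("RGB")   # placeholder input path
resized = img.resize(target_size, Image.LANCZOS)   # high-quality downscaling filter
resized.save("reference_resized.png")
print("resized to", resized.size)
```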

add relevant text prompt

4. Add your prompt. We used the following to help the model understand the scene better:

Prompt: a girl dancing in college farewell on the stage.


video generated using hunyuan custom

5. Click the Run option. The result you see is not cherry-picked; we are simply presenting what we got on our first attempt.

The model adheres closely to your subject image and generates whatever you provide as reference prompts and images. You may not get a perfect generation on the first attempt, but the results are still better than many other video generation models, which tend to hallucinate more with morphing video frames.