Install CogVideoX: Text-to-Video and Image-to-Video (ComfyUI)


CogVideoX is a diffusion-based text-to-video model released by the Knowledge Engineering Group (KEG) & Data Mining (THUDM) at Tsinghua University.

The model was trained on long, detailed prompts, such as those written by GLM-4 or GPT-4. For a detailed overview of CogVideoX, see the research paper. For commercial use, check out ChatGLM and the API platform.

Unlike many diffusion-based video generation models, which struggle to generate longer clips, CogVideoX can generate 6-second videos. It can now also run on GPUs with less than 12 GB of VRAM.

Currently, these variants have been released:

  •  CogVideoX-5B (Text-to-Video), released under the CogVideoX License.
  •  CogVideoX-2B (Text-to-Video), released under the Apache 2.0 License.
  •  CogVideoX-5B-I2V (Image-to-Video), released under the CogVideoX License.

Let's move on to the installation and the workflow.




Installation:

1. First, install ComfyUI if you have not done so already.

2. Now, clone the CogVideoX wrapper (custom nodes). Open the "ComfyUI/custom_nodes" folder, click the folder's address bar, type "cmd", and press Enter to open the command prompt in that folder.

Then, just paste this command in the command prompt to install the wrapper:

git clone https://github.com/kijai/ComfyUI-CogVideoXWrapper.git

3. You also need a few additional dependencies to speed up video rendering.

For ComfyUI portable users:

Open the "ComfyUI_windows_portable" folder, click the folder's address bar, type "cmd", and press Enter to open the command prompt. Then run this command:

python_embeded\python.exe -m pip install --pre onediff onediffx nexfort


For regular ComfyUI users:

Open the command prompt and use these commands:

pip install --pre onediff onediffx

pip install nexfort
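After running the install commands above, a quick check can confirm that the packages are visible to the Python interpreter ComfyUI uses. The following is a minimal sketch using only the standard library; the script name `check_deps.py` is hypothetical, and the package names are simply those from the pip commands above.

```python
from importlib import metadata

# Distribution names taken from the pip commands above.
PACKAGES = ["onediff", "onediffx", "nexfort"]


def installed_versions(names):
    """Return {name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Portable users would run the script with the embedded interpreter (for example, `python_embeded\python.exe check_deps.py`), so that the check looks at the same environment the install command targeted.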


All the required models are downloaded automatically from THUDM's Hugging Face repository, so you don't need to download them manually.

The first run of the workflow will take some time because the models are downloaded in the background. To follow the download progress in real time, switch to the command prompt window that ComfyUI is running in.


Workflow:

1. The workflows can be found inside the "ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/examples" folder. Drag and drop one directly into ComfyUI.

| Workflow | Description |
| --- | --- |
| cogvideo_2b_context_schedule_test_01.json | update workflows |
| cogvideo_5b_vid2vid_example_01.json | update workflows |
| cogvideo_2b_controlnet_example_01.json | Update cogvideo_2b_controlnet_example_01.json |
| cogvideo_5b_example_01.json | Update cogvideo_5b_example_01.json |
| cogvideo_I2V_example_01.json | update workflows |
| cogvideo_fun_pose_example_01.json | Add context schedules for control pipeline |
| cogvideo_fun_5b_GGUF_10GB_VRAM_example_01.json | Create Cogvideo_fun_5b_GGUF_10GB_VRAM_example_01.json |
| cogvideo_fun_i2v_example_01.json | update vae tile defaults |


Several workflows are available; choose the one that fits your requirements. For illustration, we use the basic one.
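The set of example workflows changes as the wrapper is updated, so it can help to list what your local clone actually contains. Here is a small sketch, assuming ComfyUI lives in a folder named "ComfyUI" relative to where the script is run (adjust the path for your setup); the function name `list_workflows` is only illustrative.

```python
from pathlib import Path


def list_workflows(comfy_root):
    """Return the sorted names of the example workflow JSON files."""
    examples = (
        Path(comfy_root)
        / "custom_nodes"
        / "ComfyUI-CogVideoXWrapper"
        / "examples"
    )
    if not examples.is_dir():
        # Wrapper not cloned yet, or a different install location.
        return []
    return sorted(p.name for p in examples.glob("*.json"))


if __name__ == "__main__":
    for name in list_workflows("ComfyUI"):
        print(name)
```

An empty result usually means the path is wrong or the wrapper has not been cloned into "custom_nodes" yet.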




2. In the CogVideoX model node, there are three variants to choose from:
(a) CogVideoX-5B (Text-to-Video, for higher VRAM)
(b) CogVideoX-2B (Text-to-Video, for lower VRAM)
(c) CogVideoX-5B-I2V (Image-to-Video)


[Image: CogVideoX configuration. Source: CogVideoX]

Use the recommended settings that match your hardware. The CogVideoX team has published more detailed information, reproduced below; go through it to get a better understanding.

CogVideoX model details:

| Model Type | CogVideoX-2B | CogVideoX-5B | CogVideoX-5B-I2V |
| --- | --- | --- | --- |
| Model Description | Entry-level model, balancing compatibility. Low cost for running and secondary development. | Larger model with higher video generation quality and better visual effects. | CogVideoX-5B image-to-video version. |
| Inference Precision | FP16* (recommended), BF16, FP32, FP8*, INT8; not supported: INT4 | BF16 (recommended), FP16, FP32, FP8*, INT8; not supported: INT4 | Same as CogVideoX-5B |
| Single GPU Memory Usage | SAT FP16: 18 GB; diffusers FP16: from 4 GB*; diffusers INT8 (torchao): from 3.6 GB* | SAT BF16: 26 GB; diffusers BF16: from 5 GB*; diffusers INT8 (torchao): from 4.4 GB* | Same as CogVideoX-5B |
| Multi-GPU Inference Memory Usage | FP16: 10 GB* using diffusers | BF16: 15 GB* using diffusers | Same as CogVideoX-5B |
| Inference Speed (Step = 50, FP/BF16) | Single A100: ~90 seconds; Single H100: ~45 seconds | Single A100: ~180 seconds; Single H100: ~90 seconds | Same as CogVideoX-5B |
| Fine-tuning Precision | FP16 | BF16 | Same as CogVideoX-5B |
| Fine-tuning Memory Usage | 47 GB (bs=1, LORA); 61 GB (bs=2, LORA); 62 GB (bs=1, SFT) | 63 GB (bs=1, LORA); 80 GB (bs=2, LORA); 75 GB (bs=1, SFT) | 78 GB (bs=1, LORA); 75 GB (bs=1, SFT, 16 GPU) |
| Prompt Language | English* | Same as CogVideoX-2B | Same as CogVideoX-2B |
| Maximum Prompt Length | 226 Tokens | Same as CogVideoX-2B | Same as CogVideoX-2B |
| Video Length | 6 Seconds | Same as CogVideoX-2B | Same as CogVideoX-2B |
| Frame Rate | 8 Frames / Second | Same as CogVideoX-2B | Same as CogVideoX-2B |
| Video Resolution | 720 x 480, no support for other resolutions (including fine-tuning) | Same as CogVideoX-2B | Same as CogVideoX-2B |
| Position Embedding | 3d_sincos_pos_embed | 3d_rope_pos_embed | 3d_rope_pos_embed + learnable_pos_embed |
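To put the video figures from the table in perspective, the nominal frame count and per-frame size can be worked out directly. This is only back-of-the-envelope arithmetic on the published numbers (6 seconds, 8 frames per second, 720 x 480); the frame-count setting exposed by the wrapper nodes may differ slightly from this nominal value.

```python
# Figures taken from the model details table.
VIDEO_SECONDS = 6
FRAMES_PER_SECOND = 8
WIDTH, HEIGHT = 720, 480


def nominal_frame_count(seconds, fps):
    """Frames in a clip of the given length at a constant frame rate."""
    return seconds * fps


frames = nominal_frame_count(VIDEO_SECONDS, FRAMES_PER_SECOND)
pixels_per_frame = WIDTH * HEIGHT

print(frames)            # 48
print(pixels_per_frame)  # 345600
```

In other words, each clip is roughly 48 frames of 720 x 480 video, which is why the raw output looks low-resolution and is usually upscaled afterwards.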


3. Load the CLIP models. Use FP16 on higher-end GPUs and FP8 on lower-end GPUs.

4. According to the official guidance, the model was trained on long prompts processed with a T5 transformer model, so we used detailed prompts generated with ChatGPT, which CogVideoX understands better.


First test: 

We generated a professional model photoshoot clip in the ocean.

Prompt used: A professional photoshoot scene set in the ocean, featuring a model standing confidently in shallow water. The model is dressed in a sleek, elegant outfit, with a flowing fabric that moves gracefully with the ocean breeze. The scene is captured during the golden hour, with the sun setting on the horizon, casting a warm glow on the water's surface. Gentle waves lap around the model’s feet, creating a dynamic and serene atmosphere. A professional photographer is seen on the shore, using a high-end camera with a large lens, capturing the moment. Reflective equipment and light modifiers are strategically placed to enhance the lighting, with an assistant holding a reflector to direct sunlight onto the model. The overall mood is glamorous, serene, and professional, emphasizing the beauty of the ocean backdrop and the skill of the photoshoot crew.


[Video: first test clip generated using CogVideoX]

Here is our first result. The woman's right hand is slightly deformed, but the model has added camera movement and panning that give the clip a professional look, along with realistic ocean waves.

The frame quality is admittedly low, but that is not a big concern: our goal here is consistent, defect-free frames. The clip can be upscaled afterwards with other techniques in ComfyUI, such as neural network latent upscaling, or by splitting the video into individual frames and running them through the SUPIR upscaler.
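The frame-splitting step mentioned above can be done outside ComfyUI. Below is a hedged sketch that assumes the ffmpeg command-line tool is installed and on your PATH (ffmpeg is our assumption here, not something the wrapper requires); the file names `clip.mp4` and `frames` are hypothetical placeholders.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(video, out_dir):
    """Build an ffmpeg command that writes every frame as a numbered PNG."""
    pattern = Path(out_dir) / "frame_%05d.png"
    return ["ffmpeg", "-i", str(video), str(pattern)]


def extract_frames(video, out_dir):
    """Run ffmpeg to split the video into frames inside out_dir."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_command(video, out_dir), check=True)


if __name__ == "__main__":
    # Hypothetical input clip; only runs if the file actually exists.
    if Path("clip.mp4").exists():
        extract_frames("clip.mp4", "frames")
```

The resulting PNG files can then be loaded as an image batch and passed through an upscaler of your choice.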


Second test:

Let's challenge the model with a more complex scene and see how well it holds up.

Prompt used: An action-packed scene set in a futuristic cityscape at night, inspired by an Iron Man movie. The central figure is a superhero in a high-tech, red and gold metallic suit with glowing blue eyes and arc reactor on the chest, hovering in mid-air with jet thrusters blazing from his hands and feet. The suit is sleek, with intricate details and panels that reflect the city lights. In the background, towering skyscrapers with neon signs and holographic billboards illuminate the night sky. The superhero is in a dynamic pose, dodging a barrage of energy blasts from a formidable enemy robot flying nearby, which is large, menacing, and armed with glowing red weaponry. Sparks fly and smoke trails in the air, adding to the intensity of the battle. The scene captures a sense of speed, power, and heroism, with a dramatic sky filled with dark clouds and flashes of lightning, amplifying the urgency and high stakes of the confrontation.


[Video: second test clip generated using CogVideoX]

This time the model seems unsure what to generate, and there is a lot of morphing across the video frames. Overall, though, the result is still considerably better than what we get from other diffusion-based models, which often need multiple attempts to produce a single usable clip.


Conclusion:

After some testing, we can conclude that CogVideoX is more capable than the other diffusion-based video generation models we have tried. It can now also run on lower-end GPUs by using a quantized model.