FramePack: Generate Longer Videos with Low VRAM

FramePack image-to-video generation model

Video generation has always been a resource-intensive task, often requiring powerful GPUs and significant processing time. But what if you could generate high-quality videos on an average GPU? That is the promise of FramePack, a creative approach that is changing how we think about next-frame prediction models for video generation.

It is an image-to-video diffusion framework developed by researchers Lvmin Zhang and Maneesh Agrawala (the same Lvmin Zhang who also created ControlNet and IC-Light).

Unlike traditional video AI models, which consume excessive memory and often crash on consumer hardware, FramePack is designed to run smoothly on everyday computers: even a laptop with just 6GB of VRAM can generate full 30fps videos, provided it has at least 30GB of system RAM. You can get more detailed insights from their research paper.


Installation

1. Get ComfyUI installed if you are a new user. Existing users should update it from the Manager by clicking "Update ComfyUI" (a manual update sketch for Git installs is shown below).
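If you installed ComfyUI manually from GitHub rather than through the Manager, a minimal update sketch (assuming a standard git clone of the official repository) looks like this:

cd ComfyUI
git pull
python -m pip install -r requirements.txt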

2. Move into the "ComfyUI/custom_nodes" folder. Clone Kijai's ComfyUI-FramePackWrapper repository by typing the following command into the command prompt:

git clone https://github.com/kijai/ComfyUI-FramePackWrapper.git

3. Then install the required dependencies using the following commands in the command prompt.

For normal ComfyUI users:

pip install -r requirements.txt

For ComfyUI portable users: move inside the ComfyUI_windows_portable folder and use this command:

python_embeded\python.exe -m pip install -r ComfyUI\custom_nodes\ComfyUI-FramePackWrapper\requirements.txt

4. Now, download the required base model, text encoders, VAE, and CLIP Vision files.

Note that if you are already using ComfyUI's native support for HunyuanVideo (also described in our tutorial), you do not need to download these models (text encoders, VAE, and CLIP Vision), as the project uses the same ones. You only need to download the FramePack I2V diffusion model.


Download CLIP Vision, text encoders, VAE

If you have not used them before, download and place these models into their respective folders (a command-line download sketch follows this list):

- Download the CLIP Vision file (SigClip) and put it in your ComfyUI/models/clip_vision folder.

- Download the text encoder files and save them into your ComfyUI/models/text_encoders directory.

- Download the VAE model and put it into your ComfyUI/models/vae folder.
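As an illustration, these files can also be fetched from the command line with huggingface-cli. The repository and file names below follow the Comfy-Org packaging of the HunyuanVideo stack and are assumptions; verify them against the download links above, and move any files that land in nested subfolders into the directories named in the list.

huggingface-cli download Comfy-Org/sigclip_vision_384 sigclip_vision_patch14_384.safetensors --local-dir ComfyUI/models/clip_vision
huggingface-cli download Comfy-Org/HunyuanVideo_repackaged split_files/text_encoders/clip_l.safetensors --local-dir ComfyUI/models/text_encoders
huggingface-cli download Comfy-Org/HunyuanVideo_repackaged split_files/text_encoders/llava_llama3_fp8_scaled.safetensors --local-dir ComfyUI/models/text_encoders
huggingface-cli download Comfy-Org/HunyuanVideo_repackaged split_files/vae/hunyuan_video_vae_bf16.safetensors --local-dir ComfyUI/models/vae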


5. Now, download the FramePack diffusion model. Create a folder named lllyasviel inside the "ComfyUI/models/diffusers/" folder, then create a FramePackI2V_HY folder inside it and save the model files there.

FramePack F1 has also been officially released; it uses single-direction (forward-only) prediction, whereas the older model uses bi-directional prediction. You can download it and store it in the same folder structure described above (see the sketch below).
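As a sketch, downloading the diffusers-format models with huggingface-cli could look like this (the repository IDs lllyasviel/FramePackI2V_HY and lllyasviel/FramePack_F1_I2V_HY_20250503 are assumptions; confirm them on Hugging Face before downloading):

huggingface-cli download lllyasviel/FramePackI2V_HY --local-dir ComfyUI/models/diffusers/lllyasviel/FramePackI2V_HY
huggingface-cli download lllyasviel/FramePack_F1_I2V_HY_20250503 --local-dir ComfyUI/models/diffusers/lllyasviel/FramePack_F1_I2V_HY_20250503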

Another alternative is to use the converted models from developer Kijai.

Download the FramePack I2V FP8 or FramePack I2V FP16 model and save it into your "ComfyUI/models/diffusion_models" folder (a download sketch follows).
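A command-line sketch for fetching the converted single-file models (the repository Kijai/HunyuanVideo_comfy and the exact file names are assumptions; check Kijai's Hugging Face page for the current ones):

huggingface-cli download Kijai/HunyuanVideo_comfy FramePackI2V_HY_fp8_e4m3fn.safetensors --local-dir ComfyUI/models/diffusion_models
huggingface-cli download Kijai/HunyuanVideo_comfy FramePackI2V_HY_bf16.safetensors --local-dir ComfyUI/models/diffusion_models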

We will update this guide as soon as a quantized FramePack F1 variant becomes available.

Select the one that suits your hardware and use case: FP8 requires less VRAM at somewhat lower quality, whereas FP16 uses more VRAM for higher-quality output.

6. Restart ComfyUI and refresh it.


Workflow

1. The example workflow can be found inside your "ComfyUI/custom_nodes/ComfyUI-FramePackWrapper/example_workflows" folder.

2. Drag and drop it into ComfyUI.



Load your FramePack model

(a) Load the FramePack I2V model. You have two options for loading it; choose either one. The first loads the model provided by lllyasviel, and the second loads Kijai's quantized version. Then set the attention mode (SDPA, Flash Attention, or Sage Attention) and the model precision (BF16, FP16, FP32).

Using Flash Attention or Sage Attention can reduce inference time by roughly 30-40%.
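Note that Flash Attention and Sage Attention are optional extras and are not pulled in by the wrapper's requirements. A minimal install sketch (package names as published on PyPI; prebuilt wheels must match your CUDA and PyTorch versions, and flash-attn can take a long time to compile otherwise):

pip install sageattention
pip install flash-attn --no-build-isolation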

Choose your target image

(b) Load the image you want to convert into video footage.

Select all clip models and text encoders

(c) Load all the clip models and text encoders.

load sigclip model

(d) Load the sigclip model.

load VAE model

(e) Select the VAE model.

Set your image resolution

(f) Set the image resolution. The default is 512 for lower-VRAM users; you can set it higher if your VRAM allows.


Put your positive prompts

(g) Provide a relevant positive prompt for the conditioning. Negative conditioning is not required here.

Setup configurations for framepack

(h) Set the configuration in the FramePack node:

- Set the video length in seconds.

- Set the GPU memory preservation (minimum 6GB).

- Start generation using the Run option.

FramePack represents a significant breakthrough in making high-quality video generation accessible to everyday users. Its innovative approach to memory management and bi-directional sampling solves key challenges that have limited video generation on consumer hardware.

While it's particularly well-suited for certain types of videos and has some limitations with complex scene changes, the ability to generate minutes-long videos in a single pass on a laptop GPU is truly revolutionary. For content creators, researchers, and AI enthusiasts, FramePack opens up new possibilities without requiring enterprise-grade hardware.