LTXVideo 0.9.7 : Create longer AI videos with low VRAM

If you have ever imagined generating high-quality videos faster than you can watch them, LTX-Video aims to make that possible. Developed by Lightricks, it is described as the first DiT-based (Diffusion Transformer) video generation model capable of producing 24 FPS videos at a resolution of 768x512 pixels in real time.

The model was trained on a large-scale dataset of diverse video content, giving it the ability to generate varied and realistic scenes. From nature-inspired visuals to urban settings, the possibilities are nearly endless. Now, let's move on to the installation process.

Installation

1. Install ComfyUI if you are a new user.

2. Existing users should select "Update All" in the Manager to update ComfyUI.

3. Install the official LTX Video custom nodes from the Manager. To do this, select the "Install Custom node" option, search for "ComfyUI-LTXVideo", and click "Install".

All the necessary files are installed automatically when you run the workflow for the first time. You can track the progress in real time from the ComfyUI terminal.

If you are already using this custom node and just want the new model, update the node from the Manager by searching for "ComfyUI-LTXVideo".

Alternative (manual):

Go to "ComfyUI/custom_nodes" and open a command prompt by typing "cmd" in the folder address bar. Clone the repository with the following command:

git clone https://github.com/Lightricks/ComfyUI-LTXVideo.git

Install the dependencies:

For a regular ComfyUI install, run this from inside the cloned "ComfyUI-LTXVideo" folder:

pip install -r requirements.txt

For portable ComfyUI

Go to the ComfyUI_windows_portable folder, open a command prompt there, and run the command below:

python_embeded\python.exe -m pip install -r ComfyUI\custom_nodes\ComfyUI-LTXVideo\requirements.txt

4. Download any of the updated models (.safetensors files) from the Hugging Face repository and save it inside your "models/checkpoints" folder.

Several model variants are available. Choose one according to your use case:

Sl. No.	Model	Description
(a)	LTXV 13B 0.9.7	Generates cinematic-quality videos at high speed on NVIDIA 4000/5000 series GPUs
(b)	Latent (Spatial/Temporal) Upscaler	The spatial upscaler increases video resolution by upscaling latent tensors without decoding/encoding, while the temporal upscaler improves consistency between frames
(c)	LTXV 0.9.6	Higher-quality and faster generation; great for final results
(d)	LTXV 0.9.6 Distilled	Fastest model; generates in 8 steps, lighter, good for iteration
(e)	LTXV 0.9.5	Improved quality with fewer artifacts in video generation

If you are low on VRAM, use Kijai's quantized FP8 variant from his Hugging Face repository.

5. You also need to download the text encoders (t5xxl_fp16.safetensors, t5xxl_fp8_e4m3fn.safetensors and t5xxl_fp8_e4m3fn_scaled.safetensors) from the Hugging Face repository and save them into the "ComfyUI/models/clip" folder. If you have used Flux workflows before, you should already have these files and can skip this step.

The fp8 variants are for GPUs with 12GB of VRAM or less, whereas fp16 is for higher-end GPUs.
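The 12GB rule of thumb can be written down as a small helper. This is only a sketch: the function name and its parameters are illustrative, and the 12GB cutoff is simply the figure quoted in this guide, not something ComfyUI enforces.

```python
def pick_t5_encoder(vram_gb: float, threshold_gb: float = 12.0) -> str:
    """Suggest a T5-XXL text encoder file for a given amount of VRAM.

    Hypothetical helper: the default 12 GB cutoff mirrors this guide's
    rule of thumb (fp8 for 12 GB and lower, fp16 above that).
    """
    if vram_gb <= 0:
        raise ValueError("vram_gb must be positive")
    if vram_gb <= threshold_gb:
        return "t5xxl_fp8_e4m3fn.safetensors"
    return "t5xxl_fp16.safetensors"


if __name__ == "__main__":
    for vram in (8, 12, 24):
        print(f"{vram} GB -> {pick_t5_encoder(vram)}")
```

If you prefer the scaled fp8 file, swap the returned file name; the selection logic stays the same.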

6. Next, clone the PixArt-Alpha model into your "ComfyUI/models/text_encoders" folder. If the folder does not exist, create it.

To clone it, open a command prompt by typing "cmd" in the folder address bar and run the following command:

git clone https://huggingface.co/PixArt-alpha/PixArt-XL-2-1024-MS

7. Restart and refresh ComfyUI for the changes to take effect.
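Before restarting, it can help to confirm that everything landed in the folders named in the steps above. The script below is a minimal sketch: the function name is illustrative, the folder names are the ones used in this guide, and the checkpoint check only looks for any .safetensors file, so it cannot tell whether that file is actually an LTXV model.

```python
from pathlib import Path


def check_ltxv_layout(comfy_root):
    """Return a dict mapping each expected item to True (found) or False."""
    root = Path(comfy_root)
    models = root / "models"
    checkpoints = models / "checkpoints"
    clip = models / "clip"
    return {
        "custom node (ComfyUI-LTXVideo)":
            (root / "custom_nodes" / "ComfyUI-LTXVideo").is_dir(),
        # Any .safetensors file counts; the model itself is not inspected.
        "checkpoint (.safetensors)":
            checkpoints.is_dir() and any(checkpoints.glob("*.safetensors")),
        "T5 text encoder (t5xxl*.safetensors)":
            clip.is_dir() and any(clip.glob("t5xxl*.safetensors")),
        "PixArt-XL-2-1024-MS clone":
            (models / "text_encoders" / "PixArt-XL-2-1024-MS").is_dir(),
    }


if __name__ == "__main__":
    import sys

    # Pass the ComfyUI folder as the first argument, or run from its parent.
    target = sys.argv[1] if len(sys.argv) > 1 else "ComfyUI"
    for item, found in check_ltxv_layout(target).items():
        print(("OK       " if found else "MISSING  ") + item)
```

Run it with the path to your ComfyUI folder as the only argument; any line marked MISSING points back to the matching installation step.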

GGUF Variant

A quantized GGUF variant of the LTXVideo model is also available from city96. It gives faster inference at the cost of slightly lower output quality.

1. First, install and set up the GGUF custom nodes. If you already have them, just update them from the Manager.

2. Download any of the LTXV models from the Hugging Face repository. They range from Q3 (faster, lower-quality generation) to Q8 (best quality, longer render time):

- LTXV 13B 0.9.7 GGUF model

- LTXV 0.9.6 GGUF distilled model

After downloading, save the file into your "ComfyUI/models/diffusion_models" folder.
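To get a feel for why the lower quantization levels suit low-VRAM cards, here is a back-of-the-envelope estimate of weight size. It is a rough sketch that assumes about 13 billion parameters (taken from the "13B" in the model name) and treats Q3 to Q8 as a nominal 3 to 8 bits per weight. Real GGUF files carry extra scale data and may keep some layers at higher precision, so actual file sizes will differ.

```python
def approx_weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Nominal weight size in GB (1 GB = 1e9 bytes), ignoring quantization overhead."""
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    # Nominal bit widths only; these are assumptions, not measured file sizes.
    for bits in (3, 4, 5, 6, 8):
        print(f"Q{bits}: ~{approx_weight_size_gb(13, bits):.1f} GB")
```

Treat the numbers as an ordering aid rather than a download size: the lower the quantization level, the less memory the weights need.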

3. Now, download the text encoder model and put it into your "ComfyUI/models/text_encoders" folder.

4. Then, download the VAE model and save it into your "ComfyUI/models/vae" folder.

Workflow

1. Find the workflows inside your "ComfyUI/custom_nodes/ComfyUI-LTXVideo/assets" folder, or download them from the GitHub repository. The workflows include:

(a) Text-to-video

(b) Image-to-video

If you are using the GGUF model, get the workflow from the GGUF Hugging Face repository:

GGUF LTXV distilled 0.9.6 workflow

2. Drag and drop the workflow into ComfyUI.

Let's test how the model performs with detailed text prompts and an input image. Here, we try to generate a horror movie scene using the text-to-video workflow.

(a) Load the LTXV model into the checkpoint node.

(b) Load the CLIP model into the CLIP node.

(c) Add the positive and negative prompts into the text conditioning nodes.

(d) Set your frame rate. The default value is 25.

(e) Set the video resolution (height and width).

(f) Set the steps value.
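When choosing the frame count and resolution, note that the upstream LTX-Video README states the model works with widths and heights divisible by 32 and with frame counts of the form 8n + 1. The helper below is a minimal sketch of snapping requested values to that grid. The function names are illustrative, and the constraint comes from the upstream README rather than from this guide, so check it against the version you install.

```python
def snap_resolution(width: int, height: int, multiple: int = 32) -> tuple[int, int]:
    """Round each side down to the nearest multiple (never below one multiple)."""
    def snap(value: int) -> int:
        return max(multiple, (value // multiple) * multiple)

    return snap(width), snap(height)


def snap_frame_count(seconds: float, fps: int = 25) -> int:
    """Largest frame count of the form 8n + 1 that fits in the requested duration."""
    target = int(seconds * fps)
    return max(1, ((target - 1) // 8) * 8 + 1)


if __name__ == "__main__":
    print(snap_resolution(768, 512))   # (768, 512): already on the grid
    print(snap_resolution(1280, 720))  # (1280, 704)
    print(snap_frame_count(4, 25))     # 97
```

For the 768x512 setting used in the tests below nothing changes, while a 4-second clip at 25 FPS snaps from 100 frames down to 97.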

Text to Video

We used the following positive prompt:

A woman with a haunting presence stands atop the weathered roof of a dilapidated, rust-streaked trailer in a desolate environment. She wears a long, flowing dress that sways gently in the cold wind, its fabric aged and tattered at the edges, blending with the gloom of her surroundings. Her posture is both eerie and commanding, shoulders slightly hunched yet exuding an unsettling authority. Her piercing gaze is locked forward, unyielding and distant, as though peering into another realm. The sky above is dominated by dark, churning clouds, foretelling an impending storm, with streaks of lightning faintly illuminating the horizon. The atmosphere is heavy with tension, shadows from the trailer and nearby debris stretching unnaturally long under the dim, uneven lighting. The scene is captured in hyper-realistic detail, with every element from the grime on the trailer to the strands of her unkempt hair rendered in ultra-high definition, creating a cinematic and chilling portrayal of foreboding.

Negative prompt:

low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly

Resolution: 768x512

Frame rate (FPS): 25

CFG: 3

Steps: 30

Sampler: Euler

Finally, here is our result.

The result looks somewhat like camera footage and is fairly detailed, but the face looks a little deformed and some artifacts remain. Keep in mind that the output is only 768x512, so the quality is limited, but it can be improved with other upscaling techniques.

Second try: this time we generate a video for an e-commerce use case. Let's see how it performs.

Positive Prompt used:

Ultra-high-definition close-up of a stunning editorial female model with flawless skin, striking symmetrical facial features, and piercing eyes. She is wearing a structured outfit featuring a vibrant, colorful Balmain-inspired short skirt paired with a chic, form-fitting top adorned with intricate patterns. The look is completed with bold platform shoes that add a statement edge. The model stands confidently, exuding poise and elegance, illuminated by soft, diffused studio lighting that highlights every texture and detail. The background is minimalist, allowing the outfit’s vibrant colors and the model's elegance to dominate the frame. Rendered in 8k resolution for sharp, lifelike detail, ensuring a highly polished and professional aesthetic suitable for a high-fashion editorial spread.

Negative prompt:

low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly

Resolution: 768x512

Frame rate (FPS): 25

CFG: 3

Steps: 30

Not impressive. This time the model failed to deliver, and the result is unsatisfactory.

Image to Video

Uploaded image

Positive Prompt used:

An untouched sandy beach with a small, white boat resting on the shore. The scene features footprints scattered on the sand, gentle ocean waves rolling onto the beach, and a horizon filled with sparse vegetation and a partly cloudy blue sky. Driftwood and natural debris are scattered along the coastline, capturing a peaceful, rustic, and natural atmosphere.

Negative prompt:

low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly

Resolution: 768x512

Frame rate (FPS): 25

CFG: 3

Steps: 30

Here is the output.

Second try: let's generate a haunted movie scene. For this, we used an input image generated with Flux Schnell.

Positive Prompt used:

Capturing a haunted movie scene where a women standing alone in untouched sandy beach with a small, white boat resting on the shore. The scene features footprints scattered on the sand, gentle ocean waves rolling onto the beach, and a horizon filled with sparse vegetation and a partly cloudy blue sky. Driftwood and natural debris are scattered along the coastline, capturing a peaceful, rustic, and natural atmosphere.

Negative prompt:

low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly

Resolution: 768x512

Frame rate (FPS): 24

CFG: 3

Steps: 40

Here is the output.

We ran the model on an RTX 4090, and each render took 15-23 seconds for a 4-second video. It is better than CogVideoX and Mochi 1 in terms of VRAM usage and rendering time, but it appears to need more training to produce consistently good results.
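Those timings translate into throughput as follows. This is plain arithmetic on the numbers reported above (a 4-second clip, 15 to 23 seconds of rendering), assuming the 25 FPS setting used in most of the tests; the function name is illustrative.

```python
def render_stats(clip_seconds: float, render_seconds: float, fps: int = 25) -> dict:
    """Derive frame count, generation throughput and realtime factor for one render."""
    frames = clip_seconds * fps
    return {
        "frames": frames,
        "frames_per_render_second": frames / render_seconds,
        # Below 1.0 means rendering takes longer than playback.
        "realtime_factor": clip_seconds / render_seconds,
    }


if __name__ == "__main__":
    for render_time in (15, 23):
        stats = render_stats(4, render_time)
        print(
            f"{render_time}s render: {stats['frames']} frames, "
            f"{stats['frames_per_render_second']:.2f} frames/s, "
            f"{stats['realtime_factor']:.2f}x realtime"
        )
```

With these inputs the render runs at roughly 4.3 to 6.7 generated frames per second, or about 0.17x to 0.27x of playback speed at the settings we used.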

Conclusion

As mentioned on the official page, this model needs a detailed prompt so that it can understand the scene better and give a more refined result with improved prompt adherence. You can also enhance your prompts with other tools running in the background, such as an LLM (large language model) like DanTag-Tipo or a VLM (vision language model) like Florence2.

Posted by Admin
