LTX-2 VideoGen Locally (GGUF/FP8/BF16)

LTX-2 video generation in ComfyUI

For a while now, video generation models have been doing an impressive job visually. Motion looks good, scenes feel cinematic, and prompts are followed pretty well. But there has always been one awkward gap: silence. A visually rich video without sound feels unfinished. No atmosphere. No emotion. No realism.

Most existing workflows try to fix this by stitching things together later: first generate the video, then generate audio separately, or the other way around. It kind of works, but it often feels disconnected. The audio does not quite match the scene, lip sync feels off, and environmental sounds do not react naturally to what's happening visually. LTX-2, released by Lightricks, comes in with a simple but bold promise: generate video and audio together, inside a single model, so they actually understand each other.

Instead of bolting sound on afterward, LTX-2 treats audiovisual generation as one unified problem. The goal is not just speech, but everything: background ambience, foley sounds, emotional tone, and timing that feels natural.

LTX-2 training-inference pipeline (ref: research paper)

Under the hood, LTX-2 is built very deliberately. The team did not mash audio and video into one messy representation. Instead, they respected how different these modalities actually are. The model uses an asymmetric dual-stream transformer with:

(a) a large video stream (14B parameters) that handles complex spatial and temporal visuals, and

(b) a smaller audio stream (5B parameters) tuned for 1D temporal sound data.

These streams talk to each other through bidirectional cross-attention, meaning video influences audio and audio influences video, frame by frame. The details are laid out in their research paper. This is what enables things like lip sync, impact sounds, and environment-aware acoustics to line up properly.

Another important design choice is separate latent spaces for audio and video. This allows better compression for each modality, modality-specific positional embeddings (3D for video, 1D for audio), and cleaner editing workflows like video-to-audio (V2A) or audio-to-video (A2V) later on.
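To make the dual-stream idea concrete, here is a minimal sketch of bidirectional cross-attention between a video token stream and an audio token stream. This is not the actual LTX-2 implementation; the dimensions, module names, and residual wiring are assumptions chosen only to show how each stream can query the other.

```python
# Illustrative sketch only: two token streams, each attending over the other.
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    def __init__(self, video_dim=2048, audio_dim=1024, heads=8):
        super().__init__()
        # Project each stream into the other's width so queries and keys match.
        self.audio_to_video = nn.Linear(audio_dim, video_dim)
        self.video_to_audio = nn.Linear(video_dim, audio_dim)
        self.video_attends_audio = nn.MultiheadAttention(video_dim, heads, batch_first=True)
        self.audio_attends_video = nn.MultiheadAttention(audio_dim, heads, batch_first=True)

    def forward(self, video_tokens, audio_tokens):
        # Video queries attend over audio keys/values ...
        a_as_v = self.audio_to_video(audio_tokens)
        video_update, _ = self.video_attends_audio(video_tokens, a_as_v, a_as_v)
        # ... and audio queries attend over video keys/values.
        v_as_a = self.video_to_audio(video_tokens)
        audio_update, _ = self.audio_attends_video(audio_tokens, v_as_a, v_as_a)
        # Residual updates keep each stream's own representation intact.
        return video_tokens + video_update, audio_tokens + audio_update

# Toy shapes: 1 clip, 300 video tokens, 120 audio tokens.
video = torch.randn(1, 300, 2048)
audio = torch.randn(1, 120, 1024)
v_out, a_out = BidirectionalCrossAttention()(video, audio)
```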

 

Installation

1. Install ComfyUI if you are a new user. Existing users should update it from the Manager by selecting Update All.

 

Install and update from the ComfyUI Manager


2. From the Manager, select the Install Custom Nodes option. Search for the "LTXVideo" custom node and install it. If it is already installed, just update it from the Manager by clicking the Custom Nodes Manager option.

3. Now, there are different model variants (text-to-video, image-to-video, and control-to-video) in the LTX-2 Hugging Face repository. Download any of them as described below:

1. LTX-2 19B Dev FP8 (ltx-2-19b-dev-fp8.safetensors): Development version of the LTX-2 19B model optimized with FP8 precision. It reduces VRAM usage and improves inference speed while maintaining near-original output quality. Best suited for experimentation and faster local inference.

2. LTX-2 19B Dev BF16 (ltx-2-19b-dev.safetensors): Full development model using BF16 precision. Requires 16GB VRAM with 32GB system RAM. Offers higher numerical stability and better output consistency than FP8, making it ideal for high-quality generation, testing, and fine-tuning workflows.

3. LTX-2 19B Distilled FP8 (ltx-2-19b-distilled-fp8.safetensors): A distilled and compressed version of the LTX-2 19B model using FP8 precision. Designed for faster inference and lower hardware requirements while preserving most of the core model capabilities. Suitable for resource-constrained environments.

4. LTX-2 19B Distilled BF16 (ltx-2-19b-distilled.safetensors): Distilled variant of the LTX-2 19B model in BF16 format. Balances reduced model size with better output stability and quality than the FP8 distilled version. Ideal for production-oriented inference where quality still matters.

Save it into ComfyUI/models/checkpoints folder, not the diffusion_models folder.
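If you prefer scripting the download instead of grabbing the file from the browser, a small huggingface_hub sketch like the one below works. The repo id used here is an assumption; check the actual LTX-2 Hugging Face page for the exact repo and filename before running.

```python
# Hypothetical download helper using huggingface_hub.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="Lightricks/LTX-2",              # assumed repo id; verify on Hugging Face
    filename="ltx-2-19b-distilled-fp8.safetensors",
    local_dir="ComfyUI/models/checkpoints",  # run from the directory that contains ComfyUI/
)
```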

GGUF- 

If you want the LTX-2 GGUF models for low-VRAM setups, download them and set them up accordingly. Make sure you have already installed the ComfyUI-GGUF custom node by city96 from the Manager. If it is already installed, update this custom node from the Manager.

(a) LTXV-2 GGUF by Kijai

(b) LTXV-2 GGUF by Quantstack

(c) LTXV-2 GGUF by Unsloth

Save them into the ComfyUI/models/unet folder.


4. Download the upscaler models provided below.
(a) Spatial upscaler (ltx-2-spatial-upscaler-x2-1.0.safetensors)

(b) Temporal Upscaler (ltx-2-temporal-upscaler-x2-1.0.safetensors)

Save them into the ComfyUI/models/latent_upscale_models folder.

5. Download the ltx-2-19b-distilled-lora model (ltx-2-19b-distilled-lora-384.safetensors). Save it inside the ComfyUI/models/loras folder.

6. Download the quantized Gemma text encoder (gemma-3-12b-it-qat-q4_0-unquantized). Save it inside the ComfyUI/models/text_encoders/gemma-3-12b-it-qat-q4_0-unquantized folder.
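Before moving on, it can help to confirm that everything from steps 3-6 landed in the expected folders. The quick sanity script below only checks the paths named in this guide; adjust COMFY_ROOT to your install location, and swap the checkpoint name if you downloaded a different variant.

```python
# Verify the expected folder layout from steps 3-6 (paths assume the distilled FP8 checkpoint).
import os

COMFY_ROOT = "ComfyUI"  # adjust to your ComfyUI install path
expected = [
    "models/checkpoints/ltx-2-19b-distilled-fp8.safetensors",
    "models/latent_upscale_models/ltx-2-spatial-upscaler-x2-1.0.safetensors",
    "models/latent_upscale_models/ltx-2-temporal-upscaler-x2-1.0.safetensors",
    "models/loras/ltx-2-19b-distilled-lora-384.safetensors",
    "models/text_encoders/gemma-3-12b-it-qat-q4_0-unquantized",
]

for rel in expected:
    path = os.path.join(COMFY_ROOT, rel)
    status = "OK     " if os.path.exists(path) else "MISSING"
    print(f"{status} {path}")
```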

7. Now, if you want the LTX control LoRA models (Canny, Pose, Depth, Dolly, Static, etc.), download them as provided below:

1. ltx-2-19b-ic-lora-detailer.safetensors
2. ltx-2-19b-ic-lora-pose-control.safetensors
3. ltx-2-19b-ic-lora-canny-control.safetensors
4. ltx-2-19b-ic-lora-depth-control.safetensors
5. ltx-2-19b-lora-camera-control-dolly-in.safetensors
6. ltx-2-19b-lora-camera-control-dolly-left.safetensors
7. ltx-2-19b-lora-camera-control-dolly-out.safetensors
8. ltx-2-19b-lora-camera-control-dolly-right.safetensors
9. ltx-2-19b-lora-camera-control-jib-down.safetensors
10. ltx-2-19b-lora-camera-control-jib-up.safetensors
11. ltx-2-19b-lora-camera-control-static.safetensors

Save them inside ComfyUI/models/loras folder.

8. Restart and refresh ComfyUI.


Workflows

1. After installing the LTX custom node, the workflows can be found inside the ComfyUI/custom_nodes/ComfyUI-LTXVideo/example_workflows folder. You can also get them by navigating to the ComfyUI templates section.

LTX-2_I2V_Distilled_wLora.json
LTX-2_I2V_Full_wLora.json
LTX-2_ICLoRA_All_Distilled.json
LTX-2_T2V_Distilled_wLora.json
LTX-2_T2V_Full_wLora.json
LTX-2_V2V_Detailer.json

We used an RTX 4080 Super with 16GB VRAM to generate a 5.4-second 480p text-to-video clip, which took around 2 minutes.

2. Drag and drop the workflow into ComfyUI.

Settings-

BF16 full variant: CFG 3, Steps 25

Distilled variant: CFG 1, Steps 8

Example Prompt- Cinematic handheld medium shot. A foggy city street at night, lit by warm street lamps and soft neon reflections on wet asphalt. A man in his late 30s wearing a worn leather jacket stands under the light, shoulders tense, eyes darting. He exhales slowly and steps forward as rain begins to fall. The camera tracks him from the side, slightly shaky, staying close to his face. He suddenly stops, turns toward the camera, and whispers, "We are not alone." Distant sirens echo as the camera slowly pushes in to an intense close-up on his eyes, rain dripping down his face.

Prompt techniques for better results using LTX-2

(a) Think in shots, not keywords- Write your prompt like a short film scene, not a list of tags. Describe what happens from start to finish in a natural flow.

(b) Establish the shot early- Start by defining the shot type and style (wide, close-up, cinematic, handheld, animated, noir, Pixar-style, etc.) so the model anchors the visual language immediately.

(c) Set the scene clearly- Describe lighting, color palette, textures, atmosphere, and time of day to lock in mood and realism.

(d) Describe action sequentially- Explain actions in the order they happen, using present tense. LTX-2 performs best when motion feels continuous and intentional.

(e) Limit characters and simultaneous actions- Fewer subjects and focused actions lead to cleaner, more accurate results.

(f) Show emotions visually, not abstractly- Avoid words like "sad" or "angry" alone. Instead, show emotion through facial expressions, posture, gestures, and pauses.

(g) Define characters with just enough detail- Include age range, clothing, hairstyle, and standout features. Avoid overloading with unnecessary traits.

(h) Be explicit about camera movement- Clearly state how the camera moves in relation to the subject (pan, dolly, push in, handheld tracking, over-the-shoulder).

(i) Keep it one cohesive paragraph- A single flowing paragraph helps the model maintain continuity and scene logic.

(j) Aim for 4–8 strong sentences- This is the sweet spot for clarity without overwhelming the model.

(k) Name styles early if using them- If the look is stylized (film noir, pixel art, surreal, documentary), mention it near the beginning of the prompt.

(l) Avoid readable text and logos- Do not rely on signs, labels, or written text to carry meaning.

(m) Iterate instead of overloading- Start simple, generate, then refine. LTX-2 is designed for fast iteration, so layer complexity gradually.

(n) Match detail to shot scale- Close-ups need fine detail (skin, breath, eye movement). Wide shots should stay broader and simpler.

(o) Include audio and dialogue intentionally- Describe ambient sound, music, and tone. Put dialogue in quotation marks and specify delivery style or accent if needed.

(p) Prefer clean, readable motion- Simple, cinematic movement works better than chaotic or complex physics-heavy actions.

LTX-2 feels like one of those releases that quietly fixes a fundamental flaw everyone had just accepted. Video without sound was never enough; we just tolerated it because that's all we had. By treating audio and video as equal partners rather than separate steps, LTX-2 moves generative media closer to how humans actually experience content.

You do not watch life in silence, and you do not hear it without context either. The fact that this level of audiovisual modeling is being released openly is a big deal. It lowers the barrier for experimentation, pushes the ecosystem forward, and gives developers something solid to build on rather than another black-box demo.