LTX 2.3 ID-LoRA: Consistent Lip-Sync Audio-Video

 

Most AI video personalization tools still treat video and audio as separate systems. This creates a major limitation: the visuals may look accurate, but the voice often feels disconnected from the scene. Lip-sync can drift, sound effects may not match on-screen actions, and voice-cloning models usually cannot adapt speaking style or environment through text prompts alone.

On the other hand, prompt-based audio models can change tone and style, but they lack visual awareness. The result is content that feels artificial instead of naturally unified.


[Figure: How ID-LoRA works]
 

ID-LoRA aims to solve this by generating both appearance and voice together inside a single AI model. Instead of handling visuals and audio independently, it allows a text prompt, reference image, and short audio clip to guide the entire generation process simultaneously.  

[Figure: ID-LoRA architecture]

 

This creates more realistic personalized videos in which facial movement, voice, emotion, and environment stay consistent with each other. For more detail, see the ID-LoRA research paper.

 

Installation

1. Install ComfyUI on your machine. If it is already installed, update it from the Manager by selecting the Update ComfyUI option.

2. Download the LTX 2.3 model (ltx-2.3-22b-dev.safetensors). This is the 22B-parameter core video model, and it requires high VRAM for full-generation workflows.

3. Download the LTX 2.3 distilled model (ltx-2.3-22b-distilled.safetensors). This is the fast-generation variant, optimized for 8 inference steps with CFG set to 1.

4. Download the text encoder (gemma_3_12B_it_fp4_mixed.safetensors), which is required for processing prompts. Put it in ComfyUI/models/text_encoders.


5. Download the spatial upscaler (ltx-2.3-spatial-upscaler-x2-1.1.safetensors). It upscales low-resolution input videos and is essential for two-stage and three-stage pipelines.


6. Download the distilled LoRA (ltx-2.3-22b-distilled-lora-384.safetensors). This is a LoRA version of the distilled model that you apply to refine texture fidelity during the upscale passes.

7. Download the LTX 2.3 ID-LoRA CelebV-HQ weights (id-lora-celebvhq-ltx2.3.safetensors). These ID-LoRA weights are trained on the CelebV-HQ dataset and optimized for complex motion and singing video generation.

8. Download the ID-LoRA TalkVid weights (id-lora-talkvid-ltx2.3.safetensors). These LoRA weights are ideal for static talking-head videos and digital avatars.

9. (Optional) Download the LTX-2.3-GGUF models by unsloth or the LTX 2.3 GGUF models by Quantstack (quantized variants ranging from Q2 to Q8). Use any of them with the GGUFLoaderKJ node on GPUs with less than 16GB VRAM.

10. Restart and refresh ComfyUI for the changes to take effect.
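
Before restarting, it can help to confirm that every downloaded file is where the loader nodes will look for it. The sketch below checks for the file names listed in the steps above. Only the text_encoders folder is stated in this guide; the other subfolder names (checkpoints, latent_upscale_models, loras) are assumptions based on common ComfyUI layouts, and the function name missing_models is purely illustrative. Adjust the mapping to whatever your workflow's loader nodes expect.

```python
from pathlib import Path

# File names come from the installation steps above.
# Every subfolder except "text_encoders" is an ASSUMPTION about a typical
# ComfyUI layout; change it to match the loader nodes in your workflow.
EXPECTED_MODELS = {
    "checkpoints": [
        "ltx-2.3-22b-dev.safetensors",
        "ltx-2.3-22b-distilled.safetensors",
    ],
    "text_encoders": ["gemma_3_12B_it_fp4_mixed.safetensors"],  # stated in step 4
    "latent_upscale_models": ["ltx-2.3-spatial-upscaler-x2-1.1.safetensors"],
    "loras": [
        "ltx-2.3-22b-distilled-lora-384.safetensors",
        "id-lora-celebvhq-ltx2.3.safetensors",
        "id-lora-talkvid-ltx2.3.safetensors",
    ],
}


def missing_models(comfy_root, expected=EXPECTED_MODELS):
    """Return the relative paths (under <comfy_root>/models) that are not present."""
    models_dir = Path(comfy_root) / "models"
    missing = []
    for subfolder, filenames in expected.items():
        for name in filenames:
            if not (models_dir / subfolder / name).is_file():
                missing.append(str(Path(subfolder) / name))
    return missing


if __name__ == "__main__":
    # "ComfyUI" is a placeholder for your own install directory.
    for entry in missing_models("ComfyUI"):
        print("missing:", entry)
```

If you use the optional GGUF variants from step 9 instead of the full model, remove the corresponding entries from the mapping before running the check.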



Workflow

1. Download the workflow (LTX-2.3-ID-Lora-lipsync.json) from our Hugging Face repository.
2. Drag and drop it into ComfyUI. If any nodes are reported as missing, install them from the Manager.

3. Load the models (video generation model, upscaler, LoRAs, text encoder) into their respective nodes.

4. Upload an image into the I2V image node. Load an audio clip into the Load audio node.

5. Enter your prompt in the prompt box. Set the frame count according to the video length you want:

3 seconds: 73 frames
4 seconds: 97 frames
5 seconds: 121 frames
10 seconds: 241 frames
15 seconds: 361 frames
20 seconds: 481 frames
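
Every entry in the table above fits the pattern frames = 24 x seconds + 1, so durations that are not listed can be computed the same way. The sketch below assumes that pattern holds in general; the 24 fps value is inferred from the table rather than stated by the workflow, and frames_for_duration is an illustrative helper name.

```python
def frames_for_duration(seconds, fps=24):
    """Frame count for a clip length, following the table's pattern.

    ASSUMPTION: fps=24 and the extra +1 frame are inferred from the
    table above (for example 3 s -> 73 frames), not from the workflow itself.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return int(round(fps * seconds)) + 1


if __name__ == "__main__":
    for s in (3, 4, 5, 10, 15, 20):
        print(f"{s} seconds: {frames_for_duration(s)} frames")
```

For example, a 7-second clip would use 169 frames under this assumption.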

Resolutions

Horizontal (Landscape)
320 x 224 (Lowest baseline draft)
512 x 320 (Fast testing)
768 x 512 (Standard low-res)
832 x 480 (Widescreen draft)
1024 x 576 (Medium quality)
1280 x 704 (Standard 720p equivalent)
1920 x 1088 (Standard 1080p equivalent)
3840 x 2176 (Maximum 4K equivalent)

Vertical (Portrait)
224 x 320 (Lowest baseline draft)
320 x 512 (Fast testing)
512 x 768 (Standard low-res)
480 x 832 (Tall draft)
576 x 1024 (Medium quality)
704 x 1280 (Standard 720p equivalent)
1088 x 1920 (Standard 1080p equivalent)


6. Hit run to start generation.
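
If ComfyUI reports missing nodes in step 2, it can be useful to see which node types the workflow file actually uses before installing anything. The sketch below lists them from the JSON file. It assumes the two export layouts ComfyUI workflows commonly use: a UI export with a top-level "nodes" list whose entries carry a "type" field, or an API export that maps node ids to objects with a "class_type" field. The helper names are illustrative, and the downloaded file may be structured differently.

```python
import json
from pathlib import Path


def node_types(workflow):
    """Return the sorted, de-duplicated node type names in a workflow dict.

    ASSUMPTION: the dict is either a UI export ({"nodes": [{"type": ...}]})
    or an API export ({"<id>": {"class_type": ...}}).
    """
    if isinstance(workflow.get("nodes"), list):
        return sorted({n["type"] for n in workflow["nodes"] if "type" in n})
    return sorted(
        {
            n["class_type"]
            for n in workflow.values()
            if isinstance(n, dict) and "class_type" in n
        }
    )


def load_node_types(path):
    """Read a workflow JSON file from disk and list its node types."""
    return node_types(json.loads(Path(path).read_text(encoding="utf-8")))


if __name__ == "__main__":
    for name in load_node_types("LTX-2.3-ID-Lora-lipsync.json"):
        print(name)
```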


ID-LoRA feels like an important step toward truly coherent AI-generated video. The biggest advantage is not just better voice cloning or cleaner visuals. It is that both modalities are generated together, so each stays aware of the other throughout generation. That joint understanding creates outputs that feel noticeably more believable.