Creating synchronized audio and video with AI has always been a headache. Most systems today, whether text-to-video, video-to-audio, or audio-to-video, focus on just one side of the story. You either generate visuals and then try to slap some sound on top, or you create sound first and hope the visuals somehow match later. OVI is a unified audio-video generation model (11B parameters) released under the Apache 2.0 license.
It's built from twin 5B-parameter branches: a video branch that replicates the WAN 2.2 5B model for visual clarity, and an audio branch that draws on MMAudio for audio generation and synchronization. The result is a unified generator that understands both modalities, audio and video, as parts of the same creative process.
*Ovi architecture (Ref: official research paper)*
With other video generation models, you get stunning visuals with awkward lip-sync, sound effects that miss their cues, and emotional dissonance where the soundtrack does not match the scene. Even though companies like Google have pushed this frontier with closed systems like Veo 3, the research community still lacks a truly open-source, one-pass model that generates both together seamlessly and naturally.
Ovi, by contrast, handles lip-sync without relying on face bounding boxes and supports realistic conversations between multiple speakers. For a detailed overview, you can read their research paper.
Installation
1. Make sure you have ComfyUI installed; if not, follow our ComfyUI installation tutorial. If you are on an older version, update it from the Manager by clicking Update All.
2. Now move inside the ComfyUI/custom_nodes folder and open a command prompt. Clone the repository using the following command:
git clone https://github.com/snicolast/ComfyUI-Ovi.git
Move inside ComfyUI-Ovi:
cd ComfyUI-Ovi
Install dependencies:
pip install -r requirements.txt
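Put together, the whole setup looks like this (run from inside your ComfyUI/custom_nodes folder, assuming pip points at the same Python environment ComfyUI uses):

```bash
# Clone the Ovi custom node into ComfyUI's custom_nodes folder
git clone https://github.com/snicolast/ComfyUI-Ovi.git

# Enter the node folder and install its Python dependencies
cd ComfyUI-Ovi
pip install -r requirements.txt
```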
3. There are two ways to set up the models:
Automatic- This is the recommended way. All the models (Ovi model, MMAudio, Wan 2.2 VAE, text encoder, etc.) are downloaded automatically from their official repositories when you run the workflow for the first time. The real-time download status can be tracked in the ComfyUI terminal.
Manual- If you prefer to download the models yourself:
Download Ovi BF16 (model.safetensors) and rename it to Ovi-11B-bf16.safetensors,
or
Ovi FP8 (model_fp8_e4m3fn.safetensors) and rename it to Ovi-11B-fp8.safetensors.
Renaming the file to the correct name is necessary, otherwise you will get an error. Save it into the ComfyUI/models/diffusion_models folder. The rest of the models listed below are the same ones used by the Wan 2.2 setup, so if you already have them, downloading again is not required.
Download the Wan 2.2 VAE (wan2.2_vae.safetensors) and save it into the ComfyUI/models/vae folder.
Download the text encoder in BF16 (umt5-xxl-enc-bf16.safetensors) or FP8 (umt5-xxl-enc-fp8_e4m3fn.safetensors) and save it into the ComfyUI/models/text_encoders folder.
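For reference, here is a minimal sketch of where the manually downloaded files end up. The ~/ComfyUI and ~/Downloads paths are assumptions for illustration only; adjust them to your own install and download locations:

```bash
# Assumption: ComfyUI lives at ~/ComfyUI and the files were downloaded to ~/Downloads
COMFY=~/ComfyUI

# Ovi diffusion model (pick ONE variant and rename it as required by the node)
mv ~/Downloads/model.safetensors             "$COMFY/models/diffusion_models/Ovi-11B-bf16.safetensors"
# mv ~/Downloads/model_fp8_e4m3fn.safetensors "$COMFY/models/diffusion_models/Ovi-11B-fp8.safetensors"

# Wan 2.2 VAE
mv ~/Downloads/wan2.2_vae.safetensors        "$COMFY/models/vae/"

# Text encoder (BF16 shown; use the FP8 file instead for low VRAM)
mv ~/Downloads/umt5-xxl-enc-bf16.safetensors "$COMFY/models/text_encoders/"
```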
4. Restart ComfyUI and refresh the browser page for the changes to take effect.
Workflow
1. After setting up the custom node, the example workflow (ComfyUI-OVI-workflow-example.json) can be found inside the ComfyUI/custom_nodes/ComfyUI-Ovi/workflow_example folder.
2. Drag and drop it into ComfyUI. If you see a bunch of red error nodes after loading, open the Manager, click the Install Missing Custom Nodes option, and install them one by one.
If you have not downloaded the models manually, the first run will take a while as the models are auto-downloaded. You can watch the real-time status in the ComfyUI terminal.
3. Next, follow the workflow steps:
(a) By default, the workflow does text-to-video. For image-to-video generation, load your image into a Load Image node and connect it to the first frame image input of the Ovi Video Generator node.
(b) Inside the Ovi Engine Loader node, select the model's base precision (BF16 or FP8 variant).
VRAM requirement:
BF16 model: more than 32 GB
FP8 model: 16-24 GB
Set the cpu_offload parameter to true if you are running with less VRAM.
(c) In the Ovi Component Loader node, select the text encoder (umt5-xxl-enc-bf16 for high VRAM, umt5-xxl-enc-fp8 for low VRAM) and the Wan 2.2 VAE (wan_2_2_vae) model.
(d) Ovi Attention Selector- Set this to FlashAttention, SDPA, Sage, etc. for faster generation. Leave it at the default (auto) if you are not sure what to pick.
(e) Ovi Video Generator- Add your positive and negative prompts into the prompt boxes. Since this is based on the Wan 2.2 video model, you can also follow our Wan 2.2 tutorial for prompting tips. Then set the video resolution.
Inside your prompt, wrap spoken lines in <S> Your-speech <E> tags and describe background audio with <AUDCAP> Your-background-audio-prompt <ENDAUDCAP> tags so the model can detect them reliably. Here, <S> marks the start of speech and <E> marks the end.
For example- A neon-lit street flickers in the rain. A lone dancer in a silver coat spins under a streetlamp, water splashing around. She raises her arm and shouts, <S>The night belongs to dreamers.<E> The crowd gathered under umbrellas cheers, phones raised high. A saxophonist steps forward from the shadows, replying, <S>Then let's make the city sing.<E> Thunder rolls in the distance as lights shimmer off wet pavement. <AUDCAP>Smooth jazz saxophone, rain pattering, distant thunder, excited murmurs.<ENDAUDCAP>
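Stripped of the scene details, a prompt therefore follows this rough skeleton (every description below is a placeholder, not a required phrase):

```
Visual description of the scene. <S>First speaker's line.<E> More visual description.
<S>Second speaker's reply.<E> <AUDCAP>Background music, ambient sounds and effects.<ENDAUDCAP>
```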
It currently supports the following video resolutions: 1504 x 608, 1344 x 704, 1344 x 672, 1280 x 704, 1280 x 640, 960 x 960, 832 x 480, and 704 x 704.
Sample Steps- Use 20-25 for faster generation at lower quality, or raise it to around 50 for better quality.
(f) Click Run to start video generation.



