We all know how awkward it feels when a video looks amazing but the sound just does not match. A fight scene without the punch sound loses its impact. A car chase without the roar of engines feels flat. This is one of the biggest gaps in today's video generation. Models that create sound effects for videos usually stumble because they lack enough diverse data, they struggle to balance text and visuals, and the audio quality often comes out underwhelming. HunyuanVideo-Foley tries to close this gap by generating high-fidelity audio that syncs with video dynamics and matches the scene's semantic meaning.
Researchers from Tencent Hunyuan, Zhejiang University and Nanjing University of Aeronautics and Astronautics developed a 100,000-hour multimodal dataset with automated annotation. The details can be found in their research paper. The model encodes text with CLAP, visual features with SigLIP2, and audio with a DAC VAE.
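To picture how these encoders feed the generator, here is a minimal conditioning sketch; the tensor shapes below are assumptions for illustration, and mmdit is a hypothetical stand-in for the model's transformer, not the official API.

```python
import torch

# Purely illustrative feature shapes (batch of 1); the real dimensions come
# from the CLAP, SigLIP2 and DAC VAE checkpoints and may differ.
text_feats  = torch.randn(1,  77, 768)   # CLAP-style text token embeddings
video_feats = torch.randn(1, 128, 768)   # SigLIP2-style per-frame embeddings
audio_lat   = torch.randn(1, 345, 128)   # noisy DAC VAE audio latents to denoise

# Conceptually, the model denoises `audio_lat` while attending to the two
# conditioning streams; a call would look roughly like:
#   pred = mmdit(audio_lat, timestep, text_feats, video_feats)
```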
It uses multimodal diffusion transformers that integrate temporal fusion and semantic injection. To enhance alignment, a representation alignment strategy is applied using self-supervised audio features with a REPA loss guided by ATST-Frame. This stabilizes training and refines audio generation.
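A rough PyTorch sketch of what such a REPA-style alignment loss looks like follows; the feature shapes and the linear projection head are assumptions, and ssl_feats stands for frame-level features from a frozen self-supervised encoder such as ATST-Frame.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

proj = nn.Linear(1024, 768)  # assumed dims: model features -> SSL feature size

def repa_loss(diffusion_feats, ssl_feats):
    """diffusion_feats: (B, T, 1024) intermediate transformer features,
    ssl_feats: (B, T, 768) frozen self-supervised audio features."""
    pred = proj(diffusion_feats)
    # Maximize per-frame cosine similarity between the two representations.
    cos = F.cosine_similarity(pred, ssl_feats, dim=-1)   # (B, T)
    return -cos.mean()

# During training, this term is added to the main diffusion objective
# with a small weight.
```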
HunyuanVideo-Foley addresses the core problems with three key innovations. It creates a scalable data pipeline that gathers diverse multimodal data at large scale. It introduces a novel multimodal diffusion transformer that balances text and visual information without one overwhelming the other. And it applies representation alignment to boost the stability and quality of the audio output. The final step uses the DAC VAE to decode audio latents into high-quality 48 kHz waveforms. The result is audio that matches visual timing, aligns with context, and reaches professional fidelity standards.
Hunyuan Video Foley architecture (Ref: Official Page)
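As a rough illustration of that final decoding step, the snippet below writes decoded latents out as a 48 kHz WAV file; dac_vae and its decode method are assumptions standing in for the bundled vae_128d_48k.pth decoder, which the custom node normally drives internally.

```python
import torch
import torchaudio

def save_generated_audio(dac_vae, audio_latents, path="foley.wav"):
    # Decode audio latents to a waveform; shape assumed to be (B, 1, samples).
    with torch.no_grad():
        waveform = dac_vae.decode(audio_latents)
    # Write the first item in the batch as a 48 kHz WAV file.
    torchaudio.save(path, waveform[0].cpu(), sample_rate=48_000)
```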
This model has the potential to redefine how we experience media. A film no longer feels disconnected between sight and sound. Games reach a new level of realism where every action carries a natural audio response. Content creators gain a tool that reduces the gap between raw visuals and polished productions. The direction of HunyuanVideo-Foley reflects a shift in AI research. It moves from generating isolated outputs to building immersive multimodal experiences. This is not just progress for AI audio generation. It is progress for the future of storytelling and entertainment.
Installation
1. Install ComfyUI if you are new to it. Update it by clicking Update All from the Manager if already installed.
2. Move into the ComfyUI/custom_nodes folder and clone the repository:
git clone https://github.com/if-ai/ComfyUI_HunyuanVideoFoley.git
3. Move into the ComfyUI_HunyuanVideoFoley folder:
cd ComfyUI_HunyuanVideoFoley
4. Install the required dependencies (for a regular ComfyUI install):
pip install -r requirements.txt
For ComfyUI portable, move back into the ComfyUI_windows_portable folder and use this command:
python_embeded\python.exe -m pip install -r ComfyUI\custom_nodes\ComfyUI_HunyuanVideoFoley\requirements.txt
5. Finally, run the install script for the ComfyUI HunyuanVideo-Foley custom node:
python install.py
6. Download the HunyuanVideo-Foley models. There are two ways to do this:
(a) Automatic - This is the recommended option. The models are auto-downloaded from the official HunyuanVideo-Foley Hugging Face repository into the ComfyUI/models/foley folder the first time you run the workflow. The real-time download status can be tracked in the ComfyUI terminal.
(b) Manual - Use this if you know what you are doing. Download the models (hunyuanvideo_foley.pth, synchformer_state_dict.pth, vae_128d_48k.pth) and save them into the ComfyUI/models/foley folder. Then make sure the configuration file is at configs/hunyuanvideo-foley-xxl.yaml. If you prefer to script the manual download, see the sketch after these steps.
7. Restart ComfyUI and refresh it.
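For the manual path, a minimal download script is sketched below. It assumes the checkpoints are hosted in the tencent/HunyuanVideo-Foley Hugging Face repository and that the filenames match the ones listed in step 6; check the custom node's README if the repository or names differ.

```python
from huggingface_hub import hf_hub_download

REPO_ID = "tencent/HunyuanVideo-Foley"   # assumed repository id
TARGET_DIR = "ComfyUI/models/foley"      # run this from your ComfyUI root

# Fetch the three checkpoints used by the custom node into the foley folder.
for filename in ("hunyuanvideo_foley.pth",
                 "synchformer_state_dict.pth",
                 "vae_128d_48k.pth"):
    hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir=TARGET_DIR)
```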
Workflow Explanation
1. After installing the custom node you will get the workflow (hunyuan_foley.json) inside ComfyUI/custom_nodes/ComfyUI_HunyuanVideoFoley/example_workflows folder.
2. Drag and drop it into ComfyUI.
3. The workflow loads a video, extracts its frames and audio, runs the HunyuanVideoFoley model to generate Foley audio from the frames, a text prompt and the generation parameters, previews the results, then recombines audio with the frames and saves a finished video.
Hunyuan Video Foley Workflow
(a) Load Video (Upload node) - you pick the source file. This node outputs raw frames, audio and basic video data.
(b) HunyuanFoley Model Loader - loads the HunyuanFoley model and sets the quantization option. It supports several quantization types (fp8_e5m2 / fp8_e4m3fn / none, which keeps the original BF16 weights). Choose the one that suits your VRAM.
Video Info / Get Video Components - reads fps, width/height, duration and splits the video into separate streams (images, audio, metadata) for other nodes to use.
(c) Video Helper Suite - utility node(s) that pass video frames, audio and metadata cleanly into the generator and into the final combiner.
(d) HunyuanVideoFoley Dependencies- loads required model files (encoders, feature extractors, VAE, etc.).
(e) HunyuanVideoFoley Torch Compile - compiles/initializes the model for the runtime (optimizes it for CPU/GPU).
(f) HunyuanVideoFoley Generator (Advanced) - the core node that takes the input frames (and sometimes audio/metadata) plus the text prompt and generation settings.
Important parameters shown there:
guidance_scale (CFG) - how strongly the model follows the prompt; range: 1.0-10.0, default: 4.5.
text_prompt - positive/negative text prompts describing the audio you want; be as detailed as possible.
num_inference_steps - more steps mean better quality but slower generation; range: 10-100, default: 50.
sample_nums - how many samples/variants to produce; range: 1-6, default: 1.
seed - controls the randomness of generation; reuse a value to reproduce a result.
silent_audio - enable or disable silent audio output with true/false.
(g) Preview nodes (Preview Video / Preview Audio / Preview Any) - quickly play the generated video/audio so you can check results before finalizing.
(h) Video Combine - takes the generated frames + audio (and any original pieces you keep) and encodes the final file. Settings here: frame_rate, format (e.g. h264 mp4), pix_fmt, crf (quality), and save_output.
(i) Final output / save - the finished video file is written to disk and shown in the UI for download or further checks.
(j) Click run to initiate the generation.
Tweak num_inference_steps, guidance_scale, and crf to balance speed versus quality. Use seed to reproduce exact results; change it to randomize. The memory_efficient / cpu_offload flags help when GPU memory is limited. If you want to set these parameters programmatically, see the sketch below.
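The following is a hedged sketch using ComfyUI's HTTP API. It assumes you exported the workflow in API format (Save (API Format)) and that the generator node exposes inputs with the names listed above; the node id "12" is purely illustrative, so look it up in your own export.

```python
import json
import urllib.request

# Load an API-format export of the workflow (assumed filename).
with open("hunyuan_foley_api.json") as f:
    workflow = json.load(f)

# Hypothetical node id for the HunyuanVideoFoley Generator node.
gen_inputs = workflow["12"]["inputs"]
gen_inputs["guidance_scale"] = 4.5        # CFG strength
gen_inputs["num_inference_steps"] = 50    # quality vs speed
gen_inputs["seed"] = 12345                # fix to reproduce a run

# Queue the modified workflow on a locally running ComfyUI instance.
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```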
After using the model, we can say that the results are not 100% accurate yet, but they should improve over time.