Wan 2.2 Bernini: Ref-Video Editing and Style Transfer

 

 

Video generation has evolved quickly, but most existing systems still share one major limitation: they are usually built for a single task. One model might generate videos from text. Another might edit existing videos. A different system might work with reference images. The result is a fragmented workflow where each task requires a separate architecture, separate training, and separate optimization.

Bernini (released by ByteDance & Bernini Team) supports multiple video workflows inside a unified framework, including text-to-video (T2V), subject-to-video (R2V), video editing (V2V), and reference-guided video editing (RV2V). The model has been merged into Wan2.2 by the community, so you can run whichever of these tasks you need from a single setup.

Bernini features showcase

Rather than forcing a single model to do everything, Bernini divides responsibilities: one component thinks and another creates. The framework assigns semantic planning to an MLLM-based planner. Instead of generating pixels directly, this planner predicts the target semantic representation inside the Vision Transformer (ViT) embedding space.

Bernini working architecture


Once the semantic blueprint is prepared, a Diffusion Transformer (DiT)-based renderer takes over and converts those instructions into realistic video outputs. For editing tasks, Bernini introduces additional source Variational Autoencoder (VAE) features to preserve important visual details while making modifications. More details can be found in their research paper.

Bernini working on image and video editing

The framework also introduces two notable improvements:
(a) Segment-Aware 3D Rotary Positional Embedding (SA 3D RoPE), to better process multiple visual inputs and maintain spatio-temporal understanding.
(b) Chain-of-thought reasoning inside the planner, which helps the model carry a deeper understanding into the generation process.
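
The planner/renderer split described above can be pictured as a two-stage data flow. The sketch below is only a toy illustration, not Bernini's code: the names SemanticPlan, plan, and render are hypothetical, the "embedding" is a placeholder list rather than a real ViT embedding, and the mapping from inputs to task labels is inferred from the task names alone.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SemanticPlan:
    """Hypothetical stand-in for the planner's output in ViT embedding space."""
    task: str
    embedding: List[float]


def plan(prompt: str, num_references: int = 0,
         has_source_video: bool = False) -> SemanticPlan:
    """Toy planner: labels the task and returns a placeholder embedding."""
    if has_source_video:
        task = "RV2V" if num_references > 0 else "V2V"
    else:
        task = "R2V" if num_references > 0 else "T2V"
    # Placeholder values only; a real planner would predict a semantic embedding.
    embedding = [float(len(prompt)), float(num_references)]
    return SemanticPlan(task=task, embedding=embedding)


def render(semantic_plan: SemanticPlan,
           source_vae_features: Optional[List[float]] = None) -> dict:
    """Toy renderer: reports which conditioning signals it would consume."""
    is_edit = semantic_plan.task in ("V2V", "RV2V")
    if is_edit and source_vae_features is None:
        raise ValueError("editing tasks need source VAE features")
    conditioning = ["semantic_plan"]
    if is_edit:
        # Editing tasks also receive source VAE features to preserve detail.
        conditioning.append("source_vae_features")
    return {"task": semantic_plan.task, "conditioning": conditioning}
```

The point of the sketch is the separation of concerns: the planner never produces pixels, and the renderer only needs the source VAE features when an existing video is being edited.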


Installation

1. First, install and set up ComfyUI to run this model. Existing users need to update ComfyUI to the latest version from the Manager tab.

2. You need the basic Wan 2.2 I2V installation already set up. The Wan 2.2 Bernini workflow depends on it, so the text encoders and VAE models must be ready to use.

3. Download the Wan 2.2 Bernini pair of models (High and Low). There are several variants (FP16, FP8 scaled, FP8 mixed, GGUF). Choose one according to your system resources:

(a) Wan 2.2 Bernini (High & Low) FP16, repacked by Kijai - for high-VRAM users, minimum 24GB

Wan 2.2 Bernini (High & Low) FP16


(b) Wan 2.2 Bernini (High & Low) FP8 scaled / FP8 mixed, optimized by Kijai - for low-VRAM users, minimum 16GB

Wan 2.2 Bernini (High & Low) FP8 scaled / FP8 mixed


(c) Wan 2.2 Bernini (High & Low) FP16-FP8-mixed, repacked by the Comfy team - for users with at least 16-24 GB of VRAM. These are the same models by Kijai listed above, collected in one place. This repack also has Wan 2.1 support; to use it, you need the basic Wan 2.1 workflow already set up.

Wan 2.2 Bernini (High & Low) FP16-FP8-mixed


Save these models (High and Low) into the ComfyUI/models/diffusion_models folder.

(d) Wan 2.2 Bernini GGUF (High & Low) - for users with at least 12GB of VRAM. Choose any one High-Low pair of models (Q4, Q5, or Q8).



Wan 2.2 Bernini GGUF (High & Low)


If you use this variant, save the models into the ComfyUI/models/unet folder.

4. Restart and refresh ComfyUI.
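
As a quick reference, the VRAM guidance and target folders from step 3 can be written as a small lookup. This is a minimal sketch that only restates the numbers given above; the helper names pick_variant and target_dir are hypothetical, and the assumption that ComfyUI lives in a folder called "ComfyUI" should be adjusted to your installation.

```python
from pathlib import Path

# (minimum VRAM in GB, variant label, subfolder under ComfyUI/models),
# taken from the variant list in step 3, ordered from largest to smallest.
VARIANTS = [
    (24, "FP16", "diffusion_models"),
    (16, "FP8 scaled / FP8 mixed", "diffusion_models"),
    (12, "GGUF (Q4/Q5/Q8)", "unet"),
]


def pick_variant(vram_gb: float) -> tuple:
    """Return (label, subfolder) for the largest variant that fits vram_gb."""
    for min_vram, label, folder in VARIANTS:
        if vram_gb >= min_vram:
            return label, folder
    raise ValueError(f"{vram_gb} GB is below the 12 GB minimum listed above")


def target_dir(comfy_root: str, vram_gb: float) -> Path:
    """Folder where the High and Low model files for that variant should go."""
    _, folder = pick_variant(vram_gb)
    return Path(comfy_root) / "models" / folder


if __name__ == "__main__":
    label, _ = pick_variant(16)
    print(label, "->", target_dir("ComfyUI", 16))
```

The thresholds are the minimums quoted in this guide, not measured requirements, so treat the result as a starting point rather than a guarantee that a given variant will fit.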


Workflow

1. Download the workflow (Wan2.2_Bernini.json) from our Hugging Face repository.
If you use the GGUF variant, replace the diffusion model loader node with the Unet loader node.


2. Drag and drop the workflow onto the ComfyUI canvas.

3. Load all the models (Wan 2.2 Bernini High and Low, text encoders, VAE, etc.) into their respective nodes.

4. Upload your reference images or videos (supports 2-5 at most) for style transfer, video editing, subject removal, and similar tasks.

5. Enter your prompts in the prompt box.

6. Set the KSampler configuration:
sampler - res_multistep
resolution - 720p (for high VRAM), 480p (for low VRAM)

7. Hit Run to start the generation.
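
The settings from steps 4 and 6 can be collected into a small pre-flight check before you queue a run. This is a sketch under stated assumptions: build_run_config and the dictionary keys are hypothetical and are not part of ComfyUI, the 24GB cut-off for "high VRAM" is borrowed from the FP16 guidance in the installation section, and the reference limit reads "2-5 max" as an upper bound of 5.

```python
HIGH_VRAM_GB = 24     # assumption: "high VRAM" as used for the FP16 variant
MAX_REFERENCES = 5    # assumption: upper bound taken from "2-5 max" in step 4


def build_run_config(vram_gb: float, num_references: int) -> dict:
    """Collect the sampler and resolution choices from step 6 in one place."""
    if num_references < 0:
        raise ValueError("num_references cannot be negative")
    if num_references > MAX_REFERENCES:
        raise ValueError(
            f"{num_references} references given, at most {MAX_REFERENCES} supported"
        )
    # 720p for high-VRAM systems, 480p otherwise, as listed in step 6.
    resolution = "720p" if vram_gb >= HIGH_VRAM_GB else "480p"
    return {
        "sampler_name": "res_multistep",
        "resolution": resolution,
        "num_references": num_references,
    }


if __name__ == "__main__":
    print(build_run_config(vram_gb=16, num_references=3))
```

The returned dictionary is only a checklist of values to enter by hand in the workflow nodes; it does not modify the workflow file.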