Wan2.2 VBVR: Consistent Controlled Motion Video Generation

 

 

If you look at how fast AI has evolved, it's honestly impressive. But here is the catch: almost all of that intelligence lives in text. Video models have made huge strides too; they can generate stunning, realistic visuals. But when it comes to actually understanding what is happening in a video, things start to fall apart. The VBVR (Very Big Video Reasoning) technique tackles this problem: it does not just process frames, it actually models how things evolve across time.

This is because reasoning about videos is not just about recognizing objects. It is about understanding time, motion, interactions, and cause and effect, and right now we simply do not have the right tools or data to train models for that. In short, AI can see and talk, but it still struggles to reason about what it sees over time.

VBVR video generation framework showcase


Unlike text, video naturally captures spatial structure, motion, and continuity, making it a perfect medium for building more intuitive, human-like reasoning systems. If done right, this could unlock a whole new level of intelligence in AI.

The VBVR model suite is trained on a massive dataset: over 1 million video clips, 2 million images, and around 200 carefully designed reasoning tasks, roughly 1000x larger than existing datasets. But it is not just about size; more detailed insights can be found in the research paper. The dataset is built on a thoughtful framework inspired by human cognition, focusing on five core reasoning pillars: Abstraction, Knowledge, Spatial understanding, Perception, and Transformation.

Researchers also introduced VBVR-Bench, an evaluation system that moves away from vague, model-based scoring and instead uses rule-based, human-aligned methods. This makes results more reliable, interpretable, and reproducible.

 

Installation

1. First, install ComfyUI if you have not already. Existing users should update ComfyUI from the Manager by selecting the Update All option.

2. The workflow is based on the basic Wan2.2 I2V model, so make sure you already have the basic Wan 2.2 Image to Video workflow set up.

3. Now, download the Wan2.2 VBVR model. There are multiple model variants to choose from; download the one that suits your system requirements:

Wan2.2 VBVR fp8 and bf16

(a) Wan 2.2 VBVR High & Low (FP8) by LiconStudio: at least 16-24GB VRAM required
(b) Wan 2.2 VBVR High & Low (BF16) by LiconStudio: at least 24GB VRAM required

Save these into the ComfyUI/models/diffusion_models folder.

4. Alternatively, you can take advantage of the model by using the VBVR LoRA together with the basic Wan2.2 I2V High+Low model.

 Wan 2.2 VBVR (FP16) by kijai

 Wan 2.2 VBVR (FP16) by LiconStudio

Download either the Wan 2.2 VBVR (FP16) LoRA by Kijai or the Wan 2.2 VBVR LoRA by LiconStudio, then save it inside the ComfyUI/models/loras folder.

5. Restart and refresh ComfyUI for the changes to take effect.
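Before restarting, it can help to sanity-check that the files landed in the right folders. Here is a minimal sketch; the filenames are assumptions for illustration, so substitute whichever variant you actually downloaded:

```python
from pathlib import Path

# Adjust to your ComfyUI install location.
COMFY = Path("ComfyUI")

# Hypothetical filenames -- match them to the files you downloaded.
expected = [
    COMFY / "models" / "diffusion_models" / "wan2.2_vbvr_high_fp8.safetensors",
    COMFY / "models" / "diffusion_models" / "wan2.2_vbvr_low_fp8.safetensors",
    COMFY / "models" / "loras" / "wan2.2_vbvr_lora_fp16.safetensors",
]

for p in expected:
    status = "found" if p.is_file() else "missing"
    print(f"{status}: {p}")
```

Anything reported as missing will show up as a load failure in the corresponding loader node, so it is worth fixing before opening the workflow.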

Workflow 

1. Download the workflow (Wan2.2_VBVR.json) from our Hugging Face repository. The workflow includes three variants:

(a) Wan2.2 + VBVR (used as an independent diffusion model)

(b) Wan2.2 + VBVR LoRA (used as a LoRA with the basic Wan2.2 High+Low I2V model)

(c) Wan 2.2 without the VBVR LoRA


2. Drag and drop it into ComfyUI. Install any missing (red) nodes from the Manager, then restart and refresh ComfyUI.

3. If using the Wan2.2 + VBVR model, load the Wan2.2 VBVR High and Low models using the Wan Video model loader node.

Alternatively, as explained above, if you load Kijai's Wan2.2 VBVR High rank-64 LoRA into the Wan Video Lora select node, you also need the basic Wan2.2 I2V High+Low model loaded through the Wan Video model loader node.

4. Load your image into the Load Image node, and load the other basic models (text encoders, VAE, etc.) into their respective nodes.

Settings:
Motion Amplitude: 1.20 to 1.50 (higher values add extra motion but can be unstable)
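If you are driving the workflow from a script rather than the UI, it is easy to pass an out-of-range value by accident. A tiny sketch of a hypothetical helper (not part of the workflow itself) that keeps the setting inside the stable 1.20 to 1.50 band:

```python
def clamp_motion_amplitude(value: float, low: float = 1.20, high: float = 1.50) -> float:
    """Clamp the motion-amplitude setting to the range that stays stable.

    Values above `high` add extra motion but can destabilize the output,
    so they are pulled back to the upper bound.
    """
    return max(low, min(high, value))

print(clamp_motion_amplitude(1.8))   # -> 1.5 (too much motion, clamped down)
print(clamp_motion_amplitude(1.35))  # -> 1.35 (already in range, unchanged)
```

The bounds are just the range quoted above; loosen them if you are deliberately experimenting with more aggressive motion.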

5. Put a detailed prompt into the prompt box and hit Run to start generation.

 

Test 1

Test 2

 

Conclusion

We have already seen how scaling data and models transformed language AI, and VBVR suggests we might be entering a similar phase for video-based intelligence. But there is also a reality check here: simply making models bigger is no longer enough.

What matters now is better data, better structure, and better evaluation. VBVR does not just throw more data at the problem; it creates a systematic way to study reasoning itself.

If this direction continues, we are not just looking at better video models; we are looking at AI that understands the physical world more like humans do. And that's where things get really interesting.