Multimodal Model Tools
This page collects sources about image, video, and multimodal model tooling.
Sources in this batch
- Midjourney posted a technical deep dive on its “Midjourney Scanner”.
- Lucas Beyer’s X post discusses claims about whether video generation models have internal physical understanding, citing a paper about probing diffusion models.
- Omar Sanseviero’s X post says llama.cpp added video input support for Gemma 4 video understanding via chat completions and mtmd-cli.
Research interest
The most surprising thread is the dispute over whether video generation models encode usable physical structure. Taken together, linear probes of diffusion and video models, llama.cpp video input support, and production scanning tools suggest that multimodal systems are becoming easier to inspect and to run locally. This thread is worth tracking for evidence that generative video representations support downstream reasoning rather than only photorealistic synthesis.
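For readers unfamiliar with linear probing, the following is a minimal sketch of the general technique, not the method of the paper cited in the post. It assumes activations from a frozen model have already been extracted into an array; here they are replaced by synthetic random features, and the "physical quantity" is a made-up target that is linearly encoded by construction. The function names (fit_linear_probe, probe_r2, demo) are hypothetical. A shuffled-label control is included because a probe only counts as evidence if it clearly beats a baseline with no real signal.

```python
import numpy as np


def fit_linear_probe(features, targets, l2=1e-3):
    """Fit a ridge-regularized linear map from frozen features to targets."""
    # Append a constant column so the probe can learn an offset.
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    # Closed-form ridge solution: (X^T X + l2 * I)^-1 X^T y
    gram = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ targets)


def probe_r2(weights, features, targets):
    """Coefficient of determination of the probe on the given examples."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    pred = X @ weights
    ss_res = np.sum((targets - pred) ** 2)
    ss_tot = np.sum((targets - targets.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def demo(seed=0):
    rng = np.random.default_rng(seed)
    n, d = 400, 32
    # ASSUMPTION: synthetic stand-in for activations of a frozen video model.
    feats = rng.normal(size=(n, d))
    # ASSUMPTION: hypothetical physical quantity, linearly encoded plus noise.
    direction = rng.normal(size=d)
    target = feats @ direction + 0.1 * rng.normal(size=n)
    # Control: the same values in random order, so no real signal remains.
    shuffled = rng.permutation(target)

    train, test = slice(0, 300), slice(300, n)
    w = fit_linear_probe(feats[train], target[train])
    w_ctrl = fit_linear_probe(feats[train], shuffled[train])
    r2 = probe_r2(w, feats[test], target[test])
    r2_ctrl = probe_r2(w_ctrl, feats[test], shuffled[test])
    return r2, r2_ctrl


if __name__ == "__main__":
    r2, r2_ctrl = demo()
    print(f"held-out R^2: {r2:.3f}  shuffled-control R^2: {r2_ctrl:.3f}")
```

A high held-out score alongside a near-zero or negative control score indicates the quantity is linearly decodable from the features. It does not by itself show that the model uses that information, which is the crux of the dispute noted above.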
Related
Batch 21-100 update
New related pages: world-models-and-video-intelligence and robotics-and-embodied-ai. The second batch adds TRELLIS.2, efficient video intelligence, world model material, Vision Banana, LeWorldModel, and dexterous robotics sources.
Updated: 2026-06-27