Multimodal Model Tools

This page collects sources about image, video, and multimodal model tooling.

Sources in this batch

  • Midjourney posted a technical deep dive on its “Midjourney Scanner”.
  • Lucas Beyer’s X post discusses claims about whether video generation models have internal physical understanding, citing a paper about probing diffusion models.
  • Omar Sanseviero’s X post says llama.cpp added video input support for Gemma 4 video understanding via chat completions and mtmd-cli.

Research interest

The most surprising thread is the dispute over whether video generation models encode usable physical structure. Three developments in this batch point toward multimodal systems becoming easier to inspect and run locally: linear probes of diffusion and video models, llama.cpp video input support, and production scanning tools. The dispute is worth tracking for evidence that generative video representations support downstream reasoning rather than only photorealistic synthesis.
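
For readers unfamiliar with linear probing, the idea can be illustrated with a minimal sketch. It is not the method of the cited paper: the features here are synthetic random vectors standing in for frozen model activations, and the function names (fit_linear_probe, probe_r2, demo) and the ridge strength are assumptions chosen for illustration. A probe that predicts a property well on held-out data suggests the property is linearly decodable from the features; a probe that fails suggests it is not, at least not linearly.

```python
import numpy as np

def fit_linear_probe(features, targets, ridge=1e-3):
    """Fit a ridge-regularized linear map from frozen features to a scalar target."""
    # Append a constant column so the probe can learn an offset.
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ targets)

def probe_r2(weights, features, targets):
    """Coefficient of determination of the probe on the given data."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    pred = X @ weights
    ss_res = np.sum((targets - pred) ** 2)
    ss_tot = np.sum((targets - targets.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def demo(seed=0):
    """Compare a linearly encoded property with an unrelated one (synthetic data)."""
    rng = np.random.default_rng(seed)
    n, d = 2000, 64
    # Stand-in for activations; a real probe would use features from a frozen model.
    feats = rng.normal(size=(n, d))
    direction = rng.normal(size=d)
    encoded = feats @ direction + 0.1 * rng.normal(size=n)  # decodable by construction
    unrelated = rng.normal(size=n)                          # independent of the features
    train, test = slice(0, 1500), slice(1500, None)
    w_enc = fit_linear_probe(feats[train], encoded[train])
    w_unr = fit_linear_probe(feats[train], unrelated[train])
    return (probe_r2(w_enc, feats[test], encoded[test]),
            probe_r2(w_unr, feats[test], unrelated[test]))

if __name__ == "__main__":
    r2_encoded, r2_unrelated = demo()
    print(f"held-out R^2, encoded property:   {r2_encoded:.3f}")
    print(f"held-out R^2, unrelated property: {r2_unrelated:.3f}")
```

The probe is kept linear and evaluated on held-out data on purpose: a more expressive probe could learn the property itself rather than read it out, which would say little about what the underlying model encodes.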

Batch 21-100 update

New related pages: world-models-and-video-intelligence and robotics-and-embodied-ai. The second batch adds TRELLIS.2, efficient video intelligence, world model material, Vision Banana, LeWorldModel, and dexterous robotics sources.

Updated: 2026-06-27