Multimodal Open Models

This page tracks multimodal pretraining; video, image, and voice models; OCR; vision-language models (VLMs); and world-model-style multimodal systems.

Sources in this batch

  • Beyond Language Modeling explores multimodal pretraining.
  • Molmo 2, Dolphin-v2, VibeVoice, Mistral 3, OLMo-3-32B-Think, Qwen3-Omni, DeepSeek-OCR, and ModernVBERT/ColModernVBERT represent the open and multimodal model ecosystem.
  • Hunyuan World Mirror, VideoRAG, and video-generation explainers connect long-context video comprehension to generative models.
  • Visualizing VLMs and COSMOS-Web-style scientific visualization sources broaden the evaluation and application context.

Research interest

The surprising pattern is that multimodal models are fragmenting into specialized capabilities: pointing and tracking, voice, OCR, video RAG, world simulation, and long-context visual understanding. The research question is whether these capabilities will converge into one general model interface or remain a compositional stack of specialized open models.

Open questions:

  • Which multimodal tasks require native joint pretraining versus tool composition?
  • Can video RAG handle long temporal structure without losing the causal order of events?
  • How should open multimodal models be benchmarked for agent use, not just captioning or VQA?
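
To make the video RAG question above concrete, here is a minimal sketch of one way temporal order can be lost and restored. It assumes a retriever has already scored clips for relevance; the Clip type, its fields, the scores, and both helper functions are hypothetical and not taken from VideoRAG or any source in this batch. Ranking by score alone can place an effect before its cause, so the sketch re-sorts the selected clips by start time before building the context.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Clip:
    """Hypothetical retrieved video segment (all fields are illustrative)."""
    start_s: float   # clip start time in seconds
    end_s: float     # clip end time in seconds
    caption: str     # text description of the clip
    score: float     # relevance score from any retriever (assumed given)


def select_clips_in_time_order(clips: List[Clip], k: int) -> List[Clip]:
    """Keep the k highest-scoring clips, then re-sort them chronologically."""
    top_k = sorted(clips, key=lambda c: c.score, reverse=True)[:k]
    return sorted(top_k, key=lambda c: c.start_s)


def build_context(clips: List[Clip]) -> str:
    """Render clips as timestamped lines so the reader model sees the order."""
    return "\n".join(
        f"[{c.start_s:.0f}s-{c.end_s:.0f}s] {c.caption}" for c in clips
    )


if __name__ == "__main__":
    # Made-up example data: score order differs from time order.
    clips = [
        Clip(120.0, 130.0, "glass shatters on the floor", 0.91),
        Clip(100.0, 110.0, "cat jumps onto the table", 0.78),
        Clip(300.0, 310.0, "person sweeps up the glass", 0.85),
        Clip(10.0, 20.0, "empty kitchen", 0.12),
    ]
    print(build_context(select_clips_in_time_order(clips, k=3)))
```

Re-sorting only fixes the order of what was retrieved. If a causally necessary clip scores too low to be selected, it is still missing from the context, which is part of why the question remains open.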