Multimodal Open Models
This page tracks multimodal pretraining, video/image/voice models, OCR, VLMs, and world-model-like multimodal systems.
Sources in this batch
- Beyond Language Modeling explores multimodal pretraining.
- Molmo 2, Dolphin-v2, VibeVoice, Mistral 3, OLMo-3-32B-Think, Qwen3-Omni, DeepSeek-OCR, and ModernVBERT/ColModernVBERT represent the open/multimodal model ecosystem.
- Hunyuan World Mirror, VideoRAG, and video-generation explainers connect long-context video comprehension to generative models.
- Visualizing VLMs and COSMOS-Web-style scientific visualization sources broaden the evaluation and application context.
Research interest
The surprising pattern in this batch is that multimodal models are fragmenting into specialized capabilities: pointing/tracking, voice, OCR, video RAG, world simulation, and long-context visual understanding. The research question is whether these capabilities converge behind one general model interface or remain a compositional stack of specialized open models.
Open questions:
- Which multimodal tasks require native joint pretraining versus tool composition?
- Can video RAG handle long temporal structure without losing causality, i.e. the order and dependency of events across retrieved clips?
- How should open multimodal models be benchmarked for agent use, not just captioning or VQA?
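The video RAG question can be made concrete with a toy sketch. It is not taken from VideoRAG or any other source in this batch. The `Clip` record, the three-dimensional embeddings, and the `retrieve_in_time_order` helper are all hypothetical, chosen only to illustrate one common mitigation: retrieve clips by similarity, then re-sort them chronologically before handing them to a downstream model.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Clip:
    start_s: float    # clip start time in seconds (hypothetical field)
    caption: str      # short text description of the clip
    embedding: tuple  # toy embedding; a real system would use a learned encoder


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_in_time_order(clips, query_embedding, k=2):
    """Select the k most similar clips, then return them sorted by start time."""
    by_similarity = sorted(
        clips,
        key=lambda c: cosine(c.embedding, query_embedding),
        reverse=True,
    )
    return sorted(by_similarity[:k], key=lambda c: c.start_s)


# Made-up data: three clips from one video, with hand-written embeddings.
clips = [
    Clip(0.0, "glass sits on table", (0.9, 0.1, 0.0)),
    Clip(12.0, "cat jumps on table", (0.1, 0.9, 0.2)),
    Clip(25.0, "glass falls and breaks", (0.8, 0.3, 0.1)),
]
query = (0.8, 0.35, 0.1)  # stands in for an embedded question about the glass

result = retrieve_in_time_order(clips, query, k=2)
for clip in result:
    print(f"{clip.start_s:>5.1f}s  {clip.caption}")
```

With this made-up data, the two retrieved clips come back in chronological order, but the middle clip (the cat jumping) is not retrieved at all because its embedding is dissimilar to the query. Re-sorting restores ordering, yet the causally relevant event can still be dropped, which is one way to read the "losing causality" concern.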