Open Models and Gemma 4

This page tracks open and locally runnable model releases, especially Gemma 4, Qwen coder variants, Step models, and Bonsai, along with browser and runtime tooling such as Transformers.js.

Sources in this batch

  • A LessWrong post asks how far open models lag behind the frontier.
  • Google and Maarten Grootendorst provide Gemma 4 overviews, including multi-token prediction drafters.
  • AMD documents day-0 Gemma 4 support on AMD processors and GPUs.
  • Hugging Face hosts Step-3.7-Flash-GGUF and Qwen3-Coder-Next-GGUF variants.
  • Transformers.js 4.0.0 and Bonsai indicate continued expansion of model deployment surfaces.
  • A hardware-design video asks how hardware is reshaping LLM design.

Research interest

The surprising angle is the convergence of open weights, local runtimes, and architecture/runtime co-design. Gemma 4’s multi-token prediction drafters are notable because they target inference speed at the architecture level, rather than only through kernel optimization or quantization. The open-model-gap question should be tracked empirically, because there is probably no single gap: it may differ across reasoning, coding, and multimodal tasks, and again across operational dimensions such as latency and local deployability.
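To make the drafter point concrete, here is a rough sketch of the standard speculative-decoding estimate. It assumes the drafters are used in a draft-and-verify loop, that each drafted token is accepted independently with probability alpha, and that one drafter step costs cost_ratio of a full target-model pass. All three names and every number below are illustrative assumptions, not figures from the Gemma 4 overviews.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    alpha: assumed per-token acceptance probability (treated as i.i.d.).
    gamma: number of tokens the drafter proposes per pass.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus one token from verification.
        return float(gamma + 1)
    # Closed form of the geometric series 1 + alpha + ... + alpha**gamma.
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Rough speedup over plain one-token-per-pass decoding.

    cost_ratio: assumed cost of one drafter step relative to one target pass.
    """
    if cost_ratio < 0.0:
        raise ValueError("cost_ratio must be non-negative")
    # Each iteration pays for gamma drafter steps plus one target pass.
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


if __name__ == "__main__":
    # Hypothetical acceptance rates; real values must be measured per workload.
    for alpha in (0.6, 0.8, 0.9):
        print(alpha, round(estimated_speedup(alpha, gamma=4, cost_ratio=0.05), 2))
```

The estimate is only as good as the measured acceptance rate, which tends to vary by task, so it is best read as a way to frame measurements rather than as a prediction.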

Open questions:

  • Does multi-token prediction materially change the serving economics for small and mid-size models?
  • Do open models lag mainly in post-training and evaluation, data, architecture, or deployment polish?
  • Which model families are actually useful under local hardware constraints rather than only on leaderboards?
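
For the local-hardware question, a first-pass filter is whether the quantized weights fit in memory at all. The sketch below is a back-of-the-envelope check, not a benchmark: the function names, the parameter count, the bits-per-weight value, and the 20% headroom are all hypothetical, and the estimate ignores the KV cache, activations, and quantization metadata.

```python
def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in bytes."""
    return num_params * bits_per_weight / 8.0


def fits_in_memory(
    num_params: float,
    bits_per_weight: float,
    memory_bytes: float,
    headroom_fraction: float = 0.2,  # assumed reserve for KV cache and runtime
) -> bool:
    """Return True if the weights fit within the memory left after headroom."""
    usable = memory_bytes * (1.0 - headroom_fraction)
    return weight_bytes(num_params, bits_per_weight) <= usable


if __name__ == "__main__":
    # Hypothetical model: 7e9 parameters on a device with 8e9 bytes of memory.
    for bits in (16, 8, 4):
        size_gb = weight_bytes(7e9, bits) / 1e9
        print(bits, size_gb, fits_in_memory(7e9, bits, 8e9))
```

Passing this check says nothing about throughput or output quality at a given quantization level; those still need to be measured on the target hardware.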