Long Context and Recursive Reasoning

This page tracks recursive language models, looped reasoning, byte/continuous/diffusion language models, and very-long-context architectures.

Sources in this batch

  • Karpathy’s microgpt/nanochat material and recursive-language-model sources point toward small, comprehensible training/inference systems.
  • “Learning to Reason in 13 Parameters”, recursive reasoning with tiny networks, and Recursive Language Models make unusually small or looped reasoning systems central.
  • Sudoku variants, looped language models, continuous autoregressive LMs, text diffusion, and BERT-as-one-diffusion-step all challenge simple left-to-right transformer assumptions.
  • ATLAS claims transformer-like processing with context windows as large as ten million tokens.
  • Compute-as-teacher suggests turning inference compute into supervision.
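
As a point of reference for the looped and recursive items above, here is a minimal sketch of weight-tied looping: one small block applied repeatedly, so effective depth grows while the parameter count stays fixed. It is a toy illustration and not the method of any source listed here. The names `make_block` and `run_looped`, the tanh layer, and the 2x2 weights are all assumptions chosen for brevity.

```python
import math

def make_block(weights, bias):
    """Return one tanh layer; `weights` is a square matrix, `bias` a vector."""
    def block(state):
        return [
            math.tanh(sum(w * s for w, s in zip(row, state)) + b)
            for row, b in zip(weights, bias)
        ]
    return block

def run_looped(block, state, n_loops):
    """Apply the same block n_loops times; weights are shared across iterations."""
    state = list(state)
    for _ in range(n_loops):
        state = block(state)
    return state

# Hypothetical weights, chosen only for illustration.
WEIGHTS = [[0.5, -0.5], [0.25, 0.75]]
BIAS = [0.0, 0.1]
block = make_block(WEIGHTS, BIAS)

# Parameter count is fixed by the block and does not depend on the loop count.
n_params = sum(len(row) for row in WEIGHTS) + len(BIAS)

shallow = run_looped(block, [1.0, 2.0], n_loops=1)  # one pass through the block
deep = run_looped(block, [1.0, 2.0], n_loops=8)     # eight passes, same parameters
```

An ordinary stacked network with eight distinct layers of this size would hold eight times as many parameters. The looped version trades that capacity for reuse, which is the trade the first open question below asks about.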

Research interest

The most surprising pattern in this cluster is that several sources attack reasoning through recurrence, loops, diffusion, continuous outputs, or extreme context rather than by simply scaling dense transformers. If even a subset of these approaches works, the next wave of LLM progress may be algorithmic and stateful, not just a matter of larger pretraining runs.

Open questions:

  • Are looped/recursive models genuinely more sample- or compute-efficient, or are they harder-to-train variants of ordinary transformers?
  • Does a ten-million-token context change tasks qualitatively, or does it just stress memory and retrieval mechanisms?
  • Can inference-time computation be converted into durable training signal without reward hacking?
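
To put the memory side of the ten-million-token question in rough numbers, the sketch below estimates the key/value cache a conventional dense-attention transformer would need at that length. It does not describe ATLAS, whose memory mechanism may differ. The layer count, KV-head count, head dimension, and 16-bit precision are hypothetical values picked for illustration, and `kv_cache_bytes` is a helper invented here.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV-cache size for a standard attention stack (assumed layout)."""
    # Factor 2: one key vector and one value vector per token, layer, and KV head.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_tokens * per_token

# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128, fp16 values.
total = kv_cache_bytes(10_000_000, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"{total / 1e12:.2f} TB")  # prints "1.31 TB"
```

Under these assumptions a single ten-million-token sequence needs about 128 KiB of cache per token, or roughly 1.3 TB in total. That is why approaches at this scale tend to rely on compressed or learned memory rather than a plain cache.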