Post-Training and Efficient Learning

This page tracks prompt evolution, reinforcement learning, distillation, LoRA, quantization-aware training, and compute-as-teacher methods.

Sources in this batch

  • GEPA claims reflective prompt evolution can outperform reinforcement learning.
  • Evolution strategies at hyperscale and FP8 reinforcement learning provide context on optimization at scale: gradient-free search and low-precision RL training.
  • LoRA Without Regret and on-policy distillation target efficient adaptation.
  • Compute-as-teacher and compute-optimal quantization-aware training connect inference-time compute to training objectives.
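
As a point of reference for the LoRA item above, the adapter idea can be sketched in a few lines. This is a minimal illustration in plain Python, not code from any of the listed sources. It assumes the standard LoRA parameterization, W + (alpha / r) * B A, with the base weight W frozen and only A and B trained. The helper names (matmul, lora_effective_weight, trainable_params) are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen base weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    r = len(a)  # adapter rank
    scale = alpha / r
    delta = matmul(b, a)  # low-rank update, shape (d_out, d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, r):
    """Number of adapter parameters for one weight matrix."""
    return r * (d_in + d_out)


# Toy example (values are illustrative only).
w = [[1, 0, 0], [0, 1, 0]]   # d_out = 2, d_in = 3
a = [[1, 2, 3]]              # r = 1
b_zero = [[0], [0]]          # B starts at zero, so the adapter is a no-op
b_trained = [[1], [2]]

unchanged = lora_effective_weight(w, a, b_zero, alpha=2)
adapted = lora_effective_weight(w, a, b_trained, alpha=2)
```

For a 4096 x 4096 layer at rank 16, trainable_params gives 131,072 adapter parameters against 16,777,216 in the full matrix, a 128x reduction in trainable parameters for that layer.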

Research interest

What stands out in this cluster is that many “training” improvements now operate around the model: prompt evolution, distillation, LoRA, low-precision formats for RL, and supervision derived from inference compute. This suggests a broader post-training design space in which the boundaries between optimizer, prompt, adapter, and runtime are fluid.

Open questions:

  • When does prompt/program evolution outperform gradient-based RL, and why?
  • Can on-policy distillation preserve exploration benefits without serving-time cost?
  • How should quantization-aware training be optimized for reasoning and tool use rather than perplexity?
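
The on-policy distillation question above turns on what signal the student receives on its own samples. One common formulation, assumed here rather than taken from the listed sources, scores each token position of a student-sampled sequence by the reverse KL divergence KL(student || teacher). The sketch below computes that quantity for a single position in plain Python. The function names and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) for one token position.

    The expectation is taken under the student's distribution, which is
    what makes the signal on-policy: the student is penalized where it
    puts probability mass that the teacher does not.
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return sum(
        ps * (math.log(ps) - math.log(pt))
        for ps, pt in zip(p_s, p_t)
        if ps > 0
    )


# Toy logits over a 3-token vocabulary (illustrative values only).
matched = reverse_kl([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = reverse_kl([3.0, 0.0, 0.0], [0.0, 0.0, 3.0])
```

Here matched is zero up to floating-point error, and mismatched is strictly positive. In a training loop this per-position value would be averaged over sampled sequences and minimized with respect to the student, with the teacher queried only during training.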