Agent Time Horizons and Real-World Use

This page tracks agent task horizons, real-world tool use, code execution, function calling, sandboxes, and agent frameworks.

Sources in this batch

  • METR measures time horizon using Claude Code and Codex.
  • Google tests a Learning Hub powered by goal-based actions.
  • FunctionGemma, ReasoningLayer, Mistral Vibe CLI, Anthropic advanced tool use, code execution with MCP, and sandbox-runtime all target more capable tool-using agents.
  • A structured-output video covers grammars, regexes, and state machines as reliability tools.
  • Anthropic’s agent-skills article and Paper2Agent/Sibyl sources focus on packaging real-world expertise into agent-usable artifacts.

Research interest

The central research variable here is time horizon: how long a partially specified task, from hours to days, an agent can work on while remaining useful. Tool APIs, sandboxes, function calling, and skills are not just product features; each is a hypothesis about how to extend agent reliability beyond short chat turns.
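One way to make time horizon measurable, offered here as an assumption rather than something the sources in this batch spell out, is to model success probability as a logistic function of log task length and report the length at which modeled success crosses a target rate. The sketch below uses made-up parameters (`intercept`, `slope`); they are illustrative and not fitted to any real data.

```python
import math


def success_probability(minutes: float, intercept: float, slope: float) -> float:
    """Logistic model of success against log2 of task length in minutes."""
    z = intercept + slope * math.log2(minutes)
    return 1.0 / (1.0 + math.exp(-z))


def horizon_minutes(intercept: float, slope: float, target: float = 0.5) -> float:
    """Task length at which the modeled success rate equals `target`."""
    if slope >= 0:
        # The model only has a meaningful horizon if longer tasks are harder.
        raise ValueError("expected success to fall as tasks get longer")
    logit = math.log(target / (1.0 - target))
    # Solve intercept + slope * log2(t) = logit for t.
    return 2.0 ** ((logit - intercept) / slope)


# Hypothetical parameters, chosen only so the arithmetic is easy to follow.
h50 = horizon_minutes(intercept=6.0, slope=-1.0)               # 64 minutes
h80 = horizon_minutes(intercept=6.0, slope=-1.0, target=0.8)   # shorter than h50
```

Raising the target success rate always shortens the horizon under this model, which is why a horizon figure is only comparable across agents when the target rate is stated alongside it.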

Open questions:

  • Which failures dominate as task horizon increases: planning, state tracking, tool misuse, verification, or human handoff?
  • Can structured outputs and sandboxed execution be composed into strong reliability guarantees?
  • What is the minimal skill/package format that turns a research paper into a reliable interactive agent?
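The second question can be made concrete with a small sketch of the composition it asks about: a format check on the model's output, followed by an allowlist gate before anything runs. Everything here is hypothetical, including the call format (`{"tool": ..., "arg": ...}`), the `run_call` helper, and the `word_count` tool; the allowlist is only a stand-in for a real sandbox and provides no process isolation.

```python
import json
import re
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical call format: {"tool": "<name>", "arg": "<string>"}
TOOL_NAME = re.compile(r"[a-z_]{1,32}")


@dataclass
class Result:
    ok: bool
    detail: str


def parse_call(raw: str) -> dict:
    """Structured-output check: reject anything that is not a well-formed call."""
    call = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(call, dict) or set(call) != {"tool", "arg"}:
        raise ValueError("expected exactly the keys 'tool' and 'arg'")
    if not isinstance(call["tool"], str) or not TOOL_NAME.fullmatch(call["tool"]):
        raise ValueError("tool name does not match the allowed pattern")
    if not isinstance(call["arg"], str):
        raise ValueError("arg must be a string")
    return call


def run_call(raw: str, tools: Dict[str, Callable[[str], str]]) -> Result:
    """Run a call only if it passes the format check and the allowlist."""
    try:
        call = parse_call(raw)
    except ValueError as exc:
        return Result(False, f"rejected by format check: {exc}")
    tool = tools.get(call["tool"])
    if tool is None:
        return Result(False, f"rejected by allowlist: {call['tool']}")
    try:
        return Result(True, tool(call["arg"]))
    except Exception as exc:  # a tool failure is reported, not propagated
        return Result(False, f"tool failed: {exc}")


# Hypothetical allowlist with a single harmless tool.
TOOLS = {"word_count": lambda text: str(len(text.split()))}

accepted = run_call('{"tool": "word_count", "arg": "a b c"}', TOOLS)
blocked = run_call('{"tool": "delete_files", "arg": "/"}', TOOLS)
malformed = run_call("not json", TOOLS)
```

Each layer rules out one class of failure: the format check stops malformed calls, and the allowlist stops well-formed calls to tools that were never granted. Neither layer says anything about whether an accepted call was the right one to make, which is the gap the question leaves open.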