Agent Time Horizons and Real-World Use
This page tracks agent task horizons, real-world tool use, code execution, function calling, sandboxes, and agent frameworks.
Sources in this batch
- METR measures time horizon using Claude Code and Codex.
- Google tests a Learning Hub powered by goal-based actions.
- FunctionGemma, ReasoningLayer, Mistral Vibe CLI, Anthropic advanced tool use, code execution with MCP, and sandbox-runtime all target more capable tool-using agents.
- A structured-output video covers grammars, regexes, and state machines as reliability tools.
- Anthropic’s agent-skills article and Paper2Agent/Sibyl sources focus on packaging real-world expertise into agent-usable artifacts.
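As a rough illustration of the state-machine idea in the structured-output item above, the sketch below constrains a character-level decoder with a small hand-written automaton. It is not taken from the video or from any listed source. The pattern (a signed integer), the state names, and the `constrained_decode` helper are all hypothetical, and the "model" is stood in for by pre-ranked candidate lists.

```python
from typing import Dict, List

# Hypothetical DFA for the pattern -?[0-9]+ (state names are illustrative).
DIGITS = "0123456789"
TRANSITIONS: Dict[str, Dict[str, str]] = {
    "start": {"-": "sign", **{d: "digits" for d in DIGITS}},
    "sign": {d: "digits" for d in DIGITS},
    "digits": {d: "digits" for d in DIGITS},
}
ACCEPTING = {"digits"}


def allowed_next(state: str) -> List[str]:
    """Characters the automaton permits from the given state."""
    return sorted(TRANSITIONS[state])


def accepts(text: str) -> bool:
    """True if the whole string is a valid signed integer under the DFA."""
    state = "start"
    for ch in text:
        nxt = TRANSITIONS[state].get(ch)
        if nxt is None:
            return False
        state = nxt
    return state in ACCEPTING


def constrained_decode(ranked_candidates: List[List[str]]) -> str:
    """At each step, keep the highest-ranked candidate the DFA allows.

    ranked_candidates[i] is an assumed stand-in for a model's ranked
    proposals at step i; a real decoder would mask logits instead.
    """
    state, out = "start", []
    for candidates in ranked_candidates:
        choice = next((c for c in candidates if c in TRANSITIONS[state]), None)
        if choice is None:
            break  # no permitted candidate; stop early
        out.append(choice)
        state = TRANSITIONS[state][choice]
    return "".join(out)
```

Because decoding can stop early, a caller would still check `accepts(...)` on the final string before trusting it. The automaton rules out malformed prefixes, but it does not by itself guarantee a complete output.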
Research interest
The most important research variable here is time horizon: can agents remain useful over hours or days of partially specified work? Tool APIs, sandboxes, function calling, and skills are not just product features; they are hypotheses about how to extend agent reliability beyond short chat turns.
Open questions:
- Which failures dominate as task horizon increases: planning, state tracking, tool misuse, verification, or human handoff?
- Can structured outputs and sandboxed execution be composed into strong reliability guarantees?
- What is the minimal skill/package format that turns a research paper into a reliable interactive agent?
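To make the composition question above concrete, here is a minimal sketch of one possible pipeline: validate a structured tool call first, then run its payload in a separate process with a timeout. Everything here is an assumption for illustration. The `run_python` tool name and the JSON shape are hypothetical, and a child process with a timeout is only a stand-in for a real sandbox such as the sandbox-runtime listed among the sources; it provides no filesystem or network isolation.

```python
import json
import subprocess
import sys
from typing import Any, Dict

ALLOWED_TOOLS = {"run_python"}  # hypothetical tool name


def parse_tool_call(raw: str) -> Dict[str, Any]:
    """Structured-output gate: reject anything that is not a known tool call."""
    call = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    if call.get("tool") not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {call.get('tool')!r}")
    if not isinstance(call.get("code"), str):
        raise ValueError("'code' must be a string")
    return call


def run_isolated(code: str, timeout_s: float = 5.0) -> str:
    """Execution gate: run code in a child interpreter with a time limit.

    NOTE: this is NOT a security sandbox, only a placeholder for one.
    """
    result = subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return result.stdout.strip()


def execute(raw: str) -> str:
    """Compose the two gates: validate the call, then execute it."""
    call = parse_tool_call(raw)
    return run_isolated(call["code"])
```

The two gates fail in different ways: the first rejects malformed or unexpected calls before anything runs, while the second bounds what a well-formed call can do. Neither checks that the result is correct, which is why verification remains a separate open question.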