Agent Time Horizons and Real-World Use

This page tracks agent task horizons, real-world tool use, code execution, function calling, sandboxes, and agent frameworks.

Sources in this batch

METR measures time horizon using Claude Code and Codex.
Google tests a Learning Hub powered by goal-based actions.
FunctionGemma, ReasoningLayer, Mistral Vibe CLI, Anthropic advanced tool use, code execution with MCP, and sandbox-runtime all target more capable tool-using agents.
A structured-output video covers grammars, regexes, and state machines as reliability tools.
Anthropic’s agent-skills article and Paper2Agent/Sibyl sources focus on packaging real-world expertise into agent-usable artifacts.
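
As a rough illustration of the structured-output idea noted above (regexes and state machines as reliability tools), the sketch below checks a model-emitted tool call before anything is executed. The call format, the tool names (`read_file`, `search`), and the helper `parse_tool_call` are all hypothetical, invented for this example; none of them come from the sources listed here or from any real framework.

```python
import re

# Hypothetical call format (an assumption for this sketch): tool_name(key="value", ...)
CALL_RE = re.compile(r'^(?P<name>[a-z_]+)\((?P<args>.*)\)$')
ARG_RE = re.compile(r'\s*(?P<key>[a-z_]+)="(?P<value>[^"]*)"\s*')

# Hypothetical allowlist: tool name -> permitted argument names.
ALLOWED_TOOLS = {"read_file": {"path"}, "search": {"query", "limit"}}


def parse_tool_call(text):
    """Return (name, args) for a well-formed, allowed call, else None."""
    match = CALL_RE.match(text.strip())
    if match is None:
        return None
    name = match.group("name")
    if name not in ALLOWED_TOOLS:
        return None
    args = {}
    raw = match.group("args")
    if raw.strip():
        # Toy format: values cannot contain commas or quotes, so a plain
        # split is enough; anything that does not fit is rejected.
        for part in raw.split(","):
            arg = ARG_RE.fullmatch(part)
            if arg is None or arg.group("key") not in ALLOWED_TOOLS[name]:
                return None
            args[arg.group("key")] = arg.group("value")
    return name, args


print(parse_tool_call('read_file(path="notes.md")'))
print(parse_tool_call('delete_all()'))
```

The design choice here is to reject anything the pattern does not explicitly accept, so a malformed or unlisted call never reaches a tool. A grammar-constrained decoder would enforce the same shape during generation rather than after it; this sketch only validates after the fact.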

Research interest

The most important research variable here is time horizon: can agents remain useful over hours or days of partially specified work? Tool APIs, sandboxes, function calling, and skills are not just product features; they are hypotheses about how to extend agent reliability beyond short chat turns.

Open questions:

Which failures dominate as task horizon increases: planning, state tracking, tool misuse, verification, or human handoff?
Can structured outputs and sandboxed execution be composed into strong reliability guarantees?
What is the minimal skill/package format that turns a research paper into a reliable interactive agent?
