The useful production agent is rarely the one with the longest prompt. It is the one that knows the objective, respects permission boundaries, gathers evidence, and leaves a recoverable trail when only half the task succeeds.
Demo agents look impressive because the environment is clean. Production workflows are not clean. Repositories are dirty, credentials expire, APIs return partial results, and files change while the agent is working.
Keep a run log that someone can resume
An agent should track its objective, working set, actions taken, assumptions, and unresolved risks. This does not require a grand memory system. A durable run log is often enough:
```json
{
  "objective": "Prepare content site for ad review",
  "workingSet": ["src/content/posts/*.md", "src/pages/*.astro"],
  "actions": [
    "scanned posts for duplicate headings",
    "rewrote six low-detail articles",
    "ran pnpm build"
  ],
  "verification": {
    "build": "passed",
    "warnings": 0
  },
  "openRisks": ["production env var configured outside repository"]
}
```
Without state, the agent repeats work. With vague state, it becomes overconfident.
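A minimal sketch of how such a log could be persisted and resumed, in Python. The `RunLog` class, its snake_case field names, and the `save`/`load` helpers are illustrative assumptions rather than part of any particular agent framework. The idea is only that a resumed run can check the log before repeating an action.

```python
import json
import os
from dataclasses import dataclass, field, asdict
from pathlib import Path


@dataclass
class RunLog:
    # Hypothetical structure mirroring the kinds of fields a run log tracks.
    objective: str
    working_set: list[str] = field(default_factory=list)
    actions: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)
    open_risks: list[str] = field(default_factory=list)

    def record(self, action: str) -> bool:
        """Record an action; return False if the log already contains it."""
        if action in self.actions:
            return False
        self.actions.append(action)
        return True


def save(log: RunLog, path: Path) -> None:
    # Write to a temporary file and rename it over the target, so an
    # interrupted write is less likely to leave a half-written log behind.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(asdict(log), indent=2), encoding="utf-8")
    os.replace(tmp, path)


def load(path: Path) -> RunLog:
    """Rebuild the log from disk so a later run can pick up where it stopped."""
    return RunLog(**json.loads(path.read_text(encoding="utf-8")))
```

A resumed run would call `load`, then use the return value of `record` to decide whether a step still needs to run. Matching actions by exact string is a simplification; a real log would likely use stable step identifiers.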
Separate read, write, and publish permissions
Reading a repository is different from editing files. Editing files is different from pushing to a remote. Drafting a message is different from sending it.
Useful products make these boundaries visible:
| Permission | Example |
|---|---|
| Read | inspect logs, repository, docs |
| Write | edit branch, draft issue, prepare migration |
| Execute | run tests, build site, generate preview |
| Publish | merge PR, deploy production, send email |
Sensitive actions should show target, reason, and expected effect. Hidden autonomy is not productivity; it is operational risk.
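One way to make these boundaries explicit in code is a small gate that every proposed action passes through. The sketch below is an assumption-laden illustration: `Permission`, `ProposedAction`, `authorize`, and `describe` are hypothetical names, and requiring human approval for every publish action is one possible policy, not the only one.

```python
from dataclasses import dataclass
from enum import Enum


class Permission(Enum):
    READ = "read"
    WRITE = "write"
    EXECUTE = "execute"
    PUBLISH = "publish"


@dataclass(frozen=True)
class ProposedAction:
    permission: Permission
    target: str
    reason: str
    expected_effect: str


def authorize(
    action: ProposedAction,
    granted: set[Permission],
    approved_by_human: bool = False,
) -> bool:
    """Allow an action only if its permission was granted.

    Assumed policy: publish actions additionally need explicit human approval.
    """
    if action.permission not in granted:
        return False
    if action.permission is Permission.PUBLISH and not approved_by_human:
        return False
    return True


def describe(action: ProposedAction) -> str:
    """Render target, reason, and expected effect for a human reviewer."""
    return (
        f"[{action.permission.name}] target={action.target}\n"
        f"  reason: {action.reason}\n"
        f"  expected effect: {action.expected_effect}"
    )
```

Each permission is granted separately here rather than as an ordered level, so holding write access does not silently imply execute or publish access.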
Evaluate on real work, not toy prompts
The best evaluation set comes from work that already happened:
- failed support triage
- confusing pull request review
- manual release note drafting
- content migration
- local setup failure
- dependency update review
- lightweight QA pass
Score completion, evidence quality, recoverability, and artifact usefulness. A correct answer with no evidence is hard to trust. A partial answer with precise state may still save time.
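The four scoring dimensions can be kept separate instead of collapsed into a single number. The following is a rough sketch under stated assumptions: the 0 to 2 scale, the `RunScore` class, and the `summarize` helper are invented for illustration.

```python
from dataclasses import dataclass

DIMENSIONS = ("completion", "evidence_quality", "recoverability", "artifact_usefulness")


@dataclass(frozen=True)
class RunScore:
    # Assumed scale: 0 = missing, 1 = partial, 2 = solid.
    completion: int
    evidence_quality: int
    recoverability: int
    artifact_usefulness: int

    def __post_init__(self) -> None:
        for dim in DIMENSIONS:
            value = getattr(self, dim)
            if not 0 <= value <= 2:
                raise ValueError(f"{dim} must be between 0 and 2, got {value}")


def summarize(scores: list[RunScore]) -> dict[str, float]:
    """Average each dimension separately so a strong score cannot mask a weak one."""
    if not scores:
        raise ValueError("no scores to summarize")
    return {
        dim: sum(getattr(score, dim) for score in scores) / len(scores)
        for dim in DIMENSIONS
    }
```

Reporting per-dimension averages keeps a run that finished correctly but left no evidence from looking the same as one that finished with a full trail.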
Design tools for reviewable evidence
The same model behaves differently depending on its tools. Broad shell access is powerful, but expensive to review. Narrow tools are safer for sensitive operations.
Tool output should support the next decision:
- Log search: query, time range, matching lines, surrounding context
- Browser action: URL, selected element, action taken, screenshot link
- Deployment: version, environment, status, rollback target
- Git action: branch, commit SHA, files changed, remote pushed
If the tool returns only “success,” the agent becomes a black box.
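As an illustration of the log search case, a tool can return a structured result that carries the query, the time range, the matching lines, and their context. This is a simplified sketch: `search_log`, `LogMatch`, and `LogSearchResult` are hypothetical names, and the in-memory list of `(timestamp, text)` pairs stands in for a real log backend.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogMatch:
    line_number: int      # 1-based position in the searched log
    line: str
    context: list[str]    # the matching line plus its neighbours


@dataclass(frozen=True)
class LogSearchResult:
    query: str
    start: datetime
    end: datetime
    matches: list[LogMatch]


def search_log(
    lines: list[tuple[datetime, str]],
    query: str,
    start: datetime,
    end: datetime,
    context: int = 1,
) -> LogSearchResult:
    """Return matches within [start, end] along with surrounding lines."""
    matches = []
    for i, (timestamp, text) in enumerate(lines):
        if start <= timestamp <= end and query in text:
            lo = max(0, i - context)
            hi = min(len(lines), i + context + 1)
            matches.append(LogMatch(i + 1, text, [t for _, t in lines[lo:hi]]))
    return LogSearchResult(query, start, end, matches)
```

Because the result echoes the query and time range, a reviewer can tell an empty match list caused by a narrow window apart from one caused by a genuinely quiet log.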
Recovery is part of the product
A production agent must handle partial failure. A migration can apply locally and fail in staging. A browser automation can click the wrong element after a redesign. A push can succeed to one remote and fail to another.
The recovery note should say:
- what succeeded
- what did not run
- what state may have changed
- what evidence was collected
- what a human should inspect next
This matters more than a polished final message. A failed task with a precise recovery note is still useful. A vague, confident failure creates more work.
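The five parts of a recovery note map naturally onto a small data structure. The sketch below is illustrative only: `RecoveryNote` and its field names are assumptions, and the plain-text rendering is one of many reasonable formats.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryNote:
    succeeded: list[str] = field(default_factory=list)
    not_run: list[str] = field(default_factory=list)
    possibly_changed_state: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)
    inspect_next: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render every section, even empty ones, so gaps stay visible."""
        sections = [
            ("What succeeded", self.succeeded),
            ("What did not run", self.not_run),
            ("What state may have changed", self.possibly_changed_state),
            ("What evidence was collected", self.evidence),
            ("What a human should inspect next", self.inspect_next),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in (items or ["(none recorded)"]))
        return "\n".join(lines)
```

Printing empty sections as "(none recorded)" is a deliberate choice in this sketch: a missing section could mean nothing happened or that nobody checked, and the note should not hide that difference.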
Start with reversible workflows
The strongest first use cases are repetitive but evidence-heavy: local setup checks, release note drafts, issue deduplication, content migration, dependency review, and QA passes. These tasks have enough structure to evaluate and enough variation to benefit from a model.
Do not start with irreversible production actions. Let the agent gather context, prepare changes, run verification, and produce artifacts. Widen permissions only after the team can measure reliable completion and recovery.