How it runs

Evaluation & quality

How finished work is scored by an independent reviewer, sent back for rework below the bar, and force-accepted honestly when it still fails.

After an agent finishes a task, an independent reviewer scores it 0–100. The principle is simple: no agent grades its own work, and a confused verdict fails closed rather than rubber-stamping.

Who reviews

Review is manager-routed. The evaluator is chosen in order:

  1. A delegated task is reviewed by the manager who handed it down.
  2. Otherwise a Reviewer/QA-role agent, if one exists.
  3. Otherwise the assignee’s manager.
  4. Otherwise any other agent — never the assignee itself.

With no other agent available (for example, a solo CEO), evaluation returns null and the runtime accepts, relying on you as the external check. It never asks an agent to grade its own output.

Unparseable verdicts fail closed

A missing SCORE is retried once with a strict reprompt. If it’s still unparseable, the task is sent back for rework — not auto-passed (task.eval.unparsed is audited). This is the opposite of fail-open behavior, which would rubber-stamp exactly the confused, low-quality outputs most likely to omit the token. The score parser tolerates wrapping (SCORE: **72**) and decimals.

Echo/no-judge setups (like the fake brain) can opt back into fail-open with failOpenOnUnparsedEval: true.

Rework carries feedback

Below evalThreshold (default 60) the task returns to todo with the reviewer’s feedback. On the next attempt the judge is shown the prior feedback and asked to verify each point was addressed — so rework is a real loop, not a re-roll.

Rework holds the task’s dependents until it lands, so downstream work never builds on output that’s still being fixed.

Force-accept is honest

After maxAttempts (default 2 — one initial run plus one rework) a still failing task is not left to deadlock. It’s marked done, but:

  • flagged qualityFlag: 'below-threshold' with its score,
  • a distinct task.forceAccept audit event is emitted,
  • an operator request is raised, and
  • delegations do not fan out off rejected work.

Downstream tasks that depend on it get their upstream context prefixed with [ACCEPTED BELOW QUALITY BAR, score N], so a dependent can hedge rather than trust it blindly. Accepted-clean and accepted-below-the-bar are always distinguishable — to the runtime and to you.

Turning it off or tuning it

KnobDefaultEffect
evaluate (EVAL=off, --no-eval)onrun the quality gate at all
evalThreshold (EVAL_THRESHOLD, --eval-threshold)60minimum passing score
maxAttempts2total executions (initial + reworks)
failOpenOnUnparsedEvalfalseauto-pass an unparseable verdict (echo/no-judge only)
EVAL_THRESHOLD=70 quorum      # hold a higher bar
quorum --no-eval              # ship whatever agents produce, unreviewed

Reliability around execution

Evaluation sits on top of a side-effect-safe execution model. A task turn runs real tools, so it gets a single attempt — a blind retry would re-run Bash/writes/sends. Only genuinely pre-effect blips (spawn/ENOENT) retry once; a clean non-zero exit or unparseable output fails fast. The review-gated rework loop is the correct layer to re-execute real work, which is exactly what it does.