feat(0.50.2): actionability fixes from real-data dogfood by tangletools · Pull Request #123 · tangle-network/agent-eval

tangletools · 2026-05-27T20:18:41Z

Summary

Surfaced by running analyzeRuns() against three real consumer datasets (legal-agent canonical, agent-builder canonical campaigns, gtm-agent golden run). Each change below makes the report say something explicit in a case where it previously stayed silent.

Changes (all additive):

ScalarDistribution.tailRuns?: Array<{runId, score}> populated for the composite distribution — the report now names the 5 worst runs to inspect first.
InsightReport.costQuality.degraded?: {cost?, pareto?} — explicit reasons when costUsd is all zero or only a single candidate exists. Replaces the silently-misleading single-point Pareto.
buildRecommendations fires on poor composite distribution: critical/investigate below 0.3, high/investigate below 0.5, with worst-N runIds named in the detail.
buildRecommendations flags missing-judges with a medium/expand-corpus pointing at outcome.judgeScores enrichment.
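
The first two changes can be illustrated with a minimal sketch. This is not the code from the PR: `RunRecord`, `worstRuns`, `degradedReasons`, `candidateId`, and the reason strings are hypothetical names chosen for illustration, and only the field names `tailRuns`, `runId`, `score`, `costUsd`, `cost`, and `pareto` come from the description above.

```typescript
// Hypothetical input shape; the real run record in agent-eval may differ.
interface RunRecord {
  runId: string;
  composite: number;
  costUsd: number;
  candidateId: string;
}

// Matches the shape described above: Array<{runId, score}>.
interface TailRun {
  runId: string;
  score: number;
}

// Matches the shape described above: {cost?, pareto?}.
interface Degraded {
  cost?: string;
  pareto?: string;
}

// Return the n lowest-scoring runs, worst first, without mutating the input.
function worstRuns(runs: RunRecord[], n = 5): TailRun[] {
  return [...runs]
    .sort((a, b) => a.composite - b.composite)
    .slice(0, n)
    .map((r) => ({ runId: r.runId, score: r.composite }));
}

// Explain why a cost/quality analysis would be uninformative,
// or return undefined when neither condition applies.
function degradedReasons(runs: RunRecord[]): Degraded | undefined {
  const degraded: Degraded = {};
  if (runs.length > 0 && runs.every((r) => r.costUsd === 0)) {
    degraded.cost = "costUsd is zero for every run";
  }
  const candidates = new Set(runs.map((r) => r.candidateId));
  if (candidates.size <= 1) {
    degraded.pareto = "only a single candidate, so a Pareto frontier is a single point";
  }
  return degraded.cost || degraded.pareto ? degraded : undefined;
}
```

Returning `undefined` when nothing is degraded keeps the field optional, which fits the additive framing of these changes: existing consumers that never read `degraded` see no difference.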

Dogfood result. Legal-agent canonical (n=36, composite mean=0.002) previously returned recommendations: []. Now returns:

critical/investigate — Composite mean 0.002 is below the 0.3 floor — the agent is broken on this corpus
  Worst 5 runs: restaurant-formation=0, crypto-exchange-licensing=0, nuclear-startup-nrc=0,
  cannabis-dispensary=0, existing-business-audit=0
medium/expand-corpus — No judge scores recorded — per-dimension + calibration insights unavailable
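
The threshold logic behind this output might look roughly like the sketch below. It assumes the branch keys off the composite mean (as the message text suggests) and that the two composite branches are mutually exclusive. `Recommendation`, `CompositeSummary`, `recommend`, and the wording of the high-severity title are hypothetical; only the severity/action labels and the 0.3 and 0.5 cutoffs come from the description above.

```typescript
type Severity = "critical" | "high" | "medium";
type Action = "investigate" | "expand-corpus";

// Hypothetical shapes for illustration only.
interface Recommendation {
  severity: Severity;
  action: Action;
  title: string;
  detail?: string;
}

interface CompositeSummary {
  mean: number;
  tailRuns: Array<{ runId: string; score: number }>;
}

function recommend(composite: CompositeSummary, hasJudgeScores: boolean): Recommendation[] {
  const out: Recommendation[] = [];

  // Name the worst runs so the reader knows where to look first.
  const worst = composite.tailRuns.map((r) => `${r.runId}=${r.score}`).join(", ");
  const detail = `Worst ${composite.tailRuns.length} runs: ${worst}`;

  if (composite.mean < 0.3) {
    out.push({
      severity: "critical",
      action: "investigate",
      title: `Composite mean ${composite.mean} is below the 0.3 floor`,
      detail,
    });
  } else if (composite.mean < 0.5) {
    out.push({
      severity: "high",
      action: "investigate",
      title: `Composite mean ${composite.mean} is below 0.5`,
      detail,
    });
  }

  // Independent of the composite score: without judge scores,
  // per-dimension and calibration insights cannot be computed.
  if (!hasJudgeScores) {
    out.push({
      severity: "medium",
      action: "expand-corpus",
      title: "No judge scores recorded",
    });
  }
  return out;
}
```

Under these assumptions, a corpus like the legal-agent one (mean 0.002, no judge scores) yields two entries: a critical/investigate item followed by a medium/expand-corpus item.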

Test plan

pnpm test tests/contract-analyze-runs.test.ts — 15/15 pass (4 new tests for the new branches)
pnpm test — 1431/1431 overall
pnpm typecheck — clean
Re-dogfooded on real legal-agent + agent-builder data, verified the recommendations + tailRuns + degraded fields populate correctly

Surfaced by running analyzeRuns() against three real consumer datasets (legal-agent, agent-builder, gtm-agent golden run).

- composite.tailRuns: worst-5 runIds with scores
- costQuality.degraded: explicit reasons when costUsd=0 / single candidate; replaces silently-misleading single-point Pareto
- buildRecommendations: composite-distribution branch (critical below 0.3, high below 0.5) names worst-N runIds in detail
- buildRecommendations: missing-judges flag for corpora without outcome.judgeScores

Dogfooded on legal-agent canonical (n=36, mean=0.002): previously recommendations: []; now critical/investigate with 5 specific failing scenarios + medium/expand-corpus for missing judges.

Tests: 4 new, 15/15 in analyze-runs, 1431/1431 overall.

drewstone

✅ Auto-approved tangletools PR — `54a98a85`

This PR was opened by the trusted tangletools automation account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: tangletools_author · 2026-05-27T20:18:49Z

drewstone approved these changes May 27, 2026

tangletools merged commit 2309667 into main May 27, 2026
1 check passed

tangletools deleted the feat/0.50.2-actionability-fixes branch May 27, 2026 20:20
