feat(0.50.2): actionability fixes from real-data dogfood #123

Merged: tangletools merged 1 commit into main from feat/0.50.2-actionability-fixes on May 27, 2026
@tangletools
Contributor

Summary

Surfaced by running analyzeRuns() against three real consumer datasets (legal-agent canonical, agent-builder canonical campaigns, gtm-agent golden run). This PR fixes the cases where the report was silent when it should have been loud.

Changes (all additive):

  • ScalarDistribution.tailRuns?: Array<{runId, score}> populated for the composite distribution — the report now names the 5 worst runs to inspect first.
  • InsightReport.costQuality.degraded?: {cost?, pareto?} — explicit reasons when costUsd is all zero or only a single candidate exists. Replaces the silently-misleading single-point Pareto.
  • buildRecommendations fires on poor composite distribution: critical/investigate below 0.3, high/investigate below 0.5, with worst-N runIds named in the detail.
  • buildRecommendations flags missing-judges with a medium/expand-corpus pointing at outcome.judgeScores enrichment.
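
To make the first two bullets concrete, here is a minimal sketch of how a worst-N tail and the degraded reasons could be derived. It is not the code from this PR: only `tailRuns`, `runId`, `score`, `degraded`, `cost`, `pareto`, and `costUsd` appear in the description above, while `RunCost`, `candidateId`, `worstRuns`, `degradedReasons`, and the reason strings are hypothetical names chosen for illustration.

```typescript
// Hypothetical input shapes; the real types in the package may differ.
interface RunScore { runId: string; score: number }
interface RunCost { candidateId: string; costUsd: number }

// Mirrors the optional {cost?, pareto?} shape named in the PR description.
interface Degraded { cost?: string; pareto?: string }

// Return the n lowest-scoring runs, worst first.
// Array.prototype.sort is stable, so tied scores keep their input order.
function worstRuns(runs: RunScore[], n = 5): RunScore[] {
  return [...runs].sort((a, b) => a.score - b.score).slice(0, n);
}

// Report why a cost/quality analysis would be misleading, or undefined if it is usable.
function degradedReasons(runs: RunCost[]): Degraded | undefined {
  const degraded: Degraded = {};
  // Assumption: "all zero" means every run reports costUsd === 0.
  if (runs.length > 0 && runs.every((r) => r.costUsd === 0)) {
    degraded.cost = "costUsd is zero for every run";
  }
  // Assumption: a Pareto frontier is only meaningful with two or more candidates.
  const candidates = new Set(runs.map((r) => r.candidateId));
  if (candidates.size < 2) {
    degraded.pareto = "fewer than two candidates";
  }
  return degraded.cost || degraded.pareto ? degraded : undefined;
}
```

Returning `undefined` when nothing is degraded keeps the field optional, consistent with the changes being described as additive.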

Dogfood result. Legal-agent canonical (n=36, composite mean=0.002) previously returned recommendations: []. Now returns:

critical/investigate — Composite mean 0.002 is below the 0.3 floor — the agent is broken on this corpus
  Worst 5 runs: restaurant-formation=0, crypto-exchange-licensing=0, nuclear-startup-nrc=0,
  cannabis-dispensary=0, existing-business-audit=0
medium/expand-corpus — No judge scores recorded — per-dimension + calibration insights unavailable
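
The threshold logic that produces output like the above can be sketched as follows. This is an illustration under stated assumptions, not the actual `buildRecommendations`: the 0.3 and 0.5 cutoffs and the priority/action labels come from this PR, while `Recommendation`, `CompositeSummary`, `recommend`, `hasJudgeScores`, and the exact message wording are hypothetical.

```typescript
type Priority = "critical" | "high" | "medium";
type Action = "investigate" | "expand-corpus";

// Hypothetical recommendation shape; the real one may carry more fields.
interface Recommendation { priority: Priority; action: Action; title: string; detail: string }

interface CompositeSummary {
  mean: number;
  tailRuns: Array<{ runId: string; score: number }>;
}

function recommend(composite: CompositeSummary, hasJudgeScores: boolean): Recommendation[] {
  const out: Recommendation[] = [];
  // Name the worst runs so the reader knows what to inspect first.
  const worst = composite.tailRuns.map((r) => `${r.runId}=${r.score}`).join(", ");
  const detail = worst ? `Worst ${composite.tailRuns.length} runs: ${worst}` : "";

  if (composite.mean < 0.3) {
    out.push({ priority: "critical", action: "investigate",
      title: `Composite mean ${composite.mean} is below the 0.3 floor`, detail });
  } else if (composite.mean < 0.5) {
    out.push({ priority: "high", action: "investigate",
      title: `Composite mean ${composite.mean} is below 0.5`, detail });
  }

  // Assumption: the missing-judges check is independent of the composite check.
  if (!hasJudgeScores) {
    out.push({ priority: "medium", action: "expand-corpus",
      title: "No judge scores recorded", detail: "Enrich outcome.judgeScores" });
  }
  return out;
}
```

Under these assumptions, a corpus with mean 0.002 and no judge scores yields two entries: one critical/investigate and one medium/expand-corpus, matching the shape of the dogfood result.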

Test plan

  • pnpm test tests/contract-analyze-runs.test.ts — 15/15 pass (4 new tests for the new branches)
  • pnpm test — 1431/1431 overall
  • pnpm typecheck — clean
  • Re-dogfooded on real legal-agent + agent-builder data; verified that the recommendations, tailRuns, and degraded fields populate correctly

Surfaced by running analyzeRuns() against three real consumer
datasets (legal-agent, agent-builder, gtm-agent golden run).

- composite.tailRuns: worst-5 runIds with scores
- costQuality.degraded: explicit reasons when costUsd=0 / single
  candidate; replaces silently-misleading single-point Pareto
- buildRecommendations: composite-distribution branch (critical
  below 0.3, high below 0.5) names worst-N runIds in detail
- buildRecommendations: missing-judges flag for corpora without
  outcome.judgeScores

Dogfooded on legal-agent canonical (n=36, mean=0.002): previously
recommendations: []; now critical/investigate with 5 specific
failing scenarios + medium/expand-corpus for missing judges.

Tests: 4 new, 15/15 in analyze-runs, 1431/1431 overall.
Contributor

@drewstone drewstone left a comment

✅ Auto-approved tangletools PR — 54a98a85

This PR was opened by the trusted tangletools automation account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: tangletools_author · 2026-05-27T20:18:49Z

@tangletools tangletools merged commit 2309667 into main May 27, 2026
1 check passed
@tangletools tangletools deleted the feat/0.50.2-actionability-fixes branch May 27, 2026 20:20
2 participants