feat(0.50.2): actionability fixes from real-data dogfood #123
Merged
Surfaced by running analyzeRuns() against three real consumer datasets (legal-agent, agent-builder, gtm-agent golden run).

- composite.tailRuns: worst-5 runIds with scores
- costQuality.degraded: explicit reasons when costUsd=0 / single candidate; replaces silently-misleading single-point Pareto
- buildRecommendations: composite-distribution branch (critical below 0.3, high below 0.5) names worst-N runIds in detail
- buildRecommendations: missing-judges flag for corpora without outcome.judgeScores

Dogfooded on legal-agent canonical (n=36, mean=0.002): previously recommendations: []; now critical/investigate with 5 specific failing scenarios + medium/expand-corpus for missing judges.

Tests: 4 new, 15/15 in analyze-runs, 1431/1431 overall.
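To make the tailRuns and composite-distribution changes concrete, here is a minimal sketch of how the worst-N runs and the severity thresholds could fit together. It is not the PR's implementation: `worstRuns`, `compositeRecommendation`, and the `Recommendation` shape are hypothetical names, and applying the 0.3 / 0.5 thresholds to the composite mean is an assumption (the description only says "critical below 0.3, high below 0.5").

```typescript
// Hypothetical shapes: only `tailRuns`, `runId`, and `score` are named in the PR.
interface TailRun {
  runId: string;
  score: number;
}

interface Recommendation {
  severity: "critical" | "high";
  action: "investigate";
  detail: string;
}

// Return the n lowest-scoring runs, worst first, without mutating the input.
function worstRuns(runs: TailRun[], n = 5): TailRun[] {
  return [...runs].sort((a, b) => a.score - b.score).slice(0, n);
}

// Assumption: the thresholds are compared against the composite mean.
function compositeRecommendation(runs: TailRun[]): Recommendation | null {
  if (runs.length === 0) return null;
  const mean = runs.reduce((sum, r) => sum + r.score, 0) / runs.length;
  if (mean >= 0.5) return null; // distribution is acceptable, nothing to flag
  const severity = mean < 0.3 ? "critical" : "high";
  const ids = worstRuns(runs).map((r) => r.runId).join(", ");
  return {
    severity,
    action: "investigate",
    detail: `composite mean ${mean.toFixed(3)}; inspect first: ${ids}`,
  };
}
```

Naming the run IDs in the detail string is what makes the recommendation actionable: a reader can go straight to the failing runs instead of re-deriving them from the distribution.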
drewstone
approved these changes
May 27, 2026
Contributor
drewstone
left a comment
✅ Auto-approved tangletools PR — 54a98a85
This PR was opened by the trusted tangletools automation account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: tangletools_author · 2026-05-27T20:18:49Z
Summary
Surfaced by running `analyzeRuns()` against three real consumer datasets (legal-agent canonical, agent-builder canonical campaigns, gtm-agent golden run). Where the report was silent when it should have been loud, fix the silence.

Changes (all additive):

- `ScalarDistribution.tailRuns?: Array<{runId, score}>` populated for the composite distribution: the report now names the 5 worst runs to inspect first.
- `InsightReport.costQuality.degraded?: {cost?, pareto?}`: explicit reasons when `costUsd` is all zero or only a single candidate exists. Replaces the silently-misleading single-point Pareto.
- `buildRecommendations` fires on poor composite distribution: `critical/investigate` below 0.3, `high/investigate` below 0.5, with worst-N runIds named in the detail.
- `buildRecommendations` flags missing-judges with a `medium/expand-corpus` pointing at `outcome.judgeScores` enrichment.

Dogfood result. Legal-agent canonical (n=36, composite mean=0.002) previously returned `recommendations: []`. Now returns `critical/investigate` with 5 specific failing scenarios, plus `medium/expand-corpus` for missing judges.

Test plan

- `pnpm test tests/contract-analyze-runs.test.ts`: 15/15 pass (4 new tests for the new branches)
- `pnpm test`: 1431/1431 overall
- `pnpm typecheck`: clean
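
For reviewers who want a concrete picture of the `costQuality.degraded` and missing-judges checks described in the summary, the sketch below shows one way such checks could be written. Everything beyond the field names `costUsd`, `outcome.judgeScores`, `cost`, and `pareto` is an assumption: the `RunRecord` shape, the `candidate` field, the reason strings, and the helper names `degradedReasons` and `missingJudges` are hypothetical and not taken from the PR.

```typescript
// Hypothetical input shape; `candidate` and the judgeScores value type are assumptions.
interface RunRecord {
  runId: string;
  candidate: string;
  costUsd: number;
  outcome: { judgeScores?: Record<string, number> };
}

interface Degraded {
  cost?: string;
  pareto?: string;
}

// Return explicit reasons when cost/quality analysis cannot be trusted,
// or undefined when neither degraded condition applies.
function degradedReasons(runs: RunRecord[]): Degraded | undefined {
  const degraded: Degraded = {};
  if (runs.every((r) => r.costUsd === 0)) {
    degraded.cost = "costUsd is 0 for every run; cost comparisons are not meaningful";
  }
  const candidates = new Set(runs.map((r) => r.candidate));
  if (candidates.size <= 1) {
    degraded.pareto = "only one candidate; a Pareto frontier needs at least two";
  }
  return degraded.cost || degraded.pareto ? degraded : undefined;
}

// True when no run carries any judge scores, which is the condition
// the PR pairs with a medium/expand-corpus recommendation.
function missingJudges(runs: RunRecord[]): boolean {
  return runs.every((r) => {
    const scores = r.outcome.judgeScores;
    return scores === undefined || Object.keys(scores).length === 0;
  });
}
```

Returning a reason string rather than a single-point Pareto keeps the report honest: the consumer sees why the analysis is unavailable instead of a frontier that looks valid but carries no information.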