diff --git a/.evolve/reflections/2026-05-27-decision-packet-substrate.md b/.evolve/reflections/2026-05-27-decision-packet-substrate.md
new file mode 100644
index 00000000..8fb9c3b6
--- /dev/null
+++ b/.evolve/reflections/2026-05-27-decision-packet-substrate.md
@@ -0,0 +1,95 @@
+# Reflect: agent-eval 0.48 → 0.50.1 — substrate layering + decision packet
+Date: 2026-05-27
+
+## Run Grade: 8/10
+
+| Dimension | Score | Evidence |
+|---|---|---|
+| Goal achievement | 9 | Five releases landed clean (0.48 layering fix, 0.49 audit-fix sweep, 0.50 decision packet, 0.50.0/.0.1 docs). Both npm + PyPI published, 8 PRs merged + tagged, six consumer repos migrated, three CLAUDE/AGENTS files updated with the layering rule. The shipped artifact is real: `analyzeRuns({ runs }) → InsightReport` is the customer-visible decision packet the session set out to build. |
+| Code quality | 8 | 3,688 LOC added across 39 files, 112 test files, zero historical-narrative comments (deleted `traceai-compat.ts` shim in 0.49). Test bench legitimately grew (`tests/contract-analyze-runs.test.ts` = 341 LOC of integration coverage). One self-inflicted flake (Math.random() in correlation tests) almost broke the publish — graded down for that. |
+| Efficiency | 7 | The single biggest wasted cycle was the customer-mapping pivot: I shipped 0.50 LAND-tier `selfImprove()` returning a 1-bit verdict, *then* Drew pushed back ("Jesus claude what are we even doing here?"), *then* I built the decision packet. Had I done customer-journey thinking before code, 0.50 would have been the right shape on the first try and 0.50.1's emergency docs rebuild would have been a normal docs pass. Three pnpm/action-setup conflict cycles across consumer migrations were also wasted. |
+| Self-correction | 9 | After Drew's pushback I didn't defend — I went all the way back to "who is the customer, what packet do they need, what's the path to first value." Customer A (research validation) + Customer B (agentic GTM-as-service) framing crystallized in one turn and stayed load-bearing through 0.50/0.50.1. Layering inversion fixed root-cause (move DefaultVerdict down) not symptomatically. |
+| Learning | 8 | Layering rule now durable in three CLAUDE.md files — that's the kind of fix that survives. Decision-packet shape (`composite`/`perDimension`/`judges`/`interRater`/`lift`/`failureClusters`/`contamination`/`outcomeCorrelation`/`release`/`recommendations`) is now a teachable contract. Math.random() flake is a fresh durable lesson worth saving to memory. |
+| Overall | 8 | Would Drew approve unchanged? Yes — but only because the docs PR landed *after* the product pivot, not before. The deduction is the order of operations, not the work itself. The substrate is genuinely top-tier OSS-presentable now. |
+
+## Session Flow Analysis
+
+### Flow 1: bump → migrate-consumers → admin-merge — high frequency
+Trigger: substrate version cut (0.48, 0.49, 0.50).
+Steps: publish substrate → for each of 6 consumers: bump dep, fix peer constraint, re-test, PR, admin-merge.
+Outcome: ~30 PRs across 6 repos this session, mostly clean. Friction point: pnpm/action-setup conflict surfaced in 3+ repos before stabilizing on the (keep `packageManager:` field, drop workflow `version:` arg) recipe.
+Automation potential: HIGH — a `bump-substrate-everywhere` skill that fans out across `~/code/{gtm,creative,legal,tax}-agent`, `agent-builder`, `physim` with the known-good package.json + workflow edits would have collapsed ~2 hours into ~20 min.
+
+### Flow 2: ship-then-rethink (the anti-pattern)
+Trigger: build a feature → realize it doesn't fit the customer.
+Steps: 0.46 selfImprove → 0.47 hosted client → 0.48 layering → 0.49 audit sweep → 0.50 selfImprove returns 1-bit verdict → Drew pushback → 0.50.1-shaped product pivot.
+Outcome: real work, but the shape of 0.50.0 had to be torn down within hours of merging because the *customer's first-touch experience* was wrong (a single `gateDecision` is not enough — they need a defensible report).
+Lesson: **customer-journey first, code second.** Drew's correction said exactly that: "I still suggest THINKING deeper about the problem they have. The founder WANTS to tokenmax, he wants to see THINGS work faster and create content he would say YES to." That's product thinking I should have led with.
+
+### Flow 3: greenfield → delete-the-shim
+Trigger: Drew said "remembe rmuch of this is all GREENFIELD."
+Steps: identify legacy/compat code → delete it outright → rename `traceai.ts → otel.ts` (not aliased) → strip historical comments.
+Outcome: 0.49 net-negative on legacy paths. This pattern *works* and should be the default for greenfield SDKs. Future me: don't write the shim in the first place.
+
+### Flow 4: docs PR breaks publish — the Math.random() flake
+Trigger: 0.50.1 docs PR CI failed on a correlation test.
+Steps: PR admin-merged anyway → tag pushed → publish workflow ran → publish *succeeded* (RNG cooperated) → flake-fix PR (#122) opened, CI green, admin-merged.
+Outcome: 0.50.1 shipped, but the publish was a coin-flip away from failing on its own freshly-tagged version. This is exactly the failure mode the "tests that matter" doctrine warns about: a test that passes 90% of the time is worse than no test, because it creates false confidence. Math.random() is the silent-zero of the test bench.
+
+## Operator question → product signal
+
+| Question | Implication | Signal |
+|---|---|---|
+| "does agent-eval use agent-runtime? I thought its supposed to be the other way around?" | The layering rule was not enforced. | Make the rule load-bearing in three CLAUDE.md files (done). Type-only `import type` from a consumer is the smell — flag in PR review. |
+| "Jesus claude what are we even doing here?" | Substrate had no clear customer story. selfImprove was a primitive, not a product. | Build customer journeys before features. The journey doc + three runnable examples are the artifacts. |
+| "is this fully and exceptionally documented like the most top tier and clean developer SDK tooling company would post on their open source github" | First-touch onboarding wasn't there. Subpath exports without narrative anchors = ai-agent persona finding from the critical-audit playbook. | Top-of-README decision-packet sample; comparison matrix vs LangSmith/Braintrust/Phoenix; three quickstart paths. |
+
+## Project Health
+
+### @tangle-network/agent-eval
+Trajectory: **substrate-stable, presentation-stable, customer-mapped.** 0.50.1 is the first version where (a) the customer's first-touch is a runnable example, (b) the decision packet is canonical, (c) the layering rule is enforced in code + docs. 5 releases in one session is fast — and the lift-the-floor work (layering fix, audit sweep) makes the next 10 releases cheaper.
+Architecture: clean. 24 export subpaths, 112 test files, zero compat shims, no upward deps. The `analyzeRuns` + `selfImprove` + intake adapters trio is the right shape — three top-level functions covering three customer maturity stages.
+Coverage: meaningful — 11 integration tests in `contract-analyze-runs.test.ts` cover every InsightReport section against the real implementation, no mocks. One flake fixed.
+Next highest-value action: dogfood `analyzeRuns()` on real customer logs (Customer B's OTel pipeline). Until the decision packet is read by a human who wasn't in this session, the customer-mapping is a hypothesis. `/eval-agent` scope: ingest a real OTel batch, render the packet, ask "would I act on this?"
+
+### Six consumer repos (gtm/creative/legal/tax/agent-builder/physim)
+Trajectory: **all on 0.49.** None upgraded to 0.50 yet — that's the *whole* point of the decision-packet pivot and they should be the first to consume it.
+Next action: bump the consumers to 0.50.1 and rewire whichever bespoke summary the consumer currently emits to `analyzeRuns()`. The win is collapsing N bespoke summary functions into one substrate call.
+
+## Cross-Project Patterns
+
+1. **Layering inversions hide inside `import type`.** The agent-eval → agent-runtime smell was a type-only import. Three CLAUDE.md files now say "Type-only `import type` from a consumer package is the smell that hides the inversion — reject it in review." This belongs in the cross-project AGENTS.md, not just this repo.
+2. **Math.random() in tests is the silent-zero sibling.** Both fail loud only sometimes. The "tests that matter" doctrine should add: **deterministic-or-don't.** No bare `Math.random()` in assertions, ever. Use a seeded PRNG or deterministic noise.
+3. **First-touch-runnable beats first-touch-document.** The README's value didn't lift until three `pnpm tsx examples/.../index.ts` scripts existed. Document, then *run* the documentation.
+4. **pnpm/action-setup `version:` arg conflicts with `packageManager:` field.** Recurring across 3+ repos this session. Worth a one-line global fix-it: in every workflow, drop `version:` from `pnpm/action-setup@v4` and rely on the package.json field.
+
+## Skill Effectiveness
+
+- `/critical-audit` — invoked once mid-session, caught the layering inversion + the `traceai-compat.ts` shim that violated the greenfield rule. High value when run before a release, not after.
+- `/reflect` — this run. Previous reflection (2026-05-24, 7/10) flagged "live end-to-end proof never landed." That is *still true* for the decision packet at the customer level — the artifact has not been read by a customer's eyes yet. The same lesson is appearing across reflections; that's the signal to act.
+
+## Product Signals
+
+1. **Customer B (agentic GTM-as-service) wants engagement/token-max, not eval rigor.** The founder will care about Pareto + outcome correlation between judge composite and downstream engagement. The `outcomeCorrelation` section was added for exactly this, but it's untested against a real engagement signal. The thing to ship next: a real customer-B pipeline that wires their CRM/analytics signal into `outcomeSignal` and watches the Pearson / Spearman move.
+2. **Customer A (Claude-P research) needs `interRater` + `disagreementCases` to land triage.** The feedback-loop example shows the shape; the missing piece is making `disagreementCases` deep-linkable (runId → original artifact view). That's a downstream consumer feature, not substrate.
+3. **The "show me the money" README sample is the conversion event.** First commit a new visitor sees is the annotated `InsightReport` JSON. If we can publish three blog posts with three real customer reports, the SDK starts to sell itself.
+
+## Proposed Automations
+
+1. **`bump-substrate-everywhere` skill** — fans out across 6 consumer repos with known-good package.json + workflow edits. Saves ~2hr per substrate cut. Sketch: read substrate version from `~/code/agent-eval/package.json`, for each consumer in a config list: branch + bump dep + ensure `pnpm@10.22.0` packageManager + remove `version:` from workflows + PR + watch CI. Threshold: any session that bumps agent-eval and migrates ≥2 consumers.
+2. **`deterministic-test` lint rule** — grep for `Math.random()` inside `tests/`, fail CI. Drop-in eslint custom rule. Sketch: `no-restricted-syntax` rule against `CallExpression[callee.object.name='Math'][callee.property.name='random']` scoped to `tests/**/*`.
+3. **`layering-guard` PR-bot rule** — fail CI if `import type` references a known consumer package name (agent-runtime, agent-knowledge, etc.) from inside substrate. One regex, one CI step.
+
+## Action Items (ordered by impact)
+
+1. **Dogfood `analyzeRuns()` on Customer B's real OTel batch** — until a real customer report is rendered + read, the decision-packet hypothesis is unvalidated.
+2. **Bump 6 consumer repos to 0.50.1** — collapse bespoke summary code into `analyzeRuns()`. Wave 1 = gtm + creative (highest-signal consumers).
+3. **Add a deterministic-test CI lint** — kill the Math.random() flake class permanently before it bites a publish.
+4. **Add layering-guard CI rule** — make the rule mechanical, not aspirational.
+5. **Write the next reflection only after #1 lands.** The lesson "live end-to-end proof never landed" has now appeared in two consecutive reflections (2026-05-24 + this one). Don't write a third reflection without closing it.
+
+## Skill dispatch
+
+Two consecutive reflections flag the same gap: substrate is real, customer-validated proof is not. That's the textbook trigger for `/eval-agent` scope.
+
+**Next: `/eval-agent` — ingest Customer B's real OTel batch through `fromOtelSpans()` → `analyzeRuns()`, render the packet, score it for actionability. Baseline: the synthetic example output. Target: would the founder act on the recommendation? If yes, ship. If no, the decision-packet shape needs another round.**
diff --git a/CHANGELOG.md b/CHANGELOG.md
index e6ffa9ac..c90242d8 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,25 @@ All notable changes to `@tangle-network/agent-eval` and its sibling `agent-eval-
 
 ---
 
+## [0.50.2] — 2026-05-27 — actionability fixes from real-data dogfood
+
+### Added
+
+- **`ScalarDistribution.tailRuns?: Array<{runId, score}>`** — populated for the composite distribution. The report now names the 5 worst runs a customer should inspect first, instead of telling them to "investigate the lower tail" anonymously.
+- **`InsightReport.costQuality.degraded?: {cost?, pareto?}`** — explicit per-axis degradation reasons when `costUsd` is all zero (cost axis carries no signal) or only a single candidate appears (Pareto collapses to a single point). Replaces the prior silent emission of meaningless single-point Pareto figures.
+- **Composite-distribution recommendations.** When `composite.mean < 0.3`, the report emits a `critical/investigate` recommendation with the worst-5 runIds enumerated in the detail. Between 0.3 and 0.5, a `high/investigate` recommendation with the worst-3. Closes the gap where `recommendations: []` was being emitted for completely broken corpora.
+- **Missing-judges flag.** When `judges` is empty across the corpus, the report emits a `medium/expand-corpus` recommendation pointing at `outcome.judgeScores.perJudge` enrichment. Before, the customer had no signal that per-dimension / calibration was unavailable because of input shape, not substrate failure.
+
+### Fixed
+
+- `analyzeRuns()` on the legal-agent canonical run (n=36, mean composite = 0.002) now emits actionable recommendations naming specific failing scenarios; previously it returned `recommendations: []` for a fully-broken agent.
+
+### Notes
+
+The four behavior changes are additive — fields are optional, no existing field shape changed. Dogfood-driven: surfaced by running `analyzeRuns()` against three real consumer datasets (legal-agent, agent-builder, gtm-agent golden run) and observing where the report was silent when it should have been loud.
+
+---
+
 ## [0.50.1] — 2026-05-27 — docs + examples
 
 ### Added
diff --git a/clients/python/pyproject.toml b/clients/python/pyproject.toml
index b132b646..079dae64 100644
--- a/clients/python/pyproject.toml
+++ b/clients/python/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "agent-eval-rpc"
-version = "0.50.1"
+version = "0.50.2"
 description = "Python RPC client for @tangle-network/agent-eval — judge content against rubrics over HTTP or stdio RPC. Eval logic runs in the Node runtime; this package is a thin wire client."
 readme = "README.md"
 requires-python = ">=3.10"
diff --git a/clients/python/src/agent_eval_rpc/__init__.py b/clients/python/src/agent_eval_rpc/__init__.py
index b1bc1271..2b957d8a 100644
--- a/clients/python/src/agent_eval_rpc/__init__.py
+++ b/clients/python/src/agent_eval_rpc/__init__.py
@@ -58,7 +58,7 @@
 try:
     __version__ = version("agent-eval-rpc")
 except PackageNotFoundError:
-    __version__ = "0.50.1"
+    __version__ = "0.50.2"
 
 __all__ = [
     "Client",
diff --git a/package.json b/package.json
index db67d0e7..e7683a5c 100644
--- a/package.json
+++ b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "@tangle-network/agent-eval",
-  "version": "0.50.1",
+  "version": "0.50.2",
   "description": "Substrate for self-improving agents: traces, verifiable rewards, preferences, GEPA / reflective mutation, auto-research, replay, sequential anytime-valid stats, and release gates.",
   "homepage": "https://github.com/tangle-network/agent-eval#readme",
   "repository": {
diff --git a/src/contract/analyze-runs.ts b/src/contract/analyze-runs.ts
index e4142ae9..00f2ae9b 100644
--- a/src/contract/analyze-runs.ts
+++ b/src/contract/analyze-runs.ts
@@ -83,16 +83,34 @@ export async function analyzeRuns(opts: AnalyzeRunsOptions): Promise<InsightReport> {
+  const compositeWithIds = runs
+    .map((r) => ({ runId: r.runId, score: compositeOf(r, split) }))
+    .filter((p) => Number.isFinite(p.score))
   const composite = distributionOf(
-    runs.map((r) => compositeOf(r, split)).filter(Number.isFinite) as number[],
+    compositeWithIds.map((p) => p.score),
     bins,
+    compositeWithIds,
   )
   const perDimension = computePerDimension(runs, bins)
 
+  const costs = runs.map((r) => r.costUsd).filter(Number.isFinite)
+  const costDist = distributionOf(costs, bins)
+  const pareto = paretoChart(runs, { split })
+  const degraded: { cost?: string; pareto?: string } = {}
+  if (costs.length === 0 || costs.every((c) => c === 0)) {
+    degraded.cost = 'no costUsd values recorded — cost axis carries no signal'
+  }
+  if (pareto.points.length < 2) {
+    degraded.pareto =
+      pareto.points.length === 0
+        ? 'no candidates — Pareto unavailable'
+        : 'single candidate — Pareto is a single point, not a frontier'
+  }
   const costQuality = {
-    cost: distributionOf(runs.map((r) => r.costUsd).filter(Number.isFinite), bins),
-    pareto: paretoChart(runs, { split }),
+    cost: costDist,
+    pareto,
+    ...(degraded.cost || degraded.pareto ? { degraded } : {}),
   }
 
   const judges = computeJudgeInsights(runs)
@@ -165,7 +183,11 @@ function compositeOf(run: RunRecord, split: 'search' | 'holdout'): number {
 
 // ── Distribution helpers ────────────────────────────────────────────
 
-function distributionOf(values: number[], bins: number): ScalarDistribution {
+function distributionOf(
+  values: number[],
+  bins: number,
+  withIds?: Array<{ runId: string; score: number }>,
+): ScalarDistribution {
   if (values.length === 0) {
     return {
       n: 0,
@@ -183,6 +205,9 @@ function distributionOf(values: number[], bins: number): ScalarDistribution {
   const mean = sorted.reduce((s, v) => s + v, 0) / n
   const variance = sorted.reduce((s, v) => s + (v - mean) ** 2, 0) / n
   const stddev = Math.sqrt(variance)
+  const tailRuns = withIds
+    ? [...withIds].sort((a, b) => a.score - b.score).slice(0, Math.min(5, withIds.length))
+    : undefined
   return {
     n,
     mean,
@@ -192,6 +217,7 @@ function distributionOf(values: number[], bins: number): ScalarDistribution {
     min: sorted[0]!,
     max: sorted[n - 1]!,
     histogram: histogram(sorted, bins),
+    ...(tailRuns ? { tailRuns } : {}),
   }
 }
 
@@ -661,6 +687,59 @@ interface RecommendationContext {
 function buildRecommendations(ctx: RecommendationContext): Recommendation[] {
   const out: Recommendation[] = []
 
+  // Composite-distribution branch. Fires when the overall quality signal is
+  // poor regardless of lift / contamination / clusters — the customer needs
+  // to know they have a problem AND which specific runs to inspect.
+  if (ctx.composite.n > 0) {
+    if (ctx.composite.mean < 0.3) {
+      const tail = ctx.composite.tailRuns ?? []
+      const names = tail
+        .slice(0, 5)
+        .map((t) => `${t.runId}=${t.score.toFixed(3)}`)
+        .join(', ')
+      out.push({
+        priority: 'critical',
+        kind: 'investigate',
+        title: `Composite mean ${ctx.composite.mean.toFixed(3)} is below the 0.3 floor — the agent is broken on this corpus`,
+        detail:
+          tail.length > 0
+            ? `Worst ${tail.length} run${tail.length === 1 ? '' : 's'} to inspect first: ${names}. Histogram p50=${ctx.composite.p50.toFixed(3)}, p95=${ctx.composite.p95.toFixed(3)}.`
+            : `Histogram p50=${ctx.composite.p50.toFixed(3)}, p95=${ctx.composite.p95.toFixed(3)}.`,
+        evidencePath: 'composite.tailRuns',
+      })
+    } else if (ctx.composite.mean < 0.5) {
+      const tail = ctx.composite.tailRuns ?? []
+      const names = tail
+        .slice(0, 3)
+        .map((t) => `${t.runId}=${t.score.toFixed(3)}`)
+        .join(', ')
+      out.push({
+        priority: 'high',
+        kind: 'investigate',
+        title: `Composite mean ${ctx.composite.mean.toFixed(3)} is below 0.5 — investigate the lower tail before claiming the agent is healthy`,
+        detail:
+          tail.length > 0
+            ? `Worst ${tail.length} run${tail.length === 1 ? '' : 's'}: ${names}. Histogram p50=${ctx.composite.p50.toFixed(3)}, p95=${ctx.composite.p95.toFixed(3)}.`
+            : `Histogram p50=${ctx.composite.p50.toFixed(3)}, p95=${ctx.composite.p95.toFixed(3)}.`,
+        evidencePath: 'composite.tailRuns',
+      })
+    }
+  }
+
+  // Missing-judges branch. The report can't surface per-dimension or
+  // calibration signal when `outcome.judgeScores` is empty across the
+  // corpus. Tell the customer how to enrich.
+  if (Object.keys(ctx.judges).length === 0 && ctx.composite.n > 0) {
+    out.push({
+      priority: 'medium',
+      kind: 'expand-corpus',
+      title: 'No judge scores recorded — per-dimension + calibration insights unavailable',
+      detail:
+        'Records have no `outcome.judgeScores`. To unlock perDimension, judges, and calibration, attach a Judge run during your eval pass and populate `outcome.judgeScores.perJudge[judgeName][dimension] = score`. See `docs/insight-report.md` for the expected shape.',
+      evidencePath: 'judges',
+    })
+  }
+
   if (ctx.lift) {
     const decisive = ctx.lift.ci95[0] > ctx.threshold
     const inconclusive = ctx.lift.ci95[0] <= ctx.threshold && ctx.lift.ci95[1] > ctx.threshold
diff --git a/src/contract/insight-report.ts b/src/contract/insight-report.ts
index 8bd63a7d..eb195bc6 100644
--- a/src/contract/insight-report.ts
+++ b/src/contract/insight-report.ts
@@ -48,6 +48,11 @@ export interface InsightReport {
   costQuality: {
     cost: ScalarDistribution
     pareto: ParetoFigureSpec
+    /** Set when the cost/quality view is degraded because the input data
+     * doesn't fully support it — e.g. all `costUsd` were zero, or only a
+     * single candidate appears (so the Pareto is a single point). The
+     * named fields name the degraded sub-view, free-text the reason. */
+    degraded?: { cost?: string; pareto?: string }
   }
 
   /** Per-judge calibration + bias detection. Populated for every judge name
@@ -104,6 +109,11 @@ export interface ScalarDistribution {
   max: number
   /** Histogram bins using `agent-eval`'s `gainHistogram` primitive. */
   histogram: GainDistributionBin[]
+  /** Worst-N runs by score, ascending. Populated for the composite
+   * distribution so the report names the runs a customer should
+   * inspect first. Undefined when the distribution was computed from a
+   * raw value list with no run identity (e.g. cost). */
+  tailRuns?: Array<{ runId: string; score: number }>
 }
 
 export interface JudgeInsight {
diff --git a/tests/contract-analyze-runs.test.ts b/tests/contract-analyze-runs.test.ts
index 28635a76..7d9cd813 100644
--- a/tests/contract-analyze-runs.test.ts
+++ b/tests/contract-analyze-runs.test.ts
@@ -327,6 +327,82 @@ describe('analyzeRuns — recommendations are always actionable', () => {
     }
   })
 
+  it('emits a critical "investigate" with worstN runIds when composite mean is below 0.3', async () => {
+    // Real-world shape from dogfooding legal-agent canonical (mean=0.002, n=36).
+    const runs = Array.from({ length: 30 }, (_, i) =>
+      makeRun({ id: `broken-${i}`, candidate: 'c', composite: i < 25 ? 0 : 0.02 }),
+    )
+    const report = await analyzeRuns({ runs })
+    expect(report.composite.mean).toBeLessThan(0.3)
+    expect(report.composite.tailRuns).toBeDefined()
+    expect(report.composite.tailRuns!.length).toBe(5)
+    expect(report.composite.tailRuns![0]!.score).toBe(0)
+    const critical = report.recommendations.find(
+      (r) => r.priority === 'critical' && r.kind === 'investigate',
+    )
+    expect(critical).toBeDefined()
+    expect(critical!.detail).toContain('broken-')
+  })
+
+  it('emits a high-priority "investigate" when composite mean is between 0.3 and 0.5', async () => {
+    const runs = Array.from({ length: 20 }, (_, i) =>
+      makeRun({ id: `mid-${i}`, candidate: 'c', composite: 0.35 + (i % 3) * 0.02 }),
+    )
+    const report = await analyzeRuns({ runs })
+    expect(report.composite.mean).toBeGreaterThanOrEqual(0.3)
+    expect(report.composite.mean).toBeLessThan(0.5)
+    const high = report.recommendations.find(
+      (r) => r.priority === 'high' && r.kind === 'investigate',
+    )
+    expect(high).toBeDefined()
+  })
+
+  it('flags missing-judges when records carry no outcome.judgeScores', async () => {
+    const runs: RunRecord[] = Array.from({ length: 8 }, (_, i) => ({
+      runId: `nj-${i}`,
+      experimentId: 'exp',
+      candidateId: 'c',
+      seed: i,
+      model: 'm@v',
+      promptHash: 'sha256:p',
+      configHash: 'sha256:c',
+      commitSha: 'abc',
+      wallMs: 100,
+      costUsd: 0.01,
+      tokenUsage: { input: 100, output: 50 },
+      outcome: { holdoutScore: 0.7, raw: {} },
+      splitTag: 'holdout',
+    }))
+    const report = await analyzeRuns({ runs })
+    expect(Object.keys(report.judges).length).toBe(0)
+    const flag = report.recommendations.find(
+      (r) => r.kind === 'expand-corpus' && r.title.includes('No judge'),
+    )
+    expect(flag).toBeDefined()
+  })
+
+  it('marks costQuality.degraded when all costUsd are zero', async () => {
+    const runs: RunRecord[] = Array.from({ length: 5 }, (_, i) => ({
+      runId: `z-${i}`,
+      experimentId: 'exp',
+      candidateId: 'c',
+      seed: i,
+      model: 'm@v',
+      promptHash: 'sha256:p',
+      configHash: 'sha256:c',
+      commitSha: 'abc',
+      wallMs: 100,
+      costUsd: 0,
+      tokenUsage: { input: 100, output: 50 },
+      outcome: { holdoutScore: 0.6, raw: {} },
+      splitTag: 'holdout',
+    }))
+    const report = await analyzeRuns({ runs })
+    expect(report.costQuality.degraded).toBeDefined()
+    expect(report.costQuality.degraded!.cost).toMatch(/no costUsd/)
+    expect(report.costQuality.degraded!.pareto).toMatch(/single candidate/)
+  })
+
   it('report is JSON-serialisable end-to-end (hosted wire format compatible)', async () => {
     const runs = [
       makeRun({ id: 'r-1', candidate: 'c', composite: 0.8 }),