Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
95 changes: 95 additions & 0 deletions .evolve/reflections/2026-05-27-decision-packet-substrate.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,95 @@
# Reflect: agent-eval 0.48 → 0.50.1 — substrate layering + decision packet
Date: 2026-05-27

## Run Grade: 8/10

| Dimension | Score | Evidence |
|---|---|---|
| Goal achievement | 9 | Five releases landed clean (0.48 layering fix, 0.49 audit-fix sweep, 0.50 decision packet, 0.50.0/.0.1 docs). Both npm + PyPI published, 8 PRs merged + tagged, six consumer repos migrated, three CLAUDE/AGENTS files updated with the layering rule. The shipped artifact is real: `analyzeRuns({ runs }) → InsightReport` is the customer-visible decision packet the session set out to build. |
| Code quality | 8 | 3,688 LOC added across 39 files, 112 test files, zero historical-narrative comments (deleted `traceai-compat.ts` shim in 0.49). Test bench legitimately grew (`tests/contract-analyze-runs.test.ts` = 341 LOC of integration coverage). One self-inflicted flake (Math.random() in correlation tests) almost broke the publish — graded down for that. |
| Efficiency | 7 | The single biggest wasted cycle was the customer-mapping pivot: I shipped 0.50 LAND-tier `selfImprove()` returning a 1-bit verdict, *then* Drew pushed back ("Jesus claude what are we even doing here?"), *then* I built the decision packet. Had I done customer-journey thinking before code, 0.50 would have been the right shape on the first try and 0.50.1's emergency docs rebuild would have been a normal docs pass. Three pnpm/action-setup conflict cycles across consumer migrations were also wasted. |
| Self-correction | 9 | After Drew's pushback I didn't defend — I went all the way back to "who is the customer, what packet do they need, what's the path to first value." Customer A (research validation) + Customer B (agentic GTM-as-service) framing crystallized in one turn and stayed load-bearing through 0.50/0.50.1. Layering inversion fixed root-cause (move DefaultVerdict down) not symptomatically. |
| Learning | 8 | Layering rule now durable in three CLAUDE.md files — that's the kind of fix that survives. Decision-packet shape (`composite`/`perDimension`/`judges`/`interRater`/`lift`/`failureClusters`/`contamination`/`outcomeCorrelation`/`release`/`recommendations`) is now a teachable contract. Math.random() flake is a fresh durable lesson worth saving to memory. |
| Overall | 8 | Would Drew approve unchanged? Yes — but only because the docs PR landed *after* the product pivot, not before. The deduction is the order of operations, not the work itself. The substrate is genuinely top-tier OSS-presentable now. |

## Session Flow Analysis

### Flow 1: bump → migrate-consumers → admin-merge — high frequency
Trigger: substrate version cut (0.48, 0.49, 0.50).
Steps: publish substrate → for each of 6 consumers: bump dep, fix peer constraint, re-test, PR, admin-merge.
Outcome: ~30 PRs across 6 repos this session, mostly clean. Friction point: pnpm/action-setup conflict surfaced in 3+ repos before stabilizing on the (keep `packageManager:` field, drop workflow `version:` arg) recipe.
Automation potential: HIGH — a `bump-substrate-everywhere` skill that fans out across `~/code/{gtm,creative,legal,tax}-agent`, `agent-builder`, `physim` with the known-good package.json + workflow edits would have collapsed ~2 hours into ~20 min.

### Flow 2: ship-then-rethink (the anti-pattern)
Trigger: build a feature → realize it doesn't fit the customer.
Steps: 0.46 selfImprove → 0.47 hosted client → 0.48 layering → 0.49 audit sweep → 0.50 selfImprove returns 1-bit verdict → Drew pushback → 0.50.1-shaped product pivot.
Outcome: real work, but the shape of 0.50.0 had to be torn down within hours of merging because the *customer's first-touch experience* was wrong (a single `gateDecision` is not enough — they need a defensible report).
Lesson: **customer-journey first, code second.** Drew's correction said exactly that: "I still suggest THINKING deeper about the problem they have. The founder WANTS to tokenmax, he wants to see THINGS work faster and create content he would say YES to." That's product thinking I should have led with.

### Flow 3: greenfield → delete-the-shim
Trigger: Drew said "remembe rmuch of this is all GREENFIELD."
Steps: identify legacy/compat code → delete it outright → rename `traceai.ts → otel.ts` (not aliased) → strip historical comments.
Outcome: 0.49 net-negative on legacy paths. This pattern *works* and should be the default for greenfield SDKs. Future me: don't write the shim in the first place.

### Flow 4: docs PR breaks publish — the Math.random() flake
Trigger: 0.50.1 docs PR CI failed on a correlation test.
Steps: PR admin-merged anyway → tag pushed → publish workflow ran → publish *succeeded* (RNG cooperated) → flake-fix PR (#122) opened, CI green, admin-merged.
Outcome: 0.50.1 shipped, but the publish was a coin-flip away from failing on its own freshly-tagged version. This is exactly the failure mode the "tests that matter" doctrine warns about: a test that passes 90% of the time is worse than no test, because it creates false confidence. Math.random() is the silent-zero of the test bench.

## Operator question → product signal

| Question | Implication | Signal |
|---|---|---|
| "does agent-eval use agent-runtime? I thought its supposed to be the other way around?" | The layering rule was not enforced. | Make the rule load-bearing in three CLAUDE.md files (done). Type-only `import type` from a consumer is the smell — flag in PR review. |
| "Jesus claude what are we even doing here?" | Substrate had no clear customer story. selfImprove was a primitive, not a product. | Build customer journeys before features. The journey doc + three runnable examples are the artifacts. |
| "is this fully and exceptionally documented like the most top tier and clean developer SDK tooling company would post on their open source github" | First-touch onboarding wasn't there. Subpath exports without narrative anchors = ai-agent persona finding from the critical-audit playbook. | Top-of-README decision-packet sample; comparison matrix vs LangSmith/Braintrust/Phoenix; three quickstart paths. |

## Project Health

### @tangle-network/agent-eval
Trajectory: **substrate-stable, presentation-stable, customer-mapped.** 0.50.1 is the first version where (a) the customer's first-touch is a runnable example, (b) the decision packet is canonical, (c) the layering rule is enforced in code + docs. 5 releases in one session is fast — and the lift-the-floor work (layering fix, audit sweep) makes the next 10 releases cheaper.
Architecture: clean. 24 export subpaths, 112 test files, zero compat shims, no upward deps. The `analyzeRuns` + `selfImprove` + intake adapters trio is the right shape — three top-level functions covering three customer maturity stages.
Coverage: meaningful — 11 integration tests in `contract-analyze-runs.test.ts` cover every InsightReport section against the real implementation, no mocks. One flake fixed.
Next highest-value action: dogfood `analyzeRuns()` on real customer logs (Customer B's OTel pipeline). Until the decision packet is read by a human who wasn't in this session, the customer-mapping is a hypothesis. `/eval-agent` scope: ingest a real OTel batch, render the packet, ask "would I act on this?"

### Six consumer repos (gtm/creative/legal/tax/agent-builder/physim)
Trajectory: **all on 0.49.** None upgraded to 0.50 yet — that's the *whole* point of the decision-packet pivot and they should be the first to consume it.
Next action: bump the consumers to 0.50.1 and rewire whichever bespoke summary the consumer currently emits to `analyzeRuns()`. The win is collapsing N bespoke summary functions into one substrate call.

## Cross-Project Patterns

1. **Layering inversions hide inside `import type`.** The agent-eval → agent-runtime smell was a type-only import. Three CLAUDE.md files now say "Type-only `import type` from a consumer package is the smell that hides the inversion — reject it in review." This belongs in the cross-project AGENTS.md, not just this repo.
2. **Math.random() in tests is the silent-zero sibling.** Both fail loud only sometimes. The "tests that matter" doctrine should add: **deterministic-or-don't.** No bare `Math.random()` in assertions, ever. Use a seeded PRNG or deterministic noise.
3. **First-touch-runnable beats first-touch-document.** The README's value didn't lift until three `pnpm tsx examples/.../index.ts` scripts existed. Document, then *run* the documentation.
4. **pnpm/action-setup `version:` arg conflicts with `packageManager:` field.** Recurring across 3+ repos this session. Worth a one-line global fix-it: in every workflow, drop `version:` from `pnpm/action-setup@v4` and rely on the package.json field.

## Skill Effectiveness

- `/critical-audit` — invoked once mid-session, caught the layering inversion + the `traceai-compat.ts` shim that violated the greenfield rule. High value when run before a release, not after.
- `/reflect` — this run. Previous reflection (2026-05-24, 7/10) flagged "live end-to-end proof never landed." That is *still true* for the decision packet at the customer level — the artifact has not been read by a customer's eyes yet. The same lesson is appearing across reflections; that's the signal to act.

## Product Signals

1. **Customer B (agentic GTM-as-service) wants engagement/token-max, not eval rigor.** The founder will care about Pareto + outcome correlation between judge composite and downstream engagement. The `outcomeCorrelation` section was added for exactly this, but it's untested against a real engagement signal. The thing to ship next: a real customer-B pipeline that wires their CRM/analytics signal into `outcomeSignal` and watches the Pearson / Spearman move.
2. **Customer A (Claude-P research) needs `interRater` + `disagreementCases` to land triage.** The feedback-loop example shows the shape; the missing piece is making `disagreementCases` deep-linkable (runId → original artifact view). That's a downstream consumer feature, not substrate.
3. **The "show me the money" README sample is the conversion event.** First commit a new visitor sees is the annotated `InsightReport` JSON. If we can publish three blog posts with three real customer reports, the SDK starts to sell itself.

## Proposed Automations

1. **`bump-substrate-everywhere` skill** — fans out across 6 consumer repos with known-good package.json + workflow edits. Saves ~2hr per substrate cut. Sketch: read substrate version from `~/code/agent-eval/package.json`, for each consumer in a config list: branch + bump dep + ensure `pnpm@10.22.0` packageManager + remove `version:` from workflows + PR + watch CI. Threshold: any session that bumps agent-eval and migrates ≥2 consumers.
2. **`deterministic-test` lint rule** — grep for `Math.random()` inside `tests/`, fail CI. Drop-in eslint custom rule. Sketch: `no-restricted-syntax` rule against `CallExpression[callee.object.name='Math'][callee.property.name='random']` scoped to `tests/**/*`.
3. **`layering-guard` PR-bot rule** — fail CI if `import type` references a known consumer package name (agent-runtime, agent-knowledge, etc.) from inside substrate. One regex, one CI step.

## Action Items (ordered by impact)

1. **Dogfood `analyzeRuns()` on Customer B's real OTel batch** — until a real customer report is rendered + read, the decision-packet hypothesis is unvalidated.
2. **Bump 6 consumer repos to 0.50.1** — collapse bespoke summary code into `analyzeRuns()`. Wave 1 = gtm + creative (highest-signal consumers).
3. **Add a deterministic-test CI lint** — kill the Math.random() flake class permanently before it bites a publish.
4. **Add layering-guard CI rule** — make the rule mechanical, not aspirational.
5. **Write the next reflection only after #1 lands.** The lesson "live end-to-end proof never landed" has now appeared in two consecutive reflections (2026-05-24 + this one). Don't write a third reflection without closing it.

## Skill dispatch

Two consecutive reflections flag the same gap: substrate is real, customer-validated proof is not. That's the textbook trigger for `/eval-agent` scope.

**Next: `/eval-agent` — ingest Customer B's real OTel batch through `fromOtelSpans()` → `analyzeRuns()`, render the packet, score it for actionability. Baseline: the synthetic example output. Target: would the founder act on the recommendation? If yes, ship. If no, the decision-packet shape needs another round.**
19 changes: 19 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,25 @@ All notable changes to `@tangle-network/agent-eval` and its sibling `agent-eval-

---

## [0.50.2] — 2026-05-27 — actionability fixes from real-data dogfood

### Added

- **`ScalarDistribution.tailRuns?: Array<{runId, score}>`** — populated for the composite distribution. The report now names the 5 worst runs a customer should inspect first, instead of telling them to "investigate the lower tail" anonymously.
- **`InsightReport.costQuality.degraded?: {cost?, pareto?}`** — explicit per-axis degradation reasons when `costUsd` is all zero (cost axis carries no signal) or only a single candidate appears (Pareto collapses to a single point). Replaces the prior silent emission of meaningless single-point Pareto figures.
- **Composite-distribution recommendations.** When `composite.mean < 0.3`, the report emits a `critical/investigate` recommendation with the worst-5 runIds enumerated in the detail. Between 0.3 and 0.5, a `high/investigate` recommendation with the worst-3. Closes the gap where `recommendations: []` was being emitted for completely broken corpora.
- **Missing-judges flag.** When `judges` is empty across the corpus, the report emits a `medium/expand-corpus` recommendation pointing at `outcome.judgeScores.perJudge` enrichment. Before, the customer had no signal that per-dimension / calibration was unavailable because of input shape, not substrate failure.

### Fixed

- `analyzeRuns()` on the legal-agent canonical run (n=36, mean composite = 0.002) now emits actionable recommendations naming specific failing scenarios; previously it returned `recommendations: []` for a fully-broken agent.

### Notes

The four behavior changes are additive — fields are optional, no existing field shape changed. Dogfood-driven: surfaced by running `analyzeRuns()` against three real consumer datasets (legal-agent, agent-builder, gtm-agent golden run) and observing where the report was silent when it should have been loud.

---

## [0.50.1] — 2026-05-27 — docs + examples

### Added
Expand Down
2 changes: 1 addition & 1 deletion clients/python/pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ build-backend = "hatchling.build"

[project]
name = "agent-eval-rpc"
version = "0.50.1"
version = "0.50.2"
description = "Python RPC client for @tangle-network/agent-eval — judge content against rubrics over HTTP or stdio RPC. Eval logic runs in the Node runtime; this package is a thin wire client."
readme = "README.md"
requires-python = ">=3.10"
Expand Down
2 changes: 1 addition & 1 deletion clients/python/src/agent_eval_rpc/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -58,7 +58,7 @@
try:
__version__ = version("agent-eval-rpc")
except PackageNotFoundError:
__version__ = "0.50.1"
__version__ = "0.50.2"

__all__ = [
"Client",
Expand Down
2 changes: 1 addition & 1 deletion package.json
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
{
"name": "@tangle-network/agent-eval",
"version": "0.50.1",
"version": "0.50.2",
"description": "Substrate for self-improving agents: traces, verifiable rewards, preferences, GEPA / reflective mutation, auto-research, replay, sequential anytime-valid stats, and release gates.",
"homepage": "https://github.com/tangle-network/agent-eval#readme",
"repository": {
Expand Down
Loading
Loading