ci(test): land the #185 residuals — flake-repro workflow + stuck-state timeout diagnostic#197
…e timeout diagnostic

Residual 2: .github/workflows/flake-repro.yml, a manually dispatched job (workflow_dispatch only; the branch's TEMP push trigger is stripped) that loops one suspect test on the real 2-vCPU ubuntu runner until it fails, then uploads the failing trx + traces + blame hang-dump, with logging cranked via ENV (never by editing committed log levels). It reliably manufactured the CodeEditRecompile flake (iterations 4/19/80) and targets any project/filter.

Residual 3: CodeEditRecompileTest.WaitForLatestRelease now dumps a full discriminating diagnostic on its 50s timeout: MIRROR (cross-hub cache handle) vs INDEX (persisted+indexed state + Release children) views of the node, to separate owner-side "never produced v2" from delivery/mirror staleness, plus a fresh re-trigger to split a one-time missed emission (recovers) from a persistent clobber/dead subscription (stays stuck). Timing-neutral: it only runs in the already-failing timeout path.

Extracted from the never-PR'd ci/flake-repro-workflow branch WITHOUT the abandoned #124 watcher refinement it also carried (superseded by #194's commit-path high-water fix).

Completes #185 (residual 1 landed in #194). Fixes #185.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
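The "loop one suspect test until it fails" idea from Residual 2 can be sketched in a few lines of shell. This is a minimal illustration, not the contents of flake-repro.yml: the function names `run_until_fail` and `fake_test` are hypothetical, and `fake_test` is a stand-in that fails on its 4th call so the sketch runs without a .NET toolchain.

```shell
# Hypothetical sketch of a loop-until-failure harness (not the real workflow step).

# run_until_fail MAX CMD [ARGS...]
# Runs CMD up to MAX times. On the first failure it prints the failing
# iteration and returns 1; if every iteration passes it returns 0.
run_until_fail() {
  local max="$1"
  shift
  local i=1
  while [ "$i" -le "$max" ]; do
    if ! "$@"; then
      echo "reproduced on iteration $i"
      return 1
    fi
    i=$((i + 1))
  done
  echo "no failure in $max iterations"
  return 0
}

# Stand-in for the real test command; it fails on its 4th invocation.
# In CI the command would be a real test run instead, for example
# (assumed shape): dotnet test "$PROJECT" --filter "$FILTER"
COUNTER_FILE="$(mktemp)"
echo 0 > "$COUNTER_FILE"
fake_test() {
  local n
  n=$(( $(cat "$COUNTER_FILE") + 1 ))
  echo "$n" > "$COUNTER_FILE"
  [ "$n" -ne 4 ]
}
```

Calling `run_until_fail 10 fake_test` stops at the fourth iteration and returns a non-zero status, which is what lets a CI step mark the run as failed while later steps collect artifacts.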
Pull request overview
Adds a reusable manual GitHub Actions workflow to reproduce CI-only flakes on the real 2‑vCPU runner, and enhances CodeEditRecompileTest.WaitForLatestRelease to emit richer diagnostics when the existing 50s wait times out (mirror vs index views + a bounded re-trigger attempt).
Changes:
- Add .github/workflows/flake-repro.yml (workflow_dispatch only) to loop a specified test until failure and upload TRX + dumps + MeshWeaver traces.
- Extend the WaitForLatestRelease timeout path to dump discriminating state (MIRROR vs INDEX + Release children) and attempt a re-trigger to classify “one-time missed emission” vs “persistently stuck”.
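The classification in the second item can be pictured as a small decision function. The sketch below is only an illustration under stated assumptions: it models each view as an integer release version, which the source does not confirm, and the name `classify_timeout` and its labels are hypothetical rather than taken from CodeEditRecompileTest.

```shell
# Hypothetical decision sketch; the real diagnostic lives in the C# test.
# Assumption: each view can be summarized as an integer release version.

# classify_timeout MIRROR_VERSION INDEX_VERSION EXPECTED_VERSION RECOVERED
#   MIRROR_VERSION   version seen through the cross-hub cache handle
#   INDEX_VERSION    version seen in the persisted+indexed state
#   EXPECTED_VERSION version the test was waiting for
#   RECOVERED        "yes" if a fresh re-trigger produced the release
classify_timeout() {
  local mirror="$1" index="$2" expected="$3" recovered="$4"
  local where when
  if [ "$index" -lt "$expected" ]; then
    # The owner never persisted the expected release.
    where="owner-side: expected release never produced"
  elif [ "$mirror" -lt "$expected" ]; then
    # The release exists in the index but the mirror did not receive it.
    where="delivery: mirror is stale"
  else
    where="inconclusive: both views show the expected release"
  fi
  if [ "$recovered" = "yes" ]; then
    when="sub-case a (one-time missed emission)"
  else
    when="sub-case b (persistently stuck)"
  fi
  echo "$where; $when"
}

classify_timeout 1 2 2 yes
# prints: delivery: mirror is stale; sub-case a (one-time missed emission)
```

The two axes are kept independent on purpose: the view comparison says where the release went missing, and the re-trigger says whether the failure persists.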
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| test/MeshWeaver.Hosting.Monolith.Test/CodeEditRecompileTest.cs | Adds timeout-path diagnostics (mirror vs index, release listing, re-trigger classification). |
| .github/workflows/flake-repro.yml | New manual workflow to reproduce flakes on ubuntu-latest and upload diagnostics artifacts. |
Test Results (shard 2): 15 files ±0, 15 suites ±0, 6m 15s ⏱️ -44s. Results for commit cc980cb. ± Comparison against base commit 77df906. This pull request removes 43 tests.
Test Results: 55 files +1, 55 suites +1, 23m 50s ⏱️ +4m 6s. Results for commit cc980cb. ± Comparison against base commit 77df906. This pull request removes 43 and adds 673 tests. Note that renamed tests count towards both.
- The three facts using WaitForLatestRelease get [Fact(Timeout = 120000)] with rationale: the happy path completes in seconds; the budget is for the FAILURE path. The 50s primary wait plus the discriminating diagnostic (probes + decisive re-trigger, worst ~50s) must fit inside the xUnit method timeout, or the diagnostic is cancelled before it can be emitted.
- Index probes are best-effort and tightly bounded (first emission, 5s): waiting for a non-empty snapshot could stall the whole bound when the node is genuinely absent from the index, which is itself a diagnostic result.
- flake-repro.yml exits non-zero when the flake reproduces so the run is clearly red in the Actions list; diagnostics-collection/upload steps still run via if: always().

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
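The timeout budget in the first point is simple arithmetic, and a quick check makes the headroom explicit. The helper name `budget_fits` is hypothetical; the numbers are the ones quoted above (120000 ms method timeout, 50 s primary wait, roughly 50 s worst-case diagnostic).

```shell
# Hypothetical helper: does the failure path fit inside the method timeout?

# budget_fits METHOD_TIMEOUT_MS PRIMARY_WAIT_S DIAGNOSTIC_WORST_S
# Prints the remaining headroom in seconds and returns 1 if it is negative.
budget_fits() {
  local method_s=$(( $1 / 1000 ))
  local used_s=$(( $2 + $3 ))
  local headroom_s=$(( method_s - used_s ))
  echo "$headroom_s"
  [ "$headroom_s" -ge 0 ]
}

budget_fits 120000 50 50
# prints: 20
```

With these figures the failure path leaves about 20 s of slack, whereas a 60000 ms method timeout would give negative headroom and the diagnostic would be cancelled before it could be emitted.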
Closes #185 (residual 1 already landed via #194).
Residual 2: reusable 2-vCPU flake-repro workflow
.github/workflows/flake-repro.yml: manually dispatched (workflow_dispatch only; the branch's TEMP self-serve push: trigger is stripped, so merging this cannot start any run). Loops one suspect test on the real ubuntu 2-vCPU runner until it fails (inputs: project / filter / iterations / log level via ENV, never by editing committed log levels), uploading the failing iteration's trx + MonolithMeshTestBase traces + blame hang-dump. It reliably manufactured the CodeEditRecompile flake on iterations 4/19/80 and targets any project/test. Zero src/ risk.
Residual 3: stuck-state timeout diagnostic
CodeEditRecompileTest.WaitForLatestRelease dumps a discriminating diagnostic on its 50 s timeout: MIRROR (the cross-hub cache handle the test reads) vs INDEX (persisted+indexed state + Release children) views of the same node, separating owner-side "never produced v2" from delivery/mirror staleness, plus a fresh re-trigger that splits a one-time missed emission (recovers → sub-case a) from a persistent clobber / dead subscription (stays stuck → sub-case b). Runs only in the already-failing timeout path; the happy path is untouched.
Provenance & verification
- ci/flake-repro-workflow branch without the abandoned #124 watcher refinement it also carried (NodeTypeCompilationHelpers.cs / MeshDataSource.cs, superseded by #194's commit-path high-water fix).
- on: is dispatch-only.
- dotnet build test/MeshWeaver.Hosting.Monolith.Test -c Release -warnaserror: clean.
- CodeEditRecompileTest (Release): 5/5 pass with the diagnostic in place.

🤖 Generated with Claude Code