ci(test): land the #185 residuals — flake-repro workflow + stuck-state timeout diagnostic #197
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,159 @@ | ||
| # Manually-triggered flake REPRODUCER. The CI-only concurrency flakes | ||
| # (ThreadSubmissionIntegrationTest, CodeEditRecompileTest, …) are positive-wait | ||
| # timeouts of the "race-the-watcher → wedged hub / dropped cross-mirror write" | ||
| # class — they pass in isolation, as a full class, and under DOTNET_PROCESSOR_COUNT=2 | ||
| # locally, but trip under the real 2-vCPU CI runner. This workflow loops one | ||
| # suspect test ON THAT RUNNER until it fails, with logging cranked up via ENV | ||
| # (never by editing a committed appsettings/Log* level), and uploads the FAILING | ||
| # iteration's trx + MonolithMeshTestBase traces + blame hang-dump so the actual | ||
| # wedge is diagnosable — i.e. it manufactures the repro the fix must be proven against. | ||
| # | ||
| # Run it from the Actions tab (workflow_dispatch). Defaults target CodeEditRecompile; | ||
| # re-dispatch with project=test/MeshWeaver.AI.Test filter=FullyQualifiedName~ThreadSubmissionIntegrationTest | ||
| # (etc.) for the others. | ||
| name: Flake repro (manual) | ||
| | ||
| on: | ||
| workflow_dispatch: | ||
| inputs: | ||
| project: | ||
| description: "Test project to loop" | ||
| type: string | ||
| default: "test/MeshWeaver.Hosting.Monolith.Test" | ||
| filter: | ||
| description: "dotnet test --filter expression (the suspect test)" | ||
| type: string | ||
| default: "FullyQualifiedName~CodeEditRecompileTest.PressingCompileButton_SetsRequestedReleaseAt_AndProducesNewRelease" | ||
| iterations: | ||
| description: "Max loop iterations (stops on first failure)" | ||
| type: string | ||
| default: "25" | ||
| loglevel: | ||
| description: "Default log level for the looped runs (Debug/Trace to surface the watcher race)" | ||
| type: string | ||
| default: "Debug" | ||
| | ||
| permissions: | ||
| contents: read | ||
| | ||
| env: | ||
| # Same workspace-pinned NuGet cache as dotnet-test.yml so a --no-build/--no-restore | ||
| # loop resolves packages offline. | ||
| NUGET_PACKAGES: ${{ github.workspace }}/.nuget/packages | ||
| | ||
| jobs: | ||
| repro: | ||
| name: "Loop suspect test until fail" | ||
| # ubuntu-latest == the 2-vCPU GitHub-hosted runner the real flake needs. | ||
| runs-on: ubuntu-latest | ||
| timeout-minutes: 60 | ||
| # Effective params: the workflow_dispatch inputs. The literal after each input is a fallback used only when that input is empty (this workflow has no push trigger). | ||
| env: | ||
| REPRO_PROJECT: ${{ inputs.project || 'test/MeshWeaver.Hosting.Monolith.Test' }} | ||
| REPRO_FILTER: ${{ inputs.filter || 'FullyQualifiedName~CodeEditRecompileTest.PressingCompileButton_SetsRequestedReleaseAt_AndProducesNewRelease' }} | ||
| REPRO_ITERS: ${{ inputs.iterations || '80' }} | ||
| DOTNET_ENVIRONMENT: Development | ||
| # 🚨 Logging cranked via ENV only — never a committed appsettings/Log* change | ||
| # (those are production cost contract). Surfaces the watcher/messaging trace. | ||
| Logging__LogLevel__Default: ${{ inputs.loglevel || 'Debug' }} | ||
| steps: | ||
| - uses: actions/checkout@v6 | ||
| # Mirror dotnet-test.yml: reclaim preinstalled tooling a .NET+Postgres run never | ||
| # touches so the build output + Testcontainers image fit the ~14 GB runner disk. | ||
| - name: Free disk space | ||
| uses: jlumbroso/free-disk-space@main | ||
| with: | ||
| tool-cache: false | ||
| dotnet: false | ||
| android: true | ||
| haskell: true | ||
| large-packages: true | ||
| docker-images: true | ||
| swap-storage: true | ||
| - name: Setup .NET | ||
| uses: actions/setup-dotnet@v5 | ||
| with: | ||
| dotnet-version: 10.0.x | ||
| - name: Restore workloads | ||
| run: dotnet workload restore | ||
| - name: Restore dependencies | ||
| run: dotnet restore | ||
| # Build the whole solution in Release EXACTLY as the real CI build job does, so | ||
| # the looped --no-build runs execute the identical binaries that flake in CI | ||
| # (Release, same warnings-as-errors gate). Faithful repro > speed. | ||
| - name: Build | ||
| run: dotnet build --no-restore -c Release -p:CIRun=true -warnaserror | ||
| # The mesh-local #r feed some dynamic-compilation tests resolve at runtime | ||
| # (version-less `#r "nuget:MeshWeaver.X"`), packed exactly as dotnet-test.yml does. | ||
| - name: Pack mesh-local #r packages | ||
| run: | | ||
| set -euo pipefail | ||
| dotnet pack src/MeshWeaver.BusinessRules/MeshWeaver.BusinessRules.csproj \ | ||
| -c Release --no-build --no-restore -o dist/packages --nologo | ||
| dotnet pack src/MeshWeaver.BusinessRules.Generator/MeshWeaver.BusinessRules.Generator.csproj \ | ||
| -c Release --no-build --no-restore -o dist/packages --nologo | ||
| # 🚨 DO NOT crank the CompileWatcher log levels here. Repro #3/#4 proved it's a HEISENBUG: | ||
| # Debug logging in the watcher hot path changes the scheduling enough to MASK the race | ||
| # (runs #1/#2 caught it at iter 19/4 at the committed Warning level; runs #3/#4 with the | ||
| # categories at Debug missed it across 40+80 iters). So we keep production-like timing and | ||
| # rely on the timing-NEUTRAL stuck-state diagnostic in WaitForLatestRelease (it runs only | ||
| # on the 50s timeout, never during the race) to pin the stalled stage. | ||
| - name: Loop the suspect test until it fails | ||
| run: | | ||
| set -uo pipefail | ||
| iters="$REPRO_ITERS" | ||
| proj="$REPRO_PROJECT" | ||
| filt="$REPRO_FILTER" | ||
| echo "Looping '$filt' in $proj up to $iters× on a 2-vCPU runner (nproc=$(nproc))." | ||
| failed=0 | ||
| for i in $(seq 1 "$iters"); do | ||
| echo "::group::iteration $i / $iters" | ||
| # --no-build --no-restore: run the Release binaries built above, exactly | ||
| # like the CI shards. blame-hang-* captures a mini-dump if the host wedges | ||
| # (the "unresponsive after the second compile" symptom) instead of just | ||
| # timing out opaquely. | ||
| dotnet test "$proj" -c Release --no-build --no-restore \ | ||
| --filter "$filt" -l:trx \ | ||
| --blame-hang-timeout 120s --blame-hang-dump-type mini | ||
| rc=$? | ||
| echo "::endgroup::" | ||
| if [ "$rc" -ne 0 ]; then | ||
| echo "::error::REPRO on iteration $i (exit $rc) — diagnostics uploaded below." | ||
| failed=1 | ||
| break | ||
| fi | ||
| echo "iteration $i passed" | ||
| done | ||
| if [ "$failed" = "0" ]; then | ||
| echo "No repro in $iters iterations (the flake did not surface this run)." | ||
| fi | ||
| # Exit non-zero on repro so the run is CLEARLY marked failed in the Actions | ||
| # list (repro achieved = red). The diagnostics-collection + upload steps | ||
| # below run regardless via `if: always()`. | ||
| exit "$failed" | ||
| - name: Collect diagnostics | ||
| if: always() | ||
| run: | | ||
| mkdir -p repro-diagnostics | ||
| # trx (per-iteration; the last one is the failing run since we break on fail). | ||
| find "$REPRO_PROJECT" -name '*.trx' -exec cp {} repro-diagnostics/ \; 2>/dev/null || true | ||
| # blame hang mini-dump (written when the test host wedged, not just timed out). | ||
| find "$REPRO_PROJECT" -name '*.dmp' -exec cp {} repro-diagnostics/ \; 2>/dev/null || true | ||
| # MonolithMeshTestBase phase/dispose/memory traces (the watcher + INIT/DISPOSE | ||
| # trace that pinpoints the wedge), written to the process tempdir. | ||
| for f in meshweaver-test-trace meshweaver-dispose-trace meshweaver-memory-delta; do | ||
| [ -f "/tmp/$f.log" ] && cp "/tmp/$f.log" "repro-diagnostics/_$f.log" || true | ||
| done | ||
| # XUnitFileOutputHelper writes per-test/background-hub logs under bin/.../test-logs | ||
| # (the cranked watcher trace lands here when no test is the active ITestOutputHelper). | ||
| # Search repo-wide — background-hub logs can land outside the suspect project dir. | ||
| find . -path '*/test-logs/*.log' -exec cp {} repro-diagnostics/ \; 2>/dev/null || true | ||
| echo "Collected:"; ls -lh repro-diagnostics/ || true | ||
| - name: Upload repro diagnostics | ||
| if: always() | ||
| uses: actions/upload-artifact@v6 | ||
| with: | ||
| name: flake-repro-diagnostics | ||
| path: repro-diagnostics/ | ||
| retention-days: 15 | ||
| compression-level: 9 |
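
The "Loop the suspect test until it fails" step above can be reduced to a small, locally runnable sketch of the same pattern: run a command up to N times, stop at the first non-zero exit, and report which iteration failed. The names `run_until_fail` and `flaky_stub` are hypothetical and exist only for illustration; the stub stands in for the `dotnet test ...` invocation so the sketch runs without a .NET SDK.

```shell
#!/usr/bin/env bash
# Sketch of the loop-until-first-failure pattern used by the workflow step.
# `run_until_fail` and `flaky_stub` are hypothetical names, not part of the PR.
set -uo pipefail   # deliberately no -e: a failing iteration must not abort the script

# run_until_fail MAX CMD [ARGS...]
# Runs CMD up to MAX times. On the first non-zero exit it prints the failing
# iteration number and returns 1; if every run passes it prints 0 and returns 0.
run_until_fail() {
  local max="$1"; shift
  local i rc
  for i in $(seq 1 "$max"); do
    "$@" >/dev/null 2>&1
    rc=$?
    if [ "$rc" -ne 0 ]; then
      echo "$i"
      return 1
    fi
  done
  echo 0
  return 0
}

# Stand-in for the real test command: fails on its third invocation.
# The call count lives in a temp file so it survives command substitution.
calls_file="$(mktemp)"
echo 0 > "$calls_file"
flaky_stub() {
  local n
  n=$(( $(cat "$calls_file") + 1 ))
  echo "$n" > "$calls_file"
  [ "$n" -ne 3 ]
}

failed_at="$(run_until_fail 10 flaky_stub)"
echo "first failure at iteration: $failed_at"
```

As in the workflow, leaving out `-e` is what lets the loop inspect the exit code itself and break on the first failure, so the most recent artifacts on disk belong to the failing iteration.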