fix: prevent false service error alerts during operations #6279

Merged
Pyatakov merged 1 commit into develop from fix/spurious-service-unavailable-modal on Jun 26, 2026

@Pyatakov Pyatakov commented Jun 25, 2026

Contributor

Description

  • Fixed a false "Something went wrong" modal that appeared during long operations such as policy import or dry-run, even though the operation completed successfully.
  • Root cause was in the api-gateway: it checked each service's health by asking for a status reply and ignoring anything that arrived later than 300ms. A service that is merely busy answers late, so it looked exactly like a crashed one and was reported as unavailable, triggering the modal mid-operation.
  • The gateway now tracks the most recent status reply from each service, including late ones, and reports a service as unavailable only after it has been silent for longer than two heartbeats (~70s). A busy service keeps answering (late), so it is no longer mistaken for a down service, while a genuinely down service is still detected.
  • The frontend re-checks status across a short grace period before showing the modal, so a brief blip (such as a service restarting) is ignored.
  • The alert resets once services recover, so a later genuine outage is reported again (previously it fired only once per session).
  • Pending checks are cancelled when the websocket connection closes, so an old timer cannot trigger the modal during a reconnect.
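
The gateway-side change described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `LivenessTracker`, `recordReply`, `isAvailable`, and the service names are hypothetical, and the 35s heartbeat is an assumption inferred from "two heartbeats (~70s)". Treating a service that has never replied as unavailable is also an assumption.

```typescript
// Hypothetical sketch: liveness by recency of the last status reply.
// Assumed value: "two heartbeats (~70s)" suggests a heartbeat of about 35s.
const HEARTBEAT_MS = 35_000;
const LIVENESS_TTL_MS = 2 * HEARTBEAT_MS;

class LivenessTracker {
  // Timestamp (ms) of the most recent status reply per service.
  private readonly lastSeen = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number = LIVENESS_TTL_MS) {
    this.ttlMs = ttlMs;
  }

  // Called for every status reply, however late it arrives.
  recordReply(service: string, now: number = Date.now()): void {
    this.lastSeen.set(service, now);
  }

  // A service counts as unavailable only once it has been silent
  // for longer than the TTL. Never-seen services are treated as
  // unavailable here (an assumption of this sketch).
  isAvailable(service: string, now: number = Date.now()): boolean {
    const seen = this.lastSeen.get(service);
    return seen !== undefined && now - seen <= this.ttlMs;
  }

  // Convenience helper: which of the expected services are silent?
  unavailable(services: string[], now: number = Date.now()): string[] {
    return services.filter((s) => !this.isAvailable(s, now));
  }
}
```

With this shape, a busy service whose reply arrives several seconds late still refreshes its timestamp, so only a service that stays silent past the TTL is reported. The old 300ms cut-off could not tell these two cases apart.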

Resolves #6281

@Pyatakov Pyatakov requested review from a team as code owners June 25, 2026 20:30
@Pyatakov Pyatakov self-assigned this Jun 25, 2026
github-actions Bot commented Jun 25, 2026


Test Results

 32 files  ±0   64 suites  ±0   6m 17s ⏱️ +47s
 35 tests ±0   32 ✅  - 3  0 💤 ±0  3 ❌ +3 
165 runs  ±0  162 ✅  - 3  0 💤 ±0  3 ❌ +3 

For more details on these failures, see this check.

Results for commit fb7467c. ± Comparison against base commit 89b8867.

♻️ This comment has been updated with latest results.

@Pyatakov Pyatakov changed the title fix: suppress spurious service-unavailable modal fix: prevent false service error alerts during operations Jun 25, 2026
@Pyatakov Pyatakov force-pushed the fix/spurious-service-unavailable-modal branch from 0c258dd to c2d8ae9 Compare June 25, 2026 22:25
@Pyatakov Pyatakov marked this pull request as draft June 25, 2026 22:39
@Pyatakov Pyatakov force-pushed the fix/spurious-service-unavailable-modal branch 2 times, most recently from dd091d2 to 46064ef Compare June 25, 2026 22:53
The 'Something went wrong' dialog appeared during long operations such as
policy import or dry-run even though they succeeded. The api-gateway
polled service health by broadcasting GET_STATUS and discarding any reply
later than 300ms. A service that is merely busy answers the poll late, so
it looked identical to a crashed one and was reported as not ready,
tripping the dialog mid-operation.

Track service liveness by the recency of any status reply instead of a
single 300ms snapshot. The gateway now keeps a persistent SEND_STATUS
listener that records every reply (including late ones) and reports a
service as unavailable only once it has been silent past a TTL of ~2
heartbeats. A busy-but-alive service keeps its liveness fresh, so it is no
longer confused with a crashed one, while a genuinely down service is
still detected.

The frontend keys the dialog off that signal with a short grace re-check
(so a brief restart blip is ignored), resets the alert on recovery so a
future outage is reported again, and clears the pending check when the
socket closes.

Signed-off-by: Alex Piatakov <alex.piatakov@swirldslabs.com>
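
The frontend behaviour described in the commit message (grace re-check, reset on recovery, cancel on socket close) could look roughly like the sketch below. All names (`ServiceAlertGate`, `onStatus`, `onSocketClose`, the 5s default grace period) are hypothetical and not taken from this PR. For brevity the sketch performs a single re-check at the end of the grace period, and the scheduler is injectable so the logic can be exercised without real timers.

```typescript
// Hypothetical sketch of the modal gating logic; not the PR's code.
type Cancel = () => void;
type Schedule = (fn: () => void, ms: number) => Cancel;

// Default scheduler backed by setTimeout/clearTimeout.
const realSchedule: Schedule = (fn, ms) => {
  const id = setTimeout(fn, ms);
  return () => clearTimeout(id);
};

class ServiceAlertGate {
  private cancelPending: Cancel | undefined;
  private alerted = false;
  private readonly isHealthy: () => boolean;
  private readonly showModal: () => void;
  private readonly graceMs: number;
  private readonly schedule: Schedule;

  constructor(
    isHealthy: () => boolean, // returns the latest known status
    showModal: () => void,
    graceMs: number = 5_000, // assumed grace period
    schedule: Schedule = realSchedule,
  ) {
    this.isHealthy = isHealthy;
    this.showModal = showModal;
    this.graceMs = graceMs;
    this.schedule = schedule;
  }

  // Called whenever a status update arrives from the gateway.
  onStatus(healthy: boolean): void {
    if (healthy) {
      this.clearPending();
      this.alerted = false; // reset so a later outage is reported again
      return;
    }
    if (this.alerted || this.cancelPending) return;
    // Do not alert immediately: re-check after the grace period.
    this.cancelPending = this.schedule(() => {
      this.cancelPending = undefined;
      if (!this.isHealthy()) {
        this.alerted = true;
        this.showModal();
      }
    }, this.graceMs);
  }

  // Called when the websocket closes, so a stale timer cannot fire
  // the modal during a reconnect.
  onSocketClose(): void {
    this.clearPending();
  }

  private clearPending(): void {
    this.cancelPending?.();
    this.cancelPending = undefined;
  }
}
```

In this sketch a brief blip never reaches `showModal`, because the re-check sees a healthy status by the time the grace period ends.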
@Pyatakov Pyatakov force-pushed the fix/spurious-service-unavailable-modal branch from 46064ef to fb7467c Compare June 26, 2026 13:13
@Pyatakov Pyatakov marked this pull request as ready for review June 26, 2026 13:38
@Pyatakov Pyatakov merged commit c2e6740 into develop Jun 26, 2026
21 of 23 checks passed
@Pyatakov Pyatakov deleted the fix/spurious-service-unavailable-modal branch June 26, 2026 14:31
