Skip to content

fix(signer): defer sign-round clear + persist before idempotent serve#4123

Merged
mswilkison merged 3 commits into
extraction/frost-signer-mirror-2026-05-26from
fix/signer-signround-state-rollback
Jun 28, 2026
Merged

fix(signer): defer sign-round clear + persist before idempotent serve#4123
mswilkison merged 3 commits into
extraction/frost-signer-mirror-2026-05-26from
fix/signer-signround-state-rollback

Conversation

@mswilkison

Copy link
Copy Markdown
Contributor

Summary

Fixes two Codex-flagged P2 state-consistency holes in start_sign_round (pkg/tbtc/signer/src/engine/signing.rs), where a failed persist or a later validation error could leave in-memory state diverged from durable state and bypass the ROAST replay/transition protections.

1. Persist before serving the idempotent cached round (bug #1)

The fresh-round path mutates the session (consumed-replay markers + round state) and then persists. If that persist failed (state-key-provider or disk error) it returned an error and served no shares — but the canonical idempotent serve then returned signature shares without persisting, so a restart could replay the round with no durable consumed marker.

Fix: the idempotent cached serve now persists before serving when the round is not yet durable, tracked by a process-local SIGN_ROUND_PERSIST_PENDING marker (set on fresh-round mutation, cleared after a successful persist). When the original persist already succeeded — the common case — it still serves the cached round without persisting, preserving the "idempotent replay survives a state-key-provider outage" property that build_taproot_tx relies on. (Rollback is impossible here: the transition clear zeroizes the prior round material.)

2. Defer clearing the active attempt until validation finishes (bug #2)

On an authorized ROAST attempt advance, clear_active_sign_round_for_attempt_transition ran before the fresh-path checks (participant resolution, included-set equality, quarantine, consumed-replay, share construction). A malformed advance that passed authorization + the RFC-21 coordinator check but failed a later check destroyed the in-memory active round with no validated/persisted replacement — so active_attempt_context was dropped and the next StartSignRound could start a fresh attempt without transition evidence until a restart reloaded durable state.

Fix: the advance is authorized but the clear is deferred until every fallible fresh-path check has passed, immediately before the replacement round is installed and persisted (the idempotent/conflict branch is skipped for an authorized advance via a flag).

Tests

Two regression tests, both verified to fail against the pre-fix code:

  • authorized_advance_failing_later_check_preserves_active_sign_round — an authorized advance that fails the included-set check leaves attempt 1 idempotently signable (pre-fix: ConsumedAttemptReplay).
  • idempotent_sign_round_replay_persists_before_serving_after_persist_outage — a sign round whose persist fails does not serve shares on idempotent replay until the state-key provider recovers (pre-fix: served shares with the key down).

Full lib suite: 308 passing (cargo test --lib), cargo fmt --check clean. Design validated with Codex.

Found during the Codex review of #4005.

Co-Authored-By: Claude Opus 4.8 (1M context) noreply@anthropic.com

🤖 Generated with Claude Code

Fixes two Codex-flagged P2 state-consistency holes in start_sign_round.

1. Persist failure left in-memory state diverged from durable state. After the
   fresh-round path mutated the session (consumed-replay markers, round state), a
   failed persist returned an error but the canonical idempotent serve then
   returned signature shares WITHOUT persisting -- so a restart could replay the
   round with no durable consumed marker. The idempotent cached serve now
   persists before serving when the round is not yet durable, tracked by a
   process-local SIGN_ROUND_PERSIST_PENDING marker; when the original persist
   already succeeded it still serves cached without persisting, preserving the
   'idempotent replay survives a state-key-provider outage' property
   build_taproot_tx relies on (rollback is impossible -- the transition clear
   zeroizes the prior round material).

2. Active attempt cleared before later validation could fail. On an authorized
   ROAST attempt advance, clear_active_sign_round_for_attempt_transition ran
   before the fresh-path checks (participant resolution, included-set equality,
   quarantine, consumed-replay, share construction). A malformed advance that
   passed authorization but failed a later check destroyed the in-memory active
   round with no validated/persisted replacement, so the next StartSignRound
   could start a fresh attempt without transition evidence until a restart. The
   clear is now deferred until every fallible check has passed, just before the
   replacement round is installed and persisted.

Adds two regression tests, both verified to fail against the pre-fix code. Design
validated with Codex.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 28, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 33c21f51-4c04-463f-99fa-d20b46b3a539

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/signer-signround-state-rollback

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

mswilkison and others added 2 commits June 28, 2026 10:05
Codex re-review: the SIGN_ROUND_PERSIST_PENDING marker was process-global but
cleared only at start_sign_round's own persist sites. If a start_sign_round
persist failed (marker set) and a later UNRELATED successful persist (e.g. a DKG
for another session) then wrote the whole engine state -- making that round
durable -- the marker stayed stale-true. A subsequent idempotent replay of the
now-durable round during a state-key-provider outage would re-enter the persist
branch, try to persist again, and fail instead of serving the cached round.

Move the marker into the persistence module and clear it inside
persist_engine_state_to_storage_with_key on any successful write, so any
operation's successful persist clears it. start_sign_round sets it on a
fresh-round mutation (mark_sign_round_persist_pending) and reads it in the
idempotent serve (sign_round_persist_pending); the explicit clears at the
start_sign_round persist sites are removed (the persist clears it now).

Adds a regression test (verified to fail against the pre-fix code) covering the
unrelated-persist-then-outage replay.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codex re-review: the process-global SIGN_ROUND_PERSIST_PENDING bit conflated all
sessions. After one session's StartSignRound established a round and failed to
persist, the bit stayed set process-wide, so the idempotent cached-serve branch
-- which consulted it for EVERY session -- would force an UNRELATED, already-
durable session's replay to re-persist and fail during the same state-key outage
instead of serving its durable cached shares. An availability regression.

Replace the global AtomicBool with a per-session set
(SIGN_ROUND_PERSIST_PENDING_SESSIONS: OnceLock<Mutex<BTreeSet<String>>>) keyed by
session_id: mark the specific session on a fresh-round mutation, consult that
session in the cached-serve branch, and clear the WHOLE set on any successful
persist (a persist writes the entire engine state, so every in-memory round
becomes durable at once -- preserving the cross-operation durability the prior
commit established). reset_for_tests clears the set explicitly.

Both invariants stay intact: a non-durable round's replay still re-persists
before serving (durability); a durable round's replay still serves without
persisting (availability); a different session's failed persist no longer drags
down an unrelated durable session.

Adds a regression test (verified to fail against the global bool) and corrects
the marker doc comment (mutual exclusion is the inner mutex, not the ENGINE_STATE
guard, since clear also runs off-guard at startup and in test reset).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mswilkison mswilkison merged commit 4587081 into extraction/frost-signer-mirror-2026-05-26 Jun 28, 2026
20 checks passed
@mswilkison mswilkison deleted the fix/signer-signround-state-rollback branch June 28, 2026 22:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant