fix(signer): defer sign-round clear + persist before idempotent serve #4123
Merged
mswilkison merged 3 commits into extraction/frost-signer-mirror-2026-05-26 on Jun 28, 2026
Fixes two Codex-flagged P2 state-consistency holes in start_sign_round.

1. Persist failure left in-memory state diverged from durable state. After the fresh-round path mutated the session (consumed-replay markers, round state), a failed persist returned an error but the canonical idempotent serve then returned signature shares WITHOUT persisting -- so a restart could replay the round with no durable consumed marker. The idempotent cached serve now persists before serving when the round is not yet durable, tracked by a process-local SIGN_ROUND_PERSIST_PENDING marker; when the original persist already succeeded it still serves cached without persisting, preserving the 'idempotent replay survives a state-key-provider outage' property build_taproot_tx relies on (rollback is impossible -- the transition clear zeroizes the prior round material).

2. Active attempt cleared before later validation could fail. On an authorized ROAST attempt advance, clear_active_sign_round_for_attempt_transition ran before the fresh-path checks (participant resolution, included-set equality, quarantine, consumed-replay, share construction). A malformed advance that passed authorization but failed a later check destroyed the in-memory active round with no validated/persisted replacement, so the next StartSignRound could start a fresh attempt without transition evidence until a restart. The clear is now deferred until every fallible check has passed, just before the replacement round is installed and persisted.

Adds two regression tests, both verified to fail against the pre-fix code. Design validated with Codex.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
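The persist-before-serve rule from item 1 can be sketched as follows. This is a minimal single-session model, not the signer's actual code: the Engine struct, its fields, and SignError are hypothetical, and a boolean stands in for the state-key provider being reachable.

```rust
// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
enum SignError {
    PersistFailed,
}

// Hypothetical single-session engine model.
struct Engine {
    cached_round: Option<Vec<u8>>,  // in-memory shares for the active round
    durable_round: Option<Vec<u8>>, // stand-in for the persisted engine state
    persist_pending: bool,          // stand-in for the SIGN_ROUND_PERSIST_PENDING marker
    key_provider_up: bool,          // stand-in for state-key-provider availability
}

impl Engine {
    fn new() -> Self {
        Engine {
            cached_round: None,
            durable_round: None,
            persist_pending: false,
            key_provider_up: true,
        }
    }

    // Writes the in-memory round to "storage"; fails while the key provider is down.
    fn persist(&mut self) -> Result<(), SignError> {
        if !self.key_provider_up {
            return Err(SignError::PersistFailed);
        }
        self.durable_round = self.cached_round.clone();
        self.persist_pending = false;
        Ok(())
    }

    fn start_sign_round(&mut self, shares: &[u8]) -> Result<Vec<u8>, SignError> {
        if let Some(cached) = self.cached_round.clone() {
            // Idempotent replay: persist first only if the round is not yet durable.
            if self.persist_pending {
                self.persist()?;
            }
            return Ok(cached);
        }
        // Fresh round: mutate in memory, mark as pending, then try to persist.
        self.cached_round = Some(shares.to_vec());
        self.persist_pending = true;
        self.persist()?;
        Ok(shares.to_vec())
    }
}
```

In this model a replay after a failed persist keeps returning an error until a persist succeeds, while a replay of an already-durable round is served even when the key provider is down.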
Codex re-review: the SIGN_ROUND_PERSIST_PENDING marker was process-global but cleared only at start_sign_round's own persist sites. If a start_sign_round persist failed (marker set) and a later UNRELATED successful persist (e.g. a DKG for another session) then wrote the whole engine state -- making that round durable -- the marker stayed stale-true. A subsequent idempotent replay of the now-durable round during a state-key-provider outage would re-enter the persist branch, try to persist again, and fail instead of serving the cached round.

Move the marker into the persistence module and clear it inside persist_engine_state_to_storage_with_key on any successful write, so any operation's successful persist clears it. start_sign_round sets it on a fresh-round mutation (mark_sign_round_persist_pending) and reads it in the idempotent serve (sign_round_persist_pending); the explicit clears at the start_sign_round persist sites are removed (the persist clears it now).

Adds a regression test (verified to fail against the pre-fix code) covering the unrelated-persist-then-outage replay.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
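A rough sketch of the arrangement this commit describes, with the marker owned by the persistence module and cleared by any successful write. The names mark_sign_round_persist_pending and sign_round_persist_pending come from the commit message; the persist_engine_state function and its boolean parameter are simplified, hypothetical stand-ins for persist_engine_state_to_storage_with_key, whose real signature is not shown here.

```rust
mod persistence {
    use std::sync::atomic::{AtomicBool, Ordering};

    // Process-global marker, as it existed at this commit (before the per-session change).
    static SIGN_ROUND_PERSIST_PENDING: AtomicBool = AtomicBool::new(false);

    // Called by start_sign_round after a fresh-round mutation.
    pub fn mark_sign_round_persist_pending() {
        SIGN_ROUND_PERSIST_PENDING.store(true, Ordering::SeqCst);
    }

    // Read by the idempotent cached-serve branch.
    pub fn sign_round_persist_pending() -> bool {
        SIGN_ROUND_PERSIST_PENDING.load(Ordering::SeqCst)
    }

    // Hypothetical, simplified persist entry point. `key_available` stands in
    // for the state-key provider; the real write of engine state is elided.
    pub fn persist_engine_state(key_available: bool) -> Result<(), &'static str> {
        if !key_available {
            return Err("state key provider unavailable");
        }
        // A successful write makes every in-memory round durable, so the
        // marker is cleared here regardless of which operation triggered it.
        SIGN_ROUND_PERSIST_PENDING.store(false, Ordering::SeqCst);
        Ok(())
    }
}
```

Clearing inside the persist function means callers such as a DKG for another session clear the marker as a side effect, which is what removes the stale-true state described above.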
Codex re-review: the process-global SIGN_ROUND_PERSIST_PENDING bit conflated all sessions. After one session's StartSignRound established a round and failed to persist, the bit stayed set process-wide, so the idempotent cached-serve branch -- which consulted it for EVERY session -- would force an UNRELATED, already-durable session's replay to re-persist and fail during the same state-key outage instead of serving its durable cached shares. An availability regression.

Replace the global AtomicBool with a per-session set (SIGN_ROUND_PERSIST_PENDING_SESSIONS: OnceLock<Mutex<BTreeSet<String>>>) keyed by session_id: mark the specific session on a fresh-round mutation, consult that session in the cached-serve branch, and clear the WHOLE set on any successful persist (a persist writes the entire engine state, so every in-memory round becomes durable at once -- preserving the cross-operation durability the prior commit established). reset_for_tests clears the set explicitly.

Both invariants stay intact: a non-durable round's replay still re-persists before serving (durability); a durable round's replay still serves without persisting (availability); a different session's failed persist no longer drags down an unrelated durable session.

Adds a regression test (verified to fail against the global bool) and corrects the marker doc comment (mutual exclusion is the inner mutex, not the ENGINE_STATE guard, since clear also runs off-guard at startup and in test reset).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
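The per-session marker could look roughly like the sketch below. The static's name and type are taken from the commit message; the helper function names, their session_id parameters, and the choice to unwrap a poisoned lock are assumptions made for illustration.

```rust
use std::collections::BTreeSet;
use std::sync::{Mutex, OnceLock};

// Name and type as given in the commit message.
static SIGN_ROUND_PERSIST_PENDING_SESSIONS: OnceLock<Mutex<BTreeSet<String>>> = OnceLock::new();

// Hypothetical accessor: lazily creates the empty set on first use.
fn pending_sessions() -> &'static Mutex<BTreeSet<String>> {
    SIGN_ROUND_PERSIST_PENDING_SESSIONS.get_or_init(|| Mutex::new(BTreeSet::new()))
}

// Mark one session after its fresh-round mutation (signature is an assumption).
fn mark_sign_round_persist_pending(session_id: &str) {
    // The inner mutex provides the mutual exclusion; unwrap panics on a
    // poisoned lock, which is a simplification for this sketch.
    pending_sessions().lock().unwrap().insert(session_id.to_owned());
}

// Consulted by the cached-serve branch for the session being replayed only.
fn sign_round_persist_pending(session_id: &str) -> bool {
    pending_sessions().lock().unwrap().contains(session_id)
}

// Hypothetical helper: a successful persist writes the whole engine state,
// so every pending session becomes durable at once and the set is emptied.
fn clear_all_pending_on_successful_persist() {
    pending_sessions().lock().unwrap().clear();
}
```

Keying by session_id keeps one session's failed persist from affecting another session's replay, while clearing the whole set on success preserves the cross-operation behavior of the previous commit.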
4587081 into extraction/frost-signer-mirror-2026-05-26
20 checks passed
Summary
Fixes two Codex-flagged P2 state-consistency holes in start_sign_round (pkg/tbtc/signer/src/engine/signing.rs), where a failed persist or a later validation error could leave in-memory state diverged from durable state and bypass the ROAST replay/transition protections.

1. Persist before serving the idempotent cached round (bug #1)

The fresh-round path mutates the session (consumed-replay markers + round state) and then persists. If that persist failed (state-key-provider or disk error) it returned an error and served no shares, but the canonical idempotent serve then returned signature shares without persisting, so a restart could replay the round with no durable consumed marker.

Fix: the idempotent cached serve now persists before serving when the round is not yet durable, tracked by a process-local SIGN_ROUND_PERSIST_PENDING marker (set on fresh-round mutation, cleared after a successful persist). When the original persist already succeeded, which is the common case, it still serves the cached round without persisting, preserving the "idempotent replay survives a state-key-provider outage" property that build_taproot_tx relies on. (Rollback is impossible here: the transition clear zeroizes the prior round material.)

2. Defer clearing the active attempt until validation finishes (bug #2)

On an authorized ROAST attempt advance, clear_active_sign_round_for_attempt_transition ran before the fresh-path checks (participant resolution, included-set equality, quarantine, consumed-replay, share construction). A malformed advance that passed authorization + the RFC-21 coordinator check but failed a later check destroyed the in-memory active round with no validated/persisted replacement, so active_attempt_context was dropped and the next StartSignRound could start a fresh attempt without transition evidence until a restart reloaded durable state.

Fix: the advance is authorized but the clear is deferred until every fallible fresh-path check has passed, immediately before the replacement round is installed and persisted (the idempotent/conflict branch is skipped for an authorized advance via a flag).
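The deferred-clear ordering for bug #2 can be illustrated with a small model. Everything here is hypothetical (Session, ActiveRound, AdvanceError, and advance_attempt are not the signer's types), and only two of the fresh-path checks are modeled; the point is only that no mutation happens until every fallible check has passed.

```rust
// Hypothetical model types for this sketch only.
#[derive(Debug, Clone, PartialEq)]
struct ActiveRound {
    attempt: u32,
    included: Vec<u16>,
}

#[derive(Debug, PartialEq)]
enum AdvanceError {
    IncludedSetMismatch,
    Quarantined,
}

struct Session {
    active: Option<ActiveRound>,
    quarantined: Vec<u16>,
}

impl Session {
    // Models an already-authorized attempt advance.
    fn advance_attempt(
        &mut self,
        attempt: u32,
        included: Vec<u16>,
        expected: &[u16],
    ) -> Result<(), AdvanceError> {
        // Run every fallible check first, without touching `self.active`.
        if included.as_slice() != expected {
            return Err(AdvanceError::IncludedSetMismatch);
        }
        if included.iter().any(|p| self.quarantined.contains(p)) {
            return Err(AdvanceError::Quarantined);
        }
        // All checks passed: only now clear the previous attempt (stand-in for
        // clear_active_sign_round_for_attempt_transition) and install the new one.
        self.active = None;
        self.active = Some(ActiveRound { attempt, included });
        Ok(())
    }
}
```

With this ordering, an advance that fails the included-set check returns an error and leaves the previous attempt in place, which mirrors what the first regression test in this PR asserts.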
Tests
Two regression tests, both verified to fail against the pre-fix code:

- authorized_advance_failing_later_check_preserves_active_sign_round: an authorized advance that fails the included-set check leaves attempt 1 idempotently signable (pre-fix: ConsumedAttemptReplay).
- idempotent_sign_round_replay_persists_before_serving_after_persist_outage: a sign round whose persist fails does not serve shares on idempotent replay until the state-key provider recovers (pre-fix: served shares with the key down).

Full lib suite: 308 passing (cargo test --lib), cargo fmt --check clean. Design validated with Codex.

Found during the Codex review of #4005.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
🤖 Generated with Claude Code