Skip to content

fix(derivation): verify fill-gap region against L1 blob to heal local forks#1008

Open
curryxbo wants to merge 1 commit into
mainfrom
fix/derivation-fillgap-verify-fork
Open

fix(derivation): verify fill-gap region against L1 blob to heal local forks#1008
curryxbo wants to merge 1 commit into
mainfrom
fix/derivation-fillgap-verify-fork

Conversation

@curryxbo

@curryxbo curryxbo commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Problem

A node that diverges from canonical at its unsafe tip and then gets stuck below a committed batch tip never self-heals via L1 derivation — it advances a permanent "shadow chain" and P2P stays wedged on parent block not found. Only a data wipe + resync recovers it.

Root cause is in the local verify scenario dispatch:

  1. The node is stuck at fork height H; a batch is ~480 blocks, so H is almost always inside a batch, i.e. batch.lastBlockNumber > H.
  2. HeaderByNumber(batch.lastBlockNumber) returns nil (tail missing) → dispatch routes to scenario C (fill-gap) before any blob comparison runs. The blob-mismatch self-heal (scenario B, deriveForce(.., 0) full rewrite) is therefore never reached.
  3. Scenario C calls deriveForce(batchInfoFull, localLatest), which skips every block already present locally (Number <= skipNumber) and appends only the missing tail, anchoring it on the local head.
  4. The batch carries block content (txs + block contexts) but not parent hashes, so the skip is blind: the skipped blocks are the local fork, but fill-gap trusts them and builds the tail on top of the fork.

Result: geth keeps producing blocks with correct tx content but wrong parent/state hashes (a shadow chain), batch after batch. latestBatchIndex advances, block height climbs, but every canonical block from peers fails with parent block not found and the node never reconciles.

Fix

After fill-gap completes, the whole [firstBlock, lastBlock] range is present locally. So do not break out of the switch — fall through to the existing rebuildBlob verification:

  • It reconstructs the blob from the local blocks (including the ones fill-gap skipped) and compares against the L1 blob hashes.
  • A fork in the skipped region diverges the blob → triggers the existing deriveForce(.., 0) full rewrite, reorging the fork onto canonical.
  • When the local blocks were genuinely canonical (the normal "node merely behind" case), the blob matches and no rewrite happens — the fill-gap fast path is preserved.

This reuses the exact verification + self-heal machinery already used by scenario A/B; the diff is just "skip the early break and verify".

Why this is complete

The only blind spot of blob-content comparison is a fork that differs only in parent linkage with identical content — but any genuine fork's root block must differ in content (txs or a blob-encoded context field), otherwise it would be canonical. Since batches are processed in order, the root is caught and rewritten in its own batch, making every later batch's firstBlock-1 anchor canonical by induction.

Test plan

  • go build ./... in node/
  • go vet ./derivation/
  • On a node deliberately forked below the batch tip: confirm blob hash mismatch; triggering self-heal reorg fires on the fill-gap batch and the node reorgs back onto canonical (P2P recovers, no more parent block not found).
  • Regression: a node merely behind (honest fill-gap) still catches up with no spurious rewrite.

Made with Cursor

Summary by CodeRabbit

  • Bug Fixes
    • Improved local verification during sequencer-stop conditions so the system proceeds with blob and data reconciliation instead of exiting early.
    • Refined forced re-derivation to reconcile forward from the chain’s anchor, keeping only blocks that match both linkage and content, and rewriting from the first detected divergence.
    • Enhanced block-content matching for stricter, byte-accurate transaction comparisons (including normalized base-fee handling), improving detection of inconsistent or forked segments.

@curryxbo curryxbo requested a review from a team as a code owner June 25, 2026 07:22
@curryxbo curryxbo requested review from panos-xyz and removed request for a team June 25, 2026 07:22
@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Warning

Review limit reached

@curryxbo, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 59 minutes and 9 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f3747c06-81b5-42cf-9607-c1b668bd6da5

📥 Commits

Reviewing files that changed from the base of the PR and between 7fc178d and 9d3eb84.

📒 Files selected for processing (2)
  • node/derivation/batch_info.go
  • node/derivation/derivation.go
📝 Walkthrough

Walkthrough

Local-verify scenario C now reconciles through deriveForce instead of stopping after fillGap, and deriveForce now rewrites from the canonical anchor when block content or parent linkage diverges. New block-content comparison helpers support the reconcile path.

Changes

Local verify reconciliation

Layer / File(s) Summary
Scenario C reconcile flow
node/derivation/derivation.go
Scenario C comments and logs now describe reconciliation, and the reconcile/self-heal paths call deriveForce without a skip boundary.
Block content matching helpers
node/derivation/batch_info.go
Block comparison helpers now check headers, transaction bytes, and normalized base fee equality against SafeL2Data.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

validator

Suggested reviewers

  • twcctop
  • FletcherMan

Poem

A rabbit peered at blocks aligned,
Then found a mismatch of content and kind.
It hopped to the anchor, rewrote the lane,
And the little chain sang straight again.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: verifying the fill-gap derivation path against the L1 blob to heal local forks.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/derivation-fillgap-verify-fork

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@node/derivation/derivation.go`:
- Around line 416-421: The fallback around `rebuildBlob` in `derivation.go` is
returning too early on divergence-like sentinel errors, so it never reaches the
existing `deriveForce(..., 0)` self-heal path. Update the caller that handles
`rebuildBlob` results to detect the wrapped sentinel divergence cases (including
`ErrBatchVerifyDivergence` / `blob_count_mismatch`) and route them through the
same full-rewrite branch used after `fillGap`-detected mismatches. Keep the
existing self-heal behavior centralized in the `deriveForce` path so both
verification failures and `rebuildBlob` divergence errors are handled
identically.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 894f04f7-2b4f-46a8-806e-3a62e60d6ca7

📥 Commits

Reviewing files that changed from the base of the PR and between 4631b00 and d3a382a.

📒 Files selected for processing (1)
  • node/derivation/derivation.go

Comment thread node/derivation/derivation.go Outdated
curryxbo pushed a commit that referenced this pull request Jun 25, 2026
rebuildBlob signals a fork two ways: a per-blob hash mismatch (returned in
the rebuilt slice) and a wrapped ErrBatchVerifyDivergence error
(blob_count_mismatch). The caller only handled the former and returned
early on the latter, so a fork that changes the reconstructed blob count
aborted the pull instead of self-healing.

Extract the full-rewrite (deriveForce(.., 0)) into a shared selfHeal
closure and invoke it from both the hash-mismatch loop and the
ErrBatchVerifyDivergence error branch. Non-sentinel rebuildBlob errors
remain transient failures that must not trigger a reorg.

Addresses review feedback on #1008.

Co-authored-by: Cursor <cursoragent@cursor.com>
@curryxbo curryxbo force-pushed the fix/derivation-fillgap-verify-fork branch 2 times, most recently from 504fb79 to 7fc178d Compare June 25, 2026 12:47

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@node/derivation/derivation.go`:
- Around line 398-400: `deriveForce` is still using `d.ctx` and
`context.Background()` instead of the caller’s `derivationBlock` context, so
cancellation and deadlines are not propagated during quiesced work. Update the
`deriveForce` call sites in `derivation.go` to pass the active context from
`withReactorsQuiesced`/`derivationBlock`, and thread that context through
`deriveForce` itself and the related helpers it calls so all L2 reads/writes
respect cancellation promptly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6afbe52b-03cd-462b-860c-6ad6c646a345

📥 Commits

Reviewing files that changed from the base of the PR and between d3a382a and 7fc178d.

📒 Files selected for processing (2)
  • node/derivation/batch_info.go
  • node/derivation/derivation.go

Comment on lines 398 to +400
err = d.withReactorsQuiesced(ctx, batchInfo.batchIndex, func() error {
localLatest, err := d.l2Client.BlockNumber(ctx)
if err != nil {
return fmt.Errorf("read local latest: %w", err)
}
var derErr error
lastHeader, derErr = d.deriveForce(batchInfoFull, localLatest)
lastHeader, derErr = d.deriveForce(batchInfoFull)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Thread the caller context through deriveForce.

deriveForce can perform many L2 reads/writes while reactors are quiesced, but it uses d.ctx and context.Background() instead of the derivationBlock context. Cancellation won’t stop an in-flight reconcile promptly, and each rewritten block can wait up to 60s.

Proposed fix
- lastHeader, derErr = d.deriveForce(batchInfoFull)
+ lastHeader, derErr = d.deriveForce(ctx, batchInfoFull)
- lastHeader, derErr = d.deriveForce(batchInfoFull)
+ lastHeader, derErr = d.deriveForce(ctx, batchInfoFull)
-func (d *Derivation) deriveForce(rollupData *BatchInfo) (*eth.Header, error) {
+func (d *Derivation) deriveForce(ctx context.Context, rollupData *BatchInfo) (*eth.Header, error) {
@@
- anchor, err := d.l2Client.HeaderByNumber(d.ctx, big.NewInt(int64(anchorNum)))
+ anchor, err := d.l2Client.HeaderByNumber(ctx, big.NewInt(int64(anchorNum)))
@@
- local, lErr := d.l2Client.BlockByNumber(d.ctx, big.NewInt(int64(num)))
+ local, lErr := d.l2Client.BlockByNumber(ctx, big.NewInt(int64(num)))
@@
- ctx, cancel := context.WithTimeout(context.Background(), time.Duration(60)*time.Second)
+ callCtx, cancel := context.WithTimeout(ctx, time.Duration(60)*time.Second)
  defer cancel()
- return d.l2Client.NewSafeL2Block(ctx, &safeData)
+ return d.l2Client.NewSafeL2Block(callCtx, &safeData)

Also applies to: 442-444, 918-980

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@node/derivation/derivation.go` around lines 398 - 400, `deriveForce` is still
using `d.ctx` and `context.Background()` instead of the caller’s
`derivationBlock` context, so cancellation and deadlines are not propagated
during quiesced work. Update the `deriveForce` call sites in `derivation.go` to
pass the active context from `withReactorsQuiesced`/`derivationBlock`, and
thread that context through `deriveForce` itself and the related helpers it
calls so all L2 reads/writes respect cancellation promptly.

curryxbo pushed a commit that referenced this pull request Jun 25, 2026
…anch

Cherry-pick the PR #1008 fix (deriveForce verifies each local block against
the batch's content + canonical parent before keeping it, rewriting from the
first divergent/missing height) so the QA fork-injection node actually
exercises L1 self-heal. Scenario C now reconciles in one pass; the
REORG-TEST [2/3]/[3/3] markers are retained for log visibility.

Co-authored-by: Cursor <cursoragent@cursor.com>
@curryxbo curryxbo force-pushed the fix/derivation-fillgap-verify-fork branch 3 times, most recently from 1a011d4 to 149b664 Compare June 26, 2026 02:23
curryxbo pushed a commit that referenced this pull request Jun 26, 2026
…them (#1008)

Verify-driven deriveForce: walking from the canonical anchor
(firstBlockNumber-1), keep each local block while its content matches the
batch, then rewrite via NewSafeL2Block from the first divergent or missing
height onward. Replaces the old skipNumber fill-gap, which blindly trusted
local blocks and could grow a permanent shadow chain (correct tx content,
wrong parent/state hashes) that never self-healed.

This is the fix under test by the chaos hook in the preceding commit; it is
the same change as PR #1008.

Co-authored-by: Cursor <cursoragent@cursor.com>
curryxbo pushed a commit that referenced this pull request Jun 26, 2026
…them (#1008)

Verify-driven deriveForce: walking from the canonical anchor
(firstBlockNumber-1), keep each local block while its content matches the
batch, then rewrite via NewSafeL2Block from the first divergent or missing
height onward. Replaces the old skipNumber fill-gap, which blindly trusted
local blocks and could grow a permanent shadow chain (correct tx content,
wrong parent/state hashes) that never self-healed.

This is the fix under test by the chaos hook in the preceding commit; it is
the same change as PR #1008.

Co-authored-by: Cursor <cursoragent@cursor.com>
…them

A node that diverges from canonical at its unsafe tip and gets stuck below
a committed batch tip never self-heals via L1 derivation. A batch is ~480
blocks, so the fork height is almost always inside a batch
(batch.lastBlockNumber > fork height); HeaderByNumber(lastBlockNumber)
returns nil, routing every batch to scenario C (fill-gap) before any
content comparison runs. fill-gap then called deriveForce with skipNumber =
localLatest, which skipped every block already present locally and appended
only the missing tail, anchoring it on the local head. The batch carries
block content but not parent hashes, so that skip was blind: if the present
blocks were a fork, fill-gap built the tail on top of the fork and geth
kept advancing a shadow chain (correct tx content, wrong parent/state
hashes) while P2P stayed wedged on "parent block not found". Only a data
wipe + resync recovered it.

Fix: replace the skipNumber-driven deriveForce with a single verify-driven
walk. Starting from a known-canonical anchor (firstBlockNumber-1, the tip
the previous batch already verified), each height is KEPT only when the
local block both links to the running canonical parent AND its content
matches the batch (timestamp, gas limit, base fee, ordered txs); otherwise
the walk switches to rewrite mode via NewSafeL2Block for the rest of the
range. The parent-link check is what makes a keep safe — the batch has no
parent hashes, but canonical anchor + matching parent + matching content
implies deterministic execution reproduces the canonical block. The first
divergent or missing height flips the walk into rewrite mode, and because
every later local block then points at the now-stale parent, the parent
check keeps rewriting to the tip automatically.

This unifies the old scenario-B (self-heal, rewrite all) and scenario-C
(fill-gap, append tail) paths: the divergence point is discovered by
comparison instead of being supplied as a skip boundary, so a clean prefix
is skipped for free, a fork is reorged onto canonical, and a missing tail
is appended — without ever blindly trusting un-verified local blocks.
Scenario C now reconciles in one pass and advances the cursor instead of
deferring to a later poll.

Co-authored-by: Cursor <cursoragent@cursor.com>
@curryxbo curryxbo force-pushed the fix/derivation-fillgap-verify-fork branch from 149b664 to 9d3eb84 Compare June 26, 2026 03:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants