feat: add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead #3262

Draft
julienrbrt wants to merge 7 commits into main from julien/edge-case-p2p-stall
Conversation

@julienrbrt
Member

@julienrbrt julienrbrt commented Apr 16, 2026

Overview

Add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead

Early WIP; I do not like what the LLM has produced.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 16, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b14a33d8-d794-4d8e-8381-e6a290bb5019

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Contributor

github-actions Bot commented Apr 16, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Apr 22, 2026, 1:08 PM |

@julienrbrt julienrbrt changed the title from "feat: add automatic DA retriever walkback when P2P stalls and DA blocks" to "feat: add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead" Apr 16, 2026
@claude
Contributor

claude Bot commented Apr 16, 2026

Claude finished @julienrbrt's task in 3m 26s


Code Review — feat: add automatic DA retriever walkback when P2P stalls

Summary: The PR introduces a walkback mechanism so that when P2P stalls and the DA follower has already advanced well past the node's current height, the DA subscriber rewinds to re-fetch earlier DA heights. The core idea is sound, but there are several correctness bugs and architectural concerns worth addressing before merge.


Critical Issues

1. Potential nil-pointer panic on data-only DA events

walkbackCheck in syncer.go:552 calls events[0].Header.Height() without guarding for a nil Header. DAHeightEvent.Header is a *types.SignedHeader pointer that can legitimately be nil when the DA height contains only a data blob (no block header). Any data-only event will cause a panic at this line.

// syncer.go:551-557
minHeight := events[0].Header.Height()  // ← panics if Header == nil
for _, e := range events[1:] {
    if e.Header.Height() < minHeight {
        minHeight = e.Header.Height()
    }
}

Fix: skip events whose Header is nil when computing minHeight, so that only events with Header != nil are compared.
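One possible shape for the nil-safe scan is sketched below. The types are simplified, hypothetical stand-ins for types.SignedHeader and common.DAHeightEvent (only the fields needed here are modeled), and the helper name minHeaderHeight is an assumption, not something that exists in the PR.

```go
package main

import "fmt"

// SignedHeader and DAHeightEvent are reduced, hypothetical stand-ins for
// the real types; they exist only to make this sketch self-contained.
type SignedHeader struct{ height uint64 }

func (h *SignedHeader) Height() uint64 { return h.height }

type DAHeightEvent struct {
	Header *SignedHeader // nil for data-only DA heights
}

// minHeaderHeight returns the smallest header height among the events that
// carry a header. ok is false when no event has a header at all.
func minHeaderHeight(events []DAHeightEvent) (lowest uint64, ok bool) {
	for _, e := range events {
		if e.Header == nil {
			continue // data-only event: nothing to compare
		}
		if h := e.Header.Height(); !ok || h < lowest {
			lowest, ok = h, true
		}
	}
	return lowest, ok
}

func main() {
	events := []DAHeightEvent{
		{Header: nil}, // data-only event that would panic in the current code
		{Header: &SignedHeader{height: 12}},
		{Header: &SignedHeader{height: 9}},
	}
	lowest, ok := minHeaderHeight(events)
	fmt.Println(lowest, ok) // prints: 9 true
}
```

Returning an explicit ok flag lets the caller treat "no headers at this DA height" as its own case instead of reading a zero height as a real block height.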


2. Infinite rewind loop when walkback reaches the DA floor

walkbackCheck returns early (return 0) when daHeight <= s.daRetrieverHeight.Load() — the floor check — but does not clear walkbackActive. This creates an infinite loop:

  1. walkbackActive = true, DA processes daRetrieverHeight+1
  2. No gap-filling events found → return daRetrieverHeight
  3. Subscriber rewinds to daRetrieverHeight → processed → advances to daRetrieverHeight+1
  4. walkbackActive still true, empty events → rewound again → repeat forever

The syncer would spin at the floor, never progressing and spamming logs.

Fix: clear walkbackActive (and log a warning) when the floor is reached without resolving the gap.
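A minimal sketch of the floor handling, assuming a reduced model of the syncer state: the walkback struct, its field names, and the gapResolved parameter are hypothetical and only mirror the names used in this review. The real walkbackCheck has more inputs.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// walkback is a hypothetical, reduced model of the state discussed above.
type walkback struct {
	active atomic.Bool   // stands in for walkbackActive
	floor  atomic.Uint64 // stands in for daRetrieverHeight
}

// check returns the DA height to rewind to, or 0 to keep moving forward.
// gapResolved would come from the real gap-detection logic.
func (w *walkback) check(daHeight uint64, gapResolved bool) uint64 {
	if !w.active.Load() {
		return 0
	}
	if gapResolved {
		w.active.Store(false)
		return 0
	}
	if daHeight <= w.floor.Load() {
		// Floor reached without closing the gap: stop the walkback so the
		// subscriber is not rewound to the same height on every iteration.
		w.active.Store(false)
		fmt.Println("warn: walkback reached DA floor without resolving gap")
		return 0
	}
	return daHeight - 1
}

func main() {
	var w walkback
	w.floor.Store(100)
	w.active.Store(true)

	fmt.Println(w.check(102, false)) // prints: 101
	fmt.Println(w.check(101, false)) // prints: 100
	fmt.Println(w.check(100, false)) // warns, then prints: 0
	fmt.Println(w.check(101, false)) // prints: 0 (walkback is no longer active)
}
```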


3. Duplicate event processing after walkback

When HandleCatchup(ctx, local) succeeds, events have already been piped to the event sink inside fetchAndPipeHeight. walkbackChecker then returns local-1, which triggers RewindTo(local-1). On the next iteration the subscriber processes local-1 and then local again, piping the same events a second time. Depending on how downstream processing handles duplicates, this could cause double-application of blocks.

This needs one of two things: confirmation that event piping is idempotent (cite the deduplication path), or a walkback that rewinds without re-piping the already-processed height.
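If no deduplication path exists today, one option is a thin idempotent wrapper in front of the event sink. The sketch below is illustrative only: eventKey, dedupSink, and Pipe are hypothetical names, the key fields are an assumption about what identifies an event, and the real sink interface may look different.

```go
package main

import "fmt"

// eventKey identifies an event by DA height and block height.
// Both the key and the sink below are hypothetical.
type eventKey struct {
	daHeight    uint64
	blockHeight uint64
}

// dedupSink forwards each key at most once. It is not safe for concurrent
// use and never prunes its map; a real version would need both.
type dedupSink struct {
	seen    map[eventKey]struct{}
	forward func(eventKey)
}

func newDedupSink(forward func(eventKey)) *dedupSink {
	return &dedupSink{seen: make(map[eventKey]struct{}), forward: forward}
}

// Pipe forwards k and returns true, or returns false if k was already piped.
func (s *dedupSink) Pipe(k eventKey) bool {
	if _, dup := s.seen[k]; dup {
		return false
	}
	s.seen[k] = struct{}{}
	s.forward(k)
	return true
}

func main() {
	forwarded := 0
	sink := newDedupSink(func(eventKey) { forwarded++ })

	sink.Pipe(eventKey{daHeight: 41, blockHeight: 7})
	sink.Pipe(eventKey{daHeight: 42, blockHeight: 8})
	sink.Pipe(eventKey{daHeight: 42, blockHeight: 8}) // replayed after a rewind, dropped

	fmt.Println("forwarded:", forwarded) // prints: forwarded: 2
}
```

Entries below the last applied block height could be pruned to keep the map bounded; that bookkeeping is left out here.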


Significant Issues

4. Unused startDAHeight field in daFollower

Added in da_follower.go and stored in NewDAFollower, but never read anywhere in daFollower. It appears the walkback minimum is instead enforced via s.daRetrieverHeight in the Syncer-level walkbackCheck. The field is dead code and should be removed.

5. Walkback log fires on every step — extremely verbose

The Info log inside if needsWalkback { ... } in walkbackCheck (syncer.go:566-572) is emitted on every single DA height during a walkback. A gap of 500 heights produces 500 log lines saying the same thing. The log should only fire when the walkback activates (i.e., when walkbackActive transitions from false → true).
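Logging only on the false → true transition can be done with a compare-and-swap, assuming walkbackActive is (or can become) an atomic.Bool. The walkbackState type and activate method below are hypothetical names used for illustration.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// walkbackState is a hypothetical holder for the walkbackActive flag.
type walkbackState struct {
	active atomic.Bool
}

// activate marks the walkback as active and reports whether this call
// performed the false -> true transition.
func (w *walkbackState) activate() bool {
	return w.active.CompareAndSwap(false, true)
}

func main() {
	var w walkbackState
	logged := 0

	// Simulate a walkback across 500 DA heights.
	for da := uint64(500); da > 0; da-- {
		if w.activate() {
			logged++
			fmt.Printf("walkback activated at DA height %d\n", da)
		}
	}
	fmt.Println("log lines:", logged) // prints: log lines: 1
}
```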

6. Leaky abstraction: events bubbled up through HandleCatchup just for walkbackChecker

The SubscriberHandler interface now returns []common.DAHeightEvent from HandleCatchup solely so the walkbackChecker callback can inspect them. The comment in subscriber.go:28-32 admits this explicitly ("The subscriber does not interpret them"). This couples a low-level primitive to the Syncer's gap-detection logic and complicates any future SubscriberHandler implementations. An alternative design would be for the walkbackChecker to query the node height independently and not need the events at all — it could just compare DA height against node height on each tick. The events are only needed to check minHeight, which could be obtained from the store or a separate signal.

7. _ = events in test leaves return value untested

da_follower_test.go:183:

events, err := follower.HandleCatchup(t.Context(), s.daHeight)
// ...
_ = events  // ← return value completely ignored

The signature change is the whole point of this PR; the test should assert that returned events match the events emitted by the mock retriever.


Minor Issues

8. mockHandler.On(...).Return([]common.DAHeightEvent(nil), nil) — awkward mock return value

subscriber_test.go:51-52 returns []common.DAHeightEvent(nil) (typed nil). Since the mock returns via args.Get(0).([]common.DAHeightEvent), passing a plain nil would panic — so the typed nil is necessary, but it reads poorly. Consider a local noEvents []common.DAHeightEvent sentinel variable for clarity.

9. s.ctx use in walkbackCheck

walkbackCheck uses s.ctx (the syncer's root context) for the store Height call rather than receiving a context from the caller chain. This is technically fine since walkbackChecker is a callback without a context parameter, but it ties the store call's lifecycle to the syncer root rather than the subscriber's catchup context. The callback signature could optionally carry a context.

10. One-at-a-time walkback performance

Each walkback step rewinds exactly one DA height. For large gaps this triggers O(gap) catchup iterations. A smarter approach would be to estimate the target DA height directly by bisecting or using the DA height stored alongside the last known block height — avoiding the linear scan entirely.
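As a rough illustration of the bisection idea, the sketch below uses sort.Search from the standard library. It assumes a monotone predicate reaches(da) that answers "does DA height da contain a block at or above the target height?". That predicate, and the helper name findRewindTarget, are hypothetical. In practice each probe could cost a DA fetch, and monotonicity may not hold if blocks land on DA out of order.

```go
package main

import (
	"fmt"
	"sort"
)

// findRewindTarget returns the smallest DA height in [lo, hi] for which
// reaches(da) is true, or hi+1 if there is none. It assumes reaches is
// monotone: once true at some height, it stays true for all higher ones.
func findRewindTarget(lo, hi uint64, reaches func(da uint64) bool) uint64 {
	if hi < lo {
		return hi + 1
	}
	n := int(hi - lo + 1)
	i := sort.Search(n, func(i int) bool { return reaches(lo + uint64(i)) })
	return lo + uint64(i)
}

func main() {
	// Highest block height included at DA heights 100..107 (made-up data).
	highest := []uint64{10, 10, 12, 15, 15, 20, 22, 30}
	const lo, hi = uint64(100), uint64(107)
	target := uint64(15)

	da := findRewindTarget(lo, hi, func(da uint64) bool {
		return highest[da-lo] >= target
	})
	fmt.Println(da) // prints: 103
}
```

This needs O(log gap) probes instead of O(gap) rewinds, at the cost of depending on an index or lookup that the current code may not have.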


Test Coverage Assessment

The new TestSyncer_walkbackCheck and TestSubscriber_RewindTo tests are well-structured and cover the happy path. Missing coverage:


Summary

The PR author's note ("I do not like what the LLM has produced") is well-founded. The core concept is valid, but the current implementation has correctness issues (#1, #2, #3) that need fixing before this is production-safe. I'd recommend:

  1. Fix the three critical bugs (nil panic, infinite floor loop, duplicate events)
  2. Remove startDAHeight dead code
  3. Reconsider whether events need to be bubbled up through HandleCatchup at all — a simpler design might avoid the architectural awkwardness

@codecov

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 67.39130% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.87%. Comparing base (3efcf93) to head (d585f08).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| block/internal/syncing/syncer.go | 60.86% | 16 Missing and 2 partials ⚠️ |
| block/internal/syncing/da_follower.go | 78.26% | 4 Missing and 1 partial ⚠️ |
| block/internal/da/subscriber.go | 76.47% | 3 Missing and 1 partial ⚠️ |
| block/internal/da/async_block_retriever.go | 50.00% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3262      +/-   ##
==========================================
+ Coverage   61.84%   61.87%   +0.03%     
==========================================
  Files         122      122              
  Lines       16241    16299      +58     
==========================================
+ Hits        10044    10085      +41     
- Misses       5312     5325      +13     
- Partials      885      889       +4     
| Flag | Coverage Δ |
| --- | --- |
| combined | 61.87% <67.39%> (+0.03%) ⬆️ |

