Skip to content

feat(blockchain): cap aggregation job children to 2 (binary merge)#482

Open
MegaRedHand wants to merge 3 commits into
mainfrom
limit-aggregation-children-to-2
Open

feat(blockchain): cap aggregation job children to 2 (binary merge)#482
MegaRedHand wants to merge 3 commits into
mainfrom
limit-aggregation-children-to-2

Conversation

@MegaRedHand

Copy link
Copy Markdown
Collaborator

Summary

  • select_proofs_greedily (committee-signature aggregation worker, interval 2) picked every existing proof that added new validator coverage, with no upper bound on how many became children of one aggregation job.
  • Caps the combined new+known selection at 2 children per job, forcing pairwise/binary merges instead of N-way merges.
  • Proofs left uncovered by the cap aren't dropped — they remain in the new/known payload buffers and get merged in a subsequent aggregation round.

Why

Uncapped child counts have previously driven debug builds to stack-overflow in leanVM's rec_aggregation, and an unbounded N-way merge approaches leanVM's own hard cap (MAX_RECURSIONS = 16), beyond which a group fails to aggregate outright. Bounding to 2 keeps recursive verification cost per round small and predictable.

Scoped to the interval-2 aggregation worker (crates/blockchain/src/aggregation.rs) only; block-building compaction and reaggregate.rs have their own greedy-selection sites left untouched.

Test plan

  • cargo test -p ethlambda-blockchain --lib — 42 passed (3 new tests covering: cap within one set, cap across new+known combined, early-stop when no new coverage)
  • cargo fmt --all -- --check
  • cargo clippy -p ethlambda-blockchain --all-targets -- -D warnings

select_proofs_greedily picked as many existing proofs as fit the
greedy coverage criterion, with no upper bound. Uncapped child counts
have previously driven debug builds to stack-overflow in leanVM's
rec_aggregation, and an unbounded N-way merge isn't far from leanVM's
own hard cap (MAX_RECURSIONS = 16), at which point a group would fail
to aggregate outright.

Cap each aggregation job at 2 children, forcing pairwise (binary)
merges. Proofs left uncovered by the cap stay in the new/known buffers
and get merged in a later aggregation round instead of being dropped.
@github-actions

Copy link
Copy Markdown

🤖 Kimi Code Review

Overall Assessment: This is a well-executed safety fix that prevents stack overflow in recursive verification by capping aggregation arity to binary merges. The implementation is correct and the test coverage is thorough.

Minor Suggestions

crates/blockchain/src/aggregation.rs

  • Lines 324-330: Consider adding a debug_assert!(MAX_AGGREGATION_CHILDREN >= 2, "binary merge requires at least 2 children") near the constant definition to prevent accidental misconfiguration if this value is ever made configurable.

  • Lines 489-504 (test helper): The proof_for_validator helper uses vid as usize + 1 for the bit vector length. While validator indices in consensus are bounded by the active validator set size (well within usize limits), consider saturating or asserting on vid to avoid theoretical overflow on 32-bit platforms during tests:

    let len = (vid as usize).checked_add(1).expect("validator index too large");
  • Line 350: The while condition selected.len() < MAX_AGGREGATION_CHILDREN correctly implements the cap. However, since MAX_AGGREGATION_CHILDREN is a small constant (2), you could make the bound explicit in the type system using smallvec or ArrayVec to avoid heap allocations for selected, though this is purely an optimization and not necessary for correctness.

Consensus Safety Check

  • Line 327: The comment notes that uncapped greedy selection previously caused stack overflows in rec_aggregation. Confirm that the binary merge constraint (max 2 children) produces a verification tree depth of O(log n), which is safe for the leanVM recursion limit. The mathematical associativity of BLS aggregation ensures that binary merging produces the same final signature as N-way merging, so this does not affect consensus validity.

Code Quality

  • Lines 487-537: The unit tests effectively verify the capping behavior across the three critical cases (within one set, across both sets, and early termination). The test naming is descriptive.

Conclusion: LGTM. The change correctly bounds recursion depth while preserving liveness (uncovered proofs remain in buffers for subsequent rounds) and safety (aggregation correctness is maintained).


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

Copy link
Copy Markdown

🤖 Codex Code Review

Findings

  1. The new “later round will pick them up” assumption is not true for known-only leftovers, so the cap can strand partial merges permanently. In aggregation.rs, pure recursive merge work is only scheduled from new_payload_keys(). After a capped round, the merged proof and any leftover proof are promoted from new to known at store.rs. On the next interval, if there are no fresh gossip sigs and no fresh new proof for that data_root, snapshot_aggregation_inputs returns no job, so the remaining known proofs never get another merge pass. With 3 disjoint proofs, this regresses from “fully converge in one round” to “can stop forever at two proofs”.

  2. The cap is applied before considering known_proofs, which can discard the highest-coverage reuse candidates and force unnecessary XMSS aggregation from scratch. In aggregation.rs, the loop fully consumes the 2-child budget from new_proofs before looking at known_proofs. A case like new = [{0}, {1}], known = [{0..63}] now selects the two tiny new proofs and ignores the large existing proof entirely. That is a performance regression, and on a tight interval-2 budget it also increases the chance of missing publication windows.

Notes

I did not see a direct memory-safety or consensus-validation bug in the narrow diff itself, but Item 1 is a real aggregation-progress regression.

I could not run cargo test here because the sandboxed environment blocked the required Rust toolchain/network setup.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@github-actions

Copy link
Copy Markdown

🤖 Claude Code Review

Good, confirms test helpers are valid. I now have enough context to write the review.

Review: cap aggregation job children to 2 (binary merge)

Overall: Sound, narrowly-scoped fix with good test coverage. The core logic in select_proofs_greedily (crates/blockchain/src/aggregation.rs:340-381) correctly caps the combined selection across new_proofs/known_proofs at MAX_AGGREGATION_CHILDREN, and the early break at line 375-377 is a legitimate micro-optimization (avoids materializing a remaining Vec for known_proofs once the cap is already hit), not dead code.

A few things worth confirming/calling out:

  1. No buffer leak from capping — I verified that proofs left unselected this round aren't orphaned: PayloadBuffer::push (crates/storage/src/store.rs:172-218) already prunes any existing proof that becomes a strict subset of an incoming one. Since a newly-aggregated proof is always a superset of its (≤2) children's participants, those children get auto-evicted on insert. So repeated binary merges converge correctly over subsequent rounds without buffer growth — good, this wasn't obviously guaranteed from the diff alone.

  2. Residual unbounded merges elsewhere (acknowledged scope limit, but worth tracking as a follow-up) — crates/blockchain/src/reaggregate.rs:146-175 still builds an uncapped children vec from candidate.local_partials before calling aggregate_proofs, and block-building compaction has its own greedy-selection site too. If the stack-overflow/MAX_RECURSIONS=16 risk this PR addresses is real for interval-2 aggregation, the same N-way recursive merge risk exists on the reaggregation path triggered by importing a block with many local partials. The PR description calls this out explicitly as out-of-scope, which is reasonable for a minimal fix, but it'd be good to confirm there's a tracked follow-up rather than this being forgotten.

  3. Liveness/latency tradeoff, not a bug — capping to binary merges means full coverage of heavily-fragmented proof sets (e.g., after a partition or catch-up) now takes O(log n) interval-2 rounds instead of one. Worth confirming this doesn't risk attestations missing the interval-4 promotion window under adversarial fragmentation (many small proofs for the same data root), even though each individual round is bounded and cheap. No action needed unless devnet testing shows multi-round convergence interacting badly with the 750ms per-session deadline.

  4. Tests — the three new unit tests correctly target the meaningful cases (intra-set cap, cross-set combined cap, early-stop on no new coverage). No notes.

  5. Style — doc comment on MAX_AGGREGATION_CHILDREN is clear and explains the why (debug stack overflow, leanVM MAX_RECURSIONS), matching the project's convention of justifying non-obvious constants.

No correctness, security, or memory-safety issues found in the diff itself. The change is a safe, well-justified narrowing of existing behavior.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@greptile-apps

greptile-apps Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR changes interval-2 proof aggregation to use binary merges. The main changes are:

  • Adds a two-child cap for aggregation jobs.
  • Stops greedy proof selection once the combined cap is reached.
  • Adds unit tests for capped selection behavior.

Confidence Score: 5/5

The changed flow looks mergeable after a small cleanup to proof selection ordering.

  • Unselected proofs remain buffered for later aggregation rounds.
  • The new test helper uses symbols and call patterns already used in this crate.
  • Mixed new and known proof sets can choose lower-coverage children than needed when new proofs fill the cap first.

crates/blockchain/src/aggregation.rs

Important Files Changed

Filename Overview
crates/blockchain/src/aggregation.rs Adds the binary child cap and tests it, with one ordering issue where new proofs can fill the cap before better known proofs are considered.
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
crates/blockchain/src/aggregation.rs:347
**Priority Bucket Hides Better Proofs**

When `new_proofs` contains two small disjoint proofs, this loop reaches the child cap before `known_proofs` is examined. A known proof for the same root that covers many more validators is skipped, so `build_job` receives a smaller `covered` set and can produce a heavier, lower-coverage aggregate even though the same two-child limit could have reused the better proof.

Reviews (1): Last reviewed commit: "feat(blockchain): cap aggregation job ch..." | Re-trigger Greptile

@@ -337,7 +347,7 @@ fn select_proofs_greedily(
for proof_set in [new_proofs, known_proofs] {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Priority Bucket Hides Better Proofs

When new_proofs contains two small disjoint proofs, this loop reaches the child cap before known_proofs is examined. A known proof for the same root that covers many more validators is skipped, so build_job receives a smaller covered set and can produce a heavier, lower-coverage aggregate even though the same two-child limit could have reused the better proof.

Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/blockchain/src/aggregation.rs
Line: 347

Comment:
**Priority Bucket Hides Better Proofs**

When `new_proofs` contains two small disjoint proofs, this loop reaches the child cap before `known_proofs` is examined. A known proof for the same root that covers many more validators is skipped, so `build_job` receives a smaller `covered` set and can produce a heavier, lower-coverage aggregate even though the same two-child limit could have reused the better proof.

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant