core/state/snapshot: fix generator shutdown race (backport ethereum/go-ethereum#33540)#286
Conversation
…3540) The snapshot generator goroutine parked waiting for a genAbort signal even after generation completed, and diskLayer.Release() never sent it. On shutdown the goroutine lingered and could access the trie database after it was closed, causing iterator failures. Replace the genAbort request/reply channel with a cancel/done broadcast model and an idempotent, blocking stopGeneration(); Release() now stops the generator. Also converts Tree.Disable()/Tree.Rebuild() to stopGeneration() (upstream had already done this; libevm predated that refactor). Backport of ethereum#33540.
|
I'll wait for @JonathanOppenheimer's review as original author before doing mine so it's in a stable position. I think we should be tracking this in #128 as it will definitely result in conflicts if we don't revert it before the merge. In which version was this introduced into geth? |
|
It was merged 5 days ago in ethereum@bc1967f so not included in a version yet. Given the geth release cadence I'd expect a new release will be about with this fix by the time we merge, but even if not, we could merge just that PR cleanly. Once we merge this, I'll be sure to add the maintenance notes to #128. |
It'll be in 1.17.4: https://github.com/ethereum/go-ethereum/milestone/201, and yeah agree to add it to the list! |
Fix typo in comment regarding goroutine snapshot. Signed-off-by: Rahul Muttineni <rahul.muttineni@avalabs.org>
|
@ARR4N it's ready for review. I kept the Jonathan's comments unresolved so that it's easy for you to review as well. |
alarso16
left a comment
There was a problem hiding this comment.
This diff looks comparable, and I verified this fixed my issue in ava-labs/avalanchego#5482
ARR4N
left a comment
There was a problem hiding this comment.
Seeing as this has already had so many eyes on it, including upstream approval, consider my comments just as questions for my own understanding.
Why this should be merged
This is a backport of the PR @JonathanOppenheimer submitted (and got merged) to upstream, so mostly should be pretty similar.
How this works
See the upstream PR. The easiest way to review this is to open the upstream PR and look at the differences side by side to see what is different. You'll notice that
core/state/snapshot/snapshot.gois different in this backport in that there are two additional replacements of aborting being replaced withstopGeneration. Everything else is more or less the same.How this was tested
The test was backported as well.