Fix cloud-wal-restore on compressed WAL + sibling backup label#1190
Closed
adam8157 wants to merge 2 commits into
Closed
Fix cloud-wal-restore on compressed WAL + sibling backup label#1190adam8157 wants to merge 2 commits into
adam8157 wants to merge 2 commits into
Conversation
Commit 5625881 tightened CloudWalDownloader._get_wals_to_download so that hitting an invalid ``is_requested_wal`` entry exits the iteration immediately, in order to fail fast when a WAL was requested but only its ``<wal>.<offset>.backup`` label exists in the bucket. That logic used ``path.startswith(requested_wal_path)`` to decide ``is_requested_wal``, which also matches the related ``.backup`` label file. With compression enabled, both files live in the bucket as ``<wal>.gz`` and ``<wal>.<offset>.backup.gz``, and the label sorts before the WAL in lexicographic order (``.`` 0x2e + ``0`` 0x30 vs ``.`` 0x2e + ``g`` 0x67). So when recovery requests the start-checkpoint WAL of a compressed backup, the iteration hits the ``.backup.gz`` file first, ``_validate_wal_path`` rejects it as a backup file, and the new ``break`` exits before ever reaching the real ``.gz`` WAL — leaving Postgres unable to fetch the start-checkpoint WAL and recovery to abort with ``FATAL: could not locate required checkpoint record``. Match the requested WAL by comparing the basename (after stripping any compression suffix) against ``wal_name``, so a sibling ``.backup`` file no longer trips the ``is_requested_wal`` branch and is harmlessly skipped via ``_validate_wal_path``. The invariant from 5625881 — "when only the ``.backup`` label exists and the actual WAL is missing, return an empty list rather than unrelated subsequent WALs" — is still preserved, but enforced after the iteration: a ``found_requested`` flag tracks whether the requested WAL was actually located, and the method returns ``[]`` if not. This keeps the prefetch path simple while still failing fast for the caller. References: BAR-1305. Signed-off-by: Adam Lee <adam8157@gmail.com>
Cover the regression scenario fixed by the previous commit: when the bucket holds both ``<wal>.gz`` and ``<wal>.<offset>.backup.gz`` for the requested WAL, the sorted iteration must walk past the backup-label entry and return the real compressed WAL — not exit early on a ``startswith`` match against the label file. The existing ``test_get_wals_to_download_exits_early_when_requested_wal_is_invalid`` test continues to cover the complementary case: when only the ``.backup`` label exists and the actual WAL is missing, the method still returns an empty list. References: BAR-1305. Signed-off-by: Adam Lee <adam8157@gmail.com>
f80b5b9 to
6da99b2
Compare
Contributor
Author
|
Somehow these two commits are already merged, closing |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
CloudWalDownloader._get_wals_to_downloadmatched the requested WAL bypath.startswith(requested_wal_path), which also matched the sibling<wal>.<offset>.backuplabel file. Combined with the early-break introduced in 5625881, recovery against a compressed backup would exit the iteration on the.backup.gzentry (which sorts before.gzlexicographically) and never reach the real WAL —FATAL: could not locate required checkpoint record.<wal>and<wal>.partial. Sibling.backupfiles are now harmlessly skipped via_validate_wal_path. The "requested WAL missing → return[]" invariant from 5625881 is preserved via afound_requestedflag enforced after the loop.References: BAR-1305.