[python] Fix limit push-down discarding non-raw_convertible splits #7742

Open

TheR1sing3un wants to merge 8 commits into apache:master from
TheR1sing3un:py-fix-limit-pushdown-non-raw

Conversation

@TheR1sing3un (Member) commented Apr 29, 2026

How

_apply_push_down_limit now mirrors Java: it short-circuits when the
predicate references any non-partition column; otherwise it accumulates
split.merged_row_count() and stops once the limit is reached. Splits
with an unknown merged count fall through to the reader unchanged.
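
A condensed sketch of the resulting loop, assembled from the commit
messages below (helper and attribute names match those used elsewhere
in this PR; the surrounding scanner class is elided):

    def _apply_push_down_limit(self, splits):
        # Mirror of Java's early return: no limit, or a predicate that
        # touches a non-partition column, disables the push-down.
        if self.limit is None or self._has_non_partition_filter():
            return splits
        limited_splits = []
        scanned_row_count = 0
        for split in splits:
            merged = split.merged_row_count()
            if merged is not None:
                limited_splits.append(split)
                scanned_row_count += merged
                if scanned_row_count >= self.limit:
                    return limited_splits
        return splits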

Tests

ApplyPushDownLimitUnitTest drives _apply_push_down_limit with
synthetic splits.

Compatibility

No public API change. No file format change.

@XiaoHongbo-Hope (Contributor)

Could you please double-check the PR description and the change? I ran both newly added tests on master,
and they pass there as well, so they don't seem to reproduce the described pre-fix issue.

@TheR1sing3un force-pushed the py-fix-limit-pushdown-non-raw branch from 9b4c40a to 3b7c748 on April 29, 2026 at 16:31
@TheR1sing3un (Member, Author)

@XiaoHongbo-Hope you're right — the two earlier tests didn't actually distinguish the buggy and fixed implementations. Both used inputs (same-key-twice on a single bucket) where every split ended up non-raw_convertible, which means the pre-fix loop body never ran and the fallback return splits returned everything anyway. Thanks for catching it.

I've replaced them with a single, deterministic reproducer that does exercise the bug:

  • PK table partitioned on dt, bucket=1.
  • Partition p1 — two overlapping writes on the same PK → non-raw_convertible split.
  • Partition p2 — one write → raw_convertible split with row_count=1.

PrimaryKeyTableSplitGenerator walks partitions in order, so the plan is [non-raw (p1), raw (p2)]. With with_limit(1) the pre-fix loop skips the non-raw split, then immediately early-returns after the raw one — limited_splits=[raw], p1's data is silently dropped.
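
For reference, a hypothetical condensed shape of that test (write_rows
here stands in for the fixture's write plumbing; the scan/read calls
follow pypaimon's ReadBuilder API):

    def test_limit_drops_non_raw_split_after_raw_budget_is_met(self):
        # p1: two overlapping writes on the same PK -> non-raw split.
        write_rows(table, [('p1', 1, 'a')])
        write_rows(table, [('p1', 1, 'b')])
        # p2: a single write -> raw_convertible split, row_count=1.
        write_rows(table, [('p2', 2, 'c')])
        read_builder = table.new_read_builder().with_limit(1)
        splits = read_builder.new_scan().plan().splits()
        actual = read_builder.new_read().to_arrow(splits).num_rows
        self.assertEqual(2, actual)  # pre-fix: 1 != 2, p1 silently dropped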

End-to-end check:

$ git checkout origin/master -- pypaimon/read/scanner/file_scanner.py
$ pytest ...test_limit_drops_non_raw_split_after_raw_budget_is_met
FAILED ... AssertionError: 1 != 2
$ git checkout HEAD -- pypaimon/read/scanner/file_scanner.py
$ pytest ...test_limit_drops_non_raw_split_after_raw_budget_is_met
1 passed

Force-pushed 3b7c7484b with the new reproducer and an updated commit message / PR description that walks through why the bug requires [non-raw, raw] ordering. PTAL.

…row_count for budget

Two divergences from Java's DataTableBatchScan.applyPushDownLimit():

1) Non-raw_convertible splits were skipped entirely by the loop body
   — they never entered ``limited_splits``. As a consequence, when a
   non-raw split appeared BEFORE a raw split that meets the limit, the
   early-return omitted the non-raw split from the plan altogether.
   Java unconditionally adds every visited split.

2) The accumulator used ``split.row_count`` (file-level pre-DV upper
   bound) where Java uses ``split.partialMergedRowCount()`` — file
   row_count *minus* any deletion-vector cardinality already recorded
   in the manifest. Python has the same value via
   ``DataSplit.merged_row_count()``, but ``_apply_push_down_limit``
   wasn't using it, so on DV-aware raw splits the accumulator
   over-counted and the early-return fired before the reader could
   actually produce ``limit`` rows.

The two divergences interact. With ``[non-raw, raw]`` and a tight
limit, (1) silently drops the non-raw partition's data. With
``[raw_with_DV, non-raw, ...]`` and a limit between the post-DV and
pre-DV row counts, (2) makes the loop early-return on the DV split
alone, leaving the reader with fewer rows than it could otherwise
produce by also draining the trailing non-raw splits.
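
For contrast, a reconstruction of the pre-fix (master) loop from the
two divergences above:

  for split in splits:
      if split.raw_convertible:
          limited_splits.append(split)          # (1) non-raw never added
          scanned_row_count += split.row_count  # (2) pre-DV upper bound
          if scanned_row_count >= self.limit:
              return limited_splits
  return splits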

Fix:

  for split in splits:
      limited_splits.append(split)        # add unconditionally
      if split.raw_convertible:
          merged = split.merged_row_count()
          scanned_row_count += merged if merged is not None else split.row_count
          if scanned_row_count >= self.limit:
              return limited_splits
  return splits

The ``merged is not None`` fallback to ``split.row_count`` keeps the
previous behaviour for layouts where the merged count cannot be
derived from the manifest (older snapshots, some data-evolution
shapes); using the pre-DV upper bound there is still strictly better
than the alternative of skipping that split's contribution to the
budget.

Tests:

  test_limit_drops_non_raw_split_after_raw_budget_is_met (new):
    deterministic ``[non-raw (p1), raw (p2)]`` plan. Pre-fix (master)
    fails with ``1 != 2``: ``limited_splits=[raw]``, p1's data is
    silently dropped. Post-fix returns both splits.

  ApplyPushDownLimitUnitTest (new): synthetic-split unit tests for the
    accumulator, since pypaimon's writer doesn't compact L0 → L1+ and
    the DV-enabled PK read path skips L0, so a true DV-aware
    raw_convertible split is hard to produce from a pure-Python end-
    to-end fixture. Cases:

      * test_dv_aware_accumulator_uses_merged_row_count —
        ``[raw(row_count=10, merged=4), non-raw, non-raw]`` + limit=5.
        Pre-fix: early-returns after the raw split → 1 split.
        Post-fix: 4 < 5 keeps walking → 3 splits.
      * test_accumulator_falls_back_to_row_count_when_merged_unavailable
        — guards the ``merged is None`` fallback path.
      * test_no_raw_splits_falls_through_to_full_list — all-non-raw
        falls through to the loop's terminal ``return splits``.
      * test_empty_splits_returns_empty / test_no_limit_returns_input_unchanged
        — boundary conditions.
@TheR1sing3un force-pushed the py-fix-limit-pushdown-non-raw branch from 3b7c748 to a57200e on April 29, 2026 at 17:20
@TheR1sing3un (Member, Author)

Update — there was a second divergence from Java's implementation that the previous version of this PR didn't address.

Java's applyPushDownLimit accumulates split.partialMergedRowCount() (file row_count minus any deletion-vector cardinality already recorded in the manifest), while Python was using split.row_count (the pre-DV upper bound). On a DV-aware raw split, the pre-DV accumulator over-counts and the early-return fires before the reader can produce limit rows. Even with the "always append" fix from the previous round, a plan like [raw_with_DV(row_count=10, merged=4), non-raw, non-raw] with limit=5 would still early-return on the raw split and leave the reader with only 4 rows, despite the trailing non-raw splits being able to provide more.

DataSplit.merged_row_count() already exposes the post-DV value, so the fix is one line — accumulate merged_row_count() instead of row_count, with a fallback to row_count when the merged count is unavailable (older snapshots / certain data-evolution shapes).

Force-pushed a57200ebb:

  • _apply_push_down_limit now uses merged_row_count() if not None else row_count as the budget accumulator.
  • New unit tests in ApplyPushDownLimitUnitTest drive the accumulator directly with synthetic splits — pypaimon's writer doesn't promote L0 → L1+, and the DV-enabled PK read path skips L0 files, so producing a true DV-aware raw split end-to-end isn't really feasible from a pure-Python fixture. The synthetic-split tests pin down the accumulator semantics without depending on storage layout.
  • Both regression tests (test_limit_drops_non_raw_split_after_raw_budget_is_met and test_dv_aware_accumulator_uses_merged_row_count) fail on origin/master and pass post-fix.
  • PR description rewritten to walk through both divergences and how they interact.

PTAL.

Review thread on pypaimon/read/scanner/file_scanner.py (diff excerpt):

    if split.raw_convertible:
        limited_splits.append(split)
        scanned_row_count += split.row_count
        merged = split.merged_row_count()
@XiaoHongbo-Hope (Contributor) commented Apr 30, 2026

Could you double-check this against current Java semantics? Or is the PR description not the latest? It says this mirrors Java.

@TheR1sing3un (Member, Author)

> Could you double-check this against current Java semantics? Or is the PR description not the latest? It says this mirrors Java.

Hi, how about now? Thanks for the reminder!

Review feedback (XiaoHongbo-Hope on PR apache#7742): the previous fix
accumulated ``merged_row_count() if not None else split.row_count``
under a ``raw_convertible`` gate while unconditionally adding every
split to ``limited_splits``. That was a Python-only behavior and
diverged from Java's ``DataTableBatchScan.applyPushDownLimit`` despite
the PR description claiming "mirrors Java line-for-line". Three
concrete divergences:

  1. Gate: Java uses ``mergedRowCount.isPresent()``, we used
     ``raw_convertible``.
  2. Append timing: Java only adds splits whose merged count is known;
     we added every split regardless.
  3. Fallback: Java has none; we fell back to ``split.row_count`` when
     ``merged_row_count()`` returned None.

The single behavioral fix this PR needs to deliver is the accumulator
source — replacing ``split.row_count`` (DV-blind, over-counts when DV
is on) with ``merged_row_count()`` (DV-aware). Java already does
exactly this. Drop the extra divergences so the loop reads as a
direct port of Java:

    for split in splits:
        merged = split.merged_row_count()
        if merged is not None:
            limited_splits.append(split)
            scanned_row_count += merged
            if scanned_row_count >= self.limit:
                return limited_splits
    return splits

Test adjustments:
- Removed the integration test ``test_limit_drops_non_raw_split_after_
  raw_budget_is_met``. Its expectation ("non-raw split survives the
  limit pushdown") was based on the now-reverted unconditional-append
  behavior. Java drops non-raw splits after the budget is met —
  matching this is now correct, so the test is no longer a regression
  reproducer.
- Renamed ``test_accumulator_falls_back_to_row_count_when_merged_
  unavailable`` to ``test_accumulator_skips_splits_with_unknown_
  merged_count`` and rewrote the docstring to describe the actual
  Java-aligned behavior (skip + fall through).
- Kept ``test_dv_aware_accumulator_uses_merged_row_count`` as the
  master-vs-fix reproducer (see the sketch after this commit message):
  master accumulates row_count=10 ≥ limit=5 and early-returns ``[raw]``
  (1 split); the fix accumulates merged=4 < 5, skips the two non-raw
  splits, and falls through to ``return splits`` with all 3. Verified
  by swapping the file body to master's version — this test fails
  (1 != 3) on master and passes after the fix.

Lint: flake8 clean. Tests: 10/10 in reader_split_generator_test.py.
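
A minimal sketch of how that reproducer drives the accumulator with
synthetic splits; the fake-split class and the _apply harness wrapper
are hypothetical stand-ins for whatever ApplyPushDownLimitUnitTest
actually defines:

    class _FakeSplit:
        def __init__(self, row_count, merged):
            self.row_count = row_count  # pre-DV upper bound (master's input)
            self._merged = merged

        def merged_row_count(self):
            return self._merged         # None models "merged count unknown"

    def test_dv_aware_accumulator_uses_merged_row_count(self):
        # [raw(row_count=10, merged=4), non-raw, non-raw], limit=5.
        splits = [_FakeSplit(10, 4), _FakeSplit(3, None), _FakeSplit(3, None)]
        result = self._apply(splits, limit=5)
        # master: 10 >= 5 early-returns [raw] (1 split); fix: 4 < 5,
        # unknown-merged splits are skipped, the loop falls through to
        # `return splits` with all 3.
        self.assertEqual(3, len(result))
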
…limit

Line-by-line audit against Java
``DataTableBatchScan.applyPushDownLimit`` (paimon-core/.../source/
DataTableBatchScan.java:128-165) caught one missing branch:

  Java L129: if (pushDownLimit == null || hasNonPartitionFilter())
              return Optional.empty();

Java skips limit pushdown entirely when the predicate references any
non-partition column, because per-split row counts (the accumulator
input below) are pre-filter and would over-count against the actual
filtered output — pushing the early-return budget too low and giving
the reader fewer rows than the user asked for.

Add the equivalent short-circuit to Python: a new private helper
``_has_non_partition_filter()`` mirrors Java's
``SnapshotReaderImpl.hasNonPartitionFilter`` (lines 235-248) using
the existing ``_get_all_fields`` predicate-leaf walker. When the
predicate references any column outside ``partition_keys`` the
limit-pushdown loop is skipped and the splits are returned untouched.
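
A condensed sketch of the helper, assuming ``_get_all_fields`` yields
the field names referenced by the predicate leaves (the exact signature
and the attribute paths for the predicate and partition keys may differ
in file_scanner.py):

    def _has_non_partition_filter(self) -> bool:
        # No predicate at all: nothing can reference a non-partition column.
        if self.predicate is None:
            return False
        partition_keys = set(self.table.partition_keys)
        # Mirrors SnapshotReaderImpl.hasNonPartitionFilter: any leaf field
        # outside the partition keys disables limit push-down.
        referenced = self._get_all_fields(self.predicate)
        return any(field not in partition_keys for field in referenced)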

Tests:
- New ``test_non_partition_filter_short_circuits_pushdown`` in
  ApplyPushDownLimitUnitTest covers the new branch.
- Existing 4 unit tests carry through unchanged (the new short-circuit
  doesn't trip when ``has_non_partition_filter=False``).

Inline comments now annotate every Java line we mirror (L129, L138,
L146, L147-163, L164) so a reviewer can verify the port at a glance.
@XiaoHongbo-Hope (Contributor) commented May 1, 2026

Thanks for the update - the latest code looks good to me. Two small follow-ups, just suggestions:

  1. The PR description still doesn't seem correct; could you please double-check it again?
  2. Could you please keep the comments a bit lighter?

@TheR1sing3un (Member, Author)

> Thanks for the update - the latest code looks good to me. Two small follow-ups, just suggestions:
>
>   1. The PR description still doesn't seem correct; could you please double-check it again?
>   2. Could you please keep the comments a bit lighter?

You're right. My opus is a bit wordy, lol. Let me update it.

Address review on PR apache#7742: shrink the multi-paragraph rationale on
``_apply_push_down_limit`` / ``_has_non_partition_filter`` to a single
line each pointing at the Java counterpart. The full reasoning lives
in the PR description; the file just needs to say what it mirrors.
@TheR1sing3un (Member, Author) commented May 1, 2026

Ready for the final review! @XiaoHongbo-Hope

Comment thread on paimon-python/pypaimon/tests/reader_split_generator_test.py
… description drift

Address review comment r3173561771: tighten the new unit-test
docstrings and correct the parts that no longer match the
implementation.

* Class-level rationale dropped — the cases speak for themselves.
* test_dv_aware_accumulator_uses_merged_row_count: previous wording
  said the post-fix loop "adds the two non-raw splits without changing
  the accumulator". That's wrong: ``merged is None`` splits are NOT
  appended to ``limited_splits``; the three-split result comes from
  the fall-through ``return splits`` after the loop completes. Updated
  to say so.
* Other docstrings shrunk to one or two lines each.
* _apply / _split helpers: dropped the inline narration on the fake
  scanner / fake split — they're trivially obvious from the bodies.
@XiaoHongbo-Hope (Contributor) commented May 2, 2026

Thanks for the updates — this looks much cleaner now. One tiny wording nit: the current Java code seems to use split.mergedRowCount(), while the PR description / docstring mentions partialMergedRowCount. This might confuse other developers. Otherwise LGTM from my side.

Reviewer pointed out the docstring named the Java method
``partialMergedRowCount`` while the actual API is
``Split.mergedRowCount()`` (DataTableBatchScan.applyPushDownLimit
calls ``split.mergedRowCount()``). Pick the real name so future
readers cross-referencing Java don't get tripped up.
@TheR1sing3un (Member, Author)

> Thanks for the updates — this looks much cleaner now. One tiny wording nit: the current Java code seems to use split.mergedRowCount(), while the PR description / docstring mentions partialMergedRowCount. This might confuse other developers. Otherwise LGTM from my side.

done~

@XiaoHongbo-Hope (Contributor)

+1
