[python] In-memory merge buffer for primary-key writer (Java SortBufferWriteBuffer parity) #7759
Draft
TheR1sing3un wants to merge 8 commits into apache:master from
Conversation
``MergeEngine.PARTIAL_UPDATE`` is exposed in ``core_options.py`` and
accepts ``merge-engine: partial-update`` as a table option, but the
read path never reads that option — ``sort_merge_reader.py`` hardcodes
``DeduplicateMergeFunction()``. So a user who creates a PK table with
``merge-engine: partial-update`` and writes overlapping rows whose
non-null columns differ gets silently deduplicated results instead of
the expected per-field merge: their data is wrong, with no error or
warning. The same is true for ``aggregation`` and ``first-row`` —
both are silently degraded to dedupe today.
This change ports the core ``PartialUpdateMergeFunction`` semantics
from Java
(paimon-core/.../mergetree/compact/PartialUpdateMergeFunction.java) and
wires the Python read path to dispatch on ``merge-engine``:
* New ``pypaimon/read/reader/partial_update_merge_function.py``: on
  each ``add(kv)`` copy non-null fields of ``kv.value`` into an
  accumulator; ``get_result()`` returns a fresh KeyValue with the
  merged row. The result is built into a brand-new tuple so the merge
  output is decoupled from upstream's reused KeyValue instances (see
  the sketch after this list).
* ``SortMergeReaderWithMinHeap.__init__`` gains an optional
``merge_function`` kwarg; default still ``DeduplicateMergeFunction()``
so any direct callers (none in-tree) are unchanged.
* ``MergeFileSplitRead.section_reader_supplier`` selects the merge
function based on ``self.table.options.merge_engine()``:
DEDUPLICATE -> DeduplicateMergeFunction (unchanged)
PARTIAL_UPDATE -> PartialUpdateMergeFunction
AGGREGATE / FIRST_ROW -> NotImplementedError (was silent dedupe)
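The core of the port is the last-non-null fold. A minimal sketch of those semantics is below -- illustrative only; the ``KeyValue`` stand-in and its field layout are assumptions, not pypaimon's actual classes:

```python
from typing import List, Optional, Tuple


class KeyValue:
    """Simplified stand-in for pypaimon's KeyValue (assumption)."""

    def __init__(self, key: Tuple, value: Tuple):
        self.key = key
        self.value = value  # value-side fields; None means "no update here"


class PartialUpdateMergeFunction:
    def __init__(self):
        self._key = None
        self._fields: Optional[List] = None

    def reset(self) -> None:
        self._key = None
        self._fields = None

    def add(self, kv: KeyValue) -> None:
        if self._fields is None:  # first row seen for this key
            self._key = kv.key
            self._fields = list(kv.value)
            return
        for i, field in enumerate(kv.value):
            if field is not None:        # later non-null wins;
                self._fields[i] = field  # a later null never clobbers

    def get_result(self) -> Optional[KeyValue]:
        if self._fields is None:
            return None
        # Brand-new tuple: the merged row is decoupled from upstream's
        # reused KeyValue instances.
        return KeyValue(self._key, tuple(self._fields))


f = PartialUpdateMergeFunction()
f.reset()
f.add(KeyValue((1,), ("alice", None)))
f.add(KeyValue((1,), (None, "NYC")))
assert f.get_result().value == ("alice", "NYC")
```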
Out of scope, intentionally:
* Per-field aggregator overrides (``fields.<name>.aggregate-function``)
* Sequence-group support (``fields.<name>.sequence-group``)
* ``ignore-delete`` / ``partial-update.remove-record-on-*`` options
* AGGREGATE / FIRST_ROW merge engine implementations
DELETE / UPDATE_BEFORE rows raise ``NotImplementedError`` at ``add()``
time so we can't silently corrupt data with a half-implemented contract.
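A hedged sketch of that guard -- the ``RowKind`` enum here is a simplified stand-in, and in the real code the check sits at the top of ``add()``:

```python
from enum import Enum


class RowKind(Enum):
    """Simplified stand-in for pypaimon's row kinds (assumption)."""
    INSERT = 0
    UPDATE_BEFORE = 1
    UPDATE_AFTER = 2
    DELETE = 3


def check_row_kind(kind: RowKind) -> None:
    # Refuse retraction rows up front rather than merging them wrongly.
    if kind in (RowKind.DELETE, RowKind.UPDATE_BEFORE):
        raise NotImplementedError(
            f"partial-update merge does not handle row kind {kind.name} yet")
```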
Tests:
* ``test_partial_update_merge_function.py`` — 11 unit cases covering
single insert, two-way overlapping merges, three-way merges, later-
null-does-not-clobber, reset between keys, get_result-before-any-
add, UPDATE_AFTER acceptance, DELETE / UPDATE_BEFORE refusal, and
result decoupling from input kv (proves we're not aliasing
upstream's reused KeyValue).
* ``test_partial_update_e2e.py`` — 8 cases: two-write merge, three-
write merge, disjoint keys unaffected, later-non-null wins, later-
null preserves earlier value, deduplicate engine unchanged
(regression), and aggregation / first-row raise NotImplementedError.
Verified by checking out ``origin/master``'s ``sort_merge_reader.py`` /
``split_read.py`` and rerunning ``test_partial_update_e2e.py``: master
fails the 4 partial-update merge cases (silent dedupe) and the 2
aggregation / first-row "raises" cases (silent dedupe instead of
raising); fix passes all 8.
…f-scope options

Address review on r3168491328: previously `_build_merge_function()` dispatched on `merge-engine: partial-update` alone, so a table that also configured sequence-group / per-field aggregator / ignore-delete / partial-update.remove-record-on-* would fall into the simple PartialUpdateMergeFunction and silently drop those semantics -- exactly the same silent-corruption pattern this PR exists to close, just reshaped from "silent dedupe" to "silent half-partial-update".

Now the PARTIAL_UPDATE branch first scans the table options for any of the unsupported keys:

* fields.<name>.sequence-group
* fields.<name>.aggregate-function
* fields.default-aggregate-function
* ignore-delete (and the partial-update./first-row./deduplicate. prefixed aliases) when truthy
* partial-update.remove-record-on-delete when truthy
* partial-update.remove-record-on-sequence-group when truthy

If any are set, raise NotImplementedError naming every offending key so the user can either drop them or escalate. Same shape as the existing AGGREGATE / FIRST_ROW raise.

Tests: 7 new e2e cases in test_partial_update_e2e.py, one per option, plus a regression case asserting `ignore-delete: false` (explicitly disabled) still passes through to the merge function.
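A hedged sketch of that scan -- the key names follow the commit message, but the helper signature and the truthiness convention are assumptions:

```python
import re


def partial_update_unsupported_options(options: dict) -> list:
    """Collect table-option keys the simple partial-update port cannot honour."""
    offending = []
    for key, value in options.items():
        if re.match(r"^fields\..+\.(sequence-group|aggregate-function)$", key):
            offending.append(key)
        elif key == "fields.default-aggregate-function":
            offending.append(key)
        elif key.endswith("ignore-delete") and str(value).lower() == "true":
            offending.append(key)  # also catches the engine-prefixed aliases
        elif (key.startswith("partial-update.remove-record-on")
              and str(value).lower() == "true"):
            offending.append(key)
    return sorted(offending)


assert partial_update_unsupported_options(
    {"merge-engine": "partial-update",
     "fields.city.sequence-group": "ts"}) == ["fields.city.sequence-group"]
```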
…eNonNullFields

Java PartialUpdateMergeFunction.updateNonNullFields (lines 177-188) raises IllegalArgumentException when an input field is null and the schema marks that field NOT NULL. The Python port previously absorbed such inputs silently, letting writes whose first value was null on a NOT NULL field land null in the accumulator.

Changes:

* PartialUpdateMergeFunction.__init__ takes an optional `nullables` list parallel to value indices. When given, every add() checks each null input against `nullables[i]` and raises ValueError on a NOT NULL field, matching Java semantics on every row (not just the first). When omitted, behaviour is unchanged (back-compat for direct callers).
* MergeFileSplitRead snapshots the raw value-side schema as `value_fields` before _create_key_value_fields wraps it, then hands `[f.type.nullable for f in self.value_fields]` to the merge function.
* Five new unit cases in test_partial_update_merge_function.py: first row null on NOT NULL raises, subsequent row null on NOT NULL raises, null on a nullable field is absorbed, length-mismatch nullables raises, omitting nullables preserves the previous lenient behaviour.

Result: with the existing guard in _build_merge_function (which refuses out-of-scope options) and the NOT NULL enforcement here, the simple last-non-null path is now feature-equivalent to Java's updateNonNullFields + getResult on the supported subset.
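The check itself is small; a hedged sketch, factored out as a standalone function for clarity (the real check runs inside ``add()``):

```python
from typing import Sequence


def check_not_null(value_fields: Sequence, nullables: Sequence[bool]) -> None:
    """Mirror Java updateNonNullFields: a null input on a NOT NULL field raises."""
    if len(value_fields) != len(nullables):
        raise ValueError(
            f"nullables length {len(nullables)} does not match "
            f"value arity {len(value_fields)}")
    for i, (field, nullable) in enumerate(zip(value_fields, nullables)):
        if field is None and not nullable:
            raise ValueError(f"field index {i} is NOT NULL but the input is null")
```

The wiring then passes ``[f.type.nullable for f in self.value_fields]`` as ``nullables``, per the commit above.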
…tedFailure

Reviewer asked to cover rows that land in the same data file -- multiple write_arrow() calls before a single prepare_commit(). Adding the cases revealed the writer-side / read-side gap upstream of this PR: KeyValueDataWriter._merge_data only does concat + sort (no merge function applied), so the flushed file holds duplicate primary keys; on read, _build_split_from_pack treats any single-file group as raw_convertible and routes it through the fast path, skipping SortMergeReader and the merge-engine dispatch this PR adds.

Fixing it requires either a merge buffer in KeyValueDataWriter (mirroring Java SortBufferWriteBuffer / MergeTreeWriter) or a tighter raw_convertible check that proves intra-file PK uniqueness -- both are write-path / scan-path restructurings outside this read-side merge-engine port. The two new cases are kept as unittest.expectedFailure so the gap stays visible and converts to passing regressions when the writer-side fix lands.
Move DeduplicateMergeFunction (previously embedded at the end of sort_merge_reader.py) into its own module pypaimon/read/reader/deduplicate_merge_function.py so it can be reused outside the read path.

Add pypaimon/common/merge_engine_dispatch.py with a single build_merge_function entry point and a partial_update_unsupported_options helper, both lifted verbatim from MergeFileSplitRead so the dispatch has exactly one implementation. MergeFileSplitRead._build_merge_function shrinks to a thin wrapper.

This keeps the read path's behaviour byte-identical and prepares the write path (next commit) to pick its merge function through the same dispatch.
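The entry point is compact; a hedged sketch, reusing the class and helper names from the sketches above and assuming a MergeEngine enum shaped like core_options.py's:

```python
from enum import Enum


class MergeEngine(Enum):
    DEDUPLICATE = "deduplicate"
    PARTIAL_UPDATE = "partial-update"
    AGGREGATE = "aggregation"
    FIRST_ROW = "first-row"


def build_merge_function(engine: MergeEngine, options: dict):
    """Single dispatch shared by the read and write paths."""
    if engine == MergeEngine.DEDUPLICATE:
        return DeduplicateMergeFunction()
    if engine == MergeEngine.PARTIAL_UPDATE:
        offending = partial_update_unsupported_options(options)
        if offending:
            raise NotImplementedError(
                "partial-update options not supported yet: " + ", ".join(offending))
        return PartialUpdateMergeFunction()
    raise NotImplementedError(f"merge-engine '{engine.value}' is not implemented")
```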
KeyValueDataWriter._merge_data previously did concat + sort only -- no merge function was ever applied -- so a primary-key flush could emit a single data file containing two or more rows for the same primary key. The read-side raw_convertible fast path (split_generator.py:99-100) treats single-file PK splits as "merge-free" and skips SortMergeReader, which produced silent multi-row-per-PK results on master regardless of merge engine.

Mirror Java MergeTreeWriter.flushWriteBuffer + SortBufferWriteBuffer.MergeIterator.advanceIfNeeded (paimon-core/.../mergetree/SortBufferWriteBuffer.java:163-293): fold each run of equal-PK rows in the sorted pending buffer through the table's MergeFunction (reset + add + get_result) before writing to the file. The flushed file therefore satisfies the LSM "PK unique within a file" invariant the read side relies on.

FileStoreWrite._build_pk_merge_function picks the merge function through the shared dispatch added in the previous commit. PARTIAL_UPDATE with out-of-scope options keeps the explicit raise introduced in apache#7745 -- silently degrading there would reintroduce the same data-quality risk this PR exists to close. Wholly unsupported engines (aggregation / first-row) fall back to DeduplicateMergeFunction on the write side so the file still keeps the LSM invariant; the read side's dispatch still raises, so users get an explicit error before observing wrong-engine data.

Sister to apache#7745. Closes the writer-side gap that apache#7745's expectedFailure cases exposed.
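The fold itself is a groupby over the sorted buffer. A hedged sketch over plain rows -- the real code operates on a pa.Table, and ``key_of`` here is a hypothetical key extractor:

```python
from itertools import groupby


def merge_pending_by_pk(sorted_rows, merge_function, key_of):
    """Fold each run of equal-PK rows through the merge function.

    sorted_rows must already be sorted by primary key, mirroring
    SortBufferWriteBuffer.MergeIterator.advanceIfNeeded.
    """
    merged = []
    for _, run in groupby(sorted_rows, key=key_of):
        merge_function.reset()
        for kv in run:
            merge_function.add(kv)
        result = merge_function.get_result()
        if result is not None:     # a None result drops the whole PK group,
            merged.append(result)  # like Java's do { ... } while (result == null)
    return merged
```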
Drop the unittest.expectedFailure decorators on the two same-commit multi-write_arrow cases apache#7745 added: the writer-side merge buffer now folds same-PK runs before flush, so they pass. Add a same-commit deduplicate regression (test_deduplicate_two_write_arrows_single_commit) so the master bug -- silently returning both rows on the default merge engine -- cannot come back undetected.

Add unit coverage in tests/test_write_merge_buffer.py exercising KeyValueDataWriter._merge_pending_by_pk directly with synthetic pa.Table inputs: dedupe collapse / disjoint keys / partial-update fold across two and three writes / later-null does not clobber / empty buffer / single-row fast path / get_result returning None drops the run.

Adjust two e2e cases that previously asserted a NotImplementedError at read time: with the writer-side dispatch in place, the unsupported-engine fallback runs at write time but read still raises, and the partial-update + unsupported-option cases now surface their NotImplementedError on the writer's first flush. Update the assertion sites accordingly.

reader_primary_key_test.test_pk_multi_write_once_commit drops a TODO that explicitly documented the missing merge: with the writer-side fold now in place, user_id=2's two writes deduplicate to the latest row, so the expected table no longer contains a duplicate PK.
* Reject ``with_write_type`` on PK tables in ``_build_pk_merge_function`` with a clear NotImplementedError. Without this guard the buffer layout (column subset on the value side) and the merge function's arity (full table) drift apart and crash with IndexError on flush.
* Replace ``except NotImplementedError`` with an explicit ``MergeEngine.AGGREGATE / FIRST_ROW`` check so a future engine that legitimately raises won't be silently swallowed.
* Tighten three docstrings/comments: "fail at flush time" and "first write_arrow flush" misnamed the timing -- the dispatch fires inside ``FileStoreWrite._create_data_writer``, not on flush; and the partial-update unit test comment said "4 value-side columns" while listing three.
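After these fixes the write-side dispatch takes roughly this shape -- a hedged sketch; ``table.write_projection`` is a hypothetical stand-in for the state ``with_write_type`` sets, and the options object is assumed to behave like a mapping:

```python
def build_pk_merge_function(table):
    if getattr(table, "write_projection", None):
        raise NotImplementedError(
            "with_write_type on primary-key tables is not supported: the buffer "
            "would carry a value-side column subset while the merge function "
            "expects full table arity")
    engine = table.options.merge_engine()
    # Explicit engine check instead of `except NotImplementedError`, so a
    # future engine that legitimately raises isn't silently swallowed.
    if engine in (MergeEngine.AGGREGATE, MergeEngine.FIRST_ROW):
        return DeduplicateMergeFunction()  # write-side fallback keeps the LSM invariant
    return build_merge_function(engine, table.options)
```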
block on #7745.
Purpose
Fix a silent data-quality bug on master: a primary-key table that gets multiple write_arrow calls inside a single prepare_commit returns multiple rows per PK on read -- regardless of merge engine, including the default deduplicate. Verified empirically on origin/master (b4e54ad) and on the fix branch. This is the writer-side gap #7745's expectedFailure cases exposed.

Root cause

* KeyValueDataWriter._merge_data did concat + sort only -- never applied the table's merge function -- so a flushed file held multiple rows per PK, violating the Java LSM invariant "PK unique within a file".
* The raw_convertible fast path (split_generator.py:99-100) treats any single-file PK split as merge-free and skips SortMergeReader. That assumption holds on Java because the writer enforces the invariant; on Python it didn't.

Fix
Mirror Java MergeTreeWriter.flushWriteBuffer + SortBufferWriteBuffer.MergeIterator.advanceIfNeeded (paimon-core/.../mergetree/SortBufferWriteBuffer.java:163-293): before each flush, fold each run of equal-PK rows through the table's merge function (reset + add + get_result). The flushed file therefore satisfies the LSM invariant the read side relies on.

The same dispatch is now shared between the read path (MergeFileSplitRead._build_merge_function) and the writer (FileStoreWrite._build_pk_merge_function) -- single source of truth, no implementation drift possible.

In scope
* pypaimon/common/merge_engine_dispatch.py (new): single build_merge_function entry point + partial_update_unsupported_options helper, lifted verbatim from MergeFileSplitRead.
* pypaimon/read/reader/deduplicate_merge_function.py (new): extracted from the inline class at the end of sort_merge_reader.py so the writer can reuse it.
* pypaimon/write/writer/key_value_data_writer.py: constructor takes merge_function; new _merge_pending_by_pk runs the per-PK fold; prepare_commit and _check_and_roll_if_needed invoke it before flushing.
* pypaimon/write/file_store_write.py::_build_pk_merge_function: picks the merge function via the shared dispatch. Wholly unsupported engines (aggregation / first-row) fall back to DeduplicateMergeFunction so the file still maintains the LSM invariant; the read side still raises explicitly. partial-update with out-of-scope options keeps the explicit raise from [python] Implement partial-update merge engine in pypaimon #7745. with_write_type (column-subset writes) on PK tables is rejected with a clear NotImplementedError rather than crashing on a hidden arity mismatch.
* pypaimon/tests/test_write_merge_buffer.py (new): 9 unit cases driving _merge_pending_by_pk directly with synthetic pa.Table inputs.
* pypaimon/tests/test_partial_update_e2e.py: drops the two expectedFailure decorators (now passing); adds test_deduplicate_two_write_arrows_single_commit regression for the master silent bug; adjusts unsupported-option cases to expect the error inside the first write_arrow call.
* pypaimon/tests/reader_primary_key_test.py::test_pk_multi_write_once_commit: drops a # TODO support pk merge comment and tightens the expected table -- user_id=2's two writes now correctly dedupe to one row.

Out of scope
* Spilling the sort buffer to disk (Java BinaryExternalSortBuffer). Python keeps pa.Table as the buffer; large-buffer OOM mitigation is a separate effort.
* aggregation / first-row merge engines on the write side (need AggregateMergeFunction / FirstRowMergeFunction ports first, tracked separately). Those engines fall back to dedupe on flush so files stay valid; reads still raise explicitly.
* ignore-delete and partial-update.remove-record-on-* -- still explicitly rejected by the shared dispatch from [python] Implement partial-update merge engine in pypaimon #7745.
* DELETE / UPDATE_BEFORE row kinds -- Python writers always emit _VALUE_KIND = 0 (INSERT) today; merge functions raise on non-INSERT input as before.
* with_write_type (column-subset writes) on PK tables -- explicitly rejected. The buffer layout would carry only the subset on the value side, while the merge function is built for the full table arity. Supporting this requires either filling absent columns with nulls before flush or adapting the merge function's arity, both larger than this PR.
* AppendOnlyDataWriter / DataBlobWriter -- no PK, no merge needed.

Tests
From paimon-python/:

Master vs fix verification (both engines):
Anti-divergence checklist
* KeyValueDataWriter._merge_pending_by_pk runs reset / add / get_result once per equal-PK run, equivalent to Java SortBufferWriteBuffer.MergeIterator.advanceIfNeeded.
* Read and write paths pick their merge function through the same build_merge_function -- impossible to drift.
* Flushed files keep PKs unique within a file, so the raw_convertible fast path's assumption holds.
* get_result() returning None drops that PK group (mirrors Java's do { ... } while (result == null)).
* _check_and_roll_if_needed folds before slicing for size, so each sliced file individually maintains PK uniqueness.

Known trade-off
_check_and_roll_if_needed runs the per-PK fold on every write call (Python's buffer is a single pa.Table and it is re-folded before any size-based split). Java's SortBufferWriteBuffer only folds on flush. For workloads with many small batches this is O(n²) in buffer size. The fold is idempotent so correctness is unaffected; if it shows up in profiles, the buffer can be moved to an append-only batch list with a lazy fold at flush in a follow-up.

Generative AI disclosure
Drafted with assistance from a generative AI tool. All code, tests, and Java alignment were reviewed and validated by the contributor.
Sister PR to #7745. Built on top of #7745's branch; once #7745 merges, this rebases cleanly onto master.