[python] In-memory merge buffer for primary-key writer (Java SortBufferWriteBuffer parity) #7759
Draft
TheR1sing3un wants to merge 8 commits into apache:master from
Conversation
``MergeEngine.PARTIAL_UPDATE`` is exposed in ``core_options.py`` and
accepts ``merge-engine: partial-update`` as a table option, but the
read path never reads that option — ``sort_merge_reader.py`` hardcodes
``DeduplicateMergeFunction()``. So a user who creates a PK table with
``merge-engine: partial-update`` and writes overlapping rows whose
non-null columns differ gets silently deduplicated results instead of
the expected per-field merge: their data is wrong, with no error or
warning. The same is true for ``aggregation`` and ``first-row`` —
both are silently degraded to dedupe today.
This change ports the core ``PartialUpdateMergeFunction`` semantics
from Java
(paimon-core/.../mergetree/compact/PartialUpdateMergeFunction.java) and
wires the Python read path to dispatch on ``merge-engine``:
* New ``pypaimon/read/reader/partial_update_merge_function.py``: on
  each ``add(kv)`` copy non-null fields of ``kv.value`` into an
  accumulator; ``get_result()`` returns a fresh KeyValue with the
  merged row. The result is built into a brand-new tuple so the merge
  output is decoupled from upstream's reused KeyValue instances (see
  the sketch after this list).
* ``SortMergeReaderWithMinHeap.__init__`` gains an optional
``merge_function`` kwarg; default still ``DeduplicateMergeFunction()``
so any direct callers (none in-tree) are unchanged.
* ``MergeFileSplitRead.section_reader_supplier`` selects the merge
function based on ``self.table.options.merge_engine()``:
DEDUPLICATE -> DeduplicateMergeFunction (unchanged)
PARTIAL_UPDATE -> PartialUpdateMergeFunction
AGGREGATE / FIRST_ROW -> NotImplementedError (was silent dedupe)
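The core of the port is the last-non-null fold. A minimal sketch of those semantics is below -- illustrative only; the ``KeyValue`` stand-in and its field layout are assumptions, not pypaimon's actual classes:

```python
from typing import List, Optional, Tuple


class KeyValue:
    """Simplified stand-in for pypaimon's KeyValue (assumption)."""

    def __init__(self, key: Tuple, value: Tuple):
        self.key = key
        self.value = value  # value-side fields; None means "no update here"


class PartialUpdateMergeFunction:
    def __init__(self):
        self._key = None
        self._fields: Optional[List] = None

    def reset(self) -> None:
        self._key = None
        self._fields = None

    def add(self, kv: KeyValue) -> None:
        if self._fields is None:  # first row seen for this key
            self._key = kv.key
            self._fields = list(kv.value)
            return
        for i, field in enumerate(kv.value):
            if field is not None:        # later non-null wins;
                self._fields[i] = field  # a later null never clobbers

    def get_result(self) -> Optional[KeyValue]:
        if self._fields is None:
            return None
        # Brand-new tuple: the merged row is decoupled from upstream's
        # reused KeyValue instances.
        return KeyValue(self._key, tuple(self._fields))


f = PartialUpdateMergeFunction()
f.reset()
f.add(KeyValue((1,), ("alice", None)))
f.add(KeyValue((1,), (None, "NYC")))
assert f.get_result().value == ("alice", "NYC")
```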
Out of scope, intentionally:
* Per-field aggregator overrides (``fields.<name>.aggregate-function``)
* Sequence-group support (``fields.<name>.sequence-group``)
* ``ignore-delete`` / ``partial-update.remove-record-on-*`` options
* AGGREGATE / FIRST_ROW merge engine implementations
DELETE / UPDATE_BEFORE rows raise ``NotImplementedError`` at ``add()``
time so we can't silently corrupt data with a half-implemented contract.
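A hedged sketch of that guard -- the ``RowKind`` enum here is a simplified stand-in, and in the real code the check sits at the top of ``add()``:

```python
from enum import Enum


class RowKind(Enum):
    """Simplified stand-in for pypaimon's row kinds (assumption)."""
    INSERT = 0
    UPDATE_BEFORE = 1
    UPDATE_AFTER = 2
    DELETE = 3


def check_row_kind(kind: RowKind) -> None:
    # Refuse retraction rows up front rather than merging them wrongly.
    if kind in (RowKind.DELETE, RowKind.UPDATE_BEFORE):
        raise NotImplementedError(
            f"partial-update merge does not handle row kind {kind.name} yet")
```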
Tests:
* ``test_partial_update_merge_function.py`` — 11 unit cases covering
single insert, two-way overlapping merges, three-way merges, later-
null-does-not-clobber, reset between keys, get_result-before-any-
add, UPDATE_AFTER acceptance, DELETE / UPDATE_BEFORE refusal, and
result decoupling from input kv (proves we're not aliasing
upstream's reused KeyValue).
* ``test_partial_update_e2e.py`` — 8 cases: two-write merge, three-
write merge, disjoint keys unaffected, later-non-null wins, later-
null preserves earlier value, deduplicate engine unchanged
(regression), and aggregation / first-row raise NotImplementedError.
Verified by checking out ``origin/master``'s ``sort_merge_reader.py`` /
``split_read.py`` and rerunning ``test_partial_update_e2e.py``: master
fails the 4 partial-update merge cases (silent dedupe) and the 2
aggregation / first-row "raises" cases (silent dedupe instead of
raising); fix passes all 8.
…f-scope options

Address review on r3168491328: previously `_build_merge_function()` dispatched on `merge-engine: partial-update` alone, so a table that also configured sequence-group / per-field aggregator / ignore-delete / partial-update.remove-record-on-* would fall into the simple PartialUpdateMergeFunction and silently drop those semantics -- exactly the same silent-corruption pattern this PR exists to close, just reshaped from "silent dedupe" to "silent half-partial-update".

Now the PARTIAL_UPDATE branch first scans the table options for any of the unsupported keys:

* fields.<name>.sequence-group
* fields.<name>.aggregate-function
* fields.default-aggregate-function
* ignore-delete (and the partial-update./first-row./deduplicate. prefixed aliases) when truthy
* partial-update.remove-record-on-delete when truthy
* partial-update.remove-record-on-sequence-group when truthy

If any are set, raise NotImplementedError naming every offending key so the user can either drop them or escalate. Same shape as the existing AGGREGATE / FIRST_ROW raise.

Tests: 7 new e2e cases in test_partial_update_e2e.py, one per option, plus a regression case asserting `ignore-delete: false` (explicitly disabled) still passes through to the merge function.
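A hedged sketch of that scan -- the key names follow the commit message, but the helper signature and the truthiness convention are assumptions:

```python
import re


def partial_update_unsupported_options(options: dict) -> list:
    """Collect table-option keys the simple partial-update port cannot honour."""
    offending = []
    for key, value in options.items():
        if re.match(r"^fields\..+\.(sequence-group|aggregate-function)$", key):
            offending.append(key)
        elif key == "fields.default-aggregate-function":
            offending.append(key)
        elif key.endswith("ignore-delete") and str(value).lower() == "true":
            offending.append(key)  # also catches the engine-prefixed aliases
        elif (key.startswith("partial-update.remove-record-on")
              and str(value).lower() == "true"):
            offending.append(key)
    return sorted(offending)


assert partial_update_unsupported_options(
    {"merge-engine": "partial-update",
     "fields.city.sequence-group": "ts"}) == ["fields.city.sequence-group"]
```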
…eNonNullFields

Java PartialUpdateMergeFunction.updateNonNullFields (lines 177-188) raises IllegalArgumentException when an input field is null and the schema marks that field NOT NULL. The Python port previously absorbed such inputs silently, letting writes whose first value was null on a NOT NULL field land null in the accumulator.

Changes:

* PartialUpdateMergeFunction.__init__ takes an optional `nullables` list parallel to value indices. When given, every add() checks each null input against `nullables[i]` and raises ValueError on a NOT NULL field, matching Java semantics on every row (not just the first). When omitted, behaviour is unchanged (back-compat for direct callers).
* MergeFileSplitRead snapshots the raw value-side schema as `value_fields` before _create_key_value_fields wraps it, then hands `[f.type.nullable for f in self.value_fields]` to the merge function.
* Five new unit cases in test_partial_update_merge_function.py: first row null on NOT NULL raises, subsequent row null on NOT NULL raises, null on a nullable field is absorbed, length-mismatch nullables raises, omitting nullables preserves the previous lenient behaviour.

Result: with the existing guard in _build_merge_function (which refuses out-of-scope options) and the NOT NULL enforcement here, the simple last-non-null path is now feature-equivalent to Java's updateNonNullFields + getResult on the supported subset.
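The check itself is small; a hedged sketch, factored out as a standalone function for clarity (the real check runs inside ``add()``):

```python
from typing import Sequence


def check_not_null(value_fields: Sequence, nullables: Sequence[bool]) -> None:
    """Mirror Java updateNonNullFields: a null input on a NOT NULL field raises."""
    if len(value_fields) != len(nullables):
        raise ValueError(
            f"nullables length {len(nullables)} does not match "
            f"value arity {len(value_fields)}")
    for i, (field, nullable) in enumerate(zip(value_fields, nullables)):
        if field is None and not nullable:
            raise ValueError(f"field index {i} is NOT NULL but the input is null")
```

The wiring then passes ``[f.type.nullable for f in self.value_fields]`` as ``nullables``, per the commit above.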
…tedFailure

Reviewer asked to cover rows that land in the same data file -- multiple write_arrow() calls before a single prepare_commit(). Adding the cases revealed the writer-side / read-side gap upstream of this PR: KeyValueDataWriter._merge_data only does concat + sort (no merge function applied), so the flushed file holds duplicate primary keys; on read, _build_split_from_pack treats any single-file group as raw_convertible and routes it through the fast path, skipping SortMergeReader and the merge-engine dispatch this PR adds.

Fixing it requires either a merge buffer in KeyValueDataWriter (mirroring Java SortBufferWriteBuffer / MergeTreeWriter) or a tighter raw_convertible check that proves intra-file PK uniqueness -- both are write-path / scan-path restructurings outside this read-side merge-engine port. The two new cases are kept as unittest.expectedFailure so the gap stays visible and converts to passing regressions when the writer-side fix lands.
Move DeduplicateMergeFunction (previously embedded at the end of sort_merge_reader.py) into its own module pypaimon/read/reader/deduplicate_merge_function.py so it can be reused outside the read path.

Add pypaimon/common/merge_engine_dispatch.py with a single build_merge_function entry point and a partial_update_unsupported_options helper, both lifted verbatim from MergeFileSplitRead so the dispatch has exactly one implementation. MergeFileSplitRead._build_merge_function shrinks to a thin wrapper.

This keeps the read path's behaviour byte-identical and prepares the write path (next commit) to pick its merge function through the same dispatch.
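The entry point is compact; a hedged sketch, reusing the class and helper names from the sketches above and assuming a MergeEngine enum shaped like core_options.py's:

```python
from enum import Enum


class MergeEngine(Enum):
    DEDUPLICATE = "deduplicate"
    PARTIAL_UPDATE = "partial-update"
    AGGREGATE = "aggregation"
    FIRST_ROW = "first-row"


def build_merge_function(engine: MergeEngine, options: dict):
    """Single dispatch shared by the read and write paths."""
    if engine == MergeEngine.DEDUPLICATE:
        return DeduplicateMergeFunction()
    if engine == MergeEngine.PARTIAL_UPDATE:
        offending = partial_update_unsupported_options(options)
        if offending:
            raise NotImplementedError(
                "partial-update options not supported yet: " + ", ".join(offending))
        return PartialUpdateMergeFunction()
    raise NotImplementedError(f"merge-engine '{engine.value}' is not implemented")
```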
KeyValueDataWriter._merge_data previously did concat + sort only -- no merge function was ever applied -- so a primary-key flush could emit a single data file containing two or more rows for the same primary key. The read-side raw_convertible fast path (split_generator.py:99-100) treats single-file PK splits as "merge-free" and skips SortMergeReader, which produced silent multi-row-per-PK results on master regardless of merge engine.

Mirror Java MergeTreeWriter.flushWriteBuffer + SortBufferWriteBuffer.MergeIterator.advanceIfNeeded (paimon-core/.../mergetree/SortBufferWriteBuffer.java:163-293): fold each run of equal-PK rows in the sorted pending buffer through the table's MergeFunction (reset + add + get_result) before writing to the file. The flushed file therefore satisfies the LSM "PK unique within a file" invariant the read side relies on.

FileStoreWrite._build_pk_merge_function picks the merge function through the shared dispatch added in the previous commit. PARTIAL_UPDATE with out-of-scope options keeps the explicit raise introduced in apache#7745 -- silently degrading there would reintroduce the same data-quality risk this PR exists to close. Wholly unsupported engines (aggregation / first-row) fall back to DeduplicateMergeFunction on the write side so the file still keeps the LSM invariant; the read side's dispatch still raises, so users get an explicit error before observing wrong-engine data.

Sister to apache#7745. Closes the writer-side gap that apache#7745's expectedFailure cases exposed.
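The fold itself is a groupby over the sorted buffer. A hedged sketch over plain rows -- the real code operates on a pa.Table, and ``key_of`` here is a hypothetical key extractor:

```python
from itertools import groupby


def merge_pending_by_pk(sorted_rows, merge_function, key_of):
    """Fold each run of equal-PK rows through the merge function.

    sorted_rows must already be sorted by primary key, mirroring
    SortBufferWriteBuffer.MergeIterator.advanceIfNeeded.
    """
    merged = []
    for _, run in groupby(sorted_rows, key=key_of):
        merge_function.reset()
        for kv in run:
            merge_function.add(kv)
        result = merge_function.get_result()
        if result is not None:     # a None result drops the whole PK group,
            merged.append(result)  # like Java's do { ... } while (result == null)
    return merged
```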
Drop the unittest.expectedFailure decorators on the two same-commit multi-write_arrow cases apache#7745 added: the writer-side merge buffer now folds same-PK runs before flush, so they pass. Add a same-commit deduplicate regression (test_deduplicate_two_write_arrows_single_commit) so the master bug -- silently returning both rows on the default merge engine -- cannot come back undetected.

Add unit coverage in tests/test_write_merge_buffer.py exercising KeyValueDataWriter._merge_pending_by_pk directly with synthetic pa.Table inputs: dedupe collapse / disjoint keys / partial-update fold across two and three writes / later-null does not clobber / empty buffer / single-row fast path / get_result returning None drops the run.

Adjust two e2e cases that previously asserted a NotImplementedError at read time: with the writer-side dispatch in place, the unsupported-engine fallback runs at write time but read still raises, and the partial-update + unsupported-option cases now surface their NotImplementedError on the writer's first flush. Update the assertion sites accordingly.

reader_primary_key_test.test_pk_multi_write_once_commit drops a TODO that explicitly documented the missing merge: with the writer-side fold now in place, user_id=2's two writes deduplicate to the latest row, so the expected table no longer contains a duplicate PK.
* Reject ``with_write_type`` on PK tables in ``_build_pk_merge_function`` with a clear NotImplementedError. Without this guard the buffer layout (column subset on the value side) and the merge function's arity (full table) drift apart and crash with IndexError on flush.
* Replace ``except NotImplementedError`` with an explicit ``MergeEngine.AGGREGATE / FIRST_ROW`` check so a future engine that legitimately raises won't be silently swallowed.
* Tighten three docstrings/comments: "fail at flush time" and "first write_arrow flush" misnamed the timing -- the dispatch fires inside ``FileStoreWrite._create_data_writer``, not on flush; and the partial-update unit test comment said "4 value-side columns" while listing three.
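After these fixes the write-side dispatch takes roughly this shape -- a hedged sketch; ``table.write_projection`` is a hypothetical stand-in for the state ``with_write_type`` sets, and the options object is assumed to behave like a mapping:

```python
def build_pk_merge_function(table):
    if getattr(table, "write_projection", None):
        raise NotImplementedError(
            "with_write_type on primary-key tables is not supported: the buffer "
            "would carry a value-side column subset while the merge function "
            "expects full table arity")
    engine = table.options.merge_engine()
    # Explicit engine check instead of `except NotImplementedError`, so a
    # future engine that legitimately raises isn't silently swallowed.
    if engine in (MergeEngine.AGGREGATE, MergeEngine.FIRST_ROW):
        return DeduplicateMergeFunction()  # write-side fallback keeps the LSM invariant
    return build_merge_function(engine, table.options)
```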
block on #7745.
Purpose
Fix a silent data-quality bug on master: a primary-key table that gets multiple write_arrow calls inside a single prepare_commit returns multiple rows per PK on read -- regardless of merge engine, including the default deduplicate. Verified empirically on origin/master (b4e54ad) and on the fix branch. This is the writer-side gap #7745's expectedFailure cases exposed.

Root cause

* KeyValueDataWriter._merge_data did concat + sort only -- never applied the table's merge function -- so a flushed file held multiple rows per PK, violating the Java LSM invariant "PK unique within a file".
* The raw_convertible fast path (split_generator.py:99-100) treats any single-file PK split as merge-free and skips SortMergeReader. That assumption holds on Java because the writer enforces the invariant; on Python it didn't.

Fix
Mirror Java MergeTreeWriter.flushWriteBuffer + SortBufferWriteBuffer.MergeIterator.advanceIfNeeded (paimon-core/.../mergetree/SortBufferWriteBuffer.java:163-293): before each flush, fold each run of equal-PK rows through the table's merge function (reset + add + get_result). The flushed file therefore satisfies the LSM invariant the read side relies on.

The same dispatch is now shared between the read path (MergeFileSplitRead._build_merge_function) and the writer (FileStoreWrite._build_pk_merge_function) -- single source of truth, no implementation drift possible.

In scope
* pypaimon/common/merge_engine_dispatch.py (new): single build_merge_function entry point + partial_update_unsupported_options helper, lifted verbatim from MergeFileSplitRead.
* pypaimon/read/reader/deduplicate_merge_function.py (new): extracted from the inline class at the end of sort_merge_reader.py so the writer can reuse it.
* pypaimon/write/writer/key_value_data_writer.py: constructor takes merge_function; new _merge_pending_by_pk runs the per-PK fold; prepare_commit and _check_and_roll_if_needed invoke it before flushing.
* pypaimon/write/file_store_write.py::_build_pk_merge_function: picks the merge function via the shared dispatch. Wholly unsupported engines (aggregation / first-row) fall back to DeduplicateMergeFunction so the file still maintains the LSM invariant; the read side still raises explicitly. partial-update with out-of-scope options keeps the explicit raise from [python] Implement partial-update merge engine in pypaimon #7745. with_write_type (column-subset writes) on PK tables is rejected with a clear NotImplementedError rather than crashing on a hidden arity mismatch.
* pypaimon/tests/test_write_merge_buffer.py (new): 9 unit cases driving _merge_pending_by_pk directly with synthetic pa.Table inputs.
* pypaimon/tests/test_partial_update_e2e.py: drops the two expectedFailure decorators (now passing); adds test_deduplicate_two_write_arrows_single_commit regression for the master silent bug; adjusts unsupported-option cases to expect the error inside the first write_arrow call.
* pypaimon/tests/reader_primary_key_test.py::test_pk_multi_write_once_commit: drops a # TODO support pk merge comment and tightens the expected table -- user_id=2's two writes now correctly dedupe to one row.

Out of scope
* Spilling the sort buffer to disk (Java BinaryExternalSortBuffer). Python keeps pa.Table as the buffer; large-buffer OOM mitigation is a separate effort.
* aggregation / first-row merge engines on the write side (need AggregateMergeFunction / FirstRowMergeFunction ports first, tracked separately). Those engines fall back to dedupe on flush so files stay valid; reads still raise explicitly.
* ignore-delete and partial-update.remove-record-on-* -- still explicitly rejected by the shared dispatch from [python] Implement partial-update merge engine in pypaimon #7745.
* DELETE / UPDATE_BEFORE row kinds -- Python writers always emit _VALUE_KIND = 0 (INSERT) today; merge functions raise on non-INSERT input as before.
* with_write_type (column-subset writes) on PK tables -- explicitly rejected. The buffer layout would carry only the subset on the value side, while the merge function is built for the full table arity. Supporting this requires either filling absent columns with nulls before flush or adapting the merge function's arity, both larger than this PR.
* AppendOnlyDataWriter / DataBlobWriter -- no PK, no merge needed.

Tests
From paimon-python/:

Master vs fix verification (both engines):
Anti-divergence checklist
* KeyValueDataWriter._merge_pending_by_pk runs reset / add / get_result once per equal-PK run, equivalent to Java SortBufferWriteBuffer.MergeIterator.advanceIfNeeded.
* Read and write paths pick their merge function through the same build_merge_function -- impossible to drift.
* Flushed files keep PKs unique within a file, so the raw_convertible fast path's assumption holds.
* get_result() returning None drops that PK group (mirrors Java's do { ... } while (result == null)).
* _check_and_roll_if_needed folds before slicing for size, so each sliced file individually maintains PK uniqueness.

Known trade-off
_check_and_roll_if_needed runs the per-PK fold on every write call (Python's buffer is a single pa.Table and it is re-folded before any size-based split). Java's SortBufferWriteBuffer only folds on flush. For workloads with many small batches this is O(n²) in buffer size. The fold is idempotent so correctness is unaffected; if it shows up in profiles, the buffer can be moved to an append-only batch list with a lazy fold at flush in a follow-up.

Generative AI disclosure
Drafted with assistance from a generative AI tool. All code, tests, and Java alignment were reviewed and validated by the contributor.
Sister PR to #7745. Built on top of #7745's branch; once #7745 merges, this rebases cleanly onto master.