[python] Align Identifier with Java by encoding branch into the object field#7738
Conversation
…t field
Java's org.apache.paimon.catalog.Identifier represents a table on a branch
(or a system table on a branch) by encoding the branch / system_table into
the "object" field, e.g. "tbl$branch_dev$snapshots". The wire shape is
exactly two fields, "database" and "object". The Python Identifier was
out of step:
* SYSTEM_BRANCH_PREFIX was "branch-", whereas Java uses "branch_". Any
"$branch_<name>" object name produced by the REST server made
is_system_table() return True (treating a branched table as a system
table) — a real correctness bug for catalog snapshot commits that
routed through the branch-encoded identifier.
* "branch" was a third JSON field. Java has @JsonIgnoreProperties so it
silently dropped that field on the wire — Python clients setting
`branch=...` thought they were branching but the server never saw it.
* get_table_name() returned the raw object name including any encoded
suffix; get_branch_name() / get_system_table_name() were missing.
Port the Java design 1:1:
* Identifier(database, object) is the JSON-create constructor; "object"
may already carry an encoded branch / system_table.
* Identifier.create(database, table, branch=None, system_table=None)
encodes the components into the object field. branch="main"
(case-insensitive) is the default and is not encoded, matching Java.
* _split_object_name() lazily decodes the object into table / branch /
system_table, with the same one- / two- / three-segment rules as
Java's splitObjectName().
* SYSTEM_BRANCH_PREFIX is now "branch_".
* get_full_name() drops the database prefix when the database is
UNKNOWN_DATABASE, matching Java.
* Equality and hash use only (database, object), so a JSON round-trip
produces an equal instance.
Eliminate redundant branch parameter passing now that the identifier is
the single source of truth:
* SnapshotCommit.commit() no longer takes a separate `branch` argument
— both CatalogSnapshotCommit and RenamingSnapshotCommit (and the
abstract base) drop it; FileStoreCommit's call site stops passing
table.current_branch().
* FileStoreTable.current_branch() reads from the identifier instead of
CoreOptions; FileStoreTable.copy() re-encodes the branch into the
new identifier when CoreOptions.BRANCH is supplied, so the catalog
environment sees the branched object name.
* RESTCatalog.to_table_metadata() drops the now-redundant block that
re-derives BRANCH from the identifier into options.
* RESTTokenFileIO.refresh_token() uses Identifier.create(db, table,
branch=...) to strip the system-table suffix while preserving the
branch.
* The mock RESTCatalogServer's URL parser stops looking for a
"table.branch" suffix and instead reads identifier.get_branch_name()
directly, matching the Java URL convention.
Tests:
* identifier_test: 25 cases covering simple parse, backtick parsing,
branch encoding, main-branch normalization, UNKNOWN_DATABASE,
one/two/three-segment object names with and without branch prefix,
create() with branch + system_table, equality through JSON shape.
* catalog_branch_manager_test, file_store_table_test,
rest_catalog_commit_snapshot_test, rest_token_file_io_test: updated
to construct branched identifiers via Identifier.create(...) or the
"db.tbl$branch_<x>" string form.
|
|
||
|
|
||
| @dataclass | ||
| @dataclass(init=False) |
There was a problem hiding this comment.
This may be a break change?
There was a problem hiding this comment.
Thanks for catching that @JingsongLi — yes, you're right. Let me enumerate what changes shape so I can address each:
Identifier(database, object, branch=...)constructor kwarg removed.identifier.branchattribute removed.Identifier.create(db, object)second positional arg renamed fromobjecttotable; passing a pre-encoded object containing$now caches it as the table name rather than splitting it.SYSTEM_BRANCH_PREFIXchanged from'branch-'to'branch_'(this one is a bug-fix — Java has always usedbranch_, so anything Python wrote withbranch-was already not round-trippable through the Java REST server).- The
branchJSON field is no longer emitted (Java@JsonIgnoreProperties(ignoreUnknown = true)was already dropping it on the wire, so this never carried information end-to-end, but the Python wire shape does change).
(4) and (5) I'd defend as correctness fixes — the pre-fix behaviour was inconsistent with Java in ways that silently corrupted cross-language scenarios, and I don't see a backward-compat path that preserves the buggy behaviour while letting branched system tables work against a Java REST server.
(1) (2) (3) are the real API-shape break. I agree those need a soft-deprecation path. I'll push a follow-up commit on this PR that:
- Re-accepts
Identifier(..., branch=...)and routes it through the new encoding internally, emittingDeprecationWarning. - Re-exposes
identifier.branchas a property that delegates toget_branch_name(), also withDeprecationWarning. - Re-accepts the old
Identifier.create(db, object)two-arg form (detected by$in the second arg) and routes through the new constructor, withDeprecationWarning. - Adds tests that lock in both the warning and the equivalent-result behaviour, so we don't accidentally drop the shim before users have a chance to migrate.
- Adds a migration note to the PR description.
The plan would be to remove the shim in the next minor version. Does that sound reasonable?
There was a problem hiding this comment.
I don't know why you need to introduce _BRANCH_NOT_SET, Java also has many constructors.
There was a problem hiding this comment.
I don't know why you need to introduce
_BRANCH_NOT_SET, Java also has many constructors.
You're right — the sentinel was overkill, and the deprecation path it was
gating is unnecessary if we just match Java's constructor shape directly.
Pushed b17c946, which:
- Drops
_BRANCH_NOT_SETand allDeprecationWarning. __init__(database, object=None, branch=None, system_table=None)
mirrors Java's three public constructors in one signature: when
branch/system_tableis non-None, encode intoobject; otherwise
treatobjectas the final string (the@JsonCreatorpath).Identifier.createkeeps its existing kwargs and now just delegates
to the constructor — Java has only the 2-argcreate, but keeping
the multi-arg form here costs nothing and avoids touching every
caller introduced by the first commit.
Backward compatibility against the pre-PR public surface (the worry
from your first review) is preserved without warnings:
Identifier(db, obj, branch=...)/branch=None— accepted, encodes
intoobjectso the Java REST server now actually sees the branch.identifier.branch— read+write property. The setter re-encodes
objectsoid.branch = "x"keeps working for any caller that used
to assign to the dataclass field.Identifier.create(db, "tbl$snapshots")two-arg form — unchanged.
A new IdentifierBackwardCompatibilityTest locks these in by asserting
"the old call sites run without raising" (no warning assertions, no
shim semantics — these are first-class supported entry points now).
PTAL.
…nch=/branch/object-with-dollar surfaces Address review feedback on PR apache#7738: the new wire-aligned shape is correct, but the change shouldn't break callers that today use ``Identifier(database=..., object=..., branch=...)`` (kwarg form), ``identifier.branch`` (read), or ``Identifier.create(db, "tbl$snapshots")`` (legacy raw-object two-arg form). These surfaces now keep working for one more release with a ``DeprecationWarning``: * ``Identifier.__init__`` accepts ``branch=`` again. The branch is encoded into ``object`` internally so the wire output matches what ``Identifier.create(db, table, branch=...)`` would produce. * ``Identifier.branch`` is exposed as a ``@property`` that delegates to ``get_branch_name()``. * ``Identifier.create(db, table)`` detects the legacy raw-object shape (``$`` in the second arg, no branch / system_table kwargs) and stores the string verbatim in ``object``, preserving the pre-PR behaviour where ``create("db", "orders$snapshots")`` yielded a system-table identifier. All three paths emit a single ``DeprecationWarning`` with a clear migration suggestion. Scheduled for removal in the next minor release. The bug-fix changes — ``SYSTEM_BRANCH_PREFIX = 'branch_'`` (was ``'branch-'``, never matched Java) and dropping the un-roundtrippable ``branch`` JSON field — are kept as-is; they were already incorrect versus Java and there is no compat path that preserves the buggy behaviour while letting branched system tables work end-to-end. New tests in ``IdentifierBackwardCompatibilityShimTest`` lock in both the behavioural compatibility AND the warning, so the shim doesn't get silently dropped before users have a chance to migrate: * constructor branch= encodes into object, equal to create() * constructor branch="main" is not encoded * constructor branch=None still warns (deprecated kwarg) * .branch property delegates and warns * create(db, "$"-containing) falls back to raw-object form, warns * create(db, "no-dollar") does NOT warn * create(db, "$"-table, branch=...) skips the fallback (kwargs win)
|
Pushed a913e1461 addressing the review:
Kept the two correctness fixes — |
…entinel Address review on a913e14: instead of a `_BRANCH_NOT_SET` sentinel gating a DeprecationWarning, mirror Java's three public constructors (`Identifier(db, object)`, `Identifier(db, table, branch)`, `Identifier(db, table, branch, systemTable)`) directly in `__init__` -- `branch` / `system_table` not None routes through the encoding path, otherwise `object` is taken as the final string. Backward compatibility (pre-PR public surface keeps working without errors): * `Identifier(db, obj, branch=...)` / `branch=None` -- accepted. * `identifier.branch` -- read+write property; the setter re-encodes `object` so the wire stays consistent. * `Identifier.create(db, obj)` two-arg form -- unchanged. Tests: `IdentifierBackwardCompatibilityTest` now asserts the old call sites run without raising (no warning assertions, no shim semantics).
Purpose
Java's
org.apache.paimon.catalog.Identifierencodes the branch (and any system-table suffix) into theobjectfield —tbl$branch_dev$snapshots— so the wire shape is exactly two strings,databaseandobject. Python'sIdentifierwas out of step on three real points:SYSTEM_BRANCH_PREFIX = 'branch-'in Python vsbranch_(with underscore) in Java. Anything written by the Java REST server in the formtbl$branch_<name>falls into Python's two-segment branch-of-is_system_table(), where the prefix checkstartswith('branch-')returnsFalse→ the method returnsTrue. Today a branched table is misclassified as a system table by Python.branchwas a separate JSON field. Java's@JsonIgnoreProperties(ignoreUnknown = true)silently drops that field on the wire — Python clients callingIdentifier(database=..., object=..., branch=...)thought they were targeting a branch but the server never saw it.get_table_name()returned the rawobjectname including any encoded suffix, andget_branch_name()/get_system_table_name()couldn't decodetbl$branch_dev$snapshots-style names.This PR ports the Java design 1:1.
Identifier changes
Identifier(database, object)is now the JSON-create constructor; theobjectmay already carry an encoded branch / system_table.Identifier.create(database, table, branch=None, system_table=None)encodes the components into the object.branch == "main"(case-insensitive) is the default and is not encoded, matching Java._split_object_name()lazily decodes the object intotable/branch/system_table, with the same 1- / 2- / 3-segment rules as Java'ssplitObjectName().SYSTEM_BRANCH_PREFIXis nowbranch_.get_full_name()drops the database prefix when database isUNKNOWN_DATABASE(matches Java).__hash__/__eq__use only(database, object), so a JSON round-trip produces an equal instance.Backward-compat shim (added in commit 2 after review feedback)
Three previously-public surfaces keep working for one more release with a
DeprecationWarning, scheduled for removal in the next minor release:Identifier(database, object, branch=...)— accepted again; the branch is encoded intoobjectinternally so the wire output matchesIdentifier.create(db, table, branch=...).identifier.branch— exposed as a@propertythat delegates toget_branch_name().Identifier.create(db, "tbl$snapshots")— when the second positional arg already contains$and nobranch=/system_table=kwargs are supplied, the call is treated as the legacy raw-object form: the string is stored verbatim inobject, preserving the pre-PR behaviour wherecreate("db", "orders$snapshots")yielded a system-table identifier.The bug-fix changes (
SYSTEM_BRANCH_PREFIXcorrected tobranch_, dropping the un-round-trippablebranchJSON field) are kept as-is; they were already incorrect versus Java, and there is no compat path that preserves the buggy behaviour while letting branched system tables work end-to-end.Knock-on cleanup (single source of truth)
Now that the identifier carries the branch:
SnapshotCommit.commit()(abstract +CatalogSnapshotCommit+RenamingSnapshotCommit) drops itsbranchparameter;FileStoreCommitstops passingtable.current_branch().FileStoreTable.current_branch()reads from the identifier (wasself.options.branch()).FileStoreTable.copy()re-encodes the branch into a new identifier whenCoreOptions.BRANCHis supplied, soCatalogEnvironmentsees the branched object name.RESTCatalog.to_table_metadata()drops the now-redundant block that copied the branch back into options.RESTTokenFileIO.refresh_token()usesIdentifier.create(db, table, branch=...)to strip the system-table suffix while preserving the branch.table.branchsuffix and readsidentifier.get_branch_name()directly, matching the Java URL convention.Migration
Identifier(db, obj, branch=b)Identifier.create(db, table, branch=b)identifier.branchidentifier.get_branch_name()(orget_branch_name_or_default())Identifier.create(db, "tbl$snapshots")Identifier(db, "tbl$snapshots")(raw) orIdentifier.create(db, "tbl", system_table="snapshots")The deprecated forms keep working for this release; deprecation warnings indicate the migration path. Removal targeted at the next minor release.
Linked issue
N/A — surfaced when running pypaimon against a Java REST server that emits
tbl$branch_<name>object names.Tests
pypaimon/tests/identifier_test.py::IdentifierTest— 25 cases covering simple / backtick / Java-compatible parsing, branch encoding,main-branch normalization,UNKNOWN_DATABASE, one- / two- / three-segment object names with and without branch prefix,create()with branch + system_table, equality through the JSON shape, and the malformedtbl$x$y-without-branch-prefix case.pypaimon/tests/identifier_test.py::IdentifierBackwardCompatibilityShimTest— 8 cases locking in the shim: constructorbranch=warns + encodes correctly;branch="main"not encoded;branch=Nonestill warns;.branchproperty delegates + warns;create(db, "$"-containing)falls back to raw-object form + warns;create(db, "no-dollar")does NOT warn;create(db, "$"-table, branch=...)skips the fallback so kwargs win.catalog_branch_manager_test,file_store_table_test,rest_catalog_commit_snapshot_test,rest_token_file_io_test— updated to construct branched identifiers viaIdentifier.create(...)or thedb.tbl$branch_<x>string form.Local:
pytest pypaimon/tests/{identifier_test,branch,table/file_store_table_test,file_store_commit_test,rest/{client_test,rest_token_file_io_test,rest_catalog_commit_snapshot_test}.py}→ 33 identifier tests + surrounding suites green; the only failure in those suites is the pre-existing lance environment issue (unrelated to this PR).flake8 --config=dev/cfg.iniclean.API and format
branchJSON field was already silently dropped by Java; nothing that successfully round-tripped through the Java REST server is affected. After the fix, branch-aware Python clients now correctly produceobject="tbl$branch_<name>"that Java decodes the same way.Identifier(database, object, branch=...)andidentifier.branchand the legacyIdentifier.create(db, object-with-dollar)form keep working for this release with aDeprecationWarning. Migration table above.Documentation
The new
create()signature, the encoding rules, and the deprecation paths are documented inline onIdentifier,__init__,create(), andbranch.Generative AI disclosure
Drafted with assistance from an AI coding tool; the design is a direct 1:1 port of
org.apache.paimon.catalog.Identifier(paimon-api), and every behavioural difference between Java and Python that this PR closes is exercised by a test inidentifier_test.py. The deprecation shim was added in response to review feedback (#7738 review thread onidentifier.py:29).