Skip to content

fix: sync getBulk correctly decodes mixed hashed and plain keys#195

Open
bihaoxwork wants to merge 4 commits into
masterfrom
sync-getbulk-mixed-keys
Open

fix: sync getBulk correctly decodes mixed hashed and plain keys#195
bihaoxwork wants to merge 4 commits into
masterfrom
sync-getbulk-mixed-keys

Conversation

@bihaoxwork

@bihaoxwork bihaoxwork commented May 28, 2026

Copy link
Copy Markdown

Summary

Test

  • Existing EVCacheTestDI async bulk get tests pass
  • Re-enabled sync getBulk test path passes with mixed hashed/unhashed keys
  • No regressions for plain-key-only and hashed-key-only bulk requests

Refactor getBulkData to split hashed vs plain keys and route each set
through the appropriate transcoder path, fixing incorrect decoding when
a single bulk request contains both hashed and unhashed keys. Re-enable
the previously disabled getBulk test path in EVCacheTestDI.

Co-authored-by: Bihao Xu <bxu@netflix.com>
@bihaoxwork bihaoxwork requested review from AustinWheel, akhaku, Copilot and shy-1234 and removed request for akhaku and Copilot May 31, 2026 19:08
@bihaoxwork bihaoxwork requested a review from srrangarajan June 2, 2026 22:39
@shy-1234

shy-1234 commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

The PR description seems misleading. The mixed-key-aware asyncGetBulk entry point was introduced in #181. (There was also another follow up PR #184) Please update the description when you get a chance to make sure the right context is linked.

Comment thread evcache-client/test/com/netflix/evcache/test/EVCacheTestDI.java Outdated
Comment thread evcache-client/test/com/netflix/evcache/test/EVCacheTestDI.java Outdated
Comment thread evcache-core/src/main/java/com/netflix/evcache/pool/EVCacheClient.java Outdated
@bihaoxwork

Copy link
Copy Markdown
Author

Code Review

Summary

Refactors the synchronous getBulk path so mixed hashed/plain key batches decode each key with the correct transcoder, adds chunking support for that path, and re-enables the previously-disabled mixed-key test. The core fix is correct and well-covered; findings below are non-blocking quality/observability items.

Approval Recommendation

Comment — no blocking issues. The mixed-key decode, the transcoder-fallback port from #184, and the chunked+hashed read path were all independently verified correct. Consider the null-guard suggestion and the test/observability considerations at your discretion.

Issues Found

Suggestions

  • On the chunking path, getBulk(plainKeys, hashedKeys, …) does return assembleChunks(...) directly, but assembleChunks swallows internal errors and returns null (its catch falls through to return null), bypassing getBulk's own outer catch that returns Collections.emptyMap(). getBulkData then calls buildKeyValueResult(objMap, …) with no null check, and buildKeyValueResult dereferences objMap.size() → NPE. When throwException=true && hasZF=false, the caller sees a confusing NullPointerException instead of the original exception (the pre-refactor plain path returned emptyMap() and never NPE'd). Suggest null-guarding before buildKeyValueResult, or normalizing assembleChunks's null return to emptyMap().
    (confidence: 82)
    evcache-core/.../EVCacheImpl.java#L1994-L1995 · EVCacheClient.java#L1056-L1057

Considerations

  • decodeForKey silently returns null when a hashed key's envelope decode does not yield an EVCacheValue (the instanceof check fails) — no debug log, no metric. The non-chunked path (EVCacheMemcachedClient.asyncGetBulk) logs this exact case ("did not yield an instance of EVCacheValue, this could be due to collision"). Adding an equivalent debug log (and/or reusing the KEY_HASH_COLLISION metric) would restore diagnostic parity on the chunked path.
    (confidence: 73)
    EVCacheClient.java#L584-L591
  • The debug log "fetching bulk data with set of keys containing hashed key(s)" now fires unconditionally for every bulk get, including all-plain-key batches (which the refactor routes through this same path). The async counterpart branches the message on whether hashedKeysMap is empty; the sync path lost that branch and will mislead anyone reading debug output.
    (confidence: 73)
    EVCacheImpl.java#L1991-L1992
  • doChunkingTests(true) (hashing + chunking) never asserts that chunking actually happened — the getAllChunks(...) structural check is guarded by if (!hashingEnabled). The round-trip get/getBulk assertions do exercise the chunked+hashed decode end-to-end, but if a value were silently stored unchunked (wrong branch taken), the test would still pass via the non-chunked branch in assembleChunks. Consider an explicit getAllChunks(hashKey) assertion in the hashed branch (using the actual hash key string) to close the gap.
    (confidence: 72)
    EVCacheTestDI.java#L376-L383
  • The comment "This exercises the mixed-key, chunk-aware getBulk" is inaccurate: with app-level hash.key=true, every key in doChunkingTests is hashed, so buildKeyMap produces an empty plainKeys set — this is an all-hashed (not mixed) chunking scenario. Minor comment wording.
    (confidence: 70)
    EVCacheTestDI.java#L353-L354

Pre-existing Issues

Exists on unchanged code (introduced in #181 for the async path), now also reachable from the sync path. Not reachable through normal flow since memcached only returns requested keys — listed for awareness.

  • buildKeyValueResult does getOrDefault(wireKey, hashedMap.get(wireKey)); if a returned wire key is in neither map, evcKey is null and retMap.put(null, val) silently inserts a null key.
    EVCacheImpl.java#L1962-L1965

Questions for the Author

None — the chunking-routing design and the chunked+hashed read path were verified correct (chunks are stored under hashKey_00, reassembled, then 2-step decoded), and the open human-review questions were already addressed.

Notes

  • Verified correct: (1) the mixed-key split decode (plain → 1-step, hashed → 2-step envelope decode with collision check); (2) the transcoder-fallback ternary correctly ported from the async path (fix: async bulk get transcoder fallback for plain keys in mixed-key requests #184); (3) the chunked+hashed read path (write stores the EVCacheValue envelope chunked under hashKey_NN; read reassembles and decodes via the envelope then value transcoder). The user-visible result is unchanged because the final decanonicalization loop normalizes absent-key and null-value to the same key → null.
  • The refreshEVCache change that closes the previous lifecycleManager before rebuilding is a good catch — it stops the per-refresh DI-container leak.
  • Already acknowledged by the human reviewer and not re-flagged here: unused variable (L359), assert-inside-loop (L461), and the PR-description attribution to Fix: asyncBulkGet decoding of hashed and unhashed keys together in one request #181/fix: async bulk get transcoder fallback for plain keys in mixed-key requests #184.
  • Two candidate findings were investigated and dropped after verification: a claimed RuntimeException-from-decodeForKey deadlock in the non-chunked path (false — decodeForKey is only called from the chunked assembleChunks), and a chunked-vs-non-chunked null-value divergence (false at the API level — normalized away downstream).
  • Agent 7 (Codex) ran but exhausted its turn budget during code exploration, surfacing only a trailing-whitespace nit (assumed CI/style-handled). Internal Netflix context sources (Jira/Slack/Manuals) were not applicable to this public OSS PR.

Generated with pr-review-plugin (Claude + Codex)
React with a thumbs up if useful, thumbs down otherwise.

@bihaoxwork

Copy link
Copy Markdown
Author

Code Review

Summary

Re-review of the delta since the last review (1 new commit, no new discussion), now cross-checked against the team's EVCache review guidelines. The new commit d3916b1 ("fix: drop collided keys and avoid NPE in chunked getBulk") correctly resolves the previous NPE finding, and its "drop collided keys" behavior is exactly what our guidelines prescribe for hashed/raw key collisions. No new blocking issues. The remaining items are non-blocking considerations carried over from the previous review, plus one pre-existing logging-convention gap.

Approval Recommendation

Comment — the NPE fix is verified correct and the collision-drop change is on-standard. Items below are non-blocking.

Issues Found

Considerations

  • (Carried over, now amplified) On the chunked hashed-key path, decodeForKey returns null for a non-EVCacheValue envelope with no debug log and no metric. The collision case is logged/metered via the collisionChecker lambda, but the !(obj instanceof EVCacheValue) branch falls straight to return null and never invokes it — unlike the non-chunked EVCacheMemcachedClient.asyncGetBulk, which logs this exact case. After this PR the null now causes the key to be silently dropped from the result, so the corner case is fully invisible.
    (confidence: 72)
    EVCacheClient.java#L584-L591
  • (Unchanged from previous review) The debug log "fetching bulk data with set of keys containing hashed key(s)" in the sync getBulkData fires unconditionally for every bulk get, including all-plain-key batches, so the "containing hashed key(s)" wording is misleading.
    (confidence: 73)
    EVCacheImpl.java#L1991-L1992
  • (Unchanged from previous review) doChunkingTests(true) (hashing + chunking) never asserts that chunking actually happened — the getAllChunks(...) structural check is guarded by if (!hashingEnabled). The round-trip assertions still pass via the non-chunked branch in assembleChunks even if a value were silently stored unchunked. Consider an explicit getAllChunks(hashKey) assertion in the hashed branch.
    (confidence: 72)
    EVCacheTestDI.java#L376-L383
  • (Unchanged from previous review) The comment "This exercises the mixed-key, chunk-aware getBulk" is inaccurate: with app-level hash.key=true, every key is hashed, so buildKeyMap produces an empty plainKeys set — this is an all-hashed (not mixed) chunking scenario.
    (confidence: 70)
    EVCacheTestDI.java#L353-L354

Documentation Suggestions

  • EVCacheClient.java:686-691assembleChunks now intentionally drops keys whose decode returns null (collision / non-EVCacheValue) — which is the prescribed handling for collisions — while the adjacent checksum-failure branch still inserts null (returnMap.put(ci.getKey(), null)). At the user-facing API the two are normalized to the same key → null by the decanonicalization loop in EVCacheImpl.getBulkData, but they diverge in the hasZF zone-fallback path (a dropped key becomes retry-eligible; a null-valued key does not). The drop is deliberate (per the commit message and the team's "drop the key" guideline); a one-line comment explaining why collision-drop and checksum-fail use different sentinels would prevent a future reader from "fixing" the asymmetry.

Pre-existing Issues

These exist on unchanged code and predate this commit; listed for awareness only.

  • The catch in assembleChunks logs log.error(e.getMessage(), e) — against the team logging guidelines, this should be warn for expected-degraded/recoverable reads (e.g. when hasZF=true or _throwException=false, the failure is suppressed/retried), and it should include the app name and server group for operator context. The log.error line itself is pre-existing; the new commit's throw e now also lets the same exception double-log downstream in getBulkData.
    EVCacheClient.java#L695-L698
  • buildKeyValueResult does getOrDefault(wireKey, hashedMap.get(wireKey)); if a returned wire key is in neither map, evcKey is null and retMap.put(null, val) silently inserts a null key. (Not reachable through normal flow since memcached only returns requested keys.)
    EVCacheImpl.java#L1962-L1965
  • In the ArrayCopyError diagnostic loop inside assembleChunks, dataMap.get(skey).getData() is dereferenced without a null check. If any skey is absent from dataMap, this throws an NPE inside the catch handler, masking the original array-copy exception. The new commit's throw e makes the (masked) exception propagate to callers.
    EVCacheClient.java#L673-L676

Questions for the Author

None — the open questions about the drop-vs-null intent are covered by the Documentation Suggestion above and have no functional impact (downstream normalization).

Notes

  • Aligned with team EVCache review guidelines. The "drop collided keys" change matches the documented standard ("when a key appears as both a hashed and a raw key, drop the key rather than surfacing a decode error to the caller"). The decode path also follows the on-wire-format contract: it detects the envelope via instanceof EVCacheValue (not a flag) and re-decodes the inner value using the envelope's own flags. This is a read-path-only change with no wire/cache-format impact, so it is backward compatible with older clients.
  • Prior NPE finding is fixed (verified). assembleChunks now declares throws Exception and re-throws instead of returning null; both getBulk overloads catch it and return Collections.emptyMap() (or rethrow when _throwException=true), and the mixed-key call site additionally normalizes a null result to emptyMap(). The buildKeyValueResult NPE path is no longer reachable. Not re-flagged.
  • Collision-drop is a net improvement in the hasZF path: a dropped collided key is now retry-eligible against fallback zones (the value we held belonged to a different key), whereas the previous put(null) counted it as present and skipped fallback.
  • Agent 7 (Codex) ran cleanly and reported no actionable correctness issues in the changed code.
  • Already addressed by human review and not re-flagged: unused variable (L359), assert-inside-loop (L461), and the PR-description attribution to Fix: asyncBulkGet decoding of hashed and unhashed keys together in one request #181/fix: async bulk get transcoder fallback for plain keys in mixed-key requests #184.
  • Public OSS repo (github.com): review links use github.com.
  • Previous review feedback: 0 thumbs up, 0 thumbs down (standard thresholds applied).

Generated with pr-review-plugin (Claude + Codex)
React with a thumbs up if useful, thumbs down otherwise.

@bihaoxwork

Copy link
Copy Markdown
Author

Code Review

Summary

Third (delta) review. The delta since the last review is a single commit (242cce8, "Add test for mixed-key chunked getBulk") that was written specifically to address the carried-over considerations — and it does: all four are now resolved. The two production changes in this commit are behavior-preserving debug-log improvements. No new blocking issues; one minor logging-context consideration remains.

Approval Recommendation

Approve — the mixed-key decode fix has been independently verified correct across three reviews, the chunked + hashed read path is now structurally asserted (plain path) and round-trip-verified (hashed path), and the previously-flagged NPE and observability gaps are closed. The single consideration below is trivial and non-blocking.

Issues Found

Considerations

  • The new decodeForKey debug log includes the app name but not the server group / zone. Every comparable debug log in EVCacheClient includes zone (e.g. the appName … zone … pattern used throughout the class), and the team logging guideline asks for app name and server group in log context. The fields zone (L81) and serverGroup (L82) are in scope at this call site. (Note: this log intentionally mirrors the non-chunked sibling in EVCacheMemcachedClient.asyncGetBulk, which also omits the zone — so this is a minor, optional consistency nit, and the key is already in the message.)
    (confidence: 72)
    EVCacheClient.java#L590-L595

Questions for the Author

None.

Notes

  • Prior-review considerations — all addressed by 242cce8:
    • Misleading unconditional "containing hashed key(s)" debug log → now reports accurate "N plain and M hashed key(s)" counts (EVCacheImpl.getBulkData).
    • Silent key-drop on a non-EVCacheValue envelope → debug log added in decodeForKey, mirroring the non-chunked path. (No metric was added, but verified consistent with the non-chunked sibling asyncGetBulk, which also only logs — intentional parity, not a regression. Scored below threshold.)
    • doChunkingTests(true) never asserting chunking actually happened → replaced with doMixedKeyChunkingTests + assertChunked / assertNotChunked helpers (structural assertion on the plain path, round-trip verification on the hashed path).
    • Inaccurate "mixed-key" comment → the test now uses auto.hash.keys=true, producing a genuine plain + hashed mix (short keys stay plain, long keys auto-hash), verified against EVCacheImpl L224.
  • Dropped after verification (scored < 70): string-concatenation in the two new debug logs (40 / 35 — matches the prevailing concatenation style in both files and is isDebugEnabled-guarded); missing KEY_HASH_COLLISION metric on the non-EVCacheValue branch (42 — consistent with the non-chunked path, deliberate mirror); removal of the standalone hash.key=true + chunking test arm (42 — the new mixed test already exercises every line of the hashed + chunked decode path; the all-hashed case is a degenerate sub-case with no diverging branch in assembleChunks/decodeForKey); test-diagnostic nits (assertNotNull(results) NPE-masking; ignored latch.await return; refreshEVCache double-close only if setupEnv throws mid-rebuild; the "absence proves chunking" comment being imprecise). Dropped questions: ZF-retry-on-corrupt-hashed-chunk (38 — pre-existing plain-path behavior, only newly reachable) and "is hash.key+chunking supported" (42 — answerable from code).
  • Pre-existing items from the prior review remain unchanged (informational, not re-flagged): log.error (vs warn) in the assembleChunks catch; buildKeyValueResult put(null, …) when a wire key is in neither map (not reachable in normal flow); ArrayCopyError diagnostic loop dereferencing dataMap.get(skey).getData() without a null check.
  • Codex (Agent 7) ran successfully on the full diff and reported no actionable correctness issues.
  • Already addressed by human review and not re-flagged: unused variable (L359), assert-inside-loop (L461), PR-description attribution to Fix: asyncBulkGet decoding of hashed and unhashed keys together in one request #181/fix: async bulk get transcoder fallback for plain keys in mixed-key requests #184.
  • Public OSS repo (github.com): review links use github.com. Internal Netflix context sources (Jira/Slack/Manuals/Sourcegraph) are not applicable.
  • Previous review feedback: 0 thumbs up, 0 thumbs down (standard thresholds applied).

Generated with pr-review-plugin (Claude + Codex)
React with a thumbs up if useful, thumbs down otherwise.

@bihaoxwork bihaoxwork force-pushed the sync-getbulk-mixed-keys branch from 242cce8 to 79d1d73 Compare June 8, 2026 22:13
@bihaoxwork bihaoxwork requested review from Copilot and removed request for Copilot and srrangarajan June 15, 2026 17:10
}
}

final Map<String, CachedData> dataMap = evcacheMemcachedClient.asyncGetBulk(allKeys, chunkingTranscoder, null, chunkingNodeValidator(Call.BULK))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still trying to understand what assembleChunks does, so just curious should we still be using this deprecated version of asyncGetBulk here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC these are the keys that we encoded using chunkingTranscoder? is this a third layer of encoding/decoding on top of the valueTranscoder and envelopeTranscoder?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: If it's correct to use this single transcoder version method, we can either

  • remove the deprecated annotation in that method and enhance its comment to justify its use case or
  • use the double transcoder version and pass null as the second transcoder.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good questions, let me address all three together since they're related.

On the "third layer of encoding" (and what ChunkTranscoder is): ChunkTranscoder isn't really an encode/decode layer. It's an identity transcoder, decode and encode both just return the input unchanged:

public CachedData decode(CachedData d) { return d; }
public CachedData encode(CachedData o) { return o; }

Its only job is to avoid deserializing so we can pull the raw chunk bytes off the wire and reassemble them. So there's no third codec. The actual decoding is still just the two layers you mentioned, done in decodeForKey after reassembly: envelopeTranscoder to unwrap the EVCacheValue, then valueTranscoder for the payload.

On using the deprecated asyncGetBulk here: because of the above, at this point we're only reading chunk keys (key_00, key_NN) as raw bytes, every key takes the exact same single-step read, there's no plain-vs-hashed transcoder distinction at the chunk layer. The deprecation note only warns against using this overload when you need the memcached layer to do the 2-step hashed decode, which doesn't apply to chunk reads. So functionally it's the right method here. I think we can remove deprecated annotation.

I will use the double transcoder version also add some comments back assembleChunks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants