fix: thread-safe cache writes and feature update handling #114

Open
vazarkevych wants to merge 1 commit into growthbook:main from vazarkevych:fix/thread-safe-cache-writes

vazarkevych commented Apr 23, 2026

Collaborator

Problem

Several race conditions existed in cache and feature-update handling:

  • InMemoryFeatureCache had no locking - concurrent reads/writes could corrupt cache entries.
  • FeatureRepository.load_features / load_features_async had no fetch coalescing — on a cold cache, many threads/coroutines requesting the same SDK payload could all hit the GrowthBook API/CDN at once (cache-miss stampede). Note: Python's GIL does not help here, as it is released during the blocking HTTP fetch.
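  • _feature_update_callbacks was mutated and iterated without a lock - concurrent add/remove/notify could change the list mid-iteration, so callbacks could be skipped or still be invoked after removal. (A plain Python list does not raise when resized during iteration; that RuntimeError applies to dicts and sets.)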
  • _sticky_bucket_cache_lock was a boolean flag, not a real lock - the spin-loop was not thread-safe and silently returned {} when the "lock" was held.
  • FeatureCache.get_current_state returned a mutable reference to savedGroups instead of a copy.
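
To make the last point concrete, here is a minimal sketch of the aliasing problem. `LeakyState` and its method names are hypothetical stand-ins, not the SDK's classes. It also shows that a `dict()` copy is shallow, so nested values stay shared.

```python
class LeakyState:
    """Hypothetical stand-in for a cache that exposes internal state."""

    def __init__(self):
        self._saved_groups = {"beta_users": ["u1", "u2"]}

    def get_by_reference(self):
        # Hands out the internal dict itself: callers can mutate cache state.
        return self._saved_groups

    def get_copy(self):
        # Hands out a shallow copy: top-level keys are isolated from callers.
        return dict(self._saved_groups)


state = LeakyState()
leaked = state.get_by_reference()
leaked["injected"] = ["x"]  # silently changes the cache
reference_leaks = "injected" in state.get_by_reference()

state2 = LeakyState()
copied = state2.get_copy()
copied["injected"] = ["x"]  # only changes the caller's copy
copy_isolated = "injected" not in state2.get_by_reference()

# dict() is a shallow copy, so the nested list is still the same object.
copied["beta_users"].append("u3")
nested_still_shared = "u3" in state2.get_by_reference()["beta_users"]
```

If callers are expected to mutate nested values, `copy.deepcopy` would be the stricter option, at the cost of copying the whole structure on every call.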

Changes

This branch was rebuilt on top of current main (post remote-eval) to avoid the stale conflicts from the original PR:

  • InMemoryFeatureCache: threading.Lock around get / set / clear.
  • Per-key fetch coalescing in load_features / load_features_async: on a miss, only the first caller fetches for a given cache key; others wait and read the freshly-cached value (double-checked under the lock). Cache hits return before acquiring any lock, so there is no overhead on the hot path. Async locks are keyed by (event-loop id, cache key) to avoid reusing a lock bound to a finished loop (the cross-loop asyncio.Lock hang fixed earlier in main).
  • Callback delivery: add/remove guarded by a dedicated _callbacks_lock; _notify copies the list under the lock and iterates outside it, preventing both mid-iteration mutation of the list being walked and deadlocks from slow callbacks.
  • FeatureCache.get_current_state: returns a dict() copy of savedGroups.
  • Sticky buckets: replaced the boolean flag with a real asyncio.Lock() and simplified _refresh_sticky_buckets (re-check under the lock; removed the silent {} fallback).
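
For readers who have not seen the pattern, the per-key coalescing described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this branch: `CoalescingLoader`, `_lock_for`, and the `fetch` callable are hypothetical names, and the cache is a plain dict rather than the SDK's cache classes.

```python
import threading
import time
from typing import Callable, Dict


class CoalescingLoader:
    """Hypothetical loader: at most one fetch per cache key on a cold cache."""

    def __init__(self, fetch: Callable[[str], dict]):
        self._fetch = fetch
        self._cache: Dict[str, dict] = {}
        self._cache_lock = threading.Lock()       # guards _cache
        self._key_locks: Dict[str, threading.Lock] = {}
        self._key_locks_guard = threading.Lock()  # guards _key_locks

    def _cache_get(self, key: str):
        with self._cache_lock:
            return self._cache.get(key)

    def _lock_for(self, key: str) -> threading.Lock:
        with self._key_locks_guard:
            lock = self._key_locks.get(key)
            if lock is None:
                lock = self._key_locks[key] = threading.Lock()
            return lock

    def load(self, key: str, force_refresh: bool = False) -> dict:
        # Fast path: a hit returns without touching the per-key lock.
        if not force_refresh:
            cached = self._cache_get(key)
            if cached is not None:
                return cached
        with self._lock_for(key):
            # Re-check under the lock: another caller may have filled the cache.
            # force_refresh skips the re-check so it always refetches.
            if not force_refresh:
                cached = self._cache_get(key)
                if cached is not None:
                    return cached
            value = self._fetch(key)
            with self._cache_lock:
                self._cache[key] = value
            return value


calls = []  # list.append is atomic in CPython, so no extra lock is needed here


def slow_fetch(key: str) -> dict:
    calls.append(key)
    time.sleep(0.05)  # stands in for the blocking HTTP request
    return {"key": key, "features": {}}


loader = CoalescingLoader(slow_fetch)
threads = [threading.Thread(target=loader.load, args=("sdk-abc",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
fetches_after_burst = len(calls)

loader.load("sdk-abc", force_refresh=True)
fetches_after_refresh = len(calls)
```

Eight concurrent cold-cache callers produce a single fetch, and a later `force_refresh=True` call fetches again even though the cache is warm.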

Preserved from current main

  • Remote-eval cache keys (_compute_cache_key) are untouched; the remote-eval path keeps its existing _remote_eval_inflight coalescing and is intentionally left out of the new lock.
  • force_refresh semantics (the re-check under the lock also honors force_refresh, so SSE invalidation still triggers a refetch).
  • SSE invalidation and async client behavior unchanged.

vazarkevych requested a review from madhuchavva April 23, 2026 14:29

madhuchavva commented Jun 18, 2026

Contributor

@vazarkevych - thanks for identifying these tricky issues.

I guess the most pressing issue here is the cache-miss stampede in FeatureRepository.load_features / load_features_async: if many threads or coroutines ask for the same uncached SDK payload at once, they can all hit the GrowthBook API/CDN simultaneously.

The P2 list includes callback list mutation during notification and the sticky-bucket boolean lock, but the blast radius there is limited. So I’d salvage this by porting these ideas onto current main: thread-safe InMemoryFeatureCache, snapshot callback delivery, per-key fetch coalescing for sync/async loads, savedGroups copy semantics, and a real sticky-bucket lock. Many changes have gone in since, so we'll need to preserve current remote-eval cache keys, force_refresh, SSE invalidation, and async client behavior.
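
For reference, "snapshot callback delivery" could look roughly like the sketch below. `CallbackRegistry` and its method names are hypothetical, and swallowing callback exceptions is an assumption made for the example, not something the PR specifies.

```python
import threading
from typing import Callable, List


class CallbackRegistry:
    """Hypothetical registry: mutate under a lock, notify from a snapshot."""

    def __init__(self):
        self._callbacks: List[Callable[[dict], None]] = []
        self._callbacks_lock = threading.Lock()

    def add(self, cb: Callable[[dict], None]) -> None:
        with self._callbacks_lock:
            if cb not in self._callbacks:
                self._callbacks.append(cb)

    def remove(self, cb: Callable[[dict], None]) -> None:
        with self._callbacks_lock:
            if cb in self._callbacks:
                self._callbacks.remove(cb)

    def notify(self, payload: dict) -> None:
        # Copy under the lock, then release it before calling user code.
        with self._callbacks_lock:
            snapshot = list(self._callbacks)
        for cb in snapshot:
            try:
                cb(payload)
            except Exception:
                # Assumption: one failing callback should not block the rest.
                pass


registry = CallbackRegistry()
seen = []


def one_shot(payload: dict) -> None:
    seen.append(("one_shot", payload["version"]))
    # Re-enters the registry; safe because notify() holds no lock here.
    registry.remove(one_shot)


def steady(payload: dict) -> None:
    seen.append(("steady", payload["version"]))


registry.add(one_shot)
registry.add(steady)
registry.notify({"version": 1})
registry.notify({"version": 2})
```

Because `notify` iterates a copy, a callback that unregisters itself neither deadlocks on the non-reentrant lock nor causes `steady` to be skipped in the same round.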

madhuchavva left a comment

Contributor

resolve the conflicts and address the review comments please

vazarkevych force-pushed the fix/thread-safe-cache-writes branch from 874a5b2 to 3d78f53 June 25, 2026 12:41
@vazarkevych

Collaborator Author

> resolve the conflicts and address the review comments please

Thanks for the advice, agreed on the priorities. I've rebuilt this branch on top of current main (post remote-eval) rather than merging, so the stale conflicts are gone.

@vazarkevych

Collaborator Author

> @vazarkevych - thanks for identifying these tricky issues.
>
> I guess the most pressing issue here is the cache-miss stampede in FeatureRepository.load_features / load_features_async: if many threads or coroutines ask for the same uncached SDK payload at once, they can all hit the GrowthBook API/CDN simultaneously.
>
> The P2 list includes callback list mutation during notification and the sticky-bucket boolean lock, but the blast radius there is limited. So I’d salvage this by porting these ideas onto current main: thread-safe InMemoryFeatureCache, snapshot callback delivery, per-key fetch coalescing for sync/async loads, savedGroups copy semantics, and a real sticky-bucket lock. Many changes have gone in since, so we'll need to preserve current remote-eval cache keys, force_refresh, SSE invalidation, and async client behavior.

All the items from your list are in: thread-safe InMemoryFeatureCache, snapshot callback delivery, per-key fetch coalescing for both sync and async loads, savedGroups copy, and a real sticky-bucket lock. Remote-eval cache keys, force_refresh, SSE invalidation and async behavior are preserved (the remote-eval path is left out of the new lock since it already coalesces via _remote_eval_inflight).

The only non-obvious bit is the async coalescing lock — it's keyed by (event-loop id, cache key) to avoid reusing a lock bound to a finished loop.
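
A rough sketch of that keying, for anyone reviewing the idea rather than the diff. `AsyncCoalescingLoader`, `_lock_for`, and `fake_fetch` are hypothetical names, and the sketch skips details a real implementation would likely need, such as pruning locks for closed loops and guarding the lock table across threads.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple


class AsyncCoalescingLoader:
    """Hypothetical loader: one asyncio.Lock per (event loop, cache key)."""

    def __init__(self, fetch: Callable[[str], Awaitable[dict]]):
        self._fetch = fetch
        self._cache: Dict[str, dict] = {}
        self._locks: Dict[Tuple[int, str], asyncio.Lock] = {}

    def _lock_for(self, key: str) -> asyncio.Lock:
        # Keying by loop id means a lock bound to a finished loop is never
        # handed to a coroutine running on a different loop.
        lock_key = (id(asyncio.get_running_loop()), key)
        lock = self._locks.get(lock_key)
        if lock is None:
            lock = self._locks[lock_key] = asyncio.Lock()
        return lock

    async def load(self, key: str) -> dict:
        cached = self._cache.get(key)
        if cached is not None:
            return cached  # hot path: no lock involved
        async with self._lock_for(key):
            cached = self._cache.get(key)  # re-check under the lock
            if cached is not None:
                return cached
            value = await self._fetch(key)
            self._cache[key] = value
            return value


calls = []


async def fake_fetch(key: str) -> dict:
    calls.append(key)
    await asyncio.sleep(0.01)  # stands in for the HTTP request
    return {"key": key}


loader = AsyncCoalescingLoader(fake_fetch)


async def burst():
    return await asyncio.gather(*(loader.load("sdk-abc") for _ in range(5)))


first = asyncio.run(burst())   # first event loop
calls_after_first = len(calls)

loader._cache.clear()          # simulate a cold cache again
second = asyncio.run(burst())  # a new event loop gets its own lock
```

Each `asyncio.run` call creates a fresh loop, so the second burst uses a new lock instead of one whose waiters were tied to the first loop. The lock table grows by one entry per loop and key in this sketch, which is why some pruning would be needed in practice.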

vazarkevych requested a review from madhuchavva June 25, 2026 13:00