fix: thread-safe cache writes and feature update handling #114
fix: thread-safe cache writes and feature update handling #114vazarkevych wants to merge 1 commit into
Conversation
|
@vazarkevych - thanks for identifying these tricky issues. I guess the most pressing issue here is: the cache-miss stampede in and, P2 list includes callback list mutation during notification, sticky bucket boolean lock. but the blast radius is limited. so, I’d salvage this by porting these ideas onto current |
madhuchavva
left a comment
There was a problem hiding this comment.
resolve the conflicts and address the review comments please
874a5b2 to
3d78f53
Compare
Thanks for the advice — agreed on the priorities. I've rebuilt this branch on top of current main (post remote-eval) rather than merging, so the stale conflicts are gone |
All the items from your list are in: thread-safe The only non-obvious bit is the async coalescing lock — it's keyed by (event-loop id, cache key) to avoid reusing a lock bound to a finished loop. |
Problem
Several race conditions existed in cache and feature-update handling:
nMemoryFeatureCachehad no locking - concurrent reads/writes could corrupt cache entries.FeatureRepository.load_features/load_features_asynchad no fetch coalescing — on a cold cache, many threads/coroutines requesting the same SDK payload could all hit the GrowthBook API/CDN at once (cache-miss stampede). Note: Python's GIL does not help here, as it is released during the blocking HTTP fetch._feature_update_callbackswas mutated and iterated without a lock - concurrentadd/remove/notifycould raiseRuntimeError: list changed size during iteration._sticky_bucket_cache_lockwas a boolean flag, not a real lock - the spin-loop was not thread-safe and silently returned{}when the "lock" was held.FeatureCache.get_current_statereturned a mutable reference tosavedGroupsinstead of a copy.Changes
This branch was rebuilt on top of current main (post remote-eval) to avoid the stale conflicts from the original PR:
InMemoryFeatureCache:threading.Lockaroundget/set/clear.load_features/load_features_async: on a miss, only the first caller fetches for a given cache key; others wait and read the freshly-cached value (double-checked under the lock). Cache hits return before acquiring any lock, so there is no overhead on the hot path. Async locks are keyed by (event-loop id, cache key) to avoid reusing a lock bound to a finished loop (the cross-loopasyncio.Lockhang fixed earlier in main).add/removeguarded by adedicated _callbacks_lock;_notifycopies the list under the lock and iterates outside it, preventing both the iteration error and deadlocks from slow callbacks.FeatureCache.get_current_state: returns a dict() copy of savedGroups.asyncio.Lock()and simplified_refresh_sticky_buckets(re-check under the lock; removed the silent {} fallback)._compute_cache_key) are untouched; the remote-eval path keeps its existing_remote_eval_inflightcoalescing and is intentionally left out of the new lock.force_refreshsemantics (the re-check under the lock also honors force_refresh, so SSE invalidation still triggers a refetch).