Skip to content

PLT-0000 fix(deploy): serialize repocache access to prevent stale git locks#151

Open
uptickmetachu wants to merge 2 commits intodevelopfrom
plt-0000/fix-repocache-race
Open

PLT-0000 fix(deploy): serialize repocache access to prevent stale git locks#151
uptickmetachu wants to merge 2 commits intodevelopfrom
plt-0000/fix-repocache-race

Conversation

@uptickmetachu
Copy link
Copy Markdown
Collaborator

Motivation

The gitops pod in the gitops namespace failed a deployment with:

Run: cd /tmp/tmpom_5nw_l;git fetch; git checkout bc929a9… returned with exit code: 1
fatal: Unable to create '/tmp/tmpom_5nw_l/.git/index.lock': File exists.

This wasn't disk pressure — node and pod both had plenty of headroom. The root cause is a race in gitops_server/utils/git.py::temp_repo:

  1. The cache-exists check ran outside the repo_lock semaphore, so two concurrent workers could both decide the cache needs cloning.
  2. The cp -r <cache> <tmp> step was fully unlocked. While one worker was (re)cloning the cache, another's cp -r could snapshot the cache mid-write and pull a half-written .git/index.lock into its private temp dir, breaking the subsequent git checkout.

Under the webhook storm right after a pod restart (~16 concurrent deploys), this race fires and whole deploys fail.

Changes

  • Replace global asyncio.Semaphore(1) with per-repo asyncio.Lock (_get_cache_lock(url)). Concurrent deploys against different repos still run in parallel.
  • Move the cache-exists check inside the lock — eliminates the double-clone window.
  • Hold the lock across cp -r <cache> <tmp> too, so readers can't observe a partially written cache.
  • git fetch / git checkout stay outside the lock: the temp dir is private per deploy, so this keeps the heavy network work parallel.
  • Defensive rm -f <tmp>/.git/index.lock after the copy as belt-and-braces in case a stale lock ever lands on disk from another path.

Test plan

  • uv run pytest tests/test_git.py — all 3 tests pass
  • uv run ruff check / ruff format --check — clean
  • Deploy to gitops namespace and watch a deployment storm (e.g. a master push touching multiple apps) — no more Unable to create '.git/index.lock' failures
  • Confirm concurrent deploys to different repos still run in parallel (per-repo lock, not global)

Follow-ups (not in this PR)

The same incident also showed the gitops pod getting SIGKILL'd (exit 137) from liveness probe timeouts during deployment storms — /livez times out when the event loop is busy coordinating many subprocesses. Worth raising livenessProbe.timeoutSeconds and/or moving the probe off the busy loop in a follow-up.

🤖 Generated with Claude Code

Replace the global asyncio.Semaphore guarding only the initial clone
with per-repo asyncio.Locks held across both cache init and the
cp -r into the per-deploy temp dir. Previously the cache-exists
check ran outside the lock and the copy was fully unlocked, so a
second worker could read the cache mid-clone and pull a half-written
.git/index.lock into the temp dir, breaking the subsequent checkout.

fetch/checkout remain outside the lock since each temp dir is
private, preserving parallelism across concurrent deploys. A
defensive rm -f of .git/index.lock after the copy handles any stale
lock that might still slip through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 24, 2026

Docker Images

Commit: b21025a604521e5cd13cad41d8727356b578c897

Tag
610829907584.dkr.ecr.ap-southeast-2.amazonaws.com/gitops:test-b21025a

Mise started failing to install python@3.12.1 with "No GitHub artifact
attestations found". Disable the attestation check in CI so the install
step succeeds; the python version is already pinned via .mise.toml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant