fix(mcp): reconnect MCP/CLI session after a transient engine restart #2
Open
mathias-heide wants to merge 1 commit into
The MCP server is a long-lived stdio process (it lives as long as Claude Code / Cursor / Codex keep it), but it cached ONE EngineApiClient that snapshotted the engine's api-port + api-token at first connect and never refreshed them.

The engine invalidates those creds on EVERY launch: it mints a fresh random api-token (_generate_api_token) and can bind a different port (start() increments 6550..6565 when the old socket lingers). So any engine restart / reopen / crash-relaunch left the cached client hitting the old port with the old Bearer token -> 401 -> surfaced to the agent as "disconnected from the project".

The only recovery was withEngine's catch -> resetClient, which heals on the NEXT call, so every restart cost at least one hard, user-visible disconnect, and soft not_connected / identity_mismatch states never reset the client at all.

Fix (MCP/CLI scope only):
- EngineApiClient.credentialsChanged(): cheap drift probe that re-reads ~/.summer/api-port + api-token and reports whether they no longer match the client's snapshot (empty/unreadable creds = no drift, so a mid-write read never thrashes the cache).
- getClient(): check credentialsChanged() before reusing the cache, so a silent engine restart reconnects transparently with zero surfaced error.
- withEngine(): drop the cached client and retry ONCE on connection-class failures only: thrown 401/403/ECONNREFUSED/closed-port, and the soft not_connected / identity_mismatch terminal states (both "nothing applied" by contract). Deliberately does NOT retry timed_out / content_mismatch / denied / canceled, which are ambiguous or intentional and could double-apply a mutation.

Repro: src/mcp/reconnect.integration.test.ts drives the real getClient -> EngineApiClient -> withEngine stack over real HTTP against a fake engine that rotates its token; the session survives the rotation. Plus unit coverage for the drift cache and the retry classifier.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Problem: MCP/CLI "disconnects from the project"

Distinct from the desktop engine-disconnect. The Summer MCP server (`npx summer-engine mcp`) is a long-lived stdio process: it lives as long as Claude Code / Cursor / Codex keep it. It caches one `EngineApiClient` that snapshots the engine's `~/.summer/api-port` + `api-token` at first connect and never refreshes them.

But the engine invalidates those credentials on every launch:

- `api-token`: fresh each launch (`local_api_server.cpp::_generate_api_token`); the old `Bearer` token → 401 → reads as "disconnected".
- Port: can differ between launches (`tool_net_thread.cpp::start()`).
- `instanceId`: per process (`engine_identity.cpp`).

So any engine restart / reopen / crash-relaunch / update left the cached client hitting the old port with the old token → 401. The only recovery was `withEngine`'s `catch → resetClient()`, which heals on the next call, so every restart cost at least one hard, user-visible disconnect, and soft `not_connected` / `identity_mismatch` terminal states (which bypass the `catch`) never reset the client at all.

Fix (scope kept to MCP/CLI)
- `EngineApiClient.credentialsChanged()`: cheap drift probe that re-reads `api-port` + `api-token` and reports whether they no longer match this client's snapshot. Empty/unreadable creds (engine mid-write or just-closed) count as no drift, so a transient read never thrashes the cache.
- `getClient()`: checks `credentialsChanged()` before reusing the cache, so a silent engine restart reconnects transparently with zero surfaced error.
- `withEngine()`: drops the cached client and retries once on connection-class failures only: thrown `401`/`403`/`ECONNREFUSED`/closed-port, and the soft `not_connected` / `identity_mismatch` terminal states (both "nothing applied" by contract). Deliberately does NOT retry `timed_out` / `content_mismatch` / `denied` / `canceled`: these are ambiguous or intentional, and retrying could double-apply a mutation.

Two layers compose as defense-in-depth: proactive drift detection handles the common "restarted between calls" case; reset+retry handles the race where the restart lands during a call.
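The drift probe and cache check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `SketchClient`, `Creds`, and `CredsReader` are hypothetical names, and the credential reader is injected so the sketch never touches a real `~/.summer` directory.

```typescript
// Hypothetical shapes for illustration; the real EngineApiClient is not shown here.
type Creds = { port: string; token: string };
type CredsReader = () => Creds | null;

class SketchClient {
  private readonly snapshot: Creds;
  private readonly read: CredsReader;

  constructor(snapshot: Creds, read: CredsReader) {
    this.snapshot = snapshot;
    this.read = read;
  }

  // True only when readable, non-empty credentials differ from the snapshot.
  credentialsChanged(): boolean {
    let current: Creds | null;
    try {
      current = this.read();
    } catch {
      return false; // unreadable: treated as "no drift"
    }
    // Empty values (engine mid-write or just closed) also count as "no drift".
    if (!current || !current.port || !current.token) return false;
    return (
      current.port !== this.snapshot.port ||
      current.token !== this.snapshot.token
    );
  }
}

let cached: SketchClient | null = null;

// Reuse the cached client unless the on-disk credentials have drifted.
function getClient(read: CredsReader): SketchClient {
  if (cached && !cached.credentialsChanged()) return cached;
  const creds = read();
  if (!creds || !creds.port || !creds.token) {
    if (cached) return cached; // keep the old client rather than thrash
    throw new Error("engine credentials not available");
  }
  cached = new SketchClient(creds, read);
  return cached;
}
```

Treating empty or unreadable credentials as "no drift" is the design choice that keeps a half-written credential file from invalidating a client that may still be valid.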
Repro / tests

- `src/mcp/reconnect.integration.test.ts`: drives the real `getClient → EngineApiClient → withEngine` stack over real HTTP against a fake engine that rotates its token (only the credential-file readers are stubbed, so the user's `~/.summer` is untouched). The session survives the rotation and a mid-call restart.
- `src/mcp/server.reconnect.test.ts` (drift cache): reuse while stable, rebuild on token/port rotation, don't thrash on a mid-write empty read.
- `src/mcp/tools/with-engine.reconnect.test.ts` (retry classifier): heals 401 / `fetch failed` / `not_connected` / `identity_mismatch`; does not retry `timed_out` / normal op errors / 5xx.

All 6 new repro tests fail on `main` and pass with this change. Full suite: 312 passed (was 298). `tsc` strict build clean.

Coordinated with the engine-disconnect work but no engine/web changes here: purely the `summer-engine-agent` MCP/CLI client.

🤖 Generated with Claude Code
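For readers who want a concrete picture of the retry classification in the Fix section, here is a minimal sketch. It assumes thrown errors expose an HTTP `status`, a Node-style `code`, or a `fetch failed` message, and that results carry a `status` string; `EngineResult`, `isConnectionError`, `shouldRetryResult`, and the callback-style `withEngine` signature are hypothetical and may not match the actual implementation.

```typescript
// Hypothetical result shape; the state names come from the PR description.
type EngineResult = { status: string };

// Soft terminal states that are "nothing applied" by contract, so a retry is safe.
const RETRYABLE_STATES = new Set(["not_connected", "identity_mismatch"]);

function shouldRetryResult(result: EngineResult): boolean {
  return RETRYABLE_STATES.has(result.status);
}

// Assumption: connection-class failures show up as 401/403, ECONNREFUSED,
// or a "fetch failed" message. Anything else (including 5xx) is not retried.
function isConnectionError(err: unknown): boolean {
  const e = err as
    | { status?: number; code?: string; message?: string }
    | null
    | undefined;
  if (e?.status === 401 || e?.status === 403) return true;
  if (e?.code === "ECONNREFUSED") return true;
  return (
    typeof e?.message === "string" &&
    /fetch failed|ECONNREFUSED/.test(e.message)
  );
}

// Runs op; on a connection-class failure, drops the client and retries once.
async function withEngine<C>(
  getClient: () => C,
  resetClient: () => void,
  op: (client: C) => Promise<EngineResult>,
): Promise<EngineResult> {
  for (let attempt = 0; ; attempt++) {
    try {
      const result = await op(getClient());
      if (attempt === 0 && shouldRetryResult(result)) {
        resetClient();
        continue; // second and final attempt with a fresh client
      }
      return result;
    } catch (err) {
      if (attempt === 0 && isConnectionError(err)) {
        resetClient();
        continue;
      }
      throw err; // ambiguous or non-connection failure: surface it unchanged
    }
  }
}
```

Capping the loop at one retry means a persistent failure still surfaces after the second attempt, and states such as `timed_out` pass straight through because they are not in the retryable set.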