
fix(mcp): reconnect MCP/CLI session after a transient engine restart #2

Open
mathias-heide wants to merge 1 commit into main from fix/mcp-session-reconnect


@mathias-heide


Problem — MCP/CLI "disconnects from the project"

This is distinct from the desktop engine-disconnect. The Summer MCP server (npx summer-engine mcp) is a long-lived stdio process: it lives as long as Claude Code / Cursor / Codex keep it running. It caches one EngineApiClient that snapshots the engine's ~/.summer/api-port and api-token at first connect and never refreshes them.

But the engine invalidates those credentials on every launch:

| Engine behavior | Source | Effect on the cached client |
|---|---|---|
| Mints a new random api-token on each launch | local_api_server.cpp::_generate_api_token | Old Bearer token → 401 → reads as "disconnected" |
| Binds port 6550, but increments (up to 6565) if the old socket lingers | tool_net_thread.cpp::start() | Pinned to a dead/old port |
| New random instanceId per process | engine_identity.cpp | Identity drift |

So any engine restart, reopen, crash-relaunch, or update left the cached client presenting the old token (→ 401), and dialing a dead port if the engine had moved. The only recovery was withEngine's catch → resetClient(), which heals only on the next call. As a result, every restart cost at least one hard, user-visible disconnect. The soft not_connected / identity_mismatch terminal states bypass that catch, so they never reset the client at all.
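To make the failure mode concrete, here is a hypothetical in-memory model. FakeEngine, SnapshotClient, Creds, and Outcome are illustrative names, not summer-engine-agent types. The model assumes the engine's credentials are just a port and a bearer token, and it uses the string "ECONNREFUSED" to stand in for a real socket error.

```typescript
// Hypothetical in-memory model of the failure mode: FakeEngine, SnapshotClient,
// Creds and Outcome are illustrative names, not summer-engine-agent types.
interface Creds {
  port: number;
  token: string;
}

type Outcome = "ok" | "401" | "ECONNREFUSED";

class FakeEngine {
  private launches = 0;
  creds: Creds = { port: 6550, token: "" };

  constructor() {
    this.restart();
  }

  // Every launch mints a fresh token. If the previous socket lingers, the
  // engine moves to the next port (capped at 6565 in this model).
  restart(oldSocketLingers = false): void {
    this.launches += 1;
    const port = oldSocketLingers ? Math.min(this.creds.port + 1, 6565) : 6550;
    this.creds = { port, token: `token-${this.launches}` };
  }

  handle(port: number, bearer: string): Outcome {
    if (port !== this.creds.port) return "ECONNREFUSED"; // nothing listening there
    return bearer === this.creds.token ? "ok" : "401";
  }
}

// Copies the credentials once at construction, like the old cached client.
class SnapshotClient {
  private readonly creds: Creds;

  constructor(private readonly engine: FakeEngine) {
    this.creds = { ...engine.creds };
  }

  call(): Outcome {
    return this.engine.handle(this.creds.port, this.creds.token);
  }
}

const engine = new FakeEngine();
const cached = new SnapshotClient(engine);
console.log(cached.call()); // prints: ok
engine.restart();
console.log(cached.call()); // prints: 401 (same port, stale token)
engine.restart(true);
console.log(cached.call()); // prints: ECONNREFUSED (stale port)
```

Only a client built after the restart succeeds. Any fix therefore has to rebuild the cached client once the credentials move.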

Fix (scope kept to MCP/CLI)

  • EngineApiClient.credentialsChanged() — cheap drift probe: re-reads api-port + api-token and reports whether they no longer match this client's snapshot. Empty/unreadable creds (engine mid-write or just-closed) count as no drift so a transient read never thrashes the cache.
  • getClient() — checks credentialsChanged() before reusing the cache, so a silent engine restart reconnects transparently with zero surfaced error.
  • withEngine() — drops the cached client and retries once on connection-class failures only: thrown 401/403/ECONNREFUSED/closed-port, and the soft not_connected / identity_mismatch terminal states (both "nothing applied" by contract). Deliberately does NOT retry timed_out / content_mismatch / denied / canceled — ambiguous or intentional, and retrying could double-apply a mutation.
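The drift probe and the cache check in getClient() can be sketched as follows. This is a minimal sketch under stated assumptions: the real client reads ~/.summer/api-port and api-token from disk, while here that I/O sits behind an injected reader. CredsReader, DriftAwareClient, and makeGetClient are hypothetical names.

```typescript
// Minimal sketch of snapshot-and-compare drift detection. Creds, CredsReader,
// DriftAwareClient and makeGetClient are hypothetical stand-ins for the real
// EngineApiClient / getClient(); file I/O on ~/.summer is hidden behind the
// injected reader.
interface Creds {
  port: number;
  token: string;
}

// Returns null when the credential files are missing, empty or unparseable
// (for example while the engine is mid-write or has just closed).
type CredsReader = () => Creds | null;

class DriftAwareClient {
  readonly snapshot: Creds;

  constructor(private readonly readCreds: CredsReader) {
    const creds = readCreds();
    if (creds === null) throw new Error("engine credentials unavailable");
    this.snapshot = { ...creds }; // copy, so later reads cannot alias it
  }

  // Cheap drift probe: re-read and compare with the snapshot. An unreadable
  // read counts as "no drift", so a transient read never thrashes the cache.
  credentialsChanged(): boolean {
    const current = this.readCreds();
    if (current === null) return false;
    return current.port !== this.snapshot.port || current.token !== this.snapshot.token;
  }
}

// getClient(): reuse the cached client unless its credentials drifted.
function makeGetClient(readCreds: CredsReader): () => DriftAwareClient {
  let cached: DriftAwareClient | null = null;
  return () => {
    if (cached === null || cached.credentialsChanged()) {
      cached = new DriftAwareClient(readCreds);
    }
    return cached;
  };
}
```

In this sketch, an empty read keeps the existing client, while any token or port change rebuilds it on the next getClient() call. A restart that lands in the middle of a call is not caught here, and that is the case the retry layer has to cover.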

Two layers compose as defense-in-depth: proactive drift detection handles the common "restarted between calls" case; reset+retry handles the race where the restart lands during a call.
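The reset+retry layer can be sketched as a small wrapper plus a classifier. This is a hedged sketch, not the real withEngine(). The EngineResult shape and the Status union are assumptions, and so is the idea that thrown errors expose status, code, or the "fetch failed" message that Node's built-in fetch uses for network failures.

```typescript
// Hypothetical sketch of "drop the cached client and retry once on
// connection-class failures". EngineResult, the Status union and the
// status/code/message fields on thrown errors are assumptions modeled on
// the PR description, not the real withEngine() signature.
type Status =
  | "ok"
  | "not_connected"
  | "identity_mismatch"
  | "timed_out"
  | "content_mismatch"
  | "denied"
  | "canceled";

interface EngineResult {
  status: Status;
}

// Soft terminal states that mean "nothing applied" by contract.
const RETRYABLE_STATUSES: ReadonlySet<Status> = new Set<Status>(["not_connected", "identity_mismatch"]);

// Thrown errors that point at a stale connection rather than a failed op.
function isConnectionError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as { status?: unknown; code?: unknown; message?: unknown; cause?: { code?: unknown } };
  if (e.status === 401 || e.status === 403) return true;
  if (e.code === "ECONNREFUSED" || e.cause?.code === "ECONNREFUSED") return true;
  return typeof e.message === "string" && e.message.includes("fetch failed");
}

async function withEngine<C>(
  getClient: () => C,
  resetClient: () => void,
  op: (client: C) => Promise<EngineResult>,
): Promise<EngineResult> {
  try {
    const result = await op(getClient());
    // timed_out, content_mismatch, denied, canceled and ok are returned as-is.
    if (!RETRYABLE_STATUSES.has(result.status)) return result;
  } catch (err) {
    if (!isConnectionError(err)) throw err; // 5xx and normal op errors surface unchanged
  }
  // Connection-class failure: reset the cache and retry exactly once.
  resetClient();
  return op(getClient());
}
```

The allow-list is deliberately narrow. Only outcomes that, by contract, applied nothing are replayed. A timed_out call may already have applied its mutation, so it is surfaced rather than retried.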

Repro / tests

  • src/mcp/reconnect.integration.test.ts — drives the real getClient → EngineApiClient → withEngine stack over real HTTP against a fake engine that rotates its token (only the credential-file readers are stubbed, so the user's ~/.summer is untouched). The session survives the rotation and a mid-call restart.
  • src/mcp/server.reconnect.test.ts — drift cache: reuse while stable, rebuild on token/port rotation, don't thrash on a mid-write empty read.
  • src/mcp/tools/with-engine.reconnect.test.ts — retry classifier: heals 401 / fetch failed / not_connected / identity_mismatch; does not retry timed_out / normal op errors / 5xx.

All 6 new repro tests fail on main and pass with this change. Full suite: 312 passed (was 298). tsc strict build clean.

Coordinated with the engine-disconnect work, but this PR makes no engine or web changes; it touches only the summer-engine-agent MCP/CLI client.

🤖 Generated with Claude Code

The MCP server is a long-lived stdio process (it lives as long as Claude
Code / Cursor / Codex keep it) but it cached ONE EngineApiClient that
snapshotted the engine's api-port + api-token at first connect and never
refreshed them. The engine invalidates those creds on EVERY launch: it
mints a fresh random api-token (_generate_api_token) and can bind a
different port (start() increments 6550..6565 when the old socket
lingers). So any engine restart / reopen / crash-relaunch left the cached
client hitting the old port with the old Bearer token -> 401 -> surfaced
to the agent as "disconnected from the project". The only recovery was
withEngine's catch -> resetClient, which heals on the NEXT call, so every
restart cost at least one hard, user-visible disconnect, and soft
not_connected / identity_mismatch states never reset the client at all.

Fix (MCP/CLI scope only):
- EngineApiClient.credentialsChanged(): cheap drift probe that re-reads
  ~/.summer/api-port + api-token and reports whether they no longer match
  the client's snapshot (empty/unreadable creds = no drift, so a mid-write
  read never thrashes the cache).
- getClient(): check credentialsChanged() before reusing the cache, so a
  silent engine restart reconnects transparently with zero surfaced error.
- withEngine(): drop the cached client and retry ONCE on connection-class
  failures only — thrown 401/403/ECONNREFUSED/closed-port, and the soft
  not_connected / identity_mismatch terminal states (both "nothing
  applied" by contract). Deliberately does NOT retry timed_out /
  content_mismatch / denied / canceled, which are ambiguous or intentional
  and could double-apply a mutation.

Repro: src/mcp/reconnect.integration.test.ts drives the real getClient ->
EngineApiClient -> withEngine stack over real HTTP against a fake engine
that rotates its token; the session survives the rotation. Plus unit
coverage for the drift cache and the retry classifier.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>