Wire persistent DCRCredentialStore into EmbeddedAuthServer #5196

Draft
tgrunnagle wants to merge 6 commits into dcr-3b_issue_5184 from dcr-3c_issue_5185


@tgrunnagle tgrunnagle commented May 5, 2026

DRAFT - not ready for review

Summary

  • Why: Phase 2 of the DCR story shipped an in-memory stub for the DCRCredentialStore that lived only in the runner package. Restarting an authserver, or scaling out to additional replicas, discarded every RFC 7591 client registration and re-registered against the upstream on every boot, which is unworkable for the Datadog-style upstream demo and for any multi-replica deployment. This PR wires the persistent DCRCredentialStore introduced in earlier sub-issues (in-memory + Redis backends) into EmbeddedAuthServer so a Redis-backed authserver reuses already-registered clients across replicas and restarts.
  • What:
    • EmbeddedAuthServer.dcrStore is now typed against storage.DCRCredentialStore and is derived from the same storage.Storage value returned by createStorage, so a single storage_type: redis config toggles DCR persistence alongside the rest of authserver state. The storage.Storage interface embeds storage.DCRCredentialStore, promoting the previously-needed runtime type assertion to a compile-time guarantee.
    • The Phase 2 standalone in-memory DCRCredentialStore in pkg/authserver/runner/dcr_store.go is collapsed into a thin storageBackedStore adapter that delegates to storage.DCRCredentialStore and translates DCRResolution <-> DCRCredentials at the boundary. There is now exactly one persistence implementation per backend.
    • The constructor is split into the public NewEmbeddedAuthServer (creates storage) and an unexported newEmbeddedAuthServerWithStorage (owns the cleanup contract). Any error after entry closes the storage backend via a deferred cleanup gated on a named return error, so a crash-looping caller no longer leaks the Redis client connection pool / MemoryStorage cleanup goroutine on every restart.
    • buildPureOAuth2Config remains pure (unchanged signature, no ctx, no I/O); buildUpstreamConfigs is the boundary that consumes the resolver and overlays DCR-resolved credentials onto each upstream.
    • Adds DCRStore() accessor on EmbeddedAuthServer mirroring IDPTokenStorage / UpstreamTokenRefresher, used by integration tests to verify the resolver and the authserver write through the same backend.
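
A minimal sketch of the deferred-cleanup contract described in the constructor-split bullet above. All names here (`closer`, `trackingStorage`, `newServerWithStorage`, `errBuild`) are hypothetical stand-ins rather than the real types in this PR, and the sketch joins the close error onto the returned error with `errors.Join` where the real code may log it instead.

```go
package main

import (
	"errors"
	"fmt"
)

// closer is a stand-in for the storage backend; only Close matters here.
type closer interface {
	Close() error
}

// trackingStorage records whether Close was called, similar in spirit to
// the closeTrackingStorage wrapper named in the test plan (hypothetical shape).
type trackingStorage struct{ closed bool }

func (s *trackingStorage) Close() error {
	s.closed = true
	return nil
}

// server is a placeholder for the constructed auth server.
type server struct{ stor closer }

// errBuild is an illustrative failure used to drive the error path.
var errBuild = errors.New("build failed")

// newServerWithStorage owns the cleanup contract: whenever it returns a
// non-nil error, the deferred function closes stor before the caller sees it.
func newServerWithStorage(stor closer, build func() error) (srv *server, retErr error) {
	defer func() {
		if retErr == nil {
			return // success: the returned server now owns stor
		}
		if closeErr := stor.Close(); closeErr != nil {
			retErr = errors.Join(retErr, closeErr)
		}
	}()

	if err := build(); err != nil {
		return nil, fmt.Errorf("build server: %w", err)
	}
	return &server{stor: stor}, nil
}

func main() {
	failing := &trackingStorage{}
	_, err := newServerWithStorage(failing, func() error { return errBuild })
	fmt.Println("error path: err =", err, "closed =", failing.closed)

	healthy := &trackingStorage{}
	_, err = newServerWithStorage(healthy, func() error { return nil })
	fmt.Println("success path: err =", err, "closed =", healthy.closed)
}
```

Gating on the named return means every early `return nil, err` added later inherits the cleanup without having to repeat it.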

This PR also lands the dependency stack that #5185 builds on (the persistent DCRCredentialStore types + memory backend, the Redis backend, the operator CRD surface for DCR, and the runner-side DCR resolver wiring). Each layer was developed and reviewed as a separate commit on this branch; commits are sequenced so each one builds and tests cleanly.

Closes #5185

Type of change

  • Bug fix
  • New feature
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other (describe):

Test plan

  • Unit tests (task test)
  • E2E tests (task test-e2e)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Notable test coverage added by this PR:

  • pkg/authserver/integration_dcr_restart_test.go (new) — TestEmbeddedAuthServer_DCRSurvivesRestart boots an EmbeddedAuthServer against a mock AS, captures the DCR store via the new DCRStore() accessor, closes the server, and asserts the persisted DCR row survives the first server's Close. Lives in package authserver_test to avoid the runner -> authserver import cycle. The full "boot, close, boot again, observe zero /register" scenario across a fresh constructor is documented as a gap (the production Redis path requires Sentinel, which miniredis does not speak); test docstring records the conditions under which it can be closed.
  • pkg/authserver/runner/embeddedauthserver_test.go: TestBuildUpstreamConfigs_DCR exercises first-call registration + cache-hit on the second call (zero additional HTTP requests) and asserts the caller's RunConfig.Upstreams slice is never mutated. TestNewEmbeddedAuthServer_ClosesStorageOnError uses a closeTrackingStorage wrapper to verify the deferred-cleanup contract.
  • pkg/authserver/storage/memory_test.go, redis_test.go, redis_integration_test.go — coverage for the persistent DCRCredentialStore operations on both backends, including ScopesHash canonicalisation (sort + dedupe + newline join).
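
For readers unfamiliar with the ScopesHash canonicalisation mentioned in the last bullet, the sort + dedupe + newline join can be sketched as below. The function names are hypothetical, and hashing the canonical string with hex-encoded SHA-256 is an assumption; the PR text does not say which digest the real implementation uses.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"slices"
	"strings"
)

// canonicalScopes sorts and de-duplicates scopes, then joins them with "\n".
// It works on a copy so the caller's slice is not reordered.
func canonicalScopes(scopes []string) string {
	sorted := slices.Clone(scopes)
	slices.Sort(sorted)
	sorted = slices.Compact(sorted) // removes adjacent duplicates after sorting
	return strings.Join(sorted, "\n")
}

// scopesHash hashes the canonical form. SHA-256 with hex encoding is an
// assumption made for this sketch only.
func scopesHash(scopes []string) string {
	sum := sha256.Sum256([]byte(canonicalScopes(scopes)))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := scopesHash([]string{"profile", "openid", "profile"})
	b := scopesHash([]string{"openid", "profile"})
	fmt.Println("order and duplicates ignored:", a == b)
}
```

The point of canonicalising first is that two upstream configs requesting the same scope set in a different order map to the same stored registration.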

API Compatibility

  • This PR does not break the v1beta1 API, OR the api-break-allowed label is applied and the migration guidance is described above.

The CRD changes are additive: OAuth2UpstreamConfig.clientId becomes optional with a CEL constraint requiring exactly one of clientId or dcrConfig, and a new dcrConfig field is added. Existing MCPExternalAuthConfig / VirtualMCPServer resources that set clientId continue to validate unchanged.
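
The "exactly one of clientId or dcrConfig" rule can be illustrated in plain Go. This is only a sketch of the constraint's logic with simplified, hypothetical types; the real check is a CEL rule on the CRD, and the field types here are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// dcrConfig and oauth2Upstream are simplified stand-ins for the CRD types.
type dcrConfig struct {
	DiscoveryURL         string
	RegistrationEndpoint string
}

type oauth2Upstream struct {
	ClientID  string     // empty means "not set" in this sketch
	DCRConfig *dcrConfig // nil means "not set"
}

// validateClientSource rejects configs that set both or neither of
// clientId and dcrConfig.
func validateClientSource(u oauth2Upstream) error {
	hasClientID := u.ClientID != ""
	hasDCR := u.DCRConfig != nil
	if hasClientID == hasDCR {
		return errors.New("exactly one of clientId or dcrConfig must be set")
	}
	return nil
}

func main() {
	static := oauth2Upstream{ClientID: "my-client"}
	dynamic := oauth2Upstream{DCRConfig: &dcrConfig{RegistrationEndpoint: "https://as.example.com/register"}}
	neither := oauth2Upstream{}

	fmt.Println("static:", validateClientSource(static))
	fmt.Println("dynamic:", validateClientSource(dynamic))
	fmt.Println("neither:", validateClientSource(neither))
}
```

Existing resources that only set clientId fall into the first case, which is why the change is additive.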

Does this introduce a user-facing change?

Yes. Operators of OAuth2 upstreams can now configure RFC 7591 Dynamic Client Registration in the operator CRD via dcrConfig (with discoveryUrl or registrationEndpoint, plus optional initialAccessTokenRef, softwareId, softwareStatement) instead of statically configuring clientId + clientSecret. When the authserver is configured with storage_type: redis, DCR registrations persist across restarts and are shared across replicas; in single-replica memory mode, registrations live for the process lifetime as before.

Special notes for reviewers

  • This PR is the terminal task in the Phase 3 DCR DAG and pulls along the dependency stack from sub-issues 1 and 2 (persistent DCRCredentialStore types + memory + Redis backends), the operator CRD surface, and the Phase 2 resolver wiring. The size is above the usual 400-line / 10-file limit; each commit is self-contained and the stack reads top-to-bottom in commit order. Reviewers may prefer to walk the per-commit diffs.
  • The full "boot, close, boot again, zero /register" cross-constructor restart scenario is not exercised; closing it requires either miniredis-Sentinel emulation or a Docker-based Redis Sentinel cluster in the test harness. The wiring that the second boot would consume — the type of dcrStore being the same storage.DCRCredentialStore that authserver.New writes through — is verified at compile time by storage.Storage embedding storage.DCRCredentialStore and by TestEmbeddedAuthServer_DCRSurvivesRestart asserting the persistence boundary.
  • buildPureOAuth2Config was kept intentionally pure (no ctx, no I/O) to preserve the architectural gate established in Authserver DCR integration (Phase 2, Steps 2a-2g) #4978; the wiring change swaps the implementation passed into the resolver, not the call shape.
  • No secrets (client_secret, registration_access_token, initial_access_token, refresh tokens) appear as arguments to slog.* calls; the grep assertion from Authserver DCR integration (Phase 2, Steps 2a-2g) #4978 still applies.

@tgrunnagle tgrunnagle changed the base branch from main to dcr-3b_issue_5184 May 5, 2026 15:48
@github-actions github-actions Bot added the size/L Large PR: 600-999 lines changed label May 5, 2026
@tgrunnagle tgrunnagle force-pushed the dcr-3b_issue_5184 branch 2 times, most recently from b0bf320 to 1736a6e Compare May 7, 2026 15:41
tgrunnagle added 2 commits May 7, 2026 09:28
Type EmbeddedAuthServer.dcrStore against storage.DCRCredentialStore and
derive it from the same storage.Storage value returned by createStorage
via a single type assertion, so a Redis-backed authserver reuses
already-registered RFC 7591 clients across replicas and restarts
instead of re-registering at every boot.

Phase 2 left two parallel DCR stores: a runner-side in-memory map in
dcr_store.go and the storage-level interface added in sub-issue 1. This
collapses the runner-side implementation into a thin storageBackedStore
adapter that delegates to storage.DCRCredentialStore, leaving exactly
one persistence implementation per backend (storage.MemoryStorage and
storage.RedisStorage).

NewInMemoryDCRCredentialStore is preserved as a test helper that wraps
storage.NewMemoryStorage so existing resolver tests compile unchanged;
the standalone inMemoryDCRCredentialStore type and its map / RWMutex
are deleted. buildPureOAuth2Config is unchanged — the wiring change
swaps the implementation passed to the resolver, not the call shape.

Add TestEmbeddedAuthServer_DCRSurvivesRestart in
embeddedauthserver_test.go (next to TestNewEmbeddedAuthServer_DCRBoot)
covering the durable-restart case: boot, close, rebuild against the
same storage.MemoryStorage instance, assert the second resolve makes
zero AS requests. The integration_test.go file under pkg/authserver
would otherwise be the natural home, but it is in package authserver
and importing runner from there would cycle (runner already imports
authserver); the test docstring records this constraint.

Fixed issues from code review of #5185 wiring change:

- HIGH: Storage backend leaked on NewEmbeddedAuthServer error paths.
  Split the constructor into a public NewEmbeddedAuthServer that calls
  createStorage and an unexported newEmbeddedAuthServerWithStorage that
  owns the cleanup contract via a deferred Close gated on a named
  return error. Verified by TestNewEmbeddedAuthServer_ClosesStorageOnError
  using a closeTrackingStorage wrapper.

- MEDIUM: Comment claimed interface embedding that did not exist.
  Embed storage.DCRCredentialStore in the storage.Storage interface
  instead, promoting the runtime type assertion to a compile-time
  guarantee (the AC's explicitly preferred outcome). The dead error
  branch and its outdated comment are gone; mocks regenerated via
  task gen.

- MEDIUM: Test placement deviated from AC instruction. Moved
  TestEmbeddedAuthServer_DCRSurvivesRestart out of the runner package
  and into a new pkg/authserver/integration_dcr_restart_test.go in
  package authserver_test, so the test lives next to the other
  pkg/authserver integration tests without inducing the runner ->
  authserver import cycle. Added a small public DCRStore() accessor
  on EmbeddedAuthServer mirroring existing IDPTokenStorage /
  UpstreamTokenRefresher accessors.

- MEDIUM: Durable-restart not exercised end-to-end. Strengthened the
  restart test to go through NewEmbeddedAuthServer for the first boot
  (full constructor path with DCR), capture the storage via the new
  DCRStore() accessor, and assert the DCR row survives the first
  server's Close. The full "boot, close, boot again, observe zero
  /register" scenario remains a documented gap (the production Redis
  path requires Sentinel which miniredis does not speak); the gap and
  the conditions under which it can be closed are recorded in the test
  docstring per the review's accept-the-gap branch.
@tgrunnagle tgrunnagle force-pushed the dcr-3c_issue_5185 branch from 781a0b9 to 565fade Compare May 7, 2026 16:33
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels May 7, 2026

codecov Bot commented May 7, 2026

Codecov Report

❌ Patch coverage is 92.18750% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.87%. Comparing base (df0ad8f) to head (609e7d3).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/authserver/runner/embeddedauthserver.go | 88.88% | 1 Missing and 1 partial ⚠️ |
| pkg/authserver/server_impl.go | 66.66% | 1 Missing and 1 partial ⚠️ |
| pkg/authserver/runner/dcr_store.go | 97.50% | 1 Missing ⚠️ |
Additional details and impacted files
@@                  Coverage Diff                  @@
##           dcr-3b_issue_5184    #5196      +/-   ##
=====================================================
+ Coverage              67.81%   67.87%   +0.05%     
=====================================================
  Files                    610      610              
  Lines                  62379    62420      +41     
=====================================================
+ Hits                   42302    42366      +64     
+ Misses                 16902    16877      -25     
- Partials                3175     3177       +2     



@tgrunnagle tgrunnagle left a comment


Multi-Agent Consensus Review

Agents consulted: concurrency, architecture, test-coverage, security, general-quality. Codex unavailable (skipped).

Consensus Summary

| # | Finding | Consensus | Severity | Action |
|---|---|---|---|---|
| F1 | Tests leak MemoryStorage cleanupLoop goroutine via newInMemoryDCRResolutionCache | 9/10 | MEDIUM | Fix |
| F3 | Test names TestInMemoryDCRResolutionCache_* reference deleted type | 7/10 | MEDIUM | Fix |
| F4 | newInMemoryDCRResolutionCache doc still markets it for production deployments | 7/10 | MEDIUM | Fix |
| F5 | Restart test name overstates coverage; depends on undocumented MemoryStorage post-Close behavior | 9/10 | MEDIUM | Fix |
| F6 | Storage interface widening (DCRCredentialStore embed) couples impls to DCR; broadens secret reach | 7/10 | MEDIUM | Discuss |
| F7 | resolutionToCredentials/credentialsToResolution drop fields with no round-trip test | 7/10 | MEDIUM | Fix |
| F8 | Deferred-cleanup slog.Warn logs unredacted retErr; can leak upstream response body | 7/10 | MEDIUM | Fix |
| F9 | newMockUpstreamAS duplicates pkg/authserver/runner.newMockAuthorizationServer | 7/10 | LOW | Fix |
| F10 | DCRStore() accessor: secret reach + delegation pattern + lifecycle docs | 8/10 | MEDIUM | Discuss |
| F11 | PR is DRAFT; size over budget; e2e gap not documented | n/a | INFO | Comment |

Overall

The wiring change is structurally correct: deferred-cleanup contract via named return retErr works, no double-close path between the deferred close and EmbeddedAuthServer.Close → server.Close → storage.Close, and closeOnce continues to make the caller path idempotent. The architectural gate from #4978 (buildPureOAuth2Config remains pure) is preserved. The closeTrackingStorage test seam is the right shape for verifying the contract — interface seam, not a hook field.

The dominant cluster of findings is comment drift and test-naming drift around the dcrResolutionCache / inMemoryDCRResolutionCache rename (F3, F4) plus a real test-infra goroutine leak that this PR introduces by routing test fixtures through storage.NewMemoryStorage (F1). These are fast to fix.

F6 and F10 are the design-discussion items. Embedding DCRCredentialStore into Storage was a deliberate choice to land a compile-time guarantee, but it widens the secret-bearing surface across every consumer of storage.Storage and every future backend. The new public DCRStore() accessor returns the raw secret-bearing interface and breaks the e.server.* delegation pattern used by sibling getters; if its only consumer is the integration test, a test-only export would be cleaner. Both are worth a deliberate decision rather than a side effect of compile-time-safety convenience.

F5 is the integration test's coverage story: the TestEmbeddedAuthServer_DCRSurvivesRestart name promises a restart, but the assertion is "reading from the same MemoryStorage instance after Close still works," which is an undocumented contract of MemoryStorage. The test docstring honestly acknowledges that the cross-boot scenario is deferred, but the restart behavior implied by the test name and by the PR's user-visible "survives restart with Redis" promise is not yet covered by the suite.

Documentation

  • dcrResolutionCache interface doc and newInMemoryDCRResolutionCache doc need updates per F4.
  • EmbeddedAuthServer.dcrStore field comment + DCRStore() doc need a security note per F10.
  • The Storage interface doc should call out the DCR-credential reach if F6's design lands as embedded.

Verification notes

Three findings raised by individual agents were verified against the post-PR file state and dropped as false positives — they cited the deleted (-) lines in the diff as if they were still in the new file. Specifically: (a) one HIGH security finding claimed RedisStorage does not implement DCRCredentialStore — verified false at pkg/authserver/storage/redis.go:2040 on the base branch dcr-3b_issue_5184; (b) two findings about the dcrResolutionCache interface doc retaining "in-memory" / "never expired" claims — verified false; the post-PR file at lines 33-46 contains only the new "thin adapter" framing.

Process notes

  • Self-review caveat: the authenticated user is the PR author, so the review event is necessarily COMMENT (GitHub rejects REQUEST_CHANGES on own PRs). The findings would otherwise lean toward REQUEST_CHANGES due to the cluster of MEDIUM consensus items.
  • Per .claude/rules/pr-creation.md, this PR exceeds the 400-line / 10-file budget — the author already acknowledged this in the body. Splitting the integration test (217 LOC) into a follow-up would put the wiring change inside budget.

Generated with Claude Code

Comment thread pkg/authserver/runner/dcr_store.go Outdated
Comment thread pkg/authserver/runner/dcr_store_test.go
Comment thread pkg/authserver/runner/dcr_store.go Outdated
Comment thread pkg/authserver/integration_dcr_restart_test.go Outdated
Comment thread pkg/authserver/storage/types.go Outdated
Comment thread pkg/authserver/runner/dcr_store.go
Comment thread pkg/authserver/runner/embeddedauthserver.go
Comment thread pkg/authserver/integration_dcr_restart_test.go Outdated
Comment thread pkg/authserver/runner/embeddedauthserver.go
tgrunnagle added 3 commits May 7, 2026 11:43
Addresses #5196 review comments:
- MEDIUM types.go (3203565433) F6: Storage no longer embeds
  DCRCredentialStore. The embed promoted GetDCRCredentials /
  StoreDCRCredentials onto every Storage consumer (handlers,
  registration, session, etc.), broadening the surface that can read raw
  client_secret and registration_access_token. The compile-time
  guarantee is preserved by the per-backend var _ DCRCredentialStore =
  (*MemoryStorage)(nil) / (*RedisStorage)(nil) checks; the runner and
  authserver constructors now obtain the DCR-capable handle via an
  explicit type assertion at the boundary, fail-loud if a future backend
  omits DCR.

- MEDIUM embeddedauthserver.go (3203565453) F10: DCRStore() delegates
  through e.server.DCRStore() instead of holding a redundant e.dcrStore
  field. The Server interface now exposes DCRStore() with a SECURITY +
  lifecycle doc noting the returned handle surfaces raw secrets and is
  bound to Server.Close. Eliminates the drift window between
  EmbeddedAuthServer and the underlying authserver if the server ever
  swaps backends.
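
The F6 change above trades the interface embed for two mechanisms: a per-backend compile-time assertion and an explicit type assertion at the boundary. A minimal sketch of that pattern follows; the interfaces are trimmed to illustrative method sets with hypothetical signatures, and `legacyStorage`, `dcrStoreFrom`, and `errNoDCR` are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Storage and DCRCredentialStore are trimmed stand-ins; the real interfaces
// have more methods and different signatures.
type Storage interface {
	Close() error
}

type DCRCredentialStore interface {
	GetDCRCredentials(key string) (clientID string, ok bool)
	StoreDCRCredentials(key, clientID string)
}

// MemoryStorage implements both interfaces.
type MemoryStorage struct{ dcr map[string]string }

func NewMemoryStorage() *MemoryStorage { return &MemoryStorage{dcr: map[string]string{}} }

func (*MemoryStorage) Close() error { return nil }

func (m *MemoryStorage) GetDCRCredentials(key string) (string, bool) {
	id, ok := m.dcr[key]
	return id, ok
}

func (m *MemoryStorage) StoreDCRCredentials(key, clientID string) { m.dcr[key] = clientID }

// Compile-time check: the build breaks if MemoryStorage stops satisfying
// DCRCredentialStore, without widening the Storage interface itself.
var _ DCRCredentialStore = (*MemoryStorage)(nil)

// legacyStorage is a hypothetical backend that omits DCR support.
type legacyStorage struct{}

func (legacyStorage) Close() error { return nil }

var errNoDCR = errors.New("storage backend does not implement DCRCredentialStore")

// dcrStoreFrom is the boundary: it obtains the DCR-capable handle via a type
// assertion and fails loudly if the backend lacks it.
func dcrStoreFrom(s Storage) (DCRCredentialStore, error) {
	dcr, ok := s.(DCRCredentialStore)
	if !ok {
		return nil, fmt.Errorf("%T: %w", s, errNoDCR)
	}
	return dcr, nil
}

func main() {
	_, err := dcrStoreFrom(NewMemoryStorage())
	fmt.Println("memory backend error:", err)

	_, err = dcrStoreFrom(legacyStorage{})
	fmt.Println("legacy backend error:", err)
}
```

Only code that goes through the boundary function can reach the secret-bearing methods; ordinary Storage consumers never see them.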
Addresses #5196 review comments:
- MEDIUM dcr_store.go (3203565416) F1: replace
  newInMemoryDCRResolutionCache (which leaked a MemoryStorage
  cleanupLoop goroutine on every call) with a test-only newMemoryDCRStore
  in dcr_testhelpers_test.go that takes *testing.T and registers
  t.Cleanup(stor.Close). Updates ~32 call sites across dcr_test.go,
  dcr_store_test.go, and embeddedauthserver_test.go.

- MEDIUM dcr_store_test.go (3203565423) F3: rename test functions from
  TestInMemoryDCRResolutionCache_* to TestStorageBackedStore_* so the
  suite names match the type they actually exercise (the deleted
  inMemoryDCRResolutionCache type no longer exists).

- MEDIUM dcr_store.go (3203565426) F4: dropped the production-marketing
  paragraphs from the helper's docstring (production code in
  NewEmbeddedAuthServer no longer reaches it). The new docstring on
  newMemoryDCRStore states the test-only purpose and the cleanup
  contract directly.

- MEDIUM integration_dcr_restart_test.go (3203565429) F5: rename
  TestEmbeddedAuthServer_DCRSurvivesRestart ->
  TestEmbeddedAuthServer_DCRStorePersistsAcrossClose, and refactor to
  read credentials BEFORE Close. The test no longer silently depends on
  MemoryStorage's undocumented post-Close readability. Tightened the
  44-line docstring to scope it to what is exercisable today and a
  pointer to the deferred follow-up.

- LOW integration_dcr_restart_test.go (3203565449) F9: added
  "DO NOT COPY THIS A THIRD TIME" tripwire on both newMockUpstreamAS
  (authserver_test) and newMockAuthorizationServer (runner) directing
  the next caller to extract the helper to a shared
  pkg/authserver/internal/testhelpers package before duplicating it.
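
The F1 fix above relies on a test helper that registers its own teardown. A minimal sketch of that shape, under stated assumptions: `memoryStore` is a hypothetical stand-in for a storage that starts a background goroutine, and the helper takes a small `cleanupRegistrar` interface (which `*testing.T` satisfies) so the example can run outside `go test`; the real helper is described as taking `*testing.T` directly.

```go
package main

import (
	"fmt"
	"sync"
)

// cleanupRegistrar is the subset of testing.TB this sketch needs.
type cleanupRegistrar interface {
	Cleanup(func())
}

// memoryStore stands in for a storage whose constructor starts a goroutine.
type memoryStore struct {
	stop      chan struct{}
	done      chan struct{}
	closeOnce sync.Once
}

func newMemoryStore() *memoryStore {
	s := &memoryStore{stop: make(chan struct{}), done: make(chan struct{})}
	go func() {
		defer close(s.done)
		<-s.stop // a real cleanup loop would also tick and evict entries here
	}()
	return s
}

// Close stops the background goroutine and waits for it to exit.
// sync.Once makes repeated calls safe.
func (s *memoryStore) Close() {
	s.closeOnce.Do(func() {
		close(s.stop)
		<-s.done
	})
}

// newMemoryDCRStore builds a store and registers Close as teardown, so no
// call site can forget to stop the goroutine.
func newMemoryDCRStore(t cleanupRegistrar) *memoryStore {
	s := newMemoryStore()
	t.Cleanup(s.Close)
	return s
}

// fakeT mimics testing.T's Cleanup bookkeeping for this runnable example.
type fakeT struct{ cleanups []func() }

func (f *fakeT) Cleanup(fn func()) { f.cleanups = append(f.cleanups, fn) }

// runCleanups runs registered functions last-in, first-out, as testing.T does.
func (f *fakeT) runCleanups() {
	for i := len(f.cleanups) - 1; i >= 0; i-- {
		f.cleanups[i]()
	}
}

func main() {
	t := &fakeT{}
	s := newMemoryDCRStore(t)
	t.runCleanups()

	select {
	case <-s.done:
		fmt.Println("background goroutine stopped")
	default:
		fmt.Println("background goroutine still running")
	}
}
```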
Addresses #5196 review comments:
- MEDIUM dcr_store.go (3203565436) F7: added
  TestResolutionCredentialsRoundTrip pinning the field-by-field contract
  between resolutionToCredentials and credentialsToResolution
  (preserved, dropped, key-recovered, nil-shortcircuit). Added
  MUST-update-both-converters comments on the DCRResolution struct in
  dcr.go and the DCRCredentials struct in storage/types.go so a future
  contributor adding a field to either type sees the converter
  obligation at the struct definition rather than only at the
  converters. Documented the ProviderName asymmetry: the field is
  storage-only ("debug/audit only" per its own docstring) and is
  intentionally left unpopulated by the runner; the test asserts that
  invariant so any future threading is paired with the assertion
  update.

- MEDIUM embeddedauthserver.go (3203565446) F8: route both closeErr and
  retErr in the deferred-cleanup slog.Warn through sanitizeErrorForLog
  so a wrapped DCR failure whose error chain inlines an upstream
  /register response body cannot leak userinfo, query, or fragment
  components into operator logs. Renamed the "original_error" key to
  "cause" to match the package-wide vocabulary. Added
  TestNewEmbeddedAuthServer_DeferredCleanupSanitizesLog which captures
  the Warn record by swapping slog.Default() and asserts both that
  literal secret markers are scrubbed and that host components survive
  for operator correlation.
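
To illustrate the kind of scrubbing the F8 fix describes, here is a sketch that strips userinfo, query, and fragment from URLs embedded in an error message while keeping scheme, host, and path. It is not the package's sanitizeErrorForLog; `scrubURLs`, `sanitizeForLog`, and the URL-matching regular expression are assumptions made for the example, and the regex is deliberately simplistic.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"regexp"
)

// urlPattern finds http(s) URLs inside free-form error text. It stops at
// whitespace and quotes, which is enough for this sketch.
var urlPattern = regexp.MustCompile(`https?://[^\s"']+`)

// scrubURLs removes userinfo, query, and fragment from every URL in msg,
// keeping scheme, host, and path so operators can still correlate failures.
func scrubURLs(msg string) string {
	return urlPattern.ReplaceAllStringFunc(msg, func(raw string) string {
		u, err := url.Parse(raw)
		if err != nil {
			return "[unparseable-url]"
		}
		u.User = nil
		u.RawQuery = ""
		u.Fragment = ""
		u.RawFragment = ""
		return u.String()
	})
}

// sanitizeForLog returns a log-safe rendering of err, or "" for a nil error.
func sanitizeForLog(err error) string {
	if err == nil {
		return ""
	}
	return scrubURLs(err.Error())
}

func main() {
	cause := errors.New("POST https://user:s3cret@as.example.com/register?token=abc#frag failed")
	wrapped := fmt.Errorf("resolve DCR credentials: %w", cause)
	fmt.Println(sanitizeForLog(wrapped))
}
```

Sanitizing the rendered string rather than individual error types means a wrapped chain is covered no matter which layer inlined the URL.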
@tgrunnagle

Re: F11 (PR state / size / e2e gap from the multi-agent review body): noted, no commit. The size-over-budget acknowledgement and the deferred Sentinel-restart e2e are already called out in the PR body, and the PR remains in draft pending the size discussion. The Sentinel-restart gap is now explicitly recorded in the docstring on TestEmbeddedAuthServer_DCRStorePersistsAcrossClose (39dfc62) so a future contributor adding the harness has a single grep target.

@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/L Large PR: 600-999 lines changed labels May 7, 2026

@github-actions github-actions Bot left a comment


Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

Drops the package authserver_test workaround introduced to break the
runner -> authserver import cycle that blocked extending
pkg/authserver/integration_test.go. The test's actual subject is the
runner-side DCR-store wiring (it goes through runner.NewEmbeddedAuthServer
and asserts on runner.EmbeddedAuthServer.DCRStore()), so the runner
package is the natural home and matches the location of its sibling
TestNewEmbeddedAuthServer_DCRBoot / _ClosesStorageOnError tests.

The runner-package newMockAuthorizationServer helper replaces the
duplicate newMockUpstreamAS the cross-package placement forced; the
"DO NOT COPY THIS A THIRD TIME" tripwire on it is dropped now that the
duplication is gone.
@github-actions github-actions Bot removed the size/XL Extra large PR: 1000+ lines changed label May 7, 2026
@github-actions github-actions Bot dismissed their stale review May 7, 2026 19:29

PR size has been reduced below the XL threshold. Thank you for splitting this up!


github-actions Bot commented May 7, 2026

✅ PR size has been reduced below the XL threshold. The size review has been dismissed and this PR can now proceed with normal review. Thank you for splitting this up!

@github-actions github-actions Bot added the size/L Large PR: 600-999 lines changed label May 7, 2026


Development

Successfully merging this pull request may close these issues.

Persistent DCRCredentialStore: wire into EmbeddedAuthServer

1 participant