Domains#1
Closed
siddhsuresh wants to merge 31 commits into
Closed
Conversation
…ster + compute/ecs_service networking/alb - New use_ravion_managed_domains toggle (default false, non-breaking). - When on, declares one domains_alb_attachment that allocates an FQDN, issues the cluster wildcard ACM cert, writes the A-ALIAS, and exposes default_cert_arn. - HTTPS listener is force-created and its default cert comes from the attachment; var.certificate_arns[0] is ignored, additional ARNs still attach as SNI. New outputs: ravion_default_url / _fqdn / _cert_arn / _alb_attachment_id. compute/ecs_cluster - Single use_ravion_managed_domains toggle that forwards down to both the public and private alb child modules (distinct slots so they coexist). - Forces HTTPS + HTTP->HTTPS redirect when on. Per-ALB cert ARN inputs are ignored at index 0 when on (additional ARNs still attach as SNI). - module.yml gets a new "Ravion-Managed Domains" section; cert-ARN inputs are hidden when the toggle is on. New outputs cover both ALBs. compute/ecs_service - New domains list(string) input. When non-empty, declares one domains_module_certificate that hands the FQDN list + cluster's HTTPS listener ARN to Ravion's api. The api requests the ACM cert, persists validation CNAMEs (visible in Domains tab), polls until ISSUED, and attaches as SNI out-of-band — terraform apply never blocks on customer DNS. On destroy the api detaches + deletes the cert. - New outputs surface cert id / arn / status for the UI. The Ravion provider source (ravion.com/ravion/domains) only resolves when the runner's terraform mirror is configured (Ravion pipeline runners inject this via ~/.terraformrc). Required-provider declarations are added so any caller using these modules without Ravion-managed mode still works against a local mirror or skipping the providers in tooling.
When the cluster module has Ravion-managed domains on, it exposes
`ravion_public_alb_default_app_domain_id` (the app_domain_id from the
cluster's alb_attachment). A service module wires that as
`ravion_parent_app_domain_id`, plus the cluster ALB's HTTPS listener ARN
and DNS name. The new resources in compute/ecs_service/ravion_domains.tf
then activate iff:
- var.domains is empty (no custom-domain takeover yet), AND
- ravion_parent_app_domain_id is non-empty
and install:
1. domains_app_domain.auto with parent_id = cluster's app_domain.
Allocates `<svc>-<hash>.<cluster-fqdn>` server-side (allocator now
supports a ParentApex). The cluster wildcard cert covers it — zero
per-service ACM work, zero per-service validation records.
2. domains_dns_record.auto_alias writes the A-ALIAS in our Route53.
3. aws_lb_listener_rule.ravion_auto_domain installs a host_header rule
on the cluster HTTPS listener, forwarding to the service's TG.
When the user adds a custom domain, count flips to 0 → terraform destroys
all three → server-side cascade removes the ManagedDomain row, the
Route53 record, and AWS deletes the listener rule. The user's
domains_module_certificate takes over.
networking/alb gains a `ravion_default_app_domain_id` output exposing
the alb_attachment's app_domain_id (was already in the underlying
provider resource, just not surfaced).
ecs_cluster forwards the public + private flavors.
- networking/vpc/module.yml + compute/ecs_cluster/module.yml: add readme, ui.metrics (CloudWatch via Ravion AWS account), and ui.links. - compute/ecs_service/module.yml: new file mirroring the input schema (incl. the new ravion-managed domain inputs) with metrics + readme. - networking/vpc/variables.tf: wrap validation conditions in try() so a null default doesn't blow up the iteration/length checks. Terraform 1.10 does not short-circuit `||` over `for`/`length` in variable validation. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Same TF 1.10 quirk as the variables.tf fix in the previous commit: `||` does not short-circuit length() over a null value inside a precondition. Wrap each length-check in try() so the null default for *_subnet_cidrs passes the precondition without erroring. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Same TF 1.10 quirk hitting child modules pulled in by ecs_cluster: - compute/autoscaling/variables.tf: instance_maintenance_policy attribute access fails when the policy itself is null. Wrap min/max condition in try() so the null default doesn't blow up the validation. - networking/nlb/variables.tf: contains() barfs on null. Wrap both dns_record_client_routing_policy and enforce_security_group_inbound_rules conditions in try(). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Bulk-applied transform via /tmp/wrap_null_conditions.py over all the modules ecs_cluster + ecs_service pull in. TF 1.10's variable validation does not short-circuit `||` when the right side touches a null value, so the existing `var.X == null || EXPR(var.X)` guards blow up at plan time. try(..., true) preserves the original semantics for non-null inputs (the expression evaluates and returns its bool result) and silently passes when the expression errors on null — same intent the `== null ||` guards had originally. Touches: autoscaling, ecs_cluster, ecs_service, alb, nlb, vpc. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
… is on When use_ravion_managed_domains = true, the listener default cert is supplied via the domains_alb_attachment resource (Ravion provisions and attaches the cluster wildcard cert). var.certificate_arns is intentionally empty in that case — the precondition was rejecting the valid config. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…s is on Without this gate the per-service auto-domain modules can't reference public_alb_https_listener_arn even though the HTTPS listener is created by the Ravion-managed-domains code path on the cluster ALB.
The control plane stores load_balancer_attachment_json as a string in the module input so it can template-substitute listener/ALB ARNs from upstream cluster outputs. The Terraform variable now matches that shape (string) and the module decodes it once in locals.tf, merging the decoded value with the optional() defaults the typed variable used to provide. All downstream resources read from local.load_balancer_attachment unchanged.
TF 1.10 doesn't short-circuit `var.X != null && var.X.attr` when X is null — it still evaluates the right side and errors on attribute access. Same pattern fix as the validation cleanup commits.
Server-side DNS record validation rejects ALIAS values whose zone_id is empty. The ecs_service module was hardcoding zone_id="" — now takes ravion_auto_domain_alb_zone_id (mapped from the cluster's public_alb_zone_id output) and writes it into the ALIAS value.
Locked to ravion.com/ravion/domains 0.1.99, a local-dev throwaway version whose checksum no longer matches the published binary. TF regenerates a fresh lock on the next init under the ~> 0.1 constraint.
Initial TF apply runs against the placeholder hello-world task definition (no real app image yet) which never stabilises and trips ECS's deployment circuit breaker. The deploy workflow that pushes the actual image is the one that should wait for steady state, not the infra apply.
Replaces ravion_app_domain + dns_record + module_certificate + the auto/custom split with a single `ravion_domain` resource that nests under a cluster-level parent (customer's top-level TF). The parent owns the wildcard cert; the service rides it via SNI — no per-service ACM. Drops `use_ravion_managed_domains` and all related Ravion plumbing from ecs_cluster and networking/alb: the cluster module is now pure-AWS, and the customer wires `ravion_domain` themselves at the top level. Provider pinned to ~> 0.3. Pin ravion provider to 0.1.99 for local-mirror testing The local mirror at port 8095 has 0.1.99 republished with the latest provider code (Mode A/B, certificate.domains). Higher versions exist in the mirror but aren't in sync with the runner's lock file. Exact-pin lets the runner pick the binary we just published without touching .terraform.lock.hcl. Bump ravion provider constraint to ~> 0.4 Domain handler rewrite (Mode A/B split, `domains` field replaces `custom_domains`) ships in 0.4.0. Per-service custom domains: Mode A vs Mode B Custom domains move OUT of the cluster module and onto each service: add `domains = ["api.example.com"]` to `ecs_service` to opt into Mode B. - ecs_cluster: drop ravion_custom_domains var (cluster cert covers only the wildcard pair); add ravion_aws_account_id + ravion_aws_region as passthrough outputs so services can issue their own certs in the same account. - ecs_service: new `domains` input (max 10). New cluster_alb_dns_name + cluster_alb_zone_id + ravion_aws_account_id + ravion_aws_region inputs (Mode B only — pipe from cluster outputs). ravion_domain.this switches between Mode A (rides wildcard) and Mode B (own cert covering only customer FQDNs) based on whether `domains` is set. - ecs_service: aws_lb_listener_certificate.ravion added (Mode B only, attaches the per-service cert as SNI extra). aws_lb_listener_rule host_header values switch between auto-FQDN (Mode A) and custom domains (Mode B). Move cluster ravion_domain into ecs_cluster module The cluster-level `ravion_domain` (wildcard cert + listener binding) now lives inside `compute/ecs_cluster` rather than being hand-rolled in the customer's top-level TF. Opt in with `use_ravion_managed_domains = true` and pipe `module.cluster.ravion_cluster_domain_id` + `module.cluster.public_alb_https_listener_arn` into each `ecs_service`. - ecs_cluster: new ravion_domains.tf allocates cluster ravion_domain (wildcard) and binds the cert as SNI extra on the public ALB HTTPS listener (does NOT replace the listener's default cert) - ecs_cluster: new vars `use_ravion_managed_domains`, `ravion_cluster_name`, `ravion_aws_account_id`, `ravion_aws_region`, `ravion_custom_domains` - ecs_cluster: new outputs `ravion_cluster_domain_id`, `ravion_cluster_domain_fqdn`, `ravion_cluster_domain_url`, `ravion_cluster_cert_arn`, `ravion_cluster_cert_status` - ecs_cluster: require ravion provider ~> 0.3 - ecs_cluster: README section documenting the wildcard pattern + the `ravion_custom_domains` DNS-01 validation flow - ecs_service: update comments to reference cluster module outputs Adopt domains provider 0.2.0 (domains_module_certificate → domains_certificate) The provider dropped the platform-internal naming (no more "module" or "cluster" in resource types). domains_module_certificate is now just domains_certificate; domains_cluster_certificate folded into the same resource since the only difference was whether the caller passes explicit domains. Bump version constraint to ~> 0.2. Wire per-service custom-domain certs via aws_lb_listener_certificate domains_module_certificate no longer accepts listener_arn (provider 0.1.x). Add explicit aws_lb_listener_certificate.ravion_custom to bind the cert as SNI on the cluster's HTTPS listener. Refresh doc strings across both modules to match the new ownership model. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
This reverts commit 32f6012.
This reverts commit 1cbbab9.
0.4.3 in the local mirror has the new ravion_domain resource (and certificate.domains field). 0.1.99 was an old binary that still exposed domains_module_certificate / domains_app_domain etc. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The alb module knows nothing about Ravion-managed domains anymore — the ecs_cluster module attaches the Ravion wildcard cert as an SNI extra on the listener after the alb is created. The precondition was referencing a variable that was never declared in alb's variables.tf. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Mode B previously replaced the auto-FQDN with the customer domains in the listener rule's host_header, dropping all traffic to the auto-FQDN the moment `domains` was set. Now Mode A and Mode B are two separate `ravion_domain` resources living side-by-side; the listener rule matches BOTH the auto-FQDN and every customer FQDN, giving the customer a no-downtime window to flip their DNS over. The Ravion control plane retires the auto-FQDN's Route53 record once at least one customer routing record is MATCHED. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Reads cutover state from the new data source so the auto resource's count flips to 0 once Ravion retires the slot. Next plan after retirement shows a clean destroy of ravion_domain.auto[0] and the listener rule narrows to only customer FQDNs. Provider bumped to 0.4.4 for the new data source. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Following the Mode A/B split, outputs.tf still referenced the dead `ravion_domain.this` symbol. Repoint to `ravion_domain.auto` and add two new outputs for the Mode B cert (custom domain id + cert_arn). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Drops aws_lb_listener_certificate.ravion. The new ravion_domain.custom takes `resource_arn = var.cluster_https_listener_arn` and Ravion attaches the cert to the listener server-side after ACM validates it. TF apply no longer fails with `UnsupportedCertificate` for the between-states-window when ACM has not yet flipped the cert ISSUED. The whole post-issue + pre-delete cert→listener lifecycle lives in Ravion now. Provider pinned to 0.4.5 for the new resource_arn attribute. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Two changes both targeting the same edge case (Mode B → Mode A): 1. `ravion_auto_retired` is only honoured while `ravion_mode_b` is true. When the user removes every entry from `var.domains`, the auto-FQDN must come back — otherwise the listener rule has zero host_header values and AWS rejects the apply. 2. Defensive: `aws_lb_listener_rule.ravion` count guards on `length(local.ravion_host_header_values) > 0`, so any future transient zero-headers state declines to declare the rule instead of failing AWS schema validation at plan time. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…r to 0.5.0
Multiple services share one cluster HTTPS listener, and AWS rejects
duplicate listener-rule priorities. Default ravion_listener_rule_priority
was 50000 (collision-prone if anyone overrode to a fixed number); now
defaults to 0 = auto-derive a stable priority from sha256(var.name),
giving every service a unique [1000, 49999] slot without hand-picking.
Explicit values (1-50000) still win.
Bumps both compute/ecs_service and compute/ecs_cluster provider pin
from 0.4.x to 0.5.0, which is the rotation-aware terraform-provider-domains
build: in-place SAN edits no longer destroy+create the cert (the prior
flow cascade-deleted ManagedDomain rows and left the Domains tab empty
while the new cert sat in PENDING_VALIDATION). 0.5.0 lands the PATCH
/domains/{id} path and tower-go's RotateManagedCertWorkflow.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
When cluster_parent_domain_id is set, aws_lb_listener_rule.ravion owns ALB routing scoped by host_header to this service's FQDNs. The caller- supplied listener_rules from the control plane became redundant — and actively harmful: priorities collide across services on the shared listener, and any path-only rule (path-pattern /*) catches traffic destined for sibling services. Short-circuit them entirely in that mode. Also add `ignore_changes = [action]` to the ravion rule so blue/green deploy controllers can flip target groups without TF undoing the swap. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
0.5.1 ships the schema fix that drops UseStateForUnknown on the rotation pending_* attributes — without it, in-place SAN rotations trigger "Provider produced inconsistent result after apply" because the planner expected the prior (empty) state values but the server returns the new rotation target. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…ssary try() calls for improved readability and performance. Update ECS cluster and service modules to support Ravion-managed domains, including new inputs for cluster domain ID and HTTPS listener ARN. Adjust ALB and NLB configurations to handle Ravion-specific requirements, ensuring proper certificate management and listener settings. Enhance documentation in module.yml files to clarify usage of Ravion-managed domains and associated configurations.
Restores the validation conditions in compute/{autoscaling,ecs_cluster,ecs_service}
and networking/{alb,nlb} variables.tf to their pre-03f87c3 state. The other
changes in 03f87c3 (Ravion module.yml + listener wiring fixes) remain.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
`public_alb_certificate_arns` is a pre-Ravion input. ARNs in it are frequently stale once the customer flips to Ravion mode, and silently trying to SNI-attach them on every apply turns dead certs into recurring "CertificateNotFound" failures. The Ravion wildcard already serves as the listener default, so the SNI-extras resource was strictly additive and never load-bearing. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The alb sub-module's security_group wires HTTPS ingress (443) only when it owns the HTTPS listener (gated on local.create_https_listener). My prior refactor flipped that off so Ravion could own the listener with the wildcard cert as default — but I forgot to re-add the SG rule. Symptom: dig resolves, port 443 TCP connect times out (SG drop), demo HTTPS broken end-to-end. Mirrors the rules the alb module would have emitted: one per IPv4 cidr in var.public_alb_ingress_cidr_blocks, one per IPv6 cidr (hardcoded ::/0 to match the alb module's default since the cluster module doesn't yet expose a public_alb_ingress_ipv6_cidr_blocks pass-through). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Greptile Summary
This PR introduces opt-in Ravion-managed domain/cert provisioning for ECS clusters and services, adds a new
ravion_domainTerraform resource type backed by a custom provider, and refactorsload_balancer_attachmentfrom a typed Terraform variable to a JSON-decoded string to support control-plane ARN templating.ecs_cluster: newravion_domains.tfallocates a wildcardravion_domainfor the cluster, binds the cert to the public ALB HTTPS listener as an SNI extra, and exposes domain outputs;module.ymladds UI inputs, dashboard metrics, and output declarations.ecs_service: newravion_domains.tfimplements Mode A (auto-FQDN under the cluster wildcard, with a two-phase cutover viaravion_auto_domain_status) and Mode B (per-service cert covering customer FQDNs);load_balancer_attachmentvariable replaced byload_balancer_attachment_jsonstring with local decoding;module.ymladded from scratch.try(..., true)change: every variable validation inautoscaling,ecs_cluster,ecs_service,alb,nlb, andvpcis wrapped to suppress throws from complex-object attribute access.Confidence Score: 2/5
Multiple blocking gaps exist between the module.yml control-plane definitions and the actual Terraform code — mismatched output names, mismatched input variable names, and conflicting provider version pins — that will cause apply failures or silent no-ops on every stack that uses the new Ravion domain wiring.
The module.yml files declare Ravion output names (ravion_managed_domains_enabled, ravion_public_alb_default_*) that simply do not exist in outputs.tf, and input names (ravion_parent_app_domain_id, ravion_auto_domain_listener_arn, etc.) that don't match the TF variables the resources actually read. Any downstream module referencing these outputs will receive nothing, and any upstream value the control plane passes using the module.yml keys will be silently ignored. Separately, the Ravion provider is pinned at = 0.4.3 in ecs_cluster and = 0.4.5 in ecs_service, which Terraform cannot satisfy in a shared workspace. Finally, when certificate_arns = [] (the new default set in module.yml), the ALB HTTPS listener is created with certificate_arn = null, which AWS rejects. Together these issues mean the primary new feature cannot be exercised in production without code fixes.
compute/ecs_cluster/module.yml and compute/ecs_service/module.yml (output/input name mismatches), compute/ecs_cluster/versions.tf and compute/ecs_service/versions.tf (conflicting provider version pins), networking/alb/locals.tf (null certificate_arn on HTTPS listener)
Important Files Changed
Sequence Diagram
sequenceDiagram participant CP as Control Plane participant Cluster as ecs_cluster module participant RavionAPI as Ravion Provider participant AWS as AWS (ACM + ALB) participant Service as ecs_service module CP->>Cluster: "use_ravion_managed_domains=true, ravion_aws_account_id" Cluster->>RavionAPI: "ravion_domain.cluster (wildcard cert, target=public ALB)" RavionAPI-->>AWS: "Issue ACM cert (*.cluster-fqdn + cluster-fqdn)" RavionAPI-->>Cluster: cert_arn, fqdn, id Cluster->>AWS: aws_lb_listener_certificate (SNI extra on HTTPS listener) Cluster-->>CP: ravion_cluster_domain_id, ravion_cluster_cert_arn, ... CP->>Service: cluster_parent_domain_id, cluster_https_listener_arn Service->>RavionAPI: data.ravion_auto_domain_status (check retirement) RavionAPI-->>Service: "retired=false" Service->>RavionAPI: ravion_domain.auto (child under cluster domain, Mode A) RavionAPI-->>Service: auto fqdn (svc-hash.cluster-fqdn) Service->>AWS: aws_lb_listener_rule.ravion (host_header to target group) Note over Service,AWS: Mode B (custom domains) CP->>Service: "domains=[api.example.com]" Service->>RavionAPI: ravion_domain.custom (cert covering customer FQDNs) RavionAPI-->>AWS: Attach cert to ALB listener via resource_arn Note over RavionAPI,AWS: Customer adds DNS CNAME records RavionAPI-->>Service: "retired=true (after DNS resolves)" Service->>RavionAPI: ravion_domain.auto count to 0 (destroy)Comments Outside Diff (1)
compute/autoscaling/variables.tf, line 9-10 (link)try(..., true)silences all validation errorsWrapping every validation condition in
try(..., true)means any expression that throws (e.g., a type mismatch or an unexpected attribute access on an incorrectly-shaped value) silently passes as valid instead of surfacing an error message. For the simple scalar checks likevar.desired_capacity == null || var.desired_capacity >= 0, the null-guard short-circuit already prevented the throw withouttry(), so the wrapper is no-op but harmless there. The concern is the object-typed variables (mixed_instances_policy,instance_refresh,warm_pool,scaling_policies) where a malformed input object that causes an attribute-access error will now be accepted with no user feedback. This same pattern is applied throughoutcompute/ecs_cluster/variables.tf,compute/ecs_service/variables.tf,networking/alb/variables.tf, andnetworking/nlb/variables.tf.Prompt To Fix With AI
Prompt To Fix All With AI
Reviews (1): Last reviewed commit: "ecs_service: bring auto-FQDN back when u..." | Re-trigger Greptile