feat: add GRID v20 and CUDA LTS driver containers #158
Operational follow-ups before downstream rollout

A rubber-duck pass surfaced two items worth tracking. Both are operational, not code blockers for this PR.

1. MCR onboarding for the new repo path (action required)

The new image path needs to be onboarded to MCR before AgentBaker can pull it. Confirmed with a current check:

Without the registration/ACL/replication setup, the first push from this workflow will succeed against ACR but the public MCR path will still 404. Please trigger the standard new-MCR-repo onboarding for

2. Runtime validation on RTX PRO 6000 BSE v6 hardware

Blackwell GPUs require the open-source NVIDIA kernel module. The 595.x branch should default to the open kmod at install time (default since 560.x), so the existing
Could you also extend the logic at the same time to support two CUDA drivers in AKS-GPU?
Adds two new NVIDIA driver container images alongside the existing ones, following the same matrix-build pattern as the existing aks-gpu-cuda / aks-gpu-cuda-arm64 split:

- `aks-gpu-grid-v20` (GRID v20, driver `595.58.03`): required for the RTX PRO 6000 Blackwell Server Edition v6 SKU family. Existing `aks-gpu-grid` (v18, 570.211.01) is unchanged.
- `aks-gpu-cuda-lts` and `aks-gpu-cuda-lts-arm64` (CUDA LTS, NVIDIA R580 LTSB `580.159.04`): a Long Term Support Branch variant alongside the existing Production Branch `aks-gpu-cuda` (595.71.05). R580 LTSB is supported by NVIDIA through Aug 2028.

Files:

- `driver_config.yml`: add `grid_v20` and `cuda_lts` blocks.
- `main.yaml` / `ci.yaml`: matrix-include both branches in each of the grid, cuda, and cuda-arm64 jobs. `DRIVER_KIND=grid` / `cuda` is literal; cache keys and image tags scope by image repo.
- `justfile`: add `buildgridv20` / `pushgridv20` and `buildcudalts` / `pushcudalts`.

Out of scope:

- `auto_update.py`: new variants pinned manually for now, same flow as the existing v18 pin (PR #154). A focused follow-up can extend the updater to handle each branch within its own major.
- AgentBaker consumption of the new images: separate PR.
- Mariner/Azure Linux/ACL GRID install paths: use separate package/sysext flows, unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
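As a rough illustration of "matrix-include both branches" with cache keys and image tags scoped by image repo, the following Python sketch builds one matrix entry per driver branch. The dictionary shape, the field names (`repo`, `kind`, `version`), the `matrix_for` helper, and the cache-key format are all assumptions made for the example; they are not the actual `driver_config.yml` schema or workflow syntax.

```python
# Hypothetical in-memory stand-in for driver_config.yml; the field names
# ("repo", "kind", "version") are assumptions, not the file's real schema.
DRIVER_CONFIG = {
    "grid": {"repo": "aks-gpu-grid", "kind": "grid", "version": "570.211.01"},
    "grid_v20": {"repo": "aks-gpu-grid-v20", "kind": "grid", "version": "595.58.03"},
    "cuda": {"repo": "aks-gpu-cuda", "kind": "cuda", "version": "595.71.05"},
    "cuda_lts": {"repo": "aks-gpu-cuda-lts", "kind": "cuda", "version": "580.159.04"},
}


def matrix_for(kind, config=DRIVER_CONFIG):
    """Build one matrix entry per driver branch of the given kind."""
    entries = []
    for block in config.values():
        if block["kind"] != kind:
            continue
        entries.append({
            # DRIVER_KIND stays the literal kind ("grid" or "cuda") for every branch.
            "DRIVER_KIND": kind,
            # Tag and cache key are derived from the image repo, so two branches
            # of the same kind never share either one.
            "image_tag": f'{block["repo"]}:{block["version"]}',
            "cache_key": f'{block["repo"]}-buildcache',
        })
    return entries


grid_matrix = matrix_for("grid")
cuda_matrix = matrix_for("cuda")
```

Because the repo name is the scoping key, adding a branch adds a new cache key and a new tag namespace without touching the existing ones.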
Adds two new NVIDIA driver container variants, following the same matrix-build pattern as the existing `aks-gpu-cuda` / `aks-gpu-cuda-arm64` split:

- `aks-gpu-grid-v20`: NVIDIA GRID v20 (driver `595.58.03`). Required for the RTX PRO 6000 Blackwell Server Edition v6 SKU family. Existing `aks-gpu-grid` (v18, `570.211.01`) is unchanged.
- `aks-gpu-cuda-lts` and `aks-gpu-cuda-lts-arm64`: NVIDIA R580 LTSB (`580.159.04`). A Long Term Support Branch variant alongside the existing Production Branch `aks-gpu-cuda` (`595.71.05`). R580 LTSB is supported by NVIDIA through Aug 2028.

Changes

- `driver_config.yml`: add `grid_v20` and `cuda_lts` blocks.
- `main.yaml` / `ci.yaml`: matrix-include both branches in each of the `grid`, `cuda`, and `cuda-arm64` jobs; `DRIVER_KIND=grid` / `cuda` is literal; cache keys and image tags scope by image repo.
- `justfile`: add `buildgridv20` / `pushgridv20` and `buildcudalts` / `pushcudalts`.

Design trade-off: separate MCR repo vs. shared repo with version-prefix tags
Two designs were considered:
- Chosen (this PR): new MCR repo per driver branch, e.g. `aks-gpu-grid` (570.x), `aks-gpu-grid-v20` (595.x), `aks-gpu-cuda` (PB), `aks-gpu-cuda-lts` (LTSB).
- Alternative considered: keep one MCR repo per driver kind and distinguish branches by tag prefix, e.g. `aks-gpu-grid:570.*` and `aks-gpu-grid:595.*`, both pushed from this same workflow.

Recurring cost vs. one-time cost
The biggest asymmetry between the two designs is when the complexity is paid:
- Chosen design (recurring cost): every new branch (`grid_v21`, next CUDA LTSB after R580, etc.) requires a new MCR repo onboarding, a new datamodel constant in AgentBaker, a new line in the SKU → repo selection logic, and a new Renovate rule. Version is encoded in the image name, so every branch bump requires coordinating a new image name across repos.
- Alternative (one-time cost): a new branch needs only a new block in `driver_config.yml` and a new tag on the existing repo; the image name does not change. Version lives in the tag, which is its natural place. But the schema/parser/Renovate complexity to support multiple branches in one repo has to be paid once up front, and it brings a silent-regression risk that has to be permanently guarded against.

Where the alternative pushes complexity in AgentBaker
- `components.json` schema disambiguation. With separate repos, each entry has a unique `downloadURL`. With a shared repo, both entries would have the same `downloadURL` (`aks-gpu-grid:*`), forcing a new disambiguator field (e.g. `driverBranch: "570"` / `"595"`) or a change to the `:*` placeholder convention.
- `gpu_components.go` parser rework. Today it does `switch strings.TrimSuffix(image.DownloadURL, ":*")`, so two entries pointing to the same repo collapse to the same case and the second silently overwrites the first. Disambiguation requires switching on (repo, branch).
- Renovate rule pinning. Without a pin, Renovate would pick the highest tag (`595.x` > `570.x`), which would silently overwrite the v18 entry with a 595.x tag, regressing every v18 GRID node. Preventing this requires `matchCurrentValue: "/^570\\./"` and `/^595\\./` regex pins on each rule: permanent complexity plus an ambient "latest tag" ambiguity for anyone outside AgentBaker pulling naively.

Verdict
The chosen design is simpler to reason about today but pays its cost every time a new branch is added. The alternative is more natural for versions-as-tags and cheaper per new branch, but has a permanent silent-regression risk in Renovate that has to be actively guarded against.
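The two silent-regression modes attributed to the shared-repo alternative (the parser collapse and the highest-tag overwrite) can be sketched in a few lines. This is a simplified Python model, not AgentBaker's Go parser and not Renovate's actual matching logic: the entry shape, the `driverBranch` field, and the helpers `index_by_repo`, `index_by_repo_and_branch`, and `pick_update` are hypothetical names introduced for illustration.

```python
import re

# Hypothetical entries for the shared-repo design; "driverBranch" is the
# disambiguator field suggested above, not an existing components.json field.
ENTRIES = [
    {"downloadURL": "aks-gpu-grid:*", "driverBranch": "570", "version": "570.211.01"},
    {"downloadURL": "aks-gpu-grid:*", "driverBranch": "595", "version": "595.58.03"},
]


def trim_suffix(text, suffix):
    """Mirror the behavior of Go's strings.TrimSuffix."""
    return text[: -len(suffix)] if suffix and text.endswith(suffix) else text


def index_by_repo(entries):
    """Key on the repo alone: entries sharing a repo overwrite each other."""
    out = {}
    for entry in entries:
        out[trim_suffix(entry["downloadURL"], ":*")] = entry["version"]
    return out


def index_by_repo_and_branch(entries):
    """Key on (repo, branch): both entries survive."""
    out = {}
    for entry in entries:
        key = (trim_suffix(entry["downloadURL"], ":*"), entry["driverBranch"])
        out[key] = entry["version"]
    return out


def version_key(tag):
    """Compare dotted versions numerically rather than as strings."""
    return tuple(int(part) for part in tag.split("."))


def pick_update(candidates, branch_pin=None):
    """Pick the highest tag, optionally restricted to tags matching a regex pin."""
    allowed = [t for t in candidates if branch_pin is None or re.search(branch_pin, t)]
    return max(allowed, key=version_key)


tags = ["570.211.01", "595.58.03"]
collapsed = index_by_repo(ENTRIES)
separated = index_by_repo_and_branch(ENTRIES)
unpinned = pick_update(tags)
pinned = pick_update(tags, branch_pin=r"^570\.")
```

In this model, `collapsed` keeps a single entry for `aks-gpu-grid` while `separated` keeps both, and the unpinned pick jumps to the 595.x tag while the pinned pick stays on 570.x. The guard only holds for as long as every rule carries its pin, which is the permanent cost the verdict refers to.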
Given the cadence of new driver branches (a couple per year between GRID and CUDA LTSBs) and the existing precedent of `aks-gpu-cuda` / `aks-gpu-cuda-arm64` already being separate repos, the chosen design is the more compatible step. If we end up onboarding many more branches over time, switching to the tag-based model later is a reasonable migration target.

Out of scope
- `auto_update.py`: new variants pinned manually for now (same flow as the existing v18 pin in PR #154, "chore: update NVIDIA GRID driver to v18.6 (570.211.01)").

Operational note
New MCR paths (`aks-gpu-grid-v20`, `aks-gpu-cuda-lts`, `aks-gpu-cuda-lts-arm64`) need standard onboarding before downstream consumers can pull them.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>