[4970] Generate tuning inputs on GPU via splitmix64 device RNG #4971
itikhono wants to merge 5 commits into
Thank you for your contribution! Since this is an external pull request, a maintainer must review the PR and add the "ok-to-test" label if it is approved for testing.
Pull request overview
This PR improves GPU compile/tuning throughput by generating candidate input buffers directly on the GPU (splitmix64 counter-based RNG) instead of generating on the host and copying H2D per candidate.
Changes:
- Added a GPU-side random-fill kernel (`device::generate_random`) and a host wrapper (`gpu_generate_random`) that recurses into tuple sub-objects.
- Updated the `time_program` tuning path to allocate parameter buffers on GPU and fill them via `gpu_generate_random` (keeping `fill_map` on the host-fill path).
- Added a GPU unit test covering determinism, supported types, empty shapes, tuples, and non-computable types.
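A minimal host-side sketch of the counter-based scheme may help make the determinism claim concrete. It is not the PR's kernel: the function names `splitmix64_mix`, `random_at`, and `fill_random` are hypothetical, and the mapping of 64-bit outputs to floats in [0, 1) is an assumption, since the actual value range is not stated here. The mixing constants are the standard splitmix64 ones, and the PR description characterizes the input to the mixer as `seed + i * golden_ratio_step`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Standard splitmix64 output mixer (finalizer constants from the reference algorithm).
constexpr std::uint64_t splitmix64_mix(std::uint64_t z)
{
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// Counter-based draw: element i depends only on (seed, i), so every GPU
// thread could compute its own value without any shared RNG state.
constexpr std::uint64_t random_at(std::uint64_t seed, std::uint64_t i)
{
    constexpr std::uint64_t golden_ratio_step = 0x9E3779B97F4A7C15ULL;
    return splitmix64_mix(seed + i * golden_ratio_step);
}

// Host stand-in for the device kernel (hypothetical name). The loop body is
// what one thread would do for its index i.
// ASSUMPTION: values are mapped to floats in [0, 1) using the top 24 bits.
inline std::vector<float> fill_random(std::uint64_t seed, std::size_t n)
{
    std::vector<float> out(n);
    for(std::size_t i = 0; i < n; ++i)
    {
        out[i] = static_cast<float>(random_at(seed, i) >> 40) * (1.0f / 16777216.0f);
        assert(out[i] >= 0.0f && out[i] < 1.0f); // 24-bit integer scaled by 2^-24
    }
    return out;
}
```

Because `fill_random(seed, n)` has no state carried between calls, two tuning candidates given the same seed see identical inputs, which is the property the PR relies on for fair comparison.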
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/gpu/generate_random.cpp | New GPU test coverage for deterministic RNG fill, tuples, and non-computable types. |
| src/targets/gpu/time_op.cpp | Switches tuning input creation to GPU allocation + GPU RNG fill (except fill_map). |
| src/targets/gpu/include/migraphx/gpu/hip.hpp | Exposes gpu_generate_random API. |
| src/targets/gpu/include/migraphx/gpu/device/generate_random.hpp | Declares new device-side RNG entrypoint. |
| src/targets/gpu/hip.cpp | Implements gpu_generate_random wrapper with tuple recursion. |
| src/targets/gpu/device/generate_random.cpp | Implements splitmix64-based device kernel to fill buffers. |
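The tuple recursion attributed to the `gpu_generate_random` wrapper can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `src/targets/gpu/hip.cpp`: `buffer_node`, `fill_leaf`, and `fill_random_recursive` are hypothetical names, the buffers live in host memory rather than on the GPU, and reusing the same seed for every sub-object is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an argument: a leaf buffer or a tuple of sub-objects.
struct buffer_node
{
    std::vector<std::uint8_t> bytes;      // leaf storage (unused for tuples)
    std::vector<buffer_node> sub_objects; // non-empty means "this is a tuple"
};

// Placeholder leaf fill: byte i is the low byte of a splitmix64-mixed counter.
inline void fill_leaf(std::vector<std::uint8_t>& bytes, std::uint64_t seed)
{
    for(std::size_t i = 0; i < bytes.size(); ++i)
    {
        std::uint64_t z = seed + i * 0x9E3779B97F4A7C15ULL;
        z               = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
        z               = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
        bytes[i]        = static_cast<std::uint8_t>(z ^ (z >> 31));
    }
}

// Fills every leaf reachable from node and returns how many leaves were filled.
// ASSUMPTION: each sub-object is filled with the same seed.
inline std::size_t fill_random_recursive(buffer_node& node, std::uint64_t seed)
{
    if(node.sub_objects.empty())
    {
        fill_leaf(node.bytes, seed); // a zero-sized leaf is a no-op
        return 1;
    }
    std::size_t leaves = 0;
    for(auto& sub : node.sub_objects)
        leaves += fill_random_recursive(sub, seed);
    assert(leaves >= node.sub_objects.size()); // each sub-object has at least one leaf
    return leaves;
}
```

Returning the leaf count is a convenience of this sketch; it makes it easy to check in a test that every sub-buffer of a nested tuple was visited.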
Don't worry about that one, I think it's missing a label. Just fix the licensing.
…/AMDMIGraphX into gpu-device-bench-inputs
Regressions detected 🔴
@kahmed10 @shivadbhavsar could you help review this PR, please?
This PR covers the first part of issue #4970.

It eliminates the "input-gen + H2D (CPU waste)" part; the GPU part (caused by the bundle increase 1->10) remains.
- `device::generate_random` uses a counter-based splitmix64 RNG (`seed + i * golden_ratio_step` → splitmix64), so output is deterministic per seed and reproducible across candidates for fair comparison.
- `time_program` now allocates inputs with `allocate_gpu` and fills them via `gpu_generate_random` (recurses tuple sub-objects), while `fill_map` inputs keep the host-fill path.
Behavior parity with the old host path
- bool: `visit_all` → `normalize<bool>` → 0/1, identical to the old special-case.
- Non-computable types (e.g. fp4x2): `visit_all` would throw, so generation falls back to a raw byte fill, matching the old `uint8` host behavior.
- `generate_argument`.
Performance
Test plan
- `test_gpu_generate_random`: seed determinism + range, half type, empty shape no-op, tuple fills every sub-buffer, non-computable (fp4x2) raw-byte fill. 5/5 pass.
Perf testing for YOLO-family models (MI350):
Used `migraphx-driver perf`; no actual difference was detected, though the results are quite noisy.
different models, batch 4