[4970] Generate tuning inputs on GPU via splitmix64 device RNG by itikhono · Pull Request #4971 · ROCm/AMDMIGraphX

itikhono · 2026-06-16T17:16:05Z

This PR addresses the first part of issue #4970.
It eliminates the "input-gen + H2D (CPU waste)" part; the GPU part (caused by the bundle increase 1->10) remains.

During op/program tuning, candidate inputs were generated on the host (xorshf96 PRNG) and copied to the device for every candidate. This replaces that with a device kernel that fills tuning inputs directly on the GPU, removing the per-candidate host PRNG + H2D copy.
New device::generate_random uses a counter-based splitmix64 RNG (seed + i * golden_ratio_step → splitmix64), so output is deterministic per seed and reproducible across candidates for fair comparison.
time_program now allocates inputs with allocate_gpu and fills them via gpu_generate_random (recurses tuple sub-objects), while fill_map inputs keep the host-fill path.
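To illustrate the counter-based scheme described above, here is a minimal host-side model in Python. It is a sketch, not the PR's kernel: the constants are the standard published splitmix64 ones, while `random_at`, `fill_uniform`, and the mapping of raw bits to `[0, 1)` are hypothetical names and choices made for this example only. The point it shows is that element `i` depends only on `(seed, i)`, so any thread can compute its own value without shared state.

```python
MASK64 = (1 << 64) - 1
# Standard splitmix64 increment (the "golden ratio step").
GOLDEN_GAMMA = 0x9E3779B97F4A7C15


def splitmix64(x: int) -> int:
    """One splitmix64 step: add the increment, then apply the 64-bit mixer."""
    z = (x + GOLDEN_GAMMA) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


def random_at(seed: int, i: int) -> int:
    """Counter-based draw: seed + i * golden_ratio_step -> splitmix64.

    Hypothetical helper; the exact device-side formula may differ.
    """
    return splitmix64((seed + i * GOLDEN_GAMMA) & MASK64)


def fill_uniform(seed: int, n: int) -> list[float]:
    """Model of the fill: each element is computed independently from (seed, i).

    Mapping the top 53 bits to [0, 1) is an assumption of this sketch.
    """
    return [(random_at(seed, i) >> 11) * 2.0**-53 for i in range(n)]


if __name__ == "__main__":
    a = fill_uniform(seed=7, n=4)
    b = fill_uniform(seed=7, n=4)
    print(a == b)  # same seed, same buffer contents
    print(a)
```

Because no state is carried between elements, refilling a buffer for each tuning candidate with the same seed yields identical inputs, which is what makes the candidate timings comparable.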

Behavior parity with the old host path

bool: handled by visit_all → normalize<bool> → 0/1, identical to the old special-case.
fp4x2 (only non-computable type): visit_all would throw, so generation falls back to a raw byte fill — matching the old uint8 host behavior.
tuples: same seed across sub-objects, same as the previous generate_argument.
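The three parity rules above can be modelled on the host as follows. This is only a sketch under stated assumptions: shapes are represented as `(dtype, n)` pairs or lists of sub-shapes, and `fill`, `random_at`, the "lowest bit" rule for bool, and the "lowest byte" rule for the raw fill are all hypothetical stand-ins, not the behavior of `visit_all`, `normalize<bool>`, or `gpu_generate_random` themselves.

```python
MASK64 = (1 << 64) - 1
GOLDEN_GAMMA = 0x9E3779B97F4A7C15  # standard splitmix64 increment


def random_at(seed: int, i: int) -> int:
    """Counter-based splitmix64 draw for element i (hypothetical helper)."""
    z = (seed + (i + 1) * GOLDEN_GAMMA) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


def fill(shape, seed: int):
    """Fill a shape description with deterministic values.

    shape is either (dtype, n) or a list of sub-shapes (a tuple shape).
    """
    if isinstance(shape, list):
        # Tuples: recurse with the SAME seed for every sub-object.
        return [fill(sub, seed) for sub in shape]
    dtype, n = shape
    if dtype == "bool":
        # Assumption: bool values end up as 0 or 1.
        return [random_at(seed, i) & 1 for i in range(n)]
    if dtype == "fp4x2":
        # Non-computable type: fall back to a raw byte fill.
        return bytes(random_at(seed, i) & 0xFF for i in range(n))
    # Any other type: a value in [0, 1) (assumed range for this sketch).
    return [(random_at(seed, i) >> 11) * 2.0**-53 for i in range(n)]


if __name__ == "__main__":
    out = fill([("float", 4), ("float", 4), ("bool", 8), ("fp4x2", 6)], seed=3)
    print(out[0] == out[1])  # identical sub-buffers, since the seed is shared
    print(out[2], out[3].hex())
```

One consequence of sharing the seed is that two sub-objects of the same type and size receive identical contents, which matches the old host behavior described above.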

Performance

No FPS regression across the YOLO model family (within noise).
Compile/tuning time improved by up to ~6.6x at batch 64 on MI350 and ~10x at batch 32 on R9700. These numbers were measured with the bundle increase 1->10 also reverted.

Test plan

test_gpu_generate_random: seed determinism + range, half type, empty shape no-op, tuple fills every sub-buffer, non-computable (fp4x2) raw-byte fill — 5/5 pass.
YOLO compile + inference sweep (fork vs develop).

Perf testing for YOLO-family models (MI350):

Measured with migraphx-driver perf. No real difference was detected; the results are quite noisy.

Different models, batch 4:

Model	Fixed, img/s	Develop (before), img/s	Δ
yolov8m	1255.7	1242.5	+1.1%
yolov9m	1177.1	1111.2	+5.9%
yolov10m	1337.1	1303.4	+2.6%
yolo11m	1583.1	1564.2	+1.2%
yolo12m	1341.4	1350.1	−0.6%
yolo26m	1415.0	1407.2	+0.6%
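For reference, the Δ column is the relative change of the fixed branch against develop, (fixed − develop) / develop. The short script below recomputes it from the throughput figures in the table; the function and variable names are chosen for this example only.

```python
# Throughput in img/s, copied from the table: (fixed, develop).
RATES = {
    "yolov8m": (1255.7, 1242.5),
    "yolov9m": (1177.1, 1111.2),
    "yolov10m": (1337.1, 1303.4),
    "yolo11m": (1583.1, 1564.2),
    "yolo12m": (1341.4, 1350.1),
    "yolo26m": (1415.0, 1407.2),
}


def relative_change(fixed: float, develop: float) -> float:
    """Percentage change of `fixed` relative to `develop`."""
    return (fixed - develop) / develop * 100.0


if __name__ == "__main__":
    for model, (fixed, develop) in RATES.items():
        print(f"{model}: {relative_change(fixed, develop):+.1f}%")
```

Rounded to one decimal place, the results match the Δ column above.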

github-actions · 2026-06-16T17:16:33Z

Thank you for your contribution! Since this is an external pull request, a maintainer must review PR and add the "ok-to-test" label if it is approved for testing.

Copilot

Pull request overview

This PR improves GPU compile/tuning throughput by generating candidate input buffers directly on the GPU (splitmix64 counter-based RNG) instead of generating on the host and copying H2D per candidate.

Changes:

Added a GPU-side random-fill kernel (device::generate_random) and a host wrapper (gpu_generate_random) that recurses into tuple sub-objects.
Updated time_program tuning path to allocate parameter buffers on GPU and fill them via gpu_generate_random (keeping fill_map on the host-fill path).
Added a GPU unit test covering determinism, supported types, empty shapes, tuples, and non-computable types.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.


File	Description
test/gpu/generate_random.cpp	New GPU test coverage for deterministic RNG fill, tuples, and non-computable types.
src/targets/gpu/time_op.cpp	Switches tuning input creation to GPU allocation + GPU RNG fill (except `fill_map`).
src/targets/gpu/include/migraphx/gpu/hip.hpp	Exposes `gpu_generate_random` API.
src/targets/gpu/include/migraphx/gpu/device/generate_random.hpp	Declares new device-side RNG entrypoint.
src/targets/gpu/hip.cpp	Implements `gpu_generate_random` wrapper with tuple recursion.
src/targets/gpu/device/generate_random.cpp	Implements splitmix64-based device kernel to fill buffers.

itikhono · 2026-06-18T11:04:04Z

@pfultz2 @causten could you help to run [MIGraphX Performance Tests / security_gate (pull_request_target)] target? I think I don't have access/rights to trigger this job

pfultz2 · 2026-06-18T13:52:43Z

could you help to run [MIGraphX Performance Tests / security_gate (pull_request_target)] target?

Don't worry about that one; I think it is missing a label. Just fix the licensing.


gh-app-migraphx-bot-pr-write · 2026-06-18T23:08:17Z

Test	Batch	New Rate (5a3a3e)	Old Rate (f54ca3)	Diff	Status
torchvision-resnet50	64	669.58	3,151.54	-78.75%	🔴
torchvision-resnet50_fp16	64	1,766.59	6,672.57	-73.52%	🔴
torchvision-densenet121	32	557.68	2,669.98	-79.11%	🔴
torchvision-densenet121_fp16	32	644.55	4,553.52	-85.85%	🔴
torchvision-inceptionv3	32	416.47	1,794.69	-76.79%	🔴
torchvision-inceptionv3_fp16	32	501.54	2,710.69	-81.50%	🔴
cadene-inceptionv4	16	222.10	224.10	-0.89%	✅
cadene-resnext64x4	16	176.18	409.63	-56.99%	🔴
slim-mobilenet	64	4,428.97	8,288.90	-46.57%	🔴
slim-nasnetalarge	64	nan	229.24	nan	❌
slim-resnet50v2	64	776.52	3,329.55	-76.68%	🔴
bert-mrpc-onnx	8	140.05	1,170.13	-88.03%	🔴
bert-mrpc-tf	1	37.33	482.99	-92.27%	🔴
pytorch-examples-wlang-gru	1	210.63	335.86	-37.28%	🔴
pytorch-examples-wlang-lstm	1	220.15	464.03	-52.56%	🔴
torchvision-resnet50_1	1	25.53	771.72	-96.69%	🔴
cadene-dpn92_1	1	48.67	451.76	-89.23%	🔴
cadene-resnext101_1	1	44.39	363.79	-87.80%	🔴
onnx-taau-downsample	1	35.45	400.76	-91.15%	🔴
dlrm-criteoterabyte	1	6.16	32.67	-81.14%	🔴
dlrm-criteoterabyte_fp16	1	nan	52.58	nan	❌
agentmodel	1	688.21	9,679.70	-92.89%	🔴
unet_fp16	2	11.24	57.23	-80.35%	🔴
resnet50v1_fp16	1	40.31	972.36	-95.85%	🔴
resnet50v1_int8	1	300.05	960.27	-68.75%	🔴
bert_base_cased_fp16	64	323.80	1,102.92	-70.64%	🔴
bert_large_uncased_fp16	32	82.58	347.52	-76.24%	🔴
bert_large_fp16	1	4.49	205.28	-97.82%	🔴
distilgpt2_fp16	16	303.52	2,094.45	-85.51%	🔴
yolov5s	1	95.46	568.20	-83.20%	🔴
tinyllama	1	3.76	46.01	-91.83%	🔴
vicuna-fastchat	1	8.54	44.04	-80.60%	🔴
whisper-tiny-encoder	1	51.11	419.49	-87.82%	🔴
whisper-tiny-decoder	1	13.59	420.06	-96.76%	🔴
llama2_7b	1	2.06	20.46	-89.95%	🔴
qwen1.5-7b	1	11.39	23.67	-51.88%	🔴
phi3-3.8b	1	2.67	26.98	-90.09%	🔴
llama3-8b	1	2.11	21.83	-90.31%	🔴
whisper-large-encoder	1	2.50	10.32	-75.76%	🔴
whisper-large-decoder	1	3.09	107.05	-97.11%	🔴
mistral-7b	1	5.08	23.86	-78.72%	🔴
FLUX.1-schnell	1	43.74	761.61	-94.26%	🔴

Regressions detected 🔴

gh-app-migraphx-bot-pr-write · 2026-06-18T23:08:18Z

Test	Status	Result
bert-mrpc-onnx	✅	PASSED: MIGraphX meets tolerance
bert-mrpc-tf	❌	ERROR - check error output traceback Traceback (most recent call last): File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 377, in main() File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 313, in main import tensorflow as tf File "/usr/local/lib/python3.10/dist-packages/tensorflow/init.py", line 38, in from tensorflow.python.tools import module_util as _module_util File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/init.py", line 36, in from tensorflow.python import pywrap_tensorflow as _pywrap_tensorflow File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 26, in self_check.preload_check() File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/platform/self_check.py", line 63, in preload_check from tensorflow.python.platform import _pywrap_cpu_feature_guard ImportError: libamdhip64.so.6: cannot open shared object file: No such file or directory
pytorch-examples-wlang-gru	✅	PASSED: MIGraphX meets tolerance
pytorch-examples-wlang-lstm	✅	PASSED: MIGraphX meets tolerance
dlrm-criteoterabyte	✅	PASSED: MIGraphX meets tolerance
agentmodel	✅	PASSED: MIGraphX meets tolerance
unet	✅	PASSED: MIGraphX meets tolerance
resnet50v1	✅	PASSED: MIGraphX meets tolerance
bert_base_cased_fp16	✅	PASSED: MIGraphX meets tolerance
bert_large_uncased_fp16	🔴	FAILED: MIGraphX is not within tolerance - check verbose output
bert_large	✅	PASSED: MIGraphX meets tolerance
yolov5s	🔴	FAILED: MIGraphX is not within tolerance - check verbose output
tinyllama	✅	PASSED: MIGraphX meets tolerance
vicuna-fastchat	✅	PASSED: MIGraphX meets tolerance
whisper-tiny-encoder	✅	PASSED: MIGraphX meets tolerance
whisper-tiny-decoder	✅	PASSED: MIGraphX meets tolerance
distilgpt2_fp16	🔴	FAILED: MIGraphX is not within tolerance - check verbose output
llama2_7b	✅	PASSED: MIGraphX meets tolerance
qwen1.5-7b	✅	PASSED: MIGraphX meets tolerance
phi3-3.8b	✅	PASSED: MIGraphX meets tolerance
llama3-8b	✅	PASSED: MIGraphX meets tolerance
whisper-large-encoder	❌	ERROR - check error output traceback 2026-06-18 18:02:07.178313 [WARN] [/data/src/onnx/onnx_parser.cpp:282] Model has unbound symbolic dimension(s): batch_size, encoder_sequence_length, feature_size. These default to 1 and may cause unexpected behavior. Try setting `--dim-param @<name> <value>` or `--input-dim @<input> <dims>` if program compilation fails. Traceback (most recent call last): File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 377, in main() File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 224, in main model = migraphx.parse_onnx(model_name, default_dim_value=batch) RuntimeError: /data/src/include/migraphx/op/convolution.hpp:113: normalize_compute_shape: CONVOLUTION: mismatched channel numbers: input channels (1) != weights channels (80) * group (1)
whisper-large-decoder	❌	ERROR - check error output traceback 2026-06-18 18:02:09.983074 [WARN] [/data/src/onnx/onnx_parser.cpp:282] Model has unbound symbolic dimension(s): batch_size, decoder_sequence_length, encoder_sequence_length / 2. These default to 1 and may cause unexpected behavior. Try setting `--dim-param @<name> <value>` or `--input-dim @<input> <dims>` if program compilation fails.
mistral-7b	✅	PASSED: MIGraphX meets tolerance
FLUX.1-schnell	✅	PASSED: MIGraphX meets tolerance

itikhono · 2026-06-19T16:11:39Z

@kahmed10 @shivadbhavsar could you help with reviewing this PR please?

Generate tuning inputs on GPU via splitmix64 device RNG

c6a80fe

Copilot AI review requested due to automatic review settings June 16, 2026 17:16

itikhono requested a review from causten as a code owner June 16, 2026 17:16

Copilot started reviewing on behalf of itikhono June 16, 2026 17:16

itikhono mentioned this pull request Jun 16, 2026

YOLO-family models: slow compile that grows dramatically with input size #4970

Open

Copilot AI reviewed Jun 16, 2026


Comment thread src/targets/gpu/time_op.cpp Outdated

Comment thread src/targets/gpu/device/generate_random.cpp Outdated

pfultz2 requested changes Jun 16, 2026


Comment thread src/targets/gpu/device/generate_random.cpp Outdated

Comment thread src/targets/gpu/include/migraphx/gpu/hip.hpp Outdated

Comment thread test/gpu/generate_random.cpp Outdated

Resolve review comments; new tests

17c321d

itikhono requested a review from pfultz2 June 17, 2026 09:28

pfultz2 approved these changes Jun 17, 2026


Merge branch 'develop' into gpu-device-bench-inputs

b81105d

pfultz2 added the ok-to-test label Jun 18, 2026

itikhono added 2 commits June 18, 2026 20:25

fix license

5a89c76

Merge branch 'gpu-device-bench-inputs' of https://github.com/itikhono…

5a3a3e5

…/AMDMIGraphX into gpu-device-bench-inputs

github-actions Bot removed the ok-to-test label Jun 18, 2026

itikhono added the ok-to-test label Jun 18, 2026

pfultz2 requested review from kahmed10 and shivadbhavsar June 18, 2026 23:13

itikhono wants to merge 5 commits into ROCm:develop from itikhono:gpu-device-bench-inputs

3 participants
