[4970] Generate tuning inputs on GPU via splitmix64 device RNG #4971

Open
itikhono wants to merge 5 commits into ROCm:develop from itikhono:gpu-device-bench-inputs

itikhono commented Jun 16, 2026

Contributor

This PR covers the first part of issue #4970.
It eliminates the "input-gen + H2D (CPU waste)" part; the GPU part (caused by the bundle increase 1->10) remains.

  • During op/program tuning, candidate inputs were generated on the host (xorshf96 PRNG) and copied to the device for every candidate. This replaces that with a device kernel that fills tuning inputs directly on the GPU, removing the per-candidate host PRNG + H2D copy.
  • New device::generate_random uses a counter-based splitmix64 RNG (seed + i * golden_ratio_step -> splitmix64), so output is deterministic per seed and reproducible across candidates for fair comparison.
  • time_program now allocates inputs with allocate_gpu and fills them via gpu_generate_random (which recurses into tuple sub-objects), while fill_map inputs keep the host-fill path.
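
To make the counter-based scheme concrete, here is a minimal host-side sketch of how such a fill could work. It is not the PR's kernel: the function names, the 24-bit mapping to [0, 1), and the use of a plain loop in place of GPU threads are all assumptions for illustration. Only the splitmix64 mixing constants and the `seed + i * golden_ratio_step` counter form are taken from the standard algorithm and the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// splitmix64 finalizer: scrambles a 64-bit counter value into a well-mixed output.
inline std::uint64_t splitmix64_mix(std::uint64_t z)
{
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// Counter-based draw: element i depends only on (seed, i), so every element
// can be computed independently and in any order.
inline std::uint64_t random_bits(std::uint64_t seed, std::uint64_t i)
{
    const std::uint64_t golden_ratio_step = 0x9E3779B97F4A7C15ULL; // about 2^64 / phi
    return splitmix64_mix(seed + i * golden_ratio_step);
}

// Hypothetical mapping: take the top 24 bits so the result is exactly
// representable as a float in [0, 1). The real kernel's range may differ.
inline float to_unit_float(std::uint64_t bits)
{
    return static_cast<float>(bits >> 40) * (1.0f / 16777216.0f); // 2^-24
}

// Host stand-in for the device kernel: on a GPU each iteration would be one thread.
inline std::vector<float> fill_random(std::size_t n, std::uint64_t seed)
{
    std::vector<float> out(n);
    for(std::size_t i = 0; i < n; ++i)
        out[i] = to_unit_float(random_bits(seed, i));
    return out;
}
```

Because no state is carried between elements, calling `fill_random(n, seed)` twice with the same seed yields identical buffers, which is the property that keeps candidate comparisons fair.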

Behavior parity with the old host path

  • bool: handled by visit_all -> normalize<bool> -> 0/1, identical to the old special case.
  • fp4x2 (the only non-computable type): visit_all would throw, so generation falls back to a raw byte fill, matching the old uint8 host behavior.
  • tuples: same seed across sub-objects, same as the previous generate_argument.
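
The parity rules above can be sketched on the host as follows. This is a rough illustration only: the `buffer` struct, the `is_bool` flag, and the way bits are reduced to 0/1 or to a raw byte are hypothetical stand-ins and not MIGraphX types or the PR's actual normalization.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an argument: a leaf byte buffer or a tuple of sub-objects.
struct buffer
{
    bool is_bool = false;             // leaf holds bool values (0/1 only)
    std::vector<std::uint8_t> bytes;  // leaf storage, unused for tuples
    std::vector<buffer> sub_objects;  // non-empty for tuples
};

// Standard splitmix64 finalizer.
inline std::uint64_t mix(std::uint64_t z)
{
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

inline void fill_random(buffer& b, std::uint64_t seed)
{
    if(!b.sub_objects.empty())
    {
        // Tuples: every sub-object is filled with the same seed.
        for(auto& sub : b.sub_objects)
            fill_random(sub, seed);
        return;
    }
    for(std::size_t i = 0; i < b.bytes.size(); ++i)
    {
        const std::uint64_t bits = mix(seed + i * 0x9E3779B97F4A7C15ULL);
        // bool: reduce to 0/1; otherwise: raw byte fill (the non-computable fallback).
        b.bytes[i] = b.is_bool ? static_cast<std::uint8_t>(bits >> 63)
                               : static_cast<std::uint8_t>(bits >> 56);
    }
}
```

With this shape, two equally sized sub-buffers of a tuple end up with identical contents, and an empty buffer is a no-op because the loop body never runs.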

Performance

  • No FPS regression across the YOLO model family (within noise).
  • Compile/tuning time improved by up to ~6.6x at batch 64 on MI350 and ~10x at batch 32 on R9700 (measured together with reverting the bundle increase 1->10).

Test plan

  • test_gpu_generate_random: seed determinism + range, half type, empty shape no-op, tuple fills every sub-buffer, non-computable (fp4x2) raw-byte fill — 5/5 pass.
  • YOLO compile + inference sweep (fork vs develop).

Perf testing for YOLO-family models (MI350):

Measured with migraphx-driver perf. No real difference was detected, and the results are quite noisy.

Different models, batch 4:

| Model | Fixed (img/s) | Develop, before (img/s) | Δ |
|---|---|---|---|
| yolov8m | 1255.7 | 1242.5 | +1.1% |
| yolov9m | 1177.1 | 1111.2 | +5.9% |
| yolov10m | 1337.1 | 1303.4 | +2.6% |
| yolo11m | 1583.1 | 1564.2 | +1.2% |
| yolo12m | 1341.4 | 1350.1 | −0.6% |
| yolo26m | 1415.0 | 1407.2 | +0.6% |
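
The Δ column appears to be the relative change of the fixed rate against the develop rate. A small sketch of that arithmetic, assuming this definition (the helper name is hypothetical):

```cpp
#include <cassert>
#include <cmath>

// Relative throughput change in percent: positive means the fixed build is faster.
inline double percent_delta(double fixed_rate, double develop_rate)
{
    return (fixed_rate / develop_rate - 1.0) * 100.0;
}
```

For example, `percent_delta(1255.7, 1242.5)` is roughly 1.06, which rounds to the +1.1% listed for yolov8m.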

Copilot AI review requested due to automatic review settings June 16, 2026 17:16
@itikhono itikhono requested a review from causten as a code owner June 16, 2026 17:16
@github-actions

Contributor

Thank you for your contribution! Since this is an external pull request, a maintainer must review PR and add the "ok-to-test" label if it is approved for testing.

Copilot AI left a comment

Contributor

Pull request overview

This PR improves GPU compile/tuning throughput by generating candidate input buffers directly on the GPU (splitmix64 counter-based RNG) instead of generating on the host and copying H2D per candidate.

Changes:

  • Added a GPU-side random-fill kernel (device::generate_random) and a host wrapper (gpu_generate_random) that recurses into tuple sub-objects.
  • Updated time_program tuning path to allocate parameter buffers on GPU and fill them via gpu_generate_random (keeping fill_map on the host-fill path).
  • Added a GPU unit test covering determinism, supported types, empty shapes, tuples, and non-computable types.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| test/gpu/generate_random.cpp | New GPU test coverage for deterministic RNG fill, tuples, and non-computable types. |
| src/targets/gpu/time_op.cpp | Switches tuning input creation to GPU allocation + GPU RNG fill (except fill_map). |
| src/targets/gpu/include/migraphx/gpu/hip.hpp | Exposes gpu_generate_random API. |
| src/targets/gpu/include/migraphx/gpu/device/generate_random.hpp | Declares new device-side RNG entrypoint. |
| src/targets/gpu/hip.cpp | Implements gpu_generate_random wrapper with tuple recursion. |
| src/targets/gpu/device/generate_random.cpp | Implements splitmix64-based device kernel to fill buffers. |

Comment thread src/targets/gpu/time_op.cpp Outdated
Comment thread src/targets/gpu/device/generate_random.cpp Outdated
Comment thread src/targets/gpu/device/generate_random.cpp Outdated
Comment thread src/targets/gpu/include/migraphx/gpu/hip.hpp Outdated
Comment thread test/gpu/generate_random.cpp Outdated
@itikhono itikhono requested a review from pfultz2 June 17, 2026 09:28
@itikhono

Contributor Author

@pfultz2 @causten could you help to run [MIGraphX Performance Tests / security_gate (pull_request_target)] target? I think I don't have access/rights to trigger this job

@pfultz2

pfultz2 commented Jun 18, 2026

Collaborator

could you help to run [MIGraphX Performance Tests / security_gate (pull_request_target)] target?

Don't worry about that one; I think it is missing a label. Just fix the licensing.

@gh-app-migraphx-bot-pr-write

| Test | Batch | New Rate (5a3a3e) | Old Rate (f54ca3) | Diff | Status |
|---|---|---|---|---|---|
| torchvision-resnet50 | 64 | 669.58 | 3,151.54 | -78.75% | 🔴 |
| torchvision-resnet50_fp16 | 64 | 1,766.59 | 6,672.57 | -73.52% | 🔴 |
| torchvision-densenet121 | 32 | 557.68 | 2,669.98 | -79.11% | 🔴 |
| torchvision-densenet121_fp16 | 32 | 644.55 | 4,553.52 | -85.85% | 🔴 |
| torchvision-inceptionv3 | 32 | 416.47 | 1,794.69 | -76.79% | 🔴 |
| torchvision-inceptionv3_fp16 | 32 | 501.54 | 2,710.69 | -81.50% | 🔴 |
| cadene-inceptionv4 | 16 | 222.10 | 224.10 | -0.89% | |
| cadene-resnext64x4 | 16 | 176.18 | 409.63 | -56.99% | 🔴 |
| slim-mobilenet | 64 | 4,428.97 | 8,288.90 | -46.57% | 🔴 |
| slim-nasnetalarge | 64 | nan | 229.24 | nan | |
| slim-resnet50v2 | 64 | 776.52 | 3,329.55 | -76.68% | 🔴 |
| bert-mrpc-onnx | 8 | 140.05 | 1,170.13 | -88.03% | 🔴 |
| bert-mrpc-tf | 1 | 37.33 | 482.99 | -92.27% | 🔴 |
| pytorch-examples-wlang-gru | 1 | 210.63 | 335.86 | -37.28% | 🔴 |
| pytorch-examples-wlang-lstm | 1 | 220.15 | 464.03 | -52.56% | 🔴 |
| torchvision-resnet50_1 | 1 | 25.53 | 771.72 | -96.69% | 🔴 |
| cadene-dpn92_1 | 1 | 48.67 | 451.76 | -89.23% | 🔴 |
| cadene-resnext101_1 | 1 | 44.39 | 363.79 | -87.80% | 🔴 |
| onnx-taau-downsample | 1 | 35.45 | 400.76 | -91.15% | 🔴 |
| dlrm-criteoterabyte | 1 | 6.16 | 32.67 | -81.14% | 🔴 |
| dlrm-criteoterabyte_fp16 | 1 | nan | 52.58 | nan | |
| agentmodel | 1 | 688.21 | 9,679.70 | -92.89% | 🔴 |
| unet_fp16 | 2 | 11.24 | 57.23 | -80.35% | 🔴 |
| resnet50v1_fp16 | 1 | 40.31 | 972.36 | -95.85% | 🔴 |
| resnet50v1_int8 | 1 | 300.05 | 960.27 | -68.75% | 🔴 |
| bert_base_cased_fp16 | 64 | 323.80 | 1,102.92 | -70.64% | 🔴 |
| bert_large_uncased_fp16 | 32 | 82.58 | 347.52 | -76.24% | 🔴 |
| bert_large_fp16 | 1 | 4.49 | 205.28 | -97.82% | 🔴 |
| distilgpt2_fp16 | 16 | 303.52 | 2,094.45 | -85.51% | 🔴 |
| yolov5s | 1 | 95.46 | 568.20 | -83.20% | 🔴 |
| tinyllama | 1 | 3.76 | 46.01 | -91.83% | 🔴 |
| vicuna-fastchat | 1 | 8.54 | 44.04 | -80.60% | 🔴 |
| whisper-tiny-encoder | 1 | 51.11 | 419.49 | -87.82% | 🔴 |
| whisper-tiny-decoder | 1 | 13.59 | 420.06 | -96.76% | 🔴 |
| llama2_7b | 1 | 2.06 | 20.46 | -89.95% | 🔴 |
| qwen1.5-7b | 1 | 11.39 | 23.67 | -51.88% | 🔴 |
| phi3-3.8b | 1 | 2.67 | 26.98 | -90.09% | 🔴 |
| llama3-8b | 1 | 2.11 | 21.83 | -90.31% | 🔴 |
| whisper-large-encoder | 1 | 2.50 | 10.32 | -75.76% | 🔴 |
| whisper-large-decoder | 1 | 3.09 | 107.05 | -97.11% | 🔴 |
| mistral-7b | 1 | 5.08 | 23.86 | -78.72% | 🔴 |
| FLUX.1-schnell | 1 | 43.74 | 761.61 | -94.26% | 🔴 |

Regressions detected 🔴

@gh-app-migraphx-bot-pr-write

Test Status Result
bert-mrpc-onnx PASSED: MIGraphX meets tolerance
bert-mrpc-tf ERROR - check error output
traceback
Traceback (most recent call last):
  File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 377, in <module>
    main()
  File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 313, in main
    import tensorflow as tf
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/__init__.py", line 38, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/__init__.py", line 36, in <module>
    from tensorflow.python import pywrap_tensorflow as _pywrap_tensorflow
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 26, in <module>
    self_check.preload_check()
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/platform/self_check.py", line 63, in preload_check
    from tensorflow.python.platform import _pywrap_cpu_feature_guard
ImportError: libamdhip64.so.6: cannot open shared object file: No such file or directory
pytorch-examples-wlang-gru PASSED: MIGraphX meets tolerance
pytorch-examples-wlang-lstm PASSED: MIGraphX meets tolerance
dlrm-criteoterabyte PASSED: MIGraphX meets tolerance
agentmodel PASSED: MIGraphX meets tolerance
unet PASSED: MIGraphX meets tolerance
resnet50v1 PASSED: MIGraphX meets tolerance
bert_base_cased_fp16 PASSED: MIGraphX meets tolerance
bert_large_uncased_fp16 🔴 FAILED: MIGraphX is not within tolerance - check verbose output
bert_large PASSED: MIGraphX meets tolerance
yolov5s 🔴 FAILED: MIGraphX is not within tolerance - check verbose output
tinyllama PASSED: MIGraphX meets tolerance
vicuna-fastchat PASSED: MIGraphX meets tolerance
whisper-tiny-encoder PASSED: MIGraphX meets tolerance
whisper-tiny-decoder PASSED: MIGraphX meets tolerance
distilgpt2_fp16 🔴 FAILED: MIGraphX is not within tolerance - check verbose output
llama2_7b PASSED: MIGraphX meets tolerance
qwen1.5-7b PASSED: MIGraphX meets tolerance
phi3-3.8b PASSED: MIGraphX meets tolerance
llama3-8b PASSED: MIGraphX meets tolerance
whisper-large-encoder ERROR - check error output
traceback
2026-06-18 18:02:07.178313 [WARN] [/data/src/onnx/onnx_parser.cpp:282] Model has unbound symbolic dimension(s): batch_size, encoder_sequence_length, feature_size. These default to 1 and may cause unexpected behavior. Try setting --dim-param @<name> <value> or --input-dim @<input> <dims> if program compilation fails.
Traceback (most recent call last):
  File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 377, in <module>
    main()
  File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 224, in main
    model = migraphx.parse_onnx(model_name, default_dim_value=batch)
RuntimeError: /data/src/include/migraphx/op/convolution.hpp:113: normalize_compute_shape: CONVOLUTION: mismatched channel numbers: input channels (1) != weights channels (80) * group (1)
whisper-large-decoder ERROR - check error output
traceback
2026-06-18 18:02:09.983074 [WARN] [/data/src/onnx/onnx_parser.cpp:282] Model has unbound symbolic dimension(s): batch_size, decoder_sequence_length, encoder_sequence_length / 2. These default to 1 and may cause unexpected behavior. Try setting --dim-param @<name> <value> or --input-dim @<input> <dims> if program compilation fails.
mistral-7b PASSED: MIGraphX meets tolerance
FLUX.1-schnell PASSED: MIGraphX meets tolerance

@itikhono

itikhono commented Jun 19, 2026

Contributor Author

@kahmed10 @shivadbhavsar could you help with reviewing this PR please?
