Skip to content

[CUDA] Add SM120 NVF4 block-scale MMA support#2364

Open
qqq-tao wants to merge 9 commits into
tile-ai:mainfrom
qqq-tao:nvf4-block-scale-sm120-pr-clean
Open

[CUDA] Add SM120 NVF4 block-scale MMA support#2364
qqq-tao wants to merge 9 commits into
tile-ai:mainfrom
qqq-tao:nvf4-block-scale-sm120-pr-clean

Conversation

@qqq-tao

@qqq-tao qqq-tao commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Add warp-level mxf4nvf4 block-scale MMA lowering and coverage so TileLang can validate NVF4 kernels against SM120/CUTLASS behavior.

Accuracy Validation

To validate the SM120 NVF4 block-scale MMA lowering, I added a CUTLASS-based reference check in maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py.

The validation compares TileLang's warp-level mma.sync.aligned.kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue4m3 path against CUTLASS SM120 block-scaled NVFP4 GEMM using the same packed NVF4 A/B inputs and the same UE4M3 scale factors.

Test configuration:

  • Problem size: M=N=128, K=256
  • Tile shape: 128x128x256
  • Input type: packed NVF4 / e2m1
  • Accumulator/output type: float32
  • Scale type: ue4m3
  • Scale vector size: scale_vec::4X
  • Scale granularity: 16 K-elements per scale factor
  • Reference: CUTLASS SM120 NVF4 block-scaled GEMM

Result:

scale_mode: varying
input_mode: random
max_abs_diff: 0.0
mean_abs_diff: 0.0
TileLang NVF4 block-scale output matches CUTLASS exactly.

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **New Features**
  * Added SM120 block-scale GEMM support for FP4 (NVF4) with corresponding runtime/intrinsic support and a CUDA reference implementation for validation.
  * Added a correctness comparison tool that runs and compares block-scale kernels against the CUDA reference, reporting numeric-difference metrics.

* **Tests**
  * Added comprehensive CUDA-focused tests covering layouts, scale-factor mappings, configuration validation, codegen markers, and end-to-end correctness.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Add warp-level mxf4nvf4 block-scale MMA lowering and coverage so TileLang can validate NVF4 kernels against SM120/CUTLASS behavior.
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds SM120 NVF4 block-scale MMA support end-to-end: TL builtin intrinsic, PTX/mma template, CodeGen emission, 4-bit layout transforms and emitter wiring, CUTLASS CUDA reference kernel, TileLang correctness harness with packing/decoding utilities, and CUDA-gated tests.

Changes

NVF4 Block-Scale MMA Support

Layer / File(s) Summary
Compiler op and TIR intrinsic surface
src/op/builtin.{cc,h}, tilelang/language/ast/ir.py, tilelang/language/tir/ir.py, tilelang/language/tir/op.py
Register tl.ptx_mma_block_scale as a TL builtin with 17 inputs and expose dtype-forwarding TIR/AST wrappers plus an op wrapper that converts metadata into StringImm/IntImm before invoking the intrinsic.
PTX instruction template and config
src/tl_templates/cuda/gemm_sm120.h
Define SM120 enums and SM120MmaBlockScaledConfig with a specialization enabling exactly kMxf4nvf4/scale-vec-4/kUE4M3, add inline-PTX device implementation for mma.sync.aligned.m16n8k64, and provide a guarded sm120_mma_sync_blockscaled template wrapper.
CUDA codegen emission
src/cuda/codegen/codegen_cuda.cc
Extend CodeGenTileLangCUDA to recognize tl.ptx_mma_block_scale (17 args), validate the supported configuration, rewrite fp4-packed A/B buffer operands (halving offsets where packed), and emit tl::sm120_mma_sync_blockscaled with template argument substitution.
Layout transforms and MMA emitter
tilelang/cuda/intrinsics/layout/mma_layout.py, tilelang/cuda/intrinsics/macro/mma_macro_generator.py, tilelang/cuda/intrinsics/{macro/,}__init__.py, tilelang/intrinsics/__init__.py
Add shared↔MMA layout mappings for 4-bit A/B operands and inverse load helpers; introduce BlockScaleMmaConfig registry; implement TensorCoreIntrinEmitterWithBlockScale subclass that overrides ldmatrix_a/ldmatrix_b and emits per-warp ptx_mma_block_scale calls using explicit scale buffers and computed packed K indices; re-export emitter through package initializers.
CUTLASS reference kernel
maint/gemm/cutlass_nvf4_ref.cu
Add a PyTorch-bound CUTLASS SM120 block-scaled GEMM entry cutlass_nvf4_gemm_128x128x256 that validates tensor devices/contiguity and exact packed sizes, constructs CUTLASS problem/stride/layout views including SFA/SFB, checks can_implement, allocates workspace, runs the kernel, and synchronizes.
TileLang evaluation harness
maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py
Add a correctness-evaluation script: TileLang NVF4 primfunc generator with optional swizzled shared layouts, FP4/UE4M3 packing/repacing utilities for TileLang vs CUTLASS, FP4 decode to float32 reference, CUTLASS extension builder/loader, execution of both implementations, metric printing, and optional strict equality assertion.
Tests and correctness validation
testing/python/language/test_tilelang_language_nvf4_mma_block_scale.py
Add CUDA-gated tests: FP4 decode table, swizzle layout helper, NVF4 primfunc generator, input/scale generators, fragment-layout coverage tests, lane-to-atom mapping tests, unsupported-config/type rejection tests, codegen source assertions, and end-to-end correctness tests comparing kernel outputs to a decoded float32 reference with zero tolerance.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • lucifer1004
  • LeiWang1999

Poem

🐰 In silken shared tiles I hop and pack,
Four-bit whispers stitched along each track,
Scales march in words, the warps all hum,
TileLang builds, CUTLASS comes to sum,
A rabbit nods — the matrices align.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 22.06% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title '[CUDA] Add SM120 NVF4 block-scale MMA support' accurately and concisely describes the primary change—adding warp-level block-scale MMA support for NVF4 on SM120 GPUs.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Apply clang-format updates and remove an unused block-scale emitter local.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py`:
- Around line 253-256: The code sets cutlass_root using a hardcoded developer
path which is unsafe; update the logic in the block that defines repo and
cutlass_root (variables cutlass_root, repo and the use of
os.environ["CUTLASS_ROOT"]) to remove the hardcoded "/data/home/..." fallback
and instead: prefer the CUTLASS_ROOT environment variable if set, otherwise fall
back to repo / "3rdparty" / "cutlass"; ensure cutlass_root.exists() is checked
and raise a clear error or log if neither location exists so the failure is
explicit.

In `@tilelang/language/ast/ir.py`:
- Line 2140: The module's __all__ includes "ptx_mma_block_scale" but no symbol
by that name is defined or imported, which breaks wildcard imports; either
remove "ptx_mma_block_scale" from the __all__ list or add a proper
definition/import for ptx_mma_block_scale (e.g., define the function/class or
import it from its source) so the name is actually bound in this module; update
the __all__ entry near the existing list and ensure the symbol name matches
exactly the defined/imported identifier.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6a5bd78f-7a51-446a-bd8a-0078295fbd05

📥 Commits

Reviewing files that changed from the base of the PR and between a3f7093 and a24f9f1.

📒 Files selected for processing (16)
  • maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py
  • maint/gemm/cutlass_nvf4_ref.cu
  • src/cuda/codegen/codegen_cuda.cc
  • src/cuda/codegen/codegen_cuda.h
  • src/op/builtin.cc
  • src/op/builtin.h
  • src/tl_templates/cuda/instruction/mma_block_scale.h
  • testing/python/language/test_tilelang_language_nvf4_mma_block_scale.py
  • tilelang/cuda/intrinsics/__init__.py
  • tilelang/cuda/intrinsics/layout/mma_layout.py
  • tilelang/cuda/intrinsics/macro/__init__.py
  • tilelang/cuda/intrinsics/macro/mma_macro_generator.py
  • tilelang/intrinsics/__init__.py
  • tilelang/language/ast/ir.py
  • tilelang/language/tir/ir.py
  • tilelang/language/tir/op.py

Comment thread maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py Outdated
Comment thread tilelang/language/ast/ir.py
qutao added 2 commits June 9, 2026 18:44
Apply ruff formatting and remove trailing whitespace from the NVF4 block-scale emitter changes.
Remove a developer-specific CUTLASS fallback path, bind the block-scale MMA AST helper, and apply clang-format to the CUDA codegen changes.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tilelang/cuda/intrinsics/macro/mma_macro_generator.py (2)

1402-1407: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve outer-dimension bases for plain scale buffers.

_scale_region_parts() treats Buffer differently from the A/B paths and drops every prefix dimension by returning ([], 0, 0). The accesses at Lines 1478-1482 and Line 1510 then only index the trailing two axes, which breaks direct N-D scale buffers and sliced views. Reuse _legalize_to_buffer_region() here so scale buffers follow the same region contract as A/B.

Suggested fix
     `@staticmethod`
     def _scale_region_parts(scale_buf: Buffer | BufferRegion):
-        if isinstance(scale_buf, BufferRegion):
-            return scale_buf.buffer, [r.min for r in scale_buf.region[:-2]], scale_buf.region[-2].min, scale_buf.region[-1].min
-        if isinstance(scale_buf, Buffer):
-            return scale_buf, [], 0, 0
-        raise ValueError(f"Unsupported scale buffer type: {type(scale_buf)}")
+        region = TensorCoreIntrinEmitter._legalize_to_buffer_region(scale_buf)
+        return (
+            region.buffer,
+            [r.min for r in region.region[:-2]],
+            region.region[-2].min,
+            region.region[-1].min,
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/cuda/intrinsics/macro/mma_macro_generator.py` around lines 1402 -
1407, _scale_region_parts currently returns (buffer, [], 0, 0) for plain Buffer
which drops outer-dimension bases and breaks N-D scale buffers; change it to
call and reuse _legalize_to_buffer_region(scale_buf) so both Buffer and
BufferRegion follow the same region contract as A/B. Update _scale_region_parts
to accept either type, call _legalize_to_buffer_region when scale_buf is a
Buffer to obtain the buffer and full region, and then extract the same tuples
(buffer, [r.min for r in region[:-2]], region[-2].min, region[-1].min) as for
BufferRegion; keep the existing ValueError for unsupported types. Ensure
references to _scale_region_parts usage (the indexing code that expects
preserved outer bases) continue to work unchanged.

1282-1333: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate the fixed NVF4 contract in the constructor.

This emitter always lowers as mxf4nvf4 with k64, e2m1/e2m1, and the block-scale fragment layouts from this class, but it never rejects incompatible a_dtype, b_dtype, or accum_dtype inputs. A caller can currently instantiate this public API with, for example, float16 operands and still get NVF4 PTX emitted against mismatched fragment assumptions. Fail fast here instead of silently generating wrong code.

Suggested guard
     def __init__(
         self,
         a_dtype: str = T.float4_e2m1fn,
         b_dtype: str = T.float4_e2m1fn,
         accum_dtype: str = T.float32,
@@
         kind: str = "mxf4nvf4",
         scale_vec_size: int = 4,
         stype: str = "ue4m3",
     ):
+        if str(DataType(a_dtype)) != str(T.float4_e2m1fn) or str(DataType(b_dtype)) != str(T.float4_e2m1fn):
+            raise ValueError("SM120 block-scale MMA currently only supports float4_e2m1fn operands")
+        if str(DataType(accum_dtype)) != str(T.float32):
+            raise ValueError("SM120 block-scale MMA currently only supports float32 accumulation")
         self.block_scale_config = _get_block_scale_mma_config(kind, scale_vec_size, stype)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/cuda/intrinsics/macro/mma_macro_generator.py` around lines 1282 -
1333, The constructor currently forces block-scale config to NVF4 but does not
validate the caller-provided dtypes, so callers can pass incompatible
a_dtype/b_dtype/accum_dtype and silently generate wrong code; after calling
_get_block_scale_mma_config(...) in __init__, add a guard that checks the
resolved self.block_scale_config.kind (and/or its expected dtype descriptors
from the config) against the incoming a_dtype, b_dtype, and accum_dtype
parameters (e.g., ensure a_dtype and b_dtype match the NVF4 fragment dtypes such
as T.float4_e2m1fn and accum_dtype matches the expected accumulator like
T.float32 or whatever the config exposes), and raise a ValueError with a clear
message if they mismatch; perform this validation before calling
super().__init__ so invalid combinations fail fast.
maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py (1)

342-354: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

The FP32 reference diagnostics are missing the UE4M3 block scales.

ref is built from the decoded FP4 payloads only, so diff_tl_ref and diff_cutlass_ref are not checking the same computation as the block-scaled GEMMs. In the default NVF4_SCALE_MODE="varying" case, those numbers are inherently misleading. Either apply sfa_logical / sfb_logical per 16-K chunk when building ref, or drop these prints until that scale-aware reference exists.

Suggested minimal change
-    ref = _decode_rowmajor_fp4(a, M, K) @ _decode_rowmajor_fp4(b, N, K).T
-    diff_tl_ref = (c_tl - ref).abs()
-    diff_cutlass_ref = (c_cutlass - ref).abs()
     print("scale_mode:", scale_mode)
     print("input_mode:", input_mode)
     print("max_abs_diff:", diff.max().item())
     print("mean_abs_diff:", diff.mean().item())
     print("max_abs_diff_transposed:", diff_t.max().item())
     print("mean_abs_diff_transposed:", diff_t.mean().item())
-    print("max_abs_diff_tilelang_ref:", diff_tl_ref.max().item())
-    print("mean_abs_diff_tilelang_ref:", diff_tl_ref.mean().item())
-    print("max_abs_diff_cutlass_ref:", diff_cutlass_ref.max().item())
-    print("mean_abs_diff_cutlass_ref:", diff_cutlass_ref.mean().item())
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py` around lines 342 - 354,
The FP32 reference `ref` is computed from raw decoded FP4 payloads so it doesn't
include the UE4M3 block scales (`sfa_logical`/`sfb_logical`), making
`diff_tl_ref` and `diff_cutlass_ref` invalid under NVF4_SCALE_MODE="varying";
fix by applying the per-block scales to the decoded tensors before matmul: after
calling `_decode_rowmajor_fp4(a, M, K)` and `_decode_rowmajor_fp4(b, N, K)`,
multiply each 16xK chunk of the decoded A by the corresponding entries in
`sfa_logical` and each 16xK chunk of the decoded B by `sfb_logical` (or apply
equivalent broadcasting per 16-K block) so `ref = scaled_decoded_a @
scaled_decoded_b.T` matches the block-scaled GEMM, otherwise remove the
`diff_tl_ref`/`diff_cutlass_ref` prints until the scale-aware reference is
implemented; reference symbols: `ref`, `_decode_rowmajor_fp4`, `sfa_logical`,
`sfb_logical`, `diff_tl_ref`, `diff_cutlass_ref`.
♻️ Duplicate comments (1)
maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py (1)

247-256: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fail fast when the CUTLASS root is missing.

This now avoids the developer-local fallback, but it still passes non-existent include roots into load(). If CUTLASS_ROOT is unset or wrong and 3rdparty/cutlass is absent, the harness fails later with a compiler error instead of an explicit setup error.

Suggested fix
     repo = Path(__file__).resolve().parents[2]
     cutlass_root_env = os.environ.get("CUTLASS_ROOT")
     cutlass_root = Path(cutlass_root_env) if cutlass_root_env else repo / "3rdparty" / "cutlass"
+    if not cutlass_root.exists():
+        raise RuntimeError(
+            f"CUTLASS not found at {cutlass_root}. "
+            "Set CUTLASS_ROOT or ensure 3rdparty/cutlass exists."
+        )
 
     return load(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py` around lines 247 - 256,
Check that the computed cutlass_root (from cutlass_root_env or repo / "3rdparty"
/ "cutlass") actually exists before calling load; if it does not exist, raise a
clear RuntimeError instructing the developer to set CUTLASS_ROOT or populate
3rdparty/cutlass, and only pass existing include paths (cutlass_root / "include"
and cutlass_root / "tools" / "util" / "include") into the load(...) call instead
of blindly passing non-existent paths; refer to the variables cutlass_root_env,
cutlass_root, extra_include_paths and the load(...) call to locate where to add
the existence check and error raise.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py`:
- Around line 342-354: The FP32 reference `ref` is computed from raw decoded FP4
payloads so it doesn't include the UE4M3 block scales
(`sfa_logical`/`sfb_logical`), making `diff_tl_ref` and `diff_cutlass_ref`
invalid under NVF4_SCALE_MODE="varying"; fix by applying the per-block scales to
the decoded tensors before matmul: after calling `_decode_rowmajor_fp4(a, M, K)`
and `_decode_rowmajor_fp4(b, N, K)`, multiply each 16xK chunk of the decoded A
by the corresponding entries in `sfa_logical` and each 16xK chunk of the decoded
B by `sfb_logical` (or apply equivalent broadcasting per 16-K block) so `ref =
scaled_decoded_a @ scaled_decoded_b.T` matches the block-scaled GEMM, otherwise
remove the `diff_tl_ref`/`diff_cutlass_ref` prints until the scale-aware
reference is implemented; reference symbols: `ref`, `_decode_rowmajor_fp4`,
`sfa_logical`, `sfb_logical`, `diff_tl_ref`, `diff_cutlass_ref`.

In `@tilelang/cuda/intrinsics/macro/mma_macro_generator.py`:
- Around line 1402-1407: _scale_region_parts currently returns (buffer, [], 0,
0) for plain Buffer which drops outer-dimension bases and breaks N-D scale
buffers; change it to call and reuse _legalize_to_buffer_region(scale_buf) so
both Buffer and BufferRegion follow the same region contract as A/B. Update
_scale_region_parts to accept either type, call _legalize_to_buffer_region when
scale_buf is a Buffer to obtain the buffer and full region, and then extract the
same tuples (buffer, [r.min for r in region[:-2]], region[-2].min,
region[-1].min) as for BufferRegion; keep the existing ValueError for
unsupported types. Ensure references to _scale_region_parts usage (the indexing
code that expects preserved outer bases) continue to work unchanged.
- Around line 1282-1333: The constructor currently forces block-scale config to
NVF4 but does not validate the caller-provided dtypes, so callers can pass
incompatible a_dtype/b_dtype/accum_dtype and silently generate wrong code; after
calling _get_block_scale_mma_config(...) in __init__, add a guard that checks
the resolved self.block_scale_config.kind (and/or its expected dtype descriptors
from the config) against the incoming a_dtype, b_dtype, and accum_dtype
parameters (e.g., ensure a_dtype and b_dtype match the NVF4 fragment dtypes such
as T.float4_e2m1fn and accum_dtype matches the expected accumulator like
T.float32 or whatever the config exposes), and raise a ValueError with a clear
message if they mismatch; perform this validation before calling
super().__init__ so invalid combinations fail fast.

---

Duplicate comments:
In `@maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py`:
- Around line 247-256: Check that the computed cutlass_root (from
cutlass_root_env or repo / "3rdparty" / "cutlass") actually exists before
calling load; if it does not exist, raise a clear RuntimeError instructing the
developer to set CUTLASS_ROOT or populate 3rdparty/cutlass, and only pass
existing include paths (cutlass_root / "include" and cutlass_root / "tools" /
"util" / "include") into the load(...) call instead of blindly passing
non-existent paths; refer to the variables cutlass_root_env, cutlass_root,
extra_include_paths and the load(...) call to locate where to add the existence
check and error raise.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5bd71dcd-3f86-40b2-b9d8-6c494431985b

📥 Commits

Reviewing files that changed from the base of the PR and between a24f9f1 and c3c51c9.

📒 Files selected for processing (6)
  • maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py
  • maint/gemm/cutlass_nvf4_ref.cu
  • src/cuda/codegen/codegen_cuda.cc
  • src/tl_templates/cuda/instruction/mma_block_scale.h
  • tilelang/cuda/intrinsics/macro/mma_macro_generator.py
  • tilelang/language/ast/ir.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/cuda/codegen/codegen_cuda.cc
  • src/tl_templates/cuda/instruction/mma_block_scale.h
  • maint/gemm/cutlass_nvf4_ref.cu

Make the CUTLASS comparison reference scale-aware, validate block-scale emitter dtypes, preserve scale buffer regions, and check CUTLASS include paths explicitly.
@Rachmanino

Rachmanino commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Really appreciate your contribution! I think it's better to provide high-level API for blockscaled mma on sm120, like existing T.tcgen05_gemm_blockscaled for sm100 tcgen5? Besides, it may be better to move blockscaled mma instructions to src/tl_templates/cuda/gemm_sm120.h? Would love to discuss if you have better suggestions!

@qqq-tao

qqq-tao commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Thanks, that makes sense.

This PR is currently a low-level SM120 NVF4 MMA emitter. It checks that the new mma.sync instruction works, and that the fragment layout and scale-factor lane mapping are correct.

I agree that users should have a higher-level API, similar to the existing SM100 blockscaled GEMM API. The SM120 version should not copy the TCGEN05/TMEM behavior directly, but it can provide a similar user experience on top of the SM120 warp-level mma.sync path.

For this PR, I can move the SM120 blockscaled MMA helper into gemm_sm120.h, so it sits with the SM120 GEMM code. Then we can add a higher-level tile op API as a follow-up, unless you prefer to include that in this PR.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tilelang/language/tir/op.py (1)

1079-1119: ⚠️ Potential issue | 🟠 Major

Fix tl.ptx_mma_block_scale intrinsic arity: set_num_inputs(21) doesn’t match CUDA lowering expectations.

  • src/op/builtin.cc:244-247 registers tl.ptx_mma_block_scale with .set_num_inputs(21).
  • src/cuda/codegen/codegen_cuda.cc:2852 lowers the intrinsic with ICHECK_EQ(op->args.size(), 17U) and consumes op->args[0..16] (accum_dtype/shape/layouts/k/vec_size/dtypes + A/B/C pointers & offsets + scale_a/scale_b).
  • The Python wrapper language/tir/op.py:1079-1119 passes exactly those 17 call_intrin args, so the registration count should be aligned (likely change .set_num_inputs(21)17).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/language/tir/op.py` around lines 1079 - 1119, The registered
intrinsic tl.ptx_mma_block_scale has a mismatched arity: the Python wrapper
function ptx_mma_block_scale builds 17 call_intrin arguments and the CUDA
lowering asserts ICHECK_EQ(..., 17), so update the intrinsic registration to use
.set_num_inputs(17) (replace the current .set_num_inputs(21)) so the
registration count matches the call_intrin arguments and the CUDA lowering
expectations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@tilelang/language/tir/op.py`:
- Around line 1079-1119: The registered intrinsic tl.ptx_mma_block_scale has a
mismatched arity: the Python wrapper function ptx_mma_block_scale builds 17
call_intrin arguments and the CUDA lowering asserts ICHECK_EQ(..., 17), so
update the intrinsic registration to use .set_num_inputs(17) (replace the
current .set_num_inputs(21)) so the registration count matches the call_intrin
arguments and the CUDA lowering expectations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 82a9e9cc-83ba-44c4-acfa-79f6cc76f5fa

📥 Commits

Reviewing files that changed from the base of the PR and between cbc83fc and 17d2313.

📒 Files selected for processing (5)
  • src/cuda/codegen/codegen_cuda.cc
  • src/tl_templates/cuda/gemm_sm120.h
  • testing/python/language/test_tilelang_language_nvf4_mma_block_scale.py
  • tilelang/cuda/intrinsics/macro/mma_macro_generator.py
  • tilelang/language/tir/op.py
💤 Files with no reviewable changes (1)
  • tilelang/cuda/intrinsics/macro/mma_macro_generator.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • testing/python/language/test_tilelang_language_nvf4_mma_block_scale.py

@Rachmanino

Rachmanino commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Nice work! May I ask whether we support TMA load for operands and SFs? If so, we may also illustrate that in our example. Also, I'm curious whether you've considered warp-specialization (either automatic version or handwritten) and its influence on the performance.

@qqq-tao

qqq-tao commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Yes, I am still tuning to optimize the performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants