Required prerequisites
What version of TileLang are you using?
0.1.11+cuda.gita4399e4c
System information
NVIDIA A10G (sm_86), Python 3.11. This is a compile-time crash (during reduce lowering), so the CUDA/PyTorch runtime versions aren't involved.
Problem description
A T.reduce_sum over a fragment of width N aborts at compile time when N is not a multiple of the
thread count — e.g. N=96 with threads=128. N that divides the thread count compiles fine. The same
reduction is a routine pattern (it's what RMSNorm/LayerNorm/softmax do), and many real hidden dims aren't
multiples of 128, so this is easy to hit.
N (threads=128) |
result |
| 64, 128 (divides 128) |
compiles & runs |
| 96, 100, 120, 127 (doesn't divide 128) |
InternalError: Inconsistent layouts ... in ReduceOp |
The crash is a thread-replication mismatch: the reduce destination is laid out with replicate_size=128
(the thread count) but the source fragment ends up replicate_size=96 (the width), and the reduce
lowerer asserts these must match.
Reproducible example code
import torch, tilelang, tilelang.language as T
N, THREADS = 96, 128 # N not a multiple of THREADS -> crash; N=64 or 128 compiles
@tilelang.jit
def reduce(X: T.Tensor((1, N), "float32"), Y: T.Tensor((1,), "float32")):
with T.Kernel(1, threads=THREADS) as bx:
f = T.alloc_fragment((1, N), "float32")
s = T.alloc_fragment((1,), "float32")
T.copy(X, f)
T.reduce_sum(f, s, dim=1)
T.copy(s, Y)
X = torch.arange(N, dtype=torch.float32, device="cuda").reshape(1, N)
Y = torch.empty(1, dtype=torch.float32, device="cuda")
reduce(X, Y) # crashes at compile; set N=64 and it compiles & runs (Y == X.sum())
Traceback
tvm::tl::ReduceOpNode::Lower(...)
src/backend/common/op/reduce.h:391 ReduceLowerer<cuda::Reduce>::Lower(...)
ICHECK(!analyzer->CanProve(dst_layout->ReplicateExtent() > red_layout->ReplicateExtent()))
tvm.error.InternalError: Check failed: ... Inconsistent layouts between src and dst in ReduceOp:
dst_layout=tl.Fragment(... replicate_size=128 ...) red_layout=tl.Fragment(... replicate_size=96 ...)
Expected behavior
Compile and reduce correctly (the divisible widths do), or reject with a clear, user-facing message
naming the constraint — not an internal ICHECK exposing replicate_size internals.
Additional context
Root cause: when a width-N fragment is spread across T threads, the per-thread replication factor is
computed with a bare integer division — loop_partition.cc:217
num_thread / thread_extent, with no remainder handling. When N % T != 0 the resulting layout is
inconsistent (here the destination keeps replicate_size=T=128 while the reduced source carries
replicate_size=N=96), and the reduce lowerer's assertion
(reduce.h:391)
trips. The same N % T != 0 precondition, on an elementwise (non-reduce) fragment with N > T, instead
surfaces as no available layout found in layout inference — filed separately as the elementwise cousin.
A related earlier case (#1374) was fixed by PR #1533 in src/layout/utils.cc, but that fix doesn't cover
this reduce path (verified: still crashes on a4399e4c).
Required prerequisites
What version of TileLang are you using?
0.1.11+cuda.gita4399e4c
System information
NVIDIA A10G (sm_86), Python 3.11. This is a compile-time crash (during reduce lowering), so the CUDA/PyTorch runtime versions aren't involved.
Problem description
A
T.reduce_sumover a fragment of widthNaborts at compile time whenNis not a multiple of thethread count — e.g.
N=96withthreads=128.Nthat divides the thread count compiles fine. The samereduction is a routine pattern (it's what RMSNorm/LayerNorm/softmax do), and many real hidden dims aren't
multiples of 128, so this is easy to hit.
N(threads=128)InternalError: Inconsistent layouts ... in ReduceOpThe crash is a thread-replication mismatch: the reduce destination is laid out with
replicate_size=128(the thread count) but the source fragment ends up
replicate_size=96(the width), and the reducelowerer asserts these must match.
Reproducible example code
Traceback
tvm::tl::ReduceOpNode::Lower(...) src/backend/common/op/reduce.h:391 ReduceLowerer<cuda::Reduce>::Lower(...) ICHECK(!analyzer->CanProve(dst_layout->ReplicateExtent() > red_layout->ReplicateExtent())) tvm.error.InternalError: Check failed: ... Inconsistent layouts between src and dst in ReduceOp: dst_layout=tl.Fragment(... replicate_size=128 ...) red_layout=tl.Fragment(... replicate_size=96 ...)Expected behavior
Compile and reduce correctly (the divisible widths do), or reject with a clear, user-facing message
naming the constraint — not an internal
ICHECKexposingreplicate_sizeinternals.Additional context
Root cause: when a width-
Nfragment is spread acrossTthreads, the per-thread replication factor iscomputed with a bare integer division —
loop_partition.cc:217num_thread / thread_extent, with no remainder handling. WhenN % T != 0the resulting layout isinconsistent (here the destination keeps
replicate_size=T=128while the reduced source carriesreplicate_size=N=96), and the reduce lowerer's assertion(
reduce.h:391)trips. The same
N % T != 0precondition, on an elementwise (non-reduce) fragment withN > T, insteadsurfaces as
no available layout foundin layout inference — filed separately as the elementwise cousin.A related earlier case (#1374) was fixed by PR #1533 in
src/layout/utils.cc, but that fix doesn't coverthis reduce path (verified: still crashes on
a4399e4c).