Skip to content

[BUG][Fuzzer] T.reduce_sum over a fragment crashes LayoutInference with an internal "Inconsistent layouts" error when the reduced width isn't a multiple of the thread count #2399

Description

@Hughshine

Required prerequisites

What version of TileLang are you using?

0.1.11+cuda.gita4399e4c

System information

NVIDIA A10G (sm_86), Python 3.11. This is a compile-time crash (during reduce lowering), so the CUDA/PyTorch runtime versions aren't involved.

Problem description

A T.reduce_sum over a fragment of width N aborts at compile time when N is not a multiple of the
thread count — e.g. N=96 with threads=128. N that divides the thread count compiles fine. The same
reduction is a routine pattern (it's what RMSNorm/LayerNorm/softmax do), and many real hidden dims aren't
multiples of 128, so this is easy to hit.

N (threads=128) result
64, 128 (divides 128) compiles & runs
96, 100, 120, 127 (doesn't divide 128) InternalError: Inconsistent layouts ... in ReduceOp

The crash is a thread-replication mismatch: the reduce destination is laid out with replicate_size=128
(the thread count) but the source fragment ends up replicate_size=96 (the width), and the reduce
lowerer asserts these must match.

Reproducible example code

import torch, tilelang, tilelang.language as T
N, THREADS = 96, 128          # N not a multiple of THREADS -> crash; N=64 or 128 compiles

@tilelang.jit
def reduce(X: T.Tensor((1, N), "float32"), Y: T.Tensor((1,), "float32")):
    with T.Kernel(1, threads=THREADS) as bx:
        f = T.alloc_fragment((1, N), "float32")
        s = T.alloc_fragment((1,), "float32")
        T.copy(X, f)
        T.reduce_sum(f, s, dim=1)
        T.copy(s, Y)

X = torch.arange(N, dtype=torch.float32, device="cuda").reshape(1, N)
Y = torch.empty(1, dtype=torch.float32, device="cuda")
reduce(X, Y)                # crashes at compile; set N=64 and it compiles & runs (Y == X.sum())

Traceback

tvm::tl::ReduceOpNode::Lower(...)
  src/backend/common/op/reduce.h:391  ReduceLowerer<cuda::Reduce>::Lower(...)
    ICHECK(!analyzer->CanProve(dst_layout->ReplicateExtent() > red_layout->ReplicateExtent()))
tvm.error.InternalError: Check failed: ... Inconsistent layouts between src and dst in ReduceOp:
  dst_layout=tl.Fragment(... replicate_size=128 ...) red_layout=tl.Fragment(... replicate_size=96 ...)

Expected behavior

Compile and reduce correctly (the divisible widths do), or reject with a clear, user-facing message
naming the constraint — not an internal ICHECK exposing replicate_size internals.

Additional context

Root cause: when a width-N fragment is spread across T threads, the per-thread replication factor is
computed with a bare integer division — loop_partition.cc:217
num_thread / thread_extent, with no remainder handling. When N % T != 0 the resulting layout is
inconsistent (here the destination keeps replicate_size=T=128 while the reduced source carries
replicate_size=N=96), and the reduce lowerer's assertion
(reduce.h:391)
trips. The same N % T != 0 precondition, on an elementwise (non-reduce) fragment with N > T, instead
surfaces as no available layout found in layout inference — filed separately as the elementwise cousin.
A related earlier case (#1374) was fixed by PR #1533 in src/layout/utils.cc, but that fix doesn't cover
this reduce path (verified: still crashes on a4399e4c).

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions