[BUG][Fuzzer] `T.reduce_sum` over a fragment crashes LayoutInference with an internal "Inconsistent layouts" error when the reduced width isn't a multiple of the thread count #2399

### Required prerequisites

- [x] I have read the documentation <https://tilelang.com>.
- [x] I have searched the [Issue Tracker](https://github.com/tile-ai/tilelang/issues) that this hasn't already been reported. (comment there if it has.)

### What version of TileLang are you using?

0.1.11+cuda.gita4399e4c

### System information

NVIDIA A10G (sm_86), Python 3.11. This is a compile-time crash (during reduce lowering), so the CUDA/PyTorch runtime versions aren't involved.


### Problem description

A `T.reduce_sum` over a fragment of width `N` aborts at compile time when `N` and the thread count don't
divide evenly, e.g. `N=96` with `threads=128`. Widths that divide the thread count (`N=64`, `N=128`)
compile fine. This kind of row reduction is a routine pattern (it's what RMSNorm/LayerNorm/softmax do),
and many real hidden dims aren't multiples of 128, so this is easy to hit.

| `N` (threads=128) | result |
|---|---|
| 64, 128 (divides 128) | compiles & runs |
| 96, 100, 120, 127 (doesn't divide 128) | `InternalError: Inconsistent layouts ... in ReduceOp` |

The crash is a thread-replication mismatch: the reduce destination is laid out with `replicate_size=128`
(the thread count) but the source fragment ends up `replicate_size=96` (the width), and the reduce
lowerer asserts these must match.


### Reproducible example code

```python
import torch, tilelang, tilelang.language as T
N, THREADS = 96, 128          # N does not divide THREADS -> crash; N=64 or 128 compiles

@tilelang.jit
def reduce(X: T.Tensor((1, N), "float32"), Y: T.Tensor((1,), "float32")):
    with T.Kernel(1, threads=THREADS) as bx:
        f = T.alloc_fragment((1, N), "float32")
        s = T.alloc_fragment((1,), "float32")
        T.copy(X, f)
        T.reduce_sum(f, s, dim=1)
        T.copy(s, Y)

X = torch.arange(N, dtype=torch.float32, device="cuda").reshape(1, N)
Y = torch.empty(1, dtype=torch.float32, device="cuda")
reduce(X, Y)                # crashes at compile; set N=64 and it compiles & runs (Y == X.sum())
```

### Traceback

```pytb
tvm::tl::ReduceOpNode::Lower(...)
  src/backend/common/op/reduce.h:391  ReduceLowerer<cuda::Reduce>::Lower(...)
    ICHECK(!analyzer->CanProve(dst_layout->ReplicateExtent() > red_layout->ReplicateExtent()))
tvm.error.InternalError: Check failed: ... Inconsistent layouts between src and dst in ReduceOp:
  dst_layout=tl.Fragment(... replicate_size=128 ...) red_layout=tl.Fragment(... replicate_size=96 ...)
```

### Expected behavior

Either compile and reduce correctly (as the widths that divide the thread count already do), or reject
with a clear, user-facing message naming the constraint, rather than an internal `ICHECK` that exposes
`replicate_size` internals.
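
One shape the clear rejection could take is an up-front divisibility check. The sketch below is illustrative only: `check_reduce_width` is a hypothetical helper, not a TileLang API, and the condition `threads % n == 0` is an assumption taken from the tested widths in the table above (all at most 128), not from TileLang's actual layout rules.

```python
def check_reduce_width(n: int, threads: int) -> None:
    """Hypothetical front-end check; NOT part of TileLang's API.

    Assumes the constraint observed in the report: for n <= threads,
    the reduction compiles only when n divides threads.
    """
    if n <= 0 or threads <= 0:
        raise ValueError("n and threads must be positive")
    if threads % n != 0:
        raise ValueError(
            f"reduce_sum: reduced width {n} does not evenly divide "
            f"threads={threads} (threads % width = {threads % n})"
        )


check_reduce_width(64, 128)       # passes silently
try:
    check_reduce_width(96, 128)   # the failing case from the repro
except ValueError as err:
    print(err)
```

A message of this kind names the two user-visible quantities (width and thread count) instead of `replicate_size`.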

### Additional context

Root cause: when a width-`N` fragment is spread across `T` threads, the per-thread replication factor is
computed with a bare integer division, `num_thread / thread_extent` at
[`loop_partition.cc:217`](https://github.com/tile-ai/tilelang/blob/a4399e4c9934edd4e34bde89628eba9377548b10/src/transform/loop_partition.cc#L217),
with no remainder handling. When that division is not exact (every failing width in the table has
`T % N != 0`), the resulting layout is inconsistent (here the destination keeps `replicate_size=T=128`
while the reduced source carries `replicate_size=N=96`), and the reduce lowerer's assertion
([`reduce.h:391`](https://github.com/tile-ai/tilelang/blob/a4399e4c9934edd4e34bde89628eba9377548b10/src/backend/common/op/reduce.h#L391))
trips. The same kind of divisibility failure, on an elementwise (non-reduce) fragment with `N > T` and
`N % T != 0`, instead surfaces as `no available layout found` in layout inference; that is filed
separately as the elementwise cousin. A related earlier case (#1374) was fixed by PR #1533 in
`src/layout/utils.cc`, but that fix doesn't cover this reduce path (verified: still crashes on `a4399e4c`).
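
The effect of dropping the remainder can be seen with plain integers. This is a minimal sketch, not TileLang code: `split_threads` is a hypothetical helper, and treating the fragment width as `thread_extent` is an assumption made only to mirror the `num_thread / thread_extent` expression quoted above.

```python
def split_threads(num_thread: int, thread_extent: int) -> tuple[int, int]:
    """Return the truncated quotient and the remainder it silently drops."""
    return divmod(num_thread, thread_extent)


NUM_THREAD = 128  # threads= in the repro

# Assumption: the fragment width plays the role of thread_extent.
for extent in (64, 96, 128):
    factor, leftover = split_threads(NUM_THREAD, extent)
    status = "exact" if leftover == 0 else f"{leftover} threads unaccounted for"
    print(f"extent={extent:3d}: factor={factor}, {status}")
```

Widths 96 and 128 both give a quotient of 1, so the quotient alone cannot tell the exact case from the inexact one; only the remainder distinguishes them.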

