
riscv testing rvv#3

Draft
luhenry wants to merge 87 commits into riscv-testing-mobilenetv2 from riscv-testing-rvv

@luhenry
Collaborator

@luhenry commented May 16, 2026

No description provided.

hboyraz and others added 8 commits May 15, 2026 14:22
…ytorch#19603) (pytorch#19603)

Summary:

Replaces the compile-time `#ifdef ENABLE_XNNPACK_WEIGHTS_CACHE` gate in
XNNCompiler.cpp with a runtime boolean plumbed from
`XnnpackBackendOptions::resolve_weight_cache(context)` through
`XNNPACKBackend::init` to `XNNCompiler::compileModel`.

This fixes a silent-disable bug: previously, runtime opt-in via
`set_option(weight_cache_option_key, true)` was silently a no-op unless
the build also set `-c executorch.xnnpack_weights_cache=1`, because the
cache pointer handed to `xnn_create_runtime_v4` was hardcoded to nullptr
when the macro was undefined. Multimethod LoRA models re-packed the
entire backbone for every method load, costing
hundreds of MB of resident memory.

The runtime path now keys all three cache-relevant code regions
(unpacked-data load, cache pointer handoff to xnn_create_runtime_v4, and
finalize_for_runtime) on `bool use_weight_cache` resolved per-init from
the BackendInitContext.

The `Result<vector<string>>` declaration in compileModel was reshaped to
plain `vector<string>` because `Result<>` is non-assignable, and the new
runtime branch needs to assign to that variable.

Reviewed By: GregoryComer

Differential Revision: D105123995

Co-authored-by: Hakan Boyraz <[email protected]>
…ng" (pytorch#19620)

This reverts commit 7355d7b.

### Summary
Temporarily reverting to restore test health for QNN jobs.
Summary:

Add int16 activation / int8 weight (a16w8) quantization tests for
`aten.mean.dim` on Ethos-U55 and Ethos-U85.

## Changes
- Add `a16w8_mean_test_parameters` dict with 11 test configurations
covering keepdim/no-keepdim, positive/negative dims, dim=None, and ranks
1-4
- Add `test_mean_dim_a16w8_u55_INT` using `EthosU55PipelineINT` with
`a16w8_quantization=True, symmetric_io_quantization=True`
- Add `test_mean_dim_a16w8_u85_INT` using `EthosU85PipelineINT` with
same kwargs
- Register `ops/test_mean_dim.py` in `fbcode/` and `xplat/`
`targets.bzl`

Differential Revision: D104532361
Differential Revision: D105377713

Pull Request resolved: pytorch#19621
Differential Revision: D105378870

Pull Request resolved: pytorch#19622
…tmaxWithSoftmax

Differential Revision: D105367634

Pull Request resolved: pytorch#19619
Comment thread examples/riscv/setup.sh Outdated

# Download newer version of qemu-user-static from Debian repositories
QEMU_VERSION=10.0.8+ds-0+deb13u1+b1_$(dpkg --print-architecture)
[[ -f qemu-user_${QEMU_VERSION}.deb ]] || wget --progress=dot:giga http://ftp.us.debian.org/debian/pool/main/q/qemu/qemu-user_${QEMU_VERSION}.deb
Collaborator Author

Need to use https:// here, since I don't verify the package after download.
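
One possible way to check the package after download is to compare its SHA-256 digest against a pinned value. This is only a sketch using the Python standard library; the helper names and the idea of pinning a digest are assumptions, not something setup.sh does today.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large packages are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the file does not match the pinned digest (hypothetical helper)."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
```

The expected digest would have to be recorded next to `QEMU_VERSION` and updated whenever the version is bumped.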

pytorchbot and others added 21 commits May 18, 2026 00:40
This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: pytorch#19553 by
@kimishpatel
^ Please use this as the source of truth for the PR details, comments,
and reviews
ghstack PR base:
https://github.com/pytorch/executorch/tree/gh/kimishpatel/240/base
ghstack PR head:
https://github.com/pytorch/executorch/tree/gh/kimishpatel/240/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head:
https://github.com/pytorch/executorch/tree/gh/kimishpatel/240/orig
Differential Revision:
[D104318324](https://our.internmc.facebook.com/intern/diff/D104318324/)
@diff-train-skip-merge

Co-authored-by: Kimish Patel <[email protected]>
…ytorch#19572)

### Summary
Add QAT tests for AvgPool, MaxPool and Mul tensor ops for Neutron
backend using the new Neutron MLIR flow.

### Test plan
Unit tests provided.


cc @robert-kalmar
The output from gcc 13.3 and 15.2 compared:

d_print_comp_inner 9,608 -> 11,884 (+2,276)
_vfprintf_r 7,072 -> 8,484 (+1,412)
_dtoa_r 3,260 -> 3,596 (+336)
_svfprintf_r 6,988 -> 8,208 (+1,220)
_vfiprintf_r 3,692 -> 4,524 (+832)
d_print_mod 1,584 -> 1,942 (+358)
d_type 1,952 -> 2,124 (+172)
__gxx_personality_v0 1,068 -> 1,124 (+56)

New/now-large visible entries in the GCC 15.2 log include:

__ieee754_fmod 924
_Unwind_VRS_Pop 778
d_name 776

ExecuTorch itself did not grow.

ExecuTorch .text total:
GCC 13.3: 15,334
GCC 15.2: 15,006
delta: -328
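
The deltas above can be re-derived with a few lines of Python. The numbers are copied from the listing; the dictionary layout is only for illustration.

```python
# (GCC 13.3 size, GCC 15.2 size) per symbol, copied from the comparison above.
sizes = {
    "d_print_comp_inner": (9608, 11884),
    "_vfprintf_r": (7072, 8484),
    "_dtoa_r": (3260, 3596),
    "_svfprintf_r": (6988, 8208),
    "_vfiprintf_r": (3692, 4524),
    "d_print_mod": (1584, 1942),
    "d_type": (1952, 2124),
    "__gxx_personality_v0": (1068, 1124),
}

# Growth per symbol, largest first.
deltas = {name: new - old for name, (old, new) in sizes.items()}
for name, delta in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(f"{name:24s} {delta:+,}")

# ExecuTorch .text total: GCC 15.2 minus GCC 13.3.
executorch_text_delta = 15006 - 15334
print(f"ExecuTorch .text delta: {executorch_text_delta:+,}")
```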

Signed-off-by: [email protected]
Change-Id: I5c3e9388f3a6d87fd987811d7dc04e9ef85cb69d
Conv2d operator tests were creating random inputs at module import time.
The Arm test seed is applied later by an autouse pytest fixture, so
those tensors were not actually controlled by ARM_TEST_SEED.

That made tests nondeterministic across fresh pytest processes and could
expose different quantization behavior from run to run. Generate the
affected inputs lazily inside each test case so the existing seed
fixture makes them reproducible and ARM_TEST_SEED=RANDOM can
re-randomize the intended data.
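
A minimal sketch of the difference, using the standard `random` module in place of torch and a plain function in place of the autouse pytest fixture (both substitutions are assumptions made to keep the example self-contained):

```python
import random

# Anti-pattern: drawn at import time, before any seed fixture has run,
# so the values are not controlled by the test seed.
EAGER_INPUT = [random.random() for _ in range(4)]


def make_input(n: int = 4) -> list[float]:
    """Draw the input lazily, so it sees whatever seed is active at call time."""
    return [random.random() for _ in range(n)]


def run_case(seed: int) -> list[float]:
    random.seed(seed)    # stands in for the autouse seed fixture
    return make_input()  # generated inside the test, after seeding
```

With the lazy form, two runs with the same seed see identical data, and a different seed re-randomizes it.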

Signed-off-by: Zingo Andersen <[email protected]>
Summary:
Add int16 activation / int8 weight (a16w8) quantization tests for
`aten.var` on Ethos-U55 and Ethos-U85.

## Changes
- Add `test_var_a16w8_u55_INT` and `test_var_a16w8_u85_INT` using
`EthosU55PipelineINT`/`EthosU85PipelineINT` with
`a16w8_quantization=True, symmetric_io_quantization=True`
- Split `Var.test_parameters` into passing (`a16w8_var_test_parameters`,
keepdim=True) and xfail (`a16w8_var_test_parameters_xfails`,
keepdim=False) groups for the a16w8 tests
- Mark `keepdim=False` cases (`var_3d_no_keep_dim_0_correction`,
`var_4d_no_keep_dim_0_5_correction`) as `pytest.mark.xfail` since var
a16w8 produces incorrect output for scalar/reduced-rank output
- Register `ops/test_var.py` in `fbcode/` and `xplat/` `targets.bzl`

Differential Revision: D104532362
- Chat template: Gemma 4 31B-IT is instruction-tuned and produces
degenerate output without chat-template wrapping. Auto-wrap --prompt
with the IT template in both inference.py and the C++ runner;
--raw-prompt / --raw_prompt skips wrapping for
pre-formatted input.
- inv_freq dedup: Extract _compute_inv_freq() on Gemma4Attention so
__init__ and materialize_runtime_buffers share a single implementation
instead of duplicating the RoPE frequency computation.
- CI hardening: Check for "Paris" in the export inference sanity check
instead of just checking the script doesn't crash. Restore gemma4_31b
unit tests in the CUDA build job.
- Docs: Update README.md and model.md to reflect chat template and
inv_freq changes.

---------

Co-authored-by: mnachin <[email protected]>
Co-authored-by: Gasoonjia <[email protected]>
pytorch#19508)

## Summary

Fixes pytorch#19356 — Unify and Improve the Android dev story

## Changes

- Added Java API Reference (Javadoc) link to using-executorch-android
page (1-line fix)
- Added Javadoc entry to Android sidebar navigation
- Created package-info.java for org.pytorch.executorch with overview and
quick-start
- Created package-info.java for org.pytorch.executorch.extension.llm
- Created overview.html with top-level intro, quick-start code, and
package descriptions
- Updated build.gradle to pass -overview flag so overview.html is picked
up by the build

## Testing

All are verified by opening the documents manually.

- Added a "Java API Reference" hyperlink directly in the
using-executorch-android page → trivial 1-line fix
<img width="1896" height="886" alt="Screenshot 2026-05-12 124608"
src="https://github.com/user-attachments/assets/50a27d22-b419-4aaa-9b68-2c128e39253e"
/>

- Added sidebar/nav entry under Platforms → Android → "Java API
Reference (Javadoc)"
<img width="1885" height="999" alt="Screenshot 2026-05-12 124524"
src="https://github.com/user-attachments/assets/25c05c29-91d8-4b7a-a47c-b6fe9ed0cdc4"
/>

- Improved the bare javadoc/index.html with an overview, quick-start
code, and package descriptions
<img width="1894" height="1018" alt="Screenshot 2026-05-12 124548"
src="https://github.com/user-attachments/assets/8ab5aabe-7e83-485b-b74b-bc1b28ae2055"
/>

cc @mergennachin @AlannaBurke @digantdesai @freddan80 @per @zingo
@oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani
## Summary
1. New `docs/source/llm/run-on-android.md`, a Java reference for the
`executorch-android` AAR runner. Same shape as `run-on-ios.md`. Covers
`LlmModule`, the `LlmModuleConfig` builder, `LlmGenerationConfig`, the
`LlmCallback` methods, `load`/`stop`/`resetContext`, and the image/audio
prefill variants. Points at LlamaDemo.

2. Added `run-on-android` to the LLM toctree in `working-with-llms.md`,
sitting between the Qualcomm page and iOS.

3. In `getting-started.md`, swapped the two GitHub example links for the
in-docs Android and iOS pages so users stay in the docs.

4. Added a tip admonition to `using-executorch-export.md` under Model
Preparation, sending HF Hub users to `export-llm-optimum.md` before the
manual flow.

5. Cleaned up `export-llm-optimum.md`. Removed the leftover "Method 1"
framing since only the CLI path is documented, bumped the orphaned
subheadings up a level, and pointed the Running on Device links at the
new Android page and the existing iOS page (sample apps kept inline).


Fixes  pytorch#8790



cc @mergennachin @AlannaBurke @larryliu0820 @cccclai @helunwencser
@jackzhxng @byjlw
…seMadvise path (pytorch#19587)

This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: pytorch#19554 by
@kimishpatel
^ Please use this as the source of truth for the PR details, comments,
and reviews
ghstack PR base:
https://github.com/pytorch/executorch/tree/gh/kimishpatel/241/base
ghstack PR head:
https://github.com/pytorch/executorch/tree/gh/kimishpatel/241/head
Merge bot PR base:
https://github.com/pytorch/executorch/tree/gh/kimishpatel/240/orig
Merge bot PR head:
https://github.com/pytorch/executorch/tree/gh/kimishpatel/241/orig
Differential Revision:
[D104318326](https://our.internmc.facebook.com/intern/diff/D104318326/)
@diff-train-skip-merge

---------

Co-authored-by: Kimish Patel <[email protected]>
Co-authored-by: Gasoonjia <[email protected]>
Fixes pytorch#18073 (This issue was already fixed before this PR)

## Problem
The [setup
tutorial](https://docs.pytorch.org/executorch/main/using-executorch-building-from-source.html)
prescribes:

```bash
conda create -yn executorch python=3.10.0
```

Python 3.10.0 has a CPython bug that leaves the `init=False` fields of
PyTorch's `LeafSpec` uninitialized. `run_decompositions()` in ExecuTorch
deepcopies the `LeafSpec` nodes and crashes.
This breaks the export pipeline (`to_edge_transform_and_lower`,
`to_edge`, `run_decompositions`), so every model export attempt fails on
Python 3.10.0.

> The error message (`AttributeError: 'LeafSpec' object has no attribute
'type'`)
> gives no indication that Python version is the cause, making it hard
for tutorial followers to diagnose.

## Upstream Fix Status
The fix has already landed in PyTorch `main` via pytorch/pytorch#177154
and is available in recent nightly builds.
However, ExecuTorch currently pins to `torch==2.11`.
The 2.11 branch was cut before the fix was merged, so the upstream fix
does not apply until ExecuTorch's torch pin is bumped to 2.12.
Until then, this PR is necessary to prevent tutorial followers from
hitting the bug and getting lost.

## Resolution
The CI build script (`.ci/docker/build.sh`) uses:

```bash
PYTHON_VERSION=3.10
```

conda interprets `python=3.10` as a **prefix**, installing the latest
available
3.10.x — currently 3.10.16 — which does not have this bug.
The CPython fix landed in 3.10.1.
This discrepancy (`3.10` vs `3.10.0`) makes the bug invisible to CI
while affecting any user who follows the setup tutorial.

**Fix** python=3.10.0 -> python=3.10 to align the docs with CI.
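
For readers who already have an environment, a small guard like the following could flag the affected interpreter before an export is attempted. This is a sketch, not part of this PR; the function names are hypothetical, and it assumes (per the text above) that only 3.10.0 is affected.

```python
import sys


def affected_by_leafspec_bug(version_info=sys.version_info) -> bool:
    """True only for CPython 3.10.0; per the text above, 3.10.1 carries the fix."""
    return tuple(version_info[:3]) == (3, 10, 0)


def python_version_warning(version_info=sys.version_info):
    """Return a warning string for the affected version, else None."""
    if affected_by_leafspec_bug(version_info):
        return (
            "Python 3.10.0 detected: model export may fail with "
            "AttributeError: 'LeafSpec' object has no attribute 'type'. "
            "Recreate the environment with python=3.10."
        )
    return None


if __name__ == "__main__":
    message = python_version_warning()
    if message:
        print(message, file=sys.stderr)
```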

## Test plan
**Not working**
```
conda create -yn executorch3100 python=3.10.0
conda activate executorch3100
```

**Working**
```
conda create -yn executorch310 python=3.10
conda activate executorch310
```

```python
import torch
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

class MyModel(torch.nn.Module):
    def forward(self, x):
        return x + 1

# 1. Export your PyTorch model
model = MyModel().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)
exported_program = torch.export.export(model, example_inputs)

# 2. Optimize for target hardware (switch backends with one line)
program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()]  # CPU | CoreMLPartitioner() for iOS | QnnPartitioner() for Qualcomm
).to_executorch()
```

Co-authored-by: Gasoonjia <[email protected]>
Part of pytorch#17425.

This refreshes the iOS SwiftPM documentation for the ExecuTorch 1.0.0
package flow.

Changes:
- Updates the remaining `swiftpm-0.6.0` guidance in the getting started
page to use `swiftpm-1.0.0`.
- Clarifies the Xcode product selection step for the simple XNNPACK app
path.
- Refreshes the Xcode package-product screenshot and demo video assets
for the `swiftpm-1.0.0` flow.

Validation:
- `git diff --check`
- Verified `docs/source/_static/img/swiftpm_xcode.mp4` duration is
94.37s at 1280x836 and sampled frames through package selection.

cc @mergennachin @AlannaBurke @shoumikhin @cbilgin
…71218) (pytorch#19593)

Reviewed By: psiddh

Differential Revision: D104380965

Co-authored-by: DevmateRemedimateMacaClaude Bot <[email protected]>
## Summary

Adds `Tensor.copyDataInto(... dst)` to the Android Java API for the
float32 and float16 dtypes. It copies the tensor's data into a
caller-provided destination buffer instead of allocating a fresh
`float[]` per call (as `getDataAsFloatArray()` does today). The same
pattern is repeated for other types.

## Motivation

While profiling depth inference on Android with Perfetto, output
extraction was a meaningful contributor to ART GC pressure. Each call to
`output.toTensor().dataAsFloatArray` allocates a new Java `float[]`
sized to the tensor's element count and bulk-copies from the underlying
off-heap buffer into it.

The native side already exposes the underlying `FloatBuffer` directly
(zero-copy view of the C++ tensor's `data_ptr()`), so the only thing
missing was a public way for callers to drain it into a destination
buffer they already own and reuse across calls.

## API

```java
public void copyDataInto(FloatBuffer dst)
```

- Implemented on all datatypes

## Caller-side usage example

```java
// One-time setup
FloatBuffer depthBuf = Tensor.allocateFloatBuffer(numelDepth);

// Per inference
EValue[] outputs = module.forward(...);
depthBuf.rewind();
outputs[0].toTensor().copyDataInto(depthBuf);   // no allocation
// ... read from depthBuf ...
```

## Test plan

- [x] Added unit tests in `TensorTest.kt`:
  - `testCopyDataIntoFloat32`: round-trip with reuse across two calls
  - `testCopyDataIntoFloat32_writesAtDstPosition`: verifies the call
    writes at `dst.position()` and advances it (does not overwrite from
    index 0)
  - `testCopyDataIntoFloat32_overflow`: `BufferOverflowException` on
    undersized destination
  - `testCopyDataIntoFloat16`: verifies fp16→fp32 widening matches
    `getDataAsFloatArray`
  - `testCopyDataIntoFloat_unsupportedDtype`: `IllegalStateException`
    from base default for non-float dtypes

This PR was authored with Claude.

cc @kirklandsign @cbilgin

---------

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
Differential Revision: D105503056

Pull Request resolved: pytorch#19623
…#19249)

### Summary

When a model prepared with torchao's `quantize_(...)` (e.g. blockwise
int4)
is lowered without an iOS18+ `minimum_deployment_target`, coremltools
raises
a `ValueError` from inside `_construct_constexpr_dequant_op`:

```
ValueError: The more fine-grained quantization (such as blockwise) is only supported since iOS18.Please set minimum_deployment_target to iOS18 for using it.
```

This message is technically correct but does not tell the ExecuTorch
user
*how* to set the deployment target — the answer is buried in
`CoreMLBackend.generate_compile_specs(...)` plus
`CoreMLPartitioner(compile_specs=...)`, which is not obvious unless
you've
already been through the docs.

The two `dequantize_affine` / `dequantize_codebook` handlers in
`backends/apple/coreml/compiler/torch_ops.py` are the only call sites
where
the failing coremltools utilities are invoked from ExecuTorch code, so I
wrap them and re-raise the error with an additional hint that shows the
exact partitioner call.  After this change the user sees:

```
ValueError: The more fine-grained quantization (such as blockwise) is only supported since iOS18.Please set minimum_deployment_target to iOS18 for using it.
ExecuTorch hint: pass `compile_specs=CoreMLBackend.generate_compile_specs(minimum_deployment_target=ct.target.iOS18)` (or higher) to `CoreMLPartitioner` when lowering models that use `quantize_(...)`.
```
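
The wrap-and-re-raise pattern can be sketched as a decorator. This is not the code in `torch_ops.py`: the decorator name, the stand-in handler, and the choice to add the hint only when the message mentions `iOS18` are all assumptions for illustration. Only the hint text is taken from the output above.

```python
import functools

HINT = (
    "ExecuTorch hint: pass "
    "`compile_specs=CoreMLBackend.generate_compile_specs("
    "minimum_deployment_target=ct.target.iOS18)` (or higher) to "
    "`CoreMLPartitioner` when lowering models that use `quantize_(...)`."
)


def add_deployment_target_hint(fn):
    """Re-raise iOS18-related ValueErrors with the ExecuTorch hint appended."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except ValueError as err:
            if "iOS18" in str(err):
                # Keep the original message and chain the original exception.
                raise ValueError(f"{err}\n{HINT}") from err
            raise  # unrelated errors pass through unchanged

    return wrapper


@add_deployment_target_hint
def fake_dequantize_handler():
    # Hypothetical stand-in for the coremltools call that fails below iOS18.
    raise ValueError(
        "The more fine-grained quantization (such as blockwise) "
        "is only supported since iOS18."
    )
```

Chaining with `from err` keeps the original coremltools traceback visible, so the success path and the original diagnostics are both unaffected.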

Fixes pytorch#13122.

### Test plan

Added `test_dequantize_affine_below_ios18_raises_with_hint` which lowers
a
PerGroup-int4 quantized linear with
`minimum_deployment_target=ct.target.iOS17`
and asserts the raised `ValueError` mentions both `iOS18` and the
`CoreMLPartitioner` / `minimum_deployment_target` keywords.

The existing iOS18 quantization tests still pass
(`test_dequantize_affine_b4w_linear` exercised locally to confirm the
wrapper does not affect the success path).

```
$ python -m unittest -v executorch.backends.apple.coreml.test.test_torch_ops.TestTorchOps.test_dequantize_affine_below_ios18_raises_with_hint
Ran 1 test in 0.653s

OK
$ python -m unittest -v executorch.backends.apple.coreml.test.test_torch_ops.TestTorchOps.test_dequantize_affine_b4w_linear
Ran 1 test in 0.536s

OK
```

Authored with Claude.

cc @kimishpatel @YifanShenSZ @cymbalrush @metascroy
Reverts pytorch#19565

The mp4 file fails import (D105606113) due to large file size
restrictions:

oldSize: 3635918
newSize: 7982247

cc @mergennachin @AlannaBurke @shoumikhin @cbilgin
qti-horodnic and others added 23 commits May 19, 2026 17:48
…Ten op (pytorch#19301)

### Summary
Added support for the core aten op `tan` using a decomposition pass and
the identity:

```
tan(x) = sin(x) / cos(x)
```
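
As a quick numeric sanity check of the identity (plain Python `math`, not the QNN decomposition pass itself):

```python
import math


def tan_decomposed(x: float) -> float:
    """tan(x) expressed as sin(x) / cos(x), as in the decomposition above."""
    return math.sin(x) / math.cos(x)


# Compare against the library tan at a few points away from cos(x) == 0.
for x in (-1.0, 0.0, 0.5, 1.2):
    assert math.isclose(tan_decomposed(x), math.tan(x), rel_tol=1e-12, abs_tol=1e-12)
```

Near odd multiples of π/2 the denominator approaches zero, so the quotient form becomes numerically sensitive there.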

### Test plan
```
python backends/qualcomm/tests/test_qnn_delegate.py TestQNNQuantizedOperator.test_qnn_backend_tan --model SM8750 --host aisw-vm15-labsd --device 545ee4aa --build_folder build-android
python backends/qualcomm/tests/test_qnn_delegate.py TestQNNFloatingPointOperator.test_qnn_backend_tan --model SM8750 --host aisw-vm15-labsd --device 545ee4aa --build_folder build-android
```
…pytorch#19558)

Summary:

The pass checked for a batch norm following the conv to avoid breaking
fusion with a squeeze.

However, it did not support Conv -> Batch Norm -> ReLU or Conv -> ReLU.

This commit adds that support, along with other supported activations.

Reviewed By: rascani

Differential Revision: D105017469
Differential Revision: D105100368

Pull Request resolved: pytorch#19680
Differential Revision: D105753995

Pull Request resolved: pytorch#19682
### Summary
Some older Python versions seem to interpret
`qnn_config.htp_performance_mode` as a list, while newer versions treat
it as an enum. To stay compatible with older Python versions, remove the
comma.
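
For context, in standard Python a trailing comma after an expression builds a one-element tuple. The sketch below shows that general behavior with a hypothetical enum; whether this is exactly what happened in the QNN config code is an assumption.

```python
from enum import Enum


class PerfMode(Enum):  # hypothetical stand-in for the HTP performance mode enum
    BURST = 1


with_comma = PerfMode.BURST,     # trailing comma: a 1-tuple containing the enum
without_comma = PerfMode.BURST   # no comma: the enum member itself

print(type(with_comma).__name__)     # tuple
print(type(without_comma).__name__)  # PerfMode
```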

…c core ATen op (pytorch#19283)

### Summary
Added support for the core ATen op `scatter.src` using an op builder
with the [QNN
implementation](https://docs.qualcomm.com/doc/80-63442-10/topic/HtpOpDefSupplement.html#scatterelements)
for `ScatterElements`. Note `scatter.src` uses `ScatterElements`
directly with the argument `reduction=NONE`.

### Test plan
```
python backends/qualcomm/tests/test_qnn_delegate.py -k TestQNNQuantizedOperator.test_qnn_backend_scatter_src --model SM8750 --host aisw-vm15-labsd --device 545ee4aa --build_folder build-android
python backends/qualcomm/tests/test_qnn_delegate.py -k TestQNNFloatingPointOperator.test_qnn_backend_scatter_src --model SM8750 --host aisw-vm15-labsd --device 545ee4aa --build_folder build-android
```
pytorch#19224)

Summary:
Heap profiling at runtime with HTP backend on Android platforms. DSP
heap profiling is available for QnnContext_createFromBinary use-cases.
It captures total DSP heap usage at two checkpoints:
- Before the first context is created (before_context_created)
- After the last context is freed (after_context_freed)

The difference between the two values represents heap consumed during
context execution. The value after freeing is typically equal to or
greater than before creation.

Test plan:
python backends/qualcomm/tests/test_qnn_delegate.py
TestQNNQuantizedUtils.test_qnn_backend_runtime_option_heap_profile -b
build-android -H ${HOST} -s ${SN} -m ${SOC_MODEL}
Port doc changes from 12bb0e7 from pure md-files to md.in stubs.


Signed-off-by: Adrian Lundell <[email protected]>
Summary
- Adds a VGF Swin2SR super-resolution example for Arm.
- Adds FP and INT8 export/eval flows with deterministic demo assets.
- Adds Arm OOTB smoke coverage and model tests.

Validation
- bash -n backends/arm/test/test_arm_ootb.sh
- PYTHONPATH=. /Users/usazah01/src/executorch/env/bin/python -m pytest
-q -p no:rerunfailures backends/arm/test/models/test_swin2sr_arm.py -s
- PATH=/Users/usazah01/src/executorch/env/bin:$PATH
backends/arm/scripts/pre-push


cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell @rascani

---------

Signed-off-by: Usamah Zaheer <[email protected]>
The flag VK_PIPELINE_CREATE_2_FAIL_ON_PIPELINE_COMPILE_REQUIRED_BIT
means
"Do not perform expensive synchronous compilation during this call".

If Vulkan detects that such compilation is needed, it instead
fails with an error. We can skip this precaution.
)

Based on the decision that aot_arm_compiler should no longer be used for
production use, this patch updates the documentation to direct users
away from aot_arm_compiler for production use, and instead points them
to the Python API.

Signed-off-by: Martin Lindström <[email protected]>
…#19679) (pytorch#19679)

Summary:
Adds `LlmModuleConversationHistoryTest`, an Android instrumentation test
that exercises the multi-turn / KV-cache plumbing on `LlmModule`. The
OKR theme this enables is "Feature testing → conversation history"
(3.2), which depends on `prefillPrompt` + `resetContext` semantics being
correct.

The test runs on the existing TinyStories-110M fixture pulled by
`android_test_setup.sh` from the public `ossci-android` S3 bucket, so it
works on **both** internal fbsource Android CI and OSS GitHub Actions
Android CI without any new fixture infrastructure.

Because TinyStories is too small and not instruction-tuned,
content-level assertions (e.g. "did the model recall the user's name")
are not reliable. Instead, the test asserts four behavioral invariants
of the conversation-history surface that any production multi-turn flow
depends on:

1. `testResetContextProducesDeterministicOutput` — at temperature=0
(greedy decode), running the same prompt twice with `resetContext()`
between yields identical token streams. This is the foundational
invariant: clearing the KV cache truly returns the model to a clean
state.
2. `testKvCacheStatePersistsAcrossGenerateCalls` — without
`resetContext()` between calls, two `generate()` calls with the same
prompt diverge, proving the KV cache is preserved across turns. If this
ever fails, multi-turn conversation is silently broken.
3. `testPrefillPromptInfluencesNextGeneration` —
`prefillPrompt(history)` followed by `generate(prompt)` differs from a
clean-context `generate(prompt)`, proving the prefilled context actually
reaches the decoder.
4. `testResetContextClearsPrefilledHistory` — `prefillPrompt +
resetContext + generate` matches a clean-slate `generate`, proving reset
fully clears prefilled state.
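
The four invariants can be illustrated with a toy model whose output depends deterministically on all text seen so far. `ToyLlm` is a hypothetical stand-in, not the `LlmModule` API, and the snake_case method names only mirror the ones mentioned above.

```python
import hashlib


class ToyLlm:
    """Hypothetical stand-in: each token is a hash of the accumulated context."""

    def __init__(self):
        self._context = ""  # plays the role of the KV cache

    def reset_context(self):
        self._context = ""

    def prefill_prompt(self, text: str):
        self._context += text

    def generate(self, prompt: str, n_tokens: int = 4) -> list[str]:
        self._context += prompt
        tokens = []
        for _ in range(n_tokens):
            token = hashlib.sha256(self._context.encode()).hexdigest()[:8]
            tokens.append(token)
            self._context += token  # generated tokens also enter the context
        return tokens


def clean_generate(prompt: str) -> list[str]:
    return ToyLlm().generate(prompt)


llm = ToyLlm()

# 1. Reset between identical prompts gives identical output.
first = llm.generate("Once upon a time")
llm.reset_context()
assert llm.generate("Once upon a time") == first

# 2. Without reset, the second call sees the earlier context and diverges.
assert llm.generate("Once upon a time") != first

# 3. Prefilled history influences the next generation.
llm.reset_context()
llm.prefill_prompt("My name is Ada. ")
assert llm.generate("Once upon a time") != clean_generate("Once upon a time")

# 4. Reset clears prefilled history.
llm.reset_context()
llm.prefill_prompt("My name is Ada. ")
llm.reset_context()
assert llm.generate("Once upon a time") == clean_generate("Once upon a time")
```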


Reviewed By: GregoryComer, kirklandsign

Differential Revision: D105741356

Pulled By: psiddh

---------

Co-authored-by: Copilot Autofix powered by AI <[email protected]>
Co-authored-by: Claude <[email protected]>
Change-Id: Ibb7ef4167ab96426133fce64e34366c365cd12ad
Signed-off-by: Yufeng Shi <[email protected]>
)

## Summary

This PR modernizes the ExecuTorch Arm bare-metal runner workflow so
users can move from a PyTorch model to a runnable
Arm executor runner with fewer manual build-system steps, stronger
validation, and faster repeated local iteration.

The main change is a new standalone Arm executor runner CMake entry
point. `run.sh` now acts as the orchestration layer for common Ethos-U
bare-metal flows: it can derive build directories, configure the
standalone runner with Arm bare-metal defaults, stage generated PTE/BPTE
files, validate reused CMake caches, build the needed runner target,
locate the runner binary, and invoke FVP.

## Problem

Before this change, the Arm runner workflow depended on manually
stitching together ExecuTorch build/install artifacts, runner CMake
configuration, PTE input wiring, toolchain and target settings, optional
debug features, and repeated install/export steps.

That made the workflow harder to explain, fragile in CI, slower to
iterate on locally, and easy to break when reusing a build directory
configured for a different target or feature set.


## CMake Architecture Change

```mermaid
flowchart LR
    subgraph Before
        A1["Build ExecuTorch<br/>arm-baremetal preset"] --> A2["Install/export artifacts"]
        A2 --> A3["Configure runner CMake<br/>examples/arm/executor_runner"]
        A4["PTE / BPTE"] --> A3
        A3 --> A5["arm_executor_runner ELF"]
    end

    subgraph After
        B1["run.sh"] --> B2["Validate / choose build dir"]
        B2 --> B3["Standalone runner CMake<br/>examples/arm/executor_runner/standalone"]
        B4["PTE / BPTE"] --> B1
        B3 --> B5["ExecuTorch top-level CMake<br/>as subdirectory"]
        B3 --> B6["Arm CMake helpers + presets"]
        B5 --> B7["arm_executor_runner ELF"]
        B6 --> B7
    end
```

## What Changed

- Added `examples/arm/executor_runner/standalone` as the supported
  standalone CMake entry point for `arm_executor_runner`.
- Added shared Arm CMake helpers for Ethos-U SDK setup, required target
  validation, and predictable runner output paths.
- Updated `build_executor_runner.sh` and `run.sh` to use the standalone
  runner workflow.
- Added deterministic default build directories under `--et_build_root`.
- Added cache validation for reused build directories, including target,
  toolchain, selected ops, PTE placement, BundleIO, ETDump, and devtools
  settings.
- Added PTE/BPTE staging so repeated runs can reuse the same configured
  CMake build directory.
- Integrated selective-op handling into the standalone runner path.
- Cleaned up bare-metal install/export behavior so standalone builds can
  consume reusable build-tree artifacts.
- Updated Arm README and notebooks for the new workflow.
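
To make the cache-validation idea concrete, here is a rough sketch of comparing a reused `CMakeCache.txt` against the settings requested for the current run. It is not how `run.sh` implements the check; the function names and the `TARGET_CPU` variable are hypothetical, and only the `KEY:TYPE=VALUE` cache line format is standard CMake.

```python
def parse_cmake_cache(text: str) -> dict[str, str]:
    """Parse CMakeCache.txt content into {variable: value}, ignoring comments."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "//")):
            continue  # blank lines, '#' comments, and '//' help strings
        key_and_type, sep, value = line.partition("=")
        if not sep:
            continue
        key = key_and_type.split(":", 1)[0]  # drop the ':TYPE' suffix
        entries[key] = value
    return entries


def cache_mismatches(cache: dict[str, str], expected: dict[str, str]) -> dict:
    """Return {variable: (cached, expected)} for every setting that differs."""
    return {
        key: (cache.get(key), want)
        for key, want in expected.items()
        if cache.get(key) != want
    }


sample = """\
# This is the CMakeCache file.
//Build type
CMAKE_BUILD_TYPE:STRING=Release
TARGET_CPU:STRING=cortex-m55
"""
cache = parse_cmake_cache(sample)
print(cache_mismatches(cache, {"TARGET_CPU": "cortex-m85"}))
```

A non-empty result would mean the build directory was configured for something else and should not be silently reused.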

## Iteration Speed

Repeated local PTE-to-runner iteration is now **8x faster** because
`run.sh` can reuse the configured standalone CMake build directory,
stage updated PTE/BPTE payloads into the existing cache wiring, and
rebuild only the needed runner target instead of repeating the full
manual configure/install/export flow.

This is a developer workflow speedup, not a model runtime speedup.

## Result

For common Ethos-U bare-metal usage, the user-facing path is now
script-owned and repeatable:

1. Run Arm setup.
2. Run `examples/arm/run.sh` with a model and target.
3. Reuse or inspect the generated build directory under
   `--et_build_root`.
4. Iterate by regenerating the PTE/BPTE and rebuilding through the same
   validated CMake cache.

VGF host flows remain explicit: `run.sh` requires an existing
`--build-dir` for VGF-style host builds rather than auto-configuring
them as bare-metal runner builds.

## Testing

Validated through the Arm backend runner, bare-metal, VGF, and CI
workflows covered by this stack.



cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell @rascani

---------

Signed-off-by: Usamah Zaheer <[email protected]>
…ng matrix (pytorch#19617)

### Summary

This is the continuation of
pytorch#19399 and
pytorch#19521, to deliver on Phase 2
of pytorch#18991

### Test plan

This code is exclusively test code. Everything works out of the box, and
CI will validate.
Moves CI testing from YAML to .ci/scripts/test_zephyr.sh and creates a
table of README and target combinations to run instead of having mv2
tests hard-coded. This will make it easier to add more samples and tests
in the future, as the test flow is more generic.


Signed-off-by: Zingo Andersen <[email protected]>
…torch#19688)

### Summary
Neutron SW 3.1.1 removes the restriction for maximum supported kernel
size of the MaxPool2D operator. This PR reflects this change in the
Neutron backend.

### Test plan
Unit test provided.


cc @robert-kalmar @JakeStevens @digantdesai @rascani
…torch#19687)

### Summary
This PR adds support for the aten.sigmoid operator with the new Neutron
MLIR flow.

### Test plan
Unit tests provided.

cc @robert-kalmar @JakeStevens @digantdesai @rascani
…ytorch#19667)

### Summary
This PR adds support for the `aten.leaky_relu` operator with the new
Neutron MLIR flow.

### Test plan
Unit tests provided.


cc @robert-kalmar @JakeStevens @digantdesai @rascani
Differential Revision: D104433196

Pull Request resolved: pytorch#19641
luhenry added 3 commits May 20, 2026 22:09
Since RISC-V allows different hardware implementations to have different
vector lengths (similar to Arm SVE), we want to make sure that we test
on different configurations. Luckily, QEMU lets us simply set a
vlen=<128,256,512,...> parameter on QEMU_CPU to emulate different
vector lengths.
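
A rough sketch of how a test matrix over vector lengths could be generated. The helper names are hypothetical, and the `rv64,v=true` base string and the power-of-two, at-least-128 check are assumptions about the QEMU CPU options; only the `vlen=` parameter on `QEMU_CPU` comes from the description above.

```python
def qemu_cpu_for_vlen(vlen: int, base: str = "rv64") -> str:
    """Build a QEMU_CPU value that enables RVV with the given vector length in bits."""
    # Assumption: vlen must be a power of two and at least 128.
    if vlen < 128 or vlen & (vlen - 1):
        raise ValueError(f"unsupported vlen: {vlen}")
    return f"{base},v=true,vlen={vlen}"


def qemu_env_matrix(vlens=(128, 256, 512)) -> list[dict[str, str]]:
    """One environment override per vector length to run the test suite under."""
    return [{"QEMU_CPU": qemu_cpu_for_vlen(v)} for v in vlens]


for env in qemu_env_matrix():
    print(env["QEMU_CPU"])
```

Each entry would be merged into the environment of the qemu-user invocation, for example through `subprocess.run(..., env={**os.environ, **env})`.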
@luhenry force-pushed the riscv-testing-rvv branch from f836fdb to 2c8507d on May 20, 2026 20:11