riscv testing rvv by luhenry · Pull Request #3 · riseproject-dev/executorch

luhenry · 2026-05-16T13:14:15Z

No description provided.

…ytorch#19603) (pytorch#19603) Summary: Replaces the compile-time `#ifdef ENABLE_XNNPACK_WEIGHTS_CACHE` gate in XNNCompiler.cpp with a runtime boolean plumbed from `XnnpackBackendOptions::resolve_weight_cache(context)` through `XNNPACKBackend::init` to `XNNCompiler::compileModel`. This fixes a silent-disable bug: previously, runtime opt-in via `set_option(weight_cache_option_key, true)` was silently a no-op unless the build also set `-c executorch.xnnpack_weights_cache=1`, because the cache pointer handed to `xnn_create_runtime_v4` was hardcoded to nullptr when the macro was undefined. Multimethod LoRA models re-packed the entire backbone for every method load, costing hundreds of MB of resident memory. The runtime path now keys all three cache-relevant code regions (unpacked-data load, cache pointer handoff to xnn_create_runtime_v4, and finalize_for_runtime) on `bool use_weight_cache` resolved per-init from the BackendInitContext. The `Result<vector<string>>` declaration in compileModel was reshaped to plain `vector<string>` since `Result<>` is non-assignable, which is required for the new runtime branch. Reviewed By: GregoryComer Differential Revision: D105123995 Co-authored-by: Hakan Boyraz <[email protected]>

…ng" (pytorch#19620) This reverts commit 7355d7b. ### Summary Temporarily reverting to restore test health for QNN jobs.

Summary: Add int16 activation / int8 weight (a16w8) quantization tests for `aten.mean.dim` on Ethos-U55 and Ethos-U85. ## Changes - Add `a16w8_mean_test_parameters` dict with 11 test configurations covering keepdim/no-keepdim, positive/negative dims, dim=None, and ranks 1-4 - Add `test_mean_dim_a16w8_u55_INT` using `EthosU55PipelineINT` with `a16w8_quantization=True, symmetric_io_quantization=True` - Add `test_mean_dim_a16w8_u85_INT` using `EthosU85PipelineINT` with same kwargs - Register `ops/test_mean_dim.py` in `fbcode/` and `xplat/` `targets.bzl` Differential Revision: D104532361

…rch#19561) Differential Revision: D98080033 Pull Request resolved: pytorch#19561

…rch#19523) Differential Revision: D104862210 Pull Request resolved: pytorch#19523

Differential Revision: D105377713 Pull Request resolved: pytorch#19621

Differential Revision: D105378870 Pull Request resolved: pytorch#19622

…tmaxWithSoftmax Differential Revision: D105367634 Pull Request resolved: pytorch#19619

luhenry · 2026-05-16T13:19:02Z

+
+# Download newer version of qemu-user-static from Debian repositories
+QEMU_VERSION=10.0.8+ds-0+deb13u1+b1_$(dpkg --print-architecture)
+[[ -f qemu-user_${QEMU_VERSION}.deb ]] || wget --progress=dot:giga http://ftp.us.debian.org/debian/pool/main/q/qemu/qemu-user_${QEMU_VERSION}.deb


This needs to use https://, since I don't check the package after download.
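One way to address the second half of that note is to compare the downloaded file against a pinned digest before installing it. The following is a minimal sketch, assuming a SHA-256 digest is recorded next to `QEMU_VERSION`; the helper names `sha256_of` and `verify_sha256` are hypothetical and not part of this PR.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path: Path, expected_hex: str) -> None:
    """Raise ValueError when the file does not match the pinned digest (hypothetical helper)."""
    actual = sha256_of(path)
    if actual != expected_hex.lower():
        raise ValueError(f"{path}: expected {expected_hex}, got {actual}")
```

The pinned digest would have to come from a trusted source; no real digest for the qemu-user package is given here.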

@kimishpatel

This PR was created by the merge bot to help merge the original PR into the main branch. ghstack PR number: pytorch#19553 by @kimishpatel ^ Please use this as the source of truth for the PR details, comments, and reviews ghstack PR base: https://github.com/pytorch/executorch/tree/gh/kimishpatel/240/base ghstack PR head: https://github.com/pytorch/executorch/tree/gh/kimishpatel/240/head Merge bot PR base: https://github.com/pytorch/executorch/tree/main Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/kimishpatel/240/orig Differential Revision: [D104318324](https://our.internmc.facebook.com/intern/diff/D104318324/) @diff-train-skip-merge Co-authored-by: Kimish Patel <[email protected]>

@robert-kalmar

…ytorch#19572) ### Summary Add QAT tests for AvgPool, MaxPool and Mul tensor ops for Neutron backend using the new Neutron MLIR flow. ### Test plan Unit tests provided. cc @robert-kalmar

…#19552) ### Test plan Existing unit test

@digantdesai

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani Signed-off-by: Baris Demir <[email protected]> Co-authored-by: Baris Demir <[email protected]>

@digantdesai

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani Signed-off-by: Baris Demir <[email protected]> Co-authored-by: Baris Demir <[email protected]>

The output from gcc 13.3 and 15.2 compared:

d_print_comp_inner    9,608 -> 11,884 (+2,276)
_vfprintf_r           7,072 -> 8,484 (+1,412)
_dtoa_r               3,260 -> 3,596 (+336)
_svfprintf_r          6,988 -> 8,208 (+1,220)
_vfiprintf_r          3,692 -> 4,524 (+832)
d_print_mod           1,584 -> 1,942 (+358)
d_type                1,952 -> 2,124 (+172)
__gxx_personality_v0  1,068 -> 1,124 (+56)

New/now-large visible entries in the GCC 15.2 log include:

__ieee754_fmod   924
_Unwind_VRS_Pop  778
d_name           776

ExecuTorch itself did not grow. ExecuTorch .text total:

GCC 13.3: 15,334
GCC 15.2: 15,006
delta: -328

Signed-off-by: [email protected]
Change-Id: I5c3e9388f3a6d87fd987811d7dc04e9ef85cb69d
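The per-symbol deltas quoted above can be rechecked mechanically. This is a small sketch that uses only the numbers from the commit message; the names `SYMBOL_SIZES` and `size_deltas` are illustrative.

```python
# (old, new) sizes as quoted in the commit message for GCC 13.3 and GCC 15.2.
SYMBOL_SIZES = {
    "d_print_comp_inner": (9608, 11884),
    "_vfprintf_r": (7072, 8484),
    "_dtoa_r": (3260, 3596),
    "_svfprintf_r": (6988, 8208),
    "_vfiprintf_r": (3692, 4524),
    "d_print_mod": (1584, 1942),
    "d_type": (1952, 2124),
    "__gxx_personality_v0": (1068, 1124),
}


def size_deltas(sizes: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Return new - old for every symbol."""
    return {name: new - old for name, (old, new) in sizes.items()}
```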

Conv2d operator tests were creating random inputs at module import time. The Arm test seed is applied later by an autouse pytest fixture, so those tensors were not actually controlled by ARM_TEST_SEED. That made tests nondeterministic across fresh pytest processes and could expose different quantization behavior from run to run. Generate the affected inputs lazily inside each test case so the existing seed fixture makes them reproducible and ARM_TEST_SEED=RANDOM can re-randomize the intended data. Signed-off-by: Zingo Andersen <[email protected]>
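The difference between import-time and lazy input creation can be sketched without the Arm test harness. The example below is only an illustration: it uses the standard-library `random` module in place of torch, and `run_case` stands in for a test body that runs after the seed fixture.

```python
import random

# Eager: values are drawn once, at import time, before any seed
# fixture has had a chance to run.
EAGER_CASES = {"conv_1": [random.random() for _ in range(4)]}

# Lazy: each entry is a zero-argument factory, so values are drawn only
# when the test body calls it, after the seed has been applied.
LAZY_CASES = {"conv_1": lambda: [random.random() for _ in range(4)]}


def run_case(name: str, seed: int) -> list[float]:
    """Stand-in for a seeded test: seed first, then build the inputs."""
    random.seed(seed)  # plays the role of the autouse seed fixture
    return LAZY_CASES[name]()
```

With the lazy form, the same seed reproduces the same inputs in a fresh process, and a different seed re-randomizes them.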

@digantdesai

) Add static cache integration tests in llama cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani Signed-off-by: Xingguo Li <[email protected]> Co-authored-by: Elena Zhelezina <[email protected]>

Summary: Add int16 activation / int8 weight (a16w8) quantization tests for `aten.var` on Ethos-U55 and Ethos-U85. ## Changes - Add `test_var_a16w8_u55_INT` and `test_var_a16w8_u85_INT` using `EthosU55PipelineINT`/`EthosU85PipelineINT` with `a16w8_quantization=True, symmetric_io_quantization=True` - Split `Var.test_parameters` into passing (`a16w8_var_test_parameters`, keepdim=True) and xfail (`a16w8_var_test_parameters_xfails`, keepdim=False) groups for the a16w8 tests - Mark `keepdim=False` cases (`var_3d_no_keep_dim_0_correction`, `var_4d_no_keep_dim_0_5_correction`) as `pytest.mark.xfail` since var a16w8 produces incorrect output for scalar/reduced-rank output - Register `ops/test_var.py` in `fbcode/` and `xplat/` `targets.bzl` Differential Revision: D104532362

- Chat template: Gemma 4 31B-IT is instruction-tuned and produces degenerate output without chat-template wrapping. Auto-wrap --prompt with the IT template in both inference.py and the C++ runner; --raw-prompt / --raw_prompt skips wrapping for pre-formatted input. - inv_freq dedup: Extract _compute_inv_freq() on Gemma4Attention so __init__ and materialize_runtime_buffers share a single implementation instead of duplicating the RoPE frequency computation. - CI hardening: Check for "Paris" in the export inference sanity check instead of just checking the script doesn't crash. Restore gemma4_31b unit tests in the CUDA build job. - Docs: Update README.md and model.md to reflect chat template and inv_freq changes. --------- Co-authored-by: mnachin <[email protected]> Co-authored-by: Gasoonjia <[email protected]>

@mergennachin

pytorch#19508) ## Summary Fixes pytorch#19356 — Unify and Improve the Android dev story ## Changes - Added Java API Reference (Javadoc) link to using-executorch-android page (1-line fix) - Added Javadoc entry to Android sidebar navigation - Created package-info.java for org.pytorch.executorch with overview and quick-start - Created package-info.java for org.pytorch.executorch.extension.llm - Created overview.html with top-level intro, quick-start code, and package descriptions - Updated build.gradle to pass -overview flag so overview.html is picked up by the build ## Testing All are verified by opening the documents manually. - Added a "Java API Reference" hyperlink directly in the using-executorch-android page → trivial 1-line fix <img width="1896" height="886" alt="Screenshot 2026-05-12 124608" src="https://github.com/user-attachments/assets/50a27d22-b419-4aaa-9b68-2c128e39253e" /> - Added sidebar/nav entry under Platforms → Android → "Java API Reference (Javadoc)" <img width="1885" height="999" alt="Screenshot 2026-05-12 124524" src="https://github.com/user-attachments/assets/25c05c29-91d8-4b7a-a47c-b6fe9ed0cdc4" /> - Improved the bare javadoc/index.html with an overview, quick-start code, and package descriptions <img width="1894" height="1018" alt="Screenshot 2026-05-12 124548" src="https://github.com/user-attachments/assets/8ab5aabe-7e83-485b-b74b-bc1b28ae2055" /> cc @mergennachin @AlannaBurke @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani

@mergennachin

## Summary 1. New `docs/source/llm/run-on-android.md`, a Java reference for the `executorch-android` AAR runner. Same shape as `run-on-ios.md`. Covers `LlmModule`, the `LlmModuleConfig` builder, `LlmGenerationConfig`, the `LlmCallback` methods, `load`/`stop`/`resetContext`, and the image/audio prefill variants. Points at LlamaDemo. 2. Added `run-on-android` to the LLM toctree in `working-with-llms.md`, sitting between the Qualcomm page and iOS. 3. In `getting-started.md`, swapped the two GitHub example links for the in-docs Android and iOS pages so users stay in the docs. 4. Added a tip admonition to `using-executorch-export.md` under Model Preparation, sending HF Hub users to `export-llm-optimum.md` before the manual flow. 5. Cleaned up `export-llm-optimum.md`. Removed the leftover "Method 1" framing since only the CLI path is documented, bumped the orphaned subheadings up a level, and pointed the Running on Device links at the new Android page and the existing iOS page (sample apps kept inline). Fixes pytorch#8790 cc @mergennachin @AlannaBurke @larryliu0820 @cccclai @helunwencser @jackzhxng @byjlw

…ytorch#19644) Reverts pytorch#19604 This breaks internal CI

@kimishpatel

…seMadvise path (pytorch#19587) This PR was created by the merge bot to help merge the original PR into the main branch. ghstack PR number: pytorch#19554 by @kimishpatel ^ Please use this as the source of truth for the PR details, comments, and reviews ghstack PR base: https://github.com/pytorch/executorch/tree/gh/kimishpatel/241/base ghstack PR head: https://github.com/pytorch/executorch/tree/gh/kimishpatel/241/head Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/kimishpatel/240/orig Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/kimishpatel/241/orig Differential Revision: [D104318326](https://our.internmc.facebook.com/intern/diff/D104318326/) @diff-train-skip-merge --------- Co-authored-by: Kimish Patel <[email protected]> Co-authored-by: Gasoonjia <[email protected]>

Fixes pytorch#18073 (This issue was already fixed before this PR)

## Problem

The [setup tutorial](https://docs.pytorch.org/executorch/main/using-executorch-building-from-source.html) prescribes:

```bash
conda create -yn executorch python=3.10.0
```

This version has a CPython bug that leaves the `init=False` fields of PyTorch's `LeafSpec` uninitialized. `run_decompositions()` in ExecuTorch deepcopies the `LeafSpec` nodes and crashes. This breaks the export pipeline (`to_edge_transform_and_lower`, `to_edge`, `run_decompositions`), so any model export attempt fails unconditionally on Python 3.10.0.

> The error message (`AttributeError: 'LeafSpec' object has no attribute 'type'`)
> gives no indication that the Python version is the cause, making it hard for tutorial followers to diagnose.

## Upstream Fix Status

The fix has already landed in PyTorch `main` via pytorch/pytorch#177154 and is available in recent nightly builds. However, ExecuTorch currently pins to `torch==2.11`. The 2.11 branch was cut before the fix was merged, which means the upstream fix does not apply until ExecuTorch's torch pin is bumped to 2.12. Until then, this PR is necessary to prevent tutorial followers from hitting the bug and getting lost.

## Resolution

The CI build script (`.ci/docker/build.sh`) uses:

```bash
PYTHON_VERSION=3.10
```

conda interprets `python=3.10` as a **prefix**, installing the latest available 3.10.x (currently 3.10.16), which does not have this bug. The CPython fix landed in 3.10.1. This discrepancy (`3.10` vs `3.10.0`) makes the bug invisible to CI while affecting any user who follows the setup tutorial.

**Fix** `python=3.10.0` -> `python=3.10` to align the docs with CI.

## Test plan

**Not working**

```
conda create -yn executorch3100 python=3.10.0
conda activate executorch3100
```

**Working**

```
conda create -yn executorch310 python=3.10
conda activate executorch310
```

```python
import torch
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner


class MyModel(torch.nn.Module):
    def forward(self, x):
        return x + 1


# 1. Export your PyTorch model
model = MyModel().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)
exported_program = torch.export.export(model, example_inputs)

# 2. Optimize for target hardware (switch backends with one line)
program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()]  # CPU | CoreMLPartitioner() for iOS | QnnPartitioner() for Qualcomm
).to_executorch()
```

Co-authored-by: Gasoonjia <[email protected]>

@mergennachin

Part of pytorch#17425. This refreshes the iOS SwiftPM documentation for the ExecuTorch 1.0.0 package flow. Changes: - Updates the remaining `swiftpm-0.6.0` guidance in the getting started page to use `swiftpm-1.0.0`. - Clarifies the Xcode product selection step for the simple XNNPACK app path. - Refreshes the Xcode package-product screenshot and demo video assets for the `swiftpm-1.0.0` flow. Validation: - `git diff --check` - Verified `docs/source/_static/img/swiftpm_xcode.mp4` duration is 94.37s at 1280x836 and sampled frames through package selection. cc @mergennachin @AlannaBurke @shoumikhin @cbilgin

…71218) (pytorch#19593) Reviewed By: psiddh Differential Revision: D104380965 Co-authored-by: DevmateRemedimateMacaClaude Bot <[email protected]>

@kirklandsign

## Summary

Adds `Tensor.copyDataInto(... dst)` to the Android Java API for the float32 and float16 dtypes. It copies the tensor's data into a caller-provided destination buffer instead of allocating a fresh `float[]` per call (as `getDataAsFloatArray()` does today). The same pattern is repeated for other types.

## Motivation

While profiling depth inference on Android with Perfetto, output extraction was a meaningful contributor to ART GC pressure. Each call to `output.toTensor().dataAsFloatArray` allocates a new Java `float[]` sized to the tensor's element count and bulk-copies from the underlying off-heap buffer into it. The native side already exposes the underlying `FloatBuffer` directly (zero-copy view of the C++ tensor's `data_ptr()`), so the only thing missing was a public way for callers to drain it into a destination buffer they already own and reuse across calls.

## API

```java
public void copyDataInto(FloatBuffer dst)
```

- Implemented on all datatypes

## Caller-side usage example

```java
// One-time setup
FloatBuffer depthBuf = Tensor.allocateFloatBuffer(numelDepth);

// Per inference
EValue[] outputs = module.forward(...);
depthBuf.rewind();
outputs[0].toTensor().copyDataInto(depthBuf); // no allocation
// ... read from depthBuf ...
```

## Test plan

- [x] Added unit tests in `TensorTest.kt`:
  - `testCopyDataIntoFloat32`: round-trip with reuse across two calls
  - `testCopyDataIntoFloat32_writesAtDstPosition`: verifies the call writes at `dst.position()` and advances it (does not overwrite from index 0)
  - `testCopyDataIntoFloat32_overflow`: `BufferOverflowException` on undersized destination
  - `testCopyDataIntoFloat16`: verifies fp16→fp32 widening matches `getDataAsFloatArray`
  - `testCopyDataIntoFloat_unsupportedDtype`: `IllegalStateException` from base default for non-float dtypes

This PR was authored with Claude.

cc @kirklandsign @cbilgin

---------

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

Differential Revision: D105503056 Pull Request resolved: pytorch#19623

@kimishpatel

…#19249) ### Summary When a model prepared with torchao's `quantize_(...)` (e.g. blockwise int4) is lowered without an iOS18+ `minimum_deployment_target`, coremltools raises a `ValueError` from inside `_construct_constexpr_dequant_op`: ``` ValueError: The more fine-grained quantization (such as blockwise) is only supported since iOS18.Please set minimum_deployment_target to iOS18 for using it. ``` This message is technically correct but does not tell the ExecuTorch user *how* to set the deployment target — the answer is buried in `CoreMLBackend.generate_compile_specs(...)` plus `CoreMLPartitioner(compile_specs=...)`, which is not obvious unless you've already been through the docs. The two `dequantize_affine` / `dequantize_codebook` handlers in `backends/apple/coreml/compiler/torch_ops.py` are the only call sites where the failing coremltools utilities are invoked from ExecuTorch code, so I wrap them and re-raise the error with an additional hint that shows the exact partitioner call. After this change the user sees: ``` ValueError: The more fine-grained quantization (such as blockwise) is only supported since iOS18.Please set minimum_deployment_target to iOS18 for using it. ExecuTorch hint: pass `compile_specs=CoreMLBackend.generate_compile_specs(minimum_deployment_target=ct.target.iOS18)` (or higher) to `CoreMLPartitioner` when lowering models that use `quantize_(...)`. ``` Fixes pytorch#13122. ### Test plan Added `test_dequantize_affine_below_ios18_raises_with_hint` which lowers a PerGroup-int4 quantized linear with `minimum_deployment_target=ct.target.iOS17` and asserts the raised `ValueError` mentions both `iOS18` and the `CoreMLPartitioner` / `minimum_deployment_target` keywords. The existing iOS18 quantization tests still pass (`test_dequantize_affine_b4w_linear` exercised locally to confirm the wrapper does not affect the success path). 
``` $ python -m unittest -v executorch.backends.apple.coreml.test.test_torch_ops.TestTorchOps.test_dequantize_affine_below_ios18_raises_with_hint Ran 1 test in 0.653s OK $ python -m unittest -v executorch.backends.apple.coreml.test.test_torch_ops.TestTorchOps.test_dequantize_affine_b4w_linear Ran 1 test in 0.536s OK ``` Authored with Claude. cc @kimishpatel @YifanShenSZ @cymbalrush @metascroy
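The wrap-and-re-raise pattern described in this commit can be sketched in isolation. The code below is not the implementation in `torch_ops.py`: `failing_converter` is a hypothetical stand-in for the coremltools call, `with_deployment_hint` is an invented helper name, and filtering on the substring `iOS18` is an assumption about how the hint might be scoped.

```python
HINT = (
    "ExecuTorch hint: pass `compile_specs=CoreMLBackend.generate_compile_specs("
    "minimum_deployment_target=ct.target.iOS18)` (or higher) to "
    "`CoreMLPartitioner` when lowering models that use `quantize_(...)`."
)


def failing_converter():
    """Hypothetical stand-in for the call that rejects blockwise quantization."""
    raise ValueError("only supported since iOS18")


def with_deployment_hint(fn):
    """Call fn; on an iOS18-related ValueError, re-raise with the hint appended."""
    try:
        return fn()
    except ValueError as err:
        if "iOS18" not in str(err):
            raise  # unrelated errors pass through unchanged
        raise ValueError(f"{err}\n{HINT}") from err
```

Chaining with `from err` keeps the original exception reachable as `__cause__`, so the upstream traceback is not lost.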

@mergennachin

Reverts pytorch#19565 The mp4 file fails import (D105606113) due to large file size restrictions: oldSize: 3635918 newSize: 7982247 cc @mergennachin @AlannaBurke @shoumikhin @cbilgin

…Ten op (pytorch#19301)

### Summary

Added support for the core aten op `tan` using a decomposition pass and the identity:

```
tan(x) = sin(x) / cos(x)
```

### Test plan

```
python backends/qualcomm/tests/test_qnn_delegate.py TestQNNQuantizedOperator.test_qnn_backend_tan --model SM8750 --host aisw-vm15-labsd --device 545ee4aa --build_folder build-android
python backends/qualcomm/tests/test_qnn_delegate.py TestQNNFloatingPointOperator.test_qnn_backend_tan --model SM8750 --host aisw-vm15-labsd --device 545ee4aa --build_folder build-android
```
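The identity behind the decomposition is easy to check numerically. This is a plain-Python sketch using the `math` module, not the QNN decomposition pass itself; `tan_decomposed` is an illustrative name.

```python
import math


def tan_decomposed(x: float) -> float:
    """Compute tan(x) as the quotient sin(x) / cos(x)."""
    return math.sin(x) / math.cos(x)


# Compare against the library tan at a few points away from the poles.
for x in (-1.0, 0.0, 0.5, 1.2):
    assert math.isclose(tan_decomposed(x), math.tan(x), rel_tol=1e-12, abs_tol=1e-12)
```

Near odd multiples of π/2 the denominator approaches zero, so a quotient-based form can be expected to be more sensitive there.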

…pytorch#19558) Summary: The pass checked for a batch norm following the conv to avoid breaking fusion with a squeeze. However, it did not support Conv -> Batch Norm -> ReLU or Conv -> ReLU. This commit adds that support, along with other supported activations. Reviewed By: rascani Differential Revision: D105017469

Differential Revision: D105100368 Pull Request resolved: pytorch#19680

@digantdesai

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani

Differential Revision: D105753995 Pull Request resolved: pytorch#19682

### Summary Some older Python versions seem to interpret `qnn_config.htp_performance_mode` as a list, while newer versions treat it as an enum. To stay compatible with older Python versions, remove the comma.
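The commit message does not show the offending line, so the following only illustrates the general Python behaviour that a trailing comma after an assigned value builds a one-element tuple. `PerfMode` is a hypothetical stand-in for the QNN performance-mode enum.

```python
from enum import Enum


class PerfMode(Enum):
    """Hypothetical stand-in for the QNN performance-mode enum."""

    BURST = 1


with_comma = PerfMode.BURST,  # trailing comma: this is a one-element tuple
without_comma = PerfMode.BURST  # no comma: this is the enum member itself
```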

…c core ATen op (pytorch#19283)

### Summary

Added support for the core ATen op `scatter.src` using an op builder with the [QNN implementation](https://docs.qualcomm.com/doc/80-63442-10/topic/HtpOpDefSupplement.html#scatterelements) for `ScatterElements`. Note `scatter.src` uses `ScatterElements` directly with the argument `reduction=NONE`.

### Test plan

```
python backends/qualcomm/tests/test_qnn_delegate.py -k TestQNNQuantizedOperator.test_qnn_backend_scatter_src --model SM8750 --host aisw-vm15-labsd --device 545ee4aa --build_folder build-android
python backends/qualcomm/tests/test_qnn_delegate.py -k TestQNNFloatingPointOperator.test_qnn_backend_scatter_src --model SM8750 --host aisw-vm15-labsd --device 545ee4aa --build_folder build-android
```

…rch#19539) Differential Revision: D104775245 Pull Request resolved: pytorch#19539

pytorch#19224) Summary: Heap profiling at runtime with HTP backend on Android platforms. DSP heap profiling is available for QnnContext_createFromBinary use-cases. It captures total DSP heap usage at two checkpoints: - Before the first context is created (before_context_created) - After the last context is freed (after_context_freed) The difference between the two values represents heap consumed during context execution. The value after freeing is typically equal to or greater than before creation. Test plan: python backends/qualcomm/tests/test_qnn_delegate.py TestQNNQuantizedUtils.test_qnn_backend_runtime_option_heap_profile -b build-android -H ${HOST} -s ${SN} -m ${SOC_MODEL}

Port doc changes from 12bb0e7 from pure md-files to md.in stubs. Signed-off-by: Adrian Lundell <[email protected]>

@digantdesai

Summary - Adds a VGF Swin2SR super-resolution example for Arm. - Adds FP and INT8 export/eval flows with deterministic demo assets. - Adds Arm OOTB smoke coverage and model tests. Validation - bash -n backends/arm/test/test_arm_ootb.sh - PYTHONPATH=. /Users/usazah01/src/executorch/env/bin/python -m pytest -q -p no:rerunfailures backends/arm/test/models/test_swin2sr_arm.py -s - PATH=/Users/usazah01/src/executorch/env/bin:$PATH backends/arm/scripts/pre-push cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani --------- Signed-off-by: Usamah Zaheer <[email protected]>

The flag VK_PIPELINE_CREATE_2_FAIL_ON_PIPELINE_COMPILE_REQUIRED_BIT means "do not perform expensive synchronous compilation during this call". If Vulkan detects that compilation is needed, the call fails instead. We can skip this precaution.

) Based on the decision that aot_arm_compiler should no longer be used for production use, this patch updates the documentation to direct users away from aot_arm_compiler for production use, and instead points them to the Python API. Signed-off-by: Martin Lindström <[email protected]>

…#19679) (pytorch#19679) Summary: Adds `LlmModuleConversationHistoryTest`, an Android instrumentation test that exercises the multi-turn / KV-cache plumbing on `LlmModule`. The OKR theme this enables is "Feature testing → conversation history" (3.2), which depends on `prefillPrompt` + `resetContext` semantics being correct. The test runs on the existing TinyStories-110M fixture pulled by `android_test_setup.sh` from the public `ossci-android` S3 bucket, so it works on **both** internal fbsource Android CI and OSS GitHub Actions Android CI without any new fixture infrastructure. Because TinyStories is too small and not instruction-tuned, content-level assertions (e.g. "did the model recall the user's name") are not reliable. Instead, the test asserts four behavioral invariants of the conversation-history surface that any production multi-turn flow depends on: 1. `testResetContextProducesDeterministicOutput` — at temperature=0 (greedy decode), running the same prompt twice with `resetContext()` between yields identical token streams. This is the foundational invariant: clearing the KV cache truly returns the model to a clean state. 2. `testKvCacheStatePersistsAcrossGenerateCalls` — without `resetContext()` between calls, two `generate()` calls with the same prompt diverge, proving the KV cache is preserved across turns. If this ever fails, multi-turn conversation is silently broken. 3. `testPrefillPromptInfluencesNextGeneration` — `prefillPrompt(history)` followed by `generate(prompt)` differs from a clean-context `generate(prompt)`, proving the prefilled context actually reaches the decoder. 4. `testResetContextClearsPrefilledHistory` — `prefillPrompt + resetContext + generate` matches a clean-slate `generate`, proving reset fully clears prefilled state. Reviewed By: GregoryComer, kirklandsign Differential Revision: D105741356 Pulled By: psiddh --------- Co-authored-by: Copilot Autofix powered by AI <[email protected]> Co-authored-by: Claude <[email protected]>
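The four invariants can be illustrated with a toy model whose output depends on everything seen so far. This is a sketch only: `ToyLlm` is a hypothetical stand-in and not the Android `LlmModule`, its snake_case methods merely mirror `prefillPrompt`, `resetContext`, and `generate`, and a hash of the context plays the role of greedy decoding.

```python
import hashlib


class ToyLlm:
    """Hypothetical stand-in for a runner with a KV cache."""

    def __init__(self) -> None:
        self._context: list[str] = []

    def prefill_prompt(self, text: str) -> None:
        self._context.append(text)

    def reset_context(self) -> None:
        self._context.clear()

    def generate(self, prompt: str) -> str:
        self._context.append(prompt)
        # Deterministic output derived from the full context so far.
        joined = "\x00".join(self._context).encode()
        out = hashlib.sha256(joined).hexdigest()[:8]
        self._context.append(out)
        return out


def check_invariants(model: ToyLlm, prompt: str, history: str) -> None:
    """Assert the four behaviours listed in the commit message."""
    model.reset_context()
    baseline = model.generate(prompt)

    # 1. Reset, then the same prompt again: identical output.
    model.reset_context()
    assert model.generate(prompt) == baseline

    # 2. No reset between calls: the second output diverges.
    model.reset_context()
    first = model.generate(prompt)
    assert model.generate(prompt) != first

    # 3. Prefilled history changes the next generation.
    model.reset_context()
    model.prefill_prompt(history)
    assert model.generate(prompt) != baseline

    # 4. Prefill followed by reset matches a clean-slate generate.
    model.prefill_prompt(history)
    model.reset_context()
    assert model.generate(prompt) == baseline
```

Because the checks compare outputs with each other rather than with expected text, they do not depend on what the model actually says, which is the property the commit message relies on for a small, non-instruction-tuned fixture.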

@digantdesai

Summary cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani

Change-Id: Ibb7ef4167ab96426133fce64e34366c365cd12ad Signed-off-by: Yufeng Shi <[email protected]>

@digantdesai

)

## Summary

This PR modernizes the ExecuTorch Arm bare-metal runner workflow so users can move from a PyTorch model to a runnable Arm executor runner with fewer manual build-system steps, stronger validation, and faster repeated local iteration.

The main change is a new standalone Arm executor runner CMake entry point. `run.sh` now acts as the orchestration layer for common Ethos-U bare-metal flows: it can derive build directories, configure the standalone runner with Arm bare-metal defaults, stage generated PTE/BPTE files, validate reused CMake caches, build the needed runner target, locate the runner binary, and invoke FVP.

## Problem

Before this change, the Arm runner workflow depended on manually stitching together ExecuTorch build/install artifacts, runner CMake configuration, PTE input wiring, toolchain and target settings, optional debug features, and repeated install/export steps. That made the workflow harder to explain, fragile in CI, slower to iterate on locally, and easy to break when reusing a build directory configured for a different target or feature set.

## CMake Architecture Change

```mermaid
flowchart LR
  subgraph Before
    A1["Build ExecuTorch<br/>arm-baremetal preset"] --> A2["Install/export artifacts"]
    A2 --> A3["Configure runner CMake<br/>examples/arm/executor_runner"]
    A4["PTE / BPTE"] --> A3
    A3 --> A5["arm_executor_runner ELF"]
  end
  subgraph After
    B1["run.sh"] --> B2["Validate / choose build dir"]
    B2 --> B3["Standalone runner CMake<br/>examples/arm/executor_runner/standalone"]
    B4["PTE / BPTE"] --> B1
    B3 --> B5["ExecuTorch top-level CMake<br/>as subdirectory"]
    B3 --> B6["Arm CMake helpers + presets"]
    B5 --> B7["arm_executor_runner ELF"]
    B6 --> B7
  end
```

## What Changed

- Added `examples/arm/executor_runner/standalone` as the supported standalone CMake entry point for `arm_executor_runner`.
- Added shared Arm CMake helpers for Ethos-U SDK setup, required target validation, and predictable runner output paths.
- Updated `build_executor_runner.sh` and `run.sh` to use the standalone runner workflow.
- Added deterministic default build directories under `--et_build_root`.
- Added cache validation for reused build directories, including target, toolchain, selected ops, PTE placement, BundleIO, ETDump, and devtools settings.
- Added PTE/BPTE staging so repeated runs can reuse the same configured CMake build directory.
- Integrated selective-op handling into the standalone runner path.
- Cleaned up bare-metal install/export behavior so standalone builds can consume reusable build-tree artifacts.
- Updated Arm README and notebooks for the new workflow.

## Iteration Speed

Repeated local PTE-to-runner iteration is now **8x faster** because `run.sh` can reuse the configured standalone CMake build directory, stage updated PTE/BPTE payloads into the existing cache wiring, and rebuild only the needed runner target instead of repeating the full manual configure/install/export flow.

This is a developer workflow speedup, not a model runtime speedup.

## Result

For common Ethos-U bare-metal usage, the user-facing path is now script-owned and repeatable:

1. Run Arm setup.
2. Run `examples/arm/run.sh` with a model and target.
3. Reuse or inspect the generated build directory under `--et_build_root`.
4. Iterate by regenerating the PTE/BPTE and rebuilding through the same validated CMake cache.

VGF host flows remain explicit: `run.sh` requires an existing `--build-dir` for VGF-style host builds rather than auto-configuring them as bare-metal runner builds.

## Testing

Validated through the Arm backend runner, bare-metal, VGF, and CI workflows covered by this stack.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani

---------

Signed-off-by: Usamah Zaheer <[email protected]>
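The cache validation for reused build directories amounts to comparing the settings a build directory was configured with against the settings being requested now. The sketch below shows that comparison in Python under stated assumptions: the real logic lives in `run.sh` and the CMake helpers, and the function name and the setting keys used in the usage note are hypothetical.

```python
from typing import Optional


def cache_mismatches(
    cached: dict[str, str], requested: dict[str, str]
) -> dict[str, tuple[Optional[str], str]]:
    """Return {setting: (cached_value, requested_value)} for every disagreement."""
    return {
        key: (cached.get(key), value)
        for key, value in requested.items()
        if cached.get(key) != value
    }
```

For example, calling it with a cached `{"TARGET": "a", "ETDUMP": "OFF"}` and a requested `{"TARGET": "b", "ETDUMP": "OFF"}` reports only `TARGET`; an empty result means the directory can be reused as configured.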

…ng matrix (pytorch#19617) ### Summary This is the continuation of pytorch#19399 and pytorch#19521, to deliver on Phase 2 of pytorch#18991 ### Test plan This code is exclusively test code. Everything works out of the box, and CI will validate.

Moves CI testing from YAML to .ci/scripts/test_zephyr.sh and creates a table of README and target combinations to run, instead of having the mv2 tests hard-coded. This makes it easier to add more samples and tests in the future, as the test flow is more generic. Signed-off-by: Zingo Andersen <[email protected]>

@robert-kalmar

…torch#19688) ### Summary Neutron SW 3.1.1 removes the restriction for maximum supported kernel size of the MaxPool2D operator. This PR reflects this change in the Neutron backend. ### Test plan Unit test provided. cc @robert-kalmar @JakeStevens @digantdesai @rascani

@robert-kalmar

…torch#19687) ### Summary This PR adds support for the aten.sigmoid operator with the new Neutron MLIR flow. ### Test plan Unit tests provided. cc @robert-kalmar @JakeStevens @digantdesai @rascani

@robert-kalmar

…ytorch#19667) ### Summary This PR adds support for the `aten.leaky_relu` operator with the new Neutron MLIR flow. ### Test plan Unit tests provided. cc @robert-kalmar @JakeStevens @digantdesai @rascani

Differential Revision: D104433196 Pull Request resolved: pytorch#19641

Because RISC-V allows different hardware implementations to have different vector lengths (similar to Arm SVE), we want to make sure we test on several configurations. Luckily, QEMU lets us set a `vlen=<128,256,512,...>` parameter on `QEMU_CPU` to emulate different vector lengths.
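A test driver can build one environment per vector length and run the same binary under each. This is a minimal sketch, assuming the tests run through qemu-user so that the `QEMU_CPU` environment variable is honoured; the base CPU string `rv64,v=true` and the function name are assumptions, since the text above only names the `vlen` parameter.

```python
import os


def qemu_env_for_vlen(vlen: int, base_cpu: str = "rv64,v=true") -> dict[str, str]:
    """Build an environment that asks qemu-user to emulate the given vector length."""
    # Only the sizes named in the text (128, 256, 512, ...) are accepted here.
    if vlen < 128 or vlen & (vlen - 1):
        raise ValueError(f"vlen must be a power of two >= 128, got {vlen}")
    env = dict(os.environ)
    env["QEMU_CPU"] = f"{base_cpu},vlen={vlen}"  # base_cpu is an assumed default
    return env
```

Looping over `(128, 256, 512)` and passing each result as `env=` to `subprocess.run` would exercise the same test binary at each vector length.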

hboyraz and others added 8 commits May 15, 2026 14:22

Revert "Add grammar fields to GenerationConfig for constrained decodi…

e6bf149


Thread method-scoped kernel registry through Program and Method (pyto…

a8cfe2b


Add CoreML-stable RMSNorm for llama eager paths (pytorch#19523) (pyto…

d1db6b7


Fix GenerationConfig initialization in qnn_multimodal_runner

42d87c4


Fix libc++.so.1 missing for qnn-context-binary-utility (pytorch#19622)

7dbd972


Prevent _safe_softmax decomposition in trace and rewire replaceSafeSof…

824cbff


luhenry commented May 16, 2026


pytorchbot and others added 21 commits May 18, 2026 00:40

NXP backend: Add QAT tests for AvgPool, MaxPool and Mul tensor ops (p…

df3fa0d

…ytorch#19572) ### Summary Add QAT tests for AvgPool, MaxPool and Mul tensor ops for Neutron backend using the new Neutron MLIR flow. ### Test plan Unit tests provided. cc @robert-kalmar

NXP backend: Sync config importer to nxp internal CI changes (pytorch…

c531386

…#19552) ### Test plan Existing unit test

Arm backend: Stabilize MobileNetV3 fp16 TOSA test (pytorch#19590)

3c68b67

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani Signed-off-by: Baris Demir <[email protected]> Co-authored-by: Baris Demir <[email protected]>

Arm backend: Stabilize VGF bilinear fp16 test (pytorch#19613)

9beca1f

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani Signed-off-by: Baris Demir <[email protected]> Co-authored-by: Baris Demir <[email protected]>

Revert "Wire target_config Buck deps on cmsis_nn_py (pytorch#19604)" (p…

760aa39

…ytorch#19644) Reverts pytorch#19604 This breaks internal CI

Fix out_of_bounds_read in getConstantDataPtr (XNNCompiler.cpp) (T2673…

20415bf

…71218) (pytorch#19593) Reviewed By: psiddh Differential Revision: D104380965 Co-authored-by: DevmateRemedimateMacaClaude Bot <[email protected]>

Guard weight_dequant.args[1] access in _quantize_fused_conv_bias pass

d62addb

Differential Revision: D105503056 Pull Request resolved: pytorch#19623

Revert "Update iOS SwiftPM docs for ExecuTorch 1.0.0" (pytorch#19652)

7c495fa

Reverts pytorch#19565 The mp4 file fails import (D105606113) due to large file size restrictions: oldSize: 3635918 newSize: 7982247 cc @mergennachin @AlannaBurke @shoumikhin @cbilgin

qti-horodnic and others added 23 commits May 19, 2026 17:48

Drop _loadedBackendOptions ivar from Apple bindings (pytorch#19680)

10c8958

Differential Revision: D105100368 Pull Request resolved: pytorch#19680

Bump PyTorch pins to 2.12 (pytorch#19643)

d66a37c

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani

Fix race condition in XNNPACK weights cache during concurrent init()

a4bd823

Differential Revision: D105753995 Pull Request resolved: pytorch#19682

Handle rank-changing views in FuseCascadedTransposeOrPermuteOps (pyto…

6f052fe

…rch#19539) Differential Revision: D104775245 Pull Request resolved: pytorch#19539

Arm backend: Fix stale docgen generation pt.2 (pytorch#19685)

82cf123

Port doc changes from 12bb0e7 from pure md-files to md.in stubs. Signed-off-by: Adrian Lundell <[email protected]>

Revert PyTorch 2.12 pin bump (pytorch#19698)

3b5d18d

Summary cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani

Add FP8 placeholder support to ExecuTorch serialization (pytorch#19043)

8debe93

Change-Id: Ibb7ef4167ab96426133fce64e34366c365cd12ad Signed-off-by: Yufeng Shi <[email protected]>

NXP backend: Add support for sigmoid with the new Neutron flow. (py…

6c74cdc

…torch#19687) ### Summary This PR adds support for the aten.sigmoid operator with the new Neutron MLIR flow. ### Test plan Unit tests provided. cc @robert-kalmar @JakeStevens @digantdesai @rascani

NXP backend: Add support for leaky_relu with the new Neutron flow. (p…

a76d9cd

…ytorch#19667) ### Summary This PR adds support for the `aten.leaky_relu` operator with the new Neutron MLIR flow. ### Test plan Unit tests provided. cc @robert-kalmar @JakeStevens @digantdesai @rascani

Thread kernel_registry through Module::load_method (pytorch#19641)

6ba868e

Differential Revision: D104433196 Pull Request resolved: pytorch#19641

luhenry force-pushed the riscv-testing-rvv branch from 950ac09 to f836fdb May 20, 2026 18:50

github-actions bot added the "module: arm" and "ciflow/trunk" labels May 20, 2026

luhenry added 3 commits May 20, 2026 22:09

Add XNNPACK coverage instrumentation for riscv64

7eba60a

Align RISC-V workflow display name to others

2c8507d

luhenry force-pushed the riscv-testing-rvv branch from f836fdb to 2c8507d May 20, 2026 20:11

riscv testing rvv #3
luhenry wants to merge 87 commits into riscv-testing-mobilenetv2 from riscv-testing-rvv

20 participants
