Releases: meta-pytorch/monarch
0.5.0
New features & API changes
Python: actor identifiers renamed to ActorAddr. ActorId is now ActorAddr across the Python bindings (#3618, #3622). The old pid: int constructor argument is gone; ActorAddr carries a string uid (with pid retained as a compatibility alias) and new label / proc_label properties. ActorAddr.from_string now expects the actor.proc@location wire format. Mailbox.post, PythonActorHandle.bind, ActorSupervisionEvent.actor_id, UndeliverableMessageEnvelope.sender, Instance.actor_id, and the ClientActor / Error / Failure stubs are all updated. ActorMeshProtocol no longer exposes region or get(rank).
Kubernetes operator integration. KubernetesJob.add_mesh now takes pod_template: V1PodTemplateSpec instead of pod_spec: V1PodSpec, and accepts a new annotations= kwarg (#3872, #3949). With meta-pytorch/monarch-kubernetes#49, we need v0.2.0+ of the monarch operator for KubernetesJob with monarch v0.5.0+.
Per-rank bootstrap. HostMesh.spawn_procs(bootstrap_command=...) accepts either a uniform BootstrapCommand or a Callable[[Point], BootstrapCommand] for per-rank customization (e.g. per-GPU CUDA_VISIBLE_DEVICES) (#3463). New helpers default_bootstrap_cmd() and BootstrapCommand.with_env(env).
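As a rough illustration of the per-rank form, the sketch below uses stand-in types rather than Monarch's own. `FakeBootstrapCommand`, its `program` field, and the plain-dict `point` are hypothetical; only the overall shape (a callable that maps a rank's coordinates to a command carrying its own environment) is taken from the note above.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Dict, Mapping


@dataclass(frozen=True)
class FakeBootstrapCommand:
    """Hypothetical stand-in for BootstrapCommand; not Monarch's class."""

    program: str
    env: Dict[str, str] = field(default_factory=dict)

    def with_env(self, env: Mapping[str, str]) -> "FakeBootstrapCommand":
        # Return a copy with the extra environment variables merged in.
        return replace(self, env={**self.env, **env})


def per_gpu_bootstrap(
    base: FakeBootstrapCommand,
) -> Callable[[Mapping[str, int]], FakeBootstrapCommand]:
    """Build a per-rank factory that pins each rank to one GPU index."""

    def make(point: Mapping[str, int]) -> FakeBootstrapCommand:
        return base.with_env({"CUDA_VISIBLE_DEVICES": str(point["gpus"])})

    return make


base = FakeBootstrapCommand(program="worker")
make_cmd = per_gpu_bootstrap(base)
# One command per GPU rank on host 0; the base command is left untouched.
commands = [make_cmd({"hosts": 0, "gpus": g}) for g in range(4)]
```

Returning a copy from `with_env` keeps the uniform base command reusable across ranks, which is the property a per-rank callable relies on.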
SPMD entry point. New host_mesh_from_store(...) stands up a HostMesh from a torchrun/torchx-style entry point without going through the Job API (#3559).
Telemetry helpers. monarch.actor.span(name) and @traced decorator replace ad-hoc OTEL TRACER.start_as_current_span(...) blocks; spans auto-bind to the current actor (#3665, #3774). PySpan is now a context manager.
Tensor engine & multiprocessing. Tensor engine builds on CPU and macOS via a split tensor_engine_gpu Cargo feature; the env var MONARCH_RDMA_GPU_PLATFORM was renamed to MONARCH_GPU_PLATFORM (#3530). RDMA Python bindings now degrade gracefully when native libs are absent. Linux default multiprocessing start method flipped from spawn to forkserver (#3529). async def __supervise__ is now supported (#3526).
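User code that creates its own worker processes and depends on a particular start method can request it explicitly instead of relying on a default. The sketch below uses only the standard library; `pick_context` is a hypothetical helper, and it does not change or inspect Monarch's own setting.

```python
import multiprocessing as mp


def pick_context(preferred: str = "forkserver"):
    """Return a context for `preferred` if the platform offers it, else the default."""
    if preferred in mp.get_all_start_methods():
        return mp.get_context(preferred)
    return mp.get_context()


ctx = pick_context("forkserver")
chosen = ctx.get_start_method()
# Pools and processes created from `ctx` use `chosen`, regardless of the
# process-wide default.
```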
config.configure keys. Added rdma_disable_ibverbs, rdma_allow_tcp_fallback, rdma_max_chunk_size_mb. Removed remote_allocator_heartbeat_interval. New parametrize_config_pointwise test helper.
Removals & deprecations.
- The legacy allocator stack is gone: monarch._src.actor.allocator, LocalAllocator, ProcessAllocator, HostMesh.allocate_nonblocking / _allocate_nonblocking, and the process_allocator binary (#3567–#3586). Use HostMesh + attach_to_workers or a JobTrait class.
- monarch._src.actor.namespace and the namespace API removed (#3116).
- Future.get() called from inside an active asyncio or tokio thread now emits a DeprecationWarning and becomes a RuntimeError in v0.6 (#3827).
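The Future.get() deprecation targets blocking calls made while an event loop is running on the same thread. A minimal sketch of that kind of check, covering asyncio only, might look like the following; `blocking_get` is a hypothetical stand-in and not Monarch's implementation.

```python
import asyncio
import warnings


def in_running_event_loop() -> bool:
    """True when called from a thread that is currently running an asyncio loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


def blocking_get(value):
    """Hypothetical stand-in for a blocking Future.get()."""
    if in_running_event_loop():
        warnings.warn(
            "blocking get() called from a running event loop; await instead",
            DeprecationWarning,
            stacklevel=2,
        )
    return value


async def handler():
    # Record warnings so the caller can see what the blocking call emitted.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = blocking_get(42)
    return result, [w.category for w in caught]


sync_result = blocking_get(1)  # no loop running: no warning
async_result, categories = asyncio.run(handler())
```

In code that already runs inside a coroutine, awaiting the future rather than calling a blocking accessor avoids the warning altogether.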
Examples & docs. New Kubernetes GRPO tutorial (Qwen3.5-0.8B on GSM8K) (#3597), Oracle OKE example (#3671), GRPO via cooperative multitasking (#3525).
Rust internals (not Python-visible). Endpoint sends are now infallible and renamed send → post, with failures flowing through a new Undeliverable<M> enum (#3890–#3894, #3912). A new Gateway layer owns per-proc reachability and serving (#3818–#3823); Proc::local → Proc::isolated. Identity constructors collapsed into anonymous() / instance(label) / singleton(name) (#3935, #3940). hyperactor::reference deleted and hyperactor::host moved to hyperactor_mesh::host (#3641, #3724). New hyperactor_remote crate adds keepalive links, supervisors, and rendezvous tokens (#3762–#3768).
Bug Fixes
- Ctrl-C no longer hangs the runtime (#3801); flaky PyShared.__await__ borrow race (#3862); two RwLock/DashMap deadlocks in actor teardown (#3754); re-entrant TraceEventDispatcher SIGSEGV in real training runs (#3690); Mailbox::post_unchecked shard deadlock (#3684); host shutdown race (#3663).
- Bootstrap falls back when XDG_RUNTIME_DIR doesn't exist (#3418); long-path SUN_LEN unix-socket panic (#3697); HostMesh label sanitization (#3691); controller GetState no longer triggers an undeliverable bounce (#3450); RDMA find_cuda_segment boundary (#3769).
Performance & Reliability
- Native V1 casting and the destination-actor reorder buffer are now on by default (#3812), with a point-to-point optimization for small casts (#3646).
- RDMA completion polling is now adaptive: the default flipped from a fixed 1 ms sleep to yield-only, gated by MONARCH_RDMA_CQ_BUSY_POLL_WINDOW (#3771). resolve_ibv made synchronous, removing a per-read round-trip (#3773). TLS code-transfer replaced with RDMABuffer leader fan-out (#3390). Arc-refcounted PDs/MRs close a latent PD double-free (#3883); KeepaliveLocalMemory is the sole local-memory handle, with explicit unsafe accessors (#3922).
- Channel correctness: host flushes acks before exit (#3637); duplex sessions made structurally concurrent (#3675); experimental multi-stream sender exp_dial_unordered (#3557, #3558).
- ProcMeshController reaps procs orphaned by a dead client via MESH_ORPHAN_TIMEOUT (#3811); periodic RSS recording for managed processes (#3733).
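The adaptive polling idea can be pictured as a loop that spins briefly and then yields between checks instead of sleeping for a fixed interval. The sketch below is an illustration only: `poll_until` and its parameters are hypothetical, and how MONARCH_RDMA_CQ_BUSY_POLL_WINDOW is actually interpreted is not stated in these notes.

```python
import time
from typing import Callable


def poll_until(ready: Callable[[], bool], busy_window_s: float, timeout_s: float) -> bool:
    """Spin on `ready` for `busy_window_s`, then keep polling but yield between checks."""
    start = time.monotonic()
    while True:
        if ready():
            return True
        elapsed = time.monotonic() - start
        if elapsed >= timeout_s:
            return False
        if elapsed >= busy_window_s:
            # Past the busy window: give up the CPU briefly rather than
            # sleeping for a fixed 1 ms.
            time.sleep(0)


calls = {"n": 0}


def ready_after_three() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


completed = poll_until(ready_after_three, busy_window_s=0.001, timeout_s=1.0)
timed_out = poll_until(lambda: False, busy_window_s=0.0, timeout_s=0.01)
```

A short busy window favors latency for completions that arrive quickly, while yielding afterwards keeps an idle poller from monopolizing a core.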
Build & Release
- macOS wheels ship with the stable PyPI release (#3854) and the nightly matrix (#3451); the initial publish pipeline landed in #3412, with follow-up fixes for missing fields (#3344), no-torch (#3371), the crash-recovery plugin (#3831), and general build breakage (#3786).
- ROCm GPU CI via a matrix-based workflow (#3190); ROCm excluded from PR runs (#3861).
- PyTorch bumped 2.11.0 → 2.12.0 for stable; nightly tracks 2.13.0 (#3863).
- publish_release Docker base aligned to CUDA 12.6 for torch 2.12.0 (#3921); nightly Docker images repaired after upstream cuda12.8 removal (#3880).
- PyPI wheels now carry classifiers and project URLs (#3379); docs deploy targets stable (#3415). New GHA workflow marks stale PRs and deletes branches of closed/non-merged PRs (#3778); test-result XML uploaded as artifacts (#3670); global 5-minute cargo-nextest timeout (#3855).
0.4.1
Full Changelog: v0.4.0...v0.4.1
v0.4.1 is a small patch release that includes some powerful new features and important bug fixes.
New Features & API changes
v0.4.1 adds a substantial new CLI workflow around long-lived jobs:
monarch apply and monarch exec can now be used to launch subclasses
of JobTrait.
This release also introduces JobTrait.remote_mount: mounting a local filesystem to
sync with workers in the monarch job. This makes a FUSE mount on each worker and syncs
changes to the filesystem to all workers. It can use RDMA or TCP depending on availability to
send the data.
JobTrait.gather_mount works in reverse: a read-only FUSE mount that
pulls per-worker directories back into a unified local view. This can be used to gather
logs or other outputs from all workers to be examined locally.
The Monarch Dashboard is a local web UI for inspecting a running Monarch job
in real time. It is included in torchmonarch and starts alongside telemetry.
For jobs, enable both admin and telemetry:
    job.enable_admin()
    job.enable_telemetry(TelemetryConfig(include_dashboard=True, dashboard_port=8265))

The dashboard has three views:
Summary for overall health, actor counts, failures, and message traffic;
Hierarchy for drilling from host mesh down to individual actor details;
DAG for an interactive topology view of hosts, procs, and actors.
It's still early, so the UI and APIs may evolve, but it's already useful for
understanding topology, debugging failures, and inspecting message flow.
On the mesh-admin side, the HTTP surface expands with POST /v1/query and
POST /v1/pyspy_dump, while the internals were refactored to use typed IDs,
references, and timestamps behind a curl-friendly JSON/DTO boundary. That
should make the admin API easier to evolve without breaking existing
consumers.
Bug fixes
- The RDMA function is_rdma_available, which was deleted in v0.4.0, is back with a deprecation warning. It is now just a wrapper around is_ibverbs_available. get_rdma_backend is the recommended way to check which implementation is in use.
- RDMA bug fix for mlx5dv: #3293
Runtime correctness also improved in a few important places in error paths:
stop_actor_by_name now waits for actual actor termination, mesh scans no
longer crash or spin forever when a ProcMesh spawn fails, and mesh-controller
OncePort replies now return accumulated responses correctly.
Performance
A zero-copy regression in the pickle send path was fixed: #3234
v0.4.0
Monarch v0.4 Release Notes
New Features
Networking & RDMA
- EFA support for RDMA: RDMA with AWS's libefa (Elastic Fabric Adapter).
- TCP fallback for RDMA: when RDMA is unavailable, the data plane automatically falls back to TCP, broadening hardware compatibility (#2999).
- ROCm / HIP support for the RDMA stack, enabling AMD GPU deployments (#2891).
- The channel transport layer was rewritten around a typed session lifecycle and unified NetLink dispatch, improving reconnect reliability and adding duplex-mode channels.
Distributed Telemetry & Dashboard
Monarch now ships a built-in observability dashboard. The new distributed telemetry system collects actor, mesh, host, proc, and message-level data in real time and exposes it through both a web UI and a schema-first REST API (OpenAPI 3.1). An OTLP-compatible metrics, logs, and trace exporter makes it straightforward to integrate with Grafana, Jaeger, or any OpenTelemetry collector in Kubernetes deployments.
Admin TUI & Live Diagnostics
A new terminal UI (admin_tui) provides live introspection of running meshes, procs, and actors via an HTTP admin server. It includes a built-in py-spy integration that can capture Python stack traces from any running actor directly in the TUI, making it much easier to diagnose stalls and performance issues in production.
Kubernetes
KubernetesJob gained Python-native provisioning, removing the dependency on an external Go controller for mesh creation. A new optional labels parameter on add_mesh() enables integration with Kueue and other label-based Kubernetes controllers (#2693).
Python API Changes
- allocate_nonblocking, from_alloc, and host_mesh are renamed to private methods; use attach_to_workers and the KubernetesJob / ProcessJob APIs instead (#2971).
- NUMA bindings are now exposed for proc mesh spawning (#2996).
Bug Fixes & Performance Improvements
Supervision & Fault Tolerance
- ControllerController supervision: a single child torchstore controller failure no longer poisons the parent and all siblings. Each child is now isolated, fixing a critical bug where one failed session could block all subsequent get_or_spawn_controller() calls (#2835).
- Orphaned mesh cleanup: child actors now detect when their parent is unreachable and self-terminate, preventing leaked GPU resources (#2198).
- Clean Python shutdown: proc exit now calls Py_FinalizeEx, giving Python objects a chance to run destructors and eliminating the pybind11::dec_ref GIL crashes seen during shutdown (#2524).
- Reliable proc_mesh.stop(): stop now flushes pending messages and acks before exiting, fixing races that caused spurious errors in CI and user code (#2658).
Performance
- Lazy ValueMesh unpickling: values returned from accumulate are now deserialized on access rather than eagerly, reducing latency for large results (#2983).
- RLE-compressed OnceBuffer accumulation: repeated identical values are run-length encoded during accumulation, cutting memory and network cost for common broadcast patterns (#2989).
- Telemetry overhead was significantly reduced by demoting internal spans and gating channel-level tracing behind DEBUG.
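Run-length encoding, the technique named in the OnceBuffer item above, collapses consecutive equal values into (value, count) pairs. The sketch below shows the general idea with the standard library; `rle_encode` and `rle_decode` are illustrative helpers and say nothing about Monarch's internal representation.

```python
from itertools import groupby
from typing import Any, Iterable, List, Tuple


def rle_encode(values: Iterable[Any]) -> List[Tuple[Any, int]]:
    """Collapse each run of consecutive equal values into (value, run_length)."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(runs: Iterable[Tuple[Any, int]]) -> List[Any]:
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


# Ten replies, nine of them identical, shrink to three runs.
replies = ["ok"] * 6 + ["retry"] + ["ok"] * 3
encoded = rle_encode(replies)
```

The saving depends on how often neighbouring ranks return the same value, which is why the note singles out broadcast-style patterns.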
Build & Packaging
- Official aarch64 (ARM64) release binaries are now published alongside x86_64 on PyPI
0.3.0
Monarch 0.3.0 Release Notes
New Features
Kubernetes Job Support
Monarch now supports running distributed training workloads on Kubernetes clusters. The new KubernetesJob API connects to pre-provisioned GPU pods managed by the https://github.com/meta-pytorch/monarch-kubernetes/ repository, enabling seamless multi-node DDP training
on Kubernetes.
Key Capabilities:
- Connect to Kubernetes pods using KubernetesJob
- Provision GPU workers via the MonarchMesh Custom Resource Definition
- Run multi-node DDP training using SPMDActor
Example:
    from monarch.job.kubernetes import KubernetesJob
    from monarch.spmd import SPMDActor

    k8s_job = KubernetesJob(namespace="monarch-tests")
    k8s_job.add_mesh("ddpmesh", num_replicas=2)
    job_state = k8s_job.state()
    proc_mesh = job_state.ddpmesh.spawn_procs({"gpus": 4})
    spmd_actors = proc_mesh.spawn("_SPMDActor", SPMDActor)
See the full tutorial: https://meta-pytorch.org/monarch/generated/examples/ddp/kubernetes_ddp.html
We also publish docker packages, see https://github.com/meta-pytorch/monarch/pkgs/container/monarch
monarch.spmd and monarch.job.spmd SPMDJob
The new monarch.job.spmd module provides serve() and run_spmd() for an interactive SPMD development workflow:
- Reserve once, iterate many times: Allocate hosts once, then call run_spmd() repeatedly without reprovisioning
- Remote debugging: Add breakpoint() in your training script and attach with monarch debug
- Job caching: Reload cached job state and re-run on the same reserved hosts
Example:
    from monarch.job import job_load
    from monarch.job.spmd import serve

    job = serve(
        ["torchrun", "--nproc-per-node=4", "--standalone", "train.py"],
        scheduler="local_cwd",
    )
    job.run_spmd()

    # Later, reload and re-run without reprovisioning:
    job = job_load(".monarch/job_state.pkl")
    job.run_spmd()
This supports single-node training with command lists and multi-node training with TorchX AppDef on schedulers like Slurm.
See the example: https://meta-pytorch.org/monarch/generated/examples/ddp/spmd_job.html
Experimental Queue Dispatch Mode (Performance)
A new actor dispatch mode where Rust enqueues messages to a channel for Python to process, rather than Rust acquiring the GIL directly. This can improve throughput for message-heavy workloads.
    from monarch.config import configure

    configure(actor_queue_dispatch=True)
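The general shape of queue dispatch is that producers only enqueue, and a single consumer drains the queue and runs handlers. The sketch below models that with standard-library threads; it is an analogy under that assumption, not Monarch's Rust-to-Python mechanism, and `run_dispatcher` is a hypothetical name.

```python
import queue
import threading

_STOP = object()  # sentinel telling the consumer to exit


def run_dispatcher(inbox: "queue.Queue", handled: list) -> None:
    """Consumer loop: the only thread that runs handler code."""
    while True:
        message = inbox.get()
        if message is _STOP:
            break
        handled.append(message.upper())


inbox: "queue.Queue" = queue.Queue()
handled: list = []
worker = threading.Thread(target=run_dispatcher, args=(inbox, handled))
worker.start()

# The producer side only enqueues; it never runs handler code itself.
for message in ["a", "b", "c"]:
    inbox.put(message)
inbox.put(_STOP)
worker.join()
```

Because one consumer processes messages in arrival order, the producer's per-message cost is just the enqueue.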
Real this_proc() for Local Spawning
The this_proc() function returns a handle to the current singleton process, enabling actors to spawn other actors locally. Remote actors can use this_proc() to spawn actors on their own host, enabling patterns like handing out references to a local proc and having remote actors spawn resources on it.
    from monarch.actor import Actor, endpoint, this_proc


    class HelperActor(Actor):
        """Placeholder helper so the example is self-contained."""


    class ManagerActor(Actor):
        @endpoint
        def spawn_helper(self) -> HelperActor:
            # Spawns HelperActor in the same process as ManagerActor
            return this_proc().spawn("helper", HelperActor)
Zero-Copy Messaging Path from Python
A new Buffer class enables zero-copy message serialization from Python. Large writes (≥256 bytes) are stored as references to Python bytes objects rather than being copied, integrating with multipart serialization for efficient vectored I/O.

    from monarch._rust_bindings.monarch_hyperactor.buffers import Buffer
    from monarch.config import configure

    buffer = Buffer()
    buffer.write(b"small")  # copied into pending buffer
    buffer.write(b"x" * 1000)  # stored as zero-copy reference

    # Configure the threshold via:
    configure(small_write_threshold=256)  # default
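The copy-or-reference rule described above can be modelled in pure Python. The class below is a hypothetical stand-in (`SegmentBuffer` is not Monarch's Buffer, and its internals are assumed); it only illustrates how small writes coalesce into one pending segment while large writes are kept as references to the caller's bytes object.

```python
from typing import List, Union


class SegmentBuffer:
    """Hypothetical stand-in, not monarch's Buffer."""

    def __init__(self, small_write_threshold: int = 256) -> None:
        self.threshold = small_write_threshold
        self.segments: List[Union[bytes, bytearray]] = []

    def write(self, data: bytes) -> None:
        if len(data) < self.threshold:
            # Small write: copy into a pending bytearray at the tail.
            if not self.segments or not isinstance(self.segments[-1], bytearray):
                self.segments.append(bytearray())
            self.segments[-1] += data
        else:
            # Large write: keep a reference to the caller's bytes object.
            self.segments.append(data)

    def total_len(self) -> int:
        return sum(len(segment) for segment in self.segments)


big = b"x" * 1000
buf = SegmentBuffer()
buf.write(b"small")
buf.write(b"er")
buf.write(big)
```

A list of segments like this maps naturally onto vectored I/O, where each segment is handed to the transport without first being joined into one contiguous buffer.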
Principles of Ownership in Supervision
This release improves the supervision model for error handling in meshes, built on four core principles:
- Owned meshes: Creating new meshes always results in an owned mesh
- Single ownership: All meshes are owned by at most one actor (no transfer or suspension)
- Lifecycle binding: A mesh cannot outlive its owner; when the owner dies, so does the mesh
- Graceful cleanup: Stopped meshes drain pending messages before cleanup; owned meshes clean up before their owner
Actors can now implement __supervise__ to handle failures from owned meshes.
Example:
    import logging

    from monarch.actor import Actor, MeshFailure


    class ManagerActor(Actor):
        def __supervise__(self, failure: MeshFailure) -> bool:
            logging.error(f"failure encountered: {failure}")
            # Return truthy to handle, falsey to propagate
            return False
See the documentation: https://meta-pytorch.org/monarch/actors.html#error-handling-in-meshes
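The truthy/falsey contract can be read as a walk up the ownership chain: each owner is offered the failure in turn until one reports it handled. The toy model below uses hypothetical names (`Node`, `report`) and plain strings for failures; what Monarch does when no owner handles a failure is outside its scope.

```python
from typing import Callable, List, Optional


class Node:
    """Hypothetical stand-in for an actor that may own other actors."""

    def __init__(
        self,
        name: str,
        owner: Optional["Node"] = None,
        supervise: Optional[Callable[[str], object]] = None,
    ) -> None:
        self.name = name
        self.owner = owner
        # Default handler returns None (falsey), so the failure propagates.
        self._supervise = supervise or (lambda failure: None)

    def report(self, failure: str, trail: List[str]) -> Optional[str]:
        """Offer the failure to each owner in turn; return the handler's name."""
        node = self.owner
        while node is not None:
            trail.append(node.name)
            if node._supervise(failure):
                return node.name
            node = node.owner
        return None


root = Node("root", supervise=lambda failure: True)
manager = Node("manager", owner=root)
worker = Node("worker", owner=manager)

trail: List[str] = []
handled_by = worker.report("worker crashed", trail)
```

Here the manager declines the failure and the root accepts it, so the trail records both owners in order.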
SkyPilot Integration (Community Contribution)
SkyPilotJob enables running Monarch on Kubernetes and cloud VMs across 20+ cloud providers (AWS, GCP, Azure, CoreWeave, Nebius, etc.) via https://skypilot.readthedocs.io/.
    import sky
    from monarch_skypilot import SkyPilotJob

    job = SkyPilotJob(
        meshes={"trainers": 2},
        resources=sky.Resources(accelerators="A100:1"),
        cluster_name="my-monarch-cluster",
    )
    state = job.state()
    trainers = state.trainers  # HostMesh with 2 nodes
Features:
- Automatic cluster provisioning and teardown
- Autostop for idle clusters
- Workdir sync and custom file mounts
- Default PyPI install or custom Docker images
Install with:
    pip install torchmonarch-nightly "skypilot[kubernetes]"
Getting Started
Install Monarch 0.3.0:
    pip install torchmonarch==0.3.0
0.2.0
Monarch Release Notes
Overview
This release focuses on correctness, robustness, and operational maturity. Major improvements span supervision and shutdown semantics, logging and observability, Kubernetes readiness, SPMD workflows, test hygiene, and build compatibility. Monarch is now more predictable under failure, easier to debug, and better suited for long-running and large-scale deployments.
Supervision & Shutdown
Actor supervision and shutdown behavior has been significantly hardened and clarified.
Key Improvements
- Strict supervision hierarchy
  - Every actor or process has exactly one parent (except the root).
  - Child actors can no longer persist after their parent faults or stops.
- Reliable recursive shutdown
  - Asking an actor to stop deterministically stops its entire subtree.
  - Shutdown cases are documented, tested, and log spam has been audited.
- Improved fault propagation
  - Supervision errors now describe the full hierarchy of exits.
  - Endpoint failures surface clearer context, including actor and endpoint names.
- HostMesh lifecycle control
  - HostMesh can be cleanly stopped (disconnect clients and kill workers).
  - HostMesh can be force-killed, causing worker loops to exit immediately.
  - Persistent allocations remain usable for reconnects after stop.
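To illustrate the recursive-shutdown guarantee above, the sketch below stops a small actor tree so that no child outlives its parent. `TreeActor` is a hypothetical model, and stopping children before their parent is an assumption made for the example rather than a documented ordering.

```python
from typing import List


class TreeActor:
    """Hypothetical model of an actor with child actors."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.children: List["TreeActor"] = []
        self.stopped = False

    def spawn(self, name: str) -> "TreeActor":
        child = TreeActor(f"{self.name}/{name}")
        self.children.append(child)
        return child

    def stop(self, order: List[str]) -> None:
        # Stop the whole subtree first so no child outlives its parent.
        for child in self.children:
            child.stop(order)
        self.stopped = True
        order.append(self.name)


root = TreeActor("root")
a = root.spawn("a")
b = root.spawn("b")
x = a.spawn("x")

order: List[str] = []
root.stop(order)
```

Stopping in a fixed traversal order is what makes the shutdown deterministic: the same tree always produces the same stop sequence.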
Logging
Logging has been refactored to improve clarity, reduce noise, and clearly separate user-facing signals from system internals.
Key Improvements
- Clear separation of logs
  - Monarch system logs and user logs are cleanly separated.
  - User-visible faults are communicated only via exceptions and supervision events.
- Improved error clarity
  - Errors are categorized (e.g., user, system, infrastructure).
  - Actor names are reported in user-understandable syntax.
  - Actor failure reports include richer context and causal chaining.
- Structured logging
  - Errors emit structured log records suitable for filtering and aggregation.
  - Supervision events follow a defined schema.
- Reduced default noise
  - Log forwarding, aggregation, and enrichment are disabled by default.
  - Log messages have been audited for signal quality.
Observability
Observability has been expanded across actors, meshes, and endpoints.
Key Improvements
- Comprehensive metrics
  - Endpoint latency, throughput, payload size, and error counts are universally available.
  - Metrics are collected on both client and server sides.
- Lifecycle instrumentation
  - Actor, process, and mesh state changes emit structured events.
  - Supervision events are fully instrumented.
- Root-cause visibility
  - The first triggering event in a failure cascade is surfaced.
  - User-parseable actor IDs are linked to internal actor identifiers.
- Tracing
  - Distributed spans cover message send and receive paths.
  - Traces can be visualized via Perfetto and standard tracing backends.
- Performance awareness
  - Instrumentation overhead has been reduced and made configurable.
Build Hygiene & Compatibility
Build and dependency management has been simplified.
Key Improvements
- RDMA and tensor engine support are dynamically loaded. The same wheel can be installed
- Monarch no longer has a binary dependency on PyTorch.
- PyTorch is required only at the Python layer.
- Startup time and binary size are significantly reduced.
Networking
Networking reliability has improved, with a focus on Lightning integration.
Key Improvements
- Lightning integration works on HostMesh v1.
- Networking behavior is documented and standardized for OSS usage.
Deprecation
Legacy v0 codepath has been removed
0.1.0
Monarch v0.1.0: Initial Release
We're excited to announce the first public release of Monarch, a distributed programming framework for PyTorch built around scalable actor messaging and direct memory access.
Monarch brings together ideas from actor-based concurrency, fault-tolerant supervision, and high-performance tensor communication to make distributed training simpler, more explicit, and faster.
Highlights
- Actor-Based Programming for PyTorch
  Define Python classes that run remotely as actors, send them messages, and coordinate distributed work using a clean, imperative API.

      from monarch.actor import Actor, endpoint, this_host

      training_procs = this_host().spawn_procs({"gpus": 8})

      class Trainer(Actor):
          @endpoint
          def train(self, step: int): ...

      trainers = training_procs.spawn("trainers", Trainer)
      trainers.train.call(step=0).get()

- Scalable Messaging and Meshes
  Actors are organized into meshes: collections that support broadcast, gather, and other scalable communication primitives.
- Supervision and Fault Tolerance
  Monarch adopts supervision trees for error handling and recovery. Failures propagate predictably, allowing fine-grained restart and robust distributed workflows.
- High-Performance RDMA Transfers
  Full RDMA integration for CPU and GPU memory via libibverbs, providing zero-copy, one-sided tensor communication across processes and hosts.
- Distributed Tensors
  Native support for tensors sharded across processes, enabling distributed compute without custom data movement code.
Monarch is experimental and under active development.
Expect incomplete APIs, rapid iteration, and evolving interfaces.
We welcome contributions; please discuss significant changes or ideas via issues before submitting PRs.
v0.0.0
First Monarch Release!
https://pypi.org/project/torchmonarch/0.0.0/