Releases: meta-pytorch/monarch

0.5.0

@dulinriley dulinriley released this 19 May 18:11

New features & API changes

Python: actor identifiers renamed to ActorAddr. ActorId is now ActorAddr across the Python bindings (#3618, #3622). The old pid: int constructor argument is gone; ActorAddr carries a string uid (with pid retained as a compatibility alias) and new label / proc_label properties. ActorAddr.from_string now expects the actor.proc@location wire format. Mailbox.post, PythonActorHandle.bind, ActorSupervisionEvent.actor_id, UndeliverableMessageEnvelope.sender, Instance.actor_id, and the ClientActor / Error / Failure stubs are all updated. ActorMeshProtocol no longer exposes region or get(rank).

Kubernetes operator integration. KubernetesJob.add_mesh now takes pod_template: V1PodTemplateSpec instead of pod_spec: V1PodSpec, and accepts a new annotations= kwarg (#3872, #3949). As of meta-pytorch/monarch-kubernetes#49, using KubernetesJob with Monarch v0.5.0+ requires v0.2.0+ of the Monarch operator.

Per-rank bootstrap. HostMesh.spawn_procs(bootstrap_command=…) accepts either a uniform BootstrapCommand or a Callable[[Point], BootstrapCommand] for per-rank customization (e.g. per-GPU CUDA_VISIBLE_DEVICES) (#3463). New helpers default_bootstrap_cmd() and BootstrapCommand.with_env(env).
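
The per-rank form can be pictured with the sketch below. It does not import Monarch: Point and BootstrapCommand here are simplified, hypothetical stand-ins (a flat rank and a program plus environment), and per_gpu_bootstrap and local_gpus are names invented for this example. It only illustrates the shape of a Callable[[Point], BootstrapCommand] that derives CUDA_VISIBLE_DEVICES from the rank.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Dict


@dataclass(frozen=True)
class Point:
    """Stand-in for a mesh coordinate; only a flat rank is modeled here."""
    rank: int


@dataclass(frozen=True)
class BootstrapCommand:
    """Stand-in for a bootstrap command: a program plus environment overrides."""
    program: str
    env: Dict[str, str] = field(default_factory=dict)

    def with_env(self, env: Dict[str, str]) -> "BootstrapCommand":
        # Return a copy with the extra variables merged in; the base is untouched.
        return replace(self, env={**self.env, **env})


def per_gpu_bootstrap(
    base: BootstrapCommand, local_gpus: int
) -> Callable[[Point], BootstrapCommand]:
    """Build a per-rank callable that pins each rank to one local GPU index."""
    def make(point: Point) -> BootstrapCommand:
        gpu = point.rank % local_gpus
        return base.with_env({"CUDA_VISIBLE_DEVICES": str(gpu)})
    return make


base = BootstrapCommand(program="python", env={"LOG_LEVEL": "info"})
bootstrap_for = per_gpu_bootstrap(base, local_gpus=4)
commands = [bootstrap_for(Point(rank)) for rank in range(6)]
```

In the real API the callable would be passed as bootstrap_command= to HostMesh.spawn_procs; the uniform case corresponds to passing a single command instead of a function.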

SPMD entry point. New host_mesh_from_store(...) stands up a HostMesh from a torchrun/torchx-style entry point without going through the Job API (#3559).

Telemetry helpers. monarch.actor.span(name) and @traced decorator replace ad-hoc OTEL TRACER.start_as_current_span(...) blocks; spans auto-bind to the current actor (#3665, #3774). PySpan is now a context manager.
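
To show the pattern these helpers follow, here is a standard-library sketch of a span context manager and a traced decorator. The names mirror the release note, but the bodies are hypothetical stand-ins: they record timings into a list instead of exporting to OpenTelemetry, and they do not bind to an actor. The real signatures may differ.

```python
import functools
import time
from contextlib import contextmanager
from typing import Callable, Iterator, List, Tuple

# Collected (name, duration_seconds) pairs; a real tracer would export these.
RECORDED: List[Tuple[str, float]] = []


@contextmanager
def span(name: str) -> Iterator[None]:
    """Time the enclosed block and record it under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded on exit, so inner spans are appended before outer ones.
        RECORDED.append((name, time.perf_counter() - start))


def traced(fn: Callable) -> Callable:
    """Wrap `fn` so every call runs inside a span named after the function."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        with span(fn.__qualname__):
            return fn(*args, **kwargs)
    return wrapper


@traced
def load_batch(n: int) -> list:
    return list(range(n))


with span("training_step"):
    batch = load_batch(3)
```

The decorator covers whole functions, while the context manager covers arbitrary blocks, which is why the two forms together can replace hand-written start_as_current_span blocks.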

Tensor engine & multiprocessing. Tensor engine builds on CPU and macOS via a split tensor_engine_gpu Cargo feature; the env var MONARCH_RDMA_GPU_PLATFORM was renamed to MONARCH_GPU_PLATFORM (#3530). RDMA Python bindings now degrade gracefully when native libs are absent. Linux default multiprocessing start method flipped from spawn to forkserver (#3529). async def __supervise__ is now supported (#3526).

config.configure keys. Added rdma_disable_ibverbs, rdma_allow_tcp_fallback, rdma_max_chunk_size_mb. Removed remote_allocator_heartbeat_interval. New parametrize_config_pointwise test helper.

Removals & deprecations.

  • The legacy allocator stack is gone: monarch._src.actor.allocator, LocalAllocator, ProcessAllocator, HostMesh.allocate_nonblocking / _allocate_nonblocking, the process_allocator binary (#3567–#3586). Use HostMesh + attach_to_workers or a JobTrait class.
  • monarch._src.actor.namespace and the namespace API removed (#3116).
  • Future.get() called from inside an active asyncio or tokio thread now emits a DeprecationWarning and becomes a RuntimeError in v0.6 (#3827).
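
The Future.get() deprecation above can be illustrated with a standard-library sketch. blocking_get is a hypothetical stand-in, not Monarch's Future.get(), and only the asyncio half of the check is modeled (the tokio case has no Python-level equivalent here). It shows one way to detect a running event loop and warn.

```python
import asyncio
import warnings


def in_running_event_loop() -> bool:
    """Return True when the calling thread is currently running an asyncio loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


def blocking_get(value):
    """Hypothetical stand-in for a blocking get(); warns inside an event loop."""
    if in_running_event_loop():
        warnings.warn(
            "blocking get() called from inside an event loop; await instead",
            DeprecationWarning,
            stacklevel=2,
        )
    return value


async def handler():
    # Calling the blocking form from a coroutine triggers the warning.
    return blocking_get(42)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    outside = blocking_get(1)        # no loop running: no warning
    inside = asyncio.run(handler())  # loop running: warning emitted
```

Code that hits the warning would typically switch to awaiting the future rather than calling the blocking form from a coroutine.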

Examples & docs. New Kubernetes GRPO tutorial (Qwen3.5-0.8B on GSM8K) (#3597), Oracle OKE example (#3671), GRPO via cooperative multitasking (#3525).

Rust internals (not Python-visible). Endpoint sends are now infallible and renamed send → post, with failures flowing through a new Undeliverable<M> enum (#3890–#3894, #3912). A new Gateway layer owns per-proc reachability and serving (#3818–#3823); Proc::local → Proc::isolated. Identity constructors collapsed into anonymous() / instance(label) / singleton(name) (#3935, #3940). hyperactor::reference deleted and hyperactor::host moved to hyperactor_mesh::host (#3641, #3724). New hyperactor_remote crate adds keepalive links, supervisors, and rendezvous tokens (#3762–#3768).

Bug Fixes

  • Ctrl-C no longer hangs the runtime (#3801); flaky PyShared.__await__ borrow race (#3862); two RwLock/DashMap deadlocks in actor teardown (#3754); re-entrant TraceEventDispatcher SIGSEGV in real training runs (#3690); Mailbox::post_unchecked shard deadlock (#3684); host shutdown race (#3663).
  • Bootstrap falls back when XDG_RUNTIME_DIR doesn't exist (#3418); long-path SUN_LEN unix-socket panic (#3697); HostMesh label sanitization (#3691); controller GetState no longer triggers an undeliverable bounce (#3450); RDMA find_cuda_segment boundary (#3769).

Performance & Reliability

  • Native V1 casting and the destination-actor reorder buffer are now on by default (#3812), with a point-to-point optimization for small casts (#3646).
  • RDMA completion polling is now adaptive: the default flipped from a fixed 1 ms sleep to yield-only, gated by MONARCH_RDMA_CQ_BUSY_POLL_WINDOW (#3771). resolve_ibv made synchronous, removing a per-read round-trip (#3773). TLS code-transfer replaced with RDMABuffer leader fan-out (#3390). Arc-refcounted PDs/MRs close a latent PD double-free (#3883); KeepaliveLocalMemory is the sole local-memory handle, with explicit unsafe accessors (#3922).
  • Channel correctness: host flushes acks before exit (#3637); duplex sessions made structurally concurrent (#3675); experimental multi-stream sender exp_dial_unordered (#3557, #3558).
  • ProcMeshController reaps procs orphaned by a dead client via MESH_ORPHAN_TIMEOUT (#3811); periodic RSS recording for managed processes (#3733).

Build & Release

  • macOS wheels ship with the stable PyPI release (#3854) and the nightly matrix (#3451); the initial publish pipeline landed in #3412, with follow-up fixes for missing fields (#3344), no-torch (#3371), the crash-recovery plugin (#3831), and general build breakage (#3786).
  • ROCm GPU CI via a matrix-based workflow (#3190); ROCm excluded from PR runs (#3861).
  • PyTorch bumped 2.11.0 → 2.12.0 for stable; nightly tracks 2.13.0 (#3863). publish_release Docker base aligned to CUDA 12.6 for torch 2.12.0 (#3921); nightly Docker images repaired after upstream cuda12.8 removal (#3880).
  • PyPI wheels now carry classifiers and project URLs (#3379); docs deploy targets stable (#3415). New GHA workflow marks stale PRs and deletes branches of closed/non-merged PRs (#3778); test-result XML uploaded as artifacts (#3670); global 5-minute cargo-nextest timeout (#3855).

0.4.1

@dulinriley dulinriley released this 08 Apr 01:13

Full Changelog: v0.4.0...v0.4.1

v0.4.1 is a small patch release that includes some powerful new features and important bug fixes.

New Features & API changes

v0.4.1 adds a substantial new CLI workflow around long-lived jobs:
monarch apply and monarch exec can now be used to launch subclasses
of JobTrait.
This release also introduces JobTrait.remote_mount, which mounts a local filesystem on the
workers of a Monarch job and keeps it in sync. It creates a FUSE mount on each worker and
propagates filesystem changes to all workers, sending the data over RDMA or TCP depending
on availability.
JobTrait.gather_mount works in reverse: a read-only FUSE mount that
pulls per-worker directories back into a unified local view. This can be used to gather
logs or other outputs from all workers to be examined locally.

The Monarch Dashboard is a local web UI for inspecting a running Monarch job
in real time. It is included in torchmonarch and starts alongside telemetry.
For jobs, enable both admin and telemetry:

job.enable_admin()
job.enable_telemetry(TelemetryConfig(include_dashboard=True, dashboard_port=8265))

The dashboard has three views:
Summary for overall health, actor counts, failures, and message traffic;
Hierarchy for drilling from host mesh down to individual actor details;
DAG for an interactive topology view of hosts, procs, and actors.

It’s still early, so the UI and APIs may evolve, but it’s already useful for
understanding topology, debugging failures, and inspecting message flow.

On the mesh-admin side, the HTTP surface expands with POST /v1/query and
POST /v1/pyspy_dump, while the internals were refactored to use typed IDs,
references, and timestamps behind a curl-friendly JSON/DTO boundary. That
should make the admin API easier to evolve without breaking existing
consumers.

Bug fixes

  • The RDMA function is_rdma_available, which was deleted in v0.4.0, is back with a deprecation warning. It is now just a wrapper around is_ibverbs_available. Use get_rdma_backend to check which implementation is in use.
  • RDMA bug fix for mlx5dv: #3293

Runtime correctness also improved in a few important places in error paths:
stop_actor_by_name now waits for actual actor termination, mesh scans no
longer crash or spin forever when a ProcMesh spawn fails, and mesh-controller
OncePort replies now return accumulated responses correctly.

Performance

A zero-copy regression in the pickle send path was fixed: #3234

v0.4.0

@dulinriley dulinriley released this 26 Mar 20:52

Monarch v0.4 Release Notes

New Features

Networking & RDMA

  • EFA support for RDMA: RDMA with AWS's libefa (Elastic Fabric Adapter).
  • TCP fallback for RDMA: when RDMA is unavailable, the data plane automatically falls back to TCP, broadening hardware compatibility (#2999).
  • ROCm / HIP support for the RDMA stack, enabling AMD GPU deployments (#2891).
  • The channel transport layer was rewritten around a typed session lifecycle and unified NetLink dispatch, improving reconnect reliability and adding duplex-mode channels.

Distributed Telemetry & Dashboard

Monarch now ships a built-in observability dashboard. The new distributed telemetry system collects actor, mesh, host, proc, and message-level data in real time and exposes it through both a web UI and a schema-first REST API (OpenAPI 3.1). An OTLP-compatible metrics, logs, and trace exporter makes it straightforward to integrate with Grafana, Jaeger, or any OpenTelemetry collector in Kubernetes deployments.

Admin TUI & Live Diagnostics

A new terminal UI (admin_tui) provides live introspection of running meshes, procs, and actors via an HTTP admin server. It includes a built-in py-spy integration that can capture Python stack traces from any running actor directly in the TUI, making it much easier to diagnose stalls and performance issues in production.

Kubernetes

KubernetesJob gained Python-native provisioning, removing the dependency on an external Go controller for mesh creation. A new optional labels parameter on add_mesh() enables integration with Kueue and other label-based Kubernetes controllers (#2693).

Python API Changes

  • allocate_nonblocking, from_alloc, and host_mesh are renamed to private methods; use attach_to_workers and the KubernetesJob / ProcessJob APIs instead (#2971).
  • NUMA bindings are now exposed for proc mesh spawning (#2996).

Bug Fixes & Performance Improvements

Supervision & Fault Tolerance

  • ControllerController supervision: a single child torchstore controller failure no longer poisons the parent and all siblings. Each child is now isolated, fixing a critical bug where one failed session could block all subsequent get_or_spawn_controller() calls (#2835).
  • Orphaned mesh cleanup: child actors now detect when their parent is unreachable and self-terminate, preventing leaked GPU resources (#2198).
  • Clean Python shutdown: proc exit now calls Py_FinalizeEx, giving Python objects a chance to run destructors and eliminating the pybind11::dec_ref GIL crashes seen during shutdown (#2524).
  • Reliable proc_mesh.stop(): stop now flushes pending messages and acks before exiting, fixing races that caused spurious errors in CI and user code (#2658).

Performance

  • Lazy ValueMesh unpickling: values returned from accumulate are now deserialized on access rather than eagerly, reducing latency for large results (#2983).
  • RLE-compressed OnceBuffer accumulation: repeated identical values are run-length encoded during accumulation, cutting memory and network cost for common broadcast patterns (#2989).
  • Telemetry overhead was significantly reduced by demoting internal spans and gating channel-level tracing behind DEBUG.
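
Run-length encoding, as mentioned for OnceBuffer accumulation above, can be sketched in a few lines. This is a generic illustration of the technique, not Monarch's implementation; rle_encode, rle_decode, and the sample replies are names made up for the example.

```python
from typing import Any, Iterable, List, Tuple


def rle_encode(values: Iterable[Any]) -> List[Tuple[Any, int]]:
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs: List[Tuple[Any, int]] = []
    for value in values:
        if runs and runs[-1][0] == value:
            # Same as the previous value: extend the current run.
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs


def rle_decode(runs: Iterable[Tuple[Any, int]]) -> List[Any]:
    """Expand (value, run_length) pairs back into the original sequence."""
    out: List[Any] = []
    for value, count in runs:
        out.extend([value] * count)
    return out


# Ten replies from ranks in a mesh, mostly identical.
replies = ["ok"] * 6 + ["retry"] + ["ok"] * 3
encoded = rle_encode(replies)
```

When most ranks return the same value, the encoded form holds a handful of pairs instead of one entry per rank, which is where the memory and network savings come from.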

Build & Packaging

  • Official aarch64 (ARM64) release binaries are now published alongside x86_64 on PyPI

0.3.0

@colin2328 colin2328 released this 30 Jan 22:27

Monarch 0.3.0 Release Notes

New Features

Kubernetes Job Support

Monarch now supports running distributed training workloads on Kubernetes clusters. The new KubernetesJob API connects to pre-provisioned GPU pods managed by the https://github.com/meta-pytorch/monarch-kubernetes/ repository, enabling seamless multi-node DDP training on Kubernetes.

Key Capabilities:

  • Connect to Kubernetes pods using KubernetesJob
  • Provision GPU workers via the MonarchMesh Custom Resource Definition
  • Run multi-node DDP training using SPMDActor

Example:

  from monarch.job.kubernetes import KubernetesJob
  from monarch.spmd import SPMDActor

  k8s_job = KubernetesJob(namespace="monarch-tests")
  k8s_job.add_mesh("ddpmesh", num_replicas=2)

  job_state = k8s_job.state()
  proc_mesh = job_state.ddpmesh.spawn_procs({"gpus": 4})
  spmd_actors = proc_mesh.spawn("_SPMDActor", SPMDActor)

See the full tutorial: https://meta-pytorch.org/monarch/generated/examples/ddp/kubernetes_ddp.html

We also publish docker packages, see https://github.com/meta-pytorch/monarch/pkgs/container/monarch


monarch.spmd and monarch.job.spmd SPMDJob

The new monarch.job.spmd module provides serve() and run_spmd() for an interactive SPMD development workflow:

  • Reserve once, iterate many times: Allocate hosts once, then call run_spmd() repeatedly without reprovisioning
  • Remote debugging: Add breakpoint() in your training script and attach with monarch debug
  • Job caching: Reload cached job state and re-run on the same reserved hosts
  Example:

  from monarch.job import job_load  # assumed import path; the original snippet omits it
  from monarch.job.spmd import serve

  job = serve(
      ["torchrun", "--nproc-per-node=4", "--standalone", "train.py"],
      scheduler="local_cwd",
  )
  job.run_spmd()

  # Later, reload and re-run without reprovisioning:
  job = job_load(".monarch/job_state.pkl")
  job.run_spmd()

This supports single-node training with command lists and multi-node training with TorchX AppDef on schedulers like Slurm.

See the example: https://meta-pytorch.org/monarch/generated/examples/ddp/spmd_job.html


Experimental Queue Dispatch Mode (Performance)

A new actor dispatch mode where Rust enqueues messages to a channel for Python to process, rather than Rust acquiring the GIL directly. This can improve throughput for message-heavy workloads.

  from monarch.config import configure

  configure(actor_queue_dispatch=True)
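
The idea behind queue dispatch can be sketched with the standard library alone. This is only an analogy: a producer thread stands in for the Rust runtime and a consumer thread for the Python dispatch loop, and none of the names below come from Monarch.

```python
import queue
import threading
from typing import List

STOP = object()  # sentinel telling the consumer to exit


def consumer(inbox: "queue.Queue", handled: List[str]) -> None:
    """Drain the queue in order; stands in for the Python-side dispatch loop."""
    while True:
        message = inbox.get()
        if message is STOP:
            break
        handled.append(f"handled:{message}")


inbox: "queue.Queue" = queue.Queue()
handled: List[str] = []
worker = threading.Thread(target=consumer, args=(inbox, handled))
worker.start()

# The producer only enqueues and moves on; it never runs a handler itself.
for message in ["a", "b", "c"]:
    inbox.put(message)
inbox.put(STOP)
worker.join()
```

Because the producer never executes handler code, it does not have to wait for the interpreter on every message; the single consumer preserves message order.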

Real this_proc() for Local Spawning

The this_proc() function returns a handle to the current singleton process, enabling actors to spawn other actors locally. Remote actors can use this_proc() to spawn actors on their own host, enabling patterns like handing out references to a local proc and having remote actors spawn resources on it.

from monarch.actor import Actor, endpoint, this_proc

class ManagerActor(Actor):
    @endpoint
    def spawn_helper(self) -> HelperActor:
        # Spawns HelperActor in the same process as ManagerActor
        return this_proc().spawn("helper", HelperActor)

Zero-Copy Messaging Path from Python

A new Buffer class enables zero-copy message serialization from Python. Large writes (≥256 bytes) are stored as references to Python bytes objects rather than being copied, integrating with multipart serialization for efficient vectored I/O.

from monarch._rust_bindings.monarch_hyperactor.buffers import Buffer
from monarch.config import configure

buffer = Buffer()
buffer.write(b"small")       # copied into pending buffer
buffer.write(b"x" * 1000)    # stored as zero-copy reference

# Configure the threshold via:
configure(small_write_threshold=256)  # default
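
The copy-versus-reference split described above can be modeled in pure Python. SketchBuffer is a hypothetical stand-in written for this note, not Monarch's Buffer: it copies writes below the threshold into a pending bytearray and keeps larger writes as memoryview references, producing an ordered list of fragments of the kind a vectored write could consume.

```python
from typing import List, Union

Fragment = Union[bytes, memoryview]


class SketchBuffer:
    """Hypothetical stand-in illustrating a small-write threshold."""

    def __init__(self, small_write_threshold: int = 256) -> None:
        self.threshold = small_write_threshold
        self.pending = bytearray()            # small writes are copied here
        self.fragments: List[Fragment] = []   # ordered output fragments

    def _flush_pending(self) -> None:
        if self.pending:
            self.fragments.append(bytes(self.pending))
            self.pending = bytearray()

    def write(self, data: bytes) -> None:
        if len(data) < self.threshold:
            self.pending += data              # copy: cheap for small payloads
        else:
            self._flush_pending()             # keep fragments in write order
            self.fragments.append(memoryview(data))  # reference, no copy

    def finish(self) -> List[Fragment]:
        self._flush_pending()
        return self.fragments


buf = SketchBuffer()
big = b"x" * 1000
buf.write(b"small")
buf.write(big)
buf.write(b"tail")
parts = buf.finish()
```

Copying small writes avoids building a long list of tiny fragments, while large payloads skip the copy entirely.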

Principles of Ownership in Supervision

This release improves the supervision model for error handling in meshes, built on four core principles:

  1. Owned meshes: Creating new meshes always results in an owned mesh
  2. Single ownership: All meshes are owned by at most one actor (no transfer or suspension)
  3. Lifecycle binding: A mesh cannot outlive its owner; when the owner dies, so does the mesh
  4. Graceful cleanup: Stopped meshes drain pending messages before cleanup; owned meshes clean up before their owner

Actors can now implement __supervise__ to handle failures from owned meshes.

Example:

  class ManagerActor(Actor):
      def __supervise__(self, failure: MeshFailure) -> bool:
          logging.error(f"failure encountered: {failure}")
          # Return truthy to handle, falsey to propagate
          return False  # falsey: let the failure propagate
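
The truthy/falsey contract can be pictured as a walk up the ownership chain. The sketch below is a toy model, not Monarch code: Owner and deliver_failure are hypothetical names, and the behavior once the root declines to handle a failure is left as a placeholder because these notes do not describe it.

```python
from typing import Callable, List, Optional


class Owner:
    """Hypothetical stand-in for an actor that owns meshes."""

    def __init__(
        self,
        name: str,
        handler: Callable[[str], bool],
        parent: Optional["Owner"] = None,
    ) -> None:
        self.name = name
        self.handler = handler
        self.parent = parent


def deliver_failure(owner: Optional[Owner], failure: str) -> List[str]:
    """Walk up the ownership chain until a handler returns a truthy value."""
    visited: List[str] = []
    while owner is not None:
        visited.append(owner.name)
        if owner.handler(failure):
            return visited             # handled here; stop propagating
        owner = owner.parent
    visited.append("<unhandled>")      # placeholder: nobody handled it
    return visited


root = Owner("root", handler=lambda failure: True)
manager = Owner("manager", handler=lambda failure: False, parent=root)
path = deliver_failure(manager, "worker crashed")
```

Here the manager declines the failure, so it reaches the root, which handles it.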

See the documentation: https://meta-pytorch.org/monarch/actors.html#error-handling-in-meshes


SkyPilot Integration (Community Contribution)

SkyPilotJob enables running Monarch on Kubernetes and cloud VMs across 20+ cloud providers (AWS, GCP, Azure, CoreWeave, Nebius, etc.) via https://skypilot.readthedocs.io/.

  import sky
  from monarch_skypilot import SkyPilotJob

  job = SkyPilotJob(
      meshes={"trainers": 2},
      resources=sky.Resources(accelerators="A100:1"),
      cluster_name="my-monarch-cluster",
  )
  state = job.state()
  trainers = state.trainers  # HostMesh with 2 nodes

Features:

  • Automatic cluster provisioning and teardown
  • Autostop for idle clusters
  • Workdir sync and custom file mounts
  • Default PyPI install or custom Docker images

Install with:

pip install torchmonarch-nightly skypilot[kubernetes]


Getting Started

Install Monarch 0.3.0:

pip install torchmonarch==0.3.0

0.2.0

@colin2328 colin2328 released this 22 Dec 20:54

Monarch Release Notes

Overview

This release focuses on correctness, robustness, and operational maturity. Major improvements span supervision and shutdown semantics, logging and observability, Kubernetes readiness, SPMD workflows, test hygiene, and build compatibility. Monarch is now more predictable under failure, easier to debug, and better suited for long-running and large-scale deployments.


Supervision & Shutdown

Actor supervision and shutdown behavior has been significantly hardened and clarified.

Key Improvements

  • Strict supervision hierarchy

    • Every actor or process has exactly one parent (except the root).
    • Child actors can no longer persist after their parent faults or stops.
  • Reliable recursive shutdown

    • Asking an actor to stop deterministically stops its entire subtree.
    • Shutdown cases are documented, tested, and log spam has been audited.
  • Improved fault propagation

    • Supervision errors now describe the full hierarchy of exits.
    • Endpoint failures surface clearer context, including actor and endpoint names.
  • HostMesh lifecycle control

    • HostMesh can be cleanly stopped (disconnect clients and kill workers).
    • HostMesh can be force-killed, causing worker loops to exit immediately.
    • Persistent allocations remain usable for reconnects after stop.
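
The recursive shutdown guarantee above can be illustrated with a small tree model. Node is a hypothetical stand-in for an actor or process, and stopping children before their parent is an assumption of this sketch rather than a documented ordering.

```python
from typing import List, Optional


class Node:
    """Hypothetical stand-in for an actor or process in a supervision tree."""

    def __init__(self, name: str, parent: Optional["Node"] = None) -> None:
        self.name = name
        self.children: List["Node"] = []
        self.stopped = False
        if parent is not None:
            parent.children.append(self)  # exactly one parent per node

    def stop(self, order: List[str]) -> None:
        # Stop children first so no child outlives its parent.
        for child in self.children:
            child.stop(order)
        self.stopped = True
        order.append(self.name)


root = Node("root")
trainer = Node("trainer", parent=root)
loader = Node("loader", parent=trainer)
logger = Node("logger", parent=root)

order: List[str] = []
trainer.stop(order)  # stops trainer and its subtree only
```

Stopping trainer takes loader down with it, while root and logger, which sit outside that subtree, keep running.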

Logging

Logging has been refactored to improve clarity, reduce noise, and clearly separate user-facing signals from system internals.

Key Improvements

  • Clear separation of logs

    • Monarch system logs and user logs are cleanly separated.
    • User-visible faults are communicated only via exceptions and supervision events.
  • Improved error clarity

    • Errors are categorized (e.g., user, system, infrastructure).
    • Actor names are reported in user-understandable syntax.
    • Actor failure reports include richer context and causal chaining.
  • Structured logging

    • Errors emit structured log records suitable for filtering and aggregation.
    • Supervision events follow a defined schema.
  • Reduced default noise

    • Log forwarding, aggregation, and enrichment are disabled by default.
    • Log messages have been audited for signal quality.

Observability

Observability has been expanded across actors, meshes, and endpoints.

Key Improvements

  • Comprehensive metrics

    • Endpoint latency, throughput, payload size, and error counts are universally available.
    • Metrics are collected on both client and server sides.
  • Lifecycle instrumentation

    • Actor, process, and mesh state changes emit structured events.
    • Supervision events are fully instrumented.
  • Root-cause visibility

    • The first triggering event in a failure cascade is surfaced.
    • User-parseable actor IDs are linked to internal actor identifiers.
  • Tracing

    • Distributed spans cover message send and receive paths.
    • Traces can be visualized via Perfetto and standard tracing backends.
  • Performance awareness

    • Instrumentation overhead has been reduced and made configurable.

Build Hygiene & Compatibility

Build and dependency management has been simplified.

Key Improvements

  • RDMA and tensor engine support are dynamically loaded. The same wheel can be installed
  • Monarch no longer has a binary dependency on PyTorch.
    • PyTorch is required only at the Python layer.
    • Startup time and binary size are significantly reduced.

Networking

Networking reliability has improved, with a focus on Lightning integration.

Key Improvements

  • Lightning integration works on HostMesh v1.
  • Networking behavior is documented and standardized for OSS usage.

Deprecation

The legacy v0 codepath has been removed.

0.1.0

@colin2328 colin2328 released this 22 Oct 05:02

🦋 Monarch v0.1.0: Initial Release
We're excited to announce the first public release of Monarch, a distributed programming framework for PyTorch built around scalable actor messaging and direct memory access.
Monarch brings together ideas from actor-based concurrency, fault-tolerant supervision, and high-performance tensor communication to make distributed training simpler, more explicit, and faster.

🚀 Highlights

  1. Actor-Based Programming for PyTorch
    Define Python classes that run remotely as actors, send them messages, and coordinate distributed work using a clean, imperative API.
from monarch.actor import Actor, endpoint, this_host

training_procs = this_host().spawn_procs({"gpus": 8})

class Trainer(Actor):
    @endpoint
    def train(self, step: int): ...

trainers = training_procs.spawn("trainers", Trainer)
trainers.train.call(step=0).get()
  2. Scalable Messaging and Meshes
    Actors are organized into meshes: collections that support broadcast, gather, and other scalable communication primitives.
  3. Supervision and Fault Tolerance
    Monarch adopts supervision trees for error handling and recovery. Failures propagate predictably, allowing fine-grained restart and robust distributed workflows.
  4. High-Performance RDMA Transfers
    Full RDMA integration for CPU and GPU memory via libibverbs, providing zero-copy, one-sided tensor communication across processes and hosts.
  5. Distributed Tensors
    Native support for tensors sharded across processes, enabling distributed compute without custom data movement code.

⚠️ Early Development Notice
Monarch is experimental and under active development.
Expect incomplete APIs, rapid iteration, and evolving interfaces.
We welcome contributions; please discuss significant changes or ideas via issues before submitting PRs.

v0.0.0

v0.0.0 Pre-release

@colin2328 colin2328 released this 03 Sep 17:15