Topology aware mesh creation #1465

Open
wang2yn84 wants to merge 1 commit into main from lance-fix-mesh

@wang2yn84
Collaborator

@wang2yn84 wang2yn84 commented May 1, 2026

Starting with v7x, I am seeing more and more mesh creation failures:

jax.errors.JaxRuntimeError: INTERNAL: Not a valid subslice size because bounds are not along host boundaries. Proposed subslice size: 4,1,1, host bounds: 2,2,1
Set --FLAGS_pathways_enforce_subset_devices_form_subslice to false at the Pathways client to disable this check.

The direct root cause is that the CLI simply flattens the device list, slices that list, and creates the mesh from the result, so the selected devices may or may not belong to the same host. This PR completely rewrites the mesh creation logic to use each device's slice, host, and device coordinates, so that the resulting mesh respects host bounds.
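To illustrate the difference, here is a minimal sketch of the idea. It is not the code in this PR. `FakeDevice`, `select_host_aligned`, and the 4x2x1 layout with two 2x2x1 hosts are hypothetical, chosen only to mirror the shapes in the error above; the field names `slice_index`, `process_index`, and `coords` are assumed to match the attributes exposed on real TPU devices.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeDevice:
    """Hypothetical stand-in for a TPU device (field names are assumptions)."""
    id: int
    slice_index: int
    process_index: int  # which host owns this device
    coords: tuple       # (x, y, z) position in the physical topology


def select_host_aligned(devices, num_devices):
    """Pick num_devices devices made up of whole hosts only.

    Assumes every host owns the same number of devices.
    """
    by_host = defaultdict(list)
    for d in devices:
        by_host[(d.slice_index, d.process_index)].append(d)

    per_host_counts = {len(v) for v in by_host.values()}
    if len(per_host_counts) != 1:
        raise ValueError("hosts own differing numbers of devices")
    (per_host,) = per_host_counts

    if num_devices % per_host != 0:
        raise ValueError(
            f"{num_devices} devices cannot be formed from whole hosts "
            f"of {per_host} devices each"
        )
    num_hosts = num_devices // per_host
    if num_hosts > len(by_host):
        raise ValueError("not enough hosts available")

    selected = []
    for key in sorted(by_host)[:num_hosts]:
        # Within a host, order by physical coordinates.
        selected.extend(sorted(by_host[key], key=lambda d: d.coords))
    return selected


# Hypothetical 4x2x1 topology: host 0 owns x in {0, 1}, host 1 owns x in {2, 3}.
devices = [
    FakeDevice(id=y * 4 + x, slice_index=0, process_index=x // 2, coords=(x, y, 0))
    for y in range(2)
    for x in range(4)
]

# Naive approach: flatten by id and take the first four devices.
# This yields a 4x1x1 strip that cuts across both hosts.
naive = sorted(devices, key=lambda d: d.id)[:4]

# Host-aligned approach: take one whole 2x2x1 host.
aligned = select_host_aligned(devices, 4)

print(sorted({d.process_index for d in naive}))
print(sorted({d.process_index for d in aligned}))
```

In this toy layout the naive selection touches two hosts while the host-aligned selection stays on one, which is the property the subslice check in the error message appears to enforce.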

Reference

Colab Notebook

Checklist

  • I have added all the necessary unit tests for my change.
  • I have verified that my change does not break existing code and all unit tests pass.
  • I have added all appropriate doc-strings/documentation.
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have signed the Contributor License Agreement.
  • I have followed Contribution Guidelines.

@wang2yn84 wang2yn84 changed the title Fix mesh creation Topology aware mesh creation May 4, 2026