
Add colocated mode to agentic cli#1458

Open
wang2yn84 wants to merge 1 commit into main from lance-colocate

Conversation

wang2yn84 (Collaborator) commented Apr 30, 2026

This PR adds colocated mode to the agentic CLI. It does the following:

  1. Completely refactors the current training pipeline and provides a clear execution graph by separating rollout, reference, and trainer into their own stages. This makes it easier to add a critic or other roles to the pipeline in the future.
  2. Replaces the same_mesh_as config with colocate_with. same_mesh_as forces colocated roles to share the same mesh, but in practice they can use meshes of different shapes while operating on the same set of devices. colocate_with makes that possible.
  3. Removes can_enable_async_rollout, because rollout is always async in agentic mode.
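
To illustrate the difference described in item 2, here is a minimal sketch in plain Python. `MeshSpec`, `same_mesh`, and `colocated` are hypothetical names made up for this example and are not the types or checks used in this PR; the sketch only assumes that a mesh can be summarized by a shape and the device ids it covers.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class MeshSpec:
    """Hypothetical stand-in for a device mesh: a shape plus the device ids it covers."""

    shape: tuple
    device_ids: tuple

    def __post_init__(self):
        # A mesh of shape (a, b, ...) must place exactly a * b * ... devices.
        if prod(self.shape) != len(self.device_ids):
            raise ValueError("mesh shape must match the number of devices")


def same_mesh(a: MeshSpec, b: MeshSpec) -> bool:
    """Stricter requirement: identical shape and identical device layout."""
    return a.shape == b.shape and a.device_ids == b.device_ids


def colocated(a: MeshSpec, b: MeshSpec) -> bool:
    """Looser requirement: the same set of devices, with any mesh shape."""
    return set(a.device_ids) == set(b.device_ids)


# Two roles on the same 8 devices, laid out with different mesh shapes.
actor = MeshSpec(shape=(2, 4), device_ids=tuple(range(8)))
rollout = MeshSpec(shape=(1, 8), device_ids=tuple(range(8)))
```

In this sketch `same_mesh(actor, rollout)` is false while `colocated(actor, rollout)` is true, which is the case the stricter same-mesh requirement rules out.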

If offload is enabled and the user specifies rollout_model_config.colocate_with="actor", the reference stage blocks until the global batch barrier is reached.
If offload is disabled and the user specifies rollout_model_config.colocate_with="actor", the pipeline assumes there is enough HBM to host all the models, so it does not block and runs asynchronously.
If the user does not specify rollout_model_config.colocate_with and explicitly provides a rollout mesh, the pipeline runs in disaggregated mode.
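
The three cases above can be summarized as a small decision function. This is only a sketch of the rules as stated in this description, not the code in the PR: `PipelineMode`, `select_mode`, and the `has_rollout_mesh` parameter are hypothetical names, and combinations the description does not cover raise an error rather than guess.

```python
from enum import Enum
from typing import Optional


class PipelineMode(Enum):
    """Hypothetical labels for the three behaviors described above."""

    COLOCATED_BLOCKING = "colocated; reference stage waits on the global batch barrier"
    COLOCATED_ASYNC = "colocated; stages run asynchronously"
    DISAGGREGATED = "disaggregated; rollout runs on its own mesh"


def select_mode(
    colocate_with: Optional[str], offload: bool, has_rollout_mesh: bool
) -> PipelineMode:
    """Map the rollout config to a pipeline mode, following the PR description."""
    if colocate_with == "actor":
        # With offload, the models take turns on the devices, so the
        # reference stage must wait for the global batch barrier.
        if offload:
            return PipelineMode.COLOCATED_BLOCKING
        # Without offload, all models are assumed to fit in HBM at once.
        return PipelineMode.COLOCATED_ASYNC
    if colocate_with is None and has_rollout_mesh:
        return PipelineMode.DISAGGREGATED
    raise ValueError("configuration not covered by the PR description")


print(select_mode("actor", offload=True, has_rollout_mesh=False).name)
print(select_mode(None, offload=False, has_rollout_mesh=True).name)
```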

Reference

Colab Notebook

Checklist

  • I have added all the necessary unit tests for my change.
  • I have verified that my change does not break existing code and all unit tests pass.
  • I have added all appropriate doc-strings/documentation.
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have signed the Contributor License Agreement.
  • I have followed Contribution Guidelines.

