[tx] Make it easy to run on a multi-node Ray cluster #935

@pcmoritz

Currently we have multi-node support (#810), but all the processes have to be started separately on the different nodes.

Instead, we should make it easy to run with a single entry point on a Ray cluster. One way to do this would be to introduce a new backend argument, enable_ray; if it is true, the relevant processes are automatically started as Ray actors and scheduled on different nodes:

uv run --extra aws --extra gpu --extra tinker -m tx.tinker.api --base-model Qwen/Qwen3-8B --backend-config '{"max_lora_adapters": 3, "max_lora_rank": 1, "tensor_parallel_size": 4, "fully_sharded_data_parallel_size": 2, "train_micro_batch_size": 8, "sample_max_num_sequences": 256, "enable_ray": true}' > out1.log
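To make the proposal concrete, here is a rough sketch of how the single entry point could use Ray once enable_ray is set. It is not the tx implementation: `should_use_ray`, `launch_workers`, and the `Worker` actor are hypothetical names, and the sketch assumes one worker process per node and an already running Ray cluster. Only the Ray calls themselves (placement groups with the STRICT_SPREAD strategy, actor options) are existing Ray APIs.

```python
import json


def should_use_ray(backend_config_json: str) -> bool:
    """Return True when the proposed (not yet existing) enable_ray flag is set."""
    return bool(json.loads(backend_config_json).get("enable_ray", False))


def launch_workers(num_nodes: int, gpus_per_node: int):
    """Start one hypothetical worker actor per node and return the actor handles."""
    # Imported lazily so the config helper above works without Ray installed.
    import ray
    from ray.util.placement_group import placement_group
    from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

    # Connect to the Ray cluster that is already running.
    ray.init(address="auto", ignore_reinit_error=True)

    @ray.remote
    class Worker:
        """Placeholder for a process that is started by hand on each node today."""

        def __init__(self, rank: int, world_size: int):
            self.rank = rank
            self.world_size = world_size

        def node_ip(self) -> str:
            return ray.util.get_node_ip_address()

    # One bundle per node; STRICT_SPREAD forces each bundle onto a different node.
    bundles = [{"GPU": gpus_per_node} for _ in range(num_nodes)]
    pg = placement_group(bundles, strategy="STRICT_SPREAD")
    ray.get(pg.ready())  # block until the cluster has reserved the resources

    return [
        Worker.options(
            num_gpus=gpus_per_node,
            scheduling_strategy=PlacementGroupSchedulingStrategy(
                placement_group=pg,
                placement_group_bundle_index=rank,
            ),
        ).remote(rank, num_nodes)
        for rank in range(num_nodes)
    ]
```

With something like this, the API entry point would call `should_use_ray` on the --backend-config string and, if it returns True, call `launch_workers` instead of expecting the processes to be started manually. How tensor_parallel_size and fully_sharded_data_parallel_size map onto nodes is left open here.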

There might be better designs for this; happy to discuss them in a PR or in this issue.
