Currently we have multi-node support (#810), but all the processes have to be started separately on the different nodes.
Instead, we should make it easy to run with a single entry point on a Ray cluster. One way to do this would be to introduce a new backend argument, enable_ray: if it is true, the relevant processes are automatically started as Ray actors and scheduled on different nodes. For example:
uv run --extra aws --extra gpu --extra tinker -m tx.tinker.api --base-model Qwen/Qwen3-8B --backend-config '{"max_lora_adapters": 3, "max_lora_rank": 1, "tensor_parallel_size": 4, "fully_sharded_data_parallel_size": 2, "train_micro_batch_size": 8, "sample_max_num_sequences": 256, "enable_ray": true}' > out1.log
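To make the proposal a bit more concrete, here is a rough sketch of what the enable_ray path could look like. It is not the existing tx implementation: `plan_bundles`, `launch_on_ray`, `gpus_per_node`, and `worker_fn` are hypothetical names, and it assumes one process per node and that the mesh needs `tensor_parallel_size * fully_sharded_data_parallel_size` GPUs in total. The Ray calls used (placement groups with `STRICT_SPREAD`, `PlacementGroupSchedulingStrategy`) are standard Ray APIs.

```python
import json
import math


def plan_bundles(backend_config: str, gpus_per_node: int) -> list[dict]:
    """Return one Ray resource bundle per node, or [] if enable_ray is off.

    Hypothetical helper: assumes the total GPU count is
    tensor_parallel_size * fully_sharded_data_parallel_size.
    """
    cfg = json.loads(backend_config)
    if not cfg.get("enable_ray", False):
        return []
    total_gpus = cfg.get("tensor_parallel_size", 1) * cfg.get(
        "fully_sharded_data_parallel_size", 1
    )
    num_nodes = math.ceil(total_gpus / gpus_per_node)
    # The last node may need fewer GPUs if total_gpus is not a multiple.
    return [
        {"GPU": min(gpus_per_node, total_gpus - i * gpus_per_node), "CPU": 1}
        for i in range(num_nodes)
    ]


def launch_on_ray(bundles: list[dict], worker_fn):
    """Start one actor per bundle, each on a distinct node, and run worker_fn.

    worker_fn(rank, world_size) stands in for whatever currently has to be
    started by hand on each node.
    """
    # Imported lazily so the non-Ray path does not require Ray to be installed.
    import ray
    from ray.util.placement_group import placement_group
    from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

    ray.init(address="auto")  # connect to an already running Ray cluster
    # STRICT_SPREAD places every bundle on a different node.
    pg = placement_group(bundles, strategy="STRICT_SPREAD")
    ray.get(pg.ready())

    @ray.remote
    class Worker:
        def run(self, rank: int, world_size: int):
            return worker_fn(rank, world_size)

    actors = [
        Worker.options(
            num_gpus=bundle["GPU"],
            num_cpus=bundle["CPU"],
            scheduling_strategy=PlacementGroupSchedulingStrategy(
                placement_group=pg, placement_group_bundle_index=i
            ),
        ).remote()
        for i, bundle in enumerate(bundles)
    ]
    return ray.get([a.run.remote(i, len(actors)) for i, a in enumerate(actors)])
```

With the config from the command above and an assumed 4 GPUs per node, `plan_bundles` would request two bundles of 4 GPUs each, and `launch_on_ray` would start one actor per bundle on separate nodes.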
There might be better designs for this; happy to discuss it in a PR or in this issue.