[tx] Make it easy to run on a multi-node Ray cluster #935

Currently we have multi-node support (https://github.com/NovaSky-AI/SkyRL/pull/810), but the processes still have to be started separately on each node.

Instead, we should make it easy to run with a single entry point on a Ray cluster. One way to do this would be to introduce a new backend argument, `enable_ray`; when it is true, the relevant processes would be started automatically as Ray actors and scheduled on the different nodes, for example:
```
uv run --extra aws --extra gpu --extra tinker -m tx.tinker.api --base-model Qwen/Qwen3-8B --backend-config '{"max_lora_adapters": 3, "max_lora_rank": 1, "tensor_parallel_size": 4, "fully_sharded_data_parallel_size": 2, "train_micro_batch_size": 8, "sample_max_num_sequences": 256, "enable_ray": true}' > out1.log
```
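To make the idea more concrete, here is a rough sketch of what the `enable_ray` path could look like. It is only an illustration, not the actual tx implementation: `plan_bundles`, `launch_workers`, `Worker`, and the `gpus_per_node` parameter are hypothetical names, and it assumes the total GPU count is `tensor_parallel_size * fully_sharded_data_parallel_size` with one worker process per node. The Ray side uses a placement group with the `STRICT_SPREAD` strategy so that each worker lands on a different node.

```python
import json
import math


def plan_bundles(backend_config: dict, gpus_per_node: int) -> list[dict]:
    """Return one Ray resource bundle per worker process (hypothetical layout)."""
    if not backend_config.get("enable_ray", False):
        return []
    tp = backend_config.get("tensor_parallel_size", 1)
    fsdp = backend_config.get("fully_sharded_data_parallel_size", 1)
    # Assumption: the job needs tp * fsdp GPUs in total.
    remaining = tp * fsdp
    num_nodes = math.ceil(remaining / gpus_per_node)
    bundles = []
    for _ in range(num_nodes):
        gpus = min(gpus_per_node, remaining)
        bundles.append({"GPU": gpus, "CPU": 1})
        remaining -= gpus
    return bundles


def launch_workers(bundles: list[dict]) -> list:
    """Start one actor per bundle, each on a different node of a running cluster."""
    # Imported lazily so the planning logic above works without Ray installed.
    import ray
    from ray.util.placement_group import placement_group
    from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

    ray.init(address="auto", ignore_reinit_error=True)

    # STRICT_SPREAD requires every bundle to be placed on a distinct node.
    pg = placement_group(bundles, strategy="STRICT_SPREAD")
    ray.get(pg.ready())

    @ray.remote
    class Worker:  # hypothetical stand-in for the real per-node process
        def __init__(self, rank: int):
            self.rank = rank

        def node_ip(self) -> str:
            return ray.util.get_node_ip_address()

    workers = []
    for rank, bundle in enumerate(bundles):
        strategy = PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=rank
        )
        worker = Worker.options(
            num_gpus=bundle["GPU"],
            num_cpus=bundle["CPU"],
            scheduling_strategy=strategy,
        ).remote(rank)
        workers.append(worker)
    return workers


if __name__ == "__main__":
    config = json.loads(
        '{"tensor_parallel_size": 4, "fully_sharded_data_parallel_size": 2, "enable_ray": true}'
    )
    bundles = plan_bundles(config, gpus_per_node=4)
    print(bundles)
    if bundles:
        workers = launch_workers(bundles)
```

With this split, the planning step can be unit-tested without a cluster, and only `launch_workers` needs a running Ray cluster to connect to.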

There might be better designs for this; happy to discuss over a PR or in this issue.
