@kevinmingtarja
Collaborator

@kevinmingtarja kevinmingtarja commented Nov 12, 2025

This PR implements the first of many SkyPilot templates, which aim to simplify writing SkyPilot YAMLs. It adds ~/sky_templates/ray/start_cluster.sh, a script that spins up a user-space Ray cluster for running your own Ray programs, multi-node vLLM inference, or RL with frameworks such as SkyRL and verl.

To use it, simply call it from your run step:

resources:
  cpus: 2+

num_nodes: 2

workdir: .

run: |
  ~/sky_templates/ray/start_cluster.sh

  if [ "$SKYPILOT_NODE_RANK" == "0" ]; then
    python main.py
  fi

# ⚙︎ Job submitted, ID: 1
# ├── Waiting for task resources on 2 nodes.
# └── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
# (head, rank=0, pid=3726) Starting Ray cluster...
# (head, rank=0, pid=3726) Ray is not installed. Installing...
# (head, rank=0, pid=3726) Using Python 3.10.10 environment at: /opt/conda
# ...
# (head, rank=0, pid=3726) Ray 2.51.1 installed successfully.
# (worker1, rank=1, pid=2680, ip=10.0.7.223) Ray 2.51.1 installed successfully.
# (head, rank=0, pid=3726) Starting Ray head node...
# (head, rank=0, pid=3726) 2025-11-12 09:28:59,475        INFO usage_lib.py:447 -- Usage stats collection is disabled.
# (head, rank=0, pid=3726) 2025-11-12 09:28:59,475        INFO scripts.py:914 -- Local node IP: 10.0.7.170
# (worker1, rank=1, pid=2680, ip=10.0.7.223) Starting Ray worker node...
# (worker1, rank=1, pid=2680, ip=10.0.7.223) Waiting for head node at 10.0.7.170:6379...
# (worker1, rank=1, pid=2680, ip=10.0.7.223) Head node not healthy yet. Retrying in 1s...
# (worker1, rank=1, pid=2680, ip=10.0.7.223) Head node is healthy. Starting worker node...
# (head, rank=0, pid=3726) 2025-11-12 09:29:04,648        SUCC scripts.py:950 -- --------------------
# (head, rank=0, pid=3726) 2025-11-12 09:29:04,648        SUCC scripts.py:951 -- Ray runtime started.
# (head, rank=0, pid=3726) 2025-11-12 09:29:04,648        SUCC scripts.py:952 -- --------------------
# ...
# (worker1, rank=1, pid=2680, ip=10.0.7.223) 2025-11-12 09:29:05,139      INFO scripts.py:1095 -- Local node IP: 10.0.7.223
# (head, rank=0, pid=3726) Head node started successfully.
# (head, rank=0, pid=3726) Waiting for all 2 nodes to join...
# ...
# (worker1, rank=1, pid=2680, ip=10.0.7.223) Worker node started successfully.
# (head, rank=0, pid=3726) 2025-11-12 09:29:05,800        WARNING services.py:397 -- Found multiple active Ray instances: {'10.0.7.170:6379', '10.0.7.170:6380'}. Connecting to latest cluster at 10.0.7.170:6379. You can override this by setting the `--address` flag or `RAY_ADDRESS` environment variable.        
# (head, rank=0, pid=3726) All 2 nodes have joined.
# (head, rank=0, pid=3726) 2025-11-12 09:29:11,172        WARNING services.py:397 -- Found multiple active Ray instances: {'10.0.7.170:6380', '10.0.7.170:6379'}. Connecting to latest cluster at 10.0.7.170:6379. You can override this by setting the `--address` flag or `RAY_ADDRESS` environment variable.       
# (head, rank=0, pid=3726) 2025-11-12 09:29:11,172        INFO worker.py:1832 -- Connecting to existing Ray cluster at address: 10.0.7.170:6379...                                                              
# (head, rank=0, pid=3726) 2025-11-12 09:29:11,186        INFO worker.py:2003 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265                                                                 
# (head, rank=0, pid=3726) /opt/conda/lib/python3.10/site-packages/ray/_private/worker.py:2051: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0                                                                    
# (head, rank=0, pid=3726)   warnings.warn(
# (head, rank=0, pid=3726) Sum of squares: 328350
# ✓ Job finished (status: SUCCEEDED).
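
The worker log lines above ("Waiting for head node at ...", "Head node not healthy yet. Retrying in 1s...") suggest a simple retry loop. The sketch below is a hypothetical illustration of that pattern, not the actual contents of start_cluster.sh; the function name `wait_for_head` and its arguments are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a "wait for head node" retry loop.
# This is NOT the actual start_cluster.sh; names and defaults are assumptions.

# Usage: wait_for_head MAX_RETRIES DELAY_SECONDS CHECK_CMD [ARGS...]
# Re-runs CHECK_CMD until it succeeds, or gives up after MAX_RETRIES failures.
wait_for_head() {
  local max_retries="$1"
  local delay="$2"
  shift 2
  local attempt=0

  # "$@" is the health-check command supplied by the caller.
  until "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max_retries" ]; then
      echo "Head node did not become healthy after ${attempt} attempts." >&2
      return 1
    fi
    echo "Head node not healthy yet. Retrying in ${delay}s..."
    sleep "$delay"
  done
  echo "Head node is healthy. Starting worker node..."
}
```

On a worker it could be called as `wait_for_head 60 1 ray health-check --address 10.0.7.170:6379`, assuming the `ray health-check` CLI command is available in the installed Ray version. Passing the check as a command keeps the loop independent of how health is actually probed.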

Docs preview:

https://docs.skypilot.co/en/simple-ray/examples/training/ray.html
https://docs.skypilot.co/en/simple-ray/running-jobs/distributed-jobs.html#executing-a-distributed-ray-program

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • test_ray_basic
    • Manually tested the SkyRL, verl, and Kimi-K2 examples
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@kevinmingtarja kevinmingtarja changed the title from "[Core] One-click user-space Ray cluster" to "[Core][UX] One-click user-space Ray cluster" on Nov 12, 2025
@kevinmingtarja
Collaborator Author

/smoke-test

@kevinmingtarja
Collaborator Author

/smoke-test -k nemo
/smoke-test -k ray
/smoke-test -k nemo --nebius
/smoke-test -k ray --nebius

@kevinmingtarja
Collaborator Author

/smoke-test -k nemo
/smoke-test -k ray

@kevinmingtarja
Collaborator Author

/smoke-test -k nemo

@kevinmingtarja
Collaborator Author

/smoke-test -k nemo
/smoke-test -k ray

Comment on lines 9 to 15
# Environment Variables:
# RAY_HEAD_PORT=6379 - Ray head node port
# RAY_DASHBOARD_PORT=8265 - Ray dashboard port
# RAY_DASHBOARD_HOST=127.0.0.1 - Dashboard host (set to 0.0.0.0 to expose externally)
# RAY_DASHBOARD_AGENT_LISTEN_PORT= - (Optional) Dashboard agent listen port
# RAY_NODE_IP_ADDRESS= - (Optional) Node IP address
# RAY_CMD_PREFIX= - (Optional) Command prefix (e.g., "uv run")
Collaborator Author

This is the minimal set of parameters needed to cover all our examples that use Ray.
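
To illustrate how a script could consume the environment variables listed in the snippet above, here is a hedged sketch that applies the documented defaults and appends the optional flags only when they are set. It is not the actual start_cluster.sh: the function name `build_head_cmd` is hypothetical, and the mapping of each variable to a `ray start` flag is an assumption. The command is printed rather than executed so it can be inspected.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the actual start_cluster.sh.
# Shows one way to turn the documented environment variables into a
# `ray start --head` command line. The variable-to-flag mapping is assumed.

build_head_cmd() {
  # Defaults taken from the documented environment variables.
  local port="${RAY_HEAD_PORT:-6379}"
  local dash_port="${RAY_DASHBOARD_PORT:-8265}"
  local dash_host="${RAY_DASHBOARD_HOST:-127.0.0.1}"

  # RAY_CMD_PREFIX is split on whitespace on purpose (e.g. "uv run").
  local -a cmd=()
  if [ -n "${RAY_CMD_PREFIX:-}" ]; then
    read -r -a cmd <<< "$RAY_CMD_PREFIX"
  fi

  cmd+=(ray start --head
        "--port=${port}"
        "--dashboard-port=${dash_port}"
        "--dashboard-host=${dash_host}")

  # Optional settings are added only when the variable is non-empty.
  if [ -n "${RAY_DASHBOARD_AGENT_LISTEN_PORT:-}" ]; then
    cmd+=("--dashboard-agent-listen-port=${RAY_DASHBOARD_AGENT_LISTEN_PORT}")
  fi
  if [ -n "${RAY_NODE_IP_ADDRESS:-}" ]; then
    cmd+=("--node-ip-address=${RAY_NODE_IP_ADDRESS}")
  fi

  # Print the command instead of running it.
  printf '%s\n' "${cmd[*]}"
}
```

With no variables set, `build_head_cmd` prints `ray start --head --port=6379 --dashboard-port=8265 --dashboard-host=127.0.0.1`. In a run step, a user would presumably export an override such as `RAY_DASHBOARD_HOST=0.0.0.0` before calling ~/sky_templates/ray/start_cluster.sh.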

Collaborator

@Michaelvll Michaelvll left a comment

Thanks @kevinmingtarja! It looks mostly good to me. Please see the comments below.

Also, we should update the Ray example page to include the new template: https://docs.skypilot.co/en/latest/examples/training/ray.html

@kevinmingtarja
Collaborator Author

/build-docs

@github-actions
Contributor

✅ ReadTheDocs build triggered for branch simple-ray

The documentation will be available at: https://docs.skypilot.co/en/simple-ray/

@kevinmingtarja
Collaborator Author

/build-docs

@github-actions
Contributor

✅ ReadTheDocs build triggered for branch simple-ray

The documentation will be available at: https://docs.skypilot.co/en/simple-ray/

@kevinmingtarja
Collaborator Author

Docs preview:

Added references to the start and stop template scripts, and listed the environment variables used to pass arguments to them.

@kevinmingtarja
Collaborator Author

/build-docs

@github-actions
Contributor

✅ ReadTheDocs build triggered for branch simple-ray

The documentation will be available at: https://docs.skypilot.co/en/simple-ray/

@kevinmingtarja
Collaborator Author

Note: I renamed ~/skypilot_templates -> ~/sky_templates to be more consistent with ~/sky_workdir and ~/sky_logs.

@kevinmingtarja
Collaborator Author

/smoke-test -k nemo
/smoke-test -k ray

Collaborator

@Michaelvll Michaelvll left a comment

Thanks @kevinmingtarja! This is awesome! It looks quite good to me.

@kevinmingtarja
Collaborator Author

/smoke-test -k nemo
/smoke-test -k ray

@kevinmingtarja
Collaborator Author

/smoke-test -k ray_basic
/smoke-test -k ray_basic --gcp
/smoke-test -k ray_basic --nebius
/smoke-test -k ray_basic --kubernetes

@kevinmingtarja kevinmingtarja enabled auto-merge (squash) November 14, 2025 05:06
@kevinmingtarja kevinmingtarja merged commit 969e9c4 into master Nov 14, 2025
24 checks passed
@kevinmingtarja kevinmingtarja deleted the simple-ray branch November 14, 2025 05:06