Collaborator

@kevinmingtarja kevinmingtarja commented Aug 14, 2025

Addresses #6634

Users can manually start a stopped cluster through the cloud provider's UI or API, outside of SkyPilot.

When this happens and we then try to refresh the cluster status, the refresh goes as follows:

  1. The cloud API reports that all instances are up.
  2. We therefore proceed to check the Ray status, but this fails with an SSH connection timeout, because the manually started instance may have been assigned a new IP address while we only know the old one.
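The two steps above can be sketched with plain data in place of real cloud and SSH calls. This is only an illustration of the failure mode, not SkyPilot's actual refresh code: `ClusterRecord`, `refresh_status`, and the status strings are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass
class ClusterRecord:
    """Hypothetical stand-in for the locally cached cluster information."""
    name: str
    cached_ip: str  # IP address recorded the last time SkyPilot started the cluster


def refresh_status(record: ClusterRecord,
                   cloud_instances: Dict[str, Tuple[str, str]],
                   reachable_ips: Set[str]) -> str:
    """Mimic the two refresh steps described above.

    cloud_instances maps an instance id to (state, current_ip), as a cloud
    API might report it. reachable_ips is the set of addresses that would
    accept an SSH connection right now.
    """
    # Step 1: ask the cloud API whether every instance is running.
    all_up = all(state == 'running' for state, _ in cloud_instances.values())
    if not all_up:
        return 'STOPPED'
    # Step 2: the runtime check goes over SSH to the *cached* IP, which may
    # be stale if the instance was started outside of SkyPilot.
    if record.cached_ip not in reachable_ips:
        return 'SSH_TIMEOUT'
    return 'UP'


# The instance was restarted manually and received a new public IP.
record = ClusterRecord(name='my-cluster', cached_ip='1.2.3.4')
cloud = {'i-123': ('running', '5.6.7.8')}
status = refresh_status(record, cloud, reachable_ips={'5.6.7.8'})
```

In this scenario the cloud reports the instance as running, yet the check against the cached address cannot connect, which matches the SSH timeout described in step 2.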

Besides the changed IP address (which could be resolved simply by re-fetching the latest data from the cloud API), a manually started cluster has a second problem: it does not have the SkyPilot runtime installed (the check will error with Ray cluster is not found ...). So the only way to recover from this is to run sky start again.

This may not be intuitive for users, so this PR adds a log message that helps them recover from this state.
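A minimal sketch of what such a hint could look like is shown below. The function names, logger name, and message wording are assumptions made for illustration and are not taken from the PR's diff; only the `sky start` command comes from the description above.

```python
import logging
from typing import Callable

logger = logging.getLogger('manual_restart_sketch')  # hypothetical logger name


def recovery_hint(cluster_name: str) -> str:
    """Build a hint telling the user how to recover (wording is illustrative)."""
    return (f'Timed out connecting to cluster {cluster_name!r} over SSH. '
            'If the cluster was started manually outside of SkyPilot, '
            f'run: sky start {cluster_name}')


def check_runtime(cluster_name: str, probe: Callable[[], None]) -> bool:
    """Run a health probe and log the recovery hint if it times out.

    probe is a hypothetical callable standing in for the SSH-based Ray
    status check; it is expected to raise TimeoutError on a timeout.
    """
    try:
        probe()
    except TimeoutError:
        logger.warning(recovery_hint(cluster_name))
        return False
    return True
```

The hint is attached only to the timeout path, so a healthy cluster produces no extra output.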

An IP address change can also happen if a user manually allocates a static public IP address after sky launch. In this case, too, we won't know about the new IP address until the user runs sky start again.

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Added test_aws_manual_restart_recovery
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

Collaborator Author

kevinmingtarja commented Aug 14, 2025

/smoke-test -k test_aws_manual_restart_recovery

We should see SSH connection timeouts in the logs ^

@kevinmingtarja kevinmingtarja changed the title [WIP][Core] Recovery mechanism for manually restarted clusters [Core] Provide escape hatch for recovering manually restarted clusters Aug 14, 2025
@kevinmingtarja kevinmingtarja changed the title [Core] Provide escape hatch for recovering manually restarted clusters [UX] Provide escape hatch for recovering manually restarted clusters Aug 15, 2025
@kevinmingtarja kevinmingtarja marked this pull request as ready for review August 15, 2025 08:07
@Michaelvll Michaelvll linked an issue Aug 15, 2025 that may be closed by this pull request
Collaborator

@SeungjinYang SeungjinYang left a comment

LGTM with minor nits

@kevinmingtarja kevinmingtarja enabled auto-merge (squash) August 22, 2025 22:55
@kevinmingtarja kevinmingtarja merged commit 0fa3a9a into master Aug 22, 2025
16 checks passed
@kevinmingtarja kevinmingtarja deleted the refresh-stale-ips-ssh-timeout branch August 22, 2025 23:01
Development

Successfully merging this pull request may close these issues.

[Nebius] IP changed and SkyPilot UI cannot connect the cluster

3 participants