[UX] Provide escape hatch for recovering manually restarted clusters #6672
/smoke-test -k test_aws_manual_restart_recovery
We should see SSH connection timeouts in the logs ^
/smoke-test -k test_aws_manual_restart_recovery
Got expected error: https://buildkite.com/skypilot-1/smoke-tests/builds/2437/steps/canvas?jid=0198a99d-99ec-4e75-b604-4d4f11da6fbc#0198a99d-99ec-4e75-b604-4d4f11da6fbc/57-571
SeungjinYang
left a comment
LGTM with minor nits
Addresses #6634
There's a possibility that users start a stopped cluster manually through the cloud provider's UI or API.
When this happens, and we try to refresh the cluster status, the following happens:
- `ray status` is run, but this fails with an SSH connection timeout, because the manually started instance could get a new IP address and we only know the old one.

Other than having a different IP address (which could be resolved simply by re-fetching the latest data from the cloud API), the other problem with a manually started cluster is that it doesn't have the SkyPilot runtime installed (it will error with `Ray cluster is not found ...`). So the only way to recover from this is to run `sky start` again.

This might not be intuitive for users, though, so this PR adds a log message to help them recover from this state.
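To make the recovery hint concrete, here is a minimal sketch of the kind of check that could produce such a message. It is not the code in this PR: `ClusterRecord`, `recovery_hint`, and the message wording are hypothetical names chosen for illustration, and the only thing taken from the description above is that the cached IP may be stale and that `sky start` is the way to recover.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterRecord:
    """Hypothetical stand-in for what is stored locally about a cluster."""
    name: str
    cached_ip: str


def recovery_hint(record: ClusterRecord,
                  ssh_reachable: bool,
                  current_ip: Optional[str]) -> Optional[str]:
    """Return a user-facing hint when a status refresh cannot reach the cluster.

    `ssh_reachable` and `current_ip` are assumed to come from the caller
    (an SSH probe and a cloud API query); this sketch does no I/O itself.
    """
    if ssh_reachable:
        # Nothing to recover from: the cached IP still works.
        return None

    hint = (f"Could not reach cluster {record.name!r} over SSH "
            f"at {record.cached_ip}.")
    if current_ip is not None and current_ip != record.cached_ip:
        # The instance was likely restarted or re-addressed outside SkyPilot.
        hint += f" The cloud provider now reports {current_ip}."
    hint += (" If the cluster was restarted manually, run "
             f"`sky start {record.name}` to recover it.")
    return hint
```

For example, `recovery_hint(ClusterRecord("dev", "1.2.3.4"), False, "5.6.7.8")` returns a message that names both IP addresses and ends with the `sky start dev` suggestion, while a reachable cluster yields `None`.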
An IP address change could also happen if a user manually allocates a static public IP address after `sky launch`. In this case, we also won't know of the IP address change until the user runs `sky start` again.

Tested (run the relevant ones):
- `bash format.sh`
- `test_aws_manual_restart_recovery`
- `/smoke-test` (CI) or `pytest tests/test_smoke.py` (local)
- `/smoke-test -k test_name` (CI) or `pytest tests/test_smoke.py::test_name` (local)
- `/quicktest-core` (CI) or `pytest tests/smoke_tests/test_backward_compat.py` (local)