-
Notifications
You must be signed in to change notification settings - Fork 923
[Core] Improve sky down speed for AWS clusters with ports exposed
#6629
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
/smoke-test |
|
/quicktest-core |
SeungjinYang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🏎️
|
/smoke-test |
|
/quicktest-core |
3b559a6 to
f6874f2
Compare
|
/quicktest-core |
|
/smoke-test |
|
/smoke-test |
|
/quicktest-core |
Michaelvll
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @lloyd-brown for making the changes! Just left some small comments : )
| if expected_sg_name != aws_cloud.DEFAULT_SECURITY_GROUP_NAME: | ||
| # Ensure the default security group is created. This is needed | ||
| # to enable us to use the default security group to quickly | ||
| # delete the cluster. If the default security group is not created, | ||
| # we will need to block on instance termination to delete the | ||
| # security group. | ||
| _configure_security_group(ec2, vpc_id, | ||
| aws_cloud.DEFAULT_SECURITY_GROUP_NAME, []) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might not work in the case when user account does not have permission to create security group. Previously, a user can specify the security group name to skip the creation, but now it will always be created? https://docs.skypilot.co/en/latest/reference/config.html#config-yaml-aws-security-group-name
| instances.terminate() | ||
| else: | ||
| # Case 4: We are managing the non-default sg. The default SG does not | ||
| # exist. We must block on instance termination. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We must block on instance termination so that ...
Add a reason to the comment : )
sky down speed for AWS
sky down speed for AWSsky down speed for AWS clusters with ports exposed
* Move instances to default sg instead of wait for terminate. * Create default sg if not exist, wait on terminate if sg none. * Configure security group using vpc id. * Remove extended IP rules. * Update comment on sg configure. * Clean up teardown logic. * Bash format.sh.
This pull requests aims to reduce the amount of time it takes to run sky down on a cluster. The core issue is that when we expose ports we create a new security group that's deletion is blocked on the termination of the created instances. Incorporating Tian's fix we move those instances to the default security group instead of blocking on their termination. I also parallelized the call to change the security group as it's nontrivial (approx. 1s per instance).
This code was tested by creating a cluster with exposed ports with multiple nodes on AWS and then comparing the time it takes to run sky down on that cluster. For example, sky down on a 4 node cluster went from 58 seconds to 13 seconds.
Tested (run the relevant ones):
bash format.sh/smoke-test(CI) orpytest tests/test_smoke.py(local)/smoke-test -k test_name(CI) orpytest tests/test_smoke.py::test_name(local)/quicktest-core(CI) orpytest tests/smoke_tests/test_backward_compat.py(local)