Skip to content

Conversation

@lloyd-brown
Copy link
Collaborator

@lloyd-brown lloyd-brown commented Aug 12, 2025

This pull requests aims to reduce the amount of time it takes to run sky down on a cluster. The core issue is that when we expose ports we create a new security group that's deletion is blocked on the termination of the created instances. Incorporating Tian's fix we move those instances to the default security group instead of blocking on their termination. I also parallelized the call to change the security group as it's nontrivial (approx. 1s per instance).

This code was tested by creating a cluster with exposed ports with multiple nodes on AWS and then comparing the time it takes to run sky down on that cluster. For example, sky down on a 4 node cluster went from 58 seconds to 13 seconds.

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@lloyd-brown lloyd-brown changed the title Move instances to default sg instead of wait for terminate. [Core] Improve sky down speed Aug 12, 2025
@lloyd-brown
Copy link
Collaborator Author

/smoke-test

@lloyd-brown
Copy link
Collaborator Author

/quicktest-core

Copy link
Collaborator

@SeungjinYang SeungjinYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🏎️

@lloyd-brown
Copy link
Collaborator Author

/smoke-test

@lloyd-brown
Copy link
Collaborator Author

/quicktest-core

@lloyd-brown
Copy link
Collaborator Author

/quicktest-core

@lloyd-brown
Copy link
Collaborator Author

/smoke-test

@lloyd-brown
Copy link
Collaborator Author

/smoke-test

@lloyd-brown
Copy link
Collaborator Author

/quicktest-core

@lloyd-brown lloyd-brown merged commit 5ac3578 into master Aug 13, 2025
18 checks passed
@lloyd-brown lloyd-brown deleted the sky-down-aws-speed branch August 13, 2025 05:37
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lloyd-brown for making the changes! Just left some small comments : )

Comment on lines +107 to +114
if expected_sg_name != aws_cloud.DEFAULT_SECURITY_GROUP_NAME:
# Ensure the default security group is created. This is needed
# to enable us to use the default security group to quickly
# delete the cluster. If the default security group is not created,
# we will need to block on instance termination to delete the
# security group.
_configure_security_group(ec2, vpc_id,
aws_cloud.DEFAULT_SECURITY_GROUP_NAME, [])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not work in the case when user account does not have permission to create security group. Previously, a user can specify the security group name to skip the creation, but now it will always be created? https://docs.skypilot.co/en/latest/reference/config.html#config-yaml-aws-security-group-name

instances.terminate()
else:
# Case 4: We are managing the non-default sg. The default SG does not
# exist. We must block on instance termination.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We must block on instance termination so that ...
Add a reason to the comment : )

@Michaelvll Michaelvll changed the title [Core] Improve sky down speed [Core] Improve sky down speed for AWS Aug 13, 2025
@Michaelvll Michaelvll changed the title [Core] Improve sky down speed for AWS [Core] Improve sky down speed for AWS clusters with ports exposed Aug 13, 2025
massaindustries pushed a commit to Seeweb/skypilot that referenced this pull request Aug 26, 2025
* Move instances to default sg instead of wait for terminate.

* Create default sg if not exist, wait on terminate if sg none.

* Configure security group using vpc id.

* Remove extended IP rules.

* Update comment on sg configure.

* Clean up teardown logic.

* Bash format.sh.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants