[Core] Add gRPC autostop service to skylet #6574
/quicktest-core -k test_autostop_functionality
/smoke-test -k test_autostop
/quicktest-core -k test_autostop_functionality --base-branch releases/0.9.3
/quicktest-core -k test_autostop_functionality
/smoke-test -k test_autostop --gcp
/smoke-test --kubernetes
/smoke-test
/smoke-test -k test_launch_fast_with_autostop
/smoke-test -k test_launch_fast_with_autostop
/smoke-test
Edit: Re-ran and passed
You know how our CI format check runs the formatter and fails if it produces any diff? Could we add a similar check for protos, so that CI ensures the generated files are always up to date with the proto files?
… for grpc timeout
/smoke-test
Added! https://github.com/skypilot-org/skypilot/actions/workflows/compile-protos-check.yml
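The idea behind such a check (regenerate, then fail on any diff) can be sketched in a few lines. This is only an illustration, not the contents of the workflow linked above: `snapshot`, `stale_files`, and the `regenerate` callable are hypothetical names, and a real check would more likely invoke the proto compiler and then run `git diff --exit-code`.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, List


def snapshot(directory: Path) -> Dict[str, str]:
    """Map each file under `directory` to the SHA-256 of its contents."""
    return {
        str(p.relative_to(directory)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(directory.rglob('*')) if p.is_file()
    }


def stale_files(directory: Path, regenerate: Callable[[], None]) -> List[str]:
    """Run `regenerate` and return the files it added, removed, or changed.

    An empty list means the checked-in generated files were already up to
    date. `regenerate` is a placeholder for whatever compiles the protos.
    """
    before = snapshot(directory)
    regenerate()
    after = snapshot(directory)
    return sorted(name for name in before.keys() | after.keys()
                  if before.get(name) != after.get(name))
```

A CI job built on this would exit non-zero whenever `stale_files(...)` is non-empty, which mirrors how a formatter check fails when formatting changes anything.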
/smoke-test -k test_autodown --kubernetes
* add gRPC autostop service to skylet
* update SKYLET_VERSION to '17'
* Clarify comment in dependencies.py regarding separation of 'dev' from 'all'
* rename helper fn
* refactor gRPC and protobuf versions in dependencies.py
* add types-protobuf to development requirements
* move dev requirements to requirements-dev.txt
* handle case when remote cluster is running an old skylet pre-gRPC
* implement fallback to legacy execution for is_definitely_autostopping
* fix is_definitely_autostopping
* small clean up
* small clean up
* reduce ssh port forward log level to debug to not break VALIDATE_LAUNCH_OUTPUT
* move protos to sky/schemas/proto
* add instruction for regenerating files from protobuf
* minor cleanup
* add is_grpc_enabled flag to CloudVmRayResourceHandle to avoid waiting for grpc timeout
* refactor pre-commit config, add protobuf compilation check in CI
* small cleanup
* regenerate protos, fix pre-commit isort exclusion list
* fix format.yaml
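One plausible reading of the fallback and the `is_grpc_enabled` flag mentioned in the commits above is sketched here. It is an assumption about the shape of the logic, not the code in this PR: `ClusterHandle` is a trimmed-down stand-in for `CloudVmRayResourceHandle`, and `GrpcUnavailableError` and `call_with_fallback` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar('T')


class GrpcUnavailableError(Exception):
    """Hypothetical stand-in for a connection failure or timeout on the gRPC path."""


@dataclass
class ClusterHandle:
    """Hypothetical, trimmed-down stand-in for CloudVmRayResourceHandle."""
    is_grpc_enabled: bool = True


def call_with_fallback(handle: ClusterHandle, grpc_call: Callable[[], T],
                       legacy_call: Callable[[], T]) -> T:
    """Try the gRPC path first; fall back to the legacy codegen-over-SSH path.

    Once gRPC is found to be unavailable (for example, the remote skylet
    predates the gRPC service), the flag is cleared so later calls skip
    straight to the legacy path instead of waiting for another timeout.
    """
    if handle.is_grpc_enabled:
        try:
            return grpc_call()
        except GrpcUnavailableError:
            handle.is_grpc_enabled = False
    return legacy_call()
```

Remembering the result on the handle is what avoids paying the gRPC timeout on every request to a cluster that runs an old skylet.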
This PR introduces a gRPC service interface for Skylet's autostop configuration functionality, replacing the previous mechanism of generating code using AutostopCodeGen and invoking it over SSH.
We started with autostop because it is a low-risk place to begin the migration to gRPC: there are only two functions (SetAutostop and IsAutostopping). We will quickly follow up with the remaining places that still use codegen (JobLibCodeGen, ManagedJobCodeGen, ServeCodeGen, RayCodegen).
The way this works is by establishing an SSH tunnel using local port forwarding from the API server to the Skylet, and then creating a gRPC channel over this encrypted tunnel. We store the local port number and the PID in CloudVmRayResourceHandle (which is persisted to the DB and lasts for the lifetime of the cluster) so that we can reuse this tunnel across multiple requests.
Given that this is the first time we're using gRPC in this codebase, there are a few things that first need to be decided:
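As an illustration of the tunnel reuse described above, here is a minimal sketch using only the standard library. It assumes a POSIX host; `TunnelInfo` is a hypothetical stand-in for the port and PID stored on the handle, and the helper names, the SSH target, and the remote port in the usage note are placeholders rather than values taken from this PR.

```python
import os
import socket
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TunnelInfo:
    """Hypothetical record of an open tunnel (local port and ssh process PID)."""
    local_port: int
    pid: int


def find_free_port() -> int:
    """Ask the OS for an unused local TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('127.0.0.1', 0))
        return s.getsockname()[1]


def build_forward_cmd(local_port: int, remote_port: int,
                      ssh_target: str) -> List[str]:
    """Build an ssh command that forwards local_port to remote_port on the host.

    -N: do not run a remote command; -L: local port forwarding.
    """
    return ['ssh', '-N', '-L', f'{local_port}:localhost:{remote_port}',
            ssh_target]


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # Signal 0 only checks for existence.
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # The process exists but belongs to another user.
    return True


def port_open(port: int, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on localhost:port."""
    try:
        with socket.create_connection(('127.0.0.1', port), timeout=timeout):
            return True
    except OSError:
        return False


def tunnel_is_reusable(info: Optional[TunnelInfo]) -> bool:
    """A stored tunnel is reusable if its process is alive and its port answers."""
    return (info is not None and pid_alive(info.pid) and
            port_open(info.local_port))
```

When `tunnel_is_reusable` returns False, a caller would start a new tunnel with something like `subprocess.Popen(build_forward_cmd(find_free_port(), remote_port, target))`, store the new port and PID, and then open the gRPC channel against `localhost:<local_port>`.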
Tested (run the relevant ones):
* bash format.sh
* /smoke-test (CI) or pytest tests/test_smoke.py (local)
* /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
* Still investigating one smoke test failure - False alarm: an additional log line from the port forward command caused some output matching in our smoke test to fail; reduced that log line to debug level so it wouldn't break this
* /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)
Performance improvements:
Ran a simple test (gist) on a 2 vCPU 4 GB RAM machine which schedules 50 autostop requests in parallel.
Before: The cluster would hang (and go to INIT state) as it would reach 100% CPU and memory usage
After: No noticeable CPU/memory usage spikes
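For readers without access to the gist, a test of this shape could look roughly like the sketch below. It is not the gist itself: `run_parallel` is a hypothetical helper, and `request_fn` is a placeholder for whatever issues a single autostop request against the cluster.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_parallel(request_fn: Callable[[int], None], n: int = 50) -> List[float]:
    """Issue n requests concurrently and return per-request latencies in seconds.

    `request_fn` receives the request index; it stands in for one autostop
    request and is an assumption of this sketch.
    """

    def timed(i: int) -> float:
        start = time.perf_counter()
        request_fn(i)
        return time.perf_counter() - start

    # One worker per request so that all n requests are in flight together.
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(timed, range(n)))
```

Watching CPU and memory on the cluster while such a loop runs is what distinguishes the before and after behavior reported above; the latencies returned here are only a secondary signal.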