@kevinmingtarja
Collaborator

@kevinmingtarja kevinmingtarja commented Aug 7, 2025

This PR introduces a gRPC service interface for Skylet's autostop configuration functionality, replacing the previous mechanism of generating code using AutostopCodeGen and invoking it over SSH.

We started with autostop because it is a low-risk place to begin the gRPC migration: there are only two functions (SetAutostop and IsAutostopping). We will quickly follow up with the remaining places that still use codegen (JobLibCodeGen, ManagedJobCodeGen, ServeCodeGen, RayCodegen).

This works by establishing an SSH tunnel, using local port forwarding, from the API server to the Skylet, and then creating a gRPC channel over this encrypted tunnel. We store the local port number and the tunnel's PID in CloudVmRayResourceHandle (which is persisted to the db and lasts for the lifetime of the cluster) so that we can reuse the tunnel across multiple requests.
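As a rough illustration of the tunnel reuse described above, here is a minimal sketch. It is not the code in this PR: `TunnelInfo`, `build_port_forward_command`, and `can_reuse_tunnel` are hypothetical names, the port numbers are placeholders, and the liveness check is POSIX-only.

```python
import os
import socket
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TunnelInfo:
    """Hypothetical record; the PR stores the port and PID on the handle."""
    local_port: int
    pid: int


def build_port_forward_command(local_port: int, remote_port: int,
                               ssh_target: str) -> List[str]:
    # -N: run no remote command; -L: forward local_port to remote_port
    # on the remote machine's loopback interface.
    return ['ssh', '-N', '-L', f'{local_port}:localhost:{remote_port}',
            ssh_target]


def is_process_alive(pid: int) -> bool:
    # POSIX only: signal 0 performs the existence check without
    # delivering a signal.
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # The process exists but belongs to another user.
    return True


def is_port_open(port: int, timeout: float = 0.5) -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex(('127.0.0.1', port)) == 0


def can_reuse_tunnel(tunnel: Optional[TunnelInfo]) -> bool:
    # Reuse only if the recorded process is still running and the
    # forwarded local port still accepts connections.
    return (tunnel is not None and is_process_alive(tunnel.pid) and
            is_port_open(tunnel.local_port))
```

If `can_reuse_tunnel` returns False, the caller would start a new tunnel and persist the new port and PID; a gRPC channel could then be opened against the local end, for example with `grpc.insecure_channel(f'localhost:{local_port}')`, since the SSH tunnel already encrypts the traffic.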

Given that this is the first time we're using gRPC in this codebase, there are a few things that first need to be decided:

  1. Directory structure for the protobuf files, as pointed out by [Core] Add gRPC autostop service to skylet #6574 (comment)
  2. Process for generating and committing the generated Python code, and ensuring that the runtime grpc and protobuf versions are compatible with the generated code
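To make the second point concrete, here is a minimal sketch of a version compatibility check. The policy it encodes (same major version, runtime at least as new as the version used for generation) is an assumption for illustration; the authoritative rules are the ones documented by protobuf and gRPC. `parse_version` and `is_runtime_compatible` are hypothetical helpers.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse '5.29.3' into (5, 29, 3); a suffix such as 'rc1' is ignored."""
    parts = []
    for piece in version.split('.'):
        digits = ''
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_runtime_compatible(generated_with: str, runtime: str) -> bool:
    # Assumed policy (not taken from the PR): the runtime must share the
    # major version of the generator and must not be older than it.
    gen = parse_version(generated_with)
    run = parse_version(runtime)
    if not gen or not run:
        return False
    return run[0] == gen[0] and run >= gen
```

A check like this could run at import time or in CI, comparing the version pinned for code generation with the version installed at runtime.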

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
    • One smoke test failure turned out to be a false alarm: an additional log line from the port forward command broke some output matching in our smoke test. That log line is now logged at debug level so it no longer breaks the test.
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

Performance improvements:

Ran a simple test (gist), on a 2 vCPU, 4 GB RAM machine, that schedules 50 autostop requests in parallel.

% sky launch -c bench --infra aws -y prometheus.yaml; sleep 30; for i in {1..50}; do sh -c 'sky autostop -i 60 --down -y bench &'; done

Before: The cluster hung (and went to INIT state) because it reached 100% CPU and memory usage

[Screenshots: 2025-08-10, 11:37:30 PM and 11:37:58 PM]

After: No noticeable CPU/memory usage spikes

[Screenshots: 2025-08-10, 11:54:26 PM and 11:54:44 PM]
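For anyone who wants to repeat the parallel part of the benchmark from a script rather than a shell loop, a minimal sketch follows. It is not the gist linked above: `run_in_parallel` is a hypothetical helper, `AUTOSTOP_COMMAND` simply mirrors the shell command in this description, and `DRY_RUN_COMMAND` is a harmless stand-in for trying the helper without a cluster.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence

# Mirrors the command used in the shell loop; needs a cluster named "bench".
AUTOSTOP_COMMAND = ['sky', 'autostop', '-i', '60', '--down', '-y', 'bench']
# Harmless stand-in that exits immediately with status 0.
DRY_RUN_COMMAND = [sys.executable, '-c', 'pass']


def run_in_parallel(command: Sequence[str], count: int) -> List[int]:
    """Start `count` copies of `command` concurrently; return exit codes."""

    def run_once(_: int) -> int:
        completed = subprocess.run(command,
                                   stdout=subprocess.DEVNULL,
                                   stderr=subprocess.DEVNULL,
                                   check=False)
        return completed.returncode

    # One worker thread per request so all subprocesses overlap.
    with ThreadPoolExecutor(max_workers=count) as pool:
        return list(pool.map(run_once, range(count)))
```

Calling `run_in_parallel(AUTOSTOP_COMMAND, 50)` would issue the same 50 requests; unlike the shell loop, it waits for all of them and reports their exit codes.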

@kevinmingtarja
Collaborator Author

/quicktest-core -k test_autostop_functionality

@kevinmingtarja
Collaborator Author

kevinmingtarja commented Aug 8, 2025

/quicktest-core -k test_autostop_functionality --kubernetes

@kevinmingtarja
Collaborator Author

/smoke-test -k test_autostop

@kevinmingtarja
Collaborator Author

/quicktest-core -k test_autostop_functionality --base-branch releases/0.9.3
/quicktest-core -k test_autostop_functionality --base-branch releases/0.10.0

@kevinmingtarja
Collaborator Author

/quicktest-core -k test_autostop_functionality
/quicktest-core -k test_autostop_functionality --base-branch releases/0.9.3
/smoke-test -k test_autostop

@kevinmingtarja
Collaborator Author

/smoke-test -k test_autostop --gcp
/quicktest-core -k test_autostop_functionality --gcp

@kevinmingtarja kevinmingtarja marked this pull request as ready for review August 8, 2025 17:06
@kevinmingtarja
Collaborator Author

/smoke-test --kubernetes

@kevinmingtarja
Collaborator Author

/smoke-test
/quicktest-core
/quicktest-core --base-branch releases/0.9.3

@kevinmingtarja
Collaborator Author

/smoke-test -k test_launch_fast_with_autostop

@kevinmingtarja
Collaborator Author

/smoke-test -k test_launch_fast_with_autostop

@kevinmingtarja
Collaborator Author

/smoke-test
/quicktest-core
/quicktest-core --base-branch releases/0.9.3

@kevinmingtarja
Collaborator Author

kevinmingtarja commented Aug 9, 2025

The smoke test failures are unrelated to this PR; all of them are caused by Azure quota limits:

Operation could not be completed as it results in exceeding approved Total Regional Cores quota. Additional details - 
Deployment Model: Resource Manager, Location: eastus2, Current Limit: 65, Current Usage: 64, Additional Required: 2, 
(Minimum) New Limit Required: 66

Edit: Re-ran and passed

@SeungjinYang
Collaborator

Do you know how we have the CI format tests that catch if format has been run or not based on CI running the formatter and catching any diffs? Could we have a similar test but with proto so we have CI ensuring the generated file is always up to date with proto files?

@kevinmingtarja
Collaborator Author

/smoke-test
/quicktest-core
/quicktest-core --base-branch releases/0.9.3

@kevinmingtarja
Collaborator Author

Do you know how we have the CI format tests that catch if format has been run or not based on CI running the formatter and catching any diffs? Could we have a similar test but with proto so we have CI ensuring the generated file is always up to date with proto files?

Added! https://github.com/skypilot-org/skypilot/actions/workflows/compile-protos-check.yml
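For readers curious what such a check can look like, here is a minimal sketch of the comparison step. It does not reproduce the linked workflow: `digest_tree` and `find_stale_files` are hypothetical helpers, and the `*_pb2*.py` pattern is an assumption about how the generated files are named.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def digest_tree(root: Path, pattern: str = '*_pb2*.py') -> Dict[str, str]:
    """Map each matching file (path relative to root) to its SHA-256."""
    return {
        path.relative_to(root).as_posix():
            hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob(pattern))
    }


def find_stale_files(committed: Dict[str, str],
                     regenerated: Dict[str, str]) -> List[str]:
    """Return files that are missing, extra, or differ in content."""
    names = set(committed) | set(regenerated)
    return sorted(name for name in names
                  if committed.get(name) != regenerated.get(name))
```

In CI, one option is to regenerate the code into a temporary directory (for example with `python -m grpc_tools.protoc`), call `digest_tree` on both directories, and fail the job if `find_stale_files` returns anything. A simpler alternative is to regenerate in place and fail on any `git diff`, which is how the format check described in the question works.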

@kevinmingtarja kevinmingtarja merged commit 1d1fcea into master Aug 12, 2025
16 checks passed
@kevinmingtarja kevinmingtarja deleted the skylet-grpc-service branch August 12, 2025 22:09
@kevinmingtarja
Collaborator Author

/smoke-test -k test_autodown --kubernetes

massaindustries pushed a commit to Seeweb/skypilot that referenced this pull request Aug 26, 2025
* add gRPC autostop service to skylet

* update SKYLET_VERSION to '17'

* Clarify comment in dependencies.py regarding separation of 'dev' from 'all'

* rename helper fn

* refactor gRPC and protobuf versions in dependencies.py

* add types-protobuf to development requirements

* move dev requirements to requirements-dev.txt

* handle case when remote cluster is running an old skylet pre-gRPC

* implement fallback to legacy execution for is_definitely_autostopping

* fix is_definitely_autostopping

* small clean up

* small clean up

* reduce ssh port forward log level to debug to not break VALIDATE_LAUNCH_OUTPUT

* move protos to sky/schemas/proto

* add instruction for regenerating files from protobuf

* minor cleanup

* add is_grpc_enabled flag to CloudVmRayResourceHandle to avoid waiting for grpc timeout

* refactor pre-commit config, add protobuf compilation check in CI

* small cleanup

* regenerate protos, fix pre-commit isort exclusion list

* fix format.yaml
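
The fallback behaviour named in the commits above (old skylets without gRPC, and an is_grpc_enabled flag to avoid waiting for a gRPC timeout) can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: `Handle`, `GrpcUnavailableError`, and `call_with_fallback` are hypothetical stand-ins, and the real handle is CloudVmRayResourceHandle.

```python
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar('T')


class GrpcUnavailableError(Exception):
    """Hypothetical error raised when the remote skylet has no gRPC server."""


@dataclass
class Handle:
    """Hypothetical stand-in for the persisted cluster handle."""
    is_grpc_enabled: bool = True


def call_with_fallback(handle: Handle, grpc_call: Callable[[], T],
                       legacy_call: Callable[[], T]) -> T:
    if handle.is_grpc_enabled:
        try:
            return grpc_call()
        except GrpcUnavailableError:
            # Remember the failure so later requests skip the gRPC
            # attempt (and its timeout) and go straight to legacy.
            handle.is_grpc_enabled = False
    return legacy_call()
```

With this shape, only the first request against an old cluster pays the cost of a failed gRPC attempt; subsequent requests use the legacy codegen-over-SSH path directly.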