Skip to content

Conversation

@kyuds
Copy link
Collaborator

@kyuds kyuds commented Aug 16, 2025

Replaces ServeCodeGen with a gRPC service in skylet.

TODO:

  • Streaming apis: reserved for future PR (as this PR is already too big)

Notes:

  • had to monkey patch a bunch of stuff for tests. Created RpcRunner to have more fine-grained control over monkey-patching. Otherwise, a lot of tests were working due to "side-effects" of monkey-patches.
  • had to separate serve_rpc_utils due to circular imports (backend_utils)

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@kyuds kyuds marked this pull request as draft August 16, 2025 12:07
@kyuds
Copy link
Collaborator Author

kyuds commented Aug 18, 2025

/quicktest-core
/smoke-test
/smoke-test --serve
/smoke-test --kubernetes

@kyuds
Copy link
Collaborator Author

kyuds commented Aug 18, 2025

smoketest, and smoketest serve pass

@kevinmingtarja I think I'll have to upgrade the api server version (or maybe skylet version?). I am getting backwards compatibility issues because skylet doesn't have the Serve implementation for grpc (nevertheless, is_grpc_enabled is True, naturally because we implemented it for autostop)

@kevinmingtarja
Copy link
Collaborator

kevinmingtarja commented Aug 18, 2025

smoketest, and smoketest serve pass

@kevinmingtarja I think I'll have to upgrade the api server version (or maybe skylet version?). I am getting backwards compatibility issues because skylet doesn't have the Serve implementation for grpc (nevertheless, is_grpc_enabled is True, naturally because we implemented it for autostop)

Yeah good catch, I'm also facing that problem in my PR, and I think upgrading the skylet version is not enough, because we don't auto-upgrade skylet/clusters that are launched by an older API server (we have to re-run sky start IIRC). So I think we have two options:

  1. We catch grpc UNIMPLEMENTED errors, and fallback to legacy SSH codegen execution
except grpc.RpcError as e:
            if e.code() == grpc.StatusCode.UNIMPLEMENTED:
  1. We need a way to detect if a skylet/cluster is deployed by an older API server, and have a way to auto-upgrade Skylet, so that it can always get the latest code. Thinking out loud here, maybe we should store the skylet version on the API server, similar to how in k8s you can see what version of kubelet your nodes are running.

I was planning on just doing (1) for now in #6647, and leave (2) for a follow up PR.

@kyuds
Copy link
Collaborator Author

kyuds commented Aug 18, 2025

  1. We catch grpc UNIMPLEMENTED errors, and fallback to legacy SSH codegen execution
except grpc.RpcError as e:
            if e.code() == grpc.StatusCode.UNIMPLEMENTED:

I was planning on just doing (1) for now in #6647, and leave (2) for a follow up PR.

Got it. Also, I feel it will be helpful for us to have some kind of standards on catching errors that arise with grpc.
I've identified four common errors:

  • grpc.FutureTimeoutError
  • grpc.RpcError
  • exceptions.FetchClusterInfoError (this comes from handle.get_grpc_channel() which routes to like a cluster status updater that will fail when cluster is unavailable)
  • exceptions.SkyletInternalError

The existing codepath exposes CommandError and FetchClusterInfoError, but handling them is extremely inconsistent (some errors are wrapped as RuntimeExceptions while others are just exposed), which makes it really confusing to develop.

@kyuds
Copy link
Collaborator Author

kyuds commented Aug 18, 2025

/smoke-test --serve

@kyuds
Copy link
Collaborator Author

kyuds commented Aug 18, 2025

/smoke-test --kubernetes
/smoke-test --jobs-consolidation

@kevinmingtarja
Copy link
Collaborator

/smoke-test --kubernetes /smoke-test --jobs-consolidation

I saw that the --jobs-consolidation test failure is related to sky.exceptions.CloudError: grpc error (FutureTimeoutError). FYI I also saw those and made it more robust in my PR.

@kyuds
Copy link
Collaborator Author

kyuds commented Aug 19, 2025

/smoke-test -k test_launch_fast_with_autostop --aws --jobs-consolidation

@kyuds kyuds marked this pull request as ready for review August 26, 2025 06:01
@kyuds kyuds requested a review from kevinmingtarja August 26, 2025 06:01
@kyuds
Copy link
Collaborator Author

kyuds commented Sep 16, 2025

/smoke-test --serve

@kyuds
Copy link
Collaborator Author

kyuds commented Sep 16, 2025

/quicktest-core

@kyuds
Copy link
Collaborator Author

kyuds commented Sep 16, 2025

/smoke-test --serve --grpc

@kevinmingtarja
Copy link
Collaborator

/smoke-test --serve --kubernetes --grpc
/smoke-test --serve --kubernetes

Copy link
Collaborator

@kevinmingtarja kevinmingtarja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @kyuds ! Looks solid overall, just a few nits and discussion points.

Copy link
Collaborator

@kevinmingtarja kevinmingtarja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Should we run quicktest once more as a sanity check?

@kyuds
Copy link
Collaborator Author

kyuds commented Sep 17, 2025

/quicktest-core

@kyuds
Copy link
Collaborator Author

kyuds commented Sep 17, 2025

rerunning failed smoke tests (flaky, due to provisioning errors which are unrelated)

@kyuds
Copy link
Collaborator Author

kyuds commented Sep 17, 2025

smoke-tests passing (current failures due to provisioning errors/manual cancel) linking buildkite results below (commit to pre-review commit, but review only was concerned about renaming, etc):
https://buildkite.com/skypilot-1/smoke-tests/builds/3364
https://buildkite.com/skypilot-1/smoke-tests/builds/3364

@kyuds kyuds merged commit 5e5f91d into master Sep 17, 2025
21 of 28 checks passed
@kyuds kyuds deleted the kyuds/serve-grpc branch September 17, 2025 05:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants