@cblmemo
Collaborator

@cblmemo cblmemo commented Aug 8, 2025

This PR adds support for running pools on a remote jobs controller. It also adds resource management across pools and jobs.

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@cblmemo
Collaborator Author

cblmemo commented Aug 11, 2025

/smoke-test
/quicktest-core

@cblmemo
Collaborator Author

cblmemo commented Aug 12, 2025

/smoke-test
/quicktest-core

@cg505 cg505 self-requested a review August 12, 2025 18:44
Collaborator

@cg505 cg505 left a comment

The scheduling logic is getting increasingly complicated.
We should have some comments that explain how it works with pools.
Can we do an audit of the existing comments?

self.graph = nx.DiGraph()
self.name: Optional[str] = None
self.policy_applied: bool = False
self.pool: Optional[str] = None
Collaborator

Since we do not serialize this, we need to make sure it is handled correctly by RESTful admin policy. See #6348.

Collaborator Author

Added a comment

if current_state == state.ManagedJobScheduleState.ALIVE_WAITING:
if not _can_lauch_in_alive_job():
if not (controller_utils.can_provision() or
actual_pool is not None):
Collaborator

Why do we care about the pool here?

Collaborator Author

Because pool jobs do not need to launch. This lets a pool job get scheduled when there is no slot left for provisioning but there is still a slot for a new job controller process.
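
A minimal sketch of the condition discussed here, for readers following the thread. The function and parameter names (`may_leave_alive_waiting`, `can_provision`, `pool`) are hypothetical stand-ins, not the actual scheduler API; `can_provision` stands for the result of `controller_utils.can_provision()` in the snippet above.

```python
from typing import Optional


def may_leave_alive_waiting(can_provision: bool, pool: Optional[str]) -> bool:
    """Illustrative version of the ALIVE_WAITING check from the diff.

    Assumption: a job that targets a pool does not need a provisioning
    slot, so it may proceed even when provisioning is saturated.
    """
    return can_provision or pool is not None


# No provisioning slot is free, but the job targets a pool: it may proceed.
pool_job_ok = may_leave_alive_waiting(can_provision=False, pool="my-pool")
# No provisioning slot is free and there is no pool: the job keeps waiting.
plain_job_ok = may_leave_alive_waiting(can_provision=False, pool=None)
```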

# first. The requirements to launch in an alive job are more
# lenient, so there is no way that we wouldn't be able to launch
# an ALIVE_WAITING job, but we would be able to launch a WAITING
# job.
Collaborator

This comment is no longer true, right? There could be a WAITING job using a pool that can start (doesn't need a LAUNCHING slot), but the ALIVE_WAITING job we fetch is not on a pool and cannot go to LAUNCHING.
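
The counterexample can be made concrete with a small illustrative model. The `Job` class, the `can_start` function, and the rule that a WAITING job needs a free controller-process slot are assumptions pieced together from this thread, not the real scheduler code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    name: str
    state: str  # "ALIVE_WAITING" or "WAITING"
    pool: Optional[str] = None


def can_start(job: Job, launching_slot_free: bool,
              controller_slot_free: bool) -> bool:
    """Assumed start rules, simplified for illustration."""
    # Assumption: pool jobs skip the LAUNCHING slot requirement.
    launch_ok = launching_slot_free or job.pool is not None
    if job.state == "ALIVE_WAITING":
        # The controller process already exists; only launching matters.
        return launch_ok
    # Assumption: a WAITING job also needs a slot for a new controller process.
    return controller_slot_free and launch_ok


jobs: List[Job] = [
    Job("non-pool-job", "ALIVE_WAITING"),
    Job("pool-job", "WAITING", pool="my-pool"),
]

# All LAUNCHING slots are taken, but a controller-process slot is free.
startable = [
    j.name for j in jobs
    if can_start(j, launching_slot_free=False, controller_slot_free=True)
]
```

Under these assumptions only the WAITING pool job is startable, so checking the ALIVE_WAITING job first and stopping when it cannot launch would leave runnable work unscheduled.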

@cblmemo
Collaborator Author

cblmemo commented Aug 12, 2025

/smoke-test
/quicktest-core

Collaborator

@cg505 cg505 left a comment

We will fix the scheduler logic in a follow-up PR.

@cblmemo
Collaborator Author

cblmemo commented Aug 12, 2025

/smoke-test
/quicktest-core

@cblmemo
Collaborator Author

cblmemo commented Aug 12, 2025

/smoke-test
/quicktest-core

@cblmemo
Collaborator Author

cblmemo commented Aug 13, 2025

/smoke-test
/quicktest-core

@cblmemo
Collaborator Author

cblmemo commented Aug 13, 2025

/smoke-test -k test_skyserve_new_autoscaler_update

@cblmemo
Collaborator Author

cblmemo commented Aug 13, 2025

/smoke-test -k test_skyserve_new_autoscaler_update --kubernetes

@cblmemo
Collaborator Author

cblmemo commented Aug 13, 2025

/smoke-test -k test_skyserve_new_autoscaler_update

@cblmemo cblmemo merged commit a78880f into master Aug 13, 2025
17 checks passed
@cblmemo cblmemo deleted the pool-in-remote branch August 13, 2025 18:48
massaindustries pushed a commit to Seeweb/skypilot that referenced this pull request Aug 26, 2025
* init

* check pools

* resources management

* fix

* fix test

* tracking #procs

* refactor

* fix

* fix

* fix

* fix endpoint

* fix ready replicas

* backoff 1

* fix leakage

* locks

* ux

* ux

* dont check can_provision for pool jobs

* fix

* comments

* nit

* nit

* try fix autoscaler

* Revert "try fix autoscaler"

This reverts commit 8f22215.

* new resource limitation