[Pools] Support remote job controller. #6591
Conversation
/smoke-test
…to pool-in-remote
/smoke-test
cg505 left a comment
The scheduling logic is increasingly complicated.
We should have some comments that explain how it works with pools.
Can we do an audit of the existing comments?
self.graph = nx.DiGraph()
self.name: Optional[str] = None
self.policy_applied: bool = False
self.pool: Optional[str] = None
Since we do not serialize this, we need to make sure it is handled correctly by the RESTful admin policy. See #6348.
Added a comment
if current_state == state.ManagedJobScheduleState.ALIVE_WAITING:
    if not _can_lauch_in_alive_job():
    if not (controller_utils.can_provision() or
            actual_pool is not None):
Why do we care about the pool here?
Because pool jobs do not launch new clusters. This helps them get scheduled when there is no slot left for provisioning but there is still a slot for a new job controller process.
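For illustration, here is a minimal sketch of the kind of check being discussed, not the actual scheduler code. `can_provision()` mirrors the name in the diff above, while `has_controller_slot()` is an invented stand-in for the per-controller process limit:

```python
from typing import Optional


def can_provision() -> bool:
    """Stub: whether a provisioning (LAUNCHING) slot is free (assumed for illustration)."""
    return False


def has_controller_slot() -> bool:
    """Stub: whether a new job controller process can be started (assumed for illustration)."""
    return True


def can_start_waiting_job(pool: Optional[str]) -> bool:
    # Pool jobs run on existing pool workers, so they do not need a
    # provisioning (LAUNCHING) slot -- only a controller process slot.
    if pool is not None:
        return has_controller_slot()
    # Non-pool jobs must provision a new cluster first.
    return can_provision() and has_controller_slot()


# With no LAUNCHING slot free, a pool job can still be scheduled:
assert can_start_waiting_job(pool='my-pool')
assert not can_start_waiting_job(pool=None)
```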
# first. The requirements to launch in an alive job are more
# lenient, so there is no way that we wouldn't be able to launch
# an ALIVE_WAITING job, but we would be able to launch a WAITING
# job.
This comment is no longer true, right? There could be a WAITING job using a pool that can start (doesn't need a LAUNCHING slot), but the ALIVE_WAITING job we fetch is not on a pool and cannot go to LAUNCHING.
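To make the concern concrete, here is a toy illustration of that scenario with invented data structures (not SkyPilot internals): a non-pool ALIVE_WAITING job is fetched first and is blocked on a LAUNCHING slot, while a pool WAITING job behind it could start without one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    name: str
    state: str  # 'ALIVE_WAITING' or 'WAITING'
    pool: Optional[str] = None


def launchable(job: Job, launching_slot_free: bool) -> bool:
    # Pool jobs skip the LAUNCHING slot requirement (see the check above).
    return job.pool is not None or launching_slot_free


# ALIVE_WAITING jobs are considered before WAITING jobs.
jobs = [
    Job('a', 'ALIVE_WAITING'),            # not on a pool, needs a slot
    Job('b', 'WAITING', pool='my-pool'),  # could start without a slot
]

# A scheduler that stops at the first non-launchable job would block here,
# even though job 'b' is launchable.
print(launchable(jobs[0], launching_slot_free=False))  # False
print(launchable(jobs[1], launching_slot_free=False))  # True
```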
/smoke-test
cg505 left a comment
We will fix the scheduler logic in a follow-up PR.
/smoke-test
/smoke-test
/smoke-test
/smoke-test -k test_skyserve_new_autoscaler_update
/smoke-test -k test_skyserve_new_autoscaler_update --kubernetes
This reverts commit 8f22215.
/smoke-test -k test_skyserve_new_autoscaler_update
* init
* check pools
* resources management
* fix
* fix test
* tracking #procs
* refactor
* fix
* fix
* fix
* fix endpoint
* fix ready replicas
* backoff 1
* fix leakage
* locks
* ux
* ux
* dont check can_provision for pool jobs
* fix
* comments
* nit
* nit
* try fix autoscaler
* Revert "try fix autoscaler" (This reverts commit 8f22215.)
* new resource limitation
This PR supports running pools on a remote job controller. It also adds resource management across pools and jobs.
Tested (run the relevant ones):
bash format.sh
/smoke-test (CI) or pytest tests/test_smoke.py (local)
/smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
/quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)