[jobs] limit the max number of jobs that can run/launch #7488
Conversation
Previously, after #7051, the number of jobs launching/running scaled purely with the controller resources. After this change, we set hard maximums on how many jobs can run and launch at once, since some bottlenecks (e.g. the DB) will not necessarily scale with the instance resources.
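For intuition only, here is a minimal sketch of how a hard cap can sit on top of a resource-derived limit. The constant names (`MAX_RUNNING_JOBS_CAP`, `MAX_LAUNCHING_JOBS_CAP`), their values, and the per-job memory figures are assumptions for illustration, not the merged SkyPilot code:

```python
# A minimal sketch, assuming hypothetical constant names and values; the real
# limits and resource heuristics live in sky/jobs/scheduler.py and may differ.
MAX_RUNNING_JOBS_CAP = 2000
MAX_LAUNCHING_JOBS_CAP = 128


def running_job_limit(controller_memory_mb: int,
                      mem_per_job_mb: int = 350) -> int:
    # Scale with the controller's resources...
    resource_based = max(1, controller_memory_mb // mem_per_job_mb)
    # ...but never beyond the hard cap, since some bottlenecks (e.g. the jobs
    # DB) do not grow with the instance size.
    return min(resource_based, MAX_RUNNING_JOBS_CAP)


def launching_job_limit(controller_memory_mb: int,
                        mem_per_launch_mb: int = 1024) -> int:
    resource_based = max(1, controller_memory_mb // mem_per_launch_mb)
    return min(resource_based, MAX_LAUNCHING_JOBS_CAP)
```

For example, with `controller_memory_mb=1_000_000` the resource-based value would be 2857, which the cap brings down to 2000.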
/smoke-test --managed-jobs
Michaelvll left a comment:
Thanks @cg505! LGTM, as long as it is tested. : )
sky/jobs/scheduler.py (Outdated)

# Limit the number of jobs that can be running at once on the entire jobs
# controller cluster. It's hard to handle cancellation of more than 2000 jobs at
# once.
MAX_TOTAL_JOBS = 2000
Add a TODO here for getting rid of this in the future
Also, rename to MAX_RUNNING_JOBS? TOTAL sounds like including PENDING and finished jobs.
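Taken together, the two suggestions above might look roughly like this; the exact wording of the TODO and the final constant name are up to the author, so treat this as illustrative only:

```python
# TODO: remove this hard cap once cancellation and the jobs DB can keep up
# with whatever the controller resources would otherwise allow.
# Limit the number of jobs that can be running at once on the entire jobs
# controller cluster. It's hard to handle cancellation of more than 2000
# jobs at once.
MAX_RUNNING_JOBS = 2000
```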
sky/jobs/scheduler.py (Outdated)

# LAUNCHES_PER_WORKER = 16
# JOBS_PER_WORKER = 400

# keep 1GB reserved after the controllers
Not aligned with the 2048 below?
this is just moved, but good catch
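For context on the reservation math, a hedged sketch of how a reserved buffer and a per-controller memory estimate could be kept consistent. The names and values here are assumptions, not the constants from the diff; if the code uses 2048 MB, the "1GB" comment should be updated to match (or vice versa):

```python
# Illustrative values only; not the actual constants from the diff.
MEMORY_PER_CONTROLLER_MB = 2048  # assumed per-controller memory footprint
RESERVED_MEMORY_MB = 2048        # buffer kept free after the controllers; the
                                 # inline comment should say "2GB" if this stays
                                 # at 2048 (or the value should be 1024).


def controllers_that_fit(total_memory_mb: int) -> int:
    """How many controller processes fit after reserving the buffer."""
    usable_mb = max(0, total_memory_mb - RESERVED_MEMORY_MB)
    return max(1, usable_mb // MEMORY_PER_CONTROLLER_MB)
```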
CURRENT_HASH = os.path.expanduser('~/.sky/wheels/current_sky_wheel_hash')


def get_number_of_controllers() -> int:
Should we add an LRU cache for this?
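If the controller count is effectively static for the lifetime of the process, `functools.lru_cache` is a one-line change. A sketch with an assumed placeholder body, since the real `get_number_of_controllers()` inspects controller state:

```python
import functools


@functools.lru_cache(maxsize=1)
def get_number_of_controllers() -> int:
    # Placeholder body so the sketch is runnable; the real function queries
    # the controller cluster. With lru_cache, a later change in the controller
    # count is not observed unless get_number_of_controllers.cache_clear()
    # is called.
    return 1
```

The main design question the comment raises is invalidation: caching is safe only if the number of controllers cannot change while the scheduler process is running.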
Tested (run the relevant ones):
- `bash format.sh`
- `/smoke-test` (CI) or `pytest tests/test_smoke.py` (local)
- `/smoke-test -k test_name` (CI) or `pytest tests/test_smoke.py::test_name` (local)
- `/quicktest-core` (CI) or `pytest tests/smoke_tests/test_backward_compat.py` (local)