-
Notifications
You must be signed in to change notification settings - Fork 919
[Pool][Serve] Resource management coordination for jobs/pool; Move to SDK + Threading for SkyServe provision/terminate. #7332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
/smoke-test |
|
/quicktest-core |
|
/smoke-test |
|
/quicktest-core |
|
also cc @andylizf who proposed this change (using thread) a while ago |
|
@cblmemo My static type checker gives [{
"resource": "/home/andyl/skypilot/sky/jobs/controller.py",
"owner": "pyright",
"severity": 8,
"message": "\"LAUNCHES_PER_WORKER\" is not a known attribute of module \"sky.jobs.scheduler\"",
"startLineNumber": 1122,
"startColumn": 44,
"endLineNumber": 1122,
"endColumn": 63,
"modelVersionId": 3
}] |
|
/smoke-test |
|
/smoke-test |
|
/smoke-test |
|
/smoke-test |
andylizf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some discussions.
Co-authored-by: Andy Lee <andylizf@outlook.com>
|
/smoke-test -k new_autoscaler_update --gcp |
|
/smoke-test --serve -k autoscaler |
|
/smoke-test --serve |
|
/smoke-test |
|
/smoke-test --serve -k autoscaler |
|
/smoke-test --serve -k autoscaler |
|
/smoke-test --serve |
This reverts commit 8c607d2.
|
/smoke-test --serve -k autoscaler |
sky/skylet/constants.py
Outdated
| OVERRIDE_CONSOLIDATION_MODE = 'IS_SKYPILOT_JOB_CONTROLLER' | ||
| IS_SKYPILOT_SERVE_CONTROLLER = 'IS_SKYPILOT_SERVE_CONTROLLER' | ||
|
|
||
| OVERRIDE_CONCURRENT_LAUNCHES = 'OVERRIDE_CONCURRENT_LAUNCHES' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename the env var to SKYPILOT_SERVE_OVERRIDE_CONCURRENT_LAUNCHES?
|
/smoke-test --serve -k autoscaler |
|
/smoke-test --serve |
|
All test passed. Merging now! |
This causes issues with sdk usage, which may invoke the client multiple times with different env vars within the same process. Correctness here is worth the miniscule performance hit. This fixes the smoke test `test_managed_jobs_config_labels_isolation`. That test technically regressed in skypilot-org#7966, but would only fail when both jobs were claimed by the same job controller process, which is rare. After skypilot-org#7332, the test would only have a single job controller process and started consistently failing.
* [client] remove client-side cache for request payload env vars This causes issues with sdk usage, which may invoke the client multiple times with different env vars within the same process. Correctness here is worth the miniscule performance hit. This fixes the smoke test `test_managed_jobs_config_labels_isolation`. That test technically regressed in #7966, but would only fail when both jobs were claimed by the same job controller process, which is rare. After #7332, the test would only have a single job controller process and started consistently failing. * fix test
This PR adds the resource management for both pool/serve and jobs controllers.
Rules:
MAXIMUM_CONTROLLER_RESERVED_MEMORY_MB, and the remaining resources is dedicated to the API server;POOL_JOBS_RESOURCES_RATIO.This PR also move SkyServe's process-based launch/termination into API Server calls and using threads.
Tested (run the relevant ones):
bash format.sh/smoke-test(CI) orpytest tests/test_smoke.py(local)/smoke-test -k test_name(CI) orpytest tests/test_smoke.py::test_name(local)/quicktest-core(CI) orpytest tests/smoke_tests/test_backward_compat.py(local)