Skip to content

Conversation

@cblmemo
Copy link
Collaborator

@cblmemo cblmemo commented Sep 23, 2025

This PR adds the resource management for both pool/serve and jobs controllers.

Rules:

  1. We statically reserve some space for pool/serve and jobs controllers.
    • In consolidation mode, we reserve MAXIMUM_CONTROLLER_RESERVED_MEMORY_MB, and the remaining resources is dedicated to the API server;
    • In remote job controller, we start a local API server in non-deploy mode, and all resources is given to the controllers.
  2. The reserved resources is statically partitioned into 2 parts, controlled by POOL_JOBS_RESOURCES_RATIO.

This PR also move SkyServe's process-based launch/termination into API Server calls and using threads.

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@cblmemo
Copy link
Collaborator Author

cblmemo commented Sep 23, 2025

/smoke-test
/quicktest-core

@cblmemo
Copy link
Collaborator Author

cblmemo commented Sep 23, 2025

/quicktest-core

@cblmemo
Copy link
Collaborator Author

cblmemo commented Sep 23, 2025

/smoke-test
/quicktest-core

@cblmemo
Copy link
Collaborator Author

cblmemo commented Sep 23, 2025

/quicktest-core

@cblmemo
Copy link
Collaborator Author

cblmemo commented Sep 24, 2025

also cc @andylizf who proposed this change (using thread) a while ago

@andylizf
Copy link
Collaborator

andylizf commented Sep 24, 2025

@cblmemo My static type checker gives
Curious why pylint is not catching it.

[{
	"resource": "/home/andyl/skypilot/sky/jobs/controller.py",
	"owner": "pyright",
	"severity": 8,
	"message": "\"LAUNCHES_PER_WORKER\" is not a known attribute of module \"sky.jobs.scheduler\"",
	"startLineNumber": 1122,
	"startColumn": 44,
	"endLineNumber": 1122,
	"endColumn": 63,
	"modelVersionId": 3
}]

@cblmemo
Copy link
Collaborator Author

cblmemo commented Sep 24, 2025

/smoke-test
/quicktest-core

@cblmemo
Copy link
Collaborator Author

cblmemo commented Sep 24, 2025

/smoke-test
/quicktest-core

@cblmemo
Copy link
Collaborator Author

cblmemo commented Sep 25, 2025

/smoke-test
/quicktest-core

@cblmemo
Copy link
Collaborator Author

cblmemo commented Sep 25, 2025

/smoke-test
/quicktest-core

Copy link
Collaborator

@andylizf andylizf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some discussions.

cblmemo and others added 3 commits September 25, 2025 12:28
@cblmemo
Copy link
Collaborator Author

cblmemo commented Sep 25, 2025

/smoke-test -k new_autoscaler_update --gcp

@cblmemo
Copy link
Collaborator Author

cblmemo commented Nov 22, 2025

/smoke-test --serve -k autoscaler

@cblmemo
Copy link
Collaborator Author

cblmemo commented Nov 22, 2025

/smoke-test --serve
/quicktest-core

@cblmemo
Copy link
Collaborator Author

cblmemo commented Nov 22, 2025

/smoke-test
/quicktest-core

@cblmemo
Copy link
Collaborator Author

cblmemo commented Nov 22, 2025

/smoke-test --serve -k autoscaler

@cblmemo
Copy link
Collaborator Author

cblmemo commented Nov 22, 2025

/smoke-test --serve -k autoscaler

@cblmemo
Copy link
Collaborator Author

cblmemo commented Nov 22, 2025

/smoke-test --serve

@cblmemo
Copy link
Collaborator Author

cblmemo commented Nov 22, 2025

/smoke-test --serve -k autoscaler

OVERRIDE_CONSOLIDATION_MODE = 'IS_SKYPILOT_JOB_CONTROLLER'
IS_SKYPILOT_SERVE_CONTROLLER = 'IS_SKYPILOT_SERVE_CONTROLLER'

OVERRIDE_CONCURRENT_LAUNCHES = 'OVERRIDE_CONCURRENT_LAUNCHES'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename the env var to SKYPILOT_SERVE_OVERRIDE_CONCURRENT_LAUNCHES?

@cblmemo
Copy link
Collaborator Author

cblmemo commented Nov 22, 2025

/smoke-test --serve -k autoscaler

@cblmemo
Copy link
Collaborator Author

cblmemo commented Nov 23, 2025

/smoke-test --serve
/quicktest-core

@cblmemo
Copy link
Collaborator Author

cblmemo commented Nov 23, 2025

All test passed. Merging now!

@cblmemo cblmemo merged commit 5a380e9 into master Nov 23, 2025
30 of 35 checks passed
@cblmemo cblmemo deleted the resource-mgmt-pool-with-job branch November 23, 2025 01:13
cg505 added a commit to cg505/skypilot that referenced this pull request Nov 25, 2025
This causes issues with sdk usage, which may invoke the client
multiple times with different env vars within the same process.
Correctness here is worth the miniscule performance hit.

This fixes the smoke test `test_managed_jobs_config_labels_isolation`.
That test technically regressed in skypilot-org#7966, but would only fail when
both jobs were claimed by the same job controller process, which is
rare. After skypilot-org#7332, the test would only have a single job controller
process and started consistently failing.
@lloyd-brown lloyd-brown mentioned this pull request Nov 25, 2025
5 tasks
cg505 added a commit that referenced this pull request Nov 26, 2025
* [client] remove client-side cache for request payload env vars

This causes issues with sdk usage, which may invoke the client
multiple times with different env vars within the same process.
Correctness here is worth the miniscule performance hit.

This fixes the smoke test `test_managed_jobs_config_labels_isolation`.
That test technically regressed in #7966, but would only fail when
both jobs were claimed by the same job controller process, which is
rare. After #7332, the test would only have a single job controller
process and started consistently failing.

* fix test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants