Skip to content

Conversation

@cg505
Copy link
Collaborator

@cg505 cg505 commented Oct 6, 2025

Previously, after #7051, we allowed the number of jobs launching/running to scale purely based on the controller resources.

After this change, we set maximum values for the number that can run and launch, since there are some bottlenecks (e.g. DB) that will not necessarily scale with the instance resources.

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • manually set the constants to low values to make sure they took effect as expected
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

Previously, after skypilot-org#7051, we allowed the number of jobs
launching/running to scale purely based on the controller resources.

After this change, we set maximum values for the number that can run
and launch, since there are some bottlenecks (e.g. DB) that will not
necessarily scale with the instance resources.
@cg505
Copy link
Collaborator Author

cg505 commented Oct 6, 2025

/smoke-test --managed-jobs

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cg505! LGTM, as long as it is tested. : )

# Limit the number of jobs that can be running at once on the entire jobs
# controller cluster. It's hard to handle cancellation of more than 2000 jobs at
# once.
MAX_TOTAL_JOBS = 2000
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a TODO here for getting rid of this in the future

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, rename to MAX_RUNNING_JOBS? TOTAL sounds like including PENDING and finished jobs.

# LAUNCHES_PER_WORKER = 16
# JOBS_PER_WORKER = 400

# keep 1GB reserved after the controllers
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not aligned with the 2048 below?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just moved, but good catch

CURRENT_HASH = os.path.expanduser('~/.sky/wheels/current_sky_wheel_hash')


def get_number_of_controllers() -> int:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add a lru cache for this?

@cg505 cg505 enabled auto-merge (squash) October 7, 2025 02:50
@cg505 cg505 merged commit 265eb04 into skypilot-org:master Oct 7, 2025
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants