Skip to content

Conversation

@kevinmingtarja
Copy link
Collaborator

@kevinmingtarja kevinmingtarja commented Jul 23, 2025

This PR introduces a new Autostop config, wait_for, which determines the condition to check for when resetting the idleness timer. This option works in conjunction with idle_minutes. This config is an enum:

  1. "jobs_and_ssh" (default) - Wait for all jobs to complete AND all SSH sessions to disconnect, thus fixing [Core] SSH sessions should count as non-idle autostop #6292
  2. "jobs" - Wait for all jobs to complete.
  3. "none" - Stop immediately after idle time expires, regardless of running jobs or SSH connections.

With the introduction of "none" for doing hard stops, we should also rename idle_minutes, as it might be confusing for users. We can do that in a follow-up PR, along with making the field accept arbitrary time units (not only minutes). I've created #6382 to track that.


Context:
Today, idleness for autodown/stop is measured from the completion of all tasks in the cluster's queue. However, a user may be SSH'd in, for example to debug some stuff. So we should count SSH activity as non-idle.

This PR implements this by listing the contents of /dev/pts (the virtual filesystem where these pseudo-terminal devices reside) and counting them (excluding ptmx).

$ ls /dev/pts
0  1  2  3  ptmx

Initially, we considered using who, which is part of GNU coreutils. The command is simple, it just prints a list of users who are currently logged in. But we found out that SSH sessions from Cursor (probably VS Code too) did not show up under who (thanks @concretevitamin for reminding to test this path!), likely because their Remote-SSH extension does not spawn an interactive login process.

$ who
gcpuser  pts/0        2025-07-23 20:32 (12.121.71.90)
gcpuser  pts/1        2025-07-23 20:33 (12.121.71.90)

Another approach considered was:

$ ps ax | grep -v grep | grep "sshd:.*@"
   1655 ?        S      0:00 sshd: gcpuser@notty
   3669 ?        S      0:00 sshd: gcpuser@notty
   3977 ?        S      0:00 sshd: gcpuser@pts/0
   4074 ?        S      0:00 sshd: gcpuser@pts/1
   4160 ?        S      0:00 sshd: gcpuser@notty

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Documentations (for user-facing changes)
  • Any manual or new tests for this PR (please specify below)
    - New smoke tests: test_autostop_wait_for_jobs_and_ssh, test_autostop_wait_for_none
    - Manually tested by ssh'ing from the terminal
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@kevinmingtarja
Copy link
Collaborator Author

/smoke-test -k test_autostop_with_ssh_sessions

@kevinmingtarja
Copy link
Collaborator Author

/smoke-test -k test_autostop

@kevinmingtarja
Copy link
Collaborator Author

/smoke-test -k test_autostop_with_ssh_sessions

@kevinmingtarja
Copy link
Collaborator Author

/smoke-test -k test_autostop_with_ssh_sessions --gcp
/smoke-test -k test_autostop_with_ssh_sessions --azure

@kevinmingtarja
Copy link
Collaborator Author

/smoke-test -k test_autostop_with_ssh_sessions --aws
/smoke-test -k test_autostop_with_ssh_sessions --gcp
/smoke-test -k test_autostop_with_ssh_sessions --azure

@kevinmingtarja kevinmingtarja enabled auto-merge (squash) July 23, 2025 23:00
@kevinmingtarja kevinmingtarja disabled auto-merge July 23, 2025 23:18
@kevinmingtarja
Copy link
Collaborator Author

/smoke-test -k test_autostop_wait_for_jobs --aws
/smoke-test -k test_autostop_wait_for_jobs --gcp
/smoke-test -k test_autostop_wait_for_jobs --azure
/smoke-test -k test_autostop_wait_for_jobs_and_ssh --aws
/smoke-test -k test_autostop_wait_for_jobs_and_ssh --gcp
/smoke-test -k test_autostop_wait_for_jobs_and_ssh --azure
/smoke-test -k test_autostop_wait_for_none --aws
/smoke-test -k test_autostop_wait_for_none --gcp
/smoke-test -k test_autostop_wait_for_none --azure

@kevinmingtarja
Copy link
Collaborator Author

/smoke-test -k test_autostop_wait_for_jobs_and_ssh --gcp
/smoke-test -k test_autostop_wait_for_jobs_and_ssh --azure

@kevinmingtarja
Copy link
Collaborator Author

/smoke-test -k test_autostop_wait_for_none

@kevinmingtarja kevinmingtarja changed the title [Core] Check for active SSH sessions when checking cluster idleness [Core] Add wait_for in Autostop config, check for active SSH sessions by default Jul 25, 2025
@kevinmingtarja
Copy link
Collaborator Author

/quicktest-core

@kevinmingtarja
Copy link
Collaborator Author

/smoke-test -k test_autostop_wait_for_jobs
/smoke-test -k test_autostop_wait_for_jobs_and_ssh
/smoke-test -k test_autostop_wait_for_none
/smoke-test -k test_autostop_with_unhealthy_ray_cluster

@kevinmingtarja
Copy link
Collaborator Author

/smoke-test -k test_autostop_wait_for_jobs
/smoke-test -k test_autostop_wait_for_jobs_and_ssh
/smoke-test -k test_autostop_wait_for_none
/smoke-test -k test_autostop_with_unhealthy_ray_cluster

@kevinmingtarja
Copy link
Collaborator Author

/smoke-test -k test_autostop_wait_for_jobs
/smoke-test -k test_autostop_wait_for_jobs_and_ssh
/smoke-test -k test_autostop_wait_for_none
/smoke-test -k test_autostop_with_unhealthy_ray_cluster

@kevinmingtarja
Copy link
Collaborator Author

/quicktest-core

@kevinmingtarja
Copy link
Collaborator Author

/smoke-test -k test_autostop_wait_for_jobs
/smoke-test -k test_autostop_wait_for_jobs_and_ssh
/smoke-test -k test_autostop_wait_for_none
/smoke-test -k test_autostop_with_unhealthy_ray_cluster

@kevinmingtarja kevinmingtarja merged commit 9b4ac32 into master Aug 1, 2025
20 checks passed
@kevinmingtarja kevinmingtarja deleted the kevin/sky-2583-core-ssh-sessions-should-count-as-non-idle-autostop branch August 1, 2025 23:55
@Michaelvll
Copy link
Collaborator

we should check the backward compatibility for diff client server version, considering we have some API changes?

@zpoint
Copy link
Collaborator

zpoint commented Aug 2, 2025

/quicktest-core

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants