@gmittal gmittal commented Nov 24, 2021

Also adds support for multiple clusters!

@gmittal gmittal force-pushed the user/gmittal/cli-v2 branch from d589e8f to d95cda7 Compare November 29, 2021 01:56
@concretevitamin concretevitamin left a comment

Add TODOs on making sure the Docker examples all pass?

First look, still need to finish reading cli.py.


# TODO: cd into workdir immediately on the VM
# TODO: Delete the temporary cluster config yml (or figure out a way to re-use it)
cloud_vm_ray_backend._run(f'ray attach {handle} --tmux')

Calling a private function is an antipattern, which signals to me that we should expose run() in backend_utils so it can be called here.

NOTE TO SELF: check if we want to break Backend abstractions

@@ -0,0 +1,59 @@
import sqlite3

NOTE TO SELF: good dir to put in? one-line docs on module, methods.
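
For reference, a minimal sketch of what a small, documented sqlite3-backed state module could look like. The table name `clusters`, its columns, and every function name here are hypothetical and are not taken from this PR's code; the default `:memory:` path is only so the sketch runs without touching disk.

```python
"""Sketch of a cluster state store backed by sqlite3 (names are hypothetical)."""
import sqlite3
import time
from typing import List, Optional, Tuple


def open_state_db(path: str = ':memory:') -> sqlite3.Connection:
    """Opens (or creates) the database and ensures the table exists."""
    conn = sqlite3.connect(path)
    conn.execute('CREATE TABLE IF NOT EXISTS clusters ('
                 'name TEXT PRIMARY KEY, launched_at INTEGER, handle TEXT)')
    conn.commit()
    return conn


def add_or_update_cluster(conn: sqlite3.Connection,
                          name: str,
                          handle: str,
                          launched_at: Optional[int] = None) -> None:
    """Inserts a cluster row, replacing any existing row with the same name."""
    if launched_at is None:
        launched_at = int(time.time())
    conn.execute('INSERT OR REPLACE INTO clusters VALUES (?, ?, ?)',
                 (name, launched_at, handle))
    conn.commit()


def get_clusters(conn: sqlite3.Connection) -> List[Tuple[str, int, str]]:
    """Returns all (name, launched_at, handle) rows, oldest first."""
    return conn.execute('SELECT name, launched_at, handle FROM clusters '
                        'ORDER BY launched_at').fetchall()
```

Keying the table on the cluster name means a relaunch under the same name overwrites the old row instead of accumulating stale entries.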

@concretevitamin concretevitamin left a comment

Remember to leave out prototype/~/.sky/session.db!

@gmittal gmittal removed the request for review from romilbhardwaj December 2, 2021 00:53
@gmittal gmittal merged commit e0d15fa into master Dec 2, 2021
@gmittal gmittal deleted the user/gmittal/cli-v2 branch December 2, 2021 01:20
frank-lsf added a commit that referenced this pull request Dec 9, 2021
Make 2nd run of "sky run" / "python app.py" skip setup; more examples. (#52)

* Make 2nd run of "sky run" / "python app.py" skip setup. More examples.

* Fix workflows yaml.

LocalDockerBackend for debugging (#35)

working LocalDockerBackend

Show logs on docker build failure (#53)

Stream build logs on failure and default to remove container on completion

Usability: basic "sky run" distributed task; rsync .git; more docs. (#54)

* aws: move region order

* Do not exclude .git from workdir by default; fixes #42.

* Task: document, num_nodes in from_yaml(), 'run' str for >1 nodes.

* Example: multi_hostname.{yaml,py}, and some fixes to make it work

* Check 'run' field.

Add logging for execution_v2 (#49)

* Add logging for provisioning

* Log task output into a file for `one_node`

* Fix format issues and support ParTask logging

* Fix info message

* Add docstr to logging.py

* Fix tail -f information for provision

* Fix teardown

* Fix codegen, format and other issues

Fix logs sync for n_node (#57)

* Fix logs rsync down for n_node

Prettier paths (#50)

Update .gitignore for sky_logs (#55)

* Update .gitignore

* Add *.log to gitignore

Docker-in-Docker support and bugfixes (#65)

* Fix ping example and debugging commands printed to the user

* Add default docker image

* Add check for container status in post_execute

* lint

* PR comments

containerized app example (#60)

* containerized app example

* setup command fixes and dind notes

* mkdir fix

Fix the incomplete logs (#69)

Fix T4 (#62)

CLI with cluster handles and interactive sessions (#58)

Add interactive sessions (`gpunode`, etc.), multiple clusters support, cluster handles, and `sky status`.

Make pylint pass. (#76)

* First pass @ making pylint happy; TODO: a few more to solve; test.

* Fix a few more pylint errors, and some minor fixes

* Final pylint errors

* Fix more pylint

* Debug pytest issue

* Accidentally removed a useful try-catch; reverted & documented

Add Detectron2 example (#74)

Cleanups: remove V1 yaml templates; update pylint.yml (#77)

* Cleanups: remove V1 yaml templates; update pylint.yml

* Add cluster_name: Optional[str] = None to the 2 Backend impls

Fix script filling to make tpu_app run again. (#79)

Add support for cross-cloud retry (#70)

* Support multi-cloud provision retry

* Fix pylint issues

Update .gitignore for tpu shell script (#80)

Horovod Example (#29)

* Horovod  Example

* Yapf fix

* Fix

Co-authored-by: Michael Luo <michaelluo@MacBook-Pro.local>

Add jupyter notebook example (#34)

* Add jupyter notebook example

* Change to new execute

Fix cross-cloud retry for CLI and CLI (#84)

* Do not fail the program when dag and optimizer are not registered and fix cli

Support file_mounts destination paths that require root access. (#68)

Fixes #39.

Both these should now work:

```
/data/<25GB dir>: /data/<25GB dir>
/data/<25GB dir>: gs://<dir>
```

Added examples/using_file_mounts.yaml as a user guide.

* Support file_mounts destination paths that require root access.

* Enable more robust file mounts.

* Rename to using_file_mounts.yaml.

* Use ~/.sky/file_mounts/

* Remove V1 code to avoid confusion.

* Add slashes to existing file mounts examples.

* Revert "Add slashes to existing file mounts examples."

This reverts commit 53bade0.

* Add storage.is_directory() to get rid of the weird slashes

* Cleanup in execution.py

Add example for resnet training with tensorboard (#45)

* add an example for resnet training with tensorboard

[CLI] "sky down" and move cli.py to the core package. (#85)

Example usage:

  # See available commands.
  >> sky

  # Run a task, described in a yaml file.
  # Provisioning, setup, file syncing are handled.
  >> sky run task.yaml
  >> sky run [-c cluster_name] task.yaml

  # Show the list of tasks and running clusters.
  >> sky status

  # Tear down a specific cluster.
  >> sky down -c cluster_name

  # Tear down all existing clusters.
  >> sky down -a

* cli: cleanups, remove unimplemented commands.

* cli/cli.py -> sky/cli.py

* Make cli.py pass pylint.

* sky down

* Fix pylint.

* Update

* Make CLUSTER an argument instead - better help msg

* sky cancel --all (-a)

* Minor fixes and linting.

Fix `pushd not found` in cloud-stores (#88)

* Merge run function for cloud_stores and backend_utils
* Add `executable=/bin/bash`
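
As an illustration of why `executable=/bin/bash` matters: `pushd` is a bash builtin, and `subprocess` with `shell=True` uses `/bin/sh` by default on POSIX, which may be a shell without `pushd`. The helper below is a minimal sketch under that assumption, not the merged `run` function from this commit; its name and return type are chosen only for the example.

```python
import subprocess


def run(cmd: str) -> subprocess.CompletedProcess:
    """Runs a shell command string under bash instead of the default /bin/sh.

    Hypothetical helper for illustration; assumes /bin/bash exists.
    """
    return subprocess.run(cmd,
                          shell=True,
                          executable='/bin/bash',
                          check=True,
                          capture_output=True,
                          text=True)


if __name__ == '__main__':
    # pushd is available here because the command is interpreted by bash.
    print(run('pushd / > /dev/null && pwd').stdout.strip())
```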

[CLI] Introduce `sky exec`; some more CLI fixes (#89)

* Fix tricky "sky run" bugs.

Bug before this change:

  sky run cluster app.yaml

    -> handle = cluster.yaml

  now, manually change cluster.yaml instance_type, and ray up cluster.yaml

  sky run cluster app.yaml again, this will fail!

After this change, "sky run" deals with this case correctly.

Another bug: previously, _reuse_or_provision_cluster() in cli.py used "ray up"
in one case. It should use backend.provision() instead, which offers provision
retries.

* Remove _reuse_or_provision_cluster() and fixes sky.execute()

_reuse_or_provision_cluster() in the 'sky run' codepath is buggy:

It performs a content diff on generated.yaml.  Which means:

If cluster is not manually changed, this check DOES NOT capture
  - non-workdir file mounts content changes
Other cluster specs are captured.

If cluster is manually changed, skipping re-setup violates Task contracts.

* Add 'sky exec'

* update help

* Lint

* entry_point -> yaml_path

* Show commands in help in better order
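
The provision-retry point above can be sketched generically: try each candidate location in order and fall through to the next on failure. This is not how `backend.provision()` is implemented; `ProvisionError`, `provision_with_retry`, and the candidate strings are all hypothetical names used only for this sketch.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar('T')


class ProvisionError(Exception):
    """Hypothetical error raised when one provisioning attempt fails."""


def provision_with_retry(candidates: Iterable[str],
                         provision_fn: Callable[[str], T]) -> T:
    """Tries provision_fn on each candidate in order; returns the first success."""
    last_error: Optional[ProvisionError] = None
    for candidate in candidates:
        try:
            return provision_fn(candidate)
        except ProvisionError as e:
            # Remember the failure and move on to the next candidate.
            last_error = e
    raise ProvisionError('All candidates failed.') from last_error
```

A single "ray up" call has no such fallback, which is the gap the commit message describes.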

Fix non-utf-8 characters when redirecting output (#92)

* Fix non-utf-8 output

* Enable head comment in using_file_mounts.yaml

Add types to global_user_state (#94)

Add types

[CLI] Speed up 'sky exec' for faster iterative development (#91)

* (Not back-compat) global_user_state: support pickle-able handles.

This change is not binary back-compat.  Do this to update:
  rm ~/.sky/state.db

* Speed up sky exec by not using 'ray exec/rsync_up'

Changes:
- Sane SSH options (control path has huge performance boost)
- ResourceHandle: is now a tuple (head_ip, yaml, tpu).
  - Uses the cached head_ip to directly run our own 'rsync', 'ssh <cmd>'.
  - Thus tpu management is nicely handled

* Add minimal.yaml

* Update run_smoke_tests.sh

* TESTED: examples/run_smoke_tests.sh

* Add username to auto-generated cluster name

* Fix cloud->remote file mount mkdir logic.

* Better naming for 'managed_tpu' & address comments

* Fix _create_interactive_node() use of handle; don't down the node.

* lint
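
The SSH control-path change in the commit above can be illustrated with standard OpenSSH connection-multiplexing options (`ControlMaster`, `ControlPath`, `ControlPersist`). This is a sketch of the idea only: the function names, the control directory, and the 120s persistence value are assumptions, not the options this commit actually sets.

```python
import os
from typing import List


def ssh_options(control_dir: str = '~/.ssh/sky-control') -> List[str]:
    """Builds OpenSSH options that reuse one connection across invocations.

    The control directory is a hypothetical location and must exist
    before ssh is actually invoked.
    """
    # %C expands to a hash of the connection parameters (OpenSSH token).
    control_path = os.path.join(os.path.expanduser(control_dir), '%C')
    return [
        '-o', 'ControlMaster=auto',
        '-o', f'ControlPath={control_path}',
        '-o', 'ControlPersist=120s',
    ]


def ssh_command(head_ip: str, user: str, remote_cmd: str) -> List[str]:
    """Builds an argv list that runs remote_cmd on a cached head IP."""
    return ['ssh', *ssh_options(), f'{user}@{head_ip}', remote_cmd]
```

With multiplexing, only the first call pays for the TCP and authentication handshake; later `ssh` and `rsync -e ssh` calls to the same host reuse the open connection.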

[CLI] Add sky ssh and port-forwarding (#90)

* Add sky attach and port forwarding gpunode/cpunode

* Fix Ctrl-D error

* Change the name to sky ssh

* Fix type hints

* linting

* format

* Fix merging errors

(Non back-compat) CLI improvements: interactive, remove task table, show entry points  (#99)

* CLI: by default, (re)use the same interactive nodes

* Remove task table & its associated code

Rationale: the task table will be useful in the future, once we add task
status tracking (RUNNING, DONE, etc.).  Absent that, the task table from
`sky status` and the overhead of `sky cancel -a` are not that useful at the
moment.

We should add this code back in the future.

* Show entry point per cluster in `sky status`

Clusters
+----------------------+----------------+-----------------------------------------+
|     CLUSTER NAME     |    LAUNCHED    |                 LAST USE                |
+----------------------+----------------+-----------------------------------------+
| sky-cpunode-zongheng | 10 minutes ago |               sky cpunode               |
|        debug1        | 7 minutes ago  | sky run -c debug1 examples/minimal.yaml |
|  sky-bba5-zongheng   | 3 minutes ago  |              multi_echo.py              |
+----------------------+----------------+-----------------------------------------+

* Make 'sky ssh' call the ssh cmd directly

* Truncate last use

* lint
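
A minimal sketch of truncating the LAST USE column so long commands do not widen the table. The function name, the 20-character default, and the `...` suffix are assumptions for illustration, not values taken from this commit.

```python
def truncate(text: str, max_len: int = 20, ellipsis: str = '...') -> str:
    """Shortens text to at most max_len characters, marking the cut."""
    if len(text) <= max_len:
        return text
    if max_len <= len(ellipsis):
        # Not enough room for the marker; hard-cut instead.
        return text[:max_len]
    return text[:max_len - len(ellipsis)] + ellipsis


if __name__ == '__main__':
    print(truncate('sky run -c debug1 examples/minimal.yaml'))
    print(truncate('sky cpunode'))
```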

Rebase & format

Update __init__.py
frank-lsf added a commit that referenced this pull request Jan 8, 2022
* GCP catalog

* sky list-gpus

* lint

* Update cli.py

* lint

* Update GCP prompt

* Address comments

* TPUs -> Google TPUs

* Update GCP catalog with regions info

* Update fetch_gcp.py

* Update gcp

* fix

* fix

* Update TPU code

* Revert "Update TPU code"

This reverts commit 1f08fc0.

* Address comments

* Update cli.py

* Update cli.py

* Update cli.py

* Update cli.py
gmittal added a commit that referenced this pull request Mar 15, 2022
Add interactive sessions (`gpunode`, etc.), multiple clusters support, cluster handles, and `sky status`.
gmittal pushed a commit that referenced this pull request Mar 15, 2022
Michaelvll pushed a commit that referenced this pull request Dec 12, 2024
* Fix several issues

* fallback min available mem

* backward compatibility change

* format

* guard against OOM version

* remove/simplify OOM prevention

* logs and comments

* print warning log

* nit

* more conservative on non-blocking mem
juanmichelini pushed a commit to ICML-25-BenchName-builds-repair/skypilot that referenced this pull request Jul 10, 2025
Add interactive sessions (`gpunode`, etc.), multiple clusters support, cluster handles, and `sky status`.
juanmichelini pushed a commit to ICML-25-BenchName-builds-repair/skypilot that referenced this pull request Jul 10, 2025