Add interactive sessions, direct launch, and status #58
d589e8f to d95cda7
concretevitamin
left a comment
Add TODOs on making sure the Docker examples all pass?
First look, still need to finish reading cli.py.
prototype/cli/cli.py
Outdated
    # TODO: cd into workdir immediately on the VM
    # TODO: Delete the temporary cluster config yml (or figure out a way to re-use it)
    cloud_vm_ray_backend._run(f'ray attach {handle} --tmux')
Calling a private func is an antipattern, which signals to me that we should expose run() in backend_utils so you can call it here.
NOTE TO SELF: check if we want to break Backend abstractions
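A rough sketch of what a public `run()` in `backend_utils` could look like, so the CLI no longer reaches into `cloud_vm_ray_backend._run`. The signature and the `executable='/bin/bash'` default are assumptions for illustration, not the project's actual helper.

```python
import subprocess


def run(cmd: str, **kwargs) -> subprocess.CompletedProcess:
    """Runs a shell command under bash; raises CalledProcessError on failure.

    Hypothetical public wrapper: callers such as the CLI would use this
    instead of a backend's private _run().
    """
    return subprocess.run(cmd,
                          shell=True,
                          check=True,
                          executable='/bin/bash',
                          **kwargs)


if __name__ == '__main__':
    # Stand-in command; the real call site would pass the 'ray attach ...' string.
    result = run('echo attached', capture_output=True, text=True)
    print(result.stdout.strip())
```

With something like this in place, the call site would read `backend_utils.run(f'ray attach {handle} --tmux')` and the Backend class keeps its private helpers private.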
prototype/sky/user_manager.py
Outdated
    @@ -0,0 +1,59 @@
    import sqlite3
NOTE TO SELF: good dir to put in? one-line docs on module, methods.
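To illustrate the "one-line docs on module, methods" suggestion, here is a minimal sketch of a sqlite3-backed module with a module docstring and a one-line docstring per function. The `sessions` table, its columns, and all function names are hypothetical; the real user_manager.py is not shown in this thread.

```python
"""Hypothetical sketch: tracks interactive sessions in a local SQLite database."""
import sqlite3
import time
from typing import List, Optional, Tuple


def connect(db_path: str = ':memory:') -> sqlite3.Connection:
    """Opens the database at db_path and creates the sessions table if missing."""
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS sessions ('
                 'name TEXT PRIMARY KEY, launched_at INTEGER)')
    conn.commit()
    return conn


def add_session(conn: sqlite3.Connection,
                name: str,
                launched_at: Optional[int] = None) -> None:
    """Inserts or replaces the session row keyed by name."""
    if launched_at is None:
        launched_at = int(time.time())
    conn.execute('INSERT OR REPLACE INTO sessions VALUES (?, ?)',
                 (name, launched_at))
    conn.commit()


def get_sessions(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Returns all (name, launched_at) rows, oldest first."""
    cursor = conn.execute(
        'SELECT name, launched_at FROM sessions ORDER BY launched_at')
    return cursor.fetchall()


def remove_session(conn: sqlite3.Connection, name: str) -> None:
    """Deletes the session with the given name, if present."""
    conn.execute('DELETE FROM sessions WHERE name = ?', (name,))
    conn.commit()
```

Taking the database path as a parameter leaves the "good dir to put in?" question open: the caller decides where the file lives, and tests can use `':memory:'`.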
concretevitamin
left a comment
Remember to leave out prototype/~/.sky/session.db!
Make 2nd run of "sky run" / "python app.py" skip setup; more examples. (#52)
* Make 2nd run of "sky run" / "python app.py" skip setup. More examples.
* Fix workflows yaml.

LocalDockerBackend for debugging (#35)
working LocalDockerBackend

Show logs on docker build failure (#53)
Stream build logs on failure and default to remove container on completion

Usability: basic "sky run" distributed task; rsync .git; more docs. (#54)
* aws: move region order
* Do not exclude .git from workdir by default; fixes #42.
* Task: document, num_nodes in from_yaml(), 'run' str for >1 nodes.
* Example: multi_hostname.{yaml,py}, and some fixes to make it work
* Check 'run' field.

Add logging for execution_v2 (#49)
* Add logging for provisioning
* Log task output into a file for `one_node`
* Fix format issues and support ParTask logging
* Fix info message
* Add docstr to logging.py
* Fix tail -f information for provision
* Fix teardown
* Fix codegen, format and other issues

Fix logs sync for n_node (#57)
* Fix logs rsync down for n_node

Prettier paths (#50)

Update .gitignore for sky_logs (#55)
* Update .gitignore
* Add *.log to gitignore

Docker-in-Docker support and bugfixes (#65)
* Fix ping example and debugging commands printed to the user
* Add default docker image
* Add check for container status in post_execute
* lint
* PR comments

containerized app example (#60)
* containerized app example
* setup command fixes and dind notes
* mkdir fix

Fix the incomplete logs (#69)

Fix T4 (#62)

CLI with cluster handles and interactive sessions (#58)
Add interactive sessions (`gpunode`, etc.), multiple clusters support, cluster handles, and `sky status`.

Make pylint pass. (#76)
* First pass @ making pylint happy; TODO: a few more to solve; test.
* Fix a few more pylint errors, and some minor fixes
* Final pylint errors
* Fix more pylint
* Debug pytest issue
* Accidentally removed a useful try-catch; reverted & documented

Add Detectron2 example (#74)

Cleanups: remove V1 yaml templates; update pylint.yml (#77)
* Cleanups: remove V1 yaml templates; update pylint.yml
* Add cluster_name: Optional[str] = None to the 2 Backend impls

Fix script filling to make tpu_app run again. (#79)

Add support for cross-cloud retry (#70)
* Support multi-cloud provision retry
* Fix pylint issues

Update .gitignore for tpu shell script (#80)

Horovod Example (#29)
* Horovod Example
* Yapf fix
* Fix
Co-authored-by: Michael Luo <michaelluo@MacBook-Pro.local>

Add jupyter notebook example (#34)
* Add jupyter notebook example
* Change to new execute

Fix cross-cloud retry for CLI and CLI (#84)
* Do not fail the program when dag and optimizer are not registered and fix cli

Support file_mounts destination paths that require root access. (#68)
Fixes #39. Both these should now work:
```
/data/<25GB dir>: /data/<25GB dir>
/data/<25GB dir>: gs://<dir>
```
Added examples/using_file_mounts.yaml as a user guide.
* Support file_mounts destination paths that require root access.
* Enable more robust file mounts.
* Rename to using_file_mounts.yaml.
* Use ~/.sky/file_mounts/
* Remove V1 code to avoid confusion.
* Add slashes to existing file mounts examples.
* Revert "Add slashes to existing file mounts examples."
  This reverts commit 53bade0.
* Add storage.is_directory() to rid of the weird slashes
* Cleanup in execution.py

Add example for resnet training with tensorboard (#45)
* add an example for resnet training with tensorboard

[CLI] "sky down" and move cli.py to the core package. (#85)
Example usage:
  # See available commands.
  >> sky
  # Run a task, described in a yaml file.
  # Provisioning, setup, file syncing are handled.
  >> sky run task.yaml
  >> sky run [-c cluster_name] task.yaml
  # Show the list of tasks and running clusters.
  >> sky status
  # Tear down a specific cluster.
  >> sky down -c cluster_name
  # Tear down all existing clusters.
  >> sky down -a
* cli: cleanups, remove unimplemented commands.
* cli/cli.py -> sky/cli.py
* Make cli.py pass pylint.
* sky down
* Fix pylint.
* Update
* Make CLUSTER an argument instead - better help msg
* sky cancel --all (-a)
* Minor fixes and linting.

Fix `pushd not found` in cloud-stores (#88)
* Merge run function for cloud_stores and backend_utils
* Add `executable=/bin/bash`

[CLI] Introduce `sky exec`; some more CLI fixes (#89)
* Fix tricky "sky run" bugs.
  Bug before this change:
    sky run cluster app.yaml -> handle = cluster.yaml
    now, manually change cluster.yaml instance_type, and ray up cluster.yaml
    sky run cluster app.yaml again, this will fail!
  After this change, "sky run" deals with this case correctly.
  Another bug: previously _reuse_or_provision_cluster() in cli.py uses "ray up" in one case. It should use backend.provision() instead which offers provision retries.
* Remove _reuse_or_provision_cluster() and fixes sky.execute()
  _reuse_or_provision_cluster() in the 'sky run' codepath is buggy: It performs a content diff on generated.yaml. Which means:
  If cluster is not manually changed, this check DOES NOT capture
  - non-workdir file mounts content changes
  Other cluster specs are captured.
  If cluster is manually changed, skipping re-setup violates Task contracts.
* Add 'sky exec'
* update help
* Lint
* entry_point -> yaml_path
* Show commands in help in better order

Fix non-utf-8 characters when redirecting output (#92)
* Fix non-utf-8 output
* Enable head comment in using_file_mounts.yaml

Add types to global_user_state (#94)
Add types

[CLI] Speed up 'sky exec' for faster iterative development (#91)
* (Not back-compat) global_user_state: support pickle-able handles.
  This change is not binary back-compat. Do this to update: rm ~/.sky/state.db
* Speed up sky exec by not using 'ray exec/rsync_up'
  Changes:
  - Sane SSH options (control path has huge performance boost)
  - ResourceHandle: is now a tuple (head_ip, yaml, tpu).
  - Uses the cached head_ip to directly run our own 'rsync', 'ssh <cmd>'.
  - Thus tpu management is nicely handled
* Add minimal.yaml
* Update run_smoke_tests.sh
* TESTED: examples/run_smoke_tests.sh
* Add username to auto-generated cluster name
* Fix cloud->remote file mount mkdir logic.
* Better naming for 'managed_tpu' & address comments
* Fix _create_interactive_node() use of handle; don't down the node.
* lint

[CLI] Add sky ssh and port-forwarding (#90)
* Add sky attach and port forwarding gpunode/cpunode
* Fix Ctrl-D error
* Change the name to sky ssh
* Fix type hints
* linting
* format
* Fix merging errors

(Non back-compat) CLI improvements: interactive, remove task table, show entry points (#99)
* CLI: by default, (re)use the same interactive nodes
* Remove task table & its associated code
  Rationale: useful in the future when we add task status tracking (RUNNING, DONE, etc.). Absent that, the task table from `sky status` and the overhead of `sky cancel -a` is not that useful at the moment. We should add this code back in the future.
* Show entry point per cluster in `sky status`
  Clusters
  +----------------------+----------------+-----------------------------------------+
  | CLUSTER NAME         | LAUNCHED       | LAST USE                                |
  +----------------------+----------------+-----------------------------------------+
  | sky-cpunode-zongheng | 10 minutes ago | sky cpunode                             |
  | debug1               | 7 minutes ago  | sky run -c debug1 examples/minimal.yaml |
  | sky-bba5-zongheng    | 3 minutes ago  | multi_echo.py                           |
  +----------------------+----------------+-----------------------------------------+
* Make 'sky ssh' call the ssh cmd directly
* Truncate last use
* lint

Rebase & format

Update __init__.py
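
The `sky status` cluster table in the commit message above (relative launch times, a truncated LAST USE column) could be produced along these lines. This is a standalone sketch using only the standard library; the function names, the 40-character limit, and the "just now" wording are assumptions, and the actual CLI may rely on a table library instead.

```python
from typing import List, Sequence


def readable_time_duration(seconds: int) -> str:
    """Formats an age in seconds as a coarse string such as '10 minutes ago'."""
    if seconds < 60:
        return 'just now'
    if seconds >= 86400:
        count, unit = seconds // 86400, 'day'
    elif seconds >= 3600:
        count, unit = seconds // 3600, 'hour'
    else:
        count, unit = seconds // 60, 'minute'
    suffix = '' if count == 1 else 's'
    return f'{count} {unit}{suffix} ago'


def truncate(text: str, max_len: int = 40) -> str:
    """Shortens text to at most max_len characters, ending in '...' if cut."""
    if len(text) <= max_len:
        return text
    return text[:max_len - 3] + '...'


def format_table(header: Sequence[str], rows: List[Sequence[str]]) -> str:
    """Renders an ASCII table with '+---+' borders and left-aligned cells."""
    all_rows = [header] + list(rows)
    widths = [max(len(row[i]) for row in all_rows) for i in range(len(header))]
    border = '+' + '+'.join('-' * (w + 2) for w in widths) + '+'

    def fmt(row: Sequence[str]) -> str:
        cells = (cell.ljust(w) for cell, w in zip(row, widths))
        return '| ' + ' | '.join(cells) + ' |'

    lines = [border, fmt(header), border]
    lines += [fmt(row) for row in rows]
    lines.append(border)
    return '\n'.join(lines)


if __name__ == '__main__':
    rows = [
        ('debug1', readable_time_duration(7 * 60),
         truncate('sky run -c debug1 examples/minimal.yaml')),
    ]
    print(format_table(('CLUSTER NAME', 'LAUNCHED', 'LAST USE'), rows))
```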
* GCP catalog
* sky list-gpus
* lint
* Update cli.py
* lint
* Update GCP prompt
* Address comments
* TPUs -> Google TPUs
* Update GCP catalog with regions info
* Update fetch_gcp.py
* Update gcp
* fix
* fix
* Update TPU code
* Revert "Update TPU code"
  This reverts commit 1f08fc0.
* Address comments
* Update cli.py
* Update cli.py
* Update cli.py
* Update cli.py
Add interactive sessions (`gpunode`, etc.), multiple clusters support, cluster handles, and `sky status`.
* Fix several issues
* fallback min available mem
* backward compatibility change
* format
* guard against OOM version
* remove/simplify OOM prevention
* logs and comments
* print warning log
* nit
* more conservative on non-blocking mem
Also adds support for multiple clusters!