Make pylint pass. #76
All checks passed, ready for review.

    """

    ResourceHandle = str  # yaml file
    class ResourceHandle(str):
It seems we have other definitions of ResourceHandle in backend.py and also in local_docker_backend.py. Can we merge them?
Yeah, agreed that the current way is not a very good design. Although both are currently strings, they mean different things. The original intention was that some backends may have their own custom, opaque handle struct, so it seemed appropriate for each backend to have its own handle definition. Actually, even right now the handle can be seen as a tuple, (cluster.yaml, tpu stuff). We should do a refactoring on this later.
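As a rough illustration of where such a refactoring could go, here is a minimal sketch in which each backend keeps its own opaque handle type under a shared base class. All names here (`CloudVmHandle`, `LocalDockerHandle`, `tpu_script`, `image_tag`, `describe`) are hypothetical and are not taken from the codebase; only the idea that a handle may be "a tuple, (cluster.yaml, tpu stuff)" comes from the comment above.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass(frozen=True)
class ResourceHandle:
    """Hypothetical base class: an opaque handle owned by one backend."""


@dataclasses.dataclass(frozen=True)
class CloudVmHandle(ResourceHandle):
    """Hypothetical handle for a cluster described by a yaml file."""
    cluster_yaml: str
    tpu_script: Optional[str] = None  # the "tpu stuff", if any


@dataclasses.dataclass(frozen=True)
class LocalDockerHandle(ResourceHandle):
    """Hypothetical handle for a locally built docker image."""
    image_tag: str


def describe(handle: ResourceHandle) -> str:
    """Render a handle for display; each backend's type is handled separately."""
    if isinstance(handle, CloudVmHandle):
        suffix = ' (with TPU)' if handle.tpu_script else ''
        return f'cluster {handle.cluster_yaml}{suffix}'
    if isinstance(handle, LocalDockerHandle):
        return f'docker image {handle.image_tag}'
    raise TypeError(f'Unknown handle type: {type(handle).__name__}')
```

With distinct types, two handles that happen to wrap the same string no longer compare equal, which is the "they mean different things" point made above.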
Make 2nd run of "sky run" / "python app.py" skip setup; more examples. (#52)
* Make 2nd run of "sky run" / "python app.py" skip setup. More examples.
* Fix workflows yaml.

LocalDockerBackend for debugging (#35)
working LocalDockerBackend

Show logs on docker build failure (#53)
Stream build logs on failure and default to remove container on completion

Usability: basic "sky run" distributed task; rsync .git; more docs. (#54)
* aws: move region order
* Do not exclude .git from workdir by default; fixes #42.
* Task: document, num_nodes in from_yaml(), 'run' str for >1 nodes.
* Example: multi_hostname.{yaml,py}, and some fixes to make it work
* Check 'run' field.

Add logging for execution_v2 (#49)
* Add logging for provisioning
* Log task output into a file for `one_node`
* Fix format issues and support ParTask logging
* Fix info message
* Add docstr to logging.py
* Fix tail -f information for provision
* Fix teardown
* Fix codegen, format and other issues

Fix logs sync for n_node (#57)
* Fix logs rsync down for n_node

Prettier paths (#50)

Update .gitignore for sky_logs (#55)
* Update .gitignore
* Add *.log to gitignore

Docker-in-Docker support and bugfixes (#65)
* Fix ping example and debugging commands printed to the user
* Add default docker image
* Add check for container status in post_execute
* lint
* PR comments

containerized app example (#60)
* containerized app example
* setup command fixes and dind notes
* mkdir fix

Fix the incomplete logs (#69)

Fix T4 (#62)

CLI with cluster handles and interactive sessions (#58)
Add interactive sessions (`gpunode`, etc.), multiple clusters support, cluster handles, and `sky status`.

Make pylint pass. (#76)
* First pass @ making pylint happy; TODO: a few more to solve; test.
* Fix a few more pylint errors, and some minor fixes
* Final pylint errors
* Fix more pylint
* Debug pytest issue
* Accidentally removed a useful try-catch; reverted & documented

Add Detectron2 example (#74)

Cleanups: remove V1 yaml templates; update pylint.yml (#77)
* Cleanups: remove V1 yaml templates; update pylint.yml
* Add cluster_name: Optional[str] = None to the 2 Backend impls

Fix script filling to make tpu_app run again. (#79)

Add support for cross-cloud retry (#70)
* Support multi-cloud provision retry
* Fix pylint issues

Update .gitignore for tpu shell script (#80)

Horovod Example (#29)
* Horovod Example
* Yapf fix
* Fix
Co-authored-by: Michael Luo <michaelluo@MacBook-Pro.local>

Add jupyter notebook example (#34)
* Add jupyter notebook example
* Change to new execute

Fix cross-cloud retry for CLI and CLI (#84)
* Do not fail the program when dag and optimizer are not registered and fix cli

Support file_mounts destination paths that require root access. (#68)
Fixes #39. Both these should now work:
```
/data/<25GB dir>: /data/<25GB dir>
/data/<25GB dir>: gs://<dir>
```
Added examples/using_file_mounts.yaml as a user guide.
* Support file_mounts destination paths that require root access.
* Enable more robust file mounts.
* Rename to using_file_mounts.yaml.
* Use ~/.sky/file_mounts/
* Remove V1 code to avoid confusion.
* Add slashes to existing file mounts examples.
* Revert "Add slashes to existing file mounts examples." This reverts commit 53bade0.
* Add storage.is_directory() to rid of the weird slashes
* Cleanup in execution.py

Add example for resnet training with tensorboard (#45)
* add an example for resnet training with tensorboard

[CLI] "sky down" and move cli.py to the core package. (#85)
Example usage:

    # See available commands.
    >> sky
    # Run a task, described in a yaml file.
    # Provisioning, setup, file syncing are handled.
    >> sky run task.yaml
    >> sky run [-c cluster_name] task.yaml
    # Show the list of tasks and running clusters.
    >> sky status
    # Tear down a specific cluster.
    >> sky down -c cluster_name
    # Tear down all existing clusters.
    >> sky down -a

* cli: cleanups, remove unimplemented commands.
* cli/cli.py -> sky/cli.py
* Make cli.py pass pylint.
* sky down
* Fix pylint.
* Update
* Make CLUSTER an argument instead - better help msg
* sky cancel --all (-a)
* Minor fixes and linting.

Fix `pushd not found` in cloud-stores (#88)
* Merge run function for cloud_stores and backend_utils
* Add `executable=/bin/bash`

[CLI] Introduce `sky exec`; some more CLI fixes (#89)
* Fix tricky "sky run" bugs.
  Bug before this change:
    sky run cluster app.yaml -> handle = cluster.yaml
    now, manually change cluster.yaml instance_type, and ray up cluster.yaml
    sky run cluster app.yaml again, this will fail!
  After this change, "sky run" deals with this case correctly.
  Another bug: previously _reuse_or_provision_cluster() in cli.py uses "ray up" in one case. It should use backend.provision() instead which offers provision retries.
* Remove _reuse_or_provision_cluster() and fixes sky.execute()
  _reuse_or_provision_cluster() in the 'sky run' codepath is buggy: It performs a content diff on generated.yaml. Which means:
  If cluster is not manually changed, this check DOES NOT capture
    - non-workdir file mounts content changes
  Other cluster specs are captured.
  If cluster is manually changed, skipping re-setup violates Task contracts.
* Add 'sky exec'
* update help
* Lint
* entry_point -> yaml_path
* Show commands in help in better order

Fix non-utf-8 characters when redirecting output (#92)
* Fix non-utf-8 output
* Enable head comment in using_file_mounts.yaml

Add types to global_user_state (#94)
Add types

[CLI] Speed up 'sky exec' for faster iterative development (#91)
* (Not back-compat) global_user_state: support pickle-able handles.
  This change is not binary back-compat.
  Do this to update: rm ~/.sky/state.db
* Speed up sky exec by not using 'ray exec/rsync_up'
  Changes:
  - Sane SSH options (control path has huge performance boost)
  - ResourceHandle: is now a tuple (head_ip, yaml, tpu).
  - Uses the cached head_ip to directly run our own 'rsync', 'ssh <cmd>'.
  - Thus tpu management is nicely handled
* Add minimal.yaml
* Update run_smoke_tests.sh
* TESTED: examples/run_smoke_tests.sh
* Add username to auto-generated cluster name
* Fix cloud->remote file mount mkdir logic.
* Better naming for 'managed_tpu' & address comments
* Fix _create_interactive_node() use of handle; don't down the node.
* lint

[CLI] Add sky ssh and port-forwarding (#90)
* Add sky attach and port forwarding gpunode/cpunode
* Fix Ctrl-D error
* Change the name to sky ssh
* Fix type hints
* linting
* format
* Fix merging errors

(Non back-compat) CLI improvements: interactive, remove task table, show entry points (#99)
* CLI: by default, (re)use the same interactive nodes
* Remove task table & its associated code
  Rationale: useful in the future when we add task status tracking (RUNNING, DONE, etc.). Absent that, the task table from `sky status` and the overhead of `sky cancel -a` is not that useful at the moment. We should add this code back in the future.
* Show entry point per cluster in `sky status`

    Clusters
    +----------------------+----------------+-----------------------------------------+
    | CLUSTER NAME         | LAUNCHED       | LAST USE                                |
    +----------------------+----------------+-----------------------------------------+
    | sky-cpunode-zongheng | 10 minutes ago | sky cpunode                             |
    | debug1               | 7 minutes ago  | sky run -c debug1 examples/minimal.yaml |
    | sky-bba5-zongheng    | 3 minutes ago  | multi_echo.py                           |
    +----------------------+----------------+-----------------------------------------+

* Make 'sky ssh' call the ssh cmd directly
* Truncate last use
* lint

Rebase & format

Update __init__.py
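The `sky exec` speed-up in #91 above rests on two ideas: reuse one SSH connection via a control path, and talk to the cached `head_ip` directly instead of going through `ray exec`. The sketch below shows how such a command line might be assembled. It is an illustration only, not the project's code: the function names, the control directory, the 120-second persistence window, and the timeout are all assumptions. The `ControlMaster`, `ControlPath`, `ControlPersist`, and `ConnectTimeout` options themselves are standard OpenSSH client options.

```python
import os
from typing import List


def ssh_options(control_dir: str = '~/.ssh/sketch-control') -> List[str]:
    """Build OpenSSH options that share one connection across invocations.

    The directory name and the 120s window are illustrative assumptions.
    """
    # %C expands to a hash of the connection parameters (host, port, user).
    control_path = os.path.join(os.path.expanduser(control_dir), '%C')
    return [
        '-o', 'ControlMaster=auto',           # open a master connection if none exists
        '-o', f'ControlPath={control_path}',  # socket shared by later ssh calls
        '-o', 'ControlPersist=120s',          # keep the master alive between calls
        '-o', 'ConnectTimeout=30',
    ]


def ssh_command(head_ip: str, user: str, remote_cmd: str) -> List[str]:
    """Assemble an argv list that runs `remote_cmd` on the cached head IP."""
    return ['ssh', *ssh_options(), f'{user}@{head_ip}', remote_cmd]
```

The returned list can be passed to `subprocess.run`; after the first call opens the master connection, later calls within the persistence window skip the SSH handshake, which is presumably where the reported speed-up comes from.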
* GCP catalog
* sky list-gpus
* lint
* Update cli.py
* lint
* Update GCP prompt
* Address comments
* TPUs -> Google TPUs
* Update GCP catalog with regions info
* Update fetch_gcp.py
* Update gcp
* fix
* fix
* Update TPU code
* Revert "Update TPU code" This reverts commit 1f08fc0.
* Address comments
* Update cli.py
* Update cli.py
* Update cli.py
* Update cli.py
* First pass @ making pylint happy; TODO: a few more to solve; test.
* Fix a few more pylint errors, and some minor fixes
* Final pylint errors
* Fix more pylint
* Debug pytest issue
* Accidentally removed a useful try-catch; reverted & documented
Squashed commit of the following:
commit 7889757299c0260fa4fc6e8c2d3b06484783aae7
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Feb 6 21:36:58 2025 +0000
change to local cache for the k8s client for newly added credential
commit bf660981504544aeb15650938d754094d00faf22
Merge: 19f13d372 3e9bd9d0f
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Feb 6 21:30:59 2025 +0000
Merge branch 'master' of github.com:skypilot-org/skypilot into restapi
commit 19f13d372b3fa7d590adcb490cc03ff7d2fa6f30
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Feb 6 19:48:17 2025 +0000
Fix the unit test
commit 72ea226c4b73c6e0eac0f4d55063a6507107c711
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Feb 6 17:42:51 2025 +0000
fix variables
commit 0dcc06b6bde5f2d24df1c7a8e31b703d54705e77
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Feb 6 07:57:26 2025 +0000
remove unnecessary endpoint
commit d54c74c59cac76ee56acfd339328144eb60d5929
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Feb 6 07:49:56 2025 +0000
refactor serve endpoint for less API calls
commit 912069e6f715d10f7d44c4acdd45d57c25de6689
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Feb 6 07:27:56 2025 +0000
Fix output
commit f8630cdba402882112fdc6821c586d897ee766ea
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Feb 6 02:36:39 2025 +0000
fix bump seconds
commit 38e66d87fc56d03e5194d9ff18f58ff2194d680a
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Feb 5 23:11:11 2025 +0000
Fix merging issue
commit 0e7ace80f84ee4845f006e63610f9f80c1b5928c
Merge: 99b81b6ff e7d94e956
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Feb 5 23:00:03 2025 +0000
Merge branch 'master' of github.com:skypilot-org/skypilot into restapi
commit 99b81b6ff810c595f4e9c5fbd4c8936692521fef
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Feb 5 22:53:15 2025 +0000
mypy
commit 0cadaf976bd3f4296250992d299128282e16f9f6
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Feb 5 22:35:31 2025 +0000
Fix serve status
commit 8d7849f3579e4a500efcc55c8a7e8023faf776e7
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Feb 5 22:28:22 2025 +0000
Avoid clone-disk-from tests
commit f0026726c1f6cfe66658e6e4820816e3d7591af9
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Feb 5 22:14:09 2025 +0000
Avoid detach setup
commit 9b8c0cbe61ede6b846bbc054f68653f6d1cdff62
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Feb 5 01:09:04 2025 +0000
Address comments
commit cfaa62e573f0345ad4ba70c1b33ba1163ba809df
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Feb 5 00:56:22 2025 +0000
pylint
commit 52bbdb7619c0a0334024d5a7e0eff02124e456e0
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Feb 5 00:55:03 2025 +0000
Address comments
commit 25a0c9255c1a166e0855f56ca0d53751b1528723
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Feb 4 22:41:21 2025 +0000
docstr
commit 1bad3ebef6bafbd0eeba28b03f8bc6e310ff558a
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Feb 4 22:13:42 2025 +0000
Fix raise for resource parsing
commit 81d7d79070b1ebc711a3c0034b40778df4608d77
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Feb 4 22:11:35 2025 +0000
Fix env var for CPU/memory limit in pod
commit 11cff6b7370f148e74da0c56729dbe094d3d9f77
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Feb 4 21:46:12 2025 +0000
rename to LONG and SHORT
commit 75797766c9830d58d8c1c0123c2782dee8417aae
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Feb 4 21:03:02 2025 +0000
Use 127.0.0.1 by default for API server endpoint
commit 41e53331ee5ae0b1119b8b24d950ff6bd13e5a37
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Mon Feb 3 23:44:01 2025 +0000
update logging for status refresh
commit 574fb6f0b6bf843384ee6f0070b40e540e06b87a
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Mon Feb 3 15:41:41 2025 -0800
[Usage/SKY-1403] Fix the usage collection for API server (#180)
* Fix usage for server side
* minor update
* Fix run id
* fix client entrypoint cmd
* rename function
* directly use reset for usage messages and refactor a bit
* minor comment update
* disable usage for internal status refresh
commit 09801da1f23c125e539caa20b94689832b7b0c08
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Mon Feb 3 04:46:40 2025 +0000
Fix merge conflict for jobs launch
commit 7b86c84f4543ac092c5fb56474c32459b33c8d34
Merge: 52d499e2f 269dfb192
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Mon Feb 3 04:27:46 2025 +0000
Merge branch 'master' of github.com:skypilot-org/skypilot into restapi
commit 52d499e2f9c3bf3f75674bd6d160224b5f50729d
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sun Feb 2 08:09:49 2025 +0000
format
commit af6a346342109c31b71ad7d793275de407c29ab8
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sun Feb 2 08:09:34 2025 +0000
address comments
commit 983bf1f392783e81477ccac1f90d9c5b6b77e16e
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Feb 1 18:33:52 2025 +0000
fix argument
commit 100a5d43154f4cf2ada5f154f40fc361dd35a9ec
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Feb 1 18:33:15 2025 +0000
Fix invocation
commit 66540e28ec8282fbaab263b440b4acd4ae4b36eb
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Feb 1 18:32:26 2025 +0000
fix invocation
commit d62c79c64656723af85d9052282e0fc7e024729e
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Feb 1 18:27:25 2025 +0000
make `need_confirmation` internal
commit 3f573afd2253410c320b3c134946c89b3d58e7cc
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Feb 1 18:01:51 2025 +0000
base exception
commit efdb5911fa0da3f4f5701f27bb26ac5b5395b252
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Feb 1 06:40:40 2025 +0000
format
commit f6964d398b6f2298bab23e2c82707cef6295c659
Merge: d7dd77d61 ac9f159db
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Feb 1 06:39:33 2025 +0000
Merge branch 'restapi' of github.com:assemble-org/skypilot into restapi
commit d7dd77d61b5579a698015e43b67718f7a923d916
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Feb 1 06:39:24 2025 +0000
Address comment
commit ac9f159db4b9adb3a9822d30b0fb71e9119da3c7
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 31 22:35:09 2025 -0800
Update sky/data/storage.py
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
commit 935c8b94c14275ffa3646312a50e82589a8fd1f7
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Feb 1 05:43:59 2025 +0000
Fix SKY-1383
commit f4f7653a2232e159ee4cfe55f0e9cc60baec0524
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Feb 1 02:13:49 2025 +0000
Fix unit tests
commit 4347b3d18acd013972132d878187952298a81ffb
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 31 23:57:14 2025 +0000
use response 500 for failed requests
commit b6fea92010749e7593387ba9813e740363633751
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 31 23:05:46 2025 +0000
Add TODO
commit 26938445c9436c72c50d35f8540e9ee54e06ca2c
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 31 20:32:18 2025 +0000
address comments
commit b6d5437665cb5dd84fc574b4452a3ea24297b034
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 31 17:46:27 2025 +0000
fix tests
commit 432b586954c84be86ac36feea26eb5f892ad794e
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 31 08:32:52 2025 +0000
Address comments
commit 8ae605c993067d8b15fcd7eb91dbaeada750a8d6
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 31 04:10:31 2025 +0000
address comments
commit 80b34b5a35f8cd4db081ddd528ee2a506835987a
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Jan 30 19:46:26 2025 +0000
Address comments
commit 4e80ff8390715cfdf55a784be2bc7b6e8266f67a
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Jan 25 19:51:56 2025 +0000
Address comments
commit 307140e45ba602b88f0d44da6ff232bbd45f22bb
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Jan 25 18:39:16 2025 +0000
Address comments
Fix cancel body
Fix
Fix
remove redis types
commit 252786fe73d47a5df725b29b4a2a1fa7d58a4aaa
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 18:35:19 2025 -0800
[SKY-1374] Fix HTTPS keys (#179)
* upload https keys
* alternative output
commit cb45ae3ab21a3ec1c3fa2567d080723378ef7703
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 23:23:05 2025 +0000
Fix autodown
commit a471c092c4918b035a51f88b728e844f262ce643
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 23:04:04 2025 +0000
Fix folder creation for log download
commit 4c2a438b24b349359e38ef37e042b6fa21bd5444
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 21:48:52 2025 +0000
format
commit 32d8ffa197cdcd669b3c888c0967808c62799712
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 21:43:01 2025 +0000
Fix cluster job downloading
commit 895f277e7cb80a2a43772499e7233f15b7130fbf
Merge: 70e7ebf61 1146cab36
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 20:35:04 2025 +0000
Merge branch 'restapi' of github.com:assemble-org/skypilot into restapi
commit 70e7ebf61cd36aecdd4d35f169b63df0d40ad9aa
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 20:23:03 2025 +0000
Naming and comments
commit 1146cab3694a34bf9cb66a4e244c069250afb47d
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 11:48:19 2025 -0800
[SKY-1368] Support empty folder in file mounts and symlinks (#178)
* upload symlink in file upload
* format
* remnant
* fix relative
* check relative path
* Fix relative path in symlink target
* Add unittest for zip and unzip
* Add back symlink
* fix circle link
commit a8bba42de13ea88b9e4fc9bfb2f121abb2663cfb
Author: zpoint <zp0int@qq.com>
Date: Sat Jan 25 02:51:29 2025 +0800
get rid of autouse on pytest (#176)
* get rid of autouse
* manual fixture
* prevent reload
* remove the prevent_reload
* Avoid reload cli
* format
---------
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
commit 099b8be94f1e40ed84f734a0bd44f7ba9941587f
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 09:52:54 2025 -0800
[SKY-1366] Fix storage reconstruct (#177)
* Fix storage reconstruct
* Error out instead
* Add action
* Fix exception type
* Fix logging
commit 62071d56bff05187cf2127429122ba6691ff05a1
Author: Hong <hong@assemblesys.com>
Date: Fri Jan 24 20:31:28 2025 +0800
[SKY-1200][UX] Validate that the sky server is available during login (#134)
commit 4179a6c2f0c771cec2f5d758304f959937b04e04
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 07:57:18 2025 +0000
move lock under logs folder
commit 5a66dae35ecc1efebbacb2a1d5fef9652cbcb8e1
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 07:22:14 2025 +0000
Fix config override error handling
commit ed641a16b5f7bf126ec0e6c8cab3b0ae836aa404
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 02:08:52 2025 +0000
Expose CLOUD_REGISTRY
commit 60e211f27c8b2b82f601eb75b99afac9795d6e62
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 02:02:03 2025 +0000
format
commit 9fcce2c977da60f875c26cc4140e87fc293f98c1
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 02:00:37 2025 +0000
[SKY-1367] Fix validate exception serialization and deserialization
commit a00748b33e3d062f4bc9afaee631f05a18d85b45
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 00:57:09 2025 +0000
Fix deepspeed setup
commit c81a20a5844d752287f951b65566d5d1e4780b69
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 00:39:53 2025 +0000
remove api server deployment
commit c2a4f49a2adf10099b87f33c5de2d9956ef69d00
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 00:34:34 2025 +0000
Fix dimming
commit 5ff29f469d68e114fc2a7305b89bde4cb62d4ba2
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Jan 23 16:32:26 2025 -0800
merge issue
commit 3994e17be460f15bbbb2834e78a29b566b5a3539
Merge: befbf5114 97b8e8f12
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Jan 23 22:35:21 2025 +0000
Merge branch 'master' of github.com:skypilot-org/skypilot into restapi
commit befbf51143a9e1735298a585b0ac84250a79d23b
Author: zpoint <zp0int@qq.com>
Date: Thu Jan 23 14:56:33 2025 +0800
SKY-1034 [Tests] Fix tests with no need for credentials (#51)
* add restapi for tests
* fix storage
* add TODO
* uncomment CI test
* pass api test and unit test
* fix dryrun bug
* fix test_config
* fix test_jobs
* fix for jobs and serve
* bug fix
* fix test jobs and serve
* support enable_all_clouds on server side
* comment out fail test cases
* bug fix and reformat
* temp comment
* pass tests
* test config uncomment
* restructure fixture for faster test
* restore change by cursor
* test test_list_accelerators
* bug fix
* fix test cli
* fix all
* test CI
* bug fix
* rename file
* Update sky/cli.py
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* test_config pass final test
* test_config pass final test
* use patch and import strategy
* format
* bug fix
* local test behaviour
* fix reload issue
* yapf
* remove test controller util
* fix test cases
* resolve PR comment
* resolve merge conflict
* fix k8s test after merging restapi
* resolve comment
* resolve PR comment
* debug
* debug
* bug fix
* api change
* fix PR comment
* change stream method
* fix
* cli fix
* Update tests/conftest.py
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* Update tests/test_cli.py
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* Update tests/test_cli.py
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* fix
* resolve PR comment
* restore class
* support stream mock
---------
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
commit 72d85e942b91905f41807765d2ad141eae7438ce
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Jan 22 22:41:28 2025 -0800
[SKY-1357] Fix jobs dashboard on Kubernetes (#175)
* Add port forward command support
* fix port forward command
* format
* Add comment
commit 0c39a8cca617b736e8c675bef08c1a8082167de8
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Jan 22 18:53:58 2025 -0800
[SKY-1363] `sky down sky-jobs...` request fail to cancel on-going request for `sky jobs launch` (#174)
* Cancel jobs request when downing
* Add column for api status
* Add comment
* minor
commit d582284dc4ddb2cad5282ca03829144acb275644
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Jan 22 18:51:05 2025 -0800
[SKY-1323] Add timestamps for server logs (#173)
* Add timestamps for server log
* comments
* format
commit fffc1535530f52c888e52c9176318918220e1758
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Wed Jan 22 15:58:22 2025 -0800
[SKY-1223] Docs for multi-k8s support, exec based kubeconfig converter (#164)
* Add docs and kubeconfig converter
* lint
* update docs
* update docs
commit 3ab843ffbb46bc87b4bcc75c113c618142a1ca77
Author: zpoint <zp0int@qq.com>
Date: Thu Jan 23 06:04:30 2025 +0800
[SKY-1009] [Robust] Make streaming requests (sky logs , sky jobs logs , etc) synchronous (#155)
* tail logs
* circular import
* print end
* bug fix
* fix
* core function
* resolve PR comment
* mypy
* bug fix
* type
* resolve PR comment
* restore change
* resolve merge conflict
* Architecture: Add changes discussed offline
* minor movement
* format
* format
* Fix import
* fix imports
---------
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
commit 3e75a0f4d3756fb8c469e94088c360d5fdcbe4a4
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Wed Jan 22 13:07:39 2025 -0800
[SKY-1329] Better logging when API server fails to start (#167)
* Better logging when API server fails to start
* lint
* comments
* lint
* lint
commit 179ed3391ee8c5a7e554b9470965cca82a751432
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Jan 22 11:17:25 2025 -0800
[SKY-1337] Refactor client common and fix jobs logs sync down (#165)
* refactor and support log sync down for jobs
* Refactor for file uploads
* Fix log path
* Fix jobs logs downloading
* fix smoke test
* fix smoke test
* format
* Add debugging
* Add debug
* Reuse request body
* Remove debug lines
* get the latest
* fix jobs logs tests
* fix grep JobStatus
commit a90928e11a872cbabb983fb766fa52485861ee66
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Jan 22 11:01:35 2025 -0800
[SKY-1347] Cancel pending requests for `sky down` (#172)
Cancel pending requests for `sky down` as well
commit 9167fed281428f8a64ecd39b834fa028d4864c94
Author: zpoint <zp0int@qq.com>
Date: Wed Jan 22 18:01:08 2025 +0800
add aiofiles support for pre-commit-config (#171)
add aiofiles support
commit 8301271cb1504ad76abcf2ebf4b48d1979af71de
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Tue Jan 21 22:16:21 2025 -0800
[SKY-1326] Minor logging fixes for requests (#169)
Minor logging
commit 1865ecba8b01a03e436f4b5e65beb8031cac4258
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Tue Jan 21 18:57:55 2025 -0800
[SKY-1269] Allow tasks to run in custom namespaces (#170)
* incluster namespace selection
* comment
* lint
* lint
commit c4edb603464a00a715b41df7bfbdfa7c1a30bd4a
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Jan 21 16:31:47 2025 -0800
[SKY-1115] Fix smoke tests with remote API server (no local credential) and local API server (#161)
* Run cloud storage checks in tests on a cluster
* More fixes
* add more fix
* format
* format
* Fix tests that requires local credentials
* refactor the cmd runner
* use new cluster
* Add handling for GCP cloud commands
* Wait for the cloud cmd cluster to be up
* fix storage deletion
* fix api server endpoint
* fix env
* fix None for env
* fix intermediate storage
* increase timeout
* skip bench
* fix queue with docker
* format
* more robust termination
* Fix error handling for controller
* fix API call
* format
* format
* fix gcp related tests
* Additional fixes
* fix clean up
* change zone
* avoid test if on k8s
* longer wait for cloud cmd cluster
* longer wait
* longer wait
* longer
* avoid tailing ?
* wait for running
* Fix storage mounts k8s test
* Fix msg
* Fix managed jobs output
* fix dependency installation
* Add additional fix
* longer wait time for running
* longer wait
* slightly longer for cleaning up the resources
* fix output
* Fix cluster name
* fix cluster names
commit 020a20efa43be95e36a3222d7e7abaf2269d2d7f
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sun Jan 19 14:24:31 2025 -0800
[SKY-1348] Fix console mess up (#166)
* Fix jobs logs
* Add comments
commit 2bae2b5a6c3d52f589dca55b7faa0eb2a68cc5c5
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sun Jan 19 11:52:51 2025 -0800
[SKY-1339] Upload files by chunks (#163)
* fixes
* wip fixes
* wip
* minor fix
* Fix bug for bad zip
* fix chunk upload
* update return value
* fix multi thread issue
* avoid io
* Add cleanup for stale chunks
* format
* lint
* Add comments
* ux
* fix cleanup
commit 189c686bf2a12071a21407085ef9fd258a30d800
Merge: 61f44a053 2354b818b
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sun Jan 19 19:21:07 2025 +0000
Merge branch 'master' of github.com:skypilot-org/skypilot into restapi
commit 61f44a053c6778ccf5c6b28abc92006cee997f03
Merge: e87afbc6c 6b23582d9
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 17 23:46:28 2025 +0000
Merge branch 'master' of github.com:skypilot-org/skypilot into restapi
commit e87afbc6c2687ae864c944d8137a294f32094410
Merge: b0ea95f03 9e1b4ddc5
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 17 07:15:09 2025 +0000
Merge branch 'master' of github.com:assemble-org/skypilot into restapi
commit b0ea95f030f24c3a7f9936233d1c838eb953b7d1
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 17 06:21:19 2025 +0000
Fix job controller merging issue
commit 37f6fce6ad9b1723c32fc9157701adaf8e34c330
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Thu Jan 16 20:29:22 2025 -0800
[SKY-1321] Fix concurrent requests to launch local API server (#162)
* Fix concurrent requests to launch local API server
* lint
commit 2b537b9b349fb848ae9d4a6da84f330be8fcdd8c
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 17 03:42:59 2025 +0000
format and unittest fix
commit bd65f74dce7588ae62efd58bc5d5021576213326
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 17 03:27:58 2025 +0000
Fix canonicalize accelerator implementation from master merging
commit a33a894b3e2df6a238a0fab727dcce01c95e0f5c
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 17 01:37:31 2025 +0000
Merge branch 'master' of github.com:skypilot-org/skypilot into restapi
commit 50671940daf8eb9713879a88480f230531e2cc7e
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Jan 16 15:21:41 2025 -0800
[SKY-980] Refactor jobs queue and service status to avoid mp pool (#141)
* Refactor jobs queue and service status to avoid mp pool
* add comment
* fix API cancel
commit 7c614835447302964ae767aad8b8d283e7198411
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Thu Jan 16 11:35:51 2025 -0800
[Docs] Updates based on onboarding feedback (#120)
* Update docs
* Add python version
commit 73a20398855a0a63ccb9306af2ae40446e18f29d
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Wed Jan 15 22:58:28 2025 -0800
[SKY-373] Fix `sky local` for client server (#61)
* Make sky local up work
* lint
* lint
* Update checks
* lint
* Merge with restapi
* revert log changes
* fix
* Rebase done
* Comments and refactor
* fixes
* Change to __enter__ and add todo
* lint
* lint
* fix package path
* remove debug
* Fix GPU detection
* Logging fixes
* Lint
commit 006ce5c2aaa4f8bfbf9c061195cf9dc84d3e54ec
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Jan 15 19:55:59 2025 -0800
[SKY-915/SKY-1292] Clean up ssh entries and fix api logs exception handling (#157)
* Remove ssh config for clusters that do not exist
* Fix non-existent log path issue
* Update sky/cli.py
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
* Elaborate comments
---------
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
commit 0a0cb59e43e84441812622ee50f2dd042660e787
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Jan 16 00:46:35 2025 +0000
Add back changes from #78
commit e2d3eac431d6c443244e8b5eba28b1f8a2c07180
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Wed Jan 15 15:50:36 2025 -0800
[SKY-1045] Support GCP service account auth (#160)
Update GCP service account docs
commit efd893df50e9b9e91dc988a9f47d8c5c7fd7a7f7
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Jan 15 22:32:47 2025 +0000
Fix SKY-1303
commit 36e638f677de165dff5bfe6d278f29633f1db49a
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Jan 14 19:28:44 2025 -0800
[SKY-1288] Fix python SDK docs (#156)
* Fix docs for SDK
* Split into items
* Add comments for API endpoint env var
* format
* fix docs
* Add docs for serve
commit 5b686e34d0843d7f399584302e5758a621263dcd
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Jan 14 17:31:23 2025 -0800
[SKY-1217] Fix log download test (#159)
* Fix log download test
* Add comment
commit 4757b78d389188ec89810e82e6186c5c648f4ab7
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Jan 14 16:34:05 2025 -0800
[SKY-1095] Fix serve down issue with storage deletion (#158)
Fix serve down issue with storage deletion
commit c8bb92c7c5fca19939242fd10f64d9cdc9c7327f
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Tue Jan 14 14:14:09 2025 -0800
[SKY-1233] Handle non-sky exceptions and airflow updates (#151)
* Add exception wrapping and update airflow
* Add git branch
* Update airflow and docs
* Update airflow and docs
* lint
commit 0b4530f38e35cbfea513992ce9c189698abc4365
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Tue Jan 14 12:51:53 2025 -0800
[SKY-1258] Fix sky api start error message when using remote API server (#152)
* Fix sky api start logic
* lint
* lint
* comments
* simplify logging logic
* Add docs and comments
* Add docs and comments
commit 6dd380e7e316bc7c0051f955ba5f326c4e0d454c
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Jan 14 11:14:08 2025 -0800
[SKY-1291] Fix CRLF output (#153)
* Fix CRLF output
* fix last line
* update comment
commit 79f49ab29e9f901cca143a4aec366fd946d2aa55
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Mon Jan 13 20:40:24 2025 -0800
[SKY-1108] Better logging for skypilot config override (#150)
Better logging for skypilot config override
commit c28d7b49f6670893098ff63bf552ccea7c978dd7
Author: zpoint <zp0int@qq.com>
Date: Mon Jan 13 14:01:37 2025 +0800
type hint fix for tail logs (#149)
type hint
commit 24d270b64c6fa480e0eb50583b198401e000902f
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sun Jan 12 22:00:46 2025 -0800
[SKY-1129] Avoid lru_cache across requests (#137)
* Add reload for caches
* Add comments
* Fix docstr
* Simplify
* Add TODO
commit 270edfb8156e32cb395cae55bd37c44e13f2db1f
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sun Jan 12 21:53:26 2025 -0800
[UX] Prompt when launching on other's cluster (#146)
* Prompt when launching on other's cluster
* add import
* format
* Better logging
* Only show hash when user
commit 37f0c823a1b86ddb38ce377e80863baa4d74ad6c
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sun Jan 12 21:52:34 2025 -0800
[UX] Add terminal completion and fix progress bar with `\r` (#148)
* complete console
* Add terminal completion
* format
* format
* Fix \r
* format
commit 1e3e5281e159735533a7eac9f98c21f45f21eab6
Author: zpoint <zp0int@qq.com>
Date: Sun Jan 12 11:23:38 2025 +0800
[SKY-1287] `sky exec test echo hi` doesn't show task logs (hi) (#147)
* fix merge conflict overwritten
* mypy
* resolve comment
* mypy
* return type
commit a3db3e2e61f25c0518b6c868a3f1afaacf1b885c
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 10 18:53:27 2025 -0800
[SKY-1256] Avoid buffer for streaming hints (#145)
* Avoid buffer for streaming hints
* format
* Better log streamer
* format
* format
* format
* remove cache
* fix error handling
commit 786ae1d778889f89ba170d6e3db25774dec211ca
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Jan 11 02:42:49 2025 +0000
Avoid refresh cluster status when cluster yaml is None
commit 6d31f31c9b867c0515f6c9ea3080ae63101d66f0
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 10 13:50:11 2025 -0800
[SKY-1278/SKY-1134] Add versioning for API server (#138)
* Add versioning for API server
* Add commit and version in the health call
* refactor a bit
* format
* fix docstr
* renaming
commit 212f72fe663e27a3f7f8e83da13e6e795d0831b9
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 10 19:57:48 2025 +0000
format
commit 5fbe71ea17551dfb22f1021db44dd81a98c6dfe2
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 10 19:56:22 2025 +0000
format
commit e8a1f4736536020255430e9d1d61654881a0a7dc
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 10 19:55:18 2025 +0000
Rename to API server
commit 55c648700419d9c2181206438fe2e30aaac483d8
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Fri Jan 10 10:32:56 2025 -0800
Update health endpoint in docs (#144)
Update health endpoint
commit 9ec1ed69f64019d23dc5760aec76e6c7d01bc66d
Author: zpoint <zp0int@qq.com>
Date: Fri Jan 10 17:04:42 2025 +0800
[SKY-988] [Robust] Refactor tail_logs to make it a uvicorn async call instead of an async executor call (#130)
* sync log call
* terminate worker
* use sky abort
* merge restapi
* revert logging order
* doc update
commit e1eed3bf0ddd987b26aa094a2d1407f127ab28d2
Author: Hong <hong@assemblesys.com>
Date: Fri Jan 10 14:46:21 2025 +0800
[UX] set `RUNNING` request color to green (#142)
commit 961782a0d9b48939dea9ba2abcf5a53662d55052
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Jan 9 22:34:18 2025 -0800
[SKY-1273] Rename API server related CLIs (#139)
* Rename API server related CLIs
* Consolidate server logs
* format
* Add api login python API
* format
* Add comments and TODOs
* Add TODO for endpoint env var
commit cfaf37e1e5204b42c3bfcac6fcbb7f301d5cfa18
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Jan 9 18:16:47 2025 -0800
Add user hash for existing cluster (#140)
* Add user hash for existing cluster
* format
* format
commit c96710df70bf89fd20c373e04213bde54d9157be
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 10 01:05:37 2025 +0000
Fix check multiple clouds
commit 466c4d6a767e1a82631b6f2794ca296e82904049
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Jan 9 10:48:17 2025 -0800
Clean up client file mounts during API server restarts (#135)
* Refactor client server
* add docstr
* Fix types
* Refactor for jobs/server
* use type alias
* Add todo
* rename `check_health`
* update docstr
* Add comments
* Update docstr
* Adding docstr
* Move
* Add doc
* style
* format
* fix redis
* renames
* naming
* format
* format
* clean up clients file mounts as well
* fix merging naming issue
* format
* format
commit e6cf03f050c2086382e21f3fa2a68fb0d0ade32b
Author: Yika <yikaluo@assemblesys.com>
Date: Thu Jan 9 10:20:33 2025 -0800
[SKY-1210] Fix or triage TODOs in restapi branch (#128)
* Fix or triage TODOs in restapi branch
* merge with restapi
* comments
commit f73a871431a1a85584761f14b13e2e95e4d1a0d9
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Jan 9 00:41:27 2025 -0800
Address comments for restapi branch (#133)
* Refactor client server
* add docstr
* Fix types
* Refactor for jobs/server
* use type alias
* Add todo
* rename `check_health`
* update docstr
* Add comments
* Update docstr
* Adding docstr
* Move
* Add doc
* style
* format
* fix redis
* renames
* naming
* format
* format
* Add comment
* Avoid dumping to files
* format
* Update sky/server/requests/executor.py
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
* Add todo for legacy
* use queue.Queue
---------
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
commit 8fa28098edb6a85631c9a95107857d5f3701a832
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Jan 7 19:34:12 2025 +0000
remove poetry
commit fe9d7385b459754cf493eee3345281f2f97fc4a7
Merge: 344151261 6cf98a3cf
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Jan 7 18:39:54 2025 +0000
Merge branch 'master' of github.com:assemble-org/skypilot into restapi
commit 34415126138d67254c16f830c875a9f9077f76d4
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Mon Jan 6 23:45:19 2025 -0800
Merge master (#131)
* [perf] use uv for venv creation and pip install (#4414)
* Revert "remove `uv` from runtime setup due to azure installation issue (#4401)"
This reverts commit 0b20d568ee1af454bfec3e50ff62d239f976e52d.
* on azure, use --prerelease=allow to install azure-cli
* use uv venv --seed
* fix backwards compatibility
* really fix backwards compatibility
* use uv to set up controller dependencies
* fix python 3.8
* lint
* add missing file
* update comment
* split out azure-cli dep
* fix lint for dependencies
* use runpy.run_path rather than modifying sys.path
* fix cloud dependency installation commands
* lint
* Update sky/utils/controller_utils.py
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
---------
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* [Minor] README updates. (#4436)
* [Minor] README touches.
* update
* update
* make --fast robust against credential or wheel updates (#4289)
* add config_dict['config_hash'] output to write_cluster_config
* fix docstring for write_cluster_config
This used to be true, but since #2943, 'ray' is the only provisioner.
Add other keys that are now present instead.
* when using --fast, check if config_hash matches, and if not, provision
* mock hashing method in unit test
This is needed since some files in the fake file mounts don't actually exist,
like the wheel path.
* check config hash within provision with lock held
* address other PR review comments
* rename to skip_if_no_cluster_updates
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* add assert details
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* address PR comments and update docstrings
* fix test
* update docstrings
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* address PR comments
* fix lint and tests
* Update sky/backends/cloud_vm_ray_backend.py
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* refactor skip_if_no_cluster_update var
* clarify comment
* format exception
---------
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* [k8s] Add resource limits only if they exist (#4440)
Add limits only if they exist
* [robustness] cover some potential resource leakage cases (#4443)
* if a newly-created cluster is missing from the cloud, wait before deleting
Addresses #4431.
* confirm cluster actually terminates before deleting from the db
* avoid deleting cluster data outside the primary provision loop
* tweaks
* Apply suggestions from code review
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* use usage_intervals for new cluster detection
get_cluster_duration will include the total duration of the cluster since its
initial launch, while launched_at may be reset by sky launch on an existing
cluster. So this is a more accurate method to check.
* fix terminating/stopping state for Lambda and Paperspace
* Revert "use usage_intervals for new cluster detection"
This reverts commit aa6d2e9f8462c4e68196e9a6420c6781c9ff116b.
* check cloud.STATUS_VERSION before calling query_instances
* avoid try/catch when querying instances
* update comments
---------
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* smoke tests support storage mount only (#4446)
* smoke tests support storage mount only
* fix verify command
* rename to only_mount
* [Feature] support spot pod on RunPod (#4447)
* wip
* wip
* wip
* wip
* wip
* wip
* resolve comments
* wip
* wip
* wip
* wip
* wip
* wip
---------
Co-authored-by: hwei <hwei@covariant.ai>
* use lazy import for runpod (#4451)
Fixes runpod import issues introduced in #4447.
* [k8s] Fix show-gpus when running with incluster auth (#4452)
* Add limits only if they exist
* Fix incluster auth handling
* Not mutate azure dep list at runtime (#4457)
* add 1, 2, 4 size H100's to GCP (#4456)
* add 1, 2, 4 size H100's to GCP
* update
* Support buildkite CICD and restructure smoke tests (#4396)
* event based smoke test
* more event based smoke test
* more test cases
* more test cases with managed jobs
* bug fix
* bump up seconds
* merge master and resolve conflict
* more test case
* support test_managed_jobs_pipeline_failed_setup
* support test_managed_jobs_recovery_aws
* managed job status
* bug fix
* test managed job cancel
* test_managed_jobs_storage
* more test cases
* resolve pr comment
* private member function
* bug fix
* restructure
* fix import
* buildkite config
* fix stdout problem
* update pipeline test
* test again
* smoke test for buildkite
* remove unsupported cloud for now
* merge branch 'reliable_smoke_test_more'
* bug fix
* bug fix
* bug fix
* test pipeline pre merge
* build test
* test again
* trigger test
* bug fix
* generate pipeline
* robust generate pipeline
* refactor pipeline
* remove runpod
* hot fix to pass smoke test
* random order
* allow parameter
* bug fix
* bug fix
* exclude lambda cloud
* dynamic generate pipeline
* fix pre-commit
* format
* support SUPPRESS_SENSITIVE_LOG
* support env SKYPILOT_SUPPRESS_SENSITIVE_LOG to suppress debug log
* support env SKYPILOT_SUPPRESS_SENSITIVE_LOG to suppress debug log
* add backward_compatibility_tests to pipeline
* pip install uv for backward compatibility test
* import style
* generate all cloud
* resolve PR comment
* update comment
* naming fix
* grammar correction
* resolve PR comment
* fix import
* fix import
* support gcp on pre merge test
* no gcp test case for pre merge
* [k8s] Make node termination robust (#4469)
* Add limits only if they exist
* retry deletion
* lint
* lint
* comments
* lint
* [Catalog] Bump catalog schema version (#4470)
* Bump catalog schema version
* trigger CI
* [core] skip provider.availability_zone in the cluster config hash (#4463)
skip provider.availability_zone in the cluster config hash
* remove sky jobs launch --fast (#4467)
* remove sky jobs launch --fast
The --fast behavior is now always enabled. This was unsafe before but since
#4289 it should be safe.
We will remove the flag before 0.8.0 so that it never touches a stable version.
sky launch still has the --fast flag. This flag is unsafe because it could cause
setup to be skipped even though it should be re-run. In the managed jobs case,
this is not an issue because we fully control the setup and know it will not
change.
* fix lint
* [docs] Change urls to docs.skypilot.co, add 404 page (#4413)
* Add 404 page, change to docs.skypilot.co
* lint
* [UX] Fix unnecessary OCI logging (#4476)
Sync PR: fix-oci-logging-master
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* [Example] PyTorch distributed training with minGPT (#4464)
* Add example for distributed pytorch
* update
* Update examples/distributed-pytorch/README.md
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
* Update examples/distributed-pytorch/README.md
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
* Update examples/distributed-pytorch/README.md
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
* Update examples/distributed-pytorch/README.md
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
* Update examples/distributed-pytorch/README.md
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
* Update examples/distributed-pytorch/README.md
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
* Update examples/distributed-pytorch/README.md
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
* Fix
---------
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
* Add tests for Azure spot instance (#4475)
* verify azure spot instance
* string style
* echo
* echo vm detail
* bug fix
* remove comment
* rename pre-merge test to quicktest-core (#4486)
* rename to test core
* rename file
* [k8s] support to use custom gpu resource name if it's not nvidia.com/gpu (#4337)
* [k8s] support to use custom gpu resource name if it's not nvidia.com/gpu
Signed-off-by: nkwangleiGIT <nkwanglei@126.com>
* fix format issue
Signed-off-by: nkwangleiGIT <nkwanglei@126.com>
---------
Signed-off-by: nkwangleiGIT <nkwanglei@126.com>
* [k8s] Fix IPv6 ssh support (#4497)
* Add limits only if they exist
* Fix ipv6 support
* Fix ipv6 support
* [Serve] Add and adopt least load policy as default policy. (#4439)
* [Serve] Add and adopt least load policy as default policy.
* Docs & smoke tests
* error message for different lb policy
* add minimal example
* fix
* [Docs] Update logo in docs (#4500)
* WIP updating Elisa logo; issues with light/dark modes
* Fix SVG in navbar rendering by hardcoding SVG + defining text color in css
* Update readme images
* newline
---------
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
* Replace `len()` Zero Checks with Pythonic Empty Sequence Checks (#4298)
* style: mainly replace len() comparisons with 0/1 with pythonic empty sequence checks
* chore: more typings
* use `df.empty` for dataframe
* fix: more `df.empty`
* format
* revert partially
* style: add back comments
* style: format
* refactor: `dict[str, str]`
Co-authored-by: Tian Xia <cblmemo@gmail.com>
---------
Co-authored-by: Tian Xia <cblmemo@gmail.com>
* [Docs] Fix logo file path (#4504)
* Add limits only if they exist
* rename
* [Storage] Show logs for storage mount (#4387)
* commit for logging change
* logger for storage
* grammar
* fix format
* better comment
* resolve copilot review
* resolve PR comment
* remove unused var
* Update sky/data/data_utils.py
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* resolve PR comment
* update comment for get_run_timestamp
* rename backend_util.get_run_timestamp to sky_logging.get_run_timestamp
---------
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* [Examples] Update Ollama setup commands (#4510)
wip
* [OCI] Support OCI Object Storage (#4501)
* OCI Object Storage Support
* example yaml update
* example update
* add more example yaml
* Support RClone-RPM pkg
* Add smoke test
* ver
* smoke test
* Resolve dependency conflict between oci-cli and runpod
* Use latest RClone version (v1.68.2)
* minor optimize
* Address review comments
* typo
* test
* sync code with repo
* Address review comments & more testing.
* address one more comment
* [Jobs] Allowing to specify intermediate bucket for file upload (#4257)
* debug
* support workdir_bucket_name config on yaml file
* change the match statement to if else due to mypy limit
* pass mypy
* yapf format fix
* reformat
* remove debug line
* all dir to same bucket
* private member function
* fix mypy
* support sub dir config to separate to different directory
* rename and add smoke test
* bucketname
* support sub dir mount
* private member for _bucket_sub_path and smoke test fix
* support copy mount for sub dir
* support gcs, s3 delete folder
* doc
* r2 remove_objects_from_sub_path
* support azure remove directory and cos remove
* doc string for remove_objects_from_sub_path
* fix sky jobs subdir issue
* test case update
* rename to _bucket_sub_path
* change the config schema
* setter
* bug fix and test update
* delete bucket depends on user config or sky generated
* add test case
* smoke test bug fix
* robust smoke test
* fix comment
* bug fix
* set the storage manually
* better structure
* fix mypy
* Update docs/source/reference/config.rst
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* Update docs/source/reference/config.rst
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* limit creation for bucket and delete sub dir only
* resolve comment
* Update docs/source/reference/config.rst
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* Update sky/utils/controller_utils.py
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* resolve PR comment
* bug fix
* bug fix
* fix test case
* bug fix
* fix
* fix test case
* bug fix
* support is_sky_managed param in config
* pass param intermediate_bucket_is_sky_managed
* resolve PR comment
* Update sky/utils/controller_utils.py
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* hide bucket creation log
* reset green color
* rename is_sky_managed to _is_sky_managed
* bug fix
* retrieve _is_sky_managed from stores
* propagate the log
---------
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* [Core] Deprecate LocalDockerBackend (#4516)
Deprecate local docker backend
* [docs] Add newer examples for AI tutorial and distributed training (#4509)
* Update tutorial and distributed training examples.
* Add examples link
* add rdvz
* [k8s] Fix L40 detection for nvidia GFD labels (#4511)
Fix L40 detection
* [docs] Support OCI Object Storage (#4513)
* Support OCI Object Storage
* Add oci bucket for file_mount
* [Docs] Disable Kapa AI (#4518)
Disable kapa
* [DigitalOcean] droplet integration (#3832)
* init digital ocean droplet integration
* abbreviate cloud name
* switch to pydo
* adjust polling logic and mount block storage to instance
* filter by paginated
* lint
* sky launch, start, stop functional
* fix credential file mounts, autodown works now
* set gpu droplet image
* cleanup
* remove more tests
* atomically destroy instance and block storage simultaneously
* install docker
* disable spot test
* fix ip address bug for multinode
* lint
* patch ssh from job/serve controller
* switch to EA slugs
* do adaptor
* lint
* Update sky/clouds/do.py
Co-authored-by: Tian Xia <cblmemo@gmail.com>
* Update sky/clouds/do.py
Co-authored-by: Tian Xia <cblmemo@gmail.com>
* comment template
* comment patch
* add h100 test case
* comment on instance name length
* Update sky/clouds/do.py
Co-authored-by: Tian Xia <cblmemo@gmail.com>
* Update sky/clouds/service_catalog/do_catalog.py
Co-authored-by: Tian Xia <cblmemo@gmail.com>
* comment on max node char len
* comment on weird azure import
* comment acc price is included in instance price
* fix return type
* switch with do_utils
* remove broad except
* Update sky/provision/do/instance.py
Co-authored-by: Tian Xia <cblmemo@gmail.com>
* Update sky/provision/do/instance.py
Co-authored-by: Tian Xia <cblmemo@gmail.com>
* remove azure
* comment on non_terminated_only
* add open port debug message
* wrap start instance api
* use f-string
* wrap stop
* wrap instance down
* assert credentials and check against all contexts
* assert client is None
* remove pending instances during instance restart
* wrap rename
* rename ssh key var
* fix tags
* add tags for block device
* f strings for errors
* support image ids
* update do tests
* only store head instance id
* rename image slugs
* add digital ocean alias
* wait for docker to be available
* update requirements and tests
* increase docker timeout
* lint
* move tests
* lint
* patch test
* lint
* typo fix
* fix typo
* patch tests
* fix tests
* no_mark spot test
* handle 2cpu serve tests
* lint
* lint
* use logger.debug
* fix none cred path
* lint
* handle get_cred path
* pylint
* patch for DO test_optimizer_dryruns.py
* revert optimizer dryrun
---------
Co-authored-by: Tian Xia <cblmemo@gmail.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-16-12.ec2.internal>
* [Docs] Refactor pod_config docs (#4427)
* refactor pod_config docs
* Update docs/source/reference/kubernetes/kubernetes-getting-started.rst
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
* Update docs/source/reference/kubernetes/kubernetes-getting-started.rst
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
---------
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
* [OCI] Set default image to ubuntu LTS 22.04 (#4517)
* set default gpu image to skypilot:gpu-ubuntu-2204
* add example
* remove comment line
* set cpu default image to 2204
* update change history
* [OCI] 1. Support specify OS with custom image id. 2. Corner case fix (#4524)
* Support specify os type with custom image id.
* trim space
* nit
* comment
* Update intermediate bucket related doc (#4521)
* doc
* Update docs/source/examples/managed-jobs.rst
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* Update docs/source/examples/managed-jobs.rst
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* Update docs/source/examples/managed-jobs.rst
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* Update docs/source/examples/managed-jobs.rst
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* Update docs/source/examples/managed-jobs.rst
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* Update docs/source/examples/managed-jobs.rst
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* add tip
* minor changes
---------
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* [aws] cache user identity by 'aws configure list' (#4507)
* [aws] cache user identity by 'aws configure list'
Signed-off-by: Aylei <rayingecho@gmail.com>
* refine get_user_identities docstring
Signed-off-by: Aylei <rayingecho@gmail.com>
* address review comments
Signed-off-by: Aylei <rayingecho@gmail.com>
---------
Signed-off-by: Aylei <rayingecho@gmail.com>
* [k8s] Add validation for pod_config #4206 (#4466)
* [k8s] Add validation for pod_config #4206
Check pod_config when running 'sky check k8s' by using the k8s API
* update: check pod_config when launch
check merged pod_config during launch using k8s api
* fix test
* ignore check failed when test with dryrun
If there is no kubeconfig in the environment, ignore the ValueError when launching
with dryrun. For now, we don't support checking the schema offline.
* use deserialize api to check pod_config schema
* test
* create another api_client with no kubeconfig
* test
* update error message
* update test
* test
* test
* Update sky/backends/backend_utils.py
---------
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* [core] fix wheel timestamp check (#4488)
Previously, we were only taking the max timestamp of all the subdirectories of
the given directory. So the timestamp could be incorrect if only a file changed,
and no directory changed. This fixes the issue by looking at all directories and
files given by os.walk().
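A minimal sketch of the approach described in this commit message, assuming a hypothetical helper name `latest_mtime` (this is not the actual SkyPilot function): it takes the maximum modification time over the root, every subdirectory, and every file yielded by `os.walk()`, so a change to a single file is detected even when no directory timestamp moves.

```python
import os


def latest_mtime(root: str) -> float:
    """Return the newest modification time found under root.

    Hypothetical helper for illustration only.
    """
    newest = os.path.getmtime(root)
    for dirpath, dirnames, filenames in os.walk(root):
        # Consider both directories and files, not directories alone.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so broken links are safe.
                newest = max(newest, os.lstat(path).st_mtime)
            except OSError:
                # The entry disappeared during the walk; skip it.
                continue
    return newest
```

Editing a file's contents in place updates the file's own timestamp but usually not its parent directory's, which is the case a directory-only check would miss.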
* [docs] Add image_id doc in task YAML for OCI (#4526)
* Add image_id doc for OCI
* nit
* Update docs/source/reference/yaml-spec.rst
Co-authored-by: Tian Xia <cblmemo@gmail.com>
---------
Co-authored-by: Tian Xia <cblmemo@gmail.com>
* [UX] warning before launching jobs/serve when using reauth-required credentials (#4479)
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* Update sky/backends/cloud_vm_ray_backend.py
Minor fix
* Update sky/clouds/aws.py
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* wip
* minor changes
* wip
---------
Co-authored-by: hong <hong@hongdeMacBook-Pro.local>
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* [GCP] Activate service account for storage and controller (#4529)
* Activate service account for storage
* disable logging if not using service account
* Activate for controller as well.
* revert controller activate
* Add comments
* format
* fix smoke
* [OCI] Support reuse existing VCN for SkyServe (#4530)
* Support reuse existing VCN for SkyServe
* fix
* remove unused import
* format
* [docs] OCI: advanced configuration & add vcn_ocid (#4531)
* Add vcn_ocid configuration
* Update config.rst
* fix merge issues WIP
* fix merging issues
* fix imports
* fix stores
---------
Signed-off-by: nkwangleiGIT <nkwanglei@126.com>
Signed-off-by: Aylei <rayingecho@gmail.com>
Co-authored-by: Christopher Cooper <cooperc@assemblesys.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Co-authored-by: zpoint <zp0int@qq.com>
Co-authored-by: Hong <weih1121@qq.com>
Co-authored-by: hwei <hwei@covariant.ai>
Co-authored-by: Yika <yikaluo@assemblesys.com>
Co-authored-by: Seth Kimmel <seth.kimmel3@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Lei <nkwanglei@126.com>
Co-authored-by: Tian Xia <cblmemo@gmail.com>
Co-authored-by: Andy Lee <andylizf@outlook.com>
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
Co-authored-by: Hysun He <hysunhe@foxmail.com>
Co-authored-by: Andrew Aikawa <asai@berkeley.edu>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-16-12.ec2.internal>
Co-authored-by: Aylei <rayingecho@gmail.com>
Co-authored-by: Chester Li <chaoleili2@gmail.com>
Co-authored-by: hong <hong@hongdeMacBook-Pro.local>
commit 4d9c1d5f83a7ea2799800976f98bef8435552902
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Mon Jan 6 13:06:15 2025 -0800
[Client] Create ssh config folder (#105)
Create ssh config folder
commit 397a6d9c80c1d63127699be654e1c445de8bb4db
Author: Yika <yikaluo@assemblesys.com>
Date: Fri Jan 3 14:51:31 2025 -0800
[UX][SKY-1085] Show more readable process command name in `htop` (#119)
* [UX] Process naming
* comment
commit c289b36e11bf5fe39bcf5dfbaae22feec7cc027d
Author: Yika <yikaluo@assemblesys.com>
Date: Fri Jan 3 14:40:47 2025 -0800
[UX][SKY-1081] More consistent logs for file syncing (#118)
[UX][SKY-1081] Better logs for file syncing
commit f29a6de440408172bf5c7e054b50ee6742d0557b
Author: Yika <yikaluo@assemblesys.com>
Date: Fri Jan 3 14:04:42 2025 -0800
[SKY-1071] Fix sky api server_logs for remote API server (#129)
* [SKY-1071] Fix sky api server_logs for remote API server
* comments
commit ae30752777f7bc7778feda5b51f760337d066615
Author: Yika <yikaluo@assemblesys.com>
Date: Mon Dec 23 11:20:31 2024 -0800
[UX][SKY-1092] Colored request status for sky api ls (#116)
* [UX][SKY-1092] Colored request status for sky api ls
* address comments
* fix type
* nit
commit baf3fa340f625acb18e56d28cda9bcc99fc83ff5
Author: Yika <yikaluo@assemblesys.com>
Date: Mon Dec 23 11:18:27 2024 -0800
[UX][SKY-1086] Polish worker logs in API server (#117)
* [UX][SKY-1086] Polish worker logs in API server
* address comments
* revert accident commit
commit a9f5f38c92e23a61fcb365b4fb62dd3959219c0c
Author: Yika <yikaluo@assemblesys.com>
Date: Wed Dec 18 10:17:16 2024 -0800
[SKY-1128] Fix sky storage cloud type mismatch issue (#109)
Fix sky storage cloud type mismatch issue
commit 6318b38159c6754e5e950c08c4058176d88aea5c
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Tue Dec 17 16:57:17 2024 -0800
[Docs] Quick updates to docs (#106)
Updates
commit 591c52e8757eb182392a51cf583807c6fbe7e41e
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Dec 17 16:22:01 2024 -0800
Reload kubernetes utils to avoid caching issue (#102)
* Reload kubernetes utils to avoid caching issue
* format
commit d75fcd849c2dfef833a5afb3dfae35010afc1e06
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Tue Dec 17 16:01:27 2024 -0800
[SKY-1107] Docs for deployment (#92)
* Update admin docs
* Updates
* Update docs
* Update docs
* Add storage docs
* Updates
commit 65554949e0f7743e964abf9c228c0d501087a5f5
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Dec 17 12:49:03 2024 -0800
[Tests] Fix smoke test for `--kubernetes` and backward compatibility test on AWS (#80)
* Remove stop test for k8s
* Simplify the test k8s creation commands
* use 5+
* fix
* fix
* expanduser
* add tpu mark
* minore large cpu
* correctly quote
* format
* get old jobs as well
* fix launch on existing
* fix
* fix back
* fix back compat
* minor fix
* fix
* Fix server side validation
* remove debug
* rename back to cloudvmray
* rename back to cloudvmray
* fix aws cli on kubernetes pod
* format
* use uv
* fix status -r in backward compat
* Fix path
* format
* less \t
* try less gap for waiting
* Add a timeout for curl to avoid long waiting
* Add max time as well
* shorter sleep
commit 8af911e4ad36a8f9da77f5186b18e716503a05a4
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Tue Dec 17 00:03:37 2024 -0800
[Sky-1123] Fix proxy command for SSH (#94)
Fix proxy command
commit ed724c52a458fbac9d81fa0134f0ff1612b0c531
Author: Yika <yikaluo@assemblesys.com>
Date: Mon Dec 16 18:11:31 2024 -0800
[SKY-1110] Fix sky server controller name issue (#91)
* Remove controller name reference on client side
* format
* add comments
* rename
commit 670af89b9677b75a4c08f409ccc653c07a608654
Author: Yika <yikaluo@assemblesys.com>
Date: Mon Dec 16 18:10:12 2024 -0800
Fix validating relative path (./) (#90)
* Not validate file_mounts for sky exec
* fix sky launch on file_mount relative path
* address comments
* validate on both side
* nit
* smoke test
* comments
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* format
---------
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
commit 8f07b85738dc174a0656ef5115a9b1e7145783a3
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Mon Dec 16 01:22:43 2024 -0800
Fix config reload (#79)
* reload the config
* rename
* elaborate
* Fix makedir
* minor
* longer retry for deploy
* Refactor
* fix unit test
* fix
* format
* fix docstr
* revert
* Fix loading issue
* Add unit test
* format
* fix condition
* Fix controller hash
* revert to remove
* rename
* format
* rename
* comment
* Avoid skip env var
* comment
* Add more tests
* try test_config
* format
* avoid test config
* fix folder path
* use uuid
* use uuid for download as well
* format
commit 389a32b3f0410dce4bd95b74c7e982f442f4d644
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Sat Dec 14 17:49:41 2024 -0800
[SKY-1073] Fix cross-device linking error with `sky jobs launch` (#78)
* hardlink in the same block device
* lint
* lint
* Update sky/skylet/constants.py
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
---------
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
commit 1a45d90e5b7ea75d2c86c2bba377a7d563af5f85
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Dec 14 16:09:56 2024 -0800
[Docs] Client server docs (SKY-1101) (#77)
* Add autobuild
* better logging
* Add docs for API server
* update note for `sky down` and `sky top`
* format
* Updates
* Update
* updates
* fix typos
* Fix
* comment
---------
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
commit 07a4ce71fe41a4b51ef4cade461c88648cb5cecf
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Fri Dec 13 16:06:53 2024 -0800
[SKY-1074] Fix storage related tests for remote API server (#74)
* Make test_storage_mounting.yaml.j2 cloud-specific
* Add local marker to skip tests on remote API server
* Update todo
* lint
commit 1556c3301e1d06614da8a0040c6bd55608679a4b
Author: Yika <yikaluo@assemblesys.com>
Date: Fri Dec 13 15:53:13 2024 -0800
[SKY-1080] Controller backward compatibility (#72)
* Controller backward compatibility
* format
* test
* test2
* polish
* format
* revert hostname
* nit
* better naming
* unit test
* more unit test
* format
* address comments
* constant
commit 3eb370d639713b8892bfb7fccecd8cf3e5fcff75
Author: Yika <yikaluo@assemblesys.com>
Date: Thu Dec 12 21:16:16 2024 -0800
[SKY-1084] k8 ssh proxy uses sky python (#75)
k8 ssh proxy uses sky python
commit 9f56b0fc31c6492f930805acf89c8e4ab4cc2609
Author: Yika <yikaluo@assemblesys.com>
Date: Thu Dec 12 21:15:50 2024 -0800
[SKY-1094] Disable sky bench (#73)
Disable sky bench
commit 352320e2ba3e5c086fe9fd8ea1aeac716e67e371
Author: Yika <yikaluo@assemblesys.com>
Date: Thu Dec 12 20:28:43 2024 -0800
[SKY-1104] fix sky down issue (#76)
fix sky down issue
commit 0383c82457401875a428aed0f9dd0bb6d6764904
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Dec 12 17:09:40 2024 -0800
Fix websocket proxy path and fixes SKY-1076 (#71)
* Fix websocket proxy path
* Add cluster config
* format
commit 9fbb7fba1e656bb403ddfa17056382ecca7819c1
Author: Yika <yikaluo@assemblesys.com>
Date: Thu Dec 12 16:56:40 2024 -0800
UX fixes from alpha test (#70)
* UX fixes from alpha test
* decoder encoder
* address comments
commit f8273f983dd4d78f6cbedc0e8f066c741afe26de
Author: Yika <yikaluo@assemblesys.com>
Date: Thu Dec 12 16:45:21 2024 -0800
[UX] sky api abort -a and -u (#69)
* sky api abort -a and -u
* fix -u
* comments
* address comments
* nit
commit 7b7abaa82a89d63f45a92d8e46af203bc42e991b
Author: Yika <yikaluo@assemblesys.com>
Date: Thu Dec 12 13:40:05 2024 -0800
[UX] sky api ls feature polish (#67)
* sky api ls feature polish
* address comments
* naming
* nit
* user ID
* --all-status
commit 92b137f838a636d4568a6040a73919da440576e5
Author: Yika <yikaluo@assemblesys.com>
Date: Thu Dec 12 09:45:11 2024 -0800
Fix sky storage leaking issue (#65)
* Fix cloud storage leak
* nit
* format
* comments
* comments
commit 839db19606ae46e75e354f127faa9d5ceb96e7f0
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Dec 11 21:37:46 2024 -0800
Merge branch 'master' of github.com:skypilot-org/skypilot into restapi (#62)
* [perf…
* Rearchitect SkyPilot to be Client-Server
Squashed commit of the following:
commit 7889757299c0260fa4fc6e8c2d3b06484783aae7
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Feb 6 21:36:58 2025 +0000
change to local cache for the k8s client for newly added credential
commit bf660981504544aeb15650938d754094d00faf22
Merge: 19f13d372 3e9bd9d0f
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Feb 6 21:30:59 2025 +0000
Merge branch 'master' of github.com:skypilot-org/skypilot into restapi
commit 19f13d372b3fa7d590adcb490cc03ff7d2fa6f30
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Feb 6 19:48:17 2025 +0000
Fix the unit test
commit 72ea226c4b73c6e0eac0f4d55063a6507107c711
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Feb 6 17:42:51 2025 +0000
fix variables
commit 0dcc06b6bde5f2d24df1c7a8e31b703d54705e77
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Feb 6 07:57:26 2025 +0000
remove unnecessary endpoint
commit d54c74c59cac76ee56acfd339328144eb60d5929
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Feb 6 07:49:56 2025 +0000
refactor serve endpoint for less API calls
commit 912069e6f715d10f7d44c4acdd45d57c25de6689
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Feb 6 07:27:56 2025 +0000
Fix output
commit f8630cdba402882112fdc6821c586d897ee766ea
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Feb 6 02:36:39 2025 +0000
fix bump seconds
commit 38e66d87fc56d03e5194d9ff18f58ff2194d680a
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Feb 5 23:11:11 2025 +0000
Fix merging issue
commit 0e7ace80f84ee4845f006e63610f9f80c1b5928c
Merge: 99b81b6ff e7d94e956
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Feb 5 23:00:03 2025 +0000
Merge branch 'master' of github.com:skypilot-org/skypilot into restapi
commit 99b81b6ff810c595f4e9c5fbd4c8936692521fef
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Feb 5 22:53:15 2025 +0000
mypy
commit 0cadaf976bd3f4296250992d299128282e16f9f6
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Feb 5 22:35:31 2025 +0000
Fix serve status
commit 8d7849f3579e4a500efcc55c8a7e8023faf776e7
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Feb 5 22:28:22 2025 +0000
Avoid clone-disk-from tests
commit f0026726c1f6cfe66658e6e4820816e3d7591af9
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Feb 5 22:14:09 2025 +0000
Avoid detach setup
commit 9b8c0cbe61ede6b846bbc054f68653f6d1cdff62
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Feb 5 01:09:04 2025 +0000
Address comments
commit cfaa62e573f0345ad4ba70c1b33ba1163ba809df
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Feb 5 00:56:22 2025 +0000
pylint
commit 52bbdb7619c0a0334024d5a7e0eff02124e456e0
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Feb 5 00:55:03 2025 +0000
Address comments
commit 25a0c9255c1a166e0855f56ca0d53751b1528723
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Feb 4 22:41:21 2025 +0000
docstr
commit 1bad3ebef6bafbd0eeba28b03f8bc6e310ff558a
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Feb 4 22:13:42 2025 +0000
Fix raise for resource parsing
commit 81d7d79070b1ebc711a3c0034b40778df4608d77
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Feb 4 22:11:35 2025 +0000
Fix env var for CPU/memory limit in pod
commit 11cff6b7370f148e74da0c56729dbe094d3d9f77
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Feb 4 21:46:12 2025 +0000
rename to LONG and SHORT
commit 75797766c9830d58d8c1c0123c2782dee8417aae
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Feb 4 21:03:02 2025 +0000
Use 127.0.0.1 by default for API server endpoint
commit 41e53331ee5ae0b1119b8b24d950ff6bd13e5a37
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Mon Feb 3 23:44:01 2025 +0000
update logging for status refresh
commit 574fb6f0b6bf843384ee6f0070b40e540e06b87a
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Mon Feb 3 15:41:41 2025 -0800
[Usage/SKY-1403] Fix the usage collection for API server (#180)
* Fix usage for server side
* minor update
* Fix run id
* fix client entrypoint cmd
* rename function
* directly use reset for usage messages and refactor a bit
* minor comment update
* disable usage for internal status refresh
commit 09801da1f23c125e539caa20b94689832b7b0c08
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Mon Feb 3 04:46:40 2025 +0000
Fix merge conflict for jobs launch
commit 7b86c84f4543ac092c5fb56474c32459b33c8d34
Merge: 52d499e2f 269dfb192
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Mon Feb 3 04:27:46 2025 +0000
Merge branch 'master' of github.com:skypilot-org/skypilot into restapi
commit 52d499e2f9c3bf3f75674bd6d160224b5f50729d
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sun Feb 2 08:09:49 2025 +0000
format
commit af6a346342109c31b71ad7d793275de407c29ab8
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sun Feb 2 08:09:34 2025 +0000
address comments
commit 983bf1f392783e81477ccac1f90d9c5b6b77e16e
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Feb 1 18:33:52 2025 +0000
fix argument
commit 100a5d43154f4cf2ada5f154f40fc361dd35a9ec
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Feb 1 18:33:15 2025 +0000
Fix invocation
commit 66540e28ec8282fbaab263b440b4acd4ae4b36eb
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Feb 1 18:32:26 2025 +0000
fix invocation
commit d62c79c64656723af85d9052282e0fc7e024729e
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Feb 1 18:27:25 2025 +0000
make `need_confirmation` internal
commit 3f573afd2253410c320b3c134946c89b3d58e7cc
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Feb 1 18:01:51 2025 +0000
base exception
commit efdb5911fa0da3f4f5701f27bb26ac5b5395b252
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Feb 1 06:40:40 2025 +0000
format
commit f6964d398b6f2298bab23e2c82707cef6295c659
Merge: d7dd77d61 ac9f159db
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Feb 1 06:39:33 2025 +0000
Merge branch 'restapi' of github.com:assemble-org/skypilot into restapi
commit d7dd77d61b5579a698015e43b67718f7a923d916
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Feb 1 06:39:24 2025 +0000
Address comment
commit ac9f159db4b9adb3a9822d30b0fb71e9119da3c7
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 31 22:35:09 2025 -0800
Update sky/data/storage.py
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
commit 935c8b94c14275ffa3646312a50e82589a8fd1f7
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Feb 1 05:43:59 2025 +0000
Fix SKY-1383
commit f4f7653a2232e159ee4cfe55f0e9cc60baec0524
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Feb 1 02:13:49 2025 +0000
Fix unit tests
commit 4347b3d18acd013972132d878187952298a81ffb
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 31 23:57:14 2025 +0000
use response 500 for failed requests
commit b6fea92010749e7593387ba9813e740363633751
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 31 23:05:46 2025 +0000
Add TODO
commit 26938445c9436c72c50d35f8540e9ee54e06ca2c
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 31 20:32:18 2025 +0000
address comments
commit b6d5437665cb5dd84fc574b4452a3ea24297b034
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 31 17:46:27 2025 +0000
fix tests
commit 432b586954c84be86ac36feea26eb5f892ad794e
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 31 08:32:52 2025 +0000
Address comments
commit 8ae605c993067d8b15fcd7eb91dbaeada750a8d6
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 31 04:10:31 2025 +0000
address comments
commit 80b34b5a35f8cd4db081ddd528ee2a506835987a
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Jan 30 19:46:26 2025 +0000
Address comments
commit 4e80ff8390715cfdf55a784be2bc7b6e8266f67a
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Jan 25 19:51:56 2025 +0000
Address comments
commit 307140e45ba602b88f0d44da6ff232bbd45f22bb
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Jan 25 18:39:16 2025 +0000
Address comments
Fix cancel body
Fix
Fix
remove redis types
commit 252786fe73d47a5df725b29b4a2a1fa7d58a4aaa
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 18:35:19 2025 -0800
[SKY-1374] Fix HTTPS keys (#179)
* upload https keys
* alternative output
commit cb45ae3ab21a3ec1c3fa2567d080723378ef7703
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 23:23:05 2025 +0000
Fix autodown
commit a471c092c4918b035a51f88b728e844f262ce643
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 23:04:04 2025 +0000
Fix folder creation for log download
commit 4c2a438b24b349359e38ef37e042b6fa21bd5444
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 21:48:52 2025 +0000
format
commit 32d8ffa197cdcd669b3c888c0967808c62799712
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 21:43:01 2025 +0000
Fix cluster job downloading
commit 895f277e7cb80a2a43772499e7233f15b7130fbf
Merge: 70e7ebf61 1146cab36
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 20:35:04 2025 +0000
Merge branch 'restapi' of github.com:assemble-org/skypilot into restapi
commit 70e7ebf61cd36aecdd4d35f169b63df0d40ad9aa
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 20:23:03 2025 +0000
Naming and comments
commit 1146cab3694a34bf9cb66a4e244c069250afb47d
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 11:48:19 2025 -0800
[SKY-1368] Support empty folder in file mounts and symlinks (#178)
* upload symlink in file upload
* format
* remnant
* fix relative
* check relative path
* Fix relative path in symlink target
* Add unittest for zip and unzip
* Add back symlink
* fix circle link
commit a8bba42de13ea88b9e4fc9bfb2f121abb2663cfb
Author: zpoint <zp0int@qq.com>
Date: Sat Jan 25 02:51:29 2025 +0800
get rid of autouse on pytest (#176)
* get rid of autouse
* manual fixture
* prevent reload
* remove the prevent_reload
* Avoid reload cli
* format
---------
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
commit 099b8be94f1e40ed84f734a0bd44f7ba9941587f
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 09:52:54 2025 -0800
[SKY-1366] Fix storage reconstruct (#177)
* Fix storage reconstruct
* Error out instead
* Add action
* Fix exception type
* Fix logging
commit 62071d56bff05187cf2127429122ba6691ff05a1
Author: Hong <hong@assemblesys.com>
Date: Fri Jan 24 20:31:28 2025 +0800
[SKY-1200][UX] Valid sky server available while login (#134)
commit 4179a6c2f0c771cec2f5d758304f959937b04e04
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 07:57:18 2025 +0000
move lock under logs folder
commit 5a66dae35ecc1efebbacb2a1d5fef9652cbcb8e1
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 07:22:14 2025 +0000
Fix config override error handling
commit ed641a16b5f7bf126ec0e6c8cab3b0ae836aa404
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 02:08:52 2025 +0000
Expose CLOUD_REGISTRY
commit 60e211f27c8b2b82f601eb75b99afac9795d6e62
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 02:02:03 2025 +0000
format
commit 9fcce2c977da60f875c26cc4140e87fc293f98c1
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 02:00:37 2025 +0000
[SKY-1367] Fix validate exception serialization and deserialization
commit a00748b33e3d062f4bc9afaee631f05a18d85b45
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 00:57:09 2025 +0000
Fix deepspeed setup
commit c81a20a5844d752287f951b65566d5d1e4780b69
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 00:39:53 2025 +0000
remove api server deployment
commit c2a4f49a2adf10099b87f33c5de2d9956ef69d00
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 24 00:34:34 2025 +0000
Fix dimming
commit 5ff29f469d68e114fc2a7305b89bde4cb62d4ba2
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Jan 23 16:32:26 2025 -0800
merge issue
commit 3994e17be460f15bbbb2834e78a29b566b5a3539
Merge: befbf5114 97b8e8f12
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Jan 23 22:35:21 2025 +0000
Merge branch 'master' of github.com:skypilot-org/skypilot into restapi
commit befbf51143a9e1735298a585b0ac84250a79d23b
Author: zpoint <zp0int@qq.com>
Date: Thu Jan 23 14:56:33 2025 +0800
SKY-1034 [Tests] Fix tests with no need for credentials (#51)
* add restapi for tests
* fix storage
* add TODO
* uncomment CI test
* pass api test and unit test
* fix dryrun bug
* fix test_config
* fix test_jobs
* fix for jobs and serve
* bug fix
* fix test jobs and serve
* support enable_all_clouds on server side
* comment out fail test cases
* bug fix and reformat
* temp comment
* pass tests
* test config uncomment
* restructure fixture for faster test
* restore change by cursor
* test test_list_accelerators
* bug fix
* fix test cli
* fix all
* test CI
* bug fix
* rename file
* Update sky/cli.py
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* test_config pass final test
* test_config pass final test
* use patch and import strategy
* format
* bug fix
* local test behaviour
* fix reload issue
* yapf
* remove test controller util
* fix test cases
* resolve PR comment
* resolve merge conflict
* fix k8s test after merging restapi
* resolve comment
* resolve PR comment
* debug
* debug
* bug fix
* api change
* fix PR comment
* change stream method
* fix
* cli fix
* Update tests/conftest.py
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* Update tests/test_cli.py
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* Update tests/test_cli.py
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* fix
* resolve PR comment
* restore class
* support stream mock
---------
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
commit 72d85e942b91905f41807765d2ad141eae7438ce
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Jan 22 22:41:28 2025 -0800
[SKY-1357] Fix jobs dashboard on Kubernetes (#175)
* Add port forward command support
* fix port forward command
* format
* Add comment
commit 0c39a8cca617b736e8c675bef08c1a8082167de8
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Jan 22 18:53:58 2025 -0800
[SKY-1363] `sky down sky-jobs...` request fail to cancel on-going request for `sky jobs launch` (#174)
* Cancel jobs request when downing
* Add column for api status
* Add comment
* minor
commit d582284dc4ddb2cad5282ca03829144acb275644
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Jan 22 18:51:05 2025 -0800
[SKY-1323] Add timestamps for server logs (#173)
* Add timestamps for server log
* comments
* format
commit fffc1535530f52c888e52c9176318918220e1758
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Wed Jan 22 15:58:22 2025 -0800
[SKY-1223] Docs for multi-k8s support, exec based kubeconfig converter (#164)
* Add docs and kubeconfig converter
* lint
* update docs
* update docs
commit 3ab843ffbb46bc87b4bcc75c113c618142a1ca77
Author: zpoint <zp0int@qq.com>
Date: Thu Jan 23 06:04:30 2025 +0800
[SKY-1009] [Robust] Make streaming requests (sky logs, sky jobs logs, etc.) synchronous (#155)
* tail logs
* circular import
* print end
* bug fix
* fix
* core function
* resolve PR comment
* mypy
* bug fix
* type
* resolve PR comment
* restore change
* resolve merge conflict
* Architecture: Add changes discussed offline
* minor movement
* format
* format
* Fix import
* fix imports
---------
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
commit 3e75a0f4d3756fb8c469e94088c360d5fdcbe4a4
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Wed Jan 22 13:07:39 2025 -0800
[SKY-1329] Better logging when API server fails to start (#167)
* Better logging when API server fails to start
* lint
* comments
* lint
* lint
commit 179ed3391ee8c5a7e554b9470965cca82a751432
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Jan 22 11:17:25 2025 -0800
[SKY-1337] Refactor client common and fix jobs logs sync down (#165)
* refactor and support log sync down for jobs
* Refactor for file uploads
* Fix log path
* Fix jobs logs downloading
* fix smoke test
* fix smoke test
* format
* Add debugging
* Add debug
* Reuse request body
* Remove debug lines
* get the latest
* fix jobs logs tests
* fix grep JobStatus
commit a90928e11a872cbabb983fb766fa52485861ee66
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Jan 22 11:01:35 2025 -0800
[SKY-1347] Cancel pending requests for `sky down` (#172)
Cancel pending requests for `sky down` as well
commit 9167fed281428f8a64ecd39b834fa028d4864c94
Author: zpoint <zp0int@qq.com>
Date: Wed Jan 22 18:01:08 2025 +0800
add aiofiles support for pre-commit-config (#171)
add aiofiles support
commit 8301271cb1504ad76abcf2ebf4b48d1979af71de
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Tue Jan 21 22:16:21 2025 -0800
[SKY-1326] Minor logging fixes for requests (#169)
Minor logging
commit 1865ecba8b01a03e436f4b5e65beb8031cac4258
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Tue Jan 21 18:57:55 2025 -0800
[SKY-1269] Allow tasks to run in custom namespaces (#170)
* incluster namespace selection
* comment
* lint
* lint
commit c4edb603464a00a715b41df7bfbdfa7c1a30bd4a
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Jan 21 16:31:47 2025 -0800
[SKY-1115] Fix smoke tests with remote API server (no local credential) and local API server (#161)
* Run cloud storage checks in tests on a cluster
* More fixes
* add more fix
* format
* format
* Fix tests that requires local credentials
* refactor the cmd runner
* use new cluster
* Add handling for GCP cloud commands
* Wait for the cloud cmd cluster to be up
* fix storage deletion
* fix api server endpoint
* fix env
* fix None for env
* fix intermediate storage
* increase timeout
* skip bench
* fix queue with docker
* format
* more robust termination
* Fix error handling for controller
* fix API call
* format
* format
* fix gcp related tests
* Additional fixes
* fix clean up
* change zone
* avoid test if on k8s
* longer wait for cloud cmd cluster
* longer wait
* longer wait
* longer
* avoid tailing ?
* wait for running
* Fix storage mounts k8s test
* Fix msg
* Fix managed jobs output
* fix dependency installation
* Add additional fix
* longer wait time for running
* longer wait
* slightly longer for cleaning up the resources
* fix output
* Fix cluster name
* fix cluster names
commit 020a20efa43be95e36a3222d7e7abaf2269d2d7f
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sun Jan 19 14:24:31 2025 -0800
[SKY-1348] Fix console mess up (#166)
* Fix jobs logs
* Add comments
commit 2bae2b5a6c3d52f589dca55b7faa0eb2a68cc5c5
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sun Jan 19 11:52:51 2025 -0800
[SKY-1339] Upload files by chunks (#163)
* fixes
* wip fixes
* wip
* minor fix
* Fix bug for bad zip
* fix chunk upload
* update return value
* fix multi thread issue
* avoid io
* Add cleanup for stale chunks
* format
* lint
* Add comments
* ux
* fix cleanup
commit 189c686bf2a12071a21407085ef9fd258a30d800
Merge: 61f44a053 2354b818b
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sun Jan 19 19:21:07 2025 +0000
Merge branch 'master' of github.com:skypilot-org/skypilot into restapi
commit 61f44a053c6778ccf5c6b28abc92006cee997f03
Merge: e87afbc6c 6b23582d9
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 17 23:46:28 2025 +0000
Merge branch 'master' of github.com:skypilot-org/skypilot into restapi
commit e87afbc6c2687ae864c944d8137a294f32094410
Merge: b0ea95f03 9e1b4ddc5
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 17 07:15:09 2025 +0000
Merge branch 'master' of github.com:assemble-org/skypilot into restapi
commit b0ea95f030f24c3a7f9936233d1c838eb953b7d1
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 17 06:21:19 2025 +0000
Fix job controller merging issue
commit 37f6fce6ad9b1723c32fc9157701adaf8e34c330
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Thu Jan 16 20:29:22 2025 -0800
[SKY-1321] Fix concurrent requests to launch local API server (#162)
* Fix concurrent requests to launch local API server
* lint
commit 2b537b9b349fb848ae9d4a6da84f330be8fcdd8c
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 17 03:42:59 2025 +0000
format and unittest fix
commit bd65f74dce7588ae62efd58bc5d5021576213326
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 17 03:27:58 2025 +0000
Fix canonicalize accelerator implementation from master merging
commit a33a894b3e2df6a238a0fab727dcce01c95e0f5c
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 17 01:37:31 2025 +0000
Merge branch 'master' of github.com:skypilot-org/skypilot into restapi
commit 50671940daf8eb9713879a88480f230531e2cc7e
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Jan 16 15:21:41 2025 -0800
[SKY-980] Refactor jobs queue and service status to avoid mp pool (#141)
* Refactor jobs queue and service status to avoid mp pool
* add comment
* fix API cancel
commit 7c614835447302964ae767aad8b8d283e7198411
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Thu Jan 16 11:35:51 2025 -0800
[Docs] Updates based on onboarding feedback (#120)
* Update docs
* Add python version
commit 73a20398855a0a63ccb9306af2ae40446e18f29d
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Wed Jan 15 22:58:28 2025 -0800
[SKY-373] Fix `sky local` for client server (#61)
* Make sky local up work
* lint
* lint
* Update checks
* lint
* Merge with restapi
* revert log changes
* fix
* Rebase done
* Comments and refactor
* fixes
* Change to __enter__ and add todo
* lint
* lint
* fix package path
* remove debug
* Fix GPU detection
* Logging fixes
* Lint
commit 006ce5c2aaa4f8bfbf9c061195cf9dc84d3e54ec
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Jan 15 19:55:59 2025 -0800
[SKY-915/SKY-1292] Clean up ssh entries and fix api logs exception handling (#157)
* Remove ssh config for clusters not exist
* Fix log path non-exist issue
* Update sky/cli.py
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
* Elaborate comments
---------
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
commit 0a0cb59e43e84441812622ee50f2dd042660e787
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Jan 16 00:46:35 2025 +0000
Add back changes from #78
commit e2d3eac431d6c443244e8b5eba28b1f8a2c07180
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Wed Jan 15 15:50:36 2025 -0800
[SKY-1045] Support GCP service account auth (#160)
Update GCP service account docs
commit efd893df50e9b9e91dc988a9f47d8c5c7fd7a7f7
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Wed Jan 15 22:32:47 2025 +0000
Fix SKY-1303
commit 36e638f677de165dff5bfe6d278f29633f1db49a
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Jan 14 19:28:44 2025 -0800
[SKY-1288] Fix python SDK docs (#156)
* Fix docs for SDK
* Split into items
* Add comments for API endpoint env var
* format
* fix docs
* Add docs for serve
commit 5b686e34d0843d7f399584302e5758a621263dcd
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Jan 14 17:31:23 2025 -0800
[SKY-1217] Fix log download test (#159)
* Fix log download test
* Add comment
commit 4757b78d389188ec89810e82e6186c5c648f4ab7
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Jan 14 16:34:05 2025 -0800
[SKY-1095] Fix serve down issue with storage deletion (#158)
Fix serve down issue with storage deletion
commit c8bb92c7c5fca19939242fd10f64d9cdc9c7327f
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Tue Jan 14 14:14:09 2025 -0800
[SKY-1233] Handle non-sky exceptions and airflow updates (#151)
* Add exception wrapping and update airflow
* Add git branch
* Update airflow and docs
* Update airflow and docs
* lint
commit 0b4530f38e35cbfea513992ce9c189698abc4365
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Tue Jan 14 12:51:53 2025 -0800
[SKY-1258] Fix sky api start error message when using remote API server (#152)
* Fix sky api start logic
* lint
* lint
* comments
* simplify logging logic
* Add docs and comments
* Add docs and comments
commit 6dd380e7e316bc7c0051f955ba5f326c4e0d454c
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Jan 14 11:14:08 2025 -0800
[SKY-1291] Fix CRLF output (#153)
* Fix CRLF output
* fix last line
* update comment
commit 79f49ab29e9f901cca143a4aec366fd946d2aa55
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Mon Jan 13 20:40:24 2025 -0800
[SKY-1108] Better logging for skypilot config override (#150)
Better logging for skypilot config override
commit c28d7b49f6670893098ff63bf552ccea7c978dd7
Author: zpoint <zp0int@qq.com>
Date: Mon Jan 13 14:01:37 2025 +0800
type hint fix for tail logs (#149)
type hint
commit 24d270b64c6fa480e0eb50583b198401e000902f
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sun Jan 12 22:00:46 2025 -0800
[SKY-1129] Avoid lru_cache across requests (#137)
* Add reload for caches
* Add comments
* Fix docstr
* Simplify
* Add TODO
commit 270edfb8156e32cb395cae55bd37c44e13f2db1f
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sun Jan 12 21:53:26 2025 -0800
[UX] Prompt when launching on other's cluster (#146)
* Prompt when launching on other's cluster
* add import
* format
* Better logging
* Only show hash when user
commit 37f0c823a1b86ddb38ce377e80863baa4d74ad6c
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sun Jan 12 21:52:34 2025 -0800
[UX] Add terminal completion and fix progress bar with `\r` (#148)
* complete console
* Add terminal completion
* format
* format
* Fix \r
* format
commit 1e3e5281e159735533a7eac9f98c21f45f21eab6
Author: zpoint <zp0int@qq.com>
Date: Sun Jan 12 11:23:38 2025 +0800
[SKY-1287] sky exec test echo hi doesn't show task logs (hi) (#147)
* fix merge conflict overwritten
* mypy
* resolve comment
* mypy
* return type
commit a3db3e2e61f25c0518b6c868a3f1afaacf1b885c
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 10 18:53:27 2025 -0800
[SKY-1256] Avoid buffer for streaming hints (#145)
* Avoid buffer for streaming hints
* format
* Better log streamer
* format
* format
* format
* remove cache
* fix error handling
commit 786ae1d778889f89ba170d6e3db25774dec211ca
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Jan 11 02:42:49 2025 +0000
Avoid refresh cluster status when cluster yaml is None
commit 6d31f31c9b867c0515f6c9ea3080ae63101d66f0
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 10 13:50:11 2025 -0800
[SKY-1278/SKY-1134] Add versioning for API server (#138)
* Add versioning for API server
* Add commit and version in the health call
* refactor a bit
* format
* fix docstr
* renaming
commit 212f72fe663e27a3f7f8e83da13e6e795d0831b9
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 10 19:57:48 2025 +0000
format
commit 5fbe71ea17551dfb22f1021db44dd81a98c6dfe2
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 10 19:56:22 2025 +0000
format
commit e8a1f4736536020255430e9d1d61654881a0a7dc
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 10 19:55:18 2025 +0000
Rename to API server
commit 55c648700419d9c2181206438fe2e30aaac483d8
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Fri Jan 10 10:32:56 2025 -0800
Update health endpoint in docs (#144)
Update health endpoint
commit 9ec1ed69f64019d23dc5760aec76e6c7d01bc66d
Author: zpoint <zp0int@qq.com>
Date: Fri Jan 10 17:04:42 2025 +0800
[SKY-988] [Robust] Refactor tail_logs to make it a uvicorn async call instead of an async executor call (#130)
* sync log call
* terminate worker
* use sky abort
* merge restapi
* revert logging order
* doc update
commit e1eed3bf0ddd987b26aa094a2d1407f127ab28d2
Author: Hong <hong@assemblesys.com>
Date: Fri Jan 10 14:46:21 2025 +0800
[UX] set `RUNNING` request color to green (#142)
commit 961782a0d9b48939dea9ba2abcf5a53662d55052
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Jan 9 22:34:18 2025 -0800
[SKY-1273] Rename API server related CLIs (#139)
* Rename API server related CLIs
* Consolidate server logs
* format
* Add api login python API
* format
* Add comments and TODOs
* Add TODO for endpoint env var
commit cfaf37e1e5204b42c3bfcac6fcbb7f301d5cfa18
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Jan 9 18:16:47 2025 -0800
Add user hash for existing cluster (#140)
* Add user hash for existing cluster
* format
* format
commit c96710df70bf89fd20c373e04213bde54d9157be
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Fri Jan 10 01:05:37 2025 +0000
Fix check multiple clouds
commit 466c4d6a767e1a82631b6f2794ca296e82904049
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Jan 9 10:48:17 2025 -0800
Clean up client file mounts during API server restarts (#135)
* Refactor client server
* add docstr
* Fix types
* Refactor for jobs/server
* use type alias
* Add todo
* rename `check_health`
* update docstr
* Add comments
* Update docstr
* Adding docstr
* Move
* Add doc
* style
* format
* fix redis
* renames
* naming
* format
* format
* clean up clients file mounts as well
* fix merging naming issue
* format
* format
commit e6cf03f050c2086382e21f3fa2a68fb0d0ade32b
Author: Yika <yikaluo@assemblesys.com>
Date: Thu Jan 9 10:20:33 2025 -0800
[SKY-1210] Fix or triage TODOs in restapi branch (#128)
* Fix or triage TODOs in restapi branch
* merge with restapi
* comments
commit f73a871431a1a85584761f14b13e2e95e4d1a0d9
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Thu Jan 9 00:41:27 2025 -0800
Address comments for restapi branch (#133)
* Refactor client server
* add docstr
* Fix types
* Refactor for jobs/server
* use type alias
* Add todo
* rename `check_health`
* update docstr
* Add comments
* Update docstr
* Adding docstr
* Move
* Add doc
* style
* format
* fix redis
* renames
* naming
* format
* format
* Add comment
* Avoid dumping to files
* format
* Update sky/server/requests/executor.py
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
* Add todo for legacy
* use queue.Queue
---------
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
commit 8fa28098edb6a85631c9a95107857d5f3701a832
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Jan 7 19:34:12 2025 +0000
remove poetry
commit fe9d7385b459754cf493eee3345281f2f97fc4a7
Merge: 344151261 6cf98a3cf
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Jan 7 18:39:54 2025 +0000
Merge branch 'master' of github.com:assemble-org/skypilot into restapi
commit 34415126138d67254c16f830c875a9f9077f76d4
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Mon Jan 6 23:45:19 2025 -0800
Merge master (#131)
* [perf] use uv for venv creation and pip install (#4414)
* Revert "remove `uv` from runtime setup due to azure installation issue (#4401)"
This reverts commit 0b20d568ee1af454bfec3e50ff62d239f976e52d.
* on azure, use --prerelease=allow to install azure-cli
* use uv venv --seed
* fix backwards compatibility
* really fix backwards compatibility
* use uv to set up controller dependencies
* fix python 3.8
* lint
* add missing file
* update comment
* split out azure-cli dep
* fix lint for dependencies
* use runpy.run_path rather than modifying sys.path
* fix cloud dependency installation commands
* lint
* Update sky/utils/controller_utils.py
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
---------
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* [Minor] README updates. (#4436)
* [Minor] README touches.
* update
* update
* make --fast robust against credential or wheel updates (#4289)
* add config_dict['config_hash'] output to write_cluster_config
* fix docstring for write_cluster_config
This used to be true, but since #2943, 'ray' is the only provisioner.
Add other keys that are now present instead.
* when using --fast, check if config_hash matches, and if not, provision
* mock hashing method in unit test
This is needed since some files in the fake file mounts don't actually exist,
like the wheel path.
* check config hash within provision with lock held
* address other PR review comments
* rename to skip_if_no_cluster_updates
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* add assert details
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* address PR comments and update docstrings
* fix test
* update docstrings
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* address PR comments
* fix lint and tests
* Update sky/backends/cloud_vm_ray_backend.py
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* refactor skip_if_no_cluster_update var
* clarify comment
* format exception
---------
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* [k8s] Add resource limits only if they exist (#4440)
Add limits only if they exist
* [robustness] cover some potential resource leakage cases (#4443)
* if a newly-created cluster is missing from the cloud, wait before deleting
Addresses #4431.
* confirm cluster actually terminates before deleting from the db
* avoid deleting cluster data outside the primary provision loop
* tweaks
* Apply suggestions from code review
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* use usage_intervals for new cluster detection
get_cluster_duration will include the total duration of the cluster since its
initial launch, while launched_at may be reset by sky launch on an existing
cluster. So this is a more accurate method to check.
* fix terminating/stopping state for Lambda and Paperspace
* Revert "use usage_intervals for new cluster detection"
This reverts commit aa6d2e9f8462c4e68196e9a6420c6781c9ff116b.
* check cloud.STATUS_VERSION before calling query_instances
* avoid try/catch when querying instances
* update comments
---------
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* smoke tests support storage mount only (#4446)
* smoke tests support storage mount only
* fix verify command
* rename to only_mount
* [Feature] support spot pod on RunPod (#4447)
* wip
* wip
* wip
* wip
* wip
* wip
* resolve comments
* wip
* wip
* wip
* wip
* wip
* wip
---------
Co-authored-by: hwei <hwei@covariant.ai>
* use lazy import for runpod (#4451)
Fixes runpod import issues introduced in #4447.
* [k8s] Fix show-gpus when running with incluster auth (#4452)
* Add limits only if they exist
* Fix incluster auth handling
* Not mutate azure dep list at runtime (#4457)
* add 1, 2, 4 size H100's to GCP (#4456)
* add 1, 2, 4 size H100's to GCP
* update
* Support buildkite CICD and restructure smoke tests (#4396)
* event based smoke test
* more event based smoke test
* more test cases
* more test cases with managed jobs
* bug fix
* bump up seconds
* merge master and resolve conflict
* more test case
* support test_managed_jobs_pipeline_failed_setup
* support test_managed_jobs_recovery_aws
* managed job status
* bug fix
* test managed job cancel
* test_managed_jobs_storage
* more test cases
* resolve pr comment
* private member function
* bug fix
* restructure
* fix import
* buildkite config
* fix stdout problem
* update pipeline test
* test again
* smoke test for buildkite
* remove unsupported cloud for now
* merge branch 'reliable_smoke_test_more'
* bug fix
* bug fix
* bug fix
* test pipeline pre merge
* build test
* test again
* trigger test
* bug fix
* generate pipeline
* robust generate pipeline
* refactor pipeline
* remove runpod
* hot fix to pass smoke test
* random order
* allow parameter
* bug fix
* bug fix
* exclude lambda cloud
* dynamic generate pipeline
* fix pre-commit
* format
* support SUPPRESS_SENSITIVE_LOG
* support env SKYPILOT_SUPPRESS_SENSITIVE_LOG to suppress debug log
* support env SKYPILOT_SUPPRESS_SENSITIVE_LOG to suppress debug log
* add backward_compatibility_tests to pipeline
* pip install uv for backward compatibility test
* import style
* generate all cloud
* resolve PR comment
* update comment
* naming fix
* grammar correction
* resolve PR comment
* fix import
* fix import
* support gcp on pre merge test
* no gcp test case for pre merge
* [k8s] Make node termination robust (#4469)
* Add limits only if they exist
* retry deletion
* lint
* lint
* comments
* lint
* [Catalog] Bump catalog schema version (#4470)
* Bump catalog schema version
* trigger CI
* [core] skip provider.availability_zone in the cluster config hash (#4463)
skip provider.availability_zone in the cluster config hash
* remove sky jobs launch --fast (#4467)
* remove sky jobs launch --fast
The --fast behavior is now always enabled. This was unsafe before but since
\#4289 it should be safe.
We will remove the flag before 0.8.0 so that it never touches a stable version.
sky launch still has the --fast flag. This flag is unsafe because it could cause
setup to be skipped even though it should be re-run. In the managed jobs case,
this is not an issue because we fully control the setup and know it will not
change.
* fix lint
* [docs] Change urls to docs.skypilot.co, add 404 page (#4413)
* Add 404 page, change to docs.skypilot.co
* lint
* [UX] Fix unnecessary OCI logging (#4476)
Sync PR: fix-oci-logging-master
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* [Example] PyTorch distributed training with minGPT (#4464)
* Add example for distributed pytorch
* update
* Update examples/distributed-pytorch/README.md
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
* Update examples/distributed-pytorch/README.md
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
* Update examples/distributed-pytorch/README.md
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
* Update examples/distributed-pytorch/README.md
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
* Update examples/distributed-pytorch/README.md
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
* Update examples/distributed-pytorch/README.md
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
* Update examples/distributed-pytorch/README.md
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
* Fix
---------
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
* Add tests for Azure spot instance (#4475)
* verify azure spot instance
* string style
* echo
* echo vm detail
* bug fix
* remove comment
* rename pre-merge test to quicktest-core (#4486)
* rename to test core
* rename file
* [k8s] support to use custom gpu resource name if it's not nvidia.com/gpu (#4337)
* [k8s] support to use custom gpu resource name if it's not nvidia.com/gpu
Signed-off-by: nkwangleiGIT <nkwanglei@126.com>
* fix format issue
Signed-off-by: nkwangleiGIT <nkwanglei@126.com>
---------
Signed-off-by: nkwangleiGIT <nkwanglei@126.com>
* [k8s] Fix IPv6 ssh support (#4497)
* Add limits only if they exist
* Fix ipv6 support
* Fix ipv6 support
* [Serve] Add and adopt least load policy as default policy. (#4439)
* [Serve] Add and adopt least load policy as default policy.
* Docs & smoke tests
* error message for different lb policy
* add minimal example
* fix
* [Docs] Update logo in docs (#4500)
* WIP updating Elisa logo; issues with light/dark modes
* Fix SVG in navbar rendering by hardcoding SVG + defining text color in css
* Update readme images
* newline
---------
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
* Replace `len()` Zero Checks with Pythonic Empty Sequence Checks (#4298)
* style: mainly replace len() comparisons with 0/1 with pythonic empty sequence checks
* chore: more typings
* use `df.empty` for dataframe
* fix: more `df.empty`
* format
* revert partially
* style: add back comments
* style: format
* refactor: `dict[str, str]`
Co-authored-by: Tian Xia <cblmemo@gmail.com>
---------
Co-authored-by: Tian Xia <cblmemo@gmail.com>
* [Docs] Fix logo file path (#4504)
* Add limits only if they exist
* rename
* [Storage] Show logs for storage mount (#4387)
* commit for logging change
* logger for storage
* grammar
* fix format
* better comment
* resolve copilot review
* resolve PR comment
* remove unused var
* Update sky/data/data_utils.py
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* resolve PR comment
* update comment for get_run_timestamp
* rename backend_util.get_run_timestamp to sky_logging.get_run_timestamp
---------
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* [Examples] Update Ollama setup commands (#4510)
wip
* [OCI] Support OCI Object Storage (#4501)
* OCI Object Storage Support
* example yaml update
* example update
* add more example yaml
* Support RClone-RPM pkg
* Add smoke test
* ver
* smoke test
* Resolve dependency conflict between oci-cli and runpod
* Use latest RClone version (v1.68.2)
* minor optimize
* Address review comments
* typo
* test
* sync code with repo
* Address review comments & more testing.
* address one more comment
* [Jobs] Allowing to specify intermediate bucket for file upload (#4257)
* debug
* support workdir_bucket_name config on yaml file
* change the match statement to if else due to mypy limit
* pass mypy
* yapf format fix
* reformat
* remove debug line
* all dir to same bucket
* private member function
* fix mypy
* support sub dir config to separate to different directory
* rename and add smoke test
* bucketname
* support sub dir mount
* private member for _bucket_sub_path and smoke test fix
* support copy mount for sub dir
* support gcs, s3 delete folder
* doc
* r2 remove_objects_from_sub_path
* support azure remove directory and cos remove
* doc string for remove_objects_from_sub_path
* fix sky jobs subdir issue
* test case update
* rename to _bucket_sub_path
* change the config schema
* setter
* bug fix and test update
* delete bucket depends on user config or sky generated
* add test case
* smoke test bug fix
* robust smoke test
* fix comment
* bug fix
* set the storage manually
* better structure
* fix mypy
* Update docs/source/reference/config.rst
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* Update docs/source/reference/config.rst
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* limit creation for bucket and delete sub dir only
* resolve comment
* Update docs/source/reference/config.rst
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* Update sky/utils/controller_utils.py
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* resolve PR comment
* bug fix
* bug fix
* fix test case
* bug fix
* fix
* fix test case
* bug fix
* support is_sky_managed param in config
* pass param intermediate_bucket_is_sky_managed
* resolve PR comment
* Update sky/utils/controller_utils.py
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* hide bucket creation log
* reset green color
* rename is_sky_managed to _is_sky_managed
* bug fix
* retrieve _is_sky_managed from stores
* propagate the log
---------
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* [Core] Deprecate LocalDockerBackend (#4516)
Deprecate local docker backend
* [docs] Add newer examples for AI tutorial and distributed training (#4509)
* Update tutorial and distributed training examples.
* Add examples link
* add rdvz
* [k8s] Fix L40 detection for nvidia GFD labels (#4511)
Fix L40 detection
* [docs] Support OCI Object Storage (#4513)
* Support OCI Object Storage
* Add oci bucket for file_mount
* [Docs] Disable Kapa AI (#4518)
Disable kapa
* [DigitalOcean] droplet integration (#3832)
* init digital ocean droplet integration
* abbreviate cloud name
* switch to pydo
* adjust polling logic and mount block storage to instance
* filter by paginated
* lint
* sky launch, start, stop functional
* fix credential file mounts, autodown works now
* set gpu droplet image
* cleanup
* remove more tests
* atomically destroy instance and block storage simultaneously
* install docker
* disable spot test
* fix ip address bug for multinode
* lint
* patch ssh from job/serve controller
* switch to EA slugs
* do adaptor
* lint
* Update sky/clouds/do.py
Co-authored-by: Tian Xia <cblmemo@gmail.com>
* Update sky/clouds/do.py
Co-authored-by: Tian Xia <cblmemo@gmail.com>
* comment template
* comment patch
* add h100 test case
* comment on instance name length
* Update sky/clouds/do.py
Co-authored-by: Tian Xia <cblmemo@gmail.com>
* Update sky/clouds/service_catalog/do_catalog.py
Co-authored-by: Tian Xia <cblmemo@gmail.com>
* comment on max node char len
* comment on weird azure import
* comment acc price is included in instance price
* fix return type
* switch with do_utils
* remove broad except
* Update sky/provision/do/instance.py
Co-authored-by: Tian Xia <cblmemo@gmail.com>
* Update sky/provision/do/instance.py
Co-authored-by: Tian Xia <cblmemo@gmail.com>
* remove azure
* comment on non_terminated_only
* add open port debug message
* wrap start instance api
* use f-string
* wrap stop
* wrap instance down
* assert credentials and check against all contexts
* assert client is None
* remove pending instances during instance restart
* wrap rename
* rename ssh key var
* fix tags
* add tags for block device
* f strings for errors
* support image ids
* update do tests
* only store head instance id
* rename image slugs
* add digital ocean alias
* wait for docker to be available
* update requirements and tests
* increase docker timeout
* lint
* move tests
* lint
* patch test
* lint
* typo fix
* fix typo
* patch tests
* fix tests
* no_mark spot test
* handle 2cpu serve tests
* lint
* lint
* use logger.debug
* fix none cred path
* lint
* handle get_cred path
* pylint
* patch for DO test_optimizer_dryruns.py
* revert optimizer dryrun
---------
Co-authored-by: Tian Xia <cblmemo@gmail.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-16-12.ec2.internal>
* [Docs] Refactor pod_config docs (#4427)
* refactor pod_config docs
* Update docs/source/reference/kubernetes/kubernetes-getting-started.rst
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
* Update docs/source/reference/kubernetes/kubernetes-getting-started.rst
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
---------
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
* [OCI] Set default image to ubuntu LTS 22.04 (#4517)
* set default gpu image to skypilot:gpu-ubuntu-2204
* add example
* remove comment line
* set cpu default image to 2204
* update change history
* [OCI] 1. Support specify OS with custom image id. 2. Corner case fix (#4524)
* Support specify os type with custom image id.
* trim space
* nit
* comment
* Update intermediate bucket related doc (#4521)
* doc
* Update docs/source/examples/managed-jobs.rst
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* Update docs/source/examples/managed-jobs.rst
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* Update docs/source/examples/managed-jobs.rst
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* Update docs/source/examples/managed-jobs.rst
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* Update docs/source/examples/managed-jobs.rst
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* Update docs/source/examples/managed-jobs.rst
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* add tip
* minor changes
---------
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* [aws] cache user identity by 'aws configure list' (#4507)
* [aws] cache user identity by 'aws configure list'
Signed-off-by: Aylei <rayingecho@gmail.com>
* refine get_user_identities docstring
Signed-off-by: Aylei <rayingecho@gmail.com>
* address review comments
Signed-off-by: Aylei <rayingecho@gmail.com>
---------
Signed-off-by: Aylei <rayingecho@gmail.com>
* [k8s] Add validation for pod_config #4206 (#4466)
* [k8s] Add validation for pod_config #4206
Check pod_config when running 'sky check k8s' by using the k8s api
* update: check pod_config when launch
check merged pod_config during launch using k8s api
* fix test
* ignore check failure when testing with dryrun
if there is no kube config in env, ignore ValueError when launching
with dryrun. For now, we don't support checking the schema offline.
* use deserialize api to check pod_config schema
* test
* create another api_client with no kubeconfig
* test
* update error message
* update test
* test
* test
* Update sky/backends/backend_utils.py
---------
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* [core] fix wheel timestamp check (#4488)
Previously, we were only taking the max timestamp of all the subdirectories of
the given directory. So the timestamp could be incorrect if only a file changed,
and no directory changed. This fixes the issue by looking at all directories and
files given by os.walk().
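A minimal sketch of the approach described above, not the actual SkyPilot code: `latest_mtime` is a hypothetical helper name, and the sketch assumes the goal is simply the newest modification time under a directory. It considers every directory and every file yielded by `os.walk()`, so a change to a single file is picked up even when no directory timestamp moved.

```python
import os


def latest_mtime(root: str) -> float:
    """Return the newest modification time of root or anything beneath it."""
    newest = os.path.getmtime(root)
    for dirpath, dirnames, filenames in os.walk(root):
        # Look at both directories and files; checking directories alone
        # misses in-place edits to an existing file.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                newest = max(newest, os.path.getmtime(path))
            except OSError:
                # Broken symlink or entry removed during the walk; skip it.
                continue
    return newest
```

Editing an existing file in place updates the file's own mtime but not its parent directory's, which is why a directories-only scan can report a stale timestamp.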
* [docs] Add image_id doc in task YAML for OCI (#4526)
* Add image_id doc for OCI
* nit
* Update docs/source/reference/yaml-spec.rst
Co-authored-by: Tian Xia <cblmemo@gmail.com>
---------
Co-authored-by: Tian Xia <cblmemo@gmail.com>
* [UX] warning before launching jobs/serve when using a reauth required credentials (#4479)
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* Update sky/backends/cloud_vm_ray_backend.py
Minor fix
* Update sky/clouds/aws.py
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* wip
* minor changes
* wip
---------
Co-authored-by: hong <hong@hongdeMacBook-Pro.local>
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* [GCP] Activate service account for storage and controller (#4529)
* Activate service account for storage
* disable logging if not using service account
* Activate for controller as well.
* revert controller activate
* Add comments
* format
* fix smoke
* [OCI] Support reuse existing VCN for SkyServe (#4530)
* Support reuse existing VCN for SkyServe
* fix
* remove unused import
* format
* [docs] OCI: advanced configuration & add vcn_ocid (#4531)
* Add vcn_ocid configuration
* Update config.rst
* fix merge issues WIP
* fix merging issues
* fix imports
* fix stores
---------
Signed-off-by: nkwangleiGIT <nkwanglei@126.com>
Signed-off-by: Aylei <rayingecho@gmail.com>
Co-authored-by: Christopher Cooper <cooperc@assemblesys.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Co-authored-by: zpoint <zp0int@qq.com>
Co-authored-by: Hong <weih1121@qq.com>
Co-authored-by: hwei <hwei@covariant.ai>
Co-authored-by: Yika <yikaluo@assemblesys.com>
Co-authored-by: Seth Kimmel <seth.kimmel3@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Lei <nkwanglei@126.com>
Co-authored-by: Tian Xia <cblmemo@gmail.com>
Co-authored-by: Andy Lee <andylizf@outlook.com>
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
Co-authored-by: Hysun He <hysunhe@foxmail.com>
Co-authored-by: Andrew Aikawa <asai@berkeley.edu>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-16-12.ec2.internal>
Co-authored-by: Aylei <rayingecho@gmail.com>
Co-authored-by: Chester Li <chaoleili2@gmail.com>
Co-authored-by: hong <hong@hongdeMacBook-Pro.local>
commit 4d9c1d5f83a7ea2799800976f98bef8435552902
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Mon Jan 6 13:06:15 2025 -0800
[Client] Create ssh config folder (#105)
Create ssh config folder
commit 397a6d9c80c1d63127699be654e1c445de8bb4db
Author: Yika <yikaluo@assemblesys.com>
Date: Fri Jan 3 14:51:31 2025 -0800
[UX][SKY-1085] Show more readable process command name in `htop` (#119)
* [UX] Process naming
* comment
commit c289b36e11bf5fe39bcf5dfbaae22feec7cc027d
Author: Yika <yikaluo@assemblesys.com>
Date: Fri Jan 3 14:40:47 2025 -0800
[UX][SKY-1081] More consistent logs for file syncing (#118)
[UX][SKY-1081] Better logs for file syncing
commit f29a6de440408172bf5c7e054b50ee6742d0557b
Author: Yika <yikaluo@assemblesys.com>
Date: Fri Jan 3 14:04:42 2025 -0800
[SKY-1071] Fix sky api server_logs for remote API server (#129)
* [SKY-1071] Fix sky api server_logs for remote API server
* comments
commit ae30752777f7bc7778feda5b51f760337d066615
Author: Yika <yikaluo@assemblesys.com>
Date: Mon Dec 23 11:20:31 2024 -0800
[UX][SKY-1092] Colored request status for sky api ls (#116)
* [UX][SKY-1092] Colored request status for sky api ls
* address comments
* fix type
* nit
commit baf3fa340f625acb18e56d28cda9bcc99fc83ff5
Author: Yika <yikaluo@assemblesys.com>
Date: Mon Dec 23 11:18:27 2024 -0800
[UX][SKY-1086] Polish worker logs in API server (#117)
* [UX][SKY-1086] Polish worker logs in API server
* address comments
* revert accident commit
commit a9f5f38c92e23a61fcb365b4fb62dd3959219c0c
Author: Yika <yikaluo@assemblesys.com>
Date: Wed Dec 18 10:17:16 2024 -0800
[SKY-1128] Fix sky storage cloud type mismatch issue (#109)
Fix sky storage cloud type mismatch issue
commit 6318b38159c6754e5e950c08c4058176d88aea5c
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Tue Dec 17 16:57:17 2024 -0800
[Docs] Quick updates to docs (#106)
Updates
commit 591c52e8757eb182392a51cf583807c6fbe7e41e
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Dec 17 16:22:01 2024 -0800
Reload kubernetes utils to avoid caching issue (#102)
* Reload kubernetes utils to avoid caching issue
* format
commit d75fcd849c2dfef833a5afb3dfae35010afc1e06
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Tue Dec 17 16:01:27 2024 -0800
[SKY-1107] Docs for deployment (#92)
* Update admin docs
* Updates
* Update docs
* Update docs
* Add storage docs
* Updates
commit 65554949e0f7743e964abf9c228c0d501087a5f5
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Tue Dec 17 12:49:03 2024 -0800
[Tests] Fix smoke test for `--kubernetes` and backward compatibility test on AWS (#80)
* Remove stop test for k8s
* Simplify the test k8s creation commands
* use 5+
* fix
* fix
* expanduser
* add tpu mark
* minore large cpu
* correctly quote
* format
* get old jobs as well
* fix launch on existing
* fix
* fix back
* fix back compat
* minor fix
* fix
* Fix server side validation
* remove debug
* rename back to cloudvmray
* rename back to cloudvmray
* fix aws cli on kubernetes pod
* format
* use uv
* fix status -r in backward compat
* Fix path
* format
* less \t
* try less gap for waiting
* Add a timeout for curl to avoid long waiting
* Add max time as well
* shorter sleep
commit 8af911e4ad36a8f9da77f5186b18e716503a05a4
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Tue Dec 17 00:03:37 2024 -0800
[Sky-1123] Fix proxy command for SSH (#94)
Fix proxy command
commit ed724c52a458fbac9d81fa0134f0ff1612b0c531
Author: Yika <yikaluo@assemblesys.com>
Date: Mon Dec 16 18:11:31 2024 -0800
[SKY-1110] Fix sky server controller name issue (#91)
* Remove controller name reference on client side
* format
* add comments
* rename
commit 670af89b9677b75a4c08f409ccc653c07a608654
Author: Yika <yikaluo@assemblesys.com>
Date: Mon Dec 16 18:10:12 2024 -0800
Fix validating relative path (./) (#90)
* Not validate file_mounts for sky exec
* fix sky launch on file_mount relative path
* address comments
* validate on both side
* nit
* smoke test
* comments
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* format
---------
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
commit 8f07b85738dc174a0656ef5115a9b1e7145783a3
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Mon Dec 16 01:22:43 2024 -0800
Fix config reload (#79)
* reload the config
* rename
* elaborate
* Fix makedir
* minor
* longer retry for deploy
* Refactor
* fix unit test
* fix
* format
* fix docstr
* revert
* Fix loading issue
* Add unit test
* format
* fix condition
* Fix controller hash
* revert to remove
* rename
* format
* rename
* comment
* Avoid skip env var
* comment
* Add more tests
* try test_config
* format
* avoid test config
* fix folder path
* use uuid
* use uuid for download as well
* format
commit 389a32b3f0410dce4bd95b74c7e982f442f4d644
Author: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Date: Sat Dec 14 17:49:41 2024 -0800
[SKY-1073] Fix cross-device linking error with `sky jobs launch` (#78)
* hardlink in the same block device
* lint
* lint
* Update sky/skylet/constants.py
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
---------
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
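For context on the cross-device linking error: a hard link can only be created when source and destination are on the same device. The sketch below is illustrative only and is not the fix made in this commit (which keeps the hardlink on the same block device); `same_device` and `link_or_copy` are hypothetical helper names, and the copy fallback is an assumption about one reasonable way to handle the failure.

```python
import os
import shutil


def same_device(path_a: str, path_b: str) -> bool:
    """True if both existing paths live on the same device (same st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def link_or_copy(src: str, dst: str) -> None:
    """Hard-link src to dst; fall back to a copy if the link cannot be made."""
    try:
        os.link(src, dst)
    except OSError:
        # os.link fails with an OSError (errno EXDEV) when src and dst are
        # on different devices; copying is the portable fallback.
        shutil.copy2(src, dst)
```

Checking `same_device(src, os.path.dirname(dst))` up front shows whether the link will cross devices before attempting it.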
commit 1a45d90e5b7ea75d2c86c2bba377a7d563af5f85
Author: Zhanghao Wu <zhanghao.wu@outlook.com>
Date: Sat Dec 14 16:09:56 2024 -0800
[Docs] Client server docs (SKY-1101) (#77)
* Add autobuild
* better logging
* Add docs f…
* First pass @ making pylint happy; TODO: a few more to solve; test.
* Fix a few more pylint errors, and some minor fixes
* Final pylint errors
* Fix more pylint
* Debug pytest issue
* Accidentally removed a useful try-catch; reverted & documented
* GCP catalog
Make 2nd run of "sky run" / "python app.py" skip setup; more examples. (skypilot-org#52)
* Make 2nd run of "sky run" / "python app.py" skip setup. More examples.
* Fix workflows yaml.
LocalDockerBackend for debugging (skypilot-org#35)
working LocalDockerBackend
Show logs on docker build failure (skypilot-org#53)
Stream build logs on failure and default to remove container on completion
Usability: basic "sky run" distributed task; rsync .git; more docs. (skypilot-org#54)
* aws: move region order
* Do not exclude .git from workdir by default; fixes skypilot-org#42.
* Task: document, num_nodes in from_yaml(), 'run' str for >1 nodes.
* Example: multi_hostname.{yaml,py}, and some fixes to make it work
* Check 'run' field.
Add logging for execution_v2 (skypilot-org#49)
* Add logging for provisioning
* Log task output into a file for `one_node`
* Fix format issues and support ParTask logging
* Fix info message
* Add docstr to logging.py
* Fix tail -f information for provision
* Fix teardown
* Fix codegen, format and other issues
Fix logs sync for n_node (skypilot-org#57)
* Fix logs rsync down for n_node
Prettier paths (skypilot-org#50)
Update .gitignore for sky_logs (skypilot-org#55)
* Update .gitignore
* Add *.log to gitignore
Docker-in-Docker support and bugfixes (skypilot-org#65)
* Fix ping example and debugging commands printed to the user
* Add default docker image
* Add check for container status in post_execute
* lint
* PR comments
containerized app example (skypilot-org#60)
* containerized app example
* setup command fixes and dind notes
* mkdir fix
Fix the incomplete logs (skypilot-org#69)
Fix T4 (skypilot-org#62)
CLI with cluster handles and interactive sessions (skypilot-org#58)
Add interactive sessions (`gpunode`, etc.), multiple clusters support, cluster handles, and `sky status`.
Make pylint pass. (skypilot-org#76)
* First pass @ making pylint happy; TODO: a few more to solve; test.
* Fix a few more pylint errors, and some minor fixes
* Final pylint errors
* Fix more pylint
* Debug pytest issue
* Accidentally removed a useful try-catch; reverted & documented
Add Detectron2 example (skypilot-org#74)
Cleanups: remove V1 yaml templates; update pylint.yml (skypilot-org#77)
* Cleanups: remove V1 yaml templates; update pylint.yml
* Add cluster_name: Optional[str] = None to the 2 Backend impls
Fix script filling to make tpu_app run again. (skypilot-org#79)
Add support for cross-cloud retry (skypilot-org#70)
* Support multi-cloud provision retry
* Fix pylint issues
Update .gitignore for tpu shell script (skypilot-org#80)
Horovod Example (skypilot-org#29)
* Horovod Example
* Yapf fix
* Fix
Co-authored-by: Michael Luo <michaelluo@MacBook-Pro.local>
Add jupyter notebook example (skypilot-org#34)
* Add jupyter notebook example
* Change to new execute
Fix cross-cloud retry for CLI and CLI (skypilot-org#84)
* Do not fail the program when dag and optimizer are not registered and fix cli
Support file_mounts destination paths that require root access. (skypilot-org#68)
Fixes skypilot-org#39. Both these should now work:
```
/data/<25GB dir>: /data/<25GB dir>
/data/<25GB dir>: gs://<dir>
```
Added examples/using_file_mounts.yaml as a user guide.
* Support file_mounts destination paths that require root access.
* Enable more robust file mounts.
* Rename to using_file_mounts.yaml.
* Use ~/.sky/file_mounts/
* Remove V1 code to avoid confusion.
* Add slashes to existing file mounts examples.
* Revert "Add slashes to existing file mounts examples."
This reverts commit 53bade0.
* Add storage.is_directory() to rid of the weird slashes
* Cleanup in execution.py
Add example for resnet training with tensorboard (skypilot-org#45)
* add an example for resnet training with tensorboard
[CLI] "sky down" and move cli.py to the core package. (skypilot-org#85)
Example usage:
# See available commands.
>> sky
# Run a task, described in a yaml file.
# Provisioning, setup, file syncing are handled.
>> sky run task.yaml
>> sky run [-c cluster_name] task.yaml
# Show the list of tasks and running clusters.
>> sky status
# Tear down a specific cluster.
>> sky down -c cluster_name
# Tear down all existing clusters.
>> sky down -a
* cli: cleanups, remove unimplemented commands.
* cli/cli.py -> sky/cli.py
* Make cli.py pass pylint.
* sky down
* Fix pylint.
* Update
* Make CLUSTER an argument instead - better help msg
* sky cancel --all (-a)
* Minor fixes and linting.
Fix `pushd not found` in cloud-stores (skypilot-org#88)
* Merge run function for cloud_stores and backend_utils
* Add `executable=/bin/bash`
[CLI] Introduce `sky exec`; some more CLI fixes (skypilot-org#89)
* Fix tricky "sky run" bugs.
Bug before this change:
sky run cluster app.yaml -> handle = cluster.yaml
now, manually change cluster.yaml instance_type, and ray up cluster.yaml
sky run cluster app.yaml again, this will fail!
After this change, "sky run" deals with this case correctly.
Another bug: previously _reuse_or_provision_cluster() in cli.py uses "ray up" in one case. It should use backend.provision() instead which offers provision retries.
* Remove _reuse_or_provision_cluster() and fixes sky.execute()
_reuse_or_provision_cluster() in the 'sky run' codepath is buggy:
It performs a content diff on generated.yaml. Which means:
If cluster is not manually changed, this check DOES NOT capture
- non-workdir file mounts content changes
Other cluster specs are captured.
If cluster is manually changed, skipping re-setup violates Task contracts.
* Add 'sky exec'
* update help
* Lint
* entry_point -> yaml_path
* Show commands in help in better order
Fix non-utf-8 characters when redirecting output (skypilot-org#92)
* Fix non-utf-8 output
* Enable head comment in using_file_mounts.yaml
Add types to global_user_state (skypilot-org#94)
Add types
[CLI] Speed up 'sky exec' for faster iterative development (skypilot-org#91)
* (Not back-compat) global_user_state: support pickle-able handles.
This change is not binary back-compat. Do this to update: rm ~/.sky/state.db
* Speed up sky exec by not using 'ray exec/rsync_up'
Changes:
- Sane SSH options (control path has huge performance boost)
- ResourceHandle: is now a tuple (head_ip, yaml, tpu).
- Uses the cached head_ip to directly run our own 'rsync', 'ssh <cmd>'.
- Thus tpu management is nicely handled
* Add minimal.yaml
* Update run_smoke_tests.sh
* TESTED: examples/run_smoke_tests.sh
* Add username to auto-generated cluster name
* Fix cloud->remote file mount mkdir logic.
* Better naming for 'managed_tpu' & address comments
* Fix _create_interactive_node() use of handle; don't down the node.
* lint
[CLI] Add sky ssh and port-forwarding (skypilot-org#90)
* Add sky attach and port forwarding gpunode/cpunode
* Fix Ctrl-D error
* Change the name to sky ssh
* Fix type hints
* linting
* format
* Fix merging errors
(Non back-compat) CLI improvements: interactive, remove task table, show entry points (skypilot-org#99)
* CLI: by default, (re)use the same interactive nodes
* Remove task table & its associated code
Rationale: useful in the future when we add task status tracking (RUNNING, DONE, etc.). Absent that, the task table from `sky status` and the overhead of `sky cancel -a` is not that useful at the moment. We should add this code back in the future.
* Show entry point per cluster in `sky status`
Clusters
+----------------------+----------------+-----------------------------------------+
| CLUSTER NAME         | LAUNCHED       | LAST USE                                |
+----------------------+----------------+-----------------------------------------+
| sky-cpunode-zongheng | 10 minutes ago | sky cpunode                             |
| debug1               | 7 minutes ago  | sky run -c debug1 examples/minimal.yaml |
| sky-bba5-zongheng    | 3 minutes ago  | multi_echo.py                           |
+----------------------+----------------+-----------------------------------------+
* Make 'sky ssh' call the ssh cmd directly
* Truncate last use
* lint
Rebase & format
Update __init__.py
* sky list-gpus
* lint
* Update cli.py
* lint
* Update GCP prompt
* Address comments
* TPUs -> Google TPUs
* Update GCP catalog with regions info
* Update fetch_gcp.py
* Update gcp
* fix
* fix
* Update TPU code
* Revert "Update TPU code"
This reverts commit 1f08fc0.
* Address comments
* Update cli.py
* Update cli.py
* Update cli.py
* Update cli.py
Went through all pylint errors and resolved them. After merging this, we can turn on pylint.
Other changes
Tested:
examples/run_smoke_tests.sh