Releases: skypilot-org/skypilot
Release v0.11.2rc1
This patch release is a minor bump from v0.11.1 to get you the latest fixes:
- Always set SSH key permissions to avoid an issue when the jobs controller restarts under the high availability setting #8316
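As a rough illustration of what "always set SSH key permission" can mean in practice, the sketch below forces a private key file to mode 0600, the mode OpenSSH conventionally expects for private keys. The helper name ensure_private_key_mode and the choice of 0600 are assumptions for illustration; the actual change in #8316 is not shown in these notes.

```python
import os
import stat
import tempfile


def ensure_private_key_mode(path: str, mode: int = 0o600) -> int:
    """Set the permission bits of `path` to `mode` and return the resulting bits.

    Hypothetical helper: the chmod is applied unconditionally, so a key file
    restored with looser permissions (for example after a restart) is fixed.
    """
    os.chmod(path, mode)
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    # Demo on a throwaway file standing in for a private key (POSIX only).
    fd, key_path = tempfile.mkstemp()
    os.close(fd)
    os.chmod(key_path, 0o644)  # simulate permissions that are too open
    print(oct(ensure_private_key_mode(key_path)))
    os.remove(key_path)
```

Applying the mode every time, rather than only when the key is first created, is what makes the operation safe to repeat after a controller restart.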
Install the release candidate:
# Select needed clouds.
uv pip install 'skypilot[kubernetes,aws,gcp]==0.11.2rc1'
Upgrade your remote API server:
NAMESPACE=skypilot # TODO: change to your installed namespace
RELEASE_NAME=skypilot # TODO: change to your installed release name
helm repo update skypilot
helm upgrade -n $NAMESPACE $RELEASE_NAME skypilot/skypilot \
--version 0.11.2-rc.1 \
--reset-then-reuse-values \
--set apiService.image=null # reset to default image
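Note that the pip package and the Helm chart spell the same release candidate differently (0.11.2rc1 versus 0.11.2-rc.1). A minimal sketch of that mapping follows, inferred only from the version pairs shown in these notes; the function name chart_version is hypothetical and other version forms (such as .postN releases) are deliberately rejected rather than guessed.

```python
import re

# Matches pip-style versions such as "0.11.2" or "0.11.2rc1".
_VERSION = re.compile(r"^(\d+\.\d+\.\d+)(?:rc(\d+))?$")


def chart_version(pip_version: str) -> str:
    """Map a pip-style version to the Helm chart spelling used in these notes."""
    match = _VERSION.match(pip_version)
    if match is None:
        raise ValueError(f"unsupported version form: {pip_version!r}")
    base, rc = match.groups()
    # Release candidates gain a "-rc.N" suffix; plain releases are unchanged.
    return base if rc is None else f"{base}-rc.{rc}"


if __name__ == "__main__":
    print(chart_version("0.11.2rc1"))
```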
Full Changelog: v0.11.1...v0.11.2rc1
SkyPilot v0.11.1
This patch release is a minor bump from v0.11.0 to get you the latest fixes:
- #8285: Fix an issue where the API server request database is incorrectly locked, causing an unresponsive API server with the error:
sqlite3.OperationalError: database is locked
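For readers unfamiliar with this error, the sketch below reproduces it with Python's standard sqlite3 module: one connection holds the write lock while a second connection, with no wait timeout, tries to start its own write transaction. This only illustrates how the error arises in general; it is not the fix made in #8285, and the table name is made up.

```python
import os
import sqlite3
import tempfile


def demo_database_locked() -> str:
    """Return the error message raised when a second writer hits a held lock."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "requests.db")
        # isolation_level=None lets us control transactions explicitly.
        writer = sqlite3.connect(path, isolation_level=None)
        writer.execute("CREATE TABLE requests (id INTEGER PRIMARY KEY)")
        writer.execute("BEGIN IMMEDIATE")  # take the write lock and keep it
        writer.execute("INSERT INTO requests DEFAULT VALUES")
        # timeout=0: fail immediately instead of waiting for the lock.
        other = sqlite3.connect(path, timeout=0, isolation_level=None)
        try:
            other.execute("BEGIN IMMEDIATE")
            return "no error"
        except sqlite3.OperationalError as err:
            return str(err)
        finally:
            other.close()
            writer.execute("ROLLBACK")
            writer.close()


if __name__ == "__main__":
    print(demo_database_locked())
```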
See the full v0.11 release notes for everything new in SkyPilot v0.11!
SkyPilot v0.11.0
SkyPilot v0.11.0: Multi-Cloud Pools, Fast Managed Jobs, Enterprise-Readiness at Large Scale, Programmability
SkyPilot v0.11.0 delivers major new features: Pools and Managed Jobs Consolidation Mode; significant improvements in enterprise-readiness at large scale: support for hundreds of AI engineers with a single API server instance, OOM avoidance, >10x performance improvement on many requests, additional observability, and more; and UX improvements: Templates, Python SDK, CI/CD, Git Support, and more.
Get it now with:
uv pip install "skypilot>=0.11.0"
Or, upgrade your team SkyPilot API server:
NAMESPACE=skypilot
RELEASE_NAME=skypilot
VERSION=0.11.0
helm repo update skypilot
helm upgrade -n $NAMESPACE $RELEASE_NAME skypilot/skypilot \
--set apiService.image=berkeleyskypilot/skypilot:$VERSION \
--version $VERSION --devel --reuse-values
Highlights
[Beta] SkyPilot Pools: Batch inference across clouds & k8s
SkyPilot supports spawning a pool that launches a set of workers across many clouds and Kubernetes clusters (docs, #6260, #6426, #6459, #6552, #6591, #6665, #6675, #7332, #7963, #8008, #7876, #8047, #7930, #8039, #7846, #7855, #7919, #7920). Jobs can be scheduled on this pool and distributed to workers as they become available.
Key benefits include:
- Fully utilize your GPU capacity across clouds & k8s
- Unified queue for jobs on all infra
- Keep workers warm, scale elastically
- Step aside for other higher priority jobs; reschedule when GPUs become available
Learn more in our blog post.
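To make the idea of a unified queue feeding workers "as they become available" concrete, here is a toy model: jobs are taken in queue order and each goes to whichever worker frees up first. This is only an illustrative sketch with made-up names and integer durations; it is not SkyPilot's scheduler and ignores priorities, preemption, and elastic scaling.

```python
import heapq
from typing import List, Tuple


def schedule(durations: List[int], num_workers: int) -> List[Tuple[int, int, int]]:
    """Return (job_index, worker_index, start_time) for each job in queue order."""
    # Min-heap of (time the worker becomes free, worker index).
    free_at = [(0, worker) for worker in range(num_workers)]
    heapq.heapify(free_at)
    plan = []
    for job, duration in enumerate(durations):
        start, worker = heapq.heappop(free_at)  # earliest available worker
        plan.append((job, worker, start))
        heapq.heappush(free_at, (start + duration, worker))
    return plan


if __name__ == "__main__":
    for job, worker, start in schedule([3, 2, 2, 1], num_workers=2):
        print(f"job {job} -> worker {worker} at t={start}")
```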
[GA] Managed Jobs Consolidation mode
Consolidation Mode is generally available (#7122, #7127, #7396, #7459, #7498, #7560, #7601, #7619, #7717, #7720, #7847, #8082, #8021, #8106). This enables:
- 6x faster job submission
- Consistent credentials across the API server and jobs controller
- Persistent managed jobs state on PostgreSQL
# config.yaml
jobs:
  controller:
    consolidation_mode: true # Currently defaults to False.
Efficient and Robust Managed Jobs
We have significantly optimized the Managed Jobs controller, allowing it to handle 2000+ parallel jobs on a single 8-CPU controller, an 18x improvement in job capacity with the same controller size (#7051, #7371, #7379, #7408, #7432, #7473, #7487, #7488, #7494, #7519, #7585, #7595, #7945, #7966, #7979, #8036, #8095).
Enterprise-Ready SkyPilot at Large-scale
- Support hundreds of AI engineers with a single SkyPilot API server instance
- Significantly reduced memory consumption and OOM avoidance for the API server
- Faster CLI/SDK/Dashboard with a large number of clusters and jobs
- Comprehensive API server metrics for operation
- SSO support with Microsoft Entra ID
Kubernetes
UX, Robustness, Performance Improvement
- Robust SSH for SkyPilot cluster on Kubernetes
- Robust multi-node(pod) provisioning with retry and recovery (#7820, #7854, #7852)
- Improve resource cleanup after termination
- Intelligent GPU name detection
- Improved volume support: label support, name validation, SDK support.
Volume Support for Existing PVC and Ephemeral Volumes (#7915, #7971, #8179)
- Existing PVC: Reference pre-existing Kubernetes PersistentVolumeClaims as a SkyPilot volume (#7915).
# volume.yaml
name: existing-pvc-name
type: k8s-pvc
infra: k8s/context1
use_existing: true
config:
  namespace: namespace
- Ephemeral Volumes: Volumes are automatically created when a cluster is launched and deleted when the cluster is torn down, making them ideal for temporary storage across multiple nodes, such as caches and intermediate results.
# task.sky.yaml
file_mounts:
  /mnt/cache:
    size: 100Gi
Cloud Support
CoreWeave Integration Announcement
CoreWeave now officially supports SkyPilot (#6386, #6519, #6895, #7756, #7838). This integration provides: Infiniband Support, Object Storage, Autoscaling.
See the announcement on the CoreWeave blog.
AMD GPU Support Announcement
SkyPilot now fully supports AMD GPUs on Kubernetes clusters (#6378, #6944). This includes: GPU Detection and Scheduling, Dashboard Metrics, ROCm Support.
See the announcement on the AMD ROCm blog.
More Clouds Support
- Together AI Instant cluster
- Seeweb support
User Experience
SkyPilot Templates
SkyPilot now ships predefined YAML templates for launching clusters with popular frameworks and patterns. Templates are automatically available on all new SkyPilot clusters. (#7935, #7965)
You can now launch a multi-node Ray cluster by adding a single line to your YAML’s run block:
run: |
  # One-line setup for a distributed Ray cluster
  ~/.sky/templates/ray/start_cluster
  # Submit your job
  python train.py
Programmability: SkyPilot Python SDK
SkyPilot Python SDK is significantly improved with:
- Type hints
- Log streaming
logs = sky.tail_logs(cluster_name, job_id, follow=True, preload_content=False)
for line in logs:
    if line is not None:
        if 'needle in the haystack' in line:
            print("found it!")
            break
logs.close()
- Admin policy: build admin policy with helper functions
resource_config = user_request.task.get_resource_config()
resource_config['use_spot'] = True
user_request.task.set_resources(re...
SkyPilot v0.11.0rc1
This is a preview of the upcoming 0.11.0 release!
Install the release candidate:
# Select needed clouds.
uv pip install 'skypilot[kubernetes,aws,gcp]==0.11.0rc1'
Upgrade your remote API server:
NAMESPACE=skypilot # TODO: change to your installed namespace
RELEASE_NAME=skypilot # TODO: change to your installed release name
helm repo update skypilot
helm upgrade -n $NAMESPACE $RELEASE_NAME skypilot/skypilot \
--version 0.11.0-rc.1 \
--reset-then-reuse-values \
--set apiService.image=null # reset to default image
If you try out 0.11.0rc1, please let us know in Slack!
SkyPilot v0.10.5
SkyPilot v0.10.5: Major Managed Jobs Efficiency Improvement, UX and SDK Usability Enhancement, and API Server Robustness Fixes
This release focuses on production stability, performance optimization, and fixing critical bugs that affected reliability in multi-user and Kubernetes environments. Key highlights include major performance improvements for managed jobs and the dashboard, resolution of race conditions and resource leaks, and expanded cloud/accelerator support.
This release includes 400+ merged pull requests with high-priority critical fixes and additional improvements spanning bug fixes, performance enhancements, new features, and comprehensive documentation updates.
Highlights
Managed Job Efficiency and Robustness Improvement
- 18x more jobs with the same jobs controller size – 2000+ parallel jobs on an 8-CPU controller (#7051, #7371, #7379, #7408, #7432, #7473, #7488, #7487, #7494, #7519, #7585, #7595).
- [Beta] Avoid separate job controller with the new consolidation_mode (#7127, #7396, #7459, #7122, #7498, #7601, #7560, #7619, #7717, #7720).
- Consistent credentials across API server and jobs controller
- 6x faster job submission, sky jobs launch
# config.yaml
jobs:
  controller:
    consolidation_mode: true # Currently defaults to False.
Performance and Robustness Improvement at a Large Scale
- 20x cluster status speedup at large scale (#7215, #7220, #7224, #7246, #7389, #7453, #7282, #7261, #7076, #7676).
- 6x jobs query and dashboard speedup (#7463, #7458, #7780, #7798, #7453).
API Server Robustness Improvement
- Significantly reduce memory consumption and avoid OOM for API server (#7240).
Python SDK Usability Improvements
- New Python SDK examples to easily scale out your jobs on any infra (#7335).
import sky
from sky import jobs as managed_jobs
for i in range(100):
    resource = sky.Resources(accelerators='A100:8')
    task = sky.Task(resources=resource,
                    workdir='.',
                    run='python batch_inference.py')
    managed_jobs.launch(task, name=f'hello-{i}')
- Admin policy robustness and better usability (#7827).
Integration
- CoreWeave managed Kubernetes and storage support (#7756, #6519, #7759).
- AMD GPU support and GPU metrics (#6944).
What's New
Additional Performance Improvements
- Migrated managed jobs and SkyServe to a gRPC architecture, reducing P95 latency by up to 86% and CPU usage by 20-30% (#7245, #6647, #6702, #7184).
- Optimized Kubernetes pod processing with streaming JSON, providing 2x speedup and 50% memory reduction for large clusters (#7469).
- Optimized database queries for sky status and API endpoints, improving mean response time by 10-35% (#7689, #7690, #7665, #7705, #7708).
- Replaced standard JSON with high-performance orjson for API serialization, improving tail latency for large payloads (#7734).
Critical Stability & Security Fixes
- Fixed critical race conditions in Docker operations ("container name already in use") (#7030), API server start/stop (#7534), and asynchronous job cancellation, preventing cluster leaks (#7511).
- Resolved critical Kubernetes resource leaks, including semaphore file leaks (/dev/shm exhaustion) (#7678) and fusermount-server leaks (#7398).
- Fixed system hangs caused by SSH threads (#7202), K8s SSH proxy blocking (#7537), and BrokenProcessPool crashes when canceling sky logs (#7607).
- Addressed a critical OOM bug in the server caused by inefficient GPU name canonicalization in Kubernetes (#7080).
- Fixed security/privacy bugs where pending jobs were visible across users (#7581) and environment variables leaked between managed jobs (#7459).
- Prevented accidental cancellation of wrong requests by requiring exact ID matches when prefixes are ambiguous (#7730).
- Replaced insecure curl | sh pattern for fluent-bit installation with official package repositories (#7126).
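The request-cancellation fix in the list above (exact ID matches when prefixes are ambiguous) follows a common pattern, sketched here under assumptions: an exact ID always wins, a prefix is accepted only if it matches exactly one ID, and anything else is rejected. The function and exception names are hypothetical and are not taken from the SkyPilot codebase.

```python
from typing import Iterable


class AmbiguousPrefixError(ValueError):
    """Raised when a prefix matches more than one request ID."""


def resolve_request_id(query: str, request_ids: Iterable[str]) -> str:
    """Resolve `query` to a single request ID, refusing ambiguous prefixes."""
    ids = list(request_ids)
    if query in ids:
        return query  # an exact match is never ambiguous
    matches = [rid for rid in ids if rid.startswith(query)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise KeyError(f"no request matches {query!r}")
    raise AmbiguousPrefixError(
        f"{query!r} matches {len(matches)} requests; use the full ID")


if __name__ == "__main__":
    known = ["abc123", "abc456", "def789"]
    print(resolve_request_id("de", known))
```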
UX Improvement
- Improved CLI error messages for Kubernetes resource constraints (#6814) and multi-node setup failures (#7001).
- Enhanced sky down output to show a clear summary when multiple clusters fail or succeed (#7225, #7635).
- Added a CLI spinner for short-running requests (like sky status) to show activity when the server is under load ([#76...
SkyPilot v0.10.3.post2
This patch release is a minor bump over v0.10.3.post1 to improve robustness on CoreWeave clusters by adding retries and fallbacks for Kubernetes API calls:
- Fall back to file-based command execution on 431, 400 #7536 #7563
- Retry on transient authentication issues with error code 403 #7568 #7574
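The two behaviors above can be sketched as a single wrapper: retry a call a few times on status 403, and switch to a file-based path on 400 or 431. Everything here (ApiError, run_with_fallback, the callables, the attempt count and delay) is a hypothetical stand-in for illustration, not the code from the linked PRs.

```python
import time
from typing import Callable


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status: int):
        super().__init__(f"API call failed with status {status}")
        self.status = status


def run_with_fallback(run_inline: Callable[[], str],
                      run_from_file: Callable[[], str],
                      max_attempts: int = 3,
                      delay_seconds: float = 0.0,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Run a command inline; retry on 403, fall back to a file on 400/431."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_inline()
        except ApiError as err:
            if err.status in (400, 431):
                # Request rejected or headers too large: use the file path.
                return run_from_file()
            if err.status == 403 and attempt < max_attempts:
                sleep(delay_seconds * attempt)  # simple linear backoff
                continue
            raise
    raise ValueError("max_attempts must be at least 1")
```

Passing the sleep function as a parameter keeps the sketch easy to exercise without real waiting.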
To upgrade:
pip install "skypilot==0.10.3.post2"
# Restart your local API server
sky api stop; sky api start
Or, upgrade the API server:
NAMESPACE=skypilot
RELEASE_NAME=skypilot
VERSION=0.10.3
helm repo update skypilot
helm upgrade -n $NAMESPACE $RELEASE_NAME skypilot/skypilot \
--set apiService.image=berkeleyskypilot/skypilot:0.10.3.post2 \
--version $VERSION --devel --reuse-values
Full Changelog: v0.10.3.post1...v0.10.3.post2
SkyPilot v0.10.3.post1
This patch release is a minor bump over v0.10.3 to fix a dependency issue caused by a breaking change in uvicorn==0.36.0:
- Pin uvicorn dependency to mitigate AttributeError: 'Config' object has no attribute 'setup_event_loop' error. (#7287)
If you see an issue like the above, upgrade SkyPilot with:
pip install "skypilot==0.10.3.post1"
# Restart your local API server
sky api stop; sky api start
If you are using a remote API server, it should still work with v0.10.3.
SkyPilot v0.10.3
SkyPilot v0.10.3: 2-10x Performance Improvement, Better Observability, Production-Ready Authentication and More
SkyPilot v0.10.3 delivers 2-10x performance improvement, enhanced cloud integration, improved Kubernetes integration, and strengthened production reliability features for AI/ML workloads across clouds.
Highlights
Massive Performance Improvements: 2-10x more efficient
- 4x faster sky status (#6883, #6858, #6868, #6871, #6882, #6940, #6948, #6892, #6908)
- 2-10x speedup for dashboard pages, including users, infra, volumes (#7031, #7033, #6959, #7013, #7011, #7012)
- 2x speedup for remote PostgreSQL environments through connection pooling (#6998, #6897)
- Fixed request responsiveness and SSH lagging/disconnection issues in production (#6984, #6983, #6889, #6963, #6947, #7015, #6966, #6938, #6957)
- Improved API server memory management (#7046, #7037, #6909)
Upgrade your SkyPilot API server to 0.10.3 and get the improvement out of the box!
Observability & Monitoring
More SkyPilot API server and cluster metrics can now be set up for better observability, including:
- Detailed cluster events (#6907, #6906, #6901, #6899, #6900, #6936, #6875)
- API server detailed metrics for request latency, event loop, and process resource usage (#7017, #6935, #6942, #6968, #7062)

- Request ID tracking for better debugging (#6933)
CoreWeave Integration: Autoscaler
CoreWeave autoscaler is now supported, with a single config field change (#6895):
kubernetes:
  autoscaler: coreweave
SkyPilot API Server: Authentication & Production Environment
SkyPilot now supports Microsoft Entra ID SSO login (#7045, #7028), in addition to Okta and Google Workspace. See the docs for setting up SSO login for the API server: https://docs.skypilot.co/en/latest/reference/auth.html#sso-recommended

What's New
UX Improvement
- Fix CLI auto-completion authentication issue (#6724)
- Fixed cluster ownership display (#6989)
- Fixed API logs for daemon request (#6841)
- SDK improvements for tail_logs (#6902)
- Responsive spinner for request blocking (#6905)
- Fixed SDK authentication for download endpoints (#6955)
- Enable log downloading in dashboard (#6999, #7000, #7003, #7010)
- Prevented clients from setting DB strings server-side (#7042)
- Improved logout error handling (#6874)
Storage & File Operations
- Fixed storage mounting on ARM64 instances (#7008)
- Improved chunk upload reliability (#6854)
- Rsync permission fixes for Kubernetes (#6951)
Enhanced Cloud Platform Support
Kubernetes Improvements
- Pod configuration change detection with warnings (#6912)
- Volume name validation (#6863)
- Worker service cleanup on termination (#7014)
- AMD GPU support improvements
Nebius Cloud
- Static IP configuration support (#7002)
- Docker image ID fixes (#6894)
- Pagination support for list_instances (#6867)
- Customizable API domain (#6888)
- Better network error messages (#6971)
- Various stability improvements (#6856, #6896)
AWS
- Elastic Fabric Adapter (EFA) automatic enablement for improved network performance (#6852)
- AWS Systems Manager (SSM) support with new use_ssm flag for secure connections (#6830)
- Improved VPC error messages and handling (#6746)
- Fixed CloudWatch region setting issues (#6747)
- Better teardown leak prevention (#7022)
RunPod
- Volume support for RunPod network volumes (#6949)
GPU Support
- Added B200 to common accelerators in sky show-gpus (#7006)
Documentation & Examples
- New training examples: TorchTitan (#6677), distributed RL with game servers (#6988)
- Best practices for network and storage benchmarking (#6632)
- Clarified Ray runtime usage (#6783)
- SSM documentation improvements (#7050)
- Fixed broken documentation links (#7026, #7004)
Testing & CI/CD & Development
- Backward compatibility tests against stable releases (#6979)
- SSH lag unit tests (#6968)
- Improved smoke test reliability (#6913, #6958, #6967, #7019)
- Customizable Buildkite queue support (#7063)
- Fixed type errors in managed jobs (#6994)
Developer Experience
- Python 3.12 and 3.13 support (#5304, #6990)
- Performance measurement annotations (#6943)
- Better error handling for UV-only environments (#6893)
API Server Improvements
SkyPilot v0.10.2
SkyPilot v0.10.2: Enhanced observability, programmatic SDK, performance, and more
SkyPilot v0.10.2 brings improvements in cluster management, improved Kubernetes support, a programmatic SDK, and numerous stability and performance enhancements for production use by teams with a large number of workloads.
Get it now with:
uv pip install "skypilot>=0.10.2"
Highlights
Cluster observability improvements
Get provisioning logs with the new --provision option for sky logs (#6638):
sky logs --provision <cluster-name>
Find the detailed reason for your cluster failure, e.g., OOM, with the cluster events (#6590, #6593, #6615, #6620, #6621, #6667, #6658, #6609, #6617):
SDK: Process your logs while streaming
SkyPilot introduces a new preload_content option for tail_logs to enable processing logs while streaming.
logs = sky.tail_logs(cluster_name, job_id, follow=True, preload_content=False)
for line in logs:
    if line is not None:
        if 'needle in the haystack' in line:
            print("found it!")
            break
logs.close()
Performance Improvements
- Job dashboard page loading speed improvement for 10k+ jobs (#6714, #6652)
- Volume mounting on Kubernetes reduced from 2 minutes to seconds (#6679)
- Increased the connection limit from a single client from 10 to 100 (#6782)
- Speed up sky down for AWS clusters with exposed ports by 4x (#6629, #6663, #6720)
- Fix the OOM issue with large file_mounts (#6865)
- Enhanced SSH connection robustness (#6715)
What's New
Infrastructure-specific Improvements
- Kubernetes:
- Label support for volumes (#6696)
- Better authentication handling for in-cluster services (#6600)
- Proxy setup guide for Kubernetes (#6618)
- Improved error handling: Enhanced failover error messages (#6613)
- Enable write access to conda base environment (#6766)
- Better GPU label detection with empty labels (#6859)
- AWS:
- Nebius:
- RunPod:
- Lambda: Support B200 on Lambda (#6816)
- R2 mount issue resolution (#6662)
Dashboard Enhancements
- Managed Jobs:
- User Management:
- Allow editing allowed_users and private settings for workspaces (#6566)
- Miscs:
- Fixed tour auto-start behavior (#6654)
API Server Enhancements
- Robust jobs/serve controller against upgrades (#6779)
- Prometheus metrics in deployment mode (#6712)
- Docs for using Cloud SQL for states (#6587)
- State management improvements:
- Admin Policy & Permissions & RBAC:
UX/API Improvements
- CLI Enhancements:
- Better managed job log tailing behavior (#6719)
- Improved --config option documentation (#6794)
- Better logs organization for downloading (#6795)
- Warning for disk size specifications on Kubernetes (#6637, #6684)
- Client version and commit info in sky api info (#6748)
- Improved cluster event messaging and wording (#6686, #6688)
- Better user feedback with spinner for delay messages (#6575)
- Enhanced pod configuration validation (#6825)
- SDK Improvements:
- Improved transient failure handling (#6808, #6807): More intelligent retry mechanisms
- Simpler log streaming utility (#6750): More intuitive API for processing logs in real-time with iterator-based approach
- Better response typing (#6659, #6718, #6527)
- Exposed sky.endpoints function (#6599)
- Enhanced decoder backward compatibility (#6810)
- Comprehensive endpoints documentation (#6815)
- Robustness and Performance Improvements:
- Fixed authentication with simplejson (#6698)
- Reduced import statement overhead (#6641, #6645, #6648)
- Optimized directory utilities (#6646)
- Resolved API logging issues (#6619)
- Enhanced retry logic for non-transient errors (#6844)
- Better handling of missing pods with termination filtering (#6697)
- Fixed various cluster name querying issues (#6616, #6624)
- Improved job import handling in sky init (#6623)
Testing and Infrastructure
- Test Improvements:
- Automatic retry for core tests (#6710)
- Expanded nightly build coverage (#6695)
- Staggered test execution (#6732)
- Remote server test support (#6819, #6823)
- New smoke tests for provisioning logs (#6790, #6818)
- Fixed endpoint comparison in unittests (#6578)
- Fixed TPU test example failure (#6673)
- Added Kubernetes volume merging unit tests (#6813)
- Added support for custom cloud config files in smoke tests (#6851)
- Enhanced serve status checks after termination (#6713)
- Include Nebius VMs tests in CI/CD (#6835)
- CI/CD Enhancements:
- Helm Charts & Deployment:
Other feature improvements
- SSH Node Pool: region-specific launches (#6767)
- SkyServe: GPU-aware Load Balancing (#6147)
SkyServe now intelligently scales based on GPU types. Set different QPS targets across heterogeneous GPUs:
load_balancing_policy: instance_aware_least_load
replica_policy:
  target_qps_per_replica:
    "H100:1": 2.5 # H100 can handle 2.5 QPS
    "A100:1": 1.25 # A100 can handle 1.25 QPS
    "A10:1": 0.5 # A10 can handle 0.5 QPS
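One way to read a policy named instance_aware_least_load is "send the next request to the replica with the lowest load relative to what its GPU can handle". The toy sketch below implements that reading using the QPS targets from the config above; it is an assumption about the idea, not SkyServe's actual load balancer, and pick_replica and the replica tuples are hypothetical.

```python
from typing import Dict, List, Tuple

# Per-replica QPS targets, taken from the example config above.
TARGET_QPS_PER_REPLICA: Dict[str, float] = {
    "H100:1": 2.5,
    "A100:1": 1.25,
    "A10:1": 0.5,
}


def pick_replica(replicas: List[Tuple[str, str, float]]) -> str:
    """Pick the replica name with the lowest current_qps / target_qps ratio.

    Each replica is (name, accelerator, current_qps).
    """
    def relative_load(replica: Tuple[str, str, float]) -> float:
        _, accelerator, current_qps = replica
        return current_qps / TARGET_QPS_PER_REPLICA[accelerator]

    return min(replicas, key=relative_load)[0]


if __name__ == "__main__":
    fleet = [("r-h100", "H100:1", 2.0),
             ("r-a100", "A100:1", 0.5),
             ("r-a10", "A10:1", 0.3)]
    print(pick_replica(fleet))
```

Normalizing by the target keeps a fast GPU from looking "busy" just because it serves more absolute traffic than a slower one.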
Backend
- Alpha support for replacing skylet with grpc (#6771, #6754, #6788, #6574, #6640, #6643, #6649, #6850)
- Alpha support for Managed Jobs Pool (#6665, #6571, #6614, #6676, #6580, #6682, #6675, #6785, #6591, #6666)
- Enhanced event querying by cluster hash (#6617)
- Resolved unpickle issues with class location changes (#6731)
- Volume resolution before DAG application (#)
- Fixed log cancellation issues (#6801)
Examples and Documentation
- RAG example improvements for embeddings (#6639)
- Kimi K2 multi-node serving example (#6706, #6723)
- Enhanced autostop documentation (#6577)
- Two-hop file transfer guide for managed jobs (#6576)
- SDK examples in Quick Start (#6561, #6597)
- Polished landing page design (#6756)
- Miscs (#6741, #6827, #6828, #6586, #6774)
Contributors
We thank all contributors who made this release possible!
New Contributors: @miltava, @tomzx, @webconn, @hongsu-moreh, @ibpark-moreh, @nathan-liner
All Contributors: @DanielZhangQD, @cblmemo, @kyuds, @SeungjinYang, @lloyd-brown, @romilbhardwaj, @rohansonecha, @zpoint, @aylei, @SalikovAlex, @Michaelvll, @kevinmingtarja, @miltava, @Maknee, @cg505, @tomzx, @concretevitamin, @webconn, @sethkimmel3, @andylizf, @hongsu-moreh, @ibpark-moreh, @lucamanolache, @nathan-liner
Special thanks to the community for bug reports, feature requests, and pull requests that helped improve SkyPilot!
Full Changelog
For a complete list of changes, see the commit history.
SkyPilot v0.10.1
SkyPilot v0.10.1: Improved production readiness, distributed training, cloud integrations, and more
SkyPilot v0.10.1 improves enterprise production readiness with large-scale distributed training capabilities (Llama 4 400B, OpenAI GPT-OSS), introduces/enhances integration with AMD GPUs and leading GPU clouds (CoreWeave, Nebius), and delivers enhanced reliability features for mission-critical AI workloads.
Get it now with:
uv pip install "skypilot>=0.10.1"
Highlights
Native Git Support
Use your (private) git repositories as your SkyPilot workdir:
# task.sky.yaml
workdir:
  url: https://github.com/my-org/my-repo.git
  ref: 1234ab # commit hash or branch name
Find the commit hash of your workdir in the Dashboard:
Added in (#6294, #6257, #6268).
Distributed LLM Pretraining/Finetuning and Agentic training Examples
We released high-performance distributed training examples for large models with checkpointing support (#6525, #6551, #6242, #6273).
- Llama 4 Maverick 400B model training on more than 16 H200 GPUs: example
- OpenAI GPT-OSS model pretraining and finetuning: example
Agentic training example with VeRL is now also available (#6443).
Cloud and accelerator support
Native AMD GPU Support on Kubernetes
AMD ROCm is now supported on Kubernetes clusters with SkyPilot!
# task.sky.yaml
resources:
  infra: k8s/my-amd-cluster
  image_id: docker:rocm/pytorch-training:v25.6
  accelerators: MI300:4
CoreWeave Integration
SkyPilot now supports CoreWeave clusters with native Infiniband and object store support (#6386, #6487, #6483).
Use your CoreWeave cluster with SkyPilot:
resources:
  infra: k8s/my-coreweave-cluster
  network: best # Enable infiniband
Nebius cloud improvement
B200 GPUs and spot instances are now supported on Nebius cloud (#6474, #6478, #6267).
# task.sky.yaml
resources:
  infra: nebius
  accelerators: B200:8
  use_spot: true
Additionally, SkyPilot now supports MOUNT_CACHED mode for Nebius cloud. (#6456)
Autostop/Autodown based on SSH sessions
In addition to waiting for running jobs on a cluster, you can now have autostop/autodown wait for active SSH sessions, or wait for neither, via the SkyPilot YAML (#6361, #6485).
# task.sky.yaml
resources:
  autostop:
    wait_for: jobs_and_ssh
External Logging
You can now dump your SkyPilot cluster/job logs to external logging services like AWS CloudWatch and GCP Cloud Logging (#6331, #6405, #6411, #6369). Configure it in your ~/.sky/config.yaml:
logs:
  store: aws # Or 'gcp', etc.
  aws:
    ... # Service-specific options; see below.
What's New
UX/API
- 2x Faster sky status: Intelligent caching reduces status command latency by 50% (#6166)
- Optimized Imports: Lazy loading of dependencies for faster CLI startup (#6302)
- Autostop/Autodown: Improve autostop logic during cluster refresh (#6388)
- Fast network timeout detection: Faster timeout detection for remote API server checks (#6263)
- Better hints for parallel launch (#6549)
- Support sky api logout to log out from the API server (#6284, #6327, #6412)
- Type Checking: Enhanced mypy support with py.typed marker (#6440)
UI
- Interactive Tour: Guided tour for new users to learn dashboard features (#6565, #6301)
- Mobile Support: Responsive design for narrow screens (#6280)
- YAML Syntax Highlighting: Enhanced code readability with syntax highlighting (#6404)
- User Management: Delete users directly from the dashboard interface (#6434)
- Display fixes: Cluster history tracking issues (#6512), VSCode remote connection (#6555), workspace resources aggregation (#6537, #6283)
Storage/Volume
- S3 Mounting on ARM-based Instances: Full support for S3 mounting on ARM64 instances (#6255)
- Volume Management SDK: Added SDK for creating, managing persistent volumes across clouds (#6439, #6170, #6383)
- New examples for volumes: juiceFS, NFS, etc. (#6521)
API Server
API Server Deployment
- Helm Chart on Artifact Hub: SkyPilot Helm chart now available on Artifact Hub (#6560, #6371)
- More robust DB: Use Alembic for database schema migrations (#6579, #6556, #6196), thread safety improvements (#6422, #6279)
- OAuth2 Proxy Integration: Built-in OAuth2 proxy support (#6476)
- Redis Secret Support: Configuration for Redis URL via secrets (#6559)
- Deployment image size reduction (#6312, #6285)
- GPU Detection: Fixed GPU name matching with underscores (#6491)
Client Enhancements
- Enhanced Client Reliability: Automatic retry on transient failures (#6259, #6298, #6234)
- Robustness enhancements: lock timeout notifications (#6530), transient error retry (#6234), catalog refresh (#6272)
- Service Account Token Auth Fix: Fixed token authentication in file uploads (#6432)
- Issues fixes: file path escaping (#6262),
Managed Jobs
- Force Disable Cloud Buckets: Configuration option to disable cloud bucket usage for managed jobs and SkyServe (#6402, #6407)
- Multi-Task Log Viewing: See logs from all tasks in a managed job pipeline (#6415)
Cloud Integrations
- GCP N4 Instance Types: Support for new N4 machine family (#6253)
- GCP TPU Improvements: Enhanced TPU provisioning logic (#6420)
- Hyperbolic: Marked CUSTOM_MULTI_NETWORK as unsupported (#6245)
- RunPod: CPU instance launching support (#6450)
- Vast: Port opening support (#6282)
- Azure: Dependency resolution fixes (#6231)
Miscs
- Developer Experience: Improved format.sh for code consistency (#6580)
- Backward Compatibility: Fixed issues with Task field and version detection (#6548, #6564, #6485)
- Test Stability: Fixed various flaky tests (#6507, #6495, #6511, #6469)
- Release Pipeline: Fixed API version extraction and workflow issues (#6464, #6254)
- Buildkite: Sequential test execution to reduce resource usage (#6217)
Production Deployment Guides
- PostgreSQL with GCP Cloud SQL: Complete guide for deploying API server with Cloud SQL backend (#6325)
- Docker Deployment: Containerized API server deployment guide (#6296)
- AWS SSM as SSH proxy: Setup guide for AWS Systems Manager (#6522)
Contributors
We thank all contributors who made this release possible!
New Contributors: @LokmaSpeedy, @jacobergzhou, @amd-pratmish, @tedspare, @jimbz, @makhalin
All Contributors: @alex000kim, @amd-pratmish, @andylizf, @aylei, @bikramnehra, @cblmemo, @cg505, @clayrosenthal, @concretevitamin, @DanielZhangQD, @jacobergzhou, @jimbz, @kevinmingtarja, @kyuds, @lloyd-brown, @LokmaSpeedy, @lucamanolache, @makhalin, @Maknee, @Michaelvll, @rohansonecha, @romilbhardwaj, @SalikovAlex, @SeungjinYang, @tedspare, @zhenjiasun, @zpoint
Special thanks to the community for bug reports, feature requests, and pull requests that helped improve SkyPilot!
Full Changelog
For a complete list of changes, see the commit history.



















