Releases: skypilot-org/skypilot

Release v0.11.2rc1

17 Dec 06:28
c55d94c

Pre-release

This patch release is a minor bump from v0.11.1 to get you the latest fixes:

  • Always set SSH key permissions to avoid an issue when the jobs controller restarts under the high-availability setting (#8316)

Install the release candidate:

# Select needed clouds.
uv pip install 'skypilot[kubernetes,aws,gcp]==0.11.2rc1'

Upgrade your remote API server:

NAMESPACE=skypilot # TODO: change to your installed namespace
RELEASE_NAME=skypilot # TODO: change to your installed release name
helm repo update skypilot
helm upgrade -n $NAMESPACE $RELEASE_NAME skypilot/skypilot \
  --version 0.11.2-rc.1 \
  --reset-then-reuse-values \
  --set apiService.image=null  # reset to default image
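
The pip package and the Helm chart spell the same release candidate differently in these notes (`0.11.2rc1` for pip, `0.11.2-rc.1` for the chart). A minimal sketch of that mapping, assuming the pattern seen here holds for other release candidates; `pip_rc_to_chart_version` is a hypothetical helper and not part of SkyPilot:

```python
import re


def pip_rc_to_chart_version(pip_version: str) -> str:
    """Map a pip-style release candidate to the chart-style spelling.

    Versions that are not release candidates are returned unchanged.
    Post releases such as 0.10.3.post2 are not handled here.
    """
    match = re.fullmatch(r'(\d+\.\d+\.\d+)rc(\d+)', pip_version)
    if match is None:
        return pip_version
    base, rc_number = match.groups()
    return f'{base}-rc.{rc_number}'


chart_version = pip_rc_to_chart_version('0.11.2rc1')
```

The result can be passed to `--version` in the `helm upgrade` command above.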

Full Changelog: v0.11.1...v0.11.2rc1

SkyPilot v0.11.1

15 Dec 23:54
a7380a5

This patch release is a minor bump from v0.11.0 to get you the latest fixes:

  • #8285: Fix an issue where the API server request database was incorrectly locked, leaving the API server unresponsive with the error:
    sqlite3.OperationalError: database is locked
    

See the full v0.11 release notes for everything new in SkyPilot v0.11!

SkyPilot v0.11.0

11 Dec 07:21
b2f3519

SkyPilot v0.11.0: Multi-Cloud Pools, Fast Managed Jobs, Enterprise-Readiness at Large Scale, Programmability

SkyPilot v0.11.0 delivers major new features: Pools and Managed Jobs Consolidation Mode; significantly improved enterprise-readiness at large scale: support for hundreds of AI engineers with a single API server instance, OOM avoidance, >10x performance improvement on many requests, additional observability, and more; and UX improvements: Templates, Python SDK, CI/CD, Git Support, and more.

Get it now with:

uv pip install "skypilot>=0.11.0"

Or, upgrade your team SkyPilot API server:

NAMESPACE=skypilot
RELEASE_NAME=skypilot
VERSION=0.11.0

helm repo update skypilot
helm upgrade -n $NAMESPACE $RELEASE_NAME skypilot/skypilot \
  --set apiService.image=berkeleyskypilot/skypilot:$VERSION \
  --version $VERSION --devel --reuse-values

Highlights

[Beta] SkyPilot Pools: Batch inference across clouds & k8s

SkyPilot supports spawning a pool that launches a set of workers across many clouds and Kubernetes clusters (docs, #6260, #6426, #6459, #6552, #6591, #6665, #6675, #7332, #7963, #8008, #7876, #8047, #7930, #8039, #7846, #7855, #7919, #7920). Jobs can be scheduled on this pool and distributed to workers as they become available.

Key benefits include:

  • Fully utilize your GPU capacity across clouds & k8s
  • Unified queue for jobs on all infra
  • Keep workers warm, scale elastically
  • Step aside for other higher priority jobs; reschedule when GPUs become available

Learn more in our blog post.

[Figure: batch inference architecture]

[GA] Managed Jobs Consolidation mode

Consolidation Mode is generally available (#7122, #7127, #7396, #7459, #7498, #7560, #7601, #7619, #7717, #7720, #7847, #8082, #8021, #8106). This enables:

  • 6x faster job submission
  • Consistent credentials across the API server and jobs controller
  • Persistent managed jobs state on PostgreSQL
# config.yaml
jobs:
  controller:
    consolidation_mode: true  # Currently defaults to False.

Efficient and Robust Managed Jobs

We have significantly optimized the Managed Jobs controller, allowing it to handle 2000+ parallel jobs on a single 8-CPU controller, an 18x improvement in job capacity with the same controller size (#7051, #7371, #7379, #7408, #7432, #7473, #7487, #7488, #7494, #7519, #7585, #7595, #7945, #7966, #7979, #8036, #8095).

Enterprise-Ready SkyPilot at Large-scale

  • Support hundreds of AI engineers with a single SkyPilot API server instance
  • Significantly reduced memory consumption and OOM avoidance for the API server
  • Faster CLI, SDK, and Dashboard with a large number of clusters and jobs
  • Comprehensive API server metrics for operations
  • SSO support with Microsoft Entra ID

Kubernetes

UX, Robustness, and Performance Improvements

  • Robust SSH for SkyPilot clusters on Kubernetes
  • Robust multi-node (pod) provisioning with retry and recovery (#7820, #7854, #7852)
  • Improved resource cleanup after termination
  • Intelligent GPU name detection
  • Improved volume support: label support, name validation, and SDK support

Volume Support for Existing PVC and Ephemeral Volumes (#7915, #7971, #8179)

  • Existing PVC: Reference pre-existing Kubernetes PersistentVolumeClaims as a SkyPilot volume (#7915).

    # volume.yaml
    name: existing-pvc-name
    type: k8s-pvc
    infra: k8s/context1
    use_existing: true
    config:
      namespace: namespace
  • Ephemeral Volumes: Automatically create volumes when a cluster is launched and delete them when the cluster is torn down, making them ideal for temporary storage across multiple nodes, such as caches and intermediate results.

    # task.sky.yaml
    file_mounts:
      /mnt/cache:
        size: 100Gi

Cloud Support

CoreWeave Integration Announcement

CoreWeave now officially supports SkyPilot (#6386, #6519, #6895, #7756, #7838). This integration provides: InfiniBand Support, Object Storage, Autoscaling.

See the announcement on the CoreWeave blog.

AMD GPU Support Announcement

SkyPilot now fully supports AMD GPUs on Kubernetes clusters (#6378, #6944). This includes: GPU Detection and Scheduling, Dashboard Metrics, ROCm Support.

See the announcement on the AMD ROCm blog.

More Clouds Support

  • Together AI Instant cluster
  • Seeweb support

User Experience

SkyPilot Templates

SkyPilot now ships predefined YAML templates for launching clusters with popular frameworks and patterns. Templates are automatically available on all new SkyPilot clusters. (#7935, #7965)

You can now launch a multi-node Ray cluster by adding a single line to your YAML’s run block:

run: |
  # One-line setup for a distributed Ray cluster
  ~/.sky/templates/ray/start_cluster

  # Submit your job
  python train.py

Programmability: SkyPilot Python SDK

The SkyPilot Python SDK is significantly improved with:

  • Type hints

  • Log streaming
logs = sky.tail_logs(cluster_name, job_id, follow=True, preload_content=False)
for line in logs:
    if line is not None:
        if 'needle in the haystack' in line:
            print("found it!")
            break
logs.close()
  • Admin policy: build admin policy with helper functions
resource_config = user_request.task.get_resource_config()
resource_config['use_spot'] = True
user_request.task.set_resources(re...

SkyPilot v0.11.0rc1

04 Dec 23:28
83ef36b

Pre-release

This is a preview of the upcoming 0.11.0 release!

Install the release candidate:

# Select needed clouds.
uv pip install 'skypilot[kubernetes,aws,gcp]==0.11.0rc1'

Upgrade your remote API server:

NAMESPACE=skypilot # TODO: change to your installed namespace
RELEASE_NAME=skypilot # TODO: change to your installed release name
helm repo update skypilot
helm upgrade -n $NAMESPACE $RELEASE_NAME skypilot/skypilot \
  --version 0.11.0-rc.1 \
  --reset-then-reuse-values \
  --set apiService.image=null  # reset to default image

If you try out 0.11.0rc1, please let us know in Slack!

SkyPilot v0.10.5

06 Nov 21:31
4e2cfdc

SkyPilot v0.10.5: Major Managed Jobs Efficiency Improvement, UX and SDK Usability Enhancements, and API Server Robustness Fixes

This release focuses on production stability, performance optimization, and fixing critical bugs that affected reliability in multi-user and Kubernetes environments. Key highlights include major performance improvements for managed jobs and the dashboard, resolution of race conditions and resource leaks, and expanded cloud/accelerator support.

This release includes 400+ merged pull requests with high-priority critical fixes and additional improvements spanning bug fixes, performance enhancements, new features, and comprehensive documentation updates.


Highlights

Managed Job Efficiency and Robustness Improvement

# config.yaml
jobs:
  controller:
    consolidation_mode: true  # Currently defaults to False.

  • Managed jobs robustness improvement (#7769, #7716).

Performance and Robustness Improvement at a Large Scale

API Server Robustness Improvement

  • Significantly reduce memory consumption and avoid OOM for API server (#7240).

Python SDK Usability Improvements

  • New Python SDK examples to easily scale out your jobs on any infra (#7335).
import sky
from sky import jobs as managed_jobs

for i in range(100):
    resource = sky.Resources(accelerators='A100:8')
    task = sky.Task(resources=resource,
                    workdir='.',
                    run='python batch_inference.py')
    managed_jobs.launch(task, name=f'hello-{i}')
  • Admin policy robustness and better usability (#7827).

Integration

  • CoreWeave managed Kubernetes and storage support (#7756, #6519, #7759).
  • AMD GPU support and GPU metrics (#6944).

What's New

Additional Performance Improvements

  • Migrated managed jobs and SkyServe to a gRPC architecture, reducing P95 latency by up to 86% and CPU usage by 20-30% (#7245, #6647, #6702, #7184).
  • Optimized Kubernetes pod processing with streaming JSON, providing 2x speedup and 50% memory reduction for large clusters (#7469).
  • Optimized database queries for sky status and API endpoints, improving mean response time by 10-35% (#7689, #7690, #7665, #7705, #7708).
  • Replaced standard JSON with high-performance orjson for API serialization, improving tail latency for large payloads (#7734).

Critical Stability & Security Fixes

  • Fixed critical race conditions in Docker operations ("container name already in use") (#7030), API server start/stop (#7534), and asynchronous job cancellation, preventing cluster leaks (#7511).
  • Resolved critical Kubernetes resource leaks, including semaphore file leaks (/dev/shm exhaustion) (#7678) and fusermount-server leaks (#7398).
  • Fixed system hangs caused by SSH threads (#7202), K8s SSH proxy blocking (#7537), and BrokenProcessPool crashes when canceling sky logs (#7607).
  • Addressed a critical OOM bug in the server caused by inefficient GPU name canonicalization in Kubernetes (#7080).
  • Fixed security/privacy bugs where pending jobs were visible across users (#7581) and environment variables leaked between managed jobs (#7459).
  • Prevented accidental cancellation of the wrong requests by requiring exact ID matches when prefixes are ambiguous (#7730).
  • Replaced insecure curl | sh pattern for fluent-bit installation with official package repositories (#7126).
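
The exact-match rule for ambiguous request ID prefixes (#7730) can be illustrated with a short sketch. This is not SkyPilot's implementation; `resolve_request_id` and the sample IDs are hypothetical and only show the idea of refusing to act on a prefix that matches more than one request:

```python
from typing import Iterable, Optional


def resolve_request_id(query: str, known_ids: Iterable[str]) -> Optional[str]:
    """Return the single request ID that `query` refers to, or None.

    An exact match always wins. Otherwise `query` is treated as a prefix
    and accepted only when exactly one known ID starts with it.
    """
    ids = list(known_ids)
    if query in ids:
        return query
    matches = [request_id for request_id in ids if request_id.startswith(query)]
    if len(matches) == 1:
        return matches[0]
    # No match at all, or an ambiguous prefix: refuse to guess.
    return None


known = ['abc123', 'abd456', 'ffe789']
unique = resolve_request_id('ff', known)     # one candidate, so it resolves
ambiguous = resolve_request_id('ab', known)  # two candidates, so it does not
```

Returning `None` for the ambiguous case leaves it to the caller to ask for the full ID instead of cancelling an arbitrary match.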

UX Improvement

  • Improved CLI error messages for Kubernetes resource constraints (#6814) and multi-node setup failures (#7001).
  • Enhanced sky down output to show a clear summary when multiple clusters fail or succeed (#7225, #7635).
  • Added a CLI spinner for short-running requests (like sky status) to show activity when the server is under load ([#76...

SkyPilot v0.10.3.post2

10 Oct 21:36
a51ed28

This patch release is a minor bump over v0.10.3.post1 to improve robustness on CoreWeave clusters by adding retries and a fallback for Kubernetes API calls:

  • Fall back to file-based command execution on error codes 431 and 400 (#7536, #7563)
  • Retry on transient authentication issues with error code 403 (#7568, #7574)
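
The two behaviours above, retrying on 403 and switching strategy on 431 or 400, can be sketched generically. This is an illustration only, not the code from these pull requests; `ApiError`, `call_with_retry_and_fallback`, the attempt count, and the backoff values are all assumptions:

```python
import time
from typing import Callable, TypeVar

T = TypeVar('T')


class ApiError(Exception):
    """Hypothetical error carrying the status code of a failed API call."""

    def __init__(self, status: int):
        super().__init__(f'API call failed with status {status}')
        self.status = status


def call_with_retry_and_fallback(
        primary: Callable[[], T],
        fallback: Callable[[], T],
        max_attempts: int = 3,
        base_delay: float = 0.5,
        sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `primary`; retry on 403 and use `fallback` on 400 or 431."""
    if max_attempts < 1:
        raise ValueError('max_attempts must be at least 1')
    for attempt in range(max_attempts):
        try:
            return primary()
        except ApiError as error:
            if error.status in (400, 431):
                # The request itself was rejected: retrying will not help,
                # so switch to the alternative execution path.
                return fallback()
            if error.status == 403 and attempt < max_attempts - 1:
                # Possibly transient: back off exponentially and retry.
                sleep(base_delay * 2 ** attempt)
                continue
            raise
    raise AssertionError('unreachable: the loop returns or raises')
```

Passing `sleep` as a parameter keeps the helper easy to exercise in tests without real delays.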

To upgrade:

pip install "skypilot==0.10.3.post2"
# Restart your local API server
sky api stop; sky api start

Or, upgrade the API server:

NAMESPACE=skypilot
RELEASE_NAME=skypilot
VERSION=0.10.3

helm repo update skypilot
helm upgrade -n $NAMESPACE $RELEASE_NAME skypilot/skypilot \
  --set apiService.image=berkeleyskypilot/skypilot:0.10.3.post2 \
  --version $VERSION --devel --reuse-values

Full Changelog: v0.10.3.post1...v0.10.3.post2

SkyPilot v0.10.3.post1

22 Sep 20:43
48dfbea

This patch release is a minor bump over v0.10.3 to fix a dependency issue caused by a breaking change in uvicorn==0.36.0:

  • Pin uvicorn dependency to mitigate AttributeError: 'Config' object has no attribute 'setup_event_loop' error. (#7287)

If you see the error above, upgrade SkyPilot with:

pip install "skypilot==0.10.3.post1"
# Restart your local API server
sky api stop; sky api start

If you are using a remote API server, it should still work with v0.10.3.

SkyPilot v0.10.3

10 Sep 02:16
1e23b5b

SkyPilot v0.10.3: 2-10x Performance Improvement, Better Observability, Production-Ready Authentication and More

SkyPilot v0.10.3 delivers 2-10x performance improvement, enhanced cloud integration, improved Kubernetes integration, and strengthened production reliability features for AI/ML workloads across clouds.

Highlights

Massive Performance Improvements: 2-10x more efficient

Upgrade your SkyPilot API server to 0.10.3 and get the improvement out of the box!

Observability & Monitoring

More SkyPilot API server and cluster metrics can now be set up for better observability.

CoreWeave Integration: Autoscaler

CoreWeave autoscaler is now supported, with a single config field change (#6895):

kubernetes:
  autoscaler: coreweave

SkyPilot API Server: Authentication & Production Environment

SkyPilot now supports Microsoft Entra ID SSO login (#7045, #7028), in addition to Okta and Google Workspace. See the docs for setting up SSO login for the API server: https://docs.skypilot.co/en/latest/reference/auth.html#sso-recommended

What's New

UX Improvement

  • Fix CLI auto-completion authentication issue (#6724)
  • Fixed cluster ownership display (#6989)
  • Fixed API logs for daemon request (#6841)
  • SDK improvements for tail_logs (#6902)
  • Responsive spinner for request blocking (#6905)
  • Fixed SDK authentication for download endpoints (#6955)
  • Enable log downloading in dashboard (#6999, #7000, #7003, #7010)
  • Prevented clients from setting DB strings server-side (#7042)
  • Improved logout error handling (#6874)

Storage & File Operations

  • Fixed storage mounting on ARM64 instances (#7008)
  • Improved chunk upload reliability (#6854)
  • Rsync permission fixes for Kubernetes (#6951)

Enhanced Cloud Platform Support

Kubernetes Improvements

  • Pod configuration change detection with warnings (#6912)
  • Volume name validation (#6863)
  • Worker service cleanup on termination (#7014)
  • AMD GPU support improvements

Nebius Cloud

  • Static IP configuration support (#7002)
  • Docker image ID fixes (#6894)
  • Pagination support for list_instances (#6867)
  • Customizable API domain (#6888)
  • Better network error messages (#6971)
  • Various stability improvements (#6856, #6896)

AWS

  • Elastic Fabric Adapter (EFA) automatic enablement for improved network performance (#6852)
  • AWS Systems Manager (SSM) support with new use_ssm flag for secure connections (#6830)
  • Improved VPC error messages and handling (#6746)
  • Fixed CloudWatch region setting issues (#6747)
  • Better teardown leak prevention (#7022)

RunPod

  • Volume support for RunPod network volumes (#6949)

GPU Support

  • Added B200 to common accelerators in sky show-gpus (#7006)

Documentation & Examples

  • New training examples: TorchTitan (#6677), distributed RL with game servers (#6988)
  • Best practices for network and storage benchmarking (#6632)
  • Clarified Ray runtime usage (#6783)
  • SSM documentation improvements (#7050)
  • Fixed broken documentation links (#7026, #7004)

Testing & CI/CD & Development

  • Backward compatibility tests against stable releases (#6979)
  • SSH lag unit tests (#6968)
  • Improved smoke test reliability (#6913, #6958, #6967, #7019)
  • Customizable Buildkite queue support (#7063)
  • Fixed type errors in managed jobs (#6994)

Developer Experience

  • Python 3.12 and 3.13 support (#5304, #6990)
  • Performance measurement annotations (#6943)
  • Better error handling for UV-only environments (#6893)

API Server Improvements

  • Request retention daemon fixes (#7035)
  • Better error messages for unavailable server logs (#6855)
  • Request ID header corrections (#7025)
  • Backward compatibility improvements ([#6981](#6...

SkyPilot v0.10.2

27 Aug 08:04
fc35b36

SkyPilot v0.10.2: Enhanced observability, programmatic SDK, performance, and more

SkyPilot v0.10.2 brings improvements in cluster management, improved Kubernetes support, a programmatic SDK, and numerous stability and performance enhancements for production use by teams with a large number of workloads.

Get it now with:

uv pip install "skypilot>=0.10.2"

Highlights

Cluster observability improvements

Get provisioning logs with the new --provision option for sky logs (#6638):

sky logs --provision <cluster-name>

Find the detailed reason for your cluster failure, e.g., OOM, with the cluster events (#6590, #6593, #6615, #6620, #6621, #6667, #6658, #6609, #6617).

SDK: Process your logs while streaming

SkyPilot introduces a new preload_content option for tail_logs to enable processing logs while streaming.

logs = sky.tail_logs(cluster_name, job_id, follow=True, preload_content=False)
for line in logs:
    if line is not None:
        if 'needle in the haystack' in line:
            print("found it!")
            break
logs.close()
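
The same search loop can be factored into a small helper that works on any iterator of lines. This is a sketch: `first_matching_line` is a hypothetical name, and the fake stream below stands in for a live log stream so that the example runs without a cluster:

```python
from typing import Iterable, Optional


def first_matching_line(lines: Iterable[Optional[str]],
                        needle: str) -> Optional[str]:
    """Return the first line containing `needle`, or None if the stream ends."""
    for line in lines:
        # Skip None entries, mirroring the `line is not None` check above.
        if line is not None and needle in line:
            return line
    return None


# Stand-in for a live log stream.
fake_stream = iter(['starting\n', None, 'needle in the haystack\n', 'done\n'])
match = first_matching_line(fake_stream, 'needle in the haystack')
```

Because the helper returns as soon as it finds a match, the rest of the stream is left unread. With a real stream, pass the object returned by `sky.tail_logs(..., preload_content=False)` and call `close()` on it afterwards, as in the snippet above.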

Performance Improvements

  • Job dashboard page loading speed improvement for 10k+ jobs (#6714, #6652)
  • Volume mounting on Kubernetes reduced from 2 minutes to seconds (#6679)
  • Increased the connection limit for a single client from 10 to 100 (#6782)
  • Speed up sky down for AWS clusters with exposed ports by 4x (#6629, #6663, #6720)
  • Fix the OOM issue with large file_mounts (#6865)
  • Enhanced SSH connection robustness (#6715)

What's New

Infrastructure-specific Improvements

  • Kubernetes:
    • Label support for volumes (#6696)
    • Better authentication handling for in-cluster services (#6600)
    • Proxy setup guide for Kubernetes (#6618)
    • Improved error handling: Enhanced failover error messages (#6613)
    • Enable write access to conda base environment (#6766)
    • Better GPU label detection with empty labels (#6859)
  • AWS:
    • Customizable SSH users (#6625)
    • Root device name detection (#6644)
    • Support Rocky Linux (#6711)
    • Fix fractional GPU quantity for P5.48xlarge (#6722)
  • Nebius:
    • Memory configuration support (#6832): Users can now properly specify memory requirements (e.g., 64GB instances)
    • More robust cluster status fetching (#6693, #6672)
    • Enhanced VRAM calculation (#6628)
    • Fixes for infinite waiting (#6674)
  • RunPod:
    • Docker image handling with any_of config (#6728)
    • Docker name conflict resolution (#6769)
  • Lambda: Support B200 on Lambda (#6816)
  • R2 mount issue resolution (#6662)

Dashboard Enhancements

  • Managed Jobs:
  • User Management:
    • Allow editing allowed_users and private settings for workspaces (#6566)
  • Miscs:
    • Fixed tour auto-start behavior (#6654)

API Server Enhancements

  • Robust jobs/serve controller against upgrades (#6779)
  • Prometheus metrics in deployment mode (#6712)
  • Docs for using Cloud SQL for states (#6587)
  • State management improvements:
    • Unique constraint violation detection for PostgreSQL (#6773)
    • SQLAlchemy warning suppression (#6796)
    • Reduce DB creation operations (#6594)
  • Admin Policy & Permissions & RBAC:
    • Apply admin policy for volumes (#6668, #6781, #6749)
    • Enhanced restful admin policy in testing (#6353, #6626)
    • Backward compatibility handling for admin policies (#6692)
    • Optimized permission module logging (#6805)

UX/API Improvements

  • CLI Enhancements:
    • Better managed job log tailing behavior (#6719)
    • Improved --config option documentation (#6794)
    • Better logs organization for downloading (#6795)
    • Warning for disk size specifications on Kubernetes (#6637, #6684)
    • Client version and commit info in sky api info (#6748)
    • Improved cluster event messaging and wording (#6686, #6688)
    • Better user feedback with spinner for delay messages (#6575)
    • Enhanced pod configuration validation (#6825)
  • SDK Improvements:
    • Improved transient failure handling (#6808, #6807): More intelligent retry mechanisms
    • Simpler log streaming utility (#6750): More intuitive API for processing logs in real-time with iterator-based approach
    • Better response typing (#6659, #6718, #6527)
    • Exposed sky.endpoints function (#6599)
    • Enhanced decoder backward compatibility (#6810)
    • Comprehensive endpoints documentation (#6815)
  • Robustness and Performance Improvements:
    • Fixed authentication with simplejson (#6698)
    • Reduced import statement overhead (#6641, #6645, #6648)
    • Optimized directory utilities (#6646)
    • Resolved API logging issues (#6619)
    • Enhanced retry logic for non-transient errors (#6844)
    • Better handling of missing pods with termination filtering (#6697)
    • Fixed various cluster name querying issues (#6616, #6624)
    • Improved job import handling in sky init (#6623)

Testing and Infrastructure

  • Test Improvements:
    • Automatic retry for core tests (#6710)
    • Expanded nightly build coverage (#6695)
    • Staggered test execution (#6732)
    • Remote server test support (#6819, #6823)
    • New smoke tests for provisioning logs (#6790, #6818)
    • Fixed endpoint comparison in unittests (#6578)
    • Fixed TPU test example failure (#6673)
    • Added Kubernetes volume merging unit tests (#6813)
    • Added support for custom cloud config files in smoke tests (#6851)
    • Enhanced serve status checks after termination (#6713)
    • Include Nebius VMs tests in CI/CD (#6835)
  • CI/CD Enhancements:
    • PAT token for release publishing (#6612)
    • Helm lint in pre-commit hooks (#6798, #6803)
    • Fixed nightly build triggers (#6610)
  • Helm Charts & Deployment:
    • Fixed RBAC template rendering issues (#6737, #6742)
    • Resolved ingress configuration for oauth in grafana (#6799)
    • Enhanced values.yaml documentation (#6800)
    • Better patch strategy for merging Kubernetes configs (#6653)

Other feature improvements

  • SSH Node Pool: region-specific launches (#6767)
  • SkyServe: GPU-aware Load Balancing (#6147)
    SkyServe now intelligently scales based on GPU types. Set different QPS targets across heterogeneous GPUs:
    load_balancing_policy: instance_aware_least_load
    replica_policy:
      target_qps_per_replica:
        "H100:1": 2.5    # H100 can handle 2.5 QPS
        "A100:1": 1.25   # A100 can handle 1.25 QPS
        "A10:1": 0.5     # A10 can handle 0.5 QPS
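
To show why per-GPU targets matter, here is a minimal sketch of a least-load choice normalized by those targets. It is not SkyServe's implementation of `instance_aware_least_load`; `pick_replica`, the replica names, and the in-flight counts are hypothetical, and only the QPS targets come from the YAML above:

```python
from typing import Dict, Optional

# Per-accelerator QPS targets, copied from the YAML above.
TARGET_QPS = {'H100:1': 2.5, 'A100:1': 1.25, 'A10:1': 0.5}


def pick_replica(in_flight: Dict[str, int],
                 accelerator_of: Dict[str, str],
                 target_qps: Dict[str, float]) -> Optional[str]:
    """Pick the replica with the lowest load relative to its QPS target."""
    best_replica = None
    best_load = float('inf')
    for replica, requests in in_flight.items():
        # Normalize by the target so that a faster GPU absorbs more requests.
        load = requests / target_qps[accelerator_of[replica]]
        if load < best_load:
            best_replica, best_load = replica, load
    return best_replica


in_flight = {'replica-h100': 2, 'replica-a10': 1}
accelerator_of = {'replica-h100': 'H100:1', 'replica-a10': 'A10:1'}
# H100: 2 / 2.5 = 0.8, A10: 1 / 0.5 = 2.0, so the H100 replica is chosen
# even though it has more requests in flight.
chosen = pick_replica(in_flight, accelerator_of, TARGET_QPS)
```

A plain least-connections policy would have picked the A10 replica here; dividing by the target is what makes the choice GPU-aware.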

Backend

Examples and Documentation

Contributors

We thank all contributors who made this release possible!

New Contributors: @miltava, @tomzx, @webconn, @hongsu-moreh, @ibpark-moreh, @nathan-liner

All Contributors: @DanielZhangQD, @cblmemo, @kyuds, @SeungjinYang, @lloyd-brown, @romilbhardwaj, @rohansonecha, @zpoint, @aylei, @SalikovAlex, @Michaelvll, @kevinmingtarja, @miltava, @Maknee, @cg505, @tomzx, @concretevitamin, @webconn, @sethkimmel3, @andylizf, @hongsu-moreh, @ibpark-moreh, @lucamanolache, @nathan-liner

Special thanks to the community for bug reports, feature requests, and pull requests that helped improve SkyPilot!

Full Changelog

For a complete list of changes, see the commit history.

SkyPilot v0.10.1

11 Aug 05:50
0c04d75

SkyPilot v0.10.1: Improved production readiness, distributed training, cloud integrations, and more

SkyPilot v0.10.1 improves enterprise production readiness with large-scale distributed training capabilities (Llama 4 400B, OpenAI GPT-OSS), introduces/enhances integration with AMD GPUs and leading GPU clouds (CoreWeave, Nebius), and delivers enhanced reliability features for mission-critical AI workloads.

Get it now with:

uv pip install "skypilot>=0.10.1"

Highlights

Native Git Support

Use your (private) git repositories as your SkyPilot workdir:

# task.sky.yaml
workdir:
  url: https://github.com/my-org/my-repo.git
  ref: 1234ab # commit hash or branch name

Find the commit hash of your workdir in the Dashboard.

Added in #6294, #6257, #6268.

Distributed LLM Pretraining/Finetuning and Agentic training Examples

We released high-performance distributed training examples for large models with checkpointing support (#6525, #6551, #6242, #6273).

  • Llama 4 Maverick 400B model training on more than 16 H200 GPUs: example
  • OpenAI GPT-OSS model pretraining and finetuning: example

Agentic training example with VeRL is now also available (#6443).

Cloud and accelerator support

Native AMD GPU Support on Kubernetes

AMD ROCm is now supported on Kubernetes clusters with SkyPilot!

# task.sky.yaml
resources:
  infra: k8s/my-amd-cluster
  image_id: docker:rocm/pytorch-training:v25.6
  accelerators: MI300:4

CoreWeave Integration

SkyPilot now supports CoreWeave clusters with native InfiniBand and object store support (#6386, #6487, #6483).

Use your CoreWeave cluster with SkyPilot:

resources:
  infra: k8s/my-coreweave-cluster
  network: best # Enable InfiniBand

Nebius cloud improvement

B200 GPUs and spot instances are now supported on Nebius cloud (#6474, #6478, #6267).

# task.sky.yaml
resources:
  infra: nebius
  accelerators: B200:8
  use_spot: true

Additionally, SkyPilot now supports MOUNT_CACHED mode for Nebius cloud. (#6456)

Autostop/Autodown based on SSH sessions

In addition to waiting for running jobs, autostop/autodown can now also wait for active SSH sessions, or for neither, as configured in the SkyPilot YAML (#6361, #6485).

# task.sky.yaml
resources:
  autostop:
    wait_for: jobs_and_ssh

External Logging

You can now dump your SkyPilot cluster/job logs to external logging services like AWS CloudWatch and GCP Cloud Logging (#6331, #6405, #6411, #6369). Configure it in your ~/.sky/config.yaml:

logs:
  store: aws  # Or 'gcp', etc.
  aws:
    ...  # Service-specific options; see below.

What's New

UX/API

  • 2x Faster sky status: Intelligent caching reduces status command latency by 50% (#6166)
  • Optimized Imports: Lazy loading of dependencies for faster CLI startup (#6302)
  • Autostop/Autodown: Improve autostop logic during cluster refresh (#6388)
  • Fast network timeout detection: Faster timeout detection for remote API server checks (#6263)
  • Better hints for parallel launch (#6549)
  • Support sky api logout to log out from the API server (#6284, #6327, #6412)
  • Type Checking: Enhanced mypy support with py.typed marker (#6440)

UI

  • Interactive Tour: Guided tour for new users to learn dashboard features (#6565, #6301)
  • Mobile Support: Responsive design for narrow screens (#6280)
  • YAML Syntax Highlighting: Enhanced code readability with syntax highlighting (#6404)
  • User Management: Delete users directly from the dashboard interface (#6434)
  • Display fixes: Cluster history tracking issues (#6512), VSCode remote connection (#6555), workspace resources aggregation (#6537, #6283)

Storage/Volume

  • S3 Mounting on ARM-based Instances: Full support for S3 mounting on ARM64 instances (#6255)
  • Volume Management SDK: Added SDK for creating and managing persistent volumes across clouds (#6439, #6170, #6383)
  • New examples for volumes: JuiceFS, NFS, etc. (#6521)

API Server

API Server Deployment

  • Helm Chart on Artifact Hub: SkyPilot Helm chart now available on Artifact Hub (#6560, #6371)
  • More robust DB: Use Alembic for database schema migrations (#6579, #6556, #6196), thread safety improvements (#6422, #6279)
  • OAuth2 Proxy Integration: Built-in OAuth2 proxy support (#6476)
  • Redis Secret Support: Configuration for Redis URL via secrets (#6559)
  • Deployment image size reduction (#6312, #6285)
  • GPU Detection: Fixed GPU name matching with underscores (#6491)

Client Enhancements

  • Enhanced Client Reliability: Automatic retry on transient failures (#6259, #6298, #6234)
  • Robustness enhancements: lock timeout notifications (#6530), transient error retry (#6234), catalog refresh (#6272)
  • Service Account Token Auth Fix: Fixed token authentication in file uploads (#6432)
  • Issue fixes: file path escaping (#6262)

Managed Jobs

  • Force Disable Cloud Buckets: Configuration option to disable cloud bucket usage for managed jobs and SkyServe (#6402, #6407)
  • Multi-Task Log Viewing: See logs from all tasks in a managed job pipeline (#6415)

Cloud Integrations

  • GCP N4 Instance Types: Support for new N4 machine family (#6253)
  • GCP TPU Improvements: Enhanced TPU provisioning logic (#6420)
  • Hyperbolic: Marked CUSTOM_MULTI_NETWORK as unsupported (#6245)
  • RunPod: CPU instance launching support (#6450)
  • Vast: Port opening support (#6282)
  • Azure: Dependency resolution fixes (#6231)

Miscs

  • Developer Experience: Improved format.sh for code consistency (#6580)
  • Backward Compatibility: Fixed issues with Task field and version detection (#6548, #6564, #6485)
  • Test Stability: Fixed various flaky tests (#6507, #6495, #6511, #6469)
  • Release Pipeline: Fixed API version extraction and workflow issues (#6464, #6254)
  • Buildkite: Sequential test execution to reduce resource usage (#6217)

Production Deployment Guides

  • PostgreSQL with GCP Cloud SQL: Complete guide for deploying API server with Cloud SQL backend (#6325)
  • Docker Deployment: Containerized API server deployment guide (#6296)
  • AWS SSM as SSH proxy: Setup guide for AWS Systems Manager (#6522)

Contributors

We thank all contributors who made this release possible!

New Contributors: @LokmaSpeedy, @jacobergzhou, @amd-pratmish, @tedspare, @jimbz, @makhalin

All Contributors: @alex000kim, @amd-pratmish, @andylizf, @aylei, @bikramnehra, @cblmemo, @cg505, @clayrosenthal, @concretevitamin, @DanielZhangQD, @jacobergzhou, @jimbz, @kevinmingtarja, @kyuds, @lloyd-brown, @LokmaSpeedy, @lucamanolache, @makhalin, @Maknee, @Michaelvll, @rohansonecha, @romilbhardwaj, @SalikovAlex, @SeungjinYang, @tedspare, @zhenjiasun, @zpoint

Special thanks to the community for bug reports, feature requests, and pull requests that helped improve SkyPilot!

Full Changelog

For a complete list of changes, see the commit history.