
Conversation

@DanielZhangQD (Collaborator) commented Nov 14, 2025

  • Support mounting ephemeral volumes for clusters on k8s (a minimal lifecycle sketch follows this list)
    • When the cluster is launched, create ephemeral volumes and mount them to the cluster
    • When the cluster is torn down, delete the ephemeral volumes
  • Improve the efficiency of the volume refresh daemon
  • Add logs for resource topology override with volume constraint
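
A minimal, self-contained sketch of the lifecycle described above; names and signatures are simplified stand-ins (the actual provision_volume.delete_ephemeral_volumes in this PR takes the cluster name on cloud and the provider config):

from typing import List


def create_ephemeral_volumes(cluster: str, specs: List[str]) -> List[str]:
    # Created when the cluster is launched, then mounted into it.
    return [f'{cluster}-{spec}' for spec in specs]


def delete_ephemeral_volumes(cluster: str, volumes: List[str]) -> None:
    # Deleted when the cluster is torn down.
    for volume in volumes:
        print(f'deleting {volume}')


volumes = create_ephemeral_volumes('my-cluster', ['scratch', 'cache'])
delete_ephemeral_volumes('my-cluster', volumes)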

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@DanielZhangQD (Collaborator, Author):

/smoke-test --kubernetes -k test_volumes_on_kubernetes

@DanielZhangQD (Collaborator, Author):

/smoke-test --kubernetes -k test_volumes_on_kubernetes --remote-server

@DanielZhangQD (Collaborator, Author):

Test cases:

  • Functionality:
    • Normal volumes:
      • Apply new volumes
      • List volumes
      • Mount to cluster
      • Delete volume
    • Ephemeral volumes:
      • Sky launch with ephemeral volume
        • Incorrect type
        • Incorrect size
        • Incorrect storage class
        • Incorrect label
        • ReadWriteOnce + multi-nodes
        • Normal case with label + ReadWriteMany + storage class + multi-nodes
        • Normal case with size only + multiple ephemeral volumes + normal volume
      • Sky jobs launch with ephemeral volume + normal volume
      • List volumes: last use is set and status is in_use after creation; is_ephemeral is shown
      • Sky down with ephemeral volume cleaned up
      • Failover works with ephemeral volumes cleaned up
      • Remote API server with in-cluster config
    • Volume status refresh daemon works correctly
  • Compatibility - volume operations:
    • New client + New server
    • New client + Old server
    • Old client + New server

@DanielZhangQD (Collaborator, Author):

/quicktest-core --base-branch master
/smoke-test --kubernetes --no-resource-heavy
/smoke-test --managed-jobs --aws

@DanielZhangQD marked this pull request as ready for review November 17, 2025 12:09
Comment on lines +961 to +977
if ephemeral_volumes:
    for ephemeral_volume in ephemeral_volumes:
        # Update the volumes and volume mounts in the pod spec
        if 'volumes' not in pod_spec['spec']:
            pod_spec['spec']['volumes'] = []
        pod_spec['spec']['volumes'].append({
            'name': ephemeral_volume.name,
            'persistentVolumeClaim': {
                'claimName': ephemeral_volume.volume_name_on_cloud,
            },
        })
        if 'volumeMounts' not in pod_spec['spec']['containers'][0]:
            pod_spec['spec']['containers'][0]['volumeMounts'] = []
        pod_spec['spec']['containers'][0]['volumeMounts'].append({
            'name': ephemeral_volume.name,
            'mountPath': ephemeral_volume.path,
        })
Collaborator:

Can we use the same logic to mutate the pod spec as we do for normal volumes?

Collaborator (Author):

Consider the case of other clouds, e.g. RunPod, where the volume ID is required for mounting and is only available after creation: if we moved the k8s handling to the path before provisioning, we couldn't resolve the ephemeral volume config in one place for all clouds (it's currently all in provisioner/volume.py). So I prefer to keep the current implementation, WDYT?
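
A short sketch of that ordering constraint, with hypothetical names:

from typing import Dict


def create_volume(name: str) -> Dict[str, str]:
    # Stand-in for the cloud API call: on RunPod-like clouds the volume ID
    # is only assigned here, at creation time.
    return {'name': name, 'volume_id': f'vol-{name}'}


def build_mount(volume: Dict[str, str], path: str) -> Dict[str, str]:
    # Mounting needs the ID, so this can only run after create_volume(),
    # i.e. in the provisioning path rather than a pre-provisioning template.
    return {'volumeId': volume['volume_id'], 'mountPath': path}


print(build_mount(create_volume('scratch'), '/mnt/scratch'))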

Comment on lines 248 to 249
provision_volume.delete_ephemeral_volumes(cluster_name.name_on_cloud,
                                          provider_config)
Collaborator:

After an ephemeral volume is created, can the underlying PVC be used by a normal volume via use_existing?

Collaborator (Author):

Good catch! Created issue #8029 to trace this.
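
Purely for illustration, one possible shape of such a guard (all names here are assumptions; the actual fix is tracked in #8029):

from typing import Dict


def check_not_ephemeral(pvc_labels: Dict[str, str]) -> None:
    # Assume ephemeral-volume PVCs carry a marker label (label name assumed).
    if pvc_labels.get('skypilot-ephemeral-volume') == 'true':
        raise ValueError('PVC backs an ephemeral volume and will be deleted '
                         'with its cluster; refusing use_existing.')


check_not_ephemeral({'skypilot-ephemeral-volume': 'false'})  # passes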

Comment on lines +45 to +48
def _resolve_pvc_volume_config(cloud: clouds.Cloud,
                               config: provision_common.ProvisionConfig,
                               volume_config: Dict[str, Any]) -> Dict[str, Any]:
    provider_config = config.provider_config
Collaborator:

I'm confused about why we need this new handling for ephemeral volumes. It looks like a normal volume with the same lifecycle as the referencing cluster, so why not use the same volume handling logic as for normal volumes?

Collaborator (Author):

The difference is that for normal volumes, the volume config and volume info are already available during cluster launch, while that's not the case for ephemeral volumes: the infra info is only available after optimization, and some volume info, like the volume ID on RunPod, is only available after volume creation.

'volume_name': {
    'type': 'string',
},
'is_ephemeral': {
Collaborator:

nit: I'm thinking of naming it lifecycle or something similar, which could be an enum that's easy to extend, e.g. will we need a volume that has the same lifecycle as a Pool in the future?

Collaborator (Author):

This concept should also apply to Pools if we support them later; it would mean the lifecycle is bound to the Pool or not.
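
For illustration, a minimal sketch of the suggested enum (values are hypothetical; the merged change keeps the boolean is_ephemeral):

import enum


class VolumeLifecycle(enum.Enum):
    PERSISTENT = 'persistent'  # managed independently of any cluster
    CLUSTER = 'cluster'        # created at launch, deleted at teardown
    # POOL = 'pool'            # possible future value, bound to a Pool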

    cluster_name: str,
    cluster_name_on_cloud: str,
    config: common.ProvisionConfig,
    ephemeral_volumes: Optional[List['volume_utils.VolumeInfo']] = None,
Collaborator:

Shall we put ephemeral_volumes into ProvisionConfig?

Collaborator (Author):

Updated
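
A hypothetical sketch of the updated shape; only provider_config and the ephemeral_volumes field type come from this thread, the rest is assumed:

import dataclasses
from typing import Any, Dict, List, Optional


@dataclasses.dataclass
class VolumeInfo:
    # Fields mirror those used in the pod spec mutation above.
    name: str
    volume_name_on_cloud: str
    path: str


@dataclasses.dataclass
class ProvisionConfig:
    provider_config: Dict[str, Any]
    ephemeral_volumes: Optional[List[VolumeInfo]] = None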

@DanielZhangQD (Collaborator, Author):

/quicktest-core --base-branch master
/smoke-test --kubernetes --no-resource-heavy
/smoke-test --managed-jobs --aws

@DanielZhangQD (Collaborator, Author):

/smoke-test --kubernetes -k test_volumes_on_kubernetes --remote-server

@aylei (Collaborator) left a comment:

LGTM, thanks @DanielZhangQD !

@DanielZhangQD merged commit 293c676 into master Nov 21, 2025
31 of 36 checks passed
@DanielZhangQD deleted the 3727 branch November 21, 2025 02:44