@kevinmingtarja kevinmingtarja commented Aug 28, 2025

  • Some tests were failing because GCP is not enabled on the CoreWeave API server, while include_gcs_mount is True by default -> We either have to ask the remote API server which clouds are enabled, or assume that, for example, only AWS is enabled
  • I think is_remote_server_test() needs to be modified to also check api_server_endpoint_configured_in_env_file()?
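
A minimal sketch of the two ideas above, not the actual helpers in the test suite. The function names ending in `_sketch`, their parameters, and the `include_s3_mount` flag are hypothetical; only `include_gcs_mount`, `is_remote_server_test()` and `api_server_endpoint_configured_in_env_file()` are named in this PR, and their real signatures may differ. How the list of enabled clouds is obtained from the remote API server is left out on purpose.

```python
from typing import Dict, Iterable, Optional


def is_remote_server_test_sketch(cli_endpoint: Optional[str],
                                 env_file_has_endpoint: bool) -> bool:
    """Treat the run as remote if either source configures an endpoint.

    `cli_endpoint` stands in for whatever is_remote_server_test() checks
    today; `env_file_has_endpoint` stands in for the result of
    api_server_endpoint_configured_in_env_file(). Both are assumptions.
    """
    return bool(cli_endpoint) or env_file_has_endpoint


def storage_mount_flags_sketch(enabled_clouds: Iterable[str]) -> Dict[str, bool]:
    """Derive mount flags from the clouds enabled on the API server.

    Instead of defaulting include_gcs_mount to True, only include a
    mount when its cloud is actually enabled.
    """
    enabled = {cloud.lower() for cloud in enabled_clouds}
    return {
        'include_s3_mount': 'aws' in enabled,   # hypothetical flag name
        'include_gcs_mount': 'gcp' in enabled,  # flag named in this PR
    }
```

With only AWS enabled, `storage_mount_flags_sketch(['AWS'])` turns the GCS mount off, which is the behavior the first bullet falls back to when the remote server cannot be queried.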

Remaining issues:

  1. test_using_file_mounts_with_env_vars and test_kubernetes_storage_mounts are also failing because they require GCP access
  2. sky logs --sync-down takes very long (see Slack thread) -> still looking into this
  3. HA jobs controller deployment times out because we rsync a lot of small files during init and CoreWeave's PVCs are backed by NFS: https://assemble-platforms.slack.com/archives/C07PVLJSXD0/p1756419707568979?thread_ts=1756346988.980369&cid=C07PVLJSXD0
  4. Weird issue saying the kubeconfig file does not exist or is invalid:
__________________________________________________________________ test_jobs_launch_and_logs ___________________________________________________________________
--
  | 2025-08-29 01:02:03 UTC | [gw0] linux -- Python 3.10.18 /home/buildkite/miniconda3/bin/python3
  | 2025-08-29 01:02:03 UTC | tests/smoke_tests/test_basic.py:568: in test_jobs_launch_and_logs
  | 2025-08-29 01:02:03 UTC | job_id, handle = sky.stream_and_get(sky.jobs.launch(task,
  | 2025-08-29 01:02:03 UTC | sky/utils/common_utils.py:616: in _record
  | 2025-08-29 01:02:03 UTC | return f(*args, **kwargs)
  | 2025-08-29 01:02:03 UTC | sky/server/common.py:774: in wrapper
  | 2025-08-29 01:02:03 UTC | return func(*args, **kwargs)
  | 2025-08-29 01:02:03 UTC | sky/utils/annotations.py:27: in wrapper
  | 2025-08-29 01:02:03 UTC | return func(*args, **kwargs)
  | 2025-08-29 01:02:03 UTC | sky/server/rest.py:129: in wrapper
  | 2025-08-29 01:02:03 UTC | return func(*args, **kwargs)
  | 2025-08-29 01:02:03 UTC | sky/client/sdk.py:2071: in stream_and_get
  | 2025-08-29 01:02:03 UTC | return stream_response(request_id,
  | 2025-08-29 01:02:03 UTC | sky/client/sdk.py:154: in stream_response
  | 2025-08-29 01:02:03 UTC | return get(request_id)
  | 2025-08-29 01:02:03 UTC | sky/utils/common_utils.py:616: in _record
  | 2025-08-29 01:02:03 UTC | return f(*args, **kwargs)
  | 2025-08-29 01:02:03 UTC | sky/utils/annotations.py:27: in wrapper
  | 2025-08-29 01:02:03 UTC | return func(*args, **kwargs)
  | 2025-08-29 01:02:03 UTC | sky/client/sdk.py:1979: in get
  | 2025-08-29 01:02:03 UTC | raise error_obj
  | 2025-08-29 01:02:03 UTC | E   ValueError: Failed to load Kubernetes configuration for '(current-context)'. Please check if your kubeconfig file exists at ~/.kube/config and is valid.
  | 2025-08-29 01:02:03 UTC | E   [kubernetes.config.config_exception.ConfigException] Invalid kube-config file. No configuration found.
  | 2025-08-29 01:02:03 UTC | E
  | 2025-08-29 01:02:03 UTC | E   Hint: Kubernetes attempted to query the current-context set in kubeconfig. Check if the current-context is valid.
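
One way to narrow down the error above is to check which kubeconfig paths would be consulted and whether they exist on the machine where the request runs. The sketch below is a diagnostic aid under stated assumptions, not SkyPilot code: the function names are hypothetical, it relies only on the standard `KUBECONFIG` environment variable and the default `~/.kube/config` location mentioned in the error, and it checks existence only, not whether the file is valid or has a usable current-context.

```python
import os
from pathlib import Path
from typing import List, Mapping


def kubeconfig_candidates(env: Mapping[str, str]) -> List[Path]:
    """Return the kubeconfig paths implied by the given environment.

    KUBECONFIG may list several files separated by os.pathsep; if it is
    unset or empty, fall back to the default ~/.kube/config.
    """
    raw = env.get('KUBECONFIG', '')
    paths = [p for p in raw.split(os.pathsep) if p]
    if not paths:
        paths = ['~/.kube/config']
    return [Path(p).expanduser() for p in paths]


def missing_kubeconfigs(env: Mapping[str, str]) -> List[Path]:
    """Return the candidate paths that do not exist as regular files."""
    return [p for p in kubeconfig_candidates(env) if not p.is_file()]


if __name__ == '__main__':
    for path in missing_kubeconfigs(os.environ):
        print(f'kubeconfig not found: {path}')
```

Running this both on the test client and inside the API server would show which side is missing the file, since the traceback is raised through `sky/client/sdk.py` from an error object returned by the server.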

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@kevinmingtarja kevinmingtarja changed the title from "only enable aws for remote api server in test_docker_storage_mounts" to "more smoke test fixes" Aug 29, 2025
@zpoint zpoint merged commit 5ff53fa into zpoint:dev/zeping/fix_hippocratic_failures Aug 29, 2025
@zpoint zpoint deleted the dev/zeping/fix_hippocratic_failures branch August 29, 2025 02:21
zpoint added a commit that referenced this pull request Aug 30, 2025
* fix multi_tenant

* skip some tests

* skip some storage tests

* more smoke test fixes (#82)

* only enable aws for remote api server in test_docker_storage_mounts

* fix is_remote_server_test()

* fetch accelerator dynamically from k8s cluster

---------

Co-authored-by: Kevin Mingtarja <69668484+kevinmingtarja@users.noreply.github.com>