Collaborator

@aylei aylei commented Aug 21, 2025

Test:

Launch with a network issue; the client recovers by retrying:

$ SKYPILOT_DEBUG=1 sky launch --infra aws -c bastion --cpus 4+ -y
...
Running on cluster: bastion
D 08-21 23:47:01 rest.py:119] Retry stream_and_get due to ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read)), attempt 1/5
I 08-21 15:49:35 optimizer.py:918] Considered resources (1 node):
I 08-21 15:49:35 optimizer.py:986] -----------------------------------------------------------------------------
I 08-21 15:49:35 optimizer.py:986]  INFRA             INSTANCE     vCPUs   Mem(GB)   GPUS   COST ($)   CHOSEN
I 08-21 15:49:35 optimizer.py:986] -----------------------------------------------------------------------------
I 08-21 15:49:35 optimizer.py:986]  AWS (us-east-1)   m6i.xlarge   4       16        -      0.19          ✔
I 08-21 15:49:35 optimizer.py:986] -----------------------------------------------------------------------------
....
  • Normal launch
  • sky status
  • sky api logs xxx (will not retry)
sky.exceptions.ClientError: Failed to stream logs: Request 'xxx' not found 
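
The behavior tested above (a broken stream is retried up to 5 attempts, while a "not found" client error is raised immediately) can be sketched as follows. This is a minimal illustration, not SkyPilot's actual `rest.retry_transient_errors` implementation: `TransientError`, `ClientError`, `retry_transient`, and the two demo functions are hypothetical stand-ins, and the backoff policy is an assumption.

```python
import functools
import time


class TransientError(Exception):
    """Hypothetical stand-in for a dropped connection or broken chunked stream."""


class ClientError(Exception):
    """Hypothetical stand-in for a non-retryable error such as 'request not found'."""


def retry_transient(max_attempts=5, base_delay=0.0):
    """Retry the wrapped call on TransientError only, up to max_attempts tries."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except TransientError:
                    if attempt == max_attempts:
                        raise  # Out of attempts: surface the last error.
                    # Assumed policy: exponential backoff between attempts.
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


calls = {'stream': 0, 'logs': 0}


@retry_transient(max_attempts=5)
def stream_and_get():
    """Fails twice with a transient error, then succeeds."""
    calls['stream'] += 1
    if calls['stream'] < 3:
        raise TransientError('Connection broken')
    return 'done'


@retry_transient(max_attempts=5)
def api_logs():
    """Always fails with a client error, which is never retried."""
    calls['logs'] += 1
    raise ClientError("Request 'xxx' not found")
```

In this sketch `stream_and_get()` succeeds on its third call, while `api_logs()` raises after a single call because `ClientError` is not in the retried set.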

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

Signed-off-by: Aylei <rayingecho@gmail.com>
Collaborator Author

aylei commented Aug 21, 2025

/quicktest-core

@aylei aylei enabled auto-merge (squash) August 21, 2025 16:05
@kevinmingtarja
Collaborator

/smoke-test -k test_long_setup_run_script

Collaborator

@kevinmingtarja kevinmingtarja left a comment


LGTM, thanks @aylei!

As a follow-up, what do you think of adding a test similar to https://github.com/skypilot-org/skypilot/blob/2824cc15878b2eb3d43b04b0d0a24f1fbb4577b0/tests/load_tests/test_ssh_proxy.py that stress-tests these streaming API calls under different network conditions or with simulated network failures?
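
One possible shape for such a test, as a rough sketch only: it simulates failures in-process with a fake stream rather than real network conditions, and `call_with_retry`, `make_flaky_stream`, and `run_scenarios` are hypothetical names that do not exist in SkyPilot or in the linked load test.

```python
def call_with_retry(fn, max_attempts=5):
    """Hypothetical helper: call fn, retrying on ConnectionError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # All attempts failed: propagate the error.


def make_flaky_stream(failures, lines):
    """Return a callable that drops the connection `failures` times, then succeeds."""
    state = {'remaining': failures, 'calls': 0}

    def stream():
        state['calls'] += 1
        if state['remaining'] > 0:
            state['remaining'] -= 1
            raise ConnectionError('simulated broken chunked stream')
        return list(lines)

    return stream, state


def run_scenarios(max_attempts=5):
    """Run the retry helper against several simulated failure counts."""
    results = {}
    for failures in (0, 1, 4, 5):
        stream, state = make_flaky_stream(failures, ['line 1', 'line 2'])
        try:
            call_with_retry(stream, max_attempts)
            results[failures] = ('recovered', state['calls'])
        except ConnectionError:
            results[failures] = ('gave up', state['calls'])
    return results
```

With 5 attempts, the scenarios with up to 4 simulated failures recover, and the one with 5 failures gives up after the fifth call. A real stress test would presumably inject faults at the network layer instead.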

@aylei aylei merged commit 62954dd into master Aug 21, 2025
18 checks passed
@aylei aylei deleted the retry-on-stream-error branch August 21, 2025 16:58
@usage_lib.entrypoint
@server_common.check_server_healthy_or_start
@annotations.client_api
@rest.retry_transient_errors()
Collaborator


Importing functions from the server module in the client is a bit odd. We can refactor this in the future.


4 participants