@SeungjinYang
Collaborator

@SeungjinYang SeungjinYang commented Aug 21, 2025

On transient retries: if we determine that a retry succeeded, we reset the retry count. This lets the function decorator handle any number of transient errors, as long as the function can recover from each one.

To determine whether a retry succeeded, we use the line_processed field as a proxy for progress. The idea is that if the function, on retry, was able to process more lines, it must have overcome whatever transient error was thrown before.

This PR has no effect on functions that don't interact with the line_processed field, preserving the current behavior where requests are retried max_retries times.
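
A minimal sketch of the idea described above, not the actual SDK code. The names `TransientError`, `RequestContext`, `retry_transient`, and `delay_seconds` are hypothetical and chosen for illustration; only `line_processed` and `max_retries` come from the description. The sketch assumes the decorated function receives an object whose `line_processed` counter it advances as it works.

```python
import functools
import time


class TransientError(Exception):
    """Hypothetical stand-in for whatever error type is treated as transient."""


class RequestContext:
    """Hypothetical holder for the progress counter the function updates."""

    def __init__(self):
        self.line_processed = 0


def retry_transient(max_retries=3, delay_seconds=0.0):
    """Retry on TransientError, resetting the failure count on progress."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(context, *args, **kwargs):
            failures = 0
            while True:
                # Remember how far we had gotten before this attempt.
                progress_before = context.line_processed
                try:
                    return func(context, *args, **kwargs)
                except TransientError:
                    if context.line_processed > progress_before:
                        # The attempt processed more lines before failing,
                        # so the earlier error is considered overcome.
                        failures = 0
                    failures += 1
                    if failures > max_retries:
                        raise
                    time.sleep(delay_seconds)

        return wrapper

    return decorator
```

In this sketch, a function that never touches `line_processed` never triggers the reset, so it is still retried at most `max_retries` times before the error propagates.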

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@SeungjinYang SeungjinYang requested a review from aylei August 21, 2025 18:08
@SeungjinYang SeungjinYang changed the title reset traisient failure count if function made progress on latest retry reset transient failure count if function made progress on latest retry Aug 21, 2025
@SeungjinYang SeungjinYang marked this pull request as ready for review August 21, 2025 18:24
@SeungjinYang SeungjinYang changed the title reset transient failure count if function made progress on latest retry [SDK] reset transient failure count if function made progress on latest retry Aug 21, 2025
Comment on lines 133 to 140
Collaborator Author

This change isn't strictly necessary for this PR, but I think it's a nice improvement: we retry immediately on the first transient error, which makes the code more responsive.
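
One way to picture the "retry immediately on the first transient error" behavior is a delay schedule whose first entry is zero. This is only a sketch under assumptions: the function name `retry_delay`, its parameters, and the exponential backoff used for later retries are hypothetical and are not taken from the diff on lines 133 to 140.

```python
def retry_delay(consecutive_failures, base_delay_seconds=1.0,
                max_delay_seconds=30.0):
    """Return how long to wait before the next retry (hypothetical helper).

    consecutive_failures counts transient errors since the last success
    or reset, starting at 1 for the first error.
    """
    if consecutive_failures <= 1:
        # First transient error: retry right away for responsiveness.
        return 0.0
    # Later errors: assumed exponential backoff, capped at the maximum.
    backoff = base_delay_seconds * 2 ** (consecutive_failures - 2)
    return min(backoff, max_delay_seconds)
```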

@SeungjinYang
Collaborator Author

/quicktest-core

@SeungjinYang
Collaborator Author

/quicktest-core -k job

@SeungjinYang
Collaborator Author

The test_managed_jobs backward compatibility test is failing; unsure why.

Collaborator

@aylei aylei left a comment

LGTM, thanks @SeungjinYang !

@SeungjinYang
Collaborator Author

The test_managed_jobs backward compatibility test is also failing on master and seems unrelated to this PR. Merging.

@SeungjinYang SeungjinYang merged commit b4bf58c into master Aug 22, 2025
16 of 17 checks passed
@SeungjinYang SeungjinYang deleted the reward-progress branch August 22, 2025 01:17
4 participants