Good day,
I have a multi-tenanted CICD system that uses multi-runner to handle runner management, standby/warm-up pools, etc. etc. (and it works very well, by the way 😄 ). I'm also using fairly up-to-date code (i.e. v5.15.2 of this project), along with up-to-date GHA actions/runner agent/tools.
I have a set of runners that use a shared/multi-team subnet in AWS. There are occasions where a tenant over-commits runners, and exhausts the IPv4 address space supplied by the subnet. For most operations, GHA queues-up/serializes jobs nicely. For example, if some runner-type has an upper limit of 30 instances, and 60 jobs are queued up, all of the jobs eventually run to successful completion. One semi-related "corner case" where this doesn't happen, however, is if a runner fails to start due to insufficient space in the subnet being used to launch the runner (i.e. "insufficient IP space available").
Is there some mechanism/feature-flag/etc. that exists (or that could reasonably be implemented) to provide a "retry this scale-up operation with a back-off timer to prevent API spam/overload" behavior? It would be preferable to have the job eventually get queued up, even if it means waiting for a progressively lengthier duration, versus having a job stuck in the "waiting for a runner" state for 18 hours (as a specific example). In this contrived case, it'd be desirable for the job to eventually be picked up (say, during low-usage periods overnight when there's ample capacity) instead of requiring user interaction (or a CI bot to auto-cancel "stuck" jobs).
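For clarity, the kind of back-off I have in mind is the usual capped exponential delay with jitter (the function name and default values below are just illustrative, not anything from this project's code):

```python
import random

def backoff_delay(attempt: int, base: float = 30.0,
                  cap: float = 3600.0, jitter: bool = True) -> float:
    """Seconds to wait before re-attempting a failed scale-up.

    Delay grows exponentially per attempt, capped at `cap`, with
    optional full jitter so many retries don't hammer the EC2 API
    in lock-step.
    """
    delay = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, delay] to spread retries out.
    return random.uniform(0, delay) if jitter else delay
```

So a first retry would come after ~30s, doubling each time until it levels off at one hour, which would let a job stuck on "insufficient IP space" eventually land when capacity frees up.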
Besides this quirk, this is an awesome project/tool that has been extremely helpful/useful. Cheers!