Good day,
I have a multi-tenanted CICD system that uses multi-runner to handle runner management, standby/warm-up pools, etc. etc. (and it works very well, by the way 😄 ). I'm also using fairly up-to-date code (i.e. v5.15.2 of this project), along with up-to-date GHA actions/runner agent/tools.
I have a set of runners that use a shared/multi-team subnet in AWS. There are occasions where a tenant over-commits runners, and exhausts the IPv4 address space supplied by the subnet. For most operations, GHA queues-up/serializes jobs nicely. For example, if some runner-type has an upper limit of 30 instances, and 60 jobs are queued up, all of the jobs eventually run to successful completion. One semi-related "corner case" where this doesn't happen, however, is if a runner fails to start due to insufficient space in the subnet being used to launch the runner (i.e. "insufficient IP space available").
Is there some mechanism/feature-flag/etc. that exists (or that could reasonably be implemented) to provide a "retry this scale-up operation with a back-off timer to prevent API spam/overload" behavior? It would be preferable to have the job eventually get queued up, even if it means waiting for a progressively lengthier duration, versus having a job stuck in the "waiting for a runner" state for 18 hours (as a specific example). In this contrived case, it'd be desirable for the job to eventually be picked up (say, during low-usage periods overnight when there's ample capacity) instead of requiring user interaction (or a CI bot to auto-cancel "stuck" jobs).
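For clarity, the kind of back-off I have in mind is the usual capped exponential delay with jitter (the function name and default values below are just illustrative, not anything from this project's code):

```python
import random

def backoff_delay(attempt: int, base: float = 30.0,
                  cap: float = 3600.0, jitter: bool = True) -> float:
    """Seconds to wait before re-attempting a failed scale-up.

    Delay grows exponentially per attempt, capped at `cap`, with
    optional full jitter so many retries don't hammer the EC2 API
    in lock-step.
    """
    delay = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, delay] to spread retries out.
    return random.uniform(0, delay) if jitter else delay
```

So a first retry would come after ~30s, doubling each time until it levels off at one hour, which would let a job stuck on "insufficient IP space" eventually land when capacity frees up.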
Besides this quirk, this is an awesome project/tool that has been extremely helpful/useful. Cheers!