
Sometimes, a job needs to wait 30/60 minutes before getting a runner #3953


Open
4 tasks done
julien-michaud opened this issue Feb 28, 2025 · 32 comments
Labels
bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode)

Comments

@julien-michaud

julien-michaud commented Feb 28, 2025

Checks

Controller Version

0.10.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Start workflows
2. The first two jobs will get a runner very quickly
3. The third one will sometimes stay pending for 30/40 minutes before getting a runner

Describe the bug

Let's say that I have a workflow with 3 jobs running in parallel.

Sometimes, jobs 1 and 2 will get a runner right away, but the third one has to wait 30 minutes to an hour before getting a runner.

Describe the expected behavior

All the jobs should start right away.

Note that I have two runner scale sets with the same runnerScaleSetName. I don't know whether that's bad practice, but it's working fine 🤷‍♂

I did that to ease the upgrade process: when a new chart is available, I update the gha-runner-scale-sets one by one to avoid service interruptions.
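
For illustration, a minimal sketch of that setup (assuming the published gha-runner-scale-set OCI chart; the release names, namespace, and runnerScaleSetName value are placeholders, and githubConfigSecret is omitted for brevity):

# Two Helm releases that register under the same runner scale set name,
# so one can be upgraded while the other keeps picking up jobs.
helm upgrade --install arc-runners-blue \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --namespace arc-runners \
  --set githubConfigUrl=https://github.com/company \
  --set runnerScaleSetName=company-runners

helm upgrade --install arc-runners-green \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --namespace arc-runners \
  --set githubConfigUrl=https://github.com/company \
  --set runnerScaleSetName=company-runners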

Thanks

Additional Context

gha-runner-scale-set-controller:
  enabled: true
  flags:
    logLevel: "warn"
  podLabels:
    finops.company.net/cloud_provider: gcp
    finops.company.net/cost_center: compute
    finops.company.net/product: tools
    finops.company.net/service: actions-runner-controller
    finops.company.net/region: europe-west1
  replicaCount: 3
  podAnnotations:
    ad.datadoghq.com/manager.checks: |
      {
        "openmetrics": {
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:8080/metrics",
              "histogram_buckets_as_distributions": true,
              "namespace": "actions-runner-system",
              "metrics": [".*"]
            }
          ]
        }
      }
  metrics:
    controllerManagerAddr: ":8080"
    listenerAddr: ":8080"
    listenerEndpoint: "/metrics"

gha-runner-scale-set:
  enabled: true
  githubConfigUrl: https://github.com/company
  githubConfigSecret:
    github_token: <path:secret/github_token/actions_runner_controller#token>

  maxRunners: 100
  minRunners: 1

  containerMode:
    type: "dind"  ## type can be set to dind or kubernetes

  listenerTemplate:
    metadata:
      labels:
        finops.company.net/cloud_provider: gcp
        finops.company.net/cost_center: compute
        finops.company.net/product: tools
        finops.company.net/service: actions-runner-controller
        finops.company.net/region: europe-west1
      annotations:
        ad.datadoghq.com/listener.checks: |
          {
            "openmetrics": {
              "instances": [
                {
                  "openmetrics_endpoint": "http://%%host%%:8080/metrics",
                  "histogram_buckets_as_distributions": true,
                  "namespace": "actions-runner-system",
                  "max_returned_metrics": 6000,
                  "metrics": [".*"],
                  "exclude_metrics": [
                    "gha_job_startup_duration_seconds",
                    "gha_job_execution_duration_seconds"
                  ],
                  "exclude_labels": [
                    "enterprise",
                    "event_name",
                    "job_name",
                    "job_result",
                    "job_workflow_ref",
                    "organization",
                    "repository",
                    "runner_name"
                  ]
                }
              ]
            }
          }
    spec:
      containers:
      - name: listener
        securityContext:
          runAsUser: 1000
  template:
    metadata:
      labels:
        finops.company.net/cloud_provider: gcp
        finops.company.net/cost_center: compute
        finops.company.net/product: tools
        finops.company.net/service: actions-runner-controller
        finops.company.net/region: europe-west1
    spec:
      restartPolicy: OnFailure
      imagePullSecrets:
        - name: company-prod-registry
      containers:
        - name: runner
          image: eu.gcr.io/company-production/devex/gha-runners:v1.0.0-snapshot5
          command: ["/home/runner/run.sh"]

  controllerServiceAccount:
    namespace: actions-runner-system
    name: actions-runner-controller-gha-rs-controller

Controller Logs

https://gist.github.com/julien-michaud/dce55b9320fb236b622cbb00919277ce

Runner Pod Logs

/
@julien-michaud julien-michaud added the bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode), and needs triage (Requires review from the maintainers) labels on Feb 28, 2025
@julien-michaud julien-michaud changed the title from "Sometimes, no runners are spawned" to "Sometimes, a job needs to wait 30/60 minutes before getting a runner" on Feb 28, 2025
@avadhanij

We are seeing the same issue and we have a similar setup. We are unsure whether the two runner scale sets (for upgrade ease) are actually causing problems.

@emmahsax

emmahsax commented Apr 4, 2025

We are on 0.8.2 and seem to be encountering a similar issue. We recently upgraded Karpenter to 1.3.3, and that's when we began seeing this issue, but it may have existed before that.

@marcusisnard

I’m observing similar behavior, even when not running in a high availability setup (single cluster on Azure). Unfortunately, the logs offer no insight, and the latency is unpredictable.

@marcusisnard

Our organization is experiencing job queue delays exceeding 12 hours, severely impacting production workloads. No error logs are observed on our side. What steps can we take to troubleshoot this issue? @nikola-jokic

@nikola-jokic
Collaborator

Hey everyone,

Could you please submit these logs without obfuscation through support? We cannot investigate this without knowing which workflow runs are stuck. If you have failed runners, they count toward the number of runners, so that we can avoid creating an indefinite number of runners if something goes wrong with the cluster.
But if the delay is caused on the back-end side, please submit the stuck workflow run and the unobfuscated log so we can troubleshoot it. Thanks!

@nikola-jokic
Collaborator

Hey everyone,

We found the root cause of the issue, and it should be fixed now. Please let us know if you are still experiencing this issue. I will leave this issue open for now for visibility. Thank you all for reporting it!

@marcusisnard

Hey everyone,

We found the root cause of the issue, and it should be fixed now. Please let us know if you are still experiencing this issue. I will leave this issue open for now for visibility. Thank you all for reporting it!

Do we need to uninstall and re-deploy ARC?

@nikola-jokic
Collaborator

No, the issue was on the back-end side, so it should start working properly without touching the ARC installation.

@marcusisnard


We are still seeing this issue; lots of jobs are still pending, and we do not have a cap on the maximum number of runners. Please let me know how I can send the appropriate logs and the Helm chart values used for our deployment.

@nikola-jokic
Collaborator

Do you have failed ephemeral runners? If you don't have failed ones, please send the listener log, the controller log and workflow URLs of the pending jobs. You can submit them in the support issue if you don't want to share them publicly.
If you do have failed runners, please remove all failed ephemeral runner instances, which would free up the slots to scale up.
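
For anyone looking for the concrete commands, a rough sketch of that cleanup (assuming the EphemeralRunner resources report a Failed value in .status.phase; field names may vary by chart version, and the arc-runners namespace is a placeholder):

# List ephemeral runners across all namespaces
kubectl get ephemeralrunners -A

# Delete only the ones stuck in the Failed phase
kubectl get ephemeralrunners -n arc-runners \
  -o jsonpath='{range .items[?(@.status.phase=="Failed")]}{.metadata.name}{"\n"}{end}' \
  | xargs -r kubectl delete ephemeralrunner -n arc-runners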

@niodice

niodice commented Apr 22, 2025

@marcusisnard unfortunately I can't offer any help, but I wanted to ask how you view that particular UI. It looks like a GitHub view showing the scale sets directly in the UI. I have no such view, but it would be nice to see it.

@nikola-jokic
Collaborator

Hey @niodice, the UI doesn't show that. The failed runners are scoped to the cluster, and they are a mechanism we use to guard against a bad state. So you would have to inspect these runners inside the cluster, not in the UI (if I understood you correctly).

@niodice

niodice commented Apr 30, 2025

@nikola-jokic we just saw this again. Where can I open a support case to share logs in a private venue?

@nikola-jokic
Collaborator

Hey @niodice,

Sorry for the slow response. Can you please reach out to our support? Please share the workflow run, the controller log, and the listener log.

@niodice

niodice commented May 7, 2025

@hlascelles

We are still seeing this. Self-hosted k8s ARC. Runner version 2.323.0.

Everything runs fine until we see the number of "Failed ephemeral runners" in the AutoscalingRunnerSet climb to match the total number of allowed runners. Then everything stops. No runners show up in the web console. All listeners stop at "listener-app.listener","message":"Getting next message","lastMessageID":0


We are automating a cron that checks the number of "Failed ephemeral runners", and if it goes up, we delete and re-create the AutoscalingRunnerSet. The scale set then appears to re-register, runners appear in the web console, and builds all start again.
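
Roughly what such a check can look like (a sketch only: it assumes the AutoscalingRunnerSet status exposes a failedEphemeralRunners count, which may differ by chart version, and that a GitOps tool such as Flux re-creates the resource after deletion; the names are placeholders):

#!/bin/sh
# Count failed ephemeral runners reported by the AutoscalingRunnerSet status
failed=$(kubectl get autoscalingrunnerset company-runners -n arc-runners \
  -o jsonpath='{.status.failedEphemeralRunners}')

# If any runner is stuck in Failed, delete the scale set and let GitOps re-create it
if [ "${failed:-0}" -gt 0 ]; then
  kubectl delete autoscalingrunnerset company-runners -n arc-runners
fi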

@nikola-jokic
Collaborator

Just a small hint @hlascelles: you can delete these ephemeral runner resources, and the ephemeral runner set will re-create them with a fresh config. That might be easier to manage than updating the autoscaling runner set.

@uladzislauhramovich

@nikola-jokic how does one do that? Just reinstall the chart? Because there are no failed pods or related errors visible.

@forgondolin

That's also happening in our cluster; periodically I have to delete the ephemeralrunnerset, and it's not scaling as it should.
@hlascelles is your cron restarting only the runner set, or also the controller?

@hlascelles

@nikola-jokic Same question as @uladzislauhramovich ... How do you delete the ephemeral runners? Are they k8s resources? There is nothing in the GitHub web console, no runners there.

@forgondolin We are deleting the autoscalingrunnersets (which are then automatically re-created by Flux).

@forgondolin

Don't know if it helps, but:
kubectl delete ephemeralrunnersets -n <namespace> --all (or the runnerscaleset resource, for checking the runner admin side)

@OneideLuizSchneider

For us, I had to reinstall it to make it work; I've had to do this a couple of times lately!

@Tombar

Tombar commented May 15, 2025

We are running into the exact same issue as described here.

One thing we notice is that, while the GitHub UI reports the correct number of active runners, the ephemeralrunnerset k8s resource doesn't.

We are running the latest version of the chart, CRDs, and Docker agent.

@forgondolin

@hlascelles's suggestion currently works for us. It's not ideal, but the cron job made things better; I'm still experimenting with different intervals.

@OneideLuizSchneider

Since I upgraded it to 0.11.0, I'm not facing this issue anymore!

@forgondolin

@OneideLuizSchneider I've tried 0.11.0, but it keeps spinning up and killing the autoscaler like crazy in a loop. Any tips on that?
ty!

@OneideLuizSchneider

OneideLuizSchneider commented May 16, 2025

@OneideLuizSchneider I've tried 0.11.0, but it keeps spinning up and killing the autoscaler like crazy in a loop. Any tips on that? ty!

@forgondolin
Well, I did upgrade the k8s autoscaler as well. Another point: we have many runners (dev, staging, prod, etc.), and I'm not sure if that has something to do with it. Maybe create some runners for testing, like testing-build, and see if it still happens; it seems that if you use one a lot, it starts to happen (again, not sure).

@forgondolin

@OneideLuizSchneider gonna give that a try in the next sprint. Thanks a lot

@uladzislauhramovich

Well, I did upgrade the k8s autoscaler as well. Another point: we have many runners (dev, staging, prod, etc.), and I'm not sure if that has something to do with it. Maybe create some runners for testing, like testing-build, and see if it still happens; it seems that if you use one a lot, it starts to happen (again, not sure).

Same, it only happens with the most used label of our runners. The others are fine.

@mmack

mmack commented May 27, 2025

We face the same problem. Version: 0.11.0

@uladzislauhramovich

Ok, I'm not sure how I missed that initially, but you can find your failed runners by running kubectl get ephemeralrunners -A and then delete them. Not sure yet if this helps with scheduling times the same way a redeployment does.

@hlascelles

@patrickvinograd

patrickvinograd commented May 29, 2025

So it's not inherently a bug that there are ephemeralrunner resources in a failed state; I see the value of persisting those for diagnostic purposes, since the pod goes away and having a handle to a persistent resource to track down an error is useful.

And, as several people have landed on in this and other issues, one could write a cron job to periodically clean up failed ephemeralrunners. But why are we having to do that instead of the system self-healing?

It seems like there is potentially a bug in the autoscaling logic, either in the listener or in the ephemeralrunnerset controller, i.e. it is not discounting failed ephemeralrunners when it makes its scaling decision. This morning I had an instance of a job waiting for a runner, and the listener was logging Calculated target runner count ... "decision": 1 ... "currentRunnerCount": 1 ... but that one existing runner was Failed, i.e. never going to execute a job.

I haven't dug into the listener code to confirm, but it seems like the listener should not count failed runners. There's a tradeoff: in the long term, failed runners might pile up, and might even do so quickly if there's a non-sporadic failure mode. So maybe a configurable "max failed runners" threshold would be in order?
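
If you want to check whether this is what's happening in your cluster, a quick diagnostic sketch (again assuming the .status.phase field on EphemeralRunner; compare the counts with the currentRunnerCount the listener logs):

# Count ephemeral runners by phase in the runner namespace (placeholder name)
kubectl get ephemeralrunners -n arc-runners \
  -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}' | sort | uniq -c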
