
Self-hosted workers fail immediately, get marked "offline" in the runners list. #388

Open
mikolajpabiszczak opened this issue Jul 25, 2022 · 9 comments
Assignees
Labels
documentation Markdown files external-request You asked, we did

Comments

@mikolajpabiszczak

mikolajpabiszczak commented Jul 25, 2022

I am frankly not sure whether this is an issue on the CML side, but let me describe it.

CML versions tested: 0.11.0 and 0.17.0
Cloud provider: AWS

Remark: the very same workflow worked when I last used it (3 months ago)

  1. Deploy self-hosted runner:

        [...]
        cml runner \
            --cloud=aws \
            --cloud-region=eu-west-1 \
            --cloud-type=g3s.xlarge \
            --cloud-spot \
            --single \
            --cloud-startup-script=$(echo 'echo "$(curl https://github.com/${{ github.actor }}.keys)" >> /home/ubuntu/.ssh/authorized_keys' | base64 -w 0) \
            --labels=debug
        [...]
    

    this deployment job finishes successfully, but by the time it finishes, the instance (as checked in the AWS console) has not yet performed its status checks / is still in the Initialisation stage (this was not the case when the workflow last worked, 3 months ago).

  2. The next job (which runs on the self-hosted runner) gets closed almost immediately (in 4 s):
    The runner has received a shutdown signal
    although the instance itself does not get cancelled: it goes through the AWS status checks and remains running (to clarify: the instance was deployed as single).
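
The startup script passed to `--cloud-startup-script` above is base64-encoded inline; a quick way to sanity-check that the encoding round-trips (a sketch, using GNU `base64` with `-w 0` as in the workflow, and a placeholder `ACTOR` in place of the `${{ github.actor }}` expression):

```shell
# Sanity-check the --cloud-startup-script encoding: encode, then decode and
# compare. ACTOR stands in for the ${{ github.actor }} expression.
# Note: -w 0 (disable line wrapping) is a GNU coreutils option.
script='echo "$(curl https://github.com/ACTOR.keys)" >> /home/ubuntu/.ssh/authorized_keys'
encoded=$(printf '%s\n' "$script" | base64 -w 0)
printf '%s' "$encoded" | base64 -d
```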

One more thing: if I deploy the worker as reusable, it will be marked as offline in the list of workers after the job fails and will not be accessible…

I deployed the reusable instance and got logs after failure:

ubuntu@ip-172-31-32-70:~$ journalctl -u cml.service -f
-- Logs begin at Thu 2022-07-21 01:23:30 UTC. --
Jul 22 11:36:49 ip-172-31-32-70 cml.sh[2440]: {"level":"info","message":"Outputs: 0"}
Jul 22 11:36:49 ip-172-31-32-70 cml.sh[2440]: {"level":"info","message":"Connected to acpid service."}
Jul 22 11:37:18 ip-172-31-32-70 cml.sh[2440]: {"date":"2022-07-22T11:37:18.362Z","level":"info","message":"runner status","repo":"https://github.com/xxxx/yyyy","status":"ready"}
Jul 22 11:37:30 ip-172-31-32-70 cml.sh[2440]: {"date":"Fri Jul 22 2022 11:37:30 GMT+0000 (Coordinated Universal Time)","error":{"name":"HttpError","request":{"headers":{"accept":"application/vnd.github.v3+json","authorization":"token [REDACTED]","user-agent":"octokit-rest.js/18.0.0 octokit-core.js/3.6.0 Node.js/16.16.0 (linux; x64)"},"method":"GET","request":{"agent":{}},"url":"https://api.github.com/repos/xxxx/yyyy/actions/runs?status=queued"},"response":{"data":{"documentation_url":"https://docs.github.com/rest/reference/actions#list-workflow-runs-for-a-repository","message":"Resource not accessible by integration"},"headers":{"access-control-allow-origin":"*","access-control-expose-headers":"ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset","connection":"close","content-encoding":"gzip","content-security-policy":"default-src 'none'","content-type":"application/json; charset=utf-8","date":"Fri, 22 Jul 2022 11:37:30 GMT","referrer-policy":"origin-when-cross-origin, strict-origin-when-cross-origin","server":"GitHub.com","strict-transport-security":"max-age=31536000; includeSubdomains; preload","transfer-encoding":"chunked","vary":"Accept-Encoding, Accept, X-Requested-With","x-content-type-options":"nosniff","x-frame-options":"deny","x-github-media-type":"github.v3; format=json","x-github-request-id":"ABF8:0EFF:39DD5D:40D374:62DA8BFA","x-ratelimit-limit":"5000","x-ratelimit-remaining":"4975","x-ratelimit-reset":"1658491535","x-ratelimit-resource":"core","x-ratelimit-used":"25","x-xss-protection":"0"},"status":403,"url":"https://api.github.com/repos/xxxx/yyyy/actions/runs?status=queued"},"status":403},"exception":true,"level":"error","message":"unhandledRejection: Resource not accessible by 
integration\nHttpError: Resource not accessible by integration\n    at /snapshot/cml/node_modules/@octokit/request/dist-node/index.js:86:21\n    at runMicrotasks (<anonymous>)\n    at processTicksAndRejections (node:internal/process/task_queues:96:5)\n    at async Job.doExecute (/snapshot/cml/node_modules/bottleneck/light.js:405:18)","os":{"loadavg":[1.05,0.6,0.24],"uptime":146.55},"process":{"argv":["/usr/bin/cml-internal","/snapshot/cml/bin/cml.js","runner","--name","cml-4l6sv1qiu1","--labels","debug","--idle-timeout","300","--driver","github","--repo","https://github.com/xxxx/yyyy","--token","ghs_wDbiPdDx3S0wjEm4hvt0v0v0v037P54OliM1","--tf-resource","eyJtb2RlIjoibWFuYWdlZCIsInR5cGUiOiJpdGVyYXRpdmVfY21sX3J1bm5lciIsIm5hbWUiOiJydW5uZXIiLCJwcm92aWRlciI6InByb3ZpZGVyW1wicmVnaXN0cnkudGVycmFmb3JtLmlvL2l0ZXJhdGl2ZS9pdGVyYXRpdmVcIl0iLCJpbnN0YW5jZXMiOlt7InByaXZhdGUiOiIiLCJzY2hlbWFfdmVyc2lvbiI6MCwiYXR0cmlidXRlcyI6eyJuYW1lIjoiY21sLTRsNnN2MXFpdTEiLCJsYWJlbHMiOiIiLCJpZGxlX3RpbWVvdXQiOjMwMCwicmVwbyI6IiIsInRva2VuIjoiIiwiZHJpdmVyIjoiIiwiY2xvdWQiOiJhd3MiLCJjdXN0b21fZGF0YSI6IiIsImlkIjoiaXRlcmF0aXZlLTJvNzh2ZXFjOHJrZ2kiLCJpbWFnZSI6IiIsImluc3RhbmNlX2dwdSI6IiIsImluc3RhbmNlX2hkZF9zaXplIjozNSwiaW5zdGFuY2VfaXAiOiIiLCJpbnN0YW5jZV9sYXVuY2hfdGltZSI6IiIsImluc3RhbmNlX3R5cGUiOiIiLCJyZWdpb24iOiJldS13ZXN0LTEiLCJzc2hfbmFtZSI6IiIsInNzaF9wcml2YXRlIjoiIiwic3NoX3B1YmxpYyI6IiIsImF3c19zZWN1cml0eV9ncm91cCI6IiJ9fV19"],"cwd":"/","execPath":"/usr/bin/cml-internal","gid":0,"memoryUsage":{"arrayBuffers":15632910,"external":33348698,"heapTotal":106082304,"heapUsed":75520952,"rss":311275520},"pid":2440,"uid":0,"version":"v16.16.0"},"stack":"HttpError: Resource not accessible by integration\n    at /snapshot/cml/node_modules/@octokit/request/dist-node/index.js:86:21\n    at runMicrotasks (<anonymous>)\n    at processTicksAndRejections (node:internal/process/task_queues:96:5)\n    at async Job.doExecute 
(/snapshot/cml/node_modules/bottleneck/light.js:405:18)","trace":[{"column":21,"file":"/snapshot/cml/node_modules/@octokit/request/dist-node/index.js","function":null,"line":86,"method":null,"native":false},{"column":null,"file":null,"function":"runMicrotasks","line":null,"method":null,"native":false},{"column":5,"file":"node:internal/process/task_queues","function":"processTicksAndRejections","line":96,"method":null,"native":false},{"column":18,"file":"/snapshot/cml/node_modules/bottleneck/light.js","function":"async Job.doExecute","line":405,"method":"doExecute","native":false}]}
Jul 22 11:37:30 ip-172-31-32-70 cml.sh[2440]: {"level":"error","message":"HttpError: Resource not accessible by integration","stack":"Error: HttpError: Resource not accessible by integration\n    at process.<anonymous> (/snapshot/cml/bin/cml/runner.js:333:32)\n    at process.emit (node:events:539:35)\n    at emit (node:internal/process/promises:140:20)\n    at processPromiseRejections (node:internal/process/promises:274:27)\n    at processTicksAndRejections (node:internal/process/task_queues:97:32)","status":"terminated"}
Jul 22 11:37:30 ip-172-31-32-70 cml.sh[2440]: {"level":"info","message":"Unregistering runner cml-4l6sv1qiu1..."}
Jul 22 11:37:30 ip-172-31-32-70 cml.sh[2440]: {"level":"error","message":"\tFailed: Bad request - Runner \"cml-4l6sv1qiu1\" is still running a job\""}
Jul 22 11:37:30 ip-172-31-32-70 cml.sh[2440]: {"level":"info","message":"Waiting 10 seconds to destroy"}
Jul 22 11:37:33 ip-172-31-32-70 systemd[1]: cml.service: Main process exited, code=exited, status=1/FAILURE
Jul 22 11:37:35 ip-172-31-32-70 systemd[1]: cml.service: Failed with result 'exit-code'.
@DavidGOrtega
Contributor

DavidGOrtega commented Jul 25, 2022

👋 @mikolajpabiszczak the reason is that the runner has been marked to do just one job via the --single parameter; the option you might be looking for is --reuse.

@mikolajpabiszczak
Author

mikolajpabiszczak commented Jul 25, 2022

@DavidGOrtega: I do know that. So let me emphasise this again:

  1. the problem is not about single vs. reusable (I know and understand the difference between those). In both cases the workflow does not work (and it worked 3 months ago). I used reusable only to collect the logs provided and to see whether GitHub sees the runner (it does not: it marks it as offline). In fact, all the workflows (using CML) that I tested do not work (but worked 3 months ago).

  2. Moreover, if I use the reusable runner and try to run the failed job again, it does not pick up the already existing runner (because GitHub sees it as offline).

  3. If I use single, the instance does not get cancelled after the failure; I have to terminate it manually.

(I added some clarifications in the opening message)

@DavidGOrtega
Contributor

@mikolajpabiszczak
You have in your logs

Jul 22 11:37:30 ip-172-31-32-70 cml.sh[2440]: {"level":"error","message":"HttpError: Resource not accessible by integration","stack":"Error: HttpError: Resource not accessible by integration\n    at process.<anonymous> (/snapshot/cml/bin/cml/runner.js:333:32)\n    at process.emit (node:events:539:35)\n    at emit (node:internal/process/promises:140:20)\n    at processPromiseRejections (node:internal/process/promises:274:27)\n    at processTicksAndRejections (node:internal/process/task_queues:97:32)","status":"terminated"}

There must be something that your token does not have permission to do?
Then the unregistering cannot happen yet because there is still a job in play

@DavidGOrtega
Contributor

Just to be sure, and to move one step forward: can you please check your REPO_TOKEN? Does it have all the permissions?

@mikolajpabiszczak
Author

mikolajpabiszczak commented Jul 25, 2022

These were not changed since the working runs, but I checked them again. We are using a company GitHub App, so I checked against this list:

Repository level:

  • administration (read and write)
  • checks (we are not using cml send-github-check)
  • pull requests (read and write)

Organisation level:

  • self-hosted runners (read and write)

Additionally, in the repository settings:

  • all actions are allowed
  • and workflows have read and write permissions

@dacbd
Contributor

dacbd commented Jul 25, 2022

It looks like that app needs an additional scope it might not have: https://docs.github.com/en/rest/actions/workflow-runs#list-workflow-runs-for-a-repository

@mikolajpabiszczak to confirm it is an issue with the app-generated token, can you try to curl the endpoint with one of the generated tokens?

curl \
  -H "Accept: application/vnd.github+json" \
  -H "Authorization: token <TOKEN>" \
  https://api.github.com/repos/OWNER/REPO/actions/runs
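
If the token lacks the scope, the response body should carry the same message that shows up in the service logs above; a small sketch (assuming `jq` is installed) for pulling it out of a captured body, using the 403 payload from the logs as sample input:

```shell
# Extract the error message from a captured GitHub API response body.
# The sample body below is the 403 payload from the journalctl logs above.
response='{"message":"Resource not accessible by integration","documentation_url":"https://docs.github.com/rest/reference/actions#list-workflow-runs-for-a-repository"}'
echo "$response" | jq -r '.message'
```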

@casperdcl casperdcl added documentation Markdown files external-request You asked, we did labels Jul 26, 2022
@mikolajpabiszczak
Author

mikolajpabiszczak commented Jul 26, 2022

Did some tests; indeed, the culprit was the lack of sufficient permissions: after adding Read and write permissions for Actions, the workflows work again.
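
(For reference, and as an assumption on my part since our setup uses a GitHub App: for workflows that rely on the default GITHUB_TOKEN instead, the equivalent fix would presumably be the workflow-level permissions block, e.g.:)

```yaml
# Hypothetical equivalent for setups using the default GITHUB_TOKEN; our
# actual fix was granting "Actions: Read and write" on the GitHub App itself.
permissions:
  actions: write
  contents: read
```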

Thx for your time and help! And yes, the guide needs an update in this case. ;D

@dacbd
Contributor

dacbd commented Jul 26, 2022

@mikolajpabiszczak thanks for the report and help, we'll keep this open until we update the docs

@casperdcl casperdcl transferred this issue from iterative/cml Nov 18, 2022