
Self-hosted workers fail immediately, get marked "offline" in the runners list. #388

Open
mikolajpabiszczak opened this issue Jul 25, 2022 · 9 comments
Assignees
Labels
documentation Markdown files external-request You asked, we did

Comments

@mikolajpabiszczak

mikolajpabiszczak commented Jul 25, 2022

I am frankly not sure whether this is an issue on the CML side, but let me describe it.

CML versions tested: 0.11.0 and 0.17.0
Cloud provider: AWS

Remark: the very same workflow worked when I last used it (3 months ago)

  1. Deploy self-hosted runner:

        [...]
        cml runner \
            --cloud=aws \
            --cloud-region=eu-west-1 \
            --cloud-type=g3s.xlarge \
            --cloud-spot \
            --single \
            --cloud-startup-script=$(echo 'echo "$(curl https://github.com/${{ github.actor }}.keys)" >> /home/ubuntu/.ssh/authorized_keys' | base64 -w 0) \
            --labels=debug
        [...]
    

    this deployment job finishes successfully, but by the time it finishes, the instance (as checked in the AWS console) has not yet performed its status checks / is still in the Initialisation stage (this was not the case when the workflow last worked, 3 months ago).

  2. The next job (which runs on the self-hosted runner) gets closed almost immediately (in 4 s):
    The runner has received a shutdown signal
    although the instance itself does not get cancelled: it goes through the AWS status checks and remains running (to clarify: the instance was deployed as single).
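
The startup script passed to `--cloud-startup-script` above is base64-encoded inline; a quick way to sanity-check that the encoding round-trips (a sketch, using GNU `base64` with `-w 0` as in the workflow, and a placeholder `ACTOR` in place of the `${{ github.actor }}` expression):

```shell
# Sanity-check the --cloud-startup-script encoding: encode, then decode and
# compare. ACTOR stands in for the ${{ github.actor }} expression.
# Note: -w 0 (disable line wrapping) is a GNU coreutils option.
script='echo "$(curl https://github.com/ACTOR.keys)" >> /home/ubuntu/.ssh/authorized_keys'
encoded=$(printf '%s\n' "$script" | base64 -w 0)
printf '%s' "$encoded" | base64 -d
```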

One more thing: if I deploy the worker as reusable, it will be marked as offline in the list of workers after the job fails and will not be accessible…

I deployed the reusable instance and got logs after failure:

ubuntu@ip-172-31-32-70:~$ journalctl -u cml.service -f
-- Logs begin at Thu 2022-07-21 01:23:30 UTC. --
Jul 22 11:36:49 ip-172-31-32-70 cml.sh[2440]: {"level":"info","message":"Outputs: 0"}
Jul 22 11:36:49 ip-172-31-32-70 cml.sh[2440]: {"level":"info","message":"Connected to acpid service."}
Jul 22 11:37:18 ip-172-31-32-70 cml.sh[2440]: {"date":"2022-07-22T11:37:18.362Z","level":"info","message":"runner status","repo":"https://github.com/xxxx/yyyy","status":"ready"}
Jul 22 11:37:30 ip-172-31-32-70 cml.sh[2440]: {"date":"Fri Jul 22 2022 11:37:30 GMT+0000 (Coordinated Universal Time)","error":{"name":"HttpError","request":{"headers":{"accept":"application/vnd.github.v3+json","authorization":"token [REDACTED]","user-agent":"octokit-rest.js/18.0.0 octokit-core.js/3.6.0 Node.js/16.16.0 (linux; x64)"},"method":"GET","request":{"agent":{}},"url":"https://api.github.com/repos/xxxx/yyyy/actions/runs?status=queued"},"response":{"data":{"documentation_url":"https://docs.github.com/rest/reference/actions#list-workflow-runs-for-a-repository","message":"Resource not accessible by integration"},"headers":{"access-control-allow-origin":"*","access-control-expose-headers":"ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset","connection":"close","content-encoding":"gzip","content-security-policy":"default-src 'none'","content-type":"application/json; charset=utf-8","date":"Fri, 22 Jul 2022 11:37:30 GMT","referrer-policy":"origin-when-cross-origin, strict-origin-when-cross-origin","server":"GitHub.com","strict-transport-security":"max-age=31536000; includeSubdomains; preload","transfer-encoding":"chunked","vary":"Accept-Encoding, Accept, X-Requested-With","x-content-type-options":"nosniff","x-frame-options":"deny","x-github-media-type":"github.v3; format=json","x-github-request-id":"ABF8:0EFF:39DD5D:40D374:62DA8BFA","x-ratelimit-limit":"5000","x-ratelimit-remaining":"4975","x-ratelimit-reset":"1658491535","x-ratelimit-resource":"core","x-ratelimit-used":"25","x-xss-protection":"0"},"status":403,"url":"https://api.github.com/repos/xxxx/yyyy/actions/runs?status=queued"},"status":403},"exception":true,"level":"error","message":"unhandledRejection: Resource not accessible by 
integration\nHttpError: Resource not accessible by integration\n    at /snapshot/cml/node_modules/@octokit/request/dist-node/index.js:86:21\n    at runMicrotasks (<anonymous>)\n    at processTicksAndRejections (node:internal/process/task_queues:96:5)\n    at async Job.doExecute (/snapshot/cml/node_modules/bottleneck/light.js:405:18)","os":{"loadavg":[1.05,0.6,0.24],"uptime":146.55},"process":{"argv":["/usr/bin/cml-internal","/snapshot/cml/bin/cml.js","runner","--name","cml-4l6sv1qiu1","--labels","debug","--idle-timeout","300","--driver","github","--repo","https://github.com/xxxx/yyyy","--token","ghs_wDbiPdDx3S0wjEm4hvt0v0v0v037P54OliM1","--tf-resource","eyJtb2RlIjoibWFuYWdlZCIsInR5cGUiOiJpdGVyYXRpdmVfY21sX3J1bm5lciIsIm5hbWUiOiJydW5uZXIiLCJwcm92aWRlciI6InByb3ZpZGVyW1wicmVnaXN0cnkudGVycmFmb3JtLmlvL2l0ZXJhdGl2ZS9pdGVyYXRpdmVcIl0iLCJpbnN0YW5jZXMiOlt7InByaXZhdGUiOiIiLCJzY2hlbWFfdmVyc2lvbiI6MCwiYXR0cmlidXRlcyI6eyJuYW1lIjoiY21sLTRsNnN2MXFpdTEiLCJsYWJlbHMiOiIiLCJpZGxlX3RpbWVvdXQiOjMwMCwicmVwbyI6IiIsInRva2VuIjoiIiwiZHJpdmVyIjoiIiwiY2xvdWQiOiJhd3MiLCJjdXN0b21fZGF0YSI6IiIsImlkIjoiaXRlcmF0aXZlLTJvNzh2ZXFjOHJrZ2kiLCJpbWFnZSI6IiIsImluc3RhbmNlX2dwdSI6IiIsImluc3RhbmNlX2hkZF9zaXplIjozNSwiaW5zdGFuY2VfaXAiOiIiLCJpbnN0YW5jZV9sYXVuY2hfdGltZSI6IiIsImluc3RhbmNlX3R5cGUiOiIiLCJyZWdpb24iOiJldS13ZXN0LTEiLCJzc2hfbmFtZSI6IiIsInNzaF9wcml2YXRlIjoiIiwic3NoX3B1YmxpYyI6IiIsImF3c19zZWN1cml0eV9ncm91cCI6IiJ9fV19"],"cwd":"/","execPath":"/usr/bin/cml-internal","gid":0,"memoryUsage":{"arrayBuffers":15632910,"external":33348698,"heapTotal":106082304,"heapUsed":75520952,"rss":311275520},"pid":2440,"uid":0,"version":"v16.16.0"},"stack":"HttpError: Resource not accessible by integration\n    at /snapshot/cml/node_modules/@octokit/request/dist-node/index.js:86:21\n    at runMicrotasks (<anonymous>)\n    at processTicksAndRejections (node:internal/process/task_queues:96:5)\n    at async Job.doExecute 
(/snapshot/cml/node_modules/bottleneck/light.js:405:18)","trace":[{"column":21,"file":"/snapshot/cml/node_modules/@octokit/request/dist-node/index.js","function":null,"line":86,"method":null,"native":false},{"column":null,"file":null,"function":"runMicrotasks","line":null,"method":null,"native":false},{"column":5,"file":"node:internal/process/task_queues","function":"processTicksAndRejections","line":96,"method":null,"native":false},{"column":18,"file":"/snapshot/cml/node_modules/bottleneck/light.js","function":"async Job.doExecute","line":405,"method":"doExecute","native":false}]}
Jul 22 11:37:30 ip-172-31-32-70 cml.sh[2440]: {"level":"error","message":"HttpError: Resource not accessible by integration","stack":"Error: HttpError: Resource not accessible by integration\n    at process.<anonymous> (/snapshot/cml/bin/cml/runner.js:333:32)\n    at process.emit (node:events:539:35)\n    at emit (node:internal/process/promises:140:20)\n    at processPromiseRejections (node:internal/process/promises:274:27)\n    at processTicksAndRejections (node:internal/process/task_queues:97:32)","status":"terminated"}
Jul 22 11:37:30 ip-172-31-32-70 cml.sh[2440]: {"level":"info","message":"Unregistering runner cml-4l6sv1qiu1..."}
Jul 22 11:37:30 ip-172-31-32-70 cml.sh[2440]: {"level":"error","message":"\tFailed: Bad request - Runner \"cml-4l6sv1qiu1\" is still running a job\""}
Jul 22 11:37:30 ip-172-31-32-70 cml.sh[2440]: {"level":"info","message":"Waiting 10 seconds to destroy"}
Jul 22 11:37:33 ip-172-31-32-70 systemd[1]: cml.service: Main process exited, code=exited, status=1/FAILURE
Jul 22 11:37:35 ip-172-31-32-70 systemd[1]: cml.service: Failed with result 'exit-code'.
@DavidGOrtega
Contributor

DavidGOrtega commented Jul 25, 2022

👋 @mikolajpabiszczak the reason is that the runner has been marked to do just one job via the --single parameter; the option you might be looking for is --reuse.

@mikolajpabiszczak
Author

mikolajpabiszczak commented Jul 25, 2022

@DavidGOrtega: I do know that. So let me emphasise this again:

  1. the problem is not about single vs. reusable (I know and understand the difference between those). In both cases the workflow does not work (and it worked 3 months ago). I used reusable only to collect the logs provided and to see whether GitHub sees the runner (it does not: it marks it as offline). In fact, all the workflows (using CML) that I tested do not work (but worked 3 months ago).

  2. Moreover, if I use the reusable runner and try to run the failed job again, it does not pick up the already existing runner (because GitHub sees it as offline).

  3. If I use single, the instance does not get cancelled after the failure; I have to terminate it manually.

(I added some clarifications in the opening message)

@DavidGOrtega
Contributor

@mikolajpabiszczak
You have in your logs

Jul 22 11:37:30 ip-172-31-32-70 cml.sh[2440]: {"level":"error","message":"HttpError: Resource not accessible by integration","stack":"Error: HttpError: Resource not accessible by integration\n    at process.<anonymous> (/snapshot/cml/bin/cml/runner.js:333:32)\n    at process.emit (node:events:539:35)\n    at emit (node:internal/process/promises:140:20)\n    at processPromiseRejections (node:internal/process/promises:274:27)\n    at processTicksAndRejections (node:internal/process/task_queues:97:32)","status":"terminated"}

There must be something that your token does not have permission to do?
Then the unregistering cannot happen yet because there is still a job in play

@DavidGOrtega
Contributor

Just to be sure, and to move one step forward: can you please check your REPO_TOKEN? Does it have all the permissions?

@mikolajpabiszczak
Author

mikolajpabiszczak commented Jul 25, 2022

These were not changed since the working runs, but I checked them again. We are using a company GitHub App, so I checked against this list:

Repository level:

  • administration (read and write)
  • checks (we are not using cml send-github-check)
  • pull requests (read and write)

Organisation level:

  • self-hosted runners (read and write)

Additionally, in the repository settings:

  • all actions are allowed
  • and workflows have read and write permissions

@dacbd
Contributor

dacbd commented Jul 25, 2022

It looks like that app needs an additional scope it might not have: https://docs.github.com/en/rest/actions/workflow-runs#list-workflow-runs-for-a-repository

@mikolajpabiszczak to confirm it is an issue with the app-generated token, can you try to curl the endpoint with one of the generated tokens?

curl \
  -H "Accept: application/vnd.github+json" \
  -H "Authorization: token <TOKEN>" \
  https://api.github.com/repos/OWNER/REPO/actions/runs
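
If the token lacks the scope, the response body should carry the same message that shows up in the service logs above; a small sketch (assuming `jq` is installed) for pulling it out of a captured body, using the 403 payload from the logs as sample input:

```shell
# Extract the error message from a captured GitHub API response body.
# The sample body below is the 403 payload from the journalctl logs above.
response='{"message":"Resource not accessible by integration","documentation_url":"https://docs.github.com/rest/reference/actions#list-workflow-runs-for-a-repository"}'
echo "$response" | jq -r '.message'
```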

@casperdcl casperdcl added documentation Markdown files external-request You asked, we did labels Jul 26, 2022
@mikolajpabiszczak
Author

mikolajpabiszczak commented Jul 26, 2022

Did some tests; indeed, the culprit was the lack of sufficient permissions: after adding Read and write permissions for Actions, the workflows work again.
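
(For reference, and as an assumption on my part since our setup uses a GitHub App: for workflows that rely on the default GITHUB_TOKEN instead, the equivalent fix would presumably be the workflow-level permissions block, e.g.:)

```yaml
# Hypothetical equivalent for setups using the default GITHUB_TOKEN; our
# actual fix was granting "Actions: Read and write" on the GitHub App itself.
permissions:
  actions: write
  contents: read
```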

Thx for your time and help! And yes, the guide needs an update in this case. ;D

@dacbd
Contributor

dacbd commented Jul 26, 2022

@mikolajpabiszczak thanks for the report and help, we'll keep this open until we update the docs

@casperdcl casperdcl transferred this issue from iterative/cml Nov 18, 2022