
token times out under load? #188

Closed
loosebazooka opened this issue Jan 22, 2025 · 24 comments
Labels
bug Something isn't working

Comments

@loosebazooka (Member) commented Jan 22, 2025

I'm not sure what the root cause is here, but we frequently see

/home/runner/work/_actions/sigstore/sigstore-conformance/b0635d4101f11dbd18a50936568a1f7f55b17760/test/conftest.py:157: TimeoutError
---------------------------- Captured stdout setup -----------------------------
Current token expires too early, retrying in 30 seconds.
Current token expires too early, retrying in 30 seconds.
Current token expires too early, retrying in 30 seconds.
Current token expires too early, retrying in 30 seconds.
Current token expires too early, retrying in 30 seconds.
Current token expires too early, retrying in 30 seconds.
Current token expires too early, retrying in 30 seconds.
Current token expires too early, retrying in 30 seconds.
Current token expires too early, retrying in 30 seconds.
Current token expires too early, retrying in 30 seconds.

in sigstore-java. I imagine it happens when there's some slowdown on GHA. I'm wondering whether we should explore an alternative here?
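
For context, the failing setup step behaves roughly like the sketch below. This is only an illustration of the retry behavior implied by the log, not the actual sigstore-conformance conftest.py; fetch_token, the thresholds, and the cap are all hypothetical.

```python
import time

RETRY_INTERVAL = 30  # seconds between retries (matches the log message above)
MIN_REMAINING = 60   # hypothetical freshness requirement for the token
MAX_WAIT = 300       # hypothetical cap before giving up

def fetch_token():
    """Placeholder for however the suite obtains the OIDC token
    (e.g. from the token beacon); returns (token, expiry_unix_time)."""
    raise NotImplementedError

def get_fresh_token():
    deadline = time.time() + MAX_WAIT
    while True:
        token, expiry = fetch_token()
        if expiry - time.time() >= MIN_REMAINING:
            return token
        if time.time() >= deadline:
            # The failure mode in the log above: every retry saw a token that
            # was about to expire, so the setup eventually times out.
            raise TimeoutError("could not obtain a fresh enough token")
        print("Current token expires too early, retrying in 30 seconds.")
        time.sleep(RETRY_INTERVAL)
```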

loosebazooka added the bug label on Jan 22, 2025
@haydentherapper (Collaborator)

Can we increase the frequency of the publishing beacon to every minute? Is there a concern of running out of bandwidth?

@woodruffw (Member)

Can we increase the frequency of the publishing beacon to every minute? Is there a concern of running out of bandwidth?

Yep, there should be no harm (other than an obscene amount of CI traffic) in doing that! Although it's strange that only Java is seeing that error -- we've seen other flakes on the Python client, but not consistently (or with that error).

@jku (Member) commented Jan 23, 2025

it's strange that only Java is seeing that error

It might be worth trying to pinpoint the reason here -- I have not seen this issue myself since the current token infra was set up...

  • Appu, do you have more complete logs (e.g. with timestamps)?
  • We could try to add some more info to the retry message (a small sketch of one option is below).
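
For example -- purely an illustrative sketch, not the current conftest.py code; the values are made up -- the message could include the expiry, the current time, and the remaining lifetime:

```python
import datetime

# Hypothetical values that the token fixture would already have on hand.
now = datetime.datetime.now(datetime.timezone.utc)
expiry = now + datetime.timedelta(seconds=20)  # example only

print(
    f"Current token expires too early "
    f"(now={now.isoformat()}, expiry={expiry.isoformat()}, "
    f"remaining={(expiry - now).total_seconds():.0f}s), retrying in 30 seconds."
)
```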

@jku (Member) commented Jan 23, 2025

What is going on here, btw? Why does it take 40 minutes to run the test suite on sigstore-java?

Image

@loosebazooka (Member, Author) commented Jan 23, 2025

There's a set of conformance tests, and it's actually fast (1-2 minutes in the ideal situation); when it catches a bad token it balloons to 15+ minutes.

Image

@loosebazooka (Member, Author)

I think the problem is partly that Java runs this 4 times per PR (against [Java 17, Java 11] x [staging, prod]), and this usually happens on our Dependabot update cycle, when 10 PRs are created.

@di (Member) commented Jan 23, 2025

Here's a successful but slow run: https://github.com/sigstore/sigstore-java/actions/runs/12915297054/job/36016838227

@woodruffw I don't think this is related, but I did notice that this run is building the grpclib library from source because it lacks wheels on PyPI: https://pypi.org/project/grpclib/0.4.7/. It looks like this comes from betterproto==2.0.0b6 -> sigstore-protobuf-specs==0.3.2? Should we look into avoiding that?

(Also, I think we should probably be compiling our dependencies here as well...)

@jku (Member) commented Jan 23, 2025

There's a set of conformance tests, and it's actually fast (1-2 minutes in the ideal situation); when it catches a bad token it balloons to 15+ minutes.

Maybe I'm not looking in the right place, but all production runs seem to be 15-17 minutes each? This seems fairly consistent; in fact, I've not seen any other results so far.

@di (Member) commented Jan 23, 2025

Also, maybe we can run the test suite with --durations to identify whether there are any particularly bad cases? This only takes ~5 minutes to run on the Python client; I'm surprised it takes 3x longer here.
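
For reference, this is the standard pytest flag; shown below via pytest.main, with the test path as a placeholder -- the equivalent command line is pytest --durations=20 test/.

```python
import pytest

# Report the 20 slowest test durations after the run.
# (Equivalent to `pytest --durations=20 test/` on the command line;
# "test/" is a placeholder for the suite's actual test directory.)
pytest.main(["--durations=20", "test/"])
```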

@loosebazooka (Member, Author)

Expand out: https://github.com/sigstore/sigstore-java/actions/runs/12915296222

Image

And you should see the individual runs, where prod takes 15+ minutes and staging takes 1-2 minutes.

But looking at it more now, it looks like maybe prod is getting rate limited or something? What do the commas (instead of periods) mean in the conformance output?

@woodruffw (Member)

@loosebazooka Out of curiosity, how is the conformance CLI client invoked? Do you spin up a new process each time?

(I'm wondering if JVM startup is playing a role -- when running the CPython bundle tests we end up spawning 1000+ processes, which I could see taking quite a bit of time.)
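
(For anyone following along, the general pattern looks like the sketch below -- a rough illustration, not the actual sigstore-conformance harness; the executable name and arguments are placeholders. The point is that each test step pays the client's full startup cost, which for a JVM adds up quickly over many invocations.)

```python
import subprocess

def run_client(*args: str) -> subprocess.CompletedProcess:
    """Spawn the client under test as a fresh process for a single test step."""
    # "sigstore-client" stands in for whatever entrypoint script wraps the
    # client under test; a Java client pays full JVM startup on every call.
    return subprocess.run(
        ["sigstore-client", *args],
        capture_output=True,
        text=True,
        check=True,
    )

# e.g. one verification step out of many:
# run_client("verify-bundle", "--bundle", "a.sigstore.json")
```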

@jku (Member) commented Jan 23, 2025

But looking at it more now, it looks like maybe prod is getting rate limited or something?

I think there are just so many more tests for prod that it takes a lot longer: the selftest here takes 2 minutes for prod and 0.2 minutes for staging as well.

What do the commas (instead of periods) mean in the conformance output?

Good question.

@woodruffw (Member)

@woodruffw I don't think this is related but I did notice that this is building the grpclib library from source because it lacks wheels on PyPI: pypi.org/project/grpclib/0.4.7. Looks like this comes from betterproto==2.0.0b6->sigstore-protobuf-specs==0.3.2? Should we look into avoiding that?

Hmm, yeah -- it looks like grpclib's sdist build is a pretty trivial one (from inspector), but ideally we wouldn't need that at all. I can send a ping to the upstream and see if they'd be willing to provide a wheel!

(Also, I think we should probably be compiling our dependencies here as well...)

Yeah, fully agreed! We have them fully pinned at the moment, but not fully resolved or hashed. I can look into that last part today.

@di (Member) commented Jan 23, 2025

I think the comma indicates a hung/slow test, and pytest outputs the comma to keep the output going? This happens with Python too: https://github.com/sigstore/sigstore-python/actions/runs/12894538382/job/35953394258

@di (Member) commented Jan 23, 2025

I can send a ping to the upstream and see if they'd be willing to provide a wheel!

Looks like later pre-releases have wheels: https://pypi.org/project/grpclib/0.4.8rc2/#files

@woodruffw (Member)

Ah yep, I just started writing an issue and noticed vmagamedov/grpclib#188 🙂

@jku (Member) commented Jan 23, 2025

Updating with some details:

So something clearly goes wrong in the test run on GHA for sigstore-java. None of this explains the original issue of the token not being available, but it may explain why only sigstore-java sees the problem: maybe there is a bug in how the token is refreshed when it expires multiple times during the test run, and the situation only comes up in the Java tests...

@loosebazooka (Member, Author) commented Jan 23, 2025

Oh, I thought the CPython tests were skipped by default? Is that not true? Locally they are skipped because there is no github_workspace, but I guess the default on the skip is "false".

@loosebazooka (Member, Author)

Each instance of the test is starting up a new JVM, though, so there's that. We could make conformance a long-running process, which could help, but it would complicate what the CLI is.

@jku (Member) commented Jan 24, 2025

Each instance of the test is starting up a new JVM, though, so there's that.

I've had a closer look at the TUF conformance tests (since that test suite is even more blocked by client startup time; I know because I've profiled the Python case), and this does seem like a good explanation: the Python client test run already spends >90% of its time in Python module imports, and the Java client test run takes ~4x as long. So the startup cost for the sigstore-java client is roughly 4x that of the sigstore-python client.

So

  • The issues with speed seem to be explained -- JVM startup is just expensive.
    • A potential workaround is to skip the CPython release verification tests... but they are useful too: they contain an archive of historical signatures that are good to test against 🤷
    • It would be possible to write a single-instance conformance client, where the called CLI is a fast wrapper that only does IPC to the actual (Java) conformance client (a rough sketch of the idea is below)... but that sounds like a lot of work.
  • There does not seem to be anything GitHub Actions specific here: the test suite was faster locally since it was not running the CPython release tests.
  • Token timeouts have not been explained yet, but may be related to the fact that the Java tests just take longer: maybe there is a bug in the test suite token code; this should be testable.
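
To make the IPC idea concrete -- purely an illustrative sketch under made-up assumptions (the socket path and wire format are invented), not a concrete proposal -- the fast wrapper could be a tiny script that forwards its argv to a long-running client process and mirrors the result back:

```python
#!/usr/bin/env python3
"""Hypothetical thin CLI wrapper: forwards its arguments over a Unix socket
to a long-running conformance client and mirrors the exit code back."""
import json
import socket
import sys

SOCKET_PATH = "/tmp/conformance-client.sock"  # made-up rendezvous point

def main() -> int:
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(SOCKET_PATH)
        # Send the CLI arguments as one JSON line...
        sock.sendall(json.dumps(sys.argv[1:]).encode() + b"\n")
        # ...and read back a JSON reply from the long-running client.
        reply = json.loads(sock.makefile().readline())
    sys.stdout.write(reply.get("stdout", ""))
    sys.stderr.write(reply.get("stderr", ""))
    return reply.get("exit_code", 1)

if __name__ == "__main__":
    sys.exit(main())
```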

@loosebazooka (Member, Author)

I think we can do a release and close this for now. I believe @jku's PR treats the symptoms sufficiently. There are still some longer-term issues with starting the JVM for each test, but they're much lower priority now.

@woodruffw (Member)

Sounds good to me 🙂 -- I can do a release in a moment.

@woodruffw (Member)

Just cut 0.0.17.
