shutdown event handler not killed on timeout #6615

Open
oliver-sanders opened this issue Feb 18, 2025 · 8 comments

@oliver-sanders
Member

oliver-sanders commented Feb 18, 2025

Seen in the wild: an event handler apparently not respecting its timeout.

  • Workflow stalled.
  • The "workflow stalled" event handler was run and was killed on timeout after 10 minutes.
  • Workflow hit the stall timeout.
  • The "abort on stall timeout" event handler was run; it did not time out and had to be killed manually.

CRITICAL - Workflow stalled
WARNING - PT3H stall timer starts NOW
ERROR - [('workflow-event-handler-00', 'stall') cmd] suite_report.py 'stall' 'r7091_physics_test/run11' 'workflow stalled'
    [('workflow-event-handler-00', 'stall') ret_code] -9
    [('workflow-event-handler-00', 'stall') err] killed on timeout (PT10M)
ERROR - stall EVENT HANDLER FAILED
WARNING - stall timer timed out after PT3H
ERROR - Workflow shutting down - "abort on stall timeout" is set
INFO - platform: xce - remote tidy (on xcel01)
ERROR - [('workflow-event-handler-00', 'shutdown') cmd] suite_report.py 'shutdown' 'r7091_physics_test/run11' '"abort on stall timeout" is set'
    [('workflow-event-handler-00', 'shutdown') ret_code] -15
ERROR - shutdown EVENT HANDLER FAILED
INFO - DONE

Why did the "workflow stalled" event handler abide by the timeout and the "abort on stall timeout" event handler ignore it?

This causes issues with auto-restart functionality because if an event handler ends up hanging for whatever reason, there is no timeout to stop it and the workflow will remain in the shutting down state indefinitely.
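
For readers unfamiliar with the negative ret_code values in the log: assuming they follow Python's subprocess convention (a negative return code means the process was terminated by that signal number), they can be decoded as in the sketch below. The helper name describe_ret_code is made up for illustration and is not part of Cylc.

```python
import signal


def describe_ret_code(ret_code: int) -> str:
    """Describe a subprocess-style return code (hypothetical helper)."""
    if ret_code < 0:
        # Negative values mean "terminated by signal number -ret_code".
        return f"killed by {signal.Signals(-ret_code).name}"
    return f"exited with status {ret_code}"


print(describe_ret_code(-9))   # ret_code of the stall handler in the log
print(describe_ret_code(-15))  # ret_code of the shutdown handler in the log
```

Under that assumption, -9 is consistent with the stall handler being killed on timeout (SIGKILL), and -15 is consistent with the shutdown handler being terminated by hand (SIGTERM is the default signal sent by kill).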

@oliver-sanders added the "bug", "investigation" and "needs reproducing" labels Feb 18, 2025
@oliver-sanders added this to the 8.4.x milestone Feb 18, 2025
@MetRonnie
Member

Are you talking about the process pool timeout? Is the issue that the shutdown event handler did not get killed after the process pool timeout elapsed?

@oliver-sanders
Member Author

oliver-sanders commented Feb 19, 2025

Presumably. In the above log snippet, the first event handler was killed after 10 minutes but the second was not. Whatever timeout applied to the first event handler should also have applied to the second.

@oliver-sanders
Member Author

Reproducible example

#!/usr/bin/env bash

# bin/handler

sleep 600

# flow.cylc

[scheduler]
    [[events]]
        stall handlers = handler
        shutdown handlers = handler

[scheduling]
    [[graph]]
        R1 = foo => bar

[runtime]
    [[foo]]
        script = false
    [[bar]]

# global.cylc

[scheduler]
    process pool timeout = PT10S
    [[events]]
        stall timeout = PT20S
        abort on stall timeout = True

$ cylc vip ... --no-detach

  • The first event handler (stall) gets killed by the process pool timeout (as expected).
  • The second event handler (shutdown) is left running.
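
The expected behaviour for the first handler can be pictured with a minimal sketch of a pool-style timeout check: start the command, poll it, and kill it once a deadline passes. This is an illustration under stated assumptions, not Cylc's process pool code; run_with_pool_style_timeout and its parameters are hypothetical names.

```python
import subprocess
import time


def run_with_pool_style_timeout(cmd, timeout, poll_interval=0.05):
    """Start cmd, poll it, and kill it once `timeout` seconds have passed.

    Returns a (return_code, timed_out) tuple.
    """
    proc = subprocess.Popen(cmd)
    deadline = time.monotonic() + timeout
    while proc.poll() is None:
        if time.monotonic() >= deadline:
            proc.kill()  # sends SIGKILL on POSIX
            proc.wait()  # reap the process so that returncode is set
            return proc.returncode, True
        time.sleep(poll_interval)
    return proc.returncode, False


# A command that outlives its timeout is killed, as the stall handler was.
ret_code, timed_out = run_with_pool_style_timeout(["sleep", "5"], timeout=0.2)
print(ret_code, timed_out)
```

A command that is never handed to a loop like this has no deadline at all, which matches what is observed for the shutdown handler.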

@oliver-sanders removed the "needs reproducing" label Feb 26, 2025
@hjoliver changed the title from "event handler timeout" to "shutdown event handler not killed on timeout" Feb 27, 2025
@hjoliver
Member

hjoliver commented Feb 27, 2025

(Note: for testing, you can inline sleep event handlers by appending "; echo".)
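
Why the trailing echo helps can be sketched as follows. Judging by the logged command lines earlier in this issue, the event name, workflow ID and message are appended to the handler command as arguments. The helper build_handler_command below is hypothetical and only mimics that behaviour; it is not Cylc's implementation, and the workflow ID used is a placeholder.

```python
import shlex
import subprocess


def build_handler_command(handler, event, workflow_id, message):
    """Append the three event arguments to a handler command string.

    Hypothetical helper: it mimics the command lines seen in the log,
    not Cylc's actual implementation.
    """
    args = " ".join(shlex.quote(arg) for arg in (event, workflow_id, message))
    return f"{handler} {args}"


# Without the trailing echo, the arguments would be passed to sleep itself.
bare = build_handler_command("sleep 0", "stall", "wf/run1", "workflow stalled")

# With the trailing echo, sleep runs alone and echo receives the arguments.
inlined = build_handler_command(
    "sleep 0; echo", "stall", "wf/run1", "workflow stalled"
)

print(bare)
print(inlined)
result = subprocess.run(inlined, shell=True, capture_output=True, text=True)
print(result.stdout, end="")
```

In the second form the shell runs sleep with only its duration, and the appended arguments are harmlessly consumed by echo.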

It seems to be only shutdown handlers that aren't subject to the process pool timeout.

With the same global config, two non-shutdown handlers get killed here:

[scheduler]
    [[events]]
        stall handlers = "sleep 60; echo"
        startup handlers = "sleep 60; echo"

but even a lone shutdown handler does not get killed:

[scheduler]
    [[events]]
        shutdown handlers = "sleep 60; echo"

@hjoliver
Member

hjoliver commented Feb 27, 2025

Shutdown handlers aren't run in the process pool because it's "closed" by then.

# cylc/flow/workflow_events.py
if self.proc_pool.closed:
    # Run command in foreground if abort on failure is set or if
    # process pool is closed
    self.proc_pool.run_command(proc_ctx)
    self._run_event_handlers_callback(proc_ctx)
else:
    # Run command using process pool otherwise
    self.proc_pool.put_command(
        proc_ctx, callback=self._run_event_handlers_callback)
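
For illustration only, one way a foreground command could still be given a timeout is sketched below. This is not the fix proposed for this issue and does not use Cylc's process pool API; run_in_foreground is a hypothetical standalone function, and it assumes a POSIX system.

```python
import os
import signal
import subprocess


def run_in_foreground(cmd: str, timeout: float) -> int:
    """Run a shell command synchronously, killing it after `timeout` seconds.

    Returns the command's return code (negative if killed by a signal).
    """
    # start_new_session puts the shell and its children in their own
    # process group so that the whole group can be killed together.
    proc = subprocess.Popen(cmd, shell=True, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The shell is the group leader, so its PID is also the group ID.
        os.killpg(proc.pid, signal.SIGKILL)
        return proc.wait()


# A hanging handler is killed once the timeout elapses.
print(run_in_foreground("sleep 5; echo done", timeout=0.2))
```

Killing the process group rather than only the shell matters for inlined handlers such as "sleep 60; echo", where the shell spawns sleep as a child that would otherwise keep running.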

@hjoliver
Member

I'll post a fix...

@oliver-sanders
Member Author

(Note: for testing, you can inline sleep event handlers by appending "; echo".)

Sneaky. I really don't like the auto-provision of args here.

@oliver-sanders
Member Author

Shutdown handlers aren't run in the process pool because it's "closed" by then.

Dammit, I thought it might be something like that.

FYI: There is another related issue that it might be possible to fix at the same time. #5997
