Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timeline status update may fail on landing zone deletion #1894

Open
mikkonie opened this issue Feb 7, 2024 · 3 comments
Open

Timeline status update may fail on landing zone deletion #1894

mikkonie opened this issue Feb 7, 2024 · 3 comments
Assignees
Labels
app: taskflowbackend Issue in the taskflowbackend app bug Something isn't working ongoing Ongoing issue, needs observation or pending on other projects

Comments

@mikkonie
Copy link
Contributor

mikkonie commented Feb 7, 2024

This is a very random bug I've observed in production a couple of times (two times at the time of writing, to be exact):

  • User deletes a landing zone
  • Zone gets correctly deleted from iRODS
  • Zone status gets correctly updated to DELETED
  • ..but the timeline zone_delete event status remains at SUBMIT

I'll try to see how to reproduce this. Setting the zone status is the last task in the landing_zone_delete flow. A possible explanation could be an error/crash in run_flow() after the flow has returned. I have to look into server logs to see if any errors have been raised.

One possible might be an iRODS timeout similar to #1458 and #1893. OTOH, today I witnessed this happen in production at a time when iRODS connections were otherwise working without issues.

@mikkonie mikkonie added bug Something isn't working app: taskflowbackend Issue in the taskflowbackend app labels Feb 7, 2024
@mikkonie mikkonie self-assigned this Feb 7, 2024
@mikkonie
Copy link
Contributor Author

mikkonie commented Feb 7, 2024

Still looking for a way to reproduce this. An interesting thing I observed is that in an affected project, other landing zone operations have been successfully triggered after the zone deletion. This means we reach the project unlock code in run_flow(), which is placed right after the timeline status updating.

Hence my interest is now targeted at this bit in run_flow():

# Flow completion
if flow_result and tl_event and async_mode:
    tl_event.set_status('OK', 'Async submit OK')
# Exception/failure
elif not flow_result and not ex_msg:
    ex_msg = UNKNOWN_RUN_ERROR

There is no error in the log. Based on that it looks like neither of these conditions are met. ex_msg is definitely not set at this point as we never raise an error. async_mode is always set when this flow is triggered. Likewise, tl_event should always be there as we'd get a crash if the tl_uuid query fails.

This is a weird one. I could add some debug logging to the aforementioned point in run_flow() to see if there is some unforeseen way these conditions can fail.. Or maybe I'm looking at the wrong thing entirely. Maybe we don't ever get this far in the execution of run_flow(), and the lock simply times out? But it definitely shouldn't time out that fast..

@mikkonie mikkonie added the ongoing Ongoing issue, needs observation or pending on other projects label Feb 7, 2024
@mikkonie
Copy link
Contributor Author

mikkonie commented Feb 7, 2024

I queried our production database. Out of 901 deletions, in 15 the timeline event has gotten stuck in the SUBMIT state. Timewise, these range from November 2022 to this day. Interestingly, 6 of them happened in the same project on the same day (2023-10-25).

I'll try to come up with and test some variables which could affect this but if no luck, I'll leave it ongoing now and see if I can observe more of this in production.

Mental note: here's a twoliner to grab affected events in the Django shell.

from timeline.models import ProjectEvent
events = [e for e in ProjectEvent.objects.filter(event_name='zone_delete') if e.get_status().status_type == 'SUBMIT']

One thing I noticed that at least some of these zones in question (maybe all, I need to check) had previously failed with Failed to remove collection. That doesn't exactly explain why we wouldn't hit that bit of code though.

@mikkonie
Copy link
Contributor Author

I'll keep my eye out on further occurrences of this issue. It might simply be because of an admin setting the deleted status in shell and neglecting to update/delete the related timeline status. If this is the case, the proper fix would be to enable proper admin tools. For that, see #1892.

This may well be the same issue as with #1798, so I'm setting that one to ongoing as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
app: taskflowbackend Issue in the taskflowbackend app bug Something isn't working ongoing Ongoing issue, needs observation or pending on other projects
Projects
None yet
Development

No branches or pull requests

1 participant