
Lock the atexit_hooks during execution of the hooks on shutdown. #49868

Merged
merged 8 commits into from
May 30, 2023

Conversation

NHDaly
Member

@NHDaly NHDaly commented May 17, 2023

Fixes #49841.

Follow-up to #49774.

This PR makes two changes:

  1. It locks atexit_hooks while iterating the hooks during execution of _atexit() at shutdown.
    • This prevents any data races if another Task is registering a new atexit hook while the hooks are being evaluated.
  2. It defines semantics for what happens if another Task attempts to register another atexit hook after all the hooks have finished, and we've proceeded on to the rest of shutdown.
    • Previously, those atexit hooks would be silently ignored, which violates the user's expectations and the atexit contract.
    • Now, the attempt to register the atexit hook throws an exception, which ensures we never break that promise, since the user was never able to register the hook at all.
    • This does mean that users will need to handle the thrown exception and likely perform immediately whatever teardown they had hoped to defer until exit.
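The combined behavior described above can be sketched as follows. This is a minimal illustration with hypothetical names (`hooks`, `hooks_lock`, `hooks_finished`); the actual identifiers and logic live in Base and differ in detail:

```julia
# Sketch of the pattern: one lock guards both the hook list and a
# "finished" flag, so registration and shutdown cannot race.
const hooks = Function[]
const hooks_lock = ReentrantLock()
hooks_finished = false  # only read/written while holding hooks_lock

function register_hook(f::Function)
    lock(hooks_lock) do
        # Once shutdown has drained the hooks, refuse new registrations
        # rather than silently dropping them.
        hooks_finished && error("cannot register a hook: hooks already ran")
        pushfirst!(hooks, f)
    end
end

function run_hooks()
    while true
        f = lock(hooks_lock) do
            if isempty(hooks)
                global hooks_finished = true
                return nothing
            end
            popfirst!(hooks)  # LIFO: the last hook registered runs first
        end
        f === nothing && break
        # Each hook runs outside the lock, so a hook may itself
        # register further hooks without deadlocking.
        try
            f()
        catch exc
            @error "error in hook" exception = exc
        end
    end
end
```

Registering from inside a hook works because the lock is held only while mutating the list; registering after `run_hooks` has drained everything fails loudly instead of being silently ignored.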

@NHDaly NHDaly requested a review from vtjnash May 17, 2023 22:56
@NHDaly
Member Author

NHDaly commented May 18, 2023

EDIT: these are resolved

Fix the lock to be more precise, so that we are allowed to register more
atexit hooks from inside an atexit hook.

Add tests that cover all the cases:
1. registering a hook from inside a hook
2. registering a hook from another thread while hooks are running
3. attempting to register a hook after all hooks have finished
   (disallowed)
@NHDaly NHDaly requested a review from kpamnany May 18, 2023 21:04
@NHDaly NHDaly added the backport 1.9 Change should be backported to release-1.9 label May 18, 2023
@KristofferC KristofferC mentioned this pull request May 19, 2023
@NHDaly NHDaly requested a review from pchintalapudi May 19, 2023 20:46
NHDaly and others added 2 commits May 25, 2023 16:57
Co-authored-by: Jameson Nash <[email protected]>
@NHDaly
Member Author

NHDaly commented May 28, 2023

Ready for a final review.

@NHDaly NHDaly requested a review from vtjnash May 28, 2023 00:20
@vtjnash vtjnash merged commit 20752db into master May 30, 2023
@vtjnash vtjnash deleted the nhd-atexit_hook-lock-during-atexit branch May 30, 2023 16:23
@vtjnash
Member

vtjnash commented May 30, 2023

Error in testset atexit:
Test Failed at C:\buildkite-agent\builds\win2k22-amdci6-1\julialang\julia-master\julia-48583a900c\share\julia\test\atexit.jl:260
  Expression: run(cmd_eval)
    Expected: ProcessFailedException
  No exception thrown

https://buildkite.com/julialang/julia-master/builds/24379#01886d90-17c9-48d9-9710-b2b18d515682

@IanButterworth
Member

@NHDaly this bleeds output into the test logs and adds noise. Would it be possible to avoid that?

  | binaryplatforms                                  (3) |     2.30 |   0.03 |  1.3 |     243.02 |  4284.28
  | atexit                                           (3) |        started at 2023-06-02T18:34:33.049
  | From worker 3:	INSIDE
  | From worker 3:	go
  | From worker 3:	done
  | From worker 3:	exit11
  | From worker 3:	INSIDE
  | From worker 3:	go
  | From worker 3:	done
  | From worker 3:	exit11
  | From worker 3:	FINALIZER
  | From worker 3:	ready
  | From worker 3:	done
  | From worker 3:	FINALIZER
  | From worker 3:	ready
  | From worker 3:	done
  | atexit                                           (3) |     7.24 |   0.00 |  0.0 |       5.29 |  4284.28
  | enums                                            (3) \|        started at 2023-06-02T18:34:40.289

@NHDaly
Member Author

NHDaly commented Jun 5, 2023

Oh, whoops! I think those logs were just left in from debugging. Thanks! I'll clean that up right now.

@@ -354,6 +354,7 @@ const atexit_hooks = Callable[
() -> Filesystem.temp_cleanup_purge(force=true)
]
const _atexit_hooks_lock = ReentrantLock()
global _atexit_hooks_finished::Bool = false
Member Author

@vtjnash: Oh, wait, this needs to be volatile, right? Since it's used inside a while loop across multiple threads?

So this should be an atomic, since we don't have the ability to do volatile without atomic, right?

Member

volatile is only meaningful for special memory mapped hardware registers, and x86 doesn't have any of those

Member Author

Is that true? Isn't it also needed to prevent global variables from getting pulled into either a register or a stack-local variable and thus breaking coordination across threads?

I think we've seen this in julia code before, that unless you mark the variable as an Atomic it ends up inlined as a local into the function body.

Here's an example:

julia> mutable struct Foo
           x::Int
       end

julia> function f(z::Foo)
                  while z.x > 0
                  end
                  return z.x
       end
f (generic function with 1 method)

julia> @code_native f(Foo(0))
        .section        __TEXT,__text,regular,pure_instructions
        .build_version macos, 13, 0
        .globl  _julia_f_331                    ; -- Begin function julia_f_331
        .p2align        2
_julia_f_331:                           ; @julia_f_331
; ┌ @ REPL[11]:1 within `f`
        .cfi_startproc
; %bb.0:                                ; %top
; │ @ REPL[11] within `f`
        ;DEBUG_VALUE: f:z <- [DW_OP_deref] $x0
        ;DEBUG_VALUE: f:z <- [DW_OP_deref] 0
        ldr     x0, [x0]
; │ @ REPL[11]:2 within `f`
        cmp     x0, #1                          ; =1
        b.lt    LBB0_2
LBB0_1:                                 ; %L1
                                        ; =>This Inner Loop Header: Depth=1
        b       LBB0_1
LBB0_2:                                 ; %L5
; │ @ REPL[11]:4 within `f`
        ret
        .cfi_endproc
; └
                                        ; -- End function
.subsections_via_symbols

And you can see that this breaks coordination across tasks:

julia> Threads.nthreads()
4

julia> global t::Foo = Foo(1)
Foo(1)

julia> Threads.@spawn @info f(t)
Task (runnable) @0x000000015b4bd5a0

julia> t.x = -1
-1

julia>

julia> t.x
-1

Whereas by marking this variable as an Atomic, it fixes the inlining of the variable, and its value is fetched on every iteration of the loop:

julia> mutable struct Foo2
           @atomic x::Int
       end

julia> function f(z::Foo2)
                  while @atomic(z.x) > 0
                  end
                  return @atomic z.x
       end
f (generic function with 2 methods)

julia> @code_native f(Foo2(0))
        .section        __TEXT,__text,regular,pure_instructions
        .build_version macos, 13, 0
        .globl  _julia_f_350                    ; -- Begin function julia_f_350
        .p2align        2
_julia_f_350:                           ; @julia_f_350
; ┌ @ REPL[30]:1 within `f`
        .cfi_startproc
; %bb.0:                                ; %top
; │ @ REPL[30] within `f`
        ;DEBUG_VALUE: f:z <- [DW_OP_deref] $x0
        ;DEBUG_VALUE: f:z <- [DW_OP_deref] 0
LBB0_1:                                 ; %L1
                                        ; =>This Inner Loop Header: Depth=1
; │ @ REPL[30]:2 within `f`
; │┌ @ Base.jl:50 within `getproperty`
        ldar    x8, [x0]
; │└
        cmp     x8, #0                          ; =0
        b.gt    LBB0_1
; %bb.2:                                ; %L5
; │ @ REPL[30]:4 within `f`
; │┌ @ Base.jl:50 within `getproperty`
        ldar    x0, [x0]
; │└
        ret
        .cfi_endproc
; └
                                        ; -- End function
.subsections_via_symbols

julia>

And now this issue is fixed as well:

julia> Threads.nthreads()
4

julia> global t2::Foo2 = Foo2(1)
Foo2(1)

julia> Threads.@spawn @info f(t2)
Task (runnable) @0x000000017234a290

julia> @atomic t2.x = -1
[ Info: -1
-1

^^ So I do think there is an issue in the code as written. I am pretty sure that this issue is part of what is encompassed by the term "volatility", but using whatever name we want to use for it, it's an issue, yeah?

Member Author

Although in this situation, it's also fixed by using a lock, which I guess introduces the needed barriers... 🤔

julia> mutable struct Foo
           x::Int
           lock::ReentrantLock
       end

julia> function f(z::Foo)
                  while @lock z.lock (z.x > 0)
                  end
                  return @lock z.lock z.x
       end
f (generic function with 1 method)

julia> global t::Foo = Foo(1, ReentrantLock())
Foo(1, ReentrantLock(nothing, 0x00000000, 0x00, Base.GenericCondition{Base.Threads.SpinLock}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0
)), (2, 4333405504, 4499986432)))

julia> Threads.@spawn @info f(t)
Task (runnable) @0x0000000281dbf0f0

julia> @lock t.lock t.x = -1
[ Info: -1
-1

So are you saying that the lock like that should be sufficient in all cases, and we don't also need to mark the variable atomic, to get it to become volatile?

Member

I think what Jameson meant is that the notion of "volatile" is irrelevant here. The notion that matters is atomicity (https://en.cppreference.com/w/c/language/atomic): you get it either by marking the Bool atomic or by protecting it with a lock. Before C11, volatile and atomic were often conflated.

From https://en.cppreference.com/w/c/language/volatile

Note that volatile variables are not suitable for communication between threads; they do not offer atomicity, synchronization, or memory ordering. A read from a volatile variable that is modified by another thread without synchronization or concurrent modification from two unsynchronized threads is undefined behavior due to a data race.

Member Author

aha. Thank you both! That's clear now, then. 👍 Thanks for educating me!

So, a Julia ReentrantLock is enough to also provide the LLVM / C concept of atomicity? We can guarantee that updates to the value will be visible across threads?

Now that we're discussing it, the answer seems obviously to be yes, otherwise how would anything work... OKAY, thanks, this is helpful.
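As a concrete takeaway (an illustrative sketch, not the PR's actual code): guarding every read and write of a plain field with the same lock is enough, because the lock's release/acquire pairs establish the ordering that an @atomic annotation would otherwise provide.

```julia
# A plain (non-atomic) Bool coordinated purely through a ReentrantLock.
# The lock's acquire/release semantics make the write in open! visible
# to any later is_open call on another thread.
mutable struct Gate
    open::Bool
    lk::ReentrantLock
end
Gate() = Gate(false, ReentrantLock())

# Every access goes through the lock; no @atomic needed on the field.
is_open(g::Gate) = lock(() -> g.open, g.lk)
open!(g::Gate) = lock(() -> (g.open = true), g.lk)
```

A waiter can then spin with `while !is_open(g); yield(); end` and is guaranteed to eventually observe an `open!` from another thread, matching the ReentrantLock-based example above.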

KristofferC pushed a commit that referenced this pull request Jun 26, 2023
Ensure the lock is precise, so that we are allowed to register new
atexit hooks from inside an atexit hook. But then disable `atexit()`
when shutting down after it finishes running.

Add tests that cover all the cases:
1. registering a hook from inside a hook
2. registering a hook from another thread while hooks are running
3. attempting to register a hook after all hooks have finished
   (disallowed)

Fixes #49841
Co-authored-by: Jameson Nash <[email protected]>

(cherry picked from commit 20752db)
@KristofferC KristofferC removed the backport 1.9 Change should be backported to release-1.9 label Jul 10, 2023
Successfully merging this pull request may close these issues.

Should atexit_hooks be guarded during _atexit?
5 participants