-
-
Notifications
You must be signed in to change notification settings - Fork 758
--bisect deadlocks when reporting results #2637
Comments
Thanks a lot for this detailed answer and the clues. I would love to take a deeper look on this one. In the meantime did you see this discussion rspec/rspec-rails#1353 ? |
@DavidS I'd love it if you could provide us with an isolated reproduction of this, my time is quite limited at the moment and I won't be able to get to a reproduction containing complex code like puppet, another reason for this is to exclude the possibility of puppet itself causing the deadlock. If it is just a buffer issue as you suggest it should be possible to trigger with only "RSpec" code right? Alternatively I'm open to suggestions for detecting deadlocks and preventing them within RSpec itself? |
Definitely 💚 |
require 'rspec'
RSpec.describe "a bunch of nothing" do
(0...3000).each do |t|
it { expect(t).to eq t }
end
end is an example that immediately deadlocks for me when running under
By changing the |
Thanks, yes that triggers the issue for me, as a work around the shell runner of course works, but then I guess thats why you have #2638 open 😂 |
Exactly :-D Thanks a lot for the time and work y'all put into rspec. |
Thanks. I was able to reproduce it on my mac with a very basic example like the one you mentioned. https://github.com/benoittgt/rspec_repro_bisect_deadlock |
A very interesting answer has been posted by @palkan here : benoittgt/rspec_repro_bisect_deadlock#1 I had also success removing |
@palkan comes to the same conclusion that is in my original analysis. |
Thanks a lot @DavidS for those additionnals information. What do you think about the proposal of @palkan
I tried in def send(message)
packet = Marshal.dump(message)
- @write_io.write("#{packet.bytesize}\n#{packet}")
+ @write_io.write_nonblock("#{packet.bytesize}\n#{packet}")
end
# rubocop:disable Security/MarshalLoad
def receive
packet_size = Integer(@read_io.gets)
- Marshal.load(@read_io.read(packet_size))
+ Marshal.load(@read_io.read_nonblock(packet_size)) but
I am ok to remove the |
Reading through the docs of write_nonblock (https://apidock.com/ruby/IO/write_nonblock) I also think that in the original error case you would just get a Errno::EWOULDBLOCK instead of writing the data out. |
I tried this: it fills the buffer and returns 65536. Not sure when Errno::EWOULDBLOCK could happen. My initial suggestion that we can leverage Probably, the following refactoring could be a bit better than just dropping def dispatch_run(run_descriptor)
- @run_dispatcher.dispatch_specs(run_descriptor)
+ pid = @run_dispatcher.dispatch_specs(run_descriptor)
@channel.receive.tap do |result|
# ...
Process.waitpid(pid)
end
end |
@palkan do you feel like working up a patch with something like that? |
@JonRowe Not in the next couple of weeks. I can try to find another evil martian to help with this) |
It is working with @palkan proposal @@ -1,9 +1,10 @@
def dispatch_run(run_descriptor)
- @run_dispatcher.dispatch_specs(run_descriptor)
+ pid = @run_dispatcher.dispatch_specs(run_descriptor)
@channel.receive.tap do |result|
if result.is_a?(String)
raise BisectFailedError.for_failed_spec_run(result)
end
+ Process.waitpid(pid)
end
end
@@ -23,6 +24,5 @@
end
def dispatch_specs(run_descriptor)
- pid = fork { run_specs(run_descriptor) }
- Process.waitpid(pid)
+ fork { puts run_specs(run_descriptor) }
end But I didn't verified yet what David mentioned.
I dig into MRI source code but at the moment I no clue of what can happens when it not "cleanup". I don't see it. Also another proposal will be to use
--- a/lib/rspec/core/bisect/fork_runner.rb
+++ b/lib/rspec/core/bisect/fork_runner.rb
@@ -92,7 +92,7 @@ module RSpec
def dispatch_specs(run_descriptor)
pid = fork { run_specs(run_descriptor) }
- Process.waitpid(pid)
+ Process.waitpid(pid, Process::WNOHANG) It is working too. |
I recently use my last proposition with the Don't you think a PR with that change will be a good idea? Not sure about how we can test this at the moment. I have to think about it. |
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1 First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**. ```diff RSpec.describe "a bunch of nothing" do (0...(ENV.fetch('N', 3000).to_i)).each do |t| it { expect(t).to eq t } end end ``` Try to run with `N=1547` and `N=1548`. Seems suspicious, right? Let's add `pry-byebug` to the equation (or Gemfile). In order it to work we need to tweak our runner code a bit: ```diff - $stdout = $stderr = @spec_output + # $stdout = $stderr = @spec_output ``` After a bit of `puts` debugging I localized the problem: [`@channel.send`](https://github.com/rspec/rspec-core/blob/7b6b9c3f2e2878213f97d6fc9e9eb23c323cfe1c/lib/rspec/core/bisect/fork_runner.rb#L122). `Channel#send` calls `IO#write` here https://github.com/rspec/rspec-core/blob/7b6b9c3f2e2878213f97d6fc9e9eb23c323cfe1c/lib/rspec/core/bisect/utilities.rb#L41: ```ruby def send(message) packet = Marshal.dump(message) @write_io.write("#{packet.bytesize}\n#{packet}") end ``` Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**. This number is very important: the pipe size is only **65536** on MacOS (see docs for [`IO#write_nonblock`](https://ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more). That makes `@write_io.write` hangs forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete. ----------- A basic proposal is to use WNOHANG. From waitpid doc: > If WNOHANG was specified in options and there were no children > in a waitable state, then waitid() returns 0 immediately (...) Related: - #2637
@benoittgt 's |
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1 First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**. ```diff RSpec.describe "a bunch of nothing" do (0...(ENV.fetch('N', 3000).to_i)).each do |t| it { expect(t).to eq t } end end ``` Try to run with `N=1547` and `N=1548`. Seems suspicious, right? Let's add `pry-byebug` to the equation (or Gemfile). In order it to work we need to tweak our runner code a bit: ```diff - $stdout = $stderr = @spec_output + # $stdout = $stderr = @spec_output ``` After a bit of `puts` debugging I localized the problem: [`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122). `Channel#send` calls `IO#write` here /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41: ```ruby def send(message) packet = Marshal.dump(message) @write_io.write("#{packet.bytesize}\n#{packet}") end ``` Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**. This number is very important: the pipe size is only **65536** on MacOS (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more). That makes `@write_io.write` hangs forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete. ----------- A basic proposal is to use WNOHANG. From waitpid doc: > If WNOHANG was specified in options and there were no children > in a waitable state, then waitid() returns 0 immediately (...) To validate this proposal on OSX we run just before running bisect: `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'` This will give us in loop the PIPE sizes of Ruby processes. Without our patch we see that quickly we hit 65536 bytes on two pipes, with the patch we keep pipes at the right size. ``` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ruby 40134 benoit 3 PIPE 0xf3b025a6a6cd6005 16384 ->0xf3b025a6a6cd5045 ruby 40134 benoit 4 PIPE 0xf3b025a6a6cd5045 16384 ->0xf3b025a6a6cd6005 ruby 40134 benoit 5 PIPE 0xf3b025a6a6cd7805 16384 ->0xf3b025a6a6cd7145 ruby 40134 benoit 7 PIPE 0xf3b025a6a6cd7145 16384 ->0xf3b025a6a6cd7805 ruby 40134 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40134 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ruby 40144 benoit 3 PIPE 0xf3b025a6a6cd5d05 16384 ->0xf3b025a6a6cd5c45 ruby 40144 benoit 4 PIPE 0xf3b025a6a6cd5c45 16384 ->0xf3b025a6a6cd5d05 ruby 40144 benoit 5 PIPE 0xf3b025a6a6cd7085 16384 ->0xf3b025a6a6cd6785 ruby 40144 benoit 7 PIPE 0xf3b025a6a6cd6785 16384 ->0xf3b025a6a6cd7085 ruby 40144 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40144 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ``` Improvements: The bisect command request lot's of ram. The next step should be to reduce that consumption. Related: - #2637
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1 First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**. ```diff RSpec.describe "a bunch of nothing" do (0...(ENV.fetch('N', 3000).to_i)).each do |t| it { expect(t).to eq t } end end ``` Try to run with `N=1547` and `N=1548`. Seems suspicious, right? Let's add `pry-byebug` to the equation (or Gemfile). In order it to work we need to tweak our runner code a bit: ```diff - $stdout = $stderr = @spec_output + # $stdout = $stderr = @spec_output ``` After a bit of `puts` debugging I localized the problem: [`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122). `Channel#send` calls `IO#write` here /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41: ```ruby def send(message) packet = Marshal.dump(message) @write_io.write("#{packet.bytesize}\n#{packet}") end ``` Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**. This number is very important: the pipe size is only **65536** on MacOS (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more). That makes `@write_io.write` hangs forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete. ----------- A basic proposal will be to use WNOHANG. From waitpid doc: > If WNOHANG was specified in options and there were no children > in a waitable state, then waitid() returns 0 immediately (...) To validate this proposal on OSX we run just before running bisect: `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'` This will give us in loop the PIPE sizes of Ruby processes. Without our patch we see that quickly we hit 65536 bytes on two pipes, with the patch we keep pipes at the right size. ``` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ruby 40134 benoit 3 PIPE 0xf3b025a6a6cd6005 16384 ->0xf3b025a6a6cd5045 ruby 40134 benoit 4 PIPE 0xf3b025a6a6cd5045 16384 ->0xf3b025a6a6cd6005 ruby 40134 benoit 5 PIPE 0xf3b025a6a6cd7805 16384 ->0xf3b025a6a6cd7145 ruby 40134 benoit 7 PIPE 0xf3b025a6a6cd7145 16384 ->0xf3b025a6a6cd7805 ruby 40134 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40134 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ruby 40144 benoit 3 PIPE 0xf3b025a6a6cd5d05 16384 ->0xf3b025a6a6cd5c45 ruby 40144 benoit 4 PIPE 0xf3b025a6a6cd5c45 16384 ->0xf3b025a6a6cd5d05 ruby 40144 benoit 5 PIPE 0xf3b025a6a6cd7085 16384 ->0xf3b025a6a6cd6785 ruby 40144 benoit 7 PIPE 0xf3b025a6a6cd6785 16384 ->0xf3b025a6a6cd7085 ruby 40144 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40144 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ``` But if we look properly from the doc we can even go further. > If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended. and as pirj mention: "With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it." Removing "waitpid" produce the same behavior as with `WNOHANG`. Improvements: The bisect command request lot's of ram. The next step should be to reduce that consumption. Related: - fix: #2637 - PR discussion: #2669
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1 First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**. ```diff RSpec.describe "a bunch of nothing" do (0...(ENV.fetch('N', 3000).to_i)).each do |t| it { expect(t).to eq t } end end ``` Try to run with `N=1547` and `N=1548`. Seems suspicious, right? Let's add `pry-byebug` to the equation (or Gemfile). In order it to work we need to tweak our runner code a bit: ```diff - $stdout = $stderr = @spec_output + # $stdout = $stderr = @spec_output ``` After a bit of `puts` debugging I localized the problem: [`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122). `Channel#send` calls `IO#write` here /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41: ```ruby def send(message) packet = Marshal.dump(message) @write_io.write("#{packet.bytesize}\n#{packet}") end ``` Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**. This number is very important: the pipe size is only **65536** on MacOS (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more). That makes `@write_io.write` hangs forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete. ----------- A basic proposal will be to use WNOHANG. From waitpid doc: > If WNOHANG was specified in options and there were no children > in a waitable state, then waitid() returns 0 immediately (...) To validate this proposal on OSX we run just before running bisect: `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'` This will give us in loop the PIPE sizes of Ruby processes. Without our patch we see that quickly we hit 65536 bytes on two pipes, with the patch we keep pipes at the right size. ``` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ruby 40134 benoit 3 PIPE 0xf3b025a6a6cd6005 16384 ->0xf3b025a6a6cd5045 ruby 40134 benoit 4 PIPE 0xf3b025a6a6cd5045 16384 ->0xf3b025a6a6cd6005 ruby 40134 benoit 5 PIPE 0xf3b025a6a6cd7805 16384 ->0xf3b025a6a6cd7145 ruby 40134 benoit 7 PIPE 0xf3b025a6a6cd7145 16384 ->0xf3b025a6a6cd7805 ruby 40134 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40134 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ruby 40144 benoit 3 PIPE 0xf3b025a6a6cd5d05 16384 ->0xf3b025a6a6cd5c45 ruby 40144 benoit 4 PIPE 0xf3b025a6a6cd5c45 16384 ->0xf3b025a6a6cd5d05 ruby 40144 benoit 5 PIPE 0xf3b025a6a6cd7085 16384 ->0xf3b025a6a6cd6785 ruby 40144 benoit 7 PIPE 0xf3b025a6a6cd6785 16384 ->0xf3b025a6a6cd7085 ruby 40144 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40144 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ``` But if we look properly from the doc we can even go further. > If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended. and as pirj mention: "With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it." Removing "waitpid" produce the same behavior as with `WNOHANG`. Improvements: The bisect command request lot's of ram. The next step should be to reduce that consumption. Related: - fix: #2637 - PR discussion: #2669
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1 First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**. ```diff RSpec.describe "a bunch of nothing" do (0...(ENV.fetch('N', 3000).to_i)).each do |t| it { expect(t).to eq t } end end ``` Try to run with `N=1547` and `N=1548`. Seems suspicious, right? Let's add `pry-byebug` to the equation (or Gemfile). In order it to work we need to tweak our runner code a bit: ```diff - $stdout = $stderr = @spec_output + # $stdout = $stderr = @spec_output ``` After a bit of `puts` debugging I localized the problem: [`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122). `Channel#send` calls `IO#write` here /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41: ```ruby def send(message) packet = Marshal.dump(message) @write_io.write("#{packet.bytesize}\n#{packet}") end ``` Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**. This number is very important: the pipe size is only **65536** on MacOS (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more). That makes `@write_io.write` hangs forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete. ----------- A basic proposal will be to use WNOHANG. From waitpid doc: > If WNOHANG was specified in options and there were no children > in a waitable state, then waitid() returns 0 immediately (...) To validate this proposal on OSX we run just before running bisect: `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'` This will give us in loop the PIPE sizes of Ruby processes. Without our patch we see that quickly we hit 65536 bytes on two pipes, with the patch we keep pipes at the right size. ``` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ruby 40134 benoit 3 PIPE 0xf3b025a6a6cd6005 16384 ->0xf3b025a6a6cd5045 ruby 40134 benoit 4 PIPE 0xf3b025a6a6cd5045 16384 ->0xf3b025a6a6cd6005 ruby 40134 benoit 5 PIPE 0xf3b025a6a6cd7805 16384 ->0xf3b025a6a6cd7145 ruby 40134 benoit 7 PIPE 0xf3b025a6a6cd7145 16384 ->0xf3b025a6a6cd7805 ruby 40134 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40134 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ruby 40144 benoit 3 PIPE 0xf3b025a6a6cd5d05 16384 ->0xf3b025a6a6cd5c45 ruby 40144 benoit 4 PIPE 0xf3b025a6a6cd5c45 16384 ->0xf3b025a6a6cd5d05 ruby 40144 benoit 5 PIPE 0xf3b025a6a6cd7085 16384 ->0xf3b025a6a6cd6785 ruby 40144 benoit 7 PIPE 0xf3b025a6a6cd6785 16384 ->0xf3b025a6a6cd7085 ruby 40144 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40144 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ``` But if we look properly from the doc we can even go further. > If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended. and as pirj mention: "With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it." Removing "waitpid" produce the same behavior as with `WNOHANG`. Improvements: The bisect command request lot's of ram. The next step should be to reduce that consumption. Related: - fix: #2637 - PR discussion: #2669
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1 First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**. ```diff RSpec.describe "a bunch of nothing" do (0...(ENV.fetch('N', 3000).to_i)).each do |t| it { expect(t).to eq t } end end ``` Try to run with `N=1547` and `N=1548`. Seems suspicious, right? Let's add `pry-byebug` to the equation (or Gemfile). In order it to work we need to tweak our runner code a bit: ```diff - $stdout = $stderr = @spec_output + # $stdout = $stderr = @spec_output ``` After a bit of `puts` debugging I localized the problem: [`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122). `Channel#send` calls `IO#write` here /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41: ```ruby def send(message) packet = Marshal.dump(message) @write_io.write("#{packet.bytesize}\n#{packet}") end ``` Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**. This number is very important: the pipe size is only **65536** on MacOS (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more). That makes `@write_io.write` hangs forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete. ----------- A basic proposal will be to use WNOHANG. From waitpid doc: > If WNOHANG was specified in options and there were no children > in a waitable state, then waitid() returns 0 immediately (...) To validate this proposal on OSX we run just before running bisect: `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'` This will give us in loop the PIPE sizes of Ruby processes. Without our patch we see that quickly we hit 65536 bytes on two pipes, with the patch we keep pipes at the right size. ``` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ruby 40134 benoit 3 PIPE 0xf3b025a6a6cd6005 16384 ->0xf3b025a6a6cd5045 ruby 40134 benoit 4 PIPE 0xf3b025a6a6cd5045 16384 ->0xf3b025a6a6cd6005 ruby 40134 benoit 5 PIPE 0xf3b025a6a6cd7805 16384 ->0xf3b025a6a6cd7145 ruby 40134 benoit 7 PIPE 0xf3b025a6a6cd7145 16384 ->0xf3b025a6a6cd7805 ruby 40134 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40134 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ruby 40144 benoit 3 PIPE 0xf3b025a6a6cd5d05 16384 ->0xf3b025a6a6cd5c45 ruby 40144 benoit 4 PIPE 0xf3b025a6a6cd5c45 16384 ->0xf3b025a6a6cd5d05 ruby 40144 benoit 5 PIPE 0xf3b025a6a6cd7085 16384 ->0xf3b025a6a6cd6785 ruby 40144 benoit 7 PIPE 0xf3b025a6a6cd6785 16384 ->0xf3b025a6a6cd7085 ruby 40144 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40144 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ``` But if we look properly from the doc we can even go further. > If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended. and as pirj mention: "With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it." Removing "waitpid" produce the same behavior as with `WNOHANG`. Improvements: The bisect command request lot's of ram. The next step should be to reduce that consumption. Related: - fix: #2637 - PR discussion: #2669
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1 First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**. ```diff RSpec.describe "a bunch of nothing" do (0...(ENV.fetch('N', 3000).to_i)).each do |t| it { expect(t).to eq t } end end ``` Try to run with `N=1547` and `N=1548`. Seems suspicious, right? Let's add `pry-byebug` to the equation (or Gemfile). In order it to work we need to tweak our runner code a bit: ```diff - $stdout = $stderr = @spec_output + # $stdout = $stderr = @spec_output ``` After a bit of `puts` debugging I localized the problem: [`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122). `Channel#send` calls `IO#write` here /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41: ```ruby def send(message) packet = Marshal.dump(message) @write_io.write("#{packet.bytesize}\n#{packet}") end ``` Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**. This number is very important: the pipe size is only **65536** on MacOS (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more). That makes `@write_io.write` hangs forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete. ----------- A basic proposal will be to use WNOHANG. From waitpid doc: > If WNOHANG was specified in options and there were no children > in a waitable state, then waitid() returns 0 immediately (...) To validate this proposal on OSX we run just before running bisect: `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'` This will give us in loop the PIPE sizes of Ruby processes. Without our patch we see that quickly we hit 65536 bytes on two pipes, with the patch we keep pipes at the right size. ``` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ruby 40134 benoit 3 PIPE 0xf3b025a6a6cd6005 16384 ->0xf3b025a6a6cd5045 ruby 40134 benoit 4 PIPE 0xf3b025a6a6cd5045 16384 ->0xf3b025a6a6cd6005 ruby 40134 benoit 5 PIPE 0xf3b025a6a6cd7805 16384 ->0xf3b025a6a6cd7145 ruby 40134 benoit 7 PIPE 0xf3b025a6a6cd7145 16384 ->0xf3b025a6a6cd7805 ruby 40134 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40134 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ruby 40144 benoit 3 PIPE 0xf3b025a6a6cd5d05 16384 ->0xf3b025a6a6cd5c45 ruby 40144 benoit 4 PIPE 0xf3b025a6a6cd5c45 16384 ->0xf3b025a6a6cd5d05 ruby 40144 benoit 5 PIPE 0xf3b025a6a6cd7085 16384 ->0xf3b025a6a6cd6785 ruby 40144 benoit 7 PIPE 0xf3b025a6a6cd6785 16384 ->0xf3b025a6a6cd7085 ruby 40144 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40144 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ``` But if we look properly from the doc we can even go further. > If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended. and as pirj mention: "With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it." Removing "waitpid" produce the same behavior as with `WNOHANG`. Improvements: The bisect command request lot's of ram. The next step should be to reduce that consumption. Related: - fix: #2637 - PR discussion: #2669
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1 First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**. ```diff RSpec.describe "a bunch of nothing" do (0...(ENV.fetch('N', 3000).to_i)).each do |t| it { expect(t).to eq t } end end ``` Try to run with `N=1547` and `N=1548`. Seems suspicious, right? Let's add `pry-byebug` to the equation (or Gemfile). In order it to work we need to tweak our runner code a bit: ```diff - $stdout = $stderr = @spec_output + # $stdout = $stderr = @spec_output ``` After a bit of `puts` debugging I localized the problem: [`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122). `Channel#send` calls `IO#write` here /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41: ```ruby def send(message) packet = Marshal.dump(message) @write_io.write("#{packet.bytesize}\n#{packet}") end ``` Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**. This number is very important: the pipe size is only **65536** on MacOS (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more). That makes `@write_io.write` hangs forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete. ----------- A basic proposal will be to use WNOHANG. From waitpid doc: > If WNOHANG was specified in options and there were no children > in a waitable state, then waitid() returns 0 immediately (...) To validate this proposal on OSX we run just before running bisect: `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'` This will give us in loop the PIPE sizes of Ruby processes. Without our patch we see that quickly we hit 65536 bytes on two pipes, with the patch we keep pipes at the right size. ``` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ruby 40134 benoit 3 PIPE 0xf3b025a6a6cd6005 16384 ->0xf3b025a6a6cd5045 ruby 40134 benoit 4 PIPE 0xf3b025a6a6cd5045 16384 ->0xf3b025a6a6cd6005 ruby 40134 benoit 5 PIPE 0xf3b025a6a6cd7805 16384 ->0xf3b025a6a6cd7145 ruby 40134 benoit 7 PIPE 0xf3b025a6a6cd7145 16384 ->0xf3b025a6a6cd7805 ruby 40134 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40134 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ruby 40144 benoit 3 PIPE 0xf3b025a6a6cd5d05 16384 ->0xf3b025a6a6cd5c45 ruby 40144 benoit 4 PIPE 0xf3b025a6a6cd5c45 16384 ->0xf3b025a6a6cd5d05 ruby 40144 benoit 5 PIPE 0xf3b025a6a6cd7085 16384 ->0xf3b025a6a6cd6785 ruby 40144 benoit 7 PIPE 0xf3b025a6a6cd6785 16384 ->0xf3b025a6a6cd7085 ruby 40144 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40144 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ``` But if we look properly from the doc we can even go further. > If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended. and as pirj mention: "With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it." Removing "waitpid" produce the same behavior as with `WNOHANG`. Improvements: The bisect command request lot's of ram. The next step should be to reduce that consumption. Related: - fix: #2637 - PR discussion: #2669
I just ran into this—I'm on rspec-core 3.9.2 on OS X and still seem to get the hanging behavior right before it outputs the failing and non-failing examples. Using shell runner works... Does anyone have suggestions on how to debug what is going on? |
Hello @agibralter There is many commands in the PR that may help you to understand what is happening. If you succeed to find an easy way to reproduce it I would be very happy to patch it. |
From Palkan in benoittgt/rspec_repro_bisect_deadlock#1 First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**. ```diff RSpec.describe "a bunch of nothing" do (0...(ENV.fetch('N', 3000).to_i)).each do |t| it { expect(t).to eq t } end end ``` Try to run with `N=1547` and `N=1548`. Seems suspicious, right? Let's add `pry-byebug` to the equation (or Gemfile). In order it to work we need to tweak our runner code a bit: ```diff - $stdout = $stderr = @spec_output + # $stdout = $stderr = @spec_output ``` After a bit of `puts` debugging I localized the problem: [`@channel.send`](/lib/rspec/core/bisect/fork_runner.rb@7b6b9c3#L122). `Channel#send` calls `IO#write` here /lib/rspec/core/bisect/utilities.rb@7b6b9c3#L41: ```ruby def send(message) packet = Marshal.dump(message) @write_io.write("#{packet.bytesize}\n#{packet}") end ``` Do you know, what is the `packet.bytesize` for `N=1548`? It's **65548**. This number is very important: the pipe size is only **65536** on MacOS (see docs for [`IO#write_nonblock`](ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more). That makes `@write_io.write` hangs forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete. ----------- A basic proposal will be to use WNOHANG. From waitpid doc: > If WNOHANG was specified in options and there were no children > in a waitable state, then waitid() returns 0 immediately (...) To validate this proposal on OSX we run just before running bisect: `lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'` This will give us in loop the PIPE sizes of Ruby processes. Without our patch we see that quickly we hit 65536 bytes on two pipes, with the patch we keep pipes at the right size. ``` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME ruby 40134 benoit 3 PIPE 0xf3b025a6a6cd6005 16384 ->0xf3b025a6a6cd5045 ruby 40134 benoit 4 PIPE 0xf3b025a6a6cd5045 16384 ->0xf3b025a6a6cd6005 ruby 40134 benoit 5 PIPE 0xf3b025a6a6cd7805 16384 ->0xf3b025a6a6cd7145 ruby 40134 benoit 7 PIPE 0xf3b025a6a6cd7145 16384 ->0xf3b025a6a6cd7805 ruby 40134 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40134 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ruby 40144 benoit 3 PIPE 0xf3b025a6a6cd5d05 16384 ->0xf3b025a6a6cd5c45 ruby 40144 benoit 4 PIPE 0xf3b025a6a6cd5c45 16384 ->0xf3b025a6a6cd5d05 ruby 40144 benoit 5 PIPE 0xf3b025a6a6cd7085 16384 ->0xf3b025a6a6cd6785 ruby 40144 benoit 7 PIPE 0xf3b025a6a6cd6785 16384 ->0xf3b025a6a6cd7085 ruby 40144 benoit 10 PIPE 0xf3b025a6a6cd6fc5 16384 ->0xf3b025a6a6cd5a05 ruby 40144 benoit 11 PIPE 0xf3b025a6a6cd5a05 16384 ->0xf3b025a6a6cd6fc5 ``` But if we look properly from the doc we can even go further. > If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended. and as pirj mention: "With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it." Removing "waitpid" produce the same behavior as with `WNOHANG`. Improvements: The bisect command request lot's of ram. The next step should be to reduce that consumption. Related: - fix: rspec#2637 - PR discussion: rspec#2669
I've contributed a PR which addresses a specific situation in which Specifically when encoding is set to UTF-8 (eg. by Rails) and the forked child process tries to send back BINARY/ASCII-8BIT encoded data to the parent, but the communication channel ( The parent process, whose output you're seeing, waits indefinitely for the child process which has since errored out. |
--bisect deadlocks when reporting results
at the point of ctrl+c, the process has been already sitting a lot longer than the test suite would run.
stracing the processes shows the following situation:
where pid 6090 is the
--bisect
process and pid 6096 is the child rspec process. From other traces I've been running, I understand that theppoll
is waiting on an IO event after/while writing the results to fd 10. Meanwhile the main process is hanging inwaitpid
atrspec-core/lib/rspec/core/bisect/fork_runner.rb
Line 95 in 2800fe1
A common reason why this might not show up in testing is if the result report in the tests is smaller than the underlying OS's buffer size. In that case the runner process exits after writing to the buffer and the parent continues happily reading from the buffer. In my case the testsuite results are ~93kB and the processes deadlock.
Your environment
Steps to reproduce
bundle install
the gems according to theGemfile.lock
while this creates results:
The text was updated successfully, but these errors were encountered: