--bisect deadlocks when reporting results #2637

DavidS · 2019-06-13T13:30:45Z


david@davids:~/git/puppet-resource_api$ bundle exec rspec --seed 40589 --bisect=verbose --pattern spec/\*\*\{,/\*/\*\*\}/\*_spec.rb  --exclude-pattern 'spec/{fixtures/**/*.rb,fixtures/modules/*/**/*.rb}' 
Bisect started using options: "--seed 40589 --pattern spec/**{,/*/**}/*_spec.rb --exclude-pattern spec/{fixtures/**/*.rb,fixtures/modules/*/**/*.rb}" and bisect runner: :fork
Running suite to find failures...^C

Bisect aborted!

The most minimal reproduction command discovered so far is:
  (Not yet enough information to provide any repro command)


Bisect aborted!

The most minimal reproduction command discovered so far is:
  (Not yet enough information to provide any repro command)

david@davids:~/git/puppet-resource_api$

At the point of Ctrl+C, the process had already been sitting for a lot longer than the test suite takes to run.

Stracing the processes shows the following situation:

david@davids:~$ strace -s 10000 -p 6090 -p 6096
strace: Process 6090 attached
strace: Process 6096 attached
[pid  6090] wait4(6096,  <unfinished ...>
[pid  6096] ppoll([{fd=10, events=POLLOUT}], 1, NULL, NULL, 8^Cstrace: Process 6090 detached
strace: Process 6096 detached
 <detached ...>

david@davids:~$

where pid 6090 is the --bisect process and pid 6096 is the child rspec process. From other traces I've been running, I understand that the ppoll is waiting on an IO event after/while writing the results to fd 10. Meanwhile the main process is hanging in waitpid at

rspec-core/lib/rspec/core/bisect/fork_runner.rb

Line 95 in 2800fe1

Process.waitpid(pid)

A common reason why this might not show up in testing is that the result report in the tests is smaller than the underlying OS's pipe buffer. In that case the runner process exits after writing to the buffer and the parent continues happily reading from the buffer. In my case the test suite results are ~93kB and the processes deadlock.
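The buffer-size dependence described above can be observed outside of RSpec. The following is only a minimal sketch, not RSpec code: `child_exits_without_reader?` is a hypothetical helper, and it assumes a POSIX system with `fork` whose default pipe buffer is far smaller than 4 MB (the thread reports 65536 bytes). It polls the child instead of blocking, so the sketch itself cannot hang.

```ruby
# Hypothetical helper (not part of rspec-core): fork a child that writes
# `size` bytes into a pipe that nobody reads, and report whether the child
# manages to exit within `grace` seconds.
def child_exits_without_reader?(size, grace: 2.0)
  reader, writer = IO.pipe

  pid = fork do
    reader.close
    writer.write("x" * size) # blocks once the pipe buffer is full
    writer.close
  end
  writer.close # the parent keeps only the read end

  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + grace
  exited = false
  until exited || Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
    # WNOHANG returns nil while the child is still running.
    exited = !Process.waitpid(pid, Process::WNOHANG).nil?
    sleep 0.05 unless exited
  end

  unless exited
    reader.read          # drain the pipe so the blocked child can finish
    Process.waitpid(pid) # now the blocking wait is safe
  end
  reader.close
  exited
end

puts child_exits_without_reader?(1_000)     # small report fits in the buffer
puts child_exits_without_reader?(4_000_000) # large report blocks the child
```

A blocking `Process.waitpid(pid)` in place of the polling loop would never return in the second case, which matches the `wait4` / `ppoll` pair in the strace output.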

Your environment

Ruby version: ruby 2.5.5p157 (2019-03-15 revision 67260) [x86_64-linux-gnu]
rspec-core version: 3.8.0

Steps to reproduce

Check out https://github.com/DavidS/puppet-resource_api/tree/rspec-core-repro
Run bundle install to install the gems according to the Gemfile.lock
The following command hangs:

david@davids:~/git/puppet-resource_api$ bundle exec rspec --bisect=verbose --seed 40589 --pattern 'spec/**{,/*/**}/*_spec.rb' --exclude-pattern 'spec/{fixtures/**/*.rb,fixtures/modules/*/**/*.rb}'
Bisect started using options: "--seed 40589 --pattern spec/**{,/*/**}/*_spec.rb --exclude-pattern spec/{fixtures/**/*.rb,fixtures/modules/*/**/*.rb}" and bisect runner: :fork
Running suite to find failures...

while this creates results:

david@davids:~/git/puppet-resource_api$ bundle exec rspec --bisect=verbose --require shell --seed 40589 --pattern 'spec/**{,/*/**}/*_spec.rb' --exclude-pattern 'spec/{fixtures/**/*.rb,fixtures/modules/*/**/*.rb}'
Bisect started using options: "--require shell --seed 40589 --pattern spec/**{,/*/**}/*_spec.rb --exclude-pattern spec/{fixtures/**/*.rb,fixtures/modules/*/**/*.rb}" and bisect runner: :shell
Running suite to find failures... (2 minutes 48.2 seconds)
 - Failing examples (1):
    - ./spec/puppet/resource_api_spec.rb[1:16:2:1]
 - Non-failing examples (1398):
    - ./spec/acceptance/array_spec.rb[1:1:1]
[...]


benoittgt · 2019-06-23T09:59:33Z

Thanks a lot for this detailed answer and the clues.

I would love to take a deeper look at this one. In the meantime, did you see the discussion in rspec/rspec-rails#1353?

JonRowe · 2019-06-23T10:14:19Z

@DavidS I'd love it if you could provide us with an isolated reproduction of this. My time is quite limited at the moment and I won't be able to get to a reproduction containing complex code like puppet. Another reason is to exclude the possibility of puppet itself causing the deadlock.

If it is just a buffer issue, as you suggest, it should be possible to trigger it with only "RSpec" code, right?

Alternatively I'm open to suggestions for detecting deadlocks and preventing them within RSpec itself?

benoittgt · 2019-06-23T10:15:34Z

> Alternatively I'm open to suggestions for detecting deadlocks and preventing them within RSpec itself?

Definitely 💚

DavidS · 2019-06-24T08:56:49Z

require 'rspec'

RSpec.describe "a bunch of nothing" do
  (0...3000).each do |t|
    it { expect(t).to eq t }
  end
end

is an example that immediately deadlocks for me when running under rspec --bisect:

david@davids:~/tmp/rspec-deadlock-example$ bundle exec rspec spec/nil_spec.rb --bisect 
Bisect started using options: "spec/nil_spec.rb"
Running suite to find failures...^C

Bisect aborted!

The most minimal reproduction command discovered so far is:
  (Not yet enough information to provide any repro command)


Bisect aborted!

The most minimal reproduction command discovered so far is:
  (Not yet enough information to provide any repro command)

david@davids:~/tmp/rspec-deadlock-example$ cat spec/nil_spec.rb

root@davids:~# strace -p 27326 -p 27329
strace: Process 27326 attached
strace: Process 27329 attached
[pid 27326] wait4(27329,  <unfinished ...>
[pid 27329] ppoll([{fd=8, events=POLLOUT}], 1, NULL, NULL, 8^Cstrace: Process 27326 detached
strace: Process 27329 detached
 <detached ...>

root@davids:~#

By changing the 3000, you can make the result set arbitrarily large if your OS has a larger default buffer size.

JonRowe · 2019-06-24T09:02:56Z

Thanks, yes, that triggers the issue for me. As a workaround the shell runner of course works, but then I guess that's why you have #2638 open 😂

DavidS · 2019-06-24T10:15:46Z

Exactly :-D

Thanks a lot for the time and work y'all put into rspec.

benoittgt · 2019-07-04T19:47:42Z

Thanks. I was able to reproduce it on my mac with a very basic example like the one you mentioned.

https://github.com/benoittgt/rspec_repro_bisect_deadlock

benoittgt · 2019-07-05T07:27:30Z

A very interesting answer has been posted by @palkan here: benoittgt/rspec_repro_bisect_deadlock#1

I also had success removing Process.waitpid, but I was not sure it was a good idea. I didn't understand its usage here.

DavidS · 2019-07-05T07:54:34Z

@palkan comes to the same conclusion that is in my original analysis. Process.waitpid does some OS-level cleanup that is sometimes necessary for the overall health of the OS. In this case of a single child process, we can be lazy and not do it at all, or defer it until all output is read and processed.
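The cleanup mentioned here is the reaping of the exited child's process-table entry, which otherwise stays around as a zombie until the parent waits for it or exits. As a hedged illustration only (this is not what rspec-core does), Ruby's `Process.detach` performs that wait in a background thread, so the parent can keep reading in the meantime. `reap_in_background` is a hypothetical name.

```ruby
# Hypothetical helper: let a background thread reap the child so the
# caller never blocks in waitpid itself.
def reap_in_background
  pid = fork { exit 0 }

  # Process.detach returns a Thread that waits for the child.
  reaper = Process.detach(pid)

  # Thread#value joins the thread and returns the child's Process::Status.
  status = reaper.value
  [pid, status]
end

pid, status = reap_in_background
puts status.pid == pid
puts status.success?
```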

benoittgt · 2019-07-05T10:45:00Z

Thanks a lot @DavidS for the additional information.

What do you think about @palkan's proposal?

> Use IO#write_nonblock should work. That would require changes in both send and receive methods.

I tried this in utilities.rb:

        def send(message)
          packet = Marshal.dump(message)
-         @write_io.write("#{packet.bytesize}\n#{packet}")
+         @write_io.write_nonblock("#{packet.bytesize}\n#{packet}")
        end

        # rubocop:disable Security/MarshalLoad
        def receive
          packet_size = Integer(@read_io.gets)
-         Marshal.load(@read_io.read(packet_size))
+         Marshal.load(@read_io.read_nonblock(packet_size))

but

Traceback (most recent call last):
	18: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/bin/rspec:23:in `<main>'
	17: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/bin/rspec:23:in `load'
	16: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/exe/rspec:4:in `<top (required)>'
	15: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/runner.rb:45:in `invoke'
	14: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/runner.rb:69:in `run'
	13: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/invocations.rb:36:in `call'
	12: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/coordinator.rb:17:in `bisect_with'
	11: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/coordinator.rb:27:in `bisect'
	10: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/coordinator.rb:49:in `start_bisect_runner'
	 9: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/fork_runner.rb:38:in `start'
	 8: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/coordinator.rb:31:in `block in bisect'
	 7: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/example_minimizer.rb:20:in `find_minimal_repro'
	 6: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/example_minimizer.rb:117:in `prep'
	 5: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/example_minimizer.rb:155:in `track_duration'
	 4: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/example_minimizer.rb:118:in `block in prep'
	 3: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/fork_runner.rb:59:in `original_results'
	 2: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/fork_runner.rb:71:in `dispatch_run'
	 1: from /Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/utilities.rb:47:in `receive'
/Users/benoit.tigeot/.rvm/gems/ruby-2.5.1/gems/rspec-core-3.8.1/lib/rspec/core/bisect/utilities.rb:47:in `load': marshal data too short (ArgumentError)

I am ok to remove the waitpid.

DavidS · 2019-07-05T13:42:23Z

Reading through the docs of write_nonblock (https://apidock.com/ruby/IO/write_nonblock), I also think that in the original error case you would just get an Errno::EWOULDBLOCK instead of writing the data out.

palkan · 2019-07-05T16:18:11Z

> the original error case you would just get a Errno::EWOULDBLOCK

I tried this: it fills the buffer and returns 65536. Not sure when Errno::EWOULDBLOCK could happen.
So, that would be a partial write (and that's why we see marshal data too short).
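Both behaviours can be seen on a plain pipe. The sketch below is an illustration under assumptions, not rspec-core code: `nonblocking_write_demo` is a hypothetical helper, and the payload is assumed to be larger than the pipe buffer. The first `write_nonblock` on an empty pipe writes only what fits. Once the pipe is full, further calls raise an error that includes `IO::WaitWritable` (`Errno::EAGAIN`, which is the same errno as `EWOULDBLOCK` on Linux and macOS).

```ruby
# Hypothetical helper: write a large payload to a pipe nobody reads and
# record how much each non-blocking write manages to push in.
def nonblocking_write_demo(payload_size)
  reader, writer = IO.pipe
  payload = "x" * payload_size
  writes = []
  blocked = false

  begin
    # Keep writing until the kernel refuses to take more bytes.
    loop { writes << writer.write_nonblock(payload) }
  rescue IO::WaitWritable
    # The pipe is full: this is the EAGAIN / EWOULDBLOCK case.
    blocked = true
  end

  { first_write: writes.first, total: writes.sum, blocked: blocked }
ensure
  reader&.close
  writer&.close
end

p nonblocking_write_demo(1_000_000)
```

A receiver that then reads the announced packet size gets fewer bytes than expected, which is consistent with the `marshal data too short` error.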

My initial suggestion that we can leverage write_nonblock was not correct: we need to initiate the read before calling waitpid. That's the main problem.

Probably, the following refactoring could be a bit better than just dropping Process.waitpid:

  def dispatch_run(run_descriptor)
-   @run_dispatcher.dispatch_specs(run_descriptor)
+   pid = @run_dispatcher.dispatch_specs(run_descriptor)
    @channel.receive.tap do |result|
      # ...
      Process.waitpid(pid)
    end
  end
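The ordering in this refactoring (fork, receive, then wait) can be tried in isolation. What follows is a self-contained sketch under assumptions rather than the actual fork runner: `SketchChannel` and `run_in_fork` are hypothetical names, with `send` and `receive` modelled on the `utilities.rb` snippet quoted earlier in the thread.

```ruby
# Hypothetical stand-in for the bisect channel: length-prefixed Marshal
# packets sent over a pipe.
class SketchChannel
  def initialize
    @read_io, @write_io = IO.pipe
    @read_io.binmode
    @write_io.binmode
  end

  def send(message)
    packet = Marshal.dump(message)
    @write_io.write("#{packet.bytesize}\n#{packet}")
  end

  def receive
    packet_size = Integer(@read_io.gets)
    # A blocking read keeps going until packet_size bytes have arrived.
    Marshal.load(@read_io.read(packet_size))
  end
end

# Hypothetical helper mirroring the proposed order of operations.
def run_in_fork(channel, results)
  pid = fork { channel.send(results) } # child writes the (large) report
  received = channel.receive           # parent drains the pipe first...
  Process.waitpid(pid)                 # ...and only then reaps the child
  received
end

results = Array.new(20_000) { |i| "./spec/nil_spec.rb[1:#{i}]" }
puts run_in_fork(SketchChannel.new, results) == results
```

Because the parent is already reading while the child writes, the report size no longer matters, and the child is still reaped afterwards.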

JonRowe · 2019-07-08T16:37:18Z

@palkan do you feel like working up a patch with something like that?

palkan · 2019-07-08T18:32:07Z

@JonRowe Not in the next couple of weeks. I can try to find another evil martian to help with this)

benoittgt · 2019-07-10T11:55:04Z

It works with @palkan's proposal:

@@ -1,9 +1,10 @@
         def dispatch_run(run_descriptor)
-          @run_dispatcher.dispatch_specs(run_descriptor)
+          pid = @run_dispatcher.dispatch_specs(run_descriptor)
           @channel.receive.tap do |result|
             if result.is_a?(String)
               raise BisectFailedError.for_failed_spec_run(result)
             end
+            Process.waitpid(pid)
           end
         end

@@ -23,6 +24,5 @@
           end

           def dispatch_specs(run_descriptor)
-            pid = fork { run_specs(run_descriptor) }
-            Process.waitpid(pid)
+            fork { puts run_specs(run_descriptor) }
           end

But I haven't yet verified what David mentioned:

> Process.waitpid does some OS-level cleanup that is sometimes necessary for the overall health of the OS

I dug into the MRI source code, but at the moment I have no clue what can happen when it does not "clean up". I don't see it.

Another proposal would be to use WNOHANG. From the waitpid doc:

> If WNOHANG was specified in options and there were no children in a waitable state, then waitid() returns 0 immediately (...)

--- a/lib/rspec/core/bisect/fork_runner.rb
+++ b/lib/rspec/core/bisect/fork_runner.rb
@@ -92,7 +92,7 @@ module RSpec

           def dispatch_specs(run_descriptor)
             pid = fork { run_specs(run_descriptor) }
-            Process.waitpid(pid)
+            Process.waitpid(pid, Process::WNOHANG)

It works too.
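For reference, here is a small sketch of what `Process::WNOHANG` does in Ruby. It is an illustration only, and `wnohang_demo` is a hypothetical helper. While the child is still running, `Process.waitpid(pid, Process::WNOHANG)` returns `nil` instead of blocking, which also means the child is not reaped by that call.

```ruby
# Hypothetical helper: compare a non-blocking wait on a running child
# with a blocking wait after the child has been allowed to exit.
def wnohang_demo
  reader, writer = IO.pipe

  pid = fork do
    writer.close
    reader.read # stays blocked until the parent closes its write end
  end
  reader.close

  # The child cannot have exited yet, so this returns nil immediately.
  while_running = Process.waitpid(pid, Process::WNOHANG)

  writer.close                      # lets the child see EOF and exit
  after_exit = Process.waitpid(pid) # blocks briefly, returns the pid

  [while_running, after_exit == pid]
end

p wnohang_demo
```

In the one-line patch above, the `WNOHANG` call is made right after `fork`, so it almost always returns `nil` and the parent moves straight on to reading the channel.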

benoittgt · 2019-09-18T05:29:04Z

I recently used my last proposal, the one with Process::WNOHANG, in a project where bisect got stuck. It helped us get a result instead of a locked bisect.

Don't you think a PR with that change would be a good idea? I'm not sure how we can test this at the moment. I have to think about it.


hayesr · 2019-10-14T17:26:25Z

@benoittgt's Process::WNOHANG solution also worked for me.


From Palkan in benoittgt/rspec_repro_bisect_deadlock#1

First, I've tried to play with the number of specs which led to the interesting conclusion: **the process hangs only at 1548+ specs**.

```ruby
RSpec.describe "a bunch of nothing" do
  (0...(ENV.fetch('N', 3000).to_i)).each do |t|
    it { expect(t).to eq t }
  end
end
```

Try to run with `N=1547` and `N=1548`. Seems suspicious, right?

Let's add `pry-byebug` to the equation (or Gemfile). In order for it to work we need to tweak our runner code a bit:

```diff
- $stdout = $stderr = @spec_output
+ # $stdout = $stderr = @spec_output
```

After a bit of `puts` debugging I localized the problem: [`@channel.send`](https://github.com/rspec/rspec-core/blob/7b6b9c3f2e2878213f97d6fc9e9eb23c323cfe1c/lib/rspec/core/bisect/fork_runner.rb#L122).

`Channel#send` calls `IO#write` here https://github.com/rspec/rspec-core/blob/7b6b9c3f2e2878213f97d6fc9e9eb23c323cfe1c/lib/rspec/core/bisect/utilities.rb#L41:

```ruby
def send(message)
  packet = Marshal.dump(message)
  @write_io.write("#{packet.bytesize}\n#{packet}")
end
```

Do you know what the `packet.bytesize` is for `N=1548`? It's **65548**.

This number is very important: the pipe size is only **65536** on macOS (see docs for [`IO#write_nonblock`](https://ruby-doc.org/core-2.6.3/IO.html#method-i-write_nonblock) for more).

That makes `@write_io.write` hang forever, because no one reads the buffer: we call `Channel#receive` only after `Process.waitpid(pid)`, thus waiting for the write operation to complete.

-----------

A basic proposal would be to use WNOHANG. From the waitpid doc:

> If WNOHANG was specified in options and there were no children
> in a waitable state, then waitid() returns 0 immediately (...)

To validate this proposal on OSX, we run the following just before running bisect:

`lsof -n -P -r1 -c ruby | grep -e 'PIP' -e '===' -e 'COMMAND'`

This prints, in a loop, the pipe sizes of the Ruby processes. Without our patch we quickly hit 65536 bytes on two pipes. With the patch the pipes stay at the right size.

```
COMMAND  PID    USER    FD  TYPE  DEVICE              SIZE/OFF  NODE  NAME
ruby     40134  benoit  3   PIPE  0xf3b025a6a6cd6005  16384           ->0xf3b025a6a6cd5045
ruby     40134  benoit  4   PIPE  0xf3b025a6a6cd5045  16384           ->0xf3b025a6a6cd6005
ruby     40134  benoit  5   PIPE  0xf3b025a6a6cd7805  16384           ->0xf3b025a6a6cd7145
ruby     40134  benoit  7   PIPE  0xf3b025a6a6cd7145  16384           ->0xf3b025a6a6cd7805
ruby     40134  benoit  10  PIPE  0xf3b025a6a6cd6fc5  16384           ->0xf3b025a6a6cd5a05
ruby     40134  benoit  11  PIPE  0xf3b025a6a6cd5a05  16384           ->0xf3b025a6a6cd6fc5
ruby     40144  benoit  3   PIPE  0xf3b025a6a6cd5d05  16384           ->0xf3b025a6a6cd5c45
ruby     40144  benoit  4   PIPE  0xf3b025a6a6cd5c45  16384           ->0xf3b025a6a6cd5d05
ruby     40144  benoit  5   PIPE  0xf3b025a6a6cd7085  16384           ->0xf3b025a6a6cd6785
ruby     40144  benoit  7   PIPE  0xf3b025a6a6cd6785  16384           ->0xf3b025a6a6cd7085
ruby     40144  benoit  10  PIPE  0xf3b025a6a6cd6fc5  16384           ->0xf3b025a6a6cd5a05
ruby     40144  benoit  11  PIPE  0xf3b025a6a6cd5a05  16384           ->0xf3b025a6a6cd6fc5
```

But if we read the doc carefully, we can go even further.

> If status information is immediately available on an appropriate child process, waitpid() returns this information. Otherwise, waitpid() returns immediately with an error code indicating that the information was not available. In other words, WNOHANG checks child processes without causing the caller to be suspended.

And as pirj mentions: "With this in mind, do we really need to check that information that waitpid returns? We don't seem to use it."

Removing "waitpid" produces the same behavior as with `WNOHANG`.

Improvements: The bisect command requests lots of RAM. The next step should be to reduce that consumption.

Related:
- fix: #2637
- PR discussion: #2669

agibralter · 2020-06-04T02:48:07Z

I just ran into this—I'm on rspec-core 3.9.2 on OS X and still seem to get the hanging behavior right before it outputs the failing and non-failing examples. Using shell runner works... Does anyone have suggestions on how to debug what is going on?

benoittgt · 2020-06-04T19:00:09Z

Hello @agibralter

There are many commands in the PR that may help you understand what is happening.

If you manage to find an easy way to reproduce it, I would be very happy to patch it.


mikejarema · 2021-01-21T16:15:00Z

I've contributed a PR which addresses a specific situation in which rspec --bisect hangs.

Specifically, when the encoding is set to UTF-8 (e.g. by Rails) and the forked child process tries to send BINARY/ASCII-8BIT encoded data back to the parent, the communication channel (Bisect::Channel) silently rejects it because of the encoding mismatch.

The parent process, whose output you're seeing, then waits indefinitely for the child process, which has since errored out.
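The kind of mismatch described here can be illustrated with plain Ruby. This is a hedged sketch only: it does not reproduce the rspec-core code path or the fix in the PR, and `encoding_mismatch_demo` is a hypothetical helper. `Marshal.dump` returns an ASCII-8BIT string, and transcoding such a string to UTF-8 fails as soon as it contains a byte above 0x7F, whereas a pipe in binary mode passes the bytes through untouched.

```ruby
# Hypothetical helper: show that a Marshal packet is binary data that
# cannot be transcoded to UTF-8, but survives a binmode pipe unchanged.
def encoding_mismatch_demo
  packet = Marshal.dump("caf\u00e9") # contains the bytes 0xC3 0xA9

  transcoding_error =
    begin
      packet.encode(Encoding::UTF_8)
      nil
    rescue Encoding::UndefinedConversionError => e
      e.class
    end

  reader, writer = IO.pipe
  reader.binmode
  writer.binmode
  writer.write(packet) # small enough to fit in the pipe buffer
  writer.close
  round_trip = Marshal.load(reader.read)
  reader.close

  [packet.encoding, transcoding_error, round_trip]
end

p encoding_mismatch_demo
```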

benoittgt self-assigned this Sep 2, 2019

benoittgt mentioned this issue Nov 25, 2019

Avoid bisect command to get stuck #2669

Merged

benoittgt closed this as completed in #2669 Dec 14, 2019
