test our policy to respond to a vat worker subprocess crashing #2958
Comments
I read "we need..." as: this work should be scheduled. The most conservative milestone would be staking dynamics, but I don't think it's required by then. And I infer that you're taking the ball, @warner . If I misread that, please clarify. |
#3394 is the specific thing we need in the short-term, but I'm not convinced we absolutely need it for the stress-test phase. Maybe we could add it into the phase4.5 upgrade. I'm moving this policy question out of the milestone. |
I think we're probably doing the right thing now (crash the kernel if a worker dies unexpectedly), but we don't have a good test. I'll reduce the estimate on this one to reflect the smaller remaining work. |
This seems related to #5480 |
@warner and I discussed this and concluded that it would be difficult to realize as a fully automated test that can be used in CI, but should be readily doable as a manual test. We identified three different cases of a vat worker process crashing that need to be covered. These vary depending on where in the message-handling cycle the vat and the kernel are at the time the vat process itself dies:

The test setup is a …
All vats are run using the …

Case 1
To test case 1, during the long-running busy loop we use the Unix …

Case 2
To test case 2, during the long-running busy loop we do the same …

Note, however, that the kernel is only able to sense the failure of the …

Case 3
To test case 3, we run …

Conclusion
In all cases the kernel panics and the execution of the swingset is halted, which is the desired behavior we sought to verify.
Sounds great! Thanks for the thorough test. I agree this can close the ticket.
What is the Problem Being Solved?
When a vat worker is running in a child subprocess, such as with XS, we introduce a new failure mode: the child process might die. This could happen because of an engine fault (a crash is the best case outcome for a buffer overflow or memory corruption bug, because if the process survives it could probably be compromised in some sneaky way). It might also happen because of resource exhaustion, like too much RAM being allocated, if our metering tools aren't able to catch that first. It might also happen because of something outside the scope of swingset entirely: the host computer is being rebooted, so it's killing off all processes (usually with a polite+catchable SIGINT, followed a few seconds later by an uncatchable SIGKILL). If the kernel happens to politely SIGINT the worker process first, the kernel might observe the worker's termination fast enough to react and record any state changes before being killed itself.
We need to decide what the kernel does when the worker process exits unexpectedly. Our options are roughly: terminate the vat, terminate (panic) the entire kernel, or relaunch the worker and retry the delivery.
We also need a way to react to a worker process which dies outside of the context of a delivery. This could happen if the host computer is low on RAM, and the OOM (Out-Of-Memory) Killer sets its sights on the worker process. It could also happen if an incautious sysadmin kills the wrong process.
Description of the Design
We need to enhance the vat-manager API to include worker-process crash as part of the `VatDeliveryResult` structure. It should indicate the exit code of the process, including a unix signal number (if running on unix), perhaps pre-processed to indicate what category of crash has happened. We also need some kernel options to select the policy (if we decide to support more than one).

As for the policy selection itself: the choice to terminate the vat is (highly) visible to consensus, so if we're operating in a consensus mode, we need to make sure that all validators make the same choice. We can't afford to let random host-computer reboots make one validator think the vat has died, while all the other validators think it's just fine.
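To make that concrete, the crash-reporting variant of `VatDeliveryResult` might look something like this sketch (the tag and field names are made up for illustration, not the real shape):

```ts
// Hypothetical shape for a delivery result that can report a worker crash.
// The 'worker-crashed' tag and its fields are illustrative only.
type CrashCategory = 'memory' | 'cpu' | 'external-kill' | 'unknown';

type VatDeliveryResult =
  | ['ok', unknown]
  | ['error', string]
  | ['worker-crashed', {
      exitCode: number | null; // set if the process exited on its own
      signal: string | null;   // e.g. 'SIGSEGV', 'SIGKILL' (unix only)
      category: CrashCategory; // pre-processed hint for the kernel's policy
    }];
```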
Consensus Mode
So if we're in consensus mode, I think the best option is to immediately terminate the kernel. The rule will be: if the worker reports a metering fault, terminate the vat, but if the worker dies and we don't know why, terminate the kernel.
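A minimal sketch of that rule, assuming hypothetical `terminateVat` and `panic` hooks rather than the real kernel API:

```ts
// Consensus-mode rule: metering fault => kill the vat (deterministic,
// consensus-visible); anything else => halt this validator's kernel.
// The hook names below are assumptions, not the actual kernel interface.
declare function terminateVat(vatID: string, reason: string): void;
declare function panic(reason: string): void;

function onWorkerDeath(vatID: string, meteringFault: boolean): void {
  if (meteringFault) {
    terminateVat(vatID, 'metering fault');
  } else {
    panic(`worker for ${vatID} died unexpectedly`);
  }
}
```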
If the worker dies because the host computer is being rebooted, this will play out the same way as if the kernel process had been killed first: when the reboot finishes, the kernel will come back up, and it will pick up where it left off, launching a new worker process for that vat (as well as all other active vats).
If the worker died because of an uncaught resource exhaustion problem, then the outcome depends upon how serious it is. If it's only affecting one validator, and it's somehow random (maybe swap space ran low on that computer), then rebooting the kernel might help, and the validator is only out a couple of blocks. If it persists, that validator will be unable to produce more blocks, and they'll get jailed, which is exactly the right way to evict a validator that isn't powerful enough to support the chain.
If the problem affects lots of validators, then it's much more likely that a fix is required in our software. If it affects more than a third of the voting power, the remainder will not be able to produce new blocks, and the chain will halt. The speed with which the chain comes to a halt will depend upon what fraction of the validators are affected. The worst case is for slightly fewer than one third of the voting power to go offline, in that it causes the most churn in the validator set. Ideally all faults are common-mode: they affect everybody or (nearly) nobody. Halting the chain is probably better than slashing/jailing a third of the validators.
It might be interesting to apply a `ulimit` size limit to the worker process, set to the same value across all validators. This might improve the odds that an uncaught memory-exhaustion attack affects all/most validators the same way.
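One possible way to apply such a limit, assuming the worker is launched through a shell wrapper (the `xsnap-worker` name and the 2 GiB value are placeholders, not the actual configuration):

```ts
import { spawn } from 'child_process';

// Launch the worker through a shell that applies a uniform address-space
// limit before exec'ing the real worker binary.
const limitKiB = 2 * 1024 * 1024; // ulimit -v takes KiB
const child = spawn('/bin/sh', [
  '-c',
  `ulimit -v ${limitKiB} && exec xsnap-worker "$@"`,
  'worker-wrapper', // becomes $0 inside the shell
]);

child.on('exit', (code, signal) => {
  // If the limit is hit, we'd expect to see SIGSEGV or SIGABRT here.
  console.log('worker exited', { code, signal });
});
```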
Non-Consensus Mode
If we're not running in a consensus mode, the kernel can do whatever it likes. The only party it must answer to is its future self.
I don't have strong feelings about what the ideal policy would be. I like the idea of using `ulimit` as a sensor for resource-exhaustion attacks, but I also don't want an unfortunately-sequenced host-computer reboot to make us think a vat should be terminated permanently. Fortunately I think we can use the process exit code (signal number) to distinguish between the two: when automatic stack expansion fails, the process gets a SIGSEGV; if the engine tries and fails to allocate memory, it probably self-terminates with SIGABRT (need to check this); but if a system shutdown kills the process, we'd see SIGINT or SIGKILL or SIGTERM or something. The `setrlimit(3)` man page says CPU exhaustion is signalled (on linux) with SIGXCPU followed by SIGKILL.

Assuming we can distinguish between them reliably, it might make sense for the kernel to retry a crashed worker once (per kernel process), maybe after a multi-second timeout, after which we declare the vat terminated. That should be proof against reboot-time shutdown sequencing, and would also tolerate sysadmins fumble-fingering a `kill` command that hits the wrong process.

But I think it'd also be safe (and easier to implement) to match what we do in consensus mode: any unexplained worker death causes the kernel process to halt immediately (after logging what happened, of course, so the operator knows why).
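If we do go with the distinguish-and-retry approach, a rough sketch might look like this (the category names and helper functions are illustrative, not an actual implementation):

```ts
// Classify a dead worker's exit signal following the distinctions above.
type DeathCause = 'resource-exhaustion' | 'external-kill' | 'unknown';

function classifyWorkerDeath(signal: string | null): DeathCause {
  switch (signal) {
    case 'SIGSEGV': // failed automatic stack expansion
    case 'SIGABRT': // engine gave up on a failed allocation (to be confirmed)
    case 'SIGXCPU': // setrlimit(3) CPU limit exceeded
      return 'resource-exhaustion';
    case 'SIGINT':
    case 'SIGTERM':
    case 'SIGKILL': // system shutdown or a stray `kill`
      return 'external-kill';
    default:
      return 'unknown';
  }
}

// Retry an externally-killed worker once per kernel process, after a pause;
// anything else (or a second death) terminates the vat.
let retried = false;
function shouldRetry(cause: DeathCause): boolean {
  if (cause !== 'external-kill' || retried) return false;
  retried = true;
  return true;
}
```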
Consensus Considerations
As described above, in consensus mode, we must not terminate vats without consensus-based evidence of fault. We don't want to let an attacker cause a small number of validators to drop out of consensus. Metering is supposed to catch this in a deterministic fashion, but if that fails, we'd like to improve the odds of keeping the chain intact (most validators proceeding, or most validators halting).
Security Considerations
Several memory-corruption attacks are somewhat probabilistic: the attacker must submit a number of probes to learn some secret value (e.g. ASLR secrets). These attacks will appear as crashes for the first N-1 probes, followed by a silent compromise on the Nth probe. In some cases, the secret they're trying to learn is reset when the process is restarted or the computer is rebooted. The attacker's mean time to compromise is multiplied by the time between probe attempts.
If we re-try a crashed worker immediately, and don't limit the number of retries, we're giving an attacker (who is trying to probe the worker's secrets with their vat deliveries) the best odds of success. If we slow down the retries (randomized exponential backoff is a popular approach), or put a hard limit on the number of retries, then we reduce their chances considerably.
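For example, a retry schedule along these lines (constants are arbitrary, purely illustrative):

```ts
// Randomized exponential backoff with a hard retry cap: slow, bounded
// restarts deny an attacker rapid repeated probes.
const MAX_RETRIES = 3;

function retryDelayMs(attempt: number): number | null {
  if (attempt >= MAX_RETRIES) return null; // give up; terminate the vat
  const base = 5_000 * 2 ** attempt;       // 5s, 10s, 20s, ...
  return base + Math.random() * base;      // jitter makes probe timing unpredictable
}
```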
If we use something like `seccomp(2)` (#2386), then the same considerations apply even more directly.

Test Plan
The most reliable test would be to launch a real worker process, then kill it in various ways, and assert that the kernel reacts in the selected way.
We should also have direct tests of the kernel policy, where we exercise the code with a vat-manager delivery result that has been extended to describe worker crashes.
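A sketch of what that direct policy test might look like, where `makeTestKernel` and `deliverCrashResult` stand in for whatever test harness we end up with:

```ts
import test from 'ava';

// Hypothetical harness helpers; not real APIs.
declare function makeTestKernel(): Promise<{
  deliverCrashResult: (vatID: string, crash: unknown) => Promise<void>;
}>;

// Hand the kernel a fabricated "worker crashed" delivery result and check
// that it panics, per the consensus-mode policy described above.
test('kernel panics on unexplained worker death', async t => {
  const kernel = await makeTestKernel();
  const crash = { exitCode: null, signal: 'SIGKILL', category: 'unknown' };
  await t.throwsAsync(() => kernel.deliverCrashResult('v1', crash), {
    message: /died unexpectedly/,
  });
});
```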
cc @michaelfig @dckc @erights