test our policy to respond to a vat worker subprocess crashing #2958
Comments
I read "we need..." as: this work should be scheduled. The most conservative milestone would be staking dynamics, but I don't think it's required by then. And I infer that you're taking the ball, @warner . If I misread that, please clarify. |
#3394 is the specific thing we need in the short-term, but I'm not convinced we absolutely need it for the stress-test phase. Maybe we could add it into the phase4.5 upgrade. I'm moving this policy question out of the milestone. |
I think we're probably doing the right thing now (crash the kernel if a worker dies unexpectedly), but we don't have a good test. I'll reduce the estimate on this one to reflect the smaller remaining work. |
This seems related to #5480 |
@warner and I discussed this and concluded that it would be difficult to realize as a fully automated test that can be used in CI, but should be readily doable as a manual test. We identified three different cases of a vat worker process crashing that need to be covered. These vary depending on where in the message-handling cycle the vat and the kernel are at the time the vat process itself dies:

The test setup is a …
All vats are run using the …

Case 1
To test case 1, during the long-running busy loop we use the Unix …

Case 2
To test case 2, during the long-running busy loop we do the same …

Note, however, that the kernel is only able to sense the failure of the …

Case 3
To test case 3, we run …

Conclusion
In all cases the kernel panics and the execution of the swingset is halted, which is the desired behavior we sought to verify.
Sounds great! Thanks for the thorough test. I agree this can close the ticket.
What is the Problem Being Solved?
When a vat worker is running in a child subprocess, such as with XS, we introduce a new failure mode: the child process might die. This could happen because of an engine fault (a crash is the best case outcome for a buffer overflow or memory corruption bug, because if the process survives it could probably be compromised in some sneaky way). It might also happen because of resource exhaustion, like too much RAM being allocated, if our metering tools aren't able to catch that first. It might also happen because of something outside the scope of swingset entirely: the host computer is being rebooted, so it's killing off all processes (usually with a polite+catchable SIGINT, followed a few seconds later by an uncatchable SIGKILL). If the kernel happens to politely SIGINT the worker process first, the kernel might observe the worker's termination fast enough to react and record any state changes before being killed itself.
We need to decide what the kernel does when the worker process exits unexpectedly. Our options are roughly: terminate the vat, terminate (panic) the entire kernel, or relaunch the worker and retry the delivery.
We also need a way to react to a worker process which dies outside of the context of a delivery. This could happen if the host computer is low on RAM, and the OOM (Out-Of-Memory) Killer sets its sights on the worker process. It could also happen if an incautious sysadmin kills the wrong process.
Description of the Design
We need to enhance the vat-manager API to include worker-process crash as part of the `VatDeliveryResult` structure. It should indicate the exit code of the process, including a unix signal number (if running on unix), perhaps pre-processed to indicate what category of crash has happened. We also need some kernel options to select the policy (if we decide to support more than one).

As for the policy selection itself: the choice to terminate the vat is (highly) visible to consensus, so if we're operating in a consensus mode, we need to make sure that all validators make the same choice. We can't afford to let random host-computer reboots make one validator think the vat has died, while all the other validators think it's just fine.
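To make that concrete, the crash-reporting variant of `VatDeliveryResult` might look something like this sketch (the tag and field names are made up for illustration, not the real shape):

```ts
// Hypothetical shape for a delivery result that can report a worker crash.
// The 'worker-crashed' tag and its fields are illustrative only.
type CrashCategory = 'memory' | 'cpu' | 'external-kill' | 'unknown';

type VatDeliveryResult =
  | ['ok', unknown]
  | ['error', string]
  | ['worker-crashed', {
      exitCode: number | null; // set if the process exited on its own
      signal: string | null;   // e.g. 'SIGSEGV', 'SIGKILL' (unix only)
      category: CrashCategory; // pre-processed hint for the kernel's policy
    }];
```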
Consensus Mode
So if we're in consensus mode, I think the best option is to immediately terminate the kernel. The rule will be: if the worker reports a metering fault, terminate the vat, but if the worker dies and we don't know why, terminate the kernel.
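A minimal sketch of that rule, assuming hypothetical `terminateVat` and `panic` hooks rather than the real kernel API:

```ts
// Consensus-mode rule: metering fault => kill the vat (deterministic,
// consensus-visible); anything else => halt this validator's kernel.
// The hook names below are assumptions, not the actual kernel interface.
declare function terminateVat(vatID: string, reason: string): void;
declare function panic(reason: string): void;

function onWorkerDeath(vatID: string, meteringFault: boolean): void {
  if (meteringFault) {
    terminateVat(vatID, 'metering fault');
  } else {
    panic(`worker for ${vatID} died unexpectedly`);
  }
}
```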
If the worker dies because the host computer is being rebooted, this will play out the same way as if the kernel process had been killed first: when the reboot finishes, the kernel will come back up, and it will pick up where it left off, launching a new worker process for that vat (as well as all other active vats).
If the worker died because of an uncaught resource exhaustion problem, then the outcome depends upon how serious it is. If it's only affecting one validator, and it's somehow random (maybe swap space ran low on that computer), then rebooting the kernel might help, and the validator is only out a couple of blocks. If it persists, that validator will be unable to produce more blocks, and they'll get jailed, which is exactly the right way to evict a validator that isn't powerful enough to support the chain.
If the problem affects lots of validators, then it's much more likely that a fix is required in our software. If it affects more than a third of the voting power, the remainder will not be able to produce new blocks, and the chain will halt. The speed with which the chain comes to a halt will depend upon what fraction of the validators are affected. The worst case is for slightly fewer than one third of the voting power to go offline, in that it causes the most churn in the validator set. Ideally all faults are common-mode: they affect everybody or (nearly) nobody. Halting the chain is probably better than slashing/jailing a third of the validators.
It might be interesting to apply a `ulimit` size limit to the worker process, set to the same value across all validators. This might improve the odds that an uncaught memory-exhaustion attack affects all/most validators the same way.
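One possible way to apply such a limit, assuming the worker is launched through a shell wrapper (the `xsnap-worker` name and the 2 GiB value are placeholders, not the actual configuration):

```ts
import { spawn } from 'child_process';

// Launch the worker through a shell that applies a uniform address-space
// limit before exec'ing the real worker binary.
const limitKiB = 2 * 1024 * 1024; // ulimit -v takes KiB
const child = spawn('/bin/sh', [
  '-c',
  `ulimit -v ${limitKiB} && exec xsnap-worker "$@"`,
  'worker-wrapper', // becomes $0 inside the shell
]);

child.on('exit', (code, signal) => {
  // If the limit is hit, we'd expect to see SIGSEGV or SIGABRT here.
  console.log('worker exited', { code, signal });
});
```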
Non-Consensus Mode
If we're not running in a consensus mode, the kernel can do whatever it likes. The only party it must answer to is its future self.
I don't have strong feelings about what the ideal policy would be. I like the idea of using `ulimit` as a sensor for resource-exhaustion attacks, but I also don't want an unfortunately-sequenced host-computer reboot to make us think a vat should be terminated permanently. Fortunately I think we can use the process exit code (signal number) to distinguish between the two: when automatic stack expansion fails, the process gets a SIGSEGV; if the engine tries and fails to allocate memory, it probably self-terminates with SIGABRT (need to check this); but if a system shutdown kills the process, we'd see SIGINT or SIGKILL or SIGTERM or something. The `setrlimit(3)` man page says CPU exhaustion is signalled (on linux) with SIGXCPU followed by SIGKILL.

Assuming we can distinguish between them reliably, it might make sense for the kernel to retry a crashed worker once (per kernel process), maybe after a multi-second timeout, after which we declare the vat terminated. That should be proof against reboot-time shutdown sequencing, and would also tolerate sysadmins fumble-fingering a `kill` command that hits the wrong process.

But I think it'd also be safe (and easier to implement) to match what we do in consensus mode: any unexplained worker death causes the kernel process to halt immediately (after logging what happened, of course, so the operator knows why).
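If we do go with the distinguish-and-retry approach, a rough sketch might look like this (the category names and helper functions are illustrative, not an actual implementation):

```ts
// Classify a dead worker's exit signal following the distinctions above.
type DeathCause = 'resource-exhaustion' | 'external-kill' | 'unknown';

function classifyWorkerDeath(signal: string | null): DeathCause {
  switch (signal) {
    case 'SIGSEGV': // failed automatic stack expansion
    case 'SIGABRT': // engine gave up on a failed allocation (to be confirmed)
    case 'SIGXCPU': // setrlimit(3) CPU limit exceeded
      return 'resource-exhaustion';
    case 'SIGINT':
    case 'SIGTERM':
    case 'SIGKILL': // system shutdown or a stray `kill`
      return 'external-kill';
    default:
      return 'unknown';
  }
}

// Retry an externally-killed worker once per kernel process, after a pause;
// anything else (or a second death) terminates the vat.
let retried = false;
function shouldRetry(cause: DeathCause): boolean {
  if (cause !== 'external-kill' || retried) return false;
  retried = true;
  return true;
}
```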
Consensus Considerations
As described above, in consensus mode, we must not terminate vats without consensus-based evidence of fault. We don't want to let an attacker cause a small number of validators to drop out of consensus. Metering is supposed to catch this in a deterministic fashion, but if that fails, we'd like to improve the odds of keeping the chain intact (most validators proceeding, or most validators halting).
Security Considerations
Several memory-corruption attacks are somewhat probabilistic: the attacker must submit a number of probes to learn some secret value (e.g. ASLR secrets). These attacks will appear as crashes for the first N-1 probes, followed by a silent compromise on the Nth probe. In some cases, the secret they're trying to learn is reset when the process is restarted or the computer is rebooted. The attacker's mean time to compromise is multiplied by the time between probe attempts.
If we re-try a crashed worker immediately, and don't limit the number of retries, we're giving an attacker (who is trying to probe the worker's secrets with their vat deliveries) the best odds of success. If we slow down the retries (randomized exponential backoff is a popular approach), or put a hard limit on the number of retries, then we reduce their chances considerably.
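For example, a retry schedule along these lines (constants are arbitrary, purely illustrative):

```ts
// Randomized exponential backoff with a hard retry cap: slow, bounded
// restarts deny an attacker rapid repeated probes.
const MAX_RETRIES = 3;

function retryDelayMs(attempt: number): number | null {
  if (attempt >= MAX_RETRIES) return null; // give up; terminate the vat
  const base = 5_000 * 2 ** attempt;       // 5s, 10s, 20s, ...
  return base + Math.random() * base;      // jitter makes probe timing unpredictable
}
```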
If we use something like `seccomp(2)` (#2386), then the same considerations apply even more directly.

Test Plan
The most reliable test would be to launch a real worker process, then kill it in various ways, and assert that the kernel reacts in the selected way.
We should also have direct tests of the kernel policy, where we exercise the code with a vat-manager delivery result that has been extended to describe worker crashes.
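A sketch of what that direct policy test might look like, where `makeTestKernel` and `deliverCrashResult` stand in for whatever test harness we end up with:

```ts
import test from 'ava';

// Hypothetical harness helpers; not real APIs.
declare function makeTestKernel(): Promise<{
  deliverCrashResult: (vatID: string, crash: unknown) => Promise<void>;
}>;

// Hand the kernel a fabricated "worker crashed" delivery result and check
// that it panics, per the consensus-mode policy described above.
test('kernel panics on unexplained worker death', async t => {
  const kernel = await makeTestKernel();
  const crash = { exitCode: null, signal: 'SIGKILL', category: 'unknown' };
  await t.throwsAsync(() => kernel.deliverCrashResult('v1', crash), {
    message: /died unexpectedly/,
  });
});
```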
cc @michaelfig @dckc @erights