manager-subprocess-xsnap should distinguish metering faults from random process death #3394

warner · 2021-06-22T05:25:08Z

If a vat worker subprocess exits unexpectedly, our kernel does not know the state of the vat: the worker might have died because of something the vat did, or because of something outside swingset (maybe the host computer is being rebooted and all processes are being killed, in some random order). #2958 is about having a policy to react to an unexpected worker termination.

If we're in "consensus mode", we must crash the kernel: we do not know why the worker terminated, so we don't know that it's also being terminated on all other validators. In particular, metering faults are a distinct "known" form of worker termination. We need to be able to distinguish between a metering fault and some other random error.

manager-subprocess-xsnap.js has a catch inside deliverToWorker that conflates these cases:

    async function deliverToWorker(delivery) {
      parentLog(vatID, `sending delivery`, delivery);
      let result;
      try {
        result = await issueTagged(['deliver', delivery]);
      } catch (err) {
        parentLog('issueTagged error:', err.code, err.message);
        let message;
        switch (err.code) {
          case ExitCode.E_TOO_MUCH_COMPUTATION:
            message = 'Compute meter exceeded';
            break;
          case ExitCode.E_STACK_OVERFLOW:
            message = 'Stack meter exceeded';
            break;
          case ExitCode.E_NOT_ENOUGH_MEMORY:
            message = 'Allocate meter exceeded';
            break;
          default:
            message = err.message;
        }
        return harden(['error', message, null]);
      }

I'm thinking that the non-meter-fault errors should propogate an Error upwards (i.e. don't catch the default case). We can use a non-rejecting deliverToWorker return promise with a value of ['error'..] to mean "consensus metering fault", and a yes-rejecting return promise (which should then carry all the way back to controller.run()) to mean "unknown worker error, halting the kernel".

I think we can get away without fixing this for the stress-test phase this week, but the danger is that something which happens to kill a worker process might trick that one validator into thinking the vat has had a consensus metering fault, and once that validator commits the results, it won't be able to rejoin consensus. (I think. @michaelfig has investigated this more than me).

The text was updated successfully, but these errors were encountered:

michaelfig · 2021-06-22T05:44:14Z

Well-stated!

might trick that one validator into thinking the vat has had a consensus metering fault, and once that validator commits the results, it won't be able to rejoin consensus.

That's correct. With the current problem, the kernel appears to have already committed the state that said the vat was terminated before we actually have a noticeable behaviour divergence, rather than just crashing the kernel immediately and not committing that state.

warner · 2021-06-22T16:00:46Z

This fix also needs to change deliver() in manager-helper.js, which catches all errors in deliverToWorker and replaces them with a ['error', err.message, null] VatDeliveryResult. We need a plan.

One option is to define deliver() to do one of three things:

return ['ok', null, meterUsage]: happy
return ['error', problem, null]: consensus unhappy (metering fault): terminate vat, continue with kernel
reject: non-consunsus unhappy: halt kernel

Another option is to only use return, but examine problem to distinguish between the last two cases. I don't like the idea of parsing a string to make that distinction.

I'm in favor of the first option.. any other opinions?

refs #3394

warner added bug Something isn't working SwingSet package: SwingSet xsnap the XS execution tool labels Jun 22, 2021

warner assigned dckc Jun 22, 2021

warner mentioned this issue Jun 22, 2021

test our policy to respond to a vat worker subprocess crashing #2958

Closed

warner added a commit that referenced this issue Jun 22, 2021

chore(swingset): initial draft of test-xsnap-errors

329846f

refs #3394

dckc added this to the Testnet: Stress Test Phase milestone Jun 22, 2021

dckc assigned warner Jun 22, 2021

dckc mentioned this issue Jun 22, 2021

fix(swingset): xs-worker confused meter exhaustion with process fail #3396

Merged

dckc pushed a commit that referenced this issue Jun 22, 2021

chore(swingset): initial draft of test-xsnap-errors

38ea940

refs #3394

dckc pushed a commit that referenced this issue Jun 22, 2021

chore(swingset): initial draft of test-xsnap-errors

38e9c9f

refs #3394

dckc closed this as completed in #3396 Jun 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

manager-subprocess-xsnap should distinguish metering faults from random process death #3394

manager-subprocess-xsnap should distinguish metering faults from random process death #3394

warner commented Jun 22, 2021

michaelfig commented Jun 22, 2021

warner commented Jun 22, 2021

manager-subprocess-xsnap should distinguish metering faults from random process death #3394

manager-subprocess-xsnap should distinguish metering faults from random process death #3394

Comments

warner commented Jun 22, 2021

michaelfig commented Jun 22, 2021

warner commented Jun 22, 2021