Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

manager-subprocess-xsnap should distinguish metering faults from random process death #3394

Closed
warner opened this issue Jun 22, 2021 · 2 comments · Fixed by #3396
Closed
Assignees
Labels
bug Something isn't working SwingSet package: SwingSet xsnap the XS execution tool

Comments

@warner
Copy link
Member

warner commented Jun 22, 2021

If a vat worker subprocess exits unexpectedly, our kernel does not know the state of the vat: the worker might have died because of something the vat did, or because of something outside swingset (maybe the host computer is being rebooted and all processes are being killed, in some random order). #2958 is about having a policy to react to an unexpected worker termination.

If we're in "consensus mode", we must crash the kernel: we do not know why the worker terminated, so we don't know that it's also being terminated on all other validators. In particular, metering faults are a distinct "known" form of worker termination. We need to be able to distinguish between a metering fault and some other random error.

manager-subprocess-xsnap.js has a catch inside deliverToWorker that conflates these cases:

    async function deliverToWorker(delivery) {
      parentLog(vatID, `sending delivery`, delivery);
      let result;
      try {
        result = await issueTagged(['deliver', delivery]);
      } catch (err) {
        parentLog('issueTagged error:', err.code, err.message);
        let message;
        switch (err.code) {
          case ExitCode.E_TOO_MUCH_COMPUTATION:
            message = 'Compute meter exceeded';
            break;
          case ExitCode.E_STACK_OVERFLOW:
            message = 'Stack meter exceeded';
            break;
          case ExitCode.E_NOT_ENOUGH_MEMORY:
            message = 'Allocate meter exceeded';
            break;
          default:
            message = err.message;
        }
        return harden(['error', message, null]);
      }

I'm thinking that the non-meter-fault errors should propogate an Error upwards (i.e. don't catch the default case). We can use a non-rejecting deliverToWorker return promise with a value of ['error'..] to mean "consensus metering fault", and a yes-rejecting return promise (which should then carry all the way back to controller.run()) to mean "unknown worker error, halting the kernel".

I think we can get away without fixing this for the stress-test phase this week, but the danger is that something which happens to kill a worker process might trick that one validator into thinking the vat has had a consensus metering fault, and once that validator commits the results, it won't be able to rejoin consensus. (I think. @michaelfig has investigated this more than me).

@warner warner added bug Something isn't working SwingSet package: SwingSet xsnap the XS execution tool labels Jun 22, 2021
@michaelfig
Copy link
Member

Well-stated!

might trick that one validator into thinking the vat has had a consensus metering fault, and once that validator commits the results, it won't be able to rejoin consensus.

That's correct. With the current problem, the kernel appears to have already committed the state that said the vat was terminated before we actually have a noticeable behaviour divergence, rather than just crashing the kernel immediately and not committing that state.

@warner
Copy link
Member Author

warner commented Jun 22, 2021

This fix also needs to change deliver() in manager-helper.js, which catches all errors in deliverToWorker and replaces them with a ['error', err.message, null] VatDeliveryResult. We need a plan.

One option is to define deliver() to do one of three things:

  • return ['ok', null, meterUsage]: happy
  • return ['error', problem, null]: consensus unhappy (metering fault): terminate vat, continue with kernel
  • reject: non-consunsus unhappy: halt kernel

Another option is to only use return, but examine problem to distinguish between the last two cases. I don't like the idea of parsing a string to make that distinction.

I'm in favor of the first option.. any other opinions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working SwingSet package: SwingSet xsnap the XS execution tool
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants