If a vat worker subprocess exits unexpectedly, our kernel does not know the state of the vat: the worker might have died because of something the vat did, or because of something outside swingset (maybe the host computer is being rebooted and all processes are being killed, in some random order). #2958 is about having a policy to react to an unexpected worker termination.
If we're in "consensus mode", we must crash the kernel: we do not know why the worker terminated, so we don't know that it's also being terminated on all other validators. In particular, metering faults are a distinct "known" form of worker termination. We need to be able to distinguish between a metering fault and some other random error.
manager-subprocess-xsnap.js has a catch inside deliverToWorker that conflates these cases:
async function deliverToWorker(delivery) {
  parentLog(vatID, `sending delivery`, delivery);
  let result;
  try {
    result = await issueTagged(['deliver', delivery]);
  } catch (err) {
    parentLog('issueTagged error:', err.code, err.message);
    let message;
    switch (err.code) {
      case ExitCode.E_TOO_MUCH_COMPUTATION:
        message = 'Compute meter exceeded';
        break;
      case ExitCode.E_STACK_OVERFLOW:
        message = 'Stack meter exceeded';
        break;
      case ExitCode.E_NOT_ENOUGH_MEMORY:
        message = 'Allocate meter exceeded';
        break;
      default:
        message = err.message;
    }
    return harden(['error', message, null]);
  }
  // ...
}
I'm thinking that the non-meter-fault errors should propagate an Error upwards (i.e. don't catch the default case). We can use a non-rejecting deliverToWorker return promise with a value of ['error'..] to mean "consensus metering fault", and a rejecting return promise (which should then carry all the way back to controller.run()) to mean "unknown worker error, halting the kernel".
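A minimal sketch of that split, to make the proposal concrete. It is not the actual manager-subprocess-xsnap.js code: the ExitCode numbers are placeholders, makeDeliverToWorker and meterFaultMessage are hypothetical names, issueTagged is injected so the snippet runs without a worker, and harden is left out.

```javascript
// Placeholder values: the real ExitCode table comes from xsnap.
const ExitCode = {
  E_TOO_MUCH_COMPUTATION: 1,
  E_STACK_OVERFLOW: 2,
  E_NOT_ENOUGH_MEMORY: 3,
};

// Only these exit codes count as "known" (consensus) metering faults.
const METER_FAULTS = new Map([
  [ExitCode.E_TOO_MUCH_COMPUTATION, 'Compute meter exceeded'],
  [ExitCode.E_STACK_OVERFLOW, 'Stack meter exceeded'],
  [ExitCode.E_NOT_ENOUGH_MEMORY, 'Allocate meter exceeded'],
]);

// Hypothetical helper: the message for a metering fault, or undefined
// for any other exit code.
function meterFaultMessage(code) {
  return METER_FAULTS.get(code);
}

// Hypothetical factory: issueTagged is passed in for illustration.
function makeDeliverToWorker(issueTagged) {
  return async function deliverToWorker(delivery) {
    try {
      return await issueTagged(['deliver', delivery]);
    } catch (err) {
      const message = meterFaultMessage(err.code);
      if (message === undefined) {
        // Unknown worker failure: rethrow so the returned promise rejects.
        throw err;
      }
      // Known metering fault: report it in-band, without rejecting.
      return ['error', message, null];
    }
  };
}
```

With this shape, a worker killed by something outside swingset surfaces as a rejected promise rather than as an ['error', ..] result that looks like a metering fault.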
I think we can get away without fixing this for the stress-test phase this week, but the danger is that something which happens to kill a worker process might trick that one validator into thinking the vat has had a consensus metering fault, and once that validator commits the results, it won't be able to rejoin consensus. (I think. @michaelfig has investigated this more than me).
might trick that one validator into thinking the vat has had a consensus metering fault, and once that validator commits the results, it won't be able to rejoin consensus.
That's correct. With the current problem, the kernel appears to have already committed the state that said the vat was terminated before we actually have a noticeable behaviour divergence, rather than just crashing the kernel immediately and not committing that state.
This fix also needs to change deliver() in manager-helper.js, which catches all errors in deliverToWorker and replaces them with a ['error', err.message, null] VatDeliveryResult. We need a plan.
One option is to define deliver() to do one of three things:
Another option is to only use return, but examine problem to distinguish between the last two cases. I don't like the idea of parsing a string to make that distinction.
I'm in favor of the first option. Any other opinions?
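A rough sketch of how the first option might look from the caller's side. It assumes the three outcomes are a normal result, an in-band ['error', problem, ..] result for a consensus fault, and a rejected promise for anything else; that reading is inferred from the discussion above. makeDeliver and runOneDelivery are hypothetical names, not the manager-helper.js API.

```javascript
// Hypothetical stand-in for deliver() in manager-helper.js.
function makeDeliver(deliverToWorker) {
  return async function deliver(delivery) {
    // No catch-all here: a rejection from deliverToWorker passes through
    // instead of being rewritten into ['error', err.message, null].
    return deliverToWorker(delivery);
  };
}

// Hypothetical kernel-side caller: decides what to do with one delivery.
async function runOneDelivery(deliver, delivery) {
  // If deliver() rejects, this await throws and the error keeps
  // propagating toward the run loop (unknown worker error).
  const [status, problem] = await deliver(delivery);
  if (status === 'error') {
    // In-band error: treated as a fault every validator will also see.
    return { terminateVat: true, problem };
  }
  return { terminateVat: false, problem: null };
}
```

The caller never has to parse the problem string to tell the last two cases apart: the distinction is carried by whether the promise resolves or rejects.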