|
| 1 | +# Clustered data-plane |
| 2 | + |
| 3 | +## Decision |
| 4 | + |
| 5 | +We will make the data-plane able to run in a clustered environment. |
| 6 | + |
| 7 | +## Rationale |
| 8 | + |
| 9 | +Currently, the data-plane cannot run effectively in a clustered environment because: |
| 10 | +- there is no way to identify the specific replica that is running a data flow in order to terminate or suspend it |
| 11 | +- there is no way to restart a data flow that was interrupted because its replica crashed |
| 12 | + |
| 13 | +## Approach |
| 14 | + |
| 15 | +We will provide this feature through the `DataPlaneStore` persistence layer. |
| 16 | +A `runtimeId` field will be added to the `DataFlow`; when a data flow is started, it will be set to the `runtimeId` of the replica that started it. |
| 17 | +There will also be a configured duration, `flowLease` (in the order of milliseconds, seconds at most). |
| 18 | + |
| 19 | +### Identify specific replica to suspend/terminate |
| 20 | + |
| 21 | +The currently synchronous `suspend`/`terminate` operations will become asynchronous by putting the `DataFlow` in an `-ING` state: |
| 22 | +- `SUSPENDING` |
| 23 | +- `TERMINATING` |
| 24 | + |
| 25 | +For termination (the same logic will be duplicated for suspension), two new `Processor`s will be registered in the |
| 26 | +`DataPlaneManager` state machine: |
| 27 | +- one filters by `TERMINATING` state and by the replica's `runtimeId`: it will stop the data flow and transition it to `TERMINATED` |
| 28 | +- one filters by `TERMINATING` state and by an `updatedAt` that is at least 2 to 3 times `flowLease` in the past: it will |
| 29 | + transition the data flow to `TERMINATED` (as cleanup for dangling data flows). |
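
The two filters above can be sketched as plain predicates. This is only an illustration under stated assumptions, not the actual `DataPlaneManager` code: the `Flow` record, the method names and the integer `staleMultiplier` (standing in for the "2/3 times `flowLease`" threshold) are all hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

class TerminationFilters {

    // Hypothetical, simplified view of a DataFlow: only the fields the filters need.
    record Flow(String id, String state, String runtimeId, Instant updatedAt) {}

    // First processor: flows in TERMINATING that are owned by this replica.
    // The owning replica is the one that can actually stop the transfer.
    static boolean ownedAndTerminating(Flow flow, String localRuntimeId) {
        return "TERMINATING".equals(flow.state())
                && localRuntimeId.equals(flow.runtimeId());
    }

    // Second processor: flows in TERMINATING whose updatedAt is older than
    // staleMultiplier * flowLease, whichever replica owns them (cleanup of dangling flows).
    static boolean danglingAndTerminating(Flow flow, Instant now, Duration flowLease, int staleMultiplier) {
        Instant threshold = now.minus(flowLease.multipliedBy(staleMultiplier));
        return "TERMINATING".equals(flow.state())
                && flow.updatedAt().isBefore(threshold);
    }
}
```

Keeping the second filter independent of `runtimeId` is what lets any surviving replica clean up a flow whose owner has crashed.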
| 30 | + |
| 31 | +Note: once the "termination" message is sent from the control-plane to the data-plane and the ACK received, the control-plane |
| 32 | +will consider the `DataFlow` as terminated, and it will continue evaluating the termination logic on the `TransferProcess` |
| 33 | +(send protocol message, transition to `TERMINATED`). |
| 34 | +We consider this acceptable because `DataFlow` termination is generally a cleanup operation that shouldn't take too much time. |
| 35 | + |
| 36 | +### Re-start interrupted data flow |
| 37 | + |
| 38 | +Please consider `flowLease` as a configured time duration (milliseconds, seconds at most). |
| 39 | + |
| 40 | +A running data flow will need to update its `updatedAt` field every `flowLease`. |
| 41 | +The `DataPlaneManager` state machine will fetch data flows in `STARTED` whose `runtimeId` differs from the replica's own |
| 42 | +and whose `updatedAt` is at least 2 to 3 times `flowLease` in the past. |
| 43 | +These data flows can then be started again. |
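
A minimal sketch of the lease mechanism described here, assuming an in-memory list in place of the `DataPlaneStore`. The `Flow` record, the method names and the integer `staleMultiplier` (standing in for the "2/3 times `flowLease`" threshold) are hypothetical and only illustrate the selection logic.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

class RestartSelection {

    // Hypothetical, simplified view of a DataFlow.
    record Flow(String id, String state, String runtimeId, Instant updatedAt) {}

    // Heartbeat: the owning replica refreshes updatedAt every flowLease.
    static Flow renewLease(Flow flow, Instant now) {
        return new Flow(flow.id(), flow.state(), flow.runtimeId(), now);
    }

    // Select STARTED flows owned by another replica whose lease has expired,
    // i.e. updatedAt is older than staleMultiplier * flowLease.
    static List<Flow> restartable(List<Flow> flows, String localRuntimeId,
                                  Instant now, Duration flowLease, int staleMultiplier) {
        Instant threshold = now.minus(flowLease.multipliedBy(staleMultiplier));
        return flows.stream()
                .filter(f -> "STARTED".equals(f.state()))
                .filter(f -> !localRuntimeId.equals(f.runtimeId()))
                .filter(f -> f.updatedAt().isBefore(threshold))
                .collect(Collectors.toList());
    }

    // Taking over: the restarting replica writes its own runtimeId and a fresh updatedAt.
    static Flow takeOver(Flow flow, String localRuntimeId, Instant now) {
        return new Flow(flow.id(), flow.state(), localRuntimeId, now);
    }
}
```

Using a threshold of several leases rather than a single one leaves room for a heartbeat that is merely late, so a slow but healthy replica is less likely to have its flow taken over.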
| 44 | + |