Skip to content

Commit b46a4c1

Browse files
authored
docs: DR for clustered data-plane (#4522)
* docs: DR for clustered data-plane * PR remark * PR remarks
1 parent d2b2f90 commit b46a4c1

File tree

2 files changed

+45
-0
lines changed

2 files changed

+45
-0
lines changed
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,44 @@
1+
# Clustered data-plane
2+
3+
## Decision
4+
5+
We will make the data-plane being able to run in a clustered environment.
6+
7+
## Rationale
8+
9+
Currently, data-plane cannot run effectively in a clustered environment because:
10+
- there's no way to identify a specific replica that is running a data flow and "terminate/suspend" it
11+
- there's no way to re-start a data flow that was interrupted because the replica crashed
12+
13+
## Approach
14+
15+
We will provide this feature through the `DataPlaneStore` persistence layer.
16+
A `runtimeId` will be added in the `DataFlow`, and it will be set when it gets started with the replica's `runtimeId`.
17+
There will be a configured duration `flowLease` (that can be in milliseconds, seconds at most).
18+
19+
### Identify specific replica to suspend/terminate
20+
21+
The now synchronous `suspend`/`terminate` will become asynchronous by putting the `DataFlow` in a `-ING` state like:
22+
- `SUSPENDING`
23+
- `TERMINATING`
24+
25+
For termination (the same logic will be duplicated for suspension), in the `DataPlaneManager` state machine there will be
26+
two new `Processor` registered:
27+
- one filters by `TERMINATING` state and `runtimeId`: it will stop the data flow and transition it to `TERMINATED`
28+
- one filters by `TERMINATING` and by `updatedAt` passed by at least 2/3 times `flowLease`: it will transition the data flow to
29+
`TERMINATED` (as cleanup for dangling data flows).
30+
31+
Note: once the "termination" message is sent from the control-plane to the data-plane and the ACK received, the control-plane
32+
will consider the `DataFlow` as terminated, and it will continue evaluating the termination logic on the `TransferProcess`
33+
(send protocol message, transition to `TERMINATED`).
34+
We consider this acceptable because `DataFlow` termination is generally a cleanup operation that shouldn't take too much time.
35+
36+
### Re-start interrupted data flow
37+
38+
Please consider `flowLease` as a configured time duration (milliseconds, seconds at most).
39+
40+
A running data flow will need to update the `updatedAt` field every `flowLease`
41+
In the `DataPlaneManager` state machine, fetches items in `STARTED` with `runtimeId` different from the replica one,
42+
that have `updatedAt` past by at least 2/3 times `flowLease`.
43+
These data-flows can then be started again
44+

docs/developer/decision-records/README.md

+1
Original file line numberDiff line numberDiff line change
@@ -62,3 +62,4 @@
6262
- [2024-08-16 Policy_Validation](./2024-08-16-policy-validation)
6363
- [2024-09-24 STS Accounts API](./2024-09-24-sts-accounts-api)
6464
- [2024-09-25 Multiple Protocol Versions](./2024-09-25-multiple-protocol-versions)
65+
- [2024-10-02 Clustered data-plane](./2024-10-02-clustered-data-plane/)

0 commit comments

Comments
 (0)