
Joint Consensus #101

Merged: 50 commits, Feb 14, 2019

Commits (changes below are shown from 15 of the 50 commits)
1b77e82
Enable Joint Consensus.
Hoverbear Sep 12, 2018
21068ff
Reduce trace messages
Hoverbear Nov 21, 2018
79df099
Test and bugfix intermingled conf changes
Hoverbear Nov 22, 2018
cd93df0
Harmonize function names, documentation example.
Hoverbear Nov 22, 2018
c62aee7
fmt
Hoverbear Nov 22, 2018
d3f4f7d
Harmonize error message name
Hoverbear Nov 22, 2018
2f3113b
Touchup docs
Hoverbear Nov 22, 2018
4385ea9
Doc refinement and correct candidacy_status bug
Hoverbear Nov 22, 2018
9076171
Add pitfall doc note
Hoverbear Nov 22, 2018
8a31c83
Docs, make progress tests nicer
Hoverbear Nov 22, 2018
827fe30
Documentation refinements
Hoverbear Nov 22, 2018
8379622
Lower asserts to result returns on public Raft functions
Hoverbear Nov 22, 2018
c51e5ce
Further documentation clarity
Hoverbear Nov 22, 2018
617858f
fmt
Hoverbear Nov 22, 2018
7f35bd1
Resolve nits
Hoverbear Nov 29, 2018
c38ef4c
Merge branch 'master' into joint-consensus
Hoverbear Nov 30, 2018
0a5011f
Resolve warning
Hoverbear Nov 30, 2018
60c28cb
Merge branch 'joint-consensus' of github.com:pingcap/raft-rs into joi…
Hoverbear Nov 30, 2018
8964a1c
Resolve minor nits.
Hoverbear Dec 6, 2018
77169b5
Persist began_conf_change_at to hard state
Hoverbear Dec 6, 2018
146d749
Fix power cycle failure
Hoverbear Dec 28, 2018
42a14e1
Use ConfChange instead of Entry.
Hoverbear Jan 3, 2019
2635917
Merge branch 'master' into joint-consensus
Hoverbear Jan 3, 2019
bacf6e3
wip
Hoverbear Jan 4, 2019
0429927
Some refinements, still problems.
Hoverbear Jan 9, 2019
815f7d2
Test is green!
Hoverbear Jan 9, 2019
c26eef4
lints
Hoverbear Jan 11, 2019
e9721e5
Use appropriate name for proto
Hoverbear Jan 11, 2019
8c4ba9d
Add/remove/promote error in progress on pending change.
Hoverbear Jan 11, 2019
bc66fd7
Raft layer insert/remove/promote report errors
Hoverbear Jan 11, 2019
74c77aa
apply_conf_change in rawnode gives errors
Hoverbear Jan 11, 2019
3e7ed81
Contracts -> Errors
Hoverbear Jan 11, 2019
9d9a892
Fix test_raw_node_propose_add_duplicate_node test
Hoverbear Jan 11, 2019
115f72b
fmt
Hoverbear Jan 14, 2019
4d6ee81
lints
Hoverbear Jan 14, 2019
b265184
Merge branch 'master' into joint-consensus
Hoverbear Jan 15, 2019
15f19ab
contains doesn't borrow
Hoverbear Jan 18, 2019
99f47e6
Ensure benches aren't noops
Hoverbear Jan 18, 2019
dfea373
Reflect some of @qupeng's feedback. One test still fails.
Hoverbear Jan 28, 2019
7b84b24
fix a test case about power cycle. (#173)
hicqu Jan 29, 2019
f77da21
Merge branch 'master' into joint-consensus
hicqu Jan 30, 2019
39cb541
Fix nits
Hoverbear Jan 30, 2019
579e107
Merge branch 'master' into joint-consensus
hicqu Jan 30, 2019
cb812de
Merge branch 'master' into joint-consensus
Hoverbear Jan 31, 2019
ff73fe9
Clippy/fmt
Hoverbear Jan 31, 2019
fecdf59
Merge branch 'joint-consensus' of github.com:pingcap/raft-rs into joi…
Hoverbear Jan 31, 2019
356abd7
Documentation touchup
Hoverbear Jan 31, 2019
016e4a7
fmt
Hoverbear Feb 1, 2019
55385dd
simplify snapshot restore. (#176)
hicqu Feb 2, 2019
542655e
Merge branch 'master' into joint-consensus
Hoverbear Feb 11, 2019
1 change: 1 addition & 0 deletions Cargo.toml
@@ -22,6 +22,7 @@ quick-error = "1.2.2"
rand = "0.5.4"
fxhash = "0.2.1"
fail = { version = "0.2", optional = true }
getset = "0.0.6"

[dev-dependencies]
env_logger = "0.5"
29 changes: 24 additions & 5 deletions benches/suites/progress_set.rs
@@ -11,7 +11,8 @@ pub fn bench_progress_set(c: &mut Criterion) {
bench_progress_set_remove(c);
bench_progress_set_iter(c);
bench_progress_set_get(c);
bench_progress_set_nodes(c);
bench_progress_set_voters(c);
bench_progress_set_learners(c);
}

fn quick_progress_set(voters: usize, learners: usize) -> ProgressSet {
@@ -28,7 +29,7 @@ fn quick_progress_set(voters: usize, learners: usize) -> ProgressSet {
pub fn bench_progress_set_new(c: &mut Criterion) {
let bench = |b: &mut Bencher| {
// No setup.
b.iter(|| ProgressSet::new());
b.iter(|| ProgressSet::new);
};

c.bench_function("ProgressSet::new", bench);
@@ -146,14 +147,32 @@ pub fn bench_progress_set_iter(c: &mut Criterion) {
});
}

pub fn bench_progress_set_nodes(c: &mut Criterion) {
pub fn bench_progress_set_voters(c: &mut Criterion) {
let bench = |voters, learners| {
move |b: &mut Bencher| {
let set = quick_progress_set(voters, learners);
b.iter(|| {
let set = set.clone();
let agg = set.iter().all(|_| true);
agg
set.voters().for_each(|_| {});
});
}
};

DEFAULT_RAFT_SETS.iter().for_each(|(voters, learners)| {
c.bench_function(
&format!("ProgressSet::voters ({}, {})", voters, learners),
bench(*voters, *learners),
);
});
}

pub fn bench_progress_set_learners(c: &mut Criterion) {
let bench = |voters, learners| {
move |b: &mut Bencher| {
let set = quick_progress_set(voters, learners);
b.iter(|| {
let set = set.clone();
set.learners().for_each(|_| {});
});
}
};
3 changes: 1 addition & 2 deletions benches/suites/raw_node.rs
@@ -10,8 +10,7 @@ fn quick_raw_node() -> RawNode<MemStorage> {
let peers = vec![];
let storage = MemStorage::default();
let config = Config::new(id);
let node = RawNode::new(&config, storage, peers).unwrap();
node
RawNode::new(&config, storage, peers).unwrap()
}

pub fn bench_raw_node_new(c: &mut Criterion) {
5 changes: 5 additions & 0 deletions proto/eraftpb.proto
@@ -91,11 +91,16 @@ enum ConfChangeType {
AddNode = 0;
RemoveNode = 1;
AddLearnerNode = 2;
BeginConfChange = 3;
FinalizeConfChange = 4;
}

message ConfChange {
uint64 id = 1;
ConfChangeType change_type = 2;
// Used in `AddNode`, `RemoveNode`, and `AddLearnerNode`.
uint64 node_id = 3;
bytes context = 4;
// Used in `BeginConfChange` and `FinalizeConfChange`.
ConfState configuration = 5;
A contributor commented:

If you agree with my comment about ConfState, then I think we should add a new struct here, for example:

```proto
struct MembershipChange {
    repeated uint64 nodes;
    repeated uint64 learner_nodes;
}
```

The author replied:

😆 This is exactly what ConfState was originally.

}
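To make the shape of the extended message concrete, here is a sketch using plain Rust structs as stand-ins for the protobuf-generated types (the real definitions are generated into `src/eraftpb.rs`; the names mirror the proto above, but the `Option` usage is illustrative):

```rust
// Stand-in Rust structs mirroring the proto definitions above.
#[derive(Debug, Default)]
struct ConfState {
    nodes: Vec<u64>,
    learners: Vec<u64>,
}

#[derive(Debug)]
enum ConfChangeType {
    AddNode,
    RemoveNode,
    AddLearnerNode,
    BeginConfChange,
    FinalizeConfChange,
}

#[derive(Debug)]
struct ConfChange {
    id: u64,
    change_type: ConfChangeType,
    // Used by AddNode / RemoveNode / AddLearnerNode.
    node_id: u64,
    context: Vec<u8>,
    // Used by BeginConfChange / FinalizeConfChange: carries the full target
    // configuration rather than a single node id.
    configuration: Option<ConfState>,
}

fn main() {
    let begin = ConfChange {
        id: 1,
        change_type: ConfChangeType::BeginConfChange,
        node_id: 0,
        context: Vec::new(),
        configuration: Some(ConfState { nodes: vec![1, 2, 3], learners: vec![4] }),
    };
    assert!(matches!(begin.change_type, ConfChangeType::BeginConfChange));
    assert_eq!(begin.configuration.unwrap().nodes, vec![1, 2, 3]);
}
```

The key design point under discussion: a `BeginConfChange`/`FinalizeConfChange` identifies a whole configuration, so it reuses a `ConfState`-shaped payload instead of the single `node_id` field used by the other change types.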
466 changes: 277 additions & 189 deletions src/eraftpb.rs

Large diffs are not rendered by default.

13 changes: 13 additions & 0 deletions src/errors.rs
@@ -13,6 +13,7 @@

use std::error;
use std::{cmp, io, result};
use StateRole;

use protobuf::ProtobufError;

@@ -63,6 +64,18 @@ quick_error! {
NotExists(id: u64, set: &'static str) {
display("The node {} is not in the {} set.", id, set)
}
/// The action given requires the node to be in a particular state role.
InvalidState(role: StateRole) {
display("Cannot complete that action while in {:?} role.", role)
}
/// The node attempted to transition to a new membership configuration while there was none pending.
NoPendingMembershipChange {
display("No pending membership change. Create a pending transition with `Raft::propose_membership_change` on the leader.")
}
/// An argument violates a calling contract.
ViolatesContract(contract: String) {
display("An argument violates a calling contract: {}", contract)
}
}
}

122 changes: 108 additions & 14 deletions src/lib.rs
@@ -29,7 +29,10 @@

## Creating a Raft node

You can use [`RawNode::new`](raw_node/struct.RawNode.html#method.new) to create the Raft node. To create the Raft node, you need to provide a [`Storage`](storage/trait.Storage.html) component, and a [`Config`](struct.Config.html) to the [`RawNode::new`](raw_node/struct.RawNode.html#method.new) function.
You can use [`RawNode::new`](raw_node/struct.RawNode.html#method.new) to create the Raft node. To
create the Raft node, you need to provide a [`Storage`](storage/trait.Storage.html) component, and
a [`Config`](struct.Config.html) to the [`RawNode::new`](raw_node/struct.RawNode.html#method.new)
function.

```rust
use raft::{
    Config, storage::MemStorage, raw_node::RawNode,
};
// Elided in this view: creating the Config and the RawNode.
node.raft.become_leader();
```

## Ticking the Raft node

Use a timer to tick the Raft node at regular intervals. See the following example using Rust channel `recv_timeout` to drive the Raft node at least every 100ms, calling [`tick()`](raw_node/struct.RawNode.html#method.tick) each time.
Use a timer to tick the Raft node at regular intervals. See the following example using Rust
channel `recv_timeout` to drive the Raft node at least every 100ms, calling
[`tick()`](raw_node/struct.RawNode.html#method.tick) each time.

```rust
# use raft::{Config, storage::MemStorage, raw_node::RawNode};
# // Elided in this view: the timer loop calling `node.tick()` at least every 100ms.
```

## Proposing to, and stepping the Raft node

Using the `propose` function you can drive the Raft node when the client sends a request to the Raft server. You can call `propose` to add the request to the Raft log explicitly.
Using the `propose` function you can drive the Raft node when the client sends a request to the
Raft server. You can call `propose` to add the request to the Raft log explicitly.

In most cases, the client needs to wait for a response to the request. For example, if the
client writes a value to a key, it wants to know whether the write succeeded. But the write
flow in Raft is asynchronous: the log entry must be replicated to the other followers, then
committed, and finally applied to the state machine. So we need a way to notify the client
after the write is finished.

One simple way is to use a unique ID for the client request, and save the associated callback function in a hash map. When the log entry is applied, we can get the ID from the decoded entry, call the corresponding callback, and notify the client.
One simple way is to use a unique ID for the client request, and save the associated callback
function in a hash map. When the log entry is applied, we can get the ID from the decoded entry,
call the corresponding callback, and notify the client.
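As a sketch of that ID-to-callback pattern (a stand-alone illustration; `CallbackRegistry` and its methods are hypothetical names, not part of the raft-rs API):

```rust
use std::collections::HashMap;

// Hypothetical callback registry: maps a client-generated request ID to the
// closure that answers that client.
struct CallbackRegistry {
    callbacks: HashMap<u64, Box<dyn FnOnce(bool)>>,
}

impl CallbackRegistry {
    fn new() -> Self {
        CallbackRegistry { callbacks: HashMap::new() }
    }

    // Record the callback when the request is proposed to Raft.
    fn register(&mut self, id: u64, cb: Box<dyn FnOnce(bool)>) {
        self.callbacks.insert(id, cb);
    }

    // Fire the callback when the entry carrying `id` is applied to the
    // state machine; completing an unknown or already-completed ID is a no-op.
    fn complete(&mut self, id: u64, success: bool) {
        if let Some(cb) = self.callbacks.remove(&id) {
            cb(success);
        }
    }
}

fn main() {
    let mut registry = CallbackRegistry::new();
    // On propose: encode the ID into the proposal and remember the callback.
    registry.register(1, Box::new(|ok| assert!(ok)));
    // On apply: decode the ID from the entry and notify the client.
    registry.complete(1, true);
}
```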

You can call the `step` function when you receive the Raft messages from other nodes.

```rust,ignore
// Elided in this view: receive on the channel, then propose or step.
loop {
    // ...
}
```

In the above example, we use a channel to receive the `propose` and `step` messages. We only propose the request ID to the Raft log. In your own practice, you can embed the ID in your request and propose the encoded binary request data.
In the above example, we use a channel to receive the `propose` and `step` messages. We only
propose the request ID to the Raft log. In your own practice, you can embed the ID in your request
and propose the encoded binary request data.

## Processing the `Ready` State

When your Raft node is ticked and running, Raft should enter a `Ready` state. You need to first use `has_ready` to check whether Raft is ready. If yes, use the `ready` function to get a `Ready` state:
When your Raft node is ticked and running, Raft should enter a `Ready` state. You need to first use
`has_ready` to check whether Raft is ready. If yes, use the `ready` function to get a `Ready`
state:

```rust,ignore
if !node.has_ready() {
    return;
}

// The Raft is ready, we can do something now.
let mut ready = node.ready();
```

The `Ready` state contains quite a bit of information, and you need to check and process them one by one:
The `Ready` state contains quite a bit of information, and you need to check and process them one
by one:

1. Check whether `snapshot` is empty or not. If not empty, it means that the Raft node has received a Raft snapshot from the leader and we must apply the snapshot:
1. Check whether `snapshot` is empty or not. If not empty, it means that the Raft node has received
a Raft snapshot from the leader and we must apply the snapshot:

```rust,ignore
if !raft::is_empty_snap(ready.snapshot()) {
    // Elided in this view: apply the snapshot through your storage here.
}

```

2. Check whether `entries` is empty or not. If not empty, it means that there are newly added entries but has not been committed yet, we must append the entries to the Raft log:
2. Check whether `entries` is empty or not. If not empty, it means that there are newly added
entries that have not been committed yet; we must append these entries to the Raft log:

```rust,ignore
if !ready.entries.is_empty() {
Expand All @@ -205,7 +224,9 @@ The `Ready` state contains quite a bit of information, and you need to check and

```

3. Check whether `hs` is empty or not. If not empty, it means that the `HardState` of the node has changed. For example, the node may vote for a new leader, or the commit index has been increased. We must persist the changed `HardState`:
3. Check whether `hs` is empty or not. If not empty, it means that the `HardState` of the node has
changed. For example, the node may vote for a new leader, or the commit index has been increased.
We must persist the changed `HardState`:

```rust,ignore
if let Some(hs) = ready.hs() {
Expand All @@ -214,7 +235,10 @@ The `Ready` state contains quite a bit of information, and you need to check and
}
```

4. Check whether `messages` is empty or not. If not, it means that the node will send messages to other nodes. There has been an optimization for sending messages: if the node is a leader, this can be done together with step 1 in parallel; if the node is not a leader, it needs to reply the messages to the leader after appending the Raft entries:
4. Check whether `messages` is empty or not. If not, it means that the node needs to send messages
to other nodes. There is an optimization for sending messages: if the node is a leader, this can
be done together with step 1 in parallel; if the node is not a leader, it must send the reply
messages to the leader only after appending the Raft entries:

```rust,ignore
if !is_leader {
Expand All @@ -227,7 +251,9 @@ The `Ready` state contains quite a bit of information, and you need to check and
}
```

5. Check whether `committed_entries` is empty or not. If not, it means that there are some newly
committed log entries which you must apply to the state machine. Of course, after applying, you
need to update the applied index and resume `apply` later:

```rust,ignore
if let Some(committed_entries) = ready.committed_entries.take() {
    // Elided in this view: apply each committed entry, then advance the node.
}
```

For more information, check out an [example](examples/single_mem_node/main.rs#L113-L179).

## Membership Changes

When building a resilient, scalable distributed system there is a strong need to be able to change
the membership of a peer group *dynamically, without downtime.* This Raft crate supports this via
**Joint Consensus**
([Raft paper, section 6](https://web.stanford.edu/~ouster/cgi-bin/papers/raft-atc14)).

It permits resilient arbitrary dynamic membership changes. A membership change can do any or all of
the following:

* Add peer (learner or voter) *n* to the group.
* Remove peer *n* from the group.
* Remove a leader (unmanaged, via stepdown).
* Promote a learner to a voter.

It (currently) does not:

* Allow control of the replacement leader during a stepdown.
* Optionally roll back a change mid-transition if the new peer group configuration fails.
* Provide automated promotion of newly added voters from learner to voter when they are caught up.
This must be done as a two stage process for now.

> PRs to enable these are welcome! We'd love to mentor/support you through implementing it.

This means it's possible to do:

```rust
use raft::{Config, storage::MemStorage, raw_node::RawNode, eraftpb::Message};
let config = Config { id: 1, peers: vec![1], ..Default::default() };
let mut node = RawNode::new(&config, MemStorage::default(), vec![]).unwrap();
node.raft.become_candidate();
node.raft.become_leader();

// Call this on the leader, or send the command via a normal `MsgPropose`.
node.raft.propose_membership_change((
// Any IntoIterator<Item=u64>.
// Voters
vec![1,2,3],
// Learners
vec![4,5,6],
)).unwrap();

# let entry = &node.raft.raft_log.entries(2, 1).unwrap()[0];
// ...Later when the begin entry is ready to apply:
node.raft.begin_membership_change(entry).unwrap();
assert!(node.raft.is_in_membership_change());
#
# // We hide this since the user isn't really encouraged to blindly call this, but we'd like a short
# // example.
# node.raft.commit_apply(2);
#
# let entry = &node.raft.raft_log.entries(3, 1).unwrap()[0];
// ...Later, when the finalize entry is ready to apply:
node.raft.finalize_membership_change(entry).unwrap();
assert!(node.raft.prs().voter_ids().contains(&2));
assert!(!node.raft.is_in_membership_change());
```

This is a two-phase process; in the midst of it, the peer group's leader manages
**two independent, possibly overlapping peer sets**.

> **Note:** In order to maintain resiliency guarantees (progress while a majority of both peer sets
is active), it is very important to wait until the entire peer group has exited the transition phase
before taking old, removed peers offline.
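The resiliency condition in that note can be sketched as a quorum check over both voter sets (a stand-alone illustration of the rule, not the crate's internal implementation):

```rust
use std::collections::HashSet;

// During joint consensus, an entry (or an election) succeeds only with a
// majority of BOTH the old and the new voter sets.
fn joint_quorum_reached(old: &HashSet<u64>, new: &HashSet<u64>, acks: &HashSet<u64>) -> bool {
    // A strict majority: more than half of the set has acknowledged.
    let majority = |set: &HashSet<u64>| 2 * set.intersection(acks).count() > set.len();
    majority(old) && majority(new)
}

fn main() {
    let old: HashSet<u64> = [1, 2, 3].iter().cloned().collect();
    let new: HashSet<u64> = [2, 3, 4, 5].iter().cloned().collect();

    // {2, 3} is a majority of the old set but only half of the new one.
    let acks: HashSet<u64> = [2, 3].iter().cloned().collect();
    assert!(!joint_quorum_reached(&old, &new, &acks));

    // {2, 3, 4} satisfies both sets, so progress is possible.
    let acks: HashSet<u64> = [2, 3, 4].iter().cloned().collect();
    assert!(joint_quorum_reached(&old, &new, &acks));
}
```

This is why removed peers must stay online until the transition finalizes: they may still be needed to form a majority of the old set.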

*/

#![cfg_attr(not(feature = "cargo-clippy"), allow(unknown_lints))]
@@ -275,6 +367,8 @@ extern crate quick_error;
#[cfg(test)]
extern crate env_logger;
extern crate rand;
#[macro_use]
extern crate getset;

mod config;
/// This module supplies the needed message types. However, it is autogenerated and thus cannot be
@@ -297,7 +391,7 @@ pub mod util;
pub use self::config::Config;
pub use self::errors::{Error, Result, StorageError};
pub use self::log_unstable::Unstable;
pub use self::progress::{Inflights, Progress, ProgressSet, ProgressState};
pub use self::progress::{Configuration, Inflights, Progress, ProgressSet, ProgressState};
pub use self::raft::{vote_resp_msg_type, Raft, SoftState, StateRole, INVALID_ID, INVALID_INDEX};
pub use self::raft_log::{RaftLog, NO_LIMIT};
pub use self::raw_node::{is_empty_snap, Peer, RawNode, Ready, SnapshotStatus};