Advanced Topics in Distributed Systems

Distributed System Modelling

Communication Model

Formally, there are two types of delay in a distributed system, namely the message delay $\Delta$ and the processing delay $\Phi$, but we treat them as one in this course.

  • Fully synchronous: there is a known upper bound on the message delay.
  • Weakly synchronous: the delay grows at most polynomially with respect to time $t$.
  • Semi-synchronous: there is a known probability distribution on the delay.
  • Partially synchronous: there is an unknown upper bound on the delay.
  • Asynchronous: there is no upper bound on the message delay.

Note that we do not consider message loss here.

Fault Model

In a distributed system, everything is fine if there is no failure.

A crash fault only considers that a participant goes down and (possibly) recovers; the participant always plays by the rules. This model is adopted by most tech giants. It can be proved that agreement cannot be reached deterministically over an asynchronous communication channel if even one crash fault is allowed (the FLP theorem).

A Byzantine fault assumes the adversarial case in which a participant may be hostile and may sabotage the system. This is the case in Bitcoin (check how Bitcoin tolerates Byzantine failures).

Consistency Model

Note that consistency in a distributed system (determined by the order of events observed) shares some similarity with memory consistency. A multi-core CPU can be treated as a distributed system, though the operating system hides this fact from the user.

A database sits somewhere in between a distributed system and memory. Consistency in the distributed world is also referred to as external consistency in the database context, cf. the (internal) consistency and isolation of the ACID properties.

Integrity of Distributed System

Correctness

The aforementioned models underpin the discussion of correctness for distributed systems. Correctness can be grouped into:

  • Safety property

    Safety means that “bad things do not happen”. This is the system invariant that has to be maintained throughout the lifespan. Safety properties can be proven by induction.

  • Liveness property

    Liveness means “good things do happen”, with the allowance that the good thing may be reached progressively instead of immediately. In particular, it means the system won’t get stuck. Liveness can only be shown with temporal logic or tools such as TLA+. Some examples of liveness include:

    • progress (no starvation)
    • termination
    • consensus

Time Complexity

Messages exchanged during communication are the major source of time complexity in a distributed system.

Time and Order

The time of occurrence of events naturally defines an order among them. However, because communication among computers is not instantaneous, participants in a distributed system simply cannot agree on the exact time, and thus on the order defined by time (but to check how time is synchronized over the Internet, see this and this), since participants cannot wait indefinitely for a latecomer claiming that its event happened earlier.

But we need time mostly for ordering rather than for its exact value. We can still define orders among the events. Below are some common orderings of events in a distributed system:

  • Total order: a total order can be defined on the events, and thus all replicas process messages in the same order; the order may not respect real time; required by banking; a system observing total order is sequentially consistent (it behaves as if there were only one machine); a total order that respects real time gives linearizability.
  • Causal order / happens-before order: the order captures the causal relationship or the explicit dependency (e.g. reply-to) between events; a typical causal order is reflected in the send-recv or reply-to paradigm.
  • FIFO order: different nodes can process events in different orders so long as the order of events from the same source is preserved; viable in some social network applications.

Note that order and consistency are different. Order is a relation defined on the events. A system observing xxx-order can be said to be xxx-consistent.

Implementing Total Order

  • System needs a global sequencer (leader).
  • Before sending messages, follower asks sequencer for a timestamp.
  • System needs to handle failure case (e.g. in replicated state machine).

The problem is that the system is bottlenecked by the central server (the global sequencer).
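
Below is a minimal sketch of the sequencer approach in Python, assuming an in-memory, failure-free setting; the class names (`Sequencer`, `Replica`) are illustrative, not from any particular system.

```python
import itertools

class Sequencer:
    """Central sequencer that hands out monotonically increasing timestamps."""
    def __init__(self):
        self._counter = itertools.count(1)

    def next_timestamp(self) -> int:
        return next(self._counter)

class Replica:
    """Replica that delivers messages strictly in timestamp order."""
    def __init__(self):
        self._pending = {}        # timestamp -> message
        self._next_expected = 1
        self.delivered = []

    def receive(self, timestamp: int, message: str) -> None:
        self._pending[timestamp] = message
        # Deliver in timestamp order; hold back later messages until gaps are filled.
        while self._next_expected in self._pending:
            self.delivered.append(self._pending.pop(self._next_expected))
            self._next_expected += 1

sequencer = Sequencer()
replicas = [Replica(), Replica()]

# Every sender asks the sequencer for a timestamp before broadcasting.
for msg in ["deposit 100", "withdraw 30"]:
    ts = sequencer.next_timestamp()
    for r in replicas:
        r.receive(ts, msg)

assert replicas[0].delivered == replicas[1].delivered   # the same total order everywhere
```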

Implementing Causal Order

The times of events on a single node are easy to compare; for events on different nodes, we are back to the fundamental problem of distributed systems.

If we insist on an order strictly based on time, we can only ensure a partial order. The total order above is achievable simply because ties are broken arbitrarily (e.g. by node ID or by the leader’s whim).

Lamport’s Logical Clock

  • Each node has a local counter.
  • If any local event (neither sending nor receiving) happens, increment the local counter.
  • Piggyback the counter when sending the message.
  • Reset the local counter to max(local counter, message counter) + 1 when receiving a message.

If LC(x) < LC(y), we can only be assured that either x happens before (and thus may cause) y, or that x and y are concurrent. How do we design a clock such that x happens before y $\iff$ Clock(x) < Clock(y)?
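
A minimal sketch of these rules in Python (class and method names are illustrative; following the notes, only local events and receives bump the counter):

```python
class LamportClock:
    """Lamport's logical clock for a single node."""
    def __init__(self):
        self.counter = 0

    def local_event(self) -> int:
        # Any purely local event increments the counter.
        self.counter += 1
        return self.counter

    def send(self) -> int:
        # Piggyback the current counter on the outgoing message.
        # (Many formulations also increment here, treating the send itself as an event.)
        return self.counter

    def receive(self, msg_counter: int) -> int:
        # Jump past everything seen so far.
        self.counter = max(self.counter, msg_counter) + 1
        return self.counter

a, b = LamportClock(), LamportClock()
a.local_event()                # a's counter: 1
ts = a.send()                  # the message carries 1
assert b.receive(ts) > ts      # b jumps to 2, ordered after the send
```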

Lamport’s Vector Clock

  • Each node has a local vector clock.
  • If any local event happens, increment self entry of self vector clock, i.e. VC[i][i]++.
  • When sending a message, piggyback the clock.
  • When receiving a message, reset the local clock to max(local clock, message clock), where max is taken element-wise, and then increment the node's own entry.

Now when VC(x) < VC(y), where $<$ means $\forall k,\ VC(x)_k \le VC(y)_k$ and $VC(x) \ne VC(y)$, we can say that x happens before (and thus may cause) y.
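
A sketch of the vector clock rules and the comparison (a `happens_before` helper implements the $<$ above; names are illustrative):

```python
from typing import List

class VectorClock:
    """Vector clock for node `node_id` in a system of `n` nodes."""
    def __init__(self, node_id: int, n: int):
        self.node_id = node_id
        self.clock: List[int] = [0] * n

    def local_event(self) -> List[int]:
        self.clock[self.node_id] += 1
        return list(self.clock)

    def send(self) -> List[int]:
        # Piggyback a copy of the whole vector.
        # (Some formulations also increment the own entry here.)
        return list(self.clock)

    def receive(self, msg_clock: List[int]) -> List[int]:
        # Element-wise max with the incoming vector, then count the receive itself.
        self.clock = [max(a, b) for a, b in zip(self.clock, msg_clock)]
        self.clock[self.node_id] += 1
        return list(self.clock)

def happens_before(x: List[int], y: List[int]) -> bool:
    """VC(x) < VC(y): every entry <= and the vectors are not equal."""
    return all(a <= b for a, b in zip(x, y)) and x != y

a, b = VectorClock(0, 2), VectorClock(1, 2)
a.local_event()              # a: [1, 0]
vx = a.send()                # the message carries [1, 0]
vy = b.receive(vx)           # b: [1, 1]
assert happens_before(vx, vy) and not happens_before(vy, vx)
```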

Implementing FIFO Order

This can be achieved as long as each sender labels its messages with consecutive sequence numbers and receivers deliver them in that per-sender order, as sketched below.
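
A minimal per-sender FIFO buffer, assuming each sender numbers its messages consecutively from 0 (names are illustrative):

```python
class FifoChannel:
    """Deliver messages from each sender in sequence-number order."""
    def __init__(self):
        self._next = {}      # sender -> next expected sequence number
        self._buffer = {}    # (sender, seq) -> message
        self.delivered = []

    def receive(self, sender: str, seq: int, message: str) -> None:
        self._buffer[(sender, seq)] = message
        expected = self._next.setdefault(sender, 0)
        # Deliver everything that is now in order; buffer the rest.
        while (sender, expected) in self._buffer:
            self.delivered.append(self._buffer.pop((sender, expected)))
            expected += 1
            self._next[sender] = expected

ch = FifoChannel()
ch.receive("alice", 1, "second")   # arrives early, buffered
ch.receive("alice", 0, "first")    # fills the gap; both are delivered in order
assert ch.delivered == ["first", "second"]
```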

Leader Election

A leader exists for efficiency's sake. In most cases, it reduces the messaging cost from $O(n^2)$ to $O(n)$.

Leader election is akin to the mutual exclusion problem because at any time exactly one machine can be the leader. However, they differ in two ways:

  • mutual exclusion hates starvation, while a leader stays in place as long as there is no failure, which violates the bounded-waiting requirement of mutual exclusion;
  • mutual exclusion doesn't handle a failed lock holder, while leader election must handle a failed leader.

One-line Solution

The node with the highest ID becomes the leader. For example, everyone broadcasts its IP address and the node with the highest IP wins the election.

In distributed systems, we always dig out corner cases that break the protocol. Regarding this one-line solution, it is possible that the node with the highest IP repeatedly dies and recovers, leading to recurrent “by-elections”. This reinforces the idea that “everything is fine if there is no failure”.

Bully Algorithm

The Bully algorithm improves on the one-line solution. It works on a fully-connected network and assumes a synchronous, fail-stop setting by introducing a timeout.

In the Bully algorithm, once a node detects that the leader has failed (by timeout), it sends an Election message to those whose IDs are greater than its own, trying to claim the leadership.

  • If any node with a larger ID replies, it gives up and waits for the Leader message;
  • If it receives no reply to its Election message before the timeout, it assumes it can be the leader and sends a Leader message to all others;
  • If it receives a Reply from a larger-ID node but no subsequent Leader message before the timeout, it assumes the would-be leader has died and re-triggers the election;
  • If it receives an Election message from a smaller-ID node, it sends a Reply back and triggers its own election.

The trick is that sending only to the highest-ID node known so far and waiting for a potential timeout is not worthwhile compared with sending to all larger-ID nodes in parallel, because the highest-ID node may itself have died.

The best-case message complexity (the highest-ID live node detects the failure) is $O(n)$; the worst case (the lowest-ID node detects the failure) is $O(n^2)$.
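
A simplified, synchronous simulation of the election in Python; timeouts are modelled as "no live higher-ID node replied", and the sequential walk up the ID order is a simplification of the real protocol, where all repliers run their own elections concurrently (the winner is the same: the highest live ID).

```python
from typing import Dict

def bully_election(initiator: int, alive: Dict[int, bool]) -> int:
    """Return the ID that ends up broadcasting the Leader message."""
    candidate = initiator
    while True:
        # Election messages go to every node with a larger ID.
        repliers = [i for i in alive if i > candidate and alive[i]]
        if not repliers:
            # Timeout with no Reply: claim leadership and broadcast Leader.
            return candidate
        # A larger node replied; it takes over the election.
        candidate = min(repliers)

alive = {1: True, 2: True, 3: True, 4: False, 5: False}
assert bully_election(initiator=1, alive=alive) == 3   # nodes 4 and 5 are down
```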

Fault Model

The dependable-computing assumptions made so far are: 1) messages are not lost; 2) each peer is honest and cooperative. Due to FLP, in an asynchronous system, fault detection cannot be both complete and accurate.

  • Complete: you don’t miss any faulty node.
  • Accurate: you won’t get any false alarm.

We prefer completeness to accuracy (even if it occasionally raises false alarms). Some fault detection algorithms include the following; a heartbeat sketch follows the list:

  • ping-ack (source-driven/demand-driven)
  • heartbeat (target-driven)
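
A minimal heartbeat-style detector, assuming peers push heartbeats periodically; the class name and `timeout` value are illustrative.

```python
import time

class HeartbeatDetector:
    """Target-driven detection: suspect a peer if no heartbeat arrived within `timeout`."""
    def __init__(self, peers, timeout: float = 2.0):
        self.timeout = timeout
        self.last_seen = {p: time.monotonic() for p in peers}

    def on_heartbeat(self, peer: str) -> None:
        self.last_seen[peer] = time.monotonic()

    def suspected(self):
        now = time.monotonic()
        # Complete but not accurate: a merely slow peer may be wrongly suspected.
        return [p for p, t in self.last_seen.items() if now - t > self.timeout]

d = HeartbeatDetector(["a", "b"], timeout=0.01)
d.on_heartbeat("a")
time.sleep(0.02)
d.on_heartbeat("b")
print(d.suspected())    # likely ['a']: its last heartbeat is older than the timeout
```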

Crash Fault

A crash means the node is either permanently dead, slow to respond, or dead and then recovered (under the asynchronous model, one cannot differentiate a slow network from a node failure).

For crash faults, we assume fail-stop, in which case most protocols deal with the following:

  • the node crashes abruptly, like hitting a segmentation fault;
  • others can detect the failure (otherwise it is a crash-stop);
  • the internal state of the failed node is lost (logs are not assumed to be persisted).

Network Partitioning

Byzantine Fault

Byzantine failure generalizes the above failures. In essence, a Byzantine failure is a fault in which different symptoms are presented to different observers (e.g. a malicious node passing different messages to different nodes).

Consensus

The Problem

We have $n$ nodes, each labeled $0,\dots,n-1$. Each node proposes its own initial value from an agreed domain. In the end, every node should agree upon an irrevocable final decision value.

Consensus is about reaching agreement instead of merely voicing opinions. Agreement means the decided value of a non-faulty node is identical to that of every other non-faulty process. Termination means every node must reach a final value. Validity is a sanity check: the final decision must be chosen from the initial proposals, instead of being a trivial value.

Some examples of consensus problem include:

  • leader election (Bully algorithm);

  • distributed transaction (2PC);

  • agreement problem (where a node brings up a proposal and others say yes or no);

  • replicated state machine (RAFT).

FLP Theorem (Hardness of <asynchronous> Consensus Problem): it is not possible for a deterministic protocol to solve consensus in an asynchronous system with $\ge 1$ process that may crash-stop.

If we switch to synchronous setting, things will be a lot better.

Byzantine Generals Problem: the BGP allegory considers a number of generals attacking a fortress. The generals must agree on a time to attack or simply retreat; any uncoordinated attack would lead to defeat. The generals are physically separated and have to send their votes via messengers, who may fail to deliver the votes (though such loss can be detected, indicating fail-stop and synchrony). The problem is complicated by the existence of traitor generals who may send different votes to different generals.

Byzantine Fault Tolerance Theorem (Hardness of <synchronous> Consensus Problem): when the number of malicious processes is $1$, it is impossible to achieve consensus unless the total number of processes is $> 3$. This generalizes as follows: when the number of malicious processes is $f$, it is impossible to achieve consensus unless the total number of processes is $> 3f$.

Protocols

Oral Message Protocol

There are two types of roles, namely commander and lieutenant. A general becomes a commander when he issues commands to other generals; he becomes lieutenant when he receives commands from others.

By “oral message”, it means

  • the receiver knows the sender of the message;
  • the absence of a message can be detected.

As a loyal lieutenant, before attacking, a smart move is to confirm whether others received the same command as it did. Thus, everyone (as a general) should forward its received command to the others. In this way, honest generals can confirm each other's votes and outvote the traitors. This is the intuition behind Lamport's oral message solution.

This solution is based on a recursive "function" call. Let $f$ be the number of traitors and $S$ the set of generals. The base case is $OM(0, v)$, where no traitor is assumed and any commander can directly coordinate the attack. Otherwise,

  • The commander $i$ initiates $OM(f, S - \{i\})$ and sends out a value $v$ to every lieutenant $j \in S - \{i\}$.
  • Upon receipt, every lieutenant $j$ initiates $OM(f-1, S - \{i,j\})$ as the commander and forwards $v_j$, its received vote, to every lieutenant $k \in S - \{i,j\}$.
  • After finishing the above recursive calls, every lieutenant counts the votes and takes the majority.

For details, refer to this post; a compact sketch of the recursion follows.
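
A compact, illustrative implementation of the recursion in Python; the `send` function models the oral-message channel (a traitor may distort what it relays), and the default-on-tie majority follows the classical presentation.

```python
from collections import Counter

def majority(values, default="retreat"):
    """Majority vote; fall back to a default value on a tie."""
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return default
    return counts[0][0]

def om(f, commander, lieutenants, value, send):
    """OM(f): return {lieutenant: decided value}."""
    if f == 0:
        return {lt: send(commander, lt, value) for lt in lieutenants}
    # Each lieutenant records what it heard directly from the commander ...
    heard = {lt: send(commander, lt, value) for lt in lieutenants}
    decisions = {}
    for lt in lieutenants:
        votes = [heard[lt]]
        # ... and collects, via OM(f-1), what every other lieutenant claims it heard.
        for other in lieutenants:
            if other == lt:
                continue
            sub = om(f - 1, other, [l for l in lieutenants if l != other],
                     heard[other], send)
            votes.append(sub[lt])
        decisions[lt] = majority(votes)
    return decisions

# 4 generals, general 3 is a traitor that always flips whatever it relays.
def send(src, dst, v):
    return ("retreat" if v == "attack" else "attack") if src == 3 else v

print(om(1, commander=0, lieutenants=[1, 2, 3], value="attack", send=send))
# Loyal lieutenants 1 and 2 both decide "attack".
```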

Practical-BFT

PBFT works under weak synchrony. It uses message signatures to ensure that messages from the commander cannot be tampered with.

Paxos

Dummy Protocol

The most intuitive way to reach consensus is to commit as long as a majority accepts the message; this is the 1-phase majority-vote design. However, in such a design the committed value is only known by the initiator and not shared among the other learners, leading to potential inconsistency. If we instead make it a 1-phase unanimous (collective) vote, we lose availability.

To solve this, we either add another phase to broadcast the committed value (Paxos) or ensure that the value is agreed by a super-quorum (Fast Paxos). Adding a phase means relying on one node to collect information and broadcast it to the others, while enlarging the quorum means less fault tolerance. The key is the tradeoff between robustness to failure and a consistent result.

Happy Path

Paxos assumes partial synchrony and no Byzantine faults. There are three roles in the Paxos protocol, namely the proposer, the voter/acceptor (the acceptor role develops from the voter), and the learner (who is outside the quorum and does not vote); the roles may or may not be exclusive. There are two phases, namely proposal election ($\ne$ leader election) and atomic total-order broadcast.

In Paxos, no actual value is attached during the proposal election. The value is attached only after a majority has promised to the proposer: once the proposer sees itself accepted by the majority, it sends out the value. A minimal sketch in code follows the phase list below.

  • 1a) Prepare Phase

    Whichever proposer is interested in proposing a value sends out a Prepare<N> message that contains a globally unique identifier $N$ on which a total order can be defined (e.g. <proposer_ip, local_proposal_no>), informing all voters of its attempt to become a leader (only after becoming a leader can a node propose a value).

  • 1b) Promise Phase

    For each voter,

    • if $N > M$ where $M$ is the max identifier it has ever seen,

      • if it has accepted another value $u$ (which is associated with $M$), it replies a Promise<M, u> message;
      • else it replies a Promise<N, "Any value is fine."> message;
    • else ignore $N$’s proposal to save communication;

  • 2a) Accept Phase

    For proposer,

    • if its proposal $N$ receives a majority of promises, it becomes a “leader” (not the leader in the Raft sense; see later) and
      • if all received promises say "Any value is fine", it can decide the value as v and send Accept(N, v) to all (or a majority of) the voters;
      • else the previous leader has probably died and the proposer must propagate the legacy value: it sends Accept(N, v') where $v'$ is the value carried by the promise with the highest identifier among the replies;

    For each voter (now as the acceptor),

    • if it receives Accept(N, v) from proposer (leader already)
      • if $N$ is no smaller than the max identifier it has promised (i.e. it has not promised a higher proposal in the meantime), it returns Accepted(v) to the leader;
      • else it ignores this Accept message because this is an older message;
  • 2b) Accepted Phase

    For the leader, if it receives Accepted messages from a majority of the voters with the same proposal number $N$, consensus is reached and the value is committed. It then notifies the others (including learners) about the accepted value.
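
Below is a minimal, failure-free sketch of a single decree in Python; message passing is modelled as direct method calls, and the names (`Acceptor`, `run_round`) are illustrative rather than taken from any real implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Acceptor:
    """Single-decree Paxos acceptor/voter."""
    promised: Tuple[int, int] = (-1, -1)   # highest N promised, as (round, proposer_id)
    accepted: Optional[Tuple[Tuple[int, int], str]] = None   # (N, value) accepted so far

    def on_prepare(self, n):
        # 1b) Promise only if N is higher than anything promised before.
        if n > self.promised:
            self.promised = n
            return ("promise", n, self.accepted)   # report any previously accepted value
        return None                                # ignore stale proposals

    def on_accept(self, n, value):
        # 2a -> 2b) Accept unless a higher N has been promised in the meantime.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return ("accepted", n, value)
        return None

def run_round(acceptors, n, my_value):
    """Proposer logic for one round: prepare, then accept with the forced value if any."""
    promises = [p for p in (a.on_prepare(n) for a in acceptors) if p]
    if len(promises) <= len(acceptors) // 2:
        return None                                # no majority of promises: give up
    # If some acceptor already accepted a value, adopt the one with the highest N.
    prior = [acc for (_, _, acc) in promises if acc is not None]
    value = max(prior)[1] if prior else my_value
    acks = [r for r in (a.on_accept(n, value) for a in acceptors) if r]
    return value if len(acks) > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(3)]
print(run_round(acceptors, (1, 0), "v1"))   # 'v1' is chosen
print(run_round(acceptors, (2, 1), "v2"))   # still 'v1': a later proposer must adopt it
```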

Summary

Paxos is leaderless because every new value needs to go through the above two phases, meaning that a voter is not prohibited from trying to become a leader even when a leader (for the current decree) already exists. This also means multiple decisions can be made in parallel (the multi-decree parliament).

The practical Paxos variant is Multi-Paxos, where the leader is reused until it dies (skipping Phase 1). Multi-Paxos is often loosely referred to as Paxos, even in the Raft paper.

The downside of the Paxos paper is that it is incomplete, lacking specifications for corner cases. For example, what if proposers propose aggressively and never let the leader get the job done (mitigated by a random cooldown period)? What if the leader dies? For details, refer to this post.

Raft

Usually, reaching consensus lowers performance. Raft is designed for high availability in a replicated state machine (RSM). It works under a fail-stop setting (timing assumptions are needed for leader election). The difference between an RSM and broadcasting is that an RSM only requires agreement on the final state, while broadcasting requires agreement on the ordering of the input.



  • Three roles

    • Leader

      Leader handles all the client requests until it fails.

    • Follower

      Follower forwards client request to the leader.

    • Candidate

      When the leader is detected to have failed, whoever starts running for leadership becomes a candidate.

  • Three RPCs

    • AppendEntries

      This RPC may convey commands or nothing (as a heartbeat).

    • RequestVote

      This happens when a follower detects that there has been no RPC from the leader for an election timeout.

    • InstallSnapshot

      This RPC brings a lagging follower up to date when the leader has already compacted its log into a snapshot.

  • Considerations

    • leader election and log replication
    • re-election (dead leader)
    • split-brain (slow leader)
    • configuration change

Happy Path

Raft maintains that replicas receive the same input sequence, apply the log serially (or in a serially-equivalent manner), and reach the same output state. Every replica stores the input sequence as its log. Each replica buffers its log entries until they are committed.

Time is divided into terms. Each term begins with an election, followed by normal operation under the elected leader. A term can be leaderless (and thus has no operation) due to a split vote. Each replica maintains its current term to check whether it is out of date.

The leader sends AppendEntries either to issue commands or as a heartbeat. After issuing a command, it receives from each follower whether the command has been buffered, together with the follower's current term (think why).

If a follower receives nothing from the leader for an election timeout, it becomes a candidate. When triggering an election, the candidate increments its current term, votes for itself, and issues RequestVote to the other replicas. This results in either a majority vote, a split vote (in which case a new round of election is initiated), or "the return of the king" (in which case the candidate steps back to follower).

Upon a new log entry, the leader first appends it to its own log and broadcasts the command to followers via AppendEntries(cmd). Once the leader knows the command is replicated on a majority, it applies the command and asks followers to apply it via AppendEntries(cmd, commitIdx).
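
A simplified follower-side sketch of the AppendEntries consistency check (commit-index handling, heartbeats, and persistence are omitted; entries are `(term, command)` pairs and `prev_index` is 1-based):

```python
def append_entries(follower_log, follower_term, leader_term,
                   prev_index, prev_term, entries):
    """Return (success, current_term) as the follower's reply."""
    # A stale leader is rejected; the reply carries our term so that it steps down.
    if leader_term < follower_term:
        return False, follower_term
    # Log matching check: we must hold the entry the leader assumes we have.
    if prev_index > len(follower_log) or \
       (prev_index > 0 and follower_log[prev_index - 1][0] != prev_term):
        return False, leader_term
    # Drop a conflicting suffix and append any entries not already present.
    for offset, entry in enumerate(entries):
        idx = prev_index + offset                  # 0-based slot for this entry
        if idx < len(follower_log) and follower_log[idx][0] != entry[0]:
            del follower_log[idx:]
        if idx >= len(follower_log):
            follower_log.append(entry)
    return True, leader_term

log = [(1, "x=1")]
ok, _ = append_entries(log, follower_term=1, leader_term=2,
                       prev_index=1, prev_term=1, entries=[(2, "y=2")])
assert ok and log == [(1, "x=1"), (2, "y=2")]
```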

Raft maintains two safety properties:

  1. Log matching: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index.
  2. Leader completeness: once an entry is committed, it appears in the logs of all future leaders; committing an entry also implies that all preceding entries are committed.

Re-election

As stated above, an entry is committed when the leader sees it replicated on a majority. Upon RequestVote, each replica checks the candidate's term and last log index/term; the candidate's log must be at least as up-to-date as the replica's. This ensures that the leader always holds the most up-to-date log and that log entries only flow from the leader to followers.
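
A minimal sketch of this voting rule, assuming the replica tracks whom it has voted for in its current term (all names are illustrative):

```python
def grant_vote(candidate_term, candidate_last_log_term, candidate_last_log_index,
               my_term, my_last_log_term, my_last_log_index,
               voted_for, candidate_id):
    """Return True if this replica should grant its vote to the candidate."""
    if candidate_term < my_term:
        return False                                  # stale candidate
    if voted_for not in (None, candidate_id):
        return False                                  # already voted this term
    # "At least as up-to-date": compare the last log term first, then the index.
    return (candidate_last_log_term, candidate_last_log_index) >= \
           (my_last_log_term, my_last_log_index)

assert grant_vote(3, 2, 5, 3, 2, 4, None, "n2")       # longer log with the same term wins
assert not grant_vote(3, 1, 9, 3, 2, 4, None, "n2")   # older last term loses, however long
```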

On the other hand, a new leader never commits old-term entries by counting replicas (e.g. by re-broadcasting old-term entries and waiting for a majority); otherwise the system runs the risk of overwriting committed entries ($\S5.4$ of Raft). These old-term entries are committed implicitly once the leader commits a new entry in its own term.

Split-brain

This is solved by having each replica (including the old leader) become a follower once it finds its term is stale upon rejection. At the same time, the stale replica updates its term to the rejecting replica's term.

Configuration Change

A 2-phase solution: add a phase to broadcast the new configuration (in effect using one phase for synchronization).

Once a follower receives and applies C_old_and_new, it is in joint-consensus mode: it has to obey the rules of both configurations.

CAP and Eventual Consistency

Now we consider the network partition fault model.

  • Consistency means every read, possibly from different clients, receives the most recent write or an error (i.e. linearizability).
  • Availability means every request received by a non-faulty node in the system must result in a response.
  • Partition tolerance means the system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.

CAP Theorem: If there is a network partition, one has to choose between consistency (linearizability) or availability.

As an aside, consistency has different meanings in distributed systems, databases (where it requires the validity of database rules to be upheld), and memory ordering; availability has different meanings in distributed systems and software systems.

Linearizability is the de facto gold standard of ordering in the distributed world. Depending on the order observed, the consistency levels of a distributed system include:

  • strong consistency / atomic consistency (linearizability)
  • sequential consistency (total order)
  • causal consistency (causal order)
  • FIFO consistency (FIFO order)
  • eventual consistency

Due to the CAP theorem, we have to relax the requirement on consistency by allowing more divergence across the replicas.

Eventual Consistency

Eventual consistency guarantees that there are no lost updates. Eventually, i.e. when there are no further updates, the current updates will be propagated to all the replicas and the replicas become identical. During operation, however, there is no guarantee on the order of these update events; the client is allowed to see inconsistent intermediate results.

To ensure there are no lost updates, a record of all the intermediate updates needs to be kept.

Case Study: CRDT

Amazon's shopping cart implementation maintains an insert-set and a remove-set on each replica and occasionally merges the sets across replicas. This implementation has been generalized as the conflict-free replicated data type (CRDT); a minimal sketch follows the ACI list below.

For a data structure to be a CRDT, the ACI properties have to be satisfied to make the merge step well-behaved:

  • associative, to handle late arrivals
  • commutative, to handle reordered arrivals
  • idempotent, to handle duplicated arrivals
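
A minimal sketch of the shopping-cart style CRDT (a two-phase set with an add-set and a remove-set per replica); the class name is illustrative.

```python
class TwoPhaseSet:
    """Add-set/remove-set CRDT; merging is a set union on both sides."""
    def __init__(self):
        self.added = set()
        self.removed = set()

    def add(self, item):
        self.added.add(item)

    def remove(self, item):
        if item in self.added:
            self.removed.add(item)

    def merge(self, other: "TwoPhaseSet") -> None:
        # Set union is associative, commutative and idempotent, so merges may
        # arrive late, reordered, or duplicated without changing the outcome.
        self.added |= other.added
        self.removed |= other.removed

    def contents(self):
        return self.added - self.removed

r1, r2 = TwoPhaseSet(), TwoPhaseSet()
r1.add("book"); r2.add("pen"); r2.remove("pen")
r1.merge(r2); r2.merge(r1)
assert r1.contents() == r2.contents() == {"book"}
```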

Two implementations of CRDT are the commutative replicated data type (CmRDT, which exchanges operations) and the convergent replicated data type (CvRDT, which exchanges final states); arguably, Google Slides resembles a CmRDT and GitHub a CvRDT.

If a data structure's operations do not satisfy ACI, the order of operations on the data structure has to be preserved (by using the tricks introduced in the [Time and Order](#time-and-order) section).

By relaxing consistency, we can have availability as well as scalability under network partition, since the coordination used to uphold strong consistency would otherwise become the throughput bottleneck. An eventually consistent system is said to be coordination-free. But no coordination does not mean no communication: the Amazon shopping cart still needs communication to merge, and Google Slides still needs communication to exchange operations. It is just that the system does not wait for agreement.

Distributed Caching

P2P is a decentralized overlay network. It can be either structured or unstructured. In unstructured P2P, a peer can decide which files it hosts, and finding a file requires message flooding.

In structured P2P, the system maintains a consistent hashing table to map a file to a particular set of nodes. Let $K$ be the number of files and $n$ the number of nodes. The advantage of consistent hashing is that, when a node dies, i.e. when the number of slots changes, on average only $K / n$ files need to be remapped.

To achieve this, the nodes and the files are hashed separately, giving each node and file a unique identifier. To avoid having to reassign every BLOB when a server is added or removed, nodes and files are first placed on a virtual ring by taking their hash values modulo a fixed number. Each file is then assigned to the next node that appears on the ring in clockwise order, which is the system invariant maintained throughout the lifespan.
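
A minimal sketch of the ring and the clockwise lookup, assuming MD5 as the hash and a 16-bit ring (both choices are illustrative):

```python
import bisect
import hashlib

RING_SIZE = 2 ** 16

def ring_position(key: str) -> int:
    """Place a node ID or a file name on the virtual ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

class ConsistentHashRing:
    def __init__(self, nodes):
        self._ring = sorted((ring_position(n), n) for n in nodes)

    def lookup(self, filename: str) -> str:
        """Invariant: a file belongs to the next node clockwise on the ring."""
        pos = ring_position(filename)
        idx = bisect.bisect_right([p for p, _ in self._ring], pos)
        return self._ring[idx % len(self._ring)][1]    # wrap around past the last node

    def add_node(self, node: str) -> None:
        bisect.insort(self._ring, (ring_position(node), node))

    def remove_node(self, node: str) -> None:
        self._ring.remove((ring_position(node), node))

ring = ConsistentHashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
before = ring.lookup("song.mp3")
ring.add_node("10.0.0.4")        # only keys between two ring positions are remapped
after = ring.lookup("song.mp3")
```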

Now it is simple to contemplate how consistent hashing deals with the following operations:

  • a search is initiated;
  • a peer joins;
  • a peer leaves;
  • a peer fails.

In the baseline implementation, each node only knows its predecessor and successor. For better efficiency, a finger table that contains the $+1, +2, +4, +8, +16, +32, \dots$ neighbors relative to the node's own hash value can be maintained to speed up searching.

When a node $N_i$ joins, it has to ask a known peer $N_j$ to help it find its predecessor and successor. When a node leaves, it can gently inform its predecessor and successor and hand over its files properly. When a node fails, the system has to recover on its own. In any case, nodes in the system undergo a stabilization phase to update their finger tables.

Consistent hashing has been widely applied in cloud systems like Amazon Dynamo. For detail, refer to the Wiki.

Blockchain

Public Blockchain

A public blockchain is a decentralized (not merely distributed), probabilistic (instead of deterministic, to circumvent FLP) BFT RSM (handling leader election and log replication). It is confronted with several possible Byzantine attacks:

  • a node creates a number of fake identities to occupy the majority so that it can tamper with the ledger, namely the Sybil attack (resolved by PoW/PoS);
  • a node falsely adds value to itself or double-spends the money (resolved by “peer review”);
  • a node steals money from others’ account (resolved by cryptography).

The log is a chain of blocks, which is append-only and tamper-proof (protected by block hashes and nonces). Leader election and log replication are secured by psychology and economics: as long as there are enough participants, it is not worthwhile for an individual to diverge from the main branch. In essence, there is no failure handling in the blockchain's log replication.

Private Blockchain

The upside of a private/permissioned/consortium blockchain is that it does not face the Sybil attack, because every participant is licensed and legal action can be taken against it.

Distributed Transaction

Below are the differences between concurrency control in an operating system and in a database:

|             | OS                         | DB                                               |
| ----------- | -------------------------- | ------------------------------------------------ |
| Scenario    | R/W on a single value      | a transaction involves R/W on multiple values    |
| Correctness | linearizability            | ACID                                             |
| Solution    | locking; lock-free methods | 2PL (pessimistic); validation-based (optimistic) |

A DB has a larger granularity of atomicity because of the transaction. As a result, it downgrades the golden requirement from linearizability to serializability. Building a DB is quite the reverse of building consistent transactions on top of inconsistent replication: the DB, as a slacker system, is built atop a stricter system.

Single-node Transaction

2-phase Locking

  • If a transaction needs to read/write an item (say a row), it must acquire the corresponding lock first.
  • A transaction may read/write multiple items, but once it starts unlocking, it cannot acquire any more locks1.

2PL guarantees a serializable schedule, which is deducible from the locking order. The good aspect is that it requires little coordination: everyone only needs to respect the rule locally and acquire locks from the global lock table. However, deadlock can still happen (and can be resolved by aborting one of the transactions).

To improve performance, finer-grained locking (such as shared/exclusive locks) can be applied.

Optimistic Concurrency Control

Instead of waiting for locks, OCC lets transactions interleave freely. But when a conflict is observed, the conflicting transactions are aborted and restarted.

  • OCC records and cross-checks the timestamps of transactions’ read and write.
  • Whenever a transaction reads an item and finds that it has already been written by a transaction with a later timestamp than its own, it updates its own timestamp to the last-written timestamp and restarts.

The serial order is equivalent to the final timestamp order. Finer control can be achieved by keeping multiple versions of an item (multi-version timestamp ordering), so that reads can be served from the version history, though more space is consumed.

OCC is a concept as well as a method. A more practical implementation is referred to as the validation-based method, which comes in three phases for a single transaction (a sketch follows the list):

  • (Execution Phase) The writes are stored into thread-local in-memory write-set.
  • (Validation Phase) Validates that the timestamps of a transaction’s read-set and write-set do not conflict with those transactions that have already committed.
  • (Commit Phase) If there is no overlap or serializability conflict (note that these are not the same condition), lock, apply the write-set, and unlock; otherwise, restart.
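
A simplified, single-threaded sketch of the three phases, assuming the write-sets of transactions that committed since this transaction started are available at validation time (all names are illustrative):

```python
class OccTransaction:
    """Validation-based OCC over a plain dict 'database'."""
    def __init__(self, db, committed_write_sets):
        self.db = db
        self.committed_write_sets = committed_write_sets   # of concurrently committed txns
        self.read_set, self.write_set = set(), {}

    def read(self, key):
        self.read_set.add(key)
        return self.write_set.get(key, self.db.get(key))

    def write(self, key, value):
        # Execution phase: buffer writes locally; do not touch the database yet.
        self.write_set[key] = value

    def commit(self) -> bool:
        # Validation phase: abort if a committed transaction wrote something we read.
        for ws in self.committed_write_sets:
            if ws & self.read_set:
                return False                                # conflict: caller restarts
        # Commit phase: apply the buffered writes (locking is elided in this sketch).
        self.db.update(self.write_set)
        return True

db = {"x": 1}
t = OccTransaction(db, committed_write_sets=[{"y"}])        # 'y' was written, not 'x'
t.write("x", t.read("x") + 1)
assert t.commit() and db["x"] == 2
```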

Multi-node Transaction

The 2-phase commit (2PC) algorithm involves two phases: one for reaching the decision and another for the commit. In this context, the leader/coordinator is called the transaction manager (TM) and a follower/participant is called a resource manager (RM). The paradigm is: the TM proposes a transaction and all nodes need to reach a unanimous decision. The intricacy in this setting is that the system needs to reach a unanimous decision to reach a unanimous state, rather than mere majority consensus.
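
A happy-path sketch of the two phases (the `ResourceManager` interface is illustrative; a real system adds write-ahead logging, timeouts, and recovery):

```python
class ResourceManager:
    def __init__(self, can_commit: bool = True):
        self.can_commit = can_commit
        self.state = "running"

    def prepare(self) -> bool:
        # An RM that votes yes must hold its locks and be able to commit later.
        self.state = "prepared" if self.can_commit else "aborting"
        return self.can_commit

    def finish(self, decision: str) -> None:
        self.state = decision

def two_phase_commit(resource_managers) -> str:
    """Unanimous 'yes' commits; any 'no' (or, in practice, a timeout) aborts."""
    # Phase 1: reach the decision -- the TM asks every RM to prepare and vote.
    votes = [rm.prepare() for rm in resource_managers]
    decision = "commit" if all(votes) else "abort"
    # Phase 2: commit -- the TM broadcasts the unanimous decision; RMs must obey.
    for rm in resource_managers:
        rm.finish(decision)
    return decision

assert two_phase_commit([ResourceManager(), ResourceManager()]) == "commit"
assert two_phase_commit([ResourceManager(), ResourceManager(can_commit=False)]) == "abort"
```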

Designing a distributed database involves concurrency control (which entails locking remote memory), write-ahead logging, and 2PC. There are two ways to achieve a distributed database:

  • distributed transaction without replication (global serializability)
  • distributed transaction with replication (1-copy serializability)

2PC is not fault-tolerant: every replica has to move in lockstep. As such, it suffers from blocking under the nine kinds of fail-stop scenarios. As a result, 3PC and the Paxos Commit protocol were proposed to mitigate the liveness issue of 2PC, if not eliminate it completely; they add liveness at the cost of lower efficiency.


  1. For atomicity and durability, locks are released only after the commit record reaches the log buffer, namely Strict 2PL. ↩︎