ELI5: Leader Election

In distributed systems, it is often necessary to elect a leader process whenever there is a need to coordinate tasks. We assume that such a system contains a symmetric network of processes, i.e. all processes in the network are indistinguishable, and anyone can be the leader.

Note that these elections are not like those in real life, where we want to elect the best or most capable process — no, we really just want to elect someone and be done with it. In this post, we'll take a look at algorithms for leader election given 3 different process topologies.

Note: I wrote this post to explain this topic to myself while I was taking CS4231 Parallel and Distributed Algorithms in university.

A Ring of Processes

Assumptions

We first assume that we have a network of processes arranged in a ring. Why a ring topology? Turns out, it is a pretty common logical topology for 2 reasons:

  1. It is efficient – a ring only requires O(n) number of connections between nodes
  2. It has a reasonable amount of resilience against faults – a ring requires 2 cuts to segment the network.

We also assume the nodes all have a unique ID. This is a reasonable assumption in the real world; there are a number of sortable globally unique ID generators that do not require synchronization.

Lastly, we assume that the network is reliable and messages sent are guaranteed to be received exactly once.

Algorithm

Given these assumptions, we can use the Chang-Robert's algorithm to elect a leader. The idea is to elect the process with the highest ID. Since process IDs are unique, we'll have exactly 1 leader. Here are the rules:

  1. Processes can only send messages in a clockwise direction.
  2. A process can (and does) send an election message containing its ID.
  3. If a process receives an election message, it must pass it on to the next process in the ring if the message's ID is greater than its own, or ignore it if the message's ID is less than its own. If the message's ID is equal to its own, it knows that it is the leader, since all other processes had lower IDs and passed the message on.
  4. Now, the victor process sends a leader message through the ring to notify the other processes that it has won and shove this fact in their faces. A victory lap, if you will. All other processes will just take note and pass the message on. Once the victor process has received its leader message, it knows that everyone knows that it's the leader.

That's it! Let's look at the performance of the algorithm.

Performance

In the best case, the algorithm takes O(N) messages to elect a leader. How? We assume the N processes are arranged in clockwise increasing order. For illustration, let's assume the N processes have unique IDs 1...N. Process N's election message will travel around the ring, as every other ring has an ID less than N. What about all the other processes? They'll send their election message to a process with an ID 1 greater than theirs and will therefore be dropped after just one send. This means that all the other N-1 processes will send a total of N-1 election messages. Therefore, the total number of election messages = N + N-1 = O(N).

In the worst case, the algorithm takes O(N^2) messages to elect a leader. Take the same ring as above, but reverse the processes, i.e. the processes are arranged in clockwise decreasing order with IDs N...1. Now, consider process N. As before, its election message will travel around the ring until it comes back to it — this results in N election messages. Next, consider process N-1. Its election message travels until it reaches process N, which ignores the message as N > N-1 — this results in N-1 election messages. Therefore, every process i results in i messages being sent through the network, and the total number of election messages is N + N-1 + N-2 + ... + 1 = N(N+1)/2 = O(N).

In the average case, the algorithm takes O(N log N) messages to elect.

A Fully Connected Graph of Processes

If we the graph is fully connected (i.e. every process is connected directly with every other process), we can actually just use Lamport's mutual exclusion algorithm to elect the leader. Basically, this algorithm allows exactly 1 process to run some bit of code (a critical section) while every other process waits. You can see where this is going. One process starts the election by informing everyone else that it's starting. Once acknowledgement is received, all processes just try to enter the critical section. Whoever gets in first (and there will only be 1) sends a message to everyone else declaring its the leader. Cool.

Unfortunately, a fully connected set of processes isn't scalable; that's a lot of edges to maintain. In the real world, this may mean actual wires connecting different machines.

A Connected Graph of Processes

Let's consider something more realistic — a connected graph of processes (i.e. every node is reachable from any other node), where every process knows how many neighbors it has.

The solution here is to build a spanning tree across the graph, and make the root of this tree the leader.

To make this algorithm more understandable, let's first assume that there's one root process that knows it's the root process. This process sends an invite message to all its neighbors. When one of these neighbors receive this invite message, it forwards it on to all of its other neighbors, and replies the inviter with an accept message, establishing a parent/child relationship between the sender and recipient of the invite message. All the other processes in the graph follow the same steps. If one of these process subsequently receives another invite message, it'll just ignore it (so that our tree doesn't become a graph). Since the graph is connected, we know that the tree will eventually grow to reach every process in the graph, thus forming a spanning tree. In addition, every process can count the number of neighbors that it's received a message from. When this count is equal to its (known) number of neighbors, the process knows that it's done building relationships with its neighbors.

Let's come back to the assumption that the root process knows it's root. Let's get rid of the assumption, making all processes equal again (remember, the graph is symmetric). If we assume that every process has a unique ID, the solution is simple: just let every process assume it's the root and start building a spanning tree. The tree rooted at the process with the highest ID wins, and that process is the leader.


Armed with these theoretical algorithms, you're now ready to implement a solution to declare a leader among equals!