Coordination and Agreement
Objectives
To give students understanding of the goals of the problems of coordination and agreement in distributed systems; to give them intuitions and algorithmic techniques for addressing the problems; to give understanding of the theoretical and practical limits to solving them – in particular, the limits due to the possibility of failure.
More specifically, to introduce algorithms for distributed mutual exclusion and election algorithms, for multicast communication, consensus and related problems.
To appreciate the impact of whether we use a synchronous or asynchronous system model on the algorithms we construct.
Points to emphasize
`Failure detectors’ have both a practical and a theoretical importance. They are of practical importance because many systems have to cope with failures and therefore to detect them, however reliably or unreliably. They are of theoretical importance because the presence of a failure detector with well defined properties affects our ability to solve the problem of consensus.
Distributed mutual exclusion is needed when processes access a resource or collection of resources and we require their updates to be consistent. It is preferable for the service that manages the resources to provide mutual exclusion itself, since this does not require extra communication; but file servers, for example, do not provide synchronization and we require a separate synchronization service. Algorithms to achieve distributed mutual exclusion must be evaluated on the basis of their efficiency and their behaviour under failure conditions.
Election algorithms are required when there is a need to distinguish one of a collection of processes as a coordinator. These algorithms are designed to select a unique process, even under failure conditions. Unfortunately, network partitions may mean that a collection of processes is split into several subgroups, each of which cannot know whether the other continues to function.
In multicast communication we make a strong distinction between multicast properties (delivery guarantees) and how to implement those properties. We can treat multicast in a modular way, in particular delivery reliability and ordering guarantees are orthogonal; and either of these can be made uniform. Multicast communication is considered only for static groups in this chapter. Chapter 18 treats the case of dynamic groups.
Consensus can be related to several similar problems including multicast communication. While solutions in synchronous systems exist, consensus cannot be deterministically solved in an asynchronous system even under modest failure assumptions. But that result says that consensus cannot be guaranteed , not that it is impossible. In practice systems do reach consensus as a matter of course, by masking process failures if necessary (e.g. transactions, covered in chapters 16 and 17).
Possible difficulties
One of the problems with this material is being clear about the assumptions underlying each result (synchronous/asynchronous system, failure assumptions, closed/open groups etc.). Failure assumptions, in particular, radically change our ability to solve problems.
Inevitably, some students will find the material in this chapter rather abstract. Encourage them to try to develop algorithms themselves.
Teaching hints
Begin by reviewing `agreement in Pepperland’ (p. 66 and p. 69), and the failure model of Section 2.4.2. Students normally find this material stimulating.
If you have already introduced the concept of distributed shared memory or talked about shared files, point out that these require a distributed mutual exclusion service. Similarly motivate the other problems of coordination and agreement using practical examples of bulletin boards (multicast delivery), backup systems for mission-critical applications (elections and other consensus problems) etc.
Multicast ordering using a sequencer is a good programming exercise (Java).