A few years ago, while researching Zookeeper for a project I was working on, I realized that there was a whole field of Computer Science, Distributed Systems, that I was totally unfamiliar with. That started a journey of discovery that’s been very rewarding. In response to a question on the Akka mailing list, I put together a list of links to Distributed Systems resources. I’ve been meaning to turn that email into a blog post for a while.
Implementation-oriented books that I would recommend for developers are:
- I Heart Logs by Jay Kreps, the designer of Kafka (there is also a webcast). This is a collection of his recent epic blog posts (30 pages, highly recommended).
- Jeff Hodges’s Notes on Distributed Systems for Young Bloods blog post and his Practicalities of Distributed Systems videos
- Pattern-Oriented Software Architecture, Volume 4 (and Volume 2)
- Cloud Design Patterns
- Reactive Messaging Patterns with the Actor Model
- Enterprise Integration Patterns
- SOA Patterns
- CQRS Journey
These are all filled with practical advice for building real-world distributed systems.
One thing I found is that there is currently a big gap between academic and industry knowledge. This is discussed in a post on Henry Robinson’s excellent Paper Trail blog, where he provides a guide to digging deeper, both on the academic side and by reading research papers written by industry leaders like Google, Yahoo, etc. Definitely read the links in the “First Steps” section. The gap is also the topic of a post on Marc Brooker’s blog. Besides papers, he links to some other good people to follow, like Aphyr and Peter Bailis. Two blogs that review Distributed Systems papers are the Morning Paper and MetaData. I also recommend following Brave New Geek, Ben Stopford and Kellabyte, and the High Scalability and Highly Scalable blogs.
Working to fill the gap are a few books (most are in progress):
- Reactive Design Patterns by Akka’s own Roland Kuhn and Typesafe’s Jamie Allen
- Designing Data-Intensive Applications by Martin Kleppmann from the Samza Project (also check out his recent video)
- Distributed Systems for Fun and Profit
Essential ACM Queue articles:
- There’s Just No Getting around It: You’re Building a Distributed System
- The Network is Reliable
- Eventually Consistent by Werner Vogels of Amazon
- Eventual Consistency Today: Limitations, Extensions, and Beyond
- There is No Now – Problems with Simultaneity in Distributed Systems
Notable blog posts:
- Distributed Algorithms in NoSQL Databases
- Strong Consistency Models
- Elements of Scale: Composing and Scaling Data Platforms
- Cloud Computing Concepts Part 1 and Part 2 – despite the generic-sounding name, these classes cover many distributed systems fundamentals
I recommend getting familiar with the CAP Theorem. You’re going to run into it all over the place.
- Towards Robust Distributed Systems by Eric Brewer, the guy who came up with CAP.
- CAP Twelve Years Later by Eric Brewer
- Availability and Partition Tolerance
- Consistency Tradeoffs in Modern Distributed Database System Design (aka the PACELC paper)
- Perspectives on the CAP Theorem
- This Long Run’s CAP Theorem series
Zookeeper is a Consensus (or Coordination) system. Consensus is a major topic in theoretical and practical distributed systems and is what got me started digging into distributed systems originally. To start getting familiar with Consensus I recommend:
- Distributed Consensus, a.k.a “What do we eat for lunch?”
- Consensus Systems for the Skeptical Architect
- The Zookeeper book from O’Reilly
- The Secret Lives of Data – an interactive tutorial on Raft
- Raft – A Consensus Algorithm for Distributed Logs
- Implementing Replicated Logs with Paxos
On the academic textbook side:
- Distributed Systems, an Algorithmic Approach
- Introduction to Secure and Reliable Distributed Programming
- Guide to Reliable Distributed Systems by Ken Birman. Also check out his recent syllabus, which has slides covering the material in the book. Speaking of Birman, he created Virtual Synchrony, a well-known distributed group membership algorithm that is useful in many situations. The open source JGroups is one implementation of Virtual Synchrony.
This is just the tip of the iceberg. Besides consensus, other distributed systems topics that I’ve found interesting include distributed databases, group membership, gossip protocols (used in Akka, Cassandra and Consul), time and clocks, and peer-to-peer systems.