Distributed Systems links

A few years ago, while researching Zookeeper for a project I was working on, I realized that there was a whole field of Computer Science, Distributed Systems, that I was totally unfamiliar with. That started a journey of discovery that’s been very rewarding.  In response to a question on the Akka mailing list I put together a list of links to Distributed Systems resources.  I’ve been meaning to translate that email into a blog post for a while.

To start off I would definitely recommend checking out a talk that Jonas Boner from Typesafe gave at Strange Loop called The Road to Akka Cluster and Beyond (slides).

Implementation-oriented books that I would recommend for developers are:

These are all filled with practical advice for building real-world distributed systems.

One thing I found is that there is a big gap between academic and industry knowledge right now.  This is discussed in a post on Henry Robinson’s excellent Paper Trail blog, where he provides a guide to digging deeper both on the academic side and by reading research papers written by industry leaders like Google, Yahoo, etc.  Definitely read the links in the “First Steps” section.  The gap is also the topic of a post on Marc Brooker’s blog and a post on Murat’s blog.  Besides papers, he links to some other good people to follow, like Aphyr and Peter Bailis.  Two blogs that review Distributed Systems papers are the Morning Paper and MetaData.  I also recommend following Brave New Geek, Ben Stopford and Kellabyte, and the Hacking, Distributed, High Scalability and Highly Scalable blogs.

Papers We Love is a collective of meetups across the world where people present their takes on research papers that they find fascinating.  Their web site has videos taken at these meetups.  They also hold yearly conferences.

YouTubers are also getting into the act – for example, Vivek Haldar has a series of videos called Read a Paper where he summarizes papers in around ten minutes.

Many times the conferences where the papers are presented also publish videos, slide decks and posters that are much easier to consume for a working developer.  If you have a paper that you are really interested in, be sure to check out the web site of the conference where the paper was published.  Usenix in particular is really good at this.  In addition, in the last few years a number of research projects have been creating web sites to promote the research, where you can find code, videos and more.  For example, check out the site for Hermes, a replication protocol.

Working to fill the gap between academia and industry:

Essential ACM Queue articles

Notable blog posts

Other reading lists

Online Courses

I recommend getting familiar with the CAP Theorem.  You’re going to run into it all over the place.  Briefly, it states that when a network partition occurs, a distributed system has to give up either consistency or availability.

Zookeeper is a Consensus (or Coordination) system.  Consensus is a major topic in theoretical and practical distributed systems and is what got me started digging into distributed systems originally.  To start getting familiar with Consensus I recommend:

On the academic textbook side, I have these on my stack to read:

This is just the tip of the iceberg.  Besides consensus, other distributed systems topics that I’ve found interesting include distributed databases, group membership, gossip protocols (used in Akka, Cassandra and Consul), time and clocks, and peer-to-peer systems.

Adventures in Clustering – part 1

Last year I added clustering support to a system I had previously developed for a client. The requirements were to implement automated failover to eliminate a single point of failure and to distribute certain kinds of work among members of the cluster.

The application I was changing is a message delivery system, but my client has other applications with the same needs so the solution needed to be general enough to apply elsewhere.

Automating Failover with Leader Election

The message delivery system runs a number of scheduled jobs. For example, one of the scheduled jobs checks the database to see which new messages need to be delivered, delivers them, and updates message statuses after the delivery is complete. The message delivery system runs on JVMs on multiple machines for availability, but only one JVM should run the scheduled job, otherwise duplicate messages will be sent.

Before clustering support was added one of the JVMs was designated as the Delivery Server using a Java system property and only the Delivery Server ran the scheduled jobs. If the Delivery Server went down, one of the other JVMs had to be manually converted into the Delivery Server. This was suboptimal because it depended on humans to be available to both notice the problem and perform the failover, and the failover process was error-prone.

There are a number of ways to solve the problem for the specific use case I just described. But there were other use cases where manual failover was also a problem, both in the message delivery system and in other applications the client has. I didn’t want a use-case-specific solution for each problem.

To solve the problem generally I decided to use Leader Election. With Leader Election cluster members decide among themselves which member is the leader and the leader is responsible for certain tasks. The message delivery system already had the concept of a leader – the Delivery Server. I just needed to automate the process of choosing that leader.

The ClusterService

To support the Leader Election and work distribution features, I introduced the concept of a cluster service. When the service is initialized it starts the leader election process for that cluster member. At any time it can be queried to see if the current node is the leader, and who the other members of the cluster are. Here is a minimal sketch of the ClusterService interface (the exact signatures are assumptions based on the description in this post):
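trait ClusterService {
  // True if clustering is enabled for this deployment
  def enabled: Boolean

  // Starts the leader election process for this cluster member
  def initialize(): Unit

  // Queries the current state of the cluster
  def clusterStatus: ClusterStatus
}

case class ClusterStatus(current: String, leader: String,
                         participants: Set[String]) {
  def isLeader = current == leader
}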

In ClusterStatus, current, leader and participants are node IDs.

Scheduled Jobs

Previously only the Delivery Server ran scheduled jobs.  With Leader Election all of the cluster members run the scheduled jobs, but those jobs were changed to return immediately if the current cluster member is not the leader.   For example:
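A sketch of such a job; distributeEvents and distribute are named below, while the surrounding class and the scheduling wiring are assumptions:

class EventDistributionJob(clusterService: ClusterService) {

  // Scheduled to run every 30 seconds on every cluster member
  def distributeEvents() {
    val status = clusterService.clusterStatus
    if (status.isLeader) {
      distribute()   // only the leader does the actual work
    }
  }

  private def distribute() {
    // check the database for pending events and hand them out
  }
}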

The distributeEvents job runs every 30 seconds on all cluster members. It gets the cluster status from the ClusterService and if the current node is the leader it calls distribute() to do the actual work.

Work Distribution

In the message delivery system the ClusterService is also used for work distribution. The leader distributes certain pieces of work to all of the nodes in the cluster: the ClusterService is queried for the cluster members, and each cluster member node ID is mapped to a remote Akka actor. For example, in sketch form (the actor system name, remote path and port are assumptions):
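val status = clusterService.clusterStatus
val workers: Seq[ActorRef] = status.participants.toSeq.map { nodeId =>
  // Each cluster member runs a worker actor; here the node ID doubles
  // as the host name in the remote actor path
  actorSystem.actorFor("akka://delivery@" + nodeId + ":2552/user/worker")
}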

I will cover the work distribution system in detail in a separate post.

Zookeeper and Curator

I wanted to leverage a third-party clustering system, as developing core clustering functionality is non-trivial. My first choice was Akka Cluster, but when I was adding clustering support to the message delivery system Akka Cluster had not been released (it was just a high-level design doc at that time). Zookeeper, on the other hand, had been on the scene for a while. The Zookeeper client has a reputation of being difficult to work with, so I decided to use Curator, a library that abstracts over the Zookeeper Java client.

ZookeeperService

ZookeeperService is the Curator-based implementation of the ClusterService. In outline (the constructor parameters and field names are assumptions):
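class ZookeeperService(connectString: String, latchPath: String,
                       nodeId: String, val enabled: Boolean)
    extends ClusterService {

  private var client: CuratorFramework = _
  private var leaderLatch: LeaderLatch = _

  // initialize(), the leader election code and clusterStatus are
  // covered in the sections below
}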

ZookeeperService also has an embedded Zookeeper server which will be covered in part 2.

Leader Election using Curator

Curator has two ways of doing Leader Election: the LeaderSelector and LeaderLatch.

I first implemented leader election using LeaderSelector but I was unsatisfied with how complicated the resulting code was. After some discussion with Jordan Zimmerman, Curator’s developer, I reimplemented leader election using LeaderLatch. Here’s a sketch of what it looks like (using the latchPath and nodeId fields assumed above):
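def selectLeader() {
  leaderLatch = new LeaderLatch(client, latchPath, nodeId)
  leaderLatch.start()
  watchLeaderChildren()
}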

The call to watchLeaderChildren() is optional. It’s only needed if you want to be notified when the leader changes or if a cluster member falls out of the cluster. In the message delivery system that wasn’t strictly necessary because it always checks who the leader is before doing something that only the leader should do. But it’s a nice thing to have for monitoring purposes:
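A sketch that follows the description below (the logging is an assumption):

private def watchLeaderChildren() {
  // Watch the children of the leader latch znode; each child represents
  // a cluster member and the first child in the list is the leader
  client.getChildren.usingWatcher(new CuratorWatcher {
    def process(event: WatchedEvent) {
      // Membership changed: query and log the new cluster status, then
      // set the watch again (Zookeeper watches fire only once)
      logger.info("Cluster membership changed: " + clusterStatus)
      watchLeaderChildren()
    }
  }).forPath(latchPath)
}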

In watchLeaderChildren a watch is set on the children of the LeaderLatch znode. Each child represents a cluster member and the first znode in the list of children is the leader.  If the set of children changes the watch is fired and the process() method is called. In process() the cluster status is queried and the watch is set again (Zookeeper watches are disabled after they fire).

Cluster Status using Curator

In sketch form, ZookeeperService.clusterStatus looks like this (reconstructed from the description below):
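import scala.collection.JavaConverters._

def clusterStatus: ClusterStatus = {
  val participants = leaderLatch.getParticipants.asScala
  ClusterStatus(current = leaderLatch.getId,
                leader = leaderLatch.getLeader.getId,
                participants = participants.map(_.getId).toSet)
}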

It is a straightforward query of the LeaderLatch to populate the ClusterStatus object.

Application Startup

ClusterService.initialize is called at application startup if ClusterService.enabled is true. Here is a sketch of the ZookeeperService implementation (the retry policy settings are placeholders):
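def initialize() {
  client = CuratorFrameworkFactory.newClient(connectString,
                                             new ExponentialBackoffRetry(1000, 3))
  client.start()
  // Block until the client has connected to the Zookeeper server
  client.getZookeeperClient.blockUntilConnectedOrTimedOut()
  selectLeader()
}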

CuratorFramework is the high-level Curator API.  It wraps a Curator Client, which wraps the Zookeeper client. After the CuratorFramework is created, the startup code blocks until the client has connected to the server and is ready to start working. Then selectLeader() is called to start the leader election process. Once the leader election process is started, Curator handles everything else under the hood (for example, if the leader dies, a new node joins the cluster, or one of the Zookeeper servers goes down).

Coming Up

In the next posts in this series I will embed the Zookeeper server into the application, handle a corner case where leadership changes while the leader is working, discuss work distribution in detail, and talk about a port of the clustering functionality to Akka Cluster. Stay tuned!

An Auto-Updating Caching System – part 2

In the previous post we imagined that we needed to build a caching system in front of a slow backend system. The cache needed to meet the following requirements:

  • The data in the backend system is constantly being updated so the caches need to be updated every N minutes.
  • Requests to the backend system need to be throttled.

Akka actors looked like a good fit for the requirements. Each actor would handle a query for the backend system and cache the results. In part 1 I talked about the CacheSystem, which created CacheActors and provided helper methods for working with the caches. In this post I will cover the CacheActor class hierarchy.

Here is the base CacheActor class:

abstract class CacheActor[V](cacheSystem: CacheSystem) extends Actor with Logging {
  implicit val execContext: ExecutionContext = context.dispatcher

  def findValueReceive: Receive = {
    case FindValue(params) => findValueForSender(params, sender)
  }

  def findValueForSender(params: Params, sender: ActorRef) {
    val key = params.cacheKey
    val elem = cache.get(key)

    if (elem != null) {
      // Cache hit: reply immediately with the cached value
      sender ! elem.getObjectValue.asInstanceOf[V]
    } else {
      // Cache miss: query the backend on a future and reply when it completes
      Future { findObject(params) }.onComplete {
        case Right(result) => result match {
          case Some(value) => sender ! value
          case None => sender ! Status.Failure(new Exception("findObject returned None for key=" + key + " cache=" + cache.getName))
        }
        case Left(ex) => sender ! Status.Failure(ex)
      }
    }
  }

  def findObject(params: Params): Option[V] = {
    cacheSystem.findObjectForCache(params.cacheKey, cache, 
                                   finder(params))
  }

  // Provided by subclasses
  val cache: Cache
  def finder(params: Params): () => V
}

object CacheActor {
  case class FindValue(params: Params)

  trait Params {
    def cacheKey: String
  }
}

Part 1 showed an example of a CachingBusinessService sending a FindValue message to a Service1CacheActor using the ? (ask) method.  findValueReceive handles FindValue by either returning a value from the cache or making a call to the backend (via CacheSystem.findObjectForCache) to get the value.

Concrete CacheActors are responsible for implementing finder which returns a function to query the backend system. The returned function is ultimately executed by CacheSystem.findObjectForCache.

Part 1 also showed CacheSystem sending UpdateCacheForNow messages to periodically update cache values. UpdateCacheForNow is handled by DateCacheActor, a subclass of CacheActor:

abstract class DateCacheActor[V](cacheSystem: CacheSystem) 
    extends CacheActor[V](cacheSystem) {

  override def receive = findValueReceive orElse  {
    case UpdateCacheForNow => updateCacheForNow()

    case UpdateCacheForPreviousBusinessDay => updateCacheForPreviousBusinessDay()
  }

  def updateCacheForNow() {
    val activeBusinessDay: Range[Date] = DateUtil.calcActiveBusinessDay
    val start = activeBusinessDay.getStart
    val now = new Date

    // If today is a business day and now is within the business day,
    // retrieve data from the backend and put it in the cache
    if (now.getTime >= start.getTime &&
        now.getTime <= activeBusinessDay.getEnd.getTime) {
      updateCacheForDate(now)
    }
  }

  def updateCacheForPreviousBusinessDay() {
    updateCacheForDate(DateUtil.calcActiveBusinessDay.getStart)
  }

  def updateCacheForDate(date: Date) {
    import DateCacheActor._    // Use separate thread pool
    Future { findObject(new DateParams(date)) }
  }
}

object DateCacheActor {
  // Update cache for the current time
  case object UpdateCacheForNow  

  // Update cache for previous business day   
  case object UpdateCacheForPreviousBusinessDay

  // updateCacheForDate() uses a separate thread pool to prevent scheduled tasks 
  // from interfering with user requests
  val FUTURE_POOL_SIZE = 5
  val FUTURE_QUEUE_SIZE = 20000

  private lazy val ucfdThreadPoolExecutor = 
    new ThreadPoolExecutor(FUTURE_POOL_SIZE, FUTURE_POOL_SIZE, 1, TimeUnit.MINUTES, 
                           new ArrayBlockingQueue[Runnable](FUTURE_QUEUE_SIZE, true))
  implicit lazy val ucfdExecutionContext: ExecutionContext = 
    ExecutionContext.fromExecutor(ucfdThreadPoolExecutor)
}

During non-business hours UpdateCacheForNow messages are ignored and values from the previous business day are returned from the cache.  If the app is started during non-business hours, an UpdateCacheForPreviousBusinessDay message is scheduled to populate cache values for the previous business day.

A separate thread pool is used to perform the backend system queries for the scheduled UpdateCacheFor* tasks.  We don’t want them to interfere with user requests, which are handled using the regular actor thread pool.

Here is what a concrete DateCacheActor would look like, using the Service1CacheActor from part 1 as an example:

class Service1CacheActor(val cache: Cache, cacheSystem: CacheSystem, 
                         bizService: BusinessService) 
    extends DateCacheActor[JList[Service1Result]](cacheSystem) {

  override def receive = super.receive

  override def updateCacheForDate(date: Date) {
    import DateCacheActor._
    Future { findObject(new Service1Params(date, true)) }
    Future { findObject(new Service1Params(date, false)) }
  }

  def finder(params: Params) = { () =>
    params match {
      case p: Service1Params => bizService.service1(p.date, p.useFoo)
      case _ => throw new IllegalArgumentException("...") 
    }
  }
}

class Service1Params(date: Date, val useFoo: Boolean) extends DateParams(date) {
  override def cacheKey = super.cacheKey + ":" + useFoo
}

Service1CacheActor’s implementation of updateCacheForDate finds and caches the results of the true and false variations of the BusinessService.service1 backend system call.

If we wanted to cache another one of BusinessService’s methods using the auto-updating caching system we would (see the sketch after this list):

  1. Subclass DateCacheActor, implement finder and potentially override updateCacheForDate.
  2. Subclass DateParams, providing the parameters to the backend query, and override the cacheKey method.
  3. Call createCacheActor again in CachingBusinessService to create the new DateCacheActor from #1, and write a cached version of the backend query method, sending FindValue to the new actor and waiting for the response.
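For example, caching a hypothetical BusinessService.service2(date, limit) method would look something like this (all of the service2 names are assumptions):

class Service2Params(date: Date, val limit: Int) extends DateParams(date) {
  override def cacheKey = super.cacheKey + ":" + limit
}

class Service2CacheActor(val cache: Cache, cacheSystem: CacheSystem,
                         bizService: BusinessService)
    extends DateCacheActor[JList[Service2Result]](cacheSystem) {

  override def receive = super.receive

  def finder(params: Params) = { () =>
    params match {
      case p: Service2Params => bizService.service2(p.date, p.limit)
      case _ => throw new IllegalArgumentException("...")
    }
  }
}

// In CachingBusinessService:
val service2CacheActor =
  cacheSystem.createCacheActor("service2", DATE_CACHE_SIZE, 0 seconds,
                               new Service2CacheActor(_, _, bizService))

def service2(date: Date, limit: Int): JList[Service2Result] = {
  val future = service2CacheActor ? FindValue(new Service2Params(date, limit))
  Await.result(future, timeout.duration).asInstanceOf[JList[Service2Result]]
}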

An Auto-Updating Caching System – part 1

Imagine you needed to build a caching system in front of a slow backend system with the following requirements:

  • The data in the backend system is constantly being updated so the caches need to be updated every N minutes.
  • Requests to the backend system need to be throttled.

Here’s a possible solution taking advantage of Akka actors and Scala’s support for functions as first-class objects.  The first piece of the puzzle is a CacheSystem which creates and queries EhCache caches:

class CacheSystem(name: String, updateIntervalMin: Int, cacheManager: CacheManager) {
  var caches = List.empty[Cache]
  val actorSystem = ActorSystem("cache_" + name)

  val DEFAULT_TTL_SEC = 86400   // 1 day

  def addCache(name: String, size: Int, ttlSeconds: Int = DEFAULT_TTL_SEC): Cache = {
    val config = new CacheConfiguration(name, size)
    config.setTimeToIdleSeconds(ttlSeconds)
    val cache = new Cache(config)
    cacheManager.addCache(cache)
    caches = cache :: caches
    cache
  }

  def createCacheActor(cacheName: String, cacheSize: Int, scheduleDelay: Duration, 
                       actorCreator: (Cache, CacheSystem) => Actor, 
                       ttlSeconds: Int = DEFAULT_TTL_SEC): ActorRef = {

    val cache = addCache(cacheName, cacheSize, ttlSeconds)
    val actor = actorSystem.actorOf(Props(actorCreator(cache, this)), 
                                    name = cacheName + "CacheActor")

    actorSystem.scheduler.schedule(scheduleDelay, updateIntervalMin minutes, 
                                   actor, UpdateCacheForNow)   
    if (!DateUtil.isNowInActiveBusinessDay) {
      actorSystem.scheduler.scheduleOnce(scheduleDelay, actor, 
                                         UpdateCacheForPreviousBusinessDay)
    }

    actor
  }

  def findCachedObject[T](key: String, cache: Cache, finder: () => T): Option[T] = {
    val element = cache.get(key)

    if (element == null) {
      findObjectForCache(key, cache, finder)
    } else {
      Some(element.getObjectValue.asInstanceOf[T])
    }
  }

  def findObjectForCache[T](key: String, cache: Cache, finder: () => T): Option[T] = {
    val domainObj = finder()
    if (domainObj != null) {
      val element = new Element(key, domainObj)
      cache.put(element)
      Some(domainObj)
    } else {
      None
    }
  }

  def clearAllCaches() {
    caches.foreach(_.removeAll())
  }
}

The createCacheActor method creates a cache and an actor, and schedules tasks to periodically update the cache. I decided to use actors for this because the actor system’s thread pool is a good way to meet the throttling requirement. In addition, using the Akka API it is easy to have scheduled tasks send messages to actors.  createCacheActor takes a function of (Cache, CacheSystem) => Actor to create the actor.  It then fills in those parameters to create the actor using the Akka actorOf method.

The findCachedObject and findObjectForCache methods take a finder function of () => T that handles the lookup of objects from the backend system.

Here is an example of the CacheSystem being used by the business logic layer:

class CachingBusinessService(bizService: BusinessService) extends BusinessService {
  implicit val timeout = Timeout(60 seconds)

  val service1CacheActor = 
    cacheSystem.createCacheActor("service1", DATE_CACHE_SIZE, 0 seconds, 
                                 new Service1CacheActor(_, _, bizService))
  // ... more actors created here

  def service1(date: Date, useFoo: Boolean): JList[Service1Result] = {
    val future = 
      service1CacheActor ? FindValue(new Service1Params(date, useFoo))
    Await.result(future, timeout.duration).asInstanceOf[JList[Service1Result]]
  }

  // ... more service methods
}

The CachingBusinessService is a caching implementation of the BusinessService interface. It creates CacheActors to service the requests.  To create the Service1CacheActor it passes a partially applied constructor function to createCacheActor.

The caching implementation of service1 sends a FindValue message to the service1CacheActor, using the ? (ask) method which returns an Akka Future.  Then it waits for the result of the future and returns it to the caller.

Using Await.result should raise a red flag. You don’t want to block if you don’t have to (especially inside of an actor). However, in this case the BusinessService is being called as part of a REST API served by a non-async HTTP server. Before the caching layer was introduced, it would block waiting for the backend to respond.

Here’s the code for the FindValue message and the Params that it contains.  Params are the parameters for the backend query. Each unique Params object corresponds to a cache entry so each Params subclass is responsible for generating the appropriate cache key.

object CacheActor {
  case class FindValue(params: Params)

  trait Params {
    def cacheKey: String
  }
}

In the next post I’ll describe the CacheActor class hierarchy.