
Adventures in Clustering – part 2

Embedding a Zookeeper Server

To minimize the number of moving parts in the message delivery system I wanted to embed the Zookeeper server in the application, rather than running a separate ensemble of Zookeeper servers.

The embedded Zookeeper server is encapsulated in an EmbeddedZookeeper trait that is mixed into the ZookeeperService class from part 1.
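Roughly, the trait and the two server wrappers look something like this (ClusterService and Logging are the traits from part 1; the StoppableServer interface, the wrapper class names, and the exact signatures here are illustrative rather than the actual code):

```scala
import org.apache.zookeeper.server.ZooKeeperServerMain
import org.apache.zookeeper.server.quorum.QuorumPeerMain

// Common interface so the embedding code can stop either kind of server.
trait StoppableServer {
  def stop(): Unit
}

// Standalone server: expose ZooKeeperServerMain's protected shutdown() as stop().
class StandaloneZookeeperServer extends ZooKeeperServerMain with StoppableServer {
  def stop(): Unit = shutdown()
}

// Replicated server: shut down the QuorumPeer that runFromConfig() started.
class ReplicatedZookeeperServer extends QuorumPeerMain with StoppableServer {
  def stop(): Unit = Option(quorumPeer).foreach(_.shutdown())
}

trait EmbeddedZookeeper {
  // Self-type: this trait can only be mixed into something that provides nodeId and log methods.
  this: ClusterService with Logging =>

  // Implemented by ZookeeperService.
  def clientPort: Int

  // Reference to the running server so it can be stopped at application shutdown.
  @volatile protected var zkServer: Option[StoppableServer] = None

  // configureServer() and startServer() are covered below.
}
```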

There are two kinds of Zookeeper servers: standalone and replicated. Standalone (aka Single) servers, which are typically used for local development and testing, use ZooKeeperServerMain to start up. Replicated (aka Clustered or Multi-server) servers, which provide high availability, use QuorumPeerMain. I extend both of these classes to add a stop() method in a common interface.

A couple of other things to mention about EmbeddedZookeeper: it is self-typed to ClusterService and Logging to get access to the nodeId and log methods. Also, there is an abstract clientPort method that is implemented by ZookeeperService.

ZookeeperService Initialization

The embedded server is configured and started as part of the initialize() method in ZookeeperService.
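Here is a sketch of the server-related part of initialize() (the hostnames parameter and the genServerNames() signature are illustrative; the real method presumably does more):

```scala
// Sketch of the server-related portion of initialize().
def initialize(hostnames: Seq[String]): Unit = {
  val serverNames = genServerNames(hostnames)   // hostnames of all cluster nodes
  val startFn = configureServer(serverNames)    // standalone or replicated, per ServerNames
  startServer(startFn)                          // run the blocking startup function on its own thread
}
```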

In initialize() a list of Zookeeper server hostnames is passed to genServerNames(), which generates a ServerNames instance (see below). Then the Zookeeper server is configured and started.

Server Configuration

ServerNames contains hosts, a collection of cluster node IDs, along with serversKey and serversVal, which correspond to the server.x=[hostname]:nnnnn[:nnnnn] configuration settings for clustered servers (each server needs to know about all of the other servers). If there are multiple elements in ServerNames.hosts then useReplicated will be true, which tells configureServer() to configure a replicated server.
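ServerNames might look something like this (the field types here are assumptions based on the description above):

```scala
// Sketch of ServerNames; field types are assumptions.
case class ServerNames(
  hosts: Seq[String],        // cluster node hostnames / IDs
  serversKey: Seq[String],   // "server.1", "server.2", ... one entry per node
  serversVal: Seq[String]    // "host1:2888:3888", "host2:2888:3888", ...
) {
  // More than one host means a replicated (quorum) server is needed.
  def useReplicated: Boolean = hosts.size > 1
}
```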

In configureServer() the first thing that happens is the Zookeeper server data directory is removed and recreated. I found that the Zookeeper server data directory could get corrupted if the server wasn’t shut down cleanly. Instead of trying to maintain the data directory across restarts I decided to just recreate the data directory each time the application started up. The downside is that the server’s database has to be populated on each startup (it synchronizes the data from other servers). In this particular use case it’s ok because there is very little data being stored in Zookeeper (just the Leader Election znode and children). The upside is that no manual intervention is needed to address corrupt data directories.
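For example, a helper along these lines (the helper name is illustrative):

```scala
import java.io.File

// Delete and recreate the data directory so a directory corrupted by an unclean
// shutdown can never prevent the server from starting.
private def resetDataDir(dataDir: File): Unit = {
  def deleteRecursively(f: File): Unit = {
    Option(f.listFiles).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    f.delete()
  }
  if (dataDir.exists()) deleteRecursively(dataDir)
  dataDir.mkdirs()
}
```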

Replicated Zookeeper servers require a myid file that identifies the ordinal position of the server in the ensemble. To avoid having to create this file manually I create it here as part of server startup.
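For example (this assumes nodeId doubles as the server's ordinal in the ensemble):

```scala
import java.io.{File, FileWriter}

// Write the myid file that identifies this server's ordinal position in the ensemble.
private def writeMyId(dataDir: File, ordinal: Int): Unit = {
  val writer = new FileWriter(new File(dataDir, "myid"))
  try writer.write(ordinal.toString) finally writer.close()
}
```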

Finally, the appropriate Zookeeper server object is instantiated and configured with a set of Properties, and a server startup function is returned for use by startServer(). A reference to the server object is saved in zkServer for shutdown later.
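Putting those pieces together, configureServer() might look roughly like this (dataDir and nodeId are assumed to be available from ZookeeperService and ClusterService; the property names are standard Zookeeper settings, and the tickTime/initLimit/syncLimit values are just reasonable defaults):

```scala
import java.util.Properties

import org.apache.zookeeper.server.ServerConfig
import org.apache.zookeeper.server.quorum.QuorumPeerConfig

// Sketch of configureServer(): reset the data directory, write myid in replicated mode,
// build the server configuration from Properties, and return a startup function.
private def configureServer(serverNames: ServerNames): () => Unit = {
  resetDataDir(dataDir)
  if (serverNames.useReplicated) writeMyId(dataDir, nodeId)

  val props = new Properties()
  props.setProperty("dataDir", dataDir.getAbsolutePath)
  props.setProperty("clientPort", clientPort.toString)
  if (serverNames.useReplicated) {
    props.setProperty("tickTime", "2000")
    props.setProperty("initLimit", "10")
    props.setProperty("syncLimit", "5")
    // server.x entries tell this node about every server in the ensemble.
    serverNames.serversKey.zip(serverNames.serversVal).foreach {
      case (key, value) => props.setProperty(key, value)
    }
  }

  val quorumConfig = new QuorumPeerConfig()
  quorumConfig.parseProperties(props)   // in replicated mode this also reads myid from dataDir

  if (serverNames.useReplicated) {
    val server = new ReplicatedZookeeperServer()
    zkServer = Some(server)
    () => server.runFromConfig(quorumConfig)   // blocks until the server shuts down
  } else {
    val server = new StandaloneZookeeperServer()
    zkServer = Some(server)
    val config = new ServerConfig()
    config.readFrom(quorumConfig)
    () => server.runFromConfig(config)         // blocks until the server shuts down
  }
}
```

Note that both runFromConfig() calls block until the server shuts down, which is why startServer() runs the returned function on its own thread.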

Server Startup

To start the server, a new thread is created and the server startup function returned by configureServer() is run in that thread. The main application thread waits on a semaphore for the server to start.
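Here is a sketch of startServer(), with one plausible way to signal readiness: a helper thread polls the client port and releases the semaphore once a connection succeeds.

```scala
import java.io.IOException
import java.net.Socket
import java.util.concurrent.{Semaphore, TimeUnit}

// Sketch of startServer(): run the blocking startup function on its own thread and
// block the caller on a semaphore until the server is accepting connections.
private def startServer(startFn: () => Unit): Unit = {
  val started = new Semaphore(0)

  val serverThread = new Thread(new Runnable {
    def run(): Unit = startFn()                  // blocks until the server shuts down
  }, "embedded-zookeeper")
  serverThread.setDaemon(true)
  serverThread.start()

  val watcher = new Thread(new Runnable {
    def run(): Unit = {
      while (!canConnect(clientPort)) Thread.sleep(100)
      started.release()                          // server is reachable; wake the main thread
    }
  }, "embedded-zookeeper-startup-watcher")
  watcher.setDaemon(true)
  watcher.start()

  // The main application thread waits here for the server to start (with a safety timeout).
  if (!started.tryAcquire(30, TimeUnit.SECONDS))
    throw new IllegalStateException("embedded Zookeeper server did not start")
}

// True if a TCP connection to the embedded server's client port succeeds.
private def canConnect(port: Int): Boolean =
  try { new Socket("localhost", port).close(); true } catch { case _: IOException => false }
```

Making both threads daemons keeps the embedded server from preventing JVM shutdown if the application exits without an orderly stop.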

Notes

One of the limitations of running Zookeeper in replicated mode is that a quorum of servers has to be running in order for the ensemble to respond to requests. A quorum is a majority of the servers (N/2 + 1, using integer division). In practice this means that there should be an odd number of servers and there need to be at least three servers in the cluster. If there are three servers in the cluster then two of them have to be up and running for the clustering functionality to work. Depending on how you deploy your application it might not be possible to always keep at least a quorum of servers running. If that is the case then embedded Zookeeper won’t be an option for you.

Also, one of the things I noticed when running embedded Zookeeper is that during deployments, as application JVMs are bounced, you will get a lot of error noise in the logs from the Zookeeper server classes. This is expected behavior, but it might cause concern with your operations team.

Coming Up

In the next post I will come back to the ClusterService and fix a problem where multiple nodes might think they are the leader at the same time.