High-availability resides in the Backing Services

Alessandro Gerlinger Romero
9 min read · Nov 12, 2018

Assume that a tool for the management of containerized applications across multiple hosts (e.g., Kubernetes, OpenShift, Docker Swarm, Apache Mesos) is in place and that applications are composed of containerized processes as defined by the Twelve-Factor App.

Still according to the Twelve-Factor App [1], such containerized processes can be horizontally scaled up and down as needed since they are stateless and share-nothing. Consequently, any data that needs to persist must be stored in a stateful store. Those stores are called backing services and should be treated as attached resources, in the sense that they are loosely coupled to the deployment they are attached to.

Altogether, the goal of any containerized application is to be as available as the network on which it runs, in other words, to have the high-availability property. Therefore, as containerized processes are stateless and share-nothing, high-availability resides in the backing services.

High-availability in Backing Services — Basic Concepts

Scaling a backing service that provides a distributed stateful store up or down poses challenges since, at least, the variable number of instances needs to communicate with each other (forming a cluster), to communicate with the processes attached to them, and to provide some sort of consistency. Furthermore, in such a distributed stateful store, the ability to survive a network partitioning into multiple disconnected backing services is desirable for high-availability.

Partition tolerance (P) and consistency (C), together with availability (A), converge to the CAP theorem [2], also named Brewer’s theorem, which states that it is impossible for a distributed stateful store to simultaneously provide more than two of these three guarantees. Therefore, the first (1) concept related to the high-availability of backing services is the selection of two of the three guarantees.

The remaining four basic concepts around distributed stateful backing services are: (2) discovery mechanisms, a.k.a. peer discovery, (3) the race condition in the initial cluster formation, (4) health check, and (5) detection and mitigation of network partitions.

(2) Discovery mechanisms

Once the two guarantees (from the CAP theorem) are selected for a given distributed stateful backing service, the next step is to analyze/define how the members of the cluster will discover each other, which is called the discovery mechanism, a.k.a. peer discovery. There are two groups of discovery mechanisms: (1) statically defined and (2) dynamically defined.

The statically defined discovery mechanisms, in turn, can be divided into two types: (a) a list of predefined IPs/hostnames and (b) a subnet. Option (a) is not applicable to a containerized application since the number of instances is variable and, consequently, so are their IPs/hostnames. Option (b) is normally built into products and, in this way, commonly used in containerized applications. There are different implementations of (b); however, in the most common one, namely well-known-address (WKA), the backing service defines a list of candidate members using a subnet and then establishes communication with all candidate members. Although common, WKA can lead to an initial delay in the cluster formation since a (potentially large) range must be scanned to find peers.

The dynamically defined discovery mechanisms are based on some sort of service (e.g., Kubernetes, AWS, Azure, Consul, Zookeeper, etcd, DNS), in which one or more queries are resolved for a given service name. Such a service can require backing service registration or can be based on a DNS service (like the one provided by Docker Swarm and Kubernetes). There are also dynamically defined mechanisms based on UDP multicast; however, they are not usually recommended for production environments since UDP is often blocked and other mechanisms are more definite [3].
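
As an illustration, the sketch below shows the DNS-based dynamic case, assuming a hypothetical Docker Swarm service named mystore; the embedded DNS service resolves tasks.mystore to one A record per running instance, so the peer list follows the scaling of the service.

# Resolve the (hypothetical) service name to the IPs of its currently running instances
getent hosts tasks.mystore
# illustrative output, one line per peer; the list changes as the service scales up or down:
# 10.0.9.3      tasks.mystore
# 10.0.9.4      tasks.mystore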

(3) Race condition in the initial cluster formation

Once the members can find each other, the next challenge is the natural race condition that emerges in the initial cluster formation since a multitude of members are starting at the same time and more than one can be the “first one”.

Some backing services continually try to establish the communication between peers (e.g., Hazelcast); on the other hand, some try only once (e.g., RabbitMQ). The former behavior can produce transient errors, while the latter can impair the cluster formation permanently.

Therefore, when working with products that selected the latter behavior, it is mandatory to control the initialization of the members; otherwise, some containers do not join the cluster and become “standalone” stateful stores. In order to avoid such undesired behavior, a common technique is to rely on some sort of order, in such a way that the next instance of the backing service is started only after the previous one is available. The underlying provisioning and orchestration tool must provide the information (e.g., slots in Docker Swarm) or the feature (e.g., Kubernetes stateful sets [4]) for such a guarantee.
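
For instance, Docker Swarm can hand the slot number to each container through a templated environment variable, which the bootstrap script can later use to decide whether it must wait for an earlier instance (the RabbitMQ example below applies exactly this idea). The sketch assumes hypothetical service and image names.

# Expose the Docker Swarm slot of each task as the SLOT environment variable
docker service create \
  --name mystore \
  --replicas 3 \
  --env SLOT="{{.Task.Slot}}" \
  mystore-image:latest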

(4) Health check

A tool for the management of containerized applications must define a way to determine whether a given containerized process is able to handle requests and, consequently, to diagnose the health status of a given service. In the microservices realm, the Health Check API pattern [5] supports such health diagnosis.
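
A minimal sketch of such a check is shown below, assuming a hypothetical HTTP /health endpoint exposed by the containerized process; the orchestrator runs the script periodically and treats a non-zero exit code as unhealthy.

#!/bin/sh
# Probe the (hypothetical) health endpoint of the local process;
# exit 0 means healthy, any other exit code means unhealthy
curl --fail --silent --max-time 2 http://localhost:8080/health || exit 1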

The health check is continually called by the tool for the management of containerized processes to diagnose their health status. If a process is unhealthy, the tool commonly removes the unhealthy process from the service name and, eventually, it can command a “stop/start” of the service (usually configured in real-world apps). Note that a “stop/start” of a containerized process roughly means that failed/discontinued members will be replaced with brand new ones.

“Stop/start” of a stateless and share-nothing service has no consequences; however, “stop/start” of a distributed stateful store backing service can easily lead to inconsistency since part of the state can be lost (in memory or in storage). Therefore, additional concerns about data storage and replication must be considered.

(5) Behavior in the presence of a network partition

As described in the previous basic concept, the health checking can easily lead to “network partitions” since a failed/discontinued member, when restarted/started, can perceive that a network partition occurred. The other source of network partitions is a network issue between healthy nodes. While a network partition is in place, the two (or more) sides of the cluster can, optimistically, evolve independently, with both sides thinking the other is unavailable. This scenario is known as split-brain. In fact, split-brain approaches can be classified as either optimistic or pessimistic [6].

The optimistic approach selects AP (from CAP) and lets the partitioned nodes work as usual (sacrificing consistency); afterwards, once the network is fixed, reconciliation might be required in order to bring the cluster back to a consistent state (e.g., Hazelcast, RabbitMQ with ignore as the partition handling strategy [7]).

The pessimistic approach selects CP (from CAP, sacrificing availability): once a network partition has been detected, access to the sub-partitions is limited in order to guarantee consistency (e.g., RabbitMQ with pause_minority as the partition handling strategy [7]).

RabbitMQ — Example

Now that the five basic concepts related to the high-availability of backing services have been explored, this section briefly shows how these concepts are materialized in a concrete backing service, RabbitMQ.

(1) CAP Theorem

According to the RabbitMQ documentation [7], RabbitMQ can be used in different ways to support a stateful store backing service, mainly federation and clustering. Typically, one would use clustering for high availability and increased throughput, with machines in a single location (reliable LAN link) [7]. Therefore, the selection of a RabbitMQ cluster chooses consistency and partition tolerance (CP) from the CAP theorem.

(2) Discovery mechanisms

According to the RabbitMQ documentation [9], RabbitMQ provides a variety of discovery mechanisms (e.g., Consul, etcd, Kubernetes, AWS, DNS). In a real-world project, DNS was selected as the discovery mechanism and is configured using the following snippet:

cluster_formation.peer_discovery_backend = rabbit_peer_discovery_dns
cluster_formation.dns.hostname = tasks.rabbit

Once the configuration is in place, (1) the “seed hostname” (in the above snippet, “tasks.rabbit”) is queried in the DNS server, (2) for each returned DNS record’s IP address a reverse DNS lookup is performed, and (3) the current node’s prefix (e.g., rabbit in rabbit@hmlhfx-backend-rabbit.1.18972dh.support) is prepended to each resolved hostname.
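
The same three steps can be reproduced by hand when troubleshooting discovery, as in the sketch below; the IP address and hostname in the comments are illustrative and environment specific.

# (1) forward lookup of the seed hostname (one A record per peer)
getent hosts tasks.rabbit
# (2) reverse lookup of each returned IP address
nslookup 10.0.9.3
#     e.g., resolves to hmlhfx-backend-rabbit.1.18972dh.support
# (3) prepend the node prefix to each resolved hostname to build the peer node name
#     e.g., rabbit@hmlhfx-backend-rabbit.1.18972dh.support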

It is important that each resolved name (after steps 1, 2 and 3 discussed above) matches the RABBITMQ_NODENAME of the corresponding peer, and also that the following properties are defined adequately: RABBITMQ_USE_LONGNAME (true) and RABBITMQ_ERLANG_COOKIE (the same value for all peers of the cluster).

(3) Race condition in the initial cluster formation

Without Kubernetes, which would provide StatefulSet [4], in a real-world project the slot (the order of the instance of a given containerized process) is used to control the sequence of peer startup, consequently avoiding the race condition in the initial cluster formation.

The bootstrap shell script checks whether the slot of the current instance is greater than 1; in that case, the script waits (wait-for [11]) until the slot 1 instance is healthy and, therefore, available as a peer for cluster formation.
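
A minimal sketch of such a bootstrap script is shown below, assuming the SLOT environment variable injected as shown earlier and a hypothetical hostname for the slot 1 instance; wait-for [11] blocks until the given host:port accepts connections.

#!/bin/sh
# Slot 1 bootstraps the cluster; every other slot waits for slot 1 to accept AMQP connections
if [ "$SLOT" -gt 1 ]; then
  ./wait-for hmlhfx-backend-rabbit-1:5672 -t 300 -- echo "slot 1 is available"
fi
# hand over to the regular entrypoint of the RabbitMQ image
exec docker-entrypoint.sh rabbitmq-server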

(4) Health check

In a real-world project, the health check of the instances of the RabbitMQ backing service is performed by running a shell script that executes the command “rabbitmqctl cluster_status”. Moreover, in order to guarantee that there is no undesired network partition due to the initial cluster formation, the response of this command is parsed searching for the slot 1 node; if it is not found, the process is unhealthy. Therefore, only peers that joined the slot 1 node are considered healthy.
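
A minimal sketch of such a health check script is shown below, assuming a hypothetical node name for the slot 1 instance; the exit code is what the orchestrator uses to mark the container healthy or unhealthy.

#!/bin/sh
# Healthy only if rabbitmqctl answers and the slot 1 node is part of the cluster view
SLOT1_NODE="rabbit@hmlhfx-backend-rabbit-1"   # hypothetical node name of the slot 1 instance
rabbitmqctl cluster_status | grep -q "$SLOT1_NODE" || exit 1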

At this stage, the following image shows an example of such treatment of the basic concepts related to high-availability.

Nonetheless, as discussed earlier, the health check itself brings additional concerns about data storage and replication. In order to avoid losing messages, the replication of queues was defined using the mirroring policy provided by RabbitMQ [10], which leads to the following image for a given queue (note the feature “ha” enabled in the queue).
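
A minimal sketch of such a mirroring policy is shown below, assuming that all queues should be mirrored to all nodes; the policy name and pattern are illustrative, and the exact ha-mode should follow the durability requirements of the project [10].

# Mirror every queue ("^" matches all queue names) to all nodes and synchronize automatically
rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all","ha-sync-mode":"automatic"}'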

(5) Behavior in the presence of a network partition

According to the RabbitMQ documentation [8], the default behavior of RabbitMQ in the face of a network partition is to “ignore” it; therefore, once a split occurs, it will continue even after network connectivity is restored. Such a default configuration clearly selects a different pair of guarantees from the CAP theorem, namely AP, compromising consistency.

In order to configure the backing service to guarantee CP, the right partition handling strategy must be selected. There are four strategies, namely ignore, pause_minority, pause_if_all_down and autoheal. In a real-world project, the following configuration (rabbitmq.conf) was selected:

cluster_formation.node_cleanup.only_log_warning = false
cluster_partition_handling = pause_minority

The first line of the configuration (the plugin rabbitmq_peer_discovery_common must be enabled) defines that nodes that are no longer available should be removed from the cluster, since they would not rejoin it (StatefulSet [4] is not being used). The second line defines that RabbitMQ will automatically pause cluster nodes which determine themselves to be in a minority (i.e., fewer than or equal to half the total number of nodes) after seeing other peers go down [8], thus preferring consistency over availability. For example, in a three-node cluster split into sides of two nodes and one node, the single node pauses itself while the majority side keeps serving requests.
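
The effective value can be double-checked on a running node, for example as sketched below; rabbitmqctl environment dumps the runtime configuration of the node, including the partition handling strategy (the output format is roughly as in the comment).

# Confirm the partition handling strategy applied to the running node
rabbitmqctl environment | grep cluster_partition_handling
# expected output (roughly): {cluster_partition_handling,pause_minority}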

Wrapping-up

The reason why high-availability resides in the backing services, as well as the five basic concepts that must be appropriately addressed for a containerized, highly available stateful store backing service, were presented and explored using RabbitMQ as an example stateful store backing service.

These five basic concepts are general and should be assessed for every containerized process that is stateful and must be highly available. For example, a containerized process that applies an in-memory data grid, e.g., Hazelcast, must address these same five basic concepts.

References

[1] The Twelve-Factor App — https://12factor.net/

[2] Gilbert, Seth; Lynch, Nancy (2002). “Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services”. ACM SIGACT News. 33(2): 51–59.

[3] Hazelcast — Setting Up Clusters Discovery Mechanisms — https://docs.hazelcast.org/docs/latest-development/manual/html/Setting_Up_Clusters/Discovery_Mechanisms.html

[4] Kubernetes — Statefulset — https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/

[5] Health Check API — https://microservices.io/patterns/observability/health-check-api.html

[6] Davidson, Susan; Garcia-Molina, Hector; Skeen, Dale (1985). “Consistency In A Partitioned Network: A Survey”. ACM Computing Surveys. 17(3): 341–370.

[7] RabbitMQ — Distributed RabbitMQ brokers — https://www.rabbitmq.com/distributed.html

[8] RabbitMQ — Clustering and Network Partitions — https://www.rabbitmq.com/partitions.html

[9] RabbitMQ — Cluster Formation and Peer Discovery — https://www.rabbitmq.com/cluster-formation.html

[10] RabbitMQ — Highly Available (Mirrored) Queues — https://www.rabbitmq.com/ha.html

[11] Wait for it — Alpine — https://github.com/eficode/wait-for
