The Big CAP Theory Part 1

What is the CAP Theorem?

Everyone wants reliable applications, and for decades we’ve striven to achieve reliability by pushing the same old philosophies harder and harder, often to their breaking point. Each gain becomes more costly and more difficult as we push closer to our goal, as if reliability were an asymptote: never truly reachable, and exponentially more expensive to approach. The CAP theorem describes this trade-off, and it is critical to understand the basic theorem before we can understand the solutions.

The CAP Theorem, also called Brewer’s Theorem, was posited by Eric Brewer in 1998 and codified into a formal theorem when Seth Gilbert and Nancy Lynch published its proof in 2002. The basis of the theorem is that a distributed system, which describes virtually all modern architectures, can’t be 100% Consistent, 100% Available, and 100% Partition Tolerant at the same time. One of the three tent poles of reliability, Consistency, Availability, or Partition Tolerance, inevitably suffers.

Consistency

The basis of consistency is shared state. This means that every portion of the architecture knows, or has access to, the same information as every other portion. If you’ve ever had to replicate sessions between nodes in a cluster of application servers, or configured an RDBMS cluster so that each node stays in sync with every other node and holds a complete copy of the entire data set, you’ve configured in favor of consistency.
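
Here is a minimal sketch of that idea: a write is applied to every node before it is acknowledged, so any node can answer a read with the same value. The class and method names are hypothetical, purely for illustration; a real cluster would replicate over the network rather than in memory.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}          # each node keeps a full copy of the data set

    def apply(self, key, value):
        self.data[key] = value


class ConsistentCluster:
    def __init__(self, nodes):
        self.nodes = nodes

    def write(self, key, value):
        # Synchronously apply the write to every node -- favoring consistency.
        for node in self.nodes:
            node.apply(key, value)

    def read(self, key):
        # Any node will do, because they all hold identical state.
        return self.nodes[0].data.get(key)


cluster = ConsistentCluster([Node("a"), Node("b"), Node("c")])
cluster.write("session:42", {"user": "alice"})
print(cluster.read("session:42"))   # every node agrees on the value
```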

The primary way distributed systems have delivered on consistency is by blocking clients that write to the system until those writes have persisted. That way, a client knows immediately if there is a failure of any kind and can decide whether to retry or report the error. In a strange way, even though the process can take an extended period of time, it actually fails fast. However, this method is also slow, and it is the foundation of the ACID principle, the topic of a future article.
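
The sketch below illustrates that blocking pattern under stated assumptions: `write_to_all_nodes` is a stand-in for whatever synchronous, blocking persistence call a real system exposes, and the retry policy is purely illustrative.

```python
import time


class WriteFailed(Exception):
    pass


def write_to_all_nodes(key, value):
    # Placeholder for a synchronous write across the cluster.
    # A real implementation would not return until every replica confirmed.
    ...


def blocking_write(key, value, retries=3, backoff_seconds=0.5):
    for attempt in range(1, retries + 1):
        try:
            write_to_all_nodes(key, value)   # the client blocks here
            return True                      # the write is durable everywhere
        except WriteFailed:
            if attempt == retries:
                raise                        # report the error to the caller
            time.sleep(backoff_seconds * attempt)   # or retry after a pause


blocking_write("order:17", {"status": "paid"})
```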

Availability

Availability is possibly the simplest of the three precepts of the CAP theorem. Essentially, availability refers to whether a distributed system can respond to a request. No more, no less. People often complicate the availability precept by adding other expectations, such as responding to a request with the correct data. This actually highlights the complex interaction of the precepts of the CAP theorem: availability doesn’t care whether the response is correct or efficient, only whether the system is able to respond at all.
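
To make that narrow definition concrete, here is a small sketch of an availability probe that only asks whether the system answers at all. The probe URL and timeout are assumptions for illustration; note that even an error response still counts as "available" in this sense.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def is_available(probe_url, timeout_seconds=2.0):
    try:
        urlopen(probe_url, timeout=timeout_seconds)
        return True              # the system responded
    except HTTPError:
        return True              # an error response is still a response
    except (URLError, TimeoutError):
        return False             # no response at all -- an availability failure


# Example: is_available("http://example.com/health")
```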

The primary way distributed systems have delivered on availability is by maintaining an authoritative system of record and blocking access to data while it is being written. We see this in an RDBMS as record locks, which prevent other requests from accessing the data until the first request has completed its write. As we can see, not a complicated aspect, but a frustrating one.
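
A minimal sketch of that record-locking behavior, using a per-key lock in place of a real RDBMS row lock. The names are hypothetical; a real database manages these locks internally.

```python
import threading
from collections import defaultdict

_records = {}                                  # the authoritative system of record
_record_locks = defaultdict(threading.Lock)    # one lock per record key


def write_record(key, value):
    with _record_locks[key]:    # other readers and writers of this key wait here
        _records[key] = value


def read_record(key):
    with _record_locks[key]:    # blocked until any in-flight write completes
        return _records.get(key)


write_record("row:7", "updated value")
print(read_record("row:7"))
```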

Partition Tolerance

If a messaging or RDBMS cluster has ever halted and reported that a partition was detected, or if you’ve ever tuned the heartbeat or net tick time configuration for a middleware product, you’ve dealt with partition tolerance. Partition tolerance defines how well a distributed system can recover when any of the nodes miss any of the messages updating state between them. When an application is waiting for the RDBMS to finish updating all the other nodes with your write, it is waiting on the partition tolerance solution. The classic solution has been to immediately copy every operation to a set of backup nodes and make the client wait until the update has propagated throughout the system. Often every node is a backup of every other node, which is common in RDBMS clusters, where data is rarely segmented across the cluster except in large and complex data architectures.

This propagation often introduces considerable overhead, because each operation a client executes turns into multiple internal operations as it is replicated to all the backup nodes. Additionally, the distributed system, or cluster, must continually check that all the other backup nodes are active, so it knows whether communication has been lost and an update message may have been missed. This makes partition tolerance the most error prone precept and the most likely to halt a system: any error in communication can trigger a partition fault, even when nothing actually went wrong. When an Oracle cluster begins issuing cluster halt errors, or a messaging cluster such as RabbitMQ stops processing messages, that is a failure of partition tolerance.
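
A hedged sketch of that continual checking: each node records when it last heard from its peers, and a peer that goes quiet past a threshold is treated as a suspected partition. The threshold plays the same role as the heartbeat and net tick time settings mentioned earlier; all names are illustrative.

```python
import time


class HeartbeatMonitor:
    def __init__(self, peers, timeout_seconds=5.0):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {peer: time.monotonic() for peer in peers}

    def record_heartbeat(self, peer):
        # Called whenever any message (heartbeat or otherwise) arrives from a peer.
        self.last_seen[peer] = time.monotonic()

    def suspected_partitions(self):
        # Peers we have not heard from recently -- possibly partitioned, possibly
        # just slow. The monitor cannot tell the difference, which is exactly why
        # false partition faults happen.
        now = time.monotonic()
        return [peer for peer, seen in self.last_seen.items()
                if now - seen > self.timeout_seconds]


monitor = HeartbeatMonitor(["node-a", "node-b", "node-c"])
monitor.record_heartbeat("node-a")
print(monitor.suspected_partitions())   # empty until a peer goes quiet too long
```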

What do we do with all this information? How do we design a reliable system when our only option is balancing a three-legged stool? That’s what we’ll cover in future articles on the ACID and BASE principles. Feel like something was missed in this introduction to the CAP theorem? Drop me a line at robert@opensource.io

About the Author
