CAP Theorem – System Design

The CAP theorem (Consistency, Availability, Partition tolerance) is a fundamental concept in distributed systems that describes the trade-offs among three key system properties. It states that a distributed system can guarantee at most two of the three properties at the same time.

Here’s a detailed explanation of the CAP theorem with an example and a diagram:

The Three Properties

Consistency (C):
- Every read receives the most recent write or an error.
- This ensures that all nodes in the distributed system return the same data at any time.
- Example: In a banking system, if you transfer money from one account to another, any subsequent read operation should reflect the updated balances immediately.
Availability (A):
- Every request (read or write) sent to a working node receives a non-error response, though that response is not guaranteed to reflect the most recent write.
- This means the system remains operational even if some nodes fail.
- Example: If one node in a system goes down, the system should still handle requests using the remaining nodes.
Partition Tolerance (P):
- The system continues to operate despite network partitions (failures in communication between nodes).
- This ensures the system is resilient to network issues and does not completely fail.
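- Example: If a network link between two nodes is broken, the system as a whole should keep operating rather than fail outright. How each side of the break behaves depends on whether the system favors consistency or availability.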

CAP Theorem Trade-Offs

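According to the CAP theorem, a distributed system can provide at most two of these properties at once. Because network partitions cannot be ruled out in practice, the real choice is usually between consistency and availability while a partition lasts: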

CP (Consistency + Partition Tolerance):
- Ensures data consistency across all nodes even during a partition, but some requests may be rejected or time out until the partition heals.
- Example: HBase or MongoDB (with strict consistency).
AP (Availability + Partition Tolerance):
- Ensures the system is always available even during a partition, but data consistency may be compromised.
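- Example: Cassandra or DynamoDB in their default, eventually consistent configurations (both can be tuned toward stronger consistency).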
CA (Consistency + Availability):
- Ensures consistency and availability as long as there are no network partitions, but sacrifices partition tolerance.
- Example: Traditional RDBMS systems like MySQL or PostgreSQL in a single-node setup.
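
To make the CP and AP choices concrete, here is a minimal sketch of a toy two-replica key-value store. It is not how any of the databases named above are implemented; the names TinyStore, Replica, Unavailable and the partitioned flag are made up for illustration, and replication is modeled as a simple loop.

```python
class Unavailable(Exception):
    """Raised when the CP store refuses a request during a partition."""


class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}


class TinyStore:
    """Toy store with two replicas. `mode` is "CP" or "AP" (hypothetical)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [Replica("a"), Replica("b")]
        self.partitioned = False  # flip to True to simulate a partition

    def write(self, replica_index, key, value):
        if not self.partitioned:
            # Healthy network: copy the write to every replica.
            for replica in self.replicas:
                replica.data[key] = value
            return
        if self.mode == "CP":
            # The other replica is unreachable, so refuse rather than diverge.
            raise Unavailable("partition: write rejected to stay consistent")
        # AP: accept the write locally; the other replica is now stale.
        self.replicas[replica_index].data[key] = value

    def read(self, replica_index, key):
        # In this toy, CP rejects every write during a partition, so reads
        # from either replica still return the last acknowledged write.
        return self.replicas[replica_index].data.get(key)


if __name__ == "__main__":
    ap = TinyStore("AP")
    ap.write(0, "stock", 10)
    ap.partitioned = True
    ap.write(0, "stock", 9)  # accepted on replica 0 only
    print(ap.read(0, "stock"), ap.read(1, "stock"))  # replicas now disagree

    cp = TinyStore("CP")
    cp.write(0, "stock", 10)
    cp.partitioned = True
    try:
        cp.write(0, "stock", 9)
    except Unavailable as exc:
        print("CP store refused the write:", exc)
```

The AP store keeps answering but its replicas disagree until they are reconciled, while the CP store stays consistent by turning writes away.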

Real-World Example

Scenario: Online Shopping System

Imagine a distributed system for an e-commerce platform.
A user adds a product to their cart. This operation is replicated across multiple servers to ensure availability and fault tolerance.

Consistency:
- If the system is consistent, all servers must have the same updated cart data before responding to the user.
- During a network partition, the system may block updates until all servers synchronize, reducing availability.
Availability:
- The system remains available even if some servers are disconnected.
- However, the cart data may differ between servers (inconsistency) during the partition.
Partition Tolerance:
- During a partition, the system must tolerate the failure and continue working.
- It may prioritize either availability (AP) or consistency (CP), but not both.
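
If the cart service takes the AP route, the servers need some way to reconcile their copies once the partition heals. The sketch below shows one simple option, assumed here for illustration only: treat each cart as a set of item names and merge by union. The function name merge_carts and the sample items are hypothetical, and this add-only approach cannot express removals (a removed item would reappear after the merge).

```python
def merge_carts(*replicas):
    """Merge divergent cart copies by set union (add-only reconciliation)."""
    merged = set()
    for cart in replicas:
        merged |= cart
    return merged


if __name__ == "__main__":
    # Both servers start with the same cart.
    server_1 = {"book"}
    server_2 = {"book"}

    # During the partition the user's requests land on different servers.
    server_1.add("lamp")
    server_2.add("mug")

    # After the partition heals, the copies are reconciled.
    print(sorted(merge_carts(server_1, server_2)))
```

Union works here because adding items is order-independent, so every server reaches the same merged cart no matter which copy it starts from.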

Diagram of CAP Theorem

Here is a conceptual representation of the CAP theorem:

          Partition Tolerance
                 /   \
                /     \
               /       \
          Consistency   Availability

In practice:

CA systems work well without partitions but fail when one occurs.
CP systems ensure correctness but may reject requests during a partition.
AP systems prioritize availability but may return outdated or inconsistent data during a partition.

Detailed Example with CAP Trade-Offs

Banking System Example:

Consistency (C): All branches of the bank show the same account balance at all times.
Availability (A): ATMs and online banking always provide service.
Partition Tolerance (P): The system can handle communication breakdowns between branches.

Trade-Off:

During a network partition:
- CP: The system rejects or delays transactions it cannot safely replicate (availability is compromised) until the partition heals, so account balances stay consistent.
- AP: Transactions continue (availability is maintained), but balances may be inconsistent across branches.
- CA: Without partition tolerance, the system cannot guarantee service during communication failures.
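
A common way to implement the CP choice is a majority quorum: a transaction commits only if the side of the partition handling it can reach more than half of the nodes. The sketch below illustrates that rule under simplifying assumptions. The names can_commit and Account, and the idea of passing in a count of reachable nodes, are hypothetical; a real system would discover reachability through its replication protocol.

```python
def can_commit(reachable_nodes, cluster_size):
    """Return True if the reachable side holds a strict majority."""
    return reachable_nodes > cluster_size // 2


class Account:
    """Toy account replicated across `cluster_size` branches (hypothetical)."""

    def __init__(self, balance, cluster_size):
        self.balance = balance
        self.cluster_size = cluster_size

    def withdraw(self, amount, reachable_nodes):
        # CP behavior: refuse the transaction when no majority is reachable.
        if not can_commit(reachable_nodes, self.cluster_size):
            raise RuntimeError("no quorum: transaction rejected")
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount
        return self.balance


if __name__ == "__main__":
    account = Account(balance=100, cluster_size=5)

    # Majority side of a 3/2 split: the withdrawal goes through.
    print(account.withdraw(30, reachable_nodes=3))

    # Minority side of the same split: the withdrawal is refused.
    try:
        account.withdraw(30, reachable_nodes=2)
    except RuntimeError as exc:
        print(exc)
```

With an even split, such as 2 and 2 in a four-node cluster, neither side has a majority, so both sides refuse transactions until the partition heals.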