The Trouble with Distributed Systems
Single-Node vs Distributed Systems: Developer Mindset
| Aspect | Single-Node Developer | Distributed Systems Developer |
|---|---|---|
| Failure Model | Failures are rare and total (process crash) | Partial failures are normal—some nodes fail, others continue |
| Failure Detection | Immediate and obvious (exceptions, crashes) | Ambiguous—timeout could mean crash, delay, or network loss |
| Communication | Function calls, shared memory (reliable, instant) | Message passing over network (lossy, delayed, duplicated) |
| Latency Assumption | Predictable, low (ns–µs) | Variable, unbounded (ms–seconds) |
| Time Model | Single system clock; consistent ordering | No global clock; clock drift; ordering is uncertain |
| Event Ordering | Deterministic execution order | Partial ordering only; causality must be inferred |
| State Management | Single source of truth | Multiple replicas; state can diverge |
| Consistency | Strong consistency by default | Tradeoffs: consistency vs availability vs latency |
| Concurrency | Threads/processes with shared memory | Independent nodes; no shared memory; coordination is hard |
| Error Handling | Exceptions are exceptional | Failures are expected; retries are common |
| Retry Semantics | Usually unnecessary | Must consider idempotency and duplicate effects |
| Knowledge Model | Node has full, accurate system view | Node’s view is incomplete and possibly wrong |
| Leadership / Coordination | Simple (in-process locks) | Requires consensus, leader election, leases |
| Edge Cases | Limited and predictable | Combinatorial explosion of edge cases |
| Debugging | Reproducible, local debugging | Non-deterministic, timing-dependent, hard to reproduce |
| Correctness Definition | Correct output for given input | Safety + Liveness guarantees |
| Performance Focus | CPU, memory optimization | Latency, throughput, fault tolerance tradeoffs |
| Design Philosophy | “Make it work efficiently” | “Assume everything can fail—design for recovery” |
| Core Abstraction | Deterministic computation | Unreliable communication + uncertainty |
| Typical Bugs | Logic errors, memory bugs | Race conditions, split-brain, stale reads, data loss |
| Mental Model | Program = function | System = protocol among unreliable participants |
Single-node programming is about controlling execution. Distributed systems programming is about surviving uncertainty.
**Q & A**
Which row introduces the *biggest conceptual break* from what you already know?
The core challenges of distributed systems stem from three primary areas of unreliability:
- Networks: Asynchronous packet networks offer no guarantees on delivery or latency. Timeouts are the only mechanism for failure detection, yet they cannot distinguish between a crashed node, a network outage, or a slow response.
- Clocks: Hardware clocks drift and synchronization protocols like NTP are subject to network delays. Relying on timestamps for ordering events often leads to silent data loss.
- Processes: Unpredictable "stop-the-world" garbage collection pauses, virtualization "steal time," and I/O latencies can halt a process at any moment, causing it to lose leadership or violate lease agreements.
To operate reliably, engineers must abandon optimism and assume that anything that can go wrong will go wrong. Reliability is achieved by building fault-tolerant mechanisms—such as quorums and fencing tokens—that allow the system as a whole to function even when its individual components are unreliable.
**Q and A**
- Which of these do you think is the hardest to deal with in practice?
- In a single-machine program, which of these do we usually ignore?
Faults and the Reality of Partial Failure
In a single-node environment, hardware is designed to present an idealized system model of mathematical perfection. If an internal fault occurs, the system typically crashes entirely to avoid returning confusing, incorrect results.
In contrast, distributed systems must confront the "messy reality of the physical world." This includes:
- Partial Failure: A state where some components are broken but others are fine. This is nondeterministic; an operation may succeed one moment and fail the next without a clear cause.
- Component Reliability: While supercomputers (HPC) often treat partial failure as total failure (stopping the entire cluster), cloud-based internet services must remain available. They are built from "commodity" hardware with higher failure rates but lower costs, necessitating software-level fault tolerance.
- The Goal: The objective is to build a reliable higher-level system from unreliable underlying components. While this doesn't eliminate all faults, it makes the remaining issues easier to reason about.
Unreliable Networks
Internet services primarily use shared-nothing architectures, where machines communicate solely via asynchronous packet networks (Ethernet/IP). These networks provide no guarantees regarding when a message will arrive or if it will arrive at all.
The Six Points of Failure
When a sender transmits a request and fails to receive a response, it is impossible to distinguish between these scenarios:

- The request was lost (e.g., an unplugged cable).
- The request is waiting in a queue (overload).
- The remote node has failed (crash or power loss).
- The remote node has temporarily stopped responding (e.g., a GC pause).
- The response was lost on the network (misconfigured switch).
- The response was delayed (congestion).
Timeouts and Unbounded Delays
Because delays are unbounded, there is no "correct" value for a timeout.
- Long timeouts: Increase the time users must wait to see an error.
- Short timeouts: Risk prematurely declaring a node "dead," which can lead to cascading failures as load is transferred to already struggling nodes.
Time out estimation: 2d + r where d is the maximum delay to deliver a packet and r is the maximum delay to process a request.
**Q and A**
- Should we always choose a very large timeout to be safe?
- What happens if every node uses aggressive short timeouts?
Network congestion and queueing

-
Network Switch: Multiple sources competing for one link leads to output queuing and congestion delay. If the queue overflows, packets are dropped and must be resent.
-
Operating System: Incoming network requests are queued in the OS kernel until a busy CPU or application can handle them, introducing arbitrary delay.
-
Virtualized Environments: When a running OS (VM) is paused by the VMM to allow another VM to use the CPU, incoming data is buffered by the VMM, increasing delay variability.
-
TCP Sender: Flow control (backpressure) causes the sender to limit its output rate, creating additional queuing before the data enters the network.
Variable Latency: Packet vs. Circuit Switching
| Feature | Packet Switching (Internet/Datacenter) | Circuit Switching (Telephone Network) |
|---|---|---|
| Bandwidth | Opportunistic; shared dynamically. | Fixed and guaranteed for the duration. |
| Utilization | High; optimized for bursty traffic. | Low; bandwidth is wasted if idle. |
| Delays | Unbounded; subject to queueing. | Bounded; no queueing. |
Unreliable Clocks
Distributed systems rely on clocks for timeouts, expiration dates, and event ordering. However, hardware clocks (quartz oscillators) are imprecise and subject to clock drift (varying by temperature and age).
Network Time Protocol (NTP)
NTP is a networking protocol used to synchronize the clocks of computers over a network to a common timebase. It ensures accurate timekeeping for various applications, including GPS and financial services.
Monotonic vs. Time-of-Day Clocks
| Clock Type | Purpose | Behavior |
|---|---|---|
| Time-of-Day | Wall-clock time (e.g., UTC). | Can jump backward if resynced via NTP; unsuitable for measuring elapsed time. |
| Monotonic | Measuring duration (e.g., timeouts). | Guaranteed to move forward; absolute value is meaningless (often nanoseconds since boot). |
Slewing
Slewing in clocks refers to the gradual adjustment of the clock's time to correct for drift, ensuring smooth synchronization without abrupt changes. This process is crucial for maintaining accurate timekeeping in systems like NTP (Network Time Protocol) where precise timing is essential.
Leap seconds
A leap second is an additional second added to Coordinated Universal Time (UTC) to keep it in sync with astronomical time (UT1), which is based on the Earth's rotation. Leap seconds are typically added at the end of June or December to ensure the difference between UTC and UT1 remains within 0.9 seconds.
Language specific timestamps
| Language | Function/Command | Description |
|---|---|---|
| Python | time.monotonic() |
Returns monotonic time unaffected by system clock changes. |
| JavaScript | performance.now() |
Provides a high-resolution monotonic timestamp. |
| Bash | date +%s.%N |
Gets a high-resolution timestamp (not fully monotonic). |

The Danger of Last Write Wins (LWW)
Many databases use timestamps to resolve conflicts. This is dangerous because:
- A node with a lagging clock can silently overwrite data from a node with a faster clock.
- It cannot distinguish between sequential writes and truly concurrent ones.
- NTP synchronization is limited by network round-trip time, making it nearly impossible to ensure perfect ordering across nodes.
Confidence Intervals and TrueTime
A clock reading should be viewed as a confidence interval rather than a point in time. Google’s Spanner uses the TrueTime API to return a range [earliest, latest]. By deliberately waiting for the length of the uncertainty interval before committing a transaction, Spanner ensures that causality is respected across datacenters.
Process Pauses
A node in a distributed system may be "preempted" at any moment, halting execution for seconds or even minutes.
Causes of "Stop-the-World" Pauses
- Garbage Collection (GC): In languages like Java, GC can pause all threads.
- Virtualization: A hypervisor may suspend a VM to allow another VM to use the CPU (steal time).
- Paging/Thrashing: The OS may pause a process to swap memory pages from disk.
- Disk/Network I/O: Synchronous I/O operations can block threads unexpectedly.
The Lease Problem
A common pattern involves a leader obtaining a "lease" to perform writes. If a process is paused after checking that its lease is valid but before performing the write, the lease may expire during the pause. If another node has been elected leader in the meantime, the original node will unknowingly perform an unauthorized write, leading to data corruption.
Knowledge, Truth, and Lies
In a distributed system, a node cannot trust its own perception. It can only know the state of other nodes by exchanging messages, which are themselves unreliable.
Quorums: Truth by Majority
Decisions (such as declaring a node "dead") must be made by a quorum—usually an absolute majority of nodes. This prevents a single "limping" or semi-disconnected node from causing a total system failure.
Fencing Tokens
To prevent a "zombie" leader (one that was paused and replaced) from corrupting data, fencing tokens must be used.
- The lock service issues a monotonically increasing token with every lease.
- The storage service records the highest token it has processed.
- Any write request with an older (lower) token is rejected.
Byzantine Faults
While most datacenter systems assume nodes are "unreliable but honest," Byzantine faults occur when nodes "lie" or send corrupted data (common in aerospace or blockchain).
- Byzantine Fault Tolerance (BFT): Extremely complex and expensive to implement; generally unnecessary for internal datacenter applications where nodes are controlled by a single organization.
System Models and Correctness
To prove the correctness of distributed algorithms, engineers use formalized System Models based on timing and fault assumptions.
Timing Models
- Synchronous: Bounded delays and clock errors (unrealistic for most).
- Partially Synchronous: Behaves synchronously most of the time but allows for occasional unbounded delays (the most practical model).
- Asynchronous: No timing assumptions; no clocks; very restrictive.
Safety vs. Liveness
- Safety: "Nothing bad happens." If violated, the damage is permanent (e.g., a duplicate fencing token).
- Liveness: "Something good eventually happens" (e.g., a node eventually receives a response). Liveness properties are allowed to fail during network outages but must recover once the network is restored.
Final Insight: "In distributed systems, suspicion, pessimism, and paranoia pay off." Theoretical models are essential for uncovering hidden flaws, but real-world implementations must always be prepared for the "impossible" to happen.