Systemic Failure: Aphorisms

  1. Distributed systems are everywhere. Every system is distributed; ignoring this leads to failure.
  2. Designing and building distributed systems is challenging.
  3. CAP is the law. Don't defy it. When a partition happens, you keep either consistency (CP) or availability (AP); choose one and design for it.
  4. Centralization is a risk. Avoid reliance on single points of failure.
  5. Clients are participants. Integrate them into your distributed systems design.
  6. Failure is inevitable. Software, networks, and hardware will fail. Prepare for it.
  7. Handle errors: network interruptions, hardware malfunctions, and human mistakes.
  8. Networks are dynamic. Topologies shift, partitions multiply, and nodes vanish. Adapt.
  9. Account for latency. Distinguish it from partitions and outages.
  10. Time is relative. Clocks drift; events clash. Manage concurrency carefully.
  11. Synchronization is delicate. Beware inconsistencies; deleted data can return.
  12. Actions have consequences. Mitigate irreversible side effects.
  13. Algorithms are fragile. Safeguard critical execution paths from failures.
  14. Read-only is insufficient. Marking data read-only does not guarantee that nothing can write to it.
  15. Quorums are flexible. Adjust cluster size and voting thresholds as needed.
  16. Storage is vulnerable. Mitigate corruption; data can disappear or reappear.
  17. Storage is limited. Plan for capacity limits, since unlimited storage does not exist.
  18. Bandwidth is precious. Minimize data transfer during resynchronization.
  19. Acknowledgement isn't confirmation. An ack shows a message was received, not that it was processed. Verify both.
  20. Persistence requires storage. Write messages to disk to prevent loss.
  21. Timeouts are finite. Don't wait indefinitely for lost messages.
  22. Brief outages matter. Even short disruptions can have significant impacts.
  23. Theory isn't practice. Don't rely solely on unproven research.
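
Aphorism 10 (clocks drift, events clash) is often handled by ordering events with logical time instead of wall-clock time. Below is a minimal sketch of a Lamport logical clock, a standard technique that the list itself does not name. The class and method names are illustrative assumptions.

```python
class LamportClock:
    """Logical clock: orders events without trusting wall-clock time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event or message send: advance the counter by one.
        self.time += 1
        return self.time

    def receive(self, remote_time):
        # On receipt, jump past both our own time and the sender's timestamp.
        self.time = max(self.time, remote_time) + 1
        return self.time


# Two hypothetical nodes exchanging one message.
a, b = LamportClock(), LamportClock()
sent = a.tick()            # node a stamps the outgoing message
received = b.receive(sent) # node b stamps the receipt
```

The receipt always carries a larger timestamp than the send, so causally related events stay ordered even when the nodes' physical clocks disagree. Concurrent events can still tie or order arbitrarily, so this does not remove the need for conflict handling.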
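
Aphorism 15 (quorums are flexible) can be made concrete with the usual overlap rule for replicated reads and writes: with N replicas, a read quorum R and a write quorum W intersect whenever R + W > N. The sketch below assumes that Dynamo-style model, which the list does not specify. The function names are hypothetical.

```python
def majority(n):
    """Smallest number of votes that is more than half of n nodes."""
    return n // 2 + 1


def quorums_overlap(n, r, w):
    """True if every read quorum shares at least one node with every write quorum."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("quorum sizes must be between 1 and n")
    return r + w > n


# Growing the cluster changes which thresholds are safe.
safe_small = quorums_overlap(n=3, r=2, w=2)   # 2 + 2 > 3
unsafe_big = quorums_overlap(n=5, r=2, w=2)   # 2 + 2 <= 5
```

The same R and W that overlap on three nodes stop overlapping on five, which is why thresholds should be revisited whenever the cluster size changes.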
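
Aphorisms 19 and 21 together suggest waiting for a real reply, but only for a bounded time. The sketch below resends a request a fixed number of times and waits a finite interval for each reply. The `send` callback and the `replies` queue are hypothetical stand-ins for a real transport, and the default values are arbitrary.

```python
import queue


def request_with_timeout(send, replies, timeout_s=0.05, attempts=3):
    """Send a request and wait a bounded time for its reply, retrying on silence."""
    for _ in range(attempts):
        send()  # (re)transmit the request
        try:
            # Block for at most timeout_s seconds, never indefinitely.
            return replies.get(timeout=timeout_s)
        except queue.Empty:
            continue  # no reply in time; the message may have been lost
    raise TimeoutError(f"no reply after {attempts} attempts")


# Hypothetical transport that drops the first request and answers the second.
replies = queue.Queue()
sent_count = 0

def flaky_send():
    global sent_count
    sent_count += 1
    if sent_count >= 2:
        replies.put("processed")

result = request_with_timeout(flaky_send, replies, timeout_s=0.01)
```

Because a retry may reach a peer that already handled the first request, this pattern is only safe when the operation is idempotent or deduplicated, which ties back to aphorism 12 on irreversible side effects.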