Great Talk on Debugging Production Systems

AWS is currently experiencing a degradation, and Reddit is down.
Bryan Cantrill (VP of Engineering at Joyent) gave an excellent and entertaining presentation about Debugging Production Systems. It is well worth an hour of your time if you work on large distributed cloud stacks in production. Some of the key things I took from the video:

  • Every failure is sacred. Seek to build a system that properly stores and formats stack traces for future analysis.
  • Cloud architectures are full of abstraction layers where “ungentlemanly” failures occur.  Instead of crashing, things stay up, but performance is degraded.
  • These non-fatal pathologies are particularly difficult to debug. They often lead to cascading levels of system instability and throw the system into untested states. This spiral is often behind airline disasters: a recoverable, non-critical failure occurs, various failover and automatic systems either over-react or under-react, the airplane becomes difficult to control, the systems no longer make sense, and serious human error follows.
  • When looking at a core dump, you are acting as a scientist and you are testing hypotheses and proving theories with data. When looking at non-fatal pathologies, you are acting as a physician and treating symptoms. Debugging this way makes it very difficult to determine root cause.
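
To make the "every failure is sacred" point concrete, here is a minimal sketch of what storing and formatting stack traces for later analysis could look like. It is my own illustration, not something shown in the talk: the function names (`failure_record`, `store_failure`), the record fields and the JSON Lines file layout are all assumptions.

```python
import json
import traceback
from datetime import datetime, timezone
from typing import Optional


def failure_record(exc: BaseException, context: Optional[dict] = None) -> dict:
    """Turn an exception into a JSON-serializable record (hypothetical schema)."""
    frames = traceback.extract_tb(exc.__traceback__)
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "type": type(exc).__name__,
        "message": str(exc),
        # One entry per stack frame, outermost call first.
        "frames": [
            {"file": f.filename, "line": f.lineno, "function": f.name}
            for f in frames
        ],
        # Caller-supplied details (service name, request id, ...).
        "context": context or {},
    }


def store_failure(exc: BaseException, path: str, context: Optional[dict] = None) -> dict:
    """Append the record as one JSON line so earlier failures are never overwritten."""
    record = failure_record(exc, context)
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

Calling `store_failure(exc, "failures.jsonl", {"service": "demo"})` from an `except` block appends one structured record per failure. Keeping the frames as data rather than as a pre-rendered string is a design choice that makes it easier to group and search failures later.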

The talk ends with a short demo of the capabilities of DTrace. There is some amazing core-level magic going on there: it can take static snapshots, make transient failures manifest as hard failures, and generally reduce the effort required to gather failure data from dynamic systems.