This is a crosspost from Do Not Erase, the blog of my employer, Managed by Q.
If you develop software, there's a good chance you're familiar with test-driven development (TDD), whether you practice it or not: write your tests first, confirm that they fail, write your initial code, and iterate until your tests pass.
While TDD is common, test coverage only guarantees so much: tests can only check the cases you thought to write down. When a system is actually running in production, especially over a distributed network, there are countless ways it can fail unexpectedly.
Just as we write tests to catch problems ahead of time, we can use monitoring and other runtime tools to pre-empt and understand those failures, as well as other types of behavior we can't predict.
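To make the idea concrete, here is a minimal sketch of what runtime monitoring can look like at its simplest: a decorator that counts calls, counts errors, and tracks time spent in a function. Everything here is an assumption for illustration. `METRICS`, `monitored`, and `fetch_order` are hypothetical names, and a real service would send these numbers to a metrics or logging backend rather than keep them in memory.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory metrics store, for illustration only. A real
# service would report these values to a metrics or logging backend.
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def monitored(func):
    """Record call count, error count, and cumulative latency for func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        stats = METRICS[func.__name__]
        stats["calls"] += 1
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            stats["errors"] += 1
            raise  # re-raise so callers still see the failure
        finally:
            # Runs on success and on failure, so latency is always recorded.
            stats["total_seconds"] += time.perf_counter() - start
    return wrapper


@monitored
def fetch_order(order_id):
    # Hypothetical stand-in for a network call that can fail at runtime.
    if order_id < 0:
        raise ValueError("invalid order id")
    return {"id": order_id}
```

After a few calls, `METRICS["fetch_order"]` holds the counts and cumulative latency for that function, including calls that raised. Unlike a test, this records what actually happened at runtime instead of what we expected to happen.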