GOTO Aarhus 2021

Thursday Jun 10
11:30 –
Sailing Center 2

DevOps at Scale, the Scary Parts


Running systems at scale comes with scary outages, most of them caused by deploying bad code or config. However, with scale also come all the problems that are not supposed to happen.

In this talk we’ll present three real-life outages: on-disk corruption, network packet corruption, and data deletion. We’ll relive the stages on-callers go through during outages like these: annoyance at getting paged, panic at realizing something is completely wrong, frustration at being unable to find the root cause, relief that it can be fixed, exhaustion from scrambling to remediate, and finally pride in getting through it. Along the way we will look at some of the tools and techniques used to remediate these incidents and prevent future ones.