Can you prepare for Black Swan events?

  • Date: 30 Jul 2024

By definition - you cannot. There is, however, something you can do to speed up the recovery - and it is an ongoing process.

What is a Black Swan event? It's a highly improbable event that, in theory, should never happen. When it does happen, the consequences are usually dire. Black swans usually catch us when least expected and can be compared to a fire that cannot simply be "put out".

A perfect example of a black swan is the recent "CrowdStrike" issue that basically stopped the whole world for almost a day, causing enormous havoc and generating massive losses everywhere. My recommendations would not have helped in the "CrowdStrike" case, simply because IT is a bit like medicine - a dentist cannot just replace a surgeon.

In 2024 I experienced the third black swan of my career. All three taught me valuable lessons that I want to share with you to help increase your odds of survival.

The first swan - the unbreakable breaks

A cluster providing block storage went down, and the whole data center was affected for 3 days.

Our main database was down, and sales were a round zero.

What saved us? Backups! They were taken every 3 hours and kept with a different provider. In less than 4 hours we had restored the data and were running again, since our other products were not affected.
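A backup schedule like this only helps if someone notices when it silently stops. Below is a minimal sketch of a freshness check, assuming you can already obtain the timestamp of the newest backup at each storage location. The location names, the 30-minute grace period and the function name are hypothetical; only the 3-hour interval comes from the story above.

```python
from datetime import datetime, timedelta, timezone

BACKUP_INTERVAL = timedelta(hours=3)  # from the article: backups every 3 hours
GRACE = timedelta(minutes=30)         # assumption: slack for upload time


def stale_backups(latest_by_location, now, max_age=BACKUP_INTERVAL + GRACE):
    """Return the sorted names of locations whose newest backup is too old.

    latest_by_location maps a location name to the timestamp of its newest
    backup, or to None if no backup was found there at all.
    """
    return sorted(
        name
        for name, taken_at in latest_by_location.items()
        if taken_at is None or now - taken_at > max_age
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical locations: one inside the main provider, one off-site.
    latest = {
        "main-provider": now - timedelta(hours=1),
        "offsite-provider": now - timedelta(hours=5),
    }
    print(stale_backups(latest, now))
```

Running a check like this from a machine outside the main provider keeps the alarm itself out of the same basket.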

Not keeping all your eggs in one basket is a great strategy - and when it comes to backups - it's an absolute must.

Diversification worked in our favor even though we did not plan for a black swan.

Now a side note in case you are using managed databases: always keep a backup outside of the managed DB system as well - and be prepared to provision a database server from scratch on a VM. It's normally not much to prepare - and there are plenty of 1-click solutions - but having this card in your pocket might save your company one day (and it costs you almost nothing).

The second swan - things get harder

We had specifically selected a managed load balancer solution designed for extra flexibility and resiliency. It went down with an intermittent error that the provider could not detect immediately - which caused even further delays.

Sales were disrupted dramatically.

How did we recover?

We did not. One possible option would have been to create a VPS-based load balancer and change the DNS, but we never considered this solution because we relied on the product to prevent such issues.

This incident revealed the dangers of convenient dependencies.

Again - preparing a simple scenario where an nginx-based load balancer can be provisioned by hand with some pre-made configs takes one hour of work, plus another for documentation and testing. Such a load balancer does not need great features or enormous flexibility. It's the "recovery mode" tool - it should be simple and durable.
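To give an idea of how small such a "pre-made config" can be, here is a sketch that renders a bare-bones nginx configuration for round-robin load balancing. It is an illustration under assumptions, not the setup we actually ran: the backend addresses, the upstream name and the function name are hypothetical, and TLS, health checks and timeouts are left out on purpose.

```python
def render_nginx_lb(backends, listen_port=80, upstream_name="recovery_pool"):
    """Render a minimal nginx.conf that proxies HTTP traffic to the backends.

    nginx distributes requests across the servers of an upstream block
    round-robin by default, so no balancing directive is needed.
    """
    if not backends:
        raise ValueError("at least one backend is required")
    servers = "\n".join(f"        server {backend};" for backend in backends)
    return (
        "events {}\n"
        "http {\n"
        f"    upstream {upstream_name} {{\n"
        f"{servers}\n"
        "    }\n"
        "    server {\n"
        f"        listen {listen_port};\n"
        "        location / {\n"
        f"            proxy_pass http://{upstream_name};\n"
        "            proxy_set_header Host $host;\n"
        "            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;\n"
        "        }\n"
        "    }\n"
        "}\n"
    )


if __name__ == "__main__":
    # Hypothetical application servers behind the recovery load balancer.
    print(render_nginx_lb(["10.0.0.11:8080", "10.0.0.12:8080"]))
```

Keeping the generated file next to the recovery documentation, and validating it once with `nginx -t` on a throwaway VM, is most of the testing hour mentioned above.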

The third swan - "anycast" can also die... apparently

A combined registrar and DNS provider experienced a DDoS attack that took down their anycast infrastructure, and with it the DNS itself. Our domains effectively ceased to exist.

Plot twist: you buy "anycast DNS" because you want to remove DDoS attacks from your risk matrix.

The outage spanned two days, with a total of 7 hours in prime time - sales were of course affected. We could not recover from this one, either. Our hands were tied.

The registrar has exclusive control over our name servers. Switching to another DNS provider was impossible, because we cannot change these particular records (NS) ourselves.
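This kind of dependency is easy to spot in advance. The sketch below takes the name server hostnames of a domain (however you obtain them, for example by copying them from the registrar's panel) and reports whether they all belong to one provider. The function names and hostnames are hypothetical, and grouping by the last two labels of the hostname is a naive assumption that breaks for suffixes such as `co.uk`.

```python
def ns_providers(nameservers):
    """Group name server hostnames by a naive guess at the provider domain.

    The guess is simply the last two labels of each hostname, which is an
    assumption for illustration and not a proper public-suffix lookup.
    """
    providers = set()
    for host in nameservers:
        labels = host.rstrip(".").lower().split(".")
        providers.add(".".join(labels[-2:]))
    return providers


def is_single_provider(nameservers):
    """True if every name server appears to belong to the same provider."""
    return len(ns_providers(nameservers)) <= 1


if __name__ == "__main__":
    # Hypothetical NS set of a domain hosted entirely with one company.
    ours = ["ns1.example-dns.net.", "ns2.example-dns.net."]
    print(ns_providers(ours), is_single_provider(ours))
```

If the check says "single provider", the follow-up question is whether the registrar lets you change the NS records at all.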

Conclusions

We learned the hard way that we are locked in to our vendors.

Just 3 events, yet each had a significant impact. The market offers plenty of fantastic products that simplify software production and delivery. However, they all come with a common risk: vendor lock-in.

You buy those products with the intention of gaining flexibility and shifting the risk of failures and black swans to the vendor. It works most of the time, but when a black swan event hits, you sink alongside the product, because you are locked in to it.

Lessons learned? If you can - diversify your infrastructure. Check if you can rearrange the bricks, so that in case of the unimaginable you can recover your systems in under a day.

Preparing for every scenario is impossible - but sometimes identifying additional options and building awareness around them can in fact be worthwhile.
