How many servers and services do you have that nobody dares to touch?

If you cannot answer the question in the title, maybe it's time to break the glass ceiling that hangs over your business and limits what it can do.

In the DevOps world, a "snowflake" is a machine that nobody dares to touch, simply because it mimics a true snowflake: it is so fragile that even the slightest change can break the whole thing apart.

The most ironic thing is that the fragility of a snowflake can cause true avalanches - especially if we look at potential consequences and the so-called "fallout".

Snowflakes don't just appear - they are always made. For instance, you might have an old server running a critical application that only one retired employee understands. The fear of breaking something and causing downtime keeps everyone away, and that is what makes it a "snowflake".

Another very popular example: there is this one server - or a set of servers - that everybody was just "enhancing" all the time. At some point its status changes from "production machine" to "if that thing goes down - we drown, too". The scariest part is that in such cases people are usually afraid that the machine won't reboot, or that nobody can recreate it outside of its environment because of all the interconnections and "new layers" added over the past X years.

A similar thing happens to code. This is how you build the worst type of tech debt, both operational and development. You might not feel it, but its tentacles work according to the "death by 1000 cuts" principle. At some point there comes the last and final cut. That's when real panic begins.

But as the situation progresses, you subconsciously adapt to the new reality, which, like a glass ceiling, limits your business and influences all your future decisions.

Ignoring these issues is like ignoring a ticking time bomb. The longer you wait, the more complicated the fix. And yes, the more expensive, too.

What to do?

Just like with procrastination, you need to act on it. If you don't, then at some point you will become a hostage of your own tech.

Very often snowflakes run legacy software, which complicates the process even further. Many software vendors and services intentionally no longer publish archived or old versions of their software, because they want to force people to upgrade.

This is happening because way too many people don't patch and upgrade their infrastructure - and hence they create a collective problem for everyone.

Unavailable old versions can become a limiting factor when you have to recreate the environment in a virtual setting. This is why it's important to act and not wait endlessly.

How to do it?

I would choose the DevOps way myself, which involves:

assessment of the situation by an expert
replication of the environment so that it's possible to run isolated tests
monitoring - not only the infrastructure, but also business vitals; you will love it once the new solution comes online!
backups; you have no idea how many backups are not working as advertised
investigating and introducing concepts like SDA (software defined architecture), database rollback strategies, zero-downtime deployments
when the snowflake is transformed - it's time to look at other items that might be beneficial in your case

Assessment

This is a golden DevOps rule: have a plan before you act, and in order to have a plan - you need to look around, gather all information and see what is going on.

In this phase it's an advantage if the person in charge has the knowledge of your current technology - and not only the one running on the snowflake. I am writing this with the assumption that the snowflake is running on legacy software. Believe me, this is most often the case.

Replication

One of the reasons why you have a snowflake is that it's no longer easy to replicate the environment. Unfortunately, you have to do it, but in a controlled manner.

We have Docker, VirtualBox, OpenVZ, KVM, Xen - or Hyper-V. With these it's possible to build confined environments and replicate the hardware of the snowflake. All you have to do is grab an image of your current snowflake - or perform "creative copying" (database directories etc. usually aren't trivial to copy, but there are often ways to perform reliable hot copies of those). Then you can safely work on making sure that you can recreate the environment - although this time you will also document it.

By replicating the environment you are actually doing 50% of the job: you get the recipe for the server and the ability to run further tests and transform things.

One mistake that people often make when they replicate the snowflake is that they begin to change so many things that at some point they end up with two snowflakes. Nobody dares to put the new machine into production, because in terms of configuration and features it has drifted too far away from the original machine.
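One way to keep the replica honest is to compare it against the original after every change. Here is a minimal sketch in Python, assuming you can export a package-to-version mapping from each machine. The `find_drift` helper and both inventories are made up for illustration and are not part of any existing tool.

```python
def find_drift(original, replica):
    """Compare two {package: version} mappings and report the differences."""
    shared = set(original) & set(replica)
    return {
        # Present on the snowflake, absent on the replica.
        "missing": sorted(set(original) - set(replica)),
        # Present on the replica only.
        "extra": sorted(set(replica) - set(original)),
        # Present on both, but with different versions.
        "changed": sorted(n for n in shared if original[n] != replica[n]),
    }


# Made-up inventories for illustration only.
snowflake = {"openssl": "1.0.2", "php": "5.6", "mysql": "5.5"}
replica = {"openssl": "1.0.2", "php": "7.4", "redis": "6.0"}

report = find_drift(snowflake, replica)
print(report)
```

An empty report means the replica still matches the original. Anything else is a change that should be either reverted or written down as a deliberate decision.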

Monitoring

You need it, and I am not just talking about monitoring of CPU or memory, but also business vitals. By actually correlating those two classes of measurements you can often catch errors before they reach your customers - and stop processes that can potentially corrupt your data.
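To make the idea of correlating the two classes of measurements more concrete, here is a small Python sketch. It flags the minutes in which the machine looks healthy while the business metric collapses. The `orders_per_minute` metric, the thresholds and the sample data are all assumptions chosen for illustration; a real setup would read them from your monitoring system.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    minute: int
    cpu_percent: float        # infrastructure metric
    orders_per_minute: int    # business vital (hypothetical)


def silent_failures(samples, baseline_window=3, drop_ratio=0.5, cpu_limit=80.0):
    """Return the minutes where orders fell below drop_ratio of the recent
    average while CPU stayed under cpu_limit."""
    flagged = []
    for i in range(baseline_window, len(samples)):
        recent = samples[i - baseline_window:i]
        baseline = mean(s.orders_per_minute for s in recent)
        current = samples[i]
        looks_healthy = current.cpu_percent < cpu_limit
        business_dropped = current.orders_per_minute < drop_ratio * baseline
        if looks_healthy and business_dropped:
            flagged.append(current.minute)
    return flagged


# Made-up measurements for illustration only.
samples = [
    Sample(0, 35.0, 100),
    Sample(1, 40.0, 104),
    Sample(2, 38.0, 96),
    Sample(3, 37.0, 20),   # CPU looks fine, orders collapsed
    Sample(4, 36.0, 98),
]
print(silent_failures(samples))
```

CPU and memory alone would report nothing unusual for minute 3. Only the business metric reveals that something is wrong.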

Backups

Plenty of businesses have no control over their backups and are unable to tell whether they can restore their data. Why? Because they have no routine for test restores, so they never perform one.

When you work with the snowflake I recommend re-checking everything around your backups. Should a disaster happen - they will be your last resort.
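A restore test does not have to be complicated. The sketch below, in Python, compares checksums of the original files with those of the restored copy. Directory names and file contents are invented, and `shutil.copytree` only stands in for whatever restore procedure your backup tool actually provides.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def checksums(root):
    """Map each file path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_restore(source, restored):
    """Return the relative paths that are missing or differ after a restore."""
    expected = checksums(source)
    actual = checksums(restored)
    return sorted(name for name in expected if actual.get(name) != expected[name])


# Self-contained demonstration using temporary directories.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source"
    restored = Path(tmp) / "restored"
    source.mkdir()
    (source / "db.dump").write_text("important data")
    (source / "app.conf").write_text("port=8080")

    shutil.copytree(source, restored)               # stand-in for a real restore
    good_restore = verify_restore(source, restored)

    (restored / "db.dump").write_text("truncated")  # simulate a broken backup
    bad_restore = verify_restore(source, restored)

print(good_restore, bad_restore)
```

Run on a schedule, a check like this turns "we think the backups work" into something you can actually verify.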

SDA

When you replicate the snowflake, it is a good idea to make sure that the process can be repeated. You can script it - or you can go one level up and use Software Defined Architecture. You can use systems like SaltStack, Puppet or Ansible to create declarative recipes that will enable you to seamlessly recreate configuration and even whole machines.

This usually requires a bit more effort, but that is something that pays off quicker than you think - especially if you are growing or need some form of seasonal scaling.
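The core of the declarative approach is that you describe the state you want, and the tool works out the steps. Running it a second time changes nothing. The toy model below illustrates that idea in Python. It is not how SaltStack, Puppet or Ansible are implemented, and the `plan` and `apply_plan` functions and the package names are hypothetical.

```python
def plan(desired, current):
    """Work out the actions needed to move `current` to `desired`.

    Both arguments map a package name to a version string.
    """
    actions = []
    for name, version in sorted(desired.items()):
        if name not in current:
            actions.append(("install", name, version))
        elif current[name] != version:
            actions.append(("change", name, version))
    for name in sorted(set(current) - set(desired)):
        actions.append(("remove", name, current[name]))
    return actions


def apply_plan(actions, current):
    """Apply the actions to a copy of `current` and return the new state."""
    state = dict(current)
    for action, name, version in actions:
        if action == "remove":
            del state[name]
        else:
            state[name] = version
    return state


# Made-up recipe and machine state for illustration only.
desired = {"nginx": "1.24", "postgresql": "15"}
machine = {"nginx": "1.18", "telnetd": "0.17"}

first_run = plan(desired, machine)
machine = apply_plan(first_run, machine)
second_run = plan(desired, machine)   # nothing left to do
print(first_run, second_run)
```

The recipe (`desired`) doubles as documentation of the machine, which is exactly what the snowflake was missing.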

The project is ready

The "feature freeze" rule applies to snowflakes - something I mentioned implicitly in the previous paragraphs. Don't allow the new configuration that you are creating to drift away too much from the old one (especially if you are placing the same software on the machine).

Establish the solid foundation that you did not have, stabilize it, and once it has run without bugs for some period - then start adding features.

Summary

This list is far from complete - it's often a bit of work, but usually enough for your team to regain confidence and work at full speed again.

So yes, it is worth it, because the value that you unlock once you have the best fitting scenario in place will keep on paying off in the future.

Are there any twists? Yes, sometimes you don't eliminate the snowflake, but begin a "strangling operation".

The infamous "it depends" also means that there are many solutions to a given problem. Sometimes antipatterns can be the best ones.

In the video I go much deeper into details and expand on the concepts introduced in this blog article.
