If something can fail, it will.
What value do you get from the infamous "blue screen of death" or from a "Wanted function is not available right now. Try again later. ID: d34db33f"?
ZERO.
The probability that a simple system will fail is relatively low - but conversion-oriented systems are usually built from multiple pieces. If one link breaks - the whole chain goes down.
Normally the focus is on not having to deal with the error in the first place. We add resiliency, redundancy, we invest in load balancers and tools that are supposed to make our life easier. We also buy external products that are supposed to take some of those risks off our shoulders.
Focus on "avoiding the error" often leaves the communication area totally unsupervised. When a disaster happens - that is usually the moment when we figure out that "the messaging could have been better" or "we should have invested a few hours in a maintenance mode".
This is sadly a common problem - one that bites even those with big budgets who should normally have it under control.
I always tell my customers - focus on three areas - and I deliberately use action verbs for that purpose.
- communicate: inform your customers about the status of your systems
- write meaningful error messages: "Try again later" isn't!
- prepare a maintenance mode: this might save your business
Error messages in web systems vary from well-formulated to outright insulting - and occasional profanity can pop up every now and then.
The "error" checklist
Below is a more detailed checklist that takes into account the three areas I mentioned above.
Communicate
If you are an integration partner, offer an API that others consume, or your product is simply important to others (which is very relative!) - have a status page hosted at another provider - and have it integrated with your main website or websites.
Have it automated - when something goes down, the color changes. Also have a manual mode, so that you can mark a maintenance window yourself or flag an error that the automated checks didn't catch.
Most such products also give you the ability to write extra text so that you can "annotate" your outages and maintenance. Use it - and in case of a longer outage that severely affects your customers, always make sure to state when the next update will be published.
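The automated side of this can start very small: a probe runs on a schedule, classifies each check, and pushes the state to your status-page provider. Below is a minimal Python sketch of the classification step only - the state names and the 2-second "degraded" threshold are my assumptions, not any provider's API; check your provider's documentation for the actual push call.

```python
# Sketch: classify a single health probe into the component states that most
# status-page products understand. http_status is None when the probe timed
# out or could not connect at all.

def component_status(http_status, latency_s, degraded_after_s=2.0):
    """Map one probe result to a status-page component state."""
    if http_status is None or http_status >= 500:
        # No answer at all, or a server error: the component is down.
        return "major_outage"
    if latency_s > degraded_after_s:
        # Answers, but too slowly - flag it before it becomes an outage.
        return "degraded_performance"
    return "operational"

print(component_status(200, 0.1))   # a healthy, fast endpoint
print(component_status(None, 0.0))  # probe timed out
```

A cron job or a tiny daemon calling this every minute - and a manual override on top - already covers both modes described above.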
I have witnessed both sides of the problem: when things go down and a huge business cannot sell a thing, and when I was the one expected to find a solution while plenty of customers (and the business on top of them) were waiting on the other side. Neither side is comfortable - but when the only thing you can do is passively wait and hope... a timeline and regular updates help your customers a lot in deciding whether to employ other plans.
There is also one more thing when it comes to that - if you proactively take care of proper communication - you have a chance to own the conversation.
Social media are everywhere and people are quick. Have your own channel - and own it. But to do that, you have to provide meaningful information, and at a decent rate.
Write meaningful error messages
There are two classes of error messages:
- the ones that are "hardcoded", i.e. reactions to processes that went wrong
- the ones that you control, i.e. during an outage
Both can be messed up, although it's usually the hardcoded ones that receive minimum love from developers and stakeholders.
Writing those is hard - seriously. You need a "catch-all" phrase for everything!
Systems usually throw different exception types - use that to your advantage. Look at the probability and frequency of a given exception - and write custom catch sections for the most common ones.
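As a sketch of that advice - the exception names below are hypothetical stand-ins for your own hierarchy. The point is that the most frequent failures get their own branch with a message the user can act on, while the catch-all stays honest and generic.

```python
# Hypothetical exception types standing in for your own hierarchy.
class PaymentGatewayTimeout(Exception):
    pass

class InventoryUnavailable(Exception):
    pass

def user_message(exc):
    """Translate an exception into a message the customer can act on."""
    if isinstance(exc, PaymentGatewayTimeout):
        return ("Our payment provider did not respond in time. "
                "Your card was NOT charged - please retry the payment.")
    if isinstance(exc, InventoryUnavailable):
        return ("This item just went out of stock. "
                "We have saved your cart so you can order once it is back.")
    # The catch-all still tells the user what happens next on our side.
    return ("Something went wrong on our side. "
            "The error has been logged and our team has been notified.")

print(user_message(PaymentGatewayTimeout()))
```

Notice that even the timeout branch answers the question the user actually has ("was I charged?") instead of describing the exception.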
Other than that it's all about:
- being clear: name things as they are
- being as specific as possible: look at the frequency/occurrence rate - you can't cover them all - so cover at least the most probable ones
- include context: can you provide an external API answer as context? Or the values of some variables? If so - do it. Throw it into a nicely formatted <pre> block. People take screenshots when they contact support - that is 80% less work for you, and quicker satisfaction for your customers
- suggest a fix: if you can suggest any fix - do it. Avoid generic flops like "Try again later" or "Contact support" - unless these are exactly the things the user should do
- avoid jargon: use language that your customers can understand. Need to add some fruity details? See the point about context. You can always provide a correlation ID at the bottom of the error message - or a nice <pre> block with context
- consistent format: try to keep your message format alike - have a template, or a few. This way your error messages won't drift apart over time
- be polite: you might think this is an obvious one... but believe me, that's not the case. If there's one thing that ChatGPT is good for, it is rewriting possibly drastic messages into polite ones. So use it if you are out of ideas - but remember to seed it with details about your app and your customer group so that the tone can be properly adjusted
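A simple way to get the consistent format, the <pre> context block, and the correlation ID in one place is a shared template that every error page goes through. A Python sketch - the field names and layout are my assumptions, not a standard:

```python
import json
import uuid

# One template for all error pages: title, explanation, suggested next step,
# and a <pre> block carrying the correlation ID plus machine context.
ERROR_TEMPLATE = """\
<h1>{title}</h1>
<p>{explanation}</p>
<p>{next_step}</p>
<pre>correlation id: {correlation_id}
{context}</pre>"""

def render_error(title, explanation, next_step, context):
    """Render an error page; context is a dict of debugging details."""
    return ERROR_TEMPLATE.format(
        title=title,
        explanation=explanation,
        next_step=next_step,
        correlation_id=uuid.uuid4(),
        context=json.dumps(context, indent=2),
    )

html = render_error(
    "Payment failed",
    "Our payment provider rejected the transaction.",
    "Please check your card details and try again.",
    {"provider_response": "card_declined", "order_id": "A-1042"},
)
print(html)
```

When the user screenshots this page, support gets the correlation ID and the provider's answer for free.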
For situations when you control the message, i.e. during an ongoing outage or maintenance, I recommend following this checklist:
- write what happened - share as much as you legally can; if you don't know what happened - write that you are investigating the problem (which is true!)
- provide a timeline; if you cannot, then provide a time for the next update
- keep updating people
How often should you update? It all depends on how long you think the whole outage will take - but if you provide a timeline for the next update then people can decide whether they want to employ backup plans on their side.
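The checklist above is mechanical enough to template. A sketch of an update composer - the wording and the 30-minute default cadence are my assumptions; `now` is a parameter only so the example is reproducible:

```python
from datetime import datetime, timedelta, timezone

def outage_update(what_happened, eta=None, next_update_minutes=30, now=None):
    """Compose a status update: what happened, a timeline if known,
    otherwise at least a time for the next update."""
    now = now or datetime.now(timezone.utc)
    lines = [f"{now:%H:%M} UTC - {what_happened}"]
    if eta:
        lines.append(f"Expected resolution: {eta}.")
    else:
        # If you don't know what happened yet, say you are investigating.
        lines.append("We are still investigating the root cause.")
    nxt = now + timedelta(minutes=next_update_minutes)
    lines.append(f"Next update at {nxt:%H:%M} UTC at the latest.")
    return "\n".join(lines)

print(outage_update("Checkout is failing for some card payments."))
```

The "next update at ... at the latest" line is the part your customers plan around - never skip it.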
Prepare a maintenance mode
It's all about being able to gracefully "shut down" your systems and inform customers about it. This works especially well when multiple frontends use one big backend - or when you actually have to perform an upgrade with downtime.
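A sketch of what "gracefully" can mean in practice: a tiny WSGI middleware that serves a 503 with a Retry-After header while a flag file exists, and passes traffic through otherwise. The flag path and page text are assumptions - in a real setup the flag might be a config key, a feature toggle, or a load-balancer rule instead.

```python
import os

MAINTENANCE_FLAG = "maintenance.on"  # hypothetical flag file; touch it to enable

def maintenance_middleware(app, flag_path=MAINTENANCE_FLAG):
    """Wrap a WSGI app: serve 503 with Retry-After while the flag file exists."""
    page = (b"<h1>Scheduled maintenance</h1>"
            b"<p>We expect to be back within the hour. "
            b"Follow status.example.com for updates.</p>")

    def wrapper(environ, start_response):
        if os.path.exists(flag_path):
            # 503 + Retry-After tells both humans and crawlers this is temporary.
            start_response("503 Service Unavailable",
                           [("Content-Type", "text/html"),
                            ("Retry-After", "3600")])
            return [page]
        return app(environ, start_response)

    return wrapper
```

Enabling maintenance is then a one-line operation on the server (`touch maintenance.on`) - no deploy, no restart.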
Why do I focus so much on this? Because you are about to retain a customer or save your data. When people know what is going on, what happens next, and when it will be fixed - the majority of them WILL understand the situation and COME BACK later.
If all they meet is a white screen of loading death... then you don't own the communication channel and allow people to "assume" what happened. You don't want that.
With a maintenance mode you own the communication channel from the very beginning - and can deal with users - all at once. If you have a customer support line - they will thank you, because otherwise it would be red hot from calls and/or emails!
A fine example that was implemented by some of my customers:
- a number of frontends
- a large backend behind a load balancer with multiple nodes
We were able to introduce maintenance mode on every frontend separately. In case the big backend died - the mode was invoked automatically.
When the maintenance mode was enabled on the big backend - then all frontends were automatically enabling it, too.
Now - you can have standard predefined messages available, but nothing prevents you from adding extra text to the predefined HTTP response that triggers the maintenance mode on the frontends.
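The frontend side of that setup boils down to one decision: is the backend announcing maintenance (503, possibly with its own text), is it simply dead (any other 5xx), or is it healthy? A sketch of that decision - the status-code convention and the default message are assumptions drawn from the example above, not a standard protocol:

```python
# Fallback text shown when the backend did not ship its own message.
DEFAULT_MESSAGE = "We are performing maintenance. Please check back shortly."

def frontend_state(backend_status, backend_body=""):
    """Return (maintenance_enabled, message_to_show) for this frontend,
    based on the backend's HTTP status and response body."""
    if backend_status == 503:
        # Backend announced maintenance - prefer the text it sent along.
        return True, backend_body.strip() or DEFAULT_MESSAGE
    if backend_status >= 500:
        # Backend died without announcing anything - enable maintenance
        # automatically rather than showing users a broken page.
        return True, DEFAULT_MESSAGE
    return False, ""
```

With this in place, flipping maintenance on the big backend propagates to every frontend on its next health check - exactly the cascade described above.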
This has saved the business on a few occasions:
- sales were recovered
- data corruption was prevented (after a critical error was discovered in production)
Does it work?
Now - how do I know that it works? Because we measured it. Customers COME BACK when they hear the truth, when you provide a time frame, when you actually treat them like partners...
Yes, you become vulnerable - but there are ways to convey the message without becoming a "piñata" for your competition.
It's in your best interest to communicate the errors properly. It can save your customer support department, your brand, your product, your sales.