When you turn on your kitchen tap you expect water to come out.
Most people don’t think about where the water comes from, what
treatments it’s had, or the miles of pipes and valves it’s travelled
through to reach you.
No-one really cares - until it stops working. Which doesn’t happen that often if it’s fitted correctly in the first place and properly maintained.
It’s the same for other things. Lights, TVs, washing machines. Most people don’t give them a second thought until they go wrong. It should be that way for IT systems too.
The difference is the basic principles of large-scale water supply have been around for millennia. The Mesopotamians used clay pipes 6000 years ago. The ancient Egyptians used copper, the Han Dynasty used bamboo, and the Romans used lead. In fact, the word plumbing comes from plumbum, the Latin for lead.
Much like today, those ancient folk probably didn’t think much about the mechanics of their water supply either. Until it stopped working.
By comparison, IT systems are still in their infancy and are continually evolving through small daily changes and sometimes massive transformations. So, making IT systems as reliable as plumbing or electrical appliances is tricky – not to mention time-consuming and expensive.
GeoPlace needs reliable IT systems and applications to do its work. What we do is unique, so we often can’t find off-the-shelf software, which means we have to invent some of it ourselves.
We go to great lengths to make our systems as reliable as we can but it’s a moving target. We made over 2,000 changes (fixes and improvements) in the last year alone, any one of which has the potential to cause unexpected problems with our dozens of applications, 350+ servers, or millions of lines of code.
When it comes right down to it, all IT problems are caused by humans – don’t let anyone tell you otherwise!
Problems can stem from improper maintenance, inadequate monitoring, cutting corners (known as technical debt), a design flaw or manufacturing fault from years ago, pushing a system beyond what it was designed to do or beyond its planned lifetime, or simply from a typo made yesterday by an overworked and underappreciated engineer (except on the last Friday in July 🤨).
Somewhere along the way there was a human being who created a system, tool, or process that allowed the problem to occur. Someone didn’t put enough checks or guardrails in place to prevent it. Don’t be too hard on them though, it was probably unintentional and due to a lack of information at the time.
Humans are fallible. Mistakes and accidents happen. So, we need to find a way to protect against that.
Poka-yoke (pronounced PO-ka yo-KAY) is a Japanese term from lean manufacturing that roughly translates as mistake-proofing. There are examples of it everywhere. It’s why your sink has an overflow outlet, your washing machine won’t operate with the door open, and plugs only fit in the electric socket one way.
Sir Ranulph Fiennes is credited with saying “There is no such thing as bad weather, only inappropriate clothing”. In that spirit, let’s say that there is no such thing as computer error, only inappropriate mistake-proofing.
So how do poka-yoke principles apply to IT systems at GeoPlace?
- We limit access to systems by following the principle of least privilege, which means only people who need access can have it. It’s like locking a door and only giving keys to people who need them (there’s a simple sketch of this idea after this list).
- We have measures in place to spot potential problems early. Any non-routine changes go through an approvals system where work is checked by at least one other pair of eyes, sometimes more. Changes then pass through multiple quality assurance and testing stages before they are deployed to our production systems. This release management process is crucial in reducing problems and making our systems more reliable.
- Despite all the advances in technology, servers often still rely on mechanical components – like spinning fans and disks. We use servers that have redundant components so if a power supply or hard disk fails there’s another one to automatically take its place.
- Whenever possible we build high-availability systems, where multiple servers are used to run one application. This means we can lose an entire server (sometimes more than one) without causing disruption. In some ways it’s like airliners having multiple engines in case one of them fails.
- Almost all our systems are hosted on virtual servers, which separates the hardware from the software. This provides the capability to automatically move systems away from malfunctioning hardware, often without any human intervention at all.
- Our cloud providers have a long list of accreditations to ensure their procedures follow good practice and security is maintained, including ISO 27017, ISO 27018 and ISO 27701, as well as PCI DSS and SOC 1, 2 and 3.
- In this VUCA (volatile, uncertain, complex and ambiguous) world we must also be vigilant against outside threats. We employ all the usual security measures you would expect, and we are certified to the UK Government’s Cyber Essentials standard, which ensures we’re effectively managing things like passwords and anti-malware, and applying security updates – a thankless task, but it has to be done. A notable example was seen in December 2021 with the scramble to fix the Log4j flaw, which has been called “the single biggest, most critical vulnerability of the last decade”.
- To ensure our information security processes are following good practice we are certified to the ISO 27001 international standard, which includes being assessed by external auditors every six months. This complements our quality management certification to the ISO 9001 standard which amongst other things helps formalise our approach to risk management.
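
To make the least-privilege idea from the first bullet a little more concrete, here’s a minimal sketch. The roles, systems and helper function below are invented purely for illustration – this isn’t our actual access-control code. The key point is that nothing is allowed unless it has been explicitly granted:

```python
# A minimal sketch of least privilege: deny by default, grant only what's needed.
# The roles and systems here are hypothetical examples, not real GeoPlace ones.

ROLE_PERMISSIONS = {
    "address-data-editor": {"gazetteer-app"},
    "infrastructure-engineer": {"gazetteer-app", "server-admin-console"},
}

def can_access(role: str, system: str) -> bool:
    """Access is allowed only if it has been explicitly granted to the role."""
    return system in ROLE_PERMISSIONS.get(role, set())

print(can_access("address-data-editor", "gazetteer-app"))         # True - they need it
print(can_access("address-data-editor", "server-admin-console"))  # False - no key, no entry
```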
Despite all this, preventative measures only get us so far. We’re human, so errors slip through from time to time and accidents still happen.
When they do, we need to be ready to deal with them and get things back to normal as soon as possible.
- We monitor thousands of different aspects of our systems so we can be alerted about problems, hopefully before they cause noticeable disruption. We can’t monitor everything though, not least because we have to be careful to avoid the Observer Effect, where excessive monitoring can itself actually create problems.
- When we have a problem, it’s our highly skilled and experienced staff who save the day. Ultimately, it’s their expertise that resolves problems. We capture lessons learned in our Post-Incident Review sessions so that we can stop the same problem happening again.
- We invest a lot of time and effort into automating our processes wherever we can, in particular when we’re building our IT systems. By using techniques like infrastructure-as-code we write scripts that software tools can use to build and configure our servers (there’s a simplified sketch of the idea after this list). Although writing the scripts takes longer than manually building servers, it pays off in the long term by reducing errors and ensuring future deployments are done quickly and in exactly the same way every time. In fact, it’s now very common that we won’t even try to fix a malfunctioning server because it’s quicker and easier to simply destroy it and build a new one. Back in the day we used to give servers cute names, like Nemo and Dory, and nurse them back to health when they were sick. Now we give them designations, like svswpapil03p, and treat them as if they are disposable.
- For the really big problems we turn to our Business Continuity and Disaster Recovery procedures. These are reserved for catastrophic events like major fires. We have duplicate systems already in place, ready to be used, and continually updated copies of our most important data. If we had a disaster we could deploy these systems rapidly, but most of our servers don’t hold data, so we would rebuild those from scratch using automated scripts.
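
To give a flavour of the infrastructure-as-code approach mentioned above, here’s a simplified, hypothetical sketch – not one of our real build scripts, and the second server name is made up. The idea is to describe the servers we want as data, then let a script make reality match that description, so a broken server can simply be destroyed and rebuilt identically:

```python
# A simplified, hypothetical sketch of infrastructure-as-code: servers are
# described as data and built by a script, the same way every time.

DESIRED_SERVERS = [
    {"name": "svswpapil03p", "cpus": 4, "memory_gb": 16, "role": "api"},
    {"name": "svswpapil04p", "cpus": 4, "memory_gb": 16, "role": "api"},
]

def build_server(spec: dict) -> None:
    """Stand-in for the provisioning step a real tool would carry out."""
    print(f"Building {spec['name']}: {spec['cpus']} CPUs, "
          f"{spec['memory_gb']} GB RAM, role={spec['role']}")

def apply(desired: list[dict], existing: set[str]) -> None:
    """Build anything that's described but not yet running; nothing is built by hand."""
    for spec in desired:
        if spec["name"] not in existing:
            build_server(spec)

# If svswpapil04p fails, we destroy it and re-run the script to recreate it.
apply(DESIRED_SERVERS, existing={"svswpapil03p"})
```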
We do a lot already that aligns with poka-yoke principles, but we’re always looking to improve. We can look at why our changes fail and how to stop them happening again; we can find ways to avoid disruption by teaching systems how to deal with problems themselves (known as exception handling – there’s a small example below); and we can practise restoring systems so we can do it rapidly when problems do occur.
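
As a rough illustration of what exception handling can look like, here’s a generic sketch – not code from our systems, and the function names are invented. A program can catch a temporary failure, such as a dropped connection, and retry instead of falling over and causing disruption:

```python
import time

def fetch_with_retries(fetch, attempts: int = 3, wait_seconds: float = 1.0):
    """Call fetch(); if it fails with a ConnectionError, wait and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except ConnectionError as error:
            if attempt == attempts:
                raise  # out of retries - escalate so the problem gets noticed
            print(f"Attempt {attempt} failed ({error}); retrying...")
            time.sleep(wait_seconds)

# A deliberately flaky stand-in for any call that might fail temporarily,
# such as reading from another system over the network.
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network glitch")
    return "data loaded"

print(fetch_with_retries(flaky_fetch))  # recovers on the third attempt
```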
It's not easy or quick, but every improvement brings us a step closer to being as reliable as your kitchen tap.
Want to find out how we transitioned from hardware to the cloud? Read this case study from OVHcloud.