System Design 101: System Reliability

Is your system reliable?

Faisal Sheikh
4 min read · May 17, 2020

When things go wrong or times are bad, you count on someone or something reliable, that is, someone or something that helps you cope with the bad times. In the context of a software system, the word ‘reliable’ means much the same. The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient systems. In the event of a fault, our system relies on the fault-tolerance mechanisms that we have designed and implemented; hence, designing our systems to be fault-tolerant is of paramount importance.

Reliability (noun) — The ability to work correctly, even when things go wrong.

What things can go wrong in a system? Or what can be called a fault?

A fault is not necessarily a system failure. A fault is the failure of a single system component to behave as expected under certain circumstances, whereas a system failure means the system as a whole stops working correctly. The circumstances under which a fault appears might occur rarely; when they do occur, though, our system should be able to deal with them without crashing or behaving unexpectedly.
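To make the distinction concrete, here is a minimal Python sketch of a component fault that does not turn into a system failure. The names `fetch_with_fallback`, `flaky_cache`, and `database` are hypothetical and exist only for illustration; a real system would likely catch narrower exception types and log the fault.

```python
from typing import Callable


def fetch_with_fallback(primary: Callable[[], str], fallback: Callable[[], str]) -> str:
    """Return the primary component's result, or the fallback's if the primary faults."""
    try:
        return primary()
    except Exception:
        # The fault is contained here, so the caller still gets an answer.
        return fallback()


def flaky_cache() -> str:
    # Hypothetical component that is currently faulty.
    raise ConnectionError("cache node unreachable")


def database() -> str:
    # Hypothetical slower, but healthy, component.
    return "value-from-database"


# The cache faults, yet the overall request still succeeds.
result = fetch_with_fallback(flaky_cache, database)
print(result)
```

The cache component has a fault, but the system as a whole keeps serving requests, which is what being fault-tolerant means here.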

For instance, consider a web server that you developed and host on your own computer. It works as expected, but will it still work the same way when the computer’s hard disk crashes? Or when a critical software bug pops up? Or when the security of your system is compromised? Or when your computer is destroyed, say, in a natural calamity?

Of course, it’s impossible to account for every fault that might occur in a system, and that isn’t required either. For instance, it makes no sense to plan to host your server on a different planet (if that were even possible) in case Earth is wiped out. It is necessary, though, to account for the faults that are reasonably likely to occur, such as security attacks and hardware failures.

Most faults fall into one of two broad categories:

  1. Hardware faults: Failure of one or more hardware components to perform as expected, e.g., a hard-disk crash or the destruction of the machine.
  2. Software faults: Failure of one or more software components to perform as expected, e.g., an error triggered by a certain type of user input, or a security vulnerability.

How to design a fault-tolerant system?

The process of designing a fault-tolerant system starts with identifying the faults, then finding ways to cope with them, and lastly, verifying that our system actually tolerates them.

For instance, to make our system tolerant of hard-disk crashes, we first need to identify a disk crash as a fault. Once it is identified, we have to find a way to cope with it, say, by having another hard disk that holds a backup of the data. Lastly, to validate this approach, we can run a test in which we intentionally detach the main hard disk and then verify that the data is still available from the backup disk.
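The three steps for the hard-disk example can be sketched in Python as follows. This is only an illustration under loud assumptions: two temporary directories stand in for the two disks, deleting a directory stands in for detaching a disk, and `MirroredStore` and `run_disk_loss_test` are hypothetical names, not part of any library.

```python
import shutil
import tempfile
from pathlib import Path


class MirroredStore:
    """Writes every file to two locations and reads from whichever still works."""

    def __init__(self, primary: Path, backup: Path) -> None:
        self.primary = primary
        self.backup = backup

    def write(self, name: str, data: bytes) -> None:
        # Coping mechanism: keep a second copy of the data on the backup "disk".
        for root in (self.primary, self.backup):
            root.mkdir(parents=True, exist_ok=True)
            (root / name).write_bytes(data)

    def read(self, name: str) -> bytes:
        try:
            return (self.primary / name).read_bytes()
        except OSError:
            # The primary "disk" is gone or unreadable; serve the backup copy.
            return (self.backup / name).read_bytes()


def run_disk_loss_test() -> bytes:
    """Inject the fault on purpose and return what the store reads afterwards."""
    with tempfile.TemporaryDirectory() as tmp:
        store = MirroredStore(Path(tmp) / "disk_a", Path(tmp) / "disk_b")
        store.write("users.db", b"alice,bob")
        shutil.rmtree(store.primary)  # simulate detaching the main disk
        return store.read("users.db")


print(run_disk_loss_test())
```

The sketch only covers losing the primary disk on reads. A write that fails halfway, or a backup that silently goes stale, would be separate faults to identify and test in the same way.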

Some faults, though, cannot be tolerated by our system; we must prevent them. Consider, for instance, a security vulnerability (a software fault) that allows attackers to access user data. Once the data has been leaked, the event cannot be undone; hence, the only way to cope with this fault is to prevent it, for example by having robust security mechanisms in place.

Keep in mind, though, that it’s rare to arrive at a highly resilient, fault-tolerant system in the initial design. Even if most faults have been considered in the design, we can still encounter unforeseen faults; hence, a good system design is one that can adapt and evolve in the future without a lot of effort.

Although most of the scenarios I mentioned above are pretty obvious, real software systems are much more complex and their faults aren’t as apparent. On the brighter side, we now have a lot of tools and techniques available for enhancing the reliability of our systems. In upcoming articles, I’ll try to address some of them in detail; if you’re interested, make sure to hit the follow button.

If you liked this article, leave a clap, and let me know your thoughts, queries, or suggestions in the comments section below.
