It is important to understand that no distributed system is ever completely safe from failure. No matter how fault tolerant a system is designed to be, there is no such thing as a failure-proof system. Problems will always arise, and taking the necessary precautions and having strong problem-solving skills are essential to recovering a distributed system from any type of failure.
We will discuss four types of failures that may occur within a distributed system and the proper way of addressing each. Without the proper precautions, knowledge, and understanding of distributed systems and their failures, business continuity is put at risk and can be disrupted.
One of the most common failures in a distributed system is hardware failure, which is also one of the main reasons why performing backups is necessary. No failure will make you appreciate the importance of backups more than an unrecoverable hard disk failure. Depending on which piece of hardware was the root of the failure, the fix can be as simple as a plug-and-play replacement or as extensive as recovering from a catastrophic meltdown.
This type of failure also applies to a centralized system and can have the same consequences if the system is not designed to be fault tolerant. To isolate this failure, you must understand how a synchronous system detects it.
Such a system sends a message to a device and waits a given amount of time for a response. If no response is received within that time, it sends the message again. After a certain number of resends, the device is labeled as failed. The way to recover from and prepare for this failure is physical redundancy: either active replication or a primary-backup arrangement. Physical redundancy also involves keeping spare physical components on hand to replace any hardware that may have failed.
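The timeout-and-resend detection described above can be sketched as follows. This is a minimal illustration, not any particular system's protocol; the `send` callable and the retry limit are hypothetical placeholders standing in for a real network send with a timeout.

```python
MAX_RETRIES = 3   # resends allowed before the device is labeled failed (illustrative)

def detect_failure(send, retries=MAX_RETRIES):
    """Timeout-and-resend detector.

    `send` represents one message attempt: it returns the device's reply,
    or None if the wait time elapsed with no response. Returns True if the
    device is considered alive, False if it is labeled as failed.
    """
    for _ in range(retries):
        if send() is not None:   # a response arrived within the timeout
            return True
    return False                 # out of resends: label the device failed

# A hypothetical flaky device that drops the first two messages, then replies.
replies = iter([None, None, "ack"])
print(detect_failure(lambda: next(replies)))  # → True (responded on the third attempt)

print(detect_failure(lambda: None))           # → False (never responds; labeled failed)
```

In a real system the timeout would be a property of the transport (for example a socket timeout) rather than a return value, but the labeling logic is the same.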
Another common failure that occurs within distributed systems is network congestion. Network congestion occurs when too many messages are transmitted at once, which overloads the network and degrades its performance. Machines that communicate with one another in a distributed system can likewise become overloaded when the amount of work to be processed becomes too large for the network to handle.
This type of failure is very common in distributed systems because distributed communication sometimes contains a feedback loop. That feedback loop can cause the congestion to feed upon itself and add still more congestion to the network.
This type of failure is also common in a centralized system when too many requests flood into it. Monitoring the network is a good approach but not a complete one: too much monitoring or logging can itself contribute to congestion by causing delays and potentially interfering with normal operations.
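One common way to break the feedback loop just described is exponential backoff with jitter, so that failed senders wait progressively longer, and at randomized times, before resending instead of all retrying at once. This mitigation is a standard technique, not one prescribed by the text; the parameters below are illustrative.

```python
import random

def backoff_delays(base=0.1, cap=5.0, attempts=6):
    """Return a list of jittered retry delays in seconds.

    The ceiling doubles after each failed attempt (exponential backoff),
    capped at `cap`. Each delay is drawn uniformly below its ceiling
    ("full jitter") so many clients do not resend in lockstep and
    re-congest the network they just overloaded.
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, ceiling))
    return delays

print(backoff_delays())  # six delays, each bounded by its doubling ceiling
```

A sender would sleep for each delay in turn before retrying, giving the congested network time to drain rather than feeding the loop.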
There may also be rare occasions when a whole site fails. Although rare, never rule out the possibility of an entire site being brought to its knees. In a natural disaster, for example, a whole site could lose power, and even uninterruptible power supplies would not be sufficient to keep the system running. A site failure is considered a “double fault” because more than one machine has failed, and most distributed systems are designed to handle only single component failures.
Like many failures associated with distributed systems, this failure also applies to a centralized system. The best way to prepare for a site failure is to have a hot site: a backup of the system located at another site. Although maintaining a hot site can be expensive, it is worth the investment because it minimizes downtime and keeps business continuity intact.
In fault-tolerant distributed systems, arbitrary faults can occur during the execution of an algorithm. This type of failure is called a Byzantine failure. When it occurs, the system may respond in unpredictable ways: it may fail to take the next step in the algorithm, it may execute a step incorrectly, or it may execute an arbitrary step other than the one the algorithm indicated. In essence, Byzantine faults occur when a faulty processor continues to run while giving incorrect results.
The faulty processor can also collude with other faulty processors to give the false impression that they are all working correctly. Byzantine failures are not as common in centralized systems as they are in distributed systems. To isolate a Byzantine failure, it must be detected by verifying the results of a computation, duplicating the work on two or more processors. If all processors produce the same result, it is unlikely that a Byzantine failure occurred.
All of these failures can occur unexpectedly at any time, and to avoid potentially disastrous outcomes an organization must implement a solid backup plan in case one does occur. Most of these failures arise from a lack of understanding of the complexities within a distributed system. No system is completely failure-proof, but as long as the proper precautions are taken in designing a fault-tolerant distributed system, the chances of failure are minimized and the system becomes more flexible and less prone to errors.