Failures That May Occur in a Distributed System
A distributed system is a collection of processors that work toward a common goal. Examples include SOA-based systems, massively multiplayer online games, and peer-to-peer applications. A distributed system is a software system whose components are located on networked computers. These components communicate and coordinate by passing messages, and they interact with each other to accomplish the common goal. Each processor has its own local memory.
Undeliverable Message Failures
A message is undeliverable when either the recipient is down at the time the message arrives or the sender and the recipient are in different components of a network partition.
Such a failure stops processes at other sites from communicating with the affected site.
When a site experiences a system failure, processing stops abruptly and the contents of volatile storage are destroyed (Microsoft Research, 2012).
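The two causes of an undeliverable message described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `Site` class, its `up` and `partition` fields, and the `deliver` function are hypothetical names introduced for illustration, and no real networking is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """Hypothetical model of a site, introduced only for this illustration."""
    name: str
    up: bool = True        # False while the site is down
    partition: int = 0     # label of the network component the site belongs to
    inbox: list = field(default_factory=list)


def deliver(sender: Site, recipient: Site, message: str) -> bool:
    """Return True if the message is delivered, False if it is undeliverable."""
    if not recipient.up:
        # Cause 1: the recipient is down when the message arrives.
        return False
    if sender.partition != recipient.partition:
        # Cause 2: sender and recipient are in different partition components.
        return False
    recipient.inbox.append(message)
    return True


a = Site("A")
b = Site("B")
first = deliver(a, b, "hello")        # both up, same component
b.up = False
second = deliver(a, b, "are you up?")  # recipient is down
b.up = True
b.partition = 1
third = deliver(a, b, "ping")          # recipient is in another component
```

In this sketch only the first message reaches the inbox of site B. A real system usually cannot tell the two causes apart and would typically infer the failure from a missing reply.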
Network Partition Failure
In a network partition failure, the network fragments into two or more disjoint sub-networks. Messages can be sent within each sub-network, but messages sent between sub-networks are lost. A centralized system is the opposite of a distributed system. A distributed system is a collection of processors, each with its own memory, that communicate over various lines, whereas a centralized system concentrates certain functions in the system's hub, where they can be easily accessed from all points (Wikipedia, 2012). After a failure has occurred, certain actions must be taken, and the kind of failure determines which actions are needed. Site and communication failures manifest themselves as the inability of one site to exchange messages with another site. One of the first steps after a failure is a handshake procedure.
A handshake is an exchange in which two sites communicate to set parameters so that normal communication over the channel can begin. After the failure has been isolated, the work of fixing it can start. When a system has a failure, it must initiate a procedure that allows the system to reconfigure. This lets its primary function fail and reset to a simpler function, mitigating any unacceptable failure consequence. The system stays under control without being forced to sacrifice desired, but uninsurable, capabilities. After the system has reconfigured, it goes through the recovery phase and is integrated back into the system. A network partition occurs when all paths between two sites contain a failed or broken link.
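As a rough illustration of the handshake described above, the following sketch shows two sites agreeing on communication parameters before normal traffic resumes. The parameter names (`versions`, `max_message_bytes`) and the negotiation rule (highest common version, smaller message limit) are assumptions made for this example, not something the sources specify.

```python
from typing import Optional


def handshake(local: dict, remote: dict) -> Optional[dict]:
    """Agree on channel parameters, or return None if the sites are incompatible.

    Both arguments are hypothetical parameter proposals of the form
    {"versions": [...], "max_message_bytes": int}.
    """
    common_versions = set(local["versions"]) & set(remote["versions"])
    if not common_versions:
        # No shared protocol version, so normal communication cannot begin.
        return None
    return {
        # Use the newest protocol version that both sites support.
        "version": max(common_versions),
        # Respect the more restrictive of the two message size limits.
        "max_message_bytes": min(
            local["max_message_bytes"], remote["max_message_bytes"]
        ),
    }


site_a = {"versions": [1, 2, 3], "max_message_bytes": 4096}
site_b = {"versions": [2, 3, 4], "max_message_bytes": 1024}
agreed = handshake(site_a, site_b)
```

Here `agreed` holds version 3 and a 1024-byte limit. If the two sites shared no version, the function would return `None` and the caller would have to keep the link out of service.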
A network partition divides the operational sites into two or more components, where every two sites within a component can communicate with each other but cannot communicate with sites in other components. When the links are repaired, communication is reestablished between sites that previously could not exchange messages, thereby merging components. One way to reduce the probability of a network partition is to design a highly connected network, in which the failure of a few sites and links will not disrupt all the paths between any pair of sites. This requires more components and costs more money.
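The idea of components, and the benefit of a highly connected network, can be made concrete by treating sites as graph nodes and working links as edges. The sketch below is an illustration only: the function name `components` and the four-site example topology are assumptions, and the search is an ordinary breadth-first traversal.

```python
from collections import deque


def components(sites, links):
    """Group sites into components that can still exchange messages.

    `sites` is a list of site names and `links` is a list of (a, b) pairs
    for working bidirectional links.
    """
    adjacency = {site: set() for site in sites}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    result = []
    for start in sites:
        if start in seen:
            continue
        # Breadth-first search collects every site reachable from `start`.
        group = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            site = queue.popleft()
            group.add(site)
            for neighbor in adjacency[site]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        result.append(group)
    return result


sites = ["A", "B", "C", "D"]

# A chain A-B-C-D: losing the single link B-C partitions the network.
chain_after_failure = components(sites, [("A", "B"), ("C", "D")])

# A ring adds the link A-D: losing B-C no longer partitions the network.
ring_after_failure = components(sites, [("A", "B"), ("C", "D"), ("A", "D")])

# Repairing B-C in the chain merges the two components again.
chain_after_repair = components(sites, [("A", "B"), ("B", "C"), ("C", "D")])
```

In the chain, the failure leaves two components, {A, B} and {C, D}. The ring survives the same failure as a single component, at the cost of one extra link.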
Sometimes the network's topology is constrained by other factors, such as geography and the communication medium, so our ability to avoid network partitions is limited. A distributed system has many advantages: it can connect remote users, it offers higher speed, and for the most part it is reliable. However, the system needs to know how to handle errors and failures correctly so that it can fix them quickly and easily.
Microsoft. (2014). Distributed Recovery, Chapter 7. Retrieved from http://research.microsoft.com/en-us/people/philbe/chapter7.pdf
Wikipedia. (2014). Centralized Systems. Retrieved from http://en.wikipedia.org/wiki/Centralized_system