The dependability of computer systems is one of the key issues of the technological era. Our daily lives are now governed by complex computer systems (Haugk, Lax, Royer and Williams, 1985). Operating systems that manage key applications must be able to cope with the increasing rate of software problems, malicious attacks and hardware flaws (Parhami, 2005; Lohr, 2001). One of the most significant requirements for an operating system is therefore resilience to errors.
Most operating systems stop running once they encounter a hardware or software problem. This results in the loss of the applications and data currently running in the system. Common examples are blue screen errors in Windows and kernel panics in UNIX (David and Campbell, n.d.). This is unfortunate, since users are mainly concerned with their applications and data. They fear losing data through a fault that is not of their making. Even after a software or hardware fault, users want their data to remain intact and recoverable.
Self-healing operating systems were developed to address this problem. A self-healing operating system automatically detects, diagnoses and repairs localized software and hardware problems. Once an error has been detected, the operating system can use various techniques to recover (Andrzejak, Geihs, Shehory and Wilkes, 2009).

Code reloading

Temporary memory errors, or memory corruption caused by erroneous code, can lead to errors such as illegal instructions in the software code.
Although ECC memory can detect and fix some temporary memory faults, it cannot handle corruption that results from invalid instructions. The simplest and most effective technique for handling such a problem is code reloading. This recovery technique reloads the faulty memory word from permanent memory. If the fault is permanent, which can be established through testing, recovery may still be possible by remapping the faulty hardware page using virtual memory support.
If the processor raises an undefined instruction exception, the handler reloads the instruction from a copy of the system code in memory-mapped permanent memory and then executes the reloaded instruction. This recovery procedure is the simplest to implement. However, it cannot detect memory corruption that turns one opcode into another legal opcode (David and Campbell, n.d.). Regular checking of the operating system code is therefore important for better detection of memory faults. Hashes and checksums are simple methods of verifying running system code.
If a fault is detected, a reload is triggered immediately. This is a preventive strategy that can detect faults before they cause errors. It can also detect faults that turn one opcode into another legitimate opcode (Demsky and Rinard, 2002). Choices periodically computes a CRC-32 checksum of critical kernel code. This ensures that the memory in which the instructions are stored has not been corrupted. If the checksum changes as a result of corrupted memory, the corrupted block of memory is reloaded from permanent memory.
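The periodic check and reload can be sketched as follows. This is a minimal illustration and not the Choices implementation: the block size, the function names and the use of a Python bytearray to stand in for kernel memory are all assumptions made for the example. The CRC-32 routine comes from Python's standard zlib module.

```python
import zlib

BLOCK_SIZE = 16  # assumed block granularity, chosen only for the example


def block_checksums(image, block_size=BLOCK_SIZE):
    """Compute one CRC-32 per block of the pristine code image."""
    return [zlib.crc32(image[i:i + block_size])
            for i in range(0, len(image), block_size)]


def find_corrupted_blocks(memory, expected, block_size=BLOCK_SIZE):
    """Return the indices of blocks whose CRC-32 no longer matches."""
    corrupted = []
    for index, checksum in enumerate(expected):
        start = index * block_size
        if zlib.crc32(bytes(memory[start:start + block_size])) != checksum:
            corrupted.append(index)
    return corrupted


def reload_blocks(memory, pristine, indices, block_size=BLOCK_SIZE):
    """Overwrite each corrupted block with its copy from permanent storage."""
    for index in indices:
        start = index * block_size
        memory[start:start + block_size] = pristine[start:start + block_size]


# Hypothetical 64-byte "kernel code" image held in permanent storage.
pristine = bytes(range(64))
expected = block_checksums(pristine)

memory = bytearray(pristine)   # working copy in volatile memory
memory[20] ^= 0x01             # simulate a single bit flip inside block 1

damaged = find_corrupted_blocks(memory, expected)
reload_blocks(memory, pristine, damaged)
```

Because the comparison is made against checksums of the pristine image, the check also catches a flipped bit that happens to produce another valid opcode, which the exception-driven reload cannot see.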
The instruction cache is flushed to ensure that all affected instructions are discarded. The checksum can also be computed as soon as an operating system error is detected, to make sure that the system and recovery code are not affected (Liedtke, 1995). Modern ARM-based processor designs include Run Time Integrity Checker (RTIC) hardware. The operating system can configure this hardware to compute and verify SHA-1 hashes of specific code areas. Once an error is identified, the processor is notified via an interrupt.
The same kind of checksum verification can be used to check the integrity of constant data. Checking the integrity of changing data is hard. One weakness of this recovery procedure is that it cannot be applied transparently to code that is generated at run time or to self-modifying code. Care must therefore be taken to store a replica of the generated code in permanent memory (Shapiro, 2004).

Component micro-rebooting

This technique has been proven effective for application programs. Applying it to operating systems is also practicable (Voas and McGraw, 1998).
The technique can help in recovering from temporary hardware faults and some system bugs. In the Nooks project, it was used in the form of extension restarts to recover the Linux kernel. The technique involves reinitialising the corrupted component, or destroying and recreating it, and then reissuing the request to the component. Whereas code reloading fixes errors only in processor instructions, this technique fixes errors in kernel data structures. The technique works together with component isolation: the wrapper elements that isolate components are also used to manage the recovery.
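A wrapper of this kind might look roughly like the sketch below. The class names, the `max_reboots` limit and the toy `NameCache` component are hypothetical and are not taken from Nooks or Choices. The sketch only shows the control flow: catch the failure, recreate the component, reissue the request.

```python
class ComponentWrapper:
    """Hypothetical isolation wrapper that micro-reboots a failed component."""

    def __init__(self, factory, max_reboots=1):
        self._factory = factory          # callable that builds a fresh component
        self._max_reboots = max_reboots  # reboots allowed per request
        self.component = factory()
        self.reboots = 0

    def call(self, method_name, *args):
        attempts = 0
        while True:
            try:
                return getattr(self.component, method_name)(*args)
            except Exception:
                if attempts >= self._max_reboots:
                    raise                # the fault is not transient; give up
                attempts += 1
                self.reboots += 1
                # Destroy and recreate the component, then retry the request.
                self.component = self._factory()


class NameCache:
    """Toy component whose internal table can become corrupted."""

    def __init__(self):
        self.table = {}

    def lookup(self, name):
        # Fails with AttributeError if the table has been corrupted.
        return self.table.setdefault(name, len(name))


wrapper = ComponentWrapper(NameCache)
wrapper.component.table = None           # simulate a corrupted data structure
value = wrapper.call("lookup", "inode")  # fails once, reboots, then succeeds
```

Note that the recreated component starts with an empty table: whatever state the old instance held is lost, which is the price of this technique.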
The fault model addressed by micro-reboot is component-level fault containment. This can be partly enforced by component isolation (Tanenbaum, Herder and Bos, 2006).

Automatic service restarts

If a crucial operating system service, such as the paging daemon, stops working, it brings the entire system to a halt. Once the failure of such a process is detected, restarting the process can solve the problem and allow the operating system to continue. The fault model handled by automatic service restart is single-process failure.
In this case there is usually no external state corruption. In a micro-kernel operating system, this technique essentially involves detecting and restarting the affected system services, which run as application processes (David, Carlyle and Campbell, 2007). In Minix3, for instance, this is done by the reincarnation server. A system process can be designed so that it is restarted automatically once it encounters an exception. One particular system process loops continuously, waiting for a ready process and yielding to it. This special system process is the process dispatcher.
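The restart logic for a service that runs as an ordinary process can be sketched with a small supervisor. This is only an illustration under stated assumptions: the `supervise` function, the restart limit and the crash-once child script are invented for the example, and the Minix3 reincarnation server works differently in detail. Here a failure is detected simply through a non-zero exit code.

```python
import os
import subprocess
import sys
import tempfile


def supervise(command, max_restarts=3):
    """Run a service and restart it whenever it exits abnormally.

    Returns the number of restarts that were needed.
    """
    restarts = 0
    while True:
        exit_code = subprocess.run(command).returncode
        if exit_code == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(f"service failed {restarts + 1} times, giving up")
        restarts += 1


# Hypothetical service that crashes on its first run and works on the second.
CRASH_ONCE = """
import os, sys
marker = sys.argv[1]
if not os.path.exists(marker):
    open(marker, "w").close()
    sys.exit(1)
"""

with tempfile.TemporaryDirectory() as workdir:
    marker = os.path.join(workdir, "crashed_once")
    restarts = supervise([sys.executable, "-c", CRASH_ONCE, marker])
```

The restart limit matters: without it, a service with a permanent fault would be restarted forever.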
The system becomes completely useless once the process dispatcher crashes. For this reason, some systems run the process dispatcher as a restartable process that can be recovered after a crash (Demsky and Rinard, n.d.). Process restarts may fail where the process uses locks to access shared data structures, which commonly happens when the process dies while holding one or more locks. Even if the shared data structures are unaffected or can be corrected, recovery will not succeed until all the locks held by the dead process are released.
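One way to make such locks releasable is to record which process holds each lock. The sketch below is a hypothetical illustration: the `LockTracker` class and the owner identifiers are assumptions for the example. It relies on the fact that a Python `threading.Lock` may be released by a thread other than the one that acquired it.

```python
import threading
from collections import defaultdict


class LockTracker:
    """Records which owner holds which locks so they can be force-released."""

    def __init__(self):
        self._held = defaultdict(set)    # owner id -> locks currently held
        self._guard = threading.Lock()   # protects the table itself

    def acquire(self, owner, lock):
        lock.acquire()
        with self._guard:
            self._held[owner].add(lock)

    def release(self, owner, lock):
        with self._guard:
            self._held[owner].discard(lock)
        lock.release()

    def force_release_all(self, owner):
        """Release every lock still recorded for a halted owner."""
        with self._guard:
            locks = self._held.pop(owner, set())
        for lock in locks:
            lock.release()
        return len(locks)


tracker = LockTracker()
page_table_lock = threading.Lock()

tracker.acquire("pid-7", page_table_lock)
# ... process "pid-7" dies here without releasing its lock ...
released = tracker.force_release_all("pid-7")
available = page_table_lock.acquire(blocking=False)  # the lock is free again
```

Force-releasing a lock does not repair the data it protected, so this only helps when the shared structure is intact or has been corrected first.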
The system should therefore track all the locks held by processes and forcibly release any that are still held when a process is halted. Lock tracking and forced unlocking make it possible for execution to continue once a fault has been identified and fixed (Tanenbaum, Herder and Bos, 2006).

Watch-dog based recovery

This technique uses external hardware watchdog timers. They detect errors in which the operating system is not doing any useful work, for example when it is stuck in an infinite loop. The operating system must reset the timer at regular intervals.
A signal is sent to the processor once the timer expires. The timer is usually wired to the reset pin of the processor, so a failure leads to a complete reboot of the system. The weakness of this approach is that a complete reboot results in the loss of the user data and applications currently held in volatile memory. However, since memory contents are preserved across a processor reset, both the operating system state and the user state can be reconstructed. This makes it possible to continue operating after the reset, so that user data is recovered and reliability is higher (Andrzejak, Geihs, Shehory and Wilkes, 2009).
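The behaviour of the timer can be modelled in software. The sketch below is a simulation, not a driver for real watchdog hardware: the `WatchdogTimer` class, its tick-based countdown and the callback that stands in for the reset line are assumptions for the example.

```python
class WatchdogTimer:
    """Counts down once per tick and fires a reset if it is not kicked in time."""

    def __init__(self, timeout_ticks, on_expire):
        self._timeout = timeout_ticks
        self._remaining = timeout_ticks
        self._on_expire = on_expire      # stands in for the processor reset line

    def kick(self):
        """Called regularly by a healthy operating system."""
        self._remaining = self._timeout

    def tick(self):
        """Called once per hardware clock tick."""
        self._remaining -= 1
        if self._remaining <= 0:
            self._remaining = self._timeout
            self._on_expire()


resets = []
watchdog = WatchdogTimer(timeout_ticks=3, on_expire=lambda: resets.append("reset"))

# Healthy phase: the system kicks the timer before every tick.
for _ in range(10):
    watchdog.kick()
    watchdog.tick()
resets_while_healthy = len(resets)

# Lock-up phase: the system is stuck in a loop and stops kicking.
watchdog.kick()
for _ in range(3):
    watchdog.tick()
resets_after_lockup = len(resets)
```

The timeout has to be longer than the longest interval a healthy system can go without kicking the timer, otherwise the watchdog resets a system that is working correctly.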
This technique has been successfully implemented in Linux and Choices. When the watchdog bites, the processor, the memory management unit (MMU) and the interrupt subsystem are reset, yet the system can continue to operate. To avoid losing user data, the reset handler bypasses the usual boot procedure when the reset was triggered by the timer. Instead, it turns the memory management unit back on, deactivates the process that was running, reinitialises the interrupts and jumps to the operating system’s process dispatch loop.
The system then runs the next ready process (Shapiro, 2004). All that is lost is the state of the process that was running when the processor was reset. That process cannot be scheduled again, so it is removed from the process queue. An alternative way of resolving the lock-up is to deliver an exception to the thread that is locked up. The thread is then free to attempt local recovery rather than being forced to terminate. The fault model for watch-dog based recovery is a single process crash without external state corruption.
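The decision taken by the reset handler can be summarised in a few lines. This is a purely illustrative model: the function name, the string reset causes and the representation of the process queue as a `deque` of process ids are assumptions, and a real handler would run in low-level boot code rather than Python.

```python
from collections import deque


def reset_handler(reset_cause, process_queue, running_pid):
    """Return (steps taken, pid to dispatch next) after a processor reset."""
    if reset_cause != "watchdog":
        # Any other reset takes the usual boot path; nothing is preserved.
        return ["normal boot"], None

    steps = ["re-enable MMU"]
    # The state of the process that was running is lost, so drop it.
    if running_pid in process_queue:
        process_queue.remove(running_pid)
    steps.append(f"deactivate process {running_pid}")
    steps.append("reinitialise interrupts")
    steps.append("enter dispatch loop")
    next_pid = process_queue.popleft() if process_queue else None
    return steps, next_pid


queue = deque([4, 9, 12])
steps, next_pid = reset_handler("watchdog", queue, running_pid=9)
```

Every process other than the one that was running at the time of the reset stays in the queue and is scheduled as usual.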
The technique uses the lock tracking code to release shared resources held by the terminated process. Another kind of lock-up that can trigger a watchdog timeout is a deadlock. Recovery in this case can be attempted by restarting some of the components involved so as to break the cycle (Andrzejak, Geihs, Shehory and Wilkes, 2009).

Transactional roll-back

If an error causes an exception during an operation, the state of the component can be rolled back. This is achieved by aborting the operation, after which the operation is retried.
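The abort, roll back and retry sequence can be sketched as follows. This is a minimal single-threaded illustration that takes a full backup copy of the component state before each attempt. The `run_transaction` function and the toy `PageAllocator` component are hypothetical, and a real system would use a far cheaper mechanism than a deep copy.

```python
import copy


def run_transaction(component, operation, retries=1):
    """Run operation(component); on an exception, restore the state and retry."""
    for attempt in range(retries + 1):
        backup = copy.deepcopy(vars(component))  # backup copy of the state
        try:
            return operation(component)
        except Exception:
            vars(component).clear()
            vars(component).update(backup)       # roll back to the backup
            if attempt == retries:
                raise                            # retries exhausted


class PageAllocator:
    """Toy component with state that must stay consistent."""

    def __init__(self):
        self.free = [1, 2, 3]
        self.used = []


pending_faults = [True]  # one transient fault waiting to happen


def allocate_two(allocator):
    allocator.used.append(allocator.free.pop())
    if pending_faults and pending_faults.pop():
        raise RuntimeError("transient fault in the middle of the operation")
    allocator.used.append(allocator.free.pop())
    return list(allocator.used)


allocator = PageAllocator()
result = run_transaction(allocator, allocate_two)
```

Without the roll-back, the page taken before the fault would stay marked as used and the retry would allocate a third page. With it, the retry starts from the original state and page 1 remains free.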
In Choices, transactions are managed by the same wrapper elements that provide isolation. When an exception goes unhandled, the wrapper aborts the transaction and the state of the component is rolled back. The multi-threaded, non-blocking implementation offered by RSTM can also be used for better performance (Brown and Patterson, 2001). Supporting a transactional model on components incurs costs in both space and time. The space cost comes from storing backup copies of state before each transaction.
The time cost comes from copying memory and managing it while a transaction is set up and committed (Marathe et al., 2006). Transactional roll-back differs from component micro-rebooting in that roll-back undoes only the current operation, whereas micro-rebooting reinitialises the entire internal state of the component. Either technique can be used, depending on the kind of component. In particular, if the component holds crucial state information that would be lost under micro-rebooting, transactional roll-back can be used to retain that state.
Component micro-rebooting is useful when the component can tolerate state reinitialisation, and it has low overheads (Demsky and Rinard, n.d.).

Process-level recovery

Where transparent recovery cannot work, or where the recovery process itself fails, the state of selected processes can be saved to permanent storage. This is done as a last resort, when none of the other techniques works. Once the user state has been saved, the system can attempt a full reboot. The saved process state can then be restored selectively.
All operating system state is reinitialised by the reboot, which is likely to remove transient errors. Process-level recovery ensures that user applications are not lost when the fault affects only a few system applications or unimportant operating system state. The technique can be combined with file system snapshots to make sure that file integrity is not compromised after recovery by continuing to run faulty processes. This procedure needs minimal support from the operating system: all it requires is a working permanent storage device and code for managing user process state.
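The save, reboot and selective restore cycle can be illustrated with a small sketch. Everything here is an assumption made for the example: process state is reduced to a dictionary, JSON in a temporary directory stands in for permanent storage, and a hypothetical `suspect` flag marks processes that should not be restored.

```python
import json
import os
import tempfile


def save_process_states(states, path):
    """Write user process state to permanent storage before the reboot."""
    with open(path, "w") as handle:
        json.dump(states, handle)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the data reaches the device


def restore_process_states(path, keep):
    """Reload saved state, keeping only the processes accepted by keep()."""
    with open(path) as handle:
        saved = json.load(handle)
    return {pid: state for pid, state in saved.items() if keep(state)}


states = {
    "101": {"name": "editor", "buffer": "draft text", "suspect": False},
    "102": {"name": "indexer", "buffer": "", "suspect": True},
}

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "process_state.json")
    save_process_states(states, path)
    states = None  # the full reboot discards everything in volatile memory
    restored = restore_process_states(path, keep=lambda s: not s["suspect"])
```

Only the editor process comes back after the simulated reboot. Leaving out the suspect process keeps a possibly faulty process from running again.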
The saved processes can be restored selectively after the healing process (Ghosha, Sharman, Rao and Upadhyaya, 2007).

Conclusion

The reliability of computer systems is one of the key issues in modern society, because computers have become central to our lives and we depend on them for many of our activities. A reliable computer system is one that can recover from a fault or an error effectively, without loss of user applications or data. This is why operating systems have been designed to be self-healing.
This means that they can automatically detect, diagnose and repair localized software and hardware problems. The recovery techniques discussed in this paper are code reloading, component micro-rebooting, automatic service restarts, watch-dog based recovery, transactional roll-back and process-level recovery.

Annotated Bibliography

Andrzejak, A., Geihs, K., Shehory, O. & Wilkes, J. (2009). Self-Healing and Self-Adaptive Systems. Dagstuhl Seminar 09201, May 10-15, 2009. This paper, presented at the Dagstuhl Seminar, tackles various aspects of self-healing and self-adaptive systems.
The issues discussed include fault detection and diagnosis, recovery and repair techniques, frameworks and architectures for self-adaptive systems, self-healing solutions in IT infrastructures, and fault management for application systems. The discussion of recovery and repair techniques makes the paper an important resource for the project.

Brown, A. & Patterson, D. (2001). Embracing failure: A case for recovery-oriented computing (ROC). High Performance Transaction Processing Symposium, Asilomar, CA, October 2001. This paper is broadly about recovery-oriented computing.
Brown and Patterson discuss various aspects of recovery from faults and errors in computing. They include the role of operating systems in recovery, which is the focus of this research. As a result, the paper provides very important information for the project. The authors are experts in data recovery, so the information is reliable for understanding recovery in computing.

David, F. & Campbell, R. (n.d.). Building a Self-Healing Operating System. Urbana, IL: University of Illinois. This paper discusses the rationale behind the development of self-healing operating systems. The authors go on to discuss the recovery techniques that ensure user applications and data in temporary storage are not lost when an operating system crashes. The techniques discussed are code reloading, component micro-rebooting, automatic service restarts, watch-dog based recovery, transactional roll-back and process-level recovery. This makes the paper an important resource for this project.

David, F., Carlyle, J. & Campbell, R. (2007). Exploring Recovery from Operating System Lockups. In USENIX Annual Technical Conference, Santa Clara, CA.
In the recovery process, restarting a process may be impossible where the process holds locks. This mostly happens when the process terminates while holding one or more locks. This resource provides crucial information on how to deal with such lock-ups so that recovery is effective. The paper explains what lock-ups are and how to handle them under different recovery methods. This is what makes it an important source for this paper.

Demsky, B. & Rinard, M. (2002). Automatic detection and repair of errors in data structures. Technical Report MIT-LCS-TR-875, Massachusetts Institute of Technology.
This paper is about the automatic detection and repair of errors in computer systems. The idea of automatic detection and repair shows that the operating system itself takes part in detection and recovery. The paper details how a self-healing operating system detects and repairs errors in data structures. These detection and recovery techniques are the main focus of the essay.

Demsky, B. & Rinard, M. (n.d.). Automatic Data Structure Repair for Self-Healing Systems. Retrieved on August 3, 2010 from http://people.csail.mit.edu/rinard/paper/sms03.pdf The authors describe a system they developed that accepts specifications of key data structure constraints and then detects and repairs violations of these constraints, making it possible for the program to recover from errors and continue working effectively. The paper sets out the procedures the authors use for detecting errors and recovering from them. This is what makes the paper significant for the research.

Ghosha, D., Sharman, R., Rao, R. & Upadhyaya, S. (2007). Self-healing systems: survey and synthesis. Decision Support Systems, Volume 42, Issue 4. Ghosha, Sharman, Rao and Upadhyaya give a detailed analysis of self-healing systems. Their analysis covers contemporary software-based systems and applications, at a time when such systems have gained significant importance. They discuss the ability of self-healing systems to manage conflicting resources and serve different user needs. They go on to discuss the need to discover and rectify system faults and recover from errors, and how to do so. They argue that these systems attempt to “heal” themselves by recovering from faults and regaining normal performance levels.
Haugk, G., Lax, F., Royer, R. & Williams, J. (1985). The 5ESS(TM) switching system: Maintenance capabilities. AT&T Technical Journal, 64(6, part 2). This paper discusses the maintenance capabilities of operating systems. It is a useful resource for the essay because it discusses the self-healing of operating systems from a historical point of view. Computer systems have been affected by software bugs and hardware faults since the beginning. The article discusses how the bugs and faults that result in errors have been handled since the invention of computer hardware and software.

Liedtke, J. (1995). On micro-kernel construction. In SOSP ’95: Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles. New York: ACM Press. This book contains the proceedings of the 1995 ACM Symposium on Operating Systems Principles. It includes a discussion of component micro-rebooting, which has been proven effective for application programs. The author also argues that applying this technique to operating systems is practicable. In the Nooks project, the technique was used in the form of extension restarts to recover the Linux kernel.
This book contains important information on component micro-rebooting as a recovery technique for self-healing operating systems.

Lohr, S. (2001). Go To: The Story of the Math Majors, Bridge Players, Engineers, Chess Wizards, Maverick Scientists, and Iconoclasts, the Programmers Who Created the Software Revolution. New York: Basic Books. This book provides important information on the evolution and workings of software. It offers reliable information on software management. Software bugs are among the problems that cause errors in processes, and the book offers a clear understanding of these bugs and ways of dealing with them.
Marathe, V. et al. (2006). Lowering the Overhead of Software Transactional Memory. Technical Report TR 893, Computer Science Department, University of Rochester, March 2006. According to this paper, supporting a transactional model on components incurs overheads in both space and time. The space overhead comes from storing backup copies of state before each transaction. The time overhead comes from copying memory and managing it while a transaction is set up and committed. Having established this, the authors go on to discuss ways of reducing these overheads.
Parhami, B. (2005). Computer Architecture: From Microprocessors to Supercomputers. New York: Oxford University Press. As technology has advanced, so has the need for more reliable systems. The section of this book that discusses computer operations contains significant information for the paper. Hardware faults are as crucial in error detection and recovery as software faults, which makes the book important for the research. The research would not be complete without an understanding of computer hardware.

Shapiro, M. (2004). Self-Healing in Modern Operating Systems. Retrieved on August 3, 2010 from http://queue.acm.org/detail.cfm?id=1039537 Shapiro introduces the topic of self-healing operating systems by first discussing the role the operating system plays in a computer system. It is not possible to understand the concept of self-healing operating systems without understanding operating systems in general, and this is the strength of the article for this research. He goes on to discuss the self-healing system model, which leads to self-healing operating systems, the center of this research.
Tanenbaum, A. S., Herder, J. N. & Bos, H. (2006). Can We Make Operating Systems Reliable and Secure? Computer, 39(5):44–51. The reliability of computer systems is one of the key issues in modern society. This article explains why computer systems need to be made reliable and dependable. The authors go on to explain ways in which operating systems can be made more reliable in a computing environment prone to hardware faults and software bugs. The article is an important resource for the essay since it provides solutions to the problem.

Voas, J. M. & McGraw, G. (1998). Software Fault Injection. New York: Wiley. Software Fault Injection is a book that recognises that software bugs can make computer systems unreliable. It discusses ways in which bugs and errors in computer systems can be identified and what should be done about them. The solution suggested by Voas and McGraw relates to operating systems, leading to what are referred to as self-healing operating systems. The section on how the system can solve problems with the software is the one that offers important information for the research.