In the previous post we looked at why fault tolerance is needed in VLSI system design. A VLSI system can broadly be viewed as a combination of the following three layers:
- Hardware Layer (Processing cores, Memories, etc.)
- Software Layer (OS, Program Instructions)
- Interconnection Layer (Bus-Based or Network-on-Chip)
Designers apply several techniques at each of these layers to deal with transient as well as permanent faults, and all of these techniques rely on redundancy in some form. So we can say, “Redundancy is the heart of fault tolerance.”
Fault tolerance relies on four different forms of redundancy:
- Hardware Redundancy (HR)
- Software Redundancy (SR)
- Information Redundancy (IR)
- Time Redundancy (TR)
Hardware Redundancy (HR)
Here we add redundant copies of a complete module, or of its sub-modules, to the system. The redundant units perform the same job as the original unit so that a fault can be detected and masked. Triple Modular Redundancy (TMR) is a very common implementation of hardware redundancy. Say the module is a processor. A TMR system contains two more identical processors that execute the same set of instructions alongside the primary one. A voter unit compares the outputs of all three processors, takes the majority value as the correct result, and discards the faulty result, if any.
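The voting step can be sketched in a few lines of Python. This is only a software model of the idea, not a hardware description: the names `vote`, `tmr_execute`, `healthy_unit` and `faulty_unit` are made up for illustration, and the injected fault (one flipped output bit) is an assumption.

```python
def vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (b & c) | (a & c)


def tmr_execute(modules, operand: int) -> int:
    """Run the same operand through three redundant modules and vote on the outputs."""
    if len(modules) != 3:
        raise ValueError("TMR needs exactly three modules")
    out_a, out_b, out_c = (module(operand) for module in modules)
    return vote(out_a, out_b, out_c)


def healthy_unit(x: int) -> int:
    # Stand-in for the real computation done by a processor.
    return x + 1


def faulty_unit(x: int) -> int:
    # Hypothetical fault: bit 2 of the output is flipped.
    return (x + 1) ^ 0b100


result = tmr_execute([healthy_unit, faulty_unit, healthy_unit], 10)
print(result)  # 11, the faulty output 15 is outvoted
```

The voter masks a fault in a single module only. If two modules fail in the same way, the wrong value wins the vote.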
Software Redundancy (SR)
In software redundancy, multiple teams of programmers develop different versions of the same program. It is very unlikely that independently developed versions would produce the same fault on the same input set. SR builds on HR when the versions run simultaneously on multiple identical modules, and on TR when they run one after another on the same hardware module.
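As a rough illustration, the sketch below runs three versions of an integer square root one after another on the same machine (the TR flavour) and accepts the majority answer. The three functions and the design fault planted in `isqrt_rounding` are assumptions chosen for the example, not taken from any real project.

```python
import math
from collections import Counter


def isqrt_library(n: int) -> int:
    # Version 1: relies on the standard library.
    return math.isqrt(n)


def isqrt_bisection(n: int) -> int:
    # Version 2: independent implementation using binary search.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def isqrt_rounding(n: int) -> int:
    # Version 3: hypothetical design fault, rounds instead of flooring.
    return round(math.sqrt(n))


def n_version_execute(versions, n: int) -> int:
    """Run every version on the same input and return the strict-majority result."""
    results = [version(n) for version in versions]
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among versions")
    return value


answer = n_version_execute([isqrt_library, isqrt_bisection, isqrt_rounding], 15)
print(answer)  # 3, the rounding version returns 4 and is outvoted
```

Running the same three functions on three separate modules at once would be the HR flavour of the same scheme.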
Information Redundancy (IR)
The most familiar example of information redundancy is the parity bit. Information redundancy adds redundant bits to the original data bits to detect errors and, in some cases, even correct them. These additional bits are known as check bits. Hamming codes, cyclic codes, and checksums are different forms of information redundancy.
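To make the idea of check bits concrete, here is a minimal sketch of an even-parity bit (detects a single-bit error) and a Hamming(7,4) code (corrects a single-bit error). The function names and the bit ordering (parity bits at positions 1, 2 and 4 of the codeword) are choices made for this example.

```python
def even_parity_bit(bits):
    """Check bit that makes the total number of 1s even."""
    return sum(bits) % 2


def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword laid out as p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the error, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


data = [1, 0, 1, 1]
sent = hamming74_encode(data)
received = list(sent)
received[4] ^= 1  # simulate a single-bit error in transit
recovered = hamming74_decode(received)
print(sent, received, recovered)
```

Three check bits protect four data bits here, which gives a feel for the storage overhead that information redundancy brings.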
Time Redundancy (TR)
Most of the faults we come across in VLSI systems are transient: they go away after a short time interval. So re-executing the part of the code that produced a fault a short while ago is very likely not to produce the fault again, provided the fault is transient. This technique spends additional time to get the correct result, hence the name time redundancy.
Depending on the requirements, different forms of redundancy can be used, alone or in combination, in different application areas. The area, power, and performance penalties depend on how the redundancy is implemented. The following table gives a generalized view, which may vary from application to application and situation to situation.
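Re-execution can be sketched as a simple retry loop. The example assumes some fault detector already exists (parity, a duplicate comparison, a range check); here it is reduced to a hypothetical `is_valid` callback, and `TransientAdder` is an invented unit whose first execution only is corrupted.

```python
def run_with_retry(task, is_valid, max_attempts=3):
    """Re-execute task until its result passes the check; return (result, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        result = task()
        if is_valid(result):
            return result, attempt
    raise RuntimeError("fault persisted across retries; it is probably permanent")


class TransientAdder:
    """Hypothetical unit: a transient fault flips bit 3 of the first result only."""

    def __init__(self):
        self.calls = 0

    def __call__(self):
        self.calls += 1
        total = 2 + 3
        return total ^ 0b1000 if self.calls == 1 else total


# Assumed sanity check standing in for a real detector: the sum must fit in 3 bits.
result, attempts = run_with_retry(TransientAdder(), lambda r: r < 8)
print(result, attempts)  # 5 2
```

The second attempt costs extra time but no extra hardware, which is the trade-off that defines time redundancy. A fault that survives every retry is treated as permanent and has to be handled by another form of redundancy.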
Redundancy | Area Penalty | Performance Penalty | Power Penalty | Application |
---|---|---|---|---|
Hardware | High | Low | Moderate | Critical systems (avionics, automobiles, etc.) |
Software | High when combined with HR | High when combined with TR | Moderate | Complex computing systems |
Information | Sometimes high (e.g., RAID) | High | Low | Communication and memories |
Time | Low | High | Low | Computing and Interconnect |