Over the past few years, VLSI design has grown remarkably. High-performance (peta-scale) computing is now a reality, and we are expecting exa-scale computing by 2020. Many-core processors are a common topic these days. Intel’s Xeon Phi (Knights Landing) with up to 72 cores and the KiloCore processor from UC Davis (fabricated by IBM) with 1000 cores are good examples. Design scientists and manufacturing engineers are working hard to come up with efficient designs that meet the market's area, power, and performance demands. At the same time, these highly integrated units need to be robust, and the design cycle has to meet time-to-market needs.
Designers are certainly trying hard to develop fault-free systems, but no matter how robust the design is, a 100% fault-free design is impossible. Computer scientists and engineers have introduced a variety of tools and techniques to reduce the number of faults in the systems they build. However, we need to build systems that acknowledge the existence of faults as a fact of life and incorporate techniques to tolerate these faults while still delivering an acceptable level of service [1].
In several application areas, fault tolerance is a necessity that must be incorporated during the design phase:
- Critical applications: aircraft, nuclear reactors, medical equipment
- Harsh environments: systems exposed to high vibration, temperature, humidity, electromagnetic disturbances, or particle hits
- High-performance computing systems: complex systems consisting of millions of devices
A fault or error in one part of a unit can spread across the whole system. For example, a stuck-at-zero fault at the data output of a memory module (a permanent connection to ground) may deliver a wrong logic “0” to the processor when the correct value is logic “1”. The processor then operates on this wrong value and may produce a wrong result. In this scenario the processor itself is not faulty; the error in the result originates from the fault in the memory module and propagates through the processor.
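The propagation described above can be illustrated with a small Python sketch. This is a toy model, not a circuit simulation: the names `read_memory`, `processor_add`, and `stuck_at_zero_mask`, the 8-bit word width, and the choice of an adder as the "processor" are all assumptions made for illustration.

```python
def read_memory(stored_word, stuck_at_zero_mask=0):
    """Return the 8-bit word seen on the memory's data output.

    Bits set in stuck_at_zero_mask model output lines that are
    permanently tied to ground, so they always read as 0.
    (Hypothetical model for illustration.)
    """
    return stored_word & ~stuck_at_zero_mask & 0xFF


def processor_add(a, b):
    """A fault-free 8-bit adder standing in for the processor."""
    return (a + b) & 0xFF


stored = 0b00000101   # the value 5 is stored correctly in memory
operand = 3

# Healthy memory: the processor sees 5 and computes 5 + 3.
good_result = processor_add(read_memory(stored), operand)

# Data output bit 0 is stuck at zero: the processor sees 4 instead of 5.
faulty_result = processor_add(
    read_memory(stored, stuck_at_zero_mask=0b00000001), operand
)

print(good_result, faulty_result)
```

The adder is identical in both runs, yet the second result is wrong, which is the point of the example: the error appears at the processor's output even though the only fault is in the memory.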
Faults in electronic systems can be transient or permanent.
- Transient faults: disappear after a relatively short time. An example is a memory cell whose contents are changed spuriously by electromagnetic interference. Overwriting the memory cell with the correct content makes the fault go away.
- Permanent faults: never go away; the component has to be repaired or replaced.
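The difference between the two fault types can be sketched with a toy one-bit memory cell in Python. The `MemoryCell` class and its fault-injection methods are hypothetical names introduced only for this illustration; real fault models are considerably more detailed.

```python
class MemoryCell:
    """Toy one-bit memory cell with simple fault injection (illustrative only)."""

    def __init__(self, value=0):
        self._stored = value & 1
        self.stuck_at = None  # None means healthy; 0 or 1 means a permanent fault

    def write(self, value):
        # A write always updates the stored bit, even if the cell is stuck.
        self._stored = value & 1

    def read(self):
        # A permanent stuck-at fault overrides whatever was written.
        if self.stuck_at is not None:
            return self.stuck_at
        return self._stored

    def inject_transient_flip(self):
        # Models a one-time upset, e.g. from electromagnetic interference.
        self._stored ^= 1

    def inject_permanent_stuck_at(self, level):
        # Models a defect that persists until the part is repaired or replaced.
        self.stuck_at = level & 1


# Transient fault: the stored bit is corrupted, but rewriting restores it.
cell = MemoryCell()
cell.write(1)
cell.inject_transient_flip()
after_upset = cell.read()
cell.write(1)
after_rewrite = cell.read()

# Permanent fault: rewriting does not help.
bad_cell = MemoryCell()
bad_cell.inject_permanent_stuck_at(0)
bad_cell.write(1)
after_bad_rewrite = bad_cell.read()

print(after_upset, after_rewrite, after_bad_rewrite)
```

In this model a transient fault only corrupts state, so a fresh write repairs it, while a permanent fault changes the behavior of the cell itself and survives every write.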
We will discuss different redundancy techniques for dealing with faults in VLSI systems in future posts.
[1] Fault-Tolerant Systems, by Israel Koren and C. Mani Krishna