Analysis and Architecture for Application Level Reliability

Project status: 

Single event upsets (SEUs) are a source of concern for the correct operation of CMOS circuits, and the problem becomes more severe as transistor sizes and supply voltages decrease. Under the traditional, numerical notion of correctness, every output must be correct to the last bit. However, many applications are resilient to a certain degree of error and produce output of acceptable quality even in the presence of SEUs. We use the term application-level correctness to denote acceptable output (rather than numerically correct output) for such applications. Such applications use a fidelity metric to estimate the quality of the output. For example, for JPEG decoders, an image with a PSNR (peak signal-to-noise ratio) of at least 35 dB is considered acceptable.
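As a rough illustration of a fidelity metric, the sketch below computes PSNR with the standard definition, 10 * log10(MAX^2 / MSE), and applies the 35 dB threshold mentioned above. The function names, the flat list-of-pixels representation, the 8-bit peak value of 255, and the availability of a fault-free reference image are all assumptions made for this example; they are not taken from the project.

```python
import math


def psnr(reference, decoded, max_value=255):
    """Return the PSNR in dB between two equal-length pixel sequences.

    max_value=255 assumes 8-bit samples (an assumption for this sketch).
    """
    if len(reference) != len(decoded) or not reference:
        raise ValueError("images must be non-empty and of equal size")
    # Mean squared error between the reference and the decoded output.
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def is_acceptable(reference, decoded, threshold_db=35.0):
    """Application-level correctness check: the PSNR must reach the threshold."""
    return psnr(reference, decoded) >= threshold_db


# Hypothetical 8x8 block of identical pixels with a single flipped bit.
reference = [100] * 64
low_bit_flip = [100 ^ 1] + [100] * 63     # least significant bit flipped
high_bit_flip = [100 ^ 128] + [100] * 63  # most significant bit flipped
```

In this toy block, a flip in the least significant bit of one pixel leaves the image well above the threshold, while a flip in the most significant bit drops it below 35 dB. That is the sense in which some errors are tolerable at the application level and others are not.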

However, even in these applications, there are critical instructions whose numerical correctness must be guaranteed, for example the exit condition of an iterative improvement algorithm. Our goal in this project is to identify all such critical instructions and protect them using some form of duplication. Our technique consists of an offline, profile-guided compiler analysis pass and an online monitoring technique. Together, these two techniques greatly reduce the number of instructions that need to be duplicated at runtime.
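The idea of duplicating only critical instructions can be sketched at the source level as follows. This is a minimal illustration and not the project's compiler pass or monitoring mechanism: the helper names (checked, converged, approximate_sqrt), the retry policy, and the choice of Newton's method as the iterative improvement algorithm are all assumptions. Only the exit condition is evaluated twice and compared; the update step is left unprotected, on the assumption that a later iteration can absorb a small error in it.

```python
def checked(compute, *args, max_retries=3):
    """Evaluate compute(*args) twice and accept the result only if both agree.

    On a mismatch (as a transient fault might cause), the pair is re-executed.
    The retry limit is a hypothetical policy chosen for this sketch.
    """
    for _ in range(max_retries):
        first = compute(*args)
        second = compute(*args)
        if first == second:
            return first
    raise RuntimeError("duplicated computations kept disagreeing")


def converged(previous, current, tolerance):
    """Exit condition of the iterative improvement loop (treated as critical)."""
    return abs(current - previous) < tolerance


def approximate_sqrt(value, tolerance=1e-9, max_iterations=100):
    """Newton's method for the square root of a positive value."""
    estimate = value if value > 1 else 1.0
    for _ in range(max_iterations):
        # Non-critical: an error here is corrected by later iterations.
        next_estimate = 0.5 * (estimate + value / estimate)
        # Critical: a wrong exit decision could end the loop too early,
        # so this comparison is duplicated and checked.
        if checked(converged, estimate, next_estimate, tolerance):
            return next_estimate
        estimate = next_estimate
    return estimate
```

Protecting only the exit test keeps the duplication overhead to one extra comparison per iteration, rather than doubling the whole loop body.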