Modeling and Experimental Demonstration of Accelerated Self-Healing Techniques

Xinfei Guo
ECE Dept., University of Virginia, Charlottesville, VA 22904, USA
[email protected]

Wayne Burleson
AMD Research, Boxborough, MA 01719, USA
[email protected]

Mircea Stan
ECE Dept., University of Virginia, Charlottesville, VA 22904, USA
[email protected]

ABSTRACT
In this paper we postulate that future electronic systems will use sleep time as an active recovery period essential for their overall performance. Our hypothesis is that by explicitly controlling the ratio of sleep to active time and the sleep conditions (e.g. higher temperatures, negative voltages), we can deeply rejuvenate electronic systems periodically to improve their metrics. We perform a series of stress and recovery experiments using commercial FPGAs to demonstrate several cases where we bring stressed chips back to within 90% of their original margin by actively rejuvenating for only 1/4 of the stress time. We validate our experiments against extracted models and present potential applications to multi-core systems.

Categories and Subject Descriptors
B.8.2 [Hardware]: Performance and Reliability - Performance Analysis and Design Aids; B.7.1 [Hardware]: Integrated Circuits - Types and Design Styles, Advanced Technologies

General Terms Reliability, Performance, Measurement, Design

Keywords Aging, BTI, accelerated recovery, self-healing, FPGAs

1. INTRODUCTION
A great challenge for present and future electronic systems is coping with process, voltage, temperature and aging (PVTA) variations. Variations reduce yields and complicate testing, require increased design margins that lead to lower performance or higher power and cost, lead to transient and/or permanent faults, thus reducing system reliability, and in general get worse with each new process node. It is likely that soon all the potential benefits of going to the next process node could be wiped out by the extra margins needed to compensate for variations.

Transistor aging (or wearout) is a long-term process that is caused by several interrelated physical mechanisms that conspire to worsen both performance, by making circuits slower, and power, by increasing leakage [1, 2]. Aging consists of both reversible and irreversible phenomena which accumulate at different rates under stress, e.g. voltage stress. When the system is not stressed, there is some level of recovery, typically at a much slower rate than the wearout.

Bias temperature instability (BTI) is one of the dominant reversible aging mechanisms; it shifts the threshold voltage (Vth) of transistors over time under voltage stress, increases circuit delay and shortens circuit lifetime [3, 4]. Negative Bias Temperature Instability (NBTI) occurs under negative stress conditions and affects PMOS transistors. Similarly, Positive Bias Temperature Instability (PBTI) affects NMOS transistors under positive stress voltage. Although the effect of PBTI has been negligible in previous technologies, it is rapidly becoming an important reliability issue with the introduction of high-k and metal gates [5, 6]. Depending on the bias condition of the gate, there are two phases of BTI. The stress (or wearout) phase occurs when the gate is under stress (Vgs < 0 for PMOS, Vgs > 0 for NMOS), and the recovery phase happens when stress is removed.

Several ways of dealing with BTI-induced variations have been proposed previously; one method is to accept the variations, track and monitor them [7, 8], then dynamically adapt to them [9, 10, 11], thus being able to design for the average case instead of the worst case. The problem is that, with scaling, the worst case becomes even worse and the distribution becomes skewed, so the potential advantages of adaptation are reduced: the system will function correctly but with poor power, performance and area (PPA) metrics.

This work proposes a more fundamental solution inspired by circadian rhythms to actually repair the variations. In [12, 13], a supply voltage greater than normal is applied to decrease stress time while keeping the same throughput. In this way, aging, especially NBTI-induced aging, can be reduced with a small performance penalty, but with power overheads. Most previous BTI mitigation techniques focus on reducing BTI-induced degradation during operation (under stress) by decreasing the stress voltage or stress time; however, either performance or power overheads are introduced. Although the idea of circadian rhythms has been proposed in [13], in that previous work “sleep” still only meant a period of inactivity; our approach is much more ambitious in that it considers sleep a period of active recovery. We investigate the idea of periodic sleep for electronic systems, not unlike that of biological systems, and propose several accelerated self-healing techniques. By controlling the ratio of sleep to active time and the sleep conditions, such as applying negative supply voltage and high temperature, we show that the chip can be rejuvenated significantly. The proposed techniques are demonstrated by modeling and experiments based on a set of FPGA chips in 40nm. Explorations of implementation on multi-core systems are proposed as future work. The main contributions of this paper are as follows:

• We propose accelerated self-healing techniques to improve lifetime and hence relax the design margins of electronic systems by the use of negative voltages and high temperatures;

• We develop a first-order model for both wearout and accelerated recovery periods for 40nm FPGAs, based on the latest device level NBTI models;

• We comprehensively validate the proposed techniques using hardware experiments, demonstrating their significance;

• We explore the on-chip implementation of the techniques and present their potential application in other electronic system architectures such as multi-core systems.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

DAC '14, June 01 - 05 2014, San Francisco, CA, USA Copyright 2014 ACM 978-1-4503-2730-5/14/06 …$15.00. http://dx.doi.org/10.1145/2593069.2593162

The rest of the paper is organized as follows. Section 2 covers sleep vs. inactivity for electronic systems with similarities from biology; proactive accelerated rejuvenation is introduced. Section 3 presents a first order gate-level wearout and accelerated self-healing model for FPGAs based on existing BTI device level models. Section 4 focuses on measurement setup and test cases; test results and model validation are presented in Section 5. We discuss the on-chip implementation of the technique in Section 6; we also propose potential applications in other systems, like multi-core. Section 7 concludes the paper.

2. ACCELERATED SELF-HEALING
In this section, we motivate the introduction of accelerated self-healing and propose the concept of proactive accelerated rejuvenation for electronic systems.

2.1 Inspired by Biology: Sleep vs. Inactivity
The concepts of active and sleep power modes for electronic systems are widely used in the community, but the terms are slightly misleading: until now, sleep for electronic systems has really just meant a period of inactivity or idleness. This is quite different from biological organisms, which go through several active processes during sleep that are essential for the recovery of their full capabilities. In this paper, we use the circadian-rhythm-inspired definitions of sleep and inactivity, and argue that sleep should be used as an active recovery period for future electronics. Electronic systems will benefit from such sleep periods with active rejuvenation, during which some of the effects of wearout (such as BTI) can be reversed, thus leading to effective self-healing.

2.2 Proactive Accelerated Rejuvenation
Here we distinguish between passive recovery (system unstressed when not in use) and active recovery (proactively scheduled accelerated recovery periods), similar to [14]. We claim that passive recovery, being slow and unpredictable, cannot effectively be used to improve metrics and reduce margins; it is sometimes even ignored when modeling aging phenomena. Recovery can be made active by reversing the direction of the stress (e.g. using positive instead of negative Vgs in the case of NBTI), and can be accelerated (e.g. by increasing the temperature). Philosophically, there are two alternatives for scheduling accelerated recovery: reactive, when an integrated circuit (IC) or part of an IC has aged by a particular threshold amount, and proactive, in anticipation of future wearout. Reactive accelerated recovery is potentially more “economic” since it is only scheduled when needed. However, it has three drawbacks: it needs to track changing threshold voltages; it is unpredictable, thus potentially introducing performance and/or energy overheads at inopportune times and likely leading to a smaller improvement in lifetime; and it accumulates more irreversible aging upfront, thus leading to worse expected performance and energy metrics, because the circuit operates for more time in an aged/stressed mode.

Without proactive accelerated rejuvenation, electronic systems need to be designed to cope with aging over the lifetime of the product, typically several years. This means increased design margins and/or methods to adapt to the worst conditions. It should be noted that adaptation is no panacea since aging fundamentally worsens the system metrics, e.g. by simultaneously reducing performance and increasing power. The system might function correctly with adaptation, but will still become sluggish and burn too much power. Proactive recovery, with explicit accelerated recovery periods scheduled ahead of any sign of stress, is simpler to implement, results in the system operating for a longer time in a “refreshed” mode, thus leading to better expected performance and other metrics, and has better cumulative metrics as well.

3. CROSS-LAYER MODELING
To analyze the wearout and accelerated self-healing mechanisms, we present a first order model for FPGAs, which will be used as our test platform. The proposed model is based on the latest device level NBTI Trapping/Detrapping (TD) model [15], and considers FPGA gate and circuit level structures. Both active stress (or wearout) and accelerated self-healing (or deep rejuvenation) phase models are presented in this section.

3.1 Device Level Wearout Model
The Trapping/Detrapping (TD) model is used as a robust prediction method for NBTI considering both variations in device level and supply voltage [15]. According to the model, the threshold voltage of the transistor increases when a trap captures a charge carrier in the stress phase. If the transistor is in the recovery phase, some of the interface traps can be annealed, and the number of occupied traps reaches a new equilibrium, which results in partial recovery. Since the degradation effect of PBTI is similar to NBTI [6, 12], the PBTI effect can be modeled similarly to the NBTI effect. The overall BTI effect can be described by the equations from [15]. Assume that stress starts at time 0, and no stress is applied before. The threshold voltage shift at the end of a stress period t1 is:

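$$\Delta V_{th}(t_1) = \phi_1 \left( A + \log(1 + C t_1) \right) \qquad (1)$$

$$\phi_1 \sim K_1 \exp\left(-\frac{E_0}{kT}\right) \exp\left(\frac{B V_{dds}}{k T t_{ox}}\right) \qquad (2)$$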

If a sleep interval of t2 follows the stress phase, the total threshold voltage shift in the end is equal to:

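$$\Delta V_{th}(t_1 + t_2) = \phi_2 \left( A + \log(1 + C t_2) \right) + \Delta V_{th}(t_1) \left( 1 - \frac{k + \log(1 + C t_2)}{k + \log(1 + C (t_1 + t_2))} \right) \qquad (3)$$

$$\phi_2 \sim K_2 \exp\left(-\frac{E_0}{kT}\right) \exp\left(\frac{B V_{ddr}}{k T t_{ox}}\right) \qquad (4)$$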

where A, B, C are (approximately) constant, K1 and K2 are fitting parameters, k is Boltzmann’s constant, T is temperature, E0 is activation energy, tox is the oxide thickness, and Vdds and Vddr are the supply voltages under stress and recovery, respectively.

Equation (3) estimates the dependence of the threshold voltage shift ∆Vth on the ratio of active vs. sleep time, the voltage and the temperature. Regarding the ratio in particular: if t2 << t1, the second component dominates and recovery starts fast; as t2 increases, the first component becomes dominant and increases logarithmically with time. This means that recovery is slower than degradation and ∆Vth cannot be fully recovered. The unrecovered part is carried into the next stress phase and accumulates, as shown in Figure 1.
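The stress and recovery expressions in Equations (1) to (4) can be evaluated numerically. The following Python sketch is only an illustration: all parameter values are hypothetical placeholders rather than values extracted in this paper, and it assumes that the k inside the fraction of Equation (3) is a fitting constant (named k_fit here) distinct from Boltzmann's constant, which the text does not state explicitly.

```python
import math

K_BOLTZMANN_EV = 8.617333262e-5  # Boltzmann's constant in eV/K


def phi(K, E0, B, vdd, temp_k, tox):
    """Prefactor of Equations (2)/(4): K * exp(-E0/kT) * exp(B*Vdd/(kT*tox))."""
    kt = K_BOLTZMANN_EV * temp_k
    return K * math.exp(-E0 / kt) * math.exp(B * vdd / (kt * tox))


def dvth_stress(phi1, A, C, t1):
    """Equation (1): threshold voltage shift after stressing for t1."""
    return phi1 * (A + math.log(1.0 + C * t1))


def dvth_after_sleep(phi2, A, C, k_fit, dvth_t1, t1, t2):
    """Equation (3): total shift after stress t1 followed by sleep t2."""
    first = phi2 * (A + math.log(1.0 + C * t2))
    ratio = (k_fit + math.log(1.0 + C * t2)) / (
        k_fit + math.log(1.0 + C * (t1 + t2))
    )
    second = dvth_t1 * (1.0 - ratio)  # part of the stress-induced shift still present
    return first + second


# Hypothetical numbers, chosen only to exercise the functions.
A, C, K_FIT = 1.0, 1e-2, 1.0
PHI_STRESS, PHI_SLEEP = 1e-3, 1e-5
T1 = 24 * 3600.0  # 24 hours of stress, in seconds
shift_t1 = dvth_stress(PHI_STRESS, A, C, T1)
shift_after_6h = dvth_after_sleep(PHI_SLEEP, A, C, K_FIT, shift_t1, T1, 6 * 3600.0)
```

With these placeholder values the shift after six hours of sleep is smaller than the shift at the end of stress but does not return to zero, which is the qualitative behavior described above.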

3.2 Wearout Model for FPGAs

FPGA vendors have been aggressive in adopting the very latest technology nodes; this makes FPGAs more susceptible to wearout that can lead to frequency degradations [16]. Due to their “bleeding edge” technologies, reconfigurability and regular structure, FPGAs are an ideal test platform for aging research [17, 18]. In this paper, we choose 2-input Look Up Table (LUT)-based commercial FPGAs, fabricated in a 40nm technology, to demonstrate experimentally our proposed techniques.

Figure 1. Behavioral illustration of stress and recovery (∆Vth vs. time over alternating stress and recovery phases, with ∆Vth(t1) marked at t1 and ∆Vth(t1+t2) marked at t1+t2)

Basic components of FPGAs include the I/O and the core architecture; we focus on the core architecture in this paper. Figure 2 shows a generic Pass Transistor (PT)-based 2-input LUT structure. Routing blocks include all the routing elements between LUT blocks. Four configuration bits (C0 to C3) are stored in memory; In0 and In1 are the input signals.

Let’s consider an inverter mapped to the LUT: In0 is the input of the inverter, C0 to C3 are 0101 and In1 is always 1. As shown in the figure, the Path Of Interest (POI) runs from the input of the LUT-based inverter to the output of the routing blocks. Here, we consider two stress conditions, AC stress and DC stress (constant stress). The rate of aging depends on switching activity; in this model, DC stress refers to the case where the input of the LUT is always static and doesn’t switch, and AC stress is when the input is switching. Assume the inverter is under DC stress and In0 is always 1. Then M1 and M5 are under stress and their threshold shift will affect the delay of POI. If In0 is always 0, only M7 is under stress. Based on this simple example, two hypotheses can be made:

• Hypothesis 1: Not all the transistors on POI are under stress. In DC stress mode, once the inputs are given, the number of stressed and unstressed transistors is constant;

• Hypothesis 2: Recovery can only have an impact on stressed transistors, but has no effect on “fresh” (never aged) transistors, nor on transistors that have already recovered (close) to the “fresh” state.

Although the exact gate level netlists of commercial FPGAs are unavailable, we believe that the two hypotheses can be applied to any pass-transistor LUT configurations. The propagation delay of a digital gate can be approximated as:

$$t_d \propto \frac{C_L V_{dd}}{I_d} \sim \frac{C_L V_{dd}}{V_{dd} - V_{th}} \qquad (5)$$

where CL is the output capacitance of the gate. The change in gate delay when Vth is subject to change is:

$$\Delta t_d \sim \frac{\Delta V_{th}}{V_{dd} - V_{th}} \cdot t_{d0} \qquad (6)$$

where td0 is the original delay of the gate without any Vth shift. Assume that all stressed transistors on POI are under the same stress condition (Vgs are the same), so we can approximately assume that ∆Vth of all stressed transistors are the same. Total delay change ∆Td of POI becomes:

$$\Delta T_d = \sum_{n=1}^{LD} \Delta t_{dn} \sim N_s \, \Delta t_d \qquad (7)$$

where LD is the logic depth, Ns is number of transistors that are under stress and 0 ≤ Ns ≤ LD. Combine Equations (1), (2), (6) and (7), and assume Vdds>>Vth. The total delay shift can be expressed as:

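$$\Delta T_d(t_1) \sim Y \exp\left(-\frac{E_0}{kT}\right) \exp\left(\frac{B V_{dds}}{k T t_{ox}}\right) \frac{A + \log(1 + C t_1)}{V_{dds}} \qquad (8)$$

$$Y \sim K_1 N_s t_{d0} \qquad (9)$$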
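For readers who want the intermediate step, the chain from Equations (1), (2), (6) and (7) to Equations (8) and (9) can be written out as follows. This is a sketch under the stated assumption Vdds >> Vth; it additionally treats all Ns stressed transistors as sharing the same ∆Vth and the same fresh delay td0, which the text implies but does not spell out.

```latex
\begin{aligned}
\Delta t_d &\sim \frac{\Delta V_{th}}{V_{dds} - V_{th}} \, t_{d0}
  \approx \frac{\Delta V_{th}}{V_{dds}} \, t_{d0}
  && (V_{dds} \gg V_{th}) \\
\Delta T_d(t_1) &\sim N_s \, \Delta t_d
  = \frac{N_s \, t_{d0}}{V_{dds}} \, \Delta V_{th}(t_1) \\
  &= \frac{N_s \, t_{d0}}{V_{dds}} \, \phi_1 \left( A + \log(1 + C t_1) \right) \\
  &\sim \underbrace{K_1 N_s t_{d0}}_{Y}
     \exp\!\left(-\frac{E_0}{kT}\right)
     \exp\!\left(\frac{B V_{dds}}{k T t_{ox}}\right)
     \frac{A + \log(1 + C t_1)}{V_{dds}}
\end{aligned}
```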

If Vdds and T are constant, Equation (8) can be expressed as:

$$\Delta T_d(t_1) \sim \beta \left( A + \log(1 + C t_1) \right) \qquad (10)$$

where β, A and C are fitting parameters and can be extracted from measurement results.
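One way such an extraction could be done is sketched below in Python. It is not the fitting procedure used in this paper: it assumes a simple grid search over candidate values of C, and for each candidate solves the remaining linear least-squares problem in closed form, since Equation (10) is linear in β and βA once C is fixed. The function name fit_wearout and the grid are hypothetical.

```python
import math


def fit_wearout(times, delays, c_grid):
    """Fit delays ~ beta * (A + log(1 + C * t)) and return (beta, A, C).

    For each candidate C the model is a straight line in x = log(1 + C*t):
    delay = beta*A + beta*x, so ordinary least squares gives beta and beta*A.
    The candidate with the smallest squared error wins.
    """
    best = None
    n = len(times)
    for C in c_grid:
        xs = [math.log(1.0 + C * t) for t in times]
        mean_x = sum(xs) / n
        mean_y = sum(delays) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        if sxx == 0.0:
            continue  # degenerate candidate, no slope can be estimated
        beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, delays)) / sxx
        if beta == 0.0:
            continue  # A = intercept / beta would be undefined
        intercept = mean_y - beta * mean_x  # equals beta * A
        sse = sum((intercept + beta * x - y) ** 2 for x, y in zip(xs, delays))
        if best is None or sse < best[0]:
            best = (sse, beta, intercept / beta, C)
    if best is None:
        raise ValueError("no usable candidate for C")
    return best[1], best[2], best[3]
```

A finer grid, or a nonlinear optimizer seeded with the grid result, would be the natural refinement if the coarse grid does not bracket the true C closely enough.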

3.3 Accelerated Recovery Modeling
Based on the recovery phase equation of the device model, we combine Equations (3), (4) and (7), and the delay change of POI after sleep period t2 becomes:

$$\Delta T_d(t_1 + t_2) = \frac{\phi_2 \, t_{d0} \left( A + \log(1 + C t_2) \right)}{V_{dds}} + \Delta T_d(t_1) \left( 1 - \frac{k + \log(1 + C t_2)}{k + \log(1 + C (t_1 + t_2))} \right) \qquad (11)$$

where Vdds is the voltage during operation after recovery. Assume that the ratio of active time to sleep time is α in one cycle, and the cycle period is t. Delay change in one cycle can be expressed as:

$$\Delta T_d(t) = \frac{\phi_a \, t_{d0} \left( A + \log\left(1 + \frac{C t}{1 + \alpha}\right) \right)}{V_{dds}} + \Delta T_d\!\left(\frac{\alpha t}{1 + \alpha}\right) \left( 1 - \frac{k + \log\left(1 + \frac{C t}{1 + \alpha}\right)}{k + \log(1 + C t)} \right) \qquad (12)$$

$$\phi_a \sim K_a \exp\left(-\frac{E_0}{k T_a}\right) \exp\left(\frac{B V_{dda}}{k T_a t_{ox}}\right) \qquad (13)$$

where Vdda and Ta are supply voltage and temperature under accelerated self-healing condition. Based on Equation (12), we can see a big dependence of delay shift as a function of α, Vdda and Ta. By decreasing Vdda and increasing Ta, the first component can be decreased exponentially. Also, by tuning α properly, both the first and second component can decrease.

4. EXPERIMENTAL SETUP
To validate the proposed techniques and evaluate the model, a series of accelerated tests are conducted. In this section, we talk about the measurement setup and test schedule in detail.

4.1 Stress and Recovery “Knobs”
So far, we have proposed three ways to accelerate the recovery during sleep: one is proactive recovery, and the other two are related to the sleep conditions, i.e. negative voltage and high temperature. The main “knobs” tuned in measurements are voltage, time, temperature, switching activity and α, the ratio of active (wearout) to sleep (rejuvenation) time.
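The role of α can be made concrete with a small Python sketch. It splits one cycle of period t into its active and sleep parts and evaluates the two components of Equation (12). The parameter values in any call are hypothetical, the delay shift at the end of the active part (dtd_active) is passed in rather than modeled, and k_fit is assumed to be the fitting constant that appears inside the fraction of Equation (12).

```python
import math


def split_cycle(period, alpha):
    """Return (t_active, t_sleep) with t_active / t_sleep == alpha."""
    t_sleep = period / (1.0 + alpha)
    return alpha * t_sleep, t_sleep


def delay_shift_after_cycle(phi_a, td0, vdds, A, C, k_fit, dtd_active, period, alpha):
    """Equation (12): delay change after one active + sleep cycle.

    dtd_active is the delay change accumulated by the end of the active
    part of the cycle, i.e. Delta T_d(alpha * t / (1 + alpha)).
    """
    _, t_sleep = split_cycle(period, alpha)
    first = phi_a * td0 * (A + math.log(1.0 + C * t_sleep)) / vdds
    ratio = (k_fit + math.log(1.0 + C * t_sleep)) / (
        k_fit + math.log(1.0 + C * period)
    )
    second = dtd_active * (1.0 - ratio)
    return first + second
```

For example, split_cycle(30 * 3600.0, 4.0) corresponds to 24 hours of activity followed by 6 hours of sleep, the schedule implied by an active-to-sleep ratio of 4.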

4.2 Test Configuration
In our model, the delay change is used as the metric to capture the effect of wearout. We choose a Ring oscillator (RO) structure which is widely used as a test platform to measure the delay of the Circuit Under Test (CUT) to capture the delay. Figure 3 shows our test configuration, which is a modified LUT-based Ring Oscillator based on the design proposed in [19]. It consists of 75 inverters implemented in LUTs and a 16-bit counter to capture the output frequency of the ring oscillator. Enable signal En is used to switch between AC stress and DC stress mode. The oscillation frequency fosc can be calculated as:

$$f_{osc} = 2 C_{out} f_{ref} \qquad (14)$$

So the delay of CUT is:

$$T_d = \frac{1}{2 f_{osc}} = \frac{1}{4 C_{out} f_{ref}} \qquad (15)$$

where fref is the frequency of the reference clock. To pick this frequency, CUT is placed at different locations on the FPGA, and a diagnostic program is run. The output of the counter is read from a certain time range that has stable values. Environmental factors and the voltage supply are kept constant from one reading to another; when fref = 500Hz, the variation of the counter output is within ±5 and ±0.0001% in terms of corresponding RO frequency variation which we consider acceptable.

Figure 2. Pass-transistor based LUT structure

Figure 3. Test configuration (labels: fref, En, 75 LUTs, Circuit Under Test (CUT), 16-b Counter with clk, in and rst, 16-bit output Cout)

4.3 Accelerated Testing Methodology
From the model presented in Section 3, both elevated temperature and voltage have a great impact on wearout and can be used to accelerate aging. In this work, only elevated temperature is applied since our preliminary tests show that we can observe a larger than 1% frequency degradation under high temperature for all of our test cases. The recommended operating temperature of the FPGAs we use is within -40°C to 85°C. In our test cases, we choose 100°C and 110°C, which are above the upper limit of temperature, but not too high to prevent the chip from functioning. The FPGA chips are heated up or cooled down by a thermal chamber, which allows temperature fluctuation of ±0.3°C. Core voltage is provided by a DC power supply and its nominal value is 1.2V. A clock generator provides the external clock source for the counter.

4.4 Test Cases
All tests were carried out on a group of fresh commercial FPGA chips within the same family. Several test cases in both active (wearout) and sleep (accelerated self-healing) phases are considered and are denoted as follows (AS: accelerated stress, AR: accelerated recovery, etc.):

• AS110AC24: In this accelerated stress test case, the chip is under 110°C environment for 24 hours in AC stress mode. RO is always enabled to switch.

• AS110DC24: This is similar to the previous case, but in DC stress mode. RO is enabled only every 20 minutes for data recording. Data sampling overhead is less than 3s.

• AS100DC24: 100°C is applied and the chip is under accelerated DC stress mode for 24 hours.

• R20Z6: In this case, chips are recovered for 6 hours under 20°C at 0V.

• AR20N6: Negative voltage of -0.3V is applied to the chip to accelerate recovery at 20°C.

• AR110Z6: In this case, only high temperature (110°C) is applied, and the chip is powered off at 0V for 6 hours.

• AR110N6: Chips are recovered with both 110°C and -0.3V.

During recovery, RO wakes up every 30 minutes for data sampling. Five chips of the same type are used for different test cases, which are summarized in Table 1. The last test case, which is conducted after Chip 5 is re-stressed for 48 hours, is used to compare the accelerated recovery behavior with the case that has the same active/sleep ratio, but with different stress condition. As a baseline all chips are stressed at 20°C and 1.2V for 2 hours initially.

Table 1. Test cases for Accelerated Wearout and Self-Healing

Phase            | Case No.  | Chip No. | T (°C) | Voltage (V) | Time (hours) | Switching Activity | Active/Sleep
Active (Stress)  | AS110AC24 | 1        | 110    | 1.2         | 24           | AC                 | -
Active (Stress)  | AS110DC24 | 2        | 110    | 1.2         | 24           | DC                 | -
Active (Stress)  | AS110DC24 | 3        | 110    | 1.2         | 24           | DC                 | -
Active (Stress)  | AS100DC24 | 4        | 100    | 1.2         | 24           | DC                 | -
Active (Stress)  | AS110DC24 | 5        | 110    | 1.2         | 24           | DC                 | -
Active (Stress)  | AS110DC48 | 5        | 110    | 1.2         | 48           | DC                 | -
Sleep (Recovery) | R20Z6     | 2        | 20     | 0           | 6            | -                  | 4
Sleep (Recovery) | AR20N6    | 3        | 20     | -0.3        | 6            | -                  | 4
Sleep (Recovery) | AR110Z6   | 4        | 110    | 0           | 6            | -                  | 4
Sleep (Recovery) | AR110N6   | 5        | 110    | -0.3        | 6            | -                  | 4
Sleep (Recovery) | AR110N12  | 5        | 110    | -0.3        | 12           | -                  | 4

5. TEST RESULTS AND ANALYSIS
This section presents the testing results from all the test cases in Section 4.

5.1 Accelerated Wearout Test Results
5.1.1 Effect of Switching Activity on Wearout
AC stress and DC stress are conducted in the first and second case, Figure 4 shows the measurement results. In the first 3 hours, RO frequency degradation of both cases is relatively fast and then becomes slower. AC stress can be viewed as a symmetric stress and recovery process, during which stress phases are always followed by recovery phases due to dynamic activity of the circuit, and results in smaller frequency degradation, which is about half of that in the DC stress case. The results also indicate that recovery is slower compared to wearout since the chip cannot be fully recovered with symmetric AC stress. In other words, AC stress is a partially self-healing process with a slow recovery rate. To fully, or almost fully, rejuvenate the chip, accelerated techniques are thus required.

Figure 4. AC/DC stress test results (Frequency Degradation (%) at 0, 3 hours, 6 hours, 12 hours and 24 hours; series: AC Stress, DC Stress)

5.1.2 Effect of Temperature on Wearout
Figure 5 shows measured delay change over time at 100°C and 110°C. As the model predicts, initially, frequency degrades fast and then slower. High temperature accelerates the degradation. Table 2 summarizes the delay change (%) for different temperature conditions. Table 3 shows the extracted parameters we use in the model.

Figure 5. Accelerated Wearout with 110°C and 100°C for 1 day (Delay Change ΔTd (ns) vs. Time (×10^4 s); series: 110°C Measurement, 100°C Measurement, 100°C Model, 110°C Model)

5.2 Results for Accelerated Recovery

Considering that we use different individual chips for different experiments, the initial RO frequencies of the fresh chips differ due to variations. To make a fair comparison, we use recovered delay (RD, the delay decrease during recovery) as our metric, which can be calculated as:

$$RD(t_2) = T_d(t_1) - T_d(t_2) = \Delta T_d(t_1) - \Delta T_d(t_2) \qquad (16)$$

where ∆Td(t1) is the delay change at the end of the stress phase, and ∆Td(t2) is the current delay change.

5.2.1 Negative Voltage

Currently, when electronic systems go to sleep, the supply voltage is usually gated to reduce leakage, but this only results in passive recovery. For accelerated recovery we apply a negative voltage of -0.3V. Figure 6 compares the recovered delay over 6 hours when the temperature is set at 20°C and 110°C, respectively. These tests correspond to the last three (AR) test cases. Model predictions are also included in the figure. The results show that stressed chips rejuvenate faster with a negative supply voltage at both temperatures. By applying a negative voltage, the recovery is significantly accelerated even at room temperature.

5.2.2 High Temperature

High temperature not only accelerates wearout, but also accelerates recovery. Figure 7 presents recovered delay vs. temperature with supply voltages at 0V and -0.3V; in both cases, high temperature accelerates recovery. The proposed model accurately predicts this behavior, as also shown in the figure.

Figure 8 shows the delay change ∆Td over time for four cases, and indicates that the test results match the modeling results well. High temperature (110°C) combined with negative voltage (-0.3V) achieves the highest recovery rate. Table 4 summarizes the results for different recovery conditions; the design margin relaxed parameter is defined as how much the chip recovered from the original margin. For example, in the case where both high temperature and negative voltage are applied to the system, the design margin relaxed parameter is as high as 72.4%, which means we can bring the stressed chip back to 27.6% of the original design margin in only 1/4 of the stress time. In all accelerated cases, we can bring the stressed chips back to within 90% of their original margin.
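As a rough illustration of the recovered-delay metric in Eq. (16) and of the design margin relaxed parameter, the following minimal Python sketch computes both from two delay-change readings. The function names and the numeric inputs are hypothetical, and the formula used for the relaxed parameter (recovered delay divided by the delay change at the end of stress) is an assumption about how the text's definition could be read, not the authors' stated computation.

```python
def recovered_delay(delta_td_t1: float, delta_td_t2: float) -> float:
    """Eq. (16): RD(t2) = dTd(t1) - dTd(t2), in the unit of the inputs."""
    return delta_td_t1 - delta_td_t2


def margin_relaxed(delta_td_t1: float, delta_td_t2: float) -> float:
    """Fraction of the stress-induced delay shift that was recovered.

    Assumption (not from the paper): relaxed = RD(t2) / dTd(t1).
    """
    if delta_td_t1 <= 0:
        raise ValueError("delta_td_t1 must be positive (the chip must be stressed)")
    return recovered_delay(delta_td_t1, delta_td_t2) / delta_td_t1


# Hypothetical readings chosen for illustration; not measurements from the paper.
shift_after_stress_ns = 2.5   # dTd(t1): delay change at the end of the stress phase
shift_after_sleep_ns = 0.69   # dTd(t2): delay change after accelerated recovery

rd = recovered_delay(shift_after_stress_ns, shift_after_sleep_ns)
relaxed = margin_relaxed(shift_after_stress_ns, shift_after_sleep_ns)
print(f"RD = {rd:.2f} ns, relaxed = {relaxed:.1%}, remaining = {1 - relaxed:.1%}")
```

Because RD is a difference of delay changes, the chip-specific fresh delay cancels out, which is what makes the metric comparable across chips with different initial RO frequencies.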

5.2.3 Ratio of active vs. sleep time

In both the AR110N6 and AR110N12 cases, the ratio of active to sleep time is 4, but the stress conditions are different. Our test results in Table 5 show that in both cases, the same design margin relaxed parameter can be achieved. This indicates that by only tuning the active vs. sleep ratio and sleep conditions, the original design margin can be relaxed significantly. Figure 9 shows the whole period of wearout and accelerated recovery behavior under high temperature (110°C), negative voltage (-0.3V) and a scheduled active vs. sleep ratio of 4.

Figure 6. Recovery at (a) 20°C (b) 110°C [bar charts of recovered delay (ns) at 0 hour, 0.3 hour, 1 hour, 2 hours, 4 hours and 6 hours; series: 0V, 0V Model, -0.3V, -0.3V Model]

Figure 7. Recovery under (a) 0V (b) -0.3V [bar charts of recovered delay (ns) at 0 hour, 0.3 hour, 1 hour, 2 hours, 4 hours and 6 hours; series: 20°C, 20°C Model, 110°C, 110°C Model]

Figure 8. Delay change over time during recovery [delay change ΔTd (ns) vs. time (×10^4 s); series: 110°C and -0.3V, 110°C and 0V, 20°C and -0.3V, 20°C and 0V, each with its model curve]
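The active vs. sleep ratio discussed in Section 5.2.3 translates directly into a time budget. The short Python sketch below works through that arithmetic for a ratio of 4; the helper names and the example active durations are assumptions made for illustration and do not come from Table 5.

```python
def sleep_hours_for(active_hours: float, active_to_sleep_ratio: float) -> float:
    """Sleep (recovery) time implied by an active time and an active:sleep ratio."""
    if active_hours < 0 or active_to_sleep_ratio <= 0:
        raise ValueError("active_hours must be >= 0 and the ratio must be > 0")
    return active_hours / active_to_sleep_ratio


def sleep_fraction(active_to_sleep_ratio: float) -> float:
    """Share of every active+sleep cycle that is spent asleep."""
    if active_to_sleep_ratio <= 0:
        raise ValueError("the ratio must be > 0")
    return 1.0 / (1.0 + active_to_sleep_ratio)


RATIO = 4.0  # active vs. sleep ratio used in the text

# Hypothetical active (stress) durations, in hours.
for active_hours in (24.0, 48.0):
    sleep_hours = sleep_hours_for(active_hours, RATIO)
    print(f"active {active_hours:.0f} h -> sleep {sleep_hours:.0f} h "
          f"({sleep_fraction(RATIO):.0%} of each cycle asleep)")
```

With a ratio of 4, the recovery phase takes 1/4 of the stress time, so the system is unavailable for one fifth of each full cycle.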

6. FUTURE WORK

In previous sections we demonstrated that the proposed accelerated techniques can rejuvenate chips significantly. In this section, we present some possible on-chip implementations of our techniques and discuss their potential use in other electronic systems, such as multi-core systems.

6.1 On-chip Negative Voltages

In a typical design, such as an FPGA or another type of chip, negative voltages are possible and have been proposed for other purposes, e.g. in [20]. The challenges in picking the negative voltage are: (1) Breakdown voltage limitation: the magnitude of the negative voltage must stay below the lateral pn-junction breakdown voltage; (2) Implementation feasibility: generating a negative voltage on chip introduces area overhead; (3) Leakage: Gate-induced Drain Leakage (GIDL) may introduce large leakage currents. The power overhead of the voltage-generating circuit is another consideration. From our test results presented in Section 3, a modest negative voltage, such as -0.3V, can be enough to rejuvenate the chip deeply.

6.2 Multi-core Implementation

There is a lot of interest in energy-efficient scheduling on multi-core architectures, including the future emergence of “dark silicon.” The basic idea is to keep some cores active while others are asleep, either to save energy or to stay within TDP limits. In this section we suggest two applications of the proposed accelerated techniques in multi-core systems. The first is to use active cores as “on-chip heaters” for the cores that are asleep. Figure 10 illustrates a simplified 8-core system in which cores 3 and 7 are in sleep mode. Both cores are surrounded by active neighbor cores, which generate heat during operation. By taking advantage of this heat, cores 3 and 7 can be rejuvenated deeply during sleep. The second is to consider circadian rhythms when scheduling: the multi-core system can benefit from sleep periods with active recovery, during which some of the effects of wearout can actually be reversed by external resources, as demonstrated in this paper. Both temporal and spatial distributions can be explored. Combining the proposed accelerated techniques with existing core scheduling methods can bring a substantial benefit in extending the lifetime and relaxing the design margin of multi-core systems.

[Figure 10 diagram: eight cores (Core 1 to Core 8) around a shared L3 cache; cores 3 and 7 are asleep ("Zzzzzz...") and receive heat from their active neighbor cores]
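The two multi-core ideas of Section 6.2 (active neighbors acting as on-chip heaters, and a circadian rotation of sleeping cores) can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the 2x4 grid floorplan, the four-way adjacency rule (which ignores any cache placed between rows), and the round-robin rotation are all hypothetical choices made for illustration, not the scheduling method of the paper.

```python
from typing import Dict, List, Set, Tuple

# Assumed floorplan: cores 1-4 in the top row, cores 5-8 in the bottom row.
ROWS, COLS = 2, 4


def position(core: int) -> Tuple[int, int]:
    """Map a 1-based core number to its (row, column) in the assumed grid."""
    return divmod(core - 1, COLS)


def neighbors(core: int) -> List[int]:
    """Cores directly above, below, left and right of the given core."""
    row, col = position(core)
    result = []
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < ROWS and 0 <= c < COLS:
            result.append(r * COLS + c + 1)
    return result


def active_heaters(sleeping: Set[int]) -> Dict[int, List[int]]:
    """For each sleeping core, list the adjacent cores that are still active."""
    return {
        core: [n for n in neighbors(core) if n not in sleeping]
        for core in sorted(sleeping)
    }


def sleeping_cores(slot: int, ratio: int = 4) -> Set[int]:
    """Hypothetical circadian rotation: each core sleeps for one slot out of
    every (ratio + 1) slots, giving an active:sleep ratio of `ratio`."""
    period = ratio + 1
    return {
        core
        for core in range(1, ROWS * COLS + 1)
        if (core - 1) % period == slot % period
    }


# Spatial view: the situation of Figure 10, with cores 3 and 7 asleep.
print(active_heaters({3, 7}))

# Temporal view: which cores sleep in each slot of one rotation period.
for slot in range(5):
    print(slot, sorted(sleeping_cores(slot)))
```

A real scheduler would also have to account for workload demand and measured temperatures; the sketch only shows how the spatial (neighbor heat) and temporal (rotation) views can be expressed separately and then combined.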

7. CONCLUSIONS

This paper explores the idea of deep rejuvenation for electronic systems, inspired by the active recovery that takes place during sleep in biological systems. By controlling the ratio of active vs. sleep time or the sleep conditions (negative voltage and high temperature), the transistors can be deeply rejuvenated. A first-order model and comprehensive experimental results demonstrate that the proposed techniques can significantly accelerate recovery. We demonstrate several cases that bring stressed chips to within 90% of their original margin by actively rejuvenating for only 1/4 of the stress time. On-chip implementations of the proposed techniques are discussed. One limitation of the work is that the first-order model is optimistic in that it ignores other aging effects, such as Electromigration (EM), and uses a first-order estimate for calculating delay. Additionally, the effects of chip-to-chip variations on aging are ignored for now. Future work includes exploring deep rejuvenation on a periodic schedule and developing a virtual circadian rhythm. By exploiting the extra flexibility offered by circadian rhythms, the power, performance and area (PPA) metrics can be significantly improved. Since the time before the next scheduled deep rejuvenation is known in advance, there is a good opportunity for device/circuit/architecture/system cross-layer optimization that takes advantage of it.

8. ACKNOWLEDGMENTS

This work was supported in part by NSF under Grant No. CCF-1255907, and by SRC through the Global Research Collaboration (GRC) program under task ID 2410.001. We would also like to thank Mr. Alec Roelke for discussions.

9. REFERENCES

[1] ITRS Edition Report, http://www.itrs.net/reports.html, 2011.

[2] M. Agarwal, et al. “Circuit failure prediction and its application to transistor aging,” Proc. IEEE VLSI Test Symp., pp. 277-286, 2007.

[3] W. Wang, et al. “The impact of NBTI on the performance of combinational and sequential circuits,” Proc. DAC, pp. 364-369, Jun. 2007.

[4] W. Wang, “Circuit aging in scaled CMOS design: Modeling, simulation, and prediction,” Arizona State University, Tempe, Doctoral Dissertation, 2008.

[5] S. Pae, et al. “BTI reliability of 45 nm high-K+ metal gate process technology,” Proc. IRPS, pp. 352-357, 2008.

[6] S. Jafar, et al. “A comparative study of NBTI and PBTI (charge trapping) in SiO2/HfO2 stacks with FUSI, TiN, Re gates,” Proc. VLSI Circuits, pp. 23-25, 2006.

[7] T. Kim, et al. “Silicon Odometer: An on-chip reliability monitor for measuring frequency degradation of digital circuits,” J. Solid State Circuits, vol. 43, no. 4, pp. 874-880, 2008.

[8] A. C. Cabe, et al. “Small embeddable NBTI sensors (SENS) for tracking on-chip performance decay,” Proc. IEEE Int. Symp. Quality Electron. Des., pp. 1-6, 2009.

[9] Z. Qi and M.R. Stan, “NBTI Resilient Circuits Using Adaptive Body Biasing,” Proc. ACM Great Lakes VLSI Symposium, pp. 285-290, 2008.

[10] S.V. Kumar, et al. “Adaptive Techniques for Overcoming Performance Degradation due to Aging in Digital Circuits,” Proc. IEEE ASP-DAC, pp. 284-289, 2009.

[11] L. Zhang and R. Dick, “Scheduled voltage scaling for increasing lifetime in the presence of NBTI,” Proc. IEEE ASP-DAC, pp. 492-497, 2009.

[12] S. Gupta and S. S. Sapatnekar, “GNOMO: Greater-than-NOMinal Vdd Operation for BTI mitigation,” Proc. IEEE ASP-DAC, pp. 271-276, 2012.

[13] S. Gupta and S. S. Sapatnekar, “Employing Circadian Rhythms to Enhance Power and Reliability,” TODAES, pp. 1-23, 2013.

[14] J. Shin, et al. “A Proactive Wearout Recovery Approach for Exploiting Microarchitectural Redundancy to Extend Cache SRAM Lifetime,” Proc. ISCA, pp. 353-362, 2008.

[15] J. B. Velamala, et al. “Physics Matters: Statistical Aging Prediction under Trapping/Detrapping,” Proc. DAC, pp. 139-144, 2012.

[16] K. Ramakrishnan, et al. “Impact of NBTI on FPGAs,” 20th Intl. Conf. on VLSI Design, pp. 717-722, 2007.

[17] A. Maiti, et al. “The Impact of Aging on an FPGA-Based Physical Unclonable Function,” Proc. FPL, pp. 151-156, 2011.

[18] S. Kiamehr, et al. “Investigation of NBTI and PBTI induced aging in different LUT implementations,” Intl. Conf. on FPT, pp. 1-8, 2011.

[19] S. Velusamy, et al. “Monitoring temperature in FPGA based SoCs,” Proc. Comput. Des. Conf., pp. 634-637, 2005.

[20] S. Mukhopadhyay, et al. “Capacitive coupling based transient negative bit-line voltage (Tran-NBL) scheme for improving write-ability of SRAM design in nanometer technologies,” Proc. ISCAS, pp. 384-387, 2008.

Figure 10. Illustration of Multi-core System Self-Healing

Figure 9. Illustration of wearout vs. accelerated recovery

