# Design and Optimization of On-Chip Interconnects Using Wave-Pipelined Multiplexed Routing

Ajay J. Joshi, Student Member, IEEE, Gerald G. Lopez, Student Member, IEEE, and Jeffrey A. Davis, Senior Member, IEEE

Abstract-Every new VLSI technology generation has resulted in interconnects increasingly limiting the performance, area, and power dissipation of new processors. Subsequently, it is necessary to devise efficient interconnect design techniques to reduce the impact of VLSI interconnects on overall system design. New optimizations of a wave-pipelined multiplexed (WPM) interconnect routing circuit are described in this paper. These WPM circuits can be used with current interconnect repeater circuits to further reduce interconnect delay, interconnect area, transistor area, and/or power dissipation. For example, new area constrained WPM circuit optimizations illustrate that the interconnect circuit power can be reduced by 26% or the interconnect performance can be improved by 74%. Moreover, in both these cases, because a significant number of repeaters are eliminated, the transistor area can reduce by 41% or 29%, respectively. Finally, the tolerance of WPM circuits to crosstalk noise, power supply noise, clock skew, and manufacturing variations is also presented. This study of tolerance levels defines the conditions under which the WPM circuit will function correctly, and it is shown in this paper for the first time that WPM circuits are robust enough to operate with variability that can be encountered in deep submicrometer technologies.

*Index Terms*—Low-power high-performance design, on-chip interconnects, on-chip networks, time-division multiplexing (TDM), wave-pipelined multiplexing (WPM).

# I. INTRODUCTION

INTERCONNECTS have become the performance bottleneck of current VLSI and future gigascale integrated (GSI) design systems. With the stress on improving system performance with every new technology generation to meet the market needs, it is imperative to devise novel and practical interconnect design techniques to reduce the interconnect delay and improve overall system performance. Various design [1], [2] and material solutions [3] have been proposed to solve this interconnect problem.

Repeater insertion, for example, is one of the most commonly adopted strategies to reduce the interconnect delay. It is shown in [2] that inserting an optimal number and optimal size of equispaced repeaters on an interconnect reduces the relationship between interconnect length and interconnect delay from quadratic to linear. This is a significant reduction in interconnect delay that directly translates into an improvement in interconnect performance. The repeater insertion technique can also be used to reduce power and/or area [4]–[6] while maintaining interconnect performance.

Manuscript received October 13, 2006; revised March 20, 2007.

A. J. Joshi is with Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA 02139 USA (e-mail: joshi@mit.edu).

G. G. Lopez and J. A. Davis are with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA.

Digital Object Identifier 10.1109/TVLSI.2007.902209

Along with this repeater insertion technique, there is an opportunity to implement wire sharing pervasively across the entire range of interconnects to further improve the interconnect design. The wave-pipelined multiplexed (WPM) routing technique described in [7] can be used to implement wire sharing. This WPM routing technique takes advantage of intra-clock period interconnect idleness and transmits multiple data signals over the interconnect in a wave-pipelined fashion. This WPM routing technique can be used along with repeater insertion to improve interconnect design. In this paper, new WPM design optimizations are presented to minimize power, area, and/or interconnect delay. In addition, the tolerance of WPM circuits to inherent circuit and system variations is investigated.

A brief overview of the WPM routing technique is presented in Section II. Section III discusses various power, area, and/or delay optimization strategies using WPM routing. Given that an electrical circuit should work under both best case and worst case scenarios for it to be used commercially, the tolerance level of the WPM circuit to crosstalk noise, power supply noise, clock skew, and manufacturing variations is described in Section IV, followed by concluding remarks in Section V.

# II. WPM ROUTING

A detailed description of the WPM interconnect routing technique is presented in [7]. This technique takes advantage of the inherent interconnect idleness and sends multiple signals in a wave-pipelined fashion on a single shared interconnect in a single clock cycle. This wire sharing technique has been designed such that there is no reduction in the overall throughput performance of the system after application of WPM routing. Figs. 1 and 2 show the schematic diagram of the circuitry required for implementing conventional and WPM routing [7], respectively. As can be seen from Figs. 1 and 2, two dedicated wires can be replaced by a single shared wire using WPM routing. The WPM circuit uses a 2:1 multiplexer, a 1:2 demultiplexer, buffers, and some delay circuitry for correct sampling and routing of data. The required signal  $\varphi_{\min}$  can be easily generated locally at both the sender and receiver side using the global clock that is distributed across the entire chip. Based on the delay constraints described in [7], it is possible to generate either a single-stage WPM (SSWPM) or double-stage WPM (DSWPM) interconnect.

The primary advantage of this wire sharing technique is the reduction in the number of interconnects that need to be routed.



Fig. 1. Schematic diagram for conventional routing.



Fig. 2. Schematic diagram for WPM routing.

This reduction in interconnect count directly results in a reduction of the interconnect area and transistor area. In addition, there is an opportunity to increase the spacing between the WPM interconnect and its neighbors to reduce power, area, and/or delay by optimizing different design parameters. Thus, depending on the design at hand, various design parameters can be tweaked to generate an optimal WPM interconnect design.

# III. OPTIMIZATION OF WPM CIRCUIT DESIGN

As described in Section II, two dedicated interconnects can be replaced by a single shared interconnect using WPM routing. The elimination of one interconnect frees up additional routing area that can be harnessed in a variety of ways. For example, the spacing between the interconnects can be increased so as to fill up all available routing area. This increase in wire spacing decreases the coupling capacitance between the neighboring interconnects. As a result, smaller-sized drivers and receivers can be used, resulting in a decrease in the total device capacitance. Hence, there is an opportunity to reduce both dynamic and static power dissipation, and/or improve performance with WPM circuits. Similarly, the width of the interconnect can be increased to fill up available routing area. This will reduce the overall interconnect delay and improve performance. This will also help in the reduction of the transistor area.

Fig. 3. Cross-sectional view of conventional design and WPM design.

To illustrate WPM optimization, two dedicated interconnects with repeaters, each of length 1.0 cm and designed to operate at 1.3 GHz for 100-nm technology, are considered. The interconnects are designed using RLC modeling. The inductance and capacitance components are extracted using RAPHAEL. However, it should be noted that as repeaters are inserted onto the interconnects the inductive effects of the wires reduce and these effects can be ignored if the equivalent resistance of the interconnect plus the driver resistance of each segment is greater than its characteristic impedance [8]. It is assumed that the two interconnects have active lines as their neighbors as shown in Fig. 3. The dimensions of these two interconnects are designed such that they require 70% of the clock period for data transmission. It is assumed that a guardband of 20% of the clock period is necessary to account for clock skew and signal variations. Even though each interconnect remains idle for only 10% of the clock period, this time is enough to schedule a second signal in a wave-pipelined fashion, using the WPM technique, on either interconnect without any loss of throughput performance. Hence, we use WPM routing and replace these two interconnects by a single shared WPM interconnect, which uses the overhead circuitry described in Section II. The WPM design will have a single shared interconnect with two active lines as neighbors as shown in Fig. 3.
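The timing budget described above can be sketched numerically. The following is a minimal illustration that assumes only the figures quoted in the text (a 1.3-GHz clock, 70% of the period for data transmission, and a 20% guardband); the function name `timing_budget` and its parameters are hypothetical.

```python
def timing_budget(clock_hz, transmit_frac=0.70, guard_frac=0.20):
    """Split one clock period into transmission, guardband, and idle time (seconds)."""
    period = 1.0 / clock_hz
    transmit = transmit_frac * period
    guard = guard_frac * period
    idle = period - transmit - guard  # what remains for scheduling a second signal
    return {"period": period, "transmit": transmit, "guardband": guard, "idle": idle}


# 1.3 GHz is the operating frequency used in the case study.
budget = timing_budget(1.3e9)
for name, seconds in budget.items():
    print(f"{name:>9}: {seconds * 1e12:6.1f} ps")
```

With these fractions, the idle time is one tenth of the period, which is the slack the text relies on to schedule the second signal.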

# A. Minimum Wire Area Design

This optimization represents the more classic application of WPM to reduce wire area only and will represent a baseline to compare to other optimizations in this section. Fig. 4 illustrates how interconnect routing area (interconnect pitch × interconnect length) changes as a function of the number of inserted repeaters for a conventional design with two dedicated interconnects and the WPM design with a single shared interconnect channel. It is assumed that the interconnect dimensions can be changed such that the overall wire delay is constant along the curves in Fig. 4. One can observe from Fig. 4 that a simple application of WPM decreases the total wire area by 50%. This reduction in interconnect count decreases repeater count and, therefore, decreases active transistor area. Fig. 5 shows the variation in the transistor area for different numbers of repeaters. Even with WPM overhead, one can get more than a 15% reduction in the transistor area at the optimal design point.



Fig. 4. WPM design-interconnect area versus number of repeaters.



Fig. 5. WPM design-transistor area versus number of repeaters.

An increase in the static power and the dynamic power of the system is observed after application of this type of WPM routing. The dynamic power increases due to the increase in the activity factor of the shared resources and the additional WPM overhead circuitry. In addition, the static power of the overhead circuitry required for implementing the WPM routing also contributes to the power equation. As a result, there is an increase in the total power dissipated by the wire-area-centric WPM design. At the minimum wire area design point, close to 20% increase in the total power is observed for this classic application of WPM as shown in Fig. 6.

# B. Low-Power and Low-Area Design (Balanced Design)

The elimination of interconnects resulting from WPM increases available routing area and provides an opportunity to increase wire spacing, which can result in lower wire capacitance and smaller driver sizes. Furthermore, because of crosstalk constraints [9], increasing wire spacing can enable an increase in dielectric thickness, which reduces the effective ground capacitance. The wire spacing and dielectric thickness are increased in the same proportion, such that the crosstalk constraints and processing constraints [9] are not violated. Figs. 7–9 show the variation in interconnect area, transistor area, and power with the number of repeaters for a low-power design. The increase in wire spacing and dielectric thickness decreases interconnect capacitance, which decreases interconnect delay to less than 70% of the clock period. Hence, the pitch and width of the interconnect are proportionately decreased so that the delay will be equal to 70% of the clock period (as in the conventional design). This enables the use of smaller sized drivers/receivers. The resulting decrease in interconnect capacitance and device capacitance decreases the total power dissipated by the system, creating an opportunity for a low-power design. For Spwire = (wire spacing/wire width) = 1.5 and Hoxide = (dielectric thickness/wire width) = 1.5, a 6% reduction in power can be observed at the optimal point in Fig. 9. The decrease in interconnect count and driver/receiver sizes decreases the interconnect area and transistor area, respectively, of the system. A 44% decrease in interconnect area and a 29% decrease in transistor area can be observed at the optimal point.

Fig. 6. WPM design-total power versus number of repeaters.

Fig. 7. Low-power and low-area design—interconnect area versus number of repeaters.

# C. Minimum Power Design

It is possible to further increase wire spacing such that the interconnect area of the WPM circuit is the *same* as that of the conventional circuit. This would represent a minimum power design for the two-wire circuit. In our case study, for Spwire = 4.9 and Hoxide = 2.0, the total interconnect area of the WPM circuit is equal to the interconnect area of the conventional circuit. Here, it is assumed that Hoxide can have a maximum value of 2.0 due to manufacturing constraints. Fig. 10 shows the variation in power with the number of repeaters for the conventional and minimum power designs. Even with the inclusion of WPM overhead circuits, the WPM design dissipates 26% less power than the conventional circuit at the optimal point. Fig. 11 shows the transistor area for the conventional and WPM interconnect designs. As expected, a significant reduction in transistor area is observed for the minimum power design. At the optimal point, a 41% reduction in the total transistor area is observed.

Fig. 8. Low-power and low-area design-transistor area versus number of repeaters.

Fig. 9. Low-power and low-area design-total power versus number of repeaters.

# D. High Performance Design

The WPM routing technique can also be optimized to improve the latency performance of the interconnect circuit for a given area constraint. The performance of large systems can be limited either by the delay of the longest interconnect on the chip, which is routed on the global tier, or the logic critical path. Let us assume that the system performance is being limited by interconnect delay. In such a case, for a given wire area WPM routing can be used to reduce the interconnect delay and in turn improve system performance.

TABLE I COMPARISON BETWEEN CONVENTIONAL DESIGN AND HIGH PERFORMANCE DESIGN

| Design type             | Interconnect<br>width (μm) | Interconnect<br>spacing (μm) | Transistor area<br>(cm²) | Operating<br>frequency (GHz) | Power dissipation<br>(normalized) |
|-------------------------|----------------------------|------------------------------|--------------------------|------------------------------|-----------------------------------|
| Conventional design     | 0.383                      | 0.383                        | 5.24e-6                  | 1.3                          | 1.0                               |
| High performance design | 0.691                      | 0.613                        | 3.72e-6                  | 2.26                         | 1.07                              |



Fig. 10. Minimum power design-total power versus number of repeaters.



Fig. 11. Minimum power design-transistor area versus number of repeaters.

The interconnects routed on a global tier have wide widths and a high aspect ratio [10]. The elimination of an interconnect using WPM routing frees up a significant amount of routing area that can be used to reduce interconnect delay. Fig. 12 shows a redesign approach that can be adopted for interconnects in the global tier to improve performance. As can be seen from Fig. 12, the width of the WPM interconnect is increased to be equal to the height of the interconnect. This increase in interconnect cross-section reduces the interconnect resistance. In addition, the spacing between the interconnects is also increased. This helps in reducing the interconnect coupling capacitance. Though the ground capacitance increases due to the increase in interconnect width, there is still a reduction in the *RC* interconnect delay.



Fig. 12. High performance design using WPM routing (cross-sectional view of the global tier).

To quantify the performance improvement, a 1.0-cm-long interconnect is designed to operate at 1.3 GHz using 100-nm technology. A suboptimal number [11] and suboptimal size [12] of repeaters are inserted on the interconnects. An aspect ratio of 1.8 [10] is assumed and the wire spacing is assumed to be equal to the wire width for the conventional design. Table I shows a comparison between the conventional design and the high performance design. While calculating the delay of a high performance design, the delay of the overhead circuitry is also included. The operating frequency (performance) of the system is calculated as $0.8/t_{\text{latency}}$, where $t_{\text{latency}}$ is the latency of the second bit. It can be observed from the table that a 74% improvement in latency performance is obtained with no net increase in wire area per bit. In addition, a 29% reduction in transistor area and a 7% increase in power dissipation are also obtained.
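The frequency relation and the percentages quoted from Table I can be checked with a short calculation. The sketch below takes the Table I entries as inputs; the helper names are hypothetical, and the reading of the 0.8 factor as the fraction of the clock period left after the 20% guardband is an assumption rather than something this section states explicitly.

```python
def operating_frequency(t_latency_s, usable_fraction=0.8):
    """Operating frequency as 0.8 / t_latency (t_latency = latency of the second bit)."""
    return usable_fraction / t_latency_s


def latency_for_frequency(freq_hz, usable_fraction=0.8):
    """Inverse of operating_frequency: the second-bit latency implied by a frequency."""
    return usable_fraction / freq_hz


# Values taken from Table I.
f_conv, f_wpm = 1.3e9, 2.26e9        # operating frequency (Hz)
area_conv, area_wpm = 5.24e-6, 3.72e-6  # transistor area (cm^2)

t_conv = latency_for_frequency(f_conv)
t_wpm = latency_for_frequency(f_wpm)
improvement = f_wpm / f_conv - 1.0
area_reduction = 1.0 - area_wpm / area_conv

print(f"implied latency, conventional: {t_conv * 1e12:.0f} ps")
print(f"implied latency, WPM:          {t_wpm * 1e12:.0f} ps")
print(f"performance improvement:       {improvement:.0%}")
print(f"transistor area reduction:     {area_reduction:.0%}")
```

The two ratios round to the 74% and 29% figures reported in the text.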

# E. Comparison of WPM Circuit Optimizations

The advantages obtained using WPM routing in terms of interconnect area, transistor area, performance, and power dissipation for different designs are summarized in Table II. A simple application of WPM routing reduces interconnect area by 50% and transistor area by 15% while maintaining performance. However, there is an increase in power. If the interconnect area is reduced by 44%, then a reduction in both transistor area (29%) and power (6%) is observed. This gives us a balanced WPM design. If the interconnect area in WPM design is maintained to be equal to the conventional design, then a minimum power and low area design is generated. Here, a significant reduction in transistor area (41%) and power (26%) is obtained while maintaining performance. Similarly, for the high performance design, WPM routing can result in a 74% improvement in latency performance while still reducing transistor area by 29%. There is no change in the interconnect area for the high performance design.

TABLE II COMPARISON BETWEEN DIFFERENT DESIGN STYLES USING WPM ROUTING

| Design type                              | Interconnect area<br>(normalized) | Transistor area<br>(normalized) | Performance<br>(normalized) | Power dissipation<br>(normalized) |
|------------------------------------------|-----------------------------------|---------------------------------|-----------------------------|-----------------------------------|
| Conventional design                      | 1.0                               | 1.0                             | 1.0                         | 1.0                               |
| WPM minimum wire area design             | 0.5                               | 0.85                            | 1.0                         | 1.2                               |
| WPM balanced design                      | 0.56                              | 0.71                            | 1.0                         | 0.94                              |
| WPM minimum power and<br>low area design | 1.0                               | 0.59                            | 1.0                         | 0.74                              |
| WPM high performance<br>design           | 1.0                               | 0.71                            | 1.74                        | 1.07                              |
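As a cross-check, the normalized entries of Table II can be converted into the percentage changes quoted in the text. The snippet below is only a convenience sketch; the dictionary layout and the helper `percent_change` are hypothetical, and the numbers are copied from Table II.

```python
METRICS = ("interconnect area", "transistor area", "performance", "power")

# Normalized values from Table II (conventional design = 1.0 in every column).
TABLE_II = {
    "WPM minimum wire area": (0.50, 0.85, 1.00, 1.20),
    "WPM balanced": (0.56, 0.71, 1.00, 0.94),
    "WPM minimum power and low area": (1.00, 0.59, 1.00, 0.74),
    "WPM high performance": (1.00, 0.71, 1.74, 1.07),
}


def percent_change(normalized):
    """Percentage change relative to the conventional design (negative = reduction)."""
    return (normalized - 1.0) * 100.0


for design, values in TABLE_II.items():
    changes = ", ".join(
        f"{metric} {percent_change(value):+.0f}%"
        for metric, value in zip(METRICS, values)
    )
    print(f"{design}: {changes}")
```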

# IV. IMPACT OF VARIATIONS ON WPM CIRCUITS

The WPM routing technique can provide significant advantages in terms of interconnect area, transistor area, power dissipation, and interconnect performance. However, to harness these advantages it is imperative for the circuit to function correctly under both best case and worst case conditions. The performance of an interconnect circuit can be severely affected by crosstalk noise, power supply noise, clock skew, manufacturing variations, etc. It is, therefore, necessary to provide adequate guardbands and at times even overdesign the WPM circuit to ensure correct circuit operation under various nonideal conditions. The tolerance levels of the WPM circuit to external noise are discussed here. As described in [7], the circuitry for WPM routing uses static CMOS logic, dynamic latch, and transmission gate designs. The WPM routing technique requires delay matching at the receiver side to correctly sample the signals transmitted on the shared wire. This delay matching is directly affected by delay variations resulting from crosstalk noise, power supply noise, and manufacturing variations. The tolerance of the WPM circuit to these delay variations is discussed in this section. The signal $\varphi_{\min}$, which is used to correctly sample the data at the sender side and the receiver side, is generated from a global clock. The existence of clock skew in the different clock domains in a high-performance system can affect the correct operation of the WPM circuit. The effect of clock skew on the WPM circuit is also explored. In addition, the effect of manufacturing variations on the working of the WPM circuit is also determined.

# A. Crosstalk and Dynamic Delay

The interconnects in a multilevel interconnect network are routed orthogonally on adjacent routing levels. The capacitive coupling between interconnects routed on adjacent routing channels may result in noise transients that cause variations in the latency of the interconnects. As the WPM routing technique is highly dependent on the delay constraints given in [7], it is necessary to study the impact of crosstalk noise on the delay of the shared wires.

In order to analyze the impact of crosstalk on WPM routing, an interconnect system consisting of five interconnects and two nonideal ground planes is considered. The interconnect system that is used to study the capacitive crosstalk effects is shown in Fig. 13. The total interconnect self capacitances and the mutual capacitances are denoted by $C_g$ and $C_m$, respectively. Though not shown explicitly in Fig. 13, both near capacitance and far capacitance components are considered while calculating $C_m$. The two ground planes can be replaced by orthogonal active lines; however, [8] has shown that the assumption of ground planes above and below is a good first-order approximation for calculating the interconnect capacitance.

Fig. 13. Interconnect system with five interconnects and two ground planes.

Fig. 14. Different switching patterns in the five interconnect system.

Depending on the way an interconnect network is designed, the neighboring lines of an interconnect can be active signal lines or ground lines. To provide the most general analysis, we will have three different types of switching patterns for this five interconnect system. The three different switching patterns are shown in Fig. 14.

In the first switching pattern, only the center interconnect C switches and all the neighboring interconnects are ground lines. In the second switching pattern, all interconnects switch in the same direction, while in the third switching pattern alternate interconnects switch in opposite directions. The figure also shows the effective switching capacitance of the center interconnect. As can be seen from Fig. 14, the switching capacitance is maximum for pattern (3), i.e., when the adjacent interconnects switch in opposite directions, so the interconnect latency will be maximum for pattern (3). On the other hand, the switching capacitance is minimum for pattern (2), resulting in a minimum interconnect latency.

As described in [7], at the receiver side of the WPM circuit, the signal  $\varphi_{\min}$  is delayed to give  $\varphi_1$  and  $\varphi_2$  which are used to sample the signals received at the input of the demultiplexer. In order to ensure that the received signal is correctly sampled, the signal  $\varphi_1$  should be wide enough to account for any variations in interconnect latency. To account for latency variations due to capacitive crosstalk the signal  $\varphi_1$  (and, hence, signal  $\varphi_{\min}$ ) should be at least as wide as the difference between the worst case delay and best case delay of the interconnect.

Using Sakurai's model for interconnect delay [13], the delay for an interconnect having a resistance R, a capacitance C, and k repeaters can be given by

$$\tau_{\rm int} = (1.02R_{\rm seg}C_{\rm seg} + 2.3R_tC_{\rm seg} + 2.3R_tC_t + 2.3R_{\rm seg}C_t) \cdot k \tag{1}$$

where  $R_{seg}(=R/k)$  is the resistance of the interconnect segment,  $C_{seg}(=C/k)$  is the switching capacitance of the interconnect segment,  $R_t$  is the output resistance of the repeater driver, and  $C_t$  is the input capacitance of the repeater driver. For switching pattern (2) in Fig. 14, the switching capacitance  $C_{seg} = 2C_g/k$  and for switching pattern (3) in Fig. 14 the switching capacitance  $C_{seg} = (2C_g + 4C_m)/k$ . Assuming the repeaters have been designed for worst case delay for both patterns, the minimum width of signal  $\varphi_1$  to maintain signal integrity can be given by the difference between the worst case and best case delay, or

$$\tau_{\rm WPM} = \left[\frac{(1.02R_{\rm seg} + 2.3R_t)(2C_g + 4C_m - 2C_g)}{k}\right] \cdot k \tag{2}$$

$$\tau_{\rm WPM} = (1.02R_{\rm seg} + 2.3R_t)(4C_m). \tag{3}$$
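The step from (1) to (3) can be checked numerically: evaluating (1) for the switching capacitances of patterns (3) and (2) and subtracting should reproduce the closed form in (3). The sketch below does this; the function names are hypothetical, and the resistance and capacitance values are made up purely for illustration and are not taken from the paper.

```python
import math


def interconnect_delay(R, C, k, R_t, C_t):
    """Eq. (1): delay of a line with total resistance R, total switching
    capacitance C, and k repeaters (driver resistance R_t, input capacitance C_t)."""
    R_seg = R / k
    C_seg = C / k
    per_segment = (1.02 * R_seg * C_seg + 2.3 * R_t * C_seg
                   + 2.3 * R_t * C_t + 2.3 * R_seg * C_t)
    return per_segment * k


def crosstalk_spread(R, C_m, k, R_t):
    """Eq. (3): worst-case minus best-case delay, (1.02 R_seg + 2.3 R_t) * 4 C_m."""
    return (1.02 * (R / k) + 2.3 * R_t) * 4.0 * C_m


# Hypothetical values chosen only for illustration (not from the paper).
R, C_g, C_m = 2000.0, 0.4e-12, 0.3e-12  # ohms, farads, farads
k, R_t, C_t = 8, 200.0, 20e-15          # repeater count, ohms, farads

best = interconnect_delay(R, 2 * C_g, k, R_t, C_t)             # pattern (2)
worst = interconnect_delay(R, 2 * C_g + 4 * C_m, k, R_t, C_t)  # pattern (3)
spread = crosstalk_spread(R, C_m, k, R_t)

print(f"best-case delay:  {best * 1e12:.1f} ps")
print(f"worst-case delay: {worst * 1e12:.1f} ps")
print("Eq. (3) matches the direct difference:",
      math.isclose(worst - best, spread, rel_tol=1e-9))
```

The terms in (1) that do not contain $C_{\rm seg}$ cancel in the subtraction, which is why only the factor $(1.02R_{\rm seg} + 2.3R_t)$ survives in (3).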

The minimum pulsewidth [14] of the signal  $\varphi_{\min}$  that is necessary to ensure data transmitted over the interconnect reaches the input of the demultiplexer is given by

$$\tau_{\text{WPM,limit}} = \sigma_{\text{RCseg}} \ln \left( \frac{K_1}{1 - \upsilon_1} \right) + \Delta_{\text{repeater}} \tag{4}$$

where

$$\sigma_{\text{RCseg}} = R_t C_t + R_t C_{\text{seg}} + C_t R_{\text{seg}} + 0.4 R_{\text{seg}} C_{\text{seg}} \tag{5}$$

$$K_1 = 1.01 \left[ \frac{R_t C_{\text{seg}} + R_{\text{seg}} C_t + R_{\text{seg}} C_{\text{seg}}}{R_t C_{\text{seg}} + R_{\text{seg}} C_t + \frac{\pi}{4} R_{\text{seg}} C_{\text{seg}}} \right] \tag{6}$$

$$\upsilon_{k-1} = \frac{1}{2 - \upsilon_k} \tag{7}$$

from [13] and $\upsilon_1$ is the fraction of the full supply voltage at the output of the first repeater segment.
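Equations (4)–(6) translate directly into a small calculation. The sketch below is one possible reading, under stated assumptions: the function names are hypothetical, $\upsilon_1$ and $\Delta_{\text{repeater}}$ are treated as plain inputs because this section does not spell out how they are obtained, and the recursion in (7) is not implemented. The numerical values are illustrative only.

```python
import math


def sigma_rc_seg(R_seg, C_seg, R_t, C_t):
    """Eq. (5): time constant of one repeater segment."""
    return R_t * C_t + R_t * C_seg + C_t * R_seg + 0.4 * R_seg * C_seg


def k1_factor(R_seg, C_seg, R_t, C_t):
    """Eq. (6): the coefficient K_1."""
    common = R_t * C_seg + R_seg * C_t
    return 1.01 * (common + R_seg * C_seg) / (common + (math.pi / 4.0) * R_seg * C_seg)


def min_pulsewidth(R_seg, C_seg, R_t, C_t, v1, delta_repeater):
    """Eq. (4): minimum pulsewidth of phi_min.

    v1 is the fraction of the supply voltage at the first segment output
    (0 < v1 < 1); delta_repeater is supplied by the caller (assumption)."""
    if not 0.0 < v1 < 1.0:
        raise ValueError("v1 must lie strictly between 0 and 1")
    sigma = sigma_rc_seg(R_seg, C_seg, R_t, C_t)
    k1 = k1_factor(R_seg, C_seg, R_t, C_t)
    return sigma * math.log(k1 / (1.0 - v1)) + delta_repeater


# Hypothetical segment parameters, for illustration only.
R_seg, C_seg, R_t, C_t = 250.0, 0.25e-12, 200.0, 20e-15
limit = min_pulsewidth(R_seg, C_seg, R_t, C_t, v1=0.9, delta_repeater=10e-12)
print(f"tau_WPM,limit = {limit * 1e12:.1f} ps")
```

Because the numerator in (6) always exceeds the denominator for positive component values, $K_1$ stays above 1.01, and the logarithm in (4) is positive for any $\upsilon_1$ between 0 and 1.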

Fig. 15 shows the WPM circuit and its corresponding timing diagram. In the timing diagram,  $\varphi_{s(x=0)}$  and  $\varphi_{(x=l)}$  give the data signal at the beginning and the end of the interconnect, respectively.  $\varphi_{\min}$  is the signal that is used at both the sender side and the receiver side for correct sampling. Its pulsewidth is given by  $\tau_{\rm WPM}$. For the timing diagram, it is assumed that during the first clock period logic bit 1 follows logic bit 1, and during the second clock period logic bit 0 follows logic bit 0 over the shared wire. In the case of no delay variations, the interconnect delay will be equal to  $\tau_{\rm int(ideal)}$. Hence, the required  $\tau_{\rm WPM}$  will be equal to  $2\tau_{\rm guardband}$  (see Fig. 15) or the minimum pulsewidth [14], whichever is greater. It is not necessary to add extra guardbands if the minimum pulsewidth is used, as it provides the necessary time for correct sampling. The guardbands provide the necessary setup and hold times for the signal to be correctly sampled. On the other hand, in the case of delay variations due to crosstalk noise, the pulsewidth should be equal to  $\tau_{\rm WPM} = \tau_{\rm int(worstcase)} - \tau_{\rm int(bestcase)} + 2\tau_{\rm guardband}$. Here too, the guardbands provide the necessary setup and hold times for correct sampling of data bits. In Fig. 15, the calculation of the necessary pulsewidth for the case when there is delay variation due to crosstalk is shown. There can be a significant difference between the pulsewidth that is required when delay variations exist and when they do not. Based on the delay constraints in [7], this variation in pulsewidths directly affects the number of interconnects that can be designed as SSWPM interconnects.
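The two pulsewidth rules stated above can be written as a pair of small helpers. This is a minimal sketch with hypothetical function names; the guardband and minimum-pulsewidth values in the usage lines are illustrative assumptions, while the two delays are the conventional-insertion entries for the 2.00E-05 cm row of Table III.

```python
def pulsewidth_no_variation(t_guardband, t_min_pulse):
    """No delay variation: the greater of 2 * guardband and the minimum pulsewidth."""
    return max(2.0 * t_guardband, t_min_pulse)


def pulsewidth_with_crosstalk(t_worst, t_best, t_guardband):
    """With crosstalk: (worst-case delay - best-case delay) + 2 * guardband."""
    if t_worst < t_best:
        raise ValueError("worst-case delay must not be smaller than best-case delay")
    return (t_worst - t_best) + 2.0 * t_guardband


# Illustrative guardband and minimum pulsewidth (assumed, not from the paper).
t_guard, t_min = 20e-12, 60e-12
ideal = pulsewidth_no_variation(t_guard, t_min)
noisy = pulsewidth_with_crosstalk(7.99e-10, 5.64e-10, t_guard)

print(f"no delay variation: {ideal * 1e12:.0f} ps")
print(f"with crosstalk:     {noisy * 1e12:.0f} ps")
```

Even with these rough numbers, the crosstalk case demands a much wider pulse, which is the gap the text identifies as limiting how many interconnects qualify as SSWPM.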

For conventional repeater insertion, the difference in the worst case interconnect delay and the best case interconnect delay can be fairly large. Hence, the pulsewidth  $\tau_{WPM}$  required for sampling can be quite large resulting in a small fraction of interconnects that can be classified as SSWPM interconnects, and as a result, there will be less opportunity to apply WPM routing. In order to avoid such a case, a staggered repeater insertion technique after [1] can be adopted. Fig. 16 shows a five-interconnect system where repeaters are inserted in a staggered manner.

If opposite switching signals are transmitted on adjacent interconnects, then half of the interconnect segment experiences similar switching on the neighboring interconnect, while the other half experiences opposite switching. This staggered repeater insertion significantly reduces the worst case delay because of the reduction of net crosstalk. In case similar switching signals are transmitted on adjacent interconnects, the neighboring interconnects with staggered repeaters still experience some coupling, and the delay is more than the case where repeaters are inserted in conventional fashion. Thus, the worst case delay using staggered repeater insertion is less than that exhibited for conventional repeater insertion for switching pattern (3) in Fig. 14, and the best case delay using staggered repeater insertion is more than that exhibited by conventional repeater insertion for switching pattern (2) in Fig. 14. Table III shows this comparison between the similar switching and opposite switching delay for conventional and staggered repeater insertion determined using HSPICE for a 1.0-cm-long interconnect. Here, the interconnect switching capacitances are determined using the RC2 models in RAPHAEL. As repeaters are inserted on the interconnects, the interconnect becomes resistive and, as described earlier, the inductance can be ignored in this analysis. However, inductance and inductive coupling are included in the HSPICE simulations. The difference between the worst case and best case delay for staggered repeater insertion is more than 60% smaller than the corresponding difference for conventional repeater insertion. So, if a staggered repeater insertion technique is adopted, a smaller control pulse for the  $\varphi_{\min}$  signal will be sufficient for correct sampling of data at the receiver side. This would significantly increase the number of interconnects that can be designed as SSWPM interconnects.

Figs. 17 and 18 show a plot of the pulsewidth  $\tau_{\rm WPM}$  that is necessary to avoid any loss of signal integrity due to crosstalk noise when repeaters are inserted in a conventional and a staggered fashion. Here,  $\tau_{\rm WPM}$  is equal to the difference between worst case and best case interconnect delay plus  $2\tau_{\rm guardband}$  ($\tau_{\rm guardband}$ = 50% pulsewidth). The absolute minimum pulsewidth that is required for data transmission is also plotted in Figs. 17 and 18. In Fig. 17, the interconnect length is maintained at 1.0 cm, while the interconnect width is varied from 0.1 to 0.9  $\mu$m. The aspect ratio is assumed to be 1.0, and the spacing between the wires is assumed to be equal to the width of the wires. A suboptimal number and suboptimal size of repeaters are inserted on these interconnects. The gray shaded area shows the set of pulsewidths that can avoid signal integrity loss due to capacitive coupling noise.

Fig. 15. WPM circuit and timing diagram showing the necessary pulsewidth to tolerate crosstalk noise.

As can be seen in Fig. 17, for small wire widths, the necessary pulsewidth for staggered repeater insertion is more than the minimum pulsewidth that can travel across the interconnect. Beyond a width of 0.2  $\mu$m, the two plots cross and the pulsewidth becomes limited by the minimum pulsewidth that can travel across the interconnect. Thus, if we use staggered repeater insertion, then for interconnect widths larger than 0.2  $\mu$m there would be no loss of signal integrity due to capacitive coupling between neighboring interconnects. As we go to larger wire widths, the necessary pulsewidth plot for conventional repeater insertion crosses the plot for minimum pulsewidth. Hence, for wire widths of 0.7  $\mu$m and above, even if the conventional repeater insertion technique is used, there will be no loss in signal integrity.

Similarly in Fig. 18, the interconnect width is fixed at 0.2  $\mu$ m and the interconnect length is varied from 0.1 to 1.3 cm. Here, too, a suboptimal number and suboptimal size of repeaters are

| Interconnect width (cm) | Conventional repeater insertion: similar switching delay (s) | Conventional repeater insertion: opposite switching delay (s) | Staggered repeater insertion: similar switching delay (s) | Staggered repeater insertion: opposite switching delay (s) |
|-------------------------|------------------|------------------|------------------|------------------|
| 1.00E-05                | 1.18E-09         | 1.59E-09         | 1.21E-09         | 1.37E-09         |
| 2.00E-05                | 5.64E-10         | 7.99E-10         | 5.76E-10         | 6.68E-10         |
| 3.00E-05                | 3.43E-10         | 5.21E-10         | 3.78E-10         | 4.41E-10         |
| 4.00E-05                | 2.49E-10         | 3.86E-10         | 2.79E-10         | 3.25E-10         |
| 5.00E-05                | 2.06E-10         | 3.05E-10         | 2.20E-10         | 2.55E-10         |

TABLE III COMPARISON OF INTERCONNECT DELAY FOR CONVENTIONAL REPEATER INSERTION AND STAGGERED REPEATER INSERTION USING HSPICE



Fig. 16. Staggered repeater insertion.



Fig. 17. Minimum pulse widths required to avoid loss of data integrity due to crosstalk noise (fixed interconnect length).

inserted on these interconnects. It can be seen from this plot that for shorter interconnects the required pulsewidth for the control signal  $\varphi_{\min}$  is limited by the minimum pulsewidth. For interconnects that are 0.9 cm and longer, the pulsewidth required will be limited by the pulsewidth for staggered repeater insertion.

Thus, the pulsewidth of the control signal that will be used for correct sampling of the data at the receiver side and at the sender side (the data pulsewidth is equal to the pulsewidth of the sampling signal) can be increased to prevent any loss of data integrity due to the crosstalk noise. The pulsewidth should be at least as large as the difference between the worst case delay and the best case delay of the interconnect.
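The lower bound stated above can be checked directly against the Table III data. The following Python sketch is illustrative only: the function names are hypothetical, and the guardband is left as a free input because its absolute value is not tabulated in this section.

```python
# Table III delays in seconds for a 1.0-cm interconnect (HSPICE values from the text).
# Key: interconnect width in micrometers (1.00E-05 cm = 0.1 um).
# Value: (conv_similar, conv_opposite, stag_similar, stag_opposite).
TABLE_III = {
    0.1: (1.18e-9, 1.59e-9, 1.21e-9, 1.37e-9),
    0.2: (5.64e-10, 7.99e-10, 5.76e-10, 6.68e-10),
    0.3: (3.43e-10, 5.21e-10, 3.78e-10, 4.41e-10),
    0.4: (2.49e-10, 3.86e-10, 2.79e-10, 3.25e-10),
    0.5: (2.06e-10, 3.05e-10, 2.20e-10, 2.55e-10),
}


def delay_spread(best_case, worst_case):
    """Difference between the worst case and best case delay."""
    return worst_case - best_case


def required_pulsewidth(best_case, worst_case, guardband):
    """tau_WPM = delay spread + 2 * guardband (guardband supplied by the caller)."""
    return delay_spread(best_case, worst_case) + 2.0 * guardband


def spread_reduction(row):
    """Fractional reduction of the delay spread obtained by staggering repeaters."""
    conv_similar, conv_opposite, stag_similar, stag_opposite = row
    conventional = delay_spread(conv_similar, conv_opposite)
    staggered = delay_spread(stag_similar, stag_opposite)
    return 1.0 - staggered / conventional


reductions = {width: spread_reduction(row) for width, row in TABLE_III.items()}
for width_um, reduction in sorted(reductions.items()):
    print(f"{width_um:.1f} um: delay spread reduced by {100.0 * reduction:.1f}%")
```

For every tabulated width the computed reduction exceeds 60%, which is consistent with the figure quoted for staggered repeater insertion.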

### B. Power Supply Noise

In a VLSI design, the simultaneous switching in neighboring circuits within a short duration of time can cause considerably large current spikes, which increase the IR drop and result in



Fig. 18. Minimum pulse widths required to avoid loss of data integrity due to crosstalk noise (fixed interconnect dimensions).

L(di/dt) noise over the power supply network [15]. Power supply noise varies the supply voltage seen by the devices, which changes the transistor drive current and may significantly affect the latency of interconnects with repeaters. The WPM circuit design uses delay matching at the receiver side to correctly sample the data received at the input of the demultiplexer and route it to the appropriate sink. Hence, it is imperative to study the effect of power supply noise on the WPM routing technique.

To study the impact of power supply noise on the overall delay of the interconnect, a 1.0-cm-long interconnect with 0.2  $\mu$ m width is modeled using *RLC* modeling. The interconnect aspect ratio is 1.0 and the interconnect spacing is equal to the interconnect width. A suboptimal number and suboptimal size of repeaters are inserted on these interconnects. The circuit is assumed to be driven by a supply voltage of 1.2 V with a  $\pm 10\%$ variation.

A lower supply voltage reduces the transistor drive current, which increases the delay of the interconnect circuit. Conversely, a higher supply voltage increases the drive current and reduces the wire delay. For the WPM routing technique, we are concerned with the difference between the worst case delay and the best case delay. Fig. 19 shows this difference resulting from power supply variations when neighbors are switching in the same direction and in the opposite direction. The figure also plots the minimum sustainable pulsewidth that can travel across the interconnect. As can be seen from the figure, the difference between the worst case and best case delay for both types of neighbors is much smaller than the minimum pulsewidth. Thus,



Fig. 19. Delay variations due to power supply noise.

variations due to power supply noise can be easily tolerated by the WPM circuit.
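The direction of the supply-voltage effect can be illustrated with a simple device model. This section does not state one, so the sketch below substitutes the alpha-power-law approximation (drive current proportional to $(V_{dd} - V_t)^{\alpha}$) purely as an assumption; the threshold voltage and exponent are hypothetical values, and only the 1.2-V supply and the ±10% range come from the text. It models the repeater contribution only, not the HSPICE results of Fig. 19.

```python
VDD_NOMINAL = 1.2   # volts, from the text
VARIATION = 0.10    # +/-10% supply variation, from the text


def relative_gate_delay(vdd, vt=0.3, alpha=1.3):
    """Gate delay up to a constant factor: t ~ Vdd / (Vdd - Vt)**alpha.

    vt and alpha are hypothetical illustration values, not taken from the paper.
    """
    return vdd / (vdd - vt) ** alpha


nominal = relative_gate_delay(VDD_NOMINAL)
slow = relative_gate_delay(VDD_NOMINAL * (1.0 - VARIATION)) / nominal
fast = relative_gate_delay(VDD_NOMINAL * (1.0 + VARIATION)) / nominal

print(f"delay at low supply:  {slow:.3f} x nominal")
print(f"delay at high supply: {fast:.3f} x nominal")
```

Under these assumptions the low-supply corner is slower than nominal and the high-supply corner is faster; the quantity of interest for WPM routing is the gap between the two corners, which must stay below the minimum sustainable pulsewidth.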

The power supply noise has a similar effect on the delay circuitry used at the receiver side of the WPM circuit. As the drive current is directly proportional to the supply voltage, a reduction in the supply voltage increases the charge–discharge time of the transistors in the inverter chain, resulting in an increase in the delay of $\varphi_{\min}$. Similarly, if the supply voltage increases, the drive current increases and the charge–discharge time of the transistors decreases, which results in a decrease in the delay of $\varphi_{\min}$. For the delayed $\varphi_{\min}$ signal to correctly sample the signal at the receiver side, it is imperative that the variations in the delay of $\varphi_{\min}$ be minimal. A guardband is therefore introduced when calculating the pulsewidth of the signal $\varphi_{\min}$ that is generated locally at the receiver side. This guardband ensures that the signals are correctly sampled by the demultiplexer even if there are variations in the delay of the signal $\varphi_{\min}$.

#### C. Clock Skew

With the continuous increase in the operating frequency of future GSI systems [16], it is difficult to maintain the clock signals in phase with each other in different regions of a system. This has resulted in the existence of multiple clock domains in high performance digital systems. A significant amount of clock skew can exist among different clock domains. The signal  $\varphi_{\min}$  that is used for sampling of data at both the sender and the receiver side is generated by simply ANDing the global clock signal and a delayed global clock signal. Clock skew directly affects the time when the signal  $\varphi_{\min}$  is generated, which might result in a delay mismatch at the receiver side. Thus, clock skew can have an adverse effect on the working of the WPM circuit.
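The generation of $\varphi_{\min}$ and its sensitivity to skew can be sketched with an idealized discrete-time model. The Python code below is a minimal illustration that assumes a 50% duty cycle clock and integer time ticks; the function names and the numeric values are hypothetical, and only the ANDing of the clock with its delayed copy comes from the text.

```python
def clock(t, period):
    """Ideal 50%-duty-cycle clock sampled at integer time t (arbitrary ticks)."""
    return (t % period) < period // 2


def phi_min(t, period, delay, skew=0):
    """AND of the (possibly skewed) clock and a copy of it delayed by `delay`."""
    local_t = t - skew  # positive skew: the local clock edge arrives late
    return clock(local_t, period) and clock(local_t - delay, period)


def pulse_width(period, delay, skew=0):
    """Number of ticks per period during which phi_min is high."""
    return sum(phi_min(t, period, delay, skew) for t in range(period))


def rising_edge(period, delay, skew=0):
    """First tick in one period at which phi_min goes from low to high."""
    for t in range(period):
        if phi_min(t, period, delay, skew) and not phi_min(t - 1, period, delay, skew):
            return t
    return None


PERIOD, DELAY = 1000, 300  # hypothetical tick counts
print("pulsewidth:", pulse_width(PERIOD, DELAY))
print("edge, no skew:", rising_edge(PERIOD, DELAY))
print("edge, +50 ticks skew:", rising_edge(PERIOD, DELAY, skew=50))
```

In this model the pulsewidth is set by the delay element alone, while skew shifts the whole sampling window by the same amount, which is the mechanism behind the delay mismatch described above.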

To analyze the tolerance of WPM routing to clock skew, it is assumed that if the leading edge of the clock input to a region arrives after the expected global reference leading edge, then the input clock in that clock domain has positive clock skew. On the other hand, if the clock input to a region arrives before the expected time, then the region has negative clock skew. Though the existence of clock skew in a clock domain can be determined with respect to a global reference signal, the relative clock skew between two clock domains may be much more or much less than

TABLE IV CLOCK SKEW TOLERANCE OF A WPM CIRCUIT FOR DIFFERENT INTERCONNECT LENGTHS

| Interconnect<br>length (cm) | Maximum<br>positive clock<br>skew (% of global<br>clock period) | Maximum<br>negative clock<br>skew (% of global<br>clock period) |
|-----------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------|
| 0.5                         | 10.209                                                          | 10.186                                                          |
| 0.6                         | 9.286                                                           | 9.286                                                           |
| 0.7                         | 9.275                                                           | 9.560                                                           |
| 0.8                         | 9.548                                                           | 9.792                                                           |
| 0.9                         | 9.806                                                           | 10.196                                                          |
| 1                           | 10.312                                                          | 9.794                                                           |

the clock skew between individual clock domains and the global reference.

In case of WPM routing, if there is positive clock skew at the receiver side with respect to the sender side, then the signal  $\varphi_{\min}$  that is generated using the global clock will be delayed. As a result, the sampling window will get delayed and it will not correctly sample the signal at receiver side. Similarly, if the clock skew is negative, then the  $\varphi_{\min}$  signal will be generated early and it will reach the sampling circuitry early, which will again result in an incorrect sampling of data. This will significantly affect the signal integrity in a WPM circuit.

To study the tolerance level of an interconnect circuit to clock skew, interconnects having lengths varying from 0.5 to 1.0 cm are designed for 100-nm technology. It is assumed that these interconnects will lie on the same tier. Based on the multilevel interconnect network design approach proposed in [11], the pitch values are chosen such that the longest interconnect would require 70% of the clock period for data transmission. Out of the remaining 30% of the clock period, 20% will serve as the guardband and 10% will be enough to schedule a second signal over the WPM interconnect in a wave-pipelined fashion. The interconnect system is assumed to be designed for 1.3 GHz. A suboptimal number and suboptimal size of repeaters are inserted on the interconnect. Clock skew is deliberately introduced in between the clock signals that are used to generate the $\varphi_{\min}$ signal at the sender side and receiver side. As a result, the signal $\varphi_{\min}$ gets delayed or arrives early at the input of the sampling circuitry.
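The clock-period budget described above can be made concrete with a few lines of arithmetic. The Python sketch below uses only the 1.3-GHz frequency and the 70%/20%/10% split from the text; the variable names are illustrative.

```python
CLOCK_HZ = 1.3e9                 # design frequency from the text
period_ps = 1e12 / CLOCK_HZ      # clock period in picoseconds

# Fractions of the clock period allocated for the longest interconnect.
BUDGET = {
    "data transmission": 0.70,
    "guardband": 0.20,
    "second wave-pipelined signal": 0.10,
}

budget_ps = {name: fraction * period_ps for name, fraction in BUDGET.items()}

print(f"clock period: {period_ps:.1f} ps")
for name, duration in budget_ps.items():
    print(f"{name}: {duration:.1f} ps")
```

The three allocations sum to the full clock period, so any skew that is tolerated must be absorbed within the guardband share.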

The upper and lower limits of clock skew tolerance are determined based on signal integrity analysis. Within this range, the data signal integrity is maintained. Table IV shows the maximum positive and negative clock skew that can be tolerated by the WPM circuit across a range of interconnect lengths. The skew tolerance is presented in the table as a percentage of the clock period. On average, the WPM circuit has a skew tolerance of approximately 10% of the clock period for both positive and negative clock skew.
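The average tolerance can be recomputed from the Table IV entries. The short Python sketch below does so; the data are copied from the table and the variable names are illustrative.

```python
# Table IV: interconnect length (cm) -> (max positive skew, max negative skew),
# both expressed as a percentage of the global clock period.
TABLE_IV = {
    0.5: (10.209, 10.186),
    0.6: (9.286, 9.286),
    0.7: (9.275, 9.560),
    0.8: (9.548, 9.792),
    0.9: (9.806, 10.196),
    1.0: (10.312, 9.794),
}

positive = [pos for pos, _ in TABLE_IV.values()]
negative = [neg for _, neg in TABLE_IV.values()]

mean_positive = sum(positive) / len(positive)
mean_negative = sum(negative) / len(negative)

print(f"mean positive skew tolerance: {mean_positive:.2f}% of clock period")
print(f"mean negative skew tolerance: {mean_negative:.2f}% of clock period")
print(f"smallest tolerance in the table: {min(positive + negative):.3f}%")
```

Both means fall slightly below 10%, so the 10% figure is best read as a rounded average rather than a guaranteed bound for every length.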

#### D. Manufacturing Variations

The impact of die-to-die and within-die variations on WPM routing is explored. For our purposes, variation in gate length is analyzed. According to the ITRS [16], gate CD control can be achieved with a $\pm 3\sigma$, 10% variation in gate length after etch. Following the techniques in Bowman's device variation analysis of critical path delay [17], the statistical simulations employ the same $\pm 3\sigma$, 10% variation in gate length as described by the ITRS. It is assumed that

|       | tdelay1 [ns] | tdelay3 [ns] | tdelay5 [ns] | tdelay7 [ns] | tdelay9 [ns] | tdelay11 [ns] |
|-------|--------------|--------------|--------------|--------------|--------------|---------------|
| Run 1 | 0.13         | 0.30         | 0.46         | 0.62         | 0.78         | 0.94          |
| Run 2 | 0.13         | 0.30         | 0.46         | 0.62         | 0.78         | 0.94          |
| Run 3 | 0.13         | 0.30         | 0.46         | 0.62         | 0.78         | 0.94          |

TABLE V DELAY CIRCUIT MEAN VALUES [ns]

TABLE VI DELAY CIRCUIT  $6\sigma$  Values [ns]

|       | tdelay1 [ns] | tdelay3 [ns] | tdelay5 [ns] | tdelay7 [ns] | tdelay9 [ns] | tdelay11 [ns] |
|-------|--------------|--------------|--------------|--------------|--------------|---------------|
| Run 1 | 0.04         | 0.08         | 0.12         | 0.16         | 0.20         | 0.24          |
| Run 2 | 0.04         | 0.08         | 0.12         | 0.16         | 0.20         | 0.24          |
| Run 3 | 0.04         | 0.08         | 0.12         | 0.15         | 0.19         | 0.23          |



Fig. 20. Schematic circuitry of the delay circuitry in WPM routing.



Fig. 21. Delay circuitry timing variations due to gate length process variations.

die-to-die and within-die components of variation are equal contributors. In determining spatial correlation between devices, it is assumed that gate length is highly correlated for devices found within 100 $\mu$m of each other [18]. Two circuits are analyzed for the impact of device variation: the delay circuitry and the interconnect circuitry.
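The sampling assumptions above can be sketched as a small Monte Carlo generator. This is not the flow used for the HSPICE runs; it is a minimal Python illustration that interprets "equal contributors" as equal variances, normalizes the nominal gate length to 1.0, and ignores the 100-$\mu$m spatial correlation. All function and parameter names are hypothetical.

```python
import random
import statistics


def sample_gate_lengths(n_dies, devices_per_die, l_nominal=1.0,
                        three_sigma_fraction=0.10, seed=0):
    """Draw gate lengths with die-to-die and within-die Gaussian components.

    Assumption: the two components have equal variance, and together they
    give a total 3-sigma variation of `three_sigma_fraction` of nominal.
    """
    rng = random.Random(seed)
    sigma_total = three_sigma_fraction * l_nominal / 3.0
    sigma_component = sigma_total / 2 ** 0.5  # equal split of the variance
    lengths = []
    for _ in range(n_dies):
        die_offset = rng.gauss(0.0, sigma_component)      # shared by one die
        for _ in range(devices_per_die):
            device_offset = rng.gauss(0.0, sigma_component)  # per device
            lengths.append(l_nominal + die_offset + device_offset)
    return lengths


lengths = sample_gate_lengths(n_dies=1000, devices_per_die=10)
print(f"mean gate length: {statistics.fmean(lengths):.4f} (normalized)")
print(f"3-sigma spread:   {3.0 * statistics.pstdev(lengths):.4f} (normalized)")
```

Each sampled length would then parameterize one circuit simulation; a more faithful model would additionally correlate the within-die offsets of neighboring devices.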

Fig. 20 shows the schematic of the delay circuit used in WPM routing. The delay circuit with gate length variation is simulated 1000 times in each of 3 separate runs (3000 total HSPICE simulations) to ensure experimental integrity. Tables V and VI tabulate the mean and $\pm 6\sigma$ values, respectively, of each of the three runs. Here, tdelay1, tdelay3, tdelay5, tdelay7, tdelay9, and tdelay11 indicate the delay experienced by a pulse at stages 1, 2, 3, 4, 5, and 6, respectively, of the delay circuitry. For a single circuit, no difference in the mean "tdelay" values is observed between a nominal delay circuit and a circuit with device variation. The criterion used to determine whether manufacturing variations will cause a timing failure is to compare the $\pm 6\sigma$ variation to 50% of the pulsewidth, which is 0.25 ns. If the $6\sigma$ value is below this 0.25-ns threshold, the variation is not

TABLE VII INTERCONNECT DELAY MEAN VALUES [ns]

|       | delay1 [ns] | delay2 [ns] |
|-------|-------------|-------------|
| Run 1 | 1.57        | 1.83        |
| Run 2 | 1.57        | 1.83        |
| Run 3 | 1.57        | 1.83        |

TABLE VIII INTERCONNECT DELAY  $6\sigma$  Values [ns]

|       | delay1 [ns] | delay2 [ns] |
|-------|-------------|-------------|
| Run 1 | 0.22        | 0.15        |
| Run 2 | 0.23        | 0.16        |
| Run 3 | 0.23        | 0.16        |

enough to cause a WPM failure. For the experiments performed, it is shown in Table VI that the delay circuits are well within the aforementioned criteria. Sample distributions for Run 1 are illustrated in Fig. 21.

The interconnect circuit with device variation is simulated in the same fashion as the delay circuitry previously described. Here, delay1 and delay2 stand for the time of arrival of the first data and second data, respectively, at the input of the receiver side circuit of the WPM circuit. For a single interconnect circuit, no difference in the mean "delay" values is observed between an ideal interconnect circuit and an interconnect circuit with gate length variation. Values for the mean and $\pm 6\sigma$ are presented in Tables VII and VIII, respectively. For the experiments performed, it is shown in Table VIII that the $\pm 6\sigma$ values for the interconnect circuits are well within the criterion for correct WPM circuit operation. Sample distributions for Run 1 are illustrated in Fig. 22.

In summary, statistical simulations are run to analyze the impact of process variations on the delay and interconnect circuitry. Both circuits exhibit no change in mean delay values when compared to ideal circuits. Finally, the  $\pm 6\sigma$  values are found to be within specification of the WPM routing design.
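The pass/fail criterion can be applied mechanically to the tabulated $6\sigma$ values. The Python sketch below assumes that the 0.25-ns figure is the comparison threshold (50% of the pulsewidth); the data are copied from Tables VI and VIII, and the helper name is hypothetical.

```python
THRESHOLD_NS = 0.25  # assumed to be 50% of the pulsewidth, per the stated criterion

# 6-sigma values in ns, copied from Table VI (delay circuit, six stages)
# and Table VIII (interconnect, delay1 and delay2).
TABLE_VI = {
    "Run 1": [0.04, 0.08, 0.12, 0.16, 0.20, 0.24],
    "Run 2": [0.04, 0.08, 0.12, 0.16, 0.20, 0.24],
    "Run 3": [0.04, 0.08, 0.12, 0.15, 0.19, 0.23],
}
TABLE_VIII = {
    "Run 1": [0.22, 0.15],
    "Run 2": [0.23, 0.16],
    "Run 3": [0.23, 0.16],
}


def worst_margin(table, threshold=THRESHOLD_NS):
    """Smallest remaining margin (threshold minus 6-sigma) over all entries."""
    return min(threshold - value for values in table.values() for value in values)


for name, table in (("delay circuit", TABLE_VI), ("interconnect", TABLE_VIII)):
    margin = worst_margin(table)
    status = "pass" if margin > 0 else "fail"
    print(f"{name}: worst margin {margin:.2f} ns ({status})")
```

Under this reading of the criterion every entry passes, with the tightest margin occurring at the last stage of the delay circuit.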

# V. CONCLUSION

The design and optimization of an on-chip WPM interconnect circuit is presented in this paper. This WPM routing technique can be applied pervasively with repeater insertion to further improve overall interconnect design. By optimizing various interconnect design parameters, it is possible to reduce area, power, and/or delay of an interconnect using WPM routing. For a balanced WPM interconnect design, it is possible to reduce interconnect area by 44%, transistor area by 29%, and power dissipation by 6% with no change in performance, while for a minimum power WPM design the power dissipation and transistor area can be reduced by 26% and 41%, respectively. There is no change in interconnect performance or area for the minimum power design. Similarly, for the high performance WPM design, performance can be increased by 74% and transistor area can be reduced by 29% with no change in interconnect area.

Fig. 22. Interconnect circuit delay variations due to gate length process variations.

The tolerance levels of WPM routing to variations due to crosstalk noise, power supply noise, clock skew, and manufacturing variations are also presented in this paper. It is shown that by designing the WPM circuit for worst case wire delay and having the pulsewidth for the sampling signal  $\varphi_{\min}$  to be at least equal to the difference between the worst case delay and the best case delay, capacitive crosstalk noise can be tolerated. On the other hand, in case of power supply noise, the delay variations resulting from variable supply voltage are much smaller compared to those due to crosstalk noise. Thus, the WPM circuit can easily tolerate power supply noise. Furthermore, a case study is presented which shows that the WPM circuit can tolerate an average of 10% positive and negative clock skew. Finally, the effect of manufacturing variations on the functioning of the delay circuit and interconnect circuit of the WPM design is also studied. The  $\pm 6\sigma$  values for the various runs on the delay circuit and interconnect circuit are well within the tolerance levels for the WPM design.

# References

- [1] J. Xu and W. Wolf, "A wave-pipelined on-chip interconnect structure for networks-on-chips," in *Proc. HOTI*, 2003, pp. 10–14.
- [2] H. Bakoglu and J. Meindl, "Optimal interconnection circuits for VLSI," *IEEE Trans. Electron Devices*, vol. ED-32, no. 5, pp. 903–909, May 1985.
- [3] S. Thompson, M. Alavi, M. Hussein, P. Jacob, C. Kenyon, P. Moon, M. Prince, S. Sivakumar, S. Tyagi, and M. Bohr, "130 nm logic technology featuring 60 nm transistors, low-K dielectrics, and Cu interconnects," *Intel Technol. J.*, vol. 06, no. 02, pp. 5–13, May 2002.
- [4] K. Banerjee and A. Mehrotra, "A power-optimal repeater insertion methodology for global interconnects in nanometer designs," *IEEE Trans. Electron Devices*, vol. 49, no. 11, pp. 2001–2007, Nov. 2002.

- [5] V. Adler and E. Friedman, "Repeater design to reduce delay and power in resistive interconnect," *IEEE Trans. Circuits Syst. I, Fundam. Theory Appl.*, vol. 45, pp. 607–616, May 1998.
- [6] G. Garcea, N. van der Meijs, and R. Otten, "Simultaneous analytical area and power optimization for repeater insertion," in *Proc. ICCAD*, 2003, pp. 568–573.
- [7] A. Joshi and J. Davis, "Wave-pipelined multiplexed (WPM) routing for gigascale integration (GSI)," *IEEE Trans. Very Large Scale Integr.* (VLSI) Syst., vol. 13, no. 8, pp. 899–910, Aug. 2005.
- [8] A. Naeemi, "Analysis and optimization for global interconnects for gigascale integration (GSI)," Ph.D. dissertation, Electr. Comput. Eng. Dept., Georgia Inst. Technol., Atlanta, 2003.
- [9] J. Davis, "A hierarchy of interconnect limits and opportunities for gigascale integration (GSI)," Ph.D. dissertation, Electr. Comput. Eng. Dept., Georgia Inst. Technol., Atlanta, 1999.
- [10] S. Thompson, N. Anand, M. Armstrong, C. Auth, B. Arcot, M. Alavi, P. Bai, J. Bielefeld, R. Bigwood, J. Brandenburg, M. Buehler, S. Cea, V. Chikarmane, C. Choi, R. Frankovic, T. Ghani, G. Glass, W. Han, T. Hoffmann, M. Hussein, P. Jacob, A. Jain, C. Jan, S. Joshi, C. Kenyon, J. Klaus, S. Klopcic, J. Luce, Z. Ma, B. Mcintyre, K. Mistry, A. Murthy, P. Nguyen, H. Pearson, T. Sandford, R. Schweinfurth, R. Shaheed, S. Sivakumar, M. Taylor, B. Tufts, C. Wallace, P. Wang, C. Weber, and M. Bohr, "A 90 nm logic technology featuring 50 nm strained silicon channel transistors, 7 layers of Cu interconnects, low k ILD and 1 μm<sup>2</sup> SRAM cell," in *Proc. IEDM*, 2002, pp. 61–64.
- [11] R. Venkatesan, J. Davis, K. Bowman, and J. Meindl, "Optimal n-tier multilevel interconnect architectures for gigascale integration (GSI)," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 9, no. 6, pp. 899–912, Dec. 2001.
- [12] Y. Cao, C. Hu, X. Huang, A. Kahng, S. Muddu, D. Stroobandt, and D. Sylvester, "Effects of global interconnect optimizations on performance estimation of deep submicron design," in *Proc. ICCAD*, 2003, pp. 56–61.
- [13] T. Sakurai, "Closed-form expressions for interconnection delay, coupling, and crosstalk in VLSI's," *IEEE Trans. Electron Devices*, vol. 40, no. 1, pp. 118–124, Jan. 1993.
- [14] V. Deodhar and J. Davis, "Optimization for throughput performance for low power VLSI interconnects," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 13, no. 3, pp. 308–318, Mar. 2005.
- [15] H. Bakoglu, Circuits, Interconnections and Packaging for VLSI. Reading, MA: Addison-Wesley, 1990.
- [16] ITRS, "International Technology Roadmap for Semiconductors," 2004 [Online]. Available: http://public.itrs.net
- [17] K. A. Bowman, S. G. Duvall, and J. D. Meindl, "Impact of die-to-die and within-die parameter fluctuations on the maximum clock frequency distribution for gigascale integration," *IEEE J. Solid-State Circuits*, vol. 37, no. 2, pp. 183–190, Feb. 2002.
- [18] P. Friedberg, Y. Cao, J. Cain, R. Wang, J. Rabaey, and C. Spanos, "Modeling within-die spatial correlation effects for process-design co-optimization," in *Proc. ISQED*, 2005, pp. 516–521.



**Ajay Joshi** (S'98) was born in Mumbai, India. He received the B.E. degree in computer engineering from the University of Mumbai, Mumbai, India, in 2001, and the M.S. degree and Ph.D. degree in electrical engineering from the Georgia Institute of Technology, Atlanta, in 2003 and 2006, respectively.

He is currently working as a Postdoctoral Associate in the Integrated Systems Group (ISG), Massachusetts Institute of Technology, where he is exploring the scope of using photonics and CNTs for on-chip communication. In 2001, he joined the Advanced Interconnect Modeling and Design (AIMD) Research Group, Georgia Institute of Technology. In 2003, he worked as an intern with the Post Silicon Validation Debug Group, Intel Corporation, Santa Clara, CA, where he was responsible for developing two system debug tools. Over the past four years, he has coauthored several publications in international conferences and refereed journals. His research interests include interconnect modeling, network-on-chip design, high-speed low-power digital design, vertical integration, and physical design.



**Gerald G. Lopez** (S'02) received the B.S. degree (*cum laude*) in computer engineering from the University of Maryland, Baltimore County, in 2001, as a Meyerhoff Scholar, and the M.S. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, in 2002, where he is currently pursuing the Ph.D. degree in electrical and computer engineering.

**Jeffrey A. Davis** (M'99–SM'05) received the B.E.E., M.S.E.E., and Ph.D. degrees from the Georgia Institute of Technology, Atlanta, in 1993, 1997, and 1999, respectively.

He is currently an Associate Professor in the School of Electrical and Computer Engineering, Georgia Institute of Technology. He has coauthored over 60 refereed journal, conference, and workshop publications. He has also co-edited a book entitled *Interconnect Technology and Design for Gigascale Integration* (Kluwer, 2003). His current research interests focus on developing compact VLSI interconnect models, creating novel global interconnect circuits, and understanding the limitations and opportunities of system-level interconnect prediction.

Dr. Davis was a recipient of the National Science Foundation CAREER Award for excellence as a young educator and researcher in January 2001, the 2002–2003 Outstanding ECE Junior Faculty Award, and the 2004–2005 Class of 1940 W. Roane Beard Outstanding Teacher Award from the Georgia Institute of Technology. In 2002, he was the general chair for the International Workshop on System Level Interconnect Prediction (SLIP), and he served as a Guest Editor for a special issue of the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS on SLIP.