Home/Magazine Archive/September 2018 (Vol. 61, No. 9)/Can Beyond-CMOS Devices Illuminate Dark Silicon?/Full Text

Contributed articles
## Can Beyond-CMOS Devices Illuminate Dark Silicon?

For more than 50 years, Moore's Law has been the fundamental economic driver of the microprocessor industry,^{17} with the number of on-chip (CMOS) transistors doubling with each technology generation. As a corollary, microprocessor performance also doubles as a result of transistor scaling, per Dennard scaling^{6} and Pollack's Rule.^{4} Unfortunately, performance-scaling trends have abated due to increased sub-threshold leakage current and decreased supply-voltage scaling.^{3} Consequently, computer architects have adopted multi-core architectures in an attempt to maintain processor performance scaling via parallel processing.^{4}

While multi-core processors have succeeded in delivering modest (approximately linear) performance gains, projections indicate they will encounter a power wall as transistors continue to scale.^{7} Specifically, Esmaeilzadeh et al.^{7} suggested that as the number of transistors continues to double, power densities will approach the physical and economic limits of a chip's thermal design power (TDP), necessitating the selective activation of on-chip devices. This phenomenon is colloquially referred to as "dark silicon"^{7} and has inspired a range of solutions,^{27} including "beyond-CMOS" devices,^{1} "dim silicon" cores,^{8} customized accelerators,^{28} and even combinations of all such approaches, to produce heterogeneous architectures.^{26} While each approach offers novel ideas to overcome the "dark silicon" phenomenon, beyond-CMOS devices are the most fundamental but also the most unpredictable choice for overcoming the limitations of CMOS devices.^{27}

Here, we present a methodology to benchmark beyond-CMOS devices at the architectural level. Our benchmarking model, which we refer to as the new Dark Silicon, or nDS, model, is purely analytical and based on two existing analytical benchmarking methodologies. The first is the "Beyond-CMOS Benchmarking" version 3 (BCBv3) methodology^{19} that aims to benchmark beyond-CMOS devices at the circuit level. The second is an architectural-level approach called the "Dark Silicon," or DS, model, that explores the limits of multi-core scaling within a fixed TDP and area budget for CMOS-based technology.^{7} Combining these two methodologies and introducing three modifications, we were able to benchmark beyond-CMOS devices at the architectural level for multi-core processors executing parallel workloads, or PARSEC benchmarks.^{2} To investigate the state of the art, we input four promising beyond-CMOS devices to nDS and quantified their performance and percentage of dark silicon with respect to CMOS. Ultimately, we want to determine if beyond-CMOS devices can overcome the growth of dark silicon and provide a future path for the continuation of Moore's Law.

First, we briefly introduce our selected beyond-CMOS devices and their parameters used in our case study. We then summarize the BCBv3^{19} and Dark Silicon^{7} models that are the foundation of our framework.

**Overview of steep-slope devices.** From the classes of beyond-CMOS devices, we selected four representative "steep-slope" devices that initial device/circuit-level benchmarking efforts suggest are among the most competitive with CMOS.^{19} Specifically, we selected the heterojunction III-V TFET (HetJTFET),^{21} Negative Capacitance FET (NCFET),^{20} gallium nitride TFET (GaNTFET),^{16} and two-dimensional hetero-junction interlayer TFET (Thin-TFET).^{14} Steep-slope devices offer a sub-threshold slope below the intrinsic limit for CMOS of 60mV/decade. The result is increased on-current and reduced leakage current at low supply voltage when compared to CMOS. Table 1 reports the CMOS and steep-slope device parameters used in our model.

**Table 1. Benchmarking parameters for CMOS and steep-slope devices.**

**BCB overview.** The first beyond-CMOS benchmarking effort by Bernstein et al.,^{1} BCBv1, which evolved from Phase 1.0 of the Semiconductor Research Corporation's Nanoelectronics Research Initiative, aimed to assemble device researchers to collectively benchmark their devices. However, due to a lack of unifying benchmarking guidelines, no conclusive argument could be made from the device comparisons. To overcome this lack of guidelines, Nikonov and Young^{18} developed BCBv2, an analytical methodology built on BCBv1, by which all device parameters for a given device class would be derived from the same uniform assumptions, relations, and schemes. These computations yielded estimates for performance, area, switching delay, and energy. BCBv3^{19} was released in 2015 to reflect an improved understanding of beyond-CMOS devices and their operations. In addition to updated device parameters, BCBv3 also included new logic-circuit configurations (such as sequential logic) and computation of standby power for each device. Figure 1 is a sample output of BCBv3 comparing dynamic switching energy vs. delay of CMOS and our selected devices for an inverter fanout-of-4 (INVFO4), a circuit used to estimate input/output signal delay.

**Figure 1. BCBv3 sample output showing dynamic energy vs. delay of an inverter INVFO4 for various device technologies; device input parameters are reported in Table 1.**

**Dark silicon model overview.** In 2011, Esmaeilzadeh et al.^{7} explored multi-core scaling limits for five technology-node generations (45nm to 8nm) to project how core scaling might affect the performance (measured in SPECmarks) of multi-core processors. They developed an analytical model to compute potential performance gains on parallel applications, or PARSEC,^{2} for each such node. This DS model spans from the device level to the architectural level and includes a device-scaling model, a core-scaling model, and a multi-core-scaling model.

The device-scaling model considers trends associated with device scaling, including area, frequency, and power, from 45nm to 8nm, based on optimistic^{9} and conservative^{5} projection schemes.^{a} For each technology node, scaling factors for frequency and power are derived by normalizing their projections against empirical 45nm CMOS data. It is important to note here that frequency-scaling factors derived from the International Technology Roadmap for Semiconductors (ITRS) projections are based on INVFO4 simulations from the Model for the Assessment of CMOS Technologies and Roadmaps (MASTAR) for each technology node.^{9} In this sense, the DS model uses INVFO4 delay to determine the clock frequency of a multi-core processor.

The core-scaling model provides projections for the maximum performance a single core can achieve for a given area. Moreover, it also projects core power for selected core performance. It is derived by creating two scatter plots—core area vs. performance and core power vs. performance—using empirical 45nm processor data. For both plots, SPECmark is used as a measure of performance, representing aggregate performance of the SPEC benchmark suite.^{24} The authors then plotted the Pareto-optimal frontier for processors based on 45nm technology. For each technology generation, the scaling factors derived in the device model were then used to scale the area vs. performance and power vs. performance Pareto frontiers. The result is a set of processor core projections for each technology node.
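The frontier construction and scaling just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DS model's actual code: the `Core` record, the function names, and every numeric value are hypothetical, and the frontier here is simply the set of non-dominated points rather than a fitted curve.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Core:
    perf: float   # performance score (e.g., SPECmark); hypothetical values below
    area: float   # core area in mm^2
    power: float  # core power in W

def area_perf_frontier(cores):
    """Keep only cores not matched or beaten in performance by a core of smaller or equal area."""
    frontier, best = [], float("-inf")
    for c in sorted(cores, key=lambda c: (c.area, -c.perf)):
        if c.perf > best:
            frontier.append(c)
            best = c.perf
    return frontier

def scale(cores, freq_factor, power_factor, area_factor):
    """Project 45nm design points to another node using per-node scaling factors."""
    return [Core(c.perf * freq_factor, c.area * area_factor, c.power * power_factor)
            for c in cores]

# Hypothetical 45nm design points and scaling factors, for illustration only.
cores_45nm = [Core(10.0, 8.0, 6.0), Core(8.0, 9.0, 7.0), Core(16.0, 20.0, 18.0)]
frontier_45nm = area_perf_frontier(cores_45nm)
frontier_next = scale(frontier_45nm, freq_factor=2.0, power_factor=0.5, area_factor=0.25)
```

Note that performance is multiplied by the frequency-scaling factor, so a factor below one would shrink the projected frontier instead of extending it.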

The multi-core-scaling model investigates two multi-core configurations based on CPUs and GPUs, and four topologies for each configuration—symmetric, asymmetric, dynamic, and composed multi-core. For a selected multi-core configuration and topology, the DS model uses the core-scaling model to analytically compute the best possible speedup, optimal number of cores, and fraction of dark silicon for each PARSEC benchmark, given a certain chip area and TDP budget.
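As a rough illustration of the symmetric case, the sketch below picks the core design point that maximizes an Amdahl-style speedup under area and TDP limits. It is a deliberate simplification: the DS model also accounts for effects such as memory behavior, which are omitted here. The function name, the tuple layout, and all numbers are assumptions made for this example.

```python
import math

def best_symmetric(points, f_parallel, area_budget, tdp):
    """Return (speedup, n_cores, dark_fraction, point) for the best symmetric design.

    points: iterable of (perf, area_mm2, power_w) per-core design points.
    f_parallel: fraction of the workload that can run in parallel.
    """
    best = None
    for perf, area, power in points:
        # The core count is capped by whichever budget runs out first.
        n = min(math.floor(area_budget / area), math.floor(tdp / power))
        if n < 1:
            continue
        speedup = 1.0 / ((1.0 - f_parallel) / perf + f_parallel / (n * perf))
        dark = 1.0 - n * area / area_budget  # unpowered share of the chip area
        if best is None or speedup > best[0]:
            best = (speedup, n, dark, (perf, area, power))
    return best

# Hypothetical design points: a small efficient core and a large fast core.
points = [(1.0, 10.0, 5.0), (2.0, 40.0, 20.0)]
result = best_symmetric(points, f_parallel=0.9, area_budget=100.0, tdp=25.0)
```

In this example the small core wins, but the power budget allows only five of the ten cores that fit on the chip, so half the area is dark.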

Like the DS model, our nDS model consists of device-, core-, and multi-core-scaling models. However, as the DS models are based on CMOS scaling trends, they cannot be utilized directly for beyond-CMOS devices without accounting for the change in device technology. This is because some of the assumptions made in the DS model (such as Pollack's Rule) are no longer valid when transitioning from CMOS to beyond-CMOS devices. Here, we describe the three modifications, one for each scaling model, required for benchmarking beyond-CMOS devices.

First, in the DS-device-scaling model, we derived CMOS frequency- and power-scaling factors by normalizing INVFO4 frequency and power data for each technology node against 45nm CMOS data (2010 column in the 2010 ITRS^{9}).^{b} As the 45nm frequency and power data are based on MASTAR INVFO4 simulations, simply normalizing the 15nm BCB-based frequency and power values by the 45nm MASTAR-based data would produce inconsistent scaling factors. Figure 2 illustrates this inconsistency, showing the BCB INVFO4 computation and comparing it to the MASTAR computation for 45nm CMOS; there is a notable discrepancy between the two computed INVFO4 values, with switching energy data points differing by approximately 50X.

**Figure 2. INVFO4 energy vs. delay for various technology nodes and devices; the initials BCB or MASTAR appear after each device to indicate the computation model for the data point.**

To produce consistently normalized values, we leveraged the BCBv3 model to compute the 45nm CMOS parameters. We utilized the same 45nm inputs as were used in the MASTAR simulation from the 2010 column in the 2010 ITRS report^{9} while adjusting the BCB model for the 45nm technology node; the default is a 15nm metal half pitch, representing the 2018 column in the 2010 ITRS report. With the newly computed BCB INVFO4 45nm data point, we normalize the 15nm beyond-CMOS and CMOS devices to produce consistent frequency- and power-scaling factors, as reported in the inset of Figure 3.
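The normalization itself is simple arithmetic, sketched below. The sketch assumes clock frequency is inversely proportional to INVFO4 delay and that both data points come from the same computation model. The function name and all delay and power values are hypothetical, chosen only to show a device whose frequency-scaling factor falls below one.

```python
def scaling_factors(delay, power, base_delay, base_power):
    """Normalize a device's INVFO4 delay and power against a baseline data point."""
    freq_factor = base_delay / delay   # frequency taken as proportional to 1/delay
    power_factor = power / base_power
    return freq_factor, power_factor

# Hypothetical INVFO4 values (arbitrary but consistent units).
base_45nm = {"delay": 10.0, "power": 8.0}
devices_15nm = {"fast_device": (4.0, 2.0), "slow_device": (20.0, 1.0)}

factors = {
    name: scaling_factors(d, p, base_45nm["delay"], base_45nm["power"])
    for name, (d, p) in devices_15nm.items()
}
```

Here `slow_device` has a frequency-scaling factor of 0.5: it uses far less power than the baseline but switches more slowly.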

**Figure 3. Erroneous compression of the 45nm area vs. performance Pareto frontier occurs when a device has a frequency-scaling factor of less than one, assuming beyond-CMOS devices and CMOS are of equal size.**

Our second modification pertains to the core-scaling model, which projects the Pareto-optimal core area vs. performance and core power vs. performance for each technology node beyond 45nm. The respective empirical 45nm Pareto frontiers are scaled using the scaling factors derived in the device-scaling model for each technology node. Note the area-scaling factor is computed from Moore's Law using *Area*_{Scaling} = 2

When normalized to the new 45nm data point, the frequency-scaling factor of some beyond-CMOS devices is less than one; that is, the 15nm projection indicates the device is slower than 45nm CMOS, thus prohibiting the area/performance Pareto frontier scaling mentioned earlier. For example, if we assume CMOS and beyond-CMOS devices are of similar size and scale equally, then the area component of the area/performance Pareto frontier does not differ between CMOS and beyond-CMOS devices at 15nm. When performance is scaled by a frequency-scaling factor of less than one, the area/performance Pareto frontier becomes more compact compared to the 45nm curve, as reflected in the GaNTFET data in Figure 3. This compaction violates the premise of the area/performance Pareto frontier, or Pollack's Rule, as performance does not increase from 45nm to 15nm. To address this problem, we first assume the area/performance Pareto frontiers of all 15nm beyond-CMOS devices are identical to the projected 15nm CMOS area/performance Pareto frontier in Figure 3, preserving Pollack's Rule. We then derive a new area-scaling factor based on the ratio of CMOS device area to the target beyond-CMOS device's area; for example, TFET devices are approximately 25% larger than CMOS devices.^{18} When considering steep-slope devices, nDS reduces the baseline CMOS chip area budget by 25%.

Our third and final modification is to the multi-core-scaling model. When run out of the box, the DS model reduces TDPs by 20% to account for leakage power across all CMOS-technology-node generations. To account for differences in leakage power for each beyond-CMOS device, we modified the multi-core-scaling model to consider per-device leakage power. To compute the percent leakage power of each device, we leverage the BCB model, which also computes the dynamic and static power of a 32-bit ripple carry adder (RCA) for each beyond-CMOS device. The percentage of leakage power is simply the ratio of static power to total power. Our justification for using the 32-bit RCA is twofold: First, in the DS model, the cores that compose a multi-core architecture are primarily pipelined logic.^{c} Second, we use the 32-bit RCA to compute the leakage power not only for the 15nm beyond-CMOS devices but also for 15nm CMOS to ensure consistency.
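The leakage calculation can be written out as follows. This is a sketch that assumes the per-device leakage fraction is applied to the TDP in the same multiplicative way as the DS model's fixed 20% reduction. The function names and the static and dynamic power values are hypothetical.

```python
def leakage_fraction(static_w, dynamic_w):
    """Share of total power that is static (leakage) power."""
    return static_w / (static_w + dynamic_w)

def dynamic_budget(tdp_w, static_w, dynamic_w):
    """TDP left for switching activity after setting aside the leakage share."""
    return tdp_w * (1.0 - leakage_fraction(static_w, dynamic_w))

# Hypothetical 32-bit RCA figures that reproduce a 20% leakage share.
share = leakage_fraction(static_w=1.0, dynamic_w=4.0)
budget = dynamic_budget(tdp_w=125.0, static_w=1.0, dynamic_w=4.0)
```

With a 20% leakage share, a 125W TDP leaves 100W for dynamic power; a device with lower leakage keeps more of the budget.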

To demonstrate that nDS produces accurate projections, we validated the model with empirical 22nm and 14nm Intel processor data. Specifically, we collected SPECmark scores from the available SPEC CPU2006 data, along with each processor's TDP from its corresponding data sheet, to create an empirical power vs. performance plot at the 22nm and 14nm technology nodes. Following a similar approach by Esmaeilzadeh et al.^{7} for 45nm processor data, we used power regression to fit a curve to the data in a Pareto-optimal sense. In Figure 4, the orange- and red-dotted-line curves with triangle markers represent empirical power/performance Pareto frontiers for 22nm and 14nm, respectively. Note that while we compare 16nm (conservative and ITRS MASTAR) and 15nm BCB projections to the 14nm empirical data, the inputs for the 16nm and 15nm projections represent the same technology year, or the 2018 column of the 2011 ITRS report, and are representative of current 14nm Intel processors.

**Figure 4. Comparison of core power vs. performance projections for 22nm and 16/15/14nm technology for various projection methodologies; results show our approach (green curve) is closer to the empirical data (red curve) than the projection schemes used by Esmaeilzadeh et al. ^{7}**
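One common way to carry out a power regression is a least-squares fit of y = a·x^b in log-log space, sketched below. This assumes "power regression" means a power-law fit, and it fits all supplied points rather than only the Pareto-optimal ones. The function name and the synthetic data are illustrative, not the processor data used in the study.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via linear regression on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx                 # slope in log-log space is the exponent
    a = math.exp(my - b * mx)     # intercept in log-log space is log(a)
    return a, b

# Synthetic performance/power pairs generated from power = 0.5 * perf**1.5.
perf = [10.0, 20.0, 30.0, 40.0]
power = [0.5 * p ** 1.5 for p in perf]
a, b = fit_power_law(perf, power)
```

Because the synthetic data follow an exact power law, the fit recovers the generating coefficient and exponent.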

Using the 22nm and 14nm empirical Pareto frontiers, we first assess the accuracy of the two projection schemes—ITRS and conservative—from Esmaeilzadeh et al.^{7} In Figure 4, the curves are colored according to their respective technology node projection, with blue curves representing 22nm and green curves representing 16/15nm.^{d} For 22nm, the optimistic ITRS MASTAR projection overestimates power/performance, as shown by the blue curve with square markers. Conversely, the conservative projections underestimate power/performance for the 22nm technology node, as shown by the blue curve with diamond markers. For the 16nm conservative projection—green curve with diamond markers—the projection again underestimates the power/performance. The 16nm ITRS MASTAR projection is somewhat more in line with empirical data.

To assess the accuracy of our approach, we first used the BCB model to compute 22nm and 15nm CMOS energy vs. delay for an INVFO4, as in Figure 1. We next computed the 22nm and 15nm INVFO4 frequency and power-scaling factors by normalizing them against the BCB-computed 45nm INVFO4 data. Using the scaling factors obtained through the BCB, we scaled the 45nm power/performance Pareto frontier to 22nm and 15nm, as shown in Figure 4 as the blue curve with circle markers and the green curve with circle markers, respectively. At 22nm, the BCB-based projection produces a nearly identical Pareto frontier when compared to the empirical 22nm data. Likewise, the BCB 15nm projection is also closely aligned with the 14nm empirical data. Our model thus outputs projections in line with current-generation processors. We now leverage our approach to project 15nm beyond-CMOS devices.

Given the validated nDS model, we now consider architectural-level performance projections for the steep-slope devices discussed earlier. Our results project both the speedup and percentage of dark silicon for 15nm steep-slope devices relative to 45nm CMOS^{e} under various input constraints. Given that 45nm and 15nm are separated by three technology nodes, we desire a speedup of at least 8× in order to conform to Moore's Law. The results presented here are divided into three parts: a high-TDP (125W) study, a low-TDP (5W) study, and a heterogeneous multi-core study. To be consistent with the original DS-model work, our chip area budget is 111mm^{2}, or the area of a 45nm Intel Nehalem processor without the L2 and L3 caches.^{7}
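The 8× target follows from assuming performance doubles with each technology generation, as stated in the introduction. With n denoting the number of generations (a symbol introduced here for illustration):

```latex
\text{target speedup} = 2^{n} = 2^{3} = 8
\qquad (n = 3 \text{ generations between 45nm and 15nm})
```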

**High-TDP study.** The nDS model aims to find the maximum speedup achievable by a symmetric multi-core processor. The number of cores in the processor must satisfy the given TDP and area constraints. We report the results in Figure 5 for high (125W) and low (5W) TDPs. The solid blue bars in the subplots of Figure 5 respectively represent the geometric means across PARSEC for (a) speedup relative to 45nm, (b) number of active cores, and (c) percentage of total chip area that is dark silicon.

**Figure 5. For each device, the three sub-figures show the geometric means of (a) speedup over 45nm CMOS, (b) number of active cores, and (c) percentage of dark silicon across all PARSEC benchmarks for high (125W) and low (5W) TDPs.**

As one might expect from Figure 1, the Thin-TFET indeed achieves the greatest average speedup with the lowest average percentage of dark silicon. However, even within this optimistic scenario of selecting the best multi-core configuration for each benchmark, the Thin-TFET falls short of performance-scaling trends associated with Moore's Law; that is, the Thin-TFET achieves 6.36× speedup when Moore's Law predicts 8× speedup from 45nm to 15nm. In fact, the Thin-TFET is less than 2× better than both CMOS HP and LV. This result is not apparent from the circuit-level benchmarking results in Figure 1. However, compared to CMOS HP, the Thin-TFET is able to power 99% of its cores, whereas over one-third of the CMOS HP chip must be powered off. Moreover, the Thin-TFET consumes an average of 75% less power compared to CMOS HP and CMOS LV. Although this result is not shown, our model indicates CMOS HP and CMOS LV average 120W per benchmark (the power limit is reached), vs. 30W for the Thin-TFET (the area limit is reached).
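The two headline figures in this paragraph can be checked directly from the stated values:

```latex
1 - \frac{30\,\mathrm{W}}{120\,\mathrm{W}} = 0.75
\qquad\text{and}\qquad
\frac{6.36}{8} \approx 0.80
```

That is, the Thin-TFET draws about one-quarter of the CMOS power while delivering roughly 80% of the 8× Moore's Law target.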

Again, the results in Figure 5 represent the per-benchmark speedups for each device technology; speedup assumes optimal core configuration for each benchmark. As the total number of cores can vary for each benchmark, the average speedups discussed previously represent an upper bound on the achievable speedup. To observe how these averages might change, given a single static symmetric multi-core processor, we modified nDS to select the one multi-core configuration that yielded the greatest geometric mean speedup for all benchmarks. Table 2 reports these results and compares them to the previous "best speedup per benchmark" optimization strategy. The average speedup for the best multi-core configuration for each benchmark is listed in the second column, and the average speedup for the best multi-core configuration for all benchmarks is listed in the third column. As expected, the average speedups decrease when selecting a static multi-core configuration vs. a variable configuration across all benchmarks. Of particular interest, the Thin-TFET is the least-affected device when changing the multi-core optimization strategy. Given a sufficient power budget, the Thin-TFET would thus achieve between 5× and 6× speedup over 45nm CMOS for highly parallel applications. However, the speedup of the Thin-TFET remains less than 2× better than both the CMOS HP and CMOS LV.

**Table 2. Geometric mean of speedups relative to 45nm CMOS for two types of optimizations: per each benchmark and across all benchmarks.**
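The difference between the two optimization strategies can be illustrated with a small sketch. The speedup table below is invented for this example: configuration labels map to per-benchmark speedups, and the function names are hypothetical.

```python
from statistics import geometric_mean

def per_benchmark_best(speedups):
    """Geometric mean when each benchmark gets its own best configuration."""
    n_bench = len(next(iter(speedups.values())))
    best_each = [max(row[i] for row in speedups.values()) for i in range(n_bench)]
    return geometric_mean(best_each)

def static_best(speedups):
    """Single configuration with the highest geometric mean across all benchmarks."""
    config = max(speedups, key=lambda k: geometric_mean(speedups[k]))
    return config, geometric_mean(speedups[config])

# Hypothetical speedups for two benchmarks under two core counts.
speedups = {"8 cores": [4.0, 1.0], "32 cores": [1.0, 9.0]}
upper_bound = per_benchmark_best(speedups)
chosen, static_mean = static_best(speedups)
```

The per-benchmark strategy can never do worse than the static one, because it takes the maximum for every benchmark independently; this is why it acts as an upper bound.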

**Low-TDP study.** Steep-slope devices are likely to deliver greater benefit compared to CMOS in power-constrained environments. We now re-examine the preceding results for multi-core processors with low TDP (5W). For each device, the orange-striped bars in the subplots in Figure 5 represent low-TDP geometric mean results when the optimal core configuration is selected for each benchmark. We first note an unintuitive result in Figure 5a: the average speedup of 15nm CMOS HP indicates it is slower than 45nm CMOS HP. This is a direct result of utilizing CMOS HP in a TDP-constrained environment. The high switching energy and standby power of CMOS HP limit the number of devices that can be utilized simultaneously. On average, CMOS HP can utilize only approximately six cores, while the remaining 93% of the chip is dark silicon. Moreover, CMOS LV shows little improvement vs. CMOS HP, with 1.5× average speedup and 60% dark silicon.

In this power-constrained scenario, most of the steep-slope devices outperform CMOS, and the Thin-TFET again achieves the best overall average speedup. Compared to CMOS HP, the Thin-TFET is more than 5× faster on average, with only 30% dark silicon. Compared to CMOS LV, the Thin-TFET is approximately 2.5× faster on average while utilizing 30% more of its cores. Collectively, no steep-slope device is significantly faster than CMOS, but the amount of dark silicon in both the high- and low-TDP cases indicates a higher percentage of the processor's cores can be powered on.

As we did with high-TDP, we now consider low-TDP when a single static multi-core configuration is selected for all benchmarks. Table 3 reports these results and compares them to the previous per-benchmark optimization strategy. The results show, on average, the speedup for each device changes by 27.8% when switching to the static multi-core configuration. Moreover, in contrast to the high-TDP result, the Thin-TFET experiences the greatest change. However, the Thin-TFET remains the best overall device, with speedups 4.8× better than CMOS HP and 2.4× better than CMOS LV, the more natural comparison.

**Table 3. Geometric mean of speedups relative to 45nm CMOS for two types of optimizations: per each benchmark and across all benchmarks.**

As a more thorough evaluation, we extended our benchmarking methodology to consider benchmarks for low-power applications. We chose two benchmarks from the MediaBench benchmarking suite: MPEG-2 encoding and JPEG encoding.^{13} To input them into our model, we followed the same approach outlined in the original DS model work by Esmaeilzadeh et al.^{7} For the two MediaBench benchmarks, this process involves determining the fraction of instructions that can be run in parallel,^{11,12} the fraction of instructions that access memory,^{22} and L1^{23} and L2^{15} cache-miss-rate constants. Once these inputs are determined, the original DS model can be employed to estimate cycles per instruction for each possible design point along a device's Pareto frontier. Finally, this data can be used in our nDS model to compute speedups, as described earlier. Figure 6 reports the speedup results for the two MediaBench benchmarks and the PARSEC benchmarks.

**Figure 6. nDS benchmarking results for per-benchmark speedups relative to 45nm CMOS for CMOS and steep-slope devices with a low (5W) TDP; results include two MediaBench benchmarks: MPEG-2 and JPEG encoding.**

The MPEG-2 and JPEG benchmarks are both highly parallel applications, whereby 86% and 96% of instructions can be run in parallel, respectively. Their relative speedups are thus consistent with the highly parallel PARSEC benchmarks, as reported in Figure 6. Again, the Thin-TFET performs best, with speedups of 5× for MPEG-2 and 7.4× for JPEG.

**Heterogeneous multi-core study.** We now explore the potential speedups achievable through steep-slope devices in a heterogeneous or an asymmetric multi-core architecture where different core types can use different technologies. Researchers are actively considering how to integrate two-dimensional materials with CMOS.^{29} Several previous research efforts, including Zhang et al.^{30} and Swaminathan et al.,^{25} have considered hybrid multi-core processors consisting of two types of device technologies. However, most such research is concerned with trading performance for power efficiency. Here, we aim to determine if greater speedups relative to 45nm are possible when considering heterogeneous asymmetric multi-core architectures.

The asymmetric multi-core design we target is identical to the one in the original dark silicon model in which there is one large monolithic core, in addition to many identical smaller cores. The large, high-performing core is leveraged to accelerate serial portions of the code, while the smaller cores, as well as the large one, are used to execute the parallel portions. For our asymmetric-core study, we fixed the technology of the large core while allowing the technology of the smaller cores to vary. For each large-core and small-core technology, our model determines the optimal number of smaller cores able to yield the best overall speedup for each PARSEC benchmark. We perform this operation for both the high-TDP and low-TDP cases, reporting our results in Table 4. For the high-TDP results in Table 4, the greatest speedup was achieved when the large-core technology was CMOS HP and the small-core technology was the Thin-TFET. This result is fairly intuitive, as CMOS HP provides the best individual core performance, while the performance and power efficiency of the Thin-TFET provide the best parallel performance. Moreover, these two technologies achieve an average speedup of 7.27× relative to 45nm technology and come closest to achieving the desired 8× speedup (from 45nm to 15nm) based on Moore's Law. However, when compared to an asymmetric multi-core composed of only 15nm CMOS HP technology, the CMOS HP/Thin-TFET multi-core is, again, only 1.4× faster. For the low-TDP case, the homogeneous Thin-TFET achieved the best overall speedup of 4.56× relative to 45nm CMOS. Compared to the 15nm CMOS HP symmetric multi-core and homogeneous asymmetric multi-core, the asymmetric Thin-TFET was approximately 6× faster. (The higher speedup over 15nm CMOS HP is due to the fact that 15nm CMOS HP incurs high standby power, as explained earlier.) When compared to the CMOS LV homogeneous asymmetric and symmetric multi-cores, the asymmetric Thin-TFET was 2.2× and 3× faster, respectively.

**Table 4. Geometric mean speedups relative to 45nm CMOS for an asymmetric multi-core composed of two device technologies.**
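The asymmetric design described above can be approximated with an Amdahl-style sketch in which the large core alone runs the serial fraction and all cores together run the parallel fraction. This simplification is an assumption for illustration, not the nDS implementation; the function name, the tuple layout, and the core parameters are hypothetical.

```python
import math

def asymmetric_speedup(large, small, f_parallel, area_budget, tdp):
    """Speedup of one large core plus as many small cores as the budgets allow.

    large, small: (perf, area_mm2, power_w) tuples, possibly of different technologies.
    Returns (speedup, n_small).
    """
    perf_l, area_l, power_l = large
    perf_s, area_s, power_s = small
    # Small cores share whatever area and power remain after the large core.
    n_small = min(math.floor((area_budget - area_l) / area_s),
                  math.floor((tdp - power_l) / power_s))
    n_small = max(n_small, 0)
    serial_time = (1.0 - f_parallel) / perf_l
    parallel_time = f_parallel / (perf_l + n_small * perf_s)
    return 1.0 / (serial_time + parallel_time), n_small

# Hypothetical fast large core paired with small low-power cores.
speedup, n_small = asymmetric_speedup(
    large=(4.0, 20.0, 10.0), small=(1.0, 5.0, 2.0),
    f_parallel=0.9, area_budget=100.0, tdp=50.0)
```

In this example area is the binding constraint: 16 small cores fit, although the power budget would allow 20.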

To project the performance of multi-core processors based on beyond-CMOS devices, we introduced nDS, an analytical architectural-level benchmarking model for beyond-CMOS devices. nDS achieves this level of benchmarking by leveraging and modifying two existing benchmarking models: the BCB and the DS model. We validated the accuracy of nDS by showing close correlation between projected and empirical CMOS power/performance for current-generation processors. To demonstrate the capabilities of nDS, we benchmarked four promising steep-slope devices. We sought to determine if these devices could sustain Moore's Law and/or limit the growth of dark silicon. Our results demonstrated that for both high- and low-TDP processors, none of the beyond-CMOS device technologies we examined are projected to sustain Moore's Law performance-scaling trends. However, given that the 2011 ITRS^{9} expected speedup from 15nm CMOS to 10nm CMOS is approximately 1.3× and that a 15nm technology change from CMOS to Thin-TFET yields a 5× average performance gain, we would expect a 6× to 7× performance gain for the Thin-TFET at 10nm.
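The 10nm estimate is the product of the two stated gains, assuming they compound multiplicatively:

```latex
5 \times 1.3 = 6.5
```

This value falls within the quoted 6× to 7× range.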

For both high- and low-TDPs, most of the steep-slope devices have less dark silicon while also consuming less power compared to CMOS. For high-TDP, the Thin-TFET has less than 1% dark silicon compared to over 36% and 17% for the CMOS HP and LV, respectively. For low-TDP, the GaNTFET achieves the least dark silicon, with approximately 18% compared to the 93% and 60% of the CMOS HP and LV, respectively. While the selected devices are not significantly faster than CMOS, their steep-slope nature allows for less power consumption per core and thus their ability to power a greater percentage of on-chip devices. As a result, steep-slope devices are better able to exploit application parallelism to achieve further performance gains.

This work was supported in part by the Center for Low Energy Systems Technology (LEAST), one of the six SRC STARnet Centers, sponsored by MARCO and DARPA.

1. Bernstein, K., Cavin, R., Porod, W., Seabaugh, A., and Welser, J. Device and architecture outlook for beyond-CMOS switches. *Proceedings of the IEEE 98*, 12 (Dec. 2010), 2169–2184.

2. Bienia, C., Kumar, S., Singh, J.P., and Li, K. The PARSEC benchmark suite: Characterization and architectural implications. In *Proceedings of the International Conference on Parallel Architectures and Compilation Techniques.* ACM Press, New York, 2008, 72–81.

3. Borkar, S. Design challenges of technology scaling. *IEEE Micro 19*, 4 (July 1999), 23–29.

4. Borkar, S. Thousand-core chips: A technology perspective. In *Proceedings of the Design Automation Conference.* ACM Press, New York, 2007, 746–749.

5. Borkar, S. The exascale challenge. In *Proceedings of the 2010 International Symposium on VLSI Design, Automation and TEST.* IEEE Press, 2010, 2–3.

6. Dennard, R., Rideout, V., Bassous, E., and LeBlanc, A. Design of ion-implanted MOSFETs with very small physical dimensions. *IEEE Journal of Solid-State Circuits 9*, 5 (Oct. 1974), 256–268.

7. Esmaeilzadeh, H., Blem, E., Amant, R.S., Sankaralingam, K., and Burger, D. Dark silicon and the end of multi-core scaling. In *Proceedings of the International Symposium on Computer Architecture.* ACM Press, New York, 2011, 365–376.

8. Huang, W., Rajamani, K., Stan, M., and Skadron, K. Scaling with design constraints: Predicting the future of big chips. *IEEE Micro 31*, 4 (July 2011), 16–29.

9. International Technology Roadmap for Semiconductors; http://www.itrs2.net

10. Kim, R., Avci, U., and Young, I. Source/drain doping effects and performance analysis of Ballistic III-V n-MOSFETs. *IEEE Journal of the Electron Devices Society 3*, 1 (Jan. 2015), 37–43.

11. Kodaka, T., Kimura, K., and Kasahara, H. Multigrain parallel processing for JPEG encoding on a single-chip multiprocessor. In *Proceedings of the International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems.* IEEE Press, 2002, 57–63.

12. Kodaka, T., Nakano, H., Kimura, K., and Kasahara, H. Parallel processing using data localization for MPEG2 encoding on Oscar chip multiprocessor. In *Proceedings of Innovative Architecture for Future-Generation High-Performance Processors and Systems.* IEEE Press, 2004, 119–127.

13. Lee, C., Potkonjak, M., and Mangione-Smith, W.H. Mediabench: A tool for evaluating and synthesizing multimedia and communications systems. In *Proceedings of the ACM/IEEE International Symposium on Microarchitecture.* IEEE Computer Society Press, 1997, 330–335.

14. Li, M., Esseni, D., Nahas, J., Jena, D., and Xing, H. Two-dimensional hetero-junction interlayer tunneling field-effect transistors (Thin-TFETs). *IEEE Journal of the Electron Devices Society 3*, 3 (May 2015), 200–207.

15. Li, M.-L., Sasanka, R., Adve, S.V., Chen, Y.-K., and Debes, E. The ALPBench benchmark suite for complex multimedia applications. In *Proceedings of the IEEE International Workload Characterization Symposium.* IEEE Press, 2005, 34–45.

16. Li, W., Sharmin, S., Ilatikhameneh, H., Rahman, R., Lu, Y., Wang, J., Yan, X., Seabaugh, A., Klimeck, G., Jena, D., and Fay, P. Polarization-engineered III-Nitride heterojunction tunnel field-effect transistors. *IEEE Journal on Exploratory Solid-State Computational Devices and Circuits* (Dec. 2015), 28–34.

17. Moore, G. Cramming more components onto integrated circuits. *Electronics 38*, 8 (Apr. 1965), 114–117.

18. Nikonov, D. and Young, I. Overview of beyond-CMOS devices and a uniform methodology for their benchmarking. *Proceedings of the IEEE 101*, 12 (Dec. 2013), 2498–2533.

19. Nikonov, D. and Young, I. Benchmarking of beyond-CMOS exploratory devices for logic integrated circuits. *IEEE Journal on Exploratory Solid-State Computational Devices and Circuits* (Dec. 2015), 3–11.

20. Salahuddin, S. and Datta, S. Use of negative capacitance to provide voltage amplification for low-power nanoscale devices. *Nano Letters 8*, 2 (Feb. 2008), 405–410.

21. Seabaugh, A.C. and Zhang, Q. Low-voltage tunnel transistors for beyond-CMOS logic. *Proceedings of the IEEE 98*, 12 (Dec. 2010), 2095–2110.

22. Sohoni, S. *Improving L2 Cache Performance through Stream-Directed Optimizations.* Ph.D. thesis, University of Cincinnati, Cincinnati, OH, 2004; http://rave.ohiolink.edu/etdc/view?acc_num=ucin1092932892

23. Sohoni, S., Min, R., Xu, Z., and Hu, Y. A study of memory system performance of multimedia applications. In *Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems.* ACM Press, New York, 2001, 206–215.

24. Standard Performance Evaluation Corporation; http://www.spec.org

25. Swaminathan, K., Kultursay, E., Saripalli, V., Narayanan, V., Kandemir, M., and Datta, S. Improving energy efficiency of multi-threaded applications using heterogeneous CMOS-TFET multi-cores. In *Proceedings of the International Symposium on Low-Power Electronics and Design.* IEEE Press, 2011, 247–252.

26. Swaminathan, K., Kultursay, E., Saripalli, V., Narayanan, V., Kandemir, M., and Datta, S. Steep-slope devices: From dark to dim silicon. *IEEE Micro 33*, 5 (Sept. 2013), 50–59.

27. Taylor, M.B. Is dark silicon useful?: Harnessing the four horsemen of the coming dark silicon apocalypse. In *Proceedings of the Design Automation Conference.* IEEE Press, 2012, 1131–1136.

28. Venkatesh, G., Sampson, J., Goulding, N., Garcia, S., Bryksin, V., Lugo-Martinez, J., Swanson, S., and Taylor, M.B. Conservation cores: Reducing the energy of mature computations. *ACM SIGARCH Computer Architecture News 38*, 1 (2010).

29. Yu, L., Lee, Y.-H., Ling, X., Santos, E.J.G., Shin, Y.C., Lin, Y., Dubey, M., Kaxiras, E., Kong, J., Wang, H., and Palacios, T. Graphene/MoS_{2} hybrid technology for large-scale two-dimensional electronics. *Nano Letters 14*, 6 (May 2014), 3055–3063.

30. Zhang, Y., Peng, L., Fu, X., and Hu, Y. Lighting the dark silicon by exploiting heterogeneity on future processors. In *Proceedings of the Design Automation Conference.* ACM Press, New York, 2013, 1–7.

a. The 45nm node was the current technology node (2009) when Esmaeilzadeh et al.^{7} developed the original Dark Silicon model.

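b. The 45nm power input was analytically computed as *P*_{dynamic} = α*CV*^{2}*f* (the standard dynamic-power relation: activity factor α, switched capacitance *C*, supply voltage *V*, clock frequency *f*).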

c. Shared caches were removed from the input area and power budgets.^{7}

d. Conservative and ITRS projections list the 2018 technology node as 16nm, while BCB uses 15nm; current empirical data labels the 2018 node as 14nm.

e. We use 45nm CMOS to maintain consistency with the original DS model.

**©2018 ACM 0001-0782/18/9**

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

