Computer science has long had a solid foundation for evaluating the performance of algorithms. The asymptotic complexity of the time required by an algorithm is well defined and usually tractable, allowing for a clear evaluation of whether one algorithm provides a fundamental improvement over another. More nuanced and alternative evaluations, such as amortized and randomized analysis, provide additional insights into the fundamental advantages of different algorithms.
Unfortunately, the situation is grimmer when evaluating the performance of a computer system, whether that system is a computer architecture, a compiler, a graphics processor, or a runtime system. Given a specific application, it is fairly straightforward to execute the application on various systems and see which one runs it faster on the provided input. Of course, once an application has been run on a particular input, one generally does not need to rerun it on that same input.
What programmers really want is some way to evaluate which system is likely to provide better performance on the applications and data sets they will run in the future, thus making it the "better" system. Benchmarks are the standard tool for this evaluation; they also provide a way to examine how various system components behave and interact under load. Benchmarks should give repeatable results, even when rerun by an independent researcher or testing organization. A benchmark can be either a real or a synthetic application.
A synthetic application doesn't compute anything useful but is designed to have performance characteristics that are representative of a range of real applications.
Benchmarks have an established history in computer science. The first widely used synthetic benchmark was the Whetstone benchmark, written in Algol 60 in 1972 and later translated into many other languages. Several benchmarks became well known and widely used in research or commercial settings, or both; examples include the Livermore loops, the Dhrystone benchmark, the Linpack benchmark, and the Perfect Club benchmarks.
The design and selection of benchmarks, however, has traditionally been a matter of art and taste, as much art as engineering. The paper here by the DaCapo team is the best effort I have seen at providing a sound basis for selecting benchmarks. Historically, there has been no standard methodology for deciding whether a benchmark does indeed provide a representative measure of a system's performance within a particular domain. A more serious problem with benchmarks is that they age poorly. Benchmarks often do a reasonable job of evaluating the performance of applications at the time they are proposed; however, they tend to grow less useful over time.
Almost every systems researcher and commercial software developer has a personal horror story about a poorly designed benchmark that was difficult to use, produced misleading results, or focused attention on the wrong problem for too long. One such story from my own experience involves the db benchmark from SPEC JVM98, which was intended to represent database workloads. Several early papers on removing redundant or useless synchronization from Java programs focused on this benchmark, since removing such synchronization could produce a 20% to 30% speed improvement. Closer examination, however, revealed that more than 70% of the benchmark's CPU time was spent in a badly written 20-line Shell sort; replacing the handwritten sort with a call to the built-in sort function doubled the execution speed, even without removing the useless synchronization.
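To illustrate the kind of hot spot involved, here is a minimal sketch in Python (not the benchmark's actual code, which was Java) of a hand-rolled Shell sort of the sort the anecdote describes. The lesson is to profile first: a routine like this, not synchronization, dominated the runtime, and swapping in the language's built-in sort was the real fix.

```python
def shell_sort(items):
    """Shell sort using the original gap sequence (n/2, n/4, ..., 1).
    A hand-written sort like this is easy to get subtly slow or wrong;
    modern built-in sorts are highly tuned and usually much faster."""
    a = list(items)
    gap = len(a) // 2
    while gap > 0:
        # Gapped insertion sort: each pass leaves the array gap-sorted.
        for i in range(gap, len(a)):
            value = a[i]
            j = i
            while j >= gap and a[j - gap] > value:
                a[j] = a[j - gap]
                j -= gap
            a[j] = value
        gap //= 2
    return a

data = [5, 3, 8, 1, 9, 2]
# The one-line replacement in the anecdote: use the built-in sort instead.
assert shell_sort(data) == sorted(data)
```

The point is not that Shell sort is wrong, but that a profiler, not intuition about synchronization overhead, is what reveals where the time actually goes.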
The DaCapo research group offers what appears to be an exceptionally well-engineered set of benchmarks for evaluating Java systems. This effort includes not only selecting the benchmark applications but also designing a substantial infrastructure to support the execution and evaluation of benchmark runs.
Far more important than the actual selection of the benchmarks and the engineering infrastructure is the DaCapo team's excellent description of best practices for using benchmarks to evaluate Java system performance, along with a principled approach for evaluating whether a suite of benchmark applications is, in fact, sufficiently diverse. This approach involves measuring a number of characteristics of each application and then applying principal component analysis (PCA) to determine whether the applications have fundamental differences or basically measure the same aspects of a system. I hope the methodology described in the paper will allow the DaCapo benchmark suite, and others, to be evaluated so they can evolve in ways that keep them useful and meaningful for more than just a moment in time.
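As a rough illustration of this diversity test (a sketch of the general PCA idea, not the paper's actual metrics or data), the following Python code runs a two-feature PCA in closed form over hypothetical per-benchmark measurements; the benchmark names, feature choices, and values are all invented for illustration. If the first principal component explains nearly all of the variance, the nominally different benchmarks are largely exercising the same underlying behavior.

```python
import math

# Hypothetical characteristics per benchmark (illustrative only):
# (allocation rate in MB/s, branch mispredictions per 1K instructions).
benchmarks = {
    "bench_a": (5.0, 0.80),
    "bench_b": (5.2, 0.90),
    "bench_c": (1.0, 3.00),
    "bench_d": (5.1, 0.85),
}

def pc1_variance_fraction(points):
    """Fraction of total variance captured by the first principal
    component, for two-feature data, via the closed-form eigenvalues
    of the 2x2 sample covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Eigenvalues of [[sxx, sxy], [sxy, syy]]: mean +/- half-discriminant.
    mean_ev = (sxx + syy) / 2
    delta = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    ev1, ev2 = mean_ev + delta, mean_ev - delta
    return ev1 / (ev1 + ev2)

ratio = pc1_variance_fraction(list(benchmarks.values()))
# A ratio near 1.0 suggests the suite is redundant along these features.
print(f"variance explained by PC1: {ratio:.2%}")
```

With more than two characteristics the same idea applies to a larger covariance matrix; one would typically use a numerical library rather than closed-form eigenvalues, and inspect how many components are needed before concluding the suite is diverse.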
©2008 ACM 0001-0782/08/0800 $5.00
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.