I have attended the Supercomputing Conference (SC) several times over the past decade, including SC14 in New Orleans last month. One of the themes that I heard at this year’s conference was a new variation on an old tradeoff: productivity vs. performance.
The old tradeoff dates back to the days of the first high-level languages. When creating a new piece of software, one could choose to write it in either an assembly language or a high-level language.
Since the assembly version of the software ran faster, it had better performance than the high-level version. Since a human could develop the high-level version faster, and it required less work to port and maintain, the high-level version let that human be more productive, provided the performance difference was not too great. If you were (say) a scientist working at a national laboratory where a new supercomputer was installed every few years, you would be willing to use a high-level language and pay a performance penalty, rather than use an assembly language and have to rewrite your software for each new supercomputer. It was acceptable to trade some performance for higher productivity, so long as the performance penalty associated with the high-level language was not too large.
(Lest the reader think this is ancient history and that today's compilers can solve the problem for us, consider the work of Kazushige Goto, whose hand-optimized assembly versions of the Basic Linear Algebra Subprograms (BLAS) libraries for the Intel and AMD x86 architectures regularly ran at least 20% faster than compiled versions. 20% was a big enough performance difference that most supercomputers included Goto's version of BLAS until just a few years ago, when accelerators began to dominate the supercomputing scene.)
Jumping forward to today: The top supercomputers are heterogeneous systems of distributed nodes, each containing multicore processors and accelerators. These accelerators may be either general-purpose graphics processing units (GPUs) or co-processors (e.g., Intel’s Xeon Phi). To use the hardware in these supercomputers efficiently, software developers most often use MPI+X, where MPI handles communication between the distributed nodes and X is the technology used to exploit the cores within a node (the options discussed below are CUDA, OpenCL, OpenACC, and OpenMP).
In choosing X, one must make a productivity-performance tradeoff decision. The following overview skips over many details, but it will hopefully provide the reader with a high-level understanding of some of the tradeoffs involved.
CUDA lets a software developer write highly efficient code for Nvidia GPUs, which dominate the accelerator market. It is the most mature of the accelerator technologies, and there is a wealth of available documentation and examples on the Internet.
However, to maximize software performance on a given Nvidia GPU, one’s CUDA code must be tuned for that device. In particular, the programmer must explicitly organize the GPU’s threads into blocks, organize the blocks into a grid, move data from the computer’s main memory to the GPU’s memory and back, place data in either the GPU’s (larger but slower) global memory or its (smaller but faster) shared memory, and so on. The optimal way to specify all of these depends on the characteristics of a given GPU, which means one’s code may have to be retuned for each new generation of Nvidia device, at least to get maximum performance.
People I talked to at SC14 loved the performance CUDA can deliver on Nvidia GPUs, but did not love the work required to create and tune their programs the first time around, or the work required to retune their programs when those programs were ported to a new machine. They view every minute of that kind of work as lost productivity.
OpenCL lets a software developer write highly efficient code for CPU cores, co-processor cores, or GPU cores, not just Nvidia GPU cores. However, this flexibility comes at a cost: OpenCL programs are far more complex than their CUDA counterparts. They also require similar tuning, in that a programmer must explicitly organize work-items (threads) into work-groups (blocks), organize the work-groups into an index-space (grid), move data from the computer’s main memory to an accelerator’s memory and back, place data in the (large but slow) global memory or the (small but fast) local memory, and so on.
As with CUDA, one’s code may well need to be retuned each time it is ported to a new architecture. In head-to-head comparisons, CUDA is generally faster than OpenCL, at least on the Nvidia devices where the two can be directly compared, making tuning at least as important in OpenCL as it is in CUDA.
As mentioned above, OpenCL programs are more complicated than their CUDA counterparts. This added complexity tends to increase the time-expense of OpenCL software development and maintenance, reducing one's productivity even more.
OpenACC seeks to improve one’s productivity by (i) making it easier to write code that will run on an accelerator, and (ii) making the task of porting one’s code to a new architecture as simple as recompiling that code. For example, to exploit a GPU, one may assign code to workers (threads), organize workers into gangs (blocks), and designate chunks of worker-code that are amenable to SIMD execution as vector chunks. However, one may either specify values for each of these, or let the compiler choose values. OpenACC also provides built-in support for some parallel patterns that are commonly needed on the accelerator, such as the reduction pattern.
Researchers at SC14 reported that they had compared the performance of OpenACC against CUDA or OpenCL in both summer 2013 and summer 2014, and that while OpenACC had a substantial performance penalty in 2013, that penalty had been greatly reduced (but remained significant) by summer 2014. Anecdotally, I heard one researcher say that in 2013, he had written an OpenACC program in 50% of the time it took him to write a CUDA version, but that his CUDA version ran 50% faster than his OpenACC version. In 2014, the OpenACC compiler had improved to where his CUDA version was less than 25% faster than his OpenACC version. Compiler vendors like Cray and the Portland Group (PGI) have a commercial interest in improving the performance of OpenACC, so the performance gap between OpenACC and the other technologies seems likely to continue to shrink in the future.
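To make the "specify values, or let the compiler choose" idea concrete, here is a minimal sketch in C of the reduction pattern written with OpenACC directives. The function names (sum_auto, sum_tuned) and the gang and vector sizes are illustrative assumptions, not recommendations; a compiler without OpenACC support should simply ignore the pragmas and run the loops sequentially.

```c
/* Sum an array, letting the OpenACC compiler choose the gang/worker/vector
   mapping. The reduction clause combines the partial sums into total. */
double sum_auto(const double *a, int n)
{
    double total = 0.0;
    #pragma acc parallel loop reduction(+:total) copyin(a[0:n])
    for (int i = 0; i < n; ++i) {
        total += a[i];
    }
    return total;
}

/* Same computation, but with the mapping spelled out by hand.
   The values 64 and 128 are arbitrary placeholders that would need
   tuning for a particular device. */
double sum_tuned(const double *a, int n)
{
    double total = 0.0;
    #pragma acc parallel loop gang vector num_gangs(64) vector_length(128) \
        reduction(+:total) copyin(a[0:n])
    for (int i = 0; i < n; ++i) {
        total += a[i];
    }
    return total;
}
```

Both versions compute the same result; only the second commits to device-specific numbers, which is exactly the part that may need retuning on new hardware.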
OpenMP has traditionally been an easy-to-use API for implicit multithreading on multicore CPUs. However, the OpenMP 4 specification adds directives for running code on target devices (accelerators), on which one may identify work to be done by threads, organize those threads into teams (an abstraction for blocks) and organize teams into a league (an abstraction for grids or other device-specific mappings). OpenMP also provides a simd directive for marking loop-code that is amenable to vectorization. As in OpenACC, the exact mappings of these abstractions to the hardware may be left to the compiler or may be explicitly specified.
As of this writing, compiler writers are still implementing support for the OpenMP 4 specification, so this technology is far less mature than OpenACC, which, as we saw above, trails CUDA and OpenCL in performance. (At one time, there was talk of OpenACC and OpenMP merging, but this seems unlikely, according to those at SC14.) When mature, OpenMP has the potential to provide everything one needs to exploit parallelism on CPU and accelerator cores, making it a strong candidate for balancing performance with productivity.
All of these technologies bear watching in the future, as they continue to evolve and mature. While CUDA and OpenCL are a far cry from assembly, the need to be knowledgeable about the hardware details of one’s accelerator in order to achieve optimal (or even good) performance makes these technologies feel much closer to the hardware than a typical high-level language. In terms of productivity, OpenACC seems promising, but it still has work to do to close the performance gap. OpenMP seems even more promising, but has even further to go before it is ready for use.
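As a rough illustration of the OpenMP 4 device directives, the following C sketch offloads a simple vector update. The function name axpy and the choice of computation are assumptions made for this example; the target, teams, distribute, and map constructs come from the OpenMP 4 specification, and how they map to hardware is left to the compiler here. Without OpenMP support, or without an available device, the loop is expected to run on the host.

```c
/* Compute y[i] = a * x[i] + y[i] for i in [0, n).
   target      : run the region on a device (accelerator) if one is available
   teams       : create a league of thread teams
   distribute  : split loop iterations across the teams
   parallel for: split each team's share across its threads
   map         : copy x to the device; copy y to the device and back */
void axpy(int n, float a, const float *x, float *y)
{
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Clauses such as num_teams or thread_limit could be added to pin down the mapping explicitly, at the cost of having to revisit those numbers for each new device.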
In the meantime, CS educators face the interesting problem of deciding what technologies to use in the classroom to expose their students to accelerator-based computing. This brings me to these questions for those who have read this far:
I look forward to hearing your views!
A really interesting thought on the performance vs. productivity dilemma that we developers normally face when writing programs.
Just wondering, in the field of machine learning, many of the libraries are implemented using Python, Matlab or R; and I believe parallel implementations of these libraries do not use the same MPI library used by many scientific applications developed using C/C++/Fortran (e.g. in Python we have libraries like NumbaPro or the native multiprocessing module).
Do you think there will be a convergence or standardization of these parallel processing APIs, so that libraries coded in different languages can easily be used together without substantial recompiling or rewriting of code? I believe this is also a scenario where we choose productivity over performance.
I'm not an expert in Python, Matlab, or R, but I'm told that many of the functions in Python libraries are actually "wrappers" around C functions that do the actual work, for better performance. So if these libraries have parallel capabilities, the people who wrote these libraries likely developed the required functionality in C using MPI/OpenMP/CUDA/... for performance, and then "wrapped" that functionality in an API for Python programmers to use for increased productivity. I'd guess that Matlab and/or R functionality that runs on a Beowulf cluster takes a similar approach.
There are also projects like mpi4py that seek to provide an MPI-like system for writing distributed computations in Python, and pycuda that provides an API for writing GPU computations in Python. So another option is to develop applications using packages like those that let you take advantage of the available hardware. If there is an algorithm that's expressed using MPI or CUDA but you prefer Python, you can 'translate' that algorithm to Python using the appropriate package. So that's another way to trade off performance and productivity.
As far as convergence goes: there used to be just two hardware platforms to deal with -- distributed multiprocessors and shared memory multiprocessors -- but the accelerators bring a third option into the mix. On today's supercomputers, all three have converged, but on servers, desktops, laptops, etc., you're still dealing with shared memory multiprocessors, some of which have an accelerator.
Some of the new features in OpenMP 4 seem to be designed to bridge the shared-memory/accelerator gap. For example, if your program does lots of array operations that are amenable to vectorization, you can precede those operations with #pragma omp simd. Then if the underlying hardware contains a vector accelerator unit, the compiler can generate machine code to take advantage of that hardware; otherwise, the compiler will ignore the #pragma and generate normal machine code. That makes me more productive, as it offloads most of the effort of developing for performance from me to the compiler. Of course, I still have to be able to recognize sections of code that are amenable to vectorization; the compiler won't do that for me.
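A minimal sketch of that #pragma omp simd usage, in C, might look like the following. The function name scale_add and the arithmetic are purely illustrative; the pragma only asserts that the loop iterations are safe to vectorize, and whether vector instructions are actually generated depends on the compiler and the target hardware.

```c
/* out[i] = 2 * x[i] + y[i]: each iteration is independent of the others,
   so the loop is a candidate for vectorization. */
void scale_add(int n, const float *x, const float *y, float *out)
{
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        out[i] = 2.0f * x[i] + y[i];
    }
}
```

The programmer's job is still to know that out does not overlap x or y; the pragma is a promise to the compiler, not something the compiler verifies.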
Bridging the gap between distributed and shared-memory multiprocessors is trickier because programs that are written specifically for maximum performance on a shared memory multiprocessor (with or without an accelerator) generally require significant revision to run on a distributed multiprocessor.
Some languages (e.g. Erlang, Scala) try to bridge this divide by encouraging the developer to have their processes/threads/tasks communicate only through message passing. If your software is designed that way, then it doesn't matter whether those components are run locally on the same machine or remotely on distributed machines. Put differently, the message-passing approach to communication is pretty general, in that programs written using that approach can run either on a distributed-memory or a shared-memory multiprocessor; whereas programs written to communicate via shared memory (using locks and the like) are limited to shared-memory multiprocessors. So those languages might be seen as trying to make message-passing the standard API for parallel processing.
All of these approaches would require substantial recompiling or rewriting of code to achieve what you want. I hate to sound pessimistic, but achieving the kind of interoperability you are describing seems unlikely any time soon. The parallel design patterns work of Mattson, Keutzer, et al. supports a standardized thought process when designing parallel software, but at the implementation and runtime levels, achieving that kind of standardization seems like a grand challenge.
The difference between GotoBLAS2 and the compiled Netlib is a lot more than 20%. It can be ~1000x in the case of DGEMM! But it's not like Kazushige is the only person to have solved this problem. Vendors such as Intel (MKL), AMD (ACML), Cray (LibSci), IBM (ESSL), NVIDIA (CUBLAS), etc. provide highly optimized implementations for their systems. And there are, of course, other open-source projects, including the well-known ATLAS (math-atlas.sourceforge.net) and the much newer BLIS (https://code.google.com/p/blis/).
Your point is still true, but the way it is made might allow the reader to conclude that optimizing BLAS is a skill possessed by only one living person and that the optimizations the compiler is missing are worth the convenience of writing Fortran instead of assembly. I'll give 20% to use Fortran instead of assembly any day, but 1000x is not acceptable.
These are all good points, Jeff. Thank you for filling in some missing details!