General Purpose Computation on Graphics Processing Units

June 21st, 2006

A top-of-the range graphics card, such as the NVidia 7900GTX, is a highly parallel computer, with 24 execution units capable of performing 32-bit floating point calculations on 4-element vectors at 650MHz, including being able to issue a pair of multiply-add instructions per cycle, giving a theoretical maximum of 250GFlops. The GPU has 512MB of memory attached, with a 50GB/sec theoretical maximum bandwidth. There is also a GPU that recently became available, (the 7950GX2) which is basically a pair of 7900GTX on a single card, and its possible to install a 2 or maybe even four cards in a single machine. All of the top-model GPU cards cost around $600 shortly after launch.

So… for the cost of a modest PC and 2 7950GX2 cards (a total of about $3000), you get around 1000 GFlops of computing power and 2GB of memory operating at an aggregate 200GB/sec.

Compare this to a tricked out dual-core PC (a total of about $3000), where you get around 20GFlops of computing power and 2GB of memory operating at an aggregate 10GB/sec.

In general, GPUs have been improving in performance at a rate far higher than Moore’s law, though there are of course questions about whether this can continue. Given that GPUs are now consuming 100+W each, ATI and NVidia have already picked the low hanging performance fruit with their architecture. On the other hand, it looks like Intel and AMD have mastered their heat problems, and are managing major improvements in performance simultaneous with major energy use reductions. I imagine that ATI and NVidia will be able to apply some of these techniques to their designs, enabling them to continue along the performance curve for some time to come. Also – GPUs don’t necessarily need increased clock-speeds for increased performance – that can be achieved by adding more vector processing units. The main limiting factor seems to be memory bandwidth, hence the advent of multi-gpu solutions such as the 7950GX2.

The 1000 GFlops I mentioned above is a theoretical maximum. Its only on certain kinds of problems that rate can be achieved. In general, the problems that can leverage this power are said to have high “arithmetic intensity”, i.e. more computation than memory accesses. An example of a computation that is unsuitable is matrix-matrix multiplication which has O(N^3) memory accesses to produce O(N^2) results. CPUs have the advantage here because they can chunk the operation into tasks that can fit in cache.

Current GPUs operate only on 32-bit floating point values, and not even IEEE-754 standard 32-bit floating point at that. (When rendering computer graphics, you don’t really care if the final results of your computations are out by ± 0.2%, so the 32-bit floating-point operations tend to be a bit lax when it comes to rounding the last bit, and operations such as division and transcendental instrinsics (sqrt, sin, cos, exp, log, etc) are also less accurate. It is possible, however, to perform computations at a higher precision, by emulating double-precision with two single-precision values, but at the cost of more instructions – and many more in the case of transcendental instrinsics. GPUs also offer 16-bit floating-point values, which, if lower precision is acceptable, can improve performance by reducing memory and bandwidth requirements. Whilst current GPUs have no support for integer or bitwise operations, the next generation of cards will (DirectX 10 class cards).

The programming model for GPUs is quite constrained. Computation is rendering. That is, the result of rendering is a 2D array of ‘pixels’, which are currently limited to 4 planes of 4 floats each. Computation is performed by assigning to each “pixel” in turn one of the N vector processing units, whose job it is to compute that pixel. Those vector processing units can read from many random memory locations, but can only write to one location – the pixel they are assigned to. This memory model is called CREW, or Concurrent-Read Exclusive-Write. Computing something like a histogram, which involves examining a sequence of ‘things’ and throwing each of them into one of N ‘bins’, also known as a scatter operation, is not a natural fit for this architecture, though, of course, where there’s a will, there’s a way. It is probably easier to compute a histogram on the CPU, however.

A major differentiating factor between GPUs and CPUs is the design of their memory systems. The individual processors in a GPU have small caches, designed for 2D coherent memory accesses, and are around twice as fast as main memory. GPU main memory systems are optimised for fast sequential access, with random access being around 3-5 times slower. Given this architecture, something like a dot-product between multiple vectors, z[j] = sum(x[j][i]*y[j][i]), is a perfect operation for GPUs.

GPU code can be written using a number of tools, from assembly language, through one of several C-like languages designed for graphics programming (Cg from NVidia, HLSL from Microsoft, GLSL from OpenGL, all of which are pretty much interchangeable with each other), and on to specialized languages designed to make programming GPUs easier for non-graphics programmers (i.e. stream-processing languages such as BrookGPU and Sh). The stream-processing languages seem overly-complex and somewhat inefficient for my tastes (Sh is based on C++, and the C-like BrookGPU is no longer under active development, though there are rumours that ATI is sponsoring further development of Brook for a future API of theirs). It seems to me that Cg/HLSL/GLSL as the most effective ways of programming these devices for now. Cg/HLSL/GLSL are very close to C, with support for loops, conditionals, function calls, along with a number of extensions oriented around vector and matrix processing (floating point vectors of up to 4 elements, and matrices of up to 4×4 elements). Also, by operating at the graphics level rather than the specialised stream-processing languages, one can leverage graphics-specific operations. For example, a graphics operations known as an “Occlusion Query” can be very handy – it very quickly totals up how many pixels pass a test specified in arbitrary Cg/HLSL/GLSL code. An example of the use of the occlusion query is in computing the median of a set of numbers (i.e. in log(N) passes, you can binary search for a value which has an equal number of samples above and below it)..

Some other limitations of GPUs are that, even in a multiple GPU configuration, each GPU has only a certain amount of local RAM to operate on (512MB in the case of the 7900GTX), and that transfers to/from the CPU tend to be around the 500-1000MB/sec mark (well below the theoretical 4GB/sec available on the PCIe X16 bus interface, so there’s room for improvement). Another limitation is that textures (2D arrays) are currently limited in size to 4096×4096.

On a historical note, we have seen these cards increase in features and capabilities almost yearly. Loops are the most recent addition to their capabilities, and the one that I believe pushed us over the threshold into general purpose computing. In the not too distant future, we will see DirectX 10 class cards implementing integer/bitwise operations and a suite of other features, along with 64 or more execution units. In general, the architectural model seems fairly firm, and unlikely to undergo huge changes. What is likely to change is that the various vendors are preparing APIs for general purpose computing on their architectures. Whilst not much is known about these APIs at this point, their goal is to remove the necessity of programming in DirectX/OpenGL with their graphics-oriented abstractions, which requires knowlege of graphics terminology and techniques, and can be somewhat awkward when applied to other programming tasks.

A good source for more information on this field can be found at www.gpgpu.org

One Response to “General Purpose Computation on Graphics Processing Units”

  1. choy Says:

    http://www.nvidia.com/object/tesla_gpu_processor.html

    looks like they’ve moved up to IEEE 754 single precision floats in the Tesla C870.