GPU programming, a high level overview

Over the last year, there has been quite some fuss over using GPU’s for other tasks than gaming. News stories about AI and cryptocurrency mining requiring vast amounts of GPU’s; product shortages in the consumer gaming market are just a few to name. Share prices for semiconductor design companies, particularly Nvidia, have been soaring. Yet, on the other hand, there is little understanding in the general software development community as to how these techniques work, what class of problems they can solve more effectively, and how and when to use them.

In this series of blog posts, I will shed some light on this issues. We’ll see how applications can benefit from the massive throughput these chips can deliver. Our focus will be mainly on Nvidia’s CUDA platform, since it is the most established and well documented. There are alternatives, off course, like the AMD sponsored ROCm/HIP platform, or OpenCL. OpenCL has been around for some time now, but never gained much traction. This is partially due to Nvidia’s lacklustre support, pushing it’s CUDA platform, instead. Fortunately, some of these competing platforms are directly supported by some popular machine learning frameworks, such as TensorFlow and PyTorch, or libraries such as ArrayFire. There is also direct support in platforms geared towards technical computing, such as Julia or MATLAB. This is of particular importance for those who don’t want to end up with a vendor lock-in on Nvidia.

A little history

The graphics processor unit, or GPU for short, is a so-called co-processor. This means that the regular CPU instructs this processor and offloads certain computations that it cannot (efficiently) do itself.

The architecture is designed specifically to apply the same operations to a queue of independent data items. It has a relatively simple execution unit, but carries around many of them. That way, each of the execution units can pick an item from the queue and processes it, while the CPU can do something else. To render more visual spectacle (i.e. more data items in the queue) in the same time frame, manufacturers can either increase clock frequency, or add more execution units to drain the queue in time for the next image. Clock frequencies usually run close to what the manufacturing process allows, so adding more execution units is a straightforward way to increase performance.

For computer graphics purposes, part of the memory of the GPU is allocated as the frame buffer. This is the buffer that drives your video port, such as the HDMI or DisplayPort connector on your laptop. For our purposes, however, we never write to the frame buffer since we don’t need to draw anything on screen.

So what is so special about the graphics workload that a GPU can do so much better then a CPU? As an example, let’s look the early generation consumer market GPU’s, such as the Nvidia’s first commercial success, the Riva 128. It can generate 3D images by being instructed through a graphics API. At that time, and to this day, the two dominant API’s are OpenGL and Direct3D. These drivers take commands to create, and operate on, graphics primitives. These are primarily nodes (points in space), vertices that connect nodes (lines in space) and vertices forming triangles (surfaces in space) in a 3D space. The driver transforms these operations to a stream of low-level instructions the GPU hardware can understand and execute.

At first, the operations the GPU performed on these primitives were rather limited. These GPU’s had special hardware to perform geometric calculations in relatively few cycles. It had a fixed pipeline of multiple stages where it would move around and rotate every node based on what the camera position is of what we see on the screen. Then it would figure out which vertices were hidden by other vertices, and fill the visible ones with some predefined texture. What is special about these workloads is that for each stage, it is the same (very specific) arithmetic procedure that needs to be performed on many elements. This lead to a so-called SIMT (simultaneous instruction, multiple threads) architecture. A set of dedicated processing units all execute the same program, sharing the electronic circuit that does the instruction decoding.

While this model lead to some impressive advances in 3D gaming in the 90’s there was also a drawback. The graphics pipeline was not really programmable. You could instruct it to a certain degree, but it was all strongly tied to the functionality that the graphics API’s and the hardware provided. As game designers were naturally pushing for more impressive visuals, the fixed render pipeline became a creative limitation. The more spectacular special visual effects had to be programmed and processed on the CPU. This made the CPU again the bottleneck that the GPU was supposed to alleviate. As a result, GPU designers started to change their processing units support more general functionality. While the driver would still convert the graphics API calls to programs, it was now also possible to run your own program using a C-like shader language.

It did not take long for some smart engineers to figure out that some of their workload shared similar traits with 3D graphics workloads. They wrote specific shaders to parallelize their workflows on relatively cheap consumer hardware. Such hardware offered far more memory bandwidth for the price than traditional computer clusters offered. To provide enough bandwidth, such clusters often required special network interfaces and equipment which was (and is) expensive. This “abuse” of the shaders however, was quite inconvenient, as you still had to use the graphics API to load your data and program and extract the data from the GPU’s memory. It was primarily Nvidia who realized that there was a market for memory/compute bound CPU-bound workflows that they could start moving into and sell their GPU’s beyond their traditional video game-loving audience. To aid its market entry, Nvidia introduced the CUDA software development kit. This consists of the CUDA language, a C-like language^[1] with some modifications to make it suitable for SIMT, along with its compiler; a CUDA client library exposing the API as well as tools such as debuggers and profilers. With this SDK, a developer could build solutions to a very wide range of problems using a familiar paradigm. This gave rise to the term general purpose GPU programming, or GPGPU for short.

The kind of suitable workloads

So, apart from from the obvious workloads related to computer graphics, let’s have a more detailed look of some of the workloads that we can improve the throughput on. We have already seen that it works best if it can work on a sequence of independent data items. Off course, these do not necessarily have to do anything with computer graphics. In fact, they don’t even have to be independent (meaning data items referencing each other explicitly, or implicitly via the list order) as we will see in our next post! Any application that operates very large data sets of a similar nature can benefit from the GPU parallelization. Typical examples will be image and video analysis and modification and augmentation, high speed network filtering and monitoring, physics simulations, training of AI networks from prepared data and off-line 3D rendering, to mention a few.

Figure 1. An example of how execution on the CPU and GPU are interleaved in time.

Figure 1 gives a simplified overview of how the interplay between the processor and the GPU work. From a single thread on the CPU, a large number of threads get queued on the GPU. The CUDA driver will split up the work in batches and schedule them. The GPU will work on as many batches simultaneously as it has threads available. Based on the actual number of threads that the GPU can run simultaneously, part of the batch items might have to wait until the GPU has free capacity again. After the CUDA queue is filled by the main thread, the CPU can either wait on the results to come is (as sketched in Figure 1), or it can do something else and will block when it needs the results computed by the GPU.

So how does this work from a programmer’s perspective? As an example, have a look at a variant of one of the simplest CUDA programs, in jargon known as kernels. This one is almost obligatory in any tutorial, the GPGPU version of hello world. I performs an per element addition and multiplication:

__global__ void compute(float *a, float *b, float *c)
{
    int i = threadIdx.x + blockIdx.x + blockDim.x;
    c[i] = a[i] + b[i] * b[i];
}

We can call this code from a C or C++ file as it has the global specifier, that is compiled with Nvidia’s nvcc compiler by

float *a, *b, *c;

cudaMalloc(&a, 1024 * 4096);
cudaMalloc(&b, 1024 * 4096);
cudaMalloc(&c, 1024 * 4096);

// ... copying emitted for brevity

dim3 blockSize(1024);
dim3 gridSize(4096);

<<<gridSize, blockSize>>> compute(a, b, c);

where a, b and c are pointers into the memory of the GPU. For brevity, we exclude the memory cleanup and data transfer instructions. Alternatively, we could use the CUDA driver API directly, which gives more dynamic control over what to launch and how. This is also the only way that you can use CUDA when using foreign function interface bindings, such as JCuda for the JVM. Unfortunately, it does not seem to be very actively maintained. We still require manual resource management, and debugging and performance measure tooling will be more difficult to use.

While this shows the basic idea of how we can run the same program of different data (the variable i will be unique for each thread, per the automatically injected variables threadIdx and blockIdx), by itself is typically little practical value. To run a workload on a GPU, we have to pay a penalty for moving the data and the kernel to the GPU in the first place. Then, the scheduler needs to queue the actual kernel instructions on the processor, and finally we have to copy the data back from the GPU device memory.^[2]

A note on complexity

For use cases such as these, the computational complexity is of O(n): the amount of work we need to perform scales linearly with the size of the problem. This kind of computational complexity will typically result in the memory transfer being the bottleneck. Unless is the arithmetic complexity is high (meaning for each element in the input array we do a lot of work) it usually more worthwhile to do such work on the CPU. Things become more interesting for large problems with a worse complexity, such as sorting at O(N log N); Fourier transformations, also with O(N log (N)); matrix multiplication — O(N²), solving linear equation systems – upper bound of O(N³). With such computational complexity unfortunately also comes programming complexity. The same data has to be processed by different threads at different times, requiring the programmer to explicitly enforce synchronisation at the interaction points.

While performance naively seems to be about performing a workload in less time, a more interesting take on the class of problems is the so-called Gustafson’s law. This basically states how much a task is slowed down on a sequential machine, compared to what could have been done in the same time, the more so for problems whose complexity is of higher order. For example, in physics simulations this means we can model more detailed and/or contain larger structures; for finance this means we run more theoretical price models on different financial instruments and compare them against current market prices to reap real-time investment opportunities (so-called high-frequency trading).

For many applications, though, we need to add a certain amount of detail and complexity to be useful. A physics model that is so coarse that we cannot put the structures in in whose behaviour we want to simulate is of little interest. Perhaps a more striking example can be found by the emergence of AI systems that are currently taking the world by storm. A lot of the mathematical building blocks are in fact not that new, with some of them known since the seventies. The problem has always been that the compute power required to successfully train these models has been too limited to cross into the territory for usability for many operators. Once the performance of GPU’s and the toolkits for GPGPU made running these models feasible, research into AI and ML picked up rapidly again. This lead to the amazing tools that are now publicly available.

That’s it for today. Next time, we’ll dive a bit more into the technical details of the GPU model to see how it works in detail, and how to use this knowledge to build high-throughput kernels.

1. Newer versions support various versions of C++ as well.

2. On some architectures, such as the Nvidia Tegra, the CPU and GPU memory is shared, so we don’t pay the memory transfer penalty.