Last week we had a quick glance of what GPU programming is and what kind of issues it can tackle. Today, we will dive a bit under the hood and take a look at the logical architecture of modern chips, and show a basic strategy when to migrate part of our code over to the CUDA platform.

The hardware

One of the prime challenges with GPU programming is that its model exposes much more of the hardware model than traditional CPU programming does. Modern CPU’s are internally quite complex beasts. They use many techniques such as out-of-order execution, speculative execution and super-scalar design. Such tricks try to execute multiple instructions simultaneously, or hide memory latency, all to achieve maximum instruction throughput. These are all handled in hardware, and a program is guaranteed to be executed as if the instructions were executed sequentially, one at the time.[1] For a developer, this makes live a bit easier.

The GPU architecture takes a different approach. The execution unit of a GPU processor is comparatively much simpler. It doesn’t contain much logic to optimize its throughput, and doesn’t do speculative execution. The hardware instruction scheduler, however, is much more complicated, and responsible for maintaining throughput. Fortunately, developers don’t need to be concerned with its details, but knowing it exists is vital to writing correct and efficient programs. Thus, some of the effort of optimizing against the target platform is shifted to the developer to write high-throughput code. The upside of this model is that the amount of chip real estate (i.e., the number of transistors) for an execution unit is much lower. These space savings mean the same execution unit can be duplicated many, many times on the same chip.[2] On some of the more expensive GPU’s, this means that it can easily run over 100.000 threads concurrently.

For programming, we are mainly concerned with the logical model of the GPU. This is a programming model that, while quite close to the underlying hardware, abstract away some issues that the work dispatcher to the GPU in the driver or a compiler can do for us. This frees a developer from some of the more mundane work regarding scheduling and memory details.

The programming model

Right down on the lowest level, there are — obviously, off course — individual threads. They work on a single data item, such as a floating point number or an integer. This is really quite similar to a CPU thread, and it is also the unit that will execute our program. Each thread has a linear sequential id (the threadIdx and blockIdx variables that are globally available inside CUDA code). With this offset id, each thread can load or store a unique data element from/to memory, as we can see in the example in figure 1.

Next up are warps, which are a combination of 32 threads on Nvidia GPU’s (and 64 threads on AMD GPU’s). A warp is the minimum workload that can be scheduled. Hence, we always need to run our program in multiples of 32 threads. It is important to remember this if the data array we are working on is not a multiple of 32 elements. Part of a warp might read (or write!) outside the data boundary if we don’t tell it explicitly to do nothing. The reason for this is that a warp has a single instruction decoder. Each thread in the warp thus need to either execute the same instruction, while working on different data, or do nothing. An immediate consequence of this model is that if some threads within a warp branch diverge while others do not (e.g. by an if/else), the threads that don’t take the branch idle all thread have joined a common point again. Interestingly, modern GPU’s are not required to run a branch to completion before working on another branch; it might interleave the code in two ore more branches, with branches progressing in any order. Because of this undecided order, on some cases when reading and writing to memory, it is required to provide synchronization points within a warp itself. The GPU won’t do that for us, unfortunately.

Figure 1. A very simple SIMT program containing a single warp. All threads in the warp execute the same instruction, but on different data.

Next up are blocks – these are also called cooperative thread arrays in PTX.[3] This is a collection of warps that run on the same core — called a streaming multiprocessor in Nvidia parlance. A single chip typically contains multiple of these multiprocessors. All threads within a block have access to the same shared memory. Shared memory is a low latency memory buffer, a kind of scratchpad memory, used for threads within a block to share data with each other. While it is extremely fast — a couple of clock cycles, it is also a scarce resource, with many architecture generations having no more than 64 kib per multiprocessor. Shared memory needs to be declared specifically a such within the GPU program; while we can access it as regular memory, the compiler will map all reads and writes to this memory to the correct instructions. A kernel invocation typically consists of multiple blocks, which together form a grid. Different blocks of the same kernel will not necessarily be scheduled on the same multiprocessor, or even the same physical chip for that matter.

It is the CUDA driver that is responsible of placing a block on the hardware,a developer cannot rely on a specific execution order. The only way to guarantee that, at an arbitrary point in the code, all blocks have executed and their memory writes to be globally visible, is to split the code up in two kernels and run them sequentially.

Yet one level up the hierarchy we find the device scope. It holds the global memory for all threads running on the chip, and is the prime method for the GPU and the CPU to exchange data. Memory transfers can only be initiated from the CPU, and have a blocking and non-blocking variant. When we need to perform frequent or large transfers, we can use a special type of memory allocation that is guaranteed not to be swapped out or moved physically — called pinned memory. That way, the data transfer can done by a dedicated high-speed direct memory access (DMA) controller, freeing the CPU from doing the transfer itself. Additionally, the GPU itself also doesn’t have to wait for these transfers to complete. It is possible to copy data from/to the device, execute CPU programs and GPU programs simultaneously to achieve maximum throughput.

At the bottom of the hierarchy is the system scope, which mainly deals with multiple GPU’s and GPU/CPU synchronisation. For now, we will not dive into this level. To summarize the hierarchy, figure 2 shows how the different levels build on top of each other, and what kind of resources they own and expose to the level above.

gpu hierarchy
Figure 2. A very simple SIMT program. All threads in the warp execute the same instruction, but on different data.

As we have seen last time, workloads of O(N) will typically not benefit too much from GPU parallelization because of the data transfer overhead. Any increase in complexity implies, however, that the results at array offset i will somehow depend on the results at other offsets as well. Threads outside a warp cannot exchange data directly; the threads will have to communicate by reading and writing to common memory. Ideally this is fast shared memory, but for some applications writing to device memory is inevitable. As we do not control the warp scheduling, to prevent data races, such reads and write have to be synchronized. Unfortunately, these tend to be relatively expensive, so a developer should strive to prevent redundant shared memory access. In order to prevent excessive waiting, synchronization can be applied at different levels of the hierarchical GPU model. This keeps the synchronization communication on the chip as local as possible, and thereby the time spent in the synchronization lock.


One of the more interesting aspects of a GPU is the scheduling behaviour, as this is quite different from CPU’s. On a normal CPU, a thread runs on a single core in apparently sequential order, until it gets exempted by the OS scheduler. The OS scheduler decides which task should run next, and the CPU continuous executing that task at the point where it was pre-empted previously. During the time before the next pre-emption, the CPU will use all tricks in the book to prevent stalling its internal execution units to make as much progress through a program as possible, while still adhering logically to the code sequence as we programmed it.

A GPU, on the other hand, is a latency-hiding architecture. Now what does that mean? In a regular sequential program, there are typically only a few instructions that can be executed together. For example, while we are waiting for some data from memory to arrive, we can already start working on some other instructions for which we do have all the dependent data, since the CPU requires only logically consistent observable behaviour.

The GPU does a somewhat similar trick, but uses the data parallelism explicitly. While it might, for example, be waiting for a memory load for a warp 1 to come in, it is in most cases not sophisticated enough to already start working on data it does have in this warp. But it doesn’t have to! If our kernel call contained enough warps, it could simply try to progress on another warp that is ready to have its next instruction executed. An example of this mechanism is shown in figure 3. A chip will hold one or more multiprocessors, which manage the shared memory, caches and registers for all warps that are scheduled on it. On most CUDA architectures, there are four warp queues per mutiprocessor, and each can hold 64 warps. A mutiprocessor should typically contain enough registers to maintain the state of all warps that can be scheduled. A developer has to be careful, though: even with the number of registers running into the many 10’s of thousands, they are still a scarce resource. Using to many of them in a program will decrease the number of warps that can be active simultaneously.

warp q progress
Figure 3. Illustration of a warp queue. At t0 there are three warps that can proceed. Warps 2 and 6 have their instructions executed, and at t1 we are still awaiting the result. In the meantime, warp 1 and 4 have completed their instruction and are ready to have their next instructed dispatched.

As an example how this architecture increases throughput, let look at a somewhat fictional memory access example. Suppose that a memory read from device memory takes a considerable number of clock cycles from instruction dispatch to the data showing up in the thread registers. If we have enough warps queued up, though, any of these can do some other work (or perhaps the same work), while the first warp awaits the memory transfer. The GPU architecture is designed to exploit this kind of parallelism, so compared to OS threads, there will not be any switching overhead when working on a different warp.

This example also nicely show why we shouldn’t run jobs that are too small on the GPU. If the total set of warps is too small, we will more frequently see that no warp is ready to progress and we waste the cycle, thus defeating the latency-hiding system. A multiprocessor contains multiple execution units for different workloads (e.g. load/store, floating point and integer operations). The warp schedulers will at every clock cycle try to match a warp that can make progress to an available execution unit. To make sure we optimally use the available hardware, the GPU will expose all kind of metrics detailing the occupation of its resources. A tool like Nvidia’s Nsight is a nice way to analyse these metrics.

Assess, (Parallelize), Optimize, Deploy

Now that we have a better understanding of how the GPU does its magic, let’s zoom out a little bit. GPU programming is a nice tool too have in the toolbox, but we need to apply if effectively. When building algorithms, it is always best to have a CPU implementation, first. These are typically easier to program, test, verify and integrate into an existing codebase, and at any rate serve as a base case. In my experience, clients many times don’t specify any performance criteria, but will typically have some presumptions about what would be considered acceptable. Now suppose we find ourselves in a situation were the client not too happy with the performance, what do we do?

The answer, off course, is to profile our application. This can either be done by writing specific benchmark cases if the slow use case is known a priori, or (somewhat more tricky) collect profiles from a acceptance or production environment. This will show if there are any hot-spots in the code that will give the highest pay-off when we optimize. If such spots exists, we need to make sure the current solution is properly coded, and is not limited by I/O operations, in which case the GPU will be of little help. Sometimes, just using a different approach to solving the problem can make a huge difference. If there is still an obvious bottleneck, we can possibly start to think to offload the work to a GPU.

A decision to offload, will have many consequences that should be considered. Again, we limit ourselves to the CUDA case as it’s a more mature and better documented solution then OpenCL or HIP, so its easier to start with. The drawback is that we are now required to run on a machine with an Nvidia CPU, whereas other solutions can run on a multitude of platforms (even the regular CPU). If we want to run our solution in the cloud, the machines tend to be more expensive then their non-GPU counterparts. Second, the instance must have a GPU that matches (at least) the compute capability target architecture. The exposure of the GPU can be done either directly by running a virtual machine that supports mapping the GPU into the virtual machine space directly, or indirectly though e.g. Docker. To use an Nvidia GPU on docker, the Docker host needs to have nvidia-container-toolkit installed, and the container needs to be started with the --gpus flag to expose the device.

How successful the offloading to the GPU will be, depends to what extend parts of the workload can progress independently. If our code contains many synchronisation points, where blocks or whole kernels have to wait until all other have completed, performance will be degraded –- sometimes quite dramatically.

After an initial implementation, the kernel should be profiled by running a representative problem. Using Nvidia’s Nsight compute, we can get quite a lot of detailed information which parts are performing and which not, together with some suggestions on how to remedy the situation. In my experience, an initial GPU implementation tends to be quite suboptimal, so visualizing what is going on is an indispensable help to utilize the hardware efficiently.

Where to begin

Unfortunately, both the CUDA runtime API and its lower level CUDA driver API companion are written in C only. These are needed to manage the GPU context and execute our program on the GPU. From the major programming languages, it seems only Python has an Nvidia maintained wrapper. In my opinion, Python is a great language for exploratory analysis and rapid prototyping. Besides, a developer also gets access to quite a lot of libraries and packages that run GPU kernels. Some prominent examples of these are machine learning libraries such as PyTorch and Tensorflow. Also, Nvidia makes quite a lot of libraries available for common algorithms and tasks.

Now that we have a proper overview of what the pitfalls are in GPU programming, we are ready to try to tackle an actual problem. Next time, we will look at some actual code, and see what the profiler tools have to tell us about it.

1. This assumption famously ceases to hold in multithreaded programs. This is a fascinating topic in itself, however.
2. Interestingly enough, Intel tried something similar with the — now discontinued — Xeon Phi architecture, which packed a lot of relatively simple early Pentium cores, with some modifications, on a single chip.
3. For now, we will not concern ourselves with PTX. This is an intermediate representation, similar in philosophy to Java bytecode or LLVM intermediate representation. For those who want a thorough understanding of GPU architecture, diving into the PTX instruction set architecture programming guide is highly recommended.