While current graphics processing units (GPUs) can offer at least an order of magnitude more processing power than the central processing units (CPUs) traditionally used to perform calculations, accessing this performance is frequently less than trivial. This series of articles examines some of the issues that arise when trying to optimize the performance of code running on the GPU.

Kernel Occupancy

Figure 1: Intel Processor Die Map (source: Intel)

A modern CPU, as shown in Figure 1, consists of a small number of identical, complex, general purpose cores. Each of these cores is designed to process a wide range of instructions and to execute a single thread of instructions at high speed. Much of the complexity of these cores is due to the fact that, to improve speed, the processing of each instruction is pipelined.

Figure 2: 16 Stage CPU Pipeline (source: Intel)

Each stage in the processing of an instruction is handled by a separate section of the hardware. This allows many instructions to be in flight at the same time, rather than each having to finish before the next begins, as shown in Figure 2 for a 16 stage pipeline. This approach does have drawbacks, however. Chief amongst these is that it may not be possible to start processing the next instruction, for example because it depends on the result of an instruction still being processed or on data still being loaded. Modern CPUs allow for this by looking ahead in the stream of upcoming instructions to find one that can be started, a technique known as out of order execution. The drawback is that the hardware to do this is quite complex and reduces the number of transistors that can be allocated to the job of actually processing the instructions. This is just one example of how the general purpose nature of the CPU detracts from its ability to do a single job, in this case processing arithmetical instructions.
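
To make the dependency problem concrete, consider the following C sketch (a hypothetical illustration, not code from any real workload). The first loop forms a chain in which every addition needs the result of the previous one, stalling the pipeline; the second breaks the chain into independent operations that the hardware can overlap.

    /* Each iteration depends on the previous value of acc, so consecutive
       additions cannot overlap in the pipeline. */
    double dependent_sum(const double *x, int n)
    {
        double acc = 0.0;
        for (int i = 0; i < n; i++)
            acc += x[i];
        return acc;
    }

    /* Four independent accumulators break the dependency chain, giving the
       hardware independent additions it can keep in flight simultaneously.
       (Assumes n is a multiple of 4 to keep the sketch short.) */
    double independent_sum(const double *x, int n)
    {
        double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
        for (int i = 0; i < n; i += 4) {
            a0 += x[i];
            a1 += x[i + 1];
            a2 += x[i + 2];
            a3 += x[i + 3];
        }
        return (a0 + a1) + (a2 + a3);
    }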

Figure 3: NVIDIA Maxwell Architecture Block Diagram (source: NVIDIA)

In contrast to a CPU, a GPU consists of a very large number of identical, simple, in-order processing elements. This structure is shown in Figure 3. These elements are simple, efficient and small precisely because, unlike a core in a CPU, they are designed only to support a limited set of primarily arithmetical operations. The remaining functions required of the GPU are performed by a much smaller number of special purpose elements which are shared between the much larger number of processing elements. The processing elements are organised into groups, each of which shares a number of special purpose elements and other resources, forming what is commonly termed a “compute core”.

While the structure of a GPU is very different from that of a CPU, many of the challenges in extracting the maximum performance from it remain the same. One of the primary challenges, as alluded to above when discussing the structure of a CPU, is to ensure that all of the processing elements are kept busy; the measure of this is known as kernel occupancy. With its very different structure, the GPU uses a different strategy from the CPU to maximize it.

Workgroups and Work Items

To extract the maximum performance from the GPU, the problem being solved must mirror the structure of the hardware: it must consist of a large number of identical tasks, or work items, each of which can be processed by one of the processing elements.
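
In OpenCL terms, for example, a vector addition decomposes naturally this way. The kernel below is a minimal sketch (the kernel name and arguments are illustrative assumptions), with one work item per output element.

    /* Minimal OpenCL C kernel sketch: each work item handles exactly one
       element of the output, so a problem of N elements becomes N identical
       tasks that can be spread across the processing elements. */
    __kernel void vector_add(__global const float *a,
                             __global const float *b,
                             __global float *c)
    {
        size_t i = get_global_id(0);  /* unique index of this work item */
        c[i] = a[i] + b[i];
    }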

It is the nature of the problem that allows the GPU to adopt a different approach to handling cases where a processing element cannot, for some reason, execute the next instruction in the thread it is working on. Instead of including complex logic to look ahead in the instruction stream for an instruction it can execute, the GPU exploits the fact that there are many work items to be performed and simply switches to another work item whose next instruction can be executed.

As mentioned above, a single processing element is not complete by itself but rather is part of a compute core. The processing elements of a compute core share resources and may switch between the work items assigned to that compute core, but not to work items assigned to other compute cores. The set of work items assigned to a single compute core is known as a work group.
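
Sharing resources within a compute core is what lets the work items of a work group cooperate. The OpenCL C sketch below is an illustrative example (names are assumptions, and the work group size is assumed to be a power of two); it uses local memory and barriers, both of which operate only within a single work group, never across work groups.

    /* Illustrative OpenCL C sketch: work items in the same work group share
       __local memory and can synchronize with barrier(); no equivalent
       synchronization exists between different work groups. */
    __kernel void partial_sum(__global const float *in,
                              __global float *out,
                              __local float *scratch)
    {
        size_t lid  = get_local_id(0);    /* index within this work group  */
        size_t gid  = get_global_id(0);   /* index across all work items   */
        size_t size = get_local_size(0);  /* work items in this work group */

        scratch[lid] = in[gid];
        barrier(CLK_LOCAL_MEM_FENCE);     /* wait for the whole work group */

        /* Tree reduction within the work group (size assumed to be a
           power of two). */
        for (size_t stride = size / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        if (lid == 0)
            out[get_group_id(0)] = scratch[0];  /* one result per group */
    }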

One of the most significant factors in maximising kernel occupancy is therefore to ensure that the problem to be solved is split into the optimal number of work groups, each containing the optimal number of work items. This is a non-trivial task and the correct values will vary depending on the hardware. The aim is to ensure that there are sufficient work groups to keep all the compute cores occupied and sufficient work items in each work group to keep all the processing elements in each compute core occupied. Some of the principal factors to be considered when computing the size of the work groups are listed below (a sketch showing how to query these values follows the list):

  • The number of compute cores available on the device being used.
  • The number of processing elements per compute core on the device being used.
  • The number of registers available per compute core vs the number required by the program to be executed.
  • The amount of shared memory available to each compute core vs the amount required by the program to be executed.
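
In OpenCL, these quantities can be queried at run time. The host-side C sketch below is a hedged illustration: it assumes a valid cl_device_id and cl_kernel have already been created, and omits error checking. Note that standard OpenCL does not expose register counts directly, but the per-kernel work group limit it reports reflects the kernel's register and local memory usage.

    /* Host-side C sketch of querying the values that constrain work group
       size in OpenCL. Assumes `device` and `kernel` already exist. */
    #include <CL/cl.h>
    #include <stdio.h>

    void report_limits(cl_device_id device, cl_kernel kernel)
    {
        cl_uint compute_units;
        size_t max_wg_size, kernel_wg_size, preferred_multiple;
        cl_ulong local_mem_size, kernel_local_mem;

        /* Number of compute cores (OpenCL: compute units) on the device. */
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(compute_units), &compute_units, NULL);

        /* Largest work group the device supports. */
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof(max_wg_size), &max_wg_size, NULL);

        /* Local (shared) memory available per compute unit. */
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof(local_mem_size), &local_mem_size, NULL);

        /* Per-kernel limits, reflecting its register and local memory use. */
        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(kernel_wg_size), &kernel_wg_size,
                                 NULL);
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(preferred_multiple),
                                 &preferred_multiple, NULL);
        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                                 sizeof(kernel_local_mem),
                                 &kernel_local_mem, NULL);

        printf("compute units: %u, max work group: %zu\n",
               compute_units, max_wg_size);
        printf("kernel max work group: %zu (preferred multiple %zu)\n",
               kernel_wg_size, preferred_multiple);
        printf("local mem: %llu bytes total, %llu used by kernel\n",
               (unsigned long long)local_mem_size,
               (unsigned long long)kernel_local_mem);
    }

A common starting point is to make the work group size a multiple of the preferred multiple reported for the kernel, then tune from there on each target device.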