
GPGPU::Fast Transfers Tutorial

Contents

  1. Introduction and prerequisites
  2. Code framework
    1. Input data layout and user-defined parameters
    2. OpenGL and shader management
    3. Remarks and limitations
  3. Fast transfers via PBOs
    1. PBO setup and initialisation
    2. Conventional transfer to GPU memory
    3. PBO-accelerated transfers to GPU memory
    4. Analysis
    5. Conventional transfer from GPU memory
    6. PBO-accelerated transfers from GPU memory
    7. Analysis
  4. Summary and results
  5. Acknowledgements
  6. Disclaimer and copyright

Source code

Source code of the example implementation is provided. All code has been tested using Visual Studio.NET on Windows and the GNU and Intel compilers on Linux. Please note that the OpenGL extension this tutorial builds on is currently not supported on ATI hardware; a decent NVIDIA GPU is required. GLUT and GLEW are used to set up a rendering context and manage extensions, as explained in my other tutorials. Cg is required for the first version below; the GLSL version requires OpenGL 2.0 because I decided to use the core shader features instead of the (old) GLSL extensions. The sample code comes as one C++ file, and compilation is trivial if you link against the required libraries.




Introduction and Prerequisites

This tutorial demonstrates how to use the so-called pixel buffer object extension of OpenGL (in short: PBO) to achieve faster transfers to and from GPU memory. It is based on one particular application scenario, but we will try to explain the underlying concepts in enough detail so that adaptations to other applications should be reasonably easy. Even if the application as outlined below is far from what you want to achieve using GPGPU techniques, you should read through the tutorial, since the bottom line is that PBO transfers of floating point data are almost always more efficient than traditional ones, requiring a surprisingly small, self-contained set of modifications to your source code. If you do not want to learn about the "application" covered in this tutorial, continue directly in chapter 2.

Related work

This is a tutorial, and not a tool to benchmark all available combinations of texture formats, targets, pack alignments etc. If you are looking for such a tool (and a paper describing the results), refer to Owen Harrison's transferBench project. You might also want to take a look at GPUbench and especially their library of results.

"Application" outline

Let us assume that some high-precision CPU application generates a comparatively large amount of double precision data and performs compute-intensive operations on it in a sequential fashion. "Comparatively large" means the amount of data is much larger than the onboard memory of GPUs, which is typically in the range of 128 MB to 1 GB these days. Two real-life examples of such co-processor style scenarios are:

In order to offload these calculations to GPUs serving as co-processors, we need to find efficient solutions to the following two subproblems:

Ideally, we would like to stream the next set of data into GPU memory while computing on the current one and reading back the results of the last set of input data, interleaving communication and computation. To keep this tutorial as general as possible, it will not demonstrate how to do this explicitly. We do, however, take implicit advantage of such DMA transfers, and some ideas on how to realise such schemes will be discussed.

In this tutorial, we will emulate the scenario described above in the following way:

Notes for the reader

This tutorial is not meant to reiterate everything that is explained in detail in the Basic Math Tutorial. The reader is expected to have the level of understanding of GPGPU computing conveyed in that entry-level tutorial. In the following chapters, we will therefore not discuss topics such as ping-pong style computations, viewport-to-computational-domain mappings, shaders-as-kernels etc. If you are not familiar with these, please refer to the entry-level tutorial first.




Code framework

In this section, we will briefly describe the accompanying code implementing the "application" outlined above. For details on why a particular decision was made and how a particular OpenGL detail or GPGPU concept is realised, please refer to the Basic Math Tutorial.

Input data layout and user-defined parameters

Input arrays are supposed to have a fixed length N, chosen so that the mapping from 1D arrays to 2D textures is trivial: N=texSize*texSize. Input arrays are partitioned into numChunks disjoint sets, each consisting of numArraysPerChunk=9 individual arrays. The value 9 is mostly historical: when I first implemented PBO toy code for one of my projects in early 2006, I was working with 9-banded matrices. If you want to change this number, you will have to change the hard-coded shader as well; a (user-friendly) dynamic assembly of the computational kernel / shader source is beyond the scope of this tutorial.

The default settings in the code result in approximately 1.6 GB of double precision input data, accessed via the variable chunks. You can adapt the memory requirements to your hardware by increasing or decreasing the value of numChunks or by modifying the array length N (texSize). Due to the way the code is designed, choosing a combination of settings that requires more memory than is physically available will result in extensive hard disk paging and swapping, creating a bottleneck which renders the entire tutorial code useless. The code prints out an estimate of the memory load at startup, allowing you to terminate it in time.

To emulate the workload of the kernel on each chunk of data, use the variable numKernelSteps to define the number of iterations the kernel performs in a ping-pong fashion before the next set of input arrays is streamed in. The variable numIterations can be used to iterate the whole computation sequence several times, especially to reduce timing noise if the amount of input data is less than one gigabyte.
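
To summarise how these parameters fit together, here is a minimal sketch of their declaration; the concrete values are only illustrative and the actual defaults in the accompanying code may differ:

 // Sketch of the user-defined parameters (values are illustrative only).
 const int texSize           = 1024;              // textures are texSize x texSize
 const int N                 = texSize * texSize; // length of each 1D input array
 const int numArraysPerChunk = 9;                 // fixed, matches the hard-coded shader
 const int numChunks         = 24;                // scale this to your physical memory
 const int numKernelSteps    = 4;                 // ping-pong passes per chunk of data
 const int numIterations     = 1;                 // repetitions of the whole sequence

 // Estimate of the double precision input data held in main memory, printed
 // at startup so the run can be aborted before paging sets in.
 const double memoryGB = double(numChunks) * numArraysPerChunk * N
                         * sizeof(double) / (1024.0 * 1024.0 * 1024.0);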

In the accompanying example implementation, the change from conventional to PBO-based transfers is toggled via a simple preprocessor definition called USE_PBO.

OpenGL and shader management

The shader itself is trivial: it just accumulates all input arrays componentwise into a new array that holds the running total of the values. Enabling and running this shader is plain Cg or GLSL usage. A standard one-to-one mapping of the viewport to the data and the output domain is employed. Rendering to textures in a ping-pong fashion to iterate over all sets of input data is achieved using one FBO with two textures attached to it. For details, please refer to the comments in the accompanying source code.

Remarks and limitations

It is important to note that only one set of input textures is used, corresponding to one set of input data. These textures are organised in an array inputTextures of length numArraysPerChunk. The reason for this is simply that memory is a precious resource: while we could of course allocate enough textures to hold all input data, this would increase the memory requirements by a factor of 1.5 compared to the amount of double precision data. One gigabyte worth of input data would require 1.5 gigabytes of available physical memory: the call to glTexImage2D() implies a malloc() in driver-controlled memory. In real applications, unnecessary allocation of memory should naturally be avoided. Optimal texture management for such a streaming application would be to maintain a round-robin buffer of textures so that the available onboard memory is almost entirely occupied by these textures. Please keep in mind that the shaders and especially the framebuffer(s) use onboard memory as well. It is beyond the scope of this tutorial to teach how to maintain such a buffer, since we deliberately do not cover how to explicitly interleave communication and computation, as explained above. Such a scheme would however result in a significant performance increase, and might be covered in a follow-up tutorial.

PBOs are not an out-of-the-box mechanism to achieve improved performance for free! The performance of PBO-based implementations depends strongly on a number of parameters:

One other aspect that should be kept in mind is that PBOs themselves are quite memory-hungry. The way they are currently implemented in the drivers results in approximately three times the amount of memory being allocated: if the PBO is created to store 1 MB of data, 3 MB will be allocated. The good news is that only 1 MB will actually be used (filled with data); still, excessive usage of PBOs can easily flood the swap space. There is an admittedly ugly macro in the accompanying source code which can be used to monitor this behaviour on Linux systems.
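
One way to observe this on Linux is to inspect the virtual memory size of the process before and after creating a buffer object. The following is only a sketch of the idea, not the macro from the accompanying code:

 // Hypothetical helper (not the macro from the accompanying source code):
 // print the current virtual memory size of the process on Linux by
 // parsing /proc/self/status. Call it before and after glBufferData().
 #include <cstdio>
 #include <cstring>
 void printVmSize(const char* label)
 {
   FILE* f = fopen("/proc/self/status", "r");
   if (!f) return;
   char line[256];
   while (fgets(line, sizeof(line), f))
     if (strncmp(line, "VmSize:", 7) == 0)
       printf("%s %s", label, line);
   fclose(f);
 }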




Fast transfers via PBOs

PBO setup and initialisation

In order to use pixel buffer objects, we have to generate them using standard OpenGL API calls:

 #define BUFFER_OFFSET(i) ((char *)NULL + (i))	

 GLuint ioBuf[10];
 glGenBuffers(10, ioBuf);    

The first line is a convenience macro to offset data in PBOs, taken directly from the extension specification. Since we want to stream nine arrays into nine textures at a time and need one additional PBO to demonstrate asynchronous readbacks, we generate ten buffer objects in one go. These are the only required steps in the initialisation phase of an application that benefits from PBOs.
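
For completeness, the matching cleanup at shutdown is a single call; this is only a sketch and assumes no buffer is still bound to a pixel pack or unpack target at that point:

 // Release the ten buffer objects again once they are no longer needed.
 glDeleteBuffers(10, ioBuf);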

Conventional transfer to GPU memory

For the scenario covered in this tutorial, conventional transfers of input data to textures require two steps:

 copy (chunks[chunk][array],tmpfloat);

 glTexSubImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, 0, 0, texSize, texSize, 
                 GL_LUMINANCE, GL_FLOAT, tmpfloat);

The first call explicitly converts a double precision array to a single precision array using a loop encapsulated in a simple function (refer to the source code for details). The second call copies the data to a previously allocated texture. The various parameters are discussed in detail in the Basic Math Tutorial.
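
A minimal sketch of what such a conversion helper might look like is given below; the copy() in the accompanying source code may differ in signature and details (and also handles the reverse float-to-double direction for readbacks):

 // Hypothetical sketch of the conversion helper: convert one double
 // precision input array of length texSize*texSize to single precision,
 // writing into dst (a temporary array or a mapped PBO).
 void copy(const double* src, void* dst)
 {
   float* out = static_cast<float*>(dst);
   for (int i = 0; i < texSize * texSize; i++)
     out[i] = static_cast<float>(src[i]);
 }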

PBO-accelerated transfers to GPU memory

PBO-accelerated transfers are only slightly more complicated:

 glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, ioBuf[array]);
 glBufferData(GL_PIXEL_UNPACK_BUFFER_ARB,texSize*texSize*sizeof(float), 
              NULL, GL_STREAM_DRAW);
 void* ioMem = glMapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY);
 assert(ioMem); 
 copy(chunks[chunk][array], ioMem);
 glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB);
 glTexSubImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, 0, 0, texSize, texSize, 
                 GL_LUMINANCE, GL_FLOAT, BUFFER_OFFSET(0));
 glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);

Before we discuss and explain the details of this sequence of API calls, it is important to note that (assuming the buffer objects are initialised correctly as outlined above) in order to use PBOs to increase transfer speed, all that needs to be done is to replace the conventional calls with these new ones. In other words, this is only slightly more complex than plain copy and paste.

In the first two calls, we bind the buffer object (array is just a loop index over our nine input arrays). Next, we invalidate this buffer object by passing NULL as data. This call also defines the size of the buffer. It is generally recommended to invalidate each buffer before actually using it to avoid problems. The last parameter is a hint to the driver how the buffer object will be used. Changing the usage pattern (or the size) of a buffer object is an expensive operation, which is the reason why we use separate buffers for transferring data to and from the card. The usage patterns are explained in much detail in the vertex buffer object extension, which is a prerequisite to the PBO extension. Essentially, GL_STREAM_DRAW hints to the driver that the data store contents will be specified once by the application and used at most a few times as the source of a GL (drawing) command, which is precisely what we want to accomplish.

By calling glMapBuffer(), we acquire a pointer to the first data item in this buffer object by mapping it to the so-called unpack buffer target. The unpack buffer refers to a location in memory that gets "unpacked" from CPU (main) memory into GPU memory, hence the somewhat confusing terminology. This is a pointer to a chunk of memory in what I like to call "driver-controlled memory". Mapping the buffer also changes the "permissions" to access this particular chunk of memory, essentially transferring control of the memory to us. While the buffer is mapped, the driver does not have control over it. Depending on the driver, this might be PCIe/AGP memory, or ideally onboard memory. The access specifier GL_WRITE_ONLY tells the GL that while we have control of the memory, we will access it write-only. This allows for internal optimisations and increases our chances of achieving good performance. If we mapped this buffer read/write, it would almost always be located in system memory, from which reading is much faster. Our access pattern is write-only; all available access modes are listed in the vertex buffer object extension.

Once we have gained control over this particular chunk of memory, we can copy data into it. If the data were in the same format as the buffer, we could use memcpy() and possibly DMA, but our input data is double precision and we have to convert it to single precision explicitly in a loop, encapsulated in the small utility function copy(). Note that we do the conversion and the data movement to the buffer in one call, a significantly more efficient approach than converting the data first and copying it afterwards.
After the data conversion and movement are complete, we release this particular chunk of memory by unmapping the corresponding buffer object, effectively giving control back to the GL driver.

We have not yet populated any texture with the data, which we need to do in order to use it as input to our shader. The next call does exactly this by sourcing the still bound buffer object. Since we gave control over the memory back to the GL driver, ideally no actual memcpy() takes place if the driver decided to map the buffer in onboard memory previously. If not, one true, asynchronous DMA transfer from driver-controlled memory to onboard memory takes place. Note that the parameters to glTexSubImage() are identical to the conventional version except for the last one, where we pass an offset into the bound buffer instead of a pointer to an array in system memory.

Finally, we unbind the buffer object by binding zero. This call is crucial, as performing the following computations while the buffer object is still bound will result in all sorts of weird side effects, possibly even delivering wrong results. Refer to the specification to learn more about which GL calls act differently while a buffer is bound.

Analysis

The conventional transfer implies a series of non-DMA memory operations: first, an explicit conversion from double to single precision is performed by the CPU. The call to conventional glTexSubImage() implies an explicit memcpy() into driver-controlled memory, since the OpenGL specification states that it is safe to overwrite the contents of the input array after passing it to that API function. glTexSubImage() additionally implies rearranging the data as the driver deems necessary. The version using PBOs is hence faster, since the CPU is only involved once in the entire process, converting and rearranging in one go. We definitely save one memcpy(), and possibly a second one if the driver does not decide to DMA the data anyway after we unmap the buffer. Experiments with using just one buffer object instead of nine indicate that a true DMA is performed once we unmap the buffer, since using more (nine in this case) buffer objects results in overall better performance.

Conventional transfer from GPU memory

Conventional readback of data from the GPU to the CPU requires three steps:

 glReadBuffer(attachmentpoints[readTex]);
 glReadPixels(0, 0, texSize, texSize,GL_LUMINANCE,GL_FLOAT,tmpfloat);
 copy(tmpfloat, data);	

First, we declare one texture attached to our FBO as the source of read operations; then we read the data back into a temporary single precision array; finally, we convert it explicitly to double precision to use it again in our "application".

PBO-accelerated transfers from GPU memory

To exploit PBOs for transferring data from the GPU to the CPU, the following steps are required:

 glReadBuffer(attachmentpoints[readTex]);
 glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, ioBuf[9]);
 glBufferData(GL_PIXEL_PACK_BUFFER_ARB,texSize*texSize*sizeof(float), 
              NULL, GL_STREAM_READ);
 glReadPixels (0, 0, texSize, texSize, GL_LUMINANCE, GL_FLOAT, 
               BUFFER_OFFSET(0));
 void* mem = glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY);   
 assert(mem);
 copy(mem,data);
 glUnmapBuffer(GL_PIXEL_PACK_BUFFER_ARB);
 glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0); 

First, we declare one texture attached to our FBO as the source for the readback and bind the buffer to the pixel-pack target. Note: PBOs 0--8 were used during download, so here we use the previously generated PBO at index 9. Next, we invalidate this buffer by passing NULL, the target size and a usage pattern: GL_STREAM_READ hints to the driver that the data store contents will be specified once and used at most a few times by the application, which is exactly what we want to do. For details and other usage patterns, please refer to the vertex buffer object extension.

In contrast to a conventional glReadPixels(), this call returns immediately when a matching buffer is bound, and just triggers a DMA transfer into the buffer object, which is most definitely located in system memory due to the usage pattern we specified. Readbacks are only asynchronous on the CPU side. To take advantage of this, a real application would have to perform a lot of independent work on some other data while the result is DMA'ed in the background, as illustrated by the sketch below.
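
To illustrate the interleaving, here is a sketch of how a real application might structure the readback; doSomeIndependentCPUWork() is a hypothetical placeholder for work that does not depend on the result:

 // Sketch: with a pack buffer bound, glReadPixels() only initiates the DMA
 // transfer and returns immediately; glMapBuffer() later blocks until the
 // data has actually arrived. doSomeIndependentCPUWork() is hypothetical.
 glReadPixels(0, 0, texSize, texSize, GL_LUMINANCE, GL_FLOAT,
              BUFFER_OFFSET(0));
 doSomeIndependentCPUWork();
 void* mem = glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY);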

Unfortunately, not all data layout and format combinations are supported for asynchronous transfers. I recommend performing benchmarks with Owen Harrison's transferBench tool to figure out the details for your application.

In our case, async transfers are not supported. This is not a particular problem, since there is no work to be done in the meantime. We acquire a pointer to the data by mapping the buffer to the pixel pack target. We again indicate what we want to do with the data the returned pointer relates to, accessing it read-only in our case. Note that this call will block until the data is available, just as glReadPixels() would have blocked in the conventional approach.

Finally, we convert the data back to double precision, since the underlying "application" will continue using it in double precision. This step is performed explicitly by the CPU and is not necessary in most applications. To give control over the memory back to the driver, we unmap the buffer and unbind the buffer object.

Analysis

As explained previously, conventional transfers require a pipeline stall on the GPU to ensure that the data being read back is consistent with the state of the computation. PBO-accelerated transfers are NOT able to change this behaviour; they are only asynchronous on the CPU side. This behaviour cannot be changed at all due to the way the GPU pipeline works. This means in particular that PBO transfers from the GPU will not deliver any speedup with the application covered in this tutorial; they might even be slower than conventional ones. They are however asynchronous on the CPU: if an application can schedule enough work between initiating the transfer and actually using the data, true asynchronous transfers are possible and performance might be improved, provided the data format allows this.




Summary and results

On a GeForce 7800 GTX, the throughput of streaming 16.4 GB of double precision input data is increased from 0.7 GB/s to 1.8 GB/s (runtime 24.0 s vs. 9.1 s for the accompanying source code) by just replacing conventional downloads with PBO-accelerated ones, for the same computation. Hence, using PBO-accelerated downloads as introduced in this tutorial is a must-do, especially since it is essentially free from a programming point of view.

It should be noted that the actual amount of data transferred is of course a factor of two smaller, since the arrays are converted to single precision before the transfer. That is why I talk about throughput and not about transfer speed via the PCIe bus. A version of the code that uses single precision input data is also provided and reaches a throughput of 1.2 GB/s, which is reasonably close to the real transfer speed one would expect.

For readbacks, gaining speedups by using PBOs requires some code restructuring and cannot be applied in all applications. The reason is that readbacks are only asynchronous on the CPU, not on the GPU. To benefit from PBO acceleration, a lot of independent work needs to be scheduled between initiating the transfer and requesting the data.




Acknowledgements

I would like to thank Mike Hudson, Oliver Korb and Thomas Rohkämper for beta-testing and proofreading. Mike also created the initial version of the GLSL port, which I ported to GL2.0 syntax. As always, the forums at gpgpu.org deserve proper credit.


Disclaimer and Copyright

(c) 2006 Dominik Göddeke, University of Dortmund, Germany.

The example code for this tutorial is released under a weakened version of the zlib/libPNG licence:

This software is provided 'as-is', without any express or implied
warranty. In no event will the author be held liable for any 
damages arising from the use of this software. 
Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely.

Feedback (preferably by e-mail) is appreciated!