How to build a BVH – part 9a: To the GPU

In this series we explore the wonderful world of the bounding volume hierarchy: the mysterious data structure that enables real-time ray tracing on recent GPUs, in complex, animated scenes. Despite the broad adoption of ray tracing in games, in-depth understanding of the underlying technology seems to be reserved for a select few. In these articles we show how a proper BVH is implemented without support from DXR / RTX / OptiX / Vulkan, in easy-to-read ‘sane C++’. Whether this is actually useful for you, or just interesting as background information, obviously depends on your needs.

This series consists of ten articles. For each article, full source code is available on GitHub, in easy-to-read ‘sane C++’. Contents:

  1. Basics: data structure, simple construction and ray traversal.
  2. Faster rays: the Surface Area Heuristic and ordered traversal.
  3. Faster construction: binned BVH building.
  4. BVHs for animated models: refitting and rebuilding.
  5. BVHs for animated scenes: the top-level BVH.
  6. TLAS & BLAS part 2 – all together now.
  7. Consolidating, part 1: ray tracing textured and shaded meshes.
  8. Consolidating, part 2: Whitted-style ray tracing.
  9. OpenCL: GPU ray tracing, part 1 (this article).
  10. OpenCL: GPU ray tracing, part 2 (series final).

We approach the end of this series. After building the BVH on the CPU, and rendering an animated scene with textured and reflective teapots on the CPU, we take the rendering process to the GPU, chasing visions of better performance. The GPU bits in this article will be implemented in OpenCL. The GPU code will not rely on hardware ray tracing; in fact, we will not use any RTX functionality at all. Of course we will do some performance measurements, and compare against RTX – in article 10.

GPU

Compared to a CPU, a GPU is a strange beast. It sits pretty isolated in a computer: separated, by a slow connection, from CPU, RAM, user input, disk I/O and other commodities. The only thing it has quick access to is the display. In terms of processing it’s an oddity as well: a large group of cores chews on an even larger number of threads, which ideally all do the same thing, just on different data. Still, the prospect of vast bandwidth and great compute power is tempting, and justifies some pain and effort to get this beast under control.

In this article we will use OpenCL to access the GPU. This is a somewhat arbitrary choice; e.g. CUDA (NVIDIA only) or compute shaders would also do the job. Regardless of the chosen GPU programming framework, we’ll have to deal with an API, which can be a bit cumbersome, especially compared to the quite pure C++ we have been using so far. To make our lives a bit easier, templates exist, and the template we’ve been using so far comes with some decent OpenCL functionality. If you are not using the template, please copy the OpenCL specifics from the template to your own code; it’s a pretty thin wrapper.

The OpenCL program we’re going to run is pretty simple:
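A minimal sketch of such a kernel follows. The exact gradient formula is an assumption; the OpenCL qualifiers and get_global_id are stubbed with plain C++ equivalents here, so the logic can be verified on the CPU. In the actual kernels.cl these stubs are of course absent.

```cpp
// stand-ins for OpenCL built-ins, so the kernel logic compiles as C++
#define __kernel
#define __global
static int gidx[2];                              // simulated work-item id
int get_global_id( int dim ) { return gidx[dim]; }
typedef unsigned int uint;

__kernel void render( __global uint* target, int offset )
{
	// one thread per pixel: red ramps with x, green with y (assumed formula);
	// 'offset' is unused in this first version
	int x = get_global_id( 0 ), y = get_global_id( 1 );
	int red = x * 255 / 640, green = y * 255 / 640;
	target[x + y * 640] = (red << 16) + (green << 8);
}
```

On the GPU, get_global_id(0) and get_global_id(1) yield a unique (x,y) per thread, so the kernel body runs once for every pixel.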

The code would produce a nice gradient, but we’re not quite there yet. Someone needs to start this code, ideally from some C++. To do that, we store the OpenCL code in a file, kernels.cl, and load it from the C++ code that we have been developing. To ::Init() we add:
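A sketch of those two lines: Kernel and Buffer are the template’s wrapper classes, and their exact constructor signatures are assumptions; the stub definitions below exist only to make the snippet self-contained.

```cpp
// minimal stand-ins for the template's Kernel and Buffer classes (assumptions)
struct Kernel
{
	Kernel( const char* file, const char* entry ) {} // compiles file, finds entry point
};
struct Buffer
{
	unsigned int size;
	Buffer( unsigned int s ) : size( s ) {}          // size in bytes, in RAM and/or VRAM
};

Kernel* tracer = new Kernel( "cl/kernels.cl", "render" );
Buffer* pixelBuffer = new Buffer( 640 * 640 * 4 );   // one uint per pixel
```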

The first line loads the OpenCL source file from subdirectory cl, compiles it, and looks for a kernel function called render. The second line creates a Buffer of 640 * 640 * 4 bytes. This ‘Buffer’ is something special: the bytes it stores can reside in regular RAM, or on the GPU, or both. The two locations are not automatically in sync, but we can copy from VRAM to RAM (using Buffer::CopyFromDevice() ) or from RAM to VRAM (using Buffer::CopyToDevice() ). So now we have a way to provide the render kernel with data, and a way to get results back to the CPU, where we can operate on them more easily.

We can now execute the kernel, e.g. in the ::Tick() function:

The kernel is expecting two arguments: the first is a __global uint*, the second a simple int. The call to SetArguments(...) passes the buffer we created, and 0 for the integer, to OpenCL; the second line starts the kernel on the GPU, using 640×640 threads.

Once the kernel is done (which will be in an instant), we can get the results by copying the buffer from the GPU to RAM:

The buffer now contains 640×640 unsigned integers, which we can simply copy to the screen surface. We could use a nested for loop over x and y for that, but a memcpy does the job more quickly:
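A self-contained sketch of that copy; ‘pixelData’ stands in for the host-side copy of the pixel Buffer and ‘pixels’ for the template’s screen surface, and both names are assumptions.

```cpp
#include <string.h>
typedef unsigned int uint;

uint pixelData[640 * 640]; // filled by Buffer::CopyFromDevice()
uint pixels[640 * 640];    // the window's 32-bit pixel array

void CopyBufferToScreen()
{
	// both arrays share the same 32-bit pixel layout, so one memcpy of
	// 640 * 640 * 4 bytes replaces the nested loop over x and y
	memcpy( pixels, pixelData, 640 * 640 * 4 );
}
```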

And now we can start the program, and admire its glorious GPU-calculated gradient.

With that out of the way, we can now focus on getting some ray tracing done on the GPU. We just created GPU code that produces colors that we can display; all that we need to do now is calculate these colors using a Trace function – running on the GPU. The Trace function will need data: a skybox, triangles, and of course: a BVH.

GPGPU Best Practices

From here, we will port the existing CPU ray tracer to the GPU, one bit at a time. A few ‘best practices’ will help to make this process (and probably other GPGPU projects) go smoothly:

First of all, we start with working CPU code. Debugging on the GPU is hard; it really helps if we don’t complicate things further by building new things directly in GPU code.

Secondly, start with something simple and build from there. We just did that: getting anything to the screen (‘first image’) is an important first step, from a practical and psychological point of view. From there, small, verifiable steps will ultimately lead to a finished project.

Third, functionality comes before speed. “The most important optimization is the one that turns a not working program into a working one,” and: “Premature optimization is the root of all evil.” Applied to GPGPU: make it work first, in the most intuitive manner; tweaking is for later.

And finally, let the GPU do what it does best. In our case: Building a BVH is not easy to do on a massively parallel architecture. So: don’t! Tracing rays on the other hand is a perfect fit: We basically execute the same task, for thousands of pixels.

Trace

Before we proceed, let’s make some small changes to the render kernel, so that it can work with a different window size than the hardcoded 640×640 we used so far.

This requires two additional changes. The constants SCRWIDTH and SCRHEIGHT (used later) are defined in a file named common.h, which can be found in the template folder. We thus add #include "common.h" at the top of kernels.cl.

Note that common.h will now be included by the CPU code as well as the GPU code. This is a great way to share some functionality, and to ensure that both use the same settings (such as screen size, in this case).

The second change is a replacement of 640 * 640 by SCRWIDTH * SCRHEIGHT in the CPU code. This occurs in two places: once for the allocation of the target buffer, and once for the invocation of the trace kernel, in ::Tick.

Time to do some Tracin’. In the render kernel we can use pixel coordinates x and y to generate a primary ray:
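A sketch of the ray setup, written as plain C++ so it can be checked; the helper functions stand in for OpenCL’s native float3 operators and normalize, and the 640×640 screen size is hardcoded as before.

```cpp
#include <math.h>
struct float3 { float x, y, z; };
float3 make3( float x, float y, float z ) { float3 r = { x, y, z }; return r; }
float3 add( float3 a, float3 b ) { return make3( a.x + b.x, a.y + b.y, a.z + b.z ); }
float3 sub( float3 a, float3 b ) { return make3( a.x - b.x, a.y - b.y, a.z - b.z ); }
float3 mul( float3 a, float f ) { return make3( a.x * f, a.y * f, a.z * f ); }
float3 normalize3( float3 a )
{
	float invLen = 1.0f / sqrtf( a.x * a.x + a.y * a.y + a.z * a.z );
	return mul( a, invLen );
}

// generate a primary ray direction through pixel (x,y): interpolate over the
// virtual screen plane spanned by corners p0, p1 and p2, then normalize
float3 PrimaryRayDir( int x, int y, float3 camPos, float3 p0, float3 p1, float3 p2 )
{
	float u = x / 640.0f, v = y / 640.0f; // SCRWIDTH / SCRHEIGHT
	float3 pixelPos = add( p0, add( mul( sub( p1, p0 ), u ), mul( sub( p2, p0 ), v ) ) );
	return normalize3( sub( pixelPos, camPos ) );
}
```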

Setting up the ray requires a camera position camPos and three screen corners p0, p1 and p2. On the CPU we would calculate those outside the tile/pixel loop, because they are the same for all pixels. For GPU code, the only way to do something ‘outside the loop’ is to do it on the CPU. We thus pass the four positions as arguments of the kernel:

On the CPU-side, the change is straight-forward:

Setting up a ray requires a Ray class. We could move the CPU-side Ray definition to common.h to use the same struct on CPU and GPU, but that has issues: The CPU Ray definition uses __m128 variables for SIMD processing, which is not supported on GPUs. It is simpler to just re-define Ray in kernels.cl, and, with it, Intersection:
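A sketch of the kernel-side definitions; the field names follow the CPU version from the earlier articles, and the small float3 struct stands in for OpenCL’s built-in float3 type.

```cpp
typedef unsigned int uint;
struct float3 { float x, y, z; };

// kernel-side mirror of the CPU structs, minus the __m128 SIMD members
struct Intersection
{
	float t;         // distance along the ray, 1e30f if no hit
	float u, v;      // barycentric coordinates of the hit
	uint instPrim;   // instance index (top bits) + primitive index
};
struct Ray
{
	struct float3 O, D, rD;   // origin, direction, reciprocal direction
	struct Intersection hit;  // OpenCL requires the 'struct' keyword here
};
```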

Note the strong similarity between the C++ code and the OpenCL code. Some differences exist as well, though: where C++ lets us get away with just using Intersection as the type of member hit in struct Ray, OpenCL insists on struct Intersection hit. Likewise, Ray ray in the kernel must be written as struct Ray ray. Another notable difference that we encounter early on: we can’t pass anything ‘by reference’, so when we call the Trace function, we either pass the ray ‘by value’ or via a pointer.

The initial version of function Trace (which just returns a white pixel) reveals one more nuisance. A float3 is not constructed using float3( 1, 1, 1 ), but instead using (float3)( 1, 1, 1 ), as shown in the code snippet.

Function Trace returns a float3, like we did in the CPU code. Since the buffer that the kernel operates on is in fact an unsigned int buffer, we need to convert the final pixel value from float3 to uint:
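That conversion can look like this (sketch; OpenCL’s built-in min would replace fminf, and the function name is an assumption):

```cpp
#include <math.h>
typedef unsigned int uint;
struct float3 { float x, y, z; };

// convert a floating point color to a 32-bit pixel, clamping each channel to [0,1]
uint RGB32FtoRGB8( struct float3 c )
{
	uint r = (uint)(fminf( c.x, 1.0f ) * 255.0f);
	uint g = (uint)(fminf( c.y, 1.0f ) * 255.0f);
	uint b = (uint)(fminf( c.z, 1.0f ) * 255.0f);
	return (r << 16) + (g << 8) + b;
}
```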

We now have a kernel that renders a white window. But let’s look on the bright side: The white pixels are produced using correct primary rays, using a Trace function that should be easy to extend to what we had on the CPU.

Sky

The easiest way to prove that the primary rays are in fact correct is by converting the skydome functionality.

In the CPU code, in ::Tick, we create a new buffer:

The buffer takes the existing HDR skydome, stored in skyPixels. The second line copies the data to the GPU. We need to do this only once: Skydomes generally don’t change over time, so there will be no per-frame data transport for this data. To use the data in the kernel we do however need to pass a pointer to it. We add an argument to the kernel:

In the CPU code, in ::Tick, we pass the pointer to the data on the GPU device.

To make the data accessible in the Trace function, we simply pass it along:

Inside Trace, we now have a ray, and skyPixels. We can test both by pretending that we never hit any geometry; this should yield a rotating skydome without teapots. Code:
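A sketch of the index calculation at the heart of that code, written as plain C++ so it can be verified. The 3200×1600 skydome size is hardcoded, the parameters stand for the components of ray->D, and in the kernel the resulting index is used to read (a scaled version of) skyPixels.

```cpp
#include <math.h>
#define INVPI   0.31830988618f   // constants as defined in common.h
#define INV2PI  0.15915494309f
#define TWOPI   6.28318530718f

// map a (normalized) ray direction to a texel index in the 3200x1600 skydome
unsigned int SkyIdx( float dx, float dy, float dz )
{
	float phi = atan2( dz, dx );  // note: atan2, not atan2f; may be negative
	unsigned int u = (unsigned int)(3200 * (phi > 0 ? phi : phi + TWOPI) * INV2PI - 0.5f);
	unsigned int v = (unsigned int)(1600 * acos( dy ) * INVPI - 0.5f);
	return u + v * 3200;
}
```

The ternary on phi is the extra step compared to the CPU version: it wraps negative angles into [0, 2π) before scaling to texel coordinates.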

That requires some explanation. First of all, the size of the skydome texture is now hardcoded. To make this more generic, we really should pass skyWidth and skyHeight along with the skyPixels pointer. Secondly, INVPI and INV2PI conveniently come from common.h: including it pays off again. And finally: some math functions have different names in OpenCL. Here: atan2f and acosf become atan2 and acos. Some functions also behave slightly differently. Here: atan2 can yield a negative number, which we need to handle.

The kernel now produces a considerably more interesting image:

BVH Data

Let’s take a step back, and see what other data we need on the GPU to complete the Trace function.

  • A single mesh, with Tri and TriEx arrays.
  • A BVH for the mesh, consisting of a triangle index array and a set of nodes.
  • A texture: again, just one.
  • An array of BVHInstances, which includes transforms.
  • A TLAS, with instance indices in the leaves.

To get all this data to the GPU, we start with creating seven buffers, using the data we have created on the CPU side, at the end of the ::Init() function:

Buffers triData and triExData still assume the hardcoded triangle limit of 1024 triangles in the Mesh class, which is enough for the teapot object. If you already relaxed this constraint, you can just replace 1024 by the actual number of triangles in the mesh.

Once all the buffers are created, we copy them to the GPU:

One buffer is missing from this list: the tlasData buffer. This buffer changes every frame when we animate objects and update the TLAS; the call to CopyToDevice() will thus go to the ::Tick() function rather than the ::Init() function.

To get pointers to all this data in the kernel code, we pass all the pointers to the kernel, like we did with the skydome texture data:

In kernels.cl we update the definition of the render kernel:

OpenCL allows us to pass pointers to undefined structs like Tri and TriEx, but to use these, we’ll need to properly define them. This however reveals an important complication.

Alignment

The problem is the float3 type. On the CPU side, we expect it to be 12 bytes in size; in fact, we rely on this, e.g. in the (rather important) TLASNode class, which fits snugly in 32 bytes:
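For reference, the CPU-side layout, with the 12-byte, three-float version of float3 (field names as used earlier in the series):

```cpp
typedef unsigned int uint;
struct float3 { float x, y, z; };   // 12 bytes on the CPU

// CPU-side TLAS node: exactly 32 bytes
struct TLASNode
{
	struct float3 aabbMin;          // 12 bytes
	uint leftRight;                 // 2x16 bits: child node indices
	struct float3 aabbMax;          // 12 bytes
	uint BLAS;                      // instance index, for leaf nodes
};
```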

In OpenCL, things are different. A GPU can read 16 bytes from memory more quickly than 12 bytes, and therefore, a float3 is padded with a dummy field, to make it 16 bytes. To enforce the correct layout of the TLASNodes that we receive from the CPU, we can define the struct as follows:
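The OpenCL-side definition replaces each float3 by three separate floats, so no hidden padding can appear; the layout is verified here in C++, where it is byte-for-byte identical.

```cpp
typedef unsigned int uint;

// same 32-byte layout as the CPU version, without relying on float3
struct TLASNode
{
	float minx, miny, minz;    // aabbMin, as three scalars
	uint leftRight;            // 2x16 bits: child node indices
	float maxx, maxy, maxz;    // aabbMax
	uint BLAS;                 // instance index, for leaf nodes
};
```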

This struct is 32 bytes in size, and has the same layout as the CPU version. Copying data into this should thus yield the correct result.

The other structs that use float3 variables can be modified in a similar fashion:
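BVHNode, for instance, gets the same treatment (sketch; the CPU version stores aabbMin, leftFirst, aabbMax and triCount in 32 bytes, and the scalar triplets preserve that layout on the GPU; Tri and TriEx follow the same pattern):

```cpp
typedef unsigned int uint;

// 32-byte BVH node, with the two float3 bounds written as scalar triplets
struct BVHNode
{
	float minx, miny, minz;    // aabbMin
	uint leftFirst;            // left child index, or first triangle index
	float maxx, maxy, maxz;    // aabbMax
	uint triCount;             // 0 for interior nodes
};
```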

The last structure that we need is the BVHInstance class. The CPU version of this class has some data and methods that we don’t need for rendering on the GPU: on the GPU end we just need the two matrices it holds. Still, synchronization is easier if we keep the same layout, so we devise a struct that does the job:
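A sketch of that struct; the float16 type is written here as a 16-float array so the 160-byte size can be checked on the CPU, and the member order is an assumption about the CPU class layout.

```cpp
typedef unsigned int uint;
struct float16 { float f[16]; };   // stand-in for OpenCL's float16 vector type

// GPU mirror of BVHInstance: 160 bytes, same as the CPU class
struct BVHInstance
{
	struct float16 transform;      // the CPU-side mat4 (64 bytes)
	struct float16 invTransform;   // its inverse (64 bytes)
	uint dummy1, dummy2;           // mirrors the 8-byte BVH* pointer
	uint dummy[6];                 // mirrors the 24-byte aabb bounds member
};
```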

The variables dummy1 and dummy2 together take up 8 bytes, just like the BVH* in the original class. The last member variable, aabb bounds, is 6 times 4 bytes in size, just like the uint dummy array at the end of the proposed struct. Note the odd float16 data type: this will represent the mat4 we used on the CPU.

All data is in position, time to start rendering.

Tea Time

To render a teapot using ray tracing on the GPU, we need an OpenCL version of the BVH::Intersect function, which implements BLAS traversal. Once this works, we can convert TLAS traversal and shading, but all that will not produce anything verifiable without proper BLAS traversal. Let’s start with the CPU code, and tweak it until it works:

The first change we make is the function header. OpenCL is not an object-oriented language, so BVH::Intersect will have to be a function at the global scope. We also need to make changes to the data we pass into it: The Ray& must either become a struct Ray*, or simply a struct Ray (i.e., by value). In this case the pointer is the better option, primarily because we will change the ray that is handed to us. And finally, we’ll need some additional data that we accessed via the BVH class so far: The triangle data, and of course the BVH data, consisting of nodes and triangle indices.

After adding the struct keyword on lines 3, 17 and 18 (before each BVHNode* declaration), line 7 shows up in the error log. Structs cannot have member functions in OpenCL, so we replace the node->IsLeaf() condition by a direct test on the node data: node->triCount > 0.

Next up is the call to IntersectTri(...). This function doesn’t exist yet, so let’s add it and convert it to OpenCL:
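A CPU-verifiable sketch of the converted function; the helpers stand in for OpenCL’s built-in float3 operators, cross and dot, and the algorithm is the Möller–Trumbore test from part 1 of the series.

```cpp
typedef unsigned int uint;
struct float3 { float x, y, z; };
struct Intersection { float t, u, v; uint instPrim; };
struct Ray { struct float3 O, D, rD; struct Intersection hit; };
struct Tri { struct float3 vertex0, vertex1, vertex2, centroid; };

struct float3 sub3( struct float3 a, struct float3 b ) { struct float3 r = { a.x - b.x, a.y - b.y, a.z - b.z }; return r; }
struct float3 cross3( struct float3 a, struct float3 b ) { struct float3 r = { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x }; return r; }
float dot3( struct float3 a, struct float3 b ) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Moller-Trumbore ray/triangle test; updates ray->hit on a closer hit
void IntersectTri( struct Ray* ray, struct Tri* tri, uint instPrim )
{
	struct float3 edge1 = sub3( tri->vertex1, tri->vertex0 );
	struct float3 edge2 = sub3( tri->vertex2, tri->vertex0 );
	struct float3 h = cross3( ray->D, edge2 );
	float a = dot3( edge1, h );
	if (a > -0.0001f && a < 0.0001f) return; // ray parallel to triangle
	float f = 1.0f / a;
	struct float3 s = sub3( ray->O, tri->vertex0 );
	float u = f * dot3( s, h );
	if (u < 0 || u > 1) return;
	struct float3 q = cross3( s, edge1 );
	float v = f * dot3( ray->D, q );
	if (v < 0 || u + v > 1) return;
	float t = f * dot3( edge2, q );
	if (t > 0.0001f && t < ray->hit.t)
		ray->hit.t = t, ray->hit.u = u, ray->hit.v = v, ray->hit.instPrim = instPrim;
}
```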

Note that after obtaining the three float3s for the vertices (lines 3-5, highlighted), the code is quite straightforward and close to the C++ original: even the cross product on line 7 is supported out-of-the-box by OpenCL. We call IntersectTri from BVHIntersect, using a pointer to a single triangle:

The next remaining issue is the AABB intersection, called from lines 19 and 20. Luckily we kept the non-SIMD version, which is easily converted to OpenCL. Original:

And the OpenCL version, which uses the ‘three-float notation’ for the extents of the bounding box stored in a BVHNode:
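A sketch of that version, verified as C++: the node bounds are read as scalar triplets (minx..maxz), and the two small helpers stand in for OpenCL’s built-in fmin and fmax.

```cpp
typedef unsigned int uint;
struct float3 { float x, y, z; };
struct Intersection { float t, u, v; uint instPrim; };
struct Ray { struct float3 O, D, rD; struct Intersection hit; };
struct BVHNode { float minx, miny, minz; uint leftFirst; float maxx, maxy, maxz; uint triCount; };

float fmin2( float a, float b ) { return a < b ? a : b; }
float fmax2( float a, float b ) { return a > b ? a : b; }

// slab test; returns the entry distance, or 1e30f for a miss
float IntersectAABB( struct Ray* ray, struct BVHNode* n )
{
	float tx1 = (n->minx - ray->O.x) * ray->rD.x, tx2 = (n->maxx - ray->O.x) * ray->rD.x;
	float tmin = fmin2( tx1, tx2 ), tmax = fmax2( tx1, tx2 );
	float ty1 = (n->miny - ray->O.y) * ray->rD.y, ty2 = (n->maxy - ray->O.y) * ray->rD.y;
	tmin = fmax2( tmin, fmin2( ty1, ty2 ) ), tmax = fmin2( tmax, fmax2( ty1, ty2 ) );
	float tz1 = (n->minz - ray->O.z) * ray->rD.z, tz2 = (n->maxz - ray->O.z) * ray->rD.z;
	tmin = fmax2( tmin, fmin2( tz1, tz2 ) ), tmax = fmin2( tmax, fmax2( tz1, tz2 ) );
	if (tmax >= tmin && tmin < ray->hit.t && tmax > 0) return tmin;
	return 1e30f;
}
```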

The final problem in BVHIntersect is a bit silly: OpenCL doesn’t have a ‘swap’ function. So, when we find that the second child is closer than the first, we swap their pointers and distances manually:
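In code it looks like this; dist1/dist2 and child1/child2 are the variables from the traversal loop, and the swap is wrapped in a helper here only for testing — in the kernel it appears inline.

```cpp
struct BVHNode { int id; }; // stand-in; only the pointer swap matters here

// ensure child1 is the nearest child, by manually swapping pointers and distances
void SortChildren( struct BVHNode** child1, struct BVHNode** child2, float* dist1, float* dist2 )
{
	if (*dist1 > *dist2)
	{
		float d = *dist1; *dist1 = *dist2; *dist2 = d;
		struct BVHNode* n = *child1; *child1 = *child2; *child2 = n;
	}
}
```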

The full BVHIntersect function, ported to OpenCL:
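A CPU-verifiable sketch of the full traversal; the helpers replace the OpenCL built-ins, and in kernels.cl the data pointers would additionally carry the __global qualifier. The (instanceIdx << 20) encoding of instPrim follows the earlier articles.

```cpp
typedef unsigned int uint;
struct float3 { float x, y, z; };
struct Intersection { float t, u, v; uint instPrim; };
struct Ray { struct float3 O, D, rD; struct Intersection hit; };
struct Tri { struct float3 vertex0, vertex1, vertex2, centroid; };
struct BVHNode { float minx, miny, minz; uint leftFirst; float maxx, maxy, maxz; uint triCount; };

struct float3 sub3( struct float3 a, struct float3 b ) { struct float3 r = { a.x - b.x, a.y - b.y, a.z - b.z }; return r; }
struct float3 cross3( struct float3 a, struct float3 b ) { struct float3 r = { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x }; return r; }
float dot3( struct float3 a, struct float3 b ) { return a.x * b.x + a.y * b.y + a.z * b.z; }
float fmin2( float a, float b ) { return a < b ? a : b; }
float fmax2( float a, float b ) { return a > b ? a : b; }

void IntersectTri( struct Ray* ray, struct Tri* tri, uint instPrim )
{
	struct float3 edge1 = sub3( tri->vertex1, tri->vertex0 ), edge2 = sub3( tri->vertex2, tri->vertex0 );
	struct float3 h = cross3( ray->D, edge2 );
	float a = dot3( edge1, h );
	if (a > -0.0001f && a < 0.0001f) return; // ray parallel to triangle
	float f = 1.0f / a;
	struct float3 s = sub3( ray->O, tri->vertex0 );
	float u = f * dot3( s, h );
	if (u < 0 || u > 1) return;
	struct float3 q = cross3( s, edge1 );
	float v = f * dot3( ray->D, q );
	if (v < 0 || u + v > 1) return;
	float t = f * dot3( edge2, q );
	if (t > 0.0001f && t < ray->hit.t)
		ray->hit.t = t, ray->hit.u = u, ray->hit.v = v, ray->hit.instPrim = instPrim;
}

float IntersectAABB( struct Ray* ray, struct BVHNode* n )
{
	float tx1 = (n->minx - ray->O.x) * ray->rD.x, tx2 = (n->maxx - ray->O.x) * ray->rD.x;
	float tmin = fmin2( tx1, tx2 ), tmax = fmax2( tx1, tx2 );
	float ty1 = (n->miny - ray->O.y) * ray->rD.y, ty2 = (n->maxy - ray->O.y) * ray->rD.y;
	tmin = fmax2( tmin, fmin2( ty1, ty2 ) ), tmax = fmin2( tmax, fmax2( ty1, ty2 ) );
	float tz1 = (n->minz - ray->O.z) * ray->rD.z, tz2 = (n->maxz - ray->O.z) * ray->rD.z;
	tmin = fmax2( tmin, fmin2( tz1, tz2 ) ), tmax = fmin2( tmax, fmax2( tz1, tz2 ) );
	if (tmax >= tmin && tmin < ray->hit.t && tmax > 0) return tmin;
	return 1e30f;
}

// stack-based BVH traversal, visiting the nearest child first
void BVHIntersect( struct Ray* ray, uint instanceIdx, struct BVHNode* bvhNode, struct Tri* tri, uint* triIdx )
{
	struct BVHNode* node = &bvhNode[0];
	struct BVHNode* stack[32];
	uint stackPtr = 0;
	while (1)
	{
		if (node->triCount > 0) // leaf node: was node->IsLeaf() on the CPU
		{
			for (uint i = 0; i < node->triCount; i++)
				IntersectTri( ray, &tri[triIdx[node->leftFirst + i]],
					(instanceIdx << 20) + triIdx[node->leftFirst + i] );
			if (stackPtr == 0) break; else node = stack[--stackPtr];
			continue;
		}
		struct BVHNode* child1 = &bvhNode[node->leftFirst];
		struct BVHNode* child2 = &bvhNode[node->leftFirst + 1];
		float dist1 = IntersectAABB( ray, child1 ), dist2 = IntersectAABB( ray, child2 );
		if (dist1 > dist2) // manual swap, OpenCL has no std::swap
		{
			float d = dist1; dist1 = dist2; dist2 = d;
			struct BVHNode* c = child1; child1 = child2; child2 = c;
		}
		if (dist1 == 1e30f) { if (stackPtr == 0) break; else node = stack[--stackPtr]; }
		else { node = child1; if (dist2 != 1e30f) stack[stackPtr++] = child2; }
	}
}
```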

All done. We can call this function from Trace, to see if the rather massive amount of new code and data actually works. At the start of the Trace function we call the new BVHIntersect function:

This should produce a white teapot, without any shading, and skydome pixels for all rays that do not hit the teapot. And lo and behold!

TODO

Now, we’re obviously not done. It’s just one teapot; it is not animated, and there’s no reflection and no texture. But really, that is for another day; the article is long enough as it is. From here things should go smoothly: OpenCL is now under our command, and the data is in position. Next time we complete the project, with some large-scale instanced scenery.

Closing Remarks

The complete code for this project, in ten convenient incremental stages, can be found on GitHub (for Windows / Visual Studio).

Enjoyed the article? Spotted an error?

Ping me on Twitter (@j_bikker), comment on this article, or send an email to bikker.j@gmail.com.
You can also find me on Github: https://github.com/jbikker.

Or read some more articles: https://jacco.ompf2.com.

Author: jbikker
