How to build a BVH – part 2: Faster Rays

In this series we explore the wonderful world of the bounding volume hierarchy: the mysterious data structure that enables real-time ray tracing on recent GPUs, in complex, animated scenes. Despite the broad adoption of ray tracing in games, in-depth understanding of the underlying technology seems to be reserved for a select few. In this article we show how a proper BVH is implemented without support from DXR / RTX / OptiX / Vulkan, in easy-to-read ‘sane C++’. Whether this is actually useful for you, or just interesting as background information, depends on your needs, obviously.

This series consists of ten articles. For each article, full source code is available on GitHub, in easy-to-read ‘sane C++’. Contents:

  1. Basics: data structure, simple construction and ray traversal.
  2. Faster rays: the Surface Area Heuristic and ordered traversal (this article).
  3. Faster construction: binned BVH building.
  4. BVHs for animated models: refitting and rebuilding.
  5. BVHs for animated scenes: the top-level BVH.
  6. TLAS & BLAS part 2 – all together now.
  7. Consolidating, part 1: ray tracing textured and shaded meshes.
  8. Consolidating, part 2: Whitted-style ray tracing.
  9. OpenCL: GPU ray tracing, part 1.
  10. OpenCL: GPU ray tracing, part 2 (series finale).

In this second article we explore several approaches to substantially speed up ray traversal. For a static scene, this means faster rendering. However, some of the presented techniques do make BVH construction more costly. How to fix this is a topic for a later article.

Setting up Shop

In the first article the scene consisted of a list of random triangles. For today’s experiments we will need something fancier. We’ll use a model from the Unity RobotLab scene:

The model used here (click to download it) consists of exactly 12582 triangles. For the purpose of this article the file was converted to a simplistic format, which can be loaded with just a few lines of code.

The second change to the code is in the visualization: a sense of depth is obtained by plotting greyscale pixels based on the ray distance.
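A minimal sketch of such a depth-to-greyscale mapping; the ‘t = 1e30f means miss’ convention follows the series, but the function name and the 42.0f scale factor are illustrative choices:

```cpp
#include <algorithm>
#include <cstdint>

uint32_t DepthToRGB( float t )
{
    if (t >= 1e30f) return 0;                          // miss: black background
    int c = 255 - (int)std::min( 254.0f, 42.0f * t );  // nearer is brighter
    return (uint32_t)((c << 16) | (c << 8) | c);       // grey: R = G = B
}
```

The scale factor is scene dependent: it simply maps the range of distances we expect to see onto 0..255.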

For this camera view, the scene renders in 300 milliseconds on my machine, which corresponds to about 1.3MRays/s, still on a single core.

Something interesting happens when we observe the model from the opposite side. When we place the camera at (-1.5f, -0.2f, -2.5f) and flip the relative position of the screen plane accordingly by changing -2 to 2 for p0, p1 and p2, we get a different image, but also different performance: suddenly one frame only takes 183 milliseconds, for a ray throughput of more than 2.1MRays/s.

Once we see what causes this, we have the means to always achieve the higher performance level. In this article we will combine this with several other things to greatly improve overall performance.

Raison d’Être

Before we get to the cause of the puzzling difference between the two camera viewpoints I would like to introduce you to the Surface Area Heuristic.

Last time we chose the split plane position and axis in a rather simplistic manner: halfway along the longest axis and perpendicular to that axis, a technique known as the midpoint split. It is not hard to come up with scenes where this is not a great idea:

A midpoint split in this case yields two large overlapping boxes. A subtle nudge of the plane to the right, so it includes the disc at the bottom of the image, would be much better: the left box would be only slightly bigger, but the box on the right would be tiny. On the other hand, a nudge to the left might also be a good idea: the midpoint split hardly yields a balanced tree, while including the three discs near the split plane seems far better in that regard.

This raises the question: what makes a good BVH? Should it be balanced, should it quickly isolate empty space, or is there something else?

Let’s take a step back, and think about what the BVH is trying to solve. Without a BVH, we brute-force intersect N triangles. With the BVH, we intersect a small number of triangles, plus a number of bounding boxes. An optimal BVH minimizes the total number of intersections. Applied to the local problem of picking a good split plane: since the number of bounding boxes after the split is constant (i.e., two), the optimal split plane is the one that minimizes the number of intersections between a ray and the primitives in both child nodes. ‘A ray’, without further context, is a random ray, unless we specifically construct the BVH for a particular ray – which we don’t.

The word ‘random’ provides an important clue. How many intersections with a random ray need to be evaluated for a bounding box containing N triangles? That depends: if the ray hits the box, the answer is N; otherwise it is zero. The cost is thus proportional to N, but also proportional to the probability that the ray hits the box.

So, what is the probability of a random ray hitting a random box? Obviously, we can’t calculate this. But we can do something else. Consider these two triangles:

Which of these has a higher probability of being struck by a random ray? Obviously, the larger one. So, although we cannot calculate the probability itself, we do know that it is proportional to surface area, and that turns out to be enough. Applying this to boxes, the same principle holds: the chance that a random ray hits a random box is proportional to the surface area of the box. Note: not its volume! This is easy to see when you imagine a box with a height of zero: although it now has a volume of zero, it can still be hit by a ray.

This leads to the formulation of the Surface Area Heuristic:

C_{SAH}=N_{left} A_{left} + N_{right} A_{right}

In words: the ‘cost’ of a split is proportional to the summed cost of intersecting the two resulting boxes, including the triangles they store. The cost of a box with triangles is proportional to the number of triangles, times the surface area of the box.

Applying the SAH

The SAH allows us to estimate the cost of a split. But, for which splits do we evaluate this cost? The answer is: all of them. There are infinitely many ways to split a bounding box along one axis, and we consider all three axes x, y and z instead of just the longest one. Fortunately, the cost function is constant between two primitive centroids: it only changes when the split plane crosses a centroid. This means that we only need to evaluate it at those locations:

Only after evaluating all of them do we know which position (on which axis) is best; this position then replaces the midpoint split position.

We thus replace the midpoint split by a search for the lowest cost. For this, we use candidate split planes through the centroids of all triangles in the node, and for each axis:
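Sketched below is what such a search can look like. To keep the sketch self-contained, the SAH evaluation is inlined into the loop and the triangle and box helpers are re-declared in minimal form; names like FindBestSplit are illustrative, not the series’ actual code:

```cpp
#include <algorithm>
#include <vector>

struct float3
{
    float x, y, z;
    float operator[]( int i ) const { return i == 0 ? x : i == 1 ? y : z; }
};

struct Tri { float3 vertex0, vertex1, vertex2, centroid; };

struct Box
{
    float3 bmin = { 1e30f, 1e30f, 1e30f }, bmax = { -1e30f, -1e30f, -1e30f };
    void grow( const float3& p )
    {
        bmin = { std::min( bmin.x, p.x ), std::min( bmin.y, p.y ), std::min( bmin.z, p.z ) };
        bmax = { std::max( bmax.x, p.x ), std::max( bmax.y, p.y ), std::max( bmax.z, p.z ) };
    }
    float area() const
    {
        float ex = bmax.x - bmin.x, ey = bmax.y - bmin.y, ez = bmax.z - bmin.z;
        return ex * ey + ey * ez + ez * ex; // half the true surface area; the factor 2 cancels
    }
};

// try a split plane through every centroid, on every axis, and keep the best
void FindBestSplit( const std::vector<Tri>& tris, int& bestAxis, float& bestPos, float& bestCost )
{
    bestAxis = -1, bestPos = 0, bestCost = 1e30f;
    for (int axis = 0; axis < 3; axis++) for (const Tri& t : tris)
    {
        float candidatePos = t.centroid[axis];
        // evaluate the SAH for this candidate (the article factors this out
        // into a separate EvaluateSAH function)
        Box left, right;
        int leftCount = 0, rightCount = 0;
        for (const Tri& u : tris)
            if (u.centroid[axis] < candidatePos)
                leftCount++, left.grow( u.vertex0 ), left.grow( u.vertex1 ), left.grow( u.vertex2 );
            else
                rightCount++, right.grow( u.vertex0 ), right.grow( u.vertex1 ), right.grow( u.vertex2 );
        float cost = leftCount * left.area() + rightCount * right.area();
        if (cost > 0 && cost < bestCost) // NaN from an empty side fails 'cost > 0'
            bestAxis = axis, bestPos = candidatePos, bestCost = cost;
    }
}
```

Note the quadratic flavor of this search: for every candidate we loop over all triangles again. That is exactly why BVH construction gets slower, as we will see below.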

The search yields a bestPos and a bestAxis, which we then use instead of the midpoint split position.

The above code uses function EvaluateSAH to calculate the cost for one potential split. This function needs to determine the bounding boxes that would result from the split, as well as the number of triangles in each of the boxes. Based on this information, the SAH cost function can be evaluated.

A new struct aabb is used here, which is straightforward but makes the EvaluateSAH function a lot more concise:
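The two pieces can be sketched as follows. In the article EvaluateSAH receives the node and uses a triangle index array; here it takes a plain triangle array to stay self-contained, so treat the signature as an assumption:

```cpp
#include <algorithm>

struct float3
{
    float x, y, z;
    float operator[]( int i ) const { return i == 0 ? x : i == 1 ? y : z; }
};

struct Tri { float3 vertex0, vertex1, vertex2, centroid; };

struct aabb
{
    float3 bmin = { 1e30f, 1e30f, 1e30f }, bmax = { -1e30f, -1e30f, -1e30f };
    void grow( float3 p )
    {
        bmin = { std::min( bmin.x, p.x ), std::min( bmin.y, p.y ), std::min( bmin.z, p.z ) };
        bmax = { std::max( bmax.x, p.x ), std::max( bmax.y, p.y ), std::max( bmax.z, p.z ) };
    }
    float area()
    {
        float ex = bmax.x - bmin.x, ey = bmax.y - bmin.y, ez = bmax.z - bmin.z; // extent
        return ex * ey + ey * ez + ez * ex; // half the true area: constant factors cancel
    }
};

float EvaluateSAH( const Tri* tri, int triCount, int axis, float pos )
{
    // determine triangle counts and bounds for the two halves of this split
    aabb leftBox, rightBox;
    int leftCount = 0, rightCount = 0;
    for (int i = 0; i < triCount; i++)
    {
        const Tri& t = tri[i];
        if (t.centroid[axis] < pos)
        {
            leftCount++;
            leftBox.grow( t.vertex0 ), leftBox.grow( t.vertex1 ), leftBox.grow( t.vertex2 );
        }
        else
        {
            rightCount++;
            rightBox.grow( t.vertex0 ), rightBox.grow( t.vertex1 ), rightBox.grow( t.vertex2 );
        }
    }
    float cost = leftCount * leftBox.area() + rightCount * rightBox.area();
    return cost > 0 ? cost : 1e30f; // an empty side yields NaN, reported as 'bad split'
}
```

Note that area() deliberately returns half the surface area: the heuristic only compares relative costs, so constant factors drop out.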

There is one thing that remains, and that may come as a bit of a surprise. In the initial version of the Subdivide function, recursion was terminated when a certain triangle count was reached (we used 2). Using the SAH, we can do better. Splitting a node in two new nodes is supposed to reduce the cost function. This is however not always the case. Take the example used in the first article: two triangles may form an axis-aligned quad, which cannot be split into two boxes, unless these two boxes fully overlap. The SAH can be used to easily detect this. We calculate the SAH cost of the parent:

Before splitting, the parent is one box with triangles, so the cost is based on this single box. Now, splitting is supposed to help. If it doesn’t, we should abort the attempt to split.

This test replaces the old test.
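The criterion can be sketched with two small functions; the names and signatures are illustrative assumptions (in the article the parent cost is computed from the node itself, as triangle count times box area):

```cpp
float CalculateNodeCost( int triCount, float nodeArea )
{
    // same convention as the per-split SAH cost: primitive count times (half) box area
    return triCount * nodeArea;
}

bool ShouldSubdivide( float bestSplitCost, int triCount, float nodeArea )
{
    // split only if the best split is expected to be cheaper than
    // leaving the node as a leaf
    return bestSplitCost < CalculateNodeCost( triCount, nodeArea );
}
```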

Results

Before applying the surface area heuristic, it took 183 milliseconds to render the scene. With the SAH, this is reduced to 128 milliseconds: a 43% speedup.

There is a problem however. Without SAH, the BVH is constructed rapidly: it only takes 2.5 milliseconds on my machine. With SAH, this increases to 7.4 seconds. The reason is obvious: we simply do a lot more work to determine the split plane position.

Order

Back to that weird thing with speed being influenced by camera angle. Consider the following two scenarios:

In the left scenario, the ray travels to the south. It intersects the top and bottom node, but depending on the order, it needs to visit both, or only one. This is because an intersection shortens the ray: hitting the first disc along the ray makes the ray short enough to not intersect the bottom node. If we on the other hand visit the bottom node first, the ray also gets shorter, but not short enough to skip the top node. In the right scenario the ideal order is reversed: this time we will save time if we visit the bottom node first.

It is now clear what happened with the two camera viewpoints: since we visit the nodes in a fixed order, some views are faster than others. We can fix this by visiting the nodes front to back. There are several ways to do this. One way requires a reformulation of the traversal algorithm:

The traversal looks quite different from what we had before. For starters, the code is no longer recursive. An infinite loop is used instead, and whenever two child nodes require a visit, one of them is pushed on a small stack. The loop completes when we try to pop from an empty stack.

Secondly, the test to check if we intersect a node is removed; this time we check if the child nodes intersect the ray. The answer from the IntersectAABB function is also no longer a boolean, but a float: 1e30f denotes ‘miss’, while any other value is a hit at the reported distance:
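A sketch of what such a distance-returning slab test can look like; the minimal Ray struct here is a stand-in for the one used in the series:

```cpp
#include <algorithm>

struct float3 { float x, y, z; };
struct Ray { float3 O, D; float t = 1e30f; };

// slab test: returns the entry distance of the box, or 1e30f on a miss
float IntersectAABB( const Ray& ray, float3 bmin, float3 bmax )
{
    float tx1 = (bmin.x - ray.O.x) / ray.D.x, tx2 = (bmax.x - ray.O.x) / ray.D.x;
    float tmin = std::min( tx1, tx2 ), tmax = std::max( tx1, tx2 );
    float ty1 = (bmin.y - ray.O.y) / ray.D.y, ty2 = (bmax.y - ray.O.y) / ray.D.y;
    tmin = std::max( tmin, std::min( ty1, ty2 ) ), tmax = std::min( tmax, std::max( ty1, ty2 ) );
    float tz1 = (bmin.z - ray.O.z) / ray.D.z, tz2 = (bmax.z - ray.O.z) / ray.D.z;
    tmin = std::max( tmin, std::min( tz1, tz2 ) ), tmax = std::min( tmax, std::max( tz1, tz2 ) );
    return (tmax >= tmin && tmin < ray.t && tmax > 0) ? tmin : 1e30f;
}
```

Note the `tmin < ray.t` condition: a box that lies entirely beyond the closest hit found so far is reported as a miss, which is what makes the ordered traversal pay off.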

Using the distances to both child nodes, we can sort them: now, if the nearest child is at distance 1e30f, we know we must have missed them both, in which case we proceed with a node from the stack. Otherwise, we proceed with the nearest child. If the far child also requires visitation, we push this one on the stack.
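Putting the pieces together, a compact sketch of the ordered traversal follows. To keep it self-contained and runnable, the slab test is repeated here and leaf intersection is simplified: the leaf’s own box stands in for its triangles. The node layout loosely follows the series; treat this as an illustration of the ordering logic, not the article’s exact code:

```cpp
#include <algorithm>

struct float3 { float x, y, z; };
struct Ray { float3 O, D; float t = 1e30f; };

struct BVHNode
{
    float3 aabbMin, aabbMax;
    unsigned leftFirst, triCount; // triCount > 0 marks a leaf
    bool isLeaf() const { return triCount > 0; }
};

// slab test returning the entry distance, 1e30f on a miss
float IntersectAABB( const Ray& ray, float3 bmin, float3 bmax )
{
    float tx1 = (bmin.x - ray.O.x) / ray.D.x, tx2 = (bmax.x - ray.O.x) / ray.D.x;
    float tmin = std::min( tx1, tx2 ), tmax = std::max( tx1, tx2 );
    float ty1 = (bmin.y - ray.O.y) / ray.D.y, ty2 = (bmax.y - ray.O.y) / ray.D.y;
    tmin = std::max( tmin, std::min( ty1, ty2 ) ), tmax = std::min( tmax, std::max( ty1, ty2 ) );
    float tz1 = (bmin.z - ray.O.z) / ray.D.z, tz2 = (bmax.z - ray.O.z) / ray.D.z;
    tmin = std::max( tmin, std::min( tz1, tz2 ) ), tmax = std::min( tmax, std::max( tz1, tz2 ) );
    return (tmax >= tmin && tmin < ray.t && tmax > 0) ? tmin : 1e30f;
}

void IntersectBVH( Ray& ray, const BVHNode* bvhNode )
{
    const BVHNode* node = &bvhNode[0], *stack[64];
    unsigned stackPtr = 0;
    while (1)
    {
        if (node->isLeaf())
        {
            // simplified leaf: the box entry distance stands in for the
            // actual triangle intersections
            float d = IntersectAABB( ray, node->aabbMin, node->aabbMax );
            if (d < ray.t) ray.t = d;
            if (stackPtr == 0) break; else node = stack[--stackPtr];
            continue;
        }
        const BVHNode* child1 = &bvhNode[node->leftFirst];
        const BVHNode* child2 = &bvhNode[node->leftFirst + 1];
        float dist1 = IntersectAABB( ray, child1->aabbMin, child1->aabbMax );
        float dist2 = IntersectAABB( ray, child2->aabbMin, child2->aabbMax );
        if (dist1 > dist2) { std::swap( dist1, dist2 ); std::swap( child1, child2 ); }
        if (dist1 == 1e30f)
        {
            // missed both children: continue with a node from the stack
            if (stackPtr == 0) break; else node = stack[--stackPtr];
        }
        else
        {
            // visit the nearest child first; defer the far child if it was hit
            node = child1;
            if (dist2 != 1e30f) stack[stackPtr++] = child2;
        }
    }
}
```

Note that the front-to-back ordering costs no more than one compare and two swaps per interior node.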

The resulting traversal is faster for both camera views. The front-side view drops from 128 to 107 milliseconds. The back-side view improves from 300 to 91 milliseconds: more than 3 times faster than without ordering.

Low Hanging Fruit

Let’s finish this article with some quick and dirty optimizations. For this, we visit the IntersectAABB function once more. It contains no less than six divisions, which are notoriously expensive for processors. In this case, they are easy to get rid of.

A division by x can be replaced by a multiplication by 1/x. In this case, that doesn’t look particularly useful, because 1/x itself contains a division. However, one ray gets intersected with many boxes; that means that the cost of calculating 1/ray.D.xyz can be cached. This requires a modification to the ray struct:

Member variable rD will contain the reciprocals of the direction components, which we can set right after constructing the ray:

From here on, IntersectAABB operates without divisions:
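The three changes can be combined in one self-contained sketch; MakeRay is an illustrative helper, not part of the series’ code:

```cpp
#include <algorithm>

struct float3 { float x, y, z; };

struct Ray
{
    float3 O, D, rD; // rD caches the reciprocals of D, set once per ray
    float t = 1e30f;
};

Ray MakeRay( float3 O, float3 D )
{
    Ray ray;
    ray.O = O, ray.D = D;
    ray.rD = { 1 / D.x, 1 / D.y, 1 / D.z }; // paid once, reused for every box
    return ray;
}

// slab test: six multiplications instead of six divisions
float IntersectAABB( const Ray& ray, float3 bmin, float3 bmax )
{
    float tx1 = (bmin.x - ray.O.x) * ray.rD.x, tx2 = (bmax.x - ray.O.x) * ray.rD.x;
    float tmin = std::min( tx1, tx2 ), tmax = std::max( tx1, tx2 );
    float ty1 = (bmin.y - ray.O.y) * ray.rD.y, ty2 = (bmax.y - ray.O.y) * ray.rD.y;
    tmin = std::max( tmin, std::min( ty1, ty2 ) ), tmax = std::min( tmax, std::max( ty1, ty2 ) );
    float tz1 = (bmin.z - ray.O.z) * ray.rD.z, tz2 = (bmax.z - ray.O.z) * ray.rD.z;
    tmin = std::max( tmin, std::min( tz1, tz2 ) ), tmax = std::min( tmax, std::max( tz1, tz2 ) );
    return (tmax >= tmin && tmin < ray.t && tmax > 0) ? tmin : 1e30f;
}
```

A zero direction component yields an infinite reciprocal, which the min/max logic handles correctly under IEEE-754 floating point.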

On my machine this optimization has limited effect, which indicates that the performance problem may lie elsewhere. Perhaps the traversal code is memory bound (which causes the divisions to be hidden behind memory latencies) or perhaps we suffer from branch mispredictions in the conditional code in IntersectAABB. A definitive diagnosis requires a proper profiling session with a specialized tool such as Intel’s VTune. Instead of that, I am going to propose two optimizations: one that aims to improve cache utilization, and one that reduces the amount of conditional code.

1. Improving data locality: To make better use of the caches, we should ensure that data that we use is similar to data we have recently seen. This is known as temporal data locality. In a ray tracer this can be achieved by rendering the image in tiles. The pixels in a tile of e.g. 4×4 pixels often find the same triangles, typically after traversing the same BVH nodes. For this, we change the pixel plotting loops as follows:

2. Reducing conditional code: We can change IntersectAABB to a version that is mostly devoid of conditional code. This version does however require an Intel compatible CPU, so the following is very platform dependent.

We start with a modification of the BVHNode:

This rather horrible data structure uses an unnamed union to overlap the two float3 + uint structs with a __m128 (pronounced as ‘quadfloat’) value, which is a 128-bit vector variable that stores four floats. The __m128 data type is supported in hardware on pretty much any modern x86 processor. The union lets us access the original data without changes. At the same time, we can now read the bounding box information into two SSE registers, which lets us use a rather cryptic piece of intersection logic. Before we get to that, we need to make a similar change to the ray struct:

The constructor makes sure that the three dummy values that pad the 12-byte float3 values to 16 bytes are initialized to valid floats. The SIMD AABB intersection code now becomes:
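Combining the node and ray changes, a self-contained sketch could look as follows. The unnamed-union-with-anonymous-struct trick is a widely supported compiler extension (MSVC, GCC, Clang), and the lane extraction via _mm_storeu_ps replaces the MSVC-specific .m128_f32 accessor; treat these details as assumptions:

```cpp
#include <xmmintrin.h>
#include <algorithm>

struct float3 { float x, y, z; };

// node layout: two 16-byte halves, so each half loads as a single __m128
struct alignas(16) BVHNode
{
    union { struct { float3 aabbMin; unsigned leftFirst; }; __m128 aabbMin4; };
    union { struct { float3 aabbMax; unsigned triCount; }; __m128 aabbMax4; };
};

struct Ray
{
    Ray() { O4 = D4 = rD4 = _mm_set1_ps( 1.0f ); } // keep the dummies valid floats
    union { struct { float3 O; float dummy1; }; __m128 O4; };
    union { struct { float3 D; float dummy2; }; __m128 D4; };
    union { struct { float3 rD; float dummy3; }; __m128 rD4; };
    float t = 1e30f;
};

float IntersectAABB_SSE( const Ray& ray, const __m128 bmin4, const __m128 bmax4 )
{
    // mask off the fourth lane, which holds an integer, not a float
    static const __m128 mask4 = _mm_cmpeq_ps( _mm_setzero_ps(), _mm_set_ps( 1, 0, 0, 0 ) );
    __m128 t1 = _mm_mul_ps( _mm_sub_ps( _mm_and_ps( bmin4, mask4 ), ray.O4 ), ray.rD4 );
    __m128 t2 = _mm_mul_ps( _mm_sub_ps( _mm_and_ps( bmax4, mask4 ), ray.O4 ), ray.rD4 );
    __m128 vmax4 = _mm_max_ps( t1, t2 ), vmin4 = _mm_min_ps( t1, t2 );
    float vmax[4], vmin[4];
    _mm_storeu_ps( vmax, vmax4 ), _mm_storeu_ps( vmin, vmin4 );
    float tmax = std::min( vmax[0], std::min( vmax[1], vmax[2] ) );
    float tmin = std::max( vmin[0], std::max( vmin[1], vmin[2] ) );
    return (tmax >= tmin && tmin < ray.t && tmax > 0) ? tmin : 1e30f;
}
```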

On my Ryzen processor this yields a modest speed boost compared to the original code. That same Ryzen also made clear to me that doing calculations on numbers that are not floats (in this case: the fourth component in aabbMin4 and aabbMax4, which really are ints) is exceedingly slow. The SSE intersection code counters this by masking these values out.

With the improved data locality and SIMD code we reach the final performance level for this article. For the backside view we end up with 86ms per frame, which means we are blasting the scene with 4.5MRays/s. The frontside view is a little bit slower at 103ms, but that is still 3.8MRays/s, on a single core.

If you are interested in trying out multi-core performance: OpenMP lets you get started with relative ease. You will have to play a bit with the tile size to optimize performance. Also make sure to use the dynamic scheduler: each tile has a different workload. As a target, know that good parallel ray tracing should scale linearly with the number of cores, although OpenMP alone may not completely get you there.
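A sketch of what tile-level OpenMP parallelism can look like; the image and tile sizes are assumptions, and the constant pixel colour stands in for the actual trace. Compile with -fopenmp (or /openmp on MSVC); without it the pragma is ignored and the loop simply runs serially:

```cpp
#include <vector>

const int WIDTH = 640, HEIGHT = 640, TILESIZE = 4;
const int TILES_X = WIDTH / TILESIZE, TILES = TILES_X * (HEIGHT / TILESIZE);

void RenderFrame( std::vector<unsigned>& pixels )
{
    // dynamic scheduling: tiles have uneven workloads, so let idle threads
    // grab the next unprocessed tile
    #pragma omp parallel for schedule(dynamic)
    for (int tile = 0; tile < TILES; tile++)
    {
        int tx = tile % TILES_X, ty = tile / TILES_X;
        for (int v = 0; v < TILESIZE; v++)
            for (int u = 0; u < TILESIZE; u++)
            {
                int x = tx * TILESIZE + u, y = ty * TILESIZE + v;
                pixels[x + y * WIDTH] = 0xFF00FF; // trace the pixel's ray here
            }
    }
}
```

Each tile writes to its own pixels, so no synchronization is needed inside the loop.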

Closing Remarks

The next article in this series discusses binned BVH building for rapid construction. Check it out now.

The complete code for this project (for Windows / Visual Studio) can be found on GitHub.

Enjoyed the article? Spotted an error?

Ping me on Twitter (@j_bikker), comment on this article, or send an email to bikker.j@gmail.com.
You can also find me on Github: https://github.com/jbikker.

Or read some more articles: https://jacco.ompf2.com.

Author: jbikker

6 thoughts on “How to build a BVH – part 2: Faster Rays”

  1. I’m probably missing something, but in the “ordered traversal” in the IntersectBVH function, it looks like you test the content of a popped node even if you have found an intersection closer than dist1(2) while traversing its brother’s content. Wouldn’t it make sense to keep the distance to the node in addition to it, and dismiss an unstacked node if that distance is bigger than the current ray intersection distance?

    Perhaps that distance management and additional code complexity is costlier than the additional intersections though.

    1. Good observation; indeed the pushed node will be visited even if the ray got too short for it to be strictly needed. However, the nodes inside that node will be culled immediately in that case. Any clever trick to prevent the visit will thus at best prevent fetching (and intersecting) two nodes. One way to obtain that is to store the distance to the far child on the stack (along with a pointer to it), doubling the bandwidth for the stack. Whether that is worth it or not is something I need to try.

      1. Yes, either fetching/intersecting two nodes or one leaf (and its triangles). Probably marginal though.

  2. Isn’t the aabb area function only calculating half of the total surface area? Shouldn’t it be 2.0f * (e.x * e.y + e.y * e.z + e.z * e.x)? Am I missing something?

    1. No you’re not missing something. However: this is all about relative cost and areas; scaling all surface areas by a factor 2 therefore does not affect the outcome of the heuristic.

  3. Very helpful articles. Although I use a somewhat different hierarchy and coding style, your code is still pretty clear and easy to follow.
