How to build a BVH – part 3: Quick Builds

In this series we explore the wonderful world of the bounding volume hierarchy: the mysterious data structure that enables real-time ray tracing on recent GPUs, in complex, animated scenes. Despite the broad adoption of ray tracing in games, in-depth understanding of the underlying technology seems to be reserved to a select few. In this article we show how a proper BVH is implemented without support from DXR / RTX / OptiX / Vulkan, in easy-to-read ‘sane C++’. Whether this is actually useful for you, or just interesting as background information, depends on your needs, obviously.

This series consist of six eight articles. For each article, full source code is available on Github, in easy-to-read ‘Sane C++’. Contents:

  1. Basics: data structure, simple construction and ray traversal.
  2. Faster rays: the Surface Area Heuristic and ordered traversal.
  3. Faster construction: binned BVH building (this article).
  4. BVHs for animated models: refitting and rebuilding.
  5. BVHs for animated scenes: the top-level BVH.
  6. TLAS & BLAS part 2 – all together now.
  7. Consolidating, part 1: ray tracing textured and shaded meshes.
  8. Consolidating, part 2: Whitted-style ray tracing.
  9. OpenCL: GPU ray tracing, part 1.
  10. OpenCL: GPU ray tracing, part 2 (series finale).

In this third article we discuss how to reduce the time it takes to build a good BVH. This will be an important ingredient for the ultimate goal: building a BVH for a large scale dynamic environment, which is the topic of a later article.

Preparations

In the previous article the Surface Area Heuristic was introduced to improve the quality of the BVH. Here, a ‘high quality tree’ is one that can be used to quickly find the intersection between a ray and the scene. The SAH works wonders, and when combined with ordered traversal, the naïve BVH from the first article is outperformed by a factor 3.

There is a catch however. Where the basic BVH builder needed mere milliseconds to build the tree, the SAH builder needs multiple seconds – and we’re not exactly using a very complex mesh. The reason is simple: For N polygons, we evaluate N split plane candidates on each axis – but split plane evaluation involves visiting all primitives. The algorithmic complexity is thus O(N²) for one split; not great if we hoped to build BVHs for much larger scenes, and also not something we can solve with low-level optimizations or parallel code.

For the new concepts introduced in this article we will need to clean up the Subdivide function somewhat. Let’s start with splitting of the code that finds the optimal split plane position and axis to a function:

This simplifies the relevant section in Subdivide to

Next, Subdivide checks if the best split cost is actually an improvement over not splitting. We can make that a bit more clear using a function that calculates the cost of an unsplit node:

Back in Subdivide, we call CalculateNodeCost after finding the best split plane, and terminate if it isn’t worth it:

Now that everything is nice and tidy we can focus on the isolated logic in FindBestSplitPlane.

From O(N²) to O(N)

A straightforward way to reduce the time complexity to O(N) is to use the midpoint algorithm. However, we have very good reasons for not using that: the resulting tree is a rather poor one. But what would happen if we did not just consider one position, but several, or many? Intuitively, there shouldn’t be too much of a difference between these two situations:

Left: split plane candidates through the primitive centroids. Right: uniformly spaced split plane candidates.

In fact, if we use a high enough number of uniformly spaced planes, the solution should converge to the same answer as the full SAH sweep. And, once this works, we can start experimenting: At which number of planes do we still get a reasonable tree?

The initial code for this is simple enough:

Here, 100 is hardcoded as the number of intervals; these are separated by 99 planes, which will be the split plane candidates. The planes are a fixed distance apart: to prevent a repeated division by 100, this distance is calculated once as (boundsMax – boundsMin) / 100.

Running the code gets us our first results. Building the BVH is now considerably faster, because 100 is far smaller than N: on my machine it now takes about 350ms. The other result is that the BVH is not bad at all: where the full sweep results in a frame time of 106ms, the 100 planes yield a tree that renders in 109ms; a rather modest reduction in performance considering the much faster build.

It turns out that a smaller number of planes still gets us a pretty decent tree. At only 16 planes, rendering takes 110ms – but the build completes in a mere 60ms. And at just 8 planes, rendering takes 111ms, while the tree now builds in 30ms. Even 4 planes seem reasonable: 113ms for rendering, 15ms for building the tree. Note that this is still far better than the midpoint split.

We can now balance tree quality and build speed. This is very useful, especially once we start rendering animated scenes: a model that occupies just a few pixels and changes every frame justifies a quick and dirty tree, but full-screen static scenery should use a tree that is high quality – no matter the cost. We can control this now, especially when we realize that we left one trick unused: limiting the SAH sweep to the longest axis, like we did with the midpoint split. This comes at a minor degradation of tree quality, but it cuts build time down to 5ms for our scene.

In short, we have options.

Tweaking

In the initial code for the uniformly spaced planes I assumed that we should subdivide the full extent of the AABB of a node. This is strictly not necessary. In the SAH sweep, the first split plane candidate is on the first centroid; the last one on the last centroid. A better range of values to cut up is thus the AABB over the centroids of the primitives. It turns out that this matters.

Bounds over the primitives (blue) versus bounds over the primitive centroids (yellow).

The centroid bounds are quickly found using a loop over the primitives:

This replaces the two lines that read boundsMin and boundsMax directly from the node AABB. At 4 and 8 planes per dimension, this slightly improves the resulting tree, at negligible cost.

Binning

The Unity vehicle mesh consists of about 12k triangles. Using the fixed planes, this now builds in mere milliseconds. But: it turns out that we can do much better than what we have so far.

Let’s have another look at the diagram with the evenly spaced planes:

Here, seven planes separate 8 intervals. Each of the primitive centroids projects to one of these intervals. We can now calculate the bounds of each interval, and the primitive count for each interval. Once we have this data for each interval, we can efficiently determine the union of the bounds of a range of intervals, as well as the sum of the primitive counts of a range of intervals. And this in turn is exactly what we need to evaluate the SAH; this time without visiting every primitive multiple times.

Let’s start with the definition of a bin, which will store an AABB and a primitive count:

Next, we populate the bins by visiting the primitives once (per axis):

Here, BINS is the number of intervals.

The bin index for a primitive centroid is calculated as:

binIdx=(centroid[a] - boundsMin) * 8 / (boundsMax - boundsMin)

To prevent the division per primitive, variable scale stores the non-varying part of this equation.

Once we have determined bounds and primitive counts for the bins, we can use this information to calculate the info we need at each plane. Our final goal is to evaluate the SAH at each plane. For this, we need the following information:

  • The number of primitives to the left of the plane
  • The number of primitives to the right of the plane
  • The bounding box over the primitives to the left of the plane
  • The bounding box over the primitives to the right of the plane.

We need this data per plane, so we store it in some arrays:

We can fill these arrays with a sweep over the bins. We will do a sweep from the left to the right to populate the leftArea and leftCount arrays, and a sweep from the right to the left to populate the rightArea and rightCount arrays.

Note that the sweep from left to right and from right to left happen simultaneously. Variable leftBox starts as an empty AABB. At each bin it is extended with the bounds of that bin. Likewise, leftSum (used for the summed triangle count) starts at zero and increases with the counts stored for each bin. The variables for the right side are updated in the same manner.

After the arrays have been filled we can evaluate the SAH for each plane:

Once again, scale is used to prevent the repeated division.

The full Subdivide function is now a somewhat large chunk of code:

But, it is worth it. Using eight bins the BVH now builds in just 11ms: almost three times faster than before. I did not time the version that considers only the longest axis of the node bounds, but that should take us close to 3ms for this mesh. And that is without any low-level optimizations and SIMD; there is still a lot of room for improvement if we go that route.

What’s Next

Now that we can build a BVH this rapidly we are ready to consider animated scenes. Before we go there, two more ingredients are needed: refitting and rigid motion for BVHs using ray transforms. Combined with fast rebuilds, these ingredients unlock the full potential of the top-level BVH, which will finally let us ray trace massive scenes with large moving objects in real-time, even on a CPU. But that’s material for another day.

Continue reading in article 4: rebuilding and refitting.

Closing Remarks

The complete code for this project (for Windows / Visual Studio) can be found on GitHub.

Enjoyed the article? Spotted an error?

Ping me on Twitter (@j_bikker), comment on this article, or send an email to bikker.j@gmail.com.
You can also find me on Github: https://github.com/jbikker.

Or read some more articles: https://jacco.ompf2.com. Some favorites:

Author: jbikker

10 thoughts on “How to build a BVH – part 3: Quick Builds

  1. Would it not be better to pre-sort the primitives by centroid along each axis and evaluate the SAH incrementally while sweeping over the axis? That will find the best split plane in linear time without any quality trade-off.

    Might not fit what you have in mind for animation in the next chapter though.

    1. The sort adds an O(n log n) pass (for each axis) to the first split. After that it is possible to maintain the sorting for subsequent splits however, as was done in the ‘Bonsai BVH’ paper. I chose not to do this, as it is conceptually a lot more complex. It does theoretically lead to better BVHs though.

  2. I’ve been following along in my own codebase, now up to this step. It turns out you can have bugs that don’t prevent it from working, but do completely destroy performance. Once I stepped through the code I found where the outputs just didn’t look correct (giant numbers instead of numbers in the 0-5 range) and fixing the typos improved performance by about 1,500x haha. Thanks for these great articles.

      1. It was my own personal typos from rewriting the code. Although I did see one tiny typo in the first article, line 8 of the code under the heading “Traverse”, IntersectTri( ray, tri[node.firstPrim + i] ) should I believe be IntersectTri( ray, tri[triIdx[node.firstTriIdx + i]] ), since you had already replaced the direct node access with the indexed access (but not firstTriIdx -> leftFirst yet).

        I personally had an error in the intersect AABB, which basically caused it to intersect every time, and also a small bug in the partition selection, which I noticed made my performance worse than the down the middle split, which is how I caught it.

        1. Thanks for pointing that one out! You are right and I am sorry it made you waste time on a bug hunt. Code has been updated.

  3. Wonderful series, ty so much.

    But yea Daniel is right about the “tri[triIdx[node.firstTriIdx + i]]”, been following along in rust and I would be sad to admit how many hours I spent trying to figure out what was going wrong.

  4. I was following your code snippets pretty carefully and pieced them together, but my code did not work. I even wrote visualization of bounding boxes with SVG and transparency to help me debug the issue, but I was getting weird results.

    The issue was that I did not reinitialize the ‘scale’ (line 44 of the full code). Using the very same name for a value and later for inverse of that value is pretty confusing. In the shorter snippet, it kinda looks like you’re providing the value for context and one needs to pay close attention to realize that the value is now inverted.

    But more importantly, I want to thank you for writing this series. It’s a nice read and easy to follow. I’m writing my toy ray tracer implementation in my spare time on and off since 2011 and this past Christmas was the first time I was able to finally debug the core and implement proper acceleration structure for it. So thank you for a Christmas gift, I guess 🙂

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.