How to build a BVH – part 6: All Together Now

In this series we explore the wonderful world of the bounding volume hierarchy: the mysterious data structure that enables real-time ray tracing on recent GPUs, in complex, animated scenes. Despite the broad adoption of ray tracing in games, in-depth understanding of the underlying technology seems to be reserved to a select few. In this article we show how a proper BVH is implemented without support from DXR / RTX / OptiX / Vulkan, in easy-to-read ‘sane C++’. Whether this is actually useful for you, or just interesting as background information, depends on your needs, obviously.

This series consist of ten articles. For each article, full source code is available on Github, in easy-to-read ‘Sane C++’. Contents:

  1. Basics: data structure, simple construction and ray traversal.
  2. Faster rays: the Surface Area Heuristic and ordered traversal.
  3. Faster construction: binned BVH building.
  4. BVHs for animated models: refitting and rebuilding.
  5. BVHs for animated scenes: the top-level BVH.
  6. TLAS & BLAS part 2 – all together now (this article).
  7. Consolidating, part 1: ray tracing textured and shaded meshes.
  8. Consolidating, part 2: Whitted-style ray tracing.
  9. OpenCL: GPU ray tracing, part 1.
  10. OpenCL: GPU ray tracing, part 2 (series finale).

In this article we finalize the construction of the two-level acceleration structure for animated scenes.

Half Baked

The conclusion of article 5 probably felt a bit underwhelming. The code handles no less than two armadillos, which get intersected using a single call to the Intersect method of TLAS, even if they both get a new transformation matrix each frame. True, but… that’s not the kind of performance and versatility you may have been hoping for. It’s time to fulfill some promises.

Let’s formulate a target for this article, and then see how we can get there. The goal for today is to have 256 armadillos floating through space, while rotating, and bouncing off the walls of a virtual cube. For this, we will build a top-level structure over 256 bottom-level acceleration structures, in real-time. The total polygon count will be 30,000 times 256, so well over 7 million triangles. Once we reach this goal, it’s play time; we’ll be able to ray trace virtually anything – although a scene that can be intersected efficiently by rays must have certain characteristics. More on that at the end of the article.

Proper TLAS Construction

Last time we built a TLAS over two BVHs. We did this by storing the two BVHs in TLAS leaf nodes; the two leaf nodes then became child nodes of a single TLAS root node.

For a larger number of BVHs we can use a similar technique. In short, we will combine pairs of nodes using a parent node; the parent node will then be paired with other leaf nodes or interior nodes, until we have just a single node left. This is then the root node.

To build an optimal TLAS, the pairs that we create must be carefully chosen. It turns out that we should be pairing nodes that form small boxes – where ‘small’ actually means: having a small surface area. So, given 256 BVHs, we are looking for the two BVHs that, when put into a single AABB, have the smallest AABB. A simple algorithm will get the job done:

This is an O(N^2) algorithm. However, we’re overlooking something: the algorithm gets us one pair, which brings the number of nodes down to 255 (one is now an interior node). We need to do this 255 more times to get to a single root node. By then, the algorithm is O(N^3), which is bad, even on a fast machine and at a pretty low number of BVHs (such as 256).

Luckily, something better exists. For this, we need to look at a 2008 paper by Walter et al., titled Fast Agglomerative Clustering for Rendering. The term agglomerative clustering refers to the bottom-up process we’re using when we combine pairs of nodes; the alternative is the top-down process we used for constructing a BLAS over a group of triangles. The algorithm from the paper, in pseudo-code:

If you try this out on a piece of paper you’ll see that this actually works. Performance is pretty good as well: we get close to O(N^2) in practice, which is quite acceptable for 256 armadillos.

Before we turn this into practical code we need to make a change to the TLASNode struct that was proposed in article 5. The struct stores only the index of the left child, assuming that the right child can be derived from that. Using agglomerative clustering, this is no longer the case. Luckily, the required change is not too complex:

The 32-bit field leftRight now stores two numbers, in 2×16 bits. For leaf nodes, field BLAS stores the index of a BLAS, otherwise it is unused. We identify leaves by looking at leftRight: for interior nodes one of the child indices must be greater than zero, so only if we set leftRight to 0, we have a leaf node.

Back to TLAS construction. Construction starts with the 256 leaf nodes:

As with a regular BVH, tlasNode[0] will contain the root. We create the root last, so we reserve space for it; the first node we will use as a leaf node is thus tlasNode[1] (indexed by nodesUsed). The AABB we use for each TLAS leaf node is obtained from the BVHs and is in world space.

Now that we have the leaf nodes, we can create the work list for the agglomerative clustering algorithm:

Rather than directly storing the TLASNode instances, we store indices of elements of the tlasNode array. The number of nodes that still need to be paired is obtained from blasCount; for 256 armadillos this will simply be 256, and after each generated pair, it will decrement by 1 until only one node is left: this will be our root node.

The array can be initialized in the loop that creates the TLAS leaf nodes:

Everything is now ready for the actual clustering. The algorithm calls ‘FindBestMatch’ three times. This is the function that, given a TLASNode A, finds the node B (where B != A) that forms the smallest AABB with A. It is implemented as follows:

This function simply checks all remaining nodes in the list. For each node, the combined AABB is determined. If the area of this AABB (skipping the factor 2, as usual) is smaller than what we found before, we take note of that. After checking all nodes we return our best find.

Note the tricky addressing of the tlasNode array: since the work list contains tlasNode array indices, we need this complex indirection to query the correct nodes.

With the FindBestMatch function properly implemented, we can now finalize the agglomerative clustering algorithm.

This code closely follows the pseudo-code. When two nodes are paired, a new interior node is created (line 15), which points to existing nodes with indices nodeIdxA and nodeIdxB (line 16). The AABB of the new interior node is the union of the AABBs of the child nodes (lines 17 and 18). Now that the two nodes are paired, work item A (in the nodeIdx array) gets replaced by the index of a newly created TLAS interior node (line 19). Work item B, which is to be removed, gets replaced by the last element in the list (line 20), and the list is shortened by 1 (on line 21, as part of the function call).

And with that, we can suddenly handle far more than a pair of armadillos. Play time!

Move it

Let’s load a few more armadillos. For starters, 16 of them:

Loading 16 armadillos happens in a flash, but building a BVH for each of them causes a bit of a delay. We want 256 armadillos ultimately, so we’ll need to look into that later.

With 16 objects in memory we need some basic animation to make them move. How about this:

This super-ugly code makes half of the armadillos bounce, while the other half rotates. It’s all pretty chaotic. The code does however show how flexible things are now: objects are scaled, translated and rotated without any difficulty, using pretty basic matrix math. This will combine quite nicely with a typical scene graph.

Changing the transforms of the BLAS BVHs updates their world space bounds, and so we need to build a new TLAS. We do this each frame:

That’s all! This will invoke the agglomerative builder, which executes in a flash for the 16 armadillos that we have right now. Let’s tilt and move the camera a bit to take it all in:

The result is glorious.

Après Nous le Déluge

It is time for the final step: 256 armadillos. For that we only really need one new ingredient: instancing.

Instancing is very commonly used in games:

Here, just a few trees are carbon-copied all over the rolling hills, with a slightly different scale and orientation applied to prevent that we notice the repetition. In a regular 3D engine, instancing helps to reduce memory use and bandwidth between CPU and GPU. In a ray tracer we get essentially the same benefits, plus one more: we can reuse a BLAS if the only difference between two objects can be expressed in a 4×4 matrix: rotation, translation and scale. On a race track that means: one car gets multiplied to form a grid; add a track and you’re done. You could even apply different textures to each car: materials do not affect the BVH, after all.

Let’s make the necessary changes. We start with a BVHInstance class, which takes over the matrix transform functionality from the BVH class.

This leads to a cleaner BVH class: the inverse transform and world space bounds are now gone, as well as the SetTransform(...) method. Even the BVH::Intersect(...) is simpler now; the ray transform is handled by a short method of BVHInstance, which forwards the work, with an already transformed ray, to the BVH.

From now on, when we would normally process a BVH, we now operate on a BVHInstance. The TLAS constructor takes a BVHInstance array, for instance:

That also means that member variable blas is now a BVHInstance* rather than a BVH*. Apart from this, surprisingly few changes are needed; we can simply point to the same BVH now with different transforms, which is exactly what we need. This has a great effect on the scene construction:

A mesh is now loaded once, and a BVH is constructed for it, also once. The only thing that we get 16 of is BVHInstances – but creating these is, well, instantaneous.

Let’s bring in an army of armadillos!

We’re in the Army Now

Recall the goal of this article: to have 256 armadillos float around in a cubical section of space, rotating and bouncing off the walls. For that, we need some additional data in the AllTogetherApp class.

Each of these will become arrays when we initialize the scene.

Finally, we use this data to actually update the transforms of 256 BVHInstance objects:

And that is all. On my machine, about 8.5 million rays per second are traced through this scene, which now consists of 7.68 million triangles. Updating the TLAS only takes half a millisecond.

Future Work

From here, the sky is the limit. In terms of BVH maintenance, we can add all kinds of functionality:

  • Add refitting for meshes that deform. We can still instance them, complete with animation.
  • Add rebuilding for meshes that undergo more structural changes.
  • Move large static meshes by changing their transforms alone.
  • Add all this in a single framework that intelligently divides work over frames for an optimal balance between ray tracing speed and BVH maintenance time.

In terms of performance, we’ve only scratched the surface. We can for example trace bundles of rays to greatly speed up the kind of ray distributions we use in these demos, exploiting the fact that rays for nearby pixels tend to travel the same route through a BVH. We can also build a BVH with more than 2 child nodes per level, to reduce the number of traversal steps. This is beneficial for rays that go all over the place, like those for path tracing.

And then there is the actual using of the rays. We’ve done greyscale images so far. Some cleverly aimed rays can yield photo-realistic images. Doing that for animated scenes is the holy grail – and it’s happening in games right now.

But all that is for another day.

Armadillo signing off.

Closing Remarks

The complete code for this project, in ten convenient incremental stages, can be found on GitHub (for Windows / Visual Studio).

Enjoyed the article? Spotted an error?

Ping me on Twitter (@j_bikker), comment on this article, or send an email to bikker.j@gmail.com.
You can also find me on Github: https://github.com/jbikker.

Or read some more articles: https://jacco.ompf2.com. Some favorites:

Author: jbikker

11 thoughts on “How to build a BVH – part 6: All Together Now

  1. You mention rescaling the objects, but I don’t see how the scaling works WRT the hit distance computation. Doesn’t the “local BVH” hit distance need to be scaled for the portion that is inside of the BVH ? (hit dist = dist to BVH entry + bvhScale * dist from entryPoint to hit) ?

    1. The answer, in short: The scaling part of the matrix is no different than the other parts (translation / rotation).

      It is however rather hard to see that inverting the matrix works for the scale though. When we scale a mesh, we do the same as with translation: we scale the ray (the whole ray) in the opposite ‘direction’. Try this out on a piece of paper, and keep three things in mind:

      1. We do not re-normalize the ray direction after applying the transform.
      2. ray.t is to be multiplied by the magnitude of ray direction to get the actual distance in world space units.
      3. Concatenating a scaling matrix with a translation matrix affects the translation.

      Regarding 3: take mat4::Scale( 2 ) * mat4::Translate( float3( 4, 0, 0 ) ) and notice how the translation becomes 8 in the resulting matrix.

      Hope this helps.

  2. Why can the TLAS not be built in the same top-down fashion as we did for triangles? Where we’d compute the centroids of all bounding boxes and do a binned SAH recursion to build the tree? Why are we attempting to build the tree from the bottom up?

    1. A bottom-up tree is of superior quality. It is also much more expensive to build, which is why we reserve it for the TLAS.
      I need to specify ‘superior quality’ and compare it against a top-down build; this will happen in a separate article.

      1. Good to know! Thanks so much for the reply. I’m working on a ray tracing engine using tiled GeoTIFF and SPC Maplet geometries of celestial bodies, so my geometries are completely non overlapping. This explains why I was seeing good performance using a top-down build for my particular use case. This is great to know though! Thanks for investigating.

      2. Any updates on bottom up vs top down for scenes with overlap?
        Also, in the paper the algorithm uses a kd-tree rather than a list to accelerate the search for the best match. It seems that you have a working kd-tree implementation but didn’t use it for some reason?

        1. In the past I built TLASes with just a few nodes, and in that case, the list is a good option: it never took an objectionable amount of time. To construct a TLAS over a more complex scene I did indeed implement a full kd-tree with support for arbitrary additions and deletions, and with very good performance. While that is a good solution for a few k nodes, it is still insufficient when the numbers go up.

          I didn’t further investigate the effects of overlap.

  3. How would you optimize adding and removing many meshes to the TLAS? (Or is it ok to rebuild the TLAS every frame?)

    1. It is indeed customary to rebuild the TLAS per frame and not to make any assumptions about refittability or anything else that could be exploited for faster construction. Even AAA games have just a few k BLAS’es and won’t struggle too much with a per-frame rebuild.

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.