How to build a BVH – Part 1: Basics

In this series we explore the wonderful world of the bounding volume hierarchy: the mysterious data structure that enables real-time ray tracing on recent GPUs, in complex, animated scenes. Despite the broad adoption of ray tracing in games, in-depth understanding of the underlying technology seems to be reserved for a select few. In this article we show how a proper BVH is implemented without support from DXR / RTX / OptiX / Vulkan, in easy-to-read ‘sane C++’. Whether this is actually useful for you, or just interesting as background information, depends on your needs, obviously.

This series consists of ten articles. For each article, full source code is available on GitHub. Contents:

  1. Basics: data structure, simple construction and ray traversal (this article).
  2. Faster rays: the Surface Area Heuristic and ordered traversal.
  3. Faster construction: binned BVH building.
  4. BVHs for animated models: refitting and rebuilding.
  5. BVHs for animated scenes: the top-level BVH.
  6. TLAS & BLAS part 2 – all together now.
  7. Consolidating, part 1: ray tracing textured and shaded meshes.
  8. Consolidating, part 2: Whitted-style ray tracing.
  9. OpenCL: GPU ray tracing, part 1.
  10. OpenCL: GPU ray tracing, part 2 (series finale).

In this first article we discuss the bare basics: what the BVH is used for, and how a basic BVH is constructed and traversed with a ray. The end product is a simple but reasonably efficient CPU ray tracer, which we will further improve upon in subsequent articles.

Background

The BVH is an acceleration structure. It serves to reduce a pretty fundamental problem in ray tracing: the sheer cost of finding the nearest intersection of a ray and a scene consisting of (potentially) millions of polygons. The basic principle is easy to understand: if we have 64 polygons, we could split them into two groups of 32 polygons, each group enclosed by a bounding volume such as a box or a sphere. When intersecting the scene, we can skip 32 polygons if the ray does not intersect the enclosing volume, nearly halving the processing time. Next, we do the same to each of the two groups: split in two halves, encapsulate, recurse. The resulting bounding volume hierarchy brings down processing time to manageable levels: with each level, we halve the number of polygons. Starting with 64 polygons, we traverse a tree in six steps to arrive at a single polygon.

BVH: a hierarchy of bounding volumes. We can skip any volume that the ray doesn’t intersect. Image based on A 3-Dimensional Representation for Fast Rendering of Complex Scenes, by Rubin and Whitted.

That’s the general idea, but let’s make things practical. I will be using a minimalistic template from one of my courses for these experiments, but the code is designed to be generic. Feel free to use something else.

We start with the definition of a triangle:
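Something like the following will do; the bare-bones float3 here is a stand-in for the richer vector type in my template:

```cpp
#include <cassert>

// minimal 3-component vector; stand-in for the template's float3
struct float3 { float x, y, z; };

// three vertices, plus a centroid that we will compute once during BVH construction
struct Tri { float3 vertex0, vertex1, vertex2; float3 centroid; };
```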

Your triangle may need more properties for rendering, but for building a BVH and intersecting the triangle with a ray, this is all we need. Speaking of rays:
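A minimal ray, matching the description below (the float3 stand-in is again an assumption):

```cpp
#include <cassert>

struct float3 { float x, y, z; };

// a ray: origin O, direction D, and the distance t to the nearest hit found so far
struct Ray { float3 O, D; float t = 1e30f; };
```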

Here, O is the ray origin, D is the (typically normalized) ray direction. Scalar t is the distance at which we found an intersection, which gets initialized to ‘infinity’ (well, almost) here.

Let’s construct some random triangles to play with:
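A self-contained version of this setup could look as follows; the xor32 generator and the `InitTriangles` wrapper are illustrative stand-ins for the template's helpers:

```cpp
#include <cassert>
#include <cstdint>

struct float3 { float x, y, z; };
float3 operator+( const float3& a, const float3& b ) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
float3 operator-( const float3& a, const float3& b ) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
float3 operator*( const float3& a, float s ) { return { a.x * s, a.y * s, a.z * s }; }
struct Tri { float3 vertex0, vertex1, vertex2, centroid; };

#define N 64
Tri tri[N];

// Marsaglia's xor32: a uniform random float in the range 0..1
uint32_t seed = 0x12345678u;
float RandomFloat()
{
    seed ^= seed << 13, seed ^= seed >> 17, seed ^= seed << 5;
    return seed * 2.3283064365387e-10f; // 1 / 2^32
}
float3 RandomFloat3() { return { RandomFloat(), RandomFloat(), RandomFloat() }; }

void InitTriangles()
{
    for (int i = 0; i < N; i++)
    {
        // first vertex: a random point in a 10x10x10 cube around the origin...
        float3 r0 = RandomFloat3(), r1 = RandomFloat3(), r2 = RandomFloat3();
        tri[i].vertex0 = r0 * 9.0f - float3{ 5, 5, 5 };
        // ...with the other two vertices at a small random offset from it
        tri[i].vertex1 = tri[i].vertex0 + r1;
        tri[i].vertex2 = tri[i].vertex0 + r2;
    }
}
```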

Here, RandomFloat() yields a uniform random number in the range 0..1. I used Marsaglia’s Xor32 RNG, but anything will do. Given three vectors with random components, a triangle is then produced within a 10x10x10 cube centered around the origin. Not pretty, but good enough for now. Feel free to replace this with a proper model, e.g. imported using assimp or TinyOBJ.

Next, we need a way to intersect a triangle with a ray:
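A sketch of the routine, with minimal stand-ins for the template's float3 helpers:

```cpp
#include <cassert>

struct float3 { float x, y, z; };
float3 operator-( const float3& a, const float3& b ) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
float3 cross( const float3& a, const float3& b )
{
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}
float dot( const float3& a, const float3& b ) { return a.x * b.x + a.y * b.y + a.z * b.z; }
struct Tri { float3 vertex0, vertex1, vertex2, centroid; };
struct Ray { float3 O, D; float t = 1e30f; };

// Möller–Trumbore ray/triangle intersection
void IntersectTri( Ray& ray, const Tri& tri )
{
    const float3 edge1 = tri.vertex1 - tri.vertex0;
    const float3 edge2 = tri.vertex2 - tri.vertex0;
    const float3 h = cross( ray.D, edge2 );
    const float a = dot( edge1, h );
    if (a > -0.0001f && a < 0.0001f) return; // ray parallel to triangle
    const float f = 1 / a;
    const float3 s = ray.O - tri.vertex0;
    const float u = f * dot( s, h );
    if (u < 0 || u > 1) return;
    const float3 q = cross( s, edge1 );
    const float v = f * dot( ray.D, q );
    if (v < 0 || u + v > 1) return;
    const float t = f * dot( edge2, q );
    if (t > 0.0001f && t < ray.t) ray.t = t; // keep the nearest hit
}
```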

This is the famous Möller–Trumbore intersection algorithm, which is still pretty close to ‘as fast as possible’. There exist slightly faster approaches (see Chapter 5 of Sven’s thesis), but these need more (or rather, different) data than just the three vertices.

The IntersectTri function determines if there is an intersection between the ray and the triangle. If this is the case, we check if the found intersection distance is closer than what the ray encountered before. If so, the t value for the ray is updated. For a given ray, we can now simply check all triangles; after doing so, the ray ‘knows’ the shortest distance to a surface in the scene.

So, let’s generate some rays. For a 640 x 640 image, for example, we do this by firing rays from a camera position to a virtual screen plane:
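In code, the per-pixel ray setup can look like this; the `GeneratePrimaryRay` wrapper and the minimal float3 are illustrative, the camera and screen-corner values are the ones described below:

```cpp
#include <cassert>
#include <cmath>

struct float3 { float x, y, z; };
float3 operator+( const float3& a, const float3& b ) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
float3 operator-( const float3& a, const float3& b ) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
float3 operator*( const float3& a, float s ) { return { a.x * s, a.y * s, a.z * s }; }
float3 normalize( const float3& a )
{
    float l = sqrtf( a.x * a.x + a.y * a.y + a.z * a.z );
    return a * (1.0f / l);
}
struct Ray { float3 O, D; float t = 1e30f; };

// camera at (0,0,-18); virtual screen corners 3 units in front of it
const float3 camPos = { 0, 0, -18 };
const float3 p0 = { -1, 1, -15 }, p1 = { 1, 1, -15 }, p2 = { -1, -1, -15 };

Ray GeneratePrimaryRay( int x, int y ) // x, y in [0,640)
{
    // interpolate over the rectangle spanned by p0, p1 and p2
    float3 pixelPos = p0 + (p1 - p0) * (x / 640.0f) + (p2 - p0) * (y / 640.0f);
    Ray ray;
    ray.O = camPos;
    ray.D = normalize( pixelPos - ray.O );
    ray.t = 1e30f; // 'infinity': no hit found yet
    return ray;
}
```

For the brute-force render, each of these rays is then intersected with every triangle in the scene.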

Here, the camera is placed at position (0,0,-18), so: well outside the cube that contains the triangles. Virtual screen corners are located at (-1, 1, -15), (1,1,-15) and (-1,-1,-15), so 3 units in front of the camera. For each of the pixels we can now calculate a position on this square. The ray is then initialized with the camera position as origin, and a normalized vector from the ray origin to the pixel position as a direction.

Generating primary rays: pixelPos is obtained by interpolating over the rectangle defined by p0, p1 and p2.

With a properly initialized primary ray we can now get our first ray traced view of the test scene, by ‘brute force’ checking the ray against every triangle. The result is a t value in the ray for each pixel. This value is 1e30f if we hit nothing, or the actual distance to a triangle otherwise.

Result of ray tracing 64 triangles.

The result is underwhelming, but hey, it shows that the basic setup works. I’ll leave the explanation of proper Whitted-style ray tracing to Peter Shirley’s amazing book and focus on the efficiency of those rays instead.

Concepts and Ingredients

On my machine, it takes about 140 milliseconds to construct the 640 x 640 rays and intersect them with the 64 triangles. That’s not bad, by the way: we trace almost 3 million rays per second, and that’s not even using multithreading or the GPU. The problem, however, is that computation scales linearly with the number of triangles: tracing 256 triangles takes 540 milliseconds, and 1024 triangles brings this to 2250 milliseconds.

Time to introduce the BVH.

When Rubin & Whitted first described the bounding volume hierarchy, they assumed the bounding volumes would be placed manually. This is not something artists will enjoy, so let’s see how a BVH can be constructed automatically.

We start with the scene and its bounding box. This is going to be the root node for the hierarchy. We then recursively subdivide this node. A split plane partitions the primitives in two sets. Note that we do not split space: this might lead to primitives being split and/or assigned to both sides. In a BVH this never happens. The partitioning is what we need the primitive centroids for: each triangle (or disc, in the figure above) is either to the left of the split plane, or not to the left (which includes the situation where the primitive is on the split plane – we need to be precise!).

After the split we have two (now independent) groups of objects, with their updated group bounds:

Note that the bounding volumes of the two groups may overlap. This actually happens when we split the right group vertically:

Splitting the left group also leads to an interesting situation: the group of discs at the top and the single disc at the bottom have bounds that are pretty far apart. A ray that does intersect the left group may thus miss both of its child nodes, avoiding intersection tests with seven discs.

For a BVH node we need the following data:

  1. The bounds. In our case: an axis-aligned bounding box (AABB), which is easy to calculate and cheap to store: just 6 floats define the AABB in 3D space. Note: a sphere would be even cheaper to store, but it is not as convenient to work with in practice.
  2. References to the two child nodes, if the node is not a leaf node.
  3. References to primitives, if the node is a leaf node.

A naïve BVH node layout would look something like:
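For instance (a deliberately naive sketch, to be improved below):

```cpp
#include <cassert>
#include <vector>

struct float3 { float x, y, z; };
struct Tri { float3 vertex0, vertex1, vertex2, centroid; };

// a naive node layout: correct, but large and unpredictable in size
struct BVHNode
{
    float3 aabbMin, aabbMax;      // the bounds of this node
    BVHNode* left, * right;       // child pointers: 4 bytes each, or 8 on a 64-bit OS
    bool isLeaf;
    std::vector<Tri*> primitives; // only used when isLeaf is true
};
```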

However, this struct has many problems. It is rather large, but also somewhat unpredictable in size due to the vector and the pointers, which take up 4 bytes on a 32-bit OS, but 8 on a 64-bit OS.

The first improvement is switching from pointers to indices. For triangles we use their indices in the triangle array; for child node pointers we can do something similar if we pre-allocate all nodes that we ever need in a BVHNode pool. We can do this, because the size of the BVH for N triangles has an upper limit: we can never have more than 2N-1 nodes, since N primitives in N leaves have no more than N/2 parents, N/4 grandparents and so on.

With primitive pointers and BVHNode pointers replaced by indices the BVHNode struct reduces to:
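Something like this (with a 12-byte float3; the bool survives for one more step):

```cpp
#include <cassert>
#include <cstdint>
typedef uint32_t uint;
struct float3 { float x, y, z; };

struct BVHNode
{
    float3 aabbMin, aabbMax;    // 24 bytes
    uint leftChild, rightChild; // indices into the pre-allocated node pool: 8 bytes
    bool isLeaf;                // removed in the next step
    uint firstPrim, primCount;  // first primitive index plus count: 8 bytes
};
```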

Note how we cleverly store a list of primitives using just two numbers: we simply store the index of the first primitive, and the number of primitives. This way, any number of primitives can be specified in just eight bytes, as long as they are stored consecutively in the primitive array. It turns out that we can enforce this elegantly.

We can make several improvements; most of these I will mention later but one is useful now: boolean isLeaf is redundant once we realize that for a leaf the primCount cannot be zero, while for all other nodes leftChild and rightChild cannot be zero. This yields (for now) the final structure with a size of 40 bytes.

Construction

We can now start the actual construction of the hierarchy. The first step is the creation of the root node, which we will then recursively subdivide.
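In outline, the builder could look like this; `UpdateNodeBounds` and `Subdivide` are discussed next in the article and are stubbed here so the snippet stands on its own:

```cpp
#include <cassert>
#include <cstdint>
typedef uint32_t uint;

struct float3 { float x, y, z; };
float3 operator+( const float3& a, const float3& b ) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
float3 operator*( const float3& a, float s ) { return { a.x * s, a.y * s, a.z * s }; }

struct Tri { float3 vertex0, vertex1, vertex2, centroid; };
struct BVHNode { float3 aabbMin, aabbMax; uint leftChild, rightChild, firstPrim, primCount; };

#define N 64
Tri tri[N];
BVHNode bvhNode[N * 2 - 1]; // pre-allocated pool: 2N-1 is the upper limit
uint rootNodeIdx = 0, nodesUsed = 1;

// stubs, so this snippet compiles standalone
void UpdateNodeBounds( uint ) {}
void Subdivide( uint ) {}

void BuildBVH()
{
    // compute triangle centroids; these drive the partitioning during subdivision
    for (int i = 0; i < N; i++)
        tri[i].centroid = (tri[i].vertex0 + tri[i].vertex1 + tri[i].vertex2) * 0.3333f;
    // claim the first node from the pool and assign all primitives to it
    BVHNode& root = bvhNode[rootNodeIdx];
    root.leftChild = root.rightChild = 0;
    root.firstPrim = 0, root.primCount = N;
    UpdateNodeBounds( rootNodeIdx );
    // recursively subdivide
    Subdivide( rootNodeIdx );
}
```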

The first lines of this function initialize the centroids for the triangles, which are needed for the subdivision process. After that, we claim one node from the bvhNode array. The root node is initialized: the child references are zeroed, and all primitives are added to it.

Updating the AABB for the root node is straightforward:
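A possible implementation; the float3 overloads of fminf / fmaxf mimic what my template provides:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
typedef uint32_t uint;

struct float3 { float x, y, z; };
// component-wise min / max, as provided by the template
float3 fminf( const float3& a, const float3& b ) { return { std::fmin( a.x, b.x ), std::fmin( a.y, b.y ), std::fmin( a.z, b.z ) }; }
float3 fmaxf( const float3& a, const float3& b ) { return { std::fmax( a.x, b.x ), std::fmax( a.y, b.y ), std::fmax( a.z, b.z ) }; }

struct Tri { float3 vertex0, vertex1, vertex2, centroid; };
struct BVHNode { float3 aabbMin, aabbMax; uint leftChild, rightChild, firstPrim, primCount; };

#define N 64
Tri tri[N];
BVHNode bvhNode[N * 2 - 1];

void UpdateNodeBounds( uint nodeIdx )
{
    BVHNode& node = bvhNode[nodeIdx];
    node.aabbMin = { 1e30f, 1e30f, 1e30f };
    node.aabbMax = { -1e30f, -1e30f, -1e30f };
    // visit every vertex of every triangle in the node to grow the AABB
    for (uint i = node.firstPrim; i < node.firstPrim + node.primCount; i++)
    {
        Tri& leafTri = tri[i];
        node.aabbMin = fminf( node.aabbMin, leafTri.vertex0 );
        node.aabbMin = fminf( node.aabbMin, leafTri.vertex1 );
        node.aabbMin = fminf( node.aabbMin, leafTri.vertex2 );
        node.aabbMax = fmaxf( node.aabbMax, leafTri.vertex0 );
        node.aabbMax = fmaxf( node.aabbMax, leafTri.vertex1 );
        node.aabbMax = fmaxf( node.aabbMax, leafTri.vertex2 );
    }
}
```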

Note: The fminf and fmaxf functions used here return the minimum and maximum x, y and z for pairs of float3‘s. Each vertex of each triangle in the BVHNode is visited to find the lowest and highest x, y and z: the resulting set of six values is the AABB we are looking for.

Finally, we recursively subdivide the root node. A few steps are required:

  1. Determine the axis and position of the split plane.
  2. Split the group of primitives in two halves using the split plane.
  3. Create child nodes for each half.
  4. Recurse into each of the child nodes.

1. Split plane axis and position: for now, we will split the AABB along its longest axis. Later we will discuss better ways, but for now this is fine. Code:
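Wrapped in a small helper for clarity (the `LongestAxisSplit` name and the indexable float3 are illustrative):

```cpp
#include <cassert>

struct float3
{
    float x, y, z;
    float operator[]( int i ) const { return i == 0 ? x : i == 1 ? y : z; }
};
float3 operator-( const float3& a, const float3& b ) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }

// pick the longest axis of the AABB; split in the middle of that axis
int LongestAxisSplit( const float3& aabbMin, const float3& aabbMax, float& splitPos )
{
    float3 extent = aabbMax - aabbMin;
    int axis = 0;
    if (extent.y > extent.x) axis = 1;
    if (extent.z > extent[axis]) axis = 2;
    splitPos = aabbMin[axis] + extent[axis] * 0.5f;
    return axis;
}
```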

2. Split the group in two halves: this sounds like a complicated thing to do, but in BVH construction this is surprisingly simple. Because we are not splitting any triangles, the combined size of the list of triangles in both halves is exactly the size of the unsplit list. That means the split can be done in place. For this, we walk the list of primitives, and swap each primitive that is not on the left of the plane with a primitive at the end of the list. This is functionally equivalent to a QuickSort partition.
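In the article this loop lives inline in Subdivide; here it is wrapped in a hypothetical `Partition` helper so it can be exercised in isolation:

```cpp
#include <cassert>
#include <utility> // std::swap

struct float3
{
    float x, y, z;
    float operator[]( int i ) const { return i == 0 ? x : i == 1 ? y : z; }
};
struct Tri { float3 vertex0, vertex1, vertex2, centroid; };

// in-place, QuickSort-style partition of tri[first .. first+count-1];
// returns the index of the first primitive on the right of the plane
int Partition( Tri tri[], int first, int count, int axis, float splitPos )
{
    int i = first, j = first + count - 1;
    while (i <= j)
    {
        if (tri[i].centroid[axis] < splitPos) i++;
        else std::swap( tri[i], tri[j--] ); // move to the back of the range
    }
    return i;
}
```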

Note that the triangles do not need to be sorted; it is sufficient that the two groups of triangles on both sides of the split plane are stored consecutively.

3. Creating child nodes for each half: Using the outcome of the partition loop we construct two child nodes. The first child node contains the primitives at the start of the array; the second child node contains the primitives starting at i. The number of primitives on the left is thus i - node.firstPrim; the right child has the remaining primitives. In code:
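Stripped down to the bookkeeping fields, and wrapped in a hypothetical `CreateChildNodes` helper (in the article this is part of Subdivide, followed by bounds updates and recursion for both children):

```cpp
#include <cassert>
#include <cstdint>
typedef uint32_t uint;

struct BVHNode { uint leftChild, rightChild, firstPrim, primCount; };
BVHNode bvhNode[8];
uint nodesUsed = 1;

// given partition result i, split node nodeIdx into two child nodes
void CreateChildNodes( uint nodeIdx, uint i )
{
    BVHNode& node = bvhNode[nodeIdx];
    uint leftCount = i - node.firstPrim;
    if (leftCount == 0 || leftCount == node.primCount) return; // one side is empty
    uint leftChildIdx = nodesUsed++;
    uint rightChildIdx = nodesUsed++; // always leftChildIdx + 1
    bvhNode[leftChildIdx].firstPrim = node.firstPrim;
    bvhNode[leftChildIdx].primCount = leftCount;
    bvhNode[rightChildIdx].firstPrim = i;
    bvhNode[rightChildIdx].primCount = node.primCount - leftCount;
    node.leftChild = leftChildIdx;
    node.rightChild = rightChildIdx;
    node.primCount = 0; // the node we just split is now an interior node
    // next: UpdateNodeBounds and Subdivide for both children (omitted here)
}
```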

Some details to point out here:

  • Theoretically, the split in the middle can yield an empty box on the left or the right side. Not easy to come up with such a situation though! Left as an exercise for the reader. 😉
  • We should not forget to set the primitive count of the node we just split to zero. Since we do not store the isLeaf boolean anymore, we rely on the primCount to identify leaves and interior nodes.
  • Since we increment the ‘next free bvh node counter’ nodesUsed twice on consecutive lines, rightChildIdx is always equal to leftChildIdx + 1. We can therefore drop rightChild from the BVHNode struct entirely, which reduces it to 36 bytes.

We’re almost done with the subdivide functionality. Once we have generated the left and right child nodes, we update their bounds (using a call to UpdateNodeBounds, which we created earlier), and we recurse into the child nodes (using a call to the Subdivide function we’re currently writing). That leaves us with one question: when does the recursion end? One would say: once there is one primitive left. This is sadly not true. The point is, two triangles quite often form a quad in real-world scenes. If this quad is axis-aligned, there is no way we can split it (with an axis-aligned plane) into two non-empty halves. For this reason, we typically terminate when a leaf has two or fewer primitives. Even this is not completely safe – but for a solution we need a seemingly unrelated technique, which will be discussed in the next article.

One More Thing

There is one nagging oversight in the partition code that we need to fix: the line that swaps two whole triangles.

This line has two issues. First, we are changing the original list of triangles. This may be fine if it is owned by the BVH builder, but not so great if it is shared with e.g. the animation system of an engine. Secondly, we are swapping whole triangles by value. Their size is somewhat limited in this case, but what if a triangle has uv-coordinates, material ids, a name and so on?

We can resolve these issues by using an intermediate array with triangle indices. It contains simple unsigned integers and gets initialized with the numbers 0..N-1. From then on, we swap entries in this array, rather than the triangles themselves. A BVHNode leaf now references a slice of indices, rather than a set of consecutive triangles. It is an extra step of indirection, but it proves to be quite practical.

We make a few small changes to facilitate this new data structure.

At the start of BuildBVH we initialize this array:
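Which amounts to little more than an identity mapping (shown here with an illustrative `InitTriIdx` wrapper):

```cpp
#include <cassert>
#include <cstdint>
typedef uint32_t uint;

#define N 64
uint triIdx[N]; // the new indirection: indices into the triangle array

void InitTriIdx()
{
    // identity mapping at first; partitioning will shuffle this array,
    // leaving the triangles themselves untouched
    for (uint i = 0; i < N; i++) triIdx[i] = i;
}
```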

The BVHNode struct needs some updates to member variable names, and now also incorporates some of the other changes we discussed.

UpdateNodeBounds now uses the new indirection when gathering triangle vertices.

The full Subdivide function, of which we have seen bits and pieces so far, in its full glory:
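With the indirection in place, a self-contained sketch of the complete builder (again with a minimal float3 standing in for the template's type) could look like this:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>
typedef uint32_t uint;

struct float3
{
    float x, y, z;
    float operator[]( int i ) const { return i == 0 ? x : i == 1 ? y : z; }
};
float3 operator+( const float3& a, const float3& b ) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
float3 operator-( const float3& a, const float3& b ) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
float3 operator*( const float3& a, float s ) { return { a.x * s, a.y * s, a.z * s }; }
float3 fminf( const float3& a, const float3& b ) { return { std::fmin( a.x, b.x ), std::fmin( a.y, b.y ), std::fmin( a.z, b.z ) }; }
float3 fmaxf( const float3& a, const float3& b ) { return { std::fmax( a.x, b.x ), std::fmax( a.y, b.y ), std::fmax( a.z, b.z ) }; }

struct Tri { float3 vertex0, vertex1, vertex2, centroid; };
struct BVHNode { float3 aabbMin, aabbMax; uint leftChild, firstTriIdx, triCount; };

#define N 64
Tri tri[N];
uint triIdx[N];
BVHNode bvhNode[N * 2 - 1]; // 2N-1 is the upper limit on the node count
uint rootNodeIdx = 0, nodesUsed = 1;

void UpdateNodeBounds( uint nodeIdx )
{
    BVHNode& node = bvhNode[nodeIdx];
    node.aabbMin = { 1e30f, 1e30f, 1e30f }, node.aabbMax = { -1e30f, -1e30f, -1e30f };
    for (uint i = 0; i < node.triCount; i++)
    {
        Tri& leafTri = tri[triIdx[node.firstTriIdx + i]]; // note the indirection
        node.aabbMin = fminf( node.aabbMin, fminf( leafTri.vertex0, fminf( leafTri.vertex1, leafTri.vertex2 ) ) );
        node.aabbMax = fmaxf( node.aabbMax, fmaxf( leafTri.vertex0, fmaxf( leafTri.vertex1, leafTri.vertex2 ) ) );
    }
}

void Subdivide( uint nodeIdx )
{
    // terminate recursion: with 2 or fewer triangles this node stays a leaf
    BVHNode& node = bvhNode[nodeIdx];
    if (node.triCount <= 2) return;
    // determine split axis (longest side of the AABB) and position (its middle)
    float3 extent = node.aabbMax - node.aabbMin;
    int axis = 0;
    if (extent.y > extent.x) axis = 1;
    if (extent.z > extent[axis]) axis = 2;
    float splitPos = node.aabbMin[axis] + extent[axis] * 0.5f;
    // in-place partition of the index array
    int i = node.firstTriIdx, j = i + node.triCount - 1;
    while (i <= j)
    {
        if (tri[triIdx[i]].centroid[axis] < splitPos) i++;
        else std::swap( triIdx[i], triIdx[j--] );
    }
    // abort the split if one of the sides is empty
    int leftCount = i - node.firstTriIdx;
    if (leftCount == 0 || leftCount == (int)node.triCount) return;
    // create child nodes; rightChildIdx is always leftChildIdx + 1
    int leftChildIdx = nodesUsed++, rightChildIdx = nodesUsed++;
    bvhNode[leftChildIdx].firstTriIdx = node.firstTriIdx;
    bvhNode[leftChildIdx].triCount = leftCount;
    bvhNode[rightChildIdx].firstTriIdx = i;
    bvhNode[rightChildIdx].triCount = node.triCount - leftCount;
    node.leftChild = leftChildIdx;
    node.triCount = 0; // this node is now an interior node
    UpdateNodeBounds( leftChildIdx );
    UpdateNodeBounds( rightChildIdx );
    // recurse
    Subdivide( leftChildIdx );
    Subdivide( rightChildIdx );
}

void BuildBVH()
{
    for (uint i = 0; i < N; i++) triIdx[i] = i;
    for (uint i = 0; i < N; i++)
        tri[i].centroid = (tri[i].vertex0 + tri[i].vertex1 + tri[i].vertex2) * 0.3333f;
    BVHNode& root = bvhNode[rootNodeIdx];
    root.leftChild = 0, root.firstTriIdx = 0, root.triCount = N;
    UpdateNodeBounds( rootNodeIdx );
    Subdivide( rootNodeIdx );
}
```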

And there you have it: one valid BVH constructed from a polygon soup, in a (perhaps) surprisingly short snippet of code.

Traverse

Time to put the BVH to work. Traversing it is of course a recursive process, starting at the root node:

  1. Terminate if the ray misses the AABB of this node.
  2. If the node is a leaf: intersect the ray with the triangles in the leaf.
  3. Otherwise: recurse into the left and right child.

That doesn’t sound too complex, and indeed: it is implemented in just a few lines of code.

The IntersectAABB function is one function we did not implement yet. To detect if a ray intersects an AABB we use the slab test. Note that we don’t care where we hit the box; the only info we need is a yes/no answer. Without further explanation:
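A sketch of the slab test, including the `tmin < ray.t` term that skips boxes beyond the nearest hit found so far:

```cpp
#include <cassert>
#include <cmath>

struct float3 { float x, y, z; };
struct Ray { float3 O, D; float t = 1e30f; };

// slab test: a yes/no answer is all we need
bool IntersectAABB( const Ray& ray, const float3 bmin, const float3 bmax )
{
    float tx1 = (bmin.x - ray.O.x) / ray.D.x, tx2 = (bmax.x - ray.O.x) / ray.D.x;
    float tmin = std::fmin( tx1, tx2 ), tmax = std::fmax( tx1, tx2 );
    float ty1 = (bmin.y - ray.O.y) / ray.D.y, ty2 = (bmax.y - ray.O.y) / ray.D.y;
    tmin = std::fmax( tmin, std::fmin( ty1, ty2 ) ), tmax = std::fmin( tmax, std::fmax( ty1, ty2 ) );
    float tz1 = (bmin.z - ray.O.z) / ray.D.z, tz2 = (bmax.z - ray.O.z) / ray.D.z;
    tmin = std::fmax( tmin, std::fmin( tz1, tz2 ) ), tmax = std::fmin( tmax, std::fmax( tz1, tz2 ) );
    return tmax >= tmin && tmin < ray.t && tmax > 0;
}
```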

The IntersectBVH function now does the work we previously did with the loop over all the triangles. That means that this loop now simply gets replaced by a single function call:
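Putting it all together, a self-contained traversal sketch (reusing the minimal helpers from the earlier snippets):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
typedef uint32_t uint;

struct float3 { float x, y, z; };
float3 operator-( const float3& a, const float3& b ) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
float3 cross( const float3& a, const float3& b )
{
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}
float dot( const float3& a, const float3& b ) { return a.x * b.x + a.y * b.y + a.z * b.z; }

struct Tri { float3 vertex0, vertex1, vertex2, centroid; };
struct Ray { float3 O, D; float t = 1e30f; };
struct BVHNode { float3 aabbMin, aabbMax; uint leftChild, firstTriIdx, triCount; };

#define N 64
Tri tri[N];
uint triIdx[N];
BVHNode bvhNode[N * 2 - 1];

void IntersectTri( Ray& ray, const Tri& tri ) // Möller–Trumbore, as before
{
    const float3 edge1 = tri.vertex1 - tri.vertex0, edge2 = tri.vertex2 - tri.vertex0;
    const float3 h = cross( ray.D, edge2 );
    const float a = dot( edge1, h );
    if (a > -0.0001f && a < 0.0001f) return;
    const float f = 1 / a;
    const float3 s = ray.O - tri.vertex0;
    const float u = f * dot( s, h );
    if (u < 0 || u > 1) return;
    const float3 q = cross( s, edge1 );
    const float v = f * dot( ray.D, q );
    if (v < 0 || u + v > 1) return;
    const float t = f * dot( edge2, q );
    if (t > 0.0001f && t < ray.t) ray.t = t;
}

bool IntersectAABB( const Ray& ray, const float3 bmin, const float3 bmax ) // slab test
{
    float tx1 = (bmin.x - ray.O.x) / ray.D.x, tx2 = (bmax.x - ray.O.x) / ray.D.x;
    float tmin = std::fmin( tx1, tx2 ), tmax = std::fmax( tx1, tx2 );
    float ty1 = (bmin.y - ray.O.y) / ray.D.y, ty2 = (bmax.y - ray.O.y) / ray.D.y;
    tmin = std::fmax( tmin, std::fmin( ty1, ty2 ) ), tmax = std::fmin( tmax, std::fmax( ty1, ty2 ) );
    float tz1 = (bmin.z - ray.O.z) / ray.D.z, tz2 = (bmax.z - ray.O.z) / ray.D.z;
    tmin = std::fmax( tmin, std::fmin( tz1, tz2 ) ), tmax = std::fmin( tmax, std::fmax( tz1, tz2 ) );
    return tmax >= tmin && tmin < ray.t && tmax > 0;
}

void IntersectBVH( Ray& ray, const uint nodeIdx )
{
    BVHNode& node = bvhNode[nodeIdx];
    if (!IntersectAABB( ray, node.aabbMin, node.aabbMax )) return;
    if (node.triCount > 0) // leaf: intersect the triangles
        for (uint i = 0; i < node.triCount; i++)
            IntersectTri( ray, tri[triIdx[node.firstTriIdx + i]] );
    else // interior node: recurse into both children
    {
        IntersectBVH( ray, node.leftChild );
        IntersectBVH( ray, node.leftChild + 1 );
    }
}
```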

Question is: is it any faster?

Well, on my machine I got 140 milliseconds for brute-force intersecting 64 triangles. With the BVH, this drops to (drumroll…) 32 milliseconds; a 4.4x speedup. Things get more interesting at 1024 triangles. At this scene size, frame time without BVH is 2250 milliseconds; with BVH this drops to 112 milliseconds. That is 20x faster. For larger scenes, the brute force approach becomes infeasible. And we’re only getting started: better BVHs will yield faster rays.

Dangling Pointers

In this article I postponed a lot of things. For instance: how can we make the BVHNode structure even smaller, and why should we do this? And what about the ‘ideal’ split plane axis and position? How do we safely terminate recursion in Subdivide? And what about multithreading construction?

Most of these topics are for another day, but let’s end this article with a better BVHNode structure. We currently have:
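For reference, the struct at this point, as sketched in the earlier snippets (the 12-byte float3 stands in for the template's type):

```cpp
#include <cassert>
#include <cstdint>
typedef uint32_t uint;
struct float3 { float x, y, z; };

struct BVHNode
{
    float3 aabbMin, aabbMax;    // 24 bytes
    uint leftChild;             // 4 bytes; rightChild is always leftChild + 1
    uint firstTriIdx, triCount; // 8 bytes
};
```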

This takes 36 bytes. That means that, on a modern GPU, one node fits in a 64-byte cache line, but quite frequently a node will in fact be partially in one cache line and partially in the next. This is a pretty severe performance issue. If we can reduce the BVHNode to 32 bytes, and if we align the array of nodes that we pre-allocate to a 64-byte boundary in memory (e.g. using _aligned_malloc), then every node always uses exactly half of a cache line. Even better: if we are careful, the two children of a node will be in the same cache line (for the impatient: skip one node after the root to achieve this).

We can reach the 32-byte node size when we realize that a node that has triangles must be a leaf. It thus has no child nodes. And the other way round: if there are child nodes, there are no triangles. We thus need to store a child node index or the index of the first triangle. That means these can be stored in one and the same variable:
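Which could look like this (the isLeaf helper replaces the boolean we removed earlier):

```cpp
#include <cassert>
#include <cstdint>
typedef uint32_t uint;
struct float3 { float x, y, z; };

struct BVHNode
{
    float3 aabbMin, aabbMax; // 24 bytes
    uint leftFirst;          // 4 bytes: left child index OR first triangle index
    uint triCount;           // 4 bytes: 0 for an interior node
    bool isLeaf() const { return triCount > 0; }
};
```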

How we interpret this mysterious ‘leftFirst’ variable depends on triCount. If it is 0, leftFirst contains the index of the left child node. Otherwise, it contains the index of the first triangle index.

There you have it: 32 bytes for one BVH node. It doesn’t get any better. Or perhaps it does. We’ll see, in a later article.

Continue reading in article 2, in which we explore how to trace rays more quickly.

Thanks

This article now has a small but important fix in the ray traversal code, as suggested by Daniel Moore. Thanks!

Closing Remarks

The complete code for this project (for Windows / Visual Studio) can be found on GitHub.

Enjoyed the article? Spotted an error?

Ping me on Twitter (@j_bikker), comment on this article, or send an email to bikker.j@gmail.com.
You can also find me on Github: https://github.com/jbikker.

Or read some more articles: https://jacco.ompf2.com.

Author: jbikker

22 thoughts on “How to build a BVH – Part 1: Basics”

  1. Extremely useful articles! Thanks a lot for the material!
    Somehow I got an empty box on one side right away on my test mesh so made a workaround by trying other axis if this happens.

  2. Thanks a ton for these articles 🙏
    I wanted to share a nice finding, using axis of greatest variance works really well with meshes that contain long thin triangles,
    while simply using axis of largest bounding box side causes a leaf node to contain thousands of triangles which makes traversal very slow.

      1. We compute element wise variance of triangle centers (bounding box centers work as well), then chose the split axis based on the largest component (similar to choosing split axis based on largest component of node’s bounding box dimensions),

        The naive variance algorithm from here works as a start:
        https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

        The 1111Engine……stl file from here exhibits the issue because it has some long thin triangles:
        https://drive.google.com/file/d/1TfpeZHQ9_krY9cSQulYGZ5YbfKdFSL1f/view

      2. Here’s the essence of it:
        We compute element wise variance of triangle centers (bounding box centers work as well), then chose the split axis based on the largest component (similar to choosing split axis based on largest component of node’s bounding box dimensions),

        1. Clear! I wasn’t sure if you meant actual variance. This sounds like a good alternative to full SAH, thanks for sharing.

  3. if (tri[triIdx[i]].centroid[axis] < splitPos) i++;

    ^is this split correct? Isn't it possible for a triangle to have a centroid that's on the left side of the bounding box, but 1 or 2 vertices that put it inside both bounding boxes?

    Shouldn't the split be based on whether all three vertices lie on the left side?

  4. What is the rationale of having a triangle struct store the vertex/centroid data it requires? This would obviously lead to a larger memory footprint as adjacent triangles will being storing duplicate vertex information. Would it be sensible instead to use a vertex buffer and index buffer, where we only need to process each triangle’s index and then can fetch the vertex data from a single vertex buffer.

    So instead of needing to do: tri[triIdx].vertex0

    We could do: vertex_buffer[face_buffer[triIdx][0]] (in this instance, face_buffer is a 2d array of the same format as a face definition in an OBJ file. Conceivably you could convert this to a 1d array with linear index, but I wrote it this way for clarity hopefully).

    What is the benefit of having the triangle’s explicitly hold their vertices? If I missed this somewhere I apologize for missing it.

    1. A vertex / index buffer would work fine. It does however add some bandwidth for ‘pointer chasing’, which is why sometimes the vertices are stored directly. It costs some memory, but it is faster.

      1. Is it always faster? If you were rendering in a coherent way such that adjacent triangles are (often) rendered one after the other, shouldn’t previously loaded vertex information already be in cache and thus not need to be read from memory? Or does the fact that each ray traverses the bvh in-between triangle intersections mean that the cache is disrupted with aabb data, and thus vertex information is expelled and no benefit is gained (as I believe there is a benefit in rasterization where each triangle is processed sequentially, and thus cached vertices can help you).

        I’m fairly unfamiliar with the gritty details of this sort of thing so I’m hoping my question is sensible!

        1. It is always faster. Even getting the data from cache is more work than directly accessing (hopefully cached) vertex positions. On top of that, without indirection data there is more room for actual vertex data, so the probability of a cache hit will go up.
          This is just in the optimal case that you described. In practice, ray distributions are often not coherent at all, and/or other data access evicts vertex data. In that case every memory access is a cache miss.
          To give you a ballpark figure: in practice indexed triangles will make the ray query about 5% slower. This is a rough estimate for packets of primary and shadow rays in the Arauna CPU ray tracer. I have no data for path tracing, but I expect it is similar.

  5. May I ask about the BVH storage in memory ? I cannot read it out from the code… Are two child nodes stored next to each other in the array, or is that a linear storage (meaning first child node is always stored after the parent and the second child is stored after the whole first leaf) ?

    I’m using a recursive function which works only at one node at time, resulting in linear storage of tree in array. Should I rewrite it like yours, or do you think it’s fine ?

    void obj_list::split_bvh(uint begin, uint end, uint node_size) {
    aabb bbox = box_from(begin, end);
    uint size = end - begin;
    if (size > node_size) {
    uint n = bvh.size();
    uchar axis = bbox.get_longest_axis();
    uint split = sort(begin, end, axis, bbox.pmid());
    if (split == begin || split == end) {
    bvh.emplace_back(bvh_node(bbox, begin, end, 0));
    return;
    }
    bvh.emplace_back(bvh_node(bbox, n + 1, 0, 1));
    split_bvh(begin, split, node_size);
    bvh[n].n2 = bvh.size();
    split_bvh(split, end, node_size);
    }
    else bvh.emplace_back(bvh_node(bbox, begin, end, 0));
    }

  6. Great article! I like all the little optimisation tricks. One note though: By using the centroid you will sometimes assign a primitive to the incorrect side, the bounding box center should be used instead.
    Example:
    Triangle: a = (-5, 6, 0), b = (-5, 0, 0), c = (7, 0, 0.), centroid = (-1, 2, 0), aabbCenter = (1, 3, 0)
    Splitting plane: x = 0
    Using the centroid, the triangle goes to the left side, which leads to a bleeding of 7 units.
    Using the bounding box center, the triangle goes to the right side, which leads to a bleeding of 5 units.

    However, centroid can sort some axis-aligned quad cases naturally.

    1. Thanks for pointing that out, I never realized that it matters which point you pick. But it indeed seems logical that the smallest ‘bleeding’ as you call it happens with the AABB centroid. TIL. 🙂

  7. Great article I have come again to share findings ^^, the partitioning does not work with unsigned integers because j will get decremented below zero and it will wrap around and cause issues ^^, sharing this note in case anyone has the same issue
    the solution is to use a slightly differently written code for partitioning


    uint32_t i = node->start;
    uint32_t j = node->start + node->count - 1;
    do
    {
    while (i <= j && triangles[indices[i]].aabb.max.as_array[split_axis] < split_pos)
    {
    i++;
    }
    while (i <= j && triangles[indices[j]].aabb.max.as_array[split_axis] >= split_pos)
    {
    j--;
    }

    if (i < j)
    {
    uint32_t tmp = indices[i];
    indices[i] = indices[j];
    indices[j] = tmp;
    i++;
    j--;
    }
    } while (i < j);

    1. The code does not use unsigned ints in any of the versions in the articles or on GitHub, unless I missed something?

      1. ah sorry I meant people who need to implement using unsigned ints based on the article, I did and ran into this issue, so posted fix ^^

  8. As a word of advice to anyone else following this, allowing the BVH to exit construction with a leaf with more tris than the specified threshold is VITAL.
    I came upon this article looking for speedups of my implementation of the BVH, and spent 8 hours trying to figure out why my raycasting was 10x slower than the listed speed in pt 2. The singular issue causing this was me forcing my BVH to split until only 4 tris or less remained in each leaf. For such a small detail to cause such a tremendous mess is frankly incredible. It was literally the last thing I checked.
    Kudos to you, you saved my bacon on this one. I considered my code trying to select splits that put all tris on only one side of the split a bug. It never would have occurred to me to allow it were not for this article.
