Wavefront Path Tracing

Introduction

In this article we will explore an important concept used in the recently published Lighthouse 2 platform. Wavefront path tracing, as it is called by NVIDIA’s Laine, Karras and Aila, or streaming path tracing, as it was originally named by Van Antwerpen in his master’s thesis, plays a crucial role in the development of efficient GPU path tracers, and potentially, also in CPU path tracers. It is somewhat counter-intuitive however, and its use requires rethinking the flow of ray tracing algorithms.

‘Occupancy’

Path tracing is a surprisingly simple algorithm, which can be described in a few lines of pseudo-code:

vec3 Trace( vec3 O, vec3 D )
   IntersectionData i = Scene::Intersect( O, D )
   if (i == NoHit) return vec3( 0 )          // ray left the scene
   if (i == Light) return i.material.color   // lights do not reflect
   vec3 R = RandomDirectionOnHemisphere( i.normal )
   float pdf = 1 / (2 * PI)                  // pdf of uniform hemisphere sampling
   return Trace( i.position, R ) * i.BRDF * dot( i.normal, R ) / pdf

Input is a primary ray from the camera through a screen pixel. For this ray we determine the nearest intersection with a scene primitive. If there is none, the ray disappeared in the void. Otherwise, if the ray encountered a light, we found a light transport path between the light and the camera. If we find anything else, we bounce and recurse, hoping that the bounced ray does find a light. Note that this process resembles the (reverse) path of a photon, bouncing around the scene surfaces.

GPUs are designed to do the same thing on many threads. At first sight, ray tracing is a natural fit for this. So, we use OpenCL or CUDA to spawn a thread per pixel, and each thread executes the algorithm, which indeed works as intended, and quite fast too: just have a look at some ShaderToys to see how fast ray tracing can be on the GPU. Fast or not, the question is: are these ray tracers as fast as they can be?
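
For concreteness, here is a minimal sketch of such a one-thread-per-pixel (‘megakernel’) path tracer in CUDA. The scene, camera and sampling helpers (Intersection, intersectScene, generatePrimaryRay, diffuseReflection) are hypothetical placeholders rather than real library calls, and the float3 operators are assumed to come from CUDA’s helper_math.h; the point is that the recursion becomes a per-thread loop.

// One thread per pixel runs the complete path tracing loop.
// Intersection, intersectScene, generatePrimaryRay and diffuseReflection are
// hypothetical helpers; float3 operators are assumed from helper_math.h.
__global__ void megakernel( float3* accumulator, int width, int height, unsigned int frameSeed )
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    unsigned int seed = (x + y * width) * 9781 + frameSeed * 1791;   // simplistic per-pixel RNG seed
    float3 O, D;
    generatePrimaryRay( x, y, width, height, seed, O, D );           // hypothetical camera code
    float3 throughput = make_float3( 1, 1, 1 );
    float3 E = make_float3( 0, 0, 0 );
    for (int depth = 0; depth < 16; depth++)                         // recursion rewritten as a loop
    {
        Intersection i = intersectScene( O, D );                     // hypothetical scene traversal
        if (!i.hit) break;                                           // ray left the scene
        if (i.isLight) { E = throughput * i.emission; break; }       // lights do not reflect
        float pdf;
        float3 R = diffuseReflection( i.normal, seed, pdf );         // random direction on the hemisphere
        throughput *= i.brdf * dot( i.normal, R ) / pdf;
        O = i.position; D = R;                                       // bounce and continue
    }
    accumulator[x + y * width] += E;
}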

There is a problem with the algorithm. A primary ray may find a light right away, or after a single random bounce, or after fifty bounces. A CPU programmer may see a potential stack overflow; a GPU programmer should see low hardware utilization. The problem is caused by the (conditional) tail recursion: a path may get terminated at a light source, or it may continue if it hit something else. Translated to many threads: a portion of the threads will get terminated, and a portion continues. After a few bounces, we have a few threads that have work left to do, while most threads are waiting for the final threads to finish.

The hardware utilization problem is amplified by the SIMT execution model of GPUs. Threads are organized in groups, e.g. 32 threads go together in a warp on a Pascal GPU (10xx class NVidia hardware). The threads in a warp share a single program counter: they execute in lock-step, so every program instruction is executed by the 32 threads simultaneously. SIMT stands for: single instruction multiple thread, which describes this concept well. For a SIMT processor, conditional code is a challenge. This is nicely illustrated in the Volta white paper:

SIMT execution of conditional code. From: Volta white paper, NVidia.

When some condition is true for some threads in the warp, the branches of the if-statement are serialized. The alternative to ‘all threads do the same thing’ is ‘some threads are disabled’. In an if-then-else block, the portion of threads doing useful work will be 50% on average, unless all threads agree on the condition.

Conditional code is sadly not uncommon in a path tracer. Shadow rays are cast only if a light source is not behind the shading point, different paths may hit different materials, Russian roulette may or may not kill a path, and so on. It turns out that this becomes a major source of inefficiency, and it is not easy to prevent without extreme measures.
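
A toy example of the effect, as a sketch (the two shading functions are trivial stand-ins for real shading code): when the threads of one warp disagree on materialId, the hardware executes both branches one after the other, with part of the warp masked off each time.

__device__ float shadeDiffuse( int i ) { return i * 0.5f; }   // stand-in for an expensive diffuse shader
__device__ float shadeGlass( int i ) { return i * 2.0f; }     // stand-in for an expensive dielectric shader

__global__ void shadeByMaterial( const int* materialId, float* result, int n )
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // if the threads of a warp fetched different material ids, the two branches are serialized
    if (materialId[i] == 0) result[i] = shadeDiffuse( i );
    else result[i] = shadeGlass( i );
}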

An earlier version of this article used the term ‘occupancy’ instead of ‘hardware utilization’. Occupancy is a somewhat tricky concept. For NVIDIA devices, it is defined as: “the number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently”. The maximum number of concurrent warps includes warps that are ready to be swapped in when another warp encounters a stall. Occupancy does not take into account control flow divergence, which reduces instruction level parallelism (ILP). It is thus possible to have 100% occupancy for code that has a single active thread in each warp, and it is possible to have 100% hardware utilization with <100% occupancy.

Streaming Path Tracing

The streaming path tracing algorithm is designed to combat the root of the occupancy problem. It splits the path tracing algorithm into four phases:

  1. Generate
  2. Extend
  3. Shade
  4. Connect

Each phase is implemented as a separate program. So instead of running the full path tracer as a single GPU program (‘kernel’), we now have four kernels. And on top of that, they execute in a loop, as we will see shortly.

Phase 1 (‘Generate’) is responsible for generating the primary rays. It is a simple kernel that produces ray origins and directions for as many rays as there are pixels. The output of this phase is a large buffer of rays, and a counter, which tells the next phase how many rays should be processed. For primary rays this is of course simply screen width times screen height.
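
A minimal sketch of such a kernel (the names and the camera model are illustrative, not the Lighthouse 2 code); origin and direction are stored as float4 for reasons discussed in the ‘Consequences’ section below:

// Phase 1 ('Generate'): one thread per pixel writes one primary ray to the ray buffer.
__global__ void generatePrimaryRays( float4* rayO, float4* rayD, int width, int height )
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    const int idx = x + y * width;
    // simplistic pinhole camera at the origin, looking down +z (an assumption for this sketch)
    float3 D = make_float3( (x + 0.5f) / width - 0.5f, (y + 0.5f) / height - 0.5f, 1.0f );
    const float invLen = rsqrtf( D.x * D.x + D.y * D.y + D.z * D.z );
    rayO[idx] = make_float4( 0, 0, 0, 0 );                                      // origin; .w is spare
    rayD[idx] = make_float4( D.x * invLen, D.y * invLen, D.z * invLen, 1e34f ); // direction; .w = tmax
}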

Phase 2 (‘Extend’) is the second kernel. It is executed only after phase 1 has completed for all pixels. The kernel reads the buffer generated in phase 1, and intersects each ray with the scene. The output of this phase is an intersection result for each ray, stored in a buffer.

Phase 3 (‘Shade’) executes after phase 2 is completely done. It takes the intersection result from phase 2 and evaluates the shading model for each path. This may or may not generate new rays, depending on whether a path was terminated or not. A path that spawns a new ray (the path is ‘extended’) writes a new ray (‘path segment’) to a buffer. Paths that directly sample light sources (‘explicit light sampling’ or ‘next event estimation’) write a shadow ray to a second buffer.
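
Sketched below is how such a Shade kernel can write compacted output buffers: each surviving path reserves a slot with an atomic counter, so the buffers have no gaps. The buffer encodings and the shading itself are placeholders, not the actual Lighthouse 2 kernel.

// Phase 3 ('Shade'), heavily simplified: one intersection result per thread; extension
// rays and shadow rays are appended to compacted buffers via atomic counters.
__global__ void shade( const float4* hits, float4* extRayO, float4* extRayD, int* extCounter,
                       float4* shadowRays, int* shadowCounter, int pathCount )
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= pathCount) return;
    const float4 hit = hits[i];
    if (hit.x < 0) return;                              // 'no hit' encoding is an assumption; path ends here
    // ... evaluate the material, accumulate emission, apply Russian roulette ...
    const bool extend = true;                           // placeholder for the real termination logic
    if (extend)
    {
        const int slot = atomicAdd( extCounter, 1 );    // reserve a gap-free slot for the bounced ray
        extRayO[slot] = make_float4( 0, 0, 0, 0 );      // new ray origin (placeholder values)
        extRayD[slot] = make_float4( 0, 0, 1, 1e34f );  // new ray direction (placeholder values)
    }
    const int sslot = atomicAdd( shadowCounter, 1 );    // same pattern for next event estimation
    shadowRays[sslot] = make_float4( 0, 0, 0, 0 );      // shadow ray towards a light (placeholder values)
}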

Phase 4 (‘Connect’) traces the shadow rays generated in phase 3. It is similar to phase 2, but there is an important distinction: shadow rays merely need to find any intersection, while extension rays need to find the nearest intersection. This justifies a separate kernel.

Once phase 4 has completed we are left with a buffer that contains path extension rays. With these rays we proceed to phase 2. We do this until no path extension rays remain, or until we reach a maximum number of iterations.
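
Putting the phases together, the host side becomes a loop over kernel launches. The sketch below reuses the generatePrimaryRays and shade sketches from above, adds hypothetical extend and connect kernels, and ping-pongs between two ray buffers because phase 3 cannot write to the buffer it reads from; none of the names are Lighthouse 2’s.

struct RayBuffers                      // illustrative container for the device-side buffers
{
    float4 *rayO[2], *rayD[2], *hits, *shadow;
    int* counters;                     // counters[0] = extension rays, counters[1] = shadow rays
    float3* accumulator;
};

__global__ void extend( float4* rayO, float4* rayD, float4* hits, int pathCount );    // hypothetical: nearest hit
__global__ void connect( float4* shadowRays, float3* accumulator, int shadowCount );  // hypothetical: any hit

// Host-side wavefront loop (illustrative; error handling omitted).
void renderFrame( RayBuffers& buf, int width, int height, int maxDepth )
{
    int pathCount = width * height;
    dim3 block( 8, 8 ), grid( (width + 7) / 8, (height + 7) / 8 );
    generatePrimaryRays<<< grid, block >>>( buf.rayO[0], buf.rayD[0], width, height );           // phase 1
    for (int depth = 0; depth < maxDepth && pathCount > 0; depth++)
    {
        const int in = depth & 1, out = (depth + 1) & 1;               // ping-pong between two ray buffers
        cudaMemset( buf.counters, 0, 2 * sizeof( int ) );              // reset the output counters
        extend<<< (pathCount + 255) / 256, 256 >>>( buf.rayO[in], buf.rayD[in], buf.hits, pathCount );   // phase 2
        shade<<< (pathCount + 255) / 256, 256 >>>( buf.hits, buf.rayO[out], buf.rayD[out], buf.counters,
            buf.shadow, buf.counters + 1, pathCount );                                                   // phase 3
        int counts[2];
        cudaMemcpy( counts, buf.counters, 2 * sizeof( int ), cudaMemcpyDeviceToHost );
        if (counts[1] > 0)
            connect<<< (counts[1] + 255) / 256, 256 >>>( buf.shadow, buf.accumulator, counts[1] );       // phase 4
        pathCount = counts[0];          // the one value that has to travel back to the CPU each iteration
    }
}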

Inefficiencies

For a performance-aware programmer, the streaming path tracing algorithm should raise a lot of red flags:

  • Instead of a single kernel invocation, we now have three invocations per iteration, plus a generate kernel. Kernel invocations involve a certain overhead, so this is bad.
  • Each kernel reads a massive buffer and writes a massive buffer.
  • The CPU needs to know how many threads to spawn for each kernel, so the GPU needs to inform the CPU how many rays were generated in phase 3. Information travelling back from GPU to CPU is a bad idea, and we need it at least once per iteration.
  • How does phase 3 write rays to a buffer without having gaps all over the place? It’s not going to use an atomic counter for that, is it?
  • The number of active paths is still going to go down, so how does the scheme help in the first place?

To start with the last concern: if we pass a million tasks to the GPU, it will not run a million threads concurrently. The actual number of concurrent threads depends on the hardware capabilities, but in general, tens of thousands of threads execute in parallel. Only when the workload drops below that number will we see hardware under-utilization due to a low task count.

The massive buffer I/O is another concern. This is indeed an issue, but not as much as we might expect: data access is highly predictable, and especially for writing to the buffers, latency is not a problem. In fact, this type of data processing is precisely what the GPU was made for in the first place.

Another thing that a GPU does well is atomic counters, which you may not expect when coming from a CPU world. As a rule of thumb, an atomic write is about as expensive as an uncached write to global memory, at least when contention on the counter is low (see the comments below for a more nuanced discussion). In many cases, the latency will be hidden by the massively parallel execution of the GPU.

Consequences

Before we discuss the details, we will have a look at the consequences of the wavefront path tracing algorithm. First of all, the buffers. We need a buffer for the output of phase 1, i.e. the primary rays. For each ray, we need:

  • A ray origin: three floats, so 12 bytes
  • A ray direction: three floats, so 12 bytes

In practice it is better to increase the size of the buffer: storing 16 bytes each for the ray origin and the ray direction allows the GPU to read them with single 128-bit reads. The alternative is a 64-bit read followed by a 32-bit read to obtain a float3, which is about twice as slow. So for a 1920×1080 screen we have: 1920 x 1080 x 32 bytes = ~64MB. We also need a buffer for the intersection results produced by the Extend kernel. This is 128 bits per entry, so 32MB. Next, the Shade kernel may produce up to 1920×1080 path extensions (the upper limit), and we can’t write them to the buffer we are reading from. So, another 64MB. And finally, if our path tracer casts shadow rays, that is another 64MB buffer. Summing everything, we get to 224MB of data, just for the wavefront algorithm. Or, about 1GB at 4K resolution.
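
The same arithmetic, spelled out (the struct name is illustrative):

// The 16-byte aligned ray layout discussed above.
struct Ray
{
    float4 O4;                         // xyz = origin, w free (e.g. tmin or a packed pixel index)
    float4 D4;                         // xyz = direction, w free (e.g. tmax)
};
static_assert( sizeof( Ray ) == 32, "a ray should be exactly two 128-bit words" );

// Buffer sizes at 1920x1080 (2,073,600 paths):
//   extension rays (in)   : 2,073,600 * 32 bytes = ~64MB
//   intersection results  : 2,073,600 * 16 bytes = ~32MB
//   extension rays (out)  : 2,073,600 * 32 bytes = ~64MB
//   shadow rays           : 2,073,600 * 32 bytes = ~64MB
//                                           total: ~224MB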

Now, here’s another thing we may need to get used to: memory is abundant. 1GB may sound like a lot, and there are ways to lower that figure, but realistically, by the time we need to actually path trace at 4K, using 1GB on an 8GB GPU is the least of our problems.

A bigger problem than the memory requirements is the set of consequences for the rendering algorithm. So far I assumed that we want to generate a single extension ray and perhaps a shadow ray per thread in the Shade kernel. But what if we want to do some ambient occlusion, using 16 AO rays per pixel? The 16 AO rays need to be stored in a buffer, but worse, their intersection results will only arrive in the next iteration. A similar problem arises with Whitted-style ray tracing: casting shadow rays to multiple lights, or splitting a path when it encounters glass, becomes pretty much impossible.

On the plus side, wavefront path tracing does solve the issues we identified in the ‘Occupancy’ section:

  • In phase 1, all threads unconditionally produce primary rays and write these to a buffer.
  • In phase 2, all threads unconditionally intersect rays with the scene and write intersection results to a buffer.
  • In phase 3, we start the intersection result evaluation with 100% occupancy.
  • In phase 4, we process a continuous list of shadow rays without any gaps.

By the time we return to phase 2 with the surviving paths, now two segments long, we again have a compacted ray buffer, guaranteeing full occupancy when the kernel starts.

There is an additional benefit, which we should not overlook. The code in the four individual phases is isolated. Each kernel can use all available GPU resources (cache, shared memory, registers) without taking into account other kernels. This may allow the GPU to run the scene intersection code with more threads, because this code does not require as many registers as the shading code. More threads means: better latency hiding.

Full occupancy, better latency hiding, streaming writes: these are benefits that directly relate to the origin and nature of the GPU platform. For a GPU, wavefront path tracing is very natural.

Is It Worth It?

The question is of course: does the improved occupancy justify the buffer I/O and the cost of the additional kernel invocations?

The answer is: yes, but it is not easy to prove.

If we go back to the ShaderToy path tracers for a moment, we will see that most of them feature a simple hard-coded scene. Replacing this with a full-blown scene is not trivial: for millions of primitives, ray/scene intersection becomes a hard problem, which is often left to NVidia (Optix), AMD (Radeon Rays) or Intel (Embree). None of these options can simply replace a hard-coded scene in a toy CUDA ray tracer. In CUDA, the closest match (Optix) insists on taking over your program flow. On the CPU, Embree will actually allow you to trace individual rays from your own code, but at a significant performance cost: it much prefers to trace a large batch of rays instead.

Screen from “It’s About Time”, rendered using Brigade 1.

Whether wavefront path tracing is faster than the alternative (the ‘megakernel’ as Laine et al. call it) depends on the time spent in the kernels (large scenes and expensive shaders reduce the relative overhead of the wavefront algorithm), the maximum path length, the occupancy in the megakernel and the difference in register pressure in the four phases. In an early version of the original Brigade path tracer, we found that even a simple scene, with a mix of specular and Lambertian surfaces, running on a GTX480 benefited from the wavefront approach.

Streaming Path Tracing in Lighthouse 2

The Lighthouse 2 platform provides two wavefront path tracers. The first one uses Optix Prime to implement phases 2 and 4 (the ray/scene intersection phases); the second one uses Optix directly to implement the same functionality.

Optix Prime is a simplified version of Optix that only intersects collections of rays with a scene consisting of triangles. Unlike the full Optix library, it does not support custom intersection code. For a wavefront path tracer this is exactly what is needed, however.

The Optix Prime wavefront path tracer is implemented in rendercore.cpp in the rendercore_optixprime_b project. Optix Prime initialization starts in the Init function, using rtpContextCreate. A scene is created using rtpModelCreate. The various ray buffers are created in the SetTarget function, using rtpBufferDescCreate. Note that we provide regular device pointers for these buffers: this means that they can be used in Optix as well as in regular CUDA kernels.
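
In terms of the Optix Prime C API, this setup amounts to roughly the following (a sketch, not the literal Lighthouse 2 code: error checks are omitted, and the descriptor and pointer names are illustrative):

RTPcontext context;                                           // created in Init
rtpContextCreate( RTP_CONTEXT_TYPE_CUDA, &context );
RTPmodel model;                                               // the scene, also created in Init
rtpModelCreate( context, &model );
rtpModelSetTriangles( model, indexDesc, vertexDesc );         // indexDesc / vertexDesc: descriptors for the mesh
rtpModelUpdate( model, RTP_MODEL_HINT_NONE );                 // builds the acceleration structure

// ray and hit buffers, created in SetTarget; extensionRayBuffer and extensionHitBuffer are
// plain CUDA device pointers, so CUDA kernels can write and read them directly
RTPbufferdesc raysDesc, hitsDesc;
rtpBufferDescCreate( context, RTP_BUFFER_FORMAT_RAY_ORIGIN_TMIN_DIRECTION_TMAX,
    RTP_BUFFER_TYPE_CUDA_LINEAR, extensionRayBuffer, &raysDesc );
rtpBufferDescCreate( context, RTP_BUFFER_FORMAT_HIT_T_TRIID_U_V,
    RTP_BUFFER_TYPE_CUDA_LINEAR, extensionHitBuffer, &hitsDesc );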

Rendering starts in the Render method. The CUDA kernel generateEyeRays is used to fill the primary ray buffer. Once the buffer is populated, Optix Prime is invoked using rtpQueryExecute. This yields the intersection results in extensionHitBuffer. Note that all buffers remain on the GPU: there is no traffic between the CPU and the GPU, apart from the kernel calls. The Shade phase is implemented in a regular CUDA kernel, shade. Its implementation can be found in pathtracer.cu.
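
The query itself, again as a sketch of the Prime API rather than the literal source:

// Phase 2, executed once per wavefront iteration, after the ray buffer has been filled.
RTPquery query;
rtpQueryCreate( model, RTP_QUERY_TYPE_CLOSEST, &query );      // extension rays need the nearest hit
rtpQuerySetRays( query, raysDesc );
rtpQuerySetHits( query, hitsDesc );
rtpQueryExecute( query, RTP_QUERY_HINT_ASYNC );               // results appear in extensionHitBuffer

Phase 4 uses the same calls with RTP_QUERY_TYPE_ANY, since a shadow ray only needs to know whether anything at all is hit.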

A few details of the optixprime_b implementation are noteworthy. First, shadow rays are traced outside the wavefront loop. This is valid: a shadow ray contributes to a pixel if it is not occluded, but apart from that, its result is not needed anywhere. A shadow ray is thus a fire-and-forget ray, which we can trace at any time and in any order. Here, this is exploited by batching all shadow rays, so that the batch that is finally traced is as large as possible. This has one unfortunate consequence: for N iterations of the wavefront algorithm and X primary rays, the upper bound on the shadow ray count is X·N.

Another detail is the handling of the various counters. The Extend and Shade phases need to know how many paths are active. The counters for this are updated (atomically) on the GPU, and subsequently used on the GPU, without ever moving back to the CPU. Sadly there is one case where this is not possible: Optix Prime wants to know the number of rays to trace. For this we need to bring back the counters once per iteration.
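
That single transfer can be as small as one integer per query, e.g. (names are illustrative):

int extensionRayCount;                                        // written by the Shade kernel via atomicAdd
cudaMemcpy( &extensionRayCount, deviceCounters, sizeof( int ), cudaMemcpyDeviceToHost );
rtpBufferDescSetRange( raysDesc, 0, extensionRayCount );      // tell Optix Prime how many rays to trace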

Conclusion

This article explained what wavefront path tracing is, and why it is needed to run the path tracing algorithm efficiently on a GPU. A practical implementation is provided in the Lighthouse 2 platform, which is open source and available on Github.

Acknowledgements

The article was updated based on suggestions by Jedd (see comments).

3 comments for “Wavefront Path Tracing”

  1. Jedd
    August 4, 2019 at 1:36 am

    Hi Jacco,

    Thanks for the nice article. I’m a raytracing noob, so I definitely learned something :-). I do have a couple of follow-up questions and observations, though.

    1. The following statement is not quite right.

    “if we pass a million tasks to the GPU, it will not spawn a million threads. The actual number of threads that runs concurrently depends on the hardware, but in general, tens of thousands of threads are running. Only when the workload drops below that number we are going to see occupancy issues due to a low task count.”

    The GPU *will* spawn a million threads – they just won’t all be spawned at once. As you note, the hardware has limits on how many threads may run concurrently, so new threads will be spawned as old threads retire. Further, this notion of occupancy is completely different than the “active-threads-in-a-warp” occupancy you discuss earlier and later in the article (more commonly known as warp divergence).

    2. Kernel invocation overhead is typically quite small. When have you observed this being a problem in practice?

    3. The mental model you described for atomics is not quite right, either. I’d contend that GPUs are actually quite *bad* at atomics. It is hard and expensive to scale atomic performance as the rest of the GPU scales, so you should only expect the relative performance of atomics to decrease in the future as GPUs become ever larger. It’s also worth noting that the hardware mechanism used to order z-buffer/framebuffer updates is completely different than the mechanism used for the atomics you access from a kernel (at least on the GPUs I’m familiar with). Typically, an atomic access will only be as fast as an uncached write if there is no contention on that atomic. The more the HW has to serialize atomic accesses, the slower they will be.

    4. For the two-way traffic problem you describe, have you evaluated GPU-enqueue-GPU (i.e. pre-allocate a large command buffer for your worst-case number of iterations, initialize it with no-ops, then encode real kernel invocations into the command buffer as you progress) vs persistent threads?

    Persistent threads require forward progress guarantees, which are available only on recent NVIDIA GPUs as far as I know. GPU-enqueue-GPU on the other hand is available across all vendors (NVIDIA, AMD, Intel, Apple, probably others), so it would significantly broaden an implementation’s portability. That said, it seems you’re using NVIDIA technology to begin with (e.g. Optix), so maybe your flexibility in exploring this trade-off is limited or just not pertinent to you :-).

    • jbikker
      August 4, 2019 at 1:25 pm

      Hi Jedd,

      Thanks for your extensive feedback and insightful remarks.

      I have read up on the term ‘occupancy’ and decided to replace it by ‘hardware utilization’ for this article, which better describes the goal of wavefront path tracing.

      Regarding kernel invocation overhead: this may be an assumption I kept from earlier CUDA versions. A kernel invocation typically involves the transfer of arguments and a CPU-to-device signal to launch the kernel, but I can imagine that this is mostly hidden by asynchronous transfers and the command queue these days?

      Atomics: I removed the bit about z-buffers. I did observe that most attempts to reduce atomics at the warp level to a single atomic (using a warp parallel scan) are futile; simply doing an atomic for every thread seems more efficient. Similarly, I noticed that writing to the frame buffer using atomics (where collisions are rare) barely slows down the code. This is consistent with what you wrote: the frame buffer access would basically be an uncached write, and the atomic completes in roughly the same time.

      Persistent threads: I removed this remark; it turns out that the overhead of thread allocation is negligible since Kepler. The term persistent threads is a bit confusing by the way: Aila & Laine used it for a worker thread / work stealing scheme, not for threads that survive over frame boundaries.

      – Jacco.

      • Jedd
        August 5, 2019 at 9:48 am

        Hi Jacco,

        Thanks for the extensive message back :-).

        1. I agree your definition is, ultimately, what matters to a software developer. However, from the perspective of HW, there is an important distinction. A thread that becomes inactive due to conditional code still occupies resources in the GPU (most importantly, a temporary register allocation), thus preventing future threads from launching. This distinction is important because any “occupancy” metric you may find in performance tools won’t measure the effect of conditional predication.

        2. Yes, the kernel launches will be pipelined through much of the GPU’s command processor. From the GPU’s perspective, the main overhead is the cache-flush-invalidates that make memory coherent between invocations. From the CPU’s perspective, there is only the latency of submitting the command buffer that (hopefully) contains many kernel invocations. Typically, in a real-time rendering setup, this latency is hidden by double or triple buffering the frame.

        3. Your first observation is probably because the compiler is already doing the warp reduction on your behalf. I agree that your claim about performance is correct when there is no contention, but I was merely pointing out that it’s not true in the contended case (which can be common depending on the exact nature of the algorithm).

        4. Presumably your setup is roughly this: one or more consumer threads that spin waiting for work to arrive and one or more producer threads that send work to be processed by the consumer threads. Unfortunately, most GPUs do not guarantee that this situation won’t livelock. The shader core may not schedule waves fairly and can consequently starve the producer threads of the opportunity to execute. Further complicating the situation, because of the SIMT nature of GPUs, even if the shader core scheduled waves fairly (“forward progress guarantee”), one must be careful not to pair a producer and consumer thread in the same wave, else livelock may still occur. NVIDIA has apparently solved this latter problem in Volta (but I suspect there is still a performance cost, so it’s desirable to avoid it if you can).
