In this article we will explore an important concept used in the recently published Lighthouse 2 platform. Wavefront path tracing, as it is called by NVIDIA’s Laine, Karras and Aila, or streaming path tracing, as it was originally named by Van Antwerpen in his master’s thesis, plays a crucial role in the development of efficient GPU path tracers, and potentially, also in CPU path tracers. It is somewhat counter-intuitive however, and its use requires rethinking the flow of ray tracing algorithms.
The path tracing algorithm is a surprisingly simple algorithm, which can be described in a few lines of pseudo-code:
vec3 Trace( vec3 O, vec3 D ) IntersectionData i = Scene::Intersect( O, D ) if (i == NoHit) return vec3( 0 ) // ray left the scene if (i == Light) return i.material.color // lights do not reflect vec3 R, pdf = RandomDirectionOnHemisphere( i.normal ), 1 / 2PI return Trace( i.position, R ) * i.BRDF * dot( i.normal, R ) / pdf
Input is a primary ray from the camera through a screen pixel. For this ray we determine the nearest intersection with a scene primitive. If there is none, the ray disappeared in the void. Otherwise, if the ray encountered a light, we found a light transport path between the light and the camera. If we find anything else, we bounce and recurse, hoping that the bounced ray does find a light. Note that this process resembles the (reverse) path of a photon, bouncing around the scene surfaces.
GPUs are designed to do the same thing on many threads. At first sight, ray tracing is a natural fit for this. So, we use OpenCL or CUDA to spawn a thread per pixel, and each thread executes the algorithm, which indeed works as intended, and quite fast too: just have a look at some ShaderToys to see how fast ray tracing can be on the GPU. Fast or not, the question is: are these ray tracers as fast as they can be?
There is a problem with the algorithm. A primary ray may find a light right away, or after a single random bounce, or after fifty bounces. A CPU programmer may see a potential stack overflow; a GPU programmer should see low hardware utilization. The problem is caused by the (conditional) tail recursion: a path may get terminated at a light source, or it may continue if it hit something else. Translated to many threads: a portion of the threads will get terminated, and a portion continues. After a few bounces, we have a few threads that have work left to do, while most threads are waiting for the final threads to finish.
The hardware utilization problem is amplified by the SIMT execution model of GPUs. Threads are organized in groups, e.g. 32 threads go together in a warp on a Pascal GPU (10xx class NVidia hardware). The threads in a warp share a single program counter: they execute in lock-step, so every program instruction is executed by the 32 threads simultaneously. SIMT stands for: single instruction multiple thread, which describes this concept well. For a SIMT processor, conditional code is a challenge. This is nicely illustrated in the Volta white paper:
When some condition is true for some threads in the warp, the branches of the if-statement are serialized. The alternative to ‘all threads do the same thing’ is ‘some threads are disabled’. In an if-then-else block, the portion of threads doing useful work will be 50% on average, unless all threads agree on the condition.
Conditional code is sadly not uncommon in a path tracer. Shadow rays are cast only if a light source is not behind the shading point, different paths may hit different materials, Russian roulette may or may not kill a path, and so on. It turns out that this becomes a major source of inefficiency, and it is not easy to prevent without extreme measures.
An earlier version of this article used the term ‘occupancy’ instead of ‘hardware utilization’. Occupancy is a somewhat tricky concept. For NVIDIA devices, it is defined as: “the number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently“. The maximum number of concurrent warps includes warps that are ready to be swapped in when another warp encounters a stall. Occupancy does not take into account control flow divergence, which reduces instruction level parallelism (ILP). It is thus possible to have 100% occupancy for code that has a single active thread in each warp, and it is possible to have 100% hardware utilization with <100% occupancy.
Streaming Path Tracing
The streaming path tracing algorithm is designed to combat the root of the occupancy problem. Streaming path tracing splits the path tracing algorithm in four phases:
Each phase is implemented as a separate program. So instead of running the full path tracer as a single GPU program (‘kernel’), we now have four kernels. And on top of that, they execute in a loop, as we will see shortly.
Phase 1 (‘Generate’) is responsible for generating the primary rays. It is a simple kernel that produces ray origins and directions for as many rays as there are pixels. The output of this phase is a large buffer of rays, and a counter, which tells the next phase how many rays should be processed. For primary rays this is of course simply screen width times screen height.
Phase 2 (‘Extend’) is the second kernel. It is executed only after phase 1 has completed for all pixels. The kernel reads the buffer generated in phase 1, and intersects each ray with the scene. The output of this phase is an intersection result for each ray, stored in a buffer.
Phase 3 (‘Shade’) executes after phase 2 is completely done. It takes the intersection result from phase 2 and evaluates the shading model for each path. This may or may not generate new rays, depending on whether a path was terminated or not. A paths that spawns a new ray (the path is ‘extended’) writes a new ray (‘path segment’) to a buffer. Paths that directly sample light sources (‘explicit light sampling’ or ‘next event estimation’) write a shadow ray to a second buffer.
Phase 4 (‘Connect’) traces the shadow rays generated in phase 3. It is similar to phase 2, but there is an important distinction: shadow rays merely need to find any intersection, while extension rays need to find the nearest intersection. This justifies a separate kernel.
Once phase 4 has completed we are left with a buffer that contains path extension rays. With these rays we proceed to phase 2. We do this until no path extension rays remain, or until we reach a maximum number of iterations.
For a performance-aware programmer, the streaming path tracing algorithms should raise a lot of red flags:
- Instead of a single kernel invocation, we now have three invocations per iteration, plus a generate kernel. Kernel invocations involve a certain overhead, so this is bad.
- Each kernel reads a massive buffer and writes a massive buffer.
- The CPU needs to know how many threads to spawn for each kernel, so the GPU needs to inform the CPU how many rays were generated in phase 3. Information travelling back from GPU to CPU is a bad idea, and we need it at least once per iteration.
- How does phase 3 write rays to a buffer without having gaps all over the place? It’s not going to use an atomic counter for that, is it?
- The number of active paths is still going to go down, so how does the scheme help in the first place?
To start with the last concern: if we pass a million tasks to the GPU, it will not run a million threads concurrently. The actual number of threads that runs concurrently depends on the hardware capabilities, but in general, tens of thousands of threads execute in parallel. Only when the workload drops below that number we are going to see hardware under-utilization due to a low task count.
The massive buffer I/O is another concern. This is indeed an issue, but not as much so as we might expect: data access is highly predictable, and especially for writing to the buffers, latency is not a problem. In fact, this type of data processing is precisely what the GPU was made for in the first place.
Another thing that a GPU does well is atomic counters, which you may not expect when you are coming from a CPU world. As a rule of thumb, an atomic write is as expensive as an un-cached write to global memory. In many cases, the latency will be hidden by the massively parallel execution of the GPU.
Before we discuss the details, we will have a look at the consequences of the wavefront path tracing algorithm. First of all, the buffers. We need a buffer for the output of phase 1, i.e. the primary rays. For each ray, we need:
- A ray origin: three floats, so 12 bytes
- A ray direction: three floats, so 12 bytes
In practice it is better to increase the size of the buffer. Storing 16 bytes for the ray origin and direction allows the GPU to read those with a single 128-bit read. The alternative is a 64-bit read followed by a 32-bit read to obtain a float3, which is about twice as slow. So for a 1920×1080 screen we have: 1920x1080x32=~64MB. We also need a buffer for the intersection results produced by the Extend kernel. This is again 128 bit per entry, so 32MB. Next, the Shade kernel may produce up to 1920×1080 path extensions (upper limit), and we can’t write them to the buffer we are reading from. So, another 64MB. And finally, if our path tracer casts shadow rays, that is another 64MB buffer. Summing everything, we get to 224MB of data, just for the wavefront algorithm. Or, about 1GB at a 4K resultion.
Now, here’s another thing we may need to get used to: memory is abundant. 1GB may sound like a lot, and there are ways to lower that figure, but realistically, by the time we need to actually path trace at 4K, using 1GB on an 8GB GPU is the least of our problems.
A bigger problem than memory requirements are the consequences for the rendering algorithm. So far I assumed that we want to generate a single extension ray and perhaps a shadow ray per thread in the Shade kernel. But what if we want to do some AO, using 16 rays per pixel? The 16 AO rays need to be stored in a buffer, but worse, they will only arrive in the next iteration. A similar problem arises with Whitted-style ray tracing: casting a shadow ray to multiple lights, or splitting a path when it encounters glass is pretty much impossible.
On the plus side, wavefront path tracing does solve the issues we identified in the ‘Occupancy’ section:
- In phase 1, all threads unconditionally produce primary rays and write these to a buffer.
- In phase 2, all threads unconditionally intersect rays with the scene and write intersection results to a buffer.
- In phase 3, we start the intersection result evaluation with 100% occupancy.
- In phase 4, we process a continuous list of shadow rays without any gaps.
By the time we return to phase 2 with the survivingfor paths with a length of 2 segments we again have a compacted ray buffer, guaranteeing full occupancy when the kernel starts.
There is an additional benefit, which we should not overlook. The code in the four individual phases is isolated. Each kernel can use all available GPU resources (cache, shared memory, registers) without taking into account other kernels. This may allow the GPU to run the scene intersection code with more threads, because this code does not require as many registers as the shading code. More threads means: better latency hiding.
Full occupancy, better latency hiding, streaming writes: these are benefits that directly relate to the origin and nature of the GPU platform. For a GPU, wavefront path tracing is very natural.
Is It Worth It?
The question is of course: does the improved occupancy justify the buffer I/O and the cost of the additional kernel invocations?
The answer is: yes, but it is not easy to prove.
If we go back to the ShaderToy path tracers for a moment, we will see that most of them feature a simple hard-coded scene. Replacing this by a full-blown scene is not trivial: for millions of primitives ray-scene intersection becomes a hard problem, that is often left to NVidia (Optix), AMD( Radeon-Rays) or Intel (Embree). None of these options can simply replace a hard-coded scene in a toy CUDA ray tracer. In CUDA, the closest match (Optix) insists on taking over your program flow. On the CPU, Embree will actually allow you to trace individual rays from your own code, but at a significant performance cost: it much prefers to trace a large batch of rays instead of individual rays.
Whether wavefront path tracing is faster than the alternative (the ‘megakernel’ as Laine et al. call it) depends on the time spent in the kernels (large scenes and expensive shaders reduce the relative overhead of the wavefront algorithm), the maximum path length, the occupancy in the megakernel and the difference in register pressure in the four phases. In an early version of the original Brigade path tracer, we found that even a simple scene, with a mix of specular and Lambertian surfaces, running on a GTX480 benefited from the wavefront approach.
Streaming Path Tracing in Lighthouse 2
The Lighthouse 2 platform provides two wavefront path tracers. The first one uses Optix Prime to implement phase 2 and 4 (the ray/scene intersection phases); the second one uses Optix directly to implement the same functionality.
Optix Prime is a simplified version of Optix that will only intersect a collection of rays with a scene consisting of triangles. Unlike the full Optix library, it does not support user intersection code; it will only intersect triangles. For a wavefront path tracer this is exactly what is needed however.
The Optix Prime wavefront path tracer is implemented in
rendercore.cpp in the
rendercore_optixprime_b project. Optix Prime Initialization starts in the
Init function, using
rtpContextCreate. An scene is created using
rtpModelCreate. The various ray buffers are created in the
SetTarget function, using
rtpBufferDescCreate. Note that we provide regular device pointers for these buffers: this means that they can be used in Optix as well as in regular CUDA kernels.
Rendering starts in the
Render method. The CUDA kernel
generateEyeRays is used to fill the primary ray buffer. Once the buffer is populated, Optix Prime is invoked, using
rtpQueryExecute. This yields the intersection results in
extensionHitBuffer. Note that all buffers remain on the GPU: there is no traffic between the CPU and the GPU, apart from the kernel calls. The Shade phase is implemented in a regular CUDA kernel,
shade. It’s implementation can be found in
A few details of the
optixprime_b implementation are noteworthy. First, shadow rays are traced outside the wavefront loop. This is valid: a shadow ray contributes to a pixel if it is not occluded, but apart from that, its result is not needed anywhere. A shadow ray is thus a fire and forget ray, which we can trace at any time and in any order. In this case, this is exploited by batching all shadow rays, so that the batch that is finally traced is as large as possible. This has one unfortunate consequence: for N iterations of the wavefront algorithm and X primary rays, the upper bound on the shadow ray count is XN.
Another detail is the handling of the various counters. The Extend and Shade phases need to know how many paths are active. The counters for this are updated (atomically) on the GPU, and subsequently used on the GPU, without ever moving back to the CPU. Sadly there is one case where this is not possible: Optix Prime wants to know the number of rays to trace. For this we need to bring back the counters once per iteration.
This article explained what wavefront path tracing is, and why it is needed to run the path tracing algorithm efficiently on a GPU. A practical implementation is provided in the Lighthouse 2 platform, which is open source and available on Github.
The article was updated based on suggestions by Jebb (see comments).