Speeding Up Lighthouse 2

Last week an updated version of the Lighthouse 2 benchmark was released. This version is about 15% faster, and produces images with less noise. In fact, rendering 10 paths per pixel (samples per pixel, or spp) takes as long as it took the February 27 version of the benchmark to render 8spp. At the same time, the quality of the 10spp version exceeds the image quality of the 24spp version in the February 27 package. Yesterday I committed the improved source code. In this blog post I will discuss the changes that resulted in the improved performance and quality. Many of these changes should apply to other renderers, and, in fact, to GPGPU applications in general.

Optimization

To optimize any application, we must answer two questions:

  1. What is the goal of the optimization?
  2. Where are the bottlenecks in the existing code?

If you have read my posts on The Art of Software Optimization, you will recognize these as part of the structured approach to optimization.

The ‘goal’ of optimization can be quite broad, and includes things like the target platform (in this case: high-end consumer-level CPU and GPU) and the time available for optimization (in this case: spare time project, so no deadlines). And of course: objective performance, which generally should be ‘better’. But in the case of a path tracer, this is a bit harder to define. It is not hard to make a path tracer that runs at 60fps. However, on today’s hardware, it is impossible to write a path tracer that renders a noise-free image of a complex scene at 60fps (although with clever filtering Quake 2 RTX may actually qualify). So, let’s carefully define our aim:

The aim is to improve the code in such a way that a higher quality unfiltered image is achieved at the same frame rate.

Why unfiltered? For several reasons: first of all, a better unfiltered image means better input for a filter, and thus a better filtered image. And besides that, once the noise level is low enough, unfiltered path tracing has several benefits. Depth of field, motion blur and translucent materials are notoriously hard to filter, and in general, filtering tends to blur features in the image. Skipping the tricks yields a renderer that just works.

With this goal in mind, we need to define ‘quality’. With a path tracer, this is straightforward. ‘Ground truth’ is obtained by rendering a view with many samples per pixel. For the scenes in the benchmark the image has fully converged after a few thousand samples. The quality of a real-time image is now determined using the difference between this image and the ground truth image. To express this using one number we can use the root mean square error (RMSE):

rmse=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(f_i-g_i)^2} \tag{1}

where f_i is pixel i, g_i is ground truth pixel i and N is the number of pixels. We thus compute the average squared difference between a pixel and ground truth, and take the square root of that. For rgb pixels we need one additional detail: I will simply treat r, g and b as individual pixel values, so N becomes three times larger. Alternatively, we could weight r, g and b differently to acknowledge the sensitivity of the human eye to the color components. However, this is not needed here: we simply want a similarity metric between two images.
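As a concrete illustration, here is a minimal C++ sketch of this metric over two float rgb buffers (the buffer layout and function name are assumptions, not the benchmark's actual code):

#include <cmath>
#include <cstddef>

// RMSE between a rendered frame and the converged ground truth.
// Both buffers hold interleaved r, g, b floats; every channel counts as a value.
float RMSE( const float* render, const float* groundTruth, size_t pixelCount )
{
    const size_t N = pixelCount * 3; // r, g and b treated as individual values
    double sum = 0;
    for (size_t i = 0; i < N; i++)
    {
        const double d = (double)render[i] - (double)groundTruth[i];
        sum += d * d;
    }
    return (float)std::sqrt( sum / (double)N );
}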

The adjusted optimization goal:

The aim is to modify the code in such a way that the lowest RMSE is obtained at 30 frames per second at a resolution of 1600×900 pixels.

The 30fps requirement is somewhat arbitrary. It is the lowest number that people nowadays accept as ‘real-time’, it is the frame rate of most movies, and several console games deliberately aim for this number. Likewise, 1600×900 is arbitrary: Few modern screens will be smaller than 1920×1080, and on such screens, 1600×900 is a large window that doesn’t look like we’re cutting corners (even though we totally are).

Profiling

The benchmark project provides some basic numbers to guide the optimization process. To see these, you will have to build the project from the source code, which is available on GitHub. On line 24 of main.cpp of the benchmarkapp project you can disable the automatic flythrough to enable interactive mode.

At 10spp, the frame time is 42.7ms. Frame time is used by a number of tasks, several of which are timed individually. Sending out rays from the camera and intersecting them with the geometry takes 6.17ms. This is time spent in OptiX, and via OptiX, in the RTX hardware. It is quite safe to assume that there is not much to be gained here. Secondary rays (diffuse and specular bounces) take up 6.52ms, and shadow rays take another 6ms. The total time spent on actual ray tracing is 19.78ms. The time not spent on ray tracing is dominated by ‘shading time’. Shading means processing ray tracing results by evaluating the material model, which generates the shadow rays and bounced rays. This is implemented in CUDA, and thus fully under our control. Any speedup that we achieve in this code will have a significant impact on the overall performance of the path tracer. This is the functionality we will have to focus on.

Speed

The shading code for the Optix7 render core (which is the one used in the benchmark) can be found in pathtracer.h. Lighthouse 2 is a wavefront path tracer. Very briefly, this means that all primary rays arrive in the shading kernel, which then produces secondary rays and shadow rays, which are processed in subsequent waves. Read more about wavefront path tracing in this dedicated blog post.

The differences between the March and February versions of the renderer are relatively small. They do however yield a significant speed boost. Profiling the same camera view in the same scene with the new code shows the difference.

Due to changes to the camera and material model it is sadly not possible to recreate the exact same view, but the data is clear nevertheless: ray tracing takes about the same amount of time (as expected), but shading went down from 21.58ms to 13.52ms.

Let’s briefly discuss the changes:

A float4 skydome – The skydome is stored as a massive HDR texture. Every read from this texture is very likely a cache miss, especially for all the secondary rays that hit it at pretty much random locations.

The usual way to store such a bitmap is with three floats per pixel. However, at the hardware level, there is no such thing as a float3 memory transaction: we either get 1, 2 or 4 32-bit values. A float3 thus requires two transactions. To make matters worse, the two reads may touch two different cache lines.

A simple fix is to store the skydome pixels as float4 values, wasting four bytes per skydome pixel. This sounds pretty bad, but with a 6GB video card (remember, high-end) this is hardly an issue.
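As an illustration, a minimal CUDA sketch of the difference between the two layouts (the function and buffer names are assumptions, not the actual Lighthouse 2 skydome code):

// float3 layout: the three channels are not a native transaction size, so the
// compiler issues multiple loads, which may also straddle two cache lines.
__device__ float3 ReadSky3( const float* sky, int idx )
{
    return make_float3( sky[idx * 3 + 0], sky[idx * 3 + 1], sky[idx * 3 + 2] );
}

// float4 layout: one aligned 128-bit load, at the cost of 4 wasted bytes per pixel.
__device__ float3 ReadSky4( const float4* sky, int idx )
{
    const float4 p = sky[idx]; // single 128-bit transaction
    return make_float3( p.x, p.y, p.z );
}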

No textures – In the scene used in the benchmark every polygon has a texture. However, many polygons will use a single texel from this bitmap.

This again provides an opportunity to reduce the number of memory transactions. The material class used in Lighthouse 2 already supports a plain material color, which is used in the absence of a texture. While loading the scene, if a ‘one texel’ polygon is detected, the material is duplicated and modified to not use a texture.
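A sketch of the idea at scene-loading time (simplified stand-in types and names, not Lighthouse 2's actual material and scene classes):

#include <vector>
#include <cuda_runtime.h> // for float3 on the host

struct Material { float3 color; int textureID; }; // simplified stand-in

// If a polygon effectively samples a single texel, duplicate its material,
// bake that texel's color in, and drop the texture so shading skips the fetch.
int BakeSingleTexelMaterial( std::vector<Material>& materials, int materialID, float3 texel )
{
    Material flat = materials[materialID]; // duplicate the original material
    flat.color = texel;                    // the one texel the polygon would read
    flat.textureID = -1;                   // no texture lookup at shading time
    materials.push_back( flat );
    return (int)materials.size() - 1;      // new material id for this polygon
}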

This is obviously a highly scene-specific optimization, and under normal circumstances the visual artist should optimize the materials. It does however show that looking for a reduction in memory transactions in general pays off.

Flow divergence – The shading model used in Lighthouse 2 is based on the Disney BRDF implementation of the Appleseed renderer. The original implementation distinguishes four components in materials:

  1. Diffuse
  2. Sheen
  3. Specular
  4. Clearcoat

Additional properties, such as subsurface scattering, metalness and anisotropy, are handled as part of these four.

Not all materials use all four components. In fact, most materials use one or two. When different threads in the same warp require the evaluation of different components, this evaluation will be serialized.

It turns out that the evaluation of the four components has some functional overlap. By executing the shared code unconditionally, the time spent in divergent code is reduced, which speeds up shading calculations.
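The pattern, in a toy CUDA sketch (this is not the actual Appleseed-derived Disney code; the float3 operators and dot() are assumed from CUDA's helper_math.h, for which Lighthouse 2 has its own equivalents):

// The terms every component needs are computed unconditionally by all threads
// in the warp; only short component-specific tails remain divergent.
__device__ float3 EvalComponents( float3 N, float3 L, float3 V, float3 baseColor,
                                  bool hasDiffuse, bool hasSpecular )
{
    // shared work, executed regardless of which components the material uses
    const float NdotL = fmaxf( dot( N, L ), 0.0f );
    const float3 H = normalize( L + V );
    const float NdotH = fmaxf( dot( N, H ), 0.0f );
    float3 result = make_float3( 0.0f, 0.0f, 0.0f );
    // divergent work, now limited to these small tails
    if (hasDiffuse) result += baseColor * (NdotL * (1.0f / 3.14159265f));
    if (hasSpecular) result += make_float3( NdotL * powf( NdotH, 64.0f ) );
    return result;
}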

Note that this is only a temporary solution. For the Disney BRDF, some choices have been made that are purely driven by artistic wishes. Given the large impact of shading on overall render time, it seems logical to look for a more efficient compromise: perhaps a single unconditional code path can yield 90% of the flexibility at 50% of the cost.

Optimizing blue noise – Blue noise is used in Lighthouse 2 to replace random numbers. Very briefly: with blue noise we optimize the distribution of stochastic error over the image.

The picture is shamelessly copied from the one-page paper by Solid Angle’s Iliyan Georgiev and Marcos Fajardo. The implementation in Lighthouse 2 is based on work by Eric Heitz, which draws a single random number from a 2D blue noise tile using the following code:

float noise( uint* blueNoise, int x, int y, int idx, int dim )
{
    x &= 127, y &= 127, idx &= 255, dim &= 255;
    // xor index based on optimized ranking
    int rankedIdx = (idx ^ blueNoise[dim + (x + y * 128) * 8 + 65536 * 3]) & 255;
    // fetch value in sequence
    int value = blueNoise[dim + rankedIdx * 256];
    // if the dimension is optimized, xor sequence value based on optimized scrambling
    value ^= blueNoise[(dim & 7) + (x + y * 128) * 8 + 65536];
    // convert to float and return
    float r = (0.5f + value) * (1.0f / 256.0f);
    if (r >= 1) r -= 1; // never happens?
    return r;
}

We typically need four of these samples at a time. For example, when generating primary rays, two random numbers are used to generate a position on the aperture, and two for a random position on the pixel. Likewise, in the shading code, two random numbers are needed to select a random point on a random light, and two to steer the random bounce. The number of random parameters is called the dimensionality: a camera ray is thus four-dimensional. The dimension we wish to sample with the above code is the last argument of the function, ‘dim’.

Without paying too much attention to the inner workings of the above function, we can see that consecutive ‘dim’ values read adjacent elements of the blueNoise array. If dim is a multiple of 4 (note: it is), this happens twice: once in the calculation of rankedIdx, and once when XOR’ing the variable value. That means that we have two opportunities to replace four separate memory transactions by a single 128-bit memory transaction. Since this happens twice for every path, this should make a difference.

Behold the 4-way blue noise sampler:

float4 noise4( uint* blueNoise, int x, int y, int idx, int dim )
{
    // one 128-bit read replaces four separate ranking reads
    const uint4 bn4 = *((uint4*)(blueNoise + dim + (x + y * 128) * 8 + 65536 * 3));
    const int rsi1 = (idx ^ bn4.x) & 255, rsi2 = (idx ^ bn4.y) & 255;
    const int rsi3 = (idx ^ bn4.z) & 255, rsi4 = (idx ^ bn4.w) & 255;
    // these four reads depend on the ranked indices and remain scalar
    const int v1 = blueNoise[dim + 0 + rsi1 * 256];
    const int v2 = blueNoise[dim + 1 + rsi2 * 256];
    const int v3 = blueNoise[dim + 2 + rsi3 * 256];
    const int v4 = blueNoise[dim + 3 + rsi4 * 256];
    // one 128-bit read replaces four separate scrambling reads
    const uint4 bx4 = *((uint4*)(blueNoise + (dim & 7) + (x + y * 128) * 8 + 65536));
    // xor, convert to float and return
    return make_float4( (0.5f + (v1 ^ bx4.x)) * (1.0f / 256.0f), (0.5f + (v2 ^ bx4.y)) * (1.0f / 256.0f),
        (0.5f + (v3 ^ bx4.z)) * (1.0f / 256.0f), (0.5f + (v4 ^ bx4.w)) * (1.0f / 256.0f) );
}

Although values v1, v2, v3 and v4 still require a memory transaction each, the total number of reads has been brought down substantially.
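In the shading code, the four separate noise() calls can then be replaced by a single noise4() call; a sketch of what this looks like (the dimension variable and names are illustrative, not the exact Lighthouse 2 call sites):

// before: four scalar lookups, each with its own memory transactions
// float r0 = noise( blueNoise, x, y, sampleIdx, dimBase + 0 );
// float r1 = noise( blueNoise, x, y, sampleIdx, dimBase + 1 );
// float r2 = noise( blueNoise, x, y, sampleIdx, dimBase + 2 );
// float r3 = noise( blueNoise, x, y, sampleIdx, dimBase + 3 );
// after: one call; e.g. r4.x/r4.y pick the point on the light, r4.z/r4.w steer the bounce
const float4 r4 = noise4( blueNoise, x, y, sampleIdx, dimBase );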

Tweaking __launch_bounds__ – I am probably pointing out the obvious here, but carefully tuning (and periodically re-tuning, per kernel and per hardware generation) the workgroup size of every kernel is very important for optimal performance. I had tuned this before, but it turned out that the large kernels in Lighthouse 2 could use another pass. I am somewhat ashamed to report that this probably had the biggest impact on the final performance level.
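For reference, the mechanism being tuned, in a generic CUDA sketch (the kernel and the numbers are illustrative, not Lighthouse 2's actual settings):

// The first argument promises a maximum block size, the second requests a
// minimum number of resident blocks per SM; both steer register allocation,
// so the sweet spot differs per kernel and per GPU generation.
__global__ void __launch_bounds__( 128, 4 ) ShadeKernel( float4* accumulator, int jobCount )
{
    const int jobIdx = threadIdx.x + blockIdx.x * blockDim.x;
    if (jobIdx >= jobCount) return;
    // ... shading work for this path goes here ...
}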

PART 2 – Variance Reduction

A performance improvement alone – albeit substantial – has limited impact on the image quality after 33 milliseconds. The speedup lets us render 10spp instead of 8spp, but the 2 extra samples barely make a difference. It turns out that some small changes have a far greater impact.

Using a low-resolution skydome texture – The skydome texture is an HDR bitmap with some very bright spots. A simple way to reduce the variance of glossy surfaces is to blur the skydome. The March version of the benchmark stores a 1/64th-size version of the sky, in which each 8×8 block of pixels is averaged into a single pixel, to obtain the same effect.
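A sketch of the 8×8 box filter that builds the low-resolution copy (names and details are illustrative, not the benchmark's actual code):

#include <cuda_runtime.h> // for float4 / make_float4 on the host

// Average each 8x8 block of the full-resolution HDR sky into one pixel of a
// 1/64th-size copy; glossy bounces sample this blurred version instead.
void DownsampleSky( const float4* sky, int w, int h, float4* smallSky )
{
    const int sw = w / 8, sh = h / 8;
    for (int y = 0; y < sh; y++) for (int x = 0; x < sw; x++)
    {
        float r = 0, g = 0, b = 0;
        for (int v = 0; v < 8; v++) for (int u = 0; u < 8; u++)
        {
            const float4 p = sky[(x * 8 + u) + (y * 8 + v) * w];
            r += p.x, g += p.y, b += p.z;
        }
        smallSky[x + y * sw] = make_float4( r / 64, g / 64, b / 64, 0 );
    }
}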

Path space regularization – A path, starting at the camera, that hits a near-specular glossy surface after visiting a much more diffuse surface typically has high variance. Path space regularization helps: by clamping the roughness of materials so that it never drops below the roughness encountered earlier along the path, overall variance is reduced.
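The core of the idea, as a small sketch (how the maximum roughness is carried along with the path state is an assumption, not Lighthouse 2's exact bookkeeping):

// A surface is never treated as smoother than the roughest surface the path
// has already visited; 'pathMaxRoughness' starts at 0 for the camera ray.
__device__ float RegularizeRoughness( float materialRoughness, float& pathMaxRoughness )
{
    const float regularized = fmaxf( materialRoughness, pathMaxRoughness );
    pathMaxRoughness = regularized; // remembered for subsequent bounces
    return regularized;             // use this instead of the raw material roughness
}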

Aggressive clamping – Caustics are notoriously hard to render using a path tracer. The earliest signs of an emerging caustic are formed by some bright specks. These fireflies are the result of paths that found a light source via an ‘improbable path’, e.g. via glass or a mirror. If we accept some bias in the final image we can reduce the fireflies by clamping path transport values. To do this correctly:

  1. Detect the bright paths by comparing their magnitude \sqrt{r^2+g^2+b^2} against a threshold m.
  2. Normalize the color.
  3. Scale the color by m.

Note that normalizing and scaling is better than clamping each channel individually. Also note that the rgb color is repeatedly treated as a vector here, which does not really make sense, but it serves as a cheap approximation of a proper luminance calculation.
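The three steps in a small sketch (the threshold m and the helper name are illustrative):

// Clamp bright path contributions: keep the color's direction, cap its
// magnitude at m, instead of clamping r, g and b independently.
__device__ float3 ClampTransport( float3 c, float m )
{
    const float magnitude = sqrtf( c.x * c.x + c.y * c.y + c.z * c.z );
    if (magnitude > m)
    {
        const float scale = m / magnitude; // normalize, then scale by m
        c.x *= scale, c.y *= scale, c.z *= scale;
    }
    return c;
}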

Adaptive Sampling

The final feature that was supposed to reduce variance is adaptive sampling. Sadly, the variance estimation that drives the adaptive sampling requires at least 8 samples, plus an additional 8 samples on average, adaptively distributed. Within the 33ms budget this is currently not possible. To be continued.

Questions? Mail me: bikker.j@gmail.com, or follow me on Twitter: @j_bikker.

3 comments for “Speeding Up Lighthouse 2”

  1. Jasper Bekkers
    April 7, 2020 at 1:43 pm

    If you need to reduce bandwidth for the skydome you could look into storing it in 32 bits instead: either a shared exponent representation (rgbm / rgb9e5 etc.) so you still get the dynamic range, or something like 11:11:10 if the required dynamic range is much smaller.

    • jbikker
      April 7, 2020 at 2:11 pm

      Yes that should work. I’ll do some timings later to see if bandwidth is the issue. So far I purely optimized for transaction count, which helped greatly.

      • Jasper Bekkers
        April 8, 2020 at 5:37 am

        Yeah, they’re two entirely distinct things. A long time ago I sunk some time into doing a load-store vectorizer for SPIR-V after noticing how much impact this had on NVidia hardware in particular. AMD hardware typically doesn’t suffer from this problem because it waits for memory transactions differently: as long as overlapping memory transactions share an s_waitcnt, it treats them as a single transaction.

        Early prototype is still up on github.com/Jasper-Bekkers/SPIRV-Tools
