The implementation of the renderer can be found on GitHub. This text occasionally refers to specific source files in this repository.

I am a teacher. My topics include graphics (in particular: ray tracing) and software optimization, and hidden between the lines, C/C++ and game development. For newcomers, getting up to speed in these topics is a massive undertaking. Aspiring game developers often turn to easier programming languages, or full-blown engines such as Unity or Unreal. This is fine, but it does take away from the joy of having full low level control over the machine.

One way to reduce the complexity of the task is to stick with (2D) raster graphics. On a flat screen, Tetris and Snake can be programmed with relatively little effort.

Perhaps this is also what made Minecraft such a success. Limiting creativity to a coarse 3D grid suddenly turns 3D world building into playing with Legos.

The goal of the voxel renderer is to offer the aspiring game developer a 3D world to toy with: a fixed-size 3D canvas, that can be edited with simple operations like plot, line, block, sphere. There will be sprites, tiles and bitmap fonts. There will also be some supporting operations such as sprite collision detection. The idea is that with these tools, a 3D Tetris or Pacman or Snake is as easy to build as the 2D original.

With these goals in mind, let’s formulate some requirements:

- The renderer should be fast enough for games, regardless of the contents of the 3D canvas.
- The renderer should be scalable: ideally it should at least work on older hardware. Young developers do not always wield the latest GPUs.
- It should run on the GPU, so the CPU can be used for game logic.
- It should run *in parallel* with the CPU-side game logic.
- The world should have a fixed size, with a reasonable amount of detail. Let’s go for 1024³, which is admittedly somewhat arbitrary.
- The system should not choke on pretty significant changes to the world.

With these requirements in mind, it makes sense to use *ray tracing* rather than rasterization. This may sound counter-intuitive, but there are over 1 billion voxels in the world, and quite a few of them will change per frame. Processing the voxels into triangles could be rather costly. Luckily, ray tracing voxels is an easy task for a GPU. An added benefit is that ray tracing is scalable and easy to use. Once the basic renderer works, shadows and reflections can be added easily.

The voxels will be stored in a two-level grid: a coarse 128³ *top-level grid*, whose cells reference 8³ voxel *bricks*, for a total resolution of 1024³. This data structure (which I first encountered in Fairlight’s 5-Faces demo) has several benefits. For starters, it lets the ray tracer skip over empty bricks. It also allows for easy partial map synchronization between CPU and GPU. And last but not least: most maps can now be stored with automatic compression.

A top-level grid cell can thus point to a brick, or it can be empty. There is a third option: a grid cell can also store a solid color, which is a cheap replacement for a uniformly colored brick. To make this practical, the lowest bit of the 32-bit grid cell value determines if the cell stores a brick pointer:

- 0: the grid cell is solid; solid color is stored in bits 1..31.
- 1: the grid cell contains a pointer. The brick index is stored in bits 1..31.

With this layout, a value of 0 is a ‘solid color’ grid cell with color 0, i.e. transparent. We can thus ‘memset’ the top-level grid to clear the world.
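The encoding can be sketched as follows (a minimal illustration with names of my own choosing; the repository may use a different bit layout):

```cpp
#include <cassert>
#include <cstdint>

// Bit 0 selects the cell type; the payload occupies bits 1..31.
constexpr uint32_t MakeSolidCell( uint32_t color )    { return color << 1; }          // bit 0 = 0: solid color
constexpr uint32_t MakeBrickCell( uint32_t brickIdx ) { return (brickIdx << 1) | 1; } // bit 0 = 1: brick pointer
constexpr bool     IsBrickCell( uint32_t cell )       { return (cell & 1) != 0; }
constexpr uint32_t CellPayload( uint32_t cell )       { return cell >> 1; }
```

Note that MakeSolidCell( 0 ) yields the value 0, which is what makes the memset-to-clear trick work.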

At the lowest level, the bricks store the voxels. Each voxel will store an 8-bit value, where 0 means transparent. When not transparent, the 8-bit value will be interpreted as a 3-3-2 rgb triplet. This is a significantly lower color resolution than 24-bit RGB, but it does help to keep the memory requirements down. Even with a single byte per voxel, the upper bound is 1GB for 128³ unique bricks, plus 8MB for the 128³ 32-bit values of the top-level grid.
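A 3-3-2 conversion could look like this (a sketch; the exact bit order is an assumption, see the repository for the actual mapping):

```cpp
#include <cassert>
#include <cstdint>

// Pack 24-bit RGB into 3-3-2: red in bits 5..7, green in bits 2..4, blue in bits 0..1.
uint8_t PackRGB332( uint8_t r, uint8_t g, uint8_t b )
{
    return (r & 0xE0) | ((g >> 3) & 0x1C) | (b >> 6);
}

// Expand back to 8 bits per channel by keeping the stored high bits.
void UnpackRGB332( uint8_t c, uint8_t& r, uint8_t& g, uint8_t& b )
{
    r = c & 0xE0;
    g = (c << 3) & 0xE0;
    b = (c << 6) & 0xC0;
}
```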

To make the renderer run on the GPU, I will use OpenCL. There are other options, such as compute shaders, Vulkan and CUDA. A quick investigation reveals that CUDA excludes many GPUs, compute shaders seem to offer insufficient low-level control over the pipeline, and Vulkan is great but also huge. OpenCL turns out to be pretty compatible: even somewhat old iGPUs are supported, as well as recent Android devices. There’s just one caveat: because of limited support for OpenCL on NVIDIA hardware, we have to limit ourselves to OpenCL 1.2.

The application flow, which is inspired by the Brigade 1 renderer, is illustrated in the following diagram.

In this flow, CPU code and GPU kernels run in parallel. Once the CPU is done making changes to the world, it starts a parallel copy of this data to the GPU. Obviously, we can’t just overwrite data that is being used by the GPU, so the changes are stored in a staging buffer. The staging buffer is processed once rendering on the GPU completes, in the commit kernel.

A few details about this process:

- The commit kernel runs for a very short amount of time: it copies data from the staging buffer in GPU memory to other locations in GPU memory. The bandwidth for these transfers is much higher than the bandwidth between CPU and GPU.
- The parallel copy from CPU to staging buffer has very little impact on the time it takes to execute the render kernel. We get this transfer ‘for free’ by hiding it behind GPU compute work.
- Doing the parallel copy successfully requires *two* OpenCL job queues. This is not hard to set up, but the queues must be carefully synchronized: commit cannot start before the copy completes, render cannot start before commit completes, and the next copy cannot start before commit completes. In OpenCL, this is conveniently achieved using events.

Frame time is now limited by either the combined runtime of game logic and the parallel copy, or the combined runtime of the commit kernel and the render kernel.

The implementation of this flow can be found in methods World::Render and World::Commit in world.cpp on GitHub.

Time to dive into the details. Let’s start with the *host* side of things, in other words: the CPU, and the operations on the data set in system RAM.

To make changes to the world, the CPU can directly access a host-side copy of the data.

The two-level grid complicates this slightly. To set a voxel, we must first ensure that the corresponding cell in the top-level grid points to a brick. If this is the case, we calculate the position in the brick, and overwrite the value. If not, we assign a brick to the cell first.

To assign a brick to a cell, we obtain it from an array of unused bricks. Most scenes will not use a unique brick for every cell: we can thus pre-allocate a fraction of the theoretical upper limit. The pre-allocated bricks are added to a circular buffer. Bricks are claimed at the tail, and recycled at the head of this list. By using a power of 2 size for the buffer, the modulo that is needed to wrap around at the end of the available space becomes a cheap bitwise AND.

When setting a voxel value, there are two more special cases to consider. The first is when we try to set a value that is identical to the existing value for the voxel, either encoded in a brick, or directly in the value for a solid top-level grid cell. Checking for this prevents redundant work, including unnecessary allocation of bricks. The second is when setting a value to 0 results in a completely empty brick. We can detect this situation by keeping track of the number of non-zero voxels in a brick. Empty bricks can now be recycled to the circular buffer. Ultimately, this is required if we don’t want to run out of bricks over time.
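The logic described in the last few paragraphs can be sketched in (simplified, single-threaded) code. This is a miniature with hypothetical names and a plain vector instead of the circular buffer, not the actual world.h implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr int GRID = 4, BRICK = 8;            // tiny grid for illustration; real sizes are 128 and 8
constexpr int BSIZE = BRICK * BRICK * BRICK;  // voxels (and bytes) per brick

struct MiniWorld
{
    uint32_t grid[GRID * GRID * GRID] = {};   // 0 = transparent solid cell; bit 0 set = brick pointer
    std::vector<uint8_t> brickData;           // BSIZE bytes per brick
    std::vector<int> nonZero;                 // non-zero voxel count per brick
    std::vector<uint32_t> freeBricks;         // stand-in for the circular buffer of unused bricks

    uint32_t AllocBrick()
    {
        if (freeBricks.empty())               // the real code pre-allocates; we grow on demand
        {
            freeBricks.push_back( (uint32_t)(brickData.size() / BSIZE) );
            brickData.resize( brickData.size() + BSIZE, 0 );
            nonZero.push_back( 0 );
        }
        uint32_t idx = freeBricks.back();
        freeBricks.pop_back();
        return idx;
    }

    void Set( int x, int y, int z, uint8_t v )
    {
        uint32_t& cell = grid[(x / BRICK) + (y / BRICK) * GRID + (z / BRICK) * GRID * GRID];
        if (!(cell & 1))                      // solid cell: no brick assigned yet
        {
            if ((cell >> 1) == v) return;     // identical value: skip, no brick needed
            uint32_t idx = AllocBrick();      // assign a brick, filled with the solid color
            memset( &brickData[idx * BSIZE], (uint8_t)(cell >> 1), BSIZE );
            nonZero[idx] = (cell >> 1) ? BSIZE : 0;
            cell = (idx << 1) | 1;
        }
        uint32_t idx = cell >> 1;
        uint8_t& voxel = brickData[idx * BSIZE + (x % BRICK) + (y % BRICK) * BRICK + (z % BRICK) * BRICK * BRICK];
        if (voxel == v) return;               // identical value: nothing to do
        nonZero[idx] += (v != 0) - (voxel != 0);
        voxel = v;
        if (nonZero[idx] == 0)                // brick became empty: recycle it
        {
            cell = 0;
            freeBricks.push_back( idx );
        }
    }
};
```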

The use of voxels encourages multi-threaded code. This is an interesting side effect of the 3D setting: where 2D code running on modern CPUs can assume pretty much unlimited processing power, 3D data processing can bring a fast processor to its knees, quickly. For example: adding a solid voxel ball with radius 25 can be implemented by evaluating a cube of 50 ✕ 50 ✕ 50 voxels. That is 125,000 loop iterations, resulting in approx. 65,000 ‘Plot’ operations. A bouncing ball must be deleted at its old location and plotted at a new location, which doubles the effort. Having 50 balls bouncing around in 3D suddenly isn’t trivial.

Without changes, setting voxels is far from thread-safe. Multiple threads may claim a brick before writing its index in the top-level grid, for instance.

The circular buffer helps. It becomes thread-safe simply by making the head- and tail pointer increments atomic (if we can guarantee it will not be empty). This is a significant improvement over e.g. stacks, for which thread-safety is surprisingly complex. The atomics do not solve all our problems. There’s a subtle source of inefficiencies lurking here: *false sharing*.

When two threads recycle bricks to the circular buffer, they will write to successive locations in the buffer. In most cases, these two locations will be in the same 64-byte cache line. This cache line in turn is stored in the L1 caches of the cores that executed the threads. So, even though the two cores wrote to different locations in the cache line, they now need to synchronize the changes: a costly operation.

We can prevent false sharing by spacing the writes. This can be achieved by allocating a much larger circular buffer, but it is more efficient to take steps of 17.

The number 17 is a prime. Because a prime step size shares no factors with a power-of-2 buffer size, stepping through the buffer in increments of 17 visits *all* elements before revisiting element 0. With a step size of 17 and an element size of 4 bytes, two subsequently visited elements are now 68 bytes apart, which places them in different cache lines.
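The slot arithmetic can be sketched as follows (names are hypothetical; the real buffer stores brick indices):

```cpp
#include <cassert>
#include <atomic>
#include <cstdint>

constexpr uint32_t SIZE = 1024, STRIDE = 17;   // SIZE must be a power of 2
std::atomic<uint32_t> head( 0 ), tail( 0 );    // atomic counters make claim/recycle thread-safe

// Map the i-th operation to a slot. Because gcd( 17, 1024 ) == 1, the sequence
// visits all 1024 slots before repeating, and for 4-byte elements two consecutive
// slots are 68 bytes apart: always in different 64-byte cache lines.
uint32_t Slot( uint32_t i )  { return (i * STRIDE) & (SIZE - 1); }
uint32_t ClaimSlot()         { return Slot( tail.fetch_add( 1 ) ); }
uint32_t RecycleSlot()       { return Slot( head.fetch_add( 1 ) ); }
```

Note how the power-of-2 size turns the modulo into a bitwise AND with SIZE - 1.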

Sadly, the thread-safe circular buffer does not take care of everything. It is still possible that two threads try to assign a new brick to a top-level grid cell concurrently. This can be prevented by batching and sorting writes, but for now, I left this unimplemented.

With all the above information, plotting a single voxel may seem very costly. Luckily, this is not the case: in the majority of cases (roughly 511 out of 512 for an 8 ✕ 8 ✕ 8 brick size), plotting is a matter of finding the top-level grid cell, reading the brick index from the cell, determining the offset of the voxel in the brick, and setting the voxel. Although this is obviously more work than directly writing to an array, it is not vastly more expensive.

The full implementation of the host-side Get and Set methods can be found in world.h in the GitHub repository.

After one or more CPU threads have made changes to the world, it must be synchronized to the GPU for rendering. As explained before, this happens via a staging buffer. To get the data to the staging buffer, a single copy is used.

Data transfer to and from the GPU is the Achilles heel of many applications that use the GPU for calculations.

Data travels from CPU RAM to GPU VRAM over the bus. For a PCIe 3.0 bus, this happens at a rate of approx. 1GB/s per lane. Commonly 16 lanes are available, for a total of 16GB/s. Although it is clear that at 60fps we cannot send all our voxels to the GPU each frame, we should be able to send a fairly large portion. To reach high throughput however, it is important to limit the number of copies: each individual copy has significant overhead.

Taking this into account, each frame the following data is sent to the GPU:

- The top-level grid: 8MB of data. Considering the bandwidth we have available, finding out which parts of this data have changed is not worth it.
- Changed bricks: we can keep track of which bricks were changed since the last frame. Only these bricks will be sent to the GPU. Currently, at most 8192 bricks will be synchronized; at 512 bytes per brick this is 4MB of data. A larger budget would probably also be safe.

The data ends up in a staging buffer on the GPU. From here, a small kernel copies the data to the final destination, which is of course also on the GPU. The bandwidth of these copies is insanely high: about 320GB/s, i.e. 20x more than the PCIe 3.0 bus. The commit kernel in practice executes in a flash.

Now that the data is stored in VRAM, we can trace some rays. For this, we use Amanatides and Woo’s 3D-DDA traversal algorithm to step through the top-level grid.

The layout of the top-level grid matters. On the CPU, we can simulate a 3D array in a chunk of contiguous memory: just like a 2D image is a collection of rows stored one after the other in memory, a 3D grid is a collection of 2D images, stored one after the other. We can then access a single element using a simple formula: p = x + y * width + z * width * height. This layout can however be less than optimal. When scanning the grid along the x-axis, subsequent elements are likely to be in the same cache line, speeding up access. When scanning along y or z, or simply diagonally, this is no longer the case. We can optimize for *all* directions by storing data together that is close in space. A basic application of this concept is *tiling*: by storing 8 ✕ 8 ✕ 8 voxels in a 512-byte brick, we ensure that this data fits in a handful of cache lines. Tracing a ray through the brick will benefit from the cache, regardless of the direction of the ray.
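The difference between the two layouts can be made concrete with some index arithmetic (a sketch, index math only):

```cpp
#include <cassert>

// A 1024^3 volume, split into 8^3 bricks on a 128^3 grid.
constexpr int W = 1024, B = 8, G = W / B;

// Linear layout: x-neighbors are adjacent, but y-neighbors are 1024 elements apart.
int FlatIndex( int x, int y, int z ) { return x + y * W + z * W * W; }

// Tiled layout: the 512 voxels of a brick are stored contiguously, so a ray
// traversing the brick touches only a handful of cache lines, whatever its direction.
int TiledIndex( int x, int y, int z )
{
    int brick = (x / B) + (y / B) * G + (z / B) * G * G;   // which brick
    int local = (x % B) + (y % B) * B + (z % B) * B * B;   // which voxel inside it
    return brick * B * B * B + local;
}
```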

On the GPU, the ideal layout is hardware-dependent. Luckily, OpenCL knows which layout is best. To benefit from this knowledge, we can simply store our data in a 3D image. The render kernel can then read from this image, which will be slightly faster than accessing a plain array in a naïve format.

We get the 8MB top-level grid data into a 3D image using a special OpenCL function, *clEnqueueCopyBufferToImage*. This function will take data that has been placed in the staging buffer, and reformat it while copying it to the 3D image buffer.

The full ray traversal implementation can be found in kernels.cl on GitHub.

Now that we can trace rays, producing an image is relatively straightforward. For a ray tracer, we trace primary rays through the pixels of a screen plane. The TraceRay function returns the voxel that was found, or 0 if the ray left the scene. It also returns a normal for the intersection point. For a set of voxels, we have only 6 unique normals. For a scene illuminated by a skydome, we thus have only 6 unique shading values, which we obtain by integrating over the 6 unique hemispheres. File common.h contains these precalculated values.

From here, the sky is the limit. A simple extension that replaces the basic unoccluded skydome illumination by a single diffuse bounce is already in place. These rays are distributed over the hemisphere using blue noise, and have a limited length to reduce their impact on performance. The ray tracer can be further expanded to support shadows for a light source, or reflections for some voxel colors, or depth of field, and so on. And of course, some post processing should be added: TAA, some HDR effects, filtering, … . But all of that is an entirely different playground.

In this article, a number of optimizations were discussed:

- Copying data from CPU to GPU in parallel with running kernels, effectively hiding the transfers entirely;
- Sending all data using a single copy from host to device;
- Using a circular buffer instead of a stack to simplify multithreading;
- Using a large step size for the circular buffer to prevent false sharing;
- Using powers of 2 for efficient modulo calculations in the circular buffer and during two-level grid traversal;
- Storing 3D data in a 3D bitmap for access with better data locality.

The resulting renderer traces up to 1GRay/s, producing over 60fps at a resolution of 1600×900 pixels on a 2070 laptop GPU.

Questions / suggestions? Mail me at bikker.j@gmail.com.

```
vec3 P1( 1, 1, 1 );
vec3 P2( 2, 2, 2 );
vec3 D = normalize( P2 - P1 );
```

This is an interesting example, because it tempts us to store actual vectors in quadfloats: e.g., __m128 p1_4 = _mm_set_ps( 1, 1, 1, 0 ). Doing so is obviously a waste: we only have three-component vectors. Things get worse when I tell you that Intel introduced 8-wide SIMD with the AVX instruction set in 2011:

```
__m256 A = _mm256_set1_ps( 2.0f );
__m256 B = _mm256_set_ps( 1, 2, 3, 4, 5, 6, 7, 8 );
__m256 C = _mm256_mul_ps( A, B );
```

As you can see, it works just like SSE, but this time we get eight floats per octfloat, the registers are now 256 bits rather than 128 bits, and the intrinsic names are even longer.

Applying AVX to the first code snippet really forces us to think differently about ‘vectorizing’ code. Let’s start with the scalar flow:

```
float P1x = 1, P1y = 1, P1z = 1;
float P2x = 2, P2y = 2, P2z = 2;
float Dx = P2x - P1x, Dy = P2y - P1y, Dz = P2z - P1z;
float len = sqrtf( Dx * Dx + Dy * Dy + Dz * Dz );
Dx /= len, Dy /= len, Dz /= len;
```

The functionality shown here is identical to the original functionality. This time however, every operation is operating on a scalar. The scalar flow exposes the work that is done under the hood, which in turn encourages a simple optimization, where we calculate the reciprocal of len, so we can replace the three divisions by one reciprocal and three multiplications.

The scalar flow is also the ideal starting point for writing SSE code. A proper vector flow consists of *multiple scalar flows*.

So, instead of normalizing one vector at a time, we should be normalizing *four*. If we can do that properly, the concept naturally extends to AVX, which we can use to normalize eight vectors at a time.
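Normalizing four at a time can be sketched as follows (my sketch, assuming the components are already available per lane; it uses exact _mm_sqrt_ps and _mm_div_ps rather than the faster, approximate _mm_rsqrt_ps):

```cpp
#include <cassert>
#include <cmath>
#include <xmmintrin.h>

// x4, y4 and z4 each hold one component of four different vectors:
// four scalar flows, executed in lockstep.
void Normalize4( __m128& x4, __m128& y4, __m128& z4 )
{
    __m128 len2 = _mm_add_ps( _mm_add_ps(
        _mm_mul_ps( x4, x4 ), _mm_mul_ps( y4, y4 ) ), _mm_mul_ps( z4, z4 ) );
    __m128 invLen = _mm_div_ps( _mm_set1_ps( 1.0f ), _mm_sqrt_ps( len2 ) ); // one reciprocal length per lane
    x4 = _mm_mul_ps( x4, invLen );
    y4 = _mm_mul_ps( y4, invLen );
    z4 = _mm_mul_ps( z4, invLen );
}
```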

Normalizing vectors is boring, so let’s bring in a more practical problem: ray/triangle intersection. This is an operation that is frequently used in rendering, but also in collision detection, line-of-sight queries and audio propagation.

The triangle is defined by its vertices v1, v2 and v3; the ray by its origin O and (normalized) direction D. The intersection is defined by a single scalar t, which is the distance of the intersection along the ray. The intersection point itself is then defined as I = O + t·D.

Intersecting a triangle efficiently is often done using the approach proposed by Möller and Trumbore:

```
bool Intersect( vec3 O, vec3 D, vec3 v1, vec3 v2, vec3 v3, out float u, out float v, out float t )
{
    vec3 edge1 = v2 - v1, edge2 = v3 - v1;
    vec3 h = cross( D, edge2 );
    float det = dot( edge1, h );
    if (det > -0.001 && det < 0.001) return false; // no hit
    float inv_det = 1.0 / det;
    vec3 s = O - v1;
    u = dot( s, h ) * inv_det;
    if (u < 0.0 || u > 1.0) return false; // no hit
    vec3 q = cross( s, edge1 );
    v = dot( D, q ) * inv_det;
    if (v < 0.0 || u + v > 1.0) return false; // no hit
    t = dot( edge2, q ) * inv_det;
    return (t > 0);
}
```

Given a ray (in O and D) and a triangle the code yields a yes/no answer for the hit, as well as a distance t and a position u,v in barycentric coordinates for the hit.

To benefit from SIMD for this code we need to identify a scalar flow. And, we need to run this scalar flow in parallel: four times for SSE, eight times for AVX, or an arbitrary number of times if we want to anticipate arbitrary SIMD-widths (e.g. 16 for AVX512). We are clearly looking for *data parallelism*, i.e. executing the same code for different inputs. This happens to be the execution model of GPUs as well, but that is another story.

In the case of ray/triangle intersections, we have two options:

- Intersect many rays with one triangle;
- Intersect one ray with many triangles.

Both are valid: it’s unlikely that we have just one triangle to intersect, and it is also unlikely that we have just one ray. But, since we need a decision, let’s assume we have four rays, and one triangle. I will assume SSE hardware, but the idea extends naturally to AVX.

Let’s start with the scalar flow of the intersection code.

```
float e1x = v2x - v1x, e1y = v2y - v1y, e1z = v2z - v1z;
float e2x = v3x - v1x, e2y = v3y - v1y, e2z = v3z - v1z;
float hx = Dy * e2z - Dz * e2y;
float hy = Dz * e2x - Dx * e2z;
float hz = Dx * e2y - Dy * e2x;
float det = e1x * hx + e1y * hy + e1z * hz;
if (det > -0.001 && det < 0.001) return false;
float inv_det = 1 / det;
float sx = Ox - v1x, sy = Oy - v1y, sz = Oz - v1z;
float u = (sx * hx + sy * hy + sz * hz) * inv_det;
if (u < 0 || u > 1) return false;
float qx = sy * e1z - sz * e1y;
float qy = sz * e1x - sx * e1z;
float qz = sx * e1y - sy * e1x;
float v = (Dx * qx + Dy * qy + Dz * qz) * inv_det;
if (v < 0 || u + v > 1) return false;
float t = (e2x * qx + e2y * qy + e2z * qz) * inv_det;
return (t > 0);
```

Here, operations on vec3 objects have been replaced by scalar operations. Dot products are now written out as the sum of three products. The cross products are written out as well.

We are now ready for the actual vectorization. The first lines calculate two edges of the triangle. Since each stream is intersecting a ray against the same triangle, we feed the same edges in each stream:

```
__m128 e1x4 = _mm_set1_ps( v2x - v1x );
__m128 e1y4 = _mm_set1_ps( v2y - v1y );
__m128 e1z4 = _mm_set1_ps( v2z - v1z );
__m128 e2x4 = _mm_set1_ps( v3x - v1x );
__m128 e2y4 = _mm_set1_ps( v3y - v1y );
__m128 e2z4 = _mm_set1_ps( v3z - v1z );
```

The next three lines pose a problem. The scalar code refers to Dx, Dy and Dz for the direction of a single ray. We can replace this by Dx4, Dy4 and Dz4 for four rays, but that requires an expensive ‘gather’ operation, e.g.:

```
__m128 Dx4 = _mm_set_ps( ray1.Dx, ray2.Dx, ... );
```

The _mm_set_ps instruction can become painfully slow, especially when the four rays are not in the same cache line: in that case, *one *instruction can cause *four *cache misses.

We can prevent this by reorganizing our ray data. Imagine the array of rays looks like this:

```
struct Ray { vec3 O, D; };
Ray ray[256]; // 256 * 2 * 12 bytes = 6KB
```

We call this an *array of structures*, or *AOS*. An alternative layout, which holds the same data and has the same size, is the SOA layout:

```
float Ox[256], Oy[256], Oz[256]; // 3x 1024 bytes
float Dx[256], Dy[256], Dz[256]; // 3x 1024 bytes, total 6KB
```

This layout happens to be identical to this one:

```
__m128 Ox4[64], Oy4[64], Oz4[64]; // 3x 1024 bytes
__m128 Dx4[64], Dy4[64], Dz4[64]; // 3x 1024 bytes, total 6KB
```

And *that *is data we can read efficiently in the intersection code. I have added the calculation of det, which is trivial now.

```
__m128 hx4 = _mm_sub_ps( _mm_mul_ps( Dy4, e2z4 ), _mm_mul_ps( Dz4, e2y4 ) );
__m128 hy4 = _mm_sub_ps( _mm_mul_ps( Dz4, e2x4 ), _mm_mul_ps( Dx4, e2z4 ) );
__m128 hz4 = _mm_sub_ps( _mm_mul_ps( Dx4, e2y4 ), _mm_mul_ps( Dy4, e2x4 ) );
__m128 det4 = _mm_add_ps( _mm_add_ps( _mm_mul_ps( e1x4, hx4 ), _mm_mul_ps( e1y4, hy4 ) ), _mm_mul_ps( e1z4, hz4 ) );
```

The next line to translate reads:

```
if (det > -0.001 && det < 0.001) return false;
```

This is a significant challenge. Scalar det became vector det4. So, we need to figure out how to compare two vectors. But more fundamentally, what if *one *stream wants to ‘return’, while the others wish to continue? The answer is: *the stream doesn’t return*. Instead, we incapacitate it.

To see how that can be done, we return to unsigned integers once more. Consider the following code snippet:

```
unsigned int a = CalculateSomething();
if (a < 1) return;
a += Calculation2();
```

Let’s assume that the profiler indicates that the code is performing poorly due to branch mispredictions. In that case, we *could* rewrite the code to get rid of the branch:

```
unsigned int a = CalculateSomething();
bool valid = (a >= 1);
a += Calculation2() * valid;
```

The interesting thing about a boolean is that it can be used as a number: 0 for ‘false’, and 1 for ‘true’. And with that, the addition on the third line works as always if valid is indeed true. But if valid is *false*, the addition doesn’t change a anymore, even though the code is executed. This *masking *of operations is what we will use in SSE / AVX code as well.

The SSE instruction set does in fact provide comparison instructions:

```
__m128 mask1 = _mm_cmpeq_ps( a4, b4 ); // equal
__m128 mask2 = _mm_cmpgt_ps( a4, b4 ); // greater than
__m128 mask3 = _mm_cmple_ps( a4, b4 ); // less or equal
```

The result of these operations is a 128-bit mask. The cmp instructions perform four comparisons, and store each result as a 32-bit mask: 11111111 11111111 11111111 11111111 for true, and 00000000 00000000 00000000 00000000 for false. A 128-bit mask can be used to ‘zero-out’ results, or to let results through unmodified. For that, we use the AND operator. Using unsigned ints and floats again:

```
union { float f; unsigned int i; };
f = 3.141593f;
i &= 0xFFFFFFFF; // f is unchanged
i &= 0x00000000; // f is now 0.0
```

As in the earlier examples, f and i occupy the same memory and use the same 32 bits, so changing i changes f and vice versa. After filling f with 3.141593, i contains gibberish. If we apply the bitwise AND operator to i using 32 bits all set to 1, the result is whatever was in i, and therefore f remains unmodified. If, however, we apply the bitwise AND operator using 32 bits all set to 0, the result is 0. And, interestingly, 32 bits set to 0 yield floating point number 0.0. The final line thus sets f to 0.

We can now apply this to the condition in the intersection code.

```
if (det > -0.001 && det < 0.001) return false;
```

We have two conditions, and thus two masks to start with:

```
__m128 mask1 = _mm_cmple_ps( det4, _mm_set1_ps( -0.001f ) );
__m128 mask2 = _mm_cmpge_ps( det4, _mm_set1_ps( 0.001f ) );
__m128 combined = _mm_or_ps( mask1, mask2 );
```

Note that the conditions are inverted: we want to *return* if det exceeds -0.001 and det is smaller than 0.001; we thus want to keep the stream alive if det is smaller than or equal to -0.001, or if det is greater than or equal to 0.001. The combined mask, created on the third line, will have 32 bits set to 1 for each stream that is still active, and to 0 for inactive streams.

Now that we know which streams are still useful, we can just continue. There are two additional if statements, which yield further masks. On the last two lines everything comes together: the distance that we calculate is tested against 0 (we can’t have intersections behind the ray origin), and this test is combined with all earlier masks. Only valid distances, computed by active streams, may return ‘yes’.

The final vectorized code then becomes:

```
__m128 EPS4 = _mm_set_ps1( EPSILON );
__m128 MINUSEPS4 = _mm_set_ps1( -EPSILON );
__m128 ONE4 = _mm_set_ps1( 1.0f );
__m128 e1x4 = _mm_set_ps1( v2.x - v1.x );
__m128 e1y4 = _mm_set_ps1( v2.y - v1.y );
__m128 e1z4 = _mm_set_ps1( v2.z - v1.z );
__m128 e2x4 = _mm_set_ps1( v3.x - v1.x );
__m128 e2y4 = _mm_set_ps1( v3.y - v1.y );
__m128 e2z4 = _mm_set_ps1( v3.z - v1.z );
__m128 hx4 = _mm_sub_ps( _mm_mul_ps( dy4, e2z4 ), _mm_mul_ps( dz4, e2y4 ) );
__m128 hy4 = _mm_sub_ps( _mm_mul_ps( dz4, e2x4 ), _mm_mul_ps( dx4, e2z4 ) );
__m128 hz4 = _mm_sub_ps( _mm_mul_ps( dx4, e2y4 ), _mm_mul_ps( dy4, e2x4 ) );
__m128 det4 = _mm_add_ps( _mm_add_ps( _mm_mul_ps( e1x4, hx4 ), _mm_mul_ps( e1y4, hy4 ) ), _mm_mul_ps( e1z4, hz4 ) );
__m128 mask1 = _mm_or_ps( _mm_cmple_ps( det4, MINUSEPS4 ), _mm_cmpge_ps( det4, EPS4 ) );
__m128 inv_det4 = _mm_rcp_ps( det4 );
__m128 sx4 = _mm_sub_ps( ox4, _mm_set_ps1( v1.x ) );
__m128 sy4 = _mm_sub_ps( oy4, _mm_set_ps1( v1.y ) );
__m128 sz4 = _mm_sub_ps( oz4, _mm_set_ps1( v1.z ) );
__m128 u4 = _mm_mul_ps( _mm_add_ps( _mm_add_ps( _mm_mul_ps( sx4, hx4 ), _mm_mul_ps( sy4, hy4 ) ), _mm_mul_ps( sz4, hz4 ) ), inv_det4 );
__m128 mask2 = _mm_and_ps( _mm_cmpge_ps( u4, _mm_setzero_ps() ), _mm_cmple_ps( u4, ONE4 ) );
__m128 qx4 = _mm_sub_ps( _mm_mul_ps( sy4, e1z4 ), _mm_mul_ps( sz4, e1y4 ) );
__m128 qy4 = _mm_sub_ps( _mm_mul_ps( sz4, e1x4 ), _mm_mul_ps( sx4, e1z4 ) );
__m128 qz4 = _mm_sub_ps( _mm_mul_ps( sx4, e1y4 ), _mm_mul_ps( sy4, e1x4 ) );
__m128 v4 = _mm_mul_ps( _mm_add_ps( _mm_add_ps( _mm_mul_ps( dx4, qx4 ), _mm_mul_ps( dy4, qy4 ) ), _mm_mul_ps( dz4, qz4 ) ), inv_det4 );
__m128 mask3 = _mm_and_ps( _mm_cmpge_ps( v4, _mm_setzero_ps() ), _mm_cmple_ps( _mm_add_ps( u4, v4 ), ONE4 ) );
__m128 newt4 = _mm_mul_ps( _mm_add_ps( _mm_add_ps( _mm_mul_ps( e2x4, qx4 ), _mm_mul_ps( e2y4, qy4 ) ), _mm_mul_ps( e2z4, qz4 ) ), inv_det4 );
__m128 mask4 = _mm_cmpgt_ps( newt4, _mm_setzero_ps() );
__m128 mask5 = _mm_cmplt_ps( newt4, t4 );
__m128 combined = _mm_and_ps( _mm_and_ps( _mm_and_ps( _mm_and_ps( mask1, mask2 ), mask3 ), mask4 ), mask5 );
t4 = _mm_blendv_ps( t4, newt4, combined );
```

Note that even something simple as returning ‘yes’ or ‘no’ suddenly is non-trivial in vectorized code. Instead, the above code updates the intersection distance if (and only if) a valid distance was found which is smaller than the previously found distance. And with that, everything is unconditional.

Conclusion

In this post I discussed several concepts:

- The importance of parallel scalar flows as a starting point for vectorization;
- Data layout: Array of Structures versus Structure of Arrays;
- The concept of masking, which allows us to continue a stream unconditionally.

All this was applied to a practical bit of code.

With that, we conclude the topic of SIMD. If you have any questions, feel free to drop me an email at bikker.j@gmail.com, or follow me on Twitter: @j_bikker.

Doing things simultaneously means: using multiple cores, or multiple processors, or a CPU and a GPU. However, before we start spawning threads, we can do things in parallel in a single thread, using a concept that is called *instruction level parallelism* (ILP).

ILP happens naturally in a superscalar CPU pipeline, where multiple instructions are fetched, decoded and executed in each cycle, at least under the right conditions. A second form of ILP is provided by certain complex CPU instructions that perform multiple operations. An example is *fused multiply and add* (FMA), which performs a calculation like a = b·c + d using a single assembler instruction. A particularly interesting class of complex instructions are the *vector operations*, which apply a single operation on multiple data. When applied correctly, single instruction multiple data (SIMD) can yield great speedups, of four or eight times, and sometimes more. This doesn’t hinder multi-core processing either: whatever gains we obtain in a single thread using SIMD code scales up even further when we finally do spawn those extra threads.

To understand the concept of SIMD it helps to start with 32-bit unsigned integers. An unsigned integer consists of 4 bytes:

```
union { unsigned int a4; unsigned char a[4]; };
```

In C/C++, a union is a mechanism that allows us to overlap variables in memory. In this case, unsigned int a4 and the array a[4] occupy the same bytes. This means that if we modify a4, we modify the array, and vice versa. This allows us to do some interesting things:

```
a[0] = 1;
a[1] = 3;
a[2] = 25;
a[3] = 100;
a4 *= 2;
```

The first four lines modify the individual bytes. The last line operates on all four bytes at the same time. The effect is interesting: multiplying a4 by 2 yields { 2, 6, 50, 200 } in the array. We effectively operated on four numbers using a single instruction.

This doesn’t seem particularly useful, but in fact it is. Let’s start with a little detour, and the following code snippet:

```
uint ScaleColor( uint c, float x ) // x = 0..1
{
    uint red = (c >> 16) & 255;
    uint green = (c >> 8) & 255;
    uint blue = c & 255;
    red = red * x, green = green * x, blue = blue * x;
    return (red << 16) + (green << 8) + blue;
}
```

This code takes a color, stored as ‘alpha-red-green-blue’ in the four bytes of a single unsigned integer, and scales the color by ‘x’, one color component at a time.

This code works, but it is very inefficient, because of the many implicit type conversions. To multiply red (an *integer* value) by x (a *float* value), red is promoted to a float value, then the multiplication is applied, and finally the result is converted back to an integer value. We can do better, using some basic *fixed point* math, where scale factor x is represented by an integer value in the range 0..255:

```
uint ScaleColor( uint c, uint x ) // x = 0..255
{
    uint red = (c >> 16) & 255, green = (c >> 8) & 255, blue = c & 255;
    red = (red * x) >> 8;
    green = (green * x) >> 8;
    blue = (blue * x) >> 8;
    return (red << 16) + (green << 8) + blue;
}
```

The ‘>> 8’ is a *bitshift*, which effectively divides a number by 256. This compensates for the scale factor, which is 256 times larger than it was before. The code now is devoid of float operations, and thus type conversions; as a consequence it runs (much) faster.

Interestingly, we can do even better. Behold the following code:

```cpp
uint ScaleColor( const uint c, const uint x ) // x = 0..255
{
	uint redblue = c & 0x00FF00FF;
	uint green = c & 0x0000FF00;
	redblue = ((redblue * x) >> 8) & 0x00FF00FF;
	green = ((green * x) >> 8) & 0x0000FF00;
	return redblue + green;
}
```

This time, the red and blue color components sit together in a single 32-bit value. Green is separated, in its own 32-bit value. This leaves an 8-bit gap between red and blue, where green used to be.

The gap allows us to multiply red and blue by the scale, using a single multiplication. If we multiply a number in the range 0..255 by another number in the same range, the maximum value we can get is 255 × 255 = 65025, which fits in 16 bits.

The code exploits this. The red and blue bytes are multiplied by the scale, then divided by 256 using a bitshift, which moves each bit 8 positions to the right. Finally, we need to get rid of some bits, for which the & operation is used.

Compared to the original integer code, which used 7 bitshifts, 3 ands, 3 multiplications and 2 additions, the new code only uses 2 bitshifts, 4 ands, 2 multiplications and an addition. And we’re not even done. The final version is slightly more efficient:

```cpp
uint ScaleColor( const uint c, const uint x ) // x = 0..255
{
	uint redblue = c & 0x00FF00FF;
	uint green = c & 0x0000FF00;
	redblue = (redblue * x) & 0xFF00FF00;
	green = (green * x) & 0x00FF0000;
	return (redblue + green) >> 8;
}
```

This time we have 1 shift, 4 ands, 2 multiplications and 1 addition, for a total of 8 operations.

The point of this exercise is this: once we realize that an unsigned int is a collection of four bytes, integer operations become operations on *vectors* of four bytes. The four bytes are operated on in parallel.

This sadly has limited use. Consider the following situation:

```cpp
union { unsigned int a4; unsigned char a[4]; };
a[0] = 200;
a[1] = a[2] = a[3] = 1;
a4 *= 2;
```

After the multiplication, array a contains { 144, 3, 2, 2 } instead of the expected { 400, 2, 2, 2 }. Since 400 doesn’t fit in a byte, the result overflows into a[1], which thus becomes 3, while a[0] keeps the remaining 400 − 256 = 144.

If we would want to work with larger numbers, or negative numbers, or if we wish to divide four values, this approach doesn’t work anymore. In an ideal world, we would like a bit more freedom:

- An ideal approach supports unsigned and signed integers as well as floats;
- It would be nice if we could also use more complex arithmetic such as division, multiplication, and square roots;
- Preferably, there should be no interference between the parallel streams, such as overflow spilling into a neighboring value;
- It would be nice if things were a bit easier to use.

With vector instructions, we get all that. Well, except for one thing, of course.

In 1997, Intel introduced the Pentium MMX processor, and with it the MMX instruction set, the first use of SIMD in commodity hardware. Shortly after that, in 1998, Motorola introduced AltiVec, and in 1999, Intel released the Pentium III, with the SSE instruction set.

The SSE instruction set consists of 70 assembler instructions that operate on 128-bit *vector registers*. Rather than storing a single value, each of these stores 4 floats, or 4 integers, or 8 shorts, or 16 bytes.

In Visual Studio, the vector functionality is directly accessible in C/C++, in the form of the (unpronounceable) __m128 data type, which we will call, from now on, *quadfloat*. If the vector contains integers, the data type is __m128i, or *quadint*. Since a quadfloat is literally just a small set of floats, we can use a union to access the individual values, like we did before:

```cpp
union { __m128 a4; float a[4]; };
```

We operate on quadfloats and quadints using vector operations. For this, so-called *intrinsics* are provided: instructions that translate to a single assembler instruction. A few examples:

```cpp
__m128 a4 = _mm_set_ps( 1, 0, 3.141592f, 9.5f );
__m128 b4 = _mm_setzero_ps();
__m128 c4 = _mm_add_ps( a4, b4 ); // not: __m128 c4 = a4 + b4;
__m128 d4 = _mm_sub_ps( b4, a4 );
```

The first line stores four floats in a vector register. The second line sets the four values in the vector to zero. Adding two vectors is done using _mm_add_ps, and subtracting using _mm_sub_ps. Note the naming convention: you are free to do this differently of course, but it helps to identify variables that actually contain 4 values rather than 1. Also note that you can’t simply add vectors using e.g. a4+b4, unless you wrap the quadfloats and quadints in a class that provides some overloaded operators. It’s all very low-level, very assembler-like, and that is because SSE intrinsics get you *very* close to the metal.

As promised, SSE provides a pretty versatile set of operations:

```cpp
__m128 c4 = _mm_div_ps( a4, b4 );   // component-wise division
__m128 d4 = _mm_sqrt_ps( a4 );      // four square roots
__m128 e4 = _mm_rcp_ps( a4 );       // four reciprocals
__m128 f4 = _mm_rsqrt_ps( a4 );     // four reciprocal square roots (!)
```

Divisions, square roots, reciprocals. And the best part: you get the result of four divisions in roughly the same time it takes to do one regular division. Four square roots: you often get these *faster* than a single square root. The weirdest instruction is _mm_rsqrt_ps, which gets you *four reciprocal square roots* in just a few cycles, much faster than even a single square root! Why is that? Well, 1/√x is very frequently used, namely for normalizing vectors. For that reason, x86-compatible CPUs provide a fast approximate reciprocal square root instruction, which uses a hardware lookup table, baked into your CPU, that coughs up the answers at incredible speed. If ‘approximate’ is not good enough for you, you can easily refine the answer afterwards.

Beyond square roots and reciprocals, there is more good stuff:

```cpp
a4 = _mm_max_ps( a4, b4 );
c4 = _mm_min_ps( a4, b4 );
```

Those dreaded min/max lines that typically yield conditional code are handled by actual instructions in SSE. And that means that if your code was slowed down by branch mispredictions, SSE can actually lead to speedups beyond 4x.

Particles

Let’s do some practical SIMD work. You can download the project by clicking here; it should compile out-of-the-box in Visual Studio 2019. When you run it you see a slowly rotating backdrop, consisting of eight black holes:

Press space to start a stream of particles.

The application provides some basic timings for code blocks. Drawing the backdrop took 100.7 milliseconds in the above screenshot. In that time, for each pixel the gravity of the black holes is accumulated, and, if this value exceeds 1, a black pixel is plotted (an accurate simulation of the event horizon), otherwise a shade of blue. The code that produces the backdrop can be found in game.cpp on line 30:

```cpp
void Game::BuildBackdrop()
{
	Pixel* dst = screen->GetBuffer();
	float fy = 0;
	for (unsigned int y = 0; y < SCRHEIGHT; y++)
	{
		float fx = 0;
		for (unsigned int x = 0; x < SCRWIDTH; x++)
		{
			float g = 0;
			for (unsigned int i = 0; i < HOLES; i++)
			{
				float dx = m_Hole[i]->x - fx;
				float dy = m_Hole[i]->y - fy;
				float squareddist = (dx * dx + dy * dy);
				g += (250.0f * m_Hole[i]->g) / squareddist;
			}
			if (g > 1) g = 0;
			*dst++ = (int)(g * 255.0f);
			fx++;
		}
		fy++;
	}
}
```

Note that this code has not been intentionally crippled for this example: it is pretty optimal as-is.

To apply SIMD to the code, we will be working with four numbers at a time. Looking at the loops, we have two options here: we can either work on four pixels at a time, or we can process the black holes in groups of four. Either option is fine, but we need to make a choice, so I will work with four black holes at a time. That means that the inner loop changes:

```cpp
for (unsigned int i = 0; i < HOLES; i += 4)
{
	// ...
}
```

Inside the loop, every *scalar* operation (operating on one input to produce one result) will be converted to a vector operation. Let’s start with the first line:

```cpp
float dx = m_Hole[i]->x - fx;
```

Before we translate this line to SSE code, let’s create a backup. This has two purposes: we can revert to the (correct) code later if needed, and, since the SSE code is going to be rather hard to read, it helps others to understand what we are doing.

```cpp
// float dx = m_Hole[i]->x - fx;
__m128 dx4 = ... ;
```

The original line took the x-coordinate of a single black hole, and subtracted the x-coordinate of a pixel. The new line still operates on a single pixel, but four black holes. So, m_Hole[i]->x becomes:

```cpp
_mm_set_ps( m_Hole[i]->x, m_Hole[i + 1]->x, m_Hole[i + 2]->x, m_Hole[i + 3]->x )
```

Whatever we wish to subtract from that vector should be a vector as well. That means that the single float value fx must be promoted to a vector with four identical values:

```cpp
_mm_set_ps( fx, fx, fx, fx )
```

Or, briefer:

```cpp
_mm_set1_ps( fx )
```

The whole calculation thus becomes:

```cpp
// float dx = m_Hole[i]->x - fx;
__m128 dx4 = _mm_sub_ps( _mm_set_ps( m_Hole[i]->x, m_Hole[i + 1]->x,
	m_Hole[i + 2]->x, m_Hole[i + 3]->x ), _mm_set1_ps( fx ) );
```

The second line is converted in the same manner, but this time for y-coordinates. The third line is now not hard to convert:

```cpp
// float squareddist = (dx * dx + dy * dy);
__m128 sd4 = _mm_add_ps( _mm_mul_ps( dx4, dx4 ), _mm_mul_ps( dy4, dy4 ) );
```

The final line however presents some challenges. The calculation is simple enough: expand 250.0f to a vector of four identical values using _mm_set1_ps, multiply by four gravity values which we put in a vector using _mm_set_ps, and divide the result by sd4. But what about g? In the original code, the gravity of eight black holes was added to g one by one, but now we have *four* values. We can for now solve the problem by using a union.

```cpp
// g += (250.0f * m_Hole[i]->g) / squareddist;
union { __m128 temp4; float temp[4]; };
temp4 = _mm_div_ps( _mm_mul_ps( _mm_set1_ps( 250.0f ), _mm_set_ps( m_Hole[i]->g,
	m_Hole[i + 1]->g, m_Hole[i + 2]->g, m_Hole[i + 3]->g ) ), sd4 );
g += temp[0] + temp[1] + temp[2] + temp[3];
```

Note that the actual addition to g is done using scalar code; since the vectors are really only just four floats, this is a valid approach.

With all the lines of the inner loop converted to SSE, the code runs faster, but sadly, not a lot. The reason is, in a way, once again *type conversions*: in order to work on vector data, we gather floats together over and over again. We should apply some loop hoisting, and get all those conversions out of the inner loop. The black holes do not change position while all the pixels are being plotted; we can thus simply prepare two quadfloats with the x-coordinates of eight black holes, and likewise two quadfloats for the y-coordinates and gravity values.

```cpp
__m128 hx4[2] = {
	_mm_set_ps( m_Hole[0 * 4 + 0]->x, m_Hole[0 * 4 + 1]->x,
		m_Hole[0 * 4 + 2]->x, m_Hole[0 * 4 + 3]->x ),
	_mm_set_ps( m_Hole[1 * 4 + 0]->x, m_Hole[1 * 4 + 1]->x,
		m_Hole[1 * 4 + 2]->x, m_Hole[1 * 4 + 3]->x )
};
__m128 hy4[2] = {
	_mm_set_ps( m_Hole[0 * 4 + 0]->y, m_Hole[0 * 4 + 1]->y,
		m_Hole[0 * 4 + 2]->y, m_Hole[0 * 4 + 3]->y ),
	_mm_set_ps( m_Hole[1 * 4 + 0]->y, m_Hole[1 * 4 + 1]->y,
		m_Hole[1 * 4 + 2]->y, m_Hole[1 * 4 + 3]->y )
};
__m128 hg4[2] = {
	_mm_set_ps( m_Hole[0 * 4 + 0]->g, m_Hole[0 * 4 + 1]->g,
		m_Hole[0 * 4 + 2]->g, m_Hole[0 * 4 + 3]->g ),
	_mm_set_ps( m_Hole[1 * 4 + 0]->g, m_Hole[1 * 4 + 1]->g,
		m_Hole[1 * 4 + 2]->g, m_Hole[1 * 4 + 3]->g )
};
```

Calculating dx4 now reduces to:

```cpp
// float dx = m_Hole[i]->x - fx;
__m128 dx4 = _mm_sub_ps( hx4[i / 4], _mm_set1_ps( fx ) );
```

This time, the code is almost twice as fast as the original. Still not the promised 4x, but that has to do with how we accumulate (and process) the gravity.

To apply SSE to your own code, you may find the following steps useful.

**1. Locate a significant bottleneck in your code.**

Writing SSE code is *hard*. Make sure the effort is worth it. Use a profiler to pinpoint particularly expensive code.

**2. Keep a copy of the original code.**

Whether you do this in comments, or in an #ifdef block, keeping the original code helps. You will thank yourself later, for example when you try to port to a platform that doesn’t support SSE, such as ARM (e.g. for Android).

**3. Prepare the scalar code.**

Add a loop that executes your code four times, so that you know beforehand that you have four data streams that can be processed in parallel.

**4. Reorganize your data.**

Make sure you don’t convert your data over and over again; have it in the correct (vector) format before you enter the code you wish to convert. Lots of gathers will reduce your gains to zero, or worse.

**5. Make unions with floats.**

Quadfloats can be unioned with floats. This lets you convert one line at a time, checking correct behavior at every step.

**6. Convert one line at a time.**

Verify functionality as you go.

**7. Check MSDN for exotic SSE instructions.**

There are some pretty weird instructions that may solve your problem in a minimal number of cycles. Know your options.

This is part 1 of 2 of the introduction to SIMD. Part 2 is now also available and discusses how to layout your data in a fundamentally SIMD-friendly way, and how to handle conditional code.

Questions? Mail me: bikker.j@gmail.com, or follow me on Twitter: @j_bikker.

```cpp
float4 FetchTexelTrilinear( float lambda, float2 uv, int offset, int width, int height )
{
	int level0 = min( MIPLEVELCOUNT - 1, (int)lambda );
	int level1 = min( MIPLEVELCOUNT - 1, level0 + 1 );
	float f = lambda - floor( lambda );
	// select first MIP level
	int o0 = offset, w0 = width, h0 = height;
	for (int i = 0; i < level0; i++) o0 += w0 * h0, w0 >>= 1, h0 >>= 1;
	// select second MIP level
	int o1 = offset, w1 = width, h1 = height;
	for (int i = 0; i < level1; i++) o1 += w1 * h1, w1 >>= 1, h1 >>= 1;
	// read actual data
	float4 p0 = FetchTexel( uv, o0, w0, h0 );
	float4 p1 = FetchTexel( uv, o1, w1, h1 );
	// final interpolation
	return (1 - f) * p0 + f * p1;
}
```

For context, a bit of sampling theory:

When using textures in a renderer we may encounter two issues related to sampling a ‘discrete signal’, which the texture is: it is a raster image that approximates an analog image, which in turn can be imagined as an image with infinite resolution. The two issues are:

- Oversampling: when using the texture, we read a particular pixel of the texture many times, because the texture is too small or we are viewing it up close.
- Undersampling: this time, when reading the texture we skip pixels, because the texture is too large. The skipping tends to happen in patterns, which are called Moiré patterns. These can be very distracting, especially when animated.

Oversampling can be countered by increasing texture resolution. This is however not always an option. To get a better image using the data we have, we can use interpolation. For a texture, this interpolation is bilinear.

Bilinear interpolation blends four pixels. Compared to nearest pixel sampling, which uses only a single pixel, this is a significant increase in memory transactions.

For undersampling we have another technique: *MIP mapping*. For this, we store the original texture, but also scaled-down versions of it. When we detect undersampling, we simply switch to a smaller version of the texture.

The visible transition between the MIP maps can finally be countered with *trilinear* interpolation. Where bilinear interpolation interpolates in two dimensions by blending four pixels, trilinear interpolates in three dimensions. The extra dimension is a blend between two MIP maps. The number of memory operations is now eight: four pixels in each of the two MIP maps that we interpolate between.

Let’s return to the code snippet. The input parameters are:

- lambda: floating point version of the desired MIP level. The fractional part of lambda is the interpolation parameter used to blend between the bilinear samples from the two nearest MIP maps.
- uv is the texture coordinate, where (1,1) is the bottom-right corner of the texture. Larger values result in tiling. The coordinate must still be scaled by the sizes of the sampled MIP maps.
- offset is the position of the texture data in a larger buffer that stores all textures.
- width and height specify the dimensions of the first MIP map (level 0).

The texture data for the MIP maps is stored sequentially, so for a 16 x 16 texture, the first MIP map (which is 8 x 8) starts at offset 256, and the second one (4 x 4) at 256 + 64, and so on. In the code, the offsets of the two used MIP maps are calculated using two small for loops:

```cpp
int o0 = offset, w0 = width, h0 = height;
for (int i = 0; i < level0; i++) o0 += w0 * h0, w0 >>= 1, h0 >>= 1;
int o1 = offset, w1 = width, h1 = height;
for (int i = 0; i < level1; i++) o1 += w1 * h1, w1 >>= 1, h1 >>= 1;
```

This is where things get interesting.

Yesterday I received an e-mail from Marvin Reza. He pointed out that the for loops are in fact finite geometric sums:

offset = width \cdot height \cdot (1 + \frac{1}{4} + \frac{1}{16} + \frac{1}{64} + \dots)

This finite sum can be calculated in closed form as

\sum_{k=0}^{n-1}r^k=\frac{1-r^n}{1-r}

where r is (in our case) 0.25, and n is the desired MIP level. In CUDA (or OpenCL, or C) we would write:

```cpp
o = width * height * (1.0f - powf( 0.25f, level0 )) / 0.75f;
```

I had two issues with this suggestion. The first is that this function performs so many memory operations that surely a minor optimization of a for loop is not going to speed it up in any measurable way. The second is that the powf function is generally considered so expensive that it surely cannot beat a simple for loop.

However… it *does *in fact make a measurable difference. And we’re not even done yet.

Marvin points out that the powf( 0.25, level ) can be replaced by exp2f( -2 * level ), which is slightly faster still. But we can take it one step further.

In IEEE floating point representation, 2^n is special. Have a look at this table:

```
power  value      as uint      binary (32 bits)
2^-1   0.500000   1056964608   00111111000000000000000000000000
2^-2   0.250000   1048576000   00111110100000000000000000000000
2^-3   0.125000   1040187392   00111110000000000000000000000000
2^-4   0.062500   1031798784   00111101100000000000000000000000
2^-5   0.031250   1023410176   00111101000000000000000000000000
2^-6   0.015625   1015021568   00111100100000000000000000000000
```

The third column contains the ‘unsigned integer’ representation of the 32 bits used to store the floating point number in the second column. The fourth column shows the binary representation of this number. One thing is immediately clear: *all the action happens in the top bits.* In an IEEE 32-bit floating point number, these top bits hold the *exponent* of the number. The remaining bits store the *mantissa*.

The exponents for the six powers of two are obtained when we shift the integer number 23 bits to the right. They are: 126, 125, 124, 123, 122 and 121.

In the above table we went from float to integer representation. We can however also go in the opposite direction. If we want to store 2^-n, we can store (127 - n) << 23 in an unsigned integer, and reinterpret this as a floating point number. And this is *much* faster than a powf or an exp2f.

In CUDA:

```cpp
value = __uint_as_float( (127 - n) << 23 );
```

The whole FetchTexelTrilinear now becomes:

```cpp
float4 FetchTexelTrilinear( float lambda, float2 uv, int offset, int width, int height )
{
	int level0 = 0, level1 = 0;
	float f = 0;
	if (lambda >= 0)
		level0 = min( MIPLEVELCOUNT - 1, (int)lambda ),
		level1 = min( MIPLEVELCOUNT - 1, level0 + 1 ),
		f = lambda - floor( lambda );
	// as proposed by Marvin Reza, slightly faster
	float scale = (float)(width * height) * 1.3333333333f;
	int o0 = offset + (int)(scale * (1 - __uint_as_float( (127 - 2 * level0) << 23 )));
	int o1 = offset + (int)(scale * (1 - __uint_as_float( (127 - 2 * level1) << 23 )));
	// read actual data
	float4 p0 = FetchTexel( uv, o0, width >> level0, height >> level0 );
	float4 p1 = FetchTexel( uv, o1, width >> level1, height >> level1 );
	// final interpolation
	return (1 - f) * p0 + f * p1;
}
```

Note that this exercise exposed a bug where lambda could be negative. This is countered in the optimized version of the code.

Many thanks to Marvin for the initial suggestion and for triggering this experiment!

Questions? Mail to bikker.j@gmail.com, or follow me on Twitter: @j_bikker.

To optimize any application, we must answer two questions:

- What is the goal of the optimization?
- Where are the bottlenecks in the existing code?

If you read my posts on *The Art of Software Optimization* you will recognize these as part of the structured approach to optimization.

The ‘goal’ of optimization can be quite broad, and includes things like the target platform (in this case: high-end consumer-level CPU and GPU) and the time available for optimization (in this case: spare time project, so no deadlines). And of course: objective performance, which generally should be ‘better’. But in the case of a path tracer, this is a bit harder to define. It is not hard to make a path tracer that runs at 60fps. However, on today’s hardware, it is impossible to write a path tracer that renders a noise-free image of a complex scene at 60fps (although with clever filtering Quake 2 RTX may actually qualify). So, let’s carefully define our aim:

The aim is to improve the code in such a way that a higher quality unfiltered image is achieved at the same frame rate.

Why unfiltered? For several reasons: first of all, a better unfiltered image means better input for a filter, and thus a better filtered image. And besides that, once the noise level is low enough, unfiltered path tracing has several benefits. Depth of field, motion blur and translucent materials are notoriously hard to filter, and in general, filtering tends to blur features in the image. Skipping the tricks yields a renderer that *just works*.

With this goal in mind, we need to define ‘quality’. With a path tracer, this is straightforward. ‘Ground truth’ is obtained by rendering a view with many samples per pixel. For the scenes in the benchmark the image has fully converged after a few thousand samples. The quality of a real-time image is now determined using the difference between this image and the ground truth image. To express this using one number we can use the root mean square error (RMSE):

rmse=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(f_i-g_i)^2}\tag{1}

where f_i is pixel i, g_i is ground truth pixel i and N is the number of pixels. We thus compute the average squared difference between a pixel and ground truth, and take the square root of that. For *rgb* pixels we need an additional detail. I will simply treat *r*, *g* and *b* as individual pixel values, so *N* becomes three times larger. Alternatively, we could use different weights for *r*, *g* and *b* to acknowledge the sensitivity of the human eye to the color components. However, this is not needed here: we simply want a similarity metric between two images.

The adjusted optimization goal:

The aim is to modify the code in such a way that the lowest RMSE is obtained at 30 frames per second at a resolution of 1600×900 pixels.

The 30fps requirement is somewhat arbitrary. It is the lowest number that people nowadays accept as ‘real-time’, it is the frame rate of most movies, and several console games deliberately aim for this number. Likewise, 1600×900 is arbitrary: Few modern screens will be smaller than 1920×1080, and on such screens, 1600×900 is a large window that doesn’t look like we’re cutting corners (even though we totally are).

The benchmark project provides some basic numbers to guide the optimization process. To see these, you will have to build the project from the source code, which is available on github. On line 24 of main.cpp of the benchmarkapp project you can disable the automatic flythrough to enable interactive mode.

At 10spp, the frame time is 42.7ms. Frame time is used by a number of tasks, several of which are timed individually. Sending out rays from the camera and intersecting them with the geometry takes 6.17ms. This is time spent in Optix, and via Optix, in the RTX hardware. It is quite safe to assume that there is not much to be gained here. Secondary rays (diffuse and specular bounces) take up 6.52ms, and shadow rays take another 6ms. The total time spent on actual ray tracing is 19.78ms. The time *not* spent on ray tracing is dominated by ‘shading time’. Shading means: processing ray tracing results by evaluating the material model, which results in the generation of shadow rays and bounced rays. This is implemented in CUDA, and thus fully under our control. Any speedup that we achieve in this code will have a significant impact on the overall performance of the path tracer. This is the functionality we will have to focus on.

The shading code for the Optix7 render core (which is the one used in the benchmark) can be found in pathtracer.h. Lighthouse 2 is a wavefront path tracer. Very briefly, this means that *all *primary rays arrive in the shading kernel, which then produces secondary rays and shadow rays, which are processed in subsequent waves. Read more about wavefront path tracing in this dedicated blog post.

The differences between the March and February versions of the renderer are relatively small. They do however yield a significant speed boost. The image below shows the same camera view in the same scene, using the new code:

Due to changes to the camera and material model it is sadly not possible to recreate the exact same view, but the data is clear nevertheless: ray tracing takes about the same amount of time (as expected), but shading went down from 21.58ms to 13.52ms.

Let’s briefly discuss the changes:

**A float4 skydome** – The skydome is stored as a massive HDR texture. Every read from this texture is very likely a cache miss, especially for all the secondary rays that hit it at pretty much random locations.

The normal way to store such a bitmap is using three floats. However, at the hardware level, there is no such thing as a float3 memory transaction: we either get 1, 2 or 4 32-bit values. A float3 thus requires *two* operations. To make matters worse, the two reads may require access to two cache lines.

A simple fix is to store the skydome pixels as float4 values, wasting four bytes per skydome pixel. This sounds pretty bad, but with a 6GB video card (remember, high-end) this is hardly an issue.

**No textures** – In the scene used in the benchmark every polygon has a texture. However, many polygons will use a single texel from this bitmap.

This again provides an opportunity to reduce the number of memory transactions. The texture class used in Lighthouse 2 already supports a material color, which is used in the absence of a texture. While loading the scene, if a ‘one texel’ polygon is detected, the material is duplicated and modified to not use a texture.

This is obviously a highly scene-specific optimization, and under normal circumstances the visual artist should optimize the materials. It does however show that looking for a reduction in memory transactions in general pays off.

**Flow divergence** – The shading model used in Lighthouse 2 is based on the Disney BRDF implementation of the Appleseed renderer. The original implementation distinguishes four components in materials:

- Diffuse
- Sheen
- Specular
- Clearcoat

Additional properties, such as subsurface scattering, metallicness and anisotropy are handled as part of these four.

Not all materials use all four components. In fact, most materials use one or two. When different threads in the same warp require the evaluation of different components, this evaluation will be serialized.

It turns out that the evaluation of the four components has some functional overlap. By executing the shared code unconditionally, the time spent in divergent code is reduced, which speeds up shading calculations.

Note that this is only a temporary solution. For the Disney BRDF, some choices have been made that are purely driven by artistic wishes. Looking at the great impact of shading on overall render time, it seems logical to look for a more efficient compromise: perhaps a single unconditional code path can yield 90% of the flexibility at 50% of the cost.

**Optimizing blue noise** – Blue noise is used in Lighthouse 2 to replace random numbers. Very briefly: with blue noise we optimize the distribution of stochastic error over the image:

The picture is shamelessly copied from the one-page paper by Solid Angle’s Iliyan Georgiev and Marcos Fajardo. The implementation in Lighthouse 2 is based on work by Eric Heitz, which draws a single random number from a 2D blue noise tile using the following code:

```cpp
float noise( uint* blueNoise, int x, int y, int idx, int dim )
{
	x &= 127, y &= 127, idx &= 255, dim &= 255;
	// xor index based on optimized ranking
	int rankedIdx = (idx ^ blueNoise[dim + (x + y * 128) * 8 + 65536 * 3]) & 255;
	// fetch value in sequence
	int value = blueNoise[dim + rankedIdx * 256];
	// if the dimension is optimized, xor sequence value based on optimized scrambling
	value ^= blueNoise[(dim & 7) + (x + y * 128) * 8 + 65536];
	// convert to float and return
	float r = (0.5f + value) * (1.0f / 256.0f);
	if (r >= 1) r -= 1; // never happens?
	return r;
}
```

In the shading code we typically need four of these samples. For example, when generating primary rays, two random numbers are used to generate a position on the aperture, and two for a random position on the pixel. And in the shading code, two random numbers are needed to select a random point on a random light, and two to steer the random bounce. The number of random parameters is called the *dimensionality*: a camera ray thus is four dimensional. The dimension we wish to sample with the above code is the last argument of the function, ‘dim’.

Without paying too much attention to the inner workings of the above function, we can see that increasing ‘dim’ values yield subsequent samples from the blueNoise array. If dim is a multiple of 4 (note: it is), this happens twice: once in the calculation of *rankedIdx*, and once when xor’ing variable *value*. That means that we have two opportunities to replace four separate memory transactions by a single 128-bit memory transaction. Since this happens twice for every path, this should make a difference.

Behold the 4-way blue noise sampler:

```cpp
float4 noise4( uint* blueNoise, int x, int y, int idx, int dim )
{
	uint4 bn4 = *(uint4*)(blueNoise + dim + (x + y * 128) * 8 + 65536 * 3);
	int rsi1 = (idx ^ bn4.x) & 255, rsi2 = (idx ^ bn4.y) & 255;
	int rsi3 = (idx ^ bn4.z) & 255, rsi4 = (idx ^ bn4.w) & 255;
	int v1 = blueNoise[dim + 0 + rsi1 * 256];
	int v2 = blueNoise[dim + 1 + rsi2 * 256];
	int v3 = blueNoise[dim + 2 + rsi3 * 256];
	int v4 = blueNoise[dim + 3 + rsi4 * 256];
	uint4 bx4 = *(uint4*)(blueNoise + (dim & 7) + (x + y * 128) * 8 + 65536);
	return make_float4(
		(0.5f + (v1 ^ bx4.x)) * (1.0f / 256.0f),
		(0.5f + (v2 ^ bx4.y)) * (1.0f / 256.0f),
		(0.5f + (v3 ^ bx4.z)) * (1.0f / 256.0f),
		(0.5f + (v4 ^ bx4.w)) * (1.0f / 256.0f) );
}
```

Although values v1, v2, v3 and v4 still require a memory transaction each, the total number of reads has been brought down substantially.

**Tweaking __launch_bounds__** – I am probably pointing out the obvious here, but carefully tuning the workgroup size of every kernel (and periodically re-tuning it, ideally per hardware generation) is *very* important for optimal performance. I have carefully tuned this before, but it turned out that large kernels in Lighthouse 2 could use some tuning nevertheless. I am somewhat ashamed to report that this probably had the biggest impact on the final performance level.

A performance improvement alone – albeit substantial – has limited impact on the image quality after 33 milliseconds. The speedup lets us render 10spp instead of 8spp, but the 2 extra samples barely make a difference. It turns out that some small changes have a far greater impact.

**Using a low-resolution skydome texture** – The skydome texture is an HDR bitmap, with some very bright spots. A simple way to reduce the variance of glossy surfaces is to blur the skydome. The March version of the benchmark stores a 1/64th-resolution version of the sky, in which each 8×8 block of pixels is averaged into a single pixel, to obtain the same effect.

**Path space regularization** – A path, starting at the camera, that hits a near-specular glossy surface *after* visiting a much more diffuse surface typically has high variance. Path space regularization helps: by clamping the roughness of materials to be no less than earlier encountered roughness, overall variance is reduced.

**Aggressive clamping** – Caustics are notoriously hard to render using a path tracer. The earliest signs of an emerging caustic are formed by some bright specks. These *fireflies* are the result of paths that found a light source via an ‘improbable path’, e.g. via glass or a mirror. If we accept some bias in the final image we can reduce the fireflies by clamping path transport values. To do this correctly:

- Detect the bright paths by comparing their magnitude √(r² + g² + b²) against a threshold m.
- Normalize the color.
- Scale the color by m.

Note that normalizing and scaling is better than simple clamping. Also note that the *rgb* color is repeatedly treated as a vector here, which does not really make sense, but serves as a cheap approximation of a proper luminance calculation.
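The three steps could be sketched as follows (a minimal, hypothetical helper, not the actual Lighthouse 2 code):

```cpp
#include <cmath>

struct Color { float r, g, b; };

// Firefly suppression sketch: if the rgb 'vector' is longer than m,
// normalize it and scale it back to length m. This preserves hue,
// unlike clamping each channel separately.
Color clampTransport( Color c, float m )
{
    float mag = sqrtf( c.r * c.r + c.g * c.g + c.b * c.b );
    if (mag > m)
    {
        float scale = m / mag; // normalize, then scale by m, in one step
        c.r *= scale, c.g *= scale, c.b *= scale;
    }
    return c;
}
```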

The final feature that was *supposed* to reduce variance is adaptive sampling. Sadly, the variance estimation that drives the adaptive sampling requires at least 8 samples, plus an additional 8 samples on average, adaptively distributed. Within the 33ms budget this is currently not possible. To be continued.

Questions? Mail me: bikker.j@gmail.com, or follow me on Twitter: @j_bikker.

In the first post on the Art of Software Optimization I presented the ten steps of a structured approach to optimization:

- Determine optimization requirements.
- Profile: determine hotspots.
- Analyze hotspots: determine scalability.
- Apply high level optimizations to hotspots.
- Profile again.
- Parallelize / vectorize / use GPGPU.
- Profile again.
- Apply low level optimizations.
- Repeat steps 7 and 8 until time runs out.
- Report.

A vital part of this process is profiling. I used profiling to find out that the sorting code of the demo application scaled poorly, which was countered with a high level optimization: BubbleSort was replaced by QuickSort, which has a much better algorithmic complexity. After that, more profiling revealed that two lines in a function that draws scaled sprites are now responsible for the bulk of the runtime of the application.

The starting point for this post is the demo project used in part 1, but this time with 65536 sprites, and QuickSort instead of BubbleSort. Download the Visual Studio 2019 project by clicking here.

After profiling the code with VerySleepy CS for about 27 seconds, the following results are shown:

The two lines that take 11.61 and 10.10 seconds of the total runtime can be found in surface.cpp, line 429 and 430:

```
Pixel color = src[u + v * m_Pitch];
if (color & 0xffffff)
a_Target->GetBuffer()[a_X + x + ((a_Y + y) * a_Target->GetPitch())] = color;
```

Before we attempt to optimize this code, let’s take a moment to see what happens in Sprite::DrawScaled. The method draws a sprite to the screen, taking into account transparency. The sprite is drawn to a rectangular area on the screen that may differ from the size of the sprite; this lets us draw the sprite image with an arbitrary scale.

In the above image, a 16×16 Mario sprite is drawn twice. The small version is 8×8 pixels, which means that four pixels of the sprite need to be squeezed into 1 pixel on the screen. We could do that by averaging 2×2 pixels. Sprite::DrawScaled uses a simpler approach: it will just skip every other row and column.

We thus *write* to every pixel in a square, but we don’t necessarily *read* from every pixel of the sprite image, except when we enlarge the sprite: in that case we may in fact read every pixel multiple times. This is how DrawScaled operates: the two loops cover the target rectangle, and for each pixel in this rectangle, the position of a pixel in the sprite is calculated.

With that in mind, let’s try to make the code faster. The question is, however: *why* is this code slow? There are several options:

- Line 429 reads from memory. Memory is slow.
- Line 430 writes to memory.
- There is conditional code on line 430.
- There is a function call on line 430.
- Line 430 uses a lengthy calculation.

Many of these points require knowledge about the hardware to understand why they would be problematic. But the worst thing is: it is not even certain that lines 429 and 430 are the problem… Let’s stop for a moment, and gather some information before we continue.

The C++ code that we are trying to optimize will be executed by an *x86 processor*, which expects x86 machine code, which is produced by the C++ compiler. It’s easy to take a peek under the hood.

To obtain the disassembly shown here, compile the program in ‘Debug’ mode and place a breakpoint on line 428. Once the breakpoint is triggered, right-click line 428, and select ‘Go To Disassembly’.

To understand what happens here we need a crash course in x86 assembler.

The first *x86* CPU was the Intel 8086, which was introduced in 1978. This was a 16-bit processor, with eight general purpose 16-bit registers:

- **AX**, the *accumulator*;
- **BX**, the *base* register;
- **CX**, the *counter*;
- **DX**, the *data* register;
- **BP**, the *base pointer*;
- **SI**, the *source index*;
- **DI**, the *destination index*;
- and **SP**, the *stack pointer*.

The first four registers can also be addressed per byte: **AH** and **AL**, **BH** and **BL**, and so on.

In 1985 the 32-bit Intel 80386 processor was introduced. This processor is fully compatible with the 16-bit version, but sports eight 32-bit registers: **EAX, EBX, ECX, EDX, EBP, ESI, EDI** and **ESP,** which overlap the original 16-bit registers. An optional floating point co-processor, the 80387, adds eight 80-bit registers for floating point calculations, named **st0..st7**.

After the 32-bit processors, AMD introduced the x86-compatible ‘x86-64‘ processors, with 64-bit registers: **RAX**, **RBX** and so on, as well as eight new 64-bit general purpose registers, named **R8..R15**. In terms of registers, a final important addition came with SIMD support, for which 8 128-bit registers are available: **xmm0..7**, and more recently: 16 256-bit registers, **ymm0..15**. AVX512 CPUs finally have 32 512-bit registers, named **zmm0..31**.

An example of some x86 assembly code is shown below.

```
loop: mov eax, [0x1008FFA0] // read from address into register
      shr eax, 5            // shift eax 5 bits to the right
      add eax, edx          // add registers, store in eax
      dec ecx               // decrement ecx
      jnz loop              // jump if not zero
      fld [esi]             // load from address [esi] onto FPU stack
      fld st0               // duplicate top float
      faddp                 // add top two values, push result
```

Every line has a simple instruction, which reads something from memory into a register, writes to memory, operates on a pair of operands, or continues execution at a new location, perhaps based on a condition.

The above code is not too hard to read. With this knowledge, let’s take another look at the assembly that was produced for the C++ code:

The first instruction converts an integer (*x*, stored 38h (hexadecimal, 56 in base 10) bytes away from the start of the stack) to a floating point number in register xmm0 (*ss,* or *single scalar*). The second line multiplies this number by variable *[whatever]*. The result must be stored in integer variable *u*, so the third line converts the multiplication result back to integer. Finally, the result of the conversion, in register eax, is written to a local variable on the stack.

Subsequent lines of C++ code yield more complex assembly, but it is not too hard to make sense of it. Things change drastically however when we switch from *debug* to *release*.

When looking at the assembly produced in *release* mode, the first thing that becomes clear is that there is now *much more *assembly code. The code for individual lines also appears to be strangely interleaved, and a lot of code is repeated.

It turns out that the compiler has optimized the code in several ways:

- The content of the loop is being repeated. This is known as loop unrolling; it reduces the overhead of the loop itself.
- Instructions are reordered. This increases separation of instructions that depend on results of earlier instructions, and helps the CPU to execute multiple instructions per clock cycle.

Running the *release* code shows that it is indeed much faster than the *debug* code. It is clear that the compiler knows things about CPUs that we don’t. And although it is nice that the compiler automates a lot of the optimization, it would be *nicer* if we knew how to help the compiler produce the fastest code possible.

The execution of instructions by a CPU is typically divided in four phases:

- Fetch
- Decode
- Execute
- Writeback

In **phase 1**, an instruction is read from memory. The instruction is encoded in a number of bits, and must be decoded in **phase 2**. In **phase 3**, the instruction is executed. The result of the execution is written back to memory or a register in **phase 4**, and is now available as input for subsequent instructions.

The above diagram shows this flow, with the ‘Execute’ stage highlighted. Now imagine that each stage takes one cycle: in that case, each instruction takes *four *cycles. During each phase, only a part of the CPU is actually active: the *Fetch* hardware goes unused when an instruction is being decoded, for example.

Now consider the following improved CPU model:

This is a *pipeline*: during each cycle, one instruction is fetched, one decoded, one executed and one result is written back. All parts of the CPU are now making themselves useful during each cycle, and the throughput of the CPU is (in theory) four times higher.

This improvement comes at a price. Imagine that the second instruction needs the result from the first instruction. The first instruction still needs to complete three phases, so this time the second instruction gets delayed: it must be halted until the input it requires is available. This is called a *bubble *in the pipeline. Compilers prevent bubbles by reordering code, as we have seen. If instruction three does not depend on instruction one, two and three can perhaps be swapped. This is also good to know for us programmers: we should write code with fewer dependencies, or perhaps somehow fill up bubbles with useful work (which we then essentially get done for free).

Modern CPUs use far more than four stages in the pipeline. The reason for this is *clockspeed*. The rate at which the CPU can run basically depends on the most complex stage. If we make individual stages simpler, we can increase the frequency, and thus improve throughput. Like this:

Of course, with longer pipelines, the impact of dependencies increases. Another problem grows with it: the cost of branches. Imagine that we have a ‘goto’ statement somewhere in our code. To fill the pipeline, we can just follow the ‘goto’. But what if the jump is *conditional*, and the condition is still being evaluated?

This special dependency is handled by *branch prediction* hardware in the CPU. Instead of waiting for the condition, the CPU will ‘guess’ if the conditional jump will be taken or not. If it guesses wrong, the pipeline is full of irrelevant instructions, and must be reset – at a significant cost. When conditional jumps are unpredictable, this happens frequently. This can become a major source of inefficiency in a program.

The programmer can counter this by reducing conditional code. And, if the use of conditional code cannot be prevented, it should at least be predictable.

In the last diagram the *Execute* stage was represented by three sub-stages. In real scenarios this presents another source of inefficiency. Some instructions need completely different execution logic than others: dividing a floating point number has no overlap with loading an integer from memory, for instance.

So, when a floating point instruction is being executed, a significant portion of the CPU is idling. But what if we execute a different instruction *in the same cycle*? This must obviously be a different type of instruction, which increases the pressure on the compiler to produce a suitable stream of instructions. And, if we execute several instructions per cycle, we must also *fetch *and *decode *several instructions per cycle:

This yields the mighty *superscalar pipeline*, which is commonly used in modern processors.

Let’s take a step back. A modern CPU executes multiple instructions per cycle, *if *these instructions are not all of the same type, and *if *these instructions do not depend on recently started instructions. The pipeline is filled, even with instructions after a jump, *if *the jump destination is correctly guessed. In all other cases, the machinery comes to a halt, and severe latencies occur. The compiler attempts to minimize dependencies, and optimizes the instruction mix, but it could use some help.

It is time to introduce some generic tools for low level software optimization. These are meant to kick-start your optimization effort. Many of these involve the concepts I discussed in the previous paragraphs. I refer to these tools as: *“The Rules of Engagement”*.

Rule number one:

- Avoid Costly Operations.

This may seem obvious. But what is a ‘costly operation’? As you dive deeper into optimization, you will quickly gain an intuition for this. Here is a quick overview:

- `<<`, `>>` : bit shifting
- `+`, `-`, `&`, `|`, `^` : simple arithmetic, logical operators
- `*` : multiplication
- `/` : division
- `sqrt`
- `sin`, `cos`, `tan`
- `pow`, `log`, `exp`

On most processors, bit operations are very cheap (as in: they take one cycle or less on average). Sometimes they can be completely free. There are many ways to exploit this. For example: multiplying by a power of 2 can be replaced by a bit shift: v = v << 3 has the same effect as v *= 8 (assuming v is an integer). Likewise, v >>= 4 is the same as v /= 16, at least for unsigned or non-negative values. Your compiler also knows this, so it is actually safe to write v /= 16. But what about v *= 9? A speedy replacement is: v += v << 3, which gets rid of the multiplication.

Looking at the table, ‘pow’ is pretty bad. So, please replace y=pow(x,5) by y=x*x*x*x*x! It looks ugly, but the result is *much* faster.

Point here is: once you know which operations may be costly, you can start looking for alternatives, carefully balancing required precision, code versatility and readability, and performance.

- Precalculate.

If there is no fast alternative to a certain operation, you may be able to use a *lookup table*. Take sine and cosine, for example: on many processors, these are expensive to evaluate. But a 360 entry table lets you fetch the correct values at the cost of a (cached?) memory operation. And some interpolation gets you values in between, albeit at a slight loss of accuracy.
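A minimal sketch of such a table, assuming 360 entries and linear interpolation (the names and the degree-based interface are illustrative):

```cpp
#include <cmath>

// Hypothetical 360-entry sine table with linear interpolation; accurate
// to a few decimal digits, which is often enough for graphics work.
static float sinTable[361]; // one extra entry, so interpolation never wraps
static bool tableReady = false;

float fastSin( float degrees ) // expects 0 <= degrees < 360
{
    if (!tableReady) // fill the table once (not thread-safe; a sketch)
    {
        for (int i = 0; i <= 360; i++)
            sinTable[i] = sinf( i * 3.141592653f / 180.0f );
        tableReady = true;
    }
    int i = (int)degrees;
    float frac = degrees - i; // fractional part, for interpolation
    return sinTable[i] + frac * (sinTable[i + 1] - sinTable[i]);
}
```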

A special case of precalculation is *loop hoisting*. Here is the inner loop of DrawScaled again:

```
for (int x = 0; x < a_Width; x++)
{
    int u = (int)((float)x * whatever);
    Pixel color = src[u + v * m_Pitch];
    if (color & 0xffffff)
        a_Target->GetBuffer()[a_X + x + ((a_Y + y) * a_Target->GetPitch())] = color;
}
```

Here, the address of a pixel in array src is calculated as u + v * m_Pitch. But, inside the inner loop, v is not changing, and neither is m_Pitch. So, we can precalculate the result of v * m_Pitch *outside the inner loop*, and use the precalculated value inside the loop. And, as you can see, there are several other opportunities for loop hoisting in this short code segment.
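A hoisted version of this loop might look like the sketch below; the function arguments stand in for the member variables of the original code, and the names are illustrative:

```cpp
typedef unsigned int Pixel;

// Hoisted sketch of the DrawScaled inner loop: everything that is
// constant during the loop is computed once, before the loop starts.
void drawRow( const Pixel* src, int srcPitch, int v, float whatever,
              Pixel* dst, int dstPitch, int dstX, int dstY, int width )
{
    const Pixel* srcRow = src + v * srcPitch;     // hoisted: v * m_Pitch
    Pixel* dstRow = dst + dstX + dstY * dstPitch; // hoisted: target row address
    for (int x = 0; x < width; x++)
    {
        int u = (int)((float)x * whatever);
        Pixel color = srcRow[u];
        if (color & 0xffffff) dstRow[x] = color;  // skip transparent pixels
    }
}
```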

Precalculation can go much further though. We see just how far in a few minutes.

- Pick the Right Data Type.

In C and C++, you can store numbers in many ways: in integers, unsigned integers, floats, doubles, (unsigned) shorts and (unsigned) chars. So which data type should you use in practice? The answer is: it depends.

Imagine you have a positive number, and you know that it will never exceed 255. You now have the option to store it in an unsigned char, which fits in an 8-bit register (AL or AH, for example). There is a problem with that approach: even though all x86-compatible CPUs still support byte access via AH and AL (the halves of the lowest 16 bits of the 64-bit RAX register), this functionality is outdated and therefore somewhat unoptimized. A compiler will thus emulate the byte behavior, using a full register. This requires some extra operations compared to regular 64-bit numbers: summing 1 and 255 in a byte yields 0, due to overflow, and this overflow must now be mimicked. What it boils down to is that operating on bytes is *more work* than operating on 64-bit numbers.

There is another consideration when picking a data type, and that is type conversion. We have seen before that a simple line like int u = x * whatever involves multiple type conversions if ‘whatever’ happens to be a float value. The type conversion itself is not free, and may add significant overhead to the calculation. Ensuring that all variables are either int or float would thus improve performance.

And finally, it is beneficial to have both integer code and floating point code in the same block. This allows the compiler to interleave instructions for the superscalar pipeline. A long loop that operates exclusively on integers will have part of the CPU idling (the floating point execution units), which is a waste of compute power.

- Avoid Conditional Branches.

The description of the (superscalar) pipeline should have made clear why conditional code is bad. But you may not always be aware of conditionals. Examples of conditional code:

if (x==y) { Function1(); } else { Function2(); }

This is obvious conditional code. Note that if x in practice always equals y, the conditional code is cheap to evaluate: the CPU will quickly learn and predict the correct program flow.

a=a>b?a:b;

This is the so-called *ternary operator*. It is often hidden: a = max(a,b).

More examples of conditional code:

switch (value) { case 1: Function1(); break; case 2: Function2(); break; }

virtual float GetX() { return x; }

for( int x = 0; x < 10; x++ ) { Function1( x ); }

do { Function1(); } while (x > 0);

Fast code does not use conditionals, or uses mostly predictable conditions. Reaching this state is not always easy. Sometimes lookup tables help. Sometimes you can split a loop into multiple parts to get rid of a condition inside the loop. And, like the compiler did, loop unrolling helps: a for loop includes the evaluation of a conditional, so fewer iterations mean fewer conditionals.
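Sometimes a conditional can also be replaced by plain arithmetic. A branchless max for 32-bit integers, for example (a sketch; modern compilers often emit a conditional move for the ternary anyway, so measure before adopting tricks like this):

```cpp
// Branchless alternative to a = a > b ? a : b for 32-bit ints: the sign
// of (a - b) is smeared into a mask. Note: a - b can overflow for
// extreme inputs, so this only works when the difference fits an int.
int branchlessMax( int a, int b )
{
    int diff = a - b;
    int mask = diff >> 31;    // all ones if a < b, all zeroes otherwise
    return a - (diff & mask); // a if a >= b, else a - (a - b) = b
}
```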

- Early Out.

This one is easily overlooked. Consider the following code snippet:

```
char a[] = "abcdefghijklmnopqrstuvwxyz", c = 'p';
int position = -1;
for ( int t = 0; t < strlen( a ); t++ )
{
    if (a[t] == c) { position = t; }
}
```

This code finds a character in a string. In this case we look for ‘p’, which we find in loop iteration 16. But after that, the loop *continues searching*. The fix is simple: add a *break *statement as soon as the correct answer is found. And while we are at it: the function strlen(a) is called whenever the loop condition is evaluated. This is prevented when we precalculate the length of string a: an example of *loop hoisting*.
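With both fixes applied, a sketch of the repaired search could look like this:

```cpp
#include <cstring>

// The repaired search: strlen is hoisted out of the loop, and the loop
// stops as soon as the character is found.
int findChar( const char* a, char c )
{
    int len = (int)strlen( a ); // hoisted: evaluated once, not per iteration
    for (int t = 0; t < len; t++)
        if (a[t] == c) return t; // early out: no need to scan the rest
    return -1;
}
```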

‘Early Out’ can be exploited in the DrawScaled function. We are scaling *very* specific sprites: images of red, green, blue and yellow balls. That means that once we have found the right side of the ball image, we can stop: there will be no opaque pixels further to the right. Of course, such a change makes function DrawScaled significantly less versatile. But, if the goal is to make the application faster for the current input, this may be considered acceptable.

- Use the Power of Two.

A division by a power of two can be replaced by a bitshift, as we have already seen. But there are more situations where powers of two shine. Consider the following array:

float a[100][100];

Accessing elements in this array requires a multiplication: cell 10,10 for instance is located at position 10 of row 10, so 10 times the width of the table plus 10. Now, if we would slightly resize the table:

float a[100][128];

This time, the width is a power of two. Cell 10,10 is now located at a + 10 * 128 + 10, which is the same as a + (10 << 7) + 10. The result is that every access of a cell in this table is now faster.

Note that we are wasting some memory here. On the average PC, this is *not an issue*. You probably have 8GB or more, and chances are you never wrote a program that used more than 1GB. Feel free to use memory liberally!
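The indexing trick can be written out as follows (illustrative only: when the width is a compile-time constant, the compiler performs this substitution by itself):

```cpp
// With a row width of 128, the 2D index row * 128 + col becomes a shift.
float readCell( const float* a, int row, int col )
{
    return a[(row << 7) + col]; // same as a[row * 128 + col]
}
```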

- Do Things Simultaneously.

The final Rule of Engagement is: *Do Things Simultaneously*. That means: use all cores. Maximize the use of all cores, in fact. And once all cores are running at peak performance, turn to the GPU, and make that chip work at peak performance as well. But before we get to multithreading and GPGPU, we need to consider doing things simultaneously on a single core. This is what we will do when we start using SIMD hardware, which will be discussed in detail in a later blog post.

Summarizing, the Rules of Engagement:

- Avoid Costly Operations
- Precalculate
- Pick the Right Data Type
- Avoid Conditional Branches
- Early Out
- Use the Power of Two
- Do Things Simultaneously

With these, you should be able to make a decent start.

Let’s bring back the inner loop of DrawScaled once more:

```
for (int x = 0; x < a_Width; x++)
{
    int u = (int)((float)x * whatever);
    Pixel color = src[u + v * m_Pitch];
    if (color & 0xffffff)
        a_Target->GetBuffer()[a_X + x + ((a_Y + y) * a_Target->GetPitch())] = color;
}
```

With the Rules of Engagement in mind, there’s a lot we can do now:

- We could store ‘x’ as an integer, emulating floating point logic using fixed point math, to evade the type conversions.
- We can loop hoist v * m_Pitch, and (a_Y + y) * a_Target->GetPitch().
- In fact, we can calculate the destination address at the start of a line of pixels and increment it by 1 at the end of the loop iteration.
- We can ‘early out’ on the right side of a ball.
- We can preprocess the ball images to remove the alpha channel, so that ‘color & 0xffffff’ is reduced to just ‘color’.
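A sketch that combines two of these ideas, fixed-point stepping and loop hoisting (the buffer layout and names are illustrative, not the actual surface.cpp code):

```cpp
typedef unsigned int Pixel;

// Inner loop with 16.16 fixed-point stepping for 'u': no float-to-int
// conversion per pixel, and hoisted row addressing.
void drawRowFixed( const Pixel* srcRow, float whatever, Pixel* dstRow, int width )
{
    int ufp = 0, step = (int)(whatever * 65536.0f); // 'whatever' in 16.16 fixed point
    for (int x = 0; x < width; x++, ufp += step)
    {
        Pixel color = srcRow[ufp >> 16]; // integer part of u
        if (color & 0xffffff) dstRow[x] = color;
    }
}
```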

But there is that nagging feeling… *What are we doing?*

There is a famous quote by William Wulf that goes like this:

More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason – including blind stupidity.

The problem is this: we are drawing scaled sprites. The scale is uniform, so the width equals the height, in this demo. Width and height are passed as integers, and the number of unique sizes is quite limited. Certainly no ball is larger than 64 pixels, or smaller than 5 or so. So, *why don’t we simply precalculate scaled sprites?* This gets rid of the scaling logic altogether, which surely beats optimizing it…

With the prescaled ball sprites we can now render a massive amount of spheres in real-time.

But what if we want to go *further*? There is always a faster way… How about:

- We can get rid of the sorting. There are in fact two ways to do this: the first is using a kD-tree, the second using a z-buffer. I’ll leave this as an exercise for the reader.
- Considering the density of the dot sphere, we can simply skip the back side. And perhaps the inside as well. So we can just render a thin shell.
- Pixels on the screen are 32-bit, and so is the sprite. Using a palettized display greatly reduces data transport.
- Perhaps we can draw two pixels at once, with 64-bit writes? Or maybe even more?

Of course, at some point you need a 4k screen to benefit from further performance gains. But just for the fun of it, *how far can we go?*

I remember an old trick used on the Amiga, where a regular loop with a variable number of iterations was unrolled.

Original:

for( int x = 0; x < N; x++ ) pixel[x] = 0;

Unrolled:

```
pixel[0] = 0;
pixel[1] = 0;
pixel[2] = 0;
pixel[3] = 0;
...
pixel[1023] = 0;
```

To set N pixels to 0, one would set pointer ‘pixel’ to the correct value, and then jump to the assignment of pixel[1024 – N], so that precisely the last N assignments execute. Without a loop condition, without a jump after each iteration: this is obviously blazingly fast.

We can apply this concept to ball plotting as well. Imagine we want to draw a yellow ball using a width and height of 1 pixel, at location (x, y). For that we don’t need a loop:

```
Pixel* a = GetBuffer() + x + y * m_Width;
a[0] = 0x00ffff00; // yellow
```

Note that the color is not read from a ball picture anymore; it is simply hardcoded. A larger ball can be drawn similarly, e.g. a 2×2 ball:

```
Pixel* a = GetBuffer() + x + y * m_Width;
switch (size)
{
case 1:
    a[0] = 0x00ffff00; // yellow
    break;
case 2:
    a[0] = 0x00ffff00; a[1] = 0x00ffff00;
    a[1280] = 0x00ffff00; a[1281] = 0x00ffff00;
    break;
}
```

As a final optimization, each pair of 32-bit writes can be combined in a single 64-bit write. Note that in the above code the screen width (here: 1280) has been hardcoded as well. From here we can proceed, with all desired ball sizes, and with the correct colors obtained from the scaled ball sprites. I wrote a small program that emits the desired C code to do this. For 64 ball sizes in four colors, the result is a *massive* 6MB source file. But rendering using that code is extremely fast.
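The generator itself is not included here, but its core could look something like this hypothetical sketch (emitCase and its parameters are my invention; the real program also read per-pixel colors from the scaled sprites, where here a single color is passed in for brevity):

```cpp
#include <sstream>
#include <string>

// Emit one unrolled 'case' of the plotting switch for a given ball size:
// a solid size x size square of hardcoded pixel writes.
std::string emitCase( int size, unsigned int color, int screenWidth )
{
    std::ostringstream out;
    out << "case " << size << ":\n";
    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            out << "a[" << (x + y * screenWidth) << "] = 0x"
                << std::hex << color << std::dec << ";\n";
    out << "break;\n";
    return out.str();
}
```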

In the next post I will dive into the memory hierarchy, and the impact of caches on application performance. Until then: for questions, please email me at bikker.j@gmail.com, or follow me on Twitter: @j_bikker.

In this first post I will introduce the *Art of Software Optimization*. This introduction includes a high level overview of the optimization process, which allows us to perform optimization in a structured, deliberate way. I will also discuss the most important ingredient of successful software optimization: *profiling*.

Let’s start at the beginning. With today’s multi-teraflop GPUs and 64-core CPUs, who needs fast software? In software development, production efficiency and security are more important than performance, and hardware is cheaper than labor, so Java is more popular than C and Python beats C++. Despite this, some software still requires the supercomputer of the future, e.g. AlphaGo Parallel, which used a massive system to beat a human. Luckily, compute power continues to improve consistently over the years, which in fact provides us with a first reason to properly optimize our software. Doubling the performance of your program is about the same thing as running it on hardware that will be available in 18 months or so. Optimization thus allows us to *look into the future*.

There is a second reason for software optimization: *some software needs to run on pretty weak hardware*. This could be an OS running on a smartwatch, or photo processing software running on the CPUs of an ancient mars rover. And, if you are Nikon, you may have an edge over your competitors if your digital camera gets away with a cheaper CPU: the price will be lower and the battery will last longer.

A third reason affects us all: waiting is annoying. Whether you are waiting for Windows to complete an unexpected update at the start of your once-in-a-lifetime presentation for an international audience (I’ve seen worse examples than what happened to Gabe Newell), or a massive file copy, or to get on a plane: we are impatient beings, we spend too much time sleeping and eating already so let’s not add months of waiting.

There is one more reason for software optimization, but it is subtle, and it will make more sense once we get to the end of this article. So instead, let’s try to define what optimization *is*. It is a lot of things: *thinking like a CPU*, for example. That means: being aware of instruction pipelines, latencies, dependencies, bandwidth, cycles… The image below shows an 80486DX2/66. It may not mean much to you, but for me, it was the core of the first PC I put together, using components from three shops, and it was unbelievably fast. And this photo shows its naked machinery. That’s a thing of beauty. I own this thing. I want to know how it works, and I want to make it work for me. To do that, I should not teach it how to speak English; I should learn how to speak CPU.

So, thinking like a CPU. But optimization is also: *work smarter, not harder*. That means: smarter algorithms (e.g., reducing some O(n) code to O(log n)), but also: picking the right algorithm for a particular data set. Sorting a massive set of evenly distributed numbers requires a different sorting algorithm than a relatively small set of strings, for example. And finally: work smarter, not harder also means: be as accurate as necessary, but not more. If your square root function is accurate to the fifth decimal digit, but you clamp it to a whole number afterwards, chances are you can replace it by something faster.

And finally, optimization is: *don’t assume, measure*. For this we use a *profiler*, i.e. software that monitors performance characteristics of a program. Profiling should always steer the optimization effort: without it, we are blind.

So, optimization is:

- Think like a CPU
- Work smarter, not harder
- Don’t assume, measure

The profiler should help us decide where to aim our efforts. This has to do with the *Pareto Principle*, aka the *80/20 rule*: 80% of the effects come from 20% of the causes. Translated to and adjusted for software: 1% of your code takes 99% of the execution time. Understanding this is important for several reasons: first of all, profiling helps you to find this 1%, the *“low hanging fruit”*, where modifications to just a few lines of code can have massive impact on overall performance. Secondly, limiting your efforts to 1% of the code limits the impact on maintainability. Just imagine an optimized code base that is littered with inline assembly: the thought alone will discourage any project manager from even allowing you to speed up the code.

The profiler also helps us to execute optimization as a structured and deliberate process. The output of this process is (typically) a program that is *10-25x faster*. We will review this structured process in the next section.

Without further ado, this is the consistent approach:

- Determine optimization requirements.
- Profile: determine hotspots.
- Analyze hotspots: determine scalability.
- Apply high level optimizations to hotspots.
- Profile again.
- Parallelize / vectorize / use GPGPU.
- Profile again.
- Apply low level optimizations.
- Repeat steps 7 and 8 until time runs out.
- Report.

The list actually starts with a step zero: *complete the code*. It does not make sense to optimize code that is not ‘done’: optimizations may make further changes harder, and those changes may make the optimizations obsolete. “Premature optimization is the root of all evil”, as Donald Knuth famously (and somewhat controversially) put it.

After step zero we should establish our goals. This includes (a range of) target hardware, but also the desired performance level. Optimizing for 5x is not the same as optimizing for 100x; a vastly larger data set may put unforeseen strain on initially innocent code.

Our optimization requirements should also consider the time we have available to optimize. Assuming we optimize finished code, the optimization process probably needs to be squeezed in between QA and product delivery; chances are that your time budget is expressed in days, not weeks. This may affect our decisions later on: replacing a bad sorting algorithm by a decent one in one hour may be preferable to replacing it by the best one in four hours, because the decent one gives us three hours to spend on something more important.

After the initial phase of establishing requirements we encounter step two: *profiling*. Note that we didn’t make a single change to the code yet.

Before we get to the actual profiling, behold this machine:

This is the *Difference Engine*, designed by Charles Babbage. It is considered to be the first design for a computer. Sadly Charles never managed to get it to work, although the design is sound: it just required improved lubricants. At Charles’ side was a lady: Ada Lovelace. She wrote the first software, for another design by Charles: the Analytical Engine. About this software, Lady Lovelace makes the following remarks:

In almost every computation a great variety of arrangements for the succession of the processes is possible, and various considerations must influence the selection amongst them (…).

One essential object is to choose that arrangement which shall tend to reduce to a minimum the time necessary for completing the calculation.

And this, my friends, is the very first mention of the concept of software optimization. So next time you wonder why so few ladies chose the noble profession of software engineering, do realize that it all started with Lady Ada Lovelace.

It is time to make things practical. For this, I have prepared a small project, the output of which can be observed in the following image:

You can download the Visual Studio 2019 project files by clicking here.

When you run the application you will see that it performs just fine. Maybe you need to switch from debug to release, but after that the application should reach a pretty solid frame rate.

The application comes with some built-in profiling. In the top-left corner of the screen you can see the amount of time it takes to:

- Transform the 256 dots using a 4×4 rotation matrix
- Sort the 256 dots
- Render the 256 dots
- Clear the screen.

Based on these initial findings, we may want to focus our attention on the rendering process, which currently takes the most time. Rendering happens in Game::Render (in game.cpp):

```
void Game::Render()
{
    t.reset();
    m_Surface->Clear( 0 );
    elapsed4 = t.elapsed();
    for ( int i = 0; i < DOTS; i++ )
    {
        // extract dot index from sorted z-coordinate
        uint dotIdx = (uint)&m_Rotated[i].z & 3;
        // draw scaled sprite
        int sx = (int)(m_Rotated[i].x * 350.0f) + SCRWIDTH / 2;
        int sy = (int)(m_Rotated[i].y * 350.0f) + SCRHEIGHT / 2;
        int size = (int)((m_Rotated[i].z + 2) * 10.0f);
        m_Dot->SetFrame( dotIdx );
        m_Dot->DrawScaled( sx - size / 2, sy - size / 2, size, size, m_Surface );
    }
}
```

It is not immediately obvious what the problem with this function could be. Perhaps the divisions? Maybe the method invocations, or the type casts? But we should neither guess nor assume: instead, we profile.

To profile, we use a profiler. There is a profiler integrated in Microsoft Visual Studio, but you can also use an external one. I will use VerySleepy CS, but the process is similar with other profilers, so use whatever works best for you.

In the above image, I am starting the demo from the profiler. After specifying the executable to profile, and the directory to run it in, I let the demo run for about 10 seconds, before it is terminated. The profiler shows the following result:

There are several problems with the output:

- Actual program code is not recognizable and source code is missing.
- The workload for the application is so small that the most expensive function is the one that synchronizes the main loop to the vsync of the monitor.

Regarding the first issue: we can add debug information to our executable in Visual Studio via project properties ==> C/C++ ==> General ==> Debug Information Format (set to ‘Program Database’), and: project properties ==> Linker ==> Debugging ==> Generate Debug Info (set to ‘Yes’).

To resolve the second issue, we can simply increase the number of ‘DOTS’ in game.cpp. Let’s set it to 4096:

Increasing the amount of dots reveals something interesting. The time needed to *draw* the particles scales as expected, but the *sorting* time… exploded. A quick inspection of the sorting code reveals why:

```
void Game::Sort()
{
    for (int i = 0; i < DOTS; i++)
        for (int j = 0; j < (DOTS - 1); j++)
            if (m_Rotated[j].z > m_Rotated[j + 1].z)
            {
                vec3 h = m_Rotated[j];
                m_Rotated[j] = m_Rotated[j + 1];
                m_Rotated[j + 1] = h;
            }
}
```

This code is an implementation of a simple sorting algorithm named BubbleSort. This particular implementation scales quadratically in the number of elements to sort, which explains the terrible runtime for larger sets.

Obviously, we need to fix this, before we look at the rendering code (which was the bottleneck for the initial workload). To do this, we have a large number of options. Which one is best for this particular situation? May I suggest an unconventional answer: *the one that yields a decent improvement with minimal effort*. This leaves time for other optimizations; if the sorting becomes a problem again, we will simply revisit it.

A simple QuickSort implementation does the job:

```
void Swap( vec3& a, vec3& b ) { vec3 t = a; a = b; b = t; }

int Pivot( vec3 a[], int first, int last )
{
    int p = first;
    vec3 e = a[first];
    for (int i = first + 1; i <= last; i++) if (a[i].z <= e.z) Swap( a[i], a[++p] );
    Swap( a[p], a[first] );
    return p;
}

void QuickSort( vec3 a[], int first, int last )
{
    if (first >= last) return;
    int pivotElement = Pivot( a, first, last );
    QuickSort( a, first, pivotElement - 1 );
    QuickSort( a, pivotElement + 1, last );
}

void Game::Sort()
{
    QuickSort( m_Rotated, 0, DOTS - 1 );
}
```

After this change, rendering is (by far) the bottleneck again. Running the demo with 65536 dots is sufficient to keep the CPU busy:

This time we can see not only that ‘rendering’ is the most expensive part of the program, but also that the problem is in the DrawScaled function; within that function, two specific lines in the inner loop take almost all of the time. We can stop looking at the other 99% of the program, until we have resolved this issue. How to speed up this code is a topic for another blog post.

We have seen that optimization starts with profiling. Our initial observation (target ‘render’) had to be adjusted once we changed the workload: ‘sort’ required a high-level optimization to resolve the terrible scalability. To save time, we opted for a pragmatic solution (in this case, QuickSort) which may not be the best sorting algorithm for this particular data set, but it was very easy to drop in. The profiler now guides us to just two lines in one function, which take about 90% of the runtime.

In the next post we will have a look at what can be done about the performance of DrawScaled using some low-level optimizations.

*Questions? Mail me at bikker.j@gmail.com*, or follow me on Twitter: @j_bikker.

**Integrals in Physically Based Rendering**

To calculate light transport in a virtual scene we use Kajiya’s *rendering equation*:

This is the *three point formulation* of the rendering equation. It describes that light arriving at point s from point x is the sum of light emitted by x towards s, plus the light reflected by x towards s.

The amount of energy that is transported from point x towards s depends on several factors:

- geometry term G(s\leftrightarrow x): if s and x are mutually visible this is \frac{1}{||s-x|| ^2 }, otherwise it is 0;
- emittance L_e(s\leftarrow x): only if x is on a light emitting surface, this is greater than 0;
- irradiance L(x\leftarrow y): this is the light that x receives from a third point y, which it may reflect towards s;
- reflectance f_r(s\leftarrow x\leftarrow y): this is the *bidirectional reflectance distribution function (BRDF)*, which defines the relation between incoming and outgoing energy for a pair of directions.

The rendering equation contains an integral: light may arrive at x from any direction over the hemisphere (\Omega) over x. The integral is *recursive*: light arriving from y is either emitted by y, or reflected by y, so L(x\leftarrow y) is calculated using the same formula. Recursion typically terminates when we find an emissive surface: if L_e(s\leftarrow x)>0 we do not evaluate the integral.

Several alternative formulations of the rendering equation exist. One example is the hemispherical formulation:

\tag{2}L_o\left (x,\omega_o\right )=L_e\left (x,\omega_o\right )+\int_{\Omega}f_r\left (x,\omega_i,\omega_o\right )L_i\left (x,\omega_i\right )\left (\omega_i\cdot n\right )d\omega_i

This states that the light leaving x in direction \omega_o is the light emitted by x in direction \omega_o, plus the light reflected by x in direction \omega_o. Reflected light is again an integral: *radiance* arriving from all directions \omega_i over the hemisphere over x, converted to *irradiance* by the \omega_i\cdot n factor (where n is the surface normal at x), and scaled by the BRDF f_r(x,\omega_i,\omega_o).

Another interesting formulation is the transport formulation:

\tag{3}L=L_e+TL

This leaves out a lot of details, and simply states that the light leaving a point is the light emitted by the point, plus the light *transported* by the point. The transported light can either be emitted by a third point, or it can be transported by that third point, so:

\tag{4}L=L_{e_1}+TL_{e_2}+TTL_{e_3}+TTTL_{e_4}+...

Practical: a ray tracer (as well as a path tracer) creates images by sending rays from the camera lens into the world. The color of a pixel is determined by the light transported from the first surface point hit by a ray. The above equations allow us to calculate this light accurately.

**Path Tracing**

The path tracing algorithm uses Monte Carlo integration to evaluate the rendering equation. The basic algorithm operates as follows:

- A primary ray is constructed, which starts at the lens, and extends through a pixel.
- The end of this ray is the first scene surface that the ray intersects.
- This surface point may emit energy (L_e>0), or reflect energy (L_e=0). The reflected energy is evaluated by taking a *single* sample.
- Taking a single sample is implemented by generating a random direction \omega_i over the hemisphere and sending a new ray in that direction to probe incoming energy. This effectively extends the path by one segment. With the newly generated ray, the algorithm continues at step 2, until it terminates on a light emitting surface or leaves the scene.

For a single path, this yields a sample with high variance. This is resolved by sending many paths through each pixel.
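The control flow of these steps can be sketched in code. The sketch below is emphatically not the renderer from this series: the scene is replaced by a hypothetical toy in which every ray hits either a light source (with probability Q) or a diffuse surface that reflects a single sample, so that the recursive structure can be checked against a known analytic answer. All names and constants are mine.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Toy stand-in for a scene: at every path vertex we either hit a light
// (probability Q, radiance LE, the path terminates) or a diffuse surface
// with albedo RHO (the path continues with a single reflected sample).
// The analytic answer for this toy is Q * LE / (1 - (1 - Q) * RHO).
constexpr double Q = 0.5, LE = 1.0, RHO = 0.5;

// One path: classify the hit, terminate on an emissive surface,
// otherwise extend the path by one segment with a single sample.
double TracePath( std::mt19937& rng )
{
    std::uniform_real_distribution<double> U( 0.0, 1.0 );
    if (U( rng ) < Q) return LE;     // L_e > 0: path ends on a light
    return RHO * TracePath( rng );   // L_e = 0: reflect a single sample
}

// One path is one high-variance sample; average many paths per pixel.
double RenderPixel( int paths )
{
    std::mt19937 rng( 1234 );
    double sum = 0;
    for (int i = 0; i < paths; i++) sum += TracePath( rng );
    return sum / paths;
}
```

With Q=0.5, LE=1 and RHO=0.5 the expected pixel value is 0.5/(1-0.25)=2/3; averaging many paths converges to it, illustrating the variance reduction described above.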

One path is now effectively one sample of the rendering equation. Note the similarity with particle transport: a single path bears strong resemblance to the behavior of a photon bouncing around the scene. An obvious distinction is the direction of the path as a whole: photons originate at light sources; path tracing operates in reverse. According to the *Helmholtz reciprocity principle*, both directions are equivalent.

Also note the dimensionality of a path sample: for a single bounce, we need two random numbers to generate a direction on the hemisphere, i.e. D=2. At two bounces, D=4 and so on. Additional dimensions are needed to sample the area of the pixel, the area of the lens, and a time slice. The curse of dimensionality strikes hard, rendering Riemann sums useless.

*Figure 1* shows a path traced scene. The surface point in the center of the red crosshairs is, like all other pixels of the image, at the end of a camera ray. This surface point x reflects light towards the camera sensor s, as described by the rendering equation: this is a combination of light arriving from the light source, light arriving from the sky, and light arriving via other surfaces in the scene.

The light reflected by x towards the camera arrives over the hemisphere over x. The figure shows this hemisphere, and next to it an enlarged version of (two sides of) this hemisphere.

**Monte-Carlo Integration of the Hemisphere**

We can evaluate the integral over the hemisphere using Monte Carlo integration. A naive implementation of this is straight-forward: we generate two random numbers, which we use to determine a direction \omega_i towards the hemisphere, and sample incoming light from that direction.

Combining the Monte Carlo integrator from the first article with the integral over the hemisphere of the rendering equation yields:

\tag{5}\int_{\Omega}f_r\left (x,\omega_i,\omega_o\right )L_i\left (x,\omega_i\right )\left (\omega_i\cdot n\right )d\omega_i\approx \frac{A}{n}\sum_{k=1}^{n}f_r(x,\omega_k,\omega_o)L_i(x,\omega_k)(\omega_k\cdot n)

where \omega_o is the direction towards the camera, \omega_k is a random direction on the hemisphere, and A is the surface area of a hemisphere of radius 1, which is simply 2\pi.
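As a sketch, this estimator can be tested on a case with a known outcome. Assumed here (not from the text): a diffuse BRDF f_r=albedo/\pi under a constant sky L_i=1, for which the integral is exactly the albedo, and the standard mapping from two uniform random numbers to a uniform direction on the hemisphere.

```cpp
#include <cassert>
#include <cmath>
#include <random>

struct Vec3 { double x, y, z; };
const double PI = 3.141592653589793;

// Standard mapping (not derived in the text) from two uniform random
// numbers to a uniformly distributed direction on the hemisphere
// around the normal n = +z; the pdf of this distribution is 1/(2*pi).
Vec3 UniformSampleHemisphere( double r1, double r2 )
{
    double z = r1;                        // cos(theta), uniform in [0,1]
    double s = std::sqrt( 1.0 - z * z );  // sin(theta)
    double phi = 2.0 * PI * r2;
    return { std::cos( phi ) * s, std::sin( phi ) * s, z };
}

// The estimator above with A = 2*pi, for the hypothetical diffuse
// BRDF (f_r = albedo / pi) under a constant sky of radiance 1; the
// analytic value of the integral is then exactly the albedo.
double EstimateReflected( int n, double albedo )
{
    std::mt19937 rng( 42 );
    std::uniform_real_distribution<double> U( 0.0, 1.0 );
    double sum = 0;
    for (int k = 0; k < n; k++)
    {
        Vec3 w = UniformSampleHemisphere( U( rng ), U( rng ) );
        sum += (albedo / PI) * 1.0 * w.z;   // f_r * L_i * (w . n)
    }
    return 2.0 * PI * sum / n;              // multiply by A / n
}
```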

Using Monte Carlo with importance sampling (article 1, *Equation 7*) we get:

\tag{6}\int_{\Omega}f_r\left (x,\omega_i,\omega_o\right )L_i\left (x,\omega_i\right )\left (\omega_i\cdot n\right )d\omega_i\approx \frac{1}{n}\sum_{k=1}^{n}\frac{f_r(x,\omega_k,\omega_o)L_i(x,\omega_k)(\omega_k\cdot n)}{p(\omega_k)}

In other words: not all random incoming directions \omega_k need to have the same chance of being selected; we can use a pdf to focus our efforts.

**Importance Sampling the Hemisphere: Cosine**

Looking at the integral, even if we know nothing about f_r and L_i, we can at least conclude one thing: the term that is being summed is proportional to \omega_k\cdot n. This is called the *cosine term*, and accounts for the fact that a beam of incoming light (radiance) is distributed over a larger area when it arrives at a greater angle from the normal, yielding less energy per unit area (irradiance). In the absence of other information, this cosine term makes a good pdf.

The raw cosine requires normalization: in 3D, it integrates to \pi over the hemisphere. We now have our pdf: p(x)=\frac{\omega_k\cdot n}{\pi}. The question that remains is how to pick a random \omega_k proportional to this pdf. The Global Illumination Compendium provides an answer: given two uniform random numbers r_1 and r_2, we obtain a cosine weighted random vector \omega_k with

\tag{7}x=\cos \left (2\pi r_1 \right )\sqrt{1-r_2}\\y=\sin \left (2\pi r_1\right )\sqrt{1-r_2}\\z=\sqrt{r_2}

Note that being able to sample a direction proportional to a pdf directly, as we do here, is quite rare. Also note that, at least for the \omega_k\cdot n term, this pdf is *perfect*: it is not an approximation of the function we are trying to integrate.
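Equation 7 translates directly to code. The estimator below it is a sketch with the same hypothetical diffuse/constant-sky setup as before (my assumption, not from the text): dividing each sample by the pdf (\omega_k\cdot n)/\pi cancels the cosine term, so for this particular integrand the variance vanishes entirely.

```cpp
#include <cassert>
#include <cmath>
#include <random>

struct Vec3 { double x, y, z; };
const double PI = 3.141592653589793;

// Equation 7: a cosine-weighted random direction on the hemisphere
// around the normal n = +z, from two uniform random numbers r1 and r2.
Vec3 CosineSampleHemisphere( double r1, double r2 )
{
    double s = std::sqrt( 1.0 - r2 );
    return { std::cos( 2.0 * PI * r1 ) * s,
             std::sin( 2.0 * PI * r1 ) * s,
             std::sqrt( r2 ) };
}

// Importance-sampled estimate with pdf p(w) = (w . n) / pi. For the
// hypothetical diffuse/constant-sky case (f_r = albedo / pi, L_i = 1),
// dividing each sample by the pdf cancels the cosine exactly, so every
// sample equals the albedo.
double EstimateReflectedCosine( int n, double albedo )
{
    std::mt19937 rng( 42 );
    std::uniform_real_distribution<double> U( 1e-9, 1.0 ); // avoid r2 = 0
    double sum = 0;
    for (int k = 0; k < n; k++)
    {
        Vec3 w = CosineSampleHemisphere( U( rng ), U( rng ) );
        double sample = (albedo / PI) * 1.0 * w.z; // f_r * L_i * (w . n)
        sum += sample / (w.z / PI);                // divide by the pdf
    }
    return sum / n;
}
```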

**Importance Sampling the Hemisphere: Direct Illumination**

*Figure 3* shows a latitude-longitude version of the hemisphere we used before.

Incoming radiance at x is dominated by *direct light* (from the light source, and to a lesser extent, from the sky). We would thus like to focus on directions that will hit the light source.

This time, a pdf is not easily created. However, we can use a trick. Recall the pdf in *Figure 10* of part 1, where half of the domain received 70% of the samples. We can do something similar for direct illumination. Earlier we defined the full light transport as:

Applied to point x, the term TL_{e_2} represents all direct illumination arriving at x (L_{e_1} is the light emitted by x). The subsequent terms TTL_{e_3}+TTTL_{e_4}+... represent indirect illumination. We can thus evaluate direct and indirect illumination *separately*, using two distinct integrals, and sum the result.

This is illustrated in *Figure 4*. In the top row, the left image shows a 2D representation of the hemisphere. The red shape indicates the magnitude of the illumination: point x is lit from all directions, but most of the energy is coming from the light source (blue). The center image shows the direct illumination, and the right image the indirect illumination. The sum of these is the original integral.

Obtaining the sum of just the indirect illumination is straight-forward: we pick a random direction on the hemisphere, and when it happens to hit a light source, we ignore its result. This seems like a waste, but lights typically do not occupy a significant portion of the hemisphere. Since we pick each direction with the same probability as we did before, we can use the pdf we used before.

Obtaining the sum of just the direct illumination is also straight-forward: we pick a random direction towards the light, *by aiming for a random point on the light*. Note that this time, we definitely do not pick each direction with the same probability: we thus need a new pdf. This is a pdf that is zero in most places, and constant over the area of the light source projected on the hemisphere. The constant must be chosen so that the pdf integrates to 1. This is the case if the constant is 1/SA, where SA is the area of the light source projected on the hemisphere, the *solid angle*:
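The text does not give a formula for SA; a common small-source approximation (an assumption here, and inexact for large or very close lights) is the projected area of the light divided by the squared distance. A minimal sketch, with hypothetical names:

```cpp
#include <cassert>
#include <cmath>

// Approximate solid angle subtended by a small light of area A at
// distance d, whose surface normal makes an angle thetaL with the
// direction towards the shaded point: projected area over squared
// distance. The direct-light pdf is then 1/SA over the projected
// light and zero elsewhere, so a direct-light sample is weighted
// by SA (= 1 / pdf).
double ApproxSolidAngle( double area, double dist, double cosThetaL )
{
    return area * cosThetaL / (dist * dist);
}
```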

**Modified Path Tracing Algorithm**

With the discussed importance sampling techniques we can now formulate the improved path tracing algorithm:

- A primary ray is constructed, which starts at the lens, and extends through a pixel.
- The end of this ray is the first scene surface that the ray intersects.
- This surface point may emit energy (L_e>0), or reflect energy (L_e=0). The reflected energy is evaluated by summing two samples: one for the direct illumination, and one for the indirect illumination.
- To sample the direct illumination, we pick a random point on the light source, and probe this direction using a ray. The result is divided by the pdf, which is 1/SA.
- To sample the indirect illumination, we generate a random direction proportional to the pdf: (\omega_i\cdot n)/\pi, and continue with step 2 with a ray in this direction. The result is divided by the pdf.

**Synchronize**

At this point, the following concepts should be clear:

- The importance sampled version of the Monte Carlo integrator (part 1, *Equation 7*) yields correct results for any valid pdf, including p(x)=1. Many valid pdfs are possible; some will yield lower variance than the constant pdf, others may yield higher variance.
- An important term in the indirect illumination is \omega_i\cdot n. A pdf proportional to this term typically lowers variance, even though it doesn’t take into account the other terms, or G.
- The union / sum of the integrals of direct and indirect illumination over the hemisphere is the integral of all illumination over the hemisphere.
- Sampling direct light using a separate ray towards the light source is a form of importance sampling: we skip many directions on the hemisphere. We thus need a pdf. Normalizing it requires the *solid angle*.

If not, please reread the relevant sections, and if that fails, ask questions.

**Multiple Lights**

One last thing that was omitted in the description of sampling direct light is how to handle *multiple lights*. The solution to this problem is simple, but why it works requires some explanation.

If we have n lights in total, we pick a random light, and evaluate it as we did before: the pdf is 1/SA or zero; 1/SA for the set of directions towards this particular light, zero elsewhere. The result we get from sampling the random light is then multiplied by n.

To understand why this works, let’s consider the graph of *Equation 3* of part 1 again:

Previously, we integrated this numerically by taking a random sample in the domain [-2.5,2.5]. An alternative is to randomly pick one half of the function for each sample. The domain then becomes either [-2.5,0] or [0,2.5], and thus the contribution of the sample is scaled by 2.5, instead of the 5 that we used for the full domain. Multiplying by 2 thus gives us the correct estimate.

This still works if the two subdomains are not equally sized. Suppose we have subdomains [-2.5,1] and [1,2.5]. Now half of our samples are scaled by 3.5, and the other half by 1.5. Multiplying by 2 yields \left (\frac{1}{2} \times 3.5 \times 2\right )+\left (\frac{1}{2} \times 1.5 \times 2\right )=5, which is still correct. Applying this to lights: each light is a subdomain of the full domain, i.e. the set of all lights. Each light may be picked with an equal probability. A selected light now represents all lights, and therefore its contribution is scaled by n.
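This bookkeeping is easy to verify numerically. The sketch below (names mine) integrates a constant test function f(x)=1 over [-2.5,2.5] (true area: 5) using the two unequal subdomains from the text:

```cpp
#include <cassert>
#include <cmath>
#include <random>

double Fconst( double ) { return 1.0; } // constant test function, area 1 * 5 = 5

// Per sample: pick subdomain [-2.5,1] (width 3.5) or [1,2.5] (width 1.5)
// with probability 1/2, take a uniform sample inside it, scale by the
// chosen width, and multiply by 2 (the number of subdomains).
double TwoSubdomainEstimate( int n )
{
    std::mt19937 rng( 7 );
    std::uniform_real_distribution<double> U( 0.0, 1.0 );
    double sum = 0;
    for (int i = 0; i < n; i++)
    {
        bool left = U( rng ) < 0.5;                 // pick a subdomain
        double width = left ? 3.5 : 1.5;
        double x = left ? -2.5 + 3.5 * U( rng )
                        :  1.0 + 1.5 * U( rng );    // uniform inside it
        sum += 2.0 * width * Fconst( x );           // scale by width, then x2
    }
    return sum / n;
}
```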

**Importance Sampling Lights**

When a scene is illuminated by multiple lights, picking a random light is not always the best option. A small, distant light will have less effect on a surface point than a nearby bright light source.

We can again use importance sampling to improve our estimate in this situation, by assigning a picking probability to each light source. Since this is again a pdf, these probabilities must sum to 1. To construct it, we first estimate the *potential contribution* of each light source, I\times SA, where I is the emission of the light source per unit area, and SA, as before, the solid angle. An important factor that is missing here is the visibility of the light source: visibility can only be determined using a ray, which is precisely the operation we wish to avoid.

Suppose we get the following potential contributions for four light sources:

light 1: 0.01

light 2: 5.02

light 3: 2.77

light 4: 0.59

We can normalize these values by dividing each by the sum of the four values (here: 8.39), which yields a valid pdf: it integrates to 1, and it is never zero unless the contribution of a light is 0.

Next, we create a *cumulative distribution function (cdf)*, which stores the partial integrals of the pdf:

Or, in the case of a discrete pdf:

\tag{11}F(x)=\sum_{x_i<x}f(x_i)

In other words, for a random value x, the cdf tells us the probability that sampling according to the pdf produces this value, or a smaller one. For the four light sources, the cdf is simply:

cdf[0] = (0.01) / 8.39 ≈ 0.1%
cdf[1] = (0.01 + 5.02) / 8.39 ≈ 60%
cdf[2] = (0.01 + 5.02 + 2.77) / 8.39 ≈ 93%
cdf[3] = (0.01 + 5.02 + 2.77 + 0.59) / 8.39 = 100%

We can now pick a light with a probability proportional to the pdf, using the cdf:

```
float r = Rand();
int lightIndex = 0;
while (cdf[lightIndex] < r) lightIndex++;
```

In other words: we look for the first cdf entry that is not smaller than the random number r; the index of that entry is the index of the selected light. Since we are now picking lights using a pdf, we need to divide the result by the value of the pdf. Note that we no longer multiply by the number of light sources: the light count is already accounted for by the pdf. In fact, when we picked each of the four light sources with equal probability, the value of the constant pdf was 0.25: multiplying by the light count is thus the same as dividing by this pdf.
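Put together, building the cdf and picking a light looks like this (a self-contained sketch; function names are mine, the contribution values are the four-light example from the text):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Build the discrete cdf from unnormalized potential contributions.
std::vector<double> BuildCdf( const std::vector<double>& contrib )
{
    double total = 0;
    for (double c : contrib) total += c;
    std::vector<double> cdf;
    double running = 0;
    for (double c : contrib) cdf.push_back( (running += c) / total );
    return cdf;                   // the last entry is exactly 1
}

// Pick a light from a uniform random number r in [0,1).
int PickLight( const std::vector<double>& cdf, double r )
{
    int i = 0;
    while (cdf[i] < r) i++;       // first entry that is >= r
    return i;
}
```

A bright light such as light 2 (contribution 5.02 out of 8.39) is now selected for roughly 60% of the samples.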

**Closing Remarks**

This concludes this document. If you have any questions or suggestions, please contact me:

Jacco Bikker: bikker.j@gmail.com
Twitter: @j_bikker

Github: https://github.com/jbikker/lighthouse2

Utrecht, December 12^{th}, 2019.


Rendering frequently involves the evaluation of multidimensional definite integrals: e.g., the visibility of an area light, radiance arriving over the area of a pixel, radiance arriving over a period of time, and the irradiance arriving over the hemisphere of a surface point. Evaluation of these integrals is typically done using Monte-Carlo integration, where the integral is replaced by the expected value of a stochastic experiment.

This document details the basic process of Monte-Carlo integration, as well as several techniques to reduce the variance of the approach. This will be done from a practical point of view: it is assumed that the reader is not intimately familiar with probability theory, but still wants to benefit from it for the development of efficient yet correct rendering algorithms.

**Definite Integrals**

A definite integral is an integral of the form \int_{a}^{b}f(x)dx, where [a,b] is an interval (or domain), x is a scalar and f(x) is a function that can be evaluated for each point in the interval. As worded by Wikipedia, a definite integral is defined as the signed area in the xy-plane that is bounded by the graph of f, the x-axis and the vertical lines x=a and x=b (*Figure 1a*).

The concept extends intuitively to higher dimensions: for a definite double integral, *signed area* becomes *signed volume* (*Figure 1b*) and in general, for definite multiple integrals, this becomes the *signed hyper-volume*.

In some cases, the area can be determined *analytically*, e.g. for f(x)=2: for the domain [a,b], the area is simply 2(b-a). In other cases an analytical solution is impossible, for example when we want to know the volume of the part of the iceberg that is above the water (*Figure 1c*). In such cases, f(x) can often only be determined by *sampling*.

**Numerical Integration**

We can estimate the value of complex definite integrals using *numerical integration*. One example of this is the *Riemann sum*. We calculate this sum by dividing the region into regular shapes (e.g. rectangles) that together form a region similar to the actual region. The Riemann sum is defined as:

Here, n is the number of subintervals, and \Delta x_i=\frac{b-a}{n} is the width of one subinterval. For each interval i, we sample f at a fixed location x_i in the subinterval (in *Figure 2*: at the start of the subinterval).

Note that as we increase n the Riemann sum will converge to the actual value of the integral:

\tag{2}\int_{a}^{b}f(x)dx=\lim_{||\Delta x||\to 0}\sum_{i=1}^{n}f(x_i)\Delta x_i

Riemann sums also work in higher dimensions (*Figure 3*). However, we run into a problem: for a function with two parameters, the number of subintervals must be much larger if we want to have a resolution that is comparable to what we used in the 2D case. This effect is known as the *curse of dimensionality*, and is amplified in higher dimensions.

We will now evaluate the accuracy of the Riemann sum for the following (deliberately convoluted) function:

\tag{3}f(x)=\left |\sin \left (\frac{1}{2}x+\frac{\pi}{2} \right )\tan \frac{x}{27}+\sin \left (\frac{3}{5}x^2 \right )+\frac{4}{x+\pi+1}-1\right |

A plot of the function over the domain [-2.5,2.5] is shown below. For reference, we calculate the definite integral \int_{-2.5}^{2.5}f(x)dx via Wolfram Alpha, which reports that the area is 3.12970. The plot on the right shows the accuracy of numerical integration using the Riemann sum for increasing n.

To put some numbers on the accuracy: for n=50, the error is ~2\times10^{-3}. At n=100, the error is ~3\times10^{-4}. Another order of magnitude is obtained for n=200.
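For reference, the Riemann sum experiment can be reproduced with a few lines of code (left-endpoint sampling, as in the text; function names are mine):

```cpp
#include <cassert>
#include <cmath>

const double PI = 3.141592653589793;

// Equation 3, the deliberately convoluted test function.
double F( double x )
{
    return std::fabs( std::sin( 0.5 * x + 0.5 * PI ) * std::tan( x / 27.0 )
                    + std::sin( 0.6 * x * x )
                    + 4.0 / (x + PI + 1.0) - 1.0 );
}

// Left Riemann sum over [a,b] with n equal-width subintervals,
// sampling f at the start of each subinterval.
double RiemannSum( double a, double b, int n )
{
    double dx = (b - a) / n, sum = 0;
    for (int i = 0; i < n; i++) sum += F( a + i * dx ) * dx;
    return sum;
}
```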

For additional information on Riemann sums, consider these resources:

- Kahn Academy: https://www.khanacademy.org/math/ap-calculus-ab/ab-accumulation-riemann-sums/ab-riemann-sums/v/simple-riemann-approximation-using-rectangles
- Wikipedia: https://en.wikipedia.org/wiki/Riemann_sum

**Monte Carlo (1)**

For rendering, few integrals (none?) are *univariate*. This means that we quickly meet the curse of dimensionality. On top of that, sampling a function at regular intervals is prone to *undersampling* and *aliasing*: we may miss important values in the function, or end up with unintended interference between the sampled function and the sample pattern (*Figure 5*).

We solve these problems using a technique known as *Monte Carlo integration*. Similar to the Riemann sum this involves sampling the function at a number of points, but unlike the *deterministic* pattern in the Riemann sum, we turn to a fundamentally *non-deterministic* ingredient: random numbers.

Monte Carlo integration is based on the observation that an integral can be replaced by the *expected value* of a stochastic experiment:

In other words: we sample the function n times at random locations within the domain (denoted by capital X), average the samples, and multiply by the width of the domain (for a univariate function). As with the Riemann sum, as n approaches infinity, the average of the samples converges to the expected value, and thus the true value of the integral.

**A Bit of Probability Theory**

It is important to grasp all the individual concepts here. Let’s start with the *expected value*: this is the value we expect for a single sample. Note that this is not necessarily a *possible* value, which may be counter-intuitive. For example, when we roll a die, the expected value is 3.5: the average of all possible outcomes, (1+2+3+4+5+6)/6=21/6=3.5.

The second concept is *random numbers*. It may sound obvious, but what we need for Monte Carlo integration are uniformly distributed random numbers, i.e. every value must have an equal probability of being generated. More on this later.

A third concept is *deviation*, and, related to that, *variance*. Even when we take a small number of samples, the expected value of the average, as well as the expected value of each individual sample, stays the same. However, evaluating *Equation 4* will rarely produce exactly this value. Deviation is the difference between the expected value and the outcome of the experiment: X-E(X).

In practice, this deviation has an interesting distribution:

This is a plot of the *normal distribution* or *bell curve*: it shows that not all deviations are equally probable. In fact, ~68.2% of our samples are within the range -1\sigma..1\sigma, where \sigma (sigma) is the *standard deviation*. Two useful ways to describe ‘standard deviation’ are:

- Standard deviation is a *measure of the dispersion* of the data.
- 95% of the data points are within 2\sigma from the mean.

To determine the standard deviation we have two methods:

- Standard deviation \sigma=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left (X_i-E\left [X\right ]\right )^2}: this works if we have a discrete probability distribution and the expected value E[X] is known. This is true for dice, where X=\{1,2,3,4,5,6\} and E[X]=3.5. Plugging in the numbers, we get \sigma=1.71.
- Alternatively, we can calculate the sample standard deviation using \sigma=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left (X_i-\bar{X}\right )^2}, where \bar{X} is the sample mean. More information on Wikipedia.
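The die example can be checked directly with the first formula (a small sketch; the function name is mine):

```cpp
#include <cassert>
#include <cmath>

// Standard deviation of a fair die via the discrete formula, using
// the known expected value E[X] = 3.5.
double DieSigma()
{
    double sum = 0;
    for (int x = 1; x <= 6; x++) sum += (x - 3.5) * (x - 3.5);
    return std::sqrt( sum / 6.0 ); // sqrt(17.5 / 6) = 1.7078...
}
```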

Sanity check: does this make sense? If \sigma=1.71, we claim that 68.2% of the samples are within 1.71 from 3.5. We know that {2,3,4,5} satisfy this criterion; 1 and 6 do not. Four out of six is 66.7%. Only if our die could produce *any* value in the range [1..6] would we have hit the 68.2% mark exactly.

Instead of standard deviation we frequently use the related term *variance*, which is defined simply as Var\left [X\right ]=\sigma^2. Being a square, variance is always positive, which helps in our calculations.

**Monte Carlo (2)**

In an earlier section we evaluated *Equation 3* using a Riemann sum. We will now repeat this experiment using Monte Carlo integration. Recall that Monte Carlo integration is defined as

A straight translation to C code:

```
double sum = 0;
for( int i = 0; i < n; i++ ) sum += f( Rand( 5 ) - 2.5 );
sum = (sum * 5.0) / (double)n;
```

The result for n=2 to n=200 is shown in the graph below. This suggests that Monte Carlo integration performs much worse than the Riemann sum. A closer inspection of the error shows that for n=200 the average error of the Riemann sum is 0.0002, while the error for Monte Carlo is 0.13.

In higher dimensions, this difference is reduced, but not eliminated. The following equation is an expanded version of the one we used before, taking two parameters:

\tag{6}f(x,y)=\left |\sin\left( \frac{1}{2}x+\frac{\pi}{2}\right )\tan \frac{x}{27}+\sin \left ( \frac{1}{6}x^2\right )+\frac{4}{x+\pi+1}-1\right |\left |\sin\left ( 1.1y\right )\cos\left (2.3x\right )\right |

Over the domain x∈[-2.5,2.5],y∈[-2.5,2.5], the volume bounded by this function and the xy-plane is 6.8685. At n=400 (20×20 samples), the error of the Riemann sum is 0.043. At the same sample count, the average error using Monte Carlo integration is 0.33. This is better than the previous result, but the difference is still significant. To understand this problem we investigate a well-known variance reduction technique for Monte Carlo integration: stratification.

Stratification improves the *uniformity* of random numbers. In *Figure 8a*, eight random numbers are used to sample the function. Since each number is picked at random, they often are not evenly distributed over the domain. *Figure 8b* shows the effect of stratification: the domain is subdivided in eight strata, and in each stratum a random position is chosen, improving uniformity.

The effect on variance is quite pronounced. *Figure 9a* shows a plot of the results with and without stratification. *Figure 9b* shows the error in the estimate. For n=10, the average error for 8 strata is 0.05; for 20 strata 0.07 and for 200 strata it drops to 0.002. Based on these results it is tempting to use a large number of strata. Stratification has drawbacks however, which amplify with an increasing stratum count. First of all, the number of samples must always be a multiple of the number of strata and secondly, like the Riemann sum, stratification suffers from the curse of dimensionality.
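Stratification applied to Equation 3 can be sketched as follows (one sample per stratum; names are mine):

```cpp
#include <cassert>
#include <cmath>
#include <random>

const double PI = 3.141592653589793;

// Equation 3, the test function used throughout this article.
double F( double x )
{
    return std::fabs( std::sin( 0.5 * x + 0.5 * PI ) * std::tan( x / 27.0 )
                    + std::sin( 0.6 * x * x )
                    + 4.0 / (x + PI + 1.0) - 1.0 );
}

// Stratified Monte Carlo over [a,b]: the domain is cut into equal
// strata and one uniform random sample is taken inside each stratum,
// so the sample count is (a multiple of) the stratum count.
double StratifiedMC( double a, double b, int strata )
{
    std::mt19937 rng( 99 );
    std::uniform_real_distribution<double> U( 0.0, 1.0 );
    double w = (b - a) / strata, sum = 0;
    for (int i = 0; i < strata; i++)
        sum += F( a + (i + U( rng )) * w ) * w; // one sample per stratum
    return sum;
}
```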

**Importance Sampling**

In the previous sections, we have sampled equations uniformly. An extension to the Monte Carlo *integrator* allows us to change that:

\tag{7}F_N=\frac{1}{N}\sum_{i=1}^{N}\frac{f(X_i)}{p(X_i)}

Here, p(X) is a *probability density function (pdf)*: it specifies the relative probability that a random variable takes on a particular value.

For a uniform random variable in the range 0..1, the pdf is simply 1 (*Figure 10*a), which means that each value has an equal probability of being chosen. If we integrate this function over the domain [0,0.5] we get 0.5: the probability that X<\frac{1}{2}. For X>\frac{1}{2}, we obviously get the same probability.

*Figure 10b* shows a different pdf. This time, the probability of generating a number smaller than \frac{1}{2} is 70%. This is achieved with the following code snippet:

```cpp
float SamplePdf()
{
	if (Rand() < 0.7f) return Rand( 0.5f ); // 70%: a value in [0, 0.5)
	else return Rand( 0.5f ) + 0.5f;        // 30%: a value in [0.5, 1)
}
```

This pdf is defined as:

\tag{8}p(x)=\left\{\begin{matrix}1.4, & \text{if }x<\frac{1}{2}\\0.6, & \text{otherwise}\end{matrix}\right.

The numbers 1.4 and 0.6 reflect the requirement that the probability that x<\frac{1}{2} is 70%: integrating the pdf over [0,\frac{1}{2}] yields 1.4\times\frac{1}{2}=0.7, and integrating over [\frac{1}{2},1] yields 0.6\times\frac{1}{2}=0.3. This illustrates an important requirement for pdfs in general: the pdf *must* integrate to 1. Another requirement is that p(x) cannot be zero where f(x) is not zero: this would mean that parts of f have a zero probability of being sampled, which obviously affects the estimate.

A few notes to help you grasp the concept of the pdf:

- A single value of the pdf does *not* represent a probability: the pdf can therefore locally be greater than 1 (e.g., in the pdf we just discussed).
- The integral over (part of) the domain of the pdf however *is* a probability, and therefore the pdf integrates to 1.

A single value can be interpreted as the *relative likelihood* that a particular value occurs.

Note that the *normal distribution* is a probability density function: it provides us with a probability that some random variable falls within a certain range. In the case of the normal distribution, this random variable is the deviation from the mean. Like a well-mannered pdf, the normal distribution integrates to 1.

*Equation 7* thus allows for non-uniform sampling. It compensates for this by dividing each sample by the relative chance it is picked. Why this matters is illustrated in *Figure 11a*. The plotted function features a significant interval where its value is 0. Sampling this region is useless: nothing is added to the sum, we just divide by a larger number. Recall the iceberg in *Figure 1c*: there is no point in sampling the height in a large area around the iceberg.

A pdf that exploits this knowledge about the function is shown in *Figure 11b*. Notice that this pdf actually is zero for a range of values. This does not make it an invalid pdf: the function is zero at the same locations. We can extend this idea beyond zero values. Samples are best spent where the function has significant values. In fact, the *ideal pdf* is *proportional to the function we are sampling*. A very good pdf for our function is shown in *Figure 12a*. An even better pdf is shown in *Figure 12b*. In both cases, we must not forget to *normalize* it, so it integrates to 1.

The pdfs in *Figure 12* pose two challenges:

- how do we create such pdfs;
- how do we sample such pdfs?

The answer to both questions is: *we don’t.* In many cases the function we
wish to integrate is unknown, and the only way to determine where it is
significant is by sampling it – which is precisely what we need the pdf for; a
classic chicken and egg situation.

In other cases however, we have a coarse idea of where we may expect the function to yield higher values, or zero values. In those cases, a crude pdf is often better than no pdf.

We may also be able to build the pdf on-the-fly. A couple of samples estimate the shape of the function, after which we aim subsequent samples at locations where we expect high values, which we use to improve the pdf, and so on.

In the next article we apply these concepts to rendering. A significant challenge is to construct pdfs. We will explore several cases where pdfs can help sampling.

Contact: bikker.j@gmail.com Twitter: @j_bikker

Github: https://github.com/jbikker/lighthouse2

In this article we will explore an important concept used in the recently published Lighthouse 2 platform. Wavefront path tracing, as it is called by NVIDIA’s Laine, Karras and Aila, or streaming path tracing, as it was originally named by Van Antwerpen in his master’s thesis, plays a crucial role in the development of efficient GPU path tracers, and potentially, also in CPU path tracers. It is somewhat counter-intuitive however, and its use requires rethinking the flow of ray tracing algorithms.

**‘Occupancy’**

The path tracing algorithm is a surprisingly simple algorithm, which can be described in a few lines of pseudo-code:

```cpp
vec3 Trace( vec3 O, vec3 D )
{
	IntersectionData i = Scene::Intersect( O, D );
	if (i == NoHit) return vec3( 0 );        // ray left the scene
	if (i == Light) return i.material.color; // lights do not reflect
	vec3 R = RandomDirectionOnHemisphere( i.normal );
	float pdf = 1 / TWOPI;
	return Trace( i.position, R ) * i.BRDF * dot( i.normal, R ) / pdf;
}
```

Input is a *primary ray* from the camera through a screen pixel. For this ray we determine the nearest intersection with a scene primitive. If there is none, the ray disappeared in the void. Otherwise, if the ray encountered a light, we found a light transport path between the light and the camera. If we find anything else, we bounce and recurse, hoping that the bounced ray does find a light. Note that this process resembles the (reverse) path of a photon, bouncing around the scene surfaces.

GPUs are designed to do the same thing on many threads. At first sight, ray tracing is a natural fit for this. So, we use OpenCL or CUDA to spawn a thread per pixel, and each thread executes the algorithm, which indeed works as intended, and quite fast too: just have a look at some ShaderToys to see how fast ray tracing can be on the GPU. Fast or not, the question is: are these ray tracers *as fast as they can be*?

There is a problem with the algorithm. A primary ray may find a light right away, or after a single random bounce, or after fifty bounces. A CPU programmer may see a potential stack overflow; a GPU programmer should see *low hardware utilization*. The problem is caused by the (conditional) tail recursion: a path may get terminated at a light source, or it may continue if it hit something else. Translated to many threads: a portion of the threads will get terminated, and a portion continues. After a few bounces, we have a few threads that have work left to do, while most threads are waiting for the final threads to finish.

The hardware utilization problem is amplified by the SIMT execution model of GPUs. Threads are organized in groups, e.g. 32 threads go together in a *warp* on a Pascal GPU (10xx class NVIDIA hardware). The threads in a warp share a single program counter: they execute in lock-step, so every program instruction is executed by the 32 threads simultaneously. SIMT stands for *single instruction multiple thread*, which describes this concept well. For a SIMT processor, conditional code is a challenge. This is nicely illustrated in the Volta white paper.

When some condition is true for some threads in the warp, the branches of the *if*-statement are serialized. The alternative to ‘all threads do the same thing’ is ‘some threads are disabled’. In an if-then-else block, the portion of threads doing useful work will be 50% on average, unless all threads agree on the condition.

Conditional code is sadly not uncommon in a path tracer. Shadow rays are cast only if a light source is not behind the shading point, different paths may hit different materials, Russian roulette may or may not kill a path, and so on. It turns out that this becomes a major source of inefficiency, and it is not easy to prevent without extreme measures.

An earlier version of this article used the term

‘occupancy’instead of ‘hardware utilization’. Occupancy is a somewhat tricky concept. For NVIDIA devices, it is defined as: “the number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently“. The maximum number of concurrent warps includes warps that are ready to be swapped in when another warp encounters a stall. Occupancy does not take into account control flow divergence, which reducesinstruction level parallelism(ILP). It is thus possible to have 100% occupancy for code that has a single active thread in each warp, and it is possible to have 100% hardware utilization with <100% occupancy.

**Streaming Path Tracing**

The streaming path tracing algorithm is designed to combat the root of the occupancy problem. Streaming path tracing splits the path tracing algorithm in four phases:

1. **Generate**
2. **Extend**
3. **Shade**
4. **Connect**

Each phase is implemented as a separate program. So instead of running the full path tracer as a single GPU program (‘kernel’), we now have *four *kernels. And on top of that, they execute in a loop, as we will see shortly.

**Phase 1 (‘Generate’)** is responsible for generating the primary rays. It is a simple kernel that produces ray origins and directions for as many rays as there are pixels. The output of this phase is a large buffer of rays, and a counter, which tells the next phase how many rays should be processed. For primary rays this is of course simply *screen width* times *screen height*.

**Phase 2 (‘Extend’)** is the second kernel. It is executed only after phase 1 has completed for all pixels. The kernel reads the buffer generated in phase 1, and intersects each ray with the scene. The output of this phase is an intersection result for each ray, stored in a buffer.

**Phase 3 (‘Shade’)** executes after phase 2 is completely done. It takes the intersection result from phase 2 and evaluates the shading model for each path. This may or may not generate new rays, depending on whether a path was terminated or not. A path that spawns a new ray (the path is ‘extended’) writes a new ray (‘path segment’) to a buffer. Paths that directly sample light sources (‘explicit light sampling’ or ‘next event estimation’) write a shadow ray to a second buffer.

**Phase 4 (‘Connect’)** traces the shadow rays generated in phase 3. It is similar to phase 2, but there is an important distinction: shadow rays merely need to find *any *intersection, while extension rays need to find the nearest intersection. This justifies a separate kernel.

Once phase 4 has completed we are left with a buffer that contains path extension rays. With these rays we proceed to phase 2. We do this until no path extension rays remain, or until we reach a maximum number of iterations.

**Inefficiencies**

For a performance-aware programmer, the streaming path tracing algorithm should raise a lot of red flags:

- Instead of a single kernel invocation, we now have *three invocations per iteration*, plus a generate kernel. Kernel invocations involve a certain overhead, so this is bad.
- Each kernel reads a massive buffer and writes a massive buffer.
- The CPU needs to know how many threads to spawn for each kernel, so the GPU needs to inform the CPU how many rays were generated in phase 3. Information travelling back from GPU to CPU is a bad idea, and we need it at least once per iteration.
- How does phase 3 write rays to a buffer without having gaps all over the place? It’s not going to use an atomic counter for that, is it?
- The number of active paths is still going to go down, so how does the scheme help in the first place?

To start with the last concern: if we pass a million tasks to the GPU, it will not run a million threads concurrently. The actual number of threads that runs concurrently depends on the hardware capabilities, but in general, tens of thousands of threads execute in parallel. Only when the workload drops below that number do we see hardware under-utilization due to a low task count.

The massive buffer I/O is another concern. This is indeed an issue, but not as much so as we might expect: data access is highly predictable, and especially for writing to the buffers, latency is not a problem. In fact, this type of data processing is precisely what the GPU was made for in the first place.

Another thing that a GPU does well is atomic counters, which may be unexpected if you come from a CPU world. As a rule of thumb, an atomic write is as expensive as an uncached write to global memory. In many cases, the latency will be hidden by the massively parallel execution of the GPU.

**Consequences**

Before we discuss the details, we will have a look at the consequences of the wavefront path tracing algorithm. First of all, the buffers. We need a buffer for the output of phase 1, i.e. the primary rays. For each ray, we need:

- A ray origin: three floats, so 12 bytes
- A ray direction: three floats, so 12 bytes

In practice it is better to increase the size of the buffer. Storing 16 bytes each for the ray origin and the ray direction allows the GPU to read those with a single 128-bit read. The alternative is a 64-bit read followed by a 32-bit read to obtain a float3, which is about twice as slow. So for a 1920×1080 screen we have: 1920×1080×32 bytes ≈ 64MB. We also need a buffer for the intersection results produced by the Extend kernel. This is again 128 bits per entry, so 32MB. Next, the Shade kernel may produce up to 1920×1080 path extensions (upper limit), and we can’t write them to the buffer we are reading from. So, another 64MB. And finally, if our path tracer casts shadow rays, that is another 64MB buffer. Summing everything, we get to 224MB of data, just for the wavefront algorithm. Or, about 1GB at a 4K resolution.

Now, here’s another thing we may need to get used to: memory is abundant. 1GB may sound like a lot, and there are ways to lower that figure, but realistically, by the time we need to actually path trace at 4K, using 1GB on an 8GB GPU is the least of our problems.

A bigger problem than memory requirements are the consequences for the rendering algorithm. So far I assumed that we want to generate a single extension ray and perhaps a shadow ray per thread in the Shade kernel. But what if we want to do some AO, using 16 rays per pixel? The 16 AO rays need to be stored in a buffer, but worse, they will only arrive in the next iteration. A similar problem arises with Whitted-style ray tracing: casting a shadow ray to multiple lights, or splitting a path when it encounters glass is pretty much impossible.

On the plus side, wavefront path tracing does solve the issues we identified in the ‘Occupancy’ section:

- In phase 1, all threads unconditionally produce primary rays and write these to a buffer.
- In phase 2, all threads unconditionally intersect rays with the scene and write intersection results to a buffer.
- In phase 3, we start the intersection result evaluation with 100% occupancy.
- In phase 4, we process a continuous list of shadow rays without any gaps.

By the time we return to phase 2 with the surviving paths (now with a length of two segments) we again have a compacted ray buffer, guaranteeing full occupancy when the kernel starts.

There is an additional benefit, which we should not overlook. The code in the four individual phases is isolated. Each kernel can use all available GPU resources (cache, shared memory, registers) without taking into account other kernels. This may allow the GPU to run the scene intersection code with more threads, because this code does not require as many registers as the shading code. More threads means: better latency hiding.

Full occupancy, better latency hiding, streaming writes: these are benefits that directly relate to the origin and nature of the GPU platform. For a GPU, wavefront path tracing is very natural.

**Is It Worth It?**

The question is of course: does the improved occupancy justify the buffer I/O and the cost of the additional kernel invocations?

The answer is: yes, but it is not easy to prove.

If we go back to the ShaderToy path tracers for a moment, we will see that most of them feature a simple hard-coded scene. Replacing this by a full-blown scene is not trivial: for millions of primitives, ray-scene intersection becomes a hard problem that is often left to NVIDIA (Optix), AMD (Radeon Rays) or Intel (Embree). None of these options can simply replace a hard-coded scene in a toy CUDA ray tracer. In CUDA, the closest match (Optix) insists on taking over your program flow. On the CPU, Embree will actually allow you to trace individual rays from your own code, but at a significant performance cost: it much prefers to trace a large batch of rays instead of individual rays.

Whether wavefront path tracing is faster than the alternative (the ‘megakernel’ as Laine et al. call it) depends on the time spent in the kernels (large scenes and expensive shaders reduce the relative overhead of the wavefront algorithm), the maximum path length, the occupancy in the megakernel and the difference in register pressure in the four phases. In an early version of the original Brigade path tracer, we found that even a simple scene, with a mix of specular and Lambertian surfaces, running on a GTX480 benefited from the wavefront approach.

**Streaming Path Tracing in Lighthouse 2**

The Lighthouse 2 platform provides two wavefront path tracers. The first one uses Optix Prime to implement phase 2 and 4 (the ray/scene intersection phases); the second one uses Optix directly to implement the same functionality.

Optix Prime is a simplified version of Optix that will only intersect a collection of rays with a scene consisting of triangles. Unlike the full Optix library, it does not support user intersection code; it will only intersect triangles. For a wavefront path tracer this is exactly what is needed however.

The Optix Prime wavefront path tracer is implemented in `rendercore.cpp` in the `rendercore_optixprime_b` project. Optix Prime initialization starts in the `Init` function, using `rtpContextCreate`. A scene is created using `rtpModelCreate`. The various ray buffers are created in the `SetTarget` function, using `rtpBufferDescCreate`. Note that we provide regular device pointers for these buffers: this means that they can be used in Optix as well as in regular CUDA kernels.

Rendering starts in the `Render` method. The CUDA kernel `generateEyeRays` is used to fill the primary ray buffer. Once the buffer is populated, Optix Prime is invoked, using `rtpQueryExecute`. This yields the intersection results in `extensionHitBuffer`. Note that all buffers remain on the GPU: there is no traffic between the CPU and the GPU, apart from the kernel calls. The Shade phase is implemented in a regular CUDA kernel, `shade`. Its implementation can be found in `pathtracer.cu`.

A few details of the `optixprime_b` implementation are noteworthy. First, shadow rays are traced outside the wavefront loop. This is valid: a shadow ray contributes to a pixel if it is not occluded, but apart from that, its result is not needed anywhere. A shadow ray is thus a *fire and forget* ray, which we can trace at any time and in any order. In this case, this is exploited by batching all shadow rays, so that the batch that is finally traced is as large as possible. This has one unfortunate consequence: for *N* iterations of the wavefront algorithm and *X* primary rays, the upper bound on the shadow ray count is *XN*.

Another detail is the handling of the various counters. The Extend and Shade phases need to know how many paths are active. The counters for this are updated (atomically) on the GPU, and subsequently used on the GPU, without ever moving back to the CPU. Sadly there is one case where this is not possible: Optix Prime wants to know the number of rays to trace. For this we need to bring back the counters once per iteration.

**Conclusion**

This article explained what wavefront path tracing is, and why it is needed to run the path tracing algorithm efficiently on a GPU. A practical implementation is provided in the Lighthouse 2 platform, which is open source and available on Github.

** Acknowledgements**

The article was updated based on suggestions by Jebb (see comments).
