This is going to be a short post. It is about a function that is heavily used in the Lighthouse 2 renderer: FetchTexelTrilinear, which can be found in sampling_shared.h. It roughly looks like this:

float4 FetchTexelTrilinear( float lambda, float2 uv, int offset, int width, int height )
{
	int level0 = min( MIPLEVELCOUNT - 1, (int)lambda );
	int level1 = min( MIPLEVELCOUNT - 1, level0 + 1 );
	float f = lambda - floor( lambda );
	// select first MIP level
	int o0 = offset, w0 = width, h0 = height;
	for (int i = 0; i < level0; i++) o0 += w0 * h0, w0 >>= 1, h0 >>= 1;
	// select second MIP level
	int o1 = offset, w1 = width, h1 = height;
	for (int i = 0; i < level1; i++) o1 += w1 * h1, w1 >>= 1, h1 >>= 1;
	// read actual data
	float4 p0 = FetchTexel( uv, o0, w0, h0 );
	float4 p1 = FetchTexel( uv, o1, w1, h1 );
	// final interpolation
	return (1 - f) * p0 + f * p1;
}

For context, a bit of sampling theory:

When using textures in a renderer we may encounter two issues related to sampling a ‘discrete signal’, which the texture is: it is a raster image that approximates an analog image, which in turn can be imagined as an image with infinite resolution. The two issues are:

- Oversampling: when using the texture, we read a particular pixel of the texture many times, because the texture is too small or we are viewing it up close.
- Undersampling: this time, when reading the texture we skip pixels, because the texture is too large. The skipping tends to happen in patterns, which are called Moiré patterns. These can be very distracting, especially when animated.

Oversampling can be countered by increasing texture resolution. This is however not always an option. To get a better image using the data we have, we can use interpolation. For a texture, this interpolation is bilinear.

Bilinear interpolation blends four pixels. Compared to nearest pixel sampling, which uses only a single pixel, this is a significant increase in memory transactions.

For undersampling we have another technique: *MIP mapping*. For this, we store the original texture, but also scaled-down versions of it. When we detect undersampling, we simply switch to a smaller version of the texture.
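The scaled-down versions are typically produced with a 2x2 box filter: each pixel in the next MIP level averages four pixels of the previous one. A minimal sketch, again single-channel and assuming power-of-two dimensions (Downsample is a hypothetical name):

```c
/* Produce the next MIP level (w/2 x h/2) from the current one (w x h)
   with a 2x2 box filter. */
void Downsample( const float* src, float* dst, int w, int h )
{
	int dw = w >> 1, dh = h >> 1;
	for (int y = 0; y < dh; y++) for (int x = 0; x < dw; x++)
		dst[x + y * dw] = 0.25f * (src[2 * x + 2 * y * w] + src[2 * x + 1 + 2 * y * w] +
		                           src[2 * x + (2 * y + 1) * w] + src[2 * x + 1 + (2 * y + 1) * w]);
}
```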

The visible transition between the MIP maps can finally be countered with *trilinear* interpolation. Where bilinear interpolation interpolates in two dimensions by blending four pixels, trilinear interpolates in three dimensions. The extra dimension is a blend between two MIP maps. The number of memory operations is now eight: four pixels in each of the two MIP maps that we interpolate between.

Let’s return to the code snippet. The input parameters are:

- lambda: floating point version of the desired MIP level. The fractional part of lambda is the interpolation parameter used to blend between the bilinear samples from the two nearest MIP maps.
- uv: the texture coordinate, where (1,1) is the bottom-right corner of the texture. Larger values result in tiling. The coordinate must still be scaled by the dimensions of the sampled MIP maps.
- offset: the position of the texture data in a larger buffer that stores all textures.
- width and height: the dimensions of the first MIP map (level 0).

The texture data for the MIP maps is stored sequentially, so for a 16 x 16 texture, the full-resolution image occupies offsets 0..255, the first scaled-down MIP map (which is 8 x 8) starts at offset 256, the next one (4 x 4) at 256 + 64, and so on. In the code, the offsets of the two used MIP maps are calculated using two small for loops:

int o0 = offset, w0 = width, h0 = height;
for (int i = 0; i < level0; i++) o0 += w0 * h0, w0 >>= 1, h0 >>= 1;
int o1 = offset, w1 = width, h1 = height;
for (int i = 0; i < level1; i++) o1 += w1 * h1, w1 >>= 1, h1 >>= 1;
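As a quick sanity check of this layout, the loop can be wrapped in a small helper (MipOffset is a hypothetical name, not part of Lighthouse 2): for a 16 x 16 texture it yields offsets 0, 256, 320, 336, ...

```c
/* Offset of MIP level 'level' relative to the start of the texture,
   using the same running sum as the loops above. */
int MipOffset( int width, int height, int level )
{
	int o = 0, w = width, h = height;
	for (int i = 0; i < level; i++) o += w * h, w >>= 1, h >>= 1;
	return o;
}
```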

This is where things get interesting.

Yesterday I received an e-mail from Marvin Reza. He pointed out that the for loops are in fact finite geometric sums:

offset = width \cdot height \cdot \left( 1 + \frac{1}{4} + \frac{1}{16} + \frac{1}{64} + \ldots \right)

This sum can be calculated as

\sum_{k=0}^{n-1}r^k=\frac{1-r^n}{1-r}

where r is (in our case) 0.25, and n is the desired MIP level. In CUDA (or OpenCL, or C) we would write:

o = width * height * (1.0f - powf( 0.25f, level0 )) / 0.75f;

I had two issues with this suggestion. The first is that this function is performing so many memory operations that surely a minor optimization to a for loop is not going to speed it up in any measurable way. The second is that a powf function is considered to be so expensive that surely it is not any faster than a simple for loop.

However… it *does* in fact make a measurable difference. And we’re not even done yet.

Marvin points out that the powf( 0.25, level ) can be replaced by exp2f( -2 * level ), which is slightly faster still. But we can take it one step further.

In IEEE floating point representation, 2^n is special. Have a look at this table:

2^-1   0.500000   1056964608   111111000000000000000000000000
2^-2   0.250000   1048576000   111110100000000000000000000000
2^-3   0.125000   1040187392   111110000000000000000000000000
2^-4   0.062500   1031798784   111101100000000000000000000000
2^-5   0.031250   1023410176   111101000000000000000000000000
2^-6   0.015625   1015021568   111100100000000000000000000000

The third column contains the ‘unsigned integer’ representation of the 32-bits used to store the floating point number in the second column. The fourth column shows the binary representation of this number. One thing is immediately clear: *all the action happens in the top bits.* In an IEEE 32-bit floating point number, this is the *exponent* of the number. The remaining bits store the *mantissa.*

The exponents for the six powers of two are obtained when we shift the integer number 23 bits to the right. They are: 126, 125, 124, 123, 122 and 121.
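This can be verified directly by reinterpreting the float's bits. A portable C sketch using memcpy (on the GPU you would use the __float_as_uint intrinsic instead; FloatExponentBits is a hypothetical name):

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bits as a 32-bit unsigned integer and
   return the 8-bit exponent field (bits 23..30). */
unsigned FloatExponentBits( float f )
{
	uint32_t bits;
	memcpy( &bits, &f, sizeof bits );
	return (bits >> 23) & 0xff;
}
```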

In the above table we went from float to integer representation. We can however also go in the opposite direction. If we want to store 2^-n, we can store (127-n)<<23 in an unsigned integer, and reinterpret this as a floating point number. And this is *much* faster than a powf or an exp2f.

In CUDA:

value = __uint_as_float( (127 - n) << 23 );
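Outside CUDA, the same bit trick can be written portably with memcpy (PowerOfTwoNeg is a hypothetical name; the construction is valid for 0 <= n <= 126, since larger n would land in the denormal/zero exponent range):

```c
#include <stdint.h>
#include <string.h>

/* Build 2^(-n) directly from its IEEE 754 bit pattern:
   exponent field 127-n, mantissa zero. */
float PowerOfTwoNeg( int n )
{
	uint32_t bits = (uint32_t)(127 - n) << 23;
	float f;
	memcpy( &f, &bits, sizeof f );
	return f;
}
```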

The whole FetchTexelTrilinear now becomes:

float4 FetchTexelTrilinear( float lambda, float2 uv, int offset, int width, int height )
{
	int level0 = 0, level1 = 0;
	float f = 0;
	if (lambda >= 0)
		level0 = min( MIPLEVELCOUNT - 1, (int)lambda ),
		level1 = min( MIPLEVELCOUNT - 1, level0 + 1 ),
		f = lambda - floor( lambda );
	// as proposed by Marvin Reza, slightly faster
	float scale = (float)(width * height) * 1.3333333333f;
	int o0 = offset + (int)(scale * (1 - __uint_as_float( (127 - 2 * level0) << 23 )));
	int o1 = offset + (int)(scale * (1 - __uint_as_float( (127 - 2 * level1) << 23 )));
	// read actual data
	float4 p0 = FetchTexel( uv, o0, width >> level0, height >> level0 );
	float4 p1 = FetchTexel( uv, o1, width >> level1, height >> level1 );
	// final interpolation
	return (1 - f) * p0 + f * p1;
}

Note that this exercise exposed a bug where lambda could be negative. This is countered in the optimized version of the code.

Many thanks to Marvin for the initial suggestion and for triggering this experiment!

Questions? Mail to bikker.j@gmail.com, or follow me on Twitter: @j_bikker.


Do you have any performance measurements or a comparison table?

No, sorry. The exact impact of the changes is complex: it depends on the number of texture reads (texture layers, normal maps, specularity map etc.) and other factors, which is why I merely claimed a speedup. In practice the impact is not going to exceed 1%, unless shading is very simple and texture layers are many.