 # Optimizing Trilinear Interpolation

This is going to be a short post. It is about a function that is heavily used in the Lighthouse 2 renderer: FetchTexelTrilinear, which can be found in sampling_shared.h. It roughly looks like this:

float4 FetchTexelTrilinear( float lambda, float2 uv, int offset, int width, int height )
{
int level0 = min( MIPLEVELCOUNT - 1, (int)lambda );
int level1 = min( MIPLEVELCOUNT - 1, level0 + 1 );
float f = lambda - floor( lambda );
// select first MIP level
int o0 = offset, w0 = width, h0 = height;
for (int i = 0; i < level0; i++) o0 += w0 * h0, w0 >>= 1, h0 >>= 1;
// select second MIP level
int o1 = offset, w1 = width, h1 = height;
for (int i = 0; i < level1; i++) o1 += w1 * h1, w1 >>= 1, h1 >>= 1;
float4 p0 = FetchTexel( uv, o0, w0, h0 );
float4 p1 = FetchTexel( uv, o1, w1, h1 );
// final interpolation
return (1 - f) * p0 + f * p1;
}

For context, a bit of sampling theory:

When using textures in a renderer we may encounter two issues related to sampling a ‘discrete signal’, which the texture is: it is a raster image that approximates an analog image, which in turn can be imagined as an image with infinite resolution. The two issues are:

1. Oversampling: when using the texture, we read a particular pixel of the texture many times, because the texture is too small or we are viewing it up close.
2. Undersampling: this time, when reading the texture we skip pixels, because the texture is too large. The skipping tends to happen in patterns, which are called Moiré patterns. These can be very distracting, especially when animated.

Oversampling can be countered by increasing texture resolution. This is however not always an option. To get a better image using the data we have, we can use interpolation. For a texture, this interpolation is bilinear.

Bilinear interpolation blends four pixels. Compared to nearest pixel sampling, which uses only a single pixel, this is a significant increase in memory transactions.

For undersampling we have another technique: MIP mapping. For this, we store the original texture, but also a scaled down versions. When we detect undersampling, we simply switch to a smaller version of the texture.

The visible transition between the MIP maps can finally be countered with trilinear interpolation. Where bilinear interpolation interpolated in two dimensions by blending four pixels, trilinear interpolates in three dimensions. The extra dimension is a blend between two MIP maps. The number of memory operations is now eight: four pixels in each of the two MIP maps that we interpolate between.

• lambda: floating point version of the desired MIP level. The fractional part of lambda is the interpolation parameter used to blend between the bilinear samples from the two nearest MIP maps.
• uv is the texture coordinate, where (1,1) is the bottom-right corner of the texture. Larger values result in tiling. The coordinate must still be scaled by the sizes of the sampled MIP maps.
• offset is the position of the texture data in a larger buffer that stores all textures.
• width and height specify the dimensions of the first MIP map (level 0).

The texture data for the MIP maps is stored sequentially, so for a 16 x 16 texture, the first MIP map (which is 8 x 8) starts at offset 256, and the second one (4 x 4) at 256 + 64, and so on. In the code, the offsets of the two used MIP maps are calculated using two small for loops:

int o0 = offset, w0 = width, h0 = height;
for (int i = 0; i < level0; i++) o0 += w0 * h0, w0 >>= 1, h0 >>= 1;

int o1 = offset, w1 = width, h1 = height;
for (int i = 0; i < level1; i++) o1 += w1 * h1, w1 >>= 1, h1 >>= 1; 

This is where things get interesting.

Yesterday I received an e-mail from Marvin Reza. He pointed out that the for loops are in fact finite geometric sums:

$$offset=\frac{1}{4}+\frac{1}{16}+\frac{1}{64}...$$

This sum can be calculated as

$$\sum_{k=0}^{n}r^k=\frac{1-r^n}{1-r}$$

where $r$ is (in our case) 0.25, and $n$ is the desired MIP level. In CUDA (or OpenCL, or C) we would write:

o = width * height * (1.0f - powf( 0.25f, level0 ) / 0.75f);

I had two issues with this suggestion. The first is that this function is performing so many memory operations that surely a minor optimization to a for loop is not going to speed it up in any measurable way. The second is that a powf function is considered to be so expensive that surely it is not any faster than a simple for loop.

However… it does in fact make a measurable difference. And we’re not even done yet.

Marvin points out that the powf( 0.25, level ) can be replaced by exp2f( -2 * level ), which is slightly faster still. But we can take it one step further.

In IEEE floating point representation, $2^n$ is special. Have a look at this table:

2^-1    0.500000    1056964608     111111000000000000000000000000
2^-2    0.250000    1048576000     111110100000000000000000000000
2^-3    0.125000    1040187392     111110000000000000000000000000
2^-4    0.062500    1031798784     111101100000000000000000000000
2^-5    0.031250    1023410176     111101000000000000000000000000
2^-6    0.015625    1015021568     111100100000000000000000000000

The third column contains the ‘unsigned integer’ representation of the 32-bits used to store the floating point number in the second column. The fourth column shows the binary representation of this number. One thing is immediately clear: all the action happens in the top bits. In an IEEE 32-bit floating point number, this is the exponent of the number. The remaining bits store the mantissa.

The exponents for the six powers of two are obtained when we shift the integer number 23 bits to the right. They are: $126, 125, 124, 123, 122$ and $121$.

In the above table we went from float to integer representation. We can however also go in the opposite direction. If we want to store $2^{-n}$, we can store (127-n)<<23 in an unsigned integer, and reinterpret this as a floating point number. And this is much faster than a powf or an exp2f.

In CUDA:

value = __uint_as_float( (127 - n) << 23 );

The whole FetchTexelTrilinear now becomes:

float4 FetchTexelTrilinear( float lambda, float2 uv, int offset, int width, int height )
{
int level0 = 0, level1 = 0;
float f = 0;
if (lambda >= 0)
level0 = min( MIPLEVELCOUNT - 1, (int)lambda ),
level1 = min( MIPLEVELCOUNT - 1, level0 + 1 ),
f = lambda - floor( lambda );
// as proposed by Marvin Reza, slightly faster
float scale = (float)(width * height) * 1.3333333333f;
int o0 = offset + (int)(scale * (1 - __uint_as_float( (127 - 2 * level0) << 23 )));
int o1 = offset + (int)(scale * (1 - __uint_as_float( (127 - 2 * level1) << 23 )));
float4 p0 = FetchTexel( texCoord, o0, width >> level0, height >> level0 );
float4 p1 = FetchTexel( texCoord, o1, width >> level1, height >> level1 );
// final interpolation
return (1 - f) * p0 + f * p1;
}

Note that this exercise exposed a bug where lambda could be negative. This is countered in the optimized version of the code.

Many thanks to Marvin for the initial suggestion and for triggering this experiment!

## 8 comments for “Optimizing Trilinear Interpolation”

1. Jaewon Jung
April 10, 2020 at 6:57 pm

Thanks for the nice article!

I think you meant “undersampling” in the paragraph of “For oversampling we have another technique…”?

• jbikker
April 10, 2020 at 7:01 pm

You are right! Fixed, thanks.

• Dithermaster
June 2, 2020 at 4:25 pm

Great article, thanks.

There is a second instance in that same paragraph that should change: “When we detect oversampling…”

• jbikker
June 2, 2020 at 5:39 pm

Thanks! I fixed it.

2. Rustam
April 11, 2020 at 6:00 am

Nice!
Any performance measurements and comparison?

3. April 14, 2020 at 8:52 am

Do you have a performance comparison table?

• jbikker
April 14, 2020 at 8:57 am

No, sorry. The exact impact of the changes is complex: it depends on the number of texture reads (texture layers, normal maps, specularity map etc.) and other factors, which is why I merely claimed a speedup. In practice the impact is not going to exceed 1%, unless shading is very simple and texture layers are many.

4. 