OPT#4:SIMD (part 2 of 2)

In the previous post I introduced the concept of single instruction, multiple data using 4 bytes in a 32-bit unsigned integer, and using Intel’s SSE instructions. Using SSE instructions, a sequence of so-called vector instructions operates on four instances of a single scalar flow. Take the following simple example:

vec3 P1( 1, 1, 1 );
vec3 P2( 2, 2, 2 );
vec3 D = normalize( P2 - P1 );

This is an interesting example, because it tempts us to store actual vectors in quadfloats: e.g., __m128 p1_4 = _mm_set_ps( 1, 1, 1, 0 ). Doing so is obviously a waste: we only have three-component vectors. Things get worse when I tell you that Intel introduced 8-wide SIMD with the AVX instruction set in 2011:

__m256 A = _mm256_set1_ps( 2.0f );
__m256 B = _mm256_set_ps( 1, 2, 3, 4, 5, 6, 7, 8 );
__m256 C = _mm256_mul_ps( A, B );

As you can see, it works just like SSE, but this time we get eight floats per octfloat, the registers are now 256 bits rather than 128 bits, and the intrinsic names are even longer.

Applying AVX to the first code snippet really forces us to think differently about ‘vectorizing’ code. Let’s start with the scalar flow:

float P1x = 1, P1y = 1, P1z = 1;
float P2x = 2, P2y = 2, P2z = 2;
float Dx = P2x - P1x, Dy = P2y - P1y, Dz = P2z - P1z;
float len = sqrtf( Dx * Dx + Dy * Dy + Dz * Dz );
Dx /= len, Dy /= len, Dz /= len;

The functionality shown here is identical to the original functionality. This time however, every operation is operating on a scalar. The scalar flow exposes the work that is done under the hood, which in turn encourages a simple optimization, where we calculate the reciprocal of $len$, so we can replace the three divisions by one reciprocal and three multiplications.

The scalar flow is also the ideal starting point for writing SSE code. A proper vector flow consists of multiple scalar flows.

So, instead of normalizing one vector at a time, we should be normalizing four. If we can do that properly, the concept naturally extends to AVX, which we can use to normalize eight vectors at a time.

Triangles

Normalizing vectors is boring, so let’s bring in a more practical problem: ray/triangle intersection. This is an operation that is frequently used in rendering, but also in collision detection, line-of-sight queries and audio propagation.

The triangle is defined by its vertices, $v_1$, $v_2$ and $v_3$; the ray by its origin $O$ and (normalized) direction $D$. The intersection is defined by a single scalar $t$, which is the distance of the intersection along the ray. The intersection point itself is then defined as $I=O+t\cdot D$.

Intersecting a triangle efficiently is often done using the approach proposed by Möller and Trumbore:

bool Intersect( vec3 O, D, v1, v2, v3; out float u, v, t )
vec3 edge1 = v2 - v1, edge2 = v3 - v1;
vec3 h = cross( D, edge2 );
float det = dot( edge1, h );
if (det > -0.001 && det < 0.001) return false; // no hit
float inv_det = 1.0 / det;
vec3 s = O - v1;
u = dot( s, h ) * inv_det;
if (u < 0.0 || u > 1.0) return false; // no hit
vec3 q = cross( s, edge1 );
v = dot( D, q ) * inv_det;
if (v < 0.0 || u + v > 1.0) return false; // no hit
t = dot( edge2, q ) * inv_det;
return (t > 0);

Given a ray (in O and D) and a triangle the code yields a yes/no answer for the hit, as well as a distance $t$ and a position $u,v$ in barycentric coordinates for the hit.

To benefit from SIMD for this code we need to identify a scalar flow. And, we need to run this scalar flow in parallel: four times for SSE, eight times for AVX, or an arbitrary number of times if we want to anticipate arbitrary SIMD-widths (e.g. 16 for AVX512). We are clearly looking for data parallelism, i.e. executing the same code for different inputs. This happens to be the execution model of GPUs as well, but that is another story.

In the case of ray/triangle intersections, we have two options:

1. Intersect many rays with one triangle;
2. Intersect one ray with many triangles.

Both are valid: it’s unlikely that we have just one triangle to intersect, and it is also unlikely that we have just one ray. But, since we need a decision, let’s assume we have four rays, and one triangle. I will assume SSE hardware, but the idea extends naturally to AVX.

float e1x = v2x - v1x;
float e1y = v2y - v1y;
float e1z = v2z - v1z;
float e2x = v3x - v1x;
float e2y = v3y - v1y;
float e2z = v3z - v1z;
float hx = Dy * e2z - Dz * e2y;
float hy = Dz * e2x - Dx * e2z;
float hz = Dx * e2y - Dy * e2x;
float det = e1x * hx + e1y * hy + e1z * hz;
if (det > -0.001 && det < 0.001) return false;
float inv_det = 1 / det;
float sx = Ox - v1x;
float sy = Oy - v1y;
float sz = Oz - v1z;
float u = (sx * hx + sy * hy + sz * hz) * inv_det;
if (u < 0 || u > 1) return false;
float qx = sy * e1z - sz * e1y;
float qy = sz * e1x - sx * e1z;
float qz = sx * e1y - sy * e1x;
float v = (Dx * qx + Dy * qy + Dz * qz) * inv_det;
if (v < 0 || u + v > 1) return false;
float t = (e2x * Qx + e2y * Qy + e2z * Qz) * inv_det;
return if (t > 0); 

Here, operations on vec3 objects have been replaced by scalar operations. Dot products are now written out as the sum of three products. The cross products are written out as well.

We are now ready for the actual vectorization. The first lines calculate two edges of the triangle. Since each stream is intersecting a ray against the same triangle, we feed the same edges in each stream:

__m128 e1x4 = _mm_set1_ps( v2x - v1x );
__m128 e1y4 = _mm_set1_ps( v2y - v1y );
__m128 e1z4 = _mm_set1_ps( v2z - v1z );
__m128 e2x4 = _mm_set1_ps( v3x - v1x );
__m128 e2y4 = _mm_set1_ps( v3y - v1y );
__m128 e2z4 = _mm_set1_ps( v3z - v1z ); 

The next three lines pose a problem. The scalar code refers to Dx, Dy and Dz for the direction of a single ray. We can replace this by Dx4, Dy4 and Dz4 for four rays, but that requires an expensive ‘gather’ operation, e.g.:

__m128 Dx4 = _mm_set_ps( ray1.Dx, ray2.Dx, ... );

The _mm_set_ps instruction can become painfully slow, especially when the four rays are not in the same cache line: in that case, one instruction can cause four cache misses.

OAS versus SOA

We can prevent this by reorganizing our ray data. Imagine the array of rays looks like this:

struct Ray { vec3 O, D; };
Ray ray[256]; // 256 * 12 * 2 bytes = 6Kb

We call this an array of structures, or AOS. An alternative layout, which holds the same data and has the same size, is the SOA layout:

float Ox[256], Oy[256], Oz[256]; // 3x 1024 bytes
float Dx[256], Dy[256], Dz[256]; // 3x 1024 bytes, total 6Kb

This layout happens to be identical to this one:

__m128 Ox4[64], Oy4[64], Oz4[64]; // 3x 1024 bytes
__m128 Dx4[64], Dy4[64], Dz4[64]; // 3x 1024 bytes, total 6Kb

And that is data we can read efficiently in the intersection code. I have added the calculation of $det$, which is trivial now.

__m128 hx4 = _mm_sub_ps( _mm_mul_ps( Dy4, e2z4 ), _mm_mul_ps( Dz4, e2y4 ) );
__m128 hy4 = _mm_sub_ps( _mm_mul_ps( Dz4, e2x4 ), _mm_mul_ps( Dx4, e2z4 ) );
__m128 hz4 = _mm_sub_ps( _mm_mul_ps( Dx4, e2y4 ), _mm_mul_ps( Dy4, e2x4 ) );
_mm_add_ps( _mm_mul_ps( e1x4, hx4 ), _mm_mul_ps( e1y4 * hy4 ) ),
_mm_mul_ps( e1z4, hz4 )
);

Conditions

The next line to translate reads:

if (det > -0.001 && det < 0.001) return false;

This is a significant challenge. Scalar $det$ became vector $det4$. So, we need to figure out how to compare two vectors. But more fundamentally, what if one stream wants to ‘return’, while the others wish to continue? The answer is: the stream doesn’t return. Instead, we incapacitate it.

To see how that can be done, we return to unsigned integers once more. Consider the following code snippet:

unsigned int a = CalculateSomething();
if (a < 0) return;
a += Calculation2();

Let’s assume that the profiler indicates that the code is performing poorly due to branch mispredictions. In that case, we could rewrite the code to get rid of the branch:

unsigned int a = CalculateSomething();
bool valid = (a >= 1);
a += Calculation2() * valid;

The interesting thing about a boolean is that it can be used as a number: 0 for ‘false’, and 1 for ‘true’. And with that, the addition on the third line works as always if $valid$ is indeed true. But if $valid$ is false, the addition doesn’t change $a$ anymore, even though the code is executed. This masking of operations is what we will use in SSE / AVX code as well.

The SSE instruction set does in fact provide comparison instructions:

__m128 mask1 = _mm_cmpeq_ps( a4, b4 ); // equal
__m128 mask2 = _mm_cmpgt_ps( a4, b4 ); // greater than
__m128 mask3 = _mm_cmple_ps( a4, b4 ); // less or equal

The result of these operations is a 128-bit mask. The cmp instructions perform four comparisons, and store each result as a 32-bit mask: 11111111 11111111 11111111 11111111 for true, and 00000000 00000000 00000000 00000000 for false. A 128-bit mask can be used to ‘zero-out’ results, or to let results through unmodified. For that, we use the AND operator. Using unsigned ints and floats again:

union { float f; unsigned int i; };
f = 3.141593f;
i &= 0xFFFF;
i &= 0x0000;

As in the earlier examples, $f$ and $i$ occupy the same memory and use the same 32 bits, so changing $i$ changes $f$ and vice versa. After filling $f$ with 3.141593, $i$ contains gibberish. If we apply the bitwise AND operator to $i$ using 32 bits all set to 1, the result is whatever was in $i$, and therefore $f$ remains unmodified. If, however, we apply the bitwise AND operator using 32 bits all set to 0, the result is 0. And, interestingly, 32 bits set to 0 yield floating point number 0.0. The final line thus sets $f$ to 0.

We can now apply this to the condition in the intersection code.

if (det > -0.001 && det < 0.001) return false;

__m128 mask1 = _mm_cmple_ps( det, _mm_set1_ps( -0.001f ) );
__m128 mask2 = _mm_cmpge_ps( det, _mm_set1_ps(  0.001f ) );
__m128 combined = _mm_or_ps( mask1, mask2 );

Note that the conditions are inverted: we want to return if $det$ exceeds -0.001 and $det$ is smaller than 0.001; we thus want to keep the stream alive if $det$ is smaller or equal to -0.001, or if $det$ is greater or equal to 0.001. The combined mask, created on the third line, will have 32 bits set to 1 for each stream that is still active, or to 0 for inactive streams.

Now that we know which streams are still useful, we can just continue. There are two additional if statements, that will update the mask. On the last two lines everything comes together: the distance that we calculate is tested against 0 (we can’t have intersections behind the ray origin), and to this test we add the combined mask. Only valid distances, computed by active streams, may return ‘yes’.

The final vectorized code then becomes:

__m128 EPS4 = _mm_set_ps1( EPSILON );
__m128 MINUSEPS4 = _mm_set_ps1( -EPSILON );
__m128 ONE4 = _mm_set_ps1( 1.0f );
__m128 e1x4 = _mm_set_ps1( v2.x - v1.x );
__m128 e1y4 = _mm_set_ps1( v2.y - v1.y );
__m128 e1z4 = _mm_set_ps1( v2.z - v1.z );
__m128 e2x4 = _mm_set_ps1( v3.x - v1.x );
__m128 e2y4 = _mm_set_ps1( v3.y - v1.y );
__m128 e2z4 = _mm_set_ps1( v3.z - v1.z );
__m128 hx4 = _mm_sub_ps( _mm_mul_ps( dy4, e2z4 ), _mm_mul_ps( dz4, e2y4 ) );
__m128 hy4 = _mm_sub_ps( _mm_mul_ps( dz4, e2x4 ), _mm_mul_ps( dx4, e2z4 ) );
__m128 hz4 = _mm_sub_ps( _mm_mul_ps( dx4, e2y4 ), _mm_mul_ps( dy4, e2x4 ) );
__m128 det4 = _mm_add_ps( _mm_add_ps( _mm_mul_ps( e1x4, hx4 ), _mm_mul_ps( e1y4, Py4 ) ), _mm_mul_ps( e1z4, hz4 ) );
__m128 mask1 = _mm_or_ps( _mm_cmple_ps( det4, MINUSEPS4 ), _mm_cmpge_ps( det4, EPS4 ) );
__m128 inv_det4 = _mm_rcp_ps( det4 );
__m128 sx4 = _mm_sub_ps( ox4, _mm_set_ps1( v1.x ) );
__m128 sy4 = _mm_sub_ps( oy4, _mm_set_ps1( v1.y ) );
__m128 sz4 = _mm_sub_ps( oz4, _mm_set_ps1( v1.z ) );
__m128 u4 = _mm_mul_ps( _mm_add_ps( _mm_add_ps( _mm_mul_ps( sx4, hx4 ), _mm_mul_ps( sy4, hy4 ) ), _mm_mul_ps( sz4, hz4 ) ), inv_det4 );
__m128 mask2 = _mm_and_ps( _mm_cmpge_ps( u4, _mm_setzero_ps() ), _mm_cmple_ps( u4, ONE4 ) );
__m128 qx4 = _mm_sub_ps( _mm_mul_ps( sy4, e1z4 ), _mm_mul_ps( sz4, e1y4 ) );
__m128 qy4 = _mm_sub_ps( _mm_mul_ps( sz4, e1x4 ), _mm_mul_ps( sx4, e1z4 ) );
__m128 qz4 = _mm_sub_ps( _mm_mul_ps( sx4, e1y4 ), _mm_mul_ps( sy4, e1x4 ) );
__m128 v4 = _mm_mul_ps( _mm_add_ps( _mm_add_ps( _mm_mul_ps( dx4, qx4 ), _mm_mul_ps( dy4, qy4 ) ), _mm_mul_ps( dz4, qz4 ) ), inv_det4 );
__m128 mask3 = _mm_and_ps( _mm_cmpge_ps( v4, _mm_setzero_ps() ), _mm_cmple_ps( _mm_add_ps( u4, v4 ), ONE4 ) );
__m128 newt4 = _mm_mul_ps( _mm_add_ps( _mm_add_ps( _mm_mul_ps( e2x4, qx4 ), _mm_mul_ps( e2y4, qy4 ) ), _mm_mul_ps( e2z4, qz4 ) ), inv_det4 );
__m128 mask4 = _mm_cmpgt_ps( newt4, _mm_setzero_ps() );
__m128 mask5 = _mm_cmplt_ps( newt4, t4 );
t4 = _mm_blendv_ps( t4, newt4, combined ); 

Note that even something simple as returning ‘yes’ or ‘no’ suddenly is non-trivial in vectorized code. Instead, the above code updates the intersection distance if (and only if) a valid distance was found which is smaller than the previously found distance. And with that, everything is unconditional.

Conclusion

In this post I discussed several concepts:

1. The importance of parallel scalar flows as a starting point for vectorization;
2. Data layout: Array of Structures versus Structure of Arrays;
3. The concept of masking, which allows us to continue a stream unconditionally.

All this was applied to a practical bit of code.

With that, we conclude the topic of SIMD. If you have any questions, feel free to drop me an email at bikker.j@gmail.com, or follow me on Twitter: @j_bikker.

3 comments for “OPT#4:SIMD (part 2 of 2)”

1. Jaewon Jung
May 18, 2020 at 9:26 pm

Thank you for a nice article!

I am confused about the third to last line in the final code;
“__m128 mask5 = _mm_cmplt_ps( newt4, newt4 );”

Shouldn’t it be like “__m128 mask5 = _mm_cmplt_ps( newt4, t4 );” given your explanation right before the conclusion?

• jbikker
May 19, 2020 at 6:23 am

You are correct, thanks for spotting it!

This site uses Akismet to reduce spam. Learn how your comment data is processed.