OPT#3:SIMD (part 1 of 2)

This is the third article in the series of posts on Optimization, which accompany online lectures for the IGAD program of the Breda University of Applied Sciences. You can find the first post (on profiling) here. The second post, on low level optimization, introduced the Rules of Engagement, which can help when approaching the bottlenecks found using profiling. The last Rule, #7, is: do things simultaneously, but this was not discussed in detail.

Doing things simultaneously means: using multiple cores, or multiple processors, or a CPU and a GPU. However, before we start spawning threads, we can do things in parallel in a single thread, using a concept that is called instruction level parallelism (ILP).

ILP happens naturally in a superscalar CPU pipeline, where multiple instructions are fetched, decoded and executed in each cycle, at least under the right conditions. A second form of ILP is provided by certain complex CPU instructions that perform multiple operations. An example is fused multiply and add (FMA), which performs a calculation like a = b·c + d using a single assembler instruction. A particularly interesting class of complex instructions are the vector operations, which apply a single operation to multiple data. When applied correctly, single instruction multiple data (SIMD) can yield great speedups of four or eight times, and sometimes more. This doesn’t hinder multi-core processing either: whatever gain we obtain in a single thread using SIMD code scales up even further when we finally do spawn those extra threads.
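
In C, this operation is available as the standard library function fmaf, which computes b * c + d with a single rounding step; on FMA-capable hardware the compiler can map it to a single instruction, elsewhere it may fall back to a library call. A minimal fragment (b, c and d are assumed to be floats):

#include <math.h>

// fused multiply-add: b * c + d in one operation, rounded once
float a = fmaf( b, c, d );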

SIMD

To understand the concept of SIMD it helps to start with 32-bit unsigned integers. An unsigned integer consists of 4 bytes:

union
{
    unsigned int a4;
    unsigned char a[4];
};

In C/C++, a union is a mechanism that allows us to overlap variables in memory. In this case, unsigned int a4 and the array a[4] occupy the same bytes. This means that if we modify a4, we modify the array, and vice versa. This allows us to do some interesting things:

a[0] = 1;
a[1] = 3;
a[2] = 25;
a[3] = 100;
a4 *= 2;

The first four lines modify the individual bytes. The last line operates on all four bytes at the same time. The effect is interesting: multiplying a4 by 2 yields { 2, 6, 50, 200 } in the array. We effectively operated on four numbers using a single instruction.

This doesn’t seem particularly useful, but in fact it is. Let’s start with a little detour, and the following code snippet:

uint ScaleColor( uint c, float x ) // x = 0..1
{
     uint red = (c >> 16) & 255;
     uint green = (c >> 8) & 255;
     uint blue = c & 255;
     red = red * x, green = green * x, blue = blue * x;
     return (red << 16) + (green << 8) + blue; 
} 

This code takes a color, stored as ‘alpha-red-green-blue’ in the four bytes of a single unsigned integer, and scales the color by ‘x’, one color component at a time.

This code works, but it is very inefficient, because of the many implicit type conversions. To multiply red (an integer value) by x (a float value), red is promoted to a float value, then the multiplication is applied, and finally the result is converted back to an integer value. We can do better, using some basic fixed point math, where scale factor x is represented by an integer value in the range 0..255:

uint ScaleColor( uint c, uint x ) // x = 0..255
{
     uint red = (c >> 16) & 255, green = (c >> 8) & 255, blue = c & 255;
     red = (red * x) >> 8;
     green = (green * x) >> 8;
     blue = (blue * x) >> 8;
     return (red << 16) + (green << 8) + blue; 
} 

The ‘>> 8’ is a bitshift, which effectively divides a number by 256. This compensates for the scale factor, which is 256 times larger than it was before. The code now is devoid of float operations, and thus type conversions; as a consequence it runs (much) faster.
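
If the scale factor starts out as a float in the range 0..1, you would convert it to this fixed point range once, before any color operations. A minimal sketch (the variable name scale is mine):

// convert a 0..1 float scale factor to the 0..255 fixed point range, once
uint x = (uint)(scale * 255.0f);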

Interestingly, we can do even better. Behold the following code:

uint ScaleColor( const uint c, const uint x ) // x = 0..255
{
     uint redblue = c & 0x00FF00FF;
     uint green   = c & 0x0000FF00;
     redblue = ((redblue * x) >> 8) & 0x00FF00FF;
     green = ((green * x) >> 8) & 0x0000FF00;
     return redblue + green; 
} 

This time, the red and blue color components sit together in a single 32-bit value. Green is separated, in its own 32-bit value. This leaves an 8-bit gap between red and blue, where green used to be.

The gap allows us to multiply red and blue by the scale, using a single multiplication. If we multiply a number in the range 0..255 by another number in the range 0..255, the maximum value we can get is 255 × 255 = 65025, which fits in 16 bits.

The code exploits this. The red and blue bytes are multiplied by the scale, then divided by 256 using a bitshift, which moves each bit 8 positions to the right. Finally, we need to get rid of some stray bits, for which the & operation is used.

Compared to the original integer code, which used 7 bitshifts, 3 ands, 3 multiplications and 2 additions, the new code only uses 2 bitshifts, 4 ands, 2 multiplications and an addition. And we’re not even done. The final version is slightly more efficient:

uint ScaleColor( const uint c, const uint x ) // x = 0..255
{
     uint redblue = c & 0x00FF00FF;
     uint green   = c & 0x0000FF00;
     redblue = (redblue * x) & 0xFF00FF00;
     green = (green * x) & 0x00FF0000;
     return (redblue + green) >> 8; 
} 

This time we have 1 shift, 4 ands, 2 multiplications and 1 addition, for a total of 8 operations.

The point of this exercise is this: once we realize that an unsigned int is simply a collection of four bytes, ordinary integer operations become operations on these vectors of four bytes. The four bytes are operated on in parallel.

This sadly has limited use. Consider the following situation:

union { unsigned int a4; unsigned char a[4]; };
a[0] = 200;
a[1] = a[2] = a[3] = 1;
a4 *= 2;

After the multiplication, array a contains { 144, 3, 2, 2 } instead of { 400, 2, 2, 2 }. Since 400 doesn’t fit in a byte, a[0] keeps only the remaining 400 - 256 = 144, while the overflow spills into a[1], which thus becomes 3.

If we would want to work with larger numbers, or negative numbers, or if we wish to divide four values, this approach doesn’t work anymore. In an ideal world, we would like a bit more freedom:

  1. An ideal approach supports unsigned and signed integers as well as floats;
  2. It would be nice if we could also use more complex arithmetic such as division, multiplication, and square roots;
  3. Preferably, there should be no interference between the parallel streams, such as overflow spilling from one value into the next;
  4. It would be nice if things were a bit easier to use.

With vector instructions, we get all that. Well, except for one thing, of course.

x86 SIMD: SSE and AVX

In 1997, Intel introduced the Pentium MMX processor, and with it, the MMX instruction set, one of the first uses of SIMD in commodity hardware. Shortly after that, in 1998, Motorola introduced AltiVec, and in 1999, Intel released the Pentium III, with the SSE instruction set.

The SSE instruction set consists of 70 assembler instructions that operate on 128-bit vector registers. Rather than a single value, each of these registers stores 4 floats, or (since SSE2) 4 integers, 8 shorts, or 16 bytes.

In Visual Studio, the vector functionality is directly accessible in C/C++, in the form of the (unpronounceable) __m128 data type, which we will call, from now on, quadfloat. If the vector contains integers, the data type is __m128i, or quadint. Since a quadfloat is literally just a small set of floats, we can use a union to access the individual values, like we did before:

union
{
    __m128 a4;
    float a[4];
};

We operate on quadfloats and quadints using vector operations. For this, so-called intrinsics are provided: compiler-supported functions that (typically) translate to a single assembler instruction. A few examples:

__m128 a4 = _mm_set_ps( 1, 0, 3.141592f, 9.5f );
__m128 b4 = _mm_setzero_ps();
__m128 c4 = _mm_add_ps( a4, b4 ); // not: __m128 c4 = a4 + b4;
__m128 d4 = _mm_sub_ps( b4, a4 );

The first line stores four floats in a vector register; note that _mm_set_ps places its first argument in the highest lane of the register. The second line sets the four values in the vector to zero. Adding two vectors is done using _mm_add_ps, and subtracting using _mm_sub_ps. Note the naming convention: you are free to do this differently of course, but the ‘4’ suffix helps to identify variables that actually contain 4 values rather than 1. Also note that you can’t simply add vectors using e.g. a4 + b4, unless you wrap the quadfloats and quadints in a class that provides some overloaded operators. It’s all very low-level and very assembler-like, and that is because SSE intrinsics get you very close to the metal.
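
If you do want the a4 + b4 notation, a thin wrapper suffices. A minimal sketch (the quadfloat struct below is my own name, not part of SSE):

#include <xmmintrin.h>

// wrap __m128 so that vectors can be added with familiar operators
struct quadfloat
{
    __m128 v;
    quadfloat( __m128 m ) : v( m ) {}
    quadfloat operator+( const quadfloat& b ) const { return _mm_add_ps( v, b.v ); }
    quadfloat operator-( const quadfloat& b ) const { return _mm_sub_ps( v, b.v ); }
};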

As promised, SSE provides a pretty versatile set of operations:

__m128 c4 = _mm_div_ps( a4, b4 ); // component-wise division
__m128 d4 = _mm_sqrt_ps( a4 );    // four square roots
__m128 e4 = _mm_rcp_ps( a4 );     // four reciprocals
__m128 f4 = _mm_rsqrt_ps( a4 );   // four reciprocal square roots (!)

Divisions, square roots, reciprocals. And the best part: you get the result of four divisions in roughly the same time it takes to do one regular division. Four square roots: you often get these faster than a single scalar square root. The weirdest instruction is _mm_rsqrt_ps, which gets you four reciprocal square roots in just a few cycles, much faster than even a single square root! Why is that? Well, 1/√x is used very frequently, namely for normalizing vectors. For that reason, x86-compatible CPUs provide a fast approximate reciprocal square root instruction, which uses a hardware lookup table baked into your CPU that coughs up the answers at incredible speed. If ‘approximate’ is not good enough for you, you can easily refine the answer afterwards.
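
The usual refinement is a single Newton-Raphson step, which roughly doubles the precision of the approximation. A minimal sketch, assuming xmmintrin.h is included and x4 holds the four input values:

// refine the ~12-bit approximation from _mm_rsqrt_ps with one
// Newton-Raphson step: r' = 0.5 * r * (3 - x * r * r)
__m128 r4 = _mm_rsqrt_ps( x4 );
r4 = _mm_mul_ps( _mm_mul_ps( _mm_set1_ps( 0.5f ), r4 ),
                 _mm_sub_ps( _mm_set1_ps( 3.0f ), _mm_mul_ps( x4, _mm_mul_ps( r4, r4 ) ) ) );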

Beyond square roots and reciprocals, there is more good stuff:

a4 = _mm_max_ps( a4, b4 );
c4 = _mm_min_ps( a4, b4 );

Those dreaded min/max lines that typically yield conditional code are handled by actual instructions in SSE. And that means that if your code was slowed down by branch mispredictions, SSE can actually lead to speedups beyond 4x.
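
For example, clamping four values to a range, which in scalar code typically shows up as conditional code, becomes two branch-free instructions. A sketch (v4, lo and hi are assumed inputs):

// clamp four floats to [lo, hi] without a single branch
__m128 clamped4 = _mm_min_ps( _mm_max_ps( v4, _mm_set1_ps( lo ) ), _mm_set1_ps( hi ) );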

Particles

Let’s do some practical SIMD work. You can download the project by clicking here; it should compile out-of-the-box in Visual Studio 2019. When you run it, you see a slowly rotating backdrop, consisting of eight black holes.

Press space to start a stream of particles.

The application provides some basic timings for code blocks. Drawing the backdrop took 100.7 milliseconds on my machine. In that time, for each pixel, the gravity of the black holes is accumulated; if this value exceeds 1, a black pixel is plotted (an accurate simulation of the event horizon), otherwise a shade of blue. The code that produces the backdrop can be found in game.cpp on line 30:

void Game::BuildBackdrop()
{
     Pixel* dst = screen->GetBuffer();
     float fy = 0;
     for (unsigned int y = 0; y < SCRHEIGHT; y++)
     {
         float fx = 0;
         for (unsigned int x = 0; x < SCRWIDTH; x++)
         {
             float g = 0;
             for (unsigned int i = 0; i < HOLES; i++)
             {
                 float dx = m_Hole[i]->x - fx;
                 float dy = m_Hole[i]->y - fy;
                 float squareddist = (dx * dx + dy * dy);
                 g += (250.0f * m_Hole[i]->g) / squareddist;
             }
             if (g > 1) g = 0;
             *dst++ = (int)(g * 255.0f);
             fx++;
         }
         fy++;
     }
}

Note that this code has not been intentionally crippled for this example: it is pretty optimal as-is.

To apply SIMD to the code, we will be working with four numbers at a time. Looking at the loops, we have two options here: we can either work on four pixels at a time, or we can process the black holes in groups of four. Either option is fine, but we need to make a choice, so I will work with four black holes at a time. That means that the inner loop changes:

for (unsigned int i = 0; i < HOLES; i += 4) 
{
   // ...
}

Inside the loop, every scalar operation (producing a single result at a time) will be converted to a vector operation. Let’s start with the first line:

float dx = m_Hole[i]->x - fx; 

Before we translate this line to SSE code, let’s create a backup. This has two purposes: we can revert to the (correct) code later if needed, and, since the SSE code is going to be rather hard to read, it helps others to understand what we are doing.

// float dx = m_Hole[i]->x - fx;
__m128 dx4 = ... ;

The original line took the x-coordinate of a single black hole, and subtracted the x-coordinate of a pixel. The new line still operates on a single pixel, but on four black holes. So, m_Hole[i]->x becomes:

_mm_set_ps( m_Hole[i]->x, m_Hole[i + 1]->x, m_Hole[i + 2]->x, m_Hole[i + 3]->x )

Whatever we wish to subtract from that vector should be a vector as well. That means that the single float value fx must be promoted to a vector with four identical values:

_mm_set_ps( fx, fx, fx, fx )

Or, more briefly:

_mm_set1_ps( fx )

The whole calculation thus becomes:

// float dx = m_Hole[i]->x - fx;
__m128 dx4 = _mm_sub_ps( 
    _mm_set_ps( m_Hole[i]->x, m_Hole[i + 1]->x, m_Hole[i + 2]->x, m_Hole[i + 3]->x ),
    _mm_set1_ps( fx ) 
);

Note that _mm_set_ps fills the register starting at the highest lane, so the first black hole ends up in lane 3; since we will simply sum the four lanes at the end, the order does not matter here. The second line is converted in the same manner, but this time for the y-coordinates:
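
// float dy = m_Hole[i]->y - fy;
__m128 dy4 = _mm_sub_ps( 
    _mm_set_ps( m_Hole[i]->y, m_Hole[i + 1]->y, m_Hole[i + 2]->y, m_Hole[i + 3]->y ),
    _mm_set1_ps( fy ) 
);

The third line is now not hard to convert: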

// float squareddist = (dx * dx + dy * dy);
__m128 sd4 = _mm_add_ps(
    _mm_mul_ps( dx4, dx4 ),
    _mm_mul_ps( dy4, dy4 )
);

The final line, however, presents some challenges. The calculation itself is simple enough: expand 250.0f to a vector of four identical values using _mm_set1_ps, multiply by four gravity values which we put in a vector using _mm_set_ps, and divide the result by sd4. But what about g? In the original code, the gravity of eight black holes was added to g one by one, but now we produce four values at a time. For now, we can solve the problem by using a union.

// g += (250.0f * m_Hole[i]->g) / squareddist; 
union { __m128 temp4; float temp[4]; };
temp4 = _mm_div_ps(
    _mm_mul_ps(
        _mm_set1_ps( 250.0f ),
        _mm_set_ps( m_Hole[i]->g, m_Hole[i + 1]->g, m_Hole[i + 2]->g, m_Hole[i + 3]->g )
    ),
    sd4
);
g += temp[0] + temp[1] + temp[2] + temp[3];

Note that the actual addition to g is done using scalar code; since the vectors are really just four floats, this is a valid approach.

With all the lines of the inner loop converted to SSE, the code runs faster, but sadly, not by a lot. The reason is, in a way, once again type conversions: in order to work on vector data, we gather floats into vectors over and over again. We should apply some loop hoisting, and get all those conversions out of the inner loop. The black holes do not change position while the pixels are being plotted; we can thus simply prepare two quadfloats with the x-coordinates of the eight black holes, and likewise two quadfloats each for the y-coordinates and the gravity values.

__m128 hx4[2] = {
     _mm_set_ps( m_Hole[0 * 4 + 0]->x, m_Hole[0 * 4 + 1]->x,
                 m_Hole[0 * 4 + 2]->x, m_Hole[0 * 4 + 3]->x ),
     _mm_set_ps( m_Hole[1 * 4 + 0]->x, m_Hole[1 * 4 + 1]->x, 
                 m_Hole[1 * 4 + 2]->x, m_Hole[1 * 4 + 3]->x )
};
__m128 hy4[2] = {
     _mm_set_ps( m_Hole[0 * 4 + 0]->y, m_Hole[0 * 4 + 1]->y, 
                 m_Hole[0 * 4 + 2]->y, m_Hole[0 * 4 + 3]->y ),
     _mm_set_ps( m_Hole[1 * 4 + 0]->y, m_Hole[1 * 4 + 1]->y, 
                 m_Hole[1 * 4 + 2]->y, m_Hole[1 * 4 + 3]->y )
};
__m128 hg4[2] = {
     _mm_set_ps( m_Hole[0 * 4 + 0]->g, m_Hole[0 * 4 + 1]->g, 
                 m_Hole[0 * 4 + 2]->g, m_Hole[0 * 4 + 3]->g ),
     _mm_set_ps( m_Hole[1 * 4 + 0]->g, m_Hole[1 * 4 + 1]->g, 
                 m_Hole[1 * 4 + 2]->g, m_Hole[1 * 4 + 3]->g )
};

Calculating dx4 now reduces to:

// float dx = m_Hole[i]->x - fx;
__m128 dx4 = _mm_sub_ps( hx4[i / 4], _mm_set1_ps( fx ) );
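
Putting the pieces together, the converted inner loop now reads as follows (a sketch assembled from the fragments above; it assumes HOLES is a multiple of 4):

for (unsigned int i = 0; i < HOLES; i += 4)
{
    __m128 dx4 = _mm_sub_ps( hx4[i / 4], _mm_set1_ps( fx ) );
    __m128 dy4 = _mm_sub_ps( hy4[i / 4], _mm_set1_ps( fy ) );
    // float squareddist = (dx * dx + dy * dy);
    __m128 sd4 = _mm_add_ps( _mm_mul_ps( dx4, dx4 ), _mm_mul_ps( dy4, dy4 ) );
    // g += (250.0f * m_Hole[i]->g) / squareddist;
    union { __m128 temp4; float temp[4]; };
    temp4 = _mm_div_ps( _mm_mul_ps( _mm_set1_ps( 250.0f ), hg4[i / 4] ), sd4 );
    g += temp[0] + temp[1] + temp[2] + temp[3];
}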

This time, the code is almost twice as fast as the original. Still not the promised 4x, but that has to do with how we accumulate (and process) the gravity.

Closing Remarks

To apply SSE to your own code, you may find the following steps useful.

1. Locate a significant bottleneck in your code.

Writing SSE code is hard. Make sure the effort is worth it. Use a profiler to pinpoint particularly expensive code.

2. Keep a copy of the original code.

Whether you do this in comments, or in an #ifdef block, keeping the original code helps. You will thank yourself later, for example when you try to port to a platform that doesn’t support SSE, such as ARM (e.g. for Android).

3. Prepare the scalar code.

Add a loop that executes your code four times, so that you know beforehand that you have four data streams that can be processed in parallel.

4. Reorganize your data.

Make sure you don’t convert your data over and over again; have it in the correct (vector) format before you enter the code you wish to convert. Lots of gathers will reduce your gains to zero, or worse.

5. Make unions with floats.

Quadfloats can be unioned with floats. This lets you convert one line at a time, checking correct behavior at every step.

6. Convert one line at a time.

Verify functionality as you go.

7. Check MSDN for exotic SSE instructions.

There are some pretty weird instructions that may just solve your problem in the smallest number of cycles. Know your options.

This is part 1 of 2 of the introduction to SIMD. Part 2 is now also available, and discusses how to lay out your data in a fundamentally SIMD-friendly way, and how to handle conditional code.

Questions? Mail me: bikker.j@gmail.com, or follow me on Twitter: @j_bikker.
