OPT#2:LowLevel

Welcome back to this series of posts on Optimization, which accompany online lectures for the IGAD program of the Breda University of Applied Sciences.

In the first post on the Art of Software Optimization I presented the ten steps of a structured approach to optimization:

  1. Determine optimization requirements.
  2. Profile: determine hotspots.
  3. Analyze hotspots: determine scalability.
  4. Apply high level optimizations to hotspots.
  5. Profile again.
  6. Parallelize / vectorize / use GPGPU.
  7. Profile again.
  8. Apply low level optimizations.
  9. Repeat step 7 and 8 until time runs out.
  10. Report.

A vital part of this process is profiling. I used profiling to find out that the sorting code of the demo application scaled poorly, which was countered with a high level optimization: BubbleSort was replaced by QuickSort, which has a much better algorithmic complexity. After that, more profiling revealed that two lines in a function that draws scaled sprites are now responsible for the bulk of the runtime of the application.

The starting point for this post is the demo project used in part 1, but this time with 65536 sprites, and QuickSort instead of BubbleSort. Download the Visual Studio 2019 project by clicking here.

After profiling the code with VerySleepy CS for about 27 seconds, the following results are shown:

The two lines that take 11.61 and 10.10 seconds of the total runtime can be found in surface.cpp, lines 429 and 430:

Pixel color = src[u + v * m_Pitch];
if (color & 0xffffff) 
    a_Target->GetBuffer()[a_X + x + ((a_Y + y) * a_Target->GetPitch())] = color;

Before we attempt to optimize this code, let’s take a moment to see what happens in Sprite::DrawScaled. The method draws a sprite to the screen, taking into account transparency. The sprite is drawn to a rectangular area on the screen that may differ from the size of the sprite; this lets us draw the sprite image with an arbitrary scale.

In the above image, a 16×16 Mario sprite is drawn twice. The small version is 8×8 pixels, which means that four pixels of the sprite need to be squeezed into 1 pixel on the screen. We could do that by averaging 2×2 pixels. Sprite::DrawScaled uses a simpler approach: it will just skip every other row and column.

We thus write to every pixel in a square, but we don’t necessarily read from every pixel of the sprite image, except when we enlarge the sprite: in that case we may in fact read every pixel multiple times. This is how DrawScaled operates: the two loops cover the target rectangle, and for each pixel in this rectangle, the position of a pixel in the sprite is calculated.
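In code, the mapping boils down to something like the sketch below (a simplified version; the variable names for the scale factors are assumptions, and the real method in surface.cpp also handles clipping and the transparency test):

for (int y = 0; y < a_Height; y++)      // loop over the target rectangle
    for (int x = 0; x < a_Width; x++)
    {
        int u = (int)((float)x * ((float)m_Width / a_Width));    // source column for this target pixel
        int v = (int)((float)y * ((float)m_Height / a_Height));  // source row for this target pixel
        Pixel color = src[u + v * m_Pitch];
        // write 'color' to target pixel (a_X + x, a_Y + y), skipping transparent pixels
    }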

With that in mind, let’s try to make the code faster. The question, however, is: why is this code slow? There are several options:

  1. Line 429 reads from memory. Memory is slow.
  2. Line 430 writes to memory.
  3. There is conditional code on line 430.
  4. There is a function call on line 430.
  5. Line 430 uses a lengthy calculation.

Many of these points require knowledge about the hardware to understand why they would be problematic. But the worst thing is: it is not even certain that lines 429 and 430 are the problem… Let’s stop for a moment, and gather some information before we continue.

x86 Assembler in 5 minutes

The C++ code that we are trying to optimize will be executed by an x86 processor, which expects x86 machine code, which is produced by the C++ compiler. It’s easy to take a peek under the hood.

To obtain the disassembly shown here, compile the program in ‘Debug’ mode and place a breakpoint on line 428. Once the breakpoint is triggered, right-click line 428, and select ‘Go To Disassembly’.

To understand what happens here we need a crash course in x86 assembler.

The first x86 CPU was the Intel 8086, which was introduced in 1978. This was a 16-bit processor, with eight general purpose 16-bit registers:

AX, the accumulator;
BX, the base register;
CX, the counter;
DX, the data register;
BP, the base pointer;
SI, or source index;
DI, or destination index;
and SP, the stack pointer.

The first four registers can also be addressed per byte: AH and AL, BH and BL, and so on.

In 1985 the 32-bit Intel 80386 processor was introduced. This processor is fully compatible with the 16-bit version, but sports eight 32-bit registers: EAX, EBX, ECX, EDX, EBP, ESI, EDI and ESP, which overlap the original 16-bit registers. An optional floating point co-processor, the 80387, adds eight 80-bit registers for floating point calculations, named st0..st7.

After the 32-bit processors, AMD introduced the x86-compatible ‘x86-64‘ processors, with 64-bit registers: RAX, RBX and so on, as well as eight new 64-bit general purpose registers, named R8..R15. In terms of registers, a final important addition came with SIMD support, which added eight 128-bit registers, xmm0..7 (sixteen in 64-bit mode), and more recently sixteen 256-bit registers, ymm0..15. AVX-512 CPUs finally have thirty-two 512-bit registers, named zmm0..31.

An example of some x86 assembly code is shown below.

loop:
    mov eax, [0x1008FFA0]   // read from address into register
    shr eax, 5              // shift eax 5 bits to the right
    add eax, edx            // add registers, store in eax
    dec ecx                 // decrement ecx
    jnz loop                // jump if not zero
    fld [esi]               // load from address [esi] onto FPU
    fld st0                 // duplicate top float
    faddp                   // add top two values, pop the stack

Every line has a simple instruction, which reads something from memory into a register, writes to memory, operates on a pair of operands, or continues execution at a new location, perhaps based on a condition.

The above code is not too hard to read. With this knowledge, let’s take another look at the assembly that was produced for the C++ code:

The first instruction converts an integer (x, stored 38h bytes, 56 in decimal, from the base of the stack frame) to a floating point number in register xmm0 (the ‘ss’ suffix stands for scalar single-precision). The second line multiplies this number by the scale variable that the snippet calls ‘whatever’. The result must be stored in the integer variable u, so the third line converts the multiplication result back to an integer. Finally, the result of the conversion, in register eax, is written to a local variable on the stack.

Subsequent lines of C++ code yield more complex assembly, but it is not too hard to make sense of it. Things change drastically however when we switch from debug to release.

When looking at the assembly produced in release mode, the first thing that becomes clear is that there is now much more assembly code. The code for individual lines also appears to be strangely interleaved, and a lot of code is repeated.

It turns out that the compiler has optimized the code in several ways:

  1. The content of the loop is being repeated. This is known as loop unrolling; it reduces the overhead of the loop itself.
  2. Instructions are reordered. This increases separation of instructions that depend on results of earlier instructions, and helps the CPU to execute multiple instructions per clock cycle.

Running the release code shows that it is indeed much faster than the debug code. It is clear that the compiler knows things about CPUs that we don’t. And although it is nice that the compiler automates a lot of the optimization, it would be nicer if we knew how to help the compiler produce the fastest code possible.

“We need to go deeper”

The execution of instructions by a CPU is typically divided into four phases:

  1. Fetch
  2. Decode
  3. Execute
  4. Writeback

In phase 1, an instruction is fetched from memory. The instruction is encoded in a number of bits, which must be decoded in phase 2. In phase 3, the instruction is executed. The result is written back to memory or a register in phase 4, and is then available as input for subsequent instructions.

The above diagram shows this flow, with the ‘Execute’ stage highlighted. Now imagine that each stage takes one cycle: in that case, each instruction takes four cycles. During each phase, only a part of the CPU is actually active: the Fetch hardware goes unused when an instruction is being decoded, for example.

Now consider the following improved CPU model:

This is a pipeline: during each cycle, one instruction is fetched, one decoded, one executed and one result is written back. All parts of the CPU are now making themselves useful during each cycle, and the throughput of the CPU is (in theory) four times higher.

This improvement comes at a price. Imagine that the second instruction needs the result from the first instruction. The first instruction still needs to complete three phases, so this time the second instruction gets delayed: it must be halted until the input it requires is available. This is called a bubble in the pipeline. Compilers prevent bubbles by reordering code, as we have seen. If instruction three does not depend on instruction one, two and three can perhaps be swapped. This is also good to know for us programmers: we should write code with fewer dependencies, or perhaps somehow fill up bubbles with useful work (which we then essentially get done for free).
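A small illustration of what ‘fewer dependencies’ means in practice (this example is mine, not from the demo): summing an array with a single accumulator forms one long dependency chain, while several independent accumulators give the pipeline independent work to overlap.

float SumChained( const float* a, int n )     // every add waits for the previous one
{
    float sum = 0;
    for (int i = 0; i < n; i++) sum += a[i];
    return sum;
}

float SumInterleaved( const float* a, int n ) // assumes n is a multiple of 4
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;     // four independent chains
    for (int i = 0; i < n; i += 4)
    {
        s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}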

Modern CPUs use far more than four stages in the pipeline. The reason for this is clock speed. The rate at which the CPU can run is essentially dictated by its slowest (most complex) stage. If we make individual stages simpler, we can increase the frequency, and thus improve throughput. Like this:

Of course, with longer pipelines, the impact of dependencies increases. Another problem also grows: the cost of branches. Imagine that we have a ‘goto’ statement somewhere in our code. To fill the pipeline, we can just follow the ‘goto’. But what if the jump is conditional, and the condition is still being evaluated?

This special dependency is handled by branch prediction hardware in the CPU. Instead of waiting for the condition, the CPU will ‘guess’ if the conditional jump will be taken or not. If it guesses wrong, the pipeline is full of irrelevant instructions, and must be reset – at a significant cost. When conditional jumps are unpredictable, this happens frequently. This can become a major source of inefficiency in a program.

The programmer can counter this by reducing conditional code. And, if the use of conditional code cannot be prevented, it should at least be predictable.
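A tiny example of removing a hard-to-predict branch (the bit trick assumes a 32-bit int and an arithmetic right shift, which holds on x86 compilers but is not guaranteed by the C++ standard):

int ClampToZeroBranchy( int v )    // mispredicts when the sign of v is random
{
    if (v < 0) v = 0;
    return v;
}

int ClampToZeroBranchless( int v ) // same result, no branch: the mask is all-ones for v >= 0
{
    return v & ~(v >> 31);
}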

In the last diagram the Execute stage was represented by three sub-stages. In real scenarios this presents another source of inefficiency. Some instructions need completely different execution logic than others: dividing a floating point number has no overlap with loading an integer from memory, for instance.

So, when a floating point instruction is being executed, a significant portion of the CPU is idling. But what if we execute a different instruction in the same cycle? This must obviously be a different type of instruction, which increases the pressure on the compiler to produce a suitable stream of instructions. And, if we execute several instructions per cycle, we must also fetch and decode several instructions per cycle:

This yields the mighty superscalar pipeline, which is commonly used in modern processors.

Let’s take a step back. A modern CPU executes multiple instructions per cycle, if these instructions are not all of the same type, and if these instructions do not depend on recently started instructions. The pipeline is filled, even with instructions after a jump, if the jump destination is correctly guessed. In all other cases, the machinery comes to a halt, and severe latencies occur. The compiler attempts to minimize dependencies, and optimizes the instruction mix, but it could use some help.

Rules of Engagement

It is time to introduce some generic tools for low level software optimization. These are meant to kick-start your optimization effort. Many of these involve the concepts I discussed in the previous paragraphs. I refer to these tools as: “The Rules of Engagement”.

Rule number one:

  1. Avoid Costly Operations.

This may seem obvious. But what is a ‘costly operation’? As you dive deeper into optimization, you will quickly gain an intuition for this. Here is a quick overview, roughly ordered from cheap to expensive:

<< , >> : bit shifting
+ - & | ^ : simple arithmetic, logical operators
* : multiplication
/ : division
sqrt : square root
sin, cos, tan : trigonometric functions
pow, log, exp : power, logarithm, exponential

On most processors, bit operations are very cheap (as in: they take one cycle or less on average). Sometimes they are completely free. There are many ways to exploit this. For example: multiplying by a power of 2 can be replaced by a bitshift: v = v << 3 has the same effect as v *= 8 (assuming v is an integer). Likewise, v >>= 4 is the same as v /= 16 for unsigned values. Your compiler also knows this, so it is actually safe to write v /= 16. But what about v *= 9? A speedy replacement is: v += v << 3, which gets rid of the multiplication.
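Spelled out in code (for unsigned integers; the compiler will do the power-of-two cases by itself):

unsigned int v = 100;
v = v << 3;         // same as v *= 8
v = v >> 4;         // same as v /= 16 (for unsigned values)
v = (v << 3) + v;   // same as v *= 9: 8 * v + v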

Looking at the list above, ‘pow’ is pretty bad. So, please replace y=pow(x,5) by y=x*x*x*x*x! It looks ugly, but the result is much faster.

The point here is: once you know which operations may be costly, you can start looking for alternatives, carefully balancing the required precision, code versatility and readability, and performance.

  2. Precalculate.

If there is no fast alternative to a certain operation, you may be able to use a lookup table. Take sine and cosine, for example: on many processors, these are expensive to evaluate. But a 360-entry table (one entry per degree) lets you fetch a value at the cost of a (hopefully cached) memory read. And some interpolation gets you the values in between, albeit at a slight loss of accuracy.
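A minimal sketch of such a table (the 360-entry, one-degree resolution follows the text; the interpolation and the function names are my own):

#include <cmath>

static float sinTable[360];

void InitSinTable()
{
    for (int i = 0; i < 360; i++) sinTable[i] = sinf( i * 3.141592653f / 180.0f );
}

float FastSin( float degrees )      // expects degrees in [0, 360)
{
    int i0 = (int)degrees, i1 = (i0 + 1) % 360;
    float frac = degrees - (float)i0;
    return sinTable[i0] + frac * (sinTable[i1] - sinTable[i0]);
}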

A special case of precalculation is loop hoisting. Here is the inner loop of DrawScaled again:

for (int x = 0; x < a_Width; x++)
{
   int u = (int)((float)x * whatever);
   Pixel color = src[u + v * m_Pitch];
   if (color & 0xffffff) 
      a_Target->GetBuffer()[a_X + x + ((a_Y + y) * a_Target->GetPitch())] = color;
}

Here, the address of a pixel in array src is calculated as u + v * m_Pitch. But, inside the inner loop, v is not changing, and neither is m_Pitch. So, we can precalculate the result of v * m_Pitch outside the inner loop, and use the precalculated value inside the loop. And, as you can see, there are several other opportunities for loop hoisting in this short code segment.
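For illustration, the same loop with the invariants hoisted out could look like this (a sketch; the surrounding function and the variable names are assumed unchanged):

Pixel* line = src + v * m_Pitch;                                              // hoisted: constant per row
Pixel* dst = a_Target->GetBuffer() + a_X + (a_Y + y) * a_Target->GetPitch();  // hoisted: constant per row
for (int x = 0; x < a_Width; x++)
{
    int u = (int)((float)x * whatever);
    Pixel color = line[u];
    if (color & 0xffffff) dst[x] = color;
}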

Precalculation can go much further though. We will see just how far in a few minutes.

  3. Pick the Right Data Type.

In C and C++, you can store numbers in many ways: in integers, unsigned integers, floats, doubles, (unsigned) shorts and (unsigned) chars. So which data type should you use in practice? The answer is: it depends.

Imagine you have a positive number, and you know that it will never exceed 255. You now have the option to store it in an unsigned char, so the CPU can keep it in an 8-bit register (AL or AH, for example). There is a problem with that approach: even though x86-compatible CPUs still allow access to AH and AL (the low bytes of the 64-bit RAX register), partial register access is a legacy feature and not particularly fast. A compiler will therefore often emulate the byte behavior using a full register. This requires some extra operations compared to regular 32-bit or 64-bit numbers: summing 1 and 255 in a byte yields 0, due to overflow, and this wrap-around must now be mimicked. What it boils down to is that operating on bytes is more work than operating on register-sized numbers.

There is another consideration when picking a data type, and that is type conversion. We have seen before that a simple line like int u = x * whatever involves multiple type conversions if ‘whatever’ happens to be a float value. The type conversion itself is not free, and may add significant overhead to the calculation. Ensuring that all variables are either int or float would thus improve performance.

And finally, it is beneficial to have both integer code and floating point code in the same block. This allows the compiler to interleave instructions for the superscalar pipeline. A long loop that operates exclusively on integers will have part of the CPU idling (the floating point execution units), which is a waste of compute power.

  4. Avoid Conditional Branches.

The description of the (superscalar) pipeline should have made clear why conditional code is bad. But you may not always be aware of conditionals. Examples of conditional code:

if (x==y) { Function1(); } else { Function2(); }

This is obvious conditional code. Note that if x in practice always equals y, the conditional code is cheap to evaluate: the CPU will quickly learn and predict the correct program flow.

a = a > b ? a : b;

This is the so-called ternary operator. It is often hidden: a = max(a,b).

More examples of conditional code:

switch (value)
{ 
case 1: Function1(); break;
case 2: Function2(); break;
} 
virtual float GetX() { return x; }
for( int x = 0; x < 10; x++ ) { Function1( x ); }
do { Function1(); } while (x > 0);

Fast code does not use conditionals, or uses mostly predictable conditions. Reaching this state is not always easy. Sometimes lookup tables help. Sometimes you can split a loop into multiple parts to get rid of a condition inside the loop. And, like the compiler did, loop unrolling helps: a for loop includes the evaluation of a conditional, so fewer iterations mean fewer conditionals.
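A small example of splitting a loop to remove a condition (an invented example, not from the demo): the first iteration is special, so handle it outside the loop and the test disappears from every remaining iteration.

void Delta( const int* in, int* out, int n )   // assumes n > 0
{
    // before: for (int i = 0; i < n; i++) out[i] = (i == 0) ? in[0] : in[i] - in[i - 1];
    out[0] = in[0];
    for (int i = 1; i < n; i++) out[i] = in[i] - in[i - 1];
}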

  5. Early Out.

This one is easily overlooked. Consider the following code snippet:

char a[] = "abcdefghijklmnopqrstuvwxyz", c = 'p';
int position = -1; 
for ( int t = 0; t < strlen( a ); t++ )
{
    if (a[t] == c)
    {
        position = t;
    }
}

This code finds a character in a string. In this case we look for ‘p’, which we find in loop iteration 16. But after that, the loop keeps searching. The fix is simple: add a break statement as soon as the correct answer is found. And while we are at it: the function strlen(a) is called every time the loop condition is evaluated. This is prevented by precalculating the length of string a: an example of loop hoisting.
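With both fixes applied, the snippet becomes:

char a[] = "abcdefghijklmnopqrstuvwxyz", c = 'p';
int position = -1;
int length = (int)strlen( a );      // hoisted out of the loop condition
for (int t = 0; t < length; t++)
{
    if (a[t] == c)
    {
        position = t;
        break;                      // early out: we have the answer
    }
}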

‘Early Out’ can be exploited in the DrawScaled function. We are scaling very specific sprites: images of red, green, blue and yellow balls. That means that once we have found the right side of the ball image, we can stop: there will be no opaque pixels further to the right. Of course, such a change makes function DrawScaled significantly less versatile. But, if the goal is to make the application faster for the current input, this may be considered acceptable.

  6. Use the Power of Two.

A division by a power of two can be replaced by a bitshift, as we have already seen. But there are more situations where powers of two shine. Consider the following array:

float a[100][100];

Accessing elements in this array requires a multiplication: cell 10,10 for instance is located at position 10 of row 10, so 10 times the width of the table plus 10. Now, if we would slightly resize the table:

float a[100][128];

This time, the width is a power of two. Cell 10,10 is now located at a + 10 * 128 + 10, which is the same as a + (10 << 7) + 10. The result is that every access of a cell in this table is now faster.

Note that we are wasting some memory here. On the average PC, this is not an issue. You probably have 8GB or more, and chances are you never wrote a program that used more than 1GB. Feel free to use memory liberally!

  7. Do Things Simultaneously.

The final Rule of Engagement is: Do Things Simultaneously. That means: use all cores. Maximize the use of all cores, in fact. And once all cores are running at peak performance, turn to the GPU, and make that chip work at peak performance as well. But before we get to multithreading and GPGPU, we need to consider doing things simultaneously on a single core. This is what we will do when we start using SIMD hardware, which will be discussed in detail in a later blog post.

Summarizing, the Rules of Engagement:

  1. Avoid Costly Operations
  2. Precalculate
  3. Pick the Right Data Type
  4. Avoid Conditional Branches
  5. Early Out
  6. Use the Power of Two
  7. Do Things Simultaneously

With these, you should be able to make a decent start.

Blind Stupidity

Let’s bring back the inner loop of DrawScaled once more:

for (int x = 0; x < a_Width; x++) 
{
    int u = (int)((float)x * whatever);
    Pixel color = src[u + v * m_Pitch];
    if (color & 0xffffff)
        a_Target->GetBuffer()[a_X + x + ((a_Y + y) * a_Target->GetPitch())] = color;
}

With the Rules of Engagement in mind, there’s a lot we can do now (a combined sketch follows this list):

  • We could store ‘x’ as an integer, emulating floating point logic using fixed point math, to evade the type conversions.
  • We can loop hoist v * m_Pitch, and (a_Y + y) * a_Target->GetPitch().
  • In fact, we can calculate the destination address at the start of a line of pixels and increment it by 1 at the end of the loop iteration.
  • We can ‘early out’ on the right side of a ball.
  • We can preprocess the ball images to remove the alpha channel, so that ‘color & 0xffffff’ is reduced to just ‘color’.
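To give an idea, a sketch that combines the first three bullets could look like this (the 16.16 fixed point format is an assumption, and ‘whatever’ is treated as the constant scale factor from the snippet above):

int fpStep = (int)(whatever * 65536.0f);                                      // 16.16 fixed point step
Pixel* line = src + v * m_Pitch;                                              // hoisted
Pixel* dst = a_Target->GetBuffer() + a_X + (a_Y + y) * a_Target->GetPitch();  // hoisted
for (int x = 0, fpU = 0; x < a_Width; x++, fpU += fpStep)
{
    Pixel color = line[fpU >> 16];                // u = (int)(x * whatever), without float math
    if (color & 0xffffff) dst[x] = color;
}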

But there is that nagging feeling… What are we doing?

There is a famous quote by William Wulf that goes like this:

More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason – including blind stupidity.

The problem is this: we are drawing scaled sprites. The scale is uniform, so the width equals the height, in this demo. Width and height are passed as integers, and the number of unique sizes is quite limited. Certainly no ball is larger than 64 pixels, or smaller than 5 or so. So, why don’t we simply precalculate scaled sprites? This gets rid of the scaling logic altogether, which surely beats optimizing it…
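A rough sketch of that idea (the exact Sprite and Surface interface of the template is glossed over here, so treat the constructor and method calls as assumptions): at startup, render each ball once for every size we need, and at draw time simply blit the prescaled copy.

Sprite* scaledBall[4][65];                       // 4 colors, sizes up to 64
void InitScaledBalls( Sprite* ball[4] )
{
    for (int c = 0; c < 4; c++) for (int s = 5; s <= 64; s++)
    {
        Surface* surface = new Surface( s, s );
        surface->Clear( 0 );                     // transparent background
        ball[c]->DrawScaled( 0, 0, s, s, surface );
        scaledBall[c][s] = new Sprite( surface, 1 );
    }
}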

Going to Extremes

With the prescaled ball sprites we can now render a massive amount of spheres in real-time.

But what if we want to go further? There is always a faster way… How about:

  • We can get rid of the sorting. There are in fact two ways to do this: the first is using a kD-tree, the second a z-buffer. I’ll leave this as an exercise for the reader.
  • Considering the density of the dot sphere, we can simply skip the back side. And perhaps the inside as well. So we can just render a thin shell.
  • Pixels on the screen are 32-bit, and so is the sprite. Using a palettized display greatly reduces data transport.
  • Perhaps we can draw two pixels at once, with 64-bit writes? Or maybe even more?

Of course, at some point you need a 4k screen to benefit from further performance gains. But just for the fun of it, how far can we go?

Going Insane

I remember an old trick used on the Amiga, where a regular loop with a variable number of iterations was unrolled.

Original:

for( int x = 0; x < N; x++ ) pixel[x] = 0;

Unrolled:

pixel[0] = 0;
pixel[1] = 0;
pixel[2] = 0;
pixel[3] = 0;
.
.
pixel[1023] = 0;

To set N pixels to 0, one would point ‘pixel’ at the right address, and then jump into the unrolled list so that exactly the last N assignments are executed. No loop condition, no jump after each iteration: this is blazingly fast.
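In portable C and C++ the same idea survives as ‘Duff’s device’: the switch jumps into the middle of an unrolled do-while, so the loop overhead is paid once per eight pixels rather than once per pixel. A hedged sketch (Pixel is assumed to be a 32-bit unsigned int, as in the template):

void ClearPixels( Pixel* pixel, int N )     // assumes N > 0
{
    int batches = (N + 7) / 8;
    switch (N % 8)                          // jump into the unrolled loop
    {
    case 0: do { *pixel++ = 0;
    case 7:      *pixel++ = 0;
    case 6:      *pixel++ = 0;
    case 5:      *pixel++ = 0;
    case 4:      *pixel++ = 0;
    case 3:      *pixel++ = 0;
    case 2:      *pixel++ = 0;
    case 1:      *pixel++ = 0;
               } while (--batches > 0);
    }
}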

We can apply this concept to ball plotting as well. Imagine we want to draw a yellow ball using a width and height of 1 pixel, at location (x, y). For that we don’t need a loop:

Pixel* a = GetBuffer() + x + y * m_Width;
a[0] = 0x00ffff00; // yellow

Note that the color is not read from a ball picture anymore; it is simply hardcoded. A larger ball can be drawn similarly, e.g. a 2×2 ball:

Pixel* a = GetBuffer() + x + y * m_Width;
switch (size)
{
case 1:
   a[0] = 0x00ffff00; // yellow 
   break;
case 2:
   a[0] = 0x00ffff00;
   a[1] = 0x00ffff00;
   a[1280] = 0x00ffff00;
   a[1281] = 0x00ffff00;
   break;
}

As a final optimization, each pair of 32-bit writes can be combined in a single 64-bit write. Note that in the above code the screen width (here: 1280) has been hardcoded as well. From here we can proceed, with all desired ball sizes, and with the correct colors obtained from the scaled ball sprites. I wrote a small program that emits the desired C code to do this. For 64 ball sizes in four colors, the result is a massive 6MB source file. But rendering using that code is extremely fast.
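For completeness, a sketch of such a 64-bit write (the cast assumes 8-byte alignment and glosses over strict-aliasing subtleties; memcpy would be the cleaner alternative):

uint64_t yellow2 = 0x00ffff0000ffff00ull;     // two yellow pixels packed into one 64-bit value
*(uint64_t*)&a[0] = yellow2;                  // replaces a[0] = ...; a[1] = ...;
*(uint64_t*)&a[1280] = yellow2;               // and the pair on the next screen line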

Closing Remarks

In the next post I will dive into the memory hierarchy, and the impact of caches on application performance. Until then: for questions, please email me at bikker.j@gmail.com, or follow me on Twitter: @j_bikker.
