This blog post serves as supplementary and/or backup material for the workshop on profiling, originally scheduled for delivery at the IGAD program of the Breda University of Applied Science, but you know… Corona. The original goal of the workshop was to introduce a series of blog posts on software optimization, something I have been teaching for quite some years now, first at the NHTV University of Applied Sciences, and more recently at the Utrecht University.

In this first post I will introduce the Art of Software Optimization. This introduction includes a high level overview of the optimization process, which allows us to perform optimization in a structured, deliberate way. I will also discuss the most important ingredient of successful software optimization: profiling.

AlphaGo Parallel, ELO rating 3140, running on 1202 CPUs and 176 GPUs.

Let’s start at the beginning. With today’s multi-teraflop GPUs and 64-core CPUs, who needs fast software? In software development, production efficiency and security is more important than performance, and hardware is cheaper than labor, so Java is more popular than C and Python beats C++. Despite this, some software still requires the super computer of the future, e.g. AlphaGo Parallel, which used a massive system to beat a human. Luckily, compute power continues to improve consistently over the years, which in fact provides us with a first reason to properly optimize our software. Doubling the performance of your program is about the same thing as running it on hardware that will be available in 18 months or so. Optimization thus allows us to look into the future.

XBMC running on a Raspberry Pi.

There is a second reason for software optimization: some software needs to run on pretty weak hardware. This could be an OS running on a smartwatch, or photo processing software running on the CPUs of an ancient mars rover. And, if you are Nikon, you may have an edge over your competitors if your digital camera gets away with a cheaper CPU: the price will be lower and the battery will last longer.

A third reason affects us all: waiting is annoying. Whether you are waiting for Windows to complete an unexpected update at the start of your once-in-a-lifetime presentation for an international audience (I’ve seen worse examples than what happened to Gabe Newell), or a massive file copy, or to get on a plane: we are impatient beings, we spend too much time sleeping and eating already so let’s not add months of waiting.

There is one more reason for software optimization, but it is subtle, and it will make more sense once we get to the end of this article. So instead, let’s try to define what optimization is. It is a lot of things: thinking like a CPU, for example. That means: being aware of instruction pipelines, latencies, dependencies, bandwidth, cycles… The image below is a 80486dx2/66. It may not mean much to you, but for me, it was the core of the first PC I put together, using components from 3 shops, and it was unbelievably fast. And this photo shows it’s naked machinery. That’s a thing of beauty. I own this thing. I want to know how that works, and I want to make it work for me. To do that, I should not teach it how to speak English; I should learn how to speak CPU.

So, thinking like a CPU. But optimization is also: work smarter, not harder. That means: smarter algorithms (e.g., reducing some O(n) code to O(log n)), but also: picking the right algorithm for a particular data set. Sorting a massive set of evenly distributed numbers requires a different sorting algorithm than a relatively small set of strings, for example. And finally: work smarter, not harder also means: be as accurate as necessary, but not more. If your square root function is accurate to the fifth decimal digit, but you clamp it to a whole number afterwards, chances are you can replace it by something faster.

And finally, optimization is: don’t assume, measure. For this we use a profiler, i.e. software that monitors performance characteristics of a program. Profiling should always steer the optimization effort: without it, we are blind.

So, profiling is:

  • Think like a CPU
  • Work smarter, not harder
  • Don’t assume, measure

The profiler should help us decide where to aim our efforts. This has to do with the Pareto Principle, aka the 80/20 rule: 80% of the effects come from 20% of the causes. Translated to and adjusted for software: 1% of your code takes 99% of the execution time. Understanding this is important for several reasons: first of all, profiling helps you to find this 1%, the “low hanging fruit”, where modifications to just a few lines of code can have massive impact on overall performance. Secondly, limiting your efforts to 1% of the code limits the impact on maintainability. Just imagine an optimized code base that is littered with inline assembly: the thought alone will discourage any project manager from even allowing you to speed up the code.

The profiler also helps us to execute optimization as a structured and deliberate process. The output of this process is (typically) a program that is 10-25x faster. We will review this structured process in the next section.

A Consistent Approach to Optimization

Without further ado, this is the consistent approach:

  1. Determine optimization requirements.
  2. Profile: determine hotspots.
  3. Analyze hotspots: determine scalability.
  4. Apply high level optimizations to hotspots.
  5. Profile again.
  6. Paralellize / vectorize / use GPGPU
  7. Profile again.
  8. Apply low level optimizations.
  9. Repeat step 7 and 8 until time runs out.
  10. Report.

The list actually starts with a step zero: complete the code. It does not make sense to optimize code that is not ‘done’: optimizations may make further changes harder, and the changes may make optimizations obsolete. “Premature optimization is the root of all evil”, as it was put in a controversial way by Donald Knuth.

After step zero we should establish our goals. This includes (a range of) target hardware, but also the desired performance level. Optimizing for 5x is not the same as optimizing for 100x; a vastly larger data set may put unforeseen strain on initially innocent code.

Our optimization requirements should also consider the time we have available to optimize. Assuming we optimize finished code, the optimization process probably needs to be squeezed in between QA and product delivery; chances are that your time budget is expressed in days, not weeks. This may affect our decisions later on: replacing a bad sorting algorithm by a decent one in one hour may be preferable to replacing it by the best one in four hours, because the decent one gives us three hours to spend on something more important.

After the initial phase of establishing requirements we encounter step two: profiling. Note that we didn’t make a single change to the code yet.

Before we get to the actual profiling, behold this machine:

A machine.

This is the Difference Engine, designed by Charles Babbage. It is considered to be the first design for a computer. Sadly Charles never managed to get it to work, although the design is sound: it just required improved lubricants. At Charles’ side was a lady: Ada Lovelace. She wrote the first software, for another design by Charles: the Analytical Engine. About this software, Lady Lovelace makes the following remarks:

In almost every computation a great variety of arrangements for the succession of the processes is possible, and various considerations must influence the selection amongst them (…).
One essential object is to choose that arrangement which shall tend to reduce to a minimum the time necessary for completing the calculation.

And this, my friends, is the very first mention of the concept of software optimization. So next time you wonder why so few ladies chose the noble profession of software engineering, do realize that it all started with Lady Ada Lovelace.


It is time to make things practical. For this, I have prepared a small project, the output of which can be observed in the following image:

Classic Amiga 500 dot cloud.

You can download the Visual Studio 2019 project files by clicking here.

When you run the application you will see that it performs just fine. Maybe you need to switch from debug to release, but after that the application should reach a pretty solid frame rate.

The application comes with some built-in profiling. In the top-left corner of the screen you can see the amount of time it takes to:

  1. Transform the 256 dots using a 4×4 rotation matrix
  2. Sort the 256 dots
  3. Render the 256 dots
  4. Clear the screen.

Based on the initial findings, we may want to focus our attention to the rendering process, which is currently taking most time. Rendering happens in Game::Render (in game.cpp):

void Game::Render()
    t.reset(); m_Surface->Clear( 0 ); elapsed4 = t.elapsed();
    for ( int i = 0; i < DOTS; i++ )
        // extract dot index from sorted z-coordinate
        uint dotIdx = (uint)&m_Rotated[i].z & 3;
        // draw scaled sprite
        int sx = (int)(m_Rotated[i].x * 350.0f) + SCRWIDTH / 2;
        int sy = (int)(m_Rotated[i].y * 350.0f) + SCRHEIGHT / 2;
        int size = (int)((m_Rotated[i].z + 2) * 10.0f);
        m_Dot->SetFrame( dotIdx );
        m_Dot->DrawScaled( sx - size / 2, sy - size / 2, size, size, m_Surface );

It is not immediately trivial to see what the problem with this function could be. Perhaps the divisions? Maybe the method invocations, or the type casts? But, we should not guess nor assume: instead, we profile.

To profile, we use a profiler. There is a profiler integrated in Microsoft Visual Studio, but you can also use an external one. I will use VerySleepy CS, but the process is similar with other profilers, so use whatever works best for you.


In the above image, I am starting the demo from the profiler. After specifying the executable to profile, and the directory to run it in, I let the demo run for about 10 seconds, before it is terminated. The profiler shows the following result:

VerySleepy report on a profiling session.

There are several problems with the output:

  1. Actual program code is not recognizable and source code is missing.
  2. The workload for the application is so small that the most expensive function is the one that synchronizes the main loop to the vsync of the monitor.

Regarding the first issue: we can add debug information to our executable in Visual Studio via project properties ==> C/C++ ==> General ==> Debug Information Format (set to ‘Program Database’), and: project properties ==> Linker ==> Debugging ==> Generate Debug Info (set to ‘Yes’).

To resolve the second issue, we can simply increase the number of ‘DOTS’ in game.cpp. Let’s set it to 4096:

Dot cloud, 4096 dots.

Increasing the amount of dots reveals something interesting. The time needed to draw the particles scales as expected, but the sorting time… exploded. A quick inspection of the sorting code reveals why:

void Game::Sort()
    for (int i = 0; i < DOTS; i++)
        for (int j = 0; j < (DOTS - 1); j++)
            if (m_Rotated[j].z > m_Rotated[j + 1].z)
                vec3 h = m_Rotated[j];
                m_Rotated[j] = m_Rotated[j + 1];
                m_Rotated[j + 1] = h;

This code is an implementation of a simple sorting algorithm named BubbleSort. This particular implementation scales quadratically in the number of elements to sort, which explains the terrible runtime for larger sets.

Obviously, we need to fix this, before we look at the rendering code (which was the bottleneck for the initial workload). To do this, we have a large number of options. Which one is best for this particular situation? May I suggest an unconventional answer: the one that yields a decent improvement with minimal effort. This leaves time for other optimizations; if the sorting becomes a problem again, we will simply revisit it.

A simple QuickSort implementation does the job:

void Swap( vec3& a, vec3& b ) { vec3 t = a; a = b; b = t; }
int Pivot( vec3 a[], int first, int last )
    int p = first;
    vec3 e = a[first];
    for (int i = first + 1; i <= last; i++) if (a[i].z <= e.z) Swap( a[i], a[++p] );
    Swap( a[p], a[first] );
    return p;
void QuickSort( vec3 a[], int first, int last )
    int pivotElement;
    if (first >= last) return;
    pivotElement = Pivot( a, first, last );
    QuickSort( a, first, pivotElement - 1 );
    QuickSort( a, pivotElement + 1, last );
void Game::Sort()
    QuickSort( m_Rotated, 0, DOTS - 1 );

After this change, rendering is (by far) the bottleneck again. Running the demo with 65536 dots is sufficient to keep the CPU busy:

This time we can not only see that ‘rendering’ is the most expensive part of the program; the problem is in the DrawScaled function, and in that function, it’s two specific lines in the inner loop that take all the time. We can stop looking at the other 99% of the program, until we have resolved this issue. How to speed up this code is a topic for another blog post.


We have seen that optimization starts with profiling. Our initial observation (target ‘render’) had to be adjusted once we changed the workload: ‘sort’ required a high-level optimization to resolve the terrible scalability. To save time, we opted for a pragmatic solution (in this case, QuickSort) which may not be the best sorting algorithm for this particular data set, but it was very easy to drop in. The profiler now guides us to just two lines in one function, which take about 90% of the runtime.


In the next post we will have a look at what can be done about the performance of DrawScaled using some low-level optimizations.

Questions? Mail me at bikker.j@gmail.com, or follow me on Twitter: @j_bikker.