The GCC and LLVM optimizers contain troves of arcane and esoteric tricks to speed up code on different systems. Rather surprisingly, GCC will even correct bad hand-optimizations!
One such example is using shifts instead of multiplication. In the olden days, this was a reliable way to speed up your code, especially for game and demo programming. It's mentioned in Hacker's Delight, which I highly recommend for bit fiddlers.
To draw a single pixel at given x and y positions, you must calculate the screen offset:
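```
offset = (y * width + x) * components_size
```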
`components_size` is the size of each pixel. For 8-bit RGBA values, it will be four bytes. In the nineties, graphics modes typically used color palettes, so the component would be a one-byte index into the palette. Games and demos often used mode 13h, which is 320 by 200 pixels at one byte per pixel. That gives
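```
offset = y * 320 + x
```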
However, multiplication was quite expensive in those days, and with many individual pixels being drawn (as opposed to copying image blocks, or scanning through the screen buffer one pixel at a time), an optimization was to replace it with instructions taking fewer cycles. Noticing that 320 = 2⁸ + 2⁶, the expression can be transformed to
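```
offset = (y << 8) + (y << 6) + x
```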
On a CPU like the Intel 386, that would take fewer cycles in total.
This was a tried-and-true technique, part of every graphics programmer's bag of tricks. I typed it out by reflex for years. However, on modern CPUs, such code is no longer optimal. Let's look at the assembly code for the function:
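```c
/* the shift-and-add version */
int offset(int x, int y)
{
    return (y << 8) + (y << 6) + x;
}
```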
Using LLVM, we can compile the code to native 64-bit instructions with no optimizations:
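```
$ clang -S -O0 offset.c   # assuming the source lives in offset.c
```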
The code below is from an Intel i7 on OS X. First it loads y and x into esi and edi, respectively, from the stack frame:
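```
    # a sketch of the unoptimized output; exact stack offsets will vary
    movl    -4(%rbp), %esi    # load y
    movl    -8(%rbp), %edi    # load x
```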
It then performs x += (y << 6) and x += (y << 8):
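```
    # again a sketch; the real listing has a few more register moves
    movl    %esi, %eax        # copy y
    shll    $6, %eax
    addl    %eax, %edi        # x += (y << 6)
    movl    %esi, %eax        # copy y again
    shll    $8, %eax
    addl    %eax, %edi        # x += (y << 8)
    movl    %edi, %eax        # return value
```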
But turning on optimizations, the code changes slightly:
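```
    # roughly what the optimized output looks like; a sketch from memory
    shll    $6, %esi          # y <<= 6
    addl    %esi, %edi        # x += y
    shll    $2, %esi          # y <<= 2 (now y * 256 in total)
    addl    %esi, %edi        # x += y
    movl    %edi, %eax
```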
While it still performs shifts and adds, it uses fewer cycles. (The function prologue and epilogue are excluded above.) In pseudo-code, it does this:
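```
y <<= 6
x += y
y <<= 2
x += y
return x
```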
LLVM compiles to the exact same code with this version:
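```c
int offset(int x, int y)
{
    return y*320 + x;
}
```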
The output for 32-bit targets is structurally equivalent. But what about GCC? It goes even further!
Compiling the original shift-and-add code with optimizations turned on (e.g., -O2), all the shifts and adds are folded into a single multiplication:
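```
    # a sketch of GCC's output
    imull   $320, %esi, %eax  # y * 320
    addl    %edi, %eax        # + x
```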
That is, GCC has rewritten our hand-optimized code back to return x + y*320. But is it faster? Yes, it is!
When it comes to performance testing, there's nothing like actually running real code. So let's put the functions to the test.
We'll use NASM to assemble the two versions above, and GCC 5 to compile a test driver. We'll turn off optimizations where needed.
The assembly code for the imul and shift + add functions is placed in offset.asm. I'm on OS X right now, so we'll need to use the correct alignment and symbol names, otherwise the linker won't be able to find the functions. The file looks something like this:
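```
; offset.asm -- a sketch; the original file may differ slightly.
; On OS X (Mach-O), symbols need a leading underscore to be
; callable from C as offset_imul and offset_shift_add.
bits 64
section .text

global _offset_imul
align 16
_offset_imul:
    imul    eax, esi, 320   ; y * 320
    add     eax, edi        ; + x
    ret

global _offset_shift_add
align 16
_offset_shift_add:
    shl     esi, 6          ; y <<= 6
    add     edi, esi        ; x += y
    shl     esi, 2          ; y <<= 2
    add     edi, esi        ; x += y
    mov     eax, edi
    ret
```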
Assemble it with:
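```
$ nasm -f macho64 offset.asm -o offset.o
```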
The test driver program, called test-offset.c, uses getrusage to measure the CPU time spent in each function. To compare the functions, we'll run one billion iterations of each and keep the best time seen so far. That's the approach Facebook uses in their profiling code in Folly (apparently they went back to it after experimenting with statistical modelling). For code like this, we don't want to measure things like the average running time: the best run gives the truest measure of how fast the function can go. In outline, the driver looks like this:
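```c
/* test-offset.c -- a sketch of the test driver; names and
 * iteration counts are best guesses and may differ from the
 * original. */
#include <stdio.h>
#include <sys/resource.h>

/* The hand-written assembly versions from offset.asm */
int offset_imul(int x, int y);
int offset_shift_add(int x, int y);

/* A plain C version for the compiler to chew on */
int offset(int x, int y)
{
    return (y << 8) + (y << 6) + x;
}

/* CPU time consumed so far, in seconds */
static double cpu_time(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
}

/* Run one billion iterations, repeat a few times, keep the best time */
static double benchmark(int (*func)(int, int))
{
    const int runs = 3;
    const long iterations = 1000000000L;
    volatile int sink;   /* prevents the calls from being elided */
    double best = -1;

    for (int run = 0; run < runs; ++run) {
        const double start = cpu_time();
        for (long i = 0; i < iterations; ++i)
            sink = func(i & 319, i & 199);
        const double elapsed = cpu_time() - start;
        if (best < 0 || elapsed < best)
            best = elapsed;
    }
    (void)sink;
    return best;
}

int main(void)
{
    printf("offset_imul      %f s\n", benchmark(offset_imul));
    printf("offset_shift_add %f s\n", benchmark(offset_shift_add));
    printf("offset (C)       %f s\n", benchmark(offset));
    return 0;
}
```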
Compile this with optimizations off, so GCC leaves the timing loops and the C function alone:
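```
$ gcc -O0 test-offset.c offset.o -o test-offset
```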
We're finally ready to party (according to some definition of "party"):
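```
$ ./test-offset
```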
The imul function is the fastest, but only by a small amount.
However, the point here was to show how GCC will fix bad hand-optimizations.
And clearly, it did that correctly!
What's interesting is that GCC 5 obviously didn't optimize the plain C offset function as well as either of the hand-written offset_imul and offset_shift_add functions in this program. You can take a look at the disassembly of test-offset (for instance with otool -tV on OS X) and see for yourself.
Obviously, the first mistake in the C function is that it creates a stack frame. If you recompile with -fomit-frame-pointer, GCC quickly rectifies that, and we get the shift + add version (but not the imul one!):
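```
    # a sketch of GCC's output with -fomit-frame-pointer; without
    # optimizations, the arguments are still spilled to the stack
    movl    %edi, -4(%rsp)    # x
    movl    %esi, -8(%rsp)    # y
    movl    -8(%rsp), %eax
    sall    $8, %eax          # y << 8
    movl    -8(%rsp), %edx
    sall    $6, %edx          # y << 6
    addl    %edx, %eax
    addl    -4(%rsp), %eax    # + x
    ret
```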
Rerunning the tests, I got these results:
The two shift + add versions should actually get the exact same timings, but they don't! That's most likely preemption noise from the OS. I didn't bother trying to run this in single-user mode, with nice -10, and so on. If you try this out yourself, particularly if you're on another OS, let me know your results in the comments!
Another question is why LLVM doesn't perform the same optimization. GCC supposedly produces somewhat faster code than LLVM, but I haven't checked whether that's still true. Also, we haven't looked at any side effects in the surrounding code; I'm thinking about things like the CPU pipeline. I don't expect to see anything interesting there, though, because we're only accessing values on the stack. Let me know in the comments if you have two cents to spare!
I also tried adding -mtune=native, and the results I got then were:
Each run is a little bit different, though.