Nibble Stew: Experiment: binary size reduction by using common function tails

In embedded development the most important feature of any program is its size. The raw performance does not usually matter that much, but size does. A program that is even one byte larger than available flash size is useless.

GCC, Clang and other free compilers do an admirable job in creating small executables when asked to with the -Os compiler switch. However there are still optimizations that could be added. Suppose we have two functions that looks like this:

int funca() {
int i = 0;
i+=func2();
return i+func1();
}

int funcb() {
int i = 1;
i+=func3();
return i+func1();
}

They would get compiled into the following asm on x86-64:

funca():
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 0
call func2()
add DWORD PTR [rbp-4], eax
call func1()
mov edx, eax
mov eax, DWORD PTR [rbp-4]
add eax, edx
leave
ret
funcb():
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 1
call func3()
add DWORD PTR [rbp-4], eax
call func1()
mov edx, eax
mov eax, DWORD PTR [rbp-4]
add eax, edx
leave
ret

If you look carefully, the last 7 instructions on both of these functions are identical. In fact the code above can be rewritten to this:

funca():
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 0
call func2()
common_tail:
add DWORD PTR [rbp-4], eax
call func1()
mov edx, eax
mov eax, DWORD PTR [rbp-4]
add eax, edx
leave
ret
funcb():
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 1
call func3()
jmp common_tail

Depending on your point of view this can be seen as either a cool hack or an insult to everything that is good and proper in the world. funcb does an unconditional jump inside the body of an unrelated function. The reason this works is that we know that the functions end in a ret operand which pops a return address from the stack and jumps into that (that is, the "parent function" that called the current function). Thus both code segments are identical and can be collapsed into one. This is an optimisation that can only be done at the assembly level, because C prohibits gotos between functions.

How much does this save?

To test this I wrote a simple Python script that parses assembly output, finds the ends of functions and replaces common tails with jumps as described above. It uses a simple heuristic and only does the reduction if there are three or more common instructions. Then I ran it on the assembly output of SQLite's "amalgamation" source file. That resulted in reductions such as this one:

Ltail_packer_57:

setne %al

Ltail_packer_1270:

andb $1, %al

movzbl %al, %eax

popq %rbp

retq

This function tail is being used in two different ways, sometimes with the setne command and sometimes without. In total the asm file contained 1801 functions. Out of those 1522 could be dedupped. The most common removals looked like this:

addq $48, %rsp

popq %rbp

retq

That is, the common function suffix. Interestingly, when the dedupped asm is compiled, the output is about 10k bigger than without dedupping. The original code was 987 kB. I did not measure where the difference comes from. It could be either because the extra labels need extra metadata or because the jmp instruction takes more space than the instructions it replaces because the jump might need a 32 bit offset. A smarter implementation would look to minimize the jump distance so it would fit in 16 bits and thus in a smaller opcode. (I'm not sure if x86-64 has those opcodes so the previous comment might be wrong in practice but the reasoning behind it is still valid.)

Is this actually worth doing?

On the x86 probably not because the machines have a lot of ram to spare and people running them usually only care about raw performance. The x86 instruction set is also quite compact because it has a variable size encoding. The situation is different in ARM and other embedded platforms. They have fewer instructions and a constant encoding size (usually 32 bits). This means longer instruction sequences which gives more potential for size reductions. Some embedded compilers do this optimization so The Real World would seem to indicate that it is worth it.

I wanted to run the test on ARM assembly as well, but parsing it for function tails is much more difficult than for x86 asm so I just gave up. Thus knowing the real benefits would require getting comments from an actual compiler engineer. I don't even pretend to be one on the Internet, so I just filed a feature request on this to the Clang bug tracker.

Nibble Stew

Tuesday, July 18, 2017

Experiment: binary size reduction by using common function tails

How much does this save?

Is this actually worth doing?

No comments:

Post a Comment