The other day I read a two-parter blog post by Owen Shepherd, {n} times faster than C. In it, he takes a simple algorithm, and optimizes it as much as he can, dropping down to raw assembly along the way.
I love this sort of thing, though I’m not very good at it. I decided I’d try my own hand at it anyway, but for the A64 instruction set. I’ve tried to make this post understandable if you haven’t read Owen’s, though I highly recommend you do! (if only for how much fun I found it to read).
For context, here’s the environment I’m using to obtain benchmark timings and compilation outputs:
- Clang 16.0.6
- -O3 and -g3
- Instruments 14.3.1
- M1 Pro
- macOS 13.4.1 without any weird kernel boot arguments or anything
All code (plus a nifty benchmarking harness!) is available in this GitHub repo. I’ve matched the inputs generated by Owen’s original benchmarking setup as closely as possible.
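If you’re wondering what the measurement itself looks like, the gist is simply timing one pass over a large input and dividing bytes processed by seconds elapsed. Here’s a rough sketch of that idea (this is not the actual harness from the repo, and all the names here are mine):
#include <stddef.h>
#include <stdint.h>
#include <time.h>

static double
throughput_gb_per_s(int32_t (*count)(const uint8_t *), const uint8_t *input,
        size_t length)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    volatile int32_t result = count(input); // volatile so the call isn't optimized away
    clock_gettime(CLOCK_MONOTONIC, &end);
    (void)result;
    double seconds = (double)(end.tv_sec - start.tv_sec)
            + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    return (double)length / seconds / 1e9;
}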
Problem statement:
- Initialize a counter variable to zero.
- Iterate over a null-terminated array of bytes.
- Whenever you see an ‘s’, increment the counter.
- Whenever you see a ‘p’, decrement the counter.
The basics
Let’s begin with the most trivial solution to the problem, written in C, as given by Owen’s blog post (plus some minor stylistic changes):
int32_t
basic(const uint8_t *input)
{
int32_t res = 0;
while (true) {
uint8_t c = *input;
input++;
switch (c) {
case '\0':
return res;
case 's':
res++;
break;
case 'p':
res--;
break;
default:
break;
}
}
}
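To make the expected behaviour concrete, here’s the sort of quick sanity check I like to keep around (assuming the basic function above is in scope; the test strings are mine, not from the benchmark inputs):
#include <assert.h>

int
main(void)
{
    // 's' increments, 'p' decrements, everything else is ignored
    assert(basic((const uint8_t *)"spss") == 2);
    assert(basic((const uint8_t *)"mississippi") == 2); // four 's's, two 'p's
    assert(basic((const uint8_t *)"") == 0);
    return 0;
}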
Compared to Owen’s 310 MB/s, I get 305 MB/s. In spite of varying microarchitectures, instruction sets, compilers, cache hierarchies and RAM speeds, we’ve ended up at almost the same speed!
Next, let’s take a look at the corresponding A64 assembly for this C:
basic:
mov w8, #0
loop:
ldrb w9, [x0], #1
cbz w9, exit
cmp w9, 'p'
b.eq p
cmp w9, 's'
b.ne loop
mov w9, #1
add w8, w8, w9
b loop
p:
mov w9, #-1
add w8, w8, w9
b loop
exit:
mov x0, x8
ret
I was honestly quite surprised to see that Clang emits assembly which is instruction-for-instruction equivalent to the x64 version. The only part here that leaves me confused is that Clang uses a mov and an add-with-register instead of a single add-with-immediate …
Rather than retrace the steps in Owen’s blog post, I think it’d be more fun to skip ahead to the faster versions of the code before taking a different approach entirely.
The table-driven approach
At the start of the second post we are introduced to this function:
static int32_t to_add[256] = {
['s'] = 1,
['p'] = -1,
};
int32_t
table(const uint8_t *input)
{
int32_t res = 0;
while (true) {
uint8_t c = *input;
input++;
if (c == '\0') {
return res;
}
res += to_add[c];
}
}
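One detail worth spelling out about that table: thanks to the designated initializers and static storage duration, every entry other than to_add['s'] and to_add['p'] is guaranteed to be zero, so res += to_add[c] silently adds 0 for every byte we don’t care about.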
Here we see a common approach to optimizing tight loops like these: eliminate branches by replacing them with table lookups. And this does indeed work: we just boosted our throughput 9×, bringing it to 2.712 GB/s! This completely pales in comparison to the 4 GB/s Owen achieved, though :(
table:
ldrb w9, [x0]
cbz w9, exit
mov x8, x0
mov w0, #0
add x8, x8, #1
adr x10, to_add
nop
loop:
ldr w9, [x10, w9, uxtw #2]
add w0, w9, w0
ldrb w9, [x8], #1
cbnz w9, loop
ret
exit:
mov w0, #0
ret
Mercifully, the assembly is, as far as these things go, serene.
Skipping past all that setup, the core loop is only three instructions: load from to_add (x10) using the c value from the previous iteration (w9) as an index, add it to res (w0), and finally load the next value of c. We jump back to the top of the loop only if c wasn’t zero (the null terminator).
Hmm, anyway, what could be causing this to be so slow? Come to think of it, when Owen tried compiling the same code with GCC, performance plummeted to 2.13 GB/s. In his case, the issue boiled down to (IIUC) the inner loop straddling a cache line boundary. Could we be having the same issue?
A quick trip to the Clang command line argument reference reveals that -falign-loops=64 is the incantation we need. And yes, the assembly is absolutely filled with nops. Unfortunately, performance stays roughly the same at 2.695 GB/s.
Running Instruments with appropriate performance counters enabled on table, I found that the function spends 90.8% of its time¹ simply loading bytes from the input (ldrb). To make things faster we just need to do that less. It’s not like we’re loading each byte of the input multiple times, so the way forward is to load multiple bytes in one go. To do that, though, we’ll need to bend the rules.
Cheating incident #1 – explicit length
Imagine the address space immediately following the input we’ve been given is unmapped. In this scenario, reading even a single byte past the null terminator at the end of the input will make us segfault. This is fine if we read a byte at a time, stopping as soon as we see the null terminator:
..........spspspspspspspspspsp0..........
^^^^^^^^^^^^^^^^^^^^|
all good :)
Now imagine that we switch to loading multiple bytes at once – say, in chunks of eight. The final chunk we read is likely to not only contain the null terminator, but also several of those segfault-triggering bytes following it!
these 3b make us segfault ↓
vvv
..........spspspspspspspspspsp0..........
\______/\______/\______/
8 8 8
KABOOM!
To overcome this,
we can switch to what is arguably a far saner scheme
(but one which is technically against the rules of the problem statement):
pass in the length of the input directly
instead of relying on a null terminator.
With this extra information in hand,
we can find how many bytes don’t neatly fit into our eight-byte chunks
using the %
operator.
We’ll process these bytes – I’m calling them “slack” –
separately, byte-by-byte at the end.
..........spspspspspspspspspsp0..........
\______/\______/^^^^|
8 8
all good :)
Here’s a pseudocode implementation:
int32_t
new_implementation(const uint8_t *input, size_t length)
{
size_t chunk_count = length / 8;
size_t slack = length % 8;
const uint8_t *chunks_end = input + chunk_count * 8;
int32_t res = 0;
while (input != chunks_end) {
// process input in eight-byte chunks,
// somehow ...
input += 8;
}
// account for remaining slack
res += table(input, slack);
return res;
}
A little sidebar
To implement the algorithm I described above,
we’ll need a version of table
which takes a length.
I ended up with this neat little function:
static int32_t to_add[256] = {
['s'] = 1,
['p'] = -1,
};
int32_t
table_length(const uint8_t *input, size_t length)
{
int32_t res = 0;
const uint8_t *end = input + length;
while (input != end) {
uint8_t c = *input;
input++;
res += to_add[c];
}
return res;
}
I won’t show the assembly here because it’s a huge mess,
since Clang managed to unroll this loop four times
(meaning it copy-pasted the loop body so that
four bytes get processed before branching back to the top of the loop).
This results in a huge speedup of 15× over basic
for a total throughput of 4.610 GB/s.
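In case “unrolled four times” is hard to picture, here’s roughly what that shape looks like written back out as C. This is only a sketch of the idea (the function name is mine, it leans on the same static to_add table as above, and it assumes the length is a multiple of four, which the real compiler-generated code handles with extra fix-up iterations):
int32_t
table_length_unrolled_sketch(const uint8_t *input, size_t length)
{
    int32_t res = 0;
    const uint8_t *end = input + length;
    while (input != end) {
        // four table lookups per trip around the loop,
        // so the branch at the bottom runs a quarter as often
        res += to_add[input[0]];
        res += to_add[input[1]];
        res += to_add[input[2]];
        res += to_add[input[3]];
        input += 4;
    }
    return res;
}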
Processing the input in chunks
How can we load multiple bytes of our input in one go? One approach is to cast the input pointer from a uint8_t * to a uint64_t *, thereby letting us load eight bytes at once. We can then loop through the component bytes of the loaded integer, processing them as we go. Something like this?
int32_t
new_implementation(const uint8_t *input, size_t length)
{
// ...
while (input != chunks_end) {
uint64_t chunk = *(const uint64_t *)input;
input += 8; // step forward by a whole chunk, not a single byte
uint8_t bytes[8];
memcpy(bytes, &chunk, 8);
for (size_t i = 0; i < 8; i++) {
res += to_add[bytes[i]];
}
}
// ...
}
I don’t like how we
declare both chunk
and bytes
and then memcpy
from one to the other
just to loop over bytes,
but that’s exactly the sort of thing compilers are good at optimizing away.
One problem here is that we’re casting a uint8_t *
to a uint64_t *
without thinking about alignment:
a pointer to a 64-bit integer must be a multiple of eight,
but we haven’t made sure of that.
Although the M1’s Firestorm microarchitecture permits
(and as far as I know doesn’t penalize)
unaligned accesses,
they are undefined behavior as per the C standard,
so we’d better fix this.
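To make the fix-up we’re about to write concrete, with a made-up address: if input were 0x1003, then (uintptr_t)input % 8 would be 3, so we’d process 8 - 3 = 5 bytes the slow, byte-by-byte way, and the chunked loop would start at the nicely aligned address 0x1008.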
Remember that pseudocode from earlier about handling “slack”? Let’s adapt it to also handle alignment, and glue it together with that code above to process each chunk:
static int32_t to_add[256] = {
['s'] = 1,
['p'] = -1,
};
int32_t
table_8(const uint8_t *input, size_t length)
{
size_t padding = 0;
if ((uintptr_t)input % 8 != 0) {
padding = 8 - (uintptr_t)input % 8;
}
int32_t res = table_length(input, padding);
input += padding;
length -= padding;
size_t chunk_count = length / 8;
size_t slack = length % 8;
const uint64_t *chunks = (const uint64_t *)input;
const uint64_t *chunks_end = chunks + chunk_count;
while (chunks != chunks_end) {
uint64_t chunk = *chunks;
chunks++;
uint8_t bytes[8];
memcpy(bytes, &chunk, 8);
for (size_t i = 0; i < 8; i++) {
res += to_add[bytes[i]];
}
}
input = (const uint8_t *)chunks;
res += table_length(input, slack);
return res;
}
And lo and behold!
Our new implementation gives us further gains:
20× faster than basic
at 6.358 GB/s.
As I had hoped, Clang completely eliminated the memcpy
as well as the loop,
resulting in this very fast but very lengthy mess:
table_8:
stp x22, x21, [sp, #-0x30]!
stp x20, x19, [sp, #0x10]
stp x29, x30, [sp, #0x20]
add x29, sp, #0x20
mov x20, x1
mov x21, x0
ands x8, x0, #0x7
mov w9, #8
sub x8, x9, x8
tst x0, #0x7
csel x22, xzr, x8, eq
mov x1, x22
bl table_length
mov x19, x0
add x0, x21, x22
sub x8, x20, x22
cmp x8, #8
b.lo slack
lsr x10, x8, #3
add x9, x0, x10, lsl #3
lsl x10, x10, #3
adr x11, to_add
nop
loop:
ldr x12, [x0], #8
and x13, x12, #0xff
ldr w13, [x11, x13, lsl #2]
add w13, w13, w19
ubfx x14, x12, #8, #8
ldr w14, [x11, x14, lsl #2]
ubfx x15, x12, #16, #8
ldr w15, [x11, x15, lsl #2]
add w14, w15, w14
add w13, w14, w13
ubfx x14, x12, #24, #8
ldr w14, [x11, x14, lsl #2]
ubfx x15, x12, #32, #8
ldr w15, [x11, x15, lsl #2]
ubfx x16, x12, #40, #8
ldr w16, [x11, x16, lsl #2]
add w14, w15, w14
add w14, w16, w14
add w13, w14, w13
ubfx x14, x12, #48, #8
ldr w14, [x11, x14, lsl #2]
lsr x12, x12, #54
and x12, x12, #0x3fc
ldr w12, [x11, x12]
add w12, w12, w14
add w19, w12, w13
subs x10, x10, #8
b.ne loop
mov x0, x9
slack:
and x1, x8, #0x7
bl table_length
add w0, w0, w19
ldp x29, x30, [sp, #0x20]
ldp x20, x19, [sp, #0x10]
ldp x22, x21, [sp], #0x30
ret
I had been considering using bitwise operations to separate the bytes out of chunk instead of copying it into an array of bytes, but it seems Clang has already done that for us. ubfx, or Unsigned Bitfield Extract, is an A64 instruction that copies a region of bits from one register into another. Do a quick scan through the assembly and you’ll see repeated ubfx uses with increasing offsets, which is exactly the assembly we want to see here.
Also, take note of all the loads in the inner loop: they all load with some offset from to_add (x11) except for one, which loads from input (x0). At the bottom of the loop you’ll see that we are indeed stepping forward by eight each iteration. Thus, we have achieved our goal of reducing how often we load from the input: we now only load once every eight bytes, instead of once every byte.
That makes me wonder, though, if we can do better? Could we load once every sixteen bytes instead? Sure thing!
Loading less often with 128-bit integers
It turns out that the AArch64 execution state
(or is it the A64 instruction set? or the Armv8-A architecture? I hate this)
makes available to us thirty-two 128-bit registers.
We can access these from C using the __uint128_t
type.
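(Worth noting: __uint128_t is a GCC/Clang extension rather than part of standard C, but we’re already relying on Clang targeting AArch64 here, so that’s fine.)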
We only need to tweak table_8
slightly to make use of this:
static int32_t to_add[256] = {
['s'] = 1,
['p'] = -1,
};
int32_t
table_16(const uint8_t *input, size_t length)
{
size_t padding = 0;
// all the 8s are replaced with 16
if ((uintptr_t)input % 16 != 0) {
padding = 16 - (uintptr_t)input % 16;
}
int32_t res = table_length(input, padding);
input += padding;
length -= padding;
size_t chunk_count = length / 16;
size_t slack = length % 16;
// and all the uint64_ts are replaced with __uint128_t
const __uint128_t *chunks = (const __uint128_t *)input;
const __uint128_t *chunks_end = chunks + chunk_count;
while (chunks != chunks_end) {
__uint128_t chunk = *chunks;
chunks++;
uint8_t bytes[16];
memcpy(bytes, &chunk, 16);
for (size_t i = 0; i < 16; i++) {
res += to_add[bytes[i]];
}
}
input = (const uint8_t *)chunks;
res += table_length(input, slack);
return res;
}
I won’t show the assembly because it’s basically the same as for table_8, just longer. One thing I found interesting is that 128-bit registers are never actually used, with Clang preferring to use two 64-bit registers instead. It’s amazing that it was able to see through my (admittedly opaque) code and figure out the intention behind it so effectively.
The performance improvement here was more modest: 23× faster than basic at 7.090 GB/s.
A new approach: Neon
It seems we’re starting to hit a limit on the table-driven strategy, at least with how we’re approaching it currently.
If you’ve ever looked much into software optimization, you’ve probably heard of SIMD, or Single Instruction, Multiple Data. The name really says it all: one instruction operates on multiple pieces of data.
Arm as supported by the M1
(I’ve given up on using terminology)
gives us an instruction set (????) called Neon
which uses those 128-bit registers from earlier.
Those registers can be treated as 16-, 32- or 64-bit floats, or as 128-bit vectors.
Vectors can be divvied up into lanes of various widths:
a 128-bit vector can be viewed as sixteen 8-bit lanes,
eight 16-bit lanes, four 32-bit lanes, and so on.
These are each represented by a C type provided by the arm_neon.h header: int8x16_t, int16x8_t, int32x4_t, etc. There exist unsigned versions of each. Since we’re processing individual bytes, we’ll mainly be using uint8x16_t.
To give you some idea of what kinds of operations you can do with vectors, there are ones like “multiply these two vectors lanewise”, as well as ones like “add up all the lanes in this vector”. We can access these operations from C using special functions called “intrinsics”, also from the arm_neon.h header, which are (mostly) named after the instructions they produce. For instance, adding two vectors of four uint32_ts each can be accomplished with the assembly instruction add v0.4s, v1.4s, v2.4s, or just as well with the C code v0 = vaddq_u32(v1, v2).
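Here’s that example fleshed out into a tiny standalone program, in case you want something to paste into a file and play with (the values are arbitrary ones I picked):
#include <arm_neon.h>
#include <stdio.h>

int
main(void)
{
    uint32_t a[4] = {1, 2, 3, 4};
    uint32_t b[4] = {10, 20, 30, 40};
    uint32x4_t v1 = vld1q_u32(a);      // load four 32-bit lanes from memory
    uint32x4_t v2 = vld1q_u32(b);
    uint32x4_t v0 = vaddq_u32(v1, v2); // add v0.4s, v1.4s, v2.4s
    uint32_t out[4];
    vst1q_u32(out, v0);                // store the lanes back to memory
    printf("%u %u %u %u\n", out[0], out[1], out[2], out[3]); // 11 22 33 44
    return 0;
}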
Let’s try to think about how we can implement our algorithm using Neon. Say we’ve just loaded up a vector of sixteen bytes:
p p h t z p a y z n s m s x m d
We want to locate all the ‘s’s and ‘p’s, adding 1 for the ‘s’s and subtracting 1 for the ‘p’s.
We want some way to perform the following transformation:
p p h t z p a y z n s m s x m d
↓
-1 -1 0 0 0 -1 0 0 0 0 1 0 1 0 0 0
Neon gives us an instruction, cmeq, that compares two vectors: if two lanes are equal, the corresponding output lane is set to -1; if two lanes aren’t equal, the corresponding output lane is set to 0. Perfect!
Given a register full of ‘p’s, we’re already halfway:
p p h t z p a y z n s m s x m d
p p p p p p p p p p p p p p p p
↓ cmeq ↓
-1 -1 0 0 0 -1 0 0 0 0 0 0 0 0 0 0
We can do the same for the ‘s’s:
p p h t z p a y z n s m s x m d
s s s s s s s s s s s s s s s s
↓ cmeq ↓
0 0 0 0 0 0 0 0 0 0 -1 0 -1 0 0 0
How can we make those positive, then? In binary, -1 as a byte is 11111111. We want to turn that into 00000001, leaving all 00000000s untouched. An easy way to do this is to bitwise-AND with 00000001:
a | b | a & b |
---|---|---|
00000000 | 00000001 | 00000000 |
11111111 | 00000001 | 00000001 |
Let’s see that in full:
p p h t z p a y z n s m s x m d
s s s s s s s s s s s s s s s s
↓ cmeq ↓
0 0 0 0 0 0 0 0 0 0 -1 0 -1 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
↓ and ↓
0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0
Awesome!
We now have a resultant vector for ‘s’s and ‘p’s, but how do we combine them together?
In fact, it just so happens that bitwise-ORing anything with 0 gives back the original value, and our vectors have no overlapping 1s and -1s. So, we can merge them using the orr instruction:
-1 -1 0 0 0 -1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0
↓ orr ↓
-1 -1 0 0 0 -1 0 0 0 0 1 0 1 0 0 0
And there we have our original target vector!
Writing this out in code is trivial now that we’ve worked out the steps that are involved. To make it easier to focus on the logic of the code I’ve put the inner loop here by itself first, so you can get a good look at it:
uint8x16_t s = vmovq_n_u8('s');
uint8x16_t p = vmovq_n_u8('p');
uint8x16_t ones = vmovq_n_u8(1);
while (chunks != chunks_end) {
uint8x16_t chunk = *chunks;
chunks++;
uint8x16_t matched_s = vandq_u8(vceqq_u8(chunk, s), ones);
uint8x16_t matched_p = vceqq_u8(chunk, p);
uint8x16_t matched = vorrq_u8(matched_s, matched_p);
res += vaddlvq_s8(matched);
}
Let’s look over each of the intrinsics I’ve used here:
- vmovq_n_u8: duplicates a scalar value across all lanes
- vceqq_u8: performs that cmeq equality operation from earlier (I don’t know why the intrinsic is named differently to the instruction)
- vandq_u8: bitwise-AND
- vorrq_u8: bitwise-OR
- vaddlvq_s8: adds values across lanes into a scalar which is large enough to hold the result (in this case adding up all 8-bit lanes gives a 16-bit sum)
You might have noticed how each of these intrinsics ends with u8 except for vaddlvq_s8. u8 indicates an unsigned 8-bit integer; s8 is the same, but signed. The reason for using the signed variant in this case is that widening the sum from eight bits to sixteen bits is a sign-dependent operation: sign-extending the byte 11111111 gives -1, while zero-extending it gives 0000000011111111, which is not -1 for a 16-bit integer.
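Here’s the same distinction in plain scalar C, just to make it concrete:
#include <assert.h>
#include <stdint.h>

int
main(void)
{
    int8_t signed_byte = -1;               // bit pattern 11111111
    uint8_t unsigned_byte = 0xff;          // the same bit pattern
    int16_t sign_extended = signed_byte;   // 1111111111111111, i.e. -1
    int16_t zero_extended = unsigned_byte; // 0000000011111111, i.e. 255
    assert(sign_extended == -1);
    assert(zero_extended == 255);
    return 0;
}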
Here’s the function in full (which is mostly the same as for table_16):
int32_t
neon(const uint8_t *input, size_t length)
{
size_t padding = 0;
if ((uintptr_t)input % 16 != 0) {
padding = 16 - (uintptr_t)input % 16;
}
int32_t res = table_length(input, padding);
input += padding;
length -= padding;
size_t chunk_count = length / 16;
size_t slack = length % 16;
const uint8x16_t *chunks = (const uint8x16_t *)input;
const uint8x16_t *chunks_end = chunks + chunk_count;
uint8x16_t s = vmovq_n_u8('s');
uint8x16_t p = vmovq_n_u8('p');
uint8x16_t ones = vmovq_n_u8(1);
while (chunks != chunks_end) {
uint8x16_t chunk = *chunks;
chunks++;
uint8x16_t matched_s = vandq_u8(vceqq_u8(chunk, s), ones);
uint8x16_t matched_p = vceqq_u8(chunk, p);
uint8x16_t matched = vorrq_u8(matched_s, matched_p);
res += vaddlvq_s8(matched);
}
input = (const uint8_t *)chunks;
res += table_length(input, slack);
return res;
}
Finally, the processor is kind enough to reward us handsomely for our efforts: 22.519 GB/s for a total speedup of 73×! Funnily enough there’s little point in poring over the assembly this time around, since we’ve basically hand-written which instructions we want Clang to produce (though a quick glance is a good idea!).
We can go further, though.
Profiling in Instruments shows that the function spends 81.6% of its time in the pair of instructions vaddlvq_s8 generates – again, I’m not sure if I can trust this, but it does make sense considering what the Firestorm SIMD instruction tables have to say about saddlv and smov.
Reducing less often
If you think about it, we really don’t need to be reducing the vector on each loop iteration. Instead, why don’t we use a vector accumulator variable that gets reduced at the end? Indeed, we could use res for this purpose if we were to change its type from int32_t to uint8x16_t. Hmmm, that won’t work though: what if we get unlucky and one of res’s lanes overflows past 127?
Let’s wrap the body of the inner loop in another loop which runs 127 times (so overflow is impossible), after which the accumulator is reduced and added to res, before being reset to zero. Since we’re now stepping in units of 16 × 127 = 2032 instead of 16, we’ll also have to update the “slack” calculations.
int32_t
neon_less_reduce(const uint8_t *input, size_t length)
{
size_t padding = 0;
if ((uintptr_t)input % 16 != 0) {
padding = 16 - (uintptr_t)input % 16;
}
int32_t res = table_length(input, padding);
input += padding;
length -= padding;
// NEW:
size_t iter_count = length / (16 * 127);
size_t slack = length % (16 * 127);
size_t chunk_count = iter_count * 127;
const uint8x16_t *chunks = (const uint8x16_t *)input;
const uint8x16_t *chunks_end = chunks + chunk_count;
uint8x16_t s = vmovq_n_u8('s');
uint8x16_t p = vmovq_n_u8('p');
uint8x16_t ones = vmovq_n_u8(1);
while (chunks != chunks_end) {
uint8x16_t acc = vmovq_n_u8(0);
// NEW:
for (uint32_t i = 0; i < 127; i++) {
uint8x16_t chunk = *chunks;
chunks++;
uint8x16_t matched_s =
vandq_u8(vceqq_u8(chunk, s), ones);
uint8x16_t matched_p = vceqq_u8(chunk, p);
uint8x16_t matched = vorrq_u8(matched_s, matched_p);
acc = vaddq_u8(acc, matched);
}
res += vaddlvq_s8(acc);
}
input = (const uint8_t *)chunks;
res += table_length(input, slack);
return res;
}
This simple function has turned into quite a large volume of code, hasn’t it?
The performance here has improved in dramatic fashion once again,
reaching 33.952 GB/s, a whopping 111× faster than our original basic
function.
But can we do better? Of course we can.
Cheating incident #2 – only ‘s’ and ‘p’
Tragically, while I was writing this post I ran across {n} times faster than C, where n = 128 by Thomas Ip. Not only did someone else make a follow-up post to Owen’s original post in which they use SIMD to get huge performance improvements, but that someone else also used Neon intrinsics on an M1 Pro! I was quite miffed that someone had beaten me to the punch, though I decided to finish writing anyway :)
Anyhow, in his post, Thomas makes an astute observation. The ASCII representation for ‘p’ is 1110000, while ‘s’ is 1110011. We can distinguish between the two solely through the least-significant bit – 'p' & 1 == 0, while 's' & 1 == 1. Of course, this only works if every single character in the input is either ‘s’ or ‘p’. Owen’s original code never assumes this, and neither has ours so far. Let’s keep going and try out the idea anyway!
First, here’s the simplest version of Thomas’s code that takes advantage of this optimization:
fn opt3_count_s_branchless(input: &str) -> i64 {
let n_s = input.bytes().map(|b| (b & 1) as i64).sum::<i64>();
(2 * n_s) - input.len() as i64
}
We count the number of occurrences of ‘s’ by adding up the least-significant bit of every byte in the input; let’s call this s_count. Since we’re now assuming the input is composed purely of ‘s’s and ‘p’s, we know that p_count must be length - s_count. Our goal is to increment for all ‘s’s and decrement for all ‘p’s, so the value we return is (1 * s_count) + (-1 * p_count).
Let’s expand that:
(1 * s_count) + (-1 * p_count)
= s_count - p_count
= s_count - (length - s_count)
= s_count - length + s_count
= 2 * s_count - length
And whaddaya know, that’s exactly what Thomas’s function returns!
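For reference, here’s that same idea written as plain scalar C (a sketch only; the function name is mine and appears in neither post):
int32_t
lsb_scalar_sketch(const uint8_t *input, size_t length)
{
    int32_t s_count = 0;
    for (size_t i = 0; i < length; i++)
        s_count += input[i] & 1; // 1 for 's', 0 for 'p'
    return 2 * s_count - (int32_t)length;
}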
Side note: I want to quickly point out how interesting it is that the problem has been reduced from one of searching for characters to counting how many least-significant bits are set.
Let’s apply this in our own code, now. Once again, the main loop is the only part that’s changed:
int32_t
neon_lsb(const uint8_t *input, size_t length)
{
size_t padding = 0;
if ((uintptr_t)input % 16 != 0) {
padding = 16 - (uintptr_t)input % 16;
}
int32_t res = table_length(input, padding);
input += padding;
length -= padding;
size_t iter_count = length / (16 * 127);
size_t slack = length % (16 * 127);
size_t chunk_count = iter_count * 127;
const uint8x16_t *chunks = (const uint8x16_t *)input;
const uint8x16_t *chunks_end = chunks + chunk_count;
// NEW:
// no more s and p constants
uint8x16_t ones = vmovq_n_u8(1);
while (chunks != chunks_end) {
uint8x16_t acc = vmovq_n_u8(0);
for (uint32_t i = 0; i < 127; i++) {
uint8x16_t chunk = *chunks;
chunks++;
// NEW:
// brutally simple loop body:
// just mask out the lsb
// and add to the accumulator
acc = vaddq_u8(acc, vandq_u8(chunk, ones));
}
// NEW:
// that formula we worked out
// makes an appearance
res += 2 * vaddlvq_s8(acc) - 16 * 127;
}
input = (const uint8_t *)chunks;
res += table_length(input, slack);
return res;
}
A nice speed increase:
now we’re 124× faster than basic
at 37.961 GB/s.
An unfortunate discovery
It was at this point in the article-writing process that I was going to call it on the algorithmic work, and get down to micro-optimization (or at least the few things I know which could pass as micro-optimization). For some reason, though, I decided to go read the Hacker News comments under Owen’s post, where I stumbled across a comment from nwallin:
nwallin — 7 July 2023
So I downloaded your code. On my desktop, with loop-9 gcc I got ~4.5 GB/s, and with loop-7 I got ~4.4 GB/s. With the following code:
#include <stddef.h>
int run_switches(const char *s, size_t n) {
    int res = 0;
    for (; n--; ++s)
        res += (*s == 's') - (*s == 'p');
    return res;
}
I got ~31 GB/s in GCC and ~33 GB/s in Clang. This is without any padding, or SIMD intrinsics, or any such nonsense. This is just untying the compiler’s hands and giving it permission to do its job properly.
[…]
With a little bit of blocking:
#include <stddef.h>
int run_switches(const char *s, size_t n) {
    int res = 0;
    char tmp = 0;
    for (size_t i = n & 63; i--; ++s)
        tmp += (*s == 's') - (*s == 'p');
    res += tmp;
    for (n >>= 6; n--;) {
        tmp = 0;
        for (size_t i = 64; i--; ++s)
            tmp += (*s == 's') - (*s == 'p');
        res += tmp;
    }
    return res;
}
That’s ~55 GB/s.
Anyway, the point is, you’re pretty far from the point where you ought to give up on C and dive into assembly.
Well shit.
I tried the last implementation from the comment, and it reached 48.522 GB/s in the same benchmarking setup I’ve been using throughout this post.
:(
Sigh, I guess it’s time to dive in and try to figure out why this is so fast. First, though, I’m gonna refactor and restyle nwallin’s code a bunch²:
int32_t
eqsub(const uint8_t *input, size_t length)
{
int32_t res = 0;
size_t chunk_size = 64;
size_t iters = length / chunk_size;
size_t slack = length % chunk_size;
for (size_t i = 0; i < iters; i++) {
int8_t acc = 0;
for (size_t j = 0; j < chunk_size; j++) {
acc += (*input == 's') - (*input == 'p');
input++;
}
res += acc;
}
for (size_t i = 0; i < slack; i++) {
res += (*input == 's') - (*input == 'p');
input++;
}
return res;
}
The overall structure used here will look extremely familiar to you now; we split the input into chunks, process each chunk and add the result of the chunk to res, then process the “slack” left over at the end. What’s really interesting about this solution is the heart of the loop: (b == 's') - (b == 'p'). What’s going on here?
Well, comparing the current byte with either ‘s’ or ‘p’ yields 0 or 1. First off, both comparisons cannot be 1 because a byte cannot be both ‘s’ and ‘p’ at once. Both comparisons could certainly be 0, though. We then subtract the result of comparing with ‘p’ from the result of comparing with ‘s’. What does that mean? Maybe a little table will help:
| | b == 'p' | b != 'p' |
|---|---|---|
| b == 's' | impossible | 1 - 0 == 1 |
| b != 's' | 0 - 1 == -1 | 0 - 0 == 0 |
How clever is that?! The formula always returns the value we want, for every possible byte value – no cheating required!
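If you want to convince yourself, the three cases are easy to check directly (here 'x' stands in for any byte that’s neither ‘s’ nor ‘p’):
#include <assert.h>

int
main(void)
{
    assert((('s' == 's') - ('s' == 'p')) == 1);  // an 's' contributes +1
    assert((('p' == 's') - ('p' == 'p')) == -1); // a 'p' contributes -1
    assert((('x' == 's') - ('x' == 'p')) == 0);  // anything else contributes 0
    return 0;
}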
As you’d expect, the assembly for eqsub is a huge tangled mess, so I won’t show it here. The main loop’s assembly is optimal: eight neat sequences of cmeq, cmeq, sub for the two equality checks and the subtraction. Hold on: cmeq produces -1 for “true”, not 1, so won’t the results all be flipped? Clang is very clever here: it swaps the arguments to sub, negating the result and undoing the inversion.
Why not apply this in our own code? I’ve marked the few changes needed with comments:
int32_t
neon_eqsub(const uint8_t *input, size_t length)
{
size_t padding = 0;
if ((uintptr_t)input % 16 != 0) {
padding = 16 - (uintptr_t)input % 16;
}
int32_t res = table_length(input, padding);
input += padding;
length -= padding;
size_t iter_count = length / (16 * 127);
size_t slack = length % (16 * 127);
size_t chunk_count = iter_count * 127;
const uint8x16_t *chunks = (const uint8x16_t *)input;
const uint8x16_t *chunks_end = chunks + chunk_count;
uint8x16_t s = vmovq_n_u8('s');
uint8x16_t p = vmovq_n_u8('p');
// NEW:
// “ones” variable is no longer needed
while (chunks != chunks_end) {
uint8x16_t acc = vmovq_n_u8(0);
for (uint32_t i = 0; i < 127; i++) {
uint8x16_t chunk = *chunks;
chunks++;
// NEW:
// tiny inner loop
// with just three instructions
uint8x16_t matched_s = vceqq_u8(chunk, s);
uint8x16_t matched_p = vceqq_u8(chunk, p);
uint8x16_t matched = vsubq_u8(matched_p, matched_s);
acc = vaddq_u8(acc, matched);
}
res += vaddlvq_s8(acc);
}
input = (const uint8_t *)chunks;
res += table_length(input, slack);
return res;
}
Another wonderful performance win,
this time reaching … 21.171 GB/s?
Huh?
Yyyup, we’ve actually regressed performance,
below even neon_less_reduce
(which ran at 33.952 GB/s).
How could this be?
To investigate, I ran each of neon_less_reduce, neon_eqsub and eqsub for 5000 iterations on my standard 1 MB test input while tracking performance counters in Instruments.
billions of: | cycles | instrs | L1D misses | misprds |
---|---|---|---|---|
neon_eqsub | 96.4 | 316.4 | 6.4 | 0.3067 |
neon_less_reduce | 60.5 | 356.0 | 5.9 | 0.3087 |
eqsub | 42.0 | 223.8 | 6.7 | 0.0006 |
What stands out to me is the number of branch mispredictions.
Why would the supposedly-branchless code we’ve been writing
have any branch mispredictions at all?
Looking a bit further into which instructions are causing the issue,
it seems that all the branch mispredictions for neon_eqsub
and neon_less_reduce
are concentrated on a cmp
instruction at the bottom of the inner loop.
Welp, now that it’s been shoved in my face like this,
it really does make a lot of sense.
for (uint32_t i = 0; i < 127; i++)
has too many iterations for Clang to fully unroll it,
so the CPU is now put in an uncomfortable position:
it goes around that loop again and again,
priming the branch predictor to say,
“hey, I think this branch will be taken and we’ll go around the loop again!”.
Then, after a measly 127 iterations,
the prediction will be untrue,
causing a pipeline flush and a 13 cycle penalty.
We can fix this quite easily by using a more modest iteration count so that Clang can unroll the loop for us:
int32_t
neon_eqsub_unroll(const uint8_t *input, size_t length)
{
size_t padding = 0;
if ((uintptr_t)input % 16 != 0) {
padding = 16 - (uintptr_t)input % 16;
}
int32_t res = table_length(input, padding);
input += padding;
length -= padding;
size_t iter_count = length / (16 * 8);
size_t slack = length % (16 * 8);
size_t chunk_count = iter_count * 8;
const uint8x16_t *chunks = (const uint8x16_t *)input;
const uint8x16_t *chunks_end = chunks + chunk_count;
uint8x16_t s = vmovq_n_u8('s');
uint8x16_t p = vmovq_n_u8('p');
while (chunks != chunks_end) {
uint8x16_t acc = vmovq_n_u8(0);
// NEW:
// loop only for eight iterations
for (uint32_t j = 0; j < 8; j++) {
uint8x16_t chunk = *chunks;
chunks++;
uint8x16_t matched_s = vceqq_u8(chunk, s);
uint8x16_t matched_p = vceqq_u8(chunk, p);
uint8x16_t matched = vsubq_u8(matched_p, matched_s);
acc = vaddq_u8(acc, matched);
}
res += vaddlvq_s8(acc);
}
input = (const uint8_t *)chunks;
res += table_length(input, slack);
return res;
}
Finally, we’ve managed to beat eqsub with 49.806 GB/s, 163× faster than basic.
Cheating incident #2 – redux
I can’t help but at least try to apply the least-significant bit strategy here, just for fun:
int32_t
neon_lsb_unroll(const uint8_t *input, size_t length)
{
size_t padding = 0;
if ((uintptr_t)input % 16 != 0) {
padding = 16 - (uintptr_t)input % 16;
}
int32_t res = table_length(input, padding);
input += padding;
length -= padding;
size_t iter_count = length / (16 * 8);
size_t slack = length % (16 * 8);
size_t chunk_count = iter_count * 8;
const uint8x16_t *chunks = (const uint8x16_t *)input;
const uint8x16_t *chunks_end = chunks + chunk_count;
// NEW:
// ones variable makes a comeback!
uint8x16_t ones = vmovq_n_u8(1);
while (chunks != chunks_end) {
uint8x16_t acc = vmovq_n_u8(0);
for (uint32_t j = 0; j < 8; j++) {
uint8x16_t chunk = *chunks;
chunks++;
// NEW:
// simple loop body
acc = vaddq_u8(acc, vandq_u8(chunk, ones));
}
// NEW:
// use formula to deduce correct 's' vs 'p' value
res += 2 * vaddlvq_s8(acc) - 16 * 8;
}
input = (const uint8_t *)chunks;
res += table_length(input, slack);
return res;
}
And when I thought I had finally found a winner with 87.017 GB/s, 285× faster than basic, I had the idea to try adapting eqsub to use this too.
int32_t
lsb(const uint8_t *input, size_t length)
{
int32_t res = 0;
size_t chunk_size = 64;
size_t iters = length / chunk_size;
size_t slack = length % chunk_size;
for (size_t i = 0; i < iters; i++) {
int8_t acc = 0;
for (size_t j = 0; j < chunk_size; j++) {
acc += *input & 1;
input++;
}
res += acc;
}
for (size_t i = 0; i < slack; i++) {
res += *input & 1;
input++;
}
return 2 * res - (int32_t)length;
}
87.138 GB/s. Goddammit. You win, Clang :)
What Did We Learn™
This post followed the exact steps I took while trying to optimize the original function. Don’t do what I did! Before bringing out the SIMD intrinsics, try to reformulate your code so the compiler can vectorize it for you. We’ve seen how effective that can be with eqsub.
In general, try to use the compiler to your advantage, but make sure to check the resulting assembly. In this case eqsub emitted a perfect sequence of cmeqs and subs, but you won’t always be so lucky. Occasionally it just won’t emit what you want, even if you use intrinsics, and in those cases writing in raw assembly is an option.
1. Whether this is true or not, I’m not sure, since I’ve had some weird experiences with the M1’s performance counters, both inside and outside of Instruments. ↩︎
2. Aside from the stdint.h types, the variable names and the old school formatting, the biggest change I’ve made here is to “de-optimize” the loop conditions. Like, yes, n & 63 is faster than n % 64, but only if you have compiler optimizations turned off – but why would you do that in this context, of all places? Maybe it’s an old habit of nwallin’s, I don’t know. And don’t worry, even though it looks quite different from the original, I made sure that the performance remained identical; feel free to compare the original with my version to convince yourself that they’re equivalent. ↩︎
Luna Razzaghipour
15 July 2023