The other day I read a two-parter blog post by Owen Shepherd, {n} times faster than C. In it, he takes a simple algorithm, and optimizes it as much as he can, dropping down to raw assembly along the way.
I love this sort of thing, though I’m not very good at it. I decided I’d try my own hand at it anyway, but for the A64 instruction set. I’ve tried to make this post understandable if you haven’t read Owen’s, though I highly recommend you do! (if only for how much fun I found it to read).
For context, here’s the environment I’m using to obtain benchmark timings and compilation outputs:
- Clang 16.0.6
- -O3 and -g3
- Instruments 14.3.1
- M1 Pro
- macOS 13.4.1 without any weird kernel boot arguments or anything
All code (plus a nifty benchmarking harness!) is available in this GitHub repo. I’ve matched the inputs generated by Owen’s original benchmarking setup as closely as possible.
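If you’re wondering what the measurement itself looks like, the gist is simply timing one pass over a large input and dividing bytes processed by seconds elapsed. Here’s a rough sketch of that idea (this is not the actual harness from the repo, and all the names here are mine):
#include <stddef.h>
#include <stdint.h>
#include <time.h>

static double
throughput_gb_per_s(int32_t (*count)(const uint8_t *), const uint8_t *input,
        size_t length)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    volatile int32_t result = count(input); // volatile so the call isn't optimized away
    clock_gettime(CLOCK_MONOTONIC, &end);
    (void)result;
    double seconds = (double)(end.tv_sec - start.tv_sec)
            + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    return (double)length / seconds / 1e9;
}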
Problem statement:
- Initialize a counter variable to zero.
- Iterate over a null-terminated array of bytes.
- Whenever you see an ‘s’, increment the counter.
- Whenever you see a ‘p’, decrement the counter.
The basics
Let’s begin with the most trivial solution to the problem, written in C, as given by Owen’s blog post (plus some minor stylistic changes):
int32_t
basic(const uint8_t *input)
{
int32_t res = 0;
while (true) {
uint8_t c = *input;
input++;
switch (c) {
case '\0':
return res;
case 's':
res++;
break;
case 'p':
res--;
break;
default:
break;
}
}
}
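To make the expected behaviour concrete, here’s the sort of quick sanity check I like to keep around (assuming the basic function above is in scope; the test strings are mine, not from the benchmark inputs):
#include <assert.h>

int
main(void)
{
    // 's' increments, 'p' decrements, everything else is ignored
    assert(basic((const uint8_t *)"spss") == 2);
    assert(basic((const uint8_t *)"mississippi") == 2); // four 's's, two 'p's
    assert(basic((const uint8_t *)"") == 0);
    return 0;
}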
Compared to Owen’s 310 MB/s, I get 305 MB/s. In spite of varying microarchitectures, instruction sets, compilers, cache hierarchies and RAM speeds, we’ve ended up at almost the same speed!
Next, let’s take a look at the corresponding A64 assembly for this C:
basic:
mov w8, #0
loop:
ldrb w9, [x0], #1
cbz w9, exit
cmp w9, 'p'
b.eq p
cmp w9, 's'
b.ne loop
mov w9, #1
add w8, w8, w9
b loop
p:
mov w9, #-1
add w8, w8, w9
b loop
exit:
mov x0, x8
ret
I was honestly quite surprised to see that Clang emits assembly which is instruction-for-instruction equivalent to the x64 version. The only part here that leaves me confused is that Clang uses a mov and an add-with-register instead of a single add-with-immediate …
Rather than retrace the steps in Owen’s blog post, I think it’d be more fun to skip ahead to the faster versions of the code before taking a different approach entirely.
The table-driven approach
At the start of the second post we are introduced to this function:
static int32_t to_add[256] = {
['s'] = 1,
['p'] = -1,
};
int32_t
table(const uint8_t *input)
{
int32_t res = 0;
while (true) {
uint8_t c = *input;
input++;
if (c == '\0') {
return res;
}
res += to_add[c];
}
}
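One detail worth spelling out about that table: thanks to the designated initializers and static storage duration, every entry other than to_add['s'] and to_add['p'] is guaranteed to be zero, so res += to_add[c] silently adds 0 for every byte we don’t care about.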
Here we see a common approach to optimizing tight loops like these: eliminate branches by replacing them with table lookups. And this does indeed work: we just boosted our throughput 9×, bringing it to 2.712 GB/s! This completely pales in comparison to the 4 GB/s Owen achieved, though :(
table:
ldrb w9, [x0]
cbz w9, exit
mov x8, x0
mov w0, #0
add x8, x8, #1
adr x10, to_add
nop
loop:
ldr w9, [x10, w9, uxtw #2]
add w0, w9, w0
ldrb w9, [x8], #1
cbnz w9, loop
ret
exit:
mov w0, #0
ret
Mercifully, the assembly is, as far as these things go, serene.
Skipping past all that setup, the core loop is only three instructions: load from to_add (x10) using the c value from the previous iteration (w9) as an index, add it to res (w0), and finally load the next value of c. We jump back to the top of the loop only if c wasn’t zero (the null terminator).
Hmm, anyway, what could be causing this to be so slow? Come to think of it, when Owen tried compiling the same code with GCC, performance plummeted to 2.13 GB/s. In his case, the issue boiled down to (IIUC) the inner loop straddling a cache line boundary. Could we be having the same issue?
A quick trip to the Clang command line argument reference reveals that -falign-loops=64 is the incantation we need. And yes, the assembly is absolutely filled with nops. Unfortunately, performance stays roughly the same at 2.695 GB/s.
Running Instruments with appropriate performance counters enabled on table, I found that the function spends 90.8% of its time¹ simply loading bytes from the input (ldrb). To make things faster we just need to do that less. It’s not like we’re loading each byte of the input multiple times, so the way forward is to load multiple bytes in one go. To do that, though, we’ll need to bend the rules.
Cheating incident #1 – explicit length
Imagine the address space immediately following the input we’ve been given is unmapped. In this scenario, reading even a single byte past the null terminator at the end of the input will make us segfault. This is fine if we read a byte at a time, stopping as soon as we see the null terminator:
..........spspspspspspspspspsp0..........
^^^^^^^^^^^^^^^^^^^^|
all good :)
Now imagine that we switch to loading multiple bytes at once – say, in chunks of eight. The final chunk we read is likely to not only contain the null terminator, but also several of those segfault-triggering bytes following it!
these 3b make us segfault ↓
vvv
..........spspspspspspspspspsp0..........
\______/\______/\______/
8 8 8
KABOOM!
To overcome this,
we can switch to what is arguably a far saner scheme
(but one which is technically against the rules of the problem statement):
pass in the length of the input directly
instead of relying on a null terminator.
With this extra information in hand,
we can find how many bytes don’t neatly fit into our eight-byte chunks
using the %
operator.
We’ll process these bytes – I’m calling them “slack” –
separately, byte-by-byte at the end.
..........spspspspspspspspspsp0..........
\______/\______/^^^^|
8 8
all good :)
Here’s a pseudocode implementation:
int32_t
new_implementation(const uint8_t *input, size_t length)
{
size_t chunk_count = length / 8;
size_t slack = length % 8;
const uint8_t *chunks_end = input + chunk_count * 8;
int32_t res = 0;
while (input != chunks_end) {
// process input in eight-byte chunks,
// somehow ...
input += 8;
}
// account for remaining slack
res += table(input, slack);
return res;
}
A little sidebar
To implement the algorithm I described above,
we’ll need a version of table
which takes a length.
I ended up with this neat little function:
static int32_t to_add[256] = {
['s'] = 1,
['p'] = -1,
};
int32_t
table_length(const uint8_t *input, size_t length)
{
int32_t res = 0;
const uint8_t *end = input + length;
while (input != end) {
uint8_t c = *input;
input++;
res += to_add[c];
}
return res;
}
I won’t show the assembly here because it’s a huge mess,
since Clang managed to unroll this loop four times
(meaning it copy-pasted the loop body so that
four bytes get processed before branching back to the top of the loop).
This results in a huge speedup of 15× over basic
for a total throughput of 4.610 GB/s.
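In case “unrolled four times” is hard to picture, here’s roughly what that shape looks like written back out as C. This is only a sketch of the idea (the function name is mine, it leans on the same static to_add table as above, and it assumes the length is a multiple of four, which the real compiler-generated code handles with extra fix-up iterations):
int32_t
table_length_unrolled_sketch(const uint8_t *input, size_t length)
{
    int32_t res = 0;
    const uint8_t *end = input + length;
    while (input != end) {
        // four table lookups per trip around the loop,
        // so the branch at the bottom runs a quarter as often
        res += to_add[input[0]];
        res += to_add[input[1]];
        res += to_add[input[2]];
        res += to_add[input[3]];
        input += 4;
    }
    return res;
}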
Processing the input in chunks
How can we load multiple bytes of our input in one go? One approach is to cast the input pointer from a uint8_t * to a uint64_t *, thereby letting us load eight bytes at once. We can then loop through the component bytes of the loaded integer, processing them as we go. Something like this?
int32_t
new_implementation(const uint8_t *input, size_t length)
{
// ...
while (input != chunks_end) {
uint64_t chunk = *(const uint64_t *)input;
input += 8; // step forward by a whole chunk, not a single byte
uint8_t bytes[8];
memcpy(bytes, &chunk, 8);
for (size_t i = 0; i < 8; i++) {
res += to_add[bytes[i]];
}
}
// ...
}
I don’t like how we
declare both chunk
and bytes
and then memcpy
from one to the other
just to loop over bytes,
but that’s exactly the sort of thing compilers are good at optimizing away.
One problem here is that we’re casting a uint8_t *
to a uint64_t *
without thinking about alignment:
a pointer to a 64-bit integer must be a multiple of eight,
but we haven’t made sure of that.
Although the M1’s Firestorm microarchitecture permits
(and as far as I know doesn’t penalize)
unaligned accesses,
they are undefined behavior as per the C standard,
so we’d better fix this.
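To make the fix-up we’re about to write concrete, with a made-up address: if input were 0x1003, then (uintptr_t)input % 8 would be 3, so we’d process 8 - 3 = 5 bytes the slow, byte-by-byte way, and the chunked loop would start at the nicely aligned address 0x1008.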
Remember that pseudocode from earlier about handling “slack”? Let’s adapt it to also handle alignment, and glue it together with that code above to process each chunk:
static int32_t to_add[256] = {
['s'] = 1,
['p'] = -1,
};
int32_t
table_8(const uint8_t *input, size_t length)
{
size_t padding = 0;
if ((uintptr_t)input % 8 != 0) {
padding = 8 - (uintptr_t)input % 8;
}
int32_t res = table_length(input, padding);
input += padding;
length -= padding;
size_t chunk_count = length / 8;
size_t slack = length % 8;
const uint64_t *chunks = (const uint64_t *)input;
const uint64_t *chunks_end = chunks + chunk_count;
while (chunks != chunks_end) {
uint64_t chunk = *chunks;
chunks++;
uint8_t bytes[8];
memcpy(bytes, &chunk, 8);
for (size_t i = 0; i < 8; i++) {
res += to_add[bytes[i]];
}
}
input = (const uint8_t *)chunks;
res += table_length(input, slack);
return res;
}
And lo and behold!
Our new implementation gives us further gains:
20× faster than basic
at 6.358 GB/s.
As I had hoped, Clang completely eliminated the memcpy
as well as the loop,
resulting in this very fast but very lengthy mess:
table_8:
stp x22, x21, [sp, #-0x30]!
stp x20, x19, [sp, #0x10]
stp x29, x30, [sp, #0x20]
add x29, sp, #0x20
mov x20, x1
mov x21, x0
ands x8, x0, #0x7
mov w9, #8
sub x8, x9, x8
tst x0, #0x7
csel x22, xzr, x8, eq
mov x1, x22
bl table_length
mov x19, x0
add x0, x21, x22
sub x8, x20, x22
cmp x8, #8
b.lo slack
lsr x10, x8, #3
add x9, x0, x10, lsl #3
lsl x10, x10, #3
adr x11, to_add
nop
loop:
ldr x12, [x0], #8
and x13, x12, #0xff
ldr w13, [x11, x13, lsl #2]
add w13, w13, w19
ubfx x14, x12, #8, #8
ldr w14, [x11, x14, lsl #2]
ubfx x15, x12, #16, #8
ldr w15, [x11, x15, lsl #2]
add w14, w15, w14
add w13, w14, w13
ubfx x14, x12, #24, #8
ldr w14, [x11, x14, lsl #2]
ubfx x15, x12, #32, #8
ldr w15, [x11, x15, lsl #2]
ubfx x16, x12, #40, #8
ldr w16, [x11, x16, lsl #2]
add w14, w15, w14
add w14, w16, w14
add w13, w14, w13
ubfx x14, x12, #48, #8
ldr w14, [x11, x14, lsl #2]
lsr x12, x12, #54
and x12, x12, #0x3fc
ldr w12, [x11, x12]
add w12, w12, w14
add w19, w12, w13
subs x10, x10, #8
b.ne loop
mov x0, x9
slack:
and x1, x8, #0x7
bl table_length
add w0, w0, w19
ldp x29, x30, [sp, #0x20]
ldp x20, x19, [sp, #0x10]
ldp x22, x21, [sp], #0x30
ret
I had been considering using bitwise operations to separate the bytes out of chunk instead of copying it into an array of bytes, but it seems Clang has already done that for us. ubfx, or Unsigned Bitfield Extract, is an A64 instruction that copies a region of bits from one register into another. Do a quick scan through the assembly and you’ll see repeated ubfx uses with increasing offsets, which is exactly the assembly we want to see here.
Also, take note of all the loads in the inner loop: they all load with some offset from to_add (x11) except for one, which loads from input (x0). At the bottom of the loop you’ll see that we are indeed stepping forward by eight each iteration. Thus, we have achieved our goal of reducing how often we load from the input: we now only load once every eight bytes, instead of once every byte.
That makes me wonder, though, if we can do better? Could we load once every sixteen bytes instead? Sure thing!
Loading less often with 128-bit integers
It turns out that the AArch64 execution state
(or is it the A64 instruction set? or the Armv8-A architecture? I hate this)
makes available to us thirty-two 128-bit registers.
We can access these from C using the __uint128_t
type.
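(Worth noting: __uint128_t is a GCC/Clang extension rather than part of standard C, but we’re already relying on Clang targeting AArch64 here, so that’s fine.)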
We only need to tweak table_8
slightly to make use of this:
static int32_t to_add[256] = {
['s'] = 1,
['p'] = -1,
};
int32_t
table_16(const uint8_t *input, size_t length)
{
size_t padding = 0;
// all the 8s are replaced with 16
if ((uintptr_t)input % 16 != 0) {
padding = 16 - (uintptr_t)input % 16;
}
int32_t res = table_length(input, padding);
input += padding;
length -= padding;
size_t chunk_count = length / 16;
size_t slack = length % 16;
// and all the uint64_ts are replaced with __uint128_t
const __uint128_t *chunks = (const __uint128_t *)input;
const __uint128_t *chunks_end = chunks + chunk_count;
while (chunks != chunks_end) {
__uint128_t chunk = *chunks;
chunks++;
uint8_t bytes[16];
memcpy(bytes, &chunk, 16);
for (size_t i = 0; i < 16; i++) {
res += to_add[bytes[i]];
}
}
input = (const uint8_t *)chunks;
res += table_length(input, slack);
return res;
}
I won’t show the assembly because it’s basically the same as for table_8, just longer. One thing I found interesting is that 128-bit registers are never actually used, with Clang preferring to use two 64-bit registers instead. It’s amazing that it was able to see through my (admittedly opaque) code and figure out the intention behind it so effectively.
The performance improvement here was more modest: 23× faster than basic at 7.090 GB/s.
A new approach: Neon
It seems we’re starting to hit a limit on the table-driven strategy, at least with how we’re approaching it currently.
If you’ve ever looked much into software optimization, you’ve probably heard of SIMD, or Single Instruction, Multiple Data. The name really says it all: one instruction operates on multiple pieces of data.
Arm as supported by the M1
(I’ve given up on using terminology)
gives us an instruction set (????) called Neon
which uses those 128-bit registers from earlier.
Those registers can be treated as 16-, 32- or 64-bit floats, or as 128-bit vectors.
Vectors can be divvied up into lanes of various widths:
a 128-bit vector can be viewed as sixteen 8-bit lanes,
eight 16-bit lanes, four 32-bit lanes, and so on.
These are each represented by a C type provided by the arm_neon.h header: int8x16_t, int16x8_t, int32x4_t, etc. There exist unsigned versions of each. Since we’re processing individual bytes, we’ll mainly be using uint8x16_t.
To give you some idea of what kinds of operations you can do with vectors, there are ones like “multiply these two vectors lanewise”, as well as ones like “add up all the lanes in this vector”. We can access these operations from C using special functions called “intrinsics”, also from the arm_neon.h header, which are (mostly) named after the instructions they produce. For instance, adding two vectors of four uint32_ts each can be accomplished with the assembly instruction add v0.4s, v1.4s, v2.4s, or just as well with the C code v0 = vaddq_u32(v1, v2).
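Here’s that example fleshed out into a tiny standalone program, in case you want something to paste into a file and play with (the values are arbitrary ones I picked):
#include <arm_neon.h>
#include <stdio.h>

int
main(void)
{
    uint32_t a[4] = {1, 2, 3, 4};
    uint32_t b[4] = {10, 20, 30, 40};
    uint32x4_t v1 = vld1q_u32(a);      // load four 32-bit lanes from memory
    uint32x4_t v2 = vld1q_u32(b);
    uint32x4_t v0 = vaddq_u32(v1, v2); // add v0.4s, v1.4s, v2.4s
    uint32_t out[4];
    vst1q_u32(out, v0);                // store the lanes back to memory
    printf("%u %u %u %u\n", out[0], out[1], out[2], out[3]); // 11 22 33 44
    return 0;
}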
Let’s try to think about how we can implement our algorithm using Neon. Say we’ve just loaded up a vector of sixteen bytes:
p p h t z p a y z n s m s x m d
We want to locate all the ‘s’s and ‘p’s, adding 1 for the ‘s’s and subtracting 1 for the ‘p’s.
We want some way to perform the following transformation:
p p h t z p a y z n s m s x m d
↓
-1 -1 0 0 0 -1 0 0 0 0 1 0 1 0 0 0
Neon gives us an instruction, cmeq, that compares two vectors: if two lanes are equal, the corresponding output lane is set to -1; if two lanes aren’t equal, the corresponding output lane is set to 0. Perfect!
Given a register full of ‘p’s, we’re already halfway:
p p h t z p a y z n s m s x m d
p p p p p p p p p p p p p p p p
↓ cmeq ↓
-1 -1 0 0 0 -1 0 0 0 0 0 0 0 0 0 0
We can do the same for the ‘s’s:
p p h t z p a y z n s m s x m d
s s s s s s s s s s s s s s s s
↓ cmeq ↓
0 0 0 0 0 0 0 0 0 0 -1 0 -1 0 0 0
How can we make those positive, then? In binary, -1 as a byte is 11111111. We want to turn that into 00000001, leaving all 00000000s untouched. An easy way to do this is to bitwise-AND with 00000001:
a | b | a & b |
---|---|---|
00000000 | 00000001 | 00000000 |
11111111 | 00000001 | 00000001 |
Let’s see that in full:
p p h t z p a y z n s m s x m d
s s s s s s s s s s s s s s s s
↓ cmeq ↓
0 0 0 0 0 0 0 0 0 0 -1 0 -1 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
↓ and ↓
0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0
Awesome!
We now have a resultant vector for ‘s’s and ‘p’s, but how do we combine them together?
In fact, it just so happens that bitwise-ORing anything with 0 gives back the original value, and our vectors have no overlapping 1s and -1s. So, we can merge them using the orr instruction:
-1 -1 0 0 0 -1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0
↓ orr ↓
-1 -1 0 0 0 -1 0 0 0 0 1 0 1 0 0 0
And there we have our original target vector!
Writing this out in code is trivial now that we’ve worked out the steps that are involved. To make it easier to focus on the logic of the code I’ve put the inner loop here by itself first, so you can get a good look at it:
uint8x16_t s = vmovq_n_u8('s');
uint8x16_t p = vmovq_n_u8('p');
uint8x16_t ones = vmovq_n_u8(1);
while (chunks != chunks_end) {
uint8x16_t chunk = *chunks;
chunks++;
uint8x16_t matched_s = vandq_u8(vceqq_u8(chunk, s), ones);
uint8x16_t matched_p = vceqq_u8(chunk, p);
uint8x16_t matched = vorrq_u8(matched_s, matched_p);
res += vaddlvq_s8(matched);
}
Let’s look over each of the intrinsics I’ve used here:
- vmovq_n_u8: duplicates a scalar value across all lanes
- vceqq_u8: performs that cmeq equality operation from earlier (I don’t know why the intrinsic is named differently to the instruction)
- vandq_u8: bitwise-AND
- vorrq_u8: bitwise-OR
- vaddlvq_s8: adds values across lanes into a scalar which is large enough to hold the result (in this case adding up all 8-bit lanes gives a 16-bit sum)
You might have noticed how each of these intrinsics ends with u8 except for vaddlvq_s8. u8 indicates an unsigned 8-bit integer; s8 is the same, but signed. The reason for using the signed variant in this case is that widening the sum from eight bits to sixteen bits is a sign-dependent operation: sign-extending the byte 11111111 gives -1, while zero-extending it gives 0000000011111111, which is not -1 for a 16-bit integer.
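Here’s the same distinction in plain scalar C, just to make it concrete:
#include <assert.h>
#include <stdint.h>

int
main(void)
{
    int8_t signed_byte = -1;               // bit pattern 11111111
    uint8_t unsigned_byte = 0xff;          // the same bit pattern
    int16_t sign_extended = signed_byte;   // 1111111111111111, i.e. -1
    int16_t zero_extended = unsigned_byte; // 0000000011111111, i.e. 255
    assert(sign_extended == -1);
    assert(zero_extended == 255);
    return 0;
}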
Here’s the function in full (which is mostly the same as for table_16):
int32_t
neon(const uint8_t *input, size_t length)
{
size_t padding = 0;
if ((uintptr_t)input % 16 != 0) {
padding = 16 - (uintptr_t)input % 16;
}
int32_t res = table_length(input, padding);
input += padding;
length -= padding;
size_t chunk_count = length / 16;
size_t slack = length % 16;
const uint8x16_t *chunks = (const uint8x16_t *)input;
const uint8x16_t *chunks_end = chunks + chunk_count;
uint8x16_t s = vmovq_n_u8('s');
uint8x16_t p = vmovq_n_u8('p');
uint8x16_t ones = vmovq_n_u8(1);
while (chunks != chunks_end) {
uint8x16_t chunk = *chunks;
chunks++;
uint8x16_t matched_s = vandq_u8(vceqq_u8(chunk, s), ones);
uint8x16_t matched_p = vceqq_u8(chunk, p);
uint8x16_t matched = vorrq_u8(matched_s, matched_p);
res += vaddlvq_s8(matched);
}
input = (const uint8_t *)chunks;
res += table_length(input, slack);
return res;
}
Finally, the processor is kind enough to reward us handsomely for our efforts: 22.519 GB/s for a total speedup of 73×! Funnily enough there’s little point in poring over the assembly this time around, since we’ve basically hand-written which instructions we want Clang to produce (though a quick glance is a good idea!).
We can go further, though.
Profiling in Instruments shows that the function spends 81.6% of its time in the pair of instructions vaddlvq_s8 generates – again, I’m not sure if I can trust this, but it does make sense considering what the Firestorm SIMD instruction tables have to say about saddlv and smov.
Reducing less often
If you think about it, we really don’t need to be reducing the vector on each loop iteration. Instead, why don’t we use a vector accumulator variable that gets reduced at the end? Indeed, we could use res for this purpose if we were to change its type from int32_t to uint8x16_t. Hmmm, that won’t work though: what if we get unlucky and one of res’s lanes overflows past 127?
Let’s wrap the body of the inner loop in another loop which runs 127 times (so overflow is impossible), after which the accumulator is reduced and added to res, before being reset to zero. Since we’re now stepping in units of 16 × 127 = 2032 instead of 16, we’ll also have to update the “slack” calculations.
int32_t
neon_less_reduce(const uint8_t *input, size_t length)
{
size_t padding = 0;
if ((uintptr_t)input % 16 != 0) {
padding = 16 - (uintptr_t)input % 16;
}
int32_t res = table_length(input, padding);
input += padding;
length -= padding;
// NEW:
size_t iter_count = length / (16 * 127);
size_t slack = length % (16 * 127);
size_t chunk_count = iter_count * 127;
const uint8x16_t *chunks = (const uint8x16_t *)input;
const uint8x16_t *chunks_end = chunks + chunk_count;
uint8x16_t s = vmovq_n_u8('s');
uint8x16_t p = vmovq_n_u8('p');
uint8x16_t ones = vmovq_n_u8(1);
while (chunks != chunks_end) {
uint8x16_t acc = vmovq_n_u8(0);
// NEW:
for (uint32_t i = 0; i < 127; i++) {
uint8x16_t chunk = *chunks;
chunks++;
uint8x16_t matched_s =
vandq_u8(vceqq_u8(chunk, s), ones);
uint8x16_t matched_p = vceqq_u8(chunk, p);
uint8x16_t matched = vorrq_u8(matched_s, matched_p);
acc = vaddq_u8(acc, matched);
}
res += vaddlvq_s8(acc);
}
input = (const uint8_t *)chunks;
res += table_length(input, slack);
return res;
}
This simple function has turned into quite a large volume of code, hasn’t it?
The performance here has improved in dramatic fashion once again,
reaching 33.952 GB/s, a whopping 111× faster than our original basic
function.
But can we do better? Of course we can.
Cheating incident #2 – only ‘s’ and ‘p’
Tragically, while I was writing this post I ran across {n} times faster than C, where n = 128 by Thomas Ip. Not only did someone else make a follow-up post to Owen’s original post in which they use SIMD to get huge performance improvements, but that someone else also used Neon intrinsics on an M1 Pro! I was quite miffed that someone had beaten me to the punch, though I decided to finish writing anyway :)
Anyhow, in his post, Thomas makes an astute observation. The ASCII representation for ‘p’ is 1110000, while ‘s’ is 1110011. We can distinguish between the two solely through the least-significant bit – 'p' & 1 == 0, while 's' & 1 == 1. Of course, this only works if every single character in the input is either ‘s’ or ‘p’. Owen’s original code never assumes this, and neither has ours so far. Let’s keep going and try out the idea anyway!
First, here’s the simplest version of Thomas’s code that takes advantage of this optimization:
fn opt3_count_s_branchless(input: &str) -> i64 {
let n_s = input.bytes().map(|b| (b & 1) as i64).sum::<i64>();
(2 * n_s) - input.len() as i64
}
We count the number of occurrences of ‘s’ by adding up the least-significant bit of every byte in the input; let’s call this s_count. Since we’re now assuming the input is composed purely of ‘s’s and ‘p’s, we know that p_count must be length - s_count. Our goal is to increment for all ‘s’s and decrement for all ‘p’s, so the value we return is (1 * s_count) + (-1 * p_count).
Let’s expand that:
(1 * s_count) + (-1 * p_count)
= s_count - p_count
= s_count - (length - s_count)
= s_count - length + s_count
= 2 * s_count - length
And whaddaya know, that’s exactly what Thomas’s function returns!
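For reference, here’s that same idea written as plain scalar C (a sketch only; the function name is mine and appears in neither post):
int32_t
lsb_scalar_sketch(const uint8_t *input, size_t length)
{
    int32_t s_count = 0;
    for (size_t i = 0; i < length; i++)
        s_count += input[i] & 1; // 1 for 's', 0 for 'p'
    return 2 * s_count - (int32_t)length;
}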
Side note: I want to quickly point out how interesting it is that the problem has been reduced from one of searching for characters to counting how many least-significant bits are set.
Let’s apply this in our own code, now. Once again, the main loop is the only part that’s changed:
int32_t
neon_lsb(const uint8_t *input, size_t length)
{
size_t padding = 0;
if ((uintptr_t)input % 16 != 0) {
padding = 16 - (uintptr_t)input % 16;
}
int32_t res = table_length(input, padding);
input += padding;
length -= padding;
size_t iter_count = length / (16 * 127);
size_t slack = length % (16 * 127);
size_t chunk_count = iter_count * 127;
const uint8x16_t *chunks = (const uint8x16_t *)input;
const uint8x16_t *chunks_end = chunks + chunk_count;
// NEW:
// no more s and p constants
uint8x16_t ones = vmovq_n_u8(1);
while (chunks != chunks_end) {
uint8x16_t acc = vmovq_n_u8(0);
for (uint32_t i = 0; i < 127; i++) {
uint8x16_t chunk = *chunks;
chunks++;
// NEW:
// brutally simple loop body:
// just mask out the lsb
// and add to the accumulator
acc = vaddq_u8(acc, vandq_u8(chunk, ones));
}
// NEW:
// that formula we worked out
// makes an appearance
res += 2 * vaddlvq_s8(acc) - 16 * 127;
}
input = (const uint8_t *)chunks;
res += table_length(input, slack);
return res;
}
A nice speed increase:
now we’re 124× faster than basic
at 37.961 GB/s.
An unfortunate discovery
It was at this point in the article-writing process that I was going to call it on the algorithmic work, and get down to micro-optimization (or at least the few things I know which could pass as micro-optimization). For some reason, though, I decided to go read the Hacker News comments under Owen’s post, where I stumbled across a comment from nwallin:
nwallin — 7 July 2023
So I downloaded your code. On my desktop, with loop-9 gcc I got ~4.5 GB/s, and with loop-7 I got ~4.4 GB/s. With the following code:
#include <stddef.h>
int run_switches(const char *s, size_t n) {
    int res = 0;
    for (; n--; ++s)
        res += (*s == 's') - (*s == 'p');
    return res;
}
I got ~31 GB/s in GCC and ~33 GB/s in Clang. This is without any padding, or SIMD intrinsics, or any such nonsense. This is just untying the compiler’s hands and giving it permission to do its job properly.
[…]
With a little bit of blocking:
#include <stddef.h>
int run_switches(const char *s, size_t n) {
    int res = 0;
    char tmp = 0;
    for (size_t i = n & 63; i--; ++s)
        tmp += (*s == 's') - (*s == 'p');
    res += tmp;
    for (n >>= 6; n--;) {
        tmp = 0;
        for (size_t i = 64; i--; ++s)
            tmp += (*s == 's') - (*s == 'p');
        res += tmp;
    }
    return res;
}
That’s ~55 GB/s.
Anyway, the point is, you’re pretty far from the point where you ought to give up on C and dive into assembly.
Well shit.
I tried the last implementation from the comment, and it reached 48.522 GB/s in the same benchmarking setup I’ve been using throughout this post.
:(
Sigh, I guess it’s time to dive in and try to figure out why this is so fast. First, though, I’m gonna refactor and restyle nwallin’s code a bunch²:
int32_t
eqsub(const uint8_t *input, size_t length)
{
int32_t res = 0;
size_t chunk_size = 64;
size_t iters = length / chunk_size;
size_t slack = length % chunk_size;
for (size_t i = 0; i < iters; i++) {
int8_t acc = 0;
for (size_t j = 0; j < chunk_size; j++) {
acc += (*input == 's') - (*input == 'p');
input++;
}
res += acc;
}
for (size_t i = 0; i < slack; i++) {
res += (*input == 's') - (*input == 'p');
input++;
}
return res;
}
The overall structure used here will look extremely familiar to you now; we split the input into chunks, process each chunk and add the result of the chunk to res, then process the “slack” left over at the end. What’s really interesting about this solution is the heart of the loop: (b == 's') - (b == 'p'). What’s going on here?
Well, comparing the current byte with either ‘s’ or ‘p’ yields 0 or 1. First off, both comparisons cannot be 1 because a byte cannot be both ‘s’ and ‘p’ at once. Both comparisons could certainly be 0, though. We then subtract the result of comparing with ‘p’ from the result of comparing with ‘s’. What does that mean? Maybe a little table will help:
| | b == 'p' | b != 'p' |
|---|---|---|
| b == 's' | impossible | 1 - 0 == 1 |
| b != 's' | 0 - 1 == -1 | 0 - 0 == 0 |
How clever is that?! The formula always returns the value we want, for every possible byte value – no cheating required!
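If you want to convince yourself, the three cases are easy to check directly (here 'x' stands in for any byte that’s neither ‘s’ nor ‘p’):
#include <assert.h>

int
main(void)
{
    assert((('s' == 's') - ('s' == 'p')) == 1);  // an 's' contributes +1
    assert((('p' == 's') - ('p' == 'p')) == -1); // a 'p' contributes -1
    assert((('x' == 's') - ('x' == 'p')) == 0);  // anything else contributes 0
    return 0;
}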
As you’d expect, the assembly for eqsub is a huge tangled mess, so I won’t show it here. The main loop’s assembly is optimal: eight neat sequences of cmeq, cmeq, sub for the two equality checks and the subtraction. Hold on: cmeq produces -1 for “true”, not 1, so won’t the results all be flipped? Clang is very clever here: it swaps the arguments to sub, negating the result and undoing the inversion.
Why not apply this in our own code? I’ve marked the few changes needed with comments:
int32_t
neon_eqsub(const uint8_t *input, size_t length)
{
size_t padding = 0;
if ((uintptr_t)input % 16 != 0) {
padding = 16 - (uintptr_t)input % 16;
}
int32_t res = table_length(input, padding);
input += padding;
length -= padding;
size_t iter_count = length / (16 * 127);
size_t slack = length % (16 * 127);
size_t chunk_count = iter_count * 127;
const uint8x16_t *chunks = (const uint8x16_t *)input;
const uint8x16_t *chunks_end = chunks + chunk_count;
uint8x16_t s = vmovq_n_u8('s');
uint8x16_t p = vmovq_n_u8('p');
// NEW:
// “ones” variable is no longer needed
while (chunks != chunks_end) {
uint8x16_t acc = vmovq_n_u8(0);
for (uint32_t i = 0; i < 127; i++) {
uint8x16_t chunk = *chunks;
chunks++;
// NEW:
// tiny inner loop
// with just three instructions
uint8x16_t matched_s = vceqq_u8(chunk, s);
uint8x16_t matched_p = vceqq_u8(chunk, p);
uint8x16_t matched = vsubq_u8(matched_p, matched_s);
acc = vaddq_u8(acc, matched);
}
res += vaddlvq_s8(acc);
}
input = (const uint8_t *)chunks;
res += table_length(input, slack);
return res;
}
Another wonderful performance win,
this time reaching … 21.171 GB/s?
Huh?
Yyyup, we’ve actually regressed performance,
below even neon_less_reduce
(which ran at 33.952 GB/s).
How could this be?
To investigate, I ran each of neon_less_reduce, neon_eqsub and eqsub for 5000 iterations on my standard 1 MB test input while tracking performance counters in Instruments.
billions of: | cycles | instrs | L1D misses | misprds |
---|---|---|---|---|
neon_eqsub | 96.4 | 316.4 | 6.4 | 0.3067 |
neon_less_reduce | 60.5 | 356.0 | 5.9 | 0.3087 |
eqsub | 42.0 | 223.8 | 6.7 | 0.0006 |
What stands out to me is the number of branch mispredictions.
Why would the supposedly-branchless code we’ve been writing
have any branch mispredictions at all?
Looking a bit further into which instructions are causing the issue,
it seems that all the branch mispredictions for neon_eqsub
and neon_less_reduce
are concentrated on a cmp
instruction at the bottom of the inner loop.
Welp, now that it’s been shoved in my face like this,
it really does make a lot of sense.
for (uint32_t i = 0; i < 127; i++)
has too many iterations for Clang to fully unroll it,
so the CPU is now put in an uncomfortable position:
it goes around that loop again and again,
priming the branch predictor to say,
“hey, I think this branch will be taken and we’ll go around the loop again!”.
Then, after a measly 127 iterations,
the prediction will be untrue,
causing a pipeline flush and a 13 cycle penalty.
We can fix this quite easily by using a more modest iteration count so that Clang can unroll the loop for us:
int32_t
neon_eqsub_unroll(const uint8_t *input, size_t length)
{
size_t padding = 0;
if ((uintptr_t)input % 16 != 0) {
padding = 16 - (uintptr_t)input % 16;
}
int32_t res = table_length(input, padding);
input += padding;
length -= padding;
size_t iter_count = length / (16 * 8);
size_t slack = length % (16 * 8);
size_t chunk_count = iter_count * 8;
const uint8x16_t *chunks = (const uint8x16_t *)input;
const uint8x16_t *chunks_end = chunks + chunk_count;
uint8x16_t s = vmovq_n_u8('s');
uint8x16_t p = vmovq_n_u8('p');
while (chunks != chunks_end) {
uint8x16_t acc = vmovq_n_u8(0);
// NEW:
// loop only for eight iterations
for (uint32_t j = 0; j < 8; j++) {
uint8x16_t chunk = *chunks;
chunks++;
uint8x16_t matched_s = vceqq_u8(chunk, s);
uint8x16_t matched_p = vceqq_u8(chunk, p);
uint8x16_t matched = vsubq_u8(matched_p, matched_s);
acc = vaddq_u8(acc, matched);
}
res += vaddlvq_s8(acc);
}
input = (const uint8_t *)chunks;
res += table_length(input, slack);
return res;
}
Finally, we’ve managed to beat eqsub with 49.806 GB/s, 163× faster than basic.
Cheating incident #2 – redux
I can’t help but at least try to apply the least-significant bit strategy here, just for fun:
int32_t
neon_lsb_unroll(const uint8_t *input, size_t length)
{
size_t padding = 0;
if ((uintptr_t)input % 16 != 0) {
padding = 16 - (uintptr_t)input % 16;
}
int32_t res = table_length(input, padding);
input += padding;
length -= padding;
size_t iter_count = length / (16 * 8);
size_t slack = length % (16 * 8);
size_t chunk_count = iter_count * 8;
const uint8x16_t *chunks = (const uint8x16_t *)input;
const uint8x16_t *chunks_end = chunks + chunk_count;
// NEW:
// ones variable makes a comeback!
uint8x16_t ones = vmovq_n_u8(1);
while (chunks != chunks_end) {
uint8x16_t acc = vmovq_n_u8(0);
for (uint32_t j = 0; j < 8; j++) {
uint8x16_t chunk = *chunks;
chunks++;
// NEW:
// simple loop body
acc = vaddq_u8(acc, vandq_u8(chunk, ones));
}
// NEW:
// use formula to deduce correct 's' vs 'p' value
res += 2 * vaddlvq_s8(acc) - 16 * 8;
}
input = (const uint8_t *)chunks;
res += table_length(input, slack);
return res;
}
And when I thought I had finally found a winner with 87.017 GB/s, 285× faster than basic, I had the idea to try adapting eqsub to use this too.
int32_t
lsb(const uint8_t *input, size_t length)
{
int32_t res = 0;
size_t chunk_size = 64;
size_t iters = length / chunk_size;
size_t slack = length % chunk_size;
for (size_t i = 0; i < iters; i++) {
int8_t acc = 0;
for (size_t j = 0; j < chunk_size; j++) {
acc += *input & 1;
input++;
}
res += acc;
}
for (size_t i = 0; i < slack; i++) {
res += *input & 1;
input++;
}
return 2 * res - (int32_t)length;
}
87.138 GB/s. Goddammit. You win, Clang :)
What Did We Learn™
This post followed the exact steps I took while trying to optimize the original function. Don’t do what I did! Before bringing out the SIMD intrinsics, try to reformulate your code so the compiler can vectorize it for you. We’ve seen how effective that can be with eqsub.
In general, try to use the compiler to your advantage, but make sure to check the resulting assembly. In this case eqsub emitted a perfect sequence of cmeqs and subs, but you won’t always be so lucky. Occasionally it just won’t emit what you want, even if you use intrinsics, and in those cases writing in raw assembly is an option.
1. Whether this is true or not, I’m not sure, since I’ve had some weird experiences with the M1’s performance counters, both inside and outside of Instruments. ↩︎
2. Aside from the stdint.h types, the variable names and the old school formatting, the biggest change I’ve made here is to “de-optimize” the loop conditions. Like, yes, n & 63 is faster than n % 64, but only if you have compiler optimizations turned off – but why would you do that in this context, of all places? Maybe it’s an old habit of nwallin’s, I don’t know. And don’t worry, even though it looks quite different from the original, I made sure that the performance remained identical; feel free to compare the original with my version to convince yourself that they’re equivalent. ↩︎
Luna Razzaghipour
15 July 2023