In the context of out-of-order microarchitectures, I was under the impression that execution units remain occupied until the µop they’re processing is complete. This is often not the case.
As an example, take the Firestorm microarchitecture in the A14 and M1. It has two integer execution units that can execute multiplies, and a single multiplication takes three cycles to complete.
Of course, a sequence of dependent instructions like
```
benchmark:
    mul x1, x0, x0 // a
    mul x2, x1, x1 // b
    mul x3, x2, x2 // c
    mul x4, x3, x3 // d
    ret
```
will take 4 × 3 = 12 cycles, since it can only take advantage of a single execution unit:
cycle | EU 1 | EU 2 | completed |
---|---|---|---|
0 | [a] | [ ] | |
1 | [a] | [ ] | |
2 | [a] | [ ] | |
3 | [b] | [ ] | a |
4 | [b] | [ ] | a |
5 | [b] | [ ] | a |
6 | [c] | [ ] | a , b |
7 | [c] | [ ] | a , b |
8 | [c] | [ ] | a , b |
9 | [d] | [ ] | a , b , c |
10 | [d] | [ ] | a , b , c |
11 | [d] | [ ] | a , b , c |
12 | [ ] | [ ] | a , b , c , d |
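(If you want to sanity-check that three-cycle latency yourself, a rough C sketch of the idea looks like the following. This isn't the code from the repo I mention at the end; the clock_gettime harness is just one way to time things, the 3.2 GHz constant is only a placeholder for your CPU's maximum frequency, and the iteration count is arbitrary.)

```c
// Rough latency sketch (AArch64 only): every mul depends on the previous
// one, so the loop is limited by multiply latency, not throughput.
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS 100000000ull
#define ASSUMED_MAX_GHZ 3.2 // placeholder; set to your CPU's maximum frequency

static double now_seconds(void) {
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
	return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
	uint64_t x = 3;
	double start = now_seconds();
	for (uint64_t i = 0; i < ITERATIONS; i++) {
		// A four-long dependent chain, like the benchmark above.
		__asm__ volatile(
			"mul %0, %0, %0\n\t"
			"mul %0, %0, %0\n\t"
			"mul %0, %0, %0\n\t"
			"mul %0, %0, %0"
			: "+r"(x));
	}
	double elapsed = now_seconds() - start;
	printf("%.2f cycles per dependent mul (x = %llu)\n",
	       elapsed * ASSUMED_MAX_GHZ * 1e9 / (ITERATIONS * 4),
	       (unsigned long long)x);
}
```

If the latency really is three cycles, the printed number should settle right around three: the loop counter and branch can execute alongside the multiply chain, so they don't add to the critical path.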
With my original understanding of how execution units work, a sequence of independent instructions like
```
benchmark:
    mul x1, x0, x0 // a
    mul x2, x0, x0 // b
    mul x3, x0, x0 // c
    mul x4, x0, x0 // d
    ret
```
would take 2 × 3 = 6 cycles:
cycle | EU 1 | EU 2 | completed |
---|---|---|---|
0 | [a] | [b] | |
1 | [a] | [b] | |
2 | [a] | [b] | |
3 | [c] | [d] | a , b |
4 | [c] | [d] | a , b |
5 | [c] | [d] | a , b |
6 | [ ] | [ ] | a , b , c , d |
As it turns out, many execution unit and µop combinations are heavily pipelined. This means that a µop can be issued to an execution unit while it’s still busy processing a different µop. So, on Firestorm that code sequence actually executes more like
cycle | EU 1 | EU 2 | completed |
---|---|---|---|
0 | [a][ ][ ] | [b][ ][ ] | |
1 | [c][a][ ] | [d][b][ ] | |
2 | [ ][c][a] | [ ][d][b] | |
3 | [ ][ ][c] | [ ][ ][d] | a , b |
4 | [ ][ ][ ] | [ ][ ][ ] | a , b , c , d |
taking 4 cycles instead of 6.
In the limit, where the two execution units are constantly kept fed with multiplication µops, my original understanding would have predicted 1.5 cycles/instruction on average, when in reality they can sustain 0.5 cycles/instruction – each execution unit can be fed a new multiplication µop every cycle, and we have two of them.
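The same sort of sketch shows the throughput side: fill the loop body with independent multiplies instead of a dependent chain, and the per-multiply cost should drop towards 0.5 cycles. (Again, this is only an illustration, not the repo's code; the frequency constant and iteration count are placeholders.)

```c
// Rough throughput sketch (AArch64 only): eight independent multiplies per
// iteration, so the limit is how many µops the multiply units can accept
// per cycle, not the three-cycle latency of any single multiply.
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS 100000000ull
#define ASSUMED_MAX_GHZ 3.2 // placeholder; set to your CPU's maximum frequency

static double now_seconds(void) {
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
	return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
	uint64_t a = 3, b = 5, c = 7, d = 11, e = 13, f = 17, g = 19, h = 23;
	double start = now_seconds();
	for (uint64_t i = 0; i < ITERATIONS; i++) {
		// Eight separate dependency chains; within an iteration none of
		// these multiplies depend on each other.
		__asm__ volatile(
			"mul %0, %0, %0\n\t"
			"mul %1, %1, %1\n\t"
			"mul %2, %2, %2\n\t"
			"mul %3, %3, %3\n\t"
			"mul %4, %4, %4\n\t"
			"mul %5, %5, %5\n\t"
			"mul %6, %6, %6\n\t"
			"mul %7, %7, %7"
			: "+r"(a), "+r"(b), "+r"(c), "+r"(d),
			  "+r"(e), "+r"(f), "+r"(g), "+r"(h));
	}
	double elapsed = now_seconds() - start;
	printf("%.2f cycles per independent mul (sum = %llu)\n",
	       elapsed * ASSUMED_MAX_GHZ * 1e9 / (ITERATIONS * 8),
	       (unsigned long long)(a + b + c + d + e + f + g + h));
}
```

With two fully pipelined multiply units this should sit near 0.5 cycles per mul; with the unpipelined model I originally had in my head, it could never go below 1.5.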
Knowing this, I finally get why instruction latency and bandwidth tables specify reciprocal throughput: it's equivalent to cycles/instruction!
I put together a GitHub repo where you can see this for yourself.
Make sure to adjust the maximum CPU frequency in `Entry Point.c` as appropriate.
Luna Razzaghipour
26 December 2024