Suppose we want to round a number, `n`

, up to the next multiple of `p`

,
where `p`

is a power of two.
We might write some naive code like this:

```
int64_t
round_up(int64_t n, int64_t p)
{
int64_t rem = n % p;
if (rem == 0) {
return n;
}
return n + p - rem;
}
```

We calculate the remainder of `n`

divided by `p`

.
If `n`

is already a multiple of `p`

, then we can just return `n`

.
Otherwise, we calculate how far we are from the next multiple
– that’s `p - rem`

–
then add that onto `n`

, and return.
A little fiddly, but it’s quite simple all up.
The resulting assembly is no fun, though:

```
round_up:
sdiv x8, x0, x1
msub x8, x8, x1, x0
add x9, x1, x0
sub x9, x9, x8
cmp x8, #0
csel x0, x0, x9, eq
ret
```

`sdiv`

, which performs division,
can take up to nine cycles on the M1’s Firestorm cores,
while `msub`

, which includes a multiplication, can take up to three cycles.
Theoretically, in the best case this function will take ten cycles.
Surely we can do better since we know that `p`

is a power of two?

## Rounding down efficiently

Though it isn’t immediately clear how this is relevant, there’s a fast way to round a number down to the next multiple of a power of two.

Let’s say we want to round 100 down to the nearest multiple of 16,
which is 2^{4}.
To do this, let’s look at some multiples of 16 in binary
to see if we can spot a pattern:

decimal | binary |
---|---|

16 | `00010000` |

32 | `00100000` |

48 | `00110000` |

64 | `01000000` |

Notice anything? The last four bits are always zero! To round a number down to the next multiple of 16, we can just zero out the bottom four bits:

```
int64_t
round_down_16(int64_t n)
{
return n & 0xfffffffffffffff0;
}
```

If `n`

is a power of 16 then
we’re just zeroing out bits that are already zero – no effect.
If `n`

isn’t a power of 16 then we’re turning off bits that were set,
meaning the returned value will be less (we’re rounding *down,* after all).

## Masks

In decimal, powers of ten are always a 1 followed by several 0s.
When you subtract one from a power of ten, you end up with a series of 9s
with one less digit than you started with.
For example `100_000 - 1 == 99_999`

.
Equivalently, powers of two in binary are always a `1`

followed by several `0`

s.
Subtracting one from these gives a series of `1`

s.
To find the equivalent of that large `fff...`

constant from the previous section
for any value of `p`

, we can rely on the fact that `p`

is a power of two:

```
int64_t
round_down(int64_t n, int64_t p)
{
return n & ~(p - 1);
}
```

Going back to our example of 16 from before, `16 - 1 == 0b1111`

.
We want zeroes in those bottom four bits, not ones, so we take the complement:
`~(16 - 1) == 0b11111...111110000`

.

## The solution

We want to round up though, not down.
To achieve this we can use another little trick:
rather than figuring out how to round up,
let’s see if we can modify `n`

so we can round down every time instead.

Say we want to round some numbers up to the next multiple of 4. We’d expect the following results from our function:

```
...
round_up(55, 4) == 56
round_up(56, 4) == 56
round_up(57, 4) == 60
round_up(58, 4) == 60
round_up(59, 4) == 60
round_up(60, 4) == 60
round_up(61, 4) == 64
round_up(62, 4) == 64
round_up(63, 4) == 64
round_up(64, 4) == 64
round_up(65, 4) == 68
round_up(66, 4) == 68
...
```

To be able to round down rather than up,
we want to map each four-line chunk *down* to the next four-line chunk.
For instance, if we could transform
57 into 60, 58 into 61, 59 into 62, and 60 into 63,
then we’d be able to reuse `round_down`

from before.
As you might’ve noticed, the increase needed here is `p - 1`

,
or `4 - 1 == 3`

in this case.
Thus,

```
int64_t
round_up(int64_t n, int64_t p)
{
n += p - 1;
return round_down(n, p);
}
```

is a correct implementation of `round_up`

.
We don’t actually need a stand-alone `round_down`

implementation,
so let’s inline that:

```
int64_t
round_up(int64_t n, int64_t p)
{
n += p - 1;
return n & ~(p - 1);
}
```

And flatten it out:

```
int64_t
round_up(int64_t n, int64_t p)
{
int64_t mask = p - 1;
return (n + mask) & ~mask;
}
```

The compiler, which in this case is Apple Clang 15.0.0, does something which surprised me when I first saw it:

```
round_up:
add x8, x0, x1
sub x8, x8, #1
neg x9, x1
and x0, x8, x9
ret
```

Rather than calculating `~mask`

, which is really just `~(p - 1)`

,
it’s calculated `-p`

.
I probably should’ve already known this,
but it turns out that the negation of a positive number `x`

is equivalent to `~(x - 1)`

!
Is that where “two’s *complement*” comes from?

## One final improvement

Unfortunately, while very cool, that last change the compiler made actually led to worse code.

See, A64 has an instruction called Bitwise Bit Clear (or `bic`

for short)
specifically for zeroing out an integer `x`

at the bit positions where bits are set in another integer `y`

.
In other words, it performs `x & ~y`

in one go.
By merging the calculation of `mask`

with the calcuation of its complement,
the compiler has missed an opportunity to use `bic`

.
I compiled several versions of the function hoping to get Clang to emit `bic`

,
but to no avail :(

I looked through `arm_acle.h`

which supposedly has a bunch of intrinsics
perfect for this exact sort of situation,
but sadly I couldn’t find an intrinsic for `bic`

.
It seems we have to drop down to inline assembly.
I’ll start with just a wrapper function for `bic`

:

```
int64_t
bit_clear(int64_t x, int64_t y)
{
asm(
"bic %0, %1, %2"
: "=r"(x)
: "r"(x), "r"(y)
);
return x;
}
```

The variable given as `"=r"`

is where the assembly will place its result,
while the variables given as `"r"`

are ones that the assembly reads from.
`%0`

refers to the first variable in that list, `%1`

refers to the second, etc.
Thus, `%0`

is `x`

, `%1`

is `x`

too, and `%2`

is `y`

.
In other words, we’re running `bic`

on `x`

and `y`

and putting the result in `x`

.

We can update `round_up`

to make use of this:

```
int64_t
round_up(int64_t n, int64_t p)
{
int64_t mask = p - 1;
return bit_clear(n + mask, mask);
}
```

Finally, this incarnation of `round_up`

compiles down to just three instructions:

```
round_up:
sub x8, x1, #1
add x9, x8, x0
bic x0, x9, x8
ret
```

Each of these instructions is dependent on the previous one,
so we aren’t exploiting instruction-level parallelism or anything.
Every instruction here completes in a single cycle on the M1,
so in total our new `round_up`

implementation takes just three cycles.

I haven’t bothered to make any measurements
(I just *cannot* bring myself to write Yet Another benchmarking harness)
but I’m pretty sure it’ll be much faster than what we started with.

Luna Razzaghipour

21 September 2023