An ALU (Arithmetic Logic Unit) is a circuit for performing various integer operations. It is an essential component of any CPU.

It does not perform operations on floating point numbers, which is handled by a FPU instead.

Overview

Common operations

Bitwise AND, OR, XOR

These operations simply involve wiring each bit to the corresponding gate.

Shift/rotate left/right

Shifting or rotating a number by a fixed amount of bits simply involves wiring each input bit to its corresponding output bit.

Shifting by a variable amount is trickier, especially for large numbers. Using a single multiplexer will require an excessive amount of wires and gates to support every possible rotation.

A sequence multiple small multiplexers can be used, each shifting the number by x * 2^y. While this is slower it saves a lot on gates.

# Right shifter for 8 bits numbers.
s4 = y[2] ? (x >> 4) : x
s2 = y[1] ? (s4 >> 2) : s4
s1 = y[0] ? (s2 >> 1) : s2

Addition

An adder is composed of multiple 1-bit adders. Each of these 1-bit adders has three inputs and two outputs.

The addition of two 1-bit numbers has a 2-bit result. The high bit is called the carry. For multi-bit adder this carry is carried to the next 1-bit adder.

The LSb (out) can be calculated by XORing all three input bits together. The MSb (carry_out) can be calculated by testing whether two or more bits are 1.

out = in_a ^ in_b ^ carry_in
carry_out = (in_a & in_b) | (in_b & carry_in) | (carry_in & in_a)

Note that:

(x & z) | (y & z) = (x | y) & z
x | y = (x ^ y) | (x & y)
x & !x = 1

From which follows:

   (in_a & in_b) | (in_b & carry_in) | (carry_in & in_a)
 = (in_a & in_b) | ((in_a | in_b) & carry_in)
 = (in_a & in_b) | (((in_a ^ in_b) | (in_a & in_b)) & carry_in)
 = (in_a & in_b) | ((in_a ^ in_b) & carry_in) | ((in_a & in_b) & carry_in)
 = (in_a & in_b) | ((in_a ^ in_b) & carry_in)

Hence we can save a few gates by using the result of in_a ^ in_b

t = in_a ^ in_b
out = t ^ carry_in
carry_out = (in_a & in_b) | (t & carry_in)

Faster addition

While chaining 1-bit adders is a cheap & easy way to implement a multi-bit adder, it is also slow as the signal must propagate from the lowest to the highest bit. To speed this up a carry-lookahead is used.

This lookahead computes the carry for multiple bits up, which allows using multiple small multi-bit adders in parallel.

Suppose that g = x & y and p = x ^ y, then:

c = (x & y) | ((x ^ y) & z) = g | (p & z)

Since we chain each 1-bit adder:

c1 = g0 | (p0 & c0)
c2 = g1 | (p1 & c1)
c3 = g2 | (p2 & c2)
...

If we fully work out c on the right side of each expression we get:

c1 = g0 | (p0 & c0)
c2 = g1 | (p1 & c1)
   = g1 | (p1 & (g0 | (p0 & c0)))
   = g1 | (p1 & g0) | (p1 & p0 & c0)
c3 = g2 | (p2 & c1)
   = g2 | (p2 & (g1 | (p1 & g0) | (p1 & p0 & c0)))
   = g2 | (p2 & g1) | (p2 & p1 & g0) | (p2 & p1 & p0 & c0)
...

We can group the p and g signals:

P = ... p2 & p1 & p0
G = ... (... p2 & g1) | (... p2 & p1 & g0)

And calculate the final carry signal:

C = G | (P & c0)

Addition in HDLs & simulators

Note that most HDLs & simulators already provide a component for performing addition. This component should be preferred as it can be synthesized more efficiently than a manual implementation (e.g. by using built-in adders or other components in FPGAs, such as Xilinx's CARRY4).

Subtraction

For two-complements arithmetic subtraction can be done with a - b = a + (-b). -b is equivalent to ~b + 1, i.e. apply bitwise NOT to b and add 1. Adding 1 can be done by setting the carry high.

Equality

The cheapest way to test for equality is by using bitwise XOR: if two numbers are equal, their XOR is 0. If not, it is nonzero. Then do a bitwise OR of all bits. If 1, the numbers are not equal.

Less Than

For unsigned arithmetic a Less Than test can be done with a subtraction and checking the carry bit.

TODO signed arithmetic requires checking carry and high bit IIRC?

Multiplication

Division

Pitfalls

Division operation in Verilog, VHDL et al.

It may be tempting to use the division operator directly (e.g. alu_o = alu_a / alu_b). Do not do this as it will generate a huge and slow combinatorial circuit.

Instead do each part of the division step by step, one step per clock cycle.

ALU

Contents