Arithmetic Circuits

This part is almost the same as NUS CG3207 Lec 04 - Arithmetic for Computers, so I will combine the information together here.

Arithmetic circuits are the central building blocks of computers. Computers and digital logic perform many arithmetic functions: addition, subtraction, comparisons, shifts, multiplication, and division. This section describes hardware implementations for all of these operations.

Addition

Addition is one of the most common operations in digital systems. If you still remember, we've introduced here that all binary arithmetic can be done using adders! We first consider how to add two 1-bit binary numbers and then extend to N-bit binary numbers. Adders also illustrate trade-offs between speed and complexity.

Half Adder

We begin by building a 1-bit half adder. As shown in Figure 5.1, the half adder has two inputs, A and B, and two outputs, $S$ and $C_{\text{out}}$. $S$ is the sum of A and B. If A and B are both 1, the sum is 2, which cannot be represented with a single binary digit. Instead, it is indicated with a carry out $C_{\text{out}}$ into the next column.
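For reference, the standard half adder output equations (consistent with Figure 5.1) are:

$$S = A \oplus B, \qquad C_{\text{out}} = AB$$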

In a multi-bit adder, $C_{\text{out}}$ is added, or carried in, to the next most significant bit (i.e., into the next column). However, the half adder lacks a $C_{\text{in}}$ input to accept the $C_{\text{out}}$ of the previous column. The full adder, described in the next section, solves this problem.

Full Adder

A full adder accepts the carry in $C_{\text{in}}$, as shown in Figure 5.3. The figure also shows the output equations for $S$ and $C_{\text{out}}$.
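Those equations are standard, so for completeness:

$$S = A \oplus B \oplus C_{\text{in}}, \qquad C_{\text{out}} = AB + AC_{\text{in}} + BC_{\text{in}}$$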

Carry Propagate Adder

An $N$-bit adder sums two $N$-bit inputs, A and B, and a carry in $C_{\text{in}}$ to produce an $N$-bit result $S$ and a carry out $C_{\text{out}}$. It is commonly called a carry propagate adder (CPA) because the carry out of one bit propagates into the next bit. The symbol for a CPA is shown in Figure 5.4; it is drawn just like a full adder except that A, B, and S are buses rather than single bits. Three common CPA implementations are called

  1. Ripple-carry adders

  2. Carry-lookahead adders

  3. Prefix adders (FYI)

Ripple-Carry Adder

The simplest way to build an $N$-bit carry propagate adder is to chain together $N$ full adders. The $C_{\text{out}}$ of one stage acts as the $C_{\text{in}}$ of the next stage, as shown in Figure 5.5 for 32-bit addition. This is called a ripple-carry adder.

The ripple-carry adder has the disadvantage of being slow when $N$ is large: $S_{31}$ depends on $C_{30}$, which depends on $C_{29}$, and so forth all the way back to $C_{\text{in}}$. We say that the carry ripples through the carry chain. The delay of the adder, $t_{\text{ripple}}$, grows directly with the number of bits, as given in Equation 5.1, where $t_{\text{FA}}$ is the delay of a full adder:

$$t_{\text{ripple}} = N t_{\text{FA}} \tag{5.1}$$

The main disadvantage of the ripple-carry adder is that the carry propagates very slowly. Can we accelerate the process of determining the carries? Here comes the carry-lookahead adder.

Carry-Lookahead Adder

A carry-lookahead adder (CLA) is another type of carry propagate adder that solves the ripple-carry adder's problem by dividing the adder into blocks and providing circuitry to quickly determine the carry out of a block as soon as the carry in is known. Thus, it is said to look ahead across the blocks rather than waiting to ripple through all the full adders inside a block.

CLAs use bit-level generate ($g$) and propagate ($p$) signals to determine carry behavior for individual bit positions, which are then combined into block-level signals ($G$ and $P$) to determine the carry out of larger multi-bit groups.

  • The $i$-th column of an adder is said to generate a carry if it produces a carry out independent of the carry in. Hence, we have $g_i = A_i B_i$.

  • The $i$-th column of an adder is said to propagate a carry if it produces a carry out whenever there is a carry in. Thus, we have $p_i = A_i + B_i$.

Using $g_i$ and $p_i$, we can rewrite the carry out of each full-adder stage ($C_i$):

$$
\begin{align*}
C_1 &= p_0 C_0 + g_0 \\
C_2 &= p_1 C_1 + g_1 = p_1 p_0 C_0 + p_1 g_0 + g_1 \\
&\;\;\vdots \\
C_i &= p_i C_{i-1} + g_i
\end{align*}
$$

Let's say we want to build a 32-bit carry-lookahead adder. Doing the above manipulation for 32 stages is cumbersome. However, we can divide the adder into 4-bit blocks first (i.e., implement a 4-bit carry-lookahead adder as the fundamental block).
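For a 4-bit block, unrolling the recurrence above gives the block-level generate and propagate signals:

$$
\begin{align*}
G_{3:0} &= g_3 + p_3\,(g_2 + p_2\,(g_1 + p_1 g_0)) \\
P_{3:0} &= p_3\, p_2\, p_1\, p_0 \\
C_{\text{out}} &= G_{3:0} + P_{3:0}\, C_{\text{in}}
\end{align*}
$$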

Using this fundamental block (4-bit carry-lookahead adder), we can build bigger adders. Figure 5.6 (a) shows a 32-bit carry-lookahead adder composed of eight 4-bit blocks. Each block contains a 4-bit ripple-carry adder and some lookahead logic to compute the carry out of the block given the carry in, as shown in Figure 5.6 (b).


Putting it all together

HDLs provide the + operation to specify a CPA. HDL Example 5.1 describes a CPA with carries in and out.
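A minimal SystemVerilog sketch in the spirit of HDL Example 5.1 (module and signal names are my own):

```systemverilog
module adder #(parameter N = 8) (
  input  logic [N-1:0] a, b,
  input  logic         cin,
  output logic [N-1:0] s,
  output logic         cout
);
  // The synthesis tool picks an implementation (ripple, lookahead, ...)
  assign {cout, s} = a + b + cin;
endmodule
```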

Subtraction

Recall from the previous part that adders can add positive and negative numbers using two's complement number representation. Subtraction is almost as easy:

  1. flip the sign of the second number

  2. then add it

Thus, subtraction is performed with a single CPA by computing $A + \bar{B}$ with $C_{\text{in}} = 1$. Figure 5.9 shows the symbol for a subtractor and the underlying hardware for performing $Y = A - B$.

HDL Example 5.2 describes a subtractor.
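Again, a hedged SystemVerilog sketch along those lines:

```systemverilog
module subtractor #(parameter N = 8) (
  input  logic [N-1:0] a, b,
  output logic [N-1:0] y
);
  // Internally y = a + ~b + 1, i.e., two's complement subtraction
  assign y = a - b;
endmodule
```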

Signed Addition/Subtraction

In signed addition/subtraction, we have four flags, N(egative), Z(ero), C(arry), and V (overflow), to indicate the status of our result. The following part explains when each flag is set.

Signed Addition

By observing the addition examples above, we may find out that

  1. For unsigned addition, carry = 1 indicates that the result is wrong.

  2. For signed addition,

    1. Overflow occurs when (+ve) + (+ve) = (-ve) or (-ve) + (-ve) = (+ve), i.e., adding two negative numbers gives a positive number, or adding two positive numbers gives a negative number

      1. To determine the overflow flag, we always treat the two operands as signed! The check is simply: if the sign bits (MSBs, the (N-1)-th bits) of the operands are the same but differ from the sign bit of the result, there is an overflow.

      2. Overflow can never occur when you add two numbers of different signs

      3. Carry and overflow are unrelated (however, carry is sometimes called "unsigned overflow")

    2. The carry flag is set when the carry out of the most significant column (the value in the blue rectangle in the examples above) is one.

    3. The negative flag is determined by looking only at the MSB of the result (the (N-1)-th bit).

    4. The zero flag is determined by looking at all N bits of the result: Z = 1 when every bit is 0.
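To make these checks concrete, a minimal SystemVerilog sketch of flag generation for addition (module and signal names are my own):

```systemverilog
module addflags #(parameter N = 32) (
  input  logic [N-1:0] a, b,
  output logic [N-1:0] s,
  output logic n, z, c, v
);
  logic cout;
  assign {cout, s} = a + b;   // N-bit add, carry out of the MSB column
  assign n = s[N-1];          // Negative: MSB of the result
  assign z = (s == '0);       // Zero: all N result bits are 0
  assign c = cout;            // Carry ("unsigned overflow")
  // Overflow: operands share a sign that differs from the result's sign
  assign v = (a[N-1] == b[N-1]) && (s[N-1] != a[N-1]);
endmodule
```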


Signed Subtraction

By observing the subtraction examples above, we can find out that

  1. For unsigned subtraction, a borrow indicates that the result is wrong.

    1. In ARM, the carry flag is just NOT borrow. In Intel x86, however, the carry flag is the borrow itself: during subtraction, the ALU carry out is inverted to form the carry flag.

  2. For signed subtraction,

    1. Overflow occurs when (+ve) - (-ve) = (-ve) or (-ve) - (+ve) = (+ve).

      1. Similar to signed addition, we always treat the two operands as signed. So, if the sign bits (MSBs) of the operands differ, and the result has the same sign as the second operand (the subtrahend), there is an overflow.

      2. Overflow can never occur in subtraction when operands are of the same sign.

    2. In hardware, the carry flag simply reflects the ALU's carry-out signal. Conceptually, during subtraction, this signal acts as the inverse of the borrow used in pen-and-paper arithmetic.
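A matching sketch for subtraction on the same adder datapath, computing $A + \bar{B} + 1$ (again, my own naming; note that $\bar{B}$ must be inverted at its own width before the wider add):

```systemverilog
module subflags #(parameter N = 32) (
  input  logic [N-1:0] a, b,
  output logic [N-1:0] s,
  output logic n, z, c, v
);
  logic [N-1:0] bn;
  logic cout;
  assign bn = ~b;                      // invert B at N bits first
  assign {cout, s} = a + bn + 1'b1;    // a - b as a + ~b + 1
  assign n = s[N-1];
  assign z = (s == '0);
  assign c = cout;                     // ARM-style: carry = NOT borrow
  // Overflow: operand signs differ and the result's sign matches B's
  assign v = (a[N-1] != b[N-1]) && (s[N-1] == b[N-1]);
endmodule
```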



In summary, from the interpretation side, we know the following points:

  1. When C = 1, it indicates the unsigned addition result is wrong.

  2. When C = 0, it indicates that the unsigned subtraction result is wrong.

  3. When V = 1, it indicates that the signed result (addition or subtraction) is wrong.


For RISC-V, the NZCV flags are not architecturally stored, and they are largely irrelevant when the ALU does addition. Still, it is recommended to go through how each of the NZCV flags would be generated during addition.

In contrast, the NZCV logic is relevant when the ALU does subtraction. Again, it is important to go through how NZCV is generated during subtraction. In RISC-V, subtraction is performed by

  • sub

  • branch/slt variants

For the pure sub instruction, NZCV may not be that useful. But for the branch/slt variants, this NZCV information, which is not stored as flag bits for use by future instructions, becomes much more useful. The different comparisons, like eq, etc., are implemented using these NZCV expressions, as the following table shows.

| Comparison | Signed/Unsigned | Condition | Implemented as | Uses |
| --- | --- | --- | --- | --- |
| eq | Both | A == B | Z | Z flag |
| ne | Both | A ≠ B | ¬Z | Z flag |
| lt | Signed | A < B | N ⊕ V | Signed overflow logic |
| ge | Signed | A ≥ B | ¬(N ⊕ V) | Signed overflow logic |
| ltu | Unsigned | A < B | ¬C | Borrow logic |
| geu | Unsigned | A ≥ B | C | Borrow logic |


Multi-word arithmetic in RISC-V

Suppose we want to do 64-bit addition and subtraction in 32-bit RISC-V. We use [t0:t1] to store the 64-bit number A, [t2:t3] to store the 64-bit number B, and [t4:t5] to store the final result (t0, t2, and t4 hold the lower words).

For the addition, we have,
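a carry-propagating sequence like the following sketch (register layout as above, grounded in the summary table below; t6 is a scratch register):

```asm
add  t4, t0, t2     # lower word: t4 = A_lo + B_lo
sltu t6, t4, t0     # carry = 1 if the lower sum wrapped around (sum < A_lo)
add  t5, t1, t3     # upper word: t5 = A_hi + B_hi
add  t5, t5, t6     # fold the carry into the upper word
```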

For the subtraction, we have,
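a matching borrow-propagating sketch:

```asm
sub  t4, t0, t2     # lower word: t4 = A_lo - B_lo
sltu t6, t0, t2     # borrow = 1 if A_lo < B_lo
sub  t5, t1, t3     # upper word: t5 = A_hi - B_hi
sub  t5, t5, t6     # subtract the borrow from the upper word
```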

So, in summary, we have

| Operation | Condition for carry/borrow | Equivalent test | RISC-V instruction |
| --- | --- | --- | --- |
| Addition | carry | (sum < A) or (sum < B) | sltu t6, t4, t0 |
| Subtraction | borrow | (A < B) | sltu t6, t0, t2 |

Comparators

A comparator determines whether two binary numbers are equal or if one is greater or less than the other. A comparator receives two N-bit binary numbers A and B. There are two common types of comparators.

  • An equality comparator produces a single output indicating whether A is equal to B (A==B).

  • A magnitude comparator produces one or more outputs indicating the relative values of A and B.


Equality comparator

The equality comparator is the simpler piece of hardware. Figure 5.11 shows the symbol and implementation of a 4-bit equality comparator. It first checks whether the corresponding bits in each column of A and B are equal, using XNOR gates. The numbers are equal if all of the columns are equal.


Magnitude comparator

Magnitude comparison is usually done by computing A - B and looking at the sign (most significant bit) of the result, as shown in Figure 5.12. If the result is negative (i.e., the sign bit is 1), then A is less than B. Otherwise, A is greater than or equal to B.

HDL Example 5.3 shows how to use various comparison operations.
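A hedged SystemVerilog sketch of both comparator types (my own naming; the book's example may differ):

```systemverilog
module comparators #(parameter N = 4) (
  input  logic [N-1:0] a, b,
  output logic eq, lt
);
  logic [N-1:0] diff;
  assign eq   = &(a ~^ b);   // XNOR each column, then AND all columns
  assign diff = a - b;
  assign lt   = diff[N-1];   // sign bit of A - B (valid when no overflow)
endmodule
```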

ALU

An Arithmetic/Logical Unit (ALU) combines a variety of mathematical and logical operations into a single unit. For example, a typical ALU might perform addition, subtraction, magnitude comparison, AND, and OR operations. The ALU forms the heart of most computer systems.

Figure 5.14 shows the symbol for an N-bit ALU with N-bit inputs and outputs.

The ALU receives a control signal F that specifies which function to perform. Table 5.1 lists typical functions that the ALU can perform.

Figure 5.15 shows an implementation of the ALU.


Some ALUs produce extra outputs, called flags, that indicate information about the ALU output. For example, an overflow flag indicates that the result of the adder overflowed. A zero flag indicates that the ALU output is 0.

HDL Implementation

The HDL for an $N$-bit ALU is shown as follows.
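Since the listing isn't reproduced here, below is a minimal sketch assuming a 2-bit control encoding of 00 = AND, 01 = OR, 10 = ADD, 11 = SUB (the actual encoding in Table 5.1 may differ):

```systemverilog
module alu #(parameter N = 32) (
  input  logic [N-1:0] a, b,
  input  logic [1:0]   f,     // function select (assumed encoding)
  output logic [N-1:0] y,
  output logic         zero   // flag: result is all zeros
);
  always_comb
    case (f)
      2'b00: y = a & b;
      2'b01: y = a | b;
      2'b10: y = a + b;
      2'b11: y = a - b;
    endcase
  assign zero = (y == '0);
endmodule
```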

Shifters and Rotators

Shifters and rotators move bits and multiply or divide by powers of 2. There are several kinds of commonly used shifters:


Logical shifter

It shifts the number to the left (LSL) or right (LSR) and fills empty spots with 0's. For example,


Arithmetic shifter

It is the same as a logical shifter, but on right shifts it fills the most significant bits with a copy of the old most significant bit (msb). This is useful for multiplying and dividing signed numbers. Arithmetic shift left (ASL) is the same as logical shift left (LSL): both fill the empty spots on the right with 0's, so the msb handling doesn't matter. For example,


Rotator

It rotates a number in a circle such that empty spots are filled with bits shifted off the other end.

An $N$-bit shifter can be built from $N$ N:1 multiplexers. The input is shifted by 0 to $N-1$ bits, depending on the value of the $\log_2 N$-bit select lines (shamt). Figure 5.16 shows the symbol and hardware of 4-bit shifters. The operators <<, >>, and >>> typically indicate shift left, logical shift right, and arithmetic shift right, respectively.


If we want to use this idea to build a 32-bit shifter, we need 32 32:1 multiplexers, which is a very high hardware cost. To solve this problem, let's introduce the hardware-efficient shifter.

Hardware-efficient Shifter

This hardware-efficient shifter only needs 5 × 32 2:1 multiplexers. Below is the implementation of the 32-bit srl shifter.

This shifter uses a logarithmic, staged design that decomposes any shift amount into a combination of small, power-of-two shifts. Each stage doubles the shift capacity, making the circuit both elegant and efficient.
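As a concrete sketch of that staged idea (one layer of 2:1 muxes per bit of shamt; names are my own):

```systemverilog
module srl32 (
  input  logic [31:0] shin,
  input  logic [4:0]  shamt,
  output logic [31:0] shout
);
  logic [31:0] s0, s1, s2, s3;
  // Each stage shifts by a power of two iff its shamt bit is set
  assign s0    = shamt[0] ? { 1'b0, shin[31:1]} : shin;
  assign s1    = shamt[1] ? { 2'b0, s0[31:2]}   : s0;
  assign s2    = shamt[2] ? { 4'b0, s1[31:4]}   : s1;
  assign s3    = shamt[3] ? { 8'b0, s2[31:8]}   : s2;
  assign shout = shamt[4] ? {16'b0, s3[31:16]}  : s3;
endmodule
```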

To add support for sll and sra, the easiest way is to implement a 3-to-1 multiplexer, controlled by the shift type info from the instruction.

However, more efficient ways to combine the shifts exist. For example, when $\text{shamt5}_4 = 1$, the input can come from another small "multiplexer" that gives

  • {16{0}. ShIn31:16} for srl

  • {16{ShIn31}. ShIn31:16} for sra

Hint: To implement this, we only need one 1-bit 2:1 multiplexer to select whether the fill bit should be 0 or ShIn[31]; duplicating that bit 16 times gives the result we want.

Multiplication

In the previous part, we have seen that multiplication is nothing but shift then add. By shifting, we form the partial products.

In general, an N×N multiplier multiplies two N-bit numbers and produces a 2N-bit result. Multiplication of 1-bit binary numbers is equivalent to the AND operation, so AND gates are used to form the partial products.

Array Multiplier

Figure 5.18 shows the symbol, function, and implementation of an unsigned 4x4 multiplier.

The unsigned multiplier receives the multiplicand and multiplier, $A$ and $B$, and produces the product $P$. Figure 5.18(b) shows how partial products are formed. Each partial product is a single multiplier bit ($B_3$, $B_2$, $B_1$, or $B_0$) AND the multiplicand bits ($A_3$, $A_2$, $A_1$, $A_0$). With $N$-bit operands, there are $N$ partial products and $N-1$ stages of 1-bit adders. For example, for a 4×4 multiplier, the partial product of the first row is $B_0$ AND ($A_3$, $A_2$, $A_1$, $A_0$). This partial product is added to the shifted second partial product, $B_1$ AND ($A_3$, $A_2$, $A_1$, $A_0$). Subsequent rows of AND gates and adders form and add the remaining partial products.

This array multiplier is combinational: the whole multiplication (all ANDs and adds) happens in one clock cycle. This video may be helpful for understanding the critical path and hardware analysis of the array multiplier.

The pros and cons of using an array multiplier

In the array multiplier, we have the following two advantages

  1. the multiplication is done in one cycle, so it uses fewer cycles.

  2. the bits are computed in parallel by the AND gates and then "ripple" through the adders, so it provides better parallelizability.

However, an array multiplier has the following disadvantages

  1. The clock period depends on the critical path, which may thus lower the clock speed.

  2. The hardware usage is high and grows as $O(n^2)$.

Sequential Multiplier

Suppose we do not want to do the multiplication in one single cycle; then we can use the sequential multiplier. Its workflow is given as follows,
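Since the workflow diagram isn't reproduced here, below is a behavioral sketch of the shift-and-add loop it describes (small operands assumed; names are my own):

```systemverilog
module seqmult #(parameter N = 3) (
  input  logic clk, start,
  input  logic [N-1:0] a, b,
  output logic [2*N-1:0] s,
  output logic done
);
  logic [2*N-1:0] sha;                 // multiplicand, shifted left each step
  logic [N-1:0]   shb;                 // multiplier, shifted right each step
  logic [$clog2(N+1)-1:0] cnt;

  always_ff @(posedge clk)
    if (start) begin
      s <= '0; sha <= a; shb <= b; cnt <= N;
    end else if (cnt != 0) begin
      if (shb[0]) s <= s + sha;        // add multiplicand when B's LSB is 1
      sha <= sha << 1;                 // A = A << 1
      shb <= shb >> 1;                 // B = B >> 1
      cnt <= cnt - 1;
    end

  assign done = (cnt == 0);
endmodule
```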

To understand it better, let's step through an example. Let S=000000, A=011, B=101 (initial values)

  1. $B_0 = 1$ (original $B_0 = 1$)

    1. add: S = S + A = 000000 + 000011 = 000011

    2. A = A << 1; Thus A = 000110

    3. B = B >> 1; Thus B = 010

  2. $B_0 = 0$ (original $B_1 = 0$)

    1. no addition required here

    2. A = A << 1; Thus A = 001100

    3. B = B >> 1; Thus B = 001

  3. $B_0 = 1$ (original $B_2 = 1$)

    1. add: S = S + A = 000011 + 001100 = 001111

Improved Sequential Multiplier

The idea is to always align $A$ to the "high side" (the 3 most significant bits of S in 3-bit multiplication) and then perform the addition. Then shift S to the right and do not shift A. The least significant bits of S are populated with B, which keeps getting shifted to the right along with S (this avoids the need for an extra register for B). Its workflow is shown as follows,
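Again as a hedged sketch of that workflow (counting/control logic omitted; names are my own):

```systemverilog
module seqmult2 #(parameter N = 3) (
  input  logic clk, start,
  input  logic [N-1:0] a, b,
  output logic [2*N-1:0] s      // after N steps, holds the full product
);
  logic [2*N:0] t;              // one extra bit for the high-side carry

  always_ff @(posedge clk)
    if (start)
      s <= {{N{1'b0}}, b};      // low half starts out holding B
    else begin
      // add A on the high side when S's LSB (a bit of B) is 1
      t = s[0] ? s + {a, {N{1'b0}}} : {1'b0, s};
      s <= t >> 1;              // shift the whole register right by one
    end
endmodule
```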

Again, stepping through an example will make it easier to understand. Let S = 000101 (i.e., {000, B} with B = 101) and A = 011 (initial values)

  1. $S_0 = 1$ (original $B_0 = 1$)

    1. add S = S + A000 = 000101 + 011000 = 011101

    2. S = S >> 1; Thus S = 001110

  2. $S_0 = 0$ (original $B_1 = 0$)

    1. no addition required

    2. S = S >> 1; Thus S = 000111

  3. $S_0 = 1$ (original $B_2 = 1$)

    1. add S = S + A000 = 000111 + 011000 = 011111

    2. S = S >> 1; Thus S = 001111, which is the final result

The trade-off between array multiplier and sequential multiplier
  1. Clock speed: As we have seen above, the array multiplier uses only one cycle, but that cycle is likely long, slowing the clock; the sequential multiplier uses more cycles, but each cycle is likely shorter, allowing a faster clock.

  2. Hardware cost: The array multiplier usually uses more hardware than the sequential multiplier.

As hardware designers, we always need to think about this kind of trade-off to achieve what we want. The following table summarizes the cost and latency of an n-bit multiplier built using a carry propagate adder.

| Multiplier Type | Cost (Hardware Area) | Latency (Speed) |
| --- | --- | --- |
| Sequential Multiplier | $O(n)$ (Linear) | $O(n^2)$ (Quadratic) |
| Array Multiplier | $O(n^2)$ (Quadratic) | $O(n)$ (Linear) |

HDL Implementation

The HDL for a multiplier is in HDL Example 5.4. As with adders, the synthesis tools may pick the most appropriate design given the timing constraints.
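In the same spirit as HDL Example 5.4, a behavioral sketch:

```systemverilog
module multiplier #(parameter N = 8) (
  input  logic [N-1:0]   a, b,
  output logic [2*N-1:0] p
);
  assign p = a * b;   // the tool chooses the multiplier structure
endmodule
```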

Division

Binary division can be performed using the following algorithm for $N$-bit unsigned numbers in the range $[0,\ 2^N - 1]$.

  1. The partial remainder is initialized to 0 ($R' = 0$), and the most significant bit of the dividend $A$ becomes the least significant bit of $R$ ($R = \{R' << 1,\ A_i\}$).

  2. The divisor $B$ is subtracted from this partial remainder to determine whether it fits ($D = R - B$).

    1. If the difference $D$ is negative (i.e., the sign bit of $D$ is 1), then the quotient bit $Q_i$ is 0 and the difference is discarded.

    2. Otherwise, the difference $D$ is non-negative; the quotient bit $Q_i$ is 1 and the remainder $R$ becomes $D$.

    3. In either case, the partial remainder is then doubled (left-shifted by one column), the next most significant bit of $A$ becomes its least significant bit, and the process repeats.

  3. The process (step 2) repeats, and the result satisfies $\frac{A}{B} = Q + \frac{R}{B}$.
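A combinational SystemVerilog sketch of this restoring-division loop (my own naming; one trial subtraction per quotient bit, divide-by-zero not handled):

```systemverilog
module divider #(parameter N = 4) (
  input  logic [N-1:0] a, b,   // dividend, divisor (b != 0 assumed)
  output logic [N-1:0] q, r    // quotient, remainder
);
  logic [N-1:0] rem;
  logic [N:0]   d;             // extra bit captures the sign of the trial difference
  always_comb begin
    rem = '0;
    q   = '0;
    for (int i = N-1; i >= 0; i--) begin
      rem = {rem[N-2:0], a[i]};        // shift in the next dividend bit
      d   = {1'b0, rem} - {1'b0, b};   // trial subtraction D = R - B
      if (!d[N]) begin                 // D >= 0: the divisor fits
        q[i] = 1'b1;
        rem  = d[N-1:0];
      end
    end
    r = rem;
  end
endmodule
```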

Below is a more straightforward example,

This approach is called the long division approach; it is basically the same as what we have seen above, but presented in a more straightforward way:

  1. If divisor $\leq$ dividend bits

    1. 1 bit in quotient, subtract

  2. Otherwise

    1. 0 bit in quotient, bring down next dividend bit

Array Divider

Figure 5.22 shows a schematic of a 4-bit array divider. The divider computes $A/B$ and produces a quotient $Q$ and a remainder $R$. The legend shows the symbol and schematic for each block in the array divider.

  1. Each row performs one iteration of the division algorithm. Specifically, each row calculates the difference $D = R - B$. (Recall that $R + \bar{B} + 1 = R - B$.)

  2. The multiplexer select signal, $N$ (for Negative), receives 1 when a row's difference $D$ is negative. So $N$ is driven by the most significant bit of $D$.

  3. Each quotient bit ($Q_i$) is 0 when $D$ is negative and 1 otherwise. The multiplexer passes $R$ on to the next row if the difference is negative, and $D$ otherwise. (See the legend.)

  4. The following row shifts the new partial remainder left by one bit (a good example of "shifting is just rewiring"), appends the next most significant bit of $A$, and then repeats the process.

The delay of an $N$-bit array divider increases proportionally to $N^2$ because the carry must ripple through all $N$ stages in a row before the sign is determined and the multiplexer can select $R$ or $D$. This repeats for all $N$ rows.

Sequential Divider

The sequential divider here just follows the pseudocode mentioned above, which handles unsigned division. To convert it to signed division, only minor changes are needed. (For more information, see Lab 03.)

Similar to the sequential multiplier, we can also perform the division sequentially,

To understand it better, again, let's step through an example: 00000111 (dividend) divided by 0010 (divisor).

Hennessy and Patterson's textbook mentions an improved version of the division algorithm above; it is shown as follows,

Signed Multiplication/Division


Signed vs. Unsigned Multiplication

When you multiply two n-bit numbers,

  • the lowest n bits (the least significant half) of the product are identical regardless of whether the numbers are interpreted as signed or unsigned.

  • Only the upper n bits differ (because sign extension affects those).

That’s why:

  • MUL (in RISC-V) gives just the lower n bits.

  • MULH / MULHU (RISC-V) give the upper n bits, for signed/unsigned respectively.


Macro-op fusion

“Macro-op fusion” means the processor can fuse mul and mulh internally and perform a single 64-bit multiply, returning both halves if needed.

We know that RISC-V has separate instructions for multiplication:

  • mul → lower 32 bits of product

  • mulh → upper 32 bits (signed × signed)

  • mulhu → upper 32 bits (unsigned × unsigned)

  • mulhsu → upper 32 bits (signed × unsigned)

If we implement them literally, it looks like several separate multipliers are needed. But in hardware, we don't have to do several full multiplications. Instead, we can compute one 64-bit result internally and just output different parts or interpret the sign bits differently.

So, even though RISC-V has multiple mul variants, we don’t necessarily perform multiple physical multiplications.


Signed/Unsigned always matters in Division

Unlike multiplication, signed and unsigned division don’t share any bits of result. We can’t reuse the same division result for both. For example,

| Dividend | Divisor | Signed? | Result | Notes |
| --- | --- | --- | --- | --- |
| 6 | 3 | signed or unsigned | 2 | Same either way |
| 6 | -3 | signed | -2 | (because of the sign) |
| 6 | -3 | unsigned | 0 (meaningless) | divisor interpreted as 4294967293 if 32-bit! |
| -6 | 3 | signed | -2 | okay |
| -6 | 3 | unsigned | big number: 0x55555553 | -6 → 0xFFFFFFFA, interpreted as 4294967290 |

So division results sometimes completely change depending on whether you treat numbers as signed or unsigned — even if you use the same bit pattern.


Simple trick for signed multiplication/division

Signed division (and signed multiplication) can be done using the same unsigned operation unit if you:

  1. Record the sign of both operands.

  2. Take their absolute values (convert negatives to positives).

  3. Perform unsigned division (or multiplication).

  4. If exactly one operand was negative, negate the result.
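A behavioral sketch of this trick for division (my own naming; a real unit would also special-case a zero divisor and the overflow case below):

```systemverilog
// signed division built on an unsigned divide
function automatic logic signed [31:0] sdiv(
  input logic signed [31:0] a, b
);
  logic [31:0] ua, ub, uq;
  logic        neg;
  ua  = a[31] ? -a : a;    // |a| (careful: |-2^31| is not representable!)
  ub  = b[31] ? -b : b;    // |b|
  uq  = ua / ub;           // unsigned division
  neg = a[31] ^ b[31];     // negative iff exactly one operand was negative
  return neg ? -uq : uq;
endfunction
```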

But there are some weird scenarios. For example, suppose we want to do the following multiplication,

$$-2^{31} \times -1 = 2^{31}$$

The result cannot fit into a 32-bit register. This is an overflow (same trick to detect overflow as we have discussed in signed addition/subtraction)!

Number Systems

Computers operate on both integers and fractions. So far, we have only considered representing signed or unsigned integers. This section introduces fixed- and floating-point number systems that can represent rational numbers.

  • Fixed-point numbers are analogous to decimals; some of the bits represent the integer part, and the rest represent the fraction.

  • Floating-point numbers are analogous to scientific notation, with a mantissa and an exponent.

Fixed-Point Number Systems

Fixed-point notation has an implied binary point between the integer and fraction bits, analogous to the decimal point between the integer and fraction digits of an ordinary decimal number. For example, Figure 5.23(a) shows a fixed-point number with four integer bits and four fraction bits. Figure 5.23(b) shows the implied binary point in blue, and Figure 5.23(c) shows the equivalent decimal value.


Signed Fixed-Point

Signed fixed-point numbers can use either two's complement or sign/magnitude notation. Figure 5.24 shows the fixed-point representation of $-2.375$ using both notations, with four integer and four fraction bits.

  • In sign/magnitude form, the most significant bit is used to indicate the sign.

  • The two's complement representation is formed by inverting the bits of the absolute value and adding 1 to the least significant (rightmost) bit. In this case, the least significant bit position is in the $2^{-4}$ column.

    • In this example, we have: 00100110 -> 11011001 -> 11011010

A trick to quickly find the binary representation of a fraction using fixed-point notation
  1. First, write the fraction as $\frac{A}{B}$, where $A$ is any integer and $B$ is an integer of the form $2^n$.

  2. Represent $A$ in binary and find $n$.

  3. Move the binary point $n$ bits to the left from behind the rightmost bit.

For example, we want to convert $0.625$:

  1. $0.625 = \frac{5}{8} = \frac{5}{2^3}$

  2. $5 = 101_2$

  3. $0.625 = 0.101_2$

  4. Fit into the eight-bit representation: it becomes 0000.1010

Fixed-Point Calculation

Fixed-point addition and subtraction are simple because the position of the binary point doesn't change. But multiplication and division are a bit more tedious. So, here we will only talk about fixed-point multiplication and division.

Actually, from the small trick mentioned above, we've seen that fixed-point notation has an implicit scale factor of $2^{\text{number of fraction bits}}$.

For example, if we have 2 fraction bits, the scale factor is $2^2 = 4$. That means every stored integer actually represents $\text{stored}/4$. So, 00000110₂ = 6 actually means $6/4 = 1.5$.

Knowing this, we can clearly see that when we multiply two numbers in fixed-point, both are already scaled. For example, let's compute

| Variable | Actual value | Stored (binary) | Decimal equivalent (stored) |
| --- | --- | --- | --- |
| a | 1.5 | 00000110 | 6 |
| b | 2.5 | 00001010 | 10 |

When we compute a*b, we get 00111100₂ (60). Since we scaled each input by $2^2 = 4$, the product is scaled by $4 \times 4 = 16 = 2^4$. So, to get back to the same scale factor of 2 fraction bits, we must divide the product by 4 (shift right by 2 bits). Thus, we have 00111100₂ >> 2 = 00001111₂. This gives a stored value of 15 and an actual value of $15/4 = 3.75$.
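A small SystemVerilog sketch of this rescaling (assuming 8-bit unsigned values with 2 fraction bits, truncation, and no overflow handling):

```systemverilog
module fxpmul (
  input  logic [7:0] a, b,   // UQ6.2 operands
  output logic [7:0] p       // UQ6.2 product (truncated)
);
  logic [15:0] full;
  assign full = a * b;       // full product carries 2 + 2 = 4 fraction bits
  assign p    = full[9:2];   // shift right by 2 to restore 2 fraction bits
endmodule
```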

But in the above scenario, we assumed a short multiply, i.e., n-bit × n-bit = n-bit.

When we want to do a long multiplication,

  • an overflow may happen, which means the result cannot fit into the available result bits.

  • an underflow may also happen, which means the multiplication result is representable, but not accurately, because our fractional bits are too few.

Floating-Point Number Systems

Floating-point numbers are analogous to scientific notation. They circumvent the limitation of having a constant number of integer and fraction bits, allowing the representation of very large and very small numbers. Like scientific notation, floating-point numbers have a sign, mantissa (M), base (B), and exponent (E), as shown below:

$$\pm M \times B^E$$

For example, the number $4.1 \times 10^3$ is the decimal scientific notation for 4100. It has a mantissa of 4.1, a base of 10, and an exponent of 3. The decimal point floats to the position right after the most significant digit. Floating-point numbers are base 2 with a binary mantissa. 32 bits are used to represent 1 sign bit, 8 exponent bits, and 23 mantissa bits.

The exponent needs to represent both positive and negative exponents. To do so, floating-point uses a biased exponent, which is the original exponent plus a constant bias. In 32-bit floating-point representation, the bias is 127. For example, for the exponent 7, the biased exponent is 7 + 127 = 134 = 10000110₂. For the exponent -4, the biased exponent is -4 + 127 = 123 = 01111011₂.

To put it all together, let's examine two examples:


Represent unsigned number in IEEE 754

Suppose we want to represent $228_{10} = 11100100_2 = 1.11001_2 \times 2^7$ using the floating-point system. It will be as follows,
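Filling in the three fields (sign 0, biased exponent $7 + 127 = 134$, fraction 1100100 followed by zeros):

$$\underbrace{0}_{\text{sign}}\ \underbrace{10000110}_{\text{biased exponent}}\ \underbrace{11001000000000000000000}_{\text{fraction (23 bits)}}$$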


Represent signed number in IEEE 754

Suppose we want to write $-58.25_{10}$ in floating point (IEEE 754).

  1. Convert decimal to binary: $58.25_{10} = 111010.01_2$

  2. Write in binary scientific notation: $1.1101001_2 \times 2^5$

  3. Fill in the fields: sign bit = 1, biased exponent = $5 + 127 = 132 = 10000100_2$, 23 fraction bits = 110 1001 0000 0000 0000 0000.
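Assembled, the 32-bit word is:

$$\underbrace{1}_{\text{sign}}\ \underbrace{10000100}_{\text{biased exponent}}\ \underbrace{11010010000000000000000}_{\text{fraction (23 bits)}} = \texttt{0xC2690000}$$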

Special Cases

The IEEE floating-point standard has special cases to represent numbers such as zero, infinity, and illegal results.

More formats

So far, we have examined 32-bit floating-point numbers. This format is also called single-precision, single, or float. The IEEE 754 standard also defines 64-bit double-precision numbers (also called doubles) and 128-bit quadruple-precision numbers (also called quads) that provide greater precision and greater range. Table 5.5 shows the number of bits used for the fields in each format.

Rounding

Arithmetic results that fall outside of the available precision must round to a neighboring number. The rounding modes are

  1. round down,

  2. round up,

  3. round toward zero, and

  4. round to nearest.

The default rounding mode is round to nearest. In the round-to-nearest mode, if two numbers are equally near, the one with a 0 in the least significant position of the fraction is chosen.

For example, we want to round 1.100101 (1.578125) to only 3 fraction bits

  1. round down: 1.100

  2. round up: 1.101

  3. round toward zero: 1.100

  4. round to nearest: 1.101 (1.625 is closer to 1.578125 than 1.5 is)


Floating-Point Calculation


Addition

The steps for adding floating-point numbers with the same sign are as follows:

  1. Extract exponent and fraction bits.

  2. Prepend leading 1 to form the mantissa.

  3. Compare exponents.

  4. Shift smaller mantissa if necessary.

  5. Add mantissas.

  6. Normalize mantissa and adjust exponent if necessary.

  7. Check whether overflow or underflow occurred, then round the result.

  8. Assemble exponent and fraction back into floating-point number.

For example,
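let's add $1.5$ and $3.25$ with 3 fraction bits (illustrative numbers of my own, since the original worked figure isn't reproduced here):

  1. Extract the fields and prepend the leading 1s: $1.5 = 1.100_2 \times 2^0$ and $3.25 = 1.101_2 \times 2^1$.

  2. The exponents differ by 1, so shift the smaller mantissa right: $1.100_2 \times 2^0 = 0.110_2 \times 2^1$.

  3. Add the mantissas: $0.110_2 + 1.101_2 = 10.011_2$, giving $10.011_2 \times 2^1$.

  4. Normalize and adjust the exponent: $1.0011_2 \times 2^2 = 4.75$, which checks out.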



Multiplication

Floating-point multiplication follows these steps (mirroring the worked example below):

  1. Add the exponents (subtracting one bias so it is not counted twice).

  2. Multiply the significands.

  3. Check for overflow or underflow.

  4. Round the product.

  5. Set the sign of the final product.

For example, using the above steps, multiply the numbers $0.5_{10}$ and $-0.4375_{10}$.

  1. In binary, the task is multiplying $1.000_2 \times 2^{-1}$ by $-1.110_2 \times 2^{-2}$.

  2. Adding the exponents without bias:

    1. $-1 + (-2) = -3$. The biased exponent will be $-3 + 127 = 124$.

  3. Multiplying the significands:

    1. $1.000_2 \times 1.110_2 = 1.110_2$; with the exponents, the product so far is $1.110_2 \times 2^{-3}$.

  4. Checking Overflow and underflow:

    1. As $-3$ is within $[-126, 127]$, there is no overflow or underflow.

  5. Rounding the product:

    1. No need in this case

  6. Adding the sign to the final product:

    1. Since the signs of the original operands differ, make the sign of the product negative. Hence, the product is $-1.110_2 \times 2^{-3}$.

