Arithmetic Circuits
This part is almost the same as NUS CG3207 Lec 04 - Arithmetic for Computers, so I will combine the information together here.
Arithmetic circuits are the central building blocks of computers. Computers and digital logic perform many arithmetic functions: addition, subtraction, comparisons, shifts, multiplication, and division. This section describes hardware implementations for all of these operations.
Addition
Addition is one of the most common operations in digital systems. If you still remember, we've introduced here that all binary arithmetic can be done using adders! We first consider how to add two 1-bit binary numbers. We then extend to N-bit binary numbers. Adders also illustrate trade-offs between speed and complexity.
Half Adder
We begin by building a 1-bit half adder. As shown in Figure 5.1, the half adder has two inputs, $A$ and $B$, and two outputs, $S$ and $C_{out}$. $S$ is the sum of $A$ and $B$. If $A$ and $B$ are both 1, $S$ is 2, which cannot be represented with a single binary digit. Instead, it is indicated with a carry out ($C_{out} = 1$) in the next column.

In a multi-bit adder, $C_{out}$ is added, or carried in, to the next most significant bit (into the next column). However, the half adder lacks a $C_{in}$ input to accept the $C_{out}$ of the previous column. The full adder, described in the next section, solves this problem.
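From the description above, the half adder reduces to two gates ($S = A \oplus B$, $C_{out} = AB$), which map directly to an HDL. A minimal sketch, with module and port names of my own choosing:

```systemverilog
// 1-bit half adder: sum is XOR, carry out is AND
module halfadder(input  logic a, b,
                 output logic s, cout);
  assign s    = a ^ b;   // sum of the two input bits
  assign cout = a & b;   // carry out when both inputs are 1
endmodule
```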
Full Adder
A full adder accepts the carry in $C_{in}$, as shown in Figure 5.3. The figure also shows the output equations: $S = A \oplus B \oplus C_{in}$ and $C_{out} = AB + AC_{in} + BC_{in}$.

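In the same style, a sketch of the full adder (again, the names are mine):

```systemverilog
// 1-bit full adder: S = A xor B xor Cin, Cout = majority(A, B, Cin)
module fulladder(input  logic a, b, cin,
                 output logic s, cout);
  assign s    = a ^ b ^ cin;                      // sum bit
  assign cout = (a & b) | (a & cin) | (b & cin);  // carry out (majority function)
endmodule
```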
Carry Propagate Adder
An $N$-bit adder sums two $N$-bit inputs, $A$ and $B$, and a carry in $C_{in}$ to produce an $N$-bit result $S$ and a carry out $C_{out}$. It is commonly called a carry propagate adder (CPA) because the carry out of one bit propagates into the next bit. The symbol for a CPA is shown in Figure 5.4; it is drawn just like a full adder except that A, B, and S are buses rather than single bits. Three common CPA implementations are called
Ripple-carry adders
Carry-lookahead adders
Prefix adders (FYI)

Ripple-Carry Adder
The simplest way to build an N-bit carry propagate adder is to chain together N full adders. The $C_{out}$ of one stage acts as the $C_{in}$ of the next stage, as shown in Figure 5.5 for 32-bit addition. This is called a ripple-carry adder.

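Structurally, Figure 5.5 is just N full adders wired carry-out to carry-in. A sketch assuming the fulladder module from the previous snippet:

```systemverilog
// N-bit ripple-carry adder built by chaining full adders
module rippleadder #(parameter N = 32)
                    (input  logic [N-1:0] a, b,
                     input  logic         cin,
                     output logic [N-1:0] s,
                     output logic         cout);
  logic [N:0] c;        // internal carry chain, c[0] = cin
  assign c[0] = cin;
  for (genvar i = 0; i < N; i++) begin : stage
    fulladder fa(.a(a[i]), .b(b[i]), .cin(c[i]), .s(s[i]), .cout(c[i+1]));
  end
  assign cout = c[N];
endmodule
```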
The ripple-carry adder has the disadvantage of being slow when N is large, as $S_{31}$ depends on $C_{30}$, which depends on $C_{29}$, and so forth all the way back to $C_{in}$. We say that the carry ripples through the carry chain. The delay of the adder, $t_{ripple}$, grows directly with the number of bits, as given in Equation 5.1, where $t_{FA}$ is the delay of a full adder:

$$t_{ripple} = N t_{FA} \tag{5.1}$$
The main disadvantage of the ripple-carry adder is that the carry propagates very slowly. Can we accelerate the process of determining the carry? Here comes the carry-lookahead adder.
Carry-Lookahead Adder
A carry-lookahead adder (CLA) is another type of carry propagate adder that solves the ripple-carry adder's problem by dividing the adder into blocks and providing circuitry to quickly determine the carry out of a block as soon as the carry in is known. Thus it is said to look ahead across the blocks rather than waiting to ripple through all the full adders inside a block.
CLAs use bit-level generate ($G_i$) and propagate ($P_i$) signals to determine carry behavior for individual bit positions, which are then combined into block-level signals ($G_{i:j}$ and $P_{i:j}$) to determine the carry out for larger multi-bit groups.
The $i$-th column of an adder is said to generate a carry if it produces a carry out independent of the carry in. Hence, we have $G_i = A_i B_i$.
The $i$-th column of an adder is said to propagate a carry if it produces a carry out whenever there is a carry in. Thus, we have $P_i = A_i + B_i$.
One great advantage here is that both $G_i$ and $P_i$ depend only on the two input bits $A_i$ and $B_i$, with no need to know the carry in ($C_{i-1}$) of that column.
Using $G_i$ and $P_i$, we can rewrite the carry out of each column of full adders: $C_i = A_i B_i + (A_i + B_i) C_{i-1} = G_i + P_i C_{i-1}$.
Let's say we want to build a 32-bit carry-lookahead adder. Doing the above manipulation for 32 stages is cumbersome. However, we can divide the columns into 4-bit blocks first (i.e., implement a 4-bit carry-lookahead adder as the fundamental block).

Using this fundamental block (4-bit carry-lookahead adder), we can build bigger adders. Figure 5.6 (a) shows a 32-bit carry-lookahead adder composed of eight 4-bit blocks. Each block contains a 4-bit ripple-carry adder and some lookahead logic to compute the carry out of the block given the carry in, as shown in Figure 5.6 (b).
$C_1$, $C_2$, $C_3$, and $C_4$ are the four carries that we've looked ahead!

Image Explanation
The AND and OR gates needed to compute the single-bit generate and propagate signals, $G_i$ and $P_i$, from $A_i$ and $B_i$ are left out for brevity.
The block propagate ($P_{3:0}$) and generate ($G_{3:0}$) signals for a 4-bit block (the subscript $3{:}0$ refers to the bit range of the block here) are
$G_{3:0} = G_3 + P_3(G_2 + P_2(G_1 + P_1 G_0))$, the same as shown in the figure above
$P_{3:0} = P_3 P_2 P_1 P_0$, the same as shown in the figure above
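Unrolling the carry recurrence for one 4-bit block (writing the block's carry in as $C_0$ and the carry out of column $i$ as $C_{i+1}$, to match the figure's labels) makes the lookahead explicit; every carry depends only on the $G$'s, the $P$'s, and $C_0$:

$$\begin{aligned}
C_1 &= G_0 + P_0 C_0\\
C_2 &= G_1 + P_1 G_0 + P_1 P_0 C_0\\
C_3 &= G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0\\
C_4 &= G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0 = G_{3:0} + P_{3:0}\, C_0
\end{aligned}$$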
The ripple effect is avoided within a block, but the ripple-carry effect is still present between blocks. Nevertheless, the delays are significantly lower than in the original ripple-carry adder.
Putting it all together
HDLs provide the + operation to specify a CPA. HDL Example 5.1 describes a CPA with carries in and out.
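The body of HDL Example 5.1 is not reproduced here, but it boils down to a one-liner; a sketch in the same spirit (parameter and port names are my guesses):

```systemverilog
// N-bit carry propagate adder: the synthesis tool picks the architecture
// (ripple-carry, carry-lookahead, prefix, ...) that meets timing
module adder #(parameter N = 8)
              (input  logic [N-1:0] a, b,
               input  logic         cin,
               output logic [N-1:0] s,
               output logic         cout);
  assign {cout, s} = a + b + cin;  // (N+1)-bit sum: the top bit is the carry out
endmodule
```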
Subtraction
Recall from the previous part that adders can add positive and negative numbers using two's complement number representation. Subtraction is almost as easy:
flip the sign of the second number
then add it
Flipping the sign of a two's complement number is done by inverting the bits then adding 1.
Thus, subtraction $Y = A - B$ is performed with a single CPA by computing $A + \overline{B} + 1$. Figure 5.9 shows the symbol for a subtractor and the underlying hardware for performing $Y = A - B$.

An N-bit subtractor needs N NOT gates. This is tricky! Don't just glance at the figure and deduce that only one NOT gate is needed!
HDL Example 5.2 describes a subtractor.
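Likewise, a sketch of what such a subtractor looks like (names are my own):

```systemverilog
// N-bit subtractor: y = a - b, which hardware realizes as a + ~b + 1
module subtractor #(parameter N = 8)
                   (input  logic [N-1:0] a, b,
                    output logic [N-1:0] y);
  assign y = a - b;  // synthesizes to a CPA with B inverted and carry in = 1
endmodule
```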
Signed Addition/Subtraction
In signed addition/subtraction, we have four flags, N(egative), Z(ero), C(arry), and (o)V(erflow), to indicate the status of our result. The following part will explain when each flag is set.
Before that, let's make some clarifications or conventions first.
We denote that the LSB is at bit 0 (the 0-th bit), and so on for all the remaining bits.
When we add two N-bit numbers, we will get an (N+1)-bit number. However, we treat the MSB (the N-th bit) as the carry bit, and the remaining N bits are the result. So, when we say the MSB of the result, we are actually referring to the (N-1)-th bit.
Signed Addition

By observing the addition examples above, we may find out that
For unsigned addition, carry = 1 indicates that the result is wrong.
For signed addition, overflow occurs when (+ve) + (+ve) = (-ve) or (-ve) + (-ve) = (+ve), i.e., two positive numbers adding up to a negative number or two negative numbers adding up to a positive number.
To determine the overflow flag, we always treat the two operands as signed! So, the check is: if the sign bits (MSBs) of the operands are the same, and the sign of the result (the (N-1)-th bit) is different from them, there is an overflow.
Overflow can never occur when you add two numbers of different signs
Carry and overflow are unrelated. (However, carry is sometimes called "unsigned overflow".)
Carry flag is set when the value in the blue rectangle is one.
Negative flag is determined by looking only at the MSB of the result (the (N-1)-th bit, actually).
Zero flag is determined by looking only at the N-bit result.
Notes
In addition, there are some small tips
N flag and Z flag cannot be 1 simultaneously.
Z flag and V flag can be set simultaneously. e.g., 1000+1000=0000 with carry out 1, and the interpretation of the result to be 0.
The hardware as well as the instruction used for performing both signed and unsigned addition are the same
The hardware doesn't know and doesn't care if we are dealing with signed or unsigned.
Only the interpretation of the result and/or the way the condition codes are used are different.
So, first use signed/unsigned to write out the two operands, then just do the binary addition on these two operands and derive the flags.
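To make the flag rules above concrete, here is a sketch (my own, not from the lecture) of how hardware might derive NZCV for an N-bit addition:

```systemverilog
// NZCV flags for an N-bit addition A + B (a sketch)
module addflags #(parameter N = 32)
                 (input  logic [N-1:0] a, b,
                  output logic [N-1:0] s,
                  output logic         n_flag, z_flag, c_flag, v_flag);
  assign {c_flag, s} = a + b;            // the (N+1)-bit sum's MSB is the carry
  assign n_flag = s[N-1];                // sign bit of the N-bit result
  assign z_flag = (s == '0);             // all N result bits are zero
  assign v_flag = (a[N-1] == b[N-1]) &&  // operands share a sign...
                  (s[N-1] != a[N-1]);    // ...but the result's sign differs
endmodule
```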
Signed Subtraction

By observing the subtraction examples above, we can find out that
For unsigned subtraction, borrow indicates that the result is wrong.
In ARM, carry is just NOT borrow. In Intel, however, carry is borrow: for subtraction, the hardware flips the adder's carry out to derive the carry flag.
For signed subtraction,
Overflow occurs when (+ve) - (-ve) = (-ve) or (-ve) - (+ve) = (+ve).
And similar to the signed addition, we always treat the two operands as signed. So, if the sign bits (MSBs) of the operands are different, and the result has the same sign as the second operand (the subtrahend), there is an overflow.
Overflow can never occur in subtraction when operands are of the same sign.
In hardware, the Carry flag simply reflects the ALU's carry-out signal. Conceptually, during subtraction, this signal acts as the inverse of the borrow used in pen-and-paper arithmetic.
Notes
The hardware as well as the instruction used for performing both signed and unsigned subtraction are the same
The hardware doesn't know and doesn't care if we are dealing with signed or unsigned.
Only the interpretation of the result and/or the way the condition codes are used are different.
So, first use signed/unsigned to write out the two operands, then just do the binary subtraction on these two operands and derive the flags.
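And the subtraction counterpart (again my own sketch); note how computing A - B as A + ~B + 1 makes the adder's carry out equal to NOT borrow, i.e., the ARM convention:

```systemverilog
// NZCV flags for an N-bit subtraction A - B (a sketch)
module subflags #(parameter N = 32)
                 (input  logic [N-1:0] a, b,
                  output logic [N-1:0] d,
                  output logic         n_flag, z_flag, c_flag, v_flag);
  // the concatenations keep ~b at N bits, so the extra MSB really is the carry out
  assign {c_flag, d} = {1'b0, a} + {1'b0, ~b} + 1'b1;  // C = NOT borrow
  assign n_flag = d[N-1];                              // sign of the result
  assign z_flag = (d == '0);
  assign v_flag = (a[N-1] != b[N-1]) &&                // operand signs differ...
                  (d[N-1] == b[N-1]);                  // ...and the result matches the subtrahend's sign
endmodule
```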
In summary, we know the following points from the human interpretation side,
When C = 1, it indicates the unsigned addition result is wrong.
When C = 0, it indicates that the unsigned subtraction result is wrong.
When V = 1, it indicates that both signed addition and signed subtraction result are wrong.
For RISC-V, NZCV is irrelevant when the ALU does addition. But it is still recommended to go through how each flag of NZCV is generated during addition.
In contrast, NZCV is relevant when the ALU does subtraction. Again, it is important to go through how NZCV is generated during subtraction. In RISC-V, subtraction can be done in
`sub`
branch/`slt` variants
In the pure `sub` instruction, NZCV may not be that useful. But in the branch/`slt` variants, these NZCV values, which are not stored as flag bits for use by future instructions, become much more useful. The different comparisons, like `eq`, etc., are implemented using these NZCV values.
| Condition | Operands | Meaning | Flag expression | Based on |
| --- | --- | --- | --- | --- |
| `eq` | Both | A == B | Z | Z flag |
| `ne` | Both | A ≠ B | ¬Z | Z flag |
| `lt` | Signed | A < B | N ⊕ V | Signed overflow logic |
| `ge` | Signed | A ≥ B | ¬(N ⊕ V) | Signed overflow logic |
| `ltu` | Unsigned | A < B | ¬C | Borrow logic |
| `geu` | Unsigned | A ≥ B | C | Borrow logic |
Table Explanation
For `lt` and `ge`, which are for signed numbers, we must first make sure there is no overflow, then check the N flag.
For `ltu` and `geu`, use the ARM borrow logic to understand: if A < B, then A - B will need a borrow, thus the carry flag is 0, and vice versa. `sltu` is useful to check whether the subtraction of two numbers will need a borrow or not.
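Assuming the NZCV values of A - B are available (as in the subflags sketch earlier), each table row above costs at most one gate; a sketch:

```systemverilog
// branch-condition decode from the NZCV of A - B (a sketch)
module branchcond(input  logic n_flag, z_flag, c_flag, v_flag,
                  output logic eq, ne, lt, ge, ltu, geu);
  assign eq  = z_flag;              // A == B
  assign ne  = ~z_flag;             // A != B
  assign lt  = n_flag ^ v_flag;     // signed A < B (N and V disagree)
  assign ge  = ~(n_flag ^ v_flag);  // signed A >= B
  assign ltu = ~c_flag;             // unsigned A < B: the subtraction borrowed
  assign geu = c_flag;              // unsigned A >= B: no borrow
endmodule
```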
Comparators
A comparator determines whether two binary numbers are equal or if one is greater or less than the other. A comparator receives two N-bit binary numbers A and B. There are two common types of comparators.
An equality comparator produces a single output indicating whether A is equal to B (A == B).
A magnitude comparator produces one or more outputs indicating the relative values of A and B.
Equality comparator
The equality comparator is the simpler piece of hardware. Figure 5.11 shows the symbol and implementation of a 4-bit equality comparator. It first checks whether the corresponding bits in each column of A and B are equal, using XNOR gates. The numbers are equal if all of the columns are equal.

Magnitude comparator
Magnitude comparison is usually done by computing A - B and looking at the sign (most significant bit) of the result, as shown in Figure 5.12. If the result is negative (i.e., the sign bit is 1), then A is less than B. Otherwise, A is greater than or equal to B.

Here, [N-1] means the output is bit N-1 of the result, i.e., its sign bit.
HDL Example 5.3 shows how to use various comparison operations.
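The body of HDL Example 5.3 is omitted here; a sketch in the same spirit, leaning on the HDL's built-in unsigned comparison operators:

```systemverilog
// equality and unsigned magnitude comparators (a sketch)
module comparators #(parameter N = 8)
                    (input  logic [N-1:0] a, b,
                     output logic eq, neq, lt, lte, gt, gte);
  assign eq  = (a == b);   // XNOR each column, then AND (Figure 5.11)
  assign neq = (a != b);
  assign lt  = (a <  b);   // effectively the sign of a - b (Figure 5.12)
  assign lte = (a <= b);
  assign gt  = (a >  b);
  assign gte = (a >= b);
endmodule
```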
ALU
An Arithmetic/Logical Unit (ALU) combines a variety of mathematical and logical operations into a single unit. For example, a typical ALU might perform addition, subtraction, magnitude comparison, AND, and OR operations. The ALU forms the heart of most computer systems.
Figure 5.14 shows the symbol for an N-bit ALU with N-bit inputs and outputs.

The ALU receives a control signal F that specifies which function to perform. Table 5.1 lists typical functions that the ALU can perform.

Figure 5.15 shows an implementation of the ALU.

Image Explanation
$F_2$ is also the carry in to the adder.
In two's complement arithmetic, $A - B = A + \overline{B} + 1$.
When $F_{1:0} = 11$, the ALU performs the set if less than (SLT) operation. When $A < B$, $Y = 1$. Otherwise, $Y = 0$. In other words, $Y$ is set to 1 if $A$ is less than $B$.
SLT is performed by computing $S = A - B$. If $S$ is negative (i.e., the sign bit is set), $A$ is less than $B$. The zero extend unit produces an N-bit output by concatenating its 1-bit input with 0's in the most significant bits. The sign bit ($S_{N-1}$) of $S$ is the input to the zero extend unit.
The AND and OR gates shown in the figure above each need 32 2-input AND/OR gates to implement! Not one!
The multiplexer that selects whether the second input source is the N-bit $B$ or $\overline{B}$ can be implemented using N 2-input XOR gates, where one input of each XOR gate is one bit of the N-bit $B$, and the other is the 1-bit signal $F_2$.
Some ALUs produce extra outputs, called flags, that indicate information about the ALU output. For example, an overflow flag indicates that the result of the adder overflowed. A zero flag indicates that the ALU output is 0.
HDL Implementation
The HDL for an N-bit ALU is shown as follows.
TODO: Wait for adding LOL.
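Until that is added, here is an interim sketch of the Figure 5.15 datapath, assuming the usual 3-bit encoding of Table 5.1 ($F_2$ inverts B, and $F_{1:0}$ selects AND, OR, ADD, SLT):

```systemverilog
// N-bit ALU sketch: F2 chooses B or ~B (and is the adder's carry in),
// F1:0 chooses among AND, OR, ADD/SUB, and SLT
module alu #(parameter N = 32)
            (input  logic [N-1:0] a, b,
             input  logic [2:0]   f,
             output logic [N-1:0] y);
  logic [N-1:0] bb, s;
  assign bb = f[2] ? ~b : b;   // invert B for subtraction / SLT
  assign s  = a + bb + f[2];   // F2 as carry in: A + ~B + 1 = A - B
  always_comb
    case (f[1:0])
      2'b00: y = a & bb;                       // AND
      2'b01: y = a | bb;                       // OR
      2'b10: y = s;                            // ADD (SUB when F2 = 1)
      2'b11: y = {{(N-1){1'b0}}, s[N-1]};      // SLT (meant for F = 111): zero-extend sign of A - B
    endcase
endmodule
```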
Shifters and Rotators
Shifters and rotators move bits and multiply or divide by powers of 2. There are several kinds of commonly used shifters:
Logical shifter
It shifts the number to the left (LSL) or right (LSR) and fills the empty spots with 0's. For example, 11001 >> 2 = 00110.
Arithmetic shifter
It is the same as a logical shifter, but on right shifts it fills the most significant bits with a copy of the old most significant bit (msb). This is useful for multiplying and dividing signed numbers. For example, 11001 >>> 2 = 11110. Arithmetic shift left (ASL) is the same as logical shift left (LSL): because we fill the empty spots at the right with 0's either way, the sign-fill distinction doesn't matter on left shifts.
Rotator
It rotates a number in a circle such that the empty spots are filled with the bits shifted off the other end. For example, 11001 ROR 2 = 01110.
An N-bit shifter can be built from N N:1 multiplexers. The input is shifted by 0 to N-1 bits, depending on the value of the $\log_2 N$-bit select lines (shamt). Figure 5.16 shows the symbol and hardware of 4-bit shifters. The operators <<, >>, and >>> typically indicate shift left, logical shift right, and arithmetic shift right, respectively.

Image Explanation
For Figure 5.16(b), which is about `srl` (shift right logical), we can divide it into four cases:
shamt = 00 (no shift), then the output is $A_3, A_2, A_1, A_0$
shamt = 01 (srl 1 bit), then the output is $0, A_3, A_2, A_1$
shamt = 10 (srl 2 bits), then the output is $0, 0, A_3, A_2$
shamt = 11 (srl 3 bits), then the output is $0, 0, 0, A_3$
Such a shifter shown in the Figure 5.16 which can combinationally shift an input by an arbitrary amount is called a barrel shifter.
Shifting by a fixed amount (e.g., 1 bit, 2 bits, etc.) only needs rewiring! It doesn't need any multiplexer.
If we want to use this idea to build a 32-bit shifter, we need 32 32:1 multiplexers, which has a very high hardware cost. To solve this problem, let's introduce the hardware-efficient shifter.
Hardware-efficient Shifter
This hardware-efficient shifter only needs 5 × 32 2:1 multiplexers. Below is the implementation of the 32-bit srl shifter.

This shifter uses a logarithmic, staged design that decomposes any shift amount into a combination of small, power-of-two shifts. Each stage doubles the shift capacity, making the circuit both elegant and efficient.
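A sketch of this staged design (signal names are mine): each of the five stages is one row of 32 2:1 multiplexers, controlled by one bit of shamt:

```systemverilog
// logarithmic 32-bit srl shifter: decompose the shift amount
// into conditional power-of-two shifts (16, 8, 4, 2, 1)
module srl32(input  logic [31:0] din,
             input  logic [4:0]  shamt,
             output logic [31:0] dout);
  logic [31:0] s0, s1, s2, s3;
  assign s0   = shamt[4] ? {16'b0, din[31:16]} : din;  // shift by 16?
  assign s1   = shamt[3] ? { 8'b0,  s0[31:8]}  : s0;   // shift by 8?
  assign s2   = shamt[2] ? { 4'b0,  s1[31:4]}  : s1;   // shift by 4?
  assign s3   = shamt[1] ? { 2'b0,  s2[31:2]}  : s2;   // shift by 2?
  assign dout = shamt[0] ? { 1'b0,  s3[31:1]}  : s3;   // shift by 1?
endmodule
```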
To add support for sll and sra, the easiest way is to add a 3-to-1 multiplexer, controlled by the shift-type info from the instruction.
However, more efficient ways to combine the shifts exist. For example, when shamt[4] = 1, the upper input can come from another small "multiplexer" that gives
{16{0}, ShIn[31:16]} for `srl`
{16{ShIn[31]}, ShIn[31:16]} for `sra`
Multiplication
In the previous part, we have seen that multiplication is nothing but shift then add. By shifting, we form the partial products.
In general, an N×N multiplier multiplies two N-bit numbers and produces a 2N-bit result. Multiplication of 1-bit binary numbers is equivalent to the AND operation, so AND gates are used to form the partial products.
Array Multiplier
Figure 5.18 shows the symbol, function, and implementation of an unsigned 4x4 multiplier.

The unsigned multiplier receives the multiplicand and multiplier, $A$ and $B$, and produces the product $P$. Figure 5.18(b) shows how partial products are formed. Each partial product is a single multiplier bit ($B_i$) AND the multiplicand bits ($A_{N-1:0}$). With N-bit operands, there are N partial products and N-1 stages of 1-bit adders. For example, for a 4×4 multiplier, the partial product of the first row is $B_0$ AND ($A_3, A_2, A_1, A_0$). This partial product is added to the shifted second partial product, $B_1$ AND ($A_3, A_2, A_1, A_0$). Subsequent rows of AND gates and adders form and add the remaining partial products.
This array multiplier is combinational because the whole multiplication (all ANDs and adds) happens in one clock cycle. Maybe this video is good at explaining the critical path and hardware analysis of the array multiplier.
Sequential Multiplier
Suppose that we do not want to do the multiplication in one single cycle; we can use the sequential multiplier instead. Its workflow is given as follows,

To understand it better, let's step through an example. Let S = 000000, A = 011, B = 101 (initial values).
Cycle 1 (original $b_0$ = 1), add:
S = S + A = 000000 + 000011 = 000011
A = A << 1; thus A = 000110
B = B >> 1; thus B = 010
Cycle 2 (original $b_1$ = 0):
no addition required here
A = A << 1; thus A = 001100
B = B >> 1; thus B = 001
Cycle 3 (original $b_2$ = 1), add:
S = S + A = 000011 + 001100 = 001111, which is the final result (011 × 101 = 3 × 5 = 15)
The hardware complexity increases linearly as we increase the bits to be multiplied.
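A behavioral sketch of this basic sequential multiplier, sized for the 3-bit example above (my own rendering; the start/done sequencing is reduced to a bare minimum):

```systemverilog
// basic sequential multiplier: one add/shift step per clock
module seqmult(input  logic       clk, start,
               input  logic [2:0] a_in, b_in,  // multiplicand, multiplier
               output logic [5:0] s);          // product accumulator
  logic [5:0] a;
  logic [2:0] b;
  always_ff @(posedge clk)
    if (start) begin            // load: S = 0, A zero-extended, B as is
      s <= '0;
      a <= {3'b000, a_in};
      b <= b_in;
    end else begin              // run this for 3 cycles after start
      if (b[0]) s <= s + a;     // add multiplicand if the current multiplier bit is 1
      a <= a << 1;              // multiplicand marches left...
      b <= b >> 1;              // ...multiplier bits march right
    end
endmodule
```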
Improved Sequential Multiplier
The idea is to always align A to the "high side" (the 3 most significant bits of S in 3-bit multiplication) and then perform the addition. Then shift S to the right and do not shift A. The least significant bits of S are populated with B, which keeps getting shifted to the right along with S (this avoids the need for an extra register for B). Its workflow is shown as follows,

Again, stepping through an example will make it easier to understand. Let S = 000101 (i.e., {000, B} with B = 101) and A = 011 (initial values).
Cycle 1 (original $b_0$ = 1, the current LSB of S), add:
S = S + A000 = 000101 + 011000 = 011101
S = S >> 1; thus S = 001110
Cycle 2 (original $b_1$ = 0):
no addition required
S = S >> 1; thus S = 000111
Cycle 3 (original $b_2$ = 1), add:
S = S + A000 = 000111 + 011000 = 011111
S = S >> 1; thus S = 001111, which is the final result
But there is a problem with the improved sequential multiplier: what if the 32-bit ALU generates a carry out? Where should the carry out go?
Ans: The carry-out effectively becomes the new MSB (Bit 63) of the 64-bit Product after the shift occurs.
HDL Implementation
The HDL for a multiplier is in HDL Example 5.4. As with adders, the synthesis tools may pick the most appropriate design given the timing constraints.
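As before, the example body is omitted; a sketch in the same spirit (names mine):

```systemverilog
// N x N multiplier with a 2N-bit product; the tool picks the architecture
module multiplier #(parameter N = 8)
                   (input  logic [N-1:0]   a, b,
                    output logic [2*N-1:0] p);
  assign p = a * b;
endmodule
```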
Division
Before we continue this section, let's make some conventions first. Given $A / B = Q + R / B$, we say that
$A$ is the dividend
$B$ is the divisor
$Q$ is the quotient
$R$ is the remainder
Binary division can be performed using the following algorithm for N-bit unsigned numbers in the range $[0, 2^N - 1]$.
The partial remainder $R$ is initialized to 0, and the most significant bit of the dividend becomes the least significant bit of $R$ ($R = \{R \ll 1, A_{N-1}\}$).
The divisor $B$ is subtracted from this partial remainder to determine whether it fits ($D = R - B$).
If the difference $D$ is negative (i.e., the sign bit of $D$ is 1), then the quotient bit $Q_i$ is 0 and the difference is discarded.
Otherwise, the difference is positive: the quotient bit $Q_i$ is 1 and the partial remainder becomes $D$ ($R = D$).
In any event, the partial remainder is then doubled (left-shifted by one column), the next most significant bit of $A$ becomes the least significant bit of $R$, and the process repeats.
The process (from step 2) repeats for all N bits, and the result satisfies $A / B = Q + R / B$, i.e., $A = Q \times B + R$.
N-bit operands yield an N-bit quotient and remainder. Here, we want to show that the divisor, quotient, and remainder share the same bit width, but actually the dividend can have 2N bits.
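As a cross-check, here is a direct, unoptimized rendering of the pseudocode above (my own sketch): one loop iteration per quotient bit, most significant bit first:

```systemverilog
// combinational restoring divider: q = a / b, r = a % b
module divider #(parameter N = 4)
                (input  logic [N-1:0] a, b,   // dividend, divisor
                 output logic [N-1:0] q, r);  // quotient, remainder
  logic [N:0] rr, d;  // partial remainder and trial difference (one extra sign bit)
  always_comb begin
    rr = '0;
    for (int i = N-1; i >= 0; i--) begin
      rr = {rr[N-1:0], a[i]};   // shift in the next dividend bit
      d  = rr - b;              // does the divisor fit?
      if (d[N]) q[i] = 1'b0;    // negative: quotient bit 0, discard the difference
      else begin
        q[i] = 1'b1;            // positive: quotient bit 1...
        rr   = d;               // ...and the remainder becomes the difference
      end
    end
    r = rr[N-1:0];
  end
endmodule
```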
Below is a more straight-forward example,

This approach is called the long division approach, which is basically the same as what we have seen above, but in a more straightforward way:
If the divisor ≤ the current dividend bits
1 bit in quotient, subtract
Otherwise
0 bit in quotient, bring down the next dividend bit
This illustration is more straightforward, but less complete. Thus it is good for doing by-hand calculation. But for the systematic approach to implement in hardware, please review the pseudocode above.
Array Divider
Figure 5.22 shows a schematic of a 4-bit array divider. The divider computes $A / B$ and produces a quotient $Q$ and a remainder $R$. The legend shows the symbol and schematic for each block in the array divider.

Each row performs one iteration of the division algorithm. Specifically, each row calculates the difference $D = R - B$. (Recall that $R - B = R + \overline{B} + 1$.)
The multiplexer select signal, $N$ ($N$ for Negative), receives 1 when a row's difference is negative. So $N$ is driven by the most significant bit of $D$.
Each quotient bit ($Q_i$) is 0 when $D$ is negative and 1 otherwise. The multiplexer passes $R$ on to the next row if the difference is negative and $D$ otherwise. (See the legend.)
The following row shifts the new partial remainder left by one bit (a good example showing that "shifting is just rewiring"), appends the next most significant bit of $A$, and then repeats the process.
The delay of an N-bit array divider increases proportionally to $N^2$ because the carry must ripple through all N stages in a row before the sign is determined and the multiplexer can select $R$ or $D$. This repeats for all N rows.
Sequential Divider
The sequential divider here just follows the pseudocode mentioned above, which is for unsigned division. To convert it to signed division, only minor changes are needed. (For more information on this, go to Lab 03.)
Similar to the Sequential Multiplier, we can also do the divider sequentially,

To understand it better, let's again step through an example: 00000111 (dividend) divided by 0010 (divisor).

In Hennessy and Patterson's textbook, they mention an improved version of the division algorithm above; it is shown as follows,

Signed Multiplication/Division
Signed vs. Unsigned Multiplication
When you multiply two n-bit numbers,
the lowest n bits (the least significant half) of the product are identical no matter whether the numbers are interpreted as signed or unsigned.
Only the upper n bits differ (because sign extension affects those).
That’s why:
`MUL` (in RISC-V) gives just the lower n bits.
`MULH`/`MULHU` (RISC-V) give the upper n bits, for signed/unsigned respectively.
Macro-op fusion
“Macro-op fusion” means the processor can fuse mul and mulh internally and perform a single 64-bit multiply, returning both halves if needed.
We know that RISC-V has separate instructions for multiplication:
`mul` → lower 32 bits of product
`mulh` → upper 32 bits (signed × signed)
`mulhu` → upper 32 bits (unsigned × unsigned)
`mulhsu` → upper 32 bits (signed × unsigned)
If we implement them literally, it looks like three separate multipliers are needed. But in hardware, we don’t have to do three full multiplications. Instead, we can compute one 64-bit result internally and just output different parts or interpret sign bits differently.
So, even though RISC-V has multiple mul variants, we don’t necessarily perform multiple physical multiplications.
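In hardware terms, that sharing can look like the following sketch (signed version only; mulhu and mulhsu would differ just in how the operands are extended):

```systemverilog
// one widening multiply serves both mul and mulh
module mulunit(input  logic signed [31:0] a, b,
               output logic        [31:0] mul_o,    // mul:  low half (same for signed/unsigned)
               output logic        [31:0] mulh_o);  // mulh: high half (signed x signed)
  logic signed [63:0] full;
  assign full   = a * b;   // single 64-bit product, operands sign-extended
  assign mul_o  = full[31:0];
  assign mulh_o = full[63:32];
endmodule
```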
Signed/Unsigned always matters in Division
Unlike multiplication, signed and unsigned division don't share any bits of the result. We can't reuse the same division result for both. For example,
| Dividend | Divisor | Interpreted as | Result | Note |
| --- | --- | --- | --- | --- |
| 6 | 3 | signed or unsigned | 2 | Same either way |
| 6 | -3 | signed | -2 | (because of the sign) |
| 6 | -3 | unsigned | invalid | (divisor interpreted as 4294967293 if 32-bit!) |
| -6 | 3 | signed | -2 | okay |
| -6 | 3 | unsigned | big number ≈ 0x55555554 | (because -6 -> 0xFFFFFFFA is interpreted as 4294967290) |
So division results sometimes completely change depending on whether you treat numbers as signed or unsigned — even if you use the same bit pattern.
Simple trick for signed multiplication/division
Signed division (and signed multiplication) can be done using the same unsigned operation unit (see the sketch after this list) if you:
Record the sign of both operands.
Take their absolute values (convert negatives to positives).
Perform unsigned division (or multiplication).
If exactly one operand was negative, negate the result.
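A sketch of this trick (the / operator stands in for the unsigned divider unit; as discussed right below, the $-2^{31}$ corner case still overflows):

```systemverilog
// signed division on an unsigned divider: strip signs, divide, fix up
module sdiv #(parameter N = 32)
             (input  logic signed [N-1:0] a, b,
              output logic signed [N-1:0] q);
  logic [N-1:0] ua, ub, uq;
  assign ua = a[N-1] ? -a : a;               // steps 1-2: record signs, take absolute values
  assign ub = b[N-1] ? -b : b;
  assign uq = ua / ub;                       // step 3: unsigned divide
  assign q  = (a[N-1] ^ b[N-1]) ? -uq : uq;  // step 4: negate if exactly one operand was negative
endmodule
```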
But there are some weird scenarios. For example, if one operand is $-2^{31}$ (the most negative 32-bit number), taking its absolute value gives $+2^{31}$, and the result cannot fit into a 32-bit register. This is an overflow (same trick to detect overflow as we have discussed in signed addition/subtraction)!
Number Systems
Computers operate on both integers and fractions. So far, we have only considered representing signed or unsigned integers. This section introduces fixed- and floating-point number systems that can represent rational numbers.
Fixed-point numbers are analogous to decimals; some of the bits represent the integer part, and the rest represent the fraction
Floating-point numbers are analogous to scientific notation, with a mantissa and an exponent.
If we have N bits, we can only represent $2^N$ patterns. Using different number systems won't change the number of patterns, but can change the distribution of the numbers they represent.
Fixed-Point Number Systems
Fixed-point notation has an implied binary point between the integer and fraction bits, analogous to the decimal point between the integer and fraction digits of an ordinary decimal number. For example, Figure 5.23(a) shows a fixed-point number with four integer bits and four fraction bits. Figure 5.23(b) shows the implied binary point in blue, and Figure 5.23(c) shows the equivalent decimal value.

Notes
This notation is not so standard.
The position of the binary point affects the range (difference between the largest and smallest representable numbers) and precision (smallest possible difference between any two numbers)
Signed Fixed-Point
Signed fixed-point numbers can use either two's complement or sign/magnitude notation. Figure 5.24 shows the fixed-point representation of $-2.375$ using both notations with four integer and four fraction bits.

The implicit binary point is shown in blue for clarity.
In sign/magnitude form, the most significant bit is used to indicate the sign.
The two's complement representation is formed by inverting the bits of the absolute value and adding 1 to the least significant (rightmost) bit. In this case, the least significant bit position is in the $2^{-4}$ column.
In this example, we have: 00100110 (2.375) -> 11011001 (inverted) -> 11011010 (+1).
In general, we use Ua.b to denote an unsigned fixed-point number with a integer bits and b fraction bits. Qa.b denotes a signed (two's complement) fixed point number with a integer bits (including the sign bit) and b fractional bits.
Fixed-Point Calculation
Fixed-point addition and subtraction is simple because the position of the binary point won't change. But multiplication and division are a bit tedious. So, here we will only talk about the fixed-point number multiplication and division.
Actually, from the small trick we mentioned above, we've seen that fixed-point notation has an implicit scale factor, which is $2^{-b}$ for $b$ fraction bits.
For example, if we have 2 fractional bits, the scale factor is $2^{-2} = 0.25$. That means every stored integer $x$ actually represents $x \times 2^{-2}$. So, 00000110₂ = 6 actually means $6 \times 2^{-2} = 1.5$.
Knowing this, we can clearly see that when we multiply two numbers in fixed point, both are already scaled. For example, let's compute $1.5 \times 2.5$:

| Name | Value | Stored bits | Stored integer |
| --- | --- | --- | --- |
| a | 1.5 | 00000110 | 6 |
| b | 2.5 | 00001010 | 10 |
When we do a*b, we will have 00111100₂ (= 60). Since we scaled each input by $2^{-2}$, the product is scaled by $2^{-4}$. So to get back to the same scale factor, which is 2 fractional bits, we must divide the product by 4 (shift right by 2 bits). Thus, we have 00111100₂ >> 2 = 00001111₂. This gives our stored value to be 15 and our actual value to be $15 \times 2^{-2} = 3.75$.
But in the above scenario, we assumed a short multiply, i.e., n-bit × n-bit = n-bit.
When we want to do a long multiplication,
an overflow may happen, which basically means our result bits cannot hold the multiplication result.
an underflow may also happen, which means the multiplication result is representable in magnitude, but not accurately, because our fractional bits are too few.
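The rescaling rule above, in HDL form for the same 2-fraction-bit format (a sketch; overflow into the dropped high bits is silently ignored here):

```systemverilog
// fixed-point multiply with 2 fraction bits: the raw product is scaled
// by 2^-4, so shift right by 2 to return to the operands' 2^-2 scale
module qmul(input  logic [7:0] a, b,  // operands, 2 fraction bits each
            output logic [7:0] p);    // short multiply: same format as inputs
  logic [15:0] full;
  assign full = a * b;       // e.g., 6 x 10 = 60 = 00111100
  assign p    = full >> 2;   // 60 >> 2 = 15, i.e., 3.75 with 2 fraction bits
endmodule
```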
Floating-Point Number Systems
Floating-point numbers are analogous to scientific notation. They circumvent the limitation of having a constant number of integer and fraction bits, allowing the representation of very large and very small numbers. Like scientific notation, floating-point numbers have a sign, mantissa (M), base (B), and exponent (E), as shown below:

$$\pm M \times B^E$$
For example, the number $4.1 \times 10^3$ is the decimal scientific notation for 4100. It has a mantissa of 4.1, a base of 10, and an exponent of 3. The decimal point floats to the position right after the most significant digit. Floating-point numbers are base 2 with a binary mantissa. 32 bits are used to represent 1 sign bit, 8 exponent bits, and 23 mantissa bits.
The exponent needs to represent both positive and negative exponents. To do so, floating-point uses a biased exponent, which is the original exponent plus a constant bias. In 32-bit floating-point representation, the bias is 127. For example, for the exponent 7, the biased exponent is 7 + 127 = 134 = 10000110₂. For the exponent -4, the biased exponent is -4 + 127 = 123 = 01111011₂.
This notation conforms to the IEEE 754 floating-point standard. In this notation, the first bit of the mantissa (the one to the left of the binary point) is always 1, so it need not be stored; only the fraction bits after the binary point are kept.
To put it all together, let's examine two examples:
Represent signed number in IEEE 754
Suppose we want to write $-58.25_{10}$ in floating point (IEEE 754).
Convert decimal to binary: $58.25_{10} = 111010.01_2$
Write in binary scientific notation: $1.1101001_2 \times 2^5$
Fill in the fields: sign bit = 1, biased exponent = $5 + 127 = 132 = 10000100_2$, 23 fraction bits = 110 1001 0000 0000 0000 0000

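Packing the three fields together gives the final 32-bit word (a quick check of the encoding above):

$$\underbrace{1}_{\text{sign}}\;\underbrace{10000100}_{\text{biased exponent}}\;\underbrace{11010010000000000000000}_{\text{fraction}} = \texttt{0xC2690000}$$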
Special Cases
The IEEE floating-point standard has special cases to represent numbers such as zero, infinity, and illegal results.

More formats
So far, we have examined 32-bit floating-point numbers. This format is also called single-precision, single, or float. The IEEE 754 standard also defines 64-bit double-precision numbers (also called doubles) and 128-bit quadruple-precision numbers (also called quads) that provide greater precision and greater range. Table 5.5 shows the number of bits used for the fields in each format.

Rounding
Arithmetic results that fall outside of the available precision must round to a neighboring number. The rounding modes are
round down,
round up,
round toward zero, and
round to nearest.
The default rounding mode is round to nearest. In the round-to-nearest mode, if two numbers are equally near, the one with a 0 in the least significant position of the fraction is chosen.
For example, we want to round 1.100101 (1.578125) to only 3 fraction bits
round down: 1.100
round up: 1.101
round toward zero: 1.100
round to nearest: 1.101 (1.625 is closer to 1.578125 than 1.5 is)
Notes
A number overflows when its magnitude is too large to be represented.
A number underflows when it is too tiny to be represented.
In round-to-nearest mode, overflows are rounded up to $\pm\infty$ and underflows are rounded down to 0.
Floating-Point Calculation
Addition
The steps for adding floating-point numbers with the same sign are as follows:
Extract exponent and fraction bits.
Prepend leading 1 to form the mantissa.
Compare exponents.
Shift smaller mantissa if necessary.
Add mantissas.
Normalize mantissa and adjust exponent if necessary.
Check whether overflow or underflow occurred, then round the result.
Assemble exponent and fraction back into floating-point number.
For example,

Notes
For subtraction, just change step 5 to: subtract the mantissas.
If the two mantissas are subtracted as A - B and A is smaller, just do B - A, add a negative sign to the final result, and then normalize the mantissa.
If in step 3 the shift amount is negative, shift the upper mantissa to the right instead.
Step 7, checking overflow and underflow, is important! The idea is to check whether the final result's exponent is within [-126, 127] for a single-precision floating-point number. If it is < -126, underflow; if it is > 127, overflow.
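A tiny worked example of the steps, with numbers of my own choosing ($1.5 + 3.25$):

$$\begin{aligned}
1.100_2 \times 2^0 + 1.101_2 \times 2^1 &= 0.1100_2 \times 2^1 + 1.1010_2 \times 2^1 && \text{shift the smaller mantissa right by 1}\\
&= 10.0110_2 \times 2^1 && \text{add the mantissas}\\
&= 1.00110_2 \times 2^2 = 100.11_2 = 4.75 && \text{normalize and adjust the exponent}
\end{aligned}$$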
Multiplication
Floating-point multiplication follows the steps shown below:

For example, let's use the above steps to multiply the numbers $0.5_{10}$ and $-0.4375_{10}$.
In binary, the task is multiplying $1.000_2 \times 2^{-1}$ by $-1.110_2 \times 2^{-2}$.
Adding the exponents without bias:
$-1 + (-2) = -3$. The biased exponent will be $-3 + 127 = 124$.
Multiplying the significands:
$1.000_2 \times 1.110_2 = 1.110_2$, so the unsigned product is $1.110_2 \times 2^{-3}$
Checking overflow and underflow:
as $-3$ is within $[-126, 127]$, there is no overflow or underflow
Rounding the product:
no rounding is needed in this case
Adding the sign to the final product:
since the signs of the original operands differ, make the sign of the product negative. Hence, the product is $-1.110_2 \times 2^{-3}$.