Lab 03 - Multiply and Divide
Introduction
In this lab, we are still using the single cycle RISC-V processor which we have designed in Lab 02. The only difference is that, we will incorporate the multiply and divide into our processor. So, to set up for this lab, you can just copy all the design files from your lab 02 and then copy the MCycle.v into your design files and test_MCycle.v into your testbench from here!
Task 1: Implement Division
This task only needs HDL simulation.
For this task, what we do is to strictly follow the sequential divider introduced in the lecture 04.
Implementation Details
As you may notice, the sequential divider we have introduced in Lec 04 by right supports only unsigned division. However, we can only do some minor changes to make it support signed by considering the following cases. (in other words, find patterns from the following cases)
From the table above, we can clearly see that to do the signed division, we can just convert everything to unsigned first, then determine the sign of the quotient and the remainder.
Quotient is positive if and only if the sign of the dividend and divisor are the same. Otherwise, quotient is negative.
The sign of the remainder follows the sign of the dividend.
And to test whether our signed/unsigned multiplication/division works normally or not, we craft the following test cases:
0x00000000
0x00000000
0x0000000000000000
Zero × Zero
0x00000001
0xFFFFFFFF
0x00000000FFFFFFFF
Identity × Max
0xFFFFFFFF
0xFFFFFFFF
0xFFFFFFFE00000001
Max × Max
0x0000FFFF
0x0000FFFF
0x00000000FFFE0001
16-bit check
0x7FFFFFFF
0x00000002
0x00000000FFFFFFFE
Near sign bit
0x80000000
0x00000002
0x0000000100000000
2³¹ × 2
0xFFFFFFFF
0x00000002
0x00000001FFFFFFFE
Max × 2
0xAAAAAAAA
0x00000002
0x0000000155555554
Alternating bits
0x12345678
0x9ABCDEF0
0x0B00EA4E242D2080
Random case
0x0000ABCD
0x00001234
0x000000000C35F3F4
Mid-range test
0x00000000
0x00000000
0x0000000000000000
Zero
0x00000001
0xFFFFFFFF
0xFFFFFFFFFFFFFFFF
+1 × −1 = −1
0xFFFFFFFF
0xFFFFFFFF
0x0000000000000001
−1 × −1 = +1
0x7FFFFFFF
0x00000002
0x00000000FFFFFFFE
(2³¹−1) × 2
0x80000000
0x00000002
0xFFFFFFFF00000000
−2³¹ × 2
0x80000000
0xFFFFFFFF
0x0000000080000000
−2³¹ × −1 (overflow)
0x00000010
0xFFFFFFF0
0xFFFFFFFFFFFFFF00
16 × −16 = −256
0x7FFFFFFF
0xFFFFFFFF
0xFFFFFFFF80000001
(2³¹−1) × −1
0xFFFFFFFE
0xFFFFFFFE
0x0000000000000004
(−2) × (−2)
0x00001234
0xFFFFABCD
0x0000FFFFFA034FA4
Mixed signs
0x00000000
0x00000001
0x00000000
0x00000000
Zero ÷ Nonzero
0xFFFFFFFF
0x00000001
0xFFFFFFFF
0x00000000
÷ 1
0xFFFFFFFF
0x00000002
0x7FFFFFFF
0x00000001
Max ÷ 2
0x80000000
0x00000002
0x40000000
0x00000000
2³¹ ÷ 2
0x00000007
0x00000003
0x00000002
0x00000001
Small remainder
0xFFFFFFFF
0x7FFFFFFF
0x00000002
0x00000001
Large divisor
0xFFFFFFFF
0xFFFFFFFF
0x00000001
0x00000000
Equal operands
0x00000005
0x00000007
0x00000000
0x00000005
Dividend < Divisor
0x0000000A
0x00000003
0x00000003
0x00000001
Nonzero remainder
0xFFFFFFFE
0xFFFFFFFD
0x00000001
0x00000001
Max-1 ÷ Max-2
0x00000000
0x00000001
0x00000000
0x00000000
Zero ÷ Positive
0x00000001
0xFFFFFFFF
0xFFFFFFFF
0x00000000
+1 ÷ −1 = −1
0xFFFFFFFF
0xFFFFFFFF
0x00000001
0x00000000
−1 ÷ −1 = 1
0x7FFFFFFF
0x00000002
0x3FFFFFFF
0x00000001
Positive half division
0x80000000
0xFFFFFFFF
0x80000000
0x00000000
Overflow case (−2³¹ ÷ −1)
0x80000000
0x00000002
0xC0000000
0x00000000
Negative ÷ Positive
0x00000010
0xFFFFFFF0
0xFFFFFFFF
0x00000000
16 ÷ −16 = −1
0xFFFFFFF6
0x00000005
0xFFFFFFFE
0x00000000
−10 ÷ 5 = −2 r 0
0x12345678
0x0000F00F
0x00000135
0x000026B5
Random signed
0x80000000
0x00000003
0xD5555555
0xFFFFFFFD
−2³¹ ÷ 3 = floor-toward-zero
We didn't include the case when divisor is 0 because it is specified clearly that the divisor won't be 0 in the lab manual.
Task 2: Incorporate MCycle into our CPU
The whole idea here is that our processor is still single cycle! However, our multiplier and divider (MCycle) are not single cycle. So to maintain this single cycle property, we have a signal called Busy coming out from our MCycle to the WE (Write Enable) of the PC register indicating that if the MCycle is working, our PC won't be updated!
Implementation Details
As for now, you may wonder how we incorporate the MCycle into our processor, and below is the updated microarchitecture diagram!
Based on this updated microarchitecture, we can know for sure what to modify in our processor:
Decoder: to generate
Start,MCycleOp[1:0]andMulDivsignal.RV: there are two things here
Assign
WE_PC = ~busyAdd a multiplexer in the RV datapath to enable select from
ALUResultorMCycleResult(which is justResult1in our microarchitecture)
Wrapper: to remain readability, you can change
ALUResulttoComputeResult.
To derive the MulDiv, the Opcode must be same of the Opcode of DP Reg instructions and the func7 == 0000001. Strictly follow this, otherwise you will spend 2 hours debugging your lb/lbu/lh/lhu/sb/sh in Lab 04 like me 😭.
Task 3: Enhancement
In Lab 03, we are required to improve the performance of the signed multiplier. In the example code, it is intended to be written in a very inefficient way, just to give us more space to improve its efficiency.
Why it sucks
Below is part of the example code,
In Line 2 or Line 10, you can see that the teaching team intentionally to make signed multiplication taken 2 * width cycles, which is to intentionally make the signed multiplication inefficient.
Improvement Details
Use the simple trick from lec 04
As for the partial sum method, you will at most shift width - 1 times and perform width times addition. The idea of this trick is also what we use to implement our division.
Booth Algorithm
There are a lot of resources online introducing what is Booth Algorithm, the ones I am watching are this video and this video. (The sequence matters)
The improvements and trade-off analysis of Booth Algorithm are as follows
Disadvantages
We cannot reduce the number of shifts. It will always be
width-1.If the bit pattern of our operands is
0101..., then Booth Algorithm will actually perform slower than the method above. As, for half of the "addition", we will do the subtraction instead.
Here, we improve the booth's algorithm as a sequential multiplier. Can we make it combinational? In other words, can we implement the booth's algorithm using the idea from the array multiplier? (This is to use the increased hardware cost to trade off for the improvement of the speed of multiplication. See more from the trade-off between array multiplier and sequential multiplier.)
This is asked by the GA, I think it is a quite good food for thought.
By right, we can achieve even less cycles if we implement the booth's algorithm combinationally. But the trade-off here will be maybe more hardware usage and slower clock speed.
The booth's algorithm implemented in our Lab 03 doesn't support the mulhu, you will need to make some trivial changes to make it support the mulhu, or you can refer to our Lab 04 code directly.
Last updated