Lab 02 - Single Cycle RV Processor

Introduction

In this lab, we will build a single-cycle RISC-V processor; the lab manual is here. For now, our processor will support the following instructions:

  • add, addi, sub, and, andi, or, ori

  • lw, sw

  • beq, bne, jal (without linking, that is, without saving the return address).

  • lui, auipc

  • sll, srl, sra

Design Files

Top Module

Our top-level design file, Top_Nexys.vhd, is written in VHDL instead of Verilog. (FYI, some of our teaching team are real fans of VHDL 😂.) Before reading this section, I strongly recommend that you read about our Wrapper file first.

Purpose of Top

In this module, we explicitly connect the ports (LEDs, seven-segment display, etc.) on our Wrapper (our "motherboard") to the real ports (LEDs, seven-segment display, etc.) on our Nexys 4 FPGA board.

CLK_DIV_BITS

You don't need to know how everything in Top_Nexys.vhd works except for one variable, CLK_DIV_BITS, which is essentially used to slow down the processor's clock.

For example, the base clock frequency on our FPGA board is 100 MHz, so:

  1. If CLK_DIV_BITS = 5, then the processor's clock frequency will be $100 \div 2^5 = 3.125~\text{MHz}$.

  2. If CLK_DIV_BITS = 2, then the processor's clock frequency will be $100 \div 2^2 = 25~\text{MHz}$.

In summary, the relationship between CLK_DIV_BITS and clock frequency is

$$\text{clock frequency} = 100~\text{MHz} \div 2^{\text{CLK\_DIV\_BITS}}$$
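As a quick sanity check of the formula above, this short Python sketch lists the processor clock for a few CLK_DIV_BITS values. Python is used only for illustration; the 100 MHz base clock comes from the text, and `clock_freq_mhz` is a hypothetical helper name.

```python
BASE_CLOCK_MHZ = 100.0  # on-board clock of the Nexys 4, as stated above


def clock_freq_mhz(clk_div_bits: int) -> float:
    """Processor clock frequency (MHz) for a given CLK_DIV_BITS value."""
    if clk_div_bits < 0:
        raise ValueError("CLK_DIV_BITS must be non-negative")
    return BASE_CLOCK_MHZ / (2 ** clk_div_bits)


if __name__ == "__main__":
    for bits in range(6):
        print(f"CLK_DIV_BITS = {bits}: {clock_freq_mhz(bits):g} MHz")
```

Because the divider only ever halves the clock, this formula can only produce 100, 50, 25, 12.5, ... MHz, so a frequency such as 20 MHz is not reachable through CLK_DIV_BITS alone.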
If you could choose to run your processor at 20, 50, or 100 MHz, which would you choose?

As discussed in Lec 02, the maximum clock frequency is determined by the critical path, i.e., the combinational path between two registers that has the longest delay.

So, to answer this question, we need to clearly demonstrate to the TA that

  1. Since we are designing a single-cycle processor, every instruction must complete in one clock cycle, so the clock period must cover the slowest instruction.

  2. Among the instructions we support, the load instruction (lw) has the longest execution time: it passes through instruction memory, the register file read, the ALU, data memory, and the result multiplexer before being written back to the register file.

  3. Therefore, the critical path lies along the load instruction's datapath.

  4. If this critical path meets the timing requirement for 20 MHz, 50 MHz, or 100 MHz, the processor can run at that frequency by adjusting CLK_DIV_BITS.

Thus, when choosing between 20 MHz, 50 MHz, and 100 MHz, we should pick the highest frequency whose clock period (50 ns, 20 ns, and 10 ns, respectively) is still no shorter than the critical path delay, so that no timing constraint is violated.
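The decision rule above can be sketched in Python. The critical path delays in the loop are made-up numbers for illustration; in practice you would take the actual figure (or the worst negative slack) from Vivado's timing report. `period_ns` and `choose_frequency` are hypothetical helpers, not part of the lab files.

```python
from typing import Optional, Sequence


def period_ns(freq_mhz: float) -> float:
    """Clock period in ns for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def choose_frequency(critical_path_ns: float,
                     candidates_mhz: Sequence[float]) -> Optional[float]:
    """Fastest candidate whose clock period is at least the critical path delay."""
    feasible = [f for f in candidates_mhz if period_ns(f) >= critical_path_ns]
    return max(feasible) if feasible else None


if __name__ == "__main__":
    candidates = [20, 50, 100]
    for delay in (8.0, 15.0, 35.0, 60.0):  # hypothetical critical path delays (ns)
        choice = choose_frequency(delay, candidates)
        result = f"{choice} MHz" if choice else "none of the candidates"
        print(f"critical path {delay} ns -> {result}")
```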

Why does the Wrapper exist?

Wrapper.v is long, but it’s basically a simulation harness that sits between our RISC-V core (RV) and the outside world (testbench or FPGA board). It provides memories and memory-mapped peripherals so that the processor can interact with LEDs, DIP switches, buttons, UART, OLED, etc. We can think of it as the “motherboard” that our RV CPU plugs into.

In short, Wrapper.v is essentially the Verilog implementation of our Lec 03 microarchitecture with an MMIO part added. (See the spoiler at the end.)

Purpose of Wrapper

  • Provides IROM (instruction memory) and DMEM (data memory) to the processor.

  • Provides MMIO (memory-mapped I/O) registers for peripherals like LEDs, UART, DIP switches, OLED, accelerometer, etc.

  • Translates “clean, parallel” signals into abstract peripherals that are easy to monitor in simulation.

  • Used only for simulation (it is not synthesized directly as the top module). On the FPGA, the higher-level Top_Nexys.vhd connects the Wrapper to the actual hardware pins.

So, it makes our RISC-V behave like it’s running on a small SoC with peripherals.

I/O Ports

| Signal | Direction | Width (bits) | Description |
|---|---|---|---|
| DIP | In | 16 | DIP switch inputs (not debounced). |
| PB | In | 3 | Push buttons (BTNL, BTNC, BTNR; not debounced). |
| LED_OUT | Out (reg) | 8 | LEDs [7:0] showing processor results. |
| LED_PC | Out | 7 | Shows PC[8:2] on LEDs [15:9]. |
| SEVENSEGHEX | Out (reg) | 32 | Data for the 8-digit 7-seg display. |
| UART_TX | Out (reg) | 8 | Byte sent to PC/testbench via UART. |
| UART_TX_ready | In | 1 | UART ready flag (OK to write a new TX byte). |
| UART_TX_valid | Out (reg) | 1 | Indicates the Wrapper wrote a new UART TX byte. |
| UART_RX | In | 8 | Byte received from PC/testbench via UART. |
| UART_RX_valid | In | 1 | Indicates new RX data is available. |
| UART_RX_ack | Out (reg) | 1 | Acknowledges that the RX byte was read. |
| OLED_Write | Out (reg) | 1 | Pixel update signal for the OLED. |
| OLED_Col | Out (reg) | 7 | OLED column index. |
| OLED_Row | Out (reg) | 6 | OLED row index. |
| OLED_Data | Out (reg) | 24 | OLED pixel data <R,G,B> (8/8/8 aligned). |
| ACCEL_Data | In | 32 | Packed <Temp, X, Y, Z> from the accelerometer. |
| ACCEL_DReady | In | 1 | Accelerometer data ready flag. |
| RESET | In | 1 | Active-high reset. |
| CLK | In | 1 | Clock input (divided clock shown on LED[8]). |
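Two of the table rows describe bit fields that are easy to get wrong, so here is a small Python sketch of them. `led_pc` follows the table exactly (PC[8:2]); since instructions are word-aligned, PC[1:0] is always 00, which is why those two bits are not shown. The layout in `unpack_accel` (four equal 8-bit fields, Temp in the most significant byte and Z in the least) is an assumption based only on the order <Temp, X, Y, Z> in the table; check Wrapper.v before relying on it.

```python
def led_pc(pc: int) -> int:
    """LED_PC value: PC[8:2], i.e. the low 7 bits of the word index."""
    return (pc >> 2) & 0x7F


def unpack_accel(accel_data: int) -> dict:
    """Split ACCEL_Data into four bytes.

    ASSUMPTION: Temp is in bits [31:24], X in [23:16], Y in [15:8], Z in [7:0].
    """
    return {
        "temp": (accel_data >> 24) & 0xFF,
        "x": (accel_data >> 16) & 0xFF,
        "y": (accel_data >> 8) & 0xFF,
        "z": accel_data & 0xFF,
    }


if __name__ == "__main__":
    print(bin(led_pc(0x00400024)))
    print(unpack_accel(0x11223344))
```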

Memories

In Wrapper.v (our motherboard), we have not only our CPU (RV.v) but also three memory spaces:

  • IROM: instruction memory (the program comes from a .mem file dumped by RARS).

  • DMEM: data memory (for global variables, stack, etc.).

  • MMIO: memory-mapped I/O (LEDs, switches, UART, OLED, etc.).

Address Decoding

In Wrapper.v, we need to decide whether each access (a read or a write) targets DMEM or an MMIO peripheral. To do so, we introduce a set of dec_* (decode) signals, each indicating whether the current address falls in DMEM or in a particular MMIO peripheral's range.
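A minimal Python sketch of the decoding idea above. All addresses and signal names here are hypothetical placeholders, not the lab's actual memory map (take those from Wrapper.v), and only three MMIO peripherals are modeled. The point is that each dec_* signal is a simple range or equality check on the address, and at most one of them is 1 for any address.

```python
# HYPOTHETICAL addresses for illustration only; the real map is defined in Wrapper.v.
DMEM_BASE = 0x10010000
DMEM_SIZE = 0x1000                    # hypothetical 4 KiB DMEM
MMIO_ADDRS = {                        # hypothetical dec_MMIO_* names and addresses
    "dec_MMIO_LED": 0xFFFF0060,
    "dec_MMIO_DIP": 0xFFFF0064,
    "dec_MMIO_SEVENSEG": 0xFFFF0080,
}


def decode(addr: int) -> dict:
    """Return every dec_* signal (0 or 1) for a given byte address."""
    dec = {"dec_DMEM": int(DMEM_BASE <= addr < DMEM_BASE + DMEM_SIZE)}
    for name, mmio_addr in MMIO_ADDRS.items():
        # In this sketch, each MMIO register occupies a single word address.
        dec[name] = int(addr == mmio_addr)
    return dec


if __name__ == "__main__":
    print(decode(0x10010004))
    print(decode(0xFFFF0064))
```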

Connection to RV

Unlike the Lec 03 microarchitecture, here we have added the MMIO part.

1

Instruction fetch path

  1. CPU outputs PC. This is done in RV.v, where PC is an output of the CPU.

  2. Wrapper uses PC to fetch Instr from the instruction memory (IROM).

  3. Wrapper sends Instr back into the CPU. This is done in RV.v, where Instr is an input to the CPU.
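The fetch path above can be modeled in a few lines of Python. The IROM base address is an assumption (0x00400000 is RARS's default .text start, but check the memory configuration you actually use), and IROM is modeled as a plain list of 32-bit words; the two instruction words are the encodings of addi x1, x0, 5 and add x2, x1, x1.

```python
IROM_BASE = 0x00400000  # ASSUMED IROM base address (RARS's default .text start)


def fetch(irom: list, pc: int) -> int:
    """Return the 32-bit word that the Wrapper feeds back to the CPU as Instr."""
    if pc % 4 != 0:
        raise ValueError("PC must be word-aligned")
    index = (pc - IROM_BASE) >> 2  # byte address -> word index
    if not 0 <= index < len(irom):
        raise IndexError("PC is outside IROM")
    return irom[index]


irom = [
    0x00500093,  # addi x1, x0, 5
    0x00108133,  # add  x2, x1, x1
]

if __name__ == "__main__":
    print(hex(fetch(irom, 0x00400004)))
```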

2

Data access path

The data access path has two parts: loading data from either DMEM or MMIO, and storing data to either DMEM or MMIO.


Load: To the CPU, loading from DMEM and loading from an MMIO peripheral look identical (in our implementation, both are read together first); only the Wrapper decides which source is used.

  • CPU computes address in ALU -> ALUResult.

  • CPU asserts MemRead = 1. (MemRead corresponds to MemtoReg in our Lec 03 microarchitecture.)

  • Wrapper checks ALUResult and enables the corresponding dec_* signal to indicate which memory we want to access (read from).

    • If address ∈ DMEM range (dec_DMEM == 1) -> read from DMEM and place the read data on ReadData_DMEM.

    • If address ∈ MMIO range (dec_MMIO_* == 1) -> read from the MMIO peripheral and place the read data on ReadData_MMIO.

    • After that, we delay (register) the decoded signals so that they line up with the read data.

  • Wrapper selects between ReadData_DMEM and ReadData_MMIO using the decoded signals and puts the chosen value on ReadData_in. (This is where the I/O multiplexing comes into play.)

  • CPU reads ReadData_in and writes it into the register file.
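The load path above can be sketched in Python as follows. The address ranges are hypothetical placeholders, DMEM and MMIO are modeled as dictionaries keyed by byte address, and the delay of the decoded signals is not modeled (this is a purely combinational view). It shows the key idea: both sources are read, and only the decode result decides which one reaches ReadData_in.

```python
# HYPOTHETICAL ranges for illustration only; the real ones are in Wrapper.v.
DMEM_BASE, DMEM_SIZE = 0x10010000, 0x1000
MMIO_BASE = 0xFFFF0000


def load(addr: int, dmem: dict, mmio: dict) -> int:
    """Model one lw: read both sources, then let the decode result pick one."""
    read_data_dmem = dmem.get(addr, 0)  # ReadData_DMEM
    read_data_mmio = mmio.get(addr, 0)  # ReadData_MMIO
    dec_dmem = DMEM_BASE <= addr < DMEM_BASE + DMEM_SIZE
    dec_mmio = addr >= MMIO_BASE
    # I/O multiplexer that drives ReadData_in
    if dec_dmem:
        return read_data_dmem
    if dec_mmio:
        return read_data_mmio
    return 0  # unmapped address: this sketch simply returns 0


if __name__ == "__main__":
    dmem = {0x10010000: 42}
    mmio = {0xFFFF0064: 0xA5}  # e.g. a hypothetical DIP switch register
    print(load(0x10010000, dmem, mmio), hex(load(0xFFFF0064, dmem, mmio)))
```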

Store: A store either updates memory or triggers a side effect on a peripheral.

  • CPU computes address in ALU -> ALUResult.

  • CPU outputs the value we want to store in WriteData_out.

  • CPU asserts MemWrite_out = 1. (MemWrite_out is {4{MemWrite}}, i.e., the MemWrite signal from our Lec 03 microarchitecture replicated four times, one bit per byte of the word.)

  • Wrapper checks ALUResult:

    • If address ∈ DMEM range -> write WriteData_out into DMEM.

    • If address ∈ MMIO range -> forward WriteData_out into peripheral (e.g. set LEDs, update 7-seg).

  • CPU does not expect anything on ReadData_in.
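Similarly, here is a minimal Python sketch of the store path. Again the addresses are hypothetical placeholders, only one peripheral (the LEDs) is modeled, and writes to unmapped addresses are simply ignored in this sketch. As in the text, nothing is returned to the CPU.

```python
# HYPOTHETICAL addresses for illustration only; the real ones are in Wrapper.v.
DMEM_BASE, DMEM_SIZE = 0x10010000, 0x1000
LED_ADDR = 0xFFFF0060


def store(addr: int, write_data: int, dmem: dict, peripherals: dict) -> None:
    """Model one sw: update DMEM or trigger a peripheral side effect."""
    if DMEM_BASE <= addr < DMEM_BASE + DMEM_SIZE:
        dmem[addr] = write_data & 0xFFFFFFFF
    elif addr == LED_ADDR:
        peripherals["LED_OUT"] = write_data & 0xFF  # LED_OUT is 8 bits wide
    # Any other address is ignored in this sketch; nothing goes back to the CPU.


if __name__ == "__main__":
    dmem, peripherals = {}, {}
    store(0x10010008, 7, dmem, peripherals)
    store(LED_ADDR, 0x1FF, dmem, peripherals)
    print(dmem, peripherals)
```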

Why do you simulate Wrapper instead of Top_Nexys?

In this lab, simulating the Wrapper is sufficient because it contains representations of all the FPGA ports. The Top_Nexys module only connects these ports to the actual FPGA hardware. Since simulation only involves providing inputs to these ports and observing the outputs, the physical connections to the FPGA are unnecessary. Therefore, simulating the Wrapper alone captures the full functionality needed for verification.

RISC-V processor

Our RISC-V processor, or RV processor for short, is implemented in RV.v, which has the following hierarchy:

Why is there no mv instruction in RISC-V?

In ARM assembly (if you have taken CG2028), MOV can do two things:

  1. Register -> Register: It copies the value from the source register rs into the destination register rd.

  2. Immediate -> Register: It moves an immediate value into the destination register rd.

For these two scenarios, RISC-V uses its base instructions, or pseudo-instructions that expand into combinations of base instructions, to achieve the same effect:

  1. Register -> Register: addi rd, rs, 0. (Assemblers do accept mv rd, rs, but it is only a pseudo-instruction that expands to exactly this addi.)

  2. Immediate -> Register: li rd, imm. (li is a pseudo-instruction; it expands to a single addi when imm fits in 12 signed bits, and to lui + addi otherwise.)

So, RISC-V doesn’t need a special instruction for all the variants that ARM has because the same effects can be achieved with a combination of simple instructions.
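The lui + addi expansion of li has one subtle point: addi sign-extends its 12-bit immediate, so when bit 11 of the constant is 1, the upper 20 bits must be incremented by one to compensate. The Python sketch below (hypothetical helper names `split_li` and `rebuild`) always returns both parts; an assembler would emit only the addi when the constant already fits in 12 signed bits.

```python
def split_li(imm: int) -> tuple:
    """Split a 32-bit constant into (lui immediate, addi immediate) for li rd, imm."""
    imm &= 0xFFFFFFFF
    lo = imm & 0xFFF
    if lo >= 0x800:   # addi would sign-extend this into a negative number,
        lo -= 0x1000  # so use that negative value explicitly
    hi = ((imm - lo) >> 12) & 0xFFFFF  # upper 20 bits, compensating for a negative lo
    return hi, lo


def rebuild(hi: int, lo: int) -> int:
    """Value left in rd by lui rd, hi followed by addi rd, rd, lo (32-bit)."""
    return ((hi << 12) + lo) & 0xFFFFFFFF


if __name__ == "__main__":
    for value in (0x12345678, 0x12345FFF, 0xFFFFFFFF):
        hi, lo = split_li(value)
        print(f"li 0x{value:08X} -> lui 0x{hi:05X}, addi {lo}")
```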

Tips

Demo

In case you are unsure about what to demo, here are some tips from the teaching team:

1

General Caveats

  1. Make your output dependent on some user inputs. In other words, don't hardcode the output (e.g., by using a counter).

  2. Design your assembly program such that if any one instruction doesn't work, the final output will be different. In other words, to get the correct output for a given input, every instruction in your assembly program must work correctly.

    1. Don't just look at the final output to claim that every instruction of your program works. For example, if you shift right by 5 bits and then shift left by 5 bits, your result may still be right even if your shift instructions don't work.
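To see why a shift-right-then-shift-left test can hide a broken shifter, here is a small Python model of sll, srl, and sra on 32-bit values. `broken_shift` is a hypothetical stand-in for a faulty shifter that passes its input through unchanged. The same model also shows that srl and sra only produce different results when the input's most significant bit is 1, so a test value for sra should be negative.

```python
MASK32 = 0xFFFFFFFF


def sll(x: int, n: int) -> int:
    """Logical shift left on a 32-bit value."""
    return (x << n) & MASK32


def srl(x: int, n: int) -> int:
    """Logical shift right: zeros are shifted in from the left."""
    return (x & MASK32) >> n


def sra(x: int, n: int) -> int:
    """Arithmetic shift right: copies of bit 31 are shifted in from the left."""
    x &= MASK32
    if x & 0x80000000:  # negative as a signed 32-bit value
        x -= 1 << 32    # convert to a negative Python int
    return (x >> n) & MASK32  # Python's >> is arithmetic on ints


def broken_shift(x: int, n: int) -> int:
    """Hypothetical faulty shifter that ignores the shift amount."""
    return x & MASK32


if __name__ == "__main__":
    # Low 5 bits are zero: the correct and the broken shifter agree.
    good = 0x12345660
    print(hex(sll(srl(good, 5), 5)), hex(broken_shift(broken_shift(good, 5), 5)))
    # Low 5 bits are non-zero: now the bug is exposed.
    bad = 0x1234567F
    print(hex(sll(srl(bad, 5), 5)), hex(broken_shift(broken_shift(bad, 5), 5)))
    # srl and sra differ only when bit 31 is set.
    print(hex(srl(0x80000000, 4)), hex(sra(0x80000000, 4)))
```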

2

Testbench

Your program should run based on your user input, so the testbench simply simulates that user input and checks whether the output of your program is correct.

3

More about OLED

In the lab, the OLED is provided, but it is just another output device. The fancy demos you may see in the lab by Dr. Rajesh are simply the result of well-crafted assembly code.

4

Review the RISC-V Microarchitecture

In particular, know the datapath for every type of instruction we have learned. It is best to understand it first and then memorise it! Below is a file that summarises the datapath for each instruction:

5

General Steps

  1. Assembly code correctness: Step through the assembly program instruction by instruction (e.g., in RARS) to make sure the algorithm works correctly at the "high level".

  2. HDL (Vivado) simulation: Step through each instruction to make sure it is executed correctly in Vivado simulation (behavioral + post-synthesis).

  3. Hardware demo: This step cannot be stepped through, so we just provide the input and observe the output.
