Lab 01 - Get prepared
Introduction
It is imperative to put in effort and try your best for this assignment. It may take an amount of effort that is quite disproportionately large, compared to the impact on your grade. This is normal. This assignment is designed to prepare you for the later ones, so that you can spend time debugging your design, instead of debugging your knowledge.
— From CG3207 Teaching Team
Task 1: Assembly Simulation
Task Instruction
The goal of this task is to understand the RISC-V assembly language, so I will put some effort into explaining the sample program, which is public on GitHub.
Overall Structure
This sample RISC-V assembly program contains 3 parts,
.eqv (Constants)
The code here is not data at all — it never goes into instruction memory (IROM) or data memory (DMEM). Instead, it's purely an assembler directive: a symbolic substitution (like #define in C).
.eqv NAME VALUE
When the assembler sees NAME, it replaces it with VALUE.
Instruction Memory (IROM / Code Memory)
This is the program memory — where instructions live. It stores the instructions that the CPU fetches and executes. It has the following coding convention,
.text ## IROM segment
main:
li s0, MMIO_BASE
...
halt:
j halt
Code Explanation
All code goes inside .text.
Labels like main:, loop:, wait: are symbolic addresses.
Final halt: j halt ensures execution has a “dead end”, since without an OS, programs don’t really return.
With IROM depth 9 → 2^9 = 512 bytes → you can fit 128 instructions (each 4 bytes).
Pseudoinstructions (like li) may expand to multiple real instructions, so you must stay within this limit.
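The capacity arithmetic above can be checked with a short Python sketch. The function name irom_capacity and its parameters are made up for illustration; the only values taken from the notes are the depth of 9 bits and the 4-byte instruction size.

```python
def irom_capacity(depth_bits, bytes_per_instr=4):
    """Return (size in bytes, number of instructions) for a byte-addressed ROM."""
    size_bytes = 1 << depth_bits                  # 2^depth_bits bytes
    return size_bytes, size_bytes // bytes_per_instr


size_bytes, n_instr = irom_capacity(9)
print(size_bytes, n_instr)  # 512 128
```

Because a pseudoinstruction such as li can take two of those 128 slots, it is the count of real instructions after expansion that must stay under the limit, not the count of source lines.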
Data Memory (DMEM)
Data memory stores constants, variables, strings, arrays, stack, heap, etc. Its coding convention is as follows,
.data ## DMEM segment
DMEM:
delay_val: .word 4 # a constant at DMEM+0x00
string1: .asciz "\r\nWelcome to CG3207..\r\n"
var1: .word 1 # static variable, initial value = 1
.align 9
STACK_INIT:
Line-by-line Explanation
Line 15-31
These lines define some constants.
Line 40-45 (main)
This section initializes base addresses (MMIO_BASE, LED address, DIP switch address).
Line 41: li is implemented as lui + addi because MMIO_BASE doesn't fit in a 12-bit immediate. Now s0 = 0xFFFF0000. This will be the starting point for accessing all peripherals.
Line 43: Computes the LED peripheral’s memory-mapped address, s1 = s0 + 0x60 -> s1 = 0xFFFF0060. Later, writing to (s1) will control the LEDs.
Line 44: Since DIP_OFF (0x64) fits in 12 bits, this is just one addi instruction, not lui + addi. It loads the immediate into a dp reg (data-processing register), s2, which holds the DIP switch offset from the MMIO base.
Line 45: Adds the DIP switch offset to the MMIO base, so s2 = 0xFFFF0064.
The comment here means this is just a way to demonstrate the instruction add (register + register) and the usage of a dp reg (data-processing register).
li in RISC-V means "load immediate" and its implementation depends on the size of the immediate value.
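To make the size dependence concrete, here is a minimal Python sketch of how a 32-bit immediate can be split into a lui part and an addi part. The function expand_li is hypothetical and only illustrates the general technique; the exact sequence RARS emits may differ (for instance, it may still emit an addi with a zero immediate).

```python
def expand_li(imm):
    """Sketch of how `li rd, imm` can be split for a 32-bit immediate.

    Returns ("addi", value) when imm fits in a signed 12-bit field,
    otherwise ("lui+addi", hi20, lo12) with lo12 as a signed value.
    """
    imm &= 0xFFFFFFFF
    signed = imm - (1 << 32) if imm & 0x80000000 else imm
    if -2048 <= signed <= 2047:
        return ("addi", signed)
    lo = imm & 0xFFF
    if lo >= 0x800:                        # addi sign-extends, so lo acts as negative
        lo -= 0x1000
    hi = ((imm - lo) >> 12) & 0xFFFFF      # round the upper part up to compensate
    return ("lui+addi", hi, lo)


print(expand_li(0x64))         # ('addi', 100)
print(expand_li(0xFFFF0000))   # ('lui+addi', 1048560, 0)
```

Here 1048560 is 0xFFFF0, so lui places 0xFFFF0000 in the register and the addi part adds 0. The rounding step only matters when bit 11 of the immediate is set.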
Line 46-54 (loop + wait)
This code snippet mainly does the following
Each time through loop, DIP switches are read → LEDs updated.
Then the program spends time in wait, decrementing s3 until it reaches zero.
That’s a software delay loop.
Once the delay is over, execution returns to loop, reloads delay_val, and the process repeats.
So, the overall effect is: LEDs continuously reflect DIP switches, but with a controlled refresh rate (slowed down by the delay loop). And the detailed explanation is as follows:
Line 47: Loads the constant delay_val (here = 4) from data memory into s3.
Line 48: Reads the DIP switch values from the MMIO register at s2 = 0xFFFF0064.
Line 49: Writes the same value into the LED MMIO register (s1 = 0xFFFF0060). So the LEDs mirror whatever is on the DIP switches.
Line 51: Decrement the delay counter.
Line 52: If counter hits zero, go back to loop: to reload delay_val and refresh LEDs.
Line 53: If counter ≠ 0, jump back to wait: (continue counting down).
Here jal zero, wait is used as a plain jump. Since jal normally stores the return address into a register, writing into zero discards it. It is equivalent to j wait.
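The control flow of loop and wait can be mimicked in a few lines of Python. This is only a behavioural model under the assumptions stated in the comments: the function mirror_with_delay and the list dip_samples are invented for illustration, and the model ignores real instruction timing.

```python
def mirror_with_delay(dip_samples, delay_val=4):
    """Model the loop/wait structure: one LED update per pass, then a countdown.

    dip_samples: DIP switch values seen on successive passes (hypothetical input).
    Returns (LED values written, number of decrements executed in `wait`).
    Assumes delay_val >= 1; on real hardware a value of 0 would wrap around instead.
    """
    if delay_val < 1:
        raise ValueError("delay_val must be at least 1 in this model")
    leds = []
    decrements = 0
    for dip in dip_samples:          # loop:
        s3 = delay_val               #   lw s3, delay_val
        leds.append(dip)             #   read DIP, write the same value to the LEDs
        while True:                  # wait:
            s3 -= 1                  #   decrement the counter
            decrements += 1
            if s3 == 0:              #   counter hit zero -> back to loop
                break
            # otherwise: jal zero, wait
    return leds, decrements


print(mirror_with_delay([0x1, 0xF0]))  # ([1, 240], 8)
```

With delay_val = 4, each pass through loop costs four decrements in wait, which is the refresh-rate throttling described above.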
Line 63-79 (dmem)
This is the data memory,
Line 69: Defines a constant delay_val stored at the beginning of data memory.
Line 70-71: Stores a null-terminated string in memory. Each character = 1 byte. The assembler appends a null (0x00) at the end.
Line 72: A statically allocated variable, initialized with 1. As the string above is 24 bytes long, it occupies DMEM+0x04 to DMEM+0x1B (its last word starts at DMEM+0x18). Thus, var1 happens to fit at DMEM+0x1C.
Line 73: If the string were 1 byte longer, var1 would be stored at DMEM+0x20 for word alignment.
Line 75: .align 9 means “advance the current memory location to the next multiple of 2⁹ = 512 bytes.”
Line 76: Follows Line 75, so the stack starts at address DMEM+0x200.
Line 77: Mainly describes the stack in RISC-V:
Stack grows downwards (toward lower addresses).
sp (stack pointer) should be initialized to this address.
Each push → decrement sp, each pop → increment sp.
In RISC-V, words are stored in little-endian order. So, below is how the string from Line 70-71 is stored,

Each word is stored in little-endian byte order in RISC-V memory (we encountered little-endianness in CG2111A): the least significant byte sits at the lowest address, while the bits within each byte keep their normal order.
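The DMEM layout and the byte order can be reproduced with Python's struct module. This is a sketch for checking the numbers, not output from RARS; the helper align_up and the offset variable names are illustrative.

```python
import struct

# .asciz appends the terminating null byte
string1 = b"\r\nWelcome to CG3207..\r\n" + b"\x00"


def align_up(offset, boundary=4):
    """Round offset up to the next multiple of boundary."""
    return (offset + boundary - 1) // boundary * boundary


delay_val_off = 0x00
string1_off = delay_val_off + 4                      # after the 4-byte delay_val word
var1_off = align_up(string1_off + len(string1))      # .word data is word-aligned

# View the string bytes as little-endian 32-bit words, the way a
# word-oriented memory dump presents them.
words = struct.unpack("<%dI" % (len(string1) // 4), string1)

print(len(string1), hex(var1_off))     # 24 0x1c
print([hex(w) for w in words[:2]])     # ['0x65570a0d', '0x6d6f636c']
```

The first word reads 0x65570a0d because the bytes '\r' (0x0d), '\n' (0x0a), 'W' (0x57), 'e' (0x65) sit at increasing addresses, and the lowest address holds the least significant byte.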
What's actually going on in Line 47? (auipc)
In Line 47, the instruction lw s3, delay_val is actually implemented by two RISC-V instructions.
auipc x19, 0x0000fc10
lw x19, 0xffffffec(x19)
This is shown as follows (x19 is the s3 register):

The reason for this two-step sequence is that lw is an I-type instruction, and I-type immediates are limited to a signed 12-bit offset relative to a base register. This means lw can only directly access data within -2048 to +2047 bytes of the base address. When the data we want to load is located far away, we need an additional instruction to construct a base address that is “close enough.”
This is where auipc (Add Upper Immediate to PC) comes in. auipc takes the current PC value, adds a 20-bit immediate shifted left by 12 bits, and stores the result into the destination register. In other words, it lets us build a base address relative to the PC, suitable for accessing distant memory.
The symbol delay_val is at address 0x10010000. The instruction lw s3, delay_val itself is at 0x00400014.
These two addresses differ by much more than 12 bits, so a plain lw cannot reach delay_val directly.
To bridge the gap, the assembler splits the target address into a high part and a low part.
The upper 20 bits difference is: 0x10010 - 0x00400 = 0xFC10. This becomes the immediate for auipc.
After executing auipc x19, 0x0000fc10, register x19 holds: x19 = PC + (0xFC10 << 12) = 0x10010014, which is a value “close” to the address of delay_val.
Now only a small offset is left to cover.
The lower 12 bits difference is: 0x000 - 0x014 = 0xFFFFFFEC (that is, -20). This fits within the signed 12-bit immediate range of lw.
Finally, the instruction lw x19, 0xffffffec(x19) uses x19 as the base plus the small offset to reach the exact address of delay_val and load its value into s3.
The key idea is that auipc provides a way to construct PC-relative addresses for far-away data or code. By combining auipc (for the high 20 bits of the address) with an I-type instruction like lw (for the low 12 bits), RISC-V can access any 32-bit address in memory, despite the immediate size limitations of a single instruction.
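The high/low split can be verified numerically. The sketch below uses the two addresses quoted in these notes; the function name pcrel_split is made up, and the rounding step reflects the usual way of compensating for the sign extension of the low 12 bits, not a transcript of RARS internals.

```python
def pcrel_split(pc, target):
    """Split (target - pc) into an auipc immediate (hi20) and a signed lo12 offset."""
    offset = (target - pc) & 0xFFFFFFFF
    lo = offset & 0xFFF
    if lo >= 0x800:                         # lw sign-extends its 12-bit immediate
        lo -= 0x1000
    hi = ((offset - lo) >> 12) & 0xFFFFF    # round the upper part to compensate
    return hi, lo


pc, target = 0x00400014, 0x10010000
hi, lo = pcrel_split(pc, target)
base = (pc + (hi << 12)) & 0xFFFFFFFF       # value in x19 after auipc
print(hex(hi), lo, hex(base), hex((base + lo) & 0xFFFFFFFF))
# 0xfc10 -20 0x10010014 0x10010000
```

The offset -20 is the same bit pattern as 0xFFFFFFEC when written as a 32-bit value, which is how it appears in the disassembly.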
Using la to load an address that is far away from the current PC works in exactly the same way (it typically expands to auipc followed by addi). Instead of loading the content, la loads the address of that content.
Demonstration
In this task, we mainly need to demonstrate what the following image shows:

Run the code step by step till Line 48
Change the input at the DIP switches (0xffff0064), then run Line 48 and 49; the output at the LEDs (0xffff0060) should be mirrored.
This loop is infinite, so showing this mirror once suffices.
Wait for the problems proposed by the TA.
Questions Preparation
What is the 32-bit representation of a certain instruction, like the opcode, funct3, etc.?
Bring the RISC-V card along with you.
What is the memory capacity of IROM?
As IROM can store 128 words, its memory capacity is 7 bits. (Although I think this wording is a bit off here, as the memory capacity should be 128 words bruh, and it is the word address of IROM that is 7 bits.)
Optional Task
Helloworld without subroutines
The RISC-V assembly code for HelloWorld is public on GitHub. The overall behavior is:
It waits for the user to press the A key followed by Enter (\r or \n) on the console.
It echoes every input character to the console, LEDs, and seven-segment display while waiting.
Once the correct input is received, it prints “Welcome to CG3207..” to the console using UART character by character.
The LED and seven-segment display here are just used as “hardware echo” that mirrors what you typed.
We meet UART again; feel free to go back to the NUS CG2111A Notes to review how UART works! Here, the UART serial communication is set up between our RISC-V processor and our PC's console (on RARS).
And I will do the explanation section by section,
Line 17-50 (Setup)
This is the setup work. Nothing special.
Line 52-78 (Read A and Enter)
This section is also pretty straightforward. But in Line 75-78, the trick used to implement "if A or B" needs our attention,
Line 79-96 (Print "Helloworld")
a0 stores the address of the word (4 bytes) to be printed. Within each word, one byte is printed at a time. After a word has been printed, a0 is incremented by 4 to print the next word. (As we've seen in the previous task, string1 is 24 bytes, i.e. 6 words long.)
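The word-by-word, byte-by-byte printing can be modelled in Python as follows. This is a sketch under two assumptions that are not confirmed by the notes: that the byte at the lowest address is printed first (taken from the least significant end of the little-endian word) and that printing stops at the null terminator. The function print_words is hypothetical.

```python
import struct


def print_words(words):
    """Collect characters from 32-bit little-endian words, lowest byte first,
    stopping at the first null byte."""
    out = []
    for word in words:                 # a0 advances by 4 after each word
        for _ in range(4):             # one byte at a time within the word
            byte = word & 0xFF
            if byte == 0:              # null terminator: done
                return bytes(out).decode("ascii")
            out.append(byte)
            word >>= 8                 # move the next byte into the low position
    return bytes(out).decode("ascii")


data = b"\r\nWelcome to CG3207..\r\n\x00"       # string1, 24 bytes = 6 words
words = struct.unpack("<6I", data)
print(repr(print_words(words)))  # '\r\nWelcome to CG3207..\r\n'
```

Shifting right by 8 after each byte is one common way to walk through a word; the actual program may extract the bytes differently.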
Task 2: Basic HDL Simulation
In an initial statement, whether in RTL code or in a testbench, the L.H.S. signal must be a reg.
RTL Design
Clock Enable
The Clock_Enable block has three states,
1Hz mode: Pull up the enable to HIGH for 1 clock cycle (10ns for Nexys 4) every 1 second.
4Hz mode: Pull up the enable to HIGH for 1 clock cycle every 0.25 second.
Pause mode: Pull down the enable to LOW until pause mode is exited.
In our simulation, the enable behaves like below

How long enable stays low is controlled by the corresponding threshold for each of the modes above. We can think of enable as a slower clock signal.
The image uses the 1Hz mode with the threshold changed to 8 for simulation. On real hardware it should be 100_000_000.
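The pulse pattern can be sketched with a small Python model (not the lab's Verilog). The function enable_trace is invented for illustration, and the exact cycle in which the pulse appears within each period is an assumption; only the period of threshold cycles is taken from the notes.

```python
def enable_trace(threshold, n_cycles):
    """Return the enable value for each clock cycle: HIGH once every `threshold` cycles."""
    counter = 0
    trace = []
    for _ in range(n_cycles):
        if counter >= threshold - 1:   # last count of the period
            trace.append(1)            # enable is HIGH for exactly one cycle
            counter = 0
        else:
            trace.append(0)
            counter += 1
    return trace


print(enable_trace(8, 16))
# [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
```

With a threshold of 100_000_000 and a 10ns clock period, the same pattern gives one pulse per second.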
Should threshold be implemented sequentially or combinationally?
Don't be confused by the comments in the Verilog code: threshold should be updated in a combinational block (it is faster). You can still implement it sequentially, but the latter is not recommended if you want to get a small improvement on your CPU.
Get Mem
This block is straightforward, as we only need to implement two things:
the combinational logic part for the data fed to the seven-segment display and leds, and the upper_lower signal fed to the leds
the sequential logic part for the counter, which is basically the addr.
Why it is a 9-bit counter/address here?
Let's analyze from the bottom up:
As we only have 16 LEDs on the Nexys and each instruction is 32 bits long, we need to show the upper and lower half-word of each instruction on the leds separately. Thus, we need 1 bit for this, and this must be the LSB of our counter/address. (addr[0])
As for our IROM and DMEM, each of them is 128 words long, thus we need another 7 bits to track the address. (addr[7:1])
As we display IROM first, then DMEM, we need another 1 bit to do this "switching". (addr[8])
Thus, in total, the counter/address should be 9 bits long in this module. This also means that to display all of the content from IROM and DMEM, we need 2^9 = 512 "rounds". (This is useful in our simulation)
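The bit-field breakdown can be expressed as a small decoder. This Python sketch only illustrates the field positions; the function name decode_addr is made up, and mapping addr[8] = 0 to IROM is an assumption based on IROM being displayed first with the counter starting at 0.

```python
def decode_addr(addr):
    """Split the 9-bit counter into (memory select, word index, half-word select)."""
    assert 0 <= addr < (1 << 9)
    half_sel = addr & 0x1            # addr[0]: which 16-bit half of the word
    word_idx = (addr >> 1) & 0x7F    # addr[7:1]: one of 128 words
    mem_sel = (addr >> 8) & 0x1      # addr[8]: assumed 0 = IROM, 1 = DMEM
    return mem_sel, word_idx, half_sel


print(decode_addr(0), decode_addr(255), decode_addr(256), decode_addr(511))
# (0, 0, 0) (0, 127, 1) (1, 0, 0) (1, 127, 1)
```

Counter values 0 to 255 walk through both halves of all 128 IROM words, and 256 to 511 do the same for DMEM, which gives the 512 rounds.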
Top
This module has nothing special, we just instantiate the two modules we have designed,
the Clock_Enable module
the Get_Mem module
And the one module that is provided,
the Seven_seg module
In the Top module, we also need to implement a multiplexer to choose the 16-bit data to be shown on the led.
Behavioral Simulation
Here, we basically need to simulate all the combinations of the inputs, which are
Both btnC and btnU are not pressed.
btnU is pressed but btnC is not pressed.
btnC is pressed but btnU is not pressed.
No auto-check version
From the explanation above, we know that we need 512 rounds to display all the data from IROM and DMEM. Under each of the modes above, the number of clock cycles per round is different, and it is determined by the corresponding threshold value in each mode, thus
Under 1Hz mode, we need 512×threshold_1Hz clock cycles in total; this will be the terminating i value in our for loop.
Similarly, under 4Hz mode, we need 512×threshold_4Hz clock cycles.
The total time is also not difficult to calculate. Let's say each clock cycle takes 10ns. So, under 1Hz mode, the total time taken will be 10×512×threshold_1Hz ns.
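As a quick sanity check of these numbers, the cycle count and simulated time can be computed in Python. The constant and function names are illustrative; the 10ns period, the 512 rounds, and the two threshold values (8 for simulation, 100_000_000 on hardware) come from the notes.

```python
CLOCK_PERIOD_NS = 10       # 100 MHz clock
ROUNDS = 2 ** 9            # 512 counter values to show IROM and DMEM


def sim_budget(threshold):
    """Return (clock cycles, time in ns) needed to step through all rounds."""
    cycles = ROUNDS * threshold
    return cycles, cycles * CLOCK_PERIOD_NS


print(sim_budget(8))             # (4096, 40960)
print(sim_budget(100_000_000))   # (51200000000, 512000000000)
```

With the real threshold, one full pass would take 512 seconds, which is why the thresholds are shrunk for simulation.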
Notes
The initial statement will not be ignored in the behavioral simulation, but will be ignored in the real run, or synthesis.
Remember to change the threshold values back to normal after simulation. (To observe the results quickly, we changed them to some really small numbers during the simulation.)
Auto-check version
Here, auto-check means we read the IROM and DMEM data again in our testbench, and then check at each clock cycle that the led output of our UUT is the same as the expected led. (Only led here because led is the only port visible to us.) See more from this issue.
Notice that what we want to achieve here is that we press btnU for several "seconds", then release it, then press btnC for several seconds, then release it. Then we check whether uut.led and expected_led match or not.
To implement this, it is recommended to write a Verilog task to do the checking for us. (This reduces duplicate code.)
Code Explanation
The argument cycles doesn't need to take the threshold into account because it uses the posedge of uut.enable.
Then in our main simulation, we can just call these two methods under each phase.
Code Explanation
The #10 at Line 16 and Line 23 is important as it makes sure that during the next phase the simulation actually sees the button change. The value may change depending on the clock period. Usually, we should pause for 1 clock period.
Demonstration
Load the bitstream file into your FPGA.
Press btnU and btnC to see the output of your FPGA.
Questions Asked
What is the difference between Clock_Enable and the normal 100MHz clock?
Clock_Enable is a slower clock implemented with the threshold approach.
How to verify the content shown on the seven-seg display is correct?
The only way is to compare it visually with the memory files (IROM and DMEM).
The fruits of our labour
The video is outdated and is for reference only, as the lower-half and upper-half display sequence is flipped. As per our design specification, the upper half should be displayed first, then the lower half.
Some Interesting Questions
Why is the actual behavior of the seven-seg display and leds different from the behavioral simulation?
An interesting phenomenon I found is that after the instruction memory, the last line will blink several times and then it still prints the instruction memory.
This is solved by Dr. Rajesh and TA Neil in this discussion. The main reason is that the synthesis tool will likely optimise the storage to a combinational circuit (4-input, 32-output) instead of a ROM, as the utilisation is very low. That means a repeating pattern modulo 16 composed of 13 valid and 3 garbage words.
Why does my FPGA have a delay after I press btnU, before it changes to the fast speed?
This happens because of the way the Clock_Enable logic uses the counter vs. threshold comparison.
If your Line 7 is counter == threshold - 1, then you will encounter the problem stated. But why?
In 1 Hz mode, the threshold is large, so the counter can hold a relatively big value.
When you switch to 4 Hz mode, the new threshold is much smaller.
If the current counter value is still larger than threshold_4Hz (the new threshold) but smaller than threshold_1Hz, the condition counter == threshold - 1 will be false for many cycles. Thus, the counter continues incrementing until it eventually wraps around to zero, only then generating the first enable pulse at the new faster rate.
This “wrap-around” is the delay you see when switching speeds.
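The wrap-around delay can be reproduced with a toy Python model. Everything numeric here is an assumption chosen to keep the example small: a 5-bit counter, a stale counter value of 10, and a new threshold of 4. The function cycles_to_first_pulse is hypothetical, and using >= instead of == is shown as one possible way to avoid the delay, not necessarily the intended fix.

```python
def cycles_to_first_pulse(counter, threshold, width, use_ge):
    """Count clock cycles until enable first pulses, starting from a stale counter.

    Assumes threshold - 1 fits in `width` bits, otherwise `==` never matches.
    """
    mask = (1 << width) - 1
    cycles = 0
    while True:
        cycles += 1
        if use_ge:
            hit = counter >= threshold - 1
        else:
            hit = counter == threshold - 1
        if hit:
            return cycles
        counter = (counter + 1) & mask   # wraps around at 2^width


# Counter was at 10 in the slow mode when the threshold dropped to 4.
print(cycles_to_first_pulse(10, 4, width=5, use_ge=False))  # 26
print(cycles_to_first_pulse(10, 4, width=5, use_ge=True))   # 1
```

With ==, the counter has to climb to 31, wrap to 0, and count up to 3 before the first pulse. On real hardware the counter is much wider, so the same effect shows up as a visible pause.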