Lec 03 - RISC-V ISA and Microarchitecture

An ISA is just the architecture (see Harris & Harris!), so this lecture covers both the architecture and the microarchitecture of RISC-V!

RISC-V ISA

As we have already seen in Harris & Harris's introduction to architecture, we first recap some important points regarding architecture and microarchitecture.

1

Architecture

This is the programmer's view of the computer, and it has the following features:

  • Defined by instructions & operand locations

  • Assembly language: human-readable format of instructions

  • Machine language: computer-readable format (1’s and 0’s)

  • Assembly language -> Machine language conversion is done by the assembler

    • one to one correspondence (except for pseudo-instructions)

2

Microarchitecture

This is about how to implement an architecture in hardware (we will see this later, in the second half of this lecture series).

RISC-V Features

In the lec, we have introduced some interesting features about RISC-V, and they are as follows,

  • As a RISC architecture, the RISC-V ISA is a load-store architecture, which means only load/store instructions can access memory

    • No mixing of memory access with data processing or branching

  • Interesting design choices to simplify hardware implementation

    • Especially the encoding of immediates (We will see later).

Below are some very interesting points starting from the instruction length vs. word length.

1

Instruction Length vs. Word Length

The word length of a processor is the width of its registers/ALU, or the size of most elements in its datapath. For example, the following are three types of RISC-V processors,

  • RV32 → word length = 32 bits (registers and ALU are 32-bit wide).

  • RV64 → word length = 64 bits.

  • RV128 (rare, theoretical) → word length = 128 bits.

The instruction length is the number of bits to encode an instruction and it is determined by the ISA encoding design and extensions included. For example,

  • Base ISA (RV32I, RV64I, RV128I) → instructions are always 32 bits long.

  • If you add the Compressed (C) extension, some instructions are 16 bits long.

  • There are also 48-bit and 64-bit instruction encodings in the RISC-V spec (for special extensions).
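
As a rough sketch of how a fetch unit can tell these lengths apart, the snippet below follows the length-encoding convention described in the RISC-V unprivileged specification, where the lowest bits of the first 16-bit parcel indicate the instruction length. The function name instruction_length_bits is hypothetical, and encodings longer than 64 bits are deliberately not handled.

```python
def instruction_length_bits(first_halfword):
    """Return the instruction length in bits, judged from its lowest 16 bits."""
    low = first_halfword & 0xFFFF
    if low & 0b11 != 0b11:
        return 16  # compressed (C extension) encoding
    if low & 0b11100 != 0b11100:
        return 32  # standard 32-bit encoding (all base ISA instructions)
    if low & 0b111111 == 0b011111:
        return 48
    if low & 0b1111111 == 0b0111111:
        return 64
    raise ValueError("longer or reserved encoding, not handled in this sketch")


if __name__ == "__main__":
    print(instruction_length_bits(0x002081B3))  # add x3, x1, x2
    print(instruction_length_bits(0x4501))      # c.li a0, 0
```

Only the low bits of the first parcel are inspected, which is why supporting several lengths mostly affects the fetch/decode frontend.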

2

What if word length and instruction length don't match?

  • Instruction length ≠ word length

    • A 32-bit processor does not mean all instructions must be 32 bits long. This is what we have seen from above.

    • Example: ARM Thumb and RISC-V compressed (C) extension use 16-bit instructions on a 32-bit processor.

    • The processor’s frontend (fetch/decode) handles variable instruction sizes, while the backend (execution units, registers, ALU) still works on 32-bit data.

  • Microarchitecture impact

    • Supporting multiple instruction lengths adds complexity to the frontend (fetch, decode).

    • The backend (datapath, ALU, register file) usually remains tied to the word length, so the impact on overall hardware design is noticeable but not drastic.

  • Practical advantage: saving instruction memory

    • Shorter instructions (16-bit vs. 32-bit) reduce code size.

    • This allows more instructions to fit into the same memory/cache, improving memory efficiency and potentially performance.

  • Word length and memory addressing

    • In a 32-bit processor, addresses are 32 bits wide.

    • This limits the maximum directly addressable memory to 2^32 bytes = 4 GB.

    • This is a backend (data memory) property, tied to word length, not instruction length.

    • Thus, even if instructions are compressed to 16 bits, the system is still limited to 4 GB RAM because addressing depends on word length.

3

More about modern processors

  • ISA Stability

    • The Instruction Set Architecture (ISA) (e.g., x86, ARM, RISC-V) usually remains stable across generations.

    • This ensures that code compiled decades ago can still run on modern processors of the same ISA family.

    • Example: Modern Intel CPUs can still run DOS-era x86 code.

  • Extensions to the ISA

    • Instead of replacing the ISA, new generations often extend it with additional instructions (e.g., Intel’s SSE, AVX, AVX-512; ARM’s NEON; RISC-V vector extensions).

    • These new instructions allow compilers and developers to take advantage of improved performance features, but are optional from the program’s perspective.

  • Backward and Forward Compatibility

    • Backward compatibility: Old code (compiled for older processors) runs fine on newer processors, since the original instructions are still supported.

    • Forward compatibility: New code (compiled with instructions from the newest extensions) will not run on older processors if those instructions are missing.

    • Workaround: Compilers often provide a compatibility mode (e.g., “target x86-64 baseline”) so the same program can run on older hardware.

  • Real-world example: Windows 11

    • Windows 11 requires support for certain instruction set extensions (e.g., CMPXCHG16B and LAHF/SAHF, and, in recent releases, SSE4.2 and POPCNT).

    • Older CPUs without these instructions cannot run Windows 11, even though they are technically “x86-64 processors.”

  • Microarchitecture vs. ISA

    • New processors primarily innovate by updating the microarchitecture (pipeline depth, branch prediction, cache design, out-of-order execution, etc.) while keeping the ISA stable.

    • This allows performance to improve without breaking software compatibility.

Registers

The register set has been introduced in Harris & Harris. But the following image adds the information about whether the register should be saved by the caller or callee.

And below are some useful notes

1

ABI name and ISA name

In RISC-V assembly code, we can use ABI names, like zero, ra, etc. But in compiled code, ISA names, like x0, x1, etc., are used.
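
The mapping between the two naming schemes is fixed by the standard calling convention, so it can be written down as a small lookup table. The sketch below builds it in Python; the helper name build_abi_names is hypothetical.

```python
def build_abi_names():
    """Map ISA register names (x0..x31) to their standard ABI names."""
    names = {"x0": "zero", "x1": "ra", "x2": "sp", "x3": "gp", "x4": "tp"}
    names.update({f"x{5 + i}": f"t{i}" for i in range(3)})        # x5-x7   -> t0-t2
    names.update({"x8": "s0", "x9": "s1"})                        # x8 is also called fp
    names.update({f"x{10 + i}": f"a{i}" for i in range(8)})       # x10-x17 -> a0-a7
    names.update({f"x{18 + i}": f"s{2 + i}" for i in range(10)})  # x18-x27 -> s2-s11
    names.update({f"x{28 + i}": f"t{3 + i}" for i in range(4)})   # x28-x31 -> t3-t6
    return names


ABI_NAMES = build_abi_names()

if __name__ == "__main__":
    for isa_name in ("x0", "x1", "x19"):
        print(isa_name, "->", ABI_NAMES[isa_name])
```

For instance, x19 maps to s3, which is the pairing that shows up in the Lab 1 auipc example later in these notes.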

2

PC in RISC-V

In RISC-V, the PC is not a register that an instruction can name explicitly as an operand, i.e., it is not one of the visible general-purpose registers. And,

  • In RISC-V, PC just stores the address of the current instruction. (Unlike ARM, where reading PC gives the address of the current instruction +4 or +8.)

  • Writing PC is done only by branch/jump instructions.

3

No instruction updates more than one visible register

This is a very important golden rule (The step title here). And, in RISC-V, the register updated is explicitly specified in the rd field. This ensures that the register file only needs one write port.

4

No instruction reads more than two registers.

Another golden rule here. This ensures that the register file needs only 2 read ports. Below is the example of RISC-V's register file,

So, in summary, below is the illustration of RISC-V register file,
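
The two golden rules can also be captured in a small behavioural model: one method that reads two registers at once, and one that writes at most one. This is only a sketch in Python; the class name RegisterFile and the write_enable parameter are illustrative assumptions, not the lecture's actual HDL.

```python
class RegisterFile:
    """Behavioural model: 32 registers, two read ports, one write port."""

    def __init__(self):
        self._regs = [0] * 32

    def read(self, rs1, rs2):
        # Two read ports: both source registers are read together.
        return self._regs[rs1], self._regs[rs2]

    def write(self, rd, value, write_enable=True):
        # One write port: at most one register changes per instruction.
        # x0 is hardwired to zero, so writes to it are ignored.
        if write_enable and rd != 0:
            self._regs[rd] = value & 0xFFFFFFFF


if __name__ == "__main__":
    rf = RegisterFile()
    rf.write(1, 7)
    rf.write(2, 35)
    a, b = rf.read(1, 2)
    rf.write(3, a + b)   # behaves like: add x3, x1, x2
    print(rf.read(3, 0))
```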

5

No flag registers

In RISC-V, flags are never stored for “future use.” Instead, comparisons and branches are self-contained. For example, beq x1, x2, target directly compares x1 and x2 and branches to target if they are equal.

  • If you do need the comparison result as data, use an instruction like slt to store it in a general-purpose register: slt x3, x1, x2 means x3 = 1 if x1 < x2, else x3 = 0.

  • Because branch instructions already use the ALU for comparison, a RISC-V implementation usually has a separate adder to compute branch target addresses (PC + offset).

Instruction Formats

RISC-V instructions come in 6 formats (R, I, S, B, U, and J), and they are well summarized in the following table,

This table is from James Zhu at UCB
1

Opcode field occupies the least significant part of the instruction

This gives us the benefit that,

  • The CPU can recognize the instruction type by just reading the first byte (since RISC-V is little-endian, the byte containing the opcode sits at the lowest address).

  • "3L mnemonic" for little-endian: In little-endian formatting, least significant byte goes to lowest memory address.
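
To make the byte-order point concrete, the sketch below lays out a 32-bit instruction word in little-endian order using Python's struct module and reads the opcode from the byte at the lowest address. The example word 0x002081B3 (add x3, x1, x2) and the function name are illustrative choices.

```python
import struct


def first_byte_and_opcode(instr):
    """Return (byte at lowest address, 7-bit opcode) for a 32-bit instruction."""
    raw = struct.pack("<I", instr)   # "<I" = little-endian unsigned 32-bit
    return raw[0], raw[0] & 0x7F     # opcode is the low 7 bits of that byte


if __name__ == "__main__":
    first, opcode = first_byte_and_opcode(0x002081B3)  # add x3, x1, x2
    print(hex(first), hex(opcode))
```

Here the byte at the lowest address is 0xB3 and its low 7 bits give opcode 0x33, the R-type arithmetic opcode.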

2

rd, rs1, rs2 always in the same place across formats

This makes the decoder hardware simpler (fewer multiplexers, as we will see later in the microarchitecture).
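
A minimal sketch of what this buys the decoder: the same shift-and-mask extraction works for every format that has the field. The function name decode_fields is hypothetical; the bit positions are those of the base 32-bit encodings.

```python
def decode_fields(instr):
    """Extract the fixed-position fields of a 32-bit RISC-V instruction."""
    return {
        "opcode": instr & 0x7F,           # bits [6:0]
        "rd":     (instr >> 7) & 0x1F,    # bits [11:7]
        "funct3": (instr >> 12) & 0x7,    # bits [14:12]
        "rs1":    (instr >> 15) & 0x1F,   # bits [19:15]
        "rs2":    (instr >> 20) & 0x1F,   # bits [24:20]
        "funct7": (instr >> 25) & 0x7F,   # bits [31:25]
    }


if __name__ == "__main__":
    r_type = decode_fields(0x002081B3)  # add x3, x1, x2  (R-type)
    s_type = decode_fields(0x0020A023)  # sw  x2, 0(x1)   (S-type)
    print(r_type["rs1"], r_type["rs2"], r_type["rd"])
    print(s_type["rs1"], s_type["rs2"])
```

Note that a format without a given field simply reuses those bits for something else (in S-type, bits [11:7] hold part of the immediate), so the control unit decides which extracted fields are meaningful.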

3

All immediates are MSB-extended

So all immediates become sign-extended words. RISC-V immediates are encoded in 2’s complement form. So when you sign-extend, you’re effectively just preserving the correct integer value. For example, in RV32I, 12-bit immediate,

  • Immediate field = 1111_1111_1000 (0xFF8).

  • As 12-bit 2’s complement → −8.

  • Sign-extend to 32 bits → 0xFFFFFFF8.

  • ALU sees this and just adds normally, giving a result 8 less.
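
The same arithmetic can be checked with a short Python sketch. The function sign_extend is a hypothetical helper; it copies the most significant bit of a 12-bit field into the upper bits of a 32-bit word, using the 12-bit pattern 1111_1111_1000 (0xFF8) for -8.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits, width=32):
    """Sign-extend a `bits`-wide field to `width` bits (as an unsigned pattern)."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):      # MSB set -> negative number
        value -= 1 << bits
    return value & ((1 << width) - 1)


if __name__ == "__main__":
    imm32 = sign_extend(0xFF8, 12)
    print(hex(imm32))                  # 0xfffffff8
    print((100 + imm32) & MASK32)      # plain 32-bit addition gives 92
```

Because the value is in two's complement form, the ALU needs no special subtract path here: a wrapping 32-bit add does the job.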

However, this may look a bit weird with the instruction sltiu. But let's look at the following example,

Suppose the above is our "set if less than immediate unsigned" instruction; the RISC-V processor will treat -8 as 0xFFFFFFF8.

  • x1=-2, x1 will be interpreted as 0xFFFFFFFE. As ALU just compares unsigned, 0xFFFFFFFE is greater than 0xFFFFFFF8, thus false, meaning under unsigned comparison x1 is not smaller than -8!

  • x1=5, x1 will be interpreted as 0x00000005. Similarly, the ALU just compares unsigned, 0x00000005 is smaller than 0xFFFFFFF8, thus true, meaning under unsigned comparison x1 is smaller than -8 (unsigned)

This gives us the range for the immediate in slti and sltiu: -2048 to +2047 (0x800 to 0x7FF as 12-bit patterns). This range applies to all I-type and S-type instructions. The B-type instruction is a bit special: its range is -4096 to +4094 (0x1000 to 0x0FFE as 13-bit patterns). But as it is impossible to give the immediate manually in B-type instructions, I think this won't appear in quizzes or finals 😂.

So, an instruction like sltiu x3, x1, 0x00000FFF is illegal, as 0xFFF = 4095 exceeds the range for the immediate in I-type and S-type instructions!
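
The sltiu behaviour described above can be reproduced with a small Python model. It is only a sketch: the function name sltiu mirrors the instruction, and the range check imitates what an assembler would reject rather than anything the hardware does.

```python
MASK32 = 0xFFFFFFFF


def sltiu(rs1_value, imm):
    """Model of sltiu: sign-extend the immediate, then compare as unsigned."""
    if not -2048 <= imm <= 2047:
        raise ValueError("immediate does not fit in a signed 12-bit field")
    imm32 = imm & MASK32               # -8 becomes 0xFFFFFFF8
    return 1 if (rs1_value & MASK32) < imm32 else 0


if __name__ == "__main__":
    print(sltiu(-2, -8))   # 0xFFFFFFFE < 0xFFFFFFF8 is false -> 0
    print(sltiu(5, -8))    # 0x00000005 < 0xFFFFFFF8 is true  -> 1
```

Passing 0xFFF (4095) as the immediate raises the ValueError, which corresponds to the illegal instruction mentioned above.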

4

The location of immediate is well-designed

From the table above, we may wonder why the immediate's location is that weird! Actually, later from the microarchitecture view, we will see that this is a genius design as it minimizes the number of multiplexers!

DP Instruction

"DP" stands for data-processing. The following table summarises all the base DP instructions.

From this table, we notice that

  • subi is unnecessary, as the assembler can encode A-B as A+(-B). This works because B is an immediate, known at assembly time. If B is a register, its value cannot be known at assembly time, and that's why sub is still needed.

  • And the following table about the three types of shift operations is copied from Harris & Harris, just for CG3207 midterm quiz purpose.

DP Pseudo-Instruction

All the pseudo-instructions in RISC-V are introduced here! Among them, knowing the working principle of auipc from Lab 1 is also necessary!

auipc explanation

In Lab1, the instruction lw s3, delay_val at Line 47 is actually implemented by two RISC-V instructions.

This is as shown as follows, (x19 is the s3 register)

The reason for this two-step sequence is that lw is an I-type instruction, and I-type immediates are limited to a signed 12-bit offset relative to a base register. This means lw can only directly access data within -2048 to +2047 bytes of the base address. When the data we want to load is located far away, we need an additional instruction to construct a base address that is “close enough.”

This is where auipc (Add Upper Immediate to PC) comes in. auipc takes the current PC value, adds a 20-bit immediate shifted left by 12 bits, and stores the result into the destination register. In other words, it lets us build a base address relative to the PC, suitable for accessing distant memory.

  1. The symbol delay_val is at address 0x10010000. The instruction lw s3, delay_val itself is at 0x00400014.

    1. These two addresses differ by far more than a signed 12-bit offset can cover, so a plain lw cannot reach delay_val directly.

  2. To bridge the gap, the assembler splits the target address into a high part and a low part.

    1. The upper 20 bits difference is: 0x10010 - 0x00400 = 0xFC10. This becomes the immediate for auipc.

  3. After executing auipc x19, 0x0000fc10, register x19 holds: x19 = PC + (0xFC10 << 12), which is a value “close” to the address of delay_val.

  4. Now only a small offset is left to cover.

    1. The lower 12 bits difference is: 0x000 - 0x014 = 0xFFFFFFEC. This fits within the signed 12-bit immediate range of lw.

  5. Finally, the instruction lw x19, 0xffffffec(x19) uses x19 as the base plus the small offset to reach the exact address of delay_val and load its value into s3.


The key idea is that auipc provides a way to construct PC-relative addresses for far-away data or code. By combining auipc (for the high 20 bits of the address) with an I-type instruction like lw (for the low 12 bits), RISC-V can access any 32-bit address in memory, despite the immediate size limitations of a single instruction.
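
The split in the steps above can be sketched in Python. The function names are hypothetical; the one detail worth noting is the standard +0x800 rounding, which compensates for the low part being sign-extended when it is added back (it has no visible effect on the high part in this particular example, but matters in general).

```python
MASK32 = 0xFFFFFFFF


def split_pc_relative(target, pc):
    """Split (target - pc) into a 20-bit auipc immediate and a signed 12-bit offset."""
    offset = (target - pc) & MASK32
    hi20 = ((offset + 0x800) >> 12) & 0xFFFFF   # rounded upper 20 bits
    lo12 = offset & 0xFFF
    if lo12 >= 0x800:                           # interpret as signed 12-bit
        lo12 -= 0x1000
    return hi20, lo12


def resolve(pc, hi20, lo12):
    """Address reached by: auipc rd, hi20 ; lw rd, lo12(rd)."""
    return (pc + (hi20 << 12) + lo12) & MASK32


if __name__ == "__main__":
    hi20, lo12 = split_pc_relative(0x10010000, 0x00400014)
    print(hex(hi20), lo12)                      # 0xfc10 -20
    print(hex(resolve(0x00400014, hi20, lo12)))
```

With the Lab 1 addresses, this gives the immediate 0xFC10 for auipc and an offset of -20, whose 32-bit pattern is 0xFFFFFFEC.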

Multiply and Divide

The multiply and divide are not part of the base instruction set, but are available as an optional standard extension.

Memory Instruction

The following is the base memory instruction:

So, from this table, notice that

  • For the memory instruction syntax, if imm=0, we can omit the imm. e.g., op rd, (rs1).

  • For lb and lh, which load a byte or a half word, the rest of the bits are formed by sign-extension/MSB-extension (just copy the MSB) of the byte/half-word

  • For lbu and lhu, zero extension (copy zero only) is done.
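
The difference between the signed and unsigned loads can be seen with a tiny model of a little-endian byte memory. This is a sketch in Python; the helper load and the sample memory contents are made up for illustration.

```python
MASK32 = 0xFFFFFFFF


def load(memory, addr, size, signed):
    """Read `size` bytes (little-endian) and extend the result to 32 bits."""
    chunk = memory[addr:addr + size]
    return int.from_bytes(chunk, "little", signed=signed) & MASK32


def lb(memory, addr):  return load(memory, addr, 1, signed=True)
def lbu(memory, addr): return load(memory, addr, 1, signed=False)
def lh(memory, addr):  return load(memory, addr, 2, signed=True)
def lhu(memory, addr): return load(memory, addr, 2, signed=False)


if __name__ == "__main__":
    mem = bytes([0x80, 0xFF, 0x7F, 0x00])
    print(hex(lb(mem, 0)), hex(lbu(mem, 0)))   # 0xffffff80 0x80
    print(hex(lh(mem, 0)), hex(lhu(mem, 0)))   # 0xffffff80 0xff80
```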

Control Instructions

The following is the base control instructions:

1

Jump vs. Branch

  1. jal can jump farther than conditional branches because

    1. jal instructions use 20-bit signed immediate

    2. Branch instructions use 12-bit signed immediate

  2. jal allows for saving return address, while conditional branches cannot.

2

The immediate in branch instructions

In the branch instructions, the immediate field is 12 bits, but it stores bits [12:1] of the offset instead of [11:0]. This is because in RISC-V, every instruction must start at an address that is an even number (a multiple of 2). You can never jump to an odd address (like 0x1001) because instructions never exist there.

So, instead of storing bits [11:0] (which would include the useless last zero), the instruction stores bits [12:1]. The hardware assumes the last bit (bit 0) is always 0.
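
As a sketch of this encoding, the Python functions below scatter a branch offset into the B-type immediate bit positions (instruction bits 31, 30:25, 11:8, and 7) and gather it back. The function names are hypothetical, and only the immediate bits are produced; rs1, rs2, funct3, and opcode are left as zero.

```python
def encode_b_imm(offset):
    """Place offset bits [12|10:5|4:1|11] into instruction bits [31|30:25|11:8|7]."""
    if offset % 2 or not -4096 <= offset <= 4094:
        raise ValueError("branch offset must be even and within -4096..+4094")
    imm = offset & 0x1FFF                       # 13-bit two's complement
    return ((((imm >> 12) & 0x1) << 31)
            | (((imm >> 5) & 0x3F) << 25)
            | (((imm >> 1) & 0xF) << 8)
            | (((imm >> 11) & 0x1) << 7))


def decode_b_imm(instr):
    """Recover the signed branch offset; bit 0 is implicitly zero."""
    imm = ((((instr >> 31) & 0x1) << 12)
           | (((instr >> 7) & 0x1) << 11)
           | (((instr >> 25) & 0x3F) << 5)
           | (((instr >> 8) & 0xF) << 1))
    return imm - 0x2000 if imm & 0x1000 else imm


if __name__ == "__main__":
    for off in (8, -8, 4094, -4096):
        assert decode_b_imm(encode_b_imm(off)) == off
    print(hex(encode_b_imm(8)))
```

Dropping bit 0 is what doubles the reach: 12 stored bits cover a 13-bit signed offset.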

3

Two types of jumps

Look at the "Description (C)" of jal and jalr from the table above to understand this part better!

  1. jal: jump and link, is a J-type instruction. And it stores return address in rd.

    1. Used when you know the target address at assembly time.

    2. Used in function call (Jump to a function’s code so it can execute)

    3. imm is 20-bit.

  2. jalr: jump and link register, is an I-type instruction. And it stores return address in rd (usually ra)

    1. Use when the target address is dynamic, stored in a register.

    2. Used in function return (Go back to where the function was called from). This is because the function can be called from many different places. The return address isn't known until the program is actually running. Therefore, we jump to the address stored in the register ra.

    3. imm is 12-bit, but it can jump anywhere in a 32-bit absolute address range. A lui instruction can first load rs1 with the upper 20 bits of a target address, then jalr can add in the lower bits.

      1. Similarly, auipc then jalr can jump anywhere in a 32-bit pc-relative address range.

For example, in the following code, we can see how jalr can jump to anywhere with the help of lui.

Put it all together

Nothing is better than an example! So, let's look at an example to put everything together.

RISC-V Microarchitecture

From a computer hardware engineer's view, a computer can be divided into 2 parts

Datapath

Datapath is the path through which data "flows". It includes the following elements

1

Storage elements

Like memories and registers. The storage elements can be further divided into the following parts

  1. Architectural state elements: manipulated by the programmer, like instruction memory (IROM), data memory (DMEM), register file (x0-x31), PC, and other control registers.

  2. Microarchitectural state elements: not accessible to the programmers, like pipeline registers, cache tags, and branch predictor state.

2

Steering logic

This is to channel data properly. For example, in the pipeline, should the ALU take operand from register file, or from immediate, or from forwarding path?

The steering logic is implemented using

  • multiplexers (select one of many inputs) and

  • internal buses (shared connections).

3

Functional units

These units operate on data, i.e., they compute new values from their inputs (deciding where the data should go is the job of the steering logic). For example,

  • ALU: performs add, subtract, AND, OR, shifts, etc.

  • Adders: sometimes separate for PC update, address calculation, etc.

    • Often stateless because they don’t store, just compute outputs when inputs arrive.

4

Interface resources

These connect the CPU to the outside world. For example,

  • External buses: connect CPU to memory, I/O devices, GPU, etc.

  • Ports: I/O interfaces like memory-mapped registers, communication ports.

What is a bus?

From the above points, we may wonder what on earth a bus is. From Wikipedia, the definition is

In computer architecture, a bus is a communication system that transfers data between components inside a computer or between computers. At its core, a bus is a shared physical pathway, typically composed of wires, that allows multiple devices to communicate.

And buses in computer architecture can be categorized along the following two dimensions

The location of the bus

  1. Internal buses: inside the CPU datapath.

    1. Example: register file → ALU operand buses.

  2. External buses: connects CPU to outside world (main memory, I/O devices, GPU, etc.).

    1. Example: system bus, PCIe, AMBA, memory bus.

The purpose of the bus

This is basically about what information a bus carries

  1. Data bus: carries the actual data values being read or written.

  2. Address bus: carries the location of data (memory address or register index).

  3. Control bus: carries signals that say what operation is happening.


Usually, we name a bus by its location first, then by its purpose (e.g., "external address bus").

Control Unit

Control Unit controls the flow/processing/storage of data in the datapath via

  • Mux selects: choose which input goes into a multiplexer (e.g., should ALU input B come from register or immediate?).

  • Register write enables: should this register latch a new value at the clock edge or stay the same?

  • Functional unit activation: is the ALU active? Is the memory doing a read or a write?

  • Operation selection: what exact operation should ALU perform (add, sub, AND, OR…)?

Implement a single-cycle microarchitecture

In this lecture, we will implement a single-cycle microarchitecture first. Basically, a single-cycle microarchitecture fetches, decodes, and executes an instruction all in one clock cycle. In this lecture, we have covered four single-cycle microarchitectures, each built upon the previous one,

1

Single-Cycle Processor with Control

As the CU in a single-cycle processor is just a decoder, the following table summarises the decoder behavior,

As you can see from the schematic above, the ALU also outputs a 3-bit ALUFlags to the PC Logic inside the Control Unit, and the PC logic is summarised as follows,

Inside the PC logic, we should be clear that it has two inputs (PCS, ALUFlags[2:0]) and one output (PCSrc)

  • PCS: determines the instruction category for PC updates.

  • ALUFlags[2:0]: Bits that describe the outcome of the ALU operation. The three bits are {eq, lt, ltu}.

  • PCSrc: 1-bit control signal to choose the next PC Value.

    • 0: use the default PC+4 (fallthrough to next instruction).

    • 1: use the branch/jump target (PC + immediate).

For now, our PC logic only supports beq; it will support more in the following iterations.

2

PC Logic with all conditions

The schematic is the same as the above. But now the PC logic is complete: we look at the funct3 field in the instruction to decide which branch instruction is being executed.
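
A behavioural sketch of such a PC logic is given below in Python. The funct3 values are the standard RV32I branch encodings, and the flags follow the {eq, lt, ltu} order used above. The is_branch argument is a hypothetical stand-in for the PCS signal, whose actual encoding comes from the lecture's table and is not reproduced here; jumps are left out of this sketch.

```python
MASK32 = 0xFFFFFFFF

# funct3 -> condition on (eq, lt, ltu) under which the branch is taken
BRANCH_TAKEN = {
    0b000: lambda eq, lt, ltu: eq,        # beq
    0b001: lambda eq, lt, ltu: not eq,    # bne
    0b100: lambda eq, lt, ltu: lt,        # blt
    0b101: lambda eq, lt, ltu: not lt,    # bge
    0b110: lambda eq, lt, ltu: ltu,       # bltu
    0b111: lambda eq, lt, ltu: not ltu,   # bgeu
}


def to_signed32(x):
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def alu_flags(a, b):
    """Return (eq, lt, ltu) for two 32-bit operands."""
    a &= MASK32
    b &= MASK32
    return a == b, to_signed32(a) < to_signed32(b), a < b


def pc_src(is_branch, funct3, flags):
    """0 -> next PC is PC+4, 1 -> next PC is the branch target."""
    if not is_branch:
        return 0
    eq, lt, ltu = flags
    return 1 if BRANCH_TAKEN[funct3](eq, lt, ltu) else 0


if __name__ == "__main__":
    flags = alu_flags(0xFFFFFFFE, 1)      # -2 vs 1
    print(pc_src(True, 0b100, flags))     # blt: taken (signed -2 < 1)
    print(pc_src(True, 0b110, flags))     # bltu: not taken (unsigned compare)
```

The funct3 values 0b010 and 0b011 are not branch encodings, so this sketch simply raises a KeyError for them.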

3

Support for lui and auipc

Here, we just add a 2-bit output named ALUSrcA to our decoder. These 2 bits control whether the ALU SrcA will have the current PC as input. With this, the CU decoder table is updated as follows,

4

Here, we have two modifications:

  1. change the ALUSrcB from 1-bit to 2-bit, thus adding another possibility for the immediate 4 to be the input of ALU SrcB.

  2. change the PCSrc from 1-bit to 2-bit, thus adding another possibility for the PC value to be read from Register File RD1.

Thus, we have two changes for our tables, the first one is the CU decoder table

And our PC Logic table will also change,

This is the whole content to build a simple single-cycle RISC-V processor. And we may notice the following,

  • No datapath resource can be used more than once per instruction, so some must be duplicated

    • Separate memories: because fetch and load/store both need memory at the same time.

    • Two adders: because PC+4 and normal ALU addition both need addition in the same cycle.

So, how can we make it faster? The answer is wait for Lec 05 😉!
