Lec 05 - The Pipelined Processor

Introduction

Recap of the single cycle processor

In Lec 03, we introduced how to implement a single-cycle RISC-V processor. In that design, we noticed that different instructions go through different parts of the datapath, so the time each instruction takes to complete differs.

The single-cycle processor we designed in Lec 03

For example, let's assume

  • Instruction and Data Memory needs 200ps

  • ALU needs 120ps

  • Adders need 75ps

  • Register File Access (reads: 100ps, writes: 60ps)

Then we will have (all times in ps),

| Instr. | I Mem | Reg Rd | ALU Op | D Mem | Reg Wr | PC Incr | Total |
| ------ | ----- | ------ | ------ | ----- | ------ | ------- | ----- |
| DP     | 200   | 100    | 120    |       | 60     |         | 480   |
| lw     | 200   | 100    | 120    | 200   | 60     |         | 680   |
| sw     | 200   | 100    | 120    | 200   |        |         | 620   |
| beq    | 200   | 100    | 120    |       |        | 75      | 495   |
| jal    | 200   |        |        |       |        | 75      | 275   |
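As a sanity check on the table, the per-instruction totals can be recomputed from the component delays. Below is a minimal Python sketch; the component names and per-instruction paths are simplifications for illustration, not actual datapath signal names.

```python
# Component delays in ps, from the example above.
DELAYS = {"imem": 200, "reg_rd": 100, "alu": 120,
          "dmem": 200, "reg_wr": 60, "pc_adder": 75}

# Which components each instruction class passes through on its critical path
# (a simplification that matches the table, not a full datapath model).
PATHS = {
    "DP":  ["imem", "reg_rd", "alu", "reg_wr"],
    "lw":  ["imem", "reg_rd", "alu", "dmem", "reg_wr"],
    "sw":  ["imem", "reg_rd", "alu", "dmem"],
    "beq": ["imem", "reg_rd", "alu", "pc_adder"],
    "jal": ["imem", "pc_adder"],
}

totals = {instr: sum(DELAYS[part] for part in path)
          for instr, path in PATHS.items()}
print(totals)  # {'DP': 480, 'lw': 680, 'sw': 620, 'beq': 495, 'jal': 275}
```

The maximum total (lw, 680 ps) is what the single-cycle clock must accommodate.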

Why do all these parts take time to execute?

It is because the Instruction Memory, Register File, ALU, Data Memory, and Adders are all implemented using many logic gates, so they inevitably have some propagation delay. It therefore takes a certain amount of time for a signal to travel from the input to the output of each part.

Our single-cycle processor's clock cycle time must be at least as long as the slowest instruction's execution time (usually, it is lw that takes the longest to complete). This wastes time on the faster instructions and leaves resources, like the adders, idle.

Make it Faster

Recall that the performance equation we have introduced in Lec 01 is,

$$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time}$$

Here, we cannot change IC, as it is determined by the compiler, and let's assume for now that CPI is fixed. Given that, how can we improve the performance? The answer is to decrease the Clock Cycle Time, using a technique called pipelining.
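With IC and CPI held fixed, the equation says CPU time scales linearly with the clock cycle time. A tiny numeric sketch (the instruction count here is made up for illustration):

```python
def cpu_time_ps(instruction_count, cpi, clock_cycle_ps):
    # CPU Time = Instruction Count x CPI x Clock Cycle Time
    return instruction_count * cpi * clock_cycle_ps

IC, CPI = 1_000_000, 1            # IC fixed by the compiler; CPI assumed fixed
slow = cpu_time_ps(IC, CPI, 680)  # clock sized for the slowest instruction
fast = cpu_time_ps(IC, CPI, 340)  # halve the clock cycle time
assert fast == slow / 2           # CPU time halves as well
```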

Five Stage Pipeline

In this course, we will use a five-stage pipeline on our processor.

Pipelining "cuts" our microarchitecture into different stages (in our case, 5), and each stage takes one clock cycle to complete. Thus, the clock cycle time can be decreased significantly.

For example, below is the image showing the five stages of the lw instruction.

  • Fetch: Instruction Fetch and update PC

  • Dec: Registers Fetch and Instruction Decode

  • Exec: Execute DP-type operations; calculate the memory address

  • Mem: Read/write the data from/to the Data Memory

  • WB: Write the result data into the register file

To see clearly how we do this "cut", we can reuse the microarchitecture in Lec 03.

After pipelining our processor, the clock cycle diagram to execute the instructions looks as follows.

We can see that after clock cycle 5 completes, one instruction is completed every cycle, thus CPI = 1. We also call the first 4 cycles overhead, as nothing completes during this time.
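The 4-cycle overhead can be seen numerically: with a 5-stage pipeline, N instructions take N + 4 cycles, so the effective CPI approaches 1 as N grows. A small sketch:

```python
def total_cycles(n_instructions, stages=5):
    # The first instruction needs `stages` cycles to reach the end of the
    # pipeline; every later instruction completes one cycle after the previous.
    return n_instructions + (stages - 1)

for n in (1, 10, 1000):
    print(n, total_cycles(n) / n)  # effective CPI: 5.0, 1.4, 1.004
```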

However, this pipeline isn't perfect. You may notice that not every instruction needs the full 5 stages: for example, sw doesn't need WB, and branch and DP Reg/Imm instructions don't need Mem. Besides, there are still some hazards that we may encounter (we will see them later in this lecture).

Is the speed-up of a pipelined processor always the same as the number of stages?

Not exactly. For example, a 5-stage pipelined processor doesn't necessarily have a 5x speed-up compared to the single-cycle processor, as you can see from the following image.

The reason is that in the pipelined processor, our clock cycle time accommodates the slowest stage, while in the single-cycle processor, our clock cycle time accommodates the slowest instruction. But the slowest instruction isn't necessarily 5x the slowest stage!
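Using the delays from the running example, the actual speed-up can be computed directly:

```python
# Delays in ps from the example above.
slowest_stage = 200        # memory access: the pipelined clock must fit this
slowest_instruction = 680  # lw: the single-cycle clock must fit this

# Per-instruction throughput ratio once the pipeline is full.
speedup = slowest_instruction / slowest_stage
print(speedup)  # 3.4, well short of the 5x suggested by the stage count
```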

This lecture utilizes the classic five-stage pipeline as a foundational model. However, the fundamental instruction phases — Fetch, Decode, Execute, Memory Access, and Writeback — remain universal across deeper pipeline architectures.

Implement a Pipelined Processor

As we have mentioned above, we need to "cut" our microarchitecture into 5 stages. To make these "cuts", we basically add registers between each pair of stages.

We have 5 stages, thus we need 4 "cuts"/registers. By adding registers, we can clearly see that our critical path is now shortened significantly, and this practice is the soul of pipelining!

Pipeline Register

The observant reader may notice that besides adding the 4 registers, we also made some changes to certain signals.

1. Delayed Signals

Several signals, like RegWrite, MemtoReg, etc., are delayed. Their names are also changed at each stage by adding the suffix F, D, E, M, or W depending on which stage the signal is at.

We can think of this delay as follows: after we "cut" the microarchitecture, we still need to "store" the corresponding Control Unit signals and Datapath signals to make sure that the operation done in each stage works correctly.

2. The cut of rdW

In the Fetch stage, rd is no longer connected directly to the instruction decoder. Instead, we delay the rd signal by 3 stages and connect it as rdW, which means that after the WB stage, we connect the rd back to the register file.

This is because we are using pipelining! Suppose we have the following instructions

If rd were connected directly to the instruction decoder, the correct rd (x1) of the add would be overwritten by the rd (x4) of the sub instruction!

3. The multiplexer at PC

This multiplexer is added for the branch instructions; correspondingly, the PC signal is delayed by 2 stages so the multiplexer chooses between PCF and PCE.

This is because in the Fetch stage, the PC keeps increasing by 4 without waiting for the actual completion of an instruction. However, if our branch is taken, we should use the "old" PC to add to ExtImm.

Pipeline Hazards

Pipelining is not perfect; sometimes it can run into trouble. These troubles are called pipeline hazards, and normally we have the following three hazards.

Structural Hazards

Structural hazards happen when two different instructions attempt to use the same resource (usually the memory) at the same time.

For example, in the following diagram, at the fourth clock cycle, both the lw and sw are accessing the same memory, because in real life, IROM and DMEM are both in the same RAM.

Data Hazards

Data hazards happen when we attempt to use data before it is ready, i.e., an instruction's source operand(s) are produced by a prior instruction still in the pipeline. This hazard is also known as a RAW (read after write) hazard or true data dependency, and it occurs very frequently in practice.

For example, in the following diagram, the destination register rd of the add instruction is needed as a source register rs by the following sub, or, and and instructions.

Control Hazards

Control hazards happen when we attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated. This usually happens in branch/jump instructions.

For example, in the branch instruction, the BTA (Branch Target Address) can only be known after the Execute stage. By that time, two more instructions will already have been fetched, and if the branch is taken, these two instructions shouldn't be completed.

Self-Diagnostic Quiz

Can inlining (putting the callee function's body into the caller function) always reduce the control hazards in a pipelined processor?


Ans: No, only sometimes. This is because of the following three reasons:

  1. I-Cache Misses (The Main Hazard): Inlining replicates code, significantly increasing the binary size ("code bloat"). If the expanded code exceeds the size of the Instruction Cache, it causes cache misses. A fetch from main memory (hundreds of cycles) is far more costly than the few cycles lost to a control hazard.

  2. Register Pressure: Inlining forces the compiler to manage more variables simultaneously. If the processor runs out of registers, it causes "spilling" (saving data to stack memory), creating memory access delays that negate the speedup from removing the function call.

  3. Internal Branches Remain: Inlining only removes the CALL and RETURN jumps. Any conditional branches (loops, if-else) inside the inlined function still exist and still cause control hazards.

Handle Pipeline Hazards

In this part, we will introduce how to handle the 3 types of pipeline hazards we have encountered above. For the sake of this course, we will focus more on the handling of data hazards.

Handle Structural Hazards

We can simply fix this hazard by using separate instruction and data memories.

Handle Data Hazards

To handle data hazards, we will introduce five methods,

Use different clock edges

This means that we write the register file and the pipeline registers on different edges of the clock.

For example, in the following diagram, we use negedge for register file and posedge for the pipeline state registers.

By following the above convention, here's the updated microarchitecture of our processor.

Using this technique to analyze the data hazard example above, we can see from the following graph that the third instruction, and, will successfully read the updated value stored in register s8.

Given that, we will summarize the pros and cons of this technique as follows,

  • Pros:

    • This technique can help reduce the "troublesome" instructions from 3 to 2 in our example above.

  • Cons:

    • It uses both edges of the same clock, a practice discouraged in CG3207.

    • The half cycle may not be enough to write the pipeline state register. In other words, reading the value from the register file and writing it to a pipeline state register may take longer than half a clock cycle.

Inserting NOPS

In RISC-V, we can use addi x0, x0, 0 as NOP.

The basic idea is that when there is a true data dependency between two instructions, we can insert NOPs between them to "delay" the data "transfer". So, if we have already implemented the first technique of using different clock edges, we should make sure there are 2 NOPs or independent useful instructions between the two instructions that have the true data dependency.

For example, in the following diagram, add and sub have a true data dependency. We can either insert two NOPs between add and sub or replace one of the NOPs with the independent useful instruction or.
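The compiler's job here can be sketched as a toy pass. This is a heavy simplification for illustration: the three-operand instruction format, the dependency check, and the fixed gap of 2 are all assumptions, and a real compiler tracks dependencies far more carefully.

```python
NOP = "addi x0, x0, 0"  # the RISC-V NOP used in this lecture

def dest(instr):
    # First operand is the destination register (toy 3-operand format).
    return instr.split()[1].rstrip(",")

def sources(instr):
    # Remaining operands are treated as sources (toy format).
    return [tok.rstrip(",") for tok in instr.split()[2:]]

def insert_nops(program, gap=2):
    """Pad the program so a consumer sits at least `gap` slots after its
    producer (gap=2 assumes the different-clock-edges technique is in use)."""
    out = []
    for instr in program:
        for back in range(1, gap + 1):
            if len(out) >= back and dest(out[-back]) in sources(instr):
                out.extend([NOP] * (gap - back + 1))
                break
        out.append(instr)
    return out

print(insert_nops(["add s8, s2, s3", "sub s9, s8, s1"]))
```

Replacing one of the inserted NOPs with an independent instruction (as in the `or` example above) would keep the same spacing without wasting a slot.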

Now, we can summarize the pros and cons of this technique

  • Pros:

    • It is easier to implement, as the NOPs are inserted by the compiler at compile time.

    • Inserting NOPs can solve almost all the hazards.

  • Cons:

    • Inserting NOPs wastes code memory.

    • As NOPs do nothing, we don't treat them as real instructions when calculating the CPI. Thus, with more clock cycles and the same number of useful instructions, the CPI will increase and thus increase our CPU Time.

    • To know the number of NOPs or independent useful instructions to insert, the compiler needs to know the microarchitecture, such as the number of stages in our pipelined processor and whether we write the register file and pipeline state registers on different clock edges. This makes the code not very portable.

Data Forwarding

This is the most challenging technique introduced for handling data hazards.

The basic idea in data forwarding is that the data is available on some pipeline register before it is written back to the register file (RF). So, we can take that data from where it is present, and pass it to where it is needed.

1. Execution Stage

In this part, we will see the forwarding circuitry for W -> E and M -> E.

In the Execute stage, we check whether a register read by the instruction currently in the Execute stage (rs1E or rs2E) matches the register written (rdM or rdW) by the instruction currently in the Memory or Writeback stage. If so, we forward the ready result (usually the ALUResult in the Memory or Writeback stage) to the Execute stage, so the input to the ALU now comes from the M or W stage rather than the E-stage register.

For example, we can see how the above forward logic works in the following graph,

Suppose we have applied the Use different clock edges technique; now the add instruction only has a true data dependency with the sub and or instructions.

  • In the sub instruction's E stage, we notice it needs rs1E to be register s8. In the add instruction, its rdM in the M stage is register s8, which matches rs1E. Thus, we can forward the ready ALUResult from the add instruction's M stage to the E stage of the sub instruction.

  • Similarly, in the or instruction's E stage, it needs rs2E to be register s8. In the add instruction, its rdW in the W stage is register s8, which matches rs2E. Thus, we can forward the ready ALUResult (in the W stage, it is ResultW) to the E stage of the or instruction.

To implement this forwarding logic, we add the Hazard Unit shown below to our microarchitecture to specifically handle the hazard.

rs1/2E, rdM/W, and RegWriteM/W are easy to pass as inputs to the Hazard Unit, but the outputs ForwardA/BE, which control the multiplexers for the ALU SrcA and SrcB, must follow exactly these three conditions:

  • Condition 1: rs1/2E == rdM/W checks for the register match.

  • Condition 2 RegWriteM/W ensures that the instruction (in our example, add) really writes to the register. If not (like sw, branch), there won't be any data hazard, thus no need to do data forwarding.

  • Condition 3 rdM/W ≠ 0 ensures that we are not forwarding from the x0 register as there is no need to do so.
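The three conditions combine into a priority mux select, sketched below in Python. The signal names follow the text; the "M"/"W"/"E" return encoding is an assumption for illustration. M takes priority over W because it holds the more recent result.

```python
def forward_src(rs_e, rd_m, regwrite_m, rd_w, regwrite_w):
    """Select the ALU source for one operand (a sketch of the Hazard Unit
    logic described above, one instance each for SrcA and SrcB)."""
    # Forward from Memory if it writes a matching, non-x0 register.
    if regwrite_m and rd_m != 0 and rs_e == rd_m:
        return "M"   # take ALUResultM
    # Otherwise try Writeback under the same three conditions.
    if regwrite_w and rd_w != 0 and rs_e == rd_w:
        return "W"   # take ResultW
    return "E"       # no hazard: use the register-file value

# sub needs s8 (x24) while add (rd = x24) is in the Memory stage.
assert forward_src(rs_e=24, rd_m=24, regwrite_m=True, rd_w=0, regwrite_w=False) == "M"
# or needs s8 while add has moved on to the Writeback stage.
assert forward_src(rs_e=24, rd_m=9, regwrite_m=True, rd_w=24, regwrite_w=True) == "W"
```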

In total, we need 2 × 2 + 2 = 6 five-bit comparators to do this data forwarding: rs1/2E each need two comparators (2 × 2 = 4), and rdM/W ≠ 0 needs another 2, whose results can be shared between the ForwardA/BE signals.

2. Mem-to-Mem Copy

In this section, we will see the forwarding circuitry for W -> M.

Another situation where we should do data forwarding is when a lw is followed by a sw, and the sw tries to store the value (rs2) from the register that the lw loads into (rd).

In this case, we need to check whether the register used in the Memory stage by the sw instruction (rs2) matches the register written by the lw in the Writeback stage (rd). If so, we forward the result.

And our Hazard Unit will be updated to the following,

By now, we have added ForwardM as another output of our Hazard Unit, and rs2M and rdW as two inputs. The logic to decide when to set ForwardM is as follows,

  • Condition 1: rs2M == rdW is the basic match check

  • Condition 2: MemWriteM ensures that the instruction at the Memory stage is a sw (see Lec 03; as above, only sw has MemWrite == 1).

  • Condition 3: MemtoRegW ensures that the instruction at the Writeback stage is the lw

  • Condition 4: rdW ≠ 0 ensures that the lw at the Writeback stage has a non-x0 destination.
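The four conditions AND together into the ForwardM signal; a minimal sketch with signal names following the text:

```python
def forward_m(rs2_m, rd_w, memwrite_m, memtoreg_w):
    """W -> M forwarding for the lw-then-sw (mem-to-mem copy) case."""
    return (memwrite_m          # Condition 2: instruction in M is a sw
            and memtoreg_w      # Condition 3: instruction in W is a lw
            and rd_w != 0       # Condition 4: lw's destination is not x0
            and rs2_m == rd_w)  # Condition 1: sw stores what lw loaded

# lw x5, 0(x10) followed by sw x5, 4(x10): forward the loaded value W -> M.
assert forward_m(rs2_m=5, rd_w=5, memwrite_m=True, memtoreg_w=True) is True
```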

Notes

3. Load and Use Hazard: Stalling

Suppose we have a lw instruction whose rd is one of the rs of the following instruction, for example, an and instruction. As the and follows closely after the lw, we run into the trouble shown below!

The problem is that in the 4th clock cycle, the and instruction needs s7, but as lw is still in the Mem stage, s7 is not yet available! To solve this problem, we need to stall the Decode and Fetch stages for the and and or instructions in the above example. As a result, during the 6th clock cycle, nothing will complete.

To implement this, the circuitry looks as follows,

We stall the PC register so that whatever is in the Fetch stage stays there and won't change, and similarly for the pipeline register D and the Decode stage. However, as the clock cycle moves forward by 1, the lw instruction will move to the Memory stage, while whatever is in the Decode stage (the and instruction in our example) would move forward to the Execute stage. This is not what we want; thus, we flush the pipeline register E at the same time we stall the pipeline registers F and D.

The basic condition for detecting it will be

  • Condition 1: MemtoRegE ensures that the instruction at the Execute stage is a lw.

  • Condition 2: (rs1D == rdE) | (rs2D == rdE) ensures that the following instruction actually uses the lw's rd. This is the earliest point at which we can detect such a load-use hazard.
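The two conditions combine into a single stall signal; when it is high, pipeline registers F and D are stalled and E is flushed. A sketch with signal names following the text:

```python
def load_use_stall(memtoreg_e, rd_e, rs1_d, rs2_d):
    # Condition 1: the instruction in Execute is a lw (MemtoRegE == 1).
    # Condition 2: the instruction in Decode reads lw's destination.
    return memtoreg_e and (rs1_d == rd_e or rs2_d == rd_e)

# lw writing s7 (x23) in Execute while `and` reading s7 is in Decode: stall.
assert load_use_stall(memtoreg_e=True, rd_e=23, rs1_d=23, rs2_d=9) is True
# An R-type in Execute never triggers the stall, even if registers match.
assert load_use_stall(memtoreg_e=False, rd_e=23, rs1_d=23, rs2_d=9) is False
```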

After 1 cycle, the forwarding circuitry (W -> E) will take care of delivering the correct data. However, how do we combine it with our Mem-to-Mem copy circuitry and deal with the case where some instructions don't use rs1 or rs2? Below is an example!

4. W -> D Forwarding

In Use different clock edges, we used the negedge of the clock to write to the register file. What if we want to use the same posedge of the clock throughout our processor?

A problem will happen if we have the following instructions:

Suppose we denote the clock cycle in which the addi's Fetch happens as clock cycle 1. We may notice that after clock cycle 4, we should forward the result that will be stored in x5 in clock cycle 5 to the Decode stage of add x6, x5, x2, so that it can use the updated value.

To handle this case, we can further modify our hazard unit to add two output signals Forward1/2D.

  • Condition 1: rs1/2D == rdW ensures the Writeback stage rd matches one of the Decode stage rs.

  • Condition 2: RegWriteW ensures that the Writeback stage will write back to the register file.

  • Condition 3: rdW ≠ 0 ensures that the Writeback stage rd is not x0.
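The three conditions again AND together, one instance per source register; a sketch with signal names following the text:

```python
def forward_d(rs_d, rd_w, regwrite_w):
    """W -> D forwarding when the register file is read and written on
    the same posedge (one instance each for Forward1D and Forward2D)."""
    return (regwrite_w         # Condition 2: W stage writes the register file
            and rd_w != 0      # Condition 3: destination is not x0
            and rs_d == rd_w)  # Condition 1: Decode reads what W writes

# addi x5, ... in Writeback while add x6, x5, x2 decodes: forward ResultW.
assert forward_d(rs_d=5, rd_w=5, regwrite_w=True) is True
assert forward_d(rs_d=5, rd_w=0, regwrite_w=True) is False  # never forward x0
```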

Our updated circuitry will be as follows,

Handle Control Hazards

Unlike data hazards, which can be handled effectively using techniques like data forwarding, there is no perfect way to handle control hazards. In this section, we will introduce the flushing technique to handle them.

The key idea of flushing is to flush the two fetched instructions if the branch is taken. In our microarchitecture, PCSrcE == 1 means the branch is taken. Thus, the condition can be written as follows,
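Since PCSrcE == 1 signals a taken branch, both instructions fetched in the branch's shadow are wiped by tying the flush controls to it. A minimal sketch (the flush signal names are assumptions for illustration):

```python
def branch_flush(pcsrc_e):
    # A taken branch resolves in Execute, so the two younger instructions
    # sitting in the earlier pipeline registers are on the wrong path.
    flush_d = pcsrc_e  # wipe the instruction entering Decode
    flush_e = pcsrc_e  # wipe the instruction entering Execute
    return flush_d, flush_e

assert branch_flush(True) == (True, True)     # taken: kill both instructions
assert branch_flush(False) == (False, False)  # not taken: pipeline unaffected
```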

And its circuitry will be as follows,

Flushing vs. Stalling

Stalling means hold still (the instruction waits), while flushing means wipe out (the instruction is removed). Think of the following example.

Stalling the pipeline register E will cause PCSrcE to keep its previous value (not updating), while flushing the pipeline register E will cause PCSrcE to be 0. As this zero is passed all the way to the end of the pipeline, we can think of flushing one pipeline stage as inserting one NOP.

If any stage is stalled, (i) the previous stages should also be stalled, and (ii) the following stage should be flushed (think of load-and-use).

Complete Hazard Handling Circuitry

The following is the complete hazard handling circuitry we have introduced in this lecture. However, it is not based on the full microarchitecture in Lec 03.
