Lec 05 - The Pipelined Processor
Introduction
Recap of the single cycle processor
In Lec 03, we introduced how to implement a single-cycle RISC-V processor. In that design, we noticed that different instructions go through different parts of the datapath, so the time taken for each instruction to complete differs.

For example, let's assume
Instruction and Data Memory accesses need 200ps each
The ALU needs 120ps
Adders need 75ps
Register File access needs 100ps for reads and 60ps for writes
Then we will have (all times in ps),

| Instruction | Instr. Mem | Reg Read | ALU | Data Mem | Reg Write | PC Adder | Total |
|---|---|---|---|---|---|---|---|
| DP  | 200 | 100 | 120 | -   | 60 | -  | 480 |
| lw  | 200 | 100 | 120 | 200 | 60 | -  | 680 |
| sw  | 200 | 100 | 120 | 200 | -  | -  | 620 |
| beq | 200 | 100 | 120 | -   | -  | 75 | 495 |
| jal | 200 | -   | -   | -   | -  | 75 | 275 |
PC_Incr is done in parallel, and when things happen in parallel, we take the worst-case timing. For the branch and jal instructions here, the processor needs to fetch the instruction (200ps) before it can decide whether to increment the PC or not, so the PC_Incr time is counted in the calculation.
And our single-cycle processor's clock cycle time must be at least as long as the slowest instruction's execution time (usually it is the lw that takes the longest to complete). This creates unwanted waste of time and resources, such as the adders sitting idle for much of the cycle.
Make it Faster
Recall that the performance equation we introduced in Lec 01 is CPU Time = Instruction Count (IC) × CPI × Clock Cycle Time.
Here, we cannot change IC, as it is determined by the compiler, and for now let's assume CPI is fixed. Given that, how can we improve the performance? The answer is simple: decrease the Clock Cycle Time, using a technique called pipelining.
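As a rough illustration, plugging in the example numbers above (and assuming the pipelined cycle time is set by the slowest component, the 200ps memory access, ignoring pipeline register overhead):

$$
\text{CPU Time} = \text{IC} \times \text{CPI} \times T_{clk},\qquad
T_{clk}^{\text{single-cycle}} \ge 680\,\text{ps}\ (\text{set by } lw),\qquad
T_{clk}^{\text{pipelined}} \approx 200\,\text{ps}
$$

With IC and CPI unchanged, this alone would give a speedup of roughly 680/200 ≈ 3.4×.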
Five Stage Pipeline
In this course, we will use a five-stage pipeline on our processor.
Pipelining "cuts" our microarchitecture into different stages (in our case, 5), and each stage takes one clock cycle to complete. Thus, the clock cycle time can be decreased significantly.
For example, below is the image showing the five stages of the lw instruction.

Fetch: Instruction Fetch and update PC
Dec: Registers Fetch and Instruction Decode
Exec: Execute DP-type; calculate memory address
Mem: Read/write the data from/to the Data Memory
WB: Write the result data into the register file
To see clearly how we do this "cut", we can reuse the microarchitecture in Lec 03.

This is just an illustration; the real partitioning may be a bit different. For example, in the first version of the design, the PC Logic is in the Execute stage.
After pipelining our processor, the clock cycle diagram for executing the instructions looks as follows

We can see that after clock cycle 5 completes, one instruction finishes every cycle, thus CPI = 1. We also call the first 4 cycles overhead, as nothing completes during this time.
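As a quick check based on the diagram: for N instructions on this 5-stage pipeline with no hazards,

$$
\text{Total cycles} = 5 + (N - 1) = N + 4,\qquad
\text{effective CPI} = \frac{N + 4}{N} \approx 1 \text{ for large } N
$$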
However, this pipeline isn't perfect. You may notice that not every instruction needs the full 5 stages: for example, sw doesn't need WB, and branch and DP Reg/Imm don't need Mem. Besides, there are still some hazards that we may encounter. (We will see them later in this lecture.)
This lecture uses the classic five-stage pipeline as a foundational model. However, the fundamental instruction phases (Fetch, Decode, Execute, Memory Access, and Writeback) remain the same across deeper pipeline architectures.
Implement a Pipelined Processor
As we have mentioned above, we need to "cut" our microarchitecture into 5 stages. To do this "cut", we are basically adding registers between each stage.

We have 5 stages, thus we need 4 "cuts"/registers. By adding registers, we can clearly see that now our critical path is shortened significantly! And this practice is the soul of pipelining!
Pipeline Register
Naming convention: We name the register by the target that it feeds the signal to. For example, the register D is named D because here the signals coming from the source (fetch stage) will go into the target (decode stage).
Real processor design: In Verilog, we use several registers to compose a big pipeline register, like the decode pipeline register shown as follows,
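A minimal sketch of what such a decode pipeline register might look like is given below; the port names (InstrF/InstrD, PCF/PCD) and the StallD/FlushD controls are assumptions that follow the F/D naming convention and the stalling/flushing discussed later in this lecture.

```verilog
// Decode-stage pipeline register (sketch; signal and port names are assumed).
// At every posedge, the D-side registers copy the values produced by the F stage.
module pipereg_D (
    input             clk,
    input             StallD,   // hold the current values (load-use stall)
    input             FlushD,   // clear to a bubble (taken branch)
    input      [31:0] InstrF,
    input      [31:0] PCF,
    output reg [31:0] InstrD,
    output reg [31:0] PCD
);
    always @(posedge clk) begin
        if (FlushD) begin
            InstrD <= 32'b0;   // a cleared register behaves like a NOP
            PCD    <= 32'b0;
        end else if (!StallD) begin
            InstrD <= InstrF;  // left-hand side copies from the right-hand side
            PCD    <= PCF;
        end
    end
endmodule
```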
So, at each posedge of the clock, every register on the left-hand side copies the value from its right-hand side. This will help us understand stalling and flushing later.
You may notice that besides adding the 4 registers, we also made some changes to some signals.

Delayed Signals
Several signals, like RegWrite, MemtoReg, etc., are delayed. Their names are also changed at each stage by adding the suffix F, D, E, M, or W depending on which stage the signal is at.
We can think of this delay as follows: after we "cut" the microarchitecture, we still need to "store" the corresponding Control Unit signals and Datapath signals to make sure that the operation done in each stage works correctly.
The cut of rdW
The rd is no longer connected directly from the instruction decoder to the register file. Instead, we delay the rd signal by 3 stages and connect rdW to the register file's write port, which means the rd is connected back to the register file only when the instruction reaches the WB stage.
This is because we are using pipelining! Suppose an add instruction that writes to x1 is immediately followed by a sub instruction that writes to x4.
If our rd were connected directly to the instruction decoder, the correct rd (x1) of the add would be overwritten by the rd (x4) of the sub instruction before the add finishes its Writeback!
The multiplexer at PC
This multiplexer is added for the branch instructions and correspondingly, the PC signal is delayed by 2 stages so the multiplexer will choose between PCF and PCE.
This is because in the Fetch stage, the PC keeps increasing by 4 without waiting for the actual completion of one instruction. However, if our branch is taken, we should use the "old" PC (PCE) to add to ExtImm.
Pipeline Hazards
Pipelining is not perfect; sometimes it can run into trouble. These troubles are called pipeline hazards, and normally we have the following three types of hazards
Structural Hazards
Structural hazards happen when two different instructions attempt to use the same resource (usually memory) at the same time.
For example, in the following diagram, at the fourth clock cycle both the lw and the sw are accessing the same memory, because in real life IROM and DMEM are both in the same RAM.

Data Hazards
Data hazards happen when we attempt to use data before it is ready, for example when an instruction's source operand(s) are produced by a prior instruction still in the pipeline. This hazard is also known as a RAW (read after write) hazard or true data dependency, and it occurs very frequently in practice.
For example, in the following diagram, the destination register rd of the add instruction is needed as a source register rs by the following sub, or, and and instructions.

Control Hazards
Control hazards happen when we attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated. This usually happens in branch/jump instructions.
For example, in a branch instruction, the BTA (Branch Target Address) can only be known after the Execute stage. By then, two more instructions will already have been fetched, and if the branch is taken, these two instructions shouldn't be completed.

Handle Pipeline Hazards
In this part, we will introduce how to handle the 3 types of pipeline hazards we have encountered above. For the purposes of this course, we will focus more on the handling of data hazards.
Handle Structural Hazards
We can simply fix this hazard by using separate instruction and data memories.
Handle Data Hazards
To handle data hazards, we will introduce five methods,
Use different clock edges
This means that we write to the register file and to the pipeline registers on different edges of the clock.
For example, in the following diagram, we use negedge for register file and posedge for the pipeline state registers.

Following the above convention, here is the updated microarchitecture of our processor

Using this technique to analyze the data hazard example above, we can see from the following graph that the third instruction (and) will successfully read the updated value stored in register s8.

Given that, we will summarize the pros and cons of this technique as follows,
Pros:
This technique can help reduce the "troublesome" instructions from 3 to 2 in our example above.
Cons:
It uses both edges of the same clock, a practice discouraged in CG3207.
Half a cycle may not be enough to write the pipeline state register. In other words, reading the value from the register file and writing it to a pipeline state register may take longer than half of a clock cycle.
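To make the idea concrete, below is a minimal sketch of a register file that is written on the negedge, while everything else in the datapath keeps using the posedge (module and port names are illustrative assumptions).

```verilog
// Register file written on the falling clock edge (sketch; names are assumed).
module regfile (
    input         clk,
    input         RegWriteW,   // write enable coming from the Writeback stage
    input  [4:0]  rs1D,
    input  [4:0]  rs2D,
    input  [4:0]  rdW,
    input  [31:0] ResultW,
    output [31:0] RD1,
    output [31:0] RD2
);
    reg [31:0] regs [0:31];

    // Writing at the negedge (mid-cycle) lets an instruction that reads the
    // same register in the second half of that cycle see the new value.
    always @(negedge clk)
        if (RegWriteW && rdW != 5'b0)
            regs[rdW] <= ResultW;

    // Reads are combinational; x0 always reads as zero.
    assign RD1 = (rs1D == 5'b0) ? 32'b0 : regs[rs1D];
    assign RD2 = (rs2D == 5'b0) ? 32'b0 : regs[rs2D];
endmodule
```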
Inserting NOPS
In RISC-V, we can use addi x0, x0, 0 as a NOP.
The basic idea is that when there exists a true data dependency between two instructions, we can insert NOPs between these instructions to "delay" the data "transfer". So, if we have already implemented the first technique of using different clock edges, we should make sure the number of NOPs or independent useful instructions between the two instructions with the true data dependency is 2.
This magic number 2 depends on the microarchitecture of the processor; in other words, on the number of stages in your pipeline.
For example, in the following diagram, add and sub have a true data dependency. We can either insert two NOPs between add and sub or replace one of the NOPs with an independent useful instruction such as or.

Now, we can summarize the pros and cons of this technique
Pros:
It is easier to implement, as the NOPs are inserted by the compiler at compile time.
Inserting NOPs can solve almost all the hazards.
Cons:
Inserting NOPs wastes code memory.
As NOPs do nothing, we don't treat them as real instructions when calculating the CPI. Thus, with more clock cycles and the same number of useful instructions, the CPI increases and so does our CPU Time.
To know the number of NOPs or independent useful instructions to be inserted, the compiler needs to know the microarchitecture, such as the number of stages in our pipelined processor and whether we have used different clock edges to write to the register file and pipeline state registers. This makes the code less portable.
Data Forwarding
This is the most challenging technique introduced for handling data hazards.
The basic idea in data forwarding is that the data is available on some pipeline register before it is written back to the register file (RF). So, we can take that data from where it is present, and pass it to where it is needed.
Execution Stage
In this part, we will see the forwarding circuitry for W -> E and M -> E.
In the Execute stage, we check whether a register read (rs1E or rs2E) by the instruction currently in the Execute stage matches the register written (rdM or rdW) by the instruction currently in the Memory or Writeback stage. If so, we forward the ready result (usually the ALUResult at the Memory or Writeback stage) to the Execute stage, so the ALU input now comes from the M or W stage rather than from the E-stage pipeline register.
For example, we can see how the above forward logic works in the following graph,

Suppose we have applied the Use different clock edges technique; now the add instruction only has a true data dependency with the sub and or instructions.
In the sub instruction's E stage, we notice it needs rs1E to be the register s8. In the add instruction, its rdM in the M stage is register s8 and it matches rs1E. Thus, we can forward the ready ALUResult from the add instruction's M stage to the E stage of the sub instruction.
Similarly, in the or instruction's E stage, it needs rs2E to be the register s8. In the add instruction, its rdW in the W stage is register s8 and it matches rs2E. Thus, we can forward the ready ALUResult (in the W stage, it is ResultW) to the E stage of the or instruction.
To implement this forwarding logic, we will add a Hazard Unit, shown as follows, to our microarchitecture to specifically handle the hazard.

rs1/2E, rdM/W and RegWriteM/W are easy to pass as inputs to the Hazard Unit, but the outputs ForwardA/BE, which are used to control the multiplexers for the ALU's SrcA and SrcB, should follow exactly the following three conditions
Condition 1: rs1/2E == rdM/W is about the match.
Condition 2: RegWriteM/W ensures that the instruction (in our example, add) really writes to the register. If not (like sw, branch), there won't be any data hazard, thus no need to do data forwarding.
Condition 3: rdM/W ≠ 0 ensures that we are not forwarding from the x0 register, as there is no need to do so.
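Putting the three conditions together, the forwarding selects might look like the sketch below (the 2-bit mux encoding and the priority given to the M stage over the W stage are assumptions; the M-stage result belongs to the newer instruction, so it should win).

```verilog
// Part of the Hazard Unit: E-stage forwarding selects (sketch; encoding assumed).
module forward_e (
    input  [4:0]     rs1E, rs2E, rdM, rdW,
    input            RegWriteM, RegWriteW,
    output reg [1:0] ForwardAE, ForwardBE
);
    always @(*) begin
        // SrcA: prefer the newer M-stage result over the W-stage result.
        if (RegWriteM && (rdM != 5'b0) && (rdM == rs1E))      ForwardAE = 2'b10; // ALUResultM (M -> E)
        else if (RegWriteW && (rdW != 5'b0) && (rdW == rs1E)) ForwardAE = 2'b01; // ResultW    (W -> E)
        else                                                  ForwardAE = 2'b00; // no hazard: E-stage register value

        // SrcB: same checks using rs2E.
        if (RegWriteM && (rdM != 5'b0) && (rdM == rs2E))      ForwardBE = 2'b10;
        else if (RegWriteW && (rdW != 5'b0) && (rdW == rs2E)) ForwardBE = 2'b01;
        else                                                  ForwardBE = 2'b00;
    end
endmodule
```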
In total, we need 2 × 2 + 2 = 6 five-bit comparators to do this data forwarding: rs1E and rs2E each need two comparators, against rdM and rdW (2 × 2 = 4), and rdM and rdW each need one more comparator against 0 (another 2), whose results can be shared between the ForwardAE and ForwardBE signals.
For DP-Reg instructions, the value is available at the Memory stage or at the Writeback stage before it is written back to the register file.
Mem-to-Mem Copy
In this section, we will see the forwarding circuitry for W -> M.
Another situation where we should do data forwarding is when a lw is followed by a sw, and the sw tries to store the value from the register (its rs2) that the lw loads into (its rd).
In this case, we need to check if the register used in the Memory stage by the sw instruction (rs2) matches the register written by the lw in the Writeback stage (rd). If so, we forward the result.

And our Hazard Unit will be updated to the following,

By now, we have added ForwardM as another output of our Hazard Unit, with rs2M and rdW as two new inputs. The logic to decide when to set ForwardM is as follows,
Condition 1: rs2M == rdW is the basic match check.
Condition 2: MemWriteM ensures that the instruction at the Memory stage is the sw. (See Lec 03; same as above, as only sw has MemWrite == 1.)
Condition 3: MemtoRegW ensures that the instruction at the Writeback stage is the lw.
Condition 4: rdW ≠ 0 ensures that the lw at the Writeback stage has a non-x0 destination.
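Combining the four conditions, a possible sketch of the ForwardM logic is:

```verilog
// Part of the Hazard Unit: W -> M forwarding for a lw followed by a sw (sketch).
module forward_m (
    input  [4:0] rs2M, rdW,
    input        MemWriteM,   // instruction in M is a store (sw)
    input        MemtoRegW,   // instruction in W is a load (lw)
    output       ForwardM
);
    assign ForwardM = MemWriteM && MemtoRegW
                   && (rdW != 5'b0)    // destination is not x0
                   && (rs2M == rdW);   // store's data register matches the load's rd
endmodule
```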
Notes
For the lw instruction, the value is ready only at the Writeback stage (after the Memory stage).
All forwarding must be done vertically (i.e., within the same clock cycle in the pipeline diagram; we cannot forward backwards in time).
For M -> E forwarding, the value must be ready at the Memory stage and the very next instruction must use that value as one of its ALU sources.
For W -> E forwarding, the instruction after the next one (2 instructions later) should use that value as one of its ALU sources.
W -> M forwarding only happens for a lw followed by a sw that meets the requirements at the beginning of this section.
If this Mem-to-Mem copy is used together with the unoptimized version of the load-and-use hazard handling, a stall will be triggered and one bubble (NOP) will be inserted. In other words, if you use the basic version of the load-and-use stall below, the Mem-to-Mem copy won't have any effect.
Load and Use Hazard: Stalling
Suppose we have a lw instruction whose rd is one of the rs of the following instruction, for example an and instruction. As the and follows closely after the lw, we run into the trouble shown below!

The problem is that in the 4th clock cycle, the and instruction needs s7, but as the lw is still in the Mem stage, s7 is not yet available! To solve this problem, we need to stall the Decode and Fetch stages for the and and or instructions in the above example. A result of this is that during the 6th clock cycle, nothing completes.

To implement this, the circuitry looks as follows,

We stall the PC register so that whatever is in the Fetch stage stays there and doesn't change, and similarly for the pipeline register D and the Decode stage. However, as the clock moves forward by one cycle, the lw instruction moves on to the Memory stage, while whatever is in the Decode stage (the and instruction in our example) would move forward to the Execute stage. This is not what we want; thus, we flush the pipeline register E at the same time as we stall pipeline registers F and D.
The basic condition for detecting this hazard is
Condition 1: MemtoRegE ensures the instruction at the Execute stage is a lw.
Condition 2: (rs1D == rdE) | (rs2D == rdE) ensures that the following instruction actually uses the lw's rd. This is the earliest point at which we can detect such a load-use hazard.
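A sketch of how these two conditions could drive the stall and flush signals is shown below (lwStall is an assumed intermediate name; StallF, StallD and FlushE match the description above).

```verilog
// Part of the Hazard Unit: load-use hazard detection (sketch; names assumed).
module loaduse_stall (
    input  [4:0] rs1D, rs2D, rdE,
    input        MemtoRegE,          // instruction in E is a lw
    output       StallF, StallD, FlushE
);
    wire lwStall = MemtoRegE && ((rs1D == rdE) || (rs2D == rdE));

    assign StallF = lwStall;   // freeze the PC register
    assign StallD = lwStall;   // freeze pipeline register D
    assign FlushE = lwStall;   // insert a bubble into pipeline register E
endmodule
```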
After 1 cycle, the forwarding circuitry (W -> E) will take care of delivering the correct data. However, how do we combine it with our Mem-to-Mem copy circuitry, and deal with the case when some instructions don't use rs1 or rs2? Below is an example!
These extra exceptions are just to improve performance. Not implementing them won't cause any functional issues, but will result in some performance loss.
W -> D Forwarding
In the Use different clock edges technique, we use the negedge of the clock for the register file. What if we want to use the same posedge of the clock everywhere in our processor?
A problem will happen if we have the following instructions:
Suppose we denote the clock cycle in which the addi's Fetch happens as clock cycle 1. We may notice that after clock cycle 4, we should forward the result that will be stored in x5 in clock cycle 5 to the Decode stage of the add x6, x5, x2 so that it can use the updated value.

To handle this case, we can further modify our hazard unit to add two output signals Forward1/2D.
Condition 1: rs1/2D == rdW ensures the Writeback stage rd matches one of the Decode stage rs.
Condition 2: RegWriteW ensures that the Writeback stage instruction will write back to the register file.
Condition 3: rdW ≠ 0 ensures that the Writeback stage rd is not x0.
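Combining the three conditions, a possible sketch of the Forward1/2D logic is shown below; when a signal is set, the corresponding Decode-stage read value is replaced by ResultW.

```verilog
// Part of the Hazard Unit: W -> D forwarding selects (sketch).
module forward_d (
    input  [4:0] rs1D, rs2D, rdW,
    input        RegWriteW,
    output       Forward1D, Forward2D
);
    assign Forward1D = RegWriteW && (rdW != 5'b0) && (rs1D == rdW);
    assign Forward2D = RegWriteW && (rdW != 5'b0) && (rs2D == rdW);
endmodule
```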
Our updated circuitry will be as follows,

Handle Control Hazards
Unlike data hazards, which can be handled effectively using techniques like data forwarding, there is no perfect way to handle control hazards. In this section, we will introduce the flushing technique to handle them.
The key idea of flushing is to flush the two fetched instructions if the branch is taken. In our microarchitecture, if PCSrcE == 1, it means the branch is taken. Thus, the condition can be written as follows,
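A possible sketch of this condition in Verilog is shown below (combining FlushE with the load-use lwStall from the previous section is an assumption about how the two mechanisms share the E register's clear port).

```verilog
// Control hazard handling (sketch): if the branch in the Execute stage is
// taken (PCSrcE == 1), squash the two younger instructions fetched after it.
module branch_flush (
    input  PCSrcE,    // branch taken
    input  lwStall,   // load-use stall from the previous section
    output FlushD, FlushE
);
    assign FlushD = PCSrcE;              // clear pipeline register D
    assign FlushE = PCSrcE || lwStall;   // clear E on a taken branch or a load-use stall
endmodule
```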
And its circuitry will be as follows,

Complete Hazard Handling Circuitry
The following is the complete hazard handling circuitry we have introduced in this lecture. However, it is not based on the full microarchitecture from Lec 03.

For the finals, be prepared to see some challenging questions that use a pipeline design that is not 5 stages; it could be 7 stages or even 17 stages. Make sure you know how to analyze the execution time under these circumstances.