Lec 06 - Advanced Processor
In this lecture, we will explore different techniques to extract even more performance from the pipelined processor that we built in Lec 05!
Branch Prediction
An ideal pipelined processor would have a CPI of 1. The branch misprediction penalty is a major cause of increased CPI. As pipelines get deeper, branches are resolved later in the pipeline. Thus, the branch misprediction penalty gets larger, because all the instructions issued after the mispredicted branch must be flushed. To address this problem, most pipelined processors use a branch predictor to guess whether the branch should be taken.
In Lec 05, the way we handled control hazards was simply to predict that branches are never taken.
Branch misprediction penalty is the flushing of all the instructions issued after the mispredicted branch.
Static Branch Prediction
The simplest form of branch prediction checks the direction of the branch and predicts that backward branches are taken and forward branches are not.
Forward Branches
These are the branches that occur at the beginning of a loop to check a condition and branch past the loop when the condition is no longer met (e.g., in for and while loops). Loops tend to execute many times, so these forward branches are usually not taken.
Loop: beq s1, s2, Out
1st loop instr
.
.
.
last loop instr
j Loop
Out: fall out instr
Backward Branches
These are the branches that occur when a program reaches the end of a loop and branches back to repeat the loop (e.g., do-while loop). Again, because loops tend to execute many times, these backward branches are usually taken.
Dynamic Branch Prediction
However, branches, especially forward branches, are difficult to predict without knowing more about the specific program. Therefore, most processors use dynamic branch predictors, which use the history of program execution to guess whether a branch should be taken.
Dynamic branch predictors maintain a table of the last several hundred (or thousand) branch instructions that the processor has executed. The table, called a branch target buffer, includes the destination of the branch and a history of whether the branch was taken.
To see this clearly, consider the following loop. The loop repeats 10 times, and the branch out of the loop (bge s0, t0, done) is taken only on the last iteration.
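A loop of the following shape matches the description: s0 is the loop counter, t0 holds the bound of 10, and the loop body is left abstract (the surrounding details are illustrative).

      addi s0, zero, 0        # i = 0
      addi t0, zero, 10       # loop bound
Loop: bge  s0, t0, done       # branch out of the loop, taken only when i reaches 10
      ...                     # loop body
      addi s0, s0, 1          # i++
      j    Loop
done: next instr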
One-bit Dynamic Branch Predictor
A one-bit dynamic branch predictor remembers whether the branch was taken the last time and predicts that it will do the same thing the next time.
While the loop is repeating, it remembers that the bge was not taken last time and predicts that it should not be taken next time. This is a correct prediction until the last branch of the loop, when the branch does get taken. Unfortunately, if the loop is run again, the branch predictor remembers that the last branch was taken. Therefore, it incorrectly predicts that the branch should be taken when the loop is first run again.
In summary, a one-bit dynamic branch predictor mispredicts the first and last branches of a loop, so its accuracy is 80% if the loop repeats 10 times. In a loop with N iterations, the accuracy is $\frac{N-2}{N}$.
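As a quick sanity check for the 10-iteration loop above:

$$\text{accuracy} = \frac{N-2}{N} = \frac{10-2}{10} = 80\%$$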
The accuracy of 80% doesn't apply to all cases! It is the accuracy only when the loop repeats 10 times!
Two-bit Dynamic Branch Predictor
A two-bit dynamic branch predictor can decrease the number of mispredictions by having four states: Strongly Taken, Weakly Taken, Weakly Not Taken, and Strongly Not Taken.

Using the same example above, when the loop is repeating, it enters the Strongly Not Taken state and predicts that the branch should not be taken next time. This is correct until the last branch of the loop, which is taken and moves the predictor to the Weakly Not Taken state. When the loop is first run again, the branch predictor correctly predicts that the branch should not be taken and reenters the Strongly Not Taken state.
In summary, a two-bit dynamic branch predictor mispredicts only the last branch of a loop. Thus, when the loop has N iterations, its accuracy is $\frac{N-1}{N}$.
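For the same 10-iteration loop, this gives:

$$\text{accuracy} = \frac{N-1}{N} = \frac{10-1}{10} = 90\%$$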
Branch Delay Slot
In computer architecture, a delay slot is an instruction slot that is executed regardless of the outcome of the preceding branch: the instruction in the delay slot executes even if the branch is taken. The compiler fills the delay slot with an independent instruction that is safe to execute irrespective of the branch outcome.
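RISC-V does not define branch delay slots, so the sketch below uses MIPS-style assembly purely as an illustration. The instruction placed immediately after the branch always executes, so the compiler moves an instruction that does not depend on the branch into that slot:

        beq  $t0, $t1, Target   # branch
        addu $s0, $s1, $s2      # delay slot: executes whether or not the branch is taken
        ...
Target: next instr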
Speculative Execution
Speculative execution is a performance optimization technique where a computer system or CPU performs tasks ahead of time based on a guess, in order to avoid delays.
One good example is the branch prediction we have seen above. Another example is load speculation.
Speculative execution must have (hardware/software) mechanisms for
Checking to see if the guess was correct
Recovering from the effects of the instructions that were executed speculatively if the guess was incorrect.
Deep Pipelining
As we have seen from Lec 01, to increase performance, we would like to speed up the clock and/or reduce the CPI. For the CPI, we know that stalling and flushing will both increase the CPI.
As we saw with the technique of pipelining in Lec 05, it can reduce the clock cycle time and thus increase the clock speed. In the real world, aside from advances in manufacturing, the easiest way to speed up the clock is to chop the pipeline into more stages. Each stage contains less logic, so it can run faster. Nowadays, 8-20 stages are commonly used.
However, the maximum number of pipeline stages is limited by the pipeline hazards, sequencing overhead, and cost.
Micro-Operations
This technique is largely used in CISC (Complex Instruction Set Computer) processors.
It means that at run time, more complex instructions are decomposed into a series of simple instructions called micro-operations (micro-ops or μops). These micro-operations can be executed on simple datapaths.
The decoding process is done by the hardware, and the micro-operations need not even be valid instructions in the ISA. This also increases code density, resulting in less IROM usage.
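For example (purely illustrative, since the exact micro-ops are microarchitecture specific), a CISC-style instruction that adds a register to a value in memory might be cracked into three RISC-like micro-ops:

add [a0], t1        # hypothetical complex ISA instruction: memory += register
                    # decomposed by the decoder into:
lw  t2, 0(a0)       # micro-op 1: load the memory operand
add t2, t2, t1      # micro-op 2: perform the addition
sw  t2, 0(a0)       # micro-op 3: store the result back to memory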
Macro-op Fusion
This is exactly the opposite of the micro-operations. We have seen this earlier in Lec 04!
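As a reminder, one commonly cited RISC-V fusion candidate (illustrative only) is the pair that builds a 32-bit constant, which a decoder may fuse into a single internal operation:

lui  t0, 0x12345        # t0 = 0x12345000
addi t0, t0, 0x678      # t0 = 0x12345678; the pair can be fused into one macro-op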
Multiple Issue Processors
Superscalar Processors
A superscalar processor issues several instructions at a time, each of which operates on one piece of data. Thus it contains multiple copies of the datapath hardware to execute multiple instructions simultaneously. The figure below shows a block diagram of a two-way superscalar processor that fetches and executes two instructions per cycle.

Ideal Case
The ideal case for a two-way superscalar processor is that it can execute exactly two instructions on each cycle. This is shown in the following figure,

For this program, the processor has a CPI of 0.5. Designers commonly refer to the reciprocal of the CPI as the instructions per cycle, or IPC. This processor has an IPC of 2 on this program.
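A program of the following shape achieves this ideal (hypothetical instructions; all that matters is that the two instructions issued together are independent of each other):

lw  t0, 0(s0)       # cycle 1, slot 1
add t1, s1, s2      # cycle 1, slot 2
sub t2, s3, s4      # cycle 2, slot 1
and t3, s5, s6      # cycle 2, slot 2
or  t4, s1, s5      # cycle 3, slot 1
sw  s6, 4(s0)       # cycle 3, slot 2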
Real Case
As we all know, executing many instructions simultaneously is difficult because of dependencies. The following figure shows a pipeline diagram running a program with data dependencies. The dependencies in the code are indicated in blue.

Cannot issue simultaneously: The add instruction is dependent on s8, which is produced by the lw instruction, so it cannot be issued at the same time as lw.
Data Forwarding: Additionally, the add instruction stalls for yet another cycle so that lw can forward s8 to add in cycle 5.
Data Forwarding: The other dependencies (between sub and and based on s8, and between or and sw based on s11) are handled by forwarding results produced in one cycle to be consumed in the next.
This program requires five cycles to issue six instructions, for an IPC of 6/5 = 1.2.
Out-of-Order Processor
To cope with the problem of dependencies, an Out-of-Order (OoO) processor looks ahead across many instructions to issue independent instructions as rapidly as possible. The instructions can issue in a different order than that written by the programmer, as long as dependencies are honored so that the program produces the intended result. This exploits more of the program's Instruction-Level Parallelism (ILP) and thus also increases the IPC.
Consider running the same program above on a two-way superscalar out-of-order processor. The processor can issue up to two instructions per cycle from anywhere in the program, as long as dependencies are observed. The following figure shows the data dependencies and the operation of the processor.

The constraints on issuing instructions are:
Cycle 1
The lw instruction issues.
The add, sub, and and instructions are dependent on lw by way of s8, so they cannot issue yet. However, the or instruction is independent, so it also issues.
Cycle 2
Remember that a two-cycle latency exists between issuing lw and a dependent instruction, so add cannot issue yet because of the s8 dependence. sub writes s8, so it cannot issue before add, lest add receive the wrong value of s8. and is dependent on sub.
Only the sw instruction issues.
Cycle 3
On cycle 3, s8 is available (or, rather, will be when add needs it), so the add issues. sub issues simultaneously, because it will not write s8 until after add consumes (e.g., reads) it.
Cycle 4
The and instruction issues. s8 is forwarded from sub to and.
This out-of-order processor issues the six instructions in four cycles, for an IPC of 6/4 = 1.5, which is higher than that of the plain superscalar processor introduced above. In a real-world out-of-order processor, we will see three kinds of data dependencies (we have seen two of them in the example above):
Read After Write (RAW)
In the example above, the dependence of add on lw by way of s8 is a read after write (RAW) hazard. add must not read s8 until after lw has written it. Similarly, the dependence of sw on or by way of s11 and of and on sub by way of s8 are RAW dependencies.
This is the type of dependency we are accustomed to handling in the pipelined processor (Lec 05). To solve it, we can use
the data forwarding logic and
sometimes the stalling technique.
Write After Read (WAR)
The dependence between sub and add by way of s8 is called a write after read (WAR) hazard or an antidependence. sub must not write s8 before add reads s8, so that add receives the correct value according to the original order of the program.
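Concretely, using the two instructions from the example (their full forms, add s9, s8, t1 and sub s8, t2, t3, appear again in the register renaming discussion below), in program order:

add s9, s8, t1      # reads s8 (earlier in program order)
sub s8, t2, t3      # writes s8; must not complete before the add above has read s8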
A WAR hazard is not essential to the operation of the program. It is merely an artifact of the programmer’s choice to use the same register for two unrelated instructions. If the sub instruction had written s3 instead of s8, then the dependency would disappear and sub could be issued before add.
To solve it, we can use the register renaming technique which will be introduced later!
Write After Write (WAW)
This hazard is not shown in the example above. It is called a write after write (WAW) hazard or an output dependence. A WAW hazard occurs if an instruction attempts to write a register after a subsequent instruction has already written it. The hazard would result in the wrong value being written to the register. For example, in the following code, lw and add both write s7. The final value in s7 should come from add according to the order of the program. If an out-of-order processor attempted to execute add first and then lw, a WAW hazard would occur.
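The code referred to is of the following shape (the memory address and the add's source operands are placeholders):

lw  s7, 0(t3)       # writes s7 (earlier in program order)
add s7, t2, t4      # also writes s7; this value must be the final one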
WAW hazards are not essential either; again, they are artifacts caused by the programmer using the same destination register for two unrelated instructions.
To solve it, we can just discard the result of the unwanted instruction. For example, if the add instruction were issued first, then the processor could eliminate the WAW hazard by discarding the result of the lw instead of writing it to s7. This is called squashing the lw.
If we discard the value of s7 in the lw instruction, why do we still need to execute the lw? This is because we want to make sure there won't be a load access fault!
Besides discarding, we can also use register renaming to solve this hazard.
Implementing the Out-of-Order Execution
While the conceptual goal is to increase ILP, the hardware implementation requires a specific structure to ensure correctness. The diagram below illustrates a typical Out-of-Order processor implementation paradigm. It is divided into three distinct phases to balance speed with stability.

The Front End: In-Order Issue
The Instruction Fetch and Decode Unit retrieves instructions and issues them to the "Reservation Stations."
Behavior: This process happens In-Order.
Reason: We must issue in program order to correctly identify and track data dependencies (like the RAW hazards mentioned above) before the instructions are scattered to different units. At this stage, if operands are not ready, the instruction is not stalled; it is simply moved to a waiting area.
The Execution Core: Out-of-Order Execute
Once issued, instructions sit in Reservation Stations (RS). These buffers hold the instruction and wait for pending operands.
Behavior: The Functional Units (FUs) (Integer, Floating point, Load/Store) initiate execution Out-of-Order. They start exactly when their data is ready, regardless of the original program sequence.
The "Red Arrows" (Common Data Bus): As soon as a Functional Unit finishes, it broadcasts the result:
To Waiting RS: The result is forwarded immediately to any other Reservation Station waiting for this data (solving RAW hazards without stalling the fetch unit).
To the Commit Unit: The result is saved for the final update.
The Back End: In-Order Commit
The Commit Unit (often coupled with a Reorder Buffer or ROB) collects results from the execution units.
Behavior: It writes results to the architectural registers (the actual programmer-visible state) In-Order (program fetch order).
Why?: This is crucial for two reasons:
Precise Exceptions: If an instruction crashes (e.g., divide by zero), we ensure that only registers from previous instructions are updated. We must not accidentally save the result of a "future" instruction that executed early.
Branch Misprediction: If the processor guessed a branch wrong (speculation), the instructions executed after the branch must be discarded. Since the Commit Unit hasn't written them to the permanent registers yet, we can simply "flush" the Reorder Buffer to correct the machine state.
Register Renaming
In practice, out-of-order processors use a technique called register renaming to eliminate WAR and WAW hazards. Register renaming adds some nonarchitectural renaming registers to the processor. For example, a processor might add 20 renaming registers, called r0 to r19. The programmer cannot use these registers directly, because they are not part of the architecture. However, the processor is free to use them to eliminate hazards.
For example, in the example above, a WAR hazard occurred between the sub s8, t2, t3 and add s9, s8, t1 instructions based on reusing s8. The out-of-order processor could rename s8 to r0 for the sub instruction. Then, sub could be executed sooner, because r0 has no dependency on the add instruction. The processor keeps a table of which registers were renamed so that it can consistently rename registers in subsequent dependent instructions. In this example, s8 must also be renamed to r0 in the and instruction, because it refers to the result of sub. The following figure shows the same program from above on an out-of-order processor with register renaming.

The constraints on issuing instructions are:
Cycle 1
The lw instruction issues.
The add instruction is dependent on lw by way of s8, so it cannot issue yet. However, the sub instruction is independent now that its destination has been renamed to r0, so sub also issues.
Cycle 2
Remember that a two-cycle latency must exist between issuing lw and a dependent instruction, so add cannot issue yet because of the s8 dependence.
The and instruction is dependent on sub, so it can issue. r0 is forwarded from sub to and.
The or instruction is independent, so it also issues.
Cycle 3
On cycle 3, s8 is available, so the add issues. s11 is also available, so sw issues.
Now the out-of-order processor with register renaming issues the six instructions in three cycles, for an IPC of 2, which achieves the ideal case!
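For concreteness, the six-instruction program and its renamed form might look as follows. The registers that carry the dependencies (s8, s9, s11, and the renaming register r0) come from the discussion above; the remaining operands are placeholders.

# original program
lw   s8, 0(s1)        # produces s8
add  s9, s8, t1       # RAW on s8 (needs lw's result)
sub  s8, t2, t3       # WAR with add and WAW with lw on s8
and  s10, s8, t4      # RAW on s8 (needs sub's result)
or   s11, t5, t6      # produces s11
sw   s11, 4(s1)       # RAW on s11 (needs or's result)

# after renaming sub's destination (and its consumer in and) from s8 to r0
lw   s8, 0(s1)
add  s9, s8, t1
sub  r0, t2, t3       # no longer conflicts with add or lw over s8
and  s10, r0, t4
or   s11, t5, t6
sw   s11, 4(s1)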
VLIW Processor
In the sections above, we tried to increase parallelism from the hardware perspective. However, there is a technique that shifts the burden of identifying parallelism from the hardware to the compiler. This technique is called VLIW (Very Long Instruction Word).
VLIW is also called compile-time multiple issue.
In a VLIW processor, the compiler packs groups of independent instructions into a bundle, and the bundle can be thought of as one very long instruction. Hence the name.
Since determining the order of execution of operations (including which operations can execute simultaneously) is handled by the compiler, the processor does not need the scheduling hardware that the three methods described above require. Thus, VLIW CPUs offer more computing with less hardware complexity (but greater compiler complexity) than do most superscalar CPUs.
VLIW Processor Application
VLIW processors excel in specialized domains like
Digital Signal Processing (DSP),
Graphics Processing, and
Machine Learning
Based on these three applications, we can see that VLIW / compile-time multiple issue is rarely used in PC processors or microcontrollers.
VLIW excels in these domains because such applications rely heavily on repetitive, predictable loops and linear data access patterns. Unlike general-purpose software, these tasks involve very little uncertainty regarding control flow or memory addresses, which minimizes runtime hazards like complex branches or cache misses. This high level of determinism allows the compiler to aggressively optimize and schedule instructions statically, enabling VLIW architectures to achieve high performance without the power-hungry, complex hardware logic needed to handle the unpredictability of mainstream CPUs.
Remember: the key distinction is determinism. VLIW thrives on static, compile-time predictability (low entropy), whereas superscalar processors are necessary for high-entropy tasks (like operating systems) that require dynamic, runtime adaptability.
Loop Unrolling
Loop Unrolling is a compiler optimization that replicates the loop body multiple times to reduce control overhead (fewer branches and counter updates) and expose larger blocks of independent instructions. For VLIW processors, this is critical because it creates a large pool of instructions that can be scheduled in parallel. By unrolling, the compiler can find operations that do not depend on each other (such as the four independent lw pairs in the example) and pack them into a single Wide Instruction Word, effectively hiding latency and maximizing the utilization of parallel functional units.
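As a sketch of such an example, consider an unrolled loop computing c[i] = a[i] + b[i], with a0, a1, and a2 as the array pointers and a3 as the remaining element count (all assumed names). After unrolling by four, the four lw pairs are mutually independent and can be packed into wide instruction words:

Loop: lw   t0, 0(a0)       # a[i]        (pair 1)
      lw   t1, 0(a1)       # b[i]
      lw   t2, 4(a0)       # a[i+1]      (pair 2)
      lw   t3, 4(a1)       # b[i+1]
      lw   t4, 8(a0)       # a[i+2]      (pair 3)
      lw   t5, 8(a1)       # b[i+2]
      lw   t6, 12(a0)      # a[i+3]      (pair 4)
      lw   a4, 12(a1)      # b[i+3]
      add  t0, t0, t1
      add  t2, t2, t3
      add  t4, t4, t5
      add  t6, t6, a4
      sw   t0, 0(a2)
      sw   t2, 4(a2)
      sw   t4, 8(a2)
      sw   t6, 12(a2)
      addi a0, a0, 16      # advance the pointers by four elements
      addi a1, a1, 16
      addi a2, a2, 16
      addi a3, a3, -4      # one counter update covers four iterations
      bnez a3, Loop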
You can see the unrolling if you use different compiler flags, like -O1, -O2, and -O3.
Multithreading
Here, everything still runs on a single processor!
Multithreading starts from two problems:
Because the ILP of real programs tends to be fairly low, adding more execution units to a superscalar or out-of-order processor gives diminishing returns.
Memory is much slower than the processor. Most lw and sw instructions access a much smaller and faster memory called a cache. However, when the instructions or data are not available in the cache, the processor may stall for 100 or more cycles while retrieving the information from the main memory.
Multithreading is a technique that helps keep a processor with many execution units busy even if
the ILP of a program is low or
the program is stalled waiting for memory
To explain multithreading, we need to define two new terms
Process
A program running on a computer is called a process. Computers can run multiple processes simultaneously. For example, you can play music on a PC while surfing the web and running a virus checker.
For example, when you click an icon to run a program on your PC, the program is loaded into memory and starts running; once it is loaded and running, it is called a process.
Thread
Each process consists of one or more threads that also run "simultaneously". For example, a word processor may have one thread handling the user typing, a second thread spell-checking the document while the user works, and a third thread printing the document. In this way, the user does not have to wait, for example, for a document to finish printing before being able to type again.
The degree to which a process can be split into multiple threads that can run "simultaneously" defines its level of thread-level parallelism (TLP).
Multithreading (whether software or hardware) lets one processor appear to do multiple things at once.
Software Multithreading
In a conventional processor, the threads only give the illusion of running simultaneously. The threads actually take turns being executed on the processor under control of the operating system (OS). When one thread’s turn ends, the OS saves its architectural state, loads the architectural state of the next thread, and starts executing that next thread. This procedure is called context switching. As long as the processor switches through all threads fast enough, the user perceives all of the threads as running at the same time.
Usually, a context switch in this scenario happens either preemptively, when a timer interrupt tells the OS that the running thread's time slice has expired, or when the running thread blocks (e.g., waiting for I/O) or voluntarily yields.
This describes the software-based concurrency on a single-core CPU.
Hardware Multithreading
A hardware multithreaded processor contains more than one copy of its architectural state so that more than one thread can be active at a time.
For example, if we extended a processor to have four program counters and 128 registers, four threads could be available at one time. If one thread stalls while waiting for data from main memory, then the processor could context switch to another thread without any delay, because the program counter and registers are already available. Moreover, if one thread lacks sufficient parallelism to keep all execution units busy in a superscalar design, then another thread could issue instructions to the idle units.
Switching between threads can either be fine-grained or coarse-grained.
Fine-grained multithreading
Fine-grained multithreading switches between threads on each instruction and must be supported by hardware multithreading.
The advantage of fine-grained multithreading is that it reduces control and data hazards: because the interleaved threads are independent, the pipeline stages are split between the threads, so instructions from the same thread are farther apart in the pipeline and there are fewer control and data hazards overall.
This is temporal.
Simultaneous multithreading
This is also known as SMT, or hyperthreading (Intel's name for it). If one thread can't keep all execution units busy, another thread can use them, so instructions from different threads execute at the same time, without duplication of functional units.
This is both temporal and spatial.
Multithreading does not improve the performance of an individual thread, because it does not increase the ILP. However, it does improve the overall throughput of the processor, because multiple threads can use processor resources that would have been idle when executing a single thread.
Flynn's Taxonomy
There are four types of computer architectures in Flynn's taxonomy. We will introduce them one by one and give you some concrete examples of each.
Single Instruction stream Single Data stream (SISD)
It's essentially a sequential computer which exploits no parallelism in either the instruction or data stream.

One example is a single-thread or single-core processor.
Single Instruction stream Multiple Data streams (SIMD)
It's a computer which exploits multiple data streams against a single instruction stream to perform operations which may be naturally parallelized.
Exploits data parallelism

The examples are hardware accelerators and GPUs.
Multiple Instruction streams Single Data stream (MISD)
As its name suggests, it's multiple instructions operating on one data stream.

This is an uncommon architecture. Examples include the autopilot system of an aeroplane and the systolic arrays that are the heart of Google TPUs.
Multiple Instruction streams Multiple Data streams (MIMD)
Multiple autonomous processors simultaneously executing different instructions on different data.
Exploits thread-level/task parallelism

Examples of MIMD architectures include parallel/distributed systems, using either a single shared memory space or a distributed memory space.
Multiprocessors
Modern processors have enormous numbers of transistors available. Using them to increase the pipeline depth or to add more execution units to a superscalar processor gives little performance benefit and wastes power. Around the year 2005, computer architects made a major shift to building multiple copies of the processor on the same chip; these copies are called cores.
A multiprocessor system consists of multiple processors and a method for communication between the processors. This system can be divided into two categories based on whether it has a shared memory at any level:
Loosely Coupled Multiprocessor Systems
In this system, each node runs its own OS instance, and the nodes communicate by passing messages rather than through a shared memory. Within this kind of system, we have two concrete examples:
Clusters
In a clustered multiprocessor system, each processor has its own local memory system instead of sharing memory.
The nodes are set to perform the same task, controlled and scheduled by software. These nodes are typically homogeneous in hardware and software, housed in the same building/geography, interconnected by a dedicated network, and have shared resources. This system is often viewed as a single computer from the outside, but it actually contains many nodes (small computers) inside.
One example is the supercomputer.
Grid Computers
This system is very loosely coupled, which means diverse computers/nodes are connected via the internet and are geographically dispersed.
Within this system, each node performs a different task to reach a common goal. The nodes are typically heterogenous in hardware and autonomous in software.
One example is cryptocurrency mining.
Tightly Coupled Multiprocessor Systems
Tightly coupled multiprocessor systems are usually mounted on the same motherboard or within the same silicon die, communicate via shared memory, and are usually controlled by a single OS.

In this system, as you can see from the image above
Each processor executes different programs and works on different data
Each processor usually has one or more levels of private cache
The processors share many resources (e.g., higher-level caches, memory, I/O devices, the interrupt system, etc.)
All processors are connected using a system bus.
However, for all the processors to communicate smoothly via the shared memory, some mechanisms are needed to handle conflicts:
Arbitration mechanisms: If two processors attempt to use the same resource simultaneously, this mechanism is needed to deal with this situation.
Mutual exclusion mechanisms: This is used to protect resources (or a range of memory) which should not be used in a concurrent manner.
Cache coherence mechanisms: These ensure cache coherence across all cache levels.
Bus snooping and directory-based mechanisms are two examples.
In the tightly coupled multiprocessor system, we have two sub-categories, introduced below.
Multiprocessors can be used to do the following two things:
run more threads simultaneously, or
run a particular thread faster
Symmetric Multiprocessors
Symmetric multiprocessors (SMP) include two or more identical processors sharing a single main memory. The multiple processors may be separate chips or multiple cores on the same chip, for example Intel Core i3 and i5 (prior to Alder Lake).
Symmetric multiprocessors are good for situations like large data centers that have lots of thread-level parallelism available.
Heterogeneous Multiprocessors
Heterogeneous multiprocessors (HMP) incorporate different types of cores and/or specialized hardware in a single system. And it can take the following two forms:
a heterogeneous system can incorporate cores with the same architecture but different microarchitectures, each with different power, performance, and area trade-offs.
another heterogeneous strategy is accelerators, in which a system contains special-purpose hardware optimized for performance or energy efficiency on specific types of tasks.
Heterogeneous systems are good for systems that have more varying or special-purpose workloads, such as mobile devices.
Let's look at some real-world examples of heterogeneous multiprocessor systems!
ARM big.LITTLE
ARM big.LITTLE is a heterogeneous computing architecture coupling relatively battery-saving and slower processor cores (LITTLE) with relatively more powerful and power-hungry ones (big). It has the following 3 variants: clustered switching (cluster migration), the in-kernel switcher (CPU migration), and global task scheduling, the last of which is described below.
Global task scheduling
This is the most powerful and popular paradigm currently, as it can enable all physical cores at the same time. In this paradigm, the "burden" goes to the scheduler.
Within the scheduler, threads with high priority or computational intensity are allocated to the big cores, while threads with lower priority or less computational intensity, such as background tasks, can be performed by the LITTLE cores.

The Apple A14 uses this paradigm, and it has 11.8 billion transistors in total!
SIMD/Vector Processing
Our motivation here is amortization.
In standard processors, every instruction incurs significant overhead because fetching and decoding commands consumes both power and time. If a program needs to perform the same operation on multiple data points — such as adding eight pairs of numbers — a standard CPU wastes effort by fetching and decoding the "Add" command eight separate times. The solution is SIMD, which allows the processor to fetch and decode the command just once and apply it to all data points simultaneously. This strategy effectively "amortizes," or spreads out, the expensive management costs across many data elements to improve efficiency.
Packed SIMD
This first option places the control explicitly in the hands of the software. The programmer must intentionally use special vector instructions (e.g., VEC8_mul) to define how data is packed and processed. While this course treats the following two terms similarly, there is a nuance in terminology:
Packed SIMD: Refers to fixed-width registers (e.g., "Process exactly 4 items at once"). This is common in standard CPUs like Intel AVX or ARM NEON.
Vector Processing: Often implies variable-length processing (e.g., "Process a list of n items"). This is common in supercomputers or RISC-V Vector extensions.
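As a sketch of what explicit vector code looks like, here is an element-wise multiply written with the RISC-V Vector extension mentioned above (the register assignments are assumptions: a0 holds the element count, a1 and a2 point to the sources, and a3 to the destination):

Loop: vsetvli t0, a0, e32, m1     # take as many 32-bit elements as the hardware allows
      vle32.v v0, (a1)            # load a vector of elements from the first source
      vle32.v v1, (a2)            # load a vector from the second source
      vmul.vv v2, v0, v1          # one instruction multiplies all of them
      vse32.v v2, (a3)            # store the result vector
      slli    t1, t0, 2           # bytes processed = elements x 4
      add     a1, a1, t1
      add     a2, a2, t1
      add     a3, a3, t1
      sub     a0, a0, t0          # elements remaining
      bnez    a0, Loop            # repeat until the whole array is done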

GPU
The second option, used by GPUs, shifts the complexity from the programmer to the hardware. This is often called SIMT (Single Instruction, Multiple Threads). Instead of writing complex vector code, the programmer writes standard scalar instructions (like a simple mul) intended for a single thread. The GPU hardware then performs the vectorization automatically:
Implicit Grouping: The hardware dynamically bundles these individual threads into groups (called "Warps" by NVIDIA or "Wavefronts" by AMD).
Lockstep Execution: These bundles are then executed on the hardware's SIMD units in lockstep.
This allows the programmer to think in terms of simple, single threads, while the hardware ensures the massive throughput of vector processing.

Systolic Arrays
Our motivation here is to eliminate memory bottleneck.
In standard processors (like the SIMD examples we discussed), data is frequently read from memory, processed, and written back. This constant access to memory (registers or cache) creates a bottleneck. The Systolic Array solves this by mimicking the rhythm of a beating heart ("systole"). Instead of each processor acting independently, they form a tightly coupled network. Data flows from memory into the array once and is then rhythmically passed from neighbor to neighbor.
Its main mechanism is called Rhythmic Data Flow. A systolic array consists of a grid of Processing Elements (PEs). Data flows through the array in a wave-like fashion. When a PE finishes a calculation, it passes the data directly to its neighbor rather than writing it back to memory.

The Trade-off: This design sacrifices flexibility (it is hard to do general-purpose logic like "If/Else") and limits available registers.
The Reward: In exchange, it achieves immense efficiency for specific tasks like Matrix Multiplication. Since operand data and partial results are stored within the passing wave, the system drastically reduces the need to access external buses or caches, saving power and increasing operation density.
TPU
One application of the systolic array is the Google TPU. The Google Tensor Processing Unit (TPU) v1 is a real-world implementation of a systolic array, designed specifically for heavy compute workloads like Machine Learning inference. It is not a standalone CPU; it sits on a PCIe bus and acts as a coprocessor, receiving instructions from a host CPU. Because the TPU runs massive, complex tasks (like multiplying two huge matrices) with a single command, it utilizes a CISC instruction set.
The heart of the TPU is the Matrix Multiply Unit, a massive systolic array containing over 65,000 processing units.
Data Flow: It reads in weights and data (activations) into local buffers (Weight FIFO and Unified Buffer). These values flow through the Matrix Unit, performing 8-bit multiply-accumulate operations at a rate of up to 92 Tera-operations per second.
Pipeline: The results flow out to an Activation Unit (which applies hardwired functions like ReLU) and can be fed back into the Unified Buffer for the next layer of calculation. This design creates a pipeline optimized entirely for the math required by deep neural networks.
