Lec 01 - History, Technology, Performance
Intro
Below your program
This famous image appears as the abstraction figure in Harris & Harris. Please go back and review the work of art before continuing!

Basically, these abstractions fall into three categories:
Application software: Written in high-level language
System software: Operating System, a.k.a. service code
Handling input/output
Managing memory and storage
Scheduling tasks & sharing resources
Hardware: Processor, memory, I/O controllers
Levels of Program Code

There are three levels when we run a high-level language program (like C) on a processor
High-level language:
Level of abstraction closer to problem domain
Provides for productivity and portability
Assembly language: Textual representation of instructions
Hardware representation:
Binary digits (bits)
Encoded instructions and data
ISA
Again, go back to the Harris & Harris notes about architecture and microarchitecture, as they are pretty well written!
ISA = assembly language programmer's / firmware engineer's view of the processor. Microarchitecture = hardware engineer's view of the processor.
ABI
What is an ABI?
An ABI (Application Binary Interface) builds on the ISA, which already defines what instructions the CPU understands, but goes further by specifying how programs interact with the OS, libraries, and hardware at the binary level.
What does ABI specify?
Usually, the ABI will specify under an ISA, what is the purpose of each register. (For example, the RISC-V Register File is defined by its ABI)
Why is ABI important?
With ABI, we can get binary portability. For example, a program compiled on one Linux x86-64 system can run on another Linux x86-64 system, because both respect the same ABI rules.
So, think of ISA as the language (words + vocabulary). Then ABI is like the etiquette and customs:
Not just what words mean, but how you greet people, how you exchange gifts, how you behave. If two people speak the same language but follow different customs, they’ll still miscommunicate. Similarly, two programs can be compiled for the same ISA, but if they don’t agree on the ABI, they’ll misinterpret function calls or data.
Todos
Compare and contrast API and ABI
In short: API = source-level contract; ABI = binary-level contract.
|  | API | ABI |
| --- | --- | --- |
| Level | Source code level | Binary / machine code level |
| Definition | The set of public types/variables/functions that you expose from your application/library. | How the compiler builds an application. |
| Example | The C standard library function printf() is an API. | It defines things such as (but not limited to): how parameters are passed to functions (registers/stack); who cleans parameters from the stack (caller/callee); where the return value is placed. |
Read up on Calling Convention on RISC-V
In this course, we use RISC-V RV32I. So, the calling convention in RISC-V RV32I makes sure that every compiler, OS, and library agrees on where arguments/return values go. This way, code compiled separately (say, your program + a math library) can work together at the binary level.
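A small hypothetical C sketch of what this means in practice (the function name is mine): under the standard RV32I calling convention, the first integer arguments arrive in registers a0 to a7 and the result is returned in a0, so any caller compiled against the same ABI agrees with this function on where values live.

```c
/* Hypothetical example of the RV32I calling convention in action.
 * Under the standard ABI, the two arguments arrive in registers
 * a0 and a1, and the return value goes back in a0. Any caller
 * compiled for the same ABI (your code, libc, a math library)
 * agrees on this, so separately compiled binaries work together.
 */
int add_scaled(int x, int y)   /* x -> a0, y -> a1 */
{
    return x + 2 * y;          /* result -> a0 */
}
```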
Which are the popular ISAs?
This is simple. Nowadays, we mainly have the following ISAs
x86
ARM
RISC-V
MIPS
History
Digital Hardware Market Segments
ASIC (application specific integrated circuit)
ASSP (application specific standard product)
FPGA (field programmable gate array)
Application Processor Market
Application processors = Processors used in smartphones / tablets (ARMv8A / 9A instruction set)
In AY25/26 Sem 1, a question that appeared in the final indicates that nowadays the fastest processors (e.g., AMD EPYC) can execute around 1000 billion instructions per second, and cutting-edge GPUs have on the order of 10,000 CUDA cores.
Moore's Law
In 1965, Intel's Gordon Moore predicted that the number of transistors that can be integrated on a single chip would double about every two years.
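To get a feel for the growth rate: doubling every two years means that over a decade the transistor count grows by a factor of about $2^{10/2} = 32$.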
Technology
Power Trends
In CMOS IC technology,
$$\text{Power} = \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency}$$
For example, suppose a new CPU has
85% of the capacitive load of the old CPU
15% voltage reduction and 15% frequency reduction
The power reduction proportion is
$$\frac{P_{\text{new}}}{P_{\text{old}}} = \frac{(C_{\text{old}} \times 0.85) \times (V_{\text{old}} \times 0.85)^2 \times (F_{\text{old}} \times 0.85)}{C_{\text{old}} \times V_{\text{old}}^2 \times F_{\text{old}}} = 0.85^4 \approx 0.52$$
But, we have reached the power wall nowadays,
we can't reduce voltage further
we can't remove more heat
How else can we improve performance?
Issues and Modern Trends
To solve the power wall problem above, the modern trends are now as follows,
Limited instruction-level parallelism (ILP), power issues
Cloud computing
Multi-core/processor systems, clusters
Heterogeneous systems, hardware accelerators, hardware/software codesign, reconfigurable computing
Application-specific instruction-set processors: NPU, TPU, Bitcoin mining, etc.
Reaching the limits of silicon:
Use compound semiconductors such as GaN, InP, etc.
The communication bottleneck — within and between chips: the data transfer speed between processor and memory
SoCs, multi-chip modules
3D ICs/stacking (e.g., in HBM)
Optical interconnects: Use light instead of electricity to communicate
In-memory computing
Leakage current & short channel effects
Multi-gate (3D) FETs — FinFET and gate-all-around (GAA) FETs.
Todos
Read up on chip binning
Chip binning = sorting tested chips into categories ("bins") based on their measured performance characteristics. For example, the same silicon die may become a "Core i9" if it passes high-frequency tests, or a "Core i5" if it is only stable at lower clocks.
This maximizes yield: instead of throwing away chips that can't meet the top spec, manufacturers sell them in lower bins.
Performance
Throughput vs. Response Time
Response time (execution time) – the time between the start and the completion of a task.
Throughput – the rate at which a system completes its tasks. It is defined as the total amount of work accomplished divided by the time taken to complete that work ($\text{Throughput} = \frac{\text{Work done}}{\text{Time taken}}$).
We need different performance metrics, as well as different sets of benchmark applications, for personal mobile devices, embedded systems, and desktop computers (which are more focused on response time) versus servers (which are more focused on throughput).
Relative Performance
We define performance to be
$$\text{Performance}_X = \frac{1}{\text{Execution time}_X}$$
"X is $n$ times faster than Y" is equivalent to
$$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$
Example: time taken to run a program
10s on A, 15s on B
$$\frac{\text{Performance}_A}{\text{Performance}_B} = \frac{\text{Execution time}_B}{\text{Execution time}_A} = \frac{15\text{s}}{10\text{s}} = 1.5$$
So A is 1.5 times faster than B
CPU Clocking
Clock period / Clock cycle time / Cycle time ($T_c$): the duration of a clock cycle. It is the reciprocal of the clock frequency (rate), e.g., 250ps = 0.25ns = $250 \times 10^{-12}$ s.
Clock frequency (rate): cycles per second, e.g., 4.0GHz = 4000MHz = $4.0 \times 10^9$ Hz.
Clock cycles: the total number of clock cycles needed to finish a task.
CPU Time: the actual time taken by the CPU to execute a program, i.e., $\text{CPU Time} = \text{Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{Clock Cycles}}{\text{Clock Rate}}$.
Example
Computer A: 2GHz clock, 10s CPU time
Designing Computer B
Aim for 6s CPU time
Can use a faster clock, but this causes 1.2× as many clock cycles
How fast must Computer B clock be?
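One way to work this out, using $\text{CPU Time} = \text{Clock Cycles} / \text{Clock Rate}$:

$$\text{Clock Cycles}_A = \text{CPU Time}_A \times \text{Clock Rate}_A = 10\text{s} \times 2\text{GHz} = 20 \times 10^9$$

$$\text{Clock Rate}_B = \frac{\text{Clock Cycles}_B}{\text{CPU Time}_B} = \frac{1.2 \times 20 \times 10^9}{6\text{s}} = 4\text{GHz}$$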
Instruction Count (IC) and CPI
Here, "instruction" means the machine instructions (e.g. RISC-V instructions in our course)
Instruction Count (IC) : the total number of instructions that a CPU must execute to complete a given program or task.
It is determined by program, ISA and compiler.
Cycles per Instruction (CPI): as its name suggests, the average number of clock cycles each instruction takes to execute.
It is determined by the CPU hardware, but can be affected by many other factors.
Notes
If different instructions have different CPI, then the average CPI is affected by instruction mix.
CPI is not a fixed property bound to the processor; it is a statistic defined by $\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}}$.
Example
IC and CPI are affected by almost everything
Algorithm
The algorithm determines the sequence of operations.
Even if two implementations solve the same problem, one may require fewer high-level steps (e.g., binary search vs. linear search).
This directly changes the instruction count (IC) — and possibly the mix of instructions (e.g., more multiplications vs. more additions).
Since different types of instructions can take different cycles (e.g., loads may stall on memory, multiplications may take multiple cycles), the average CPI changes.
Programming language
A high-level algorithm written in C vs. Python (or even C vs. Java) will not translate to the same low-level instructions.
Some languages encourage more abstraction (e.g., object-oriented overhead, runtime checks), which can generate more instructions when compiled or interpreted.
So the same algorithm in different languages can have very different instruction sequences and counts.
Compiler
Even with the same programming language and algorithm, different compilers, or different compiler optimization levels (`-O0`, `-O2`, `-Ofast`), can produce very different machine code. Example: a smart compiler might use one `mul` instruction, while a naive compiler generates a loop of additions (see the sketch after this list). This changes both the instruction count and the instruction mix, and therefore affects CPI.
ISA
If you change the ISA (e.g., RISC-V vs. x86 vs. ARM), the same high-level operation may take very different numbers and types of instructions.
Example: a complex x86 instruction (like `rep movsb`) might do in one instruction what RISC-V needs a loop of multiple instructions to do. Even within the same ISA, instruction set extensions (e.g., the RISC-V `M` extension for multiplication/division) change CPI by replacing multi-cycle software routines with single hardware instructions.
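As a hypothetical C sketch of the compiler point above (the function names are mine): both functions compute the same product, but the first can compile to a single multiply, while the second, taken literally, becomes a loop of additions and branches, giving a very different instruction count and mix.

```c
/* Two ways to compute x * n (assuming n >= 0). A compiler given the
 * first version can emit a single multiply instruction; the second
 * version, compiled naively, becomes a loop of adds, compares, and
 * branches. Same result, very different dynamic instruction count
 * and instruction mix.
 */
int mul_direct(int x, int n)
{
    return x * n;                  /* likely one mul instruction */
}

int mul_by_addition(int x, int n)
{
    int result = 0;
    for (int i = 0; i < n; i++)    /* n iterations of add + branch */
        result += x;
    return result;
}
```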
IC is data dependent
Instruction Count (IC) is data-dependent. The compiler can tell us the static instruction count (how many instructions exist in the program), but the dynamic instruction count (how many are actually executed) depends on runtime inputs and data. For example, a while loop may run 10 times or 1,000,000 times depending on the input, so the actual IC cannot be determined at compile time.
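A small hypothetical C illustration: the loop below compiles to a fixed handful of instructions (static IC), but the number of instructions actually executed (dynamic IC) depends entirely on the runtime input.

```c
#include <stdio.h>

/* The loop body compiles to a few instructions (static IC), but the
 * number of instructions actually executed (dynamic IC) depends on
 * the runtime value of n, which the compiler cannot know. */
long count_down(long n)
{
    long steps = 0;
    while (n > 1) {
        n = (n % 2 == 0) ? n / 2 : n - 1;  /* data-dependent work */
        steps++;
    }
    return steps;
}

int main(void)
{
    printf("%ld\n", count_down(10));        /* few iterations */
    printf("%ld\n", count_down(1000000));   /* many more iterations */
    return 0;
}
```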
CPI in more detail
If different instruction classes take different numbers of cycles, then
$$\text{Clock Cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{Instruction Count}_i)$$
Weighted average CPI:
$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left(\text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}}\right)$$
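A quick illustration of the weighted average with made-up numbers: suppose 50% of executed instructions are ALU operations with CPI 1, 30% are loads with CPI 2, and 20% are branches with CPI 3. Then

$$\text{CPI} = 0.5 \times 1 + 0.3 \times 2 + 0.2 \times 3 = 1.7$$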
A CPI example:
Computer A: Cycle Time = 250ps, CPI = 2.0
Computer B: Cycle Time = 500ps, CPI = 1.2
Same ISA, compiler -> Same IC
Which is faster, and by how much?
$$\text{CPU Time}_A = \text{IC} \times 2.0 \times 250\text{ps} = \text{IC} \times 500\text{ps}$$
$$\text{CPU Time}_B = \text{IC} \times 1.2 \times 500\text{ps} = \text{IC} \times 600\text{ps}$$
$$\frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{600}{500} = 1.2$$
So, from the result, we can say that
A is 1.2x (20%) faster than B, or equivalently,
B takes 20% more time than A.
Performance Summary
Performance depends on
Algorithm: affects IC, possibly CPI
Programming language: affects IC, CPI
Compiler: affects IC, CPI
Instruction Set Architecture (ISA): affects IC, CPI, Cycle Time ($T_c$)
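These factors combine in the classic CPU performance equation:

$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock Cycle}} = \text{IC} \times \text{CPI} \times T_c$$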
SPEC Benchmark
Programs can be used to measure performance, and the Standard Performance Evaluation Corporation (SPEC) develops benchmarks for CPU, I/O, Web, ...
SPEC CPU Benchmark
Using the CINT2006 standard, the benchmark score of a processor is calculated as the geometric mean of the execution time ratios,
$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$
where the Execution time ratio$_i$ (or SPECratio) is
$$\text{Execution time ratio}_i = \frac{\text{Reference execution time}_i}{\text{Execution time}_i \text{ on the computer being measured}}$$
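A minimal C sketch of the score calculation, using made-up SPECratios rather than real benchmark data:

```c
#include <stdio.h>
#include <math.h>

/* Geometric mean of SPECratios; the values below are invented
 * purely to show the arithmetic, not real benchmark results. */
int main(void)
{
    double specratio[] = {12.5, 9.8, 15.1, 11.0};   /* hypothetical */
    int n = sizeof(specratio) / sizeof(specratio[0]);

    double log_sum = 0.0;
    for (int i = 0; i < n; i++)
        log_sum += log(specratio[i]);    /* sum of logs avoids overflow */

    double geomean = exp(log_sum / n);   /* nth root of the product */
    printf("SPEC score (geometric mean) = %.2f\n", geomean);
    return 0;
}
```

(Link with `-lm` for the math library.)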
An example that you will find is as follows,

SPEC Power Benchmark
Power consumption of a server at different workload levels:
Performance: ssj_ops/sec
Power: Watts (Joules/sec)
So, the formula for the power benchmark (overall ssj_ops per watt, summed over the load levels) is
$$\text{Overall ssj\_ops per watt} = \frac{\sum_{i=0}^{10} \text{ssj\_ops}_i}{\sum_{i=0}^{10} \text{power}_i}$$
And the following is an example,

Amdahl's law
Amdahl’s Law describes the limits of performance improvement when you only improve part of a system. It says,
If only part of the execution time can be improved, the overall speedup is limited by the fraction of time that part takes.
Its formula form is
$$T_{\text{improved}} = \frac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}}$$
or it can be written as
$$S_{\text{overall}} = \frac{1}{(1 - f) + \dfrac{f}{S}}$$
where $S_{\text{overall}}$ is the overall system speedup, $f$ is the fraction of work performed by the faster component, and $S$ is the speedup of that component.
For example, imagine you optimize disk I/O in a program:
Program runtime: 50 seconds
Disk I/O: 10 seconds (20%)
Computation: 40 seconds (80%)
If you make disk I/O 10× faster, then
$$T_{\text{improved}} = \frac{10\text{s}}{10} + 40\text{s} = 41\text{s}$$
So, the speedup is
$$\frac{50\text{s}}{41\text{s}} \approx 1.22$$
Or, using the second formula, we get
$$S_{\text{overall}} = \frac{1}{(1 - 0.2) + \dfrac{0.2}{10}} = \frac{1}{0.82} \approx 1.22$$
Even though disk I/O became 10× faster, the overall program only got 22% faster, because disk I/O was only a small part of the total time.
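A minimal C helper for the second formula (the function name and example call are mine, using the disk I/O numbers above):

```c
#include <stdio.h>

/* Amdahl's law: overall speedup when a fraction f of the work
 * is sped up by a factor s and the rest is unchanged. */
double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void)
{
    /* Disk I/O example: 20% of the time, made 10x faster. */
    printf("%.2f\n", amdahl_speedup(0.2, 10.0));   /* prints 1.22 */
    return 0;
}
```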
From Amdahl's law, we can deduce a very useful idea: "Make the common case fast". Since the common case takes a larger portion of the time, improving it can speed up the whole system significantly.
Eight Great Ideas
Design for Moore's Law.
Use abstraction to simplify the design.
Make the common case fast.
Performance via parallelism.
Performance via pipelining.
Performance via prediction: prediction depends on the history
Hierarchy of memories: Cache, main memory, disk.
Dependability via redundancy: a single broken transistor should not bring down the whole system.