

### Computer Systems and Networks

ECPE 170 – University of the Pacific

# Processor Architecture

# Lab Schedule

### Activities

- **7** Today
  - Processor Architecture
  - **7** Lab 10 & 11

#### Thursday

- Network Programming
- Lab 11 & Lab 12

#### Next Tuesday

**7** ????

### **Assignments Due**

- Tonight
  - **7** Lab 10 due by 11:59pm
- **Sun Dec 1**<sup>st</sup>
  - **7** Lab 11 due by 11:59pm

- How does the hardware MIPS processor execute a single instruction?
- With a 5-step instruction cycle



#### Step 1 – Instruction Fetch (IF)

- Retrieve next instruction from memory (check the instruction cache first!)
- Program Counter (PC) register stores address of next instruction to be retrieved/executed



#### **স** Step 2 − Instruction Decode (ID)

- Decode instruction what should we do?
- Retrieve input values from registers



#### **স** Step 3 − Execute (EX)

- ALU performs arithmetic or logical operation
- Operation might be calculating a memory address



#### Step 4 – Memory Access (MEM)

Read/write memory if necessary (Check the data cache first!)



8

#### **স** Step 5 − Write Back (WB)

Write final result of instruction to register if necessary



### Example 1 – add \$so,\$s1,\$s2

- 1. IF: Load instruction from memory; increment PC
- ID: Determine operation is "add"; Load \$s1 and \$s2 from registers
- 3. **EX**: ALU performs addition operation
- 4. **MEM**: No operation (no-op)
- 5. WB: Output of ALU written to \$s0

### Example 2 – lw \$\$0,10(\$t1)

- **1. IF**: Load instruction from memory, increment PC
- 2. ID: Determine operation is "load word"; retrieve value of \$t1 from register
- 3. EX: ALU calculates memory address of desired data (\$t1 plus 10 sign-extended to full 32 bits)
- 4. **MEM**: Retrieve data from memory at address calculated by ALU (check the data cache first!)
- 5. WB: Output of memory written to \$s0

### Example 3 – sw \$so, 20(\$t1)

- 1. IF: Load instruction from memory, increment PC
- 2. ID: Determine operation is "store word"; retrieve values of \$s0 and \$t1 from registers
- 3. EX: ALU calculates memory address of storage location (\$t1 plus 20 sign-extended to full 32 bits)
- 4. MEM: Store value from \$s0 to memory at address calculated by ALU (write goes to the data cache!)
- 5. WB: No operation (no-op)

### Example 4 – beq \$t1,\$t2,label

- **1. IF**: Load instruction from memory, increment PC
- 2. ID: Determine operation is "branch on equal"; retrieve values of \$t1 and \$t2 from registers
- 3. **EX**: ALU calculates memory address of location to jump to *if* the comparison is true (PC + label sign-extended to full 32 bits); ALU also compares \$t1 and \$t2 for equality
- MEM: If comparison is <u>equal</u>, PC = address calculated by ALU. Otherwise, PC is unchanged
- 5. WB: No operation (no-op)



### Instruction Cycle

The performance of our 5-step instruction cycle is slow if we only do one instruction at a time

New Goal: Run the instruction cycle <u>quickly</u> and <u>efficiently</u>

# Instruction Cycle

16

- ↗ A laundry analogy...
  - ➤ Laundry cycle instead of instruction cycle
- Doing laundry in your residence hall
  - Washing machine 35 minutes
  - Dryer 60 minutes
  - **Folding / Hanging 8 minutes**
- How do you do one load of laundry the fastest?



# Instruction Cycle for Laundry

#### How do you do two loads of laundry the fastest?

- Back to back?
  - 206 minutes total
  - Leaves machines idle at different times

#### ↗ Concurrently?



# Pipelining

#### This is pipelining

- Performing work in parallel instead of sequentially
  - Goal: Keep all hardware busy
- Provides for instruction level parallelism (ILP)
  - Executing more than one instruction at a time

#### Without Pipelining:

With Pipelining:

| Instr.<br># | Stage |    |    |              |                  |                   |    | Instr.<br># | Pipeline Stage |    |    |    |     |     |     |
|-------------|-------|----|----|--------------|------------------|-------------------|----|-------------|----------------|----|----|----|-----|-----|-----|
| 1           | IF    | ID | EX | MEM          | WB               | Finish<br>instruc |    |             | 1              | IF | ID | EX | MEM | WB  |     |
| 2           |       |    |    | <br>starting | before<br>second | IF                | ID | EX          | 2              |    | IF | ID | EX  | MEM | WB  |
| 3           |       |    |    |              |                  |                   |    |             | 3              |    |    | IF | ID  | EX  | MEM |
| Cycle       | 1     | 2  | 3  | 4            | 5                | 6                 | 7  | 8           | Cycle          | 1  | 2  | 3  | 4   | 5   | 6   |

#### **Computer Systems and Networks**

# **Deeper Pipelining**

#### **We can do better than this**

- (Original) Laundry Room Specifications:
  - Washing machine 35 minutes
  - Dryer 60 minutes
  - **Folding / Hanging 8 minutes**

#### What is the bottleneck in our simple pipeline?

- Drying takes much longer than the other stages
- **↗** This slows down the entire laundry process

# Pipelining / Laundry Revisited



Total: 163 minutes

- How can we fix it? Get two dryers
  - Operate them in parallel, or ...
  - Operate them in series for half the time
    - Each has a specialized task
    - First dryer set to <u>hot</u> (initial drying)
    - Second dryer set to <u>cool</u> (final drying / prevent shrinking)

# Pipelining / Laundry Revisited



Total: 138 minutes

- How can we fix it? Get two dryers
  - Operate them in parallel, or ...
  - Operate them in series for half the time
    - Each has a specialized task
    - ↗ First dryer set to <u>hot</u> (initial drying)
    - Second dryer set to <u>cool</u> (final drying / prevent shrinking)

# Pipelining / Laundry Revisited

- Better performance
  - 7 206 minutes → 163 minutes → 138 minutes
  - But now we're limited by the washer speed
- How do we fix this?
  - Buy more machines, each doing smaller parts of the task

#### Could I benefit from 10 machines? 100? 1000?

- Not shown in timeline: Time required to advance laundry from one stage to the next
- The time spent moving laundry between machines could exceed the time spent <u>in</u> the machines <sup>(3)</sup>
- **↗** System becomes increasingly <u>complex</u> to design ⊗

# Pipeline Challenge 1

- **Ideal** pipeline speedup is equal to pipeline depth
  - **5** stages? Program could run at best 5 times faster
- Pipeline challenge only achieve <u>ideal</u> speedup if the pipeline is perfectly balanced
  - The hardware in every stage takes the exact same amount of time to operate
- Most pipelines are not balanced
  - Example: loading data from memory is slower than decoding instruction
- Do we set processor frequency to *fastest* or *slowest* stage?
  - Slowest stage otherwise it won't have time to finish

# Pipeline Challenge 2

- Problem: We might not always be able to keep the pipeline full of instructions
- Hazards cause pipeline conflicts and stalls
  - Data hazards (dependencies)
  - Structural hazards (resource conflicts)
  - Control hazards (conditional branching)

### Data Hazard

### Program correctness depends on executing instructions in original order

| a | Read After Write   dd \$s1,\$t1,\$t2   dd \$s2,\$t3,\$t4   dd \$t4,\$s1,\$s2 | Write After Read<br>add \$t1, <b>\$s1</b> ,\$t2<br>add <b>\$s1</b> ,\$t3,\$t4   | <u>Write After Write</u><br>add <b>\$s1</b> ,\$t1,\$t2<br>add <b>\$s1</b> ,\$t3,\$t4 |
|---|------------------------------------------------------------------------------|---------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|
| р | hird add cannot<br>roceed until first two<br>re complete!                    | Second add cannot write<br>result until after first add<br>has read its inputs! | Second add cannot write<br>result until after first add<br>has written its result!   |

25

# Structural Hazard, Control Hazard

#### Structural hazard

- Part of the processor hardware is required by two different instructions at the same time
- Example: A shared memory, shared ALU, shared data bus, etc...

#### Control hazard

The processor needs to know which instruction will be executed next, and it can't until the branch is determined

### Instruction-Level Pipelining

- Hazards can cause pipeline to stall or flush
  - **オ Stall** − pipeline is delayed for a cycle
  - **Flush** all instructions in pipeline are deleted
- Clever hardware or clever assembly programmers (or *optimizing* compilers) can reduce the effects of these hazards
  - **7** But not fully eliminate them...

# Intel Pipelining

- Almost all Intel chips (286, 386, 486, etc...) have some degree of pipelining
- Pipelining was first seriously applied to the Intel486 chip in 1989
  - Could complete an ALU instruction (coming from a register, going to a register) every clock cycle
- Pipelining got better with the **Pentium** chip in 1993
  - Double-wide: Two instructions are sent down the pipeline every cycle! (Requires two ALUs, etc...)

# Intel Pipelining

- Pipeline depth changed over time:
  - Original Pentium: 5 stages
  - Pentium 2: 12 stages
  - Pentium 3: 14 stages
  - Pentium 4: 20-24 stages
  - Pentium 4 extreme edition: 31 stages
  - Why were the pipelines getting longer?
- **7** Today
  - Core i7 has a 17-stage pipeline

29

# **MIPS** Pipelining

- Like Intel, the pipeline size of the MIPS processors has grown
  - R2000 and R3000 have 5-stage pipelines

  - **R10000** has three pipelines:
    - **5**-stage pipeline for integer instructions
    - 7-stage pipeline for floating-point instructions



Example program: (imagine it was in assembly)

- Assume we have a processor with "lots" of ALUs
  - **What instructions** <u>can</u> be executed in parallel?
  - **What instructions** <u>cannot</u> be executed in parallel?

Example program 2: (imagine it was in assembly)

- Assume we have a processor with "lots" of ALUs
  - **What instructions** <u>can</u> be executed in parallel?
  - **What instructions** <u>cannot</u> be executed in parallel?
    - **If we tried really hard, could we run them in parallel?**

#### This is instruction-level parallelism

- Finding instructions in the same program that can be executed in parallel
- Different from multi-core parallelism, which executes instructions from *different* programs in parallel
- ↗ You can do this in a single "core" of a CPU
  - Adding more ALUs to the chip is easy
  - **↗** Finding the parallelism to exploit is harder...
  - **7** Getting the data to the ALUs is harder...

#### ✓ Instruction-level parallelism is good ☺

- Let's find as much of it as possible and use it to decrease execution time!
- Two competing methods:
  - **Superscalar**: the *hardware* finds the parallelism
  - **VLIW**: the *compiler* finds the parallelism
- Both designs have multiple execution units (e.g. ALUs) in a single processor core

# MIMD – Superscalar

- Superscalar designs the hardware finds the instruction-level parallelism while the program is running
- Challenges
  - CPU instruction fetch unit must simultaneously retrieve several instructions from memory
  - CPU instruction decoding unit determines which of these instructions can be executed in parallel and combines them accordingly
    - Complicated!

### MIMD-VLIW

- Very long instruction word (VLIW) designs the compiler finds the instruction-level parallelism before the program executes
  - The compiler packs <u>multiple</u> instructions into one long instructions that the hardware executes in parallel
- Arguments:
  - For: Simplifies hardware, plus the compiler can better identify instruction dependencies (it has more time to work)
  - Against: Compilers cannot have a view of the run time code, and must plan for all possible branches and code paths
- Examples: Intel Itanium, ATI R600-R900 GPUs

Back to the example program:

- More techniques for ILP
- Speculative execution (or branch prediction)
  - Guess that e>f, and execute line 4 immediately...

#### Out-of-order execution

Execute line 7 before 4-6, since it doesn't depend on them