Lectures for 3rd Edition Note: these lectures are often supplemented with other materials and also problems from the text worked out on the blackboard. Youll want to customize these lectures for your class. The student audience for these lectures have had exposure to logic design and attend a hands-on assembly language programming lab that does not follow a typical lecture format. 1 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Chapter 1 2 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Introduction This course is all about how computers work But what do we mean by a computer? Different types: desktop, servers, embedded devices Different uses: automobiles, graphics, finance, genomics Different manufacturers: Intel, Apple, IBM, Microsoft, Sun Different underlying technologies and different costs! Analogy: Consider a course on automotive vehicles Many similarities from vehicle to vehicle (e.g., wheels) Huge differences from vehicle to vehicle (e.g., gas vs. electric) Best way to learn:
Focus on a specific instance and learn how it works While learning general principles and historical perspectives 3 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Why learn this stuff? You want to call yourself a computer scientist You want to build software people use (need performance) You need to make a purchasing decision or offer expert advice Both Hardware and Software affect performance: Algorithm determines number of source-level statements Language/Compiler/Architecture determine machine instructions (Chapter 2 and 3) Processor/Memory determine how fast instructions are executed (Chapter 5, 6, and 7) Assessing and Understanding Performance in Chapter 4 4 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers What is a computer?
Components: input (mouse, keyboard) output (display, printer) memory (disk drives, DRAM, SRAM, CD) network Our primary focus: the processor (datapath and control) implemented using millions of transistors Impossible to understand by looking at each transistor We need... 5 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Abstraction Delving into the depths reveals more information An abstraction omits unneeded detail, helps us cope with complexity What are some of the details that appear in these familiar abstractions? 6 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers How do computers work? Need to understand abstractions such as: Applications software Systems software
Assembly Language Machine Language Architectural Issues: i.e., Caches, Virtual Memory, Pipelining Sequential logic, finite state machines Combinational logic, arithmetic circuits Boolean logic, 1s and 0s Transistors used to build logic gates (CMOS) Semiconductors/Silicon used to build transistors Properties of atoms, electrons, and quantum dynamics So much to learn! 7 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Instruction Set Architecture A very important abstraction interface between hardware and low-level software standardizes instructions, machine language bit patterns, etc. advantage: different implementations of the same architecture disadvantage: sometimes prevents using new innovations True or False: Binary compatibility is extraordinarily important? Modern instruction set architectures: IA-32, PowerPC, MIPS, SPARC, ARM, and others 8 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Historical Perspective
ENIAC built in World War II was the first general purpose computer Used for computing artillery firing tables 80 feet long by 8.5 feet high and several feet wide Each of the twenty 10 digit registers was 2 feet long Used 18,000 vacuum tubes Performed 1900 additions per second Since then: Moores Law: transistor capacity doubles every 18-24 months 9 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Chapter 2 10 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Instructions: Language of the Machine Well be working with the MIPS instruction set architecture similar to other architectures developed since the 1980's Almost 100 million MIPS processors manufactured in 2002 used by NEC, Nintendo, Cisco, Silicon Graphics, Sony, 1400 1300 1200 1100 1000 900 800 Other SPARC
Hitachi Morgan Kaufmann PublishersSH PowerPC Motorola Morgan Kaufmann Publishers68K MIPS IA-32 ARM 700 600 500 400 300 200 100 0 1998 1999 2000 2001 2002 11 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers MIPS arithmetic All instructions have 3 operands Operand order is fixed (destination first) Example: C code: a = b + c
MIPS code: add a, b, c (well talk about registers in a bit) The natural number of operands for an operation like addition is threerequiring every instruction to have exactly three operands, no more and no less, conforms to the philosophy of keeping the hardware simple 12 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers MIPS arithmetic Design Principle: simplicity favors regularity. Of course this complicates some things... C code: a = b + c + d; MIPS code: add a, b, c add a, a, d Operands must be registers, only 32 registers provided Each register contains 32 bits Design Principle: smaller is faster. Why?
13 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Registers vs. Memory Arithmetic instructions operands must be registers, only 32 registers provided Compiler associates variables with registers What about programs with lots of variables Control Input Memory Datapath Processor Output I/O 14 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Memory Organization Viewed as a large, single-dimension array, with an address. A memory address is an index into the array "Byte addressing" means that the index points to a byte of memory. 0
1 2 3 4 5 6 ... 8 bits of data 8 bits of data 8 bits of data 8 bits of data 8 bits of data 8 bits of data 8 bits of data 15 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Memory Organization Bytes are nice, but most data items use larger "words" For MIPS, a word is 32 bits or 4 bytes. 0 32 bits of data 4 32 bits of data Registers hold 32 bits of data 8 32 bits of data 12 32 bits of data ... 232 bytes with byte addresses from 0 to 232-1 230 words with byte addresses 0, 4, 8, ... 232-4 Words are aligned i.e., what are the least 2 significant bits of a word address?
16 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Instructions Load and store instructions Example: C code: A[12] = h + A[8]; MIPS code: lw $t0, 32($s3) add $t0, $s2, $t0 sw $t0, 48($s3) Can refer to registers by name (e.g., $s2, $t2) instead of number Store word has destination last Remember arithmetic operands are registers, not memory! Cant write: add 48($s3), $s2, 32($s3) 17 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Our First Example Can we figure out the code?
swap(int v[], int k); { int temp; temp = v[k] v[k] = v[k+1]; v[k+1] = temp; } swap: muli $2, $5, 4 add $2, $4, $2 lw $15, 0($2) lw $16, 4($2) sw $16, 0($2) sw $15, 4($2) jr $31 18 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers So far weve learned: MIPS loading words but addressing bytes arithmetic on registers only Instruction Meaning add $s1, $s2, $s3 sub $s1, $s2, $s3 lw $s1, 100($s2) sw $s1, 100($s2) $s1 = $s2 + $s3 $s1 = $s2 $s3 $s1 = Memory[$s2+100]
Memory[$s2+100] = $s1 19 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Machine Language Instructions, like registers and words of data, are also 32 bits long Example: add $t1, $s1, $s2 registers have numbers, $t1=9, $s1=17, $s2=18 Instruction Format: 000000 10001 10010 op rs rt 01000 00000 100000 rd shamt funct Can you guess what the field names stand for? 20 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Machine Language
Consider the load-word and store-word instructions, What would the regularity principle have us do? New principle: Good design demands a compromise Introduce a new type of instruction format I-type for data transfer instructions other format was R-type for register Example: lw $t0, 32($s2) 35 18 9 op rs rt 32 16 bit number Where's the compromise? 21 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Stored Program Concept Instructions are bits Programs are stored in memory to be read or written just like data
Processor Memory memory for data, programs, compilers, editors, etc. Fetch & Execute Cycle Instructions are fetched and put into a special register Bits in the register "control" the subsequent actions Fetch the next instruction and continue 22 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Control Decision making instructions alter the control flow, i.e., change the "next" instruction to be executed MIPS conditional branch instructions: bne $t0, $t1, Label beq $t0, $t1, Label Example: if (i==j) h = i + j; bne $s0, $s1, Label add $s3, $s0, $s1
Label: .... 23 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Control MIPS unconditional branch instructions: j label Example: if (i!=j) h=i+j; else h=i-j; beq $s4, $s5, Lab1 add $s3, $s4, $s5 j Lab2 Lab1: sub $s3, $s4, $s5 Lab2: ... Can you build a simple for loop? 24 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers So far: Instruction Meaning
add $s1,$s2,$s3 sub $s1,$s2,$s3 lw $s1,100($s2) sw $s1,100($s2) bne $s4,$s5,L beq $s4,$s5,L j Label $s1 = $s2 + $s3 $s1 = $s2 $s3 $s1 = Memory[$s2+100] Memory[$s2+100] = $s1 Next instr. is at Label if $s4 $s5 Next instr. is at Label if $s4 = $s5 Next instr. is at Label Formats: R op rs rt rd I op rs rt 16 bit address J
op shamt funct 26 bit address 25 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Control Flow We have: beq, bne, what about Branch-if-less-than? New instruction: if $s1 < $s2 then $t0 = 1 slt $t0, $s1, $s2 else $t0 = 0 Can use this instruction to build "blt $s1, $s2, Label" can now build general control structures Note that the assembler needs a register to do this, there are policy of use conventions for registers 26 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Policy of Use Conventions Name Register number 0 $zero 2-3 $v0-$v1 4-7 $a0-$a3 8-15 $t0-$t7 16-23 $s0-$s7 24-25 $t8-$t9 28 $gp 29 $sp 30 $fp 31 $ra Usage the constant value 0 values for results and expression evaluation arguments temporaries saved more temporaries global pointer stack pointer frame pointer return address Register 1 ($at) reserved for assembler, 26-27 for operating system 27 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Constants
Small constants are used quite frequently (50% of operands) e.g., A = A + 5; B = B + 1; C = C - 18; Solutions? Why not? put 'typical constants' in memory and load them. create hard-wired registers (like $zero) for constants like one. MIPS Instructions: addi $29, $29, 4 slti $8, $18, 10 andi $29, $29, 6 ori $29, $29, 4 Design Principle: Make the common case fast. Which format? 28 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers How about larger constants? We'd like to be able to load a 32 bit constant into a register Must use two instructions, new "load upper immediate" instruction lui $t0, 1010101010101010 1010101010101010
filled with zeros 0000000000000000 Then must get the lower order bits right, i.e., ori $t0, $t0, 1010101010101010 ori 1010101010101010 0000000000000000 0000000000000000 1010101010101010 1010101010101010 1010101010101010 29 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Assembly Language vs. Machine Language Assembly provides convenient symbolic representation much easier than writing down numbers e.g., destination first Machine language is the underlying reality e.g., destination is no longer first Assembly can provide 'pseudoinstructions' e.g., move $t0, $t1 exists only in Assembly
would be implemented using add $t0,$t1,$zero When considering performance you should count real instructions 30 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Other Issues Discussed in your assembly language programming lab: support for procedures linkers, loaders, memory layout stacks, frames, recursion manipulating strings and pointers interrupts and exceptions system calls and conventions Some of these we'll talk more about later Well talk about compiler optimizations when we hit chapter 4. 31 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Overview of MIPS simple instructions all 32 bits wide very structured, no unnecessary baggage
only three instruction formats R op rs rt rd I op rs rt 16 bit address J op shamt funct 26 bit address rely on compiler to achieve performance what are the compiler's goals? help compiler where we can 32 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Addresses in Branches and Jumps
Instructions: bne $t4,$t5,Label $t5 beq $t4,$t5,Label $t5 j Label op Formats: I J op rs Next instruction is at Label if $t4 Next instruction is at Label if $t4 = Next instruction is at Label rt 16 bit address 26 bit address Addresses are not 32 bits How do we handle this with load and store instructions? 33 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Addresses in Branches
Instructions: bne $t4,$t5,Label beq $t4,$t5,Label Formats: I Next instruction is at Label if $t4$t5 Next instruction is at Label if $t4=$t5 op rs rt 16 bit address Could specify a register (like lw and sw) and add it to address use Instruction Address Register (PC = program counter) most branches are local (principle of locality) Jump instructions just use high order bits of PC address boundaries of 256 MB 34 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers To summarize: MIPS operands Name Example $s0-$s7, $t0-$t9, $zero, $a0-$a3, $v0-$v1, $gp,
32 registers $fp, $sp, $ra, $at Memory[0], 30 2 arithmetic. MIPS register $zero always equals 0. Register $at is reserved for the assembler to handle large constants. Accessed only by data transfer instructions. MIPS uses byte addresses, so memory Memory[4], ..., words Comments Fast locations for data. In MIPS, data must be in registers to perform sequential words differ by 4. Memory holds data structures, such as arrays, Memory[4294967292] and spilled registers, such as those saved on procedure calls. MIPS assembly language Category Instruction add Arithmetic subtract Example add $s1, $s2, $s3
Meaning $s1 = $s2 + $s3 Three operands; data in registers sub $s1, $s2, $s3 $s1 = $s2 - $s3 Three operands; data in registers addi $s1, $s2, 100 lw $s1, 100($s2) sw $s1, 100($s2) store word lb $s1, 100($s2) Data transfer load byte sb $s1, 100($s2) store byte lui $s1, 100 load upper immediate add immediate load word Comments $s1 = $s2 + 100 Used $s1 = Memory[$s2 + 100] Word Memory[$s2 + 100] = $s1 Word $s1 = Memory[$s2 + 100] Byte Byte Memory[$s2 + 100] = $s1 16 $s1 = 100 * 2
to add constants from memory to register from register to memory from memory to register from register to memory Loads constant in upper 16 bits beq $s1, $s2, 25 if($s1 == $s2) go to PC + 4 + 100 Equal test; PC-relative branch bne branch on not equal $s1, $s2, 25 if($s1 != $s2) go to PC + 4 + 100 Not equal test; PC-relative $s1, $s2, $s3 if($s2 < $s3) $s1 = 1; else$s1 = 0 Compare less than; for beq, bne if($s2 < 100) $s1 = 1; else$s1 = 0 Compare less than constant branch on equal Conditional set on less than slt branch set less than immediate slti jump j jr jal Uncondijump register tional jump jump and link $s1, $s2, 100 2500 $ra 2500 go to 10000 Jump to target address go to$ra For switch, procedure return $ra = PC + 4; go to 10000 For procedure call 35 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers 1. Morgan Kaufmann PublishersImmediate Morgan Kaufmann Publishersaddressing op rs
rt Immediate 2. Morgan Kaufmann PublishersRegister Morgan Kaufmann Publishersaddressing op rs rt rd . Morgan Kaufmann Publishers . Morgan Kaufmann Publishers. funct Registers Register 3. Morgan Kaufmann PublishersBase Morgan Kaufmann Publishersaddressing op rs rt Memory Address + Register Byte Halfword Word
4. Morgan Kaufmann PublishersPC-relative Morgan Kaufmann Publishersaddressing op rs rt Memory Address PC + Word 5. Morgan Kaufmann PublishersPseudodirect Morgan Kaufmann Publishersaddressing op Address PC Memory Word 36 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Alternative Architectures Design alternative: provide more powerful operations goal is to reduce number of instructions executed danger is a slower cycle time and/or a higher CPI
The path toward operation complexity is thus fraught with peril. To avoid these problems, designers have moved toward simpler instructions Lets look (briefly) at IA-32 37 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers IA - 32 1978: The Intel 8086 is announced (16 bit architecture) 1980: The 8087 floating point coprocessor is added 1982: The 80286 increases address space to 24 bits, +instructions 1985: The 80386 extends to 32 bits, new addressing modes 1989-1995: The 80486, Pentium, Pentium Pro add a few instructions (mostly designed for higher performance) 1997: 57 new MMX instructions are added, Pentium II 1999: The Pentium III added another 70 instructions (SSE) 2001: Another 144 instructions (SSE2) 2003: AMD extends the architecture to increase address space to 64 bits, widens all registers to 64 bits and other changes (AMD64) 2004: Intel capitulates and embraces AMD64 (calls it EM64T) and adds more media extensions
This history illustrates the impact of the golden handcuffs of compatibility adding new features as someone might add clothing to a packed bag an architecture that is difficult to explain and impossible to love 38 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers IA-32 Overview Complexity: Instructions from 1 to 17 bytes long one operand must act as both a source and destination one operand can come from memory complex addressing modes e.g., base or scaled index with 8 or 32 bit displacement Saving grace: the most frequently used instructions are not too difficult to build compilers avoid the portions of the architecture that are slow what the 80x86 lacks in style is made up in quantity, making it beautiful from the right perspective 39 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers IA-32 Registers and Data Addressing Registers in the 32-bit subset that originated with 80386 Name Use 31
0 EAX GPR Morgan Kaufmann Publishers0 ECX GPR Morgan Kaufmann Publishers1 EDX GPR Morgan Kaufmann Publishers2 EBX GPR Morgan Kaufmann Publishers3 ESP GPR Morgan Kaufmann Publishers4 EBP GPR Morgan Kaufmann Publishers5 ESI GPR Morgan Kaufmann Publishers6 EDI GPR Morgan Kaufmann Publishers7 EIP EFLAGS CS Code Morgan Kaufmann Publisherssegment Morgan Kaufmann Publisherspointer
SS Stack Morgan Kaufmann Publisherssegment Morgan Kaufmann Publisherspointer Morgan Kaufmann Publishers(top Morgan Kaufmann Publishersof Morgan Kaufmann Publishersstack) DS Data Morgan Kaufmann Publisherssegment Morgan Kaufmann Publisherspointer Morgan Kaufmann Publishers0 ES Data Morgan Kaufmann Publisherssegment Morgan Kaufmann Publisherspointer Morgan Kaufmann Publishers1 FS Data Morgan Kaufmann Publisherssegment Morgan Kaufmann Publisherspointer Morgan Kaufmann Publishers2 GS Data Morgan Kaufmann Publisherssegment Morgan Kaufmann Publisherspointer Morgan Kaufmann Publishers3 Instruction Morgan Kaufmann Publisherspointer Morgan Kaufmann Publishers(PC) Condition Morgan Kaufmann Publisherscodes 40 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers IA-32 Register Restrictions Registers are not general purpose note the restrictions below 41 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers IA-32 Typical Instructions Four major types of integer instructions:
Data movement including move, push, pop Arithmetic and logical (destination register or memory) Control flow (use of condition codes / flags ) String instructions, including string move and string compare 42 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers IA-32 instruction Formats Typical formats: (notice the different lengths) a. Morgan Kaufmann PublishersJE Morgan Kaufmann PublishersEIP Morgan Kaufmann Publishers+ Morgan Kaufmann Publishersdisplacement 4 4 8 Condi- Displacement tion JE b. Morgan Kaufmann PublishersCALL 8 32 CALL Offset c. Morgan Kaufmann PublishersMOV Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersEBX, Morgan Kaufmann Publishers[EDI Morgan Kaufmann Publishers+ Morgan Kaufmann Publishers45] 6 1 1 8 MOV d w r/m
Postbyte 8 Displacement d. Morgan Kaufmann PublishersPUSH Morgan Kaufmann PublishersESI 5 3 PUSH Reg e. Morgan Kaufmann PublishersADD Morgan Kaufmann PublishersEAX, Morgan Kaufmann Publishers#6765 4 3 1 32 ADD Reg w f. Morgan Kaufmann PublishersTEST Morgan Kaufmann PublishersEDX, Morgan Kaufmann Publishers#42 7 1 TEST w Immediate 8 32 Postbyte Immediate 43
2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Summary Instruction complexity is only one variable lower instruction count vs. higher CPI / lower clock rate Design Principles: simplicity favors regularity smaller is faster good design demands compromise make the common case fast Instruction set architecture a very important abstraction indeed! 44 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Chapter Three 45 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Numbers Bits are just bits (no inherent meaning) conventions define relationship between bits and numbers Binary numbers (base 2)
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001... decimal: 0...2n-1 Of course it gets more complicated: numbers are finite (overflow) fractions and real numbers negative numbers e.g., no MIPS subi instruction; addi can add a negative number How do we represent negative numbers? i.e., which bit patterns will represent which numbers? 46 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Possible Representations Sign Magnitude: 000 = +0 001 = +1 010 = +2 011 = +3 100 = -0 101 = -1 110 = -2 111 = -3 One's Complement Two's Complement 000 = +0 001 = +1 010 = +2 011 = +3 100 = -3 101 = -2
110 = -1 111 = -0 000 = +0 001 = +1 010 = +2 011 = +3 100 = -4 101 = -3 110 = -2 111 = -1 Issues: balance, number of zeros, ease of operations Which one is best? Why? 47 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers MIPS 32 bit signed numbers: 0000 0000 0000 ... 0111 0111 1000 1000 1000 ... 1111 1111 1111 0000 0000 0000 0000 0000 0000 0000two = 0ten 0000 0000 0000 0000 0000 0000 0001two = + 1ten 0000 0000 0000 0000 0000 0000 0010two = + 2ten
1111 1111 0000 0000 0000 1111 1111 0000 0000 0000 1111 1111 0000 0000 0000 1111 1111 0000 0000 0000 1111 1111 0000 0000 0000 1111 1111 0000 0000 0000 1110two 1111two 0000two 0001two
0010two = = = = = + + 2,147,483,646ten 2,147,483,647ten 2,147,483,648ten 2,147,483,647ten 2,147,483,646ten maxint minint 1111 1111 1111 1111 1111 1111 1101two = 3ten 1111 1111 1111 1111 1111 1111 1110two = 2ten 1111 1111 1111 1111 1111 1111 1111two = 1ten 48 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Two's Complement Operations Negating a two's complement number: invert all bits and add 1 remember: negate and invert are quite different! Converting n bit numbers into numbers with more than n bits:
MIPS 16 bit immediate gets converted to 32 bits for arithmetic copy the most significant bit (the sign bit) into the other bits 0010 -> 0000 0010 1010 -> 1111 1010 "sign extension" (lbu vs. lb) 49 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Addition & Subtraction Just like in grade school (carry/borrow 1s) 0111 0111 0110 + 0110 - 0110 - 0101 Two's complement operations easy subtraction using addition of negative numbers 0111 + 1010 Overflow (result too large for finite computer word): e.g., adding two n-bit numbers does not yield an n-bit number 0111 + 0001
note that overflow term is somewhat misleading, 1000 it does not mean a carry overflowed 50 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Detecting Overflow No overflow when adding a positive and a negative number No overflow when signs are the same for subtraction Overflow occurs when the value affects the sign: overflow when adding two positives yields a negative or, adding two negatives gives a positive or, subtract a negative from a positive and get a negative or, subtract a positive from a negative and get a positive Consider the operations A + B, and A B Can overflow occur if B is 0 ? Can overflow occur if A is 0 ? 51 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Effects of Overflow An exception (interrupt) occurs Control jumps to predefined address for exception Interrupted address is saved for possible resumption Details based on software system / language example: flight control vs. homework assignment
Don't always want to detect overflow new MIPS instructions: addu, addiu, subu note: addiu still sign-extends! note: sltu, sltiu for unsigned comparisons 52 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Multiplication More complicated than addition accomplished via shifting and addition More time and more area Let's look at 3 versions based on a gradeschool algorithm 0010 __x_1011 (multiplicand) (multiplier) Negative numbers: convert and multiply there are better techniques, we wont look at them 53 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Multiplication: Implementation Start Multiplier0 Morgan Kaufmann Publishers= Morgan Kaufmann Publishers1 1. Morgan Kaufmann Publishers Morgan Kaufmann PublishersTest Multiplier0
Multiplier0 Morgan Kaufmann Publishers= Morgan Kaufmann Publishers0 1a. Morgan Kaufmann Publishers Morgan Kaufmann PublishersAdd Morgan Kaufmann Publishersmultiplicand Morgan Kaufmann Publishersto Morgan Kaufmann Publishersproduct Morgan Kaufmann Publishersand place Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersresult Morgan Kaufmann Publishersin Morgan Kaufmann PublishersProduct Morgan Kaufmann Publishersregister Multiplicand Shift Morgan Kaufmann Publishersleft 64 Morgan Kaufmann Publishersbits Multiplier Shift Morgan Kaufmann Publishersright 64-bit Morgan Kaufmann PublishersALU 2. Morgan Kaufmann Publishers Morgan Kaufmann PublishersShift Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersMultiplicand Morgan Kaufmann Publishersregister Morgan Kaufmann Publishersleft Morgan Kaufmann Publishers1 Morgan Kaufmann Publishersbit 32 Morgan Kaufmann Publishersbits Product Write 3. Morgan Kaufmann Publishers Morgan Kaufmann PublishersShift Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersMultiplier Morgan Kaufmann Publishersregister Morgan Kaufmann Publishersright Morgan Kaufmann Publishers1 Morgan Kaufmann Publishersbit Control Morgan Kaufmann Publisherstest 64 Morgan Kaufmann Publishersbits 32nd Morgan Kaufmann Publishersrepetition? Datapath No: Morgan Kaufmann Publishers Morgan Kaufmann Publishers< Morgan Kaufmann Publishers32 Morgan Kaufmann Publishersrepetitions Yes: Morgan Kaufmann Publishers Morgan Kaufmann Publishers32 Morgan Kaufmann Publishersrepetitions Control Done 54 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Final Version Start Multiplier starts in right half of product Product0 Morgan Kaufmann Publishers= Morgan Kaufmann Publishers1 1. Morgan Kaufmann Publishers Morgan Kaufmann PublishersTest Product0 Product0 Morgan Kaufmann Publishers= Morgan Kaufmann Publishers0 Multiplicand 32 Morgan Kaufmann Publishersbits 32-bit Morgan Kaufmann PublishersALU Product Shift Morgan Kaufmann Publishersright Write Control test 3. Morgan Kaufmann Publishers Morgan Kaufmann PublishersShift Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersProduct Morgan Kaufmann Publishersregister Morgan Kaufmann Publishersright Morgan Kaufmann Publishers1 Morgan Kaufmann Publishersbit 64 Morgan Kaufmann Publishersbits 32nd Morgan Kaufmann Publishersrepetition? What goes here? No: Morgan Kaufmann Publishers Morgan Kaufmann Publishers< Morgan Kaufmann Publishers32 Morgan Kaufmann Publishersrepetitions Yes: Morgan Kaufmann Publishers Morgan Kaufmann Publishers32 Morgan Kaufmann Publishersrepetitions Done 55 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Floating Point (a brief look) We need a way to represent numbers with fractions, e.g., 3.1416 very small numbers, e.g., .000000001 very large numbers, e.g., 3.15576 109 Representation: sign, exponent, significand: (1)sign significand 2exponent more bits for significand gives more accuracy more bits for exponent increases range IEEE 754 floating point standard: single precision: 8 bit exponent, 23 bit significand double precision: 11 bit exponent, 52 bit significand 56 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers IEEE 754 floating-point standard Leading 1 bit of significand is implicit Exponent is biased to make sorting easier
all 0s is smallest exponent all 1s is largest bias of 127 for single precision and 1023 for double precision summary: (1)sign significand) 2exponent bias Example: decimal: -.75 = - ( + ) binary: -.11 = -1.1 x 2-1 floating point: exponent = 126 = 01111110 IEEE single precision: 10111111010000000000000000000000 57 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Floating point addition Sign Exponent Fraction Sign Exponent 1. Morgan Kaufmann Publishers Morgan Kaufmann PublishersCompare Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersexponents Morgan Kaufmann Publishersof Morgan Kaufmann Publishersthe Morgan Kaufmann Publisherstwo Morgan Kaufmann Publishersnumbers. Shift Morgan Kaufmann Publishersthe Morgan Kaufmann Publisherssmaller Morgan Kaufmann Publishersnumber Morgan Kaufmann Publishersto Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersright Morgan Kaufmann Publishersuntil Morgan Kaufmann Publishersits exponent Morgan Kaufmann Publisherswould Morgan Kaufmann Publishersmatch Morgan Kaufmann Publishersthe Morgan Kaufmann Publisherslarger Morgan Kaufmann Publishersexponent Small Morgan Kaufmann PublishersALU Exponent difference 0 Start
Fraction 2. Morgan Kaufmann PublishersAdd Morgan Kaufmann Publishersthe Morgan Kaufmann Publisherssignificands 1 0 1 0 3. Morgan Kaufmann PublishersNormalize Morgan Kaufmann Publishersthe Morgan Kaufmann Publisherssum, Morgan Kaufmann Publisherseither Morgan Kaufmann Publishersshifting Morgan Kaufmann Publishersright Morgan Kaufmann Publishersand incrementing Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersexponent Morgan Kaufmann Publishersor Morgan Kaufmann Publishersshifting Morgan Kaufmann Publishersleft and Morgan Kaufmann Publishersdecrementing Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersexponent Shift Morgan Kaufmann Publishersright Control 1 Overflow Morgan Kaufmann Publishersor underflow? Big Morgan Kaufmann PublishersALU Yes No 0 0 1 Increment Morgan Kaufmann Publishersor decrement
Exception 1 4. Morgan Kaufmann PublishersRound Morgan Kaufmann Publishersthe Morgan Kaufmann Publisherssignificand Morgan Kaufmann Publishersto Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersappropriate number Morgan Kaufmann Publishersof Morgan Kaufmann Publishersbits Shift Morgan Kaufmann Publishersleft Morgan Kaufmann Publishersor Morgan Kaufmann Publishersright No Rounding Morgan Kaufmann Publishershardware Still Morgan Kaufmann Publishersnormalized? Yes Sign Exponent Fraction Done 58 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Floating Point Complexities Operations are somewhat more complicated (see text) In addition to overflow we can have underflow Accuracy can be a big problem
IEEE 754 keeps two extra bits, guard and round four rounding modes positive divided by zero yields infinity zero divide by zero yields not a number other complexities Implementing the standard can be tricky Not using the standard can be even worse see text for description of 80x86 and Pentium bug! 59 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Chapter Three Summary Computer arithmetic is constrained by limited precision Bit patterns have no inherent meaning but standards do exist twos complement IEEE 754 floating point Computer instructions determine meaning of the bit patterns Performance and accuracy are important so there are many complexities in real machines Algorithm choice is important and may lead to hardware
optimizations for both space and time (e.g., multiplication) You may want to look back (Section 3.10 is great reading!) 60 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Chapter 4 61 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Performance Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding underlying organizational motivation Why is some hardware better than others for different programs? What factors of system performance are hardware related? (e.g., Do we need a new machine, or a new operating system?) How does the machine's instruction set affect performance? 62 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Which of these airplanes has the best performance? Airplane Passengers
Boeing 737-100 Boeing 747 BAC/Sud Concorde Douglas DC-8-50 101 470 132 146 Range (mi) Speed (mph) 630 4150 4000 8720 598 610 1350 544 How much faster is the Concorde compared to the 747? How much bigger is the 747 than the Douglas DC-8? 63 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Computer Performance: TIME, TIME, TIME Response Time (latency) How long does it take for my job to run? How long does it take to execute a job? How long must I wait for the database query? Throughput
How many jobs can the machine run at once? What is the average execution rate? How much work is getting done? If we upgrade a machine with a new processor what do we increase? If we add a new machine to the lab what do we increase? 64 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Execution Time Elapsed Time counts everything (disk and memory accesses, I/O , etc.) a useful number, but often not good for comparison purposes CPU time doesn't count I/O or time spent running other programs can be broken up into system time, and user time Our focus: user CPU time time spent executing the lines of code that are "in" our program 65 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Book's Definition of Performance For some program running on machine X,
PerformanceX = 1 / Execution timeX "X is n times faster than Y" PerformanceX / PerformanceY = n Problem: machine A runs a program in 20 seconds machine B runs the same program in 25 seconds 66 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Clock Cycles Instead of reporting execution time in seconds, we often use cycles seconds cycles seconds = program program cycle Clock ticks indicate when to start activities (one abstraction): time cycle time = time between ticks = seconds per cycle clock rate (frequency) = cycles per second (1 Hz. = 1 cycle/sec) A 4 Ghz. clock has a
1 4 109 1012 =250 picoseconds (ps) cycle time 67 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers How to Improve Performance seconds cycles seconds = program program cycle So, to improve performance (everything else being equal) you can either (increase or decrease?) ________ the # of required cycles for a program, or ________ the clock cycle time or, said another way, ________ the clock rate. 68 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers How many cycles are required for a program? ... 6th 5th 4th 3rd Morgan Kaufmann Publishersinstruction 2nd Morgan Kaufmann Publishersinstruction
Could assume that number of cycles equals number of instructions 1st Morgan Kaufmann Publishersinstruction time This assumption is incorrect, different instructions take different amounts of time on different machines. Why? hint: remember that these are machine instructions, not lines of C code 69 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Different numbers of cycles for different instructions time Multiplication takes more time than addition Floating point operations take longer than integer ones Accessing memory takes more time than accessing registers Important point: changing the cycle time often changes the number of cycles required for various instructions (more later) 70 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers
Example Our favorite program runs in 10 seconds on computer A, which has a 4 GHz. clock. We are trying to help a computer designer build a new machine B, that will run this program in 6 seconds. The designer can use new (or perhaps more expensive) technology to substantially increase the clock rate, but has informed us that this increase will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the same program. What clock rate should we tell the designer to target?" Don't Panic, can easily work this out from basic principles 71 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Now that we understand cycles A given program will require some number of instructions (machine instructions) some number of cycles some number of seconds We have a vocabulary that relates these quantities: cycle time (seconds per cycle) clock rate (cycles per second) CPI (cycles per instruction) a floating point intensive application might have a higher CPI MIPS (millions of instructions per second) this would be higher for a program using simple instructions
72 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Performance Performance is determined by execution time Do any of the other variables equal performance? # of cycles to execute program? # of instructions in program? # of cycles per second? average # of cycles per instruction? average # of instructions per second? Common pitfall: thinking one of the variables is indicative of performance when it really isnt. 73 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers CPI Example Suppose we have two implementations of the same instruction set architecture (ISA). For some program, Machine A has a clock cycle time of 250 ps and a CPI of 2.0 Machine B has a clock cycle time of 500 ps and a CPI of 1.2 What machine is faster for this program, and by how much? If two machines have the same ISA which of our quantities (e.g., clock rate, CPI, execution time, # of instructions, MIPS) will always be identical?
74 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers # of Instructions Example A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively). The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C. Which sequence will be faster? How much? What is the CPI for each sequence? 75 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers MIPS example Two different compilers are being tested for a 4 GHz. machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively). Both compilers are used to produce code for a large piece of software. The first compiler's code uses 5 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions. The second compiler's code uses 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions. Which sequence will be faster according to MIPS? Which sequence will be faster according to execution time? 76
2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Benchmarks Performance best determined by running a real application Use programs typical of expected workload Or, typical of expected class of applications e.g., compilers/editors, scientific applications, graphics, etc. Small benchmarks nice for architects and designers easy to standardize can be abused SPEC (System Performance Evaluation Cooperative) companies have agreed on a set of real program and inputs valuable indicator of performance (and compiler technology) can still be abused 77 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Benchmark Games An embarrassed Intel Corp. acknowledged Friday that a bug in a software program known as a compiler had led the company to overstate the speed of its microprocessor chips on an industry benchmark by 10 percent. However, industry analysts said the coding errorwas a sad commentary on a common industry practice of cheating on standardized performance testsThe error was pointed out to Intel two days ago by a competitor, Motorola came in a test known as SPECint92Intel acknowledged that it had optimized its compiler to improve its test scores. The company had also said that it did not like the practice but felt to compelled to make the optimizations because its competitors were doing the
same thingAt the heart of Intels problem is the practice of tuning compiler programs to recognize certain computing problems in the test and then substituting special handwritten pieces of code Saturday, January 6, 1996 New York Times 78 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers SPEC 89 Compiler enhancements and performance 800 700 600 o ti a r Morgan Kaufmann Publishers 500 e c n a rm 400 o fr e p Morgan Kaufmann Publishers C E P 300 S 200 100
0 gcc espresso spice doduc nasa7 li eqntott matrix300 fpppp tomcatv Benchmark Compiler Enhanced Morgan Kaufmann Publisherscompiler 79 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers SPEC CPU2000 80 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers SPEC 2000 Does doubling the clock rate double the performance? Can a machine with a slower clock rate have better performance?
1.6 Pentium Morgan Kaufmann PublishersM Morgan Kaufmann [email protected] Morgan Kaufmann Publishers1.6/0.6 Morgan Kaufmann PublishersGHz Pentium Morgan Kaufmann Publishers4-M Morgan Kaufmann [email protected] Morgan Kaufmann Publishers2.4/1.2 Morgan Kaufmann PublishersGHz Pentium Morgan Kaufmann PublishersIII-M Morgan Kaufmann [email protected] Morgan Kaufmann Publishers1.2/0.8 Morgan Kaufmann PublishersGHz 1400 1.4 1200 1.2 Pentium Morgan Kaufmann Publishers4 Morgan Kaufmann PublishersCFP2000 1000 Pentium Morgan Kaufmann Publishers4 Morgan Kaufmann PublishersCINT2000 800 1.0 0.8 600 0.6 Pentium Morgan Kaufmann PublishersIII Morgan Kaufmann PublishersCINT2000 400 0.4 Pentium Morgan Kaufmann PublishersIII Morgan Kaufmann PublishersCFP2000 200 0.2 0 500 1000 1500
2000 Clock Morgan Kaufmann Publishersrate Morgan Kaufmann Publishersin Morgan Kaufmann PublishersMHz 2500 3000 3500 0.0 SPECINT2000 SPECFP2000 SPECINT2000 SPECFP2000 SPECINT2000 SPECFP2000 Always Morgan Kaufmann Publisherson/maximum Morgan Kaufmann Publishersclock Laptop Morgan Kaufmann Publishersmode/adaptive clock Minimum Morgan Kaufmann Publisherspower/minimum clock Benchmark Morgan Kaufmann Publishersand Morgan Kaufmann Publisherspower Morgan Kaufmann Publishersmode 81 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Experiment Phone a major computer retailer and tell them you are having trouble deciding between two different computers, specifically you are confused about the processors strengths and weaknesses (e.g., Pentium 4 at 2Ghz vs. Celeron M at 1.4 Ghz ) What kind of response are you likely to get?
What kind of response could you give a friend with the same question? 82 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Amdahl's Law Execution Time After Improvement = Execution Time Unaffected +( Execution Time Affected / Amount of Improvement ) Example: "Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?" How about making it 5 times faster? Principle: Make the common case fast 83 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Example Suppose we enhance a machine making all floating-point instructions run five times faster. If the execution time of some benchmark before the floating-point enhancement is 10 seconds, what will the speedup be if half of the 10 seconds is spent executing floating-point instructions? We are looking for a benchmark to show off the new floating-point unit described above, and want the overall benchmark to show a speedup of 3.
One benchmark we are considering runs for 100 seconds with the old floating-point hardware. How much of the execution time would floatingpoint instructions have to account for in this program in order to yield our desired speedup on this benchmark? 84 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Remember Performance is specific to a particular program/s Total execution time is a consistent summary of performance For a given architecture performance increases come from: increases in clock rate (without adverse CPI affects) improvements in processor organization that lower CPI compiler enhancements that lower CPI and/or instruction count Algorithm/Language choices that affect instruction count Pitfall: expecting improvement in one aspect of a machines performance to affect the total performance 85 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Lets Build a Processor Almost ready to move into chapter 5 and start building a processor First, lets review Boolean Logic and build the ALU well need (Material from Appendix B) operation
a 32 ALU result 32 b 32 86 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Review: Boolean Algebra & Gates Problem: Consider a logic function with three inputs: A, B, and C. Output D is true if at least one input is true Output E is true if exactly two inputs are true Output F is true only if all three inputs are true Show the truth table for these three functions. Show the Boolean equations for these three functions. Show an implementation consisting of inverters, AND, and OR gates. 87 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers An ALU (arithmetic logic unit)
Let's build an ALU to support the andi and ori instructions we'll just build a 1 bit ALU, and use 32 of them operation a op a b res result b Possible Implementation (sum-of-products): 88 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Review: The Multiplexor Selects one of the inputs to be the output, based on a control input S A 0 B 1
C note: we call this a 2-input mux even though it has 3 inputs! Lets build our ALU using a MUX: 89 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Different Implementations Not easy to decide the best way to build something Don't want too many inputs to a single gate Dont want to have to go through too many gates for our purposes, ease of comprehension is important Let's look at a 1-bit ALU for addition: CarryIn a Sum b cout = a b + a cin + b cin sum = a xor b xor cin CarryOut How could we build a 1-bit ALU for add, and, and or? How could we build a 32-bit ALU?
90 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Building a 32 bit ALU CarryIn a0 b0 Operation CarryIn ALU0 Result0 CarryOut Operation CarryIn a1 a 0 b1 CarryIn ALU1 Result1 CarryOut 1 Result a2
2 b b2 CarryIn ALU2 Result2 CarryOut CarryOut a31 b31 CarryIn ALU31 Result31 91 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers What about subtraction (a b) ? Two's complement approach: just negate b and add. How do we negate? A very clever solution: Binvert
Operation CarryIn a 0 1 b 0 Result 2 1 CarryOut 92 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Adding a NOR function Can also choose to invert a. How do we get a NOR b ? Ainvert Operation Binvert a CarryIn 0 0
1 1 b 0 + Result 2 1 CarryOut 93 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Tailoring the ALU to the MIPS Need to support the set-on-less-than instruction (slt) remember: slt is an arithmetic instruction produces a 1 if rs < rt and 0 otherwise use subtraction: (a-b) < 0 implies a < b Need to support test for equality (beq $t5, $t6, $t7) use subtraction: (a-b) = 0 implies a = b 94 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers
Supporting slt Can we figure out the idea? Binvert a Binvert CarryIn a 0 Operation Ainvert Operation Ainvert CarryIn 0 0 0 1 1 1 1 Result b 0
+ Result b 0 2 + 2 1 1 Less Less 3 3 Set Overflow detection Overflow Use this ALU for most significant bit CarryOut all other bits Supporting slt Operation
Binvert Ainvert CarryIn a0 b0 CarryIn ALU0 Less CarryOut Result0 a1 b1 0 CarryIn ALU1 Less CarryOut Result1 a2 b2 0 CarryIn ALU2 Less CarryOut Result2 .. . a31
b31 0 .. . CarryIn CarryIn ALU31 Less .. . Result31 Set Overflow 96 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Test for equality Notice control lines: Operation Bnegate Ainvert 0000 0001 0010 0110 0111 1100 = = =
= = = and or add subtract slt NOR Note: zero is a 1 when the result is zero! a0 b0 CarryIn ALU0 Less CarryOut a1 b1 0 CarryIn ALU1 Less CarryOut a2 b2 0 CarryIn ALU2 Less CarryOut .. .
a31 b31 0 Result0 Result1 .. . Result2 .. . CarryIn CarryIn ALU31 Less Zero .. . .. . Result31 Set Overflow 97 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Conclusion We can build an ALU to support the MIPS instruction set key idea: use multiplexor to select the output we want
we can efficiently perform subtraction using twos complement we can replicate a 1-bit ALU to produce a 32-bit ALU Important points about hardware all of the gates are always working the speed of a gate is affected by the number of inputs to the gate the speed of a circuit is affected by the number of gates in series (on the critical path or the deepest level of logic) Our primary focus: comprehension, however, Clever changes to organization can improve performance (similar to using better algorithms in software) We saw this in multiplication, lets look at addition now 98 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Problem: ripple carry adder is slow Is a 32-bit ALU as fast as a 1-bit ALU? Is there more than one way to do addition? two extremes: ripple carry and sum-of-products Can you see the ripple? How could you get rid of it? c1 = b0c0 + a0c0 + a0b0 c2 = b1c1 + a1c1 + a1b1 c2 = c3 = b2c2 + a2c2 + a2b2 c3 = c4 = b3c3 + a3c3 + a3b3 c4 = Not feasible! Why? 99 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers
Carry-lookahead adder An approach in-between our two extremes Motivation: If we didn't know the value of carry-in, what could we do? When would we always generate a carry? gi = ai bi When would we propagate the carry? pi = ai + bi Did we get rid of the ripple? c1 = g0 + p0c0 c2 = g1 + p1c1 c2 = c3 = g2 + p2c2 c3 = c4 = g3 + p3c3 c4 = Feasible! Why? 100 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Use principle to build bigger adders CarryIn a0 b0 a1 b1 a2 b2 a3 b3 a4
b4 a5 b5 a6 b6 a7 b7 a8 b8 a9 b9 a10 b10 a11 b11 a12 b12 a13 b13 a14 b14 a15 b15 CarryIn Result03 ALU0 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersP0 Morgan Kaufmann Publishers Morgan Kaufmann PublishersG0 pi gi C1 ci Morgan Kaufmann Publishers+ Morgan Kaufmann Publishers1
CarryIn Carry-lookahead Morgan Kaufmann Publishersunit Result47 ALU1 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersP1 Morgan Kaufmann Publishers Morgan Kaufmann PublishersG1 pi Morgan Kaufmann Publishers+ Morgan Kaufmann Publishers1 gi Morgan Kaufmann Publishers+ Morgan Kaufmann Publishers1 C2 ci Morgan Kaufmann Publishers+ Morgan Kaufmann Publishers2 CarryIn Cant build a 16 bit adder this way... (too big) Could use ripple carry of 4-bit CLA adders Better: use the CLA principle again! Result811 ALU2 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersP2 Morgan Kaufmann Publishers Morgan Kaufmann PublishersG2 pi Morgan Kaufmann Publishers+ Morgan Kaufmann Publishers2 gi Morgan Kaufmann Publishers+ Morgan Kaufmann Publishers2 C3 ci Morgan Kaufmann Publishers+ Morgan Kaufmann Publishers3 CarryIn
Result1215 ALU3 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersP3 Morgan Kaufmann Publishers Morgan Kaufmann PublishersG3 pi Morgan Kaufmann Publishers+ Morgan Kaufmann Publishers3 gi Morgan Kaufmann Publishers+ Morgan Kaufmann Publishers3 C4 CarryOut ci Morgan Kaufmann Publishers+ Morgan Kaufmann Publishers4 101 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers ALU Summary We can build an ALU to support MIPS addition Our focus is on comprehension, not performance Real processors use more sophisticated techniques for arithmetic Where performance is not critical, hardware description languages allow designers to completely automate the creation of hardware! 102 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Chapter Five 103 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers The Processor: Datapath & Control
We're ready to look at an implementation of the MIPS Simplified to contain only: memory-reference instructions: lw, sw arithmetic-logical instructions: add, sub, and, or, slt control flow instructions: beq, j Generic Implementation: use the program counter (PC) to supply instruction address get the instruction from memory read registers use the instruction to decide exactly what to do All instructions use the ALU after reading the registers Why? memory-reference? arithmetic? control flow? 104 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers More Implementation Details Abstract / Simplified View: 4 Add Add
Data PC Address Instruction Instruction memory Register Morgan Kaufmann Publishers# Registers Register Morgan Kaufmann Publishers# ALU Address Data memory Register Morgan Kaufmann Publishers# Data Two types of functional units: elements that operate on data values (combinational) elements that contain state (sequential) 105 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers State Elements Unclocked vs. Clocked Clocks used in synchronous logic when should an element that contains state be updated? Falling Morgan Kaufmann Publishersedge Clock Morgan Kaufmann Publishersperiod
Rising Morgan Kaufmann Publishersedge cycle time 106 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers An unclocked state element The set-reset latch output depends on present inputs and also on past inputs R S Q Q 107 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Latches and Flip-flops Output is equal to the stored value inside the element (don't need to ask for permission to look at the value) Change of state (value) is based on the clock Latches: whenever the inputs change, and the clock is asserted Flip-flop: state changes only on a clock edge (edge-triggered methodology) "logically true",
could mean electrically low A clocking methodology defines when signals can be read and written wouldn't want to read a signal at the same time it was being written 108 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers D-latch Two inputs: the data value to be stored (D) the clock signal (C) indicating when to read & store D Two outputs: the value of the internal state (Q) and it's complement C Q D C _ Q Q D 109 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers D flip-flop
Output changes only on the clock edge D D C D latch Q D C D latch Q Q Q Q C D C Q 110 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Our Implementation
An edge triggered methodology Typical execution: read contents of some state elements, send values through some combinational logic write results to one or more state elements State element 1 Combinational Morgan Kaufmann Publisherslogic State element 2 Clock Morgan Kaufmann Publisherscycle 111 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Register File Built using D flip-flops Read Morgan Kaufmann Publishersregister number Morgan Kaufmann Publishers1 Register Morgan Kaufmann Publishers0 Register Morgan Kaufmann Publishers1 . Morgan Kaufmann Publishers. Morgan Kaufmann Publishers. Read Morgan Kaufmann Publishersregister number Morgan Kaufmann Publishers1 Read Morgan Kaufmann Publishersregister number Morgan Kaufmann Publishers2 Write register Write
data Register Morgan Kaufmann Publishersn Morgan Kaufmann Publishers Morgan Kaufmann Publishers2 Read data Morgan Kaufmann Publishers1 Register Morgan Kaufmann Publishersfile Read data Morgan Kaufmann Publishers2 M u Read Morgan Kaufmann Publishersdata Morgan Kaufmann Publishers1 x Register Morgan Kaufmann Publishersn Morgan Kaufmann Publishers Morgan Kaufmann Publishers1 Read Morgan Kaufmann Publishersregister number Morgan Kaufmann Publishers2 Write M u Read Morgan Kaufmann Publishersdata Morgan Kaufmann Publishers2 x Do you understand? What is the Mux above? 112 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Abstraction
Make sure you understand the abstractions! Sometimes it is easy to think you do, when you dont Select A31 Select B31 A B M u x C31 32 32 M u x 32 C A30 B30 M u x C30
.. . .. . A0 B0 M u x C0 113 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Register File Note: we still use the real clock to determine when to write Write C 0 1 Register Morgan Kaufmann Publishersnumber n-to-2n decoder Register Morgan Kaufmann Publishers0 .. . D C
Register Morgan Kaufmann Publishers1 n Morgan Kaufmann Publishers Morgan Kaufmann Publishers1 n D .. . C Register Morgan Kaufmann Publishersn Morgan Kaufmann Publishers Morgan Kaufmann Publishers2 D C Register Morgan Kaufmann Publishersn Morgan Kaufmann Publishers Morgan Kaufmann Publishers1 Register Morgan Kaufmann Publishersdata D 114 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Simple Implementation Include the functional units we need for each instruction Instruction address MemWrite Instruction Add Sum PC Address Read data
16 Instruction memory a. Morgan Kaufmann PublishersInstruction Morgan Kaufmann Publishersmemory b. Morgan Kaufmann PublishersProgram Morgan Kaufmann Publisherscounter c. Morgan Kaufmann PublishersAdder Write data Data memory Sign extend 32 MemRead a. Morgan Kaufmann PublishersData Morgan Kaufmann Publishersmemory Morgan Kaufmann Publishersunit Register numbers 5 Read register Morgan Kaufmann Publishers1 5 Read register Morgan Kaufmann Publishers2 5 Data
Write register 4 Read data Morgan Kaufmann Publishers1 Data Registers ALU Morgan Kaufmann Publishersoperation Zero ALU ALU result Read data Morgan Kaufmann Publishers2 Write Data b. Morgan Kaufmann PublishersSign-extension Morgan Kaufmann Publishersunit Why do we need this stuff? RegWrite a. Morgan Kaufmann PublishersRegisters b. Morgan Kaufmann PublishersALU 115 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Building the Datapath
Use multiplexors to stitch them together PCSrc M u x Add Add 4 ALU result Shift left Morgan Kaufmann Publishers2 PC Read address Instruction Instruction memory Read register Morgan Kaufmann Publishers1 ALUSrc Read data Morgan Kaufmann Publishers1 ALU Morgan Kaufmann Publishersoperation MemWrite Read register Morgan Kaufmann Publishers2 Registers Read Write data Morgan Kaufmann Publishers2
register MemtoReg Zero M u x Write data ALU ALU result Address Write data RegWrite 16 4 Sign extend 32 Read data M u x Data memory
MemRead 116 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Control Selecting the operations to perform (ALU, read/write, etc.) Controlling the flow of data (multiplexor inputs) Information comes from the 32 bits of the instruction Example: add $8, $17, $18 Instruction Format: 000000 10001 10010 op rs rt 01000 00000 100000 rd shamt funct
ALU's operation based on instruction type and function code 117 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Control e.g., what should the ALU do with this instruction Example: lw $1, 100($2) 35 2 1 op rs rt 16 bit offset ALU control input 0000 0001 0010 0110 0111 1100 100
AND OR add subtract set-on-less-than NOR Why is the code for subtract 0110 and not 0011? 118 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Control Must describe hardware to compute 4-bit ALU control input given instruction type 00 = lw, sw ALUOp 01 = beq, computed from instruction type 10 = arithmetic function code for arithmetic Describe it using a truth table (can turn into gates): 119 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers 0 M u x Add ALU
Add result 4 Instruction Morgan Kaufmann Publishers[3126] PC Read address Instruction [310] Instruction memory Instruction Morgan Kaufmann Publishers[2521] Read register Morgan Kaufmann Publishers1 Instruction Morgan Kaufmann Publishers[2016] Read register Morgan Kaufmann Publishers2 0 M u Instruction Morgan Kaufmann Publishers[1511] x 1 Instruction Morgan Kaufmann Publishers[150] Shift left Morgan Kaufmann Publishers2 RegDst Branch MemRead MemtoReg
ALUOp MemWrite ALUSrc RegWrite Control Write register Write data 16 1 Read data Morgan Kaufmann Publishers1 Zero Read data Morgan Kaufmann Publishers2 Registers Sign extend 0 M u x 1 ALU ALU result Address Read data
1 M u x 0 Data Write memory data 32 ALU control Instruction Morgan Kaufmann Publishers[50] Memto- Reg Mem Mem Instruction RegDst ALUSrc Reg Write Read Write Branch ALUOp1 ALUp0 R-format 1 0 0 1 0 0 0 1 0 lw 0 1 1 1 1 0 0 0
0 sw X 1 X 0 0 1 0 0 0 beq X 0 X 0 0 0 1 0 1 Control Simple combinational logic (truth tables) Inputs Op5 Op4 Op3 Op2 ALUOp Op1 ALU Morgan Kaufmann Publisherscontrol Morgan Kaufmann Publishersb lock Op0
ALUOp0 ALUOp1 Outputs F3 F Morgan Kaufmann Publishers(5 0) F2 F1 F0 Operation2 Operation1 R-format Operation Iw sw beq RegDst ALUSrc MemtoReg Operation0 RegWrite MemRead MemWrite Branch ALUOp1 ALUOpO 121 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers
Our Simple Control Structure All of the logic is combinational We wait for everything to settle down, and the right thing to be done ALU might not produce right answer right away we use write signals along with clock to determine when to write Cycle time determined by length of the longest path State element 1 Combinational Morgan Kaufmann Publisherslogic State element 2 Clock Morgan Kaufmann Publisherscycle We are ignoring some details like setup and hold times 122 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Single Cycle Implementation Calculate cycle time assuming negligible delays except: memory (200ps), ALU and adders (100ps), register file access (50ps) PCSrc
M u x Add Add 4 ALU result Shift left Morgan Kaufmann Publishers2 PC Read address Instruction Instruction memory Read register Morgan Kaufmann Publishers1 ALUSrc Read data Morgan Kaufmann Publishers1 ALU Morgan Kaufmann Publishersoperation MemWrite Read register Morgan Kaufmann Publishers2 Registers Read Write data Morgan Kaufmann Publishers2 register
MemtoReg Zero M u x Write data ALU ALU result Address Write data RegWrite 16 4 Sign extend 32 Read data M u x Data memory MemRead
123 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Where we are headed Single Cycle Problems: what if we had a more complicated instruction like floating point? wasteful of area One Solution: use a smaller cycle time have different instructions take different numbers of cycles a multicycle datapath: PC Address Instruction register Instruction or Morgan Kaufmann Publishersdata Memory Data Memory data register Data A
Register Morgan Kaufmann Publishers# Registers Register Morgan Kaufmann Publishers# ALU ALUOut B Register Morgan Kaufmann Publishers# 124 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Multicycle Approach We will be reusing functional units ALU used to compute address and to increment PC Memory used for instruction and data Our control signals will not be determined directly by instruction e.g., what should the ALU do for a subtract instruction? Well use a finite state machine for control 125 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Multicycle Approach Break up the instructions into steps, each step takes a cycle balance the amount of work to be done restrict each cycle to use only one major functional unit
At the end of a cycle store values for use in later cycles (easiest thing to do) introduce additional internal registers PC 0 M u x 1 Address Memory MemData Write data Instruction [2016] Instruction [150] Instruction register Instruction [150] Memory data register 0 M u x 1 Read register Morgan Kaufmann Publishers1 Instruction [2521]
0 M Instruction u x [1511] 1 0 M u x 1 16 Read data Morgan Kaufmann Publishers1 Read register Morgan Kaufmann Publishers2 Registers Write Read register data Morgan Kaufmann Publishers2 A B 4 Write data Sign extend 32 Zero ALU ALU result
ALUOut 0 1M u 2 x 3 Shift left Morgan Kaufmann Publishers2 126 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Instructions from ISA perspective Consider each instruction from perspective of ISA. Example: The add instruction changes a register. Register specified by bits 15:11 of instruction. Instruction specified by the PC. New value is the sum (op) of two registers. Registers specified by bits 25:21 and 20:16 of the instruction Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]] In order to accomplish this we must break up the instruction. (kind of like introducing variables when programming) 127 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Breaking down an instruction ISA definition of arithmetic:
Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] Reg[Memory[PC][20:16]] Could break down to: IR <= Memory[PC] A <= Reg[IR[25:21]] B <= Reg[IR[20:16]] ALUOut <= A op B Reg[IR[20:16]] <= ALUOut We forgot an important part of the definition of arithmetic! PC <= PC + 4 op 128 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Idea behind multicycle approach We define each instruction from the ISA perspective (do this!) Break it down into steps following our rule that data flows through at most one major functional unit (e.g., balance work across steps) Introduce new registers as needed (e.g, A, B, ALUOut, MDR, etc.) Finally try and pack as much work into each step
(avoid unnecessary cycles) while also trying to share steps where possible (minimizes control, helps to simplify solution) Result: Our books multicycle Implementation! 129 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Five Execution Steps Instruction Fetch Instruction Decode and Register Fetch Execution, Memory Address Computation, or Branch Completion Memory Access or R-type instruction completion Write-back step INSTRUCTIONS TAKE FROM 3 - 5 CYCLES! 130 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Step 1: Instruction Fetch
Use PC to get instruction and put it in the Instruction Register. Increment the PC by 4 and put the result back in the PC. Can be described succinctly using RTL "Register-Transfer Language" IR <= Memory[PC]; PC <= PC + 4; Can we figure out the values of the control signals? What is the advantage of updating the PC now? 131 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Step 2: Instruction Decode and Register Fetch Read registers rs and rt in case we need them Compute the branch address in case the instruction is a branch RTL: A <= Reg[IR[25:21]]; B <= Reg[IR[20:16]]; ALUOut <= PC + (sign-extend(IR[15:0]) << 2); We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic) 132 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Step 3 (instruction dependent) ALU is performing one of three functions, based on instruction type
Memory Reference: ALUOut <= A + sign-extend(IR[15:0]); R-type: ALUOut <= A op B; Branch: if (A==B) PC <= ALUOut; 133 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Step 4 (R-type or memory-access) Loads and stores access memory MDR <= Memory[ALUOut]; or Memory[ALUOut] <= B; R-type instructions finish Reg[IR[15:11]] <= ALUOut; The write actually takes place at the end of the cycle on the edge 134 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Write-back step Reg[IR[20:16]] <= MDR; Which instruction needs this? 135 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Summary: 136 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Simple Questions How many cycles will it take to execute this code? Label: lw $t2, 0($t3) lw $t3, 4($t3) beq $t2, $t3, Label add $t5, $t2, $t3 sw $t5, 8($t3) ... #assume not What is going on during the 8th cycle of execution? In what cycle does the actual addition of $t2 and $t3 takes place? 137 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers PCSource
PCWriteCond PCWrite ALUOp Outputs IorD MemRead ALUSrcB Control ALUSrcA MemWrite MemtoReg Op [50] RegWrite IRWrite 0 RegDst 26 Instruction Morgan Kaufmann Publishers[25-0] PC 0 M u x 1
Instruction [3126] Address Memory MemData Write data Instruction [2016] Instruction [150] Instruction register Instruction [150] Memory data register 0 M u x 1 Read register Morgan Kaufmann Publishers1 Instruction [2521] Read data Morgan Kaufmann Publishers1 Read register Morgan Kaufmann Publishers2 Registers Write Read
register data Morgan Kaufmann Publishers2 0 M Instruction u x [1511] 1 A 16 Sign extend 32 Instruction Morgan Kaufmann Publishers[50] 28 PC Morgan Kaufmann Publishers[3128] Zero ALU ALU result B 4 Write data 0 M u x 1
Shift left Morgan Kaufmann Publishers2 Shift left Morgan Kaufmann Publishers2 Jump address [310] 0 1M u 2 x 3 ALU control ALUOut M 1 u x 2 Review: finite state machines Finite state machines: a set of states and next state function (determined by current state and the input) output function (determined by current state and possibly input) Next state Current Morgan Kaufmann Publishersstate Next-state function
Clock Inputs Output function Outputs Well use a Moore machine (output based only on current state) 139 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Review: finite state machines Example: B. 37 A friend would like you to build an electronic eye for use as a fake security device. The device consists of three lights lined up in a row, controlled by the outputs Left, Middle, and Right, which, if asserted, indicate that a light should be on. Only one light is on at a time, and the light moves from left to right and then from right to left, thus scaring away thieves who believe that the device is monitoring their activity. Draw the graphical representation for the finite state machine used to specify the electronic eye. Note that the rate of the eyes movement will be controlled by the clock speed (which should not be too great) and that there are essentially no inputs. 140 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Implementing the Control Value of control signals is dependent upon: what instruction is being executed which step is being performed
Use the information weve accumulated to specify a finite state machine specify the finite state machine graphically, or use microprogramming Implementation can be derived from specification 141 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Graphical Specification of FSM Instruction Morgan Kaufmann Publishersfetch MemRead ALUSrcA Morgan Kaufmann Publishers= Morgan Kaufmann Publishers0 IorD Morgan Kaufmann Publishers= Morgan Kaufmann Publishers0 IRWrite ALUSrcB Morgan Kaufmann Publishers= Morgan Kaufmann Publishers01 ALUOp Morgan Kaufmann Publishers= Morgan Kaufmann Publishers00 PCWrite PCSource Morgan Kaufmann Publishers= Morgan Kaufmann Publishers00 0 Start Note: Instruction Morgan Kaufmann Publishersdecode/ register Morgan Kaufmann Publishersfetch 1 ALUSrcA Morgan Kaufmann Publishers= Morgan Kaufmann Publishers0 ALUSrcB Morgan Kaufmann Publishers= Morgan Kaufmann Publishers11 ALUOp Morgan Kaufmann Publishers= Morgan Kaufmann Publishers00 dont care if not mentioned
asserted if name only otherwise exact value Memory Morgan Kaufmann Publishersaddress computation How many state bits will we need? 2 6 ALUSrcA Morgan Kaufmann Publishers= Morgan Kaufmann Publishers1 ALUSrcB Morgan Kaufmann Publishers= Morgan Kaufmann Publishers10 ALUOp Morgan Kaufmann Publishers= Morgan Kaufmann Publishers00 8 ALUSrcA Morgan Kaufmann Publishers= Morgan Kaufmann Publishers1 ALUSrcB Morgan Kaufmann Publishers= Morgan Kaufmann Publishers00 ALUOp Morgan Kaufmann Publishers= Morgan Kaufmann Publishers10 Memory access 3 Memory access 5 MemRead IorD Morgan Kaufmann Publishers= Morgan Kaufmann Publishers1 Branch completion Execution Jump completion 9
ALUSrcA Morgan Kaufmann Publishers= Morgan Kaufmann Publishers1 ALUSrcB Morgan Kaufmann Publishers= Morgan Kaufmann Publishers00 ALUOp Morgan Kaufmann Publishers= Morgan Kaufmann Publishers01 PCWriteCond PCSource Morgan Kaufmann Publishers= Morgan Kaufmann Publishers01 PCWrite PCSource Morgan Kaufmann Publishers= Morgan Kaufmann Publishers10 R-type Morgan Kaufmann Publisherscompletion 7 MemWrite IorD Morgan Kaufmann Publishers= Morgan Kaufmann Publishers1 RegDst Morgan Kaufmann Publishers= Morgan Kaufmann Publishers1 RegWrite MemtoReg Morgan Kaufmann Publishers= Morgan Kaufmann Publishers0 Memory Morgan Kaufmann Publishersread completon Morgan Kaufmann Publishersstep 4 RegDst Morgan Kaufmann Publishers= Morgan Kaufmann Publishers1 RegWrite MemtoReg Morgan Kaufmann Publishers= Morgan Kaufmann Publishers0 142 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Finite State Machine for Control Implementation: PCWrite PCWriteCond IorD MemRead MemWrite
IRWrite Control Morgan Kaufmann Publisherslogic MemtoReg PCSource ALUOp Outputs ALUSrcB ALUSrcA RegWrite RegDst NS3 NS2 NS1 NS0 Inputs 5 p O 4 p O 3 p O 2 p O 1 p O
Instruction Morgan Kaufmann Publishersregister opcode Morgan Kaufmann Publishersfield 0 p O 3 S 2 S 1 S 0 S State Morgan Kaufmann Publishersregister 143 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers PLA Implementation If I picked a horizontal or vertical line could you explain it? Op5 Op4 Op3 Op2 Op1 Op0 S3 S2 S1 S0
PCWrite PCWriteCond IorD MemRead MemWrite IRWrite MemtoReg PCSource1 PCSource0 ALUOp1 ALUOp0 ALUSrcB1 ALUSrcB0 ALUSrcA RegWrite RegDst NS3 NS2 NS1 NS0 144 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers ROM Implementation ROM = "Read Only Memory" values of memory locations are fixed ahead of time A ROM can be used to implement a truth table if the address is m-bits, we can address 2m entries in the ROM. our outputs are the bits of data that the address points to. m n 0
0 0 0 1 1 1 1 0 0 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0 1 1 1 0 0 0 0 0 1 1 0 0
0 1 1 1 0 0 0 0 0 1 1 1 0 0 0 0 1 0 1 m is the "height", and n is the "width" 145 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers ROM Implementation How many inputs are there? 6 bits for opcode, 4 bits for state = 10 address lines (i.e., 210 = 1024 different addresses) How many outputs are there? 16 datapath-control outputs, 4 state bits = 20 outputs
ROM is 210 x 20 = 20K bits Rather wasteful, since for lots of the entries, the outputs are the same i.e., opcode is often ignored (and a rather unusual size) 146 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers ROM vs PLA Break up the table into two parts 4 state bits tell you the 16 outputs, 24 x 16 bits of ROM 10 bits tell you the 4 next state bits, 210 x 4 bits of ROM Total: 4.3K bits of ROM PLA is much smaller can share product terms only need entries that produce an active output can take into account don't cares Size is (#inputs #product-terms) + (#outputs #product-terms) For this example = (10x17)+(20x17) = 510 PLA cells PLA cells usually about the size of a ROM cell (slightly bigger) 147
2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Another Implementation Style Complex instructions: the "next state" is often current state + 1 Control Morgan Kaufmann Publishersunit PLA Morgan Kaufmann Publisherso r Morgan Kaufmann PublishersROM Outputs Input PCWrite PCWriteCond IorD MemRead MemWrite IRWrite BWrite MemtoReg PCSource ALUOp ALUSrcB ALUSrcA RegWrite RegDst AddrCtl 1 State Adder Address Morgan Kaufmann Publishersselect Morgan Kaufmann Publisherslogic ] 0 [5 p
O Instruction Morgan Kaufmann Publishersregister opcode Morgan Kaufmann Publishersfield 148 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Details Op 000000 000010 000100 100011 101011 Dispatch ROM 1 Opcode name R-format jmp beq lw sw Value 0110 1001 1000 0010 0010 Op 100011 101011 Dispatch ROM 2 Opcode name lw sw
Value 0011 0101 PLA Morgan Kaufmann Publishersor Morgan Kaufmann PublishersROM 1 State Adder 3 Mux 2 1 AddrCtl 0 0 Dispatch Morgan Kaufmann PublishersROM Morgan Kaufmann Publishers2 Dispatch Morgan Kaufmann PublishersROM Morgan Kaufmann Publishers1 Address Morgan Kaufmann Publishersselect Morgan Kaufmann Publisherslogic Instruction Morgan Kaufmann Publishersregister opcode Morgan Kaufmann Publishersfield State number 0 1 2 3 4 5 6 7 8 9 Address-control action Use Morgan Kaufmann Publishersincremented Morgan Kaufmann Publishersstate Use Morgan Kaufmann Publishersdispatch Morgan Kaufmann PublishersROM Morgan Kaufmann Publishers1
Use Morgan Kaufmann Publishersdispatch Morgan Kaufmann PublishersROM Morgan Kaufmann Publishers2 Use Morgan Kaufmann Publishersincremented Morgan Kaufmann Publishersstate Replace Morgan Kaufmann Publishersstate Morgan Kaufmann Publishersnumber Morgan Kaufmann Publishersby Morgan Kaufmann Publishers0 Replace Morgan Kaufmann Publishersstate Morgan Kaufmann Publishersnumber Morgan Kaufmann Publishersby Morgan Kaufmann Publishers0 Use Morgan Kaufmann Publishersincremented Morgan Kaufmann Publishersstate Replace Morgan Kaufmann Publishersstate Morgan Kaufmann Publishersnumber Morgan Kaufmann Publishersby Morgan Kaufmann Publishers0 Replace Morgan Kaufmann Publishersstate Morgan Kaufmann Publishersnumber Morgan Kaufmann Publishersby Morgan Kaufmann Publishers0 Replace Morgan Kaufmann Publishersstate Morgan Kaufmann Publishersnumber Morgan Kaufmann Publishersby Morgan Kaufmann Publishers0 Value of AddrCtl 3 1 2 3 0 0 3 0 0 0 149 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Microprogramming Control Morgan Kaufmann Publishersunit Microcode Morgan Kaufmann Publishersmemory Outputs Input PCWrite PCWriteCond IorD MemRead MemWrite IRWrite
BWrite MemtoReg PCSource ALUOp ALUSrcB ALUSrcA RegWrite RegDst AddrCtl Datapath 1 Microprogram Morgan Kaufmann Publisherscounter Adder Address Morgan Kaufmann Publishersselect Morgan Kaufmann Publisherslogic Instruction Morgan Kaufmann Publishersregister opcode Morgan Kaufmann Publishersfield What are the microinstructions ? 150 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Microprogramming A specification methodology appropriate if hundreds of opcodes, modes, cycles, etc. signals specified symbolically using microinstructions Label Fetch Mem1 LW2 ALU
control Add Add Add SRC1 PC PC A Register control SRC2 4 Extshft Read Extend PCWrite Memory control Read Morgan Kaufmann PublishersPC ALU Read Morgan Kaufmann PublishersALU Write Morgan Kaufmann PublishersMDR SW2 Rformat1 Func Morgan Kaufmann Publisherscode A Write Morgan Kaufmann PublishersALU B Write Morgan Kaufmann PublishersALU BEQ1 JUMP1 Subt
A B ALUOut-cond Jump Morgan Kaufmann Publishersaddress Sequencing Seq Dispatch Morgan Kaufmann Publishers1 Dispatch Morgan Kaufmann Publishers2 Seq Fetch Fetch Seq Fetch Fetch Fetch Will two implementations of the same architecture have the same microcode? What would a microassembler do? 151 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Microinstruction format Field name ALU Morgan Kaufmann Publisherscontrol SRC1 SRC2 Value Add Subt Func Morgan Kaufmann Publisherscode PC
A B 4 Extend Extshft Read ALUOp Morgan Kaufmann Publishers= Morgan Kaufmann Publishers10 ALUSrcA Morgan Kaufmann Publishers= Morgan Kaufmann Publishers0 ALUSrcA Morgan Kaufmann Publishers= Morgan Kaufmann Publishers1 ALUSrcB Morgan Kaufmann Publishers= Morgan Kaufmann Publishers00 ALUSrcB Morgan Kaufmann Publishers= Morgan Kaufmann Publishers01 ALUSrcB Morgan Kaufmann Publishers= Morgan Kaufmann Publishers10 ALUSrcB Morgan Kaufmann Publishers= Morgan Kaufmann Publishers11 Write Morgan Kaufmann PublishersALU RegWrite, RegDst Morgan Kaufmann Publishers= Morgan Kaufmann Publishers1, MemtoReg Morgan Kaufmann Publishers= Morgan Kaufmann Publishers0 RegWrite, RegDst Morgan Kaufmann Publishers= Morgan Kaufmann Publishers0, MemtoReg Morgan Kaufmann Publishers= Morgan Kaufmann Publishers1 MemRead, lorD Morgan Kaufmann Publishers= Morgan Kaufmann Publishers0 MemRead, lorD Morgan Kaufmann Publishers= Morgan Kaufmann Publishers1 MemWrite, lorD Morgan Kaufmann Publishers= Morgan Kaufmann Publishers1 PCSource Morgan Kaufmann Publishers= Morgan Kaufmann Publishers00 PCWrite PCSource Morgan Kaufmann Publishers= Morgan Kaufmann Publishers01, PCWriteCond PCSource Morgan Kaufmann Publishers= Morgan Kaufmann Publishers10, PCWrite AddrCtl Morgan Kaufmann Publishers= Morgan Kaufmann Publishers11 AddrCtl Morgan Kaufmann Publishers= Morgan Kaufmann Publishers00 AddrCtl Morgan Kaufmann Publishers= Morgan Kaufmann Publishers01 AddrCtl Morgan Kaufmann Publishers= Morgan Kaufmann Publishers10
Register control Write Morgan Kaufmann PublishersMDR Read Morgan Kaufmann PublishersPC Memory Read Morgan Kaufmann PublishersALU Write Morgan Kaufmann PublishersALU ALU PC Morgan Kaufmann Publisherswrite Morgan Kaufmann Publisherscontrol ALUOut-cond jump Morgan Kaufmann Publishersaddress Sequencing Signals active ALUOp Morgan Kaufmann Publishers= Morgan Kaufmann Publishers00 ALUOp Morgan Kaufmann Publishers= Morgan Kaufmann Publishers01 Seq Fetch Dispatch Morgan Kaufmann Publishers1 Dispatch Morgan Kaufmann Publishers2 Comment Cause Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersALU Morgan Kaufmann Publishersto Morgan Kaufmann Publishersadd. Cause Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersALU Morgan Kaufmann Publishersto Morgan Kaufmann Publisherssubtract; Morgan Kaufmann Publishersthis Morgan Kaufmann Publishersimplements Morgan Kaufmann Publishersthe Morgan Kaufmann Publisherscompare Morgan Kaufmann Publishersfor branches. Use Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersinstruction's Morgan Kaufmann Publishersfunction Morgan Kaufmann Publisherscode Morgan Kaufmann Publishersto Morgan Kaufmann Publishersdetermine Morgan Kaufmann PublishersALU Morgan Kaufmann Publisherscontrol. Use Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersPC Morgan Kaufmann Publishersas Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersfirst Morgan Kaufmann PublishersALU Morgan Kaufmann Publishersinput. Register Morgan Kaufmann PublishersA Morgan Kaufmann Publishersis Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersfirst Morgan Kaufmann PublishersALU Morgan Kaufmann Publishersinput. Register Morgan Kaufmann PublishersB Morgan Kaufmann Publishersis Morgan Kaufmann Publishersthe Morgan Kaufmann Publisherssecond Morgan Kaufmann PublishersALU Morgan Kaufmann Publishersinput. Use Morgan Kaufmann Publishers4 Morgan Kaufmann Publishersas Morgan Kaufmann Publishersthe Morgan Kaufmann Publisherssecond Morgan Kaufmann PublishersALU Morgan Kaufmann Publishersinput. Use Morgan Kaufmann Publishersoutput Morgan Kaufmann Publishersof Morgan Kaufmann Publishersthe Morgan Kaufmann Publisherssign Morgan Kaufmann Publishersextension Morgan Kaufmann Publishersunit Morgan Kaufmann Publishersas Morgan Kaufmann Publishersthe Morgan Kaufmann Publisherssecond Morgan Kaufmann PublishersALU Morgan Kaufmann Publishersinput. Use Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersoutput Morgan Kaufmann Publishersof Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersshift-by-two Morgan Kaufmann Publishersunit Morgan Kaufmann Publishersas Morgan Kaufmann Publishersthe Morgan Kaufmann Publisherssecond Morgan Kaufmann PublishersALU Morgan Kaufmann Publishersinput. Read Morgan Kaufmann Publisherstwo Morgan Kaufmann Publishersregisters Morgan Kaufmann Publishersusing Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersrs Morgan Kaufmann Publishersand Morgan Kaufmann Publishersrt Morgan Kaufmann Publishersfields Morgan Kaufmann Publishersof Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersIR Morgan Kaufmann Publishersas Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersregister numbers Morgan Kaufmann Publishersand Morgan Kaufmann Publishersputting Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersdata Morgan Kaufmann Publishersinto Morgan Kaufmann Publishersregisters Morgan Kaufmann PublishersA Morgan Kaufmann Publishersand Morgan Kaufmann PublishersB.
Write Morgan Kaufmann Publishersa Morgan Kaufmann Publishersregister Morgan Kaufmann Publishersusing Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersrd Morgan Kaufmann Publishersfield Morgan Kaufmann Publishersof Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersIR Morgan Kaufmann Publishersas Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersregister Morgan Kaufmann Publishersnumber Morgan Kaufmann Publishersand the Morgan Kaufmann Publisherscontents Morgan Kaufmann Publishersof Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersALUOut Morgan Kaufmann Publishersas Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersdata. Write Morgan Kaufmann Publishersa Morgan Kaufmann Publishersregister Morgan Kaufmann Publishersusing Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersrt Morgan Kaufmann Publishersfield Morgan Kaufmann Publishersof Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersIR Morgan Kaufmann Publishersas Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersregister Morgan Kaufmann Publishersnumber Morgan Kaufmann Publishersand the Morgan Kaufmann Publisherscontents Morgan Kaufmann Publishersof Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersMDR Morgan Kaufmann Publishersas Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersdata. Read Morgan Kaufmann Publishersmemory Morgan Kaufmann Publishersusing Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersPC Morgan Kaufmann Publishersas Morgan Kaufmann Publishersaddress; Morgan Kaufmann Publisherswrite Morgan Kaufmann Publishersresult Morgan Kaufmann Publishersinto Morgan Kaufmann PublishersIR Morgan Kaufmann Publishers(and Morgan Kaufmann Publishers the Morgan Kaufmann PublishersMDR). Read Morgan Kaufmann Publishersmemory Morgan Kaufmann Publishersusing Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersALUOut Morgan Kaufmann Publishersas Morgan Kaufmann Publishersaddress; Morgan Kaufmann Publisherswrite Morgan Kaufmann Publishersresult Morgan Kaufmann Publishersinto Morgan Kaufmann PublishersMDR. Write Morgan Kaufmann Publishersmemory Morgan Kaufmann Publishersusing Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersALUOut Morgan Kaufmann Publishersas Morgan Kaufmann Publishersaddress, Morgan Kaufmann Publisherscontents Morgan Kaufmann Publishersof Morgan Kaufmann PublishersB Morgan Kaufmann Publishersas Morgan Kaufmann Publishersthe data. Write Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersoutput Morgan Kaufmann Publishersof Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersALU Morgan Kaufmann Publishersinto Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersPC. If Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersZero Morgan Kaufmann Publishersoutput Morgan Kaufmann Publishersof Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersALU Morgan Kaufmann Publishersis Morgan Kaufmann Publishersactive, Morgan Kaufmann Publisherswrite Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersPC Morgan Kaufmann Publisherswith Morgan Kaufmann Publishersthe Morgan Kaufmann Publisherscontents of Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersregister Morgan Kaufmann PublishersALUOut. Write Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersPC Morgan Kaufmann Publisherswith Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersjump Morgan Kaufmann Publishersaddress Morgan Kaufmann Publishersfrom Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersinstruction. Choose Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersnext Morgan Kaufmann Publishersmicroinstruction Morgan Kaufmann Publisherssequentially. Go Morgan Kaufmann Publishersto Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersfirst Morgan Kaufmann Publishersmicroinstruction Morgan Kaufmann Publishersto Morgan Kaufmann Publishersbegin Morgan Kaufmann Publishersa Morgan Kaufmann Publishersnew Morgan Kaufmann Publishersinstruction. Dispatch Morgan Kaufmann Publishersusing Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersROM Morgan Kaufmann Publishers1. Dispatch Morgan Kaufmann Publishersusing Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersROM Morgan Kaufmann Publishers2. 152 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Maximally vs. Minimally Encoded No encoding: 1 bit for each datapath operation faster, requires more memory (logic) used for Vax 780 an astonishing 400K of memory! Lots of encoding: send the microinstructions through logic to get control signals uses less memory, slower Historical context of CISC: Too much logic to put on a single chip with everything else
Use a ROM (or even RAM) to hold the microcode Its easy to add new instructions 153 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Microcode: Trade-offs Distinction between specification and implementation is sometimes blurred Specification Advantages: Easy to design and write Design architecture and microcode in parallel Implementation (off-chip ROM) Advantages Easy to change since values are in memory Can emulate other architectures Can make use of internal registers Implementation Disadvantages, SLOWER now that: Control is implemented on same chip as processor ROM is no longer faster than RAM No need to go back and make changes 154 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Historical Perspective
In the 60s and 70s microprogramming was very important for implementing machines This led to more sophisticated ISAs and the VAX In the 80s RISC processors based on pipelining became popular Pipelining the microinstructions is also possible! Implementations of IA-32 architecture processors since 486 use: hardwired control for simpler instructions (few cycles, FSM control implemented using PLA or random logic) microcoded control for more complex instructions (large numbers of cycles, central control store) The IA-64 architecture uses a RISC-style ISA and can be implemented without a large central control store 155 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Pentium 4 Pipelining is important (last IA-32 without it was 80386 in 1985) Control Control I/O interface Chapter 7 Instruction Morgan Kaufmann Publisherscache Data cache Enhanced
floating Morgan Kaufmann Publisherspoint and Morgan Kaufmann Publishersmultimedia Integer datapath Control Advanced Morgan Kaufmann Publisherspipelining hyperthreading Morgan Kaufmann Publisherssupport Secondary cache and memory interface Chapter 6 Control Pipelining is used for the simple instructions favored by compilers Simply put, a high performance implementation needs to ensure that the simple instructions execute quickly, and that the burden of the complexities of the instruction set penalize the complex, less frequently used, instructions 156 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Pentium 4 Somewhere in all that control we must handle complex instructions Control Control I/O
interface Instruction Morgan Kaufmann Publisherscache Data cache Enhanced floating Morgan Kaufmann Publisherspoint and Morgan Kaufmann Publishersmultimedia Integer datapath Control Advanced Morgan Kaufmann Publisherspipelining hyperthreading Morgan Kaufmann Publisherssupport Secondary cache and memory interface Control Processor executes simple microinstructions, 70 bits wide (hardwired) 120 control lines for integer datapath (400 for floating point) If an instruction requires more than 4 microinstructions to implement, control from microcode ROM (8000 microinstructions) Its complicated! 157 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers
Chapter 5 Summary If we understand the instructions We can build a simple processor! If instructions take different amounts of time, multi-cycle is better Datapath implemented using: Combinational logic for arithmetic State holding elements to remember bits Control implemented using: Combinational logic for single-cycle implementation Finite state machine for multi-cycle implementation 158 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Chapter Six 159 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Pipelining Improve performance by increasing instruction throughput Program execution Time order
(in Morgan Kaufmann Publishersinstructions) 200 lw Morgan Kaufmann Publishers Morgan Kaufmann Publishers$1, Morgan Kaufmann Publishers100($0) Instruction fetch Reg lw Morgan Kaufmann Publishers Morgan Kaufmann Publishers$2, Morgan Kaufmann Publishers Morgan Kaufmann Publishers200($0) 400 600 Data access ALU 800 1000 1200 1400 ALU Data access 1600 1800 Reg Instruction Reg fetch 800 Morgan Kaufmann Publishersps
lw Morgan Kaufmann Publishers Morgan Kaufmann Publishers$3, Morgan Kaufmann Publishers Morgan Kaufmann Publishers300($0) Reg Instruction fetch 800 Morgan Kaufmann Publishersps Note: timing assumptions changed for this example 800 Morgan Kaufmann Publishersps Program execution Time order (in Morgan Kaufmann Publishersinstructions) lw Morgan Kaufmann Publishers Morgan Kaufmann Publishers$1, Morgan Kaufmann Publishers100($0) 200 Instruction fetch 400 Reg lw Morgan Kaufmann Publishers Morgan Kaufmann Publishers$2, Morgan Kaufmann Publishers Morgan Kaufmann Publishers200($0) 200 Morgan Kaufmann Publishersps Instruction fetch lw Morgan Kaufmann Publishers Morgan Kaufmann Publishers$3, Morgan Kaufmann Publishers Morgan Kaufmann Publishers300($0) 200 Morgan Kaufmann Publishersps 600 ALU Reg Instruction
fetch 800 Data access ALU Reg 1000 1200 1400 Reg Data access ALU Reg Data access Reg 200 Morgan Kaufmann Publishersps 200 Morgan Kaufmann Publishersps 200 Morgan Kaufmann Publishersps 200 Morgan Kaufmann Publishersps 200 Morgan Kaufmann Publishersps Ideal speedup is number of stages in the pipeline. Do we achieve this? 160 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Pipelining What makes it easy all instructions are the same length just a few instruction formats
memory operands appear only in loads and stores What makes it hard? structural hazards: suppose we had only one memory control hazards: need to worry about branch instructions data hazards: an instruction depends on a previous instruction Well build a simple pipeline and look at these issues Well talk about modern processors and what really makes it hard: exception handling trying to improve performance with out-of-order execution, etc. 161 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Basic Idea IF: Morgan Kaufmann PublishersInstruction Morgan Kaufmann Publishersfetch ID: Morgan Kaufmann PublishersInstruction Morgan Kaufmann Publishersdecode/ register Morgan Kaufmann Publishersfile Morgan Kaufmann Publishersread EX: Morgan Kaufmann PublishersExecute/ address Morgan Kaufmann Publisherscalculation MEM: Morgan Kaufmann PublishersMemory Morgan Kaufmann Publishersaccess WB: Morgan Kaufmann PublishersWrite Morgan Kaufmann Publishersback Add 4 Shift left Morgan Kaufmann Publishers2
P C Address Instruction Instruction memory Read Read register Morgan Kaufmann Publishers1 data Morgan Kaufmann Publishers1 Read register Morgan Kaufmann Publishers2 Registers Write Read register data Morgan Kaufmann Publishers2 Write data 16 ADD Add result Zero ALU ALU result Address Read data Data Memory Write data
Sign 32 extend What do we need to add to actually split the datapath into stages? 162 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Pipelined Datapath IF/ID ID/EX EX/MEM MEM/WB Add 4 Shift left Morgan Kaufmann Publishers2 PC Address Instruction memory Add Add result Read register Morgan Kaufmann Publishers1 Read data Morgan Kaufmann Publishers1
Read register Morgan Kaufmann Publishers2 Registers Read Write data Morgan Kaufmann Publishers2 register Zero ALU ALU result Read data Address Data memory Write data Write data 16 Sign extend 32 Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem? 163 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Corrected Datapath IF/ID
ID/EX EX/MEM MEM/WB Add 4 Shift left Morgan Kaufmann Publishers2 PC Address Instruction memory Add Add result Read register Morgan Kaufmann Publishers1 Read data Morgan Kaufmann Publishers1 Read register Morgan Kaufmann Publishers2 Registers Read Write data Morgan Kaufmann Publishers2 register Zero ALU ALU result Read
data Address Data memory Write data Write data 16 Sign extend 32 164 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Graphically Representing Pipelines Time Morgan Kaufmann Publishers(in Morgan Kaufmann Publishersclock Morgan Kaufmann Publisherscycles) Program execution order (in Morgan Kaufmann Publishersinstructions) lw Morgan Kaufmann Publishers$1, Morgan Kaufmann Publishers100($0) lw Morgan Kaufmann Publishers$2, Morgan Kaufmann Publishers200($0) lw Morgan Kaufmann Publishers$3, Morgan Kaufmann Publishers300($0) CC Morgan Kaufmann Publishers1 CC Morgan Kaufmann Publishers2 IM
Reg IM CC Morgan Kaufmann Publishers3 ALU Reg IM CC Morgan Kaufmann Publishers4 CC Morgan Kaufmann Publishers5 DM Reg ALU DM Reg ALU DM Reg CC Morgan Kaufmann Publishers6 CC7 Reg Can help with answering questions like:
how many cycles does it take to execute this code? what is the ALU doing during cycle 4? use this representation to help understand datapaths 165 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Pipeline Control PCSrc IF/ID ID/EX EX/MEM MEM/WB Add Add Add result 4 Shift left Morgan Kaufmann Publishers2 Branch RegWrite PC Address Instruction memory Read register Morgan Kaufmann Publishers1
Read data Morgan Kaufmann Publishers1 Read register Morgan Kaufmann Publishers2 Registers Read Write data Morgan Kaufmann Publishers2 register MemWrite ALUSrc Zero Add ALU result MemtoReg Read data Address Data memory Write data Write data Instruction (15D0) Instruction (20D16) 16 Sign extend
32 6 ALU control MemRead ALUOp Instruction (15D11) RegDst 166 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Pipeline control We have 5 stages. What needs to be controlled in each stage? Instruction Fetch and PC Increment Instruction Decode / Register Fetch Execution Memory Stage Write Back How would control be handled in an automobile plant? a fancy control center telling everyone what to do? should we use a finite state machine? 167 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Pipeline Control
Pass control signals along just like the data Instruction R-format lw sw beq Execution/Address Calculation Memory access stage stage control lines control lines Reg ALU ALU ALU Mem Mem Dst Op1 Op0 Src Branch Read Write 1 1 0 0 0 0 0 0 0 0 1 0 1 0 X 0 0
1 0 0 1 X 0 1 0 1 0 0 Write-back stage control lines Reg Mem to write Reg 1 0 1 1 0 X 0 X WB Instruction IF/ID Control M WB EX
M WB ID/EX EX/MEM MEM/WB 168 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Datapath with Control PCSrc ID/EX Control IF/ID WB EX/MEM M WB EX M MEM/WB WB Add 4
Shift left Morgan Kaufmann Publishers2 PC Address Instruction memory Add Add result Branch ALUSrc Read register Morgan Kaufmann Publishers1 Read data Morgan Kaufmann Publishers1 Read register Morgan Kaufmann Publishers2 Registers Read Write data Morgan Kaufmann Publishers2 register Zero ALU ALU result Read data Address Data memory
Write data Write data Instruction [150] Instruction [2016] 16 Sign extend 32 6 ALU control MemRead ALUOp Instruction [1511] RegDst 169 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Dependencies Problem with starting next instruction before first is finished dependencies that go backward in time are data hazards Time Morgan Kaufmann Publishers(in Morgan Kaufmann Publishersclock Morgan Kaufmann Publisherscycles)
CC Morgan Kaufmann Publishers1 CC Morgan Kaufmann Publishers2 CC Morgan Kaufmann Publishers3 10 10 10 IM Reg Value Morgan Kaufmann Publishersof register Morgan Kaufmann Publishers$2: CC Morgan Kaufmann Publishers4 CC Morgan Kaufmann Publishers5 CC Morgan Kaufmann Publishers6 CC Morgan Kaufmann Publishers7 CC Morgan Kaufmann Publishers8 CC Morgan Kaufmann Publishers9 10 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers10/20 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers20 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers20 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers20 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers20 Program execution order (in Morgan Kaufmann Publishersinstructions) sub $2, Morgan Kaufmann Publishers$1, Morgan Kaufmann Publishers$3 and Morgan Kaufmann Publishers$12, $2, Morgan Kaufmann Publishers$5
or Morgan Kaufmann Publishers$13, Morgan Kaufmann Publishers$6, $2 add Morgan Kaufmann Publishers$14, $2, $2 sw Morgan Kaufmann Publishers$15, Morgan Kaufmann Publishers100($2) IM DM Reg IM Reg DM Reg IM Reg DM Reg IM Reg DM Reg Reg DM Reg
170 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Software Solution Have compiler guarantee no hazards Where do we insert the nops ? sub and or add sw $2, $1, $3 $12, $2, $5 $13, $6, $2 $14, $2, $2 $15, 100($2) Problem: this really slows us down! 171 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Forwarding Use temporary results, dont wait for them to be written register file forwarding to handle read/write to same register ALU forwarding Time Morgan Kaufmann Publishers(in Morgan Kaufmann Publishersclock Morgan Kaufmann Publisherscycles) CC Morgan Kaufmann Publishers1 CC Morgan Kaufmann Publishers2
Value Morgan Kaufmann Publishersof Morgan Kaufmann Publishersregister Morgan Kaufmann Publishers$2: 10 10 Value Morgan Kaufmann Publishersof Morgan Kaufmann PublishersEX/MEM: Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Value Morgan Kaufmann Publishersof Morgan Kaufmann PublishersMEM/WB: Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX CC Morgan Kaufmann Publishers3 10 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX CC Morgan Kaufmann Publishers4 CC Morgan Kaufmann Publishers5 CC Morgan Kaufmann Publishers6 CC Morgan Kaufmann Publishers7 CC Morgan Kaufmann Publishers8 CC Morgan Kaufmann Publishers9 10 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers10/20 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers20 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers20 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers20 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers20 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers20 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers20 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann PublishersX Program execution order
(in Morgan Kaufmann Publishersinstructions) sub $2, Morgan Kaufmann Publishers$1, Morgan Kaufmann Publishers$3 and Morgan Kaufmann Publishers$12, $2, Morgan Kaufmann Publishers$5 or Morgan Kaufmann Publishers$13, Morgan Kaufmann Publishers$6, $2 add Morgan Kaufmann Publishers$14,$2 , $2 sw Morgan Kaufmann Publishers$15, Morgan Kaufmann Publishers100($2) what if this $2 was $13? IM Reg IM DM Reg IM Reg DM Reg IM Reg DM Reg IM
Reg DM Reg Reg DM Reg 172 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Forwarding The main idea (some details not shown) ID/EX EX/MEM MEM/WB M u x ForwardA Registers ALU M u x
Data memory M u x ForwardB Rs Rt Rt Rd EX/MEM.RegisterRd M u x Forwarding unit MEM/WB.RegisterRd 173 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Can't always forward Load word can still cause a hazard: an instruction tries to read a register following a load instruction that writes to the same register. Time Morgan Kaufmann Publishers(in Morgan Kaufmann Publishersclock Morgan Kaufmann Publisherscycles) CC Morgan Kaufmann Publishers1 CC Morgan Kaufmann Publishers2 CC Morgan Kaufmann Publishers3
CC Morgan Kaufmann Publishers4 CC Morgan Kaufmann Publishers5 DM Reg CC Morgan Kaufmann Publishers6 CC Morgan Kaufmann Publishers7 CC Morgan Kaufmann Publishers8 CC Morgan Kaufmann Publishers9 Program execution order (in Morgan Kaufmann Publishersinstructions) lw $2, Morgan Kaufmann Publishers20($1) and $4, $2, Morgan Kaufmann Publishers$5 or Morgan Kaufmann Publishers$8, $2, Morgan Kaufmann Publishers$6 add Morgan Kaufmann Publishers$9, $4, $2 IM Reg IM Reg IM
DM Reg IM Reg DM Reg Reg DM Reg Thus, we need a hazard detection unit to stall the load instruction slt Morgan Kaufmann Publishers$1, Morgan Kaufmann Publishers$6, Morgan Kaufmann Publishers$7 IM Reg DM Reg 174 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Stalling We can stall the pipeline by keeping an instruction in the same stage Time Morgan Kaufmann Publishers(in Morgan Kaufmann Publishersclock Morgan Kaufmann Publisherscycles) CC Morgan Kaufmann Publishers1 CC Morgan Kaufmann Publishers2
CC Morgan Kaufmann Publishers3 CC Morgan Kaufmann Publishers4 CC Morgan Kaufmann Publishers5 Reg DM Reg CC Morgan Kaufmann Publishers6 CC Morgan Kaufmann Publishers7 CC Morgan Kaufmann Publishers8 CC Morgan Kaufmann Publishers9 CC Morgan Kaufmann Publishers10 Program execution order (in Morgan Kaufmann Publishersinstructions) lw $2, Morgan Kaufmann Publishers20($1) IM bubble and Morgan Kaufmann Publishersbecomes Morgan Kaufmann Publishersnop add Morgan Kaufmann Publishers$4, Morgan Kaufmann Publishers$2, Morgan Kaufmann Publishers$5 or Morgan Kaufmann Publishers$8, Morgan Kaufmann Publishers$2, Morgan Kaufmann Publishers$6 add Morgan Kaufmann Publishers$9, Morgan Kaufmann Publishers$4, Morgan Kaufmann Publishers$2 IM
Reg IM DM Reg IM Reg DM DM Reg IM Reg Reg Reg DM Reg 175 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Hazard Detection Unit Stall by letting an instruction that wont write anything go forward Hazard
detection unit ID/EX.MemRead ID/EX WB M u x Control 0 IF/ID EX/MEM M WB EX M MEM/WB WB M u x Registers ALU PC Instruction memory M
u x Data memory M u x IF/ID.RegisterRs IF/ID.RegisterRt IF/ID.RegisterRt Rt IF/ID.RegisterRd Rd M u x ID/EX.RegisterRt Rs Rt Forwarding unit 176 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Branch Hazards When we decide to branch, other instructions are in the pipeline! Time Morgan Kaufmann Publishers(in Morgan Kaufmann Publishersclock Morgan Kaufmann Publisherscycles)
CC Morgan Kaufmann Publishers1 CC Morgan Kaufmann Publishers2 CC Morgan Kaufmann Publishers3 CC Morgan Kaufmann Publishers4 CC Morgan Kaufmann Publishers5 DM Reg CC Morgan Kaufmann Publishers6 CC Morgan Kaufmann Publishers7 CC Morgan Kaufmann Publishers8 CC Morgan Kaufmann Publishers9 Program execution order (in Morgan Kaufmann Publishersinstructions) 40 Morgan Kaufmann Publishersbeq Morgan Kaufmann Publishers$1, Morgan Kaufmann Publishers$3, Morgan Kaufmann Publishers28 44 Morgan Kaufmann Publishersand Morgan Kaufmann Publishers$12, Morgan Kaufmann Publishers$2, Morgan Kaufmann Publishers$5 48 Morgan Kaufmann Publishersor Morgan Kaufmann Publishers$13, Morgan Kaufmann Publishers$6, Morgan Kaufmann Publishers$2 52 Morgan Kaufmann Publishersadd Morgan Kaufmann Publishers$14, Morgan Kaufmann Publishers$2, Morgan Kaufmann Publishers$2 72 Morgan Kaufmann Publisherslw Morgan Kaufmann Publishers$4, Morgan Kaufmann Publishers50($7) IM
Reg IM Reg IM DM Reg IM Reg DM Reg IM Reg DM Reg Reg DM Reg We are predicting branch not taken need to add hardware for flushing instructions if we are wrong 177 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers
Flushing Instructions IF.Flush Hazard detection unit ID/EX WB Control 0 IF/ID M u x + EX/MEM M WB EX/MEM EX M WB + 4 M u x PC
Shift left Morgan Kaufmann Publishers2 Registers Instruction memory = M u x ALU M u x Data memory M u x Sign extend M u x Fowarding unit Note: weve also moved branch decision to ID stage 178 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers
Branches If the branch is taken, we have a penalty of one cycle For our simple design, this is reasonable With deeper pipelines, penalty increases and static branch prediction drastically hurts performance Solution: dynamic branch prediction Taken Not Morgan Kaufmann Publisherstaken Predict Morgan Kaufmann Publisherstaken Predict Morgan Kaufmann Publisherstaken Taken Not Morgan Kaufmann Publisherstaken Taken Not Morgan Kaufmann Publisherstaken Predict Morgan Kaufmann Publishersnot Morgan Kaufmann Publisherstaken Predict Morgan Kaufmann Publishersnot Morgan Kaufmann Publisherstaken Taken Not Morgan Kaufmann Publisherstaken A 2-bit prediction scheme 179 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Branch Prediction Sophisticated Techniques: A branch target buffer to help us look up the destination Correlating predictors that base prediction on global behavior and recently executed branches (e.g., prediction for a specific branch instruction based on what happened in previous branches)
Tournament predictors that use different types of prediction strategies and keep track of which one is performing best. A branch delay slot which the compiler tries to fill with a useful instruction (make the one cycle delay part of the ISA) Branch prediction is especially important because it enables other more advanced pipelining techniques to be effective! Modern processors predict correctly 95% of the time! 180 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Improving Performance Try and avoid stalls! E.g., reorder these instructions: lw lw sw sw $t0, $t2, $t2, $t0, 0($t1) 4($t1) 0($t1) 4($t1) Dynamic Pipeline Scheduling
Hardware chooses which instructions to execute next Will execute instructions out of order (e.g., doesnt wait for a dependency to be resolved, but rather keeps going!) Speculates on branches and keeps the pipeline full (may need to rollback if prediction incorrect) Trying to exploit instruction-level parallelism 181 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Advanced Pipelining Increase the depth of the pipeline Start more than one instruction each cycle (multiple issue) Loop unrolling to expose more ILP (better scheduling) Superscalar processors DEC Alpha 21264: 9 stage pipeline, 6 instruction issue All modern processors are superscalar and issue multiple instructions usually with some limitations (e.g., different pipes) VLIW: very long instruction word, static multiple issue (relies more on compiler technology) This class has given you the background you need to learn more!
182 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Chapter 6 Summary Pipelining does not improve latency, but does improve throughput Deeply pipelined Multicycle (Section Morgan Kaufmann Publishers5.5) Pipelined Multiple Morgan Kaufmann Publishersissue with Morgan Kaufmann Publishersdeep Morgan Kaufmann Publisherspipeline (Section Morgan Kaufmann Publishers6.10) Multiple Morgan Kaufmann Publishersissue with Morgan Kaufmann Publishersdeep Morgan Kaufmann Publisherspipeline (Section Morgan Kaufmann Publishers6.10) Multiple-issue pipelined (Section Morgan Kaufmann Publishers6.9) Multiple-issue pipelined (Section Morgan Kaufmann Publishers6.9) Single-cycle (Section Morgan Kaufmann Publishers5.4) Deeply pipelined Multicycle (Section Morgan Kaufmann Publishers5.5)
Single-cycle (Section Morgan Kaufmann Publishers5.4) Slower Pipelined Faster Instructions Morgan Kaufmann Publishersper Morgan Kaufmann Publishersclock Morgan Kaufmann Publishers(IPC Morgan Kaufmann Publishers= Morgan Kaufmann Publishers1/CPI) 1 Several Use Morgan Kaufmann Publisherslatency Morgan Kaufmann Publishersin Morgan Kaufmann Publishersinstructions 183 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Chapter Seven 184 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored as a charge on capacitor (must be refreshed) very small but slower than SRAM (factor of 5 to 10) Word Morgan Kaufmann Publishersline
A A B B Pass Morgan Kaufmann Publisherstransistor Capacitor Bit Morgan Kaufmann Publishersline 185 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Exploiting Memory Hierarchy Users want large and fast memories! SRAM access times are .5 5ns at cost of $4000 to $10,000 per GB. DRAM access times are 50-70ns at cost of $100 to $200 per GB. Disk access times are 5 to 20 million ns at cost of $.50 to $2 per GB. 2004 Try and give it to them anyway build a memory hierarchy CPU Level Morgan Kaufmann Publishers1 Levels Morgan Kaufmann Publishersin Morgan Kaufmann Publishersthe memory Morgan Kaufmann Publishershierarchy Increasing Morgan Kaufmann Publishersdistance from Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersCPU Morgan Kaufmann Publishersin access Morgan Kaufmann Publisherstime
Level Morgan Kaufmann Publishers2 Level Morgan Kaufmann Publishersn Size Morgan Kaufmann Publishersof Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersmemory Morgan Kaufmann Publishersat Morgan Kaufmann Publisherseach Morgan Kaufmann Publisherslevel 186 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Locality A principle that makes having a memory hierarchy a good idea If an item is referenced, temporal locality: it will tend to be referenced again soon spatial locality: nearby items will tend to be referenced soon. Why does code have locality? Our initial focus: two levels (upper, lower) block: minimum unit of data hit: data requested is in the upper level miss: data requested is not in the upper level 187 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Cache Two issues:
How do we know if a data item is in the cache? If it is, how do we find it? Our first example: block size is one word of data "direct mapped" For each item of data at the lower level, there is exactly one location in the cache where it might be. e.g., lots of items at the lower level share locations in the upper level 188 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Direct Mapped Cache Mapping: address is modulo the number of blocks in the cache Cache 0 1 0 1 0 1 0 1 0 0 1 1 0 0 1 1 0 0 0 0 1 1 1 1 00001 00101 01001 01101 10001 10101 11001 11101
Memory 189 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Direct Mapped Cache For MIPS: Address Morgan Kaufmann Publishers(showing Morgan Kaufmann Publishersbit Morgan Kaufmann Publisherspositions) 31 Morgan Kaufmann Publishers30 Hit Tag 13 Morgan Kaufmann Publishers12 Morgan Kaufmann Publishers11 20 2 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers1 Morgan Kaufmann Publishers0 Byte offset 10 Data Index Index 0 1 2 Valid Tag Data 1021 1022 1023
20 32 = What kind of locality are we taking advantage of? 190 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Direct Mapped Cache Taking advantage of spatial locality: Address Morgan Kaufmann Publishers(showing Morgan Kaufmann Publishersbit Morgan Kaufmann Publisherspositions) 31 Hit 14 Morgan Kaufmann Publishers13 18 Tag 6 Morgan Kaufmann Publishers5 8 2 Morgan Kaufmann Publishers1 Morgan Kaufmann Publishers0 4 Byte offset Data Block Morgan Kaufmann Publishersoffset
Index 18 Morgan Kaufmann Publishersbits V 512 Morgan Kaufmann Publishersbits Tag Data 256 entries 16 32 32 32 = Mux 32 191 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Hits vs. Misses Read hits this is what we want! Read misses stall the CPU, fetch block from memory, deliver to cache, restart
Write hits: can replace data in cache and memory (write-through) write the data only into the cache (write-back the cache later) Write misses: read the entire block into the cache, then write the word 192 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Hardware Issues Make reading multiple words easier by using banks of memory CPU CPU CPU Multiplexor Cache Cache Cache Bus Bus Memory b. Morgan Kaufmann PublishersWide Morgan Kaufmann Publishersmemory Morgan Kaufmann Publishersorganization Bus
Memory Memory Memory Memory bank Morgan Kaufmann Publishers0 bank Morgan Kaufmann Publishers1 bank Morgan Kaufmann Publishers2 bank Morgan Kaufmann Publishers3 c. Morgan Kaufmann PublishersInterleaved Morgan Kaufmann Publishersmemory Morgan Kaufmann Publishersorganization Memory a. One-word-wide memory Morgan Kaufmann Publishersorganization It can get a lot more complicated... 193 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Performance Increasing the block size tends to decrease miss rate: 40% 35% 30% 25%
e t a Morgan Kaufmann Publishersr 20% ss i M 15% 10% 5% 0% 4 16 64 Block Morgan Kaufmann Publisherssize Morgan Kaufmann Publishers(bytes) 256 1 Morgan Kaufmann PublishersKB 8 Morgan Kaufmann PublishersKB 16 Morgan Kaufmann PublishersKB 64 Morgan Kaufmann PublishersKB 256 Morgan Kaufmann PublishersKB Use split caches because there is more spatial locality in code: Program gcc spice Block size in words 1 4 1 4
Instruction miss rate 6.1% 2.0% 1.2% 0.3% Data miss rate 2.1% 1.7% 1.3% 0.6% Effective combined miss rate 5.4% 1.9% 1.2% 0.4% 194 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Performance Simplified model: execution time = (execution cycles + stall cycles) cycle time stall cycles = # of instructions miss ratio miss penalty Two ways of improving performance: decreasing the miss ratio decreasing the miss penalty What happens if we increase block size?
195 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Decreasing miss ratio with associativity One-way Morgan Kaufmann Publishersset Morgan Kaufmann Publishersassociative (direct Morgan Kaufmann Publishersmapped) Block Tag Data 0 1 2 3 4 5 6 Two-way Morgan Kaufmann Publishersset Morgan Kaufmann Publishersassociative Set Tag Data Tag Data 0 1 2 3 7 Four-way Morgan Kaufmann Publishersset Morgan Kaufmann Publishersassociative Set Tag Data Tag Data Tag Data Tag Data 0 1 Eight-way Morgan Kaufmann Publishersset Morgan Kaufmann Publishersassociative Morgan Kaufmann Publishers(fully Morgan Kaufmann Publishersassociative)
Compared to direct mapped, give a series of references that: Tag Data Tag Data Tag Data Tag Data Tag Data Tag Data Tag Data Tag Data results in a lower miss ratio using a 2-way set associative cache results in a higher miss ratio using a 2-way set associative cache assuming we use the least recently used replacement strategy 196 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers An implementation Address 31 30 12 11 10 9 8 8 22 Index 0 1 2 V Tag Data V 3210 Tag Data V
Tag Data V Tag Data 253 254 255 22 32 4-to-1 Morgan Kaufmann Publishersmultiplexor Hit Data 197 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Performance 15% 1 Morgan Kaufmann PublishersKB 12% 2 Morgan Kaufmann PublishersKB 9% 4 Morgan Kaufmann PublishersKB 6%
8 Morgan Kaufmann PublishersKB 16 Morgan Kaufmann PublishersKB 32 Morgan Kaufmann PublishersKB 3% 64 Morgan Kaufmann PublishersKB 128 Morgan Kaufmann PublishersKB 0 One-way Two-way Four-way Eight-way Associativity 198 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Decreasing miss penalty with multilevel caches Add a second level cache: often primary cache is on the same chip as the processor use SRAMs to add another cache above primary memory (DRAM) miss penalty goes down if data is in 2nd level cache Example: CPI of 1.0 on a 5 Ghz machine with a 5% miss rate, 100ns DRAM access Adding 2nd level cache with 5ns access time decreases miss rate to .5%
Using multilevel caches: try and optimize the hit time on the 1st level cache try and optimize the miss rate on the 2nd level cache 199 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Cache Complexities Not always easy to understand implications of caches: 1200 2000 Radix Morgan Kaufmann Publisherssort 1000 Radix Morgan Kaufmann Publisherssort 1600 800 1200 600 800 400 200 Quicksort 400 0 Quicksort
0 4 8 16 32 64 128 256 512 1024 2048 4096 Size Morgan Kaufmann Publishers(K Morgan Kaufmann Publishersitems Morgan Kaufmann Publishersto Morgan Kaufmann Publisherssort) Theoretical behavior of Radix sort vs. Quicksort 4 8 16 32 64 128 256 512 1024 2048 4096 Size Morgan Kaufmann Publishers(K Morgan Kaufmann Publishersitems Morgan Kaufmann Publishersto Morgan Kaufmann Publisherssort)
Observed behavior of Radix sort vs. Quicksort 200 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Cache Complexities Here is why: 5 Radix Morgan Kaufmann Publisherssort 4 3 2 1 Quicksort 0 4 8 16 32 64 128 256 512 1024 2048 4096 Size Morgan Kaufmann Publishers(K Morgan Kaufmann Publishersitems Morgan Kaufmann Publishersto Morgan Kaufmann Publisherssort)
Memory system performance is often critical factor multilevel caches, pipelined processors, make it harder to predict outcomes Compiler optimizations to increase locality sometimes hurt ILP Difficult to predict best algorithm: need experimental data 201 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Virtual Memory Main memory can act as a cache for the secondary storage (disk) Virtual Morgan Kaufmann Publishersaddresses Physical Morgan Kaufmann Publishersaddresses Address Morgan Kaufmann Publisherstranslation Disk Morgan Kaufmann Publishersaddresses Advantages: illusion of having more physical memory program relocation protection 202 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Pages: virtual memory blocks Page faults: the data is not in memory, retrieve it from disk huge miss penalty, thus pages should be fairly large (e.g., 4KB) reducing page faults is important (LRU is worth the price)
can handle the faults in software instead of hardware using write-through is too expensive so we use writeback Virtual Morgan Kaufmann Publishersaddress 31 Morgan Kaufmann Publishers30 Morgan Kaufmann Publishers29 Morgan Kaufmann Publishers28 Morgan Kaufmann Publishers27 15 Morgan Kaufmann Publishers14 Morgan Kaufmann Publishers13 Morgan Kaufmann Publishers12 Morgan Kaufmann Publishers11 Morgan Kaufmann Publishers10 Morgan Kaufmann Publishers9 Morgan Kaufmann Publishers8 3 Morgan Kaufmann Publishers2 Morgan Kaufmann Publishers1 Morgan Kaufmann Publishers0 Page Morgan Kaufmann Publishersoffset Virtual Morgan Kaufmann Publisherspage Morgan Kaufmann Publishersnumber Translation 29 Morgan Kaufmann Publishers28 Morgan Kaufmann Publishers27 15 Morgan Kaufmann Publishers14 Morgan Kaufmann Publishers13 Morgan Kaufmann Publishers12 Morgan Kaufmann Publishers11 Morgan Kaufmann Publishers10 Morgan Kaufmann Publishers9 Morgan Kaufmann Publishers8 Physical Morgan Kaufmann Publisherspage Morgan Kaufmann Publishersnumber 3 Morgan Kaufmann Publishers2 Morgan Kaufmann Publishers1 Morgan Kaufmann Publishers0 Page Morgan Kaufmann Publishersoffset Physical Morgan Kaufmann Publishersaddress 203 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Page Tables Virtual Morgan Kaufmann Publisherspage number Page Morgan Kaufmann Publisherstable Physical Morgan Kaufmann Publisherspage Morgan Kaufmann Publishersor Valid disk Morgan Kaufmann Publishersaddress 1 1 1
1 0 1 1 0 1 1 0 1 Physical Morgan Kaufmann Publishersmemory Disk Morgan Kaufmann Publishersstorage 204 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Page Tables Page Morgan Kaufmann Publisherstable Morgan Kaufmann Publishersregister Virtual Morgan Kaufmann Publishersaddress 3 1 Morgan Kaufmann Publishers 3 0 Morgan Kaufmann Publishers 2 9 Morgan Kaufmann Publishers 2 8 Morgan Kaufmann Publishers 2 7 1 5 Morgan Kaufmann Publishers 1 4 Morgan Kaufmann Publishers 1 3 Morgan Kaufmann Publishers 1 2 Morgan Kaufmann Publishers 1 1 Morgan Kaufmann Publishers 1 0 Morgan Kaufmann Publishers 9 Morgan Kaufmann Publishers 8 Virtual Morgan Kaufmann Publisherspage Morgan Kaufmann Publishersnumber Page Morgan Kaufmann Publishersoffset 12 20 Valid 3 Morgan Kaufmann Publishers 2 Morgan Kaufmann Publishers 1 Morgan Kaufmann Publishers 0 Physical Morgan Kaufmann Publisherspage Morgan Kaufmann Publishersnumber Page Morgan Kaufmann Publisherstable 18 If Morgan Kaufmann Publishers0 Morgan Kaufmann Publishersthen Morgan Kaufmann Publisherspage Morgan Kaufmann Publishersis Morgan Kaufmann Publishersnot
present Morgan Kaufmann Publishersin Morgan Kaufmann Publishersmemory 2 9 Morgan Kaufmann Publishers 2 8 Morgan Kaufmann Publishers 2 7 1 5 Morgan Kaufmann Publishers 1 4 Morgan Kaufmann Publishers 1 3 Morgan Kaufmann Publishers 1 2 Morgan Kaufmann Publishers 1 1 Morgan Kaufmann Publishers 1 0 Morgan Kaufmann Publishers 9 Morgan Kaufmann Publishers 8 Physical Morgan Kaufmann Publisherspage Morgan Kaufmann Publishersnumber 3 Morgan Kaufmann Publishers 2 Morgan Kaufmann Publishers 1 Morgan Kaufmann Publishers 0 Page Morgan Kaufmann Publishersoffset Physical Morgan Kaufmann Publishersaddress 205 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Making Address Translation Fast A cache for address translations: translation lookaside buffer TLB Virtual Morgan Kaufmann Publisherspage number Valid Dirty Ref 1 1 1 1 0 1 0 1 1 0 0 0 Tag Physical Morgan Kaufmann Publisherspage
address 1 1 1 1 0 1 Physical Morgan Kaufmann Publishersmemory Page Morgan Kaufmann Publisherstable Physical Morgan Kaufmann Publisherspage Valid Dirty Ref or Morgan Kaufmann Publishersdisk Morgan Kaufmann Publishersaddress 1 1 1 1 0 1 1 0 1 1 0 1 Typical values: 1 0 0 0 0 0 0 0 1 1 0 1
1 0 0 1 0 1 1 0 1 1 0 1 Disk Morgan Kaufmann Publishersstorage 16-512 entries, miss-rate: .01% - 1% miss-penalty: 10 100 cycles 206 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers TLBs and caches Virtual Morgan Kaufmann Publishersaddress TLB Morgan Kaufmann Publishersaccess TLB Morgan Kaufmann Publishersmiss exception No Yes TLB Morgan Kaufmann Publishershit? Physical Morgan Kaufmann Publishersaddress
No Try Morgan Kaufmann Publishersto Morgan Kaufmann Publishersread Morgan Kaufmann Publishersdata from Morgan Kaufmann Publisherscache Cache Morgan Kaufmann Publishersmiss Morgan Kaufmann Publishersstall while Morgan Kaufmann Publishersread Morgan Kaufmann Publishersblock No Cache Morgan Kaufmann Publishershit? Yes Write? No Yes Write Morgan Kaufmann Publishersaccess bit Morgan Kaufmann Publisherson? Write Morgan Kaufmann Publishersprotection exception Yes Try Morgan Kaufmann Publishersto Morgan Kaufmann Publisherswrite Morgan Kaufmann Publishersdata to Morgan Kaufmann Publisherscache Deliver Morgan Kaufmann Publishersdata to Morgan Kaufmann Publishersthe Morgan Kaufmann PublishersCPU Cache Morgan Kaufmann Publishersmiss Morgan Kaufmann Publishersstall while Morgan Kaufmann Publishersread Morgan Kaufmann Publishersblock No Cache Morgan Kaufmann Publishershit?
Yes Write Morgan Kaufmann Publishersdata Morgan Kaufmann Publishersinto Morgan Kaufmann Publisherscache, update Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersdirty Morgan Kaufmann Publishersbit, Morgan Kaufmann Publishersand put Morgan Kaufmann Publishersthe Morgan Kaufmann Publishersdata Morgan Kaufmann Publishersand Morgan Kaufmann Publishersthe address Morgan Kaufmann Publishersinto Morgan Kaufmann Publishersthe Morgan Kaufmann Publisherswrite Morgan Kaufmann Publishersbuffer 207 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers TLBs and Caches Virtual Morgan Kaufmann Publishersaddress 31 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers30 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers29 14 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers13 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers12 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers11 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers10 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers9 Virtual Morgan Kaufmann Publisherspage Morgan Kaufmann Publishersnumber 3 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers2 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers1 Morgan Kaufmann Publishers Morgan Kaufmann Publishers Morgan Kaufmann Publishers0 Page Morgan Kaufmann Publishersoffset 12 20 Valid Dirty Tag Physical Morgan Kaufmann Publisherspage Morgan Kaufmann Publishersnumber = = = = = = TLB TLB Morgan Kaufmann Publishershit
20 Page Morgan Kaufmann Publishersoffset Physical Morgan Kaufmann Publisherspage Morgan Kaufmann Publishersnumber Physical Morgan Kaufmann Publishersaddress Block Physical Morgan Kaufmann Publishersaddress Morgan Kaufmann Publisherstag Cache Morgan Kaufmann Publishersindex offset 18 8 4 Byte offset 2 8 12 Valid Data Tag Cache = Cache Morgan Kaufmann Publishershit 32 Data 208 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Modern Systems
209 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Modern Systems Things are getting complicated! 210 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Some Issues Processor speeds continue to increase very fast much faster than either DRAM or disk access times 100,000 10,000 1,000 Performance CPU 100 10 Memory 1 Year Design challenge: dealing with this growing disparity Prefetching? 3rd level caches and more? Memory design?
211 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Chapters 8 & 9 (partial coverage) 212 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Interfacing Processors and Peripherals I/O Design affected by many factors (expandability, resilience) Performance: access latency throughput connection between devices and the system the memory hierarchy the operating system A variety of different users (e.g., banks, supercomputers, engineers) Interrupts Processor Cache Memory- Morgan Kaufmann PublishersI/O Morgan Kaufmann Publishersbus Main memory I/O controller
Disk Disk I/O controller I/O controller Graphics output Network 213 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers I/O Important but neglected The difficulties in assessing and designing I/O systems have often relegated I/O to second class status courses in every aspect of computing, from programming to computer architecture often ignore I/O or give it scanty coverage textbooks leave the subject to near the end, making it easier for students and instructors to skip it! GUILTY! we wont be looking at I/O in much detail be sure and read Chapter 8 in its entirety. you should probably take a networking class! 214 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers
I/O Devices Very diverse devices behavior (i.e., input vs. output) partner (who is at the other end?) data rate 215 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers I/O Example: Disk Drives Platters Tracks Platter Sectors Track To access data: seek: position head over the proper track (3 to 14 ms. avg.) rotational latency: wait for desired sector (.5 / RPM) transfer: grab the data (one or more sectors) 30 to 80 MB/sec 216 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers I/O Example: Buses
Shared communication link (one or more wires) Difficult design: may be bottleneck length of the bus number of devices tradeoffs (buffers for higher bandwidth increases latency) support for many different devices cost Types of buses: processor-memory (short high speed, custom design) backplane (high speed, often standardized, e.g., PCI) I/O (lengthy, different devices, e.g., USB, Firewire) Synchronous vs. Asynchronous use a clock and a synchronous protocol, fast and small but every device must operate at same rate and clock skew requires the bus to be short dont use a clock and instead use handshaking 217 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers I/O Bus Standards Today we have two dominant bus standards: 218 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Other important issues Bus Arbitration: daisy chain arbitration (not very fair) centralized arbitration (requires an arbiter), e.g., PCI collision detection, e.g., Ethernet
Operating system: polling interrupts direct memory access (DMA) Performance Analysis techniques: queuing theory simulation analysis, i.e., find the weakest link (see I/O System Design) Many new developments 219 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Pentium 4 I/O Options Pentium Morgan Kaufmann Publishers4 processor DDR Morgan Kaufmann Publishers400 (3.2 Morgan Kaufmann PublishersGB/sec) Main memory DIMMs DDR Morgan Kaufmann Publishers400 (3.2 Morgan Kaufmann PublishersGB/sec) System Morgan Kaufmann Publishersbus Morgan Kaufmann Publishers(800 Morgan Kaufmann PublishersMHz, Morgan Kaufmann Publishers604 Morgan Kaufmann PublishersGB/sec)
AGP Morgan Kaufmann Publishers8X Memory (2.1 Morgan Kaufmann PublishersGB/sec) Graphics controller output hub CSA (north Morgan Kaufmann Publishersbridge) (0.266 Morgan Kaufmann PublishersGB/sec) 1 Morgan Kaufmann PublishersGbit Morgan Kaufmann PublishersEthernet 82875P Serial Morgan Kaufmann PublishersATA (150 Morgan Kaufmann PublishersMB/sec) (266 Morgan Kaufmann PublishersMB/sec) Parallel Morgan Kaufmann PublishersATA (100 Morgan Kaufmann PublishersMB/sec) Serial Morgan Kaufmann PublishersATA (150 Morgan Kaufmann PublishersMB/sec) Parallel Morgan Kaufmann PublishersATA (100 Morgan Kaufmann PublishersMB/sec) Disk Disk Stereo (surroundsound) AC/97 (1 Morgan Kaufmann PublishersMB/sec) USB Morgan Kaufmann Publishers2.0 (60 Morgan Kaufmann PublishersMB/sec) . Morgan Kaufmann Publishers. Morgan Kaufmann Publishers. I/O controller
hub (south Morgan Kaufmann Publishersbridge) 82801EB (20 Morgan Kaufmann PublishersMB/sec) CD/DVD Tape 10/100 Morgan Kaufmann PublishersMbit Morgan Kaufmann PublishersEthernet PCI Morgan Kaufmann Publishersbus (132 Morgan Kaufmann PublishersMB/sec) 220 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Fallacies and Pitfalls Fallacy: the rated mean time to failure of disks is 1,200,000 hours, so disks practically never fail. Fallacy: magnetic disk storage is on its last legs, will be replaced. Fallacy: A 100 MB/sec bus can transfer 100 MB/sec. Pitfall: Moving functions from the CPU to the I/O processor, expecting to improve performance without analysis. 221
2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Multiprocessors Idea: create powerful computers by connecting many smaller ones good news: works for timesharing (better than supercomputer) bad news: its really hard to write good concurrent programs many commercial failures Processor Processor Processor Cache Cache Cache Processor Processor Processor Cache Cache Cache Memory Memory Memory
Single Morgan Kaufmann Publishersbus Memory I/O Network 222 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Questions How do parallel processors share data? single address space (SMP vs. NUMA) message passing How do parallel processors coordinate? synchronization (locks, semaphores) built into send / receive primitives operating system protocols How are they implemented? connected by a single bus connected by a network 223 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Supercomputers Plot of top 500 supercomputer sites over a decade: Single Morgan Kaufmann PublishersInstruction Morgan Kaufmann Publishersmultiple Morgan Kaufmann Publishersdata Morgan Kaufmann Publishers(SIMD) 500
Cluster (network Morgan Kaufmann Publishersof workstations) 400 Cluster (network Morgan Kaufmann Publishersof SMPs) 300 Massively parallel processors (MPPs) 200 100 0 93 93 94 94 95 95 96 96 97 97 98 98 99 99 00 Sharedmemory multiprocessors (SMPs) Uniprocessors 224 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Using multiple processors an old idea Some SIMD designs:
Costs for the the Illiac IV escalated from $8 million in 1966 to $32 million in 1972 despite completion of only of the machine. It took three more years before it was operational! For better or worse, computer architects are not easily discouraged Lots of interesting designs and ideas, lots of failures, few successes 225 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Topologies P0 P1 P2 a. Morgan Kaufmann Publishers2-D Morgan Kaufmann Publishersgrid Morgan Kaufmann Publishersor Morgan Kaufmann Publishersmesh Morgan Kaufmann Publishersof Morgan Kaufmann Publishers16 Morgan Kaufmann Publishersnodes P3 P0 P4 P1 P5 P2 P6 P3 P7 P4 P5 P6
P7 b. Morgan Kaufmann PublishersOmega Morgan Kaufmann Publishersnetwork a. Morgan Kaufmann PublishersCrossbar b. Morgan Kaufmann Publishersn-cube Morgan Kaufmann Publisherstree Morgan Kaufmann Publishersof Morgan Kaufmann Publishers8 Morgan Kaufmann Publishersnodes Morgan Kaufmann Publishers(8 Morgan Kaufmann Publishers= Morgan Kaufmann Publishers23 Morgan Kaufmann Publishersso Morgan Kaufmann Publishersn Morgan Kaufmann Publishers= Morgan Kaufmann Publishers3) 226 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Clusters Constructed from whole computers Independent, scalable networks Strengths: Many applications amenable to loosely coupled machines Exploit local area networks Cost effective / Easy to expand Weaknesses: Administration costs not necessarily lower Connected using I/O bus Highly available due to separation of memories In theory, we should be able to do better 227 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Google
Serve an average of 1000 queries per second Google uses 6,000 processors and 12,000 disks Two sites in silicon valley, two in Virginia Each site connected to internet using OC48 (2488 Mbit/sec) Reliability: On an average day, 20 machines need rebooted (software error) 2% of the machines replaced each year In some sense, simple ideas well executed. Better (and cheaper) than other approaches involving increased complexity 228 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers Concluding Remarks Evolution vs. Revolution More often the expense of innovation comes from being too disruptive to computer users Acceptance of hardware ideas requires acceptance by software people; therefore hardware people should learn about software. And if software people want good machines, they must learn more about hardware to be able to communicate with and thereby influence hardware engineers. 229 2004 Morgan Kaufmann PublishersMorgan Morgan Kaufmann PublishersKaufmann Morgan Kaufmann PublishersPublishers