Spatial Computation

Spatial Computation

Spatial Computation Mihai Budiu CMU CS Thesis committee: Seth Goldstein Peter Lee Todd Mowry Babak Falsafi Nevin Heintze SCS Ph.D. Thesis defense, December 8, 2003 Spatial Computation Thesis committee: Seth Goldstein A model of general-purpose computation Peter Lee SCS based on Application-Specific Hardware. Todd Mowry

Babak Falsafi Nevin Heintze Ph.D. Thesis defense, December 8, 2003 2 Thesis Statement Application-Specific Hardware (ASH): can be synthesized by adapting software compilation for predicated architectures, provides high-performance for programs with high ILP, with very low power consumption, is a more scalable and efficient computation substrate than monolithic processors. 3 Outline Introduction Compiling for ASH

Media processing on ASH ASH vs. superscalar processors Conclusions 4 CPU Problems Complexity Power Global Signals Limited ILP 5 Design Complexity Design Time: CAD productivity favors FPL 100,000,000 10,000,000

Logic transistors/chip .1 0 1,000,000 1,000,000 10,000 100,000 10,000 1000 100 10 x x x x x 1000 x

100 21%/Year 2009 2005 2007 2003 2001 1999 1997 1993 1995 1991 1989

1987 1985 10 1983 2 .5 x 1981 Chip size (K transistors) 58%/Year Productivity 100,000 .3 5 10,000,000

Transistors/staff*month Source: S. Malik, orig Sematech from Michael Flynns FCRC 2003 talk 6 Communication vs. Computation gate wire 5ps 20ps Power consumption on wires is also dominant 7 Our Approach: ASH Application-Specific Hardware 8

Resource Binding Time 1. 1. 2. Programs 2. CPU Programs ASH 9 Hardware Interface software software ISA virtual ISA

hardware CPU hardware gates ASH 10 Application-Specific Hardware C program Dataflow IR Compiler Reconfigurable/custom hw 11 Contributions Embedded systems Asynchronous

circuits Computer architecture Reconfigurable computing Compilation High-level synthesis Nanotechnology Dataflow machines s y s m te s theory 12

Outline Introduction CASH: Compiling for ASH Media processing on ASH ASH vs. superscalar processors Conclusions 13 Computation = Dataflow Programs Circuits a x = a & 7; ... y = x >> 2; 7 & x 2

>> Operations ) functional units Variables ) wires No interpretation 14 Basic Operation + latch data ack valid 15 Asynchronous Computation + data 1

latch 5 + + + ack valid 2 + 3 + 6 4

+ 7 + 8 16 Distributed Control Logic FSM ack rdy + short, local wires asynchronous control 17 Forward Branches b if (x > 0) y = -x; else

y = b*x; x * 0 - > ! y Conditionals ) Speculation critical path 18 Control Flow ) Data Flow data Merge (label) data data

predicate Gateway Split (branch) p ! 19 0 Loops i int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; * 0

+1 < 100 sum + ! ret 20 Predication and Side-Effects addr token sequencing of side-effects no speculation pred Load data

to memory token 21 Thesis Statement Application-Specific Hardware: can be synthesized by adapting software compilation for predicated architectures, provides high-performance for programs with high ILP, with very low power consumption, is a more scalable and efficient computation substrate than monolithic processors. 22 Outline Introduction CASH: Compiling for ASH An optimization on the SIDE Media processing on ASH ASH vs. superscalar processors

Conclusions skip to 23 Availability Dataflow Analysis y = a*b; ... if (x) { ... y ... = a*b; } 24 Dataflow Analysis Is Conservative if (x) { ... y = a*b; } ... y ... = a*b;

? 25 Static Instantiation, Dynamic Evaluation flag = false; if (x) { ... y = a*b; flag = true; } ... ... = flag ? y : a*b; 26 0 epic_d 10 5 300.twolf

300.twolf 254.gap 197.parser 181.mcf 176.gcc 35 175.vpr 164.gzip 188.ammp 183.equake 147.vortex 134.perl

132.ijpeg 130.li 129.compress 124.m88ksim 099.go mesa rasta pgp_d pgp_e g721_d g721_e pegwit_d pegwit_e

jpeg_d jpeg_e mpeg2_d 197.parser 254.gap epic_d mpeg2_e 176.gcc epic_e 181.mcf 175.vpr gsm_d gsm_e

% reduction 40 164.gzip 188.ammp 183.equake adpcm_d 15 adpcm_e %st promo %st PRE 134.perl 0 147.vortex 132.ijpeg

Stores 130.li 129.compress 30 124.m88ksim 099.go mesa rasta pgp_d pgp_e g721_d g721_e pegwit_d

pegwit_e jpeg_d jpeg_e mpeg2_d mpeg2_e 20 epic_e 25 gsm_d gsm_e adpcm_d adpcm_e

SIDE Register Promotion Impact 45 Loads % ld promo % ld PRE 30 25 20 15 53 10 5 27 Outline Introduction

CASH: Compiling for ASH Media processing on ASH ASH vs. superscalar processors Conclusions 28 Performance Evaluation ASH LSQ L1 8K L2 1/4M Mem limited BW CPU: 4-way OOO Assumption: all operations have the same latency.

29 5.8 rasta pegwit_e pegwit_d mpeg2_e mpeg2_d mesa 5.8 jpeg_e jpeg_d 3 gsm_e

gsm_d g721_e g721_d epic_e epic_d adpcm_e adpcm_d Times faster Media Kernels, vs 4-way OOO 12 2.5 2 1.5

1 0.5 0 30 rasta pegwit_e pegwit_d mpeg2_e mpeg2_d mesa jpeg_e jpeg_d

gsm_e gsm_d g721_e g721_d epic_e epic_d adpcm_e adpcm_d Media Kernels, IPC 25 Base IPC ASH IPC 20 15

10 5 4 0 31 10 Speed-up rasta pegwit_e pegwit_d mpeg2_e mpeg2_d mesa

jpeg_e jpeg_d gsm_e gsm_d g721_e g721_d epic_e 8 epic_d 9 adpcm_e adpcm_d

Times bigger Speed-up IPC Correlation 12 IPC Ratio 7 6 5 4 3 2 1 0 32

Low-Level Evaluation C CASH core Results shown so far. All results in thesis. Verilog back-end Synopsys, Cadence P/R 180nm std. cell library, 2V ~1999 technology Results in the next two slides. ASIC 33 pe gw

_e it_ e it_ d pe g2 2_ d pe gw m _e _d eg

_d m pe g jp gs m gs m g7 21 _e _d _d g7 21

ad pc m Square mm 12 Area 10 8 6 4 2 0 Reference: P4 in 180nm has 217mm2 34 Power

350 Times smaller than OOO 300 250 200 150 100 50 0 power ratio adpcm_d g721_d g721_e gsm_d gsm_e jpeg_d

mpeg2_d mpeg2_e pegwit_d pegwit_e 70 41 41 129 147 94 121 136

303 303 vs 4-way OOO superscalar, 600 Mhz, with clock gating (Wattch), ~ 6W 35 Thesis Statement Application-Specific Hardware: can be synthesized by adapting software compilation for predicated architectures, provides high-performance for programs with high ILP, with very low power consumption, is a more scalable and efficient computation substrate than monolithic processors. 36 Outline Introduction CASH: Compiling for ASH Media processing on ASH dataflow pipelining ASH vs. superscalar processors Conclusions

skip to 37 i Pipelining * + <= pipelined multiplier (8 stages) int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; 100

1 sum + cycle=1 38 i Pipelining * 100 1 + <= sum

+ cycle=2 39 i Pipelining * 100 1 + <= sum + cycle=3 40

i Pipelining * 100 1 + <= sum + cycle=4 41 i Pipelining

i=1 100 1 + <= i=0 sum + pipeline balancing cycle=5 42 Outline

Introduction CASH: Compiling for ASH Media processing on ASH ASH vs. superscalar processors Conclusions 43 This Is Obvious! ASH runs at full dataflow speed, so CPU cannot do any better (if compilers equally good). 44 Percent slower / faster -10 0 147.vortex 134.perl 132.ijpeg

130.li 129.compress 124.m88ksim -20 099.go SpecInt95, ASH vs 4-way OOO 30 20 10 -30 -40 -50 45

Branch Prediction i ASH crit path for (i=0; i < N; i++) { ... CPU crit path 1 + < exception if (exception) break; } Predicted not taken Effectively a noop for CPU! Predicted taken. !

& result available before inputs 46 Percent slower/faster -20 60 147.vortex 134.perl 132.ijpeg 130.li 129.compress 124.m88ksim -40 099.go

SpecInt95, perfect prediction baseline prediction 40 20 no data 0 -60 47 ASH Problems Both branch and join not free Static dataflow (no re-issue of same instr) Memory is far Fully static No branch prediction No dynamic unrolling No register renaming Calls/returns not lenient ...

48 Thesis Statement Application-Specific Hardware: can be synthesized by adapting software compilation for predicated architectures, provides high-performance for programs with high ILP, with very low power consumption, is a more scalable and efficient computation substrate than monolithic processors. 49 Outline Introduction + CASH: Compiling for ASH + Media processing on ASH + ASH vs. superscalar processors = Conclusions 50 Strengths low power simple verification?

economies of scale highly optimized specialized to app. unlimited ILP simple hardware no fixed window branch prediction control speculation full-dataflow global signals/decision 51

Conclusions Compiling around the ISA is a fruitful research approach. Distributed computation structures require more synchronization overhead. Spatial Computation efficiently implements high-ILP computation with very low power. 52 Backup Slides Control logic Pipeline balancing Lenient execution Dynamic Critical Path Memory PRE Critical path analysis CPU + ASH 53 Control Logic rdyin C C

ackin ackout rdyout datain Reg dataout back back to talk 54 Last-Arrival Events + data

ack valid Event enabling the generation of a result May be an ack Critical path=collection of last-arrival edges 55 Dynamic Critical Path 3. Some edges may repeat 2. Trace back along last-arrival edges 1. Start from last node back back to analysis 56 Critical Paths b if (x > 0) y = -x;

else y = b*x; x * 0 - > ! y 57 Lenient Operations b if (x > 0) y = -x; else y = b*x; x *

0 - > ! y Solve the problem of unbalanced paths back back to talk 58 i Pipelining * i=1 100

1 + <= i=0 sum + cycle=6 59 i Pipelining * + <= is loop

predicate 100 1 Long latency pipe sum sums loop + cycle=7 60 i Pipelining is loop *

critical path 100 1 + <= Predicate ack edge is on the critical path. sum sums loop + 61 Pipelinine balancing * i

100 1 + <= is loop decoupling FIFO sum sums loop + cycle=7 62 Pipelinine balancing *

is loop i 100 1 + <= critical path decoupling FIFO sum sums loop + back back to presentation

63 Register Promotion (p1) *p= (p1) *p= (p2) =*p (p2 : p1) =*p Load is executed only if store is not 64 Register Promotion (2) (p1) *p= (p2) =*p (p1) *p= (false)

=*p When p2 ) p1 the load becomes dead... ...i.e., when store dominates load in CFG back 65 PRE (p1) ...=*p (p2) ...=*p (p1 p2) ...=*p This corresponds in the CFG to lifting the load to a basic block dominating the original loads 66 Store-store (1) (p1) *p=

(p1 : p2) *p= (p2) *p=... (p2) *p=... When p1 ) p2 the first store becomes dead... ...i.e., when second store post-dominates first in CFG 67 Store-store (2) (p1) *p= (p1 : p2) *p= (p2) *p=... (p2) *p=... Token edge eliminated, but... ...transitive closure of tokens preserved back 68

A Code Fragment for(i = 0; i < 64; i++) { for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break; Y[i] = X[j].q; } SpecINT95:124.m88ksim:init_processor, stylized 69 Dynamic Critical Path definition sizeof(X[j]) load predicate loop predicate for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break; 70

MIPS gcc Code LOOP: L1: beq $v0,$a1,EXIT L2: addiu $v1,$v1,20 L3: lw $v0,0($v1) L4: addiu $a0,$a0,1 L5: bne $v0,$a3,LOOP EXIT: for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) ; X[j].r == i ; &X[j+1].r ; X[j+1].r ; j++ ; X[j+1].r == 0xF L1! L2 ! L3 ! L5 ! L1 4-instructions loop-carried dependence break; 71 If Branch Prediction Correct

LOOP: L1: beq $v0,$a1,EXIT L2: addiu $v1,$v1,20 L3: lw $v0,0($v1) L4: addiu $a0,$a0,1 L5: bne $v0,$a3,LOOP EXIT: for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break; ; X[j].r == i ; &X[j+1].r ; X[j+1].r ; j++ ; X[j+1].r == 0xF L1! L2 ! L3 ! L5 ! L1 Superscalar is issue-limited! 2 cycles/iteration sustained 72 Critical Path with Prediction

Loads are not speculative for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break; 73 Prediction + Load Speculation ~4 cycles! Load not pipelined ack edge (self-anti-dependence) for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break; 74 OOO Pipe Snapshot LOOP: L1: beq $v0,$a1,EXIT L2: addiu $v1,$v1,20 L3: lw $v0,0($v1)

L4: addiu $a0,$a0,1 L5: bne $v0,$a3,LOOP EXIT: register renaming ; X[j].r == i ; &X[j+1].r ; X[j+1].r ; j++ ; X[j+1].r == 0xF IF DA EX WB CT L5 L1

L2 L1 L2 L3 L4 L1 L3 L5 L3 L2 L1 L3 L3 75 Unrolling? for(i = 0; i < 64; i++) { for (j = 0; X[j].r != 0xF; j+=2) { if (X[j].r == i) break;

if (X[j+1].r == 0xF) break; when 1 if (X[j+1].r == i) break; } Y[i] = X[j].q; } back back to talk iteration 76 Ideal Architecture Low ILP computation + OS + VM CPU ASH

High-ILP computation Memory back 77

Recently Viewed Presentations

  • Preterite Vs. Imperfect

    Preterite Vs. Imperfect

    PRETERITE VS. IMPERFECT When to use these past tenses When to Use the Preterite When an action was completed in the past at a specific time or a specific number of times When an action began or ended When an...
  • Advanced Placement Government &amp; Politics

    Advanced Placement Government & Politics

    AP Gov Exam: The Basics. Essential Question: How is the exam structured? Two sections: Multiple Choice = 60 questions in 45 minutes. (50% of Exam) Free Response Questions = 4 F.R.Q.s in 100 minutes (think 25 minutes each) (50% of...
  • Ca Isotope Geochemistry

    Ca Isotope Geochemistry

    Singleton et al., ES&T, 2005 At Hanford: Nitrate in high level waste from tanks is distinguishable from soil nitrate, low level waste nitrate, and natural vadose zone nitrate (tank waste) (low level waste) T-36 Crib 106 101 103 Suspected/Known Leaking...
  • Week 1: Introduction - University of Queensland

    Week 1: Introduction - University of Queensland

    Students need to learn to read effectively in both print and online. Ability to match suitable texts to the reading level of a struggling student {without other students being aware} Technology can serve as a motivator for some students /...
  • Lesson 21 - sshs.ecboe.org

    Lesson 21 - sshs.ecboe.org

    PenguinsThe penguins are the best known and most numerous of all Antarctic birds.They are stocky, flightless birds with wings reduced to flippers with which they propel themselves through the water. Penguins nest in large, dense colonies, some with 180,000 or...
  • Using 21st Century Tools to Teach 21st Century Skills

    Using 21st Century Tools to Teach 21st Century Skills

    Using 21st Century Tools to Teach 21st Century Skills 21st Century Skills are…Life savers Allowing us to rediscover or recharge our PASSION about teaching! In the ocean of assessments, paperwork, and day to day routines…GREAT teachers must possess the vision...
  • Selling an Idea or a Product - Rivier University

    Selling an Idea or a Product - Rivier University

    Times New Roman Monotype Sorts Arial Book Antiqua Twinkles Bitmap Image Introduction to the Gigabit-Ethernet What is Gigabit Ethernet Gigabit Ethernet System Layers Gigabit Media Independent Interface (GMII) GMII (continued) Gigabit Ethernet Migration Topologies PowerPoint Presentation PowerPoint Presentation PowerPoint Presentation...
  • IRP, Inc. Update

    IRP, Inc. Update

    Tim Adams (IRP, Inc.) Ken Carey (IRP, Inc. q. Goals and Objectives. Support IRP and IFTA enforcement. Potential to eliminate paper (IRP cab card and IFTA license & decals) Improved compliance. Positive revenue impact. Reduced cost for both jurisdictions and...