Heterogeneous System Coherence for Integrated CPU-GPU Systems

HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS
Original work by Jason et al.
Presented by Anilkumar Ranganagoudra for ECE751 coursework

EXECUTIVE SUMMARY
- Hardware coherence can increase the utility of heterogeneous systems
- Major bottlenecks in current coherence implementations
  - High bandwidth difficult to support at directory
  - Extreme resource requirements
- We propose Heterogeneous System Coherence
  - Leverages spatial locality and region coherence
  - Reduces bandwidth by 94%
  - Reduces resource requirements by 95%

PHYSICAL INTEGRATION
[Figure: CPU cores, GPU, and stacked high-bandwidth DRAM integrated together. Credit: IBM]

LOGICAL INTEGRATION
- General-purpose GPU computing
  - OpenCL
  - CUDA
- Heterogeneous Uniform Memory Access (hUMA)
  - Shared virtual address space
  - Cache coherence
- Allows new heterogeneous apps

OUTLINE
- Motivation
- Background
  - System overview
  - Cache architecture reminder
- Heterogeneous System Bottlenecks
- Heterogeneous System Coherence Details
- Results
- Conclusions

SYSTEM OVERVIEW: SYSTEM LEVEL
[Diagram: an Accelerated Processing Unit (APU) connected to DRAM channels over a high-bandwidth interconnect]

SYSTEM OVERVIEW: APU
[Diagram: APU containing a GPU cluster, a CPU cluster, a directory, a direct-access bus (used for graphics), and the path to DRAM. Arrow thickness indicates bandwidth; invalidation traffic is also shown.]
- GPU compute accesses must stay coherent

SYSTEM OVERVIEW: GPU
[Diagram: GPU cluster of compute units (CUs), each with instruction fetch/decode, a register file, execution units, local scratchpad memory, and an L1 cache; the L1s connect to the GPU L2 cache]
- Very high bandwidth: L2 has high miss rate

SYSTEM OVERVIEW: CPU
[Diagram: CPU cluster of CPU cores with L1 caches and an L2 cache that connects to the directory]
- Low bandwidth: low L2 miss rate

CACHE ARCHITECTURE REMINDER: CPU/GPU L2 CACHE
[Diagram: L2 cache with cache tag arrays and MSHRs. Labels: demand requests, core data responses, hit, miss, requests, data responses, probe requests, coherent network interface]
- Demand requests from L1: searches cache tags for a tag match
  - On a hit, return data to the L1
  - On a miss, allocates an MSHR entry and sends a request to the directory
- On a directory probe, check MSHRs and tags
  - Tag hit on probe: send data to other core

DIRECTORY ARCHITECTURE REMINDER: DIRECTORY
[Diagram: directory with a block directory tag array, MSHRs, and a probe request RAM. Labels: coherent block requests, block probe requests/responses, hit, miss, To DRAM]
- Demand requests from L2 cache: searches block tags for a tag match and allocates an MSHR entry
- Allocate and send probes to L2 caches
- On a miss, the data comes from DRAM

OUTLINE
- Motivation
- Background
- Heterogeneous System Bottlenecks
  - Simulation overview
  - Directory bandwidth
  - MSHRs
  - Performance is significantly affected
- Heterogeneous System Coherence Details
- Results
- Conclusions

SIMULATION DETAILS
- gem5 simulator
  - Simple CPU
  - GPU simulator based on AMD GCN
  - All memory requests through gem5
- Workloads
  - Modified to use hUMA
  - Rodinia & AMD APP SDK

Parameter           Value
CPU Clock           2 GHz
CPU Cores           2
CPU Shared L2       2 MB (16-way banked)
GPU Clock           1 GHz
Compute Units       32
GPU Shared L2       4 MB (64-way banked)
L3 (Memory-side)    16 MB (16-way banked)
DRAM                DDR3, 16 channels
Peak Bandwidth      700 GB/s
Baseline Directory  256k entries (8-way banked)

GPGPU BENCHMARKS
Rodinia benchmarks
- bp: trains the connection weights on a neural network
- bfs: breadth-first search
- hs: performs a transient 2D thermal simulation (5-point stencil)
- lud: matrix decomposition
- nw: performs a global optimization for DNA sequence alignment
- km: k-means clustering
- sd: speckle-reducing anisotropic diffusion
AMD SDK
- bn: bitonic sort
- dct: discrete cosine transform
- hg: histogram
- mm: matrix multiplication

SYSTEM BOTTLENECKS
[Diagram: APU with GPU cluster, CPU cluster, and directory connected to DRAM]
- Difficult to scale directory bandwidth
  - Difficult to multi-port
  - Complicated pipeline
  - Designed to support CPU bandwidth
- High resource usage
  - Must allocate MSHR for entire duration of request
  - High-bandwidth MSHR array difficult to scale
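Because an MSHR is held for the entire duration of a request, the number of MSHRs in use grows with both the request rate and the request latency. A minimal sketch of that relationship using Little's law follows. The function name and the latency values are illustrative assumptions, not numbers from the deck; the rate of 4 requests per cycle is the peak reported later in the bottlenecks summary.

```python
def mshrs_needed(requests_per_cycle: float, latency_cycles: float) -> float:
    """Little's law: average outstanding requests = arrival rate x time held.

    Each outstanding request occupies one MSHR for its whole lifetime,
    so this is also the average number of MSHRs in use.
    """
    return requests_per_cycle * latency_cycles


if __name__ == "__main__":
    # Hypothetical latencies (in cycles), chosen only to show the scaling.
    for latency_cycles in (100, 1000, 2500):
        in_use = mshrs_needed(4, latency_cycles)
        print(f"4 req/cycle x {latency_cycles} cycles -> {in_use:.0f} MSHRs")
```

Under these assumptions, a directory that accepts 4 requests per cycle would only need on the order of 10,000 MSHRs if requests stayed outstanding for thousands of cycles, which suggests that queuing and back-pressure, not just raw DRAM latency, contribute to the occupancy.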

DIRECTORY TRAFFIC
[Figure: directory accesses per GPU cycle (y-axis, 0 to 4.5) for each benchmark: bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
- Difficult to support >1 request per cycle

RESOURCE USAGE
[Figure: maximum MSHRs (y-axis ticks at 100, 1000, 10000, 100000) for each benchmark]
- Very difficult to scale MSHR array
- Causes significant back-pressure on L2s

PERFORMANCE OF BASELINE COMPARED TO UNCONSTRAINED RESOURCES
[Figure: slow-down (y-axis, 0 to 5) for each benchmark]
- Back-pressure from limited MSHRs and bandwidth

BOTTLENECKS SUMMARY
- Directory bandwidth
  - Must support up to 4 requests per cycle
  - Difficult to construct pipeline
- Resource usage
  - MSHRs are a constraining resource
  - Need more than 10,000
- Without resource constraints, up to 4x better performance

OUTLINE
- Motivation
- Background
- Heterogeneous System Bottlenecks
- Heterogeneous System Coherence Details
  - Overall system design
  - Region buffer design
  - Region directory design
  - Example
  - Hardware complexity
- Results
- Conclusions

BASELINE DIRECTORY COHERENCE
[Diagram: APU with GPU cluster, CPU cluster, and directory connected to DRAM, stepped through three phases: initialization, kernel launch, read result]

HETEROGENEOUS SYSTEM COHERENCE (HSC)
[Diagram: the same APU with GPU cluster, CPU cluster, and directory connected to DRAM, stepped through initialization and kernel launch]

REGION COHERENCE
- Applied to snooping systems
  - [Cantin, ISCA 2005] [Moshovos, ISCA 2005] [Zebchuk, MICRO 2007]
- Extended to directories
  - [Fang, PACT 2013] [Zebchuk, MICRO 2013]

(a) Region directory entry
    Region Tag (18 bits) | State (2 bits) | CPU, GPU (1 valid bit per cluster)
(b) Region buffer entry
    Region Tag (18 bits) | State (2 bits) | B0 B1 B2 ... B15 (1 valid bit per block in the region)
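The two entry formats above map naturally onto small record types. The sketch below is a hedged illustration, not the authors' implementation: the class and function names are hypothetical, the 64-B block size is the one stated later in the deck for 1-KB regions, and the slides do not say how an address is divided between the 18-bit tag and the set index, so the sketch simply keeps the whole region number.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

BLOCK_BYTES = 64                                   # block size stated later in the deck
BLOCKS_PER_REGION = 16                             # B0 ... B15 in the region buffer entry
REGION_BYTES = BLOCK_BYTES * BLOCKS_PER_REGION     # 1 KB


def split_address(addr: int) -> Tuple[int, int]:
    """Return (region number, block index within the region) for a byte address."""
    region = addr // REGION_BYTES
    block = (addr % REGION_BYTES) // BLOCK_BYTES
    return region, block


@dataclass
class RegionDirectoryEntry:
    region: int              # stands in for the 18-bit region tag
    state: int = 0           # 2-bit state; the encoding is not given in the slides
    cpu_valid: bool = False  # 1 valid bit per cluster
    gpu_valid: bool = False


@dataclass
class RegionBufferEntry:
    region: int              # stands in for the 18-bit region tag
    state: int = 0           # 2-bit state; the encoding is not given in the slides
    block_valid: List[bool] = field(
        default_factory=lambda: [False] * BLOCKS_PER_REGION
    )                        # 1 valid bit per block in the region


if __name__ == "__main__":
    region, block = split_address(0x1234)
    entry = RegionBufferEntry(region=region)
    entry.block_valid[block] = True
    print(region, block, entry.block_valid.count(True))
```

The point of the layout is that one region entry summarizes 16 blocks, so a single tag covers what would otherwise need 16 block-level entries.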

HETEROGENEOUS SYSTEM COHERENCE (HSC): APU
[Diagram: APU in which the GPU cluster and the CPU cluster each have a region buffer; a direct-access bus and a region directory (in place of the directory) lead to DRAM]
- Region buffers coordinate with region directory

HETEROGENEOUS SYSTEM COHERENCE (HSC): GPU CLUSTER
[Diagram: GPU cluster of CUs with L1 caches and the GPU L2 cache, with a region buffer at the L2]

HETEROGENEOUS SYSTEM COHERENCE (HSC): CPU CLUSTER
[Diagram: CPU cluster of CPU cores with L1 caches and an L2 cache, with a region buffer at the L2]

HSC: L2 CACHE & REGION BUFFER
[Diagram: L2 cache (cache tag arrays, MSHRs) alongside a region buffer (tag array, MSHRs). Labels: demand requests, core data responses, hit, miss, requests, data responses, probe requests, coherent network interface, direct-access bus interface]
- Region buffer: region tags and permissions
- Only region-level permission traffic
- Interface for direct-access bus

HSC: REGION DIRECTORY
[Diagram: region directory with a region directory tag array (in place of the block directory tag array), MSHRs, and a probe request RAM. Labels: region permission requests, block probe requests/responses, hit, miss, To DRAM]
- Region directory: region tags, sharers, and permissions
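The request path implied by the region buffer and region directory slides can be sketched as a toy model. This is a deliberately simplified illustration under stated assumptions: the class and method names are hypothetical, permission is modeled as a simple set membership, and probes, downgrades, and evictions (which a real protocol needs when the other cluster holds the region) are omitted.

```python
REGION_BYTES = 1024  # 1-KB regions, as stated in the deck


class RegionDirectory:
    """Toy region directory: tracks which clusters hold each region."""

    def __init__(self):
        self.sharers = {}              # region number -> set of cluster names
        self.permission_requests = 0   # traffic on the coherence network

    def request_permission(self, cluster, region):
        self.permission_requests += 1
        self.sharers.setdefault(region, set()).add(cluster)


class RegionBuffer:
    """Toy region buffer sitting next to one cluster's L2 cache."""

    def __init__(self, cluster, directory):
        self.cluster = cluster
        self.directory = directory
        self.regions = set()           # regions this cluster has permission for
        self.direct_bus_requests = 0   # block traffic on the direct-access bus

    def l2_miss(self, addr):
        region = addr // REGION_BYTES
        if region not in self.regions:
            # Region miss: only a region-level permission request
            # crosses the coherence network.
            self.directory.request_permission(self.cluster, region)
            self.regions.add(region)
        # With permission held, the block itself is fetched over the
        # direct-access bus, bypassing the directory.
        self.direct_bus_requests += 1


if __name__ == "__main__":
    directory = RegionDirectory()
    gpu = RegionBuffer("GPU", directory)
    for block in range(16):            # stream through one 1-KB region
        gpu.l2_miss(block * 64)
    print(directory.permission_requests, gpu.direct_bus_requests)
```

In this model, streaming through one region costs a single directory access and 16 direct-access bus requests, which is the best case that region-level tracking aims for when spatial locality is high.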

HSC: EXAMPLE MEMORY REQUEST
[Diagram: APU with the GPU L2 cache, GPU cluster region buffer, CPU cluster region buffer, region directory, and the path to DRAM; a region buffer entry (RBE) and a region directory entry (RDE) are marked]

HSC: HARDWARE COMPLEXITY
- Region protocols reduce directory size
  - Region directory: 8x fewer entries
- Region buffers
  - At each L2 cache
  - 1-KB region (16 64-B blocks)
  - 16K region entries
  - Overprovisioned for low-locality workloads

(a) Region directory entry
    Region Tag (18 bits) | State (2 bits) | CPU, GPU (1 valid bit per cluster)
(b) Region buffer entry
    Region Tag (18 bits) | State (2 bits) | B0 B1 B2 ... B15 (1 valid bit per block in the region)
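The entry formats and entry counts above are enough for a rough storage estimate. The arithmetic below is a sketch under assumptions that the deck does not state explicitly: "8x fewer entries" is taken relative to the 256k-entry baseline directory from the simulation details, "k" and "K" are read as 1024, and only tag, state, and valid bits are counted (no ECC or replacement metadata).

```python
REGION_TAG_BITS = 18
STATE_BITS = 2
CLUSTERS = 2             # one valid bit each for CPU and GPU
BLOCKS_PER_REGION = 16   # one valid bit per block, B0 ... B15

REGION_DIR_ENTRY_BITS = REGION_TAG_BITS + STATE_BITS + CLUSTERS           # 22
REGION_BUF_ENTRY_BITS = REGION_TAG_BITS + STATE_BITS + BLOCKS_PER_REGION  # 36

BASELINE_DIRECTORY_ENTRIES = 256 * 1024                       # from the simulation details
REGION_DIRECTORY_ENTRIES = BASELINE_DIRECTORY_ENTRIES // 8    # assumed meaning of "8x fewer"
REGION_BUFFER_ENTRIES = 16 * 1024                             # per L2 cache


def kib(bits: int) -> float:
    """Convert a bit count to KiB."""
    return bits / 8 / 1024


region_directory_kib = kib(REGION_DIRECTORY_ENTRIES * REGION_DIR_ENTRY_BITS)
region_buffer_kib = kib(REGION_BUFFER_ENTRIES * REGION_BUF_ENTRY_BITS)

if __name__ == "__main__":
    print(f"Region directory: {REGION_DIRECTORY_ENTRIES} entries, "
          f"{region_directory_kib:.0f} KiB")
    print(f"Region buffer (per L2): {REGION_BUFFER_ENTRIES} entries, "
          f"{region_buffer_kib:.0f} KiB")
```

Under these assumptions the region directory needs about 88 KiB and each region buffer about 72 KiB of tag, state, and valid-bit storage.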

HSC KEY POINTS
- Key insight
  - GPU-CPU applications exhibit high spatial locality
  - Use direct-access bus present in systems
- Offload bandwidth onto direct-access bus
  - Use coherence network only for permission
- Add region buffer to track region information
  - At each L2 cache
  - Bypass coherence network and directory
- Replace directory with region directory
  - Significantly reduces total size needed

OUTLINE
- Motivation
- Background
- Heterogeneous System Bottlenecks
- Heterogeneous System Coherence Details
- Results
  - Speed-up
  - Latency of loads
  - Bandwidth
  - MSHR usage
- Conclusions

THREE CACHE-COHERENCE PROTOCOLS
- Broadcast: null directory that broadcasts on all requests
- Baseline: block-based, mostly inclusive directory
- HSC: region-based directory with 1-KB region size

HSC PERFORMANCE
[Figure: normalized speed-up (y-axis, 0 to 5) of Broadcast, Baseline, and HSC for each benchmark: bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
- Largest slowdowns from constrained resources

DIRECTORY TRAFFIC REDUCTION
[Figure: normalized directory bandwidth (y-axis, 0 to 1.2) of broadcast, baseline, and HSC for each benchmark, with the theoretical reduction from 16-block regions marked]
- Average bandwidth significantly reduced

HSC RESOURCE USAGE
[Figure: normalized directory MSHRs required (y-axis, 0 to 0.25) for each benchmark]
- Maximum MSHRs significantly reduced

RESULTS SUMMARY
- HSC significantly improves performance
  - Reduces the average load latency
  - Decreases bandwidth requirement of directory
- HSC reduces the required MSHRs at the directory

CONCLUSIONS
- Hardware coherence can increase the utility of heterogeneous systems
- Major bottlenecks in current coherence implementations
  - High bandwidth difficult to support at directory
  - Extreme resource requirements
- We propose Heterogeneous System Coherence
  - Leverages spatial locality and region coherence
  - Reduces bandwidth by 94%
  - Reduces resource requirements by 95%

COMMENTS
- A design-space exploration of the chosen parameters would be useful
- The evaluation should also cover benchmarks with non-streaming memory access patterns
- An energy-efficiency analysis could be performed on the benchmarks

Questions?

DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

BACKUP SLIDES

LOAD LATENCY
[Figure: normalized load latency (y-axis, 0 to 4.5) of broadcast, baseline, and HSC for each benchmark: bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
- Average load time significantly reduced

EXECUTION TIME BREAKDOWN
[Figure: execution time (%) split between GPU and CPU (y-axis, 0 to 120) for each benchmark]
