POWER7 Performance & Dynamic System Optimizer
Steve Nasypany, Advanced Technical Sales Support
October 2013
© 2013 IBM Corporation

Agenda
- Affinity & Partition Placement
- Utilization, SMT & Virtual Processors
- Scaled Throughput
- Dynamic System Optimizer
- Backup: Performance Redbooks, APARs to know

Affinity & Placement Review

Affinity: Review
- Performance is closer to optimal when threads stay close to physical resources
- Affinity is a measurement of proximity to a resource
- Examples of resources include L2/L3 cache, memory, core, chip and book/node

- Cache Affinity: threads in different domains need to communicate with each other, or cache needs to move with thread(s) migrating across domains
- Memory Affinity: threads need to access data held in a memory bank not associated with the same chip or node
- Modern highly multi-threaded workloads are architected to have light-weight threads and distributed application memory
  - Can span domains with limited impact
  - Unix scheduler/dispatch/memory manager mechanisms spread workloads
- AIX Enhanced Affinity was created to optimize performance
  - The OS and hypervisor maintain metrics on a thread's affinity
  - AIX dynamically attempts to maintain best affinity to those resources
  - Supported on POWER7 & POWER7+ systems with AIX 6.1 TL05 or above

Affinity & Partition Placement: Review
- AIX Enhanced Affinity measurements:
  - Local: usually the POWER7 chip
  - Near: local node
  - Far: other node/drawer/CEC
- Entitlement, memory sizing & hardware define where the hypervisor runs a partition
- The hypervisor tries to optimize to chip, then Dual Chip Module (DCM)/node, in that order (cores & DIMMs)
- Chip type (3/4/6/8-core), DIMM sizes and DIMM population all have to be factored in
- [Diagram: affinity domains. Local = chip; Near = intranode (POWER7 770/780/795) or DCM (POWER7+ 750+/760); Far = internode]
- Sizing to chip, node/DCM & DIMM sizes where practical will give best performance and minimize distant dispatches

Affinity: lssrad
- The lssrad tool shows us logical placement
- View of a 24-way, two-socket POWER7+ 760 with Dual Chip Modules (DCM)
  - 6 cores in each chip, 12 in each DCM
  - 5 Virtual Processors (5 VP x 4-way SMT = 20 logical CPUs)

# lssrad -av
REF1     SRAD        MEM      CPU
0
            0   12363.94      0-7
            2    4589.00      12-15
1
            1    5104.50      8-11
            3    3486.00      16-19

When may I have a problem?
- An SRAD has CPUs but no memory, or vice-versa
- When CPU or memory are very unbalanced

But how do I really know?
- Tools tell you!
- Users complain
- Disparity in performance between equivalent systems
- In the real world, SRADs will never be perfectly balanced, and many workloads do not care

Notes on the lssrad output:
- REF1 = nodes or Dual Chip Modules (DCM); SRAD = Scheduler Resource Allocation Domain (chip)
- SRADs 0 & 2 are the two chips in the first DCM; SRADs 1 & 3 belong to the other DCM
- If a thread's home node was SRAD 0, SRAD 2 would be near and SRADs 1 & 3 would be far
- You can't tell from this output alone whether there is an affinity issue

Affinity: topas -M

Topas Monitor for host: claret4   Interval: 2
======================================================================
REF1 SRAD  TOTALMEM  INUSE  FREE   FILECACHE  HOMETHRDS  CPUS
----------------------------------------------------------------------
0    2     4.48G     515M   3.98G  52.9M      134.0      12-15
     0     12.1G     1.20G  10.9G  141M       236.0      0-7
1    1     4.98G     537M   4.46G  59.0M      129.0      8-11
     3     3.40G     402M   3.01G  39.7M      116.0      16-19
======================================================================
CPU  SRAD  TOTALDISP  LOCALDISP%  NEARDISP%  FARDISP%
------------------------------------------------------------
0    0     303.0      43.6        15.5       40.9
2    0     1.00       100.0       0.0        0.0
3    0     1.00       100.0       0.0        0.0
4    0     1.00       100.0       0.0        0.0
5    0     1.00       100.0       0.0        0.0
6    0     1.00       100.0       0.0        0.0
7    0     1.00       100.0       0.0        0.0
8    1     1.00       0.0         0.0        100.0

- We want to minimize multi-node far dispatches as much as possible
- Do not worry about far dispatches on single-node systems
- What's a bad FARDISP% rate? No rule-of-thumb, but 1000s of far dispatches per second will likely indicate lower performance
- How do we fix it? Entitlement & memory sizing Best Practices + the Dynamic Platform Optimizer

How does partition placement work?
- The hypervisor knows the chip types and memory configuration, and will attempt to pack partitions onto the smallest number of chips/nodes/drawers
- It considers the partition profiles and calculates optimal placements
- Placement is a function of Desired Entitlement, Desired & Maximum Memory settings
  - Virtual Processor counts are not considered
- Maximum memory defines the size of the Hardware Page Table (HPT) maintained for each partition
  - For POWER7 the HPT is 1/64th of Maximum memory; on POWER7+ it is 1/128th
  - Ideally, Desired + (Maximum/HPT ratio) < node memory size, if possible
- The POWER7 795 has additional rules, based on System Partition Processor Limit (SPPL) settings

Partition Placement
- Partition configuration plays an important part in the decision-making for the OS and the hypervisor
- Hypervisor and operating system affinity mechanisms for chip, intra-node and inter-node placement will not work optimally if you don't help them help you
  - Low entitlement, high VP ratios, and high physical/entitlement consumption will lead to lower affinity
  - So will excess real and maximum memory settings
- My best practices:
  - You don't need to run with >32GB free all the time (that includes AIX file cache; know your Computational Memory rate, see my Getting Started session)
  - Desired & Maximum memory should be within 32GB of each other unless there is a plan to your capacity planning
- See Tracy Smith's Architecting and Deploying Enterprise Systems session for more detailed guidance on Best Practices
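The HPT sizing rule above can be turned into a quick configuration check. A minimal sketch in Python; the function names are mine, while the 1/64 (POWER7) and 1/128 (POWER7+) ratios are the values quoted on this slide:

```python
def hpt_size_gb(max_mem_gb, power7_plus=False):
    # HPT is 1/64 of Maximum memory on POWER7, 1/128 on POWER7+ (per this deck)
    ratio = 128 if power7_plus else 64
    return max_mem_gb / ratio

def fits_in_node(desired_gb, max_gb, node_mem_gb, power7_plus=False):
    # Rule of thumb from the slide: Desired + (Maximum / HPT ratio) < node memory
    return desired_gb + hpt_size_gb(max_gb, power7_plus) < node_mem_gb
```

For example, a POWER7 partition with Desired = 100 GB and Maximum = 128 GB adds a 2 GB HPT, so it still fits in a 128 GB node; the same check fails once Desired approaches the node size.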

Utilization, Simultaneous Multi-threading & Virtual Processors

Review: POWER6 vs POWER7 SMT Utilization
- POWER5/6 utilization does not account for SMT; POWER7 utilization is calibrated in hardware
  - POWER6 SMT2: Htc0 busy, Htc1 idle reports 100% busy; Htc0 busy, Htc1 busy reports 100% busy
  - POWER7 SMT2: Htc0 busy, Htc1 idle reports ~70% busy; Htc0 busy, Htc1 busy reports 100% busy
  - POWER7 SMT4: Htc0 busy, Htc1-Htc3 idle reports ~65% busy
  - (busy = user% + system%)
- Simulating a single-threaded process on 1 core and 1 Virtual Processor, the utilization values change. In each of these cases, physical consumption can be reported as 1.0
- Real-world production workloads will involve dozens to thousands of threads, so many users may not notice any difference at the macro scale
- See "Simultaneous Multi-Threading on POWER7 Processors" by Mark Funk: http://www.ibm.com/systems/resources/pwrsysperf_SMT4OnP7.pdf
- See "Processor Utilization in AIX" by Saravanan Devendran: https://www.ibm.com/developerworks/mydeveloperworks/wikis/home?lang=en#/wiki/Power%20Systems/page/Understanding%20CPU%20utilization%20on%20AIX

Review: POWER6 vs POWER7 Dispatch
- POWER6 SMT2: Htc0 and Htc1 are loaded to ~80% busy before another Virtual Processor is activated
- POWER7 SMT4: with Htc0 busy and Htc1-Htc3 idle, another Virtual Processor is activated at ~50% busy
- There is a difference between how workloads are distributed across cores in POWER7 and earlier architectures
  - In POWER5 & POWER6, the primary and secondary SMT threads are loaded to ~80% utilization before another Virtual Processor is unfolded
  - In POWER7, all of the primary threads (defined by how many VPs are available) are loaded to at least ~50% utilization before the secondary threads are used. Once the secondary threads are loaded, only then will the tertiary threads be dispatched. This is referred to as Raw Throughput mode.
- Why? Raw Throughput provides the highest per-thread throughput and best response times at the expense of activating more physical cores
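The SMT-calibrated reporting described above can be sketched as a small lookup. This is illustrative only: the ~70% and ~65% figures are the approximate values quoted on these slides, not the exact hardware calibration:

```python
# Approximate per-core %busy reported for one core, keyed by
# (architecture/SMT mode, number of busy hardware threads).
# Values are the rough figures from the slides, assumed illustrative.
REPORTED_BUSY = {
    ("POWER6-SMT2", 1): 100, ("POWER6-SMT2", 2): 100,  # SMT not accounted for
    ("POWER7-SMT2", 1): 70,  ("POWER7-SMT2", 2): 100,  # calibrated in hardware
    ("POWER7-SMT4", 1): 65,  ("POWER7-SMT4", 4): 100,
}

def reported_busy(mode, busy_threads):
    """%busy (user + system) a partition reports for one core."""
    return REPORTED_BUSY[(mode, busy_threads)]
```

A single-threaded hog on a POWER7 SMT4 core therefore shows only ~65% busy, even though physical consumption for that core can be reported as 1.0.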

Review: POWER6 vs POWER7 Dispatch (continued)
[Diagram: POWER6 spreads work across proc0-proc3 using primary then secondary threads; POWER7 loads all primaries across proc0-proc3 first, then secondaries, then tertiaries]
- Once a Virtual Processor is dispatched, the Physical Consumption metric will typically increase to the next whole number
- Put another way, the more Virtual Processors you assign, the higher your Physical Consumption is likely to be

POWER7 Consumption: A Problem?
- POWER7 may activate more cores at lower utilization levels than earlier architectures when excess VPs are present
- Customers may complain that the physical consumption metric (reported as physc or pc) is equal to, or possibly even higher, after migrations to POWER7 from earlier architectures
- Every POWER7 customer with this complaint has been seen to also have significantly higher idle% percentages than on earlier architectures
- Consolidation of workloads may result in many more VPs assigned to the POWER7 partition
- Customers may also note that CPU capacity planning is more difficult in POWER7. If they will not reduce VPs, they may need to subtract idle% from the physical consumption metrics for more accurate planning
- In POWER5 & POWER6, 80% utilization was closer to 1.0 physical core
- In POWER7 with excess VPs, in theory, all of the VPs could be dispatched and the system could be 40-50% idle

POWER7 Consumption: Solutions
- Apply the APARs in the backup section; these can be causal for many of the high-consumption complaints
- Beware allocating many more Virtual Processors than sized
- Reduce Virtual Processor counts to activate secondary and tertiary SMT threads
  - Utilization percentages will go up; physical consumption will remain equal or drop

- Use nmon, topas, sar or mpstat to look at logical CPUs. If only primary SMT threads are in use with a multi-threaded workload, then excess VPs are present
- A new alternative is Scaled Throughput
  - This increases per-core utilization by loading more SMT threads per Virtual Processor
  - Details in the backup section

Scaled Throughput

What is Scaled Throughput?
- Scaled Throughput is an alternative to the default Raw AIX scheduling mechanism
- It is an alternative for some customers, at the cost of partition performance
- It is not an alternative to addressing AIX and pHyp defects, partition placement issues, unrealistic entitlement settings and excessive Virtual Processor assignments
- It will dispatch more SMT threads to a VP/core before unfolding additional VPs
- It can be considered to be more like the POWER6 folding mechanism, but this is a generalization, not a technical statement
- Supported on POWER7/POWER7+ with AIX 6.1 TL08 & AIX 7.1 TL02
- Raw vs Scaled performance:
  - Raw provides the highest per-thread throughput and best response times, at the expense of activating more physical cores
  - Scaled provides the highest core throughput, at the expense of per-thread response times and throughput. It also provides the highest system-wide throughput per VP, because tertiary thread capacity is not left on the table.
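The consumption discussion earlier advises subtracting idle% from physical consumption for capacity planning when excess VPs keep extra cores dispatched. A minimal sketch of one way to apply that adjustment; the linear scaling is my assumption of how to implement the advice, not an IBM-documented formula:

```python
def idle_adjusted_consumption(physc, idle_pct):
    """Discount physical consumption by the partition's idle percentage.

    physc:    physical cores consumed (e.g. from lparstat)
    idle_pct: partition-wide idle% over the same interval
    """
    return physc * (1.0 - idle_pct / 100.0)
```

For example, a partition reporting physc = 8.0 while 40% idle would be planned as roughly 4.8 cores of real demand.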

POWER7 Raw vs Scaled Throughput
[Diagram: Raw (default) loads primary threads across proc0-proc3 first; Scaled Mode 2 packs primary and secondary threads onto each core; Scaled Mode 4 packs all four SMT threads onto each core. Per-VP utilization steps through ~63% / 77% / 88% / 100% as the first through fourth SMT threads are loaded.]
- Once a Virtual Processor is dispatched, physical consumption will typically increase to the next whole number

Scaled Throughput: Tuning
- Tunings are not restricted, but anyone experimenting with this without understanding the mechanism may suffer significant performance impacts
- Dynamic schedo tunable; the actual thresholds used by these modes are not documented and may change at any time

  schedo -p -o vpm_throughput_mode=
    0  Legacy Raw mode (default)
    1  Scaled or "Enhanced Raw" mode with a higher threshold than legacy
    2  Scaled mode, use primary and secondary SMT threads
    4  Scaled mode, use all four SMT threads

- The schedo tunable vpm_throughput_core_threshold sets a core count at which to switch from Raw to Scaled mode
  - Allows fine-tuning for workloads depending on utilization level
  - VPs will ramp up quickly to a desired number of cores, and then be more conservative under the chosen Scaled mode
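The mode values above determine how many SMT threads get packed onto a Virtual Processor before another VP unfolds. A rough model of the resulting VP counts; the packing factors follow the mode descriptions on this slide, while the real AIX thresholds are, as noted, undocumented:

```python
import math

def vps_to_unfold(runnable_threads, mode):
    # Threads packed per VP before unfolding another, per vpm_throughput_mode:
    #   mode 0/1 (raw-like): primaries first, ~1 thread per VP
    #   mode 2: primary + secondary -> 2 threads per VP
    #   mode 4: all four SMT threads -> 4 threads per VP
    per_vp = {0: 1, 1: 1, 2: 2, 4: 4}[mode]
    return math.ceil(runnable_threads / per_vp)
```

This reproduces the slide's examples: eight runnable threads collapse onto four cores in Mode 2 and onto two cores in Mode 4, versus eight unfolded VPs under Raw.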

Scaled Throughput: Workloads
- Workloads
  - Workloads with many light-weight threads with short dispatch cycles and low I/O (the same types of workloads that benefit well from SMT)
  - Customers who are easily meeting network and I/O SLAs may find the tradeoff between higher latencies and lower core consumption attractive
  - Customers who will not reduce over-allocated VPs and prefer to see behavior similar to POWER6
- Performance
  - It depends; we can't guarantee what a particular workload will do
  - Mode 1 may see little or no impact, but higher per-core utilization with lower physical consumption
  - Workloads that do not benefit from SMT and use Mode 2 or Mode 4 will see double-digit per-thread performance degradation (higher latency, slower completion times)

Raw Throughput: Default and Mode 1
[Charts: Raw Throughput (default) vs Scaled Throughput Mode 1, plotting Active_Threads, Active_VP, Phys_Busy and Phys_Consumed over time]
- AIX will typically allocate 2 extra Virtual Processors as the workload scales up and is more instantaneous in nature
- VPs are activated and deactivated one second at a time
- Mode 1 is more of a modification to the Raw (Mode 0) throughput mode, using a higher utilization threshold and a moving average to reduce VP oscillation
- It is less aggressive about VP activations. Many workloads may see little or no performance impact

Scaled Throughput: Modes 2 & 4
[Charts: Scaled Throughput Mode 2 vs Mode 4, plotting Active_Threads, Active_VP, Phys_Busy and Phys_Consumed over time]
- Mode 2 utilizes the primary and secondary SMT threads
  - Somewhat like POWER6 SMT2; eight threads are collapsed onto four cores
  - Physical Busy (the utilization percentage) reaches ~80% of Physical Consumption
- Mode 4 utilizes the primary, secondary and tertiary SMT threads
  - Eight threads are collapsed onto two cores
  - Physical Busy reaches 90-100% of Physical Consumption

Tuning (other)
- Never adjust the legacy vpm_fold_threshold without L3 Support guidance

- Remember that Virtual Processors activate and deactivate on 1-second boundaries
- The legacy schedo tunable vpm_xvcpus allows enablement of more VPs than required by the workload. This is rarely needed, and is overridden when Scaled mode is active
- If you use the RSET or bindprocessor function and bind a workload:
  - To a secondary thread, that VP will always stay in at least SMT2 mode
  - To a tertiary thread, that VP cannot leave SMT4 mode
  - These functions should only be used to bind to primary threads, unless you know what you are doing or are an application developer familiar with the RSET API
  - Use bindprocessor -s to list primary, secondary and tertiary threads
- A recurring question is "How do I know how many Virtual Processors are active?"
  - There is no tool or metric that shows the active Virtual Processor count
  - There are ways to guess: looking at physical consumption (if folding is activated), the physc count should roughly equal the active VPs
  - nmon Analyser makes a somewhat accurate representation, but over long intervals (with a default of 5 minutes) it does not provide much resolution
  - For an idea at a given instant with a consistent workload, you can use: echo vpm | kdb

Virtual Processors

> echo vpm | kdb
VSD Thread State
CPU  CPPR  VP_STATE  FLAGS  SLEEP_STATE  PROD_TIME: SECS    NSECS     CEDE_LAT
  0     0  ACTIVE    1      AWAKE        0000000000000000   00000000  00
  1   255  ACTIVE    0      AWAKE        000000005058C6DE   25AA4BBD  00
  2   255  ACTIVE    0      AWAKE        000000005058C6DE   25AA636E  00
  3   255  ACTIVE    0      AWAKE        000000005058C6DE   25AA4BFE  00
  4   255  ACTIVE    0      AWAKE        00000000506900DD   0D0CC64B  00
  5   255  ACTIVE    0      AWAKE        00000000506900DD   0D0D6EE0  00
  6   255  ACTIVE    0      AWAKE        00000000506900DD   0D0E4F1E  00
  7   255  ACTIVE    0      AWAKE        00000000506900DD   0D0F7BE6  00
  8    11  DISABLED  1      SLEEPING     0000000050691728   358C3218  02
  9    11  DISABLED  1      SLEEPING     0000000050691728   358C325A  02
 10    11  DISABLED  1      SLEEPING     0000000050691728   358C319F  02
 11    11  DISABLED  1      SLEEPING     0000000050691728   358E2AFE  02
 12    11  DISABLED  1      SLEEPING     0000000050691728   358C327A  02
 13    11  DISABLED  1      SLEEPING     0000000050691728   358C3954  02
 14    11  DISABLED  1      SLEEPING     0000000050691728   358C3B13  02
 15    11  DISABLED  1      SLEEPING     0000000050691728   358C3ABD  02

- With SMT4, each core will have 4 logical CPUs, which equals 1 Virtual Processor (here, logical CPUs 0-7 are active, so 2 VPs are unfolded)
- This method is only useful for steady-state workloads

Dynamic System Optimizer

Active vs Dynamic System Optimizer
- Dynamic System Optimizer (DSO) is a rebranding of, and enhancement to, the legacy Active System Optimizer (ASO)
  - ASO is a free AIX feature which autonomously tunes the allocation of system resources to improve performance
  - DSO adds charged-for features via an enablement fileset
  - It is probably easier to adopt the DSO moniker with the understanding that there are two components, and the ASO daemon is the name of the process doing the actual work
- Legacy ASO provided function for optimizing cache and memory affinity
  - Monitors workloads for high CPU and memory utilization
  - Associates targeted workloads to a specific core or set of cores
  - Determines if memory pages being accessed can be relocated for higher affinity to cache & core
  - Designed for POWER7 and originally shipped with AIX 7.1 TL01

What does Dynamic System Optimizer do?
- If ASO provides best affinity for core/cache and memory, what does DSO add?
  - Dynamic migration to large pages (16 MB MPSS): conversion of memory pages to larger sizes (think Oracle SGA)
  - Data stream pre-fetch optimizations: dynamically modifies the algorithms used for controlling how data is moved into processor cache from main memory
- All function has been back-ported to AIX 6.1 TL08 and enhanced for AIX 7.1 TL02 and POWER7+

All this affinity? Confused?
- Enhanced Affinity, Dynamic System Optimizer, Dynamic Platform Optimizer: what does what?
- How is this different from AIX Enhanced Affinity?
  - Enhanced Affinity optimizes threads to a scheduler domain (think chip)

  - DSO optimizes threads within a chip to a core or set of cores
  - DSO actively optimizes memory pages for best locality and size
- How is this different from the Dynamic Platform Optimizer (DPO)?
  - DPO optimizes a partition's placement within a frame or drawer
  - Think "moves partitions" rather than "moves threads"
- In short: Enhanced Affinity, think CHIP; Dynamic System Optimizer, think CORE/DIMM; Dynamic Platform Optimizer, think FRAME

DSO Architecture
[Diagram: the ASO/DSO daemon performs analysis and monitoring using AIX kernel statistics and the Performance Monitoring Unit, and applies optimizations to workload resource allocations (AIX 6.1 TL08 / AIX 7.1)]
- DSO continually monitors public and private AIX kernel statistics and POWER7 processor hardware counters
- Determines which workloads will benefit from optimization
- Moves workloads to specific cores

What are Hardware Counters?
- POWER processors have always included hardware instrumentation in the form of the Performance Monitor Unit (PMU)
- This hardware facility collects events related to the operations in the processor
- See Jeff Stuecheli's "POWER7 Micro-architecture, A PMU Event Guided Tour"
- See "Performance Monitor Counter data analysis using Counter Analyzer", Qi Liang, 2009: http://www.ibm.com/developerworks/aix/library/au-counteranalyzer/index.html

What is DSO to AIX?
- ASO/DSO is a Unix System Resource Controller (SRC) service
  - Transparent optimization; it does not require active administrator intervention
  - Acts like any other kernel service: low overhead, high gain
  - Configurable via smitty src or the CLI
- Active tuning hibernates if no gains are achieved, and wakes up when instrumentation indicates possible performance improvements
- Focuses on long-term run-time analysis of processor and memory allocations based on affinity metrics
- Utilizes some aspects of AIX 6.1 Enhanced Affinity, but the focus is a set of cores within a chipset

ASO/DSO General
- ASO is designed to improve the performance of workloads that are long-lived, multi-threaded and have stable, non-trivial core/cache utilization
- The greater the communication between threads, the higher the potential for ASO to improve performance
- Greatest benefit when running in dedicated-processor LPAR environments, on large multi-chip or multi-node configurations
- ASO can be enabled at the system or process level; monitoring is done before and after placements
- Operates on a process-wide scope; it does not tune individual threads within a process
- No optimization of single-threaded processes, which remain managed by existing AIX scheduler mechanisms
- Improvements are limited to what can be achieved via manual tuning
- No optimization of workloads that are already members of Resource Set (RSET) attachments or controlled by bindprocessor()
  - If most of the heavy workloads fall under manual tuning, ASO will hibernate

ASO/DSO Requirements
- POWER7/POWER7+ dedicated or virtualized partitions
  - ASO: AIX 7.1 TL01 (cache and memory affinity)
  - DSO: AIX 6.1 TL08 and AIX 7.1 TL02 (legacy ASO function, 16 MB MPSS, pre-fetch support)
- Not supported in Active Memory Sharing environments
- Capped shared-processor environments must have a minimum entitlement of 2 cores
- Consumption for unfolded Virtual Processors must be sufficiently high to allow optimizations
- Dedicated partitions cannot have Virtual Processor Folding enabled; this occurs when Energy Management features are active
- No reboot is required after applying the DSO fileset; ASO will recognize it automatically
- Filesets:
  - bos.aso 7.1.2.0  Active System Optimizer
  - dso.aso 1.1.0.0  Dynamic System Optimizer ASO ext.

ASO Cache Affinity
- The initial task of ASO is to optimize the placement of workloads so that the threads of a process are grouped into the smallest affinity domain that provides the necessary CPU and memory resources
  - Locality by grouping cores located in chips
  - Consolidating workloads by cache activity
  - ASO can be of benefit on single-chip systems
- Threads that have heavy interaction with the same core make similar requests to L2 and L3 cache
- Optimize for lock contention: software threads contending for a lock can be confined to the subset of hardware (SMT) threads executing on the same core for best sharing
- Workload requirements:
  - Multi-threaded workloads with 5+ minute periods of stability
  - Minimum 0.1 core utilization

Affinity to ASO: Cache Affinity
[Diagram: threads of a workload spread across two chips (SRAD 0 and SRAD 1) utilize similar cache lines or access remote L3 cache; the optimizer relocates the workload to the same chip/SRAD for best cache affinity]
[Diagram: the optimizer compresses the workload onto a single chip/SRAD for best cache affinity]

ASO Memory Affinity
- The second task for ASO is to optimize memory allocation such that frequently accessed pages of memory are localized as close as possible to where the workload is running
- Given that a workload needs a local affinity domain, memory affinity can only be applied once a workload has been optimized for cache affinity
- Memory page migrations are continually monitored
- Workload requirements:
  - Minimum 0.1 core utilization
  - Multi-threaded workloads with 5+ minute periods of stability
  - Single-threaded workloads are not considered, since their process-private data is affinitized by the kernel
  - Workloads currently must fit within a single Scheduler Resource Affinity Domain (SRAD). An SRAD typically maps to a single chip/socket in POWER7, but DLPAR operations can impact that.

Affinity to ASO: Memory Affinity
[Diagram: a workload on one chip (SRAD 0) accesses memory frames on DIMMs associated with the other socket (SRAD 1); the optimizer migrates pages to provide better locality for the workload]

Large Memory Pages: 16MB MPSS
- Multiple Page Segment Size (MPSS)
  - AIX has supported 4K and 64K page sizes within the same 256 MB segment
  - POWER6 with AIX 6.1 and above allows autonomic conversion between 4K and 64K pages based on workload needs
  - 16 MB page sizes have been supported via manual tuning, and effectively had to be managed as pinned pages (allocated and managed up front)
  - AIX 6.1 TL8 and AIX 7.1 TL2 introduce the capability to mix 16 MB pages with other sizes within a memory segment. This allows autonomic conversion to 16 MB pages
- Processors use Translation Lookaside Buffers (TLB) and Effective to Real Address Translation (ERAT) when addressing real memory
  - Processor architectures can only have so many TLB entries. That number and the page size define how much of main memory can be directly mapped
  - Larger page sizes allow more of memory to be directly mapped, and fewer address lookups to be performed. This can minimize TLB/ERAT misses.

- Processor instrumentation allows this activity to be monitored, and heavily used memory regions to be targeted for promotion to 16 MB pages

DSO: 16 MB Activity
- Workloads
  - The ideal workload is one which uses large System V memory regions. Examples include databases using large shared memory regions (Oracle SGA), or Java JVM instances with large heap(s)
  - Workloads can be either multi-threaded or a group of single-threaded processes
  - Minimum 2 cores of stable CPU utilization over 10 minutes
  - Minimum 16 GB of system memory
  - Historically, Oracle specialists have been wary of using 16 MB pages because they had to be pre-allocated and it is not always clear what the DB's internal memory patterns are. MPSS support makes this more flexible for DSO to monitor and adjust
- Behavior
  - DSO will monitor a workload for at least 10 minutes before beginning any migration
  - Migration of small (4K) and medium (64K) memory frames to 16 MB is not a rapid process. Lab tests migrating double-digit-GB SGAs are measured in hours; SGAs on the order of 64 GB or larger could take half a day
  - You should not try to assess performance improvements until migration is complete; there is no quick way to do apples-to-apples comparisons
  - Customers using the ESP would not have seen 16 MB activity in the svmon tool, because the updates for that support were completed after the beta
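The TLB-coverage point behind 16 MB promotion is simple arithmetic: directly mappable memory is the entry count times the page size. A sketch using a hypothetical 1024-entry TLB (the real POWER7 TLB/ERAT sizes are not given in this deck):

```python
def tlb_reach_bytes(entries, page_size_bytes):
    # Memory addressable without a TLB miss = entries x page size
    return entries * page_size_bytes

# With a hypothetical 1024-entry TLB:
reach_4k  = tlb_reach_bytes(1024, 4 * 1024)          # 4 MiB of reach
reach_16m = tlb_reach_bytes(1024, 16 * 1024 * 1024)  # 16 GiB of reach
```

The same number of entries covers 4096x more memory with 16 MB pages than with 4K pages, which is why promoting hot regions can sharply reduce TLB/ERAT misses.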

16 MB MPSS Activity: svmon

# svmon -P 4129052
     Pid Command        Inuse    Pin  Pgsp  Virtual 64-bit Mthrd 16MB
 4129052 16mpss_basic   84755  12722     0    84734      Y     Y    N

   PageSize    Inuse  Pin  Pgsp  Virtual
   s    4 KB   52227  795     0    52227
   m   64 KB    1009    2     0     1009
   L   16 MB       4    0     0        4

    Vsid     Esid Type Description  PSize  Inuse  Pin  Pgsp  Virtual
 1370db7 a0000000 work N/A            smL  65536    0     0    65536

# svmon -P 4129052 -Ompss=on
     Pid Command        Inuse    Pin  Pgsp  Virtual
 4129052 16mpss_basic   84755  12722     0    84734

    Vsid     Esid Type Description  PSize  Inuse  Pin  Pgsp  Virtual
 1370db7 a0000000 work N/A            s    33008    0     0    33008
                                      m     1009  795     0     1009
                                      L        4    0     0        4

POWER7 Pre-fetch: Review
- The POWER7 architecture provides a dedicated register to control memory pre-fetching
  - The register is the Data Stream Control Register (DSCR)
  - It allows control over enablement, depth and stride of pre-fetching
- POWER pre-fetch instructions can be used to mask latencies of requests to the memory controller and fill cache. The POWER7 chip can recognize memory access patterns and initiate pre-fetch instructions automatically
- Control over how aggressively the hardware will pre-fetch, i.e. how many cache lines will be pre-fetched for a given reference, is managed via the DSCR
- The dscrctl command can be used to query and set the system-wide DSCR value
  - Query: dscrctl -q
  - A system administrator can change the system-wide value: dscrctl [-n | -b] -s <value>
  - Disengage the data prefetch feature: dscrctl -n -s 1
  - Return to the default: dscrctl -n -s 0
- This is a dynamic system-wide setting and easy to change/check
- It may yield a 5-10% performance improvement with some applications

DSO Pre-fetch
- DSO will collect information from the AIX kernel, POWER Hypervisor performance utilities and processor counters to dynamically determine the optimal setting of this register for a specific period in time
- Workloads
  - Large memory footprints and high CPU utilization with high context-switch rates are typically identified as candidates

  - Can be either multi-threaded or a group of single-threaded processes
  - This optimization is disabled if the DSCR register is set manually at the system level (dscrctl command)
  - Optimization requires a minimum system memory of 64GB, process shared-memory use of 16GB and consumption of ~8 physical cores
- Behavior
  - When the AIX DSO fileset is installed, DSCR optimization in ASO is enabled
  - Memory access patterns are monitored by the ASO daemon
  - Optimal values for the DSCR register are deduced
  - The register value can be set at the system or per-process level
  - Decisions are dynamic and automatic, so pre-fetching levels are changed according to current workload requirements

ASO/DSO Usage
- The System Resource Controller subsystem must be activated first (can also use smitty src, aso subsystem)
  - Start/Stop: [startsrc | stopsrc] -s aso
  - Status: lssrc -s aso
- ASO via the command line with the asoo command; use -p to persist across reboots
  - Activate: asoo -o aso_active=1
  - Deactivate: asoo -o aso_active=0
- Process environment variables
  - Session variables are effective until logout; use the /etc/environment file for permanent changes
  - ASO_ENABLED=ALWAYS   ASO prioritizes this process for optimization
  - ASO_ENABLED=NEVER    ASO never optimizes this process
  - ASO_OPTIONS= takes the following feature options:

    Feature  Option           Values    Effect
    ASO      ALL              ON | OFF  Enables/disables all of ASO
    ASO      CACHE_AFFINITY   ON | OFF  Enables/disables cache affinity
    ASO      MEMORY_AFFINITY  ON | OFF  Enables/disables memory affinitization. Note memory affinitization cannot be performed if cache affinity is disabled.
    DSO      LARGE_PAGE       ON | OFF  Enables/disables 16 MB MPSS
    DSO      MEMORY_PREFETCH  ON | OFF  Enables/disables pre-fetch optimization

ASO Debug
- If you open a PMR on ASO, the collection scripts do not include the ASO log files. You should collect any output from the /var/log/aso/ directory and include it
- Debug options are available at the System Resource Controller level:
  1. Start SRC: startsrc -s aso
  2. Activate ASO: asoo -o aso_active=1
  3. Enable debug: asoo -o debug_level=3 (3=highest, dynamic)
  4. Execute the workload
  5. Disable debug: asoo -o debug_level=0
  6. Forward the aso_debug.out file to IBM Support

Logging
Log files maintained by ASO:
/var/log/aso/aso.log will tell you if ASO is running
/var/log/aso/aso_process.log shows optimizations performed on processes; activities on processes and PIDs are logged
Documentation for interpreting the log files is not currently provided by IBM, but they are ASCII-readable output files like those of most SRC daemons. Some of the behavior and tolerances used by ASO can be divined by watching the output.
46 2013 IBM Corporation Dynamic System Optimizer & Performance Updates
aso_process.log: Cache Affinity
Example output of process placement for cache affinity, annotated:
Dynamic Reconfig Event, hibernates
Recognizes DSO function available
Recognizes new workload, begins monitoring
Considers optimization, decides to attach to core
47 2013 IBM Corporation Dynamic System Optimizer & Performance Updates
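A quick way to eyeball the two files named above; a hedged sketch, guarded so it is a no-op on systems without ASO installed:

```shell
# Inspect the ASO logs from the Logging slide (AIX paths).
ASO_LOG_DIR=/var/log/aso
if [ -d "$ASO_LOG_DIR" ]; then
    tail -n 20 "$ASO_LOG_DIR/aso.log"           # daemon status: is ASO running?
    tail -n 50 "$ASO_LOG_DIR/aso_process.log"   # per-process optimization decisions
else
    echo "no $ASO_LOG_DIR: ASO not installed on this system"
fi
```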

aso_process.log: Cache Affinity
Example output of process placement for cache affinity, annotated:
Recommendation
Placement
Gain Measurement & Result
48 2013 IBM Corporation Dynamic System Optimizer & Performance Updates
aso_process.log: Memory, Large Pages
Example output of analysis for memory affinity and Large Page (16 MB) promotion, and the decision to abandon optimization policies because of workload behavior:
Attaching to cores
Large Page/TLB profiling
Not enough gain, removing cache, memory, large page and pre-fetch attempts
49 2013 IBM Corporation Dynamic System Optimizer & Performance Updates
Redbooks, APARs to Know
50 2013 IBM Corporation

Dynamic System Optimizer & Performance Updates
Great Performance Redbooks
http://www.redbooks.ibm.com/abstracts/sg248080.html
http://www.redbooks.ibm.com/abstracts/sg248079.html
51 2013 IBM Corporation Dynamic System Optimizer & Performance Updates
Updates: POWER7 & AIX
The most problematic performance issues with AIX were resolved in early 2012. Surprisingly, many customers are still running with these defects:
Memory Affinity Domain Balancing
Scheduler/Dispatch defects
Wait process defect
TCP Retransmit
Shared Ethernet defects
Do not run with a firmware level below 720_101; a hypervisor dispatch defect exists below that level.
The next slide provides the APARs that resolve the major issues. We strongly recommend updating to these levels if you encounter performance issues. AIX Support will likely push you to these levels before wanting to do detailed research on performance PMRs. All customer Proof-of-Concept or tests should use these as minimum recommended levels to start with.
52 2013 IBM Corporation Dynamic System Optimizer & Performance Updates
POWER7 Performance APARs List (Issue / Release / APAR / SP-PTF)

Issue: WAITPROC IDLE LOOPING CONSUMES CPU
  7.1 TL1   IV10484   SP2 (IV09868)
  6.1 TL7   IV10172   SP2 (IV09929)
  6.1 TL6   IV06197   U846391 bos.mp64 6.1.6.17 or SP7
  6.1 TL5   IV01111   U842590 bos.mp64 6.1.5.9 or SP8

Issue: SRAD load balancing issues on shared LPARs
  7.1 TL1   IV10802   SP2 (IV09868)
  6.1 TL7   IV10173   SP2 (IV09929)
  6.1 TL6   IV06196   U846391 bos.mp64 6.1.6.17 or SP7
  6.1 TL5   IV06194   U842590 bos.mp64 6.1.5.9 or SP8

Issue: Miscellaneous dispatcher/scheduling performance fixes
  7.1 TL1   IV10803   SP2 (IV09868)
  6.1 TL7   IV10292   SP2 (IV09929)
  6.1 TL6   IV10259   U846391 bos.mp64 6.1.6.17 or SP7
  6.1 TL5   IV11068   U842590 bos.mp64 6.1.5.9 or SP8

Issue: (not legible in source)
  7.1 TL1   IV10791   SP2 (IV09868)
  6.1 TL7   IV10606   SP2 (IV09929)
  6.1 TL6   IV03903   U846391 bos.mp64 6.1.6.17 or SP7
  6.1 TL5   n/a

Issue: TCP Retransmit Processing is slow (HIPER)
  7.1 TL1   IV13121   SP4
  6.1 TL7   IV14297   SP4
  6.1 TL6   IV18483   U849886 bos.net.tcp.client 6.1.6.19 or SP8

Issue: SEA lock contention and driver issues; address space lock contention issue
  2.2.1.4   FP25 SP02

53 2013 IBM Corporation Dynamic System Optimizer & Performance Updates
AIX Paging Issue
The new numperm_global tunable was enabled with AIX 6.1 TL7 SP4 / 7.1 TL1 SP4. Customers may experience early paging due to a failed pin check on 64K pages.
What:
Fails to steal from 4K pages when 64K pages are near the maximum pin percentage (maxpin) and 4K pages are available
Scenario not properly checked for all memory pools when numperm_global is enabled
vmstat -v shows that the number of 64K pages pinned is close to maxpin%
svmon shows that 64K pinned pages are approaching the maxpin value
Action:
Apply the APAR. Alternatively, if the APAR cannot be applied immediately, disable numperm_global:
# vmo -p -o numperm_global=0
The tunable is dynamic, but workloads paged out will have to be paged back in, and performance may suffer until that completes or a reboot is performed.
APARs (FixDist has interim fixes available if the SP has not shipped):
IV26272   AIX 6.1 TL7
IV26735   AIX 6.1 TL8
IV26581   AIX 7.1 TL0
IV27014   AIX 7.1 TL1
IV26731   AIX 7.1 TL2
54 2013 IBM Corporation
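The interim workaround above can be scripted; a minimal sketch using only the vmo command from the slide (AIX-only, run as root), guarded so it degrades gracefully on other systems:

```shell
# Disable numperm_global until the APAR/interim fix can be applied.
if command -v vmo >/dev/null 2>&1; then
    vmo -p -o numperm_global=0   # -p persists across reboots; tunable is dynamic
    status="applied"
else
    status="skipped: vmo not found (AIX only)"
fi
echo "$status"
```

Remember the caveat from the slide: the change is dynamic, but previously paged-out workloads must page back in before performance recovers.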
