The Future of Information Technology and The Indiana ...

The Future of Information Technology and The Indiana ...

NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science Software: MIDAS HPC-ABDS General HPC-ABDS February 2017 1 1 Software Nexus Application Layer On Big Data Software Components for Programming and Data Processing On HPC for runtime On IaaS and DevOps Hardware and Systems HPC-ABDS this part in this deck MIDAS Java Grande 2

Big Data ABDS HPCCloud 3.0 17. Orchestration Beam, Crunch, Tez, Cloud Dataflow 16. Libraries MLlib/Mahout, TensorFlow, CNTK, R, Python HPC, Cluster Kepler, Pegasus, Taverna 15A. High Level Programming Pig, Hive, Drill Domain-specific Languages 15B. Platform as a Service App Engine, BlueMix, Elastic Beanstalk Languages Java, Erlang, Scala, Clojure, SQL, SPARQL, Python 14B. Streaming Storm, Kafka, Kinesis 13,14A. Parallel Runtime Hadoop, MapReduce 2. Coordination Zookeeper 12. Caching

Memcached Yarn, Mesos 8. File Systems HDFS, Object Stores 1, 11A Formats Thrift, Protobuf 5. IaaS OpenStack, Docker Infrastructure CLOUDS XSEDE Software Stack Fortran, C/C++, Python HPC-ABDS MPI/OpenMP/OpenCL IntegratedSo CUDA, Exascale Runtime ftware 11. Data Management Hbase, Accumulo, Neo4J, MySQL 10. Data Transfer Sqoop 9. Scheduling ScaLAPACK, PETSc, Matlab

iRODS GridFTP Slurm Lustre FITS, HDF Linux, Bare-metal, SR-IOV Clouds and/or HPC SUPERCOMPUTERS 3 HPC-ABDS Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies CrossCutting Functions 1) Message and Data Protocols: Avro, Thrift, Protobuf 2) Distributed Coordination

: Google Chubby, Zookeeper, Giraffe, JGroups 3) Security & Privacy: InCommon, Eduroam OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML OAuth 4) Monitoring: Ambari, Ganglia, Nagios, Inca 21 layers Over 350 Software Packages January 29 2016

17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA), Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML 16) Application and Analytics: Mahout , MLlib , MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA MAGMA, Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK 15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere 15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird 14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn, Twitter Heron, Databus, Facebook Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine 14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem 13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, HPX-5, Argo BEAST HPX-5 BEAST PULSAR, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective, Public Cloud: Amazon SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs 12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB, H-Store 12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC 12) Extraction Tools: UIMA, Tika 11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL 11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame

Public Cloud: Azure Table, Amazon Dynamo, Google DataStore 11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet 10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST 9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs 8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage 7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis 6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api 5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds Networking: Google Cloud DNS, Amazon Route 53 4 Functionality of 21 HPC-ABDS Layers 1) 2) 3) 4) 5) Message Protocols: Distributed Coordination:

Security & Privacy: Monitoring: IaaS Management from HPC to hypervisors: 6) DevOps: 7) Interoperability: 8) File systems: 9) Cluster Resource Management: 10) Data Transport: 11) A) File management B) NoSQL C) SQL 12) In-memory databases & caches / Object-relational mapping / Extraction Tools 13) Inter process communication Collectives, point-to-point, publishsubscribe, MPI: 14) A) Basic Programming model and runtime, SPMD, MapReduce: B) Streaming: 15) A) High level Programming: B) Frameworks 16) Application and Analytics: 17) Workflow-Orchestration: Lesson of large number (350). This is a rich software environment that HPC cannot compete with. Need to

5 use and not regenerate Note level 13 Inter process communication added Using Apache (Commercial Big Data) Data Systems for Science/Simulation Pro: Use rich functionality and usability of ABDS (Apache Big Data Stack) Pro: Sustainability model of community open source Con (Pro for many commercial users): Optimized for fault-tolerance and usability and not performance Feature: Naturally run on clouds and not HPC platforms Feature: Cloud is logically centralized, physically distributed but science data typically distributed. Question: how do science data analysis requirements differ from those commercially e.g. recommender systems heavily used commercially Approach: HPC-ABDS using HPC runtime and tools to enhance commercial data systems (ABDS on top of HPC) Upper level software: ABDS Lower level runtime: HPC HPCCloud Hardware: HPC or classic cloud dependent on application requirements 6 HPC-ABDS SPIDAL Project Activities

Green is MIDAS Black is SPIDAL Level 17: Orchestration: Apache Beam (Google Cloud Dataflow) integrated with Heron/Flink and Cloudmesh on HPC cluster Level 16: Applications: Datamining for molecular dynamics, Image processing for remote sensing and pathology, graphs, streaming, bioinformatics, social media, financial informatics, text mining Level 16: Algorithms: Generic and custom for applications SPIDAL Level 14: Programming: Storm, Heron (Twitter replaces Storm), Hadoop, Spark, Flink. Improve Inter- and Intra-node performance; science data structures Level 13: Runtime Communication: Enhanced Storm and Hadoop (Spark, Flink, Giraph) using HPC runtime technologies, Harp Level 12: In-memory Database: Redis + Spark used in Pilot-Data Memory Level 11: Data management: Hbase and MongoDB integrated via use of Beam and other Apache tools; enhance Hbase Level 9: Cluster Management: Integrate Pilot Jobs with Yarn, Mesos, Spark, Hadoop; integrate Storm and Heron with Slurm

Level 6: DevOps: Python Cloudmesh virtual Cluster Interoperability 7 Exemplar Software for a Big Data Initiative Functionality of ABDS and Performance of HPC Workflow: Apache Crunch, Python or Kepler Data Analytics: Mahout, R, ImageJ, Scalapack High level Programming: Hive, Pig Batch Parallel Programming model: Hadoop, Spark, Giraph, Harp, MPI; Streaming Programming model: Storm, Kafka or RabbitMQ In-memory: Memcached

Data Management: Hbase, MongoDB, MySQL Distributed Coordination: Zookeeper Cluster Management: Yarn, Slurm File Systems: HDFS, Object store (Swift),Lustre DevOps: Cloudmesh, Chef, Puppet, Docker, Cobbler IaaS: Amazon, Azure, OpenStack, Docker, SR-IOV Monitoring: Inca, Ganglia, Nagios 8 MIDAS Middleware 350 HPC-ABDS Software Projects System Abstraction/Standards Data Format and Storage HPC ABDS MIDAS-SPIDAL Hourglass HPC Yarn for Resource management Horizontally scalable parallel programming model Collective and Point to Point Communication Support for iteration (in memory processing) Application Abstractions/Standards Graphs, Networks, Images, Geospatial .. Scalable Parallel Interoperable Data Analytics Library

(SPIDAL) High performance Mahout, R, Matlab .. High Performance Applications 9 HPC-ABDS Stack Summarized I The HPC-ABDS software is broken up into 21 layers so that one can discuss software systems in reasonable size groups. The layers where there is especial opportunity to integrate HPC are colored green in figure. We note that data systems that we construct from this software can run interoperably on virtualized or non-virtualized environments aimed at key scientific data analysis problems. Most of ABDS emphasizes scalability but not performance and one of our goals is to produce high performance environments. Here there is clear need for better node performance and support of accelerators like Xeon-Phi and GPUs. Figure ABDS v. HPC Architecture contrasts modern ABDS and HPC stacks illustrating most of the 21 layers and labelling on left with layer number used in HPC-ABDS Figure. The omitted layers in architecture figure are Interoperability, DevOps, Monitoring and Security (layers 7, 6, 4, 3) which are all important and clearly applicable to both HPC and ABDS. We also add an extra layer language not discussed in HPC-ABDS Figure.

10 HPC-ABDS Stack Summarized II Lower layers where HPC can make a major impact include scheduling layer 9 where Apache technologies like Yarn and Mesos need to be integrated with the sophisticated HPC cluster and HTC approaches. Storage layer 8 is another important area where HPC distributed and parallel storage environments need to be reconciled with the data parallel storage seen in HDFS in many ABDS systems. However the most important issues are probably at the higher layers with data management(11), communication(13), (high layer or basic) programming(15, 14), analytics (16) and orchestration (17). These are areas where there is rapid commodity/commercial innovation and we discuss them in order below. Much science data analysis is centered on files (8) but we expect movement to the common commodity approaches of Object stores (8), SQL and NoSQL (11) where latter has a proliferation of systems with different characteristics especially in the data abstraction that varies over row/column, key-value, graph and documents. Note recent developments at the programming layer (15A) like Apache Hive and Drill, which offer high-layer access models like SQL implemented on scalable NoSQL data systems. Generalize Drill to other views such as FITS on anything (astronomy) or Excel on Anything or Matplotlib on anything 11 HPC-ABDS Stack Summarized III

The communication layer (13) includes Publish-subscribe technology used in many approaches to streaming data as well the HPC communication technologies (MPI) which are much faster than most default Apache approaches but can be added to some systems like Hadoop whose modern version is modular and allows plug-ins for HPC stalwarts like MPI and sophisticated load balancers. Need to extend to Giraph and include load-balancing The programming layer (14) includes both the classic batch processing typified by Hadoop (14A) and streaming by Storm (14B). Investigating Map-Streaming programming models seems important. ABDS Streaming is nice but doesnt support real-time or parallel processing. The programming offerings (14) differ in approaches to data model (key-value, array, pixel, bags, graph), fault tolerance and communication. The trade-offs here have major performance issues. Too many ~identical programming systems! Recent survey of graph databases from Wikidata with 49 features; chose BlazeGraph You also see systems like Apache Pig (15A) offering data parallel interfaces. At the high layer we see both programming models (15A) and Platform (15B) as a Service toolkits where the Google App Engine is well known but there are many entries include the recent BlueMix from IBM. 12 HPC-ABDS Stack Summarized IV The orchestration or workflow layer 17 has seen an explosion of activity in the ABDS space although with systems like Pegasus,

Taverna, Kepler, Swift and IPython, HPC has substantial experience. There are commodity orchestration dataflow systems like Tez and projects like Apache Crunch with a data parallel emphasis based on ideas from Google FlumeJava. A modern version of the latter presumably underlies Googles recently announced Cloud Dataflow that unifies support of multiple batch and streaming components; a capability that we expect to become common. The implementation of the analytics layer 16 depends on details of orchestration and especially programming layers but probably most important are quality parallel algorithms. As many machine learning algorithms involve linear algebra, HPC expertise is directly applicable as is fundamental mathematics needed to develop O(NlogN) algorithms for analytics that are naively O(N2). Streaming (online) algorithms are an important new area of research 13 DevOps, Platforms and Orchestration DevOps Level 6 includes several automation capabilities including systems like OpenStack Heat, Juju and Kubernetes to build virtual clusters and a standard TOSCA that has several good studies from Stuttgart TOSCA specifies system to be instantiated and managed TOSCA is closely related to workflow (orchestration) standard BPEL BPEL specifies system to be executed

In Level 17, should evaluate new orchestration systems from ABDS such as NiFi or Crunch and toolkits like Cascading Ensure streaming and batch supported Level 15B has application hosting environments such as GAE, Heroku, Dokku (for Docker), Jelastic These platforms bring together a focused set of tools to address a finite but broad application area Should look at these 3 levels to build HPC and Streaming systems 14 HPC-ABDS In Detail See for lectures covering the Spring 2015 HPC-ABDS at level of around 1 slide for each ~300 (then) members This is video and PowerPoint 15 HPC-ABDS Stack Layers 1 and 2 Layer 1) Message Protocols: Avro, Thrift, Protobuf This layer is unlikely to directly visible in many applications as used in underlying system. Thrift and Protobuf have similar functionality and are

used to build messaging protocols with data syntax dependent interfaces between components (services) of system. Avro always carries schema with messages so that they can be processed generically without syntax specific interfaces. Layer 2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups Zookeeper is likely to be used in many applications as it is way that one achieves consistency in distributed systems especially in overall control logic and metadata. It is for example used in Apache Storm to coordinate distributed streaming data input with multiple servers ingesting data from multiple sensors. Zookeeper is based on the original Google Chubby and there are several projects extending Zookeeper such as the Giraffe system. JGroups is less commonly used and is very different; it builds secure multicast messaging with a variety of transport mechanisms. 16 HPC-ABDS Stack Layers 3 and 4 Layer 3) Security & Privacy: InCommon, OpenStack Keystone, LDAP, Sentry, Sqrrl Security & Privacy is of course a huge area present implicitly or explicitly in all applications. It covers authentication and authorization of users and the security of running systems. In the Internet there are many authentication systems with sites often allowing you to use Facebook, Microsoft , Google etc. credentials. InCommon, operated by Internet2, federates research and higher education institutions, in the United States with identity management and related services. LDAP is a simple database (key-value) forming a set of distributed directories recording properties of users and resources according to X.500 standard. It allows secure management of systems. OpenStack Keystone is a role-based authorization

and authentication environment to be used in OpenStack private clouds. Sqrrl comes from a startup company spun off the US National Security Agency. It focusses on analyzing big data using Accumulo (layer 11B) to identify security issues such as cybersecurity incidents or suspicious data linkages. Layer 4) Monitoring: Ambari, Ganglia, Nagios, Inca Here Apache Ambari is aimed at installing and monitoring Hadoop systems and there are related tools at layers 6 and 15B. Very popular are the similar Nagios and Ganglia, which are system monitors with ability to gather metrics and produce alerts for a wide range of applications. Inca is a higher layer system allowing user reporting of performance of any sub system. Essentially all deployed systems use monitoring but 17 most users do not add custom reporting. HPC-ABDS Stack Layer 5 Layer 5) IaaS Management from HPC to hypervisors: Xen, KVM, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, CoreOS, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public (general access) Clouds, Google Cloud DNS, Amazon Route 53 Technologies at layer 5 underlie all applications although they may not be apparent to users. Layer 5 includes 4 major hypervisors Xen, KVM, Hyper-V and VirtualBox and the alternative virtualization approach through Linux containers OpenVZ, LXC and

Linux-Vserver. OpenStack, OpenNebula, Eucalyptus, Nimbus and Apache CloudStack are leading virtual machine managers with OpenStack most popular in US and OpenNebula in Europe (for researchers). These systems manage many aspects of virtual machines including computing, security, clustering, storage and networking; in particular OpenStack has an impressive number of subprojects (16 in March 2015). As a special case there is bare-metal i.e. the null hypervisor, which is discussed at layer 6. The DevOps (layer 6) technology Docker is playing an increasing role as a Linux container. CoreOS is a recent lightweight version of Linux customized for containers and Docker in particular. The public clouds Amazon, Azure and Google have their own solution and it is possible to move machine images between these different environments. 18 HPC-ABDS Stack Layer 6 Layer 6) DevOps: Docker, Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, Buildstep, Gitreceive This layer describes technologies and approaches that automate the deployment, installation and life-cycle of software systems and underlies software-defined systems. At Indiana University,

we integrate tools together in Cloudmesh Libcloud, Cobbler (becoming OpenStack Ironic), Chef, Docker, Slurm, Ansible, Puppet and Celery. We saw the container support system Docker earlier in layer 5. Puppet, Chef, Ansible and SaltStack are leading configuration managers allowing software and their features to be specified. Juju and OpenStack Heat extend this to systems or virtual clusters of multiple different software components. Cobbler, Xcat, Razor, Ubuntu MaaS, and OpenStack Ironic address bare-metal provisioning and enable IaaS with hypervisors, containers or baremetal. Foreman is a popular general deployment environment while Boto provides from a Python interface, DevOps for all (around 40 March 2015) the different AWS public cloud Platform features. Rocks is a well-regarded cluster deployment system aimed at HPC with software configurations specified in rolls, Cisco Intelligent Automation for Cloud is a commercial offering from Cisco that directly includes networking issues. Tupperware is used by Facebook to deploy containers for their production systems while AWS OpsWorks is new Amazon capability for automating use of AWS. Google Kubernetes focusses on cluster management for Docker while Buildstep and Gitreceive are part of the Dokku application hosting framework (layer 15B) for Docker. 19 HPC-ABDS Stack Layer 7 Layer 7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis This layer has both standards and interoperability libraries for services, compute, virtualization and storage. Libvirt provides common interfaces at the hypervisor level while Libcloud and JClouds provide this at the higher cloud provider level. TOSCA is an interesting DevOps standard specifying the type of system managed by OpenStack Heat. OCCI and CDMI provide cloud

computing and storage interfaces respectively while Whirr provides cloud-neutral interfaces to services. Saga and Genesis come from HPC community and provide standards and their implementations for distributed computing and storage. 20 HPC-ABDS Stack Layer 8 Layer 8) File systems: HDFS, Swift, Amazon S3, Azure Blob, Google Cloud Storage, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS One will use files in most applications but the details may not be visible to the user. Maybe you interact with data at layer of a data management system like iRODS or an Object store (OpenStack Swift or Amazon S3). Most science applications are organized around files; commercial systems at a higher layer. For example originally Facebook directly used a basic distributed file systems (NFS) to store images but in Haystack replaced this with a customized object store which was refined in f4 to accommodate variations in image usage with cold, warm and hot data. HDFS realizes the goal of bring computing to data and has a distributed data store on the same nodes that perform computing; it underlies the Apache Hadoop ecosystem. Openstack Cinder implements the Block store used in Amazon Elastic Block Storage EBS, Azure Files and Google Persistent Storage. This is analogous to disks accessed directly on a traditional non-cloud environment whereas Swift, Amazon S3, Azure Blob and Google Cloud Storage implement backend object stores. Lustre is a major HPC shared cluster file system with Gluster as an alternative. FUSE is a user level file system used by Gluster. Ceph is a distributed file system that projects object, block, and file storage paradigms to the user. GPFS or Global Parallel File System is a IBM parallel file system optimized to support MPI-IO and other high intensity parallel I/O scenarios. GFFS or Global Federated File System comes from Genesis II (layer 7) and

provides a uniform view of a set of distributed file systems. 21 HPC-ABDS Stack Layer 9 Layer 9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs You will certainly need cluster management in your application although often this is provided by the system and not explicit to the user. Yarn from Hadoop is very popular while Mesos from UC Berkeley is similar to Yarn and is also well used. Apache Helix comes from LinkedIn for this area. Llama from Cloudera runs above Yarn and achieves lower latency by switching use of long lived processes. Google and Facebook certainly face job management at a staggering scale and Omega and Corona respectively are proprietary systems along the lines of Yarn and Mesos. Celery is built on RabbitMQ and supports the master-worker model in a similar fashion to Azure worker role with Azure queues. Slurm is a basic HPC system as are Moab, Torque, SGE, OpenPBS while Condor also well known for scheduling of Grid applications. Many systems are in fact collections of clusters as in data centers or grids. These require management and scheduling across many clusters; the latter is termed meta-scheduling. These are addressed by the Globus Toolkit and Pilot jobs which originated in the Grid computing community 22

HPC-ABDS Stack Layer 10 Layer 10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop BitTorrent was famous 10 years ago (2004) for accounting one third of all Internet traffic; this is dropped a factor of 10 in 2014 but still an important Peer to Peer (file sharing) protocol. Simple HTTP protocols are typically used for small data transfers while the largest one might even use the Fedex/UPS solution of transporting disks between sites. SSH and FTP are old well established Internet protocols underlying simple data transfer. Apache Flume is aimed at transporting log data and Apache Sqoop at interfacing distributed data to Hadoop. Globus Online or GridFTP is dominant and successful system for the HPC community. Data transport is often not highlighted as it runs under the covers but is often quoted as a major bottleneck. 23 HPC-ABDS Stack Layer 11A Layer 11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet The data management layer 11 is a critical area for nearly all applications as it captures areas of file, object, NoSQL and SQL data management. The many entries in area testify to variety of problems (graphs, tables, documents, objects) and importance of efficient solution. Just a little while ago, this area was dominated by SQL databases and file managers. We divide this layer into 3 subparts; management and data structures for file

in 11A; the cloud NoSQL systems in 11B and the traditional SQL systems in layer 11C, which also includes the recent crop of NewSQL systems that overlap with layer 15A. It is remarkable that the Apache stack does not address file management (as object stores are used instead of file systems) and the HPC system iRODS is major system to manage files and their metadata. This layer also includes important file (data) formats. NetCDF and CDF are old but still well used formats supporting array data, as is HDF or Hierarchical Data Format. The earth science domain uses OPeNDAP and the astronomers FITS. RCFile (Row Column File) and ORC are new formats introduced with Hive (layer 15A) while Parquet based on Google Dremel is used for column storage in many major systems in layers 14A and 15A. 24 HPC-ABDS Stack Layer 11B

Layer 11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, Berkeley DB, Azure Table, Amazon Dynamo, Google DataStore, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, HBase, Google Bigtable, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, Yarcdata, AllegroGraph, Facebook Tao, Titan:db, Jena, Sesame NoSQL systems can be divided into six important styles: Tools, Key-value stores, Document-based stores, Columnbased store, Graph-based stores and Triple stores. Tools include Apache Lucene providing information-retrieval; Apache Solr uses Lucene to build a fast search engine while Apache Solandra adds Cassandra as a backend to Solr. Key-value stores have a hash table of keys and values and include Voldemort from LinkeIn; Riak uses Solr for search and is based on Amazon Dynamo; Berkeley DB comes from Oracle; Azure Table, Amazon Dynamo, and Google DataStore are the dominant public cloud NoSQL key-value stores. Document-based stores manage documents made up of tagged elements and include MongoDB which is best known system in this class. Espresso comes from LinkedIn and uses Helix (layer 9); Apache CouchDB has a variant Couchbase that adds caching (memcached) features; IBM Cloudant is a key part of IBMs cloud offering. Column-based stores have data elements that just contain data from one column as pioneered by Google Bigtable which inspires Apache Hbase; Google Megastore and Spanner build on Bigtable to provide capabilities that interpolate between NoSQL and SQL and can get scalability of NoSQL and ease of use of SQL; Apache Cassandra comes from Facebook; Apache Accumulo is also popular and RYA builds a triple store on top of it; Sqrrl is built on top of Accumulo to provide graph capabilities and security applications. Graph-based Stores: Neo4j is most popular graph database; Yarcdata Urika is supported by Cray shared memory machines and allows SPARQL queries as does AllegroGraph, which is written in Lisp and is integrated with Solr and MongoDB; Facebook TAO (The Associations and Objects) supports their specific problems with massive scaling; Titan:db is an interesting graph database and integrates with Cassandra, HBase, BerkeleyDB, TinkerPop graph stack (layer 16), Hadoop (layer 14A), ElasticSearch (layer 16), Solr and Lucene (layer 11B) Triple stores: The last category is a special case of the graph database specialized to the triples (typically resource, attribute and attribute value) that one gets in the RDF approach. Apache Jena and Sesame support storage and queries including those in SPARQL.

25 HPC-ABDS Stack Layer 11C Layer 11C) SQL/NewSQL: Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, Galera Cluster, SciDB, Rasdaman, Apache Derby, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB Layer 11C only lists a few of the traditional relational databases but includes the NewSQL area, which is also seen in systems at layer 15A. NewSQL combines a rigorous SQL interface with the scalability of MapReduce style infrastructure. Oracle, IBM DB2 and Microsoft SQL Server are of course major commercial databases but the amazing early discovery of cloud computing was that their architecture, optimized for transaction processing, was not the best for many cloud applications. Traditional databases are still used with Apache Derby, SQLite, MySQL, and PostgreSQL being important low-end open source systems. Galera Cluster is one of several examples of a replicated parallel database built on MySQL. SciDB and Rasdaman stress another idea; good support for the array data structures we introduced in layer 11A. Google Cloud SQL, Azure SQL, Amazon RDS, are the public cloud traditional SQL engines with Azure SQL building on Microsoft SQL server. N1QL illustrates an important trend and is designed to add SQL queries to the NoSQL system Couchbase. Google F1 illustrates the NewSQL concept building a quality SQL system on the Spanner system described in layer 11B. IBM dashDB similarily offers warehouse capabilities built on top of the NoSQL Cloudant which is a derivative of CouchDB (again layer 11B). BlinkDB is a research database exploring sampling to speed up queries on large datasets. 26

HPC-ABDS Stack Layer 12 Layer 12) In-memory databases&caches: Gora, Memcached, Redis, Hazelcast, Ehcache, Infinispan / Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC / Extraction Tools: UIMA, Tika This layer represents another important area addressing several important capabilities. Firstly Memcached (best known and used by GAE), Redis (an in-memory key value store), Hazelcast, Ehcache, Infinispan enable caching to put as much processing as possible in memory. This is an important optimization with Gartner highlighting in several recent hype charts with In-Memory database management systems and Analytics. UIMA and Tika are conversion tools with former well known from its use by Jeopardy winning IBM Watson system. Gora supports generation of general object data structures from NoSQL. Hibernate, OpenJPA, EclipseLink and DataNucleus are tools for persisting Java in-memory data to relational databases. 27 HPC-ABDS Stack Layer 13-1 Layer 13) Inter process communication Collectives, point-to-point, publishsubscribe, MPI: MPI, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Amazon SNS and Lambda, Google Pub Sub, Azure Queues and Event Hubs, This layer describes the different communication models used by the systems in layers 14 and 15) below. One has communication between the processes in parallel computing and communication between data sources and filters. There are important trade-offs between performance, fault tolerance, and flexibility. There are also differences that depend on application structure and hardware. MPI from

the HPC domain, has very high performance and has been optimized for different network structures while its use is well understood across a broad range of parallel algorithms. Data is streamed between nodes (ports) with latencies that can be as low as a microsecond. This contrasts with disk access with latencies of 10 milliseconds and event brokers of around a millisecond corresponding to the significant software supporting messaging systems. Hadoop uses disks to store data in between map and reduce stages, which removes synchronization overheads and gives excellent fault tolerance at cost of highest latency. Harp brings MPI performance to Hadoop with a plugin. 28 HPC-ABDS Stack Layer 13-2 There are several excellent publish-subscribe messaging systems that support publishers posting messages to named topics and subscribers requesting notification of arrival of messages at topics. The systems differ in message protocols, API, richness of topic naming and fault tolerance including message delivery guarantees. Apache has Kafka from LinkedIn with strong fault tolerance, ActiveMQ and QPid. RabbitMQ and NaradaBrokering have similar good performance to ActiveMQ. Kestrel from Twitter is simple and fast while Netty is built around Java NIO and can support protocols like UDP which are useful for messaging with media streams as well as HTTP. ZeroMQ provides fast inter-process communication with software multicast. Message queuing is well supported in commercial clouds but with different software environments; Amazon Simple Notification Service, Google Pub-Sub, Azure queues or service-bus queues. Amazon offers a new event based computing model Lambda while Azure has Event Hubs built on the service bus to support Azure streaming analytics in layer 14B. There are important messaging standards supported by many of these systems. JMS or

Java Message Service is a software API that does not specify message nature. AMQP (Advanced Message Queuing Protocol) is best known message protocol while STOMP (Simple Text Oriented Messaging Protocol) is a particularly simple and HTTP in style without supporting topics. MQTT (Message Queue Telemetry Transport) comes from the Internet of Things domain and could grow in importance for machine to machine communication. 29 HPC-ABDS Stack Layer 14A Layer 14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, Stratosphere (Apache Flink), Reef, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi Most applications use capabilities at layers 14 which we divide into the classic or batch programming in 14A and the streaming area 14B that has grown in attention recently. This layer implements the programming models shown in Fig. 3. Layer 14B supports the Map-Streaming model which is category 5 of Fig. 3. Layer 14A focusses on the first 4 categories of this figure while category 6 is included because shared memory is important in some graph algorithms. Hadoop in some sense created this layer although its programming had been known for a long time but not articulated as brilliantly as was done by Google for MapReduce. Hadoop covers categories 1 and 2 of Fig. 3 and with the Harp plug-in categories 3 and 4. Other entries here have substantial overlap with Spark and Twister (no longer developed) being pioneers for Category 3 and Pregel with an open source version Giraph supporting category 4. Pegasus also supports graph computations in category 4. Hama is an early Apache project with capabilities similar to MPI with Apache Flink and Reef newcomers supporting all of categories 1-4. Flink supports multiple data APIs including graphs with its Spargel subsystem. Reef is optimized to support machine learning. Ligra and GraphChi are shared memory graph

processing frameworks (category 6 of Fig. 3) with GraphChi supporting disk-based storage of graph. 30 HPC-ABDS Stack Layer 14B Layer 14B) Streaming: Storm, S4, Samza, Granules, Google MillWheel, Amazon Kinesis, LinkedIn Databus, Facebook Puma/Ptail/Scribe/ODS, Azure Stream Analytics Figure 3, category 5, sketches the programming model at this layer while figure 4 gives it more detail for Storm, which being open source and popular, is the best archetype for this layer. There is some mechanism to gather and buffer data, which for Apache Storm is a publish-subscribe environment such as Kafka, RabbitMQ or ActiveMQ. Then there is a processing phase delivered in Storm as bolts implemented as dataflow but which can invoke parallel processing such as Hadoop. The bolts then deliver their results to a similar buffered environment for further processing or storage. Apache has three similar environments Storm, Samza and S4 which were donated by Twitter, LinkedIn and Yahoo respectively. S4 features a built-in key-value store. Granules from Colorado State University has a similar model to Storm. The other entries are commercial systems with LinkedIn Databus and Facebook Puma/ Ptail/ Scribe/ ODS supporting internal operations of these companies for examining in near real-time the logging and response of their web sites. Kinesis, MillWheel and Azure Stream Analytics are services offered to customers of the Amazon, Google and Microsoft clouds respectively. Interestingly none of these uses the Apache functions (Storm, Samza, S4) although these run well on commercial clouds. 31 HPC-ABDS Stack Layer 15A Layer 15A) High layer Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Pig, Sawzall, Google Cloud DataFlow, Summingbird Components at this layer are not required but are very interesting and we can expect great progress to come both in improving them and using them. There are several SQL on MapReduce software systems with Apache Hive (originally from Facebook) as best known. Hive illustrates key idea that MapReduce supports parallel database actions and so SQL systems built on Hadoop can outperform traditional databases due to scalable parallelism [OSUcite]. Presto is another Facebook system supporting interactive queries rather than Hives batch model. Apache HCatalog is a table and storage management layer for Hive. Other SQL on MapReduce systems include Shark (using Spark not Hadoop), MRQL (using Hama, Spark or Hadoop for complex analytics as queries), Tajo (warehouse) Impala (Cloudera), HANA (real-time SQL from SAP), HadoopDB (true SQL on Hadoop), and PolyBase (Microsoft SQL server on Hadoop). A different approach is Kite which is a toolkit to build applications on Hadoop. Note layer 11C) lists production SQL data bases including some NewSQL systems with architectures similar to software described here. Google Dremel supported SQL like queries on top of a general data store including unstructured and NoSQL systems. This capability is now offered by Apache Drill, Google BigQuery and Amazon Redshift. Apache Phoenix exposes HBase functionality through a SQL interface that is highly optimized. Pig and Sawzall offer data parallel programming models on top of Hadoop (Pig) or Google

MapReduce (Sawzall). Pig shows particular promise as fully open source with DataFu providing analytics libraries on top of Pig. Summingbird builds on Hadoop and supports both batch (Cascading) and streaming (Storm) applications. Google Cloud Dataflow provides similar integrated batch and streaming interfaces. 32 HPC-ABDS Stack Layer 15B Layer 15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere This layer is exemplified by Google App Engine GAE and Azure where frameworks are called Platform as a Service PaaS but now there are many cloud integration/development environments. The GAE style framework offers excellent cloud hosting for a few key languages (often PHP, JavaScript Node.js, Python, Java and related languages) and technologies (covering SQL, NoSQL, Web serving, memcached, queues) with hosting involving elasticity, monitoring and management. Related capabilities are seen in AWS Elastic Beanstalk, AppScale, appfog, OpenShift (Red Hat), Heroku (part of Salesforce) while Aerobatic is specialized to single web page applications. Pivotal offers a base open source framework Cloud Foundry that is used by IBM in their BlueMix PaaS but is also offered separately as Pivotal Cloud Foundry and Web Services. The major vendors AWS, Microsoft, Google and IBM mix support of general open source software like Hadoop with proprietary technologies like BigTable (Google), Azure Machine Learning, Dynamo (AWS) and Cloudant (IBM). Jelastic is a Java PaaS that can be hosted on the major IaaS offerings including Docker. Similarly Stackato from ActiveState can be run on Docker and Cloud Foundry. CloudBees recently switched from general PaaS to focus on Jenkins continuous integration services. Engine Yard and Cloud Control focus on

managing (scaling, monitoring) the entire application. The latter recently purchased the dotCloud PaaS from Docker, which company was originally called dotCloud but changed name and focus due to huge success of their support product Docker. Dokku is a bash script providing Docker Heroku compatibility based on tools Gitreceive and Buildstep. This layer also includes toolsets that are used to build web environments OSGi, HUBzero and OODT. Agave comes from the iPlant Collaborative and offers Science as a Service building on iPlant Atmosphere cloud services with access to 600 plant biology packages. Software at this layer has overlaps with layer 16, application libraries, and layer 17, workflow and science gateways. 33 HPC-ABDS Stack Layer 16-1 Layer 16) Application and Analytics: Mahout , MLlib , MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, Scalapack, PetSc, Azure Machine Learning, Google Prediction API, Google Translation API, mlpy, scikit-learn, PyBrain, CompLearn, Caffe, Torch, Theano, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Google Fusion Tables, CINET, NWB, Elasticsearch This is the business logic of application and where you find machine learning algorithms like clustering, recommender engines and deep learning. Mahout, MLlib, MLbase are in Apache for Hadoop and Spark processing while the less well-known DataFu provides machine learning libraries on top of Apache Pig. R with a custom scripting language, is a key library from statistics community with many domain specific libraries such as Bioconductor, which has 936 entries in version 3.0. Image processing (ImageJ) in Java and High Performance Computing HPC (Scalapack and PetSc) in C++/Fortran also have rich libraries. pbdR uses Scalapack to add high performance parallelism to R which is well known for not achieving high performance often necessary for Big Data. Note R without modification will address

the important pleasingly parallel sector with scaling number of independent R computations. Azure Machine Learning, Google Prediction API, and Google Translation API represent machine learning offered as a service in the cloud. 34 HPC-ABDS Stack Layer 16-2 The array syntaxes supported in Python and Matlab make them like R attractive for analytics libraries. mlpy, scikit-learn, PyBrain, and CompLearn are Python machinelearning libraries while Caffe (C++, Python, Matlab), Torch (custom scripting), DL4J (Java) and Theano (Python) support the deep learning area with has growing importance and comes with GPUs as key compute engine. H2O is a framework using R and Java that supports a few important machine learning algorithms including deep learning, K-means, and Random Forest and runs on Hadoop and HDFS. IBM Watson applies advanced natural language processing, information retrieval, knowledge representation, automated reasoning, and machine learning technologies to answer questions. It is being customized in hundreds of areas following its success in the game Jeopardy There are several important graph libraries including Oracle PGX, GraphLab (CMU), GraphBuilder (Intel), GraphX (Spark based), and TinkerPop (open source group). IBM System G has both graph and Network Science libraries and there are also libraries associated with graph frameworks (Giraph and Pegasus) at Layer 14A. CINET and Network Workbench focus on Network science including both graph construction and analytics. Google Fusion Tables focus on analytics for Tables including map displays. ElasticSearch combines search (second only to Solr in popularity) with an analytics engine Kibana. You will nearly always need software at this layer 35 HPC-ABDS Stack Layer 17-1 Layer 17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA) This layer implements orchestration and integration of the different parts of a job. This integration is typically specified by a directed data-flow graph and a simple but important is a pipeline of the different stages of a job. Essentially all problems involve the linkage of multiple parts and the terms orchestration or workflow are used to describe the linkage of these different parts. The parts (components) involved are large and very different from the much smaller parts involved in parallel computing involving MPI or Hadoop. On general principles, communication costs decrease in comparison to computing costs as problem sizes increase. So orchestration systems are not subject to the intense performance issues we saw in layer 14. Often orchestration involves linking of distributed components. We can contrast PaaS stacks which describe the different services or functions in a single part with orchestration that describes the different parts making up a single application (job). The trend to Software as a Service clouds the distinction as it implies that a single part may be made up of multiple services. The messaging linking services in PaaS is contrasted with dataflow linking parts in orchestration. Note that often the orchestration parts often communicate via disk although faster streaming links are also common. 36 HPC-ABDS Stack Layer 17-2

Orchestration in its current form originated in the Grid and service oriented communities with the early importance of OASIS standard BPEL (Business Process Execution Language) illustrated by ActiveBPEL which was last updated in 2008. BPEL did not emphasize dataflow and was not popular in Grid community. Pegasus, Kepler, and Taverna are perhaps the best known Grid workflow systems with recently Galaxy and BioKepler popular in bioinformatics. The workflow system interface is either visual (link programs as bubbles with data flow) or as an XML or program script. The latter is exemplified by the Swift customized scripting system and the growing use of Python. Apache orchestration of this style is seen in ODE and Airavata with latter coming from Grid research at Indiana University. Recently there has been a new generation of orchestration approaches coming from Cloud computing and covering a variety of approaches. Dryad, Naiad and the recent NiFi support a rich dataflow model which underlies orchestration. These systems tend not to address the strict synchronization needed for parallel computing and that limits their breadth of use. Apache Oozie and Tez link well with Hadoop and are alternatives to high layer programming models like Apache Pig. e-Science Central from Newcastle, the Azure Data Factory and Google Cloud Dataflow focus on the end to end user solution and combine orchestration with well-chosen PaaS features. Google FlumeJava and its open source relative Apache Crunch are sophisticated efficient Java orchestration engines. Cascading, PyCascading and Scalding offer Java, Python and Scala toolkits to support orchestration. One can hope to see comparisons and integrations between these many different systems. 37

Recently Viewed Presentations

  • The major drawbacks to hydroponics

    The major drawbacks to hydroponics

    The major drawbacks to hydroponics Wick system- the biggest draw back of this system is that plants that are large or use large amounts of water may use up the nutrient solution faster than the wick(s) can supply it.
  • 1.1 Silicon Crystal Structure - University of California ...

    1.1 Silicon Crystal Structure - University of California ...

    Diffusion Current Total Current Non-Uniformly-Doped Semiconductor Einstein Relationship between D and m Example: Diffusion Constant Potential Difference due to n(x), p(x) Quasi-Neutrality Approximation If the dopant concentration profile varies gradually with position, then the majority-carrier concentration distribution does not differ...
  • Modernizing Energy Management: Solutions for Data-Driven Energy Efficiency

    Modernizing Energy Management: Solutions for Data-Driven Energy Efficiency

    Ron will provide a brief history of the evolution of energy management, from dot matrix printouts to present-day real-time monitoring-based commissioning systems, as well as examples of actionable insights, case studies at varying scales, and tools and services available to...
  • We Beat The Street - Woodbridge Township School District

    We Beat The Street - Woodbridge Township School District

    Chapter 8. George and Sampson meet for the first time. They are taking a test to get in the University of High School. Sampson has concerns about attending the school, in fear he will not be top of the class...
  • Passive and Active Immunization Nono Mkhize, PhD National

    Passive and Active Immunization Nono Mkhize, PhD National

    The second kind of protection is adaptive (or active) immunity, which develops throughout our lives. Adaptive immunity involves the lymphocytes such as B cells and T cells and develops as people are exposed to diseases or immunized against diseases through...
  • Resident Assessment Instrument - Tennessee

    Resident Assessment Instrument - Tennessee

    If unscheduled assessment due in the assessment window for a scheduled assessment, must combine by setting ARD of the scheduled assessment for the same day the unscheduled assessment is required. A scheduled assessment cannot occur after an unscheduled assessment in...
  • Diapositive 1 - SIGCOMM

    Diapositive 1 - SIGCOMM

    Move to Content-oriented Network. Traffic is already content-oriented. CDN, overlays, P2P. Users/applications care "what to receive" They don't care "from whom" Host based communication model is getting ''outdated'' Naturally shift from . Multimedia traffic 90% ??
  • Literary and Rhetorical Terms - Thomas County School District

    Literary and Rhetorical Terms - Thomas County School District

    poetic and rhetorical device in which normally unassociated ideas, words, or phrases are placed next to one another, creating an effect of surprise and wit. metaphor a figure of speech that makes a comparison between two unlike things without the...