Getting Started with The Globus Toolkit®


The GriPhyN Virtual Data System
Ian Foster for the VDS team

Science as Workflow: E.g., Galaxy Cluster Search
  [Figure: workflow DAG over Sloan Data, and a plot of the galaxy cluster size distribution (Number of Clusters against Number of Galaxies)]
  Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago

Requirements
  Express complex multi-step workflows
    Perhaps 100,000s of individual tasks
  Operate on heterogeneous distributed data
    Different formats & access protocols
  Execute workflows reliably
    Despite diverse failure conditions
  Harness many computing resources
    Parallel computers &/or distributed Grids
  Enable reuse of data & workflows
    Discovery & composition
  Support many users, workflows, resources
    Policy specification & enforcement

Virtual Data System
  VDL, XDTM
    Express complex multi-step workflows
      Perhaps 100,000s of individual tasks
    Operate on heterogeneous distributed data
      Different formats & access protocols
  Pegasus, DAGman, Globus
    Execute workflows reliably & efficiently
      Despite diverse failure conditions
    Harness many computing resources
      Parallel computers &/or distributed Grids
  VDC
    Enable reuse of data & workflows
      Discovery & composition
  TBD
    Support many users, workflows, resources
      Policy specification & enforcement

Virtual Data System
  [Figure: architecture diagram. Labels: Workflow spec; VDL Program; Virtual Data catalog; Virtual Data Workflow Generator; Abstract workflow; Create Execution Plan; Statically Partitioned DAG; Dynamically Planned DAG; Local planner; Grid Workflow Execution; DAGman DAG; DAGman & Condor-G; Job Planner; Job Cleanup]

Genome Analysis & DB Update (GADU)
  600-1000+ CPUs

The Rest of the Talk

  VDL, XDTM
    Express complex multi-step workflows
      Perhaps 100,000s of individual tasks
    Operate on heterogeneous distributed data
      Different formats & access protocols
  Pegasus, DAGman, Globus (Ewa)
    Execute workflows reliably & efficiently
      Despite diverse failure conditions
    Harness many computing resources
      Parallel computers &/or distributed Grids
  VDC
    Enable reuse of data & workflows
      Discovery & composition
  TBD
    Support many users, workflows, resources
      Policy specification & enforcement

Messy Scientific Data
  Diverse storage formats & access protocols
    Logically identical dataset can be stored in text file (e.g. CSV), binary file, spreadsheet
    Data available from filesystem, database, HTTP, WebDAV, etc.
  Metadata encoded in directory & file names
    E.g.: fMRI volume is composed of an image file & header file with same prefix
  Format dependency hinders program and workflow reuse

But... Data is Often Logically Structured
  Scientific data often maintain hierarchical structure
  A common practice is to select a set of data items and apply a transformation to each individual item
  A nested approach of such iterations could scale up to millions of objects

Introducing a Typing System
  Describe logical data structures as types & physical representations as mappings
  Define procedures in terms of typed datasets & apply procedures to different physical data
  Compose workflows from typed procedures
  Benefits
    Type checking
    Dataset selection and iteration
    Discovery by types
    Dynamic binding
    Type conversion

XDTM (Moreau, Zhao, Wilde, Foster)

  XML Dataset Typing and Mapping
  Separates logical structure from physical representations
  Logical structure described by XML Schema
    Primitive scalar types: int, float, string, date
    Complex types (structs and arrays)
  Mapping descriptor
    How logical elements map to physical
    External parameters (e.g. location)
  XPath for dataset selection

Mapping
  Define a common mapping interface
  Data providers implement the interface
    Initialize, read, create, write, close
    Responsible for data access details
  XView maintains cached logical datasets
  [Figure: VDS, XViewMgr, XView, Mapper, Data Source]

Use Case: Functional MRI
  Logical Structure
    DBIC Archive
      Study #1
        Group #1
          Subject #1
            Anatomy
              high-res volume
            Functional Runs
              run #1
                volume #001
                ...
                volume #275
              ...
              run #5
                volume #001
                ...
              snrun #...
        Group #5
        ...
      Study #...
  Physical Representation
    DBIC Archive
      Study_2004.0521.hgd
        Group_1
          Subject_2004.e024
            volume_anat.img
            volume_anat.hdr
            bold1_001.img
            bold1_001.hdr
            ...
            bold1_275.img
            bold1_275.hdr
            ...
            bold5_001.img
            ...
            snrbold*_*
            air*
            ...
        Group_5
        ...
      Study ...

Type Definitions in VDL
  type Image {};
  type Header {};
  type Volume { Image img; Header hdr; }
  type Anat Volume;
  type Warp {};
  type NormAnat { Anat aVol; Warp aWarp; Volume nHires; }
  type Run { Volume v[ ]; }
  type Subject { Anat anat; Run run[ ]; Run snrun[ ]; }
  type Group { Subject s[ ]; }
  type Study { Group g[ ]; }

Part of fMRI AIRSN (Spatial Normalization) Workflow
  [Figure: workflow diagram]
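The idea of separating logical types from their physical mapping can be sketched in a few lines of Python. This is only an illustration, not the XDTM or VDS mapper API: the class names mirror the VDL types Volume and Run, while the function map_runs and the file-name pattern (taken from the bold1_001.img / bold1_001.hdr listing) are assumptions made for the example.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Volume:
    """Logical fMRI volume: an image file plus the header sharing its prefix."""
    img: str
    hdr: str


@dataclass
class Run:
    """Logical run: an ordered array of volumes, like `Run { Volume v[ ]; }`."""
    v: List[Volume] = field(default_factory=list)


# Assumed naming rule, read off the example listing: bold<run>_<volume>.img/.hdr
_BOLD = re.compile(r"^bold(\d+)_(\d+)\.(img|hdr)$")


def map_runs(filenames: List[str]) -> Dict[int, Run]:
    """Hypothetical mapper: group physical file names into logical runs."""
    parts: Dict[Tuple[int, int], Dict[str, str]] = {}
    for name in filenames:
        m = _BOLD.match(name)
        if m is None:
            continue  # not a functional volume (e.g. volume_anat.img)
        key = (int(m.group(1)), int(m.group(2)))
        parts.setdefault(key, {})[m.group(3)] = name

    runs: Dict[int, Run] = {}
    for run_no, vol_no in sorted(parts):
        pair = parts[(run_no, vol_no)]
        if "img" not in pair or "hdr" not in pair:
            raise ValueError(f"incomplete volume bold{run_no}_{vol_no:03d}")
        runs.setdefault(run_no, Run()).v.append(Volume(pair["img"], pair["hdr"]))
    return runs


files = [
    "volume_anat.img", "volume_anat.hdr",
    "bold1_002.hdr", "bold1_001.img", "bold1_001.hdr", "bold1_002.img",
    "bold5_001.img", "bold5_001.hdr",
]
runs = map_runs(files)
print(sorted(runs))       # run numbers discovered in the listing
print(runs[1].v[0])       # first logical volume of run 1
```

A procedure written against Run and Volume never sees the file-name convention, so a different archive layout would, in this sketch, only need a different mapper.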

Type Definitions in XML Schema
  [Figure: XML Schema listing]

Procedure Definition in VDL
  (Run snr) functional( Run r, NormAnat a, Air shrink ) {
    Run yroRun = reorientRun( r, "y" );
    Run roRun = reorientRun( yroRun, "x" );
    Volume std = roRun[0];
    Run rndr = random_select( roRun, .1 ); //10% sample
    AirVector rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, [81,3,3] );
    Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );
    Volume meanRand = softmean( reslicedRndr, "y", null );
    Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, [81,3,3] );
    Volume mnQA = reslice( meanRand, mnQAAir, "o", "k" );
    Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
    Run nr = reslice_warp_run( boldNormWarp, roRun );
    Volume meanAll = strictmean( nr, "y", null );
    Volume boldMask = binarize( meanAll, "y" );
    snr = gsmoothRun( nr, boldMask, 6, 6, 6 );
  }

Dataset Iteration
  Functional analysis expressed in typed datasets
  Iterate over each volume in a run
  [Figure: workflow steps reorientRun, reorientRun, random_select, alignlinearRun, resliceRun, softmean, alignlinear, combinewarp, reslice_warpRun, strictmean, binarize, gsmoothRun]

Expanded Execution Plan
  Datasets dynamically instantiated from data sources by mappers
  [Figure: expanded DAG of numbered task instances such as reorient/01 and gsmooth/50]
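How a procedure applied to a whole run might expand into one task per volume can be sketched as follows. This is a minimal illustration under stated assumptions, not the VDS planner: the function expand_run, the step names, and the node numbering scheme are hypothetical, chosen only to resemble the step/NN labels in the expanded plan.

```python
from typing import Dict, List, Optional, Tuple


def expand_run(volumes: List[str],
               steps: List[str]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Expand per-volume steps over a run into task nodes and dependency edges.

    Each step is applied to every volume independently; a task depends only
    on the task that produced the same volume in the previous step.
    """
    nodes: List[str] = []
    edges: List[Tuple[str, str]] = []
    last: Dict[str, Optional[str]] = {vol: None for vol in volumes}
    counter = 0
    for step in steps:
        for vol in volumes:
            counter += 1
            node = f"{step}/{counter:02d}"  # hypothetical label format
            nodes.append(node)
            previous = last[vol]
            if previous is not None:
                edges.append((previous, node))
            last[vol] = node
    return nodes, edges


# Two reorientation passes over a three-volume run (volume names are made up).
nodes, edges = expand_run(
    ["volume001", "volume002", "volume003"],
    ["reorient_y", "reorient_x"],
)
for node in nodes:
    print(node)
for src, dst in edges:
    print(f"{src} -> {dst}")
```

Because tasks for different volumes share no edges, every task within a step could run in parallel, which is what lets a run of a few hundred volumes fan out across many CPUs.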

Functional MRI Execution
  [Figure]

Code Size Comparison
  Lines of code with different workflow encodings

  Workflow     Script   Generator   VDL
  GENATLAS1        49          72     6
  GENATLAS2        97         135    10
  FILM1            63         134    17
  FEAT             84         191    13
  AIRSN           215        ~400    37

The Rest of the Talk
  VDL, XDTM
    Express complex multi-step workflows
      Perhaps 100,000s of individual tasks
    Operate on heterogeneous distributed data
      Different formats & access protocols
  Pegasus, DAGman, Globus
    Execute workflows reliably & efficiently
      Despite diverse failure conditions
    Harness many computing resources
      Parallel computers &/or distributed Grids
  VDC
    Enable reuse of data & workflows
      Discovery & composition
  TBD
    Support many users, workflows, resources
      Policy specification & enforcement

Virtual Data Schema
  [Figure: entity-relationship diagram. Entities: FormalArg, ActualArg, Procedure, Call, Invocation, Workflow, Annotation, Dataset. Relationships: binds, passes, calls, uses, describes, references, includes, executes. Attribute labels visible in the diagram: argname, type, direction, value; nmspace, name, version; wfid, fromDV, toDV; object, pred, type/val, user, date; dvID, host, start, duration, exitcode, stats]

fMRI Virtual Data Queries
  Which transformations can process a subject image?
    Q: xsearchvdc -q tr_meta dataType subject_image input
    A: fMRIDC.AIR::align_warp
  List anonymized subject-images for young subjects:
    Q: xsearchvdc -q lfn_meta dataType subject_image privacy anonymized subjectType young
    A: 3472-4_anonymized.img
  Show files that were derived from patient image 3472-3:
    Q: xsearchvdc -q lfn_tree 3472-3_anonymized.img
    A: 3472-3_anonymized.img

       3472-3_anonymized.sliced.hdr
       atlas.hdr
       atlas.img
       atlas_z.jpg
       3472-3_anonymized.sliced.img

Provenance for ATLAS DC2 (High Energy Physics)
  How much compute time was delivered?
    | years| mon  | year |
    +------+------+------+
    | .45  | 6    | 2004 |
    | 20   | 7    | 2004 |
    | 34   | 8    | 2004 |
    | 40   | 9    | 2004 |
    | 15   | 10   | 2004 |
    | 15   | 11   | 2004 |
    | 8.9  | 12   | 2004 |
    +------+------+------+
  Selected statistics for one of these jobs:
    start: 2004-09-30 18:33:56
    duration: 76103.33
    pid: 6123
    exitcode: 0
    args: 8.0.5 JobTransforms-08-00-05-09/share/dc2.g4sim.filter.trf CPE_6785_556 ... -6 6 2000 4000 8923 dc2_B4_filter_frag.txt
    utime: 75335.86
    stime: 28.88
    minflt: 862341
    majflt: 96386
  Which Linux kernel releases were used?
  How many jobs were run on a Linux 2.4.28 Kernel?

LIGO Inspiral Search Application
  Describe
  Inspiral workflow application is the work of Duncan Brown, Caltech, Scott Koranda, UW Milwaukee, and the LSC Inspiral group

FOAM: Fast Ocean/Atmosphere Model
  250-Member Ensemble Run on TeraGrid under VDS
  [Figure: per-member workflow. For each ensemble member 1, 2, ..., N: Remote Directory Creation; FOAM run; Atmos Postprocessing, Ocean Postprocessing and Coupl Postprocessing; then results transferred to archival storage]
  Work of: Rob Jacob (FOAM), Veronica Nefedova (workflow design and execution)

FOAM and VDS
  Climate Supercomputer and Grad student: 160 ensemble members in 75 days
  TeraGrid and VDS: 250 ensemble members in 4 days
  Visualization courtesy Pat Behling and Yun Liu, UW Madison

Summary: Science as Workflow
  [Figure labels: Executed, Executing, Executable, Not yet executable; What I Did, What I Am Doing, What I Want to Do; Query; Execution environment; Schedule; Edit]
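The figures quoted on the ATLAS DC2 provenance slide and the FOAM slide can be put side by side with some simple arithmetic. The sketch below only restates numbers from the talk; it assumes that the job's duration, utime and stime are in seconds, and that 160 members in 75 days refers to the supercomputer setup while 250 members in 4 days refers to TeraGrid with VDS.

```python
# Compute time delivered per month in 2004 (the "years" column), copied
# from the ATLAS DC2 table.
years_by_month = {6: 0.45, 7: 20, 8: 34, 9: 40, 10: 15, 11: 15, 12: 8.9}
total_years = sum(years_by_month.values())

# Statistics of the single job shown on the slide (assumed to be seconds).
duration, utime, stime = 76103.33, 75335.86, 28.88
cpu_fraction = (utime + stime) / duration  # share of wall time spent on CPU

# FOAM ensemble throughput, in ensemble members per day.
# Assumption: first figure = supercomputer, second = TeraGrid and VDS.
supercomputer_rate = 160 / 75
teragrid_rate = 250 / 4
speedup = teragrid_rate / supercomputer_rate

print(f"compute time delivered, Jun-Dec 2004: {total_years:.2f} years")
print(f"CPU share of wall time for the sample job: {cpu_fraction:.3f}")
print(f"FOAM throughput: {supercomputer_rate:.2f} vs {teragrid_rate:.2f} "
      f"members/day ({speedup:.1f}x)")
```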

Acknowledgements
  The Virtual Data System group is:
    ISI/USC: Ewa Deelman, Carl Kesselman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, Karan Vahi
    U of Chicago: Ben Clifford, Ian Foster, Mike Wilde, Yong Zhao
  GriPhyN is supported by the NSF
  Many research efforts involved in this work are supported by the US Department of Energy, Office of Science
