Introduction to Research Data Management

Sarah Jones
Digital Curation Centre, Glasgow
Twitter: @sjDCC

Carpentry workshop - Open Research Data, 5 December 2016, Oslo

What will we cover?

1. What is research data management?
2. Considerations and pointers when:
   - Creating data
   - Managing data
   - Sharing data
3. Useful tools and resources

What is RDM?

Image CC-BY-SA by Janneke Staaks

What is Research Data Management?

The active management and appraisal of data over the lifecycle of scholarly and scientific interest.

Lifecycle diagram stages: Create, Document, Use, Store, Share, Preserve

Data management is part of good research practice

What is involved in RDM?

- Data Management Planning
- Data creation
- Annotating / documenting data
- Analysis, use, versioning
- Storage and backup
- Publishing papers and data

- Preparing for deposit
- Archiving and sharing
- Licensing
- Citing

Why manage and share data?

Direct benefits for you
- To make your research easier!
- Stop yourself drowning in irrelevant stuff
- Make sure you can understand and reuse your data again later
- Advance your career: data is growing in significance

Research integrity
- To avoid accusations of fraud or bad science
- Evidence findings and enable validation of research methods
- Meet codes of practice on research conduct
- Many research funders worldwide now require Data Management and Sharing Plans

Potential to share data
- So others can reuse and build on your data
- To gain credit: several studies have shown higher citation rates when data are shared
- For greater visibility, impact and new research collaborations
- Promote innovation and allow research in your field to advance faster

What if this was your laptop?

Why YOU need a Data Management Plan

08/01/why-you-need-a-data-management-plan

Good data management is about making informed decisions

Creating data

Image CC-SA-ND by Bill Dickinson

Data creation tips

- Choose appropriate formats
- Adopt a file naming convention
- Create metadata and documentation as you go
- Ensure consent forms, licences and agreements don't restrict opportunities to share data

Choose appropriate file formats

Different formats are good for different things:
- open, lossless formats are more sustainable, e.g. rtf, xml, tif, wav
- proprietary and/or compressed formats are less preservable but are often in widespread use, e.g. doc, jpg, mp3

One format for analysis, then convert to a standard format

BioformatsConverter batch converts a variety of proprietary microscopy image formats to the Open Microscopy Environment format, OME-TIFF

Data centres may suggest preferred formats for deposit

How will you name your files?

- Keep file and folder names short, but meaningful
- Agree a method for versioning
- Include dates in a set format, e.g. YYYYMMDD
- Avoid using non-alphanumeric characters in file names
- Use hyphens or underscores, not spaces, e.g. day-sheet, day_sheet
- Order the elements in the most appropriate way to retrieve the record

Example from ARM Climate Research Facility
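Separately from the ARM example, the naming tips above can be sketched in a few lines of Python. This is only an illustration: the function name `build_filename`, the element order (project, description, date, version) and the `v01` version style are assumptions, not a convention recommended by the slides.

```python
import re
from datetime import date


def build_filename(project: str, description: str, when: date,
                   version: int, extension: str) -> str:
    """Compose <project>_<description>_<YYYYMMDD>_v<NN>.<ext> (assumed layout)."""

    def clean(part: str) -> str:
        # Replace spaces and any other non-alphanumeric run with a single hyphen
        return re.sub(r"[^A-Za-z0-9]+", "-", part).strip("-").lower()

    stamp = when.strftime("%Y%m%d")  # dates in one set format: YYYYMMDD
    return (f"{clean(project)}_{clean(description)}_{stamp}"
            f"_v{version:02d}.{extension.lower()}")


print(build_filename("Oslo survey", "day sheet", date(2016, 12, 5), 1, "csv"))
# oslo-survey_day-sheet_20161205_v01.csv
```

Putting the date before the version means files for one project and description sort chronologically in a directory listing; a different element order may suit other retrieval needs better.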

What is metadata?

Data about data

Documentation and metadata

Metadata
- Standardised
- Structured
- Machine and human readable

Metadata helps to cite & disambiguate data
Documentation aids reuse

Metadata standards

These can be general, such as Dublin Core, or discipline specific:
- Data Documentation Initiative (DDI) - social science
- Ecological Metadata Language (EML) - ecology
- Flexible Image Transport System (FITS) - astronomy

- Provided in catalogues to aid discoverability
- Structured so search engines can uncover it
- Exposed in machine-readable form, e.g. XML

Dublin Core metadata example
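As a rough illustration of what "exposed in machine-readable form" can look like, the Python sketch below serialises a few fields from this slide's example record as Dublin Core XML using the standard library. The wrapper element name `record` and the function name `to_dublin_core_xml` are assumptions for illustration; real catalogues define their own container formats.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dublin_core_xml(fields):
    """Serialise (element, value) pairs as a flat Dublin Core XML record."""
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(record, encoding="unicode")


fields = [
    ("creator", "Donald Cooper"),
    ("description", "Vanessa Redgrave as Cleopatra"),
    ("date", "1973-08-09"),
    ("type", "Image"),
    ("format", "JPEG"),
]
print(to_dublin_core_xml(fields))
```

Because every value sits in a named, namespaced element, a harvester or search engine can pick out the creator or date without parsing free text.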

Creator:      Donald Cooper                    Role=Photographer
Subject:      Shakespeare, William, 1564-1616, Antony and Cleopatra    [LC]
Description:  Vanessa Redgrave as Cleopatra
Date:         1973-08-09
Type:         Image
Format:       JPEG
Identifier:   4150                             [catalogue no]
Source:       negative no 235
Relation:     Antony and Cleopatra: Thompson/738    IsPartOf
Coverage:     Bankside Globe                   Role=Spatial
Rights:       Donald Cooper

Use metadata standards

Metadata Standards Directory
- Broad, disciplinary listing of standards and tools
- Maintained by RDA group
- http:// metadata-directory

Biosharing
- A portal of data standards, databases, and policies
- Focused on life, environmental and biomedical sciences

Documentation

Can others understand the data? Think about what is needed in order to find, evaluate, understand, and reuse the data.

- Have you documented what you did and how?

- Did you develop code to run analyses? If so, this should be kept and shared too.
- Is it clear what each bit of your dataset means? Make sure the columns/rows are labelled, variable ranges defined, abbreviations explained in data dictionaries

ReadMe files

We recommend that a ReadMe be a plain text file containing the following:
- for each filename, a short description of what data it includes, optionally describing the relationship to the tables, figures, or sections within the accompanying publication
- for tabular data: definitions of column headings and row labels; data codes (including missing data); and measurement units
- any data processing steps, especially if not described in the publication, that may affect interpretation of results
- a description of what associated datasets are stored elsewhere, if applicable
- whom to contact with questions

Managing data

Image tools CC-BY by zzpza
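Before moving on, the ReadMe recommendations above can be turned into a small template generator. The Python below is a minimal sketch under stated assumptions: the function name `render_readme`, the section headings, the dictionary layout and every value in the example (file name, columns, contact address) are hypothetical and only show where each recommended item would go.

```python
def render_readme(files, processing_steps, related_datasets, contact):
    """Build a plain-text ReadMe covering the recommended items."""
    lines = ["FILES"]
    for f in files:
        # One short description per file name
        lines.append(f"{f['name']}: {f['description']}")
        # For tabular data: column definitions and measurement units
        for column, (definition, unit) in f.get("columns", {}).items():
            lines.append(f"  {column}: {definition} [{unit}]")
        if "missing_code" in f:
            lines.append(f"  missing data code: {f['missing_code']}")
    lines += ["", "PROCESSING STEPS"] + [f"- {s}" for s in processing_steps]
    lines += ["", "RELATED DATASETS"] + [f"- {d}" for d in related_datasets]
    lines += ["", "CONTACT", contact]
    return "\n".join(lines) + "\n"


# Hypothetical example values, for illustration only
print(render_readme(
    files=[{
        "name": "temperature_20161205_v01.csv",
        "description": "Hourly air temperature; source for Figure 2",
        "columns": {"time": ("Observation time, UTC", "ISO 8601"),
                    "temp": ("Air temperature", "degrees Celsius")},
        "missing_code": "-999",
    }],
    processing_steps=["Removed duplicate rows before analysis"],
    related_datasets=["Raw sensor logs held in the institutional archive"],
    contact="data-contact@example.org",
))
```

Keeping the output as plain text means the ReadMe stays readable without any particular software, which matches the recommendation above.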

Legal and ethical issues

Be aware of legislation that applies to you:
- Offentlighetsloven (FoI)
- EIR (Environmental Information Regulations)
- Data Protection
- Health Research Act

Understand what this means in terms of how data are stored, transferred and shared

Use appropriate services

TSD provides a platform for researchers working at UiO and in other public research institutions to collect, store and analyze sensitive research data. TSD complies with the directive of privacy and electronic communication in Norway.

storage/sensitive-data/index.html

Ask for consent for data sharing

If you do not, data centres won't be able to accept the data, regardless of any conditions on the original grant.

Where will you store the data?

- Your own device (laptop, flash drive, server etc.): and if you lose it? Or it breaks?
- Departmental drives or university servers
- Cloud storage: do they care as much about your data as you do?

The decision will be based on how sensitive your data are, how robust you need the storage to be, and who needs access to the data and when

CC image by Sharyn Morrow on Flickr
CC image by momboleum on Flickr

One copy = risk of data loss

Who will do the backup?

- Use managed services where possible (e.g. University filestores rather than local or external hard drives), so backup is done automatically
- 3... 2... 1... backup!
  - at least 3 copies of a file
  - on at least 2 different media
  - with at least 1 offsite
- Ask central IT team for advice
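The 3-2-1 rule above is simple enough to check mechanically. The following Python sketch assumes you keep a small inventory of where each copy lives; the `Copy` class, its field names and the media labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    medium: str    # e.g. "laptop-ssd", "university-filestore", "external-hdd"
    offsite: bool  # True if the copy is held away from the primary location


def satisfies_3_2_1(copies):
    """Return True if the copies meet the 3-2-1 rule."""
    enough_copies = len(copies) >= 3                    # at least 3 copies
    enough_media = len({c.medium for c in copies}) >= 2  # on at least 2 media
    has_offsite = any(c.offsite for c in copies)         # at least 1 offsite
    return enough_copies and enough_media and has_offsite


inventory = [
    Copy("laptop-ssd", offsite=False),
    Copy("university-filestore", offsite=False),
    Copy("cloud-storage", offsite=True),
]
print(satisfies_3_2_1(inventory))      # True
print(satisfies_3_2_1(inventory[:2]))  # False: only two copies, none offsite
```

A check like this only confirms that the copies exist on paper; whether each copy can actually be restored still has to be tested separately.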

Backup and preservation: not the same thing!

Backups
- Used to take periodic snapshots of data in case the current version is destroyed or lost
- Backups are copies of files stored for the short or near-long term
- Often performed on a somewhat frequent schedule

Archiving
- Used to preserve data for historical reference or potentially during disasters
- Archives are usually the final version, stored for the long term, and generally not copied over
- Often performed at the end of a project or during major milestones

How to keep your data secure?

- Develop a practical solution that fits your circumstances
- Store your data on managed servers
- Restrict access
- Keep anti-virus software up-to-date
- Encrypt mobile devices carrying sensitive information

Data sharing

Image CC-BY-NC-ND by talkingplant

The data deluge is upon us

Sensors' ability to produce data outstrips IT's ability to process it

Why not keep it all?

- Globally, data volumes are doubling every two years (John Gantz and David Reinsel, 2011, Extracting Value from Chaos)
- Storage management costs rise in the long term: hardware costs decline, but power and staff costs keep rising (David Rosenthal)
- The "storage is cheap" fallacy: decreasing hardware costs are offset by exponential growth in data volume
- Backup and mirroring multiply the cost of preserved data
- Discovery becomes harder as the chaff outweighs the wheat
- Curation of unused data is a waste of resources

Select what to keep (and share)

1. What must be kept to manage compliance risk?
2. What data could be re-used?
3. What data has value and should be kept?
4. Given costs, what will or won't be kept?
5. How will it be kept and shared, on what terms?

Should all data be open?

NO
- Many reasons, most to do with human subjects

But data existence should always be open
- Allows discovery & negotiation on use
- Avoids pointless replication

How to make data open?

1. Choose your dataset(s)
   What can you make open? You may need to revisit this step if you encounter problems later.
2. Apply an open licence
   Determine what IP exists. Apply a suitable licence, e.g. CC-BY
3. Make the data available
   Provide the data in a suitable format. Use repositories.
4. Make it discoverable
   Post on the web, register in catalogues

License research data openly

This DCC guide outlines the pros and cons of each approach and gives practical advice on how to implement your licence

CREATIVE COMMONS LIMITATIONS

Horizon 2020 Open Access guidelines point to: [licence image] or [licence image]

- NC (Non-Commercial): what counts as commercial?
- ND (No Derivatives): severely restricts use

Licences with these clauses are not open licences

EUDAT licensing tool

Answer questions to determine which licence(s) are appropriate to use

Deposit in a data repository

The EC guidelines point to Re3data as one of the registries that can be searched to find a home for data

/content/re3data-demo

How to select a repository?

- Look for provision from your community, university, publisher, funder etc.
- Check they match your particular data needs, e.g. formats accepted; mixture of Open and Restricted Access.
- See if they provide guidance on how to cite the deposited data.
- Do they assign a persistent & globally unique identifier for sustainable citations and to link back to particular researchers and grants?
- Look for certification as a Trustworthy Digital Repository with an explicit ambition to keep the data available in the long term.

Norwegian repository landscape

Zenodo

Zenodo is a multi-disciplinary repository that can be used for the long-tail of research data

- An OpenAIRE-CERN joint effort
- Multidisciplinary repository accepting:
  - Multiple data types
  - Publications
  - Software
- Assigns a Digital Object Identifier (DOI)
- Links funding, publications, data & software

What are persistent identifiers?

They are an alphanumeric code identifying a resource, organisation or individual

They must be:
- Unique
- Persistent

Ideally they should be actionable too

How do persistent identifiers work?

Citing research data: why?

How to cite data

Key citation elements
- Author
- Publication date
- Title
- Location (= identifier)
- Funder (if applicable)

Resources

Image Energy Resources | Energie Quelle CC-BY-NC by K. H. Reichert
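Before turning to the resources, the key citation elements listed above can be assembled into a citation string programmatically. The Python below is a sketch, not a prescribed citation style: the function name `format_data_citation`, the punctuation and the element order are assumptions, and the author, title and DOI in the example are made up. It relies on two facts about DOIs: they start with `10.` and they become actionable when prefixed with the `https://doi.org/` resolver.

```python
def format_data_citation(authors, year, title, identifier, funder=None):
    """Join the key citation elements into one citation string."""
    # Make a bare DOI actionable by prefixing the doi.org resolver
    if identifier.startswith("10."):
        identifier = f"https://doi.org/{identifier}"
    parts = [f"{'; '.join(authors)} ({year}).", f"{title}.", identifier]
    if funder:
        parts.append(f"Funded by {funder}.")
    return " ".join(parts)


# Hypothetical dataset; the DOI is illustrative and is not a real record
print(format_data_citation(["Doe, J."], 2016,
                           "Example survey dataset", "10.5555/example"))
# Doe, J. (2016). Example survey dataset. https://doi.org/10.5555/example
```

Writing the identifier as a resolver URL lets a reader go from the citation straight to the dataset's landing page, which is what makes a persistent identifier actionable.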

Managing and sharing data: a best practice guide

Guidance and training resources
- ESIP Data Management Training clearing house for environmental sciences
- DataONE best practices
- DCC resources
- FOSTER open science portal
- Acquire research data skills

Tools for managing data

managing-active-research-data

Also look for national & local support!

support/research/research-data
contact-noads

Finally

- Well-managed data makes your research easier, now and in future
- Well-managed data is easier to share and more likely to be re-used
- Sharing data is good for you
- It's good for all of us
- It isn't as hard as you think - we're here to show you how!

How do you share data effectively?

- Use appropriate repositories; this catalogue is a good place to start
- Document and describe it enough for others to understand, use and cite
  -datasets
- Licence it so others can reuse

Thanks for listening

For DCC resources see:
Follow us on Twitter: @digitalcuration and #ukdcc

