How to take care of your research data?
In order to make research more transparent, verifiable, replicable or reproducible, one should be able to provide the data supporting the findings. Many funding bodies nowadays require that a research proposal include a data management plan, and in many cases submitting the data to a trusted repository is also required. However, not all data can be shared, in which case the best practice is to share at least the metadata, that is, information about the data. Publishing research data is a relatively new addition to the research process and the practices are still developing. Services and tools that conform to the FAIR principles, such as the Fairdata services provided by the Finnish Ministry of Education and Culture, help bring transparency and reliability into the changing environment.
Data collection requires a lot of effort, and it is an invaluable part of the research process. Making data available can prevent duplication of efforts and save time and resources. Furthermore, the value of data as a form of research output equal to publications is becoming better understood in academia. The shift can be observed for example in the DORA declaration, which calls for the development of more accurate and responsible metrics to measure impact of scientific research outputs, including recognising and rewarding the contribution of data creators and curators. In Fairdata services, the datasets are assigned a persistent identifier, which facilitates findability and enables unambiguous data citation. Data reuse and citation increase the impact and visibility of the research, the researcher and their home institution; therefore, good research data management can bring added value to the whole research community.
Research data management (RDM) includes many aspects, e.g. the planning of data collection or generation, organising data, documentation and description, storage, version control, and decisions about archiving and preservation after the conclusion of the research project, such as what data should be discarded, what should be kept and what can be shared. It is beneficial to plan how you are going to manage your data at the very start of your research project, and the plan should cover the whole research data lifecycle:
[Figure: the research data lifecycle. Source: UK Data Service]
1. The planning phase
Develop a data management plan (DMP). There are tools to guide you through the planning process, such as DMPTuuli for the Finnish research community, which includes templates and guidance based on the requirements of various funding bodies and research institutions (e.g. Academy of Finland). Similar international tools include DMPonline and DMPTool.
The following questions can help you plan your research data management:
- What type of data will you collect and what methods are you going to use? Could you maybe make use of existing published datasets?
- Are you applying for funding? → Check the funding body requirements (e.g. in the Sherpa Juliet service)
- Does your institution have a data policy? → Check your institution’s guidelines and requirements
- Will you be collecting sensitive data? → Read the guidelines on Anonymisation and Personal Data (Data Management Guidelines by the Finnish Social Science Data Archive). This CSC webinar about sensitive data explains what sensitive data is and how it should be handled.
- Does the nature of your project require risk assessment or ethical review? → Check the guidelines by the Finnish National Board on Research Integrity, or contact your institutional or discipline-specific ethical advisory board. If applicable, also plan how you are going to inform the research participants and obtain their consent (for more information see for example the Data Management Guidelines).
- Plan and agree upon the ownership, intellectual property rights and access rights. It is important to plan ahead with your collaborators and have a clear agreement on who owns the data, who controls access and privileges, and who will be allowed to access, manipulate and modify the data. Also make the necessary decisions about authorship and intellectual property rights. More practical tips can be found in the Data Management Guidelines or in these recommendations by the Finnish National Board on Research Integrity.
- Identify the steps that will have to be taken to manage the data in each phase of its life cycle and who will be responsible for those steps.
To identify and plan the necessary steps, you can start by reading about the procedures and problems associated with each part of the data life cycle in parts 2 to 4 of this checklist: how are you going to handle, store, use or share the data? The most important questions have been compiled by DCC into this Data management checklist flyer. The How fair are your data? checklist by Sarah Jones and Marjan Grootveld will help you evaluate whether your data management plan conforms to the FAIR principles. It is normal to be unsure, and not every detail can be decided before the project has even started. It is therefore both acceptable and advisable to review and update your data management plan throughout the course of your research project.
To learn more about the structure and the benefits of developing a data management plan, see this video titled “The what, why and how of research data management” created by Research Data Netherlands.
2. During the research
Data collection and management practices are discipline specific and depend largely on the type and specifications of the data. Some basic advice to follow is:
- Create detailed documentation throughout the course of your research: having to go back and fill in the gaps in documentation afterward is laborious and complicated, if not impossible. Sufficient documentation of the data, information resources, and the methods and code used in the analyses makes the reuse and reproducibility of research possible. Also document the changes performed while collecting, organising and analysing the data. The provenance (the record trail of the origin of the data and the changes made to it) should be transparent. Good documentation will benefit you too: you will have a detailed record of your procedure and description of the data available if you want to reuse your data or need to answer any questions about it in the future.
- Use the research support services available at your institution. At most universities, the library offers data support services, or there might even be a research data management specialist at your department. Some issues can be resolved with the help of your institution’s Legal Services, or you could contact the Research Integrity Adviser.
- If you want to learn more about research data management, there are many online courses and materials available. Some examples are:
- Data Management Guidelines by the Finnish Social Science Data Archive
- MANTRA (Research Data Management Training): online course developed at the University of Edinburgh
- Responsible Research: Guide to Research Integrity, Research Ethics and Scientific Communication in Finland
- Introduction to data management (YouTube), short introduction to data management and FAIR data by CSC
- The “Love your data!” webinar series by CSC
More tips on good practice are available in this guide by CSC or this guide by UK Data Service. The following are some of the specific issues you should take into consideration when preparing your data management plan:
What steps need to be taken to ensure that the data can be interpreted and used in the future? Consider how you can make the data intelligible, understandable and reusable for others. Some aspects to take into account are for example:
- The type (e.g. interviews, survey data, images etc.) and volume of research data determine the appropriate file formats, software to be used, and storage media; use open, non-proprietary file formats that will be readable with various devices and software.
- Systematic file and directory structure, consistent naming conventions and versioning (useful tips can be found for example on the CSC website)
- Naming variables, documenting the methodology and analysis, creating a readme.txt file.
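The naming and versioning advice above can be sketched in a few lines of Python. This is only an illustration under assumed conventions: the pattern `<project>_<description>_<date>_v<NN>.<ext>`, the function name `build_filename` and the example values are hypothetical and are not prescribed by any of the guides mentioned here. Your discipline or institution may recommend a different scheme.

```python
import re
from datetime import date


def slug(text: str) -> str:
    """Lowercase the text and replace every run of other characters with a hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_filename(project: str, description: str, day: date, version: int, ext: str) -> str:
    """Build a file name following the assumed pattern
    <project>_<description>_<YYYY-MM-DD>_v<NN>.<ext>."""
    extension = ext.lstrip(".").lower()
    return f"{slug(project)}_{slug(description)}_{day.isoformat()}_v{version:02d}.{extension}"


# Hypothetical example values, chosen only for illustration.
example_name = build_filename("SoilSurvey", "Interview notes", date(2024, 3, 5), 2, "csv")
print(example_name)  # soilsurvey_interview-notes_2024-03-05_v02.csv
```

Writing the date in ISO 8601 form and zero-padding the version number means that an alphabetical file listing is also a chronological one. Whatever pattern you choose, record it in the readme.txt file so that others can interpret the names.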
Metadata means “data about data”. Metadata provide the context and information required to interpret and understand the data. Plain data without any description are not comprehensible, and therefore neither reusable nor verifiable. For a comprehensive guide on metadata and description, read this chapter of the Data Management Guidelines by the Finnish Social Science Data Archive. The following is just a basic overview of metadata types:
- Structural metadata: e.g. the file and directory structure, record trail of steps taken during collection and analysis, information about data formats.
- Administrative metadata: e.g. licenses, access rights.
- Descriptive metadata: the context and information about the data, e.g. the name of dataset, research discipline, persistent identifier, the time and place of collection and publishing of the dataset, authorship and ownership, content description (keywords, variables, etc.).
Metadata standards and formats give the metadata a clear structure and a machine-readable form. The choice of an appropriate metadata standard depends on common practice in your discipline and on the type of research data. You should also take into consideration the requirements of your institution, your chosen repository or archive, and the journal you are going to publish in. For more information about what metadata you should document and how, consult for example this metadata guide by DCC.
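To make the idea of machine-readable metadata more concrete, here is a minimal sketch in Python that groups a few fields under the three metadata types listed above and writes them out as JSON. The field names and all values are made up for illustration and do not follow any particular metadata standard. In practice you would use the schema required by your repository or discipline.

```python
import json


def build_metadata(title, creators, discipline, collected, license_id, access, files):
    """Group the fields of one dataset under the three metadata types."""
    return {
        "descriptive": {
            "title": title,
            "creators": creators,
            "discipline": discipline,
            "collected": collected,
        },
        "administrative": {
            "license": license_id,
            "access_rights": access,
        },
        "structural": {
            # Map each file to its format so that a reader knows how to open it.
            "files": files,
        },
    }


# Placeholder values throughout; none of them describe a real dataset.
record = build_metadata(
    title="Example interview dataset",
    creators=["Researcher, Example"],
    discipline="Social sciences",
    collected="2024-03-05",
    license_id="CC-BY-4.0",
    access="open",
    files={"interviews.csv": "text/csv", "readme.txt": "text/plain"},
)

print(json.dumps(record, indent=2, ensure_ascii=False))
```

A structured record like this can be validated, indexed and exchanged by software, which is what a metadata standard formalises.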
The data management plan should include information about how you are going to store and share data during the active research phase. How to ensure the integrity, safety and security of the data? More information about the ethical and legal aspects is available at the Responsible Research website. Some important points are for example the following:
- Quality assurance processes
- Back-up and recovery
- File encryption
- Ownership, access rights and IPR
- Privacy notice or privacy policy
- Access rights outside the research collaboration and access management
- Informing research participants, obtaining consent.
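One common way to check the integrity of a back-up is to compare checksums of the original files and their copies. The sketch below uses SHA-256 from Python's standard library. The function names `build_manifest` and `changed_files`, the directory layout and the file contents are hypothetical, and the approach is only one of several possible quality assurance measures, not a method prescribed by the services mentioned in this guide.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its checksum."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(original: dict[str, str], copy: dict[str, str]) -> list[str]:
    """List files that are missing from the copy or differ from the original."""
    return sorted(name for name, checksum in original.items() if copy.get(name) != checksum)


# Demonstration with throwaway files: one copy is identical, one is corrupted.
with tempfile.TemporaryDirectory() as tmp:
    original_dir = Path(tmp) / "original"
    backup_dir = Path(tmp) / "backup"
    original_dir.mkdir()
    backup_dir.mkdir()
    (original_dir / "survey.csv").write_text("id,answer\n1,yes\n", encoding="utf-8")
    (backup_dir / "survey.csv").write_text("id,answer\n1,yes\n", encoding="utf-8")
    (original_dir / "notes.txt").write_text("pilot round", encoding="utf-8")
    (backup_dir / "notes.txt").write_text("pilot r0und", encoding="utf-8")
    damaged = changed_files(build_manifest(original_dir), build_manifest(backup_dir))

print(damaged)  # ['notes.txt']
```

Storing the manifest alongside the data lets you repeat the check later, for example after restoring from a back-up or before depositing the data.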
3. Data deposition
While drawing up your data management plan, consider also what happens to the data after the research is concluded and findings possibly published. What data should be retained and where are you going to deposit it? What can be deleted or what should be destroyed (perhaps some parts of sensitive personal data)? What should be prepared for long term digital preservation? These are some basic points to take into account:
- The cost and amount of work required to prepare the data for deposition
- Funder requirements
- If the results have been published, familiarise yourself with the terms, recommendations or requirements of the publisher. Some publishers might place an embargo on data sharing, while others might require that the underlying data are also made available (check for example the data policy of PLOS journals).
- Whenever possible, link the various outputs produced in the course of the research, for example if there are reasons to submit parts of the data output into another repository. It is also useful to link the publication record with the landing page of the underlying data. Persistent identifiers are invaluable in creating reliable links.
Even if your research leads to negative findings, consider making the data available anyway. Publishing these findings could help avoid duplication of effort, contribute to the discussion, or the data could be used in meta-analysis.
You can use a service that is suitable for storage during the research and subsequent publishing of the stable, immutable version of the data. You can also use one option for the active data storage where you can process and modify the data, and another option to deposit the immutable data afterward. In any case there are steps that need to be taken before the data can be made available – they have to be logically and systematically organised, documented, and there needs to be sufficient metadata. Remove or destroy files that are not to be made available. Make sure that personal data are adequately anonymised.
What type of service is suitable for your data depends on factors such as the requirements of the funding body or your institution, and common practice in your academic field. You can choose from institutional repositories, international general data repositories (e.g. Zenodo), subject or domain specific data archives (e.g. Dryad, GenBank), data type specific services (such as GitHub for software), or national repositories (Fairdata-IDA). Data journals are a new format of peer-reviewed academic publication that specialises in publishing research data in the form of data articles (see for example Brain and Behavior, or Geoscience Data Journal).
Research data storage service IDA offers collaborative storage space for a defined group of users. IDA is suitable for active storage during the research, for sharing among the group members as well as storing stable data in an immutable state. The user can select data to be frozen – i.e. stored in an immutable state, which enables publishing the data. Prior to publishing the data, the user adds metadata to their frozen data with the Qvain metadata tool. The data described with Qvain will get a persistent identifier and a landing page once they are published. The published dataset is discoverable in the Etsin research data finder. Access to the data can be set as open or restricted, and it is also possible to only publish the metadata. IDA and Etsin can be found in the Registry of Research Data Repositories, a database of trusted research data services, and both are also well known and recognised as trusted for example in the Academy of Finland grant application process. If you are considering using Fairdata services for data management, take into account the criteria mentioned above. Note that data protection issues also affect the selection of storage service: IDA is not suitable for the storage of sensitive personal data.
Many trusted digital repositories provide long-term storage of the data exactly as it has been deposited, and guarantee data integrity on the bit level. However, such an archiving method will not ensure that the data will be usable and readable in the long term, due to software and file format obsolescence. The aim of digital preservation is to guarantee that especially valuable datasets are still usable for future generations. Digital preservation is costly: it requires active, ongoing curation and measures to extend the data lifetime so that the data stay usable for decades or even centuries. Such measures include, for example, converting obsolete file formats and various means of ensuring data integrity and quality, which keep the data readable and usable and protect them from decay and damage in the long term.
Only particularly valuable datasets are selected for digital preservation. Although each case should be assessed individually, the following characteristics suggest digital preservation could be considered:
- The data hold remarkable potential for reuse
- Cases in which the funding body requires data preservation
- Unique data whose generation or collection required a lot of resources; irreplaceable research that would be extremely difficult, costly, or impossible to reproduce.
- Data resources of national interest
- Data resources that are considered fundamental to the institution’s research core area
Digital preservation services define the policies and technical specifications necessary for data submission. The conditions might include:
- Sufficient metadata that make the data comprehensible
- Open formats that are readable by more than one application and facilitate reuse of data
- Clarification of all potential ethical and legal issues such as IPR or personal (sensitive) data issues
4. Data sharing and reuse
The data life cycle doesn’t end after the findings have been published. When you are drawing up your data management plan and considering whether to make your data available for reuse, the rule of thumb is to make the data “as open as possible, as closed as necessary”. You will need to weigh factors such as IPR and ethical issues, as well as the licences applied to the data or to the publications based on them.
When you deposit the data, you can apply a relevant licence defining reuse conditions (frequently used options are the Creative Commons or Open Data Commons licences). You can also set access rights and entitlements. The processes of applying for and granting access will be determined by the terms of use of each repository. For example in the Fairdata services you can set the data as available on an open access basis, or define the conditions for granting access. Fairdata Etsin enables users to register and apply for access to restricted datasets.
If you are reusing data collected by somebody else, make sure you give the creator credit and cite the data correctly. Use persistent identifiers whenever available. A persistent identifier ensures that the dataset is uniquely identified, that tools used for impact evaluation can recognise your citation, and that the creator of the data receives credit for their work.
A persistent identifier or PID is a unique and unambiguous machine-readable name for an object, in this case a specific research dataset. It is also a permanent link that will always take you to the landing page of the dataset, where the description and for example the licence of the dataset can be found. Usually a PID is either a DOI or a URN. These identifiers are provided by two different systems and can be told apart by their first characters: a DOI begins with “10.” and a URN begins with “urn:”.
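Telling the two identifier types apart by their first characters can be sketched as follows. This is a simplified illustration, not a validator: the function name `classify_pid` is hypothetical, the example identifiers are made up and do not resolve to real datasets, and the check only looks at the prefix rather than the full syntax of either system.

```python
def classify_pid(identifier: str) -> str:
    """Return 'DOI', 'URN' or 'unknown' based on the start of the identifier."""
    value = identifier.strip().lower()

    # DOIs are often written with a resolver address or a "doi:" label in front.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if value.startswith(prefix):
            value = value[len(prefix):]
            break

    # A DOI consists of a prefix starting with "10." and a suffix after a slash.
    if value.startswith("10.") and "/" in value:
        return "DOI"
    # A URN starts with the scheme name "urn:" (case-insensitive).
    if value.startswith("urn:"):
        return "URN"
    return "unknown"


# Made-up identifiers for illustration only.
examples = [
    "10.1234/example.5678",
    "https://doi.org/10.1234/example.5678",
    "urn:nbn:fi:example-2024",
    "dataset-42",
]
results = [classify_pid(item) for item in examples]
print(results)  # ['DOI', 'DOI', 'URN', 'unknown']
```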
Read more in the CSC blog post “What a researcher should know about persistent identifiers”. If you are interested in data impact evaluation, read this blog post by ImpactStory. For specific citation guidelines see the data citation roadmap created by the Finnish Committee for Research Data.
Fairdata-Etsin research data finder enables search and browsing of information about data generated by the Finnish research community.
For a wider search, you can start by finding a relevant data archive or repository. Re3data, the Registry of Research Data Repositories, is a global registry where you can browse and search for data repositories by subject, country or content type (e.g. software applications, audiovisual data etc.).
You can also use the search engine of a citation index – a database which harvests metadata feeds from various repositories, for example Data Citation Index by Clarivate Analytics.
Data discoverability through search engines depends on sufficient metadata – the indexing, discovery and retrieval are based on available metadata, as full-text search of data is not feasible. This is yet another reason why it is beneficial to create good documentation and descriptive metadata.
More information for organisations
Do you work for an organisation such as a research institution, academic publisher or a funding body, and want to learn more about how your organisation can promote good data management? You might be interested in the report linked below, compiled by the Knowledge Exchange collaboration. Recommendations for various types of organisations involved in the research community can be found on pages A4-A7, and page A11 summarises the factors that encourage or hinder data sharing.
Incentives and motivations for sharing research data: a researcher’s perspective