Written by: Erja Kortelainen, Sonja Sipponen, Anssi Kainulainen (CSC) and Anca Hienola (FMI)
Fairdata Services now include dataset metadata from METIS, the Finnish Meteorological Institute’s (FMI) EUDAT B2SHARE instance. The catalogue is available in the Etsin service. FMI has been publishing its research datasets in METIS since 2020. According to FMI’s research data policy, all research datasets are openly accessible to the general public; when needed, an embargo of at most two years is allowed. FMI’s METIS datasets have been in Fairdata since December 8th 2022, and the integration is now fully active, which means that all updates and new datasets are synced from FMI to Fairdata daily. Bringing the datasets into Fairdata also makes the METIS datasets automatically visible in the Research.fi portal through an existing integration. This considerably improves the visibility and findability of the datasets, in addition to the existing harvests to EUDAT B2FIND. Implementing integrations like this across multiple services makes information reliably findable with little work after the initial project, creating impact with less effort in the long run.
- METIS – FMI’s Research Data repository in EUDAT B2SHARE
- METIS – FMI’s EUDAT B2SHARE research data repository in Etsin
- METIS – FMI’s EUDAT B2SHARE research data repository in Research.fi
This post is an overview of the steps taken to bring the research dataset metadata from EUDAT B2SHARE to Fairdata Services and thus also to the Research.fi portal. We describe the challenges we faced in different aspects of the project, what we learned, and how we could use that knowledge and know-how in future projects.
First we will lay out the basics of our project and collaboration. After that we will go into more technical details of the implementation. Lastly we’ll speculate on the future uses of this type of integration.
Collaboration
The project was organized by CSC and carried out in close collaboration between CSC’s Fairdata team and EUDAT team. The Research.fi team was also involved to ensure a smooth transition to the Research.fi portal.
The Finnish Meteorological Institute (FMI) produces observation and research data on the atmosphere, the near space and the seas, as well as weather, sea, air quality and climate services for the needs of public safety, business life and citizens. The Finnish Meteorological Institute is an administrative branch of the Ministry of Transport and Communications.
The Fairdata Services are integrated services for storing, sharing and publishing research data. The Fairdata Services, organized by the Finnish Ministry of Education and Culture and produced by CSC, are offered free of charge to users in Finnish higher education institutions and research institutes.
EUDAT Collaborative Data Infrastructure (EUDAT CDI) is a pan-European consortium of more than 25 research organisations and data and computing centers, with its origin in CSC-led EU projects. EUDAT partners develop and provide a portfolio of interoperable services for different stages of the data lifecycle. CSC provides customised services for organisations based on the EUDAT components, as well as the public EUDAT CDI B2SHARE service.
The Research.fi portal is a service offered by the Ministry of Education and Culture that collects and shares information on research conducted in Finland. Research.fi contains information on the Finnish research system, publications by Finnish organizations, projects funded by public and private research funders, information on researchers operating in Finland and their research activities, and statistical information on the development of research resources and impact. The service makes it easier to locate information and experts on research and increases the visibility and societal impact of Finnish research.
Fairdata Services, as well as Research.fi, have a search function which makes finding the datasets easy. All FMI’s METIS datasets can be listed at once but datasets can also be filtered and searched by more detailed search terms. Having the datasets both in B2SHARE and in Fairdata and Research.fi gives the datasets more visibility and makes them more findable.
METIS research datasets
The Finnish Meteorological Institute conducts internationally recognized science, the results of which are applied in society and used for supporting decision-making. In addition to measurement and experimental data, the work involves the use of scientific calculation models that make use of supercomputers. FMI produces observations and research data on the atmosphere, the near space and the seas. It also provides services on weather, sea, air quality, climate and near space for the needs of public safety, business life and citizens.
Open and reproducible science is one of FMI’s strategic objectives. Through METIS, FMI makes publicly funded research data available to the widest possible audience (under a CC BY license), as the best way to maximize the impact of the data and to do justice to all the hard labor put into collecting, cleaning, and analyzing it. Even when the data itself cannot be published for reasons listed in the institute’s Open Research Data policy, FMI seeks to publish the metadata, acknowledging the data’s existence, topic, contact information and, whenever possible, ways to obtain the data. METIS allows researchers to self-archive their research data, which can increase the visibility, usage, and impact of research conducted at FMI. Knowledge management, research and openness assessment, open access to scholarly research, and fragmentation avoidance are some of the other functions of the repository. METIS is evolving into a potent tool for hosting and disseminating accumulated knowledge, highlighting FMI’s accomplishments in research, and gaining peer recognition.
About the project
The aim of the integration project was to create a solution that automatically copies dataset metadata from FMI’s B2SHARE instance to Fairdata, working in the background without any manual steps. An integration project is never a simple “we get this 100% right the first time”. It is more like “make the first configurations and mappings and see how it works”, and then iterate from there. An integration is rarely perfect: compromises and agreements are needed to get two different systems to understand each other and their underlying data structures.
Although both Fairdata and B2SHARE already had APIs in place, we still needed to get the interfaces to talk to and understand each other. When specifying the technical solution, one important question was which service would be the active party in the interaction: would B2SHARE push the data to Fairdata, or would Fairdata pull the metadata from B2SHARE? This was also a question of responsibilities: which party would implement the active solution to push or pull the metadata and commit to maintaining it? After negotiations, it was decided that B2SHARE would push its data to Fairdata by using the Fairdata Metax API.
Key steps in the project were:
- Mapping the datamodel structures and value sets used in EUDAT to those used in Fairdata
- Setting up the accounts and connections between the systems, and agreeing on the API endpoints
- Building the actual integration solution
- Testing and reviewing
- Deployment & Go-Live
This project was no exception to how integration projects usually go. We spent several hours on specifying the field mappings, fine tuning the mappings for different value sets and just overall: getting everything right.
The outcome was first tested in Fairdata’s demo environment. We brought all dataset metadata into the environment and reviewed it closely, making thorough comparisons between how the content looked in B2SHARE and how it now looked in Fairdata. Adjustments were made, mostly to the mapping rules, and then the result was reviewed again. When the project group agreed that the result was ready, the final review was done by FMI’s representatives: the hard work had paid off and we had a production-ready solution!
Mapping
In the task list, “mapping” seems to be just one task among the others. In fact, it was the most important and also the most time-consuming part of our project. Describing how the source datamodel (all fields used) translates to the destination datamodel is the most fundamental part of any integration project. We understood that this would be the base for the whole project: mess up the mapping, mess up the whole project, which would set a bad precedent for any future hopes of repeating the harvesting for other customers and communities.
From datamodel perspective, EUDAT B2SHARE’s FMI instance has three different sets of fields:
- EUDAT Core – used to standardize and validate exchange of metadata between EUDAT services, namely B2SHARE and B2FIND. It is based on DataCite Metadata Schema 4.4 and OpenAIRE Guidelines for Data Archives.
- EUDAT Extended – shared by all B2SHARE instances and communities, almost the same as EUDAT Core, but supports richer and more flexible descriptions through nested and repeatable fields and less strict limitations on (persistent) identifier types.
- B2SHARE community extensions – tailored case by case to provide better support for community specific workflows and metadata needs. FMI’s METIS community extension includes specifics about their data sources, parameters, processing steps and such.
The EUDAT B2SHARE datamodel is mainly based on DataCite, whereas Fairdata is based on DCAT. We already had an initial mapping from Fairdata (DCAT) to DataCite, but in order to succeed we needed to understand both datamodels in their entirety.
Already-built crosswalks are really helpful as general guidance when planning an integration and a mapping between services, but a general crosswalk cannot be used as-is: you need to review both the source and the destination datamodels to really understand how they are implemented and possibly tailored, and then decide how to adapt the crosswalk to your needs. You need to specify, for example, how to handle different data types, repeatable vs. single-value fields, different value sets (for example controlled vocabularies, ontologies and other pre-defined lists of known values or codes), and so on.
Well-designed crosswalks already exist, for example from the RDA Research Metadata Working Group: A Collection of Crosswalks from Fifteen Research Data Schemas to Schema.org. FAIRCORE4EOSC is also doing a case study on integrations in European national-level services, in which the Research.fi portal is participating. The expected result is to enrich the schemas of existing European national-level research information systems and improve their interoperability, as well as to serve as a template for further integrations.
Although we met challenges, in the end we had a mapping table in which every field and value set from the EUDAT B2SHARE FMI instance was matched with a field and an acceptable value in Fairdata’s datamodel. With this mapping table it was rather easy to start implementing the needed rules and exceptions in an integration solution.
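To make the idea of a mapping table concrete, here is a minimal sketch in Python. It is not the project's actual mapping: every field name, licence code and helper below is a hypothetical placeholder chosen for illustration. It shows the three kinds of rules discussed above: collapsing a repeatable field into a single value, restructuring a list, and translating a value set, while also reporting source fields that have no rule yet.

```python
from typing import Any, Callable

# Hypothetical value-set mapping: source licence codes -> destination codes.
LICENSE_MAP = {"CC-BY": "cc-by-4.0", "CC0": "cc-zero-1.0"}


def first_value(values: list) -> Any:
    """Collapse a repeatable source field into a single-value destination field."""
    return values[0] if values else None


def map_license(value: str) -> str:
    """Translate a licence code; unknown codes become an explicit 'other'."""
    return LICENSE_MAP.get(value, "other")


# One row per source field: (destination field, converter). All names are illustrative.
CROSSWALK: dict[str, tuple[str, Callable[[Any], Any]]] = {
    "titles": ("title", first_value),
    "creators": ("creator", lambda names: [{"name": n} for n in names]),
    "license": ("access_rights", map_license),
}


def apply_crosswalk(record: dict) -> tuple[dict, list]:
    """Return the mapped record and the source fields that had no mapping rule."""
    mapped, unmapped = {}, []
    for field, value in record.items():
        if field not in CROSSWALK:
            unmapped.append(field)  # surfaced for review instead of silently dropped
            continue
        destination, convert = CROSSWALK[field]
        mapped[destination] = convert(value)
    return mapped, unmapped
```

Returning the unmapped fields separately supports the iterative review described earlier: each test run shows which source fields still need a rule or an agreed exception.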
Integration solution
Serializers translate a service’s internal data representations to and from external data formats. These formats are then provided through defined protocols. Serializers and protocols are key to implementing integrations.
Fairdata datasets can be accessed through human readable landing pages and a machine readable HTTP REST API. Although most of the endpoints provided by the Metax API are accessible only to other Fairdata services, it also offers a comprehensive set of API endpoints for end users and for integration service users.
B2SHARE datasets or records can also be accessed through human readable landing pages and two machine readable protocols for direct queries or harvesting: HTTP REST API and Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Available OAI-PMH harvestable formats are EUDAT Core, Dublin Core, DataCite and MARC XML, while REST API offers EUDAT Extended format (JSON).
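As an illustration of what harvesting over OAI-PMH looks like from the client side, the sketch below builds a `ListRecords` request URL and extracts record identifiers from a response. The verb, the `metadataPrefix` and `from` parameters, the `oai_dc` prefix and the XML namespace come from the OAI-PMH 2.0 specification. The base URL and the sample response are made up for the example and do not describe any real B2SHARE endpoint.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlencode

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None) -> str:
    """Build a ListRecords request; from_date limits the harvest to newer records."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


def record_identifiers(xml_text: str) -> list:
    """Extract the identifier from each record header in an OAI-PMH response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]


# Hand-written sample response, reduced to headers only (illustrative data).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header>
      <identifier>oai:example.org:abc123</identifier>
      <datestamp>2022-12-08</datestamp>
    </header></record>
    <record><header>
      <identifier>oai:example.org:def456</identifier>
      <datestamp>2022-12-09</datestamp>
    </header></record>
  </ListRecords>
</OAI-PMH>"""
```

A real harvester would also follow the `resumptionToken` that OAI-PMH uses for paging through large result sets; that is left out here to keep the sketch short.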
Fairdata provides two serializers to represent the metadata: an internal JSON format and DataCite XML.
The active component of the B2SHARE-Metax integration is implemented as a B2SHARE internal module. The module includes a new Metax-specific serializer, which transforms metadata stored in the B2SHARE internal datamodel to a format understood by the Metax REST API, based on a set of rules (the machine readable definition of the mapping presented in the previous section). The module pushes the metadata of new or updated datasets once per day, thus avoiding redundant traffic.
The implementation was done as a B2SHARE internal module, which can be enabled or disabled as needed, for the sake of maintenance and reusability in potential future cases with other B2SHARE instances. If an external service were to harvest or request metadata from B2SHARE instead, either the HTTP REST API or the OAI-PMH protocol would be suitable.
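The push pattern described above, serializing only new or updated records once per run, can be sketched roughly as follows. This is an illustration under assumptions, not the module's actual code: the `Record` class, the payload fields and the `send` callback are all hypothetical, and in a real deployment `send` would wrap an authenticated HTTP call to the receiving API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable


@dataclass
class Record:
    """Hypothetical stand-in for a repository record."""
    record_id: str
    updated: datetime
    metadata: dict


def serialize_for_target(record: Record) -> dict:
    """Hypothetical serializer: internal record -> payload for the target API."""
    return {"identifier": record.record_id, "title": record.metadata.get("title", "")}


def push_changed(records: Iterable[Record], last_sync: datetime,
                 send: Callable[[dict], None]) -> list:
    """Send only records changed since the previous run; return the pushed ids."""
    pushed = []
    for record in records:
        if record.updated <= last_sync:
            continue  # unchanged since the last run, so no traffic is generated
        send(serialize_for_target(record))
        pushed.append(record.record_id)
    return pushed


# Usage: collect payloads in a list instead of calling a real API.
last_run = datetime(2022, 12, 8, tzinfo=timezone.utc)
demo_records = [
    Record("a", datetime(2022, 12, 7, tzinfo=timezone.utc), {"title": "Old dataset"}),
    Record("b", datetime(2022, 12, 9, tzinfo=timezone.utc), {"title": "New dataset"}),
]
outbox = []
pushed_ids = push_changed(demo_records, last_run, outbox.append)
```

Passing `send` in as a parameter keeps the selection and serialization logic testable without network access, and the same function could be pointed at a demo environment before production.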
What’s next?
Creating and showcasing links between research entities is important in order to create a big picture of Finnish research and its impact on a national and international level. Funders and decision makers need reliable information and transparency in order to make steering decisions. Finding links between research outputs, funding, infrastructures and organisations is also vital for the development of research information management. Many Finnish research organisations aim to define what research is and what kinds of entities relate to it.
Since services like Research.fi include information on publications, research funding, researchers and research organisations in addition to research datasets, they can be used in the future to showcase links between the entities and provide the links through an API for further use. In addition, the B2INST service is being developed for cataloguing instruments operated within science.
We have been investigating the possibility of bringing Finnish research dataset metadata into Fairdata from other Finnish research data repositories, including other B2SHARE instances. As the Fairdata services are meant to increase the visibility and societal impact of Finnish research and research datasets, the question has been, and still is, how to define Finnish datasets and distinguish them from the others. What does it actually mean that a dataset is Finnish? That the describer of the dataset is Finnish, or that the creator is affiliated with a Finnish higher education institution or research institute? That the research group is Finnish, and if so, what makes a group Finnish? That the research is financed by a Finnish funder? Or something else entirely? We are still evaluating these questions and the possibilities in collaboration with the Finnish Fairdata network.
In short, to prepare for national instances becoming available, we now have a functional EUDAT Core & Extended mapping to Fairdata; we would “only” need to add an extension mapping. In addition, publishing an existing dataset catalog from Fairdata to, for example, B2FIND would be a rather easy task, at least at the technical level, with this existing mapping. New research entities can also be published, catalogued and linked to datasets through future integrations. Further integrations from both B2FIND and Research.fi to international infrastructures such as OpenAIRE, which provides services for open science, are possible in the future.
If you are interested in learning more about Fairdata integrations, please visit the Fairdata Network’s page “Metax integration for organisations”. The page is available only in Finnish.