Progress in 180 Days

Web Interfaces. A core goal of the Commons is to make data findable and accessible to qualified users, regardless of where those data are stored and what metadata structure they have. We have constructed web interfaces that can use arbitrarily complex search term sets to find datasets that fit specific use cases. For example, 53 users from 8 different teams have access to COPDGene clinical data via copdgene.hms.harvard.edu. These early searches demonstrate that cross-dataset queries can work in controlled circumstances, a first step toward building an interface that can access arbitrarily complex ‘real’ datasets.
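
As an illustration of such a search, the sketch below expresses a compound term set as a JSON query against a hypothetical Commons search endpoint; the URL, field names, and query grammar are invented for illustration and are not the actual interface.

    import json
    import urllib.request

    # A compound search: datasets about COPD that include either of two assay types.
    query = {
        "terms": {
            "all_of": [{"field": "disease", "value": "COPD"}],
            "any_of": [
                {"field": "assay", "value": "RNA-seq"},
                {"field": "assay", "value": "whole-genome sequencing"},
            ],
        },
        "limit": 25,
    }

    request = urllib.request.Request(
        "https://commons.example.org/api/search",   # hypothetical endpoint
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        for dataset in json.load(response).get("results", []):
            print(dataset["id"], dataset["title"])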

Research Object Identification and Access. Data and other research objects (software, workflows, provenance, etc.) must be uniquely identifiable at a global scale, and supported by common core metadata and access services, to enable their use within and across stacks. Identifiers must be persistent, resolvable on the Web, verifiable, and compatible with the worldwide biomedical communications ecosystem. To support these goals, we have developed common models of Globally Unique Identifier (GUID) Services and of Core Metadata, and have implemented an initial set of compatible services to identify and verifiably access data across stacks and across the biomedical ecosystem. Data services previously used within the stacks have been harmonized. We have demonstrated interoperability across the data providers, stacks, and ecosystem, and have plans to extend identifier and service interoperability across the full universe of research objects and computations available on the Web.
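
As a concrete illustration, the sketch below shows how a client might resolve a GUID to its core metadata and then verify the retrieved bytes against a recorded checksum; the resolver URL and metadata field names are assumptions for illustration, not the consortium's actual service interfaces.

    import hashlib
    import json
    import urllib.request

    def resolve_and_verify(guid: str) -> bytes:
        # Resolve the GUID to core metadata (assumed JSON carrying a download
        # URL and an md5 checksum, in the spirit of the common core model).
        meta_url = f"https://resolver.example.org/{guid}"   # hypothetical resolver
        with urllib.request.urlopen(meta_url) as resp:
            meta = json.load(resp)

        # Fetch the object itself from wherever the metadata points.
        with urllib.request.urlopen(meta["access_url"]) as resp:
            data = resp.read()

        # Verifiability: the retrieved content must match the registered checksum.
        if hashlib.md5(data).hexdigest() != meta["checksum_md5"]:
            raise ValueError(f"checksum mismatch for {guid}")
        return data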

Common Data Model. One challenge presented by the datasets in the NIH Data Commons (i.e., TOPMed, GTEx, and the AGR) is that these data, particularly their metadata, are stored natively in different formats, often using different terms for the same concept. We have established the Crosscut Metadata Model, a system that renders these diverse metadata into a common exchange format. The stacks are ingesting the Crosscut Metadata Model so that users can access the Commons datasets in a uniform manner.
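
The sketch below illustrates the general idea of rendering natively different metadata into one exchange format; the source field names and the target schema are invented stand-ins, not the actual Crosscut Metadata Model serialization.

    # Each source stores the same concepts under different keys; a per-source
    # field map renders records into one common shape.
    FIELD_MAPS = {
        "sourceA": {"title": "study_name", "taxon": "organism"},
        "sourceB": {"title": "dataset_title", "taxon": "species"},
    }

    def to_common(record: dict, source: str) -> dict:
        field_map = FIELD_MAPS[source]
        return {
            "title": record[field_map["title"]],
            "taxon": record[field_map["taxon"]],
            "source": source,
        }

    print(to_common({"study_name": "Example Study", "organism": "Homo sapiens"}, "sourceA"))
    print(to_common({"dataset_title": "Example Set", "species": "Mus musculus"}, "sourceB"))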

Data Harmonization. A notorious challenge associated with the vast volumes of omics data is the absence of a unified set of standard naming conventions for information such as clinical variables, phenotypes, and properties of genes or the biological processes linked to them. Without additional effort to harmonize the metadata associated with these datasets, comprehensive retrieval of data across TOPMed, GTEx, and the AGR is exceptionally challenging for the end user. The Crosscut Metadata Model includes provisions for capturing data harmonization, and we have incorporated an initial harmonization effort covering subject demographic information (age, gender, sex), subject phenotypic information (disease status), and limited sample information (tissue or anatomical region of collection).
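
Value-level harmonization can be as simple as mapping each dataset's native codings onto one shared vocabulary, as in the sketch below; the codings shown are invented examples, and the real effort spans many more variables and vocabularies.

    # Map each dataset's native sex codings onto one shared vocabulary.
    # The source names and codings here are illustrative, not actual study codebooks.
    SEX_HARMONIZATION = {
        "study_a": {"1": "male", "2": "female"},
        "study_b": {"M": "male", "F": "female"},
    }

    def harmonize_sex(value: str, source: str) -> str:
        return SEX_HARMONIZATION[source].get(value, "unknown")

    assert harmonize_sex("1", "study_a") == "male"
    assert harmonize_sex("F", "study_b") == "female"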

Enabling FAIR Assessments. While the NIH is embracing the Findable, Accessible, Interoperable, and Reusable (FAIR) guidelines, and these guidelines are used to direct the transformative potential of the Data Commons, until a year ago it was unclear what it means to be FAIR and how FAIRness can be measured. One of the early products of the Data Commons is a software platform to perform FAIR assessments. The platform, called FAIRshake, was developed to enable manual and automated assessment of the FAIR compliance of any biomedical digital research object. FAIRshake has entries for Projects, Metrics, Rubrics, and Assessments. Metrics are specific questions about FAIR compliance, for example, whether a resource provides a license; Rubrics are collections of Metrics; Projects are collections of digital objects that can be evaluated for FAIRness, for example, all the bioinformatics software tools on the AGR websites. FAIRshake also provides a browser extension and a bookmarklet that let users see the FAIRness level of a digital object when visiting a website that hosts it. Using FAIRshake, members of the Data Commons have already manually assessed all the digital objects on the AGR and GTEx portals, and automatically assessed the publicly available machine-readable metadata from all the TOPMed studies. These assessments were communicated to the Data Stewards, who are now working on improving their FAIR scores. In addition, FAIRshake interoperates with other worldwide FAIR-enabling services such as FAIRsharing and FAIRmetrics.org, and with initiatives such as GO-FAIR; the Universal FAIR Metrics are encoded within FAIRshake as one of its rubrics. Importantly, FAIRshake is gradually becoming integrated with the full stacks. It is already integrated within Repositive and CommonsShare, two full-stacks products that have adopted FAIRshake for automated assessment of the digital objects they currently host and plan to host in the near future.
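
To make the Projects/Rubrics/Metrics hierarchy concrete, the sketch below models a rubric with one metric and a trivial automated check for it; the class names, fields, and scoring are illustrative stand-ins rather than FAIRshake's actual data model or API.

    from dataclasses import dataclass, field

    @dataclass
    class Metric:
        question: str          # a specific question about FAIR compliance

    @dataclass
    class Rubric:
        name: str
        metrics: list = field(default_factory=list)   # a Rubric is a collection of Metrics

    def assess(digital_object: dict, rubric: Rubric) -> dict:
        # Score each metric 1.0 (satisfied) or 0.0 (not), automated where possible.
        scores = {}
        for metric in rubric.metrics:
            if "license" in metric.question.lower():
                scores[metric.question] = 1.0 if digital_object.get("license") else 0.0
        return scores

    rubric = Rubric("example rubric", [Metric("Does the resource provide a license?")])
    print(assess({"name": "some tool", "license": "MIT"}, rubric))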

Application Programmer Interfaces. The Data Commons is taking advantage of powerful software-based systems to support interoperability between stacks, using a technology referred to as Application Programmer Interfaces (APIs). APIs serve as software-based intermediaries for exchanging data, and have been employed in many industrial and research domains for years. At present, 65 APIs have been identified for use within the Commons. Consortium members have agreed on the SmartAPI registry to ensure rapid discovery and reuse of these APIs. By consulting a shared list of APIs, developers in the Data Commons can check whether a given API is already in use and consolidate around well-supported APIs.
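
For example, a developer can query the registry before building a new API; the sketch below assumes SmartAPI's public query endpoint and the field names of its current JSON responses, both of which may differ in practice.

    import json
    import urllib.parse
    import urllib.request

    def search_smartapi(term: str) -> list:
        # Query the SmartAPI registry for APIs matching a free-text term.
        url = "https://smart-api.info/api/query?q=" + urllib.parse.quote(term)
        with urllib.request.urlopen(url) as resp:
            return json.load(resp).get("hits", [])

    # Check whether a metadata-related API is already registered.
    for hit in search_smartapi("metadata"):
        print(hit.get("info", {}).get("title"))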

Ethics, Security, and Privacy. The Commons will host protected data resources from GTEx and TOPMed. In order for users to access these resources, each stack service must authenticate and authorize users to assure appropriate use within the FAIR framework. The consortium is committed to providing users with the ability to log in with a single set of credentials, enabling access to data within all of the stacks on multiple (cloud) platforms. We have established a working group to address the technical issues relevant to a highly secure single sign-on capability, as well as the policy issues involved in enabling access across data from multiple clinical studies. As part of this work, we have collaborated with consortium researchers and developers to ensure that appropriate authentication and authorization standards are used, worked across full-stack teams to identify the security requirements for implementing production-ready systems in phase 2, and developed a proposal for a governance policy that will provide the Commons long-term guidance on security, privacy, and ethics. We have also drafted a proposal to stand up a Data Governance Council, which is now in review. Finally, we are working with the NIH to review and define the security controls needed for “production”-level systems.
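
As one illustration of the single sign-on pattern, the sketch below shows how a stack service might validate a shared bearer token through an OAuth2-style token introspection endpoint (RFC 7662); the endpoint URL and required scope are hypothetical.

    import json
    import urllib.parse
    import urllib.request

    INTROSPECT_URL = "https://auth.example.org/introspect"   # hypothetical authorization server

    def authorize(bearer_token: str, required_scope: str = "commons:read") -> bool:
        # POST the token to the introspection endpoint to learn its status and claims.
        body = urllib.parse.urlencode({"token": bearer_token}).encode("utf-8")
        request = urllib.request.Request(INTROSPECT_URL, data=body)
        with urllib.request.urlopen(request) as resp:
            claims = json.load(resp)
        # The token must be active and carry the scope this service requires.
        return claims.get("active", False) and required_scope in claims.get("scope", "").split()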

Streamlined Regulatory Approach. The Commons has taken a streamlined approach to obtaining controlled-access data. As part of this Pilot Phase, many of the involved institutions have created individual but uniform data access requests for dbGaP data. The NIH provided a single uniform approach to requesting these data, as explained in the DCPPC Instructions for Online Data Access Request. Additionally, all sites involved with the Commons were given access to the same datasets, listed as tier 1a and tier 1b data, for the purposes of developing the infrastructure of the Commons. Tier 1a data do not require additional approvals; however, access to 22 additional TOPMed studies with large sample sizes and complex phenotypic data (tier 1b) requires IRB approval. For institutions interested in accessing tier 1b data, we have instituted a SMART IRB Reliance mechanism, a centralized means of obtaining IRB approval for the NIH Data Commons Pilot Phase.

Cross-Stack Interoperation. A key goal for the Commons pilot phase is to show that the full stacks can interoperate with each other, and that scientists can perform the same analysis pipeline on multiple full stacks. We have recently shown that a user with a single set of credentials can log into multiple stacks, find the same datasets using standardized GUIDs, and execute workflows on those data using a common API. This combination of functionality is unprecedented in the biomedical domain. Evaluation of the workflows' output confirmed that there was no analysis variation across the stacks; runtimes and overall costs were also evaluated during the demonstration.
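
The demonstrated pattern can be summarized in code: the same credential, the same GUID, and the same workflow submitted to several stacks through a common API. The sketch below simplifies run submission to a JSON body; the endpoints and field names are assumptions, not the exact API used in the demonstration.

    import json
    import urllib.request

    STACK_ENDPOINTS = [
        "https://stack-a.example.org/runs",   # hypothetical stack endpoints
        "https://stack-b.example.org/runs",
    ]

    def run_everywhere(token: str, data_guid: str, workflow_url: str) -> dict:
        # Submit the same workflow, on the same GUID-identified data, to every stack.
        body = json.dumps({"workflow_url": workflow_url,
                           "params": {"input": data_guid}}).encode("utf-8")
        run_ids = {}
        for endpoint in STACK_ENDPOINTS:
            request = urllib.request.Request(
                endpoint, data=body,
                headers={"Authorization": f"Bearer {token}",   # same credentials everywhere
                         "Content-Type": "application/json"})
            with urllib.request.urlopen(request) as resp:
                run_ids[endpoint] = json.load(resp)["run_id"]
        return run_ids   # compare outputs across stacks once the runs complete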

Project Management. We have implemented project management practices that have proven effective in tracking project progress and in reporting monthly on milestones and deliverables, leading to better organization and accountability. The project management system enables objective measurement of DCPPC progress each month. It has simplified reporting on project completion for all PIs, and added uniform reporting of milestones to the NIH. By consolidating all project management communications under a single consortium-wide system, we have also greatly increased transparency and simplified oversight of many DCPPC activities.