The NIH Data Commons - Executive Summary
Authors: The NIH Data Commons Pilot Phase Consortium
Vision and Introduction
The opportunities of data-driven and data-intensive biomedical research are immense: the ability to explore individual research questions using vast online databases, the promise of increasingly broad and deep integrative analyses, and the ability to repurpose existing data for new questions will transform biomedical research and translational medicine. However, these opportunities come with significant challenges: the data sets are too big to download; many computational tools don’t work with the data or metadata formats output by others; and local compute capacity is often too limited to meet dynamic research needs. As a result, biomedical data consistently fails to reach its full potential in basic research, clinical medicine, and translational medicine.
One solution to these challenges is to host data and workflows in the cloud, where compute resources scale as needed and computational tools run close to the data in a standard environment. But this is only a part of the full solution: we need policies and protocols for accessing human-subjects data; indexing and search of available data sets; a collection of computational pipelines that can be applied to data sets; data standards to connect data sets between analyses; policies for data citation and reuse; and computational “stacks” of software that provide Web sites and developer interfaces for analysis. Moreover, researchers and consortia will want and need to bring their own data and computational pipelines into the cloud, both to use them and to make them available to others.
Within the first 180 days of this project, we will implement a Data Commons platform that will support biomedical data analysis in the cloud by building software infrastructure on top of standards, software pipelines, and existing data sets. This Data Commons effort will start by adopting standards and access protocols while building a software platform in support of the analysis of three high-value NIH data sets: TOPMed, GTEx, and the Alliance of Genome Resources (AGR). The resulting Data Commons platform will be a scalable cloud solution that hosts cloud-local data, supports many computational pipelines running in the cloud, and provides interactive data analysis workspaces. Over the four years of the Data Commons Pilot Phase and beyond, this Data Commons platform will expand to enable any biomedical research group to work effectively with data, improve reproducibility, and reduce redundancy in infrastructure and standards development.
This implementation will be done through a Data Commons Pilot Phase Consortium (DCPPC), consisting of NIH personnel, a number of academic and industry implementation teams, Data Stewards (representing the TOPMed, GTEx, and AGR consortia), and multiple commercial cloud providers.
Major Objectives of the Data Commons Pilot Phase Consortium
The Data Commons will require many policy and technology choices around interoperability, data access, functionality, and implementation. The goal of the DCPPC is to resolve many of these choices. We focus on a system to find, access, and use biomedical data, developing prototypes and case studies in the initial phase and engaging with the community. We have identified the following 10 Major Objectives:
1. Identifiers for data: Interoperable global unique identifier system for digital objects.
The Data Commons will be a repository for computational pipelines implementing analysis workflows, and data from many different sources. Researchers and software working within the Data Commons must be able to refer to digital objects via unambiguous identifiers, consume metadata about each object, and resolve identifiers into data and analysis pipelines. New identifiers will need to be minted as new objects enter the Data Commons, and these identifiers will need to be resolvable across all Data Commons-compatible platforms. Therefore, a major objective of the DCPPC is to specify and implement an interoperable system for global unique identifiers for digital objects.
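As a rough illustration of the mint-and-resolve cycle described above, the following Python sketch keeps identifiers in an in-memory dictionary. It is not the DCPPC's identifier system: the `IdentifierRegistry` class, the `dc` prefix, and the example storage location are all hypothetical, and a real system would need a persistent service shared across platforms.

```python
import uuid


class IdentifierRegistry:
    """Toy in-memory registry (hypothetical; stands in for a shared, persistent service)."""

    def __init__(self, prefix="dc"):
        # "dc" is a made-up namespace prefix, not one defined by the Data Commons.
        self.prefix = prefix
        self._records = {}

    def mint(self, location, metadata):
        """Create a new identifier for a digital object and store its record."""
        identifier = f"{self.prefix}:{uuid.uuid4()}"
        self._records[identifier] = {"location": location, "metadata": dict(metadata)}
        return identifier

    def resolve(self, identifier):
        """Return the stored record; raises KeyError for an unknown identifier."""
        return self._records[identifier]


registry = IdentifierRegistry()
guid = registry.mint(
    location="s3://example-bucket/sample.vcf",  # hypothetical location
    metadata={"type": "dataset", "format": "VCF"},
)
record = registry.resolve(guid)
print(guid, record["location"], record["metadata"])
```

Separating `mint` from `resolve` mirrors the two requirements in the text: new objects receive an identifier on entry, and any consumer holding that identifier can later obtain both the metadata and the location of the object.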
2. Data access: Authentication and authorization policies and protocols for controlled access to digital objects and derivatives.
The Data Commons will provide access to large collections of human-subjects data, and must support controlled access to and use of this data. Data Commons-compliant platforms must be able to provide access to this data by user and use, track access, allow audit trails of amalgamated data sets, and provide a robust policy vocabulary for these operations that is shared across platforms. The DCPPC will work with the NIH and NIH grantees to integrate with existing policies and define computational protocols for access to human-subjects data via the Data Commons.
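One way to picture access "by user and use" with an audit trail is the minimal Python sketch below. The class names, the use labels, and the policy structure are assumptions made for illustration only; actual protocols would be defined together with the NIH and would integrate with existing policies.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AccessPolicy:
    """Hypothetical policy: maps each user to the uses approved for one data set."""
    dataset_id: str
    approved_uses: dict = field(default_factory=dict)  # user -> set of use labels


@dataclass
class AccessController:
    policy: AccessPolicy
    audit_log: list = field(default_factory=list)

    def request(self, user, use):
        """Decide on a request and record it in the audit log, granted or not."""
        granted = use in self.policy.approved_uses.get(user, set())
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "dataset": self.policy.dataset_id,
            "user": user,
            "use": use,
            "granted": granted,
        })
        return granted


# Made-up user names and use labels, purely for demonstration.
policy = AccessPolicy("example-dataset", {"alice": {"general-research"}})
controller = AccessController(policy)
print(controller.request("alice", "general-research"))
print(controller.request("alice", "commercial"))
print(controller.request("bob", "general-research"))
print(len(controller.audit_log))
```

Denied requests are logged as well as granted ones, since an audit trail that records only successes cannot show who attempted access outside their approved use.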
3. Findability: Search and indexing of digital objects and data sets.
Large collections of digital objects are useless without the ability to query and discover objects of interest, e.g. data sets with particular attributes or content, and workflows that perform specific tasks. The Data Commons will need to define a common metadata format in support of indexing, search, and categorization of digital objects hosted by or accessible to the Commons. Moreover, the Data Commons must support mechanisms for indexing and searching the content of data sets in flexible and robust ways.
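To make the role of a common metadata format concrete, here is a small Python sketch of an inverted index over metadata fields. The field names (`type`, `assay`, `task`) and object identifiers are invented for the example and do not reflect any metadata format chosen by the Data Commons; searching the content of data sets would need considerably more than this.

```python
from collections import defaultdict


class MetadataIndex:
    """Toy inverted index from (field, value) pairs to object identifiers."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, object_id, metadata):
        """Index every metadata field of an object, case-insensitively."""
        for key, value in metadata.items():
            self._index[(key, str(value).lower())].add(object_id)

    def search(self, **criteria):
        """Return ids of objects that match every field=value criterion given."""
        matches = [
            self._index.get((key, str(value).lower()), set())
            for key, value in criteria.items()
        ]
        return set.intersection(*matches) if matches else set()


index = MetadataIndex()
index.add("obj-1", {"type": "dataset", "assay": "RNA-seq"})
index.add("obj-2", {"type": "workflow", "task": "alignment"})
index.add("obj-3", {"type": "dataset", "assay": "WGS"})

print(sorted(index.search(type="dataset")))
print(sorted(index.search(type="dataset", assay="rna-seq")))
print(sorted(index.search(type="workflow", task="alignment")))
```

Because data sets and workflows share one record shape, a single query interface can discover both, which is the point of agreeing on a common format before building search.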
4. Software stacks: Multiple robust and sustainable software stacks implementing Commons standards.
The standards and conventions described above will need to be implemented in “stacks” of software components and provided as a service. The DCPPC will implement several of these software stacks, each working across multiple clouds to host data, serve APIs, and execute workflows. These platforms will cooperate on the base layer of functionality and interoperability needed to achieve the core Commons goals (e.g. digital identifiers, authentication, search/indexing, and APIs) while supporting a broader range of “value add” functionality. Within the 180-day period, these platforms will primarily serve demonstration projects that prototype core functionality; over the four-year period, they will mature into full independent and interoperable implementations of Data Commons stacks.
5. Data use, standards: Standard application interfaces.
Workflow execution, user authentication, data access, data retrieval, search, and other functionality may be performed by computational applications built within and on top of the software stacks, and stacks will also need to communicate with each other. This requires a set of standardized interfaces by which applications within the Data Commons can pass data, computational pipelines, and access authorization to each other. These interfaces must all support the identifiers and access permissions and protocols defined as part of the Data Commons. In the first 180 days, we will identify the core interfaces needed for the Data Commons, and over the next four years extend them to support multiple applications that use the stacks to provide functionality by building on top of the core platform.
6. Use cases: Use case library.
The Data Commons must meet the needs of many different types of end users, from biomedical clinicians to research scientists to data scientists and software developers. To define and explore classes of users and their computational research needs, the DCPPC will build a library of use cases and personas: “personas” are potential users and their research goals, while the “use cases” describe the actions and technical mechanisms by which they will achieve these goals. This indexed, searchable library will be an expanding collection of use cases and personas that encompass the desired functionality of the DCPPC from an end-user perspective, and serve as touchpoints with the external user community as well as developer guidance within the DCPPC. We will also invite the larger user community to submit new use cases to be incorporated into the library.
7. Community: Community engagement and support across multiple levels of expertise.
We will ensure that the Data Commons is fully utilized by the biomedical research community by introducing it early and often. Within the first 180 days, we will run brainstorming, discussion, and demonstration workshops to connect with the TOPMed, GTEx, and AGR user communities, as well as reach out to method development communities that may want to bring their data analysis pipelines to the Data Commons. We also plan to connect with other Commons efforts (at the NIH and elsewhere) to discuss community standardization and platform interoperability across these efforts. Once we have an initial set of functionality implemented in the Data Commons, we will recruit biomedical users to test out searching and sharing, data scientists to conduct large-scale analyses, and computational tool and pipeline developers to add new functionality. Over the four years, we will provide training (workshops, webinars) and technical support (help desk, issue tracking) as well as a supportive online community (e.g. a user forum), developer “hack days”, and social media engagement. We will also reach out to other communities with large high-value data sets that could add synergy to the existing data sets through combined analysis, and make arrangements to host them on the Data Commons platform.
8. Community: Governance, membership, and coordination.
The Consortium needs formal governance procedures in order to chart the course of the Data Commons and the DCPPC over the first 180 days as well as the remaining four years of the Pilot Phase. The DCPPC will need to support a range of decision-making processes, including choosing Milestones and deciding whether to include specific functionality. The DCPPC will also need to maintain a flexible and open internal culture in order to coordinate and build products. Supporting this requires investment in technical infrastructure such as mailing lists, messaging platforms, and shared development locations, while also building social norms of direct contact between Consortium members. Finally, the DCPPC will need to adopt existing technology, do engineering testing between DCPPC products, and onboard new members and partners. The governance and coordination structures must be dynamic in order to support this flexibility.
9. Evaluation methods and metrics.
We plan a culture of frequent release of products, with small iterations, routine evaluation and redesign; this is now standard practice in many IT communities to ensure that functionality meets user needs. We will also establish regular evaluation checkpoints at which the DCPPC discusses demonstration projects, negotiates stack interoperability, and connects with external evaluators. Within the first 180 days, most of our evaluation metrics will be focused on demonstration projects, but we will expand our evaluation efforts to be more holistic as functionality expands.
10. FAIR guidelines and metrics.
Our guiding principles for data access, use, and reuse will adhere to the Findable, Accessible, Interoperable, and Reusable (FAIR) guidelines. While the FAIR guidelines are well-accepted, bringing them into practice will require defining and adopting community-based metrics and rubrics so these can be applied to data, and other types of digital objects, hosted within or available through the Data Commons. At the same time, once FAIR metrics and rubrics are defined, these will be used to measure the level of “FAIRness” of repositories, datasets, and other digital objects. Such evaluations will inform and engage both Data Commons users and digital object producers.
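A rubric of this kind could, for instance, be expressed as a set of named checks applied to an object's metadata record. The Python sketch below assumes four placeholder checks and metadata field names; these are invented for illustration and are not community-defined FAIR metrics, which the text notes still need to be defined and adopted.

```python
def fairness_score(metadata, rubric):
    """Return the fraction of rubric checks passed, plus the per-check results."""
    results = {name: bool(check(metadata)) for name, check in rubric.items()}
    score = sum(results.values()) / len(results)
    return score, results


# Placeholder checks, one per FAIR principle; real metrics would come from the community.
rubric = {
    "findable: has identifier": lambda m: bool(m.get("identifier")),
    "accessible: has access URL": lambda m: bool(m.get("access_url")),
    "interoperable: declares format": lambda m: bool(m.get("format")),
    "reusable: declares license": lambda m: bool(m.get("license")),
}

# Hypothetical record: no access URL, and an empty license field.
example_record = {"identifier": "dc:example", "format": "VCF", "license": ""}

score, results = fairness_score(example_record, rubric)
print(score)
for name, passed in results.items():
    print(name, passed)
```

Returning the per-check results alongside the aggregate score lets a digital object producer see which specific criteria to address, rather than only a single number.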