Executive Summary

A formidable number of large biomedical datasets are hosted in repositories across the Internet. These data are a boon for researchers performing global analyses to better understand the relationships among human biology, health, and disease, and for experimentalists interpreting their own results in the context of these resources. These vast datasets also pose huge challenges. It can be difficult even to find, much less download, data scattered across so many repositories. Moreover, as datasets grow in scale, downloading becomes impractical in terms of cost (storing multiple copies of large datasets is wasteful), accessibility (few researchers have the necessary computational infrastructure), and security (many laboratories lack state-of-the-art security and access control). To address these concerns, the NIH has formed a consortium of teams to foster the exchange and creation of knowledge via interoperable multi-cloud technologies. Data, tools, and knowledge will exist in an integrated environment built on collaborative, best-of-breed technologies that foster a unifying culture supporting research and educational excellence. We envision a future in which NIH research is performed by dynamic investigative and learning communities with cost-conscious access to NIH-funded information, unfettered by hardware restrictions.

To this end, the Data Commons Pilot Phase Consortium (DCPPC) has established a thriving network of research groups to promote consortium building and cross-team engagement. We employ agile development, supporting demonstrations of experimental functionality and regular brainstorming. Our outreach effort engages a steadily growing circle of stakeholders and potential contributors, and we work closely with Data Stewards who have deep domain expertise in the datasets hosted by the Commons. Several DCPPC research groups are addressing important Key Capabilities (KCs), continually integrating and testing their products with the four production-level stack providers ("full stacks"). Open development provides transparency and rapidly transfers our successes across the entire group. Our project management systems track and document the outcomes of all work, and we employ a Request for Comments (RFC) system in which both technical details and broad, cross-cutting issues are evaluated and ratified.

In its first incarnation, the Commons will be composed of four stacks incorporating products from the KCs, with data from three large resources available through all of these systems. Users will be able to log into any of these systems with a single set of credentials and securely access the appropriate data within all stacks, on multiple cloud providers. Users will also be supplied with a basic set of applications that they can expect to run the same way on every stack. This will be possible as our consortium advances the policies and protocols for accessing human subjects data; supports global identification, indexing, and searching of available datasets; provides a collection of computational pipelines that can be applied to those datasets; uses standards to globally identify and access datasets for analysis, along with the software (including tools and workflows) used in the analysis; creates policies for data citation, reuse, and reproducibility; and enables researchers to bring their own data and computational pipelines into the cloud.
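
To make these interoperability goals concrete, the sketch below illustrates in Python what globally identified, credential-portable data access across stacks might look like. It is purely illustrative: the index URL, the GUID, and the token are hypothetical placeholders, and the response shape follows the GA4GH Data Object Service convention only as an assumption, not as a ratified Commons interface.

    # Hypothetical sketch of cross-stack data access via a global identifier.
    # Every concrete value here (endpoint, GUID, token) is a placeholder,
    # not a real Commons service.
    import requests

    INDEX_URL = "https://index.example-commons.org/ga4gh/dos/v1/dataobjects"  # assumed resolver
    GUID = "dg.EXAMPLE/11111111-2222-3333-4444-555555555555"  # assumed globally unique identifier
    TOKEN = "example-token"  # single credential honored by every stack (assumption)

    headers = {"Authorization": f"Bearer {TOKEN}"}

    # Resolve the GUID to one or more cloud replicas; the same object may be
    # mirrored across providers, so the caller can pick the cheapest or
    # closest copy.
    record = requests.get(f"{INDEX_URL}/{GUID}", headers=headers)
    record.raise_for_status()
    urls = [u["url"] for u in record.json()["data_object"]["urls"]]

    # Fetch the bytes from the chosen replica using the same credential.
    data = requests.get(urls[0], headers=headers).content
    print(f"retrieved {len(data)} bytes from {urls[0]}")

In this picture, each full stack would implement the same resolution and authorization interfaces, so an identical script could run unchanged on any of them.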