Utility of the Data Commons

The user experience. What do the components of the Commons described above mean for users? The opportunities for data-driven and data-intensive biomedical research are immense. The ability to explore individual research questions using vast online databases, the promise of increasingly broad and deep integrative analyses, and the ability to repurpose existing data for new questions will transform biomedical research and translational medicine. We will not lack for datasets hosted on cloud-based systems; the risk is a fractured infrastructure that blocks users from working across these huge resources. The stack infrastructures currently used by the NIH, while highly successful as stand-alone systems, do not readily enable users to exchange data between stacks or to share analysis pipelines across more than one stack. The Commons pilot project has been established to remove the obstacles that prevent users from operating on multiple infrastructures.

Consider that it is trivial for users of Android phones or iPhones to perform the same Internet searches, access the Uber app, or use the same browser on either system. The Commons is meant to establish an ecosystem that works the way smartphones do now. A smartphone user can call anyone regardless of whether that person has an Android phone or an iPhone; likewise, Commons users will be able to access any data set regardless of the stack they use. A smartphone user can run the Uber app on any phone; a Commons user will be able to run the same pipeline on any stack.

One of the most important elements of the Commons is that it is specifically designed to support multiple stacks. By adopting federated authentication, authorization, and accounting systems, universal identifiers for data and software, and other important standards, we will reduce the barriers currently faced by users of individual stacks such as GDC, AnVIL, and Firecloud.

Our consortium will also advance policies and protocols for accessing human subjects data; indexing and search of available data sets; a collection of computational pipelines that can be applied to those data sets; data standards to connect data sets between analyses; policies for data citation, verification, and reuse; and computational “stacks” of software that provide Web sites and developer interfaces for analysis. Together, these will enable researchers to bring their own data and computational pipelines into the cloud, use them there, and make these resources available to others.

The NIH experience. The Commons will have a tangible, positive impact for NIH staff who manage grant and contract awards. We expect that the Commons will host datasets generated by their awardees. In cases where larger data hosting infrastructures need to be established (e.g., the next TOPMed), the stack providers will be able to draw upon standards, software systems, cloud services, and user access systems developed by the DCPPC. This will reduce the cost of developing components such as user authentication systems and the data models used to host primary data and metadata, as well as other infrastructure costs.

The experience of the community of resource providers. In future years we will also serve as a focal point for communication across existing projects such as the GDC (NCI), AnVIL (NHGRI), Kids First, STAGE (NHLBI), and All of Us (Common Fund). We will hold yearly conferences involving plenary presentations, hackathons, breakout sessions, and poster sessions. The events will be designed to maximize exposure of the latest technical and policy developments. Conferences such as these will be essential to creating effective, open interoperation between these projects at the grassroots level.