The Data Commons Architecture
At present, several NIH initiatives use sets of integrated services, software, and storage (collectively referred to as "stacks") that operate on cloud systems. The Data Commons Pilot is composed of four stacks that incorporate architectural guidance and services provided by the KCs, and data from three large resources will be hosted on all four. Users will be able to log into any of these stacks with a single set of credentials, and to find and access data in all of them across multiple cloud platforms. Users will also be supplied with a basic set of applications that are expected to run the same way on every stack, with the results of computations fully shareable across the stacks.
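As an illustration only, the single-credential, cross-stack lookup described above can be modeled with a small in-memory sketch. Everything here is hypothetical: the `Credential` and `Stack` classes, the `find_dataset` helper, and the stack, platform, and dataset names are placeholders invented for the example, and none of them reflects an actual Commons API or deployment.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Credential:
    """One login shared by every stack (hypothetical model)."""
    user: str
    token: str


@dataclass
class Stack:
    """Toy stand-in for one stack: a name, a cloud platform, and hosted data."""
    name: str
    platform: str
    datasets: set = field(default_factory=set)
    trusted_tokens: set = field(default_factory=set)

    def search(self, cred: Credential, dataset_id: str) -> bool:
        # Each stack checks the same shared credential instead of its own login.
        if cred.token not in self.trusted_tokens:
            raise PermissionError(f"{cred.user} is not authorized on {self.name}")
        return dataset_id in self.datasets


def find_dataset(stacks, cred, dataset_id):
    """Return the names of all stacks that host dataset_id."""
    return [s.name for s in stacks if s.search(cred, dataset_id)]


# Assumed setup: two stacks on different platforms trusting the same token.
shared_tokens = {"demo-token"}
stacks = [
    Stack("stack-a", "cloud-1", {"dataset-x", "dataset-y"}, shared_tokens),
    Stack("stack-b", "cloud-2", {"dataset-x"}, shared_tokens),
]
alice = Credential("alice", "demo-token")
hits = find_dataset(stacks, alice, "dataset-x")
print(hits)
```

The point of the sketch is that the caller authenticates once and queries every stack through the same interface, which is the opposite of the per-site logins that the existing stacks require.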
Why Multiple Stacks?
Stacks have been deployed for several projects, including the Genomic Data Commons, Globus, AnVIL, and TCGA on FireCloud. The Commons consortium has written a longer justification for multiple stacks; briefly, the concern is that, in their current form, the stacks tend to operate as walled gardens. Users must use a separate login system to access each site. Sharing data is difficult, and few analysis services (e.g. a variant-calling pipeline on genomic sequences) will work across many of these resources. By funding multiple stacks as part of the Commons Pilot, the NIH is directly funding teams to pilot methods for interoperability and sharing while preserving each team's existing unique and creative approaches. Collectively, this approach will enable a larger number of teams to accelerate innovation, while also creating a more sustainable model in which multiple groups contribute to common services and more stable infrastructure. All of this will matter to the research community, whose members have diverse research needs and levels of technical experience and who will ultimately be able to select from a broad set of options from the Commons.