
Why Multiple Stacks Are Necessary

Brainstorming: Why are multiple stacks an important and necessary part of the NIH Data Commons Pilot Phase Consortium?

(Team Helium, Team Copper, Team Phosphorus)

Contact points: Stan Ahalt & Titus Brown

Aug 2018

Executive Summary

At present, several NIH initiatives utilize services, software, and storage (collectively referred to as a "stack") that operate on cloud systems. Stacks have been deployed for several projects, including the Genomic Data Commons, AnVIL, and TCGA on FireCloud. In their current form these stacks tend to operate as walled gardens. Users are required to use separate login systems to access each site. Sharing data is difficult, and very few analysis services (e.g., a variant-calling pipeline on genomic sequence) will work across any of these resources. By funding multiple stacks as a part of the Commons Pilot, the NIH is directly funding teams to pilot methods for interoperability and sharing, while maintaining the existing unique and creative approaches of each team.

One important purpose of the Commons Pilot is to collectively agree on a set of best practices across the four stacks to eliminate barriers to accessing, sharing, and analyzing biomedical data. The DCPPC is working to address a number of these barriers. We are currently implementing a system that will enable login to each stack using a single set of credentials, support user access to all data on each stack regardless of hosted location, and supply a basic set of applications that will run on all systems with the same results. The stack groups participating in the Commons Pilot are also involved in many other NIH-funded projects that provide biomedical data to the research community. For this reason they are ideal candidates to develop the standards, data handling, and user authentication procedures required for robust interoperation in the Commons.
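The three goals above (one set of credentials, data access regardless of hosting location, and identical application results on every stack) can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Stack` class, its methods, the token string, and the dataset identifiers are all hypothetical and do not correspond to any DCPPC interface or to any team's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Stack:
    """Toy stand-in for one stack (hypothetical; not a DCPPC API)."""
    name: str
    trusted_tokens: Set[str]             # credentials from a shared identity provider
    local_data: Dict[str, List[float]]   # data sets hosted on this stack
    peers: List["Stack"] = field(default_factory=list)

    def login(self, token: str) -> bool:
        # Goal 1: every stack accepts the same set of credentials.
        return token in self.trusted_tokens

    def fetch(self, data_id: str, token: str) -> List[float]:
        # Goal 2: data is reachable no matter which stack hosts it.
        if not self.login(token):
            raise PermissionError(f"{self.name}: unknown credential")
        if data_id in self.local_data:
            return self.local_data[data_id]
        for peer in self.peers:
            if data_id in peer.local_data and peer.login(token):
                return peer.local_data[data_id]
        raise KeyError(data_id)

    def run(self, app: Callable[[List[float]], float],
            data_id: str, token: str) -> float:
        # Goal 3: the same application yields the same result everywhere.
        return app(self.fetch(data_id, token))


def mean(values: List[float]) -> float:
    """A trivial 'analysis application' used for the demonstration."""
    return sum(values) / len(values)


# Assumed setup: two stacks that trust the same identity provider.
shared_tokens = {"alice-token"}
stack_a = Stack("stack_a", shared_tokens, {"dataset-1": [3.0, 1.0, 2.0]})
stack_b = Stack("stack_b", shared_tokens, {"dataset-2": [10.0, 20.0]})
stack_a.peers.append(stack_b)
stack_b.peers.append(stack_a)

result_on_a = stack_a.run(mean, "dataset-2", "alice-token")
result_on_b = stack_b.run(mean, "dataset-2", "alice-token")
```

In this sketch, `result_on_a` and `result_on_b` are equal even though `dataset-2` is hosted only on `stack_b`. A real deployment would replace the shared token set with a federated identity service and the peer lookup with a common data-resolution standard.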

The resources provided by NIH through the Commons are an effective mechanism to motivate development of standards-compliant systems, both by supplying stacks with resources and by making collaboration a central metric for successful participation in the project. In addition to investing in interoperability solutions, the stacks are individually improving their own functionality, with the aim of increasing efficiency, addressing specific domain needs, or leveraging new technologies. This will result in a healthy, non-destructive form of competition that will stimulate innovation, increase services to users, and lower costs.

In short, users will be able to share data across multiple stack systems, eliminating productivity bottlenecks, yet will have opportunities to leverage different and nascent functionality that best fits their particular scientific need. Another advantage for the NIH is that we will achieve greater cost-effectiveness as new data sets are added to the Commons, because fewer resources will be required to create a supporting infrastructure to host data. Furthermore, the formation of a standards alliance to create a global framework across stacks that are funded by NIH (as well as other agencies) is simply good stewardship of government resources.

In summary, the multiple stack strategy of the Data Commons means that:

  1. The stacks involved in the Data Commons are themselves already stack providers for many other projects. Recruiting these particular stacks for this larger undertaking is a very sensible form of stewardship, ensuring that each group will define and participate in the standards that will be developed in our project.

  2. Multiple options for investment by partners: potential partners for the NIH Data Commons can choose between multiple existing stacks and their user experience, support models, etc.

  3. Practical demonstration of interoperability / n+1 capability

    1. Without two or more stacks, interoperability is a theory, not a fact.

    2. Practical interoperability is critical for the ability to add new data sets, methods, and infrastructure implementations in a permissionless way.

  4. Development of a community of practice and adoption or evolution of standards.

    1. A community of practice is a group of people who share a concern or a passion for something they do, and learn how to do it better as they interact regularly.

    2. Adoption and/or evolution of standards depends upon the development of reasonable practice first, as complete details and edge cases cannot necessarily be reasoned out in advance.

  5. Coopetition (cooperation + competition)

    1. We will end up with a much better overall suite of functionality if we have multiple different groups implementing their vision.

    2. We will also not be “captured” by a single group’s billing model, set of functionality, etc.

  6. Practical demonstration of “Commons compliance”

    1. Similar to interoperability and standards, but also important: “if you want to join the Commons, we know you can do it because others have.”
  7. At least one implementation should be completely open source, patent-free, and deployable by others.

    1. This is important for “permissionless innovation” and enabling the full power of an operational Data Commons.
  8. Beyond functionality, different full stacks will explore different operational models for cloud billing, user onboarding, engagement, training, and user interfaces. This is valuable for developing maturity in this space.

  9. There are two aspects to standards and interoperability: retrospective and prospective. In the short term the focus is largely retrospective, because we have to work with data that often was not set up for any type of standard, and with algorithms, workflows, and technology components that also were not set up for standards and interoperability. This is one reason multiple stacks are important: they address different aspects of these issues and apply multiple types of technology to solve them. Prospectively, we could imagine that standards will be demanded and will be easier to implement.

    1. This may be encompassed in the other points.
  10. Diverse commons implementations provide fundamental virtues to research and to the management of research programs:

    1. Scientific and Technological Invention and Dissemination:

      In the short term, each stack will develop methods other stacks don't offer as a function of exposure to different requirements and diverse team composition. These differentiators may be anything from a better peak-calling algorithm to a better approach to storing intermediate files in a workflow. But our overarching research aim is to publish and openly disseminate our best ideas. As a result, the best scientific and technological ideas with broad applicability will, in general, eventually become widely adopted conventions or officially ratified standards. All stacks will likely adopt winning technologies and scientific methods over time. So while the differentiation is not durable, multiple stacks play a critical role in generating innovative ideas and in serving as an evolution-like system for the selection of the strongest traits without top-down control.

    2. Parallel Innovation:

      If innovation is "a search for designs across a universe of components" [1], we believe that search will move more quickly if it's executed in parallel. One stack working in isolation is best suited to building a siloed, monolithic commons. It will be attuned to the requirements to which it is exposed and constrained by the technical skill set of its team. Multiple stacks work on the problem in parallel, building open, interoperable components. They are able to explore multiple scientific and technological possibilities simultaneously. Serendipitous invention is driven by allowing diverse communities to explore alternative spaces while also interacting. Developing multiple stacks in parallel will increase the rate at which the universe of useful approaches can be explored and effectively evaluated.

      [1]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5722871/

    3. Risk Mitigation

      Having a single stack strategically disadvantages funding agencies and other investors. If the single funded stack is ineffective in some area, whether for geographical, technological, social, financial, or other reasons, the research program is encumbered. It behooves researchers and program administrators to have choice, not only at the granularity of stacks but also at the granularity of ideas and components. Through judicious use of standards, the program will benefit from the best ideas generated in parallel by collaborating and competing teams bringing all of their attendant diversity to bear.

    4. Diversity in maturity of platforms.

      Seven Bridges and Calcium both have mature “tried and true” platforms that they are adapting to fit the needs of the Data Commons. Relative to their platforms, our perception is that Team Helium, for example, is focused on assembling tools that need adaptation but offer flexibility. This approach gives us an opportunity to show NIH what it looks like to integrate solutions to address needs through both the development of a new platform and the adaptation of existing platforms.

  11. We draw a parallel between the importance of biodiversity in ecosystems and diversity of solutions and ideologies in the Commons ecosystem.

    1. In business, differentiation is about highlighting the ways a product is different from another in order to create a sense of “value” and compete. In the Commons we want to enable an ecosystem of diverse and complementary ideas and solutions that interact with each other to create more value.

    2. In other words, in the stacks, differentiation is durable by virtue of their ever-evolving diversity. Cross-pollination of ideas and solutions does not always translate to full cross-adoption; a solution or idea, however great, may only work on one stack, or may be implemented in different ways.

    3. We don’t expect stacks to always adopt and duplicate only winning technologies. What matters is that we inculcate a culture and behavior design that encourages teams and stacks to build on and leverage those great ideas to produce greater value for the user community.

    4. The greater the “(bio)diversity” in an ecosystem, the more likely the system is to adapt to small and progressive changes without fear of collapsing.

    5. Therefore, if we want the Commons to serve as a technological foundation (platform) on which innovation will thrive in a sustainable manner, we must:

      1. "Diversify" the ideological engines that drive its direction through the support of a multiplicity of solutions, stacks and teams,

      2. "Facilitate the co-existence of these solutions" through interoperability and open standards; and,

      3. "Promote their interactions" to amplify the network effect through behavior design and community culture [Five Drivers of a Platform Scale]