
DCPPC Project Execution Plan

THIS DOCUMENT IS NOW READ-ONLY.

DCPPC Project Execution Plan

Overarching Goals of the DCPPC

Executive summary

DCPPC Project Management

DCPPC Objectives: Planned Project Deliverables

Issues to resolve

Overarching Goals of the DCPPC

Preface

The members of the Data Commons Pilot Phase Consortium (DCPPC) were brought together to leverage their expertise to collaboratively build a Data Commons. We aim to achieve the stated goals within RM-17-026 that describe the overarching principles for this ambitious new data ecosystem, and thus to make it easier and more efficient for the broad biomedical research community to use NIH datasets to advance discovery. This is the first version of a consortium-level Roadmap for the DCPPC.

The intent of this Roadmap is to outline a plan to build, prototype, and test, and, by the end of the first 180 days, to determine a path for moving forward and creating an NIH Data Commons. Activities within the 180-day pilot period will include the development of demos and early-stage products that will clarify and guide the policy and technology choices for the NIH Data Commons. This Project Execution Plan summarizes consortium-level goals, objectives, and touchpoints that will shape milestones and guide the work of the DCPPC.

Scientific Rationale

The NIH continues to fund the generation and maintenance of a large volume of biomedical data sets and knowledge resources whose reuse and impact is either suboptimal or difficult to assess. While the features that promote reuse have been well described at a high level (e.g., FAIR), to date there have not been systematic and comprehensive ways to quantify and improve the reusability of key data assets. These data sets, which include sequence and genetics data; gene, protein and metabolite expression data; registry and patient-reported outcomes data; imaging data; clinical trials results; and many other types of data, are currently stored in repositories and databases that are difficult or impossible for researchers to access and are large enough that downloading them to external compute platforms is time- and cost-prohibitive. Moreover, these datasets, even if findable and accessible, are often underutilized because of barriers to interoperability and reuse. Researchers must be able to more easily use such data—in compliance with participant consent or applicable waivers or exemptions—to formulate and test research hypotheses, explore large-scale patterns, and accelerate biomedical discovery. Researchers should also be able to better discover and re-use software and data produced by other researchers. The goal of the NIH Data Commons Consortium is to facilitate discoveries by providing a cloud-based platform upon which investigators can store, share, access, and compute with digital objects (e.g., data, software, workflows) as part of their scientific research.

DCPPC Mission Statement

Our mission is to develop tools and systems to support a trans-NIH data ecosystem that facilitates the broad use and re-use of biomedical data; develop and disseminate analysis methods and software; and enhance training relevant to data analysis and data reuse. Over the next four years, we will provide a Commons environment that supports interconnection among diverse data sets and visualization and analytic tools. We will develop best practices for operating in this environment and provide guidance towards new interoperation standards. The NIH Data Commons will include a policy and procedural framework that appropriately addresses Ethical, Legal and Social Implications (ELSI) of human subjects research, privacy issues, and data security matters. The NIH Data Commons will serve a diverse set of end users, including basic research scientists, computational biologists, clinician scientists, and programmatic users. We will create an open infrastructure and engage a broad community to enable additional developers, end users, and data contributors to participate in the NIH Data Commons.

Chief among the principles of a Data Commons is the need for accessible data and tools distributed across a network of providers. Building a Data Commons will require community consensus on standards and conventions. In support of this, we will establish Consortium governance principles and coordination guidelines to foster consensus across the DCPPC in the design and operationalization of the NIH Data Commons. We also will promote internal testing as well as engagement with the broader NIH research community in order to evaluate the NIH Data Commons.

Why this project should begin rather than wait for more refinement.

The DCPPC objectives to be undertaken in the coming months are ambitious. We strongly advocate moving forward for several reasons. One reason is simple: many of the personnel involved in generating reports for the DCPPC are currently working without support, and content generation should not continue at its current level without full support for each member of the DCPPC teams. It is understood that NIH had high expectations that the DCPPC generate a unified action plan in order to proceed; we should also recognize that expectations will inevitably hit an inflection point at which consortium members cannot continue without compensation for their effort. In addition, the DCPPC has demonstrated an ability to produce on short timelines and deliver what is requested of the group, so we can reasonably expect that any course corrections needed to succeed will occur rapidly and without significant push-back. It is inherent in this project that we will engage in a continuous process of refinement and integration as development continues over the course of the 180-day pilot period. We are confident that, collectively, we can establish consensus and complete the activities stated in this plan, and that we will implement a set of high-level, consortium-wide project goals that drive discussion, consensus, and the development of a common set of tools and systems. Again, it is imperative to begin. Once we initiate this ambitious project, we will use agile project management to generate products and demos while evolving our activities to mitigate high-risk issues and technical bottlenecks.

Terms

We define the following terms for the purpose of this document.

  • User story: a description of a software feature from an end user perspective.

  • End-user narrative: a path through several user stories that proposes a particular order of events and their relationships, targeted at a particular end-user.

  • Demos: activities and documentation resulting from the DCPPC that build, test, and demonstrate completion of the goals of the Commons.

  • Products: resources resulting from the DCPPC, including standards and conventions, APIs, data resources, websites, repositories, documentation, and training/outreach materials.

  • Stack: term of art referring to a suite of services that run in the cloud and enable ubiquitous, convenient, on-demand access to a shared pool of configurable computing resources.

  • KC: ‘Key Capabilities’ originally referred to eight targeted development areas to implement the Data Commons. Each team defined milestones in their proposals designed to achieve these capabilities. The term ‘KC’ now applies to working groups that are collaboratively implementing their milestones.

  • Team: Groups of people led by a Principal Investigator (PI), or PIs, who will complete milestones. Each group has been assigned a name drawn from the elements of the periodic table.

  • Milestones: Required to implement activities such as Demos and Products. Milestones are tied to teams.

  • MVP: Minimum Viable Product. This term is deprecated and does not appear elsewhere in this document.

  • Sprints: Term of art used in software generation, referring to short, iterative cycles of development, with continuous review of code through daily builds and end-of-sprint demos.

Executive summary

Vision and Introduction

The opportunities of data-driven and data intensive biomedical research are immense: the ability to explore individual research questions using vast online databases, the promise of increasingly broad and deep integrative analyses, and the ability to repurpose existing data to new questions will transform biomedical research and translational medicine. However, these opportunities come with significant challenges: the data sets are too big to download; many computational tools don’t work with the data or metadata formats output by others; and local compute capacity is often too limited to meet dynamic research needs. Biomedical data is consistently failing to reach its full potential in basic research, clinical, and translational medicine.

One solution to these challenges is to host data and workflows in the cloud, where compute resources scale as needed and computational tools run close to the data in a standard environment. But this is only a part of the full solution: we need policies and protocols for accessing human-subjects data; indexing and search of available data sets; a collection of computational pipelines that can be applied to data sets; data standards to connect data sets between analyses; policies for data citation and reuse; and computational “stacks” of software that provide Web sites and developer interfaces for analysis. Moreover, researchers and consortia will want and need to bring their own data and computational pipelines into the cloud, both to use them and to make them available to others.

Within the first 180 days of this project, we will implement a Data Commons platform that will support biomedical data analysis in the cloud by building software infrastructure on top of standards, software pipelines, and existing data sets. This Data Commons effort will start by adopting standards and access protocols while building a software platform in support of the analysis of three high-value NIH data sets: TOPMed, GTEx, and the Alliance of Genome Resources (AGR). The resulting Data Commons platform will be a scalable cloud solution that hosts cloud-local data, supports many computational pipelines running in the cloud, and provides interactive data analysis workspaces. Over the four years of the Data Commons Pilot Phase and beyond, this Data Commons platform will expand to enable any biomedical research group to work effectively with data, improve reproducibility, and reduce redundancy in infrastructure and standards development.

This implementation will be done through a Data Commons Pilot Phase Consortium (DCPPC), consisting of NIH personnel, a number of academic and industry implementation teams, Data Stewards (representing the TOPMed, GTEx, and AGR consortia), and multiple commercial cloud providers.

Major Objectives of the Data Commons Pilot Phase Consortium

The Data Commons will require many policy and technology choices around interoperability, data access, functionality, and implementation. The goal of the DCPPC is to resolve many of these choices. We focus on a system to find, access, and use biomedical data, developing prototypes and case studies in the initial phase and engaging with the community. We have identified the following 10 Major Objectives:

1. Identifiers for data: Interoperable global unique identifier system for digital objects.

The Data Commons will be a repository for computational pipelines implementing analysis workflows, and data from many different sources. Researchers and software working within the Data Commons must be able to refer to digital objects via unambiguous identifiers, consume metadata about each object, and resolve identifiers into data and analysis pipelines. New identifiers will need to be minted as new objects enter the Data Commons, and these identifiers will need to be resolvable across all Data Commons-compatible platforms. Therefore, a major objective of the DCPPC is to specify and implement an interoperable system for global unique identifiers for digital objects.
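
To make this requirement concrete, the sketch below shows how a client might mint and later resolve a GUID against a Commons resolver service. This is a minimal illustration: the resolver.example.org host, the endpoint paths, and the field names are assumptions made for the sake of the example, not a specification adopted by the DCPPC.

```python
import requests

RESOLVER = "https://resolver.example.org"  # hypothetical Commons resolver endpoint


def mint_guid(checksum, size, urls, token):
    """Register a new digital object and receive a globally unique identifier."""
    payload = {
        "checksums": [{"type": "sha-256", "checksum": checksum}],
        "size": size,
        "urls": urls,                       # cloud locations of the object
        "metadata": {"created_by": "dcppc-demo"},
    }
    resp = requests.post(
        f"{RESOLVER}/objects",
        json=payload,
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    return resp.json()["guid"]              # e.g. "dcppc:abc123..."


def resolve_guid(guid):
    """Resolve a GUID into its metadata and cloud access URLs."""
    resp = requests.get(f"{RESOLVER}/objects/{guid}")
    resp.raise_for_status()
    record = resp.json()
    return record["urls"], record["checksums"]
```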

2. Data access: Authentication and authorization policies and protocols for controlled access to digital objects and derivatives.

The Data Commons will provide access to large collections of human-subjects data, and must support controlled access to and use of this data. Data Commons-compliant platforms must be able to provide access to this data by user and use, track access, allow audit trails of amalgamated data sets, and provide a robust policy vocabulary for these operations that is shared across platforms. The DCPPC will work with the NIH and NIH grantees to integrate with existing policies and define computational protocols for access to human-subjects data via the Data Commons.
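
As an illustration of the intended flow, the following sketch shows a client obtaining a short-lived bearer token and asking a policy service whether a given action on a digital object is permitted. The auth.example.org host, endpoint paths, and response fields are illustrative assumptions; the actual protocols will be defined with the NIH and may follow existing standards such as OAuth2.

```python
import requests

AUTH = "https://auth.example.org"  # hypothetical Commons authorization service


def get_access_token(client_id, client_secret):
    """Exchange service credentials for a short-lived bearer token (OAuth2-style)."""
    resp = requests.post(
        f"{AUTH}/oauth/token",
        data={"grant_type": "client_credentials",
              "client_id": client_id,
              "client_secret": client_secret},
    )
    resp.raise_for_status()
    return resp.json()["access_token"]


def check_authorization(token, guid, action="read"):
    """Ask the policy service whether the bearer may perform `action` on the object.

    A real Data Commons service would also record each decision in an audit trail
    so that access to controlled human-subjects data can be reviewed later.
    """
    resp = requests.get(
        f"{AUTH}/authorize",
        params={"object": guid, "action": action},
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    return resp.json().get("allowed", False)
```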

3. Findability: Search and indexing of digital objects and data sets.

Large collections of digital objects are useless without the ability to query and discover objects of interest, e.g. data sets with particular attributes or content, and workflows that perform specific tasks. The Data Commons will need to define a common metadata format in support of indexing, search, and categorization of digital objects hosted by or accessible to the Commons. Moreover, the Data Commons must support mechanisms for indexing and searching the content of data sets in flexible and robust ways.
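
The sketch below illustrates what a minimal core metadata record and a faceted search call might look like. The field names and the index.example.org endpoints are placeholders, not the common metadata format the Consortium will define.

```python
import requests

INDEX = "https://index.example.org"  # hypothetical Commons metadata index

# A minimal core metadata record of the kind a shared specification might require.
# The field names here are illustrative, not the DCPPC's agreed metadata model.
record = {
    "guid": "dcppc:abc123",
    "title": "TOPMed Phase 1 WGS CRAM (subset)",
    "data_type": "whole-genome sequence",
    "assembly": "GRCh38",
    "access": "controlled",
    "keywords": ["TOPMed", "WGS", "cardiovascular"],
}


def index_record(record, token):
    """Submit a metadata record for indexing."""
    resp = requests.post(f"{INDEX}/records", json=record,
                         headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()


def search(query, **facets):
    """Full-text search with optional facet filters, e.g. data_type='RNA-seq'."""
    resp = requests.get(f"{INDEX}/search", params={"q": query, **facets})
    resp.raise_for_status()
    return resp.json()["hits"]
```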

4. Software stacks: Multiple robust and sustainable software stacks implementing Commons standards.

The standards and conventions described above will need to be implemented in “stacks” of software components and provided as a service. The DCPPC will implement several of these software stacks, each working across multiple clouds to host data, serve APIs, and execute workflows. These platforms will cooperate on the base layer of functionality and interoperability needed to achieve the core Commons goals (e.g., digital identifiers, authentication, search/indexing, and APIs) while supporting a broader range of “value add” functionality. Within the 180-day period, these platforms will primarily serve demonstration projects that prototype core functionality; over the four-year period, they will mature into fully independent and interoperable implementations of Data Commons stacks.

5. Data use, standards: Standard application interfaces.

Workflow execution, user authentication, data access, data retrieval, search, and other functionality may be performed by computational applications built within and on top of the software stacks, and the stacks will also need to communicate with each other. This requires a set of standardized interfaces by which applications within the Data Commons can pass data, computational pipelines, and access authorization to one another. These interfaces must all support the identifiers, access permissions, and protocols defined as part of the Data Commons. In the first 180 days, we will identify the core interfaces needed for the Data Commons, and over the next four years we will extend them to support multiple applications that build on top of the core platform and use the stacks to provide functionality.
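
As a sketch of what such a standardized interface could look like in practice, the following submits a CWL workflow to a GA4GH WES-style run-execution endpoint and polls its status. The host is hypothetical, and the paths and form fields follow the public WES draft only loosely; the interfaces actually adopted by the Commons may differ.

```python
import json
import requests

WES = "https://stack-a.example.org/ga4gh/wes/v1"  # hypothetical stack endpoint


def submit_run(workflow_url, params, token):
    """Submit a CWL/WDL workflow to a WES-style execution service.

    Field names roughly follow the public GA4GH WES draft; treat them as
    illustrative rather than normative.
    """
    resp = requests.post(
        f"{WES}/runs",
        headers={"Authorization": f"Bearer {token}"},
        data={
            "workflow_url": workflow_url,           # e.g. a workflow registered in Dockstore
            "workflow_type": "CWL",
            "workflow_type_version": "v1.0",
            "workflow_params": json.dumps(params),  # inputs referenced by GUID
        },
    )
    resp.raise_for_status()
    return resp.json()["run_id"]


def run_status(run_id, token):
    """Poll the state of a previously submitted run."""
    resp = requests.get(f"{WES}/runs/{run_id}/status",
                        headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()
    return resp.json()["state"]
```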

6. Use cases: Use case library.

The Data Commons must meet the needs of many different types of end users, from biomedical clinicians to research scientists to data scientists and software developers. To define and explore classes of users and their computational research needs, the DCPPC will build a library of use cases and personas: “personas” are potential users and their research goals, while the “use cases” describe the actions and technical mechanisms by which they will achieve these goals. This indexed and searchable library will be an expanding collection of use cases and personas that encompass the desired functionality of the DCPPC from an end-user perspective, and serve as touchpoints with the external user community as well as developer guidance within the DCPPC. We will also invite the larger user community to submit new use cases to be incorporated into the library.

7. Community: Community engagement and support across multiple levels of expertise.

We will ensure that the Data Commons is fully utilized by the biomedical research community by introducing it early and often. Within the first 180 days, we will run brainstorming, discussion, and demonstration workshops to connect with the TOPMed, GTEx, and AGR user communities, and will reach out to method development communities that may want to bring their data analysis pipelines to the Data Commons. We also plan to connect with other Commons efforts (at the NIH and elsewhere) to discuss community standardization and interoperability between Commons platforms. Once we have an initial set of functionality implemented in the Data Commons, we will recruit biomedical users to test out searching and sharing, data scientists to conduct large-scale analyses, and computational tool and pipeline developers to add new functionality. Over the four years, we will provide training (workshops, webinars) and technical support (help desk, issue tracking) as well as a supportive online community (e.g., a user forum), developer “hack days”, and social media engagement. We will also reach out to other communities with large, high-value data sets that could add synergy to the existing data sets through combined analysis, and make arrangements to host them on the Data Commons platform.

8. Community: governance, membership, and coordination.

The Consortium needs formal governance procedures in order to chart the course of the Data Commons and the DCPPC over the first 180 days as well as the remaining four years of the Pilot Phase. The DCPPC will need to support a range of decision making processes, including choosing Milestones and deciding whether to include specific functionality. The DCPPC will also need to maintain a flexible and open internal culture in order to coordinate and build products. Supporting this requires investment in technical infrastructure such as mailing lists, messaging platforms, and shared development locations, while also building social norms of direct contact between Consortium members. And, finally, the DCPPC will need to adopt existing technology, do engineering testing between DCPPC products, and onboard new members and partners. The governance and coordination structures must be dynamic in order to support this flexibility.

9. Evaluation methods and metrics.

We plan a culture of frequent release of products, with small iterations, routine evaluation and redesign; this is now standard practice in many IT communities to ensure that functionality meets user needs. We will also establish regular evaluation checkpoints at which the DCPPC discusses demonstration projects, negotiates stack interoperability, and connects with external evaluators. Within the first 180 days, most of our evaluation metrics will be focused on demonstration projects, but we will expand our evaluation efforts to be more holistic as functionality expands.

10. FAIR guidelines and metrics.

Our guiding principles for data access, use, and reuse will adhere to the Findable, Accessible, Interoperable, and Reusable (FAIR) guidelines. While the FAIR guidelines are well accepted, bringing them into practice will require defining and adopting community-based metrics and rubrics that can be applied to data and other types of digital objects hosted within or available through the Data Commons. Once FAIR metrics and rubrics are defined, they will be used to measure the level of “FAIRness” of repositories, datasets, and other digital objects. Such evaluations will inform and engage both Data Commons users and digital object producers.
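
A minimal sketch of how a FAIRness rubric might be applied programmatically to a digital-object metadata record is shown below. The criteria, field names, and equal weighting are illustrative placeholders; the real metrics and rubrics will come from the community-based work described above.

```python
# Illustrative FAIRness rubric applied to a digital-object metadata record.
# The criteria below are placeholders, not the KC1 community rubrics.

RUBRIC = {
    "findable":      lambda r: bool(r.get("guid")) and bool(r.get("title")),
    "accessible":    lambda r: r.get("access") in {"open", "controlled"},
    "interoperable": lambda r: bool(r.get("data_type")) and bool(r.get("format")),
    "reusable":      lambda r: bool(r.get("license")),
}


def fair_score(record):
    """Return per-criterion results and the fraction of criteria met."""
    results = {name: check(record) for name, check in RUBRIC.items()}
    overall = sum(results.values()) / len(results)
    return results, overall


record = {"guid": "dcppc:abc123", "title": "GTEx v7 RNA-seq counts",
          "access": "controlled", "data_type": "RNA-seq", "format": "tsv"}
print(fair_score(record))  # 'reusable' fails here because no license is declared
```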

180 Day Consortium Milestones and Timeline

Month 1:

PRODUCT An API registry for definition and discovery of APIs.

DEMO 1M.1 Searching: Web Interface to search subsets of curated phenotypic/clinical data

WORKSHOP Begin planning community engagement workshops (dates, times, and locations).

Month 2:

PRODUCT A draft DCPPC website for community engagement

DEMO 2M.1 Uploads and minting: Upload data, annotate, and mint GUIDs on at least one stack

DEMO 2M.2 Searching: Demonstrate interoperability of search with access controls across cloud environments

DEMO 2M.3 Searching: Search metadata across catalogs and data storage endpoints

Month 3:

PRODUCT A draft FAIR metadata specification to be supported by all digital objects

PRODUCT Access-restricted data sandbox for testing access control restrictions and APIs

DEMO 3M.1 Searching: Find TOPMed/GTEx/MOD data across multiple stacks using common GUID

DEMO 3M.2 FAIRness assessment: FAIR assessment API generates FAIRness reports for digital objects

DEMO 3M.3 Data analysis: Workflow is described, implemented, exchanged, and executed reproducibly across multiple Full Stacks

DEMO 3M.4 Analysis and Training: Web Interface to access tools and sample code to develop analyses

Month 4

PRODUCT A FAIR reporting and assessment tool for Commons-compliant digital object repositories

DEMO 4M.1 FAIRness assessment: Demonstrate FAIR assessment with test datasets

DEMO 4M.2 APIs: Portable workflow scheduled on two or more stacks using the same API

DEMO 4M.3 APIs: Interservice interoperability across stacks

DEMO 4M.4 Testing: TOPMed Open Sandboxes provide a technical environment to access TOPMed data

Month 5

PRODUCT API cross-platform compatibility test plan and test library

PRODUCT A draft plan for the 4 year Data Commons Pilot Phase Consortium

PRODUCT Delivery of first version of the Use Case Library

PRODUCT A draft policy for access to restricted data and management of derived results

PRODUCT A draft external user engagement plan for workshops, training, and outreach

DEMO 5M.1 Registration: Single shared sign-on for stacks and data access

DEMO 5M.2 Registration: Display audit trail for access to restricted data across stacks

DEMO 5M.3 Data analysis: User adds data and workflow to a stack to harmonize with NIH Data Commons-hosted data

Month 6

DEMO 6M.1 Data analysis: Run workflow with cost-aware data staging and provisioning

DEMO 6M.2 Data analysis: Run scalable, cost-controlled data analysis for data enrichment

DEMO 6M.3 Data analysis: Perform multi-cloud compute on consortia data and novel data according to user permissions.

DEMO 6M.4 Data analysis: Share and retrieve analysis results across stacks

DEMO 6M.5 MetaAPI: Deployed across selected APIs in the consortium

The Data Commons architecture

Caption: The Commons Technical Architecture. Users will interact with the software stacks via Web and programmatic interfaces. The stacks will supply services based on the standards and protocols implemented by the DCPPC, and use data/run computational processes on the commercial clouds.

DCPPC Project Management

Organizational Vision for the DCPPC

The DCPPC will operate with a dynamic, flexible project plan that will change course as needs arise and lessons are learned. We will focus on formal interactions with members of the DCPPC around shared public goals, and on a limited set of Design Guidelines (representative user epics) that will serve as anchors for the project. We will implement a set of high-level, consortium-wide project goals that will drive discussion, consensus, and development of a common vocabulary. We will build fit-for-purpose demos and deliverables based on DCPPC-generated requirements. We also recognize that the NIH Data Commons will initiate two general classes of tasks: i) infrastructure-building tasks to generate common products that are valued by DCPPC members and are likely to be valued by external partners; and ii) hard, but common, tasks to address challenges such as data access issues and harmonization of metadata and data sets derived from multiple sources and models. These tasks will take the form of lightweight demos specifically designed to explore these challenges.

Measures of success will include the ability to support user requirements, scientific objectives, and design guidelines. We accept the generalizable premise that “no software ever survives first contact with the user”; as such, we are willing to discard products and approaches that no longer make sense. We will embrace the principles and practices of “open source” and “agile” software development and open communication, with coordination across KCs and working groups via shared requirements, code, libraries, tests, tutorials, and demonstrated interoperability, and with iterative sprint cycles to deliver and review blocks of functional code or components. Conceptual and high-level explanations and strategies will be developed among the KCs and the Full Stacks. This environment will incubate multiple solutions as demos and products are created that address various challenges and identify areas for improvement. We envision that the Full Stacks will provide feedback to the KCs to assist with quality assurance in operations. Other particulars, such as workflow engines, are unique to each of the stacks and provide opportunities to implement new technology.

Consortium Oversight, in Brief

In the three months since the initial Consortium was finalized by the funding of the OT per RM-17-026, we have worked together to achieve the following:

  • Held an in-person kickoff meeting to introduce the vision and goals for the Consortium;

  • Developed draft prospectuses outlining potential solutions for Objectives 1-5 (GUIDs, Authentication/Access, Search and Indexing, APIs, and Full Stack implementations);

  • Created a structure for Use Cases and a number of specific use cases to address in the 180 day period (Objective 6);

  • Outlined a variety of mechanisms for Community Engagement (Objective 7);

  • Ratified a formal governance structure (Objective 8);

  • Built out a mailing list and messaging platform in support of internal Consortium culture building (Objective 8);

  • Written a draft Roadmap that outlines the policies and approaches we plan to take, including laying out a mission statement, organizational vision, a detailed set of deliverables for the 180 day period, and project management strategies, while delineating a commitment to openness and agility.

We addressed and solved many significant challenges during this period, including setting up a governance structure, establishing robust coordination and communication channels, and coordinating efforts around generating a Roadmap and Project Execution Plan. Additional activities are described below.

Project Management Vision and Philosophy

Collectively, consortium members have agreed on a set of operational principles for the DCPPC that have direct impact on our design strategy for the NIH Data Commons. The following aspects of our vision for the NIH Data Commons will translate into activities specified by this Roadmap, as well as the DCPPC Project Plan.

We will:

  • Support multiple implementations of infrastructure to stimulate healthy competition and to test whether Commons functionality can be rapidly incorporated into other stack platforms.

  • Promote robust interoperability across stacks and across infrastructures. Interoperation will be achieved through the use of well-documented APIs, Globally Unique Identifiers (GUIDs), and the use of standards, metadata harmonization, controlled vocabularies, data models, ontologies, and workflows.

  • Ensure that user access control, data access permissions, and security are compatible across stacks.

  • Continuously revisit our plan to redefine which functionality stacks should support, and to refine the research questions, computations, data types, and data models that all stacks should support. Pragmatically, this means that we will operate in roughly six-month design/build cycles with frequent cross-stack touchpoints, rather than specifying a full four-year plan.

  • Generate demos and prototype implementations within the initial 180-day period, in part by employing a pragmatic approach that includes the use of test and/or “slimmed” data sets to test our architectural design and basic functionality.

  • Enable users to bring algorithms and software tools to the data.

  • Place a premium on the user experience and employ best practices in user-based design. In particular, we expect user-facing deliverables to be expressed in terms of the functionality and interface provided to the users, and we will evaluate these deliverables through the lens of the user. We will ramp up engagement with external users throughout the four-year period and incorporate their suggestions into our efforts.

  • Provide tutorials, generate documentation, and run outreach and training workshops to demonstrate the functionality of the NIH Data Commons and to engage with the broader community.

  • Support higher level services and functionality within each stack, and enable external services to the NIH Data Commons, based on the underlying foundational services implemented within the NIH Data Commons: that is, the NIH Data Commons will be a platform.

  • Spin off products that will be used as common resources across the DCPPC and that will enable other data providers to more easily stand up their own instance of a Commons.

  • Build a four-year vision that will lay out community conventions and interoperability standards by which other Commons can participate in, expand on, and reuse the DCPPC platform and its components.

We will adopt an agile approach to software development (http://agilemanifesto.org/), where requirements and evaluation are driven by real-world end-user narratives. We will navigate between defining specific end-user narratives, implementing them in demos, and extending the stack infrastructure to robustly support them.

The goal of the 180-day period will be to implement a series of specific end-user narratives in demo form while extending and refining the underlying stack infrastructure needed to robustly and sustainably support that functionality, and to create the infrastructure to conduct KC8 scientific use cases. Products of the 180-day period will include demos and additional end-user narratives. The tangible products of the 180 days will provide a more accurate view of what is possible for the NIH Data Commons. This, in turn, will refine the scope of what the NIH Data Commons will become, and serve as a guide to the activities and functionality that will be implemented over the next four years.

To support the 180 day pilot period, we will create the following processes and resources:

Timeline

  • Create Gantt charts reflecting the activities of our working groups and identify milestones for the DCPPC.

  • Identify several points of evaluation throughout the course of the project.

  • Establish a DCPPC team to formulate the Consortium’s four-year plan.

Implementation and Community Strategies

  • Create an open infrastructure through adoption of the “open source” model defined in this document that will rapidly enable new developers to participate in the NIH Data Commons.

  • Develop well-rounded views on socio-technical solutions that will be useful to a broad set of end users, through creation of new end-user narratives and refinement of the Design Guidelines document. We envision that some of these end-user narratives may be implemented in other Commons efforts, internal or external to the NIH.

  • Embrace an agile approach to producing lightweight demos of solutions that will enable us to attack hard challenges through trial and collaboration, while building the technical and social components required to solve them.

  • Build consensus interoperability requirements across the multiple DCPPC teams who have established data management systems, each of which is designed to meet the needs of specific communities, but not necessarily harmonized.

  • Build consensus interoperability requirements around potentially inconsistent data access policies.

  • Create demos that will lead to prototype implementations, and infrastructure in the context of end-user narratives to demonstrate the utility of the NIH Data Commons with concrete examples.

  • Engage the community early and through multi-modal approaches to promote usage of the NIH Data Commons by the broader biomedical community.

  • Work in a competitive, yet integrative, manner by identifying points of collaboration and building an infrastructure that is capable of supporting multiple implementation strategies.

  • Recruit additional data integrators and downstream users of pilot-phase NIH Data Commons data to help define data integration and analytic architecture for the post–pilot phase of the project.

  • Provide open resources (e.g., documentation, design guidelines, end user narratives, standards development documents), along with contact points, so that other Commons efforts (e.g., NCATS, NCI, NHLBI, NHGRI) can track the work of the DCPPC and engage appropriately.

Requirements Analysis

  • Layer a set of high-level project goals across the DCPPC to drive agreement, discussion, consensus, and development of a common vocabulary.

  • Create a versioned set of User Epics and Design Guidelines so that DCPPC members can update development based on stable materials.

  • Develop conceptual and high-level opportunities for collaboration to support communication between the KCs (e.g., KC1-FAIR, KC2-GUIDs, KC3-APIs) and the Full Stacks.

  • Generate walkthroughs and tutorials (KC9-CoordinatingCenter) as a way to help drive a common vision and functionality for the Commons.

Software Development Management Strategies

  • Complete a landscape analysis of existing software and software points of contact.

  • Produce demos and prototypes that illustrate specific functionality and promote deeper analysis requirements.

  • Produce stacks that are robust and future-proofed.

  • Ensure that all spin-off products serve as resources for groups to share among themselves.

Initial Design Guidelines

The DCPPC has outlined a series of guidelines that reflect multiple scientific outcomes to be achieved through the work of the NIH Data Commons. The DCPPC’s overarching scientific and technical objectives will be integrated with user narratives into a set of Design Guidelines through which the KCs can work in a collective, collaborative, consortium-driven manner. The Design Guidelines (see 3rd draft) provide a framework for comprehensive general guidance, end user–focused narratives, and technically oriented stories that can serve as technical building blocks in support of specific end-user narratives. KC8 scientific use cases will also be used to drive development of the platform. As the pilot phase continues, KC8 will develop user narratives to test, evaluate, and implement Consortium-developed ideas, principles, and technology. The current Design Guidelines include high-level goals and end-user narratives for:

  • Research scientists who want to find existing analyses and explore the results;

  • Biological data scientists who generate analyses and summaries;

  • Methods developers who evaluate new analysis workflows; and

  • Data providers and data generators who contribute new data to the Data Commons.

These Design Guidelines and associated end-user narratives and user stories will be versioned periodically and updated as scientific outcomes, end-user narratives, and user stories are generated and refined.

Evaluation and Testing Criteria

The four-year vision for the NIH Data Commons begins with a 180-day pilot phase, the goal of which is to test multiple methods and ideas across the DCPPC.

At the end of the first 180 days, the DCPPC will be evaluated on three main criteria:

  • Consortium structure and flexibility — is the DCPPC capable of collaboratively developing a platform in an open and agile manner that rapidly accommodates revised requirements?

  • Testable functionality — has the DCPPC developed functionality that implements the specific scientific objectives defined in the end-user narratives and Design Guidelines?

  • Four-year plan — has the DCPPC developed a realistic high-level plan to develop, implement, and deliver a production-level Data Commons that addresses interoperability requirements and consists of a modular design?

Our evaluation criteria for the four-year vision will include:

  • Use-case analysis to evaluate the effectiveness and ease of use of the NIH Data Commons.

  • FAIRness assessments of data sources and other digital objects.

  • Query or analysis suite that can be used to benchmark and compare results in the different Full Stack implementations.

  • Plan to develop components that serve a diversity of end users, such as basic research scientists, computational biologists, clinician scientists, and programmatic users.

  • Evaluation of demos to assess if and when portions of the Project Plan require adjustments, and a strategy for regrouping and realignment that includes an ability to draw on other resources, persons, and software both internal and external to the DCPPC.

Connecting scientific use cases and goals to the technical infrastructure.

We see a significant ongoing challenge in making sure that scientific use cases and goals are harmonized with the technical infrastructure under construction by the DCPPC. The primary mechanism for connecting use cases to technology will be the Design Guidelines Working Group (DGWG), a cross-KC and cross-team body that will serve as an advocate for the scientists who will use the Commons. The DGWG will identify realistic goals that can be achieved by the Commons and translate these into implementation activities.

As per the ratified governance document, the DGWG will be composed of personnel across the Consortium, including at least one representative of both KC8 (Use Cases) and the Data Stewards. The primary output of the DGWG will be design guidelines that connect End-user Narratives to User Stories that aim to comprehensively cover the Use Case Library (Major Objective 6). The DGWG will also interact with the external user community to obtain feedback and comments on use cases.

Engagement with Data Stewards. The DGWG will initiate a near-term meeting with the Data Stewards, the purpose of which is to review user narratives and then to work together to provision a specific set of files or APIs to support those narratives. We will assemble a task force composed of DCPPC members and Data Steward representatives charged with reviewing the available data sources and determining if they are sufficient to support our user narratives. In particular, we expect to define a “minimal set of data types” to establish the data content of the Commons for the first 180 days. Engagement with the various Data Steward teams and other KCs will continue on a regular basis through online and in-person meetings.

Interaction with KC8/Use Cases. The DGWG will work closely with KC8 personnel (who will compose some of its membership) to define cross-Consortium design guidelines that meet the Use Case needs defined by KC8 as well as other KCs. The DGWG will be responsible for defining, expanding, and refining scientific goals of the Data Commons, while KC8 will take primary responsibility for connecting high level use case requirements to specific technical needs and respond to detailed questions about intent and design.

KC8 scientific use cases will also serve as driving examples for building the platform. For example, the KC8 Carbon team use case studies a specific condition, hypertrophic cardiomyopathy (HCM). The infrastructure will allow searching for all consented samples across the TOPMed cohorts that have sequencing and/or genotyping data available in conjunction with echocardiography measurements. The user will be able to retrieve and store structured cardiac morphology measures from these samples along with metadata, available demographic information, data on comorbidities, and available genetic data. These genetic and phenotypic data will then be available in a collaborative analysis environment with the ability to query across all available TOPMed cohorts for population-specific allele frequency, and to query GTEx and the MODs for user-specified sarcomeric genes related to HCM and user-specified ‘control genes’ unrelated to HCM. Using the collaborative analysis environment, we can compute genetic and phenotypic reference ranges for HCM-related variables, train an SVM classifier that distinguishes pathogenic from benign HCM variants across diverse ancestral groups using integrated TOPMed, GTEx, and MOD data, and distribute the results securely to a broader set of collaborators and consortium members.
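
As a sketch of the final analysis step in this narrative, the following trains a support vector machine to separate pathogenic from benign HCM variants using a handful of features harmonized across TOPMed, GTEx, and MOD data. The feature names, input file, and labels are hypothetical placeholders; a real analysis would run inside the controlled collaborative analysis environment rather than on a local CSV file.

```python
# Illustrative classifier-training step for the HCM use case. The input file
# and feature names are placeholders standing in for harmonized TOPMed, GTEx,
# and MOD extracts available inside the collaborative analysis environment.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

variants = pd.read_csv("hcm_variants_harmonized.csv")  # placeholder extract
features = variants[[
    "topmed_allele_freq",      # population-specific allele frequency
    "gtex_heart_expression",   # expression of the host gene in heart tissue
    "conservation_score",      # cross-species conservation informed by MOD data
    "sarcomeric_gene",         # 1 if the variant falls in a sarcomeric gene
]]
labels = variants["pathogenic"]  # curated pathogenic/benign labels

classifier = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(classifier, features, labels, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```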

Building “open” infrastructure

The NIH Data Commons will be built on an “open source” model, under the principle that our products should be open to inspection, use, and reuse as soon as possible. More specifically, this means:

  • Commons-accessible data will be provided under FAIR principles - Findable, Accessible, Interoperable, and Reusable. In particular, all data accessible within the NIH Data Commons must be made available under a clearly stated and unambiguous license, with specific rights as to use, reuse, and redistribution, subject to Data Use Agreements.

  • While the underlying infrastructure may be built using proprietary platforms, platforms must interoperate via open and community-based standards; where these standards do not yet exist or must be elaborated, we will work to create or extend them.

  • At least one set of feature-complete client source code for all NIH Data Commons core infrastructure must be made available under an OSI-approved license (e.g., https://opensource.org/licenses).

  • All central documentation and training materials will be developed and released under a Creative Commons CC-BY license or alternately via a CC0 Copyright waiver, which permit the widest possible redistribution and reuse.

  • We will provide documentation and walkthroughs to facilitate the deployment of new infrastructure stacks built by reusing our existing source code, and run hackathons on component reuse.

  • All Consortium-wide publications will be made available on preprint servers and published as open access.

Community engagement and coordination workshops

The DCPPC will work to achieve wide adoption of the Data Commons by the broader user community by engaging with internal and external users directly, building Web sites for discovery of the Data Commons, facilitating internal DCPPC interactions, and connecting with external Commons-related efforts. KC9-Training’s intent is to play a facilitating role in these interactions. In the first Pilot Phase (180 days), KC9-Training will primarily be focused on internal coordination (e.g., with Data Stewards and between teams) and external engagement with technical users and Commons-related efforts. In the long term, KC9-Training will facilitate a broader range of interactions and support, including webinars for external communities, help desk and issue tracking for external users, social media outreach, and developer hack-days.

See details of proposed workshops.

Oversight

Governance of the DCPPC Working Groups

The DCPPC has ratified a Governance Document that established objectives, responsibilities, governance, and a working group framework for the DCPPC. The document also established a DCPPC Steering Committee. The Steering Committee will: review all elements of this governance document on an annual basis; establish guidelines for the duration and process for the Chair to be elected; and review on an annual basis all elements of governance, such as the formation of working groups, term limits for chairs of each working group, and all other relevant DCPPC activities.

The Steering Committee will provide monthly feedback on the working groups, full stacks, and KC project plans during the first 180 days and every six months thereafter.

Roles and Interactions

Key Capability Working Groups

Members of the DCPPC have been collected into organizational groups that are associated with Key Capabilities as specified by RM-17-026. Descriptions for the Key Capabilities are listed in this table. The Principal Investigators, their NIH Science Officers and points of contact are listed in this table.

The personnel, as well as their institutions, are listed in the following documents:

The activities of each KC, as reflected by its current list of specific aims, are listed here:

Role of the Full Stacks

  • Provide constructive feedback to the KCs aimed at guiding improvements to their deliverables in the context of more effective stack operations.

  • Address relevant Design Guidelines that apply within an individual stack.

  • Maximize interoperation with the other stacks.

  • Adopt open standards to allow communication not only across the Full Stacks but also across the larger biomedical data environment.

  • Focus on meeting all project goals and deliverables.

  • Manage final delivery of data and services to end users.

Role of the KCs

  • Serve a dual role of:

    • Providing products and resources to assist in the implementation of individual stacks (e.g., API specifications designed to work for a single platform).

    • Providing resources that can be applied across the stacks (e.g., a common data model for all metadata).

  • Focus on a broad set of Design Guidelines that apply across stacks.

  • Deliver products to be handed off to other KCs and Stacks.

Role of the Data Stewards

  • Provide curatorial expertise on multiple data types of relevance to the model organism research community.

  • Utilize standardized metadata, controlled vocabularies, and ontologies.

  • Provide large-scale data management, quality control, data integration, and generation of secondary analysis products (e.g., read counts, validated SNPs) derived from raw data.

  • Contribute use cases to assist in driving the strategic objectives of the DCPPC.

  • Provide data and metadata, as well as APIs and other services, for representation within the NIH Data Commons.

Role of the Project Managers

The Project Managers will be responsible for tracking execution of the project in accordance with the plan, at both the operational and technical levels, for mitigating identified project risks, and for communicating any plan variations, risks, change requests, or other important events to NIH stakeholders and the DCPPC Steering Committee. Specifically, the Project Managers will be responsible for the following:

  • Monitor plan execution for variations

  • Perform change control activities

  • Perform technical controls and evaluation of deliverables

  • Manage and capture project documentation relevant to the DCPPC and the broader user community

  • Identify project risks/bottlenecks and suggest potential workarounds and mitigation strategies

  • Communication: Provide status updates to NIH stakeholders and the DCPPC Steering Committee on a regular basis

DCPPC Touchpoints for 180-day Pilot Phase

The DCPPC will share a master set of touchpoints for the 180-day pilot phase. These are checkpoints on the Consortium’s calendar: time points at which the DCPPC will regroup and ensure that deliverables are available or that resources have been transferred from one group to another. Touchpoints for the 180-day pilot phase will ensure that the DCPPC will:

  • Provide a demonstration of the utility of the NIH Data Commons, with examples of technology that begin to address one or more Scientific Objectives from the Design Guidelines.

  • Identify difficult, Consortium-level challenges (e.g., data access, metadata harmonization, interoperability, consent, governance) and propose processes or technology for addressing these challenges.

  • Identify and resolve first-order Consortium-level challenges (e.g., common APIs, models, authorization strategies, etc.).

  • Develop a cohesive plan for the infrastructure of the NIH Data Commons and begin building components.

  • Build, test, learn from, and ultimately implement several possible solutions for a Data Commons.

  • Provide a mechanism for external groups to engage with the DCPPC to identify points of contact, discuss common concerns, and participate in standards and API development.

The full stack milestone document identifies a set of sequential touchpoints on which we will collaborate to facilitate the overall delivery of the above demonstrations. The Stack groups have generated a timeline that lists important milestones. The touchpoints are listed below:

Touchpoints

Month 1

  • Operating on the assumption that the Full Stacks will have access to representative data (TOPMed and GTEx data) on commercial clouds, we focus on utilization of our core datasets (such as Phase 1 18K WGS from TOPMed on Google and AWS) [1].

  • Common IDs and minimal metadata agreement for the data onboarded. Focus on interim GUIDs and a small amount of metadata to start, so the Full Stacks can move forward.

Month 2

  • One or more TOPMed workflows in WDL/CWL and registered on a public site (KC3, for example Dockstore)

  • Onboard representative data (such as TOPMed or GTEx) into storage infrastructure for each full stack

  • Security and compliance agreement for users/developers operating in the pilot

Month 3

  • Pilot (sandbox) user interfaces available

  • Each Full Stack has a pointer to representative data on commercial clouds; by this point, data is not duplicated across clouds except where needed

  • Show that a workflow can be exchanged among the Full Stacks and that each Full Stack can execute it with test data

  • Incorporation of KC standards (exact activities to be developed as KC working groups refine deliverables)

  • Approved test user can log on to a Full Stack prototype, access data, and successfully execute a workflow

Month 4

  • Richer metadata and identifiers coming from the KCs

  • Incorporation of KC standards (exact activities to be developed as KC working groups refine deliverables)

  • Pilot users identified and onboarded

  • Cross-stack compute possible

Month 5

  • Incorporation of KC standards

  • User interviews

  • Broader user engagement (in collaboration with KC9)

  • User data injected into the Full Stacks and processed using a high-value workflow. As a scale-up proof of concept between the Full Stacks, we could onboard data into multiple Full Stacks for a distributed recompute.

Month 6

  • Refined user flows across Full Stacks

  • Stretch: show a demo where data is produced across Full Stacks and used in other Full Stacks

  • Overview of the common APIs supported by the Full Stacks

  • Incorporation of user interview results

  • Plans for additional user training and engagement

A listing of full stack milestones may be found at this link.

DCPPC Objectives: Planned Project Deliverables

DCPPC Deliverables and Functions

Products from the DCPPC include end-user narratives, stack implementations, standards and conventions, APIs, documentation, and training/outreach resources. We also will create demos to guide discussion and implementation activities.

User Capabilities at the End of the 180-day Pilot Phase

At the end of 180 days, an end user will be able to:

  • Use a first Full Stack to search for and assemble a collection of data sets, including those already accessible through the NIH Data Commons and others newly contributed by the user.

  • Execute a first workflow on a data set or collection to produce a secondary, derived data set that is assigned a new GUID and is persistent within the cloud.

  • Use a second Full Stack to perform a new analysis on the output of the first workflow, with a different workflow, generate GUIDs, and share the results with other researchers.

  • Publish derived data, analyses, and workflows.

  • Log authentication, authorization, data access, transformation, and analysis steps in sufficient detail to demonstrate compliance with consent and Data Use Agreements.

180 Day Products and Demos

DCPPC 180-Day Products

One essential purpose of funding the DCPPC is to create spin-off “products” that will be used as common resources across the DCPPC as well as open-access resources for the broader research community (now or in the future). Products will be developed in consultation with the stacks, and are also expected to enable other data providers to more easily stand up their own instance of a stack in the Commons. Examples of resources from the DCPPC that are considered products include standards and conventions, APIs, data resources, websites, repositories, documentation, and training/outreach resources. The products to be generated in the 180-day period are outlined below, and the groups that will perform these activities are listed in the Activity Matrix:

STANDARDS AND DOCUMENTATION

  • KC1: A FAIRness document providing universal FAIR metrics for Commons objects, and a FAIR metadata specification to be supported by Commons objects.

  • KC2: A GUID document that specifies GUID support requirements for full stacks, including minting, resolution, and data/metadata retrieval.

  • KC2: A metadata document that specifies common core metadata fields that must be supported by GUID and search interfaces.

  • KC2: Common core metadata fields for GUIDs

  • KC2: Detailed use cases to guarantee that GUID services support the Commons data providers and scientific objectives

  • KC2: Guidelines for each type of identifier to support GUID services

  • KC2: Namespace registry services model and instructions

  • KC2: Specifications for each GUID service or API

  • KC2: User-friendly documentation and tutorials on how data providers and full stacks should use the GUID services

  • KC2: Workflow Diagram for GUID services model

  • KC3: An API document that specifies available Data Commons APIs and associated documentation, and links to sandboxes.

  • KC3: NIH Data Commons API Registration Guidelines

  • KC3: NIH Data Commons recommended API Workflow Creation, Exchange, and Execution Standards doc

  • KC3: NIH Data Commons recommended API Data Access Standards doc

  • KC6: A security and access control document that provides detailed specifications, standards and/or conventions for all systems, including audit trail functionality.

  • KC7: A high level summary report on data flow, data integration, data harmonization, controlled vocabularies, and data models across the Commons.

  • KC9: An expanded collection of Design Guidelines that includes detailed End-user Narratives and User Stories, together with a process for review, modification, and addition.

SOCIALIZATION AND COORDINATION

  • KC9: An internal coordination Web site with complete onboarding information, a document repository, and comprehensive links to all DCPPC sites.

  • KC9: An external DCPPC website supporting community engagement via a multitude of formal (RFCs) and informal (mailing list, social media, issue tracker, annotations) mechanisms.

  • KC9: A publicly visible collection of walkthroughs and tutorials for existing functionality, together with guidelines for creating new walkthroughs and tutorials.

APIs

  • KC2: Prototype common interface for GUID minting and registration

  • KC2: Prototype common interface for resolvers

  • KC2: Prototype software and API for a central namespace registry

  • KC3: PIC-SURE API to access clinical and genomic data

  • KC3: NIH Data Commons recommended (e.g. GA4GH DOS) data upload/download API

  • KC3: NIH Data Commons recommended (e.g. GA4GH TRS/WES) workflow sharing and execution APIs

  • KC3: MetaAPI across selected APIs in the consortium

FAIRNESS SYSTEMS

  • KC1: FAIR rubrics repository (FAIRsharing and FAIR metrics already have some of this functionality)

  • KC1: FAIR scores repository (FAIRshake) and FAIR scores push/pull APIs

  • KC1: FAIR licensing repository (reusabledata.org has some of this functionality)

  • KC1: FAIR reporting and assessment tool.

SOFTWARE (and, see below)

  • KC2: Prototype for reference client library and tools

  • KC2: Prototype software for landing page extensions or service endpoints

DATASETS

  • KC7: A harmonization and mapping between terms used in the Crosscut Metadata Model and terms defined by other standards, controlled vocabularies, and ontologies. To do this, we will:

    • Develop Crosscut Metadata Model

    • Specify Exchange Format for metadata ingestion, export, and exchange

    • Create metadata instance (e.g. Metadata matrix)

    • Verify availability of Metadata Matrix to Stacks

  • KC3: Repository of all available NIH Data Commons APIs

WEB INTERFACES

KC7: Utilize the Metadata Matrix web interface to perform:

  • Faceted search

  • Search enabled by metadata mapping

  • Ability to pass data to workspaces without local download

  • Ability to build custom data sets/virtual cohorts (e.g., run a query against TOPMed metadata, identify records based on that query, and package the results in a BDBag to pass to an analysis workflow); see the sketch after this section.

  • Ability to search across data sources

  • Search leveraging Metadata Matrix

KC3: Web interfaces to access Open Sandboxes

  • Ability to search curated phenotypic data

  • Ability to access JupyterHub/R notebooks to conduct analyses

  • Ability to access Postman with examples to access data
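
The virtual-cohort capability described above (query metadata, identify matching records, and hand the result set to an analysis workflow) might look roughly like the following. The search endpoint, its parameters, and the response shape are assumptions made for this sketch; in the real demo the result set would be packaged as a BDBag and passed to a workspace rather than written to a local CSV file.

```python
import csv
import requests

SEARCH_API = "https://commons.example.org/search"   # hypothetical faceted-search endpoint

def build_virtual_cohort(facets: dict, manifest_path: str) -> None:
    """Run a faceted metadata query and write the matching records to a manifest.

    In the actual demo the manifest would be packaged as a BDBag and handed
    to an analysis workflow instead of being downloaded locally.
    """
    resp = requests.get(SEARCH_API, params=facets)
    resp.raise_for_status()
    records = resp.json()["results"]          # assumed response shape

    with open(manifest_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["guid", "subject_id", "access_url"])
        writer.writeheader()
        for rec in records:
            writer.writerow({k: rec.get(k, "") for k in writer.fieldnames})

build_virtual_cohort({"study": "TOPMed", "sex": "female"}, "cohort_manifest.csv")
```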

WORKFLOW STANDARDS

  • KC3: Development of workflow conventions

Activity Matrix

The work plans and activities required to achieve the products described above are reflected in a series of reports generated by each of the KC and Stack working groups. For more information, please review the documents referenced in the following table:

                               | KC1 FAIR | KC2 GUIDs | KC3 APIs  | KC6 Access | KC7 Indices | KC8 Use cases
Specific Aims                  | KC1.SA   | KC2.SA    | KC3.SA    | KC6.SA     | KC7.SA      | KC8.SA
Team                           | KC1.Team | KC2.Team  | KC3.Team  | KC6.Team   | KC7.Team    | KC8.Team
What users will do in 180 days | KC1.180d | KC2.180d  | KC3.180d  | KC6.180d   | KC7.180d    | KC8.180d
Design guidelines              | KC1.DG   | KC2.DG    | KC3.DG    | KC6.DG     | KC7.DG      | KC8.DG
Products: Documentation        | KC1.PD   | KC2.PD    | KC3.PD    | KC6.PD     | KC7.PD      | KC8.PD
Products: Technical            | KC1.PT   | KC2.PT    | KC3.PT    | KC6.PT     | KC7.PT      | NA
Timeline                       | KC1.TL   | KC2.TL    | See Gantt | KC6.TL     | KC7.TL      | KC8.TL
Gantt                          | KC1.G    | KC2.G     | KC3.G     | KC6.G      | KC7.G       | See timeline
Full report                    | KC1.FR   | KC2.FR    | KC3.FR    | KC6.FR     | KC7.FR      | KC8.FR

SO, WHERE IS THE LONG LIST OF DCPPC SOFTWARE PRODUCTS?

We would certainly like to provide this. Each DCPPC full stack team brings decades of experience in building widely used software for managing, accessing, analyzing, and sharing large distributed data, and in enabling collaborative science. This project will produce a broad portfolio of software components, which fall into the following categories:

  1. Cloud-based software stacks: integrated systems residing on cloud platforms that handle data processing, storage, servers, and user access control;

  2. Application programming interfaces (APIs) that transfer machine-readable data to and from cloud-based data stores; they are implemented in many programming languages and specify routines, data structures, object classes, variables, or remote calls;

  3. Workflow systems, conforming to cloud-scalable standards such as WDL and CWL, that link a series of executable programs used for data processing and analysis;

  4. Web servers with front-end interfaces enabling users to query, explore, analyze and store data on the cloud;

  5. Stand-alone software that interacts with elements of the software stacks via APIs.

Our software development will benefit several groups. Other data providers will be able to use our cloud-based stack systems to deploy their own stacks; end users will make use of our workflow systems and interfaces; and we expect programmatic users to employ the APIs to orchestrate workflows and data analysis on our systems. The DCPPC will itself also be a consumer of its own software, using it to create an ecosystem of interoperable services that assist in delegation of user-specified authorizations, data transfer, exportable analysis tools, and cross-platform data object search.

While it would be valuable to list each software product that will result from the DCPPC effort, doing so is challenging. One reason is that we are drawing on a large body of existing systems that we cannot necessarily claim as our own. As an example, the Foster team employs Globus cloud services for security and data management; Globus is widely used within and outside biomedicine, with tens of thousands of users and more than 14,000 storage systems. Because it takes a village to build high-quality software, we have resisted listing all the software products that will result from this project. Instead, we suggest it is more appropriate to review our list of DCPPC demos, which showcase the successful integration and development of software produced by this project.

Each demo will be the result of a considerable amount of execution code deployed on cloud systems, will be designed with a view to interoperability, and will be made available via Jupyter Notebooks, recorded media, hackathons, and/or live presentations. The demos are described in the next section.

DCPPC 180-Day Demos

The demos are functionality implemented in executable form, usually as a Web site available through a Web interface or a REST API. All demos will be based on use of the DCPPC products in combination with the DCPPC stack providers. Every demo will have an associated walkthrough highlighting the current functionality (which may be developed before or after the demo functionality itself). An idealized example of a demo is a stack-hosted Jupyter Notebook that uses remote APIs to search metadata, examine metadata by GUID, retrieve data by GUID, execute workflows, and mint GUIDs for newly generated data and analysis results. Breaking this idealized demonstration down into smaller components yields the demos listed in the Roadmap Listing below.
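
A minimal sketch of such a notebook is shown below. Every endpoint, path, and parameter name is a placeholder assumption made for illustration; the actual calls will use the APIs registered and documented by KC3.

```python
import requests

COMMONS = "https://commons.example.org"   # hypothetical API gateway, for illustration only

# 1. Search metadata for objects of interest (assumed /search endpoint).
hits = requests.get(f"{COMMONS}/search", params={"study": "GTEx", "assay": "RNA-seq"}).json()

# 2. Examine metadata for the first hit and retrieve the data by GUID.
guid = hits["results"][0]["guid"]
meta = requests.get(f"{COMMONS}/objects/{guid}").json()
data = requests.get(meta["access_url"]).content

# 3. Execute a registered workflow on the retrieved data (assumed /workflows endpoint).
run = requests.post(f"{COMMONS}/workflows/rnaseq-qc/runs", json={"inputs": [guid]}).json()

# 4. Mint a GUID for the newly generated analysis results.
new_guid = requests.post(
    f"{COMMONS}/guids/mint",
    json={"derived_from": guid, "run_id": run["run_id"]},
).json()["guid"]
print("Results registered under", new_guid)
```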

Roadmap Listing

DEMO# | Team | Mo¹ | Category | Description | Data sets required (TOPMed = TM, GTEx = GX, MODs = MO)
1M.1 | C | 1 | Searching | Web interface to search subsets of curated phenotypic/clinical data | TM
2M.1 | Ar, He, CA | 2 | Uploads and minting | Upload data, annotate, and mint GUIDs on at least one stack |
2M.2 | He, Ar | 2 | Searching | Demonstrate interoperability of search with access controls across cloud environments |
2M.3 | Ar, Xe, CA | 2 | Searching | Search metadata across catalogs and data storage endpoints |
3M.1 | Ca, Xe, He, Ar | 3 | Searching | Find TOPMed/GTEx/MOD data across multiple stacks using a common GUID |
3M.2 | He, Ni | 3 | FAIRness assessment | FAIR assessment API generates FAIRness reports for digital objects |
3M.3 | Ca, Xe, He, Ar | 3 | Data analysis | Workflow is described, implemented, exchanged, and executed reproducibly across multiple Full Stacks |
3M.4 | C | 3 | Data analysis and training | Web interface to access tools and sample code to develop analyses |
4M.1 | He, Ni | 4 | FAIRness assessment | Demonstrate FAIR assessment with test datasets |
4M.2 | Ca, Xe, He, Ar | 4 | APIs | Portable workflow scheduled on two or more stacks using the same API |
4M.3 | Ar, Ca | 4 | APIs | Interservice interoperability across stacks |
4M.4 | C | 4 | Testing/APIs | TOPMed Open Sandboxes will provide the technical environment to access TOPMed data | TM
5M.1 | Ar, He | 5 | Registration | Single shared sign-on for stacks and data access |
5M.2 | He | 5 | Registration | Display audit trail for access to restricted data across stacks |
5M.3 | Ca, Xe, He, Ar | 5 | Data analysis | User adds data and workflow to a stack to harmonize with NIH Data Commons-hosted data | TM, GX, MO
6M.1 | Ar | 6 | Data analysis | Run workflow with cost-aware data staging and provisioning |
6M.2 | Ar | 6 | Data analysis | Run scalable, cost-controlled data analysis for data enrichment |
6M.3 | Xe, Ca | 6 | Data analysis | Perform cross-stack, multi-cloud compute on consortium data and novel data according to user permissions, for example by running TOPMed pipelines on GTEx samples |
6M.4 | Ca, Xe, He, Ar | 6 | Data analysis | Share and retrieve analysis results across stacks | TM, GX, MO
6M.5 | C | 6 | APIs | MetaAPI deployed across selected APIs in the consortium |

¹ Month number: the first month in which a demo will be performed.

Full stacks will likely have monthly demos organized by the full stack KC.

These Demos were derived from the following Stack team documents:

A full report has also been generated by the Stack teams and describes a set of interoperability demos in detail. Team Nitrogen (not considered part of the Stack) will work together with team Helium on demos 3M.2 and 4M.1. In addition, team Nitrogen proposes two additional independent demos (Ni.6 and Ni.7) which are related to KC1 activities and could assist the efforts of the Stack teams. Team Carbon will also provide demos that can be utilized by the stack teams.

Review and Evaluation of the DCPPC Demos

In general, demos are intended to showcase a combination of coded and tested software components for finding, hosting, and/or operating on data in one of the Data Commons stacks. The demos will also demonstrate our ability to interoperate readily between the stacks. Demos will allow us to review what has been accomplished, to maintain continuous assessment of our progress, and to create resources that will be provided to the broader user community. It is also important to remember that not all demos should be considered a final delivery of DCPPC products; they are also intended to evaluate incremental capability and to inform changes to what we will build in the future. Some may be trials of proposed functionality that will be discarded or refactored as the project continues. For this reason, demo reviews will be informal, and we will use them to promote discussion of what has been built and its ability to contribute to the Commons.

All DCPPC members will be invited to review the demos, which will be evaluated by the Steering Committee co-chairs, NIH staff, DCPPC project managers, the External Panel of Consultants, and MITRE. Demos will be announced to the DCPPC via email lists and Steering Committee meetings. Live demos will be attended by DCPPC members on an ad hoc basis. Demos released as scripted walkthroughs or recordings will be reviewed individually or via an informal conference call presentation with DCPPC members. Recorded or scripted demos will be made available to the wider user community on the DCPPC coordination website where appropriate.

Issues to resolve

ISSUE 1.

Statement in text: “5M.1 Registration Single shared sign-on for stacks and data access”

We have yet to identify the stacks that will participate in this. It depends in part on obtaining deliverables from KC6 that could assist with sign-on, and we have yet to determine whether this activity is within scope at this point.

  1. If access to the actual datasets is delayed, we will produce synthetic data modeled on the TOPMed and GTEx resources so that Data Commons development can proceed.