Trajectory of the Commons for the Next 3 Years
Operationalizing the Commons.
We will establish an Authority to Operate (ATO), which is granted through a process that assesses the security, privacy, and compliance requirements implemented in our system and in the infrastructure, as well as the procedural and process items in place with the infrastructure provider. Typically, the infrastructure and application providers must complete a set of security- and privacy-related documents, such as a Privacy Impact Assessment and a System Security Plan, and undergo a third-party assessment. The plan will be submitted to the signing official at NIH (usually the CIO), and subsequent rounds of interaction will be required to provide evidence of compliance to the signing official. Once the ATO is signed, we will be authorized to make the entire Data Commons system available to end users. Completing the ATO process is expected to take approximately 7 months from the date production systems are launched. We would also like to work with the NIH to evaluate ways of enabling the research community to test the system under a limited provisional ATO. This will be crucial for testing hypotheses about user needs and gaining valuable feedback about how to improve the system for the diverse research community.
Privacy, security and ethics.
We will establish an Ethics, Security, and Privacy Governance and Operations Council (ESPGOC), composed of a Chair and Co-chair from the Data Commons research ethics, privacy, and security working group. Other council members will include human subjects experts from the NIH ICs and external security experts. The ESPGOC will recommend policies for the Commons and provide guidance regarding compliance with study participant privacy, ethical mandates, and data security requirements. This will help institutions, individuals, and researchers to access, use, share, collaborate on, and contribute data, information assets, and tools in support of collaborative scientific research while respecting privacy, security, and ethical requirements. The ESPGOC will ensure that approaches are consistent with participants' original consent and with applicable Federal laws, regulations, and NIH policies, and that they reflect a desire to foster innovation and collaboration within the scientific community through the use of data and technology. The ESPGOC will also consider the integrity of data, respect study participants' privacy, and promote the ethical use of study participants' data.
Data and Resource Sustainability.
Sustaining published bioinformatics tools (e.g., software, containers, services, workflows) and databases is currently difficult, and support for doing so is limited. Many tools and databases disappear after they are published because there are few incentives to maintain them, such as paths of credit that lead from uses back to the tools and data. To address these issues, we will explore whether the Commons could serve as a means for archiving bioinformatics tools and important databases. To do so, we will create standards for data, software, and workflow citation so that all parties involved in an analysis receive credit for their work. We will also explore governance models that encourage long-term partnerships for the purpose of sustaining tools, workflows, databases, and related services.
Interoperability/Collaboration at NIH, commerce and internationally
An exceptional number of opportunities are available to add to the services offered through the Commons and to promote adoption of the standards developed by our consortium. The NIH has announced an agreement with Google Cloud as its first industry partner, which enables us to leverage several important resources: storage of Data Commons data at zero or low cost; discounted compute for the consortium; development of ways for groups outside of the consortium to access and use the data; and services that maximize the utility of data on Google, including dataset search, Bigtable, BigQuery, search and indexing, encryption methods, training, API services, Docker deployment, and Kubernetes. We will also seek to partner with several government agencies that provide important computational resources. NIH-related resources with which we intend to work include the National Cancer Institute, AnVIL (NHGRI), Kids First, STAGE (Data Storage, Toolspace, Access, and analytics for biG-data Empowerment; National Heart, Lung, and Blood Institute, NHLBI), and All of Us (Common Fund). Other US agencies that we aim to engage are NSF (the NSF Big Data Hubs and the new Open Science Network (OSN)), USDA (ARS - Ag Data Commons), the Department of Veterans Affairs (Million Veteran Program/Apollo), and the Department of Energy. We will work with technology providers such as Globus and iRODS. The goals for all of these potential partnerships are to increase utilization of interoperation standards, share common workspaces and capabilities with an expanding user base, increase the literacy of all users in operating on cloud-based platforms, and increase the total pool of computational and storage assets for our users. Ultimately, the Commons consortium plans to establish an ambitious target for the overall computational resources that will be made available to the user community during the course of the project.
By engaging in a greater degree of interoperation and collaboration with the agencies described above, we will investigate hardening our current computational capability, scaling from thousands of analyses to on the order of 1,000,000. We also expect to grow from an initial volume of hundreds of users to tens of thousands during the course of this project.
We have established effective mechanisms for awarding resources to explore new research areas. These mechanisms include targeting new capabilities, communicating new research projects across the consortium, and testing patterns of behavior (e.g., quarterly releases, a consistent demo schedule, and defined paths for adoption of standards).
Data harmonization. Some of the datasets in the Commons are derived from clinical studies. Data from clinical studies are notoriously non-uniform, so patient variables such as blood pressure, BMI, age, and disease are nearly impossible to retrieve in a uniform manner. We will engage in harmonization efforts to impose a greater degree of uniformity on the metadata, which will enable users to perform queries such as "retrieve all variants from patients between the ages of 40 and 50", which is currently not possible.
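As a rough illustration of what harmonization makes possible, the sketch below maps two differently encoded age fields onto a common unit and then answers the example query. It is a minimal sketch under stated assumptions, not a description of the Commons implementation: the study names, field names (`age_years`, `AGE_MONTHS`), the `Variant` record, and all identifiers are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping: for each study, the field that stores patient age and
# the divisor that converts the stored value to years. Real studies would need
# curated mappings for every harmonized variable.
AGE_FIELDS = {
    "study_a": ("age_years", 1.0),    # already recorded in years
    "study_b": ("AGE_MONTHS", 12.0),  # recorded in months
}


@dataclass(frozen=True)
class Variant:
    patient_id: str
    variant_id: str


def harmonize_age(study: str, record: dict) -> Optional[float]:
    """Return the patient's age in years, or None if it was not recorded."""
    field, divisor = AGE_FIELDS[study]
    raw = record.get(field)
    return None if raw is None else float(raw) / divisor


def variants_for_age_range(patients, variants, low, high):
    """Select variants from patients whose harmonized age lies in [low, high]."""
    eligible = set()
    for study, record in patients:
        age = harmonize_age(study, record)
        if age is not None and low <= age <= high:
            eligible.add(record["patient_id"])
    return [v for v in variants if v.patient_id in eligible]


# Illustrative, made-up records from two studies with different conventions.
patients = [
    ("study_a", {"patient_id": "A1", "age_years": 45}),
    ("study_a", {"patient_id": "A2", "age_years": 62}),
    ("study_b", {"patient_id": "B1", "AGE_MONTHS": 504}),  # 42 years
    ("study_b", {"patient_id": "B2"}),                     # age missing
]
variants = [
    Variant("A1", "var-001"),
    Variant("A2", "var-002"),
    Variant("B1", "var-003"),
    Variant("B2", "var-004"),
]

selected = variants_for_age_range(patients, variants, 40, 50)
print([v.variant_id for v in selected])
```

Patients without a recorded age are excluded rather than guessed at. In practice the harder work is curating the per-study mappings, which this sketch simply assumes.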
Machine learning. We will apply machine learning predictions to various types of biomedical datasets. A searchable library of ready-to-go datasets will be made available for download and for loading into notebooks that users can run in the cloud.
In addition to our current research areas, such as FAIR metrics, cross-cut data models, GUID services, and APIs, our expected research directions will likely include:
The current state of play for guidelines and standards: why they are necessary, and the process by which they are being developed, vetted, and gaining community traction and support;
Application of machine learning predictions to a broad set of biomedical datasets to increase the available/queryable assets to our users;
Assessing the impact of publications, researchers, projects, websites, tools and other types of digital objects;
Hosting data on alternative cloud-based systems.
Project management. One success story from the first year of the DCPPC is that we implemented project management practices that have proven effective. The ability to track project progress and to report monthly on milestones and deliverables using GitHub was an eye-opener for many. This effort to better organize NIH projects will make investigators and other NIH awardees more accountable. The approaches developed for the DCPPC should be cultivated and expanded.
Training. We will establish a training program that will engage users to inform platform development, propagate essential skills, and build a community of invested ambassadors capable of contributing to the growth and maintenance of the Data Commons. We expect one of the main returns from this training program to be a steady stream of feedback, bug reports and user enhancement requests that will help grow the Data Commons to meet user needs. The program hinges on recruiting a cadre of early users interested in serving as trained “ambassadors.” Ambassadors will run in-person user training sessions, organize feedback to platform developers, develop and maintain training materials, and help drive engagement with local user communities. The horizon for the first stage of this program is approximately two years, and we aim to bring between 1,000 and 5,000 users, including 100-200 Ambassadors, onto the Data Commons platforms within that time period.
Data Steward interactions. The Data Stewards will take a more central role in the Data Commons and its initiatives during Phase II, engaging in the following activities:
1) The Data Stewards will participate in furthering data standards and harmonization by developing processes for identifying required data fields in their respective datasets, mapping these to standardized vocabularies and formats, and integrating these across studies and organisms to fulfill the requirements of targeted Use Cases. Identifying common elements across these major datasets, such as expression data for GTEx and AGR/MODs, phenotypes for TOPMed and AGR/MODs, and genes for all three, will provide opportunities for the Data Stewards to work together to determine common data elements, develop methods for mapping and aligning data across the resources, and develop standards and methods for integrating additional datasets. The Data Stewards have extensive experience in soliciting and processing data submissions from individual researchers, as well as in developing extensive data import pipelines from large-scale data resources, and will use this expertise to develop processes and templates for onboarding additional data resources into the Commons infrastructure;
2) The Data Stewards are adept at meeting the needs of diverse user communities, which will become more important for the Commons in Phase II as it moves beyond the first tier of targeted researchers considered adept at programming. The Data Stewards' software development teams create user-friendly tools and develop sophisticated workflows for all types of researchers. Members of these teams will bring this expertise to the working groups in the Commons;
3) All three Data Stewards are experienced in creating video tutorials and workshops to train researchers and students on the datasets and tools available at their sites, and they will bring this expertise to the development of training materials and workshops for Commons users; and
4) The Data Stewards will continue to contribute Use Cases that exemplify the needs of researchers and help the working groups plan and prioritize datasets and functions.