A report from the September 21st, 2018 Workshop.
Contact: C. Titus Brown firstname.lastname@example.org
On September 21, 2018, we ran a full-day workshop on “Commonsing” as part of the September DCPPC Face-to-Face at Harvard Medical School. The objective of this workshop was to introduce the NIH Data Commons to a group of experts in infrastructure development, data policy, and community building, and to discuss frameworks for governance. The workshop included senior scientists from large federally funded data collaborations, leaders in open source data science projects, representatives of funders, and ethnographers and social scientists. Most of the Data Commons External Panel of Consultants attended, as did a number of DCPPC awardees, along with several members of DataSTAGE.
The workshop was run in an “unconference” style, and participants worked throughout the day to discuss questions such as, “What is a Data Commons, and what makes it a Commons?”, “Who should participate (and why)?”, and “What are the potential reasons the Data Commons might fail?” We concluded by asking participants to identify their “top wish” for the NIH Data Commons to act on in the next 6 months.
The primary outcomes of the workshop follow.
Most of the likely causes of failure for the NIH Data Commons are not technical.
Common data analysis infrastructure has been built in other fields, and the room was rich with expertise in these efforts. When asked why this effort might fail, participants identified many reasons, but very few of them were technical. This reflects a general theme of the day: the primary challenges in creating, adopting, and sustaining a Data Commons are community- and policy-based. “The tech will make it succeed, the people will make it fail.”
A few specific reasons mentioned during the exercise:
“Too many cooks” - a failure to create or execute on a short term plan.
“Too exclusive” - we don’t achieve enough buy-in from the larger scientific community.
“Too inclusive” - so many people are involved there is no sense of citizenship, and no buy-in to the governance processes.
“Fostering an artificial community vs engaging a real community.”
“Not an ecosystem” – no technical or social support for emergent adaptive solutions.
“Focusing too much on full stack function and not enough on interoperability between them.”
“Produces and sustains harmful inequalities” – if the Data Commons supports and extends current inequities in data access and societal focus on participation.
“Too complex to use” or “Does nothing useful.”
“Data black box – easy to put data in, impossible to find/discover”.
“Not sustainable” so no one invests their careers in it.
Several experienced infrastructure builders stated that solving the technical challenges was always possible, but without investing in the social structure required for sustainability and longevity, we would not progress towards new technical challenges.
A related lesson learned in the open source community is that if you get the community right, you can solve any problem. If you get the community wrong, nothing will stay solved. We need to agree on the common problems, agree to work on solving those problems, and focus attention on these problems - and then we will produce solutions organically and sustain the investment of effort over the long term. If the community can create a “narrow middle” of solutions to technical interoperability, then the Data Commons can grow from there.
Trust in both the quality of the data and the ethical use of the data is critical.
Another central theme was that there must be general trust in the quality of the data in the NIH Data Commons, and that those depositing data must have trust that the data will be used ethically. This highlights the critical role that the Data Stewards and data curation more generally must play in the Data Commons, as well as the importance of proper data governance.
There is immense practical value in implementing shared standards around a combination of “rough consensus” and “working code”.
Experts in building past data infrastructure focused on the importance of reaching consensus on shared technical practice (standards, protocols, and implementations) within this effort by first identifying a rough consensus and then delivering working code. For example, an RFC process for adopting internal project practice should be based around multiple interoperable implementations.
It is not clear what the “Common Pool Resource” is in the NIH Data Commons.
The underlying economic basis of the NIH Data Commons remains unclear with respect to economic and sociotechnical studies theory. For example, the infrastructure and implementations of the Data Commons may be either a public good (accessible to all) or a club good (restricted to a community). The data inside the NIH Data Commons cannot be open to all, due to the inclusion of access-controlled personal health information. Neither the infrastructure nor the data meet the definition of a common pool resource, however, which makes it difficult to apply Elinor Ostrom’s well-established design principles for a Commons to them. One theory advanced in the workshop was that “engagement and focused attention of the community” acts as the common pool resource for the community’s shared mission of building the NIH Data Commons.
An interesting discussion around “free riding” and levels of community engagement also emerged. In economic theory, free riders are those who gain benefits without contributing to sustainability; free riding is often treated as a binary, but this need not be the case here – there could be a variety of levels of engagement with the NIH Data Commons, ranging from people who solely make use of the data and infrastructure for their own purposes, to people who contribute data to the Commons, to people who are involved in implementing and maintaining the Data Commons.
A related concept of “citizenship” in the Data Commons was raised, as a model for participation in the Data Commons for data stewards, patients and research subjects, infrastructure developers, and others. The term “citizen” captures the idea of reciprocal obligations that are important for building and sustaining a commons, and elucidating the rights, privileges, and obligations of Data Commons citizens could be a productive way to define the potential community for the Data Commons.
On the data side, a particularly interesting analogy was made with respect to two “public goods”, Waze and weather data.
Waze is a traffic app that allows people to crowdsource traffic patterns. For Waze, engagement ranges from those who download the app and make use of the community-gathered data without contributing any data back, to those who engage fully by contributing back into the community data set. Much of the value of the Waze app derives from this crowdsourced data. In the case of Waze, there are also significant negative externalities emerging, such as unexpected traffic patterns that create traffic congestion in neighborhoods.
In an interesting analogy to the kind of ecosystem we aspire to build, weather data is produced by the National Weather Service as a public good, and consumed by a variety of apps and weather services. The data provided by the NWS gains value from the construction of social context around it, through apps and Web sites, but the majority of consumers of the data do not contribute directly to the sustainability of the data source (except as taxpayers). However, the consumers do contribute directly to the apps through purchasing, and to the Web sites by viewing ads. Meanwhile, the data produced by the NWS would have little value if it were not produced according to a standard protocol that allowed anyone to consume it, which allows for the ecosystem of apps and Web sites to emerge.
There was a general sense that further exploration of both the theory and practice of “digital commonsing” would be important to understanding, implementing, and sustaining the NIH Data Commons. Close engagement with some social scientists (ethnographers and anthropologists) in the Data Commons Consortium was a strong recommendation of the overall group.
The important data is not the data we already have, but the data we will be generating.
The current NIH Commons efforts are focused around certain data sets -- implicit in this is nascent value to be gleaned from these data sets. However, each year we are generating more biomedical data than has previously been generated in the history of mankind. So while we may need to demonstrate the value of the Data Commons on data we already have, the true value will emerge as we take advantage of newly arriving data sets in the coming years.
The six-month ask.
We concluded by asking participants to identify their “top wish” for the NIH Data Commons in the next 6 months; here are their answers.
“Within the next six months, the NIH Data Commons should:”
Publish a set of guidelines and principles for interoperation.
Define a standard set of champion slides and enlist champions to share widely.
Define a number of empirical measures to define success and showcase progress.
Write a concrete articulation of some of the work that is going to be required (infrastructure, labor, and who is going to do the work).
Clearly identify what we are building and who and where (create a roadmap, and foster social agreement).
Lay out a visible path to see inside the black box.
Identify rate limiting steps of (e.g.) GTEx users so that we can tackle them concretely, and enlist those communities as champions.
Bring on users and have actual use cases!
Define “who is the community” of the Commons.
Lay out a roadmap of how “commonses” converge (e.g. how would you onboard a non-NIH Data Steward?).
Create a clear use case for cross data-set functionality (for instance, GTEx and TCGA).
Create participation signposts: “Here are the GitHub repos, here is how to participate.”
Define and implement LOW HANGING FRUIT on interoperability.
Bring in more data sets.
Provide open source code, libraries, and examples sufficient for people to understand the technical directions in which we are heading.
Develop a 1 page “cheat sheet”.
Demonstrate in six months that “a grad student” can do something with Data Commons that only the Broad can do now.
Pilar Ossorio (U. Wisconsin; NIH Data Commons External Panel of Consultants)
Lara Mangravite (Sage Bionetworks)
Jeremy Freeman (Chan-Zuckerberg Initiative)
Matthew Trunnell (Fred Hutch Cancer Research Center; NIH Data Commons External Panel of Consultants)
C. Titus Brown (UC Davis; NIH Data Commons)
Josh Greenberg (Sloan Foundation)
Sasha Zaranek (Curoverse/Veritas Genetics; NIH Data Commons)
Carol Willing (Project Jupyter)
David Siedzik (Broad Institute; NIH Data Commons)
Sarah Wylie (Northeastern University)
Elisha Wood-Charlson (DOE KBase)
Jen Yttri (MITRE; NIH Data Commons)
Carly Strasser (Fred Hutch Cancer Research Center)
Carl Kesselman (ISI / USC; NIH Data Commons)
Stan Ahalt (RENCI; NIH Data Commons)
Avi Ma’ayan (Icahn School of Medicine at Mount Sinai; NIH Data Commons)
Luiz Irber (UC Davis; NIH Data Commons)
Melissa Cragin (U. Illinois & NSF Midwest Big Data Hub)
Jason Williams (CSHL; NIH Data Commons External Panel of Consultants)
Kira Bradford (RENCI; NIH Data Commons; DataSTAGE)
Nadia Eghbal (Protocol Labs)
Erin Robinson (ESIP)
Brandi Davis-Dusenbery (Seven Bridges Genomics; NIH Data Commons)
Rebecca Calisi (UC Davis; NIH Data Commons)
Patricia Cruse (DataCite; NIH Data Commons)
Allison Heath (CHOP; Kids First Data Resource Center)