KC7 Initial Products

KC7 Phase I Products

Authors: Mike D'Arcy, Ian Foster, Carl Kesselman, Robert Schuler (Team Argon, PI: Foster); Robert Carter, Jonathan Crabtree, Victor Felix, Michelle Giglio, Alejandra Gonzalez-Beltran, Olukemi Ifeonu, Anup Mahurkar, Suvarna Nadendla, Philippe Rocca-Serra, Susanna-Assunta Sansone, Owen White (Team Phosphorus, PIs: Sansone, White)

Contact point: Ian Foster


Tags: KC7, metadata, index, search, DATS, JSON-LD, normalization, harmonization, Crosscut Metadata Model (C2M2), Crosscut Metadata Instance (CMI)

The KC7 Crosscut Metadata Model (C2M2) and Crosscut Metadata Instance (CMI) provide normalized, harmonized metadata for the Data Commons' three core datasets--TOPMed, GTEx, and AGR MODs--to enable indexing and searching by the four Data Commons "Full Stacks."

The native metadata for the TOPMed, GTEx, and AGR datasets has diverse encodings. For example, when describing an event, some studies record the subject's age, while others record the subject's birth date and the event date, requiring users to calculate the age. To search across multiple datasets, applications must first normalize and harmonize these differences, which is an intellectually demanding and labor-intensive activity.
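The age example can be sketched in a few lines of Python. This is only an illustration of the kind of normalization involved, not KC7 code: the record layout and the field names `age`, `birth_date`, and `event_date` are hypothetical, and dates are assumed to be ISO 8601 strings.

```python
from datetime import date


def age_at_event(record):
    """Return the subject's age in whole years at the event, or None if unknown.

    Field names are hypothetical; real studies use their own encodings.
    """
    # Case 1: the study records the age directly.
    if record.get("age") is not None:
        return int(record["age"])

    # Case 2: the study records a birth date and an event date instead.
    birth = record.get("birth_date")
    event = record.get("event_date")
    if birth is None or event is None:
        return None

    b = date.fromisoformat(birth)
    e = date.fromisoformat(event)
    # Subtract one year if the birthday has not yet occurred in the event year.
    return e.year - b.year - ((e.month, e.day) < (b.month, b.day))


# Two made-up records with different native encodings of the same fact.
study_a = {"subject": "A-1", "age": 54}
study_b = {"subject": "B-7", "birth_date": "1960-06-15", "event_date": "2015-03-01"}

ages = [age_at_event(study_a), age_at_event(study_b)]
```

Both records reduce to the same normalized value, so a search on age can treat the two studies uniformly.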

C2M2 defines a mapping from the native metadata to the DatA Tag Suite (DATS) model. KC7 extended DATS as needed to represent the Data Commons' datasets and contributed the results back to the open-source DATS project. The CMI is the TOPMed, GTEx, and AGR metadata, prepared for ingestion into the Full Stacks. The CMI is encoded using C2M2, packaged as a Big Data Bag (BDBag), and referenced using a KC2-compatible persistent identifier.
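As a rough illustration of what a mapping from native metadata to a common model involves, the sketch below renames per-dataset fields to shared names and harmonizes their values. It is not the C2M2 mapping and does not use the DATS schema: the dataset labels, field names, and value tables are all hypothetical.

```python
# Hypothetical per-dataset field mappings: native key -> common key.
FIELD_MAPS = {
    "study_a": {"SUBJID": "subject_id", "SEX": "sex"},
    "study_b": {"participant": "subject_id", "gender": "sex"},
}

# Hypothetical value harmonization tables, keyed by common field name.
VALUE_MAPS = {
    "sex": {"f": "female", "m": "male", "female": "female", "male": "male"},
}


def to_common(source, native):
    """Map one native record into the illustrative common structure."""
    mapping = FIELD_MAPS[source]
    common = {"source": source}
    for native_key, value in native.items():
        key = mapping.get(native_key)
        if key is None:
            continue  # field has no counterpart in the common structure
        values = VALUE_MAPS.get(key)
        if values is not None:
            # Harmonize known spellings; keep unrecognized values unchanged.
            value = values.get(str(value).lower(), value)
        common[key] = value
    return common


record_a = to_common("study_a", {"SUBJID": "A-1", "SEX": "F", "EXTRA": 1})
record_b = to_common("study_b", {"participant": "B-7", "gender": "female"})
```

After mapping, both records carry the same keys and the same vocabulary for `sex`, which is what lets a single query run across datasets.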

The CMI has been ingested by three of the four DCPPC Full Stack teams, plus Team Oxygen. These teams demonstrated searches over the datasets using the CMI. In practice, the definition of C2M2, construction of the CMI, and ingestion of the CMI into multiple Full Stacks proceeded iteratively, with a tight feedback and re-release loop.

The KC7 CMI currently presents the Rat and Mouse AGR MODs, three of TOPMed's ~20 studies, and 652 GTEx subjects (14,070 samples, 14,070 sequence files).