KC2 Core Metadata for GUIDs

DCPPC-DRAFT-#: 7

DCPPC-DRAFT-Title: Core Metadata for GUIDs

DCPPC-DRAFT-Type: Design Principle

Name of the person who is to be Point of Contact for the DCPPC-DRAFT: Mercè Crosas

Email of the person who is to be Point of Contact for the DCPPC-DRAFT: merce.crosas@gmail.com

Submitting Team: KC2

Requested DCPPC-DRAFT posting start date: 7/23/2018

Date Emailed for consideration: 7/13/2018

DCPPC-DRAFT-Status: Active and open for comment

URL Link to this Document: https://docs.google.com/document/d/1FD3aXr_uHnPy-YrFhQhuXET73tBVxu7F_Q5uS9TPUZs/edit?usp=sharing

License: This work is licensed under a CC-BY-4.0 license.

Core Metadata for GUIDs

Martin Fenner, Tim Clark, Dan Katz, Mercè Crosas, Patricia Cruse, John Kunze, Sarala Wimalaratne for Team Sodium

NA-1: DCPPC Team Sodium Deliverable: 2M.2.PRODUCT: Core Metadata: Complete documentation and add to Doc Library site

Version 31 May 2018

Document purpose

The Data Commons will need to uniquely identify any and all FAIR digital objects to enable long-term resolution of cited persistent data and potentially provide the capability to link disparate datasets. For this activity we reuse, adopt, and extend community-based standards already in place, for which there is demonstrated community engagement and endorsement.[1]

This document from KC2 describes the core metadata required and/or recommended for GUID registration in the Data Commons Pilot to enable citation, accessibility, etc. These core metadata are a subset of the metadata used in the Data Commons Pilot, and other KCs and full stacks will define additional metadata, in particular KC7 in its work on metadata required for indexing and discovery.

The core metadata described in this document help address a number of important use cases in DCPPC, including support for resolution and access behaviors of GUIDs, presentation of landing pages or landing services, and appropriate access to and citation of the content of digital objects (e.g., data, workflows, software). The core metadata specified in this document follow the FAIR Data Principles[2], specifically F1, F2, F4, I1-I3, and R1-R1.3.

Core Metadata Principles

  1. This document has been extensively reviewed and discussed within KC2.[1]

  2. KC2 supports GUIDs that are machine actionable, globally unique, and widely used by a community. In most cases these GUIDs are also persistent.

  3. Core metadata fields for GUIDs are tightly coupled to the particular type of GUID used, with a small set of metadata common to all GUID types supported by KC2.

  4. This document focuses on core metadata for datasets, and the data providers for those datasets. Other content types (software, workflows, etc.) will be covered insofar as they are relevant for current DCPPC use cases.

  5. GUID core metadata refers mostly to citation metadata for the digital object: the metadata that would be available in the landing page or landing service for that digital object. The tracking and accessibility of this metadata through a landing page is important for citation metrics.

  6. Extended metadata for describing a dataset are outside the scope of this document and will be addressed elsewhere, in particular in KC7. Extended metadata includes information about the (scientific) contents of the data object. The KC2 core metadata described here is a subset of KC7 metadata.

  7. All core metadata should be describable using the schema.org[3] metadata schema. If another metadata schema is used (such as DATS for KC7), a mapping to schema.org should exist.

The decision to use schema.org as the schema for core metadata was based on the following:

  1. Schema.org is supported by GUIDs used in the Data Commons Pilot, and mappings exist to other metadata standards used in the Data Commons Pilot, in particular DATS.[4]

  2. Schema.org is widely adopted outside the Data Commons Pilot, with a large number of services and tools available.

  3. Schema.org supports all metadata fields required for core metadata, so that no schema updates are needed during the pilot. We have identified at least one metadata field (checksum) that would benefit from extra work, and we have reached out to the schema.org community to work on this after the Data Commons Pilot.

DCPPC Integration

GUIDs and core metadata are essential for many DCPPC use cases, including the following use cases taken mainly from the DCPPC PEP.

Data Stewards

  1. GUIDs for GTEx, TOPMed and MODs, where needed, and using the appropriate GUID type.

KC1

  1. Develop, discuss, evaluate, and adopt FAIR community metrics.

  2. Align metadata descriptors developed for KC7 for datasets with FAIR metrics and guidelines for this type (datasets) of digital object.

  3. One of the FAIR evaluation criteria is the presence and implementation of an appropriate GUID to describe the digital object. The rubric should be coordinated with the standards developed by KC2.

KC2

  1. Namespace registry: extend the existing prefix registry built by identifiers.org.

    1. Minimum set of metadata needed to describe the data repository (data catalog) - identifier, name, description, provider, and resource url.

    2. Minimum set of metadata needed for the resolution (Dataset level) - identifier, access url, content url, name, description.

  2. GUID minting and registration capability.

  3. Landing page/GUID metadata service.

  4. Developing reference client libraries and tools that can be used by end users, other KCs, and full stacks.

KC3

  1. To achieve maximum collaboration and interoperability, API workflow standards will be developed.

KC7

  1. The crosscut metadata model describes metadata about various entities. These entities should be identified by GUIDs whose core metadata are described in this document.

KC8

  1. Creation of an initial set of three driving scientific use cases.

  2. Define points of interface with other key capabilities that are required to execute on these driving use cases.

Full Stacks

  1. Within 180 days, demonstrate interoperability between full-stacks by using shared APIs and building service brokers around areas of common functionality such as search, authentication, and workflows. Demonstrate this in the form of demos.

Data Catalogs

Data catalog is the term used by schema.org and DCAT[5] to describe a data repository hosting datasets. GTEx, dbGaP, and the various MODs are examples of data catalogs.

All GUIDs for published datasets use the includedInDataCatalog property to describe where the dataset is hosted. The KC2 Namespace Service maps namespaces to data catalogs, and provides mapping information for compact identifiers.

| Name | URL | Description | Required |
|---|---|---|---|
| @id | | Primary identifier expressed as URL/URI. See discussion of identifier. JSON-LD uses @id, RDFa 1.1 uses resource, microdata uses itemid. | Y |
| @type | http://schema.org/DataCatalog | Should be DataCatalog. | Y |
| identifier | http://schema.org/identifier | Compact identifier expressed as URL or string. | Y |
| name | http://schema.org/name | The name of the data catalog. | Y |
| url | http://schema.org/url | Location of the data catalog, expressed as HTTP URL. | Y |

Table 1. Core metadata for data catalogs

Example

{
    "@id": "https://www.ebi.ac.uk/miriam/main/datatypes/MIR:00000047",
    "@type": "DataCatalog",
    "name": "Rat Genome Database",
    "url": "http://rgd.mcw.edu/"
}
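
As a minimal sketch of how a record could be checked against Table 1, the following Python snippet reports which required fields are absent. The function name `missing_catalog_fields` is hypothetical and not part of any KC2 tool; the required field list is taken from Table 1, and the record is the example above.

```python
# Required fields for a data catalog, as listed in Table 1.
REQUIRED_CATALOG_FIELDS = ("@id", "@type", "identifier", "name", "url")


def missing_catalog_fields(record):
    """Return the Table 1 fields that are absent or empty in `record`.

    Hypothetical helper for illustration only.
    """
    return [field for field in REQUIRED_CATALOG_FIELDS if not record.get(field)]


# The data catalog example from this section.
catalog = {
    "@id": "https://www.ebi.ac.uk/miriam/main/datatypes/MIR:00000047",
    "@type": "DataCatalog",
    "name": "Rat Genome Database",
    "url": "http://rgd.mcw.edu/",
}

print(missing_catalog_fields(catalog))  # ['identifier']
```

Under this reading of Table 1, the example record would need an identifier value to be complete.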

GUIDs and Core Metadata

KC2 supports the following GUID types:

  1. Compact identifiers (via Identifiers.org or N2T.net) for accession numbers

  2. ARKs (via EZID), for example:

    1. Minids for temporary or semi-persistent GUIDs, with additional metadata

  3. Data GUIDs (via dataguids.org)

    1. Helium CommonsShare identifiers as Data GUIDs

  4. DOIs via DataCite for permanent GUIDs

These GUID types serve different use cases, and accordingly require and support different sets of core metadata. The core metadata supported by the various identifiers are listed below:

Name Compact Identifier ARK/Minid Data GUID DataCite DOI
@id Required Required Required Required
@type Required Required Required Required
identifier Required Required Required Required
url[6] Required Required Required
includedInDataCatalog Required Required
name Required Required Optional Required
author Required Optional Required
datePublished Optional Required
dateCreated[7] Required Optional Optional
additionalType Optional
description Optional Optional
keywords Optional Optional
license Optional Optional
version Optional Optional
citation Optional
isBasedOn Optional
isPredecessor Optional
isSuccessor Optional
hasPart Optional
isPartOf Optional
funder Optional
contentSize Required Optional
fileFormat Optional
contentUrl Optional Optional Optional

Table 2. Required and optional core metadata by identifier.

Compact identifiers

Compact identifiers[8] only require the registration of a namespace in the Team Sodium namespace registry.[9] No additional metadata registration is needed, or possible, at this time. Compact identifiers are used to make local accession numbers globally unique without registration of each identifier. The compact identifier service is implemented by Team Sodium member EMBL-EBI.

| Name | URL | Description | Required |
|---|---|---|---|
| @id | | Primary identifier expressed as URL/URI. See discussion of identifier. JSON-LD uses @id, RDFa 1.1 uses resource, microdata uses itemid. | Y |
| @type | http://schema.org/Dataset, http://schema.org/SoftwareSourceCode, http://schema.org/CreativeWork | Should in most cases be Dataset. Use CreativeWork if data provider (data catalog) provides multiple different types. | Y |
| identifier | http://schema.org/identifier | Compact identifier expressed as URL or string. | Y |
| url | http://schema.org/url | Location of the resource, expressed as HTTP URL. This would normally be the landing page. | R |
| includedInDataCatalog | http://schema.org/includedInDataCatalog | Data provider (data catalog) that hosts this dataset. | R |
| name | http://schema.org/name | The name or title of the resource. | Y |

Table 3. Core metadata for compact identifiers

Example

{
    "@context": "http://schema.org",
    "@id": "https://identifiers.org/rgd:2825",
    "@type": "Dataset",
    "identifier": "rgd:2825",
    "url": "https://rgd.mcw.edu/rgdweb/report/gene/main.html?id=2825",
    "includedInDataCatalog": {
        "@id": "https://www.ebi.ac.uk/miriam/main/datatypes/MIR:00000047",
        "@type": "DataCatalog",
        "name": "Rat Genome Database",
        "url": "http://rgd.mcw.edu/"
    }
}
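
To illustrate how a compact identifier relates to its resolvable form, here is a small Python sketch. It assumes the `prefix:accession` layout and the resolver URL pattern seen in the example above (`https://identifiers.org/rgd:2825`); the function names are hypothetical, and the sketch does not contact any registry or check that the prefix is registered.

```python
def split_compact_identifier(compact_id):
    """Split 'prefix:accession' at the first colon.

    Hypothetical helper; raises ValueError if either part is missing.
    """
    prefix, sep, accession = compact_id.partition(":")
    if not sep or not prefix or not accession:
        raise ValueError(f"not a compact identifier: {compact_id!r}")
    return prefix, accession


def resolver_url(compact_id, resolver="https://identifiers.org"):
    """Build the resolver URL used as @id in the example above."""
    prefix, accession = split_compact_identifier(compact_id)
    return f"{resolver}/{prefix}:{accession}"


print(split_compact_identifier("rgd:2825"))  # ('rgd', '2825')
print(resolver_url("rgd:2825"))              # https://identifiers.org/rgd:2825
```

Passing `resolver="https://n2t.net"` would produce the corresponding N2T.net form under the same assumption about the URL pattern.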

ARKs and Minids

The Archival Resource Key (ARK) is a naming scheme for persistent access to digital objects, implemented by Team Sodium member California Digital Library (CDL). There are no specific metadata requirements for ARKs. Minids[10] are an example implementation using ARKs as the GUID, developed by Team Argon. The main use case for Minids is intermediate, often transient, data products that are created during a research project.

| Name | URL | Description | Required |
|---|---|---|---|
| @id | | Primary identifier expressed as URL/URI. See discussion of identifier. JSON-LD uses @id, RDFa 1.1 uses resource, microdata uses itemid. | Y |
| @type | http://schema.org/Dataset, http://schema.org/SoftwareSourceCode, http://schema.org/CreativeWork | Should in most cases be Dataset. Use CreativeWork if data provider (data catalog) provides multiple different types. | N |
| identifier | http://schema.org/identifier | Minid expressed as URL/URI. One of the identifiers must be a checksum using PropertyValue with the specific checksum algorithm and checksum. | Y |
| url | http://schema.org/url | Location of the resource, expressed as HTTP URL. This would normally be the landing page. | Y |
| contentUrl | http://schema.org/contentUrl | Actual bytes of the media object, for example the image file or video file. | N |
| includedInDataCatalog | http://schema.org/includedInDataCatalog | Data provider (data catalog) which hosts this dataset. | N |
| dateCreated | http://schema.org/dateCreated | The date on which the CreativeWork was created. | Y |
| expires | http://schema.org/expires | Date the content expires and is no longer useful or available. | N |
| name | http://schema.org/name | The name or title of the resource. | Y |
| author | http://schema.org/author | The author(s) of the dataset. Schema.org uses creator as synonym. | Y |

Table 4. Core metadata for Minids.

Example

{
    "@context": "http://schema.org",
    "@id": "https://n2t.net/ark:/88120/r8059v",
    "@type": "CreativeWork",
    "identifier": [
        "https://n2t.net/ark:/88120/r8059v",
        {
            "@type": "PropertyValue",
            "name": "sha-256",
            "value": "cacc1abf711425d3c554277a5989df269cefaa906d27f1aaa72205d30224ed5f"
        }
    ],
    "url": "http://minid.bd2k.org/minid/landingpage/ark:/88120/r8059v",
    "contentUrl": "http://bd2k.ini.usc.edu/assets/all-hands-meeting/minid_v0.1_Nov_2015.pdf",
    "name": "minid: A BD2K Minimal Viable Identifier Pilot v0.1",
    "author": {
        "@id": "http://orcid.org/0000-0003-2129-5269",
        "@type": "Person",
        "name": "Ian Foster"
    },
    "dateCreated": "2015-11-10T04:44:44.387671Z"
}

DOIs

Digital object identifiers (DOIs) are persistent identifiers mainly used for scholarly content, including research data. Team Sodium member DataCite is a DOI Registration Agency with a focus on DOIs for datasets, with more than four million DOIs for research data registered so far. DOIs have standard required and optional metadata. The main use case for DOIs is persistent datasets that need to be referenced and cited by other resources, e.g., to fully describe workflows, or links to publications, funding and people.

| Name | URL | Description | Required |
|---|---|---|---|
| @id | | Primary identifier expressed as URL/URI. See discussion of identifier. JSON-LD uses @id, RDFa 1.1 uses resource, microdata uses itemid. | Y |
| @type | http://schema.org/Dataset, http://schema.org/SoftwareSourceCode, http://schema.org/CreativeWork | Should in most cases be Dataset. | Y |
| identifier | http://schema.org/identifier | DOI expressed as URL. One of the identifiers can be a checksum using PropertyValue with the specific checksum algorithm and checksum. | Y |
| url | http://schema.org/url | Location of the resource, expressed as HTTP URL. This would normally be the landing page. | Y |
| includedInDataCatalog | http://schema.org/includedInDataCatalog | Data provider (data catalog) which hosts this dataset. | Y |
| name | http://schema.org/name | The name or title of the resource. | Y |
| author | http://schema.org/author | The author(s) of the dataset. Schema.org uses creator as synonym. | Y |
| datePublished | http://schema.org/datePublished | Date of first publication. | Y |
| dateCreated | http://schema.org/dateCreated | The date on which the CreativeWork was created. | N |
| additionalType | http://schema.org/additionalType | An additional type for the item, typically used for adding more specific types from external vocabularies. | N |
| description | http://schema.org/description | Used for discovery. | N |
| keywords | http://schema.org/keywords | Used for discovery. | N |
| license | http://schema.org/license | A license document that applies to this content, typically indicated by URL. | N |
| version | http://schema.org/version | The version of the dataset. | N |
| citation | http://schema.org/citation | A citation or reference to another creative work, e.g. an article describing the dataset. | N |
| isBasedOn | http://schema.org/isBasedOn | A resource that was used in the creation of this resource. | N |
| predecessorOf | http://schema.org/predecessorOf | A pointer to the next version. | N |
| successorOf | http://schema.org/successorOf | A pointer to the previous version. | N |
| hasPart | http://schema.org/hasPart | A dataset (or other creative work) that is part of this resource. | N |
| isPartOf | http://schema.org/isPartOf | A dataset (or other creative work) that this resource is part of. | N |
| funder | http://schema.org/funder | One or more organizations that provided support via a financial contribution. | N |
| contentSize | http://schema.org/contentSize | File size. | N |
| fileFormat | http://schema.org/fileFormat | MIME format, or described via URL. | N |
| contentUrl | http://schema.org/contentUrl | Actual bytes of the media object, for example the image file or video file. | N |

Table 5. Core metadata for DOIs

Example

{
    "@context": "http://schema.org",
    "@type": "Dataset",
    "@id": "https://doi.org/10.25491/5e92-ht74",
    "identifier": "https://doi.org/10.25491/5e92-ht74",
    "additionalType": "Data dictionary",
    "name": "A data dictionary that describes each variable in the GTEx_v7_Annotations_SubjectPhenotypesDS.txt",
    "author": {
        "@type": "Organization",
        "name": "The GTEx Consortium"
    },
    "keywords": "gtex, annotation, phenotype, gene regulation, transcriptomics",
    "datePublished": "2017",
    "includedInDataCatalog": {
        "@type": "Organization",
        "name": "GTEx"
    },
    "version": "v7",
    "url": "https://www.gtexportal.org/home/datasets",
    "contentSize": "5.4 Mb",
    "fileFormat": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "funder": {
        "@type": "Organization",
        "@id": "https://doi.org/10.13039/100000050",
        "name": "National Heart, Lung, and Blood Institute"
    }
}
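
The required fields in Table 5 lend themselves to a simple presence check before a DOI is registered. The Python sketch below is illustrative only: `missing_required` is a hypothetical helper, the field list is copied from the rows marked Y in Table 5, and the record is an abridged copy of the GTEx example above. It checks presence, not the validity of the values.

```python
# Fields marked "Y" (required) for DOIs in Table 5.
REQUIRED_DOI_FIELDS = (
    "@id", "@type", "identifier", "url",
    "includedInDataCatalog", "name", "author", "datePublished",
)


def missing_required(record, required=REQUIRED_DOI_FIELDS):
    """Return the required fields that are absent or empty in `record`."""
    return [field for field in required if not record.get(field)]


# Abridged copy of the GTEx example; optional fields are omitted.
gtex = {
    "@context": "http://schema.org",
    "@type": "Dataset",
    "@id": "https://doi.org/10.25491/5e92-ht74",
    "identifier": "https://doi.org/10.25491/5e92-ht74",
    "name": "A data dictionary that describes each variable in the "
            "GTEx_v7_Annotations_SubjectPhenotypesDS.txt",
    "author": {"@type": "Organization", "name": "The GTEx Consortium"},
    "datePublished": "2017",
    "includedInDataCatalog": {"@type": "Organization", "name": "GTEx"},
    "url": "https://www.gtexportal.org/home/datasets",
}

print(missing_required(gtex))  # []
```

The same helper could be reused for the other GUID types by passing a different `required` tuple built from the corresponding table.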

Comments about specific metadata properties

A. url and contentUrl

We distinguish between the URL for the content itself (contentUrl) and the URL for the landing page of the resource (url). There can be more than one url and contentUrl for each resource, e.g. for datasets hosted in multiple cloud environments and full stacks.
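
Because each of these two properties may hold either a single value or several, a consumer has to handle both shapes. The following Python sketch shows one way to normalize them; the helper names and the `example.org` URLs are hypothetical and chosen only for illustration.

```python
def as_list(value):
    """Normalize a JSON-LD property value to a list (empty if absent)."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def locations(record):
    """Collect landing page URLs and content URLs from a metadata record."""
    return {
        "url": as_list(record.get("url")),
        "contentUrl": as_list(record.get("contentUrl")),
    }


# Hypothetical record: one landing page, content mirrored in two clouds.
record = {
    "url": "https://example.org/landing/1",
    "contentUrl": [
        "https://cloud-a.example.org/data/1.tsv",
        "https://cloud-b.example.org/data/1.tsv",
    ],
}

print(locations(record))
```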

B. identifier

Checksums for files are expressed as an identifier, including the algorithm used (MD5, sha-1, sha-256). Each dataset needs at least one KC2 GUID identifier, but can have multiple identifiers.
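
As a sketch of how such a checksum identifier could be produced and checked, the Python snippet below builds a PropertyValue entry in the shape used by the Minid example and verifies file content against it. The function names are hypothetical; the mapping from the algorithm names used in this document to `hashlib` names is an assumption of this sketch, not something schema.org prescribes.

```python
import hashlib

# Assumed mapping from the algorithm names used in this document
# to the names understood by Python's hashlib.
HASHLIB_NAMES = {"md5": "md5", "sha-1": "sha1", "sha-256": "sha256"}


def checksum_identifier(data, name="sha-256"):
    """Return a PropertyValue identifier holding the checksum of `data`."""
    digest = hashlib.new(HASHLIB_NAMES[name], data).hexdigest()
    return {"@type": "PropertyValue", "name": name, "value": digest}


def matches_checksum(record, data):
    """True if any checksum identifier in `record` matches `data`."""
    identifiers = record.get("identifier", [])
    if not isinstance(identifiers, list):
        identifiers = [identifiers]
    for item in identifiers:
        if not isinstance(item, dict) or item.get("@type") != "PropertyValue":
            continue  # plain GUID strings are not checksums
        algorithm = HASHLIB_NAMES.get(str(item.get("name", "")).lower())
        if algorithm and hashlib.new(algorithm, data).hexdigest() == item.get("value"):
            return True
    return False


content = b"example file content"
record = {
    "identifier": [
        "https://n2t.net/ark:/88120/r8059v",
        checksum_identifier(content),
    ]
}
print(matches_checksum(record, content))      # True
print(matches_checksum(record, b"tampered"))  # False
```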

C. expires

This property should be used to indicate that the resource is not persistent. If possible, the GUID and metadata should be preserved beyond the lifetime of the resource.
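
A consumer could use this property to decide whether the content is still expected to be available while keeping the GUID and metadata. The Python sketch below is a minimal illustration; the helper names and the `expires` value are hypothetical, and it assumes ISO 8601 timestamps like the dateCreated value in the Minid example, treating values without a time zone as UTC.

```python
from datetime import datetime, timezone


def parse_timestamp(text):
    """Parse an ISO 8601 date or timestamp into a timezone-aware datetime."""
    if text.endswith("Z"):
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: UTC
    return parsed


def is_expired(record, now=None):
    """True if the record has an `expires` value that is not in the future."""
    expires = record.get("expires")
    if expires is None:
        return False  # no expiry stated
    if now is None:
        now = datetime.now(timezone.utc)
    return parse_timestamp(expires) <= now


# Hypothetical record with an expiry date.
record = {"@id": "https://n2t.net/ark:/88120/r8059v", "expires": "2018-12-31"}
check_time = datetime(2019, 1, 1, tzinfo=timezone.utc)
print(is_expired(record, now=check_time))  # True
```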

Appendix

Schema.org metadata can be mapped to/from other metadata standards, including DataCite metadata and DATS[11].

| Schema.org | Dublin Core | DataCite | DATS |
|---|---|---|---|
| @id, identifier | identifier | identifier | identifier |
| name | title | title | title |
| author | creator | creator | creator |
| publisher | publisher | publisher | publisher |
| datePublished | date | publicationYear | date |

Table 6. Mapping of core metadata across common metadata schemata
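
The correspondence in Table 6 can be expressed as a simple lookup. The Python sketch below renames schema.org property names to their DataCite counterparts, using @id as the source of the identifier. It is illustrative only: `to_datacite_names` is a hypothetical helper, and the result is a flat renaming, not a complete or valid DataCite metadata record.

```python
# schema.org property name -> DataCite property name, following Table 6.
SCHEMA_ORG_TO_DATACITE = {
    "@id": "identifier",
    "name": "title",
    "author": "creator",
    "publisher": "publisher",
    "datePublished": "publicationYear",
}


def to_datacite_names(record):
    """Rename the properties covered by Table 6; drop everything else."""
    return {
        SCHEMA_ORG_TO_DATACITE[key]: value
        for key, value in record.items()
        if key in SCHEMA_ORG_TO_DATACITE
    }


record = {
    "@id": "https://doi.org/10.25491/5e92-ht74",
    "name": "A data dictionary",
    "datePublished": "2017",
    "version": "v7",  # not covered by Table 6, so it is dropped
}
print(to_datacite_names(record))
```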

While the focus of this document is on core metadata for datasets, GUIDs for other resource types are needed in the Data Commons Pilot. The most relevant resource types are listed below.

| Name | URL | Description |
|---|---|---|
| DataCatalog | http://schema.org/DataCatalog | Describes the data repository. |
| Dataset | http://schema.org/Dataset | Describes the dataset. |
| Software | http://schema.org/SoftwareApplication | Describes software, e.g. software used to generate a dataset. |
| Collection | http://bib.schema.org/Collection | A created collection of datasets or other creative works. |
| Report | http://schema.org/Report | A formal account of the proceedings or transactions of a group. |

Table 7. Resource types relevant for GUID core metadata

References

  1. Schema.org: http://schema.org/

  2. Bioschemas: http://bioschemas.org/index.html

  3. Fenner, M., Crosas, M., Grethe, J., Kennedy, D., Hermjakob, H., Rocca-Serra, P., … Clark, T. (2016). A Data Citation Roadmap for Scholarly Data Repositories. https://doi.org/10.1101/097196

  4. Wimalaratne, S. M., Juty, N., Kunze, J., Janée, G., McMurry, J. A., Beard, N., … Clark, T. (2018). Uniform resolution of compact identifiers for biomedical data. Scientific Data, 5, 180029. https://doi.org/10.1038/sdata.2018.29

  5. Sansone, S.-A., Gonzalez-Beltran, A., Rocca-Serra, P., Alter, G., Grethe, J. S., Xu, H., … Ohno-Machado, L. (2017). DATS, the data tag suite to enable discoverability of datasets. Scientific Data, 4, 170059. https://doi.org/10.1038/sdata.2017.59

  6. Data Catalog Vocabulary (DCAT): https://www.w3.org/TR/vocab-dcat/

  7. HCLS Community Profile: https://www.w3.org/TR/hcls-dataset

  8. EDMI guidelines

Footnotes

  1. There is not yet consensus from all the teams. At the time of delivery of this as an RFC, Teams Sodium, Xenon, and Calcium have endorsed it.

  2. https://www.force11.org/group/fairgroup/fairprinciples

  3. http://schema.org/

  4. https://biocaddie.org/publications/presentations/datamed-dats-model-annotated-schema-org

  5. https://www.w3.org/TR/vocab-dcat/

  6. See Item A in Section “Comments about specific metadata properties” below.

  7. Time when data object is “published”, where publishing here means the time a GUID is assigned to the data object.

  8. Wimalaratne, S. M., Juty, N., Kunze, J., Janée, G., McMurry, J. A., Beard, N., … Clark, T. (2018). Uniform resolution of compact identifiers for biomedical data. Scientific Data, 5, 180029. https://doi.org/10.1038/sdata.2018.29

  9. http://identifiers.org/registry/

  10. http://bd2k.ini.usc.edu/assets/all-hands-meeting/minid_v0.1_Nov_2015.pdf

  11. Sansone, S.-A., Gonzalez-Beltran, A., Rocca-Serra, P., Alter, G., Grethe, J. S., Xu, H., … Ohno-Machado, L. (2017). DATS, the data tag suite to enable discoverability of datasets. Scientific Data, 4, 170059. https://doi.org/10.1038/sdata.2017.59