KC2 Overview for Full Stacks
DCPPC-DRAFT-Title: KC2 Phase 1 Overview for Full Stacks
Name of the person who is to be Point of Contact for the DCPPC-DRAFT: Mercè Crosas
Email of the person who is to be Point of Contact for the DCPPC-DRAFT: email@example.com
Submitting Individual or Group: KC2
Requested DCPPC-DRAFT posting start date: 07/16/2018
Date Emailed for consideration: 07/12/2018
DCPPC-DRAFT-Status: Active and open for comment
URL Link to this Document: https://docs.google.com/document/d/1lx3uakz4foYN8vw8E5U6NM2F7RcBjyIGA4ZE2zvE9_g/edit?usp=sharing
License: This work is licensed under a CC-BY-4.0 license.
KC2 Phase 1 Overview for Full Stacks
An informational overview of the KC recommendations, standards, and technologies aimed at the Full Stacks for Phase 1
1. Document Purpose
This document provides information for the Full Stacks that wish to demonstrate Key Capability 2, which is concerned largely with providing a set of Globally Unique Identifiers (GUIDs) that are findable, accessible, interoperable, and reusable in the Data Commons infrastructure.
Actionable identifier - An identifier that is easily resolvable (“clickable”) using widely available software, such as web browsers, document management systems, email clients, etc. In today’s Internet, actionable identifiers are URLs (HTTP URIs).
Archival Resource Key (ARK) - Persistent identifiers designed to support long-term access to information objects, defined using the conventions described in (Kunze & Rogers 2008 https://n2t.net/ark:/13030/c7cv4br18). The ARK framework is decentralized and its infrastructure is becoming community-owned via ARKs-in-the-Open. For DCPPC, ARK registration and resolution is supported by the California Digital Library (CDL), a unit of the University of California Office of the President. There are about 175 million ARKs globally, with 19 million registered at CDL.
Compact Identifier - A GUID expressed in the pattern <namespace_prefix>:<local_accession>, which can be resolved to a URI through a meta-resolver such as https://identifiers.org or https://n2t.net. The meta-resolver maps the namespace_prefix to a local resolver, to which it presents the local_accession. Similar to the CURIE (or Compact URI) defined by the W3C.
Data Catalog - A collection of datasets, as defined in the W3C Data Catalog Vocabulary specification (https://www.w3.org/TR/vocab-dcat/#vocabulary-overview).
DataCite DOI - DataCite DOIs are DOIs registered with DataCite (http://datacite.org), a DOI Registration Agency (https://www.doi.org/registration_agencies.html). DataCite DOIs identify long-term persistent, citable data, and have metadata elements specific to data. Approximately 5 million datasets are registered with DataCite DOIs.
Digital Object Identifier (DOI) - A persistent identifier for digital objects, as specified in ISO 26324:2012 (https://www.iso.org/standard/43506.html). DOIs are an implementation of the Handle system (http://handle.net).
Data Object Service - An API that can be used to access and manage data referred to by Data GUIDs, or by a unique identifier of your choosing (http://data-object-service.readthedocs.io/en/latest/).
Globally Unique Identifier (GUID) - An identifier following certain conventions to make it unique within a global context. In DCPPC, GUIDs are meant to be globally unique and actionable on the Web - or actionable when prefixed by a resolver URI.
GUID Broker - A GUID registration and landing service, which also serves as a local metadata registry, and maps internal system identifiers to persistent GUIDs.
Identifier - A name (alphanumeric string) linked to an object, set of objects, or concept, and meant to specify that object, set, or concept uniquely within some context.
Landing Page - An HTML page and sets of associated machine- and human-readable metadata provided by a landing service.
Landing Service - A web service integral to the resolution of persistent GUIDs, which mediates between the GUID itself, a set of GUID-associated metadata, and the GUID’s object resolution endpoint(s) or URLs. A landing service provides access to a landing page.
Minid - A minid is a (semi-) persistent GUID formed according to the specifications in (Ian Foster 2015 https://bd2k.ini.usc.edu/assets/all-hands-meeting/minid_v0.1_Nov_2015.pdf) and registered as an ARK.
Object Resolution Endpoint - The address on the final server involved in (the chain of redirects that fulfills) a resolution request.
Persistent GUID - A GUID which is guaranteed to persist over a defined timespan, i.e. one which is a persistent identifier (q.v.).
Persistent Identifier - A persistent identifier (PI or PID) is a long-lasting reference to a document, file, web page, or other object. In DCPPC, persistent identifiers are equivalent to Persistent GUIDs and are actionable on the Web when represented as URLs by prefixing the persistent GUID with a resolver service URL such as http://doi.org or http://n2t.net.
Resolver service - A GUID resolver service makes persistent GUIDs actionable on the Web by redirecting “get” requests to the GUID’s landing service, which then provides a limited set of metadata, including the object resolution endpoint(s), to the client for action.
Semi-Persistent GUID - A GUID with an expiration datetime in its metadata, meant to identify temporary objects such as intermediate computational results.
Universally Unique Identifier (UUID) - A type of GUID defined using the conventions in RFC 4122 (https://www.ietf.org/rfc/rfc4122.txt). In some contexts, GUIDs and UUIDs are treated as synonyms. In the DCPPC context, however, they are not synonyms but have a class:subclass relationship: every UUID is a GUID, but not every GUID is a UUID. UUIDs are uniquely resolvable on the web within a UUID-specific URN namespace.
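The UUID entry above can be illustrated with a minimal sketch that uses only the Python standard library. The helper name `uuid_to_urn` is hypothetical, and the sample value is simply the UUID that appears in the Data GUID example later in this document.

```python
import uuid


def uuid_to_urn(text: str) -> str:
    """Validate an RFC 4122 UUID string and return its URN form.

    uuid.UUID raises ValueError when the string is malformed.
    """
    return uuid.UUID(text).urn


# Sample value taken from the Data GUID example in this document.
print(uuid_to_urn("a5d79375-1ba8-418f-9dda-eb981375e516"))
```

The URN form places the UUID in the `urn:uuid:` namespace, which is the UUID-specific namespace the definition refers to.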
3. Types of GUIDs in the DCPPC
Compact ID
Short Definition: Compact Identifier
Supported by: EMBL-EBI and California Digital Library. Supported in DCPPC by Team Sodium.
Services: Identifiers.org resolution service and prefix registration service
Primary use case: Creating GUIDs for life science data with local accession numbers.
How are they used or planning to be used by the DCPPC: Identifiers.org and N2T.net allow services to register namespaces that render local accession numbers globally unique on the Web using the CURIE format. N2T.net harvests namespace records from Identifiers.org.
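As a rough illustration of the Compact Identifier pattern, the sketch below splits a `namespace_prefix:local_accession` string and prepends a meta-resolver URL. The function names are hypothetical, and the sketch assumes the prefix is everything before the first colon; it does not check that the prefix is actually registered.

```python
from typing import Tuple


def split_compact_id(compact_id: str) -> Tuple[str, str]:
    """Split 'prefix:accession' at the first colon."""
    prefix, sep, accession = compact_id.partition(":")
    if not sep or not prefix or not accession:
        raise ValueError(f"not a compact identifier: {compact_id!r}")
    return prefix, accession


def to_resolver_url(compact_id: str, resolver: str = "https://identifiers.org") -> str:
    """Make a compact identifier actionable by prefixing a meta-resolver."""
    prefix, accession = split_compact_id(compact_id)
    return f"{resolver.rstrip('/')}/{prefix}:{accession}"


# The namespace and accession come from the Data GUID example in this document.
print(to_resolver_url(
    "ga4ghdos:dg.4503/a5d79375-1ba8-418f-9dda-eb981375e516",
    resolver="https://n2t.net",
))
```

Only the first colon is treated as the separator, so a local accession that itself contains a colon is left intact.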
DataCite DOI
Short Definition: DataCite Digital Object Identifiers
Syntax/Example: https://doi.org/10.25491/cq8s-f809, which consists of a proxy (https://doi.org), a prefix (10.25491), and a suffix (cq8s-f809).
Supported by: DataCite as a Registration Agency of the DOI Foundation. Within the DCPPC, supported by Team Sodium and Team Xenon.
Resolving Systems: doi.org
Services: DOI and accompanying metadata registration; DOI resolution; DOI indexing
Primary use case: Data that is not only persistent but also stable (<10 versions anticipated) and likely to be cited; data for which the required metadata is both available and useful to the consumer; and data for which funding exists to cover the per-ID costs. Ideally, such funding covers all of the citable, stable data to be identified from a source, not only a subset.
How are they used or planning to be used by the DCPPC: DOIs should be used for data that needs to be persistent, citable, shareable, findable, accessible, reusable, and interoperable. Examples include primary data from data stewards and derived datasets that are published and referenced from an article or other scholarly work.
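The proxy, prefix, and suffix structure of a DOI URL can be sketched as follows. The function name is hypothetical; the only rule the sketch relies on is that DOI prefixes begin with `10.` and are separated from the suffix by the first slash.

```python
from typing import Dict
from urllib.parse import urlparse


def split_doi_url(url: str) -> Dict[str, str]:
    """Split an actionable DOI URL into proxy, prefix, and suffix."""
    parsed = urlparse(url)
    proxy = f"{parsed.scheme}://{parsed.netloc}"
    # The prefix ends at the first slash; the suffix may itself contain slashes.
    prefix, sep, suffix = parsed.path.lstrip("/").partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI URL: {url!r}")
    return {"proxy": proxy, "prefix": prefix, "suffix": suffix}


# The DOI below is the example given in this section.
print(split_doi_url("https://doi.org/10.25491/cq8s-f809"))
```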
ARK
Short Definition: Archival Resource Keys (via EZID and N2T.net)
Syntax/Example: https://n2t.net/ark:/13030/qt5gz1r3mc?, where https://n2t.net proxies (makes actionable) the globally unique string that follows it. That string consists of the identifier type (ark:), the name assigning authority (13030), the base name (qt5gz1r3mc), and optional suffixes (absent here) that express variants and hierarchy. In this example, the optional inflection (‘?’) is appended to request metadata instead of the object itself.
Supported by: California Digital Library (CDL) for EZID registration and N2T.net for resolution. Supported in DCPPC by Team Sodium, Team Helium, and Team Xenon.
Resolving System: N2T.net (resolving to local services, e.g., Data Object Service (DOS))
Services: ARK and minid registration; native ARK resolution and minid (via CURIE) resolution
Primary use case: Data that is either stable or volatile, and that needs more metadata flexibility.
How are they used or planning to be used by the DCPPC: ARKs can be used like DOIs, but in DCPPC they are primarily used to register minids (q.v.).
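The ARK anatomy described above can be sketched with a regular expression. This is an illustration rather than a full ARK parser: the function name is hypothetical, the name assigning authority is assumed to be all digits (as in the example), base name and any suffixes are captured together as `name`, and only the single `?` inflection mentioned in this section is handled.

```python
import re
from typing import Dict

# Assumption: the NAAN is numeric, as in the example in this section.
ARK_URL = re.compile(
    r"^(?P<resolver>https?://[^/]+)/ark:/(?P<naan>\d+)/(?P<name>[^?]+)(?P<inflection>\??)$"
)


def parse_ark_url(url: str) -> Dict[str, str]:
    """Split an actionable ARK URL into resolver, NAAN, name, and inflection."""
    match = ARK_URL.match(url)
    if match is None:
        raise ValueError(f"not an ARK URL: {url!r}")
    return match.groupdict()


parts = parse_ark_url("https://n2t.net/ark:/13030/qt5gz1r3mc?")
# A trailing '?' signals a request for metadata rather than the object.
print(parts["naan"], parts["name"], parts["inflection"] == "?")
```

The pattern is matched against the raw string because `urllib.parse.urlparse` reports an empty query for a bare trailing `?`, which would hide the inflection.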
Minid
Short Definition: Minimal Viable Identifier
Link to more information: link
Link to client software: link
Supported by: Supported in DCPPC by Team Argon, Team Sodium, and Team Helium.
Resolving System: N2T.net
Services: Identifier minting and resolution
Primary use case: Association of persistent (and verifiable) names with files and collections. Provides an unambiguous anchor from which arbitrary metadata can be associated and enables validation of integrity by ensuring that all minids have an associated checksum.
How are they used or planning to be used by the DCPPC: Digital artifacts that have been registered as minids will be made available as ARKs.
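The integrity check that a minid's checksum enables might look like the sketch below. It is an assumption-laden illustration: the function names are hypothetical, SHA-256 is assumed as the checksum function, and fetching the registered checksum from the minid's metadata is not shown.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_registered_checksum(stream: BinaryIO, registered_checksum: str) -> bool:
    """Compare the content's digest with the checksum registered for the minid."""
    return sha256_of_stream(stream) == registered_checksum.strip().lower()


# Stand-in content; a real caller would pass open(path, "rb") instead.
content = io.BytesIO(b"example payload")
registered = hashlib.sha256(b"example payload").hexdigest()  # placeholder for minid metadata
print(matches_registered_checksum(content, registered))
```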
Data GUID
Definition: Domain-neutral persistent identifiers
Link to more information: link
Syntax/Example: https://n2t.net/ga4ghdos:dg.4503/a5d79375-1ba8-418f-9dda-eb981375e516 resolves through N2T.net. For high performance, use the resolution target directly, for example https://dataguids.org/ga4gh/dos/v1/dataobjects/dg.4503/01b048d0-e128-4cb0-94e9-b2d2cab7563d
Supported by: Supported in DCPPC by Team Calcium and Team Helium.
Resolving System: Data Object Service, indexd
Services: Identifier resolution, prefix registration, identifier minting
Primary use case: High-performance internal routing of identifiers within a full stack, combined with the ability to resolve IDs externally.
How are they used or planning to be used by the DCPPC: Data GUIDs MAY be used for any data object that requires persistence or unique identification.
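The two access paths for a Data GUID, through the N2T.net resolver or directly at the resolution target, can be sketched as below. The URL layouts are copied from the Syntax/Example entry in this section; the function name is hypothetical, and the sketch assumes a Data GUID always has the form `<prefix>/<local id>`.

```python
from typing import Dict

# URL layouts taken from the Syntax/Example entry for Data GUIDs.
N2T_RESOLVER = "https://n2t.net"
DOS_NAMESPACE = "ga4ghdos"
DOS_ENDPOINT = "https://dataguids.org/ga4gh/dos/v1/dataobjects"


def data_guid_urls(data_guid: str) -> Dict[str, str]:
    """Return the resolver URL and the direct Data Object Service URL."""
    prefix, sep, local_id = data_guid.partition("/")
    if not sep or not prefix or not local_id:
        raise ValueError(f"expected '<prefix>/<local id>': {data_guid!r}")
    return {
        "via_resolver": f"{N2T_RESOLVER}/{DOS_NAMESPACE}:{data_guid}",
        "direct": f"{DOS_ENDPOINT}/{data_guid}",
    }


urls = data_guid_urls("dg.4503/a5d79375-1ba8-418f-9dda-eb981375e516")
print(urls["via_resolver"])
print(urls["direct"])
```

External clients would typically use the resolver form, while services inside a full stack can skip the redirect by calling the direct form.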
4. Core Metadata High-Level Recommendations
Interoperability of identifier strategies depends on the ability of different full stacks to exchange descriptions of the GUIDs and of the data that the GUIDs are bound to (Core Metadata). For this reason:
A FS SHOULD provide a core set of metadata associated with a data object that is represented using the schema.org metadata schema.
The Core metadata SHOULD be human and machine readable from the dataset landing page or service. The FS MAY provide machine readable metadata in different serializations.
There is a DCPPC-DRAFT specific to Core Metadata. A FS implementation SHOULD provide a minimal set of core metadata that is compliant with that document.
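A machine-readable record using the schema.org vocabulary could be serialized as JSON-LD along the lines of the sketch below. The choice of fields here is illustrative only; the authoritative minimal set is defined by the Core Metadata DCPPC-DRAFT, and the function name and all example values are placeholders.

```python
import json
from typing import Any, Dict


def core_metadata_jsonld(identifier: str, name: str, description: str,
                         content_url: str) -> Dict[str, Any]:
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "identifier": identifier,
        "name": name,
        "description": description,
        # DataDownload.contentUrl points at the bytes of the object itself.
        "distribution": {"@type": "DataDownload", "contentUrl": content_url},
    }


record = core_metadata_jsonld(
    identifier="https://doi.org/10.25491/cq8s-f809",  # DOI example from this document
    name="Placeholder dataset title",
    description="Placeholder description.",
    content_url="https://example.org/data/placeholder.tsv",
)
print(json.dumps(record, indent=2))
```

A landing page can embed this output in a `script` element of type `application/ld+json`, which keeps the same record readable by both people and machines.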
5. Landing Services API, Resolution and Registration Guidelines
Landing services provide Core Metadata on GUIDs, including object resolution endpoints. They enable interoperability between stacks and long-term persistence of GUIDs by mediating between GUIDs and their direct object resolution behavior. They are an abstraction of the current “landing page” model supported by DOIs, ARKs, and minids, extended to the Cloud environment. KC2 has worked on an RFC for common GUID services, including landing services: https://docs.google.com/document/d/1Ug-druVDEnZSDZIBeouvg3gbrBpSr9zaoxfjnNQLZaQ/edit?usp=sharing
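The mediating role of a landing service can be pictured with a small in-memory model. This is a conceptual sketch only and is not the API proposed in the RFC: every class, method, and URL below is hypothetical. It shows why a GUID can stay stable while the object it points to moves.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LandingRecord:
    """What a landing service holds for one GUID (hypothetical shape)."""
    guid: str
    metadata: Dict[str, str]
    endpoints: List[str] = field(default_factory=list)


class LandingService:
    """Toy registry that sits between a GUID and its object endpoints."""

    def __init__(self) -> None:
        self._records: Dict[str, LandingRecord] = {}

    def register(self, record: LandingRecord) -> None:
        self._records[record.guid] = record

    def land(self, guid: str) -> LandingRecord:
        try:
            return self._records[guid]
        except KeyError:
            raise LookupError(f"unknown GUID: {guid}") from None

    def move_object(self, guid: str, new_endpoints: List[str]) -> None:
        # Only the endpoints change; the GUID and its metadata persist.
        self.land(guid).endpoints = list(new_endpoints)


service = LandingService()
service.register(LandingRecord("example:123", {"name": "Placeholder"},
                               ["https://old.example.org/obj/123"]))
service.move_object("example:123", ["https://new.example.org/obj/123"])
print(service.land("example:123").endpoints)
```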
6. KC2 External Advisory Board
An international, diverse group external to DCPPC
Experts on Identifiers for research objects and data and software citation
Select, define, and convene board during Phase 1. Start meetings during Phase 2 (~ 2 times yearly).
Geoff Bilder, CrossRef
Herbert Van de Sompel, LANL
Nicolas Le Novere (started identifiers.org)
Jo McEntyre (EMBL-EBI)
Sünje Dallmeier-Tiessen (CERN)
Leslie McIntosh (RDA/US Executive Director)
Carole Goble (University of Manchester)
Neil Chue Hong (UK Software Sustainability Institute)
Roger Schonfeld (Director, Ithaka S+R Libraries and Scholarly Communication Program)
Larry Lannom (CNRI)