Development iteration

Document Status

Status

Comment

DRAFT

(CSJ) Initial draft

DRAFT

(Dahl)

To enable a repository to communicate using the DataONE Member Node APIs, most groups take an iterative approach, implementing the lowest level APIs first, and working up through the Tiers. Implementing the DataONE APIs also involves managing identifiers, System Metadata, content versioning, and data packaging. We touch on these issues here.

Note

The DataONE Common and DataONE Client libraries provide serialization and deserialization of the DataONE types and language-specific wrappers for the DataONE APIs.

Implementing the Tier 1 Member Node API

MN developers will need to first implement the MNCore API, which provides utility methods for use by CNs, other MNs and ITK components. The ping() method provides a lightweight ‘are you alive’ operation, while the getCapabilities() provides the Node document so software clients can discover which services are provided by the MN. The CNs also use the getLogRecords() method to aggregate usage statistics of the Node and Object levels so that scientists and MN operators have access to usage for objects that they are authorized to view.

The bulk of the work within the system is provided by MNRead API methods. The CNs use the listObjects() method to harvest (or synchronize) Science Metadata from the Member Nodes on a scheduled basis. They also call get(), getSystemMetadata(), and getChecksum() to verify and register objects in the system. They will occasionally call synchronizationFailed() when there is an error, and MN operators should use these events to log these exceptions for getLogRecords(), and should inform the MN operator of the problems in a way they see fit (email, syslog, monitoring software, etc). Lastly, the getReplica() call is distinguished from the get() call in that it is only used between Member Nodes that are participating in the DataONE replication system, and that these events are logged differently to not inflate download statistics for objects. Implementing the MNRead API requires identifier management, System Metadata management, versioning, and packaging, described below.

Handling identifiers

To support as many repository systems as possible, DataONE puts few constraints on identifier strings that are used to identify objects. See the details for the Identifier Type. This flexibility requires that identifiers with special characters that would affect REST API calls, XML document structure, etc. will need to be escaped appropriately in serialized form. See the documentation on Identifiers for the details. MN operators will need to ensure that identifiers are immutable, in that there is a one to one relationship between an identifier and an object (byte for byte). This allows the DataONE system to use the checksum and byte count declared in System Metadata documents to verify and track objects correctly. Calling get() on an identifier must always return the same byte stream. See the documentation on mutability of content for more details.

System Metadata

Each object (Science Data, Science Metadata, or Resource Maps) that is exposed to the DataONE system via MNRead.listObjects() must have an accompanying System Metadata document. These documents provide basic information for access, synchronization, replication, and versioning of content. MN operators will need to either store or generate System Metadata.

One notable component of System Metadata is the formatId field. To support as many repository systems and object types as possible, DataONE assigns format information to objects according to an extensible list of object formats. The Object Format List holds a list of types akin to MIME types and file formats. The definitive list is found on the production CNs. When a new MN, holding a new type of object, joins DataONE, the new Object Formats should be added to the Object Format List before they are used in the System Metadata formatID field. DataONE has been involved in the Unified Digital Format Registry project to establish a comprehensive registry of formats, and development is ongoing. When MN objects are synchronized, objects that are tagged in the Object Format List as holding metadata (i.e. Science Metadata), will be parsed and indexed on the CN for search and discovery services. The automatic parsing and indexing is not supported for all types of objects that contain Science Metadata. For objects that are not currently supported, MN developers should coordinate with the DataONE developers to provide mappings between metadata fields and fields in DataONE’s search index. See the Content Discovery documentation for details.

Content versioning

Objects in DataONE are immutable. When a scientist or MN operator wants to update an object (Science Data, Science Metadata, or Resource Map), the process involves the following sequence:

  1. Minting a new identifier

  2. Creating a new System Metadata document for the new version of the object, and setting the ‘obsoletes’ field to the previous object identifier

  3. Updating the previous object’s System Metadata, setting the ‘obsoletedBy’ field to the new object’s identifier, and setting the ‘archived’ field to ‘true’.

  4. Updating the Member Node data store to include the new object (and the old one). For Tier 1 and 2 MNs, this will happen outside of the DataONE API. For Tier 3 and Tier 4 MNs, it can be done through the MNStorage.update() call.

Since one of the main goals of DataONE is to provide a preservation network with citable data to enable reproducible science, this sequence is critical to the system’s success. There are times when a Member Node won’t be able to store earlier versions of objects indefinitely, in which case MNs should set a replication policy in their object’s System Metadata to ‘true’ so that replicas can be made and the system will act as a persistent store of the versioned objects. However, Member Nodes have suggested that DataONE support mutable objects, and possible ways to support this without impeding the preservation goals of the federation are currently under investigation. DataONE is open for input on this. If you are facing issues with versioning, please contact support@dataone.org.

Data packaging

Scientists often work with separate objects (files) containing data and metadata, and want to access them as a collection. However, communities use different packaging technologies that are often specific to their data types. To support collections across a federated network, DataONE chose to represent data packages using the OAI-ORE specification. Also known as Resource Maps, these documents use RDF to describe relationships among objects (resources). DataONE has chosen a limited vocabulary to represent associations between objects. Curently, the associations are:

  • describes / describedBy

  • aggregates / isAggregatedBy

  • documents / isDocumentedBy

For instance, a Resource Map describes an aggregation, and the aggregation aggregates a Science Metadata document and a Science Data object. In turn, the Science Metadata document documents a Science Data object. These relationships are captured in the Resource Map as ‘triple statements’, and provide a graph of associations. Resource Maps may also be more complex, where one Science Metadata document documents many Science Data objects. Some repositories may find the need to create hierarchical collections, where one Resource Map aggregates another.

Member Node operators will need to store or generate Resource Maps that will get harvested during scheduled synchronization. The CNs will parse the maps and index these data/metadata relationships, to be used in content discovery. Note that Resource Maps will also need associated System Metadata documents. during this phase of development, MN operators can make use of the DataONE Common and DataONE Client libraries for building Resource Maps. Details about data packaging can be found in the architecture documentation.