
ASpecD dataset format (adf)

Datasets need to be persisted from time to time. While there are plenty of different file formats, storing both metadata and binary data can quickly get complicated.

General ideas

The ASpecD dataset format is vaguely reminiscent of the Open Document Format, i.e. a zipped directory containing structured data (in this case in the form of a YAML file) and binary data in a corresponding subdirectory.

As PyYAML is not capable of dealing with NumPy arrays out of the box, those are dealt with separately. Small arrays are stored inline as lists, larger arrays in separate files. For details, see the aspecd.utils.Yaml class.
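The split between inline lists and separate binary files could be sketched roughly as follows. This is a minimal illustration, not the code of aspecd.utils.Yaml: the function name `serialise_array`, the threshold value, the dictionary keys, and the file naming are all assumptions made for this example.

```python
import io

import numpy as np

# Hypothetical threshold (number of elements); not necessarily the ASpecD value.
SIZE_THRESHOLD = 100


def serialise_array(name, array, binary_files):
    """Return a YAML-friendly description of a NumPy array.

    Small arrays are converted to plain (nested) lists. Larger arrays are
    written in NumPy binary format into ``binary_files`` (a dict mapping
    file names to bytes) and only referenced by their file name.
    """
    if array.size <= SIZE_THRESHOLD:
        return {"dtype": str(array.dtype), "array": array.tolist()}
    buffer = io.BytesIO()
    np.save(buffer, array)  # same format as a .npy file on disk
    filename = f"{name}.npy"  # hypothetical naming scheme
    binary_files[filename] = buffer.getvalue()
    return {"dtype": str(array.dtype), "file": filename}


binary_files = {}
small = serialise_array("axis", np.arange(5), binary_files)
large = serialise_array("data", np.linspace(0.0, 1.0, 1000), binary_files)

# The large array can be restored from its binary representation.
restored = np.load(io.BytesIO(binary_files["data.npy"]))
```

In this sketch, `small` carries its values inline, whereas `large` only carries a reference to `data.npy`, whose bytes would end up in the binary subdirectory of the archive.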

The data format tries to be as self-contained as possible, using standard file formats and a brief description of its layout contained within the archive. Collecting the contents in a single ZIP archive lets the user handle each dataset as a single file, while more advanced users can easily dig into the details and write importers for other platforms and programming languages, making the format largely platform-independent and future-proof. Because larger numerical arrays are stored in binary form, the files should also be more compact than purely text-based formats.
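As a rough illustration of the layout described above, the following sketch builds a tiny archive in memory with Python's standard zipfile module and lists its members. The file contents and the array file name are placeholders chosen for this example; real archives are written by ASpecD itself.

```python
import io
import zipfile

# Placeholder contents, for illustration only; ASpecD writes the real files.
members = {
    "dataset.yaml": "data:\n  calculated: false\n",
    "binaryData/example.npy": b"placeholder, not valid NumPy data",
    "VERSION": "1.0.0\n",
    "README": "Readme\n======\n",
}

# Write all members into a single ZIP archive held in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    for name, content in members.items():
        archive.writestr(name, content)

# Reopen the archive and inspect it, as one could do with a file on disk.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    version = archive.read("VERSION").decode().strip()

for name in names:
    print(name)
```

Since the archive is an ordinary ZIP file, any tool or language with ZIP support can be used in the same way to look inside a dataset.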

Files and their meaning

What follows is a short description of the different files contained in the ZIP archive.

* dataset.yaml - text/YAML
  hierarchical metadata store

* binaryData/<filename>.npy - NumPy binary
  numerical data of the dataset stored in NumPy format

  Only arrays exceeding a certain threshold are stored in binary format, mainly to save space and preserve numerical accuracy.

* VERSION - text
  version number of the dataset format

  The version number follows the semantic versioning scheme.

* README - text
  General information on the dataset format
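An importer written for another platform might start by checking VERSION and locating the binary files before interpreting anything else. The sketch below uses only the Python standard library. The helper names `parse_version`, `is_readable`, and `binary_files`, the supported major version, and the rule that only the major version has to match are assumptions for illustration, not documented ASpecD behaviour.

```python
import io
import zipfile

SUPPORTED_MAJOR = 1  # assumption made for this example


def parse_version(text):
    """Split a plain 'MAJOR.MINOR.PATCH' string into a tuple of integers.

    Pre-release or build suffixes are not handled in this sketch.
    """
    major, minor, patch = text.strip().split(".")
    return int(major), int(minor), int(patch)


def is_readable(archive):
    """Return True if the major version in VERSION is the supported one."""
    version = parse_version(archive.read("VERSION").decode())
    return version[0] == SUPPORTED_MAJOR


def binary_files(archive):
    """List all NumPy files stored in the binaryData subdirectory."""
    return [
        name
        for name in archive.namelist()
        if name.startswith("binaryData/") and name.endswith(".npy")
    ]


# Build a small stand-in archive in memory (placeholder contents only).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("dataset.yaml", "data:\n  calculated: false\n")
    archive.writestr("binaryData/example.npy", b"placeholder")
    archive.writestr("VERSION", "1.2.3\n")
    archive.writestr("README", "Readme\n======\n")

with zipfile.ZipFile(buffer) as archive:
    readable = is_readable(archive)
    files = binary_files(archive)
```

Checking the version first allows an importer to fail early with a clear message instead of misreading a layout it does not know.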

README in the ZIP archive

As mentioned above, the ASpecD dataset format is essentially a ZIP archive consisting of a number of files. One of these is a text file called README with some basic information on the contents of the archive, an attempt to make the format as self-contained as possible. The contents of this file are shown below.

Readme
======

This directory contains an ASpecD dataset stored in the
ASpecD dataset format (adf).

What follows is a bit of information on the meaning of
each of the files in the directory.
Sources of further information on the file format
are provided at the end of the file.

Copyright (c) 2021, Till Biskup
2021-01-04

Files and their meaning
-----------------------

* dataset.yaml - text/YAML
  hierarchical metadata store

* binaryData/<filename>.npy - NumPy binary
  numerical data of the dataset stored in NumPy format

  Only arrays exceeding a certain threshold are stored
  in binary format, mainly to save space and preserve
  numerical accuracy.

* VERSION - text
  version number of the dataset format

  The version number follows the semantic versioning scheme.

* README - text
  This file

Further information
-------------------

More information can be found on the web in the
ASpecD package documentation:

https://docs.aspecd.de/adf.html