ASpecD dataset format (adf)
Datasets need to be persisted from time to time. While there are plenty of different file formats, storing both metadata and binary data can get complicated quickly.
The ASpecD dataset format is vaguely reminiscent of the Open Document Format, i.e. a zipped directory containing structured data (in this case in the form of a YAML file) and binary data in a corresponding subdirectory.
As PyYAML is not capable of dealing with NumPy arrays out of the box, those are dealt with separately. Small arrays are stored inline as lists, larger arrays in separate files. For details, see the
The data format tries to be as self-contained as possible, using standard file formats and a brief description of its layout contained within the archive. Collecting the contents in a single ZIP archive lets users handle one file per dataset, while more advanced users can dig into the details and write importers for other platforms and programming languages, which makes the format largely platform-independent and future-proof. Because larger numerical arrays are stored in binary form, the files should be more compact than those of formats that store all numbers as text.
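The layout described above can be illustrated with a minimal sketch. It is not the ASpecD implementation: the threshold value, the function names `write_archive` and `read_array`, and the way a binary file is referenced from the metadata are all assumptions made for illustration. To avoid a PyYAML dependency, the metadata is written as JSON text, which is generally also readable as YAML.

```python
import io
import json
import zipfile

import numpy as np

# Hypothetical threshold: arrays with more elements go to a separate .npy file.
THRESHOLD = 100


def write_archive(arrays, metadata):
    """Return the bytes of a ZIP archive with metadata and binary data."""
    document = dict(metadata)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, array in arrays.items():
            if array.size > THRESHOLD:
                # Large array: store in NumPy binary format in the subdirectory.
                member = f"binaryData/{name}.npy"
                npy = io.BytesIO()
                np.save(npy, array, allow_pickle=False)
                archive.writestr(member, npy.getvalue())
                # Hypothetical way of pointing from the metadata to the file.
                document[name] = {"type": "numpy.ndarray", "file": member}
            else:
                # Small array: store inline as a plain list.
                document[name] = array.tolist()
        # JSON text as a stand-in for the YAML metadata file.
        archive.writestr("dataset.yaml", json.dumps(document, indent=2))
    return buffer.getvalue()


def read_array(archive_bytes, member):
    """Load one .npy member of the archive back into a NumPy array."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return np.load(io.BytesIO(archive.read(member)), allow_pickle=False)


data = np.linspace(0.0, 1.0, 500)   # large: ends up in binaryData/
axis = np.array([0.0, 1.0])         # small: stays inline as a list
blob = write_archive({"data": data, "axis": axis}, {"label": "example"})
restored = read_array(blob, "binaryData/data.npy")
```

Because the `.npy` file holds the array in binary form, `restored` is bit-for-bit identical to `data`, which is the numerical-accuracy argument for not writing large arrays as text.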
Files and their meaning
What follows is a short description of the different files contained in the ZIP archive.
dataset.yaml - text/YAML
hierarchical metadata store
binaryData/<filename>.npy - NumPy binary
numerical data of the dataset stored in NumPy format
Only arrays exceeding a certain threshold are stored in binary format, mainly to save space and preserve numerical accuracy.
VERSION - text
version number of the dataset format
The version number follows the semantic versioning scheme.
README - text
General information on the dataset format
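For someone writing an importer, a first step might be to map the members of an archive to the roles listed above. The following is a sketch using only the Python standard library; the function name `describe_members` and the role labels are assumptions for illustration, and the archive built here contains placeholder contents only (including the placeholder version string).

```python
import io
import zipfile


def describe_members(archive):
    """Map each member of an open ZipFile to its role in the layout."""
    roles = {}
    for name in archive.namelist():
        if name == "dataset.yaml":
            roles[name] = "metadata"
        elif name.startswith("binaryData/") and name.endswith(".npy"):
            roles[name] = "numerical data"
        elif name == "VERSION":
            roles[name] = "format version"
        elif name == "README":
            roles[name] = "format description"
        else:
            roles[name] = "unknown"
    return roles


# Build a tiny stand-in archive in memory (placeholder contents only).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("dataset.yaml", "label: example\n")
    archive.writestr("binaryData/data.npy", b"")
    archive.writestr("VERSION", "1.0.0\n")
    archive.writestr("README", "Readme\n")

with zipfile.ZipFile(buffer) as archive:
    roles = describe_members(archive)
```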
README in the ZIP archive
As mentioned above, the ASpecD dataset format is essentially a ZIP archive that consists of a number of files. One of these is a text file called README with some basic information on the contents of the archive, an attempt to make the format as self-contained as possible. The contents of this file are shown below.
    Readme
    ======

    This directory contains an ASpecD dataset stored in the
    ASpecD dataset format (adf).

    What follows is a bit of information on the meaning of
    each of the files in the directory.
    Sources of further information on the file format
    are provided at the end of the file.

    Copyright (c) 2021, Till Biskup
    2021-01-04

    Files and their meaning
    -----------------------

    * dataset.yaml - text/YAML
      hierarchical metadata store

    * binaryData/<filename>.npy - NumPy binary
      numerical data of the dataset stored in NumPy format

      Only arrays exceeding a certain threshold are stored
      in binary format, mainly to save space and preserve
      numerical accuracy.

    * VERSION - text
      version number of the dataset format

      The version number follows the semantic versioning scheme.

    * README - text
      This file

    Further information
    -------------------

    More information can be found on the web in the ASpecD package documentation:

    https://docs.aspecd.de/adf.html
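Since the VERSION file follows semantic versioning, an importer could use it to decide whether it understands a given archive. The sketch below assumes the usual semantic-versioning convention that only a change in the major number breaks compatibility; the function names and the example version strings are hypothetical and not taken from the ASpecD package.

```python
def parse_version(text):
    """Split a MAJOR.MINOR.PATCH string into a tuple of three integers."""
    # Drop optional pre-release ("-rc.1") and build ("+build5") suffixes.
    core = text.strip().split("-", 1)[0].split("+", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return major, minor, patch


def is_readable(file_version, supported_major):
    """Assume an importer can read any file with the same major version."""
    return parse_version(file_version)[0] == supported_major


# Example version strings, as they might be read from a VERSION file.
same_major = is_readable("1.2.3\n", supported_major=1)
other_major = is_readable("2.0.0", supported_major=1)
```

A stricter importer might also compare the minor number, since a newer minor version may add fields an older reader would silently ignore.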