1. What exactly is JData?

1.1. JData in a nutshell

JData is our solution to help data creators and users, especially scientific researchers who routinely deal with complex data, easily store, read, understand, share, combine, process, exchange and integrate their data and enable new discoveries. It spares them the extra work of dealing with numerous diverse file formats, so they can focus on the more important questions. Their hard-earned data files can also be shared with a large research community by making them FAIR (findable, accessible, interoperable and reusable), helping other researchers build bigger studies and enable grander discoveries.

We have developed the JData framework specifically to address the above-mentioned challenges. This framework includes

  1. A set of data specifications that map complex data structures to a standardized "data annotation" that can unambiguously represent a large variety of data types encountered in the scientific world,
  2. A standardized data serialization approach based on existing, widely supported data serialization formats (specifically, we picked JSON and binary JSON - yes, dual-interface!), so that the JData-annotated constructs can be efficiently parsed/stored/converted/transmitted without the extra work of writing brand new parsers, and
  3. A series of libraries and encoders/decoders that map complex data types from different programming environments to the common JData annotation format, and export and import them between programming languages using a JSON/binary JSON file.

In a nutshell, the JData Specification represents complex data structures using JSON (JavaScript Object Notation) compatible data constructs, making complex (and lightweight) data structures readable, shareable, extensible and interoperable between software and analysis pipelines. The JData specification extends JSON by adding data containers that support strongly-typed binary data, complex-valued arrays, data compression, data grouping, linking/referencing, etc., entirely in the "semantic layer" without altering the syntax of the format. This makes JData-encoded data files 100% compatible with all existing JSON parsers and readily usable in any software where JSON is supported.

To make JData suitable for the many space-limited, performance-sensitive applications, we also defined a binary interface for the JData-defined data annotations, using another widely supported binary JSON-like format, Universal Binary JSON (UBJSON), to produce smaller files with significantly faster parsing. JSON/text-JData files can be losslessly converted to UBJSON/binary-JData files, and vice versa. We selected UBJSON over other binary alternatives, such as BSON, MessagePack and CBOR, largely because of its "quasi-human-readability", which is unique to this binary format. In UBJSON, all "semantic elements", i.e. the data type markers and data name records, are stored in human-readable form. By using a text editor or a string-printing tool, such as the strings command on Mac or Linux (also available on Windows via Cygwin), one can read the content of the binary data without much difficulty.

Simplicity and human-readability are among our design requirements for this framework because we understand that a data standard targeting general audiences has to be simple! The format has to be simple and intuitive; serializing, reading and writing the files has to be easy and widely supported without much programming overhead; and future extension must be inherent: once a data file is created, it must be easy to modify without worrying about breaking the parsers, so that it can stay usable for a long time. Human-readability is harder to achieve because it typically conflicts with speed and data size. Our solution attempts to strike a balance between readability and efficiency: it ensures that all "semantic components", such as data item names, strings, and data types, are easily read and understood, while the binary data payload can be stored compactly using compression or filters.

1.2. Introduction by examples

1.2.1. A simple array

Let's start with a simple example. Assume we have a 2x4 integer array 'mydata', which one can define similarly in several programming languages as

 Python:     mydata=[[1,2,3,4], [5,6,7,8]]
 JavaScript: mydata=[[1,2,3,4], [5,6,7,8]]
 MATLAB:     mydata=[[1,2,3,4]; [5,6,7,8]]
 Perl:      @mydata=([1,2,3,4], [5,6,7,8])

In JData, we can represent this data array using the widely-supported, plain-text-based JSON format as

 {
    "mydata": [[1,2,3,4], [5,6,7,8]]
 }

From the above example, you may notice two major characteristics of JData

  • P1 - First, this representation is quite intuitive and easy to understand; one can open the file in a text editor and understand the data without much trouble.
  • P2 - Secondly, the JData form is syntactically compatible with JSON; JSON encoders are freely available in almost all programming languages, and they are easy to program or use.
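
As a quick illustration of P2, the following Python sketch writes the array to JSON text and reads it back using only the standard json module. The variable names are ours and purely illustrative; any JSON library in any language should behave the same way.

```python
import json

# The 2x4 integer array from the example above.
mydata = [[1, 2, 3, 4], [5, 6, 7, 8]]

# Serialize to JSON text, wrapping the array under the "mydata" name.
text = json.dumps({"mydata": mydata})

# Parse the text back into native Python lists.
restored = json.loads(text)["mydata"]

print(text)
print(restored == mydata)
```

Because the direct format is ordinary JSON, no JData-specific library is involved in this round trip.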

1.2.2. A simple array with a binary type

Now, let's make this better: imagine that we want to store this array in an unsigned byte, i.e. uint8, format. The JSON-based form above starts to show its limitations, because the JSON specification does not support strongly-typed binary data.

In JData, we introduce a new data representation, the annotated array format, to let JSON support strongly-typed binary data

 {
    "mydata": {
        "_ArrayType_": "uint8",
        "_ArraySize_": [2,4],
        "_ArrayData_": [1,2,3,4,5,6,7,8]
     }
 }

The above annotated format does look slightly more complicated than the first form (referred to as the direct format), but it remains easily understandable. More importantly, the annotated array remains fully syntactically compatible with the JSON format. In other words, this JSON data representation is now capable of encoding data types.
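
The conversion between the direct and annotated forms can be sketched in a few lines of Python. This is a minimal illustration, not a JData library: the helper names to_annotated and from_annotated are hypothetical, the sketch handles only rectangular 2-D lists, and it assumes the elements are flattened row by row, which matches the example above.

```python
def to_annotated(rows, array_type):
    """Convert a rectangular 2-D list into an annotated-array dict.

    Assumption: elements are flattened row by row, as in the example above.
    """
    ncols = len(rows[0])
    if any(len(row) != ncols for row in rows):
        raise ValueError("all rows must have the same length")
    return {
        "_ArrayType_": array_type,
        "_ArraySize_": [len(rows), ncols],
        "_ArrayData_": [value for row in rows for value in row],
    }


def from_annotated(node):
    """Rebuild the 2-D list from an annotated-array dict."""
    nrows, ncols = node["_ArraySize_"]
    flat = node["_ArrayData_"]
    if len(flat) != nrows * ncols:
        raise ValueError("_ArrayData_ length does not match _ArraySize_")
    return [flat[i * ncols:(i + 1) * ncols] for i in range(nrows)]


annotated = to_annotated([[1, 2, 3, 4], [5, 6, 7, 8]], "uint8")
print(annotated)
print(from_annotated(annotated))
```

The resulting dict can be passed to any JSON encoder unchanged, since it contains only strings, numbers and lists.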

1.2.3. A simple array with compression

We can go one step further: JData extends JSON's capability to represent even binary data. The binary data in the above array can be stored in an encoded string, and such binary data can even be compressed to save space, for example:

 {
    "mydata": {
        "_ArrayType_": "uint8",
        "_ArraySize_": [2,4],
        "_ArrayZipType_": "zlib",
        "_ArrayZipSize_": [1,8],
        "_ArrayZipData_": "eJxjZGJmYWVj5wAAAIAAJQ=="
     }
 }

The above JSON construct stores the 8-element uint8 array by first compressing the byte stream using the zlib-deflate algorithm (as indicated by _ArrayZipType_) and then converting the result to ASCII text using base64 encoding. At this point, we are using a JSON-compliant annotation format to store strongly-typed binary data, which was previously not supported by the native JSON specification.
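
To see that the encoded string really holds the original array, the two steps can be reversed with Python's standard base64 and zlib modules. This is a sketch for this specific example only: it assumes a uint8 array (one byte per element) stored row by row, and it ignores the _ArrayZipSize_ field.

```python
import base64
import zlib

# The compressed annotated array from the example above.
node = {
    "_ArrayType_": "uint8",
    "_ArraySize_": [2, 4],
    "_ArrayZipType_": "zlib",
    "_ArrayZipSize_": [1, 8],
    "_ArrayZipData_": "eJxjZGJmYWVj5wAAAIAAJQ==",
}

# Step 1: base64 text -> compressed bytes; step 2: zlib inflate -> raw bytes.
raw = zlib.decompress(base64.b64decode(node["_ArrayZipData_"]))

# For uint8, every byte is one array element; reshape row by row.
values = list(raw)
nrows, ncols = node["_ArraySize_"]
array = [values[i * ncols:(i + 1) * ncols] for i in range(nrows)]

print(array)
```

Wider types such as int32 or float64 would additionally need the raw bytes to be unpacked according to _ArrayType_, which this sketch does not attempt.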

1.2.4. A structure

Hierarchical data are natively supported by JSON and UBJSON; therefore, it is effortless for JData files to encode and store nested complex data structures. For example, the commands below, in their respective programming languages,

 Python:     mydata={'a':5,'f':1.1,'c':{'d':[[1,2,3,4,5],[6,7,8,9,10]],'s':'a string'}}
 JavaScript: mydata={'a':5,'f':1.1,'c':{'d':[[1,2,3,4,5],[6,7,8,9,10]],'s':'a string'}}
 MATLAB:     mydata=struct('a',5,'f',1.1,'c',struct('d',[[1,2,3,4,5]; [6,7,8,9,10]],'s','a string'))
 Perl:      %mydata=('a'=>5,'f'=>1.1,'c'=>{'d'=>[[1,2,3,4,5], [6,7,8,9,10]],'s'=>'a string'})

produce the native data structure below

 mydata= {
    a =  5
    f =  1.1000
    c =
        d =
           1   2   3   4   5
           6   7   8   9  10
        s = a string
 }

In JSON/text-JData format, the above data can be intuitively represented as

 {
    "mydata": {
        "a": 5,
        "f": 1.1,
        "c": {
           "d": [
            [1,2,3,4, 5], 
            [6,7,8,9,10]
           ],
           "s": "a string"
        }
     }
 }

The above representation is rather easy to read and understand without needing special tools, libraries or prior knowledge of the data format itself.

1.2.5. Using binary interface

One may argue that, although the above data representation is general, extensible and human-readable, it requires more space to store and is not very efficient to process in large quantities.

This concern is valid, and is the main motivation for the "binary JData" interface using UBJSON.

Let's first understand why the above argument is valid. If we use the above JData form with a tab \t for indentation and a newline \n for line wrapping, it requires a total of 124 bytes to be stored on disk. Because white space between data records, including newlines and tabs, is optional in JSON, stripping it saves space. Therefore, we can use the "compact" format of JSON/JData,

 {"mydata":{"a":5,"f":1.1,"c":{"d":[[1,2,3,4,5],[6,7,8,9,10]],"s":"a string"}}}

which results in a total of 79 bytes.
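
The compact form is easy to generate with a standard JSON encoder. The Python sketch below uses the separators argument of json.dumps to drop all optional white space. Note that it counts only the JSON text itself; a trailing newline in a saved file would add one more byte.

```python
import json

mydata = {
    "a": 5,
    "f": 1.1,
    "c": {
        "d": [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]],
        "s": "a string",
    },
}

# separators=(",", ":") removes the spaces json.dumps inserts by default.
compact = json.dumps({"mydata": mydata}, separators=(",", ":"))

# Size in bytes of the JSON text alone (no trailing newline).
size = len(compact.encode("utf-8"))

print(compact)
print(size)
```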

In the JData specification, we further reduce this storage overhead by converting the above JSON data into its equivalent binary UBJSON representation. To make this data structure easier to read, we use a "block notation" that inserts newlines and white space between data records. Please note that in the actual UBJSON file, all white space and the "[" "]" markers around each "data block" below must be removed.

 [{]
    [U][6][mydata] 
    [{]
        [U][1][a] [U][5]
        [U][1][f] [d][1.1]
        [U][1][c]
        [{]
           [U][1][d] [[]
              [[] [U][1][U][2][U][3][U][4][U][5]  []]
              [[] [U][6][U][7][U][8][U][9][U][10] []]
           []]
           [U][1][s] [S][U][8][a string]
        [}]
    [}]
 [}]

To explain the above notation: UBJSON follows a syntax similar to JSON, except that both the name and value fields accept binary data type markers ([U] for uint8, [d] for float32, [S] for string; [[]...[]] stores an array object, and [{]...[}] stores a map/hash).

The above UBJSON representation requires a total of 73 bytes to store, which is slightly smaller than the text-based JSON format above. One can see that each row of the 2-D array "mydata.c.d" contains numerical elements of the same type; such a "packed" array can take advantage of the "optimized array header" defined in the UBJSON specification, which factors out the shared data type marker, in this case [U], to save more space. Using the UBJSON optimized array header, we can store the above data as

 [{]
    [U][6][mydata] 
    [{]
        [U][1][a] [U][5]
        [U][1][f] [d][1.1]
        [U][1][c]
        [{]
           [U][1][d] [[]
              [[] [$][U][#][U][5] [1][2][3][4][5]
              [[] [$][U][#][U][5] [6][7][8][9][10]
           []]
           [U][1][s] [S][U][8][a string]
        [}]
    [}]
 [}]

The above form needs a total of 71 bytes to store, about 10% smaller than the compact-formatted JSON/JData format; in the meantime, all data types are stored in strongly-typed binary form without needing to parse or format when loading or saving.
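
The two byte counts can be reproduced with a small encoder. The Python sketch below is not a complete UBJSON implementation: it assumes only the markers used in this example (uint8 [U], float32 [d], string [S], arrays and objects), limits all lengths and integers to the uint8 range, and writes float32 values in big-endian order. The names encode and encode_name are hypothetical helpers introduced for illustration.

```python
import struct


def _length(n):
    """Encode a length as a uint8 marker plus one byte (sketch limit: 255)."""
    if n > 255:
        raise ValueError("this sketch only supports lengths up to 255")
    return b"U" + bytes([n])


def encode_name(name):
    """Encode an object key: length record followed by the UTF-8 bytes."""
    raw = name.encode("utf-8")
    return _length(len(raw)) + raw


def encode(value, optimize=False):
    """Encode a Python value using the subset of markers described above."""
    if isinstance(value, bool):
        raise TypeError("booleans are outside the scope of this sketch")
    if isinstance(value, int):
        if not 0 <= value <= 255:
            raise ValueError("this sketch only supports uint8 integers")
        return b"U" + bytes([value])
    if isinstance(value, float):
        return b"d" + struct.pack(">f", value)  # big-endian float32
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return b"S" + _length(len(raw)) + raw
    if isinstance(value, list):
        packable = bool(value) and all(
            type(v) is int and 0 <= v <= 255 for v in value)
        if optimize and packable:
            # Optimized header: shared type [$][U], count [#][U][n], then the
            # raw payload bytes; no closing "]" is written in this form.
            return b"[$U#" + _length(len(value)) + bytes(value)
        return b"[" + b"".join(encode(v, optimize) for v in value) + b"]"
    if isinstance(value, dict):
        body = b"".join(
            encode_name(k) + encode(v, optimize) for k, v in value.items())
        return b"{" + body + b"}"
    raise TypeError("unsupported type: " + type(value).__name__)


doc = {"mydata": {"a": 5, "f": 1.1,
                  "c": {"d": [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]],
                        "s": "a string"}}}

plain = encode(doc)                  # one [U] marker per array element
packed = encode(doc, optimize=True)  # optimized array headers for each row

print(len(plain), len(packed))
```

Each row shrinks from 12 bytes (two brackets plus five marker/value pairs) to 11 bytes (a 6-byte header plus five payload bytes), so the saving grows quickly for longer arrays.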

1.2.6. Quasi-readability of binary JData files

As we mentioned above, UBJSON is a rather unique binary data format in that it is "quasi-human-readable". To show this, we can run the strings utility on the Mac/Linux (or Cygwin on Windows) command line on the above binary JData file, which gives the output below

 strings -n 2 mydata.jbat
{U
mydata{U
aU
fD?
c{U
d[[$U#U
[$U#U
]U
sSU
a string}}}

If we slightly reformat the above output using a code-formatting text utility, astyle, the text markers extracted from the above binary JData file can be shown as

 strings -n 2 mydata.jbat | astyle
{   U
    mydata{
        U
        aU
        fD?
        c{
            U
            d[[$U#U
            [$U#U
            ]U
            sSU
            a string}}
}

The above printed data skeleton contains all "semantic" elements of the data file, i.e. all the data subfield names, string values, and data type markers. In addition, the formatted text outline correctly reflects the hierarchical structure of the binary data.

This ability to let a user conveniently inspect and understand the overall data structure in a binary file is unique to UBJSON/binary JData, and is not offered by most other binary data formats, including HDF5, BSON, MessagePack, etc. This quasi-human-readability readily makes many binary-only data types easily searchable and findable once converted to the binary JData format.
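
Where the strings utility is not available, a similar scan takes only a few lines of Python. The sketch below is a rough approximation of strings -n 2: it treats bytes 0x20-0x7E as printable and reports every run of at least two of them. The function name printable_runs is hypothetical, and the sample bytes are a short hand-written fragment in the style of the examples above, not the contents of a real file.

```python
import re


def printable_runs(data, min_len=2):
    """Return all runs of at least min_len printable ASCII bytes in data."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


# Hand-written fragment: {"mydata": {"a": 5}} using uint8 markers.
sample = b"{U\x06mydata{U\x01aU\x05}}"

for run in printable_runs(sample):
    print(run)
```

The length and value bytes (0x06, 0x01, 0x05) fall outside the printable range, so they act as separators between the readable names and markers.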

1.2.7. Space savings

In the above toy dataset, the space saved by data compression or by converting to the binary form does not appear to be significant. In this section, we use a real-world dataset to show how JData formats can significantly reduce data file sizes and parsing overhead.

Let's consider a 3-D medical imaging dataset, a head-CT scan provided with the widely used rendering software MRIcroGL (raw data file). This 3-D CT scan of a head volume contains 208 x 256 x 225 voxels, with each voxel represented by a 1-byte gray-scale value (0-255).

To store such a dataset using the widely supported NIfTI-1 data format, the raw .nii data file requires a total of 11,982 kB, including a 348-byte NIfTI-1 binary header and 4-byte extension markers, followed by 208 x 256 x 225 = 11,980,800 bytes of image data. Often, a .nii file is stored as a gzip-compressed .nii.gz file. In this case, the .nii.gz file has a size of 2,809 kB.
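
The size breakdown above is simple arithmetic, sketched here in Python for readers who want to adapt it to other volumes. The variable names are ours; the header and extension sizes are the ones quoted in the text.

```python
import math

dims = (208, 256, 225)   # voxels along each axis
bytes_per_voxel = 1      # 1-byte gray-scale value per voxel
header_bytes = 348       # NIfTI-1 binary header
extension_bytes = 4      # extension markers

image_bytes = math.prod(dims) * bytes_per_voxel
total_bytes = header_bytes + extension_bytes + image_bytes

print(image_bytes)
print(total_bytes)
```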

By using a JData-equivalent data container that is 100% compatible with the NIfTI header, i.e. the JNIfTI format, we can convert the above .nii file to a JSON/text-based JData file or a UBJSON/binary-JData file. Optionally, as we showed above, one can use data compression via the addition of _ArrayZip*_ tags to further save space while losslessly storing the binary data. Some of these stored data samples can be found in this folder.

In the table below, we summarize the data file sizes, encoding/saving times and loading/decoding times across different file formats. The best two in size (smaller is better), saving time and loading time (shorter is better) are marked in bold in this table. All benchmarks were run in MATLAB on an Ubuntu 16.04 Linux system with a Samsung EVO 970 NVMe drive.

 File Formats            Suffix     File size    Saving time (s)   Loading time (s)
 NIfTI-1 raw             .nii       11,982 kB    0.058 s†          0.156 s†
 NIfTI-1+gzip            .nii.gz     2,809 kB    1.012 s†          0.265 s†
 Text-JData (zlib)       .jnii       3,492 kB    0.583 s*          0.136 s*/0.075 s‡
 Text-JData (lzma)       .jnii       2,608 kB    2.041 s*          0.199 s*/0.171 s‡
 Binary-JData (no zip)   .bnii      11,982 kB    0.146 s*          0.138 s*
 Binary-JData (zlib)     .bnii       2,583 kB    0.493 s*          0.090 s*
 Binary-JData (lzma)     .bnii       1,929 kB    2.128 s*          0.167 s*