To systematically address scalability and searchability in scientific data sharing, we have built a versatile data exchange platform that addresses these challenges at multiple levels. The diagram below shows the overall architecture. Let us walk you through the platform step by step, from the bottom level to the top.
At the lowest level (data serialization), the NeuroJSON platform is built on the JSON format: an internationally standardized, portable, ubiquitous and human-readable format that can be used in any programming language and understood by any user, now and in the future. It is JSON's universal presence and future-proof human-readability that won our bet as the carrier of valuable scientific data resources, including both raw data and metadata, for generations to come. A vast ecosystem of tools, including numerous free JSON parsers, standards such as JSONPath, JSON Reference, JSON Schema and JSON-LD, many NoSQL databases and much more, is readily at our disposal once our data are JSON-compliant.
We also recognize the need for efficiency. For that purpose, we supplement JSON with an interchangeable binary JSON interface, the Binary JData (BJData) format, for applications where high I/O speed and smaller disk space are desired. This is a new format derived from a widely used binary JSON format called Universal Binary JSON (UBJSON). Compared to other, more sophisticated binary JSON variants such as MessagePack, CBOR and BSON, UBJSON's KISS (keep it simple, stupid) design philosophy most closely mirrors the core spirit of JSON. We made some key modifications to UBJSON to optimize it for carrying scientific/neuroimaging data, including native support for strongly typed N-D arrays. In the past years, our C++ BJData parser has been included in JSON for Modern C++, a highly popular C++ JSON library that has received over 4 million downloads. BJData support has also been made available for Python, MATLAB/Octave, JavaScript and C. JSON and binary JSON can be losslessly converted from one to the other.
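To make the idea of a marker-based binary JSON concrete, here is a minimal sketch of a UBJSON-style encoder/decoder for a small subset of types. It is an illustration only, not a conforming BJData implementation: the function names `encode` and `decode` are hypothetical, integers are always written as 32-bit or 64-bit values, little-endian byte order is assumed, and optimized/typed N-D array containers are left out.

```python
import struct

# Marker -> (struct format, byte size) for the numeric types used in this sketch.
_NUM = {b"l": ("<i", 4), b"L": ("<q", 8), b"D": ("<d", 8)}


def encode(value):
    """Encode a JSON-compatible value as a UBJSON-style byte string (subset)."""
    if value is None:
        return b"Z"
    if value is True:
        return b"T"
    if value is False:
        return b"F"
    if isinstance(value, int):
        # A real encoder would pick the smallest integer type that fits.
        if -2**31 <= value < 2**31:
            return b"l" + struct.pack("<i", value)
        return b"L" + struct.pack("<q", value)
    if isinstance(value, float):
        return b"D" + struct.pack("<d", value)
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return b"S" + encode(len(raw)) + raw
    if isinstance(value, list):
        return b"[" + b"".join(encode(v) for v in value) + b"]"
    if isinstance(value, dict):
        out = b"{"
        for key, item in value.items():
            raw = key.encode("utf-8")
            # Object keys are written as length + bytes, without the 'S' marker.
            out += encode(len(raw)) + raw + encode(item)
        return out + b"}"
    raise TypeError(f"unsupported type: {type(value).__name__}")


def decode(buf, pos=0):
    """Decode one value starting at buf[pos]; return (value, next_position)."""
    marker = buf[pos:pos + 1]
    pos += 1
    if marker == b"Z":
        return None, pos
    if marker == b"T":
        return True, pos
    if marker == b"F":
        return False, pos
    if marker in _NUM:
        fmt, size = _NUM[marker]
        return struct.unpack_from(fmt, buf, pos)[0], pos + size
    if marker == b"S":
        length, pos = decode(buf, pos)
        return buf[pos:pos + length].decode("utf-8"), pos + length
    if marker == b"[":
        items = []
        while buf[pos:pos + 1] != b"]":
            item, pos = decode(buf, pos)
            items.append(item)
        return items, pos + 1
    if marker == b"{":
        obj = {}
        while buf[pos:pos + 1] != b"}":
            length, pos = decode(buf, pos)
            key = buf[pos:pos + length].decode("utf-8")
            pos += length
            obj[key], pos = decode(buf, pos)
        return obj, pos + 1
    raise ValueError(f"unknown marker {marker!r} at offset {pos - 1}")
```

Because every value carries an explicit type marker and every string an explicit length, `decode(encode(doc))[0]` returns the original document for any input built from these types, which is the lossless round trip described above.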
After building the low-level serialization interface, we moved on to standardizing scientific data structures. This led to our JData specification, a lightweight semantic layer that aims to describe all common scientific data structures using pure JSON annotations. We intentionally implemented JData entirely with JSON annotations/keywords, making all NeuroJSON data files 100% JSON-compatible without introducing customized syntax. Examples of JData-annotated data structures include N-D arrays, typed/complex/sparse N-D arrays, tables, trees, graphs, linked lists, binary streams, etc. Binary N-D array data can be losslessly enclosed in a JSON construct with optional data compression (supporting multiple codecs, including the high-performance blosc2 meta-compressor). Ultra-lightweight JData annotation encoders/decoders have been made widely available for MATLAB/Octave, Python, JavaScript, etc.
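As an illustration of how an N-D array can be carried by plain JSON keywords, the sketch below wraps a flat list of numbers in an annotated JSON object, optionally compressing the binary payload. The keyword names (`_ArrayType_`, `_ArraySize_`, `_ArrayData_`, `_ArrayZipType_`, `_ArrayZipSize_`, `_ArrayZipData_`) follow the JData specification as we understand it; the helper names, the little-endian row-major layout, and the use of `zlib` with base64 are assumptions of this sketch, so consult the specification for the authoritative rules.

```python
import base64
import struct
import zlib

# JData type name -> struct format character (only a few types in this sketch).
_FMT = {"uint8": "B", "int32": "i", "single": "f", "double": "d"}


def jdata_encode_array(flat, shape, dtype="double", compress=False):
    """Wrap a flat, row-major list of numbers in a JData-style annotation."""
    count = 1
    for n in shape:
        count *= n
    if count != len(flat):
        raise ValueError("shape does not match the number of elements")
    node = {"_ArrayType_": dtype, "_ArraySize_": list(shape)}
    if not compress:
        node["_ArrayData_"] = list(flat)
        return node
    # Serialize to little-endian bytes, compress, then base64-encode for JSON.
    raw = struct.pack(f"<{count}{_FMT[dtype]}", *flat)
    node["_ArrayZipType_"] = "zlib"
    # Dimensions of the serialized data before compression (assumed 1 x N here).
    node["_ArrayZipSize_"] = [1, count]
    node["_ArrayZipData_"] = base64.b64encode(zlib.compress(raw)).decode("ascii")
    return node


def jdata_decode_array(node):
    """Return (flat_values, shape) from a JData-style array annotation."""
    dtype = node["_ArrayType_"]
    if "_ArrayZipData_" in node:
        raw = zlib.decompress(base64.b64decode(node["_ArrayZipData_"]))
        count = len(raw) // struct.calcsize("<" + _FMT[dtype])
        flat = list(struct.unpack(f"<{count}{_FMT[dtype]}", raw))
    else:
        flat = list(node["_ArrayData_"])
    return flat, node["_ArraySize_"]
```

Both forms are ordinary JSON objects, so they survive `json.dumps`/`json.loads` unchanged and can be stored, searched or transmitted like any other JSON value.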
With our standardized/portable data stream and data structures at the foundation, we can readily approach neuroimaging data sharing (or any general scientific data sharing at large) and "modernize" it to become scalable, searchable and interoperable.
At the smallest scale, a neuroimaging dataset consists of many data files that enclose different pieces of information regarding the diverse imaging modalities and procedures involved. We built a series of lossless JSON wrappers to map these data files to a JSON "container" or "wrapper". These file-level JSON wrappers include the JNIfTI spec (a JSON wrapper of the commonly used NIfTI-1/2 data files), the JSNIRF spec (a JSON wrapper of the SNIRF format used in fNIRS), and the JMesh spec (for storing mesh, geometry and shape data in JSON form). Emerging neuroimaging data sharing standards, such as BIDS and Zarr, have already started using JSON sidecar files as the main carriers for metadata; these JSON files are readily usable in NeuroJSON's ecosystem without needing a wrapper or converter.
At the middle level, neuroimaging data are typically shared in units of "datasets". Sharing datasets that are made of diverse measurements and modalities and that involve many data files has been a challenging task. Inconsistent naming conventions, file/folder organization schemes, and non-standard metadata names and storage locations have set major barriers to effective imaging data sharing in the past. Fortunately, community-driven efforts, especially the BIDS standard, have greatly reduced this barrier and made it possible to parse/exchange data under a predictable, semantically meaningful, yet still simple folder/file structure. To enclose an entire dataset using our JSON framework, we have developed BIDS-to-JSON converters (such as neuroj or jbids) that iterate through every data file (.nii.gz, .tsv, .json, .bval, etc.) in a dataset, separate the searchable metadata from the non-searchable binary content, map each part to its JSON equivalent, and then combine all searchable content into a single JSON "digest" file.
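The dataset-to-digest step described above can be sketched as a simple folder walk. This is a minimal illustration, not the actual neuroj or jbids implementation: the function name `build_digest` is hypothetical, only `.json` and `.tsv` files are treated as searchable, and every other file is replaced by a `_DataLink_` placeholder holding its relative path (the keyword is borrowed from the JData annotations; the layout the real converters produce may differ).

```python
import csv
import json
import os


def build_digest(root):
    """Walk a dataset folder and merge its searchable content into one dict."""
    digest = {}
    for folder, _subdirs, files in os.walk(root):
        # Locate (or create) the nested dict that mirrors this folder.
        node = digest
        rel = os.path.relpath(folder, root)
        if rel != ".":
            for part in rel.split(os.sep):
                node = node.setdefault(part, {})
        for name in sorted(files):
            path = os.path.join(folder, name)
            if name.endswith(".json"):
                # JSON sidecar files are already searchable: embed them as-is.
                with open(path, encoding="utf-8") as f:
                    node[name] = json.load(f)
            elif name.endswith(".tsv"):
                # Tabular text becomes a list of row objects keyed by column.
                with open(path, newline="", encoding="utf-8") as f:
                    node[name] = list(csv.DictReader(f, delimiter="\t"))
            else:
                # Binary content stays outside the digest; keep only a link.
                link = os.path.relpath(path, root).replace(os.sep, "/")
                node[name] = {"_DataLink_": link}
    return digest
```

Calling `json.dumps(build_digest("path/to/dataset"))` then yields a single JSON document that mirrors the folder hierarchy, with metadata inlined and large binary files referenced rather than embedded.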
The NeuroJSON project aims at an even more ambitious goal: we do not want to stop at the file-package level for data sharing, but to think bigger, at the level of collections of datasets (which we call a "database") or many collections of datasets (i.e. databases of databases). How can we effectively represent, search, locate, interact with, and process data at such a scale? This is where NoSQL database technologies enter the scene. NoSQL is the umbrella term for a new generation of database architectures that can handle complex, heterogeneous and hierarchical data, as opposed to the table-like data structures supported by traditional relational databases. NoSQL databases have seen rapid growth over the last few decades and form the backbone of modern-day information technology and application data exchange. There are many flavors of NoSQL databases, including document-store databases (CouchDB, MongoDB), key-value databases (Redis, Valkey), and graph databases (Neo4j), among others. Because JSON is hierarchical and heterogeneous in nature, it has unsurprisingly served as the primary data exchange format for nearly all of these NoSQL database engines. As long as we can convert datasets and databases of neuroimaging data to JSON, we can readily store such data in highly optimized NoSQL database engines and achieve scalable, rapid and versatile data search, manipulation and dissemination.
This leads to the top level of our architecture and our vision of how neuroimaging datasets (or any scientific datasets) should be accessed and processed in the future. Users should not limit themselves to processing locally downloaded single imaging data files or single datasets; rather, they should use our highly scalable NoSQL database interfaces (commonly known as RESTful APIs, which provide URL-based universal access to online/cloud content) to rapidly search for and locate the data files that match their needs across large collections of datasets (or mega-databases), both interactively and programmatically in automated pipelines. The exchanged data are all encoded in human- and machine-readable JSON packets that are lightweight, versatile, and language/platform neutral. Access and computation can take place on local hardware or on large-scale, high-performance cloud-based systems. The representation and storage of data in the NeuroJSON framework are always human-understandable and searchable, making the data easily reusable in the future.
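To show what URL-based, JSON-encoded access can look like in an automated pipeline, the sketch below builds (but does not send) a search request, assuming a CouchDB-style document store that accepts JSON selector queries through a `_find` endpoint. The server address, database name and selector fields are placeholders chosen for illustration, and `build_find_request` is a hypothetical helper rather than part of any NeuroJSON client.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Placeholder server address; replace it with the actual database endpoint.
BASE_URL = "https://example.org"


def build_find_request(database, selector, fields=None, limit=25):
    """Build a POST request for a CouchDB-style `_find` (selector) query."""
    body = {"selector": selector, "limit": limit}
    if fields:
        # Restrict the returned documents to the listed fields.
        body["fields"] = list(fields)
    return Request(
        url=f"{BASE_URL}/{quote(database, safe='')}/_find",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical query: documents whose "modality" field equals "fnirs".
request = build_find_request("mydatabase", {"modality": "fnirs"}, fields=["_id"])
```

Against a live server, passing `request` to `urllib.request.urlopen` and parsing the response with `json.load` would return the matching documents as JSON, which a pipeline can then filter further or use to fetch the linked binary files.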