Why NeuroJSON

: 1. What problems are we addressing?
: 2. What solutions does NeuroJSON provide?
: 3. Mapping of neuroimaging datasets to a CouchDB database

TLDR: NeuroJSON.io serves free neuroimaging datasets that are searchable, universally accessible via URLs, and both human- and machine-readable for long-term reusability.

NeuroJSON.io serves human-readable, searchable neuroimaging datasets using the universally accessible JSON format and URL-based RESTful APIs. NeuroJSON.io is built upon highly scalable document-store NoSQL database technologies, specifically the open-source Apache CouchDB engine, which can handle millions of datasets without major performance penalties. It provides fine-grained data search capabilities that allow users to find, preview and re-combine complex data records from public datasets before download.

1. What problems are we addressing?

Traditional neuroimaging data sharing has largely been file-based, and faces challenges including 1) the diverse and complex file formats involved, 2) a lack of standardization in naming conventions, and 3) a lack of human-readable and machine-actionable metadata. The emerging BIDS (Brain Imaging Data Structure) standard has greatly simplified and homogenized file-package based dataset organization, ensuring that data files and folders are organized in a simple, consistent and semantically meaningful order, with restricted file types, accompanied by human- and machine-readable metadata at both the modality data-file level and the dataset level.

However, file-package based data sharing still faces a number of key challenges:

without searchability and findability, it is difficult to scale to the exponentially growing body of public neuroimaging datasets, in both size and complexity;
many of the data files, even with the restrictions imposed by the BIDS specification, are in binary form and are not directly human-readable or searchable; their utility depends on the continued maintenance of file parsers, and future discontinuation or revision of certain file formats may jeopardize the long-term viability of the dataset;
increasing use of complex data analysis pipelines via automated and distributed cloud-computing services demands more flexible and fine-grained data access; disseminating large zipped packages could add significant overhead to data processing and maintenance.

2. What solutions does NeuroJSON provide?

The NIH-funded NeuroJSON project addresses the need for scalability and long-term viability of scientific datasets by adopting the JSON format and NoSQL database technologies, both of which have been extensively developed and widely adopted by the IT industry over the last few decades.

JSON is a human-readable hierarchical data format that is extensively used among web and native applications for ubiquitous data exchange. It is an internationally standardized format and has a large tool ecosystem, with numerous free parsers and utilities developed for nearly every programming language and environment in existence. JSON is supported by default within many prominent programming languages such as Python, Perl, MATLAB/Octave, and JavaScript, with lightweight parsers available for most other programming environments.
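As a brief illustration of this built-in support, the following Python sketch parses a small hierarchical record using only the standard-library `json` module. The record and its field names (`Name`, `Subjects`, `scans`) are made up for illustration and are not taken from a real dataset.

```python
import json

# A made-up hierarchical metadata record (hypothetical field names)
text = '{"Name": "demo", "Subjects": {"sub-01": {"age": 25, "scans": ["T1w", "bold"]}}}'

# Parse the JSON text into nested Python dicts and lists
record = json.loads(text)

# Navigate the hierarchy with ordinary key lookups
scans = record["Subjects"]["sub-01"]["scans"]

# Serialize back to indented, human-readable JSON text
pretty = json.dumps(record, indent=2, sort_keys=True)

print(scans)
print(pretty)
```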

To enable rapid search and manipulation of massive amounts of complex data, a newer class of database engines, NoSQL databases, has been developed and is broadly used for routine handling of the large data volumes produced by online and cloud-based applications. Unlike traditional table-based relational databases, NoSQL databases can effectively handle and manipulate hierarchical data records, and JSON is the native data exchange format for many NoSQL database engines.

The NeuroJSON project first defines a set of lightweight specifications to "wrap" common neuroimaging data files into JSON constructs. These specifications range from the JData specification, which maps common scientific data structures such as tables and N-D arrays to JSON structures, to the JNIfTI specification, which maps a NIfTI data file to a JSON construct, and the JSNIRF specification, which maps an HDF5-based SNIRF data file to JSON, among others.
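To give a sense of what "wrapping" an N-D array in JSON can look like, here is a minimal Python sketch. It assumes the annotated-array convention of the JData specification, with the keys `_ArrayType_`, `_ArraySize_` and `_ArrayData_` and row-major serialization; the helper names `encode_array` and `decode_array` are hypothetical, and the specification itself remains the authoritative reference for the exact key names and type strings.

```python
import json

def encode_array(nested, dtype="double"):
    """Wrap a non-empty rectangular nested list into an annotated-array dict.

    Assumption: keys and row-major ordering follow the JData
    annotated-array convention; this is an illustrative helper only.
    """
    # Derive the shape by descending into the first element of each level
    shape = []
    level = nested
    while isinstance(level, list):
        shape.append(len(level))
        level = level[0]
    # Flatten one nesting level at a time (row-major order)
    flat = nested
    for _ in range(len(shape) - 1):
        flat = [x for row in flat for x in row]
    return {"_ArrayType_": dtype, "_ArraySize_": shape, "_ArrayData_": flat}

def decode_array(obj):
    """Rebuild the nested list from an annotated-array dict."""
    data = list(obj["_ArrayData_"])
    # Regroup from the innermost dimension outward
    for dim in reversed(obj["_ArraySize_"][1:]):
        data = [data[i:i + dim] for i in range(0, len(data), dim)]
    return data

volume = [[1, 2, 3], [4, 5, 6]]           # a tiny 2x3 example array
wrapped = encode_array(volume, "uint8")
restored = decode_array(json.loads(json.dumps(wrapped)))
print(json.dumps(wrapped))
```

Because the wrapped form is plain JSON, it survives a text round trip and can be stored alongside other metadata in the same document.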

Based on these specifications, we have developed a set of converters that can convert common neuroimaging data files, including the folder structure defined by the BIDS specification, and extract all searchable metadata and content into JSON-based files. With these JSON-encoded files, we can "upload" the searchable portion of a dataset, potentially from many datasets, to a NoSQL database to facilitate fast and complex data search.
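The general idea of such a conversion can be sketched in a few lines of Python. This is not the actual NeuroJSON converter: the function `folder_to_json`, the `link_base` parameter and the placeholder link URLs are all hypothetical, and the handling is deliberately simplified (JSON files are embedded, TSV tables are stored column-wise, and every other file is replaced by a `_DataLink_` placeholder).

```python
import csv
import json
import tempfile
from pathlib import Path

def folder_to_json(root, link_base="http://url/to/ds"):
    """Recursively convert a dataset folder into one nested dict (illustrative)."""
    tree = {}
    for path in sorted(Path(root).iterdir()):
        if path.is_dir():
            # Sub-folders become nested JSON objects
            tree[path.name] = folder_to_json(path, f"{link_base}/{path.name}")
        elif path.suffix == ".json":
            # JSON sidecar files are embedded directly
            tree[path.name] = json.loads(path.read_text(encoding="utf-8"))
        elif path.suffix == ".tsv":
            # Tables are stored column-wise so each column is searchable
            with path.open(newline="", encoding="utf-8") as fh:
                reader = csv.DictReader(fh, delimiter="\t")
                rows = list(reader)
                columns = reader.fieldnames or []
            tree[path.name] = {c: [r[c] for r in rows] for c in columns}
        else:
            # Other (binary) files are replaced by a placeholder link
            tree[path.name] = {"_DataLink_": f"{link_base}/{path.name}"}
    return tree

# Build a tiny made-up dataset folder and convert it
with tempfile.TemporaryDirectory() as tmp:
    ds = Path(tmp) / "ds_demo"
    (ds / "sub-01" / "anat").mkdir(parents=True)
    (ds / "dataset_description.json").write_text('{"Name": "demo"}', encoding="utf-8")
    (ds / "participants.tsv").write_text("participant_id\tage\nsub-01\t25\n", encoding="utf-8")
    (ds / "sub-01" / "anat" / "sub-01_T1w.nii.gz").write_bytes(b"\x00\x01")
    doc = folder_to_json(ds)

print(json.dumps(doc, indent=2))
```

The resulting `doc` is a single JSON-serializable object whose keys mirror the folder layout, which is the form that can then be uploaded to a document database.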

At NeuroJSON.io, we run an instance of the Apache CouchDB server to host NeuroJSON-curated datasets. We chose CouchDB because it is fully open-source (compared to MongoDB) and supports automatic synchronization between multiple database instances.
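For readers unfamiliar with CouchDB, searches are typically expressed as JSON "selector" objects posted to a database's `_find` endpoint. The Python sketch below only constructs such a request without sending it; the server address `https://example.org:5984` and the helper `build_find_request` are placeholders chosen for illustration, not the actual NeuroJSON.io endpoint.

```python
import json
from urllib.request import Request

SERVER = "https://example.org:5984"  # hypothetical CouchDB server address

def build_find_request(database, selector, fields=None, limit=25):
    """Build (but do not send) a CouchDB POST /{db}/_find request."""
    body = {"selector": selector, "limit": limit}
    if fields:
        body["fields"] = fields  # restrict which document fields are returned
    return Request(
        url=f"{SERVER}/{database}/_find",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Select documents whose _id falls in a range, returning only the _id field
req = build_find_request(
    "openneuro",
    {"_id": {"$gte": "ds000001", "$lte": "ds000010"}},
    fields=["_id"],
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would submit the query to a live server; that step is omitted here because the address is a placeholder.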

3. Mapping of neuroimaging datasets to a CouchDB database

To carry neuroimaging datasets in a CouchDB database, we use the following mapping scheme to convert the logical structure of datasets and dataset collections to the hierarchies provided by CouchDB (databases, documents, attachments, etc.):

Data logical structure	CouchDB object	Examples
a dataset collection	a CouchDB database	openneuro, dandi, openfnirs,...
a dataset	a CouchDB document	ds000001, ds000002, ...
files and folders related to a subject	JSON keys inside a document	sub-01, sub-1/anat/scan.tsv,...
human-readable binary content (small)	an attachment to a document	.png, .jpg, .pdf, ...
non-searchable binary content (large)	`_DataLink_` JSON key	`"_DataLink_":"http://url/to/ds/filehash.jbd"`
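The mapping above can be sketched in Python as follows. CouchDB addresses a database as `/{db}`, a document as `/{db}/{doc}` and an attachment as `/{db}/{doc}/{attachment}`, so a collection/dataset pair translates directly into a URL. The server address, the helper names `dataset_url` and `find_datalinks`, the attachment name `preview.png` and the sample document are all hypothetical.

```python
from urllib.parse import quote

SERVER = "https://example.org:5984"  # hypothetical CouchDB server address

def dataset_url(collection, dataset, attachment=None):
    """Map collection -> database, dataset -> document, file -> attachment."""
    parts = [collection, dataset] + ([attachment] if attachment else [])
    return SERVER + "/" + "/".join(quote(p, safe="") for p in parts)

def find_datalinks(node, path=""):
    """Yield (json_path, url) for every _DataLink_ key in a document."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "_DataLink_" and isinstance(value, str):
                yield path, value
            else:
                yield from find_datalinks(value, f"{path}/{key}")
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_datalinks(item, f"{path}/{i}")

# A made-up document fragment following the table above
doc = {
    "sub-01": {
        "anat": {
            "scan.tsv": {"filename": ["sub-01_T1w.nii.gz"]},
            "sub-01_T1w.nii.gz": {"_DataLink_": "http://url/to/ds/filehash.jbd"},
        }
    }
}

doc_url = dataset_url("openneuro", "ds000001")
att_url = dataset_url("openneuro", "ds000001", "preview.png")
links = list(find_datalinks(doc))
print(doc_url)
print(att_url)
print(links)
```

Collecting the `_DataLink_` entries in this way lets a client search and preview the lightweight JSON document first, then download only the large binary files it actually needs.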