NeuroJSON.io serves human-readable, searchable neuroimaging datasets using the universally accessible JSON format and URL-based RESTful APIs. It is built on highly scalable document-store NoSQL database technology, specifically the open-source Apache CouchDB engine, which can handle millions of datasets without major performance penalties. It provides fine-grained search capabilities that let users find, preview and recombine complex data records from public datasets before downloading them.
Traditional neuroimaging data sharing has largely been file-based, which faces challenges in 1) handling the diverse and complex file formats involved, 2) a lack of standardized naming conventions, and 3) a lack of human-readable and machine-actionable metadata. The emerging BIDS (Brain Imaging Data Structure) standard has greatly simplified and homogenized file-package based dataset organization: data files and folders are organized in a simple, consistent and semantically meaningful order, with restricted file types, and are accompanied by human- and machine-readable metadata at both the modality data-file level and the dataset level.
However, file-package based data sharing still faces a number of key challenges.
The NIH-funded NeuroJSON project addresses the need for scalability and long-term viability of scientific datasets by adopting the JSON format and NoSQL database technologies, both of which have been extensively developed and widely adopted by the IT industry over the last few decades.
To enable rapid search and manipulation of massive amounts of complex data, a new kind of database engine, the NoSQL database, has been developed and is broadly used for routine handling of the large data produced by online and cloud-based applications. Unlike traditional table-based relational databases, NoSQL databases can efficiently handle and manipulate hierarchical data records, and JSON is often the native data exchange format of NoSQL database engines.
The NeuroJSON project first defines a set of lightweight specifications to "wrap" common neuroimaging data files into JSON constructs. These specifications include the JData specification (which maps common scientific data structures such as tables and N-D arrays to JSON structures), the JNIfTI specification (which maps a NIfTI data file to a JSON construct), the JSNIRF specification (which maps an HDF5-based SNIRF data file to JSON), and the JMesh specification (for portable exchange of discrete shape/mesh data), among others.
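As an illustration of the kind of mapping JData performs, the sketch below stores an N-D array as a JSON object that carries its type, size and flattened values. It covers only a simplified subset of the annotated-array form (the `_ArrayType_`, `_ArraySize_` and `_ArrayData_` keys, with values in row-major order). The helper names `encode_array` and `decode_array` are hypothetical and are not part of any NeuroJSON library, and the input is assumed to be a non-empty rectangular nested list.

```python
import json


def encode_array(nested, dtype="double"):
    """Wrap a rectangular nested list as a JData-style annotated array."""
    # Walk down the first element of each level to discover the shape.
    shape = []
    level = nested
    while isinstance(level, list):
        shape.append(len(level))
        level = level[0]
    # Flatten one nesting level per trailing dimension (row-major order).
    flat = nested
    for _ in shape[1:]:
        flat = [value for row in flat for value in row]
    return {"_ArrayType_": dtype, "_ArraySize_": shape, "_ArrayData_": flat}


def decode_array(record):
    """Rebuild the nested list from the annotated-array record."""
    data = list(record["_ArrayData_"])
    # Regroup from the innermost dimension outwards.
    for dim in reversed(record["_ArraySize_"][1:]):
        data = [data[i:i + dim] for i in range(0, len(data), dim)]
    return data


if __name__ == "__main__":
    volume = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    record = encode_array(volume)
    print(json.dumps(record))
    print(decode_array(record) == volume)
```

Because the record is ordinary JSON, it can be stored, indexed and queried like any other field of a document, which is the property the later database steps rely on.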
Based on these specifications, we have developed a set of converters that convert common neuroimaging data files, including the folder structure defined by the BIDS specification, and extract all searchable metadata and content into JSON-based files. With these JSON-encoded files, we can "upload" the searchable portion of a dataset, potentially of many datasets, to a highly scalable NoSQL database to facilitate fast and complex data search.
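The idea of folding a dataset folder into a single searchable JSON document can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual converters: the function name `folder_to_document` is hypothetical, only `.json` and `.tsv` files are read, and every other file (for example imaging binaries) is simply skipped.

```python
import csv
import json
from pathlib import Path


def folder_to_document(root):
    """Collect the text metadata of a dataset folder into one nested dict."""
    root = Path(root)
    doc = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if path.suffix == ".json":
            content = json.loads(path.read_text(encoding="utf-8"))
        elif path.suffix == ".tsv":
            # Each table row becomes a dict keyed by the column headers.
            with path.open(newline="", encoding="utf-8") as handle:
                content = list(csv.DictReader(handle, delimiter="\t"))
        else:
            continue  # binary or unsupported files are skipped in this sketch
        # Mirror the folder hierarchy as nested JSON keys.
        node = doc
        parts = path.relative_to(root).parts
        for folder in parts[:-1]:
            node = node.setdefault(folder, {})
        node[parts[-1]] = content
    return doc
```

Calling `json.dumps(folder_to_document("path/to/dataset"))` on a folder of this kind yields one JSON document whose keys follow the folder layout, which is the shape a document-store database can index.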
At NeuroJSON.io, we run an Apache CouchDB server instance to host NeuroJSON-curated datasets. We chose CouchDB because it is fully open-source (compared to MongoDB) and supports automatic synchronization between multiple database instances.
To carry neuroimaging datasets in a CouchDB database, we use the following mapping scheme to convert the logical structure of datasets and dataset collections to the hierarchy provided by CouchDB (databases, documents, attachments, etc.):
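CouchDB exposes its query interface over plain HTTP: a search is a JSON body posted to a database's `_find` endpoint. The sketch below only builds such a request and does not send it. The host name is a placeholder, the helper name `build_find_request` is hypothetical, and whether a particular server allows `_find` for anonymous users is an assumption. The example selector matches documents whose `_id` starts with a given prefix.

```python
import json
import urllib.request

# Placeholder host for illustration only; replace with a real CouchDB server.
BASE_URL = "https://couchdb.example.org"


def build_find_request(database, selector, fields=None, limit=25):
    """Build (but do not send) a CouchDB Mango query for one database."""
    body = {"selector": selector, "limit": limit}
    if fields:
        body["fields"] = fields  # restrict the returned keys to these
    return urllib.request.Request(
        url=f"{BASE_URL}/{database}/_find",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_find_request(
        "openneuro",
        selector={"_id": {"$regex": "^ds0000"}},
        fields=["_id"],
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it. A standard CouchDB server answers with a JSON object whose `docs` array holds the matching documents.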
| Data logical structure | CouchDB object | Examples |
|---|---|---|
| a dataset collection | a CouchDB database | openneuro, dandi, openfnirs, ... |
| a dataset | a CouchDB document | ds000001, ds000002, ... |
| files and folders related to a subject | JSON keys inside a document | |
| human-readable binary content (small) | an attachment to a document | .png, .jpg, .pdf, ... |
| non-searchable binary content (large) | _DataLink_ JSON key | |
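
The mapping above translates directly into URLs, because CouchDB addresses databases, documents and attachments as nested path segments (`/{db}`, `/{db}/{docid}`, `/{db}/{docid}/{attachment}`) and lists the documents of a database at `/{db}/_all_docs`. The following sketch builds those URLs without contacting any server. The base URL is a placeholder and the helper names are hypothetical.

```python
from urllib.parse import quote

# Placeholder host for illustration only; replace with a real CouchDB server.
BASE_URL = "https://couchdb.example.org"


def database_url(collection):
    """A dataset collection maps to a CouchDB database."""
    return f"{BASE_URL}/{quote(collection, safe='')}"


def dataset_list_url(collection):
    """List the documents (datasets) stored in one collection."""
    return f"{database_url(collection)}/_all_docs"


def document_url(collection, dataset):
    """A dataset maps to a CouchDB document inside the collection."""
    return f"{database_url(collection)}/{quote(dataset, safe='')}"


def attachment_url(collection, dataset, filename):
    """Small binary content maps to an attachment of the document."""
    return f"{document_url(collection, dataset)}/{quote(filename, safe='')}"


if __name__ == "__main__":
    print(database_url("openneuro"))
    print(dataset_list_url("openneuro"))
    print(document_url("openneuro", "ds000001"))
    print(attachment_url("openneuro", "ds000001", "preview.png"))
```

Each of these URLs can be fetched with an ordinary HTTP `GET`, which is what makes the datasets reachable from a browser, `curl`, or any language with an HTTP client.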