Utilities to convert datasets to JSON and access RESTful data on NeuroJSON.io
The NeuroJSON project (https://neurojson.org), funded by the US National Institutes of Health (NIH), aims to promote and curate scalable, searchable, and reusable neuroimaging datasets across research communities. The NeuroJSON project adopts JSON and binary JSON as its primary underlying data formats to reinforce searchability and scalability. JSON is an internationally standardized format that is supported across a wide range of programming environments. JSON has a vast toolchain ecosystem that can be readily applied to neuroimaging data once converted. Specifically, modern document-store and NoSQL databases, such as Redis, CouchDB, MongoDB, or the JSON type support in MySQL or SQLite, provide rapid and extensive search capabilities and can easily handle millions of datasets.
This toolbox provides a set of lightweight, easy-to-use shell-based utilities to convert neuroimaging datasets from native modality-specific data formats to JSON, and subsequently allow users to upload their JSON-encoded data to NeuroJSON.io, our primary document-store database built upon an open-source CouchDB server, for sharing and data publication. The utilities also provide convenient functions to query all free/public datasets provided on NeuroJSON.io and to search for specific datasets and data records for reuse in secondary data analysis or testing. The CouchDB server exposes all provided dataset information through intuitive RESTful APIs, so that neuroimaging end-users and tool developers can query, combine and download the JSON-encoded data relevant to their projects.
The migration from zip-file based neuroimaging data sharing to modern NoSQL database based data dissemination not only greatly enhances the scalability of neuroimaging datasets, but also makes datasets findable, searchable, and easy to integrate with diverse data analysis tools. It also prepares the community for building complex data analysis pipelines that require interoperable data exchange between complex software tools running on the cloud or in web-based applications.
We map neuroimaging datasets and dataset collections to CouchDB/NoSQL database object hierarchies. The table below shows the conceptual mapping of the logical data structures to the CouchDB object hierarchies.
Data logical structure | CouchDB object | Examples |
---|---|---|
a dataset collection | a CouchDB database | openneuro, dandi, openfnirs,… |
a dataset | a CouchDB document | ds000001, ds000002, … |
files and folders related to a subject | JSON keys inside a document | sub-01, sub-1/anat/scan.tsv,… |
human-readable binary content (small) | an attachment to a document | .png, .jpg, .pdf, … |
non-searchable binary content (large) | _DataLink_ JSON key | "_DataLink_":"https://url/to/ds/filehash.jbd" |
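As a rough illustration of the mapping above, the sketch below builds the CouchDB REST URL of a dataset collection (a database) and of a dataset (a document in that database). The server root is the default URL given in this document; the helper name `nj_url` and the variable `NJ_ROOT` are hypothetical and not part of the toolbox.

```shell
#!/bin/sh
# Hypothetical helper (not part of neuroj): build CouchDB REST URLs.
# Server root: NEUROJSON_IO if set, else the default stated in this README.
NJ_ROOT="${NEUROJSON_IO:-https://neurojson.io:7777}"

nj_url() {
    # $1 = database (dataset collection), $2 = optional document (dataset)
    if [ -z "${2:-}" ]; then
        printf '%s/%s\n' "$NJ_ROOT" "$1"
    else
        printf '%s/%s/%s\n' "$NJ_ROOT" "$1" "$2"
    fi
}

nj_url openneuro            # URL of a database (dataset collection)
nj_url openneuro ds000001   # URL of a document (dataset)
```

Such a URL can be passed to any HTTP client; neuroj wraps these requests so that they rarely need to be written by hand.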
A CouchDB server (like other NoSQL database engines) can hold and process enormous numbers of databases (i.e. collections) and documents (i.e. datasets) in each database. However, for high-performance search, our CouchDB server follows best-practice recommendations and sets the maximum document size to 8 MB. This means the searchable JSON-encoded content of a dataset should be limited to ~10-15 MB when stored as raw JSON text files (after parsing, the data size is reduced). In practice, the searchable content of most existing datasets fits within this limit (for example, over 90% of OpenNeuro datasets have less than 10 MB of raw JSON after conversion). After capping the sizes of large .tsv files, nearly all OpenNeuro datasets can be stored in a CouchDB document.
We highly recommend encoding only human-readable and searchable data in the JSON-encoded datasets and offloading the non-searchable binary data to externally linked files. This way, the JSON document stays small and easy to query, download and manipulate. CouchDB can perform complex searches of a database containing millions of small documents (kB) in a fraction of a second (MongoDB can be even faster; Redis offers the fastest speed if the entire database fits in the memory of the server).
Primary tools:
- neuroj: NeuroJSON client - the primary utility that calls other tools to convert and query NeuroJSON.io
- njprep: a bash script to convert databases, a single dataset or a single data file

Helper functions (called by neuroj and njprep):
- bids2json: a utility to merge converted dataset JSON files into a single datasetname.json file for upload
- link2json: a utility to create a JSON file for symbolic links
- listdatalink: a utility to list/extract all URL/externally linked data files (_DataLink_) for batch download
- mergejson: a bash/jq script to merge all converted files under a subject folder into a single subject.jbids file
- tsv2json: a Perl script to convert tsv/csv to JSON
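To make the role of the _DataLink_ key more concrete, here is a minimal sketch of pulling linked URLs out of a JSON document using only grep and sed. It is not the listdatalink implementation; it assumes every link is a plain `"_DataLink_":"..."` string pair with no escaped quotes in the value, and the function name `list_datalinks` is made up for this example.

```shell
#!/bin/sh
# Illustrative sketch, not the toolbox's listdatalink utility.
# Reads JSON on stdin and prints one _DataLink_ value per line.
list_datalinks() {
    grep -o '"_DataLink_"[[:space:]]*:[[:space:]]*"[^"]*"' |
        sed -e 's/^"_DataLink_"[[:space:]]*:[[:space:]]*"//' -e 's/"$//'
}

# Example input with made-up links
printf '%s\n' '{"anat":{"_DataLink_":"https://example.org/ds/file1.jdb"},"raw":{"_DataLink_": "symlink:git/annex/file2"}}' |
    list_datalinks
```

The printed list can then be fed to a downloader for batch retrieval. A JSON-aware tool such as jq is the more robust choice when available.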
neuroj and njprep data conversion overview
The diagram below shows the input/output folder structures used in the data conversion.
[input folder]                                      [output folder]
/orig/data/collection/root       =>                 /converted/json/root
                                                    |
                                                    |----------------------------- **database1.json** -> push to NeuroJSON.io CouchDB
|-- dataset1/                                       |-- dataset1/                                 ^
|   |-- dataset_description.json => (copy)          |   |-- dataset_description.json              | merge by `bids2json`
|   |-- README                   => (convert)       |   |-- README.jbids                          |
|   |                                               |   |---------------------- **subj-01.jbids**
|   |-- sub-01/                                     |   |-- sub-01/                   \   ^
|   |   |-- sub-01_scans.tsv     => (convert)       |   |   |-- sub-01_scans.tsv.json  |  |
|   |   |-- anat/                                   |   |   |-- anat/                  |--| merge by `mergejson`
|   |   |-- sub-01_T1w.nii.gz    => (convert)       |   |   |-- sub-01_T1w.nii.gz.json |
|   |   |-- sub-01_events.tsv                       |   |   |-- sub-01_events.tsv.json
|   |   |-- sub-01-file <symlink> -> git/annex/...  |   |   |-- sub-01-file.json: ["_DataLink_":"symlink:git/annex/..."]
|   |                                               |   |---------------------- **subj-02.jbids**
|   |-- sub-02/                                     |   |-- sub-02/
|   |   |-- ...                  =>                 |   |   |-- ...
|                                                   |----------------------------- **database2.json** -> push to NeuroJSON.io CouchDB
|-- dataset2/                                       |-- dataset2/
|   |-- ...                      =>                 |   |-- ...
|...                                                |...
                                                    |.att/   # attachment data files -> upload to your preferred or NeuroJSON server
                                                        |-- dataset1/
                                                        |   |-- md5_pathhash_file1-zlib.jdb
                                                        |   |-- md5_pathhash_file2-zlib.jdb
                                                        |   |-- ...
                                                        |-- dataset2/
                                                        |   |-- md5_pathhash_file1-zlib.jdb
                                                        |   |-- md5_pathhash_file2-zlib.jdb
                                                        |   |-- ...
                                                        ...
neuroj
neuroj is the NeuroJSON client script that provides most of the functionality. It calls njprep to perform batched and parallel dataset/data-file conversion to JSON, and supports listing, searching and downloading databases and datasets from NeuroJSON.io, our open data dissemination portal. NeuroJSON.io shares open datasets publicly and permits anonymous access.
Neuroimaging dataset creators, uploaders and collection administrators can also use neuroj to perform administrative tasks such as uploading new JSON-encoded datasets to an existing database, updating a dataset's JSON document with a new revision, deleting old versions, and other maintenance commands supported by the CouchDB REST API.
Command format: neuroj -flag1 <param1> -flag2 <param2> ...
Supported flags include:

-i/--input folderpath         path to the top folder of a data collection (such as OpenNeuro)
-o/--output folderpath        path to the output folder storing the converted JSON files
-db/--database dbname         database name (such as openneuro, openfnirs, dandi etc)
-ds/--dataset dataset         dataset name (a single dataset in a collection, such as ds000001)
-v/--rev revision             dataset revision key hash
-r/--convert                  convert database (-db) or dataset (-db .. -ds ..) (in parallel) to JSON
-t/--threads num              set the thread number for parallel conversion (4 by default)

-l/--list                     list all databases if -db is given; or the dataset if both -db/-ds are given
-q/--info                     query database info if -db is given; or dataset info if both -db/-ds are given
-f/--find '{"selector":,...}' use the CouchDB _find API to search datasets

-g/--get/--pull               retrieve and display a JSON-encoded dataset, or a complete database (slow)
-p/--put/--push dataset.json  upload JSON data to a database (-db) and dataset (if -ds is missing, use the file name), (admin only)
-c/--create                   create a specified database (-db), (admin only)
-d/--delete                   delete a specified dataset (-ds) from a database (-db), (admin only)
-u/--url https://...          CouchDB REST API root url, use https://neurojson.io:7777 (default) or use the NEUROJSON_IO env variable

-n                            read from $HOME/.netrc (Linux/MacOS) or %HOME%/_netrc (Windows) for username/password for admin tasks
--netrc-file /path/netrcfile  same as -n, specify netrc file path (see https://everything.curl.dev/usingcurl/netrc)
-U/--user username            set username for admin tasks (unless using -n or -c or the NEUROJSON_IO URL has user info)
-P/--pass password            set password for admin tasks (unless using -n or -c or the NEUROJSON_IO URL has password info)
neuroj accepts 3 ways to set the username/password when running admin tasks (create/upload/update/delete datasets). Using curl with -n/--netrc-file is the recommended approach, as it does not leave passwords in the commands or system logs.
If curl cannot be installed, neuroj attempts to use the Perl module LWP::UserAgent to communicate with the server. In this case, the user may set the environment variable NEUROJSON_IO in the form https://user:pass@example.com:port. If user/pass contains special characters, they must be URL-encoded. This way, the neuroj command will not show any password in the log. If you are on a secure computer, using -U/-P will also allow LWP::UserAgent to authenticate.
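Since credentials embedded in NEUROJSON_IO must be URL-encoded, a small percent-encoding helper can be handy. The following is a minimal POSIX shell sketch for ASCII input; the function name `urlencode`, the credentials, and the host/port are made up for illustration and are not provided by neuroj.

```shell
#!/bin/sh
# Hypothetical helper: percent-encode every character that is not an
# RFC 3986 "unreserved" character. Intended for ASCII input only.
urlencode() {
    _ue_in=$1
    _ue_out=
    while [ -n "$_ue_in" ]; do
        _ue_rest=${_ue_in#?}               # everything after the first char
        _ue_c=${_ue_in%"$_ue_rest"}        # the first char itself
        case $_ue_c in
            [A-Za-z0-9._~-]) _ue_out=$_ue_out$_ue_c ;;
            *) _ue_out=$_ue_out$(printf '%%%02X' "'$_ue_c") ;;
        esac
        _ue_in=$_ue_rest
    done
    printf '%s\n' "$_ue_out"
}

# Made-up credentials and server, following the URL form described above
user='admin'
pass='p@ss:w/rd'
NEUROJSON_IO="https://$(urlencode "$user"):$(urlencode "$pass")@example.com:7777"
export NEUROJSON_IO
echo "$NEUROJSON_IO"
```

Keeping the password in a variable read from a protected file, rather than typing it on the command line, avoids leaving it in the shell history.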
neuroj
neuroj -i /path/to/database/rootfolder -o /path/to/output/json/folder -db openneuro
neuroj -i /path/to/database/rootfolder -o /path/to/output/json/folder -db openneuro --convert -t 12
neuroj -i /path/to/database/rootfolder -o /path/to/output/json/folder -db openneuro -ds ds000001 --convert
neuroj -i /path/to/database/databasename -o /path/to/output/json/folder
neuroj -i /path/to/database/databasename -o /path/to/output/json/folder -ds ds000001
neuroj --list
neuroj --list | jq '.'
neuroj --list -db openneuro
neuroj --list -db openneuro | jq '.rows[] | .id'
neuroj --info -db openneuro
neuroj -db openneuro --find '{"selector":{},"fields":["_id"],"limit":10,"skip":2}'
neuroj -db openneuro --find '{"selector":{},"fields":["_id","bids_dataset_info.dataset_description\\\\.json"],"limit":2}' | jq '.'
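The --find examples above pass a CouchDB Mango query as a JSON string. The sketch below shows one way to assemble such a query in a script, together with the plain CouchDB `_find` request (a POST to `/<db>/_find`) that it corresponds to. The helper names `find_query` and `nj_find` are hypothetical, and whether neuroj issues exactly this request internally is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: build a Mango query that matches every document
# and returns only its _id, with paging via limit/skip.
find_query() {
    # $1 = limit, $2 = skip
    printf '{"selector":{},"fields":["_id"],"limit":%d,"skip":%d}\n' "$1" "$2"
}

# Hypothetical helper: send the query with curl to CouchDB's _find endpoint.
# Defined for reference only; it is not called here because it needs network access.
nj_find() {
    # $1 = database name, $2 = query JSON
    curl -s -X POST -H 'Content-Type: application/json' \
         -d "$2" "${NEUROJSON_IO:-https://neurojson.io:7777}/$1/_find"
}

find_query 10 2
```

The generated string can be passed directly to neuroj, for example `neuroj -db openneuro --find "$(find_query 10 2)"`.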
njprep
njprep is a converter from neuroimaging data files to JSON. It follows the general principle of the NeuroJSON project, which is to separate a dataset into a human-readable/searchable part and a non-searchable/binary data part.
The human-readable part is stored in the JSON format and can be readily uploaded to modern document-store databases to allow data analyses to scale to large datasets, making the data searchable, findable and universally accessible and parsable. The human-readability of the data format also ensures future reusability.
The non-searchable data are stored in binary JSON, or their original formats and can be stored externally while still being associated with the searchable JSON data using links, URLs or stored as “attachments” to the JSON document. They can be “re-united” with the searchable data on-demand to restore the full dataset for data analysis.
For the conversion of human-readable data files, njprep currently supports .json, .tsv, .csv, and various text files (.txt/.md/.rst). For a limited set of neuroimaging data files, such as .nii.gz and .snirf, njprep parses the file header into JSON while storing the rest in binary files. njprep also converts symbolic links to a special JSON element to maintain the linkage. Other human-readable documentation files, such as .png, .jpg and .pdf, are stored as attachments.
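The handling rules above can be summarized as a dispatch on file type. The sketch below is only an illustration of that decision table, not njprep's actual code; the function name `classify` and the category labels are invented for this example.

```shell
#!/bin/sh
# Illustrative only: njprep's real dispatch logic may differ.
classify() {
    # Symbolic links become a special JSON link element
    if [ -L "$1" ]; then
        echo "symlink"
        return
    fi
    case "$1" in
        *.json|*.tsv|*.csv|*.txt|*.md|*.rst) echo "text-to-json" ;;        # human-readable text
        *.nii|*.nii.gz|*.snirf)              echo "header-json+binary" ;;  # header to JSON, rest to binary
        *.png|*.jpg|*.pdf)                   echo "attachment" ;;          # stored as attachments
        *)                                   echo "other" ;;
    esac
}

# File names only; the files do not need to exist for this demonstration
for f in participants.tsv sub-01_T1w.nii.gz figure.png; do
    printf '%s -> %s\n' "$f" "$(classify "$f")"
done
```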
njprep
njprep /database/root/ /output/json/root/ database_name
njprep /database/root/ /output/json/root/ database_name dataset_name
njprep /database/root/ /output/json/root/ database_name dataset_name /path/to/a/file
For Linux and Mac OS:
- jq
- curl
- GNU Octave
- jbids (https://github.com/NeuroJSON/jbids, including 4 submodules under tools)
- libparallel-forkmanager-perl (for Parallel::ForkManager)
- libwww-perl (for LWP::UserAgent)
- libjson-xs-perl (for JSON::XS)
For Windows: please first install cygwin64 (https://cygwin.com/) or MSYS2 (https://msys2.org/) and also install the above packages in the corresponding cygwin64/msys2 installers.
When converting datasets with neuroj or njprep, the conversion of some data files, such as .snirf or .nii/.nii.gz, requires octave and the jbids toolbox (including its submodules). Other functionalities do not require octave.
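Before running a conversion, it can help to confirm that the command-line dependencies are on the PATH. The following is a small sketch of such a check; the function name `missing_tools` is hypothetical, and the check does not cover the Perl modules or the jbids toolbox.

```shell
#!/bin/sh
# Hypothetical helper: print the name of each listed command that is not on PATH.
missing_tools() {
    for _t in "$@"; do
        command -v "$_t" >/dev/null 2>&1 || printf '%s\n' "$_t"
    done
}

# octave is only needed for converting files such as .snirf or .nii/.nii.gz
missing="$(missing_tools jq curl perl octave)"
if [ -n "$missing" ]; then
    echo "missing tools:" $missing
else
    echo "all command-line dependencies found"
fi
```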