Auto-BIDSify - User Guide

Automated Neuroimaging Data Standardization

Convert any neuroimaging dataset to BIDS format using LLM-powered intelligence

What is Auto-BIDSify?

LLM-First Architecture

Uses an LLM for semantic understanding of your dataset structure

9-Stage Pipeline

Comprehensive workflow from data ingestion to BIDS validation

Universal Compatibility

Works with any naming convention - no hardcoded rules

Supported Formats

MRI
  • DICOM (.dcm)
  • NIfTI (.nii, .nii.gz)
  • JNIfTI (.jnii, .bnii)
fNIRS
  • SNIRF (.snirf)
  • Homer3 (.nirs)
  • MATLAB (.mat)
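
As a rough illustration of the format list above, the following sketch groups files by extension. The helper name `guess_modality` is hypothetical and is not part of Auto-BIDSify, which according to this guide relies on an LLM rather than fixed rules; the sketch only shows which extensions fall under each modality.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the Supported Formats list in this guide.
# .nii.gz is handled separately because Path.suffix only sees ".gz".
MRI_SUFFIXES = {".dcm", ".nii", ".jnii", ".bnii"}
NIRS_SUFFIXES = {".snirf", ".nirs", ".mat"}


def guess_modality(filename: str) -> Optional[str]:
    """Return "mri", "nirs", or None, based only on the file extension.

    Hypothetical helper for illustration; not the tool's own classifier.
    """
    name = Path(filename).name.lower()
    if name.endswith(".nii.gz"):
        return "mri"
    suffix = Path(name).suffix
    if suffix in MRI_SUFFIXES:
        return "mri"
    if suffix in NIRS_SUFFIXES:
        return "nirs"
    return None
```

A file such as `README.txt` returns `None`, which is a reminder that extension checks alone cannot describe a dataset.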

Interactive Demo

Before & After Comparison

❌ Before (Non-standard)
my_dataset/
├── Beijing_sub82352/
│   ├── anat_mprage/
│   │   ├── scan.dcm
│   │   └── scan_002.dcm
│   └── func_rest/
│       ├── fmri_001.dcm
│       └── fmri_002.dcm
├── Cambridge_sub06272/
│   ├── anat_mprage/
│   │   └── scan.dcm
│   └── func_rest/
│       └── fmri_001.dcm
├── Beijing_sub19283/
│   └── ...
└── README.txt

Issues:

  • Non-standard subject naming
  • DICOM files not converted
  • Missing BIDS metadata

Usage Guides

🚀 End-to-End Pipeline

Run the complete conversion in one command. The pipeline automatically executes all 9 stages from data ingestion to BIDS validation.

1. Prepare Your Dataset

Organize your neuroimaging data in a single directory:

my_dataset/
├── subject1/
│   ├── anat/
│   │   ├── T1w_001.dcm
│   │   └── ...
│   └── func/
│       ├── fmri_001.dcm
│       └── ...
├── subject2/
├── README.txt  # Optional
└── protocol.pdf # Optional
Note: Auto-BIDSify works with any naming convention. No need to rename files manually!
2. Run Auto-BIDSify

Execute the full pipeline with a single command:

python cli.py full \
  --input my_dataset/ \
  --output outputs/run1/ \
  --model qwen \
  --modality auto \
  --nsubjects 10 \
  --describe "Multi-site fMRI study with T1w and resting-state scans"

⚠️ Required: The --describe parameter is mandatory. Provide a clear description of your dataset to help the AI understand its structure.

💡 Highly Recommended: Add --nsubjects N if you know the number of subjects for much faster and more accurate processing.
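
If the subject count is not known offhand, it can often be estimated before running the pipeline. The sketch below assumes one top-level directory per subject, as in the layout shown in step 1; the function name `count_subject_dirs` is hypothetical. It does not apply to flat layouts where each subject is a single file.

```python
from pathlib import Path


def count_subject_dirs(dataset_root: str) -> int:
    """Count immediate subdirectories of dataset_root, skipping hidden ones.

    Assumes each subject has its own top-level directory.
    """
    root = Path(dataset_root)
    return sum(
        1
        for entry in root.iterdir()
        if entry.is_dir() and not entry.name.startswith(".")
    )
```

The result can then be passed as the value of `--nsubjects`.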

3. Monitor Progress

Watch the 9-stage pipeline execute:

[INFO] === Starting Full Pipeline ===
[INFO] Using model: qwen
[1/9] Ingesting data...
✓ Data ingestion complete
[2/9] Building evidence bundle...
✓ Evidence bundle saved
[3/9] Classifying files...
✓ Classification complete
[4/9] Generating dataset_description.json...
✓ Dataset description created
[5/9] Generating README.md...
✓ README created
[6/9] Generating participants.tsv...
✓ Participants file created
[7/9] Planning conversion...
✓ Conversion plan generated
[8/9] Converting files...
✓ Converted files
[9/9] Validating BIDS structure...
✓ BIDS validation passed
=== Pipeline Complete ===
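
When the pipeline runs unattended, it can help to extract the stage markers from captured output. This is a minimal sketch that assumes each message is on its own line and that stage lines look like `[3/9] Classifying files...`, as in the sample log; the function name `parse_stage_lines` is hypothetical and the real log format may differ.

```python
import re
from typing import List, Tuple

# Matches lines such as "[3/9] Classifying files..." and captures
# the stage index, the stage total, and the message without the dots.
STAGE_LINE = re.compile(r"^\[(\d+)/(\d+)\]\s+(.*?)(?:\.\.\.)?\s*$")


def parse_stage_lines(log_text: str) -> List[Tuple[int, int, str]]:
    """Return (stage, total, message) for every stage marker in the log."""
    stages = []
    for line in log_text.splitlines():
        match = STAGE_LINE.match(line.strip())
        if match:
            stages.append(
                (int(match.group(1)), int(match.group(2)), match.group(3))
            )
    return stages
```

Lines such as `[INFO] Using model: qwen` or the `✓` confirmations are ignored because they carry no `[n/9]` prefix.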
4. Access Your BIDS Dataset

Your standardized dataset is ready:

outputs/run1/bids_compatible/
├── dataset_description.json
├── README.md
├── participants.tsv
├── sub-001/
│   ├── anat/sub-001_T1w.nii.gz
│   └── func/sub-001_task-rest_bold.nii.gz
└── ...
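
A quick sanity check of the output folder might look like the sketch below. It only confirms that the three top-level files shown in the tree exist and lists the `sub-*` directories; both function names are hypothetical, and this is not a substitute for the pipeline's own BIDS validation stage.

```python
from pathlib import Path
from typing import List

# Top-level files shown in the example output tree of this guide.
EXPECTED_FILES = ["dataset_description.json", "README.md", "participants.tsv"]


def missing_top_level_files(bids_root: str) -> List[str]:
    """Return the expected top-level files that are absent from bids_root."""
    root = Path(bids_root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def list_subjects(bids_root: str) -> List[str]:
    """Return the sorted names of all sub-* directories under bids_root."""
    return sorted(
        entry.name
        for entry in Path(bids_root).iterdir()
        if entry.is_dir() and entry.name.startswith("sub-")
    )
```

An empty list from `missing_top_level_files("outputs/run1/bids_compatible")` means all three files are present.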

🚀 Quick Start

1. Installation

Install Auto-BIDSify from PyPI or GitHub:

📦 Via pip (Recommended):

pip install autobidsify

🔗 From GitHub:

pip install git+https://github.com/yiyiliu-rose/autobidsify.git

💻 For development:

git clone https://github.com/yiyiliu-rose/autobidsify.git
cd autobidsify
pip install -e .

Requirements: Python 3.8+, pip

2. Set Up API Key (Optional)

💡 Note: This step is only required if you plan to use OpenAI's LLM models. You can skip it if you use another supported model.

To use OpenAI models (such as GPT-4o), set your OpenAI API key as an environment variable:

Linux/Mac:

export OPENAI_API_KEY='your-api-key-here'

Windows (CMD):

set OPENAI_API_KEY=your-api-key-here

Windows (PowerShell):

$env:OPENAI_API_KEY="your-api-key-here"

Or, to persist it across Bash sessions on Linux/Mac, append it to ~/.bashrc:

echo 'export OPENAI_API_KEY="your-api-key-here"' >> ~/.bashrc

🔑 Get your API key: OpenAI Platform
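
Before launching a run with an OpenAI model, a script can confirm that the variable is visible to Python. This is a minimal sketch; the function name `openai_key_is_set` is hypothetical, and it only checks that the variable is non-empty, not that the key is valid.

```python
import os
from typing import Mapping


def openai_key_is_set(environ: Mapping[str, str] = os.environ) -> bool:
    """Return True if OPENAI_API_KEY is present and not blank."""
    value = environ.get("OPENAI_API_KEY", "")
    return bool(value.strip())
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.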

3. Usage

Once installed, you can use Auto-BIDSify in two ways:

🚀 End-to-End Pipeline

Run the complete conversion in one command - perfect for most users

🔧 Step-by-Step Mode

Execute stages individually for debugging and customization

💡 Tip: Start with the End-to-End Pipeline for quick results. Switch to Step-by-Step Mode if you need more control.

CLI Reference

Command Syntax

python cli.py full

Run complete end-to-end pipeline

python cli.py full --input INPUT --output OUTPUT [OPTIONS]
Option | Status | Description | Example
--input | ✓ Required | Input dataset directory | my_data/
--output | ✓ Required | Output directory | outputs/run1/
--describe | ✓ Required | Dataset description (mandatory) | "fMRI study"
--nsubjects | ⭐ Highly Recommended | Number of subjects (improves accuracy) | 10
--model | | LLM model (default: qwen) | gpt-4o
--modality | | Data type: auto/mri/nirs/mixed | mri
--id-strategy | | Subject ID: auto/numeric/semantic | auto
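
For scripted runs, the options above can be assembled into an argument list instead of a hand-typed shell line. The sketch below uses only the flags and allowed values listed in the table; the helper names `build_full_command` and `as_shell_line` are hypothetical, and the up-front checks are an assumption about sensible inputs rather than a description of how the CLI itself validates them.

```python
import shlex
from typing import List, Optional

# Allowed values copied from the option table in this guide.
MODALITIES = {"auto", "mri", "nirs", "mixed"}
ID_STRATEGIES = {"auto", "numeric", "semantic"}


def build_full_command(
    input_dir: str,
    output_dir: str,
    describe: str,
    nsubjects: Optional[int] = None,
    model: Optional[str] = None,
    modality: Optional[str] = None,
    id_strategy: Optional[str] = None,
) -> List[str]:
    """Build the argv list for `python cli.py full` from the documented flags."""
    if not describe.strip():
        raise ValueError("--describe is required and must not be empty")
    if modality is not None and modality not in MODALITIES:
        raise ValueError(f"unknown modality: {modality}")
    if id_strategy is not None and id_strategy not in ID_STRATEGIES:
        raise ValueError(f"unknown id strategy: {id_strategy}")

    cmd = ["python", "cli.py", "full",
           "--input", input_dir,
           "--output", output_dir,
           "--describe", describe]
    if nsubjects is not None:
        cmd += ["--nsubjects", str(nsubjects)]
    if model is not None:
        cmd += ["--model", model]
    if modality is not None:
        cmd += ["--modality", modality]
    if id_strategy is not None:
        cmd += ["--id-strategy", id_strategy]
    return cmd


def as_shell_line(cmd: List[str]) -> str:
    """Render an argv list as a single, safely quoted shell command."""
    return shlex.join(cmd)
```

The list form can be passed directly to `subprocess.run`, which avoids quoting problems with multi-word descriptions.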
Individual Stage Commands
  • python cli.py ingest --input INPUT --output OUTPUT
  • python cli.py evidence --output OUTPUT
  • python cli.py classify --output OUTPUT
  • python cli.py trio --output OUTPUT --model MODEL [--trio TYPE]
  • python cli.py plan --output OUTPUT --model MODEL
  • python cli.py execute --output OUTPUT
  • python cli.py validate --output OUTPUT
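
Step-by-step mode can also be driven from a script. The sketch below builds the stage commands in the order they are listed above, which is assumed to be the intended execution order, and stops at the first failing stage. The names `stage_commands` and `run_stages` are hypothetical, and the optional `--trio TYPE` argument is left out because its accepted values are not given in this guide.

```python
import subprocess
from typing import Callable, List


def stage_commands(input_dir: str, output_dir: str, model: str) -> List[List[str]]:
    """Return one argv list per stage, in the order listed in this guide."""
    base = ["python", "cli.py"]
    return [
        base + ["ingest", "--input", input_dir, "--output", output_dir],
        base + ["evidence", "--output", output_dir],
        base + ["classify", "--output", output_dir],
        base + ["trio", "--output", output_dir, "--model", model],
        base + ["plan", "--output", output_dir, "--model", model],
        base + ["execute", "--output", output_dir],
        base + ["validate", "--output", output_dir],
    ]


def run_stages(commands: List[List[str]], runner: Callable = subprocess.run) -> None:
    """Run each stage in turn; check=True raises on a non-zero exit code."""
    for cmd in commands:
        runner(cmd, check=True)
```

Running the stages one at a time makes it possible to inspect intermediate files in the output directory before continuing.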

Real-World Examples

📊 Example 1: Replication Data for: fNIRS RSFC in Tinnitus

Real-world fNIRS dataset from Harvard Dataverse investigating resting-state connectivity in tinnitus patients

Authors: San Juan J, Hu X-S, Issa M, Bisconti S, Kovelman I, Kileny P

Published: PLoS ONE 2017, 12(6): e0179150

DOI: 10.7910/DVN/ZNZZBV

Modality: fNIRS (Homer3 .nirs)

Subjects: 13 participants

License: Public Domain (PD)

Input → Output Transformation
❌ Input (Non-BIDS)
harvard_dataverse/
├── BZZ003.nirs
├── BZZ004.nirs
├── BZZ005.nirs
├── BZZ007.nirs
├── BZZ008.nirs
├── ...
└── BZZ028.nirs

13 files, flat structure

Issues:

  • No BIDS directory structure
  • Custom naming (BZZ prefix)
  • Missing BIDS metadata files
  • Flat structure
✅ Output (BIDS-Compliant)
tinnitus_fnirs_rsfc/
├── dataset_description.json
├── README.md
├── participants.tsv
├── sub-03/
│   └── nirs/sub-03_task-passive-listening_nirs.snirf
├── sub-04/
│   └── nirs/sub-04_task-passive-listening_nirs.snirf
├── sub-05/
│   └── nirs/sub-05_task-passive-listening_nirs.snirf
├── ...
└── sub-28/
    └── nirs/sub-28_task-passive-listening_nirs.snirf

13 subjects, BIDS v1.10.0, SNIRF format

Improvements:

  • ✓ BIDS-compliant structure
  • ✓ Homer3 .nirs → SNIRF conversion
  • ✓ Standard naming with task labels
  • ✓ Original IDs preserved
  • ✓ All metadata files generated
Generated Metadata Files
dataset_description.json
{
  "Name": "Replication Data for: fNIRS RSFC in Tinnitus",
  "BIDSVersion": "1.10.0",
  "DatasetType": "raw",
  "License": "PD",
  "Authors": ["San Juan J", "Hu X-S", "Issa M", "Bisconti S", "Kovelman I", "Kileny P"]
}
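
A generated dataset_description.json can be spot-checked with a few lines of Python. The BIDS specification marks `Name` and `BIDSVersion` as required fields; the sketch below checks only those two, and the function name `missing_required_fields` is hypothetical.

```python
import json
from typing import List

# Fields the BIDS specification requires in dataset_description.json.
REQUIRED_FIELDS = ("Name", "BIDSVersion")


def missing_required_fields(description_json: str) -> List[str]:
    """Return required fields that are absent or empty in the JSON text."""
    data = json.loads(description_json)
    return [field for field in REQUIRED_FIELDS if not data.get(field)]
```

For the example file above the function returns an empty list, since both fields are present.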
participants.tsv
participant_id	original_id
sub-03	BZZ003
sub-04	BZZ004
sub-05	BZZ005
... (10 more)
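
The mapping from BZZ003 to sub-03 can be reproduced with a simple rule: take the trailing digits, drop leading zeros, and pad to the width of the largest ID. This rule is an assumption inferred from the example; Auto-BIDSify's actual numeric strategy is not documented here, and the names `numeric_ids` and `participants_tsv` are hypothetical.

```python
import csv
import io
import re
from typing import Dict, List


def numeric_ids(original_ids: List[str]) -> Dict[str, str]:
    """Map each original ID to sub-<number>, zero-padded to a common width."""
    if not original_ids:
        return {}
    numbers = {}
    for original in original_ids:
        match = re.search(r"(\d+)$", original)
        if match is None:
            raise ValueError(f"no trailing digits in {original!r}")
        numbers[original] = int(match.group(1))
    if len(set(numbers.values())) != len(numbers):
        raise ValueError("numeric parts are not unique")
    width = len(str(max(numbers.values())))
    return {orig: f"sub-{num:0{width}d}" for orig, num in numbers.items()}


def participants_tsv(mapping: Dict[str, str]) -> str:
    """Render the mapping as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["participant_id", "original_id"])
    for original, participant in mapping.items():
        writer.writerow([participant, original])
    return buffer.getvalue()
```

Keeping the original ID in its own column, as the generated file does, preserves the link back to the source data.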
README.md
# README for BIDS Dataset: fNIRS RSFC in Tinnitus

## Overview
This dataset contains replication data for investigating tinnitus effects 
on resting state functional connectivity using fNIRS.
...

## Dataset Description
- **Title**: Replication Data for: fNIRS RSFC in Tinnitus
- **Authors**: San Juan J, Hu X-S, Issa M, et al.
- **DOI**: 10.7910/DVN/ZNZZBV
- **Journal**: PLoS ONE 12(6): e0179150 (2017)
- **Subjects**: 13 participants
...

## Data Acquisition
fNIRS measuring brain activity during resting state in tinnitus patients.
...

## References
San Juan J, et al. (2017) PLoS ONE 12(6): e0179150
...

🧠 Example 2: CamCAN Multi-Site Study

Large-scale multi-site MRI dataset from the Cambridge Centre for Ageing and Neuroscience

Study: Cambridge Centre for Ageing and Neuroscience

Sites: Beijing, Cambridge, ...

Source: CamCAN Open Data

Modality: MRI (DICOM → NIfTI)

Subjects: 3,763 participants (multi-site cohort)

Scans: T1-weighted anatomy + resting-state fMRI

License: CC-BY-4.0

Input → Output Transformation
❌ Input (Non-BIDS)
my_dataset/
├── Beijing_sub82352/
│   ├── anat_mprage/scan.dcm
│   └── func_rest/fmri.dcm
├── Cambridge_sub06272/
├── Beijing_sub19283/
└── ...

3,763 subjects, DICOM format

Issues:

  • Non-standard naming (site_subID)
  • Potential ID conflicts across sites
  • DICOM not converted
  • No BIDS metadata
✅ Output (BIDS-Compliant)
bids_compatible/
├── dataset_description.json
├── README.md
├── participants.tsv
├── sub-Beijing82352/
│   ├── anat/sub-Beijing82352_T1w.nii.gz
│   └── func/sub-Beijing82352_task-rest_bold.nii.gz
├── sub-Cambridge06272/
├── sub-Beijing19283/
└── ...

3,763 subjects, NIfTI format

Improvements:

  • ✓ BIDS-compliant structure
  • ✓ Semantic IDs prevent conflicts
  • ✓ DICOM→NIfTI conversion
  • ✓ Site info preserved
Generated Metadata Files
dataset_description.json
{
  "Name": "CamCAN Multi-Site Study",
  "BIDSVersion": "1.10.0",
  "License": "CC-BY-4.0",
  "Authors": ["Research Team"]
}
participants.tsv
participant_id	site	original_id
sub-Beijing82352	Beijing	Beijing_sub82352
sub-Cambridge06272	Cambridge	Cambridge_sub06272
sub-Beijing19283	Beijing	Beijing_sub19283
... (3,760 more)

💡 ID Strategy: The original naming was site_subXXXXX. Because different sites could have overlapping IDs, Auto-BIDSify renamed each subject to sub-siteXXXXX so that every ID is globally unique.
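
The renaming described in this note can be sketched with a regular expression. The pattern assumes folder names of the form `Site_subNNNNN` with a purely alphabetic site name, as in the example; the function name `semantic_row` is hypothetical and does not describe the tool's internal logic. The underscore is dropped because BIDS labels may contain only letters and digits.

```python
import re
from typing import Tuple

# Matches folder names such as "Beijing_sub82352".
SITE_ID = re.compile(r"^(?P<site>[A-Za-z]+)_sub(?P<num>\d+)$")


def semantic_row(folder_name: str) -> Tuple[str, str, str]:
    """Return (participant_id, site, original_id) for a site_subNNNNN folder."""
    match = SITE_ID.match(folder_name)
    if match is None:
        raise ValueError(f"unexpected folder name: {folder_name!r}")
    participant_id = f"sub-{match.group('site')}{match.group('num')}"
    return participant_id, match.group("site"), folder_name
```

Two sites that happen to share a subject number still receive distinct labels, because the site name is part of the label.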

README.md
# Cambridge Centre for Ageing and Neuroscience (CamCAN) Dataset

## Overview
The CamCAN project investigates how individuals maintain cognitive abilities 
with age, integrating epidemiological, cognitive, and neuroimaging data across 
five phases.
...

## Dataset Description
Multi-phase neuroimaging study with ~700 participants:
- **Phase 1**: Demographics, health, cognitive data (~2700 adults, 2010-2012)
- **Phase 2**: Detailed cognitive, MRI, MEG data (CC700: 2011-2013)
- **Phase 3**: Repeat MRI/MEG scans (CC280: 2012-2014)
...

## Data Acquisition
**MRI Modalities**: T1, T2, DWI, resting-state fMRI, task fMRI
**Imaging Parameters**:
- T1 MPRAGE: TR 2250ms, TE 2.99ms, 1mm isotropic
- fMRI: TR 1970ms, TE 30ms, 3×3×4.44mm
- DWI: 30 directions, b=0,1000,2000
...

## File Organization
Organized as BIDS repositories by modality:
- Phase 2 Arm 1 (CC700) Raw MRI/MEG
- Phase 2 Arm 2 (Frail) Raw MRI/MEG
- Phase 3 (CC280) Raw MRI/MEG
...

## Usage Notes
Non-commercial research only. Proper acknowledgment required.
...

## References
Shafto et al. (2014) BMC Neurology 14(204)
doi: 10.1186/s12883-014-0204-1
...

Feature Comparison

Feature | fNIRS | MRI
Input | Homer3 .nirs | DICOM
Output | .snirf (BIDS) | NIfTI
Subjects | 13 (non-consecutive) | 3,763 (multi-site)
ID Strategy | Numeric (sub-03, sub-04...) | Semantic (sub-Beijing82352...)
Conversion | Re-organization | DICOM→NIfTI

Developed by Yiyi (Rose) Liu

© 2026 • Built with ❤️ for the neuroimaging community
