How To Use
This page is a practical guide for deciding which data to practice with first. The idea is to use public data to first reach a state where the same results can be reproduced (L0), and then check whether those results predict and withstand changes in conditions (L1-L2).
When selecting data at the beginning, it is more important to choose data whose procedures and results are easy for others to follow than to pick something extremely difficult. Rather than aiming for everything from the start, the shortcut is to build the smallest possible loop on public data that is easy to reproduce.
This page is the practical entry point for deciding where to start and how to complete the minimum L0 loop. The verification platform handles what counts as progress, and the casework section within the verification platform handles examples from other fields. If you want a one-page guide to how the practical pages differ from the rest of the site, please see Wiki: Guide to reading practical pages.
The minimal loop procedure from the old hands_on.md has been integrated into this page. Therefore, you can read straight through to the L0 skeleton, QC, baseline, and completion conditions without having to go to another page after data selection.
This page is the practical data portal, not the full research-question map. If you want to move from a dataset bucket here to a specific mind-upload research question, a fixed EEG-ready claim, and a grant-ready theme, start with the current public six-RQ brief, then use the RQ60 EEG feasibility page, the RQ-by-RQ deep dossiers, the grant and dataset playbook, and the current funding shortlist.
| What I want to do | First data |
|---|---|
| I want to practice the basics of preprocessing and classification | EEG Motor Movement/Imagery is an easy entry point. The task setup is relatively easy to understand, making it suitable for L0-L1 practice. |
| I want to experience long-term data and event detection | CHB-MIT is a good fit. It lets you practice handling noise, long recordings, and event detection together. |
| I want to handle state transitions | Sleep-EDF is a good fit. It is useful for learning how states change over time. |
| I want to see the difficulty of large-scale data | TUH EEG is also a candidate. However, it is heavy for a first dataset, so it is safer to get comfortable with the three datasets above first. |
A starter dataset is not meant to solve every WBE problem at once. What you want to gain here first is reproducible input organization, QC habits, and baseline comparison. Hard points such as identity and causal identity cannot be resolved with data alone at this stage.
This public page stays at the entry level. When one unresolved question is actively being turned into an EEG-ready work package, stronger routing details such as fixed Dxx + DOI anchors, first-pass KPIs, and stopping rules are kept in the wiki rather than promoted here as a public conclusion. Use the current public six-RQ brief, the RQ60 EEG feasibility page, and the RQ-by-RQ deep dossiers when you need the current one-question-at-a-time packages.
When you read a dataset introduction, it is tempting to jump straight to "what score was achieved?" The first questions should instead be what the train/test split unit is, whether leakage was checked, and whether the result was compared against a simple baseline. If this is still unclear, please read Wiki: Data Splits and Data Leakage first.
MOABB treats within-session, cross-session, and cross-subject as separate evaluation families. In other words, even when the headline number is the same 70%, a 70% obtained under "same day, same person, same setup" and a 70% obtained by holding out a "different day" or "different person" are different achievements. If you want to sort out short-term state fluctuations and long-term drift first, please also have a look at Wiki: state/trait/drift.
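The difference between these evaluation families comes down to which unit is held out. A minimal sketch in plain Python (hypothetical epoch records, not a real MOABB API) makes the leakage check concrete:

```python
# Sketch: the same pool of epochs yields different evaluation families
# depending on which unit is held out. All records here are hypothetical.

epochs = [
    {"subject": s, "session": ses, "epoch": e}
    for s in ("S01", "S02", "S03")
    for ses in ("day1", "day2")
    for e in range(4)
]

def cross_subject_split(epochs, test_subject):
    """Hold out every epoch from one subject (cross-subject family)."""
    train = [ep for ep in epochs if ep["subject"] != test_subject]
    test = [ep for ep in epochs if ep["subject"] == test_subject]
    return train, test

def cross_session_split(epochs, subject, test_session):
    """Within one subject, hold out a whole day (cross-session family)."""
    own = [ep for ep in epochs if ep["subject"] == subject]
    train = [ep for ep in own if ep["session"] != test_session]
    test = [ep for ep in own if ep["session"] == test_session]
    return train, test

def leaks(train, test, unit):
    """A split leaks at `unit` if any unit value appears on both sides."""
    return bool({ep[unit] for ep in train} & {ep[unit] for ep in test})

train, test = cross_subject_split(epochs, "S03")
assert not leaks(train, test, "subject")   # subjects are disjoint

train, test = cross_session_split(epochs, "S01", "day2")
assert leaks(train, test, "subject")       # same subject on both sides:
assert not leaks(train, test, "session")   # this is within-subject evidence only
```

The same `leaks` check applied at the wrong unit (e.g. shuffling epochs instead of holding out subjects) is exactly the leakage failure the linked wiki page describes.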
Even for the same "public EEG data", the meaning of a comparison differs across cue-locked annotation channels, expert interval annotations, whole-night hypnograms, and physician report-derived labels. Therefore, on this page, in addition to the dataset name, always record where the label came from, at what time granularity, and what counts as an independent split unit.
Deciding on a dataset name alone is not enough: if the form of the deliverable is ambiguous, it is easy to get stuck. If you want to see BIDS, Validator, QC logs, split rules, baselines, execution steps, and failure examples on one page, please see Wiki: Minimum L0 artifact pack.
After the introduction to EEG, if you would like to see the flow of selecting data on this page, going around in the L0 practice section, and confirming it as L0 in Verification, please see Wiki: Straight path from EEG to L0.
Even if the waveform file is published, if the event definition, stimulus log, time synchronization, and bad channel / bad segment recording are weak, it will be difficult to compare again later. Furthermore, in the 2026-03 re-audit, we added to the site rule that event semantics are not fixed just by having `events.tsv`, and hardware delay cannot be audited just by having LSL. If you want to understand this point from the beginning, please see Wiki: Basics of event synchronization and observation logs first.
Future dataset cards must include at least (1) onset / duration / sample, (2) clock domain plus stream-alignment rule, (3) timing-validation class such as stored-data anchor / digital trigger / physical onset / uncontrolled-response test, (4) event semantics such as trial_type, HED, and scoring rules, (5) provenance / scorer / report-usage flag, (6) independent split units, and (7) a clear stopping claim. Cards without these fields are insufficient as reusable L0 guides.
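The seven required card fields can be enforced mechanically before a card is published. A minimal sketch (the field names follow this page's list; the example card contents are hypothetical):

```python
# Sketch: a minimal dataset-card checker for the seven required fields.
# Field names mirror this page's requirements; card contents are hypothetical.

REQUIRED_FIELDS = (
    "onset_duration_sample",    # (1) onset / duration / sample
    "clock_domain_alignment",   # (2) clock domain + stream-alignment rule
    "timing_validation_class",  # (3) e.g. digital trigger, physical onset
    "event_semantics",          # (4) trial_type, HED, scoring rules
    "provenance",               # (5) provenance / scorer / report-usage flag
    "split_units",              # (6) independent split units
    "stopping_claim",           # (7) what the card does NOT claim
)

def missing_fields(card: dict) -> list:
    """Return the required fields a card fails to fill in."""
    return [f for f in REQUIRED_FIELDS if not card.get(f)]

card = {
    "onset_duration_sample": "events.tsv: onset, duration, sample",
    "clock_domain_alignment": "amplifier clock; LSL offsets logged per stream",
    "timing_validation_class": "digital trigger",
    "event_semantics": "trial_type + HED tags",
    "provenance": "scored by two raters; no report-derived labels",
    "split_units": "subject",
    # "stopping_claim" deliberately left out to show the check firing
}

assert missing_fields(card) == ["stopping_claim"]
```

A card that fails this check is, in the page's terms, insufficient as a reusable L0 guide.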
BIDS is a raw-data standard, BIDS Derivatives keeps processed-output lineage explicit, OpenNeuro and PhysioNet are storage areas, Validator is a formal check, MNE-BIDS is a loader, MNE-BIDS-Pipeline or a BIDS App is a workflow recipe, and Benchmark is a comparison rule. If you want to sort out this role difference from the beginning, please use Wiki: Standards/Repositories/Validators/Benchmarks.
EEG-based starter datasets are enough to begin, but later you may want to add spatial or structural information. If you want to map out what can be added to EEG first, please see Wiki: Multimodal integration basics.
This page had one practical weakness: it explained how to build a reproducible neuron-first EEG baseline, but it still lacked a public rule for testing whether adding maintenance-state / support-state variables changes prediction, stability, or explanation. The current primary literature does not support one compressed support-variable bucket. Cahill et al. (2024), Williamson et al. (2025), Dewa et al. (2025), and Bukalo et al. (2026) sharpen distinct astrocyte-state routes across minute-scale network encoding, recall, multiday stabilization, and fear-state support. By contrast, Suzuki et al. (2011), Silva et al. (2022), Pavlowsky et al. (2025), and Greda et al. (2025) sharpen distinct glial substrate-routing routes across lactate support, ketone-body support under starvation, learning-linked fatty-acid routing, and apoE / sortilin-dependent lipid delivery under limited glucose. Mai-Morente et al. (2025) sharpens a pericyte / capillary-support route; Kim et al. (2025) sharpens a meningeal-lymphatics / microglia route for synaptic physiology; Hirschler et al. (2025) and Dagum et al. (2026) sharpen bounded human clearance-side observables; and Chung et al. (2025) raises tracer-specific BBB transport quantification while explicitly leaving human ground truth and test-retest for future work. Therefore, this page now fixes a component-addition / ablation ladder that separates glial substrate-routing from astrocyte-state instead of letting readers pile both into one multimodal boost.
A practical component-addition / ablation ladder for maintenance-state routes
| Family added on top of the neuron-first baseline | Minimum paired data requirement | What you may say if the gain survives | What still must stop here |
|---|---|---|---|
| Glial substrate-routing route | Same-subject neural and behavioral target, plus a named glia-to-neuron fuel-support observable or perturbation with supplier cell, neuronal sink, fuel object / carrier, and nutrient or learning regime fixed. | A declared glial substrate-routing family improved prediction, recall, or a bounded memory-support readout in that named nutrient or learning regime. | That astrocyte ensemble state was identified, or that one glial fuel route generalizes across lactate, ketone-body, fatty-acid, and apoE / sortilin-dependent lipid delivery. |
| Astrocyte-state route | Same-subject neural and behavioral target, plus a named astrocyte-state observable or perturbation aligned to the same recall, stabilization, or fear-state window. | A named astrocyte-state family improved prediction, recall, stabilization, or fear-state decoding in that declared window. | That the responsible whole-brain astrocyte controller was identified, or that astrocyte-state evidence fixed glial fuel routing, clearance control, or one general support state across all tasks and timescales. |
| Neurovascular / BBB / pericyte support route | Same-subject neural and behavioral target, plus a named capillary, BBB-exchange, or BBB-transport observable with shared arousal / vascular-driver logging. | A declared vascular-support family reduced one error term or improved one prediction slice under the named physiological regime. | That a generic BBB state was measured, or that the added row directly read out the neuronal variable of interest. |
| Clearance / immune / lymphatic route | Same-subject neural or biomarker target, plus a named CSF-mobility, tracer-transport, or sleep-linked efflux route with sleep / time-of-day handling fixed. | A declared transport-side or immune-side route explained incremental variance or changed a bounded physiological readout. | That local microglial control, route-free whole-brain clearance truth, or one universal maintenance controller was identified. |
| Bundle comparison rule | Compare the neuron-only baseline, each single added family, and the full bundle under the same subjects, split rule, missingness policy, and common-driver audit. | The bundle improved the declared task under a named availability and regime constraint, beyond the strongest single added row. | That the full bundle proves the minimum required biological configuration or closes U3 by itself. |
- Freeze the neuron-first baseline, target object, split unit, and metric bundle before adding any maintenance-state row.
- Add one family at a time under the same subjects, same sessions, same missingness rule, and same evaluation family.
- Name the direct observable, time window, spatial unit, and route class for every added family, because local causal perturbation and bounded human proxy do not carry the same claim.
- Report the strongest single added row, the full bundle, and their disagreement / missing-modality behavior under the same split.
- Stop the claim at incremental predictive, stability, or physiological gain unless the result also survives common-driver controls, out-of-regime checks, and a named abstention boundary.
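The bundle comparison rule above can be sketched as a loop. All scores below are hypothetical placeholders, not measured results; the point is only the comparison order the ladder requires:

```python
# Sketch of the bundle comparison rule: neuron-only baseline, each single
# added family, then the full bundle, all under one frozen split.
# Scores are hypothetical placeholders, not measured results.

def evaluate(feature_families, split_id):
    """Stand-in for a real pipeline; returns a placeholder score.

    split_id is fixed before any maintenance-state row is added, per the
    freeze rule above; it is carried here only to make that explicit.
    """
    scores = {
        frozenset(): 0.62,                                    # neuron-only
        frozenset({"glial_routing"}): 0.64,
        frozenset({"astrocyte_state"}): 0.63,
        frozenset({"glial_routing", "astrocyte_state"}): 0.635,
    }
    return scores[frozenset(feature_families)]

SPLIT = "frozen-split-v1"
families = ["glial_routing", "astrocyte_state"]

baseline = evaluate([], SPLIT)
single = {f: evaluate([f], SPLIT) for f in families}   # one family at a time
bundle = evaluate(families, SPLIT)
best_single = max(single, key=single.get)

# The bundle claim must beat the strongest single added row, not just the baseline.
bundle_adds_value = bundle > single[best_single]
assert best_single == "glial_routing"
assert not bundle_adds_value  # here the bundle only repackages the best row
```

In this hypothetical case the bundle beats the baseline but not the strongest single row, so the ladder stops the claim at that row.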
EEG Motor Movement/Imagery, CHB-MIT, Sleep-EDF, and TUH help you fix the neuron-first baseline, split unit, QC discipline, and leakage checks. By themselves they do not close glial substrate-routing, astrocyte-state, pericyte / BBB support, clearance transport, or other maintenance-state families. Any public maintenance-state claim on this site therefore needs paired support-state data, aligned proxy logs, or a named perturbation route, and it must be compared against the strongest single added family rather than only against the all-in bundle.
The page still had one practical weakness after the family split: it could leave the impression that once several support-state rows are collected in the same subject, the bundle itself is already close to one aligned biological variable. The current primary literature does not support that shortcut. Vafaii et al. (2024) showed that spontaneous multimodal measures contain both common and divergent cortical structure. Chen et al. (2025) showed that simultaneous EEG-PET-MRI can display tightly coupled temporal evolution while still preserving spatially distinct hemodynamic and metabolic patterns. Bolt et al. (2025) showed that a major low-frequency global fMRI pattern is substantially coupled to autonomic physiology, while Epp et al. (2025) showed that about 40% of gray-matter voxels with significant task BOLD changes exhibited opposing oxygen-metabolism changes. Bundle-level gain is real but still conditional: Rohaut et al. (2024) showed that adding markers can reduce prognostic uncertainty, Amiri et al. (2023) showed that direct same-sample comparison shrank to 48 patients with all EEG and fMRI features, and Manasova et al. (2026) showed higher inter-modality disagreement in minimally conscious or improving patients even as performance improved with more modalities. Row-local stability is separate again: Bøgh et al. (2024) fixed a named repeatability window for a 3 T deuterium route, and Wirsich et al. (2021) showed reproducible EEG-fMRI connectome relations only under explicitly harmonized simultaneous acquisition. Therefore, this page now requires a practical support-state augmentation card before a same-subject bundle is read as more than row addition.
A practical support-state augmentation card for dataset bundles
| Field to log before reading a same-subject bundle strongly | Why this field is necessary | What overread it blocks |
|---|---|---|
| Route class and bridge type | State whether a row is a same-subject human proxy, a same-subject perturbation, a sequential bridge, or a mixed-species causal support row. | Do not read rodent causal support and bounded human proxy rows as interchangeable evidence just because they concern the same family label. |
| Effective time window and physiological regime | Log whether rows target the same trial epoch, sleep stage, arousal window, pharmacological state, or multiday stabilization regime. | Do not read co-acquisition or same-session wording as if one support-state sample had been aligned automatically. |
| Direct observable and quantity type | Name whether the added row is density, transport, exchange, mobility, flux, metabolism, or an indirect classifier / score. | Do not collapse tracer transport, glucose uptake, BOLD fluctuation, and bounded biomarker efflux into one solved maintenance variable. |
| Shared-driver / quantity-bridge audit | Disclose vascular, respiratory, autonomic, motion, drug, and time-of-day covariates, and say whether the bundle established a shared trajectory, a common driver, or a true quantity bridge. | Do not read correlated rows as one biological quantity when the coupling may be driven by arousal or another shared nuisance source. |
| Availability slice and missing-modality policy | Report the exact complete-case subset, any imputation or substitution rule, and whether the comparison is same-sample or maximum-available-data. | Do not hide that the bundle result may depend on a much smaller or specially filtered subgroup than the headline cohort. |
| Strongest single row and disagreement topology | Compare the best single added family against the full bundle and state where modalities agree, diverge, or change sign. | Do not promote the full bundle if it only repackages the strongest single row or if disagreement is concentrated in the hardest regime. |
| Row-local repeatability and transfer window | Name the hardware, sequence, preprocessing, centre, and acquisition window under which each added row is repeatable or portable. | Do not treat one named proxy route as field-ready or cross-centre stable just because it worked once in one harmonized setup. |
| Abstention and stopping claim | State what remains latent after the gain, such as controller identity, cell specificity, or out-of-regime failure. | Do not turn bundle improvement into a minimum-biological-configuration claim or a U3 closure claim by default. |
On this page, a support-state addition now stays at family-split augmentation evidence unless the dataset card logs the fields above and then compares neuron-first baseline, strongest single added row, and full bundle under the same split, availability slice, and abstention rule. If the bundle mixes living-human proxy classes, use the Verification: Human Proxy Composition Card alongside the Verification: Fusion Card instead of treating co-acquisition as sufficient.
1) Shared infrastructure to establish first
OpenNeuro (BIDS-based sharing)
A platform for sharing BIDS-compliant neuroimaging and electrophysiology datasets, including EEG, MEG, and fMRI.
Open OpenNeuro
PhysioNet (biosignals and benchmark culture)
A public platform for biosignal datasets and related resources, including many standard EEG corpora.
Open PhysioNet
Human Connectome Project (large-scale human imaging)
A representative public resource for large-scale human brain imaging data and analysis tools.
Open HCP
OpenNeuro and PhysioNet are entry points, but they do not guarantee reproducibility by themselves. First fix the snapshot / version, then align it with BIDS / EEG-BIDS, fix the reading and conversion path with tools such as MNE-BIDS, and finally define the comparison setting with a benchmark harness such as MOABB for within-session / cross-session / cross-subject evaluation. If you mix up repository, loader, and benchmark settings, the same dataset name will still yield incomparable results.
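Fixing the snapshot / version can be as simple as hashing a file manifest before any analysis. A stdlib-only sketch with hypothetical file paths and contents (a real workflow would use the repository's own version tags first, then pin bytes like this):

```python
# Sketch: pin a dataset snapshot by hashing a file manifest before any
# analysis, so "the same dataset name" provably means the same bytes.
# Paths and contents here are hypothetical stand-ins for downloaded files.

import hashlib

def manifest_digest(files: dict) -> str:
    """files maps relative path -> raw bytes; returns an order-independent digest."""
    h = hashlib.sha256()
    for path in sorted(files):
        h.update(path.encode())
        h.update(hashlib.sha256(files[path]).digest())
    return h.hexdigest()

snapshot = {
    "sub-01/eeg/sub-01_task-mi_eeg.edf": b"...raw bytes...",
    "sub-01/eeg/sub-01_task-mi_events.tsv": b"onset\tduration\ttrial_type\n",
}

pinned = manifest_digest(snapshot)

# Any silent change to the files changes the digest, so a later rerun
# can prove it used the same snapshot or refuse to claim comparability.
snapshot["sub-01/eeg/sub-01_task-mi_events.tsv"] += b"0.5\t0.1\tleft\n"
assert manifest_digest(snapshot) != pinned
```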
2) EEG starter pack (start with this from L0 to L1)
The following are representative introductory EEG datasets, chosen for ease of use and extensive prior reference. The selection focuses on practicing the preprocessing pipeline and reaching L0 to L1, narrowed to a range where you can immediately start comparing reproduced baselines.
| Dataset | What you can do (example) | Link |
|---|---|---|
| EEG Motor Movement/Imagery | Motor execution / motor imagery classification, preprocessing practice, baseline comparison | PhysioNet |
| CHB-MIT Scalp EEG | Epileptic seizure detection, event detection, long-term EEG handling | PhysioNet |
| Sleep-EDF | Estimating sleep stages, modeling state transitions, handling longitudinal fluctuations | PhysioNet |
| TUH EEG Corpus (large scale) | Scaling EEG classification, distribution shift closer to real operation, data-leakage countermeasures | TUH EEG |
| Dataset | Good first release | Why this is a good first release |
|---|---|---|
| EEG Motor Movement/Imagery | Baseline accuracy and preprocessing log for two-class classification | The task setup is simple, so it is easy to build a minimal loop from preprocessing to evaluation. |
| CHB-MIT | Reproduction baseline and exclusion reason log for seizure event detection | It is a good way to learn the practical difficulties of long recordings and event detection, including failure cases. |
| Sleep-EDF | Basic baseline for sleep stage classification and confusion matrix of state transitions | It shows not only accuracy but also how state transitions fail, which makes errors easier to interpret. |
| TUH EEG Corpus | Reproduction experiment on a small subset, with clarified data split rules | It is more important to lock down leakage prevention and split rules first than to process the full corpus from the start. |
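A useful habit for any of these first releases is to log the trivial baseline before the model score. A minimal sketch with hypothetical, deliberately imbalanced labels in the CHB-MIT style:

```python
# Sketch: before reporting any model score, log the trivial baseline the
# tables above ask you to compare against. Labels here are hypothetical.

from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for y in test_labels if y == majority)
    return majority, hits / len(test_labels)

# Imbalanced seizure-style labels: mostly background, few events.
train = ["bg"] * 90 + ["seizure"] * 10
test = ["bg"] * 45 + ["seizure"] * 5

label, acc = majority_baseline(train, test)
assert label == "bg"
assert acc == 0.9  # 90% "accuracy" while detecting zero seizures
```

This is why an event-detection card should report per-class metrics against the majority baseline, not a single accuracy number.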
2.5) The same score means different things across generalization settings
This is one of the current weak points of the site. `within-session`, `cross-session`, `cross-subject`, and `adaptation` may all report "classification accuracy," but they answer different questions about generalization. The official MOABB documentation also implements them as separate evaluation classes, and in the 5-day MI dataset of Ma et al. (2022), the average subject-specific accuracy dropped from within-session 68.8% to cross-session 53.7%, then recovered to cross-session adaptation 78.9% when a small amount of target-session data was used. Therefore, this site will no longer list scores alone; it will also state what was held out, what was allowed to vary, and what remains unresolved.
| Evaluation family | What is held out | What this supports | What not to overread |
|---|---|---|---|
| within-session | Folds within the same subject and the same session. | It can show whether classes separate under the same-day, same-setup condition and whether preprocessing plus baseline modeling work at all. | Do not treat this as evidence of cross-day robustness or deployable decoding. |
| cross-session | A different session or day from the same subject. | It can show how long subject-specific features persist across days and how sensitive they are to state changes and re-setup effects. | Do not read this as subject-independent generalization or zero-recalibration operation. |
| cross-subject | One or more entire subjects. | It can show whether population-level shared structure exists and how far a cold-start decoder might go at initial installation. | Do not equate this score with a decoder optimized for a specific individual. |
| cross-session adaptation | Another session is held out, then a small amount of target-session data is used for recalibration. | It can show how much performance is recoverable through recalibration and how much room there is for operational adaptation. | Do not describe this as a stable decoder that worked from the beginning without adaptation. |
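The four rows above can be written down as concrete train / test / calibration sets. A plain-Python sketch over a hypothetical two-subject, two-session trial pool:

```python
# Sketch: the four evaluation families as concrete train / test / calibration
# sets for one hypothetical 2-subject, 2-session recording pool.

trials = [
    {"subject": s, "session": ses, "trial": t}
    for s in ("S01", "S02")
    for ses in ("day1", "day2")
    for t in range(10)
]

def pick(subject=None, session=None):
    return [x for x in trials
            if (subject is None or x["subject"] == subject)
            and (session is None or x["session"] == session)]

# within-session: folds inside S01/day1 (split by trial index)
ws_train = [x for x in pick("S01", "day1") if x["trial"] < 8]
ws_test = [x for x in pick("S01", "day1") if x["trial"] >= 8]

# cross-session: train on S01/day1, test on S01/day2
cs_train, cs_test = pick("S01", "day1"), pick("S01", "day2")

# cross-subject: train on all of S01, test on all of S02
xs_train, xs_test = pick("S01"), pick("S02")

# cross-session adaptation: cross-session plus a small, disclosed
# calibration slice taken from the target session (removed from test)
calib = pick("S01", "day2")[:2]
ad_train = cs_train + calib
ad_test = [x for x in cs_test if x not in calib]

assert len(calib) == 2 and len(ad_test) == 8  # report this budget with the score
```

The adaptation row is the only one that consumes target-session data, which is exactly why its score must never be reported without the size and timing of `calib`.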
Musall et al. (2019) showed that neural activity during tasks can be strongly dominated by uninstructed movements, and Egger et al. (2024) showed over a 10-hour EEG day that movement-related decoding changes enough to motivate adaptive decoders. But fast labels are still not the whole state story. de Quervain et al. (1998) and Oei et al. (2007) showed glucocorticoid-linked retrieval impairment and reduced human hippocampal / prefrontal retrieval activity, Barone et al. (2023) plus Birnie et al. (2023) showed circadian and corticosteroid-rhythm control of hippocampal plasticity, and Sherman et al. (2015) showed that memory-linked hippocampal activity varies with circadian-rhythm consistency. Finally, Wilson et al. (2025) showed that long-term BCI operation still requires recurrent recalibration. In other words, even for the same subject, short-term resolution, cross-day tolerance, and long-term operation are different barriers, and state annotation has to split fast labels from slow internal-milieu disclosure before temporal validity is read strongly.
The remaining practical weakness on this page was subtler than simple split naming. It already separated cross-session, adaptation, and long-term use, but it still left state annotation too close to one free-text note about movement, arousal, or session ID. The current primary literature does not support that shortcut. Egger et al. (2024) showed over a 10-hour EEG day that movement-related decoding changes enough to motivate adaptive decoders, while de Quervain et al. (1998), Oei et al. (2007), Barone et al. (2023), Birnie et al. (2023), and Sherman et al. (2015) show that the same visible task can still run under different glucocorticoid, circadian, and broader slow internal-milieu regimes. At the same time, Wilson et al. (2025) and Wairagkar et al. (2025) show that recurrent recalibration burden and fast same-day throughput are different again. Therefore, on this site, a temporal claim now has to disclose fast labels such as movement / arousal / task mode separately from any relevant slow internal-milieu disclosure such as time-of-day / circadian phase, recent sleep-wake schedule, glucocorticoid or steroid exposure, and feeding / fasting or glucose-insulin regime, before fixed decoder interval, recalibration burden, or transfer ceiling are interpreted.
| If the result is reported as... | You still have to disclose | Stopped claim if missing |
|---|---|---|
| same-day online / streaming use | Fast labels such as movement / arousal / task mode, output-path / abstention or fallback policy, and whether the result stayed inside one same-day operating regime. | Do not promote to cross-day stability, fixed-decoder durability, or a generic temporal-validity benchmark. |
| cross-session | Fast state labels, any relevant slow internal-milieu disclosure such as time-of-day / circadian phase, recent sleep-wake schedule, glucocorticoid or steroid exposure, feeding / fasting or glucose-insulin regime, fixed decoder interval, and whether the setup was reattached, re-referenced, or otherwise changed. | Read only as cross-day tolerance under named conditions, not as durable decoding. |
| cross-session adaptation | The same temporal fields as above, plus how much target-session data was used, when recalibration happened, and what the pre-adaptation score was. | Do not promote to fixed-decoder stability or low-burden deployment. |
| longitudinal / chronic use | Fast labels plus slow internal-milieu disclosure, fixed decoder interval, recalibration burden, failure / fallback mode, and participant / site / task transfer ceiling. | Do not promote to generic long-term robustness or deployability. |
Even when a within-session score is high, it can still be explained by eye-movement confounds shown by Mostert et al. (2018), the EMG route shown by McFarland et al. (2005), post-onset auditory feedback shown by Chen et al. (2024), identity confounding shown by Chaibub Neto et al. (2019), time-robust resting-state fingerprints shown by Wang et al. (2020) and Di et al. (2021), or subject-driven EEG variation summarized by Gibson et al. (2022). For that reason, this site now overlays the Verification: Specificity & Shortcut Card on dataset cards and baseline results, fixing plausible nuisance routes, auxiliary channels such as EOG / EMG / behavior / audio / metadata, nuisance-only baselines, fingerprint audit, nuisance-regime hold-outs, and the claim that must stop here.
This site previously stopped more clearly at subject / session fingerprint than at setup effects. That remained too weak for dataset cards. The official EEG-BIDS specification already separates electrodes, channels, coordinate system, and reference scheme. Hu et al. (2018) showed that the measured scalp potential itself changes with reference montage and electrode setup, Melnik et al. (2017) showed that EEG recordings vary not only by subject and session but also by recording system, Xu et al. (2020) showed that cross-dataset EEG decoding is degraded by environmental variability such as amplifier, cap, sampling rate, and filtering, Ceballos-Villegas et al. (2022) explicitly modeled multinational batch effects across studies and devices, and Dong et al. (2024) showed that cross-location comparison required an explicit REST-based offline transform rather than a generic claim that the datasets had already been harmonized. Therefore, this site now treats site / device / reference system / electrode layout / coordinate route / protocol distribution as a recording-frame contract rather than as harmless metadata. Inference from these sources: common-channel intersection, interpolation to a target montage, and REST-based transformation preserve different benchmark objects, so a dataset card now has to name the harmonization branch rather than only say that setup differences were handled.
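Naming the harmonization branch can also be done in code rather than prose. A sketch of the common-channel-intersection branch over two hypothetical montages (interpolation to a target montage or a REST-based transform would be different branches preserving different benchmark objects):

```python
# Sketch: "setup differences were handled" is not a branch name. This makes
# the common-channel-intersection branch explicit for two hypothetical
# recording frames; reference family still has to be named separately.

montage_a = ["Fp1", "Fp2", "C3", "C4", "O1", "O2", "Cz"]
montage_b = ["C3", "C4", "Cz", "Pz", "O1", "O2"]

def common_channel_branch(*montages):
    """Keep only channels present in every setup, in a fixed order."""
    shared = set(montages[0]).intersection(*montages[1:])
    return sorted(shared)

branch = {
    "name": "common-channel-intersection",
    "channels": common_channel_branch(montage_a, montage_b),
    "reference_family": "unchanged per dataset",  # must be disclosed separately
}

assert branch["channels"] == ["C3", "C4", "Cz", "O1", "O2"]
```

A dataset card would log `branch` verbatim, so a later reader knows that Fp1/Fp2 and Pz were dropped rather than interpolated or re-referenced.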
The next practical weakness on this page was that split / leak / harmonization were visible, while benchmark governance could still be treated as administrative detail. The current primary and official sources do not support that shortcut. The official EEG Challenge (2025) homepage states that the original challenge preprint became outdated during execution and that the website plus starter kit should be treated as current. The official rules require disclosure of additional pretraining datasets, pretrained models / fine-tuning method, code submission at inference stage, and a single-GPU 20 GB inference budget. The official leaderboard then disclosed that Challenge 2 samples had not been randomized, allowing contiguous-trial same-subject structure to affect what the ranking meant and forcing separate awards. That warning is aligned with benchmark-side primary sources: Xiong et al. (2025) argued that inconsistent evaluation protocols make cross-model EEG-FM comparisons unreliable, and Liu et al. (2026) showed across 12 open-source foundation models and 13 datasets that ranking depends materially on transfer regime and benchmarking choices. Therefore, when this site reads a leaderboard, challenge result, or foundation-model benchmark, the card must also name benchmark version, split / randomization rule, hidden grouping structure, extra-data / pretrained-checkpoint policy, adaptation regime, inference-stage restrictions, and later organizer postmortems. If those fields are missing, we treat the result only as a qualified benchmark snapshot, not as a stable measure of portable EEG generalization.
The next practical weakness was narrower. Even after benchmark governance became visible, a reader could still talk as if the benchmark name fixed the predicted object, independent prediction unit, grouped hold-out unit, adaptation regime, and operations budget. The current primary and official sources do not support that shortcut. The official EEG Challenge (2025) homepage separates Challenge 1 response-time regression from Challenge 2 subject-level externalizing prediction, the official rules and submission page add an inference-only code-submission workflow under a single-GPU 20 GB budget, Ma et al. (2022) use one five-session motor-imagery dataset to separate within-session, cross-session, and cross-session adaptation, Liu et al. (2026) separate leave-one-subject-out cross-subject evaluation from within-subject few-shot calibration, and Lahiri et al. (2026) show that six benchmark inconsistencies can reverse rankings on identical datasets by up to 24 percentage points. Therefore, on this site, a dataset or leaderboard name is still too coarse until the object / unit / budget matrix is disclosed explicitly.
The next weak point on this page was different from benchmark governance. A dataset or benchmark card could already expose site / device / reference / layout diversity and still leave a reader with the impression that a setup-agnostic foundation model had already solved physiology-preserving transfer. The current primary literature does not support that shortcut. Han et al. (2025) target channel-permutation equivariance, Chen et al. (2025) target coordinate-based adaptation across heterogeneous devices and more than 150 layouts, and El Ouahidi et al. (2025) push setup-agnostic pretraining to more than 60,000 hours from 92 datasets and 25,000 subjects. Those papers advance recording-frame compatibility. They still do not prove that different montages, coordinate routes, and reference families already preserve one shared physiology-side representation. Ma et al. (2026) then show that strong EEG foundation models can still generalize poorly when subject-level supervision is limited unless extra adaptation structure is added. Therefore, this page now treats setup diversity, coordinate route, reference family, omitted-channel policy, and label-limited adaptation burden as separate dataset / benchmark-card fields rather than one merged claim of portable generalization.
| Case | What the named benchmark or dataset actually predicts | What still has to be frozen separately | Safe ceiling on this site |
|---|---|---|---|
| EEG Challenge 1 official homepage + rules |
Trial-level response-time regression from the CCD task. | The trial is the scoring unit, but grouped subject structure and the inference-only single-GPU 20 GB budget still have to be disclosed separately. | A named transfer benchmark under a fixed operations budget, not a general EEG decoder verdict. |
| EEG Challenge 2 official homepage + leaderboard | Subject-level externalizing-factor prediction from EEG across multiple paradigms. | The subject is the natural independent unit, and the organizer postmortem shows that hidden contiguous-trial grouping can still change what the benchmark measured. | A subject-invariance benchmark attempt whose meaning remains contingent on grouping policy, not proof that subject invariance is solved. |
| Ma et al. (2022) five-session motor-imagery dataset | The same raw dataset supports within-session, cross-session, and cross-session-adaptation evaluation families. | The dataset name alone does not tell you whether target-session data were used, when recalibration happened, or what the pre-adaptation score was. | A useful session-shift practice board, not automatic fixed-decoder durability. |
| Liu et al. (2026) foundation-model benchmark matrix | Cross-model comparison across 13 EEG datasets and nine paradigms under multiple transfer settings. | The paper explicitly separates leave-one-subject-out transfer from within-subject few-shot calibration, so hold-out unit and adaptation regime still have to be named separately. | A transfer-regime comparison board, not one portable score of EEG generalization. |
From this section onward, dataset cards and baseline results must report at least (1) evaluation family, (2) benchmark object plus independent prediction unit, (3) the independent hold-out unit, (4) raw-recording / window ancestry, (5) subject / session / site / device / reference-system / electrode-layout disjointness together with metadata-only baselines, (6) the channel-map / coordinate-route / reference-family / omitted-channel / sample-rate / filter harmonization log, including whether comparison used common-channel intersection, interpolation to a target montage, REST / another explicit transform, or no cross-setup harmonization, (7) whether target-session, target-subject, or target-site data were used, (8) recalibration amount and timing or extra label budget, (9) for leaderboard or challenge claims, benchmark provenance including version, split / randomization rule, hidden grouping, extra-data / checkpoint policy, inference-stage restrictions or operations budget, and postmortem disclosures, and (10) a stopping claim. If the claim spans more than one session or day, it must additionally disclose the site's Temporal Validity fields: state annotation split into fast labels and slow internal-milieu disclosure, fixed decoder interval, recalibration burden, and transfer ceiling. Scores without this context will be treated as limited L1 decode results, fingerprint-unresolved / acquisition-distribution-unresolved classifiers, or benchmark-object-unresolved / benchmark-governance-unresolved leaderboards rather than evidence of long-term stability or deployability.
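The ten disclosure fields above can also be enforced mechanically before a score is accepted. A minimal sketch follows; the field names are this page's own conventions for the ten items, not a community standard.

```python
# Hedged sketch: a machine-checkable dataset/benchmark card with the ten
# disclosure fields required above. Field names are illustrative conventions.
REQUIRED_FIELDS = [
    "evaluation_family",                     # (1)
    "benchmark_object_and_prediction_unit",  # (2)
    "holdout_unit",                          # (3)
    "recording_window_ancestry",             # (4)
    "disjointness_and_metadata_baselines",   # (5)
    "harmonization_log",                     # (6)
    "target_data_usage",                     # (7)
    "recalibration_budget",                  # (8)
    "benchmark_provenance",                  # (9)
    "stopping_claim",                        # (10)
]

def missing_fields(card):
    """Return the disclosure fields a dataset/benchmark card still lacks."""
    return [f for f in REQUIRED_FIELDS if not card.get(f)]

card = {"evaluation_family": "cross-subject",
        "stopping_claim": "limited L1 decode result only"}
print(missing_fields(card))  # eight of the ten fields are still undisclosed
```

A card that returns a non-empty list here should be treated exactly as the paragraph above says: a limited L1 result, not evidence of stability or deployability.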
The next practical weakness on this page was that split / leak / harmonization and benchmark governance were visible, while metric semantics could still hide behind one headline number. The primary literature does not support that shortcut. Saito & Rehmsmeier (2015) showed why precision-recall views can be more informative than ROC summaries under strong class imbalance. In seizure tasks, Roy et al. (2021) and Scheuer et al. (2021) show that practical evaluation still turns on sensitivity, false alarms per hour or day, event-overlap logic, and latency rather than plain accuracy, while Segal et al. (2023) show that false-alarm control is itself a design target in seizure prediction rather than an afterthought. In sleep staging, Sun et al. (2017) used Cohen's kappa and showed that imbalance in stage proportions changes performance estimates, while Vallat & Walker (2021) show that pooled performance can still hide especially weak N1-stage agreement. Therefore, on this site, a dataset or benchmark card must now also disclose a task-matched metric bundle, not only a split and a score.
| Task family | Minimum metric bundle on this site | Overread to block |
|---|---|---|
| Cue-locked classification / decoding | Balanced accuracy or macro-F1, confusion matrix, subject-wise aggregation, and calibration / abstention if probabilities are output. | Do not let one accuracy number hide minority-class collapse or confidence miscalibration. |
| Seizure detection / forecasting | Event sensitivity or recall, false alarms per hour or per day, event-overlap rule, detection / warning latency when relevant, and calibration if thresholds or alarms are used. | Do not let accuracy, AUROC, or one threshold-free summary stand in for clinically usable alarm behavior. |
| Sleep staging | Cohen's kappa or macro-F1, per-stage recall / F1, and a confusion matrix that keeps minority stages visible. | Do not let pooled accuracy hide weak N1 or transition-stage performance. |
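The bundles in the table above need no special library. The stdlib sketch below shows why one pooled accuracy misleads on imbalanced staging-style data; the toy epoch counts are illustrative, not from a real scorer.

```python
from collections import Counter

def per_class_recall(y_true, y_pred, labels):
    """Recall for each class, keeping minority classes visible."""
    hits = Counter((t, p) for t, p in zip(y_true, y_pred))
    totals = Counter(y_true)
    return {c: hits[(c, c)] / totals[c] if totals[c] else 0.0 for c in labels}

def balanced_accuracy(y_true, y_pred, labels):
    """Mean of per-class recalls: immune to class-frequency padding."""
    r = per_class_recall(y_true, y_pred, labels)
    return sum(r.values()) / len(labels)

def cohen_kappa(y_true, y_pred):
    """Observed agreement corrected for chance agreement."""
    n = len(y_true)
    p_obs = sum(t == p for t, p in zip(y_true, y_pred)) / n
    ct, cp = Counter(y_true), Counter(y_pred)
    p_exp = sum(ct[c] * cp[c] for c in ct) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# Toy imbalance: 8 N2 epochs, 2 N1 epochs; a lazy scorer that always says N2.
y_true = ["N2"] * 8 + ["N1"] * 2
y_pred = ["N2"] * 10
print(per_class_recall(y_true, y_pred, ["N1", "N2"]))   # N1 collapses to 0.0
print(balanced_accuracy(y_true, y_pred, ["N1", "N2"]))  # 0.5, not the 0.8 plain accuracy
print(cohen_kappa(y_true, y_pred))                      # 0.0: nothing beyond chance
```

Plain accuracy for this toy scorer is 0.8, which is exactly the overread the table tells you to block.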
3) Audit to avoid overestimating starter data
The four starter datasets below are very useful as an L0-L1 practice base, but they are not ground truth for directly verifying strong claims about EEG source imaging and WBE. What is needed here is not a usable/unusable dichotomy, but a fixed statement of which claims each dataset can support.
The stopping claims and minimum operational rules in the table below are operational boundaries drawn from what is directly observed and annotated in the official dataset descriptions and primary literature. In other words, they are not claims explicitly made by the dataset providers; they are site rules derived from annotation provenance and time fidelity.
| Dataset | Things that are easy to verify now | Still difficult to verify | Minimum precautions |
|---|---|---|---|
| EEG Motor Movement/Imagery | As a cue-locked task with 64 channels, 160 Hz sampling, and 109 subjects, it is well suited to practicing preprocessing, subject-level splits, and simple baseline comparison. | Without individual MRI, electrode coordinates, and invasive ground truth, claims of improved ESI accuracy or deep-source reconstruction cannot be audited. | Because the task presents left/right/up/down cues on screen, check for gaze, myoelectric, and cue-locked artifact contributions, and fix the split per subject. |
| CHB-MIT | Suitable for learning long-term EEG, seizure-event detection, and logging of missingness and exclusion reasons. | Because it depends heavily on a specific clinical population (children with intractable epilepsy, often during drug withdrawal), it cannot serve as a general-purpose benchmark for recognition or source imaging. | Split by case, keep the inter-record gap and montage summaries, and clarify the seizure / non-seizure imbalance first. |
| Sleep-EDF | Suitable for learning how to handle state transitions, sleep-stage classification, and longitudinal fluctuation with whole-night PSG. | The primary EEG is only two derivations (Fpz-Cz / Pz-Oz) at 100 Hz, so it is not a benchmark for spatial resolution or source imaging. | The labels are manual scores under the Rechtschaffen & Kales standard, so state the label mapping explicitly when comparing with newer sleep-stage studies. |
| TUH EEG Corpus | Suitable for learning real-world distribution difficulties: large scale, clinical noise, repeated sessions, and physician reports. | Large variation in channel counts and clinical conditions means it is not a controlled biophysical benchmark, so it is unsuitable for directly validating source-imaging improvements. | First fix patient/session-level splits, a fixed channel subset, montage normalization, and report-text leakage prevention when reports are used. |
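For CHB-MIT-style seizure work, the event-level metrics this site requires later (event sensitivity and false alarms per hour, with a named overlap rule) can be sketched in a few lines. The any-overlap rule and the toy intervals below are illustrative assumptions, not a clinical scoring standard.

```python
def overlaps(a, b):
    """Any-overlap rule between two (start_s, end_s) intervals. The overlap
    rule itself is part of the metric and must be disclosed with the score."""
    return a[0] < b[1] and b[0] < a[1]

def event_scores(true_events, pred_events, record_hours):
    """Event sensitivity + false alarms per hour under the any-overlap rule."""
    hits = sum(any(overlaps(t, p) for p in pred_events) for t in true_events)
    false_alarms = sum(not any(overlaps(p, t) for t in true_events)
                       for p in pred_events)
    sensitivity = hits / len(true_events) if true_events else float("nan")
    return sensitivity, false_alarms / record_hours

# Toy seizure intervals in seconds over a 24 h record (illustrative numbers).
truth = [(100, 160), (4000, 4075)]
preds = [(110, 150), (900, 930), (5000, 5010)]
sens, fa_h = event_scores(truth, preds, record_hours=24.0)
print(sens, round(fa_h, 3))  # 0.5 sensitivity, 0.083 false alarms per hour
```

Note that a sample-wise accuracy over the same record would look excellent, because almost every second is non-seizure; that is exactly why the per-hour false-alarm rate is reported separately.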
| Dataset | Label/Event origin | Time fidelity | Claim to stop here | Minimum operational rules |
|---|---|---|---|---|
| EEG Motor Movement/Imagery | .event files and the annotation-channel codes T0/T1/T2 mark cue-locked onsets of real or imagined movement. | Cue-onset level for a 160 Hz recording. | Do not promote open-ended thought decoding or subject-independent semantic readout. | Split by subject + run and audit visual-cue and myoelectric/ocular contributions separately. |
| CHB-MIT | The per-case summary and .seizure annotations mark seizure intervals during long-term recording; note that chb21 is the same subject as chb01. | Expert interval annotation; gaps between files also remain. | Do not treat this as gap-free continuous monitoring or count cases as if they were independent subjects. | Split by subject and case chronology rather than by file, and keep gap plus montage summaries in the runbook. |
| Sleep-EDF | Comes with an R&K hypnogram scored by a well-trained technician and a 1 Hz event marker. | The whole-night stage annotation is coarse, and although the EEG is 100 Hz, the marker channel is 1 Hz. | Do not claim sub-second event onsets or AASM-equivalent labels as self-evident. | If you split by subject-night and map from R&K to AASM, specify the mapping rule. |
| TUH EEG / TUSZ | TUH has a patient/session hierarchy and clinician report .txt files, while TUSZ was selected via report keyword search and automatic triage. | Clinical labels at session/file level and expert seizure annotation on a subset. | Do not write report-assisted labels as if they were pure EEG-only benchmark accuracy. | Require patient/session-level splits and a report-usage flag, and do not feed report text into signal-only evaluation. |
The official EEG Motor Movement/Imagery dataset description itself fixes the ceiling: 109 volunteers, 64 channels, 14 cue-driven runs, 160 Hz, and T0/T1/T2 onset codes copied into both the annotation channel and .event files. That is enough to audit cue-locked decoding, preprocessing, and subject-split hygiene, but it is also why this site requires a separate audit of visual-cue, overt-movement, and myoelectric / ocular contributions before any stronger readout wording is allowed.
The official CHB-MIT description fixes a different ceiling: 22 pediatric subjects organized into 23 cases, case chb21 being the same subject as chb01, gaps between consecutively numbered EDF files, and seizure boundaries carried by .seizure files together with case summaries. Therefore file-level randomization overstates independence unless subject identity and case chronology remain explicit.
The official Sleep-EDF description likewise constrains the interpretation: the PSG uses only Fpz-Cz / Pz-Oz EEG together with EOG and chin EMG, while the event marker and some auxiliary channels are sampled at 1 Hz, and the hypnograms are manual Rechtschaffen & Kales scores. Rosenberg & Van Hout (2013) then showed that even modern sleep-stage scoring reaches only about 82.6% overall inter-scorer agreement, with weaker agreement for N1 and N3. That is why this site stops a Sleep-EDF result at staged-state practice unless label mapping, scoring regime, and time granularity are disclosed explicitly.
For TUH / TUSZ, Obeid & Picone (2016) explain that the clinical corpus pairs EDF recordings with clinician reports, while Shah et al. (2018) describe seizure-rich triage using report keyword search and automatic detectors. Later corpus-maintenance notes documented that an early Neureka 2020 release had non-exclusive subjects across train / dev / blind evaluation and high-frequency seizure annotation problems. Therefore report-usage flags, patient/session ancestry, and benchmark postmortems are treated on this site as part of the result rather than footnotes.
When introducing starter data, always include (1) label provenance, (2) time granularity, (3) clock domain plus stream-alignment rule, (4) timing-validation class, (5) event semantics, (6) independent split unit, (7) acquisition-distribution summary plus harmonization policy, and (8) stopping claim. A dataset card that does not include this will be considered insufficient as a practical guide for L0.
BIDS/EEG-BIDS is important, but by itself it cannot prove the validity of source imaging or the comparability of cross-dataset decoding. The BIDS specification does require EEGReference, SamplingFrequency, and SoftwareFilters, and if *_electrodes.tsv is issued, *_coordsystem.json is required as well. However, these are conditions that let a third party trace provenance, not conditions that reveal the true sources or automatically harmonize reference mismatch, electrode-layout mismatch, and device / filter differences across cohorts.
Provide at least the following four items.
- Individual anatomy: individual MRI/CT, or EEG-BIDS recordings with digitized electrode positions and *_electrodes.tsv / *_coordsystem.json
- Forward-model audit: head model and skull-conductivity sensitivity analysis
- External standards: ground truth such as phantoms, simultaneous invasive recording, intracranial stimulation, or TMS-EEG
- Uncertainty: report localization errors and interval estimates, not only point estimates
4) If you want to dig deeper into source imaging, divide the data into three stages
A weak point of this page was that it stopped at "starter data are not a direct benchmark for source imaging" and then gave little help in deciding what to choose next. Here, the data are divided into three stages by the strength of claim they can support.
| Stage | Representative data | Supported claims | What cannot yet be claimed |
|---|---|---|---|
| A: Practice tier | EEG Motor Movement/Imagery, CHB-MIT, Sleep-EDF, TUH EEG | L0-L1 reproducibility analysis, QC, split design, baseline comparison | ESI localization-error improvement, deep-source claims, strong WBE-oriented reconstruction claims |
| B: Reconstruction with anatomical constraints | Individual MRI, digitized electrodes, and EEG-BIDS records including *_electrodes.tsv / *_coordsystem.json | Forward-model audit, comparison of reconstructions near the cortical surface, sensitivity analysis of electrode placement and conductivity assumptions | Deep-source accuracy guarantees without direct ground truth, generalized unique-recovery claims |
| C: Direct validation | Localize-MI (Mikulan et al., 2020), scalp EEG with intracranial stimulation, simultaneous HD-EEG/SEEG, presurgical cohort with postoperative outcome | Named validation-class audit: localization error against known stimulation sites, concordance with simultaneous invasive recording, or clinical concordance against postoperative outcome | Universal performance guarantee beyond task/cohort/montage |
On this site, you should still write which C-stage validation class you used:
- stimulation ground truth: asks localization error against a known stimulation site and time (Mikulan et al., 2020; Unnwongse et al., 2023).
- simultaneous invasive recording: asks concordance with concurrent SEEG/ECoG under the same event regime (Hao et al., 2025).
- postsurgical outcome / clinical concordance: asks whether the source estimate points toward clinically relevant tissue, not whether the source was uniquely observed (Birot et al., 2014).
Localize-MI by Mikulan et al. (2020) is a rare data resource pairing intracerebral stimulation with 256-channel scalp EEG and stereo-EEG, letting source imaging be audited directly against known stimulation locations. Hao et al. (2025) reported average localization errors of 14.07 mm for ictal ESI and 17.38 mm for interictal ESI in 29 simultaneous HD-EEG/SEEG cases, indicating that source power and source depth strongly affect accuracy. Jahromi et al. (2026) then added a 3D-printed pediatric deep-source phantom, showing that even phantom validation is not one universal board once the deep epileptic source class and geometry change. Therefore, if you want to improve source imaging, you need a C-stage benchmark with a named validation class, not an A-stage starter dataset alone.
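Once estimated and reference coordinates are in a shared space, the stimulation-ground-truth class reduces to simple distance bookkeeping plus depth stratification. A hedged sketch with hypothetical coordinates (the mm values are made up for illustration, not taken from Localize-MI or Hao et al.):

```python
import math
from collections import defaultdict

def localization_error_mm(est_xyz, true_xyz):
    """Euclidean distance (mm) between an estimated source and a known
    reference such as a stimulation contact, in one shared coordinate space."""
    return math.dist(est_xyz, true_xyz)

def depth_stratified_errors(cases):
    """cases: (depth_stratum, est_xyz, true_xyz) triples. Pool per stratum
    instead of one mean, since deep and superficial sources behave differently."""
    strata = defaultdict(list)
    for depth, est, true in cases:
        strata[depth].append(localization_error_mm(est, true))
    return {d: sum(v) / len(v) for d, v in strata.items()}

# Hypothetical coordinates in mm, only to show the bookkeeping.
cases = [
    ("superficial", (10.0, 0.0, 0.0), (10.0, 0.0, 12.0)),
    ("deep",        (0.0, 0.0, 0.0),  (0.0, 20.0, 0.0)),
]
print(depth_stratified_errors(cases))  # {'superficial': 12.0, 'deep': 20.0}
```

Keeping the strata separate is what lets a later table say "conditional detectability only" instead of hiding deep-source weakness inside one pooled mean.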
One remaining simplification on this page was to make the entrance sound too close to "HD-EEG is the serious route and low-density EEG is the weak route." The current primary literature is narrower than that. Horrillo-Maysonnial et al. (2023) showed that a targeted 33-36-electrode montage reached 54/58 sublobar concordance (93%) against an 83-electrode HD montage, but still showed larger peak-vertex distance for tangential generators. Rong et al. (2025) then showed that a DeepSIF-based approach stayed comparatively stable from 75 to 16 electrodes, with average spatial dispersions of 7.9/9.0 mm versus 21.9/28.1 mm for sLORETA and 20.0/28.9 mm for LCMV. But these papers do not erase the direct-validation ceiling. Unnwongse et al. (2023) showed that coverage geometry and conductivity assumptions still move localization error in direct validation, and Hao et al. (2025) showed that ictal and interictal ESI differed (14.07 ± 4.62 mm versus 17.38 ± 4.16 mm) and that source depth and spike power still matter. Therefore, this site no longer treats high density versus low density as the right first split. The safer split is named montage / coverage policy + inverse family + source regime + validation class.
In a systematic review by Mouthaan et al. (2019), the summary sensitivity of electric source imaging in presurgical epilepsy was 82% and specificity was 53%. In other words, although postoperative outcome and SOZ concordance are useful external criteria, source imaging itself cannot be fixed as ground truth. Even at stage C, what you can say is "how far has the error been reduced on this benchmark?", not "I uniquely read out the source in the brain."
The first question when selecting data is not "what is interesting?" but "what level of claim do I want to support this time?" Stage A is sufficient for practicing L0-L1. If you want to claim improvements in source imaging, put the claim on hold unless you audit the head model at stage B and obtain direct validation at stage C. If you do proceed to solver comparison, the next section fixes what has to stay the same across methods before any leaderboard is accepted.
4.5) Inverse-problem benchmark board: compare error questions, not solver names
The weakness of this page after the 2026-03-18 validation-class update was that it could still let a reader jump from "we used C-stage data" to "solver X won." That is too weak. Michel & Brunet (2019) describe ESI as a pipeline rather than a single algorithm, but the current literature is stricter still. Luria et al. (2024) expose a probabilistic focal-support family, Tong et al. (2025) expose a sparse debiased-inference family, and Feng et al. (2025) target extended-source reconstruction. Those papers do not return the same target object or the same uncertainty object. Pascarella et al. (2023) then showed on an in-vivo focal-source benchmark that ten methods differ not only in best localization error but also in sensitivity to regularization and montage density, while Unnwongse et al. (2023), Hao et al. (2025), and Jahromi et al. (2026) show that direct-validation boards themselves differ by stimulation class, simultaneous invasive reference, and deep-source phantom geometry. Vorwerk et al. (2024) and Vorwerk et al. (2026) further show that forward-model uncertainty is a separate audit rather than a property inherited from a nicer inverse map. Therefore, this site now treats inverse-problem comparison as a board with five fixed axes: validation class, source regime / target object, inverse family / uncertainty object, same-geometry controls, and sensitivity sweep.
This page now has to stop another shortcut explicitly. A posterior-support map, a debiased sparse interval, and an extended-source overlap estimate are not three visualizations of one identical hidden object. Luria et al. (2024) return posterior support for focal alternatives, Tong et al. (2025) return debiased estimation / inference for sparse spatial-temporal sources, and Feng et al. (2025) return uncertainty-aware extended-source reconstructions. Therefore, a benchmark on this site must name not only the board, but also which inverse family was used, what target object it aimed to recover, and what uncertainty object it actually returned.
Another shortcut still had to be blocked here. A benchmark can be “directly validated” and still score only the geometric center of a source. Feng et al. (2025) explicitly target extended-source reconstruction rather than focal-center localization. Hao et al. (2025) likewise note that their ECD-based distance comparison neglects the spatial extent of the seizure-onset and irritative zones, and argue that future localization work should incorporate source extent in addition to geometric center. Therefore, a solver that wins a focal-site distance board is not automatically the best method for distributed, extended, or propagation-rich sources, and a public benchmark on this site must state whether it scores centre, extent, overlap, or propagation pattern.
| Benchmark question | Keep fixed across methods | Primary metric to publish | What not to overread |
|---|---|---|---|
| Focal-source localization against known stimulation site | Same raw recording, event window, electrode coordinates, head model, conductivity sweep, source space, and bad-channel mask. | Distance to known stimulation site/time, plus spread across conductivity and regularization settings. | Do not crown a universal solver for extended or distributed sources from a focal-source board alone. |
| Concordance with simultaneous SEEG/ECoG under the same event | Same event definition, same reference montage, same source-depth stratification, same preprocessing, and same concordance rule. | Distance or overlap to invasive reference together with source depth and source power strata. | Do not read concordance as direct ground truth for all generators, especially low-amplitude or deep activity. |
| Clinical concordance / postsurgical outcome | Same SOZ/resection definition, same outcome window, same blinding rule, and same patient inclusion criteria. | Sensitivity/specificity or concordance against clinical outcome, clearly separated from localization error. | Do not relabel surgical concordance as precise source-localization ground truth. |
| Extended-source reconstruction or multimodal-prior reconstruction | Same definition of source extent, same prior source, same anatomical constraints, and the same focal-versus-extended evaluation split. | Extent overlap or reconstruction error for distributed sources, plus the gain from the added prior. | Do not compare an extended-source method only on a focal-source leaderboard and call it inferior in general. |
| Inverse-family comparison under one named board | Same validation class, same raw data, same geometry, same source regime, and an explicit statement of whether each method returns posterior support, sparse debiased intervals, focal centres, or source extent / overlap. | Family-typed result table: target object, uncertainty object, primary board metric, and the spread induced by conductivity / hyperparameter sweeps. | Do not collapse a probabilistic focal family, a sparse debiased family, and an extended-source family into one shared winner or one generic “better ESI” claim. |
| If MNE / beamformer / Champagne disagree | What to publish now | Safe reading on this site |
|---|---|---|
| Ranking flips when skull conductivity, head model, or electrode geometry is perturbed. | Show the family-specific ranking under the full sensitivity sweep instead of only the best run. | Method-conditioned improvement in a bounded geometry regime, not a solver winner in general. |
| A method wins only at one hand-tuned regularization point. | Publish the localization-error curve or interval across the tested hyperparameter range. | Best-case performance only; robustness remains unresolved. |
| Dense montages reduce dispersion but not localization error. | Report localization error and spatial dispersion separately. | Better concentration of the estimate, not automatic improvement in true-source accuracy. |
| Deep and superficial sources behave differently. | Stratify results by source depth rather than pooling into one mean. | Conditional detectability only; do not generalize to deep sources as a whole. |
| A focal-source board and an extended-source board favor different families. | Keep separate leaderboards for focal, sparse, and extended-source tasks. | Source-regime-specific strength, not a contradiction that can be collapsed into one number. |
| Probabilistic focal support, sparse debiased inference, and extent-aware reconstruction return different uncertainty objects. | Publish the family label, target object, and uncertainty object beside the main score instead of hiding them in Methods. | Board-specific, family-specific evidence only; do not read disagreement as generic noise around one common truth object. |
A public inverse-problem comparison on this site must now disclose at least (1) validation class, (2) source regime and target object (focal-centre / sparse / extended / propagation-aware), (3) inverse family plus the uncertainty object it returns, (4) same-geometry controls including montage / coverage policy, (5) sensitivity sweep over conductivity and key hyperparameters, (6) inter-method disagreement summary, and (7) the claim that must stop here. Without these fields, a result will be treated as a method illustration or lab-specific pipeline note, not as a reusable benchmark.
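These disclosure items can travel with every published number instead of living only in a Methods section. A minimal typed-row sketch of the "family-typed result table" from the board above; all field names and numeric values are illustrative placeholders.

```python
from dataclasses import dataclass

@dataclass
class InverseBoardRow:
    """One family-typed row for the inverse-problem board. Field names are
    this page's conventions; the values below are illustrative, not measured."""
    validation_class: str    # stimulation / simultaneous invasive / clinical
    source_regime: str       # focal-centre / sparse / extended / propagation
    inverse_family: str      # named family, never just "the solver"
    uncertainty_object: str  # posterior support / debiased interval / extent overlap / none
    primary_metric_mm: float
    sweep_spread_mm: float   # spread over conductivity / hyperparameter sweep
    stopping_claim: str

row = InverseBoardRow(
    validation_class="stimulation",
    source_regime="focal-centre",
    inverse_family="distributed linear (illustrative)",
    uncertainty_object="none (point estimate)",
    primary_metric_mm=15.0,
    sweep_spread_mm=5.0,
    stopping_claim="bounded-geometry focal board only",
)
print(f"{row.inverse_family}: {row.primary_metric_mm} mm "
      f"(sweep spread {row.sweep_spread_mm} mm)")
```

Publishing the sweep spread beside the primary metric is what blocks the "wins only at one hand-tuned regularization point" overread from the table above.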
5) Checklist that does not end with just “there is data”
Checklist
- Version fixed: do the OpenNeuro snapshot, PhysioNet version, DOI, and acquisition date remain on record?
- Reproduction: can you write down the acquisition procedure, license, preprocessing conditions, random seeds, and environment?
- Metadata: do you have sampling, reference, electrode placement, event definitions, clock domain, and a named timing-validation class?
- Annotation provenance: did you state whether the label came from an annotation channel, manual scoring, or a report-derived rule, and whether a known scorer-agreement or report-derived ceiling still limits interpretation?
- QC: are noise, defects, and artifacts quantified?
- Comparison: is there a baseline that can be compared with the same metrics as the evaluation family?
- Metric bundle: if the task is imbalanced or event-based, are event sensitivity, false alarms, per-stage agreement, or calibration disclosed rather than one headline number?
- Benchmark provenance: if the result comes from a challenge or leaderboard, are benchmark version, split / randomization, hidden grouping, subject exclusivity, extra-data policy, pretrained-checkpoint policy, inference-stage restrictions, and later postmortems fixed?
- Inverse-problem governance: if source imaging is compared, are validation class, source regime, inverse family / uncertainty object, geometry / control sweep, and inter-method disagreement disclosed before declaring a winner?
- Rebuttal evidence: are there leak tests, segment/window ancestry checks, counterfactual tests, and records of failures?
In practical work, a benchmark page, rules page, submission constraint, and final leaderboard can each fix a different part of what your score means. Brookshire et al. (2024) show that segment-based cross-validation in translational EEG can leak subject information between training and test sets and inflate headline performance. The later TUH/TUSZ maintenance record also documented a public-release case where subject exclusivity and annotation quality had to be repaired after downstream use had already begun. That is why this page now routes EEG foundation-model benchmarking not only through split / leakage hygiene but also through benchmark provenance. If the benchmark uses evolving challenge operations, continue directly to Wiki: EEG foundation models and pretraining and Verification: Pretraining Card before treating the ranking as portable generalization evidence.
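The subject-leak failure mode Brookshire et al. (2024) describe is cheap to test for once every window carries its ancestry. A minimal sketch, assuming each window is tagged with a (subject_id, session_id, index) triple as this page's ancestry fields require:

```python
def subject_leak_report(train_windows, test_windows):
    """Each window carries its ancestry as (subject_id, session_id, index).
    Any subject shared across splits is flagged before scores are trusted."""
    shared = {w[0] for w in train_windows} & {w[0] for w in test_windows}
    return {"shared_subjects": sorted(shared), "leak": bool(shared)}

# Segment-wise split gone wrong: subject S02 contributes to both sides.
train = ([("S01", "ses1", i) for i in range(4)]
         + [("S02", "ses1", i) for i in range(4)])
test = [("S02", "ses2", i) for i in range(3)]
print(subject_leak_report(train, test))
# {'shared_subjects': ['S02'], 'leak': True}
```

The same two-line intersection can be repeated at the session, site, or device level, which is how the disjointness fields in the checklist become an executable audit rather than a promise.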
6) Run the L0 minimum loop here
The goal here is not to compete for high accuracy, but to create the smallest loop that a third party can follow in the same way. The minimum pack on this site is no longer just version + BIDS + QC + split + baseline. It now also requires event fidelity, label provenance, acquisition-distribution summary, derivative lineage, and a stopping claim so later accuracy can still be read correctly.
L0 Loop
- Version: record the OpenNeuro snapshot / PhysioNet version / DOI / acquisition date
- Input: arrange the data so they can be placed in BIDS / EEG-BIDS (data + metadata + reference / channels / electrodes / events)
- Event fidelity: record onset / duration / sample, clock domain, stream-alignment rule, timing-validation class, delay / jitter evidence, and event semantics
- Label provenance: state whether the target comes from annotation channels, expert scoring, clinician reports, or other rules
- Quality: record missingness, noise, artifacts, and exclusion reasons numerically
- Processing: fix preprocessing conditions, random seeds, software versions, and the derivative lineage from raw to outputs
- Evaluation: fix one of within-session / cross-session / cross-subject first, together with the independent hold-out unit and raw-recording / window ancestry
- Output: publish at least one baseline indicator, however simple, that can be compared later, and state what claim must stop there
- Audit: record failure cases, leak tests, harmonization logs, and pending conditions along with the results
The weakness of this page was that the public checklist had become stricter than the wiki page readers actually use when assembling an L0 submission. That gap is now closed. If you want the submission shape itself, not only the route on this page, go directly to Wiki: Minimum artifact pack for L0. The synced pack now fixes dataset identity, event fidelity, label provenance, evaluation family + hold-out ancestry, acquisition-distribution summary, derivative lineage, and stopping claim as first-class deliverables.
| Common sticking point | What to fix first |
|---|---|
| Assuming the dataset name alone guarantees reproducibility | Fix the OpenNeuro snapshot tag and PhysioNet version first, and record the acquisition date and DOI in the runbook. |
| Stalling on BIDS formatting | Before entering real data, first create the directory skeleton, dataset_description.json, participants.tsv, and events.tsv. |
| Unsure how much QC to record | It is safest to fix just four items first: missingness, noise, artifacts, and exclusion reasons, and extend later. |
| Cannot settle on a baseline | Prefer a simple, easily reproduced model, such as two-class motor imagery or a spectral summary, over a complex one. |
| Getting lost in train/test design | First decide whether the comparison is within-session, cross-session, or cross-subject, then lock the split unit per subject or session. |
The dataset name alone is not enough. OpenNeuro manages snapshots with semantic-version Git tags, and PhysioNet also displays and cites dataset versions for each project. Therefore, the first runbook should record snapshot / version / DOI / retrieval date, not just the dataset name.
Even if the contents are not aligned at first, just fixing the placement will reduce rework. If you create a file name and metadata template with the premise of passing it through a validator, subsequent QC and comparisons will become much easier.
Eliminate machine-detectable problems at an early stage. Passing the BIDS Validator is not a sufficient condition for research, but it is close to the minimum requirement for sharing.
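The skeleton files named in the table above can be stubbed out in a few lines before any real data arrive. The sketch below follows BIDS file naming, but the contents are placeholders you must still complete and run through the BIDS Validator; the dataset name is a made-up example.

```python
import json
import tempfile
from pathlib import Path

def make_bids_skeleton(root, subjects):
    """Create the minimal BIDS-shaped stub: dataset_description.json,
    participants.tsv, and per-subject eeg/ directories. Contents are
    placeholders, not a validated dataset."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    (root / "dataset_description.json").write_text(json.dumps(
        {"Name": "my-l0-dataset", "BIDSVersion": "1.11.1"}, indent=2))
    rows = ["participant_id"] + [f"sub-{s}" for s in subjects]
    (root / "participants.tsv").write_text("\n".join(rows) + "\n")
    for s in subjects:
        (root / f"sub-{s}" / "eeg").mkdir(parents=True, exist_ok=True)
    return root

root = make_bids_skeleton(tempfile.mkdtemp(), ["001", "002"])
print(sorted(p.name for p in root.iterdir()))
# ['dataset_description.json', 'participants.tsv', 'sub-001', 'sub-002']
```

Fixing this placement first is exactly the "just fixing the placement reduces rework" point above: QC logs and events.tsv then have an unambiguous home.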
MNE-BIDS is a tool that helps with BIDSPath handling, data loading, and metadata extraction, while MOABB fixes the paradigm and evaluation family. There is a difference between being able to read data and being able to make fair comparisons. In particular, MNE-BIDS refuses to write back modified or preloaded data by default, so it is safer to treat preprocessed data as derivatives with explicit lineage.
From the raw waveform alone, it is difficult for a third party to reconstruct what went wrong and what was left out. The core of L0 is to record bad channels, bad segments, event synchronization, stimulus logs, and response logs with numerical values and thresholds.
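One numeric QC entry per recording is enough to start. A minimal sketch; the field names, channel labels, and threshold value are illustrative assumptions, not a QC standard.

```python
def qc_record(n_channels, bad_channels, bad_segment_s, total_s, amp_threshold_uv):
    """One numeric QC entry: fractions plus the threshold that produced them,
    so a third party can re-derive the exclusions. Field names are illustrative."""
    return {
        "bad_channels": sorted(bad_channels),
        "bad_channel_frac": round(len(bad_channels) / n_channels, 4),
        "bad_time_frac": round(bad_segment_s / total_s, 4),
        "amplitude_threshold_uv": amp_threshold_uv,
    }

entry = qc_record(n_channels=64, bad_channels={"T7", "Fp1", "O2"},
                  bad_segment_s=180.0, total_s=3600.0, amp_threshold_uv=150.0)
print(entry["bad_channel_frac"], entry["bad_time_frac"])  # 0.0469 0.05
```

Because the threshold is stored beside the fractions, a later reviewer can tell whether an exclusion came from the stated rule or from an undocumented judgment call.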
Rather than chasing SOTA, first place a comparison axis that is easy to reproduce. An initial baseline lets you see what actually improved even after you update the preprocessing or the model.
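The simplest such axis is a majority-class baseline: its score is computed once, logged, and never changes across preprocessing or model updates. A sketch with illustrative 2-class labels (e.g. left vs. right motor imagery):

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test):
    """Predict the most frequent training label for every test trial."""
    majority = Counter(y_train).most_common(1)[0][0]
    hits = sum(1 for y in y_test if y == majority)
    return hits / len(y_test)

# Illustrative labels; in practice these come from events.tsv.
y_train = ["left"] * 30 + ["right"] * 20
y_test = ["left"] * 6 + ["right"] * 4
acc = majority_baseline_accuracy(y_train, y_test)
print(f"baseline accuracy: {acc:.2f}")  # majority class is "left", so 0.60
```

Any later pipeline that cannot beat this number has learned nothing about the task, which is exactly the kind of check an initial baseline exists to provide.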
| Check item | L0 minimum | Where to return when it is missing |
|---|---|---|
| Data version | snapshot / version / DOI / retrieval date is fixed | Wiki: Standards/Repositories/Validators/Benchmarks |
| Data structure | Data is laid out in BIDS format | The shortest route to shareable data |
| Quality control | QC logs and exclusion criteria are recorded | Wiki: Event synchronization and observation logs |
| Comparability | One baseline, the evaluation family, and train/test rules are fixed | Wiki: Data splits and data leaks |
| Readiness to share | Execution steps, environment, and failure examples can be handed to a third party | Verification infrastructure |
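The comparability row hinges on locking the split unit. For a cross-subject comparison, every epoch of a given subject must land on exactly one side of the split; the sketch below assumes a hypothetical list of (subject_id, data) pairs.

```python
def split_by_subject(epochs, test_subjects):
    """epochs: list of (subject_id, data) pairs; test_subjects: set of ids.
    All epochs of one subject go to exactly one side, so no subject
    leaks across the train/test boundary."""
    train = [e for e in epochs if e[0] not in test_subjects]
    test = [e for e in epochs if e[0] in test_subjects]
    return train, test

epochs = [("sub-01", "x1"), ("sub-01", "x2"),
          ("sub-02", "x3"), ("sub-03", "x4")]
train, test = split_by_subject(epochs, {"sub-03"})
assert not {s for s, _ in train} & {s for s, _ in test}  # no subject overlap
print(len(train), len(test))  # → 3 1
```

The same pattern, with session IDs instead of subject IDs, locks a cross-session split; splitting at the epoch level instead is the classic leakage mistake.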
7) The shortest route to "shareable data" with Mind-Upload
Mind-Upload's goal is not just to collect data, but to leave it in a form that can be verified by a third party. The shortest route to that end is to approach BIDS/EEG-BIDS.
Verification Commons
Click here for the blueprint for "Standards + Storage + Evaluation".
View verification platform →
8) References and official pages
- BIDS 1.11.1: Task events
- BIDS 1.11.1: Electroencephalography
- Pernet et al. (2019), EEG-BIDS
- Robbins et al. (2021), HED for FAIR event annotation
- Hermes et al. (2025), HED library schema for EEG data annotation
- Kothe et al. (2025), Lab Streaming Layer
- Jeung et al. (2024), Motion-BIDS
- OpenNeuro Docs: Git access and snapshots
- OpenNeuro Docs: Dataset landing page and snapshot metadata
- PhysioNet: About and citation policy
- PhysioNet: Resources and citation guidance
- Appelhoff et al. (2019), MNE-BIDS
- MNE-BIDS Docs: write_raw_bids
- Jayaram & Barachant (2018), MOABB
- MOABB Docs
- MOABB Docs: WithinSessionEvaluation
- MOABB Docs: CrossSessionEvaluation
- MOABB Docs: CrossSubjectEvaluation
- Ma et al. (2022), A large EEG dataset for studying cross-session variability in motor imagery BCI
- Jiang et al. (2024), Large Brain Model for Learning Generic Representations with Tremendous EEG Data in BCI
- Lee et al. (2025), Are Large Brainwave Foundation Models Capable Yet? Insights from Fine-Tuning
- Han et al. (2025), DIVER-0: A Fully Channel Equivariant EEG Foundation Model
- Chen et al. (2025), HEAR: An EEG Foundation Model with Heterogeneous Electrode Adaptive Representation
- El Ouahidi et al. (2025), REVE: A Foundation Model for EEG -- Adapting to Any Setup with Large-Scale Pretraining on 25,000 Subjects
- Ma et al. (2026), Structured Prototype-Guided Adaptation for EEG Foundation Models
- Liu et al. (2026), EEG Foundation Models: Progresses, Benchmarking, and Open Problems
- Lahiri et al. (2026), PRISM: Exploring Heterogeneous Pretrained EEG Foundation Model Transfer to Clinical Differential Diagnosis
- EEG Challenge (2025), official homepage
- EEG Challenge (2025), rules
- EEG Challenge (2025), submission page
- EEG Challenge (2025), leaderboard
- Musall et al. (2019), Single-trial neural dynamics are dominated by richly varied movements
- Egger et al. (2024), Chrono-EEG dynamics influencing hand gesture decoding: a 10-hour study
- Karpowicz et al. (2025), Stabilizing brain-computer interfaces through alignment of latent dynamics
- Wilson et al. (2025), Long-term unsupervised recalibration of cursor-based intracortical BCIs
- Wairagkar et al. (2025), An instantaneous voice-synthesis neuroprosthesis
- Saito & Rehmsmeier (2015), The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets
- Roy et al. (2021), Evaluation of artificial intelligence systems for assisting neurologists with fast and accurate annotations of scalp electroencephalography data
- Scheuer et al. (2021), Seizure Detection: Interreader Agreement and Detection Algorithm Assessments Using a Large Dataset
- Segal et al. (2023), Utilizing risk-controlling prediction calibration to reduce false alarm rates in epileptic seizure prediction
- Sun et al. (2017), Large-Scale Automated Sleep Staging
- Vallat & Walker (2021), An open-source, high-performance tool for automated sleep staging
- PhysioNet: EEG Motor Movement/Imagery Dataset
- PhysioNet: CHB-MIT Scalp EEG Database
- PhysioNet: Sleep-EDF Database Expanded
- Obeid & Picone (2016), TUH EEG Corpus
- Shah et al. (2018), TUH Seizure Detection Corpus
- Hamid et al. (2021), Recent advances in the TUH EEG Corpus: improving the interrater agreement for artifacts and epileptiform events
- Rosenberg & Van Hout (2013), The American Academy of Sleep Medicine inter-scorer reliability program: sleep stage scoring
- Brookshire et al. (2024), Data leakage in deep learning studies of translational EEG
- Moser et al. (2009), Sleep classification difference between AASM and Rechtschaffen & Kales
- Mikulan et al. (2020), Localize-MI
- Baillet et al. (2001), Evaluation of inverse methods and head models using a human skull phantom
- Phillips et al. (2005), An empirical Bayesian solution to the source reconstruction problem in EEG
- Michel & Brunet (2019), EEG source imaging: a practical review of the analysis steps
- Aydin et al. (2019), Influence of head tissue conductivity uncertainties on EEG dipole reconstruction
- Cai et al. (2021), Robust estimation of noise for electromagnetic brain imaging with the Champagne algorithm
- Pascarella et al. (2023), An in-vivo validation of ESI methods with focal sources
- Hao et al. (2025), HD-EEG source imaging with simultaneous SEEG
- Birot et al. (2014), Head model and electrical source imaging
- Mouthaan et al. (2019), E-PILEPSY systematic review
- Unnwongse et al. (2023), Validating EEG source imaging using intracranial electrical stimulation
- Seeber et al. (2019), Subcortical electrophysiological activity is detectable with high-density EEG source imaging
- Vorwerk et al. (2024), Global sensitivity of EEG source analysis to tissue conductivity uncertainties
- Luria et al. (2024), The SESAMEEG package: a probabilistic tool for source localization and uncertainty quantification in M/EEG
- Tong et al. (2025), Debiased estimation and inference for spatial-temporal EEG/MEG source imaging
- Feng et al. (2025), Block-Champagne for extended E/MEG source imaging
- Vorwerk et al. (2026), Potential of EEG and EEG/MEG skull conductivity estimation to improve source analysis in presurgical evaluation of epilepsy
- Jahromi et al. (2026), 3D printed pediatric head phantom for assessing deep epileptic sources localization
- Cahill et al. (2024), Network-level encoding of local neurotransmitters in cortical astrocytes
- Suzuki et al. (2011), Astrocyte-neuron lactate transport is required for long-term memory formation
- Silva et al. (2022), Glial ketogenesis regulates memory maintenance during starvation
- Pavlowsky et al. (2025), Neuronal fatty acid oxidation fuels memory after intensive learning in Drosophila
- Greda et al. (2025), Interaction of sortilin with apolipoprotein E3 enables neurons to use long-chain fatty acids as alternative metabolic fuel
- Williamson et al. (2025), Learning-associated astrocyte ensembles regulate memory recall
- Dewa et al. (2025), The astrocytic ensemble acts as a multiday trace to stabilize memory
- Bukalo et al. (2026), Astrocytes enable amygdala neural representations supporting memory
- Mai-Morente et al. (2025), Pericyte pannexin1 controls cerebral capillary diameter and supports memory function
- Kim et al. (2025), Meningeal lymphatics-microglia axis regulates synaptic physiology
- Hirschler et al. (2025), Region-specific drivers of CSF mobility measured with MRI in humans
- Chung et al. (2025), Quantitative PET imaging and modeling of molecular blood-brain barrier permeability
- Dagum et al. (2026), The glymphatic system clears amyloid beta and tau from brain to plasma in humans
- Vafaii et al. (2024), Multimodal measures of spontaneous brain activity reveal both common and divergent patterns of cortical functional organization
- Chen et al. (2025), Simultaneous EEG-PET-MRI identifies temporally coupled and spatially structured brain dynamics across wakefulness and NREM sleep
- Bolt et al. (2025), Autonomic physiological coupling of the global fMRI signal
- Epp et al. (2025), BOLD signal changes can oppose oxygen metabolism across the human cortex
- Rohaut et al. (2024), Multimodal assessment improves neuroprognosis performance in clinically unresponsive critical-care patients with brain injury
- Amiri et al. (2023), Multimodal prediction of residual consciousness in the intensive care unit: the CONNECT-ME study
- Manasova et al. (2026), Multimodal multicentre investigation of diagnostic and prognostic markers in disorders of consciousness
- Bøgh et al. (2024), Repeatability of deuterium metabolic imaging in healthy volunteers at 3 T
- Wirsich et al. (2021), The relationship between EEG and fMRI connectomes is reproducible across simultaneous EEG-fMRI studies from 1.5 T to 7T