Resource

Data & Hands-on: Where to start and how to get to L0

Connect "what to use" and "how to reproduce" in the shortest route without separating them.

Mind Uploading Research Project

Public Page Updated: 2026-04-04 Curated List + L0 Practice (updated with the EEG recording-frame contract sync)

How to use this page

Read this first to avoid getting lost

This page is a practical guide that answers both "Which public data should I start validation with first?" and "How do I proceed to L0 reproducible analysis?" in one place. It does not stop at a list of dataset names; it connects BIDS, QC, splitting, and baselines in a single path.

  • We look at the shared infrastructure first, then the starter datasets.
  • Starter data is a practice board for L0-L1, not the ground truth of EEG source imaging.
  • The page now also fixes a component-addition / ablation ladder for maintenance-state routes, so glial substrate-routing, astrocyte-state, neurovascular / BBB, and clearance augmentations are compared against a named neuron-first baseline instead of being piled into one multimodal boost.
  • A same-subject support-state bundle is no longer treated as self-interpreting; route class, effective window, quantity bridge, common-driver audit, missingness policy, disagreement topology, repeatability, transfer window, and abstention are now fixed as practical fields.
  • Even inside direct-validation data, stimulation ground truth, simultaneous invasive recording, and postsurgical outcome are different evidence classes.
  • A fair inverse-problem benchmark also has to separate focal-centre versus source-extent targets, inverse family, uncertainty object, montage / coverage policy, and geometry / conductivity sensitivity rather than naming only a winning method.
  • High density is not the only meaningful route: targeted-density and DeepSIF-like low-density ESI can work in bounded regimes, but the gain is still solver-, geometry-, and source-regime-conditioned.
  • Each starter dataset has different annotation provenance, time fidelity, and independent split units.
  • Within-session / cross-session / cross-subject / adaptation are different evaluation families and should not be placed side by side under the same score.
  • Benchmark name is still too coarse: predicted object, independent prediction unit, hold-out unit, adaptation regime, and operations budget can all change what the same score means.
  • Metric semantics are task-dependent: in imbalanced or rare-event tasks, accuracy or AUROC alone do not fix event sensitivity, false alarms, minority-class failure, or calibration.
  • Cross-session and adaptation labels are not yet temporal-validity claims; state annotation is now split into fast labels and slow internal-milieu disclosure, and fixed decoder interval, recalibration burden, and transfer ceiling still have to be disclosed separately.
  • Clock domain, stream alignment, digital trigger capture, physical output onset, and uncontrolled-response timing are different timing-validation classes rather than one sync field.
  • Even when the score is numerically the same, you still have to separate the target neural variable from eye movement, EMG, behavior, feedback routes, subject / session fingerprint, and acquisition-distribution shortcuts such as site / device / reference / electrode layout.
  • Even when foundation / self-supervised EEG models are used, pretraining-corpus, coordinate-route / reference-family, omitted-channel, and label-budget audits are still required.
  • Even when datasets are described as `harmonized`, common-channel intersection, interpolation to a target montage, and REST-based transformation remain different recording-frame branches rather than one benchmark object.
  • Official challenge rules, submission constraints, and later postmortems can change what a benchmark score means, so benchmark provenance is part of the dataset / benchmark card rather than administrative detail.
  • BIDS raw layout, BIDS derivatives / lineage, workflow recipe, benchmark harness, benchmark-governance snapshot, and runtime pin are different reproducibility layers; naming only the dataset and tool remains too coarse.
  • Reference system, channel map, electrode layout, and device protocol are not cosmetic metadata; they can move scores and belong in the dataset card.
  • The ultimate goal is to make it possible for a third party to rerun the result under the same conditions.
  • The L0 artifact pack now follows this page's stricter site rule: event fidelity, label provenance, acquisition-distribution summary, derivative lineage, and a stopping claim are required alongside version/BIDS/QC/split/baseline.
Best for
Readers deciding which public dataset to start with, and readers looking for an L0 practice board
Reading time
12-20 minutes
Accuracy note
The datasets listed here are entry candidates. They are listed from the perspective of ease of use and reproducibility, and cannot cover all the issues of WBE.

Relatively clear at this stage

What we know now

  • Public EEG data is useful for L0 recall analysis and L1 baseline practice.
  • When selecting data for the first time, you will make faster progress if you prioritize ease of re-testing over difficulty.
  • Starter EEG datasets are still neuron-first baselines; maintenance-state claims need paired support-state data or aligned proxy logs, one-family-at-a-time augmentation, and strongest-single-row versus bundle comparison.
  • Glial substrate-routing and astrocyte-state are different augmentation families: a named astrocyte observable does not fix supplier cell / neuronal sink / fuel object, and a glial fuel-support route does not by itself identify astrocyte ensembles.
  • A same-subject support-state bundle can still mix common drivers, smaller complete-case slices, and opposite-sign rows, so family-split augmentation is not read strongly on this site without its own augmentation card.
  • Cue-locked events, expert interval annotations, sleep hypnograms, and physician report-derived labels have different meanings even though they are the same 'public EEG data'.
  • Even if the accuracy is the same, the strength of the argument that can be read will change depending on which generalization condition the score was obtained under.
  • The same benchmark name can still hide different predicted objects, independent units, grouped hold-out units, adaptation regimes, and inference budgets.
  • For class-imbalanced or rare-event tasks, a task-matched metric bundle is required: event sensitivity plus false alarms for seizure tasks, and macro / per-stage agreement for sleep staging, rather than one headline number.
  • Cross-session accuracy and cross-session adaptation still do not say whether a fixed decoder survived, how much recalibration was needed, or what transfer ceiling remains.
  • The same task and decoder window can still run under different slow internal-milieu regimes, so `state annotation` is not exhausted by movement, arousal, or session ID alone.
  • A dataset card that says only `events.tsv` or `LSL` still leaves open whether timing was checked at the stored-data, trigger, physical-output, or uncontrolled-response level.
  • A same-day score may reflect movement / EOG / EMG / feedback routes, subject / session fingerprint, or acquisition-distribution shortcuts rather than the target signal.
  • Foundation-model improvements are not comparable unless the pretraining corpus, channel-mismatch handling, acquisition-distribution summary, and adaptation regime are disclosed.
  • A benchmark name alone is not enough; version, split / randomization rule, hidden grouping, extra-data policy, pretrained-checkpoint policy, and inference-stage restrictions can all change what a score means.
  • A dataset or pipeline name alone is still not enough; raw layout, derivative lineage, workflow recipe, and runtime pin can each move what the result means.
  • Reference system, device, electrode layout, and filter chain can change what looks like the same EEG benchmark.
  • With only starter data and no individual MRI or invasive ground truth, we cannot make strong claims about improved ESI accuracy.
  • At source-imaging stage C, named validation class still matters because stimulation ground truth, simultaneous SEEG, and clinical outcome do not answer the same error question.
  • Low-density or targeted-density ESI can be scientifically meaningful in bounded regimes, but montage design, generator depth/orientation, and validation class still have to be declared before the result is compared with HD-EEG or treated as portable.
  • For inverse-problem claims, same raw data is still not enough; same head model, same preprocessing, same source regime, same inverse family / target object, and a sensitivity report are also required before comparing solver families.
  • At L0, a reusable artifact pack now also requires event fidelity, label provenance, hold-out ancestry, acquisition-distribution summary, derivative lineage, and a stopping claim rather than only version/BIDS/QC/split/baseline.

Still unresolved beyond this point

What we still do not know

  • Starter datasets alone cannot solve all the issues of WBE.
  • We have not yet determined which data will be most effective for future causal/closed-loop verification.
  • We have not decided yet which public data will be the default route for the annotation fidelity benchmark.
  • We have not determined yet which public benchmark board should become the default for comparing focal-centre and source-extent inverse methods under the same montage / geometry controls and uncertainty sweep.
  • We still do not have a default public EEG benchmark that logs state annotation, fixed decoder interval, recalibration burden, and transfer ceiling under one shared temporal-validity schema.
  • We still do not have a shared public template that freezes fast labels and slow internal-milieu disclosure together inside one temporal-validity addendum.
  • We still do not have a default public dataset bundle that jointly fixes a neuron-first baseline, family-split support-state augmentation, and strongest-single-row versus bundle comparison under one shared missingness and common-driver audit.

Learn the basics

Check the basics in the wiki

How To Use

This page is a practical list to help you decide which data to practice with first. First, use public data to reach a state in which the same results can be reproduced (L0), and then check whether the results predict and withstand changes in conditions (L1-L2).

Criteria for selection

When selecting data at the beginning, it is important to choose data whose procedures and results are easy for others to follow, rather than the most difficult material available. Rather than aiming for everything from the start, the shortcut is to build the smallest loop on public data that is easy to reproduce.

When you want to understand where this page fits

This page is the practical entry point for deciding where to start and how to complete the minimum L0 loop. The verification platform handles what counts as progress, and the casework section within the verification platform handles examples from other fields. If you want a one-page guide to how the practical pages differ from the rest of the site, please see Wiki: Guide to reading practical pages.

When you want to complete with just this page

The minimal loop procedure from the old hands_on.md has been integrated into this page. Therefore, you can read straight through to the L0 skeleton, QC, baseline, and completion conditions without having to go to another page after data selection.

If you want the RQ-by-RQ route from datasets back to mind-upload questions

This page is the practical data portal, not the full research-question map. If you want to move from a dataset bucket here to a specific mind-upload research question, a fixed EEG-ready claim, and a grant-ready theme, start with the current public six-RQ brief, then use the RQ60 EEG feasibility page, the RQ-by-RQ deep dossiers, the grant and dataset playbook, and the current funding shortlist.

What I want to do, and which dataset to start with
  • I want to practice the basics of preprocessing and classification: EEG Motor Movement/Imagery is the easiest entry point. The problem setting is relatively easy to understand, making it suitable for L0-L1 practice.
  • I want to experience long-term data and event detection: CHB-MIT is a good fit. It lets you practice handling noise, long recordings, and event detection together.
  • I want to handle state transitions: Sleep-EDF is a good fit. It is useful for learning how states change over time.
  • I want to see the difficulty of large-scale data: TUH EEG is also a candidate. However, it is heavy for a first dataset, so it is safer to become familiar with the first three first.
Avoid expecting too much when selecting data at the beginning

The starter dataset is not intended to solve all WBE problems at once. The first things you want to get here are reproducible input organization, QC habits, and baseline comparison. Strong claims, such as claims about identity or causal identification, cannot be resolved with the data available at this stage.

Keep the public dataset route conservative

This public page stays at the entry level. When one unresolved question is actively being turned into an EEG-ready work package, stronger routing details such as fixed Dxx + DOI anchors, first-pass KPIs, and stopping rules are kept in the wiki rather than promoted here as a public conclusion. Use the current public six-RQ brief, the RQ60 EEG feasibility page, and the RQ-by-RQ deep dossiers when you need the current one-question-at-a-time packages.

Seeing before precision

When you read a dataset introduction, it is tempting to jump straight to "what score was achieved?" The first questions should instead be what the train/test split unit is, whether leakage was checked, and whether the result was compared against a simple baseline. If this is still unclear, please read Wiki: Data Splits and Data Leakage first.
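If you want to see concretely why the split unit matters before any score, the following minimal sketch contrasts a subject-wise GroupKFold hold-out with a majority-class baseline. It assumes scikit-learn and uses synthetic placeholder features and subject IDs rather than a real dataset; the same pattern applies once real epochs are loaded.

```python
# Minimal sketch: subject-wise hold-out plus a simple baseline comparison.
# X, y, subject_ids are placeholders for epoch features, labels, and the
# subject each epoch came from.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n_epochs, n_features, n_subjects = 600, 32, 20
X = rng.normal(size=(n_epochs, n_features))
y = rng.integers(0, 2, size=n_epochs)
subject_ids = rng.integers(0, n_subjects, size=n_epochs)

# Subject-wise split: every fold holds out whole subjects, so no subject
# contributes epochs to both the training and the test side.
cv = GroupKFold(n_splits=5)
model_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                               cv=cv, groups=subject_ids)
chance_scores = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y,
                                cv=cv, groups=subject_ids)

print(f"model    : {model_scores.mean():.3f}")
print(f"baseline : {chance_scores.mean():.3f}")
```

If the model score barely clears the baseline under this grouping, that is the honest starting point; a much higher score under a trial-wise split is usually a leakage signal, not an improvement.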

Even if the score is the same, the meaning will change if the generalization conditions are different

MOABB treats within-session, cross-session, and cross-subject as separate evaluation families. In other words, even if the 70% is the same, the 70% obtained by "same day, same person, same setup" and the 70% obtained by hold-out on "different day" or "different person" are different achievements. If you want to sort out short-term state fluctuations and long-term drift first, please also have a look at Wiki: state/trait/drift.
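As a concrete illustration, the sketch below follows the standard MOABB pattern for running one pipeline under two separate evaluation families. The dataset, the subject subset, and the CSP+LDA pipeline are illustrative choices (downloading the data takes time), so treat this as a template rather than a ready benchmark.

```python
# Hedged sketch: MOABB implements within-session, cross-session, and
# cross-subject as separate evaluation classes, so the same pipeline yields
# different kinds of scores depending on the family chosen.
from moabb.datasets import PhysionetMI
from moabb.paradigms import LeftRightImagery
from moabb.evaluations import WithinSessionEvaluation, CrossSubjectEvaluation
from sklearn.pipeline import make_pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from mne.decoding import CSP

dataset = PhysionetMI()
dataset.subject_list = dataset.subject_list[:5]   # keep the example small
paradigm = LeftRightImagery()
pipelines = {"CSP+LDA": make_pipeline(CSP(n_components=4),
                                      LinearDiscriminantAnalysis())}

# Same data, same pipeline, two different generalization questions.
within = WithinSessionEvaluation(paradigm=paradigm,
                                 datasets=[dataset]).process(pipelines)
cross_subject = CrossSubjectEvaluation(paradigm=paradigm,
                                       datasets=[dataset]).process(pipelines)
# A cross-session evaluation needs a multi-session dataset; PhysionetMI has
# one session per subject, so that family is not applicable here.
print(within.groupby("pipeline")["score"].mean())
print(cross_subject.groupby("pipeline")["score"].mean())
```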

See label provenance before data name

Even with the same "public EEG data", the meaning of a comparison differs between a cue-locked annotation channel, expert interval annotations, a whole-night hypnogram, and physician report-derived labels. Therefore, on this page, in addition to the dataset name, always record where the label came from, at what time granularity, and what counts as an independent split unit.

When you are wondering what to get for L0

Deciding on a dataset name is not enough; if the form of the deliverable is ambiguous, it is easy to get stuck. If you want to see BIDS, the Validator, QC logs, split rules, baselines, execution steps, and failure examples on one page, please see Wiki: Minimum L0 artifact pack.

When you want to see the entire order from EEG to L0 in one straight line

After the introduction to EEG, if you want to see the full flow of selecting data on this page, working through the L0 practice section, and confirming the result as L0 in Verification, please see Wiki: Straight path from EEG to L0.

Raw EEG is not enough

Even if the waveform file is published, it will be difficult to re-run comparable analyses later if the event definitions, stimulus log, time synchronization, and bad-channel / bad-segment records are weak. Furthermore, in the 2026-03 re-audit, we added to the site rule that event semantics are not fixed just by having `events.tsv`, and that hardware delay cannot be audited just by having LSL. If you want to understand this point from the beginning, please see Wiki: Basics of event synchronization and observation logs first.

Event Fidelity Card now required

Future dataset cards must include at least (1) onset / duration / sample, (2) clock domain plus stream-alignment rule, (3) timing-validation class such as stored-data anchor / digital trigger / physical onset / uncontrolled-response test, (4) event semantics such as trial_type, HED, and scoring rules, (5) provenance / scorer / report-usage flag, (6) independent split units, and (7) a clear stopping claim. Cards without these fields are insufficient as reusable L0 guides.
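One possible way to keep these fields machine-checkable is to store them next to the data as a small structured record. The sketch below only illustrates the field set; the values are placeholders and do not describe any real dataset.

```python
# Hedged sketch: an event-fidelity card as a plain Python dict. Field names
# mirror items (1)-(7) above; values are illustrative placeholders.
event_fidelity_card = {
    "events": {"columns": ["onset", "duration", "sample"],
               "units": "seconds / samples"},
    "clock_domain": "amplifier clock; stimulus PC aligned via shared trigger channel",
    "timing_validation_class": "digital trigger only; no photodiode / physical-onset check",
    "event_semantics": {"trial_type": "cue_left / cue_right", "hed": None,
                        "scoring_rule": "cue-locked, not response-locked"},
    "label_provenance": {"source": "cue log", "scorer": None, "report_usage": False},
    "independent_split_unit": "subject",
    "stopping_claim": "supports cue-locked decoding practice; not timing-critical ERP claims",
}
```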

BIDS, Derivatives, Pipeline, OpenNeuro, and Benchmark are not the same

BIDS is a raw-data standard, BIDS Derivatives keeps processed-output lineage explicit, OpenNeuro and PhysioNet are storage areas, Validator is a formal check, MNE-BIDS is a loader, MNE-BIDS-Pipeline or a BIDS App is a workflow recipe, and Benchmark is a comparison rule. If you want to sort out this role difference from the beginning, please use Wiki: Standards/Repositories/Validators/Benchmarks.
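A minimal sketch of this role split, assuming an MNE-BIDS installation and a placeholder BIDS path: the loader reads what the layout declares, while the formal layout check remains a separate step.

```python
# Hedged sketch: loader versus validator. Paths and entities are placeholders.
from mne_bids import BIDSPath, read_raw_bids

bids_path = BIDSPath(root="/data/ds_example", subject="01",
                     task="motorimagery", datatype="eeg", suffix="eeg")
raw = read_raw_bids(bids_path)      # loader: raw data plus sidecar metadata
print(raw.info["sfreq"])            # metadata the layout declared, nothing more

# The formal check is a separate tool, e.g. running `bids-validator
# /data/ds_example` on the command line. Passing it means the layout is
# traceable, not that labels, events, or the reference scheme are adequate.
```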

Plan for future expansion

EEG-based starter datasets are enough to begin, but later you may want to add spatial or structural information. If you want to map out what can be added to EEG first, please see Wiki: Multimodal integration basics.

If you want to test whether adding glial, astrocyte, or clearance variables really changes the result

This page had one practical weakness: it explained how to build a reproducible neuron-first EEG baseline, but it still lacked a public rule for testing whether adding maintenance-state / support-state variables changes prediction, stability, or explanation. The current primary literature does not support one compressed support-variable bucket. Cahill et al. (2024), Williamson et al. (2025), Dewa et al. (2025), and Bukalo et al. (2026) sharpen distinct astrocyte-state routes across minute-scale network encoding, recall, multiday stabilization, and fear-state support. By contrast, Suzuki et al. (2011), Silva et al. (2022), Pavlowsky et al. (2025), and Greda et al. (2025) sharpen distinct glial substrate-routing routes across lactate support, ketone-body support under starvation, learning-linked fatty-acid routing, and apoE / sortilin-dependent lipid delivery under limited glucose. Mai-Morente et al. (2025) sharpens a pericyte / capillary-support route; Kim et al. (2025) sharpens a meningeal-lymphatics / microglia route for synaptic physiology; Hirschler et al. (2025) and Dagum et al. (2026) sharpen bounded human clearance-side observables; and Chung et al. (2025) raises tracer-specific BBB transport quantification while explicitly leaving human ground truth and test-retest for future work. Therefore, this page now fixes a component-addition / ablation ladder that separates glial substrate-routing from astrocyte-state instead of letting readers pile both into one multimodal boost.

A practical component-addition / ablation ladder for maintenance-state routes

Family added on top of the neuron-first baseline | Minimum paired data requirement | What you may say if the gain survives | What still must stop here
Glial substrate-routing route | Same-subject neural and behavioral target, plus a named glia-to-neuron fuel-support observable or perturbation with supplier cell, neuronal sink, fuel object / carrier, and nutrient or learning regime fixed. | A declared glial substrate-routing family improved prediction, recall, or a bounded memory-support readout in that named nutrient or learning regime. | That astrocyte ensemble state was identified, or that one glial fuel route generalizes across lactate, ketone-body, fatty-acid, and apoE / sortilin-dependent lipid delivery.
Astrocyte-state route | Same-subject neural and behavioral target, plus a named astrocyte-state observable or perturbation aligned to the same recall, stabilization, or fear-state window. | A named astrocyte-state family improved prediction, recall, stabilization, or fear-state decoding in that declared window. | That the responsible whole-brain astrocyte controller was identified, or that astrocyte-state evidence fixed glial fuel routing, clearance control, or one general support state across all tasks and timescales.
Neurovascular / BBB / pericyte support route | Same-subject neural and behavioral target, plus a named capillary, BBB-exchange, or BBB-transport observable with shared arousal / vascular-driver logging. | A declared vascular-support family reduced one error term or improved one prediction slice under the named physiological regime. | That a generic BBB state was measured, or that the added row directly read out the neuronal variable of interest.
Clearance / immune / lymphatic route | Same-subject neural or biomarker target, plus a named CSF-mobility, tracer-transport, or sleep-linked efflux route with sleep / time-of-day handling fixed. | A declared transport-side or immune-side route explained incremental variance or changed a bounded physiological readout. | That local microglial control, route-free whole-brain clearance truth, or one universal maintenance controller was identified.
Bundle comparison rule | Compare the neuron-only baseline, each single added family, and the full bundle under the same subjects, split rule, missingness policy, and common-driver audit. | The bundle improved the declared task under a named availability and regime constraint, beyond the strongest single added row. | That the full bundle proves the minimum required biological configuration or closes U3 by itself.
  1. Freeze the neuron-first baseline, target object, split unit, and metric bundle before adding any maintenance-state row.
  2. Add one family at a time under the same subjects, same sessions, same missingness rule, and same evaluation family.
  3. Name the direct observable, time window, spatial unit, and route class for every added family, because local causal perturbation and bounded human proxy do not carry the same claim.
  4. Report the strongest single added row, the full bundle, and their disagreement / missing-modality behavior under the same split (a minimal comparison sketch follows this list).
  5. Stop the claim at incremental predictive, stability, or physiological gain unless the result also survives common-driver controls, out-of-regime checks, and a named abstention boundary.
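A minimal sketch of steps 1-4, using synthetic placeholder feature blocks instead of real paired support-state data: the only point is the comparison structure (neuron-only baseline, one added family at a time, full bundle) under one frozen subject-wise split.

```python
# Hedged sketch of the ladder: compare baseline, single-family additions, and
# the full bundle under the same GroupKFold split. Feature blocks are
# placeholders; real support-state rows would come from paired data or logs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(1)
n = 400
subjects = rng.integers(0, 10, size=n)
y = rng.integers(0, 2, size=n)
blocks = {
    "neuron": rng.normal(size=(n, 24)),           # EEG-derived baseline features
    "astrocyte_state": rng.normal(size=(n, 4)),   # placeholder proxy row
    "clearance": rng.normal(size=(n, 3)),         # placeholder proxy row
}
cv, clf = GroupKFold(n_splits=5), LogisticRegression(max_iter=1000)

def score(feature_names):
    X = np.hstack([blocks[name] for name in feature_names])
    return cross_val_score(clf, X, y, cv=cv, groups=subjects).mean()

results = {"neuron only": score(["neuron"])}
for family in ["astrocyte_state", "clearance"]:
    results[f"neuron + {family}"] = score(["neuron", family])  # one family at a time
results["full bundle"] = score(list(blocks))

for name, value in results.items():
    print(f"{name:26s}{value:.3f}")
# Read-off rule: the bundle is only interesting if it beats the strongest
# single added row under this same split, and even then the claim stops at
# incremental gain unless common-driver and out-of-regime checks also pass.
```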
Starter EEG datasets are still only the baseline arm of this ladder

EEG Motor Movement/Imagery, CHB-MIT, Sleep-EDF, and TUH help you fix the neuron-first baseline, split unit, QC discipline, and leakage checks. By themselves they do not close glial substrate-routing, astrocyte-state, pericyte / BBB support, clearance transport, or other maintenance-state families. Any public maintenance-state claim on this site therefore needs paired support-state data, aligned proxy logs, or a named perturbation route, and it must be compared against the strongest single added family rather than only against the all-in bundle.

A same-subject support-state bundle still needs its own augmentation card

The page still had one practical weakness after the family split: it could leave the impression that once several support-state rows are collected in the same subject, the bundle itself is already close to one aligned biological variable. The current primary literature does not support that shortcut. Vafaii et al. (2024) showed that spontaneous multimodal measures contain both common and divergent cortical structure. Chen et al. (2025) showed that simultaneous EEG-PET-MRI can display tightly coupled temporal evolution while still preserving spatially distinct hemodynamic and metabolic patterns. Bolt et al. (2025) showed that a major low-frequency global fMRI pattern is substantially coupled to autonomic physiology, while Epp et al. (2025) showed that about 40% of gray-matter voxels with significant task BOLD changes exhibited opposing oxygen-metabolism changes. Bundle-level gain is real but still conditional: Rohaut et al. (2024) showed that adding markers can reduce prognostic uncertainty, Amiri et al. (2023) showed that direct same-sample comparison shrank to 48 patients with all EEG and fMRI features, and Manasova et al. (2026) showed higher inter-modality disagreement in minimally conscious or improving patients even as performance improved with more modalities. Row-local stability is separate again: Bøgh et al. (2024) fixed a named repeatability window for a 3 T deuterium route, and Wirsich et al. (2021) showed reproducible EEG-fMRI connectome relations only under explicitly harmonized simultaneous acquisition. Therefore, this page now requires a practical support-state augmentation card before a same-subject bundle is read as more than row addition.

A practical support-state augmentation card for dataset bundles

Field to log before reading a same-subject bundle strongly | Why this field is necessary | What overread it blocks
Route class and bridge type | State whether a row is a same-subject human proxy, a same-subject perturbation, a sequential bridge, or a mixed-species causal support row. | Do not read rodent causal support and bounded human proxy rows as interchangeable evidence just because they concern the same family label.
Effective time window and physiological regime | Log whether rows target the same trial epoch, sleep stage, arousal window, pharmacological state, or multiday stabilization regime. | Do not read co-acquisition or same-session wording as if one support-state sample had been aligned automatically.
Direct observable and quantity type | Name whether the added row is density, transport, exchange, mobility, flux, metabolism, or an indirect classifier / score. | Do not collapse tracer transport, glucose uptake, BOLD fluctuation, and bounded biomarker efflux into one solved maintenance variable.
Shared-driver / quantity-bridge audit | Disclose vascular, respiratory, autonomic, motion, drug, and time-of-day covariates, and say whether the bundle established a shared trajectory, a common driver, or a true quantity bridge. | Do not read correlated rows as one biological quantity when the coupling may be driven by arousal or another shared nuisance source.
Availability slice and missing-modality policy | Report the exact complete-case subset, any imputation or substitution rule, and whether the comparison is same-sample or maximum-available-data. | Do not hide that the bundle result may depend on a much smaller or specially filtered subgroup than the headline cohort.
Strongest single row and disagreement topology | Compare the best single added family against the full bundle and state where modalities agree, diverge, or change sign. | Do not promote the full bundle if it only repackages the strongest single row or if disagreement is concentrated in the hardest regime.
Row-local repeatability and transfer window | Name the hardware, sequence, preprocessing, centre, and acquisition window under which each added row is repeatable or portable. | Do not treat one named proxy route as field-ready or cross-centre stable just because it worked once in one harmonized setup.
Abstention and stopping claim | State what remains latent after the gain, such as controller identity, cell specificity, or out-of-regime failure. | Do not turn bundle improvement into a minimum-biological-configuration claim or a U3 closure claim by default.
What this card changes in practice

On this page, a support-state addition now stays at family-split augmentation evidence unless the dataset card logs the fields above and then compares neuron-first baseline, strongest single added row, and full bundle under the same split, availability slice, and abstention rule. If the bundle mixes living-human proxy classes, use the Verification: Human Proxy Composition Card alongside the Verification: Fusion Card instead of treating co-acquisition as sufficient.

1) Shared infrastructure to establish first

A

OpenNeuro (BIDS-based sharing)

A platform for sharing BIDS-compliant neuroimaging and electrophysiology datasets, including EEG, MEG, and fMRI.

Open OpenNeuro
B

PhysioNet (biosignals and benchmark culture)

A public platform for biosignal datasets and related resources, including many standard EEG corpora.

Open PhysioNet
C

Human Connectome Project (large-scale human imaging)

A representative public resource for large-scale human brain imaging data and analysis tools.

Open HCP
Reproducibility depends on the full execution chain

OpenNeuro and PhysioNet are entry points, but they do not guarantee reproducibility by themselves. First fix the snapshot / version, then align it with BIDS / EEG-BIDS, fix the reading and conversion path with tools such as MNE-BIDS, and finally define the comparison setting with a benchmark harness such as MOABB for within-session / cross-session / cross-subject. If you mix up repository, loader, and benchmark settings, the same dataset name will still yield incomparable results.
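One low-tech way to keep the chain auditable is to write the pins into a small provenance record next to the results. The sketch below assumes the named packages are installed and uses placeholder accession and snapshot values.

```python
# Hedged sketch: record execution-chain pins (dataset snapshot, package
# versions, evaluation family) so a third party can rerun the same comparison.
import json
import platform
import importlib.metadata as md

provenance = {
    "dataset": {"repository": "OpenNeuro", "accession": "ds00XXXX",
                "snapshot": "1.0.0"},                     # placeholders
    "layout": {"standard": "BIDS / EEG-BIDS", "validator_passed": True},
    "loader": {"mne": md.version("mne"), "mne_bids": md.version("mne-bids")},
    "benchmark": {"harness": "moabb", "version": md.version("moabb"),
                  "evaluation_family": "cross-subject"},
    "runtime": {"python": platform.python_version()},
}
with open("provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)
```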

2) EEG starter pack (start with this from L0 to L1)

The following are representative introductory EEG datasets, chosen for ease of use and breadth of existing reference material. The focus is on practicing the preprocessing pipeline and reaching L0 to L1, narrowed to a range where comparison against reproduced baselines can start immediately.

Dataset | What you can do (example) | Link
EEG Motor Movement/Imagery | Motor / motor-imagery classification, preprocessing practice, baseline comparison | PhysioNet
CHB-MIT Scalp EEG | Epileptic seizure detection, event detection, long-term EEG handling | PhysioNet
Sleep-EDF | Estimating sleep stages, modeling state transitions, handling longitudinal fluctuations | PhysioNet
TUH EEG Corpus (large scale) | Scaling EEG classification, real-world distribution difficulties, data-leakage countermeasures | TUH EEG
Dataset | Good first release | Why this is a good first release
EEG Motor Movement/Imagery | Baseline accuracy and preprocessing log for two-class classification | The task setup is simple, so it is easy to build a minimal loop from preprocessing to evaluation.
CHB-MIT | Reproduction baseline and exclusion-reason log for seizure event detection | It is a good way to learn the practical difficulties of long recordings and event detection, including failure cases.
Sleep-EDF | Basic baseline for sleep stage classification and confusion matrix of state transitions | It shows not only accuracy but also how state transitions fail, which makes errors easier to interpret.
TUH EEG Corpus | Reproduction experiment with a small subset and clarified data-division rules | It is more important to lock down leak prevention and split rules first than to process the full corpus from the start.

2.5) The same score means different things across generalization settings

This is one of the current weak points of the site. `within-session`, `cross-session`, `cross-subject`, and `adaptation` may all report "classification accuracy," but they answer different questions about generalization. The official MOABB documentation also implements them as separate evaluation classes, and in the 5-day MI dataset of Ma et al. (2022), the average subject-specific accuracy dropped from within-session 68.8% to cross-session 53.7%, then recovered to cross-session adaptation 78.9% when a small amount of target-session data was used. Therefore, this site will no longer list scores alone; it will also state what was held out, what was allowed to vary, and what remains unresolved.

Evaluation family | What is held out | What this supports | What not to overread
within-session | Folds within the same subject and the same session. | It can show whether classes separate under the same-day, same-setup condition and whether preprocessing plus baseline modeling work at all. | Do not treat this as evidence of cross-day robustness or deployable decoding.
cross-session | A different session or day from the same subject. | It can show how long subject-specific features persist across days and how sensitive they are to state changes and re-setup effects. | Do not read this as subject-independent generalization or zero-recalibration operation.
cross-subject | One or more entire subjects. | It can show whether population-level shared structure exists and how far a cold-start decoder might go at initial installation. | Do not equate this score with a decoder optimized for a specific individual.
cross-session adaptation | Another session is held out, then a small amount of target-session data is used for recalibration. | It can show how much performance is recoverable through recalibration and how much room there is for operational adaptation. | Do not describe this as a stable decoder that worked from the beginning without adaptation.
Why this distinction matters scientifically

Musall et al. (2019) showed that neural activity during tasks can be strongly dominated by uninstructed movements, and Egger et al. (2024) showed over a 10-hour EEG day that movement-related decoding changes enough to motivate adaptive decoders. But fast labels are still not the whole state story. de Quervain et al. (1998) and Oei et al. (2007) showed glucocorticoid-linked retrieval impairment and reduced human hippocampal / prefrontal retrieval activity, Barone et al. (2023) plus Birnie et al. (2023) showed circadian and corticosteroid-rhythm control of hippocampal plasticity, and Sherman et al. (2015) showed that memory-linked hippocampal activity varies with circadian-rhythm consistency. Finally, Wilson et al. (2025) showed that long-term BCI operation still requires recurrent recalibration. In other words, even for the same subject, short-term resolution, cross-day tolerance, and long-term operation are different barriers, and state annotation has to split fast labels from slow internal-milieu disclosure before temporal validity is read strongly.

2026-04-03 addendum: state annotation is not one free-text field

The remaining practical weakness on this page was subtler than simple split naming. It already separated cross-session, adaptation, and long-term use, but it still left state annotation too close to one free-text note about movement, arousal, or session ID. The current primary literature does not support that shortcut. Egger et al. (2024) showed over a 10-hour EEG day that movement-related decoding changes enough to motivate adaptive decoders, while de Quervain et al. (1998), Oei et al. (2007), Barone et al. (2023), Birnie et al. (2023), and Sherman et al. (2015) show that the same visible task can still run under different glucocorticoid, circadian, and broader slow internal-milieu regimes. At the same time, Wilson et al. (2025) and Wairagkar et al. (2025) show that recurrent recalibration burden and fast same-day throughput are different again. Therefore, on this site, a temporal claim now has to disclose fast labels such as movement / arousal / task mode separately from any relevant slow internal-milieu disclosure such as time-of-day / circadian phase, recent sleep-wake schedule, glucocorticoid or steroid exposure, and feeding / fasting or glucose-insulin regime, before fixed decoder interval, recalibration burden, or transfer ceiling are interpreted.

If the result is reported as... | You still have to disclose | Stopped claim if missing
same-day online / streaming use | Fast labels such as movement / arousal / task mode, output-path / abstention or fallback policy, and whether the result stayed inside one same-day operating regime. | Do not promote to cross-day stability, fixed-decoder durability, or a generic temporal-validity benchmark.
cross-session | Fast state labels, any relevant slow internal-milieu disclosure such as time-of-day / circadian phase, recent sleep-wake schedule, glucocorticoid or steroid exposure, feeding / fasting or glucose-insulin regime, fixed decoder interval, and whether the setup was reattached, re-referenced, or otherwise changed. | Read only as cross-day tolerance under named conditions, not as durable decoding.
cross-session adaptation | The same temporal fields as above, plus how much target-session data was used, when recalibration happened, and what the pre-adaptation score was. | Do not promote to fixed-decoder stability or low-burden deployment.
longitudinal / chronic use | Fast labels plus slow internal-milieu disclosure, fixed decoder interval, recalibration burden, failure / fallback mode, and participant / site / task transfer ceiling. | Do not promote to generic long-term robustness or deployability.
2026-03-18 addendum: fix the route behind the same score

Even when a within-session score is high, it can still be explained by eye-movement confounds shown by Mostert et al. (2018), the EMG route shown by McFarland et al. (2005), post-onset auditory feedback shown by Chen et al. (2024), identity confounding shown by Chaibub Neto et al. (2019), time-robust resting-state fingerprints shown by Wang et al. (2020) and Di et al. (2021), or subject-driven EEG variation summarized by Gibson et al. (2022). For that reason, this site now overlays the Verification: Specificity & Shortcut Card on dataset cards and baseline results, fixing plausible nuisance routes, auxiliary channels such as EOG / EMG / behavior / audio / metadata, nuisance-only baselines, fingerprint audit, nuisance-regime hold-outs, and the claim that must stop here.

2026-04-04 correction: acquisition distribution needs a recording-frame contract

This site previously stopped more clearly at subject / session fingerprint than at setup effects. That remained too weak for dataset cards. The official EEG-BIDS specification already separates electrodes, channels, coordinate system, and reference scheme. Hu et al. (2018) showed that the measured scalp potential itself changes with reference montage and electrode setup, Melnik et al. (2017) showed that EEG recordings vary not only by subject and session but also by recording system, Xu et al. (2020) showed that cross-dataset EEG decoding is degraded by environmental variability such as amplifier, cap, sampling rate, and filtering, Ceballos-Villegas et al. (2022) explicitly modeled multinational batch effects across studies and devices, and Dong et al. (2024) showed that cross-location comparison required an explicit REST-based offline transform rather than a generic claim that the datasets had already been harmonized. Therefore, this site now treats site / device / reference system / electrode layout / coordinate route / protocol distribution as a recording-frame contract rather than as harmless metadata. Inference from these sources: common-channel intersection, interpolation to a target montage, and REST-based transformation preserve different benchmark objects, so a dataset card now has to name the harmonization branch rather than only say that setup differences were handled.

2026-03-25 addendum: benchmark governance is part of the benchmark

The next practical weakness on this page was that split / leak / harmonization were visible, while benchmark governance could still be treated as administrative detail. The current primary and official sources do not support that shortcut. The official EEG Challenge (2025) homepage states that the original challenge preprint became outdated during execution and that the website plus starter kit should be treated as current. The official rules require disclosure of additional pretraining datasets, pretrained models / fine-tuning method, code submission at inference stage, and a single-GPU 20 GB inference budget. The official leaderboard then disclosed that Challenge 2 samples had not been randomized, allowing contiguous-trial same-subject structure to affect what the ranking meant and forcing separate awards. That warning is aligned with benchmark-side primary sources: Xiong et al. (2025) argued that inconsistent evaluation protocols make cross-model EEG-FM comparisons unreliable, and Liu et al. (2026) showed across 12 open-source foundation models and 13 datasets that ranking depends materially on transfer regime and benchmarking choices. Therefore, when this site reads a leaderboard, challenge result, or foundation-model benchmark, the card must also name benchmark version, split / randomization rule, hidden grouping structure, extra-data / pretrained-checkpoint policy, adaptation regime, inference-stage restrictions, and later organizer postmortems. If those fields are missing, we treat the result only as a qualified benchmark snapshot, not as a stable measure of portable EEG generalization.

2026-03-31 addendum: benchmark name is not yet the benchmark object

The next practical weakness was narrower. Even after benchmark governance became visible, a reader could still talk as if the benchmark name fixed the predicted object, independent prediction unit, grouped hold-out unit, adaptation regime, and operations budget. The current primary and official sources do not support that shortcut. The official EEG Challenge (2025) homepage separates Challenge 1 response-time regression from Challenge 2 subject-level externalizing prediction, the official rules and submission page add an inference-only code-submission workflow under a single-GPU 20 GB budget, Ma et al. (2022) use one five-session motor-imagery dataset to separate within-session, cross-session, and cross-session adaptation, Liu et al. (2026) separate leave-one-subject-out cross-subject evaluation from within-subject few-shot calibration, and Lahiri et al. (2026) show that six benchmark inconsistencies can reverse rankings on identical datasets by up to 24 percentage points. Therefore, on this site, a dataset or leaderboard name is still too coarse until the object / unit / budget matrix is disclosed explicitly.

2026-04-02 addendum: setup diversity is not yet physiology-equivalent transfer

The next weak point on this page was different from benchmark governance. A dataset or benchmark card could already expose site / device / reference / layout diversity and still leave a reader with the impression that a setup-agnostic foundation model had already solved physiology-preserving transfer. The current primary literature does not support that shortcut. Han et al. (2025) target channel-permutation equivariance, Chen et al. (2025) target coordinate-based adaptation across heterogeneous devices and more than 150 layouts, and El Ouahidi et al. (2025) push setup-agnostic pretraining to more than 60,000 hours from 92 datasets and 25,000 subjects. Those papers advance recording-frame compatibility. They still do not prove that different montages, coordinate routes, and reference families already preserve one shared physiology-side representation. Ma et al. (2026) then show that strong EEG foundation models can still generalize poorly when subject-level supervision is limited unless extra adaptation structure is added. Therefore, this page now treats setup diversity, coordinate route, reference family, omitted-channel policy, and label-limited adaptation burden as separate dataset / benchmark-card fields rather than one merged claim of portable generalization.

Case | What the named benchmark or dataset actually predicts | What still has to be frozen separately | Safe ceiling on this site
EEG Challenge 1 (official homepage + rules) | Trial-level response-time regression from the CCD task. | The trial is the scoring unit, but grouped subject structure and the inference-only single-GPU 20 GB budget still have to be disclosed separately. | A named transfer benchmark under a fixed operations budget, not a general EEG decoder verdict.
EEG Challenge 2 (official homepage + leaderboard) | Subject-level externalizing-factor prediction from EEG across multiple paradigms. | The subject is the natural independent unit, and the organizer postmortem shows that hidden contiguous-trial grouping can still change what the benchmark measured. | A subject-invariance benchmark attempt whose meaning remains contingent on grouping policy, not proof that subject invariance is solved.
Ma et al. (2022) (five-session motor-imagery dataset) | The same raw dataset supports within-session, cross-session, and cross-session adaptation evaluation families. | The dataset name alone does not tell you whether target-session data were used, when recalibration happened, or what the pre-adaptation score was. | A useful session-shift practice board, not automatic fixed-decoder durability.
Liu et al. (2026) (foundation-model benchmark matrix) | Cross-model comparison across 13 EEG datasets and nine paradigms under multiple transfer settings. | The paper explicitly separates leave-one-subject-out transfer from within-subject few-shot calibration, so hold-out unit and adaptation regime still have to be named separately. | A transfer-regime comparison board, not one portable score of EEG generalization.
Site rule from this section

From this section onward, dataset cards and baseline results must report at least (1) evaluation family, (2) benchmark object plus independent prediction unit, (3) the independent hold-out unit, (4) raw-recording / window ancestry, (5) subject / session / site / device / reference-system / electrode-layout disjointness together with metadata-only baselines, (6) the channel-map / coordinate-route / reference-family / omitted-channel / sample-rate / filter harmonization log, including whether comparison used common-channel intersection, interpolation to a target montage, REST / another explicit transform, or no cross-setup harmonization, (7) whether target-session, target-subject, or target-site data were used, (8) recalibration amount and timing or extra label budget, (9) for leaderboard or challenge claims, benchmark provenance including version, split / randomization rule, hidden grouping, extra-data / checkpoint policy, inference-stage restrictions or operations budget, and postmortem disclosures, and (10) a stopping claim. If the claim spans more than one session or day, it must additionally disclose the site's Temporal Validity fields: state annotation split into fast labels and slow internal-milieu disclosure, fixed decoder interval, recalibration burden, and transfer ceiling. Scores without this context will be treated as limited L1 decode results, fingerprint-unresolved / acquisition-distribution-unresolved classifiers, or benchmark-object-unresolved / benchmark-governance-unresolved leaderboards rather than evidence of long-term stability or deployability.
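As a practical aid, a card can be screened mechanically against this rule before anyone reads the score. The sketch below uses hypothetical flattened field names that mirror items (1)-(10); it is only a completeness check, not a judgment of content quality.

```python
# Hedged sketch: flag which site-rule fields a dataset / baseline card has not
# yet disclosed. Field names are illustrative; real cards may nest them.
REQUIRED_FIELDS = [
    "evaluation_family", "benchmark_object_and_prediction_unit", "holdout_unit",
    "recording_window_ancestry", "disjointness_and_metadata_baselines",
    "harmonization_log", "target_data_usage", "recalibration_or_label_budget",
    "benchmark_provenance", "stopping_claim",
]

def missing_fields(card: dict) -> list[str]:
    return [field for field in REQUIRED_FIELDS if not card.get(field)]

card = {"evaluation_family": "cross-session", "stopping_claim": "limited L1 decode only"}
print(missing_fields(card))   # everything still undisclosed stays on this list
```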

2026-03-25 addendum: metric semantics are part of the benchmark

The next practical weakness on this page was that split / leak / harmonization and benchmark governance were visible, while metric semantics could still hide behind one headline number. The primary literature does not support that shortcut. Saito & Rehmsmeier (2015) showed why precision-recall views can be more informative than ROC summaries under strong class imbalance. In seizure tasks, Roy et al. (2021) and Scheuer et al. (2021) show that practical evaluation still turns on sensitivity, false alarms per hour or day, event-overlap logic, and latency rather than plain accuracy, while Segal et al. (2023) show that false-alarm control is itself a design target in seizure prediction rather than an afterthought. In sleep staging, Sun et al. (2017) used Cohen's kappa and showed that imbalance in stage proportions changes performance estimates, while Vallat & Walker (2021) show that pooled performance can still hide especially weak N1-stage agreement. Therefore, on this site, a dataset or benchmark card must now also disclose a task-matched metric bundle, not only a split and a score.

Task family | Minimum metric bundle on this site | Overread to block
Cue-locked classification / decoding | Balanced accuracy or macro-F1, confusion matrix, subject-wise aggregation, and calibration / abstention if probabilities are output. | Do not let one accuracy number hide minority-class collapse or confidence miscalibration.
Seizure detection / forecasting | Event sensitivity or recall, false alarms per hour or per day, event-overlap rule, detection / warning latency when relevant, and calibration if thresholds or alarms are used. | Do not let accuracy, AUROC, or one threshold-free summary stand in for clinically usable alarm behavior.
Sleep staging | Cohen's kappa or macro-F1, per-stage recall / F1, and a confusion matrix that keeps minority stages visible. | Do not let pooled accuracy hide weak N1 or transition-stage performance.
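A minimal sketch of these bundles, assuming scikit-learn and tiny placeholder label arrays; the event-level seizure metrics use a simple any-overlap rule, which is one possible convention rather than the field's single standard.

```python
# Hedged sketch: a task-matched metric bundle instead of one headline number.
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, cohen_kappa_score,
                             confusion_matrix, f1_score, recall_score)

y_true = np.array([0, 0, 1, 2, 2, 1, 0, 2, 1, 0])   # e.g. three sleep stages
y_pred = np.array([0, 0, 1, 2, 1, 1, 0, 2, 2, 0])

print("balanced acc :", balanced_accuracy_score(y_true, y_pred))
print("macro F1     :", f1_score(y_true, y_pred, average="macro"))
print("kappa        :", cohen_kappa_score(y_true, y_pred))
print("per-class rec:", recall_score(y_true, y_pred, average=None))
print(confusion_matrix(y_true, y_pred))

# Event-level seizure metrics (sketch): an event counts as detected if any
# alarm interval overlaps it; all non-overlapping alarms are false alarms.
def event_metrics(true_events, alarms, hours):
    hits = sum(any(a0 < t1 and t0 < a1 for a0, a1 in alarms)
               for t0, t1 in true_events)
    false_alarms = sum(not any(a0 < t1 and t0 < a1 for t0, t1 in true_events)
                       for a0, a1 in alarms)
    return hits / max(len(true_events), 1), false_alarms / hours

sens, fa_per_hour = event_metrics([(100, 160)], [(110, 150), (400, 420)], hours=24.0)
print("event sensitivity:", sens, "false alarms/h:", fa_per_hour)
```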

3) Audit to avoid overestimating starter data

The above four datasets are very useful as a practice base for L0-L1, but they are not ground truth for directly verifying the strong claims of EEG source imaging and WBE. What is needed here is not a "usable / unusable" dichotomy, but a fixed statement of which claims each dataset can support.

The last two columns in this section reflect this site's operating logic

The stopping claims and minimum operational rules in the table below are operational boundaries drawn from what is directly observed and annotated in the official dataset descriptions and primary literature. In other words, they are not claims explicitly made by the dataset providers; they are site rules derived from annotation provenance and time fidelity.

Dataset | Things that are easy to verify now | Still difficult to verify | Minimum precautions
EEG Motor Movement/Imagery | A 64-channel, 160 Hz, 109-subject cue-locked task, so it is suited to practicing preprocessing, subject-wise splits, and simple baseline comparison. | Without individual MRI, electrode coordinates, and invasive ground truth, claims of improved ESI accuracy or deep-source reconstruction cannot be audited. | Since the task presents left/right/up/down cues on screen, check for eye-movement, EMG, and cue-locked artifact contributions, and fix the split at the subject level.
CHB-MIT | Suited to learning long-term EEG handling, seizure event detection, and logging of missing data and exclusion reasons. | Because it depends strongly on the clinical conditions of children with intractable epilepsy and drug withdrawal, it cannot serve as a general-purpose benchmark for recognition or source imaging. | Split at the case level and keep the gap and montage summary between records; make the imbalance between seizure and non-seizure periods explicit first.
Sleep-EDF | Suited to learning how to handle state transitions, sleep stage classification, and longitudinal fluctuation using whole-night PSG. | The primary EEG is two leads (Fpz-Cz / Pz-Oz) at 100 Hz, so it is not a benchmark for spatial resolution or source imaging. | The labels are manual scores based on the Rechtschaffen & Kales standard, so state the label mapping explicitly when comparing with newer sleep-staging studies.
TUH EEG Corpus | Suited to learning the difficulties of real-world distributions: large scale, clinical noise, repeated sessions, and physician reports. | It is not a controlled biophysical benchmark, given large variation in channel counts and clinical conditions, so it is unsuitable for direct validation of source-imaging improvements. | Fix patient/session-level splits, a fixed channel subset, montage normalization, and text-leakage prevention when using reports, before anything else.
Dataset | Label/Event origin | Time fidelity | Claim to stop here | Minimum operational rules
EEG Motor Movement/Imagery | The .event files and annotation channel codes T0/T1/T2 mark cue-locked onsets of real or imagined movement. | Cue-onset level for a 160 Hz recording. | Do not promote open-ended thought decoding or subject-independent semantic readout. | Split by subject + run and audit visual-cue and EMG/ocular contributions separately.
CHB-MIT | The per-case summary and .seizure annotations mark seizure intervals during long-term recording; in addition, chb21 is the same subject as chb01. | Expert interval annotations; gaps between files remain. | Do not treat this as gap-free continuous monitoring or count cases as if they were independent subjects. | Split by subject and case chronology rather than by file, and keep gap plus montage summaries in the runbook.
Sleep-EDF | Comes with an R&K hypnogram by a well-trained technician and a 1 Hz event marker. | The whole-night stage annotation is coarse, and although the EEG is 100 Hz, the marker channel is 1 Hz. | Stop claiming that sub-second event onsets or AASM-equivalent labels are self-evident. | If you split by subject-night and map from R&K to AASM, specify the mapping rule.
TUH EEG / TUSZ | TUH has a patient/session hierarchy and a clinician report .txt, while TUSZ goes through a selection pipeline including report keyword search and automatic triage. | Clinical labels at session/file level and expert seizure annotations for a subset. | Do not write report-assisted labels as if they were pure EEG-only benchmark accuracy. | Require patient/session-level splits and a report-usage flag, and do not feed report text into signal-only evaluation.
Why these stop-lines are evidence-backed rather than site style

The official EEG Motor Movement/Imagery dataset description itself fixes the ceiling: 109 volunteers, 64 channels, 14 cue-driven runs, 160 Hz, and T0/T1/T2 onset codes copied into both the annotation channel and .event files. That is enough to audit cue-locked decoding, preprocessing, and subject-split hygiene, but it is also why this site requires a separate audit of visual-cue, overt-movement, and myoelectric / ocular contributions before any stronger readout wording is allowed.

The official CHB-MIT description fixes a different ceiling: 22 pediatric subjects organized into 23 cases, case chb21 being the same subject as chb01, gaps between consecutively numbered EDF files, and seizure boundaries carried by .seizure files together with case summaries. Therefore file-level randomization overstates independence unless subject identity and case chronology remain explicit.
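In practice, the alias can be folded into the split logic before any file-level operation. The sketch below is a minimal illustration that maps case IDs to subject IDs and keeps all cases of a held-out subject on the test side; check the downloaded release for the exact case list.

```python
# Hedged sketch: subject-aware CHB-MIT grouping that merges the documented
# chb01 / chb21 alias before splitting. Case IDs follow the official
# description; verify them against the release you actually downloaded.
CASES = [f"chb{i:02d}" for i in range(1, 24)]      # chb01 .. chb23 (check release)
SAME_SUBJECT = {"chb21": "chb01"}                  # chb21 is the same child as chb01

def subject_of(case_id: str) -> str:
    return SAME_SUBJECT.get(case_id, case_id)

groups = {}
for case in CASES:
    groups.setdefault(subject_of(case), []).append(case)

# Any hold-out keeps all cases of one subject on the same side of the split;
# within a case, files stay in chronological order with gaps preserved.
held_out_subject = "chb01"
test_cases = groups[held_out_subject]              # ['chb01', 'chb21']
train_cases = [c for c in CASES if subject_of(c) != held_out_subject]
print(test_cases, len(train_cases))
```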

The official Sleep-EDF description likewise constrains the interpretation: the PSG uses only Fpz-Cz / Pz-Oz EEG together with EOG and chin EMG, while the event marker and some auxiliary channels are sampled at 1 Hz, and the hypnograms are manual Rechtschaffen & Kales scores. Rosenberg & Van Hout (2013) then showed that even modern sleep-stage scoring reaches only about 82.6% overall inter-scorer agreement, with weaker agreement for N1 and N3. That is why this site stops a Sleep-EDF result at staged-state practice unless label mapping, scoring regime, and time granularity are disclosed explicitly.

For TUH / TUSZ, Obeid & Picone (2016) explain that the clinical corpus pairs EDF recordings with clinician reports, while Shah et al. (2018) describe seizure-rich triage using report keyword search and automatic detectors. Later corpus-maintenance notes documented that an early Neureka 2020 release had non-exclusive subjects across train / dev / blind evaluation and high-frequency seizure annotation problems. Therefore report-usage flags, patient/session ancestry, and benchmark postmortems are treated on this site as part of the result rather than footnotes.

The most important site rule to add now

When introducing starter data, always include (1) label provenance, (2) time granularity, (3) clock domain plus stream-alignment rule, (4) timing-validation class, (5) event semantics, (6) independent split unit, (7) acquisition-distribution summary plus harmonization policy, and (8) stopping claim. A dataset card that does not include this will be considered insufficient as a practical guide for L0.

BIDS is a requirement, but not a ground truth

BIDS/EEG-BIDS is important, but by itself it cannot prove the validity of source imaging or the comparability of cross-dataset decoding. The BIDS specification itself requires EEGReference, SamplingFrequency, and SoftwareFilters, and if *_electrodes.tsv is provided, *_coordsystem.json is also required. However, these are conditions that allow a third party to retrace what was recorded and how, not conditions under which the true source is known or under which reference mismatch, electrode-layout mismatch, and device / filter differences across cohorts are automatically harmonized.
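A minimal sketch of this kind of trace check, using placeholder paths: it only verifies that the sidecar fields named above are present and that electrode positions come with a coordinate system, which is exactly the traceability condition, not a validity proof.

```python
# Hedged sketch: check EEG-BIDS sidecar fields before trusting cross-dataset
# comparison. The path is a placeholder for a real recording.
import json
from pathlib import Path

sidecar = Path("/data/ds_example/sub-01/eeg/sub-01_task-rest_eeg.json")
meta = json.loads(sidecar.read_text())

for key in ("EEGReference", "SamplingFrequency", "SoftwareFilters"):
    print(f"{key:18s} {meta.get(key, 'MISSING')}")

eeg_dir = sidecar.parent
electrodes = list(eeg_dir.glob("*_electrodes.tsv"))
coordsystems = list(eeg_dir.glob("*_coordsystem.json"))
if electrodes and not coordsystems:
    print("*_electrodes.tsv present without *_coordsystem.json: positions are not interpretable")
```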

If you want to claim ESI improvement, you need a different chain of evidence

Please provide at least the following four points.

  • Individual anatomy: Individual MRI/CT or EEG-BIDS recordings with digitized electrode positions and *_electrodes.tsv / *_coordsystem.json
  • Forward model audit: Head model and skull-conductivity sensitivity analysis
  • External standards: Ground truth such as phantoms, simultaneous invasive recording, intracranial stimulation, or TMS-EEG
  • Uncertainty: Report not only point estimates but also localization errors and interval estimates

4) If you want to dig deeper into source imaging, divide the data into three stages

The weak point of this page was that it stopped at saying "starter data is not a direct benchmark for source imaging" and then gave little guidance on what to choose next. Here, we divide the data into three stages according to the strength of the argument each can support.

Stage A: Practice table
  • Representative data: EEG Motor Movement/Imagery, CHB-MIT, Sleep-EDF, TUH EEG
  • Supported argument: L0-L1 reproducibility analysis, QC, split design, baseline comparison
  • Things I can't say yet: ESI localization-error improvement, deep-source claims, strong WBE-oriented reconstruction claims

Stage B: Reconstruction with anatomical constraints
  • Representative data: recordings with individual MRI, digitized electrode positions, and EEG-BIDS *_electrodes.tsv / *_coordsystem.json
  • Supported argument: forward-model audit, comparison of reconstructions near the cortical surface, sensitivity analysis of electrode placement and conductivity assumptions
  • Things I can't say yet: deep-source accuracy guarantees without direct ground truth, generalized unique-recovery claims

Stage C: Direct validation
  • Representative data: Localize-MI (Mikulan et al., 2020), scalp EEG with intracranial stimulation, simultaneous HD-EEG/SEEG, presurgical cohorts with postoperative outcome
  • Supported argument: named validation-class audit, i.e. localization error against known stimulation sites, concordance with simultaneous invasive recording, or clinical concordance with postoperative outcome
  • Things I can't say yet: universal performance guarantees beyond task / cohort / montage
Stage C is not one box

On this site, you should still write which C-stage validation class you used:

  • stimulation ground truth: asks localization error against a known stimulation site and time (Mikulan et al., 2020; Unnwongse et al., 2023).
  • simultaneous invasive recording: asks concordance with concurrent SEEG/ECoG under the same event regime (Hao et al., 2025).
  • postsurgical outcome / clinical concordance: asks whether the source estimate points toward clinically relevant tissue, not whether the source was uniquely observed (Birot et al., 2014).
The most important thing now is the C stage public benchmark

Localize-MI by Mikulan et al. (2020) is a rare data resource that pairs intracerebral stimulation with 256-channel scalp EEG and stereo-EEG, allowing source imaging to be audited directly against known stimulation locations. Hao et al. (2025) reported average localization errors of 14.07 mm for ictal ESI and 17.38 mm for interictal ESI in 29 simultaneous HD-EEG/SEEG cases, indicating that source power and source depth strongly affect accuracy. Jahromi et al. (2026) then added a 3D-printed pediatric deep-source phantom, showing that even phantom validation is not one universal board once the deep epileptic source class and geometry change. Therefore, if you want to improve source imaging, an A-stage starter dataset alone is not enough: you need a C-stage benchmark and a named C-stage validation class.

High density is no longer the only credible route, but it is still not a solver-free shortcut

One remaining simplification on this page was that the entrance read too much like ``HD-EEG is the serious route and low-density EEG is the weak route.'' The current primary literature is narrower than that. Horrillo-Maysonnial et al. (2023) showed that a targeted 33-36-electrode montage reached 54/58 sublobar concordance (93%) against an 83-electrode HD montage, but still showed larger peak-vertex distance for tangential generators. Rong et al. (2025) then showed that a DeepSIF-based approach stayed comparatively stable from 75 down to 16 electrodes, with average spatial dispersions of 7.9/9.0 mm versus 21.9/28.1 mm for sLORETA and 20.0/28.9 mm for LCMV. But these papers do not erase the direct-validation ceiling. Unnwongse et al. (2023) showed that coverage geometry and conductivity assumptions still move localization error in direct validation, and Hao et al. (2025) showed that ictal and interictal ESI differed (14.07 ± 4.62 mm versus 17.38 ± 4.16 mm) and that source depth and spike power still matter. Therefore, this site no longer treats high density versus low density as the right first split. The safer split is named montage / coverage policy + inverse family + source regime + validation class.

Postoperative outcomes can be used, but should not be equated with ground truth

In a systematic review by Mouthaan et al. (2019), the summary sensitivity of electric source imaging in presurgical epilepsy was 82% and the specificity was 53%. In other words, postoperative outcome and SOZ concordance are useful external criteria, but they do not turn source imaging itself into ground truth. Even at Stage C, what you can say is ``how far the error was reduced on this benchmark,'' not ``the source in the brain was read uniquely.''

Practical reading

The first question when selecting data is not ``what is interesting?'' but ``what level of argument do I want to support this time?'' Stage A is sufficient for practicing L0-L1. If you want to claim improvements in source imaging, put the claim on hold until you have audited the head model at Stage B and obtained direct validation at Stage C. If you do proceed to solver comparison, the next section fixes what has to stay the same across methods before any leaderboard is accepted.

4.5) Inverse-problem benchmark board: compare error questions, not solver names

The weakness of this page after the 2026-03-18 validation-class update was that it could still let a reader jump from ``we used C-stage data'' to ``solver X won.'' That is too weak. Michel & Brunet (2019) describe ESI as a pipeline rather than a single algorithm, but the current literature is stricter still. Luria et al. (2024) expose a probabilistic focal-support family, Tong et al. (2025) expose a sparse debiased-inference family, and Feng et al. (2025) target extended-source reconstruction. Those papers do not return the same target object or the same uncertainty object. Pascarella et al. (2023) then showed on an in-vivo focal-source benchmark that ten methods differ not only in best localization error but also in sensitivity to regularization and montage density, while Unnwongse et al. (2023), Hao et al. (2025), and Jahromi et al. (2026) show that direct-validation boards themselves differ by stimulation class, simultaneous invasive reference, and deep-source phantom geometry. Vorwerk et al. (2024) and Vorwerk et al. (2026) further show that forward-model uncertainty is a separate audit rather than a property inherited from a nicer inverse map. Therefore, this site now treats inverse-problem comparison as a board with five fixed axes: validation class, source regime / target object, inverse family / uncertainty object, same-geometry controls, and sensitivity sweep.

Inverse family is not just a solver style label

This page now has to stop another shortcut explicitly. A posterior-support map, a debiased sparse interval, and an extended-source overlap estimate are not three visualizations of one identical hidden object. Luria et al. (2024) return posterior support for focal alternatives, Tong et al. (2025) return debiased estimation / inference for sparse spatial-temporal sources, and Feng et al. (2025) return uncertainty-aware extended-source reconstructions. Therefore, a benchmark on this site must name not only the board, but also which inverse family was used, what target object it aimed to recover, and what uncertainty object it actually returned.

Centre error and source extent are different benchmark objects

Another shortcut still had to be blocked here. A benchmark can be “directly validated” and still score only the geometric center of a source. Feng et al. (2025) explicitly target extended-source reconstruction rather than focal-center localization. Hao et al. (2025) likewise note that their ECD-based distance comparison neglects the spatial extent of the seizure-onset and irritative zones, and argue that future localization work should incorporate source extent in addition to geometric center. Therefore, a solver that wins a focal-site distance board is not automatically the best method for distributed, extended, or propagation-rich sources, and a public benchmark on this site must state whether it scores centre, extent, overlap, or propagation pattern.

Benchmark question: Focal-source localization against a known stimulation site
  • Keep fixed across methods: same raw recording, event window, electrode coordinates, head model, conductivity sweep, source space, and bad-channel mask.
  • Primary metric to publish: distance to the known stimulation site/time, plus the spread across conductivity and regularization settings.
  • What not to overread: do not crown a universal solver for extended or distributed sources from a focal-source board alone.

Benchmark question: Concordance with simultaneous SEEG/ECoG under the same event
  • Keep fixed across methods: same event definition, same reference montage, same source-depth stratification, same preprocessing, and same concordance rule.
  • Primary metric to publish: distance or overlap to the invasive reference, together with source-depth and source-power strata.
  • What not to overread: do not read concordance as direct ground truth for all generators, especially low-amplitude or deep activity.

Benchmark question: Clinical concordance / postsurgical outcome
  • Keep fixed across methods: same SOZ/resection definition, same outcome window, same blinding rule, and same patient inclusion criteria.
  • Primary metric to publish: sensitivity/specificity or concordance against clinical outcome, clearly separated from localization error.
  • What not to overread: do not relabel surgical concordance as precise source-localization ground truth.

Benchmark question: Extended-source reconstruction or multimodal-prior reconstruction
  • Keep fixed across methods: same definition of source extent, same prior source, same anatomical constraints, and the same focal-versus-extended evaluation split.
  • Primary metric to publish: extent overlap or reconstruction error for distributed sources, plus the gain from the added prior.
  • What not to overread: do not compare an extended-source method only on a focal-source leaderboard and call it inferior in general.

Benchmark question: Inverse-family comparison under one named board
  • Keep fixed across methods: same validation class, same raw data, same geometry, same source regime, and an explicit statement of whether each method returns posterior support, sparse debiased intervals, focal centres, or source extent / overlap.
  • Primary metric to publish: a family-typed result table with target object, uncertainty object, primary board metric, and the spread induced by conductivity / hyperparameter sweeps.
  • What not to overread: do not collapse a probabilistic focal family, a sparse debiased family, and an extended-source family into one shared winner or one generic ``better ESI'' claim.
If MNE / beamformer / Champagne disagree

  • Ranking flips when skull conductivity, head model, or electrode geometry is perturbed. What to publish now: show the family-specific ranking under the full sensitivity sweep instead of only the best run. Safe reading on this site: method-conditioned improvement in a bounded geometry regime, not a solver winner in general.
  • A method wins only at one hand-tuned regularization point. What to publish now: publish the localization-error curve or interval across the tested hyperparameter range. Safe reading on this site: best-case performance only; robustness remains unresolved.
  • Dense montages reduce dispersion but not localization error. What to publish now: report localization error and spatial dispersion separately. Safe reading on this site: better concentration of the estimate, not automatic improvement in true-source accuracy.
  • Deep and superficial sources behave differently. What to publish now: stratify results by source depth rather than pooling into one mean. Safe reading on this site: conditional detectability only; do not generalize to deep sources as a whole.
  • A focal-source board and an extended-source board favor different families. What to publish now: keep separate leaderboards for focal, sparse, and extended-source tasks. Safe reading on this site: source-regime-specific strength, not a contradiction that can be collapsed into one number.
  • Probabilistic focal support, sparse debiased inference, and extent-aware reconstruction return different uncertainty objects. What to publish now: publish the family label, target object, and uncertainty object beside the main score instead of hiding them in Methods. Safe reading on this site: board-specific, family-specific evidence only; do not read disagreement as generic noise around one common truth object.
Site rule from this section

A public inverse-problem comparison on this site must now disclose at least (1) validation class, (2) source regime and target object (focal-centre / sparse / extended / propagation-aware), (3) inverse family plus the uncertainty object it returns, (4) same-geometry controls including montage / coverage policy, (5) sensitivity sweep over conductivity and key hyperparameters, (6) inter-method disagreement summary, and (7) the claim that must stop here. Without these fields, a result will be treated as a method illustration or lab-specific pipeline note, not as a reusable benchmark.
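
One way to keep those seven fields from drifting into prose is to record them as a single structured object next to the result; a minimal sketch, with every value a placeholder.

```python
import json

# Sketch of the seven disclosure fields as one record; all values are placeholders.
inverse_benchmark_record = {
    "validation_class": "stimulation ground truth",
    "source_regime_target_object": "focal-centre",
    "inverse_family": "sparse debiased inference",
    "uncertainty_object": "per-source confidence intervals",
    "same_geometry_controls": "shared head model, electrode coordinates, montage / coverage policy",
    "sensitivity_sweep": "skull conductivity x3, regularization grid x5",
    "inter_method_disagreement": "ranking flips under the conductivity sweep",
    "claim_stops_here": "bounded-geometry, focal-source improvement only",
}

print(json.dumps(inverse_benchmark_record, indent=2))
```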

5) Checklist that does not end with just “there is data”

Checklist

  • Version fixed: Are the OpenNeuro snapshot, PhysioNet version, DOI, and acquisition date recorded?
  • Reproduction: Can you write down the acquisition procedure, license, preprocessing conditions, random seeds, and environment?
  • Metadata: Do you have sampling, reference, electrode placement, event definition, clock domain, and a named timing-validation class?
  • Annotation provenance: Did you state clearly whether the label came from an annotation channel, manual scoring, or a report-derived rule, and whether a known scorer-agreement or report-derived ceiling still limits interpretation?
  • QC: Are missing data, noise, and artifacts quantified?
  • Comparison: Is there a baseline, and can it be compared using the same metrics as the evaluation family?
  • Metric bundle: If the task is imbalanced or event-based, are event sensitivity, false alarms, per-stage agreement, or calibration disclosed rather than one headline number? (A minimal sketch follows this checklist.)
  • Benchmark provenance: If the result comes from a challenge or leaderboard, are the benchmark version, split / randomization, hidden grouping, subject exclusivity, extra-data policy, pretrained-checkpoint policy, inference-stage restrictions, and later postmortems fixed?
  • Inverse-problem governance: If source imaging is compared, are the validation class, source regime, inverse family / uncertainty object, geometry / control sweep, and inter-method disagreement disclosed before declaring a winner?
  • Rebuttal evidence: Are there data-leak tests, segment / window ancestry checks, counterfactual tests, and records of failures?
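
The metric-bundle item deserves a concrete shape. A minimal sketch assuming a binary, imbalanced event-detection task and scikit-learn; the label and probability arrays are placeholders and the 0.5 threshold is arbitrary.

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, brier_score_loss,
                             precision_score, recall_score)

# Placeholder predictions for an imbalanced event-detection task.
y_true = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0])
y_prob = np.array([0.1, 0.2, 0.4, 0.1, 0.8, 0.3, 0.2, 0.6, 0.1, 0.5])
y_pred = (y_prob >= 0.5).astype(int)   # arbitrary placeholder threshold

metric_bundle = {
    "event_sensitivity": recall_score(y_true, y_pred),          # sensitivity to the rare class
    "precision": precision_score(y_true, y_pred),                # complements the false-alarm reading
    "false_alarm_rate": float(((y_pred == 1) & (y_true == 0)).mean()),
    "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
    "calibration_brier": brier_score_loss(y_true, y_prob),
}
print(metric_bundle)
```
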
Official benchmark postmortems are part of reproducibility, not footnotes

In practical work, a benchmark page, rules page, submission constraint, and final leaderboard can each fix a different part of what your score means. Brookshire et al. (2024) show that segment-based cross-validation in translational EEG can leak subject information between training and test sets and inflate headline performance. The later TUH/TUSZ maintenance record also documented a public-release case where subject exclusivity and annotation quality had to be repaired after downstream use had already begun. That is why this page now routes EEG foundation-model benchmarking not only through split / leakage hygiene but also through benchmark provenance. If the benchmark uses evolving challenge operations, continue directly to Wiki: EEG foundation models and pretraining and Verification: Pretraining Card before treating the ranking as portable generalization evidence.
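
A minimal leak test for the segment-leakage failure mode described above, assuming scikit-learn and placeholder window/subject arrays: a naive segment-level split almost always mixes one subject's windows across train and test, while a subject-grouped split does not.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, KFold

# Placeholder: 12 windows cut from 4 subjects (3 consecutive windows each).
subjects = np.repeat(["s01", "s02", "s03", "s04"], 3)
X = np.random.randn(12, 64)

def leaks_subjects(train_idx, test_idx):
    """Leak test: does any subject contribute windows to both sides?"""
    return not set(subjects[train_idx]).isdisjoint(subjects[test_idx])

# Segment-level split: subject identity almost always leaks into the test set.
naive = KFold(n_splits=4, shuffle=True, random_state=0)
print(any(leaks_subjects(tr, te) for tr, te in naive.split(X)))                      # typically True

# Subject-grouped split: the hold-out unit is the subject, so the leak test passes.
grouped = GroupShuffleSplit(n_splits=4, test_size=0.25, random_state=0)
print(any(leaks_subjects(tr, te) for tr, te in grouped.split(X, groups=subjects)))   # False
```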

6) Run the L0 minimum loop here

The goal here is not to compete for high accuracy, but to create the smallest loop that a third party can follow in the same way. The minimum pack on this site is no longer just version + BIDS + QC + split + baseline: it now also requires event fidelity, label provenance, an acquisition-distribution summary, derivative lineage, and a stopping claim, so that later accuracy numbers can still be read correctly. A minimal manifest sketch follows the loop list below.

L0 Loop

  • Version: record the OpenNeuro snapshot / PhysioNet version / DOI / acquisition date
  • Input: create a layout that can be placed in BIDS / EEG-BIDS (data + metadata + reference / channels / electrodes / events)
  • Event fidelity: record onset / duration / sample, clock domain, stream-alignment rule, timing-validation class, delay / jitter evidence, and event semantics
  • Label provenance: state whether the target comes from annotation channels, expert scoring, clinician reports, or other rules
  • Quality: record missing data, noise, artifacts, and exclusion reasons in numerical form
  • Processing: fix preprocessing conditions, random seeds, software versions, and the derivative lineage from raw to outputs
  • Evaluation: fix one of within-session / cross-session / cross-subject first, together with the independent hold-out unit and raw-recording / window ancestry
  • Output: even if it is simple, publish at least one baseline indicator that can be compared later, and state what claim must stop here
  • Audit: record failure cases, leak tests, harmonization logs, and pending conditions along with the results
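
A minimal manifest sketch for this loop, with every value a placeholder and the file name l0_manifest.json merely a convention of this example; the point is that each loop item above becomes one named field rather than prose.

```python
import json

# L0-loop manifest sketch; all values are placeholders to be replaced by the actual run.
l0_manifest = {
    "version": {"repository": "PhysioNet", "dataset_version": "1.0.0",
                "doi": "10.xxxx/placeholder", "retrieved": "2026-04-01"},
    "input": "BIDS/EEG-BIDS layout, validator report attached",
    "event_fidelity": {"clock_domain": "amplifier clock", "alignment_rule": "trigger channel",
                       "timing_validation_class": "digital trigger capture",
                       "event_semantics": "cue onset"},
    "label_provenance": "annotation channel",
    "quality": {"bad_channels": 2, "bad_segments_pct": 4.1,
                "exclusions": ["sub-03: >30% artifact"]},
    "processing": {"software": "mne==1.6 (placeholder)", "seed": 7,
                   "derivative_lineage": "raw -> filtered -> epochs"},
    "evaluation": {"family": "cross-subject", "holdout_unit": "subject"},
    "output": {"baseline": "band power + logistic regression",
               "stopping_claim": "L0 practice only"},
    "audit": {"leak_tests": "subject-disjoint check passed", "failures": []},
}

with open("l0_manifest.json", "w") as f:
    json.dump(l0_manifest, f, indent=2)
```
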
2026-03-20 addendum: the public L0 pack is now synchronized

The weakness of this page was that the public checklist had become stricter than the wiki page readers actually use when assembling an L0 submission. That gap is now closed. If you want the submission shape itself, not only the route on this page, go directly to Wiki: Minimum artifact pack for L0. The synced pack now fixes dataset identity, event fidelity, label provenance, evaluation family + hold-out ancestry, acquisition-distribution summary, derivative lineage, and stopping claim as first-class deliverables.

Where it is easy to get stuck, and what to fix first

  • ``I think the dataset name alone makes it reproducible'': fix the OpenNeuro snapshot tag and PhysioNet version first, and leave the acquisition date and DOI in the runbook.
  • Stalling at the BIDS conversion: before loading the actual data, first create the directory skeleton, dataset_description.json, participants.tsv, and events.tsv.
  • Unsure how much QC to keep: it is safest to fix just the four items (missing data, noise, artifacts, and reason for exclusion) and extend them later.
  • Cannot settle on a baseline: prefer a simple, easy-to-reproduce model such as two-class motor imagery or a spectral summary over a complex model.
  • Getting lost in train/test: first decide whether the comparison is within-session, cross-session, or cross-subject, then lock the split unit at the subject or session level.
Step 0: Freeze the version

The dataset name alone is not enough. OpenNeuro manages snapshots with semantic-version Git tags, and PhysioNet also displays and cites dataset versions for each project. Therefore, the first runbook should record snapshot / version / DOI / retrieval date, not just the dataset name.
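
A minimal sketch of that freeze, assuming the raw files sit in a local raw_download directory; the version string, DOI, and paths are placeholders, and the per-file SHA-256 hashes are an optional extra rather than something the repositories require.

```python
import hashlib
import json
from pathlib import Path

# Freeze dataset identity: version record plus per-file checksums (all values placeholders).
record = {"source": "PhysioNet", "version": "1.0.0",
          "doi": "10.xxxx/placeholder", "retrieved": "2026-04-01", "files": {}}

for f in sorted(Path("raw_download").glob("**/*.edf")):
    record["files"][str(f)] = hashlib.sha256(f.read_bytes()).hexdigest()

Path("runbook_version.json").write_text(json.dumps(record, indent=2))
```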

Step 1: Create the BIDS skeleton first

Even if the contents are not complete at first, just fixing the layout reduces rework. If you create file-name and metadata templates on the assumption that they will be run through a validator, subsequent QC and comparison become much easier.
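
A minimal hand-made skeleton sketch; the dataset name, BIDSVersion string, and subject / task labels are placeholders, and once real recordings are converted a tool such as MNE-BIDS can write these files for you.

```python
import json
from pathlib import Path

# Hand-made BIDS skeleton; contents are minimal placeholders the validator will still check.
root = Path("bids_root")
(root / "sub-01" / "eeg").mkdir(parents=True, exist_ok=True)

(root / "dataset_description.json").write_text(json.dumps(
    {"Name": "L0 practice dataset", "BIDSVersion": "1.9.0", "DatasetType": "raw"}, indent=2))

(root / "participants.tsv").write_text("participant_id\tage\tsex\nsub-01\tn/a\tn/a\n")

(root / "sub-01" / "eeg" / "sub-01_task-motorimagery_events.tsv").write_text(
    "onset\tduration\ttrial_type\tsample\n0.0\t0.0\tcue_left\t0\n")
```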

Step 2: Eliminate standards violations first with the Validator

Eliminate machine-detectable problems at an early stage. Passing the BIDS Validator is not a sufficient condition for research, but it is close to the minimum requirement for sharing.
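
A sketch of capturing the validator report next to the QC logs; the CLI name and flags depend on which validator build is installed (the legacy npm package and the newer Deno-based validator differ), so check --help before relying on this exact invocation.

```python
import subprocess

# Run whichever bids-validator CLI is installed and keep its output with the QC logs.
result = subprocess.run(
    ["bids-validator", "bids_root"],
    capture_output=True, text=True,
)
print(result.stdout)
with open("validator_report.txt", "w") as f:
    f.write(result.stdout + result.stderr)
```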

Step 2.5: Separate and fix loader and benchmark

MNE-BIDS is a tool that helps with BIDSPath handling, data loading, and metadata extraction, while MOABB fixes the paradigm and evaluation family. There is a difference between being able to read data and being able to make fair comparisons. In particular, MNE-BIDS raises an exception when asked to write back modified or preloaded data, so it is safer to treat preprocessed data as derivatives with explicit lineage.
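
A minimal sketch of the two roles side by side: MNE-BIDS reads one recording back out of the BIDS tree, while MOABB fixes the paradigm and evaluation family around a published dataset. The subject / task labels, the bids_root path, and the choice of PhysionetMI with a CSP + logistic-regression pipeline are illustrative assumptions, not a recommendation.

```python
from mne.decoding import CSP
from mne_bids import BIDSPath, read_raw_bids
from moabb.datasets import PhysionetMI
from moabb.evaluations import WithinSessionEvaluation
from moabb.paradigms import LeftRightImagery
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Loader role: read one recording back out of the BIDS tree.
# The BIDSPath must point at a real recording in your tree; labels here are placeholders.
bids_path = BIDSPath(subject="01", task="motorimagery", root="bids_root")
raw = read_raw_bids(bids_path)

# Benchmark role: MOABB fixes the paradigm and the evaluation family.
paradigm = LeftRightImagery()
evaluation = WithinSessionEvaluation(paradigm=paradigm, datasets=[PhysionetMI()])
results = evaluation.process(
    {"csp_lr": make_pipeline(CSP(n_components=4), LogisticRegression())})
print(results[["dataset", "subject", "score"]].head())
```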

Step 3: Leave QC logs as numerical values instead of waveforms

With just the raw waveform, it is difficult for a third party to reconstruct what went wrong and what was excluded. The core of L0 is to record bad channels, bad segments, event synchronization, stimulus logs, and response logs as numerical values together with the thresholds used.
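
A minimal sketch of the four fixed items as numbers; the data array stands in for one recording and the thresholds are arbitrary placeholders, not site policy.

```python
import json
import numpy as np

# Placeholder array standing in for one recording: channels x samples.
data = np.random.randn(64, 160 * 600)
flat_thresh_uv, noisy_thresh_uv = 0.1, 200.0   # placeholder thresholds

peak_to_peak = data.max(axis=1) - data.min(axis=1)
qc_log = {
    "n_channels": int(data.shape[0]),
    "flat_channels": int((peak_to_peak < flat_thresh_uv).sum()),
    "noisy_channels": int((peak_to_peak > noisy_thresh_uv).sum()),
    "missing_samples_pct": float(np.isnan(data).mean() * 100),
    "excluded": [],                              # record reasons, not just counts
    "thresholds": {"flat_uv": flat_thresh_uv, "noisy_uv": noisy_thresh_uv},
}
with open("qc_log.json", "w") as f:
    json.dump(qc_log, f, indent=2)
```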

Step 4: Fix only one baseline

Rather than chasing SOTA, first put down a comparison axis that is easy to reproduce. Having an initial baseline lets you see what actually improved even after you update the preprocessing or the model.
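
One way to fix that first baseline, assuming epoched placeholder data: log band power per channel plus logistic regression, evaluated with a subject-grouped split so the score already matches the evaluation-family rule above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder epochs: trials x channels x samples, two classes, 4 subjects.
rng = np.random.default_rng(7)
epochs = rng.standard_normal((40, 64, 320))
labels = rng.integers(0, 2, size=40)
subjects = np.repeat(["s01", "s02", "s03", "s04"], 10)

# Simple, reproducible feature: log power per channel (variance as a proxy).
features = np.log(epochs.var(axis=2))

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(baseline, features, labels,
                         groups=subjects, cv=GroupKFold(n_splits=4))
print(f"baseline accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```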

Check item, the L0 minimum bar, and where to go back when you fall short

  • Data version: the snapshot / version / DOI / acquisition date is fixed. Go back to: Wiki: Standards/Repositories/Validators/Benchmarks
  • Data structure: the data can be stored in BIDS format. Go back to: The shortest route to shareable data
  • Quality control: QC logs and exclusion criteria remain. Go back to: Wiki: Event synchronization and observation logs
  • Comparability: one baseline and the evaluation family / train-test rules are fixed. Go back to: Wiki: Data splits and data leaks
  • Prepared to share: execution steps, environment, and failure examples can be handed to a third party. Go back to: Verification infrastructure

7) The shortest route to "shareable data" with Mind-Upload

Mind-Upload's goal is not just to collect data, but to leave it in a form that a third party can verify. The shortest route to that end is to align the data with BIDS/EEG-BIDS.

Verification Commons

Click here for the blueprint for "Standards + Storage + Evaluation".

View verification platform →

8) References and official pages