The shortest distinction
Standards fix the raw-data layout, repositories fix where a versioned dataset is published, validators check schema compliance, derivative specifications fix how processed outputs stay linked to their sources, workflow / model recipes fix how outputs are produced, and benchmark harnesses plus benchmark provenance / governance fix what the score means. Even though they all look like "research infrastructure," their roles are different.
The old version of this page still let a benchmark name sound like a stable comparison label once the harness was known. That is too weak. The official EEG Challenge (2025) homepage states that the original challenge preprint became out of date during execution and that the website should be treated as current. The official rules then fix downsampling after 0.5-50 Hz filtering, additional-data disclosure, pretrained-model disclosure, and a single-GPU 20 GB inference-stage constraint. The official submission page further fixes that it is an inference-only code submission, and the final leaderboard later disclosed a Challenge 2 randomization error and separated the final awards accordingly. Recent benchmark papers make the same point in more general form: Xiong et al. (2025/2026) argue that inconsistent evaluation protocols make EEG-foundation-model comparisons unreliable, and Liu et al. (2026) show across 12 open-source foundation models and 13 datasets that the reading of transfer quality depends materially on protocol choice. Therefore, on this site, benchmark provenance / governance is treated as a first-class part of reproducibility rather than as after-the-fact administration.
The remaining weakness on this page was subtler. It still let BIDS + repository + benchmark name sound almost sufficient for reproducibility. Current official and primary sources do not support that reading. Markiewicz et al. (2021) show that OpenNeuro plus BIDS helps freeze a shareable, versioned raw input. But the BIDS specification separately requires derived datasets to carry GeneratedBy and SourceDatasets, and derivative files to keep explicit Sources. Gorgolewski et al. (2017) show that BIDS Apps solve deployment and interface portability, not automatic benchmark meaning; MNE-BIDS-Pipeline explicitly exposes a text-file configuration, cached intermediate steps, and summary reports; BIDS Stats Models defines a separate machine-readable model recipe; and Maumet et al. (2016) show that result provenance itself can be packaged as a separate standardized object. Therefore, on this site, derivative specification, workflow / model recipe, and execution / result provenance are now treated as distinct layers rather than as details hidden inside "BIDS" or "benchmark."
Why consider separately
If you confuse these layers, you'll get the wrong impression, such as, "There's a benchmark because you uploaded it to OpenNeuro," "Because it's BIDS, the processed outputs are already traceable," "Because the pipeline name was given, the recipe is already frozen," or "Because MOABB was named, the benchmark meaning is already fixed." In reality, the tasks of aligning raw data, naming derivative lineage, freezing workflow and model recipes, defining comparison rules, and freezing the exact benchmark governance are different things.
First, separate terms
| Term | What it does | Example |
|---|---|---|
| standard (raw layout) | Fix how files are placed and named and how metadata is written, so these are the same everywhere. | BIDS, EEG-BIDS. |
| Storage/shared infrastructure (repository) | Publish your data so others can retrieve it. | OpenNeuro, PhysioNet, PDB, etc. |
| Validator | Mechanically inspects for standard violations and missing metadata. | BIDS Validator. |
| derivative specification / lineage | Keep processed outputs separate from raw and link them back to their direct sources and generating pipeline. | BIDS Derivatives, GeneratedBy, SourceDatasets, Sources. |
| loader / converter | Read or write datasets in a standardized way and bridge them into the analysis library. | MNE-BIDS. |
| workflow / model recipe | Fix the ordered steps, config values, optional branches, grouping logic, and analysis graph that generate the outputs. | MNE-BIDS-Pipeline config, BIDS Apps CLI, BIDS Stats Models JSON. |
| execution / result provenance | Record which software, version, container, code, and activity actually produced the reported outputs and reports. | NIDM-Results, pipeline reports, DataLad / BABS audit trail. |
| benchmark harness | Fix tasks, splits, metrics, and prohibitions so that results are comparable. | MOABB, MLPerf, ImageNet-style operation. |
| benchmark provenance / governance | Fix which exact rule snapshot, split construction, hidden grouping, extra-data policy, pretrained-checkpoint policy, execution constraints, and postmortems defined the score. | Official challenge homepage / rules / submission / leaderboard, benchmark postmortems. |
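The derivative specification / lineage row can be made concrete with a small sketch. Below is a minimal, hypothetical `dataset_description.json` for a derived dataset, following the BIDS `GeneratedBy` and `SourceDatasets` fields; the dataset name, pipeline name, versions, and DOI are illustrative placeholders, not real projects.

```python
import json

# Minimal sketch of a derivative-dataset description with explicit
# lineage fields, as required by the BIDS Derivatives layer.
# All names, versions, and the DOI below are illustrative placeholders.
derivative_description = {
    "Name": "example-cleaned-eeg",        # hypothetical derivative name
    "BIDSVersion": "1.9.0",
    "DatasetType": "derivative",
    "GeneratedBy": [
        {"Name": "example-pipeline", "Version": "0.1.0"}
    ],
    "SourceDatasets": [
        {"DOI": "doi:placeholder", "Version": "1.0.0"}
    ],
}

text = json.dumps(derivative_description, indent=2)
print(text)
```

With `DatasetType` set to `derivative` and the lineage fields present, a later reader can trace the processed output back to a specific source-dataset version instead of mistaking it for raw data.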
In practice, the short labels are not enough, so we look at 11 layers
| layer | Representative examples | What to fix here | No guarantees yet |
|---|---|---|---|
| 1. Standards | BIDS, EEG-BIDS | File names, required metadata, coordinate system, events/channels/electrodes format. | Train/test splits and metrics are not determined. |
| 2. Public version | OpenNeuro snapshot, PhysioNet version | A third party can retrieve the same input and know which version was obtained. | The version does not necessarily fix the benchmark split or preprocessing conditions. |
| 3. Event semantics/extension schema | HED, Motion-BIDS | trial_type meaning, event tags, additional sensor metadata, and coordinate frames. | Clock alignment and benchmark splits are not determined automatically. |
| 4. Synchronization middleware | LSL | Time alignment of multiple streams, clock-offset estimation, and stream metadata. | Does not guarantee ground truth for device-side delay or stimulus presentation delay. |
| 5. Derivative specification / lineage | BIDS Derivatives, GeneratedBy, SourceDatasets, Sources | Keep processed outputs separate from raw and make source ancestry plus generating pipeline explicit. | A clean or epoched file can still be overread as self-explanatory if lineage is missing. |
| 6. Conversion/Reading | MNE-BIDS | BIDSPath, metadata extraction, reading path to MNE, format conversion when necessary. | Comparison metrics and evaluation families are not fixed. |
| 7. Workflow / model recipe | MNE-BIDS-Pipeline config, BIDS Apps CLI, BIDS Stats Models JSON | Fix step order, skipped or optional stages, model graph, and config values that determine derived outputs. | The same raw input can still produce different derivatives when the recipe changes. |
| 8. Execution / result provenance | NIDM-Results, pipeline reports, DataLad / BABS run records | Record which software, version, container, commands, and activities actually produced the outputs being reported. | A figure or score table can still be detached from the software state that created it. |
| 9. Benchmark harness | MOABB | Paradigm, evaluation family, statistical comparison, cross-dataset evaluation of the same pipeline. | Current rule snapshot, hidden grouping, extra-data policy, and execution constraints are not fixed unless governance documents are also frozen. |
| 10. Benchmark provenance / governance | Official homepage, rules page, submission page, leaderboard / postmortem | Current benchmark version, split / randomization, hidden grouping, extra-data and pretrained-model policy, inference-stage restrictions, and later corrections. | This still does not prove target-signal specificity, source-imaging truth, or operational safety outside the stated benchmark. |
| 11. Learner / runtime environment | Linear classifier, Riemannian pipeline, deep model, container image, lockfile | Which estimator was run with which preprocessing, random seeds, runtime image, and hyperparameters. | If layers 1-10 above are not fixed, the comparison is not fair. |
OpenNeuro implements each snapshot as a git tag carrying a semantic version, and PhysioNet explicitly cites a version for each project. Therefore, on this site, we include not only the dataset name but also the snapshot / version / DOI or persistent URL in the artifact. Additionally, BIDS is a raw-data container, BIDS Derivatives is the processed-data layer, HED/Motion-BIDS is semantics and additional metadata, LSL is synchronization, MNE-BIDS is an input/output path, MNE-BIDS-Pipeline or a BIDS App is a workflow recipe, BIDS Stats Models is a model recipe, NIDM-Results is result provenance packaging, and MOABB is a comparison rule. Please don't mix these up and conclude that "since I used BIDS, the benchmark is covered" or "since I installed LSL, the hardware delay is solved."
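The version-pinning rule above can be enforced mechanically. The helper below is a hypothetical sketch (the function name and the accession string are made up): it refuses to emit an input ID unless a snapshot/version or DOI is supplied, so an unpinned dataset name can never silently enter the artifact.

```python
# Hypothetical helper: a dataset reference without a snapshot tag,
# version, or DOI is rejected as ambiguous rather than recorded.
def pinned_input_id(name, snapshot=None, doi=None):
    """Return a citable input-ID string, or raise if unpinned."""
    if snapshot is None and doi is None:
        raise ValueError(f"{name}: snapshot/version or DOI required")
    parts = [name]
    if snapshot:
        parts.append(f"snapshot={snapshot}")
    if doi:
        parts.append(f"doi={doi}")
    return " ".join(parts)

# "ds-example" and "1.0.5" are placeholder values for illustration.
print(pinned_input_id("ds-example", snapshot="1.0.5"))
```

The same dataset name with two different snapshot tags then produces two distinct input IDs, which is exactly the distinction the table above asks for.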
Another weakness was to let BIDS + HED + LSL sound like a complete multimodal validity package. That is too strong. Kothe et al. (2025) made clear that LSL solves synchronized stream transport rather than device-side delay truth. Wei et al. (2020) showed that EEG-fMRI fusion remains model-conditioned, and Vafaii et al. (2024) plus Chen et al. (2025) showed that simultaneous multimodal recordings can retain modality-specific structure even when acquired together. Therefore, on this site, standards and synchronization infrastructure are necessary inputs to a multimodal study, but a separate Fusion Card is still required before the claim ceiling is raised.
MOABB correctly fixes evaluation families such as within-session, cross-session, and cross-subject, but current EEG challenge operations show that this is only one part of the benchmark object. The official EEG Challenge rules fixed the filter / downsample route, additional-data policy, pretrained-model disclosure, and inference-stage memory budget, the official submission page fixed that the competition was inference-only, and the final leaderboard disclosed a non-randomized Challenge 2 split that changed the prize structure. Therefore, on this site, a benchmark claim is incomplete unless the harness and the current governance / provenance documents are frozen together.
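The distinction between evaluation families can be illustrated without any benchmark library. The sketch below is plain Python, not MOABB code: it shows why a cross-session split answers a different question than a within-session one, because whole sessions are held out and no test-session trial ever appears in training.

```python
# Illustrative contrast between evaluation families; a hand-rolled
# sketch with toy trial records, not MOABB's implementation.
trials = (
    [{"subject": "S1", "session": "A", "trial": i} for i in range(4)]
    + [{"subject": "S1", "session": "B", "trial": i} for i in range(4)]
)

def cross_session_folds(trials):
    """Hold out one whole session per fold."""
    sessions = sorted({t["session"] for t in trials})
    for held_out in sessions:
        train = [t for t in trials if t["session"] != held_out]
        test = [t for t in trials if t["session"] == held_out]
        yield train, test

# Every fold keeps train and test sessions disjoint; a within-session
# split would instead mix trials from the same session on both sides.
for train, test in cross_session_folds(trials):
    assert {t["session"] for t in train}.isdisjoint(
        {t["session"] for t in test}
    )
```

A score from this family estimates transfer across recording sessions; shuffling trials freely would estimate something easier, which is why the two numbers must never share one leaderboard column.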
Benchmark provenance is part of reproducibility
The practical weakness on this page was to stop at benchmark harness. Recent official and primary sources do not support that shortcut. The official EEG Challenge homepage explicitly says the original preprint became outdated during execution, the official rules fix operational constraints, the official submission page narrows the executable object to inference-only code, and the final leaderboard discloses a split-construction failure that changed how the ranking had to be interpreted. In parallel, Xiong et al. (2025/2026) and Liu et al. (2026) both argue that fair EEG-foundation-model comparison requires standardized protocols and that rankings still depend materially on evaluation choices. Therefore, this site now separates benchmark provenance / governance from the harness name itself.
| Benchmark field | What it fixes | Unsafe shortcut if omitted |
|---|---|---|
| Current rule snapshot | Which homepage / rules / starter-kit state was actually in force when the run was made. | Reading an outdated proposal paper as the final benchmark definition. |
| Split / randomization / hidden grouping | Whether trial order, subject contiguity, session grouping, or other hidden structure could be exploited. | Reading a leaderboard as if it reflected portable subject-invariant generalization by default. |
| Extra-data / pretrained-model policy | Whether external corpora, checkpoints, or fine-tuning routes were allowed and how they had to be disclosed. | Comparing runs as if they were trained under the same information budget. |
| Inference-stage execution constraints | Whether the object being compared was a full training pipeline, an inference-only submission, or a memory / hardware-bounded executable. | Treating challenge rank as a pure representation-learning comparison independent of systems constraints. |
| Postmortem / correction status | Whether organizers later disclosed split flaws, score-definition changes, or prize-structure revisions. | Reading an early leaderboard snapshot as final scientific truth. |
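One way to make these fields operational is to freeze them into a single canonical record and derive an identifier from it. The sketch below is a hypothetical scheme (all field names and values are placeholders, not real challenge documents): two scores are comparable only when their governance hashes match.

```python
import hashlib
import json

# Hypothetical "Benchmark Governance ID": freeze the governance fields
# from the table above into one canonical record and hash it. All
# values below are illustrative placeholders.
governance = {
    "rules_url": "https://example.org/challenge/rules",
    "rules_retrieved": "2025-11-01",
    "split_policy": "per-subject, randomized",
    "extra_data_policy": "disclosed external corpora only",
    "pretrained_policy": "checkpoints must be declared",
    "inference_constraints": "single GPU, 20 GB, inference-only",
    "postmortem_status": "none at time of run",
}

# Canonical serialization (sorted keys) makes the hash deterministic.
canonical = json.dumps(governance, sort_keys=True).encode()
governance_id = hashlib.sha256(canonical).hexdigest()[:12]
print(governance_id)
```

If the organizers later publish a postmortem or revise a rule, the hash changes, which forces the score to be re-labeled rather than silently reinterpreted.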
Looking at the example of EEG
| stage | What to do there |
|---|---|
| 1. Align to standards | Align EEG files, events.tsv, channels.tsv, and metadata to BIDS format. |
| 2. Add event semantics | Specify trial_type, condition description, HED tags, manual scoring rule, and report usage flag. |
| 3. Audit synchronization | Record the clock domains, the synchronization method (LSL / TTL / photodiode), and the measured delay / jitter / drift. |
| 4. Check with Validator | Mechanically identifies violations of standards and missing items. |
| 5. Publish to storage | Put it on a shared platform like OpenNeuro or PhysioNet so it can be retrieved by third parties. |
| 6. Freeze derivative lineage | Keep preprocessed outputs, epochs, features, and reports as derivatives with explicit source ancestry. |
| 7. Freeze workflow / model recipe | Record the pipeline config, optional branches, model graph, and software settings that generated the outputs. |
| 8. Compare with benchmarks | Compare models with the same train/test split, the same metrics, and the same baseline. |
| 9. Freeze benchmark provenance | Record the active rules page, split/randomization policy, extra-data / pretrained-model policy, inference-stage restrictions, and postmortem status together with the score. |
| 10. Freeze runtime / result provenance | Record software versions, container or lockfile, commands, reports, and result bundles so the published figure or score can be traced back to the run that made it. |
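The ten stages above can be turned into a mechanical completeness check. The sketch below is hypothetical (the key names simply mirror this table): a run record counts as frozen only when every layer has a non-empty entry, and the gaps are reported by name.

```python
# Hypothetical completeness check over the ten stages in the table;
# key names are this page's labels, not a standardized schema.
REQUIRED = [
    "standard", "event_semantics", "sync_audit", "validator",
    "storage", "derivative_lineage", "workflow_recipe",
    "benchmark_harness", "benchmark_provenance", "runtime_provenance",
]

def missing_layers(record):
    """Return the names of required layers that are absent or empty."""
    return [k for k in REQUIRED if not record.get(k)]

# A partially filled record: only two of the ten layers are fixed.
record = {"standard": "BIDS 1.9.0", "storage": "OpenNeuro snapshot"}
print(missing_layers(record))
```

Running this at submission time turns "we forgot to freeze the split policy" from a post-publication discovery into a pre-publication error message.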
Just by aligning to the standard, there is still no "rule for comparison." But even if there is a benchmark, the comparison can still break if derivative lineage, workflow recipe, or runtime provenance are left implicit. All of those layers matter.
What is missing?
| What is missing | Problems that are likely to occur |
|---|---|
| Standards | File names and metadata differ from person to person, and reanalysis stalls at the entry point. |
| Storage place | Even if others know the dataset exists, they cannot obtain or reuse it, and the circle of comparison does not expand. |
| Validator | Standard violations are noticed late, and failures surface just before sharing or during reanalysis. |
| Derivative specification / lineage | Processed outputs can be mistaken for raw or for each other, and later readers cannot tell which source files or branches generated them. |
| Workflow / model recipe | The same pipeline name can hide different optional steps, configs, and model graphs, so the rerun does not actually reproduce the same analysis. |
| Execution / result provenance | A figure, table, or derivative can no longer be traced back to the exact software, version, container, and commands that created it. |
| Benchmark | Everyone evaluates with different splits and metrics, and the meaning of "winning" fluctuates. |
| Benchmark provenance / governance | The same benchmark name hides different rule snapshots, hidden grouping, inference limits, or later corrections, so the score is overread. |
Why raw files alone are not enough
Just having a waveform file is not enough for standards or benchmarks. At the very least, without event definitions, stimulus logs, synchronization information, QC logs, and exclusion criteria, it will be difficult to rerun the same task.
Being "publicly available" and being "comparable" are two different things. Publication is the first step, comparability is the next step in design.
Common confusion
| Things I tend to say | More accurate paraphrase |
|---|---|
| “Since we chose BIDS, there is a benchmark” | BIDS is an input format standard, not a comparison rule itself. |
| “It was standardized because it was placed in OpenNeuro” | Being posted in a repository does not guarantee that the standards and metadata are sufficient. |
| "We used the same input because the dataset name is the same" | Unless you fix OpenNeuro snapshot or PhysioNet version, it cannot be said that it is the same input. |
| "The benchmark name alone fixes what the score means" | You still need the active rules snapshot, split / randomization / hidden grouping policy, extra-data / pretrained-model policy, execution constraints, and postmortem status. |
| "Validator passed, so it's enough for research" | Validator is a formal check and does not guarantee the validity of the research or the strength of the benchmark. |
| "It became a benchmark because I could read it with MNE-BIDS" | MNE-BIDS is a reading/conversion aid; fixing evaluation families and comparison statistics is a separate task. |
| "Because the data are in BIDS, the processed outputs are already self-explanatory" | Raw BIDS and BIDS derivatives are separate layers, and processed outputs still need explicit lineage and source ancestry. |
| "Naming MNE-BIDS-Pipeline or a BIDS App already freezes the workflow" | The pipeline name alone is still too coarse; config values, skipped stages, model recipe, and software version have to be frozen as well. |
| "A containerized run already captures what the score means" | Container and runtime pin help software portability, but benchmark harness and benchmark governance still remain separate objects. |
| “Event semantics are fixed because there is `events.tsv`” | events.tsv is a container for time and columns, and condition meanings and scorer rules must be fixed separately in events.json, HED, and auxiliary logs. |
| "Using LSL even solved the hardware delay" | LSL helps with stream synchronization, but device-side delay for display/audio/amplifier requires separate measurement. |
| “MOABB scores can be directly compared across tasks” | Within-session, cross-session, and cross-subject are different evaluation families and cannot be treated equally. |
| "It's safe to convert preprocessed files back to raw BIDS" | BIDS and MNE-BIDS basically assume unprocessed or minimally processed data, and it is safer to treat modified data as derivatives by specifying the lineage. |
| "We won the benchmark, so it's good enough for actual operation" | Benchmark is a yardstick for comparison and does not automatically guarantee actual operation or the establishment of L4/L5. |
| "The challenge proposal paper is the final benchmark specification" | Execution-phase websites, rules, starter kits, and final postmortems can supersede the original proposal and must be frozen with the result. |
The minimum 7 IDs to fix
| ID | What I want at least | What happens when it is missing |
|---|---|---|
| Input ID | OpenNeuro snapshot tag, PhysioNet version, DOI, acquisition date. | Mixing different versions under the same dataset name makes the run impossible to repeat. |
| Schema ID | The BIDS/EEG-BIDS version, the Validator version, and the reasons for any warnings left in place. | Standard differences cannot be distinguished from implementation differences. |
| Derivative ID | Derived dataset name, GeneratedBy, SourceDatasets, and direct Sources lineage. | Preprocessed outputs can be confused with raw or with another derivative branch. |
| Workflow ID | MNE-BIDS-Pipeline / BIDS App / config file / model-graph version and settings. | Even with the same input version, a different recipe can still generate a different result. |
| Evaluation ID | Within-session / cross-session / cross-subject, metrics, split seed, and prohibitions. | The meaning of the score drifts and fair comparison breaks. |
| Benchmark Governance ID | Rules URL or archived snapshot, split / randomization policy, hidden grouping note, extra-data / pretrained-model policy, inference-stage restrictions, and postmortem status. | The benchmark title will stay too coarse, and the same leaderboard name may hide different scientific meanings. |
| Runtime / Result Provenance ID | Software version, container or lockfile, command log, and result bundle or report identifier. | The published figure or score cannot be traced back to the exact run that created it. |
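The seven IDs can be carried together as one frozen artifact record. The dataclass below is a hypothetical sketch: the field names mirror this table, and every value is a placeholder showing only the kind of string each field would hold.

```python
from dataclasses import asdict, dataclass
import json

# Hypothetical artifact record mirroring the seven IDs above;
# all values are illustrative placeholders, not real identifiers.
@dataclass(frozen=True)
class ArtifactRecord:
    input_id: str
    schema_id: str
    derivative_id: str
    workflow_id: str
    evaluation_id: str
    benchmark_governance_id: str
    runtime_provenance_id: str

record = ArtifactRecord(
    input_id="ds-example snapshot=1.0.2 doi=placeholder",
    schema_id="BIDS 1.9.0 / validator version + warnings logged",
    derivative_id="example-cleaned GeneratedBy=example-pipeline@0.1.0",
    workflow_id="pipeline config hash=placeholder",
    evaluation_id="cross-session, balanced accuracy, seed=42",
    benchmark_governance_id="rules snapshot hash=placeholder",
    runtime_provenance_id="container digest=placeholder",
)
print(json.dumps(asdict(record), indent=2))
```

Because the dataclass is frozen and every field is required, an artifact missing one of the seven IDs fails at construction time instead of surfacing as an untraceable score later.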
9 questions when reading strong arguments
- What is the input standard? Check whether the format is consistent (BIDS, etc.).
- What version was used? See whether the snapshot, version, DOI, and acquisition date are fixed.
- What are the event semantics and clock domain? Look at trial_type, HED, scorer rules, LSL/TTL/photodiode, and delay/jitter audits.
- Did they separate raw and derivative? See whether processed outputs remain explicit derivatives with followable lineage.
- What was used to read/write? Look at the loader/converter and see whether its version is specified.
- What workflow or model recipe generated the outputs? Look for config files, optional branches, the model graph, and software settings.
- What runtime or result-provenance record exists? Check container / lockfile, command logs, reports, or result bundles.
- What benchmark harness was used? See whether the evaluation family, metrics, and comparison statistics are fixed.
- What benchmark provenance was in force? Check the active rules snapshot, split/randomization, extra-data / checkpoint policy, inference-stage restrictions, and postmortem status.
References and official pages
- Gorgolewski et al. (2016), BIDS
- BIDS Specification: Task events
- BIDS Specification: Electroencephalography
- BIDS Specification: dataset_description, GeneratedBy, and SourceDatasets
- BIDS Derivatives: common data types and lineage metadata
- BIDS Stats Models Specification
- Pernet et al. (2019), EEG-BIDS
- Robbins et al. (2021), HED for FAIR event annotation
- Hermes et al. (2025), HED library schema for EEG data annotation
- Kothe et al. (2025), The lab streaming layer for synchronized multimodal recording
- Jeung et al. (2024), Motion-BIDS
- Markiewicz et al. (2021), OpenNeuro
- OpenNeuro Docs: Git access and snapshots
- OpenNeuro Docs: Dataset landing page and snapshot metadata
- PhysioNet: About and citation policy
- PhysioNet: Resources and citation guidance
- Appelhoff et al. (2019), MNE-BIDS
- MNE-BIDS Docs: write_raw_bids
- MNE-BIDS-Pipeline Docs
- Gorgolewski et al. (2017), BIDS Apps
- Zhao et al. (2024), BABS and large-scale BIDS-App audit trails
- Maumet et al. (2016), NIDM-Results
- Jayaram & Barachant (2018), MOABB
- MOABB Docs
- MOABB Docs: paradigm and evaluation examples
- EEG Challenge (2025): homepage
- EEG Challenge (2025): rules
- EEG Challenge (2025): submission
- EEG Challenge (2025): final leaderboard and organizer correction
- Xiong et al. (2025/2026), EEG-FM-Bench
- Liu et al. (2026), EEG Foundation Models: Progresses, Benchmarking, and Open Problems
- Wei et al. (2020), Bayesian fusion and multimodal DCM for EEG and fMRI
- Vafaii et al. (2024), multimodal spontaneous brain-activity organization
- Chen et al. (2025), simultaneous EEG-PET-MRI across wakefulness and NREM sleep
Where to go back next
Please use Data & Bench to return to the practical entry point, Verification Platform to return to overall design, and Casework to return to examples from other fields.