The shortest distinction
Standards fix the raw-data layout, repositories fix where a versioned dataset is published, validators check schema compliance, derivative specifications fix how processed outputs stay linked to their sources, workflow / model recipes fix how outputs are produced, and benchmark harnesses plus benchmark provenance / governance fix what the score means. Even though they all look like "research infrastructure," their roles are different.
The old version of this page still let a benchmark name sound like a stable comparison label once the harness was known. That is too weak. The official EEG Challenge (2025) homepage states that the original challenge preprint became out of date during execution and that the website should be treated as current. The official rules then fix downsampling after 0.5-50 Hz filtering, additional-data disclosure, pretrained-model disclosure, and a single-GPU 20 GB inference-stage constraint. The official submission page further fixes that it is an inference-only code submission, and the final leaderboard later disclosed a Challenge 2 randomization error and separated the final awards accordingly. Recent benchmark papers make the same point in more general form: Xiong et al. (2025/2026) argue that inconsistent evaluation protocols make EEG-foundation-model comparisons unreliable, and Liu et al. (2026) show across 12 open-source foundation models and 13 datasets that the reading of transfer quality depends materially on protocol choice. Therefore, on this site, benchmark provenance / governance is treated as a first-class part of reproducibility rather than as after-the-fact administration.
The remaining weakness on this page was subtler. It still let BIDS + repository + benchmark name sound almost sufficient for reproducibility. Current official and primary sources do not support that reading. Markiewicz et al. (2021) show that OpenNeuro plus BIDS helps freeze a shareable, versioned raw input. But the BIDS specification separately requires derived datasets to carry GeneratedBy and SourceDatasets, and derivative files to keep explicit Sources. Gorgolewski et al. (2017) show that BIDS Apps solve deployment and interface portability, not automatic benchmark meaning; MNE-BIDS-Pipeline explicitly exposes a text-file configuration, cached intermediate steps, and summary reports; BIDS Stats Models defines a separate machine-readable model recipe; and Maumet et al. (2016) show that result provenance itself can be packaged as a separate standardized object. Therefore, on this site, derivative specification, workflow / model recipe, and execution / result provenance are now treated as distinct layers rather than as details hidden inside "BIDS" or "benchmark."
Why consider separately
If you confuse these layers, you'll get the wrong impression, such as, "There's a benchmark because you uploaded it to OpenNeuro," "Because it's BIDS, the processed outputs are already traceable," "Because the pipeline name was given, the recipe is already frozen," or "Because MOABB was named, the benchmark meaning is already fixed." In reality, the tasks of aligning raw data, naming derivative lineage, freezing workflow and model recipes, defining comparison rules, and freezing the exact benchmark governance are different things.
First, separate terms
| Term | What it does | Example |
|---|---|---|
| standard (raw layout) | Fix how files are placed and named and how metadata is written, so these are the same everywhere. | BIDS, EEG-BIDS. |
| Storage/shared infrastructure (repository) | Publish your data so others can retrieve it. | OpenNeuro, PhysioNet, PDB, etc. |
| Validator | Mechanically inspects for standard violations and missing metadata. | BIDS Validator. |
| derivative specification / lineage | Keep processed outputs separate from raw and link them back to their direct sources and generating pipeline. | BIDS Derivatives, GeneratedBy, SourceDatasets, Sources. |
| loader / converter | Read or write datasets in a standardized way and bridge them into the analysis library. | MNE-BIDS. |
| workflow / model recipe | Fix the ordered steps, config values, optional branches, grouping logic, and analysis graph that generate the outputs. | MNE-BIDS-Pipeline config, BIDS Apps CLI, BIDS Stats Models JSON. |
| execution / result provenance | Record which software, version, container, code, and activity actually produced the reported outputs and reports. | NIDM-Results, pipeline reports, DataLad / BABS audit trail. |
| benchmark harness | Fix tasks, splits, metrics, and prohibitions so that results are comparable. | MOABB, MLPerf, ImageNet-style operation. |
| benchmark provenance / governance | Fix which exact rule snapshot, split construction, hidden grouping, extra-data policy, pretrained-checkpoint policy, execution constraints, and postmortems defined the score. | Official challenge homepage / rules / submission / leaderboard, benchmark postmortems. |
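The derivative specification / lineage row can be made concrete with a small sketch. Below is a minimal, hypothetical `dataset_description.json` for a derived dataset, following the BIDS `GeneratedBy` and `SourceDatasets` fields; the dataset name, pipeline name, versions, and DOI are illustrative placeholders, not real projects.

```python
import json

# Minimal sketch of a derivative-dataset description with explicit
# lineage fields, as required by the BIDS Derivatives layer.
# All names, versions, and the DOI below are illustrative placeholders.
derivative_description = {
    "Name": "example-cleaned-eeg",        # hypothetical derivative name
    "BIDSVersion": "1.9.0",
    "DatasetType": "derivative",
    "GeneratedBy": [
        {"Name": "example-pipeline", "Version": "0.1.0"}
    ],
    "SourceDatasets": [
        {"DOI": "doi:placeholder", "Version": "1.0.0"}
    ],
}

text = json.dumps(derivative_description, indent=2)
print(text)
```

With `DatasetType` set to `derivative` and the lineage fields present, a later reader can trace the processed output back to a specific source-dataset version instead of mistaking it for raw data.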
In practice, the short labels are not enough, so we look at 11 layers
| layer | Representative examples | What to fix here | No guarantees yet |
|---|---|---|---|
| 1. Standards | BIDS, EEG-BIDS | File names, required metadata, coordinate system, events/channels/electrodes format. | Train/test splits and metrics are not determined. |
| 2. Public version | OpenNeuro snapshot, PhysioNet version | A third party can retrieve the same input and know which version was obtained. | The version does not necessarily fix the benchmark split or preprocessing conditions. |
| 3. Event semantics/extension schema | HED, Motion-BIDS | trial_type meaning, event tags, additional sensor metadata, and coordinate frames. | Clock alignment and benchmark splits are not determined automatically. |
| 4. Synchronization middleware | LSL | Time alignment of multiple streams, clock-offset estimation, and stream metadata. | Does not guarantee ground truth for device-side delay or stimulus presentation delay. |
| 5. Derivative specification / lineage | BIDS Derivatives, GeneratedBy, SourceDatasets, Sources | Keep processed outputs separate from raw and make source ancestry plus generating pipeline explicit. | A clean or epoched file can still be overread as self-explanatory if lineage is missing. |
| 6. Conversion/Reading | MNE-BIDS | BIDSPath, metadata extraction, reading path to MNE, format conversion when necessary. | Comparison metrics and evaluation families are not fixed. |
| 7. Workflow / model recipe | MNE-BIDS-Pipeline config, BIDS Apps CLI, BIDS Stats Models JSON | Fix step order, skipped or optional stages, model graph, and config values that determine derived outputs. | The same raw input can still produce different derivatives when the recipe changes. |
| 8. Execution / result provenance | NIDM-Results, pipeline reports, DataLad / BABS run records | Record which software, version, container, commands, and activities actually produced the outputs being reported. | A figure or score table can still be detached from the software state that created it. |
| 9. Benchmark harness | MOABB | Paradigm, evaluation family, statistical comparison, cross-dataset evaluation of the same pipeline. | Current rule snapshot, hidden grouping, extra-data policy, and execution constraints are not fixed unless governance documents are also frozen. |
| 10. Benchmark provenance / governance | Official homepage, rules page, submission page, leaderboard / postmortem | Current benchmark version, split / randomization, hidden grouping, extra-data and pretrained-model policy, inference-stage restrictions, and later corrections. | This still does not prove target-signal specificity, source-imaging truth, or operational safety outside the stated benchmark. |
| 11. Learner / runtime environment | Linear classifier, Riemannian pipeline, deep model, container image, lockfile | Which estimator was run with which preprocessing, random seeds, runtime image, and hyperparameters. | If layers 1-10 above are not fixed, the comparison is not fair. |
OpenNeuro implements each snapshot as a git tag carrying a semantic version, and PhysioNet explicitly cites a version for each project. Therefore, on this site, we include not only the dataset name but also the snapshot / version / DOI or persistent URL in the artifact. Additionally, BIDS is a raw-data container, BIDS Derivatives is the processed-data layer, HED/Motion-BIDS is semantics and additional metadata, LSL is synchronization, MNE-BIDS is an input/output path, MNE-BIDS-Pipeline or a BIDS App is a workflow recipe, BIDS Stats Models is a model recipe, NIDM-Results is result provenance packaging, and MOABB is a comparison rule. Please don't mix these up and conclude that "since I used BIDS, the benchmark is covered" or "since I installed LSL, the hardware delay is solved."
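The version-pinning rule above can be enforced mechanically. The helper below is a hypothetical sketch (the function name and the accession string are made up): it refuses to emit an input ID unless a snapshot/version or DOI is supplied, so an unpinned dataset name can never silently enter the artifact.

```python
# Hypothetical helper: a dataset reference without a snapshot tag,
# version, or DOI is rejected as ambiguous rather than recorded.
def pinned_input_id(name, snapshot=None, doi=None):
    """Return a citable input-ID string, or raise if unpinned."""
    if snapshot is None and doi is None:
        raise ValueError(f"{name}: snapshot/version or DOI required")
    parts = [name]
    if snapshot:
        parts.append(f"snapshot={snapshot}")
    if doi:
        parts.append(f"doi={doi}")
    return " ".join(parts)

# "ds-example" and "1.0.5" are placeholder values for illustration.
print(pinned_input_id("ds-example", snapshot="1.0.5"))
```

The same dataset name with two different snapshot tags then produces two distinct input IDs, which is exactly the distinction the table above asks for.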
Another weakness was to let BIDS + HED + LSL sound like a complete multimodal validity package. That is too strong. Kothe et al. (2025) made clear that LSL solves synchronized stream transport rather than device-side delay truth. Wei et al. (2020) showed that EEG-fMRI fusion remains model-conditioned, and Vafaii et al. (2024) plus Chen et al. (2025) showed that simultaneous multimodal recordings can retain modality-specific structure even when acquired together. Therefore, on this site, standards and synchronization infrastructure are necessary inputs to a multimodal study, but a separate Fusion Card is still required before the claim ceiling is raised.
MOABB correctly fixes evaluation families such as within-session, cross-session, and cross-subject, but current EEG challenge operations show that this is only one part of the benchmark object. The official EEG Challenge rules fixed the filter / downsample route, additional-data policy, pretrained-model disclosure, and inference-stage memory budget, the official submission page fixed that the competition was inference-only, and the final leaderboard disclosed a non-randomized Challenge 2 split that changed the prize structure. Therefore, on this site, a benchmark claim is incomplete unless the harness and the current governance / provenance documents are frozen together.
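The distinction between evaluation families can be illustrated without any benchmark library. The sketch below is plain Python, not MOABB code: it shows why a cross-session split answers a different question than a within-session one, because whole sessions are held out and no test-session trial ever appears in training.

```python
# Illustrative contrast between evaluation families; a hand-rolled
# sketch with toy trial records, not MOABB's implementation.
trials = (
    [{"subject": "S1", "session": "A", "trial": i} for i in range(4)]
    + [{"subject": "S1", "session": "B", "trial": i} for i in range(4)]
)

def cross_session_folds(trials):
    """Hold out one whole session per fold."""
    sessions = sorted({t["session"] for t in trials})
    for held_out in sessions:
        train = [t for t in trials if t["session"] != held_out]
        test = [t for t in trials if t["session"] == held_out]
        yield train, test

# Every fold keeps train and test sessions disjoint; a within-session
# split would instead mix trials from the same session on both sides.
for train, test in cross_session_folds(trials):
    assert {t["session"] for t in train}.isdisjoint(
        {t["session"] for t in test}
    )
```

A score from this family estimates transfer across recording sessions; shuffling trials freely would estimate something easier, which is why the two numbers must never share one leaderboard column.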
Benchmark provenance is part of reproducibility
The practical weakness on this page was to stop at benchmark harness. Recent official and primary sources do not support that shortcut. The official EEG Challenge homepage explicitly says the original preprint became outdated during execution, the official rules fix operational constraints, the official submission page narrows the executable object to inference-only code, and the final leaderboard discloses a split-construction failure that changed how the ranking had to be interpreted. In parallel, Xiong et al. (2025/2026) and Liu et al. (2026) both argue that fair EEG-foundation-model comparison requires standardized protocols and that rankings still depend materially on evaluation choices. Therefore, this site now separates benchmark provenance / governance from the harness name itself.
| Benchmark field | What it fixes | Unsafe shortcut if omitted |
|---|---|---|
| Current rule snapshot | Which homepage / rules / starter-kit state was actually in force when the run was made. | Reading an outdated proposal paper as the final benchmark definition. |
| Split / randomization / hidden grouping | Whether trial order, subject contiguity, session grouping, or other hidden structure could be exploited. | Reading a leaderboard as if it reflected portable subject-invariant generalization by default. |
| Extra-data / pretrained-model policy | Whether external corpora, checkpoints, or fine-tuning routes were allowed and how they had to be disclosed. | Comparing runs as if they were trained under the same information budget. |
| Inference-stage execution constraints | Whether the object being compared was a full training pipeline, an inference-only submission, or a memory / hardware-bounded executable. | Treating challenge rank as a pure representation-learning comparison independent of systems constraints. |
| Postmortem / correction status | Whether organizers later disclosed split flaws, score-definition changes, or prize-structure revisions. | Reading an early leaderboard snapshot as final scientific truth. |
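One way to make these fields operational is to freeze them into a single canonical record and derive an identifier from it. The sketch below is a hypothetical scheme (all field names and values are placeholders, not real challenge documents): two scores are comparable only when their governance hashes match.

```python
import hashlib
import json

# Hypothetical "Benchmark Governance ID": freeze the governance fields
# from the table above into one canonical record and hash it. All
# values below are illustrative placeholders.
governance = {
    "rules_url": "https://example.org/challenge/rules",
    "rules_retrieved": "2025-11-01",
    "split_policy": "per-subject, randomized",
    "extra_data_policy": "disclosed external corpora only",
    "pretrained_policy": "checkpoints must be declared",
    "inference_constraints": "single GPU, 20 GB, inference-only",
    "postmortem_status": "none at time of run",
}

# Canonical serialization (sorted keys) makes the hash deterministic.
canonical = json.dumps(governance, sort_keys=True).encode()
governance_id = hashlib.sha256(canonical).hexdigest()[:12]
print(governance_id)
```

If the organizers later publish a postmortem or revise a rule, the hash changes, which forces the score to be re-labeled rather than silently reinterpreted.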
Looking at the example of EEG
| stage | What to do there |
|---|---|
| 1. Align to standards | Align EEG files, events.tsv, channels.tsv, and metadata to BIDS format. |
| 2. Add event semantics | Specify trial_type, condition description, HED tags, manual scoring rule, and report usage flag. |
| 3. Audit synchronization | Record the clock domains, the synchronization method (LSL / TTL / photodiode), and the measured delay / jitter / drift. |
| 4. Check with Validator | Mechanically identifies violations of standards and missing items. |
| 5. Publish to storage | Put it on a shared platform like OpenNeuro or PhysioNet so it can be retrieved by third parties. |
| 6. Freeze derivative lineage | Keep preprocessed outputs, epochs, features, and reports as derivatives with explicit source ancestry. |
| 7. Freeze workflow / model recipe | Record the pipeline config, optional branches, model graph, and software settings that generated the outputs. |
| 8. Compare with benchmarks | Compare models with the same train/test split, the same metrics, and the same baseline. |
| 9. Freeze benchmark provenance | Record the active rules page, split/randomization policy, extra-data / pretrained-model policy, inference-stage restrictions, and postmortem status together with the score. |
| 10. Freeze runtime / result provenance | Record software versions, container or lockfile, commands, reports, and result bundles so the published figure or score can be traced back to the run that made it. |
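The ten stages above can be turned into a mechanical completeness check. The sketch below is hypothetical (the key names simply mirror this table): a run record counts as frozen only when every layer has a non-empty entry, and the gaps are reported by name.

```python
# Hypothetical completeness check over the ten stages in the table;
# key names are this page's labels, not a standardized schema.
REQUIRED = [
    "standard", "event_semantics", "sync_audit", "validator",
    "storage", "derivative_lineage", "workflow_recipe",
    "benchmark_harness", "benchmark_provenance", "runtime_provenance",
]

def missing_layers(record):
    """Return the names of required layers that are absent or empty."""
    return [k for k in REQUIRED if not record.get(k)]

# A partially filled record: only two of the ten layers are fixed.
record = {"standard": "BIDS 1.9.0", "storage": "OpenNeuro snapshot"}
print(missing_layers(record))
```

Running this at submission time turns "we forgot to freeze the split policy" from a post-publication discovery into a pre-publication error message.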
Just by aligning to the standard, there is still no "rule for comparison." But even if there is a benchmark, the comparison can still break if derivative lineage, workflow recipe, or runtime provenance are left implicit. All of those layers matter.
What is missing?
| What is missing | Problems that are likely to occur |
|---|---|
| Standards | File names and metadata differ from person to person, and reanalysis stalls at the entry point. |
| Storage place | Even if others know the dataset exists, they cannot obtain or reuse it, and the circle of comparison does not expand. |
| Validator | Standard violations are noticed late, and failures surface just before sharing or during reanalysis. |
| Derivative specification / lineage | Processed outputs can be mistaken for raw or for each other, and later readers cannot tell which source files or branches generated them. |
| Workflow / model recipe | The same pipeline name can hide different optional steps, configs, and model graphs, so the rerun does not actually reproduce the same analysis. |
| Execution / result provenance | A figure, table, or derivative can no longer be traced back to the exact software, version, container, and commands that created it. |
| Benchmark | Everyone evaluates with different splits and metrics, and the meaning of "winning" fluctuates. |
| Benchmark provenance / governance | The same benchmark name hides different rule snapshots, hidden grouping, inference limits, or later corrections, so the score is overread. |
Why raw files alone are not enough
Just having a waveform file is not enough for standards or benchmarks. At the very least, without event definitions, stimulus logs, synchronization information, QC logs, and exclusion criteria, it will be difficult to rerun the same task.
Being "publicly available" and being "comparable" are two different things. Publication is the first step, comparability is the next step in design.
Common confusion
| Things I tend to say | More accurate paraphrase |
|---|---|
| “Since we chose BIDS, there is a benchmark” | BIDS is an input format standard, not a comparison rule itself. |
| “It was standardized because it was placed in OpenNeuro” | Being posted in a repository does not guarantee that the standards and metadata are sufficient. |
| "We used the same input because the dataset name is the same" | Unless you fix OpenNeuro snapshot or PhysioNet version, it cannot be said that it is the same input. |
| "The benchmark name alone fixes what the score means" | You still need the active rules snapshot, split / randomization / hidden grouping policy, extra-data / pretrained-model policy, execution constraints, and postmortem status. |
| "Validator passed, so it's enough for research" | Validator is a formal check and does not guarantee the validity of the research or the strength of the benchmark. |
| "It became a benchmark because I could read it with MNE-BIDS" | MNE-BIDS is a reading/conversion aid; fixing evaluation families and comparison statistics is a separate task. |
| "Because the data are in BIDS, the processed outputs are already self-explanatory" | Raw BIDS and BIDS derivatives are separate layers, and processed outputs still need explicit lineage and source ancestry. |
| "Naming MNE-BIDS-Pipeline or a BIDS App already freezes the workflow" | The pipeline name alone is still too coarse; config values, skipped stages, model recipe, and software version have to be frozen as well. |
| "A containerized run already captures what the score means" | Container and runtime pin help software portability, but benchmark harness and benchmark governance still remain separate objects. |
| “Event semantics are fixed because there is `events.tsv`” | events.tsv is a container for time and columns, and condition meanings and scorer rules must be fixed separately in events.json, HED, and auxiliary logs. |
| "Using LSL even solved the hardware delay" | LSL helps with stream synchronization, but device-side delay for display/audio/amplifier requires separate measurement. |
| “MOABB scores can be directly compared across tasks” | Within-session, cross-session, and cross-subject are different evaluation families and cannot be treated equally. |
| "It's safe to convert preprocessed files back to raw BIDS" | BIDS and MNE-BIDS basically assume unprocessed or minimally processed data, and it is safer to treat modified data as derivatives by specifying the lineage. |
| "We won the benchmark, so it's good enough for actual operation" | Benchmark is a yardstick for comparison and does not automatically guarantee actual operation or the establishment of L4/L5. |
| "The challenge proposal paper is the final benchmark specification" | Execution-phase websites, rules, starter kits, and final postmortems can supersede the original proposal and must be frozen with the result. |
The minimum 7 IDs to fix
| ID | What I want at least | What happens when it is missing |
|---|---|---|
| Input ID | OpenNeuro snapshot tag, PhysioNet version, DOI, acquisition date. | Mixing different versions under the same dataset name makes the run impossible to repeat. |
| Schema ID | The BIDS/EEG-BIDS version, the Validator version, and the reasons for any warnings left in place. | Standard differences cannot be distinguished from implementation differences. |
| Derivative ID | Derived dataset name, GeneratedBy, SourceDatasets, and direct Sources lineage. | Preprocessed outputs can be confused with raw or with another derivative branch. |
| Workflow ID | MNE-BIDS-Pipeline / BIDS App / config file / model-graph version and settings. | Even with the same input version, a different recipe can still generate a different result. |
| Evaluation ID | Within-session / cross-session / cross-subject, metrics, split seed, and prohibitions. | The meaning of the score drifts and fair comparison breaks. |
| Benchmark Governance ID | Rules URL or archived snapshot, split / randomization policy, hidden grouping note, extra-data / pretrained-model policy, inference-stage restrictions, and postmortem status. | The benchmark title will stay too coarse, and the same leaderboard name may hide different scientific meanings. |
| Runtime / Result Provenance ID | Software version, container or lockfile, command log, and result bundle or report identifier. | The published figure or score cannot be traced back to the exact run that created it. |
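The seven IDs can be carried together as one frozen artifact record. The dataclass below is a hypothetical sketch: the field names mirror this table, and every value is a placeholder showing only the kind of string each field would hold.

```python
from dataclasses import asdict, dataclass
import json

# Hypothetical artifact record mirroring the seven IDs above;
# all values are illustrative placeholders, not real identifiers.
@dataclass(frozen=True)
class ArtifactRecord:
    input_id: str
    schema_id: str
    derivative_id: str
    workflow_id: str
    evaluation_id: str
    benchmark_governance_id: str
    runtime_provenance_id: str

record = ArtifactRecord(
    input_id="ds-example snapshot=1.0.2 doi=placeholder",
    schema_id="BIDS 1.9.0 / validator version + warnings logged",
    derivative_id="example-cleaned GeneratedBy=example-pipeline@0.1.0",
    workflow_id="pipeline config hash=placeholder",
    evaluation_id="cross-session, balanced accuracy, seed=42",
    benchmark_governance_id="rules snapshot hash=placeholder",
    runtime_provenance_id="container digest=placeholder",
)
print(json.dumps(asdict(record), indent=2))
```

Because the dataclass is frozen and every field is required, an artifact missing one of the seven IDs fails at construction time instead of surfacing as an untraceable score later.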
9 questions when reading strong arguments
- What is the input standard? Check whether the format is consistent (BIDS, etc.).
- What version was used? See whether the snapshot, version, DOI, and acquisition date are fixed.
- What are the event semantics and clock domain? Look at trial_type, HED, scorer rules, LSL/TTL/photodiode, and delay/jitter audits.
- Did they separate raw and derivative? See whether processed outputs remain explicit derivatives with followable lineage.
- What was used to read/write? Look at the loader/converter and see whether its version is specified.
- What workflow or model recipe generated the outputs? Look for config files, optional branches, the model graph, and software settings.
- What runtime or result-provenance record exists? Check container / lockfile, command logs, reports, or result bundles.
- What benchmark harness was used? See whether the evaluation family, metrics, and comparison statistics are fixed.
- What benchmark provenance was in force? Check the active rules snapshot, split/randomization, extra-data / checkpoint policy, inference-stage restrictions, and postmortem status.
References and official pages
- Gorgolewski et al. (2016), BIDS
- BIDS Specification: Task events
- BIDS Specification: Electroencephalography
- BIDS Specification: dataset_description, GeneratedBy, and SourceDatasets
- BIDS Derivatives: common data types and lineage metadata
- BIDS Stats Models Specification
- Pernet et al. (2019), EEG-BIDS
- Robbins et al. (2021), HED for FAIR event annotation
- Hermes et al. (2025), HED library schema for EEG data annotation
- Kothe et al. (2025), The lab streaming layer for synchronized multimodal recording
- Jeung et al. (2024), Motion-BIDS
- Markiewicz et al. (2021), OpenNeuro
- OpenNeuro Docs: Git access and snapshots
- OpenNeuro Docs: Dataset landing page and snapshot metadata
- PhysioNet: About and citation policy
- PhysioNet: Resources and citation guidance
- Appelhoff et al. (2019), MNE-BIDS
- MNE-BIDS Docs: write_raw_bids
- MNE-BIDS-Pipeline Docs
- Gorgolewski et al. (2017), BIDS Apps
- Zhao et al. (2024), BABS and large-scale BIDS-App audit trails
- Maumet et al. (2016), NIDM-Results
- Jayaram & Barachant (2018), MOABB
- MOABB Docs
- MOABB Docs: paradigm and evaluation examples
- EEG Challenge (2025): homepage
- EEG Challenge (2025): rules
- EEG Challenge (2025): submission
- EEG Challenge (2025): final leaderboard and organizer correction
- Xiong et al. (2025/2026), EEG-FM-Bench
- Liu et al. (2026), EEG Foundation Models: Progresses, Benchmarking, and Open Problems
- Wei et al. (2020), Bayesian fusion and multimodal DCM for EEG and fMRI
- Vafaii et al. (2024), multimodal spontaneous brain-activity organization
- Chen et al. (2025), simultaneous EEG-PET-MRI across wakefulness and NREM sleep
Where to go back next
Please use Data & Bench to return to the practical entry point, Verification Platform to return to overall design, and Casework to return to examples from other fields.