Think in terms of one pack
The L0 artifact is not a single file or a single score. A third party can honestly verify the result only when dataset identity, what was actually observed, how train/test was separated, how derivatives were produced, and how to replay the run are fixed together.
This site's practical pages now require more than version + BIDS + QC + split + baseline. The reason is simple: EEG-BIDS, MOABB, official dataset pages, and MNE-BIDS docs together make clear that event fidelity, label provenance, evaluation family, acquisition-distribution summary, and derivative lineage materially change what a later score means. This page is now synchronized with that stricter rule.
The remaining weakness on this page was subtler than the 2026-03-20 tightening. The current practical rule had already become stricter about event fidelity, label provenance, setup distribution, and lineage, but it still left three score-defining fields too implicit. First, the official EEG Challenge (2025) homepage now states that the proposal preprint became out of date during execution and that the current website plus Starter Kit should be treated as authoritative. Second, the same official benchmark family still mixes different prediction objects, from per-trial response-time regression to subject-level psychopathology regression, and even its current rules and leaderboard record later execution changes and organizer corrections. Third, Egger et al. (2024) and Wilson et al. (2025) show that cross-session, adaptation, and long-term use still hide different temporal burdens, while Saito & Rehmsmeier (2015) and Vallat & Walker (2021) show that one headline metric can still hide task-specific failure. Therefore, the L0 pack on this site now adds benchmark object + metric bundle, benchmark provenance / governance, and a conditional Temporal-Validity addendum.
One more ambiguity remained inside items 7 and 8. Even when benchmark governance is logged, the benchmark name alone still does not fix the predicted object, independent prediction unit, grouped hold-out unit, adaptation regime, or operations budget. The official EEG Challenge (2025) homepage separates trial-level response-time regression from subject-level externalizing prediction, the official rules and submission page impose an inference-only code-submission workflow under a single-GPU 20 GB budget, Ma et al. (2022) use one motor-imagery dataset to separate within-session, cross-session, and cross-session adaptation, Liu et al. (2026) separate leave-one-subject-out transfer from within-subject few-shot calibration, and Lahiri et al. (2026) show that six benchmark inconsistencies can reverse rankings on identical datasets by up to 24 percentage points. Therefore items 7 and 8 on this page are now read together as an object / unit / budget disclosure, not as a benchmark name plus an administrative appendix.
One more L0 shortcut remained. BIDS Derivatives make it possible to say which outputs came from which sources, but that still does not freeze which recipe produced them. The current BIDS specification requires derived datasets to record GeneratedBy and supports SourceDatasets; Gorgolewski et al. (2017) showed that BIDS Apps improve software portability; and MNE-BIDS-Pipeline explicitly exposes a text configuration file, cached steps, and summary reports. Therefore, on this site, the L0 pack now treats lineage, workflow recipe, and runtime pin as separate fields that must travel together.
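The lineage fields above can be made concrete in the derived dataset itself. A minimal sketch of writing a derivatives `dataset_description.json` with `GeneratedBy` and `SourceDatasets`, per the current BIDS specification; the pipeline name, version, container tag, and DOI below are placeholders, not real identifiers:

```python
import json
from pathlib import Path

def write_derivative_description(deriv_root: Path) -> Path:
    """Write a minimal dataset_description.json for a derived dataset.

    GeneratedBy and SourceDatasets follow the current BIDS specification;
    all concrete values here are placeholders for illustration.
    """
    description = {
        "Name": "example-preprocessing-derivatives",  # placeholder name
        "BIDSVersion": "1.11.1",
        "DatasetType": "derivative",
        # Which recipe produced these outputs (workflow provenance).
        "GeneratedBy": [
            {
                "Name": "example-pipeline",           # placeholder pipeline
                "Version": "0.1.0",
                "Container": {"Type": "docker", "Tag": "example/pipeline:0.1.0"},
            }
        ],
        # Which frozen inputs the outputs came from (raw-to-derivative lineage).
        "SourceDatasets": [
            {"DOI": "doi:10.18112/openneuro.dsXXXXXX.v1.0.0", "Version": "1.0.0"}
        ],
    }
    deriv_root.mkdir(parents=True, exist_ok=True)
    out = deriv_root / "dataset_description.json"
    out.write_text(json.dumps(description, indent=2))
    return out
```

Keeping `GeneratedBy` (the recipe) separate from `SourceDatasets` (the frozen input) is exactly the separation of lineage, workflow recipe, and runtime pin that the pack requires.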
One more L0 ambiguity still remained. The pack already required a Temporal-Validity addendum, but it still left too much room to write state annotation as one vague sentence about movement, arousal, or session ID. The current primary literature does not support that shortcut. Egger et al. (2024) showed that movement-related EEG decoding conditions change across a long day-night window, while de Quervain et al. (1998), Oei et al. (2007), Barone et al. (2023), Birnie et al. (2023), and Sherman et al. (2015) show that slower circadian / glucocorticoid / endocrine-metabolic regime changes can also move memory-relevant operating state. Therefore the L0 pack now treats fast labels and slow internal-milieu disclosure as separate parts of item 9 rather than one free-text temporal note.
Minimum 14 items now required in the L0 pack
| Deliverable | Minimum contents | What breaks if it is missing |
|---|---|---|
| 1. Dataset identity | Snapshot / version / DOI / retrieval date / license / persistent URL. | Even with the same dataset name, different versions or releases get mixed and reproduction breaks. |
| 2. BIDS / EEG-BIDS skeleton | dataset_description.json, README, participant/session/run structure, *_eeg.json, *_channels.tsv, *_electrodes.tsv, and *_coordsystem.json when positions exist. | Third parties cannot reconstruct the same raw input or its measurement condition. |
| 3. Event Fidelity Card | Onset / duration / sample, clock domain, delay / jitter evidence, event semantics, and any HED or scoring rule used to interpret them. | The result may look aligned to behavior while event meaning and timing remain ambiguous. |
| 4. Label provenance | Whether the target comes from annotation channels, manual scoring, clinician reports, keyword rules, or another derived source, plus a report-usage flag when relevant. | A signal-only benchmark and a report-assisted benchmark get silently mixed. |
| 5. Standards confirmation | Validator output together with any remaining warnings and why they are acceptable. | Non-shareable structural violations remain hidden behind a seemingly clean dataset name. |
| 6. Split family + hold-out ancestry | Within-session / cross-session / cross-subject / adaptation family, the independent hold-out unit, and whether windows from the same raw recording can cross the boundary. | The score becomes uninterpretable because train/test independence is unclear. |
| 7. Benchmark object + metric bundle | Task family, predicted object, independent prediction unit, grouped hold-out unit when different, output family, and the task-matched metric bundle that makes the score interpretable. | A headline number hides whether the benchmark was cue-locked classification, event detection, sleep staging, trial-wise regression, or subject-level regression, and whether the metric actually matches the task. |
| 8. Benchmark provenance + governance | Benchmark or leaderboard name, version, current rules snapshot, split / randomization policy, hidden grouping, extra-data or pretrained-checkpoint policy, inference-stage restrictions / operations budget, and postmortem / correction status. | The same challenge or benchmark name can silently point to different score objects after execution changes or organizer corrections. |
| 9. Temporal-Validity addendum (required when the claim spans >1 session/day or uses adaptation) | State annotation split into fast labels and slow internal-milieu disclosure, fixed decoder interval, recalibration amount and timing, and transfer ceiling. | Cross-session or adaptation labels get overread as fixed-decoder durability or low-burden operation. |
| 10. Acquisition-distribution summary | Site / device / reference / channel map / electrode layout / protocol distribution, plus the harmonization policy and any metadata-only baseline. | Signal differences and setup differences get collapsed into one accuracy number. |
| 11. QC / exclusion log | Missingness, bad channels, bad segments, artifacts, exclusions, and thresholds in numerical form. | No one can tell which recordings were removed or why. |
| 12. Baseline + shortcut checks | At least one simple baseline, plus any nuisance-only or metadata-only comparison needed to keep shortcut routes visible. | Apparent improvement may come from identity, setup, or label shortcuts rather than the intended signal. |
| 13. Derivative lineage + workflow provenance + replay steps | GeneratedBy / SourceDatasets or equivalent lineage, commands, config or model recipe, container or lockfile, environment, random seeds, preprocessing boundaries, and explicit raw-to-derivative lineage. | Preprocessed data can be mistaken for raw, and other people cannot rerun the same flow. |
| 14. Failure examples + stopping claim | Known failure modes, exclusions, pending conditions, and the strongest claim the result is still allowed to stop at. | Only successes remain and later readers overread L0 as if it already implied stronger evidence. |
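Item 6's window-ancestry rule can be checked mechanically: no raw recording may contribute windows to both sides of the split. A minimal sketch, assuming each window carries the ID of its parent recording (the function and variable names are hypothetical):

```python
from collections import defaultdict

def holdout_ancestry_leaks(window_to_recording, train_ids, test_ids):
    """Return recording IDs whose windows cross the train/test boundary.

    `window_to_recording` maps window_id -> parent recording_id. Item 6 of
    the L0 pack requires this set to be empty when the independent hold-out
    unit is the recording (or any coarser unit such as session or subject).
    """
    sides = defaultdict(set)
    for window_id, recording_id in window_to_recording.items():
        if window_id in train_ids:
            sides[recording_id].add("train")
        if window_id in test_ids:
            sides[recording_id].add("test")
    return {rec for rec, s in sides.items() if s == {"train", "test"}}
```

A non-empty result means the split family label (within-session, cross-session, cross-subject) is not what the score actually measured.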
Why the old 8-point pack is now too weak
| Weak point | Why it fails now | What the pack must add |
|---|---|---|
| BIDS shape without annotation depth | BIDS and EEG-BIDS make the dataset traceable, but they do not by themselves tell you whether an outcome came from cue markers, manual stage scoring, clinician reports, or a derived rule. | Add Event Fidelity Card plus label provenance to the pack itself. |
| Split rule without evaluation family | Within-session, cross-session, cross-subject, and adaptation all answer different questions, and the same accuracy number does not transfer across them. | Add evaluation family, independent hold-out unit, and window ancestry. |
| Benchmark family without prediction object or metric bundle | Per-trial response-time regression, subject-level factor prediction, seizure alarm behaviour, and sleep-stage scoring do not test the same scientific object even when all are reported as EEG decoding. | Add benchmark object, independent prediction unit, and a task-matched metric bundle. |
| Challenge or leaderboard name without governance snapshot | The official EEG Challenge site now says the proposal preprint is outdated during execution, the current website plus Starter Kit are authoritative, and the later leaderboard note revised what the rankings meant after a sample-randomization error. | Add the current benchmark provenance / governance snapshot instead of treating it as an administrative footnote. |
| Cross-session or adaptation label without temporal-validity fields | Daily drift, recalibration burden, and fixed-decoder durability remain different questions, so cross-session and adaptation labels still underdescribe what survived across time. | Add a conditional Temporal-Validity addendum with state annotation split into fast labels and slow internal-milieu disclosure, fixed decoder interval, recalibration amount, and transfer ceiling. |
| Replay steps without lineage, workflow recipe, or setup summary | Preloaded / modified data can silently become derivatives, setup differences such as site, device, reference, and electrode layout can still dominate the result, and the same pipeline name can still hide a different config. | Add acquisition-distribution summary, harmonization log, derivative lineage, and a workflow / runtime pin. |
Five bundles to keep together
- Identity: freeze snapshot, version, DOI, retrieval date, and license.
- Observability: fix BIDS / EEG-BIDS shape, event fidelity, and label provenance.
- Evaluation: fix evaluation family, hold-out ancestry, benchmark object, metric bundle, current benchmark rules, temporal scope when needed, setup distribution, harmonization, and baselines.
- Lineage: keep raw-to-derivative boundaries explicit instead of silently rewriting modified data as raw.
- Replay: keep commands, config, environment, failures, and the stopping claim together.
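Keeping the bundles together is easier when the 14 items are checked mechanically before a result is shared. A minimal completeness sketch; the field names are hypothetical keys for a pack manifest, and item 9 is conditional exactly as the table states:

```python
# The 14 deliverables from the L0 table, keyed by item number.
# Field names are illustrative, not a prescribed schema.
L0_ITEMS = {
    1: "dataset_identity",
    2: "bids_skeleton",
    3: "event_fidelity_card",
    4: "label_provenance",
    5: "standards_confirmation",
    6: "split_family_holdout_ancestry",
    7: "benchmark_object_metric_bundle",
    8: "benchmark_provenance_governance",
    9: "temporal_validity_addendum",   # conditional, see below
    10: "acquisition_distribution_summary",
    11: "qc_exclusion_log",
    12: "baseline_shortcut_checks",
    13: "derivative_lineage_workflow_replay",
    14: "failure_examples_stopping_claim",
}

def missing_items(pack, multi_session_or_adaptation):
    """Return the item numbers absent from `pack` (a field -> value dict).

    Item 9 is required only when the claim spans more than one session/day
    or uses adaptation, matching the conditional wording in the table.
    """
    missing = []
    for number, field in L0_ITEMS.items():
        if number == 9 and not multi_session_or_adaptation:
            continue
        if not pack.get(field):
            missing.append(number)
    return missing
```

A pack that passes this check is still only L0, but at least it is a complete L0.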
Common omissions
| Common situation | What is often still missing |
|---|---|
| A dataset name exists | Snapshot, version, DOI, retrieval date, and license may not remain. |
| A waveform file is available | Events, synchronization, label provenance, event semantics, and bad segments may still be missing. |
| An accuracy number exists | Evaluation family, independent hold-out unit, benchmark object, metric bundle, harmonization log, or stopping claim may still be absent. |
| There is a leaderboard or challenge name | The current rules snapshot, randomization policy, extra-data policy, inference-stage restriction, or later organizer correction may still be absent. |
| There is a cross-session or adaptation result | State annotation, fixed decoder interval, recalibration burden, and transfer ceiling may still be absent. |
| There is code | Environment, random seeds, derivative lineage, execution order, and known failure conditions may not be written. |
| QC was supposedly done | Numeric logs, exclusion reasons, and the stopping claim may not remain. |
A stricter L0 completion check
| Question | If yes, move forward | If no, what to do next |
|---|---|---|
| Can other people recover the same input identity? | Snapshot / version / DOI / retrieval date / license and BIDS skeleton are complete. | Freeze the dataset identity and the BIDS skeleton first. |
| Can they tell what was actually annotated and by whom? | Event fidelity and label provenance are written, including any report-usage flag. | Fix events, annotation rules, and label provenance before trusting the score. |
| Can they explain what one prediction and one score mean? | Evaluation family, hold-out ancestry, benchmark object, independent prediction unit, and task-matched metric bundle are fixed. | Fix the prediction object and metric semantics before trusting the headline score. |
| Can they explain which benchmark rules were in force? | Benchmark provenance, current rules snapshot, hidden grouping / randomization policy, extra-data rules, and any later corrections are fixed. | Freeze the governance snapshot before comparing yourself to a challenge or leaderboard. |
| If the claim spans more than one session or uses adaptation, is the temporal scope explicit? | State annotation, fixed decoder interval, recalibration amount, and transfer ceiling are written. | Add the Temporal-Validity fields before treating the result as durability evidence. |
| Can they tell whether setup distribution still dominates? | Acquisition-distribution summary, harmonization policy, and shortcut-aware baselines are fixed. | Fix setup distribution, nuisance baselines, and harmonization before trusting generalization claims. |
| Can someone else replay the same derivatives? | Command, environment, preprocessing boundaries, and raw-to-derivative lineage remain. | Create a short runbook and make derivative lineage explicit. |
| Can the claim stop at the right ceiling? | Failure examples and the stopping claim are written next to the result. | State explicitly what the current pack does not justify. |
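The third question's "task-matched metric bundle" is the point made by Saito & Rehmsmeier (2015): on an imbalanced detection task, one headline accuracy can look strong while the event class is never detected. A minimal illustration with a hand-rolled bundle (toy data, not a real benchmark):

```python
def metric_bundle(y_true, y_pred):
    """Accuracy plus precision/recall for the positive (event) class.

    On imbalanced detection tasks, accuracy alone can hide a detector
    that never fires on the minority class.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"accuracy": acc, "precision": precision, "recall": recall}

# 95 negative windows, 5 event windows; a detector that never fires
# scores 0.95 accuracy with 0.0 recall on the event class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
```

Item 7 exists precisely so that the bundle, not the single headline number, travels with the pack.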
This page still does not decide which model is strongest or which metric bundle is universally best. The first objective of L0 is still to create a comparable starting point. The change on this page is only that the starting point is now defined more strictly, including the rule that benchmark meaning and temporal scope are part of the artifact rather than optional commentary.
References
- BIDS Website: Annotating a BIDS dataset
- BIDS Specification 1.11.1: Events
- BIDS Specification 1.11.1: Electroencephalography
- Pernet et al. (2019), EEG-BIDS
- MNE-BIDS Docs: write_raw_bids
- Jayaram & Barachant (2018), MOABB
- MOABB Docs: WithinSessionEvaluation
- MOABB Docs: CrossSessionEvaluation
- MOABB Docs: CrossSubjectEvaluation
- Ma et al. (2022), A large EEG dataset for studying cross-session variability in motor imagery BCI
- Chaibub Neto et al. (2019), identity confounding in machine learning-based disease diagnosis
- Melnik et al. (2017), Systems, subjects, sessions
- Xu et al. (2020), Cross-dataset deep learning for EEG
- EEG Challenge (2025): Homepage
- EEG Challenge (2025): Rules
- EEG Challenge (2025): Leaderboard
- EEG Challenge (2025): FAQ
- Egger et al. (2024), Chrono-EEG dynamics influencing hand gesture decoding: a 10-hour study
- Wilson et al. (2025), Long-term unsupervised recalibration of cursor-based intracortical brain-computer interfaces using a hidden Markov model
- Saito & Rehmsmeier (2015), The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets
- Vallat & Walker (2021), An open-source, high-performance tool for automated sleep staging
- PhysioNet: EEG Motor Movement/Imagery Dataset
- PhysioNet: CHB-MIT Scalp EEG Database
- PhysioNet: Sleep-EDF Database Expanded
- Obeid & Picone (2016), TUH EEG Corpus
Where to return next
Return to Hands-On if you want to follow the actual steps, to Data & Bench if you want to reselect the input data, or to Verification Infrastructure if you want to see how this artifact stands up as a public good.