
Wiki: Minimum artifact pack for L0

Do not call it reproducible until version, observability, benchmark meaning, lineage, and replay are fixed together

Mind Uploading Research Project

Public Page Updated: 2026-04-03 Operational guide / temporal-validity sync

How to use this page

Read this first to avoid getting lost

This is an auxiliary page that fixes what must be bundled together before an L0 result can be called a reproducible analysis on this site. It is not a procedure manual; it is a submission-shape check that asks whether a third party can reconstruct not only the score, but also what was actually observed, which prediction object and metric bundle were used, which benchmark rules were in force, what was held out, and what remained outside scope.

  • The L0 pack is no longer just version + BIDS + QC + split + baseline; it now also includes benchmark object + metric bundle, benchmark provenance / governance, workflow recipe, and a temporal-validity addendum when a claim spans more than one session, day, or adaptation stage.
  • The pack is still organized around five bundles, but the evaluation bundle now explicitly separates split family, prediction object, metric semantics, current benchmark rules, temporal scope, setup distribution, and baselines.
  • Benchmark name is still too coarse unless predicted object, independent prediction unit, grouped hold-out unit, and operations budget are frozen alongside governance.
  • Challenge or leaderboard results are not reproducible artifacts on this site unless the current rules snapshot and later organizer corrections are frozen alongside the score.
  • Derivative lineage is still not the same as workflow provenance: `GeneratedBy` / `SourceDatasets`, config files, and runtime pin have to be frozen together.
  • Cross-session or adaptation naming is not yet temporal validity; state annotation now splits fast labels from slow internal-milieu disclosure, and fixed decoder interval, recalibration burden, and transfer ceiling still have to be disclosed.
  • This page is now synchronized with the stricter practical rule already used on Datasets and Verification.
Best for
People who have started creating an L0 pack, and people who want to check how far it can be called a reproducible analysis.
Reading time
12-18 minutes
Accuracy note
This page defines the current minimum for L0. It does not by itself justify causal or identity claims, but without these fields even L0 comparability remains too weak.

Relatively clear at this stage

What we know now

  • For L0, it is more important than high accuracy that a third party can rerun under the same conditions and still understand what the score is allowed to mean.
  • BIDS / EEG-BIDS makes data traceable, but it does not by itself fix event fidelity, label provenance, or leak-free evaluation.
  • The same score can change meaning not only across within-session, cross-session, cross-subject, and adaptation settings, but also across prediction objects and metric bundles.
  • The same benchmark name can still hide different predicted objects, grouped hold-out units, and inference budgets unless those fields are frozen explicitly.
  • Challenge, leaderboard, or benchmark names alone are still too coarse because rules snapshots, randomization policies, extra-data rules, and later postmortems can materially change what the score means.
  • Cross-session and unsupervised recalibration results still do not by themselves tell you the fixed decoder interval, recalibration burden, or operational transfer ceiling.
  • The same task can still run under different circadian / endocrine-metabolic regimes, so temporal validity does not reduce to movement labels or session IDs alone.
  • Preloaded or modified recordings should be written as derivatives with explicit lineage rather than silently overwriting raw.
  • Derivative lineage and workflow provenance are separate: the run still needs config, software / container version, and replayable commands.
  • Examples of failures, setup shortcuts, and stopping claims belong in the artifact pack, not only in side notes.

Still unresolved beyond this point

What we still do not know

  • Which QC metrics, nuisance-only baselines, and harmonization transforms should become defaults still depends on the task and dataset.
  • How the L0 pack should expand into standard L1/L2 cards will depend on future benchmark design.
  • The best reusable format for benchmark-governance snapshots across rapidly changing challenge sites is still evolving.
  • The best reusable format for acquisition-distribution summaries across multi-site datasets is still evolving.

Learn the basics

Check the basics in the wiki

What the wiki is for

The wiki is a learning aid. For the project's official current synthesis, success criteria, and operating rules, always return to the public pages.

Think in terms of one pack

The L0 artifact is not a single file or a single score. Only when dataset identity, what was actually observed, how train/test was separated, how derivatives were produced, and how to replay the run are fixed together can a third party track the result honestly.
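
As a concrete illustration of "fixed together", the sketch below freezes only the dataset-identity part of the pack (item 1 in the table further down): a content hash plus the version, DOI, license, and retrieval date that must survive alongside the score. The helper names and every field name are assumptions made for this page, not a schema this project has published.

```python
# Sketch: freeze the identity of an L0 input snapshot (item 1 of the pack).
# Field names and the helpers themselves are illustrative assumptions, not a
# schema published by this project.
import hashlib
from datetime import date
from pathlib import Path

def sha256_of_tree(root: Path) -> str:
    """Hash every file under `root` in a stable order so the snapshot is checkable."""
    digest = hashlib.sha256()
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest.update(path.relative_to(root).as_posix().encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()

def freeze_identity(root: Path, name: str, version: str, doi: str,
                    license_id: str, url: str) -> dict:
    """Return the identity record; the caller decides where in the pack it is stored."""
    return {
        "dataset_name": name,
        "version": version,
        "doi": doi,
        "license": license_id,
        "persistent_url": url,
        "retrieval_date": date.today().isoformat(),
        "content_sha256": sha256_of_tree(root),
    }
```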

2026-03-20 addendum: the old 8-point pack was too weak

This site's practical pages now require more than version + BIDS + QC + split + baseline. The reason is simple: EEG-BIDS, MOABB, official dataset pages, and MNE-BIDS docs together make clear that event fidelity, label provenance, evaluation family, acquisition-distribution summary, and derivative lineage materially change what a later score means. This page is now synchronized with that stricter rule.

2026-03-28 addendum: the 11-point pack was still under-specified

The remaining weakness on this page was subtler than the 2026-03-20 tightening. The current practical rule had already become stricter about event fidelity, label provenance, setup distribution, and lineage, but it still left three score-defining fields too implicit. First, the official EEG Challenge (2025) homepage now states that the proposal preprint became out of date during execution and that the current website plus Starter Kit should be treated as authoritative. Second, the same official benchmark family still mixes different prediction objects, from per-trial response-time regression to subject-level psychopathology regression, and even its current rules and leaderboard record later execution changes and organizer corrections. Third, Egger et al. (2024) and Wilson et al. (2025) show that cross-session, adaptation, and long-term use still hide different temporal burdens, while Saito & Rehmsmeier (2015) and Vallat & Walker (2021) show that one headline metric can still hide task-specific failure. Therefore, the L0 pack on this site now adds benchmark object + metric bundle, benchmark provenance / governance, and a conditional Temporal-Validity addendum.
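
As a minimal illustration of a task-matched metric bundle, the sketch below reports several complementary metrics for an imbalanced binary detection task instead of one headline number. Which metrics belong in the bundle is task-dependent, and the selection and function name here are assumptions for illustration only, using scikit-learn.

```python
# Sketch: report a task-matched metric bundle instead of one headline number.
# The metric selection below is an illustration for an imbalanced binary
# detection task, not a claim that these are the right metrics for every
# benchmark in the pack.
import numpy as np
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             cohen_kappa_score, roc_auc_score)

def metric_bundle(y_true: np.ndarray, y_score: np.ndarray, threshold: float = 0.5) -> dict:
    y_pred = (y_score >= threshold).astype(int)
    return {
        "prevalence": float(y_true.mean()),                          # keeps class imbalance visible
        "roc_auc": float(roc_auc_score(y_true, y_score)),
        "auprc": float(average_precision_score(y_true, y_score)),    # sensitive to imbalance
        "balanced_accuracy": float(balanced_accuracy_score(y_true, y_pred)),
        "cohen_kappa": float(cohen_kappa_score(y_true, y_pred)),     # chance-corrected agreement
    }
```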

2026-03-31 addendum: benchmark name is not yet the benchmark object

One more ambiguity remained inside items 7 and 8. Even when benchmark governance is logged, the benchmark name alone still does not fix the predicted object, independent prediction unit, grouped hold-out unit, adaptation regime, or operations budget. The official EEG Challenge (2025) homepage separates trial-level response-time regression from subject-level externalizing prediction, the official rules and submission page impose an inference-only code-submission workflow under a single-GPU 20 GB budget, Ma et al. (2022) use one motor-imagery dataset to separate within-session, cross-session, and cross-session adaptation, Liu et al. (2026) separate leave-one-subject-out transfer from within-subject few-shot calibration, and Lahiri et al. (2026) show that six benchmark inconsistencies can reverse rankings on identical datasets by up to 24 percentage points. Therefore items 7 and 8 on this page are now read together as an object / unit / budget disclosure, not as a benchmark name plus an administrative appendix.
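
One way to keep the grouped hold-out unit explicit is to declare it in code rather than imply it: the sketch below builds a subject-grouped split and asserts that no subject crosses the train/test boundary. The array name `subject_of_window` and the 80/20 proportion are assumptions, not fields defined by any specific benchmark.

```python
# Sketch: make the grouped hold-out unit explicit so windows from the same
# subject or recording cannot leak across the train/test boundary.
# `subject_of_window` is an assumed array aligned with the windows.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def grouped_split(n_windows: int, subject_of_window: np.ndarray, seed: int = 0):
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, test_idx = next(splitter.split(np.zeros(n_windows),
                                              groups=subject_of_window))
    # Disclosure check: no grouped unit appears on both sides of the boundary.
    assert not set(subject_of_window[train_idx]) & set(subject_of_window[test_idx])
    return train_idx, test_idx
```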

2026-04-02 addendum: derivative lineage is not yet workflow provenance

One more L0 shortcut remained. BIDS Derivatives make it possible to say which outputs came from which sources, but that still does not freeze which recipe produced them. The current BIDS specification requires derived datasets to record GeneratedBy and supports SourceDatasets; Gorgolewski et al. (2017) showed that BIDS Apps improve software portability; and MNE-BIDS-Pipeline explicitly exposes a text configuration file, cached steps, and summary reports. Therefore, on this site, the L0 pack now treats lineage, workflow recipe, and runtime pin as separate fields that must travel together.
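
A minimal sketch of keeping lineage, workflow recipe, and runtime pin as separate fields that still travel together: the `GeneratedBy` / `SourceDatasets` field names follow the BIDS derivatives convention, while the pipeline name, versions, URL, and companion file names are placeholders assumed for illustration.

```python
# Sketch: write derivative lineage (GeneratedBy / SourceDatasets) and keep the
# workflow recipe and runtime pin next to it. The BIDS field names follow the
# derivatives convention; the concrete values and paths are placeholders.
import json
from pathlib import Path

derivatives_root = Path("derivatives/cleaned-epochs")
derivatives_root.mkdir(parents=True, exist_ok=True)

dataset_description = {
    "Name": "cleaned-epochs",
    "BIDSVersion": "1.9.0",
    "DatasetType": "derivative",
    "GeneratedBy": [{
        "Name": "example-preproc-pipeline",            # assumed pipeline name
        "Version": "0.3.1",
        "CodeURL": "https://example.org/pipeline",     # placeholder URL
        "Container": {"Type": "docker", "Tag": "example/preproc:0.3.1"},
    }],
    "SourceDatasets": [{
        "DOI": "doi:10.xxxx/example",                  # placeholder DOI
        "Version": "1.0.0",
    }],
}
(derivatives_root / "dataset_description.json").write_text(
    json.dumps(dataset_description, indent=2))

# The workflow recipe and runtime pin travel with this lineage as separate
# files, e.g. a pipeline config, an environment lockfile, and a replayable
# command list; the exact file names are up to the pack author.
```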

2026-04-03 addendum: temporal-validity addendum must split fast labels from slow internal milieu

One more L0 ambiguity still remained. The pack already required a Temporal-Validity addendum, but it still left too much room to write state annotation as one vague sentence about movement, arousal, or session ID. The current primary literature does not support that shortcut. Egger et al. (2024) showed that movement-related EEG decoding conditions change across a long day-night window, while de Quervain et al. (1998), Oei et al. (2007), Barone et al. (2023), Birnie et al. (2023), and Sherman et al. (2015) show that slower circadian / glucocorticoid / endocrine-metabolic regime changes can also move memory-relevant operating state. Therefore the L0 pack now treats fast labels and slow internal-milieu disclosure as separate parts of item 9 rather than one free-text temporal note.
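
One possible shape for item 9, with fast labels kept separate from the slow internal-milieu disclosure, is sketched below. Every field name is an assumption made for illustration; this page only fixes which information must be present, not how it is serialized.

```python
# Sketch: one way to record the Temporal-Validity addendum (item 9) with fast
# labels kept separate from slow internal-milieu disclosure. All field names
# are assumptions made for illustration.
from dataclasses import dataclass, field

@dataclass
class TemporalValidityAddendum:
    # Fast labels: per-epoch or per-trial annotations that change within a session.
    fast_labels: dict = field(default_factory=dict)       # e.g. {"movement": "...", "arousal": "..."}
    # Slow internal milieu: disclosure of slower regime context, even if only coarse.
    slow_milieu: dict = field(default_factory=dict)       # e.g. {"time_of_day": "...", "days_since_baseline": 3}
    # How long the decoder was held fixed without retraining.
    fixed_decoder_interval_days: float = 0.0
    # How much data each recalibration consumed, and when it happened.
    recalibration_events: list = field(default_factory=list)
    # The strongest transfer claim the result supports (the ceiling).
    transfer_ceiling: str = "within-session only"
```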

Minimum 14 items now required in the L0 pack

| Deliverable | Minimum desired contents | What is the problem if it is missing |
| --- | --- | --- |
| 1. Dataset identity | Snapshot / version / DOI / retrieval date / license / persistent URL. | Even with the same dataset name, different versions or releases get mixed and reproduction breaks. |
| 2. BIDS / EEG-BIDS skeleton | `dataset_description.json`, README, participant/session/run structure, `*_eeg.json`, `*_channels.tsv`, `*_electrodes.tsv`, and `*_coordsystem.json` when positions exist. | Third parties cannot reconstruct the same raw input or its measurement condition. |
| 3. Event Fidelity Card | Onset / duration / sample, clock domain, delay / jitter evidence, event semantics, and any HED or scoring rule used to interpret them. | The result may look aligned to behavior while event meaning and timing remain ambiguous. |
| 4. Label provenance | Whether the target comes from annotation channels, manual scoring, clinician reports, keyword rules, or another derived source, plus a report-usage flag when relevant. | A signal-only benchmark and a report-assisted benchmark get silently mixed. |
| 5. Standards confirmation | Validator output together with any remaining warnings and why they are acceptable. | Non-shareable structural violations remain hidden behind a seemingly clean dataset name. |
| 6. Split family + hold-out ancestry | Within-session / cross-session / cross-subject / adaptation family, the independent hold-out unit, and whether windows from the same raw recording can cross the boundary. | The score becomes uninterpretable because train/test independence is unclear. |
| 7. Benchmark object + metric bundle | Task family, predicted object, independent prediction unit, grouped hold-out unit when different, output family, and the task-matched metric bundle that makes the score interpretable. | A headline number hides whether the benchmark was cue-locked classification, event detection, sleep staging, trial-wise regression, or subject-level regression, and whether the metric actually matches the task. |
| 8. Benchmark provenance + governance | Benchmark or leaderboard name, version, current rules snapshot, split / randomization policy, hidden grouping, extra-data or pretrained-checkpoint policy, inference-stage restrictions / operations budget, and postmortem / correction status. | The same challenge or benchmark name can silently point to different score objects after execution changes or organizer corrections. |
| 9. Temporal-Validity addendum (required when the claim spans >1 session/day or uses adaptation) | State annotation split into fast labels and slow internal-milieu disclosure, fixed decoder interval, recalibration amount and timing, and transfer ceiling. | Cross-session or adaptation labels get overread as fixed-decoder durability or low-burden operation. |
| 10. Acquisition-distribution summary | Site / device / reference / channel map / electrode layout / protocol distribution, plus the harmonization policy and any metadata-only baseline. | Signal differences and setup differences get collapsed into one accuracy number. |
| 11. QC / exclusion log | Missingness, bad channels, bad segments, artifacts, exclusions, and thresholds in numerical form. | No one can tell which recordings were removed or why. |
| 12. Baseline + shortcut checks | At least one simple baseline, plus any nuisance-only or metadata-only comparison needed to keep shortcut routes visible. | Apparent improvement may come from identity, setup, or label shortcuts rather than the intended signal. |
| 13. Derivative lineage + workflow provenance + replay steps | `GeneratedBy` / `SourceDatasets` or equivalent lineage, commands, config or model recipe, container or lockfile, environment, random seeds, preprocessing boundaries, and explicit raw-to-derivative lineage. | Preprocessed data can be mistaken for raw, and other people cannot rerun the same flow. |
| 14. Failure examples + stopping claim | Known failure modes, exclusions, pending conditions, and the strongest claim the result is still allowed to stop at. | Only successes remain and later readers overread L0 as if it already implied stronger evidence. |
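
A sketch of how the 14 items can be turned into a mechanical completeness check before review. The file name expected for each item is an assumption, since this page does not prescribe a directory layout; the conditional handling of item 9 follows the rule in the table above.

```python
# Sketch: a mechanical completeness check over the 14-item pack. The file names
# expected for each item are assumptions; this page does not prescribe a layout.
from pathlib import Path

REQUIRED_ITEMS = {
    "dataset_identity": ["identity.json"],
    "bids_skeleton": ["dataset_description.json", "README"],
    "event_fidelity_card": ["event_fidelity.md"],
    "label_provenance": ["label_provenance.md"],
    "standards_confirmation": ["validator_report.txt"],
    "split_family": ["split_family.md"],
    "benchmark_object_metric_bundle": ["benchmark_object.md"],
    "benchmark_provenance_governance": ["benchmark_governance.md"],
    "temporal_validity_addendum": ["temporal_validity.md"],   # conditional item 9
    "acquisition_distribution": ["acquisition_summary.md"],
    "qc_exclusion_log": ["qc_log.csv"],
    "baseline_shortcut_checks": ["baselines.md"],
    "lineage_workflow_replay": ["run_commands.txt", "environment.lock"],
    "failures_stopping_claim": ["failures_and_stopping_claim.md"],
}

def check_pack(root: Path, spans_multiple_sessions: bool) -> dict:
    """Return the items whose expected files are missing from the pack directory."""
    missing = {}
    for item, files in REQUIRED_ITEMS.items():
        if item == "temporal_validity_addendum" and not spans_multiple_sessions:
            continue  # only required when the claim spans >1 session/day or uses adaptation
        absent = [f for f in files if not (root / f).exists()]
        if absent:
            missing[item] = absent
    return missing
```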

Why the old 8-point pack is now too weak

| Weak point | Why it fails now | What the pack must add |
| --- | --- | --- |
| BIDS shape without annotation depth | BIDS and EEG-BIDS make the dataset traceable, but they do not by themselves tell you whether an outcome came from cue markers, manual stage scoring, clinician reports, or a derived rule. | Add Event Fidelity Card plus label provenance to the pack itself. |
| Split rule without evaluation family | Within-session, cross-session, cross-subject, and adaptation all answer different questions, and the same accuracy number does not transfer across them. | Add evaluation family, independent hold-out unit, and window ancestry. |
| Benchmark family without prediction object or metric bundle | Per-trial response-time regression, subject-level factor prediction, seizure alarm behaviour, and sleep-stage scoring do not test the same scientific object even when all are reported as EEG decoding. | Add benchmark object, independent prediction unit, and a task-matched metric bundle. |
| Challenge or leaderboard name without governance snapshot | The official EEG Challenge site now says the proposal preprint is outdated during execution, the current website plus Starter Kit are authoritative, and the later leaderboard note revised what the rankings meant after a sample-randomization error. | Add the current benchmark provenance / governance snapshot instead of treating it as an administrative footnote. |
| Cross-session or adaptation label without temporal-validity fields | Daily drift, recalibration burden, and fixed-decoder durability remain different questions, so cross-session and adaptation labels still underdescribe what survived across time. | Add a conditional Temporal-Validity addendum with state annotation split into fast labels and slow internal-milieu disclosure, fixed decoder interval, recalibration amount, and transfer ceiling. |
| Replay steps without lineage, workflow recipe, or setup summary | Preloaded / modified data can silently become derivatives, setup differences such as site, device, reference, and electrode layout can still dominate the result, and the same pipeline name can still hide a different config. | Add acquisition-distribution summary, harmonization log, derivative lineage, and a workflow / runtime pin. |

Five bundles to keep together

Bundles

  • Identity: freeze snapshot, version, DOI, retrieval date, and license.
  • Observability: fix BIDS / EEG-BIDS shape, event fidelity, and label provenance.
  • Evaluation: fix evaluation family, hold-out ancestry, benchmark object, metric bundle, current benchmark rules, temporal scope when needed, setup distribution, harmonization, and baselines.
  • Lineage: keep raw-to-derivative boundaries explicit instead of silently rewriting modified data as raw.
  • Replay: keep commands, config, environment, failures, and the stopping claim together.
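
For review purposes, the 14 items can be read bundle by bundle. The grouping below is one possible assignment assumed for illustration (the page itself does not map each numbered item to a bundle), reusing the illustrative item keys from the completeness sketch above; item 13 deliberately appears under both Lineage and Replay.

```python
# Sketch: the five bundles as one possible grouping over the 14 items, using
# the same illustrative item keys as the completeness check above.
BUNDLES = {
    "identity": ["dataset_identity"],
    "observability": ["bids_skeleton", "event_fidelity_card",
                      "label_provenance", "standards_confirmation"],
    "evaluation": ["split_family", "benchmark_object_metric_bundle",
                   "benchmark_provenance_governance", "temporal_validity_addendum",
                   "acquisition_distribution", "qc_exclusion_log",
                   "baseline_shortcut_checks"],
    "lineage": ["lineage_workflow_replay"],     # raw-to-derivative boundaries
    "replay": ["lineage_workflow_replay",       # commands, config, environment
               "failures_stopping_claim"],
}
```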

Common omissions

| Common condition | What is still missing |
| --- | --- |
| A dataset name exists | Snapshot, version, DOI, retrieval date, and license may not remain. |
| A waveform file is available | Events, synchronization, label provenance, event semantics, and bad segments may still be missing. |
| An accuracy number is there | Evaluation family, independent hold-out unit, benchmark object, metric bundle, harmonization log, or stopping claim may still be absent. |
| There is a leaderboard or challenge name | The current rules snapshot, randomization policy, extra-data policy, inference-stage restriction, or later organizer correction may still be absent. |
| There is a cross-session or adaptation result | State annotation, fixed decoder interval, recalibration burden, and transfer ceiling may still be absent. |
| There is code | Environment, random seeds, derivative lineage, execution order, and known failure conditions may not be written. |
| "I thought I did QC" | Numeric logs, exclusion reasons, and the stopping claim may not remain. |

A stricter L0 completion check

| Question | If yes, move forward | If no, what to do next |
| --- | --- | --- |
| Can other people recover the same input identity? | Snapshot / version / DOI / retrieval date / license and BIDS skeleton are complete. | Freeze the dataset identity and the BIDS skeleton first. |
| Can they tell what was actually annotated and by whom? | Event fidelity and label provenance are written, including any report-usage flag. | Fix events, annotation rules, and label provenance before trusting the score. |
| Can they explain what one prediction and one score mean? | Evaluation family, hold-out ancestry, benchmark object, independent prediction unit, and task-matched metric bundle are fixed. | Fix the prediction object and metric semantics before trusting the headline score. |
| Can they explain which benchmark rules were in force? | Benchmark provenance, current rules snapshot, hidden grouping / randomization policy, extra-data rules, and any later corrections are fixed. | Freeze the governance snapshot before comparing yourself to a challenge or leaderboard. |
| If the claim spans more than one session or uses adaptation, is the temporal scope explicit? | State annotation, fixed decoder interval, recalibration amount, and transfer ceiling are written. | Add the Temporal-Validity fields before treating the result as durability evidence. |
| Can they tell whether setup distribution still dominates? | Acquisition-distribution summary, harmonization policy, and shortcut-aware baselines are fixed. | Fix setup distribution, nuisance baselines, and harmonization before trusting generalization claims. |
| Can someone else replay the same derivatives? | Command, environment, preprocessing boundaries, and raw-to-derivative lineage remain. | Create a short runbook and make derivative lineage explicit. |
| Can the claim stop at the right ceiling? | Failure examples and the stopping claim are written next to the result. | State explicitly what the current pack does not justify. |

What this page still does not do

This page still does not decide which model is strongest or which metric bundle is universally best. The first objective of L0 is still to create a comparable starting point. The change on this page is only that the starting point is now defined more strictly, including the rule that benchmark meaning and temporal scope are part of the artifact rather than optional commentary.

References

Where to return next

Return to Hands-On if you want to follow the actual steps, to Data & Bench if you want to reselect the input data, or to Verification Infrastructure if you want to see how this product stacks up as a public good.