Wiki: Verification example walkthrough

Read the blueprint through one small EEG example without overreading timing, identity, or stability

Mind Uploading Research Project

Public Page · Updated: 2026-03-28 · Worked example / technical refresh

How to use this page

Read this first to avoid getting lost

This page turns the Verification Commons into one small EEG example. The goal is still not to chase a large score. The goal is to show what has to be frozen before a small public EEG result can be read honestly: input contract, benchmark object, split regime, shortcut audit, temporal scope, calibration logic, and stopped claim.

  • This worked example is no longer just data standard + benchmark + registry + model card; it now also carries an event contract, a shortcut audit, a temporal-validity note, and a calibration / abstention note.
  • BIDS events, HED semantics, and LSL synchronization answer different questions and should not be compressed into one checkbox.
  • A subject/session split is not enough if raw-recording ancestry, identity confounding, or acquisition-distribution shortcuts remain unresolved.
  • Same-session success is not cross-day durability; the fixed decoder interval and any recalibration burden must stay explicit.
  • The safe ceiling of this example is a bounded reproducible EEG decode under a named observation contract, not a stable biomarker or hidden-state readout.
Best for
People who find Verification too abstract, and people who want one concrete EEG example that already respects the site's newer stop lines
Reading time
12-18 minutes
Accuracy note
This is a bounded L0/L1 tutorial. It does not support causal, source-identification, or WBE-level claims by itself. It shows how to build one small EEG result without silently overreading timing, subject identity, or temporal stability.

Relatively clear at this stage

What we know now

  • A small public EEG example can already teach most of the core verification logic if input contract, benchmark object, and stopped claim are frozen explicitly.
  • Event timing, event meaning, split hygiene, shortcut resistance, and temporal scope are separate technical questions.
  • High score alone is not enough; the site now asks where predictive information came from and how far the result can be extrapolated across time.
  • If probabilities, thresholds, or prediction sets are reported, fit / calibration / test separation and abstention policy belong in the worked example.

Still unresolved beyond this point

What we still do not know

  • This page still does not define one site-wide default calibration threshold or one universal temporal benchmark for every EEG task.
  • Which backbone object should become the default target for future longitudinal EEG examples remains unsettled.
  • This example does not decide which additional cards would be sufficient for stronger L2/L3 claims.

Learn the basics

Check the basics in the wiki

What the wiki is for

The wiki is a learning aid. For the project's official current synthesis, success criteria, and operating rules, always return to the public pages.

This worked example in one sentence

We use one small public EEG classification task to show what a result has to freeze before the score means anything: what entered the file, what the benchmark object actually is, what train/test independence really means, which shortcut routes remain open, what time scope is being claimed, and where the claim must stop.

Why this tutorial needed a 2026-03-28 repair

The older walkthrough was useful as a first orientation, but it still taught a weaker recipe than the current site allows. It could leave readers with the impression that BIDS + split + score + model card was already most of the work. The current literature and the rest of this site no longer support that shortcut. BIDS events, HED semantics, and LSL synchronization answer different questions; record-wise or weakly grouped splits can still learn identity; same-day success is not cross-day durability; and a score with no calibration or abstention rule is still not operationally interpretable.

Safe reading of this example

If this example is filled well, the strongest safe outcome is a bounded reproducible EEG decode under a named observation contract. It still does not become source ground truth, a target-specific biomarker by default, a causal intervention result, or a WBE-relevant hidden-state readout.

The core scaffold is still four artifacts

  • Data standard: fixes the dataset snapshot, BIDS shape, event files, channel metadata, and QC context.
  • Benchmark: fixes the task, target, split family, hold-out unit, metric bundle, and stopped claim.
  • Registry: fixes preprocessing, split freeze, baselines, nuisance checks, and success/failure conditions before the result is known.
  • Model card / audit log: records scores, failures, shortcut checks, calibration behavior, abstention behavior, and what still remains unresolved.

But the current tutorial also stacks companion cards

  • Observation contract: separates event anchor, event semantics, clock domain, timing-validation class, and label provenance. If omitted, events.tsv, HED, trigger lines, and synchronized streams are too easily read as one solved timing object.
  • Observability Budget: fixes what the EEG directly observed, namely scalp potentials under a named setup, not sources or hidden state by default. If omitted, a small scalp-level classifier is too easily promoted to internal-state evidence.
  • Specificity & Shortcut note: fixes which routes could still explain the score, such as subject fingerprint, setup distribution, residual movement, or other nuisance paths. If omitted, a clean split can still be mistaken for target-specific neural evidence.
  • Temporal-validity note: fixes whether the example is within-session, same-day, cross-session, or cross-day, and whether the decoder was fixed or updated. If omitted, same-session success is silently promoted to durability.
  • Calibration & Abstention note: fixes how probabilities or prediction sets were calibrated, and when the model should abstain instead of forcing output. If omitted, thresholds, confidence, and coverage become uninterpretable.

Step 1: Fix the input and the event contract

The first thing this example now freezes is not only which EEG file you used, but also what the time and label columns are allowed to mean. The current standards and timing literature require a narrower reading here. In BIDS, onset is measured from the first stored data point, not from physical screen or speaker onset. HED makes event semantics machine-readable. LSL can synchronize streams across a LAN and compensate for offset and jitter, but it does not by itself measure true device-side delays. This site therefore asks the tutorial to log those pieces separately; a minimal contract record is sketched at the end of this step.

  • Dataset identity: freeze the snapshot or release, DOI or persistent URL, retrieval date, and license. Why it matters: the same dataset name can still refer to different content over time.
  • BIDS skeleton: freeze dataset_description.json, participant/session/run layout, *_eeg.json, channel metadata, and electrode metadata when positions exist. Why it matters: without this, later readers cannot reconstruct the same measurement condition.
  • Event anchor: freeze events.tsv, events.json, the onset/duration/sample meaning, and any discarded-sample rule. Why it matters: the epoch boundary can look precise while still referring only to stored-file time.
  • Event semantics: freeze trial_type, HED tags when available, condition naming, and any manual scoring rule. Why it matters: two datasets can share a label name while meaning different things.
  • Clock domain and timing-validation class: freeze whether the example has only a stored-data anchor, stream alignment, digital marker capture, or actual physical timing validation. Why it matters: the site no longer lets BIDS, HED, LSL, TTL, and photodiode traces collapse into one timing claim.
  • Label provenance: freeze whether the target label comes from cue markers, manual scoring, a report-derived rule, or another derived path. Why it matters: a signal-only benchmark and a report-assisted benchmark are not the same evidence object.
  • QC / exclusions: freeze bad channels, bad segments, missing runs, and the thresholds used to exclude data. Why it matters: the score becomes impossible to audit if exclusions stay implicit.
Safe tutorial rule

For this worked example, writing only “the data are in BIDS” is no longer enough. The minimum safe wording is: which event anchor exists, which semantics exist, which timing-validation rung was actually tested, and where the label came from.
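
To make that minimum wording concrete, here is a minimal sketch of a frozen observation-contract record, written as plain JSON from Python. Every field value is a hypothetical example under an assumed file-name convention (observation_contract.json), not site-mandated vocabulary.

```python
import json

# Hypothetical field values; the site does not mandate this exact schema.
observation_contract = {
    "dataset_identity": {
        "name": "example-eeg-dataset",   # hypothetical snapshot name
        "doi": "10.0000/placeholder",    # hypothetical persistent ID
        "retrieved": "2026-03-01",
        "license": "CC0",
    },
    "event_anchor": "events.tsv onset, measured from the first stored sample",
    "event_semantics": "trial_type plus HED tags where available",
    "clock_domain": "stored-file time only; no stream alignment",
    "timing_validation_class": "no physical timing validation performed",
    "label_provenance": "cue markers emitted by the stimulus script",
    "qc_exclusions": "2 bad channels interpolated; runs with >20% bad segments dropped",
}

# Freeze the contract to disk so the later model card can point at it.
with open("observation_contract.json", "w") as f:
    json.dump(observation_contract, f, indent=2)
```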

Step 2: Fix the benchmark object and the independence unit

The next weak point in older beginner workflows was to treat a split rule as if it already solved shortcut risk. The current literature does not support that shortcut. Record-wise splits can learn identity rather than the target variable, resting-state EEG can support time-robust person identification, and cross-dataset EEG performance can move with setup differences such as amplifier, cap, sampling rate, or filtering. This example therefore freezes not only the split, but also the independent hold-out unit and the shortcut families that remain plausible; a minimal grouped-split sketch follows the table below.

  • Target: one bounded task such as two-state or few-class EEG classification. Do not overread: a small task must not silently stand in for a general biomarker or latent-state decoder.
  • Evaluation family: within-session, cross-session, cross-subject, cross-dataset, or adaptation regime. Do not overread: the same accuracy means different things in different families.
  • Independent hold-out unit: subject, session, or raw recording, not only windows or epochs. Do not overread: a result can stay identity-confounded even when train/test windows are disjoint.
  • Raw-recording ancestry: whether windows cut from one raw recording ever cross train/test. Do not overread: window-level separation is not enough if the raw ancestor is shared.
  • Setup disjointness: participant, session, site, device, reference system, channel map, and protocol differences. Do not overread: a classifier can still read acquisition-distribution structure rather than the intended signal.
  • Shortcut-aware baselines: metadata-only, subject-ID, or other nuisance-aware baselines when relevant. Do not overread: without them, the score can still be explained by who, when, or how the EEG was recorded.
  • Temporal scope: whether the example is same-session only or claims any reuse across time. Do not overread: a same-session result must not be promoted to cross-day durability after the fact.
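
As one way to freeze the independence unit in code, the sketch below uses scikit-learn's GroupShuffleSplit with subject as the group key. The arrays are synthetic stand-ins; in a real run, the subject labels would come from the BIDS participant metadata and each window would also carry its raw-recording ancestor.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_windows = 200
X = rng.normal(size=(n_windows, 64))            # stand-in feature vectors
y = rng.integers(0, 2, size=n_windows)          # two-state labels
subjects = rng.integers(0, 10, size=n_windows)  # independence unit: subject

# Grouping by subject guarantees that no subject (and hence no raw
# recording of that subject) appears on both sides of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=subjects))

assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
```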

Step 3: Write the registry before training

The registry is where this example stops being a flexible demo and becomes an auditable result. The main point is not fancy formatting. The main point is that preprocessing, splits, baselines, and stopping conditions are fixed before the score appears. If the example will later report probabilities, prediction sets, or an abstain threshold, this is also where the fit / calibration / test separation must be frozen; a minimal registry sketch follows the table below.

  • Preprocessing recipe: filtering, referencing, artifact handling, rejected channels/segments, and derivative boundaries.
  • Split freeze: the exact grouping rule for subjects, sessions, and raw recordings, plus any benchmark version.
  • Baseline plan: the simple baseline, the shortcut-aware baseline, and what counts as improvement over them.
  • Failure conditions: what will count as collapse, such as low sensitivity, a shortcut-only win, unstable calibration, or cross-session failure.
  • Calibration split: whether the model reports only hard labels or also probabilities / prediction sets, and which held-out slice is reserved for threshold or temperature tuning.
  • Stopped claim: write in advance the strongest safe claim if everything works exactly as planned.
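
A minimal sketch of such a registry entry, frozen and hashed before any training run, might look like the following. All field values are hypothetical, and the hash is only one cheap way to make later edits visible, not a site requirement.

```python
import hashlib
import json

# Hypothetical field values; real entries would name exact tool versions.
registry = {
    "preprocessing": "0.5-40 Hz band-pass, average reference, fixed artifact rule",
    "split_freeze": "GroupShuffleSplit by subject, seed 42, benchmark snapshot v0.1",
    "baselines": ["majority class", "subject-ID-only logistic regression"],
    "failure_conditions": [
        "headline metric within the shortcut baseline's confidence interval",
        "unstable calibration on the reserved calibration slice",
    ],
    "calibration_split": "two subjects reserved for threshold tuning only",
    "stopped_claim": "reproducible within-session decode under the named contract",
}

# Hashing the frozen registry gives a cheap tamper-evidence record.
blob = json.dumps(registry, sort_keys=True).encode()
print("registry sha256:", hashlib.sha256(blob).hexdigest())
```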
What the stopped claim should usually look like here

For a minimal public EEG example, the planned stopping point is usually something like: reproducible classification under a named observation contract and declared split regime. It is usually not stable biomarker evidence, target-specific neural proof, or cross-day deployability.

Step 4: Attach the route cards before reading the score

This is the main scientific tightening in the new tutorial. The score is no longer read alone. Before the score is interpreted, the example now stacks four companion checks that answer four different questions; a minimal shortcut-baseline sketch follows the table below.

  • Observability Budget: what did the sensor directly observe? Example answer: scalp potentials under a declared montage and preprocessing regime; not hidden state, sources, or a causal controller by default.
  • Specificity & Shortcut note: which route could still explain the score besides the intended target? Example answer: subject/session fingerprint, setup distribution, residual behavior, or other nuisance routes may still contribute unless audited separately.
  • Temporal-validity note: how far across time may the result be extrapolated? Example answer: if the decoder was evaluated only within session, the example stops at within-session evidence even if the score is strong.
  • Calibration & Abstention note: what do the output probabilities or sets mean, and when should output stop? Example answer: fit/calibration/test are separated, and low-confidence outputs can be rejected instead of forced.
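
For the shortcut note, one concrete audit is a metadata-only baseline that sees nothing but the subject ID. The sketch below uses synthetic data; on a real benchmark, an identity-only baseline that approaches the EEG model's score signals identity confounding rather than target-specific evidence.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(1)
subjects = rng.integers(0, 10, size=300)   # epoch-level subject IDs
y = rng.integers(0, 2, size=300)           # synthetic labels

# The baseline's only feature is the one-hot subject identity.
X_id = OneHotEncoder().fit_transform(subjects.reshape(-1, 1))
X_tr, X_te, y_tr, y_te = train_test_split(X_id, y, random_state=0)

baseline = LogisticRegression(max_iter=200).fit(X_tr, y_tr)
print("identity-only accuracy:", baseline.score(X_te, y_te))
```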
Why the temporal note is now mandatory once time enters the story

Egger et al. (2024) showed that hand-gesture EEG decoding performance shifts across a 10-hour day and that a non-updated classifier can degrade steadily over time. On this site, that means a tutorial cannot jump from “clean split” to “stable result” without stating the fixed decoder interval, the state annotation, and any recalibration burden. A minimal fixed-decoder check is sketched below.
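
Here is a minimal sketch of that check: freeze a decoder on the first block of a recording day, then score later blocks without recalibration. The data and drift below are synthetic and only demonstrate the bookkeeping, not any expected effect size.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Five consecutive time blocks with an artificial feature drift.
blocks = []
for t in range(5):
    X = rng.normal(loc=0.1 * t, size=(100, 16))
    y = rng.integers(0, 2, size=100)
    blocks.append((X, y))

# Freeze the decoder on block 0 and never update it afterwards.
X0, y0 = blocks[0]
decoder = LogisticRegression(max_iter=200).fit(X0, y0)

# Log per-block accuracy so any degradation over time stays visible.
for t, (X, y) in enumerate(blocks[1:], start=1):
    print(f"block {t}: fixed-decoder accuracy = {decoder.score(X, y):.2f}")
```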

Step 5: Publish a model card plus calibration and failure logs

At the end, the model card is still the visible artifact, but it is now narrower and more explicit than the older tutorial implied. The purpose is not only to show where the model wins, but also to expose where the route breaks and what the score is still allowed to mean. A minimal calibration and abstention sketch follows the table below.

  • Headline metrics: the main metric bundle, baseline deltas, and slice-wise results. Why it matters: one number alone hides which regime actually carried the result.
  • Shortcut results: nuisance-aware baselines, metadata-only or identity baselines when relevant, and unresolved shortcut gaps. Why it matters: the score can otherwise be overread as target-specific evidence.
  • Temporal note: whether the result is same-session only, same-day only, or tested further, and whether the decoder stayed fixed. Why it matters: this prevents silent promotion to durability.
  • Calibration report: the fit/calibration/test split, plus ECE/Brier/NLL or prediction-set coverage when applicable. Why it matters: confidence without calibration is not yet operationally meaningful.
  • Abstention / threshold policy: a reject option, a prediction-set rule, or an explicit statement that the example does not yet support one. Why it matters: this stops threshold tweaking from hiding inside the test result.
  • Failure ledger: subjects, sessions, states, or setup slices where the example collapses. Why it matters: without this, only favorable conditions survive into the narrative.
  • Stopped claim: one or two lines stating what the result supports and what it still does not support. Why it matters: this prevents the example from being reused as stronger evidence than it earned.
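
A minimal sketch of the calibration and abstention checks might look like the following, with Brier score and NLL from scikit-learn, a plain binned ECE, and a reject rule at a hypothetical confidence threshold. The synthetic probabilities are constructed to be well calibrated so the numbers are readable.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

def expected_calibration_error(y_true, p_hat, n_bins=10):
    """Plain binned ECE for binary probabilities."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (p_hat >= lo) & (p_hat < hi)
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - p_hat[mask].mean())
    return ece

rng = np.random.default_rng(3)
p_hat = rng.uniform(size=500)                    # model probabilities
y = (rng.uniform(size=500) < p_hat).astype(int)  # synthetic, well calibrated

print("Brier:", brier_score_loss(y, p_hat))
print("NLL:  ", log_loss(y, p_hat))
print("ECE:  ", expected_calibration_error(y, p_hat))

# Abstention rule frozen in the registry: refuse low-confidence outputs.
tau = 0.7  # hypothetical confidence threshold
confident = np.maximum(p_hat, 1.0 - p_hat) >= tau
print("coverage after abstention:", confident.mean())
```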

What this example now supports, and what it still does not support

What this example can support:

  • A reproducible small EEG benchmark with a named event contract and a declared split regime.
  • A bounded score comparison against declared baselines under a declared independence unit.
  • A same-session or explicitly bounded temporal result.
  • A calibrated or abstaining output, but only if calibration and abstention were frozen and reported explicitly.

What it still does not support:

  • Physical timing truth, unless the highest timing-validation rung was actually measured.
  • Target-specific neural evidence, if shortcut routes remain unresolved.
  • Cross-day durability, fixed-decoder stability, or deployability, unless those were directly audited.
  • Causal, source-identification, or WBE-level hidden-state claims.

One small pack this page now expects

Minimum pack

  • Dataset identity: snapshot, DOI or URL, retrieval date, and license.
  • Observation contract: event anchor, event semantics, timing-validation class, label provenance, and QC.
  • Benchmark object: task, target, metric bundle, split family, and independent hold-out unit.
  • Shortcut note: plausible shortcut families, shortcut-aware baselines, and unresolved shortcut gap.
  • Temporal note: same-session or beyond, fixed decoder or updated decoder, and stopped time claim.
  • Registry: preprocessing, split freeze, baselines, failure conditions, and calibration split when needed.
  • Model card: results, failures, calibration behavior, abstention behavior, and stopped claim.
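
A minimal sketch of a pack completeness gate over the list above might look like this; the file names are assumed conventions, not mandated by the site.

```python
from pathlib import Path

# Hypothetical file-name conventions for the pack artifacts listed above.
REQUIRED = [
    "dataset_identity.json",
    "observation_contract.json",
    "benchmark.json",
    "shortcut_note.md",
    "temporal_note.md",
    "registry.json",
    "model_card.md",
]

missing = [name for name in REQUIRED if not Path(name).exists()]
if missing:
    raise SystemExit(f"minimum pack incomplete, missing: {missing}")
print("minimum pack complete")
```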

Where to go next

If you want the full blueprint again, return to Verification. If you want the stricter L0 checklist, go next to Wiki: Minimum artifact pack for L0. If the main uncertainty is event timing or label meaning, use Wiki: Event synchronization and observation logs. If the problem is shortcut resistance or hold-out ancestry, use Wiki: Data splits and leakage. If the claim starts to cross sessions or days, continue to Wiki: State, trait, and drift.

References behind this correction

  1. Brain Imaging Data Structure. Events. BIDS specification.
  2. Hermes D, Pal Attia T, Beniczky S, et al. Hierarchical Event Descriptor library schema for EEG data annotation. Scientific Data. 2025. doi:10.1038/s41597-025-05791-2
  3. Kothe C, et al. The Lab Streaming Layer for synchronized multimodal recording. Imaging Neuroscience. 2025. doi:10.1162/IMAG.a.136
  4. Lepauvre A, Hirschhorn R, Bendtz K, Mudrik L, Melloni L. A standardized framework to test event-based experiments. Behavior Research Methods. 2024. doi:10.3758/s13428-024-02508-y
  5. Chaibub Neto E, et al. Detecting the impact of subject characteristics on machine learning-based diagnostic applications. npj Digital Medicine. 2019;2:99. doi:10.1038/s41746-019-0178-x
  6. Xu L, et al. Cross-Dataset Variability Problem in EEG Decoding With Deep Learning. Frontiers in Human Neuroscience. 2020;14:103. doi:10.3389/fnhum.2020.00103
  7. Di Y, et al. The Time-Robustness Analysis of Individual Identification Based on Resting-State EEG. Frontiers in Human Neuroscience. 2021;15:672946. doi:10.3389/fnhum.2021.672946
  8. Egger J, et al. Chrono-EEG dynamics influencing hand gesture decoding: a 10-hour study. Scientific Reports. 2024;14:20247. doi:10.1038/s41598-024-70609-x
  9. Lei J, G'Sell M, Rinaldo A, Tibshirani RJ, Wasserman L. Distribution-Free Predictive Inference for Regression. Journal of the American Statistical Association. 2018;113(523):1094-1111. doi:10.1080/01621459.2017.1307116