Wiki: Verification example walkthrough

Read the blueprint through one small EEG example without overreading timing, identity, or stability

Mind Uploading Research Project

Public Page · Updated: 2026-03-28 · Worked example / technical refresh

How to use this page

Read this first to avoid getting lost

This page turns the Verification Commons into one small EEG example. The goal is still not to chase a large score. The goal is to show what has to be frozen before a small public EEG result can be read honestly: input contract, benchmark object, split regime, shortcut audit, temporal scope, calibration logic, and stopped claim.

  • This worked example is no longer just data standard + benchmark + registry + model card; it now also carries an event contract, a shortcut audit, a temporal-validity note, and a calibration / abstention note.
  • BIDS events, HED semantics, and LSL synchronization answer different questions and should not be compressed into one checkbox.
  • A subject/session split is not enough if raw-recording ancestry, identity confounding, or acquisition-distribution shortcuts remain unresolved.
  • Same-session success is not cross-day durability; the fixed decoder interval and any recalibration burden must stay explicit.
  • The safe ceiling of this example is a bounded reproducible EEG decode under a named observation contract, not a stable biomarker or hidden-state readout.
Best for
People who find Verification too abstract, and people who want one concrete EEG example that already respects the site's newer stop lines
Reading time
12-18 minutes
Accuracy note
This is a bounded L0/L1 tutorial. It does not support causal, source-identification, or WBE-level claims by itself. It shows how to build one small EEG result without silently overreading timing, subject identity, or temporal stability.

Relatively clear at this stage

What we know now

  • A small public EEG example can already teach most of the core verification logic if input contract, benchmark object, and stopped claim are frozen explicitly.
  • Event timing, event meaning, split hygiene, shortcut resistance, and temporal scope are separate technical questions.
  • High score alone is not enough; the site now asks where predictive information came from and how far the result can be extrapolated across time.
  • If probabilities, thresholds, or prediction sets are reported, fit / calibration / test separation and abstention policy belong in the worked example.

Still unresolved beyond this point

What we still do not know

  • This page still does not define one site-wide default calibration threshold or one universal temporal benchmark for every EEG task.
  • Which backbone object should become the default target for future longitudinal EEG examples remains unsettled.
  • This example does not decide which additional cards would be sufficient for stronger L2/L3 claims.

Learn the basics

Check the basics in the wiki

What the wiki is for

The wiki is a learning aid. For the project's official current synthesis, success criteria, and operating rules, always return to the public pages.

This worked example in one sentence

We use one small public EEG classification task to show what a result has to freeze before the score means anything: what entered the file, what the benchmark object actually is, what train/test independence really means, which shortcut routes remain open, what time scope is being claimed, and where the claim must stop.

Why this tutorial needed a 2026-03-28 repair

The older walkthrough was useful as a first orientation, but it still taught a weaker recipe than the current site allows. It could leave readers with the impression that BIDS + split + score + model card was already most of the work. The current literature and the rest of this site no longer support that shortcut. BIDS events, HED semantics, and LSL synchronization answer different questions; record-wise or weakly grouped splits can still learn identity; same-day success is not cross-day durability; and a score with no calibration or abstention rule is still not operationally interpretable.

Safe reading of this example

If this example is filled well, the strongest safe outcome is a bounded reproducible EEG decode under a named observation contract. It still does not become source ground truth, a target-specific biomarker by default, a causal intervention result, or a WBE-relevant hidden-state readout.

The core scaffold is still four artifacts

  • Data standard: fixes the dataset snapshot, BIDS shape, event files, channel metadata, and QC context.
  • Benchmark: fixes the task, target, split family, hold-out unit, metric bundle, and stopped claim.
  • Registry: fixes preprocessing, split freeze, baselines, nuisance checks, and success/failure conditions before the result is known.
  • Model card / audit log: records scores, failures, shortcut checks, calibration behavior, abstention behavior, and what still remains unresolved.

But the current tutorial also stacks companion cards

  • Observation contract: separates event anchor, event semantics, clock domain, timing-validation class, and label provenance. If omitted, events.tsv, HED, trigger lines, and synchronized streams are too easily read as one solved timing object.
  • Observability Budget: fixes what the EEG directly observed, namely scalp potentials under a named setup, not sources or hidden state by default. If omitted, a small scalp-level classifier is too easily promoted to internal-state evidence.
  • Specificity & Shortcut note: fixes which routes could still explain the score, such as subject fingerprint, setup distribution, residual movement, or other nuisance paths. If omitted, a clean split can still be mistaken for target-specific neural evidence.
  • Temporal-validity note: fixes whether the example is within-session, same-day, cross-session, or cross-day, and whether the decoder was fixed or updated. If omitted, same-session success is silently promoted to durability.
  • Calibration & Abstention note: fixes how probabilities or prediction sets were calibrated, and when the model should abstain instead of forcing output. If omitted, thresholds, confidence, and coverage become uninterpretable.

Step 1: Fix the input and the event contract

The first thing this example now freezes is not only which EEG file you used, but also what the time and label columns are allowed to mean. The current standards and timing literature require a narrower reading here. In BIDS, onset is measured from the first stored data point, not from physical screen or speaker onset. HED makes event semantics machine-readable. LSL can synchronize streams across a LAN and compensate for offset and jitter, but it does not by itself measure true device-side delays. This site therefore asks the tutorial to log those pieces separately; a minimal contract record is sketched at the end of this step.

  • Dataset identity: freeze the snapshot or release, DOI or persistent URL, retrieval date, and license. Why it matters: the same dataset name can still refer to different content over time.
  • BIDS skeleton: freeze dataset_description.json, participant/session/run layout, *_eeg.json, channel metadata, and electrode metadata when positions exist. Why it matters: without this, later readers cannot reconstruct the same measurement condition.
  • Event anchor: freeze events.tsv, events.json, the onset/duration/sample meaning, and any discarded-sample rule. Why it matters: the epoch boundary can look precise while still referring only to stored-file time.
  • Event semantics: freeze trial_type, HED tags when available, condition naming, and any manual scoring rule. Why it matters: two datasets can share a label name while meaning different things.
  • Clock domain and timing-validation class: freeze whether the example has only a stored-data anchor, stream alignment, digital marker capture, or actual physical timing validation. Why it matters: the site no longer lets BIDS, HED, LSL, TTL, and photodiode traces collapse into one timing claim.
  • Label provenance: freeze whether the target label comes from cue markers, manual scoring, a report-derived rule, or another derived path. Why it matters: a signal-only benchmark and a report-assisted benchmark are not the same evidence object.
  • QC / exclusions: freeze bad channels, bad segments, missing runs, and the thresholds used to exclude data. Why it matters: the score becomes impossible to audit if exclusions stay implicit.
Safe tutorial rule

For this worked example, writing only “the data are in BIDS” is no longer enough. The minimum safe wording is: which event anchor exists, which semantics exist, which timing-validation rung was actually tested, and where the label came from.
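
To make that minimum wording concrete, here is a minimal sketch of a frozen observation-contract record, written as plain JSON from Python. Every field value is a hypothetical example under an assumed file-name convention (observation_contract.json), not site-mandated vocabulary.

```python
import json

# Hypothetical field values; the site does not mandate this exact schema.
observation_contract = {
    "dataset_identity": {
        "name": "example-eeg-dataset",   # hypothetical snapshot name
        "doi": "10.0000/placeholder",    # hypothetical persistent ID
        "retrieved": "2026-03-01",
        "license": "CC0",
    },
    "event_anchor": "events.tsv onset, measured from the first stored sample",
    "event_semantics": "trial_type plus HED tags where available",
    "clock_domain": "stored-file time only; no stream alignment",
    "timing_validation_class": "no physical timing validation performed",
    "label_provenance": "cue markers emitted by the stimulus script",
    "qc_exclusions": "2 bad channels interpolated; runs with >20% bad segments dropped",
}

# Freeze the contract to disk so the later model card can point at it.
with open("observation_contract.json", "w") as f:
    json.dump(observation_contract, f, indent=2)
```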

Step 2: Fix the benchmark object and the independence unit

The next weak point in older beginner workflows was to treat a split rule as if it already solved shortcut risk. The current literature does not support that shortcut. Record-wise splits can learn identity rather than the target variable, resting-state EEG can support time-robust person identification, and cross-dataset EEG performance can move with setup differences such as amplifier, cap, sampling rate, or filtering. This example therefore freezes not only the split, but also the independent hold-out unit and the shortcut families that remain plausible; a minimal grouped-split sketch follows the table below.

  • Target: one bounded task such as two-state or few-class EEG classification. Do not overread: a small task must not silently stand in for a general biomarker or latent-state decoder.
  • Evaluation family: within-session, cross-session, cross-subject, cross-dataset, or adaptation regime. Do not overread: the same accuracy means different things in different families.
  • Independent hold-out unit: subject, session, or raw recording, not only windows or epochs. Do not overread: a result can stay identity-confounded even when train/test windows are disjoint.
  • Raw-recording ancestry: whether windows cut from one raw recording ever cross train/test. Do not overread: window-level separation is not enough if the raw ancestor is shared.
  • Setup disjointness: participant, session, site, device, reference system, channel map, and protocol differences. Do not overread: a classifier can still read acquisition-distribution structure rather than the intended signal.
  • Shortcut-aware baselines: metadata-only, subject-ID, or other nuisance-aware baselines when relevant. Do not overread: without them, the score can still be explained by who, when, or how the EEG was recorded.
  • Temporal scope: whether the example is same-session only or claims any reuse across time. Do not overread: a same-session result must not be promoted to cross-day durability after the fact.
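
As one way to freeze the independence unit in code, the sketch below uses scikit-learn's GroupShuffleSplit with subject as the group key. The arrays are synthetic stand-ins; in a real run, the subject labels would come from the BIDS participant metadata and each window would also carry its raw-recording ancestor.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_windows = 200
X = rng.normal(size=(n_windows, 64))            # stand-in feature vectors
y = rng.integers(0, 2, size=n_windows)          # two-state labels
subjects = rng.integers(0, 10, size=n_windows)  # independence unit: subject

# Grouping by subject guarantees that no subject (and hence no raw
# recording of that subject) appears on both sides of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=subjects))

assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
```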

Step 3: Write the registry before training

The registry is where this example stops being a flexible demo and becomes an auditable result. The main point is not fancy formatting. The main point is that preprocessing, splits, baselines, and stopping conditions are fixed before the score appears. If the example will later report probabilities, prediction sets, or an abstain threshold, this is also where the fit / calibration / test separation must be frozen; a minimal registry sketch follows the table below.

  • Preprocessing recipe: filtering, referencing, artifact handling, rejected channels/segments, and derivative boundaries.
  • Split freeze: the exact grouping rule for subjects, sessions, and raw recordings, plus any benchmark version.
  • Baseline plan: the simple baseline, the shortcut-aware baseline, and what counts as improvement over them.
  • Failure conditions: what will count as collapse, such as low sensitivity, a shortcut-only win, unstable calibration, or cross-session failure.
  • Calibration split: whether the model reports only hard labels or also probabilities / prediction sets, and which held-out slice is reserved for threshold or temperature tuning.
  • Stopped claim: write in advance the strongest safe claim if everything works exactly as planned.
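
A minimal sketch of such a registry entry, frozen and hashed before any training run, might look like the following. All field values are hypothetical, and the hash is only one cheap way to make later edits visible, not a site requirement.

```python
import hashlib
import json

# Hypothetical field values; real entries would name exact tool versions.
registry = {
    "preprocessing": "0.5-40 Hz band-pass, average reference, fixed artifact rule",
    "split_freeze": "GroupShuffleSplit by subject, seed 42, benchmark snapshot v0.1",
    "baselines": ["majority class", "subject-ID-only logistic regression"],
    "failure_conditions": [
        "headline metric within the shortcut baseline's confidence interval",
        "unstable calibration on the reserved calibration slice",
    ],
    "calibration_split": "two subjects reserved for threshold tuning only",
    "stopped_claim": "reproducible within-session decode under the named contract",
}

# Hashing the frozen registry gives a cheap tamper-evidence record.
blob = json.dumps(registry, sort_keys=True).encode()
print("registry sha256:", hashlib.sha256(blob).hexdigest())
```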
What the stopped claim should usually look like here

For a minimal public EEG example, the planned stopping point is usually something like: reproducible classification under a named observation contract and declared split regime. It is usually not stable biomarker evidence, target-specific neural proof, or cross-day deployability.

Step 4: Attach the route cards before reading the score

This is the main scientific tightening in the new tutorial. The score is no longer read alone. Before the score is interpreted, the example now stacks four companion checks that answer four different questions; a minimal shortcut-baseline sketch follows the table below.

  • Observability Budget: what did the sensor directly observe? Example answer: scalp potentials under a declared montage and preprocessing regime; not hidden state, sources, or a causal controller by default.
  • Specificity & Shortcut note: which route could still explain the score besides the intended target? Example answer: subject/session fingerprint, setup distribution, residual behavior, or other nuisance routes may still contribute unless audited separately.
  • Temporal-validity note: how far across time may the result be extrapolated? Example answer: if the decoder was evaluated only within session, the example stops at within-session evidence even if the score is strong.
  • Calibration & Abstention note: what do the output probabilities or sets mean, and when should output stop? Example answer: fit/calibration/test are separated, and low-confidence outputs can be rejected instead of forced.
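
For the shortcut note, one concrete audit is a metadata-only baseline that sees nothing but the subject ID. The sketch below uses synthetic data; on a real benchmark, an identity-only baseline that approaches the EEG model's score signals identity confounding rather than target-specific evidence.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(1)
subjects = rng.integers(0, 10, size=300)   # epoch-level subject IDs
y = rng.integers(0, 2, size=300)           # synthetic labels

# The baseline's only feature is the one-hot subject identity.
X_id = OneHotEncoder().fit_transform(subjects.reshape(-1, 1))
X_tr, X_te, y_tr, y_te = train_test_split(X_id, y, random_state=0)

baseline = LogisticRegression(max_iter=200).fit(X_tr, y_tr)
print("identity-only accuracy:", baseline.score(X_te, y_te))
```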
Why the temporal note is now mandatory once time enters the story

Egger et al. (2024) showed that hand-gesture EEG decoding performance shifts across a 10-hour day and that a non-updated classifier can degrade steadily over time. On this site, that means a tutorial cannot jump from “clean split” to “stable result” without stating the fixed decoder interval, the state annotation, and any recalibration burden. A minimal fixed-decoder check is sketched below.
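
Here is a minimal sketch of that check: freeze a decoder on the first block of a recording day, then score later blocks without recalibration. The data and drift below are synthetic and only demonstrate the bookkeeping, not any expected effect size.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Five consecutive time blocks with an artificial feature drift.
blocks = []
for t in range(5):
    X = rng.normal(loc=0.1 * t, size=(100, 16))
    y = rng.integers(0, 2, size=100)
    blocks.append((X, y))

# Freeze the decoder on block 0 and never update it afterwards.
X0, y0 = blocks[0]
decoder = LogisticRegression(max_iter=200).fit(X0, y0)

# Log per-block accuracy so any degradation over time stays visible.
for t, (X, y) in enumerate(blocks[1:], start=1):
    print(f"block {t}: fixed-decoder accuracy = {decoder.score(X, y):.2f}")
```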

Step 5: Publish a model card plus calibration and failure logs

At the end, the model card is still the visible artifact, but it is now narrower and more explicit than the older tutorial implied. The purpose is not only to show where the model wins, but also to expose where the route breaks and what the score is still allowed to mean. A minimal calibration and abstention sketch follows the table below.

  • Headline metrics: the main metric bundle, baseline deltas, and slice-wise results. Why it matters: one number alone hides which regime actually carried the result.
  • Shortcut results: nuisance-aware baselines, metadata-only or identity baselines when relevant, and unresolved shortcut gaps. Why it matters: the score can otherwise be overread as target-specific evidence.
  • Temporal note: whether the result is same-session only, same-day only, or tested further, and whether the decoder stayed fixed. Why it matters: this prevents silent promotion to durability.
  • Calibration report: the fit/calibration/test split, plus ECE/Brier/NLL or prediction-set coverage when applicable. Why it matters: confidence without calibration is not yet operationally meaningful.
  • Abstention / threshold policy: a reject option, a prediction-set rule, or an explicit statement that the example does not yet support one. Why it matters: this stops threshold tweaking from hiding inside the test result.
  • Failure ledger: subjects, sessions, states, or setup slices where the example collapses. Why it matters: without this, only favorable conditions survive into the narrative.
  • Stopped claim: one or two lines stating what the result supports and what it still does not support. Why it matters: this prevents the example from being reused as stronger evidence than it earned.
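
A minimal sketch of the calibration and abstention checks might look like the following, with Brier score and NLL from scikit-learn, a plain binned ECE, and a reject rule at a hypothetical confidence threshold. The synthetic probabilities are constructed to be well calibrated so the numbers are readable.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

def expected_calibration_error(y_true, p_hat, n_bins=10):
    """Plain binned ECE for binary probabilities."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (p_hat >= lo) & (p_hat < hi)
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - p_hat[mask].mean())
    return ece

rng = np.random.default_rng(3)
p_hat = rng.uniform(size=500)                    # model probabilities
y = (rng.uniform(size=500) < p_hat).astype(int)  # synthetic, well calibrated

print("Brier:", brier_score_loss(y, p_hat))
print("NLL:  ", log_loss(y, p_hat))
print("ECE:  ", expected_calibration_error(y, p_hat))

# Abstention rule frozen in the registry: refuse low-confidence outputs.
tau = 0.7  # hypothetical confidence threshold
confident = np.maximum(p_hat, 1.0 - p_hat) >= tau
print("coverage after abstention:", confident.mean())
```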

What this example now supports, and what it still does not support

What this example can support:

  • A reproducible small EEG benchmark with a named event contract and a declared split regime.
  • A bounded score comparison against declared baselines under a declared independence unit.
  • A same-session or explicitly bounded temporal result.
  • A calibrated or abstaining output, but only if calibration and abstention were frozen and reported explicitly.

What it still does not support:

  • Physical timing truth, unless the highest timing-validation rung was actually measured.
  • Target-specific neural evidence, if shortcut routes remain unresolved.
  • Cross-day durability, fixed-decoder stability, or deployability, unless those were directly audited.
  • Causal, source-identification, or WBE-level hidden-state claims.

One small pack this page now expects

Minimum pack

  • Dataset identity: snapshot, DOI or URL, retrieval date, and license.
  • Observation contract: event anchor, event semantics, timing-validation class, label provenance, and QC.
  • Benchmark object: task, target, metric bundle, split family, and independent hold-out unit.
  • Shortcut note: plausible shortcut families, shortcut-aware baselines, and unresolved shortcut gap.
  • Temporal note: same-session or beyond, fixed decoder or updated decoder, and stopped time claim.
  • Registry: preprocessing, split freeze, baselines, failure conditions, and calibration split when needed.
  • Model card: results, failures, calibration behavior, abstention behavior, and stopped claim.
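
A minimal sketch of a pack completeness gate over the list above might look like this; the file names are assumed conventions, not mandated by the site.

```python
from pathlib import Path

# Hypothetical file-name conventions for the pack artifacts listed above.
REQUIRED = [
    "dataset_identity.json",
    "observation_contract.json",
    "benchmark.json",
    "shortcut_note.md",
    "temporal_note.md",
    "registry.json",
    "model_card.md",
]

missing = [name for name in REQUIRED if not Path(name).exists()]
if missing:
    raise SystemExit(f"minimum pack incomplete, missing: {missing}")
print("minimum pack complete")
```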

Where to go next

If you want the full blueprint again, return to Verification. If you want the stricter L0 checklist, go next to Wiki: Minimum artifact pack for L0. If the main uncertainty is event timing or label meaning, use Wiki: Event synchronization and observation logs. If the problem is shortcut resistance or hold-out ancestry, use Wiki: Data splits and leakage. If the claim starts to cross sessions or days, continue to Wiki: State, trait, and drift.

References behind this correction

  1. Brain Imaging Data Structure. Events. BIDS specification.
  2. Hermes D, Pal Attia T, Beniczky S, et al. Hierarchical Event Descriptor library schema for EEG data annotation. Scientific Data. 2025. doi:10.1038/s41597-025-05791-2
  3. Kothe C, et al. The Lab Streaming Layer for synchronized multimodal recording. Imaging Neuroscience. 2025. doi:10.1162/IMAG.a.136
  4. Lepauvre A, Hirschhorn R, Bendtz K, Mudrik L, Melloni L. A standardized framework to test event-based experiments. Behavior Research Methods. 2024. doi:10.3758/s13428-024-02508-y
  5. Chaibub Neto E, et al. Detecting the impact of subject characteristics on machine learning-based diagnostic applications. npj Digital Medicine. 2019;2:99. doi:10.1038/s41746-019-0178-x
  6. Xu L, et al. Cross-Dataset Variability Problem in EEG Decoding With Deep Learning. Frontiers in Human Neuroscience. 2020;14:103. doi:10.3389/fnhum.2020.00103
  7. Di Y, et al. The Time-Robustness Analysis of Individual Identification Based on Resting-State EEG. Frontiers in Human Neuroscience. 2021;15:672946. doi:10.3389/fnhum.2021.672946
  8. Egger J, et al. Chrono-EEG dynamics influencing hand gesture decoding: a 10-hour study. Scientific Reports. 2024;14:20247. doi:10.1038/s41598-024-70609-x
  9. Lei J, G'Sell M, Rinaldo A, Tibshirani RJ, Wasserman L. Distribution-Free Predictive Inference for Regression. Journal of the American Statistical Association. 2018;113(523):1094-1111. doi:10.1080/01621459.2017.1307116