Wiki: Data partitioning and data leaks

Even if accuracy is high, the evaluation is corrupted if the split is lax

Mind Uploading Research Project

Public page · Updated: 2026-03-25 · Practical guide

How to use this page

Read this first to avoid getting lost

This page is a wiki that explains from the beginning how to divide datasets and why data leaks are dangerous. The 2026-03 re-audit tightened one more point: a clean split is necessary, but it is still not enough if acquisition-distribution shortcuts or benchmark-governance failures remain hidden.

  • Difficulty varies greatly depending on whether the split unit is the subject, the session, or time.
  • Leaks occur not only through 'cheating' but also through well-intentioned preprocessing and splitting.
  • Even across the four starter datasets, the independent unit differs: subject, case, night, or session.
  • The first thing to look at is the split rules and leak countermeasures, not the accuracy itself.
  • A clean subject/session split still does not neutralize acquisition-distribution shortcuts such as site, device, reference system, electrode layout, or protocol mix.
  • Benchmark governance is part of leakage control: version, randomization, hidden grouping, extra-data policy, pretrained checkpoints, inference-stage restrictions, and organizer postmortems can all change what the score means.
Best for
People building initial evaluations on public data; people who find leaks and splits confusing
Reading time
12-18 minutes
Accuracy note
These are operational rules, not one-size-fits-all formulas. The best split still depends on the task and data structure, and official challenge operations can materially change what a benchmark score means.

Relatively clear at this stage

What we know now

  • Accuracy is easily overestimated if the train/test separation is loose.
  • Apparent performance tends to improve when fragments from the same subject, the same session, or nearby times land on both sides.
  • In clinical EEG, report text and report-derived labels can also be leakage sources.
  • Preprocessing, normalization, and feature selection also become leak sources when they are fit after looking at all the data.
  • Even after coarse split hygiene, metadata shortcuts such as site, amplifier, reference, and electrode layout can still dominate the score if disjointness and harmonization are not disclosed.
  • Leaderboard and challenge results are not stable objects unless benchmark provenance and later postmortems are carried together with the score.

Still unresolved beyond this point

What we still do not know

  • Which split best approximates future real-world operation; this depends on the task setting and usage context.
  • How to verify that leaks have been completely eliminated; the claim requires deep understanding and auditing of the data structure.
  • How to standardize the handling of report-derived labels in signal-only benchmarks; the operational design is still in progress.
  • Which benchmark-governance bundle should become the default reusable card for EEG leaderboards is still being refined.

Learn the basics

Check the basics in the wiki

What the wiki is for

The wiki is a learning aid. For the project's official current synthesis, success criteria, and operating rules, always return to the public pages.

The shortest explanation

Data division is the process of "deciding how much you are allowed to look at before checking the answers." A data leak occurs when that boundary is inadvertently crossed and information that would be unavailable in production is mixed into training and tuning.

2026-03 re-audit: split hygiene is necessary, not sufficient

The older version of this page was good at explaining subject / session / time split, but it still left two practical shortcuts too implicit. First, Chaibub Neto et al. (2019), Melnik et al. (2017), Xu et al. (2020), and Di et al. (2021) together show that identity and acquisition-distribution shortcuts can remain even when the coarse split sounds respectable. Second, the official EEG Challenge (2025) homepage, rules, submission page, and leaderboard show that benchmark governance itself can move what a score means. Therefore this page now treats split hygiene, acquisition-distribution audit, and benchmark provenance as one operational bundle rather than three unrelated side notes.

Why partitioning matters so much

On a school test, if you practice the questions while looking at the answers, you will get a higher score; but that score says nothing about your ability to solve genuinely new problems. The same goes for machine learning: if information seen during training bleeds into the test side, only the numbers look good.

First, be aware of the unit of division

  • Subject unit — Use when you want to see whether the model generalizes to new people. Note: the task looks easier than it really is when fragments of the same person sit in both train and test.
  • Session unit — Use when you want to see whether the same person is stable on different days. Note: if you split only within same-day recordings, you miss day-to-day variation and differences in electrode condition.
  • Time unit — Use when envisioning future prediction or continuous operation. Note: if windows from nearby times land on both sides, the model may effectively see almost the same fragment twice.
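The subject-unit rule above can be sketched as a small helper. This is an illustrative sketch only, not from any specific library; `split_by_subject` and its arguments are names invented for this example.

```python
import random

def split_by_subject(windows, test_fraction=0.3, seed=0):
    """Assign whole subjects to train or test so no subject spans both sides.

    `windows` is a list of (subject_id, payload) pairs; the payload is opaque
    here. Splitting at the subject level is the point: a random split over
    windows would let fragments of the same person leak into both sides.
    """
    subjects = sorted({s for s, _ in windows})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [(s, w) for s, w in windows if s not in test_subjects]
    test = [(s, w) for s, w in windows if s in test_subjects]
    return train, test

# Usage: 4 subjects x 5 windows each; no subject may appear on both sides.
data = [(f"S{i:02d}", f"win{j}") for i in range(4) for j in range(5)]
train, test = split_by_subject(data)
assert {s for s, _ in train}.isdisjoint({s for s, _ in test})
```

The same pattern extends to sessions or nights: just change what counts as the group identifier.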

Four starter datasets, and their independent units are not the same

The last two columns are this site's operational reasoning

The "Why it leaks" and "Safe split" columns in the table below are operational rules drawn by this site from each dataset's official description and the hierarchical structure shown in its primary documents.

  • EEG Motor Movement/Imagery — Independent unit: subject (and run if necessary). Common mis-split: random split of epochs / trials. Why it leaks: signal characteristics and cue structure of the same subject and session span train / test. Safe split: separate by subject first, and keep runs apart even for within-subject evaluation.
  • CHB-MIT — Independent unit: subject, with case chronology preserved. Common mis-split: random split by file. Why it leaks: chb21 is the same subject as chb01, and the gaps between files carry context. Safe split: check correspondence at the subject level rather than the case level, and preserve sequential order and gaps.
  • Sleep-EDF — Independent unit: subject-night. Common mis-split: random split of epochs. Why it leaks: sequential hypnograms and subject-specific sleep structure from the same night span train / test. Safe split: keep each night intact and declare up front whether the claim is cross-subject or within-subject.
  • TUH EEG / TUSZ — Independent unit: patient / session. Common mis-split: random split of segments / files, or signal-only evaluation with reports included. Why it leaks: multiple sessions and de-identified reports of the same patient carry information close to the label. Safe split: require per-patient / per-session splits and a report-usage flag.
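The CHB-MIT row can be made concrete with a case-to-subject lookup before splitting. The sketch below is illustrative: the chb01/chb21 identity comes from the table above, but the other mappings are placeholders, and the full table must come from the dataset's official documentation.

```python
# Illustrative: group CHB-MIT-style files by *subject* rather than by case
# number before splitting. chb01 and chb21 are the same subject, so a
# case-level split would still leak.
CASE_TO_SUBJECT = {
    "chb01": "subjA",  # same subject recorded twice ...
    "chb21": "subjA",  # ... under two case numbers
    "chb02": "subjB",  # placeholder; verify against official documentation
}

def subject_of(filename):
    case = filename.split("_")[0]           # "chb21_03.edf" -> "chb21"
    return CASE_TO_SUBJECT.get(case, case)  # fall back to case id if unmapped

files = ["chb01_03.edf", "chb21_01.edf", "chb02_16.edf"]
groups = {}
for f in files:
    groups.setdefault(subject_of(f), []).append(f)

# chb01_* and chb21_* now sit in one group and must stay on one side.
assert groups["subjA"] == ["chb01_03.edf", "chb21_01.edf"]
```

Only after this grouping does it make sense to apply a subject-level split; splitting the raw file list skips the correspondence check the table warns about.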

7 common leak patterns

  • Fragments of the same subject on both sides — The individual's idiosyncrasies are memorized, so generalization to new people looks higher than it is.
  • Adjacent time windows mixed — Nearly identical waveform slices are split into train / test, underestimating the difficulty of predicting the future.
  • Normalization and feature selection fit on all data — Test-side statistics are used during training, so information flows backwards.
  • Model selection repeated against test — Test effectively takes on the role of validation, and the final score is optimistic.
  • Duplicate or derived samples missed — Data originally cut from the same record lands on both sides, so the comparison is not between independent samples.
  • Site / device / reference / layout shortcuts hidden — The model learns acquisition-distribution structure such as amplifier, montage, reference system, or electrode layout instead of the target neural variable.
  • Challenge operations treated as fixed when they changed — The benchmark name stays the same while randomization, hidden grouping, extra-data policy, inference restrictions, or organizer postmortems change what the ranking actually measures.
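The normalization pattern above is worth seeing in code. This is a minimal sketch with hand-rolled statistics (no library assumed): the correct path fits mean and standard deviation on train only, while the leaky path fits them on train plus test and lets test-side statistics flow backwards.

```python
def fit_standardizer(values):
    """Fit mean/std on ONE split only; the test side must never enter here."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return mean, (std if std > 0 else 1.0)

def apply_standardizer(values, mean, std):
    return [(v - mean) / std for v in values]

train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 11.0]  # distribution shift the model should NOT see in advance

# Correct: statistics come from train only, then are applied to test.
mean, std = fit_standardizer(train)
test_ok = apply_standardizer(test, mean, std)

# Leaky: fitting on train + test uses test-side statistics during training.
mean_leak, std_leak = fit_standardizer(train + test)
test_leak = apply_standardizer(test, mean_leak, std_leak)

assert test_ok != test_leak  # the leak visibly changes the test inputs
```

The same boundary applies to feature selection and threshold tuning: anything with a "fit" step belongs inside the allowed split only.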

Split hygiene still leaves four shortcut families

  • Subject / session fingerprint — What can masquerade as progress: a score that looks like generalization while it mostly exploits stable subject-specific or session-specific structure. Publish instead: the independent hold-out unit, raw-window ancestry, and whether subject / session identifiers were fully disjoint.
  • Acquisition-distribution shortcut — What can masquerade as progress: a model that rides site, amplifier, cap, sampling rate, filter chain, reference system, or electrode-layout differences instead of the claimed neural variable. Publish instead: site / device / reference / layout disjointness, the harmonization log, and a metadata-only or setup-only baseline whenever possible.
  • Report / metadata shortcut — What can masquerade as progress: a signal-only claim that inherits report-derived labels, triage context, or structured metadata already sitting close to the answer. Publish instead: whether report text or derived metadata were used, and separate signal-only from multimodal / metadata-assisted scoreboards.
  • Benchmark-governance shortcut — What can masquerade as progress: a leaderboard that looks stable even though hidden grouping, randomization, extra-data policy, checkpoint policy, or inference-stage rules changed what the benchmark measured. Publish instead: benchmark version, split / randomization rule, hidden grouping, extra-data and pretrained-checkpoint policy, inference-stage restrictions, and later postmortems together with the score.
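The "metadata-only baseline" mentioned above can be sketched very simply: predict each test label from acquisition metadata alone and see how far that gets. The helper name and toy data here are invented for illustration; the point is the diagnostic, not the implementation.

```python
from collections import Counter, defaultdict

def metadata_only_baseline(train, test):
    """Predict each test label from acquisition metadata alone (here: site).

    If this baseline scores close to the model, the model is probably riding
    the acquisition-distribution shortcut rather than the neural variable.
    """
    per_site = defaultdict(Counter)
    for site, label in train:
        per_site[site][label] += 1
    overall = Counter(label for _, label in train)
    preds = []
    for site, _ in test:
        counts = per_site.get(site, overall)  # unseen site -> global majority
        preds.append(counts.most_common(1)[0][0])
    return preds

# Toy data where the label is perfectly confounded with the recording site.
train = [("siteA", "patient")] * 8 + [("siteB", "control")] * 8
test = [("siteA", "patient"), ("siteB", "control")] * 3
preds = metadata_only_baseline(train, test)
accuracy = sum(p == y for p, (_, y) in zip(preds, test)) / len(test)
assert accuracy == 1.0  # a "perfect" score without reading any EEG signal
```

A model that barely beats this baseline has learned the site, not the brain; publishing the baseline alongside the score makes that visible.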
Benchmark governance is part of leakage control, not administrative detail

The official EEG Challenge (2025) homepage explicitly says the original preprint became outdated after execution-phase changes and that the website plus starter kit should be treated as current. The official rules require disclosure of additional pretraining datasets, pretrained models and fine-tuning method, code submission during the inference stage, and a single-GPU 20 GB inference budget. The official submission page further fixes the event as an inference-only code competition. The final leaderboard then disclosed that Challenge 2 samples had not been randomized, allowing contiguous-trial same-subject structure to affect the ranking and forcing separate awards. On this site, that means benchmark governance now belongs on the same checklist as split hygiene rather than in a footnote after the score.

Metric semantics are also part of leak-resistant reporting

Even after split hygiene and benchmark provenance are disclosed, the reported number can still mislead if the task is rare-event or class-imbalanced. Saito & Rehmsmeier (2015) showed that precision-recall views are often more informative than ROC summaries under strong imbalance. In seizure tasks, Roy et al. (2021) and Scheuer et al. (2021) show that event sensitivity, overlap logic, and false alarms per hour or day matter together, while Segal et al. (2023) shows that false-alarm control is itself a design target in seizure prediction. In sleep staging, Sun et al. (2017) and Vallat & Walker (2021) show that pooled performance can still hide minority-stage failure. Therefore, this site now asks for a task-matched metric bundle in addition to split hygiene.
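The event-level pairing described above (event sensitivity plus false alarms per hour) can be sketched as follows. This is a deliberately simplified version: overlap is plain interval intersection, whereas published scoring rules such as those discussed by Roy et al. (2021) and Scheuer et al. (2021) define overlap and event counting more carefully.

```python
def overlaps(a, b):
    """Two (start, end) intervals in seconds overlap."""
    return a[0] < b[1] and b[0] < a[1]

def event_sensitivity_and_far(true_events, pred_events, record_hours):
    """Event-level sensitivity plus false alarms per hour.

    Pooled sample-level accuracy can look excellent while the detector misses
    whole events or fires constantly; reporting this pair makes that visible.
    """
    hits = sum(any(overlaps(t, p) for p in pred_events) for t in true_events)
    false_alarms = sum(
        not any(overlaps(p, t) for t in true_events) for p in pred_events
    )
    sensitivity = hits / len(true_events) if true_events else float("nan")
    return sensitivity, false_alarms / record_hours

# Toy 24 h record: 2 true seizures, 3 predictions (one of them spurious).
true_events = [(100, 160), (5000, 5060)]
pred_events = [(110, 150), (5010, 5050), (9000, 9030)]
sens, far = event_sensitivity_and_far(true_events, pred_events, record_hours=24)
assert sens == 1.0
assert far == 1 / 24
```

Both numbers belong in the report together: sensitivity alone rewards detectors that fire everywhere, and the false-alarm rate alone rewards detectors that never fire.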

Dataset-specific leak warnings added in this re-audit

  • EEG Motor Movement/Imagery — Because it is a cue-locked motor task, visual-cue, eye-movement, and myoelectric contributions must still be audited separately even after the split is tightened.
  • CHB-MIT — Do not confuse subject numbers with case numbers. Do not shuffle files; pin gaps and chronology in the runbook.
  • Sleep-EDF — Do not silently treat R&K hypnograms as equivalent to AASM. No cross-dataset comparison without a written label mapping.
  • TUH EEG / TUSZ — Do not mix report text, triage context, or session metadata derived from report keywords into the input of a signal-only benchmark.

The minimum you should report

Report Items

  • Evaluation family: whether the result is within-session, cross-session, cross-subject, or temporal / longitudinal.
  • Split rule: how many items were placed in train / validation / calibration / test, and what the independent hold-out unit was.
  • Window ancestry: which subject / case / night / session / file / record generated each split, and whether near-adjacent windows were kept apart.
  • Metric bundle: whether the task was read through balanced / macro metrics, event sensitivity plus false alarms, or per-stage agreement rather than a single headline number.
  • Report usage flag: whether the claim is signal-only, or report text / metadata / multimodal fields were also used.
  • Acquisition-distribution audit: whether site, device, reference system, channel map, electrode layout, and protocol distribution were separated, harmonized, or left mixed.
  • Preprocessing boundaries: whether normalization, feature selection, and threshold tuning were fit using only the allowed split.
  • Benchmark provenance: if this is a challenge or leaderboard result, the benchmark version, randomization rule, hidden grouping, extra-data / pretrained-checkpoint policy, inference-stage restrictions, and later postmortems.
  • Baseline: the improvement compared with a simpler model or a metadata-only / setup-only baseline.
  • Failure example + stopping claim: under what conditions it failed, what was excluded, and what stronger claim is explicitly not being made.
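The checklist above can travel with a result as a small machine-readable "evaluation card". The field names below simply mirror this page's checklist; they are not any standard schema, and the sample values are placeholders.

```python
# Illustrative "evaluation card" mirroring the report items above.
EVALUATION_CARD = {
    "evaluation_family": "cross-subject",
    "split_rule": {"train": 60, "validation": 10, "calibration": 5,
                   "test": 25, "holdout_unit": "subject"},
    "window_ancestry": "each window tagged with subject/session/file of origin",
    "metric_bundle": ["balanced_accuracy", "event_sensitivity",
                      "false_alarms_per_hour"],
    "report_usage_flag": "signal-only",
    "acquisition_audit": "site and device disjoint across splits",
    "preprocessing_boundaries": "normalization and selection fit on train only",
    "benchmark_provenance": None,  # fill in for challenge/leaderboard runs
    "baseline": "metadata-only (site) baseline reported alongside the model",
    "failure_and_stopping_claim": "fails under montage change; "
                                  "no within-subject claim is made",
}

# A result without these fields should not be compared against others.
REQUIRED = {"evaluation_family", "split_rule", "metric_bundle",
            "report_usage_flag", "preprocessing_boundaries", "baseline"}
assert REQUIRED <= set(EVALUATION_CARD)
```

Attaching such a card to every score keeps the split rules and leak countermeasures visible next to the number, which is exactly the reading order this page recommends.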

Safety measures to fall back on when you get lost

When in doubt, it is safe to follow these three rules:

  • Separate train / test by subject.
  • Do not touch test until the very end.
  • Fit normalization and feature selection on train only.

Even if this seems too harsh, reliable accuracy is more valuable than fancy numbers.

Where to go back next

Return to Data & Bench if you want to review the actual starter data, to Hands-on if you want to rebuild the minimal loops, or to Verification Foundation if you want to see why this page is part of the verification foundation.