The shortest explanation
Data splitting is the act of deciding "how far you may look before checking the answers." A data leak occurs when that boundary is crossed unnoticed and information that would not be available in production seeps into training and tuning.
The older version of this page explained subject / session / time splits well, but it left two practical shortcuts too implicit. First, Chaibub Neto et al. (2019), Melnik et al. (2017), Xu et al. (2020), and Di et al. (2021) together show that identity and acquisition-distribution shortcuts can survive even when the coarse split sounds respectable. Second, the official EEG Challenge (2025) homepage, rules, submission page, and leaderboard show that benchmark governance itself can change what a score means. This page therefore treats split hygiene, acquisition-distribution auditing, and benchmark provenance as one operational bundle rather than three unrelated side notes.
Why splitting matters so much
On a school test, practicing the questions while looking at the answers raises your score, but that score says nothing about your ability to solve genuinely new problems. Machine learning is the same: if information seen during training bleeds into the test side, only the numbers look good.
First, know your unit of splitting
| Split unit | When to use it | Pitfalls |
|---|---|---|
| Per subject | When you want to test generalization to new people. | If fragments of the same person land in both train and test, the task looks easier than it is. |
| Per session | When you want to test whether the same person is stable across different days. | Splitting only within a single day's recording misses day-to-day variation and differences in electrode condition. |
| Per time | When you envision future prediction or continuous operation. | If windows close in time land on both sides, train and test can contain nearly identical fragments. |
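The subject-unit rule in the table above can be sketched in a few lines: hold out whole subjects, never individual windows. The array sizes and subject IDs below are synthetic placeholders, not any real dataset.

```python
import numpy as np

# Sketch of a subject-disjoint split with synthetic data.
rng = np.random.default_rng(0)
n_windows = 200
subjects = rng.integers(0, 10, size=n_windows)   # subject ID per window
X = rng.normal(size=(n_windows, 64))             # one feature row per window

# Hold out 3 of the 10 subjects entirely.
held_out = rng.permutation(np.unique(subjects))[:3]
test_mask = np.isin(subjects, held_out)
train_idx, test_idx = np.where(~test_mask)[0], np.where(test_mask)[0]

# The identity shortcut is closed only if no subject spans the boundary.
assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
```

The same pattern generalizes to sessions or nights: whatever the independent unit is, the disjointness assertion should run on that unit's identifiers, not on window indices.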
Four starter datasets: the independent unit is not the same everywhere
The "Why it leaks" and "Safe split" columns in the table below are operational rules this site derives from each dataset's official description and the hierarchy shown in its primary documents.
| Dataset | Independent unit to prioritize | Common mis-split | Why it leaks | Safe split |
|---|---|---|---|---|
| EEG Motor Movement/Imagery | Subject, and run if needed | Random split of epochs / trials | Signal characteristics and cue structure of the same subject and session span train / test. | Separate by subject and by run first, even when the evaluation is within-subject. |
| CHB-MIT | Subject and case chronology | Random split by file | chb21 is the same subject as chb01, and the gaps between files carry context. | Check subject correspondence rather than case numbers, and split while preserving chronological order and gaps. |
| Sleep-EDF | Subject-night | Random split of epochs | Sequential hypnograms and subject-specific sleep structure from the same night span train / test. | Keep each night intact, and declare up front whether the claim is cross-subject or within-subject generalization. |
| TUH EEG / TUSZ | Patient / session | Random split of segments / files; signal-only evaluation with reports included | Multiple sessions and de-identified reports from the same patient carry information close to the label. | Require a per-patient / per-session split and a report-usage flag. |
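The CHB-MIT row can be turned into an automatic audit: before trusting a per-case split, check it against a case-to-subject map. The chb01/chb21 pairing is the duplicate documented by the dataset; the split and subject labels below are hypothetical.

```python
# Audit a planned split against a case -> subject map before trusting
# "per-case" independence. Subject labels here are made up for the sketch.
case_to_subject = {"chb01": "S01", "chb21": "S01", "chb02": "S02", "chb03": "S03"}

train_cases = {"chb01", "chb02"}
test_cases = {"chb21", "chb03"}

train_subjects = {case_to_subject[c] for c in train_cases}
test_subjects = {case_to_subject[c] for c in test_cases}

# Any overlap means the "per-case" split is really a same-subject leak.
overlap = train_subjects & test_subjects
print(sorted(overlap))  # ['S01'] — chb01 and chb21 share a subject
```

The same check extends to any dataset whose file or case IDs do not map one-to-one onto patients.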
Seven common leak patterns
| Common accident | What is actually happening |
|---|---|
| Put fragments of the same subject on both sides | The individual's idiosyncrasies are memorized, and generalization to new people looks better than it is. |
| Mix adjacent time windows | Nearly identical waveform slices end up in train and test, underestimating the difficulty of predicting the future. |
| Normalize and select features on all data | Test-side statistics are used during training, so information flows backwards. |
| Repeat model selection on the test set | The test set effectively becomes a validation set, and the final score is optimistic. |
| Miss duplicate or derived samples | Data cut from the same original record ends up on both sides, so the comparison is no longer between independent samples. |
| Hide site / device / reference / layout shortcuts | The model learns acquisition-distribution structure such as amplifier, montage, reference system, or electrode layout instead of the target neural variable. |
| Treat challenge operations as fixed when they changed | The benchmark name stays the same while randomization, hidden grouping, extra-data policy, inference restrictions, or organizer postmortems change what the ranking actually measures. |
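The "normalize and select features on all data" accident in the table above is the easiest to demonstrate and to fix. A minimal numpy sketch with synthetic data: fit the statistics on train only, then freeze them.

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(loc=0.0, scale=1.0, size=(100, 4))
test = rng.normal(loc=0.5, scale=1.0, size=(40, 4))   # shifted, as in deployment

# Wrong: statistics over the concatenated data let test information flow back.
#   mu = np.concatenate([train, test]).mean(axis=0)

# Right: fit normalization on train only, then apply the frozen statistics.
mu, sigma = train.mean(axis=0), train.std(axis=0)
train_z = (train - mu) / sigma
test_z = (test - mu) / sigma   # reuse train statistics, never refit on test
```

The same boundary applies to feature selection and threshold tuning: anything fit with a `fit`-like step belongs inside the training split only.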
Split hygiene still leaves four shortcut families
| Shortcut family | What can masquerade as progress | What to publish instead |
|---|---|---|
| Subject / session fingerprint | A score can look like generalization while it mostly exploits stable subject-specific or session-specific structure. | Disclose the independent hold-out unit, raw-window ancestry, and whether subject/session identifiers were fully disjoint. |
| Acquisition-distribution shortcut | A model can ride site, amplifier, cap, sampling rate, filter chain, reference system, or electrode-layout differences instead of the claimed neural variable. | Publish site / device / reference / layout disjointness, the harmonization log, and a metadata-only or setup-only baseline whenever possible. |
| Report / metadata shortcut | A signal-only claim can inherit report-derived labels, triage context, or structured metadata that already sits close to the answer. | State whether report text or derived metadata were used, and separate signal-only from multimodal / metadata-assisted scoreboards. |
| Benchmark-governance shortcut | A leaderboard can look stable even though hidden grouping, randomization, extra-data policy, checkpoint policy, or inference-stage rules changed what the benchmark measured. | Publish benchmark version, split / randomization rule, hidden grouping, extra-data and pretrained-checkpoint policy, inference-stage restrictions, and later postmortems together with the score. |
The official EEG Challenge (2025) homepage explicitly says the original preprint became outdated after execution-phase changes and that the website plus starter kit should be treated as current. The official rules require disclosure of additional pretraining datasets, pretrained models and fine-tuning method, code submission during the inference stage, and a single-GPU 20 GB inference budget. The official submission page further fixes the event as an inference-only code competition. The final leaderboard then disclosed that Challenge 2 samples had not been randomized, allowing contiguous-trial same-subject structure to affect the ranking and forcing separate awards. On this site, that means benchmark governance now belongs on the same checklist as split hygiene rather than in a footnote after the score.
Even after split hygiene and benchmark provenance are disclosed, the reported number can still mislead if the task is rare-event or class-imbalanced. Saito & Rehmsmeier (2015) showed that precision-recall views are often more informative than ROC summaries under strong imbalance. In seizure tasks, Roy et al. (2021) and Scheuer et al. (2021) show that event sensitivity, overlap logic, and false alarms per hour or day matter together, while Segal et al. (2023) shows that false-alarm control is itself a design target in seizure prediction. In sleep staging, Sun et al. (2017) and Vallat & Walker (2021) show that pooled performance can still hide minority-stage failure. Therefore, this site now asks for a task-matched metric bundle in addition to split hygiene.
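The event-level bundle argued for above (event sensitivity together with false alarms per hour) can be sketched with simple interval overlap. The events, detections, and any-overlap rule below are hypothetical; real scorers such as those discussed in Roy et al. (2021) and Scheuer et al. (2021) define overlap and merging logic more carefully.

```python
# Event-level sensitivity and false alarms per hour with hypothetical
# (start, end) intervals in seconds over a 1-hour record.
true_events = [(100, 160), (500, 540), (900, 960)]
detections = [(110, 150), (700, 705), (910, 950)]
record_hours = 1.0

def overlaps(a, b):
    """True if the half-open intervals a and b share any time."""
    return a[0] < b[1] and b[0] < a[1]

hits = sum(any(overlaps(ev, d) for d in detections) for ev in true_events)
false_alarms = sum(not any(overlaps(d, ev) for ev in true_events) for d in detections)

sensitivity = hits / len(true_events)     # 2 of 3 events caught
fa_per_hour = false_alarms / record_hours  # 1 spurious detection per hour
```

Reporting these two numbers together, rather than a pooled window-level accuracy, is exactly the task-matched bundle this site asks for on rare-event tasks.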
Dataset-specific leak warnings
| Dataset | Points to pin down |
|---|---|
| EEG Motor Movement/Imagery | It is a cue-locked motor task, so even with a stricter split, audit visual-cue, ocular, and myoelectric contributions separately. |
| CHB-MIT | Do not confuse subject numbers with case numbers. Do not shuffle files; pin gaps and chronology in the runbook. |
| Sleep-EDF | Do not silently treat R&K hypnograms as equivalent to AASM. No cross-dataset comparison without writing down the label mapping. |
| TUH EEG / TUSZ | Do not mix report text, triage context, or session metadata derived from report keywords into the input of a signal-only benchmark. |
What to report at a minimum
- Evaluation family: whether the result is within-session, cross-session, cross-subject, or temporal / longitudinal.
- Split rule: how many items were placed in train / validation / calibration / test, and what was the independent hold-out unit?
- Window ancestry: which subject / case / night / session / file / record generated each split, and were near-adjacent windows kept apart?
- Metric bundle: was the task read through balanced / macro metrics, event sensitivity plus false alarms, or per-stage agreement rather than a single headline number?
- Report usage flag: was the claim signal-only, or were report text / metadata / multimodal fields also used?
- Acquisition-distribution audit: were site, device, reference system, channel map, electrode layout, and protocol distribution separated, harmonized, or left mixed?
- Preprocessing boundaries: were normalization, feature selection, and threshold tuning fit using only the allowed split?
- Benchmark provenance: if this is a challenge or leaderboard result, what were the benchmark version, randomization rule, hidden grouping, extra-data / pretrained-checkpoint policy, inference-stage restrictions, and later postmortems?
- Baseline: what is the improvement compared with a simpler model or a metadata-only / setup-only baseline?
- Failure example + stopping claim: under what conditions did it fail, what was excluded, and what stronger claim is explicitly not being made?
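One way to make this checklist operational is a machine-readable manifest that travels with the result. The field names and values below are this page's suggestion, not any standard schema.

```python
# Sketch of a reporting manifest mirroring the checklist above.
report = {
    "evaluation_family": "cross-subject",
    "split_rule": {"unit": "subject", "train": 70, "validation": 10, "test": 20},
    "window_ancestry": "windows keyed by subject/session; adjacent windows kept on one side",
    "metric_bundle": ["balanced_accuracy", "event_sensitivity", "false_alarms_per_hour"],
    "report_usage_flag": "signal-only",
    "acquisition_audit": "single site, single amplifier, common reference",
    "preprocessing_boundaries": "normalization and feature selection fit on train only",
    "benchmark_provenance": None,  # not a challenge submission
    "baseline": "bandpower + logistic regression",
    "failure_and_stopping_claim": "fails on high-EMG sessions; no cross-site claim made",
}

# Refuse to publish a score whose manifest is missing the core fields.
required = {"evaluation_family", "split_rule", "metric_bundle",
            "report_usage_flag", "preprocessing_boundaries"}
missing = required - report.keys()
assert not missing, f"manifest incomplete: {sorted(missing)}"
```

A manifest like this can be checked in continuous integration, so an incomplete disclosure fails loudly instead of silently shipping a headline number.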
References
- PhysioNet: EEG Motor Movement/Imagery Dataset
- PhysioNet: CHB-MIT Scalp EEG Database
- PhysioNet: Sleep-EDF Database Expanded
- Obeid & Picone (2016), The Temple University Hospital EEG Data Corpus
- Shah et al. (2018), The Temple University Hospital Seizure Detection Corpus
- Moser et al. (2009), Sleep classification according to AASM and Rechtschaffen & Kales
- Chaibub Neto et al. (2019), Detecting the impact of subject characteristics on machine learning-based diagnostic applications
- Melnik et al. (2017), Systems, subjects, sessions: to what extent do these factors influence EEG data?
- Xu et al. (2020), Cross-dataset variability problem in EEG decoding with deep learning
- Di et al. (2021), The Time-Robustness Analysis of Individual Identification Based on Resting-State EEG
- Saito & Rehmsmeier (2015), The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets
- Roy et al. (2021), Evaluation of artificial intelligence systems for assisting neurologists with fast and accurate annotations of scalp electroencephalography data
- Scheuer et al. (2021), Seizure Detection: Interreader Agreement and Detection Algorithm Assessments Using a Large Dataset
- Segal et al. (2023), Utilizing risk-controlling prediction calibration to reduce false alarm rates in epileptic seizure prediction
- Sun et al. (2017), Large-Scale Automated Sleep Staging
- Vallat & Walker (2021), An open-source, high-performance tool for automated sleep staging
- EEG Challenge (2025) official homepage
- EEG Challenge (2025) official rules
- EEG Challenge (2025) official submission page
- EEG Challenge (2025) official leaderboard / organizer postmortem
Safe defaults when in doubt
When in doubt, these three rules are safe:
- Split train / test by subject.
- Do not touch the test set until the very end.
- Fit normalization and feature selection on train only.
Even if this feels too strict, reliable accuracy is worth more than flashy numbers.
Where to go back next
Go back to Data & Bench to review the actual starter datasets, to Hands-on to rebuild the minimal loop, or to Verification Foundation to see why this page is part of the verification foundation.