Wiki: Data partitioning and data leaks

Even if accuracy is high, the evaluation is corrupted if the split is lax

Mind Uploading Research Project

Public page · Updated: 2026-03-25 · Practical guide

How to use this page

Read this first to avoid getting lost

This page is a wiki that explains from the beginning how to divide datasets and why data leaks are dangerous. The 2026-03 re-audit tightened one more point: a clean split is necessary, but it is still not enough if acquisition-distribution shortcuts or benchmark-governance failures remain hidden.

  • Difficulty varies greatly depending on whether the split unit is the subject, the session, or time.
  • Leaks occur not only through 'cheating' but also through well-intentioned preprocessing and splitting.
  • Even across the four starter datasets, the independent unit differs: subject, case, night, or session.
  • The first thing to look at is the split rules and leak countermeasures, not the accuracy itself.
  • A clean subject/session split still does not neutralize acquisition-distribution shortcuts such as site, device, reference system, electrode layout, or protocol mix.
  • Benchmark governance is part of leakage control: version, randomization, hidden grouping, extra-data policy, pretrained checkpoints, inference-stage restrictions, and organizer postmortems can all change what the score means.
Best for
People building initial evaluations on public data; people who find leaks and splits confusing
Reading time
12-18 minutes
Accuracy note
These are operational rules, not one-size-fits-all formulas. The best split still depends on the task and data structure, and official challenge operations can materially change what a benchmark score means.

Relatively clear at this stage

What we know now

  • Accuracy is easily overestimated if the train/test separation is loose.
  • Apparent performance tends to improve when fragments from the same subject, the same session, or nearby times land on both sides.
  • In clinical EEG, report text and report-derived labels can also be leakage sources.
  • Preprocessing, normalization, and feature selection also become leak sources when they are fit after looking at all the data.
  • Even after coarse split hygiene, metadata shortcuts such as site, amplifier, reference, and electrode layout can still dominate the score if disjointness and harmonization are not disclosed.
  • Leaderboard and challenge results are not stable objects unless benchmark provenance and later postmortems are carried together with the score.

Still unresolved beyond this point

What we still do not know

  • Which split best approximates future real-world operation; this depends on the task setting and usage context.
  • How to verify that leaks have been completely eliminated; the claim requires deep understanding and auditing of the data structure.
  • How to standardize the handling of report-derived labels in signal-only benchmarks; the operational design is still in progress.
  • Which benchmark-governance bundle should become the default reusable card for EEG leaderboards is still being refined.

Learn the basics

Check the basics in the wiki

What the wiki is for

The wiki is a learning aid. For the project's official current synthesis, success criteria, and operating rules, always return to the public pages.

The shortest explanation

Data division is the process of "deciding how much you are allowed to look at before checking the answers." A data leak occurs when that boundary is inadvertently crossed and information that would be unavailable in production is mixed into training and tuning.

2026-03 re-audit: split hygiene is necessary, not sufficient

The older version of this page was good at explaining subject / session / time split, but it still left two practical shortcuts too implicit. First, Chaibub Neto et al. (2019), Melnik et al. (2017), Xu et al. (2020), and Di et al. (2021) together show that identity and acquisition-distribution shortcuts can remain even when the coarse split sounds respectable. Second, the official EEG Challenge (2025) homepage, rules, submission page, and leaderboard show that benchmark governance itself can move what a score means. Therefore this page now treats split hygiene, acquisition-distribution audit, and benchmark provenance as one operational bundle rather than three unrelated side notes.

Why partitioning matters so much

On a school test, if you practice the questions while looking at the answers, you will get a higher score; but that score says nothing about your ability to solve genuinely new problems. The same goes for machine learning: if information seen during training bleeds into the test side, only the numbers look good.

First, be aware of the unit of division

  • Subject unit — Use when you want to see whether the model generalizes to new people. Note: the task looks easier than it really is when fragments of the same person sit in both train and test.
  • Session unit — Use when you want to see whether the same person is stable on different days. Note: if you split only within same-day recordings, you miss day-to-day variation and differences in electrode condition.
  • Time unit — Use when envisioning future prediction or continuous operation. Note: if windows from nearby times land on both sides, the model may effectively see almost the same fragment twice.
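The subject-unit rule above can be sketched as a small helper. This is an illustrative sketch only, not from any specific library; `split_by_subject` and its arguments are names invented for this example.

```python
import random

def split_by_subject(windows, test_fraction=0.3, seed=0):
    """Assign whole subjects to train or test so no subject spans both sides.

    `windows` is a list of (subject_id, payload) pairs; the payload is opaque
    here. Splitting at the subject level is the point: a random split over
    windows would let fragments of the same person leak into both sides.
    """
    subjects = sorted({s for s, _ in windows})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [(s, w) for s, w in windows if s not in test_subjects]
    test = [(s, w) for s, w in windows if s in test_subjects]
    return train, test

# Usage: 4 subjects x 5 windows each; no subject may appear on both sides.
data = [(f"S{i:02d}", f"win{j}") for i in range(4) for j in range(5)]
train, test = split_by_subject(data)
assert {s for s, _ in train}.isdisjoint({s for s, _ in test})
```

The same pattern extends to sessions or nights: just change what counts as the group identifier.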

Four starter datasets, and their independent units are not the same

The last two columns are this site's operational reasoning

The "Why it leaks" and "Safe split" columns in the table below are operational rules drawn by this site from each dataset's official description and the hierarchical structure shown in its primary documents.

  • EEG Motor Movement/Imagery — Independent unit: subject (and run if necessary). Common mis-split: random split of epochs / trials. Why it leaks: signal characteristics and cue structure of the same subject and session span train / test. Safe split: separate by subject first, and keep runs apart even for within-subject evaluation.
  • CHB-MIT — Independent unit: subject, with case chronology preserved. Common mis-split: random split by file. Why it leaks: chb21 is the same subject as chb01, and the gaps between files carry context. Safe split: check correspondence at the subject level rather than the case level, and preserve sequential order and gaps.
  • Sleep-EDF — Independent unit: subject-night. Common mis-split: random split of epochs. Why it leaks: sequential hypnograms and subject-specific sleep structure from the same night span train / test. Safe split: keep each night intact and declare up front whether the claim is cross-subject or within-subject.
  • TUH EEG / TUSZ — Independent unit: patient / session. Common mis-split: random split of segments / files, or signal-only evaluation with reports included. Why it leaks: multiple sessions and de-identified reports of the same patient carry information close to the label. Safe split: require per-patient / per-session splits and a report-usage flag.
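The CHB-MIT row can be made concrete with a case-to-subject lookup before splitting. The sketch below is illustrative: the chb01/chb21 identity comes from the table above, but the other mappings are placeholders, and the full table must come from the dataset's official documentation.

```python
# Illustrative: group CHB-MIT-style files by *subject* rather than by case
# number before splitting. chb01 and chb21 are the same subject, so a
# case-level split would still leak.
CASE_TO_SUBJECT = {
    "chb01": "subjA",  # same subject recorded twice ...
    "chb21": "subjA",  # ... under two case numbers
    "chb02": "subjB",  # placeholder; verify against official documentation
}

def subject_of(filename):
    case = filename.split("_")[0]           # "chb21_03.edf" -> "chb21"
    return CASE_TO_SUBJECT.get(case, case)  # fall back to case id if unmapped

files = ["chb01_03.edf", "chb21_01.edf", "chb02_16.edf"]
groups = {}
for f in files:
    groups.setdefault(subject_of(f), []).append(f)

# chb01_* and chb21_* now sit in one group and must stay on one side.
assert groups["subjA"] == ["chb01_03.edf", "chb21_01.edf"]
```

Only after this grouping does it make sense to apply a subject-level split; splitting the raw file list skips the correspondence check the table warns about.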

7 common leak patterns

  • Fragments of the same subject on both sides — The individual's idiosyncrasies are memorized, so generalization to new people looks higher than it is.
  • Adjacent time windows mixed — Nearly identical waveform slices are split into train / test, underestimating the difficulty of predicting the future.
  • Normalization and feature selection fit on all data — Test-side statistics are used during training, so information flows backwards.
  • Model selection repeated against test — Test effectively takes on the role of validation, and the final score is optimistic.
  • Duplicate or derived samples missed — Data originally cut from the same record lands on both sides, so the comparison is not between independent samples.
  • Site / device / reference / layout shortcuts hidden — The model learns acquisition-distribution structure such as amplifier, montage, reference system, or electrode layout instead of the target neural variable.
  • Challenge operations treated as fixed when they changed — The benchmark name stays the same while randomization, hidden grouping, extra-data policy, inference restrictions, or organizer postmortems change what the ranking actually measures.
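The normalization pattern above is worth seeing in code. This is a minimal sketch with hand-rolled statistics (no library assumed): the correct path fits mean and standard deviation on train only, while the leaky path fits them on train plus test and lets test-side statistics flow backwards.

```python
def fit_standardizer(values):
    """Fit mean/std on ONE split only; the test side must never enter here."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return mean, (std if std > 0 else 1.0)

def apply_standardizer(values, mean, std):
    return [(v - mean) / std for v in values]

train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 11.0]  # distribution shift the model should NOT see in advance

# Correct: statistics come from train only, then are applied to test.
mean, std = fit_standardizer(train)
test_ok = apply_standardizer(test, mean, std)

# Leaky: fitting on train + test uses test-side statistics during training.
mean_leak, std_leak = fit_standardizer(train + test)
test_leak = apply_standardizer(test, mean_leak, std_leak)

assert test_ok != test_leak  # the leak visibly changes the test inputs
```

The same boundary applies to feature selection and threshold tuning: anything with a "fit" step belongs inside the allowed split only.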

Split hygiene still leaves four shortcut families

  • Subject / session fingerprint — What can masquerade as progress: a score that looks like generalization while it mostly exploits stable subject-specific or session-specific structure. Publish instead: the independent hold-out unit, raw-window ancestry, and whether subject / session identifiers were fully disjoint.
  • Acquisition-distribution shortcut — What can masquerade as progress: a model that rides site, amplifier, cap, sampling rate, filter chain, reference system, or electrode-layout differences instead of the claimed neural variable. Publish instead: site / device / reference / layout disjointness, the harmonization log, and a metadata-only or setup-only baseline whenever possible.
  • Report / metadata shortcut — What can masquerade as progress: a signal-only claim that inherits report-derived labels, triage context, or structured metadata already sitting close to the answer. Publish instead: whether report text or derived metadata were used, and separate signal-only from multimodal / metadata-assisted scoreboards.
  • Benchmark-governance shortcut — What can masquerade as progress: a leaderboard that looks stable even though hidden grouping, randomization, extra-data policy, checkpoint policy, or inference-stage rules changed what the benchmark measured. Publish instead: benchmark version, split / randomization rule, hidden grouping, extra-data and pretrained-checkpoint policy, inference-stage restrictions, and later postmortems together with the score.
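The "metadata-only baseline" mentioned above can be sketched very simply: predict each test label from acquisition metadata alone and see how far that gets. The helper name and toy data here are invented for illustration; the point is the diagnostic, not the implementation.

```python
from collections import Counter, defaultdict

def metadata_only_baseline(train, test):
    """Predict each test label from acquisition metadata alone (here: site).

    If this baseline scores close to the model, the model is probably riding
    the acquisition-distribution shortcut rather than the neural variable.
    """
    per_site = defaultdict(Counter)
    for site, label in train:
        per_site[site][label] += 1
    overall = Counter(label for _, label in train)
    preds = []
    for site, _ in test:
        counts = per_site.get(site, overall)  # unseen site -> global majority
        preds.append(counts.most_common(1)[0][0])
    return preds

# Toy data where the label is perfectly confounded with the recording site.
train = [("siteA", "patient")] * 8 + [("siteB", "control")] * 8
test = [("siteA", "patient"), ("siteB", "control")] * 3
preds = metadata_only_baseline(train, test)
accuracy = sum(p == y for p, (_, y) in zip(preds, test)) / len(test)
assert accuracy == 1.0  # a "perfect" score without reading any EEG signal
```

A model that barely beats this baseline has learned the site, not the brain; publishing the baseline alongside the score makes that visible.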
Benchmark governance is part of leakage control, not administrative detail

The official EEG Challenge (2025) homepage explicitly says the original preprint became outdated after execution-phase changes and that the website plus starter kit should be treated as current. The official rules require disclosure of additional pretraining datasets, pretrained models and fine-tuning method, code submission during the inference stage, and a single-GPU 20 GB inference budget. The official submission page further fixes the event as an inference-only code competition. The final leaderboard then disclosed that Challenge 2 samples had not been randomized, allowing contiguous-trial same-subject structure to affect the ranking and forcing separate awards. On this site, that means benchmark governance now belongs on the same checklist as split hygiene rather than in a footnote after the score.

Metric semantics are also part of leak-resistant reporting

Even after split hygiene and benchmark provenance are disclosed, the reported number can still mislead if the task is rare-event or class-imbalanced. Saito & Rehmsmeier (2015) showed that precision-recall views are often more informative than ROC summaries under strong imbalance. In seizure tasks, Roy et al. (2021) and Scheuer et al. (2021) show that event sensitivity, overlap logic, and false alarms per hour or day matter together, while Segal et al. (2023) shows that false-alarm control is itself a design target in seizure prediction. In sleep staging, Sun et al. (2017) and Vallat & Walker (2021) show that pooled performance can still hide minority-stage failure. Therefore, this site now asks for a task-matched metric bundle in addition to split hygiene.
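The event-level pairing described above (event sensitivity plus false alarms per hour) can be sketched as follows. This is a deliberately simplified version: overlap is plain interval intersection, whereas published scoring rules such as those discussed by Roy et al. (2021) and Scheuer et al. (2021) define overlap and event counting more carefully.

```python
def overlaps(a, b):
    """Two (start, end) intervals in seconds overlap."""
    return a[0] < b[1] and b[0] < a[1]

def event_sensitivity_and_far(true_events, pred_events, record_hours):
    """Event-level sensitivity plus false alarms per hour.

    Pooled sample-level accuracy can look excellent while the detector misses
    whole events or fires constantly; reporting this pair makes that visible.
    """
    hits = sum(any(overlaps(t, p) for p in pred_events) for t in true_events)
    false_alarms = sum(
        not any(overlaps(p, t) for t in true_events) for p in pred_events
    )
    sensitivity = hits / len(true_events) if true_events else float("nan")
    return sensitivity, false_alarms / record_hours

# Toy 24 h record: 2 true seizures, 3 predictions (one of them spurious).
true_events = [(100, 160), (5000, 5060)]
pred_events = [(110, 150), (5010, 5050), (9000, 9030)]
sens, far = event_sensitivity_and_far(true_events, pred_events, record_hours=24)
assert sens == 1.0
assert far == 1 / 24
```

Both numbers belong in the report together: sensitivity alone rewards detectors that fire everywhere, and the false-alarm rate alone rewards detectors that never fire.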

Dataset-specific leak warnings added in this re-audit

  • EEG Motor Movement/Imagery — Because it is a cue-locked motor task, visual-cue, eye-movement, and myoelectric contributions must still be audited separately even after the split is tightened.
  • CHB-MIT — Do not confuse subject numbers with case numbers. Do not shuffle files; pin gaps and chronology in the runbook.
  • Sleep-EDF — Do not silently treat R&K hypnograms as equivalent to AASM. No cross-dataset comparison without a written label mapping.
  • TUH EEG / TUSZ — Do not mix report text, triage context, or session metadata derived from report keywords into the input of a signal-only benchmark.

The minimum you should report

Report Items

  • Evaluation family: whether the result is within-session, cross-session, cross-subject, or temporal / longitudinal.
  • Split rule: how many items were placed in train / validation / calibration / test, and what the independent hold-out unit was.
  • Window ancestry: which subject / case / night / session / file / record generated each split, and whether near-adjacent windows were kept apart.
  • Metric bundle: whether the task was read through balanced / macro metrics, event sensitivity plus false alarms, or per-stage agreement rather than a single headline number.
  • Report usage flag: whether the claim is signal-only, or report text / metadata / multimodal fields were also used.
  • Acquisition-distribution audit: whether site, device, reference system, channel map, electrode layout, and protocol distribution were separated, harmonized, or left mixed.
  • Preprocessing boundaries: whether normalization, feature selection, and threshold tuning were fit using only the allowed split.
  • Benchmark provenance: if this is a challenge or leaderboard result, the benchmark version, randomization rule, hidden grouping, extra-data / pretrained-checkpoint policy, inference-stage restrictions, and later postmortems.
  • Baseline: the improvement compared with a simpler model or a metadata-only / setup-only baseline.
  • Failure example + stopping claim: under what conditions it failed, what was excluded, and what stronger claim is explicitly not being made.
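The checklist above can travel with a result as a small machine-readable "evaluation card". The field names below simply mirror this page's checklist; they are not any standard schema, and the sample values are placeholders.

```python
# Illustrative "evaluation card" mirroring the report items above.
EVALUATION_CARD = {
    "evaluation_family": "cross-subject",
    "split_rule": {"train": 60, "validation": 10, "calibration": 5,
                   "test": 25, "holdout_unit": "subject"},
    "window_ancestry": "each window tagged with subject/session/file of origin",
    "metric_bundle": ["balanced_accuracy", "event_sensitivity",
                      "false_alarms_per_hour"],
    "report_usage_flag": "signal-only",
    "acquisition_audit": "site and device disjoint across splits",
    "preprocessing_boundaries": "normalization and selection fit on train only",
    "benchmark_provenance": None,  # fill in for challenge/leaderboard runs
    "baseline": "metadata-only (site) baseline reported alongside the model",
    "failure_and_stopping_claim": "fails under montage change; "
                                  "no within-subject claim is made",
}

# A result without these fields should not be compared against others.
REQUIRED = {"evaluation_family", "split_rule", "metric_bundle",
            "report_usage_flag", "preprocessing_boundaries", "baseline"}
assert REQUIRED <= set(EVALUATION_CARD)
```

Attaching such a card to every score keeps the split rules and leak countermeasures visible next to the number, which is exactly the reading order this page recommends.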

Safety measures to fall back on when you get lost

When in doubt, it is safe to follow these three rules:

  • Separate train / test by subject.
  • Do not touch test until the very end.
  • Fit normalization and feature selection on train only.

Even if this seems too harsh, reliable accuracy is more valuable than fancy numbers.

Where to go back next

Return to Data & Bench if you want to review the actual starter data, to Hands-on if you want to rebuild the minimal loops, or to Verification Foundation if you want to see why this page is part of the verification foundation.