Wiki: EEG foundation models and pretraining

Read advances in representation learning separately from claims that still need to be stopped

Mind Uploading Research Project

Public Page | Updated: 2026-04-04 | Technical / natural science only

How to use this page

Read this first to avoid getting lost

This page is a learning wiki that organizes how to read EEG foundation / self-supervised models. Recent large-scale pretraining is clearly an advance, but to avoid jumping from that advance to claims such as 'generalization is solved' or 'we are one step closer to WBE,' we separate pretraining corpus, channel mismatch, adaptation regime, benchmark object / supervision unit, and evaluation family.

  • Foundation models can improve EEG decoding, but they do not solve observability, identifiability, and deployability all at once.
  • Recent primary papers themselves treat electrode mismatch, sampling-rate differences, missing channels, low SNR, and inter-subject variability as major open problems.
  • Accepted papers, official challenge rules, and arXiv preprints / under-review manuscripts are not treated as the same evidence tier.
  • Challenge and benchmark papers from 2025-2026 show that standardized cross-task / cross-subject evaluation is itself still unfinished.
  • Foundation-model benchmarks are not one object: window / trial classification, event detection, sequence labeling, subject-level regression, and retrieval-style tasks still need separate benchmark-object disclosure.
  • Benchmark object, independent prediction unit, grouped hold-out unit, and inference-stage budget are separate fields; one leaderboard name does not fix all four.
  • The official EEG Challenge leaderboard later disclosed a split-construction error in Challenge 2, so benchmark provenance here includes sample randomization, hidden grouping, and inference-stage constraints rather than only benchmark name.
  • A setup-agnostic foundation model or a very large pretraining corpus is not yet shortcut-resistant transfer; subject / site / reference / protocol shortcuts still need an explicit specificity audit.
  • A unified spatial embedding or channel-permutation-equivariant backbone is still not a shared physiological coordinate system; coordinate route, reference family, and omitted-channel policy remain separate evidence fields.
  • Larger models do not automatically win; rankings move with parameter efficiency, training time, and benchmark design.
  • To preserve comparability, a standard model card is not enough; a Pretraining Card is also required.
  • A pretraining corpus is also a dataset, so overlap audit here now splits raw-recording, subject/session, setup, task/object, and benchmark-operations ancestry instead of one yes/no box.
Best for: Readers who want to assess EEG foundation models such as LaBraM, BIOT, EEGPT, and BENDR without overclaiming
Reading time: 10-15 min
Accuracy note: This page covers only how to read the technical and natural-science evidence. It does not address overall WBE completion criteria or philosophical questions.

Relatively clear at this stage

What we know now

  • Self-supervised / foundation models show promising gains under limited-label conditions and across mixed-task downstream settings.
  • EEG has severe format heterogeneity, and differences in channel count, reference, sample rate, and window length easily break comparison.
  • The meaning of a downstream score changes across frozen, linear-probe, and fine-tuning regimes.
  • Papers from 2025-2026 are beginning to show that split construction and preprocessing choices alone can change model rankings.
  • Recent 2025-2026 model and benchmark papers show that 'works with any setup', 'wins under linear probing', and 'transfers under fine-tuning' are different claims that can reverse across evaluation regimes.
  • Benchmark name alone is still too coarse; the supervision unit can shift from windows or trials to epochs, events, or subjects, which changes what transfer means.
  • Benchmark object, independent prediction unit, grouped hold-out unit, and inference-stage budget can all change the meaning of the same leaderboard entry.
  • Official challenge operations can themselves expose hidden subject-order shortcuts or score-definition changes, so benchmark postmortems are treated here as primary evidence about comparability rather than as afterthoughts.
  • A successful foundation model cannot be read directly as source identifiability or WBE state-completeness.
  • Recent heterogeneous-device papers show that layout compatibility itself is still an active model-design target, which means geometry-route equivalence should not be silently assumed.
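The split-construction point above can be made concrete with a toy example. The sketch below is entirely synthetic (no real dataset, model, or benchmark is involved; all names and parameters are ours): each window carries only a per-subject fingerprint plus a subject-level label, so there is zero genuine task signal, yet a trivial 1-nearest-neighbour decoder looks strong under a window-shuffled split and typically falls to chance under a subject-grouped split.

```python
import random

random.seed(0)

def make_windows(n_subjects=20, windows_per_subject=40):
    """Toy 'windows': the feature contains only a subject fingerprint;
    the label is fixed per subject (like a diagnosis), so no window-level
    task signal exists at all."""
    data = []
    for s in range(n_subjects):
        label = s % 2                      # subject-level label
        offset = random.gauss(0.0, 5.0)    # subject-identity fingerprint
        for _ in range(windows_per_subject):
            x = offset + random.gauss(0.0, 0.3)
            data.append((x, label, s))
    return data

def nn_accuracy(train, test):
    """1-nearest-neighbour decoder on the scalar feature."""
    correct = 0
    for x, y, _ in test:
        nearest = min(train, key=lambda t: abs(t[0] - x))
        correct += (nearest[1] == y)
    return correct / len(test)

data = make_windows()

# Window-level random split: the same subjects appear on both sides.
pool = data[:]
random.shuffle(pool)
half = len(pool) // 2
acc_random = nn_accuracy(pool[:half], pool[half:])

# Subject-grouped split: held-out subjects are fully unseen.
train = [d for d in data if d[2] < 10]
test = [d for d in data if d[2] >= 10]
acc_grouped = nn_accuracy(train, test)

print(f"window-shuffled split accuracy: {acc_random:.2f}")   # far above chance
print(f"subject-grouped split accuracy: {acc_grouped:.2f}")  # typically near chance
```

The inflated first number is pure identity leakage: the nearest neighbour of a test window is almost always another window from the same subject, and the subject determines the label.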

Still unresolved beyond this point

What we still do not know

  • It is still unsettled which pretraining objective is the most stable across broad downstream families.
  • There is still no default path that simultaneously satisfies cross-day, cross-device, cross-task, and longitudinal deployability.
  • There is also no fixed common standard for auditing benchmark version, split rules, and checkpoint selection together.
  • There is still no fixed common standard for reporting benchmark object / supervision unit alongside benchmark provenance.
  • Cross-project reporting still varies on how raw-recording ancestry, grouped hold-out unit, and benchmark-object ancestry are disclosed together.
  • It is not yet a settled law when targeted diversity beats indiscriminate scale.
  • It also remains unresolved how to show that a pretrained EEG representation is resisting identity / setup shortcuts rather than merely tolerating them on one benchmark.
  • It also remains unresolved how far reference-family shifts, coordinate-route mismatch, and label-limited clinical adaptation can be handled without heavy downstream rescue.

Learn the basics

Check the basics in the wiki

What the wiki is for

The wiki is a learning aid. For the project's official current synthesis, success criteria, and operating rules, always return to the public pages.

Bottom line in one sentence

EEG foundation models are an important advance for representation learning and low-label downstream tasks. However, that advance is readable only after separating what data the model was pretrained on, how formats were harmonized, and how far adaptation went downstream. A large model name alone does not determine either the strength of generalization or which claims still need to be stopped.

Scope of this page

This page does not cover philosophy or legal institutions. It covers only how to read EEG foundation / self-supervised models from technical and natural-science evidence.

What the 2026-03 literature audit identified as missing

The previous site had already strengthened QC, splits, multimodality, and drift, but it was missing how to read foundation models themselves. Without that layer, recent large-scale pretraining can still be misread too quickly as "dataset shift is solved," "a general decoder exists," or "we are closer to WBE." This page therefore separates what the primary literature actually advances from what it still leaves unresolved.

Source types fixed in advance as of 2026-03-25

The sources on this page mix peer-reviewed journal / accepted conference papers, accepted posters / workshops, official challenge websites / rules, arXiv preprints, and under-review manuscripts. These are not evidence of the same strength. For example, the official EEG Foundation Challenge site states in its 2025-11-17 update that the proposal preprint does not reflect changes made during the execution phase and that the current website and starter kit should be used instead. The final leaderboard then disclosed that Challenge 2 had not randomized samples, which allowed teams to exploit the fact that contiguous trials likely came from the same subjects. Accordingly, this page does not place model-capability comparisons, benchmark-governance warnings, and moving-target competition rules into the same single frontier ranking.

"Adapting to any setup" is not yet shortcut-resistant transfer

This was the next weak point on this page. El Ouahidi et al. (2025) is important because it explicitly targets arbitrary length and electrode arrangement with pretraining on more than 60,000 hours from 92 datasets. But that is still not the same as proving that the learned representation stopped reading subject identity, reference / device / protocol structure, or other recording-distribution cues. Lahiri et al. (2026) then showed that narrow-source versus diverse-source pretraining can trade places depending on whether the downstream regime is linear-probe or fine-tuning, while Liu et al. (2026) showed across 12 open-source foundation models and 13 datasets that linear probing is often insufficient, specialist models trained from scratch remain competitive, and larger models do not automatically generalize better. Those benchmark-side warnings line up with the shortcut literature already used elsewhere on this site: Chaibub Neto et al. (2019), Xu et al. (2020), and Di et al. (2021) show why identity confounding, acquisition variability, and time-robust fingerprints must be audited separately from headline transfer. Therefore, on this site, setup-agnostic pretraining is not read as shortcut-resistant neural representation unless the downstream claim also passes the Specificity & Shortcut Card.
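One cheap audit from that shortcut literature can be sketched directly: slicing a pooled score by a nuisance variable. The helper below is hypothetical (the function name, record layout, and toy data are ours, not from any cited paper), but it shows how a respectable pooled accuracy can hide a slice that sits at chance.

```python
# Hypothetical slice audit: report accuracy pooled and per nuisance slice.
from collections import defaultdict

def slice_accuracy(records, key):
    """Accuracy per value of one nuisance field (site, device, reference...)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += int(r["pred"] == r["true"])
    return {k: hits[k] / totals[k] for k in totals}

records = (
    # Site A: predictions track the labels.
    [{"pred": t % 2, "true": t % 2, "site": "A"} for t in range(40)]
    # Site B: predictions are uncorrelated with the labels.
    + [{"pred": t % 2, "true": (t // 2) % 2, "site": "B"} for t in range(40)]
)

pooled = sum(r["pred"] == r["true"] for r in records) / len(records)
per_site = slice_accuracy(records, "site")
print(pooled)    # 0.75: looks respectable when pooled
print(per_site)  # {'A': 1.0, 'B': 0.5}: one slice is at chance
```

A headline transfer score corresponds to the pooled number; the specificity audit this page requires corresponds to the per-slice report.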

A unified spatial embedding is not yet a common physiological coordinate system

This was the next remaining compression on this page. Han et al. (2025) targeted channel-permutation equivariance so arbitrary electrode configurations could be handled more robustly, Chen et al. (2025) introduced a coordinate-based spatial embedding for more than 150 electrode layouts, and El Ouahidi et al. (2025) pushed further toward any-setup pretraining. Those are real advances in recording-frame compatibility. But they still do not prove that different montages, coordinate routes, and reference families have become one shared physiology-preserving coordinate system. Ma et al. (2026) then showed that even strong EEG foundation models can generalize poorly when subject-level supervision is limited unless extra adaptation structure is added, while Lahiri et al. (2026) showed that split construction, checkpoint selection, segment length, and normalization can still dominate comparison. Therefore, on this site, layout support, reference-family robustness, coordinate-route disclosure, and label-limited adaptation burden remain separate fields rather than being collapsed into one word such as generalization.
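One small piece of the reference-family question can even be stated as arithmetic. The sketch below uses invented toy potentials (ours, not any dataset): because EEG is recorded as potential differences v_i - v_ref, common-average re-referencing removes the choice of reference electrode exactly, but only while the channel sets match; omit one channel from one recording and the equivalence breaks. That is why omitted-channel policy stays a separate field here.

```python
# Toy demonstration: average re-referencing neutralizes the reference
# electrode, but only under matched channel sets.

def reference(potentials, ref_idx):
    """Recorded signal for each channel against one reference electrode."""
    return [v - potentials[ref_idx] for v in potentials]

def average_rereference(signals):
    mean = sum(signals) / len(signals)
    return [s - mean for s in signals]

true_potentials = [3.0, -1.5, 0.5, 2.0, -4.0]   # one underlying brain state

rec_a = reference(true_potentials, ref_idx=0)    # e.g. referenced to Cz
rec_b = reference(true_potentials, ref_idx=4)    # e.g. referenced to a mastoid

a = average_rereference(rec_a)
b = average_rereference(rec_b)
print(all(abs(x - y) < 1e-12 for x, y in zip(a, b)))   # True: reference neutralized

# Omit one channel from recording B only, then average-reference both:
b_missing = average_rereference(rec_b[:-1])
a_matched = average_rereference(rec_a)[:-1]
print(all(abs(x - y) < 1e-12 for x, y in zip(a_matched, b_missing)))  # False
```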

Benchmark object, independent unit, and hold-out unit are separate axes

This was the next remaining weakness on this page. The official EEG Challenge homepage states that Challenge 1 predicts response time from CCD trials, whereas Challenge 2 predicts externalizing scores from EEG across multiple paradigms. The official rules then add that Challenge 1 is scored per trial, submissions are inference-only, and models must run on a single GPU with 20 GB memory. Lee et al. (2025) fine-tuned large brainwave foundation models across memory tasks and sleep stage classification, Liu et al. (2026) explicitly compared leave-one-subject-out cross-subject evaluation with within-subject few-shot calibration, and Lahiri et al. (2026) showed that six benchmark inconsistencies can reverse rankings on identical datasets by up to 24 percentage points. Therefore, on this site, benchmark name, predicted object, independent prediction unit, grouped hold-out unit, and operations budget are separate disclosure fields rather than one merged "benchmark provenance" box.

2026-03-30 correction: benchmark object still needs an explicit matrix

The older wording on this page already required benchmark-object disclosure, but it still left one practical shortcut open: a reader could talk as if benchmark object, independent prediction unit, hold-out unit, and challenge operations budget were all fixed by the benchmark name alone. The current primary and official sources do not support that shortcut. On this site, those fields now have to be read separately.

EEG Challenge 1
  • Source: official homepage + rules
  • What is predicted: Trial-level response-time regression from the CCD task.
  • What unit must still be named separately: The trial is the scoring unit, but grouped subject structure and the inference-only single-GPU 20 GB operations budget still have to be disclosed separately.
  • Safe ceiling on this site: A named cross-task-transfer benchmark under a fixed operations budget, not a general decoder verdict.

EEG Challenge 2
  • Source: official homepage + leaderboard
  • What is predicted: Subject-level externalizing-factor prediction from EEG across multiple paradigms.
  • What unit must still be named separately: The subject is the natural independent unit, and the leaderboard postmortem shows that hidden contiguous-trial grouping can still change what the benchmark measured.
  • Safe ceiling on this site: A subject-invariant benchmark attempt whose interpretation remains contingent on grouping policy, not proof that subject invariance is solved.

Lee et al. (2025)
  • Source: ICML proceedings
  • What is predicted: Fine-tuning results across memory tasks and sleep stage classification.
  • What unit must still be named separately: The task family, label granularity, adaptation regime, and metric family still have to be named because sleep-stage labels and memory-task outputs are not one prediction object.
  • Safe ceiling on this site: A fine-tuning / PEFT audit across named tasks, not a universal frontier ranking for EEG foundation models.

Liu et al. (2026)
  • Source: benchmarking preprint
  • What is predicted: Cross-model comparison across 13 EEG datasets and nine paradigms.
  • What unit must still be named separately: The paper explicitly separates leave-one-subject-out evaluation from within-subject few-shot calibration, so the hold-out unit cannot be collapsed into one transfer score.
  • Safe ceiling on this site: A benchmark matrix for transfer-regime tradeoffs, not a settled answer to which model generalizes best.

Lahiri et al. (2026)
  • Source: PRISM
  • What is predicted: Clinical differential diagnosis from interictal EEG, including epilepsy versus mimickers.
  • What unit must still be named separately: The clinically interesting object is subject-level diagnosis, but the paper also shows that split construction, checkpoint selection, segment length, and normalization can dominate comparison.
  • Safe ceiling on this site: Evidence that protocol differences can dominate rankings, not an accepted law of clinical transfer.
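The difference between a trial-level and a subject-level benchmark object can be shown with toy numbers. In the sketch below (all values invented, ours only), the same predictions yield different mean-absolute-error scores depending on whether trials are scored directly or first averaged within subject, because unbiased per-trial noise cancels on aggregation while systematic per-subject error does not.

```python
# Toy illustration: one set of predictions, two benchmark objects.
from statistics import mean

# (subject, true_value, predicted_value) per trial.
trials = [
    ("s1", 0.30, 0.10), ("s1", 0.30, 0.50),   # noisy but unbiased
    ("s2", 0.60, 0.90), ("s2", 0.60, 0.30),   # noisy but unbiased
    ("s3", 0.90, 0.20), ("s3", 0.90, 0.20),   # systematically wrong
]

def mae(pairs):
    return mean(abs(t - p) for t, p in pairs)

# Benchmark object 1: each trial is one scored prediction.
trial_level = mae([(t, p) for _, t, p in trials])

# Benchmark object 2: trials are averaged within subject first.
by_subject = {}
for s, t, p in trials:
    by_subject.setdefault(s, {"true": t, "preds": []})["preds"].append(p)
subject_level = mae(
    [(d["true"], mean(d["preds"])) for d in by_subject.values()]
)

print(trial_level)    # 0.40: per-trial noise dominates
print(subject_level)  # ~0.23: only the systematic subject error survives
```

Neither number is wrong; they answer different questions, which is why the benchmark object has to be disclosed next to the score.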

2026-04-02 correction: setup compatibility is not physiological equivalence

The older wording on this page already warned against shortcut-resistant overclaims, but one practical shortcut was still left open. A reader could still move from heterogeneous-device support to a shared physiology-preserving representation without naming which part of the cross-setup gap had actually been closed. The current 2025-2026 model literature does not support that shortcut. On this site, recording-frame compatibility and physiology-side equivalence are now kept separate explicitly.

DIVER-0 (2025)
  • Source: workshop / arXiv
  • What the paper directly advances: Channel-permutation-equivariant modeling and robust adaptation to arbitrary electrode configurations unseen during pretraining.
  • What still must be disclosed separately on this site: Coordinate route, reference family, omitted-channel policy, downstream adaptation regime, and whether the target variable stayed identifiable rather than merely layout-tolerant.

HEAR (2025)
  • Source: arXiv
  • What the paper directly advances: A coordinate-based embedding that supports heterogeneous EEG devices, varying electrode counts, and more than 150 layouts.
  • What still must be disclosed separately on this site: Whether the geometry route is subject-specific or template-based, whether reference mismatch was neutralized or only absorbed, and what claim ceiling remains for cross-montage physiology.

REVE (2025)
  • Source: accepted poster / arXiv
  • What the paper directly advances: Large-scale setup-agnostic pretraining across 92 datasets and more than 60,000 hours.
  • What still must be disclosed separately on this site: Overlap audit, covered reference-system distribution, coordinate-route disclosure, and whether a downstream gain survived shortcut slices rather than only mixed-corpus transfer.

SCOPE (2026)
  • Source: arXiv
  • What the paper directly advances: A structured adaptation route for label-limited cross-subject settings where EEG foundation models otherwise generalize poorly.
  • What still must be disclosed separately on this site: The label budget, pseudo-label / prototype burden, and whether the result is a property of the pretrained representation or of extra downstream rescue.

PRISM (2026)
  • Source: arXiv
  • What the paper directly advances: Clinical-transfer evidence plus a warning that split construction, checkpoint selection, segment length, and normalization can dominate rankings.
  • What still must be disclosed separately on this site: Benchmark provenance, independent hold-out unit, preprocessing path, and the exact comparison regime before any statement about portable clinical generalization.

Read primary sources by evidence tier

The biggest weakness that needed correction here was that accepted model papers, official challenge documentation, benchmark-warning preprints, and under-review manuscripts were too easy to read as equally strong "latest research." Technically, that matters because accepted model papers support advances in representation learning / transfer under specific settings, official rules support the exposure conditions of the benchmark, and benchmark-audit preprints support warnings about instability in comparison. A table that hides source type therefore becomes a source of misreading by itself.

Source types are as of 2026-03-25.

Kostas et al. (2021) / BENDR
  • Source type: Peer-reviewed journal paper
  • What can be said relatively strongly: It showed that self-supervised pretraining can provide breadth across novel subjects, hardware, and tasks.
  • What barrier the paper itself leaves unresolved: Downstream applicability remained unsettled; pretraining alone did not guarantee universal transfer.

Wang et al. (2023) / BIOT
  • Source type: Accepted conference paper
  • What can be said relatively strongly: It provided a concrete strategy for bringing heterogeneous biosignals with different sampling rates, channels, recording durations, and missing values into cross-dataset learning.
  • What barrier the paper itself leaves unresolved: Conversely, any result that does not report format harmonization is not meaningfully comparable.

Jiang et al. (2024) / LaBraM
  • Source type: Accepted conference paper
  • What can be said relatively strongly: It performed cross-dataset pretraining on about 20 datasets and roughly 2,500 hours of EEG, and showed strong performance across multiple downstream tasks.
  • What barrier the paper itself leaves unresolved: It explicitly leaves electrode mismatch, unequal length, varied task design, and low SNR as central EEG-side challenges.

Wang et al. (2024) / EEGPT
  • Source type: Accepted conference presentation
  • What can be said relatively strongly: It reported strong downstream performance with a pretrained transformer and linear probing under low SNR, inter-subject variability, and channel mismatch.
  • What barrier the paper itself leaves unresolved: A high score there does not automatically imply cross-day deployability or source identifiability.

Lee et al. (2025) / ICML fine-tuning audit
  • Source type: Accepted conference poster
  • What can be said relatively strongly: It showed that current large brainwave foundation models only slightly outperform conventional deep baselines, while PEFT methods such as LoRA can greatly reduce the number of trainable parameters.
  • What barrier the paper itself leaves unresolved: The gain is small, around 0.5% even as reported in the abstract, so the result does not support the claim that "larger models win by default."

EEG Foundation Challenge (2025) / NeurIPS competition
  • Source type: Official competition website / rules
  • What can be said relatively strongly: It attempts to standardize measurement of cross-task transfer and subject-invariant representation over more than 3,000 HBN-EEG participants.
  • What barrier the paper itself leaves unresolved: What it provides directly is current benchmark governance, not a final verdict on model capability. The official site also states that the proposal preprint is outdated, so operational conditions should be read from the current rules and starter kit.

EEG Foundation Challenge final leaderboard (2025) / Governance postmortem
  • Source type: Official leaderboard / postmortem
  • What can be said relatively strongly: It shows that benchmark operations themselves can expose hidden subject-order shortcuts: the organizers reported that Challenge 2 samples had not been randomized, so contiguous trials could reveal same-subject structure and the final prize logic had to be changed.
  • What barrier the paper itself leaves unresolved: This is strong evidence about benchmark fragility, not a stable capability ranking of the submitted models. It tells us the measurement changed, not which architecture is universally best.

Xiong et al. (2025) / EEG-FM-Bench
  • Source type: arXiv benchmark preprint
  • What can be said relatively strongly: It states explicitly that the rapid proliferation of foundation models has outpaced standardized evaluation and that fragmented comparison is slowing scientific progress.
  • What barrier the paper itself leaves unresolved: Unharmonized comparisons do create scientific inefficiency, but this is safest to read as a benchmark warning rather than as a final frontier ranking.

El Ouahidi et al. (2025) / REVE
  • Source type: Accepted poster / arXiv manuscript
  • What can be said relatively strongly: It introduced a 4D positional encoding that can handle arbitrary length and electrode arrangement, pointing toward better transfer across diverse setups.
  • What barrier the paper itself leaves unresolved: What can be read relatively strongly here is a direction for handling heterogeneity, not a stable universal ranking across accepted benchmarks.

Han et al. (2025) / DIVER-1
  • Source type: Under-review / arXiv manuscript
  • What can be said relatively strongly: It presented what it describes as the largest-scale corpus together with a systematic scaling-law analysis, arguing that electrophysiology raises a data-constrained scaling question.
  • What barrier the paper itself leaves unresolved: The warning that smaller models trained longer can outperform larger models trained briefly under fixed data / compute is important, but an under-review source alone is not enough to fix the field's default scaling-law interpretation.

Wang et al. (2025) / NeuroTTT
  • Source type: arXiv method preprint
  • What can be said relatively strongly: It showed that domain-tuned self-supervision and test-time training can help with pretraining-downstream misalignment and cross-subject shift.
  • What barrier the paper itself leaves unresolved: Conversely, the results do not support the assumption that a foundation model alone is sufficient without downstream adaptation. Results that include TTT are also not read here as evidence of deployment simplicity.

Lahiri et al. (2026) / PRISM
  • Source type: arXiv clinical-transfer preprint
  • What can be said relatively strongly: It reported that pretraining with targeted diversity can become advantageous under fine-tuning and can improve performance on a clinical mimicker task.
  • What barrier the paper itself leaves unresolved: The warning that benchmark inconsistency alone can strongly reverse rankings on the same dataset is important, but it still should not be fixed as a shared conclusion of accepted clinical benchmarks.

Liu et al. (2026) / EEG FM benchmarking
  • Source type: arXiv benchmark / review preprint
  • What can be said relatively strongly: It compared 12 open-source foundation models and specialist baselines across 13 EEG datasets, and argued that linear probing is often insufficient, scratch specialists remain competitive, and larger models do not automatically generalize better.
  • What barrier the paper itself leaves unresolved: Because it is still a preprint and a benchmark study, it does not by itself prove shortcut resistance, deployment readiness, or a settled ranking across future accepted evaluations.

The 10 gates before reading a foundation model

G0: source type / maturity
  • Why it is needed: Accepted papers, accepted posters, official rules, arXiv preprints, and under-review manuscripts support claims of different strength.
  • Minimum evidence we want: The source type, whether it is accepted / preprint / under review, and for moving-target rules pages, the last verified date.

G1: corpus identity / overlap
  • Why it is needed: A pretraining corpus is also a dataset. If closely related data leak into the downstream side, the split no longer means what it appears to mean, and that leakage can happen through multiple ancestry axes rather than one route.
  • Minimum evidence we want: Corpus name, version / snapshot, total hours, and a multi-axis overlap audit covering raw-recording / window ancestry, subject / session ancestry, site / device / reference / layout ancestry, task / benchmark-object ancestry, and extra-data / checkpoint ancestry.

G2: population / setup diversity
  • Why it is needed: The number of datasets or total hours is not enough. If population, device, or electrode layout are biased, pretraining may simply learn recording-distribution artifacts.
  • Minimum evidence we want: The covered population, device types, clinical vs. lab setting, electrode schema, and the distribution of reference systems.

G3: harmonization / geometry route
  • Why it is needed: EEG differs greatly in channel count, electrode geometry, reference family, sample rate, and window length, and even a layout-tolerant model does not automatically erase those differences.
  • Minimum evidence we want: Channel map, electrode-coordinate route or template, reference family, resampling, token length, and the policy for missing, omitted, or interpolated channels / segments.

G4: adaptation regime
  • Why it is needed: Frozen feature extraction, full fine-tuning, and test-time training do not mean the same thing when one asks what actually transferred.
  • Minimum evidence we want: Whether the regime is frozen, linear-probe, PEFT, full fine-tune, or TTT, plus target-data usage, label budget, and recalibration amount.

G5: benchmark object / supervision unit / independent prediction unit
  • Why it is needed: Per-window classification, event detection, sequence labeling, subject-level regression, and retrieval / ranking do not test the same scientific object. Official foundation-model benchmarks already mix these families.
  • Minimum evidence we want: The supervision unit, label provenance, output family, metric bundle, what counts as one independent prediction, and whether that unit inherits raw-recording or subject grouping.

G6: benchmark provenance / operations budget
  • Why it is needed: Benchmark papers from 2025-2026 show that rankings can move with split construction, checkpoint selection, segment length, hidden sample ordering, and challenge-stage compute restrictions. The official EEG Challenge postmortem made that point operationally explicit.
  • Minimum evidence we want: Benchmark name, version, split rule, sample-randomization / hidden-grouping policy, checkpoint selection, segment length, normalization, how the external hold-out was built, and any inference-stage compute / training restrictions.

G7: shortcut-resistance / specificity bridge
  • Why it is needed: A good transfer score can still come from subject identity, site / device / reference structure, or protocol distribution rather than the intended neural variable. Foundation-model headlines do not remove that risk.
  • Minimum evidence we want: A task-matched nuisance audit, including participant / site / device / reference disjointness, metadata-only or identity baselines where applicable, shortcut slices, and the linked Specificity & Shortcut Card.

G8: scale / efficiency
  • Why it is needed: In EEG, "bigger is stronger" does not always hold. It is easy to misread results unless parameter count, data, compute, and trainable fraction are read together.
  • Minimum evidence we want: Total parameter count, trainable parameter count, pretraining epochs / steps, corpus size, training time, and adapter size.

G9: claim ceiling
  • Why it is needed: Success for a foundation model is still an advance in macro decoding / representation learning.
  • Minimum evidence we want: An explicit statement of what remains latent, and an explicit stop against source identifiability, direct validation, and WBE state-completeness claims.

Official challenge postmortems count as benchmark evidence

The EEG Challenge submission page defines an inference-only code submission setting, while the final leaderboard discloses that Challenge 2 accidentally preserved same-subject trial contiguity. Those facts are not side notes. They directly change what a reported ranking means, because one result was obtained under a fixed inference budget and another could exploit an unintended grouping cue. On this site, benchmark provenance therefore includes operational constraints and postmortem disclosures, not only the benchmark title.
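The contiguity flaw can be reproduced in a few lines. The toy sketch below is not the actual challenge pipeline (names and sizes are ours); it only shows why unrandomized serialization leaks grouping: when each subject's trials stay contiguous, the subject of sample i is almost always the subject of sample i-1, so position alone reveals hidden subject structure.

```python
import random

def subjects_in_order(n_subjects=5, trials_each=20):
    """Serialize trials subject by subject, i.e. without randomization."""
    order = []
    for s in range(n_subjects):
        order += [s] * trials_each   # contiguous block per subject
    return order

hidden = subjects_in_order()

# Neighbour heuristic: guess that sample i shares a subject with sample i-1.
same = [hidden[i] == hidden[i - 1] for i in range(1, len(hidden))]
rate_contiguous = sum(same) / len(same)
print(rate_contiguous)  # 95/99: position alone almost fixes the grouping

random.seed(0)
shuffled = hidden[:]
random.shuffle(shuffled)
same_r = [shuffled[i] == shuffled[i - 1] for i in range(1, len(shuffled))]
rate_shuffled = sum(same_r) / len(same_r)
print(rate_shuffled)    # much lower once samples are randomized
```

A submission that exploits this cue scores well for reasons that have nothing to do with EEG, which is exactly why the postmortem changes how the leaderboard is read.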

2026-04-04 correction: overlap audit must be split by ancestry axis

The older wording on this page still made overlap audit sound too much like one checkbox. Current primary and official sources do not support that compression. Brookshire et al. (2024) show a raw-window ancestry failure mode, Chaibub Neto et al. (2019) show a subject-characteristic failure mode, Melnik et al. (2017) and Xu et al. (2020) show a setup-distribution failure mode, and the official EEG Challenge data, submission, and leaderboard pages show a benchmark-object / benchmark-operations failure mode. On this site, those are now read as separate ancestry axes rather than one generic overlap warning.
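A multi-axis overlap audit is mechanically simple once the axes are named. The sketch below is illustrative only (the axis names and identifiers are ours, not a standard): it intersects pretraining and downstream ancestry per axis, so a clean subject axis cannot conceal a dirty raw-recording or benchmark-object axis.

```python
# Minimal per-axis overlap audit between a pretraining corpus and a
# downstream benchmark; all identifiers are invented for illustration.

def overlap_audit(pretrain, downstream, axes):
    report = {}
    for axis in axes:
        shared = set(pretrain[axis]) & set(downstream[axis])
        report[axis] = sorted(shared)
    return report

pretrain = {
    "recording": {"rec-001", "rec-002", "rec-003"},
    "subject":   {"sub-A", "sub-B"},
    "setup":     {"site1/64ch/avg-ref"},
    "task":      {"resting", "oddball"},
}
downstream = {
    "recording": {"rec-003", "rec-104"},       # raw-recording leak
    "subject":   {"sub-C", "sub-D"},           # subject axis is clean
    "setup":     {"site1/64ch/avg-ref"},       # same acquisition setup
    "task":      {"oddball"},                  # same benchmark object
}

report = overlap_audit(pretrain, downstream,
                       ["recording", "subject", "setup", "task"])
for axis, shared in report.items():
    print(axis, "CLEAN" if not shared else f"OVERLAP: {shared}")
```

Here a one-box audit that only checked subjects would report "no overlap" while the recording, setup, and task axes are all shared.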

The Pretraining Card required on this site

For foundation / self-supervised results, this site requires a Pretraining Card in addition to the standard model card. This is not an external publication standard; it is an operating rule of this site for keeping heterogeneous-corpus pretraining comparable.

Corpus
  • Minimum required content: Pretraining corpus name, version, total hours, exclusion criteria, and a multi-axis overlap audit covering raw-recording / window, subject / session, site / device / reference / layout, task / benchmark-object, and extra-data / checkpoint ancestry.
  • Dangerous misreading if omitted: You may miss the possibility that what looked like generalization was actually reuse of the same recording family, person, setup, task object, or benchmark lineage.

Population / Setup
  • Minimum required content: Population, device, electrode layout, reference system, and whether the setting is clinical or lab-based.
  • Dangerous misreading if omitted: You may misread the number of datasets as recording diversity itself.

Harmonization / Geometry Route
  • Minimum required content: Channel schema, electrode-coordinate route or template, reference family, sample rate, tokenization, normalization, and missing / omitted / interpolated-channel policy.
  • Dangerous misreading if omitted: You may misread recording-frame translation as physiology-preserving model capability.

Objective
  • Minimum required content: The pretraining objective, such as masked, autoregressive, or contrastive.
  • Dangerous misreading if omitted: You cannot compare which inductive bias actually mattered.

Source Type / Maturity
  • Minimum required content: Whether the source is an accepted journal / conference paper, accepted poster / workshop, official rules page, arXiv preprint, or under-review manuscript, and for a rules page, the last verified date.
  • Dangerous misreading if omitted: You may misread under-review warnings or operational documentation as frontier evidence of the same strength as accepted model papers.

Adaptation
  • Minimum required content: Frozen / linear-probe / PEFT / full fine-tune / TTT, target-data usage, label budget, and whether recalibration is used.
  • Dangerous misreading if omitted: You may conflate "a general representation transferred well" with "the model was strongly adapted to the target."

Benchmark Provenance / Operations
  • Minimum required content: Benchmark name, version, split rule, checkpoint selection, segment length, normalization, and any inference-stage compute or no-training restriction.
  • Dangerous misreading if omitted: You may misread ranking changes caused by benchmark design as differences in the model itself.

Benchmark Object / Supervision Unit
  • Minimum required content: Whether the downstream object is window / trial classification, event detection, sequence labeling, subject-level regression / diagnosis, retrieval / ranking, or another family, together with label provenance, output family, metric bundle, the independent prediction unit, and whether grouped ancestry from the same recording or subject remains.
  • Dangerous misreading if omitted: You may collapse heterogeneous wins into one story about portable EEG generalization even though the model solved different objects with different error surfaces.

Shortcut-resistance / Specificity Bridge
  • Minimum required content: For any downstream decode / biomarker / clinical claim, report participant / site / device / reference disjointness, metadata-only or subject-ID baselines where relevant, nuisance-route checks, shortcut slices, and the linked Specificity & Shortcut Card.
  • Dangerous misreading if omitted: You may misread a representation that mainly preserves identity or recording-distribution cues as if it had become invariant to those shortcuts.

Scale / Efficiency
  • Minimum required content: Total parameter count, trainable parameter count, pretraining steps / epochs, training time, adapter size, and inference cost.
  • Dangerous misreading if omitted: You may read "the foundation model won because it is large" when the real driver was compute allocation or PEFT.

Evaluation
  • Minimum required content: Evaluation family, hold-out unit, device hold-out, cross-day evaluation, abstention policy, and failure conditions.
  • Dangerous misreading if omitted: You may mistake a high same-day score for deployability.

Stopped claim
  • Minimum required content: A one-line statement of what still cannot be claimed.
  • Dangerous misreading if omitted: You may over-extrapolate foundation-model success to source truth or WBE.
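As an illustration only, the card can be operationalized as a required-field checklist. The field names below paraphrase the Pretraining Card items above and are this page's own shorthand, not an external standard; the point is that a card with any unfilled field, including the stopped claim, is not publishable.

```python
# Hypothetical checklist encoding of this site's Pretraining Card.
REQUIRED_FIELDS = [
    "corpus", "population_setup", "harmonization_geometry", "objective",
    "source_type_maturity", "adaptation", "benchmark_provenance_ops",
    "benchmark_object_unit", "shortcut_specificity_bridge",
    "scale_efficiency", "evaluation", "stopped_claim",
]

def missing_fields(card):
    """Return every required field that is absent or left empty."""
    return [f for f in REQUIRED_FIELDS if not card.get(f)]

card = {
    "corpus": "ExampleCorpus v1, 2500 h, overlap audit per ancestry axis",
    "objective": "masked reconstruction",
    "adaptation": "linear probe, no target-label fine-tuning",
    # the remaining fields are deliberately left unfilled in this example
}

gaps = missing_fields(card)
print("publishable" if not gaps else f"blocked, missing: {gaps}")
```

A card like this blocks publication until all twelve fields, not just the flattering ones, are explicitly filled.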

Operating rules on this site

Rule

  • We do not hide source type: accepted papers, official rules, and preprints / under-review manuscripts are not listed as evidence of the same strength.
  • Foundation-model results are not exempt from split auditing: independence must be checked including the pretraining corpus.
  • We do not hide population / setup diversity: we report not just the number of datasets, but which recording distributions were actually included.
  • We do not hide format harmonization: channel / reference / sampling harmonization must always be reported.
  • We do not read heterogeneous-device support as physiology equivalence: coordinate route, reference family, and omitted-channel policy stay visible even when a model accepts arbitrary layouts.
  • We do not hide the amount of adaptation: linear probing, full fine-tuning, and test-time training (TTT) are not all listed as the same kind of "transfer success."
  • We do not hide benchmark object: window classification, event detection, sequence labeling, subject-level regression, and retrieval-like tasks are not compressed into one frontier score.
  • We do not hide independent units or grouped hold-outs: trial, epoch, recording, and subject are different prediction objects and need separate disclosure.
  • We do not hide benchmark provenance: because rankings move with split / checkpoint / preprocessing differences, benchmark specification is part of the result.
  • We do not hide challenge operations budgets: inference-only settings, no-training rules, and memory limits are part of what the leaderboard score means.
  • We do not treat "any setup" as shortcut-resistant by title alone: foundation-model transfer claims also need a shortcut-resistance bridge to the Specificity & Shortcut Card.
  • Current competition rules are checked on the official site: proposal papers or companion preprints are background material; current rules / submission instructions / starter kits take priority for operations.
  • We do not hide benchmark postmortems: if organizers later disclose split flaws, sample-order shortcuts, or scoring changes, that disclosure changes how we read the leaderboard.
  • Benchmark-warning preprints are not treated as frontier verdicts: ranking reversals and scaling-law claims remain exploratory until reinforced by accepted papers or independent reruns.
  • We do not hide scale / efficiency: we do not write that a foundation model won without reporting parameter count, trainable fraction, and training time.
  • Even at high scores, the claim ceiling is kept in place: source identifiability, direct validation, closed-loop deployability, and WBE state-completeness are separate gates.
  • Results without a Pretraining Card are treated only as qualified decoding evidence: they are not automatically promoted to L2 or above.
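The scale / efficiency rule above asks for parameter count, trainable fraction, and training time before "the foundation model won" is written. A minimal sketch of that disclosure, using hypothetical component sizes (the names `backbone`, `adapter`, `probe_head` and the counts are illustrative, not taken from any cited model):

```python
# Minimal sketch, hypothetical sizes: separate "how big is the model" from
# "how much of it was actually adapted" so a frozen-backbone PEFT win is not
# reported as a full fine-tuning win.
def efficiency_card(param_counts, trainable_names):
    """param_counts: component name -> parameter count.
    trainable_names: set of component names unfrozen during adaptation."""
    total = sum(param_counts.values())
    tuned = sum(n for name, n in param_counts.items() if name in trainable_names)
    return {
        "total_params": total,
        "trainable_params": tuned,
        "trainable_fraction": round(tuned / total, 4),
    }

# Typical PEFT regime: large frozen backbone, small adapter + probe head tuned.
counts = {"backbone": 90_000_000, "adapter": 1_000_000, "probe_head": 10_000}
card = efficiency_card(counts, trainable_names={"adapter", "probe_head"})
print(card)
```

With numbers like these the trainable fraction is on the order of 1%, which is exactly the case where "won because it is large" and "won because of the adaptation regime" come apart and need to be reported separately.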

References

  1. Kostas, D., Aroca-Ouellette, S., & Rudzicz, F. (2021). BENDR: Using Transformers and a Contrastive Self-Supervised Learning Task to Learn From Massive Amounts of EEG Data. Frontiers in Human Neuroscience, 15, 653659. doi:10.3389/fnhum.2021.653659
  2. Wang, H., Lu, C., Xie, B., et al. (2023). BIOT: Biosignal Transformer for Cross-data Learning in the Wild. NeurIPS 2023. paper
  3. Jiang, W.-B., Zhao, L., & Lu, B.-L. (2024). Large Brain Model for Learning Generic Representations with Tremendous EEG Data in BCI. ICLR 2024. proceedings
  4. Wang, G., Liu, W., He, Y., Xu, C., Ma, L., & Li, H. (2024). EEGPT: Pretrained Transformer for Universal and Reliable Representation of EEG Signals. NeurIPS 2024. poster / abstract
  5. Lee, N., Barmpas, K., Panagakis, Y., Adamos, D., Laskaris, N., & Zafeiriou, S. (2025). Are Large Brainwave Foundation Models Capable Yet? Insights from Fine-Tuning. Proceedings of the 42nd International Conference on Machine Learning, PMLR 267, 32878-32888.
  6. EEG Foundation Challenge (2025). From Cross-Task to Cross-Subject EEG Decoding. NeurIPS 2025 competition. official website
  7. EEG Foundation Challenge (2025). Data. official data page
  8. EEG Foundation Challenge (2025). Rules. official rules
  9. EEG Foundation Challenge (2025). Submission. submission page
  10. EEG Foundation Challenge (2025). Leaderboard. official leaderboard / postmortem
  11. Xiong, W., Li, J., Li, J., & Zhu, K. (2025). EEG-FM-Bench: A Comprehensive Benchmark for the Systematic Evaluation of EEG Foundation Models. arXiv. arXiv:2508.17742
  12. El Ouahidi, Y., Lys, J., Thölke, P., Farrugia, N., Pasdeloup, B., Gripon, V., Jerbi, K., & Lioi, G. (2025). REVE: A Foundation Model for EEG -- Adapting to Any Setup with Large-Scale Pretraining on 25,000 Subjects. accepted poster / arXiv manuscript. arXiv:2510.21585
  13. Han, D. D., Lee, A. L., Lee, T., Gwon, Y., Lee, S., Lee, S., Park, D. K., Yoo, S., Cha, J., & Chung, C. K. (2025). DIVER-0: A Fully Channel Equivariant EEG Foundation Model. ICML 2025 Workshop on GenBio / arXiv manuscript. arXiv:2507.14141
  14. Chen, Z., Qin, C., You, W., Liu, R., Chu, C., Yang, R., Tan, K. C., & Wu, J. (2025). HEAR: An EEG Foundation Model with Heterogeneous Electrode Adaptive Representation. arXiv preprint. arXiv:2510.12515
  15. Han, D. D., Gwon, Y., Lee, A. L., et al. (2025). DIVER-1: Deep Integration of Vast Electrophysiological Recordings at Scale. under-review / arXiv manuscript. arXiv:2512.19097
  16. Wang, S., Deng, Y., Bao, Z., Zhan, X., & Duan, Y. (2025). NeuroTTT: Bridging Pretraining-Downstream Task Misalignment in EEG Foundation Models via Test-Time Training. arXiv preprint. arXiv:2509.26301
  17. Ma, J., Wu, F., Xing, Y., Lin, Q., Liu, T., Liu, C., Jia, Z., & Feng, M. (2026). Structured Prototype-Guided Adaptation for EEG Foundation Models. arXiv preprint. arXiv:2602.17251
  18. Lahiri, J. B., Runwal, P., Kulkarni, A., Jain, M., Mishra, A. R., Panwar, S., & Singh, S. (2026). PRISM: Exploring Heterogeneous Pretrained EEG Foundation Model Transfer to Clinical Differential Diagnosis. arXiv preprint. arXiv:2603.02268
  19. Liu, D., Chen, Y., Chen, Z., Cui, Z., Wen, Y., An, J., Luo, J., & Wu, D. (2026). EEG Foundation Models: Progresses, Benchmarking, and Open Problems. arXiv preprint. arXiv:2601.17883
  20. Brookshire, G., Kasper, J., Blauch, N. M., Wu, Y. C., Glatt, R., Merrill, D. A., Gerrol, S., Yoder, K. J., Quirk, C., & Lucero, C. (2024). Data leakage in deep learning studies of translational EEG. Frontiers in Neuroscience, 18, 1373515. doi:10.3389/fnins.2024.1373515
  21. Chaibub Neto, E., Pratap, A., Perumal, T. M., et al. (2019). Detecting the impact of subject characteristics on machine learning-based diagnostic applications. npj Digital Medicine, 2, 99. doi:10.1038/s41746-019-0178-x
  22. Xu, M., Yao, S., Wei, Z., et al. (2020). Cross-dataset variability problem in EEG decoding with deep learning. Frontiers in Human Neuroscience, 14, 103. doi:10.3389/fnhum.2020.00103
  23. Di, Y., An, X., Zhong, W., Liu, S., & Ming, D. (2021). The time-robustness analysis of individual identification based on resting-state EEG. Frontiers in Human Neuroscience, 15, 672946. doi:10.3389/fnhum.2021.672946