Conclusion
This site does not treat point estimates alone, uncalibrated confidence alone, untyped source-imaging widths, or outputs without abstention conditions as strong evidence. Four points are audited first: where the uncertainty comes from, which split and evaluation family calibrated the probability / interval / prediction set, where output stops when reliability is low, and how the recalibration load is recorded for an online system. The March 2026 re-audit did not leave this as an auxiliary explanation but connected it to the Verification side's Calibration & Abstention Card.
This page does not deal with philosophy or legal systems. It sorts out uncertainty, calibration, and abstention purely from the technical and natural-science side of EEG source imaging, EEG classification, language decoding, and closed-loop BCI.
The previous version was useful as a support page for teaching that confidence ≠ calibration, but it had not yet become a reusable submission specification like the Observability Budget or the Temporal Validity Card. The primary literature makes two things clear: directly extrapolating within-session calibration to cross-day / cross-subject / temporal shift is dangerous, and touching the threshold without separating fit / calibration / test collapses the evidence gate itself. In addition, the earlier page still let language-facing outputs sound too generic: a retrieval score from a fixed bank, a top-k score at known word onsets, and a prompt-conditioned LLM output were too easy to misread as expressing the same kind of uncertainty. Therefore, on this page, split, slice, candidate scaffold, coverage-risk, and fallback policy are fixed together as the Calibration & Abstention Card, and language-facing outputs are explicitly stacked with the Neural Contribution Card.
This page still lagged behind the site's newer source-imaging rule. It was already correct to say that point estimates are too weak, but it still left too much room to read posterior band, solver spread, conductivity sensitivity, and external-validator distance as if they were interchangeable uncertainty objects. The current primary literature does not support that shortcut. Mahjoory et al. (2017) showed large pipeline-conditioned variability across inverse methods and toolboxes, Vorwerk et al. (2024) showed strong conductivity-driven localization and depth error, Luria et al. (2024) described posterior support over focal-source configurations, Tong et al. (2025) derived variance and hypothesis tests for a sparse debiased estimator, and Feng et al. (2025) targeted extended-source location-plus-extent reconstruction with empirical-Bayesian uncertainty quantification. Mikulan et al. (2020) and Hao et al. (2025) then show that external validators themselves answer different error questions. This page now types the uncertainty object before it allows confidence language.
Four audit gates to fix first
| Audit gate | Minimum required disclosure | Claim blocked when it is missing |
|---|---|---|
| Gate 1: Source of the uncertainty | Breakdown into observation noise, preprocessing differences, head geometry, conductivity, subject shift, session drift, and decoder drift. | Neither "the cause of the error is known" nor "the improvement was effective" can be claimed. |
| Gate 2: Calibration | Separation of fit/calibration/test, coverage of intervals and sets, ECE/Brier/NLL, a named uncertainty object (such as posterior support, debiased interval, extent uncertainty, or cross-pipeline spread), slice-wise calibration audit, and a declared output scaffold when the task is language-facing. | Confidence, posterior, or interval width cannot be read as usable probabilities, calibrated coverage, or one generic uncertainty scale. |
| Gate 3: Abstention | Reject / abstain conditions under low reliability, the trade-off between coverage reduction and risk reduction, prediction-set size, false-alarm ceiling, branching into remeasurement and reanalysis, and disclosure of candidate-bank / prompt dependence where relevant. | It cannot be claimed that wrong answers are suppressed or that the system operates on the safe side under low reliability. |
| Gate 4: Online load | Distinction between recalibration frequency, recalibration trigger, dropout, recovery time, and hold-last-output / silence / freeze / hard stop. | Closed-loop operating stability cannot be expressed in average accuracy alone. |
Do not mix confidence, intervals, calibration, and abstention
| Concept | What it tells you | What it does not tell you |
|---|---|---|
| Point estimate | The current representative value. | How unstable it is, and whether it collapses when conditions change. |
| Interval / posterior band | How much width sits around the estimate. | Whether that width actually achieves reasonable coverage, and where it comes from, must be checked separately. |
| Prediction set | How far the candidates can be narrowed under these conditions. | Under which assumptions the set guarantees coverage, and how large the set has grown, must be checked separately. |
| Confidence | The ordering of scores and internal confidence of the model. | Whether that number matches the actual probability of being right. |
| Calibration | Whether a 0.8 really hits about 80% of the time, and whether intervals cover as claimed. | Even with good calibration, limited expressiveness and poor OOD generalization remain separate problems. |
| Abstention | Output can be stopped under low reliability, and the coverage-risk exchange can be specified. | Whether the threshold is appropriate, and whether a remeasurement / recalibration flow follows abstention, must be determined separately. |
Softmax outputs, posterior probabilities, decoder class scores, and prediction sets are not treated as calibrated probabilities or safe sets as-is. Calibration error, coverage-risk, interval / set coverage, and the separation of fit/calibration/test must be presented together before they count as reliable for actual operation.
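As a minimal sketch of the calibration-error part of this requirement, the binned confidence-accuracy gap behind ECE can be computed in a few lines. The function name, binning rule, and inputs here are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def ece(conf, correct, n_bins=10):
    """Expected calibration error: bin predictions by confidence and
    average the |confidence - accuracy| gap, weighted by bin mass.

    conf: confidence of the emitted prediction, shape (n,)
    correct: 1 if that prediction was right, else 0, shape (n,)
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            total += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return total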
If the output is a word, sentence, or speech candidate, the score must also disclose whether it was conditioned on a fixed retrieval bank, a known-onset protocol, or a prompt / language-model scaffold. On this site, that score is not read as open-world uncertainty until the scaffold itself has been fixed and audited.
Calibration is managed separately for fit / calibration / test
| Stage | What belongs here | What breaks when stages are mixed |
|---|---|---|
| fit | Learn the model parameters, feature extractor, and the decoder itself. | Mixing this stage with calibration loses track of what came from model improvement versus threshold adjustment. |
| calibration | Adjust temperature scaling, threshold tuning, conformal scores, and prediction-set size against the frozen model. | Moving the threshold while looking at the test set forfeits any claim to calibrated probabilities or coverage. |
| test | Fix final ECE/Brier/NLL, empirical coverage, false-alarm rate, and coverage-risk. | Retuning against the test set makes held-out evidence and local tuning indistinguishable. |
| deployment / temporal audit | Fix the handling of cross-day, cross-subject, temporal shift, recalibration triggers, and human intervention. | Same-day calibration is otherwise misread as a deployable threshold. |
Lei et al. (2018) specified the calibration split necessary for split conformal, and Chernozhukov et al. (2021) extended the route to distributional conformal prediction. Therefore, on this site, calibration is not treated as "setting a threshold after the fact" but as a submission that requires an independent split.
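Under such an independent split, the split-conformal route of Lei et al. (2018) can be sketched as follows. This is a schematic with hypothetical names, assuming absolute residuals of a frozen model as the nonconformity score:

```python
import numpy as np

def split_conformal_interval(resid_cal, y_pred_test, alpha=0.1):
    """Split conformal regression: residuals come from the frozen model
    on the independent calibration split, never from fit or test."""
    n = len(resid_cal)
    # finite-sample quantile: the ceil((n+1)(1-alpha))-th smallest residual
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)
    q = np.sort(resid_cal)[k - 1]
    return y_pred_test - q, y_pred_test + q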
The same calibration means different things under different evaluation families
| Evaluation family | Minimum required slices | Misreading stopped here |
|---|---|---|
| within-session | Calibration per trial / block / state / artifact burden. | Same-day confidence is not carried over to another day or another person. |
| cross-session | Calibration per recording day, electrode replacement, and state annotation. | Do not claim a stable decoder while hiding day-to-day shift within the same subject. |
| cross-subject / cross-site / cross-device | Calibration per cohort, site, device, reference scheme, and population subgroup. | Confidence from mixed validation is not misread as patient-independent reliability. |
| temporal / longitudinal / OOD | Calibration per time-since-fit, time-since-calibration, novel task, drug/vigilance state, and covariate shift. | Short-term success of a fixed model is not inflated into long-term deployability or OOD safety. |
Shafiezadeh et al. (2023) showed that the split design itself strongly influences results in patient-independent seizure prediction, and Ovadia et al. (2019) showed that predictive uncertainty can collapse badly under dataset shift. Furthermore, Han et al. (2024) showed that under temporal distribution shift, model assessment and selection themselves need to be aligned with the time axis. Therefore, this site does not read a single global ECE value as final proof of reliability.
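A slice-wise audit makes this concrete: the same confidence stream can look calibrated globally while one slice hides a large gap. The sketch below is illustrative (the names are assumptions, and it uses a single-bin gap per slice rather than a full binned ECE for brevity):

```python
import numpy as np

def per_slice_gap(conf, correct, slices):
    """Confidence-accuracy gap per declared slice (e.g. recording day).
    A real audit would bin within each slice as in ECE."""
    return {s: abs(conf[slices == s].mean() - correct[slices == s].mean())
            for s in np.unique(slices)}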
Probability, interval, prediction set, and abstention are separate outputs
| Output type | Minimum guarantee required | What must accompany it |
|---|---|---|
| scalar probability / confidence | ECE, Brier, NLL, reliability diagram, and slice-wise calibration. | Fit/calibration/test separation, the evaluation family, and which claims are stopped. |
| interval / posterior band | Agreement with empirical coverage, interval width, sensitivity analysis, and external validation. | Whether coverage is marginal or local, what widens or narrows the band, and for which variable. |
| prediction set / conformal output | Set coverage, average set size, the validity assumption, and the exchangeability / time-order rule. | The calibration split, the set-size cost, marginal vs conditional validity, and which arguments stop at OOD. |
| abstention / selective prediction | Coverage-risk curve, false-alarm ceiling, fallback path, and human-review trigger. | The threshold, the coverage drop, the silence/freeze/stop distinction, and the recovery rule. |
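The coverage-risk curve in the last row can be computed with a small threshold sweep. This is an illustrative sketch, assuming one scalar confidence per trial and correctness labels from a held-out split; the names are hypothetical:

```python
import numpy as np

def coverage_risk_curve(conf, correct):
    """Answer the most-confident trials first; at each coverage level,
    report the error rate (risk) among the answered trials."""
    order = np.argsort(-conf, kind="stable")
    errors = 1 - correct[order]
    n = len(conf)
    coverage = np.arange(1, n + 1) / n
    risk = np.cumsum(errors) / np.arange(1, n + 1)
    return coverage, risk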
Candidate-bank and prompt-conditioned confidence are not open-world uncertainty
| Route type | What the reported score is conditional on | What it must not be overread as |
|---|---|---|
| Fixed-bank retrieval | Bank contents, segment definition, retrieval metric, and top-k cutoff. | Uncertainty over unrestricted language or free-form generation. |
| Known-onset word decoding | Declared word boundaries, retrieval-set size, averaging rule, and vocabulary family. | Confidence for free-running segmentation plus decoding in natural conversation. |
| Prompt-conditioned generation | Prompt tokens, LLM prior, vocabulary, and beam / decoding strategy. | Brain-only generative likelihood over the full space of possible continuations. |
| Autoregressive semantic reconstruction | Subject-specific fit, previous decoded context, and the evaluation candidate regime. | Subject-independent or prior-free semantic access. |
Tang et al. (2023) explicitly separated encoding-model and language-model contributions by ranking the actual word against distractors, while also showing that the decoder relies on both autoregressive context and fMRI. Défossez et al. (2023) reported segment-level top-k retrieval for 3 s speech segments rather than open-vocabulary decoding. d'Ascoli et al. (2025) explicitly framed prior M/EEG retrieval work as requiring access to the ground-truth speech segments at test time, and their own word-level results still report top-k accuracy under a fixed retrieval set with known word onsets. Ye et al. (2025) concatenated brain and text embeddings into the LLM prompt itself. Therefore, on this site, language-facing confidence is read only after the scaffold is disclosed through the Neural Contribution Card, and any cross-day or deployment claim must still pass the Temporal Validity Card.
Uncertainty comes in four layers
| Layer | Representative examples | Pages / tasks mainly affected |
|---|---|---|
| Observation noise | Electrode contact, synchronization, EMG / blinks, missing samples, stimulation artifacts. | Introduction to EEG, event sync, closed-loop implementation. |
| Model / geometry uncertainty | Head model, skull conductivity, source depth, solver dependence. | Source imaging, multimodal integration, and observation-to-estimation. |
| Distribution shift | Subject differences, different days, pharmacological conditions, anesthesia, task changes, OOD conditions. | Decoding, forecasting, and counterfactual / perturbation. |
| Operational drift | Decoder drift, electrode reseating, learning, fatigue, recalibration load. | Closed-loop BCI, state-trait-drift, and longitudinal evaluation. |
The important thing is not to discuss uncertainty as one undifferentiated box. Source-imaging width depends strongly on geometry and conductivity, EEG-classification overconfidence depends strongly on calibration error and subject shift, and closed-loop failure depends strongly on drift and recalibration load. The word "uncertainty" is the same, but the accounting differs by task.
Publish different indicators for each task
| Task | Minimum required indicators | Why a point estimate alone is dangerous |
|---|---|---|
| EEG source imaging | Inverse family / target object, uncertainty object, forward-model uncertainty route, cross-family comparison rule, named validation board / operating regime, and abstention boundary. | The same scalp data can support different focal, sparse, or extent-aware uncertainty objects, so one width is not a generic confidence scale. |
| Offline EEG classification | ECE, Brier score, NLL, fit/calibration/test separation, out-of-subject evaluation, coverage-risk curve. | Even if the accuracy is high, the confidence created by mixed validation is dangerous during operation. |
| Speech / brain-to-text decode | Neural Contribution Card fields, cue regime, output family, candidate bank or retrieval-set size, prompt / LM scaffold, onset regime or caption scaffold, no-brain / no-LM / shuffle controls, calibration / abstention slices, and Temporal Validity Card when the claim leaves same-session. | Fluent output or top-k success can be carried by the scaffold more than by the neural contribution if the task constraints are left implicit. |
| Rare event prediction | False alarm rate, sensitivity, calibration curve, risk-controlling threshold, and coverage for each alarm horizon. | In low-frequency tasks such as seizure prediction, even a small amount of overconfidence can greatly impair practicality. |
| online / closed-loop BCI | abstention rate, dropout, recalibration burden, recovery time, time-since-calibration, number of silence / freeze / hard stops. | Average accuracy alone hides breakdowns in continuous operation and time when intervention is unavailable. |
What the primary literature actually shows
1. In source imaging, a width without a typed uncertainty object is still overreading
The current source-imaging literature no longer supports the shortcut that any reported width or posterior can be treated as one common confidence object. Mahjoory et al. (2017) showed substantial cross-pipeline variability across forward models, inverse methods, and software implementations. Vorwerk et al. (2024) showed that tissue-conductivity uncertainty can drive localization and depth errors, especially for quasi-tangential sources on sulcal walls. Luria et al. (2024) describe a Bayesian focal-source route that returns posterior support over alternative source configurations. Tong et al. (2025) derive variance and hypothesis testing for a sparse debiased estimator, while Feng et al. (2025) target the locations and spatial extents of extended sources with empirical-Bayesian uncertainty quantification. These are not one interchangeable width. In parallel, Mikulan et al. (2020) and Hao et al. (2025) show that external validators themselves answer different error questions. Therefore, on this site, source-imaging uncertainty is not accepted unless the inverse family, target object, uncertainty object, forward-model uncertainty route, and named validation board are all declared.
| Uncertainty object | Representative literature | What it actually describes | What it must not be treated as |
|---|---|---|---|
| Cross-pipeline spread | Mahjoory et al. (2017) | Variability across forward models, inverse methods, templates, and software implementations on the same data. | A calibrated posterior or a confidence interval for one fixed inverse family. |
| Forward-model / conductivity sensitivity | Vorwerk et al. (2024); Rimpiläinen et al. (2019) | How geometry and tissue-conductivity uncertainty move localization, depth, or magnitude. | Solver-internal uncertainty about source support when upstream physics is treated as fixed. |
| Posterior support over focal-source configurations | Luria et al. (2024) | Probability mass over candidate focal-source number and configuration under a Bayesian focal-source model. | A debiased interval for sparse amplitudes or an extent-aware uncertainty map for spatially extended sources. |
| Debiased interval / test uncertainty for sparse activity | Tong et al. (2025) | Variance-aware inference for sparse source amplitude, orientation, and depth after debiasing regularized estimates. | A posterior over alternative source configurations or an uncertainty map for arbitrary spatial extent. |
| Extent-aware empirical-Bayesian uncertainty | Feng et al. (2025) | Uncertainty tied to the locations and spatial extents of extended sources. | Peak-only confidence for focal solutions or one generic ``better uncertainty'' label. |
Mikulan et al. (2020) provide a precisely known stimulation-site board for focal localization under simultaneous intracerebral stimulation and HD-EEG, whereas Hao et al. (2025) use simultaneous HD-EEG and SEEG in drug-resistant epilepsy and show regime-dependent accuracy tied to source depth and spike power. On this site, those do not collapse into one generic ``external validator'' field. A named board is required because different boards answer different error questions.
2. In EEG classification, calibration without fixing split and shift is overreading
Shafiezadeh et al. (2023) showed that random cross-validation and leave-one-patient-out give different estimates in patient-independent seizure prediction, and Ovadia et al. (2019) showed that predictive uncertainty methods can degrade badly under dataset shift. Furthermore, Han et al. (2024) showed that model assessment and selection should be designed around the temporal order under temporal distribution shift. Duan et al.'s UNCER, Hu et al. (2024), and Shafiezadeh et al. (2024) showed that calibration itself matters, but the correct inference is that the same ECE under different split and shift families constitutes different evidence. Therefore, this site does not equate within-session calibration with cross-day / cross-subject reliability.
3. In language decoding, confidence must be read through the scaffold
Language-decoding papers make the same lesson concrete in a different form: the uncertainty object changes when the output scaffold changes. Tang et al. (2023) used subject-specific fMRI together with autoregressive context, and their ablations show that removing fMRI drops performance while isolated scoring still compares the actual word to distractors. Défossez et al. (2023) reported top-10 segment retrieval for fixed 3 s speech windows. d'Ascoli et al. (2025) scaled known-word-onset decoding to 723 participants, but still reported top-10 accuracy with a retrieval set of 250 words and found strong protocol effects such as MEG > EEG and reading > listening. Ye et al. (2025) directly concatenated brain and text prompt embeddings for LLM generation and outperformed prompt-only or permuted-brain controls. Therefore, on this site, apparently fluent text output is not accepted as a generic uncertainty object: it must declare the candidate bank, prompt scaffold, onset rule, and neural-contribution controls first.
3A. 2026-03-29 addendum: structural pass/fail bundle for language-facing confidence
| Route family | Minimum disclosure before the score is read | Minimum pass condition on this site | Claim still blocked |
|---|---|---|---|
| Fixed-bank retrieval | Bank contents, segment definition, bank size, cue policy, and whether cue-presentation data entered train / validation / test. | It can count only as retrieval-conditioned evidence if calibration and accuracy are reported on held-out trials under the same bank and cue policy, with no-brain, time-shuffle, or cue-separated comparison. | Open-world semantic access or free-choice communication. |
| Known-onset word decoding | Word-onset source, retrieval-set size, vocabulary policy, averaging rule, and whether the decoder uses only causal information around the onset. | It can count only as onset-conditioned word-decoding evidence if top-k or calibration is reported within the declared retrieval set and onset regime together with Neural Contribution Card controls. | Free-running continuous speech or unrestricted language readout. |
| Prompt-conditioned generation | Prompt tokens, prompt length, prompt-only and permuted-brain controls, LLM family, decoding strategy, and output-length rule. | It can count only as prompt-conditioned generative evidence if the paper shows what changes when the brain input is removed or permuted under the same prompt scaffold. | Brain-only generative likelihood or prior-free semantic reconstruction. |
| Viewed/recalled content captioning | Stimulus family, semantic feature extractor, candidate-initialization policy, cue or recall prompt route, and whether caption optimization started from noninformative or content-bearing seeds. | It can count only as captioning of bounded viewed/recalled content if identification or discriminability and caption quality are reported under the declared caption scaffold and compared against scaffold-preserving controls such as word-order shuffle or feature-mismatch baselines. | General inner-speech decoding or unrestricted thought reading. |
| Streaming speech neuroprosthesis | Latency quantiles, silence or abstention rule, same-day training versus fixed-decoder interval, recalibration burden, and durability slice. | It can count only as communication-subsystem evidence if online performance is reported together with abstention or silence behavior and the time window over which the decoder was kept fixed. | Long-term general communication ability, emulate, or WBE-relevant state recovery. |
The point of this bundle is not to assign one universal numeric threshold. It is to stop readers from comparing unlike uncertainty objects as if they were one probability scale. Rybár et al. (2024) showed that cue-presentation data can grossly overestimate semantic BCI performance. Horikawa (2025) added a viewed/recalled content-captioning route built from decoded semantic features and iterative text optimization. Tang et al. (2023) remained subject-cooperative autoregressive semantic reconstruction. d'Ascoli et al. (2025) remained known-onset top-10 retrieval over a fixed vocabulary. Ye et al. (2025) remained prompt-conditioned LLM generation. Littlejohn et al. (2025) and Wairagkar et al. (2025) remained invasive communication-subsystem routes. The confidence object changes before the score changes.
4. Conformal / risk-controlling routes are effective, but the assumptions and set size need to be stated separately
Lei et al. (2018) give finite-sample marginal coverage via split conformal, and Chernozhukov et al. (2021) presented distributional conformal prediction using a conditional distribution model. Tibshirani et al. (2019) then made explicit that once exchangeability breaks under covariate shift, the conformal procedure itself must be modified rather than rhetorically extended. Furthermore, Segal et al. (2023) showed how risk-controlling prediction calibration can suppress the false-alarm rate in seizure prediction, and Eliades & Papadopoulos (2019) applied conformal prediction to BCI / exoskeleton control. Set-valued output and risk-controlled thresholds are therefore effective, but it must be stated separately which split was used for calibration, how coverage and set size were traded, and which of marginal / conditional / temporal validity is claimed.
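For classification, the same split-conformal logic yields label sets whose mean size is the visible price of coverage. The following is a minimal sketch with hypothetical names, assuming softmax-like scores and the simple 1 - p nonconformity rule:

```python
import numpy as np

def conformal_sets(p_cal, y_cal, p_test, alpha=0.2):
    """Split-conformal classification: calibrate a score threshold on the
    calibration split, then return label sets and their mean size."""
    n = len(y_cal)
    scores = 1.0 - p_cal[np.arange(n), y_cal]   # nonconformity of true labels
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)
    qhat = np.sort(scores)[k - 1]
    sets = p_test >= 1.0 - qhat                 # boolean (n_test, n_classes)
    # a set can be empty when no class clears the threshold
    return sets, sets.sum(axis=1).mean()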
5. Abstention is not "because it seems safe" but disclosure of coverage and risk
Ganeshkumar et al. (2017) showed that the false prediction rate can be reduced by adding a reject option to an EEG motor-imagery BCI. What matters is showing how much coverage was given up in exchange for fewer errors. Abstention rate or accuracy alone is therefore not enough; the coverage-risk exchange conditions must be disclosed with them.
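That exchange can be made explicit by choosing the reject threshold on a calibration split against a declared risk ceiling and publishing all three numbers together. This sketch and its names are illustrative assumptions, not a cited method:

```python
import numpy as np

def reject_threshold(conf_cal, correct_cal, max_risk=0.05):
    """Among thresholds whose error rate on the accepted trials stays
    under max_risk, keep the one giving the most coverage.
    Returns (threshold, coverage, risk), or None if no threshold works."""
    best = None
    for t in np.unique(conf_cal):
        accepted = conf_cal >= t
        risk = 1.0 - correct_cal[accepted].mean()
        if risk <= max_risk and (best is None or accepted.mean() > best[1]):
            best = (float(t), float(accepted.mean()), float(risk))
    return best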
6. In online BCI, recalibration and silence are also performance figures
Wairagkar et al.'s instantaneous voice-synthesis neuroprosthesis demonstrated a low-latency loop, but it was equally important that the system was designed to return silence in non-speech sections. Wilson et al. demonstrated long-term unsupervised recalibration of an intracortical BCI and showed that the bottleneck for continued operation is not only accuracy but how much recalibration is required. In a closed-loop system, therefore, abstention / silence / recalibration burden / recovery time are kept as separate indicators alongside latency and accuracy.
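The recalibration burden is countable only when the trigger rule is explicit. As an illustrative sketch (the names and the rolling-error rule are assumptions, not any cited system's design), a trigger log can be produced like this:

```python
import numpy as np

def recalibration_triggers(err, window=50, ceiling=0.3):
    """Trigger recalibration whenever the rolling error rate of an online
    decoder exceeds a declared ceiling; log the trigger times so the
    count and spacing can be reported as performance."""
    triggers, t = [], window
    while t <= len(err):
        if err[t - window:t].mean() > ceiling:
            triggers.append(t)
            t += window   # assume recalibration resets the monitoring window
        else:
            t += 1
    return triggers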
Operation rules adopted by this site
Rules
- Do not read confidence as probability: without calibration error and interval / set coverage, treat it as an internal score.
- Separate fit/calibration/test: temperature scaling, threshold tuning, and conformal scores are managed on independent splits and are never readjusted by looking at the test set.
- Issue calibration per evaluation family: do not read within-session ECE or coverage as cross-day / cross-subject / temporal-shift reliability.
- Language-facing scores must disclose their scaffold: fixed candidate banks, known onsets, prompt tokens, beam search, and language-model priors are reported before score or fluency is read as neural uncertainty.
- Source imaging requires typed uncertainty objects, not one generic width: separate cross-pipeline spread, forward-model sensitivity, focal posterior support, sparse debiased intervals, extent-aware uncertainty, and the named validation board before reading confidence strongly.
- EEG classification reports coverage-risk: do not pass on accuracy alone; include ECE/Brier/NLL, slice-wise calibration, and coverage after abstention.
- Set-valued / conformal results also disclose their assumptions: do not hide marginal / conditional / temporal validity, set size, or the exchangeability / time-order rule.
- Manage false alarms separately for seizure prediction and rare events: include false-alarm cost and threshold control as key metrics in addition to sensitivity.
- Online BCI publishes the recalibration load as performance: publish the breakdown of recalibration count, required time, recovery time, and silence / freeze / hard stops.
- Make abstention available when reliability is low: rather than forcing a single answer, branch to remeasurement, reanalysis, or a stop that requires intervention.
- Attach a Calibration & Abstention Card to results that foreground probabilities, intervals, prediction sets, or abstention: fix split, slice, coverage-risk, and fallback policy in the common submission on the Verification side.
- Stack the Neural Contribution Card when the output is language-like: for text / speech / brain-to-text outputs, read the score only together with candidate-set, prompt, and no-brain / no-LM / shuffle disclosure on the Verification side.
References
- Mahjoory, K., Nikulin, V. V., Botrel, L., Linkenkaer-Hansen, K., Fato, M. M., & Haufe, S. (2017). Consistency of EEG source localization and connectivity estimates. NeuroImage, 152, 590-601. doi:10.1016/j.neuroimage.2017.02.076
- Vorwerk, J., Wolters, C. H., & Baumgarten, D. (2024). Global sensitivity of EEG source analysis to tissue conductivity uncertainties. Frontiers in Human Neuroscience, 18, 1335212. doi:10.3389/fnhum.2024.1335212
- Rimpiläinen, I., Solis-Lemus, J. A., & Särkkä, S. (2019). Improved EEG source localization with Bayesian uncertainty modelling of unknown skull conductivity. NeuroImage, 184, 52-60. doi:10.1016/j.neuroimage.2018.11.058
- Luria, G., Viani, A., Pascarella, A., Bornfleth, H., Sommariva, S., & Sorrentino, A. (2024). The SESAMEEG package: a probabilistic tool for source localization and uncertainty quantification in M/EEG. Frontiers in Human Neuroscience, 18, 1359753. doi:10.3389/fnhum.2024.1359753
- Tong, P. F., Yang, H., Ding, X., Ding, Y., Geng, X., An, S., Wang, G., & Chen, S. X. (2025). Debiased Estimation and Inference for Spatial-Temporal EEG/MEG Source Imaging. IEEE Transactions on Medical Imaging, 44(3), 1480-1493. doi:10.1109/TMI.2024.3506596
- Feng, Z., Guan, C., & Sun, Y. (2025). Block-Champagne: A Novel Bayesian Framework for Imaging Extended E/MEG Source. IEEE Transactions on Medical Imaging. doi:10.1109/TMI.2025.3642620
- Mikulan, E., Russo, S., Parmigiani, S., Sarasso, S., Zauli, F. M., Rubino, A., Avanzini, P., Cattani, A., Sorrentino, A., Gibbs, S., Cardinale, F., Sartori, I., Nobili, L., Massimini, M., & Pigorini, A. (2020). Simultaneous human intracerebral stimulation and HD-EEG, ground-truth for source localization methods. Scientific Data, 7, 127. doi:10.1038/s41597-020-0467-x
- Hao, S., Zhao, H., Feng, Z., Liu, W., Zhang, C., Ping, H., Zhou, Q., Sun, B., Zhan, S., & Cao, C. (2025). HD-EEG source imaging with simultaneous SEEG recording in drug-resistant epilepsy. Epilepsia, 66(11), 4451-4464. doi:10.1111/epi.18552
- Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J. V., Lakshminarayanan, B., & Snoek, J. (2019). Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift. NeurIPS 2019
- Han, E., Huang, C., & Wang, K. (2024). Model Assessment and Selection under Temporal Distribution Shift. Proceedings of Machine Learning Research, 235. PMLR 235
- Tang, J., LeBel, A., Jain, S., & Huth, A. G. (2023). Semantic reconstruction of continuous language from non-invasive brain recordings. Nature Neuroscience, 26, 858-866. doi:10.1038/s41593-023-01304-9
- Défossez, A., Caucheteux, C., Rapin, J., Kabeli, O., & King, J.-R. (2023). Decoding speech perception from non-invasive brain recordings. Nature Machine Intelligence, 5, 1097-1107. doi:10.1038/s42256-023-00714-5
- d'Ascoli, S., Bel, C., Rapin, J., Banville, H., Benchetrit, Y., Pallier, C., & King, J.-R. (2025). Towards decoding individual words from non-invasive brain recordings. Nature Communications, 16, 10521. doi:10.1038/s41467-025-65499-0
- Ye, X., Liu, X., Wang, Y., Jiang, F., Geng, J., Ye, Q., & Wang, J. (2025). Generative language reconstruction from brain recordings. Communications Biology, 8, 285. doi:10.1038/s42003-025-07731-7
- Rybár, M., Poli, R., & Daly, I. (2024). Using data from cue presentations results in grossly overestimating semantic BCI performance. Scientific Reports, 14, 28003. doi:10.1038/s41598-024-79309-y
- Horikawa, T. (2025). Mind captioning: Evolving descriptive text of mental content from human brain activity. Science Advances, 11(45), eadw1464. doi:10.1126/sciadv.adw1464
- Littlejohn, K. T., Cho, C. J., Liu, J. R., et al. (2025). A streaming brain-to-voice neuroprosthesis to restore naturalistic communication. Nature Neuroscience, 28, 902-912. doi:10.1038/s41593-025-01905-6
- Duan, T., Wang, Z., Liu, S., Yin, Y., & Srihari, S. N. (2023). UNCER: A framework for uncertainty estimation and reduction in neural decoding of EEG signals. Neurocomputing, 538, 126210. doi:10.1016/j.neucom.2023.03.071
- Hu, J., Ur Rahman, M. M., Al-Naffouri, T., & Laleg-Kirati, T.-M. (2024). Uncertainty Estimation and Model Calibration in EEG Signal Classification for Epileptic Seizures Detection. In 2024 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (pp. 1-5). doi:10.1109/EMBC53108.2024.10782858
- Shafiezadeh, S., Mento, G., & Testolin, A. (2023). Methodological Issues in Evaluating Machine Learning Models for Patient-Independent Epileptic Seizure Prediction. Mathematics, 11(7), 1650. doi:10.3390/math11071650
- Shafiezadeh, S., Duma, G. M., Mento, G., Danieli, A., Antoniazzi, L., Del Popolo Cristaldi, F., Bonanni, P., & Testolin, A. (2024). Calibrating Deep Learning Classifiers for Patient-Independent Electroencephalogram Seizure Forecasting. Sensors, 24(9), 2863. doi:10.3390/s24092863
- Lei, J., G'Sell, M., Rinaldo, A., Tibshirani, R. J., & Wasserman, L. (2018). Distribution-Free Predictive Inference for Regression. Journal of the American Statistical Association, 113(523), 1094-1111. doi:10.1080/01621459.2017.1307116
- Chernozhukov, V., Wüthrich, K., & Zhu, Y. (2021). Distributional conformal prediction. Proceedings of the National Academy of Sciences, 118(48), e2107794118. doi:10.1073/pnas.2107794118
- Tibshirani, R. J., Barber, R. F., Candès, E. J., & Ramdas, A. (2019). Conformal Prediction Under Covariate Shift. Advances in Neural Information Processing Systems, 32. NeurIPS 2019
- Segal, G., Keidar, N., Lotan, R. M., Romano, Y., Herskovitz, M., & Yaniv, Y. (2023). Utilizing risk-controlling prediction calibration to reduce false alarm rates in epileptic seizure prediction. Frontiers in Neuroscience, 17, 1184990. doi:10.3389/fnins.2023.1184990
- Ganeshkumar, P., Maheswari, U., & Vasant, P. (2017). Reject Option to Reduce False Prediction Rates for EEG-Motor Imagery Based BCI. In 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI). doi:10.1109/ICACCI.2017.8125908
- Eliades, G., & Papadopoulos, H. (2019). Applying conformal prediction to control an exoskeleton. Proceedings of Machine Learning Research, 105, 44-51. PMLR 105
- Wilson, G. H., Stein, E. A., Kamdar, F., Avansino, D. T., Pun, T. K., Gross, R., Hosman, T., Singer-Clark, T., Kapitonava, A., Hochberg, L. R., Simeral, J. D., Shenoy, K. V., Druckmann, S., Henderson, J. M., & Willett, F. R. (2025). Long-term unsupervised recalibration of cursor-based intracortical brain-computer interfaces using a hidden Markov model. Nature Biomedical Engineering. doi:10.1038/s41551-025-01536-z
- Wairagkar, M., Card, N. S., Singer-Clark, T., Hou, X., Iacobacci, C., Miller, L. M., Hochberg, L. R., Brandman, D. M., & Stavisky, S. D. (2025). An instantaneous voice-synthesis neuroprosthesis. Nature, 644(8075), 145-152. doi:10.1038/s41586-025-09127-3
Where to go back next
To return to the source imaging side, please use From observation to estimation. To return to the closed-loop side, please use Closed-loop, delay, jitter, and safety stops. To return to the entire public rule, please use Verification platform.