From single-pass uncertainty decomposition to a staged UQ stack with ensemble evidential models and conformal deployment guarantees.
Pairwell's active learning loop routes uncertain candidates to lab validation, but if the uncertainty signal comes from a scalar-only head, it mixes model uncertainty (reducible with more data) and observation noise (irreducible). Heteroscedastic variants of MC dropout or ensembles can separate these components, but they require repeated forward passes or multiple models at inference [6, 11, 12]. I propose a staged uncertainty stack: (1) a Conformal Prediction wrapper for guaranteed coverage on deployment decisions [16], (2) an Evidential Deep Learning (EDL) output layer for single-pass epistemic-aleatoric decomposition [1], and (3) an Ensemble of Evidential models (EOE) for robust dual-source epistemic uncertainty [15]. On the Papyrus++ bioactivity benchmark, a 10-member EOE outperformed both single EDL and 100-member deep ensembles on 4 of 5 reported metrics, at 10× rather than 100× inference cost [15]. Each stage delivers independent value and can be validated incrementally on Bindwell's binding task [2, 3, 15].
Deep ensembles [6] train M independent copies of a model (often M = 5) with different random initializations. At inference, all M models produce predictions, and the dispersion across predictions is used as a model-uncertainty signal. Compute scales linearly with M (training and inference). Model storage scales with M, and peak GPU memory can scale with M if models are run in parallel.
Monte Carlo dropout [11] retains dropout at inference and runs T stochastic forward passes (often T = 10-50, with 10-20 common in practice). The variance across passes approximates a Bayesian predictive uncertainty. Training cost is unchanged, but inference cost is roughly T times a single pass.
Figure 1. The two standard approaches to neural network uncertainty. Both produce an identical ± scalar that conflates epistemic and aleatoric sources (shown here with scalar heads; heteroscedastic variants can decompose but retain M× or T× inference cost), and both incur significant inference overhead at Pairwell's screening scale (millions of pairs).
In their standard (scalar-output) formulations, both methods produce a single variance that conflates epistemic and aleatoric sources. Heteroscedastic variants [12] can decompose them (epistemic = Var_t[ŷ_t], aleatoric = E_t[σ_t²]), but the decomposition still requires M or T forward passes.
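The decomposition above can be sketched in a few lines. This is a minimal illustration, assuming each pass (or ensemble member) returns a predicted mean and a predicted noise variance for the same input; the function name and the example numbers are hypothetical.

```python
from statistics import fmean, pvariance

def decompose_uncertainty(means, variances):
    """Split predictive variance for one input across T passes (or M models).

    means[t]     : predicted mean y_hat_t from pass/model t
    variances[t] : predicted noise variance sigma_t^2 from pass/model t
    """
    if len(means) != len(variances) or len(means) < 2:
        raise ValueError("need at least two passes with matching lengths")
    epistemic = pvariance(means)   # Var_t[y_hat_t]: disagreement between passes
    aleatoric = fmean(variances)   # E_t[sigma_t^2]: average predicted noise
    return epistemic, aleatoric, epistemic + aleatoric

# Hypothetical values: five passes that agree on the mean but predict large noise,
# i.e. a noise-dominated candidate.
epi, alea, total = decompose_uncertainty(
    means=[6.9, 7.0, 7.1, 7.0, 7.0],
    variances=[0.40, 0.38, 0.42, 0.41, 0.39],
)
```

Every entry in `means` and `variances` costs one forward pass, which is the M× or T× overhead the text refers to.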
Epistemic uncertainty arises from insufficient training data - the model is extrapolating. It is reducible: one well-chosen data point can substantially improve the model. Aleatoric uncertainty arises from inherent data noise (assay variability, measurement error). It is irreducible: more data will not eliminate it.
For active learning, the distinction is critical. An acquisition function should prioritize candidates where additional lab data is expected to reduce model uncertainty, and deprioritize candidates dominated by observation noise. Sampling-based methods can separate these components when paired with a heteroscedastic output (predicting an input-dependent noise term) and repeated stochastic passes [12], but that increases inference cost. The NIG evidential head provides an explicit split in closed form and keeps inference to one forward pass [1].
EDL [1] places a Normal-Inverse-Gamma (NIG) prior over a Gaussian likelihood. The network outputs four parameters (γ, ν, α, β), and these must satisfy constraints (ν > 0, β > 0, α > 1). In practice this is enforced with activations such as ν = softplus(ν_raw), β = softplus(β_raw), and α = 1 + softplus(α_raw). With these constraints, epistemic and aleatoric quantities are obtained in closed form:
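For reference, the standard NIG moments from the evidential regression formulation in [1] are written out here. The numbering is an assumption, chosen so that (2) is the variance of the mean and (3) is the expected noise.

```latex
\begin{align}
\underbrace{\mathbb{E}[\mu]}_{\text{prediction}} &= \gamma \tag{1}\\
\underbrace{\mathrm{Var}[\mu]}_{\text{epistemic}} &= \frac{\beta}{\nu(\alpha - 1)} \tag{2}\\
\underbrace{\mathbb{E}[\sigma^2]}_{\text{aleatoric}} &= \frac{\beta}{\alpha - 1} \tag{3}
\end{align}
```

The two uncertainty terms differ only by the factor 1/ν, so the evidence parameter ν alone controls how much of the expected noise is attributed to uncertainty in the mean.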
More precisely, Equation (2) corresponds to Var[μ] under the NIG posterior (uncertainty in the predicted mean), and Equation (3) corresponds to E[σ²] (expected observation noise) [1].
The model is trained by minimizing the NIG negative log-likelihood plus an evidence regularizer weighted by λ [1]. The epistemic and aleatoric quantities above are computed from one forward pass, instead of requiring multiple stochastic passes or multiple ensemble members at inference.
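A minimal per-example sketch of this objective follows, assuming the Student-t marginal likelihood and the error-weighted evidence penalty described in [1]. The function names and the default `lam` value are illustrative placeholders, not settings taken from Pairwell.

```python
from math import lgamma, log, pi

def nig_nll(y, gamma, nu, alpha, beta):
    """Negative log-likelihood of y under the Student-t marginal of the NIG."""
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * log(pi / nu)
            - alpha * log(omega)
            + (alpha + 0.5) * log(nu * (y - gamma) ** 2 + omega)
            + lgamma(alpha) - lgamma(alpha + 0.5))

def evidence_regularizer(y, gamma, nu, alpha):
    """Penalize high evidence (2*nu + alpha) in proportion to the absolute error."""
    return abs(y - gamma) * (2.0 * nu + alpha)

def edl_loss(y, gamma, nu, alpha, beta, lam=0.01):
    """Per-example loss; lam is the regularization weight (placeholder value)."""
    return nig_nll(y, gamma, nu, alpha, beta) + lam * evidence_regularizer(y, gamma, nu, alpha)
```

The regularizer is zero when the prediction is exact and grows with both the error and the claimed evidence, which is what discourages confident wrong predictions.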
Figure 2. EDL architecture. The prediction (γ) is the same binding affinity score. The other 3 parameters encode evidence, yielding two separable uncertainty quantities in closed form. Compare with Figure 1: no repeated inference, no ensemble overhead.
A systematic review of all publicly available Bindwell artifacts yields the following:
Rose, Monti, Anand, Shen - bioRxiv, 2024
github.com/Bindwell/PLAPT
neg_log10_affinity_M (and affinity_uM)
plapt.score_candidates() returns scores only
github.com/Bindwell/APPT
Predicted_pKd
bindwell.com - November 2025
With a scalar-output ensemble or MC dropout, two candidates with identical ±40 nM uncertainty can have opposite causes. Heteroscedastic variants [12] can resolve this at M× or T× inference cost; EDL resolves it in one pass:
Figure 3. Two candidates with identical ±40 nM total uncertainty from opposite sources. Left: broad distribution (aleatoric noise). Right: disagreeing narrow distributions (epistemic ignorance). Scalar-output sampling methods report the same ±40 for both. Heteroscedastic variants [12] can distinguish them at M× or T× cost; EDL distinguishes them in one pass.
If Pairwell uses a scalar-output method, both candidates are treated identically—Candidate A wastes lab budget (irreducible noise) while Candidate B is the high-value target (one assay reduces model uncertainty globally). Heteroscedastic sampling can resolve this but multiplies inference cost. At millions of candidates, either the inefficiency or the overhead compounds.
I propose modifying Pairwell's prediction head to output four NIG parameters instead of one:
Figure 4. Proposed change: final layer widened from 1 to 4 outputs, loss changed from MSE to NIG negative log-likelihood. All upstream components remain frozen [1].
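The output transform of the widened head can be sketched as below. This is a plain-Python illustration, assuming the softplus constraints described earlier; `nig_head`, the dictionary keys, and the small `EPS` floor (added to avoid division by zero when softplus underflows) are assumptions of this sketch.

```python
from math import exp, log1p

EPS = 1e-6  # assumed numerical floor, not part of the published formulation

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + log1p(exp(-abs(x)))

def nig_head(raw):
    """Map four unconstrained outputs to NIG parameters and both uncertainties."""
    gamma_raw, nu_raw, alpha_raw, beta_raw = raw
    gamma = gamma_raw                          # predicted affinity, unconstrained
    nu = softplus(nu_raw) + EPS                # nu > 0
    alpha = 1.0 + softplus(alpha_raw) + EPS    # alpha > 1
    beta = softplus(beta_raw) + EPS            # beta > 0
    aleatoric = beta / (alpha - 1.0)           # E[sigma^2]
    epistemic = aleatoric / nu                 # Var[mu] = beta / (nu * (alpha - 1))
    return {"gamma": gamma, "nu": nu, "alpha": alpha, "beta": beta,
            "aleatoric": aleatoric, "epistemic": epistemic}

# One forward pass yields the prediction and both uncertainty terms.
out = nig_head((7.1, 0.0, 0.0, 0.0))
```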
With decomposed uncertainty, the acquisition function routes on epistemic uncertainty alone:
Figure 5. Modified active learning loop. The middle branch is the key difference: high-aleatoric/low-epistemic candidates are deprioritized, saving lab budget for novel chemistry where each assay yields maximum model improvement.
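The three-way routing can be illustrated with a short sketch. The branch labels, thresholds, and function names are hypothetical; in practice the thresholds would be tuned on Pairwell's own data.

```python
def route_candidate(epistemic, aleatoric, epi_threshold, alea_threshold):
    """Route one candidate using decomposed uncertainty (labels are illustrative)."""
    if epistemic >= epi_threshold:
        return "assay"          # model lacks evidence: lab data is expected to help
    if aleatoric >= alea_threshold:
        return "deprioritize"   # noise-dominated: another assay adds little
    return "accept"             # low uncertainty of both kinds

def select_for_assay(candidates, budget, epi_threshold, alea_threshold):
    """candidates: (candidate_id, epistemic, aleatoric) tuples. Returns ids to assay."""
    to_assay = [c for c in candidates
                if route_candidate(c[1], c[2], epi_threshold, alea_threshold) == "assay"]
    to_assay.sort(key=lambda c: c[1], reverse=True)   # highest epistemic first
    return [c[0] for c in to_assay[:budget]]

picked = select_for_assay(
    [("a", 0.50, 0.10), ("b", 0.05, 0.60), ("c", 0.30, 0.20)],
    budget=1, epi_threshold=0.2, alea_threshold=0.3,
)
```

Candidate "b" has the largest total uncertainty of the three but is never sent to the lab, because almost all of it is aleatoric.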
A single EDL model can assign high evidence to the wrong answer with no external check. Recent theoretical work [13] argues that EDL's epistemic score may behave more like an energy-based OOD detector than a faithful Bayesian uncertainty estimate. Ensemble of Evidential models (EOE) [15] addresses this by training M independent EDL models and combining their outputs, capturing two independent sources of epistemic uncertainty:
Within-model (evidence-based): Each EDL model reports its own epistemic uncertainty via the NIG parameters. Low evidence ν relative to scale β signals that the model collected insufficient evidence for its prediction. The ensemble averages these NIG parameters (γ̄, ν̄, ᾱ, β̄) to produce a pooled evidence-based epistemic estimate.
Across-model (disagreement-based): Variance in predicted means Var[γ_1, …, γ_M] across ensemble members signals regions where models with different initializations reach different conclusions, the classical ensemble disagreement signal.
Figure 6. EOE captures two independent epistemic signals. Within each model: evidence-based uncertainty from the NIG posterior. Across models: classical ensemble disagreement. A single EDL model has only the left signal and can be confidently wrong with no external check.
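A minimal sketch of the two signals for a single input follows, assuming each ensemble member has already produced valid NIG parameters. The function name and dictionary keys are illustrative, and the two epistemic terms are reported separately because the text does not specify how they are merged into one score.

```python
from statistics import fmean, pvariance

def combine_eoe(members):
    """members: list of (gamma, nu, alpha, beta) tuples, one per EDL model."""
    gammas, nus, alphas, betas = zip(*members)
    g, n, a, b = fmean(gammas), fmean(nus), fmean(alphas), fmean(betas)
    return {
        "prediction": g,
        "epistemic_within": b / (n * (a - 1.0)),   # from averaged NIG parameters
        "epistemic_across": pvariance(gammas),     # disagreement between member means
        "aleatoric": b / (a - 1.0),
    }

# Hypothetical two-member ensemble whose members report identical evidence
# but disagree on the predicted affinity.
eoe = combine_eoe([(7.0, 2.0, 3.0, 1.0), (7.2, 2.0, 3.0, 1.0)])
```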
Khalil et al. [15] benchmarked EOE against single EDL, deep ensembles, and MC dropout on the Papyrus++ bioactivity dataset. EOE10 dominated on 4 of 5 metrics (RMSE, NLL, CRPS, interval score) at 10× inference cost versus 100× for comparable ensemble and MC dropout baselines, with improvements statistically significant at p<0.01 (Friedman + Conover-Holm). If Pairwell already runs a 5-model ensemble, switching to EOE5 costs the same at inference but adds uncertainty decomposition and better calibration.
EDL and EOE decompose uncertainty but cannot guarantee that predicted intervals contain the true value at a stated rate. Conformal Prediction (CP) [16] provides this guarantee: given a calibration set of n held-out assay results, compute the absolute residuals |y_i − f̂(x_i)|, take their ⌈(n+1)(1−ε)⌉-th smallest value as the threshold q̂ (the finite-sample-corrected (1−ε)-quantile), and wrap every new prediction in [f̂(x) − q̂, f̂(x) + q̂]. The resulting intervals satisfy P(Y ∈ C(X)) ≥ 1−ε under exchangeability, a distribution-free, finite-sample guarantee on marginal coverage [16].
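The split-conformal procedure can be sketched as follows. It assumes `predict` is any callable returning a point prediction; all function names are illustrative. The quantile uses the finite-sample rank ⌈(n+1)(1−ε)⌉, which is what the coverage guarantee requires.

```python
from math import ceil

def conformal_quantile(scores, epsilon):
    """Finite-sample-corrected (1 - epsilon) quantile of calibration scores."""
    n = len(scores)
    rank = ceil((n + 1) * (1.0 - epsilon))   # 1-indexed order statistic
    if rank > n:
        return float("inf")                  # too few calibration points for this epsilon
    return sorted(scores)[rank - 1]

def calibrate(predict, calib_x, calib_y, epsilon=0.1):
    """Compute q_hat from absolute residuals on a held-out calibration set."""
    scores = [abs(y - predict(x)) for x, y in zip(calib_x, calib_y)]
    return conformal_quantile(scores, epsilon)

def predict_interval(predict, x, q_hat):
    """Wrap a point prediction in a symmetric conformal interval."""
    center = predict(x)
    return center - q_hat, center + q_hat
```

With fewer than roughly 1/ε calibration points the rank exceeds n and the interval becomes unbounded, so the size of the held-out set limits how small ε can be.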
CP is post-hoc (∼20 lines of code), requires no retraining, and works on any model output. When combined with Conformalized Quantile Regression (CQR), intervals adapt to the model's own uncertainty—narrower where confident, wider where uncertain. This is directly applicable to Bindwell's selectivity screening:
The full stack (CP + EOE) need not be deployed as a single migration. Each stage delivers independent value:
| Stage | Change | Timeline | Deliverable |
|---|---|---|---|
| 1. CP Wrapper | Hold out calibration set; wrap existing model | ∼1 week | Guaranteed coverage on deployment decisions |
| 2. EDL Head | Swap output 1→4; NIG loss | ∼2–4 weeks | Epistemic routing for active learning |
| 3. EOE5 | Train 5 EDL models; average NIG params | ∼4–6 weeks | Dual epistemic signal; best UQ |
| 4. CP + EOE | CQR on top of EOE5 outputs | ∼1 week | Complete system with guarantees |
Stage 1 can be applied to the existing model today with zero architectural changes. Each subsequent stage de-risks the next. Stop at any stage if gains plateau.
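Stage 4 can be sketched under one stated assumption: that raw lower and upper bounds are derived from the EOE mean and total variance with a Gaussian-style multiplier. That derivation (`eoe_bounds`) is hypothetical and not described in the sources; the calibration step itself follows the standard CQR conformity score max(lower − y, y − upper).

```python
from math import ceil

def eoe_bounds(gamma, total_variance, z=1.645):
    """Hypothetical raw bounds from an EOE mean and total (epistemic + aleatoric) variance."""
    half = z * total_variance ** 0.5
    return gamma - half, gamma + half

def cqr_calibrate(lower, upper, y, epsilon=0.1):
    """Conformal correction q_hat for raw bounds on a held-out calibration set."""
    scores = sorted(max(lo - yi, yi - hi) for lo, hi, yi in zip(lower, upper, y))
    n = len(scores)
    rank = ceil((n + 1) * (1.0 - epsilon))   # finite-sample-corrected rank
    return float("inf") if rank > n else scores[rank - 1]

def cqr_interval(lo, hi, q_hat):
    """Widen (or shrink, if q_hat < 0) the raw bounds by the calibrated correction."""
    return lo - q_hat, hi + q_hat
```

Because the correction is added to input-dependent bounds, the final intervals stay narrower where the model reports low uncertainty and wider where it reports high uncertainty.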
Soleimany, Amini et al. (ACS Central Science, 2021) applied EDL to molecular property prediction on QM9 with message-passing models. This is a related setting (molecules and noisy labels), but it is not the same as protein-ligand binding affinity with transformer architectures. The numbers below are included as motivation only and must be revalidated on Pairwell's data and metrics [2]:
| Metric | Deep Ensemble | MC Dropout | EDL (Proposed) | Improvement |
|---|---|---|---|---|
| Training data to reach RMSE = 7.0 (QM9 active learning) | N/A | 55% of dataset | 21% of dataset | ~62% less data vs dropout |
| Virtual screening hit rate | 78% | N/A | >95% | 78% → 95%+ |
| Rank correlation (ρ) for candidate prioritization | 0.040 | N/A | 0.163 | 4× better ranking |
| Inference cost per prediction | 5× (5 models) | 10-20× (stochastic passes) | 1× | 5-20× reduction |
Zhao et al. (Nature Communications, 2025) benchmarked EDL against 11 baselines for drug-target interaction prediction:
| Benchmark | EDL vs. Best Baseline | Detail |
|---|---|---|
| Davis dataset | +0.8% accuracy, +2% F1 | Over best of 11 baselines (classical ML and neural DTI models) |
| KIBA dataset | +0.6% accuracy | Uncertainty is used to improve high-confidence decision-making |
| DrugBank | 82% precision | Interaction prediction, single forward pass |
A benchmark study of UQ methods for molecular property prediction [4] reports that performance varies by data set and evaluation metric, and no single UQ method is uniformly best. This supports treating deep ensembles as a strong baseline while still valuing cheaper alternatives like EDL when compute or deployment constraints matter.
Khalil et al. (J. Chem. Inf. Model., 2025) benchmarked six UQ architectures on the Papyrus++ bioactivity dataset (ChEMBL-derived, standardized, filtered for ≥50 readings per target). The study compared performance under both stratified splits (interpolation) and scaffold cluster splits (extrapolation). On the stratified xC50 endpoint:
| Model | Type | RMSE ↓ | NLL ↓ | CRPS ↓ | Interval ↓ | Cost |
|---|---|---|---|---|---|---|
| MC Dropout100 | Bayesian | 0.763 | 1.140 | 0.418 | 2.111 | 100× |
| Deep Ensemble100 | Bayesian | 0.728 | 1.441 | 0.406 | 2.234 | 100× |
| Evidential1 | Evidential | 0.682 | 1.005 | 0.378 | 2.247 | 1× |
| EMC10 | Hybrid | 0.670 | 0.978 | 0.367 | 2.098 | 10× |
| EOE10 | Hybrid | 0.646 | 0.943 | 0.352 | 2.028 | 10× |
EOE10 achieved the best score on 4 of 5 metrics at 10× inference cost versus 100× for the Bayesian baselines. Improvements over Deep Ensemble100 were statistically significant at p<0.01 via Friedman + Conover-Holm post-hoc tests [15]. The result is directly relevant to Bindwell: the same bioactivity prediction task, similar data scale, and the key finding that combining evidential and ensemble approaches outperforms either alone.
Evidence collapse. EDL can assign high evidence to OOD inputs if the regularization coefficient λ is poorly tuned. Evidence regularization [1] mitigates this, but λ requires per-dataset tuning: too low allows evidence collapse on OOD inputs; too high suppresses the epistemic signal. The chosen λ therefore needs monitoring on a held-out OOD set. EOE mitigation: ensemble disagreement provides a second, independent check that catches overconfident single-model predictions.
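One way to monitor this is to check how well the epistemic score separates held-out in-distribution inputs from held-out OOD inputs. The sketch below uses AUROC for that purpose; the metric choice and the function name are assumptions of this example, not a procedure taken from the cited work.

```python
def auroc(id_scores, ood_scores):
    """Probability that a random OOD input scores higher than a random ID input.

    1.0 means the epistemic score separates the sets perfectly; 0.5 means it is
    uninformative, which would be a warning sign for the chosen lambda.
    """
    if not id_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for o in ood_scores:
        for i in id_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5   # ties count as half
    return wins / (len(id_scores) * len(ood_scores))
```

Tracking this value across candidate λ settings gives a simple signal for whether the regularizer is too weak (OOD inputs receive high evidence) before the scores are used for routing.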
Limited adoption. EDL remains research-stage [2, 3]. It has not reached ensemble-level adoption in production pipelines. This is an early-mover opportunity with integration risk. Staged rollout mitigation: the CP wrapper (Stage 1) delivers value immediately with zero model changes, de-risking subsequent stages.
Calibration on proprietary data. Published results use public benchmarks (QM7/QM9 with D-MPNN architectures; Davis/KIBA with DTI-specific models). Performance on Bindwell's pesticide binding data with transformer architectures may differ and requires empirical validation. CP mitigation: conformal prediction provides distribution-free coverage guarantees regardless of model calibration quality.
Interpretability of epistemic scores. Recent theoretical work argues that many evidential objectives can yield epistemic-style scores that do not vanish even with very large data, and that these methods can behave more like energy-based OOD detectors than faithful estimators of Bayesian model uncertainty [13, 14]. Mitigation: treat the epistemic score as a routing heuristic, validate under controlled distribution shifts, and compare against heteroscedastic dropout and ensemble baselines before using it to allocate lab budget. EOE mitigation: even if single-model epistemic scores are imperfect, ensemble disagreement provides a complementary signal grounded in classical Bayesian principles [15].
No coverage guarantee from EDL alone. Neither single EDL nor EOE can guarantee that a stated fraction of its prediction intervals contain the true value. For deployment decisions (e.g., "this molecule is bee-safe"), stakeholders need guaranteed intervals, not best-effort estimates. CP mitigation: conformal prediction (Section 3.4) provides exactly this guarantee, P(Y ∈ C(X)) ≥ 90% under exchangeability, enabling regulatory-grade selectivity claims [16].
Pairwell's active learning loop benefits from separating model uncertainty and observation noise when deciding what to assay next. Sampling-based approaches can provide this split with heteroscedastic heads, but they add inference cost because they require multiple passes or multiple models [12]. An evidential NIG head provides an explicit split from a single pass, which can reduce per-candidate compute while keeping the same acquisition logic.
However, single EDL has known limitations: it can be confidently wrong, and its epistemic score may not behave as a faithful Bayesian uncertainty estimate [13, 14]. Ensembling multiple EDL models (EOE) addresses both weaknesses by adding a second, independent epistemic signal via model disagreement. In the published Papyrus++ benchmark, a 10-member EOE outperformed single EDL and, at one tenth of the inference cost, 100-member deep ensembles and MC dropout on 4 of 5 metrics [15].
For deployment decisions—particularly selectivity screening where Bindwell must guarantee that a candidate does not bind non-target receptors—Conformal Prediction provides mathematically guaranteed coverage intervals that neither EDL nor ensembles can offer alone [16]. The combined stack (CP + EOE) separates two concerns: EOE drives active learning by routing lab budget on epistemic uncertainty, while CP provides guaranteed intervals for go/no-go deployment decisions.
The staged implementation (Section 3.5) allows Bindwell to capture value incrementally, using the per-stage estimates given there: coverage guarantees from CP (∼1 week), epistemic routing from EDL (∼2–4 weeks), dual-signal UQ from EOE5 (∼4–6 weeks), and the full stack after a further ∼1 week for CQR. Each stage is independently valuable and de-risks the next. The practical value of each stage should be validated on Pairwell with side-by-side baselines and evaluation under distribution shift [2, 13, 14, 15].