Evidential Deep Learning for Pairwell

From single-pass uncertainty decomposition to a staged UQ stack with ensemble evidential models and conformal deployment guarantees.

Abstract

Pairwell's active learning loop routes uncertain candidates to lab validation, but if the uncertainty signal comes from a scalar-only head, it mixes model uncertainty (reducible with more data) and observation noise (irreducible). Heteroscedastic variants of MC dropout or ensembles can separate these components, but they require repeated forward passes or multiple models at inference [6, 11, 12]. I propose a staged uncertainty stack: (1) a Conformal Prediction wrapper for guaranteed coverage on deployment decisions [16], (2) an Evidential Deep Learning (EDL) output layer for single-pass epistemic-aleatoric decomposition [1], and (3) an Ensemble of Evidential models (EOE) for robust dual-source epistemic uncertainty [15]. On the Papyrus++ bioactivity benchmark, EOE outperformed both single EDL and standard ensembles on most reported metrics at 10× rather than 100× inference cost [15]. Each stage delivers independent value and can be validated incrementally on Bindwell's binding task [2, 3, 15].

1 Background

1.1 Sampling-Based Uncertainty Methods

Deep ensembles [6] train M independent copies of a model (often M = 5) with different random initializations. At inference, all M models produce predictions, and the dispersion across predictions is used as a model-uncertainty signal. Compute scales linearly with M (training and inference). Model storage scales with M, and peak GPU memory can scale with M if models are run in parallel.

Monte Carlo dropout [11] retains dropout at inference and runs T stochastic forward passes (often T = 10-50, with 10-20 common in practice). The variance across passes approximates a Bayesian predictive uncertainty. Training cost is unchanged, but inference cost is roughly T times a single pass.

[Figure 1 diagram. Left panel, Deep Ensembles: train 5 copies, run all 5, measure disagreement. An input pair (molecule + protein) passes through Models 1 to 5, giving 7.8, 8.4, 8.1, 7.9, 8.2, summarized as 8.08 nM ± 0.23; GPU memory is 1× (sequential) or 5× (parallel). Right panel, MC Dropout: train once, keep dropout on at inference, run T passes with random neurons off each pass, measure spread; T slightly different answers give 8.08 nM ± 0.23 at 10-20× inference cost per pair. Footer: scalar heads give one ± number and cannot tell why; at Pairwell's scale, the 5-20× overhead compounds.]

Figure 1. The two standard approaches to neural network uncertainty. Both produce an identical ± scalar that conflates epistemic and aleatoric sources (shown here with scalar heads; heteroscedastic variants can decompose but retain M× or T× inference cost), and both incur significant inference overhead at Pairwell's screening scale (millions of pairs).

In their standard (scalar-output) formulations, both methods produce a single variance that conflates epistemic and aleatoric sources. Heteroscedastic variants [12] can decompose them (Epistemic = Var_t[μ_t], Aleatoric = E_t[σ²_t], where μ_t and σ²_t are the mean and noise variance predicted by pass or member t), but the decomposition still requires M or T forward passes.
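The heteroscedastic decomposition can be sketched in a few lines. This is a minimal illustration, assuming each of the M ensemble members (or T dropout passes) returns a mean and a noise variance for the same input; the function name and all numbers are hypothetical and not taken from any Bindwell code.

```python
from statistics import fmean, pvariance

def decompose_uncertainty(means, variances):
    """Split predictive variance from M (or T) heteroscedastic passes.

    means[t] and variances[t] are the mean and noise variance predicted
    by pass t for one input (hypothetical values, illustration only).
    """
    epistemic = pvariance(means)   # Var_t[mu_t]: disagreement between passes
    aleatoric = fmean(variances)   # E_t[sigma_t^2]: average predicted noise
    return fmean(means), epistemic, aleatoric

# Five passes for one molecule-protein pair (made-up numbers).
mu = [7.8, 8.4, 8.1, 7.9, 8.2]
sigma2 = [0.10, 0.12, 0.09, 0.11, 0.08]
mean, epi, ale = decompose_uncertainty(mu, sigma2)
print(f"mean={mean:.2f} epistemic={epi:.4f} aleatoric={ale:.4f}")
```

The split is only available after all M or T passes have run, which is the inference overhead the rest of this proposal tries to avoid.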

1.2 Epistemic vs. Aleatoric Uncertainty

Epistemic uncertainty arises from insufficient training data - the model is extrapolating. It is reducible: one well-chosen data point can substantially improve the model. Aleatoric uncertainty arises from inherent data noise (assay variability, measurement error). It is irreducible: more data will not eliminate it.

For active learning, the distinction is critical. An acquisition function should prioritize candidates where additional lab data is expected to reduce model uncertainty, and deprioritize candidates dominated by observation noise. Sampling-based methods can separate these components when paired with a heteroscedastic output (predicting an input-dependent noise term) and repeated stochastic passes [12], but that increases inference cost. The NIG evidential head provides an explicit split in closed form and keeps inference to one forward pass [1].

1.3 Evidential Deep Learning

EDL [1] places a Normal-Inverse-Gamma (NIG) prior over a Gaussian likelihood. The network outputs four parameters (γ, ν, α, β), and these must satisfy constraints (ν > 0, β > 0, α > 1). In practice this is enforced with activations such as ν = softplus(νraw), β = softplus(βraw), and α = 1 + softplus(αraw). With these constraints, epistemic and aleatoric quantities are obtained in closed form:

γ = predicted value    ν = evidence strength (1)
Epistemic uncertainty = β / (ν · (α − 1)) (2)
Aleatoric uncertainty = β / (α − 1) (3)

More precisely, Equation (2) corresponds to Var[μ] under the NIG posterior (uncertainty in the predicted mean), and Equation (3) corresponds to E[σ²] (expected observation noise) [1].
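As a minimal sketch of the constraints and the closed-form split, the snippet below maps four unconstrained head outputs to valid NIG parameters and evaluates Equations (2) and (3). The function names are hypothetical, and the example values are the illustrative ones from Figure 2 (ν = 47.3, α = 12.1, β = 0.8), not measured results.

```python
import math

def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def nig_from_raw(gamma_raw, nu_raw, alpha_raw, beta_raw):
    """Map four unconstrained head outputs to valid NIG parameters."""
    gamma = gamma_raw                   # predicted value, unconstrained
    nu = softplus(nu_raw)               # nu > 0
    alpha = 1.0 + softplus(alpha_raw)   # alpha > 1
    beta = softplus(beta_raw)           # beta > 0
    return gamma, nu, alpha, beta

def nig_uncertainties(nu, alpha, beta):
    """Closed-form split from Equations (2) and (3)."""
    aleatoric = beta / (alpha - 1.0)          # E[sigma^2]
    epistemic = beta / (nu * (alpha - 1.0))   # Var[mu]
    return epistemic, aleatoric

# Illustrative parameters from Figure 2.
epi, ale = nig_uncertainties(nu=47.3, alpha=12.1, beta=0.8)
print(f"epistemic={epi:.5f} aleatoric={ale:.5f}")
```

Note that the two quantities differ only by the factor 1/ν, so the evidence strength ν alone controls how much of the total variance is attributed to the model.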

The model is trained by minimizing the NIG negative log-likelihood plus an evidence regularizer weighted by λ [1]. The epistemic and aleatoric quantities above are computed from one forward pass, instead of requiring multiple stochastic passes or multiple ensemble members at inference.
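A per-sample version of this objective might look like the sketch below. It follows the loss form given in [1] (a Student-t negative log-likelihood plus an error-weighted evidence penalty); the function names and the default λ = 0.01 are placeholders, and a production version would be written in the training framework's tensor operations rather than scalar Python.

```python
import math

def nig_nll(y, gamma, nu, alpha, beta):
    """Negative log-likelihood of y under the NIG evidential model [1]."""
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * math.log(math.pi / nu)
            - alpha * math.log(omega)
            + (alpha + 0.5) * math.log(nu * (y - gamma) ** 2 + omega)
            + math.lgamma(alpha) - math.lgamma(alpha + 0.5))

def evidence_regularizer(y, gamma, nu, alpha):
    """Penalize total evidence (2*nu + alpha) in proportion to the absolute error."""
    return abs(y - gamma) * (2.0 * nu + alpha)

def evidential_loss(y, gamma, nu, alpha, beta, lam=0.01):
    """NLL plus lambda-weighted regularizer; lam=0.01 is a placeholder value."""
    return (nig_nll(y, gamma, nu, alpha, beta)
            + lam * evidence_regularizer(y, gamma, nu, alpha))

# The loss grows as the target moves away from the predicted value.
print(evidential_loss(8.0, 8.0, 2.0, 3.0, 1.5))
print(evidential_loss(5.0, 8.0, 2.0, 3.0, 1.5))
```

The regularizer is zero for a perfect prediction and grows with both the error and the claimed evidence, which is what discourages confident wrong answers.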

[Figure 2 diagram. Input pair (molecule + protein) → model → EDL head, single forward pass. Example outputs: γ = 8.08 nM, ν = 47.3, α = 12.1, β = 0.8. Epistemic uncertainty ("model hasn't seen this", reducible) = β / (ν · (α − 1)). Aleatoric uncertainty ("data is inherently noisy", irreducible) = β / (α − 1). Cost: 1× train, 1× inference. Change: output 1→4, plus NIG loss and activation constraints.]

Figure 2. EDL architecture. The prediction (γ) is the same binding affinity score. The other 3 parameters encode evidence, yielding two separable uncertainty quantities in closed form. Compare with Figure 1: no repeated inference, no ensemble overhead.

2 Analysis of Bindwell's Current Approach

2.1 Evidence Audit

A systematic review of all publicly available Bindwell artifacts yields the following:

PLAPT Paper [5]

Rose, Monti, Anand, Shen - bioRxiv, 2024

  • Final layer architecture: 512 → 64 → 1
  • Loss function: standard MSE regression
  • No mention of uncertainty quantification
  • Terms "evidential," "NIG," "epistemic," "aleatoric" absent
  • Terms "ensemble," "MC dropout" also absent

PLAPT Repository [7]

github.com/Bindwell/PLAPT

  • CLI/API returns neg_log10_affinity_M (and affinity_uM)
  • No ± uncertainty in output
  • No confidence intervals
  • plapt.score_candidates() returns scores only

APPT Repository [8]

github.com/Bindwell/APPT

  • Protein-protein binding model (ESM2_3B backbone)
  • Output: single Predicted_pKd
  • Feed-forward prediction head, no uncertainty
  • No evidential or Bayesian methods

Blog Post [9]

bindwell.com - November 2025

  • "Calibrated uncertainty estimates" - generic term
  • "State-of-the-art generalized UQ" - method unnamed
  • Describes single ± value for active learning routing
  • No epistemic/aleatoric decomposition mentioned
  • "Evidential," "NIG," "Normal-Inverse-Gamma" absent

Key inference: PLAPT and APPT output single scalar predictions with no uncertainty. The blog describes a single ± value, which is consistent with an ensemble or MC dropout. If EDL were in use, the epistemic/aleatoric decomposition would be highlighted as a differentiator. Its absence strongly suggests EDL is not used.

2.2 The Blind Spot

With a scalar-output ensemble or MC dropout, two candidates with identical ±40 nM uncertainty can have opposite causes. Heteroscedastic variants [12] can resolve this at M× or T× inference cost; EDL resolves it in one pass:

Candidate A: Noisy Assay Target. High aleatoric, low epistemic. Model has seen this chemistry. More data will not reduce the intrinsic noise floor.

Candidate B: Novel Chemical Space. High epistemic, low aleatoric. Model is extrapolating. One assay here gives Pairwell maximum new information.

Figure 3. Two candidates with identical ±40 nM total uncertainty from opposite sources. Left: broad distribution (aleatoric noise). Right: disagreeing narrow distributions (epistemic ignorance). Scalar-output sampling methods report the same ±40 for both. Heteroscedastic variants [12] can distinguish them at M× or T× cost; EDL distinguishes them in one pass.

If Pairwell uses a scalar-output method, both candidates are treated identically—Candidate A wastes lab budget (irreducible noise) while Candidate B is the high-value target (one assay reduces model uncertainty globally). Heteroscedastic sampling can resolve this but multiplies inference cost. At millions of candidates, either the inefficiency or the overhead compounds.
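A small worked example makes the blind spot concrete. The variance components below are hypothetical, chosen only so that both candidates report the same ±40 nM total; the point is that the scalar total hides which component dominates.

```python
import math

# Hypothetical variance components in nM^2, chosen so both totals match.
candidates = {
    "A (noisy assay target)": {"epistemic": 100.0, "aleatoric": 1500.0},
    "B (novel chemical space)": {"epistemic": 1500.0, "aleatoric": 100.0},
}

def total_std(parts):
    """Total predictive std: variances add, standard deviations do not."""
    return math.sqrt(parts["epistemic"] + parts["aleatoric"])

def epistemic_share(parts):
    """Fraction of total variance that more data could in principle reduce."""
    return parts["epistemic"] / (parts["epistemic"] + parts["aleatoric"])

for name, parts in candidates.items():
    print(f"{name}: total = ±{total_std(parts):.0f} nM, "
          f"epistemic share = {epistemic_share(parts):.0%}")
```

Both candidates print a total of ±40 nM, yet the reducible share is about 6% for A and about 94% for B, so only B is a good use of an assay.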

3 Proposed Approach

3.1 Architectural Modification

I propose modifying Pairwell's prediction head to output four NIG parameters instead of one:

Current: [Atomwell embeddings] → 512 → 64 → 1, loss L_MSE

Proposed: [Atomwell embeddings] → 512 → 64 → 4, loss L_NIG

Unchanged components: Atomwell embeddings (frozen), transformer trunk (frozen), training data (same), inference latency (≈ same)

Figure 4. Proposed change: final layer widened from 1 to 4 outputs, loss changed from MSE to NIG negative log-likelihood. All upstream components remain frozen [1].

3.2 Modified Active Learning Loop

With decomposed uncertainty, the acquisition function routes on epistemic uncertainty alone:

Pairwell + EDL head: 1 pass → (γ, ν, α, β), then three routes:

  • Low epistemic uncertainty: in-distribution, so the prediction is reliable. Accept. Do not prioritize for information-gain assays.
  • High aleatoric, low epistemic: noisy target, so lab data won't reduce uncertainty. Deprioritize. Save lab budget.
  • High epistemic uncertainty: novel chemistry, so maximum information gain. Lab assay. Maximum value per $.

Lab result retrains model → epistemic uncertainty decreases in that region.

Figure 5. Modified active learning loop. The middle branch is the key difference: high-aleatoric/low-epistemic candidates are deprioritized, saving lab budget for novel chemistry where each assay yields maximum model improvement.
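The three-way routing in Figure 5 can be sketched as a simple rule. Everything named here is hypothetical: the `Candidate` class, the `route` function, and both thresholds are placeholders that would need to be tuned on Pairwell's own validation data.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    epistemic: float   # Var[mu], Equation (2)
    aleatoric: float   # E[sigma^2], Equation (3)

def route(c, epi_threshold=0.05, ale_threshold=0.5):
    """Three-way routing from Figure 5; both thresholds are placeholders."""
    if c.epistemic >= epi_threshold:
        return "lab assay"       # novel chemistry: high information gain
    if c.aleatoric >= ale_threshold:
        return "deprioritize"    # noisy target: more data will not help
    return "accept"              # in-distribution and low noise

pool = [
    Candidate("in-distribution", 0.01, 0.10),
    Candidate("noisy target", 0.01, 0.90),
    Candidate("novel scaffold", 0.20, 0.10),
]
# Spend a fixed assay budget on the highest-epistemic candidates first.
assay_queue = sorted((c for c in pool if route(c) == "lab assay"),
                     key=lambda c: c.epistemic, reverse=True)
for c in pool:
    print(c.name, "->", route(c))
```

Checking epistemic uncertainty first means a candidate that is both novel and noisy still goes to the lab; whether that is the right priority is a design choice to validate against assay cost.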

3.3 Beyond Single EDL: Ensemble of Evidential Models

A single EDL model can assign high evidence to the wrong answer with no external check. Recent theoretical work [13] argues that EDL's epistemic score may behave more like an energy-based OOD detector than a faithful Bayesian uncertainty estimate. Ensemble of Evidential models (EOE) [15] addresses this by training M independent EDL models and combining their outputs, capturing two independent sources of epistemic uncertainty:

Within-model (evidence-based): Each EDL model reports its own epistemic uncertainty via the NIG parameters. Low evidence ν relative to scale β signals that the model collected insufficient evidence for its prediction. The ensemble averages these NIG parameters (γ̄, ν̄, ᾱ, β̄) to produce a pooled evidence-based epistemic estimate.

Across-model (disagreement-based): Variance in predicted means Var[γ1, …, γM] across ensemble members signals regions where models with different initializations reach different conclusions—the classical ensemble disagreement signal.

Within each model (evidential, "How much evidence did I collect?"): EDL Model 1 (γ₁, ν₁, α₁, β₁) ··· EDL Model M (γM, νM, αM, βM). Low evidence → high epistemic uncertainty: β̄ / (ν̄ · (ᾱ − 1)), from averaged NIG parameters.

Across models (disagreement, "Do my models agree?"): high spread = high epistemic uncertainty: Var[γ₁, γ₂, …, γM], the prediction disagreement.

Figure 6. EOE captures two independent epistemic signals. Within each model: evidence-based uncertainty from the NIG posterior. Across models: classical ensemble disagreement. A single EDL model has only the left signal and can be confidently wrong with no external check.
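The two signals can be computed from the per-model NIG outputs as in the sketch below. This assumes simple arithmetic averaging of the NIG parameters, as described above; the function name and example values are hypothetical, and the sketch deliberately reports the two epistemic signals separately rather than assuming how [15] merges them.

```python
from statistics import fmean, pvariance

def combine_eoe(members):
    """Combine M evidential models for one input.

    members is a list of (gamma, nu, alpha, beta) tuples, one per model.
    Returns the mean prediction, both epistemic signals, and the pooled
    aleatoric term.
    """
    gammas, nus, alphas, betas = zip(*members)
    g, n, a, b = fmean(gammas), fmean(nus), fmean(alphas), fmean(betas)
    return {
        "prediction": g,
        "epistemic_evidential": b / (n * (a - 1.0)),   # averaged NIG parameters
        "epistemic_disagreement": pvariance(gammas),   # Var[gamma_1..gamma_M]
        "aleatoric": b / (a - 1.0),
    }

# Three hypothetical EDL models scoring the same pair.
members = [(8.0, 10.0, 3.0, 2.0), (8.2, 10.0, 3.0, 2.0), (7.8, 10.0, 3.0, 2.0)]
for key, value in combine_eoe(members).items():
    print(f"{key}: {value:.4f}")
```

A case where `epistemic_evidential` is low but `epistemic_disagreement` is high is exactly the "confidently wrong" situation a single EDL model cannot flag on its own.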

Khalil et al. [15] benchmarked EOE against single EDL, deep ensembles, and MC dropout on the Papyrus++ bioactivity dataset. EOE10 was best on 4 of 5 metrics (RMSE, NLL, CRPS, interval score) at 10× inference cost versus 100× for the deep ensemble and MC dropout baselines, with improvements statistically significant at p<0.01 (Friedman + Conover-Holm). If Pairwell already runs a 5-model ensemble, switching to EOE5 costs the same at inference and adds uncertainty decomposition; whether the published calibration gains carry over must be confirmed on Pairwell's data.

3.4 Conformal Prediction for Deployment Guarantees

EDL and EOE decompose uncertainty but cannot guarantee that predicted intervals contain the true value at a stated rate. Conformal Prediction (CP) [16] provides this guarantee: given a calibration set of n held-out assay results, compute the absolute residuals |y − f̂(x)|, take their ⌈(n+1)(1−ε)⌉/n empirical quantile (the finite-sample-corrected (1−ε)-quantile) as threshold q̂, and wrap every new prediction in [f̂(x) − q̂, f̂(x) + q̂]. The resulting intervals satisfy P(Y ∈ C(X)) ≥ 1−ε under exchangeability of calibration and test points. This is a distribution-free, finite-sample guarantee, and it holds marginally (on average over inputs) rather than for each input separately [16].

P( Ynew ∈ C(Xnew) ) ≥ 1 − ε (4)
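A minimal split-conformal wrapper following the procedure in [16] might look like this. The function names and the calibration numbers are hypothetical; the only assumption about the underlying model is that it returns a point prediction for each held-out pair.

```python
import math

def conformal_quantile(abs_residuals, epsilon):
    """Finite-sample-corrected (1 - epsilon) quantile of calibration residuals."""
    n = len(abs_residuals)
    # Rank of the order statistic; round() guards against float noise in ceil.
    k = math.ceil(round((n + 1) * (1.0 - epsilon), 9))
    if k > n:
        return math.inf   # too few calibration points for this epsilon
    return sorted(abs_residuals)[k - 1]

def conformal_interval(prediction, q_hat):
    """Symmetric interval [f(x) - q_hat, f(x) + q_hat]."""
    return prediction - q_hat, prediction + q_hat

# Hypothetical calibration set: model predictions and measured pKd values.
cal_pred = [6.1, 5.4, 7.2, 6.8, 5.9, 6.5, 7.0, 5.2, 6.3]
cal_true = [6.4, 5.1, 7.3, 6.2, 6.0, 6.9, 6.8, 5.7, 6.3]
residuals = [abs(t - p) for p, t in zip(cal_pred, cal_true)]
q_hat = conformal_quantile(residuals, epsilon=0.1)
low, high = conformal_interval(6.0, q_hat)
print(f"q_hat={q_hat:.2f} interval=[{low:.2f}, {high:.2f}]")
```

With only nine calibration points and ε = 0.1, the threshold is the largest residual; a real calibration set would be much larger, and the interval width here is constant across inputs, which is what CQR is meant to relax.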

CP is post-hoc (∼20 lines of code), requires no retraining, and works on any model output. When combined with Conformalized Quantile Regression (CQR), intervals adapt to the model's own uncertainty—narrower where confident, wider where uncertain. This is directly applicable to Bindwell's selectivity screening:

Selectivity guarantee: For bee-safe insecticide screening, if the upper bound of the 90% CP interval on bee nAChR falls below pKd 5.0, Bindwell can state that the molecule's bee-receptor affinity lies below that threshold under a procedure with at least 90% coverage. Similarly, if the lower bound on pest nAChR exceeds pKd 6.5, the pest-activity call carries the same coverage guarantee. Because CP coverage is marginal, the guarantee holds on average over exchangeable candidates rather than for each molecule individually. To my knowledge, these are regulatory-grade claims no competitor currently makes, because they use either no uncertainty or uncalibrated ensemble uncertainty without CP's mathematical backing.
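The two interval rules above reduce to a pair of comparisons. The sketch below is illustrative only: the function name and the example intervals are hypothetical, while the pKd thresholds 5.0 and 6.5 are the ones stated in the text.

```python
def selectivity_call(bee_interval, pest_interval,
                     bee_max_pkd=5.0, pest_min_pkd=6.5):
    """Apply the two interval rules to one molecule.

    Each interval is a (lower, upper) pKd pair from a 90% conformal wrapper.
    """
    bee_safe = bee_interval[1] < bee_max_pkd        # upper bound below 5.0
    pest_active = pest_interval[0] > pest_min_pkd   # lower bound above 6.5
    return {
        "bee_safe": bee_safe,
        "pest_active": pest_active,
        "advance": bee_safe and pest_active,
    }

# Hypothetical intervals for two molecules.
print(selectivity_call((3.9, 4.7), (6.8, 7.6)))   # both rules satisfied
print(selectivity_call((4.2, 5.3), (6.8, 7.6)))   # bee interval crosses 5.0
```

If both calls must hold for the same molecule, a union bound gives joint coverage of at least 1 − 0.1 − 0.1 = 80% for two 90% intervals, so a stricter per-receptor ε may be needed when a joint 90% claim is the goal.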

3.5 Staged Implementation

The full stack (CP + EOE) need not be deployed as a single migration. Each stage delivers independent value:

| Stage | Change | Timeline | Deliverable |
|---|---|---|---|
| 1. CP Wrapper | Hold out calibration set; wrap existing model | ∼1 week | Guaranteed coverage on deployment decisions |
| 2. EDL Head | Swap output 1→4; NIG loss | ∼2–4 weeks | Epistemic routing for active learning |
| 3. EOE5 | Train 5 EDL models; average NIG params | ∼4–6 weeks | Dual epistemic signal; best UQ |
| 4. CP + EOE | CQR on top of EOE5 outputs | ∼1 week | Complete system with guarantees |

Stage 1 can be applied to the existing model today with zero architectural changes. Each subsequent stage de-risks the next. Stop at any stage if gains plateau.

4 Supporting Evidence

4.1 Molecular Property Prediction [2]

Soleimany, Amini et al. (ACS Central Science, 2021) applied EDL to molecular property prediction on QM9 with message-passing models. This is a related setting (molecules and noisy labels), but it is not the same as protein-ligand binding affinity with transformer architectures. The numbers below are included as motivation only and must be revalidated on Pairwell's data and metrics [2]:

| Metric | Deep Ensemble | MC Dropout | EDL (Proposed) | Improvement |
|---|---|---|---|---|
| Training data to reach RMSE = 7.0 (QM9 active learning) | N/A | 55% of dataset | 21% of dataset | ~62% less data vs dropout |
| Virtual screening hit rate | 78% | N/A | >95% | 78% → 95%+ |
| Rank correlation (ρ) for candidate prioritization | 0.040 | N/A | 0.163 | 4× better ranking |
| Inference cost per prediction | 5× (5 models) | 10-20× (stochastic passes) | 1× (single pass) | 5-20× reduction |

4.2 Drug-Target Interaction [3]

Zhao et al. (Nature Communications, 2025) benchmarked EDL against 11 baselines for drug-target interaction prediction:

| Benchmark | EDL vs. Best Baseline | Detail |
|---|---|---|
| Davis dataset | +0.8% accuracy, +2% F1 | Over best of 11 baselines (classical ML and neural DTI models) |
| KIBA dataset | +0.6% accuracy | Uncertainty is used to improve high-confidence decision-making |
| DrugBank | 82% precision | Interaction prediction, single forward pass |

A benchmark study of UQ methods for molecular property prediction [4] reports that performance varies by data set and evaluation metric, and no single UQ method is uniformly best. This supports treating deep ensembles as a strong baseline while still valuing cheaper alternatives like EDL when compute or deployment constraints matter.

4.3 Bioactivity Modeling with EOE [15]

Khalil et al. (J. Chem. Inf. Model., 2025) benchmarked six UQ architectures on the Papyrus++ bioactivity dataset (ChEMBL-derived, standardized, filtered for ≥50 readings per target). The study compared performance under both stratified splits (interpolation) and scaffold cluster splits (extrapolation). On the stratified xC50 endpoint:

| Model | Type | RMSE ↓ | NLL ↓ | CRPS ↓ | Interval ↓ | Cost |
|---|---|---|---|---|---|---|
| MC Dropout100 | Bayesian | 0.763 | 1.140 | 0.418 | 2.111 | 100× |
| Deep Ensemble100 | Bayesian | 0.728 | 1.441 | 0.406 | 2.234 | 100× |
| Evidential1 | Evidential | 0.682 | 1.005 | 0.378 | 2.247 | 1× |
| EMC10 | Hybrid | 0.670 | 0.978 | 0.367 | 2.098 | 10× |
| EOE10 | Hybrid | 0.646 | 0.943 | 0.352 | 2.028 | 10× |

EOE10 achieved the best score on 4 of 5 metrics (the four shown above: RMSE, NLL, CRPS, and interval score) at 10× inference cost versus 100× for the Bayesian baselines. Improvements over Deep Ensemble100 were statistically significant at p<0.01 via Friedman + Conover-Holm post-hoc tests [15]. The result is directly relevant to Bindwell: it addresses a closely related bioactivity prediction task, and the key finding is that combining evidential and ensemble approaches outperforms either alone.

5 Risks and Limitations

Evidence collapse. EDL can assign high evidence to OOD inputs if the regularization coefficient λ is miscalibrated. Evidence regularization [1] mitigates this, but λ requires per-dataset tuning: too low allows evidence collapse on OOD inputs; too high suppresses the epistemic signal. The setting therefore needs monitoring on a held-out OOD set. EOE mitigation: ensemble disagreement provides a second, independent check that catches overconfident single-model predictions.

Limited adoption. EDL remains research-stage [2, 3]. It has not reached ensemble-level adoption in production pipelines. This is an early-mover opportunity with integration risk. Staged rollout mitigation: the CP wrapper (Stage 1) delivers value immediately with zero model changes, de-risking subsequent stages.

Calibration on proprietary data. Published results use public benchmarks (QM7/QM9 with D-MPNN architectures; Davis/KIBA with DTI-specific models). Performance on Bindwell's pesticide binding data with transformer architectures may differ and requires empirical validation. CP mitigation: conformal prediction provides distribution-free coverage guarantees regardless of model calibration quality.

Interpretability of epistemic scores. Recent theoretical work argues that many evidential objectives can yield epistemic-style scores that do not vanish even with very large data, and that these methods can behave more like energy-based OOD detectors than faithful estimators of Bayesian model uncertainty [13, 14]. Mitigation: treat the epistemic score as a routing heuristic, validate under controlled distribution shifts, and compare against heteroscedastic dropout and ensemble baselines before using it to allocate lab budget. EOE mitigation: even if single-model epistemic scores are imperfect, ensemble disagreement provides a complementary signal grounded in classical Bayesian principles [15].

No coverage guarantee from EDL alone. Neither single EDL nor EOE can claim that a stated fraction of prediction intervals contain the true value. For deployment decisions (e.g., "this molecule is bee-safe"), stakeholders need guaranteed intervals, not best-effort estimates. CP mitigation: conformal prediction (Section 3.4) provides exactly this guarantee, P(Y ∈ C(X)) ≥ 90%, enabling regulatory-grade selectivity claims [16].

6 Conclusion

Pairwell's active learning loop benefits from separating model uncertainty and observation noise when deciding what to assay next. Sampling-based approaches can provide this split with heteroscedastic heads, but they add inference cost because they require multiple passes or multiple models [12]. An evidential NIG head provides an explicit split from a single pass, which can reduce per-candidate compute while keeping the same acquisition logic.

However, single EDL has known limitations—it can be confidently wrong, and its epistemic score may not behave as a faithful Bayesian uncertainty estimate [13, 14]. Ensembling multiple EDL models (EOE) addresses both weaknesses by adding a second, independent epistemic signal via model disagreement, and published benchmarks show EOE dominates both single EDL and standard ensembles on bioactivity prediction tasks at a fraction of the compute [15].

For deployment decisions—particularly selectivity screening where Bindwell must guarantee that a candidate does not bind non-target receptors—Conformal Prediction provides mathematically guaranteed coverage intervals that neither EDL nor ensembles can offer alone [16]. The combined stack (CP + EOE) separates two concerns: EOE drives active learning by routing lab budget on epistemic uncertainty, while CP provides guaranteed intervals for go/no-go deployment decisions.

The staged implementation (Section 3.5) allows Bindwell to capture value incrementally: coverage guarantees from CP in week 1, epistemic routing from EDL by week 4, dual-signal UQ from EOE by week 8, and the full stack by week 9. Each stage is independently valuable and de-risks the next. The practical value of each stage should be validated on Pairwell with side-by-side baselines and evaluation under distribution shift [2, 13, 14, 15].

References

  1. Amini, A., Schwarting, W., Soleimany, A., & Rus, D. (2020). Deep Evidential Regression. Advances in Neural Information Processing Systems (NeurIPS), 33. arXiv:1910.02600
  2. Soleimany, A.P., Amini, A., Goldman, S., Rus, D., Bhatia, S.N., & Coley, C.W. (2021). Evidential Deep Learning for Guided Molecular Property Prediction and Discovery. ACS Central Science, 7(8), 1356-1367. PMC8393200
  3. Zhao, Y., et al. (2025). Evidential Deep Learning for Drug-Target Interaction Prediction. Nature Communications, 16. DOI: 10.1038/s41467-025-62235-6
  4. Hirschfeld, L., Swanson, K., Yang, K., Barzilay, R., & Coley, C.W. (2020). Uncertainty Quantification Using Neural Networks for Molecular Property Prediction. arXiv. arXiv:2005.10036
  5. Rose, T., Monti, F., Anand, N., & Shen, A. (2024). PLAPT: Protein-Ligand Binding Affinity Prediction Using Pretrained Transformers. bioRxiv. DOI: 10.1101/2024.02.08.575577v3
  6. Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. Advances in Neural Information Processing Systems (NeurIPS), 30. arXiv:1612.01474
  7. Bindwell. PLAPT: Protein-Ligand Binding Affinity Prediction Tool. github.com/Bindwell/PLAPT
  8. Bindwell. APPT: All-Purpose Protein-Protein Binding Prediction Tool. github.com/Bindwell/APPT
  9. Anand, N. & Rose, T. (2025). Defeating Pests with AI. Bindwell Blog. bindwell.com/posts/defeating-pests-with-ai
  10. Wohlwend, J., Corso, G., Passaro, S., et al. (2024). Boltz-1: Democratizing Biomolecular Interaction Modeling. MIT Jameel Clinic. github.com/jwohlwend/boltz
  11. Gal, Y. & Ghahramani, Z. (2016). Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. Proceedings of the 33rd International Conference on Machine Learning (ICML). arXiv:1506.02142
  12. Kendall, A. & Gal, Y. (2017). What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? NeurIPS. arXiv:1703.04977
  13. Shen, M., Ryu, J. J., Ghosh, S., Bu, Y., Sattigeri, P., Das, S., & Wornell, G. W. (2024). Are Uncertainty Quantification Capabilities of Evidential Deep Learning a Mirage? NeurIPS. arXiv:2402.06160
  14. Bengs, V., Hüllermeier, E., & Waegeman, W. (2023). On Second-Order Scoring Rules for Epistemic Uncertainty Quantification. ICML (PMLR). PMLR v202
  15. Khalil, K., Schweighofer, A., Dyubankova, N., van Westen, G. J. P., & van Vlijmen, H. W. T. (2025). Combining Bayesian and Evidential Uncertainty Quantification for Bioactivity Modeling. J. Chem. Inf. Model., 65(24), 13057–13069. DOI: 10.1021/acs.jcim.5c01597
  16. Angelopoulos, A. N. & Bates, S. (2022). A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification. arXiv. arXiv:2107.07511