Evidential Deep Learning for Pairwell

From single-pass uncertainty decomposition to a staged UQ stack with ensemble evidential models and conformal deployment guarantees.

Abstract

Pairwell's active learning loop routes uncertain candidates to lab validation, but if the uncertainty signal comes from a scalar-only head, it mixes model uncertainty (reducible with more data) and observation noise (irreducible). Heteroscedastic variants of MC dropout or ensembles can separate these components, but they require repeated forward passes or multiple models at inference [6, 11, 12]. I propose a staged uncertainty stack: (1) a Conformal Prediction wrapper for guaranteed coverage on deployment decisions [16], (2) an Evidential Deep Learning (EDL) output layer for single-pass epistemic-aleatoric decomposition [1], and (3) an Ensemble of Evidential models (EOE) for robust dual-source epistemic uncertainty [15]. On a bioactivity prediction benchmark, EOE outperformed both single EDL and standard ensembles on 4 of 5 reported metrics at 10× rather than 100× inference cost [15]. Each stage delivers independent value and can be validated incrementally on Bindwell's binding task [2, 3, 15].

1 Background

1.1 Sampling-Based Uncertainty Methods

Deep ensembles [6] train M independent copies of a model (often M = 5) with different random initializations. At inference, all M models produce predictions, and the dispersion across predictions is used as a model-uncertainty signal. Compute scales linearly with M (training and inference). Model storage scales with M, and peak GPU memory can scale with M if models are run in parallel.

Monte Carlo dropout [11] retains dropout at inference and runs T stochastic forward passes (often T = 10-50, with 10-20 common in practice). The variance across passes approximates a Bayesian predictive uncertainty. Training cost is unchanged, but inference cost is roughly T times a single pass.

Figure 1. The two standard approaches to neural network uncertainty. Both produce an identical ± scalar that conflates epistemic and aleatoric sources (shown here with scalar heads; heteroscedastic variants can decompose but retain M× or T× inference cost), and both incur significant inference overhead at Pairwell's screening scale (millions of pairs).

In their standard (scalar-output) formulations, both methods produce a single variance that conflates epistemic and aleatoric sources. Heteroscedastic variants [12] can decompose them (Epistemic = Var_t[ŷ_t], Aleatoric = E_t[σ²_t]), but the decomposition still requires M or T forward passes.
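The decomposition quoted above can be sketched in a few lines. This is a minimal illustration, assuming each of the M members (or T dropout passes) returns a predicted mean and a predicted noise variance; the function name `decompose` and the sample numbers are hypothetical, not taken from any Pairwell code.

```python
from statistics import fmean, pvariance

def decompose(means, variances):
    """Split predictive variance from M ensemble members or T dropout passes.

    means[t]     : predicted mean y_hat_t from pass t
    variances[t] : predicted noise variance sigma^2_t from pass t
    """
    epistemic = pvariance(means)   # Var_t[y_hat_t]: disagreement between passes
    aleatoric = fmean(variances)   # E_t[sigma^2_t]: average predicted noise
    return epistemic, aleatoric, epistemic + aleatoric

# Hypothetical outputs of M = 5 members for a single candidate (pKd units).
means = [6.1, 6.3, 5.9, 6.2, 6.0]
variances = [0.04, 0.05, 0.03, 0.04, 0.04]
epistemic, aleatoric, total = decompose(means, variances)
print(f"epistemic={epistemic:.3f} aleatoric={aleatoric:.3f} total={total:.3f}")
```

The population variance is used here for simplicity; a sample variance (dividing by M − 1) is an equally reasonable choice. Either way, all M or T passes must be run before the split is available.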

1.2 Epistemic vs. Aleatoric Uncertainty

Epistemic uncertainty arises from insufficient training data - the model is extrapolating. It is reducible: one well-chosen data point can substantially improve the model. Aleatoric uncertainty arises from inherent data noise (assay variability, measurement error). It is irreducible: more data will not eliminate it.

For active learning, the distinction is critical. An acquisition function should prioritize candidates where additional lab data is expected to reduce model uncertainty, and deprioritize candidates dominated by observation noise. Sampling-based methods can separate these components when paired with a heteroscedastic output (predicting an input-dependent noise term) and repeated stochastic passes [12], but that increases inference cost. The NIG evidential head provides an explicit split in closed form and keeps inference to one forward pass [1].

1.3 Evidential Deep Learning

EDL [1] places a Normal-Inverse-Gamma (NIG) prior over a Gaussian likelihood. The network outputs four parameters (γ, ν, α, β), and these must satisfy constraints (ν > 0, β > 0, α > 1). In practice this is enforced with activations such as ν = softplus(ν_raw), β = softplus(β_raw), and α = 1 + softplus(α_raw). With these constraints, epistemic and aleatoric quantities are obtained in closed form:

γ = predicted value,  ν = evidence strength (1)

Epistemic uncertainty = β / (ν · (α − 1)) (2)

Aleatoric uncertainty = β / (α − 1) (3)

More precisely, Equation (2) corresponds to Var[μ] under the NIG posterior (uncertainty in the predicted mean), and Equation (3) corresponds to E[σ²] (expected observation noise) [1].
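As a sketch of how the constraints and Equations (2) and (3) fit together, the snippet below maps four unconstrained head outputs to NIG parameters and then computes both uncertainties. The function names and the raw input values are illustrative assumptions; only the activations and formulas come from the text above.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def nig_from_raw(gamma_raw, nu_raw, alpha_raw, beta_raw):
    """Apply the constraints nu > 0, alpha > 1, beta > 0 to raw head outputs."""
    gamma = gamma_raw                    # predicted value, unconstrained
    nu = softplus(nu_raw)                # evidence strength
    alpha = 1.0 + softplus(alpha_raw)    # shifted so that alpha - 1 > 0
    beta = softplus(beta_raw)
    return gamma, nu, alpha, beta

def nig_uncertainty(nu, alpha, beta):
    """Closed-form split: Eq. (2) is Var[mu], Eq. (3) is E[sigma^2]."""
    aleatoric = beta / (alpha - 1.0)     # Eq. (3)
    epistemic = aleatoric / nu           # Eq. (2): beta / (nu * (alpha - 1))
    return epistemic, aleatoric

# Hypothetical raw outputs from one forward pass.
gamma, nu, alpha, beta = nig_from_raw(6.2, 0.0, 0.0, 0.0)
epistemic, aleatoric = nig_uncertainty(nu, alpha, beta)
print(f"prediction={gamma:.2f} epistemic={epistemic:.3f} aleatoric={aleatoric:.3f}")
```

Note that the two quantities differ only by the factor 1/ν, so the evidence strength ν is what separates "noisy but well understood" from "poorly understood".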

The model is trained by minimizing the NIG negative log-likelihood plus an evidence regularizer weighted by λ [1]. The epistemic and aleatoric quantities above are computed from one forward pass, instead of requiring multiple stochastic passes or multiple ensemble members at inference.
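For reference, a per-sample version of this objective can be written as below. It follows the form given in Deep Evidential Regression [1]: the NLL is the negative log-density of the Student-t marginal implied by the NIG parameters, and the regularizer scales the absolute error by the total evidence 2ν + α. The default `lam=0.01` is a placeholder assumption, not a recommended value.

```python
import math

def nig_nll(y, gamma, nu, alpha, beta):
    """Negative log-likelihood of y under the NIG-implied Student-t marginal."""
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * math.log(math.pi / nu)
            - alpha * math.log(omega)
            + (alpha + 0.5) * math.log(nu * (y - gamma) ** 2 + omega)
            + math.lgamma(alpha) - math.lgamma(alpha + 0.5))

def evidence_regularizer(y, gamma, nu, alpha):
    """Penalize evidence (2*nu + alpha) in proportion to the absolute error."""
    return abs(y - gamma) * (2.0 * nu + alpha)

def edl_loss(y, gamma, nu, alpha, beta, lam=0.01):
    """Per-sample loss: NLL + lam * regularizer (lam is a tunable assumption)."""
    return nig_nll(y, gamma, nu, alpha, beta) + lam * evidence_regularizer(y, gamma, nu, alpha)

# Hypothetical example: same NIG parameters, small vs. large error.
print(edl_loss(6.1, 6.0, 2.0, 3.0, 1.0))
print(edl_loss(8.0, 6.0, 2.0, 3.0, 1.0))
```

In a real training loop this would be averaged over a batch and differentiated by the framework in use; the scalar form here is only meant to make the two terms explicit.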

Figure 2. EDL architecture. The prediction (γ) is the same binding affinity score. The other 3 parameters encode evidence, yielding two separable uncertainty quantities in closed form. Compare with Figure 1: no repeated inference, no ensemble overhead.

2 Analysis of Bindwell's Current Approach

2.1 Evidence Audit

A systematic review of all publicly available Bindwell artifacts yields the following:

PLAPT Paper [5]

Rose, Monti, Anand, Shen - bioRxiv, 2024

Final layer architecture: 512 → 64 → 1
Loss function: standard MSE regression
No mention of uncertainty quantification
Terms "evidential," "NIG," "epistemic," "aleatoric" absent
Terms "ensemble," "MC dropout" also absent

PLAPT Repository [7]

github.com/Bindwell/PLAPT

CLI/API returns neg_log10_affinity_M (and affinity_uM)
No ± uncertainty in output
No confidence intervals
plapt.score_candidates() returns scores only

APPT Repository [8]

github.com/Bindwell/APPT

Protein-protein binding model (ESM2_3B backbone)
Output: single Predicted_pKd
Feed-forward prediction head, no uncertainty
No evidential or Bayesian methods

Blog Post [9]

bindwell.com - November 2025

"Calibrated uncertainty estimates" - generic term
"State-of-the-art generalized UQ" - method unnamed
Describes single ± value for active learning routing
No epistemic/aleatoric decomposition mentioned
"Evidential," "NIG," "Normal-Inverse-Gamma" absent

Key inference: PLAPT and APPT output single scalar predictions with no uncertainty. The blog describes a single ± value - consistent with ensemble or MC dropout. If EDL were in use, the epistemic/aleatoric decomposition would be highlighted as a differentiator. Its absence strongly suggests EDL is not used.

2.2 The Blind Spot

With a scalar-output ensemble or MC dropout, two candidates with identical ±40 nM uncertainty can have opposite causes. Heteroscedastic variants [12] can resolve this at M× or T× inference cost; EDL resolves it in one pass:

Figure 3. Two candidates with identical ±40 nM total uncertainty from opposite sources. Left: broad distribution (aleatoric noise). Right: disagreeing narrow distributions (epistemic ignorance). Scalar-output sampling methods report the same ±40 for both. Heteroscedastic variants [12] can distinguish them at M× or T× cost; EDL distinguishes them in one pass.

If Pairwell uses a scalar-output method, both candidates are treated identically—Candidate A wastes lab budget (irreducible noise) while Candidate B is the high-value target (one assay reduces model uncertainty globally). Heteroscedastic sampling can resolve this but multiplies inference cost. At millions of candidates, either the inefficiency or the overhead compounds.
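To make the blind spot concrete, here is a small worked example with made-up NIG parameters, chosen so that both candidates have the same total spread but opposite compositions. The parameter values and the nM² units are purely illustrative assumptions.

```python
import math

def nig_split(nu, alpha, beta):
    """Closed-form split from NIG parameters (Equations 2 and 3)."""
    aleatoric = beta / (alpha - 1.0)
    epistemic = aleatoric / nu
    return epistemic, aleatoric

# Hypothetical NIG parameters; variances are in nM^2.
candidate_a = {"nu": 15.0, "alpha": 2.0, "beta": 1500.0}       # strong evidence, noisy assay
candidate_b = {"nu": 1.0 / 15.0, "alpha": 2.0, "beta": 100.0}  # weak evidence, clean assay

for name, params in (("A", candidate_a), ("B", candidate_b)):
    epistemic, aleatoric = nig_split(**params)
    total_sd = math.sqrt(epistemic + aleatoric)
    print(f"{name}: total ±{total_sd:.0f} nM, "
          f"epistemic var {epistemic:.0f}, aleatoric var {aleatoric:.0f}")
```

Both candidates come out at roughly ±40 nM in total, yet A is dominated by aleatoric variance and B by epistemic variance. A scalar ± hides exactly this difference.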

3 Proposed Approach

3.1 Architectural Modification

I propose modifying Pairwell's prediction head to output four NIG parameters instead of one:

Figure 4. Proposed change: final layer widened from 1 to 4 outputs, loss changed from MSE to NIG negative log-likelihood. All upstream components remain frozen [1].

3.2 Modified Active Learning Loop

With decomposed uncertainty, the acquisition function routes on epistemic uncertainty alone:

Figure 5. Modified active learning loop. The middle branch is the key difference: high-aleatoric/low-epistemic candidates are deprioritized, saving lab budget for novel chemistry where each assay yields maximum model improvement.
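One way to express this routing logic in code is sketched below. The three branch labels, the threshold values, and the budget-limited ranking are assumptions made for illustration; the source only specifies that routing uses epistemic uncertainty and that high-aleatoric/low-epistemic candidates are deprioritized.

```python
def route(epistemic, aleatoric, epi_threshold=0.5, alea_threshold=0.5):
    """Assign a candidate to one of three hypothetical branches."""
    if epistemic >= epi_threshold:
        return "assay"         # model ignorance: a lab result is expected to help
    if aleatoric >= alea_threshold:
        return "deprioritize"  # noise-dominated: more assays will not reduce it
    return "accept"            # low on both: take the prediction as is

def select_for_assay(candidates, budget, **thresholds):
    """Pick up to `budget` candidates, highest epistemic uncertainty first.

    candidates: list of (name, epistemic, aleatoric) tuples.
    """
    to_assay = [c for c in candidates if route(c[1], c[2], **thresholds) == "assay"]
    to_assay.sort(key=lambda c: c[1], reverse=True)
    return [name for name, _, _ in to_assay[:budget]]

# Hypothetical candidates: (name, epistemic, aleatoric).
pool = [("mol-1", 0.9, 0.1), ("mol-2", 0.1, 0.9), ("mol-3", 0.05, 0.05), ("mol-4", 0.7, 0.8)]
print([route(e, a) for _, e, a in pool])
print(select_for_assay(pool, budget=1))
```

In practice the thresholds would be set from a validation set or replaced by a top-k rule on epistemic uncertainty alone.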

3.3 Beyond Single EDL: Ensemble of Evidential Models

A single EDL model can assign high evidence to the wrong answer with no external check. Recent theoretical work [13] argues that EDL's epistemic score may behave more like an energy-based OOD detector than a faithful Bayesian uncertainty estimate. Ensemble of Evidential models (EOE) [15] addresses this by training M independent EDL models and combining their outputs, capturing two independent sources of epistemic uncertainty:

Within-model (evidence-based): Each EDL model reports its own epistemic uncertainty via the NIG parameters. Low evidence ν relative to scale β signals that the model collected insufficient evidence for its prediction. The ensemble averages these NIG parameters (γ̄, ν̄, ᾱ, β̄) to produce a pooled evidence-based epistemic estimate.

Across-model (disagreement-based): Variance in predicted means Var[γ₁, …, γ_M] across ensemble members signals regions where models with different initializations reach different conclusions—the classical ensemble disagreement signal.

[Figure 6 panels. Within models (evidence): low evidence → high epistemic uncertainty, β̄ / (ν̄ · (ᾱ − 1)) from averaged NIG parameters. Across models (disagreement): "Do my models agree?" High spread = high epistemic uncertainty, Var[γ₁, γ₂, …, γ_M] from prediction disagreement.]

Figure 6. EOE captures two independent epistemic signals. Within each model: evidence-based uncertainty from the NIG posterior. Across models: classical ensemble disagreement. A single EDL model has only the left signal and can be confidently wrong with no external check.
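The two signals can be computed from the per-model NIG outputs as in the sketch below. It averages the NIG parameters and takes the variance of the predicted means, as described above. How [15] combines the two signals into a single score is not stated here, so the sketch returns them separately; the function name and sample values are hypothetical.

```python
from statistics import fmean, pvariance

def eoe_combine(members):
    """Combine M evidential models for one input.

    members: list of (gamma, nu, alpha, beta) tuples, one per EDL model.
    """
    gammas, nus, alphas, betas = zip(*members)
    gamma_bar, nu_bar = fmean(gammas), fmean(nus)
    alpha_bar, beta_bar = fmean(alphas), fmean(betas)
    return {
        "prediction": gamma_bar,
        # Within-model signal: pooled evidence-based epistemic uncertainty.
        "epistemic_evidence": beta_bar / (nu_bar * (alpha_bar - 1.0)),
        # Across-model signal: disagreement between predicted means.
        "epistemic_disagreement": pvariance(gammas),
        "aleatoric": beta_bar / (alpha_bar - 1.0),
    }

# Hypothetical outputs of M = 3 EDL models for one candidate.
members = [(6.0, 2.0, 3.0, 1.0), (6.4, 4.0, 3.0, 1.0), (5.6, 3.0, 3.0, 1.0)]
print(eoe_combine(members))
```

A candidate with low evidence-based uncertainty but high disagreement is the case a single EDL model cannot flag on its own.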

Khalil et al. [15] benchmarked EOE against single EDL, deep ensembles, and MC dropout on the Papyrus++ bioactivity dataset. EOE₁₀ dominated on 4 of 5 metrics (RMSE, NLL, CRPS, interval score) at 10× inference cost versus 100× for comparable ensemble and MC dropout baselines, with improvements statistically significant at p<0.01 (Friedman + Conover-Holm). If Pairwell already runs a 5-model ensemble, switching to EOE₅ costs the same at inference but adds uncertainty decomposition and better calibration.

3.4 Conformal Prediction for Deployment Guarantees

EDL and EOE decompose uncertainty but cannot guarantee that predicted intervals contain the true value at a stated rate. Conformal Prediction (CP) [16] provides this guarantee: given a calibration set of n held-out assay results, compute the absolute residuals |y_i − f̂(x_i)|, take the ⌈(n+1)(1−ε)⌉-th smallest residual (the finite-sample-corrected (1−ε)-quantile) as threshold q̂, and wrap every new prediction in [f̂(x) − q̂, f̂(x) + q̂]. The resulting intervals satisfy P(Y ∈ C(X)) ≥ 1−ε under exchangeability of calibration and test points, which is a distribution-free, finite-sample guarantee [16].

P( Y_new ∈ C(X_new) ) ≥ 1 − ε (4)
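A minimal split-conformal wrapper looks like the sketch below. It uses absolute residuals as the score and takes the ⌈(n+1)(1−ε)⌉-th smallest calibration score as q̂, which is the finite-sample form given in [16]. The function names and toy calibration data are assumptions for illustration.

```python
import math

def conformal_quantile(scores, epsilon):
    """Return the ceil((n + 1) * (1 - epsilon))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - epsilon))
    if k > n:
        return math.inf  # too few calibration points for this epsilon
    return sorted(scores)[k - 1]

def calibrate(y_cal, pred_cal, epsilon):
    """Compute q_hat from held-out labels and the model's point predictions."""
    scores = [abs(y - p) for y, p in zip(y_cal, pred_cal)]
    return conformal_quantile(scores, epsilon)

def interval(pred, q_hat):
    """Wrap a new point prediction in a symmetric conformal interval."""
    return pred - q_hat, pred + q_hat

# Hypothetical calibration set of n = 10 assay results (pKd units).
pred_cal = [6.0] * 10
y_cal = [6.0 + 0.1 * i for i in range(1, 11)]
q_hat = calibrate(y_cal, pred_cal, epsilon=0.1)
print(interval(6.3, q_hat))
```

With only 10 calibration points, a 90% interval already needs the largest residual, and a 95% interval is unbounded. This shows why the calibration set size limits how small ε can be.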

CP is post-hoc (∼20 lines of code), requires no retraining, and works on any model output. When combined with Conformalized Quantile Regression (CQR), intervals adapt to the model's own uncertainty—narrower where confident, wider where uncertain. This is directly applicable to Bindwell's selectivity screening:

Selectivity guarantee: For bee-safe insecticide screening, if the upper bound of the 90% CP interval on bee nAChR falls below pKd 5.0, Bindwell can claim "this molecule does not bind the bee receptor" with 90% coverage backing the call. Similarly, if the lower bound on pest nAChR exceeds pKd 6.5, the pest-activity call is covered at the same level. The guarantee is marginal: it holds on average over exchangeable candidates, not separately for each molecule [16]. These are regulatory-grade claims no competitor currently makes, because they use either no uncertainty or uncalibrated ensemble uncertainty without CP's mathematical backing.
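The decision rule described in this paragraph can be sketched as follows. The pKd cutoffs 5.0 and 6.5 come from the text; the function name, the return labels, and the example intervals are hypothetical.

```python
def selectivity_call(bee_interval, pest_interval, bee_max=5.0, pest_min=6.5):
    """Classify a candidate from its two conformal intervals (pKd units).

    bee_interval, pest_interval: (lower, upper) bounds from a CP wrapper.
    """
    _, bee_upper = bee_interval
    pest_lower, _ = pest_interval
    bee_safe = bee_upper < bee_max        # whole interval below the bee cutoff
    pest_active = pest_lower > pest_min   # whole interval above the pest cutoff
    if bee_safe and pest_active:
        return "selective"
    if bee_safe or pest_active:
        return "partial"
    return "inconclusive"

# Hypothetical 90% intervals for three candidates.
print(selectivity_call((3.8, 4.7), (6.9, 7.8)))
print(selectivity_call((4.2, 5.4), (6.9, 7.8)))
print(selectivity_call((4.2, 5.4), (6.0, 7.1)))
```

One design point to keep in mind: when both conditions must hold at once, two separate 90% intervals give a joint coverage of at least 80% by the union bound, so a joint 90% claim would need each interval calibrated at 95%.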

3.5 Staged Implementation

The full stack (CP + EOE) need not be deployed as a single migration. Each stage delivers independent value:

Stage	Change	Timeline	Deliverable
1. CP Wrapper	Hold out calibration set; wrap existing model	∼1 week	Guaranteed coverage on deployment decisions
2. EDL Head	Swap output 1→4; NIG loss	∼2–4 weeks	Epistemic routing for active learning
3. EOE₅	Train 5 EDL models; average NIG params	∼4–6 weeks	Dual epistemic signal; best UQ
4. CP + EOE	CQR on top of EOE₅ outputs	∼1 week	Complete system with guarantees

Stage 1 can be applied to the existing model today with zero architectural changes. Each subsequent stage de-risks the next. Stop at any stage if gains plateau.
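For Stage 4, the CQR step can be sketched as below. It assumes the underlying model already supplies a lower and an upper estimate per input; deriving these from EOE₅ outputs (for example, as Gaussian quantiles around the averaged mean) is an assumption, since the source does not specify the construction. The conformity score max(lo − y, y − hi) and the widened interval follow the standard CQR recipe.

```python
import math

def cqr_calibrate(y_cal, lo_cal, hi_cal, epsilon):
    """Compute the CQR correction q_hat from a held-out calibration set."""
    # Score is positive when y falls outside [lo, hi], negative when inside.
    scores = [max(lo - y, y - hi) for y, lo, hi in zip(y_cal, lo_cal, hi_cal)]
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - epsilon))
    if k > n:
        return math.inf  # too few calibration points for this epsilon
    return sorted(scores)[k - 1]

def cqr_interval(lo, hi, q_hat):
    """Widen (or shrink, if q_hat < 0) the model's own interval by q_hat."""
    return lo - q_hat, hi + q_hat

# Hypothetical calibration data: true values and model-provided bounds.
y_cal = [5.0, 6.0, 7.0, 5.5]
lo_cal = [4.5, 6.2, 6.0, 5.0]
hi_cal = [5.5, 7.0, 6.7, 6.0]
q_hat = cqr_calibrate(y_cal, lo_cal, hi_cal, epsilon=0.25)
print(cqr_interval(5.0, 6.0, q_hat))
```

Because the correction is added to the model's own bounds, inputs with narrow model intervals stay narrow after calibration, which is the adaptivity that a fixed-width CP wrapper lacks.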

4 Supporting Evidence

4.1 Molecular Property Prediction [2]

Soleimany, Amini et al. (ACS Central Science, 2021) applied EDL to molecular property prediction on QM9 with message-passing models. This is a related setting (molecules and noisy labels), but it is not the same as protein-ligand binding affinity with transformer architectures. The numbers below are included as motivation only and must be revalidated on Pairwell's data and metrics [2]:

Metric	Deep Ensemble	MC Dropout	EDL (Proposed)	Improvement
Training data to reach RMSE = 7.0 (QM9 active learning)	N/A	55% of dataset	21% of dataset	~62% less data vs dropout
Virtual screening hit rate	78%	N/A	>95%	78% → 95%+
Rank correlation (ρ) for candidate prioritization	0.040	N/A	0.163	4× better ranking
Inference cost per prediction	5× (5 models)	10-20× (stochastic passes)	1×	5-20× reduction

4.2 Drug-Target Interaction [3]

Zhao et al. (Nature Communications, 2025) benchmarked EDL against 11 baselines for drug-target interaction prediction:

Benchmark	EDL vs. Best Baseline	Detail
Davis dataset	+0.8% accuracy, +2% F1	Over best of 11 baselines (classical ML and neural DTI models)
KIBA dataset	+0.6% accuracy	Uncertainty is used to improve high-confidence decision-making
DrugBank	82% precision	Interaction prediction, single forward pass

A benchmark study of UQ methods for molecular property prediction [4] reports that performance varies by data set and evaluation metric, and no single UQ method is uniformly best. This supports treating deep ensembles as a strong baseline while still valuing cheaper alternatives like EDL when compute or deployment constraints matter.

4.3 Bioactivity Modeling with EOE [15]

Khalil et al. (J. Chem. Inf. Model., 2025) benchmarked six UQ architectures on the Papyrus++ bioactivity dataset (ChEMBL-derived, standardized, filtered for ≥50 readings per target). The study compared performance under both stratified splits (interpolation) and scaffold cluster splits (extrapolation). On the stratified xC₅₀ endpoint:

Model	Type	RMSE ↓	NLL ↓	CRPS ↓	Interval ↓	Cost
MC Dropout₁₀₀	Bayesian	0.763	1.140	0.418	2.111	100×
Deep Ensemble₁₀₀	Bayesian	0.728	1.441	0.406	2.234	100×
Evidential₁	Evidential	0.682	1.005	0.378	2.247	1×
EMC₁₀	Hybrid	0.670	0.978	0.367	2.098	10×
EOE₁₀	Hybrid	0.646	0.943	0.352	2.028	10×

EOE₁₀ achieved the best score on 4 of 5 metrics at 10× inference cost versus 100× for the Bayesian baselines. Improvements over Deep Ensemble₁₀₀ were statistically significant at p<0.01 via Friedman + Conover-Holm post-hoc tests [15]. The result is directly relevant to Bindwell: the same bioactivity prediction task, similar data scale, and the key finding that combining evidential and ensemble approaches outperforms either alone.
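For readers unfamiliar with the probabilistic metrics in the table, the sketch below gives common textbook definitions for a Gaussian predictive distribution and a central prediction interval. The exact definitions and averaging used in [15] may differ, so treat these as illustrative only; the parameter name `miscoverage` is chosen here to avoid confusion with the NIG α.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2.0 * sigma ** 2)

def gaussian_crps(y, mu, sigma):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2)."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

def interval_score(y, lower, upper, miscoverage=0.1):
    """Interval width plus a penalty of 2/miscoverage per unit of miss."""
    miss = max(lower - y, 0.0) + max(y - upper, 0.0)
    return (upper - lower) + (2.0 / miscoverage) * miss

# Hypothetical prediction: mean 6.0, standard deviation 0.5, true value 6.4.
print(gaussian_nll(6.4, 6.0, 0.5))
print(gaussian_crps(6.4, 6.0, 0.5))
print(interval_score(6.4, 5.2, 6.8))
```

All three are lower-is-better, matching the ↓ arrows in the table, and all three penalize both poor accuracy and poorly sized uncertainty.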

5 Risks and Limitations

Evidence collapse. EDL can assign high evidence to OOD inputs if the regularization coefficient λ is miscalibrated. Mitigated by evidence regularization [1], but λ requires per-dataset tuning—too low allows evidence collapse on OOD inputs; too high suppresses the epistemic signal. Requires monitoring on a held-out OOD set. EOE mitigation: ensemble disagreement provides a second, independent check that catches overconfident single-model predictions.

Limited adoption. EDL remains research-stage [2, 3]. It has not reached ensemble-level adoption in production pipelines. This is an early-mover opportunity with integration risk. Staged rollout mitigation: the CP wrapper (Stage 1) delivers value immediately with zero model changes, de-risking subsequent stages.

Calibration on proprietary data. Published results use public benchmarks (QM7/QM9 with D-MPNN architectures; Davis/KIBA with DTI-specific models). Performance on Bindwell's pesticide binding data with transformer architectures may differ and requires empirical validation. CP mitigation: conformal prediction provides distribution-free coverage guarantees regardless of model calibration quality.

Interpretability of epistemic scores. Recent theoretical work argues that many evidential objectives can yield epistemic-style scores that do not vanish even with very large data, and that these methods can behave more like energy-based OOD detectors than faithful estimators of Bayesian model uncertainty [13, 14]. Mitigation: treat the epistemic score as a routing heuristic, validate under controlled distribution shifts, and compare against heteroscedastic dropout and ensemble baselines before using it to allocate lab budget. EOE mitigation: even if single-model epistemic scores are imperfect, ensemble disagreement provides a complementary signal grounded in classical Bayesian principles [15].

No coverage guarantee from EDL alone. Neither single EDL nor EOE can claim that a stated fraction of predictions contain the true value. For deployment decisions (e.g., "this molecule is bee-safe"), stakeholders need guaranteed intervals, not best-effort estimates. CP mitigation: conformal prediction (Section 3.4) provides exactly this guarantee—P(Y ∈ C(X)) ≥ 90%—enabling regulatory-grade selectivity claims [16].
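Whatever method produces the intervals, the coverage claim can be checked empirically on held-out data. The helper below is a minimal sketch with made-up values; the function name is hypothetical.

```python
def empirical_coverage(y_true, intervals):
    """Fraction of held-out labels that fall inside their predicted interval."""
    hits = sum(lo <= y <= hi for y, (lo, hi) in zip(y_true, intervals))
    return hits / len(y_true)

# Hypothetical held-out labels and predicted intervals (pKd units).
y_true = [5.1, 6.3, 7.0, 4.8, 6.6]
intervals = [(4.5, 5.5), (6.0, 7.0), (6.0, 6.8), (4.0, 5.0), (6.1, 7.2)]
print(empirical_coverage(y_true, intervals))
```

Running this on a temporally or scaffold-shifted test set is a simple way to see whether the exchangeability assumption behind the CP guarantee holds well enough in practice.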

6 Conclusion

Pairwell's active learning loop benefits from separating model uncertainty and observation noise when deciding what to assay next. Sampling-based approaches can provide this split with heteroscedastic heads, but they add inference cost because they require multiple passes or multiple models [12]. An evidential NIG head provides an explicit split from a single pass, which can reduce per-candidate compute while keeping the same acquisition logic.

However, single EDL has known limitations—it can be confidently wrong, and its epistemic score may not behave as a faithful Bayesian uncertainty estimate [13, 14]. Ensembling multiple EDL models (EOE) addresses both weaknesses by adding a second, independent epistemic signal via model disagreement, and published benchmarks show EOE dominates both single EDL and standard ensembles on bioactivity prediction tasks at a fraction of the compute [15].

For deployment decisions—particularly selectivity screening where Bindwell must guarantee that a candidate does not bind non-target receptors—Conformal Prediction provides mathematically guaranteed coverage intervals that neither EDL nor ensembles can offer alone [16]. The combined stack (CP + EOE) separates two concerns: EOE drives active learning by routing lab budget on epistemic uncertainty, while CP provides guaranteed intervals for go/no-go deployment decisions.

The staged implementation (Section 3.5) allows Bindwell to capture value incrementally: coverage guarantees from CP in about week 1, epistemic routing from EDL by roughly weeks 3–5, dual-signal UQ from EOE by roughly weeks 7–11, and the full stack about a week after that, assuming the stages run back to back at the Section 3.5 estimates. Each stage is independently valuable and de-risks the next. The practical value of each stage should be validated on Pairwell with side-by-side baselines and evaluation under distribution shift [2, 13, 14, 15].

References

[1] Amini, A., Schwarting, W., Soleimany, A., & Rus, D. (2020). Deep Evidential Regression. Advances in Neural Information Processing Systems (NeurIPS), 33. arXiv:1910.02600
[2] Soleimany, A.P., Amini, A., Goldman, S., Rus, D., Bhatia, S.N., & Coley, C.W. (2021). Evidential Deep Learning for Guided Molecular Property Prediction and Discovery. ACS Central Science, 7(8), 1356-1367. PMC8393200
[3] Zhao, Y., et al. (2025). Evidential Deep Learning for Drug-Target Interaction Prediction. Nature Communications, 16. DOI: 10.1038/s41467-025-62235-6
[4] Hirschfeld, L., Swanson, K., Yang, K., Barzilay, R., & Coley, C.W. (2020). Uncertainty Quantification Using Neural Networks for Molecular Property Prediction. arXiv:2005.10036
[5] Rose, T., Monti, F., Anand, N., & Shen, A. (2024). PLAPT: Protein-Ligand Binding Affinity Prediction Using Pretrained Transformers. bioRxiv. DOI: 10.1101/2024.02.08.575577v3
[6] Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. Advances in Neural Information Processing Systems (NeurIPS), 30. arXiv:1612.01474
[7] Bindwell. PLAPT: Protein-Ligand Binding Affinity Prediction Tool. github.com/Bindwell/PLAPT
[8] Bindwell. APPT: All-Purpose Protein-Protein Binding Prediction Tool. github.com/Bindwell/APPT
[9] Anand, N. & Rose, T. (2025). Defeating Pests with AI. Bindwell Blog. bindwell.com/posts/defeating-pests-with-ai
[10] Wohlwend, J., Corso, G., Passaro, S., et al. (2024). Boltz-1: Democratizing Biomolecular Interaction Modeling. MIT Jameel Clinic. github.com/jwohlwend/boltz
[11] Gal, Y. & Ghahramani, Z. (2016). Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. Proceedings of the 33rd International Conference on Machine Learning (ICML). arXiv:1506.02142
[12] Kendall, A. & Gal, Y. (2017). What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? NeurIPS. arXiv:1703.04977
[13] Shen, M., Ryu, J. J., Ghosh, S., Bu, Y., Sattigeri, P., Das, S., & Wornell, G. W. (2024). Are Uncertainty Quantification Capabilities of Evidential Deep Learning a Mirage? NeurIPS. arXiv:2402.06160
[14] Bengs, V., Hüllermeier, E., & Waegeman, W. (2023). On Second-Order Scoring Rules for Epistemic Uncertainty Quantification. ICML (PMLR). PMLR v202
[15] Khalil, K., Schweighofer, A., Dyubankova, N., van Westen, G. J. P., & van Vlijmen, H. W. T. (2025). Combining Bayesian and Evidential Uncertainty Quantification for Bioactivity Modeling. J. Chem. Inf. Model., 65(24), 13057–13069. DOI: 10.1021/acs.jcim.5c01597
[16] Angelopoulos, A. N. & Bates, S. (2022). A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification. arXiv:2107.07511