Probabilistic Inference is Not Synergistic

Bayesian inference tools quantify belief in uncertain events. The basic idea is to combine prior information (an initial probability estimate) with new evidence to produce a posterior, which then serves as the prior for the next round of conditioning. A governing concept called meta-probability (the probability that these estimates are themselves good) helps us calibrate our confidence in the priors we use.
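The update cycle can be sketched with a conjugate prior. This is a minimal illustration, not from the original text: estimating a coin's bias with a Beta prior, where each posterior becomes the next round's prior. The data batches are made up for the example.

```python
# Minimal sketch of iterative Bayesian updating for a coin's bias,
# using a Beta prior (conjugate to the Bernoulli likelihood).
# The flip counts below are illustrative assumptions.

def update(alpha, beta, heads, tails):
    """Condition a Beta(alpha, beta) prior on observed flips;
    the posterior is again a Beta distribution."""
    return alpha + heads, beta + tails

# Start with a uniform prior Beta(1, 1) over the bias.
alpha, beta = 1.0, 1.0

# Each batch of new data turns the current posterior into the next prior.
for heads, tails in [(7, 3), (6, 4), (8, 2)]:
    alpha, beta = update(alpha, beta, heads, tails)

posterior_mean = alpha / (alpha + beta)  # point estimate of the bias
print(round(posterior_mean, 3))  # -> 0.688
```

The conjugacy is a convenience for the sketch; the cyclic prior-to-posterior structure is the same under any likelihood.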

Nassim Nicholas Taleb argues that people are overconfident in their probabilistic estimates. In Antifragile, part of the Incerto series[1], Taleb calls entities that improve after bad events antifragile. Risk-taking, innovation, systems governed by the Lindy effect, information security, and project management can be antifragile, and so are human bodies: all benefit from stressors. What about probabilistic inference methods? Can we combine the predictive power of several estimators into one that is at least as good as the best of them? The answer is no.

Probabilistic inference is a fragile system. By fragility we mean that probabilistic estimators cannot, in general, be combined into an estimator that matches the performance of the best candidate among them. This claim excludes the trivial case of mapping the best candidate one-to-one onto the constructed estimator.

To make the argument concrete, we work in a Bayesian setting, chosen for its simplicity and universality. We ask whether a Bayesian estimator B can match the best candidate from a set of estimators E, where E is a subset of the universal set U of all measures and B's prior is spread over E. We construct B by assigning exponential weight penalties for performance deviations, known as regret-based construction. The goal is to see whether B can achieve zero regret.
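One common regret-based construction is the exponentially weighted mixture: B predicts with a weighted average of the candidates in E, and each candidate's weight is multiplied by an exponential penalty in its loss after every observation. The sketch below is an illustration under assumed per-round losses, not the article's formal construction; note that B's cumulative loss stays strictly above the best candidate's, i.e., the regret is nonzero.

```python
import math

# Sketch of a regret-based combination: B predicts with a weighted
# mixture of candidates in E, and each candidate's weight is scaled by
# exp(-loss) after every round. The loss sequence is an assumption.

def exp_weights_mixture(candidate_losses):
    """candidate_losses[t][i] = loss of candidate i at round t.
    Returns the final normalized weights and B's cumulative loss."""
    n = len(candidate_losses[0])
    weights = [1.0 / n] * n                # uniform prior over E
    total_loss = 0.0
    for losses in candidate_losses:
        # B's loss this round is the weighted average of candidate losses.
        total_loss += sum(w * l for w, l in zip(weights, losses))
        # Exponential penalty for performance deviations.
        weights = [w * math.exp(-l) for w, l in zip(weights, losses)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return weights, total_loss

# Candidate 0 is consistently better, yet B still pays a premium.
losses = [[0.1, 0.5]] * 10
weights, b_loss = exp_weights_mixture(losses)
best_loss = 10 * 0.1
print(b_loss > best_loss)  # -> True: nonzero regret against the best candidate
```

The weights concentrate on the better candidate over time, but any weight left on the worse one keeps B's loss above the best candidate's.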

Without loss of generality, take the performance penalty to be a loss function, such as the KL divergence between B and the true distribution in E, defined on a measure-theoretic space. Since E is countably infinite, B must assign exponentially decreasing weights to its elements. But each element of B already carries a measure space that assigns exponentially decreasing weight to the event space, so the likelihood of the events of interest (those governed by E) leaves little room for any additional penalty. To combine the elements of E, we would have to inflate the likelihood rather than penalize it. This is a classic case of overfitting B to E: the construction would work on E but break on U − E, the rest of the universal set of measures.
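A toy computation makes the KL penalty concrete. The candidate set and probabilities below are illustrative assumptions: three Bernoulli candidates, with B taken as their uniform mixture. The best candidate pays zero penalty against the true distribution, while the mixture's penalty is strictly positive.

```python
import math

# Illustrative KL-divergence penalty for a mixture versus the best candidate.

def kl_bernoulli(p, q):
    """KL divergence D(p || q) between two Bernoulli distributions."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Candidate estimators in E, as Bernoulli success probabilities (assumed).
E = [0.2, 0.5, 0.8]
truth = 0.8                      # suppose the last candidate is the true one

# B as a uniform mixture of the candidates' predictive probabilities.
b = sum(E) / len(E)

best_penalty = kl_bernoulli(truth, truth)   # 0.0: best candidate, no penalty
mixture_penalty = kl_bernoulli(truth, b)    # strictly positive
print(best_penalty, mixture_penalty > 0)
```

Unless the mixture collapses onto the true candidate (the trivial one-to-one case excluded above), the KL penalty cannot reach zero.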

This result carries over to other estimation principles such as AIC, MDL, and BIC. Combining probabilistic models complicates the analysis without improving predictive power.
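The same trade-off is visible in the criteria themselves: AIC and BIC weigh likelihood against a complexity penalty, so a richer model must buy enough extra likelihood to cover its larger penalty. The log-likelihoods below are made-up numbers for illustration.

```python
import math

# Standard formulas; the inputs are illustrative, not fitted models.

def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood

# A 3-parameter model gains a little likelihood over a 2-parameter one,
# but not enough to offset its penalty (lower criterion value is better).
print(aic(-100.0, 2), aic(-99.5, 3))          # -> 204.0 205.0
print(bic(-100.0, 2, 50) < bic(-99.5, 3, 50)) # -> True
```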