Unexploitability by adversaries = robustness to randomness

One common argument in favor of Bayesian probability theory is that any deviation from it makes you vulnerable to Dutch books: sets of bets in which you lose money with certainty. But in the real world, not all environments are adversarial, so why should we use Bayesian probability theory when no one is trying to exploit us?

You could argue that you should use the same principles to reason in every environment you encounter, but why should that be? In fact, it's pretty natural to behave more cautiously when you're facing an adversary than when you're facing mere unoptimized randomness. Perhaps Bayesianism is like wearing heavy armor: good defense if you need it, but hugely inconvenient if you don't.

However, it turns out that there's a convenient theory that says that Bayes is the best you can do in a merely random environment with no adversaries. PAC-Bayes theory addresses the following setup: you have a prior over hypotheses, you observe data sampled randomly from some unknown distribution, and you want to pick a posterior that will predict well on future data from that same distribution. The theory shows that in the particular case where prediction error is measured by negative log-likelihood, the tightest bound on your expected prediction error is achieved when your posterior is the prior reweighted by the likelihood, $p_\text{post}(\text{hyp}) \propto p_\text{prior}(\text{hyp})\cdot l(\text{data}\mid\text{hyp})$. This is exactly Bayesian inference.
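The "reweight the prior by the likelihood, then renormalize" update can be sketched in a few lines. The coin-bias setting and the specific numbers here are illustrative assumptions, not something taken from PAC-Bayes theory itself:

```python
# A minimal sketch of Bayesian updating: posterior ∝ prior × likelihood.
# Hypotheses: three possible biases of a coin (probability of heads).
hypotheses = [0.3, 0.5, 0.7]
prior = {h: 1 / 3 for h in hypotheses}  # uniform prior

# Observed data: 8 heads out of 10 flips.
heads, flips = 8, 10

def likelihood(h):
    """Binomial likelihood of the data under bias h (constant factor omitted)."""
    return h ** heads * (1 - h) ** (flips - heads)

# Reweight the prior by the likelihood, then renormalize.
unnorm = {h: prior[h] * likelihood(h) for h in hypotheses}
z = sum(unnorm.values())
posterior = {h: w / z for h, w in unnorm.items()}

for h in hypotheses:
    print(f"bias {h}: prior {prior[h]:.3f} -> posterior {posterior[h]:.3f}")
```

After seeing mostly heads, the posterior shifts toward the heads-biased hypothesis, as expected.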

This is good: it means we can use Bayes in both adversarial and random environments. But the two proofs are entirely different. Why would the same structure that protects you from Dutch books also happen to be the best way to learn a distribution when data is sampled randomly?

The fact that exactly the same structure appears in two unrelated places for different reasons should make you suspicious, like the fact that mass is both the amount of inertia and the amount of gravity generated by an object. Could there be a deeper principle at play?

The duality between optimization and randomness

Being robust to uncertain and random environments requires taking into account that reality might surprise you. Suppose you want to predict how long it will take you to get to the airport, since you don't want to miss your flight. You can't just assume there won't be traffic on the way. On the other hand, if you know that, historically, even heavy traffic only delays you by 30 minutes, it makes no sense to predict a delay of three hours.

Mathematically, the assumption that reality will be maximally surprising within its possibilities corresponds to the assumption that real world outcomes will follow a maximum entropy distribution, subject to constraints. And as is well known, Bayesian inference can be understood as choosing the posterior that maximizes entropy subject to constraints.
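The idea of "maximum entropy subject to constraints" can be made concrete with the airport example: among all distributions over a few possible delays with a fixed expected delay, find the one with the highest entropy. The delay values and the target mean below are made-up numbers for illustration; the solution is found by the standard exponential-tilting form of the maxent distribution, with the Lagrange multiplier located by bisection:

```python
import math

delays = [0, 10, 20, 30]   # possible delays in minutes (illustrative)
target_mean = 12.0          # constraint: expected delay (illustrative)

def tilted(lam):
    """Exponential-family distribution p(x) ∝ exp(lam * x) over the delays."""
    weights = [math.exp(lam * x) for x in delays]
    z = sum(weights)
    return [w / z for w in weights]

def mean(p):
    return sum(x * px for x, px in zip(delays, p))

# The maxent solution under a mean constraint is an exponential tilt of
# the uniform distribution; bisect on the multiplier until the mean matches.
lo, hi = -1.0, 1.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean(tilted(mid)) < target_mean:
        lo = mid
    else:
        hi = mid
p = tilted((lo + hi) / 2)

entropy = -sum(px * math.log(px) for px in p)
print("maxent distribution:", [round(px, 3) for px in p])
print("mean:", round(mean(p), 3), "entropy:", round(entropy, 3))
```

Any other distribution with the same mean, such as one that lumps most mass on the extreme delays, has strictly lower entropy: it builds in more structure than the constraint actually justifies.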

Meanwhile, when you face an adversary, you have to consider the possibility that it might artificially increase traffic, for example by driving its own car to the airport and blocking a lane. You have to assume the adversary will do everything in its power to thwart you, and plan for that. The adversary will maximize your losses, and you will plan so that you still suffer the minimum possible loss.

Mathematically, planning for the worst someone can do within their power corresponds to picking the prediction that minimaxes your loss, subject to constraints.

And it turns out that if your success or failure is measured by the quality of your predictions, and in particular your loss is the negative log-likelihood of your predictions… picking the distribution that maximizes entropy subject to the evidence is the same as picking the distribution that minimizes the worst-case expected negative log-likelihood over all distributions consistent with that evidence. Both problems are solved by the same distribution. For the sake of brevity I'm not going to prove this, but you can check out this paper for more information.
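The simplest case of this equivalence can be checked numerically. With no constraints beyond the support, the maximum-entropy distribution is uniform, and it is also the prediction whose worst-case log loss (the adversary picks whichever outcome you assigned the least probability) is smallest. The candidate distributions below are arbitrary examples:

```python
import math

def worst_case_log_loss(q):
    """An adversary concentrates on the outcome you found least probable."""
    return max(-math.log(qx) for qx in q)

uniform = [0.25, 0.25, 0.25, 0.25]   # the maxent distribution on 4 outcomes
candidates = {
    "uniform": uniform,
    "skewed":  [0.7, 0.1, 0.1, 0.1],
    "mild":    [0.4, 0.3, 0.2, 0.1],
}

for name, q in candidates.items():
    print(f"{name:8s} worst-case log loss = {worst_case_log_loss(q):.3f}")
```

Any deviation from uniform lowers the probability of some outcome, and the adversary pounces on exactly that outcome, so the uniform (maximum-entropy) prediction is the minimax choice.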

Create your own luck

This is nice math and all, but what’s the intuition behind it?

What an agent does, essentially, is make the outcomes it prefers more likely to happen. So in a sense an agent, by acting, is tilting probability in its favor; it's "increasing its luck". Similarly, dealing with a competent adversary can just feel like having very bad luck: all of your independent fail-safe mechanisms and contingency plans fail at the same time.

In the particular case in which you only care about predictions and not actions, the negative log-likelihood of the true outcome under your prediction is how surprised you were by the outcome, or equivalently how unlucky you think you were, if predicting correctly is your goal. In a random environment, if you keep feeling very unlucky, that indicates that your model might be incorrect. And if you're betting with someone and you keep losing all your bets due to "bad luck"… maybe you should stop betting.

So making predictions with the expectation that reality will be maximally surprising, and making predictions with the expectation that someone is rigging the game against you, turn out to have exactly the same solution: predict with the distribution that is maximally entropic and consistent with your observations. This is the distribution that is most robust to bad luck, whether artificial or natural.

Note that this just means that you should use Bayesian probability theory to reason about both adversarial and random environments.1 It does not mean that you should plan exactly the same way in both situations! In reality the adversary can not only artificially increase traffic; it can steal your car and blow up the road! When you face an adversary, you should expect particularly bad luck, and bad luck in ways that are not typical. The more capable the adversary, the worse your luck will be!


  1. At least if you have enough compute to do it. ↩︎