Is there a purely epistemic basis for probability?
Introduction: invariance
Entropy is not invariant to how you parametrize your probability distributions. The uniform distribution over $[0,1]$ has a differential entropy of $\int_{[0,1]} p(x)\log (1/p(x))\,dx = \int_{[0,1]} \log (1/1)\, dx = 0$. But suppose now we measure things in units 10x smaller. The uniform distribution over $[0,10]$ has an entropy of $-\int_{[0,10]} \tfrac{1}{10} \log (1/10)\, dx = \log 10$. So just by relabeling each number $x \mapsto 10x$ we have changed the entropy of the distribution.
The KL divergence, meanwhile, is invariant to such transformations. This is because KL uses the logarithm of a ratio of densities, which doesn’t change when you change units, while entropy uses the logarithm of an absolute density. KL divergence is essentially relative entropy: entropy measured against a reference distribution, and this relative quantity is always well-defined.
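To make this concrete, here is a small numerical sketch (plain NumPy; the two triangular densities are arbitrary choices for illustration): the differential entropy of a uniform distribution changes under rescaling, while the KL divergence between two densities does not.

```python
import numpy as np

# Differential entropy of Uniform[0, w] is log(w): it changes when we
# rescale the variable (x -> 10x turns Uniform[0,1] into Uniform[0,10]).
def uniform_entropy(width):
    p = 1.0 / width  # density of Uniform[0, width]
    return p * np.log(1.0 / p) * width  # ∫ p log(1/p) dx = log(width)

print(uniform_entropy(1.0))   # 0.0
print(uniform_entropy(10.0))  # log(10) ≈ 2.303

# KL divergence between two densities on the same grid.
def kl(p, q, dx):
    return np.sum(p * np.log(p / q)) * dx

# Two triangular densities on [0, 1] (endpoints nudged to avoid log(0)).
x = np.linspace(1e-6, 1.0 - 1e-6, 10_000)
dx = x[1] - x[0]
p = 2.0 * x;         p /= np.sum(p) * dx
q = 2.0 * (1.0 - x); q /= np.sum(q) * dx

kl_original = kl(p, q, dx)

# Rescale units: y = 10x. Each density picks up a Jacobian factor 1/10,
# but the ratio p/q is unchanged, so the KL integral is unchanged.
dy = 10.0 * dx
p_y, q_y = p / 10.0, q / 10.0
kl_rescaled = kl(p_y, q_y, dy)

print(kl_original, kl_rescaled)  # equal up to floating-point error
```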
Bayes says: choose the posterior distribution that minimizes the sum of expected negative log-likelihood and KL deviation from the prior. This is the content of the Bayesian free energy equality. When entropy is well-defined, and the prior is the max-entropy distribution, then Bayes is the same as the maximum entropy principle. But when we can only define entropy in relative terms via KL, Bayes still works.
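In symbols, the decomposition referred to here can be sketched as follows (standard variational notation, with candidate posterior $q$, prior $\pi$, and data $D$):

```latex
% Free energy of a candidate posterior q(\theta), given prior \pi(\theta) and data D:
F(q) = \mathbb{E}_{q(\theta)}\!\left[ -\log p(D \mid \theta) \right]
       + \mathrm{KL}\!\left( q(\theta) \,\|\, \pi(\theta) \right)
% Using Bayes' rule p(\theta \mid D) = p(D \mid \theta)\,\pi(\theta)/p(D),
% this rearranges to
F(q) = \mathrm{KL}\!\left( q(\theta) \,\|\, p(\theta \mid D) \right) - \log p(D),
% so F is minimized exactly when q is the Bayesian posterior p(\theta \mid D),
% with minimum value -\log p(D), the negative log evidence.
```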
However, what about the other term in the Bayesian free energy, the posterior expected negative log-likelihood of the data? This uses the logarithm of an absolute quantity. So it’s not invariant in the same sense; it is only defined relative to the uniform distribution over variables and datapoints. Does this cause any problem?
Well, suppose you have an infinite number of datapoints. Then the posterior expected negative log-likelihood is not well-defined. The closest substitute is to compute a KL divergence to a reference distribution. But choosing such a reference distribution essentially corresponds to choosing how much each datapoint weighs. So it’s like choosing a utility function over datapoints: you care more about predicting some than others.
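As a toy illustration (all data and weights here are made up), weighting datapoints non-uniformly, which is what a non-uniform reference distribution amounts to, changes which parameter value minimizes the expected negative log-likelihood:

```python
import numpy as np

# Toy data: coin flips (1 = heads). Five heads, five tails.
data = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

def weighted_nll(p, weights):
    """Weighted negative log-likelihood of Bernoulli(p) on the data."""
    ll = data * np.log(p) + (1 - data) * np.log(1 - p)
    return -np.sum(weights * ll)

ps = np.linspace(0.01, 0.99, 99)  # candidate parameter grid

uniform_w = np.ones_like(data) / len(data)  # canonical uniform weighting
skewed_w = np.where(data == 1, 3.0, 1.0)    # "care 3x more about heads"
skewed_w = skewed_w / skewed_w.sum()

best_uniform = ps[np.argmin([weighted_nll(p, uniform_w) for p in ps])]
best_skewed = ps[np.argmin([weighted_nll(p, skewed_w) for p in ps])]

print(best_uniform)  # 0.5: the empirical frequency of heads
print(best_skewed)   # 0.75: the weighted frequency of heads
```

The optimum of a weighted Bernoulli likelihood is just the weighted mean of the data, so shifting the weighting shifts the "best" model — the sense in which a reference distribution over datapoints acts like a utility function.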
Of course, when the number of datapoints is finite, we can always use the canonical uniform distribution and stop worrying. But maybe we should be careful with this non-invariance thing, perhaps it indicates that something is wrong with our framework. Can we find other situations in which we are forced to choose a particular utility function over outcomes?
Anthropics
In anthropic problems, you have to reason about how many copies of you there are in the universe and which one of them you are likely to be. The canonical problem is the Sleeping Beauty problem, but for our purposes it’s better to use a different one.
Suppose N copies of you are made while you sleep, and 1 copy is transported to a blue room, while N-1 are transported to identical red rooms. Before you are put to sleep, which probability should you assign to waking up in the blue room?
Here there are two canonical responses that are considered coherent. One is 1/N, and one is 1/2. Roughly, the question is whether you consider the number of copies or the number of “equivalent situations” as the important variable.
And it turns out these two positions are coherent with, respectively, a utility function that maximizes the sum of utilities over all copies, and a utility function in which each copy maximizes only its own utility. So it seems like this is another case in which how we value outcomes influences which probabilities we should assign.
There is, however, a lingering feeling that probability should be epistemic, not decision-theoretic. If you subject each copy to the same experiment, and repeat this iteratively, both theories agree on what distribution of blue/red rooms each copy will remember seeing and expect to see. So with many iterations we can assign some objective frequentist probabilities; but one of the alternatives says those are not utility-relevant, so why use them?
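A minimal simulation of this claim, under the modelling assumption that we follow a copy sampled uniformly at random from all final copies (the values of N, ROUNDS, and TRIALS are arbitrary):

```python
import random

# Iterated copying experiment: each round, every copy is split into N
# copies, one waking in a blue room and N-1 in red rooms. Following one
# copy chosen uniformly among all final copies is equivalent to sampling
# "blue" independently with probability 1/N each round.
def sample_memory(n_copies, rounds, rng):
    return ["blue" if rng.randrange(n_copies) == 0 else "red"
            for _ in range(rounds)]

rng = random.Random(0)
N, ROUNDS, TRIALS = 4, 20, 5_000

blue_fraction = sum(
    sample_memory(N, ROUNDS, rng).count("blue")
    for _ in range(TRIALS)
) / (ROUNDS * TRIALS)

print(blue_fraction)  # ≈ 1/N = 0.25
```

The remembered frequency of blue rooms concentrates around 1/N regardless of which anthropic position one holds, which is exactly why the disagreement is about utility-relevance rather than about frequencies.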
Knightian uncertainty
Suppose you are playing some lotteries, and you want to choose which lotteries you participate in. All of them have the same possible outcomes, but they differ in their probabilities. There’s one catch though: you don’t know the exact probability distribution each lottery has. Instead, you only see a coarse-grained version.
In particular, if each lottery has a joint distribution over two variables X and Y, you only see the marginal probabilities P(X). How should you decide which lotteries to pick? Your utility function is defined over pairs of outcomes, U(X,Y). But any particular set of choices, by the VNM theorem, implies that you have a utility function over X alone, V(X). In order for this marginal utility function to be consistent with your native utility function, it must be the case that $V(X) = \mathbb{E}_{P(Y|X)}[U(X,Y)]$ for some conditional distribution $P(Y|X)$.
Not only that: by the Donsker–Varadhan variational formula, under a set of assumptions there is an optimal $P(Y|X)$, and it depends on your utility function $U$. So two agents in the same epistemic state but with different utility functions will assign different probabilities to the same outcomes.
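A sketch of this, for a discrete toy $Y$ with four outcomes (the reference distribution and the two utility functions are made-up examples): the Donsker–Varadhan formula $\sup_Q \{\mathbb{E}_Q[f] - \mathrm{KL}(Q\|P_0)\} = \log \mathbb{E}_{P_0}[e^f]$ is attained by the Gibbs tilt $Q^*(y) \propto P_0(y)e^{f(y)}$, so taking $f = U(x,\cdot)$ makes the optimal conditional depend on $U$:

```python
import numpy as np

# Donsker-Varadhan: sup_Q E_Q[f] - KL(Q || P0) = log E_P0[exp f],
# attained at the Gibbs tilt Q*(y) ∝ P0(y) exp(f(y)).
def gibbs_tilt(p0, f):
    w = p0 * np.exp(f)
    return w / w.sum()

p0 = np.array([0.25, 0.25, 0.25, 0.25])  # reference distribution over Y
u_a = np.array([1.0, 0.0, 0.0, 0.0])     # agent A's utilities U_A(x, ·)
u_b = np.array([0.0, 0.0, 0.0, 1.0])     # agent B's utilities U_B(x, ·)

q_a = gibbs_tilt(p0, u_a)
q_b = gibbs_tilt(p0, u_b)
print(q_a)  # extra mass on the outcome A values
print(q_b)  # extra mass on the outcome B values

# Check the DV equality: the Gibbs tilt attains log E_P0[exp f].
def dv_objective(q, p0, f):
    return np.sum(q * f) - np.sum(q * np.log(q / p0))

assert np.isclose(dv_objective(q_a, p0, u_a),
                  np.log(np.sum(p0 * np.exp(u_a))))
```

Same epistemic state (same $P_0$), different utilities, different implied conditionals — which is the claim in the text.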
What does this mean?
I don’t have a good answer, but I think what’s going on is the following.
Everyone understands and agrees on frequentist probability distributions. However, in many cases we have to make decisions in which some uncertain event only happens once. Subjective expected utility theory tells us that we should make those decisions by maximizing expected utility relative to some probability distribution. And it turns out that the correct probability distribution, in most cases, is the same for all possible utility functions. In those cases it might appear that the assignment of probabilities is purely epistemic, that it is about anticipating future experiences independently of bets or values.
However, at the fundamental mathematical level this is not true: probabilities by themselves are not invariant; only the pairing of probabilities and utilities is. And there are a few examples where this distinction matters: when there are infinitely many variables or datapoints, when there are multiple copies of you, and when you can’t represent all your uncertainty probabilistically.
The first of these is a technically valid but practically unrealizable situation, and not very relevant. Anthropics is of debatable importance today, given the lack of copies we can interact with; it’s also weird and nobody understands it very well, so who cares if there is no purely epistemic assignment of probabilities independent of the utility function? But every embedded agent necessarily faces Knightian uncertainty, since it can’t reason using a probability distribution over the whole universe, itself included. And all of us are embedded agents.
So I think we might have to abandon the notion of subjective probability as a purely epistemic anticipation of future experiences, and recognize that in some cases utilities and probabilities don’t factor independently. In the case of anthropics, what this would mean is that “what bets should you make” and “what memories do you expect to have” are meaningful questions with numerical answers, but “what subjective experience should you anticipate” is not. The only coherent answer to that last question is “I anticipate all possibilities with probability 100% each”.